# Buzz_Data
Bees are some of the most prolific pollinators, and the world needs them. However, international honey bee populations have been decreasing at alarming rates. Climate change, parasites, and diseases have all contributed to the loss of bees.

In this project I will follow the **CRISP-DM model**.

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:

- Business understanding – What does the business need?
- Data understanding – What data do we have / need? Is it clean?
- Data preparation – How do we organize the data for modeling?
- Modeling – What modeling techniques should we apply?
- Evaluation – Which model best meets the business objectives?
- Deployment – How do stakeholders access the results?

## Business understanding – What does the business need?

The USDA needs a data scientist for a machine learning project to build a bee population prediction system.

### Content
The dataset covers honey bee colonies in the United States from 2015 to 2022. The data were collected by the USDA and curated into a manageable dataset for EDA and predictions.

### Feature Description:

state: state within the USA. Note that "other" is a group of states combined for privacy reasons, and the "United States" entry is the average across all states.

num_colonies: number of honey bee colonies

max_colonies: max number of honey bee colonies for that quarter

lost_colonies: number of colonies that were lost during that quarter

percent_lost: percentage of honey bee colonies lost during that quarter

renovated_colonies: colonies that were 'requeened' or received new bees

percent_renovated: percentage of honey bee colonies that were renovated

quarter: Q1 is Jan to March, Q2 is April to June, Q3 is July to September, and Q4 is October to December

year: year between 2015 and 2022

varroa_mites: Percentage of colonies affected by a species of mite that affects honey bee populations

other_pests_and_parasites: Percentage of colonies affected by a collection of other harmful critters

diseases: Percentage of colonies affected by certain diseases

pesticides: Percentage of colonies affected by the use of certain pesticides

other: Percentage of colonies affected by an unlisted cause

unknown: Percentage of colonies affected by an unknown cause

In [1]:
import polars as pl 
import altair as alt

In [28]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
bd = pl.read_csv('D:/Programacion/buzz_data/input/save_the_bees.csv')
bd.sample(10)

state,state_code,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown
str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64,f64,f64,f64,f64
"""Pennsylvania""","""PA""",28000,30000,1600,5,1600,1200,4,3,2022,25.8,4.6,0.6,1.7,3.7,0.8
"""Kentucky""","""KY""",10000,10000,1300,13,170,550,6,3,2015,40.4,11.3,1.2,6.5,9.0,2.0
"""Michigan""","""MI""",98000,98000,15500,16,8000,5500,6,3,2022,25.4,2.8,2.1,4.6,3.5,1.7
"""Arizona""","""AZ""",26000,30000,2000,7,5500,3800,13,3,2019,29.0,9.9,2.7,11.4,13.2,0.6
"""Kentucky""","""KY""",6500,7000,840,12,300,660,9,3,2018,47.1,26.5,1.6,1.5,4.5,7.6
"""North Carolina…","""NC""",16000,23000,2500,11,3700,1200,5,1,2017,34.7,17.4,0.2,1.5,7.2,3.9
"""South Dakota""","""SD""",28000,34000,450,1,0,0,0,1,2021,8.6,0.0,0.0,0.0,0.0,0.8
"""Montana""","""MT""",103000,114000,6000,5,870,7500,7,3,2021,19.5,1.5,1.2,0.0,14.8,0.5
"""Maryland""","""MD""",7000,7000,1900,27,1900,30,0,1,2020,19.9,1.5,0.0,0.0,8.5,2.4
"""New York""","""NY""",44000,44000,5500,13,370,730,2,1,2022,16.6,4.1,0.0,3.1,7.7,6.6


In [3]:
bd.glimpse()

Rows: 1453
Columns: 17
$ state                     <str> 'Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Florida', 'Georgia', 'Hawaii', 'Idaho'
$ state_code                <str> 'AL', 'AZ', 'AR', 'CA', 'CO', 'CT', 'FL', 'GA', 'HI', 'ID'
$ num_colonies              <i64> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 104000, 10500, 81000
$ max_colonies              <i64> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 105000, 10500, 88000
$ lost_colonies             <i64> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380, 3700
$ percent_lost              <i64> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4
$ added_colonies            <i64> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400, 2600
$ renovated_colonies        <i64> 250, 2100, 90, 124000, 140, 0, 25000, 9500, 760, 8000
$ percent_renovated         <i64> 4, 6, 1, 7, 1, 0, 8, 9, 7, 9
$ quarter                   <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ year                      <i64> 2015, 2015, 2015,

In [4]:
# Basic summary statistics. There appear to be no null values; let's take a closer look.
bd.describe()

describe,state,state_code,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""1453""","""1453""",1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0,1453.0
"""null_count""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,124469.979353,155948.671714,16759.710943,11.266345,15744.900206,13520.302822,7.11287,2.516173,2018.474191,30.186098,10.937509,3.406676,6.185272,6.083345,3.994907
"""std""",,,438499.655483,550593.1281,60681.042329,7.359984,63548.43909,57201.973644,9.025198,1.132682,2.322824,18.861293,13.035092,6.472063,8.959392,6.488208,4.939563
"""min""","""Alabama""","""AL""",1300.0,1300.0,0.0,0.0,0.0,0.0,0.0,1.0,2015.0,0.0,0.0,0.0,0.0,0.0,0.0
"""25%""",,,8000.0,9500.0,950.0,6.0,380.0,150.0,1.0,1.0,2016.0,15.6,1.9,0.1,0.4,1.8,0.8
"""50%""",,,18500.0,23000.0,2200.0,10.0,1500.0,800.0,4.0,3.0,2018.0,27.2,7.0,1.1,2.6,4.1,2.4
"""75%""",,,59000.0,79000.0,7000.0,14.0,6000.0,3700.0,10.0,4.0,2021.0,42.2,15.1,4.2,8.5,8.2,5.4
"""max""","""Wyoming""","""WY""",3181180.0,4174440.0,502350.0,65.0,736920.0,762550.0,77.0,4.0,2022.0,98.8,91.9,87.4,73.5,61.4,46.2


In [5]:
bd.select(pl.col("state")).unique().count()

state
u32
47


In [6]:
state = bd.select(pl.col("state"))
state

state
str
"""Alabama"""
"""Arizona"""
"""Arkansas"""
"""California"""
"""Colorado"""
"""Connecticut"""
"""Florida"""
"""Georgia"""
"""Hawaii"""
"""Idaho"""


In [7]:
col = bd.select(pl.col("num_colonies"))
col

num_colonies
i64
7000
35000
13000
1440000
3500
3900
305000
104000
10500
81000


In [8]:
lost = bd.select(pl.col("lost_colonies"))
lost

lost_colonies
i64
1800
4600
1500
255000
1500
870
42000
14500
380
3700


In [9]:
# Division binds tighter than addition, so this is num_colonies + (lost_colonies / 2)
col + lost / 2

num_colonies
f64
7900.0
37300.0
13750.0
1.5675e6
4250.0
4335.0
326000.0
111250.0
10690.0
82850.0


In [10]:
bd.with_columns(pl.col("num_colonies").add(pl.col("lost_colonies")).alias("e"))

state,state_code,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown,e
str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64,f64,f64,f64,f64,i64
"""Alabama""","""AL""",7000,7000,1800,26,2800,250,4,1,2015,10.0,5.4,0.0,2.2,9.1,9.4,8800
"""Arizona""","""AZ""",35000,35000,4600,13,3400,2100,6,1,2015,26.9,20.5,0.1,0.0,1.8,3.1,39600
"""Arkansas""","""AR""",13000,14000,1500,11,1200,90,1,1,2015,17.6,11.4,1.5,3.4,1.0,1.0,14500
"""California""","""CA""",1440000,1690000,255000,15,250000,124000,7,1,2015,24.7,7.2,3.0,7.5,6.5,2.8,1695000
"""Colorado""","""CO""",3500,12500,1500,12,200,140,1,1,2015,14.6,0.9,1.8,0.6,2.6,5.9,5000
"""Connecticut""","""CT""",3900,3900,870,22,290,0,0,1,2015,2.5,1.4,0.0,0.0,21.2,2.4,4770
"""Florida""","""FL""",305000,315000,42000,13,54000,25000,8,1,2015,22.3,13.5,0.8,8.9,5.1,4.4,347000
"""Georgia""","""GA""",104000,105000,14500,14,47000,9500,9,1,2015,6.2,4.9,3.3,2.6,4.8,10.5,118500
"""Hawaii""","""HI""",10500,10500,380,4,3400,760,7,1,2015,38.8,37.7,1.6,0.0,2.0,0.0,10880
"""Idaho""","""ID""",81000,88000,3700,4,2600,8000,9,1,2015,39.8,6.7,12.5,4.8,8.9,4.9,84700


In [11]:
col.mean()

num_colonies
f64
124469.979353


In [13]:
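# is_unique is a method: without parentheses the cell only displays the bound method.
# Call col.is_unique() to get the boolean mask.
col.is_unique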

<bound method DataFrame.is_unique of shape: (1_453, 1)
┌──────────────┐
│ num_colonies │
│ ---          │
│ i64          │
╞══════════════╡
│ 7000         │
│ 35000        │
│ 13000        │
│ 1440000      │
│ …            │
│ 26000        │
│ 19500        │
│ 30030        │
│ 2888130      │
└──────────────┘>

In [24]:
col.quantile(.5)

num_colonies
f64
18500.0


In [29]:
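# The frame's only column is "num_colonies"; "col" is the Python variable, not a field name.
alt.Chart(col).mark_bar().encode(
    x='num_colonies:Q',
    y='average(num_colonies):Q'
)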


ImportError: Usage of the DataFrame Interchange Protocol requires
version 11.0.0 or greater of the pyarrow package. 
This can be installed with pip using:
   pip install "pyarrow>=11.0.0"
or conda:
   conda install -c conda-forge "pyarrow>=11.0.0"

ImportError: pyarrow

alt.Chart(...)