Machine learning project to create a honey bee population prediction system.


# Buzz_Data
Bees are some of the most prolific pollinators, and the world needs them. However, international honey bee populations have been decreasing at alarming rates. Climate change, parasites, and diseases have all contributed to the loss of bees.

In this project I will follow the **CRISP-DM model**.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:

- Business understanding – What does the business need?
- Data understanding – What data do we have / need? Is it clean?
- Data preparation – How do we organize the data for modeling?
- Modeling – What modeling techniques should we apply?
- Evaluation – Which model best meets the business objectives?
- Deployment – How do stakeholders access the results?

## Business understanding – What does the business need?

The USDA needs a data scientist for a machine learning project to create a bee population prediction system.

### Content
The data covers honey bee colonies in the United States from 2015 to 2022. It was collected by the USDA and curated into a manageable dataset for EDA and predictions.

### Feature Description:

- state: state within the USA. Note that "other" groups several states together for privacy reasons, and the "United States" entry is the average across all states.
- num_colonies: number of honey bee colonies
- max_colonies: maximum number of honey bee colonies for that quarter
- lost_colonies: number of colonies that were lost during that quarter
- percent_lost: percentage of honey bee colonies lost during that quarter
- renovated_colonies: number of colonies that were "requeened" or received new bees
- percent_renovated: percentage of honey bee colonies that were renovated
- quarter: Q1 is January to March, Q2 is April to June, Q3 is July to September, and Q4 is October to December
- year: year between 2015 and 2022
- varroa_mites: percentage of colonies affected by a species of mite that affects honey bee populations
- other_pests_and_parasites: percentage of colonies affected by a collection of other harmful critters
- diseases: percentage of colonies affected by certain diseases
- pesticides: percentage of colonies affected by the use of certain pesticides
- other: percentage of colonies affected by an unlisted cause
- unknown: percentage of colonies affected by an unknown cause

In [13]:
import polars as pl 
import altair as alt

In [12]:
bd = pl.read_csv('D:/Programacion/buzz_data/input/save_the_bees.csv')
bd.sample(10)

| state | state_code | num_colonies | max_colonies | lost_colonies | percent_lost | added_colonies | renovated_colonies | percent_renovated | quarter | year | varroa_mites | other_pests_and_parasites | diseases | pesticides | other | unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 |
| Washington | WA | 84000 | 89000 | 7500 | 8 | 540 | 220 | 0 | 4 | 2022 | 34.0 | 6.4 | 0.0 | 0.0 | 4.2 | 8.2 |
| Wyoming | WY | 19500 | 21000 | 3200 | 15 | 640 | 0 | 0 | 4 | 2022 | 22.9 | 5.9 | 4.2 | 0.0 | 0.0 | 7.4 |
| Wisconsin | WI | 66000 | 66000 | 7000 | 11 | 1500 | 4300 | 7 | 3 | 2021 | 56.0 | 36.8 | 20.3 | 31.2 | 15.3 | 8.0 |
| South Carolina… | SC | 18000 | 19000 | 1700 | 9 | 4200 | 4900 | 26 | 2 | 2021 | 48.4 | 32.7 | 2.1 | 0.0 | 9.1 | 5.0 |
| Massachusetts | MA | 4200 | 8500 | 300 | 4 | 1100 | 440 | 5 | 2 | 2016 | 40.5 | 10.6 | 0.3 | 10.6 | 2.3 | 11.9 |
| Massachusetts | MA | 3800 | 3800 | 470 | 12 | 1100 | 100 | 3 | 1 | 2020 | 13.9 | 0.5 | 1.8 | 0.0 | 4.5 | 2.2 |
| Idaho | ID | 81000 | 88000 | 3700 | 4 | 2600 | 8000 | 9 | 1 | 2015 | 39.8 | 6.7 | 12.5 | 4.8 | 8.9 | 4.9 |
| Arizona | AZ | 33000 | 33000 | 5500 | 17 | 19500 | 7000 | 21 | 2 | 2015 | 8.4 | 32.1 | 0.5 | 20.1 | 28.2 | 0.3 |
| Kentucky | KY | 11500 | 11500 | 1100 | 10 | 380 | 520 | 5 | 4 | 2020 | 23.8 | 8.8 | 0.0 | 0.0 | 8.2 | 9.0 |
| South Carolina… | SC | 12500 | 15000 | 2000 | 13 | 1300 | 750 | 5 | 4 | 2017 | 31.6 | 19.4 | 14.6 | 0.2 | 1.9 | 1.9 |

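Each row stores its time period as two separate integers, year and quarter. For time-ordered analysis or forecasting it can help to collapse them into a single date. The following is a minimal sketch using the quarter definition from the feature description (Q1 starts in January, Q2 in April, Q3 in July, Q4 in October). The helper name quarter_start is hypothetical and not part of the original notebook.

```python
from datetime import date


def quarter_start(year: int, quarter: int) -> date:
    """Return the first day of the given quarter (1 to 4)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1, 2, 3 or 4, got {quarter}")
    # Q1 -> month 1, Q2 -> month 4, Q3 -> month 7, Q4 -> month 10
    first_month = 3 * (quarter - 1) + 1
    return date(year, first_month, 1)


# Example: year and quarter taken from the Washington row in the sample.
print(quarter_start(2022, 4))
```

Using the first day of the quarter is only one convention; the last day or the quarter midpoint would work equally well as long as it is applied consistently.
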

In [16]:
bd.glimpse()

Rows: 1453
Columns: 17
$ state                     <str> 'Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Florida', 'Georgia', 'Hawaii', 'Idaho'
$ state_code                <str> 'AL', 'AZ', 'AR', 'CA', 'CO', 'CT', 'FL', 'GA', 'HI', 'ID'
$ num_colonies              <i64> 7000, 35000, 13000, 1440000, 3500, 3900, 305000, 104000, 10500, 81000
$ max_colonies              <i64> 7000, 35000, 14000, 1690000, 12500, 3900, 315000, 105000, 10500, 88000
$ lost_colonies             <i64> 1800, 4600, 1500, 255000, 1500, 870, 42000, 14500, 380, 3700
$ percent_lost              <i64> 26, 13, 11, 15, 12, 22, 13, 14, 4, 4
$ added_colonies            <i64> 2800, 3400, 1200, 250000, 200, 290, 54000, 47000, 3400, 2600
$ renovated_colonies        <i64> 250, 2100, 90, 124000, 140, 0, 25000, 9500, 760, 8000
$ percent_renovated         <i64> 4, 6, 1, 7, 1, 0, 8, 9, 7, 9
$ quarter                   <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ year                      <i64> 2015, 2015, 2015,

In [10]:
bd.describe()

| describe | state | state_code | num_colonies | max_colonies | lost_colonies | percent_lost | added_colonies | renovated_colonies | percent_renovated | quarter | year | varroa_mites | other_pests_and_parasites | diseases | pesticides | other | unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| count | 1453 | 1453 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 | 1453.0 |
| null_count | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | | | 124469.979353 | 155948.671714 | 16759.710943 | 11.266345 | 15744.900206 | 13520.302822 | 7.11287 | 2.516173 | 2018.474191 | 30.186098 | 10.937509 | 3.406676 | 6.185272 | 6.083345 | 3.994907 |
| std | | | 438499.655483 | 550593.1281 | 60681.042329 | 7.359984 | 63548.43909 | 57201.973644 | 9.025198 | 1.132682 | 2.322824 | 18.861293 | 13.035092 | 6.472063 | 8.959392 | 6.488208 | 4.939563 |
| min | Alabama | AL | 1300.0 | 1300.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2015.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | | | 8000.0 | 9500.0 | 950.0 | 6.0 | 380.0 | 150.0 | 1.0 | 1.0 | 2016.0 | 15.6 | 1.9 | 0.1 | 0.4 | 1.8 | 0.8 |
| 50% | | | 18500.0 | 23000.0 | 2200.0 | 10.0 | 1500.0 | 800.0 | 4.0 | 3.0 | 2018.0 | 27.2 | 7.0 | 1.1 | 2.6 | 4.1 | 2.4 |
| 75% | | | 59000.0 | 79000.0 | 7000.0 | 14.0 | 6000.0 | 3700.0 | 10.0 | 4.0 | 2021.0 | 42.2 | 15.1 | 4.2 | 8.5 | 8.2 | 5.4 |
| max | Wyoming | WY | 3181180.0 | 4174440.0 | 502350.0 | 65.0 | 736920.0 | 762550.0 | 77.0 | 4.0 | 2022.0 | 98.8 | 91.9 | 87.4 | 73.5 | 61.4 | 46.2 |
