In [1]:
# All of your imports here (you may need to add some)
import numpy
import scipy
import pandas as pd
from pandas import read_csv
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score

from sklearn import set_config
set_config(transform_output="pandas")



# Project 2

It is October 2018. The squirrels in Central Park are running into a problem and we need your help.

For this project you must go through most steps in the checklist. You must write a response for every item; sometimes the response will simply be "does not apply". Some of the parts are more nebulous, and you only need to show that you have done things in general (the order doesn't really matter). Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Do not do the final part (launching the product). Your presentation will be written in a dedicated section of this document, with no slides or anything like that; it should, however, include the best summary plots, graphics, and data points.

You are intentionally given very little information thus far. You must communicate with your client (me) for additional information as necessary. Make sure that your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you. This only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself, where you should keep asking questions whenever you need help.

You must submit all data files and a pickled preprocessor and final model along with this notebook.

# Frame the Problem and Look at the Big Picture

1. **Define the objective in business terms:** Help identify what is causing the deaths of squirrels in Central Park.
2. **How will your solution be used?** To manage or help prevent the spread or cause of squirrel deaths.
3. **How should you frame this problem?** As a classification problem: given a squirrel's recorded traits, predict whether it will die or not.
4. **How should performance be measured? Is the performance measure aligned with the business objective?** Accuracy and F1 score, so that we correctly identify positives (squirrels that die) while limiting false positives.
5. **What would be the minimum performance needed to reach the business objective?** Probably an F1 score in the 70 to 80 percent range.
6. **What are comparable problems? Can you reuse (personal or readily available) experience or tools?** No personal experience, but there are probably comparable wildlife studies readily available online.
7. **Is human expertise available?** Some, in the form of online studies by scientists who focus on urban animals.
8. **How would you solve the problem manually?** Find dead squirrels and look for traits they share, then compare them with living squirrels to check whether the traits we believe cause the deaths really differ between the two groups, and reevaluate if they do not.
9. **List the assumptions you (or others) have made so far. Verify assumptions if possible.** The data can accurately help track what causes the deaths; fur color has an effect on deaths; and there are no seasonal or disease-related effects that we know of.
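
Item 4 names accuracy and F1 score as the performance measures. The sketch below uses made-up labels (not census data) to show why the two can disagree when one class is rare: a model that never predicts a death still scores well on accuracy but gets an F1 of zero. The label encoding (1 = died, 0 = survived) and both prediction lists are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels: 1 = squirrel died, 0 = squirrel survived
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that always predicts survival
y_pred_all_alive = [0] * 10
acc_naive = accuracy_score(y_true, y_pred_all_alive)
f1_naive = f1_score(y_true, y_pred_all_alive, zero_division=0)

# A model that finds one of the two deaths and raises one false alarm
y_pred_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
acc_model = accuracy_score(y_true, y_pred_model)
prec_model = precision_score(y_true, y_pred_model)
rec_model = recall_score(y_true, y_pred_model)
f1_model = f1_score(y_true, y_pred_model)

print(f"always-alive: accuracy={acc_naive:.2f}, f1={f1_naive:.2f}")
print(f"toy model:    accuracy={acc_model:.2f}, precision={prec_model:.2f}, "
      f"recall={rec_model:.2f}, f1={f1_model:.2f}")
```

Both sets of predictions have the same accuracy here, which is why F1 is worth tracking alongside it when deaths are a small share of the observations.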

# Get the Data

1. **List the data you need and how much you need:**
- Dead squirrels
- All squirrels seen and different traits and observations

2. **Find and document where you can get that data:** Some given and some found at https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw/about_data

3. **Get access authorizations**: None needed; the data is publicly available.

4. **Create a workspace**: This notebook.
5. **Get the data**: Given to us and found online.
6. **Convert the data to a format you can easily manipulate**: Done; they are all CSVs.
7. **Ensure sensitive information is deleted or protected**: Done.
8. **Check the size and type of data (time series, geographical, …)**: 3,023 rows by 31 columns = 93,713 values. Column types are floats, objects, ints, and booleans.

Do not look at the data too closely at this point since you have not yet split off the testing set. Basically, enough looking at it to understand *how* to split the test set off. It is likely you will have to review the website where the data came from to be able to understand some of the features.
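
One possible way to split off the test set before exploring further is a stratified split, which keeps the share of dead squirrels the same in both parts. This is only a sketch on a toy table: the `died` column, the made-up IDs, the 20% test size, and the random seed are all assumptions, not choices taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the census table; "died" is a hypothetical target column
toy = pd.DataFrame({
    "Unique Squirrel ID": [f"S{i:03d}" for i in range(100)],
    "died": [1] * 10 + [0] * 90,
})

# stratify= keeps the 10% / 90% class balance in both train and test
train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["died"], random_state=42
)

print(len(train_set), len(test_set))
print(train_set["died"].mean(), test_set["died"].mean())
```

Without `stratify`, a small test set could by chance end up with very few positive cases, which would make F1 on the test set unstable.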

In [2]:
all_squirrels = read_csv('2018_Central_Park_Squirrel_Census_-_Squirrel_Data_20241101.csv')
dead_squirrels = read_csv('diseased_squirrels.csv')

In [21]:
all_squirrels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3023 entries, 0 to 3022
Data columns (total 31 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   X                                           3023 non-null   float64
 1   Y                                           3023 non-null   float64
 2   Unique Squirrel ID                          3023 non-null   object 
 3   Hectare                                     3023 non-null   object 
 4   Shift                                       3023 non-null   object 
 5   Date                                        3023 non-null   int64  
 6   Hectare Squirrel Number                     3023 non-null   int64  
 7   Age                                         2902 non-null   object 
 8   Primary Fur Color                           2968 non-null   object 
 9   Highlight Fur Color                         1937 non-null   object 
 10  Combination 

In [18]:
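# Work on copies so the originally loaded frames stay untouched
dead_copy = dead_squirrels.copy()
all_copy = all_squirrels.copy()
# Inner merge keeps only census rows whose ID also appears in the dead-squirrel file
dead_squirrels_data = all_copy.merge(dead_copy, how='inner', on='Unique Squirrel ID')
dead_squirrels_data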

Unnamed: 0,X,Y,Unique Squirrel ID,Hectare,Shift,Date,Hectare Squirrel Number,Age,Primary Fur Color,Highlight Fur Color,...,Kuks,Quaas,Moans,Tail flags,Tail twitches,Approaches,Indifferent,Runs from,Other Interactions,Lat/Long
0,-73.967063,40.773499,12I-AM-1013-01,12I,AM,10132018,1,Adult,Cinnamon,White,...,False,False,False,False,True,False,False,True,,POINT (-73.9670628558161 40.77349914209411)
1,-73.957956,40.795934,38C-PM-1014-09,38C,PM,10142018,9,Adult,Black,,...,False,False,False,False,False,True,False,False,,POINT (-73.9579564338627 40.7959337795027)
2,-73.970408,40.769028,6I-PM-1013-06,06I,PM,10132018,6,Adult,Gray,Cinnamon,...,False,False,False,False,True,False,False,False,,POINT (-73.9704082821356 40.7690280985956)
3,-73.968381,40.778014,16E-PM-1018-06,16E,PM,10182018,6,Adult,Cinnamon,Gray,...,False,False,False,False,True,False,True,False,,POINT (-73.968381325559 40.7780143443779)
4,-73.969424,40.775590,13F-AM-1007-02,13F,AM,10072018,2,Adult,Cinnamon,White,...,False,False,False,False,False,False,True,False,,POINT (-73.96942403275091 40.7755898126674)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,-73.972408,40.774416,11D-AM-1010-08,11D,AM,10102018,8,Adult,Cinnamon,"Gray, White",...,False,False,False,False,False,True,False,False,,POINT (-73.9724083320538 40.7744163768061)
318,-73.975653,40.773354,8B-PM-1012-07,08B,PM,10122018,7,Adult,Cinnamon,"Gray, White",...,False,False,False,False,False,False,True,False,,POINT (-73.9756533763813 40.7733537881363)
319,-73.967883,40.784761,23B-PM-1012-06,23B,PM,10122018,6,Adult,Gray,,...,True,False,False,False,False,False,False,False,scolding,POINT (-73.9678831312936 40.7847605974975)
320,-73.964544,40.781160,21F-PM-1018-02,21F,PM,10182018,2,Juvenile,Cinnamon,Gray,...,False,False,False,False,False,False,False,True,,POINT (-73.9645437409662 40.7811599933331)
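
The inner merge keeps only the squirrels that appear in both files. For a classifier, both classes need to live in one table, so one option is to flag each census row instead of filtering. The sketch below does this with `Series.isin` on toy frames; the `died` column name and the toy IDs are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two loaded tables
all_toy = pd.DataFrame({
    "Unique Squirrel ID": ["A-1", "B-2", "C-3", "D-4"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Black"],
})
dead_toy = pd.DataFrame({"Unique Squirrel ID": ["B-2", "D-4"]})

# Flag every census row whose ID appears in the dead-squirrel list
labeled = all_toy.copy()
labeled["died"] = labeled["Unique Squirrel ID"].isin(dead_toy["Unique Squirrel ID"])

print(labeled)
```

Unlike a merge, `isin` never adds rows, so repeated IDs in either table cannot inflate the row count of the labeled frame.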


In [17]:
print(dead_squirrels_data['Primary Fur Color'].value_counts(normalize=True))
print(all_squirrels['Primary Fur Color'].value_counts(normalize=True))

Primary Fur Color
Cinnamon    0.608150
Black       0.216301
Gray        0.175549
Name: proportion, dtype: float64
Primary Fur Color
Gray        0.833221
Cinnamon    0.132075
Black       0.034704
Name: proportion, dtype: float64
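
To make the two printed distributions easier to compare, they can be placed side by side with a ratio column. The sketch below copies the proportions from the output above into two Series by hand; the `comparison` frame and its column names are illustrative, not part of the original notebook.

```python
import pandas as pd

# Proportions copied from the printed value_counts output above
dead_props = pd.Series({"Cinnamon": 0.608150, "Black": 0.216301, "Gray": 0.175549})
all_props = pd.Series({"Gray": 0.833221, "Cinnamon": 0.132075, "Black": 0.034704})

# Align on fur color and compute how over- or under-represented each color is
comparison = pd.DataFrame({"dead": dead_props, "all": all_props})
comparison["ratio"] = comparison["dead"] / comparison["all"]
comparison = comparison.sort_values("ratio", ascending=False)

print(comparison.round(2))
```

A ratio above 1 means the color is more common among the dead squirrels than in the full census. Here Black and Cinnamon come out at roughly 6.2 and 4.6, while Gray is about 0.2.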


In [19]:
dead_squirrels_data.select_dtypes(include='bool').sum()


Running           51
Chasing           26
Climbing          47
Eating            90
Foraging         172
Kuks              13
Quaas             10
Moans              1
Tail flags        22
Tail twitches     48
Approaches        35
Indifferent      146
Runs from         70
dtype: int64