In [1]:
# All of your imports here (you may need to add some)
import numpy
import scipy
import pandas as pd
from pandas import read_csv
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score

from sklearn import set_config
set_config(transform_output="pandas")



# Project 2

It is October 2018. The squirrels in Central Park are running into a problem and we need your help.

For this project you must go through most steps in the checklist. You must write a response for every item, although sometimes the response will simply be "does not apply". Some parts are more nebulous; for those, simply show that you have done the work in general (the order doesn't really matter). Keep your progress and thoughts organized in this document and use formatting as appropriate (markdown headers and sub-headers for each major part). Do not do the final part (launching the product). Your presentation will be written in a dedicated section of this document, with no slides or anything like that; it should, however, include the best summary plots, graphics, and data points.

You are intentionally given very little information thus far. You must communicate with your client (me) for additional information as necessary. Make sure your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you. This only applies to getting information from your client; it does not necessarily apply to asking for help with the project itself, and you should keep asking questions whenever you need help.

You must submit all data files and a pickled preprocessor and final model along with this notebook.
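A minimal sketch of one way to pickle a fitted preprocessor and model, using Python's standard `pickle` module. The toy data, the single-step pipeline, and the file names `preprocessor.pkl` and `final_model.pkl` are assumptions made for illustration; they are not the project's actual preprocessor, model, or required file names.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in data (hypothetical); replace with the real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array([0, 1] * 10)

# Fit a placeholder preprocessor and a placeholder model.
preprocessor = Pipeline([("scale", StandardScaler())]).fit(X)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(preprocessor.transform(X), y)

# Write each fitted object to its own file (file names are hypothetical).
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f)
with open("final_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload both objects and confirm they reproduce the original predictions.
with open("preprocessor.pkl", "rb") as f:
    restored_preprocessor = pickle.load(f)
with open("final_model.pkl", "rb") as f:
    restored_model = pickle.load(f)

original_pred = model.predict(preprocessor.transform(X))
restored_pred = restored_model.predict(restored_preprocessor.transform(X))
print((original_pred == restored_pred).all())
```

Pickled scikit-learn objects are generally only safe to load with the same library versions that created them, so it may help to record the versions used alongside the files.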

# Frame the Problem and Look at the Big Picture

1. **Define the objective in business terms:** 
2. **How will your solution be used?** 
3. **How should you frame this problem?** 
4. **How should performance be measured? Is the performance measure aligned with the business objective?** 
5. **What would be the minimum performance needed to reach the business objective?** 
6. **What are comparable problems? Can you reuse (personal or readily available) experience or tools?** 
7. **Is human expertise available?** 
8. **How would you solve the problem manually?** 
9. **List the assumptions you (or others) have made so far. Verify assumptions if possible.** 

# Get the Data

1. **List the data you need and how much you need:**

2. **Find and document where you can get that data:**

3. **Get access authorizations**: None needed, publicly available.

4. **Create a workspace**: This notebook.
5. **Get the data**: 
6. **Convert the data to a format you can easily manipulate**: Done, it's a CSV
7. **Ensure sensitive information is deleted or protected**: Done
8. **Check the size and type of data (time series, geographical, …)**:

Do not look at the data too closely at this point, since you have not yet split off the testing set. Look at it only enough to understand *how* to split the test set off. You will likely have to review the website where the data came from to understand some of the features.
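One possible way to check size and types and then split off a test set without inspecting individual rows is sketched below. It assumes the target is whether a squirrel's ID appears in a separate list of diseased squirrels and that a stratified split is appropriate; both are assumptions to confirm with the client. The toy tables, the `diseased` column name, and the 25% test size are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the census table (hypothetical values).
census = pd.DataFrame({
    "Unique Squirrel ID": [f"S{i:02d}" for i in range(20)],
    "Shift": ["AM", "PM"] * 10,
})
# Toy stand-in for the list of diseased squirrels (hypothetical values).
diseased = pd.DataFrame({"Unique Squirrel ID": ["S00", "S03", "S07", "S12"]})

# Assumed target: True when the squirrel's ID appears in the diseased list.
census["diseased"] = census["Unique Squirrel ID"].isin(
    diseased["Unique Squirrel ID"]
)

# Size and column types only; no row-level inspection yet.
print(census.shape)
print(census.dtypes)

# Stratify on the target so the rare class keeps its proportion in both sets.
train_set, test_set = train_test_split(
    census, test_size=0.25, stratify=census["diseased"], random_state=42
)
print(len(train_set), len(test_set))
```

Stratifying matters most when the positive class is small, since a purely random split could leave the test set with too few positive examples to evaluate on.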

In [2]:
all_squirrels = read_csv('2018_Central_Park_Squirrel_Census_-_Squirrel_Data_20241101.csv')
dead_squirrels = read_csv('diseased_squirrels.csv')

In [3]:
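# Work on copies so the raw tables stay untouched
dead_copy = dead_squirrels.copy()
all_copy = all_squirrels.copy()
# Inner join keeps only the census rows whose ID also appears in diseased_squirrels.csv
dead_squirrels_data = all_copy.merge(dead_copy, how='inner', on='Unique Squirrel ID')
print(dead_squirrels_data)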

             X          Y Unique Squirrel ID Hectare Shift      Date  \
0   -73.967063  40.773499     12I-AM-1013-01     12I    AM  10132018   
1   -73.957956  40.795934     38C-PM-1014-09     38C    PM  10142018   
2   -73.970408  40.769028      6I-PM-1013-06     06I    PM  10132018   
3   -73.968381  40.778014     16E-PM-1018-06     16E    PM  10182018   
4   -73.969424  40.775590     13F-AM-1007-02     13F    AM  10072018   
..         ...        ...                ...     ...   ...       ...   
317 -73.972408  40.774416     11D-AM-1010-08     11D    AM  10102018   
318 -73.975653  40.773354      8B-PM-1012-07     08B    PM  10122018   
319 -73.967883  40.784761     23B-PM-1012-06     23B    PM  10122018   
320 -73.964544  40.781160     21F-PM-1018-02     21F    PM  10182018   
321 -73.975479  40.769640      5E-PM-1012-01     05E    PM  10122018   

     Hectare Squirrel Number       Age Primary Fur Color Highlight Fur Color  \
0                          1     Adult          Cinnamo