# Project 2

It is October 2018. The squirrels in Central Park are running into a problem and we need your help.

For this project you must go through most steps in the checklist. You must write a response for every item, although sometimes the response will simply be "does not apply". Some parts are more nebulous; for those, simply show that you have done the work in general (the order doesn't really matter). Keep your progress and thoughts organized in this document and use formatting as appropriate (markdown headers and sub-headers for each major part). Do not do the final part (launching the product). Your presentation will be written in this document in a dedicated section, with no slides or anything like that; it should, however, include the best summary plots, graphics, and data points.

You have intentionally been given very little information thus far. You must communicate with your client (me) for additional information as necessary. Make sure that your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you (this only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself, for which you should keep asking questions).

You must submit all data files and a pickled preprocessor and final model along with this notebook.

### **Objective**

- There is an outbreak affecting squirrels in Central Park, and we need to try to identify and manage the diseased ones using data from the Squirrel Census, which includes approximately 3,000 squirrel sightings. We aim to determine which squirrels tested positive for the disease and analyze these sightings to uncover patterns or indicators that could help predict the presence of the disease. It's important to note that squirrel behavior varies by type, and certain diseases may be influenced by fluctuating weather conditions. By understanding these factors, we hope to develop a more effective strategy for managing the outbreak.


### *Imports Go Here*

In [178]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns


# **Get the Data**
1. **List the data you need and how much you need:**
    - Weather data is needed because squirrel sightings might vary with the type of weather.
    - Daily NYC weather data.
2. **Find and document where you can get that data:**
    - Publicly available from: [Kaggle](https://www.kaggle.com/datasets/aadimator/nyc-weather-2016-to-2022)
3. **Get access authorizations**: None needed, publicly available.


### *Loading Datasets...*

- To prepare for merging data later, we start by loading the necessary datasets



In [179]:
squirrel_data = pd.read_csv("2018_Central_Park_Squirrel_Census_-_Squirrel_Data_20241105.csv")
weather_data = pd.read_csv("NYC_Weather_2016_2022.csv")

### *Functions for Cleaning and Merging Data*
- Weather samples whose dates have no squirrel sighting on the same day are excluded from the new dataset.
- The `retrieve_weather_days` function filters the weather data to include only dates when squirrel sightings occurred. The `merge_datasets` function then combines the squirrel and weather datasets based on these matching dates.
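
Since each sighting should pick up exactly one weather row, the merge can be made to fail loudly if a date matches several weather rows. The following is a minimal sketch on made-up toy frames (the IDs and values are illustrative, not census data), using the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the real datasets (values are made up for illustration)
squirrels = pd.DataFrame({
    "Unique Squirrel ID": ["A-1", "A-2", "B-1"],
    "Date": pd.to_datetime(["2018-10-14", "2018-10-14", "2018-10-17"]),
})
weather_noon = pd.DataFrame({
    "Date": pd.to_datetime(["2018-10-14", "2018-10-17"]),
    "temperature_2m (°C)": [9.5, 8.0],
})

# validate="many_to_one" raises pandas.errors.MergeError if any Date
# appears more than once on the weather side
merged = squirrels.merge(weather_noon, on="Date", how="inner", validate="many_to_one")

# One output row per sighting means no sighting was duplicated
assert len(merged) == len(squirrels)
print(merged)
```

If the weather side ever contains more than one row per date, this raises an error instead of silently multiplying the sightings.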


#### TODO: Currently we drop samples and only keep the weather measurements at noon. Potentially we could take an average of the measurements in the first or second half of the day, depending on whether the `Shift` feature is `AM` or `PM`.
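
One way the half-day averaging described in this TODO could look, as a sketch on made-up hourly readings. The AM/PM cutoff at 12:00 is an assumption about how `Shift` maps to clock time, and the toy frames and IDs are hypothetical:

```python
import pandas as pd

# Toy hourly weather (made-up values); the real file also has a "time" column
weather = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-10-14 09:00", "2018-10-14 11:00",
        "2018-10-14 13:00", "2018-10-14 15:00",
    ]),
    "temperature_2m (°C)": [8.0, 10.0, 12.0, 14.0],
})

# Label each reading AM (before noon) or PM (noon onward) to match `Shift`
weather["Date"] = weather["time"].dt.normalize()
weather["Shift"] = weather["time"].dt.hour.map(lambda h: "AM" if h < 12 else "PM")

# Average every numeric measurement within each (Date, Shift) half-day
half_day = (
    weather.drop(columns=["time"])
    .groupby(["Date", "Shift"], as_index=False)
    .mean(numeric_only=True)
)

# Toy sightings: one in the morning shift, one in the afternoon shift
squirrels = pd.DataFrame({
    "Unique Squirrel ID": ["A-1", "A-2"],
    "Date": pd.to_datetime(["2018-10-14", "2018-10-14"]),
    "Shift": ["AM", "PM"],
})

# Join on both keys so each sighting gets the average for its own half-day
merged = squirrels.merge(half_day, on=["Date", "Shift"], how="left", validate="many_to_one")
print(merged)
```

Merging on both `Date` and `Shift` keeps one weather row per sighting while using more of the hourly data than the noon-only filter.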

In [180]:
import pandas as pd

def retrieve_weather_days(weather_dataset, squirrel_dataset):
    # Parse the squirrel dates (stored as MMDDYYYY) and collect the unique sighting days
    squirrel_dataset["Date"] = pd.to_datetime(squirrel_dataset["Date"], format='%m%d%Y')
    sighting_dates = squirrel_dataset["Date"].dt.date.unique().tolist()
    weather_dataset["time"] = pd.to_datetime(weather_dataset["time"])
    weather_dataset["Date"] = weather_dataset["time"].dt.date
    # Keep only weather rows from sighting days; .copy() returns an independent
    # DataFrame so later column assignments do not raise SettingWithCopyWarning
    weather_dataset = weather_dataset[weather_dataset["Date"].isin(sighting_dates)].copy()

    return weather_dataset

# TODO: Fix merging
def merge_datasets(weather_dataset, squirrel_dataset):
    squirrel_dataset["Date"] = pd.to_datetime(squirrel_dataset["Date"], format='%m%d%Y')
    weather_dataset["time"] = pd.to_datetime(weather_dataset["time"])
    # Keep only the noon reading; .copy() avoids assigning to a slice below
    weather_dataset = weather_dataset[weather_dataset["time"].dt.hour == 12].copy()
    # Convert the date back to datetime64 so it matches the squirrel "Date" key
    weather_dataset["Date"] = pd.to_datetime(weather_dataset["time"].dt.date)
    weather_dataset = weather_dataset.drop(columns=["time"])
    merged_dataset = squirrel_dataset.merge(weather_dataset, on="Date", how="inner")

    return merged_dataset

weather_data = retrieve_weather_days(weather_data, squirrel_data)
data = merge_datasets(weather_data, squirrel_data)


In [181]:
# Load the diseased squirrels data
diseased_squirrels = pd.read_csv("diseased_squirrels.csv")

# Add a column to indicate if the squirrel is diseased
data['Diseased'] = data['Unique Squirrel ID'].isin(diseased_squirrels['Unique Squirrel ID'])

# Display the first few rows to verify
data.head()



| Unnamed: 0 | X | Y | Unique Squirrel ID | Hectare | Shift | Date | Hectare Squirrel Number | Age | Primary Fur Color | Highlight Fur Color | ... | temperature_2m (°C) | precipitation (mm) | rain (mm) | cloudcover (%) | cloudcover_low (%) | cloudcover_mid (%) | cloudcover_high (%) | windspeed_10m (km/h) | winddirection_10m (°) | Diseased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.956134 | 40.794082 | 37F-PM-1014-03 | 37F | PM | 2018-10-14 | 3 |  |  |  | ... | 9.5 | 0.0 | 0.0 | 56.0 | 0.0 | 58.0 | 71.0 | 5.1 | 309.0 | False |
| 1 | -73.968857 | 40.783783 | 21B-AM-1019-04 | 21B | AM | 2018-10-19 | 4 |  |  |  | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.9 | 241.0 | False |
| 2 | -73.974281 | 40.775534 | 11B-PM-1014-08 | 11B | PM | 2018-10-14 | 8 |  | Gray |  | ... | 9.5 | 0.0 | 0.0 | 56.0 | 0.0 | 58.0 | 71.0 | 5.1 | 309.0 | False |
| 3 | -73.959641 | 40.790313 | 32E-PM-1017-14 | 32E | PM | 2018-10-17 | 14 | Adult | Gray |  | ... | 8.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 11.0 | 13.6 | 238.0 | False |
| 4 | -73.970268 | 40.776213 | 13E-AM-1017-05 | 13E | AM | 2018-10-17 | 5 | Adult | Gray | Cinnamon | ... | 8.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 11.0 | 13.6 | 238.0 | False |




### *Data Information*

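- The dataset contains 72,552 entries and 41 columns, with a mix of numerical, categorical, and boolean data types. Most columns are complete, but `Age`, `Primary Fur Color`, and `Location` have some missing values, and `Highlight Fur Color` and `Specific Location` are missing especially often, so these columns will need cleaning. Note that 72,552 rows is about 24 times the roughly 3,000 sightings in the census, which suggests each sighting was matched to more than one weather row during the merge; the row count should be rechecked once the merging TODO is resolved.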
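
To put numbers on the missing values, the share of missing entries per column can be computed directly. This is a small sketch on a made-up frame (the rows are illustrative, not census data); the same expression could be applied to `data`:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps (values are made up)
df = pd.DataFrame({
    "Age": ["Adult", np.nan, "Adult", "Juvenile"],
    "Primary Fur Color": ["Gray", np.nan, "Gray", "Cinnamon"],
    "Highlight Fur Color": [np.nan, np.nan, np.nan, "Cinnamon"],
    "X": [-73.95, -73.96, -73.97, -73.95],
})

# isna() gives True for missing cells; the column mean is the missing fraction.
# Sorting puts the most incomplete columns first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```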

In [None]:
data.info()

In [None]:
data.describe()

### *Histograms*
- The data on squirrel sightings in Central Park during October 2018 reveals several patterns. Squirrel sightings are concentrated in specific areas, with X and Y coordinates showing bell-shaped distributions. Sightings occurred on various dates, with certain days experiencing higher counts. The number of squirrels per hectare is right-skewed, suggesting that most areas had few squirrels. Temperatures ranged from 10-20°C, typical for October in New York, while precipitation and rain were both right-skewed, indicating mostly dry conditions. Cloud cover data shows peaks at either clear or fully clouded skies, with few instances of partial cover. Wind speeds clustered between 5-15 km/h, typical for urban environments, and wind direction appeared to be uniformly distributed across various directions.

In [None]:
data.hist(bins=50, figsize=(20,15));

### *Splitting The Data Into Two Subsets*

- The data is split into a `train_set` containing 80% of the rows for training and a `test_set` containing the remaining 20% for testing. The split is reproducible because `random_state` is fixed.

In [None]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
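
The split above is purely random. If diseased squirrels turn out to be a small share of the sightings (an assumption; the class balance is not shown here), a stratified split keeps that share the same in both subsets. A minimal sketch on made-up data with 10% positives:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a rare positive class (made up: 10 of 100 rows are diseased)
df = pd.DataFrame({
    "feature": range(100),
    "Diseased": [True] * 10 + [False] * 90,
})

# stratify= preserves the proportion of each label in both subsets
train_set, test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["Diseased"]
)

# Compare the diseased share in each subset
print(train_set["Diseased"].mean(), test_set["Diseased"].mean())
```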

### *Saving Data to a CSV File*

In [None]:
data.to_csv("merged_data.csv", index=False)

### *Creating a Copy of the Training Set*

In [None]:
data = train_set.copy()

# **Data Exploration**

### *Heatmap of Squirrel Sightings in Central Park*

In [None]:
plt.figure(figsize=(12,8))
plt.hexbin(data['X'], data['Y'], gridsize=50, cmap='coolwarm', mincnt=1)
plt.colorbar(label='Number of Sightings')
plt.title('Heatmap of Squirrel Sightings in Central Park', fontsize=18)
plt.xlabel('X Coordinate', fontsize=14)
plt.ylabel('Y Coordinate', fontsize=14)
plt.show()

### *Correlation Matrix For Numerical Features*

In [None]:
# Select only numerical columns
numerical_data = data.select_dtypes(include=['float64', 'int64'])

correlation_matrix = numerical_data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()
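
Because `select_dtypes(include=['float64', 'int64'])` skips boolean columns, the `Diseased` label is not part of the matrix above. The sketch below, on made-up values, shows one way to rank numerical features by their correlation with the label after casting it to an integer:

```python
import pandas as pd

# Toy frame (made-up values); `Diseased` is boolean, as in the notebook
df = pd.DataFrame({
    "temperature_2m (°C)": [5.0, 8.0, 9.5, 12.0, 15.0, 18.0],
    "windspeed_10m (km/h)": [11.9, 13.6, 5.1, 7.0, 9.0, 6.0],
    "Diseased": [True, True, True, False, False, False],
})

# Take the numeric columns and add the label back as 0/1
numeric = df.select_dtypes(include="number").assign(Diseased=df["Diseased"].astype(int))

# Correlation of every numeric feature with the label, lowest first
target_corr = numeric.corr()["Diseased"].drop("Diseased").sort_values()
print(target_corr)
```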


### *Average Squirrel Sightings by Temperature*

In [None]:
# Create temperature bins for grouping
bins = [0, 5, 10, 15, 20, 25, 30]
labels = ['0-5°C', '5-10°C', '10-15°C', '15-20°C', '20-25°C', '25-30°C']
data['Temperature Range'] = pd.cut(data['temperature_2m (°C)'], bins=bins, labels=labels)

# Group by temperature range and calculate the mean `Hectare Squirrel Number` (plotted below as average sightings)
temperature_grouped = data.groupby('Temperature Range')['Hectare Squirrel Number'].mean().reset_index()

# Create the bar plot
plt.figure(figsize=(10,6))
sns.barplot(x='Temperature Range', y='Hectare Squirrel Number', data=temperature_grouped, palette='viridis')
plt.title('Average Squirrel Sightings by Temperature Range', fontsize=18)
plt.xlabel('Temperature Range (°C)', fontsize=14)
plt.ylabel('Average Number of Squirrel Sightings', fontsize=14)
plt.xticks(rotation=45)
plt.show()
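
If the goal is the number of sightings in each temperature range rather than the mean `Hectare Squirrel Number`, the rows in each bin can be counted directly. A small sketch on made-up temperatures, with fewer bins than above for brevity:

```python
import pandas as pd

# Made-up noon temperatures, one per sighting
temps = pd.Series([4.0, 5.0, 8.0, 9.5, 9.5, 12.0], name="temperature_2m (°C)")

bins = [0, 5, 10, 15]
labels = ["0-5°C", "5-10°C", "10-15°C"]

# pd.cut uses right-closed intervals by default, so 5.0 falls in "0-5°C"
temp_range = pd.cut(temps, bins=bins, labels=labels)

# Number of sightings per temperature range, kept in bin order
counts = temp_range.value_counts(sort=False)
print(counts)
```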


#### Some of the big blank areas in the plot below are bodies of water in Central Park.

In [None]:
diseased_data = data[data['Diseased'] == True]
non_diseased_data = data[data['Diseased'] == False]
plt.figure(figsize=(12,8))
plt.scatter(non_diseased_data['X'], non_diseased_data['Y'], c='green', label='Non-Diseased', alpha=0.5, s=10)
plt.scatter(diseased_data['X'], diseased_data['Y'], c='red', label='Diseased', alpha=0.5, s=10)
plt.title('Location of Diseased vs Non-Diseased Squirrels in Central Park', fontsize=18)
plt.xlabel('X Coordinate', fontsize=14)
plt.ylabel('Y Coordinate', fontsize=14)
plt.legend()
plt.show()

In [None]:
# Plot the number of squirrels per primary fur color, split by diseased status
plt.figure(figsize=(12,8))
sns.countplot(x='Primary Fur Color', hue='Diseased', data=data, palette='viridis')
plt.title('Primary Fur Color of Squirrels by Diseased Status', fontsize=18)
plt.xlabel('Primary Fur Color', fontsize=14)
plt.ylabel('Number of Squirrels', fontsize=14)
plt.show()

In [None]:
# Fraction of squirrels that are diseased, per primary fur color
value_counts = data['Primary Fur Color'].value_counts()
diseased_data = data[data['Diseased'] == True]
diseased_counts = diseased_data['Primary Fur Color'].value_counts()
vals = (diseased_counts / value_counts).to_dict()
print(vals)
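
`value_counts` ignores squirrels with no recorded fur color, so they drop out of the ratios above. As a sketch on made-up rows, a `groupby` with `dropna=False` gives the same fractions while keeping the missing-color group visible:

```python
import numpy as np
import pandas as pd

# Toy data (made up); two squirrels have no recorded fur color
df = pd.DataFrame({
    "Primary Fur Color": ["Gray", "Gray", "Gray", "Gray",
                          "Cinnamon", "Cinnamon", np.nan, np.nan],
    "Diseased": [True, False, False, False, True, False, True, True],
})

# The mean of a boolean column is the share of True values;
# dropna=False keeps the NaN fur color as its own group
rates = df.groupby("Primary Fur Color", dropna=False)["Diseased"].agg(["mean", "size"])
print(rates)
```

The `size` column shows how many squirrels each fraction is based on, which helps when a group is small.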

#### Found some anomalies among the diseased squirrels: a decent number have `NaN` or `Null` values. This could be a potential issue in the future.

In [None]:
data["Primary Fur Color"].isnull().sum()

In [None]:
filtered_rows = data[(data["Diseased"] == True) & (data["Primary Fur Color"].isna())]
# Collect the IDs of diseased squirrels with no recorded primary fur color
bad_ids = filtered_rows["Unique Squirrel ID"].tolist()
print(bad_ids)

In [None]:
# Anomalies
# One row per diseased squirrel; True marks a missing value in that column
bad_rows = data[data["Diseased"] == True].isna().reset_index(drop=True)
bad_rows
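
To see whether missing values are actually more common among diseased squirrels, the missing share per column can be compared between the two groups. A minimal sketch on made-up rows (the "diseased" and "healthy" group names are chosen here for readability):

```python
import numpy as np
import pandas as pd

# Toy data (made up): the diseased rows are the ones with gaps
df = pd.DataFrame({
    "Age": ["Adult", np.nan, np.nan, "Adult"],
    "Primary Fur Color": ["Gray", np.nan, "Gray", "Gray"],
    "Diseased": [False, True, True, False],
})

# Readable group labels instead of True/False
status = df["Diseased"].map({True: "diseased", False: "healthy"})

# Missing fraction of every feature column, computed separately per group
missing_by_status = df.drop(columns="Diseased").isna().groupby(status).mean().T
print(missing_by_status)
```

A large gap between the two columns for a feature would indicate that missingness itself carries information about the label.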