<a href="https://colab.research.google.com/github/alliarnold/data71200su24/blob/allibranch/projectfiles/Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1

### Step 1: Select Data Source

**Source:** Open Data NYC--Hyperlocal Temperature Monitoring\
**Filtering:** Just Brooklyn 2019 dataset\
**About:** The filtered data includes features for temperature, hour of the day [0-23], date, lat, long, source [street light monitor or tree monitor]. I am also adding a calculated column for heat safety warning based on the temperature column.\
**Source URL:** [Access filtered link here.](https://data.cityofnewyork.us/dataset/Hyperlocal-Temperature-Monitoring/qdq3-9eqn/explore/query/SELECT%20%60airtemp%60%2C%20%60day%60%2C%20%60hour%60%2C%20%60latitude%60%2C%20%60longitude%60%2C%20%60install_type%60%0AWHERE%0A%20%20caseless_one_of%28%60year%60%2C%20%222019%22%29%0A%20%20AND%20caseless_one_of%28%60borough%60%2C%20%22Brooklyn%22%29/page/filter)\
**GitHub Source:** Due to the size of the data set, it is saved as two CSV files in the project folder of my GitHub branch.


In [1]:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

In [2]:
!pip install -U scikit-learn==1.4

Collecting scikit-learn==1.4
  Downloading scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.1/12.1 MB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.4.0


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


When creating the dataframes, I passed "parse_dates" for the "Day" column so that it would be read in with the correct datatype.

In [4]:
df1 = pd.read_csv('/content/drive/My Drive/GC-CUNY/2024_Summer/BK2-019-Temp-Monitor-pt1.csv', parse_dates=[1])

In [5]:
df2 = pd.read_csv('/content/drive/My Drive/GC-CUNY/2024_Summer/BK2-019-Temp-Monitor-pt2.csv', parse_dates=[1])
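The two cells above select the date column by position (`parse_dates=[1]`). As a minimal sketch of an alternative, the column can also be selected by name, which does not depend on the column order in the file. The in-memory CSV below is a hypothetical two-row sample that only mimics the column names shown in this notebook; it is not the real data.

```python
import io
import pandas as pd

# Hypothetical two-row sample with the same column names as the real files
sample_csv = io.StringIO(
    "AirTemp,Day,Hour,Latitude,Longitude,Install.Type\n"
    "73.942167,2019-08-09,7,40.666205,-73.91691,Street Tree\n"
    "76.666333,2019-08-09,8,40.666205,-73.91691,Street Tree\n"
)

# Naming the column avoids relying on its position in the file
sample_df = pd.read_csv(sample_csv, parse_dates=["Day"])
print(sample_df.dtypes)
```

Either form should give the same result here as long as "Day" really is the second column of both files.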

In [6]:
tempdf = pd.concat([df1, df2])

In [7]:
tempdf.head()

Unnamed: 0,AirTemp,Day,Hour,Latitude,Longitude,Install.Type
0,73.942167,2019-08-09,7,40.666205,-73.91691,Street Tree
1,76.666333,2019-08-09,8,40.666205,-73.91691,Street Tree
2,78.691333,2019-08-09,9,40.666205,-73.91691,Street Tree
3,81.4725,2019-08-09,10,40.666205,-73.91691,Street Tree
4,83.571667,2019-08-09,11,40.666205,-73.91691,Street Tree


In [8]:
tempdf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 477576 entries, 0 to 135576
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   AirTemp       471649 non-null  float64       
 1   Day           477576 non-null  datetime64[ns]
 2   Hour          477576 non-null  int64         
 3   Latitude      477576 non-null  float64       
 4   Longitude     477576 non-null  float64       
 5   Install.Type  477576 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 25.5+ MB
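
The `Index: 477576 entries, 0 to 135576` line suggests that the concatenated frame kept each file's own row labels, so some labels probably appear twice. That is not necessarily a problem, but label-based lookups can then return more than one row. A small sketch of the difference, using hypothetical stand-ins for `df1` and `df2`:

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2
part1 = pd.DataFrame({"AirTemp": [73.9, 76.7]})
part2 = pd.DataFrame({"AirTemp": [74.7, 76.1, 78.0]})

# Default behaviour: each part keeps its own labels (0, 1, 0, 1, 2)
kept = pd.concat([part1, part2])

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([part1, part2], ignore_index=True)

print(kept.index.is_unique)
print(renumbered.index.is_unique)
```

Passing `ignore_index=True` in the `pd.concat` call would change the index labels shown in the later outputs of this notebook, so it is offered here only as an option.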


I went back and forth on whether to change the 'Hour' column to an object, because the hours are more of an ordinal category than a numerical feature. Ultimately, I decided to keep it as an integer so that I could more easily sort and compare the hours, but I will need to remember not to run certain calculations on it (for example, a mean).

In [None]:
# tempdf['Hour'] = tempdf['Hour'].astype(object)
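
A possible middle ground between integer and object is an ordered categorical, which keeps the 0-23 ordering for sorting and comparison while no longer treating the hours as quantities. This is a sketch on a hypothetical four-value series, not something the notebook actually does:

```python
import pandas as pd

# Hypothetical sample of hour values
hours = pd.Series([7, 23, 0, 12], name="Hour")

# Ordered categorical: categories 0..23, with a defined order
hour_dtype = pd.CategoricalDtype(categories=list(range(24)), ordered=True)
hour_cat = hours.astype(hour_dtype)

# Sorting and comparisons still follow the hour order
print(hour_cat.sort_values().tolist())
print((hour_cat < 12).tolist())
```

The trade-off is that some estimators and plotting functions expect plain numbers, so the integer column may still be the more convenient choice.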

I added a column with heat advisory levels based on the outside temperature.

In [9]:
conditions = [
    (tempdf['AirTemp'] < 80),
    (tempdf['AirTemp'] >= 80) & (tempdf['AirTemp'] < 90),
    (tempdf['AirTemp'] >= 90) & (tempdf['AirTemp'] < 103),
    (tempdf['AirTemp'] >= 103) & (tempdf['AirTemp'] < 124),
    (tempdf['AirTemp'] >= 124)
    ]

# create a list of the values we want to assign for each condition
values = ['Normal', 'Caution', 'Extreme_Caution', 'Danger', 'Extreme_Danger']

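# create a new column and use np.select to assign values to it using our lists as arguments;
# rows with a missing AirTemp match no condition, so give them an explicit 'Unknown' label
tempdf['Advisory'] = np.select(conditions, values, default='Unknown')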

# display updated DataFrame
tempdf.head()

Unnamed: 0,AirTemp,Day,Hour,Latitude,Longitude,Install.Type,Advisory
0,73.942167,2019-08-09,7,40.666205,-73.91691,Street Tree,Normal
1,76.666333,2019-08-09,8,40.666205,-73.91691,Street Tree,Normal
2,78.691333,2019-08-09,9,40.666205,-73.91691,Street Tree,Normal
3,81.4725,2019-08-09,10,40.666205,-73.91691,Street Tree,Caution
4,83.571667,2019-08-09,11,40.666205,-73.91691,Street Tree,Caution
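
The same binning can also be sketched with `pd.cut`, which takes the thresholds as a single list of bin edges. With `np.select`, a row whose `AirTemp` is missing matches none of the conditions and falls through to the default value; with `pd.cut`, such a row stays missing. The sample temperatures below are hypothetical and chosen to hit each threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures, including the bin edges and one missing value
temps = pd.Series([73.9, 80.0, 95.2, 103.0, 124.0, np.nan], name="AirTemp")

# Same thresholds as the conditions list; right=False makes each bin [low, high)
bins = [-np.inf, 80, 90, 103, 124, np.inf]
labels = ["Normal", "Caution", "Extreme_Caution", "Danger", "Extreme_Danger"]

advisory = pd.cut(temps, bins=bins, labels=labels, right=False)
print(advisory.tolist())
```

The result is an ordered categorical, so the advisory levels can also be sorted from "Normal" to "Extreme_Danger".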


### Step 2:

Separate data into training and test sets.

In [32]:
from sklearn.model_selection import train_test_split

In [49]:
y_target = tempdf['AirTemp']
print(y_target)

0         73.942167
1         76.666333
2         78.691333
3         81.472500
4         83.571667
            ...    
135572    74.705333
135573    76.107833
135574    77.990333
135575    81.616167
135576    84.669833
Name: AirTemp, Length: 477576, dtype: float64


In [42]:
x_data = tempdf.drop(['AirTemp'], axis=1)
print(x_data)

              Day  Hour   Latitude  Longitude Install.Type Advisory
0      2019-08-09     7  40.666205 -73.916910  Street Tree   Normal
1      2019-08-09     8  40.666205 -73.916910  Street Tree   Normal
2      2019-08-09     9  40.666205 -73.916910  Street Tree   Normal
3      2019-08-09    10  40.666205 -73.916910  Street Tree  Caution
4      2019-08-09    11  40.666205 -73.916910  Street Tree  Caution
...           ...   ...        ...        ...          ...      ...
135572 2019-07-15     7  40.682379 -73.931414   Light Pole   Normal
135573 2019-07-15     8  40.682379 -73.931414   Light Pole   Normal
135574 2019-07-15     9  40.682379 -73.931414   Light Pole   Normal
135575 2019-07-15    10  40.682379 -73.931414   Light Pole  Caution
135576 2019-07-15    11  40.682379 -73.931414   Light Pole  Caution

[477576 rows x 6 columns]


In [43]:
X_train, X_test, y_train, y_test = train_test_split(x_data, y_target, random_state=0, stratify=None)
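
Two details of this split may be worth keeping in mind. First, `train_test_split` holds out 25% of the rows by default, which matches the 358182 training rows out of 477576 reported below. Second, `Advisory` is computed directly from `AirTemp`, so leaving it in `x_data` gives a model a strong hint about the target. Whether that matters depends on the modelling goal; the sketch below assumes `AirTemp` stays the target and uses a hypothetical miniature of `tempdf`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of tempdf
mini = pd.DataFrame({
    "AirTemp": [73.9, 76.7, 78.7, 81.5, 83.6, 74.7, 76.1, 78.0],
    "Hour": [7, 8, 9, 10, 11, 7, 8, 9],
    "Advisory": ["Normal", "Normal", "Normal", "Caution",
                 "Caution", "Normal", "Normal", "Normal"],
})

y = mini["AirTemp"]
# Drop the target and the column that was derived from it
X = mini.drop(columns=["AirTemp", "Advisory"])

# test_size=0.25 is the default, written out here for clarity
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)
```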

### Step 3:

Explore your training set.

In [44]:
X_train.describe()

Unnamed: 0,Day,Hour,Latitude,Longitude
count,358182,358182.0,358182.0,358182.0
mean,2019-08-14 23:08:22.276496384,11.49199,40.667502,-73.927265
min,2019-06-15 00:00:00,0.0,40.646738,-73.998978
25%,2019-07-15 00:00:00,5.0,40.66006,-73.945583
50%,2019-08-15 00:00:00,11.0,40.666487,-73.925768
75%,2019-09-15 00:00:00,17.0,40.679143,-73.907185
max,2019-10-15 00:00:00,23.0,40.686464,-73.887402
std,,6.917393,0.01204,0.028421
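
`describe()` on the full frame summarizes only the datetime and numeric columns, so `Install.Type` and `Advisory` do not appear above. As a sketch, calling `describe()` on just the text columns reports count, number of unique values, most frequent value, and its frequency. The frame below is a hypothetical four-row sample.

```python
import pandas as pd

# Hypothetical four-row sample of the training features
mini_X = pd.DataFrame({
    "Hour": [7, 8, 9, 10],
    "Install.Type": ["Street Tree", "Street Tree", "Light Pole", "Street Tree"],
    "Advisory": ["Normal", "Normal", "Normal", "Caution"],
})

# Summary of the text columns: count, unique, top, freq
text_summary = mini_X[["Install.Type", "Advisory"]].describe()
print(text_summary)

# Full breakdown of one categorical column
type_counts = mini_X["Install.Type"].value_counts()
print(type_counts)
```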


In [45]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 358182 entries, 315900 to 119485
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Day           358182 non-null  datetime64[ns]
 1   Hour          358182 non-null  int64         
 2   Latitude      358182 non-null  float64       
 3   Longitude     358182 non-null  float64       
 4   Install.Type  358182 non-null  object        
 5   Advisory      358182 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 19.1+ MB


In [46]:
y_train.describe()

count    353842.000000
mean         75.672800
std           8.967774
min          46.031000
25%          70.150833
50%          75.659333
75%          81.472333
max         111.546500
Name: AirTemp, dtype: float64

In [47]:
y_train.info()

<class 'pandas.core.series.Series'>
Index: 358182 entries, 315900 to 119485
Series name: AirTemp
Non-Null Count   Dtype  
--------------   -----  
353842 non-null  float64
dtypes: float64(1)
memory usage: 5.5 MB


### Step 4: Data cleaning.

In this step I found that a number of rows were marked as NaN for the target data, in this case the air temperature. I weighed two options: dropping those rows, since I would still have plenty of data, or filling them with the median value.

In [52]:
nan_Xcount = X_train.isna().sum()

print(nan_Xcount)

Day             0
Hour            0
Latitude        0
Longitude       0
Install.Type    0
Advisory        0
dtype: int64


In [54]:
nan_Ycount = y_train.isna().sum()

print(nan_Ycount)

4340
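
Since only the target has missing values, the first option (dropping those rows) comes down to filtering the features and the target with the same mask. A minimal sketch on hypothetical data follows; the mask is converted to a NumPy array so the filtering is positional and does not rely on index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical features and target sharing the same index
X_demo = pd.DataFrame({"Hour": [7, 8, 9, 10]}, index=[10, 11, 12, 13])
y_demo = pd.Series([73.9, np.nan, 78.7, np.nan],
                   index=[10, 11, 12, 13], name="AirTemp")

# True where the target is present
keep = y_demo.notna().to_numpy()

# Apply the same mask to both so rows stay aligned
X_kept = X_demo[keep]
y_kept = y_demo[keep]
print(len(X_kept), y_kept.isna().sum())
```

Applied to the training set above, this would remove the 4340 rows with a missing temperature.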


In [55]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
imp_mean.fit(y_train)
SimpleImputer()

# apply to both testing and training data
y_train_new = imp_mean.transform(y_train)
y_test_new = imp_mean.transform(y_test)

ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
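
The ValueError above arises because `SimpleImputer` expects two-dimensional input, while `y_train` is a one-dimensional Series. One way around it, sketched here on hypothetical data, is to wrap the Series in a single-column DataFrame with `to_frame()` and flatten the result afterwards. The variable names are illustrative, and whether imputing a target is preferable to dropping those rows remains a modelling decision.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical targets with missing values
y_tr_demo = pd.Series([70.0, np.nan, 80.0, 75.0], name="AirTemp")
y_te_demo = pd.Series([np.nan, 90.0], name="AirTemp")

# Fit on the training target only, as a one-column DataFrame
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(y_tr_demo.to_frame())

# Transform both sets with the training median, then flatten back to 1-D
y_tr_filled = np.asarray(imputer.transform(y_tr_demo.to_frame())).ravel()
y_te_filled = np.asarray(imputer.transform(y_te_demo.to_frame())).ravel()
print(y_tr_filled, y_te_filled)
```

Fitting on the training target and reusing that median for the test target keeps test-set information out of the imputation.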