### **Project Title**: Predicting NYC Rideshare Prices Using Subway Delays, Ridership, and Weather Conditions 
#### ***NYC Weather Data***

#### **Source**: Visual Crossing 

<img src="PoweredByVC-WeatherLogo-RoundedRect.png" alt="Powered by Visual Crossing Weather logo" width="500" height="150">

#### **Website**: https://www.visualcrossing.com/
#### **About**: 
##### Visual Crossing is a leading provider of weather data and enterprise analysis tools to data scientists, business analysts, professionals, and academics. Visual Crossing aims to provide accurate weather data and forecasts by combining data from various sources, including ground-based weather stations, satellites, and radar, and using statistical climate modeling.

#### **Weather Data Use Case**: 
##### Weather data in this project is used to identify how weather conditions affect rideshare pricing in NYC. Adverse weather can cause demand spikes and travel delays, so including it as a feature is intended to make fare predictions more accurate and to support better planning for both riders and service providers.




### **Exploratory Data Analysis and Preprocessing**

In [29]:
import pandas as pd
import numpy as np

df = pd.read_csv('New York, ny 2020-01-01 to 2024-12-31.csv')
print(df.head())


           name             datetime  temp  feelslike   dew  humidity  precip  \
0  New York, ny  2020-01-01T00:00:00  41.2       34.7  27.9     58.99     0.0   
1  New York, ny  2020-01-01T01:00:00  39.8       33.9  25.9     57.17     0.0   
2  New York, ny  2020-01-01T02:00:00  39.2       32.8  26.8     60.71     0.0   
3  New York, ny  2020-01-01T03:00:00  39.1       30.9  25.9     58.85     0.0   
4  New York, ny  2020-01-01T04:00:00  38.9       31.4  23.9     54.59     0.0   

   precipprob preciptype  snow  ...  sealevelpressure  cloudcover  visibility  \
0           0        NaN   0.0  ...            1003.5        97.9         9.9   
1           0        NaN   0.0  ...            1003.7        92.1         9.9   
2           0        NaN   0.0  ...            1004.0        85.4         9.9   
3           0        NaN   0.0  ...            1004.3        54.9         9.9   
4           0        NaN   0.0  ...            1004.8        97.9         9.9   

   solarradiation  solaren

In [30]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43848 entries, 0 to 43847
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              43848 non-null  object 
 1   datetime          43848 non-null  object 
 2   temp              43848 non-null  float64
 3   feelslike         43848 non-null  float64
 4   dew               43848 non-null  float64
 5   humidity          43848 non-null  float64
 6   precip            43848 non-null  float64
 7   precipprob        43848 non-null  int64  
 8   preciptype        4296 non-null   object 
 9   snow              43848 non-null  float64
 10  snowdepth         43848 non-null  float64
 11  windgust          43691 non-null  float64
 12  windspeed         43848 non-null  float64
 13  winddir           43848 non-null  float64
 14  sealevelpressure  43848 non-null  float64
 15  cloudcover        43848 non-null  float64
 16  visibility        43848 non-null  float6
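
The `info()` output shows that `preciptype` has only 4296 non-null values and `windgust` has 43691, out of 43848 rows. A minimal sketch of how the missing counts could be tabulated per column is shown here. It uses a tiny made-up frame in place of the real CSV, so the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up and stand in for the real CSV.
sample = pd.DataFrame({
    "temp": [41.2, 39.8, 39.2, 39.1],
    "preciptype": [np.nan, "rain", np.nan, np.nan],
    "windgust": [20.1, np.nan, 18.3, 17.0],
})

# Count missing values per column and express them as a share of all rows.
missing = sample.isna().sum().to_frame("n_missing")
missing["pct_missing"] = (missing["n_missing"] / len(sample) * 100).round(1)
print(missing)
```

Running the same two lines on `df` instead of `sample` would give the equivalent summary for the full weather file.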

In [31]:
print(df.columns.tolist())

# location: new york, ny
# columns to focus on: 
# name - remove
# datetime
# temp
# feelslike
# dew - remove
# humidity - remove
# precip
# precipprob - remove
# preciptype - remove
# snow
# snowdepth - remove
# windgust - remove
# windspeed 
# winddir - remove
# sealevelpressure - remove
# cloudcover
# visibility
# solarradiation - remove
# solarenergy - remove
# uvindex 
# severerisk - remove
# conditions
# icon - remove
# stations - remove

['name', 'datetime', 'temp', 'feelslike', 'dew', 'humidity', 'precip', 'precipprob', 'preciptype', 'snow', 'snowdepth', 'windgust', 'windspeed', 'winddir', 'sealevelpressure', 'cloudcover', 'visibility', 'solarradiation', 'solarenergy', 'uvindex', 'severerisk', 'conditions', 'icon', 'stations']
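
As an alternative to maintaining the drop list by hand, the list can be derived from the columns that should be kept. The sketch below assumes the ten columns that remain in the cleaned frame are the intended keep list; `all_columns` and `columns_to_keep` are hypothetical names used only for illustration.

```python
# Column names copied from the printed list above.
all_columns = [
    'name', 'datetime', 'temp', 'feelslike', 'dew', 'humidity', 'precip',
    'precipprob', 'preciptype', 'snow', 'snowdepth', 'windgust', 'windspeed',
    'winddir', 'sealevelpressure', 'cloudcover', 'visibility', 'solarradiation',
    'solarenergy', 'uvindex', 'severerisk', 'conditions', 'icon', 'stations',
]

# Assumed keep list (hypothetical name): the columns retained after cleaning.
columns_to_keep = [
    'datetime', 'temp', 'feelslike', 'precip', 'snow',
    'windspeed', 'cloudcover', 'visibility', 'uvindex', 'conditions',
]

# Everything not kept is dropped; the order follows the original frame.
columns_to_drop = [c for c in all_columns if c not in columns_to_keep]
print(columns_to_drop)
```

Deriving the list this way keeps the notes and the code from drifting apart when a column is added to or removed from the keep list.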


In [32]:
# Drop columns that are not needed

columns_to_drop = [
    'name',
    'dew',
    'humidity',
    'precipprob',
    'snowdepth',
    'windgust',
    'winddir',
    'sealevelpressure',
    'solarradiation',
    'solarenergy',
    'severerisk',
    'icon',
    'stations',
    'preciptype'
]

df_cleaned = df.drop(columns=columns_to_drop)
df_cleaned

Unnamed: 0,datetime,temp,feelslike,precip,snow,windspeed,cloudcover,visibility,uvindex,conditions
0,2020-01-01T00:00:00,41.2,34.7,0.000,0.0,10.9,97.9,9.9,0,Overcast
1,2020-01-01T01:00:00,39.8,33.9,0.000,0.0,9.1,92.1,9.9,0,Overcast
2,2020-01-01T02:00:00,39.2,32.8,0.000,0.0,9.7,85.4,9.9,0,Partially cloudy
3,2020-01-01T03:00:00,39.1,30.9,0.000,0.0,14.3,54.9,9.9,0,Partially cloudy
4,2020-01-01T04:00:00,38.9,31.4,0.000,0.0,12.2,97.9,9.9,0,Overcast
...,...,...,...,...,...,...,...,...,...,...
43843,2024-12-31T19:00:00,48.7,44.9,0.000,0.0,8.6,15.4,9.9,0,Clear
43844,2024-12-31T20:00:00,47.4,43.8,0.000,0.0,7.7,97.9,9.9,0,Overcast
43845,2024-12-31T21:00:00,45.0,41.7,0.033,0.0,6.0,100.0,4.8,0,"Rain, Overcast"
43846,2024-12-31T22:00:00,44.1,40.4,0.068,0.0,6.5,100.0,1.6,0,"Rain, Overcast"


In [None]:
# Reformat the date and time column as 2020-01-01 00:30:00 to match the TLC data

df_cleaned['datetime'] = pd.to_datetime(df_cleaned['datetime']).dt.strftime('%Y-%m-%d %H:%M:%S')

# Remove possible duplicate timestamps (~5)

df_cleaned = df_cleaned.drop_duplicates(subset='datetime')
df_cleaned


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['datetime'] = pd.to_datetime(df_cleaned['datetime']).dt.strftime('%Y-%m-%d %H:%M:%S')
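
This warning usually means pandas cannot tell whether the frame being assigned to is an independent object or a view of another frame. That can happen, for example, when the cell is re-run after `df_cleaned` has been reassigned from a filtering step. One common way to avoid it is to take an explicit copy before assigning. The sketch below uses a small made-up frame, and `raw` and `cleaned` are illustrative names.

```python
import pandas as pd

# Small illustrative frame standing in for the weather data.
raw = pd.DataFrame({
    "name": ["New York, ny"] * 3,
    "datetime": ["2020-01-01T00:00:00", "2020-01-01T01:00:00", "2020-01-01T01:00:00"],
    "temp": [41.2, 39.8, 39.8],
})

# .copy() makes the result an independent frame, so the later
# column assignment is unambiguous and does not touch `raw`.
cleaned = raw.drop(columns=["name"]).drop_duplicates(subset="datetime").copy()
cleaned["datetime"] = pd.to_datetime(cleaned["datetime"]).dt.strftime("%Y-%m-%d %H:%M:%S")
print(cleaned)
```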


Unnamed: 0,datetime,temp,feelslike,precip,snow,windspeed,cloudcover,visibility,uvindex,conditions
0,2020-01-01 00:00:00,41.2,34.7,0.000,0.0,10.9,97.9,9.9,0,Overcast
1,2020-01-01 01:00:00,39.8,33.9,0.000,0.0,9.1,92.1,9.9,0,Overcast
2,2020-01-01 02:00:00,39.2,32.8,0.000,0.0,9.7,85.4,9.9,0,Partially cloudy
3,2020-01-01 03:00:00,39.1,30.9,0.000,0.0,14.3,54.9,9.9,0,Partially cloudy
4,2020-01-01 04:00:00,38.9,31.4,0.000,0.0,12.2,97.9,9.9,0,Overcast
...,...,...,...,...,...,...,...,...,...,...
43843,2024-12-31 19:00:00,48.7,44.9,0.000,0.0,8.6,15.4,9.9,0,Clear
43844,2024-12-31 20:00:00,47.4,43.8,0.000,0.0,7.7,97.9,9.9,0,Overcast
43845,2024-12-31 21:00:00,45.0,41.7,0.033,0.0,6.0,100.0,4.8,0,"Rain, Overcast"
43846,2024-12-31 22:00:00,44.1,40.4,0.068,0.0,6.5,100.0,1.6,0,"Rain, Overcast"
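
The raw frame has 43848 rows, which is exactly the number of hours from 2020-01-01 through 2024-12-31 (1827 days × 24). If a handful of timestamps are duplicated, a matching number of hours should be absent from the series. Daylight saving transitions are one possible cause, but that is an assumption that the output above does not confirm. A minimal sketch of how duplicated and missing hours could be listed, using made-up timestamps:

```python
import pandas as pd

# Illustrative timestamps: one repeated hour (01:00) and one missing hour (03:00).
stamps = pd.to_datetime(pd.Series([
    "2020-01-01 00:00:00",
    "2020-01-01 01:00:00",
    "2020-01-01 01:00:00",
    "2020-01-01 02:00:00",
    "2020-01-01 04:00:00",
]))

# Hours that appear more than once (second and later occurrences).
duplicated_hours = stamps[stamps.duplicated()].tolist()

# Hours expected in a complete hourly range but absent from the data.
expected = pd.date_range(stamps.min(), stamps.max(), freq=pd.Timedelta(hours=1))
missing_hours = expected[~expected.isin(stamps)].tolist()

print("duplicated:", duplicated_hours)
print("missing:", missing_hours)
```

Applied to the parsed `datetime` column, the same check would show whether dropping duplicates leaves gaps that matter when joining to the TLC data.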


TypeError: unsupported operand type(s) for -: 'str' and 'str'
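
The cell that raised this error is not shown. The message is what Python produces when one string is subtracted from another. A plausible cause, stated here as an assumption, is that `datetime` was turned into text by `.dt.strftime(...)` and then used in date arithmetic. The sketch below reproduces the error on made-up values and shows that parsing the strings first makes the subtraction work.

```python
import pandas as pd

# Timestamps stored as text, as produced by .dt.strftime(...).
as_text = pd.Series(["2020-01-01 00:00:00", "2020-01-01 01:00:00"])

# Subtracting the raw strings raises the same TypeError as above.
try:
    as_text.iloc[1] - as_text.iloc[0]
except TypeError as err:
    print("TypeError:", err)

# Parsing the strings back to datetimes makes the subtraction valid.
as_time = pd.to_datetime(as_text)
gap = as_time.iloc[1] - as_time.iloc[0]
print(gap)
```

If the string format is only needed for display or export, keeping a parsed datetime column alongside it avoids the conversion round trip.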