# A Data Scientist's Tourist Guide To NYC Taxis

# Summary

# Introduction

Is this your first time traveling to New York City? Scared of being ripped off by unscrupulous drivers? If you answered yes to both questions, then you are in the right place. In this notebook, we take a data science approach to New York City taxi fares.


The analysis starts by loading the data, continues with a short exploratory data analysis (EDA), and ends with a simple linear regression (SLR) to model taxi fares.



# Methods & Results


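Load the data from the NYC Taxi and Limousine Commission (TLC) website at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. The cell below reads the January 2024 yellow taxi trip records, which are published in Parquet format.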

In [None]:
import pandas as pd
# !pip install pyarrow

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
df = pd.read_parquet(url, engine='pyarrow')

print(df.head())
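
The loading cell reads Parquet from the TLC download host, while later cells read a CSV from a local `data/` folder. One way to bridge the two is a small helper that writes a DataFrame to CSV. The sketch below is only an assumption about how that step could look: the name `write_data_folder` is borrowed from a note in this notebook, but its signature, the temporary folder, and the toy values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_data_folder(frame: pd.DataFrame, csv_path: str) -> Path:
    """Write `frame` to `csv_path` as CSV, creating parent folders if needed."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file, so read_csv
    # returns exactly the columns that were written.
    frame.to_csv(path, index=False)
    return path


# Tiny stand-in for the real trip data (values are made up for illustration).
toy = pd.DataFrame({"passenger_count": [1.0, 2.0], "tip_amount": [3.5, 0.0]})

# A temporary folder is used here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_data_folder(toy, f"{tmp}/data/toy_trips.csv")
    round_trip = pd.read_csv(saved)
```

In the notebook, the same helper could be called with the full `df` and a path such as `data/yellow_tripdata_2024-01.csv`. Note that CSV does not store dtypes, so datetime columns come back as strings unless `parse_dates` is passed to `pd.read_csv`.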

Wrangle and clean the data from its original format into the format needed for the regression analysis.

In [None]:
# NOTE: the wrangled data is already in the data folder.
# TODO: maybe move the write_data_folder code to here.

df = pd.read_csv("data/yellow_tripdata_2024-01.csv")

# Drop every row that has a NaN in any column.
# TODO: do we want to drop all rows where ANY column has NaNs?
df = df.dropna()
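
Whether to drop a row when any column is missing, or only when a modeled column is missing, changes how many rows survive. The toy example below sketches the difference. The values and the `extra_col` column are made up for illustration; only `passenger_count` and `tip_amount` come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: each column has exactly one missing value.
toy = pd.DataFrame(
    {
        "passenger_count": [1.0, np.nan, 2.0, 1.0],
        "tip_amount": [2.0, 1.5, np.nan, 0.0],
        "extra_col": [np.nan, 0.0, 0.0, 0.0],
    }
)

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Default behavior: drop a row if ANY column is missing.
drop_any = toy.dropna()

# Alternative: only require the columns the model actually uses.
model_cols = ["passenger_count", "tip_amount"]
drop_subset = toy.dropna(subset=model_cols)

print(len(toy), len(drop_any), len(drop_subset))
```

With `subset`, a row that is missing only an unrelated column is kept, which may matter if some columns in the real data are sparsely populated.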

Summarize the data set as part of the exploratory data analysis that precedes the regression analysis.

In [None]:
df.describe()

Visualize the data set as part of the exploratory data analysis that precedes the regression analysis.

In [None]:
import altair as alt
# !pip install "vegafusion[embed]>=1.5.0"

alt.data_transformers.enable("vegafusion")

In [None]:
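# Visualize missing values on a random sample of rows. One pixel per row of the full
# table would make the chart far too wide; the sample size of 500 is an arbitrary choice.
# NOTE: after doing dropna(), there are no missing values left, so the heatmap is uniform.
sample = df.sample(n=min(len(df), 500), random_state=0)

alt.Chart(
    sample.isna().reset_index().melt(
        id_vars='index'
    )
).mark_rect().encode(
    alt.X('index:O').axis(None),
    alt.Y('variable').title(None),
    alt.Color('value').title('NaN'),
    alt.Stroke('value') # We set the stroke which is the outline of each rectangle in the heatmap
).properties(
    width=sample.shape[0]
)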

In [None]:
# Correlation Plot (TODO: see if we're even supposed to do this)

corr_df = df.select_dtypes('number').corr('spearman', numeric_only=True).stack().reset_index(name='corr')
corr_df.loc[corr_df['level_0'] == corr_df['level_1'], 'corr'] = 0  # Zero out the diagonal (self-correlations) only
corr_df['abs'] = corr_df['corr'].abs()

alt.Chart(corr_df).mark_circle().encode(
    x='level_0',
    y='level_1',
    size=alt.Size('abs').scale(domain=(0, 1)),
    color=alt.Color('corr').scale(scheme='redblue', domain=(-1, 1))
)
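
The correlation plot uses Spearman rather than Pearson correlation. As a rough illustration of the difference, the sketch below shows that Spearman correlation is Pearson correlation computed on ranks, so it measures monotonic rather than strictly linear association. The column names match TLC fields, but the values are made up.

```python
import pandas as pd

# Made-up values: fare rises monotonically with distance, but not linearly.
toy = pd.DataFrame(
    {
        "trip_distance": [1.0, 2.0, 3.0, 4.0, 5.0],
        "fare_amount": [5.0, 7.0, 12.0, 30.0, 31.0],
    }
)

spearman = toy.corr(method="spearman")
pearson = toy.corr(method="pearson")

# Spearman is Pearson applied to the ranks of each column.
pearson_on_ranks = toy.rank().corr(method="pearson")

# rho is 1 for any strictly increasing relationship; r is below 1 here
# because the relationship is not a straight line.
rho = spearman.loc["trip_distance", "fare_amount"]
r = pearson.loc["trip_distance", "fare_amount"]
print(rho, r)
```

Because two different columns can reach a Spearman correlation of exactly 1, a value of 1 alone does not identify the diagonal of the correlation matrix.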

Perform the regression analysis

In [None]:
#!pip install scikit-learn

from sklearn.linear_model import LinearRegression

In [None]:
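# Example: relationship between passenger_count and tip_amount.
# NOTE: based on the correlation plot, there seems to be no correlation between these two.

# scikit-learn expects a 2-D feature array, hence the reshape to a single column
X = df['passenger_count'].values.reshape(-1,1)
y = df['tip_amount'].values

model = LinearRegression()
model.fit(X, y)

# Fitted line: tip_amount = intercept + slope * passenger_count
intercept = model.intercept_
slope = model.coef_[0]

# Store the fitted values so the regression line can be plotted
df['y_pred'] = model.predict(X)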
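
For a single predictor, ordinary least squares has a closed-form solution, which is what `LinearRegression` fits in this case. The sketch below computes the intercept, slope, and R² by hand in plain Python, to make the fitted quantities easier to interpret. The function names and the toy numbers are illustrative assumptions, not part of the analysis.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Case 1: y depends exactly on x (y = 1 + 2x).
x_lin, y_lin = [1, 2, 3, 4], [3, 5, 7, 9]
b0_lin, b1_lin = fit_simple_linear_regression(x_lin, y_lin)
r2_lin = r_squared(x_lin, y_lin, b0_lin, b1_lin)

# Case 2: y is unrelated to x, so the best line is flat at the mean of y.
x_flat, y_flat = [1, 2, 1, 2], [1, 1, 3, 3]
b0_flat, b1_flat = fit_simple_linear_regression(x_flat, y_flat)
r2_flat = r_squared(x_flat, y_flat, b0_flat, b1_flat)

print(b0_lin, b1_lin, r2_lin)
print(b0_flat, b1_flat, r2_flat)
```

The second case mirrors what a predictor with no correlation to the response looks like: a slope near zero and an R² near zero. In scikit-learn, `model.score(X, y)` returns the same R² for the fitted model.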

Create a visualization of the result of the analysis


In [None]:
scatter_plot = alt.Chart(df).mark_circle().encode(
    x='passenger_count',
    y='tip_amount',
    color=alt.value('green'),
    tooltip=['passenger_count', 'tip_amount']
)

line_plot = alt.Chart(df).mark_line(color='orange').encode(
    x='passenger_count',
    y='y_pred'
)

chart = scatter_plot + line_plot

chart

# Discussion

# References