# A Data Scientist's Tourist Guide To NYC Taxis

# Summary

# Introduction

Taking a taxi in New York City can be intimidating, especially for tourists visiting for the first time. With over 200,000 taxi trips happening daily across the city, yellow cabs remain a vital part of NYC's transportation system. However, without local knowledge, tourists often worry about whether they are being overcharged or taken on unnecessarily long routes.

Using a random sample of 30,000 yellow taxi trips from January 2024, provided by the NYC Taxi and Limousine Commission (TLC), we analyze the relationship between trip distance and fare amount. Our goal is simple: help tourists understand how much they should expect to pay for a taxi ride in NYC, based on data-driven analysis.



# Methods & Results

Load the trip data for yellow taxis in January 2024 from the NYC Taxi and Limousine Commission's [website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [None]:
import os

import pandas as pd

data_set_link = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"

# Use only a smaller, random subset (30,000 rows) of the data
df = pd.read_parquet(data_set_link).sample(30000, random_state=123)

# Save the sample locally, creating the output directory if it does not exist
os.makedirs('data', exist_ok=True)
df.to_csv('data/yellow_tripdata_2024-01.csv', index=False)
df.head()

We'll now wrangle and clean the taxi data from its original format into the format needed for regression analysis. Since the regression uses only the "trip_distance" and "fare_amount" columns, we drop any rows that contain NaNs in those columns.

In [None]:
# Drop all rows with NaN in the "trip_distance" and "fare_amount" columns
df = df.dropna(subset=["trip_distance", "fare_amount"])

Next, we'll summarize the data set as part of the exploratory data analysis for our regression, starting with the summary statistics for the columns in the data set.

In [None]:
df.describe()

The summary statistics show that the mean trip distance is 3.20 miles while the median is 1.68 miles; a mean well above the median points to a right-skewed distribution. Likewise, the mean fare amount is 18.12 USD while the median is 12.8 USD, which again points to a right-skewed distribution for this column.
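The mean-versus-median comparison above can also be checked numerically. The following is a minimal sketch on a small, made-up set of fares (the values are hypothetical and are not taken from the TLC data); the same calls could be applied to `df["fare_amount"]` or `df["trip_distance"]`.

```python
import pandas as pd

# Hypothetical toy fares in USD, for illustration only (not TLC data)
fares = pd.Series([7.2, 8.6, 10.0, 12.8, 14.9, 21.2, 70.0])

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skewness={skewness:.2f}")
```

A single large fare pulls the mean above the median and makes the skewness positive, which is the pattern the summary statistics suggest for the real columns.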


Now, we'll create visualizations for exploratory data analysis. First, we want to confirm that there are no missing values in the columns that we are performing regression on ("trip_distance" and "fare_amount").

In [None]:
import altair as alt
# !pip install "vegafusion[embed]>=1.5.0"

alt.data_transformers.enable("vegafusion")

In [None]:
# Visualize missing values

alt.Chart(
    df.isna().reset_index().melt(
        id_vars='index'
    )
).mark_rect().encode(
    alt.X('index:O').axis(None),
    alt.Y('variable').title(None),
    alt.Color('value').title('NaN'),
    alt.Stroke('value')
).properties(
    width=df.shape[0]
)

As evidenced by the chart, "trip_distance" and "fare_amount" do not have NaN values.
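With many rows, a heatmap of missing values can be hard to read, so a numeric count is a useful cross-check. The sketch below uses a small hypothetical data frame (the values and the extra "tip_amount" column are made up for illustration) and mirrors the `dropna` step used earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values (not TLC data)
toy = pd.DataFrame({
    "trip_distance": [1.2, np.nan, 3.4, 0.8],
    "fare_amount": [9.3, 12.1, np.nan, 7.9],
    "tip_amount": [2.0, np.nan, 0.0, 1.5],
})

cols = ["trip_distance", "fare_amount"]

# Drop rows that are missing either regression column
cleaned = toy.dropna(subset=cols)

# Count the remaining NaNs per regression column; both counts should be zero
missing_counts = cleaned[cols].isna().sum()
print(missing_counts)
```

Running the same two lines on `df` would confirm the chart's reading without relying on the plot.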

Next, we'll create a correlation plot of all of the columns against one another, to check out the strength and direction of associations between columns. 

In [None]:
# Correlation Plot

corr_df = df.select_dtypes('number').corr('spearman', numeric_only=True).stack().reset_index(name='corr')
corr_df.loc[corr_df['level_0'] == corr_df['level_1'], 'corr'] = 0  # Zero out the diagonal (self-correlations) only
corr_df['abs'] = corr_df['corr'].abs()

alt.Chart(corr_df).mark_circle().encode(
    x='level_0',
    y='level_1',
    size=alt.Size('abs').scale(domain=(0, 1)),
    color=alt.Color('corr').scale(scheme='redblue', domain=(-1, 1))
)

Based on the correlation plot, "trip_distance" and "fare_amount" have a fairly high positive Spearman correlation: longer trips tend to have higher fares.
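To read off a single coefficient rather than judging circle sizes, the Spearman correlation for just these two columns can be computed directly. This is a minimal sketch on hypothetical toy values (not TLC data); it also shows that Spearman's coefficient is the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy trips (not TLC data); the last row breaks perfect monotonicity
toy = pd.DataFrame({
    "trip_distance": [0.5, 1.0, 2.0, 4.0, 9.0, 3.0],
    "fare_amount": [5.8, 7.9, 12.1, 19.8, 40.0, 21.0],
})

# Spearman correlation between the two columns
rho = toy.corr(method="spearman").loc["trip_distance", "fare_amount"]

# Equivalent computation: Pearson correlation of the ranked values
rho_from_ranks = toy.rank().corr(method="pearson").loc["trip_distance", "fare_amount"]

print(f"Spearman rho = {rho:.3f} (from ranks: {rho_from_ranks:.3f})")
```

On the real data, replacing `toy` with `df[["trip_distance", "fare_amount"]]` would give the value that the plot encodes.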

Now, we'll perform regression analysis to examine the relationship between "trip_distance" (the independent variable) and "fare_amount" (the dependent variable).

In [None]:
#!pip install scikit-learn

from sklearn.linear_model import LinearRegression

In [None]:
# Fit a simple linear regression of fare_amount on trip_distance
X = df['trip_distance'].values.reshape(-1,1)
y = df['fare_amount'].values

model = LinearRegression()
model.fit(X, y)

intercept = model.intercept_
slope = model.coef_[0]

df['y_pred'] = model.predict(X)

print(f"The regression line formula is: y_hat = {slope:.4f} * trip_distance + {intercept:.4f}")
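Once the slope and intercept are known, estimating a fare for a planned trip is just a matter of evaluating the line. The sketch below assumes a hypothetical helper, `estimate_fare`, and placeholder coefficients chosen only for illustration; they are not the fitted values from this analysis.

```python
def estimate_fare(distance_miles: float, slope: float, intercept: float) -> float:
    """Evaluate the fitted line y_hat = slope * trip_distance + intercept."""
    if distance_miles < 0:
        raise ValueError("distance_miles must be non-negative")
    return slope * distance_miles + intercept


# Placeholder coefficients for illustration only; substitute the values of
# `slope` and `intercept` produced by the fitted model.
example_slope = 3.0      # USD per mile (hypothetical)
example_intercept = 5.0  # USD (hypothetical)

for miles in (1.0, 2.5, 10.0):
    fare = estimate_fare(miles, example_slope, example_intercept)
    print(f"{miles:>4.1f} miles -> about {fare:.2f} USD")
```

The estimate covers only "fare_amount"; the regression does not model any other component of the total charge.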


Finally, let's visualize the result of the regression analysis as a scatter plot with the fitted regression line.

In [None]:
scatter_plot = alt.Chart(df).mark_circle().encode(
    x=alt.X('trip_distance', title='Trip Distance (miles)'),
    y=alt.Y('fare_amount', title="Fare Amount (USD)"),
    color=alt.value('purple'),
    tooltip=['trip_distance', 'fare_amount']
).properties(
    title="Regression of Trip Distance vs Fare Amount for NYC Yellow Taxis in January 2024"
)

line_plot = alt.Chart(df).mark_line(color='orange').encode(
    x='trip_distance',
    y='y_pred'
)

chart = scatter_plot + line_plot

chart

# Discussion

# References