# Introduction

This project sets out to analyze the effects that Covid-19 had on the Taxi industry in New York City, New York.

Covid-19 had a tremendous impact on every aspect of society, and many of the restrictions that were put into place directly affected the way Taxis could operate.

### Background Information
Covid-19 was declared a pandemic on March 11th of 2020 and continued for a substantial portion of the next 3 years. During the first year, lockdowns, mask mandates, limits on the number of people allowed in indoor spaces, and other restrictions were put in place across the United States. These restrictions would logically be expected to affect the NYC Taxi market, but how severe was that impact? [(Here is a link to the CDC's timeline)](https://www.cdc.gov/museum/timeline/covid19.html)

*This project will attempt to answer the following questions:*

* Did the average passenger count go up or down during the pandemic compared to years before 2020?
* Did the payment method for rides change during and after the pandemic?
* How was the average fare cost per ride affected?
* How did the pandemic affect the average trip distance?
* Was tipping affected by the pandemic?


## Methodology

Taxi data was taken from the [(NYC Taxi & Limousine Commission)](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and the Covid data was taken from the [(CDC)](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36). For definitions of terms in the Taxi data, please see the `data_dictionary_trip_records_yellow.pdf` in the assets folder.

The Taxi data originally exists in .parquet files that each contain a month's worth of trip data. This was compiled into two .csv files, `overall_data.csv` and `sample.csv`. The `overall_data.csv` file is what was used in the analysis below. It was not uploaded here due to its size and GitHub's size restrictions. `sample.csv` has been provided within the assets folder. This is a much smaller data set compiled from the same .parquet files, sized so that the notebook can be run for testing purposes and by others. If you would like to run this on a full data set like `overall_data.csv`, you will need to gather the .parquet files from the NYC link above and run `taxi_csv_create.ipynb`.
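The contents of `taxi_csv_create.ipynb` are not shown here, so the following is only a minimal sketch of how monthly trip tables could be combined into one sampled table. The function name `build_sample`, the `frac` and `seed` parameters, and the assumption that each monthly table already has a `pickup_datetime` column are illustrative and not taken from that notebook.

```python
import pandas as pd


def build_sample(monthly_frames, frac=0.01, seed=0):
    """Sample each monthly table, add a 'year' column, and stack the results.

    Hypothetical helper: the real taxi_csv_create.ipynb may work differently.
    """
    pieces = []
    for frame in monthly_frames:
        # Take the same fraction of rows from every month so no month dominates.
        piece = frame.sample(frac=frac, random_state=seed).copy()
        piece["pickup_datetime"] = pd.to_datetime(piece["pickup_datetime"])
        piece["year"] = piece["pickup_datetime"].dt.year
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


# Two tiny stand-in "months" (made-up rows, not TLC data).
jan_2019 = pd.DataFrame({
    "pickup_datetime": ["2019-01-01 08:00", "2019-01-02 09:30"],
    "trip_distance": [1.2, 3.4],
})
jan_2020 = pd.DataFrame({
    "pickup_datetime": ["2020-01-05 10:00", "2020-01-06 11:15"],
    "trip_distance": [2.0, 0.8],
})
sample = build_sample([jan_2019, jan_2020], frac=0.5)
print(sample)
```

With real files, each frame would presumably come from `pd.read_parquet` on one monthly file, and the combined result could be written out with `DataFrame.to_csv`.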


## Results

### 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import geopandas as gpd
import folium as fol

### 2. Import and clean the Taxi and Covid Data

In [None]:
overall_data = pd.read_csv('assets/overall_data.csv')

In [None]:
overall_data["pickup_datetime"] = pd.to_datetime(overall_data['pickup_datetime'])
overall_data['dropoff_datetime'] = pd.to_datetime(overall_data['dropoff_datetime'])

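We need to remove inaccurate trip data from the dataframe. Here, any trip with a recorded distance of more than 100 miles is treated as inaccurate and dropped.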

In [None]:
#drop trips longer than 100 miles; rows with a missing distance are kept
overall_data = overall_data[~(overall_data['trip_distance'] > 100)]
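As a rough illustration of this cleaning step, the sketch below applies the same 100-mile cutoff to a small made-up table using a boolean mask and reports how many rows were removed. The table `trips` and its values are invented for the example.

```python
import pandas as pd

# Made-up trips; the 250-mile row stands in for a data-entry error.
trips = pd.DataFrame({"trip_distance": [0.9, 2.5, 250.0, 17.3]})

# Rows above the cutoff are flagged, then everything else is kept.
too_long = trips["trip_distance"] > 100
cleaned = trips[~too_long].reset_index(drop=True)

removed = int(too_long.sum())
print(f"removed {removed} of {len(trips)} rows")
```

Counting the flagged rows before dropping them gives a quick sense of whether the cutoff removes a handful of outliers or a meaningful share of the data.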

In [None]:
covid_df = pd.read_csv('assets/covidreportsbystate.csv')

In [None]:
covid_df['submission_date'] = pd.to_datetime(covid_df['submission_date'])
covid_df['created_at'] = pd.to_datetime(covid_df['created_at'])

In [None]:
covid_df.drop(columns=['consent_cases', 'consent_deaths', 'prob_cases', 'new_case', 'pnew_case', 'prob_death', 'new_death', 'pnew_death', 'conf_cases', 'conf_death'], inplace=True)

In [None]:
covid_df['year'] = covid_df['submission_date'].dt.year

### 3. Explore the Data

#### Question 1: Did the average passenger count go up or down during the pandemic compared to years before 2020?

##### Plot 1: Average Passenger Count by Year

In [None]:
#create a bar graph of the average number of passengers per ride per year
passenger_count = overall_data.groupby('year')['passenger_count'].mean()
passenger_count.plot(kind='bar', ylabel="Passengers", title='Average Passenger Count by Year', color='blue')

#calculate the average number of passengers per ride per year in 2019 and 2020
passenger_count_2019 = overall_data[overall_data['year'] == 2019]['passenger_count'].mean()
passenger_count_2020 = overall_data[overall_data['year'] == 2020]['passenger_count'].mean()

passenger_count_2019, passenger_count_2020

##### Plot 2: Total Covid Cases in NY

In [None]:
#plot the tot_cases by day where state = NY
covid_df[covid_df['state'] == 'NY'].plot(kind='area',x='submission_date', y='tot_cases', title='Total Cases in NY', xlabel='Date', ylabel='Total Cases', figsize=(20, 10)).ticklabel_format(style='plain', axis='y')

### Findings

* The average number of passengers had already begun to decline in 2013, well before 2020, but took a steeper drop from 1.56 people in 2019 to 1.41 people in 2020.
* The average did recover a bit in 2021 but went back down in 2022.
* It seems the pandemic had an impact on the average rider count, but not as significant an impact as I hypothesized at the outset.
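To put the 2019 to 2020 drop in relative terms, the short calculation below uses the two averages quoted above. The helper name `percent_change` is just for illustration.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Average passengers per ride, as reported in the findings above.
avg_2019 = 1.56
avg_2020 = 1.41

drop = percent_change(avg_2019, avg_2020)
print(f"{drop:.1f}%")
```

That works out to a decline of a little under 10 percent in the average passenger count.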

#### Question 2: Did the payment method for rides change during and after the pandemic?

### Calculate the Cash and Credit Card usage totals by year

In [None]:
#count the number of rides paid with credit card in each year
credit = overall_data[overall_data['payment_type'] == 'credit_card']
credit = credit.groupby('year')['payment_type'].count()



In [None]:
#count the number of rides paid with cash in each year
cash = overall_data[overall_data['payment_type'] == 'cash']
cash = cash.groupby('year')['payment_type'].count()



### Calculate the Cash and Credit Card usage totals for 2019 and 2020 specifically

In [None]:
#count of cash in 2019
cash_2019 = overall_data[(overall_data['payment_type'] == 'cash') & (overall_data['year'] == 2019)]['payment_type'].count()
cash_2019

In [None]:
#count of credit in 2019
credit_2019 = overall_data[(overall_data['payment_type'] == 'credit_card') & (overall_data['year'] == 2019)]['payment_type'].count()
credit_2019

In [None]:
#count of cash in 2020
cash_2020 = overall_data[(overall_data['payment_type'] == 'cash') & (overall_data['year'] == 2020)]['payment_type'].count()
cash_2020

In [None]:
#count of credit in 2020
credit_2020 = overall_data[(overall_data['payment_type'] == 'credit_card') & (overall_data['year'] == 2020)]['payment_type'].count()
credit_2020

### Calculate the percentage each payment type was used in 2019 and 2020

In [None]:
#calculate the share of cash payments in 2019 (as a fraction of cash + credit)
cash_2019 / (cash_2019 + credit_2019)


In [None]:
#calculate the share of credit payments in 2019 (as a fraction of cash + credit)
credit_2019 / (cash_2019 + credit_2019)


In [None]:
#calculate the share of cash payments in 2020 (as a fraction of cash + credit)
cash_2020 / (cash_2020 + credit_2020)

In [None]:
#calculate the share of credit payments in 2020 (as a fraction of cash + credit)
credit_2020 / (cash_2020 + credit_2020)
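The per-year shares computed cell by cell above could also be produced in one step. The sketch below uses `pd.crosstab` with `normalize='index'` on a small made-up table. It assumes, as in the cells above, that `payment_type` holds the strings `'cash'` and `'credit_card'`.

```python
import pandas as pd

# Made-up rides, not TLC data.
rides = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "payment_type": ["cash", "credit_card", "credit_card", "credit_card",
                     "cash", "cash", "credit_card", "credit_card"],
})

# Keep only the two payment types compared in this notebook.
cash_or_credit = rides[rides["payment_type"].isin(["cash", "credit_card"])]

# Each row sums to 1: the share of cash and credit card rides within a year.
shares = pd.crosstab(cash_or_credit["year"], cash_or_credit["payment_type"],
                     normalize="index")
print(shares)
```

The resulting table has one row per year and one column per payment type, so any year can be compared without writing a new cell for it.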

##### Plot 3: Cash vs. Credit Payments

In [None]:
#plot yearly credit card and cash payment counts on the same axes with a legend
fig, ax = plt.subplots()
credit.plot(color='blue', ax=ax)
cash.plot(title='Cash vs. Credit Payments', color='red', ax=ax)
ax.legend(['Credit Card', 'Cash'])


##### Plot 4: Percentage of Cash vs. Credit Payments for 2019 and 2020

In [None]:
labels = ['Cash', 'Credit']
sizes = [cash_2019 / (cash_2019 + credit_2019), credit_2019 / (cash_2019 + credit_2019)]
sizes2 = [cash_2020 / (cash_2020 + credit_2020), credit_2020 / (cash_2020 + credit_2020)]

fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.title('2019 Cash vs. Credit Payments')
plt.show()

fig2, ax2 = plt.subplots()
ax2.pie(sizes2, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax2.axis('equal')
plt.title('2020 Cash vs. Credit Payments')
plt.show()



### Findings

* The number of cash payments in 2019 was 321,343 while the number of credit payments the same year was 851,437.
* The number of cash payments in 2020 was 321,204 while the number of credit payments the same year was 777,336.
* In 2019, roughly 73% of all payments were made with credit card. In 2020 that number decreased to roughly 71%.

#### Question 3: How was the average fare cost per ride affected?

##### Plot 5: Average Fare Amount by Year

In [None]:
#plot the average fare amount per year
fare_amount = overall_data.groupby('year')['fare_amount'].mean()
fare_amount.plot(title='Average Fare Amount by Year', color='blue', ylabel='Fare Amount')


In [None]:
#calculate the average fare amount in 2011
fare_amount_2011 = overall_data[overall_data['year'] == 2011]['fare_amount'].mean()
fare_amount_2011

In [None]:
#calculate the average fare amount in 2019
fare_amount_2019 = overall_data[overall_data['year'] == 2019]['fare_amount'].mean()
fare_amount_2019

In [None]:
#calculate the average fare amount in 2020
fare_amount_2020 = overall_data[overall_data['year'] == 2020]['fare_amount'].mean()
fare_amount_2020


In [None]:
#calculate the average fare amount in 2021
fare_amount_2021 = overall_data[overall_data['year'] == 2021]['fare_amount'].mean()
fare_amount_2021
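Since the yearly averages are already computed for the plot, individual years can be read from that one Series instead of filtering the frame once per year. A minimal sketch with invented fares:

```python
import pandas as pd

# Made-up fares, not TLC data.
trips = pd.DataFrame({
    "year": [2011, 2011, 2019, 2019, 2020, 2021],
    "fare_amount": [10.0, 11.0, 13.0, 14.0, 12.5, 13.5],
})

# One groupby gives the mean fare for every year at once.
fare_by_year = trips.groupby("year")["fare_amount"].mean()

# Select the years of interest by label.
selected = fare_by_year.loc[[2011, 2019, 2020, 2021]]
print(selected.round(2))
```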


##### Plot 6: Average Fare Amount for 2019 through 2022

In [None]:
#plot the average fare amount per year from 2019 onward
fare_amount = overall_data[overall_data['year'] >= 2019].groupby('year')['fare_amount'].mean()
fare_amount.plot(title='Average Fare Amount by Year', color='blue', ylabel='Fare Amount')
ax = plt.gca()
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:.0f}'))



### Findings

* The average fare price per ride had been steadily increasing since 2011.
* The average fare price in 2011 was $10.25 and grew to $13.38 in 2019.
* In 2020 the average fare price fell to $12.62.

#### Question 4: How did the pandemic affect the average trip distance?

##### Create and clean one last dataset for the NYC Boroughs

In [None]:
borough_data = pd.read_csv('assets/taxi_zone_lookup.csv')


In [None]:
#convert PU location ID in overall_data to borough based on borough_data
borough_data = borough_data[['LocationID', 'Borough']]
borough_data = borough_data.rename(columns={'LocationID': 'PULocationID'})
overall_data = overall_data.merge(borough_data, on='PULocationID', how='left')
overall_data = overall_data.rename(columns={'Borough': 'PUBorough'})


In [None]:
overall_data = overall_data.loc[overall_data['PUBorough'] != 'EWR']
overall_data = overall_data.loc[overall_data['PUBorough'] != 'Unknown']
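Because the merge above is a left join, trips whose `PULocationID` is missing from the lookup table end up with a missing borough. The sketch below shows one way to check for that using the `validate` and `indicator` options of `DataFrame.merge`. The tables and the ID 999 are made up.

```python
import pandas as pd

# Made-up lookup and trips; 999 stands in for an ID missing from the lookup.
lookup = pd.DataFrame({
    "PULocationID": [1, 2, 3],
    "PUBorough": ["EWR", "Queens", "Bronx"],
})
trips = pd.DataFrame({"PULocationID": [2, 3, 3, 999]})

# validate raises if the lookup has duplicate IDs; indicator adds a '_merge'
# column that records whether each trip found a match.
merged = trips.merge(lookup, on="PULocationID", how="left",
                     validate="many_to_one", indicator=True)

unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} trip(s) had no borough match")
```

Rows with no match are not removed by the `!= 'EWR'` and `!= 'Unknown'` filters, so it may be worth deciding explicitly whether to keep or drop them.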

In [None]:
#reset index of overall_data by puborough
map_data = overall_data.set_index('PUBorough')
map_data

##### Plot 7: Average Trip distance by Borough since 2011 Choropleth

In [None]:

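#average trip distance per pickup borough; the choropleth needs one value per borough
borough_avg_distance = overall_data.groupby('PUBorough')['trip_distance'].mean().reset_index()
#named nyc_map to avoid shadowing the built-in map(); fol.Choropleth is used in place of the deprecated Map.choropleth
nyc_map = fol.Map(location=[40.7128, -74.0060], zoom_start=10)
fol.Choropleth(geo_data='assets/new-york-city-boroughs.geojson', data=borough_avg_distance, columns=['PUBorough', 'trip_distance'], key_on='feature.properties.name', fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2, legend_name='Average Distance in Miles', highlight=True).add_to(nyc_map)
nyc_map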


##### Plot 8: Average Trip distance by Borough since 2011 Bar graph

In [None]:
#calculate the average distance per borough per year and plot it
borough_distance = overall_data.groupby(['PUBorough', 'year'])['trip_distance'].mean()
borough_distance = borough_distance.reset_index()
borough_distance = borough_distance.pivot(index='PUBorough', columns='year', values='trip_distance')
borough_distance.plot(kind='bar', title='Average Distance by Borough by Year', ylabel='Distance in Miles')
#make the graph wider
plt.gcf().set_size_inches(25, 5)



### Findings

* The average trip distance was, surprisingly, rising in most Boroughs from 2019 on.
* The Bronx had its highest yearly average trip distance in 2021 and Staten Island's was in 2020.
* There wasn't a single Borough whose trip distance fell during the pandemic-related years of 2020-2021.
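Claims such as the year of a Borough's highest average can be checked directly from a Borough-by-year table like `borough_distance`. A minimal sketch with invented averages:

```python
import pandas as pd

# Made-up average distances (miles), shaped like borough_distance:
# one row per borough, one column per year.
avg_distance = pd.DataFrame(
    {2019: [3.1, 2.4], 2020: [3.3, 2.9], 2021: [3.8, 2.6]},
    index=pd.Index(["Bronx", "Staten Island"], name="PUBorough"),
)

# For each borough, the column (year) holding the largest average.
peak_year = avg_distance.idxmax(axis=1)
print(peak_year)
```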

#### Question 5: Was tipping affected by the pandemic?

In [None]:
#average tip amount per borough (EWR and Unknown pickups were already removed from overall_data)
borough_tip = overall_data.groupby(['PUBorough'])['tip_amount'].mean()
borough_tip

In [None]:
#average tip amount in 2020
borough_tip_2020 = overall_data[overall_data['year'] == 2020]
borough_tip_2020 = borough_tip_2020.groupby(['PUBorough'])['tip_amount'].mean()
borough_tip_2020


In [None]:
#average tip amount in 2019
borough_tip_2019 = overall_data[overall_data['year'] == 2019]
borough_tip_2019 = borough_tip_2019.groupby(['PUBorough'])['tip_amount'].mean()
borough_tip_2019

In [None]:
#average tip amount in 2021
borough_tip_2021 = overall_data[overall_data['year'] == 2021]
borough_tip_2021 = borough_tip_2021.groupby(['PUBorough'])['tip_amount'].mean()
borough_tip_2021

In [None]:
#average tip amount in 2022
borough_tip_2022 = overall_data[overall_data['year'] == 2022]
borough_tip_2022 = borough_tip_2022.groupby(['PUBorough'])['tip_amount'].mean()
borough_tip_2022


In [None]:
#average tip amount for queens
borough_tip_queens = overall_data[overall_data['PUBorough'] == 'Queens']
borough_tip_queens = borough_tip_queens.groupby(['PUBorough'])['tip_amount'].mean()
borough_tip_queens


In [None]:
#percentage change in the average tip per borough from 2019 to 2020
tip_change = (borough_tip_2020 - borough_tip_2019) / borough_tip_2019 * 100

#unweighted mean of the per-borough percentage changes
tot_change = tip_change.mean()
tot_change
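The per-year tip averages above repeat the same filter-and-group step for each year. As a sketch of an alternative, a single `groupby` over both columns followed by `unstack` gives a Borough-by-year table, from which the 2019 to 2020 change follows by column arithmetic. The rows below are invented.

```python
import pandas as pd

# Made-up trips, not TLC data.
trips = pd.DataFrame({
    "PUBorough": ["Queens", "Queens", "Queens", "Bronx", "Bronx", "Bronx"],
    "year": [2019, 2019, 2020, 2019, 2020, 2020],
    "tip_amount": [4.0, 6.0, 4.0, 1.0, 1.0, 2.0],
})

# Rows are boroughs, columns are years, values are mean tips.
tip_table = trips.groupby(["PUBorough", "year"])["tip_amount"].mean().unstack("year")

# Percentage change in the mean tip from 2019 to 2020, per borough.
tip_change = (tip_table[2020] - tip_table[2019]) / tip_table[2019] * 100
print(tip_change)
```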


##### Plot 9: Average Tip Amount by Borough

In [None]:
#plot the average tip amount per borough for 2019 through 2022
fig, ax = plt.subplots()
borough_tip_2019.plot(title='Average Tip Amount by Borough', color='blue', ax=ax)
borough_tip_2020.plot(title='Average Tip Amount by Borough', color='red', ax=ax)
borough_tip_2021.plot(title='Average Tip Amount by Borough', color='green', ax=ax)
borough_tip_2022.plot(title='Average Tip Amount by Borough', color='orange', ax=ax)
ax.legend(['2019', '2020', '2021', '2022'])


### Findings

* Queens consistently has the highest average tip of all the Boroughs at $4.52.
* Averaged across the Boroughs, the change in the average tip from 2019 to 2020 was 51.59%. The only Boroughs that saw a rise in tipping in 2020 were the Bronx and Staten Island.
* Tipping did return to roughly pre-Covid levels in 2021 for all Boroughs.
* There was a steep rise in tipping in all Boroughs in 2022 from 2021 except in the Bronx and Staten Island.

#### Notes for further study of this question:

It would be very interesting to dive into the reasons behind the inverse effect Covid-19 had on the Bronx and Staten Island when it comes to tipping. While these reasons are beyond the scope of this project, one could use this data to seek greater insight in this area.


##### Plot 10: Summary plot of all data presented

In [None]:
#display the summary plots together in a 2x2 panel (the folium map is not included)

fig, ax = plt.subplots(2, 2, figsize=(30, 20))
passenger_count.plot(kind='bar', title='Average Passenger Count by Year', color='blue', ax=ax[0,0], xlabel='Year', ylabel='Passenger Count')
borough_distance.plot(kind='bar', title='Distance Traveled', colormap='Paired', ax=ax[0,1], xlabel='Borough', ylabel='Distance in Miles')
borough_tip_2019.plot(title='Average Tip Amount by Borough', color='blue', ax=ax[1,0], xlabel='Borough', ylabel='Tip Amount in USD')
borough_tip_2020.plot(title='Average Tip Amount by Borough', color='red', ax=ax[1,0], xlabel='Borough', ylabel='Tip Amount in USD')
borough_tip_2021.plot(title='Average Tip Amount by Borough', color='green', ax=ax[1,0], xlabel='Borough', ylabel='Tip Amount in USD')
borough_tip_2022.plot(title='Average Tip Amount by Borough', color='orange', ax=ax[1,0], xlabel='Borough', ylabel='Tip Amount in USD')
cash.plot(title='Cash vs. Credit Payments', color='red', ax=ax[1,1], ylabel='Number of Payments', xlabel='Year')
credit.plot(title='Cash vs. Credit Payments', color='blue', ax=ax[1,1], ylabel='Number of Payments', xlabel='Year')
ax[0,0].legend(['Average Passenger Count'])
ax[0,1].legend(loc='upper left', bbox_to_anchor=(1, 1), ncol=1, fancybox=True, shadow=True, title='Borough', title_fontsize='large',    borderpad=1, labelspacing=1, handlelength=1, handletextpad=1, borderaxespad=1, columnspacing=1)
ax[1,0].legend(['2019', '2020', '2021', '2022'])
ax[1,1].legend(['Cash', 'Credit Card'])



for container in ax[0,0].containers:
    ax[0,0].bar_label(container, label_type='edge', fmt='%.2f')


#map

# Conclusions

Through this study we have found:


* The average number of passengers had already begun to decline in 2013, well before 2020, but took a steeper drop from 1.56 people in 2019 to 1.41 people in 2020.
* The average did recover a bit in 2021 but went back down in 2022.
* It seems the pandemic had an impact on the average rider count, but not as significant an impact as I hypothesized at the outset.
* The number of cash payments in 2019 was 321,343 while the number of credit payments the same year was 851,437.
* The number of cash payments in 2020 was 321,204 while the number of credit payments the same year was 777,336.
* In 2019, roughly 73% of all payments were made with credit card. In 2020 that number decreased to roughly 71%.
* The average fare price per ride had been steadily increasing since 2011.
* The average fare price in 2011 was $10.25 and grew to $13.38 in 2019.
* In 2020 the average fare price fell to $12.62.
* Queens consistently has the highest average tip of all the Boroughs at $4.52.
* Averaged across the Boroughs, the change in the average tip from 2019 to 2020 was 51.59%. The only Boroughs that saw a rise in tipping in 2020 were the Bronx and Staten Island.
* Tipping did return to roughly pre-Covid levels in 2021 for all Boroughs.
* There was a steep rise in tipping in all Boroughs in 2022 from 2021 except in the Bronx and Staten Island.