# Exploration

This notebook contains the data exploration done on the dataset. Its aim is to answer the following research question: **TODO**. We do this by generating insightful visualizations, using multiple techniques from the class (e.g. **TODO**), and extending the dataset with additional relevant data, namely **TODO**.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6) # Default size for every figure in this notebook

First, let's read the data into a pandas DataFrame and inspect the columns. We also set the index to the start time of each trip.

In [None]:
trips = pd.read_csv('../data/Trips_2018.csv', index_col=2, parse_dates=True)
trips = trips.sort_index() # Sort the trips in ascending order of the start time
print("Shape:", trips.shape)
print(trips.head(3))

Because the index is a datetime index, we can select all trips for a single day by passing a date string to `loc`:

In [None]:
trips.loc['2018-06-01']

Let's plot how many trips were made each day.

In [None]:
daily_counts = trips.resample('D').size() # Group the trips by day

daily_counts.plot(kind='line')
plt.title('Daily Trip Counts')
plt.xlabel('Date')
plt.ylabel('Number of Trips')
plt.grid(axis='y', linestyle='--')
plt.show()

We can clearly see a difference between the number of trips made in summer and in winter. Let's show this relation explicitly by including a dataset with the temperatures in New York. The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/aadimator/nyc-weather-2016-to-2022).

In [None]:
weather = pd.read_csv("../data/NYC_Weather_2016_2022.csv", index_col=0, parse_dates=True)

# Only keep the weather for 2018
weather = weather.loc['2018']

# Compute the average temperature per day
daily_avg_temp = weather['temperature_2m (°C)'].resample('D').mean()

In [None]:
# Plot daily trip counts (left y-axis)
ax1 = daily_counts.plot(kind='line', color='tab:blue', label='Daily Trip Counts')

# Create a second y-axis (right)
ax2 = ax1.twinx()
daily_avg_temp.plot(kind='line', color='tab:red', label='Avg Temperature (°C)', ax=ax2)

ax1.set_title('Daily Trip Counts vs. Average Temperature (2018)')
ax1.set_xlabel('Date')
ax1.set_ylabel('Number of Trips', color='tab:blue')
ax2.set_ylabel('Avg Temperature (°C)', color='tab:red')
ax1.grid(axis='y', linestyle='--')

# Combine legends
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')

plt.show()


Visually, the daily trip count tracks the average temperature closely: both rise toward the summer and fall again in winter.
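
To put a number on this relationship rather than judging it by eye, one option is the Pearson correlation between the two daily series. The sketch below is only an illustration: the helper name `trip_temperature_correlation` is hypothetical, and it assumes both series have a daily `DatetimeIndex` in the same timezone convention. The demo values are synthetic, not taken from the dataset.

```python
import pandas as pd


def trip_temperature_correlation(daily_counts: pd.Series, daily_avg_temp: pd.Series) -> float:
    """Pearson correlation between daily trip counts and daily mean temperature.

    Hypothetical helper: assumes both series are indexed by day.
    """
    # Align the two series on the dates they share and drop days with missing values
    combined = pd.concat(
        [daily_counts, daily_avg_temp], axis=1, join='inner', keys=['trips', 'temp']
    ).dropna()
    return combined['trips'].corr(combined['temp'])


# Synthetic demo data (not from the dataset): counts rise linearly with temperature
demo_dates = pd.date_range('2018-01-01', periods=5, freq='D')
demo_counts = pd.Series([10, 20, 30, 40, 50], index=demo_dates)
demo_temp = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=demo_dates)
demo_r = trip_temperature_correlation(demo_counts, demo_temp)
```

In the notebook, calling the helper with `daily_counts` and `daily_avg_temp` would give a single coefficient between -1 and 1. Note that a high value shows the two series move together over the year; it does not by itself show that temperature causes the change in trips.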

Let's zoom in on January and February and look for weekly seasonality (highlighting weekend days).

In [None]:
start = '2018-01'
end = '2018-02'
trips_jan_feb = trips.loc[start:end]

number_trips_jan_feb = trips_jan_feb.resample('D').size()

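# Flag weekend days (Saturday = 5, Sunday = 6): weekends are blue, weekdays red
weekends = number_trips_jan_feb.index.weekday >= 5
colors = ['blue' if is_weekend else 'red' for is_weekend in weekends]

import matplotlib.dates as mdates

fig, ax = plt.subplots()
ax.plot(number_trips_jan_feb)
# Draw the coloured markers on top of the line
ax.scatter(number_trips_jan_feb.index, number_trips_jan_feb, marker='o', c=colors, zorder=3)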
ax.set_xlabel('Date')
ax.set_ylabel('Number of Trips')
ax.set_title('Jan-Feb 2018 Number of Trips')

# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

In general, we can see that the number of trips is lower on weekends (blue dots) than during the rest of the week (red dots).
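
The weekday/weekend difference can also be summarized numerically by averaging the daily counts per day type. This is a minimal sketch, assuming a series of daily counts with a `DatetimeIndex`; the helper name `mean_trips_by_day_type` is hypothetical and the demo values are synthetic.

```python
import numpy as np
import pandas as pd


def mean_trips_by_day_type(daily_counts: pd.Series) -> pd.Series:
    """Average number of trips per day, split into weekdays and weekends.

    Hypothetical helper: assumes daily_counts has a DatetimeIndex.
    """
    # weekday is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
    day_type = np.where(daily_counts.index.weekday >= 5, 'weekend', 'weekday')
    return daily_counts.groupby(day_type).mean()


# Synthetic demo week (not from the dataset): 2018-01-01 was a Monday
demo_week = pd.date_range('2018-01-01', periods=7, freq='D')
demo_daily = pd.Series([100, 100, 100, 100, 100, 60, 40], index=demo_week)
demo_means = mean_trips_by_day_type(demo_daily)
```

Applied to `number_trips_jan_feb`, this would return two averages that can be compared directly, which complements the visual impression from the plot.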

## Stations