# Data Visualization of NYC Taxi Trip Data

This notebook walks through loading, cleaning, processing, aggregating, and visualizing NYC taxi trip data, a large dataset that typically contains millions of records.

## Steps:
1. **Data Loading**: Load the large CSV file.
2. **Data Cleaning**: Handle missing values and outliers.
3. **Data Processing**: Calculate trip duration and other metrics.
4. **Data Aggregation**: Aggregate data by different dimensions.
5. **Data Visualization**: Create various plots to visualize the data.
6. **Performance Checks**: Use the `%%time` cell magic to measure how long each operation takes.

The data is published by the NYC Taxi and Limousine Commission and is also available through NYC Open Data. You can either download a copy into a known directory or read it straight from a URL, as the loading cell in this notebook does.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

### Load the Data

Use the following cell to load the data. `pd.read_csv` accepts either a URL or a local file path, so point `url` at your own copy if you have downloaded one.

In [None]:
# Load the data from a publicly accessible URL (a local file path works as well)
url = 'https://data.cityofnewyork.us/api/views/m6nq-qud6/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url)
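
For a file with millions of rows, reading everything in one call can exhaust memory. A minimal sketch of one alternative is shown here: read only the columns you need and process the file in chunks. The in-memory CSV is a stand-in for the real file, its values are made up, and the chunk size and the name `df_small` are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (illustrative values only)
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,fare_amount\n"
    "1,2021-01-01 00:15:00,2021-01-01 00:25:00,2.1,9.5\n"
    "2,2021-01-01 08:05:00,2021-01-01 08:35:00,7.4,24.0\n"
    "1,2021-01-01 08:40:00,2021-01-01 08:50:00,1.3,-3.0\n"
    "2,2021-01-01 17:20:00,2021-01-01 17:45:00,4.8,18.5\n"
)

# Read only the columns the analysis needs, a few rows at a time
columns = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "fare_amount"]
chunks = pd.read_csv(
    sample_csv,
    usecols=columns,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    chunksize=2,  # assumed value; something like 1_000_000 suits the real file
)

# Filter each chunk before concatenating so discarded rows never accumulate
df_small = pd.concat(
    (chunk[chunk["fare_amount"] > 0] for chunk in chunks),
    ignore_index=True,
)
print(df_small.shape)
```

With the real dataset, replace `sample_csv` with the URL or file path. Parsing the timestamps at load time also means later cells can use the `.dt` accessor without converting again.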

### Data Cleaning

Clean the data by dropping rows with missing values and filtering out records with a non-positive fare.

In [None]:
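%%time
# Drop rows with any missing value, then keep only trips with a positive fare.
# .copy() avoids SettingWithCopyWarning when new columns are added later.
df = df.dropna()
df = df[df['fare_amount'] > 0].copy()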
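
The step list mentions outliers, while the cell above only removes missing values and non-positive fares. One possible approach, sketched here on made-up rows, is to keep only records inside plausibility bounds. The thresholds `MAX_FARE` and `MAX_DISTANCE` are assumptions, not official limits, and the `trip_distance` column is assumed to be present in your copy of the data.

```python
import pandas as pd

# Illustrative rows; real data would come from the loading cell
trips = pd.DataFrame({
    "fare_amount": [9.5, 24.0, 18.5, 950.0, 12.0],
    "trip_distance": [2.1, 7.4, 4.8, 3.0, 0.0],
})

# Assumed plausibility bounds; tune them after inspecting the distributions
MAX_FARE = 500.0      # dollars
MAX_DISTANCE = 100.0  # miles

# Keep rows whose fare and distance are both positive and within the bounds
mask = (
    (trips["fare_amount"] > 0) & (trips["fare_amount"] <= MAX_FARE)
    & (trips["trip_distance"] > 0) & (trips["trip_distance"] <= MAX_DISTANCE)
)
cleaned = trips[mask].copy()
print(len(trips) - len(cleaned), "rows removed")
```

Fixed bounds are easy to explain and audit. A quantile-based cutoff (for example via `Series.quantile`) is an alternative when sensible limits are not known in advance.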

### Data Processing

Calculate derived metrics such as trip duration in minutes.

In [None]:
%%time
# Trip duration in minutes: dropoff minus pickup, converted from seconds
df['trip_duration'] = (pd.to_datetime(df['tpep_dropoff_datetime']) - pd.to_datetime(df['tpep_pickup_datetime'])).dt.total_seconds() / 60
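
As an example of another metric that can be derived in the same way, the sketch below computes average speed from distance and duration. It assumes a `trip_distance` column measured in miles; the sample rows and the `avg_speed_mph` column name are invented for illustration.

```python
import pandas as pd

# Illustrative rows using the same timestamp columns as the notebook
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2021-01-01 08:00:00", "2021-01-01 09:00:00"],
    "tpep_dropoff_datetime": ["2021-01-01 08:30:00", "2021-01-01 09:15:00"],
    "trip_distance": [6.0, 2.5],  # assumed to be in miles
})

pickup = pd.to_datetime(trips["tpep_pickup_datetime"])
dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"])
trips["trip_duration"] = (dropoff - pickup).dt.total_seconds() / 60  # minutes

# Drop zero or negative durations first so the division below is well defined
trips = trips[trips["trip_duration"] > 0].copy()

# Average speed in miles per hour (duration is converted from minutes to hours)
trips["avg_speed_mph"] = trips["trip_distance"] / (trips["trip_duration"] / 60)
print(trips[["trip_duration", "avg_speed_mph"]])
```

Implausibly high speeds computed this way can also serve as an additional outlier check.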

### Data Aggregation

Group the data by a time dimension such as pickup hour or day. The cell below counts trips per pickup hour.

In [None]:
%%time
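# Parse the pickup timestamp so the .dt accessor is available, then count rows per hour
hourly_data = df.groupby(pd.to_datetime(df['tpep_pickup_datetime']).dt.hour).count()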
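
Grouping by day works the same way. The sketch below, run on made-up rows, groups by day of week and uses named aggregation to compute a trip count and a mean fare in one pass. The names `daily_data`, `trip_count`, and `avg_fare` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Illustrative rows; the pickup column is already parsed to datetime
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2021-01-04 08:00:00",  # a Monday
        "2021-01-04 18:30:00",  # the same Monday
        "2021-01-05 09:15:00",  # a Tuesday
    ]),
    "fare_amount": [10.0, 20.0, 12.0],
})

# Group by the weekday name and compute two statistics per group
day_of_week = trips["tpep_pickup_datetime"].dt.day_name().rename("day")
daily_data = trips.groupby(day_of_week).agg(
    trip_count=("fare_amount", "size"),
    avg_fare=("fare_amount", "mean"),
)
print(daily_data)
```

Unlike `.count()`, which returns one count per column, named aggregation yields only the columns you ask for, which keeps the result small on wide tables.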

### Data Visualization

Visualize the aggregated data by plotting the number of trips for each pickup hour.

In [None]:
%%time
plt.figure(figsize=(12, 6))
plt.plot(hourly_data.index, hourly_data['VendorID'], marker='o')
plt.title('Trip Counts by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Trips')
plt.grid(True)
plt.show()
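
A bar chart is a common alternative for counts over discrete hours. The sketch below uses made-up counts in place of `hourly_data['VendorID']` and reindexes to all 24 hours so that hours without trips appear as zero rather than being skipped. The name `hourly_counts` is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative hourly counts; in the notebook this would be hourly_data['VendorID']
hourly_counts = pd.Series([120, 80, 310, 450], index=[0, 4, 8, 17], name="trips")

# Fill in the missing hours with zero so the x-axis covers the whole day
hourly_counts = hourly_counts.reindex(range(24), fill_value=0)

fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(hourly_counts.index, hourly_counts.values)
ax.set_title("Trip Counts by Hour of Day")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Number of Trips")
ax.set_xticks(range(24))
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.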