## Task 2.5: Advanced Geospatial Plotting with Kepler.gl
### Project: NYC Citi Bike Dashboard Analysis
### Abstract

This notebook covers the geospatial analysis for the NYC Citi Bike project. The objective is to visualize the most popular trip routes and identify high-demand corridors, which addresses the business need to understand bike distribution and find expansion opportunities. The maps are built with Kepler.gl, a library designed for exploring large-scale geospatial datasets.
#### Methodology and Key Steps:
1. **Data Aggregation**: The first step is to transform the raw, trip-level data (about 29.8 million rows) into an aggregated format. The data is grouped by unique start and end station pairs to count the total number of trips for each route.
2. **Targeted Sampling**: To keep the visualization both relevant and responsive, a targeted sample of the top 1,000 most popular routes is created. This focuses the map on the busiest segments of the bike network, which are the ones the project's strategic goals are concerned with.
3. **Interactive Map Creation**: An interactive map is generated using the KeplerGl widget. This map serves as the primary tool for exploration and analysis within the notebook.
4. **Visualization Customization**: The map is styled for clarity:
   - The Arc layer visualizes the flow of traffic between stations.
   - Both the color and the stroke thickness of the arcs are mapped to `trip_count`, so the busiest routes are visually dominant.
   - A high-contrast color palette makes hotspots easy to identify.
5. **Filtered Analysis**: The notebook concludes with an interactive filtering exercise, narrowing the view to the absolute busiest corridors. The accompanying analysis interprets these patterns, identifying key commuting arteries like major crosstown streets and bridges, and providing actionable insights for the business strategy team.

In [1]:
# Import Libraries
import pandas as pd
import os
from keplergl import KeplerGl
from pyproj import CRS 
import numpy as np
from matplotlib import pyplot as plt

print("Libraries imported successfully.")

Libraries imported successfully.


In [2]:
# Load the Dataset
try:
    # Define the data types for all station-related columns to prevent warnings.
    column_dtypes = {
        'start_station_name': str,
        'start_station_id': str,
        'end_station_name': str,
        'end_station_id': str
    }

    # Load the CSV, passing the dtype information.
    df_full = pd.read_csv(
        'citi_bike_2022_with_weather.csv',
        dtype=column_dtypes
    )
    
    print("Successfully loaded 'citi_bike_2022_with_weather.csv'.")
    print(f"Dataset contains {len(df_full):,} rows.")
    
except FileNotFoundError:
    print("Error: 'citi_bike_2022_with_weather.csv' not found. Please ensure the file is on your local drive and not just in iCloud.")
    df_full = pd.DataFrame()

display(df_full.head())

Successfully loaded 'citi_bike_2022_with_weather.csv'.
Dataset contains 29,838,806 rows.


| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | date | avgTemp | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9D0DC440CB40CF8E | electric_bike | 2022-08-27 13:56:47.728 | 2022-08-27 14:02:56.651 | Flatbush Ave & Ocean Ave | 3704.04 | 3 St & Prospect Park West | 3865.05 | 40.663657 | -73.963014 | 40.668132 | -73.973638 | casual | 2022-08-27 | 27.8 | both |
| 1 | 2214991DFBE5C4D7 | electric_bike | 2022-08-20 10:37:02.756 | 2022-08-20 10:45:56.631 | Forsyth St\t& Grand St | 5382.07 | E 11 St & 1 Ave | 5746.14 | 40.717798 | -73.993161 | 40.729538 | -73.984267 | casual | 2022-08-20 | 27.9 | both |
| 2 | 20C5D469563B6337 | classic_bike | 2022-08-31 18:55:03.051 | 2022-08-31 19:03:37.344 | Perry St & Bleecker St | 5922.07 | Grand St & Greene St | 5500.02 | 40.735354 | -74.004831 | 40.7217 | -74.002381 | member | 2022-08-31 | 25.6 | both |
| 3 | 3E8791885BC189D1 | classic_bike | 2022-08-02 08:05:00.250 | 2022-08-02 08:16:52.063 | FDR Drive & E 35 St | 6230.04 | Grand Army Plaza & Central Park S | 6839.1 | 40.744219 | -73.971212 | 40.764397 | -73.973715 | member | 2022-08-02 | 26.4 | both |
| 4 | 8DBCBF98885106CB | electric_bike | 2022-08-25 15:44:48.386 | 2022-08-25 15:55:39.691 | E 40 St & 5 Ave | 6474.11 | Ave A & E 14 St | 5779.11 | 40.752052 | -73.982115 | 40.730311 | -73.980472 | member | 2022-08-25 | 28.1 | both |


In [3]:
# Aggregate Trips Between Stations

print("Aggregating data to count trips per route...")

# Our NYC data already contains the latitude and longitude for each station, which is perfect for mapping.
# We will group by all the details that define a unique route: the start station's name and coordinates,
# and the end station's name and coordinates. Then, we'll count the number of rides for each group.

df_aggregated = df_full.groupby([
    'start_station_name', 
    'start_lat', 
    'start_lng', 
    'end_station_name', 
    'end_lat', 
    'end_lng'
]).size().reset_index(name='trip_count')

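# As a sanity check, compare the trip totals before and after aggregation.
# Note: groupby drops rows with a missing value in any key column by default,
# so the aggregated total can be smaller than the original row count.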
total_original_trips = len(df_full)
total_aggregated_trips = df_aggregated['trip_count'].sum()

print(f"\nTotal trips in original data: {total_original_trips:,}")
print(f"Total trips after aggregation: {total_aggregated_trips:,}")
print(f"Number of unique routes found: {len(df_aggregated):,}")

# Sort the results by the most popular routes to see the busiest corridors.
df_aggregated = df_aggregated.sort_values(by='trip_count', ascending=False)

print("\nAggregation complete. Here are the top 5 most popular routes:")
display(df_aggregated.head())

Aggregating data to count trips per route...

Total trips in original data: 29,838,806
Total trips after aggregation: 29,768,714
Number of unique routes found: 5,004,655

Aggregation complete. Here are the top 5 most popular routes:


| | start_station_name | start_lat | start_lng | end_station_name | end_lat | end_lng | trip_count |
|---|---|---|---|---|---|---|---|
| 1383851 | Central Park S & 6 Ave | 40.765909 | -73.976342 | Central Park S & 6 Ave | 40.765909 | -73.976342 | 10658 |
| 3713672 | Roosevelt Island Tramway | 40.757284 | -73.9536 | Roosevelt Island Tramway | 40.757284 | -73.9536 | 7862 |
| 633518 | 7 Ave & Central Park South | 40.766741 | -73.979069 | 7 Ave & Central Park South | 40.766741 | -73.979069 | 7378 |
| 3797878 | Soissons Landing | 40.692317 | -74.014866 | Soissons Landing | 40.692317 | -74.014866 | 6731 |
| 4268516 | W 21 St & 6 Ave | 40.74174 | -73.994156 | 9 Ave & W 22 St | 40.745497 | -74.001971 | 5987 |
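
The two totals printed above differ by 70,092 trips. A likely explanation, which is an assumption and has not been checked against this dataset, is that `groupby` leaves out rows whose key columns contain missing values (for example, rides with no recorded end station). The sketch below reproduces that behaviour on a small made-up table; the `trips` frame and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the trip table; the second ride has no recorded end station.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A"],
    "end_station_name": ["B", None, "A", "B"],
})

keys = ["start_station_name", "end_station_name"]

# Default behaviour: rows with a missing key are left out of every group.
default_counts = trips.groupby(keys).size().reset_index(name="trip_count")

# dropna=False keeps missing keys as their own group (available in pandas 1.1+).
full_counts = trips.groupby(keys, dropna=False).size().reset_index(name="trip_count")

# Count the rows that the default grouping cannot assign to any route.
rows_with_missing_key = int(trips[keys].isna().any(axis=1).sum())

print(default_counts["trip_count"].sum())  # 3
print(full_counts["trip_count"].sum())     # 4
print(rows_with_missing_key)               # 1
```

If the same check on `df_full` reports 70,092 rows with a missing key, the gap is fully explained. For a route map, dropping those rows is reasonable, since a trip without both endpoints cannot be drawn as an arc.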


### Geospatial Visualization Strategy: Focusing on High-Impact Routes

The aggregated dataset contains over 5 million unique trip routes, which is too large to render efficiently in an interactive browser-based tool like Kepler.gl. Attempting to visualize the entire dataset would lead to significant performance issues and a cluttered, uninterpretable map. Therefore, a sampling strategy is necessary.
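
One factor that may inflate the route count is the grouping itself: because the coordinates are part of the group key, the same station pair is split into several "routes" whenever a station is recorded with slightly different latitude or longitude values. Whether this happens in the Citi Bike data is an assumption that has not been verified here. The sketch below uses made-up coordinates to show the effect, along with one possible alternative that groups by station names only and takes the median coordinate.

```python
import pandas as pd

# Toy data: the A -> B route is recorded with two slightly different start latitudes.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B"],
    "start_lat": [40.7000, 40.7001, 40.7000, 40.8000],
    "start_lng": [-74.0000, -74.0000, -74.0000, -73.9000],
    "end_station_name": ["B", "B", "B", "A"],
    "end_lat": [40.8000, 40.8000, 40.8000, 40.7000],
    "end_lng": [-73.9000, -73.9000, -73.9000, -74.0000],
})

# Grouping by names AND coordinates (as in the notebook) splits A -> B in two.
by_name_and_coords = (
    trips.groupby(list(trips.columns)).size().reset_index(name="trip_count")
)

# Alternative: group by names only and use the median coordinate per route.
by_name = (
    trips.groupby(["start_station_name", "end_station_name"])
    .agg(
        start_lat=("start_lat", "median"),
        start_lng=("start_lng", "median"),
        end_lat=("end_lat", "median"),
        end_lng=("end_lng", "median"),
        trip_count=("start_lat", "size"),
    )
    .reset_index()
)

print(len(by_name_and_coords))  # 3
print(len(by_name))             # 2
```

Comparing `len(df_aggregated)` with the number of distinct name pairs in `df_full` would show whether this matters for the real data. If it does, the per-route counts in the ranking would be understated for the affected routes.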

While a random sample could provide an overview of the network's general structure, it would include many low-frequency routes that are not relevant to the core business problem of managing high demand.

Given the project's objective to **"diagnose where distribution issues stem from"** and **"identify expansion opportunities,"** the most effective analytical approach is to focus on the routes under the most strain. For this reason, I have created a sample consisting of the **top 1,000 most popular routes**. This targeted sample allows us to filter out the noise and create a clear, actionable visualization of the city's primary "bike highways," directly addressing the strategic goals of the analysis.
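
The difference between the two sampling strategies can be sketched on a small made-up table. The `routes` frame, its values, and `top_n = 2` are illustrative assumptions; the notebook itself uses the top 1,000 rows of the sorted `df_aggregated`.

```python
import pandas as pd

# Toy route table with a skewed distribution of trip counts.
routes = pd.DataFrame({
    "route": ["r1", "r2", "r3", "r4", "r5"],
    "trip_count": [50, 30, 10, 7, 3],
})

top_n = 2  # the notebook uses 1000

# Targeted sample: the busiest routes, selected without sorting the whole frame.
top_routes = routes.nlargest(top_n, "trip_count")

# Random sample of the same size, for comparison.
random_routes = routes.sample(n=top_n, random_state=0)

# Share of all trips carried by the targeted sample.
coverage = top_routes["trip_count"].sum() / routes["trip_count"].sum()
print(f"Top {top_n} routes carry {coverage:.0%} of trips")  # Top 2 routes carry 80% of trips
```

Reporting the same coverage figure for the real top 1,000 routes would make it clear how much of total demand the map actually represents.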

In [4]:
# Initialize Kepler.gl Map (with Top 1000 Routes)

# As per the project brief's goal to find the busiest areas, we will focus
# on the most popular routes. Since the DataFrame is already sorted by trip_count,
# we can simply take the top 1000 routes using .head().
df_top_1000 = df_aggregated.head(1000)

print(f"Using the top {len(df_top_1000)} most popular routes for a focused analysis.")
# Create a KeplerGl instance with this highly relevant data sample.
m = KeplerGl(
    height=600,
    data={'Top 1000 Routes': df_top_1000}
)

# Display the map instance. It will render much faster and be more focused.
m

Using the top 1000 most popular routes for a focused analysis.
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter



### Map Customization and Rationale

The map was customized to transform the raw data into an insightful visualization of New York City's primary bike routes. The following settings were applied to achieve a clear and professional result:

1.  **Layer Selection:** To focus the analysis on the flow of traffic, the default "Point" layers were hidden, and the **Arc** layer was made visible. Arcs are the ideal choice for this project as they clearly illustrate the connections and routes between different start and end stations.

2.  **Color Encoding:** The color of each arc was mapped to the **`trip_count`** column. I selected a **Purple-to-Yellow sequential color palette**, which effectively functions as a heatmap. The bright yellow highlights the most popular, high-traffic corridors, immediately drawing the viewer's eye to the areas with the highest demand.

3.  **Stroke (Thickness) Encoding:** To further emphasize the hierarchy of route popularity, the thickness of each arc was also mapped to the **`trip_count`** column. The busiest routes are therefore both the brightest and the thickest lines on the map, so they stand out even when many arcs overlap.

These combined settings create a rich, multi-layered visualization that clearly communicates both the density and the relative importance of different bike routes across the city.
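
The styling above was applied by hand in the Kepler.gl side panel. As a rough sketch, similar settings could also be expressed as a configuration dictionary. The layer `id` and `label` below are hypothetical, the dictionary is deliberately partial, and the key names reflect the saved-config layout as I understand it; it should be compared with the `m.config` output of a hand-styled map before being relied on.

```python
# Partial, illustrative Kepler.gl configuration for a single arc layer.
# "route_arcs" and "Route arcs" are made-up names; the color palette is omitted.
arc_config = {
    "version": "v1",
    "config": {
        "visState": {
            "layers": [
                {
                    "id": "route_arcs",
                    "type": "arc",
                    "config": {
                        # Must match the dataset name passed to KeplerGl(data=...).
                        "dataId": "Top 1000 Routes",
                        "label": "Route arcs",
                        # Start and end coordinates of each arc.
                        "columns": {
                            "lat0": "start_lat",
                            "lng0": "start_lng",
                            "lat1": "end_lat",
                            "lng1": "end_lng",
                        },
                        "isVisible": True,
                    },
                    # Map both color and stroke thickness to trip_count.
                    "visualChannels": {
                        "colorField": {"name": "trip_count", "type": "integer"},
                        "colorScale": "quantile",
                        "sizeField": {"name": "trip_count", "type": "integer"},
                        "sizeScale": "linear",
                    },
                }
            ]
        }
    },
}

# In the notebook this could be passed when the map is created, for example:
# m = KeplerGl(height=600, data={"Top 1000 Routes": df_top_1000}, config=arc_config)
```

Keeping the styling in code makes the map reproducible without repeating the manual steps, at the cost of maintaining a configuration that has to track the Kepler.gl schema.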

### Analysis of Filtered Popular Trips

After applying a filter to show only routes with a high `trip_count` (e.g., >1500 trips), the map transforms from a dense web into a clear network of the city's primary bike corridors. Several key patterns emerge that directly address the project brief:

1.  **Manhattan is the Epicenter:** The vast majority of the most popular routes are concentrated within Manhattan, particularly in the dense commercial and residential areas of **Midtown** and **Lower Manhattan**. This confirms that the most significant distribution challenges will be found in these zones.

2.  **Key Commuting Arteries:** The filter reveals extremely heavy traffic on specific routes, such as those connecting major transit hubs (e.g., Penn Station, Grand Central) to office districts, and on major crosstown streets that connect the east and west sides of the borough.

3.  **Bridge and Waterfront Traffic:** The **Williamsburg and Brooklyn Bridges** are clearly identified as critical commuting arteries with immense traffic volume. Routes along the Hudson River and East River waterfronts are also confirmed as extremely popular recreational and commuting paths.

**Important Note on Round Trips:** Four of the five most popular routes in the dataset are round trips that start and end at the same station, such as "Central Park S & 6 Ave" and "Roosevelt Island Tramway". Because the Arc layer in Kepler.gl cannot visualize a zero-length route, these trips are not displayed on this map. The map is useful for understanding the flow of bikes between different stations, but the business strategy team should also be aware that a substantial share of demand comes from single-station rentals, most likely leisure rides in recreational areas.

This filtered view, combined with the knowledge of popular round-trip locations, provides precise and actionable intelligence for the business strategy team. It pinpoints the exact pathways under the most strain and identifies key recreational hubs, which is invaluable for planning both fleet rebalancing and future station placement.
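
The split between round trips and station-to-station routes, and the trip-count filter, can also be reproduced in pandas. The sketch below is a minimal example: the first two rows reuse values from the top-5 table, the third row is made up, and the 1,500 threshold mirrors the example filter mentioned above.

```python
import pandas as pd

# Small stand-in for df_top_1000; the "A" -> "B" row is hypothetical.
routes = pd.DataFrame({
    "start_station_name": ["Central Park S & 6 Ave", "W 21 St & 6 Ave", "A"],
    "end_station_name": ["Central Park S & 6 Ave", "9 Ave & W 22 St", "B"],
    "trip_count": [10658, 5987, 1200],
})

# A round trip starts and ends at the same station, so its arc has zero length.
is_round_trip = routes["start_station_name"] == routes["end_station_name"]

round_trips = routes[is_round_trip]          # candidates for a separate point layer
station_to_station = routes[~is_round_trip]  # routes the Arc layer can draw

# Same threshold as the interactive filter example.
busiest = station_to_station[station_to_station["trip_count"] > 1500]

print(len(round_trips))               # 1
print(busiest["trip_count"].tolist()) # [5987]
```

Separating the two groups before mapping would let the round trips be shown as sized points at their stations instead of disappearing from the arc view.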

In [5]:
# Save the Configured Map to an HTML File

# First, get the map's current configuration, which includes the customizations
# (layers, colors, thickness) and the current filter settings.
config = m.config

# Now, we save the map to a descriptive HTML file, passing this configuration to it.
m.save_to_html(
    file_name='nyc_top_1000_bike_routes.html',
    read_only=False, # Set to False to keep the control panel in the HTML file
    config=config
)

print("Customized map and its settings have been successfully saved to 'nyc_top_1000_bike_routes.html'.")

Map saved to nyc_top_1000_bike_routes.html!
Customized map and its settings have been successfully saved to 'nyc_top_1000_bike_routes.html'.
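
The exported HTML embeds the styling, but the configuration itself can also be kept as a JSON file so the same look can be reapplied in a later session. This is a minimal sketch: the `config` dictionary below is a stand-in for `m.config`, and the file name is a hypothetical choice.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for m.config; in the notebook, use the real configuration instead.
config = {"version": "v1", "config": {"visState": {"layers": []}}}

# Hypothetical file name, written to the temp directory for this example.
config_path = Path(tempfile.gettempdir()) / "nyc_top_1000_bike_routes_config.json"

# Save the configuration as human-readable JSON.
config_path.write_text(json.dumps(config, indent=2))

# Load it back; the restored dictionary can be passed to KeplerGl(config=...).
restored = json.loads(config_path.read_text())
print(restored == config)  # True
```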


In [8]:
# --- Final Step: Save Aggregated Data for Future Use ---
# Save the aggregated DataFrame to a new, smaller CSV file.

output_filename = 'citi_bike_top_routes_aggregated.csv'
df_aggregated.to_csv(output_filename, index=False)

print(f"Aggregated route data has been successfully saved to '{output_filename}'.")

# Also saving the top 1000 routes data for future use.
top_1000_filename = 'citi_bike_top_1000_routes.csv'
df_top_1000.to_csv(top_1000_filename, index=False)

print(f"Top 1000 routes sample has been successfully saved to '{top_1000_filename}'.")

Aggregated route data has been successfully saved to 'citi_bike_top_routes_aggregated.csv'.
Top 1000 routes sample has been successfully saved to 'citi_bike_top_1000_routes.csv'.
