# CitiBike Data Cleaning: December 2023

This notebook inspects a one-month sample (December 2023) of the data we will be working with. The goal is to identify what needs to be cleaned, so that the same steps can be reused for every month from January 2023 to October 2025.

## 1. Imports and data loading

In [129]:
import numpy as np
import pandas as pd
# Load the CSV files
df1 = pd.read_csv('../data/raw/citibike/2023/12/202312-citibike-tripdata_1.csv',
                  low_memory=False)
df2 = pd.read_csv('../data/raw/citibike/2023/12/202312-citibike-tripdata_2.csv',
                  low_memory=False)
df3 = pd.read_csv('../data/raw/citibike/2023/12/202312-citibike-tripdata_3.csv',
                  low_memory=False)
# Concatenate the DataFrames
df = pd.concat([df1, df2, df3], ignore_index=True)
# Check for duplicate rows
df.duplicated().sum()

np.int64(0)
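Since the same loading step will be repeated for every month, it can be wrapped in a small helper. The sketch below is one possible version. The name `load_month` and the glob pattern are our own assumptions, inferred from the three file names above, and other months may be split into a different number of parts.

```python
from pathlib import Path

import pandas as pd


def load_month(month_dir, pattern="*-citibike-tripdata_*.csv"):
    """Read every CSV part found in month_dir and stack them into one DataFrame.

    `pattern` is an assumption based on the December 2023 file names.
    """
    paths = sorted(Path(month_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {month_dir}")
    parts = [pd.read_csv(path, low_memory=False) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(parts, ignore_index=True)
```

With this helper, the cell above would reduce to `df = load_month('../data/raw/citibike/2023/12')`.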

## 2. Overview

In [130]:
# Check column data types
print(df.info())
# Check for missing values in each column
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2204874 entries, 0 to 2204873
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 218.7+ MB
None
ride_id                  0
rideable_type            0
started_at               0
ended_at                 0
start_station_name    1550
start_station_id      1550
end_station_name      6471
end_station_id        6471
start_lat             1550
start_lng             1550
end_lat               6452
end_lng               6452
member_ca
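Raw counts are hard to compare across months of different sizes, so a relative view of the missing values may be more useful. The helper below is a minimal sketch; `missing_summary` is a hypothetical name of our own.

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    return pd.DataFrame(
        {
            "n_missing": counts,
            "pct_missing": (100 * counts / len(df)).round(3),
        }
    )
```

Calling `missing_summary(df)` on the concatenated frame would show what share of the 2,204,874 rows each missing count represents.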

In [131]:
# Check first few rows
df.head()

| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5301271E68E18BA4 | electric_bike | 2023-12-09 14:36:16.912 | 2023-12-09 14:48:13.919 | Watts St & Greenwich St | 5578.02 | W 4 St & 7 Ave S | 5880.02 | 40.724055 | -74.00966 | 40.734011 | -74.002939 | casual |
| 1 | DE4CC2FEB483AE3A | electric_bike | 2023-12-13 09:50:39.182 | 2023-12-13 10:13:01.348 | Underhill Ave & Lincoln Pl | 4042.08 | Plaza St East & Flatbush Ave | 4010.01 | 40.674012 | -73.967146 | 40.673134 | -73.969106 | casual |
| 2 | 0C67D9174CBDE2D0 | electric_bike | 2023-12-02 13:00:29.094 | 2023-12-02 13:33:22.654 | W Broadway & Spring St | 5569.06 | Central Park W & W 103 St | 7577.27 | 40.724947 | -74.001659 | 40.79559 | -73.961884 | casual |
| 3 | BDB5F8E57AF7CC70 | classic_bike | 2023-12-08 11:09:49.012 | 2023-12-08 11:11:21.778 | Ave A & E 11 St | 5703.13 | E 11 St & 1 Ave | 5746.14 | 40.728547 | -73.981759 | 40.729538 | -73.984267 | member |
| 4 | 599A6C20123BED9D | electric_bike | 2023-12-06 10:12:58.286 | 2023-12-06 10:19:12.479 | Ave A & E 11 St | 5703.13 | Lafayette St & Jersey St | 5561.06 | 40.728547 | -73.981759 | 40.724561 | -73.995653 | casual |


## 3. Impose data types 

In [132]:
# ride IDs and names → strings
for col in ["ride_id", "start_station_name", "end_station_name"]:
    if col in df.columns:
        df[col] = df[col].astype("string")
        
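# station ids → numeric
# Caveat: errors="coerce" turns any ID that is not a valid number into NaN,
# and rows with a NaN station ID are dropped in section 4.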
for col in ["start_station_id", "end_station_id"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# Categorical fields (lower memory usage)
for col in ["rideable_type", "member_casual"]:
    if col in df.columns:
        df[col] = df[col].astype("category")

# Coordinates → float32 (lower memory usage)
for col in ["start_lat", "start_lng", "end_lat", "end_lng"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("float32")

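# Datetimes
# Note: the raw strings carry no time zone, so utc=True labels them as UTC
# as-is (no clock shift is applied).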
for col in ["started_at", "ended_at"]:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors="coerce", utc=True)
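Because `errors="coerce"` replaces unparseable values with missing ones without raising, it may be worth counting how many values each conversion loses. The sketch below compares a raw column with its converted version. `count_coerced` is a hypothetical helper, and the ID `"JC013"` is an invented non-numeric example, not a value taken from this dataset.

```python
import pandas as pd


def count_coerced(raw, converted):
    """Count entries that were present in `raw` but are missing in `converted`."""
    return int((raw.notna() & converted.isna()).sum())


# Toy column: two numeric-looking IDs, one made-up non-numeric ID, one true gap
raw_ids = pd.Series(["5578.02", "JC013", None, "4042.08"])
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

lost = count_coerced(raw_ids, numeric_ids)
print(f"Values lost to coercion: {lost}")
```

In the notebook this would mean keeping a copy of the raw column before converting it. If the count is not zero for station IDs, storing them as strings instead of numbers would avoid losing those rows in the next section.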

## 4. Remove rows with missing station IDs

In [133]:
df = df.dropna(subset=["start_station_id", "end_station_id"])
df.isnull().sum()

ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    0
start_station_id      0
end_station_name      0
end_station_id        0
start_lat             0
start_lng             0
end_lat               0
end_lng               0
member_casual         0
dtype: int64
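Dropping rows changes the sample size, so when this step is repeated for other months it may help to record how many rows are removed. A minimal sketch follows, with `drop_missing` as a hypothetical helper name.

```python
import pandas as pd


def drop_missing(df, subset):
    """Drop rows with missing values in `subset`.

    Returns the cleaned frame and the number of rows that were removed.
    """
    cleaned = df.dropna(subset=subset)
    return cleaned, len(df) - len(cleaned)
```

For example, `df, n_dropped = drop_missing(df, ["start_station_id", "end_station_id"])` keeps the behavior of the cell above while making the loss visible.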

## 5. Check for a one-to-one mapping between `*_station_id` and `*_station_name`

In [125]:
start_id_name_counts = (
    df.groupby("start_station_id")["start_station_name"]
      .nunique()
)

end_id_name_counts = (
    df.groupby("end_station_id")["end_station_name"]
      .nunique()
)

conflicting_start_ids = start_id_name_counts[start_id_name_counts > 1].index.tolist()
conflicting_end_ids = end_id_name_counts[end_id_name_counts > 1].index.tolist()

print(f"Conflicting start station IDs: {conflicting_start_ids}")
print(f"Conflicting end station IDs: {conflicting_end_ids}")

Conflicting start station IDs: [3919.07]
Conflicting end station IDs: [3919.07]
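The check above reports which IDs are ambiguous, but not which names they carry. Listing the competing names with their frequencies could help decide which one to keep. The helper below is a sketch under that assumption; `names_for_conflicting_ids` is our own hypothetical name.

```python
import pandas as pd


def names_for_conflicting_ids(df, id_col, name_col):
    """List every (ID, name) pair, with its row count, for IDs that have more than one name."""
    counts = df.groupby([id_col, name_col]).size().reset_index(name="n")
    # Each (ID, name) pair appears once in `counts`, so an ID that occurs on
    # more than one row has more than one distinct name.
    conflicting = counts.duplicated(subset=id_col, keep=False)
    return counts[conflicting].sort_values([id_col, "n"], ascending=[True, False])
```

Calling `names_for_conflicting_ids(df, "start_station_id", "start_station_name")` would show the names attached to station 3919.07 and how often each occurs.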


To build a clean one-to-one mapping, count how many times each (ID, name) pair appears and keep the most common name for each ID:

In [127]:
# Most common name per start_station_id
start_map = (
    df.groupby(["start_station_id", "start_station_name"])
        .size()
        .reset_index(name="n")
        .sort_values(["start_station_id", "n"], ascending=[True, False])
        .drop_duplicates(subset=["start_station_id"])
        .set_index("start_station_id")["start_station_name"]
)

# Most common name per end_station_id
end_map = (
    df.groupby(["end_station_id", "end_station_name"])
        .size()
        .reset_index(name="n")
        .sort_values(["end_station_id", "n"], ascending=[True, False])
        .drop_duplicates(subset=["end_station_id"])
        .set_index("end_station_id")["end_station_name"]
)

# Apply names
df["start_station_name"] = df["start_station_id"].map(start_map)
df["end_station_name"]   = df["end_station_id"].map(end_map)
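One point to keep in mind: `start_map` and `end_map` are built independently, so the same station ID could in principle receive a different name on the start side than on the end side. A possible alternative, sketched below, pools the start and end pairs and builds a single map. `build_station_name_map` and the pooled column names `station_id` and `station_name` are assumptions of ours.

```python
import pandas as pd


def build_station_name_map(df):
    """Return one ID -> most common name Series, using start and end columns together."""
    starts = df[["start_station_id", "start_station_name"]].rename(
        columns={"start_station_id": "station_id", "start_station_name": "station_name"}
    )
    ends = df[["end_station_id", "end_station_name"]].rename(
        columns={"end_station_id": "station_id", "end_station_name": "station_name"}
    )
    pairs = pd.concat([starts, ends], ignore_index=True)
    return (
        pairs.groupby(["station_id", "station_name"])
        .size()
        .reset_index(name="n")
        .sort_values(["station_id", "n"], ascending=[True, False])
        .drop_duplicates(subset=["station_id"])  # keep the most frequent name per ID
        .set_index("station_id")["station_name"]
    )
```

The resulting Series could then be applied to both sides with `df["start_station_id"].map(name_map)` and `df["end_station_id"].map(name_map)`.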

In [128]:
df.isnull().sum()

ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    0
start_station_id      0
end_station_name      0
end_station_id        0
start_lat             0
start_lng             0
end_lat               0
end_lng               0
member_casual         0
dtype: int64
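The absence of missing values confirms that every ID received a name, but it does not by itself confirm that the mapping is one-to-one. The earlier check only looks at names per ID, while a strict one-to-one mapping also requires a single ID per name. The sketch below checks both directions; `mapping_violations` is a hypothetical helper.

```python
import pandas as pd


def mapping_violations(df, id_col, name_col):
    """Return (IDs with several names, names with several IDs)."""
    names_per_id = df.groupby(id_col)[name_col].nunique()
    ids_per_name = df.groupby(name_col)[id_col].nunique()
    return (
        names_per_id[names_per_id > 1].index.tolist(),
        ids_per_name[ids_per_name > 1].index.tolist(),
    )
```

After the renaming step, `mapping_violations(df, "start_station_id", "start_station_name")` should return an empty first list. A non-empty second list would point to distinct IDs that share a name.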