Documentation for data cleaning

1. tripInfo data
2. stationInfo data

# Data Cleaning Strategy

---

## 1. Tripdata Dataset: Handling NaN Values

Rows with missing values in the Tripdata dataset **can be deleted**, but only if imputation is not possible.

### 1.1 NaN `duration`
- Calculate using: `end_time - start_time`
- If not possible: delete the row

### 1.2 NaN `start_time`
- Calculate using: `end_time - duration`
- If not possible: delete the row

### 1.3 NaN `end_time`
- Calculate using: `start_time + duration`
- If not possible: delete the row
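
Rules 1.1 to 1.3 are the same identity, `end_time = start_time + duration`, solved for whichever field is missing. A minimal sketch on a toy DataFrame, assuming `duration` is stored in minutes; the helper name `fill_time_fields` and the sample rows are hypothetical and only illustrate the rules:

```python
import pandas as pd

def fill_time_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Fill a missing duration/start_time/end_time from the other two.

    Assumes duration is in minutes. Rows missing two or more of the
    three fields cannot be recovered and are dropped.
    """
    out = df.copy()
    minutes = pd.to_timedelta(out["duration"], unit="m")
    # 1.1 duration = end_time - start_time
    out["duration"] = out["duration"].fillna(
        (out["end_time"] - out["start_time"]).dt.total_seconds() / 60
    )
    # 1.2 start_time = end_time - duration
    out["start_time"] = out["start_time"].fillna(out["end_time"] - minutes)
    # 1.3 end_time = start_time + duration
    out["end_time"] = out["end_time"].fillna(out["start_time"] + minutes)
    return out.dropna(subset=["duration", "start_time", "end_time"])

# Toy data: one complete row, one row per single missing field,
# and one row with two missing fields (not recoverable)
trips = pd.DataFrame({
    "duration": [5.0, None, 7.0, 9.0, None],
    "start_time": pd.to_datetime(
        ["2025-01-01 00:12", "2025-01-01 00:12", None, "2025-01-01 00:30", None]
    ),
    "end_time": pd.to_datetime(
        ["2025-01-01 00:17", "2025-01-01 00:19", "2025-01-01 00:19", None, "2025-01-01 00:50"]
    ),
})

filled = fill_time_fields(trips)
print(filled)
```

The last toy row has neither `duration` nor `start_time`, so it falls under "If not possible: delete the row".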

### 1.4 NaN `start_lat`
- Replace via lookup in StationInfo using `start_station`
- If lookup fails: delete the row

### 1.5 NaN `start_lon`
- Replace via lookup in StationInfo using `start_station`
- If lookup fails: delete the row

### 1.6 NaN `start_station`
- Try to infer from coordinates or other fields
- If not possible: delete the row
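
One way to infer a missing station from coordinates is a nearest-neighbour lookup against StationInfo. The sketch below is an assumption about how this could be done, not a step the cleaning code performs: the helper `nearest_station` and the 0.1 km tolerance `max_km` are hypothetical, while the column names `Kiosk ID`, `Latitude` and `Longitude` are those of the station file.

```python
import numpy as np
import pandas as pd

def nearest_station(lat, lon, stations: pd.DataFrame, max_km: float = 0.1):
    """Return the 'Kiosk ID' of the closest station, or None if none is within max_km.

    Uses the haversine great-circle distance. max_km is an assumed
    tolerance, not a value taken from the dataset.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2 = np.radians(stations["Latitude"].to_numpy())
    lon2 = np.radians(stations["Longitude"].to_numpy())
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius
    i = int(np.argmin(dist_km))
    return stations["Kiosk ID"].iloc[i] if dist_km[i] <= max_km else None

# Three stations copied from the station file for illustration
stations = pd.DataFrame({
    "Kiosk ID":  [3005, 3006, 3007],
    "Latitude":  [34.048500, 34.045540, 34.050480],
    "Longitude": [-118.258537, -118.256668, -118.254593],
})

match = nearest_station(34.04555, -118.25667, stations)   # a point next to station 3006
no_match = nearest_station(34.2, -118.5, stations)        # far from every station
print(match, no_match)
```

A tolerance is needed so that a trip starting far from any dock is not silently assigned to the closest station; rows without a match would then be deleted as described above.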

### 1.7 NaN `end_lat`, `end_lon`, `end_station`
- Same approach as for start coordinates and station
- Use StationInfo to fill; delete if not possible

### 1.8 NaN `bike_type`
- Label as `"unknown_device"`
- Optionally try to infer from statistical patterns or other features

---

## 2. StationInfo Dataset: Handling NaN Values

**Do not delete any rows because of missing values!** Use imputation or labeling instead. Duplicate station IDs are handled separately (see 2.1).

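### 2.1 `Station ID` (column `Kiosk ID`)
- Check for duplicates
- If duplicates exist: merge or correct as appropriate
- Keep in mind that there is a virtual station for employees (`Kiosk ID` 3000, "Virtual Station"), whose coordinates are recorded as `0.0, 0.0`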
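
The duplicate check can be sketched with `DataFrame.duplicated`. The toy rows below are hypothetical (the second `3005` entry is invented to create a duplicate); only the column names come from the station file.

```python
import pandas as pd

stations = pd.DataFrame({
    "Kiosk ID":   [3000, 3005, 3005, 3006],
    "Kiosk Name": ["Virtual Station", "7th & Flower", "7th and Flower", "Olive & 8th"],
})

# keep=False marks every member of a duplicate group, not just the repeats,
# so conflicting entries can be compared side by side before deciding
dupes = stations[stations.duplicated(subset=["Kiosk ID"], keep=False)]
print(dupes)

# Simplest resolution: the first entry per ID wins
deduped = stations.drop_duplicates(subset=["Kiosk ID"], keep="first")
print(deduped)
```

Inspecting `dupes` first shows whether the entries actually conflict; "first entry wins" is only safe when they do not.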

### 2.2 NaN `Region`
- Label as `"Unknown"`
- Optionally use reverse geocoding (via coordinates) to infer region

### 2.3 NaN `Station Name` (column `Kiosk Name`)
- Label as `"Unnamed Station"`

### 2.4 NaN `Status`
- Label as `"Unknown"` (alternatively `"Inactive"`); the cleaning code uses `"Unknown"`

---

## 3. Additional Measures


- **Sample Checks**: Visually inspect random rows to ensure data quality
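
A sample check can be made repeatable by fixing the random seed. A minimal sketch; the helper name `sample_rows` and the toy DataFrame are hypothetical:

```python
import pandas as pd

def sample_rows(df: pd.DataFrame, n: int = 5, seed: int = 42) -> pd.DataFrame:
    """Return up to n random rows; a fixed seed makes the check repeatable."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Toy stand-in for the trip data
trips = pd.DataFrame({"trip_id": range(100), "duration": range(100)})

first = sample_rows(trips)
second = sample_rows(trips)   # same seed, same rows
print(first)
```

In the notebook the same call could be applied to `tripData`, or to a filtered view such as `tripData[tripData['end_coord_imputed']]` to inspect only rows whose coordinates were imputed.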


In [None]:
import pandas as pd

# 1) Read the input files
stationInfoFile = '../data/metro-bike-share-stations-2025-04-01.csv'
tripInfoFile    = '../data/metro-trips-2025-q1.csv'

stationData = pd.read_csv(stationInfoFile, encoding='latin-1')
tripData    = pd.read_csv(tripInfoFile,    encoding='latin-1')


# 2) Clean column names (strip surrounding whitespace)
stationData.columns = stationData.columns.str.strip()
tripData.columns    = tripData.columns.str.strip()

# 3) TripData: parse date fields (MM/DD/YYYY HH:MM); unparseable values become NaT
tripData['start_time'] = pd.to_datetime(
    tripData['start_time'],
    format='%m/%d/%Y %H:%M',
    errors='coerce'
)
tripData['end_time'] = pd.to_datetime(
    tripData['end_time'],
    format='%m/%d/%Y %H:%M',
    errors='coerce'
)

# 4) Initialize imputation flags
for col in [
    'duration_imputed',
    'start_time_imputed',
    'end_time_imputed',
    'start_coord_imputed',
    'end_coord_imputed',
    'bike_type_imputed'
]:
    tripData[col] = False

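# 5) Impute missing duration (minutes) as end_time - start_time
mask = tripData['duration'].isna()
tripData.loc[mask, 'duration'] = (
    (tripData.loc[mask, 'end_time'] - tripData.loc[mask, 'start_time'])
    .dt.total_seconds() / 60
)
# Flag only rows where the calculation actually produced a value
tripData.loc[mask, 'duration_imputed'] = tripData.loc[mask, 'duration'].notna()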

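# 6) Impute missing start_time as end_time - duration (minutes)
mask = tripData['start_time'].isna()
tripData.loc[mask, 'start_time'] = (
    tripData.loc[mask, 'end_time']
    - pd.to_timedelta(tripData.loc[mask, 'duration'], unit='m')
)
# Flag only rows where the calculation actually produced a value
tripData.loc[mask, 'start_time_imputed'] = tripData.loc[mask, 'start_time'].notna()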

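# 7) Impute missing end_time as start_time + duration (minutes)
mask = tripData['end_time'].isna()
tripData.loc[mask, 'end_time'] = (
    tripData.loc[mask, 'start_time']
    + pd.to_timedelta(tripData.loc[mask, 'duration'], unit='m')
)
# Flag only rows where the calculation actually produced a value
tripData.loc[mask, 'end_time_imputed'] = tripData.loc[mask, 'end_time'].notna()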

# 8) Helper: fill missing coordinates via a StationInfo lookup on 'Kiosk ID'
def impute_coords(df, station_df, id_col, lat_col, lon_col, flag_col):
    # One row per station, so the lookup cannot duplicate trip rows
    stations = station_df.drop_duplicates(subset=['Kiosk ID']).set_index('Kiosk ID')
    mask = df[[lat_col, lon_col]].isna().any(axis=1)
    lat = df.loc[mask, id_col].map(stations['Latitude'])
    lon = df.loc[mask, id_col].map(stations['Longitude'])
    df.loc[mask, lat_col] = lat
    df.loc[mask, lon_col] = lon
    # Flag only rows where the lookup succeeded
    df.loc[mask, flag_col] = lat.notna() & lon.notna()
    return df

# 9) start_lat/lon → StationInfo Lookup
tripData = impute_coords(
    tripData, stationData,
    id_col='start_station',
    lat_col='start_lat',
    lon_col='start_lon',
    flag_col='start_coord_imputed'
)

# 10) end_lat/lon → StationInfo Lookup
tripData = impute_coords(
    tripData, stationData,
    id_col='end_station',
    lat_col='end_lat',
    lon_col='end_lon',
    flag_col='end_coord_imputed'
)

# 11) Missing bike_type → "unknown_device"
mask = tripData['bike_type'].isna()
tripData.loc[mask, 'bike_type'] = 'unknown_device'
tripData.loc[mask, 'bike_type_imputed'] = True

# 12) Drop only the rows that still have NaN in an essential column
essential = [
    'duration','start_time','end_time',
    'start_station','end_station',
    'start_lat','start_lon','end_lat','end_lon'
]
tripData = tripData.dropna(subset=essential)

# 13) Clean stationData: drop duplicate Kiosk IDs (first entry wins), then label missing values
stationData = stationData.drop_duplicates(subset=['Kiosk ID'])
stationData['Region']       = stationData['Region'].fillna('Unknown')
stationData['Kiosk Name']   = stationData['Kiosk Name'].fillna('Unnamed Station')
stationData['Status']       = stationData['Status'].fillna('Unknown')
stationData['status2']      = stationData['status2'].fillna('Unknown')

# 14) Print a summary
print("Cleaned tripData shape:", tripData.shape)
print("Imputation flags count:\n",
      tripData[['duration_imputed','start_time_imputed','end_time_imputed',
                'start_coord_imputed','end_coord_imputed','bike_type_imputed']].sum())
print("Cleaned stationData shape:", stationData.shape)


Cleaned tripData shape: (95916, 21)
Imputation flags count:
 duration_imputed          0
start_time_imputed        0
end_time_imputed          0
start_coord_imputed      19
end_coord_imputed      2506
bike_type_imputed         0
dtype: int64
Cleaned stationData shape: (429, 8)


In [15]:
stationData


Unnamed: 0,Kiosk ID,Kiosk Name,Go Live Date,Region,Status,Latitude,Longitude,status2
0,3000,Virtual Station,7/7/2016,Unknown,Active,0.000000,0.000000,#REF!
1,3005,7th & Flower,7/7/2016,DTLA,Active,34.048500,-118.258537,#REF!
2,3006,Olive & 8th,7/7/2016,DTLA,Active,34.045540,-118.256668,#REF!
3,3007,5th & Grand,7/7/2016,DTLA,Active,34.050480,-118.254593,#REF!
4,3008,Figueroa & 9th,7/7/2016,DTLA,Active,34.046612,-118.262733,#REF!
...,...,...,...,...,...,...,...,...
424,4687,Evergreen Hub - cicLAvia Heart of LA 2024,10/13/2024,DTLA,Inactive,34.043671,-118.200844,#REF!
425,4689,University Village (Sepulveda Blvd),10/9/2024,Westside,Active,34.023849,-118.426071,#REF!
426,4690,Vineland & Burbank,12/19/2024,North Hollywood,Active,34.172451,-118.370369,#REF!
427,4691,CicLAvia West Adams Meets University Park - We...,2/23/2025,Westside,Inactive,34.025520,-118.353104,#REF!


In [17]:
tripData

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,...,plan_duration,trip_route_category,passholder_type,bike_type,duration_imputed,start_time_imputed,end_time_imputed,start_coord_imputed,end_coord_imputed,bike_type_imputed
0,475609834,5,2025-01-01 00:12:00,2025-01-01 00:17:00,3030,34.051941,-118.243530,4491,34.047440,-118.247940,...,30,One Way,Monthly Pass,standard,False,False,False,False,False,False
1,475609846,7,2025-01-01 00:12:00,2025-01-01 00:19:00,4558,34.025688,-118.395302,4569,34.026550,-118.408463,...,30,One Way,Monthly Pass,electric,False,False,False,False,False,False
2,475609903,11,2025-01-01 00:13:00,2025-01-01 00:24:00,4212,33.988129,-118.471741,4206,33.998341,-118.461014,...,30,One Way,Monthly Pass,standard,False,False,False,False,False,False
3,475609904,11,2025-01-01 00:13:00,2025-01-01 00:24:00,4212,33.988129,-118.471741,4206,33.998341,-118.461014,...,30,One Way,Monthly Pass,electric,False,False,False,False,False,False
4,475610048,13,2025-01-01 00:27:00,2025-01-01 00:40:00,4472,34.092602,-118.280930,4509,34.101639,-118.309174,...,30,One Way,Monthly Pass,standard,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95911,497792778,13,2025-03-31 23:43:00,2025-03-31 23:56:00,3022,34.046070,-118.233093,4680,34.043522,-118.255089,...,30,One Way,Monthly Pass,standard,False,False,False,False,False,False
95912,497792596,7,2025-03-31 23:44:00,2025-03-31 23:51:00,4643,34.072620,-118.449440,4528,34.060970,-118.444366,...,30,One Way,Monthly Pass,electric,False,False,False,False,False,False
95913,497792744,6,2025-03-31 23:49:00,2025-03-31 23:55:00,3026,34.063179,-118.245880,4445,34.073639,-118.251572,...,30,One Way,Monthly Pass,electric,False,False,False,False,False,False
95914,497826163,623,2025-03-31 23:56:00,2025-04-01 10:19:00,3035,34.048401,-118.260948,3077,34.039871,-118.250038,...,365,One Way,Annual Pass,standard,False,False,False,False,False,False


In [18]:
# 1) Define the columns to remove
trip_drop = [
    'trip_id',
    'plan_duration',
    'passholder_type',
    'trip_route_category',
    'duration_imputed',
    'start_time_imputed',
    'end_time_imputed',
    'start_coord_imputed',
    'end_coord_imputed',
    'bike_type_imputed'
]

station_drop = [
    'Go Live Date',
    'status2'
]

# 2) New DataFrames without these columns
cleaned_trip    = tripData.drop(columns=trip_drop)
cleaned_station = stationData.drop(columns=station_drop)

# 3) Save as CSV
cleaned_trip.to_csv('cleaned_trip_data.csv', index=False)
cleaned_station.to_csv('cleaned_station_data.csv', index=False)

print("→ cleaned_trip_data.csv and cleaned_station_data.csv were created.")

→ cleaned_trip_data.csv and cleaned_station_data.csv were created.
