- This Jupyter notebook processes CitiBike station data obtained from www.theopenbus.com/raw-data
- The dataset covers March 2015 to April 2019
- The original data are tab-delimited monthly files with inconsistent column counts and names.
- The monthly files were first merged into one file per year on the command line by running: cat *.csv > merged.csv
- The function <span style="color:blue">*process_stationdata*</span> below processes the station data for a given year by splitting each line on the tab delimiter and applying consistent column names.
- The cells after it combine the data from all years into a single stations.csv data file.
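
The splitting step described above can be sketched on a small in-memory sample. The three-column sample and its values are made up for illustration; the real files have more columns. Because the file is read with the default comma separator, each tab-delimited line lands in a single column, which is then split on tabs.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for one merged yearly file.
raw = (
    "dock_id\tdock_name\tavail_bikes\n"
    "72\tW 52 St & 11 Ave\t5\n"
    "79\tFranklin St & W Broadway\t12\n"
)

# Read with the default comma separator: each line becomes one wide column.
df = pd.read_csv(io.StringIO(raw))

# The single header cell still holds every column name, joined by tabs.
colnames = df.columns[0].split("\t")

# Split each row on tabs and attach the recovered names.
df_expand = df.iloc[:, 0].str.split("\t", expand=True)
df_expand.columns = colnames

print(df_expand)
```

All values come out as strings after the split, so numeric columns would still need converting before any arithmetic.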

In [1]:
import pandas as pd

# Function to process station data for a given year
def process_stationdata(year):
    """Process the merged CitiBike station data for one year.

    The source data came from www.theopenbus.com/raw-data and covers March 2015
    to April 2019. The original monthly files are tab-delimited with inconsistent
    column counts and names, and were merged per year on the command line with
    `cat *.csv > merged.csv`.

    This function reads the merged file for `year`, splits each line on tabs,
    keeps the first 13 columns under consistent names, and writes the result to
    stations{year}.csv.
    """
    df = pd.read_csv(f"../data/stationdata/merged{year}.csv")

    # Generate the list of columns to be used
    colnames = df.columns[0].split('\t')
    df.columns = ['_']

    
    # Split dataframe into multiple columns and save that to a new temporary dataframe
    df_expand = df['_'].str.split("\t", expand = True)

    
    # Drop any unnecessary columns created by the splitting
    if df_expand.shape[1]>13:
        df_expand.drop(list(range(13, df_expand.shape[1])), axis = 1, inplace = True)
    
    
    # Assign the column names recovered from the header line
    df_expand.columns = colnames[0:13]
    
    
    # Save as new csv file
    df_expand.to_csv(f"../data/stationdata/stations{year}.csv", index = False)    
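
Merging files with `cat` copies every file's header line into the merged file, so header rows from the later monthly files may end up as data rows. The sketch below shows one way such rows could be dropped after splitting. It assumes the repeated headers use the same name for the first column, which may not hold given the inconsistent column names; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame after splitting: a second file's header row appears as data.
df_expand = pd.DataFrame(
    [["72", "5"], ["dock_id", "avail_bikes"], ["79", "12"]],
    columns=["dock_id", "avail_bikes"],
)

# A repeated header row carries the column name as its own value.
is_header = df_expand["dock_id"] == "dock_id"

# Keep only the real data rows and renumber them.
cleaned = df_expand[~is_header].reset_index(drop=True)

print(cleaned)
```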


In [9]:
# Running this cell takes approximately 15+ minutes
for year in range(2015, 2020):
    process_stationdata(year)

In [14]:
# Read in the processed datafiles
stations2015 = pd.read_csv("data/stationdata/stations2015.csv")
stations2016 = pd.read_csv("data/stationdata/stations2016.csv")
stations2017 = pd.read_csv("data/stationdata/stations2017.csv")
stations2018 = pd.read_csv("data/stationdata/stations2018.csv")
stations2019 = pd.read_csv("data/stationdata/stations2019.csv")

# Concatenate them into a single dataframe
stations = pd.concat([stations2015, stations2016, stations2017, stations2018, stations2019])


# Write the concatenated dataframe to csv
stations.to_csv("data/stationdata/stations.csv", index = False)
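
By default `pd.concat` keeps each input frame's own row labels, so labels repeat across years in the combined frame. Whether that matters depends on how the frame is used later. The sketch below uses two small made-up frames to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Hypothetical stand-ins for two processed yearly files.
stations_a = pd.DataFrame({"dock_id": [72, 79], "avail_bikes": [5, 12]})
stations_b = pd.DataFrame({"dock_id": [72, 82], "avail_bikes": [7, 3]})

# Default: each frame keeps its own row labels, so labels repeat.
kept = pd.concat([stations_a, stations_b])

# ignore_index=True renumbers the rows 0..n-1 across the combined frame.
renumbered = pd.concat([stations_a, stations_b], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```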



In [16]:
stations.tail(5)

Unnamed: 0,dock_id,dock_name,date,hour,minute,pm,avail_bikes,avail_docks,tot_docks,_lat,_long,in_service,status_key
1671950,3534,"""Frederick Douglass Blvd & W 117 St""","""19-01-31""",9,28,1,18,21,39,40.8052,-73.9547,1,1
1671951,3534,"""Frederick Douglass Blvd & W 117 St""","""19-01-31""",10,7,1,18,21,39,40.8052,-73.9547,1,1
1671952,3534,"""Frederick Douglass Blvd & W 117 St""","""19-01-31""",10,44,1,18,21,39,40.8052,-73.9547,1,1
1671953,3534,"""Frederick Douglass Blvd & W 117 St""","""19-01-31""",11,32,1,19,20,39,40.8052,-73.9547,1,1
1671954,3534,"""Frederick Douglass Blvd & W 117 St""","""19-01-31""",12,15,1,13,26,39,40.8052,-73.9547,1,1
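
The `dock_name` and `date` values in the output above appear to carry literal double quotes. A minimal sketch of stripping them and parsing the date follows. The sample frame is hypothetical, and the `%y-%m-%d` format (two-digit year, month, day) is an assumption based on the values shown.

```python
import pandas as pd

# Hypothetical row mimicking the quoted values seen in the output above.
sample = pd.DataFrame({
    "dock_name": ['"Frederick Douglass Blvd & W 117 St"'],
    "date": ['"19-01-31"'],
})

# Remove leading and trailing double quotes from both text columns.
for col in ["dock_name", "date"]:
    sample[col] = sample[col].str.strip('"')

# Assumption: dates are stored as two-digit year, month, day.
sample["date"] = pd.to_datetime(sample["date"], format="%y-%m-%d")

print(sample)
```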


In [17]:
# Number of distinct stations (unique dock IDs)
stations['dock_id'].nunique()

1469