# Description and Goal

This project cleans, explores (EDA), and models the BlueBikes bicycle sharing system data from 20xx to 20xx, available [here](https://bluebikes.com/system-data).



| Column (From website)      |
| ----------- |
| Trip Duration (seconds) |
| Start Time and Date |
| Stop Time and Date |
| Start Station Name & ID |
| End Station Name & ID |
| Bike ID |
| User Type (Casual = Single Trip or Day Pass user; Member = Annual or Monthly Member) |









# Load and Clean Data

Step 1: Preprocessing (in Python/Pandas)

Let's load the data. The website provides many files, so we need to standardize their column names and concatenate them into one CSV file:
- Loop through CSVs, inspect column names, standardize them.
- Concatenate all into one big DataFrame.
- Clean data (e.g., fix datetime parsing, column types, missing values).
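
The three steps above can be sketched roughly as follows. This is a minimal sketch rather than the final pipeline: the in-memory CSV strings stand in for the downloaded files (their contents are made up), and `normalize_columns` and `load_all` are hypothetical helper names.

```python
import io

import pandas as pd


def normalize_columns(df):
    """Lowercase column names and replace spaces with underscores."""
    df = df.copy()
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    return df


def load_all(sources):
    """Read each CSV source, normalize its column names, and stack the rows."""
    frames = [normalize_columns(pd.read_csv(src)) for src in sources]
    return pd.concat(frames, ignore_index=True)


# Stand-in data (made up for illustration); real code would pass file paths.
file_a = io.StringIO("Start Station ID,bikeid\n115,6680\n")
file_b = io.StringIO("start station id,bikeid\n96,6681\n")

trips = load_all([file_a, file_b])
print(trips.columns.tolist())
print(len(trips))
```

Normalizing case and spacing only helps when the same name is written differently; a column that was genuinely renamed between files would still need an explicit mapping.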

In [56]:
import pandas as pd

In [None]:
start_path = "/Users/ellawang/Documents/GitHub/bike_csv_files/"
old_end_path = "-hubway-tripdata.csv"
new_end_path = "-bluebikes-tripdata.csv"


def monthly_files(year, months, suffix):
    """Return file names such as '201501-hubway-tripdata.csv' for the given months."""
    return [f"{year}{month:02d}{suffix}" for month in months]


pathways = []
# 201501 through 201804 use the hubway suffix
# note: for 201801-201803 I had to manually replace _ with - in the file names
for year in range(2015, 2018):
    pathways += monthly_files(year, range(1, 13), old_end_path)
pathways += monthly_files(2018, range(1, 5), old_end_path)
# from 201805 onward the suffix changes from hubway to bluebikes
pathways += monthly_files(2018, range(5, 13), new_end_path)
for year in range(2019, 2025):
    pathways += monthly_files(year, range(1, 13), new_end_path)
# 2025 data is available through June
pathways += monthly_files(2025, range(1, 7), new_end_path)

In [62]:
# give us a peek into the columns and formats/datatypes of each file
num_total_rows = 0
col_count = {}
for path in pathways:
    df = pd.read_csv(start_path + path)
    print(f'{path}: {df.columns} : {df.shape[0]} rows')
    num_total_rows += df.shape[0]
    print(df.iloc[0])
    for col in df.columns:
        if col not in col_count:
            col_count[col] = 0
        col_count[col] += 1

print(col_count)
print(f'Num total rows: {num_total_rows}')
# output saved to "output.txt" so this cell doesn't have to be re-run

print('done')

201501-hubway-tripdata.csv: Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender'],
      dtype='object') : 7840 rows
tripduration                                                             542
starttime                                                2015-01-01 00:21:44
stoptime                                                 2015-01-01 00:30:47
start station id                                                         115
start station name                                     Porter Square Station
start station latitude                                             42.387995
start station longitude                                           -71.119084
end station id                                                            96
end station name 

We see inconsistent naming conventions. I went through the output of the print statements (one row from each file) to see which columns are shared across files, which of those were renamed or reformatted, and which columns were dropped or added at a certain point.
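
Since only the column names matter for this comparison, one option is to read just the header of each file and group together the files that share the same columns. A minimal sketch, assuming small stand-in files written to a temporary directory (their contents are made up, and `group_files_by_columns` is a hypothetical helper):

```python
import tempfile
from collections import defaultdict
from pathlib import Path

import pandas as pd


def group_files_by_columns(paths):
    """Map each distinct tuple of column names to the files that use it."""
    groups = defaultdict(list)
    for path in paths:
        # nrows=0 parses only the header row, so large files are not loaded
        header = pd.read_csv(path, nrows=0)
        groups[tuple(header.columns)].append(Path(path).name)
    return dict(groups)


# Tiny stand-in files so the sketch runs on its own
samples = {
    "201501-hubway-tripdata.csv": "tripduration,starttime\n542,2015-01-01 00:21:44\n",
    "201502-hubway-tripdata.csv": "tripduration,starttime\n300,2015-02-01 08:00:00\n",
    "202304-bluebikes-tripdata.csv": "ride_id,started_at\n0093AA5E7E3E0158,2023-04-13 13:49:59\n",
}
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples.items():
        file_path = Path(tmp) / name
        file_path.write_text(text)
        paths.append(file_path)
    schema_groups = group_files_by_columns(paths)

for columns, files in schema_groups.items():
    print(columns, "->", files)
```

Grouping by the full tuple of names makes the points where the schema changes show up as group boundaries, without scanning any data rows.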

Column names in 99 files (201501 through 202303, inclusive)
- 'tripduration': 99 (ends in 202303) (e.g. 1105) -- DROPPING tentatively?
- 'bikeid': 99 (ends in 202303) (e.g. 6680) -- DROPPING tentatively?
- 'starttime': 99 (turns into started_at beg. 202304) (e.g. 2023-03-01 00:00:44.1520 --> 2023-04-13 13:49:59)
- 'stoptime': 99 (turns into ended_at beg. 202304)
- 'start station id': 99 (turns into start_station_id beg. 202304) (e.g. 386 --> A32011)
- 'start station name': 99 (turns into start_station_name beg. 202304) (e.g. Central Square at Mass Ave / Essex St --> format seems to stay the same!)
- 'start station latitude': 99 (turns into start_lat beg. 202304) (e.g. 42.368605 --> 42.363713, format stays the same!)
- 'start station longitude': 99 (turns into start_lng beg. 202304) (same)
- 'end station id': 99 (turns into end_station_id beg. 202304) (e.g. 386 --> A32011, same change as start station id)
- 'end station name': 99 (turns into end_station_name beg. 202304) (same)
- 'end station latitude': 99 (turns into end_lat beg. 202304) (same)
- 'end station longitude': 99 (turns into end_lng beg. 202304) (same)
- 'usertype': 99 (turns into member_casual beg. 202304) (e.g. Customer or Subscriber --> casual or member)

Column names in 64 files (201501 until 202004)
- 'birth year': 64 (e.g. 1984) -- DROPPING
- 'gender': 64 (e.g. 0 or 1 or 2) -- DROPPING

Column names in 35 files (202005 until 202303)
- 'postal code': 35 (e.g. 02118 or NaN) -- DROPPING

Column names in 27 files (202304 to 202506)
- 'ride_id': 27 (begins 202304) (e.g. 0093AA5E7E3E0158) -- DROPPING
- 'rideable_type': 27, (begins 202304) (e.g. docked_bike or classic_bike or electric_bike) -- DROPPING tentatively?
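
The renames noted above could be captured in a single mapping and applied to every file before concatenating. This is a sketch under assumptions: `LEGACY_TO_CURRENT` is a hypothetical name, the two one-row frames only reuse example values from the notes, and `start station id` is assumed to become `start_station_id` by analogy with `end_station_id`.

```python
import pandas as pd

# Legacy name (through 202303) -> current name (from 202304 onward).
# Assumption: 'start station id' maps to 'start_station_id', mirroring 'end_station_id'.
LEGACY_TO_CURRENT = {
    "starttime": "started_at",
    "stoptime": "ended_at",
    "start station id": "start_station_id",
    "start station name": "start_station_name",
    "start station latitude": "start_lat",
    "start station longitude": "start_lng",
    "end station id": "end_station_id",
    "end station name": "end_station_name",
    "end station latitude": "end_lat",
    "end station longitude": "end_lng",
    "usertype": "member_casual",
}

# One-row stand-ins for an old-schema file and a new-schema file
legacy = pd.DataFrame({
    "starttime": ["2023-03-01 00:00:44.1520"],
    "start station id": [386],
    "usertype": ["Subscriber"],
})
current = pd.DataFrame({
    "started_at": ["2023-04-13 13:49:59"],
    "start_station_id": ["A32011"],
    "member_casual": ["member"],
})

# rename() skips keys that are not present, so the same call is safe on both schemas
frames = [df.rename(columns=LEGACY_TO_CURRENT) for df in (legacy, current)]
combined = pd.concat(frames, ignore_index=True)
print(combined.columns.tolist())
```

Renaming alone does not reconcile the values: station IDs go from integers like 386 to strings like A32011, and the user type labels differ, so those columns still mix two formats after this step.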

Dropping columns: I will drop birth year, gender, and postal code since they are present in only about half of the files or fewer and are not the most informative. I will drop ride_id since it only distinguishes rides from each other, bikeid because I don't care much about the particular bike (not sure about this assumption, hm), and tripduration because it can be deduced from starttime and stoptime (I'll engineer a new column after this).

For now, I will drop rideable_type because it's only in 27 files. However, it is a meaningful variable for predicting other things, so I will do more research (maybe BlueBikes only started offering e-bikes in a certain year and before that there were only classic bikes; I also don't know the difference between a classic bike and a docked bike, so I will look into that later, but for now drop?).

In [None]:
# drop those columns
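
One way this cell might look, as a rough sketch rather than the final code. The one-row frame reuses the first-row values printed for 201501 (the birth year is just the example value from the notes), `DROP_COLS` and `duration_s` are hypothetical names, and the legacy column names are used because renaming has not happened yet.

```python
import pandas as pd

# Columns slated for removal in the notes above
DROP_COLS = [
    "birth year", "gender", "postal code",
    "ride_id", "rideable_type", "bikeid", "tripduration",
]

trips = pd.DataFrame({
    "tripduration": [542],
    "starttime": ["2015-01-01 00:21:44"],
    "stoptime": ["2015-01-01 00:30:47"],
    "birth year": [1984],
})

# Keep the recorded duration around for a sanity check before dropping it
recorded_duration = trips["tripduration"].copy()

# errors="ignore" skips columns a given file does not have
trips = trips.drop(columns=DROP_COLS, errors="ignore")

# Engineer the duration (in seconds) from the two timestamps
start = pd.to_datetime(trips["starttime"])
stop = pd.to_datetime(trips["stoptime"])
trips["duration_s"] = (stop - start).dt.total_seconds()

print(trips["duration_s"].iloc[0], recorded_duration.iloc[0])
```

For this row the timestamps differ by 543 seconds while the recorded tripduration is 542, so it may be worth comparing the two on the full data before relying on the derived column alone.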

Then I'll rename columns, standardize formatting, and use EDA to visualize the data, including missing values, before I decide how to go about filling in missing values.

In [None]:
# rename columns
# standardize formatting e.g. starttime
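
A possible shape for the formatting step, offered as a sketch under stated assumptions: every timestamp is assumed to begin with `YYYY-MM-DD HH:MM:SS` (fractional seconds are simply cut off), and the mapping Subscriber to member and Customer to casual is an assumption based on the user type description in the table at the top. `USER_TYPE_MAP` is a hypothetical name, and the frame already uses the post-202304 column names.

```python
import pandas as pd

trips = pd.DataFrame({
    "started_at": ["2023-03-01 00:00:44.1520", "2023-04-13 13:49:59", None],
    "member_casual": ["Subscriber", "member", "Customer"],
})

# Keep only the first 19 characters ("YYYY-MM-DD HH:MM:SS") so that timestamps
# with and without fractional seconds parse with one explicit format.
trips["started_at"] = pd.to_datetime(
    trips["started_at"].str.slice(0, 19),
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",  # unparseable or missing values become NaT
)

# Assumed mapping from the legacy labels to the current ones
USER_TYPE_MAP = {"Subscriber": "member", "Customer": "casual"}
trips["member_casual"] = trips["member_casual"].replace(USER_TYPE_MAP)

# Count missing values per column before deciding how to fill them
missing_per_column = trips.isna().sum()
print(missing_per_column)
```

Because `errors="coerce"` turns bad timestamps into `NaT`, the missing value counts taken afterwards include parsing failures as well as values that were empty in the source files.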

In [None]:
# eda techniques

# EDA