## Introduction

The data for this project comes from the historical data page of the Boston Blue Bikes [website](https://www.bluebikes.com/system-data). The files on that page follow several naming conventions that have changed over the past decade. These changes correspond to changes in the physical infrastructure (expansion of bike stations, etc.) and in the organizational structure (Hubway to Boston Blue Bikes).

We need to verify whether these naming conventions also correspond to changes in the file structures, so that we can adjust the dataframes before uploading to GCP.

## File types

There are two main subsets of data files available in this directory: trip data and station data. 

### Trip data

The trip data files follow three naming conventions:
- hubway_Trips_YYYY.zip
- YYYYMM-hubway-tripdata.zip
- YYYYMM-bluebikes-tripdata.zip
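
As a rough illustration, the three conventions above can be told apart with regular expressions. This is a minimal sketch: the labels and the `classify_trip_file` helper are hypothetical names introduced here, not part of the project code.

```python
import re
from typing import Optional

# Patterns mirror the three naming conventions listed above.
# The dictionary keys are made-up labels for illustration.
TRIP_PATTERNS = {
    "hubway_yearly": re.compile(r"^hubway_Trips_(\d{4})\.zip$"),
    "hubway_monthly": re.compile(r"^(\d{4})(\d{2})-hubway-tripdata\.zip$"),
    "bluebikes_monthly": re.compile(r"^(\d{4})(\d{2})-bluebikes-tripdata\.zip$"),
}


def classify_trip_file(filename: str) -> Optional[str]:
    """Return the convention label for a trip-data file name, or None."""
    for label, pattern in TRIP_PATTERNS.items():
        if pattern.match(filename):
            return label
    return None
```

A label like this could be used to decide which cleaning function a given file should go through.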

### Station data

The station data is split across four files, each with a different naming convention. Each file will be explored to verify its integrity.

## Trip Data

### Files named hubway_Trips_YYYY.zip

Each of these files holds a single csv file, which can be read directly with the pandas.read_csv() method. The cell below reads the 2011 file through its .csv URL.

In [43]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://s3.amazonaws.com/hubway-data/hubway_Trips_2011.csv")
df.head(10)



Unnamed: 0,Duration,Start date,End date,Start station number,Start station name,End station number,End station name,Bike number,Member type,Zip code,Gender
0,1712320,11/30/2011 23:58,12/1/2011 0:26,D32005,Boston Public Library - 700 Boylston St.,D32011,Stuart St. at Charles St.,B00056,Member,2116.0,Male
1,313200,11/30/2011 23:56,12/1/2011 0:01,C32008,Boylston at Fairfield,D32011,Stuart St. at Charles St.,B00133,Casual,,
2,1111430,11/30/2011 23:18,11/30/2011 23:36,A32009,Tremont St / W Newton St,D32006,Lewis Wharf - Atlantic Ave.,B00471,Member,2109.0,Male
3,1313487,11/30/2011 23:15,11/30/2011 23:37,A32001,Union Square - Brighton Ave. at Cambridge St.,D32005,Boston Public Library - 700 Boylston St.,B00056,Member,2116.0,Male
4,345115,11/30/2011 22:59,11/30/2011 23:05,B32008,Mayor Martin J. Walsh - 28 State St.,D32006,Lewis Wharf - Atlantic Ave.,B00174,Member,2109.0,Male
5,904843,11/30/2011 22:48,11/30/2011 23:03,A32004,Longwood Ave / Binney St,A32009,Tremont St / W Newton St,B00203,Member,2118.0,Male
6,266305,11/30/2011 22:42,11/30/2011 22:47,D32005,Boston Public Library - 700 Boylston St.,C32000,Tremont St. at Berkeley St.,B00431,Member,2118.0,Male
7,2065156,11/30/2011 22:40,11/30/2011 23:14,D32005,Boston Public Library - 700 Boylston St.,D32014,Tremont St / West St,B00028,Member,2114.0,Female
8,2093619,11/30/2011 22:39,11/30/2011 23:14,D32005,Boston Public Library - 700 Boylston St.,D32014,Tremont St / West St,B00394,Member,2111.0,Male
9,210672,11/30/2011 21:57,11/30/2011 22:00,D32011,Stuart St. at Charles St.,C32000,Tremont St. at Berkeley St.,B00381,Member,2116.0,Female


In [44]:
def clean_tripdata_old(df: pd.DataFrame) -> pd.DataFrame:
    """All columns need to be renamed and retyped to match newer trips datasets"""

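    # tripduration
    # NOTE: Duration in this file appears to be in milliseconds (the first
    # preview row spans about 28 minutes and records 1712320), so the cutoff
    # below keeps only trips shorter than 432 seconds.
    df["tripduration"] = df["Duration"]
    df["tripduration"] = pd.to_numeric(df["tripduration"], downcast="integer")
    df = df.drop(["Duration"], axis=1)
    # .copy() avoids SettingWithCopyWarning on the assignments that follow
    df = df[df["tripduration"] < 4.32e5].copy()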

    # starttime
    df["starttime"] = df["Start date"]
    df["starttime"] = pd.to_datetime(df["starttime"])
    df = df.drop(["Start date"], axis=1)

    # stoptime
    df["stoptime"] = df["End date"]
    df["stoptime"] = pd.to_datetime(df["stoptime"])
    df = df.drop(["End date"], axis=1)

    # start station id
    df["start station id"] = df["Start station number"]
    df = df.drop(["Start station number"], axis=1)

    # start station name
    df["start station name"] = df["Start station name"].str.lower()
    df = df.drop(["Start station name"], axis=1)

    # start station latitude
    df["start station latitude"] = np.nan

    # start station longitude
    df["start station longitude"] = np.nan

    # end station id
    df["end station id"] = df["End station number"]
    df = df.drop(["End station number"], axis=1)

    # end station name
    df["end station name"] = df["End station name"].str.lower()
    df = df.drop(["End station name"], axis=1)

    # end station latitude
    df["end station latitude"] = np.nan

    # end station longitude
    df["end station longitude"] = np.nan

    # bikeid
    df["bikeid"] = df["Bike number"]
    df = df.drop(["Bike number"], axis=1)

    # usertype
    df["usertype"] = df["Member type"].str.lower()
    df = df.drop(["Member type"], axis=1)

    # Zip code

    df["postal code"] = df["Zip code"].astype(str)
    df = df.drop(["Zip code"], axis=1)

    df["postal code"] = pd.to_numeric(
        df["postal code"], downcast="integer", errors="coerce"
    )

    # birth year
    df["birth year"] = np.nan

    # gender
    df["gender"] = df["Gender"]
    df.loc[(df["gender"] != "Male") & (df["gender"] != "Female"), "gender"] = 0
    df.loc[df["gender"] == "Male", "gender"] = 1
    df.loc[df["gender"] == "Female", "gender"] = 2
    df["gender"] = pd.to_numeric(df["gender"], downcast="float")
    df = df.drop(["Gender"], axis=1)

    return df

cleaned_df = clean_tripdata_old(df)
cleaned_df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,postal code,birth year,gender
1,313200,2011-11-30 23:56:00,2011-12-01 00:01:00,C32008,boylston at fairfield,,,D32011,stuart st. at charles st.,,,B00133,casual,,,0.0
4,345115,2011-11-30 22:59:00,2011-11-30 23:05:00,B32008,mayor martin j. walsh - 28 state st.,,,D32006,lewis wharf - atlantic ave.,,,B00174,member,2109.0,,1.0
6,266305,2011-11-30 22:42:00,2011-11-30 22:47:00,D32005,boston public library - 700 boylston st.,,,C32000,tremont st. at berkeley st.,,,B00431,member,2118.0,,1.0
9,210672,2011-11-30 21:57:00,2011-11-30 22:00:00,D32011,stuart st. at charles st.,,,C32000,tremont st. at berkeley st.,,,B00381,member,2116.0,,2.0
11,410857,2011-11-30 21:41:00,2011-11-30 21:47:00,A32010,south station - 700 atlantic ave.,,,D32017,the esplanade - beacon st. at arlington st.,,,B00241,member,2116.0,,1.0
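
The gender recoding in `clean_tripdata_old` (Male to 1, Female to 2, anything else to 0) can also be written as a single mapping. The sketch below is one possible alternative, not the notebook's method. `GENDER_CODES` and `encode_gender` are hypothetical names, and the `int8` result type is an assumption chosen to match the integer codes seen in the later files, where the original function produces a float column.

```python
import pandas as pd

# Codes taken from clean_tripdata_old: Male -> 1, Female -> 2, anything else -> 0.
GENDER_CODES = {"Male": 1, "Female": 2}


def encode_gender(gender: pd.Series) -> pd.Series:
    """Map gender labels to integer codes; unmapped or missing values become 0."""
    return gender.map(GENDER_CODES).fillna(0).astype("int8")


# Small made-up sample, including a missing value.
sample = pd.Series(["Male", "Female", None, "Male"])
encoded = encode_gender(sample)
```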


In [45]:
print(cleaned_df.dtypes)
print("\n")
print(cleaned_df.shape)
print("\n")
cleaned_df.describe()

tripduration                        int32
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id                   object
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                     object
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                             object
usertype                           object
postal code                       float64
birth year                        float64
gender                            float32
dtype: object


(35272, 16)




Unnamed: 0,tripduration,start station latitude,start station longitude,end station latitude,end station longitude,postal code,birth year,gender
count,35272.0,0.0,0.0,0.0,0.0,29552.0,0.0,35272.0
mean,300279.423169,,,,,2260.768848,,0.97919
std,85441.388165,,,,,2473.133709,,0.550547
min,60021.0,,,,,216.0,,0.0
25%,237282.5,,,,,2109.0,,1.0
50%,309973.0,,,,,2118.0,,1.0
75%,372441.0,,,,,2176.0,,1.0
max,431999.0,,,,,84010.0,,2.0
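
The summary above runs from 60021 to 431999, while the first raw row records 1712320 for a trip of roughly 28 minutes. Both suggest that `Duration` in this file is stored in milliseconds, whereas the later files store `tripduration` in seconds. A minimal sketch of a unit conversion follows, assuming that reading is correct. `duration_ms_to_seconds` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def duration_ms_to_seconds(duration_ms: pd.Series) -> pd.Series:
    """Convert durations to whole seconds, assuming the input is in milliseconds."""
    return (pd.to_numeric(duration_ms) // 1000).astype("int32")


# Values copied from the first three rows of the 2011 preview.
sample = pd.Series([1712320, 313200, 1111430])
seconds = duration_ms_to_seconds(sample)
```

If the conversion were applied before the `< 4.32e5` cutoff, the threshold would mean the same thing (five days in seconds) for every file convention.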


### Files named YYYYMM-hubway-tripdata.zip

Each of these compressed files contains a single csv file, so the pandas.read_csv() method can open the zip URL directly and decompress it on the fly.

In [59]:
import pandas as pd

df = pd.read_csv("https://s3.amazonaws.com/hubway-data/201501-hubway-tripdata.zip")
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,542,2015-01-01 00:21:44,2015-01-01 00:30:47,115,Porter Square Station,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,277,Subscriber,1984,1
1,438,2015-01-01 00:27:03,2015-01-01 00:34:21,80,MIT Stata Center at Vassar St / Main St,42.361962,-71.092053,95,Cambridge St - at Columbia St / Webster Ave,42.372969,-71.094445,648,Subscriber,1985,1
2,254,2015-01-01 00:31:31,2015-01-01 00:35:46,91,One Kendall Square at Hampshire St / Portland St,42.366277,-71.09169,68,Central Square at Mass Ave / Essex St,42.36507,-71.1031,555,Subscriber,1974,1
3,432,2015-01-01 00:53:46,2015-01-01 01:00:58,115,Porter Square Station,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,1307,Subscriber,1987,1
4,735,2015-01-01 01:07:06,2015-01-01 01:19:21,105,Lower Cambridgeport at Magazine St/Riverside Rd,42.356954,-71.113687,88,Inman Square at Vellucci Plaza / Hampshire St,42.374035,-71.101427,177,Customer,1986,2


In [66]:
def clean_tripdata_new(df: pd.DataFrame) -> pd.DataFrame:
    """Columns retyped to increase performance"""

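    # tripduration
    df["tripduration"] = pd.to_numeric(df["tripduration"], downcast="integer")
    # keep trips shorter than 432,000 seconds (5 days); .copy() avoids
    # SettingWithCopyWarning on the assignments that follow
    df = df[df["tripduration"] < 4.32e5].copy()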

    # starttime
    df["starttime"] = pd.to_datetime(df["starttime"])

    # stoptime
    df["stoptime"] = pd.to_datetime(df["stoptime"])

    # start_station_id
    df["start_station_id"] = df["start station id"].astype(str)
    df = df.drop(["start station id"], axis=1)

    # start_station_name
    df["start_station_name"] = df["start station name"].astype(str)
    df["start_station_name"] = df["start_station_name"].str.lower()
    df = df.drop(["start station name"], axis=1)

    # start_station_latitude
    df["start_station_latitude"] = pd.to_numeric(
        df["start station latitude"], downcast="float"
    )
    df = df.drop(["start station latitude"], axis=1)

    # start_station_longitude
    df["start_station_longitude"] = pd.to_numeric(
        df["start station longitude"], downcast="float"
    )
    df = df.drop(["start station longitude"], axis=1)

    # end_station_id
    df["end_station_id"] = df["end station id"].astype(str)
    df = df.drop(["end station id"], axis=1)

    # end_station_name
    df["end_station_name"] = df["end station name"].astype(str)
    df["end_station_name"] = df["end_station_name"].str.lower()
    df = df.drop(["end station name"], axis=1)

    # end_station_latitude
    df.loc[df["end station latitude"] == r"\N", "end station latitude"] = np.nan
    df["end_station_latitude"] = pd.to_numeric(
        df["end station latitude"], downcast="float"
    )
    df = df.drop(["end station latitude"], axis=1)

    # end_station_longitude
    df.loc[df["end station longitude"] == r"\N", "end station longitude"] = np.nan
    df["end_station_longitude"] = pd.to_numeric(
        df["end station longitude"], downcast="float"
    )
    df = df.drop(["end station longitude"], axis=1)

    # bikeid
    df["bikeid"] = df["bikeid"].astype(str)

    # usertype
    df["usertype"] = df["usertype"].astype(str)
    df["usertype"] = df["usertype"].str.lower()

    # birth_year
    if "birth year" in df:
        df["birth_year"] = df["birth year"]
        df.loc[df["birth_year"] == r"\N", "birth_year"] = np.nan
        df["birth_year"] = pd.to_numeric(df["birth_year"], downcast="float")
        df = df.drop(["birth year"], axis=1)
    else:
        df["birth_year"] = np.nan

    # gender
    if "gender" in df:
        df["gender"] = pd.to_numeric(df["gender"], downcast="integer")
    else:
        df["gender"] = 0

    # postal_code
    if "postal code" in df:
        df["postal_code"] = df["postal code"]
        df["postal_code"] = pd.to_numeric(
            df["postal_code"], downcast="float", errors="coerce"
        )
        df = df.drop(["postal code"], axis=1)
    else:
        df["postal_code"] = np.nan

    return df

cleaned_df = clean_tripdata_new(df)
cleaned_df.head()

Unnamed: 0,tripduration,starttime,stoptime,bikeid,usertype,gender,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,birth_year,postal_code
0,542,2015-01-01 00:21:44,2015-01-01 00:30:47,277,subscriber,1,115,porter square station,42.387997,-71.119087,96,cambridge main library at broadway / trowbridg...,42.373379,-71.111076,1984.0,
1,438,2015-01-01 00:27:03,2015-01-01 00:34:21,648,subscriber,1,80,mit stata center at vassar st / main st,42.361961,-71.092056,95,cambridge st - at columbia st / webster ave,42.372971,-71.094444,1985.0,
2,254,2015-01-01 00:31:31,2015-01-01 00:35:46,555,subscriber,1,91,one kendall square at hampshire st / portland st,42.366276,-71.09169,68,central square at mass ave / essex st,42.36507,-71.103104,1974.0,
3,432,2015-01-01 00:53:46,2015-01-01 01:00:58,1307,subscriber,1,115,porter square station,42.387997,-71.119087,96,cambridge main library at broadway / trowbridg...,42.373379,-71.111076,1987.0,
4,735,2015-01-01 01:07:06,2015-01-01 01:19:21,177,customer,2,105,lower cambridgeport at magazine st/riverside rd,42.356953,-71.113686,88,inman square at vellucci plaza / hampshire st,42.374035,-71.101425,1986.0,
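
The coordinates in the cleaned preview differ slightly from the raw ones (42.387995 becomes 42.387997), which is the expected effect of storing them as 32-bit floats. The sketch below measures that rounding error on two values copied from the preview. It uses `astype` rather than `pd.to_numeric(..., downcast="float")` purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Coordinates copied from the first row of the 2015 preview.
coords = pd.Series([42.387995, -71.119084])
as_float32 = coords.astype(np.float32)

# Absolute rounding error introduced by the narrower type, in degrees.
error = (as_float32.astype(np.float64) - coords).abs()
max_error_degrees = float(error.max())
```

At these magnitudes the error stays within a few millionths of a degree, which is well under a metre on the ground. Whether that is acceptable depends on how the coordinates are used downstream.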


In [67]:
print(cleaned_df.isna().sum())

tripduration                  0
starttime                     0
stoptime                      0
bikeid                        0
usertype                      0
gender                        0
start_station_id              0
start_station_name            0
start_station_latitude        0
start_station_longitude       0
end_station_id                0
end_station_name              0
end_station_latitude          0
end_station_longitude         0
birth_year                  254
postal_code                7840
dtype: int64


In [49]:
print(df.dtypes)
print("\n")
print(df.shape)
print("\n")
df.describe()

tripduration                 int32
starttime                   object
stoptime                    object
start station id             int64
start station name          object
start station latitude     float64
start station longitude    float64
end station id               int64
end station name            object
end station latitude       float64
end station longitude      float64
bikeid                       int64
usertype                    object
birth year                  object
gender                       int64
dtype: object


(7840, 15)




Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,gender
count,7840.0,7840.0,7840.0,7840.0,7840.0,7840.0,7840.0,7840.0,7840.0
mean,647.878444,91.928954,42.369236,-71.10221,90.985459,42.368961,-71.102134,932.269515,1.211735
std,3998.551965,19.303781,0.008205,0.014306,20.584012,0.008531,0.014827,216.461422,0.481382
min,62.0,67.0,42.356954,-71.139459,1.0,42.334876,-71.139459,23.0,0.0
25%,287.0,75.0,42.363465,-71.114214,74.0,42.362613,-71.114214,748.0,1.0
50%,406.0,89.5,42.366621,-71.101427,89.0,42.366621,-71.101427,894.0,1.0
75%,602.0,107.0,42.373268,-71.09169,107.0,42.373268,-71.09169,1090.0,1.0
max,232319.0,145.0,42.397828,-71.069957,149.0,42.397828,-71.048927,1325.0,2.0
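
The `birth year` column is typed as `object` above, which is consistent with the `\N` placeholder that `clean_tripdata_new` replaces by hand. One alternative is to declare the placeholder when the file is read. This is a minimal sketch on a made-up two-row sample, not on the real file.

```python
import io

import pandas as pd

# Tiny stand-in for a trip file; the rows are illustrative, not real data.
raw = "tripduration,birth year\n542,1984\n438,\\N\n"

# na_values tells the parser to treat the literal \N as a missing value.
df_demo = pd.read_csv(io.StringIO(raw), na_values=[r"\N"])
```

With the placeholder handled at parse time, the column arrives as a float column with `NaN` in place of `\N`.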


### Files named YYYYMM-bluebikes-tripdata.zip

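Each of these compressed files contains a csv file and a subdirectory, so pandas.read_csv() cannot open them directly. Instead, the archive is downloaded with requests, wrapped in an io.BytesIO buffer, and the csv member is opened by name with zipfile.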

In [56]:
import pandas as pd
import requests
import zipfile
import io

url = "https://s3.amazonaws.com/hubway-data/201805-bluebikes-tripdata.zip"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
df = pd.read_csv(z.open("201805-bluebikes-tripdata.csv"))
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,1177,2018-05-01 00:01:32.4590,2018-05-01 00:21:10.0260,184,Sidney Research Campus/ Erie Street at Waverly,42.357753,-71.103934,189,Kendall T,42.362428,-71.084955,790,Subscriber,1994,1
1,733,2018-05-01 00:05:19.4970,2018-05-01 00:17:32.7190,67,MIT at Mass Ave / Amherst St,42.3581,-71.093198,41,Packard's Corner - Commonwealth Ave at Brighto...,42.352261,-71.123831,1238,Subscriber,1993,2
2,437,2018-05-01 00:05:37.7590,2018-05-01 00:12:54.8300,54,Tremont St at West St,42.354979,-71.063348,6,Cambridge St at Joy St,42.361291,-71.065262,218,Subscriber,1993,1
3,730,2018-05-01 00:05:39.6780,2018-05-01 00:17:50.5880,54,Tremont St at West St,42.354979,-71.063348,46,Christian Science Plaza - Massachusetts Ave at...,42.343666,-71.085824,1885,Subscriber,1992,1
4,411,2018-05-01 00:06:10.1590,2018-05-01 00:13:02.0490,54,Tremont St at West St,42.354979,-71.063348,6,Cambridge St at Joy St,42.361291,-71.065262,602,Customer,1969,0
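
The cell above hard-codes the member name. When looping over many months, it may be easier to pick the csv member out of the archive listing. The following is a sketch under the assumption that each archive has exactly one top-level csv. `read_csv_from_zip` is a hypothetical helper, and the demo archive is built in memory with made-up contents.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(archive_bytes: bytes) -> pd.DataFrame:
    """Read the first top-level .csv member of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Skip anything inside a subdirectory and anything that is not a csv.
        csv_members = [
            name
            for name in archive.namelist()
            if name.endswith(".csv") and "/" not in name
        ]
        if not csv_members:
            raise ValueError("no top-level CSV member found in archive")
        with archive.open(csv_members[0]) as handle:
            return pd.read_csv(handle)


# Build a small archive in memory to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("trips.csv", "tripduration,bikeid\n1177,790\n")
    archive.writestr("extras/notes.txt", "not a CSV")
demo = read_csv_from_zip(buffer.getvalue())
```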


In [57]:
cleaned_df = clean_tripdata_new(df)
cleaned_df.head()



Unnamed: 0,tripduration,starttime,stoptime,bikeid,usertype,gender,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,birth_year,postal_code
0,1177,2018-05-01 00:01:32.459,2018-05-01 00:21:10.026,790,subscriber,1,184,sidney research campus/ erie street at waverly,42.357754,-71.103935,189,kendall t,42.362427,-71.084953,1994.0,
1,733,2018-05-01 00:05:19.497,2018-05-01 00:17:32.719,1238,subscriber,2,67,mit at mass ave / amherst st,42.358101,-71.093201,41,packard's corner - commonwealth ave at brighto...,42.352261,-71.123833,1993.0,
2,437,2018-05-01 00:05:37.759,2018-05-01 00:12:54.830,218,subscriber,1,54,tremont st at west st,42.35498,-71.063347,6,cambridge st at joy st,42.36129,-71.065262,1993.0,
3,730,2018-05-01 00:05:39.678,2018-05-01 00:17:50.588,1885,subscriber,1,54,tremont st at west st,42.35498,-71.063347,46,christian science plaza - massachusetts ave at...,42.343666,-71.085823,1992.0,
4,411,2018-05-01 00:06:10.159,2018-05-01 00:13:02.049,602,customer,0,54,tremont st at west st,42.35498,-71.063347,6,cambridge st at joy st,42.36129,-71.065262,1969.0,


In [58]:
print(cleaned_df.isna().sum())

tripduration                    0
starttime                       0
stoptime                        0
bikeid                          0
usertype                        0
gender                          0
start_station_id                0
start_station_name              0
start_station_latitude          0
start_station_longitude         0
end_station_id                  0
end_station_name                0
end_station_latitude            0
end_station_longitude           0
birth_year                      0
postal_code                178831
dtype: int64


In [52]:
print(cleaned_df.dtypes)
print("\n")
print(cleaned_df.shape)
print("\n")
df.describe()

tripduration                        int32
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
bikeid                             object
usertype                           object
gender                               int8
start_station_id                   object
start_station_name                 object
start_station_latitude            float32
start_station_longitude           float32
end_station_id                     object
end_station_name                   object
end_station_latitude              float32
end_station_longitude             float32
birth_year                        float32
postal_code                       float64
dtype: object


(178831, 16)




Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,birth year,gender
count,178865.0,178865.0,178865.0,178865.0,178865.0,178865.0,178865.0,178865.0,178865.0,178865.0
mean,1562.178,88.317245,42.357409,-71.08509,88.505633,42.357379,-71.084767,1875.32903,1981.054818,1.064602
std,43297.61,58.560581,0.014519,0.025097,58.894745,0.014505,0.025142,952.57628,11.716032,0.603111
min,61.0,1.0,42.303469,-71.166491,1.0,42.303469,-71.166491,1.0,1887.0,0.0
25%,428.0,43.0,42.348717,-71.1031,43.0,42.348717,-71.1031,1030.0,1969.0,1.0
50%,728.0,74.0,42.3581,-71.085954,74.0,42.357753,-71.085824,2108.0,1984.0,1.0
75%,1207.0,124.0,42.365673,-71.064263,125.0,42.365673,-71.064263,2661.0,1991.0,1.0
max,9328558.0,232.0,42.406302,-71.006098,232.0,42.406302,-71.006098,3306.0,2002.0,2.0


In [53]:
import pandas as pd
import requests
import zipfile
import io

url = "https://s3.amazonaws.com/hubway-data/202105-bluebikes-tripdata.zip"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
df = pd.read_csv(z.open("202105-bluebikes-tripdata.csv"))
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,postal code
0,609,2021-05-01 00:00:01.0450,2021-05-01 00:10:10.7300,66,Commonwealth Ave at Griggs St,42.349225,-71.132753,400,Lansdowne T Stop,42.347345,-71.100168,4885,Subscriber,2134
1,632,2021-05-01 00:00:13.0880,2021-05-01 00:10:45.9060,409,Elm St at White St,42.389524,-71.116941,104,Harvard University Radcliffe Quadrangle at She...,42.380287,-71.125107,3844,Subscriber,2144
2,187,2021-05-01 00:00:20.0430,2021-05-01 00:03:27.7480,75,Lafayette Square at Mass Ave / Main St / Colum...,42.363465,-71.100573,178,MIT Pacific St at Purrington St,42.359573,-71.101295,6907,Subscriber,2139
3,976,2021-05-01 00:00:29.9290,2021-05-01 00:16:46.0470,371,700 Huron Ave,42.380788,-71.154129,455,Coolidge Sq.,42.372076,-71.156831,2850,Subscriber,2138
4,136,2021-05-01 00:00:45.0970,2021-05-01 00:03:01.1520,39,Washington St at Rutland St,42.338515,-71.074041,26,Washington St at Waltham St,42.341575,-71.068904,3903,Subscriber,2143
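
The 2021 preview has a `postal code` column but no `birth year` or `gender`, unlike the 2018 file. A quick way to surface this kind of drift between two files is to compare their column sets. This is a minimal sketch: `compare_columns` is a hypothetical helper, and the two frames below carry only a few headers from the previews rather than real data.

```python
import pandas as pd


def compare_columns(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Report the columns that appear in only one of two dataframes."""
    old_cols, new_cols = set(old.columns), set(new.columns)
    return {
        "only_in_old": sorted(old_cols - new_cols),
        "only_in_new": sorted(new_cols - old_cols),
    }


# Empty frames carrying a subset of the headers shown in the previews.
df_2018 = pd.DataFrame(columns=["tripduration", "usertype", "birth year", "gender"])
df_2021 = pd.DataFrame(columns=["tripduration", "usertype", "postal code"])
diff = compare_columns(df_2018, df_2021)
```

This is the situation the `if ... in df` branches of `clean_tripdata_new` are written to handle.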


In [54]:
cleaned_df = clean_tripdata_new(df)
cleaned_df.head()



Unnamed: 0,tripduration,starttime,stoptime,bikeid,usertype,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,birth_year,gender,postal_code
0,609,2021-05-01 00:00:01.045,2021-05-01 00:10:10.730,4885,subscriber,66,commonwealth ave at griggs st,42.349224,-71.132751,400,lansdowne t stop,42.347343,-71.100166,,0,2134.0
1,632,2021-05-01 00:00:13.088,2021-05-01 00:10:45.906,3844,subscriber,409,elm st at white st,42.389523,-71.116943,104,harvard university radcliffe quadrangle at she...,42.380287,-71.125107,,0,2144.0
2,187,2021-05-01 00:00:20.043,2021-05-01 00:03:27.748,6907,subscriber,75,lafayette square at mass ave / main st / colum...,42.363464,-71.100571,178,mit pacific st at purrington st,42.359573,-71.101295,,0,2139.0
3,976,2021-05-01 00:00:29.929,2021-05-01 00:16:46.047,2850,subscriber,371,700 huron ave,42.380787,-71.154129,455,coolidge sq.,42.372078,-71.15683,,0,2138.0
4,136,2021-05-01 00:00:45.097,2021-05-01 00:03:01.152,3903,subscriber,39,washington st at rutland st,42.338516,-71.074043,26,washington st at waltham st,42.341576,-71.068901,,0,2143.0


In [55]:
print(cleaned_df.dtypes)
print("\n")
print(cleaned_df.shape)
print("\n")
df.describe()

tripduration                        int32
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
bikeid                             object
usertype                           object
start_station_id                   object
start_station_name                 object
start_station_latitude            float32
start_station_longitude           float32
end_station_id                     object
end_station_name                   object
end_station_latitude              float32
end_station_longitude             float32
birth_year                        float64
gender                              int64
postal_code                       float64
dtype: object


(270764, 16)




Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid
count,270893.0,270893.0,270893.0,270893.0,270893.0,270893.0,270893.0,270893.0
mean,1947.888,170.271834,42.356854,-71.088623,168.823484,42.356766,-71.08843,4706.088349
std,26596.77,143.45428,0.017207,0.027101,143.239403,0.017321,0.027126,1492.641299
min,61.0,1.0,42.267902,-71.226275,1.0,42.267902,-71.226275,218.0
25%,481.0,58.0,42.347406,-71.105495,57.0,42.347406,-71.105495,3424.0
50%,825.0,112.0,42.357219,-71.08822,108.0,42.357143,-71.08822,4882.0
75%,1422.0,318.0,42.365673,-71.06975,296.0,42.365673,-71.069616,6025.0
max,3366392.0,510.0,42.416085,-71.006098,510.0,42.416085,-71.006098,7029.0
