# Download Citi Bike System Data

## Process

1. Download zip files from the Citi Bike website
2. Unzip the files and remove the zip archives
3. Read each csv and aggregate rides to the daily level
4. Save the daily bike data into the data folder
5. Remove the unzipped csv files
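
The download step fetches one archive per month. As a minimal sketch, the list of monthly URLs can be built in Python, assuming the file naming pattern used in the download cell. `monthly_urls` is a hypothetical helper for illustration and is not part of the notebook.

```python
def monthly_urls(start=(2017, 1), end=(2019, 6)):
    """Build one archive URL per month from start to end, inclusive.

    Assumes the naming pattern used in the download cell; the helper
    name and its parameters are illustrative only.
    """
    base = "https://s3.amazonaws.com/tripdata/{:04d}{:02d}-citibike-tripdata.csv.zip"
    urls = []
    year, month = start
    while (year, month) <= end:
        urls.append(base.format(year, month))
        # step to the next month, rolling over into the next year
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


urls = monthly_urls()
```

For January 2017 through June 2019 this gives 30 URLs: 24 for 2017 and 2018 plus 6 for 2019.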

In [2]:
# download citi bike data from January 2017 to June 2019
!curl -O "https://s3.amazonaws.com/tripdata/201[7-8][01-12]-citibike-tripdata.csv.zip"
!curl -O "https://s3.amazonaws.com/tripdata/2019[01-06]-citibike-tripdata.csv.zip"
!unzip '*.zip'
!rm *.zip


[1/24]: https://s3.amazonaws.com/tripdata/201701-citibike-tripdata.csv.zip --> 201701-citibike-tripdata.csv.zip
--_curl_--https://s3.amazonaws.com/tripdata/201701-citibike-tripdata.csv.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23.1M  100 23.1M    0     0  6850k      0  0:00:03  0:00:03 --:--:-- 6850k

[2/24]: https://s3.amazonaws.com/tripdata/201702-citibike-tripdata.csv.zip --> 201702-citibike-tripdata.csv.zip
--_curl_--https://s3.amazonaws.com/tripdata/201702-citibike-tripdata.csv.zip
100 25.1M  100 25.1M    0     0  6981k      0  0:00:03  0:00:03 --:--:-- 6997k

[3/24]: https://s3.amazonaws.com/tripdata/201703-citibike-tripdata.csv.zip --> 201703-citibike-tripdata.csv.zip
--_curl_--https://s3.amazonaws.com/tripdata/201703-citibike-tripdata.csv.zip
100 23.0M 

In [2]:
import pandas as pd

# list of csv files to read
# (the `!` shell capture works only in IPython/Jupyter)
files = !ls *.csv
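
Outside IPython the `!ls` capture is not available. A minimal sketch of an equivalent listing with the standard library, assuming the csv files sit in the working directory; `list_csv_files` is a hypothetical helper name.

```python
import glob
import os


def list_csv_files(directory="."):
    """Return the csv files in `directory`, sorted by name.

    Illustrative stand-in for the IPython `!ls *.csv` capture.
    """
    pattern = os.path.join(directory, "*.csv")
    return sorted(glob.glob(pattern))


files = list_csv_files()
```

Sorting by name also puts the monthly files in chronological order, since each file name starts with the year and month.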

In [3]:
# create function to process each file individually
def aggregate_bike_data(file):
    
    """
    This function reads in raw ride files and aggregates
    daily ride counts.
    """
    
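    # read the raw trip file
    df = pd.read_csv(file)
    
    # normalize column names so that headers such as "Start Time"
    # and "starttime" both become "starttime"
    df.columns = [col.lower() for col in df.columns]
    df.columns = df.columns.str.replace(' ', '')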
    
    # create grouping date variable
    df['date'] = pd.to_datetime(df['starttime']).dt.date
    
    # aggregate daily start time counts
    daily_df = pd.DataFrame(df.groupby('date')['starttime'].count())
    daily_df = daily_df.rename(columns={'starttime': 'rides'}).reset_index()
    
    return daily_df
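
To see what the aggregation produces, here is a small sketch that applies the same steps to a three-row in-memory csv. The sample rows are made up for illustration, and `daily_ride_counts` is a hypothetical variant of the function above that uses `groupby(...).size()` in place of counting the `starttime` column.

```python
import io

import pandas as pd


def daily_ride_counts(source):
    """Count rides per start date for one trip file (path or file-like)."""
    df = pd.read_csv(source)
    # normalize headers: "Start Time" -> "starttime"
    df.columns = df.columns.str.lower().str.replace(' ', '', regex=False)
    # truncate each start timestamp to its calendar date
    df['date'] = pd.to_datetime(df['starttime']).dt.date
    # one row per date with the number of rides that started that day
    return df.groupby('date').size().reset_index(name='rides')


# made-up sample: two rides on 1 January, one on 2 January
sample = io.StringIO(
    "Start Time,Stop Time\n"
    "2017-01-01 00:05:00,2017-01-01 00:20:00\n"
    "2017-01-01 09:30:00,2017-01-01 09:45:00\n"
    "2017-01-02 08:00:00,2017-01-02 08:10:00\n"
)
daily = daily_ride_counts(sample)
```

The result has one row per date, with `rides` equal to 2 for 2017-01-01 and 1 for 2017-01-02.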

In [4]:
# concatenate the daily dataframes, one per monthly file,
# and renumber the rows so the index is unique
df = pd.concat([aggregate_bike_data(f) for f in files], ignore_index=True)

In [5]:
df.head()

|   | date | rides |
|---|------------|-------|
| 0 | 2017-01-01 | 16009 |
| 1 | 2017-01-02 | 8921 |
| 2 | 2017-01-03 | 14198 |
| 3 | 2017-01-04 | 34039 |
| 4 | 2017-01-05 | 28393 |
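
Before saving, it can be worth checking that no calendar day is missing from the combined series, for example because a monthly file failed to download. A minimal sketch using only the standard library; `missing_dates` is a hypothetical helper and the example dates are made up.

```python
from datetime import date, timedelta


def missing_dates(dates):
    """Return the calendar days between min and max that are absent."""
    present = set(dates)
    if not present:
        return []
    current, last = min(present), max(present)
    gaps = []
    while current <= last:
        if current not in present:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps


# made-up example with 3 January missing
example = [date(2017, 1, 1), date(2017, 1, 2), date(2017, 1, 4)]
gaps = missing_dates(example)
```

With the notebook's dataframe, `missing_dates(df['date'])` should return an empty list if every day in the range is covered.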


In [6]:
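import os

# write the daily counts to the data folder, creating it if needed
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

df.to_csv(os.path.join(data_dir, "citi_bike_daily.csv"), index=False)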

In [9]:
# remove unzipped csv files and the __MACOSX folder left by unzip
!rm *.csv
!rm -r __MACOSX
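
The shell cleanup above depends on `rm` being available. A minimal Python sketch of the same cleanup, assuming the raw csv files are in the working directory and the saved output lives in the `data` subfolder; `clean_downloads` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def clean_downloads(directory="."):
    """Delete top-level csv files and the __MACOSX folder, if present.

    Only files directly inside `directory` are removed, so a csv saved
    in a subfolder such as data/ is left untouched.
    """
    root = Path(directory)
    removed = []
    for path in root.glob("*.csv"):
        path.unlink()
        removed.append(path.name)
    # ignore_errors avoids a failure when the folder does not exist
    shutil.rmtree(root / "__MACOSX", ignore_errors=True)
    return sorted(removed)
```

Calling `clean_downloads()` returns the names of the csv files it deleted, which gives a quick record of what was removed.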