# Getting the data

The following data is used:
* Trip data from Bike Share NYC, available [here](https://s3.amazonaws.com/tripdata/index.html). A description of the fields is available [here](https://www.citibikenyc.com/system-data).
* Weather data, requested from NOAA. A description is available [here](https://www.ncdc.noaa.gov/cdo-web/datasets#LCD).

## 1. Tripdata

The data is stored as several .zip files in an S3 bucket. The hrefs to the files cannot be found with Beautiful Soup, because the page loads a shadow page first and Beautiful Soup only gets access to this shadow page. I used Selenium WebDriver to get the hrefs instead, because Selenium drives a real browser and so sees the page the way a 'real person' would.

In [39]:
from sqlalchemy import create_engine
from selenium import webdriver
import os
import pandas as pd
pd.set_option("display.max_rows",20)

In [2]:
chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

url = 'https://s3.amazonaws.com/tripdata/index.html'

driver.get(url)


In [33]:
# Collect the href of every anchor on the page, skipping the first and last anchors.
nodes = driver.find_elements_by_xpath('//a')
list_urls = [node.get_attribute('href') for node in nodes[1:-1]]

# Keep only the 2016 files and write their URLs to disk, one per line.
final_list = [url for url in list_urls if '2016' in url]
with open('urls.txt', 'w') as thefile:
    for item in final_list:
        thefile.write("%s\n" % item)
final_list

['https://s3.amazonaws.com/tripdata/201601-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201602-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201603-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201604-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201605-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201606-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201607-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201608-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201609-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201610-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201611-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/201612-citibike-tripdata.zip',
 'https://s3.amazonaws.com/tripdata/JC-201601-citibike-tripdata.csv.zip',
 'https://s3.amazonaws.com/tripdata/JC-201602-citibike-tripdata.csv.zip',
 'https://s3.amazonaws.com/tripdata/JC-201603-ci

In [31]:
# The file name is everything after the last '/' in the URL.
filenames = [url[url.rfind('/')+1:] for url in final_list]
print(filenames)
with open('filename.txt', 'w') as thefile:
    for item in filenames:
        thefile.write("%s\n" % item)

['201601-citibike-tripdata.zip', '201602-citibike-tripdata.zip', '201603-citibike-tripdata.zip', '201604-citibike-tripdata.zip', '201605-citibike-tripdata.zip', '201606-citibike-tripdata.zip', '201607-citibike-tripdata.zip', '201608-citibike-tripdata.zip', '201609-citibike-tripdata.zip', '201610-citibike-tripdata.zip', '201611-citibike-tripdata.zip', '201612-citibike-tripdata.zip', 'JC-201601-citibike-tripdata.csv.zip', 'JC-201602-citibike-tripdata.csv.zip', 'JC-201603-citibike-tripdata.csv.zip', 'JC-201604-citibike-tripdata.csv.zip', 'JC-201605-citibike-tripdata.csv.zip', 'JC-201606-citibike-tripdata.csv.zip', 'JC-201607-citibike-tripdata.csv.zip', 'JC-201608-citibike-tripdata.csv.zip', 'JC-201609-citibike-tripdata.csv.zip', 'JC-201610-citibike-tripdata.csv.zip', 'JC-201611-citibike-tripdata.csv.zip', 'JC-201612-citibike-tripdata.csv.zip']


I scp'ed the urls.txt file to the EC2 instance.

Then, on the EC2 instance:
* download all files
* move all the files to a bikesharedata directory
* unzip all files from within the bikesharedata directory
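
The three steps above can be sketched in Python using only the standard library. This is a minimal sketch, not the commands actually run on the instance: the function names `download_all` and `unzip_all` are hypothetical, and it assumes that `urls.txt` holds one URL per line and that the target directory is called `bikesharedata`.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_all(urls_file, dest_dir):
    """Download every URL listed in urls_file (one per line) into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for line in Path(urls_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        target = dest / url.rsplit('/', 1)[-1]  # file name = last path segment
        urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved


def unzip_all(dest_dir):
    """Extract every .zip archive in dest_dir into dest_dir itself."""
    dest = Path(dest_dir)
    extracted = []
    for archive in sorted(dest.glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return extracted
```

Calling `download_all('urls.txt', 'bikesharedata')` followed by `unzip_all('bikesharedata')` would cover all three bullet points, since the files are downloaded straight into the target directory.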

## 2. Weather data

The weather data was requested from NOAA and downloaded from the link they provided. I requested the hourly data, but the CSV comes with all the columns, including the ones for daily and monthly data, so I will drop those before I ship it off to EC2. There are too many columns to drop, and writing out the huge list of column names to create a SQL table takes far longer than just dropping them with pandas.
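
One way to avoid typing a long list of column names is to let pandas choose columns by name pattern while reading. The sketch below is an illustration under assumptions, not the approach used in this notebook: it uses a tiny inline CSV in place of `weather_data.csv`, and the helper `keep_column` is hypothetical. It keeps `DATE`, `REPORTTPYE` and every column whose name starts with `HOURLY`.

```python
import io

import pandas as pd

# Tiny stand-in for weather_data.csv (semicolon-delimited, like the real file).
sample = io.StringIO(
    "STATION;DATE;REPORTTPYE;HOURLYVISIBILITY;HOURLYPrecip;"
    "DAILYPrecip;MonthlyTotalHeatingDegreeDays\n"
    "WBAN:94728;01/01/16 00:51;FM-15;10.0;0.0;;\n"
)


def keep_column(name):
    """Keep the timestamp, the report type and all hourly measurements."""
    return name in ('DATE', 'REPORTTPYE') or name.startswith('HOURLY')


# usecols accepts a callable that is evaluated against each column name.
df_hourly = pd.read_csv(sample, delimiter=';', usecols=keep_column)
print(list(df_hourly))
```

A pattern like this keeps more hourly columns than the hand-picked list used later in this notebook, so a further selection may still be needed.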

In [3]:
df = pd.read_csv('weather_data.csv', delimiter=';')



In [62]:
df.head()

| | STATION | STATION_NAME | ELEVATION | LATITUDE | LONGITUDE | DATE | REPORTTPYE | HOURLYSKYCONDITIONS | HOURLYVISIBILITY | HOURLYPRSENTWEATHERTYPE | ... | MonthlyMaxSeaLevelPressureTime | MonthlyMinSeaLevelPressureValue | MonthlyMinSeaLevelPressureDate | MonthlyMinSeaLevelPressureTime | MonthlyTotalHeatingDegreeDays | MonthlyTotalCoolingDegreeDays | MonthlyDeptFromNormalHeatingDD | MonthlyDeptFromNormalCoolingDD | MonthlyTotalSeasonToDateHeatingDD | MonthlyTotalSeasonToDateCoolingDD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WBAN:94728 | NY CITY CENTRAL PARK NY US | 39.6 | 407.889 | -739.669 | 01/07/13 00:13 | FM-16 | OVC:08 13 | 8.0 | | ... | -9999 | | -9999 | -9999 | | | | | | |
| 1 | WBAN:94728 | NY CITY CENTRAL PARK NY US | 39.6 | 407.889 | -739.669 | 01/07/13 00:51 | FM-15 | OVC:08 15 | 8.0 | | ... | -9999 | | -9999 | -9999 | | | | | | |
| 2 | WBAN:94728 | NY CITY CENTRAL PARK NY US | 39.6 | 407.889 | -739.669 | 01/07/13 01:49 | FM-16 | OVC:08 14 | 8.0 | | ... | -9999 | | -9999 | -9999 | | | | | | |
| 3 | WBAN:94728 | NY CITY CENTRAL PARK NY US | 39.6 | 407.889 | -739.669 | 01/07/13 01:51 | FM-15 | OVC:08 14 | 8.0 | | ... | -9999 | | -9999 | -9999 | | | | | | |
| 4 | WBAN:94728 | NY CITY CENTRAL PARK NY US | 39.6 | 407.889 | -739.669 | 01/07/13 02:51 | FM-15 | OVC:08 11 | 9.0 | | ... | -9999 | | -9999 | -9999 | | | | | | |


In [4]:
for i, column in enumerate(list(df)):
    print(i, column)

0 STATION
1 STATION_NAME
2 ELEVATION
3 LATITUDE
4 LONGITUDE
5 DATE
6 REPORTTPYE
7 HOURLYSKYCONDITIONS
8 HOURLYVISIBILITY
9 HOURLYPRSENTWEATHERTYPE
10 HOURLYDRYBULBTEMPF
11 HOURLYDRYBULBTEMPC
12 HOURLYWETBULBTEMPF
13 HOURLYWETBULBTEMPC
14 HOURLYDewPointTempF
15 HOURLYDewPointTempC
16 HOURLYRelativeHumidity
17 HOURLYWindSpeed
18 HOURLYWindDirection
19 HOURLYWindGustSpeed
20 HOURLYStationPressure
21 HOURLYPressureTendency
22 HOURLYPressureChange
23 HOURLYSeaLevelPressure
24 HOURLYPrecip
25 HOURLYAltimeterSetting
26 DAILYMaximumDryBulbTemp
27 DAILYMinimumDryBulbTemp
28 DAILYAverageDryBulbTemp
29 DAILYDeptFromNormalAverageTemp
30 DAILYAverageRelativeHumidity
31 DAILYAverageDewPointTemp
32 DAILYAverageWetBulbTemp
33 DAILYHeatingDegreeDays
34 DAILYCoolingDegreeDays
35 DAILYSunrise
36 DAILYSunset
37 DAILYWeather
38 DAILYPrecip
39 DAILYSnowfall
40 DAILYSnowDepth
41 DAILYAverageStationPressure
42 DAILYAverageSeaLevelPressure
43 DAILYAverageWindSpeed
44 DAILYPeakWindSpeed
45 PeakWindDirection
46 

All the data comes from the same station, so every station-related column can be dropped.

In [40]:
df_drop_daily = df[['DATE',
       'REPORTTPYE', 'HOURLYSKYCONDITIONS', 'HOURLYVISIBILITY',
       'HOURLYDRYBULBTEMPC', 'HOURLYWETBULBTEMPC',
       'HOURLYWindSpeed', 'HOURLYWindDirection', 'HOURLYPrecip']]

In [41]:
df_drop_daily.shape

(42484, 9)

In [42]:
list(df_drop_daily)

['DATE',
 'REPORTTPYE',
 'HOURLYSKYCONDITIONS',
 'HOURLYVISIBILITY',
 'HOURLYDRYBULBTEMPC',
 'HOURLYWETBULBTEMPC',
 'HOURLYWindSpeed',
 'HOURLYWindDirection',
 'HOURLYPrecip']

In [43]:
df_drop_daily.dtypes

DATE                    object
REPORTTPYE              object
HOURLYSKYCONDITIONS     object
HOURLYVISIBILITY        object
HOURLYDRYBULBTEMPC      object
HOURLYWETBULBTEMPC     float64
HOURLYWindSpeed        float64
HOURLYWindDirection     object
HOURLYPrecip            object
dtype: object
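
Several columns that look numeric, such as `HOURLYVISIBILITY` and `HOURLYDRYBULBTEMPC`, are reported as `object`, which usually means that some cells hold non-numeric text. One possible way to handle this, sketched here rather than taken from the notebook, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable cells into `NaN`. The sample values below are made up for illustration (the trailing `s` is a hypothetical stray flag), and whether coercing is appropriate depends on what the odd values actually mean.

```python
import pandas as pd

# Made-up sample: numbers stored as text, with one unparseable cell.
sample = pd.DataFrame({
    'HOURLYVISIBILITY': ['10.0', '8.0', '9.0'],
    'HOURLYDRYBULBTEMPC': ['5.6', '5.0s', '4.4'],
})

# Convert each column; cells that cannot be parsed become NaN.
converted = sample.apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)

# Count how many cells were coerced to NaN in the temperature column.
print(converted['HOURLYDRYBULBTEMPC'].isna().sum())
```

Coerced cells become `NaN`, so a later `dropna()` would remove those rows; it is worth counting them first.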

In [56]:
df.DATE.head()

0    01/07/13 00:13
1    01/07/13 00:51
2    01/07/13 01:49
3    01/07/13 01:51
4    01/07/13 02:51
Name: DATE, dtype: object

In [45]:
relevant_dates = []
for row in df.DATE:
    if '/16' in row:
        relevant_dates.append(row)

In [57]:
df_small = df_drop_daily[df_drop_daily.DATE.isin(relevant_dates)]
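
The substring test `'/16' in row` relies on the year being the only field that can appear as `16` right after a slash. If the field order were month/day/year, it would also match the 16th of every month. A more explicit alternative, sketched here under the assumption that the timestamps are day/month/two-digit year (the notebook does not state the order), is to parse the column and filter on the year. The sample frame is made up for illustration.

```python
import pandas as pd

# Made-up sample in the same text format as the DATE column.
sample = pd.DataFrame({
    'DATE': ['31/12/15 23:51', '01/01/16 00:51', '16/07/13 02:51'],
    'HOURLYWindSpeed': [3.0, 5.0, 9.0],
})

# Assumption: day/month/two-digit year. Use '%m/%d/%y %H:%M' if the order differs.
parsed = pd.to_datetime(sample['DATE'], format='%d/%m/%y %H:%M')

# Boolean mask on the parsed year instead of a substring match.
rows_2016 = sample[parsed.dt.year == 2016]
print(rows_2016)
```

Passing an explicit `format` also makes pandas raise an error on timestamps that do not fit, instead of silently guessing.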

In [58]:
df_small.head()

| | DATE | REPORTTPYE | HOURLYSKYCONDITIONS | HOURLYVISIBILITY | HOURLYDRYBULBTEMPC | HOURLYWETBULBTEMPC | HOURLYWindSpeed | HOURLYWindDirection | HOURLYPrecip |
|---|---|---|---|---|---|---|---|---|---|
| 28549 | 01/01/16 00:51 | FM-15 | OVC:08 37 | 10.0 | 5.6 | 2.4 | 5.0 | VRB | 0.0 |
| 28550 | 01/01/16 01:51 | FM-15 | OVC:08 36 | 10.0 | 5.0 | 2.0 | 3.0 | VRB | 0.0 |
| 28551 | 01/01/16 02:51 | FM-15 | OVC:08 34 | 10.0 | 5.0 | 2.1 | 5.0 | 280 | 0.0 |
| 28552 | 01/01/16 03:51 | FM-15 | OVC:08 31 | 10.0 | 5.0 | 2.1 | 9.0 | 280 | 0.0 |
| 28553 | 01/01/16 04:51 | FM-15 | OVC:08 44 | 10.0 | 4.4 | 1.7 | 10.0 | 270 | 0.0 |


In [59]:
df_small = df_small.dropna()

In [60]:
df_small.head()

| | DATE | REPORTTPYE | HOURLYSKYCONDITIONS | HOURLYVISIBILITY | HOURLYDRYBULBTEMPC | HOURLYWETBULBTEMPC | HOURLYWindSpeed | HOURLYWindDirection | HOURLYPrecip |
|---|---|---|---|---|---|---|---|---|---|
| 28549 | 01/01/16 00:51 | FM-15 | OVC:08 37 | 10.0 | 5.6 | 2.4 | 5.0 | VRB | 0.0 |
| 28550 | 01/01/16 01:51 | FM-15 | OVC:08 36 | 10.0 | 5.0 | 2.0 | 3.0 | VRB | 0.0 |
| 28551 | 01/01/16 02:51 | FM-15 | OVC:08 34 | 10.0 | 5.0 | 2.1 | 5.0 | 280 | 0.0 |
| 28552 | 01/01/16 03:51 | FM-15 | OVC:08 31 | 10.0 | 5.0 | 2.1 | 9.0 | 280 | 0.0 |
| 28553 | 01/01/16 04:51 | FM-15 | OVC:08 44 | 10.0 | 4.4 | 1.7 | 10.0 | 270 | 0.0 |


In [61]:
df_small.to_csv('weatherdata.csv')

## 3. Putting the data in the DB

Multiple files are copied into one PostgreSQL table.

The weather data will be imported after it has been cleaned.
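
One way to load many CSV files into the same table is psql's `\copy` command, issued once per file. The sketch below only builds those commands as strings; it is not the command used for this project. The table name `tripdata`, the helper `copy_commands`, and the assumption that every CSV has a header row and the same columns as the target table are all hypothetical.

```python
from pathlib import Path


def copy_commands(data_dir, table):
    """Return one psql copy command per CSV file in data_dir, sorted by name."""
    commands = []
    for path in sorted(Path(data_dir).glob('*.csv')):
        # Paths containing a single quote would need escaping; not handled here.
        commands.append(
            "\\copy {} FROM '{}' WITH (FORMAT csv, HEADER true)".format(table, path)
        )
    return commands
```

For example, `copy_commands('bikesharedata', 'tripdata')` returns one command per extracted CSV. Each command can be pasted into a psql session connected to the target database, or the commands can be saved to a file and run with `psql -f`.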