# Exploratory data analysis (EDA) for first capstone project: Citi Bike trip data from NYC

To-Do:
- Make sure we can read in csv files. Any issues with data/format/types that need to be corrected?
- Missing values? 
- Plot station locations on map
- Plot # rides per day (1-30) for one month
- Plot # rides grouped by day of week
- Plot # rides grouped by hour



## Loading and cleaning data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
dat = pd.read_csv('data/2013-09 - Citi Bike trip data.csv',parse_dates=True)
dat.head()

Look at the DataFrame info:
- The data looks pretty clean; no null values.
- Start and end times need to be converted to datetime format.
- Gender can probably be converted to a factor (categorical) type.
- Spaces in column names should be replaced with underscores.

In [None]:
dat.info()
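The "no null values" observation can also be checked explicitly. Below is a minimal sketch, assuming a hypothetical helper `missing_summary` and a made-up toy frame (the column names and values are for illustration only, not taken from the real file). Note that `isnull` only counts real NaN/NaT values; placeholder strings used to mark missing entries would not show up here.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Per-column null counts and percentages, only for columns that have nulls."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'n_missing': counts,
        'pct_missing': 100 * counts / len(df),
    })
    return summary[summary['n_missing'] > 0]

# Toy data (made up) with two missing values in one column
toy = pd.DataFrame({
    'tripduration': [634, 1547, 178, 1580],
    'birth year': [1980.0, np.nan, 1975.0, np.nan],
    'gender': [1, 2, 0, 1],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

On the real data, `missing_summary(dat)` should return an empty frame if the columns really contain no nulls.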

Convert start/end times to datetime format.

In [None]:
# bracket assignment is the safer way to overwrite an existing column
dat['starttime'] = pd.to_datetime(dat['starttime'])
dat['stoptime'] = pd.to_datetime(dat['stoptime'])
dat.info()

Convert gender to a categorical variable.

In [None]:
dat['gender']=dat['gender'].astype('category')
dat.info()

Replace spaces in variable names with underscores.

In [None]:
dat.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)
dat.info()

## Now put all this data input/transformation together

In [None]:
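def load_citibike_monthly(path='data/2013-09 - Citi Bike trip data.csv'):
    """Load one month of Citi Bike trips and apply the cleaning steps above."""
    dat = pd.read_csv(path)
    # parse the trip start/end timestamps
    dat['starttime'] = pd.to_datetime(dat['starttime'])
    dat['stoptime'] = pd.to_datetime(dat['stoptime'])
    # gender is a coded label, not a quantity
    dat['gender'] = dat['gender'].astype('category')
    # replace spaces in column names with underscores
    dat = dat.rename(columns=lambda x: x.replace(' ', '_'))
    return dat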

In [None]:
del dat
dat=load_citibike_monthly()
dat.info()
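As an alternative, the timestamp and gender conversions could be done while reading the file. This is only a sketch: `load_trips` is a hypothetical helper, and the toy CSV below contains made-up rows with a subset of the columns used in this notebook.

```python
import io
import pandas as pd

def load_trips(path_or_buffer):
    """Read a Citi Bike style CSV, converting types at read time."""
    df = pd.read_csv(
        path_or_buffer,
        parse_dates=['starttime', 'stoptime'],  # parse these columns as datetimes
        dtype={'gender': 'category'},           # read gender as a categorical
    )
    # replace spaces in column names with underscores
    return df.rename(columns=lambda name: name.replace(' ', '_'))

# Toy CSV (made-up rows) standing in for a monthly trip file
toy_csv = io.StringIO(
    'tripduration,starttime,stoptime,start station id,gender\n'
    '634,2013-09-01 00:00:02,2013-09-01 00:10:36,164,1\n'
    '1547,2013-09-01 00:00:03,2013-09-01 00:25:50,388,2\n'
)
toy_trips = load_trips(toy_csv)
print(toy_trips.dtypes)
```

One difference to keep in mind: according to the pandas documentation, categories read with `dtype='category'` are parsed as strings, whereas `astype('category')` on an integer column keeps integer categories.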

## EDA

In [None]:
dat.describe()

## Number of rides per day of month
- Big decrease in rides in the first 1-3 days of the month (this is September, so that is probably Labor Day weekend).
- Should get data on holiday dates to use in the analysis.
- Other than that, rides seem to fluctuate between 30k and 40k per day. Need to investigate these patterns further: they could reflect holidays, weekends, or weather.

In [None]:
# add a day-of-month column and plot the number of rides per day
dat['day'] = dat.starttime.dt.day
dat.groupby('day').size().plot().grid()
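One possible source for the holiday dates mentioned above is the US federal holiday calendar that ships with pandas. The sketch below is an assumption about how this could be done, not part of the original analysis; `flag_holidays` is a hypothetical helper and the timestamps are made up.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def flag_holidays(timestamps):
    """Return a boolean Series: True where the timestamp falls on a US federal holiday."""
    stamps = pd.Series(pd.to_datetime(timestamps))
    days = stamps.dt.normalize()  # drop the time of day, keep the calendar date
    holidays = USFederalHolidayCalendar().holidays(start=days.min(), end=days.max())
    return days.isin(holidays)

# Toy start times (made up) around the first days of September 2013
toy_times = ['2013-09-01 10:00', '2013-09-02 09:30', '2013-09-03 18:45']
toy_flags = flag_holidays(toy_times)
print(toy_flags.tolist())
```

On the real data this could be used as `dat['is_holiday'] = flag_holidays(dat.starttime).values`. The calendar only covers federal holidays, so local events or school holidays would need another source.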

## Rides by day of week
There appear to be more rides on day 0 (Monday) and the fewest on day 5 (Saturday). We would need to look at other months to see whether this pattern holds up.

In [None]:
# add a day-of-week column (0 = Monday, 6 = Sunday) and plot total rides per weekday
dat['day_of_week'] = dat.starttime.dt.dayofweek
dat.groupby('day_of_week').size().plot(kind='bar')
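A caveat on the totals above: September 2013 has five Sundays and five Mondays but only four of every other weekday, so summing rides per weekday favors those two days. A minimal sketch of one way to correct for this is to average the daily totals within each weekday. `mean_rides_per_weekday` is a hypothetical helper and the start times are made up.

```python
import pandas as pd

WEEKDAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

def mean_rides_per_weekday(starttimes):
    """Average rides per calendar day for each weekday that appears in the data."""
    starts = pd.Series(pd.to_datetime(starttimes))
    # rides per calendar date (dates with no rides at all do not appear)
    daily = starts.groupby(starts.dt.normalize()).size()
    # average the daily totals within each weekday
    by_weekday = daily.groupby(daily.index.day_name()).mean()
    return by_weekday.reindex(WEEKDAY_ORDER).dropna()

# Toy start times (made up): two Mondays and one Tuesday
toy_starts = [
    '2013-09-02 08:00', '2013-09-02 09:00', '2013-09-02 17:30',  # Monday: 3 rides
    '2013-09-09 08:15',                                          # Monday: 1 ride
    '2013-09-03 08:05', '2013-09-03 18:20',                      # Tuesday: 2 rides
]
toy_weekday_means = mean_rides_per_weekday(toy_starts)
print(toy_weekday_means)
```

In the toy data Monday has twice as many rides in total as Tuesday, but the same average per day, which is the kind of distortion the raw counts can hide.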

## Relationship between hour and number of rides
The plot below shows that rides peak around hours 7-8 and 16-17 (24-hour clock), which correspond roughly to rush hour, i.e. commuting to and from work. So hour of day looks like it would be an important predictor of demand.

In [None]:
# add hour column to data frame and plot the number of rides per hour
dat['hour'] = dat.starttime.dt.hour
rides_by_hour = dat.groupby('hour').size()
rides_by_hour.plot().grid()
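If the peaks really reflect commuting, they should be much weaker on weekends. A minimal sketch of how to check this is to count rides per hour separately for weekdays and weekends. `hourly_profile` is a hypothetical helper and the start times are made up.

```python
import numpy as np
import pandas as pd

def hourly_profile(starttimes):
    """Rides per hour of day, with one column for weekdays and one for weekends."""
    starts = pd.Series(pd.to_datetime(starttimes))
    frame = pd.DataFrame({
        'hour': starts.dt.hour,
        # dayofweek: 0 = Monday ... 6 = Sunday, so 5 and 6 are the weekend
        'day_type': np.where(starts.dt.dayofweek >= 5, 'weekend', 'weekday'),
    })
    counts = frame.groupby(['day_type', 'hour']).size()
    # hours as rows, day types as columns; missing combinations become 0
    return counts.unstack('day_type', fill_value=0)

# Toy start times (made up): two weekday morning rides, one Saturday afternoon ride
toy_starts = ['2013-09-02 08:15', '2013-09-03 08:40', '2013-09-07 14:05']
toy_profile = hourly_profile(toy_starts)
print(toy_profile)
```

On the real data, `hourly_profile(dat.starttime).plot()` would draw both curves on one set of axes. Because a month has more weekdays than weekend days, comparing the shapes of the curves is more meaningful than comparing their heights.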


In [None]:
# which days had the most rides this month?
dat.groupby('day').size().sort_values(ascending=False).head(10)

In [None]:
# distribution of trip durations (in seconds), for trips shorter than 4000 s
dat.tripduration[dat.tripduration<4000].plot(kind='hist').grid()

What is the average trip duration?

In [None]:
# mean trip duration in minutes
np.mean(dat.tripduration)/60

Which stations are used the most?

In [None]:
# ten busiest start stations by number of rides
# (open question: does this depend on the day?)
dat.groupby('start_station_id').size().sort_values(ascending=False).head(10)

## Map of station locations

In [None]:
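# plot map of station locations (use leaflet/folium?)
import folium
map_osm = folium.Map(location=[40.75, -74])

lat_list = dat['start_station_latitude'].values
lon_list = dat['start_station_longitude'].values

# dat has one row per ride, so keep one row per station before looping
stations = dat[['start_station_latitude', 'start_station_longitude', 'start_station_id']].drop_duplicates()

# check the station coordinates and ids that will go on the map:
for lat, lon, name in zip(stations['start_station_latitude'], stations['start_station_longitude'], stations['start_station_id']):
    print(lat, lon, name)
    # folium.Marker([lat, lon], popup=str(name)).add_to(map_osm)

map_osm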


In [None]:
# lat_list is already a NumPy array (it was created with .values)
type(lat_list)

In [None]:
# one row per station instead of one row per ride
stations_df = dat[['start_station_latitude', 'start_station_longitude', 'start_station_id']].drop_duplicates()
stations_df.head()

In [None]:
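feature_group = folium.FeatureGroup("Locations")

# dat has one row per ride, so loop over unique stations only
stations = dat[['start_station_latitude', 'start_station_longitude', 'start_station_id']].drop_duplicates()
for lat, lon, name in zip(stations['start_station_latitude'], stations['start_station_longitude'], stations['start_station_id']):
    feature_group.add_child(folium.Marker(location=[lat, lon], popup=str(name)))

# attach the markers to a map centred on NYC and display it
map_osm = folium.Map(location=[40.75, -74])
map_osm.add_child(feature_group)
map_osm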

In [None]:
# peek at the first few (latitude, longitude, station id) tuples
list(zip(dat['start_station_latitude'].head(), dat['start_station_longitude'].head(), dat['start_station_id'].head()))