# Data Analysis

1. [Get cleaned data](#Get-cleaned-data)
2. [Get Zone Name and attach to data](#Get-Zone-Name-and-attach-to-data)
3. [Histogram Comparison: Pickups per Zone](#Histogram-Comparison,-Pickups-per-Zone)
4. [Linear Chart: Pickups evolution over time](#Linear-Chart,-Pickups-evolution-over-time)
5. [Scatter Plot, Relation between Precipitation and Pickups](#Scatter-Plot,-Relation-between-Precipitation-and-Pickups)
6. [Pairwise Relationships](#Pairwise-Relationships)
7. [Pickups in Rainy Day vs Not Rainy Days](#Pickups-in-Rainy-Day-vs-Not-Rainy-Days)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8")  # the "seaborn" style was renamed in matplotlib 3.6
plt.rcParams["figure.figsize"] = (15, 5)
import seaborn as sns

pd.options.display.max_columns = None
pd.options.display.max_rows = None
import altair as alt

# alt.data_transformers.disable_max_rows()

In [2]:
%matplotlib inline

# Get cleaned data

In [3]:
year=2017

df = pd.read_csv('../data/Data_Cleaned_'+str(year)+'_To_Model.csv', sep=',',
                       parse_dates=['datetime'])
df = df.rename(columns={'NoOfPickups':'pickups'})

df.head()

| |datetime|month|day|hour|LocationID|pickups|year|week|dayofweek|isweekend|precipitation|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|0|2017-01-01|1|1|0|4|136.0|2017|52|6|1|0.0|
|1|2017-01-01|1|1|0|12|3.0|2017|52|6|1|0.0|
|2|2017-01-01|1|1|0|13|103.0|2017|52|6|1|0.0|
|3|2017-01-01|1|1|0|24|94.0|2017|52|6|1|0.0|
|4|2017-01-01|1|1|0|41|136.0|2017|52|6|1|0.0|


# Get Zone Name and attach to data

In [4]:
# 1. Import the LocationID, borough and zone columns from the NY TAXI ZONES dataset
dfzones = pd.read_csv('../data/NY_taxi_zones.csv', sep=',',
                      usecols=['LocationID', 'borough', 'zone'])

# 2. Filter Manhattan zones
dfzones = dfzones[dfzones['borough']=='Manhattan']\
                .drop(['borough'], axis=1)\
                .sort_values(by='LocationID')\
                .drop_duplicates('LocationID').reset_index(drop=True)
dfzones.head()

| |zone|LocationID|
| --- | --- | --- |
|0|Alphabet City|4|
|1|Battery Park|12|
|2|Battery Park City|13|
|3|Bloomingdale|24|
|4|Central Harlem|41|


In [5]:
# Inner join: keeps only the rows whose LocationID is a Manhattan zone
df2 = df.merge(dfzones, on='LocationID')
df2.head()

| |datetime|month|day|hour|LocationID|pickups|year|week|dayofweek|isweekend|precipitation|zone|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|0|2017-01-01 00:00:00|1|1|0|4|136.0|2017|52|6|1|0.0|Alphabet City|
|1|2017-01-01 01:00:00|1|1|1|4|144.0|2017|52|6|1|0.0|Alphabet City|
|2|2017-01-01 02:00:00|1|1|2|4|189.0|2017|52|6|1|0.0|Alphabet City|
|3|2017-01-01 03:00:00|1|1|3|4|186.0|2017|52|6|1|0.0|Alphabet City|
|4|2017-01-01 04:00:00|1|1|4|4|125.0|2017|52|6|1|0.0|Alphabet City|
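
The merge above is an inner join, so any row whose `LocationID` is missing from `dfzones` is dropped without warning. A minimal sketch of how pandas' `validate` and `indicator` options could make that visible. The frames `pickups_demo` and `zones_demo` and their values are made up for illustration and are not the notebook's data:

```python
import pandas as pd

# Toy stand-ins for `df` and `dfzones` (illustrative values only)
pickups_demo = pd.DataFrame({
    "LocationID": [4, 4, 12, 999],   # 999 has no matching zone
    "pickups": [136.0, 144.0, 3.0, 50.0],
})
zones_demo = pd.DataFrame({
    "LocationID": [4, 12, 13],
    "zone": ["Alphabet City", "Battery Park", "Battery Park City"],
})

# validate="many_to_one" raises MergeError if a LocationID repeats in zones_demo;
# indicator=True adds a "_merge" column recording where each row came from.
checked = pickups_demo.merge(
    zones_demo, on="LocationID", how="left",
    validate="many_to_one", indicator=True,
)

# Rows that an inner join would have dropped
unmatched = checked[checked["_merge"] == "left_only"]

# Same rows an inner join would return
merged_demo = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

In this notebook the dropped rows are expected, since only Manhattan zones are kept, but the same check would also reveal unexpected gaps in the zone lookup.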


# Histogram Comparison, Pickups per Zone
The plot shows Manhattan zones ordered by their average hourly number of pickups over the year.

The top ten zones are:

|Zone|Number of Pickups (mean)|
| --- | --- |
|Upper East Side South|505.161416|
|Midtown Center|485.399543|
|Upper East Side North|457.652283|
|Penn Station/Madison Sq West|451.312671|
|Midtown East|438.765411|
|Times Sq/Theatre District|435.514155|
|Union Sq|420.867009|
|Murray Hill|412.360616|
|Clinton East|404.736301|
|East Village|377.901941|



In [6]:
df_grouped = df2[['zone', 'pickups']]
df_grouped = df_grouped.groupby('zone').mean().sort_values('pickups', ascending=False).reset_index()

pickups_by_zone = alt.Chart(df_grouped).mark_bar().encode(
    x=alt.X('zone',
            sort=alt.Sort(field='pickups',
                         order='descending')
    ),
    y='pickups',
    tooltip='zone'
).properties(
    width=800
)
pickups_by_zone

# Map, Pickups per Zone
WIP

In [27]:
import geopandas as gpd
from shapely.geometry import Polygon, MultiPolygon

shape_data = gpd.read_file('../data/taxi_zones/taxi_zones.shp')

# Filter Manhattan zones
shape_data = shape_data[shape_data['borough'] == 'Manhattan'].reset_index(drop=True)
shape_data = shape_data.drop(['borough'], axis=1)

# Reproject to Web Mercator (EPSG:3857; EPSG:3785 is its deprecated predecessor)
shape_data.to_crs(epsg=3857, inplace=True)

# Simplify the shape of the zones (otherwise the plot is slow)
shape_data["geometry"] = shape_data["geometry"].simplify(100)

data = []
for zonename, LocationID, shape in shape_data[["zone", "LocationID", "geometry"]].values:
    # A Polygon is a single part; a MultiPolygon is split into its sub-polygons
    if isinstance(shape, Polygon):
        polygons = [shape]
    elif isinstance(shape, MultiPolygon):
        # .geoms is required: a MultiPolygon is not directly iterable in Shapely 2.0
        polygons = list(shape.geoms)
    else:
        polygons = []

    for poly in polygons:
        # Extract the X and Y coordinates of the outer boundary line
        X, Y = poly.exterior.xy
        X = [int(x) for x in X]
        Y = [int(y) for y in Y]
        data.append([LocationID, zonename, X, Y])

# Create a new DataFrame with the X and Y coordinates separated
shape_data = pd.DataFrame(data, columns=["LocationID", "ZoneName", "X", "Y"])

# NOTE: the result is not assigned, so shape_data keeps one row per sub-polygon
shape_data.drop_duplicates(subset=['LocationID'])

# GET LOCATIONIDs WITH PICKUPS
# A separate name is used so that df2 (the data with zone names) is not overwritten
df_pickups = df[['LocationID', 'pickups']]
df_pickups = df_pickups.groupby('LocationID').mean().reset_index()

# JOIN SHAPE DATA WITH PICKUPS (WIP)
#shape_data.merge(df_pickups, on='LocationID', how='left')
shape_data

| |LocationID|ZoneName|X|Y|
| --- | --- | --- | --- | --- |
|0|4|Alphabet City|[-8234500, -8234690, -8235841, -8235196, -8234...|[4971984, 4970961, 4971345, 4972514, 4972139, ...|
|1|12|Battery Park|[-8239385, -8239229, -8239175, -8239225, -8239...|[4968901, 4968851, 4968359, 4968208, 4968705, ...|
|2|13|Battery Park City|[-8239027, -8239307, -8239625, -8239795, -8239...|[4970990, 4969577, 4968770, 4969086, 4970076, ...|
|3|24|Bloomingdale|[-8233137, -8233194, -8234622, -8234426, -8232...|[4982697, 4982598, 4983379, 4983737, 4982971, ...|
|4|41|Central Harlem|[-8231824, -8231160, -8231269, -8231495, -8231...|[4984298, 4983928, 4983731, 4983733, 4983456, ...|
|5|42|Central Harlem North|[-8230335, -8230425, -8230276, -8230853, -8230...|[4988211, 4987721, 4985757, 4984482, 4984283, ...|
|6|43|Central Park|[-8234586, -8234638, -8235599, -8232951, -8231...|[4977725, 4977634, 4978245, 4982962, 4982431, ...|
|7|45|Chinatown|[-8237364, -8236814, -8236741, -8237552, -8238...|[4970257, 4970304, 4969591, 4969358, 4969955, ...|
|8|48|Clinton East|[-8236660, -8236710, -8237342, -8236313, -8235...|[4976319, 4976228, 4976581, 4978444, 4978092, ...|
|9|50|Clinton West|[-8237272, -8236313, -8237138, -8238060, -8237...|[4978991, 4978444, 4976951, 4977532, 4977508, ...|


In [None]:
# WIP: inline_data is not defined anywhere in this notebook yet
gdf = gpd.GeoDataFrame.from_features(inline_data)
gdf.head()

# Linear Chart, Pickups evolution over time

#### 4.1. Pickups evolution over Months
- The number of pickups is fairly constant over the year.
- The only variation is a small decrease in pickups over the summer months, July and August. One possible interpretation is that most taxis are taken by New Yorkers (not tourists), who travel away from the city in this period.

#### 4.2. Pickups evolution over Weeks
- This graph shows more clearly how taxi demand drops sharply around US federal holidays:
    - New Year's Day: 1st January.
    - Independence Day: 4th July.
    - Labor Day: first Monday of September.
    - Thanksgiving: 4th Thursday of November.

#### 4.3. Pickups evolution over a Single Week
- This graph shows the average number of pickups per weekday:
    - Monday and Sunday are slightly quieter than the other days.
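
The holiday dips described in 4.2 could also be checked numerically instead of by eye. A minimal sketch on synthetic data: demand is constant except for an artificial dip placed on 4th July, and the names `hourly`, `weekly`, `lowest_weeks` and `by_weekday` are illustrative, not part of the notebook:

```python
import pandas as pd

# Synthetic hourly pickups for 2017 (365 days x 24 hours), constant at 100
idx = pd.date_range("2017-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
hourly = pd.Series(100.0, index=idx, name="pickups")

# Artificial holiday dip on 4th July (assumption made for the demo)
is_july_4 = hourly.index.normalize() == pd.Timestamp("2017-07-04")
hourly.loc[is_july_4] = 20.0

# Weekly mean (weeks ending on Sunday), then the week with the lowest demand
weekly = hourly.resample("W").mean()
lowest_weeks = weekly.nsmallest(1)

# Mean pickups per weekday, ordered Monday to Sunday
cats = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
by_weekday = hourly.groupby(hourly.index.day_name()).mean().reindex(cats)
```

On the real data, passing a larger number to `nsmallest` would list several low-demand weeks, which can then be compared against the holiday calendar.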

In [None]:
# take 'datetime' and 'pickups' columns
df_grouped = df2[['datetime', 'pickups']]

# MONTHS DATA
df_grouped_months = df_grouped.groupby('datetime').mean().resample('M').mean()

# WEEKS DATA
df_grouped_weeks = df_grouped.groupby('datetime').mean().resample('W').mean()

# WEEKDAY DATA
df_grouped_weekday = df_grouped.groupby('datetime').mean().reset_index()
# create a categorical list to order weekday index
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# define key to groupby: weekday name (dt.weekday_name was removed in pandas 1.0)
key = df_grouped_weekday['datetime'].dt.day_name()
# groupby 'key' (weekday) and reorder index using categorical values
df_grouped_weekday = df_grouped_weekday.groupby(key)[['pickups']].mean().reindex(cats)

# PLOT MONTHS
df_grouped_months.plot()
plt.title('Pickups evolution over MONTHS')
plt.ylim(bottom=0)

# PLOT WEEKS
df_grouped_weeks.plot()
plt.title('Pickups evolution over WEEKS')
plt.ylim(bottom=0)

# PLOT WEEKDAYS
df_grouped_weekday.plot()
plt.title('Pickups evolution over A SINGLE WEEK')
plt.ylim(bottom=0)

plt.show()

# Scatter Plot, Relation between Precipitation and Pickups

To do: try removing the outliers from 'precipitation'.

To find them, draw a box plot of 'precipitation'.

The scatter plot does not show a clear relation between 'precipitation' and 'pickups'.

The histogram and box plot show that on the majority of days it does not rain.
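
Box plots conventionally flag as outliers the points lying more than 1.5 × IQR beyond the quartiles. A sketch of applying that rule directly, on made-up precipitation values; `precip` and the 1.5 factor are assumptions for illustration, and the notebook itself later uses a fixed cut-off of 0.5 instead:

```python
import pandas as pd

# Made-up precipitation readings (illustrative only)
precip = pd.Series([0.0, 0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 2.5],
                   name="precipitation")

# Quartiles and interquartile range
q1, q3 = precip.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits used by a standard box plot
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = precip[(precip < lower) | (precip > upper)]
kept = precip[(precip >= lower) & (precip <= upper)]
```

When most readings are zero, the IQR can itself be zero, in which case this rule flags every non-zero reading; a fixed threshold may then be the more practical choice.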

In [None]:
df_grouped = df2[['datetime', 'pickups', 'precipitation']]
df_grouped = df_grouped.groupby('precipitation').mean().reset_index()

fig, ax =plt.subplots(1,3,figsize=(15,5))

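sns.histplot(df2['precipitation'], kde=True, ax=ax[0])  # distplot is deprecated
sns.boxplot(data=df2, y='precipitation', ax=ax[1])
# x and y are passed as keywords, as required by recent seaborn versions
sns.regplot(x=df_grouped['precipitation'], y=df_grouped['pickups'], ax=ax[2])

plt.show()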

Next, I remove the rows with zero precipitation and plot again.

In [None]:
df2_prec = df2[df2['precipitation'] > 0]
df_grouped = df2_prec[['datetime', 'pickups', 'precipitation']]
df_grouped = df_grouped.groupby('precipitation').mean().reset_index()

fig, ax =plt.subplots(1,3,figsize=(15,5))

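sns.histplot(df2_prec['precipitation'], kde=True, ax=ax[0])  # distplot is deprecated
sns.boxplot(data=df2_prec, y='precipitation', ax=ax[1])
# x and y are passed as keywords, as required by recent seaborn versions
sns.regplot(x=df_grouped['precipitation'], y=df_grouped['pickups'], ax=ax[2])

plt.show()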

I still don't see any relationship. I will remove the outliers with precipitation of 0.5 or more.

In [None]:
filter1 = df2['precipitation'] > 0
filter2 = df2['precipitation'] < 0.5
df2_prec = df2[filter1 & filter2]
df_grouped = df2_prec[['datetime', 'pickups', 'precipitation']]
df_grouped = df_grouped.groupby('precipitation').mean().reset_index()

fig, ax =plt.subplots(1,3,figsize=(15,5))

# x and y are passed as keywords, as required by recent seaborn versions
sns.regplot(x=df_grouped['precipitation'], y=df_grouped['pickups'], ax=ax[2])
sns.boxplot(data=df2_prec, y='precipitation', ax=ax[1])
sns.histplot(df2_prec['precipitation'], kde=True, ax=ax[0])  # distplot is deprecated
plt.show()

The scatter plot is very clear: **there is no correlation between precipitation and pickups.**
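
The visual impression could be backed by a correlation coefficient. A minimal sketch on synthetic data generated so that the two columns are independent; `demo` and the choice of Pearson and Spearman coefficients are assumptions for illustration, not results from the taxi data:

```python
import numpy as np
import pandas as pd

# Synthetic, independent columns (not the notebook's data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "precipitation": rng.exponential(scale=0.1, size=2000),
    "pickups": rng.normal(loc=400, scale=50, size=2000),
})

# Pearson: strength of a linear relationship
pearson = demo["precipitation"].corr(demo["pickups"])

# Spearman: Pearson correlation of the ranks, sensitive to any monotonic relationship
spearman = demo["precipitation"].rank().corr(demo["pickups"].rank())
```

Values close to zero for both coefficients would support the reading of the scatter plot; on the real data the same two lines would be applied to `df2['precipitation']` and `df2['pickups']`.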

# Pairwise Relationships
As the number of pickups is very stable over time, I will analyse only one month (so that it runs faster on my computer).

In [None]:
df2 = df[df['month'] == 2]
sns.pairplot(df2[['pickups','day', 'hour', 'LocationID',
       'dayofweek', 'isweekend', 'precipitation']])

Relations found:

- **Pickups - isweekend**. There are more pickups during the weekend.
- **Pickups - dayofweek**. There are more pickups on Saturday, Friday and Thursday, in that order. This variable is related to 'isweekend' but describes the distribution of pickups with more granularity, so I will keep it and remove 'isweekend'.
- **Pickups - hour**. There are more pickups between 23:00 and 3:00. This could be because there is less public transport at those hours.
- **Pickups - day**. There is a clear weekly pattern, and this information is already given by 'dayofweek', so I will remove 'day'.

# Pickups in Rainy Day vs Not Rainy Days

I compared the average number of pickups on rainy days versus non-rainy days.

Contrary to my expectation, there seems to be no relation at all between rain and pickups.

In [None]:
df2 = df[['month', 'day', 'precipitation','pickups']]
df2 = df2.groupby(['month', 'day']).mean().reset_index()

# convert the precipitation variable to categorical: 0 = it didn't rain, 1 = it rained.
df2.loc[(df2.precipitation != 0),'precipitation']=1
df2 = df2.groupby('precipitation')['pickups'].mean().reset_index()

df2.head(20)

To do: count how many days per year it rains.
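
Counting the rainy days could look like the sketch below. It runs on a toy hourly table with the same column names as `df`. The rule "any precipitation above zero makes a rainy day" and the names `demo`, `daily` and `rained` are assumptions:

```python
import pandas as pd

# Toy hourly records with the same column names as `df` (illustrative values)
demo = pd.DataFrame({
    "month": [1, 1, 1, 1, 2, 2],
    "day": [1, 1, 2, 2, 1, 1],
    "precipitation": [0.0, 0.2, 0.0, 0.0, 0.1, 0.3],
    "pickups": [100.0, 120.0, 90.0, 110.0, 80.0, 100.0],
})

# One row per calendar day: mean precipitation and mean pickups
daily = (
    demo.groupby(["month", "day"])[["precipitation", "pickups"]]
        .mean()
        .reset_index()
)

# A day counts as rainy if its mean precipitation is above zero
daily["rained"] = daily["precipitation"] > 0

rainy_days = int(daily["rained"].sum())                       # number of rainy days
mean_pickups = daily.groupby("rained")["pickups"].mean()      # rainy vs non-rainy average
```

Keeping the rainy-day count next to the two averages also shows how many days each average is based on.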

Final conclusions:

Do not use 'day' or 'isweekend'.