# Analyzing the NYC Subway Dataset
### By Ashutosh Singh

### Section 0. References

1. ggplot documentation at [yhathq.com](yhathq.com)
2. http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
3. Udacity Discussion Forum

### Section 1: Statistical test
#### 1.1 Statistical test
The Mann-Whitney U test is used to analyze the subway data.

A one-tailed p-value is used. The Python implementation of the Mann-Whitney U test returns a one-tailed p-value; multiplying it by 2 gives the two-tailed p-value.

**Null hypothesis**: There is no significant difference in subway ridership between rainy and non-rainy days.

The p-value obtained is 0.0249, which is less than the critical value of alpha = 0.05.
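
To make the relationship between the one-tailed and two-tailed p-values concrete, here is a minimal sketch of the Mann-Whitney U test in plain Python. It is not the implementation used for the results in this report. It assumes the normal approximation with a tie correction and no continuity correction, and the samples `rainy` and `dry` are made-up numbers, not values from the subway data.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, one-tailed p, two-tailed p) using the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    combined = list(sample_a) + list(sample_b)
    # Sort the pooled values while remembering where each one came from.
    pooled = sorted((value, idx) for idx, value in enumerate(combined))
    ranks = [0.0] * (n_a + n_b)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of equal values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        # Tied values all receive the average of the ranks they span.
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = average_rank
        run_length = j - i + 1
        tie_term += run_length ** 3 - run_length
        i = j + 1

    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0

    n = n_a + n_b
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)

    # One-tailed p-value: upper-tail probability of |z| under the standard normal.
    p_one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    p_two_tailed = min(1.0, 2 * p_one_tailed)
    return u_a, p_one_tailed, p_two_tailed


# Made-up illustrative samples (not taken from the subway dataset).
rainy = [12, 15, 18, 20, 22, 25]
dry = [8, 10, 11, 13, 14, 16]

u, p_one, p_two = mann_whitney_u(rainy, dry)
print("U = %.1f, one-tailed p = %.4f, two-tailed p = %.4f" % (u, p_one, p_two))
```

The two-tailed value is simply twice the one-tailed value (capped at 1), which is the doubling described above. The sketch does not handle the degenerate case where every value is tied.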
#### 1.2 Reasoning
The Mann-Whitney U test is applicable in this case because we make no prior assumption about how the values of the two samples differ. We only test whether one population tends to differ from the other.
#### 1.3 Results:
| Data | Mean value | p-value |
|---|---|---|
| With rain | 1105.447 | 0.0249 |
| Without rain | 1090.279 | 0.0249 |
#### 1.4 Interpretation
The table above shows that the mean values differ and that the p-value is below the alpha level of 0.05. A p-value this low means it is very unlikely that the two samples come from the same population. We can therefore reject the null hypothesis that ridership does not change between rainy and non-rainy days, and conclude that rain has a statistically significant effect on ridership.

### Section 2: Linear Regression
#### 2.1 Algorithm 
To compute the coefficients of the linear model that fits the ridership data, we used Ordinary Least Squares (OLS) from the Statsmodels package.
#### 2.2 Features 
The features used in the model are 
*  Rain
*  Hour
*  Mintempi

I also included dummy variables in the model, generated from the UNIT column of the dataframe.
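
As an illustration of how dummy variables enter a least-squares fit, here is a small plain-Python sketch. It is not the Statsmodels code used for this report: it solves the normal equations directly, and the unit names, hours, rain flags and the coefficients used to generate `y` are all made up for the example.

```python
def dummy_columns(values):
    """One-hot encode a categorical column, dropping the first category as the baseline."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is absorbed by the intercept
    encoded = [[1.0 if value == category else 0.0 for category in kept] for value in values]
    return kept, encoded


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up illustrative data (not taken from the subway dataset).
units = ["R001", "R002", "R003", "R001", "R002", "R003", "R001", "R002"]
hour = [0, 4, 8, 12, 16, 20, 4, 8]
rain = [0, 1, 0, 1, 0, 1, 1, 0]
unit_effect = {"R001": 0.0, "R002": 50.0, "R003": 80.0}
y = [100.0 + 20.0 * r + 5.0 * h + unit_effect[u] for r, h, u in zip(rain, hour, units)]

kept, dummies = dummy_columns(units)
rows = [[1.0, float(r), float(h)] + d for r, h, d in zip(rain, hour, dummies)]
names = ["Intercept", "rain", "hour"] + ["UNIT_" + c for c in kept]

coefficients = fit_ols(rows, y)
for name, value in zip(names, coefficients):
    print("%-10s %8.2f" % (name, value))
```

Because `y` is generated without noise, the fit recovers the coefficients used to build it. One unit is dropped as the baseline so that the dummy columns are not collinear with the intercept.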
#### 2.3 Why the feature selection 
I chose the features above for the following reasons:

a.  **Rain**: When it rains, people tend to look for the safest and easiest way to travel. Rain can also cause traffic jams and other delays that make people late, so I expected rain to increase ridership.

b.  **Hour**: Subway entries depend on the hour of the day. Most jobs start between 8 and 10 am, so more people travel during that period.

c. **MinTempi**: I chose this feature because it showed a positive (but low) correlation with the hourly entries, and adding it to the regression raised my R^2 value from 0.4783 to 0.4795. The gain is small, so this feature could be dropped to keep the model simple.
#### 2.4 Coefficients of features in Linear model 
The following table lists the coefficients calculated by the OLS model:

| Feature | Coefficient |
|---|---|
| Intercept | 1591.94 |
| Hour | 65.34 |
| Rain | 62.09 |
| MinTempi | -13.10 |

The remaining coefficients belong to the dummy variables and are omitted here.
#### 2.5 R-squared value of model  
The R-squared value of the model is 0.4795.
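
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows that calculation in plain Python; the `observed` and `predicted` values are made up and are not output from the model in this report.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / float(len(observed))
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Made-up illustrative values (not taken from the subway dataset).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed, predicted)
print("R^2 = %.4f" % r2)
```

A model that always predicts the mean of the observed values scores 0, and a perfect fit scores 1.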

In [1]:
### Insert the residual graph here

### Section 3 : Visualization 
#### 3.1 Visualization for the hourly entries during rainy and non-rainy days. 
![](plots/rainyvsnon_Rainy.png)

At first glance the graph suggests that there are more entries on non-rainy days than on rainy days, but that is a false interpretation. The bars for non-rainy days are taller only because the dataset contains many more records for non-rainy days than for rainy days.
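
One way to compare the two groups despite their different sizes is to plot relative frequencies instead of raw counts. The sketch below is only an illustration in plain Python and is not how the figure above was produced. The helper names and the two samples are made up, with the non-rainy sample built as three copies of the rainy one.

```python
def histogram(values, bin_width, n_bins):
    """Count how many values fall into each fixed-width bin starting at 0."""
    counts = [0] * n_bins
    for value in values:
        index = int(value // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw bin counts to fractions of the total so groups of different size are comparable."""
    total = float(sum(counts))
    return [count / total for count in counts]


# Made-up illustrative samples: same distribution, but three times as many non-rainy rows.
rainy_entries = [100, 300, 300, 700]
non_rainy_entries = rainy_entries * 3

rainy_counts = histogram(rainy_entries, bin_width=500, n_bins=2)
non_rainy_counts = histogram(non_rainy_entries, bin_width=500, n_bins=2)

print(rainy_counts, non_rainy_counts)
print(relative_frequencies(rainy_counts), relative_frequencies(non_rainy_counts))
```

The raw counts differ by a factor of three even though both samples have the same shape, while the relative frequencies are identical.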


In [4]:
%matplotlib inline

In [None]:
import pandas
from ggplot import *

def floor_decade(date_value):
    """Return 0 for days 1-30 of the month and 10 for day 31 (helper not used below)."""
    return (date_value.day // 31) * 10

# Load the turnstile and weather data
data = pandas.read_csv('turnstile_weather_v2.csv')

# Keep rows with more than one hourly entry, then split on the rain indicator
data = data[data['ENTRIESn_hourly'] > 1]
data_rain = data[data['rain'] > 0]
data_no_rain = data[data['rain'] == 0]
# Keep only the first 9433 non-rainy rows
data_no_rain = data_no_rain[:9433]

# Histogram of hourly entries on non-rainy days (disabled)
#plot2 = ggplot(data_no_rain, aes(x='ENTRIESn_hourly', color='rain', fill='rain')) + \
#    geom_histogram(binwidth=50, alpha=0.6) + scale_x_continuous(limits=(0, 10000)) + \
#    ggtitle('Subway Ridership on Non-Rainy Days') + xlab('Entries Each Hour') + ylab('Frequency')

# Box plot of hourly entries for each hour of the day.
# ENTRIESn_hourly is the per-interval count; ENTRIESn is the cumulative turnstile counter.
plot3 = ggplot(data, aes(x='hour', y='ENTRIESn_hourly')) + \
    geom_boxplot() + \
    ggtitle('Subway Ridership by Hour of Day') + xlab('Hour of day') + ylab('Hourly entries')

print(plot3)

#### 3.2 Visualization for ridership by time of day.
![](plots/rideByHourOfDay.png)

From the visualization in Section 3.2, it is clear that the number of subway entries increases at 12 pm and 8 pm. However, this may be due to the aggregation of entries into 4-hour slots.

### Section 4 : Conclusion 

**4.1** From the analysis of the data, it can be interpreted that more people ride the NYC subway when it is raining than when it is not.

**4.2** This conclusion rests on the following reasons:

a. The ridership samples for rainy and non-rainy periods are statistically different, as shown by the p-value. Since the mean ridership is higher during rain, we can say that more people ride the subway when it is raining.

b. The parameters of the linear regression model are not conclusive on their own, but the coefficient of the rain parameter is positive and sizeable relative to the other parameters. This suggests that rain has a positive effect on subway ridership.

### Section 5 : Reflection
#### 5.1 Shortcomings of the method and analysis
This dataset is good for basic analysis of ridership and for learning the concepts, but it covers only May 2011. It therefore describes ridership in a single month, which may depend on other factors.

The hourly entries also appear to be aggregated into 4-hour intervals. This aggregation may give misleading information about ridership at a particular hour.

Linear regression is a basic tool for getting a feel for the data. It works well when the data can be modelled linearly, but it may not provide the best analysis when the relationships are non-linear.