***
# Analyzing the NYC Subway Dataset
***

***
## Section 0 : References

http://felixfan.github.io/rstudy/2013/11/27/ggplot2-remove-grid-background-margin/

http://stackoverflow.com/questions/24564789/python-ggplot-syntax-to-annotate-chart-with-statsmodels-variable

http://stackoverflow.com/questions/17690738/in-pandas-how-do-i-convert-a-string-of-date-strings-to-datetime-objects-and-put

http://stackoverflow.com/questions/25290576/highlight-weekends-using-ggplot

http://stackoverflow.com/questions/28009370/get-weekday-for-datetime-in-pandas-dataframe
***

***
## Section 1 : Statistical Test

__Mann-Whitney U test__  
Reasons for using this test:
  * The entries per hour on rainy and non-rainy days do not follow a normal distribution.
  * It is a non-parametric test, so it does not assume that the data follow any particular probability distribution.
  

__Null Hypothesis $H_0$:__ Let $X$ and $Y$ denote the populations of ridership on rainy and non-rainy days respectively, and let $x$ and $y$ be random samples drawn from $X$ and $Y$. The null hypothesis states that $P(x > y) = 0.5$, i.e., a value drawn from rainy days is equally likely to be higher or lower than a value drawn from non-rainy days.


__A two-tailed test is used because we want to know whether samples drawn from rainy days differ from samples drawn from non-rainy days in either direction, which is what the standard Mann-Whitney U test checks.__


__P-critical value = 0.05__


__Results__  
  * Mean on rainy days: 1105.446
  * Mean on non-rainy days: 1090.279
  * P-value: $0.025 \times 2 = 0.05$ (one-sided value rounded to three decimal places)


__The P-value returned is for a one-sided test, so multiplying it by 2 gives the P-value for a two-sided test, which is 0.05 after rounding. Because this equals the P-critical value rather than falling below it, there is not enough evidence to reject the null hypothesis. The result is borderline, though, and the higher mean on rainy days suggests that samples drawn from rainy days tend to be somewhat higher than samples drawn from non-rainy days.__
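
To make the test statistic concrete, here is a minimal pure-Python sketch of the computation. It is not the code used for the results above: the function name `mann_whitney_u` and the toy samples are hypothetical, and the p-value uses the normal approximation without tie or continuity corrections, so it will differ slightly from what a statistics library reports.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, estimated P(x > y), two-sided p-value) for samples x and y.

    Sketch only: normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs where the x value is larger; ties count half.
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    # Share of pairs "won" by x (ties counted half); H0 says this is 0.5.
    prob_x_greater = u / (n1 * n2)
    # Mean and standard deviation of U under H0.
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Two-sided p-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u, prob_x_greater, p_two_sided


# Hypothetical toy samples, not taken from the subway data.
rainy = [12, 15, 9, 20, 17]
non_rainy = [11, 14, 8, 19, 16]
u, prob, p = mann_whitney_u(rainy, non_rainy)
print(u, prob, round(p, 3))
```

The pairwise loop is quadratic in the sample sizes, so it is only meant for small illustrations; on the full turnstile data a library routine would be the practical choice.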
***

***
## Section 2 : Linear Regression

__Approach__ : OLS using statsmodels  


__Features Used__ : The features used are 'rain', 'precipi', 'Hour', 'meantempi' and 'UNIT' 


__Reason for using these features__ : I started with the obvious weather features such as 'rain' and 'fog', but they did not give a better $R^2$ value. By trial and error I ended up adding the feature 'UNIT', which enters the model as a set of dummy variables.   


__Parameters of the non dummy variables__:  
  * rain :          29.464529
  * precipi :       28.726380
  * Hour   :        65.334565
  * meantempi :    -10.531825  
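
Written out, the fitted model has roughly the following form, where $\hat{y}$ is the predicted number of entries per hour. The intercept $\beta_0$ and the dummy coefficients $\gamma_u$ are not reported here, so they are left symbolic (these symbols are introduced only for illustration), and the listed coefficients are rounded to two decimal places. The sum runs over the 'UNIT' dummy variables included in the model.

$$
\hat{y} = \beta_0 + 29.46\,\mathrm{rain} + 28.73\,\mathrm{precipi} + 65.33\,\mathrm{Hour} - 10.53\,\mathrm{meantempi} + \sum_{u} \gamma_u\,\mathbf{1}[\mathrm{UNIT} = u]
$$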
  
  
__$R^2$__ : 0.47924770782. The model explains less than half of the variance in hourly entries, so I don't think it is an ideal model for predicting them.  
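
For reference, $R^2$ compares the model's squared errors with those of a baseline that always predicts the mean. The sketch below shows that calculation in plain Python; the function name `r_squared` and the sample values are hypothetical and are not taken from the subway data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / float(len(observed))
    # Squared error of the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Squared error of always predicting the mean of the observations.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only.
observed = [100.0, 250.0, 400.0, 550.0]
predicted = [120.0, 240.0, 390.0, 560.0]
print(r_squared(observed, predicted))
```

A value of 1 means the predictions match the observations exactly, while a value of 0 means the model does no better than predicting the mean every time.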
***

***
## Section 3 : Visualization

The histogram below shows that the frequency of observations is consistently higher across all the bins for non-rainy days. Note that this compares counts of observations in each bin, not the level of ridership.

<img src="plot1.png">

The scatter plot below shows that subway usage is markedly lower on weekends. The one weekday with similarly low usage is May 30, which turns out to be Memorial Day, a US holiday. Among the remaining weekdays, usage is slightly lower on Mondays.

<img src="plot2.png">
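
The weekday/weekend split and the holiday exception can be checked with the standard library alone. This is a small sketch rather than part of the original analysis: the helper names `day_type` and `memorial_day` are hypothetical, and the year 2011 is an assumption, since the year of the data is not stated in this report.

```python
import datetime


def day_type(date):
    """Label a date as 'Weekend' (Saturday or Sunday) or 'Weekday'."""
    # weekday(): Monday = 0 ... Sunday = 6
    return "Weekend" if date.weekday() >= 5 else "Weekday"


def memorial_day(year):
    """Return the last Monday of May, when Memorial Day is observed."""
    day = datetime.date(year, 5, 31)
    # Step back from May 31 until a Monday is reached.
    while day.weekday() != 0:
        day -= datetime.timedelta(days=1)
    return day


# Assumption: the data is from 2011 (the year is not stated in this report).
may_30 = datetime.date(2011, 5, 30)
print(day_type(may_30), memorial_day(2011) == may_30)
```

Under that assumption, May 30 is labeled a weekday by the calendar but coincides with Memorial Day, which is consistent with the low point among the weekdays in the plot.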

## Code used for Visualization  

This code does not run on the Udacity platform, possibly because of compatibility issues between the latest pandas version and the Python ggplot package.

import pandas as pd
from ggplot import *

def get_plots (turnstile_data):

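    df =  turnstile_data
    df1 = df[['ENTRIESn_hourly','rain']]
    # Legend text and its position on the histogram
    # (a dict replaces DataFrame.from_items, which newer pandas versions no longer provide)
    dataText = pd.DataFrame({'x': [3000, 3000],
                             'y': [5000, 4500],
                             'text': ['Black: Non rainy days', 'Blue: Rainy days']})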

    # #########First plot
    plot =  ggplot(df1,aes('ENTRIESn_hourly')) + \
            geom_histogram(data=df[df['rain'] == 0], fill = "black", alpha = 0.5,binwidth = 50) + \
            geom_histogram(data=df[df['rain'] == 1], fill = "blue", alpha = 0.5,binwidth = 50) +\
            scale_x_discrete(limits = [0, 5000]) +\
            scale_y_discrete(limits = [0, 6000]) +\
            theme_bw() +\
            xlab("Entries per Hour") +\
            ylab("Frequency") +\
            ggtitle("Distribution of entries per hour during rainy and normal days") +\
            geom_text(aes(x='x', y='y', label='text'), data=dataText)

    print(plot)

    # Second plot: mean entries per hour for each date
    # groupby sorts by date, so the dates are taken from the grouped result
    # rather than from unique(), which keeps the order of appearance
    df1 = df[['ENTRIESn_hourly','DATEn']]
    df1_groupMean = df1.groupby('DATEn')['ENTRIESn_hourly'].mean()

    df2 = pd.DataFrame({'Dates': df1_groupMean.index,
                        'Mean': df1_groupMean.values})

    df2['Dates'] = pd.to_datetime(pd.Series(df2['Dates']))

    # dayofweek: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend
    df2['weekdays'] = df2['Dates'].dt.dayofweek.map(
        lambda d: 'Weekend' if d >= 5 else 'Weekday')


    plot2 = ggplot(df2,aes(y = 'Mean', color = 'weekdays')) +\
            geom_point(aes(x='Dates')) +\
            xlab("Date") +\
            ylab("Mean entries per hour") +\
            ggtitle("Weekday vs Weekend Ridership") +\
            theme_bw()

    print(plot2)

    

***
## Section 4 : Conclusion

From my analysis, slightly more people appear to travel on rainy days, but the evidence is weak. In the statistical analysis of Section 1 the mean on rainy days was higher than on non-rainy days, yet the P-value was equal to the P-critical value of 0.05, so we could not reject the null hypothesis.


Also, in the linear regression the coefficient of the feature 'rain' is 29.464529. This shows a positive relationship between 'rain' and the output variable, entries per hour.

***

***
## Section 5 : Reflection

The dataset provided contains data for only one month, so it cannot be representative of the entire population. Many other factors are outside our control in this study: there are too many participants and too many factors influencing their choice of whether to take the subway. Hence we cannot draw a causal inference from the analysis carried out in this study.

Also, the model used to predict the hourly entries has an $R^2$ value below 0.5, so its results may not be reliable.
***