# Analyzing the NYC Subway Dataset

## Section 1: Statistical Test
The statistical test chosen to analyze this dataset is the **Mann-Whitney U test** because, as will be seen below, the samples do not follow a Gaussian (normal) distribution. Unlike Welch's t-test, the Mann-Whitney U test does not assume that the data come from any particular probability distribution, so it is appropriate here.

Furthermore, the decision to use the Mann-Whitney U test could be strengthened by performing a **Shapiro-Wilk test**, which tests whether a sample is drawn from a normal distribution.
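
As a minimal sketch of such a check, the snippet below runs `scipy.stats.shapiro` on a synthetic right-skewed sample. The sample, the helper name `looks_normal`, and the 0.05 threshold are assumptions made for illustration; they are not taken from the subway data. Note that SciPy warns that the Shapiro-Wilk p-value may be inaccurate for samples larger than about 5000 observations, so on the full dataset the result should be read with care.

```python
import numpy as np
import scipy.stats

# Hypothetical stand-in for hourly entries: a right-skewed synthetic sample
rng = np.random.RandomState(0)
skewed_sample = rng.exponential(scale=1000.0, size=500)

def looks_normal(sample, alpha=0.05):
    """Run Shapiro-Wilk; the last value is True when normality is NOT rejected."""
    W, p = scipy.stats.shapiro(sample)
    return W, p, bool(p >= alpha)

W_stat, p_shapiro, normal_ok = looks_normal(skewed_sample)

print('Shapiro-Wilk W: {0}'.format(W_stat))
print('P-Value: {0}'.format(p_shapiro))
print('Normality not rejected: {0}'.format(normal_ok))
```

A small p-value here means the normality assumption is rejected, which is the situation in which a non-parametric test such as Mann-Whitney U is the safer choice.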

Moreover, since the Mann-Whitney U test is used here only to test whether the two samples come from the same population, without specifying the direction of any difference, it will be a **two-tailed test**.
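
Whether `scipy.stats.mannwhitneyu` reports a one-sided or a two-sided p-value by default depends on the SciPy version, so it may be safer to request the two-tailed result explicitly through the `alternative` argument. The sketch below uses two synthetic samples (an assumption, not the subway data) and compares the two-sided p-value with a one-sided one.

```python
import numpy as np
import scipy.stats

# Hypothetical samples standing in for rainy and clear entries
rng = np.random.RandomState(42)
sample_a = rng.exponential(scale=2000.0, size=300)
sample_b = rng.exponential(scale=1800.0, size=300)

# Explicit two-tailed test: H1 is "the distributions differ", in either direction
U_two, p_two = scipy.stats.mannwhitneyu(sample_a, sample_b,
                                        alternative='two-sided')

# One-tailed test for comparison: H1 is "sample_a tends to be larger"
U_one, p_one = scipy.stats.mannwhitneyu(sample_a, sample_b,
                                        alternative='greater')

print('Two-sided p-value: {0}'.format(p_two))
print('One-sided p-value: {0}'.format(p_one))
```

If a SciPy release returns only a one-sided p-value, it has to be doubled before it is compared with the critical value of a two-tailed test.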

### P-Critical Value
$$P_{\text{Critical}} = 0.05$$

### Hypotheses
**Null Hypothesis**: The samples of entries on rainy days and entries on clear days came from the same general population.
$$H_0: \mu_{\text{Rain}}=\mu_{\text{Clear}}$$

**Alternative Hypothesis**: The samples of entries on rainy days and entries on clear days came from different populations.
$$H_1: \mu_{\text{Rain}}\neq\mu_{\text{Clear}}$$


In [19]:
import pandas as pd
import numpy as np
import scipy.stats

# The path of the dataset CSV file
path = r'improved-dataset/turnstile_weather_v2.csv'

dataFrame = pd.read_csv(path)

# Storing data of entries when rain and when clear
withRain = dataFrame['ENTRIESn_hourly'][dataFrame['rain'] == 1]
withoutRain = dataFrame['ENTRIESn_hourly'][dataFrame['rain'] == 0]

# Calculating means of samples
mean_withRain = np.mean(withRain)
mean_withoutRain = np.mean(withoutRain)

# Calculating Mann-Whitney U Test
U, p = scipy.stats.mannwhitneyu(withRain, withoutRain)

print 'Mean of sample entries when raining: ', mean_withRain
print 'Mean of sample entries when clear: ', mean_withoutRain
print 'Test Statistics (U value): ', U
print 'P-Value: ', p

Mean of sample entries when raining:  2028.19603547
Mean of sample entries when clear:  1845.53943866
Test Statistics (U value):  153635120.5
P-Value:  2.74106957124e-06


### Results
- The null hypothesis is rejected because **p-value < p-critical**.
- If the two samples came from the same population, the probability of obtaining a U value at least this extreme would be less than five percent. The chosen critical value also caps the probability of a Type I error at 5%.
- It is therefore reasonable to conclude that the samples come from different populations and that there is a significant difference in the number of NYC subway entries when it is raining and when it is not.

## Section 2: Linear Regression



In [41]:
import statsmodels.api as sm

# Storing input and output variables that would be used in the model
features = dataFrame[['hour', 'rain', 'meantempi']]
values = dataFrame['ENTRIESn_hourly']

# Adding variable to calculate intercept
featuresWithConstant = sm.add_constant(features)

# Calculating linear regression using Ordinary Least Squares
model = sm.OLS(values, featuresWithConstant)
results = model.fit()

# Storing results
intercept = results.params.iloc[0]  # constant term added by sm.add_constant
params = results.params[1:]

# Function to compute R^2
def compute_r_squared(data, predictions):
    # Sum of squared difference between data and predictions
    dp_diff = np.square(data - predictions).sum()
    
    # Sum of squared difference between data and its mean
    dmean_diff = np.square(data - np.mean(data)).sum()
    
    r_squared = 1 - (dp_diff / dmean_diff)
    
    return r_squared

# Making predictions based on parameters
predictionValues = intercept + np.dot(features, params)

# Compute R^2
r_squared = compute_r_squared(values, predictionValues)

print 'Intercept: ', intercept
print 'Parameters:\n', params, '\n'
print 'R Squared: ', r_squared

Intercept:  1190.29477449
Parameters:
hour         122.045050
rain         136.603716
meantempi     -8.883175
dtype: float64 

R Squared:  0.0833210998821
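
The coefficient of determination computed above follows $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. As a rough cross-check of that formula, the sketch below fits ordinary least squares with NumPy alone on synthetic data and applies the same definition. The data, the coefficients 3 and 2, and the helper name `r_squared_from_fit` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus Gaussian noise
rng = np.random.RandomState(1)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)

def r_squared_from_fit(x_values, y_values):
    """Fit y = b0 + b1 * x by least squares and return (coefficients, R^2)."""
    # Design matrix with a leading column of ones for the intercept
    X = np.column_stack([np.ones_like(x_values), x_values])
    coefficients = np.linalg.lstsq(X, y_values, rcond=None)[0]
    predictions = X.dot(coefficients)
    residual_ss = np.square(y_values - predictions).sum()
    total_ss = np.square(y_values - y_values.mean()).sum()
    return coefficients, 1.0 - residual_ss / total_ss

coefficients, r2 = r_squared_from_fit(x, y)
print('Intercept and slope: {0}'.format(coefficients))
print('R Squared: {0}'.format(r2))
```

For a model fitted with statsmodels, the same quantity is also exposed as the `rsquared` attribute of the fitted results object, which offers another way to validate a hand-written function.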


### Results


## Section 3: Visualization

In [26]:
from ggplot import *



### Discussion

## Section 4: Conclusion