P1: Analyzing the NYC Subway Dataset - Andrei Iusan.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
	<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
	<title>Analyzing the NYC Subway Dataset</title>
	<meta name="generator" content="LibreOffice 4.3.7.2 (Linux)"/>
	<meta name="created" content="00:00:00"/>
	<meta name="changed" content="2015-05-21T13:34:53.402078982"/>
	<meta name="created" content="00:00:00">
	<meta name="changed" content="2015-05-18T17:41:17.158068914">
	<meta name="created" content="00:00:00">
	<meta name="changed" content="2015-05-18T15:35:56.471062978">
	<style type="text/css">
		p { color: #000000 }
		h1 { color: #000000 }
		h2 { color: #000000 }
		h2.western { font-family: "Albany", sans-serif; font-size: 16pt }
		h2.cjk { font-family: "Droid Sans Fallback"; font-size: 16pt }
		h2.ctl { font-family: "FreeSans"; font-size: 16pt }
	</style>
</head>
<body lang="en-US" text="#000000" dir="ltr" style="background: transparent">
<p align="center" style="margin-top: 0.17in; page-break-after: avoid"><a name="h.giiv1ib353ge"></a>
<font face="Albany, sans-serif"><font size="6" style="font-size: 28pt"><b>Analyzing
the NYC Subway Dataset</b></font></font></p>
<p align="center" style="margin-top: 0.04in; margin-bottom: 0.08in; page-break-after: avoid">
<font face="Albany, sans-serif"><font size="5" style="font-size: 18pt">Do
more people ride the NY Subway when it rains?</font></font></p>
<p align="right">Author: Andrei Iusan</p>
<p><br/>
<br/>

</p>
<h1><a name="h.o2v5a17t769h"></a>Section 0. References</h1>
<ol>
	<li/>
<p>D. M. Diez, C. D. Barr, M. Çetinkaya-Rundel, <b>OpenIntro
	Statistics, Second Edition</b></p>
	<li/>
<p>H. B. Mann, D. R. Whitney, <b>On a Test of Whether one of
	Two Random Variables is Stochastically Larger than the Other</b> <i>The
	Annals of Mathematical Statistics, Volume 18, Number 1 (1947),
	50-60.</i>
	</p>
	<li/>
<p>Skipper Seabold, Josef Perktold, <b>Statsmodels:
	Econometric and Statistical Modeling with Python</b>, <i>Proceedings
	of the 9th Python in Science Conference, 57-61 (2010)</i>,
	<a href="http://statsmodels.sourceforge.net/">http://statsmodels.sourceforge.net/</a></p>
	<li/>
<p>Jones E, Oliphant T, Peterson P, et al. <b>SciPy: Open
	Source Scientific Tools for Python, 2001-</b>, <a href="http://www.scipy.org/">http://www.scipy.org/</a></p>
	<li/>
<p>Wes McKinney. <b>Data Structures for Statistical Computing
	in Python</b>, <i>Proceedings of the 9th Python in Science
	Conference, 51-56 (2010)</i>, <a href="http://pandas.pydata.org/">http://pandas.pydata.org/</a></p>
	<li/>
<p>John D. Hunter. <b>Matplotlib: A 2D Graphics Environment</b>,
	<i>Computing in Science &amp; Engineering, 9, 90-95 (2007)</i>,
	<a href="http://matplotlib.org/">http://matplotlib.org/</a>
	</p>
</ol>
<h1><a name="h.dknrkp3j7jd4"></a>Section 1. Statistical Test</h1>
<h2 class="western">1.1 Statistical test design</h2>
<p style="margin-bottom: 0in">A statistical test can be used to
determine whether weather affects ridership in the NY Subway system. First,
the data is divided into two datasets: <b>Sunny</b>,
containing all entries where rain equals 0, and
<b>Rainy</b>, containing all entries where rain equals 1. To
choose an appropriate statistical test, one can plot a histogram
of the data (Figure 1). Figure 1 (a) shows a histogram of hourly
entries: the x axis shows the number of hourly entries, and the height
of each bar shows how many records in the dataset fall in that
interval (e.g. the number of times that a subway unit recorded
0-100 entries). Because there is more data for clear days than for rainy
days, the two histograms have different scales. The bars for clear days
are taller not necessarily because people ride more on
clear days, but because the dataset contains more samples for clear days
than for rainy days. This issue can be resolved by normalizing the
histograms so that the bar heights of each sum to one.
The resulting histogram is presented in Figure 1 (b).</p>
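<p>The normalization step can be sketched as follows. This is a minimal
illustration, not the code used to produce Figure 1: the two samples are
synthetic, and the names <code>sunny</code>, <code>rainy</code> and
<code>normalized_histogram</code> are assumptions made for the example.</p>

```python
import numpy as np

# Synthetic, right-skewed stand-ins for the two samples (not the real data).
rng = np.random.default_rng(0)
sunny = rng.exponential(scale=1800.0, size=3000)  # more samples, like clear days
rainy = rng.exponential(scale=2000.0, size=1000)  # fewer samples, like rainy days

# Shared bin edges so that the two histograms are directly comparable.
bins = np.linspace(0.0, max(sunny.max(), rainy.max()), 51)


def normalized_histogram(values, bins):
    """Return bin heights that sum to one (relative frequencies)."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()


sunny_heights = normalized_histogram(sunny, bins)
rainy_heights = normalized_histogram(rainy, bins)
```

<p>Dividing by the total count removes the effect of the different sample
sizes, so the two sets of heights can be drawn on the same axis.</p>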

<p style="margin-bottom: 0in">The histogram is strongly right skewed,
so we need a non-parametric statistical test to compare the two
distributions. We can apply the Mann-Whitney U test to answer the
question. The Null Hypothesis can be formulated as “There is no
difference between the two datasets” and the Alternative Hypothesis
as “The values in the Sunny and Rainy datasets came from different
populations.”
In terms of ridership, the Null
Hypothesis states that the number of people riding the subway is roughly
the same regardless of the weather conditions, and
the Alternative Hypothesis states that rain affects the average number of
people that ride the subway.
Since the alternative does not specify a direction, we compute a two-sided test.</p>
<h3 style="margin-bottom: 0in">More formally:</h3>
<p><b>Null Hypothesis:</b>
μ<sub>Sunny</sub> = μ<sub>Rainy</sub> </p>
<p><b>Alternative Hypothesis:</b>
μ<sub>Sunny</sub> ≠ μ<sub>Rainy</sub> </p>
<p><b>Significance level:</b>
α = 0.05</p>
<p><b>Statistical test:</b>
Mann Whitney U Test, two sided.</p>
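<p>A two-sided Mann-Whitney U test at this significance level could be run
roughly as follows. This is a sketch under stated assumptions: the samples
are synthetic exponential draws rather than the subway data, and the
variable names are chosen for the example. It uses
<code>scipy.stats.mannwhitneyu</code> with the
<code>alternative</code> argument set explicitly.</p>

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic right-skewed samples standing in for the Sunny and Rainy datasets.
rng = np.random.default_rng(0)
sunny = rng.exponential(scale=1800.0, size=2000)
rainy = rng.exponential(scale=2400.0, size=2000)

alpha = 0.05  # significance level

# Ask for the two-sided p-value explicitly instead of relying on a default.
u_stat, p_value = mannwhitneyu(sunny, rainy, alternative="two-sided")

# Reject the null hypothesis when the p-value falls below alpha.
reject_null = bool(alpha > p_value)
```

<p>Stating <code>alternative</code> explicitly matters because the default
has differed between SciPy versions, and older releases reported a
one-sided p-value.</p>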
<h2 class="western">1.2 Choice of statistical test</h2>
<p>We can see from the histogram that the data is not normally
distributed. We therefore need to apply a non-parametric statistical
test. We need to check the assumptions of the statistical test that
we want to use in order to decide if that test is applicable to our
dataset. We analyze the two datasets with respect to <a href="https://statistics.laerd.com/spss-tutorials/mann-whitney-u-test-using-spss-statistics.php">Mann-Whitney
U test assumptions</a>:</p>
<p>Assumption 1: The dependent variable should be ordinal or
continuous. Our dependent variable is hourly entries. This can be
considered continuous.</p>
<p>Assumption 2: The independent variable should be categorical. In
our case, the independent variable is whether it rains or not. We
used this variable to split our data into two datasets.</p>
<p>Assumption 3: The observations should be independent. We treat this as
true: the number of entries observed in one day does not
influence other observations.</p>
<p>Assumption 4: The distributions of values in the two datasets should
have the same shape. This can be observed in Figure 1.</p>
<h2 class="western">1.3 Results</h2>
<p style="margin-bottom: 0in">I applied the test and obtained the
following results:</p>
<p style="margin-bottom: 0in">μ<sub>Sunny</sub> = 1845.53 entries</p>
<p style="margin-bottom: 0in">μ<sub>Rainy</sub> = 2028.19 entries</p>
<p style="margin-bottom: 0in">U = 1.536 * 10<sup>8</sup></p>
<p style="margin-bottom: 0in">p = 5.48 * 10<sup>-6</sup></p>
<p style="margin-bottom: 0in">where μ<sub>Sunny</sub> represents the
mean of the entries over sunny days, μ<sub>Rainy</sub> represents
the mean of the entries over rainy days, U represents the U statistic
of the Mann Whitney U test, and p represents the p-value for a two-sided
Mann Whitney U test.</p>
<p>The p-value is considerably lower than the significance level,
so the null hypothesis is rejected.</p>

<h2 class="western">1.4 Interpretation</h2>
<p>The p-value represents the probability of obtaining a difference
like the observed one or more extreme, given that the null
hypothesis is true. It is not the probability that the null hypothesis
is true. If the p-value is low, data like ours would be unlikely
under the null hypothesis. Given
that the p-value is well below the significance level, we can reject the
null hypothesis and
conclude that there is convincing evidence for the alternative
hypothesis.</p>
<h1><a name="h.z24p4e3rt9ik"></a>Section 2. Linear Regression</h1>
<h2 class="western">2.1 Approach</h2>
<p>I implemented OLS using statsmodels for my regression model.</p>
<h2 class="western">2.2 Features</h2>
<p>I used <font face='mono' color='#0005AA' size='-1'>'precipi'</font>,
<font face='mono' color='#0005AA' size='-1'>'fog'</font>,
<font face='mono' color='#0005AA' size='-1'>'wspdi'</font>,
the square of <font face='mono' color='#0005AA' size='-1'>'pressurei'</font>
(called <font face='mono' color='#0005AA' size='-1'>'pressure2'</font>),
and dummy variables for subway unit, day of week and hour.</p>
<h2 class="western">2.3 Selecting features</h2>
<p>By analyzing the data, we can see from the plots in Figure 2 that
the moment in time is the strongest predictor. From Figure 2 (a) we
can see that the weather affects the ridership slightly, but the
number of entries follows the same daily pattern. There is also a
weekly pattern that can be observed in Figure 2 (b) and Figure 3.</p>
<p>Another important aspect is that the relationship between
time-of-day and number of entries is nonlinear. Therefore, using
linear regression with 'hour' or 'day_week' as a feature will not
work properly. It is better to aggregate the values in time intervals
and use dummy variables for those intervals. I included dummy
variables for subway unit, weekday and time of day.</p>
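<p>Building dummy variables of this kind could look like the sketch below,
which uses <code>pandas.get_dummies</code>. The data frame is a tiny made-up
example; the column names <code>hour</code> and <code>day_week</code> follow
the text, while <code>UNIT</code> and the prefixes are assumptions made for
the illustration.</p>

```python
import pandas as pd

# Tiny illustrative frame: the values are made up for the example.
df = pd.DataFrame({
    "UNIT": ["R001", "R001", "R002", "R002", "R003", "R003"],
    "hour": [0, 4, 8, 12, 16, 20],
    "day_week": [0, 1, 2, 3, 5, 6],
})

# One indicator column per distinct value of each categorical variable.
unit_dummies = pd.get_dummies(df["UNIT"], prefix="unit", dtype=float)
day_dummies = pd.get_dummies(df["day_week"], prefix="day", dtype=float)
hour_dummies = pd.get_dummies(df["hour"], prefix="hour", dtype=float)

# Combine the indicator columns into a single feature table.
features = pd.concat([unit_dummies, day_dummies, hour_dummies], axis=1)
```

<p>Each row has exactly one 1 in every group of indicator columns. If the
model also contains an intercept, passing <code>drop_first=True</code> to
<code>get_dummies</code> drops one level per variable and avoids perfectly
collinear columns.</p>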
<p>To select the variables best suited as predictors,
I generated scatterplots of each candidate explanatory variable
versus the number of entries. It is best to include in a linear model
only variables that follow an approximately linear pattern with respect
to the response variable. I observed that temperature does not follow a
linear pattern, so I excluded it. Pressure appears to follow a quadratic
pattern (Figure 4a), so I dropped the pressure variable and replaced it
with its squared value (Figure 4b).</p>
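<p>The effect of replacing a variable with its square can be illustrated
with a small least-squares fit. This sketch uses
<code>numpy.linalg.lstsq</code> as a stand-in for the statsmodels OLS fit
used in this report, and the pressure and entries values are synthetic; all
names here are assumptions made for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up pressure-like values and a response that depends on their square.
pressure = rng.uniform(29.5, 30.4, size=200)
entries = 50.0 * pressure**2 + rng.normal(0.0, 5.0, size=200)

# The squared feature enters the linear model like any other column.
pressure2 = pressure**2
X = np.column_stack([np.ones_like(pressure2), pressure2])

# Ordinary least squares: find theta minimizing the squared error of X @ theta.
theta, _, _, _ = np.linalg.lstsq(X, entries, rcond=None)
predictions = X @ theta
```

<p>The model stays linear in its weights; only the input column is
transformed, which is why a quadratic relationship can still be fitted with
linear regression.</p>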
<p>It is interesting to note that, since we have a very limited
dataset, the average for Mondays is pulled down by
<a href="http://en.wikipedia.org/wiki/Memorial_Day">Memorial
Day</a>, a national holiday celebrated on the last Monday of May. This
suggests that a more accurate prediction could perhaps be made by taking
into account not only the weekday but also the day of the year, or
by adding a binary variable that marks the day as a national
holiday.</p>
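<p>A binary holiday variable of the kind suggested here could be derived
from the date alone. The sketch below only covers Memorial Day, computed as
the last Monday of May; the function names <code>memorial_day</code> and
<code>is_holiday</code> are hypothetical, and this feature is not part of the
model described in this report.</p>

```python
import calendar
import datetime


def memorial_day(year):
    """Return the date of the last Monday of May for the given year."""
    # monthcalendar returns weeks as lists of day numbers, Monday first;
    # days that fall outside the month are reported as 0.
    weeks = calendar.monthcalendar(year, 5)
    last_monday = max(week[calendar.MONDAY] for week in weeks)
    return datetime.date(year, 5, last_monday)


def is_holiday(day):
    """Binary feature: 1 if the date is Memorial Day, otherwise 0."""
    return int(day == memorial_day(day.year))
```

<p>For example, <code>is_holiday(datetime.date(2015, 5, 25))</code> returns
1, while any other Monday in that month returns 0.</p>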
<p>I obtained an R<sup>2</sup> value of 0.54.</p>
<h2 class="western">2.4 Weights</h2>
<p>The output variable (or response variable), the number of entries
for a particular subway unit at a particular moment in time, varies from
0 to 35,000 entries. Because of this wide range, the weights of different
predictors differ by orders of magnitude. The weights for the weather
variables are the following:</p>
<ul>
	<li/>
<p style="margin-bottom: 0in">precipi: -117.366997884</p>
	<li/>
<p style="margin-bottom: 0in">fog: -80.4189014166</p>
	<li/>
<p style="margin-bottom: 0in">wspdi: 33.5845241393</p>
	<li/>
<p style="margin-bottom: 0in">pressure2: -20.6362939134</p>
</ul>


<p>On the other hand, the theta values for the dummy features are
usually on the order of 10<sup>12</sup> to 10<sup>14</sup>.</p>
<h2 class="western">2.5 R<sup>2</sup></h2>
<p>For my model: R<sup>2</sup> = 0.545.</p>
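<p>For reference, R<sup>2</sup> can be computed from observed and predicted
values as one minus the ratio of the residual sum of squares to the total
sum of squares. The cross-validation code in this report calls a helper
named <code>compute_r_squared</code> whose body is not shown; the version
below is a minimal sketch of what such a helper might look like, not
necessarily the one that was used.</p>

```python
import numpy as np


def compute_r_squared(data, predictions):
    """Return the coefficient of determination of predictions against data."""
    data = np.asarray(data, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Residual sum of squares: variation left unexplained by the model.
    ss_res = np.sum((data - predictions) ** 2)
    # Total sum of squares: variation around the mean of the observed data.
    ss_tot = np.sum((data - np.mean(data)) ** 2)
    return 1.0 - ss_res / ss_tot
```

<p>A perfect prediction gives 1, predicting the mean everywhere gives 0, and
the value can become negative on held-out data when the model does worse
than the mean.</p>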
<h2 class="western">2.6 How well does this model fit the data? How
well does it predict?</h2>
<p>A distinction has to be made between modeling and predicting.</p>
<p><b>Modeling</b> is the process of describing a phenomenon
(usually a natural phenomenon) theoretically. It involves finding a
mathematical description of the real phenomenon. If the model is accurate,
it can be used to predict future events.</p>
<p><b>Prediction</b> involves using a model to predict future events.
Prediction is related to modeling; however, measuring the prediction
accuracy of a model is more difficult than measuring its goodness of fit.
To measure the fit of a model, one compares the output
of the model with the data it was fitted to. A model may describe a dataset
very well, but it may fail to generalize to other datasets.</p>
<p>Figure 5 shows the accuracy of the model with respect to the real data.
We can clearly see that the model is unable to capture the variability of
the data. Moreover, the linear model outputs negative values, which are
impossible for a count of entries. Figure 6 shows the residuals. We can see
that for larger values, the residuals get larger.
This suggests that the output variable has a non-linear behavior.</p>
<p>This model describes the data to some extent, but in order to better
assess a statistic like R<sup>2</sup> for predictive purposes, we need
to set up an experiment to estimate how well the model might behave in a
real-world setting.</p>
<p>I designed a cross-validation experiment that is described below
in pseudo-code.</p>
<font face='mono'>
<ol>
	<li/>
<p>Shuffle the data</p>
	<li/>
<p>Split the data into 10 chunks (or any N chunks; for this
	experiment I used N=10)</p>
	<li/>
<p>for every chunk:</p>
	<ul>
		<li/>
<p>test_set = data[chunk]</p>
		<li/>
<p>train_set = data – test_set</p>
		<li/>
<p>train the model using the train_set</p>
		<li/>
<p>compute predicted values for test set</p>
		<li/>
<p>compute R<sup>2</sup> for test set</p>
	</ul>
	<li/>
<p>Compute R<sup>2</sup> average and std</p>
</ol>
</font>
<p>The results of this experiment are the following:</p>
<p>R<sup>2</sup><sub>mean</sub>= 0.549</p>
<p>R<sup>2</sup><sub>std</sub>= 0.039</p>
<p>I obtained 10 values for R<sup>2</sup>. To compute a confidence
	interval, I used the t distribution. (The t distribution is required rather
	than the normal distribution because we have only a small sample.)
	Since there are 10 samples, I used the t distribution with 9 degrees
	of freedom.</p>
<p>The 95% confidence interval for R<sup>2</sup> is: (0.46, 0.63).</p>
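<p>The interval can be checked by hand from the rounded mean and standard
deviation reported above. The sketch below uses
<code>scipy.stats.t.ppf</code> to obtain the critical value; the variable
names are chosen for the example, and the standard deviation of the fold
scores is used as the scale, as in the replication code in this report.</p>

```python
from scipy.stats import t

r2_mean = 0.549  # reported mean of the 10 fold scores
r2_std = 0.039   # reported standard deviation of the 10 fold scores
n_folds = 10

# Two-sided 95% critical value of the t distribution with n - 1 degrees of freedom.
t_crit = t.ppf(0.975, n_folds - 1)

# Interval centered on the mean, with half-width t_crit * std.
lower = r2_mean - t_crit * r2_std
upper = r2_mean + t_crit * r2_std
```

<p>With a critical value of about 2.262 the half-width is about 0.088, which
agrees with the reported interval up to rounding of the inputs. Note that an
interval for the mean of the fold scores would usually divide the standard
deviation by the square root of the number of folds, which would give a
narrower interval.</p>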
<p>The code that led to this result is presented here for replication
purposes:</p>
<p><code>
	<!-- HTML generated using hilite.me --><div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%">	<span style="color: #008800; font-weight: bold">from</span> <span style="color: #0e84b5; font-weight: bold">scipy.stats</span> <span style="color: #008800; font-weight: bold">import</span> t
	<span style="color: #008800; font-weight: bold">import</span> <span style="color: #0e84b5; font-weight: bold">numpy</span> <span style="color: #008800; font-weight: bold">as</span> <span style="color: #0e84b5; font-weight: bold">np</span>
	<span style="color: #008800; font-weight: bold">import</span> <span style="color: #0e84b5; font-weight: bold">statsmodels.api</span> <span style="color: #008800; font-weight: bold">as</span> <span style="color: #0e84b5; font-weight: bold">sm</span>

	<span style="color: #888888"># ...</span>
	<span style="color: #888888"># shuffle the row positions with a fixed seed so the split is reproducible</span>
	indices <span style="color: #333333">=</span> np<span style="color: #333333">.</span>arange(<span style="color: #007020">len</span>(features))
	np<span style="color: #333333">.</span>random<span style="color: #333333">.</span>seed(<span style="color: #0000DD; font-weight: bold">1</span>)
	np<span style="color: #333333">.</span>random<span style="color: #333333">.</span>shuffle(indices)
	k_cv <span style="color: #333333">=</span> <span style="color: #0000DD; font-weight: bold">10</span>
	<span style="color: #888888"># integer division keeps the fold size usable as a slice bound</span>
	test_set_len <span style="color: #333333">=</span> <span style="color: #007020">len</span>(features)<span style="color: #333333">//</span>k_cv
	R_sq_array <span style="color: #333333">=</span> []
	<span style="color: #008800; font-weight: bold">for</span> k <span style="color: #000000; font-weight: bold">in</span> <span style="color: #007020">range</span>(k_cv):
		<span style="color: #888888"># chunk k is the test set; all rows before and after it form the training set</span>
		test_i <span style="color: #333333">=</span> indices[k<span style="color: #333333">*</span>test_set_len:(k<span style="color: #333333">+</span><span style="color: #0000DD; font-weight: bold">1</span>)<span style="color: #333333">*</span>test_set_len]
		train_i <span style="color: #333333">=</span> np<span style="color: #333333">.</span>concatenate((indices[:k<span style="color: #333333">*</span>test_set_len], indices[(k<span style="color: #333333">+</span><span style="color: #0000DD; font-weight: bold">1</span>)<span style="color: #333333">*</span>test_set_len:]))
		model <span style="color: #333333">=</span> sm<span style="color: #333333">.</span>OLS(values[train_i], features[train_i])
		results <span style="color: #333333">=</span> model<span style="color: #333333">.</span>fit()
		predicted_values <span style="color: #333333">=</span> results<span style="color: #333333">.</span>predict(features[test_i])
		r_sq <span style="color: #333333">=</span> compute_r_squared(turnstile_weather[<span style="background-color: #fff0f0">&#39;ENTRIESn_hourly&#39;</span>][test_i], predicted_values)
		R_sq_array<span style="color: #333333">.</span>append(r_sq)
		log<span style="color: #333333">.</span>write(<span style="color: #007020">str</span>(r_sq)<span style="color: #333333">+</span><span style="background-color: #fff0f0">&#39;</span><span style="color: #666666; font-weight: bold; background-color: #fff0f0">\n</span><span style="background-color: #fff0f0">&#39;</span>)
	<span style="color: #888888"># mean and standard deviation of the per-chunk scores</span>
	Rm <span style="color: #333333">=</span> np<span style="color: #333333">.</span>mean(R_sq_array)
	Rsig <span style="color: #333333">=</span> np<span style="color: #333333">.</span>std(R_sq_array)
	<span style="color: #888888"># t interval with k_cv - 1 degrees of freedom, scaled by the standard deviation</span>
	conf_interval <span style="color: #333333">=</span> t<span style="color: #333333">.</span>interval(<span style="color: #333333">.</span><span style="color: #0000DD; font-weight: bold">95</span>,<span style="color: #007020">len</span>(R_sq_array)<span style="color: #333333">-</span><span style="color: #0000DD; font-weight: bold">1</span>,loc<span style="color: #333333">=</span>Rm, scale<span style="color: #333333">=</span>Rsig)
</pre></div>

</code></p>

	<h1>Section 3. Visualizations
</h1>
	<p>
	<span id="Frame1" dir="ltr" style="float: right; width: 5.72in; height: 4.29in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in">
		<img src="histogram of hourly entries.png" name="Image1" align="left" width="549" height="600" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	1: Histogram of hourly entries. Figure 1 (a) shows the count of the
	entries vs. the number of entries. Figure 1 (b) shows the same
	histogram, but normalized, such that the heights of all bins sum to one.</i></font></p>
</span></p>
<p>
<span id="Frame2" dir="ltr" style="float: right; width: 5.69in; height: 4.27in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in"><img src="hourly entries.png" name="Image2" align="left" width="546" height="500" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	2: These plots describe ridership by hour (figure a) and by day of
	week (figure b).</i></font></p>
</span></p>
<p>
<span id="Frame3" dir="ltr" style="float: right; width: 5.72in; height: 4.29in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in"><img src="daily entries.png" name="Image3" align="left" width="549" height="500" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	3: Total entries by day.</i></font></p>
</span></p>
	<p>
	<span id="Frame4" dir="ltr" style="float: right; width: 5.72in; height: 4.29in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in">
		<img src="scatterplots/pressurei.png" name="Image1" align="left" width="549" height="500" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	4a: Scatterplot of pressure vs Entries</i></font></p>
</span></p>
	<p>
	<span id="Frame5" dir="ltr" style="float: right; width: 5.72in; height: 4.29in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in">
		<img src="scatterplots/pressure2.png" name="Image1" align="left" width="549" height="500" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	4b: Scatterplot of pressure<sup>2</sup> vs Entries</i></font></p>
</span></p>
	<p>
	<span id="Frame6" dir="ltr" style="float: right; width: 5.72in; height: 4.29in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in">
		<img src="Accuracy.png" name="Image1" align="left" width="549" height="500" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	5: The model's predicted variable versus the real variable (ENTRIESn_hourly). The red line is the identity function. The closer the points are to the red line, the more accurate the prediction. The difference between the line and a point represents the residual error of that point.</i></font></p>
</span></p>
	<p>
	<span id="Frame7" dir="ltr" style="float: right; width: 5.72in; height: 4.29in; border: none; padding: 0in; background: #ffffff">
	<p style="margin-top: 0.08in; margin-bottom: 0.08in">
		<img src="Residuals.png" name="Image1" align="left" width="549" height="500" border="0"><br clear="left"><font size="3" style="font-size: 12pt"><i>Figure
	6: Residuals versus ENTRIESn_hourly.</i></font></p>
</span></p>
<h1>Section 4. Conclusion</h1>
<p>From the analysis conducted so far we can state that more people
ride the NYC subway when it is raining. This can be observed first
by simply computing the average hourly entries for rainy vs.
non-rainy days. Although this gives us a clue as to the answer,
we cannot answer the question from this computation alone.
To properly answer the question we need to decide whether the
difference is large enough to be significant. In other words, is the
difference due to rain, or could it be due to chance that, for our
given dataset, more people used the subway during rainy days?</p>
<p>Statistics helps us answer the question on more scientific
ground. By using a statistical test we can answer the question: how
probable is a difference at least as large as the observed one if rain
had no effect? As we
have seen in Section 1, the p-value is very low,
so we can conclude that the observed difference is unlikely to be due to
chance, and that rain does indeed affect NYC Subway ridership.</p>
<p>By using linear regression we can attempt to model the observed
ridership and predict, based on the moment in time (day and hour)
and the weather, the approximate number of riders. The implementation
of this model gave an R<sup>2</sup> value of 0.54. This is high
enough to suggest there is a correlation between the input variables
and the output that can be used for predictive purposes,
but the correlation is rather weak. As observed from Figure 5 and Figure 6,
the linear model does not capture the variability of the data well.</p>
<h1>Section 5. Reflection</h1>
<h2 class="western">Shortcomings</h2>
<h3 style="margin-top: 0.17in; page-break-after: avoid"><font face="Albany, sans-serif" size="4" style="font-size: 14pt">Dataset</font></h3>
<p>The dataset used for this project included entries only for the month
of May. Therefore, the computed model has limited use: it is most likely
unsuitable for predicting ridership during winter or hot summer days.
Also, the granularity of the values is quite coarse. The entries are
summed over 4 hour intervals. It may be possible to obtain a more
accurate prediction if the data contained entries for every hour, for every
10 minutes, or even in real time, for example a dataset with
every entry and its associated timestamp.</p>
<h3 style="margin-top: 0.17in; page-break-after: avoid"><font face="Albany, sans-serif" size="4" style="font-size: 14pt">Analysis</font></h3>
<p>By examining the residuals it is clear that the linear model is
not suited for this particular application. The variability of the
data is very high and some relations are clearly not linear. In attempting
to account for the non-linearity of variables like time-of-day, I split
each such variable into multiple dummy variables. This approach increased the
number of features and created many categorical variables. Although
this approach may improve the model slightly, the linear model is not
suited for categorical variables.
For an accurate prediction, other non-linear models should be investigated.
</p>
</body>
</html>