# Introduction to Correlated Data Lab

## Correlated States

Read [this](http://fivethirtyeight.com/features/why-fivethirtyeight-gave-trump-a-better-chance-than-almost-anyone-else/) FiveThirtyEight article about 2016 polling.

Suppose you were advising a Presidential candidate's pollster in the 2020 election. 

The pollster took polls in every state/region with electoral votes, treated each state as "independent coin flips," and estimated the probability of your candidate winning by simulating 1,000,000 elections. (This is an example of [Monte Carlo simulations](http://www.palisade.com/risk/monte_carlo_simulation.asp)!)

**Question:** What suggestions do you have for the pollster? Specifically, discuss how appropriate the assumption of independence is in this case and suggest improvements if you feel any are necessary.

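**Answer:** Proximity of states matters here. Polling errors tend to be shared across geographically and demographically similar states, so the pollster should not assume that states are independent of one another. A Bayesian way of thinking, in which a result in one state updates our beliefs about similar states, may be more appropriate here.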
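To make the point concrete, here is a minimal Monte Carlo sketch comparing independent state errors with errors that share a common national component. Everything in it is an assumption for illustration: the five states, their margins and electoral votes, the error size `total_sd`, and the shared fraction `rho` are made up and are not taken from the article or from real polls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy map (not real polling data): polled margin in points
# and electoral votes for five made-up states.
margins = np.array([2.0, 1.5, 1.0, -1.0, 3.0])
electoral_votes = np.array([20, 16, 10, 29, 15])
votes_needed = electoral_votes.sum() // 2 + 1

n_sims = 100_000
total_sd = 4.0  # assumed polling error (in points) for each state


def simulate_electoral_votes(rho):
    """Simulate elections where a fraction `rho` of the error variance is shared."""
    shared = rng.normal(0.0, np.sqrt(rho) * total_sd, size=(n_sims, 1))
    local = rng.normal(0.0, np.sqrt(1.0 - rho) * total_sd,
                       size=(n_sims, len(margins)))
    outcomes = margins + shared + local
    # Sum the electoral votes of every state the candidate carries.
    return np.where(outcomes > 0, electoral_votes, 0).sum(axis=1)


ev_indep = simulate_electoral_votes(rho=0.0)  # "independent coin flips"
ev_corr = simulate_electoral_votes(rho=0.7)   # mostly shared error

p_indep = (ev_indep >= votes_needed).mean()
p_corr = (ev_corr >= votes_needed).mean()

print(f"independent: P(win) = {p_indep:.3f}, sd of EV = {ev_indep.std():.1f}")
print(f"correlated:  P(win) = {p_corr:.3f}, sd of EV = {ev_corr.std():.1f}")
```

Each state's individual win probability is the same in both runs, because the total error variance is held fixed. Only the correlation changes, and the spread of electoral-vote totals widens when the error is shared.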

## Correlated Stock Prices

You're interested in the performance of a particular stock. You use the [autocorrelation function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.autocorr.html) in Pandas to assess how autocorrelated your stock's values are.

Autocorrelation makes explicit the idea of temporal correlation we discussed previously. Suppose we wanted to see how correlated a stock's prices are with the immediately preceding day's stock prices. 

| Day | Price | Price with Lag = 1 | Price with Lag = 2 |
|-----|-------|--------------------|--------------------|
| 1   | 25    | NA                 | NA                 |
| 2   | 26    | 25                 | NA                 |
| 3   | 28    | 26                 | 25                 |
| 4   | 24    | 28                 | 26                 |
| 5   | 23    | 24                 | 28                 |

Autocorrelation with a lag of 1 will calculate the correlation between column "Price" and column "Price with Lag = 1." Autocorrelation with a lag of $k$ will calculate the correlation between stock price and the stock price of $k$ days before in a similar manner.
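As a small sketch of the idea, the lagged columns in the table can be built with `shift`, and the lag-1 autocorrelation can be computed by hand as the correlation between the price and its shifted copy. The variable names are chosen for illustration; the prices are the five values from the table above.

```python
import pandas as pd

# The five prices from the table, indexed by day.
prices = pd.Series([25, 26, 28, 24, 23],
                   index=pd.Index([1, 2, 3, 4, 5], name="Day"),
                   name="Price")

# Rebuild the lagged columns: shift(k) moves each price down by k rows.
lagged = pd.DataFrame({
    "Price": prices,
    "Price with Lag = 1": prices.shift(1),
    "Price with Lag = 2": prices.shift(2),
})
print(lagged)

# Lag-1 autocorrelation two ways: by hand and with the built-in method.
lag1_manual = prices.corr(prices.shift(1))
lag1_builtin = prices.autocorr(lag=1)
print(lag1_manual, lag1_builtin)
```

Rows with a missing lagged value are dropped from the correlation, so the lag-1 figure here is based on only four pairs of prices.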

I build a loop that iterates through lags of 1 to 365 days (we'll assume our stock price is the closing price for each day) to assess how correlated a stock price is with the stock price from $i$ days ago. (Sample code seen below.)

```
# Assumes `df` is a DataFrame with a 'stock_prices' column of daily closing prices.
for i in range(1, 366):
    print(df['stock_prices'].autocorr(lag=i))
```
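Printing 365 numbers is hard to read, so one option is to collect the autocorrelations in a Series and look at the largest ones. The sketch below does this on synthetic data, not on a real stock: the weekly sine wave, the noise level, and the column name `stock_prices` are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for two years of daily closing prices with a weekly cycle.
days = np.arange(730)
df = pd.DataFrame({
    "stock_prices": 100 + 5 * np.sin(2 * np.pi * days / 7)
                    + rng.normal(0, 1, size=days.size)
})

# Autocorrelation at each lag from 1 to 30, indexed by the lag.
acf = pd.Series({lag: df["stock_prices"].autocorr(lag=lag)
                 for lag in range(1, 31)})

# The lags with the strongest positive autocorrelation.
top_lags = acf.nlargest(4)
print(top_lags)
```

Because the synthetic series repeats every seven days, the strongest lags are multiples of 7. A real price series would need the same kind of inspection before drawing conclusions.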

**Question:** Suppose my highest values of autocorrelation are found when $i = 1, 7, 30, 365$. What does each of these suggest about the performance of this particular stock?

**Answer:** Because no transformations have been made to the stock prices (i.e., we are still looking at raw price), we can infer that the stock price on day *t* is very highly correlated with the price on the day before (*t-1*), the week before (*t-7*), a month before (approx. *t-30*), and a year before (approx. *t-365*). This suggests day-to-day persistence as well as weekly, monthly, and yearly patterns in the price.

Stock prices vary quite rapidly. Looking at almost any plot of stock price over time, we'll see a very "wiggly" function that moves around erratically. Building a model for this can be difficult.

One way to "de-noise" or "smooth" this is to create a [moving average](http://www.investopedia.com/terms/m/movingaverage.asp) of stock prices. Suppose I wanted to create a moving average of stock prices across $k$ days. In this case, I create a new column that takes the current day and the $k-1$ previous days (for $k$ total days) and averages the stock prices of these days.

For example, I have a column of stock prices and a column associated with a moving average for three days. Then, my row for Day 5 includes the Day 5 stock price and the average of the Day 3, Day 4, and Day 5 stock prices. (We'll go into more detail on this later.)

| Day | Price | Moving Average k = 3 |
|-----|-------|----------------------|
| 1   | 25    | NA                   |
| 2   | 26    | NA                   |
| 3   | 28    | 26.33                |
| 4   | 24    | 26                   |
| 5   | 23    | 25                   |
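The table above can be reproduced with pandas' `rolling` method. This is a minimal sketch using the five prices from the table; the variable names are chosen for illustration.

```python
import pandas as pd

# The five prices from the table, indexed by day.
prices = pd.Series([25, 26, 28, 24, 23],
                   index=pd.Index([1, 2, 3, 4, 5], name="Day"),
                   name="Price")

# A 3-day moving average: the current day plus the 2 previous days.
# The first two days have fewer than 3 observations, so they are NaN.
moving_avg = prices.rolling(window=3).mean()

table = pd.DataFrame({"Price": prices,
                      "Moving Average k = 3": moving_avg.round(2)})
print(table)
```

By default `rolling(window=3)` requires all three observations, which is why Days 1 and 2 show as missing, matching the NA entries in the table.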

**Question:** As the number of periods $k$ increases, how do I expect my plotted curve to change?

**Answer:** As the number of periods ($k$) increases, the curve will become smoother (i.e., it will show fewer and smaller swings).
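One rough way to check this is to measure how much the smoothed curve moves from one day to the next. The sketch below uses a synthetic random walk as a stand-in for a stock price, and "roughness" (the standard deviation of day-to-day changes) is a measure chosen here for illustration, not a standard definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic random walk standing in for 1,000 days of closing prices.
prices = pd.Series(100 + rng.normal(0, 1, size=1000).cumsum())

# Roughness: standard deviation of the day-to-day change in the smoothed curve.
roughness = {}
for k in (3, 10, 30):
    smoothed = prices.rolling(window=k).mean()
    roughness[k] = smoothed.diff().std()

for k, value in roughness.items():
    print(f"k = {k:2d}: roughness = {value:.3f}")
```

Larger windows give smaller day-to-day changes, which is what a smoother plotted curve looks like numerically.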

**Question:** Suppose we use our moving average to predict values of the stock price. As $k$ increases, how is the bias of our predictions affected? As $k$ increases, how is the variance of our predictions affected?

**Answer:** As the number of periods ($k$) increases, our model becomes less flexible: bias increases (predictions lag behind directional changes in price), while variance decreases (predictions are smoother).
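The trade-off can be illustrated on synthetic data where the true underlying trend is known. In this sketch, the linear trend, the noise level, and the choice to forecast each day with the moving average of the previous $k$ days are all assumptions. Measuring error against the noise-free `signal` is only possible because the data is simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic prices: a steady upward trend (the "signal") plus random noise.
n_days = 2000
signal = pd.Series(100 + 0.5 * np.arange(n_days))
prices = signal + rng.normal(0, 2, size=n_days)

bias, variance = {}, {}
for k in (2, 10, 50):
    # Forecast for each day: the average of the k previous days' prices.
    forecast = prices.rolling(window=k).mean().shift(1)
    error = (forecast - signal).dropna()
    bias[k] = error.mean()      # systematic lag behind the trend
    variance[k] = error.var()   # how much the forecast jumps around

for k in bias:
    print(f"k = {k:2d}: bias = {bias[k]:7.2f}, variance = {variance[k]:.3f}")
```

On this trending series, a longer window makes the forecast lag further behind the trend (larger bias in absolute value) while averaging away more of the noise (smaller variance).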
