In [None]:
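from datascience import *  # Table tools used throughout this notebook
import numpy as np  # Numerical functions, such as the correlation coefficient
import matplotlib
# Display plots inside the notebook
%matplotlib inline
import matplotlib.pyplot as plots  # Plotting library
import statsmodels.formula.api as smf  # Regression models
plots.style.use('fivethirtyeight')  # Sets the visual style of the plots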

# Correlation, regression, and prediction

One of the most important and interesting aspects of data science is making predictions about the future. How can we learn about temperatures a few decades from now by analyzing historical data about climate change and pollution? Based on a person's social media profile, what conclusions can we draw about their interests? How can we use a patient's medical history to judge how well he or she will respond to a treatment?

In this module, you will look at two **correlated** phenomena and predict unseen data points!

In order to use the datascience tools in Python, we must first import the relevant modules. The text after the # sign is called a "comment" and is not part of the code. It's simply a way for us to explain what the code does.

We will be using data from the online data archive of Prof. Larry Winner of the University of Florida. The file *hybrid* contains data on hybrid passenger cars sold in the United States from 1997 to 2013. In order to analyze the data, we must first **import** it to our Jupyter notebook and **create a table.**

In [None]:
hybrid = Table.read_table('http://inferentialthinking.com/notebooks/hybrid.csv')  # Imports the data and creates a table
hybrid.show(5)  # Displays the first five rows of the table

*References: vehicle: model of the car, year: year of manufacture, msrp: manufacturer's suggested retail price in 2013 dollars, acceleration: acceleration rate in km per hour per second, mpg: fuel economy in miles per gallon, class: the model's class.*

**Note: whenever we write an equal sign (=) in Python, we are assigning something to a variable.**

Now try to import your own data to this notebook! Remember to assign your data to a variable that is informative but not too wordy.

## Using your own data

Try it out with your own file! Make up random data if you don't have a file already and export it to a CSV format. You can create a file like this using Google Sheets, Microsoft Excel, etc. Eventually, you will have your own real file of data that you can upload and analyze, if you don't have it already.
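If you don't have a spreadsheet program handy, made-up data can also be generated and saved as a CSV directly in Python. The following is only a minimal sketch: the column names (`hours_studied`, `exam_score`), the sample size, and the file name `my_random_data.csv` are all hypothetical choices for illustration.

```python
import csv
import numpy as np

# Hypothetical example: column names, sample size, and file name are made up.
rng = np.random.default_rng(seed=0)           # seeded so the file is reproducible
hours_studied = rng.uniform(0, 10, size=20)   # 20 made-up x values
# y values that loosely follow x, plus some random noise
exam_score = 50 + 4 * hours_studied + rng.normal(0, 5, size=20)

file_name = 'my_random_data.csv'
with open(file_name, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['hours_studied', 'exam_score'])   # header row = column names
    for x, y in zip(hours_studied, exam_score):
        writer.writerow([round(float(x), 2), round(float(y), 2)])
```

Because this file is written next to the notebook, it would not need to be uploaded separately.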

To upload your data make sure you're in the directory with the green notebook:

![image](img/upload.png)

Click the "Upload" button and upload your CSV. You should see it appear in the directory:

![image](img/uploaded.png)

Now put the name of your file in the cell below and delete the hash signs (#) at the start of the two lines below `## TASK`. Make sure to run the cell after you make the changes.

In [None]:
## TASK
# my_data = Table.read_table('YOUR-FILE-NAME.csv')  # un-hashtag the front of this line when you have your data uploaded!
# my_data  # un-hashtag the front of this line when you have your data uploaded!

Let's visualize some of the data to see if we can spot a possible association! The modules we imported earlier include several powerful data visualization tools. Below are some interesting, possible relationships between our variables.

In [None]:
hybrid.scatter('acceleration', 'msrp') # Creates a scatter plot of two variables in a table

As we can see in the above scatter, there seems to be a positive association between acceleration and price. That is, cars with greater acceleration tend to cost more, on average; conversely, cars that cost more tend to have greater acceleration on average.

What about miles per gallon and price? Do you expect a positive or negative association?

In [None]:
hybrid.scatter('mpg', 'msrp')

Along with the negative association, the scatter diagram of price versus efficiency shows a **non-linear relation** between the two variables. The points appear to be clustered around a curve, not around a straight line.

Let's subset the data so that we're only looking at SUVs:

In [None]:
suv = hybrid.where('class', 'SUV')
suv.scatter('mpg', 'msrp')

As you can see, if we restrict the data just to the **SUV class**, the association appears to be more linear.

To find an association between two variables, the `.scatter` method is perhaps the most useful one. 
Try creating a few scatter plots of variables you might think are related among your data!

In [None]:
# TASK

### The correlation coefficient - *r*

> The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. ~Wikipedia

*r* = 1: the scatter diagram is a perfect straight line sloping upwards

*r* = -1: the scatter diagram is a perfect straight line sloping downwards.
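Keep in mind that *r* only measures *linear* association. As a small sketch with made-up numbers (not taken from the hybrid table), the cell below compares points that lie exactly on a line with points that lie exactly on a curve:

```python
import numpy as np

x = np.arange(-5, 6)      # made-up values: -5, -4, ..., 5
y_line = 2 * x + 1        # points exactly on an upward-sloping line
y_curve = x ** 2          # points exactly on a parabola (not a line)

r_line = np.corrcoef(x, y_line)[0, 1]     # 1, up to rounding
r_curve = np.corrcoef(x, y_curve)[0, 1]   # essentially 0, despite the perfect pattern
print(r_line, r_curve)
```

The curve is a perfect relationship, yet its *r* is essentially 0, which is why it helps to look at a scatter plot before relying on the coefficient.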

Let's calculate the correlation coefficient between acceleration and price!

In [None]:
np.corrcoef(hybrid['acceleration'], hybrid['msrp'])

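This function returns a 2×2 correlation matrix, with one entry for each pair of variables. The two 1s on the diagonal appear because each variable is, of course, perfectly correlated with itself. The other two entries are the coefficient we want, the correlation between acceleration and price. Our coefficient here is 0.6955779, ***implying a fairly strong positive correlation***.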
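To see where this number comes from, here is a sketch that computes *r* by hand as the average of the products of the two variables measured in standard units, and checks it against `np.corrcoef`. The numbers and the helper name `standard_units` are made up for illustration.

```python
import numpy as np

# Made-up numbers for illustration only; they are not taken from the hybrid table.
x = np.array([7.0, 9.5, 10.2, 12.8, 15.1])
y = np.array([21000.0, 28500.0, 26000.0, 39000.0, 52000.0])

def standard_units(values):
    """Convert an array to standard units: (value - mean) / standard deviation."""
    return (values - np.mean(values)) / np.std(values)

# r is the mean of the products of the two variables in standard units
r_by_hand = np.mean(standard_units(x) * standard_units(y))

matrix = np.corrcoef(x, y)     # 2x2 matrix; the diagonal entries are 1
r_from_numpy = matrix[0, 1]    # off-diagonal entry: correlation between x and y
print(r_by_hand, r_from_numpy)
```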

Now it's your turn to calculate the correlation coefficient on your data!

In [None]:
# TASK

### Regression

As mentioned earlier, an important tool in data science is to make predictions based on data. The code that we've created so far has helped us establish a relationship between our two variables. Once a relationship has been established, it's time to create a model that predicts unseen data values. To do this we'll find the equation of the **regression line**!

The regression line is the **best fit** line for our data. In linear regression, it is a perfectly straight line, chosen so that the sum of the squared vertical distances between the points and the line is as small as possible. Below is a picture showing the best fit line.

![image](http://onlinestatbook.com/2/regression/graphics/gpa.jpg)

As you can infer from the picture, once we find the **slope** and the **y-intercept** we can start predicting values! The equation for the above regression to predict university GPA based on high school GPA would look like this:

$UNIGPA_i = \alpha + \beta HSGPA_i + \epsilon_i$

The variable we want to predict (or model) is the left side `y` variable; the variable we think has an influence on our left side variable is on the right side. The $\alpha$ term is the y-intercept, the $\beta$ term is the slope, and the $\epsilon_i$ describes the randomness.

We can fit the model by writing the equation, without the $\alpha$ and $\epsilon_i$, in the `formula` parameter below, and passing our data in the `data` parameter. Then we just `fit` the model and ask for a `summary`. We'll try a model for:

$MSRP_i = \alpha + \beta ACCELERATION_i + \epsilon_i$

In [None]:
mod = smf.ols(formula='msrp ~ acceleration', data=hybrid.to_df())
res = mod.fit()
print(res.summary())

That's a lot of information. While we should consider everything, we'll look at the `p` value, the `coef`, and the `R-squared`. A p-value below .05 is generally considered to be significant. The `coef` is how much increase one sees in the left side variable for a one unit increase of the right side variable. So for a 1 unit increase in acceleration one might see an increase of $5067 in MSRP, according to our model. But how good is our model? That's the `R-squared`. The `R-squared` tells us what proportion of the variation in the left side variable can be explained by our model. A value of .484 isn't that bad, but obviously more goes into the MSRP value of a car than acceleration alone.
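For a model with a single right side variable, the `coef` and the `R-squared` are closely tied to the correlation coefficient. The sketch below uses made-up numbers and `np.polyfit` (a stand-in for `smf.ols`, not the method used above) to show that the slope equals *r* times the ratio of the standard deviations, and that `R-squared` equals *r* squared.

```python
import numpy as np

# Made-up numbers for illustration only; they are not taken from the hybrid table.
x = np.array([7.0, 9.5, 10.2, 12.8, 15.1])
y = np.array([21000.0, 28500.0, 26000.0, 39000.0, 52000.0])

# Least-squares straight line: polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, 1)

r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * np.std(y) / np.std(x)    # same slope, written in terms of r

# R-squared: 1 minus (unexplained variation / total variation)
fitted = intercept + slope * x
ss_residual = np.sum((y - fitted) ** 2)
ss_total = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_residual / ss_total

print(slope, slope_from_r)
print(r_squared, r ** 2)
```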

### Prediction

We can also use this model to predict on new data. Say we found hybrids with acceleration of 8, 9, and 10 respectively:

In [None]:
res.predict({'acceleration': [8,9,10]})
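
A prediction from a straight-line model is just the intercept plus the slope times the new value. As a sketch with made-up numbers, again using `np.polyfit` as a stand-in for the fitted `res` model:

```python
import numpy as np

# Made-up numbers for illustration only; they are not taken from the hybrid table.
x = np.array([7.0, 9.5, 10.2, 12.8, 15.1])
y = np.array([21000.0, 28500.0, 26000.0, 39000.0, 52000.0])

slope, intercept = np.polyfit(x, y, 1)     # least-squares line through the points

new_x = np.array([8, 9, 10])               # new values we want predictions for
predictions = intercept + slope * new_x    # prediction = intercept + slope * x
print(predictions)

# Each 1-unit step in x changes the prediction by exactly the slope
print(np.diff(predictions), slope)
```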

We can plot this line of "best fit" too:

In [None]:
hybrid.scatter('acceleration', 'msrp', fit_line=True)

That's it! By working through this module, you've learned how to **visually analyze your data**, **establish a correlation** by calculating the **correlation coefficient**, **find the regression line** and **predict data points**!

---

***We would also appreciate if you filled out this feedback form regarding the notebook:
https://goo.gl/forms/ADY9TJU3TGKlllyT2***

***Your input allows us to continue improving our educational notebooks!***