# Correlation and Linear Regression 

This notebook walks you through analyzing correlation in data, finding and plotting a regression line, and using its equation for prediction, a common technique in statistics and machine learning.

The dataset for this exercise comes from Francis Galton's famous 1885 study of the relationship between the heights of adult children and the heights of their parents. Each case is an adult child, and the variables are:

    Family: The family that the child belongs to, labeled from 1 to 205.
    Father: The father's height, in inches
    Mother: The mother's height, in inches
    Gender: The gender of the child, male (M) or female (F)
    Height: The height of the child, in inches
    Kids: The number of kids in the family of the child

For more information, see here: http://www.randomservices.org/random/data/Galton.html

As you go through this exercise with your group, think about these questions and be prepared to talk about them when you come back for the group discussion:

- How does what is represented in this data affect the predictions that can be made from it?
- What would be the implications of using such techniques in machine learning?
- Do you think these techniques "transform" the nature of knowledge?

In [None]:
#import the libraries you will need

import pandas as pd #pandas to load the dataset and manipulate the data
import numpy as np #numpy for linear algebra
import matplotlib.pyplot as plt #pyplot for plots
import seaborn as sns #seaborn for plots


In [None]:
#load the dataset from the csv file
csv_file = "Galton_Height_Data.csv"

galton_data = pd.read_csv(csv_file)

In [None]:
#see if loading the data worked
galton_data.head()

In [None]:
#check the type of data you have
galton_data.info()

## Data analysis

Suppose you want to use this data to train a model that makes predictions regarding the height of a child based on the height of their parents.

The column "Height" tells you the children's height. You will need to find the regression line and function that predicts the number in this column.

First, we need to do a little data "cleaning."


We cannot use categorical data directly in a linear regression, so we need to convert it to numbers.
To do this quickly, we can import the "LabelEncoder" class from sklearn (scikit-learn, a machine learning library).
LabelEncoder transforms categorical values into integers from 0 to n_classes - 1, assigned in sorted order. For example, if you have the categories "paris", "tokyo" and "amsterdam" in a column, they will be encoded as 1, 2 and 0 (amsterdam = 0, paris = 1, tokyo = 2).
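
To see how such an encoding assigns its numbers, here is a minimal sketch that uses only NumPy rather than LabelEncoder itself. It relies on the fact that LabelEncoder numbers its categories in sorted order, which `np.unique` reproduces; the `cities` array is made-up example data.

```python
import numpy as np

# Made-up example column with three categories
cities = np.array(["paris", "tokyo", "amsterdam", "paris"])

# np.unique returns the sorted categories and, with return_inverse=True,
# the integer code of each original value
classes, codes = np.unique(cities, return_inverse=True)

print(classes.tolist())  # ['amsterdam', 'paris', 'tokyo']
print(codes.tolist())    # [1, 2, 0, 1]
```

The codes follow alphabetical order, not the order in which the values appear in the column.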


In [None]:
from sklearn.preprocessing import LabelEncoder 

LE = LabelEncoder()

cols = ['Gender']

# encode each column listed above as integers (here F -> 0, M -> 1)
for col in cols:
    galton_data[col] = LE.fit_transform(galton_data[col])



In [None]:
#visualize how the categorical data looks now:

galton_data.head()

# Correlation

Pandas makes it quick to calculate correlations.
We can compute a correlation matrix easily with the corr() method.

Remember:
 - 1 indicates a perfect positive correlation.
 - -1 indicates a perfect negative correlation.
 - 0 indicates that there is no linear relationship between the different variables.
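
These three cases can be checked on a tiny made-up example. The sketch below uses NumPy's `np.corrcoef`; the helper `pearson_r` and the data are hypothetical and only serve as an illustration. Note that a correlation of 0 only rules out a linear relationship: in the third case y is fully determined by x.

```python
import numpy as np

x_demo = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

def pearson_r(u, v):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]

r_pos = pearson_r(x_demo, 3 * x_demo + 1)    # points on a rising line: r is 1
r_neg = pearson_r(x_demo, -2 * x_demo + 5)   # points on a falling line: r is -1
r_zero = pearson_r(x_demo, x_demo ** 2)      # a parabola: r is 0 (up to rounding)

print(r_pos, r_neg, r_zero)
```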

In [None]:
galton_data.corr()

Or, you can use a "heatmap" to visualize the correlations with the seaborn library:

In [None]:
corr = galton_data.corr()
plt.figure(figsize=(15, 12)) #defines the size of the figure
sns.heatmap(corr, annot=True)

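Observation:
- Gender is positively correlated with height in this dataset. Because Gender is now encoded as 0 (female) and 1 (male), a positive correlation means that male children tend to be taller.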


# Linear regression

You can use seaborn to plot a linear regression line between two features in the data.

In [None]:
#Regression line for child's height (y) against father's height (x)

sns.regplot(x="Father", y="Height", data=galton_data)

P.S.: the shaded area is the confidence interval for the regression estimate (its size is set by the "ci" parameter).
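
seaborn estimates this band with a bootstrap: it refits the line on many resamples of the data. The sketch below applies the same idea to the slope alone, which is a simplification of what seaborn draws. The data is made up and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noisy linear data (not taken from the Galton data)
x_boot = rng.uniform(60, 75, size=200)
y_boot = 30 + 0.5 * x_boot + rng.normal(0, 2, size=200)

# Slope of the least-squares line fitted on the full sample
slope_hat = np.polyfit(x_boot, y_boot, 1)[0]

# Refit the line on many resamples drawn with replacement
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x_boot), size=len(x_boot))
    boot_slopes.append(np.polyfit(x_boot[idx], y_boot[idx], 1)[0])

# The middle 95% of the resampled slopes gives a 95% confidence interval
low, high = np.percentile(boot_slopes, [2.5, 97.5])
print(slope_hat, low, high)
```

A narrow interval means the resamples agree on the slope; a wide one means the estimate is sensitive to which points happen to be in the sample.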

If you want to see how the relationship differs across a third variable, such as Gender, use lmplot with the "hue" parameter:

In [None]:
sns.lmplot(x="Father", y="Height", hue="Gender", data=galton_data)

Once you've got a regression line, you can use its equation to make predictions. There are machine learning libraries that do this across all variables in the dataset, but first you will use the simple statistical method, via the "stats" module of the scipy library.

Remember:

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
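
For reference, the least-squares values of a and b can be computed directly from the data: b is the covariance of X and Y divided by the variance of X, and a = mean(Y) - b * mean(X). A minimal sketch on made-up numbers (the `_demo` names are hypothetical and not part of the Galton data):

```python
import numpy as np

# Made-up heights in inches (not taken from the Galton data)
x_demo = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y_demo = np.array([65.0, 66.5, 67.0, 69.5, 70.0])

# Least-squares estimates of the slope b and the intercept a
x_dev = x_demo - x_demo.mean()
y_dev = y_demo - y_demo.mean()
b_demo = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
a_demo = y_demo.mean() - b_demo * x_demo.mean()

# np.polyfit with degree 1 fits the same line and returns [slope, intercept]
b_check, a_check = np.polyfit(x_demo, y_demo, 1)

print(a_demo, b_demo)  # approximately 23.4 and 0.65
```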

In [None]:
from scipy import stats

#predict the child's height using father's height.
# get the slope and intercept with scipy.stats.linregress
slope, intercept, r_value, p_value, std_err = stats.linregress(galton_data['Father'], galton_data['Height'])


In [None]:
# assign the values of a and b to intercept and slope, to maintain terminology above

a = intercept
b = slope

In [None]:
print(a) #intercept

In [None]:
print(b) #slope

In [None]:
# predict the child's height for x = 70 (father's height is 70 inches)
x = 70

In [None]:
#equation of the least-squares line that best fits the data:
y = a + b*x


In [None]:
print(y)

The printed value is the predicted height Y of a child whose father's height is X = 70 inches.

The r-value is the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between the 2 variables.
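
One way to interpret the r-value: in a simple linear regression, its square equals the share of the variance in Y that the fitted line accounts for. A minimal sketch on made-up numbers (all `_demo` names are hypothetical):

```python
import numpy as np

# Made-up example data (not taken from the Galton data)
x_demo = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y_demo = np.array([65.0, 66.5, 67.0, 69.5, 70.0])

# Pearson correlation coefficient
r_demo = np.corrcoef(x_demo, y_demo)[0, 1]

# Fit the least-squares line and measure how much variance it leaves unexplained
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
residuals = y_demo - (intercept_demo + slope_demo * x_demo)
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot

# The last two printed values agree up to rounding
print(r_demo, r_demo ** 2, r_squared)
```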

In [None]:
print(r_value)

In [None]:

#predict the child's height using mother's height.
# get the slope and intercept with scipy.stats.linregress
slope, intercept, r_value, p_value, std_err = stats.linregress(galton_data['Mother'], galton_data['Height'])


In [None]:
a = intercept
b = slope

# predict the child's height for x = 70 (mother's height is 70 inches)
x = 70

#equation of the least-squares line that best fits the data:
y = a + b*x

print(y)

In [None]:
print(r_value)

The correlation between father's height and child's height is stronger than the correlation between mother's height and child's height.

In [None]:
#predict the child's height using gender.
# get the slope and intercept with scipy.stats.linregress
slope, intercept, r_value, p_value, std_err = stats.linregress(galton_data['Gender'], galton_data['Height'])

a = intercept
b = slope

# predict the child's height for x = 1 (male)
x = 1

#equation of the least-squares line that best fits the data:
y = a + b*x

print(y)

In [None]:
print(r_value)

The correlation between gender and child's height is much higher than the correlation between either parent's height and child's height.

# Multiple Regression (bonus)


Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.

Let's try to predict the height of a child based on both the mother's and the father's height.

First, make a list of the independent variables and call this variable X.

Put the dependent variable in a variable called y.
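
Under the hood, a multiple regression finds one intercept and one coefficient per independent variable by least squares. The sketch below shows the idea with plain NumPy instead of sklearn. The data is made up, generated from arbitrary illustrative coefficients so that the fit can be checked, and all names are hypothetical.

```python
import numpy as np

# Made-up parent heights in inches (not taken from the Galton data)
father_demo = np.array([65.0, 68.0, 70.0, 72.0, 66.0, 74.0])
mother_demo = np.array([62.0, 60.0, 65.0, 63.0, 66.0, 61.0])

# Build child heights from a known rule so the fit can be checked:
# child = 20 + 0.4 * father + 0.3 * mother (arbitrary illustrative coefficients)
child_demo = 20 + 0.4 * father_demo + 0.3 * mother_demo

# Design matrix: a column of ones for the intercept, then one column per variable
design = np.column_stack([np.ones_like(father_demo), father_demo, mother_demo])

# Find the coefficients that minimize the squared error of design @ coef
coef, *_ = np.linalg.lstsq(design, child_demo, rcond=None)
intercept_demo, b_father, b_mother = coef

# Predict for a new pair of parents, both 60 inches tall
prediction_demo = intercept_demo + b_father * 60 + b_mother * 60
print(coef, prediction_demo)
```

Because the made-up data contains no noise, the fit recovers the coefficients 20, 0.4 and 0.3 up to rounding; with real data such as Galton's, the fit is only the best linear approximation.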


In [None]:
X = galton_data[['Father', 'Mother']]
y = galton_data['Height'] 

In [None]:
from sklearn import linear_model 

regr = linear_model.LinearRegression()
regr.fit(X, y) 

In [None]:
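#predict a child's height from the father's and mother's heights (both 60 inches)
# pass a DataFrame with the training column names to avoid a feature-name warning
new_parents = pd.DataFrame([[60, 60]], columns=['Father', 'Mother'])
predicted_height = regr.predict(new_parents)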

print(predicted_height)

In [None]:
#predict a child's height from the father's height, the mother's height and the child's gender
X = galton_data[['Father', 'Mother', 'Gender']]
y = galton_data['Height']

regr = linear_model.LinearRegression()
regr.fit(X, y)
# Gender: 1 = male, 0 = female
new_child = pd.DataFrame([[60, 60, 1]], columns=['Father', 'Mother', 'Gender'])
predicted_height = regr.predict(new_child)

print(predicted_height)

In [None]:
X = galton_data[['Father', 'Mother','Gender']]
y = galton_data['Height'] 

regr = linear_model.LinearRegression()
regr.fit(X, y) 
predicted_height = regr.predict([[60, 60,0]]) 

print(predicted_height)