# Regression

In this lesson we will learn about regression analysis. Regression is a powerful, widely used approach to modeling the relationship between variables. 

We will cover three different types of regression:

1) Linear Regression

2) Multiple Linear Regression

3) Logistic or Logit Regression

## Linear Regression

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

In linear regression, we compute the y-intercept and slope of the best-fit line running through the data points in a scatter plot. Here "best-fit" means that the line minimizes the sum of the squared vertical distances between the data points and the line. The best-fit line can be described by the equation y = mx + b, where m is the slope and b is the y-intercept.
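To make "best-fit" concrete, here is a minimal sketch that computes the slope and intercept from the standard least-squares formulas. The data are made-up toy values, not the Fragile Families data, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Toy data (hypothetical values, not from the Fragile Families files)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.4, 2.9, 3.1, 3.6])

# Closed-form least-squares estimates for the line y = m*x + b
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Sum of squared vertical distances (residuals) for the fitted line
sse_best = np.sum((y - (m * x + b)) ** 2)

# A line with a slightly different slope gives a larger sum
sse_other = np.sum((y - ((m + 0.1) * x + b)) ** 2)

print('slope : {}'.format(m))
print('intercept : {}'.format(b))
print('fitted line has the smaller squared error : {}'.format(sse_best < sse_other))
```

Changing the slope or the intercept in either direction should only increase the sum of squared distances, which is what makes this line the best fit.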

The best-fit line provides what is known as a linear model. That is, we can use the equation y=mx+b to make predictions about the value of y given the value of x.
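As a small illustration of using the line as a model, the sketch below turns a slope and intercept into a prediction function. The values of `m` and `b` and the helper name `predict` are assumptions made for this example only; they are not results from the data used in this lesson.

```python
import numpy as np

# Hypothetical slope and intercept, chosen only for illustration
m, b = 0.37, 1.71

def predict(x):
    """Return the value of y predicted by the line y = m*x + b."""
    return m * np.asarray(x, dtype=float) + b

# Predict y for a single x value and for several x values at once
print(predict(3))
print(predict([1, 5]))
```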

In this section we will build linear regression models to predict a child's GPA. As predictors, we will use language and literacy skills (feature t5c13a), science and social skills (feature t5c13b), and math skills (feature t5c13c).

In [5]:
# First, we import the libraries we will use in this notebook and load the Fragile Families data. 
# The first line sets matplotlib plots to show inside the notebook.
%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import pandas as pd

# Directory with cleaned data
background = "../../ai4all_data/background.csv"
train = "../../ai4all_data/train.csv"
output_dir = "../output"

### Process data

The next cell loads the background data, sets `challengeID` as the index, replaces entries marked 'missing' with -3, and keeps only the numeric columns.

In [6]:
# Load the background data
data_frame = pd.read_csv(background, low_memory=False)
# Get the number of samples (rows) in the data
num_samples = data_frame.shape[0]
# Check that the challengeID values are exactly 1, 2, ..., num_samples, in order.
# Both sides are converted to lists: in Python 3, comparing dict values with a range is always False.
assert list(data_frame['challengeID']) == list(range(1, num_samples + 1))
# Use challengeID as the index
data_frame = data_frame.set_index('challengeID')
# Replace entries marked 'missing' with the code -3
data_frame = data_frame.replace('missing', -3)
# Convert columns to numeric where possible; columns that cannot be converted are left unchanged
data_frame = data_frame.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Keep only the numeric columns
data_frame = data_frame.select_dtypes(include=[np.number])
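To see what these cleaning steps do, here is a minimal sketch on a tiny made-up table. The column names `challengeID` and `t5c13a` mimic the real data, but the values, the `notes` column, and the helper `to_numeric_if_possible` are assumptions for illustration. The helper uses an explicit `try`/`except` in place of `errors='ignore'`, with the same intent.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical values
toy = pd.DataFrame({
    'challengeID': [1, 2, 3],
    # Numbers stored as text, with one entry marked 'missing'
    't5c13a': pd.Series(['4', 'missing', '2'], dtype=object),
    # Free text that cannot be converted to numbers
    'notes': ['a', 'b', 'c'],
})

# Check that the IDs are exactly 1, 2, ..., number of rows
assert list(toy['challengeID']) == list(range(1, toy.shape[0] + 1))

toy = toy.set_index('challengeID')
toy = toy.replace('missing', -3)

def to_numeric_if_possible(column):
    """Convert a column to numbers; return it unchanged if that fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

toy = toy.apply(to_numeric_if_possible)
# Only numeric columns survive; the text column 'notes' is dropped
toy = toy.select_dtypes(include=[np.number])
print(toy)
```

Note that -3 is a numeric code, not a null value, so later calls to `isnull()` will not treat these entries as missing.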

In [None]:
# read outcome data
outcome = pd.read_csv(train, low_memory=False)
# set outcome data index to match data_frame index
outcome = outcome.set_index('challengeID')
# Remove null entries
outcome = outcome.loc[~outcome['gpa'].isnull()]
# How many samples with a GPA do we have now?
outcome.shape[0]

### Linear regression for GPA using language and literacy skills ('t5c13a') as predictor 

In [None]:
# First we select data_frame entries that match the entries in outcome
data_frame = data_frame.loc[data_frame.index.isin(outcome.index.values)]
# We extract the language and literacy skills from the data_frame
lang_lit = data_frame.loc[~data_frame['t5c13a'].isnull()]
lang_lit = lang_lit['t5c13a']
# We extract GPA from the outcome data, in the same row order as lang_lit
Y = outcome.loc[lang_lit.index]
GPA = Y['gpa']
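Because the predictor and the GPA come from two different tables, each value must be paired with the GPA of the same child. The following sketch, using made-up IDs and values, shows one way to align two series on their shared index; the names `skills`, `gpa`, and `common_ids` are hypothetical.

```python
import pandas as pd

# Toy series indexed by hypothetical challengeID values
skills = pd.Series([3, 5, 1], index=[10, 11, 12], name='t5c13a')
gpa = pd.Series([2.0, 3.5, 2.75], index=[12, 10, 13], name='gpa')

# Keep only the IDs present in both series, in one shared order
common_ids = skills.index.intersection(gpa.index)
skills_aligned = skills.loc[common_ids]
gpa_aligned = gpa.loc[common_ids]

# Each row now holds the skill rating and the GPA of the same ID
print(pd.concat([skills_aligned, gpa_aligned], axis=1))
```

Plotting functions such as `plt.scatter` pair values by position, not by index label, so aligning first avoids mismatched points.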

Use a scatter plot and a histogram to see what the data look like.

In [None]:
plt.scatter(lang_lit, GPA)
plt.show()
n, bins, patches = plt.hist(lang_lit,14)
plt.show()

If we calculate the average GPA for students in each skill category (1, 2, 3, 4, 5), the relationship between GPA and literacy skills may be easier to see.

In [None]:
# Calculate average GPA for students whose language and literacy skills are far below average
one = lang_lit.loc[lang_lit == 1 ]
one_gpa = GPA.loc[GPA.index.isin(one.index.values)]
one_gpa_mean = np.mean(one_gpa)
# Calculate average GPA for students whose language and literacy skills are below average
two = lang_lit.loc[lang_lit == 2]
two_gpa = GPA.loc[GPA.index.isin(two.index.values)]
two_gpa_mean = np.mean(two_gpa)
# Calculate average GPA for students whose language and literacy skills are average
three = lang_lit.loc[lang_lit == 3 ]
three_gpa = GPA.loc[GPA.index.isin(three.index.values)]
three_gpa_mean = np.mean(three_gpa)
# Calculate average GPA for students whose language and literacy skills are above average
four = lang_lit.loc[lang_lit == 4 ]
four_gpa = GPA.loc[GPA.index.isin(four.index.values)]
four_gpa_mean = np.mean(four_gpa)
# Calculate average GPA for students whose language and literacy skills are far above average
five = lang_lit.loc[lang_lit == 5 ]
five_gpa = GPA.loc[GPA.index.isin(five.index.values)]
five_gpa_mean = np.mean(five_gpa)
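The same per-category averages can also be computed in one step with pandas `groupby`. This is a sketch on made-up values; `ratings` and `gpas` are hypothetical stand-ins for the skill ratings and GPA series, and they are assumed to share the same index.

```python
import pandas as pd

# Toy ratings (1 to 5) and GPAs sharing one index; the values are made up
ratings = pd.Series([1, 2, 2, 3, 5, 5], index=[1, 2, 3, 4, 5, 6])
gpas = pd.Series([2.0, 2.5, 3.0, 3.0, 3.5, 4.0], index=[1, 2, 3, 4, 5, 6])

# Group the GPAs by rating and average each group
mean_gpa_by_rating = gpas.groupby(ratings).mean()
print(mean_gpa_by_rating)
```

A rating that never occurs in the data (here, 4) simply has no entry in the result.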

We plot the average GPA against language and literacy skills.

In [None]:
X_train = np.array([1,2,3,4,5])
y_train = np.array([one_gpa_mean,two_gpa_mean,three_gpa_mean,four_gpa_mean,five_gpa_mean])
plt.scatter(X_train, y_train)
plt.show()

**Let's do linear regression with `numpy.polyfit`**


In [None]:
coef = np.polyfit(X_train,y_train,1)
print('slope : {}'.format(coef[0]))
print('intercept : {}'.format(coef[1]))


fig = plt.figure()
ax = plt.axes()
plt.scatter(X_train, y_train)
x = np.linspace(0, 6, 100)
ax.plot(x, coef[0]*x + coef[1]);

mse = np.mean(((coef[0]*X_train + coef[1] - y_train) ** 2))
print('mean squared error : {}'.format(mse))

**We can also do linear regression with `scipy.stats.linregress`**

In [None]:
from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(X_train, y_train)

print('slope : {}'.format(slope))
print('intercept : {}'.format(intercept))
print('r-squared : {}'.format(r_value**2))
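For a straight-line fit, r-squared can be read as the fraction of the variation in y that the line explains. The sketch below, on made-up data, computes it directly from the residuals so it can be compared with the square of `r_value` returned by `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.4, 2.9, 3.1, 3.6])

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# Residual sum of squares: variation left over after fitting the line
predicted = slope * x + intercept
ss_res = np.sum((y - predicted) ** 2)
# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print('r-squared from residuals : {}'.format(r_squared))
print('r-squared from linregress : {}'.format(r_value ** 2))
```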

### We can do the same analysis for science and social skills ('t5c13b')

In [None]:
science_social = data_frame.loc[~data_frame['t5c13b'].isnull()]
science_social = science_social['t5c13b']
# Extract GPA in the same row order as science_social
Y = outcome.loc[science_social.index]
GPA = Y['gpa']
plt.scatter(science_social, GPA)
plt.show()
n, bins, patches = plt.hist(science_social,14)
plt.show()

In [None]:
# Calculate average GPA for students whose science and social skills are far below average
one = science_social.loc[science_social == 1 ]
one_gpa = GPA.loc[GPA.index.isin(one.index.values)]
one_gpa_mean = np.mean(one_gpa)
# Calculate average GPA for students whose science and social skills are below average
two = science_social.loc[science_social == 2]
two_gpa = GPA.loc[GPA.index.isin(two.index.values)]
two_gpa_mean = np.mean(two_gpa)
# Calculate average GPA for students whose science and social skills are average
three = science_social.loc[science_social == 3 ]
three_gpa = GPA.loc[GPA.index.isin(three.index.values)]
three_gpa_mean = np.mean(three_gpa)
# Calculate average GPA for students whose science and social skills are above average
four = science_social.loc[science_social == 4 ]
four_gpa = GPA.loc[GPA.index.isin(four.index.values)]
four_gpa_mean = np.mean(four_gpa)
# Calculate average GPA for students whose science and social skills are far above average
five = science_social.loc[science_social == 5 ]
five_gpa = GPA.loc[GPA.index.isin(five.index.values)]
five_gpa_mean = np.mean(five_gpa)

In [None]:
plt.scatter([1,2,3,4,5], [one_gpa_mean,two_gpa_mean,three_gpa_mean,four_gpa_mean,five_gpa_mean])
plt.show()

### We can do the same analysis for math skills ('t5c13c')

In [None]:
math = data_frame.loc[~data_frame['t5c13c'].isnull()]
math = math['t5c13c']
# Extract GPA in the same row order as math
Y = outcome.loc[math.index]
GPA = Y['gpa']
plt.scatter(math, GPA)
plt.show()
n, bins, patches = plt.hist(math,14)
plt.show()

In [None]:
# Calculate average GPA for students whose math skills are far below average
one = math.loc[math == 1 ]
one_gpa = GPA.loc[GPA.index.isin(one.index.values)]
one_gpa_mean = np.mean(one_gpa)
# Calculate average GPA for students whose math skills are below average
two = math.loc[math == 2]
two_gpa = GPA.loc[GPA.index.isin(two.index.values)]
two_gpa_mean = np.mean(two_gpa)
# Calculate average GPA for students whose math skills are average
three = math.loc[math == 3 ]
three_gpa = GPA.loc[GPA.index.isin(three.index.values)]
three_gpa_mean = np.mean(three_gpa)
# Calculate average GPA for students whose math skills are above average
four = math.loc[math == 4 ]
four_gpa = GPA.loc[GPA.index.isin(four.index.values)]
four_gpa_mean = np.mean(four_gpa)
# Calculate average GPA for students whose math skills are far above average
five = math.loc[math == 5 ]
five_gpa = GPA.loc[GPA.index.isin(five.index.values)]
five_gpa_mean = np.mean(five_gpa)

In [None]:
plt.scatter([1,2,3,4,5], [one_gpa_mean,two_gpa_mean,three_gpa_mean,four_gpa_mean,five_gpa_mean])
plt.show()
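Since the three analyses differ only in the feature column, the repeated steps could be wrapped in one function. The sketch below is one possible way to do this; the function name `mean_gpa_by_level` and its parameters are hypothetical, and it is shown on made-up tables, not the real data.

```python
import pandas as pd

def mean_gpa_by_level(features, outcomes, column, levels=(1, 2, 3, 4, 5)):
    """Average GPA for each rating level of one feature column."""
    # Drop null ratings, then keep only IDs that also have an outcome
    ratings = features[column].dropna()
    common_ids = ratings.index.intersection(outcomes.index)
    ratings = ratings.loc[common_ids]
    gpa = outcomes.loc[common_ids, 'gpa']
    # Average GPA per rating; levels with no students become NaN
    means = gpa.groupby(ratings).mean()
    return means.reindex(list(levels))

# Toy tables with hypothetical values; -3 stands for a missing rating
features = pd.DataFrame({'t5c13c': [1, 3, 3, 5, -3]}, index=[1, 2, 3, 4, 5])
outcomes = pd.DataFrame({'gpa': [2.0, 3.0, 3.5, 4.0, 2.5]}, index=[1, 2, 3, 4, 5])

math_means = mean_gpa_by_level(features, outcomes, 't5c13c')
print(math_means)
```

Calling the function with 't5c13a' or 't5c13b' on a table that has those columns would repeat the analysis for the other two skills. Restricting the result to levels 1 to 5 also leaves out coded values such as -3.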

## Multiple Linear Regression

## Logistic Regression