The dataset is taken from
http://onlinestatbook.com/2/case_studies/sat.html

#### Overview
When deciding whether to admit an applicant, colleges take lots of factors, such as grades, sports, activities, leadership positions, awards, teacher recommendations, and test scores, into consideration. Using SAT scores as a basis of whether to admit a student or not has created some controversy. Among other things, people question whether the SATs are fair and whether they predict college performance.
This study examines the SAT and GPA information of 105 students who graduated from a state university with a B.S. in computer science. Using the grades and test scores from high school, can you predict a student's college grades?



##### Questions to Answer
Can the math and verbal SAT scores be used to predict college GPA? Are the high school and college GPAs related? 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
random_state = 42

### Read Data


In [None]:
data = pd.read_excel('sat.xls')
data.head()


### Display correlation matrix 

We can see that high school GPA has a correlation of 0.78 with university GPA

In [None]:
def display_corr_matrix(data):
    # Pairwise Pearson correlations, rounded so the annotations stay readable
    cor_matrix = data.corr().round(2)
    fig, ax = plt.subplots(figsize=(6, 6))
    sns.heatmap(cor_matrix, annot=True, center=0,
                cmap=sns.diverging_palette(250, 10, as_cmap=True), ax=ax)
    plt.show()

In [None]:
display_corr_matrix(data)    
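
Each cell of the heatmap is a Pearson correlation coefficient. As a rough sketch of what `data.corr()` computes for one pair of columns, the snippet below works through the formula on made-up GPA values (the arrays `high` and `univ` are illustrative, not taken from `sat.xls`) and checks the result against `np.corrcoef`.

```python
import numpy as np

# Made-up values for illustration only; not the real dataset
high = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
univ = np.array([2.2, 2.4, 3.1, 3.3, 3.9])

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
high_centered = high - high.mean()
univ_centered = univ - univ.mean()
r_manual = (high_centered * univ_centered).sum() / np.sqrt(
    (high_centered ** 2).sum() * (univ_centered ** 2).sum()
)

# np.corrcoef returns the full 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(high, univ)[0, 1]
print('manual r = {:0.4f}, numpy r = {:0.4f}'.format(r_manual, r_numpy))
```

A value close to 1 means the two columns rise together almost linearly; a value near 0 means there is little linear relationship.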

### Prepare Training and Test data
We will predict university GPA from high school GPA, so X holds the high school GPA values and y holds the university GPA values.

The training set will contain 80% of the samples and the test set the remaining 20%.

The model will be trained only on the training set and evaluated on the test set using root mean square error (RMSE).

In [None]:
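from sklearn.model_selection import train_test_split

# Single feature (high school GPA) and target (university GPA)
X = data['high_GPA'].values
y = data['univ_GPA'].values

# scikit-learn expects X as a 2-D array of shape (n_samples, n_features)
X = X.reshape(-1, 1)
# y is kept 2-D as well, so coef_ and intercept_ come back as arrays (indexed later)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
print('Train shape', X_train.shape, 'Test Shape', X_test.shape)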

#### Display Scatterplot for High School GPA vs. University GPA

In [None]:
def display_gpa(X, y, y_pred = None):
    plt.figure(figsize= (10,6))
    plt.scatter(X, y, color = 'blue')
    if y_pred is not None:
        plt.plot(X, y_pred, color='red')
    plt.xlabel('High School GPA', fontsize = 12)
    plt.ylabel('University GPA', fontsize = 12)
    plt.show()

In [None]:
display_gpa(X_train, y_train)  


### Train the Model 


In [None]:
from sklearn import linear_model
model =linear_model.LinearRegression()
model.fit(X_train, y_train)

#### RMSE on training set

In [None]:
from sklearn.metrics import mean_squared_error
y_train_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_train_pred)
rmse = np.sqrt(mse)
print('Training set RMSE', rmse)
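
For reference, RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in GPA points. A minimal sketch of the same calculation written out with NumPy, using three made-up values rather than the model's output:

```python
import numpy as np

# Made-up actual and predicted GPAs, for illustration only
y_true = np.array([3.0, 2.5, 3.5])
y_hat = np.array([2.8, 2.7, 3.4])

# Square each error, average the squares, then take the square root
squared_errors = (y_true - y_hat) ** 2
rmse_manual = np.sqrt(squared_errors.mean())
print('Manual RMSE', rmse_manual)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.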

### Display the fitted line on Training data

In [None]:
display_gpa(X_train, y_train, y_train_pred)
print('Equation of fitted line is: y =  {:0.2f}x + {:0.2f}'.format(model.coef_[0][0], model.intercept_[0] ))
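
With a single feature, the least-squares slope and intercept have a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below applies those formulas to made-up values (the arrays `x_demo` and `y_demo` are illustrative only) and compares them with `np.polyfit`. On the real training data the same formulas should reproduce `model.coef_` and `model.intercept_` up to floating-point error.

```python
import numpy as np

# Made-up values for illustration only; not the real dataset
x_demo = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
y_demo = np.array([2.2, 2.4, 3.1, 3.3, 3.9])

# Closed-form simple linear regression
x_centered = x_demo - x_demo.mean()
slope = (x_centered * (y_demo - y_demo.mean())).sum() / (x_centered ** 2).sum()
intercept = y_demo.mean() - slope * x_demo.mean()

# np.polyfit with degree 1 returns [slope, intercept]
slope_np, intercept_np = np.polyfit(x_demo, y_demo, 1)
print('y = {:0.2f}x + {:0.2f}'.format(slope, intercept))
```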

#### RMSE on test set

In [None]:
y_test_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)
print('Test set RMSE', rmse)

### Display the fitted line on Test Data

In [None]:
display_gpa(X_test, y_test, y_test_pred)

### Linear Regression Using More Features: Multiple Regression
In the previous section we predicted university GPA from a single independent variable: high school GPA.

Let's add two more variables: the math SAT score and the verbal SAT score.

In [None]:
X =   data[['high_GPA', 'math_SAT', 'verb_SAT']].values
y   = data['univ_GPA'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)
print('Train shape', X_train.shape, 'Test Shape', X_test.shape)
X_train[:5]

In [None]:
model =linear_model.LinearRegression()
model.fit(X_train, y_train)

### Predict on test set
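As we can see, the test RMSE improves only slightly. A likely reason is that the three predictors are highly correlated with one another, so the SAT scores add little information beyond what high school GPA already provides.
Coefficient estimates in a linear model are most stable and easiest to interpret when the independent variables have low correlation with each other.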

In [None]:
y_test_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)
print('Test set RMSE', rmse)
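
To illustrate why a correlated predictor adds little, here is a small sketch on synthetic data (not the SAT dataset). The names `x1`, `x2`, `fit_rmse`, and all the numbers are assumptions chosen for the demonstration: `x2` is built to be nearly a copy of `x1`, and an ordinary least-squares fit is run with and without it using `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# x2 is x1 plus a little noise, so the two predictors are highly correlated
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
# The target depends on x1 only, plus noise
target = 2.0 * x1 + rng.normal(scale=0.5, size=n)

def fit_rmse(features, target):
    """Fit OLS with an intercept and return the RMSE on the same data."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return np.sqrt((residuals ** 2).mean())

rmse_one = fit_rmse(x1.reshape(-1, 1), target)
rmse_two = fit_rmse(np.column_stack([x1, x2]), target)
print('corr(x1, x2) =', round(np.corrcoef(x1, x2)[0, 1], 3))
print('RMSE with x1 only :', rmse_one)
print('RMSE with x1 + x2 :', rmse_two)
```

On the data used for fitting, adding a predictor can never increase the least-squares RMSE, but when the new predictor is nearly redundant the decrease is typically tiny.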