# Introduction to Data Science 
# Activity for Lecture 10: Linear Regression 2
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

Name:Andreas Martinson

Email:andreas.martinson@utah.edu

UID:u1291396


## Class exercise: analysis of the credit dataset 

Recall the 'Credit' dataset introduced in class and available [here](http://www-bcf.usc.edu/~gareth/ISL/data.html). 
This dataset consists of some credit card information for 400 people. 

First import the data and convert Income from thousands of dollars to dollars.


In [2]:
# imports and setup

import scipy as sc
import numpy as np

import pandas as pd
import statsmodels.formula.api as sm     #Last lecture: used statsmodels.formula.api.ols() for OLS
from sklearn import linear_model         #Last lecture: used sklearn.linear_model.LinearRegression() for OLS

import matplotlib.pyplot as plt
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6)

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# Import data from Credit.csv; the first column holds the row index
credit = pd.read_csv('Credit.csv', index_col=0)
# Income is recorded in thousands of dollars; rescale it to dollars
credit["Income"] = credit["Income"] * 1000
credit

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
1,14891.0,3606,283,2,34,11,Male,No,Yes,Caucasian,333
2,106025.0,6645,483,3,82,15,Female,Yes,Yes,Asian,903
3,104593.0,7075,514,4,71,11,Male,No,No,Asian,580
4,148924.0,9504,681,3,36,11,Female,No,No,Asian,964
5,55882.0,4897,357,2,68,16,Male,No,Yes,Caucasian,331
...,...,...,...,...,...,...,...,...,...,...,...
396,12096.0,4100,307,3,32,13,Male,No,Yes,Caucasian,560
397,13364.0,3838,296,5,65,17,Male,No,No,African American,480
398,57872.0,4171,321,5,67,12,Female,No,Yes,Caucasian,138
399,37728.0,2525,192,1,44,13,Male,No,Yes,Caucasian,0


## Activity 1: A First Regression Model

**Exercise:** First regress Limit on Rating: 
$$
\text{Limit} = \beta_0 + \beta_1 \text{Rating}. 
$$
Since credit ratings are primarily used by banks to determine credit limits, we expect Rating to be highly predictive of Limit, so this regression should fit very well. 

Use the 'ols' function from the statsmodels python library. What is the $R^2$ value? What are $H_0$ and $H_A$ for the associated hypothesis test and what is the $p$-value? 


In [3]:
model = sm.ols('Limit ~ Rating', data=credit).fit()
model.summary()


0,1,2,3
Dep. Variable:,Limit,R-squared:,0.994
Model:,OLS,Adj. R-squared:,0.994
Method:,Least Squares,F-statistic:,63480.0
Date:,"Thu, 18 Feb 2021",Prob (F-statistic):,0.0
Time:,18:17:22,Log-Likelihood:,-2649.1
No. Observations:,400,AIC:,5302.0
Df Residuals:,398,BIC:,5310.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-542.9282,22.850,-23.760,0.000,-587.851,-498.006
Rating,14.8716,0.059,251.949,0.000,14.756,14.988

0,1,2,3
Omnibus:,6.887,Durbin-Watson:,2.08
Prob(Omnibus):,0.032,Jarque-Bera (JB):,4.98
Skew:,-0.145,Prob(JB):,0.0829
Kurtosis:,2.537,Cond. No.,970.0


**Answer**

The $R^2$ is 0.994. For the slope, $H_0$ is that Limit has no linear dependence on Rating ($\beta_1 = 0$), and $H_A$ is that it does ($\beta_1 \neq 0$). The reported $p$-value is 0.000 for both the intercept and the slope, which means it is below the displayed precision rather than exactly zero. We therefore reject $H_0$ and conclude that Rating is a significant linear predictor of Limit.
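To make the hypothesis test concrete, here is a minimal sketch that computes the slope, $R^2$, the $t$-statistic, and the two-sided $p$-value by hand. It uses synthetic data as a stand-in for Rating and Limit (the generating coefficients and noise level are assumptions for illustration, not values from the Credit dataset).

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for Rating (x) and Limit (y); these values are illustrative only
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(100, 900, size=n)
y = -500 + 15 * x + rng.normal(0, 150, size=n)

# Closed-form OLS estimates for y = beta0 + beta1 * x
x_c = x - x.mean()
beta1 = (x_c @ (y - y.mean())) / (x_c @ x_c)
beta0 = y.mean() - beta1 * x.mean()

# R^2 = 1 - SSE / SST
resid = y - (beta0 + beta1 * x)
sse = resid @ resid
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst

# t-test of H0: beta1 = 0 against HA: beta1 != 0, with n - 2 degrees of freedom
sigma2 = sse / (n - 2)
se_beta1 = np.sqrt(sigma2 / (x_c @ x_c))
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"beta1={beta1:.3f}, R^2={r2:.3f}, t={t_stat:.1f}, p={p_value:.3g}")
```

With a $t$-statistic in the hundreds, as in the Rating regression, the $p$-value is far smaller than three decimal places can show, which is why the summary prints 0.000.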

## Activity 2: Predicting Limit without Rating 

Since Rating and Limit are almost the same variable, next we'll forget about Rating and just try to predict Limit from the real-valued variables (non-categorical variables): Income, Cards, Age, Education, Balance. 

**Exercise:** Develop a multilinear regression model to predict Limit. Interpret the results. 

For now, just focus on the real-valued variables (Income, Cards, Age, Education, Balance)
and ignore the categorical variables (Gender, Student, Married, Ethnicity). 



In [4]:
model = sm.ols('Limit ~ Income + Cards + Age + Education + Balance', data=credit).fit()
model.summary()

0,1,2,3
Dep. Variable:,Limit,R-squared:,0.94
Model:,OLS,Adj. R-squared:,0.94
Method:,Least Squares,F-statistic:,1242.0
Date:,"Thu, 18 Feb 2021",Prob (F-statistic):,1.32e-238
Time:,18:33:53,Log-Likelihood:,-3101.0
No. Observations:,400,AIC:,6214.0
Df Residuals:,394,BIC:,6238.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1752.3883,170.621,10.271,0.000,1416.947,2087.830
Income,0.0323,0.001,34.737,0.000,0.031,0.034
Cards,-61.7552,20.904,-2.954,0.003,-102.852,-20.658
Age,1.9710,1.683,1.171,0.242,-1.337,5.279
Education,-4.9297,9.107,-0.541,0.589,-22.834,12.974
Balance,3.1921,0.070,45.291,0.000,3.053,3.331

0,1,2,3
Omnibus:,78.568,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,124.063
Skew:,-1.356,Prob(JB):,1.15e-27
Kurtosis:,3.306,Cond. No.,345000.0


Which independent variables are good/bad predictors? What is the best overall model?

**Your observations:**

Based on the $p$-values, Income, Cards, and Balance are significant predictors of Limit ($p \le 0.003$), while Age ($p = 0.242$) and Education ($p = 0.589$) are not. I expect the best model to be a multiple linear regression on Income, Cards, and Balance. The $R^2$ of the current model is 0.94; I will check whether dropping Age and Education changes it by refitting below.
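As a cross-check on the $p$-values, the 95% confidence intervals in the summary can be reproduced from the coefficient and standard error as $\hat\beta \pm t_{0.975,\,\text{df}} \cdot \text{SE}$. The sketch below does this for Age and Cards using the rounded values printed in the table above, so the results agree with the summary only up to rounding. The helper name `conf_int` is made up for this illustration.

```python
from scipy import stats

n_obs, n_params = 400, 6           # observations; intercept + 5 predictors
df_resid = n_obs - n_params        # 394, matches "Df Residuals" in the summary
t_crit = stats.t.ppf(0.975, df_resid)

def conf_int(coef, std_err):
    """Two-sided 95% confidence interval for one coefficient (illustrative helper)."""
    half_width = t_crit * std_err
    return coef - half_width, coef + half_width

# Rounded coefficient and standard error values copied from the summary table
age_lo, age_hi = conf_int(1.9710, 1.683)
cards_lo, cards_hi = conf_int(-61.7552, 20.904)

print(f"Age:   [{age_lo:.3f}, {age_hi:.3f}]")
print(f"Cards: [{cards_lo:.3f}, {cards_hi:.3f}]")
```

The Age interval contains zero, which is consistent with its large $p$-value. The Cards interval lies entirely below zero, which is consistent with its small one.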


In [8]:
model = sm.ols('Limit ~ Income + Cards + Balance', data=credit).fit()
model.summary()

0,1,2,3
Dep. Variable:,Limit,R-squared:,0.94
Model:,OLS,Adj. R-squared:,0.94
Method:,Least Squares,F-statistic:,2071.0
Date:,"Thu, 18 Feb 2021",Prob (F-statistic):,1.46e-241
Time:,18:39:12,Log-Likelihood:,-3101.9
No. Observations:,400,AIC:,6212.0
Df Residuals:,396,BIC:,6228.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1783.7236,77.962,22.879,0.000,1630.453,1936.994
Income,0.0326,0.001,35.747,0.000,0.031,0.034
Cards,-59.7631,20.831,-2.869,0.004,-100.716,-18.810
Balance,3.1836,0.070,45.409,0.000,3.046,3.321

0,1,2,3
Omnibus:,77.822,Durbin-Watson:,1.986
Prob(Omnibus):,0.0,Jarque-Bera (JB):,122.252
Skew:,-1.345,Prob(JB):,2.84e-27
Kurtosis:,3.312,Cond. No.,161000.0


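The $R^2$ value did not change (0.94 in both models), so dropping Age and Education costs essentially no explained variance. The simpler model also has a lower AIC (6212 vs. 6214) and a lower BIC (6228 vs. 6238), so it is the better choice. The $p$-value for the F-statistic decreased as well (1.46e-241 vs. 1.32e-238), although both are effectively zero.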
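The AIC and BIC values in the two summaries can be recovered from the reported log-likelihoods using $\text{AIC} = 2k - 2\ln L$ and $\text{BIC} = k \ln n - 2\ln L$, where $k$ counts the fitted coefficients including the intercept. The sketch below plugs in the rounded log-likelihoods from the tables above. The function and dictionary names are made up for this illustration.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion for k fitted coefficients."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k fitted coefficients and n observations."""
    return k * math.log(n) - 2 * log_lik

n = 400
# Log-likelihoods copied (rounded) from the two summaries
full = {"log_lik": -3101.0, "k": 6}      # intercept + Income, Cards, Age, Education, Balance
reduced = {"log_lik": -3101.9, "k": 4}   # intercept + Income, Cards, Balance

for name, m in [("full", full), ("reduced", reduced)]:
    print(name, round(aic(m["log_lik"], m["k"]), 1), round(bic(m["log_lik"], m["k"], n), 1))
```

The reduced model has a slightly lower log-likelihood, but it uses two fewer parameters. Both criteria penalize parameters, so both favor the reduced model.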

## Activity 3: Incorporating Categorical Variables Into Regression Models

Now consider the binary categorical variables which we mapped to integer 0, 1 values in class.

In [6]:
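# Strip whitespace so that both ' Male' (leading space, as in the original key) and 'Male' are mapped
credit["Gender_num"] = credit["Gender"].str.strip().map({'Male':0, 'Female':1})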
credit["Student_num"] = credit["Student"].map({'Yes':1, 'No':0})
credit["Married_num"] = credit["Married"].map({'Yes':1, 'No':0})
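One thing to watch when mapping labels through a dict: `Series.map` silently returns `NaN` for any label that is not among the keys, for example one with stray whitespace. A small sketch on a made-up Series (the labels below are hypothetical, not taken from Credit.csv):

```python
import pandas as pd

# Hypothetical labels, two of which carry stray whitespace
gender = pd.Series([' Male', 'Female', 'Male', 'Female '])
mapping = {'Male': 0, 'Female': 1}

naive = gender.map(mapping)              # unmatched labels become NaN
clean = gender.str.strip().map(mapping)  # strip whitespace before mapping

n_unmapped_naive = int(naive.isna().sum())
n_unmapped_clean = int(clean.isna().sum())
print(n_unmapped_naive, n_unmapped_clean)
```

Checking `isna().sum()` after a mapping like this is a cheap way to catch labels that did not match any key before they reach the regression.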

Can you improve the model you developed in Activity 2 by incorporating one or more of these variables?


In [14]:
# Initial test
# model = sm.ols('Limit ~ Income + Cards + Balance + Gender_num + Student_num + Married_num', data=credit).fit()
# model.summary()

model = sm.ols('Limit ~ Income + Cards + Balance + Student_num', data=credit).fit()
model.summary()

0,1,2,3
Dep. Variable:,Limit,R-squared:,0.976
Model:,OLS,Adj. R-squared:,0.976
Method:,Least Squares,F-statistic:,3979.0
Date:,"Thu, 18 Feb 2021",Prob (F-statistic):,1.39e-317
Time:,18:42:11,Log-Likelihood:,-2920.6
No. Observations:,400,AIC:,5851.0
Df Residuals:,395,BIC:,5871.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1900.7873,49.855,38.126,0.000,1802.772,1998.802
Income,0.0309,0.001,52.827,0.000,0.030,0.032
Cards,-78.5986,13.281,-5.918,0.000,-104.708,-52.489
Balance,3.5059,0.047,75.271,0.000,3.414,3.597
Student_num,-1516.6454,62.834,-24.137,0.000,-1640.177,-1393.114

0,1,2,3
Omnibus:,97.253,Durbin-Watson:,1.948
Prob(Omnibus):,0.0,Jarque-Bera (JB):,174.534
Skew:,-1.389,Prob(JB):,1.26e-38
Kurtosis:,4.66,Cond. No.,201000.0


**Your answer goes here:**

I initially added all three categorical variables, then removed Gender and Married because neither coefficient was significant. I kept Student because it was significant. The $R^2$ for this model is 0.976, an increase of 0.036 (3.6 percentage points) over the previous model. The $p$-value for the F-statistic is also lower (1.39e-317 vs. 1.46e-241), and the AIC dropped from 6212 to 5851, both indicating a better model.
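To interpret the fitted coefficients, the sketch below evaluates the model equation by hand for the first row of the dataset (Income 14891, Cards 2, Balance 333, not a student, actual Limit 3606). The coefficients are the rounded values printed in the summary, so the prediction is approximate. The function name `predict_limit` is made up for this illustration.

```python
# Rounded coefficients copied from the summary table of the final model
coef = {
    "Intercept": 1900.7873,
    "Income": 0.0309,
    "Cards": -78.5986,
    "Balance": 3.5059,
    "Student_num": -1516.6454,
}

def predict_limit(income, cards, balance, student):
    """Evaluate the fitted linear model for one person (illustrative helper)."""
    return (coef["Intercept"]
            + coef["Income"] * income
            + coef["Cards"] * cards
            + coef["Balance"] * balance
            + coef["Student_num"] * student)

# First row of the dataset: a non-student with an actual Limit of 3606
pred_non_student = predict_limit(14891.0, 2, 333, 0)
residual = 3606 - pred_non_student

# The same person, hypothetically flagged as a student
pred_student = predict_limit(14891.0, 2, 333, 1)

print(f"predicted: {pred_non_student:.2f}, residual: {residual:.2f}")
print(f"student effect: {pred_student - pred_non_student:.2f}")
```

The difference between the two predictions equals the Student_num coefficient. With Income, Cards, and Balance held fixed, the model predicts a limit about 1517 lower for a student.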