# DS-SF-30 | Assignment 09: Linear Regression, Part 3

In [31]:
import os

import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

from sklearn import feature_selection, linear_model
from sklearn.metrics import r2_score

In [32]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-09-credit.csv'))

In [33]:
df

Unnamed: 0,Income,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,14.891,283,2,34,11,Male,No,Yes,Caucasian,333
1,106.025,483,3,82,15,Female,Yes,Yes,Asian,903
2,104.593,514,4,71,11,Male,No,No,Asian,580
3,148.924,681,3,36,11,Female,No,No,Asian,964
4,55.882,357,2,68,16,Male,No,Yes,Caucasian,331
...,...,...,...,...,...,...,...,...,...,...
395,12.096,307,3,32,13,Male,No,Yes,Caucasian,560
396,13.364,296,5,65,17,Male,No,No,African American,480
397,57.872,321,5,67,12,Female,No,Yes,Caucasian,138
398,37.728,192,1,44,13,Male,No,Yes,Caucasian,0


A description of the dataset is as follows:

- Income (in thousands of dollars)
- Rating: Credit score rating
- Cards: Number of Credit cards owned
- Age
- Education: Years of Education
- Gender: Male/Female
- Student: Yes/No
- Married: Yes/No
- Ethnicity: African American/Asian/Caucasian
- Balance: Average credit card debt

> ## Question 1.  Let's explore the quantitative variables that affect `Balance`.  From your preliminary analysis, which 2 variables seem to affect `Balance` the most?  Our goal is interpretation; can we use these 2 variables simultaneously?  Why or why not?

In [34]:
df.corr()

Unnamed: 0,Income,Rating,Cards,Age,Education,Balance
Income,1.0,0.791378,-0.018273,0.175338,-0.027692,0.463656
Rating,0.791378,1.0,0.053239,0.103165,-0.030136,0.863625
Cards,-0.018273,0.053239,1.0,0.042948,-0.051084,0.086456
Age,0.175338,0.103165,0.042948,1.0,0.003619,0.001835
Education,-0.027692,-0.030136,-0.051084,0.003619,1.0,-0.008062
Balance,0.463656,0.863625,0.086456,0.001835,-0.008062,1.0


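Answer: `Rating` (r = 0.86) and `Income` (r = 0.46) have the strongest correlations with `Balance`. We probably should not use both at once when the goal is interpretation: they are highly correlated with each other (r = 0.79), so their individual coefficients would be hard to separate.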
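As a rough illustration of how much that correlation matters, here is a minimal sketch of a variance inflation factor (VIF) for the two-predictor case. It assumes only `Income` and `Rating` are in the model, so the VIF reduces to a function of their pairwise correlation; the variable names are illustrative.

```python
# Pairwise correlation between Income and Rating, copied from df.corr() above.
r_income_rating = 0.791378

# With exactly two predictors, the R^2 of one regressed on the other equals
# r^2, so the variance inflation factor reduces to 1 / (1 - r^2).
vif = 1.0 / (1.0 - r_income_rating ** 2)
print(round(vif, 2))
```

Under that assumption, the variance of each coefficient estimate is inflated by a factor of roughly 2.7 compared with uncorrelated predictors.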

> ## Question 2.  `Ethnicity`, `Gender`, `Married`, and `Student` are categorical variables.  Go ahead and create dummy variables for all of them.

In [35]:
ethnicity_data = pd.get_dummies(df.Ethnicity, prefix = 'Ethnicity')
ethnicity_data

Unnamed: 0,Ethnicity_African American,Ethnicity_Asian,Ethnicity_Caucasian
0,0.0,0.0,1.0
1,0.0,1.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
...,...,...,...
395,0.0,0.0,1.0
396,1.0,0.0,0.0
397,0.0,0.0,1.0
398,0.0,0.0,1.0


In [36]:
gender_data = pd.get_dummies(df.Gender, prefix = 'Gender')
married_data = pd.get_dummies(df.Married, prefix = 'Married')
student_data = pd.get_dummies(df.Student, prefix = 'Student')

df = df.join([gender_data, ethnicity_data, married_data, student_data])

In [37]:
df.columns

Index([u'Income', u'Rating', u'Cards', u'Age', u'Education', u'Gender',
       u'Student', u'Married', u'Ethnicity', u'Balance', u'Gender_Female',
       u'Gender_Male', u'Ethnicity_African American', u'Ethnicity_Asian',
       u'Ethnicity_Caucasian', u'Married_No', u'Married_Yes', u'Student_No',
       u'Student_Yes'],
      dtype='object')
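The regressions below use only some of the dummy columns for each variable (for example `Gender_Male` but not `Gender_Female`), because including every level alongside an intercept makes the columns perfectly collinear. A minimal sketch of one way to get that directly, using a toy frame rather than the credit data (the name `toy` and its values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the credit data (values are illustrative only).
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Ethnicity': ['Caucasian', 'Asian', 'African American'],
})

# drop_first=True drops one level per variable; the dropped level becomes
# the baseline that the remaining coefficients are measured against.
dummies = pd.get_dummies(toy, columns=['Gender', 'Ethnicity'],
                         drop_first=True, dtype=float)
print(list(dummies.columns))
```

Note that `drop_first` always drops the first level in sorted order, which may not be the baseline you would pick by hand.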

> ## Question 3.  Using _sklearn_'s linear regression, predict `Balance` using `Income`, `Cards`, `Age`, `Education`, `Gender`, and `Ethnicity`

First, find the coefficients of your regression line.

In [38]:
X = df[ ['Income', 'Cards', 'Age', 'Education', 'Gender_Male', 'Ethnicity_African American', 'Ethnicity_Caucasian'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [39]:
print model.intercept_
print model.coef_

250.621754843
[  6.27995894  33.62953508  -2.32970547   1.64553607 -27.12543123
   6.54603078  10.02100719]


Then, find the p-values of your model's F-values.  You have a few variables, so try to show your p-values alongside the names of the variables.  (https://docs.python.org/2/library/functions.html#zip)

In [40]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Income', 1.0308858025893587e-22),
 ('Cards', 0.084176555599370956),
 ('Age', 0.97081387233013317),
 ('Education', 0.87230640156710226),
 ('Gender_Male', 0.66851610550260099),
 ('Ethnicity_African American', 0.78443031908756577),
 ('Ethnicity_Caucasian', 0.94772751139663791)]
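`f_regression` scores each feature on its own, which is why a feature's p-value does not depend on which other columns are in `X`. The sketch below, on synthetic data (`X_demo` and `y_demo` are made up for illustration), recomputes the F statistic by hand from each feature's correlation with the target and compares it with what `f_regression` returns.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
n = 200
X_demo = rng.normal(size=(n, 2))                      # two synthetic features
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=n)      # only feature 0 drives y

F_sklearn, p_sklearn = f_regression(X_demo, y_demo)

# Univariate F statistic: r^2 / (1 - r^2) * (n - 2), one feature at a time.
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])
F_manual = r ** 2 / (1.0 - r ** 2) * (n - 2)

print(np.allclose(F_sklearn, F_manual))
```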

> ## Question 4.  Which of your coefficients are significant at the 5% significance level?

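Answer: Only `Income` (p ≈ 1e-22); every other p-value is above 0.05. Note that `f_regression` tests each feature on its own, so these are univariate p-values rather than p-values for the coefficients of the joint model.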

> ## Question 5.  What is your model's $R^2$?

In [41]:
model.score(X, y)

0.23231260833540465

> ## Question 6.  How do we interpret this value?

Answer: About 23% of the variability in `Balance` is explained by the model.
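To make that interpretation concrete, here is a small sketch of $R^2$ computed by hand as one minus the ratio of residual to total sum of squares. The observed values are the first five `Balance` values shown earlier; the predictions are made up purely for illustration.

```python
# First five observed Balance values; the predictions are hypothetical.
y_true = [333.0, 903.0, 580.0, 964.0, 331.0]
y_pred = [400.0, 800.0, 600.0, 900.0, 410.0]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: variation the predictions fail to explain.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: variation of the target around its mean.
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1.0 - ss_res / ss_tot
print(round(r2, 4))
```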

> ## Question 7.  Now let's focus on the two most significant variables from your previous model and re-run your regression model.

In [43]:
X = df[ ['Income', 'Cards']]

model = linear_model.LinearRegression()
model.fit(X,y)
print model.intercept_
print model.coef_

151.329946349
[  6.07099859  31.83812895]


In [44]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Income', 1.0308858025893587e-22), ('Cards', 0.084176555599370956)]

> ## Question 8.  In comparison to the previous model, did the $R^2$ increase or decrease?  Why?

In [45]:
model.score(X, y)

0.22399175162249518

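Answer: It went down slightly (0.232 to 0.224) but is about the same. On the training data, dropping predictors can never raise $R^2$; the small drop shows that most of the signal is carried by these two variables.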
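Because plain $R^2$ cannot go up when predictors are removed, adjusted $R^2$ is often a fairer way to compare the two models. The sketch below applies the standard formula to the two scores above; the row count `n = 400` is an assumption, since the notebook does not show the exact number of rows.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fit on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 400  # ASSUMPTION: the exact row count is not shown in the notebook

adj_full = adjusted_r2(0.23231260833540465, n, p=7)   # seven-predictor model
adj_small = adjusted_r2(0.22399175162249518, n, p=2)  # Income and Cards only

print(adj_small > adj_full)
```

Under that assumption the two-variable model has the slightly higher adjusted $R^2$, which supports keeping the smaller model.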

> ## Question 9.  Now let's regress `Balance` on `Gender` alone.  After running your linear regressions, do you have enough evidence to claim that females have more balance than males?  (Hint: Look at the p-value of the Gender coefficient.  If it is significant then you will have evidence to support that claim, otherwise you cannot support the statement.)

In [46]:
X = df[ ['Gender_Male']]

model = linear_model.LinearRegression()
model.fit(X,y)
print model.intercept_
print model.coef_

529.536231884
[-19.73312308]


In [47]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Gender_Male', 0.66851610550260099)]

Answer: No. The p-value (0.67) is far above 0.05, so the difference between the genders is not statistically significant.
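When the only predictor is a 0/1 dummy, the intercept is the mean of the baseline group and the coefficient is the difference between the two group means. A minimal sketch on toy data (the arrays are illustrative, not from the credit dataset), using NumPy's least-squares solver in place of _sklearn_:

```python
import numpy as np

# Toy data (illustrative only): a 0/1 dummy and a numeric outcome.
dummy = np.array([0, 0, 0, 1, 1], dtype=float)
outcome = np.array([500.0, 540.0, 520.0, 480.0, 520.0])

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones_like(dummy), dummy])
(intercept, slope), _, _, _ = np.linalg.lstsq(design, outcome, rcond=None)

baseline_mean = outcome[dummy == 0].mean()
other_mean = outcome[dummy == 1].mean()

print(np.isclose(intercept, baseline_mean))
print(np.isclose(slope, other_mean - baseline_mean))
```

Read this way, the output above says the average `Balance` for females is about 529.5 and males average about 19.7 less.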

> ## Question 10.  Now let's regress `Balance` on `Ethnicity`.  After running your linear regressions, do you have enough evidence to claim that some ethnic groups carry more balance than others?

In [48]:
X = df[ ['Ethnicity_African American','Ethnicity_Asian']]

model = linear_model.LinearRegression()
model.fit(X,y)
print model.intercept_
print model.coef_

518.497487437
[ 12.50251256  -6.18376195]


In [49]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Ethnicity_African American', 0.78443031908756577),
 ('Ethnicity_Asian', 0.84489564436221742)]

Answer: No. Again, the p-values are high (0.78 and 0.84).

> ## Question 11.  Finally let's regress `Balance` on `Student`.  After running your linear regressions, do you have enough evidence to claim that students carry more balance than non-students?

In [50]:
X = df[ ['Student_Yes']]

model = linear_model.LinearRegression()
model.fit(X,y)
print model.intercept_
print model.coef_

480.369444444
[ 396.45555556]


In [51]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Student_Yes', 1.4877341077327523e-07)]

Answer: Yes, students carry more credit card debt than non-students (about 396 more on average). The p-value is very low (about 1.5e-07).

> ## Question 12.  Now let's consider the effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?

In [53]:
X = df[ ['Student_Yes', 'Income']]  # predictors only; the target Balance must not be in X

model = linear_model.LinearRegression()
model.fit(X,y)
print model.intercept_
print model.coef_

In [54]:
zip(X.columns.values, feature_selection.f_regression(X, y)[1])

[('Student_Yes', 1.4877341077327523e-07), ('Income', 1.0308858025893587e-22)]

Answer: Yes, both are significant. Both p-values are far below 0.05.

> ## Question 13.  Now let's consider the interaction effect of `Student` and `Income` on `Balance` simultaneously.  Are all the coefficients significant?  If they are, write down your regression model below

(First generate a new variable for the interaction term)

In [55]:
df['Student * Income'] = df['Student_Yes'] * df['Income']

In [56]:
X = df[ ['Student_Yes', 'Income', 'Student * Income'] ]
y = df.Balance

model = linear_model.LinearRegression()
model.fit(X,y)

print model.intercept_
print model.coef_
print feature_selection.f_regression(X, y)[1]

200.62315295
[ 476.67584321    6.21816874   -1.99915087]
[  1.48773411e-07   1.03088580e-22   4.61768368e-08]


Answer: All three univariate p-values are far below 0.05, so all terms are significant. The regression model is:
Balance = 200.62 + 476.68 * Student_Yes + 6.22 * Income - 2.00 * Student_Yes * Income
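The interaction term means students and non-students get different slopes on `Income`, not just different intercepts. A small sketch that evaluates the fitted equation directly; the coefficients are copied from the output above, while the function and constant names are illustrative.

```python
# Coefficients copied from the fitted interaction model above.
INTERCEPT = 200.62315295
B_STUDENT = 476.67584321
B_INCOME = 6.21816874
B_INTERACTION = -1.99915087

def predicted_balance(income, is_student):
    """Predicted Balance for an income (in thousands) and a student flag."""
    s = 1.0 if is_student else 0.0
    return (INTERCEPT + B_STUDENT * s
            + B_INCOME * income + B_INTERACTION * s * income)

# Slope on Income for each group: change in prediction per unit of Income.
non_student_slope = predicted_balance(1, False) - predicted_balance(0, False)
student_slope = predicted_balance(1, True) - predicted_balance(0, True)

print(round(non_student_slope, 2))
print(round(student_slope, 2))
```

Each extra thousand dollars of income adds about 6.22 to the predicted balance for non-students but only about 4.22 for students.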

> ## Question 14.  Is there any income level at which students and non-students on average carry same level of balance?

Answer:

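- Non-students: $Balance = 200.62 + 6.22 * Income$
- Students: $Balance = 200.62 + 476.68 + 6.22 * Income - 2.00 * Income$

The two lines meet where $476.68 - 2.00 * Income = 0$, that is, at an income of about 238 (in thousands of dollars). That is far above the average student income computed below, so in practice the answer is no.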

In [58]:
df[df.Student_Yes==1].Income.mean()

47.292049999999996
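The break-even income can be checked with a couple of lines of arithmetic. This sketch uses the coefficients from the interaction model and the mean student income from the cell above; the variable names are illustrative.

```python
# Coefficients copied from the interaction model fitted earlier.
b_student = 476.67584321
b_interaction = -1.99915087

# Students and non-students differ by b_student + b_interaction * income,
# so the gap closes where that expression equals zero.
break_even_income = -b_student / b_interaction

mean_student_income = 47.292049999999996  # from the cell above
ratio = break_even_income / mean_student_income

print(round(break_even_income, 1))
print(round(ratio, 1))
```

The break-even point is roughly five times the average student income, which is why the crossover is of little practical relevance.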