In [16]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
url = "https://raw.githubusercontent.com/ga-students/DS-SF-24/master/Data/Credit.csv"
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
CreditData = pd.read_csv(url)
CreditData.head(10)

Unnamed: 0.1,Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
5,6,80.18,8047,569,4,77,10,Male,No,No,Caucasian,1151
6,7,20.996,3388,259,2,37,12,Female,No,No,African American,203
7,8,71.408,7114,512,2,87,9,Male,No,No,Asian,872
8,9,15.125,3300,266,5,66,13,Female,No,No,Caucasian,279
9,10,71.061,6819,491,3,41,19,Female,Yes,Yes,African American,1350


In [17]:
del CreditData['Unnamed: 0']

#### Let's look at the correlation matrix. This time, we only explore the quantitative variables that affect Credit Balance. From your preliminary analysis, which 3 variables seem to affect Balance the most? If our goal is interpretation, can we use these 3 variables simultaneously? Why?

In [18]:
CreditData.corr()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance
Income,1.0,0.792088,0.791378,-0.018273,0.175338,-0.027692,0.463656
Limit,0.792088,1.0,0.99688,0.010231,0.100888,-0.023549,0.861697
Rating,0.791378,0.99688,1.0,0.053239,0.103165,-0.030136,0.863625
Cards,-0.018273,0.010231,0.053239,1.0,0.042948,-0.051084,0.086456
Age,0.175338,0.100888,0.103165,0.042948,1.0,0.003619,0.001835
Education,-0.027692,-0.023549,-0.030136,-0.051084,0.003619,1.0,-0.008062
Balance,0.463656,0.861697,0.863625,0.086456,0.001835,-0.008062,1.0


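Answer:  1. Rating, Limit, and Income (correlations with Balance of 0.864, 0.862, and 0.464)
         2. No. Limit and Rating have a 0.996880 correlation, so they are nearly collinear and their individual coefficients could not be interpreted if both were in the model. Income is also fairly strongly correlated (about 0.79) with each of them.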
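
One way to spot such pairs without reading the matrix by eye is to scan it for entries above a cutoff. The sketch below is only an illustration: the helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions, and the data are the first ten rows shown above rather than the full file.

```python
import pandas as pd

# First ten rows of the notebook's data, quantitative columns only.
sample = pd.DataFrame({
    "Income": [14.891, 106.025, 104.593, 148.924, 55.882,
               80.18, 20.996, 71.408, 15.125, 71.061],
    "Limit": [3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819],
    "Rating": [283, 483, 514, 681, 357, 569, 259, 512, 266, 491],
    "Cards": [2, 3, 4, 3, 2, 4, 2, 2, 5, 3],
})


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| above threshold.

    Hypothetical helper; the threshold is an arbitrary illustrative cutoff.
    """
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


flagged = highly_correlated_pairs(sample, threshold=0.9)
for col_a, col_b, r in flagged:
    print(col_a, col_b, round(r, 3))
```

Any pair this flags is a candidate for dropping one of its members before interpreting coefficients.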

#### There are a few categorical variables; let's first create dummy variables for them
    

In [19]:

RaceDummy = pd.get_dummies(CreditData.Ethnicity, prefix = 'Race')
del RaceDummy['Race_African American']

GenderDummy = pd.get_dummies(CreditData.Gender, prefix = 'Gender')
del GenderDummy['Gender_ Male']  

MarriedDummy = pd.get_dummies(CreditData.Married, prefix = 'Married')
del MarriedDummy['Married_No']

StudentDummy = pd.get_dummies(CreditData.Student, prefix = 'Student')
del StudentDummy['Student_No']

CreditData = pd.concat([CreditData, RaceDummy,GenderDummy,MarriedDummy,StudentDummy], axis=1)

CreditData.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance,Race_Asian,Race_Caucasian,Gender_Female,Married_Yes,Student_Yes
0,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333,0.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903,1.0,0.0,1.0,1.0,1.0
2,104.593,7075,514,4,71,11,Male,No,No,Asian,580,1.0,0.0,0.0,0.0,0.0
3,148.924,9504,681,3,36,11,Female,No,No,Asian,964,1.0,0.0,1.0,0.0,0.0
4,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331,0.0,1.0,0.0,1.0,0.0
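
The create-then-delete pattern above can also be written in one call. This is a sketch, assuming a pandas version that supports the `drop_first` and `dtype` arguments of `pd.get_dummies`; the small `sample` frame is made up for illustration and is not the full data set.

```python
import pandas as pd

# Tiny illustrative frame (not the real data set).
sample = pd.DataFrame({
    "Ethnicity": ["Caucasian", "Asian", "Asian", "African American"],
    "Student": ["No", "Yes", "No", "No"],
})

# drop_first=True removes the first category (in sorted order) of each
# variable, which serves as the baseline level: here 'African American'
# and 'No', the same baselines the notebook deletes by hand.
dummies = pd.get_dummies(
    sample,
    columns=["Ethnicity", "Student"],
    prefix=["Race", "Student"],
    drop_first=True,
    dtype=int,
)
print(dummies)
```

Dropping one level per variable avoids the dummy-variable trap: with an intercept in the model, a full set of dummies would be perfectly collinear.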


# Now it's time for some fun!

#### Fit a regression line that uses Education, Ethnicity, Gender, Age, Cards, and Income to predict Balance.

First step: find the coefficients of your regression line.

In [32]:
# ListVariables = CreditData.columns.values
# print(ListVariables)

x = CreditData[['Education', 'Race_Asian', 'Race_Caucasian', 'Gender_Female', 'Age', 'Cards', 'Income']]
y = CreditData['Balance']
linreg.fit(x, y)

print(linreg.intercept_)
print(linreg.coef_)


230.042354393
[  1.64553607  -6.54603078   3.47497641  27.12543123  -2.32970547
  33.62953508   6.27995894]


Second step: find the p-values of your estimates. Since you have several variables, show the p-values alongside the names of the variables.

In [33]:
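from sklearn import feature_selection
# f_regression runs a separate univariate F-test for each column of x;
# element [1] of its result holds the p-values.
p_vals = feature_selection.f_regression(x, y)[1]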

zip(x, p_vals)

[('Education', 0.87230640156710226),
 ('Race_Asian', 0.84489564436221742),
 ('Race_Caucasian', 0.94772751139663791),
 ('Gender_Female', 0.66851610550260099),
 ('Age', 0.97081387233013317),
 ('Cards', 0.084176555599370956),
 ('Income', 1.0308858025893513e-22)]

**Which of your coefficients are significant at significance level 5%?**

Answer: Income 
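
Note that `f_regression` tests each predictor on its own, so the values above are not the p-values of the coefficients in the joint model. A sketch of how the joint p-values could be obtained with the `statsmodels` formula API follows; it uses only the first ten rows shown earlier and a two-predictor formula chosen for illustration, so its numbers will not match the full data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# First ten rows of the notebook's data (illustration only).
sample = pd.DataFrame({
    "Income": [14.891, 106.025, 104.593, 148.924, 55.882,
               80.18, 20.996, 71.408, 15.125, 71.061],
    "Cards": [2, 3, 4, 3, 2, 4, 2, 2, 5, 3],
    "Balance": [333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350],
})

# Naming the columns in the formula makes the fitted model label each
# coefficient and p-value with the variable name.
model = smf.ols("Balance ~ Income + Cards", data=sample).fit()
print(model.pvalues)
```

Each entry of `model.pvalues` is the t-test p-value for that coefficient with the other predictors held in the model.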

#### What is the R-Squared of your model?

In [34]:
RSS = sum((y - linreg.predict(x)) ** 2)
TSS = sum((y - y.mean()) ** 2)
r_squared = 1 - float(RSS/TSS)
print(r_squared)
linreg.score(x, y)

0.232312608335


0.23231260833540443

#### How do we interpret this value?

Answer: our regression line accounts for about 23.2% of the variability in Balance; the remaining 76.8% is left unexplained by these predictors.

#### Now focus on two of the most significant variables from your previous model and re-run your regression model. 

In [35]:
x = CreditData[['Cards', 'Income']]
y = CreditData['Balance']

linreg.fit(x, y)
linreg.score(x, y)



0.22399175162249518

**In comparison to the previous model, did our R-Squared increase or decrease? Why?**

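Answer: our R-squared decreased slightly, from 0.2323 to 0.2240. Removing predictors from a least-squares model can never raise the R-squared computed on the same data, because the larger model can always reproduce the smaller one by setting the extra coefficients to zero. The drop is small because the removed predictors explain very little of Balance.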
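
Because plain R-squared always favors the larger model, adjusted R-squared is often used to compare models with different numbers of predictors. The sketch below applies the standard formula to the two scores printed above. The sample size `n_obs = 400` is an assumption, since the notebook never prints the number of rows, and `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_obs = 400  # ASSUMPTION: row count is not shown in the notebook

# R-squared values copied from the notebook output.
full_model = adjusted_r_squared(0.23231260833540443, n_obs, n_predictors=7)
reduced_model = adjusted_r_squared(0.22399175162249518, n_obs, n_predictors=2)

print(full_model)
print(reduced_model)
```

Under that assumed sample size, the two-variable model comes out with the slightly higher adjusted value, which suggests the dropped predictors were not earning their place.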

#### Now let's regress Balance on Gender alone. After running your regression line, do you have enough evidence to claim that females carry a higher balance than males? (Hint: Look at the p-value of the Gender coefficient. If it is significant, you have evidence to support that claim; otherwise you cannot support the statement.)

In [41]:
x = CreditData[['Gender_Female']]
y = CreditData['Balance']

p_vals = feature_selection.f_regression(x, y)[1]
zip(x, p_vals)

[('Gender_Female', 0.66851610550260099)]

Answer: no, there is not enough evidence to support that claim 

#### Now let's regress Balance on Ethnicity. After running your regression line, do you have enough evidence to claim that some ethnic groups carry a higher balance than others? (Hint: Look at the p-values of your dummy variables. If they are significant, you have evidence to support that claim; otherwise you cannot support the statement.)

In [42]:
x = CreditData[['Race_Asian', 'Race_Caucasian']]
y = CreditData['Balance']

p_vals = feature_selection.f_regression(x, y)[1]
zip(x, p_vals)

[('Race_Asian', 0.84489564436221742), ('Race_Caucasian', 0.94772751139663791)]

Answer: no, there is not enough evidence to support that claim 

#### I know you are getting tired of this, but for the last time, regress Balance on student status. After running your regression line, do you have enough evidence to claim that students carry a higher balance than others? (Hint: Look at the p-value of your dummy variable. If it is significant, you have evidence to support that claim; otherwise you cannot support the statement.)


In [43]:
x = CreditData[['Student_Yes']]
y = CreditData['Balance']

p_vals = feature_selection.f_regression(x, y)[1]
zip(x, p_vals)

[('Student_Yes', 1.4877341077327523e-07)]

Answer: yes, the p-value is below 5% significance level 

#### Now let's consider the effect of student status and income on balance simultaneously. Let's start with a regression line.

In [46]:
x = CreditData[['Student_Yes', 'Income']]
y = CreditData['Balance']

p_vals = feature_selection.f_regression(x, y)[1]
zip(x, p_vals)



[('Student_Yes', 1.4877341077327523e-07), ('Income', 1.0308858025893513e-22)]

#### Are all of our regression coefficients significant? If yes, interpret them.

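Answer: yes, both p-values are far below the 5% level (about 1.5e-07 for Student_Yes and 1.0e-22 for Income). Note that f_regression tests each variable on its own, which is why these values match the earlier single-variable results.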

#### Now let's explore the interaction between income and student status. Let's start with a regression line.

In [48]:
# First generate a column for interaction term
CreditData['Studentship_Income'] = CreditData['Student_Yes'] * CreditData['Income']
CreditData.head(2)

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance,Race_Asian,Race_Caucasian,Gender_Female,Married_Yes,Student_Yes,Studentship_Income
0,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333,0.0,1.0,0.0,1.0,0.0,0.0
1,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903,1.0,0.0,1.0,1.0,1.0,106.025


In [53]:
x = CreditData[['Student_Yes', 'Income', 'Studentship_Income']]
y = CreditData['Balance']

p_vals = feature_selection.f_regression(x, y)[1]
zip(x, p_vals)

linreg.fit(x, y)
lm = smf.ols(formula='y ~ x', data=CreditData).fit()
r_squared = linreg.score(x, y)

print('Intercept =' ,linreg.intercept_)
print('coefficients =' ,linreg.coef_)
print('P-Values = ',lm.pvalues)
print('R_Squared = ', r_squared)

('Intercept =', 200.62315294978094)
('coefficients =', array([ 476.67584321,    6.21816874,   -1.99915087]))
('P-Values = ', Intercept    5.789658e-09
x[0]         6.586095e-06
x[1]         6.340684e-23
x[2]         2.488919e-01
dtype: float64)
('R_Squared = ', 0.27988370306198973)
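
The `x[0]`, `x[1]`, `x[2]` labels above follow the column order of `x` (Student_Yes, Income, Studentship_Income). As a sketch of an alternative, the formula interface can build the interaction term itself and label every coefficient by name. The example uses only the first ten rows shown earlier, so its estimates are illustrative and will differ from the output above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# First ten rows of the notebook's data (illustration only).
sample = pd.DataFrame({
    "Income": [14.891, 106.025, 104.593, 148.924, 55.882,
               80.18, 20.996, 71.408, 15.125, 71.061],
    "Student_Yes": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    "Balance": [333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350],
})

# In a formula, `a * b` expands to a + b + a:b, so there is no need to
# create the product column by hand.
model = smf.ols("Balance ~ Student_Yes * Income", data=sample).fit()
print(model.params)
print(model.pvalues)
```

The interaction coefficient appears under the name `Student_Yes:Income`, which makes it harder to misread which p-value belongs to which term.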


#### Are our coefficients significant? If they are, write down your regression line below:

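Answer: not all of them. Student_Yes (p = 6.6e-06) and Income (p = 6.3e-23) are significant at the 5% level, but the interaction term Studentship_Income is not (p = 0.249). The fitted line is: Balance = 200.62315294978094 + 476.67584321 * Student_Yes + 6.21816874 * Income - 1.99915087 * Studentship_Income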

#### Assume all coefficients in the above regression were significant. Is there any income level at which students and non-students on average carry the same level of balance?

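Answer: yes. The predicted balance of students exceeds that of non-students by 476.67584321 - 1.99915087 * Income, which is zero at Income = 476.67584321 / 1.99915087, or about 238.4 in the units of the Income column. Below that income students are predicted to carry the higher balance; above it, non-students are.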
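
As a check on the break-even question, the sketch below plugs the coefficients printed by the interaction model into the regression line and solves for the income at which the student and non-student predictions coincide. The function name `predicted_balance` is an assumption introduced for illustration.

```python
# Coefficients copied from the notebook output of the interaction model.
intercept = 200.62315294978094
b_student = 476.67584321
b_income = 6.21816874
b_interaction = -1.99915087


def predicted_balance(income, student):
    """Predicted Balance for a given income and student flag (0 or 1)."""
    return (intercept
            + b_student * student
            + b_income * income
            + b_interaction * student * income)


# Students and non-students differ by b_student + b_interaction * income,
# so the two lines cross where that difference equals zero.
break_even_income = -b_student / b_interaction
print(break_even_income)
print(predicted_balance(break_even_income, 1))
print(predicted_balance(break_even_income, 0))
```

Whether that income falls inside the observed data can be checked with `CreditData['Income'].max()`; if it lies outside, the crossing point is an extrapolation.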

