In [2]:
# Week 5 - Guided Practice Template
# The first few inputs are what you did in Week 4 Guided Practice
# Run each input again here so that you have the right starting point for this week
# You need to replace the "pd.read_csv" file location with YOUR file location

import pandas as pd
import statsmodels.api as sm

creditScore = pd.read_csv("Week_4_GP.csv")

X = creditScore[['Credit_Lines','Credit_Length','Number_Missed_Payments','Credit_Utilization','New_Credit','Number_Credit_Types']] # independent variables
y = creditScore['Credit_Score']  # dependent variable

creditScore.dtypes

Respondent_ID               int64
Credit_Score                int64
Credit_Lines                int64
Credit_Length               int64
Number_Missed_Payments      int64
Credit_Utilization        float64
New_Credit                  int64
Number_Credit_Types         int64
dtype: object

In [3]:
# Regression with statsmodels

# Use statsmodels to calculate important traits of the regression
# This re-creates the model using the statsmodels package, rather than sklearn

X = sm.add_constant(X) # this makes sure the regression model has an intercept
# the "sm.OLS" command that calculates the regression does not have an intercept by default
 
model = sm.OLS(y, X).fit() # this calculates the regression equation
 
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:           Credit_Score   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.472
Method:                 Least Squares   F-statistic:                     36.70
Date:                Fri, 09 Dec 2022   Prob (F-statistic):           3.37e-31
Time:                        01:38:44   Log-Likelihood:                -1364.9
No. Observations:                 241   AIC:                             2744.
Df Residuals:                     234   BIC:                             2768.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    585

In [5]:
# Use Stepwise Regression - remove the variable with the highest p-value, one at a time, until all variables are significant

# Start by setting a new X data frame without "Credit_Lines" (y does not have to be set again since it is the same)

X = creditScore[['Credit_Length','Number_Missed_Payments','Credit_Utilization','New_Credit','Number_Credit_Types']]

# Re-add the constant to X
X = sm.add_constant(X)


# Refit the model using OLS
model = sm.OLS(y, X).fit()


# Summarize the model and print
print_model = model.summary()
print(print_model)



                            OLS Regression Results                            
Dep. Variable:           Credit_Score   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.473
Method:                 Least Squares   F-statistic:                     44.01
Date:                Fri, 09 Dec 2022   Prob (F-statistic):           6.27e-32
Time:                        01:42:09   Log-Likelihood:                -1365.1
No. Observations:                 241   AIC:                             2742.
Df Residuals:                     235   BIC:                             2763.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    589

# Questions
1. All 5 remaining variables are significant, so the stepwise procedure stops here and no extra step is needed.
2. R^2 is 0.484 and adjusted R^2 is 0.473. The new R^2 means the predictor variables explain 48.4% of the variation in credit score.
3. R^2 fell by 0.001 (from 0.485 to 0.484), so the total variation explained by the predictor variables barely changed. Adjusted R^2 rose by 0.001 (from 0.472 to 0.473), which means that, once the number of predictors is taken into account, the model explains slightly more variation with the non-significant variable omitted.
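
The link between R^2 and adjusted R^2 in answers 2 and 3 can be checked by hand. The sketch below applies the standard adjusted R^2 formula to the R^2 values, observation count, and predictor counts printed in the two summaries above. The function name `adjusted_r2` is illustrative and is not part of statsmodels.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the two OLS summaries (241 observations in both)
full_model = adjusted_r2(0.485, n_obs=241, n_predictors=6)
reduced_model = adjusted_r2(0.484, n_obs=241, n_predictors=5)

print(round(full_model, 3))     # 0.472
print(round(reduced_model, 3))  # 0.473
```

Dropping one predictor shrinks the penalty term, which is why adjusted R^2 can rise even though R^2 itself falls slightly.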

In [6]:
# This is the solution for Week 4 Guided Practice
# Make sure to change the file location to yours

creditScorePersonal = pd.read_csv("Week_4_GP_Time_Series.csv")

# This X is the same as the previous one, plus time-based (quarterly) indicator variables

Xpersonal = creditScorePersonal[['Q2',
                         'Q3',
                         'Q4',
                         'Credit_Lines',
                         'Credit_Length',
                         'Number_Missed_Payments',
                         'Credit_Utilization',
                         'New_Credit',
                         'Number_Credit_Types']] # independent variables

ypersonal = creditScorePersonal['Credit_Score']  # dependent variable

creditScorePersonal.head(10)

Unnamed: 0,Year,Quarter,Q2,Q3,Q4,New_Credit,Credit_Lines,Credit_Length,Number_Missed_Payments,Credit_Utilization,Number_Credit_Types,Credit_Score
0,2014,Q1,0,0,0,1,1,1,0,0.38,1,655
1,2014,Q2,1,0,0,1,1,1,0,0.34,1,653
2,2014,Q3,0,1,0,0,1,1,0,0.14,1,707
3,2014,Q4,0,0,1,1,2,1,0,0.19,1,614
4,2015,Q1,0,0,0,1,2,1,0,0.2,1,664
5,2015,Q2,1,0,0,1,2,2,0,0.56,1,647
6,2015,Q3,0,1,0,1,2,2,1,0.7,1,592
7,2015,Q4,0,0,1,0,2,2,1,0.62,1,603
8,2016,Q1,0,0,0,0,2,2,1,0.6,1,628
9,2016,Q2,1,0,0,0,2,3,1,0.58,1,639


In [7]:
# Regression with statsmodels

# Use statsmodels to calculate important traits of the regression
# This re-creates the model using the statsmodels package, rather than sklearn

Xpersonal = sm.add_constant(Xpersonal) # this makes sure the regression model has an intercept
# the "sm.OLS" command that calculates the regression does not have an intercept by default
 
model = sm.OLS(ypersonal, Xpersonal).fit() # this calculates the regression equation
 
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:           Credit_Score   R-squared:                       0.901
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     20.27
Date:                Fri, 09 Dec 2022   Prob (F-statistic):           3.49e-08
Time:                        01:43:31   Log-Likelihood:                -119.41
No. Observations:                  30   AIC:                             258.8
Df Residuals:                      20   BIC:                             272.8
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    668

In [8]:
# Using Stepwise Regression, remove the highest p-value variable and recalculate the regression
# Repeat until all variables are significant

Xpersonal = creditScorePersonal[['Q3',
                         'Q4',
                         'Credit_Lines',
                         'Credit_Length',
                         'Number_Missed_Payments',
                         'Credit_Utilization',
                         'New_Credit',
                         'Number_Credit_Types']]


# Re-add the constant to X

Xpersonal = sm.add_constant(Xpersonal)

# Refit the model using OLS
model = sm.OLS(ypersonal, Xpersonal).fit()
 


# Summarize the model and print
print_model = model.summary()
print(print_model)



                            OLS Regression Results                            
Dep. Variable:           Credit_Score   R-squared:                       0.901
Model:                            OLS   Adj. R-squared:                  0.864
Method:                 Least Squares   F-statistic:                     23.94
Date:                Fri, 09 Dec 2022   Prob (F-statistic):           6.80e-09
Time:                        01:48:36   Log-Likelihood:                -119.41
No. Observations:                  30   AIC:                             256.8
Df Residuals:                      21   BIC:                             269.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    668

In [9]:
# Using Stepwise Regression, remove the highest p-value variable and recalculate the regression
# Repeat until all variables are significant


# Independent Variables
Xpersonal = creditScorePersonal[['Q4',
                         'Credit_Lines',
                         'Credit_Length',
                         'Number_Missed_Payments',
                         'Credit_Utilization',
                         'New_Credit',
                         'Number_Credit_Types']]


# Re-add the constant to X
Xpersonal = sm.add_constant(Xpersonal)


# Refit the model using OLS
model = sm.OLS(ypersonal, Xpersonal).fit()


# Summarize the model and print
print_model = model.summary()
print(print_model)



                            OLS Regression Results                            
Dep. Variable:           Credit_Score   R-squared:                       0.901
Model:                            OLS   Adj. R-squared:                  0.869
Method:                 Least Squares   F-statistic:                     28.58
Date:                Fri, 09 Dec 2022   Prob (F-statistic):           1.24e-09
Time:                        01:50:39   Log-Likelihood:                -119.45
No. Observations:                  30   AIC:                             254.9
Df Residuals:                      22   BIC:                             266.1
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    668

In [10]:
# Create the correlation matrix
Xpersonal.corr()


Unnamed: 0,const,Q4,Credit_Lines,Credit_Length,Number_Missed_Payments,Credit_Utilization,New_Credit,Number_Credit_Types
const,,,,,,,,
Q4,,1.0,0.14663,-0.026035,0.039608,0.036386,-0.032174,0.021997
Credit_Lines,,0.14663,1.0,0.643892,0.222629,0.021838,0.33758,0.816022
Credit_Length,,-0.026035,0.643892,1.0,0.511626,0.252781,-0.154131,0.792198
Number_Missed_Payments,,0.039608,0.222629,0.511626,1.0,0.557179,-0.131897,0.42654
Credit_Utilization,,0.036386,0.021838,0.252781,0.557179,1.0,-0.104624,0.099999
New_Credit,,-0.032174,0.33758,-0.154131,-0.131897,-0.104624,1.0,0.211613
Number_Credit_Types,,0.021997,0.816022,0.792198,0.42654,0.099999,0.211613,1.0
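
One way to read a correlation matrix like the one above is to list the predictor pairs whose absolute correlation exceeds a cutoff. The sketch below hard-codes the upper triangle of the printed matrix (leaving out Q4, whose correlations are all small). The 0.7 cutoff is an assumed rule of thumb, not a value given in this exercise.

```python
# Upper-triangle correlations copied from the matrix above (Q4 omitted for brevity)
correlations = {
    ("Credit_Lines", "Credit_Length"): 0.643892,
    ("Credit_Lines", "Number_Missed_Payments"): 0.222629,
    ("Credit_Lines", "Credit_Utilization"): 0.021838,
    ("Credit_Lines", "New_Credit"): 0.33758,
    ("Credit_Lines", "Number_Credit_Types"): 0.816022,
    ("Credit_Length", "Number_Missed_Payments"): 0.511626,
    ("Credit_Length", "Credit_Utilization"): 0.252781,
    ("Credit_Length", "New_Credit"): -0.154131,
    ("Credit_Length", "Number_Credit_Types"): 0.792198,
    ("Number_Missed_Payments", "Credit_Utilization"): 0.557179,
    ("Number_Missed_Payments", "New_Credit"): -0.131897,
    ("Number_Missed_Payments", "Number_Credit_Types"): 0.42654,
    ("Credit_Utilization", "New_Credit"): -0.104624,
    ("Credit_Utilization", "Number_Credit_Types"): 0.099999,
    ("New_Credit", "Number_Credit_Types"): 0.211613,
}

CUTOFF = 0.7  # assumed threshold for "highly correlated"

# Keep pairs above the cutoff, strongest first
high_pairs = sorted(
    (pair for pair, r in correlations.items() if abs(r) > CUTOFF),
    key=lambda pair: -abs(correlations[pair]),
)
for first, second in high_pairs:
    print(first, second, correlations[(first, second)])
# Credit_Lines Number_Credit_Types 0.816022
# Credit_Length Number_Credit_Types 0.792198
```

Number_Credit_Types appears in both pairs, and the strongest single correlation involves Credit_Lines, which is consistent with choosing Credit_Lines for removal in the next cell.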


In [11]:
# Using knowledge from the correlation matrix and the One in Ten Rule, remove "Credit_Lines"

Xpersonal = creditScorePersonal[['Q4',
                         'Credit_Length',
                         'Number_Missed_Payments',
                         'Credit_Utilization',
                         'New_Credit',
                         'Number_Credit_Types']] # independent variables

# Re-add the constant to X
Xpersonal = sm.add_constant(Xpersonal)


# Refit the model using OLS
model = sm.OLS(ypersonal, Xpersonal).fit()


# Summarize the model and print
print_model = model.summary()
print(print_model)



                            OLS Regression Results                            
Dep. Variable:           Credit_Score   R-squared:                       0.865
Model:                            OLS   Adj. R-squared:                  0.829
Method:                 Least Squares   F-statistic:                     24.50
Date:                Fri, 09 Dec 2022   Prob (F-statistic):           6.63e-09
Time:                        01:56:48   Log-Likelihood:                -124.12
No. Observations:                  30   AIC:                             262.2
Df Residuals:                      23   BIC:                             272.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    655
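
The One in Ten Rule referred to in this cell can be turned into a quick count. The sketch below applies the rule as it is used in this exercise (about one predictor per ten observations) to the "No. Observations" and "Df Model" values printed in the four time-series summaries above. The variable names and the "Model 1" to "Model 4" labels are illustrative.

```python
n_observations = 30                     # "No. Observations" in each summary
max_predictors = n_observations // 10   # one predictor per ten observations

# "Df Model" from each of the four time-series summaries, in order
predictor_counts = {"Model 1": 9, "Model 2": 8, "Model 3": 7, "Model 4": 6}

for name, count in predictor_counts.items():
    print(name, "predictors:", count, "over the guideline by", count - max_predictors)
```

By this count the guideline for 30 observations is about 3 predictors, so each removal moves the model closer to the rule, although even the 6-predictor model still sits above it.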

# Questions
4. With the stepwise removal of insignificant variables, adjusted R^2 rose steadily from 85.7% to 86.4% and then to 86.9%. Removing Credit Lines, which is a significant variable, lowered adjusted R^2 to 82.9%.
5. Model 4 may be more robust than Model 3 even though its adjusted R^2 is lower, because the "one in ten rule" guards against overfitting by keeping the number of predictor variables in the regression to no more than about one tenth of the number of observations, and Model 4 uses one fewer predictor.
6. The coefficients of New Credit and Number of Credit Types changed the most. Credit Lines correlates highly with Number of Credit Types (0.816), which would explain the large change in the coefficients.
7. Removing Credit Lines should not cost the model much explanatory power, because Credit Lines and Number of Credit Types are so strongly correlated that an increase of 1 unit in Credit Lines is likely to be accompanied by an increase in Number of Credit Types as well.
8. To further increase the robustness of the model, other techniques could be used, such as LASSO regression, an automated variable selection method that shrinks the coefficients of less important variables to zero and so identifies the most important variables automatically.
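
To illustrate why LASSO acts as a variable selector, here is a minimal pure-Python sketch of LASSO fitted by coordinate descent on synthetic data. It is not the course's method or any library's implementation: the function names, the penalty `alpha=0.5`, and the made-up predictors `x1`, `x2`, and `x3` are assumptions for illustration. In practice a library routine such as scikit-learn's `Lasso` would normally be used instead.

```python
import random

def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0 inside the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, y, alpha, n_iter=100):
    """Minimize (1/(2n)) * sum((y - Xb)^2) + alpha * sum(|b|)."""
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual, with column j's
            # own current contribution added back in
            rho = sum(c * (r + coefs[j] * c) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            new_coef = soft_threshold(rho, alpha) / scale
            delta = new_coef - coefs[j]
            residual = [r - delta * c for r, c in zip(residual, col)]
            coefs[j] = new_coef
    return coefs

# Synthetic data (hypothetical): y depends on x1 and x2 but not on x3
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [3 * a - 2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

predictors = [center(x1), center(x2), center(x3)]
coefs = lasso_coordinate_descent(predictors, center(y), alpha=0.5)
print([round(c, 2) for c in coefs])
```

In this run the penalty sets the coefficient on the unrelated predictor `x3` to exactly zero, while `x1` and `x2` keep nonzero (but shrunken) coefficients. That is the sense in which LASSO selects variables; a larger `alpha` removes more of them.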