# Assignment Module 4: Logistic Regression

## Instructions
In the lesson, we learnt how Python can be used to solve logistic regression problems.
In this assignment we will use the exact same input file that we used in the lesson.
To recap, this input file contains Credit Profile, Annual Income, Education Years and Age data for a cross-section of people. The input file is in Excel format.
Assuming that Annual Income, Education Years and Age (the independent variables) adequately determine whether a Credit Profile (the dependent variable) is loan worthy (1) or not (0), this program estimates the coefficients of this equation using the `Logit` model from the `statsmodels.api` library. Note that `sm.Logit` does not add an intercept by itself and the code below passes only the three predictors, so no constant term is estimated in the fitted model.
This part was illustrated in the lesson.
In this assignment, the program code flow is exactly the same up to and including the training/modelling stage. No input is required from learners up to that point.
In the testing stage, however, learners need to add 2 lines of code to predict the probabilities of credit profiles given input values for the independent variables.

To recap, the starting point of the logistic regression equation is:

log (odds of a loan-worthy Credit_Profile) = constant + coefficient_1 * Annual_Income + coefficient_2 * Education_Years + coefficient_3 * Age
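
The left-hand side of the equation above is a log-odds value, so it has to be passed through the logistic (sigmoid) function to obtain a probability. The following is a minimal sketch in plain Python; the function name `log_odds_to_probability` and the sample log-odds values are illustrative assumptions, not part of the lesson code.

```python
import math

def log_odds_to_probability(log_odds):
    """Convert a log-odds value into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative log-odds values (hypothetical, not taken from the fitted model)
for z in (-2.0, 0.0, 2.0):
    print(z, round(log_odds_to_probability(z), 4))
```

A log-odds value of 0 corresponds to a probability of exactly 0.5, which is why 50% is the natural cut-off between the two classes.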

NOTE: You must run each code cell below, in order from top to bottom, to prepare for the coding exercise. These cells create variables which are then available to any other code cell.

In [1]:
# import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# read the input file

dcredit = pd.read_excel('Helper_Data.xlsx', sheet_name='Credit Profile')
dcredit


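| Unnamed: 0 | Credit_Profile | Annual_Income | Education_Years | Age |
|---|---|---|---|---|
| 0 | 1 | 60000.000000 | 17 | 28 |
| 1 | 1 | 70000.000000 | 20 | 24 |
| 2 | 1 | 80000.000000 | 18 | 25 |
| 3 | 1 | 50000.000000 | 17 | 22 |
| 4 | 1 | 75000.000000 | 17 | 22 |
| ... | ... | ... | ... | ... |
| 66 | 0 | 64312.265115 | 9 | 26 |
| 67 | 0 | 64137.281279 | 8 | 29 |
| 68 | 0 | 64726.534621 | 8 | 26 |
| 69 | 0 | 64255.958301 | 8 | 29 |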


## Logistic Regression modelling 

In [2]:
# No coding required from learners here

# defining the dependent and independent variables
Xtrain = dcredit[['Annual_Income', 'Education_Years', 'Age']]
#Xtrain = dcredit[[ 'Education_Years', 'Age']]
ytrain = dcredit[['Credit_Profile']]

# building the model and fitting the data
logreg = sm.Logit(ytrain, Xtrain).fit()
print(logreg.summary())


Optimization terminated successfully.
         Current function value: 0.270577
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:         Credit_Profile   No. Observations:                   71
Model:                          Logit   Df Residuals:                       68
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                  0.6096
Time:                        11:13:25   Log-Likelihood:                -19.211
converged:                       True   LL-Null:                       -49.206
Covariance Type:            nonrobust   LLR p-value:                 9.401e-14
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Annual_Income    5.573e-05   2.13e-05      2.618      0.009     1.4e-05    9.74e-05
Education_Year
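
One way to read a coefficient from this summary is as an odds ratio: exponentiating the coefficient times a change in the variable gives the factor by which the odds change, with the other variables held fixed. The following is a small sketch using the Annual_Income coefficient shown above; the income step of 10,000 is an arbitrary choice for illustration.

```python
import math

annual_income_coef = 5.573e-05  # Annual_Income coefficient from the summary above
income_step = 10_000            # hypothetical change in annual income

# exp(coef * step) is the multiplicative change in the odds of Credit_Profile = 1
odds_ratio = math.exp(annual_income_coef * income_step)
print(round(odds_ratio, 3))
```

A value above 1 means the odds of a loan-worthy profile rise with income, which agrees with the positive sign of the coefficient.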

## Testing the model

In [3]:
# This is the input test data
# No coding required here

xdata = {'Annual_Income':[100000,40000,90000,80000,90000],
        'Education_Years':[9,9,15,19,29],
        'Age':[29,26,27,39,49]}
ydata = {'Credit_Profile':[1,0,1,1,0 ]}

xtest = pd.DataFrame(xdata)
ytest = pd.DataFrame(ydata)

## Coding is required in the exercise below to perform the predictions.

## Hint: use logreg.predict to compute the predicted probabilities and assign them to a variable, say ypredict.
## Hint: having computed the predicted values, use the print function to print ypredict.
## Note that the predicted probabilities are above 50% for every reading except the second one,
## where Annual_Income = 40000. For this reading the predicted probability of a loan-worthy credit profile is 12.4%.
## Since 12.4% is less than 50%, the predicted loan worthiness for
## Annual_Income = 40000, Education_Years = 9, Age = 26 is 0 (that is, not loan worthy).
## So the actual value for this reading (0) matches the predicted value.
## If in doubt, please refer to the code in Module 4 Lesson 6.

If you have run all of the code examples above, in the order shown, then all the variables created are now available
to any other code cells in this notebook.

In [4]:
# In this code cell, just below this comment, type the code statements to perform the predictions and print them.

Click the button below to see the instructor's solution. Use the "Run" button in the toolbar to run it.
Remember that all the code cells above must be run first.

In [5]:
ypredict = logreg.predict(xtest)
print(ypredict)

0    0.624743
1    0.124206
2    0.944781
3    0.572767
4    0.851745
dtype: float64
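
To turn these probabilities into loan-worthiness classes, the 50% cut-off described earlier can be applied to each value. The following is a minimal sketch, with the printed probabilities and the `ytest` labels copied in as plain lists so that it runs without the data file; the names `threshold`, `predicted_class` and `accuracy` are assumptions for illustration.

```python
# Predicted probabilities as printed above, and the actual labels from ytest
probabilities = [0.624743, 0.124206, 0.944781, 0.572767, 0.851745]
actual = [1, 0, 1, 1, 0]

threshold = 0.5  # assumed cut-off: 50% or more counts as loan worthy (1)
predicted_class = [1 if p >= threshold else 0 for p in probabilities]

# Share of readings where the predicted class equals the actual class
matches = sum(1 for p, a in zip(predicted_class, actual) if p == a)
accuracy = matches / len(actual)

print(predicted_class)
print(accuracy)
```

With this cut-off, the last reading is classified as loan worthy although its actual label is 0, so four of the five readings match. With the pandas Series from the cell above, the same step could be written as `(ypredict >= 0.5).astype(int)`.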


Click the button below to see the *complete* solution. Use the "Run" button in the toolbar to run it.

In [6]:
# import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# read the input file    
dcredit = pd.read_excel('Helper_Data.xlsx',sheet_name ='Credit Profile')

# define the independent and dependent variables
Xtrain = dcredit[['Annual_Income', 'Education_Years', 'Age']]    
ytrain = dcredit[['Credit_Profile']]

# build the model and fit the data
logreg = sm.Logit(ytrain, Xtrain).fit()

# Input test data
xdata = {'Annual_Income':[100000,40000,90000,80000,90000],
        'Education_Years':[9,9,15,19,29],
        'Age':[29,26,27,39,49]}
ydata = {'Credit_Profile':[1,0,1,1,0 ]}

xtest = pd.DataFrame(xdata)
ytest = pd.DataFrame(ydata)

# perform the prediction and print
ypredict = logreg.predict(xtest)
print(ypredict)

Optimization terminated successfully.
         Current function value: 0.270577
         Iterations 8
0    0.624743
1    0.124206
2    0.944781
3    0.572767
4    0.851745
dtype: float64
