# 90 Day Fiance - Logistic Regression

*for binary classification problems*

https://www.geeksforgeeks.org/logistic-regression-using-statsmodels/

https://levelup.gitconnected.com/an-introduction-to-logistic-regression-in-python-with-statsmodels-and-scikit-learn-1a1fb5ce1c13

In [14]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

In [35]:
# read in Fiance CSV

fiances = pd.read_csv("data/90-Day-Fiance-regression.csv")

In [36]:
fiances.head()

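|   | usSex | usAgeRange | usRegion | foreignSex | foreignAgeRange | olderSex | olderNat | ageDiffRange | continent | met | stayTogether |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 2 | 1 |
| 2 | 0 | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 4 | 0 | 1 |
| 3 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 2 | 2 | 0 | 1 |
| 4 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |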


### looking at...
• gender of US fiancé

• age range of US fiancé

• which US region fiancé is from

• which gender is older

• which nationality is older

• age difference range

• which continent international fiancé is from

• how couple met

## Building the Logistic Regression Model

**Statsmodels** is a Python module that provides classes and functions for estimating statistical models and performing statistical tests.

* define the **dependent (y) and independent (X) variables** (if the dependent variable is non-numeric, it is first converted to numeric form, for example with dummy variables)

* statsmodels provides a **`Logit` class** for performing logistic regression

* `Logit()` accepts y and X as parameters and **returns a Logit model object**

* the model is then fitted to the data by calling its `fit()` method

In [37]:
fiances.columns

Index(['usSex', 'usAgeRange', 'usRegion', 'foreignSex', 'foreignAgeRange',
       'olderSex', 'olderNat', 'ageDiffRange', 'continent', 'met',
       'stayTogether'],
      dtype='object')

In [45]:
# defining the dependent and independent variables

feature_names = ['usSex', 'usAgeRange', 'usRegion', 'foreignAgeRange', 'olderSex', 'olderNat',
                 'ageDiffRange', 'continent', 'met']

Xtrain = fiances[feature_names]
ytrain = fiances['stayTogether']

In [40]:
# building the model and fitting the data

log_reg = sm.Logit(ytrain, Xtrain).fit()

Optimization terminated successfully.
         Current function value: 0.503725
         Iterations 6


**Output** 

* **Iterations** is the number of steps the optimizer took before converging on the parameter estimates (6 here)
* *by default, at most 35 iterations are performed (`maxiter=35`); if the optimizer has not converged by then, statsmodels stops and issues a convergence warning*

In [41]:
# print the summary table - descriptive summary about the regression results

print(log_reg.summary())

                           Logit Regression Results                           
Dep. Variable:           stayTogether   No. Observations:                   58
Model:                          Logit   Df Residuals:                       49
Method:                           MLE   Df Model:                            8
Date:                Mon, 02 May 2022   Pseudo R-squ.:                  0.2305
Time:                        18:29:03   Log-Likelihood:                -29.216
converged:                       True   LL-Null:                       -37.967
Covariance Type:            nonrobust   LLR p-value:                   0.02529
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
usSex              -1.2587      0.797     -1.580      0.114      -2.820       0.303
usAgeRange          0.3289      0.634      0.519      0.604      -0.913       1.571
usRegion            0.2543      

## Explanation of some of the terms in the summary table

* **coef:** the estimated coefficients of the independent variables in the regression equation, on the log-odds scale
* **Log-Likelihood:** the natural logarithm of the likelihood function, evaluated at the fitted parameters. Maximum Likelihood Estimation (MLE) is the optimization process of finding the set of parameters that maximizes this value
* **LL-Null:** the log-likelihood of the model when no independent variable is included (only an intercept is included)
* **Pseudo R-squ.:** a substitute for the R-squared value in Least Squares linear regression. statsmodels reports McFadden's pseudo R-squared, which is 1 minus the ratio of the log-likelihood of the full model to that of the null model


### Pseudo-R-squared

* a measure of the goodness of fit of a model. For logistic regression it quantifies how much the fitted model improves on the intercept-only (null) model in terms of log-likelihood

* Pseudo R-squared = 1 == the model fits the data perfectly (its log-likelihood is 0)

* **Pseudo R-squared = 0.2305 == the fitted model improves on the null model's log-likelihood by about 23%** (this is only loosely comparable to "percent of variation explained" in linear regression)

* Pseudo R-squared = 0 == the model is no improvement over the null model
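The pseudo R-squared in the summary table can be reproduced by hand from the two log-likelihood values printed there. The short check below uses McFadden's formula, `1 - LL_full / LL_null`, with the rounded values from the table, so the result agrees only to the reported precision.

```python
# Values copied from the summary table above
ll_full = -29.216   # Log-Likelihood
ll_null = -37.967   # LL-Null

# McFadden's pseudo R-squared
pseudo_r2 = 1 - ll_full / ll_null

# Likelihood-ratio statistic behind the "LLR p-value" row
llr_stat = 2 * (ll_full - ll_null)

print(round(pseudo_r2, 4))  # 0.2305
print(round(llr_stat, 3))   # 17.502
```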


### Model's Coefficients
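One common way to read logit coefficients is to exponentiate them into odds ratios. The sketch below does this for the two complete rows visible in the summary output above; the remaining rows are cut off there and are not reproduced. How each variable is coded (for example, which sex `usSex = 1` denotes) is not stated in this notebook, so the direction of each effect is left uninterpreted.

```python
import numpy as np

# Coefficients copied from the visible rows of the summary table
coefs = {"usSex": -1.2587, "usAgeRange": 0.3289}

# exp(coef) is the multiplicative change in the odds of stayTogether = 1
# for a one-unit increase in the variable, holding the others fixed
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Both rows have p-values above 0.05 in the table (0.114 and 0.604), so these odds ratios should be read as rough estimates rather than established effects.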

## Predicting on New Data 

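* test the model's predictions (note that the cells below reuse the full dataset the model was fitted on, so the results are in-sample rather than from new data)
* the **predict() method** of the fitted model performs predictions
* predictions are fractional values (between 0 and 1) that denote the predicted probability of the couple staying together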

In [42]:
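# defining the dependent and independent variables
# (the same rows used for fitting, since no separate test set was held out)
Xtest = fiances[feature_names]
ytest = fiances['stayTogether']

# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
# round each predicted probability to the nearest class label (0 or 1)
prediction = list(map(round, yhat))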

In [43]:
# comparing original and predicted values of y
print('Actual values', list(ytest.values))
print('Predictions :', prediction)

Actual values [1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
Predictions : [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]


### Testing the accuracy of the model

In [44]:
# confusion matrix
cm = confusion_matrix(ytest, prediction)
print("Confusion Matrix : \n", cm)

# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

Confusion Matrix : 
 [[12  9]
 [ 5 32]]
Test accuracy =  0.7586206896551724
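The confusion matrix above contains more than the accuracy figure. The sketch below unpacks it using scikit-learn's layout (rows are actual classes, columns are predicted classes) and derives a few standard metrics, plus the accuracy of a naive baseline that always predicts "stay together". The matrix values are copied from the output above.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[12, 9],
               [5, 32]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all couples classified correctly
precision = tp / (tp + fp)            # of predicted "stay", how many stayed
recall = tp / (tp + fn)               # of couples who stayed, how many were found
specificity = tn / (tn + fp)          # of couples who split, how many were found

# Baseline: always predict the majority class (stayTogether = 1)
baseline = (tp + fn) / cm.sum()

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f} baseline={baseline:.4f}")
```

The model's 0.76 accuracy compares with roughly 0.64 for the always-predict-1 baseline, and it identifies couples who stay together (recall about 0.86) more reliably than couples who split (specificity about 0.57).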
