In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Find another dataset that is suitable for logistic regression. Run a logistic regression on the data using the statsmodels package.

https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics

Variations concern hull geometry coefficients and the Froude number:<br>
1. Longitudinal position of the center of buoyancy, adimensional.<br>
2. Prismatic coefficient, adimensional.<br>
3. Length-displacement ratio, adimensional.<br>
4. Beam-draught ratio, adimensional.<br>
5. Length-beam ratio, adimensional.<br>
6. Froude number, adimensional.<br>

The measured variable is the residuary resistance per unit weight of displacement:<br>
7. Residuary resistance per unit weight of displacement, adimensional.

In [2]:
data_url = "yacht_hydrodynamics.data"
data = pd.read_csv(data_url)
data

,Long_pos_of_center,prismatic_coefficient,displacement_length_ratio,beam_draught_ratio,length_beam_ratio,froude_number,residuary_resistance
0,-2.3,0.568,4.78,3.99,3.17,0.125,0.11
1,-2.3,0.568,4.78,3.99,3.17,0.150,0.27
2,-2.3,0.568,4.78,3.99,3.17,0.175,0.47
3,-2.3,0.568,4.78,3.99,3.17,0.200,0.78
4,-2.3,0.568,4.78,3.99,3.17,0.225,1.18
...,...,...,...,...,...,...,...
303,-2.3,0.600,4.34,4.23,2.73,0.350,8.47
304,-2.3,0.600,4.34,4.23,2.73,0.375,12.27
305,-2.3,0.600,4.34,4.23,2.73,0.400,19.59
306,-2.3,0.600,4.34,4.23,2.73,0.425,30.48


In [3]:
X = data.drop(["residuary_resistance"], axis=1)
y = pd.DataFrame(data.residuary_resistance)

Convert the target values (y) to either 0 or 1, depending on whether the residuary resistance is below the median.

In [4]:
# 1 if the residuary resistance is below the median, else 0
y_scaled = (y < y.median()).astype(int)
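
As a small illustration of the median split, the sketch below applies the same comparison to a handful of resistance values taken from the rows displayed above. The names `demo`, `median`, and `labels` are illustrative and not part of the notebook. Note that the strict `<` assigns a value equal to the median to class 0.

```python
import pandas as pd

# A few residuary_resistance values from the rows displayed above
demo = pd.DataFrame(
    {"residuary_resistance": [0.11, 0.27, 0.47, 0.78, 1.18, 8.47, 12.27]}
)

# Median of the demo column
median = demo["residuary_resistance"].median()

# True where the value is strictly below the median, then cast to 0/1
labels = (demo["residuary_resistance"] < median).astype(int)

print(labels.tolist())        # class label per row
print(labels.value_counts())  # class balance after the split
```

With an odd number of rows the median row itself falls into class 0, so the two classes are not exactly balanced.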

In [5]:
import statsmodels.api as sm

log_reg = sm.Logit(y_scaled, X).fit(method='bfgs', maxiter=10000)

Optimization terminated successfully.
         Current function value: 0.051302
         Iterations: 77
         Function evaluations: 79
         Gradient evaluations: 79


Print the results and interpret the parameter coefficients for each input variable: https://www.statsmodels.org/stable/index.html.

In [6]:
print(log_reg.summary())

                            Logit Regression Results                            
Dep. Variable:     residuary_resistance   No. Observations:                  308
Model:                            Logit   Df Residuals:                      302
Method:                             MLE   Df Model:                            5
Date:                  Tue, 17 Aug 2021   Pseudo R-squ.:                  0.9260
Time:                          00:56:24   Log-Likelihood:                -15.801
converged:                         True   LL-Null:                       -213.49
Covariance Type:              nonrobust   LLR p-value:                 2.943e-83
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Long_pos_of_center           -0.2327      0.315     -0.739      0.460      -0.850       0.384
prismatic_coefficient        80.9433     23.608      3.429      0.001 
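
The headline numbers in the summary can be cross-checked by hand. The sketch below assumes that the reported pseudo R-squared is McFadden's, 1 - LL / LL-Null, and that the optimizer's "Current function value" is the negative log-likelihood divided by the number of observations. All three inputs are copied from the output above; the variable names are illustrative.

```python
# Values copied from the optimizer output and the summary table above
log_likelihood = -15.801
ll_null = -213.49
n_obs = 308

# McFadden's pseudo R-squared: 1 - LL / LL-Null
pseudo_r2 = 1 - log_likelihood / ll_null

# Average negative log-likelihood per observation
mean_nll = -log_likelihood / n_obs

print(round(pseudo_r2, 4), round(mean_nll, 6))
```

Both values agree with the reported 0.9260 and 0.051302 up to the rounding of the inputs.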

In [7]:
log_reg.params.values

array([  -0.23269135,   80.94334261,   29.55623576,  -12.49384698,
        -27.50302241, -175.22465175])

Evaluate the model as well. 

In [8]:
coef = log_reg.params
coef

Long_pos_of_center            -0.232691
prismatic_coefficient         80.943343
displacement_length_ratio     29.556236
beam_draught_ratio           -12.493847
length_beam_ratio            -27.503022
froude_number               -175.224652
dtype: float64

In [9]:
np.exp(coef)

Long_pos_of_center           7.923981e-01
prismatic_coefficient        1.423138e+35
displacement_length_ratio    6.856620e+12
beam_draught_ratio           3.749654e-06
length_beam_ratio            1.136552e-12
froude_number                7.959772e-77
dtype: float64

In [10]:
# logistic regression coefficients
results = pd.DataFrame(log_reg.params, columns=["coef"])
results["exp_coef"]=np.exp(log_reg.params)
results

,coef,exp_coef
Long_pos_of_center,-0.232691,7.923981e-01
prismatic_coefficient,80.943343,1.423138e+35
displacement_length_ratio,29.556236,6.856620e+12
beam_draught_ratio,-12.493847,3.749654e-06
length_beam_ratio,-27.503022,1.136552e-12
froude_number,-175.224652,7.959772e-77


The coefficients of the logistic regression model can be interpreted as follows:
Each input variable has a coefficient. Holding the other variables constant, a one-unit increase in a variable changes the log-odds that the residuary resistance is below the median by the value of its coefficient, which multiplies the odds by exp(coefficient).

For example, if every other factor is held constant and Long_pos_of_center is increased by 1, the log-odds change by -0.232691, so the odds of a below-median residuary resistance are multiplied by exp(-0.232691) = 7.923981e-01, a decrease of about 21%.

If prismatic_coefficient is increased by 1, the log-odds change by 80.943343, so the odds are multiplied by exp(80.943343) = 1.423138e+35.

The odds multipliers for a one-unit increase in the remaining variables are:<br>
displacement_length_ratio : 	6.856620e+12<br>
beam_draught_ratio        :	    3.749654e-06<br>
length_beam_ratio         : 	1.136552e-12<br>
froude_number             :	    7.959772e-77<br>
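
Because several of these variables take values well below 1 in the rows displayed above, a one-unit increase can be far larger than anything in the data, which is why some multipliers look extreme. The sketch below re-expresses the coefficients as percent changes in the odds and as multipliers for smaller steps, using exp(coef * step). The coefficients are copied from the output above; the step of 0.025 is the Froude-number spacing visible in the displayed rows, while 0.01 for the prismatic coefficient is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the fitted model output above
coef = pd.Series({
    "Long_pos_of_center": -0.232691,
    "prismatic_coefficient": 80.943343,
    "displacement_length_ratio": 29.556236,
    "beam_draught_ratio": -12.493847,
    "length_beam_ratio": -27.503022,
    "froude_number": -175.224652,
})

# Odds multiplier for a one-unit increase, and the same as a percent change
odds_ratio = np.exp(coef)
pct_change = (odds_ratio - 1) * 100

# Odds multiplier for a smaller step: exp(coef * step)
froude_step_ratio = np.exp(coef["froude_number"] * 0.025)
prismatic_step_ratio = np.exp(coef["prismatic_coefficient"] * 0.01)

print(pct_change.round(2))
print(froude_step_ratio, prismatic_step_ratio)
```

Read this way, one 0.025 step in the Froude number multiplies the odds of a below-median resistance by roughly 0.0125, which is easier to relate to the data than the one-unit figure of 7.959772e-77.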