In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Find another dataset that is suitable for logistic regression. Run a logistic regression on the data using the statsmodels package.

https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics

Variations concern hull geometry coefficients and the Froude number:<br>
1. Longitudinal position of the center of buoyancy, adimensional.<br>
2. Prismatic coefficient, adimensional.<br>
3. Length-displacement ratio, adimensional.<br>
4. Beam-draught ratio, adimensional.<br>
5. Length-beam ratio, adimensional.<br>
6. Froude number, adimensional.<br>

The measured variable is the residuary resistance per unit weight of displacement:<br>
7. Residuary resistance per unit weight of displacement, adimensional.

In [2]:
data_url = "yacht_hydrodynamics.data"
data = pd.read_csv(data_url)
data

Unnamed: 0,Long_pos_of_center,prismatic_coefficient,displacement_length_ratio,beam_draught_ratio,length_beam_ratio,froude_number,residuary_resistance
0,-2.3,0.568,4.78,3.99,3.17,0.125,0.11
1,-2.3,0.568,4.78,3.99,3.17,0.150,0.27
2,-2.3,0.568,4.78,3.99,3.17,0.175,0.47
3,-2.3,0.568,4.78,3.99,3.17,0.200,0.78
4,-2.3,0.568,4.78,3.99,3.17,0.225,1.18
...,...,...,...,...,...,...,...
303,-2.3,0.600,4.34,4.23,2.73,0.350,8.47
304,-2.3,0.600,4.34,4.23,2.73,0.375,12.27
305,-2.3,0.600,4.34,4.23,2.73,0.400,19.59
306,-2.3,0.600,4.34,4.23,2.73,0.425,30.48
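The frame above already carries column names, so the local file was presumably prepared beforehand. As a sketch only, assuming the raw UCI download is whitespace-separated with no header row (an assumption, not something shown in this notebook), the same layout could be read as follows. The column names are copied from the output above, and the inline two-row sample stands in for the real file.

```python
import io
import pandas as pd

# Column names copied from the output above; the raw file is assumed to have no header row.
columns = [
    "Long_pos_of_center", "prismatic_coefficient", "displacement_length_ratio",
    "beam_draught_ratio", "length_beam_ratio", "froude_number", "residuary_resistance",
]

# Two sample rows standing in for the real file (values taken from the table above).
raw = io.StringIO(
    "-2.3 0.568 4.78 3.99 3.17 0.125 0.11\n"
    "-2.3 0.568 4.78 3.99 3.17 0.150 0.27\n"
)

# sep=r"\s+" splits on any run of whitespace; names= supplies the missing header.
sample = pd.read_csv(raw, sep=r"\s+", header=None, names=columns)
print(sample.shape)
```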


In [4]:
from sklearn.preprocessing import MinMaxScaler

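# Features: every column except the measured resistance
X = data.drop(["residuary_resistance"], axis=1)
# Target kept as a one-column DataFrame
y = pd.DataFrame(data.residuary_resistance)

# Rescale each feature column to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)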
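Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). The cell above fits the scaler on the full dataset before the split. A common alternative, sketched here on synthetic data, fits it on the training rows only so that the test rows do not influence the scaling. `X_demo` and the other names are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))  # hypothetical stand-in for the feature matrix

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = MinMaxScaler().fit(X_tr)      # learn min/max from the training rows only
X_tr_scaled = scaler.transform(X_tr)   # training columns now span exactly [0, 1]
X_te_scaled = scaler.transform(X_te)   # test values may fall slightly outside [0, 1]

# Same result as the explicit formula (x - min) / (max - min)
manual = (X_tr - X_tr.min(axis=0)) / (X_tr.max(axis=0) - X_tr.min(axis=0))
print(np.allclose(X_tr_scaled, manual))
```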

Convert the target values (y) to a binary label: 1 (True) if the residuary resistance is below the median, 0 (False) otherwise.

In [5]:
y_scaled = y < y.median()
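To make the thresholding concrete, here is a small sketch on the first five resistance values from the table above (`y_demo` is a hypothetical name). Note that `True` marks below-median resistance, so the positive class in the model is "low resistance".

```python
import pandas as pd

# First five residuary_resistance values from the table above
y_demo = pd.DataFrame({"residuary_resistance": [0.11, 0.27, 0.47, 0.78, 1.18]})

# True where the value is strictly below the column median (0.47 for these five values)
y_binary = y_demo < y_demo.median()

print(y_binary["residuary_resistance"].astype(int).tolist())  # [1, 1, 0, 0, 0]
```

Because the comparison is strict, rows exactly equal to the median fall into class 0.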

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=42)

In [7]:
import statsmodels.api as sm

# sm.Logit does not add an intercept on its own, so this model is fit without a constant term
log_reg = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=10000)

Optimization terminated successfully.
         Current function value: 0.081904
         Iterations: 63
         Function evaluations: 64
         Gradient evaluations: 64


Print the results and interpret the parameter coefficients for each input variable (statsmodels documentation: https://www.statsmodels.org/stable/index.html).

In [8]:
print(log_reg.summary())

                            Logit Regression Results                            
Dep. Variable:     residuary_resistance   No. Observations:                  246
Model:                            Logit   Df Residuals:                      240
Method:                             MLE   Df Model:                            5
Date:                  Mon, 16 Aug 2021   Pseudo R-squ.:                  0.8816
Time:                          20:13:52   Log-Likelihood:                -20.148
converged:                         True   LL-Null:                       -170.22
Covariance Type:              nonrobust   LLR p-value:                 9.317e-63
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.0460      1.315      0.795      0.426      -1.532       3.624
x2           -11.0671      3.053     -3.626      0.000     -17.050      -5.084
x3           -38.1260      9.447    

In [9]:
log_reg.params.values

array([  1.04597424, -11.0671329 , -38.12603341,  46.15121649,
        44.67137291, -36.07847056])
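As a starting point for interpretation, the coefficients are changes in log-odds, and exponentiating them gives odds ratios. Because the features were min-max scaled, each coefficient describes the effect of moving a feature from its minimum to its maximum. Because the label is "below median", a positive coefficient raises the odds of low resistance. The sketch below assumes that x1 to x6 follow the column order shown in the data table; that mapping is an assumption, not something the summary confirms.

```python
import numpy as np

# Coefficients as printed above; names assume X kept the column order of the table
features = [
    "Long_pos_of_center", "prismatic_coefficient", "displacement_length_ratio",
    "beam_draught_ratio", "length_beam_ratio", "froude_number",
]
coefs = np.array([1.04597424, -11.0671329, -38.12603341,
                  46.15121649, 44.67137291, -36.07847056])

# Multiplicative change in the odds over the feature's full (scaled) range
odds_ratios = np.exp(coefs)

for name, b, ratio in zip(features, coefs, odds_ratios):
    direction = "raises" if b > 0 else "lowers"
    print(f"{name:28s} coef={b:9.3f}  odds ratio={ratio:.3g}  ({direction} odds of low resistance)")
```

Under that mapping, the negative Froude-number coefficient says that higher Froude numbers make below-median resistance less likely. In the summary above, x1 has a p-value of 0.426 and a confidence interval that spans zero, so its sign should not be read as meaningful.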

Evaluate the model as well. 

In [10]:
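from sklearn.metrics import confusion_matrix, accuracy_score

# predict() returns probabilities; rounding turns them into 0/1 class labels
y_pred = log_reg.predict(X_test)
y_pred = list(map(round, y_pred))

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", cm)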

Confusion Matrix : 
 [[25  0]
 [ 3 34]]


In [11]:
print('Test accuracy = ', accuracy_score(y_test, y_pred))

Test accuracy =  0.9516129032258065
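The confusion matrix carries more than the accuracy. As a sketch, the counts printed above can be unpacked into precision and recall for the positive class (below-median resistance), using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Values printed above; rows = true class, columns = predicted class
cm = np.array([[25, 0],
               [3, 34]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy recomputed this way matches the reported test accuracy. All three errors are positives predicted as negatives, which is why precision is perfect while recall is lower.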
