# Logistic Regression Model

In [2]:
import pandas as pd

Mars_df = pd.read_csv('MarsCrater.csv')
Mars_df.head()

Unnamed: 0,CRATER_ID,CRATER_NAME,LATITUDE_CIRCLE_IMAGE,LONGITUDE_CIRCLE_IMAGE,DIAM_CIRCLE_IMAGE,DEPTH_RIMFLOOR_TOPOG,MORPHOLOGY_EJECTA_1,MORPHOLOGY_EJECTA_2,MORPHOLOGY_EJECTA_3,NUMBER_LAYERS
0,01-000000,,84.367,108.746,82.1,0.22,,,,0
1,01-000001,Korolev,72.76,164.464,82.02,1.97,Rd/MLERS,HuBL,,3
2,01-000002,,69.244,-27.24,79.63,0.09,,,,0
3,01-000003,,70.107,160.575,74.81,0.13,,,,0
4,01-000004,,77.996,95.617,73.53,0.11,,,,0


We need a categorical response variable, so we split crater depth into two classes: at least 0.075, and less than 0.075. As explanatory variables we will use crater diameter and number of layers.

In [3]:
diam = Mars_df['DIAM_CIRCLE_IMAGE'].to_numpy().reshape(-1, 1)
layers = Mars_df['NUMBER_LAYERS'].to_numpy().reshape(-1, 1)
depth = Mars_df['DEPTH_RIMFLOOR_TOPOG'].values

We can standardize the explanatory variables (zero mean, unit variance) with sklearn:

In [4]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(diam)
diam = scaler.transform(diam)

scaler = preprocessing.StandardScaler().fit(layers)
layers = scaler.transform(layers)
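As a rough sketch of what the scaler does with its default settings: it subtracts the column mean and divides by the population standard deviation. The example below redoes that by hand with NumPy on the first five diameters shown in the table above; the names `x` and `standardized` are only for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# First five diameters from the head() output above, as a column vector
x = np.array([82.1, 82.02, 79.63, 74.81, 73.53]).reshape(-1, 1)

# Subtract the mean and divide by the population standard deviation (ddof=0)
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

print(standardized.ravel())
print(standardized.mean(), standardized.std())
```

After this step one unit of the variable equals one standard deviation, which matters later when we read the odds ratios.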

Then we split crater depth into two classes:

In [5]:
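def divide(row):
    # 1 = depth of at least 0.075, 0 = shallower.
    # A missing depth matches neither branch, so the function returns None.
    if row['DEPTH_RIMFLOOR_TOPOG'] >= 0.075:
        return 1
    elif row['DEPTH_RIMFLOOR_TOPOG'] < 0.075:
        return 0

Mars_df['CAT_DEPTH'] = Mars_df.apply(divide, axis=1)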
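Row-wise `apply` can be slow on a table with hundreds of thousands of rows. A possible vectorized alternative is sketched below on a small stand-in frame called `sample` (a hypothetical name). The first five depths come from the table above; the value 0.05 and the missing value are made up to exercise the other two cases.

```python
import numpy as np
import pandas as pd

# Stand-in data: five depths from the table above plus two made-up rows
sample = pd.DataFrame(
    {'DEPTH_RIMFLOOR_TOPOG': [0.22, 1.97, 0.09, 0.13, 0.11, 0.05, np.nan]}
)

depth_col = sample['DEPTH_RIMFLOOR_TOPOG']

# Compare the whole column at once, then blank out rows where depth is missing
sample['CAT_DEPTH'] = (depth_col >= 0.075).astype(float).where(depth_col.notna())

print(sample)
```

The `.where(...)` call keeps missing depths missing instead of silently counting them as class 0.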

In [6]:
import statsmodels.formula.api as smf

In [7]:
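# diam and layers are not columns of Mars_df; the formula picks up the
# standardized arrays defined in the cells above.
results = smf.logit(formula='CAT_DEPTH ~ diam + layers', data=Mars_df).fit()
print(results.summary())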

Optimization terminated successfully.
         Current function value: 0.254568
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:              CAT_DEPTH   No. Observations:               384343
Model:                          Logit   Df Residuals:                   384340
Method:                           MLE   Df Model:                            2
Date:                Sun, 28 Mar 2021   Pseudo R-squ.:                  0.4493
Time:                        19:19:56   Log-Likelihood:                -97841.
converged:                       True   LL-Null:                   -1.7766e+05
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5213      0.006   -236.356      0.000      -1.534      -1.509
diam           4.8614      0.

Here is our regression. It fits better than our last regression: the pseudo R² is higher. We also have p-values below 0.05 and positive coefficients for both explanatory variables.
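As a quick sanity check on the pseudo R², we can recompute it from the two log-likelihoods in the summary. This sketch assumes the reported value is McFadden's pseudo R², defined as one minus the ratio of the model log-likelihood to the null log-likelihood. The inputs are the rounded numbers printed above, so the result only matches to about four decimals.

```python
# Log-likelihoods copied from the summary above (already rounded there)
ll_model = -97841.0    # Log-Likelihood
ll_null = -1.7766e+05  # LL-Null

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null)
pseudo_r2 = 1 - ll_model / ll_null

print(round(pseudo_r2, 4))
```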

Next, we look at the odds ratios.

In [12]:
import numpy as np
# odds ratios: exponentiate the log-odds coefficients
print("Odds Ratios")
print(np.exp(results.params))

Odds Ratios
Intercept      0.218422
diam         129.198764
layers         1.714575
dtype: float64


In [15]:
# odds ratios with 95% confidence intervals
params = results.params
conf = results.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
# exponentiate the coefficients and their interval bounds together
print(np.exp(conf))

             Lower CI    Upper CI          OR
Intercept    0.215684    0.221195    0.218422
diam       123.034047  135.672370  129.198764
layers       1.693736    1.735669    1.714575
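The interval bounds in this table are just the exponentiated bounds from the regression summary. A small check for the intercept, using the rounded bounds printed in the summary (so the match is only approximate):

```python
import math

# 95% interval for the intercept on the log-odds scale, from the summary above
lower_log_odds = -1.534
upper_log_odds = -1.509

# exp() is monotonic, so the bounds carry over to the odds scale directly
or_lower = math.exp(lower_log_odds)
or_upper = math.exp(upper_log_odds)

print(or_lower, or_upper)
```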


So, the odds ratios show an association between our target and both explanatory variables. Because the explanatory variables were standardized, each odds ratio is per one standard deviation increase. The odds of a crater being in the deeper class (depth of at least 0.075) are about 129 times higher for each standard deviation increase in diameter (OR=129.2, 95% CI=123.03-135.67, p<.0001). The number of layers matters much less: each standard deviation increase in the number of layers raises the odds of being in the deeper class only about 1.7 times (OR=1.71, 95% CI=1.69-1.74, p<.0001).
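To make these numbers more concrete, here is a small sketch that turns odds into probabilities. It uses only the intercept and the diameter coefficient printed in the summary, and relies on the fact that a standardized variable equals 0 at its mean. The variable names are for illustration only.

```python
import math

# Coefficients copied from the regression summary (log-odds scale)
intercept = -1.5213
coef_diam = 4.8614

# Odds and probability of the deeper class for a crater with
# average diameter and average number of layers (both standardized to 0)
odds_at_mean = math.exp(intercept)
prob_at_mean = odds_at_mean / (1 + odds_at_mean)

# Odds ratio for a one standard deviation increase in diameter
or_diam = math.exp(coef_diam)

# Same crater, but one standard deviation larger in diameter
odds_plus_one_sd = odds_at_mean * or_diam
prob_plus_one_sd = odds_plus_one_sd / (1 + odds_plus_one_sd)

print(prob_at_mean, prob_plus_one_sd)
```

An odds ratio of 129 does not mean the probability is 129 times higher; it multiplies the odds, and the probability follows from odds / (1 + odds).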

So we can see a strong association between crater diameter and depth, and a weaker (but still present) association between number of layers and depth, which is consistent with our hypothesis.