<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 02 | From Regression to Classification</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Part I: Preparation and Exploration</h2>
<br><h4>a) Imports and Loading the Dataset</h4>
Run the code below to import packages and load the 'titanic_feature_rich.xlsx' dataset into Python.

In [None]:
# standard libraries
import numpy             as np  # mathematical essentials
import pandas            as pd  # data science essentials
import matplotlib.pyplot as plt # data visualization
import seaborn           as sns # enhanced data viz

# classification-specific libraries
import phik                           # phi coefficient
import statsmodels.formula.api as smf # logistic regression
import sklearn.linear_model           # logistic regression


# preprocessing and testing
from sklearn.preprocessing import power_transform    # yeo-johnson
from sklearn.preprocessing import StandardScaler     # standard scaler
from sklearn.model_selection import train_test_split # train-test split
from sklearn.metrics import (confusion_matrix,
                             roc_auc_score, precision_score, recall_score)


# loading data
titanic = pd.read_excel('./datasets/titanic_feature_rich.xlsx')


# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


# displaying the head of the dataset
titanic.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>User-Defined Functions</strong><br>
Run the following code to load the user-defined functions used throughout this notebook.

In [None]:
########################################
# standard_scaler
########################################
def standard_scaler(df):
    """
    Standardizes a dataset (mean = 0, variance = 1). Returns a new DataFrame.
    Requires sklearn.preprocessing.StandardScaler()
    
    PARAMETERS
    ----------
    df     | DataFrame to be used for scaling
    """

    # INSTANTIATING a StandardScaler() object
    scaler = StandardScaler(copy = True)


    # FITTING the scaler with the data
    scaler.fit(df)


    # TRANSFORMING our data after fit
    x_scaled = scaler.transform(df)

    
    # converting scaled data into a DataFrame
    new_df = pd.DataFrame(x_scaled)


    # reattaching column names
    new_df.columns = list(df.columns)
    
    return new_df



########################################
## visual_cm
########################################
def visual_cm(true_y, pred_y, labels = None):
    """
    Creates a visualization of a confusion matrix.

    PARAMETERS
    ----------
    true_y : true values for the response variable
    pred_y : predicted values for the response variable
    labels : list of class labels for the heatmap axes, default None
    """
    # visualizing the confusion matrix

    # setting labels
    lbls = labels
    

    # declaring a confusion matrix object
    cm = confusion_matrix(y_true = true_y,
                          y_pred = pred_y)


    # heatmap
    sns.heatmap(cm,
                annot       = True,
                xticklabels = lbls,
                yticklabels = lbls,
                cmap        = 'Blues',
                fmt         = 'g')


    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix of the Classifier')
    plt.show()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part II - Response Variable Analysis</h2><br>
Run the following code to generate survival proportions.

In [None]:
# proportion of 1s and 0s for survived
titanic.value_counts(subset    = 'survived',
                     normalize = True      ).round(decimals = 2)

<br>

In [None]:
# proportion of 1s and 0s for survived (female passengers)
female_passengers = titanic[ titanic['female'] == 1 ]

female_passengers.value_counts(
    subset    = 'survived',
    normalize = True      ).round(decimals = 2).sort_index(ascending = True)

<br>

In [None]:
# proportion of 1s and 0s for survived (male passengers)
male_passengers = titanic[ titanic['female'] == 0 ]

male_passengers.value_counts(
    subset    = 'survived',
    normalize = True      ).round(decimals = 2).sort_index(ascending = True)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
Not surprisingly, a considerably larger proportion of female passengers survived when compared to male passengers. Let's check the strength of the relationship between survival and being female. Note that both <em>survived</em> and <em>female</em> can only take on values of 0 or 1, so their relationship is a <strong>bivariate association rather than a correlation</strong>. If instead one feature is continuous and the other can only take on a value of 0 or 1, the relationship is a <strong>point-biserial correlation</strong> (Pearson correlation can be applied for this calculation). While we can still use Pearson correlation to get a somewhat similar result, <strong>it is more appropriate to use the <a href="https://en.wikipedia.org/wiki/Phi_coefficient">phi coefficient</a> in cases like these.</strong>
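
As a point of reference, the textbook phi coefficient for two 0/1 columns can be computed directly from the counts in a 2x2 table. The sketch below uses small made-up arrays (not the Titanic data). For two strictly binary columns, this textbook value coincides with Pearson's r. As far as I understand it, the <em>phik</em> package reports the related phi-K statistic, which is calculated differently, so its output need not match the number printed here.

```python
import numpy as np

# hypothetical 0/1 columns (illustrative values, not the Titanic data)
survived = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])
female   = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 1])

# counts for each cell of the 2x2 contingency table
n11 = int(np.sum((survived == 1) & (female == 1)))
n10 = int(np.sum((survived == 1) & (female == 0)))
n01 = int(np.sum((survived == 0) & (female == 1)))
n00 = int(np.sum((survived == 0) & (female == 0)))

# phi = (n11*n00 - n10*n01) / sqrt(product of the four marginal totals)
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Pearson correlation on the same two columns, for comparison
pearson = np.corrcoef(survived, female)[0, 1]

print(f"phi: {phi:.4f} | pearson: {pearson:.4f}")
```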

In [None]:
# using Pearson correlation
titanic_corr = titanic.corr(method = 'pearson').round(decimals = 4)


# checking results
titanic_corr.loc[ : , 'survived' ].sort_values(ascending = False)

<br>

In [None]:
# using the phi coefficient for correlation
titanic_phi_corr = titanic.phik_matrix().round(decimals = 4)


# checking results
titanic_phi_corr.loc[ : , 'survived' ].sort_values(ascending = False)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
In short, Pearson correlation is for continuous features and the phi coefficient is for non-continuous features. The code below takes advantage of this. Note that <em>survived</em> is in both sets since it is the response variable.<br>

<h4>a) Complete the code below to develop Pearson correlations and phi coefficients for the appropriate features.</h4>

In [None]:
# creating feature sets
continuous     = ['survived', 'age', 'fare']

non_continuous = ['survived', 'sibsp', 'parch', 'm_age', 'm_cabin',
                  'm_boat','m_home_dest', 'potential_youth', 'under_18',
                  'number_of_names', 'pclass_1', 'pclass_2', 'pclass_3',
                  'female', 'male']


# pearson correlation
titanic_corr = titanic[ _____ ]._____.round(decimals = 4)


# phi coefficient
titanic_phi_corr = titanic[ _____ ]._____.round(decimals = 4)


# checking results
print(f"""
Point-Biserial Correlations
---------------------------
{titanic_corr.loc[ : , 'survived' ].sort_values(ascending = False)}


Phi Coefficients
----------------
{titanic_phi_corr.loc[ : , 'survived' ].sort_values(ascending = False)}
""")

In [None]:
# creating feature sets
continuous     = ['survived', 'age', 'fare']

non_continuous = ['survived', 'sibsp', 'parch', 'm_age', 'm_cabin',
                  'm_boat','m_home_dest', 'potential_youth', 'under_18',
                  'number_of_names', 'pclass_1', 'pclass_2', 'pclass_3',
                  'female', 'male']


# pearson correlation
titanic_corr = titanic[ continuous ].corr(method = 'pearson').round(decimals = 4)


# phi coefficient
titanic_phi_corr = titanic[ non_continuous ].phik_matrix(interval_cols = non_continuous).round(decimals = 4)


# checking results
print(f"""
Point-Biserial Correlations
---------------------------
{titanic_corr.loc[ : , 'survived' ].sort_values(ascending = False)}


Phi Coefficients
----------------
{titanic_phi_corr.loc[ : , 'survived' ].sort_values(ascending = False)}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III - Preparing for Logistic Regression</h2><br>
The dataset has been prepared with the exception of transformations and standardization. Note that the steps to prepare the dataset are available in <strong>Preparing the Titanic Dataset</strong>, in case you are interested in learning more about this.
<br><br>
<h3>Transformations</h3><br>
As with the linear regression models covered in Computational Analytics, the data should be treated for skewness before modeling. However, instead of using <em>np.log1p()</em>, let's apply the <strong>Yeo-Johnson transformation</strong>, which is mathematically defined as follows:
<br><br><br>

<div style = "width:image width px; font-size:80%; text-align:center;"><img src= "./documentation/yeo_johnson_transformation.png" width="400" height="200" style="padding-bottom:0.0em;"></div>

<br><br>
In other words, it's a more sophisticated version of <em>np.log1p()</em> that has two major advantages:

1. It can transform zeros and negative values.
2. It has a power parameter (lambda), giving it the ability to change the degree of transformation in order to achieve better results.
<br>
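
The standard piecewise definition of Yeo-Johnson can be sketched in NumPy as follows. This is an illustration for a fixed, hand-picked lambda: the function name <em>yeo_johnson</em> and the sample values are assumptions made for this example, whereas <em>power_transform(&nbsp;)</em> also estimates lambda separately for each column.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Applies the Yeo-Johnson formula for one fixed lambda (illustrative helper)."""
    x   = np.asarray(x, dtype = float)
    out = np.empty_like(x)
    pos = x >= 0

    # non-negative values
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # negative values
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])

    return out


# hypothetical values that include zero and negatives
sample = np.array([-3.0, -0.5, 0.0, 2.0, 10.0])

print(yeo_johnson(sample, lmbda = 0.0)) # log1p on the non-negative values
print(yeo_johnson(sample, lmbda = 1.0)) # lambda = 1 leaves the data unchanged
```

With lambda set to 0, the non-negative branch reduces to <em>np.log1p()</em>, which is why the transformation can be viewed as a generalization of it.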

In [None]:
help(power_transform)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Run the following code to transform the x-data using the Yeo-Johnson method.

In [None]:
# subsetting X-data
x_data = titanic.loc[ : , 'age': ]


# checking skewness
x_data.skew().round(decimals = 2)

<br>

In [None]:
# yeo-johnson transformation
x_transformed = power_transform(X           = x_data,
                                method      = 'yeo-johnson',
                                standardize = True        )


# storing results as a DataFrame
x_transformed_df = pd.DataFrame(data    = x_transformed,
                                columns = list(x_data.columns))


# checking skewness results
x_transformed_df.skew().round(decimals = 2)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
Notice that the Yeo-Johnson transformation affected skewness for continuous and interval data, but not for binary or categorical data. Furthermore, in each case where the transformation was applied to continuous data, skewness moved closer to zero. Run the code below to observe this more clearly.

In [None]:
# calculating difference in skewness
print(f"""
Normality Improvements (Skewness)
---------------------------------
{abs(x_data.skew().round(decimals = 2)) - abs(x_transformed_df.skew().round(decimals = 2))}""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>Standardization</h3><br>
Run the following code to standardize the data (important in classification modeling). Even though this was already done inside the <em>power_transform(&nbsp;)</em> function (via <em>standardize = True</em>), it is important to re-standardize the data before modeling (even after a transformation).
<br><br>
Remember, scaling does not affect correlation, phi coefficients, or skewness.
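
A quick way to convince yourself of this is to standardize two synthetic columns and compare the statistics before and after. The sketch below uses made-up data and two small helper functions (<em>standardize</em> and <em>skewness</em> are names chosen for this example). The skewness here is the simple population version, so its values will differ slightly from the bias-adjusted output of pandas <em>.skew()</em>.

```python
import numpy as np

# synthetic data (illustrative only): a right-skewed column and a related column
rng  = np.random.default_rng(seed = 219)
fare = rng.exponential(scale = 30.0, size = 500)
age  = 0.1 * fare + rng.normal(loc = 30.0, scale = 10.0, size = 500)


def standardize(x):
    """Rescales to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()


def skewness(x):
    """Population skewness: mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)


fare_z = standardize(fare)
age_z  = standardize(age)

corr_before = np.corrcoef(fare, age)[0, 1]
corr_after  = np.corrcoef(fare_z, age_z)[0, 1]

print(f"""
Correlation before / after: {corr_before:.4f} / {corr_after:.4f}
Skewness    before / after: {skewness(fare):.4f} / {skewness(fare_z):.4f}
""")
```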

In [None]:
help(standard_scaler)

<br>

In [None]:
# standardizing X-data (st = scaled and transformed)
x_data_st = standard_scaler(df = x_transformed_df)


# checking results
x_data_st.describe(include = 'number').round(decimals = 2)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part IV - Logistic Regression</h2><br>
Much can be said about the power of feature engineering, but in general, <font color='red'><strong>good thinking will always beat statistics</strong></font>.<br><br>

<br>
<strong>Stratifying the Response Variable</strong><br>
When working with classification problems, preserving the balance of the response variable is critically important. In terms of the Titanic dataset, we need to preserve the proportion of people that survived in both the training and testing sets. This can be accomplished by using the <em>stratify</em> argument of <strong>train_test_split(&nbsp;)</strong>. The code below will output the original balance between those that survived and those that did not survive the Titanic disaster.

In [None]:
# survival proportions
titanic.loc[ : ,'survived'].value_counts(normalize = True).round(decimals = 2)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
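
To see what stratification does in isolation, here is a minimal sketch on a synthetic response variable (400 ones and 600 zeros, chosen for illustration). The placeholder X column and the variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic response with a 40/60 balance (illustrative only)
y = np.array([1] * 400 + [0] * 600)
X = np.arange(len(y)).reshape(-1, 1) # placeholder explanatory data

# stratifying on y so both splits keep roughly the same balance
X_tr, X_te, y_tr, y_te = train_test_split(
            X,
            y,
            test_size    = 0.25,
            random_state = 219,
            stratify     = y)

print(f"""
Original proportion of 1s: {y.mean():.2f}
Training proportion of 1s: {y_tr.mean():.2f}
Testing  proportion of 1s: {y_te.mean():.2f}
""")
```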

<h4>a) Preparing Explanatory and Response Data</h4>
Instantiate the X-features as <strong>titanic_data</strong> and the response variable (&nbsp;<em>survived</em>&nbsp;) as <strong>titanic_target</strong>.<br><br>
<em><strong>Hint:</strong> Use the DataFrame where the x-data has already been transformed and scaled.</em>

In [None]:
# declaring explanatory variables
titanic_data   = _____


# declaring response variable
titanic_target = _____


## this code will not produce an output ##

In [None]:
# declaring explanatory variables
titanic_data = x_data_st


# declaring response variable
titanic_target = titanic.loc[ : , 'survived']


## this code will not produce an output ##

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>b) Complete and run the following code to split the data into training and testing sets.</h4>
Notice the new stratify argument. This helps preserve the balance of the response variable in the training and testing sets.

In [None]:
# train-test split with stratification
x_train, x_test, y_train, y_test = train_test_split(
            titanic_data,
            titanic_target,
            test_size    = 0.25,
            random_state = 219,
            stratify     = _____) # preserving balance


# merging training data for statsmodels
titanic_train = pd.concat([x_train, y_train], axis = 1)


## this code will not produce an output ##

In [None]:
# train-test split with stratification
x_train, x_test, y_train, y_test = train_test_split(
            titanic_data,
            titanic_target,
            test_size    = 0.25,
            random_state = 219,
            stratify     = titanic_target) # preserving balance


# merging training data for statsmodels
titanic_train = pd.concat([x_train, y_train], axis = 1)


## this code will not produce an output ##

<br>

In [None]:
print(f"""
Response Variable Proportions (Training Set)
--------------------------------------------
{y_train.value_counts(normalize = True).round(decimals = 2)}



Response Variable Proportions (Testing Set)
--------------------------------------------
{y_test.value_counts(normalize = True).round(decimals = 2)}
""")



<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>c) Build a Univariate Logistic Regression Model</h4>
Build a logistic regression model in <strong>statsmodels</strong> using the x-feature that has the strongest relationship with the response variable (&nbsp;<em>survived</em>&nbsp;).

In [None]:
# instantiating a logistic regression model object
logistic_small = smf.logit(formula = """ _____ """,
                           data    = titanic_train)


# FITTING the model object
results_logistic = logistic_small._____


# checking the results SUMMARY
results_logistic.summary2() # summary2() has AIC and BIC

In [None]:
# instantiating a logistic regression model object
logistic_small = smf.logit(formula = """survived ~ m_boat""",
                           data    = titanic_train)


# fitting the model object
results_logistic = logistic_small.fit()


# checking the results SUMMARY
results_logistic.summary2() # summary2() has AIC and BIC

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>d) Build a logistic regression model in statsmodels using all of the explanatory variables.</h4>
Use the loop below for efficiency and correct any errors that occur after the copy/paste.<br><br>
<em><strong>Hint:</strong> Remember to remove one column for each one-hot encoded feature so that the model computes properly.</em>
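
The reason behind this hint can be sketched with a small design matrix. When an intercept is combined with every level of a one-hot encoded feature, the columns are perfectly collinear (the dummy columns sum to the intercept column), so the matrix loses a rank. The passenger classes below are made-up values for illustration.

```python
import numpy as np

# hypothetical passenger classes (illustrative only)
pclass = np.array([1, 2, 3, 3, 1, 2, 3, 1])

# one-hot encoding all three levels
dummies   = np.column_stack([(pclass == k).astype(float) for k in (1, 2, 3)])
intercept = np.ones((len(pclass), 1))

# design matrices with and without the redundant column
X_all  = np.hstack([intercept, dummies])         # intercept + 3 dummies
X_drop = np.hstack([intercept, dummies[:, :2]])  # intercept + 2 dummies

rank_all  = np.linalg.matrix_rank(X_all)
rank_drop = np.linalg.matrix_rank(X_drop)

print(f"""
All dummies : {X_all.shape[1]} columns, rank {rank_all}
One dropped : {X_drop.shape[1]} columns, rank {rank_drop}
""")
```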

In [None]:
for val in titanic_data:
    print(f" {val} + ")

<br>

In [None]:
# instantiating a logistic regression model object
logistic_full = smf.logit(formula = """ _____ """,
                                        data    = titanic_train)


# fitting the model object
results_full = logistic_full.fit()


# checking the results SUMMARY
results_full.summary2()

In [None]:
# instantiating a logistic regression model object
logistic_full = smf.logit(formula = """  survived ~
                                         age + 
                                         sibsp + 
                                         parch + 
                                         fare + 
                                         m_age + 
                                         m_cabin + 
                                         m_boat + 
                                         m_home_dest + 
                                         potential_youth + 
                                         under_18 + 
                                         number_of_names + 
                                         pclass_1 + 
                                         pclass_2 + 
                                         female""",
                                         data    = titanic_train)


# fitting the model object
results_full = logistic_full.fit()


# checking the results SUMMARY
results_full.summary2()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part V - Refocusing the Response Variable</h2><br>
<strong>m_boat</strong> is performing incredibly well in predicting passenger survival. This aligns with common sense as getting into a lifeboat means staying out of the frigid waters, avoiding causes of death like hypothermia and drowning. This feature is so powerful and interpretable that there is little reason to develop a more complex model. <strong>Survival depends on getting into a lifeboat.</strong>
<br><br>
Let's shift the focus of our modeling efforts to factors that contribute to getting into a lifeboat. Thus, we will change our response variable from <strong>survived</strong> to <strong>m_boat</strong>. Note that <strong>survived</strong> should not be used in the model as it takes place after the event horizon. In the interest of time, the following full model has been developed for you.
<br><br>
One task stands in our way before using <strong>m_boat</strong> as the response variable. Since it was transformed, it is no longer in binary form (0 or 1). Therefore, we need to do some preparation before we are ready to model.

In [None]:
# unique values for m_boat
titanic_train['m_boat'].unique()

<br>

In [None]:
# converting m_boat back to 0s and 1s
for index, value in titanic_train.iterrows():
    
    if   titanic_train.loc[ index, 'm_boat' ] < 0:
          titanic_train.loc[ index, 'm_boat' ] = 0
    
    elif titanic_train.loc[ index, 'm_boat' ] > 0:
          titanic_train.loc[ index, 'm_boat' ] = 1
            
    else:
        print('Something went wrong.')

<br>
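
The row-by-row loop above can also be expressed as a single vectorized comparison. The sketch below uses a small made-up Series in place of the transformed column (the values are assumptions, not the actual transformed output). One difference to keep in mind: a value of exactly zero would quietly become 0 here, rather than triggering the warning message in the loop.

```python
import pandas as pd

# hypothetical transformed values: one negative level and one positive level
m_boat_scaled = pd.Series([-0.77, 1.30, -0.77, 1.30, 1.30], name = 'm_boat')

# values above zero become 1, everything else becomes 0
m_boat_binary = (m_boat_scaled > 0).astype(int)

print(m_boat_binary.tolist())
```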

In [None]:
# new unique values for m_boat
titanic_train['m_boat'].unique()

<br>

In [None]:
# instantiating a logistic regression model object
logit_full = smf.logit(formula = """ m_boat ~
                                     age +
                                     sibsp +
                                     parch +
                                     fare +
                                     m_age +
                                     m_cabin +
                                     m_home_dest +
                                     potential_youth +
                                     under_18 +
                                     number_of_names +
                                     pclass_1 +
                                     pclass_2 +
                                     pclass_3 +
                                     female +
                                     male""",
                                     data    = titanic_train)


# fitting the model object
logit_full = logit_full.fit()


# checking the results SUMMARY
logit_full.summary2()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>a) Develop a model where all features are significant based on their p-values.</h4>
Based on the output above, remove all features that were deemed insignificant based on their p-values. Once finished, check the p-values again to ensure significance.<br><br>
<strong>Note:</strong> 'nan' is also considered insignificant (excluding the intercept, which must always be included in the model).
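
Fitted statsmodels results expose their p-values through the <em>pvalues</em> attribute, so the screening step can also be done programmatically. The sketch below works on a made-up Series of p-values and assumes a 0.05 cutoff (both are assumptions for illustration, not output from the model above). Comparisons against 'nan' evaluate to False, so those features drop out automatically.

```python
import numpy as np
import pandas as pd

# hypothetical p-values (illustrative only, not real model output)
pvalues = pd.Series({'Intercept': 0.300,
                     'age'      : 0.001,
                     'fare'     : 0.412,
                     'female'   : 0.000,
                     'male'     : np.nan})

# assumed significance cutoff
alpha = 0.05

# the intercept stays in the model regardless, so it is not screened
features = pvalues.drop(labels = 'Intercept')

# keeping features below the cutoff (nan comparisons are False)
keep = features[features < alpha].index.tolist()

# assembling a formula string for smf.logit()
formula = 'm_boat ~ ' + ' + '.join(keep)

print(formula)
```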

In [None]:
# instantiating a logistic regression model object
logit_sig = smf.logit(formula = """ _____ """,
                                            data    = titanic_train)


# fitting the model object
logit_sig = logit_sig.fit()


# checking the results SUMMARY
logit_sig.summary2()

In [None]:
# instantiating a logistic regression model object
logit_sig = smf.logit(formula = """ m_boat ~
                                    age +
                                    m_cabin +
                                    number_of_names +
                                    pclass_2 +
                                    pclass_3 +
                                    female""",
                                    data    = titanic_train)


# fitting the model object
logit_sig = logit_sig.fit()


# checking the results SUMMARY
logit_sig.summary2()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part V: Logistic Regression in scikit-learn</h2><br>
We can use the model above as a candidate model. In an effort to stay organized, we can put each candidate model into a dictionary. Run the code below to instantiate a dictionary to store the x-side of each candidate model.

In [None]:
# creating a dictionary to store candidate models

candidate_dict = {

 # full model
 'logit_full'   : ['age', 'sibsp', 'parch', 'fare', 'm_age', 'm_cabin',
                   'm_home_dest', 'potential_youth', 'under_18',
                   'number_of_names', 'pclass_1', 'pclass_2', 'female'],
 

 # p-value significant variables only
 'logit_sig'  : ['age', 'm_cabin', 'number_of_names',
                 'pclass_2', 'pclass_3', 'female'   ]

}

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h4>a) Dynamically print each feature set.</h4>
Complete the code to display each feature set in <strong>candidate_dict</strong>.

In [None]:
# printing candidate variable sets
_____(_____"""
/--------------------------\\
|Explanatory Variable Sets |
\\--------------------------/

Full Model:
-----------
{_____}


Significant p-value Model:
--------------------------------
{_____}
""")

In [None]:
# printing candidate variable sets
print(f"""
/--------------------------\\
|Explanatory Variable Sets |
\\--------------------------/

Full Model:
-----------
{candidate_dict['logit_full']}


Significant p-value Model:
--------------------------------
{candidate_dict['logit_sig']}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Regression v. Classification in scikit-learn</strong><br>
One of the many great things about working with scikit-learn is that classification modeling follows the same approach as regression modeling.
<br>
<h4>b) Build a logistic regression model in scikit-learn</h4>
Build a logistic regression model in scikit-learn using the <strong>logit_sig</strong> X-features and <strong>m_boat</strong> as the response variable.

In [None]:
# train/test split with the p-value significant model (logit_sig)
titanic_data   =  x_data_st[ _____ ]
titanic_target =  titanic  [ _____ ]


# this is the exact code we were using before
x_train, x_test, y_train, y_test = train_test_split(
            titanic_data,
            titanic_target,
            random_state = 702,
            test_size    = 0.25,
            stratify     = titanic_target)

In [None]:
# train/test split with the p-value significant model (logit_sig)
titanic_data   =  x_data_st[ candidate_dict['logit_sig'] ]
titanic_target =  titanic  [ 'm_boat']


# this is the exact code we were using before
x_train, x_test, y_train, y_test = train_test_split(
            titanic_data,
            titanic_target,
            random_state = 702,
            test_size    = 0.25,
            stratify     = titanic_target)

<br>

In [None]:
# INSTANTIATING a logistic regression model
logreg = sklearn.linear_model.LogisticRegression(solver = 'lbfgs',
                                                 C = 1,
                                                 random_state = 702)


# FITTING the training data
logreg_fit = logreg.fit(x_train, y_train)


# PREDICTING based on the testing set
logreg_pred = logreg_fit.predict(x_test)


# saving scoring data for future use
train_score = round(logreg_fit.score(x_train, y_train), ndigits = 4) # train accuracy
test_score  = round(logreg_fit.score(x_test, y_test),   ndigits = 4) # test accuracy
tt_gap      = round(abs(train_score - test_score),      ndigits = 4) # gap

# displaying and saving the gap between training and testing
print(f"""\
Training ACCURACY: {train_score}
Testing  ACCURACY: {test_score}
Train-Test Gap   : {tt_gap}
""") 

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part VI: Why Accuracy is Bad</h2><br>
What does it mean to be accurate? Mathematically, predictive accuracy can be calculated as follows:<br><br>

~~~
correct predictions / total predictions
~~~

<br>
However, such a calculation poses a problem. Let's say, for example, that we went back to predicting whether a passenger survived the Titanic disaster. If we were to run the following code:<br><br>

~~~
titanic['survived'].mean()
~~~

<br>
We would learn that approximately 42% of the passengers in the dataset survived. Therefore, if we were to claim that every passenger survived, we would have an accuracy of 42%, even though we would be 100% inaccurate in predicting passengers that did not survive. This becomes an even more serious problem when the response variable is heavily imbalanced. For example, when 90% of observations experienced a phenomenon, predicting that every observation experienced it yields 90% accuracy without identifying a single negative case. Therefore, we need to consider accuracy from two perspectives: positive cases (the 1s) and negative cases (the 0s). In this section, we will cover tools that more appropriately measure classification model performance.
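
The scenario above can be reproduced with a few lines of NumPy. The sketch uses a made-up response variable with 42 ones and 58 zeros (chosen to mirror the proportion mentioned above) and a "model" that predicts 1 for everyone.

```python
import numpy as np

# hypothetical response: 42 positive cases and 58 negative cases
y_true = np.array([1] * 42 + [0] * 58)

# a naive model that predicts the positive class every time
y_pred = np.ones_like(y_true)

# overall accuracy
accuracy = np.mean(y_true == y_pred)

# accuracy within each class
acc_positive = np.mean(y_pred[y_true == 1] == 1)
acc_negative = np.mean(y_pred[y_true == 0] == 0)

print(f"""
Overall accuracy       : {accuracy:.2f}
Accuracy on actual 1s  : {acc_positive:.2f}
Accuracy on actual 0s  : {acc_negative:.2f}
""")
```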

<br><br>
<h3>The Confusion Matrix</h3><br>
The confusion matrix in Python can be read as follows:<br><br>

~~~
                   |
  True Negatives   |  False Positives
  (correct)        |  (incorrect)
                   |
-------------------|------------------
                   |
  False Negatives  |  True Positives
  (incorrect)      |  (correct)
                   |
~~~

<br><br><br>
In terms of our model:<br>

~~~
                                                 |
  PREDICTED: GOT IN LIFEBOAT (m_boat=0)          |  PREDICTED: DID NOT GET IN LIFEBOAT (m_boat=1)
  ACTUAL:    GOT IN LIFEBOAT (m_boat=0)          |  ACTUAL:    GOT IN LIFEBOAT         (m_boat=0)
                                                 |
-------------------------------------------------|-----------------------------------------------
                                                 |
  PREDICTED: GOT IN LIFEBOAT         (m_boat=0)  |  PREDICTED: DID NOT GET IN LIFEBOAT (m_boat=1)
  ACTUAL:    DID NOT GET IN LIFEBOAT (m_boat=1)  |  ACTUAL:    DID NOT GET IN LIFEBOAT (m_boat=1)
                                                 |  
~~~
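
To make the layout concrete, the matrix can be tallied by hand so that rows hold the actual class and columns hold the predicted class. The arrays below are made-up labels for illustration. Flattening the matrix row by row yields the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np

# hypothetical actual and predicted classes (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# rows = actual class, columns = predicted class
cm = np.zeros((2, 2), dtype = int)

for actual, predicted in zip(y_true, y_pred):
    cm[actual, predicted] += 1

# flattening row by row
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN: {tn} | FP: {fp} | FN: {fn} | TP: {tp}")
```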


In [None]:
# creating a confusion matrix
print(confusion_matrix(y_true = y_test,
                       y_pred = logreg_pred))

<br>

In [None]:
# unpacking the confusion matrix
logreg_tn, \
logreg_fp, \
logreg_fn, \
logreg_tp = confusion_matrix(y_true = y_test, y_pred = logreg_pred).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {logreg_tn}
False Positives: {logreg_fp}
False Negatives: {logreg_fn}
True Positives : {logreg_tp}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Visualized Confusion Matrix</strong><br>
Run the code below to apply the user-defined function <em>visual_cm(&nbsp;)</em>, which will generate a visualization of the confusion matrix.

In [None]:
# calling the visual_cm function
visual_cm(true_y = y_test,
          pred_y = logreg_pred,
          labels = ['Life Boat', 'Not In Life Boat'])

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>MUST KNOW: Area Under The Curve (AUC)</h3><br>
The area under the curve (AUC) value is one of the most common metrics used to evaluate the overall performance of a classification model. This is largely due to the fact that this metric takes into account two key factors:<br><br>
<u>Sensitivity</u><br>
Of the times an event DID occur, the proportion that the model predicted WOULD occur: true positives / (true positives + false negatives).
<br><br>
<u>Specificity</u><br>
Of the times an event DID NOT occur, the proportion that the model predicted WOULD NOT occur: true negatives / (true negatives + false positives).
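
Both quantities follow directly from the four confusion matrix counts. The sketch below uses made-up counts for illustration. It also computes the average of sensitivity and specificity: when AUC is calculated from hard 0/1 predictions rather than predicted probabilities, the ROC curve has only one interior point and the area reduces to this average.

```python
# hypothetical confusion matrix counts (illustrative only)
tn, fp, fn, tp = 50, 10, 5, 35

# proportion of actual positives that were predicted as positive
sensitivity = tp / (tp + fn)

# proportion of actual negatives that were predicted as negative
specificity = tn / (tn + fp)

# proportion of predicted positives that were actually positive
precision = tp / (tp + fp)

# AUC when scores are hard 0/1 predictions
auc_hard = (sensitivity + specificity) / 2

print(f"""
Sensitivity (recall): {sensitivity:.4f}
Specificity         : {specificity:.4f}
Precision           : {precision:.4f}
AUC (hard labels)   : {auc_hard:.4f}
""")
```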

In [None]:
# preparing AUC, precision, and recall
auc       = round(roc_auc_score(y_true = y_test, y_score = logreg_pred) , ndigits = 4)
precision = round(precision_score(y_true = y_test, y_pred = logreg_pred), ndigits = 4)
recall    = round(recall_score(y_true = y_test, y_pred = logreg_pred)   , ndigits = 4)


# dynamically printing metrics
print(f"""\
AUC:       {auc}
Precision: {precision}
Recall:    {recall}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Run the code below to observe the model's coefficients.

In [None]:
# zipping each feature name to its coefficient
model_values = zip(titanic[candidate_dict[ 'logit_sig'] ].columns,
                           logreg_fit.coef_.ravel().round(decimals = 2))


# setting up a placeholder list to store model features
model_lst = [('intercept', round(logreg_fit.intercept_[0], ndigits = 2))]


# appending each feature-coefficient pair to the list
for val in model_values:
    model_lst.append(val)
    

# checking the results
for pair in model_lst:
    print(pair)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part VII: Adjusting The Classification Threshold</h2><br>
In this final section, we will adjust the <strong>classification threshold</strong>, or the boundary between predicting a one or a zero. By default, if an observation has a predicted probability at or above 0.50, it will be predicted as being a member of the 1 class. Adjusting this threshold may lead to better predictions or better alignment with real-world applications. This is particularly effective when attempting to control sensitivity or specificity, as it is likely to lead to fewer false positives or fewer false negatives, depending on the direction in which the threshold moves.
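
The trade-off can be previewed on a handful of made-up probabilities before applying it to the model. In the sketch below, <em>counts_at</em> is a helper name chosen for this example, and the labels and probabilities are illustrative only. Lowering the threshold trades false negatives for false positives, and raising it does the opposite.

```python
import numpy as np

# hypothetical true classes and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
prob_1 = np.array([0.10, 0.30, 0.45, 0.60, 0.20, 0.40, 0.70, 0.90])


def counts_at(threshold):
    """Returns (false positives, false negatives) at a given threshold."""
    pred = (prob_1 >= threshold).astype(int)
    fp   = int(np.sum((pred == 1) & (y_true == 0)))
    fn   = int(np.sum((pred == 0) & (y_true == 1)))
    return fp, fn


# comparing three thresholds
for threshold in (0.25, 0.50, 0.75):
    fp, fn = counts_at(threshold)
    print(f"Threshold {threshold:.2f} | False Positives: {fp} | False Negatives: {fn}")
```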

In [None]:
# printing the predicted probabilities of 0 and 1, respectively
pd.DataFrame(data = logreg_fit.predict_proba(titanic_data).round(decimals = 2),
             columns = ['Class 0', 'Class 1']).head(n = 5)

<br>

In [None]:
# printing actual predictions (0 or 1)
pd.DataFrame(data    = logreg_fit.predict(titanic_data),
             columns = ['Predicted Class']).head(n = 5)

<br>

In [None]:
# storing objects for predictions and true y values
true_y         = titanic_target
pred_probs     = pd.DataFrame(logreg_fit.predict_proba(titanic_data)).round(decimals = 2)
pred_thresh_50 = pd.DataFrame(logreg_fit.predict(titanic_data))

<br>

In [None]:
# combining the predictions into a DataFrame and renaming columns
prediction_df = pd.concat([true_y, pred_probs, pred_thresh_50], axis = 1)
prediction_df.columns = ['true_y', 'prob_0', 'prob_1', 'pred_thresh_50']


# checking results
prediction_df.head(n = 15)

<br>

In [None]:
# unpacking the confusion matrix
logreg_tn, \
logreg_fp, \
logreg_fn, \
logreg_tp = confusion_matrix(y_true = prediction_df['true_y'],
                             y_pred = prediction_df['pred_thresh_50']).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {logreg_tn}
False Positives: {logreg_fp}
False Negatives: {logreg_fn}
True Positives : {logreg_tp}
""")

<br>

In [None]:
# probability of 1 >= 0.25
prediction_df['pred_thresh_25'] = (prediction_df['prob_1'] >= 0.25).astype(dtype = int)


# checking results
prediction_df.tail(n = 10)

<br>

In [None]:
# unpacking the confusion matrix
logreg_tn, \
logreg_fp, \
logreg_fn, \
logreg_tp = confusion_matrix(y_true = prediction_df['true_y'],
                             y_pred = prediction_df['pred_thresh_25']).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {logreg_tn}
False Positives: {logreg_fp}
False Negatives: {logreg_fn}
True Positives : {logreg_tp}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~

 __     __                 ____ _                     _ 
 \ \   / /__ _ __ _   _   / ___| | __ _ ___ ___ _   _| |
  \ \ / / _ \ '__| | | | | |   | |/ _` / __/ __| | | | |
   \ V /  __/ |  | |_| | | |___| | (_| \__ \__ \ |_| |_|
    \_/ \___|_|   \__, |  \____|_|\__,_|___/___/\__, (_)
                  |___/                         |___/   
                  
~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>