<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br><h2>Script 04 | Classification Modeling with Unsupervised Data</h2>
<br>
Written by Chase Kusterer<br>
<a href="https://github.com/chase-kusterer">GitHub</a> | <a href="https://www.linkedin.com/in/kusterer/">LinkedIn</a>
<br><br><br>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h3>Introduction</h3><br>
Now that we have an understanding of unsupervised learning through principal component analysis and k-means clustering, we are ready to apply the results of these algorithms to a supervised learning problem.

In [None]:
# importing libraries
import numpy as np                                          # mathematical essentials
import pandas as pd                                         # data science essentials
from sklearn.decomposition import PCA                       # principal component analysis
from sklearn.model_selection import train_test_split        # train-test split
from sklearn.preprocessing import StandardScaler            # data prep
from sklearn.metrics import confusion_matrix, roc_auc_score # results analysis
from sklearn.cluster import KMeans                          # k-means clustering
import sklearn.linear_model                                 # classification modeling



# importing data
file    = './datasets/ames_classification.xlsx'
housing = pd.read_excel(io = file)


# checking results
housing.head(n = 5)

<br>

In [None]:
# standard_scaler
def standard_scaler(df):
    """
    Standardizes a dataset (mean = 0, variance = 1). Returns a new DataFrame.
    Requires sklearn.preprocessing.StandardScaler()
    
    PARAMETERS
    ----------
    df     | DataFrame to be used for scaling
    """

    # INSTANTIATING a StandardScaler() object
    scaler = StandardScaler(copy = True)


    # FITTING the scaler with the data
    scaler.fit(df)


    # TRANSFORMING our data after fit
    x_scaled = scaler.transform(df)

    
    # converting scaled data into a DataFrame
    new_df = pd.DataFrame(x_scaled)


    # reattaching column names
    new_df.columns = list(df.columns)
    
    return new_df
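
As a quick sanity check, the sketch below applies the same scaling steps to a small made-up DataFrame and confirms that each column ends up with a mean of roughly 0 and a standard deviation of roughly 1. The values and the names `toy_df` and `toy_scaled` are illustrative assumptions, not part of the Ames data. Note that StandardScaler divides by the population standard deviation (ddof = 0), which differs slightly from the pandas default (ddof = 1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# small illustrative dataset (hypothetical values)
toy_df = pd.DataFrame({'Lot_Area'   : [8450, 9600, 11250, 9550, 14260],
                       'Garage_Area': [ 548,  460,   608,  642,   836]})

# scaling with the same steps used in standard_scaler()
scaler     = StandardScaler(copy = True)
toy_scaled = pd.DataFrame(data    = scaler.fit_transform(toy_df),
                          columns = toy_df.columns)

# checking results (ddof = 0 matches StandardScaler's calculation)
col_means = toy_scaled.mean(axis = 0)
col_stds  = toy_scaled.std(axis = 0, ddof = 0)

print(col_means.round(decimals = 2))
print(col_stds.round(decimals = 2))
```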

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h3>Classification Modeling with Principal Components</h3><br>
Let's assume we have already conducted principal component analysis and concluded that we will retain four principal components. Note that no actual analysis was conducted here; had we done one, we would likely have arrived at a different number of retained principal components.
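
For reference, one common way to inform this decision is to inspect the explained variance ratio of each component. The sketch below is a minimal illustration on synthetic data (the array `toy_data` is an assumption standing in for the scaled housing features), not the analysis that would be run on the Ames dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for eight continuous features (assumption)
rng      = np.random.default_rng(seed = 702)
toy_data = rng.normal(size = (200, 8))

# making two of the features strongly correlated
toy_data[:, 1] = toy_data[:, 0] + 0.1 * rng.normal(size = 200)

# scaling before PCA
toy_scaled = StandardScaler().fit_transform(toy_data)

# fitting PCA with all components retained
pca_all = PCA(n_components = None, random_state = 702)
pca_all.fit(toy_scaled)

# explained variance per component and its running total
evr = pd.DataFrame({'explained_variance_ratio': pca_all.explained_variance_ratio_,
                    'cumulative'              : np.cumsum(pca_all.explained_variance_ratio_)})

print(evr.round(decimals = 3))
```

The number of components to retain can then be chosen where the cumulative ratio levels off or passes a threshold that suits the problem.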

In [None]:
# subsetting continuous data
housing_continuous = housing[ ['Lot_Area', 'Mas_Vnr_Area', 'Total_Bsmt_SF',
                               'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area',
                               'Garage_Area', 'Porch_Area'] ]


# scaling the data
pca_data = standard_scaler(df = housing_continuous)

<br>

In [None]:
# INSTANTIATING a PCA object
pca = PCA(n_components = _____,
          random_state = 702)


# fitting PCA and transforming the data into principal components
housing_pca = pca.fit_transform(pca_data)

In [None]:
# INSTANTIATING a PCA object
pca = PCA(n_components = 4,
          random_state = 702)


# fitting PCA and transforming the data into principal components
housing_pca = pca.fit_transform(pca_data)

<br>

In [None]:
#?# Do we get better results when we scale the principal components? #?#
#housing_pca_scaled = standard_scaler(df = pd.DataFrame(data = housing_pca))


# selecting x- and y-data
x_data = housing_pca
y_data = housing['Expensive_Property']


# training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size    = 0.25,
                                                    random_state = 702,
                                                    stratify     = y_data)

<br>

In [None]:
# INSTANTIATING a logistic regression model
model = sklearn.linear_model.LogisticRegression(solver       = 'lbfgs',
                                                C            = 1,
                                                random_state = 702)


# FITTING the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING based on the testing set
model_pred = model_fit.predict(x_test) # predict_proba returns class probabilities


# checking results
train_acc = model_fit.score(x_train, y_train)
test_acc  = model_fit.score(x_test , y_test )
roc_score = roc_auc_score  (y_true      = y_test,
                            y_score     = model_pred)


print(f"""
Train-Test Gap: {round(abs(train_acc - test_acc), ndigits = 3)}
Test AUC Score: {round(roc_score, ndigits = 3)}
""")

<br>
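
The cell above scores AUC on hard class predictions from predict(). As a point of comparison, the sketch below computes AUC both ways on synthetic data: from hard labels and from the positive-class probabilities returned by predict_proba(). The data come from make_classification rather than the Ames dataset, and all variable names are illustrative assumptions. With hard labels the ROC curve has only one interior point, so the two scores can differ noticeably.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic binary classification data (assumption)
x_toy, y_toy = make_classification(n_samples     = 500,
                                   n_features    = 4,
                                   n_informative = 3,
                                   n_redundant   = 1,
                                   random_state  = 702)

# training and testing sets
x_tr, x_te, y_tr, y_te = train_test_split(x_toy,
                                          y_toy,
                                          test_size    = 0.25,
                                          random_state = 702,
                                          stratify     = y_toy)

# fitting a logistic regression model
toy_fit = LogisticRegression(solver       = 'lbfgs',
                             C            = 1,
                             random_state = 702).fit(x_tr, y_tr)

# AUC from hard class labels (a single threshold at 0.5)
auc_labels = roc_auc_score(y_true  = y_te,
                           y_score = toy_fit.predict(x_te))

# AUC from predicted probabilities of the positive class
toy_probs = toy_fit.predict_proba(x_te)[:, 1]
auc_probs = roc_auc_score(y_true  = y_te,
                          y_score = toy_probs)

print(f"""
AUC (hard labels)  : {round(auc_labels, ndigits = 3)}
AUC (probabilities): {round(auc_probs, ndigits = 3)}
""")
```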

In [None]:
# unpacking the confusion matrix
model_tn, \
model_fp, \
model_fn, \
model_tp = confusion_matrix(y_true = y_test, y_pred = model_pred).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {model_tn}
False Positives: {model_fp}
False Negatives: {model_fn}
True Positives : {model_tp}
""")
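
The four confusion matrix counts can be combined into rates that are often easier to compare across models. The sketch below uses hypothetical labels and predictions (not results from the housing model) to show the arithmetic for sensitivity, specificity, and precision.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions (illustrative only)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# unpacking the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true = y_true_toy,
                                  y_pred = y_pred_toy).ravel()

# rates derived from the four counts
sensitivity = tp / (tp + fn)  # share of actual positives that were found
specificity = tn / (tn + fp)  # share of actual negatives that were found
precision   = tp / (tp + fp)  # share of predicted positives that were correct

print(f"""
Sensitivity: {round(sensitivity, ndigits = 3)}
Specificity: {round(specificity, ndigits = 3)}
Precision  : {round(precision, ndigits = 3)}
""")
```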

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h3>Classification Modeling with Clusters</h3><br>
Let's assume we have also already conducted a k-means clustering analysis and concluded that we will retain five clusters.
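
For reference, one common way to inform the choice of cluster count is to compare inertia (within-cluster sum of squares) across candidate values of k. The sketch below is a minimal illustration on synthetic blobs from make_blobs, which is an assumption standing in for the rescaled principal components; the variable names are illustrative as well.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with five true groups (assumption)
x_blobs, _ = make_blobs(n_samples    = 300,
                        centers      = 5,
                        n_features   = 4,
                        random_state = 702)

# fitting k-means for a range of candidate cluster counts
k_values    = list(range(1, 9))
inertia_lst = []

for k in k_values:
    # n_init = 10 is used here so the sketch also runs on older scikit-learn versions
    k_model = KMeans(n_clusters   = k,
                     n_init       = 10,
                     random_state = 702)
    k_model.fit(x_blobs)
    inertia_lst.append(k_model.inertia_)

# collecting results
inertia_df = pd.DataFrame({'k'      : k_values,
                           'inertia': inertia_lst})

print(inertia_df.round(decimals = 1))
```

A candidate for k is the point where inertia stops dropping sharply as more clusters are added.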

In [None]:
# standardizing the data for clustering
pca_rescaled = standard_scaler(df = pd.DataFrame(data = housing_pca))


# INSTANTIATING a k-Means object with clusters
customers_k_pca = KMeans(n_clusters   = _____ ,
                         n_init       = 'auto',
                         random_state = 702   )


# fitting the object to the data
customers_k_pca.fit(pca_rescaled)


# converting the clusters to a DataFrame
customers_kmeans_pca = pd.DataFrame({'Cluster': customers_k_pca.labels_})


# checking cluster populations
print(customers_kmeans_pca.iloc[: , 0].value_counts())

In [None]:
# standardizing the data for clustering
pca_rescaled = standard_scaler(df = pd.DataFrame(data = housing_pca))


# INSTANTIATING a k-Means object with clusters
customers_k_pca = KMeans(n_clusters   = 5     ,
                         n_init       = 'auto',
                         random_state = 702   )


# fitting the object to the data
customers_k_pca.fit(pca_rescaled)


# converting the clusters to a DataFrame
customers_kmeans_pca = pd.DataFrame({'Cluster': customers_k_pca.labels_})


# checking cluster populations
print(customers_kmeans_pca.iloc[: , 0].value_counts())

<br>

In [None]:
# checking which observations belong to each cluster
customers_kmeans_pca.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
It is a best practice to treat clusters as a categorical feature and assume they have no inherent order. In other words, we should assume that each cluster is independent. This also fits with one of the assumptions of logistic regression.
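
To make the encoding concrete, the sketch below one-hot encodes a short, made-up Series of cluster labels (the name `toy_clusters` and its values are illustrative assumptions). With drop_first = True, a feature with three clusters becomes two columns, and rows where both columns are 0 belong to the dropped baseline cluster.

```python
import pandas as pd

# hypothetical cluster labels (illustrative only)
toy_clusters = pd.Series(data = [0, 2, 1, 2, 0, 1],
                         name = 'Cluster')

# one-hot encoding so that labels are treated as names, not magnitudes
toy_dummies = pd.get_dummies(data       = toy_clusters,
                             drop_first = True).astype(dtype = int)

# checking results
print(toy_dummies)
```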

In [None]:
# one-hot encoding cluster results
cluster_df = pd.get_dummies(data       = customers_kmeans_pca['Cluster'],
                            drop_first = True).astype(dtype = int)


# checking results
cluster_df.value_counts(normalize = False).sort_index(ascending = False)

<br>

In [None]:
# selecting x- and y-data
x_data = cluster_df
y_data = housing['Expensive_Property']


# training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size    = 0.25,
                                                    random_state = 702,
                                                    stratify     = y_data)

<br>

In [None]:
# INSTANTIATING a logistic regression model
model = sklearn.linear_model.LogisticRegression(solver       = 'lbfgs',
                                                C            = 1,
                                                random_state = 702)


# FITTING the training data
model_fit = model.fit(x_train, y_train)


# PREDICTING based on the testing set
model_pred = model_fit.predict(x_test) # predict_proba returns class probabilities


# checking results
train_acc = model_fit.score(x_train, y_train)
test_acc  = model_fit.score(x_test , y_test )
roc_score = roc_auc_score  (y_true  = y_test,
                            y_score = model_pred)


print(f"""
Train-Test Gap: {round(abs(train_acc - test_acc), ndigits = 3)}
Test AUC Score: {round(roc_score, ndigits = 3)}
""")

<br>

In [None]:
# storing cluster centers
centroids_pca = pd.DataFrame(data = customers_k_pca.cluster_centers_)


# checking cluster centers
centroids_pca.round(decimals = 2)

<br>

In [None]:
# unpacking the confusion matrix
model_tn, \
model_fp, \
model_fn, \
model_tp = confusion_matrix(y_true = y_test, y_pred = model_pred).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {model_tn}
False Positives: {model_fp}
False Negatives: {model_fn}
True Positives : {model_tp}
""")

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

~~~


 __     __                               _        _ _   _ 
 \ \   / /                              | |      (_) | | |
  \ \_/ /__  _   _   _ __ ___   __ _  __| | ___   _| |_| |
   \   / _ \| | | | | '_ ` _ \ / _` |/ _` |/ _ \ | | __| |
    | | (_) | |_| | | | | | | | (_| | (_| |  __/ | | |_|_|
    |_|\___/ \__,_| |_| |_| |_|\__,_|\__,_|\___| |_|\__(_)
     
     
                                                          
~~~
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>