<a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Democode/Mod_3_Lesson_2_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Selecting Model Features and Algorithms

In this demonstration, you’ll create a classification model based on a raw dataset and measure the precision and recall. You’ll refine the dataset by selecting and scaling features and assess the impact this has on the performance of the model. You'll also examine how the choice of algorithm can affect the results.

This demonstration uses the Bank Marketing dataset.

## Context
The goal is to find the best strategies for improving the bank's next marketing campaign. How can the financial institution make future campaigns more effective? To answer this, we analyze the bank's most recent marketing campaign and identify patterns that can inform future strategies.

## Attribute Information
### Customer Data
- Age (numeric)
- Job : type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
- Marital : marital status (categorical: 'divorced', 'married', 'single', 'unknown' ; note: 'divorced' means divorced or widowed)
- Education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
- Default: has credit in default? (categorical: 'no', 'yes', 'unknown')
- Housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
- Loan: has personal loan? (categorical: 'no', 'yes', 'unknown')

### Campaign Data
- Contact: contact communication type (categorical:
'unknown', 'cellular', 'telephone')
- Day: last contact day of the month (numeric)
- Month: last contact month of year (categorical: 'jan', 'feb', 'mar',
…, 'nov', 'dec')
- Duration: last contact duration, in seconds (numeric). Important
note: this attribute highly affects the output target (e.g., if
duration=0 then y='no'). Yet, the duration is not known before a call
is performed. Also, after the end of the call y is obviously known.
Thus, this input should only be included for benchmark purposes and
should be discarded if the intention is to have a realistic
predictive model.
- Campaign: number of contacts performed during this campaign and for
this client (numeric, includes last contact)
- Pdays: number of days that passed by after the client was last
contacted from a previous campaign (numeric; -1 means client was not
previously contacted)
- Previous: number of contacts performed before this campaign and for
this client (numeric)
- Poutcome: outcome of the previous marketing campaign (categorical:
'unknown', 'failure', 'other', 'success')
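
The note on Duration above can be turned into a small preprocessing step. The sketch below is illustrative only: the three-row `sample` frame, the `LEAKY_COLUMNS` list, and the `realistic_features` helper are hypothetical names that do not appear in the notebook. It assumes the column names `duration` and `y` used by the dataset.

```python
import pandas as pd

# Hypothetical three-row sample that reuses the dataset's column names
sample = pd.DataFrame({
    "age": [35, 52, 41],
    "duration": [0, 310, 95],
    "campaign": [1, 2, 4],
    "y": ["no", "yes", "no"],
})

# Columns whose values are only known once the call has finished
LEAKY_COLUMNS = ["duration"]

def realistic_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the predictors without the target and without leaky columns."""
    return frame.drop(columns=["y"] + LEAKY_COLUMNS)

realistic = realistic_features(sample)
print(list(realistic.columns))
```

The demonstration that follows keeps `duration` for benchmarking, as the attribute description allows; a helper like this would only be needed when building a model intended for use before a call is made.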

### Target Variable
- Y: has the client subscribed to a term deposit? (binary: 'yes', 'no')

## Source
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

# Review the data and build a default model

In [None]:
# Download the marketingdata.csv file

!wget 'https://raw.githubusercontent.com/cm-int/machine-learning-fundamentals/main/module_3/Democode/marketingdata.csv'

In [None]:
# Read the data from the CSV file

import numpy as np
import pandas as pd

marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info() # info() prints its summary directly and returns None, so it is not wrapped in print()
print('\n')
marketing

In [None]:
# Separate the class variable ('Y') from the features and convert the label (yes/no) into a numeric value (1/0)

from sklearn.preprocessing import LabelEncoder

marketing_class = marketing['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)

print(marketing_class)

In [None]:
# Remove the class label from the list of features, and replace the categorical variables with numeric dummy values

marketing_features = marketing.drop(['y'], axis=1)
marketing_features = pd.get_dummies(marketing_features)

marketing_features

In [None]:
# Split the data into test and training datasets, build a K-Nearest Neighbors model, and test the precision, recall, and AUC

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

features_train, features_test, class_train, class_test = train_test_split(marketing_features, marketing_class, test_size=0.33, random_state=13)

knn_model = KNeighborsClassifier() # Select default hyperparameters (n_neighbors=5)
_ = knn_model.fit(features_train, class_train)

test_results = knn_model.predict(features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results, display_labels=['No', 'Yes'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

How good is this model at predicting a positive outcome?

*Answer: Poor. The model misses most positive outcomes and reports them as negatives. Additionally, it misclassifies more negative outcomes as positives than it correctly identifies positive outcomes.*

*This is the baseline for further investigation.*
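
To make the link between the confusion matrix and the two scores concrete, here is a minimal worked example. The counts are hypothetical and are not taken from the notebook's output; they were chosen only to mirror the pattern described in the answer (more false positives than true positives, and most positives missed).

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Compute positive-class precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for the 'Yes' class
true_positives = 30    # 'Yes' customers predicted as 'Yes'
false_positives = 45   # 'No' customers predicted as 'Yes'
false_negatives = 120  # 'Yes' customers predicted as 'No'

precision, recall = precision_recall(true_positives, false_positives, false_negatives)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```

With these counts, precision is 30 / 75 and recall is 30 / 150, so a model can look weak on both measures even when most of its 'No' predictions are correct.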

# Perform a SHAP analysis to examine which features have the most effect on the predictions

In [None]:
# Install the SHAP module

!pip install shap

In [None]:
# Create a SHAP explainer to analyze predictions made using the model
# NOTE: This step takes 5 minutes to run

import shap
import random

items = random.sample(list(features_train.index), 50) # The analysis is restricted to 50 random observations to save time
explainer_train = features_train[features_train.index.isin(items)]

explainer = shap.Explainer(knn_model.predict, explainer_train)
values = explainer(explainer_train)

In [None]:
# Display the results as a summary plot

shap.summary_plot(shap_values=values, features=explainer_train, plot_type="bar")

In [None]:
# The violin plot indicates how the features are correlated with the predictions

shap.summary_plot(shap_values=values, features=explainer_train, plot_type="violin")

**Question:**

What does this analysis indicate?

*Answer: Most of the features have little to no bearing on the predictions. It would appear that the most important features are duration, balance, day, age, and pdays, although the correlations are on the 'blue' (low) side. The Day feature is a surprise: why should making contact on a particular day of the month be important? Further analysis is necessary to verify these findings.*

# Perform a multivariate search to find the most important features for the model

In [None]:
# Search over feature-set sizes using the SelectKBest function, which ranks each feature individually with an ANOVA F-test (f_classif)
# NOTE: This step takes 5 or 6 minutes to run

from sklearn.feature_selection import SelectKBest, f_classif
import sklearn.metrics as metrics

# Iterate over the best models with different sized feature sets 
# and calculate the precision and recall of each model

pr_scores = []
rc_scores = []
for k in range(1, len(features_train.columns)-1):
    features_selector = SelectKBest(score_func=f_classif, k=k)
    features_selector = features_selector.fit(features_train, class_train)
    print(features_selector.get_feature_names_out())
    transformed_train = features_selector.transform(features_train)
    transformed_test = features_selector.transform(features_test)
    model = KNeighborsClassifier()
    model.fit(transformed_train, class_train)
    predictions = model.predict(transformed_test)
    pr_score = metrics.precision_score(class_test, predictions, zero_division=0, average='macro')
    pr_scores.append(pr_score)
    rc_score = metrics.recall_score(class_test, predictions, zero_division=0, average='macro')
    rc_scores.append(rc_score)

In [None]:
# Plot the results

import matplotlib.pyplot as plt

plt.figure(figsize=(40, 10))
plt.plot(range(1, len(features_train.columns)-1), pr_scores, label='Precision')
plt.plot(range(1, len(features_train.columns)-1), rc_scores, label='Recall')
plt.xlabel('\nBest K Features', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.xticks(range(1, len(features_train.columns)-1))
plt.ylabel('Precision/Recall Scores', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.legend(prop={'size': 20})
plt.show()

In [None]:
# Find the best precision score to minimize the false positive rate 
# (use recall to minimize the false negative rate)

best_score = max(pr_scores)
num_features = int(np.argmax(pr_scores)) # Index of the first best score; works whether the scores are Python or NumPy floats
features_selector = SelectKBest(score_func=f_classif, k=num_features+1)
features_selector = features_selector.fit(features_train, class_train)
print(f'Best features: {features_selector.get_feature_names_out()}')

**Question:**

How do these findings compare to the SHAP analysis?

*Answer: Day has dropped out of the list, but certain types of job, whether the customer has a housing loan, the method of contact, the month, and the outcome of the previous campaign now appear to be important.*
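
One reason the two analyses can disagree is that `SelectKBest` with `f_classif` scores each feature on its own against the class label, independently of any model. The following sketch shows that behavior on synthetic data. The `signal` and `noise` columns are hypothetical and unrelated to the marketing dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)

# Hypothetical features: 'signal' shifts with the class, 'noise' does not
signal = labels * 2.0 + rng.normal(size=100)
noise = rng.normal(size=100)
X = np.column_stack([noise, signal])

# Keep the single feature with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=1).fit(X, labels)
f_scores = selector.scores_
selected = selector.get_support()

print(f"F-scores (noise, signal): {f_scores}")
print(f"Selected mask: {selected}")
```

Because each score is computed one feature at a time, this approach does not account for interactions between features or for how a particular classifier uses them, which is where a model-based method such as SHAP can give a different ranking.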

# Build and test a model using the *'best'* set of features

In [None]:
# Columns reported as the best features by the search above
best_columns = ['duration', 'pdays', 'previous', 'job_retired', 'job_student', 'housing_no', 'housing_yes',
                'contact_cellular', 'contact_unknown', 'month_dec', 'month_mar', 'month_may', 'month_oct',
                'month_sep', 'poutcome_success', 'poutcome_unknown']

best_features_train = features_train[best_columns]
best_features_test = features_test[best_columns]

In [None]:
knn_model = KNeighborsClassifier() 
_ = knn_model.fit(best_features_train, class_train)

test_results = knn_model.predict(best_features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results)

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

Has the model improved?

*Answer: Slightly. Precision is better, but still not good. Recall remains poor.*

# Examine how scaling affects the choice of features

The numeric features have varying scales, so it may be worth standardizing them to see how this affects the model.
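
Scale matters for K-Nearest Neighbors because the classifier relies on distances between observations. The sketch below uses three hypothetical customers (the values are invented for illustration) to show how a wide-ranging column can dominate the Euclidean distance until the columns are standardized.

```python
import numpy as np

# Hypothetical customers: [age in years, account balance]
customers = np.array([
    [30.0, 1000.0],
    [60.0, 1010.0],  # very different age, almost the same balance
    [31.0, 1500.0],  # almost the same age, different balance
])

def distances_from_first(points: np.ndarray) -> np.ndarray:
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

raw = distances_from_first(customers)

# Standardize each column: subtract its mean and divide by its standard deviation
scaled_customers = (customers - customers.mean(axis=0)) / customers.std(axis=0)
scaled = distances_from_first(scaled_customers)

print(f"Raw distances:    {raw}")
print(f"Scaled distances: {scaled}")
```

On the raw values the balance difference swamps the age difference, so the second customer looks far closer than the third. After standardization both columns contribute on a similar footing and the two distances become comparable.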

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Return to the original dataset
marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info() # info() prints its summary directly and returns None
print('\n')

# Separate the class variable ('Y') from the features and convert the label (yes/no) into a numeric value (1/0)
marketing_class = marketing['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)
print(marketing_class)

# Remove the class label from the list of features
marketing_features = marketing.drop(['y'], axis=1)
print(marketing_features)

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(marketing_features, marketing_class, test_size=0.33, random_state=13)

In [None]:
# Create a pipeline to perform encoding of the categorical features and scaling of the numeric features

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config

numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),\
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = \
    ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features), \
                       ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create the pipeline and fit a K-Nearest Neighbors model
pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                 ('estimator', KNeighborsClassifier())])

pipe.fit(features_train, class_train)

# Display the details of the pipe
set_config(display="diagram")
pipe

In [None]:
# Display the names of the features generated by the pipeline

numeric_feature_names = pipe['preprocessor'].transformers_[0][1]['numeric_scaler'].get_feature_names_out(numeric_features)
print(f'Numeric features: {numeric_feature_names}\n')

categorical_feature_names = pipe['preprocessor'].transformers_[1][1]['categorical_encoder'].get_feature_names_out(categorical_features)
print(f'Categorical features: {categorical_feature_names}')

In [None]:
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make test predictions
predictions = pipe.predict(features_test)

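# Check the precision and recall
# NOTE: average='macro' averages the scores of both classes, so these values are not directly
# comparable with the positive-class precision and recall printed for the earlier models
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')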

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

Has the model improved?

*Answer: Precision and recall are better. AUC is still low.*
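
When comparing these figures with the earlier cells, note that the cell above passes `average='macro'`, whereas the first two models were scored with the default positive-class setting. The following sketch, using hypothetical labels, shows how far apart the two conventions can be on an imbalanced problem.

```python
from sklearn.metrics import precision_score

# Hypothetical imbalanced labels: eight 'No' (0) and two 'Yes' (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Default (binary): precision of the positive class only
positive_precision = precision_score(y_true, y_pred, zero_division=0)

# Macro: unweighted mean of the precision of class 0 and class 1
macro_precision = precision_score(y_true, y_pred, zero_division=0, average='macro')

print(f"Positive-class precision: {positive_precision}")
print(f"Macro-averaged precision: {macro_precision}")
```

Here the positive-class precision is 1 / 2, while the macro average also includes the 7 / 8 precision of the majority class, so the macro figure is noticeably higher for the same predictions. Using one convention throughout makes model-to-model comparisons easier to interpret.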

# Perform SHAP analysis on the new model

In [None]:
# Run the pipeline again without the estimator to get the transformed training data

transformed_features_train = pd.DataFrame(pipe[0].transform(features_train))
transformed_features_train

In [None]:
# The feature names have been lost, so reinstate them from the lists seen earlier

new_names = np.append(numeric_feature_names, categorical_feature_names)
transformed_features_train.columns = new_names
transformed_features_train

In [None]:
# Perform SHAP analysis over the transformed training data
# NOTE: Allow 5 or 6 minutes for this step to complete
# NOTE 2: Ignore the warnings about the classifier not being fitted with feature names

import shap
import random

items = random.sample(list(transformed_features_train.index), 50) # As before, the analysis is restricted to 50 random observations to save time
explainer_train = transformed_features_train[transformed_features_train.index.isin(items)]

explainer = shap.Explainer(pipe['estimator'].predict, explainer_train) # Perform the analysis using the 'estimator' object from the pipeline
values = explainer(explainer_train)

In [None]:
# Display the results

shap.summary_plot(shap_values=values, features=explainer_train, plot_type="bar")
shap.summary_plot(shap_values=values, features=explainer_train, plot_type="violin")

**Question:**

What does this analysis tell us?

*Answer: Scaling the numeric features results in more features having a greater influence on the model predictions. Day still seems to be relevant!*

# Perform multivariate forward selection with the new model and assess whether this change improves predictions

In [None]:
# Use the SequentialFeatureSelector to find the best set of features for the model
# NOTE: This step takes approximately 6 minutes to run

from sklearn.feature_selection import SequentialFeatureSelector

features_to_select = 5

sfs_forward = SequentialFeatureSelector(pipe['estimator'], n_features_to_select=features_to_select, direction="forward")
sfs_forward.fit(transformed_features_train, class_train)

print(f"Features selected by forward sequential selection: {sfs_forward.get_feature_names_out()}")

In [None]:
# Transform the test data and rename the columns

transformed_features_test = pd.DataFrame(pipe[0].transform(features_test))
transformed_features_test.columns = new_names
transformed_features_test

In [None]:
# Build a new model using only the selected features

reduced_features_train = transformed_features_train[sfs_forward.get_feature_names_out()]
reduced_features_test = transformed_features_test[sfs_forward.get_feature_names_out()]

knn_model = KNeighborsClassifier() # Create a new classifier
_ = knn_model.fit(reduced_features_train, class_train)

test_results = knn_model.predict(reduced_features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results, display_labels=['No', 'Yes'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

How does this model fare?

*Answer: Focussing on a smaller number of columns reduced the precision and recall, so this might not be the best strategy in this case*

# Perform detailed multivariate forward selection and select the features more carefully

In [None]:
# Perform another search over feature-set sizes using the SelectKBest function and evaluate the best mix of features for precision and recall
# NOTE: This step takes 5 or 6 minutes to run

from sklearn.feature_selection import SelectKBest, f_classif
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Iterate over the best models with different sized feature sets 
# and calculate the precision and recall of each model

pr_scores = []
rc_scores = []
for k in range(1, len(transformed_features_train.columns)-1):
    features_selector = SelectKBest(score_func=f_classif, k=k)
    features_selector = features_selector.fit(transformed_features_train, class_train)
    print(features_selector.get_feature_names_out())
    transformed_train = features_selector.transform(transformed_features_train)
    transformed_test = features_selector.transform(transformed_features_test)
    model = KNeighborsClassifier()
    model.fit(transformed_train, class_train)
    predictions = model.predict(transformed_test)
    pr_score = metrics.precision_score(class_test, predictions, zero_division=0, average='macro')
    pr_scores.append(pr_score)
    rc_score = metrics.recall_score(class_test, predictions, zero_division=0, average='macro')
    rc_scores.append(rc_score)

# Plot the results

plt.figure(figsize=(40, 10))
plt.plot(range(1, len(transformed_features_train.columns)-1), pr_scores, label='Precision')
plt.plot(range(1, len(transformed_features_train.columns)-1), rc_scores, label='Recall')
plt.xlabel('\nBest K Features', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.xticks(range(1, len(transformed_features_train.columns)-1))
plt.ylabel('Precision/Recall Scores', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.legend(prop={'size': 20})
plt.show()

In [None]:
# Find the features for the best precision score to minimize the false positive rate 
# (use recall to minimize the false negative rate)

best_score = max(pr_scores)
num_features = int(np.argmax(pr_scores)) # Index of the first best score; works whether the scores are Python or NumPy floats
features_selector = SelectKBest(score_func=f_classif, k=num_features+1)
features_selector = features_selector.fit(transformed_features_train, class_train)
print(f'Best features: {features_selector.get_feature_names_out()}')

In [None]:
# Build another new model using only the selected features

reduced_features_train = transformed_features_train[features_selector.get_feature_names_out()]
reduced_features_test = transformed_features_test[features_selector.get_feature_names_out()]

knn_model = KNeighborsClassifier() # Create a new classifier
_ = knn_model.fit(reduced_features_train, class_train)

test_results = knn_model.predict(reduced_features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results, display_labels=['No', 'Yes'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

Is this model an improvement?

*Answer: Selecting features based on model precision yields an improvement, and AUC is increasing, but the model is still not as good as the one that uses every feature.*
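
A caveat when reading the AUC values: the cells in this notebook pass hard 0/1 predictions to `RocCurveDisplay.from_predictions`, which gives the curve a single operating point between its two corners. The sketch below uses hypothetical labels and probabilities, with a small pure-Python `pairwise_auc` helper written for this illustration (it is not a library function), to show that the same model can have a different AUC when scored on probabilities.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical positive-class probabilities (multiples of 0.2, as 5 neighbors would give)
y_prob = [0.0, 0.2, 0.4, 0.6, 0.4, 0.6, 0.8, 1.0]

# Hard labels obtained by thresholding the probabilities at 0.5
y_label = [1 if p > 0.5 else 0 for p in y_prob]

auc_from_labels = pairwise_auc(y_true, y_label)
auc_from_probabilities = pairwise_auc(y_true, y_prob)

print(f"AUC from hard labels:   {auc_from_labels}")
print(f"AUC from probabilities: {auc_from_probabilities}")
```

In this example the hard labels give 0.75 and the probabilities give 0.875. In the notebook, the second column of `knn_model.predict_proba(...)` could be passed in place of the hard predictions to draw the fuller curve.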

# Try using feature extraction as an alternative strategy

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Return to the original dataset again
marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info() # info() prints its summary directly and returns None
print('\n')

# Separate the class variable ('Y') from the features and convert the label (yes/no) into a numeric value (1/0)
marketing_class = marketing['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)
print(marketing_class)

# Remove the class label from the list of features
marketing_features = marketing.drop(['y'], axis=1)
marketing_features = pd.get_dummies(marketing_features)
print(marketing_features)

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(marketing_features, marketing_class, test_size=0.33, random_state=13)

In [None]:
from sklearn.decomposition import PCA

# Perform PCA analysis
pca = PCA()
pca.fit(features_train)

print(pca.explained_variance_ratio_)

In [None]:
import matplotlib.pyplot as plt

# Plot the results
plt.figure(figsize=(10,10))
x = np.arange(1, len(pca.explained_variance_)+1)
plt.bar(x, pca.explained_variance_ratio_)
plt.xlabel('Principal Components', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.ylabel('Proportion of Explained Variances', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.show()

**Question:**

Which component(s) account for the most variance?

*Answer: Component 1 accounts for 99.2% of the variance. This component dwarfs the variance of the other components. Try building a model with this single component.*
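
One plausible explanation for such a dominant first component (an assumption, since the notebook does not verify it) is that PCA was fitted on unscaled data, where a column with a very wide range contributes almost all of the total variance. The sketch below illustrates the effect with two synthetic, independent columns whose names and spreads are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical columns: one with a large spread, one with a small spread
wide = rng.normal(0, 3000, size=500)    # balance-like values
narrow = rng.normal(40, 10, size=500)   # age-like values
X = np.column_stack([wide, narrow])

# PCA on the raw values versus PCA on standardized values
raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(f"Unscaled: {raw_ratio}")
print(f"Scaled:   {scaled_ratio}")
```

On the raw values the first component captures nearly all of the variance simply because of the units involved. After standardization the variance is shared far more evenly, which is why a high explained-variance ratio on unscaled data says little about predictive value.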

In [None]:
# Construct another model using the first principal component

pca_data = pd.DataFrame(pca.transform(marketing_features))
first_component_data = pca_data.iloc[:, 0:1]

# Split the data into test and training datasets
pca_train, pca_test, pca_class_train, pca_class_test = train_test_split(first_component_data, marketing_class, test_size=0.33, random_state=13)

# Build the model
pca_knn_model = KNeighborsClassifier()
pca_knn_model.fit(pca_train, pca_class_train)
predictions = pca_knn_model.predict(pca_test)

# Check the precision and recall
pr_score = precision_score(pca_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(pca_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(pca_class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(pca_class_test, predictions) # Plot this model's predictions, not the earlier test_results

**Question:**

Is this model an improvement?

*Answer: Precision and Recall have dropped, although AUC is climbing (very slowly). Compacting the predictive power into a single component was probably optimistic.*

In [None]:
# Try again with the first three principal components

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Return to the original dataset again
marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info() # info() prints its summary directly and returns None
print('\n')

# Separate the class variable ('Y') from the features and convert the label (yes/no) into a numeric value (1/0)
marketing_class = marketing['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)
print(marketing_class)

# Remove the class label from the list of features
marketing_features = marketing.drop(['y'], axis=1)
marketing_features = pd.get_dummies(marketing_features)
print(marketing_features)

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(marketing_features, marketing_class, test_size=0.33, random_state=13)

In [None]:
pca_data = pd.DataFrame(pca.transform(marketing_features))
first_component_data = pca_data.iloc[:, 0:3]

# Split the data into test and training datasets
pca_train, pca_test, pca_class_train, pca_class_test = train_test_split(first_component_data, marketing_class, test_size=0.33, random_state=13)

# Build the model
pca_knn_model = KNeighborsClassifier()
pca_knn_model.fit(pca_train, pca_class_train)
predictions = pca_knn_model.predict(pca_test)

# Check the precision and recall
pr_score = precision_score(pca_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(pca_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(pca_class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(pca_class_test, predictions) # Plot this model's predictions, not the earlier test_results

**Question:**

Is this model an improvement?

*Answer: Precision and Recall have improved. Maybe there is more information in the second and third components than their share of the variance suggests.*

# Incorporate scaling with PCA

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Return to the original dataset again
marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info() # info() prints its summary directly and returns None
print('\n')

# Separate the class variable ('Y') from the features and convert the label (yes/no) into a numeric value (1/0)
marketing_class = marketing['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)
print(marketing_class)

# Remove the class label from the list of features
marketing_features = marketing.drop(['y'], axis=1)
print(marketing_features)

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(marketing_features, marketing_class, test_size=0.33, random_state=13)

In [None]:
# Try scaling before performing PCA.
# Construct a new pipeline that includes feature extraction

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config
from sklearn.decomposition import PCA

numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),\
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = \
    ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features), \
                       ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create the pipeline with PCA and fit a K-Nearest Neighbors model
pca_pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                     ('extractor', PCA(n_components=1)), # Only generate the first PCA component
                     ('estimator', KNeighborsClassifier())])

pca_pipe.fit(features_train, class_train)

# Display the details of the pipe
set_config(display="diagram")
pca_pipe

In [None]:
# Evaluate the model

predictions = pca_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions) # Plot this model's predictions, not the earlier test_results

**Question:**

How does this model compare with the earlier ones?

*Answer: Precision and Recall are back to where they were with a single component without scaling.*

In [None]:
# Try again with three principal components

pca_pipe.set_params(extractor__n_components=3) # Address the PCA step by its name in the pipeline

pca_pipe.fit(features_train, class_train)
predictions = pca_pipe.predict(features_test)
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions) # Plot this model's predictions, not the earlier test_results

**Question:**

How about now?

*Answer: Precision and Recall have improved, but are still only comparable to the model without scaling.*

In [None]:
# Try multivariate feature selection to ascertain how selecting multiple PCA components affects the model
# NOTE: This step takes 5 or 6 minutes to run

from sklearn.feature_selection import SelectKBest, f_classif
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Run the pipeline again to generate all PCA components
pca_pipe.set_params(extractor__n_components=None) # None keeps every component
pca_pipe.fit(features_train, class_train)

# Generate the transformed data using the pipeline
pca_data = pd.DataFrame(pca_pipe[:-1].transform(marketing_features))

# Rename the columns returned by PCA analysis - the existing column names are numeric which can cause problems. Prepend each name with an 'x'
pca_column_names = [f'x{i}' for i in pca_data.columns]
pca_data = pca_data.set_axis(pca_column_names, axis=1)

# Split the data into training and test datasets
pca_train, pca_test, pca_class_train, pca_class_test = train_test_split(pca_data, marketing_class, test_size=0.33, random_state=13)
num_rows, num_cols = pca_data.shape

# Iterate over the best models with different sized feature sets 
# and calculate the precision and recall of each model

pr_scores = []
rc_scores = []
for k in range(1, num_cols-1):
    features_selector = SelectKBest(score_func=f_classif, k=k)
    features_selector = features_selector.fit(pca_train, pca_class_train)
    print(features_selector.get_feature_names_out())
    transformed_train = features_selector.transform(pca_train)
    transformed_test = features_selector.transform(pca_test)
    model = KNeighborsClassifier()
    model.fit(transformed_train, pca_class_train) # Use the labels from the same split as pca_train
    predictions = model.predict(transformed_test)
    pr_score = metrics.precision_score(pca_class_test, predictions, zero_division=0, average='macro')
    pr_scores.append(pr_score)
    rc_score = metrics.recall_score(pca_class_test, predictions, zero_division=0, average='macro')
    rc_scores.append(rc_score)

# Plot the results

plt.figure(figsize=(40, 10))
plt.plot(range(1, num_cols-1), pr_scores, label='Precision')
plt.plot(range(1, num_cols-1), rc_scores, label='Recall')
plt.xlabel('\nBest K Features', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.xticks(range(1, num_cols-1))
plt.ylabel('Precision/Recall Scores', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.legend(prop={'size': 20})
plt.show()

In [None]:
# Find the features for the best precision score to minimize the false positive rate 
# (use recall to minimize the false negative rate)

best_score = max(pr_scores)
num_features = int(np.argmax(pr_scores))  # Index of the first occurrence of the best precision score
features_selector = SelectKBest(score_func=f_classif, k=num_features+1)
features_selector = features_selector.fit(pca_train, pca_class_train)
print(f'Best features: {features_selector.get_feature_names_out()}')

In [None]:
# Construct another model using the highlighted principal components

pca_knn_model = KNeighborsClassifier()
pca_knn_model.fit(pca_train[features_selector.get_feature_names_out()], pca_class_train)
predictions = pca_knn_model.predict(pca_test[features_selector.get_feature_names_out()])

# Check the precision and recall
pr_score = metrics.precision_score(pca_class_test, predictions, zero_division=0, average='macro')
rc_score = metrics.recall_score(pca_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(pca_class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(pca_class_test, predictions)

**Question:**

Is this an improvement?

*Answer: Precision and Recall have improved significantly. They are back to where they were prior to PCA!*
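
One plausible reason the selection step helps is that PCA orders components by the variance they explain, not by how well they separate the classes. The following is a minimal sketch of that idea, assuming a synthetic dataset from `make_classification` as a stand-in for the marketing data. For brevity it fits the selector on the whole dataset; in the demonstration the selector is fitted on the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the marketing data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=13)

# Scale the features, then keep every principal component (n_components=None)
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=None).fit(X_scaled)
components = pca.transform(X_scaled)

# Rank the components by ANOVA F-score against the class label and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3).fit(components, y)
selected = np.flatnonzero(selector.get_support())

print('Explained variance ratio:', np.round(pca.explained_variance_ratio_, 3))
print('Components chosen by SelectKBest:', selected)
```

If the chosen indices are not simply 0, 1, and 2, then the components with the highest variance are not the ones that carry the most class information in this synthetic data.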

#Compare PCA to t-SNE

In [None]:
# NOTE: Only do this part if time allows, otherwise go straight to UMAP

import numpy as np
import pandas as pd
import random
from sklearn.preprocessing import LabelEncoder

# Return to the original dataset again
marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info()  # info() prints its own summary and returns None, so don't wrap it in print()
print('\n')

items = random.sample(list(marketing.index), 10000) # The analysis is restricted to the 10000 random observations to save time; TSNE analysis is very resource intensive
marketing_subset = marketing[marketing.index.isin(items)]

# Separate the class variable ('y') from the features and convert the label (yes/no) into a numeric value (1/0)
marketing_class = marketing_subset['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)
print(marketing_class)

# Remove the class label from the list of features and encode the categorical features
marketing_features = marketing_subset.drop(['y'], axis=1)
marketing_features = pd.get_dummies(marketing_features)
print(marketing_features)

In [None]:
# NOTE: scikit-learn's TSNE has no transform() method, so it cannot be used as an intermediate step in an sklearn pipeline
# NOTE 2: Allow 7 minutes for this step

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Perform t-SNE analysis
tsne = TSNE(n_components=3) # Reduce the dataset to 3 dimensions
transformed_data = tsne.fit_transform(marketing_features)
tsne_features = transformed_data[:, 0:3]

# Split the data into test and training datasets
tsne_features_train, tsne_features_test, tsne_class_train, tsne_class_test = train_test_split(tsne_features, marketing_class, test_size=0.33, random_state=13)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Build a model using the new dataset

tsne_knn_model = KNeighborsClassifier()
tsne_knn_model.fit(tsne_features_train, tsne_class_train)
predictions = tsne_knn_model.predict(tsne_features_test)

# Check the precision and recall
pr_score = precision_score(tsne_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(tsne_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(tsne_class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(tsne_class_test, predictions)

**Question:**

How does performance compare to PCA?

*Answer: Precision and Recall are not as good. Tuning the perplexity and learning rate might help, but each t-SNE run is time-consuming.*
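
The tuning mentioned in the answer could be sketched as a small sweep over `perplexity`. The example below is illustrative only: it assumes a small synthetic dataset so that it runs in seconds, and the perplexity values and the fixed `learning_rate=200.0` are arbitrary choices, not recommendations for the marketing data.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the marketing subset (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=13)

results = {}
for perplexity in (5, 30):
    # TSNE has no transform(), so the whole dataset is embedded before splitting
    tsne = TSNE(n_components=2, perplexity=perplexity, learning_rate=200.0,
                init='pca', random_state=13)
    embedded = tsne.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(
        embedded, y, test_size=0.33, random_state=13)

    model = KNeighborsClassifier().fit(X_train, y_train)
    predictions = model.predict(X_test)

    precision = precision_score(y_test, predictions, zero_division=0, average='macro')
    recall = recall_score(y_test, predictions, zero_division=0, average='macro')
    results[perplexity] = (precision, recall)
    print(f'perplexity={perplexity}: precision={precision:.3f}, recall={recall:.3f}')
```

On the 10,000-row subset each iteration takes minutes rather than seconds, which is why a sweep like this is rarely exhaustive.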

#Try UMAP

In [None]:
!pip install umap-learn

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Return to the original dataset again
marketing = pd.read_csv("marketingdata.csv", sep=';')
marketing.info()  # info() prints its own summary and returns None, so don't wrap it in print()
print('\n')

# Separate the class variable ('y') from the features and convert the label (yes/no) into a numeric value (1/0)
marketing_class = marketing['y']
cat_encoder = LabelEncoder().fit(marketing_class)
marketing_class = cat_encoder.transform(marketing_class)

# Remove the class label from the list of features
marketing_features = marketing.drop(['y'], axis=1)

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(marketing_features, marketing_class, test_size=0.33, random_state=13)

In [None]:
# Create a pipeline to perform encoding of the categorical features and scaling of the numeric features

import umap.umap_ as umap
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config

numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),\
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = \
    ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features), \
                       ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create the pipeline and fit a K-Nearest Neighbours model
umap_pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                      ('reducer', umap.UMAP(n_components=5)),
                      ('estimator', KNeighborsClassifier())])

umap_pipe.fit(features_train, class_train)

# Display the details of the pipe
set_config(display="diagram")
umap_pipe

In [None]:
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make predictions
predictions = umap_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

How does performance compare to PCA and t-SNE?

*Answer: Precision and Recall are on a par with (possibly slightly better than) the scaled version without feature extraction. They are better than with PCA and much better than with t-SNE.*

#Try different algorithms - Logistic Regression and Random Forest

In [None]:
from sklearn.linear_model import LogisticRegression

# Swap the K-Nearest Neighbors estimator in the UMAP pipeline for a Logistic Regression estimator

umap_pipe.set_params(**{'estimator': LogisticRegression(max_iter=1000, solver="lbfgs", tol=1e-3)})
umap_pipe

In [None]:
# Evaluate the model

from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Fit the model 
umap_pipe.fit(features_train, class_train)

# Make predictions
predictions = umap_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

How does this model compare with those seen so far?

*Answer: This model is the best yet for precision and is OK for recall. This is a significant improvement on the initial model, which had a precision of 48.7% and a recall of 27.1%.*
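
The same comparison can be repeated for any estimator by swapping the final pipeline step with `set_params`. The sketch below assumes a synthetic, mildly imbalanced dataset and a simple scaling pipeline without UMAP; the `candidates` dictionary and its keys are hypothetical names chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the marketing data (an assumption for illustration only)
X, y = make_classification(n_samples=1000, n_features=12, n_informative=5,
                           weights=[0.85, 0.15], random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=13, stratify=y)

# One pipeline whose final 'estimator' step is replaced on each iteration
pipe = Pipeline([('scaler', StandardScaler()),
                 ('estimator', KNeighborsClassifier())])

candidates = {
    'knn': KNeighborsClassifier(),
    'logreg': LogisticRegression(max_iter=1000),
    'forest': RandomForestClassifier(random_state=13),
}

scores = {}
for name, estimator in candidates.items():
    pipe.set_params(estimator=estimator)  # Swap the final step, keep the preprocessing
    pipe.fit(X_train, y_train)
    predictions = pipe.predict(X_test)
    scores[name] = {
        'precision': precision_score(y_test, predictions, zero_division=0, average='macro'),
        'recall': recall_score(y_test, predictions, zero_division=0, average='macro'),
    }
    print(f"{name}: precision={scores[name]['precision']:.3f}, recall={scores[name]['recall']:.3f}")
```

Keeping the preprocessing fixed while only the estimator changes makes it easier to attribute any difference in the scores to the algorithm itself.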

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define a new pipeline that selects all features and doesn't scale the numeric data (Tree models are not sensitive to scaling)
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

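# One-hot encode the categorical columns and pass the numeric columns through unscaled.
# remainder='passthrough' is required: the default (remainder='drop') would discard the numeric features
pipeline_preprocessor = \
    ColumnTransformer([('categorical_preprocessor', categorical_preprocessor, categorical_features)],
                      remainder='passthrough')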

# Create the pipeline and fit a Random Forest model
forest_pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                        ('estimator', RandomForestClassifier())])

forest_pipe.fit(features_train, class_train)

In [None]:
# Evaluate the model

from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make predictions
predictions = forest_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

# Try Stacking to Reduce Bias and Variance across Multiple Algorithms

In [None]:
# Create a stack with pipelines for Random Forest, UMAP, and Naive Bayes estimators. Use Logistic Regression to aggregate the results
# NOTE: Allow 5 minutes to run this step

from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB

# Create a new preprocessor for Naive Bayes that one-hot encodes the categorical data and scales the numeric features
numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),\
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = \
    ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features), \
                       ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create a similar pipeline for Naive Bayes
nb_pipe =  Pipeline([('preprocessor', pipeline_preprocessor),
                     ('estimator', GaussianNB())])

# Create a stack comprising the Random Forest and UMAP pipelines from the previous tasks (the UMAP pipeline
# now ends in the Logistic Regression estimator set earlier) and the new Naive Bayes pipeline.
# The default aggregator at the top of the stack uses Logistic Regression

estimators = [
    ("rf", forest_pipe),
    ("nb", nb_pipe),
    ("umap", umap_pipe)
]

sc = StackingClassifier(estimators=estimators)

sc.fit(features_train, class_train)

In [None]:
# Evaluate the stacked model

from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make predictions
predictions = sc.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['No', 'Yes'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

# Conclusion:

Feature selection, feature extraction, scaling, and algorithm choice are all important factors to consider when building a machine learning model. It is vitally important that you test and verify the decisions you make. Additionally, you should always consider the possibility of ensemble methods, such as stacking, to help reduce the shortcomings of a particular algorithm when classifying your data.
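
As a compact reminder of how stacking is wired together, here is a minimal sketch on synthetic data. The base estimators and their settings are assumptions chosen for speed, not the pipelines used in this demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the marketing data (an assumption for illustration only)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

# Base estimators with different inductive biases; KNN is scaled, the others are not
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=13)),
    ('nb', GaussianNB()),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

# The final estimator learns from cross-validated predictions of the base estimators
stack = StackingClassifier(estimators=base_estimators,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f'Stacked accuracy on the held-out data: {accuracy:.3f}')
```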

In this demonstration, the focus has been on raising the Precision and Recall, with scant regard to AUC. This is partly due to the dataset, which has a highly imbalanced distribution for the Yes/No labels (No outnumbers Yes by a factor of more than 25). You'll examine how to address this issue in a later lesson in this module.
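
One caveat about the ROC plots in this notebook: the cells pass hard class predictions to `RocCurveDisplay.from_predictions`, which yields a curve with a single operating point. AUC is normally computed from scores or probabilities. The following sketch, which assumes a synthetic imbalanced dataset and a plain Logistic Regression model, shows how to check the class balance and how the two ways of computing AUC can differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the marketing data (an assumption for illustration only)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.95, 0.05], random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=13, stratify=y)

# Count the observations in each class to quantify the imbalance
counts = np.bincount(y_train)
print(f'Training class counts (No, Yes): {counts}')

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard labels collapse every observation to 0 or 1 at a single threshold
hard_labels = model.predict(X_test)
# Probabilities of the positive class keep the full ranking of observations
probabilities = model.predict_proba(X_test)[:, 1]

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_scores = roc_auc_score(y_test, probabilities)
print(f'AUC from hard labels:   {auc_from_labels:.3f}')
print(f'AUC from probabilities: {auc_from_scores:.3f}')
```

For estimators that expose `predict_proba`, passing the positive-class probabilities to `RocCurveDisplay.from_predictions` in the same way gives a full curve rather than a single point.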

It is always important to understand the limitations of a model. Different feature selection and extraction strategies can have an impact (positive and negative) on the model, as can the choice of algorithm.