<a href="https://colab.research.google.com/github/cm-int/classification_models/blob/main/module_6/Democode/Mod_6_Lesson_5_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Selecting Model Features and Algorithms

In this demonstration, you’ll create a classification model based on a raw dataset and measure the precision and recall. You’ll refine the dataset by selecting and scaling features and assess the impact this has on the performance of the model. You'll also examine how the choice of algorithm can affect the results.

This demonstration uses the Airline Passenger Satisfaction dataset.

**Note:** This dataset is a cleaned-up and modified version of the original 'Passenger Satisfaction' dataset published on Kaggle.

## Context
This dataset contains the results of an airline passenger satisfaction survey. Which factors are highly correlated with a satisfied (or dissatisfied) passenger? Can you predict passenger satisfaction?

## Features:

- *Gender*: Gender of the passengers (Female, Male)

- *Customer Type*: The customer type (Loyal customer, disloyal customer)

- *Age*: The actual age of the passenger

- *Type of Travel*: Purpose of the flight (Personal Travel, Business Travel)

- *Class*: Travel class (Business, Eco, Eco Plus)

- *Flight distance*: The flight distance of this journey

- *Inflight wifi service*: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)

- *Departure/Arrival time convenient*: Satisfaction level of Departure/Arrival time convenience (0:Completely dissatisfied;5:Completely satisfied)

- *Ease of Online booking*: Satisfaction level of online booking (0:Completely dissatisfied;5:Completely satisfied)

- *Gate location*: Satisfaction level of Gate location (0:Completely dissatisfied;5:Completely satisfied)

- *Food and drink*: Satisfaction level of Food and drink (0:Completely dissatisfied;5:Completely satisfied)

- *Online boarding*: Satisfaction level of online boarding (0:Completely dissatisfied;5:Completely satisfied)

- *Seat comfort*: Satisfaction level of Seat comfort (0:Completely dissatisfied;5:Completely satisfied)

- *Inflight entertainment*: Satisfaction level of inflight entertainment (0:Completely dissatisfied;5:Completely satisfied)

- *On-board service*: Satisfaction level of On-board service (0:Completely dissatisfied;5:Completely satisfied)

- *Leg room service*: Satisfaction level of Leg room service (0:Completely dissatisfied;5:Completely satisfied)

- *Baggage handling*: Satisfaction level of baggage handling (0:Completely dissatisfied;5:Completely satisfied)

- *Check-in service*: Satisfaction level of Check-in service (0:Completely dissatisfied;5:Completely satisfied)

- *Inflight service*: Satisfaction level of inflight service (0:Completely dissatisfied;5:Completely satisfied)

- *Cleanliness*: Satisfaction level of Cleanliness (0:Completely dissatisfied;5:Completely satisfied)

- *Departure Delay in Minutes*: Minutes delayed on departure

- *Arrival Delay in Minutes*: Minutes delayed on Arrival

## Target Class:

- *Satisfaction:* Overall satisfaction level (Not Satisfied/Neutral or Satisfied)


In [None]:
# Download the customer_satisfaction.csv file

!wget 'https://raw.githubusercontent.com/cm-int/classification_models/main/module_6/Democode/customer_satisfaction.csv'

In [None]:
# Read the data from the CSV file
import numpy as np
import pandas as pd

customer_satisfaction = pd.read_csv("customer_satisfaction.csv")
customer_satisfaction.info() # info() prints its summary directly and returns None, so it is not wrapped in print()
print('\n')
customer_satisfaction

In [None]:
# Separate the class variable ('satisfaction') from the features and convert the categorical features into dummy variables
features = customer_satisfaction.drop(['satisfaction'], axis=1)
features = pd.get_dummies(features)
features

In [None]:
# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)
from sklearn.preprocessing import LabelEncoder

satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)

print(satisfaction_class)

In [None]:
# Split the data into test and training datasets, build a K-Nearest Neighbors model, and test the precision, recall, and AUC
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

features_train, features_test, class_train, class_test = train_test_split(features, satisfaction_class, test_size=0.33, random_state=13)

knn_model = KNeighborsClassifier() # Select default hyperparameters (n_neighbors=5)
_ = knn_model.fit(features_train, class_train)

test_results = knn_model.predict(features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results,  display_labels=['Not Satisfied', 'Satisfied'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**


How good is this model at predicting a positive outcome?

*Answer: Poor. The model misses many positive outcomes and reports them as negatives. Additionally, it misclassifies a large percentage of negative outcomes as positives*.

*This is the baseline for further investigation.*
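
To make the two metrics concrete, here is a minimal sketch of how precision and recall follow from the counts in a confusion matrix. The counts are hypothetical and are not taken from this notebook's output; `precision_recall` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # Share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # Share of actual positives that are found
    return precision, recall

# Hypothetical counts for the 'Satisfied' class (assumed for illustration only)
tp, fp, fn = 60, 40, 30

precision, recall = precision_recall(tp, fp, fn)
print(f'Precision: {precision:.3f}')  # 60 / (60 + 40)
print(f'Recall: {recall:.3f}')        # 60 / (60 + 30)
```

A model that reports many negatives as positives lowers precision, while a model that misses many positives lowers recall.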

# Perform a SHAP analysis to examine which features have the most effect on the predictions

In [None]:
# Install the SHAP module
!pip install shap

In [None]:
# Create a SHAP explainer to analyze predictions made using the model
# NOTE: This step takes 5 minutes to run

import shap
import random

items = random.sample(list(features_train.index), 50) # The analysis is restricted to 50 random observations to save time
explainer_train = features_train[features_train.index.isin(items)]

explainer = shap.Explainer(knn_model.predict, explainer_train)
values = explainer(explainer_train)

In [None]:
# Display the results as a summary plot

shap.summary_plot(shap_values=values, features=explainer_train, plot_type="bar")

In [None]:
# The violin plot indicates how the features are correlated with the predictions

shap.summary_plot(shap_values=values, features=explainer_train, plot_type="violin")

**Question:**

What does this analysis indicate?

*Answer: Most of the features have little to no bearing on the predictions. It would appear that the most important features are flight distance, age, arrival delay, and departure delay. The flight distance feature is a bit of a surprise - why should the length of the flight be important? This could possibly be due to passengers feeling tired after a long flight, but further analysis is necessary to verify these findings.*
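
SHAP values are based on Shapley values: each feature is credited with its average marginal contribution to the prediction across all orderings of the features. The sketch below computes exact Shapley values by brute force for a tiny hypothetical scoring function (`toy_value` is an assumption made for illustration, not part of the notebook or the `shap` package). For realistic models the `shap` package estimates these values rather than enumerating every ordering.

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return [total / len(orderings) for total in totals]

def toy_value(present):
    """Hypothetical model output given the set of features that are 'known'."""
    score = 0.0
    if 0 in present:
        score += 3.0              # Feature 0 contributes 3 on its own
    if 1 in present:
        score += 1.0              # Feature 1 contributes 1 on its own
    if 0 in present and 1 in present:
        score += 2.0              # Interaction, shared equally between features 0 and 1
    return score                  # Feature 2 never changes the output

values = shapley_values(toy_value, 3)
print(values)
```

The values sum to the difference between the output with all features and the output with none, and a feature that never changes the output receives a value of zero. This mirrors the long tail of near-zero bars in the summary plot.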

# Perform a multivariate search to find the most important features for the model

In [None]:
# Rank the features with SelectKBest (a univariate filter that uses chi-squared scores) and evaluate a model on the top k features for each k
# NOTE: This step takes 5 or 6 minutes to run

from sklearn.feature_selection import SelectKBest, chi2
import sklearn.metrics as metrics

# Iterate over the best models with different sized feature sets 
# and calculate the precision and recall of each model

pr_scores = []
rc_scores = []
for k in range(1, len(features_train.columns)-1):
    features_selector = SelectKBest(score_func=chi2, k=k)
    features_selector = features_selector.fit(features_train, class_train)
    print(features_selector.get_feature_names_out())
    transformed_train = features_selector.transform(features_train)
    transformed_test = features_selector.transform(features_test)
    model = KNeighborsClassifier()
    model.fit(transformed_train, class_train)
    predictions = model.predict(transformed_test)
    pr_score = metrics.precision_score(class_test, predictions, zero_division=0, average='macro')
    pr_scores.append(pr_score)
    rc_score = metrics.recall_score(class_test, predictions, zero_division=0, average='macro')
    rc_scores.append(rc_score)

In [None]:
# Plot the results
import matplotlib.pyplot as plt

plt.figure(figsize=(40, 10))
plt.plot(range(1, len(features_train.columns)-1), pr_scores, label='Precision')
plt.plot(range(1, len(features_train.columns)-1), rc_scores, label='Recall')
plt.xlabel('\nBest K Features', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.xticks(range(1, len(features_train.columns)-1))
plt.ylabel('Precision/Recall Scores', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.legend(prop={'size': 20})
plt.show()

In [None]:
# Find the best precision score to minimize the false positive rate 
# (use recall to minimize the false negative rate)

best_score = max(pr_scores)
num_features = int(np.argmax(pr_scores)) # Index of the first best score; comparing a list with == is not reliably element-wise
features_selector = SelectKBest(score_func=chi2, k=num_features+1)
features_selector = features_selector.fit(features_train, class_train)
best_features = features_selector.get_feature_names_out()
print(f'Best features: {best_features}')

**Question:**

How do these findings compare to the SHAP analysis?

*Answer: Flight distance is still the most significant feature, but age has dropped out. Unsurprisingly, departure delay and arrival delay are still important.*
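
One reason flight distance and the delay columns keep ranking highly may be that the chi-squared score grows with the magnitude of a feature. The sketch below uses synthetic, hypothetical data (not the survey data) to show that multiplying a non-negative feature by 100 multiplies its `chi2` score by 100, even though the feature carries exactly the same information.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)              # Synthetic binary class labels

# Hypothetical small-valued feature that is weakly related to the label
small = rng.integers(0, 6, size=200) + labels
large = small * 100                                # Same information on a larger scale

scores, _ = chi2(np.column_stack([small, large]), labels)
print(f'Score for small-scale feature: {scores[0]:.3f}')
print(f'Score for large-scale feature: {scores[1]:.3f}')
```

Because of this, rankings produced by `chi2` on unscaled data are worth treating with some caution when the columns use very different units.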

# Build and test a model using the *'best'* set of features

In [None]:
best_features_train = features_train[best_features]

best_features_test = features_test[best_features]

In [None]:
knn_model = KNeighborsClassifier() 
_ = knn_model.fit(best_features_train, class_train)

test_results = knn_model.predict(best_features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results, display_labels=['Not Satisfied', 'Satisfied'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

Has the model improved?

*Answer: Slightly. Precision and recall are marginally improved, but still a bit low.*

# Examine how scaling affects the choice of features

In [None]:
# Return to the original dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

customer_satisfaction = pd.read_csv("customer_satisfaction.csv")

# Separate the class variable ('satisfaction') from the features (the categorical features are encoded later by the pipeline)

features = customer_satisfaction.drop(['satisfaction'], axis=1)

# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)

satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)
print(satisfaction_class)

# Split the data into test and training datasets

features_train, features_test, class_train, class_test = train_test_split(features, satisfaction_class, test_size=0.33, random_state=13)

features_train.info() # info() prints its summary directly and returns None

In [None]:
# Create a pipeline to perform encoding of the categorical features and scaling of the numeric features
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config

numeric_features = features_train.select_dtypes(include='number').columns
categorical_features = features_train.select_dtypes(include='object').columns

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),\
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features), \
                       ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create the pipeline and fit a K-Nearest Neighbours model
pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                 ('estimator', KNeighborsClassifier())])

pipe.fit(features_train, class_train)

# Display the details of the pipe
set_config(display="diagram")
pipe

In [None]:
# Display the names of the features generated by the pipeline
numeric_feature_names = pipe['preprocessor'].transformers_[0][1]['numeric_scaler'].get_feature_names_out(numeric_features)
print(f'Numeric features: {numeric_feature_names}\n')

categorical_feature_names = pipe['preprocessor'].transformers_[1][1]['categorical_encoder'].get_feature_names_out(categorical_features)
print(f'Categorical features: {categorical_feature_names}')

In [None]:
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make test predictions
predictions = pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro') # Use the functions imported at the top of this cell
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

Has the model improved?

*Answer: Precision, recall, and AUC are all much better. Scaling has had a significant impact on the quality of the model.*
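
A note on the AUC: the cells in this notebook pass hard class labels (0/1) to `RocCurveDisplay.from_predictions`, which produces a ROC curve with a single operating point. One possible refinement is to pass the positive-class probabilities from `predict_proba` instead. The sketch below compares the two on synthetic data generated with `make_classification`; the dataset and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey data
X, y = make_classification(n_samples=400, n_features=6, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

model = KNeighborsClassifier().fit(X_train, y_train)

hard_labels = model.predict(X_test)            # 0 or 1
scores = model.predict_proba(X_test)[:, 1]     # Fraction of the 5 neighbors that are positive

# Each distinct score value contributes one threshold to the ROC curve
_, _, hard_thresholds = roc_curve(y_test, hard_labels, drop_intermediate=False)
_, _, score_thresholds = roc_curve(y_test, scores, drop_intermediate=False)

hard_auc = roc_auc_score(y_test, hard_labels)
score_auc = roc_auc_score(y_test, scores)

print(f'Thresholds from hard labels: {len(hard_thresholds)}, AUC: {hard_auc:.3f}')
print(f'Thresholds from probabilities: {len(score_thresholds)}, AUC: {score_auc:.3f}')
```

With the default of five neighbors the probabilities can take at most six distinct values, so the curve is still coarse, but it shows more than one trade-off between the true positive rate and the false positive rate.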

# Perform SHAP analysis on the new model

In [None]:
# Run the pipeline again without the estimator to get the transformed training data
transformed_features_train = pd.DataFrame(pipe[0].transform(features_train))
transformed_features_train

In [None]:
# The feature names have been lost, so reinstate them from the lists seen earlier
new_names = np.append(numeric_feature_names, categorical_feature_names)
transformed_features_train.columns = new_names
transformed_features_train

In [None]:
# Perform SHAP analysis over the transformed training data
# NOTE: Allow 5 or 6 minutes for this step to complete
# NOTE 2: Ignore the warnings about the classifier not being fitted with feature names

import shap
import random

items = random.sample(list(transformed_features_train.index), 50) # As before, the analysis is restricted to the 50 random observations to save time
explainer_train = transformed_features_train[transformed_features_train.index.isin(items)]

explainer = shap.Explainer(pipe['estimator'].predict, explainer_train) # Perform the analysis using the 'estimator' object from the pipeline
values = explainer(explainer_train)

In [None]:
# Display the results
shap.summary_plot(shap_values=values, features=explainer_train, plot_type="bar")
shap.summary_plot(shap_values=values, features=explainer_train, plot_type="violin")

**Question:**

What does this analysis tell us?

*Answer: Scaling the numeric features results in more features having a greater influence in the model predictions. Flight distance, which previously had the largest numeric value, has now dropped down the list of importance. Customer satisfaction now seems to be more influenced by features such as inflight wifi, ease of boarding, cleanliness, inflight entertainment, seat comfort, and on-board service.*
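
The effect of scaling on a distance-based model such as K-Nearest Neighbors can be seen with two hypothetical passengers. The values and the standard deviations below are assumptions chosen for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical passengers: [flight distance, seat comfort rating (0-5)]
passenger_a = np.array([2500.0, 5.0])
passenger_b = np.array([2400.0, 0.0])

# Share of the squared Euclidean distance contributed by each feature, unscaled
squared_gaps = (passenger_a - passenger_b) ** 2
share_unscaled = squared_gaps / squared_gaps.sum()

# Dividing by a standard deviation mimics StandardScaler (the mean cancels out in a difference)
assumed_std = np.array([1000.0, 1.3])  # Assumed standard deviations for the two columns
scaled_gaps = ((passenger_a - passenger_b) / assumed_std) ** 2
share_scaled = scaled_gaps / scaled_gaps.sum()

print(f'Unscaled share [distance, comfort]: {share_unscaled.round(4)}')
print(f'Scaled share   [distance, comfort]: {share_scaled.round(4)}')
```

Before scaling, a 100-mile difference in flight distance swamps the largest possible difference in a rating. After scaling, the rating difference dominates, which is consistent with the shift in feature importance described above.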

# Perform multivariate forward selection with the new model and assess whether this change improves predictions

In [None]:
# Use the SequentialFeatureSelector to find the best set of features for the model
# NOTE: This step takes approximately 6 minutes to run

from sklearn.feature_selection import SequentialFeatureSelector

features_to_select = 5

sfs_forward = SequentialFeatureSelector(pipe['estimator'], n_features_to_select=features_to_select, direction="forward")
sfs_forward.fit(transformed_features_train, class_train)

print(f"Features selected by forward sequential selection: {sfs_forward.get_feature_names_out()}")

In [None]:
# Transform the test data and rename the columns

transformed_features_test = pd.DataFrame(pipe[0].transform(features_test))
transformed_features_test.columns = new_names
transformed_features_test

In [None]:
# Build a new model using only the selected features
reduced_features_train = transformed_features_train[sfs_forward.get_feature_names_out()]
reduced_features_test = transformed_features_test[sfs_forward.get_feature_names_out()]

knn_model = KNeighborsClassifier() # Create a new classifier
_ = knn_model.fit(reduced_features_train, class_train)

test_results = knn_model.predict(reduced_features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results, display_labels=['Not Satisfied', 'Satisfied'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

How does this model fare?

*Answer: Focussing on a smaller number of columns reduced the precision and recall slightly, so this might not be the best strategy in this case.*

# Perform detailed multivariate forward selection and select the features more carefully

In [None]:
# Repeat the SelectKBest search on the scaled data (this time scoring with f_classif) and evaluate the best mix of features for precision and recall
# NOTE: This step takes 5 or 6 minutes to run

from sklearn.feature_selection import SelectKBest, f_classif
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Iterate over the best models with different sized feature sets 
# and calculate the precision and recall of each model

pr_scores = []
rc_scores = []
for k in range(1, len(transformed_features_train.columns)-1):
    features_selector = SelectKBest(score_func=f_classif, k=k)
    features_selector = features_selector.fit(transformed_features_train, class_train)
    print(features_selector.get_feature_names_out())
    transformed_train = features_selector.transform(transformed_features_train)
    transformed_test = features_selector.transform(transformed_features_test)
    model = KNeighborsClassifier()
    model.fit(transformed_train, class_train)
    predictions = model.predict(transformed_test)
    pr_score = metrics.precision_score(class_test, predictions, zero_division=0, average='macro')
    pr_scores.append(pr_score)
    rc_score = metrics.recall_score(class_test, predictions, zero_division=0, average='macro')
    rc_scores.append(rc_score)

# Plot the results

plt.figure(figsize=(40, 10))
plt.plot(range(1, len(transformed_features_train.columns)-1), pr_scores, label='Precision')
plt.plot(range(1, len(transformed_features_train.columns)-1), rc_scores, label='Recall')
plt.xlabel('\nBest K Features', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.xticks(range(1, len(transformed_features_train.columns)-1))
plt.ylabel('Precision/Recall Scores', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.legend(prop={'size': 20})
plt.show()

In [None]:
# Find the features for the best precision score to minimize the false positive rate 
# (use recall to minimize the false negative rate)
best_score = max(pr_scores)
num_features = int(np.argmax(pr_scores)) # Index of the first best score; comparing a list with == is not reliably element-wise
features_selector = SelectKBest(score_func=f_classif, k=num_features+1)
features_selector = features_selector.fit(transformed_features_train, class_train)
print(f'Best features: {features_selector.get_feature_names_out()}')

In [None]:
# Build another new model using only the selected features
reduced_features_train = transformed_features_train[features_selector.get_feature_names_out()]
reduced_features_test = transformed_features_test[features_selector.get_feature_names_out()]

knn_model = KNeighborsClassifier() # Create a new classifier
_ = knn_model.fit(reduced_features_train, class_train)

test_results = knn_model.predict(reduced_features_test)

_ = ConfusionMatrixDisplay.from_predictions(class_test, test_results, display_labels=['Not Satisfied', 'Satisfied'])

print(f'Precision: {precision_score(class_test, test_results)}')
print(f'Recall: {recall_score(class_test, test_results)}\n')

_ = RocCurveDisplay.from_predictions(class_test, test_results)

**Question:**

Is this model an improvement?

*Answer: Selecting features based on model precision yields an improvement, at the cost of a small reduction in recall.*
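
If neither precision nor recall should be sacrificed, one option (an assumption here, not something the notebook does) is to pick the number of features by the F1 score, the harmonic mean of the two. The scores below are hypothetical values standing in for the `pr_scores` and `rc_scores` lists.

```python
import numpy as np

# Hypothetical scores for k = 1..5 features (not taken from the notebook's output)
pr_scores = [0.70, 0.82, 0.91, 0.90, 0.88]
rc_scores = [0.65, 0.80, 0.84, 0.89, 0.90]

# F1 is the harmonic mean of precision and recall
f1_scores = [2 * p * r / (p + r) for p, r in zip(pr_scores, rc_scores)]

# np.argmax returns a zero-based index, so add 1 to get the number of features
best_precision_k = int(np.argmax(pr_scores)) + 1
best_recall_k = int(np.argmax(rc_scores)) + 1
best_f1_k = int(np.argmax(f1_scores)) + 1

print(f'Best k for precision: {best_precision_k}')
print(f'Best k for recall: {best_recall_k}')
print(f'Best k for F1: {best_f1_k}')
```

In this made-up example the three criteria pick different feature counts, which is the same kind of trade-off seen in the answer above.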

# Try using feature extraction as an alternative strategy

In [None]:
# Return to the original dataset
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

customer_satisfaction = pd.read_csv("customer_satisfaction.csv")

# Separate the class variable ('satisfaction') from the features and convert the categorical features into dummy variables

features = customer_satisfaction.drop(['satisfaction'], axis=1)
features = pd.get_dummies(features)

# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)

satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)
print(satisfaction_class)

features.info() # info() prints its summary directly and returns None

In [None]:
from sklearn.decomposition import PCA

# Perform PCA analysis
pca = PCA()
pca.fit(features)

print(pca.explained_variance_ratio_)

In [None]:
import matplotlib.pyplot as plt

# Plot the results
plt.figure(figsize=(10,10))
x = np.arange(1, len(pca.explained_variance_)+1)
plt.bar(x, pca.explained_variance_ratio_)
plt.xlabel('Principal Components', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.ylabel('Proportion of Explained Variances', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.show()

**Question:**

Which component(s) account for the most variance?

*Answer: Component 1 accounts for 99.9% of the variance. This component dwarfs the variance of the other components. Try building a model with this single component.*
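
A single component that explains almost all of the variance is often a sign that one column has a much larger numeric range than the others, because PCA works on raw variances. The sketch below illustrates this with synthetic, independent columns (hypothetical values, not the survey data), and shows how standardizing the columns changes the picture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
n_rows = 500

# Synthetic columns: one with a wide numeric range, two rating-like columns
data = np.column_stack([
    rng.normal(1200, 1000, n_rows),  # Wide-ranging column, similar in scale to a distance
    rng.normal(3, 1.3, n_rows),      # Rating-like column
    rng.normal(3, 1.3, n_rows),      # Rating-like column
])

raw_ratio = PCA().fit(data).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(data)).explained_variance_ratio_

print(f'Unscaled: {raw_ratio.round(4)}')
print(f'Scaled:   {scaled_ratio.round(4)}')
```

On the unscaled data the first component is essentially the wide-ranging column. After scaling, the variance is spread across the components.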

In [None]:
# Construct another model using the first principal component only
pca_data = pd.DataFrame(pca.transform(features))
component_data = pca_data.iloc[:, 0:1]

# Split the data into test and training datasets
pca_train, pca_test, pca_class_train, pca_class_test = train_test_split(component_data, satisfaction_class, test_size=0.33, random_state=13)

# Build the model
pca_knn_model = KNeighborsClassifier()
pca_knn_model.fit(pca_train, pca_class_train)
predictions = pca_knn_model.predict(pca_test)

# Check the precision and recall
pr_score = precision_score(pca_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(pca_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(pca_class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(pca_class_test, predictions) # Plot this model's predictions, not the stale test_results from an earlier model

**Question:**

Is this model an improvement?

*Answer: Precision and Recall have dropped significantly, as has the AUC. Compacting the predictive power into a single component was probably optimistic.*

In [None]:
# Try again with the first ten principal components
pca_data = pd.DataFrame(pca.transform(features))
component_data = pca_data.iloc[:, 0:10]

# Split the data into test and training datasets
pca_train, pca_test, pca_class_train, pca_class_test = train_test_split(component_data, satisfaction_class, test_size=0.33, random_state=13)

# Build the model
pca_knn_model = KNeighborsClassifier()
pca_knn_model.fit(pca_train, pca_class_train)
predictions = pca_knn_model.predict(pca_test)

# Check the precision and recall
pr_score = precision_score(pca_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(pca_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(pca_class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(pca_class_test, predictions) # Plot this model's predictions, not the stale test_results from an earlier model

**Question:**

Is this model an improvement?

*Answer: Precision and Recall have improved. Maybe there is more information in the other components than is alluded to by their variance.*
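
Rather than guessing the number of components, one common approach is to keep enough components to reach a chosen share of the variance. The sketch below does this on synthetic, standardized data; the 90% threshold and the dataset are assumptions made for illustration. It also shows that `PCA` accepts a fraction for `n_components` when `svd_solver='full'` is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix
X, _ = make_classification(n_samples=300, n_features=12, n_informative=6, random_state=13)
X = StandardScaler().fit_transform(X)

threshold = 0.90  # Assumed target share of explained variance

# Option 1: fit all components and find where the cumulative ratio reaches the threshold
pca_all = PCA().fit(X)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, threshold)) + 1

# Option 2: let PCA choose the number of components for the same threshold
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X)

print(f'Cumulative explained variance: {cumulative.round(3)}')
print(f'Components needed (manual): {n_needed}')
print(f'Components chosen by PCA:   {pca_auto.n_components_}')
```

A threshold like this is only a starting point. As the answer above suggests, components with little variance can still carry information that helps the classifier.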

# Compare PCA to t-SNE

**NOTE:** t-SNE analysis of the data takes upwards of 15 minutes. Only do this part if time allows, otherwise go straight to UMAP.

In [None]:
# Return to the original dataset
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

customer_satisfaction = pd.read_csv("customer_satisfaction.csv")

# Separate the class variable ('satisfaction') from the features and convert the categorical features into dummy variables

features = customer_satisfaction.drop(['satisfaction'], axis=1)
features = pd.get_dummies(features)

# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)

satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)
print(satisfaction_class)

features.info() # info() prints its summary directly and returns None

In [None]:
# NOTE: Allow 15 minutes for this step

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Perform t-SNE analysis
tsne = TSNE(n_components=3) # Reduce the dataset to 3 dimensions
transformed_data = tsne.fit_transform(features)
tsne_features = transformed_data[:, 0:2] # Note: this keeps only the first two of the three embedding dimensions

# Split the data into test and training datasets
tsne_features_train, tsne_features_test, tsne_class_train, tsne_class_test = train_test_split(tsne_features, satisfaction_class, test_size=0.33, random_state=13)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Build a model using the new dataset

tsne_knn_model = KNeighborsClassifier()
tsne_knn_model.fit(tsne_features_train, tsne_class_train)
predictions = tsne_knn_model.predict(tsne_features_test)

# Check the precision and recall
pr_score = precision_score(tsne_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(tsne_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(tsne_class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(tsne_class_test, predictions)

**Question:**

How does performance compare to PCA?

*Answer: Precision and Recall are comparable to PCA with a single component, but are not as good as PCA with ten components. AUC is poor. You could possibly tune by experimenting with the perplexity and learning rate, but it is a time-consuming process.*
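
One detail worth noting is that the t-SNE cell embeds the whole dataset before splitting it into training and test sets. scikit-learn's `TSNE` provides `fit_transform` but no separate `transform` method, so an embedding fitted on the training rows cannot be reapplied to unseen rows in the way a fitted `PCA` can. The short check below confirms this. As a consequence, the test rows influence the embedding, and the scores may be somewhat optimistic.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a projection that can be reapplied to unseen rows
pca_has_transform = hasattr(PCA(), 'transform')

# TSNE only exposes fit_transform, so there is no fitted mapping to reuse
tsne_has_transform = hasattr(TSNE(), 'transform')

print(f'PCA has a transform method:  {pca_has_transform}')
print(f'TSNE has a transform method: {tsne_has_transform}')
```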

# Try UMAP

In [None]:
!pip install umap-learn

In [None]:
# Return to the original dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

customer_satisfaction = pd.read_csv("customer_satisfaction.csv")

# Separate the class variable ('satisfaction') from the features and convert the categorical features into dummy variables

features = customer_satisfaction.drop(['satisfaction'], axis=1)
features = pd.get_dummies(features)

# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)

satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)
print(satisfaction_class)

features.info() # info() prints its summary directly and returns None

In [None]:
# Perform UMAP analysis
import umap.umap_ as umap
from sklearn.model_selection import train_test_split

reducer = umap.UMAP(n_neighbors=7, n_components=7) # Experiment with different values
umap_features = reducer.fit_transform(features)

# Split the data into test and training datasets
umap_features_train, umap_features_test, umap_class_train, umap_class_test = train_test_split(umap_features, satisfaction_class, test_size=0.33, random_state=13)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Build a model using the new dataset

umap_knn_model = KNeighborsClassifier()
umap_knn_model.fit(umap_features_train, umap_class_train)
predictions = umap_knn_model.predict(umap_features_test)

# Check the precision and recall
pr_score = precision_score(umap_class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(umap_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(umap_class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(umap_class_test, predictions)

**Question:**

How does performance compare to PCA and t-SNE?

*Answer: In this example, the performance of the UMAP model lies between that of PCA with ten components and t-SNE.*

# Incorporate scaling with PCA

*PCA appears to be the most appropriate feature extraction technique for this dataset, and scaling had a significant impact. The next logical step is to try combining these two approaches. This will involve the use of a pipeline.*

In [None]:
# Return to the original dataset
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

customer_satisfaction = pd.read_csv("customer_satisfaction.csv")

# Separate the class variable ('satisfaction') from the features
# (the pipeline below one-hot encodes the categorical features, so get_dummies is not applied here)

features = customer_satisfaction.drop(['satisfaction'], axis=1)

# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)

satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)
print(satisfaction_class)

features.info() # Inspect the reloaded data; features_train is not re-created until the next cell

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(features, satisfaction_class, test_size=0.33, random_state=13)

In [None]:
# Try scaling before performing PCA.
# Construct a new pipeline that includes feature extraction

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config
from sklearn.decomposition import PCA

numeric_features = features_train.select_dtypes(include='number').columns
categorical_features = features_train.select_dtypes(include='object').columns

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),\
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = \
    ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features), \
                       ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create the pipeline with PCA and fit a K-Nearest Neighbors model
pca_pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                     ('extractor', PCA(n_components=1)), # Only generate the first PCA component
                     ('estimator', KNeighborsClassifier())])

pca_pipe.fit(features_train, class_train)

# Display the details of the pipe
set_config(display="diagram")
pca_pipe

In [None]:
# Evaluate the model
predictions = pca_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

How does this model compare with the previous models?

*Answer: Precision and Recall are better than PCA alone, but not as good as scaling without PCA.*
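
One way to see why scaling before PCA matters: PCA follows the directions of greatest variance, so an unscaled feature with a large numeric range can dominate the first component. The following is a minimal sketch on synthetic data. The two columns and their distributions are made up for illustration and are not taken from the airline dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)

# Two hypothetical, independent features on very different scales
distance = rng.normal(1200, 900, size=500)            # large range, like a flight distance
rating = rng.integers(0, 6, size=500).astype(float)   # small range, like a 0-5 survey score
X = np.column_stack([distance, rating])

# Share of the total variance captured by the first principal component
raw_ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
scaled_ratio = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_[0]

print(f'First component, unscaled data: {raw_ratio:.4f}')
print(f'First component, scaled data:   {scaled_ratio:.4f}')
```

On the unscaled data the first component is almost entirely the large-range column, so a single-component model effectively ignores the rating. After scaling, both columns contribute on an equal footing.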

In [None]:
# Try again with a single principal component (this was the initial fit for PCA earlier)
pca_pipe[1].set_params(**{'n_components': 1})

pca_pipe.fit(features_train, class_train)
predictions = pca_pipe.predict(features_test)
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

In [None]:
# And again, this time with ten principal components
pca_pipe[1].set_params(**{'n_components': 10})

pca_pipe.fit(features_train, class_train)
predictions = pca_pipe.predict(features_test)
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

How does the model perform with ten principal components?

*Answer: Precision and Recall have improved, but are still below that achieved by using scaling alone.*
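
Rather than guessing a number of components, one option is to look at how much variance each additional component explains. The sketch below uses synthetic data from `make_classification` as a stand-in for the scaled features, and the 95% threshold is an assumed cut-off chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (not the airline data)
X, _ = make_classification(n_samples=400, n_features=12, n_informative=6, random_state=13)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance
pca = PCA(n_components=None).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95  # assumed cut-off, for illustration only
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(f'{n_needed} of {X.shape[1]} components explain at least {threshold:.0%} of the variance')
```

Explained variance only describes the features, not the class label, so a count chosen this way is a starting point and still needs checking against precision and recall.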

# Try multivariate feature selection to ascertain how selecting multiple PCA components affects the model
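
As a rough illustration of what `SelectKBest` with `f_classif` does, the sketch below scores three synthetic columns against a binary label and keeps the single best one. The data is made up: only the first column carries information about the class. Note that `f_classif` scores each column on its own with an ANOVA F-test.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(13)
y = rng.integers(0, 2, size=200)

# Column 0 tracks the class label; columns 1 and 2 are pure noise
X = np.column_stack([3.0 * y + rng.normal(size=200),
                     rng.normal(size=200),
                     rng.normal(size=200)])

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)

print('ANOVA F-scores:', selector.scores_)
print('Selected column indices:', selected)
```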

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Run the pipeline again to generate all PCA components
pca_pipe[1].set_params(**{'n_components': None})
pca_pipe.fit(features_train, class_train)

# Generate the transformed data using the pipeline
pca_data = pd.DataFrame(pca_pipe[:-1].transform(features))

# Rename the columns returned by PCA analysis - the existing column names are numeric which can cause problems. Prepend each name with an 'x'
pca_column_names = [f'x{i}' for i in pca_data.columns]
pca_data = pca_data.set_axis(pca_column_names, axis=1)

# Split the data into training and test datasets
pca_train, pca_test, pca_class_train, pca_class_test = train_test_split(pca_data, satisfaction_class, test_size=0.33, random_state=13)
num_rows, num_cols = pca_data.shape

# Iterate over the best models with different sized feature sets 
# and calculate the precision and recall of each model

pr_scores = []
rc_scores = []
for k in range(1, num_cols-1):
    features_selector = SelectKBest(score_func=f_classif, k=k)
    features_selector = features_selector.fit(pca_train, pca_class_train)
    print(features_selector.get_feature_names_out())
    transformed_train = features_selector.transform(pca_train)
    transformed_test = features_selector.transform(pca_test)
    model = KNeighborsClassifier()
    model.fit(transformed_train, pca_class_train)
    predictions = model.predict(transformed_test)
    pr_score = metrics.precision_score(pca_class_test, predictions, zero_division=0, average='macro')
    pr_scores.append(pr_score)
    rc_score = metrics.recall_score(pca_class_test, predictions, zero_division=0, average='macro')
    rc_scores.append(rc_score)

# Plot the results

plt.figure(figsize=(40, 10))
plt.plot(range(1, num_cols-1), pr_scores, label='Precision')
plt.plot(range(1, num_cols-1), rc_scores, label='Recall')
plt.xlabel('\nBest K Features', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.xticks(range(1, num_cols-1))
plt.ylabel('Precision/Recall Scores', fontdict={'family': 'serif','color':  'darkred','weight': 'normal','size': 28})
plt.legend(prop={'size': 20})
plt.show()

In [None]:
# Find the features for the best precision score to minimize the false positive rate 
# (use recall, rc_scores, to minimize the false negative rate)

best_score = max(pr_scores)
# Index of the first occurrence of the best score; the matching k is index + 1
num_features = int(np.argmax(pr_scores))
features_selector = SelectKBest(score_func=f_classif, k=num_features+1)
features_selector = features_selector.fit(pca_train, pca_class_train)
best_features = features_selector.get_feature_names_out()
print(f'Best features: {best_features}')

In [None]:
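# Construct another model using the highlighted principal components
# Note: This test fits the model directly on the PCA-transformed data rather than through the pipeline
# (the data was already scaled when the pipeline's preprocessor generated it)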

pca_knn_model = KNeighborsClassifier()
pca_knn_model.fit(pca_train[best_features], pca_class_train)
predictions = pca_knn_model.predict(pca_test[best_features])

# Check the precision and recall
pr_score = metrics.precision_score(pca_class_test, predictions, zero_division=0, average='macro')
rc_score = metrics.recall_score(pca_class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(pca_class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(pca_class_test, predictions)

**Question:**

Is this an improvement?

*Answer: Precision and Recall have improved significantly. However, the model is still only comparable to the one that used scaling without PCA.*
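
The search above picks the number of components by scoring each candidate on the test data, which can make the final test score look better than it would on unseen data. A possible alternative, sketched here under assumptions, is to put the selector inside a pipeline and let `GridSearchCV` choose `k` by cross-validation on the training data only. The data is synthetic, and the step names (`scaler`, `extractor`, `selector`, `estimator`) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the airline dataset)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('extractor', PCA()),
                 ('selector', SelectKBest(score_func=f_classif)),
                 ('estimator', KNeighborsClassifier())])

# Try every k using 5-fold cross-validation on the training data only
search = GridSearchCV(pipe,
                      param_grid={'selector__k': list(range(1, 11))},
                      scoring='precision_macro',
                      cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_['selector__k']
print(f'Best k: {best_k}, cross-validated macro precision: {search.best_score_:.3f}')

# The test data is used once, after k has been chosen
print(f'Held-out macro precision: {search.score(X_test, y_test):.3f}')
```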

#Try different algorithms - Logistic Regression and Random Forest

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

# Change the K-Nearest Neighbors estimator in the PCA pipeline for a Logistic Regression estimator
pca_pipe.set_params(**{'estimator': LogisticRegression(max_iter=1000, solver="lbfgs", tol=1e-3)})
pca_pipe

In [None]:
# Evaluate the model
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Fit the model 
pca_pipe.fit(features_train, class_train)

# Make predictions
predictions = pca_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

How does this model compare with those seen so far?

*Answer: This model is OK, but not as good as the KNN model with scaling.*
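
The evaluation cells report precision and recall with `average='macro'`, which computes the metric for each class separately and then takes the unweighted mean. The arithmetic can be sketched in plain Python as follows. The confusion-matrix counts are hypothetical, not results from this notebook, and `macro_scores` is a helper written for this illustration.

```python
# Hypothetical confusion-matrix counts: rows are actual classes, columns are predicted classes
confusion = [[80, 20],   # actual 0 (Not Satisfied): 80 predicted 0, 20 predicted 1
             [10, 90]]   # actual 1 (Satisfied):     10 predicted 0, 90 predicted 1

def macro_scores(cm):
    """Return (macro precision, macro recall) for a square confusion matrix."""
    n = len(cm)
    precisions, recalls = [], []
    for c in range(n):
        true_pos = cm[c][c]
        predicted_c = sum(cm[r][c] for r in range(n))  # column total: everything predicted as c
        actual_c = sum(cm[c])                           # row total: everything that really is c
        # Mirror zero_division=0: score 0 when the denominator is empty
        precisions.append(true_pos / predicted_c if predicted_c else 0.0)
        recalls.append(true_pos / actual_c if actual_c else 0.0)
    return sum(precisions) / n, sum(recalls) / n

macro_precision, macro_recall = macro_scores(confusion)
print(f'Macro precision: {macro_precision:.4f}')
print(f'Macro recall:    {macro_recall:.4f}')
```

Because each class counts equally in the mean, a model that does well on the larger class but poorly on the smaller one is penalized more than it would be with a weighted average.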

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Return to the original dataset
customer_satisfaction = pd.read_csv("customer_satisfaction.csv")

# Separate the class variable ('satisfaction') from the features
# The categorical features are left as they are here; the pipeline below one-hot encodes them
features = customer_satisfaction.drop(['satisfaction'], axis=1)

# Convert the class label (not satisfied/satisfied) into a numeric value (0/1)
satisfaction_class = customer_satisfaction['satisfaction']
cat_encoder = LabelEncoder().fit(satisfaction_class)
satisfaction_class = cat_encoder.transform(satisfaction_class)
print(satisfaction_class)

print(features.info())

# Split the data into test and training datasets
features_train, features_test, class_train, class_test = train_test_split(features, satisfaction_class, test_size=0.33, random_state=13)

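# Define a new pipeline that doesn't scale the numeric data (tree models are not sensitive to scaling)
# Note: ColumnTransformer drops any column that is not listed (remainder='drop' is the default), so only the
# one-hot encoded categorical features reach the estimator. Set remainder='passthrough' to keep the numeric columns.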
categorical_features = features_train.select_dtypes(include='object').columns

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Apply the encoder to the categorical columns
pipeline_preprocessor = ColumnTransformer([('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create the pipeline and fit a Random Forest model
forest_pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                        ('estimator', RandomForestClassifier())])

forest_pipe.fit(features_train, class_train)

In [None]:
# Evaluate the model
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make predictions
predictions = forest_pipe.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

**Question:**

How does this model compare to the KNN and Logistic Regression models?

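*Answer: This model performs worse than those built by using KNN and Logistic Regression. However, as written, the pipeline passes only the one-hot encoded categorical features to the Random Forest (the ColumnTransformer drops the numeric columns by default), and the model has not been tuned to the same extent as the KNN model, so improvements may be possible.*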
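
The Random Forest pipeline above lists only the categorical columns in its `ColumnTransformer`. The sketch below shows, on a small made-up DataFrame, how the `remainder` parameter decides what happens to the unlisted columns. The column names and values are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame: one categorical column and two numeric columns
toy = pd.DataFrame({'Class': ['Eco', 'Business', 'Eco Plus', 'Eco'],
                    'Age': [25, 40, 31, 58],
                    'Flight Distance': [300, 2100, 850, 460]})

# Default behaviour: columns that are not listed are dropped
drop_ct = ColumnTransformer([('cat', OneHotEncoder(), ['Class'])])

# remainder='passthrough' keeps the unlisted columns unchanged
keep_ct = ColumnTransformer([('cat', OneHotEncoder(), ['Class'])],
                            remainder='passthrough')

drop_shape = drop_ct.fit_transform(toy).shape
keep_shape = keep_ct.fit_transform(toy).shape

print('remainder default (drop):', drop_shape)   # three one-hot columns only
print('remainder passthrough:  ', keep_shape)    # three one-hot columns plus two numeric columns
```

If numeric columns are passed through to a tree model, any missing values in them may still need handling, depending on the estimator and library version.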

#Try stacking to reduce bias and variance across multiple algorithms
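
In a stack, each base estimator makes its own predictions, and a final estimator learns how to combine them. The following is a minimal sketch on synthetic data, with base estimators similar to those used in this section but without the preprocessing pipelines. The dataset and the parameter values are assumptions chosen to keep the example small and quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the airline dataset)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

base_estimators = [('rf', RandomForestClassifier(n_estimators=50, random_state=13)),
                   ('nb', GaussianNB()),
                   ('knn', KNeighborsClassifier())]

# During fit, the final estimator is trained on cross-validated predictions of the base estimators
stack = StackingClassifier(estimators=base_estimators,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)

# transform() returns the base estimators' outputs that the final estimator consumes
meta_features = stack.transform(X_test)
stacked_accuracy = stack.score(X_test, y_test)

print('Meta-feature shape:', meta_features.shape)
print('Stacked accuracy:', stacked_accuracy)
```

For a binary problem, each base estimator contributes one column of positive-class probabilities, so the final Logistic Regression sees one input per base estimator.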

In [None]:
# Create a stack with pipelines for Random Forest, K-Nearest Neighbors, and Naive Bayes estimators. Use Logistic Regression to aggregate the results
# NOTE: Allow 5 minutes to run this step

from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Create a new pipeline for K-Nearest Neighbors that generates dummy values for categorical data and scales the numeric features
numeric_features = features_train.select_dtypes(include='number').columns
categorical_features = features_train.select_dtypes(include='object').columns

# Encode categorical features
categorical_preprocessor = Pipeline([('categorical_encoder', OneHotEncoder())])

# Clean and scale numeric features
numeric_preprocessor = Pipeline([('replace_nan', SimpleImputer(missing_values=np.nan, strategy='mean')),
                                 ('numeric_scaler', StandardScaler())])

# Apply transformations to categorical and numeric columns as appropriate
pipeline_preprocessor = ColumnTransformer([('numeric_preprocessor', numeric_preprocessor, numeric_features),
                                           ('categorical_preprocessor', categorical_preprocessor, categorical_features)])

# Create a similar pipeline for Naive Bayes
nb_pipe =  Pipeline([('preprocessor', pipeline_preprocessor),
                     ('estimator', GaussianNB())])

# And another for KNN
knn_pipe = Pipeline([('preprocessor', pipeline_preprocessor),
                     ('estimator', KNeighborsClassifier())])

# Create a stack comprising the random forest pipeline from the previous tasks and the KNN and Naive Bayes pipelines.
# The default aggregator at the top of the stack uses Logistic Regression

estimators = [
    ("rf", forest_pipe),
    ("nb", nb_pipe),
    ("knn", knn_pipe)
]

sc = StackingClassifier(estimators=estimators)

sc.fit(features_train, class_train)

In [None]:
# Evaluate the stacked model
from sklearn.metrics import precision_score, recall_score, ConfusionMatrixDisplay, RocCurveDisplay

# Make predictions
predictions = sc.predict(features_test)

# Check the precision and recall
pr_score = precision_score(class_test, predictions, zero_division=0, average='macro')
rc_score = recall_score(class_test, predictions, zero_division=0, average='macro')

print(f'Precision is {pr_score}\nRecall is {rc_score}')

_ = ConfusionMatrixDisplay.from_predictions(class_test, predictions, display_labels=['Not Satisfied', 'Satisfied'])
_ = RocCurveDisplay.from_predictions(class_test, predictions)

# Conclusion:

Feature selection, feature extraction, scaling, and algorithm choice are all important factors to consider when building a machine learning model. It is vitally important that you test and verify the decisions you make. Additionally, you should always consider the possibility of ensemble methods, such as stacking, to help reduce the shortcomings of a particular algorithm when classifying your data.

In this demonstration, the focus has been on raising the Precision and Recall, with scant regard to AUC.
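
If AUC is of interest, note that the ROC plots in this demonstration are drawn from hard 0/1 predictions, which give a curve with a single operating point. The sketch below, on noisy synthetic data, compares that with a curve built from predicted probabilities. It assumes the estimator provides `predict_proba`, as Logistic Regression, K-Nearest Neighbors, and Random Forest do in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Noisy synthetic data so that the classifier is imperfect (not the airline dataset)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
hard_labels = model.predict(X_test)                  # 0/1 class labels
positive_scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Each distinct score value can act as a threshold on the ROC curve
fpr_hard, tpr_hard, _ = roc_curve(y_test, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_test, positive_scores)

print('ROC points from hard labels:  ', len(fpr_hard))
print('ROC points from probabilities:', len(fpr_soft))
print('AUC from hard labels:  ', roc_auc_score(y_test, hard_labels))
print('AUC from probabilities:', roc_auc_score(y_test, positive_scores))
```

The same probabilities can be passed to `RocCurveDisplay.from_predictions` in place of the hard labels to plot the full curve.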

It is always important to understand the limitations of a model. Different feature selection and extraction strategies can have an impact (positive and negative) on the model, as can the choice of algorithm.