# Unsupervised Machine Learning - Clustering

This is the final assessment project for course 4 of [IBM Machine Learning Specialization](https://www.coursera.org/specializations/ibm-machine-learning).

# About the data

We'll be working with the [Travel Reviews](https://archive.ics.uci.edu/dataset/484/travel+reviews) dataset provided by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/). The dataset was populated by crawling TripAdvisor.com and it contains aggregated user ratings on various categories, ranging from restaurants and juice bars to museums and religious institutions. Each data instance represents a single user's ratings of travel destinations in East Asia, averaged for each category. Each traveler rating is mapped as Excellent (4), Very Good (3), Average (2), Poor (1), and Terrible (0).

In the original dataset, category features are labeled using generic names (Category 1, Category 2, etc.). As a first step, the original labels were replaced with more descriptive names, guided by the dataset description.

## Dataset Features

*Note : The labels inside parenthesis represent new feature names.*

- User ID : User identifier in string form (e.g. 'User 123')
- Category 1 : Average user feedback on art galleries (art_galleries)
- Category 2 : Average user feedback on dance clubs (dance_clubs)
- Category 3 : Average user feedback on juice bars (juice_bars)
- Category 4 : Average user feedback on restaurants (restaurants)
- Category 5 : Average user feedback on museums (museums)
- Category 6 : Average user feedback on resorts (resorts)
- Category 7 : Average user feedback on parks/picnic spots (parks_picnic)
- Category 8 : Average user feedback on beaches (beaches)
- Category 9 : Average user feedback on theaters (theaters)
- Category 10 : Average user feedback on religious institutions (religious_inst)

# Objectives

ABC Company is experiencing a constant decline in sales figures of travel packages for East Asia. Intimidated by most competitors' clever use of AI-powered strategies, the company has hired a small Data Science team to figure out a solution. With a strong motivation to shine in the new department, the team gets busy with scraping user rating pages in tripadvisor.com and comes up with a promising dataset.

Stakeholders at ABC company aren't that impressed, but decide to give their new team a chance. They state some goals and objectives for a new pilot project. The project requirements are summarized below:

1. Using clustering analysis on Travel Reviews dataset, segment travelers based on common tastes and interests.
2. Domain experts will use these segments to design brand new travel packages, tailored to each segment.
3. Final clusters can have different sizes, but keep the count small (4-7 clusters would be nice).
4. The main priority is boosting package sales and getting high customer ratings, so focus on ordinary travelers and ignore possible outliers.

## Environment setup

In this project, we'll be using NumPy, Pandas, Matplotlib, Seaborn and Scikit-Learn, all of which should normally be installed in any Python Machine Learning environment.

Please run the following cells to prepare our modeling environment.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans, MeanShift, DBSCAN, estimate_bandwidth

%matplotlib inline

In [None]:
# Find highest pair-wise correlations and return as a Pandas DataFrame
def get_high_corr_df(corr_df):

    # Work on a copy so the caller's correlation matrix is not modified
    corr_df = corr_df.copy()

    # Disable self correlations for the upcoming examination
    for col in corr_df.columns:
        corr_df.loc[col, col] = 0.0

    # Build a DataFrame from the highest pair-wise correlations
    corr_maps = []
    indices = corr_df.abs().to_numpy().argmax(axis=1)
    for i in range(len(indices)):
        corr_map = pd.Series({'High_Corr_Col': str(corr_df.columns[indices[i]]),
                              'High_Corr_Val': float(corr_df.iloc[i, indices[i]])})
        corr_maps.append(corr_map)

    high_corr_df = pd.DataFrame(pd.concat(corr_maps, axis=1).T)
    high_corr_df.index = list(corr_df.columns)
    return high_corr_df

In [None]:
# Use a random state for repeatability
rs = 147

# Load Travel Reviews dataset from current directory into a Pandas DataFrame
df = pd.read_csv('tripadvisor_review.csv')

# Exploratory Data Analysis (EDA)

To get familiar with important dataset characteristics, an exhaustive data analysis was made before preparing dataset for modeling. The following sections briefly summarize the results of this analysis.

## Initial data exploration

Preliminary examination of the dataset revealed the following characteristics :

- Dataset has 980 instances, so it's not a large dataset and training will be relatively fast in all models.
- All features are floating point numbers (float64) rounded to 2 decimal places, except the first feature (User ID).
- Dataset has 10 numeric features with generic names (e.g. Category 1), meaning we should probably use more descriptive feature names prior to data analysis.
- Dataset does not have missing values in any feature, so data imputing is not required.
- Being averaged from integer values between 0 and 4, most numeric features have fairly similar value ranges.
- Despite similar ranges in numeric features, scaling is required due to some features having small ranges (Category 7, Category 8 and Category 10).

Please run the following cells for a demonstration of these characteristics.

In [None]:
# Preview a random sample of instances
df.sample(5, random_state=rs)

In [None]:
# Check out dataset shape
df.shape

In [None]:
# Get a brief overview of data types
df.dtypes

In [None]:
# Check out summary statistics
df.describe()
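
The missing-value and value-range observations listed above can be checked with a couple of one-liners. The sketch below is only an illustration: it uses a small synthetic table (the `demo` DataFrame and its values are made up, not the real Travel Reviews data), so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings table (hypothetical values, not the real data)
rng = np.random.default_rng(147)
demo = pd.DataFrame(
    rng.uniform(0, 4, size=(50, 3)).round(2),
    columns=['Category 1', 'Category 2', 'Category 3'],
)

# Count missing values per feature; all zeros means no imputation is needed
missing_counts = demo.isna().sum()

# Value range (max - min) per feature; clearly different ranges motivate scaling
value_ranges = demo.max() - demo.min()

print(missing_counts)
print(value_ranges)
```

On the real dataset, the same two expressions could be applied to `df` after dropping the User ID column.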

In [None]:
# Reindex numeric features with descriptive names
df.columns = pd.Index(['User ID', 'art_galleries', 'dance_clubs', 'juice_bars', 'restaurants', 'museums',
                       'resorts', 'parks_picnic', 'beaches', 'theaters', 'religious_inst'])
df.sample(5, random_state=rs)

## Correlation analysis

Correlation analysis revealed that some significant feature correlations exist. The clusters identified by our final model may (or may not) reflect some insights about this. But it's important to keep a record of these correlations.

Please run the following cells to see the results of correlation analysis.

In [None]:
# Check out all feature correlations 
corr_mat = df.corr(numeric_only=True)
corr_mat

In [None]:
# Check out highest pair-wise correlations
high_corr_df = get_high_corr_df(corr_mat)
high_corr_df

In [None]:
# Let's filter significant correlations (|corr| > 0.5)
high_corr_df[abs(high_corr_df.High_Corr_Val) > 0.5]

In [None]:
# Visualize high correlations in a pair plot
high_corr_cols = ['juice_bars', 'parks_picnic', 'museums', 'resorts', 'religious_inst']

sns.pairplot(df[high_corr_cols])

## Normality analysis

Distribution of a couple of features in this dataset is highly skewed and, according to the standard skewness threshold (0.75), half of the features are technically skewed.

Please run the following cells to examine the most skewed features.

In [None]:
# Find all features that have significant skewness (> 0.75)
skew_vals = df.iloc[:, 1:].skew()
skew_cols = skew_vals[abs(skew_vals) > 0.75]
skew_cols.sort_values(ascending=False).to_frame().rename({0: 'skew'}, axis=1)

In [None]:
# Visualize different skew levels using histograms
_, axes_ = plt.subplots(2, 3, figsize=(6, 4))
axes = axes_.flatten()

for i, col in enumerate(skew_cols.index):
    axes[i].hist(df[col], bins=20)
    axes[i].set(xlabel=col, ylabel='Frequency')

# Remove any unused subplots instead of hard-coding the last slot
for ax in axes[len(skew_cols):]:
    ax.remove()
plt.tight_layout()
plt.show()

## Outlier analysis

Since we're doing clustering analysis in this project, we're dealing with two types of outliers, neither of which merits further action :

- Outliers in feature values : All features have a limited (small) range and every feature is mapped to the [0, 1] interval by MinMaxScaler at the feature engineering stage. Skewness itself is not removed in this iteration.
- Outliers in clusters : Although DBSCAN will detect any possible outliers, the ABC company is more focused on travel package sales and a high satisfaction rate from ordinary customers. So there is no specific plan for picky customers that may be hard to please.

# Feature selection and engineering

Our exploratory data analysis revealed the need for several data preparation steps. These steps are summarized below.

## Feature selection

The User ID feature was removed, as it doesn't have any modeling value.

## Feature engineering

- All features were renamed according to the main dataset description. This step was performed prior to data analysis.
- All feature values were scaled using MinMaxScaler.

Please run the following cells to see how the above steps were done in code.

In [None]:
# Make a copy of the original data and use it for data preprocessing
data = df.copy()

data = data.drop(['User ID'], axis=1)

In [None]:
scaler = MinMaxScaler()
data[data.columns] = scaler.fit_transform(data[data.columns])

# Feature matrix used by all clustering models below
X = data.copy()

data.describe()

# Clustering models

In order to find the suitable number of clusters suggested by business objectives (i.e. 4-7) several models were trained. Modeling steps can be summarized as follows.

- A range of cluster counts (from 2 to 10) was used to train several K-Means models. Using the Elbow Method, the optimal cluster count was estimated.
- A wide range of epsilon and min_samples values (the latter labeled n_clu in the code) was used to train several DBSCAN models. The counts of clusters and outliers were aggregated and the best model was sought based on arbitrary criteria : n_clusters < 10 and n_outliers < 10 (roughly 1% of all data points).
- A single Mean Shift model was trained using the estimated bandwidth.

Please run the following cells to see the results obtained from different models.

## K-Means

In [None]:
# Train several K-Means models and aggregate results
k_max = 10
km_list = []

for k in range(2, k_max+1):
    km = KMeans(n_clusters=k, random_state=rs)
    km = km.fit(X)
    km_list.append(pd.Series({'K': k, 'Inertia': km.inertia_}))

In [None]:
# Use elbow method to find the best model
km_df = pd.concat(km_list, axis=1).T.set_index('K')
ax = km_df.plot(marker='o', ls='-')
ax.set(xlabel='Clusters', ylabel='Inertia')

As shown in the above plot, there is no obvious elbow (inflection point) in the inertia curve. To stay within the 4-7 cluster range set by our business objectives, we're going to choose K = 6 as our best model.
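
When the elbow is ambiguous, a second criterion can help narrow down K. The sketch below uses the silhouette score, which is not part of the original analysis and is named here only as one possible complement. It runs on synthetic blobs (`X_demo` is a made-up stand-in for the scaled feature matrix), so the scores it prints say nothing about the real dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix (assumption: not the real data)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=147)

# Fit K-Means for several cluster counts and record the mean silhouette score
sil_scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=147).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)

# The K with the highest score is a candidate, not a final answer
best_k = max(sil_scores, key=sil_scores.get)
print(sil_scores)
print(best_k)
```

A higher silhouette score suggests tighter, better separated clusters, but the business constraint on cluster count can still override it.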

In [None]:
km_best = KMeans(n_clusters=6, random_state=rs).fit(X)

print(f'Best K-Means : K = 6, Inertia = {round(km_best.inertia_, 4)}')
km_best

## DBSCAN

In [None]:
# Train several DBSCAN models and aggregate results
eps_vals = [i*0.001 for i in range(1, 21)]
n_clus = [2, 3]
dbs_result = []

for eps in eps_vals:
    for min_samples in n_clus:
        dbs = DBSCAN(eps=eps, min_samples=min_samples)
        dbs = dbs.fit(X)
        dbs_result.append(pd.Series({'eps': eps, 'n_clu': min_samples,
                                     'clusters': len(set(dbs.labels_[dbs.labels_ >= 0])),
                                     'outliers': int((dbs.labels_ == -1).sum())}))

In [None]:
# Build a Pandas DataFrame from aggregated results
dbs_df = pd.concat(dbs_result, axis=1).T
dbs_df.index = pd.Index([f"({round(eps, 4)}, {n_clu})" for eps, n_clu in zip(dbs_df['eps'], dbs_df['n_clu'])],
                        name='(eps, n_clu)')
dbs_df = dbs_df.drop(columns=['eps', 'n_clu'])

In [None]:
dbs_df.head(10)

One can verify from the above (sample) results that DBSCAN cannot lead to optimal clusters with this dataset. This can be due to the infamous Curse of Dimensionality. Many ranges of smaller *epsilon* values were also used, always with similar results.

Ultimately with current dataset shape, DBSCAN fails to provide good results. 
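
One way to see why very small *epsilon* values may label almost everything as noise is to compare them with typical neighbor distances. The sketch below is a hedged illustration on synthetic data (`X_demo` is uniform noise in a 10-dimensional unit hypercube, standing in for MinMax-scaled features); the k-distance heuristic it uses is an addition, not something taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in: 500 points in the unit hypercube with 10 features
rng = np.random.default_rng(147)
X_demo = rng.uniform(0, 1, size=(500, 10))

min_samples = 3

# When querying the training set itself, each point is its own first neighbor,
# so n_neighbors=min_samples matches how DBSCAN counts points within eps
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Typical distances at which a point would become a core point
print(np.percentile(k_dist, [10, 50, 90]).round(3))

# An eps far below those distances leaves no core points, so all labels are -1
labels_small = DBSCAN(eps=0.02, min_samples=min_samples).fit(X_demo).labels_
print(int((labels_small == -1).sum()))
```

On the real feature matrix, plotting the sorted `k_dist` values and looking for a knee is a common way to pick a starting range for *epsilon*.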

## Mean Shift

In [None]:
# Fit a Mean Shift model to our dataset and display clustering result
bandwidth = estimate_bandwidth(X)
mshift = MeanShift(bandwidth=bandwidth, bin_seeding=True, cluster_all=False)
mshift = mshift.fit(X)

n_clusters = len(set(mshift.labels_[mshift.labels_ >= 0]))
n_outliers = int((mshift.labels_ == -1).sum())

print(f'Mean Shift found {n_clusters} clusters and marked {n_outliers} data points as noise.')
mshift

Comparing our business objectives and the result shown above, we can verify that Mean Shift found too few clusters and too many outliers. So our Mean Shift model is not acceptable.
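
The bandwidth does not have to stay fixed at a single estimate. As a minimal sketch, assuming synthetic blobs in place of the real data (`X_demo` and the chosen quantiles are illustrative), the `quantile` argument of `estimate_bandwidth` can be varied to see how the cluster and noise counts respond.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled feature matrix (assumption: not the real data)
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=147)

ms_results = []
for q in [0.1, 0.2, 0.3]:
    # A smaller quantile gives a smaller bandwidth and usually more clusters
    bw = estimate_bandwidth(X_demo, quantile=q, random_state=147)
    ms = MeanShift(bandwidth=bw, bin_seeding=True, cluster_all=False).fit(X_demo)
    n_clusters = len(set(ms.labels_[ms.labels_ >= 0]))
    n_noise = int((ms.labels_ == -1).sum())
    ms_results.append((q, round(float(bw), 3), n_clusters, n_noise))

for row in ms_results:
    print(row)
```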

## Best model

Based on the above analysis, our best clustering model is **K-Means with 6 clusters**.
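
To hand the segments over to domain experts, it may help to express cluster centers on the original 0-4 rating scale. The following is a sketch under stated assumptions: the data is synthetic, the three column names are borrowed from the renamed features purely for illustration, and `inverse_transform` is used to undo the MinMaxScaler step.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic ratings on the 0-4 scale (hypothetical values, not the real data)
rng = np.random.default_rng(147)
cols = ['juice_bars', 'museums', 'beaches']
raw = pd.DataFrame(rng.uniform(0, 4, size=(120, 3)).round(2), columns=cols)

# Scale, then cluster, mirroring the workflow in this notebook
scaler = MinMaxScaler()
X_demo = scaler.fit_transform(raw)
km = KMeans(n_clusters=6, n_init=10, random_state=147).fit(X_demo)

# Map centroids back to the original rating scale and attach cluster sizes
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=cols)
centers['size'] = np.bincount(km.labels_, minlength=6)
print(centers.round(2))
```

Each row then reads as the average rating profile of one segment, alongside the number of travelers in it.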

# Insights and key findings

After training different models and comparing results, we can make some notes about the quality of our modeling workflow and hidden patterns in this dataset.

- Several high correlations between some features may have degraded the overall performance of our models. Also, repeating the process after transforming skewed features may lead to better results in all models.
- The K-Means workflow produced a final model that aligns well with our business objectives. However, with a 10-dimensional feature space, we can't get a sufficiently accurate visual cue about the shape and density of final clusters.
- Thinking about the inner logic of DBSCAN clustering, the constant output of the algorithm may be due to varying cluster densities in the dataset. Also, high dimensionality of our dataset may force data points to appear too far apart to distance-based algorithms. Repeating the process with lower dimensions (2D or 3D) may be worth trying.
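- The Mean Shift model was trained with a single bandwidth returned by estimate_bandwidth. The bandwidth is a tunable hyperparameter, though: Scikit-Learn's estimate_bandwidth accepts a quantile argument and MeanShift accepts any explicit bandwidth value. To get the most out of this model, trying several bandwidths and some more study may be helpful.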
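
The first note above mentions transforming skewed features. A minimal sketch of one common option, a log1p transform applied only to features above the 0.75 skewness threshold, is shown below. The data is synthetic (`demo` and its two columns are made up), and log1p is an assumption about which transform to try, not a step taken in this project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one right-skewed and one roughly symmetric feature
rng = np.random.default_rng(147)
demo = pd.DataFrame({
    'skewed': rng.exponential(scale=0.5, size=1000),
    'symmetric': rng.normal(loc=2.0, scale=0.3, size=1000),
})

# Select features whose absolute skewness exceeds the threshold
skew_before = demo.skew()
skew_cols = skew_before[skew_before.abs() > 0.75].index

# log1p is defined for non-negative values, which holds for 0-4 ratings
transformed = demo.copy()
transformed[skew_cols] = np.log1p(transformed[skew_cols])
skew_after = transformed.skew()

print(pd.DataFrame({'before': skew_before, 'after': skew_after}).round(3))
```

The transform would need to run before scaling, so that MinMaxScaler sees the transformed values.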

# Next steps

Based on the trained models and data preprocessing steps, we may be able to improve data quality and plan for some more experiments. Hopefully, these actions will lead to models that perform better and make room for easier interpretation.

## Possible faults

No dimensionality reduction was incorporated in our data preprocessing. This may be vital for the following reasons:

- Our dataset may be noisy and it's usually good practice to apply techniques like Principal Component Analysis (PCA) before modeling.
- Because most clustering algorithms are distance-based, using a lower-dimensional feature space may provide much better results.

## Action plan

Repeating the whole modeling with a more carefully refined dataset is definitely worth trying. The following actions can be taken to improve our preprocessing pipeline:
- Apply PCA with a range of components to account for high correlations and possible noise.
- To enable cluster visualization, prefer a PCA transform with 2 or 3 components.
- Try other approaches, like Multi-Dimensional Scaling (MDS), to shrink dataset’s feature space.
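
The PCA step in this plan could look roughly like the sketch below. It assumes synthetic blobs in place of the real data (`X_raw` is made up) and simply reuses K = 6 for illustration; whether two components keep enough variance on the real dataset would need to be checked.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with 10 features (assumption: not the real data)
X_raw, _ = make_blobs(n_samples=300, centers=5, n_features=10, random_state=147)
X_demo = MinMaxScaler().fit_transform(X_raw)

# Project onto 2 principal components so clusters can be plotted
pca = PCA(n_components=2, random_state=147)
X_2d = pca.fit_transform(X_demo)

# Share of total variance retained by the 2 components
explained = float(pca.explained_variance_ratio_.sum())

# Cluster in the reduced space
labels_2d = KMeans(n_clusters=6, n_init=10, random_state=147).fit_predict(X_2d)
print(round(explained, 3))
```

The two columns of `X_2d` can then be used as scatter plot axes, colored by `labels_2d`.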

# Acknowledgements

Thank you so much for taking the time to review and grade this project.

I’d also like to thank our brilliant instructors (especially Dr. Joseph Santarcangelo) for their excellent learning materials. It’s been a real pleasure taking this course.