# Dimension Reduction

In this notebook you will learn how to perform dimension reduction using a few common techniques, namely data summaries, data conversion, and PCA. You will also use some common data exploration methods and learn how to make useful interactive plots. Most of the material is based on examples from the textbook.

> (c) 2019 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> Code included in
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.

Let's get started by importing all required libraries:

In [None]:
import dmba

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn import preprocessing

%matplotlib inline

## Dataset

We are going to use the Boston Housing dataset again. Let's load it directly from the textbook's Python library (`dmba`) and prepare it for further analysis.

In [None]:
df = dmba.load_data('BostonHousing.csv')
df = df.rename(columns={'CAT. MEDV': 'CAT_MEDV'})
df.head()

## Data summaries

Let's see how to explore common summary statistics for a particular variable, in this case CRIM (per capita crime rate by town). Some of the operations were already used in the `pandas` notebook. Now we will compute mean, standard deviation, min, max, median, length, and missing values of CRIM.

In [None]:
print('Mean : ', df.CRIM.mean())
print('Std. dev : ', df.CRIM.std())
print('Min : ', df.CRIM.min())
print('Max : ', df.CRIM.max())
print('Median : ', df.CRIM.median())
print('Length : ', len(df.CRIM))
print('Number of missing values : ', df.CRIM.isnull().sum())

These data summaries give us an idea of the distribution of CRIM, but as we discussed in the lectures, it is best to always look at the histogram of a numerical variable to fully understand it:

In [None]:
fig = px.histogram(df, x="CRIM")
fig.show()

We can see that the distribution of CRIM is strongly right-skewed and looks roughly log-normal, with most values concentrated between 0 and 1. We used `plotly` to draw the histogram; it produces interactive plots, so you can zoom in and pan to explore the data.
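
One quick way to back up a visual impression of skewness is to compare the sample skewness before and after a log transform. The sketch below uses synthetic log-normal data as a stand-in (it does not load the Boston Housing data, and the parameters are arbitrary assumptions); on the real data you would apply the same two calls to `df.CRIM`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a variable like CRIM (not the real data)
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=500), name="x")

raw_skew = x.skew()          # strongly positive for a long right tail
log_skew = np.log(x).skew()  # close to 0 if the data are roughly log-normal

print(f"skewness raw: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness near zero after the log transform is consistent with a log-normal shape, but it is not a formal test.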

Now, let's have a look at the box plot for CRIM using a log scale to better visualize it:

In [None]:
fig = px.box(df, x="CRIM", log_x=True)
fig.show()

In [None]:
# TODO: make a box plot for the distribution of CRIM per CAT_MEDV


We saw how to explore a particular variable. But we can also have a look at the summary statistics for all the variables in the dataset. So let's compute mean, standard dev., min, max, median, length, and missing values for all variables:

In [None]:
pd.DataFrame({'mean': df.mean(),
              'sd': df.std(),
              'min': df.min(),
              'max': df.max(),
              'median': df.median(),
              'length': len(df),
              'miss.val': df.isnull().sum(),
             })

These data summaries are very useful for understanding the dataset. In this particular case we have a complete dataset, but we could have discovered that a particular variable has many missing records and decided to exclude it from the analysis, which is a form of dimension reduction.
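
As a minimal sketch of that idea, the snippet below drops every column whose share of missing values exceeds a cut-off. The toy table and the 50% cut-off are assumptions made for illustration only; they do not come from the Boston Housing data.

```python
import numpy as np
import pandas as pd

# Toy data: column "b" is mostly missing (hypothetical example)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [0.5, np.nan, 0.7, 0.9],
})

max_missing = 0.5                    # assumed cut-off, not from the text
missing_share = toy.isnull().mean()  # fraction of missing values per column
keep = missing_share[missing_share <= max_missing].index
reduced = toy[keep]

print(missing_share)
print(list(reduced.columns))
```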

Another useful method is to look at the pair-wise correlations:

In [None]:
df.corr().round(2)

It is always useful to see all the numbers in table format, but even better to look at a figure, which shows insights more quickly:

In [None]:
sns.set(rc={"figure.figsize":(15, 10)})
dataplot = sns.heatmap(df.corr(), cmap="RdBu", vmin=-1, vmax=1, annot=True)
plt.show()

Do any correlations stand out? Do you think you can remove any variable based on the correlation matrix? If yes, does it make sense to remove it/them? 
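
To answer such questions systematically, one option is to list the variable pairs whose absolute correlation exceeds a threshold. The following is only a sketch on synthetic data; the column names and the 0.9 threshold are assumptions, and whether a variable can actually be dropped still depends on domain knowledge.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                     # unrelated to "a"
})

threshold = 0.9  # assumed cut-off
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs > threshold]

print(strong_pairs)
```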

We can also create pivot tables to explore the data. Let's see how to do it by checking the average values for MEDV grouped by RM and CHAS:

In [None]:
# Binning RM variable
df['RM_bin'] = pd.cut(df.RM, range(0, 10), labels=False)
df.head()

In [None]:
# Pivot table: checking average value of MEDV for all combinations of RM and CHAS
pd.pivot_table(df, values='MEDV', index=['RM_bin'], columns=['CHAS'],
               aggfunc='mean', margins=True)

This table shows that low values of RM (average number of rooms) are only observed for CHAS = 0, i.e. for properties that do not bound the river. This is a useful insight about the data.
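
To make the meaning of the `RM_bin` codes concrete, here is a small sketch of how `pd.cut` assigns values to bins. The room counts are made-up values. With the default settings the intervals are closed on the right, so a value of exactly 6.0 falls into `(5, 6]`.

```python
import pandas as pd

rooms = pd.Series([3.5, 6.0, 6.2, 8.7])  # made-up values

# Edges 0, 1, ..., 9 define the intervals (0, 1], (1, 2], ..., (8, 9]
codes = pd.cut(rooms, range(0, 10), labels=False)  # integer bin codes
intervals = pd.cut(rooms, range(0, 10))            # interval labels

print(codes.tolist())
print(intervals.astype(str).tolist())
```

With `labels=False` the result is the zero-based position of the bin; omitting it returns the interval itself, which is often easier to read in a pivot table.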

In [None]:
# TODO: bin the AGE variable with a bin size of 10 
# then create a pivot table of AGE_bin vs CHAS that displays the median of MEDV 
# and do show the labels this time


## Data conversion

We can also use data conversion to reduce dimensions. Let's now have a look at aggregations:

In [None]:
# Getting cross tabulation of two variables and converting it into percentages
tbl = pd.crosstab(df.CAT_MEDV, df.ZN.astype('str'))
tbl = tbl[['0.0'] + list(tbl.columns[2:]) + ['100.0']] # re-ordering columns
propTbl = tbl / tbl.sum()
propTbl.round(2)

In [None]:
# And if we want to see the total counts
tbl
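
As an aside, `pd.crosstab` can compute column proportions directly through its `normalize` argument. The sketch below, on a made-up table, checks that `normalize="columns"` gives the same result as dividing the counts by the column totals.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 1, 0],
    "group": ["x", "x", "x", "y", "y", "y"],
})

counts = pd.crosstab(toy.label, toy.group)
by_hand = counts / counts.sum()  # divide each column by its total
built_in = pd.crosstab(toy.label, toy.group, normalize="columns")

print(built_in.round(2))
```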

Let's now visualize the results:

In [None]:
fig = px.bar(propTbl.transpose()*100., title="Distribution of CAT_MEDV by ZN",
             labels={'value':'Percentage'})
fig.show()

Can you use this insight to reduce dimensions? Perhaps by grouping similar categories of ZN together?

In [None]:
# TODO: make a similar plot that displays the distribution of CAT_MEDV per RAD


## Optional: Principal Component Analysis (PCA)

Let's now see how to compute principal components on two dimensions using a new dataset about cereals and their health rating.

In [None]:
# Loading dataset
df_cereals = dmba.load_data('Cereals.csv')
df_cereals.head()

Let's choose the variables calories and rating to see how PCA works:

In [None]:
# Fitting PCA
pcs = PCA(n_components=2)
pcs.fit(df_cereals[['calories', 'rating']])

The importance of components can be assessed using the explained variance:

In [None]:
pcsSummary = pd.DataFrame({'Standard deviation': np.sqrt(pcs.explained_variance_),
                           'Proportion of variance': pcs.explained_variance_ratio_,
                           'Cumulative proportion': np.cumsum(pcs.explained_variance_ratio_)})
pcsSummary = pcsSummary.transpose()
pcsSummary.columns = ['PC1', 'PC2']
pcsSummary.round(4)

We can see that the first principal component holds over 86% of the variance and therefore most of the information, in the sense of variability.
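
To see where these numbers come from, the sketch below checks on synthetic two-dimensional data (not the cereals data) that `explained_variance_` matches the eigenvalues of the sample covariance matrix, and that the proportions are those eigenvalues divided by their sum.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
x = rng.normal(size=300)
# Second made-up variable is negatively related to the first
X = np.column_stack([x, -0.8 * x + rng.normal(scale=0.5, size=300)])

pca = PCA(n_components=2).fit(X)

# Eigenvalues of the sample covariance matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(pca.explained_variance_)
print(eigvals)
print(pca.explained_variance_ratio_, eigvals / eigvals.sum())
```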

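The `components_` attribute of `pcs` holds the individual components. In scikit-learn each row of this array is a principal component and each column a variable, so we transpose it: in the resulting table the columns are the principal components `PC1` and `PC2`, and the rows are the variables in the order they appear in the input matrix, `calories` and `rating`. The entries are the linear coefficients (weights) used to transform the original variables.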

In [None]:
pcsComponents = pd.DataFrame(pcs.components_.transpose(), 
                             columns=['PC1', 'PC2'], 
                             index=['calories', 'rating'])
pcsComponents
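
The orientation of `components_` is easy to mix up, so here is a short check on random data with three made-up variables: the array has one row per component and one column per variable, and the rows are orthonormal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))  # 100 rows, 3 made-up variables

pca = PCA(n_components=2).fit(X)

print(pca.components_.shape)   # (n_components, n_features)

# Rows are unit-length and mutually orthogonal, so this is the identity
gram = pca.components_ @ pca.components_.T
print(gram.round(6))
```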

Use the `transform` method to get the scores, i.e. the observations projected onto the principal components:

In [None]:
scores = pd.DataFrame(pcs.transform(df_cereals[['calories', 'rating']]), 
                      columns=['PC1', 'PC2'])
scores.head()
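
For intuition, with the default settings (no whitening) the scores are the centered data multiplied by the transposed component matrix. The sketch below verifies this on synthetic data; the scales and offsets are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Two made-up variables with different scales and non-zero means
X = rng.normal(size=(50, 2)) * [10.0, 1.0] + [100.0, 5.0]

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# Center with the fitted mean, then project onto the components
by_hand = (X - pca.mean_) @ pca.components_.T

print(np.allclose(scores, by_hand))
print(scores.mean(axis=0).round(6))
```

Because the data are centered before projection, each score column has a mean of (numerically) zero.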

Now let's perform a principal component analysis of the whole table ignoring the first three non-numerical columns:

In [None]:
pcs = PCA()
pcs.fit(df_cereals.iloc[:, 3:].dropna(axis=0))
pcsSummary = pd.DataFrame({'Standard deviation': np.sqrt(pcs.explained_variance_),
                           'Proportion of variance': pcs.explained_variance_ratio_,
                           'Cumulative proportion': np.cumsum(pcs.explained_variance_ratio_)})
pcsSummary = pcsSummary.transpose()
pcsSummary.columns = ['PC{}'.format(i) for i in range(1, len(pcsSummary.columns) + 1)]
pcsSummary.round(4)

We can see that the first two principal components hold over 92% of the variance. Now let's see which variables have the highest weights in them:

In [None]:
pcsComponents = pd.DataFrame(pcs.components_.transpose(), 
                             columns=pcsSummary.columns, 
                             index=df_cereals.iloc[:, 3:].columns)
pcsComponents.iloc[:,:5]

We can see that sodium and potassium have the highest weights for PC1 and PC2, which would suggest that they are the most important variables for explaining the variance in the data. However, this is likely a scaling effect. Just have a look at the original records:

In [None]:
df_cereals.head()
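
The scaling effect can be reproduced in isolation. In the sketch below, two independent synthetic variables differ only in their units (standard deviations of 100 and 1, chosen arbitrarily). Without standardization the first component is almost entirely the large-scale variable; after `preprocessing.scale` the variance is split roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing

rng = np.random.default_rng(5)
# Two independent made-up variables on very different scales
X = np.column_stack([rng.normal(scale=100.0, size=500),
                     rng.normal(scale=1.0, size=500)])

raw = PCA().fit(X)
scaled = PCA().fit(preprocessing.scale(X))

print(raw.components_[0].round(3))              # weights of PC1, unscaled
print(raw.explained_variance_ratio_.round(3))   # dominated by the first variable
print(scaled.explained_variance_ratio_.round(3))
```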

The variables are measured on very different scales, which highlights the importance of standardizing the data before running PCA. We will now do this with the `scale` function of `preprocessing`:

In [None]:
pcs = PCA()
pcs.fit(preprocessing.scale(df_cereals.iloc[:, 3:].dropna(axis=0)))
pcsSummary = pd.DataFrame({'Standard deviation': np.sqrt(pcs.explained_variance_),
                           'Proportion of variance': pcs.explained_variance_ratio_,
                           'Cumulative proportion': np.cumsum(pcs.explained_variance_ratio_)})
pcsSummary = pcsSummary.transpose()
pcsSummary.columns = ['PC{}'.format(i) for i in range(1, len(pcsSummary.columns) + 1)]
pcsSummary.round(4)
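
Rather than reading the cumulative proportion off the table by eye, the number of components needed to reach a target can be computed. This is a sketch with made-up proportions and an assumed 90% target; in the notebook you would pass `pcs.explained_variance_ratio_` instead of `ratios`.

```python
import numpy as np

# Made-up proportions of variance, ordered from PC1 onwards
ratios = np.array([0.5, 0.3, 0.15, 0.05])
cumulative = np.cumsum(ratios)

target = 0.90  # assumed threshold

# Position of the first cumulative value that reaches the target, plus one
n_needed = int(np.argmax(cumulative >= target)) + 1
print(n_needed)  # prints 3
```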

Now we see that we need 7 components to explain over 90% of the variance in the data. Let's see which variables have the highest weights:

In [None]:
pcsComponents = pd.DataFrame(pcs.components_.transpose(), 
                             columns=pcsSummary.columns, 
                             index=df_cereals.iloc[:, 3:].columns)
pcsComponents.iloc[:,:7]

Fiber, potassium and rating have high weights for PC1, followed by calories, fat, sugars and weight for PC2. Let's now visualize the first two principal components and see whether this brings extra insights:

In [None]:
# Dropping records with NaN values
df_cereals_red = df_cereals.dropna(axis=0)
df_cereals_red = df_cereals_red.reset_index(drop=True)

# Re-projecting data to new system
projected = pcs.fit_transform(preprocessing.scale(df_cereals_red.iloc[:, 3:]))
scores = pd.DataFrame(projected,
                      columns=[f'PC{i}' for i in range(1, projected.shape[1] + 1)])

# Adding column with cereal names
df_pca = pd.concat([df_cereals_red['name'], scores[['PC1', 'PC2']]], axis=1)

df_pca.head()

In [None]:
fig = px.scatter(df_pca, x="PC1", y="PC2", text="name", height=700)
fig.update_traces(textposition="bottom center")
fig.show()