## Course - Dimensionality Reduction in Python

# Module 1 - Exploring High Dimensional Data

## Introduction

### Visually detecting redundant features
Data visualization is a crucial step in any data exploration. Let's use Seaborn to explore some samples of the US Army ANSUR body measurement dataset.

Two data samples have been pre-loaded as ansur_df_1 and ansur_df_2.

Seaborn has been imported as sns and matplotlib.pyplot as plt.

In [None]:
"""Create a pairplot of the ansur_df_1 data sample and color the points using the 'Gender' feature."""
# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(data=ansur_df_1, hue='Gender', diag_kind='hist')

# Show the plot
plt.show()

"""Two features are basically duplicates, remove one of them from the dataset."""

# Remove one of the redundant features
reduced_df = ansur_df_1.drop("body_height", axis=1)

# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(reduced_df, hue='Gender')

# Show the plot
plt.show()

"""Now create a pairplot of the ansur_df_2 data sample and color the points using the 'Gender' feature."""
# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(data=ansur_df_2, hue='Gender', diag_kind='hist')

# Show the plot
plt.show()

"""One feature has no variance, remove it from the dataset."""
# Remove the redundant feature
reduced_df = ansur_df_2.drop("n_legs", axis=1)

# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(reduced_df, hue='Gender', diag_kind='hist')

# Show the plot
plt.show()
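
The two findings above (a duplicated feature and a feature without variance) can also be checked numerically. The following is a minimal sketch on synthetic data, not the ANSUR samples themselves: the DataFrame `sample_df`, the column `stature_m`, and the 0.99 correlation threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an ANSUR sample (hypothetical values and columns)
rng = np.random.default_rng(0)
height_m = rng.normal(1.75, 0.08, size=100)
sample_df = pd.DataFrame({
    "stature_m": height_m,
    "body_height": height_m * 39.37,  # same measurement in other units
    "weight_kg": rng.normal(80, 12, size=100),
    "n_legs": 2,                      # identical for every row
    "Gender": rng.choice(["Male", "Female"], size=100),
})

numeric = sample_df.select_dtypes(include="number")

# Features with a single unique value have no variance
constant_cols = [c for c in numeric.columns if numeric[c].nunique() == 1]

# Feature pairs with an absolute correlation near 1 are near-duplicates
corr = numeric.drop(columns=constant_cols).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
duplicate_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.99  # assumed threshold
]

print(constant_cols)
print(duplicate_pairs)
```

A pairplot shows the same thing visually: a near-duplicate pair appears as points on a straight line, and a constant feature appears as a flat row and column.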

## t-SNE visualization of high-dimensional data

### Fitting t-SNE to the ANSUR data
t-SNE is a great technique for visual exploration of high-dimensional datasets. In this exercise, you'll apply it to the ANSUR dataset. You'll remove the non-numeric columns from the pre-loaded dataset df and fit TSNE to the resulting numeric dataset.

Instructions

- Drop the non-numeric columns from the dataset.
- Create a TSNE model with learning rate 50.
- Fit and transform the model on the numeric dataset.

In [None]:
from sklearn.manifold import TSNE

# Non-numerical columns in the dataset
non_numeric = ['Branch', 'Gender', 'Component']

# Drop the non-numerical columns from df
df_numeric = df.drop(non_numeric, axis=1)

# Create a t-SNE model with learning rate 50
m = TSNE(learning_rate=50)

# Fit and transform the t-SNE model on the numeric dataset
tsne_features = m.fit_transform(df_numeric)
print(tsne_features.shape)
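
The fitted output has one row per sample and two columns, since TSNE reduces to two dimensions by default. To plot the result, these two columns need to be attached to the DataFrame. The sketch below shows one way to do that on synthetic data; the `measure_*` columns and the `random_state` argument are assumptions added for illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Synthetic stand-in for the ANSUR data (hypothetical columns and values)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(60, 5)),
    columns=[f"measure_{i}" for i in range(5)],
)
df["Gender"] = rng.choice(["Male", "Female"], size=60)

# Keep only the numeric columns for t-SNE
df_numeric = df.drop("Gender", axis=1)

# random_state is set only to make this sketch reproducible
m = TSNE(learning_rate=50, random_state=0)
tsne_features = m.fit_transform(df_numeric)

# One row per sample, two t-SNE components
print(tsne_features.shape)

# Store the two components as new columns for plotting
df["x"] = tsne_features[:, 0]
df["y"] = tsne_features[:, 1]
```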

### t-SNE visualization of dimensionality
Time to look at the results of your hard work. In this exercise, you will visualize the output of t-SNE dimensionality reduction on the combined male and female ANSUR dataset. You'll create 3 scatterplots of the 2 t-SNE features ('x' and 'y'), which were added to the dataset df. In each scatterplot, you'll color the points according to a different categorical variable.

seaborn has already been imported as sns and matplotlib.pyplot as plt.

Instructions

1. Use seaborn's sns.scatterplot to create the plot and color the points by 'Component'.
2. Color the points of the scatterplot by 'Branch'.
3. Color the points of the scatterplot by 'Gender'.

In [None]:
# Color the points according to Army Component
sns.scatterplot(x="x", y="y", hue='Component', data=df)

# Show the plot
plt.show()
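
The cell above covers only the 'Component' step. Steps 2 and 3 follow the same pattern with a different `hue`. The sketch below draws both remaining plots side by side on a synthetic stand-in for df; the data values, the branch labels, and the two-panel layout are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for df with t-SNE columns 'x' and 'y'
# (values and category labels are made up)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=50),
    "y": rng.normal(size=50),
    "Branch": rng.choice(["Branch A", "Branch B"], size=50),
    "Gender": rng.choice(["Male", "Female"], size=50),
})

# One panel per remaining categorical variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feature in zip(axes, ["Branch", "Gender"]):
    sns.scatterplot(x="x", y="y", hue=feature, data=df, ax=ax)
    ax.set_title(f"Colored by {feature}")

# Show the plot
plt.show()
```

In the exercise itself, the equivalent is to rerun the original cell with hue='Branch' and then hue='Gender'.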