# Setting up the notebook

## Import the packages we need

Several existing libraries make our job easier. By importing them, we can use their functionality without writing the code ourselves.

Import these packages at the **top** of the notebook, before running any other code.

- [`pandas`](https://pandas.pydata.org/) "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool"

- [`sklearn`](https://scikit-learn.org/) is a library of tools for predictive and descriptive data analysis written in Python

- [`matplotlib`](https://matplotlib.org/) is a library for creating visualizations

- [`seaborn`](https://seaborn.pydata.org/) is a library for creating visualizations, built on top of `matplotlib`

- [`numpy`](https://numpy.org/) is a library for numerical computing with arrays

In [None]:
from math import ceil

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler, StandardScaler


In [None]:
sns.set()


# Import the Data

In [None]:
## To load our CSV file into a dataframe,
## we first need to give the notebook access to our Google Drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
## Load csv file into a dataframe 

data_path = "/content/drive/MyDrive/FAI2223_Notebooks/data/spaceship_titanic_dataset.csv"
df = pd.read_csv(data_path)

## Check file loaded
df.head()

In [None]:
## If you have problems with Google Drive above, try this cell instead.
## Un-comment the lines below by removing the "#" character at the start of each line:

#data_path = "https://raw.githubusercontent.com/fpontejos/FAI_2223/main/data/spaceship_titanic_dataset.csv"
#df = pd.read_csv(data_path)

#df.head()

# Dataset description

Our dataset was taken from a Kaggle competition called Spaceship Titanic (Howard et al., 2022).<a name="cite1"></a>[<sup>[1]</sup>](#cite-1)

Given the details of the passengers on board a spaceship, the goals are to:

1. create a predictive model for which passengers were transported to an alternate dimension.
2. create a descriptive model of the passengers.

The dataset has the following columns:

- **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.

- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

- **Destination** - The planet the passenger will be debarking to.

- **Age** - The age of the passenger.

- **VIP** - Whether the passenger has paid for special VIP service during the voyage.

- **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

- **Name** - The first and last names of the passenger.

- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
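
The `PassengerId` and `Cabin` formats described above can be split into their parts with pandas string methods. The sketch below uses a few made-up rows rather than the real data, and the new column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`) are hypothetical names chosen only for illustration.

```python
import pandas as pd

# Toy rows with made-up values that follow the formats described above
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/1/S", None],
})

# PassengerId has the form gggg_pp: the group, then the number within the group
toy[["Group", "NumberInGroup"]] = toy["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side; a missing cabin stays missing in all three parts
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```

Passengers that share the same `Group` value are travelling together, which is one way the identifier could later be turned into a usable feature.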

---
<a name="cite-1"></a>1. [^](#cite1) Howard, A., Chow, A., & Holbrook, R. (2022). Spaceship titanic. https://kaggle.com/competitions/spaceship-titanic

# Exploring and Understanding the Data

In [None]:
## Define a function that plots multiple histograms

def plot_multiple_histograms(data, feats, title="Numeric Variables' Histograms"):

    # Prepare figure. Create individual axes where each histogram will be placed
    fig, axes = plt.subplots(2, ceil(len(feats) / 2), figsize=(20, 11))

    # Plot data
    # Iterate across axes objects and draw one histogram per feature
    # (ax.hist() draws on a specific axes object, unlike plt.hist()):
    for ax, feat in zip(axes.flatten(), feats): # Notice the zip() function and flatten() method
        ax.hist(data[feat])
        ax.set_title(feat)

    # Layout
    # Add a centered title to the figure:
    plt.suptitle(title)

    plt.show()

    return


## Define a function that plots multiple box plots

def plot_multiple_boxplots(data, feats, title="Numeric Variables' Box Plots"):

    # Prepare figure. Create individual axes where each box plot will be placed
    fig, axes = plt.subplots(2, ceil(len(feats) / 2), figsize=(20, 11))

    # Plot data
    # Iterate across axes objects and draw one box plot per feature:
    for ax, feat in zip(axes.flatten(), feats): # Notice the zip() function and flatten() method
        sns.boxplot(x=data[feat], ax=ax)
        ax.set_title(feat)

    # Layout
    # Add a centered title to the figure:
    plt.suptitle(title)

    plt.show()

    return


def plot_corrmatrix(df, feats, method="pearson"):
  # Prepare figure
  fig = plt.figure(figsize=(10, 8))

  # Obtain correlation matrix, rounded to 2 decimal places, using the DataFrame corr() method and np.round().
  corr = np.round(df[feats].corr(method=method), decimals=2)

  # Plot heatmap of the correlation matrix
  sns.heatmap(data=corr, annot=True, cmap=sns.diverging_palette(220, 10, as_cmap=True), 
              vmin=-1, vmax=1, center=0, square=True, linewidths=.5)

  # Layout
  fig.subplots_adjust(top=0.95)
  fig.suptitle("Correlation Matrix", fontsize=20)

  plt.show()
  return 

## Define a function that plots multiple countplots

def plot_categorical_frequencies(data, feats, 
                             title="Categorical Variables' Counts"):
  
    # Prepare figure. Create individual axes where each count plot will be placed
    fig, axes = plt.subplots(2, ceil(len(feats) / 2), figsize=(20, 11))

    # Plot data
    # Iterate across axes objects and draw one count plot per feature.
    # Use the `data` argument (not the global df) so the function plots what it is given:
    for ax, feat in zip(axes.flatten(), feats): 
        sns.countplot(x=data[feat].astype(object), ax=ax, color='#007acc')

    # Layout
    # Add a centered title to the figure:
    plt.suptitle(title)


    plt.show()

    return



## Insights from previous lab

### We have different kinds of variables

In [None]:
metric_features = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
non_metric_features_all = ["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"]
non_metric_features = ["HomePlanet", "CryoSleep", "Destination", "VIP"]
target_variable = "Transported"

### We looked at the distributions of their values

![Histograms](https://raw.githubusercontent.com/fpontejos/FAI_2223/main/images/histograms_original.png)

In [None]:
#plot_multiple_histograms(df, metric_features)


Refresher on how to interpret boxplots and histograms:

https://www.open.edu/openlearn/science-maths-technology/mathematics-statistics/interpreting-data-boxplots-and-tables/content-section-2.5

https://statisticsbyjim.com/basics/histograms/


### We have some outliers

![Box Plots](https://raw.githubusercontent.com/fpontejos/FAI_2223/main/images/boxplots_original.png)

In [None]:
#plot_multiple_boxplots(df, metric_features)


### We have some missing values

In [None]:
## Count of missing values
df.isna().sum()


### Our values have different magnitudes

In [None]:
df.describe()


### Some of our features are non-numeric 

In [None]:
df[non_metric_features_all].head(3)

In [None]:

df[non_metric_features_all].nunique()

In [None]:
df[metric_features].head(3)


## How do we deal with these issues?



# Data Preprocessing



### Missing Values

What should we replace our missing values with?

In [None]:
## Reminder of our missing values
df[metric_features].isna().sum()

In [None]:
## First we make a copy of our data. Why?

df_original = df.copy()
df_central = df.copy()
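
One way to see why the copy matters: assigning a dataframe to a new name does not duplicate it, so changes made through the new name also change the original. The sketch below uses a made-up one-column dataframe; the names `toy`, `alias` and `backup` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [20.0, None, 40.0]})

alias = toy            # a second name for the SAME dataframe
backup = toy.copy()    # an independent copy

# Filling missing values through the alias also changes `toy`,
# because both names point to one object
alias["Age"] = alias["Age"].fillna(0)

print(toy["Age"].isna().sum())     # missing values left in toy
print(backup["Age"].isna().sum())  # missing values left in the copy
```

With a copy kept aside, we can always compare against the untouched data or start over if a preprocessing choice turns out badly.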


#### Measures of Central Tendency: Mean

In [None]:
df_means = df_central[metric_features].mean()
df_means

#### Measures of Central Tendency: Median

In [None]:
df_medians = df_central[metric_features].median()
df_medians


#### Which one to use?

In [None]:
df_central['Spa'].hist(bins=10) ## Test other bin sizes
plt.show()

In [None]:
df_central.fillna(df_medians, inplace=True)


In [None]:
df_central[metric_features].isna().sum()
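
A small sketch of how the mean and the median behave on a skewed column. The numbers are made up for illustration (they are not taken from the dataset): most values are small and a single value is very large.

```python
import pandas as pd

# Made-up spending values: mostly small, one very large, one missing
spend = pd.Series([0, 0, 0, 10, 20, 5000, None], name="Spa")

mean_value = spend.mean()      # pulled upwards by the single large value
median_value = spend.median()  # stays close to the typical value

# Two candidate ways of filling the missing entry
filled_with_mean = spend.fillna(mean_value)
filled_with_median = spend.fillna(median_value)

print("mean:  ", round(mean_value, 2))
print("median:", median_value)
```

When a histogram shows a long tail like this, the median is usually the more representative fill value; for roughly symmetric columns the two measures are close and the choice matters less.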

#### What about the non-numeric values?

In [None]:
df_central[non_metric_features].isna().sum()


In [None]:
#plot_categorical_frequencies(df, non_metric_features)


![Count Plots](https://raw.githubusercontent.com/fpontejos/FAI_2223/main/images/value_counts_original.png)


In [None]:
df_modes = df_central[non_metric_features].mode().loc[0]
df_modes


In [None]:
df_central.fillna(df_modes, inplace=True)


In [None]:
df_central[non_metric_features].isna().sum()
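
To make the `mode().loc[0]` step above easier to follow, here is a minimal sketch on a made-up dataframe. The category labels are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up categorical columns with one missing value each
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Mars", None],
    "Destination": ["A", "B", "B", None],
})

# mode() returns a dataframe because a column can have several tied modes;
# .loc[0] keeps the first mode of each column as a Series indexed by column name
modes = toy.mode().loc[0]

# fillna() with a Series fills each column with the value under its own name
filled = toy.fillna(modes)

print(modes)
print(filled)
```

Because the fill values are matched by column name, the same pattern works for the medians of the numeric columns.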


In [None]:
plot_categorical_frequencies(df_central, non_metric_features)


In [None]:
## Once we are happy with our choices, copy the result back to df
df = df_central.copy()


#### Questions?

### Treating Outliers

In [None]:
df_outliers = df.copy()

In [None]:
## Uncomment line below to run the plotting code
#plot_multiple_boxplots(df, metric_features)

![Box Plots](https://raw.githubusercontent.com/fpontejos/FAI_2223/main/images/boxplots_original.png)

#### Using Inter-Quartile Range (IQR)

In [None]:
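def remove_outliers_iqr(df, feats, qa=0.25, qb=0.75):
  # Work on a copy so the caller's dataframe is left untouched
  df_ = df.copy()

  # Lower and upper quantiles (by default the 1st and 3rd quartiles)
  q_low = df_[feats].quantile(qa)
  q_high = df_[feats].quantile(qb)
  iqr = (q_high - q_low)

  # Values further than 1.5 * IQR from the quantiles count as outliers
  upper_lim = q_high + 1.5 * iqr
  lower_lim = q_low - 1.5 * iqr

  # One boolean filter per feature: True where the value is within the limits
  iqr_filters = []
  for f in feats:
      llim = lower_lim[f]
      ulim = upper_lim[f]
      iqr_filters.append(df_[f].between(llim, ulim, inclusive='both'))

  # Keep only the rows that are within the limits for every feature.
  # A plain boolean array avoids index-alignment problems when
  # df does not have the default 0..n-1 index
  keep = np.all(iqr_filters, axis=0)
  return df_[keep]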



In [None]:
df_out_iqr = remove_outliers_iqr(df_outliers, metric_features)

print('Percentage of data kept after removing outliers with IQR method:')
print(np.round(100 * df_out_iqr.shape[0] / df_outliers.shape[0], 2))
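
To see how the IQR limits are computed, here is a worked sketch on a single made-up column (the numbers are illustrative, not from the dataset). It follows the same rule as the function above: keep values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.

```python
import pandas as pd

# Made-up values: several zeros, a few moderate values, one extreme value
values = pd.Series([0, 0, 0, 0, 10, 20, 30, 40, 1000])

q25 = values.quantile(0.25)
q75 = values.quantile(0.75)
iqr = q75 - q25

lower_lim = q25 - 1.5 * iqr
upper_lim = q75 + 1.5 * iqr

# True where the value lies within the limits (limits included)
inside = values.between(lower_lim, upper_lim, inclusive="both")

print("limits:", lower_lim, upper_lim)
print("kept:  ", inside.sum(), "of", len(values))
```

Note that the limits depend only on the middle half of the data, so a column where most values are close together gets narrow limits and can lose many rows.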


What do you think of this number?

#### Using manual threshold

In [None]:
plot_multiple_boxplots(df_outliers, metric_features)


In [None]:
manual_filters = (
    (df_outliers['RoomService']<=8000)
    &
    (df_outliers['FoodCourt']<=20000)
    &
    (df_outliers['ShoppingMall']<=10000)
    &
    (df_outliers['Spa']<=15000)
    &
    (df_outliers['VRDeck']<=14000)

)

df_out_manual = df_outliers[manual_filters]


In [None]:
print('Percentage of data kept after removing outliers with manual method:')
print(np.round(100 * df_out_manual.shape[0] / df_outliers.shape[0], 2))


What do you think of this number?

In [None]:
df = df_out_manual.copy()

#### Do we have to remove the rows?

We will revisit this question later.


#### Questions?

### Data Standardization

Why do we need to do this?

#### MinMaxScaler()

Rescales each feature so that its values lie in the range [0, 1].

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
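
With its default settings, `MinMaxScaler` computes `(x - min) / (max - min)` for each column. The sketch below checks this by hand on a made-up single-column array; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for one feature (one column, four rows)
x = np.array([[0.0], [5.0], [10.0], [20.0]])

# Manual min-max scaling, column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation with scikit-learn (default feature_range=(0, 1))
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses all the others towards 0. This is one reason to treat outliers before scaling.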

In [None]:
## Again, make a copy first
df_minmax = df.copy()


In [None]:
## Initialize MinMaxScaler
mm_scaler = MinMaxScaler()

## Get the scaled values
mm_scaled_feat = mm_scaler.fit_transform(df_minmax[metric_features])

## Replace original metric_features values with mm_scaled_feat values
df_minmax[metric_features] = mm_scaled_feat

In [None]:
df_minmax.describe().round(2)


In [None]:
plot_multiple_histograms(df_minmax, metric_features)


#### StandardScaler()

Also known as z-score scaling: each feature is centered at mean 0 and scaled to unit variance.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
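
With its default settings, `StandardScaler` computes `(x - mean) / std` for each column, using the population standard deviation (`ddof=0`). The sketch below checks this by hand on made-up values; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for one feature: mean 5, population standard deviation 2
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Manual z-scores (np.std uses ddof=0 by default, like StandardScaler)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation with scikit-learn
scaled = StandardScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```

Keep in mind that pandas' `std()` uses `ddof=1` by default, so `describe()` on a standardized dataframe can show a standard deviation slightly different from 1.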

In [None]:
## Again, make a copy first
df_standard = df.copy()


In [None]:
## Initialize StandardScaler
st_scaler = StandardScaler()

## Get the scaled values
st_scaled_feat = st_scaler.fit_transform(df_standard[metric_features])

## Replace original metric_features values with st_scaled_feat values
df_standard[metric_features] = st_scaled_feat

In [None]:
## Let's look at the statistics
## Rounded to two digits for easier viewing
df_standard.describe().round(2)


In [None]:
plot_multiple_histograms(df_standard, metric_features)


In [None]:
df = df_standard.copy()


### Feature Selection

#### Redundancy

We've already seen our correlation matrix. It shows which variables are highly correlated with each other; when two variables carry nearly the same information, we can choose to remove one of them.

In [None]:
plot_corrmatrix(df, metric_features, method="pearson")
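
Besides reading the heatmap, the highly correlated pairs can also be listed programmatically. This is a minimal sketch on a made-up dataframe; the column names `a`, `b`, `c` and the 0.9 threshold are assumptions chosen for illustration, not values recommended by the lab.

```python
import numpy as np
import pandas as pd

# Made-up data: b is almost a multiple of a, c is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute Pearson correlations between all pairs of columns
corr = toy.corr(method="pearson").abs()

# Keep only the part above the diagonal so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the (assumed) threshold
threshold = 0.9
high = pairs[pairs > threshold]

print(high)
```

For each pair listed, one of the two variables is a candidate for removal.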

#### Relevancy

We select only the variables that are relevant to the task. For example, if the task is to create a demographic segmentation, then we only keep demographic variables. For now, since we don't have a specific task, we consider all variables to be relevant.


#### Questions?

### Wrap up

#### Redo data exploration

Check if the data looks the way you expect it to. 

- Have you missed some outliers? 
- Are there still missing values?
- Is the data normalized?

This is an iterative process. It is likely you will change your preprocessing steps frequently throughout your group work.
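
Some of the checks above can also be written as code. The sketch below runs them on a made-up dataframe that has been scaled with `StandardScaler`; the variable names are illustrative, and checking for remaining outliers is still best done visually with the box plots.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the preprocessed dataframe
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0],
                    "Spa": [0.0, 0.0, 10.0, 200.0]})
feats = ["Age", "Spa"]
toy[feats] = StandardScaler().fit_transform(toy[feats])

# Are there still missing values?
no_missing = toy[feats].isna().sum().sum() == 0

# Is the data standardized? Means close to 0, population std close to 1
means_near_zero = np.allclose(toy[feats].mean(), 0)
stds_near_one = np.allclose(toy[feats].std(ddof=0), 1)

print(no_missing, means_near_zero, stds_near_one)
```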

## Next lab

KMeans Clustering

https://www.youtube.com/watch?v=5I3Ei69I40s


Hierarchical Clustering

https://dashee87.github.io/images/hierarch.gif