## Install a package from a different conda channel

### If you had issues installing ydata-profiling previously

Open a terminal (Anaconda Terminal) and run the following commands.

*If you are using a custom conda environment, run this first:*

    conda activate <ENV_NAME>
    
Then:

    conda install -c conda-forge ydata-profiling==4.5.0 -y
    
OR from a Jupyter cell:

    !conda install -c conda-forge ydata-profiling==4.5.0 -y

OR if conda doesn't work:

    !pip install -U ydata-profiling


- `<ENV_NAME>` is the name of the environment you'll be working with for this project
- The `!` prefix tells Jupyter to run the rest of the line as a terminal command
- The `-c` flag specifies the channel the package is installed from
- The `-y` flag confirms the installation of `ydata-profiling` and its dependencies in advance

We're installing `ydata-profiling` this way because Anaconda's `defaults` channel contains an outdated version of this package, whereas the `conda-forge` channel has an updated version.
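To confirm which version ended up in the active environment, a quick check from Python can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "ydata-profiling" is the distribution name used in the install commands above
print(installed_version("ydata-profiling"))
```

If this prints `None`, the package is not visible to the kernel you are running, which usually means the install went into a different environment.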


In [None]:
# Remember: library imports are ALWAYS at the top of the script, no exceptions!
import sqlite3
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

from itertools import product

## If you have an error here saying ydata_profiling is not available
## Run the instructions above
from ydata_profiling import ProfileReport

# for better resolution plots (optionally, you can change 'retina' to 'svg')
%config InlineBackend.figure_format = 'retina'

# Setting seaborn style
sns.set()

# Context
The data we will be using throughout the practical classes comes from a small relational database whose schema can be seen below:
![alt text](../figures/schema.png "Relational database schema")

# Reading the Data

In [None]:
# path to database
my_path = os.path.join("..", "data", "datamining.db")

# connect to the database
conn = sqlite3.connect(my_path)

# the query
query = """
select
    age, 
    income, 
    frq, 
    rcn, 
    mnt, 
    clothes, 
    kitchen, 
    small_appliances, 
    toys, 
    house_keeping,
    dependents, 
    per_net_purchase,
    g.gender, 
    e.education, 
    m.status, 
    r.description
from customers as c
    join genders as g on g.id = c.gender_id
    join education_levels as e on e.id = c.education_id
    join marital_status as m on m.id = c.marital_status_id
    join recommendations as r on r.id = c.recommendation_id
order by c.id;
"""

df = pd.read_sql_query(query, conn)
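The same connect, query, and load pattern can be tried on a throwaway in-memory database. The sketch below is an illustration only: the two toy tables and their rows are hypothetical and much smaller than the course database.

```python
import sqlite3

import pandas as pd

# Hypothetical toy tables, not the course database
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table genders (id integer primary key, gender text);
create table customers (id integer primary key, age integer, gender_id integer);
insert into genders values (1, 'F'), (2, 'M');
insert into customers values (1, 1980, 2), (2, 1975, 1), (3, 1990, 1);
""")

toy_query = """
select c.age, g.gender
from customers as c
    join genders as g on g.id = c.gender_id
order by c.id;
"""

# each foreign key in customers is replaced by its readable label through the join
toy_df = pd.read_sql_query(toy_query, conn)
conn.close()  # release the connection once the data is in memory

print(toy_df)
```

Closing the connection after loading is optional for a short notebook session, but it is a reasonable habit once the DataFrame holds everything you need.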

# Metadata
- *id* - The unique identifier of the customer
- *age* - The year of birth of the customer
- *income* - The income of the customer
- *frq* - Frequency: number of purchases made by the customer
- *rcn* - Recency: number of days since last customer purchase
- *mnt* - Monetary: amount of € spent by the customer in purchases
- *clothes* - Number of clothes items purchased by the customer
- *kitchen* - Number of kitchen items purchased by the customer
- *small_appliances* - Number of small_appliances items purchased by the customer
- *toys* - Number of toys items purchased by the customer
- *house_keeping* - Number of house_keeping items purchased by the customer
- *dependents* - Binary. Whether or not the customer has dependents
- *per_net_purchase* - Percentage of purchases made online
- *education* - Education level of the customer
- *status* - Marital status of the customer
- *gender* - Gender of the customer
- *description* - Description of the customer's last recommendation

# Initial Analysis

Pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

Pandas 10 min tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

In [None]:
# dataset head
df.head(10)

In [None]:
# dataset data types
df.dtypes

In [None]:
# count of missing values
df.isna().sum()

In [None]:
# duplicated observations
df.duplicated().sum()

In [None]:
# descriptive statistics
df.describe(include="all").T  # try with all and without all

In [None]:
# Use these cells to further explore the dataset
# CODE HERE

In [None]:
# How to get the names of the columns of the data?

df.columns

In [None]:
# How to get the values of one column of data?
# Different ways to access the values of a column

df['age']
df.age
df.loc[:,'age']


In [None]:
# How to get the shape of the data?

df.shape

In [None]:
# How to get the unique values of one column of data?

df['education'].unique()


## Problems:
- Duplicates?
- Data types?
- Missing values?
- Strange values?
- Descriptive statistics?

### Take a closer look and point out possible problems:

(hint: a missing value in pandas is represented by a NaN value)

In [None]:
# replace "" by nans
df.replace("", np.nan, inplace=True)

# count of missing values
df.isna().sum()

In [None]:
# check dataset data types again
df.dtypes

In [None]:
# fix wrong dtypes
df.dependents = df.dependents.astype("boolean")  # converting to "boolean" over "bool" allows preservation of NaNs
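The difference between `"bool"` and `"boolean"` is easiest to see on a tiny column. This is a minimal sketch on hypothetical toy values, not on the course data.

```python
import numpy as np
import pandas as pd

raw = pd.Series([1, 0, np.nan])  # toy column with one missing value

as_bool = raw.astype(bool)           # NaN silently becomes True
as_nullable = raw.astype("boolean")  # NaN is kept as a missing value (<NA>)

print(as_bool.tolist())
print(as_nullable.isna().sum())
```

With plain `bool`, the missing value is silently turned into `True`, so the missing-value count for that column would drop to zero without anyone noticing.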

In [None]:
# check descriptive statistics again
df.describe(include="all").T

# Visual Exploration

Matplotlib tutorials: https://matplotlib.org/stable/tutorials/index.html

Matplotlib gallery: https://matplotlib.org/stable/gallery/index.html

Seaborn tutorials: https://seaborn.pydata.org/tutorial.html


Seaborn gallery: https://seaborn.pydata.org/examples/index.html

### Matplotlib vs Seaborn:

**Matplotlib** - lower level. Lets you fully customize the plot's appearance

**Seaborn** - higher level. Complex off-the-shelf plots with one line. Matplotlib on steroids


In [None]:
#Define metric and non-metric features. Why?
non_metric_features = ["education", "status", "gender", "dependents", "description"]
metric_features = df.columns.drop(non_metric_features).to_list()

## Pyplot-style vs Object-Oriented-style
- Explicitly create figures and axes, and call methods on them (the "object-oriented (OO) style").
- Rely on pyplot to automatically create and manage the figures and axes, and use pyplot functions for plotting.

More details: https://matplotlib.org/matplotblog/posts/pyplot-vs-object-oriented-interface/
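As a rough side-by-side of the two styles, here is a minimal sketch that draws the same histogram both ways. The data is randomly generated for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).normal(size=200)  # toy data

# pyplot style: pyplot keeps track of the "current" figure and axes for us
plt.figure()
plt.hist(values, bins=10)
plt.title("pyplot style")

# OO style: we hold explicit references to the figure and the axes
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=10)
ax.set_title("OO style")

# in a notebook, finish the cell with plt.show() to render both figures
```

The OO style tends to pay off as soon as a figure has more than one axes, because each plotting call states exactly which axes it draws on.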

## Numeric Variables' Univariate Distribution

In [None]:
# Single Metric Variable Histogram
plt.hist(df["age"], bins=10)  # mess around with the bins
plt.title("age", y=-0.2)

plt.show()

In [None]:
# Try to visualize other variables' histograms

In [None]:
# Single Metric Variable Box Plot
sns.boxplot(y=df["age"])

plt.show()

What information can we extract from the plots above?

In [None]:
# All Numeric Variables' Histograms in one figure
sns.set()

# Prepare figure. Create individual axes where each histogram will be placed
fig, axes = plt.subplots(2, ceil(len(metric_features) / 2), figsize=(20, 11))

# Plot data
# Iterate across axes objects and associate each histogram (hint: use the ax.hist() instead of plt.hist()):
for ax, feat in zip(axes.flatten(), metric_features): # Notice the zip() function and flatten() method
    ax.hist(df[feat])
    ax.set_title(feat, y=-0.13)
    
# Layout
# Add a centered title to the figure:
title = "Numeric Variables' Histograms"

plt.suptitle(title)

# Save the figure
# create the exp_analysis directory first if it is not present
os.makedirs(os.path.join('..', 'figures', 'exp_analysis'), exist_ok=True)
    
plt.savefig(os.path.join('..', 'figures', 'exp_analysis', 'numeric_variables_histograms.png'), dpi=200)

plt.show()
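The `zip()` and `flatten()` combination used above can be looked at in isolation. The sketch below assumes five hypothetical features, so one of the six axes in the grid stays unused and is hidden explicitly; hiding the leftover axes is an addition that the cell above does not do.

```python
from math import ceil

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
toy = {f"feat_{i}": rng.normal(size=100) for i in range(5)}  # hypothetical features

fig, axes = plt.subplots(2, ceil(len(toy) / 2), figsize=(12, 6))
flat_axes = axes.flatten()  # 2D array of shape (2, 3) becomes a 1D array of 6 axes

# zip() stops at the shorter sequence, so only the first 5 axes get a histogram
for ax, (name, values) in zip(flat_axes, toy.items()):
    ax.hist(values)
    ax.set_title(name, y=-0.13)

# hide any leftover axes so the grid shows no empty frame
for ax in flat_axes[len(toy):]:
    ax.set_visible(False)
```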

In [None]:
# All Numeric Variables' Box Plots in one figure
sns.set()

# Prepare figure. Create individual axes where each box plot will be placed
fig, axes = plt.subplots(2, ceil(len(metric_features) / 2), figsize=(20, 11))

# Plot data
# Iterate across axes objects and associate each box plot (hint: use the ax argument):
for ax, feat in zip(axes.flatten(), metric_features): # Notice the zip() function and flatten() method
    sns.boxplot(x=df[feat], ax=ax)
    
# Layout
# Add a centered title to the figure:
title = "Numeric Variables' Box Plots"

plt.suptitle(title)

# Save the figure
# create the exp_analysis directory first if it is not present
os.makedirs(os.path.join('..', 'figures', 'exp_analysis'), exist_ok=True)
    
plt.savefig(os.path.join('..', 'figures', 'exp_analysis', 'numeric_variables_boxplots.png'), dpi=200)

plt.show()

### Insights:
- univariate distributions
- potential univariate outliers
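The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule. A minimal sketch of that rule on hypothetical toy values is shown below; whether such points are true outliers still depends on the context.

```python
import pandas as pd

toy_values = pd.Series([10, 12, 12, 13, 14, 15, 15, 16, 18, 60])  # toy data

# first and third quartiles, and the interquartile range between them
q1, q3 = toy_values.quantile([0.25, 0.75])
iqr = q3 - q1

# fences: anything outside [lower, upper] is flagged as a potential outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = toy_values[(toy_values < lower) | (toy_values > upper)]
print(outliers.tolist())
```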

--------------------------------------

### During our Exploratory Data Analysis (EDA), we must also account for:
- Coherence check
- Outliers
- Missing values
- Feature Engineering

### Depending on the context, various steps must be considered when performing Data Preprocessing. 

The most relevant steps are the following:
- Coherence check (find inconsistent values, missing values, outliers and any other problem you may find in your dataset)
- Data editing (fix inconsistent values)
- Data cleansing (drop observations - Outlier removal and removal of inconsistent values and/or features)
- Data wrangling (feature extraction/engineering and transformation)
- Data reduction (reducing the dimensionality of a dataset, producing summary statistics, reducing the number of records in a dataset)

# More Visualizations!

## Pairwise Relationship of Numerical Variables

In [None]:
# Scatter plot of two metric variables
plt.scatter(df["age"], df["income"], edgecolors="white")
plt.xlabel("age")
plt.ylabel("income")

plt.show()

In [None]:
# Pairwise Relationship of Numerical Variables
sns.set()

# Setting pairplot
sns.pairplot(df[metric_features], diag_kind="hist")

# Layout
plt.subplots_adjust(top=0.95)
plt.suptitle("Pairwise Relationship of Numerical Variables", fontsize=20)

plt.savefig(os.path.join('..', 'figures', 'exp_analysis', 'pairwise_relationship_of_numerical_variables.png'), dpi=200)
plt.show()

### Insights:
- possible bivariate relationships
- potential bivariate outliers
- univariate distributions (diagonal)

### Example of Visualization formatting

In [None]:
# making a joint plot with default formatting
sns.jointplot(data=df, x="house_keeping", y="frq")
plt.show()

In [None]:
# Making the same visualization with customized formatting
sns.set(style="ticks")
sns.jointplot(data=df, x="house_keeping", y="frq", kind="hex", color="red")
plt.show()

## Categorical/Low Cardinality Variables' Absolute Frequencies

In [None]:
# Single Non-Metric variable bar plot
sns.set() # this resets our formatting defaults
sns.countplot(x=df["education"])

plt.show()

In [None]:
# formatting the color of a simple bar chart
sns.countplot(x=df["education"], color='#007acc')

# try replacing the color with 'red' or 'blue' instead of using a hex color code.
# alternatively, you can get the hex code for a given color here:
# https://www.w3schools.com/colors/colors_picker.asp
# keep in mind any other color picker will do just as well

plt.show()

What information can we extract from the plot above?

**Using the same logic as the multiple box plot figure above, build a single figure with one bar plot per non-metric variable:**

In [None]:
# All Non-Metric Variables' Absolute Frequencies
sns.set()
title = "Categorical/Low Cardinality Variables' Absolute Frequencies"
# CODE HERE

plt.savefig(os.path.join('..', 'figures', 'exp_analysis', 'categorical_variables_frequencies.png'), dpi=200)
plt.show()

### Insights:
- low frequency values
- high cardinality

## Comparing two categorical variables

In [None]:
# Let's break this down, step by step (pandas plot - matplotlib behind)
sns.set()
df_counts = df\
    .groupby(['description', 'dependents'])\
    .size()\
    .unstack()\
    .plot.bar(stacked=True)
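The chained call above can be split into its intermediate steps. This is a minimal sketch on a hypothetical five-row DataFrame, so the counts are easy to verify by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "description": ["Good", "Good", "Bad", "Bad", "Bad"],
    "dependents": [True, False, True, True, False],
})  # hypothetical values

# 1) count the rows for every (description, dependents) pair
pair_counts = toy.groupby(["description", "dependents"]).size()

# 2) pivot the inner index level (dependents) into columns
count_table = pair_counts.unstack()

print(count_table)
# 3) count_table.plot.bar(stacked=True) would draw one stacked bar per description
```

Each row of `count_table` becomes one bar, and each column becomes one colored segment of that bar.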

## Comparing a categorical variable vs continuous (or discrete) variables

In [None]:
# Pairwise Relationship of Numerical Variables, colored by gender
sns.set()

# Setting pairplot
sns.pairplot(df[metric_features + ['gender']], diag_kind="hist", hue='gender')

# Layout
plt.subplots_adjust(top=0.95)
plt.suptitle("Pairwise Relationship of Numerical Variables", fontsize=20)

plt.show()

## Explore categorical data vs continuous and discrete data

Another example of visualization. Although it is not a simple visualization to produce, it can be very informative.

In [None]:
# notice we drop missing values so that they are not plotted as a distinct value
educ_vals = df.education.dropna().unique()

fig, axes = plt.subplots(len(metric_features), len(educ_vals), figsize=(25,18), sharex=True, sharey="row")

for ax, (feat, educ_deg) in zip(axes.flatten(), product(metric_features, educ_vals)):
    # get the data for each subplot
    data = df[df.education == educ_deg].copy()
    data['dependents'] = data['dependents'].astype(object)
    
    # we are distinguishing points according to the variable "dependents"
    sns.pointplot(x="dependents", y=feat, 
                  hue="gender", hue_order=["F", "M"], 
                  data=data, capsize=.2, ax=ax)
    
    # remove the typical default y and x labels and legend of each axis
    # CODE HERE


# set columns' titles (education)
# only on top-most row of subplots
# CODE HERE

# set metric names
# only on left-most column of subplots
# CODE HERE

# set x axis label (dependents)
# only on the bottom row of subplots
# CODE HERE

# Set legend (gender)
# only once, for the whole figure
# handles, _ = axes[0,0].get_legend_handles_labels()
# CODE HERE


# set figure
plt.subplots_adjust(top=0.92)
plt.suptitle("Three-way ANOVA for each metric variable", fontsize=25)

plt.show()




## Metric Variables' Correlation Matrix

In [None]:
# Prepare figure
fig = plt.figure(figsize=(10, 8))

# Obtain correlation matrix. Round the values to 2 decimal places. Use the DataFrame corr() and round() methods.
corr = # CODE HERE

# Build annotation matrix (only correlations with an absolute value of at least 0.5 will appear annotated in the plot)
mask_annot = np.absolute(corr.values) >= 0.5
annot = np.where(mask_annot, corr.values, np.full(corr.shape,"")) # Try to understand what this np.where() does

# Plot heatmap of the correlation matrix
sns.heatmap(data=corr, annot=annot, 
            cmap=sns.diverging_palette(220, 10, as_cmap=True), 
            fmt='s', vmin=-1, vmax=1, center=0, square=True, linewidths=.5)

# Layout
fig.subplots_adjust(top=0.95)
fig.suptitle("Correlation Matrix", fontsize=20)

plt.savefig(os.path.join('..', 'figures', 'exp_analysis', 'correlation_matrix.png'), dpi=200)

plt.show()
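To see what `np.where()` contributes to the annotation matrix, it helps to run it on a small matrix. This is a minimal sketch with hypothetical correlation values; unlike the cell above, it converts the numbers to strings explicitly before mixing them with empty strings.

```python
import numpy as np
import pandas as pd

toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2], [0.8, 1.0, -0.6], [-0.2, -0.6, 1.0]],
    index=list("abc"), columns=list("abc"),
)  # hypothetical correlation values

# True where the correlation is strong enough to be worth annotating
mask = np.absolute(toy_corr.values) >= 0.5

# element-wise choice: keep the value (as text) where mask is True, else an empty string
labels = np.where(mask, toy_corr.values.astype(str), "")
print(labels)
```

Because the result holds strings rather than numbers, the heatmap needs `fmt='s'` when this array is passed as `annot`.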

# A tool to assist you through your exploratory data analysis

Optionally, you may use `ydata-profiling` (formerly `pandas-profiling`) as a first approach to your data analysis. Remember, although this tool provides excellent insights about the data you're working with, it is not enough to perform a proper analysis.

In [None]:
profile = ProfileReport(
    df, 
    title='Tugas Customer Data',
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)

In [None]:
profile.to_notebook_iframe()

In [None]:
profile.to_file(os.path.join('..', 'figures', "tugas_customer_data.html"))

## Optional Exercise

Download the [Spaceship Titanic dataset](https://www.kaggle.com/competitions/spaceship-titanic/data). Using the  `train.csv` file, perform the same exercises that we did in this notebook.

Identify the metric and non-metric features in this dataset.

Identify if any problems exist:

- Duplicates?
- Data types?
- Missing values?
- Strange values?
- Descriptive statistics?

Visualize the different variables present in this dataset. Are there any interesting relationships present? 



