# **Exploratory Data Analysis**

Exploratory Data Analysis (EDA) is the process of examining and visualizing 
datasets to uncover patterns, trends, and relationships within the data. It 
helps identify data characteristics such as distribution, outliers, missing 
values, and correlations, which inform decisions on data preprocessing and 
model selection. EDA is essential for gaining initial insights and guiding 
the next steps in the data science workflow.


In [5]:
# ...business as usual... Let's set up the environment first.

# Basic imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# scikit-learn related imports
from sklearn.datasets import load_iris, load_wine

# Some Jupyter magic for nicer output
%config InlineBackend.figure_formats = ["svg"]   # Enable vectorized graphics

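# Adjust the default plot settings using the local `ml` helper module.
# When this cell was last run, the call failed with:
#   AttributeError: module 'ml' has no attribute 'setup_plotting'
# The call is therefore guarded so that the rest of the notebook still runs.
import sys
sys.path.append("..")
import ml
if hasattr(ml, "setup_plotting"):
    ml.setup_plotting()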

## **Step 1: Load and inspect the data**

In this notebook, we will explore the "Wine" dataset containing the results 
of a chemical analysis of wines derived from three different cultivars. 
The analysis determined the quantities of 13 constituents found in each of 
the three types of wines. 

The goal is to classify the wine samples (rows of the dataframe) into three 
wine classes (0, 1, and 2) based on the features in the dataset.

In principle, you can use this notebook to explore other datasets as well. 
Just replace the `load_wine()` function with the dataset loader of your choice.

In [3]:
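dataset = load_wine(as_frame=True)
target_name = "wine_class"
# The loader names the label column "target"; rename it so that
# `target_name` refers to an actual column of the dataframe.
df = dataset.frame.rename(columns={"target": target_name})

# Alternative: the Iris dataset. Uncomment to use it instead of Wine.
# dataset = load_iris(as_frame=True)
# target_name = "species"
# df = dataset.frame.rename(columns={"target": target_name})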
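
Swapping in another dataset, as suggested above, could look like the following sketch. The helper `load_as_frame` and the column name `"diagnosis"` are hypothetical names chosen for illustration. The sketch relies on the scikit-learn toy loaders returning a `frame` whose label column is called `"target"` when `as_frame=True` is passed.

```python
from sklearn.datasets import load_breast_cancer, load_wine


def load_as_frame(loader, target_name):
    """Load a scikit-learn toy dataset and rename its label column."""
    dataset = loader(as_frame=True)
    # `frame` contains all feature columns plus one column named "target".
    return dataset.frame.rename(columns={"target": target_name})


# Separate variable names are used so the notebook's `df` is not overwritten.
wine_df = load_as_frame(load_wine, "wine_class")
cancer_df = load_as_frame(load_breast_cancer, "diagnosis")

print(wine_df.shape, cancer_df.shape)
```

Any loader that accepts `as_frame=True` and exposes a `frame` attribute should work the same way.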

In [None]:
# Display the df to get a first impression of the data!
display(df)

In [None]:
# Display the summary statistics
summary = df.describe()
summary

In [6]:
########################
###    EXERCISE 1    ###
########################

# Answer the following questions:
# 1) How many samples are there in the dataset?
# 2) What are the features of the dataset? How many are there?
# 3) What is the target variable? What are the possible values?
# 4) What are the data types of the features?
# 5) Are there any missing values?
# 6) Are there any outliers?
# 7) What type of problem is this? (Classification, regression, clustering, etc.)
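
One possible way to approach questions 1 to 6 is sketched below. It reloads the Wine data into a separate variable `frame` (a name chosen here) so that it runs on its own. The 1.5 × IQR rule used to flag outliers is a common convention and an assumption of this sketch, not something the notebook prescribes.

```python
from sklearn.datasets import load_wine

frame = load_wine(as_frame=True).frame

# 1) + 2) Number of samples and features (all columns except the label).
n_samples = frame.shape[0]
feature_names = [col for col in frame.columns if col != "target"]
print(f"{n_samples} samples, {len(feature_names)} features")

# 3) Target variable: its possible values and how often each occurs.
class_counts = frame["target"].value_counts().sort_index()
print(class_counts)

# 4) Data types of the columns.
print(frame.dtypes)

# 5) Missing values per column and in total.
n_missing = int(frame.isna().sum().sum())
print("Missing values in total:", n_missing)

# 6) Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] per feature.
features = frame[feature_names]
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
is_outlier = features.lt(q1 - 1.5 * iqr, axis="columns") | features.gt(
    q3 + 1.5 * iqr, axis="columns"
)
print(is_outlier.sum())  # number of flagged values per feature
```

Flagged values are only candidates: whether they are measurement errors or legitimate extreme samples needs a closer look at the data.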

## **Step 2: Visualize the data** 

The more we look at the data, the better we understand it. Let's visualize it!

In [8]:
########################
###    EXERCISE 2    ###
########################

# Create different visualizations to explore the dataset.
# 1) Use histograms or violin plots to visualize the distribution of features
# 2) Use a pairplot to explore the relationships between features
# 3) Use a barplot to visualize the number of samples per class

## **Step 3: Train a simple model**

Here we train a logistic regression classifier on a training dataset 
(80% of data) and use it to predict the target classes of the test 
dataset (20% of data).

In [11]:
########################
###    EXERCISE 3    ###
########################

# Train a model as described above and evaluate it on the test dataset.
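
A minimal sketch of the 80/20 split and the logistic regression classifier described above. The feature scaling step, the stratified split, `random_state=42`, and `max_iter=1000` are choices made for this sketch and are not prescribed by the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Hold out 20% of the samples for testing; stratify to keep class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the features first: they live on very different scales,
# which slows down the convergence of logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training data only, so no information from the test set leaks into the model.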

In [13]:
########################
###    EXERCISE 4    ###
########################

# - Have a look at the available classification metrics in scikit-learn:
#   https://scikit-learn.org/1.5/modules/model_evaluation.html#classification-metrics
# - Compute a confusion matrix for the model and visualize it.
# - Compute the classification report for the model and print it.
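
The confusion matrix and the classification report could be computed as in the sketch below. To stay self-contained, it retrains the same kind of model first. The scaling step and `random_state=42` are again assumptions of the sketch.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_test, y_pred)
display_cm = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
display_cm.plot(cmap="Blues")

# Precision, recall, and F1 score per class, plus averages.
report = classification_report(y_test, y_pred)
print(report)
```

Off-diagonal entries of the confusion matrix show which classes the model confuses with each other, which a single accuracy value cannot reveal.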


## **Alternatives / automation**

### **Automated summary using `skimpy`**

[Skimpy](https://aeturrell.github.io/skimpy/) is a lightweight Python 
package for creating summary statistics from pandas DataFrames. It can be 
seen as a supercharged version of `df.describe()`.

In [None]:
# Make sure to install the skimpy package first:
# just uncomment the %pip line below and run the cell.
# You may have to restart the kernel afterwards!

# %pip install skimpy
from skimpy import skim

skim(df)

### **Automated analysis using `ydata-profiling`**


[ydata-profiling](https://docs.profiling.ydata.ai/latest/) is a Python 
package that provides a simple way to automatically profile data and 
generate reports.


In [None]:
# Make sure to install the package first.
# %pip install -U ydata-profiling
# You may have to restart the kernel afterwards!

from ydata_profiling import ProfileReport
profile = ProfileReport(df, 
                        title="OUR DATASET", 
                        sort=None,
                        sensitive=False,
                        explorative=False)

# Create and display the report
profile.to_notebook_iframe()    # Integrate into a Jupyter notebook
#profile.to_widgets()             # Integrate into a Jupyter notebook, compact
#profile.to_file("output.html")  # Save the report to a file
profile