<div style="text-align: center;">
    <h1 style="color: #3498db;">Artificial Intelligence & Machine Learning</h1>
    <h2 style="color: #3498db;">Part 1: Exploratory Data Analysis</h2>
</div>

-------------------------------------------------------------

<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    <b>Authors:</b> K. Said<br>
    <b>Date:</b> 08-09-2023
</div>

<div style="background-color: #e6e6e6; padding: 10px; border-radius: 5px; margin-top: 10px;">
    <p>This notebook is part of the "Artificial Intelligence & Machine Learning" lecture material. The following copyright statement applies to all contents and code within this file.</p>
    <b>Copyright statement:</b>
    <p>This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors and lecturers.</p>
</div>


<h1 style="color:rgb(0,120,170)">Introduction</h1>

-----------------------------------------------

<h2 style="color:rgb(0,120,0)">What you have learned so far</h2>

--------------------------------------------------------------------

So far we have covered the basics of Machine Learning, including its applications and landscape. 
Afterwards we delved into the crucial topic of datasets and their types. Moreover, we've taken a significant step by selecting our own dataset, setting the stage for practical applications ahead. 


<h2 style="color:rgb(0,120,0)">Why data analysis for ML?</h2>

------------------------------------------------------
Imagine you want to start a machine learning project for which you are given tabular data. Looking at the raw table, you might conclude that the dataset is a perfect fit for your model. Yet after training, the accuracy may end up at 60% or worse even though the model itself is perfectly fine, because the accuracy also depends on the quality of the dataset.
For this reason we analyse our dataset first. Are there any missing values or values in the wrong format? Are there any patterns that can only be seen when visualizing the dataset? Which features should we use, and how do the features depend on each other?
These are only a few of the questions you can ask when analysing the data, and the answers can have a tremendous impact on the predictions of your model.
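
As a rough illustration of the first two questions (missing values and wrong formats), here is a minimal sketch on a small toy table. The table and its column names (`height_cm`, `weight_kg`, `city`) are invented for this example and are not part of any course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, invented purely for illustration
toy = pd.DataFrame({
    "height_cm": [172.0, np.nan, 181.5, 165.0],
    "weight_kg": ["70", "82", "n/a", "58"],  # numbers stored as text
    "city": ["Linz", "Wien", None, "Graz"],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Column types: a numeric feature stored as text hints at a formatting problem
column_types = toy.dtypes

# Convert the text column to numbers; entries that cannot be parsed become NaN
weight_numeric = pd.to_numeric(toy["weight_kg"], errors="coerce")

print(missing_per_column)
print(column_types)
print(weight_numeric)
```

Note that the `"n/a"` string is not counted as missing until the column is converted, which is exactly the kind of issue that is easy to overlook in the raw table.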

<h2 style="color:rgb(0,120,0)">Our Task</h2>

-------------------------------------------------

In this notebook, your task is therefore to explore the dataset you chose in the previous task: try to understand the relations between the features, look for patterns and missing values that might lead to problems, and get a feel for the distributions and other insights that may be relevant for implementing the model.



<h1 style="color:rgb(0,120,170)">Data Analysis - Example</h1>

-----------------------------------------------

Now we can finally get started with your first more challenging exercise: data exploration. Here we will use some of the most common methods for exploring a dataset. To make it easier for you to get started, we first show you some data analysis examples based on the penguins dataset.

The penguins dataset was created by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).

The dataset is often used for first ML and data analysis projects. It contains information about various attributes of penguin species, making it a useful dataset for classification and data exploration exercises.

For more information, see [the dataset page on Kaggle](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data).

In [9]:
# Let us first import some packages we need for the visualizations
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.random.seed(42)


In [8]:
# Now we load our toy dataset 
penguins = sns.load_dataset("penguins")



<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Display the dataset</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
    Before starting with any specific plots, we first want to get an overview of our dataset. This is easy to do with, for example, the pandas library, which gives us not only an overview but also some statistical insights.
</div>
</details>


In [None]:
# To show several outputs from one cell, we use IPython's display function
# (available by default in Jupyter; the import makes this explicit)
from IPython.display import display
display(penguins, penguins.describe(include='all'), penguins.dtypes)


<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Any missing values?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
    From the output above we can see that both tables contain NaN values. Quite interesting, right? To get a better understanding, we will next plot the percentage of missing values per feature.
</div>
</details>


In [None]:
# Calculate the percentage of missing values for each feature
nan_percentage = (penguins.isna().mean() * 100).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=nan_percentage.index, y=nan_percentage.values, palette="viridis")
plt.title("Percentage of Missing Values by Feature in Penguins Dataset", fontsize=16)
plt.ylabel("Percentage of Missing Values", fontsize=14)
plt.xlabel("Features", fontsize=14)
plt.xticks(rotation=45, ha="right", fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()

<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    About 3% of the values in the column "sex" are missing, and around 0.6% in each of the four measurement columns; "species" and "island" are complete. Missing values often stem from collection errors, incomplete records, or sensor malfunctions during data gathering.
</div>
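
Percentages are easier to judge next to absolute counts. The sketch below shows one possible way to combine both in a single table and to count how many rows are affected at all. It uses a hypothetical five-row table with invented values that only mimics some penguins column names, so it runs without loading the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table mimicking the penguins columns (values invented)
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0, 50.0],
    "sex": ["Male", None, "Female", None, "Male"],
})

# Absolute and relative number of missing values per column
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})

# Number of rows that contain at least one missing value
rows_with_nan = int(df.isna().any(axis=1).sum())

print(summary)
print("Rows with at least one NaN:", rows_with_nan)
```

The same lines should work on the real data by replacing `df` with `penguins`.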

<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">What is the distribution of each species?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
    Alright, we know that the dataset contains missing values. But what about the ratio of the different species? To find out, we will now look at a simple pie chart.
</div>
</details>


In [1]:
species_counts = penguins["species"].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(species_counts, labels=species_counts.index, autopct='%1.1f%%', startangle=140)
plt.title("Distribution of Penguin Species")
plt.axis('equal')

plt.show()



<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    Quite interesting: we have 3 different species, with the largest share (about 44%) being of type <a href="https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin" style="color: blue; text-decoration: underline;" target="_blank">Adelie</a>.
</div>
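
The numbers behind such a pie chart can also be read directly from `value_counts`. The following is a small sketch on an invented label column (the counts are not the real penguin counts). The `imbalance_ratio` name is only an illustrative helper, not an established metric from the lecture.

```python
import pandas as pd

# Hypothetical label column with invented class counts
species = pd.Series(["Adelie"] * 5 + ["Gentoo"] * 3 + ["Chinstrap"] * 2,
                    name="species")

# Absolute counts and percentage share of each class
counts = species.value_counts()
shares = species.value_counts(normalize=True) * 100

# Illustrative helper: how many times larger the biggest class is than the smallest
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(1))
print("Largest / smallest class:", imbalance_ratio)
```

A ratio far above 1 suggests an imbalanced dataset, which is worth keeping in mind when choosing evaluation metrics later on.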

<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">How is the bill length distributed per species?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
   Now, out of curiosity, we want to draw a boxplot of the bill_length_mm distribution for each species. For the first boxplot we use seaborn; for the second one we use the more interactive plotly library.
</div>
</details>


In [2]:
# Seaborn box plot for "bill_length_mm"
plt.figure(figsize=(8, 6))
sns.boxplot(x="species", y="bill_length_mm", data=penguins, palette="viridis")
plt.title("Seaborn - bill_length_mm Distribution by Penguin Species")
plt.xlabel("Penguin Species")
plt.ylabel("bill_length_mm")
plt.show()

# Plotly box plot for "bill_length_mm"
fig = px.box(penguins, x="species", y="bill_length_mm", title="Plotly - bill_length_mm Distribution by Penguin Species")
fig.update_layout(showlegend=False)
fig.show()


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    With the plots above, we can clearly see the min, max, and median values. We also see that the Adelie species has a much shorter bill_length_mm than the other two species, which might come in handy when we choose the features to train our models with.
</div>
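
The values a boxplot displays can also be computed as numbers. Below is a minimal sketch using `groupby` and `describe` on a hypothetical two-species table. The species names `A` and `B` and all measurements are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements for two invented species
df = pd.DataFrame({
    "species": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
})

# One row per species with count, mean, std, min, quartiles and max
stats = df.groupby("species")["bill_length_mm"].describe()

print(stats[["min", "25%", "50%", "75%", "max"]])
```

Comparing the `50%` column (the median) across species gives the same information as comparing the centre lines of the boxes.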


<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Any correlation between the features?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
    Now we want to see the correlation between the features, which may be useful for the preprocessing and model selection part. For this purpose we will first create pairwise scatter plots, followed by a correlation heatmap.
</div>
</details>


In [3]:
# Pairwise scatter plots of all numeric features, with "species" as the hue
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(data=penguins, hue="species")
plt.suptitle("Scatter Plot of Penguins Features (All Variables)", y=1.02, fontsize=16)
plt.show()


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    Having a closer look, we can see that it is much easier for us to separate the blue (Adelie) species from the green (Gentoo) species than the blue from the orange (Chinstrap).
</div>

In [4]:
# Correlation matrix of all numeric features, visualized as a heatmap
corr_matrix = penguins.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Heatmap of Correlation Matrix for Penguin Features (All Variables)", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    We see quite a high correlation between body_mass_g and flipper_length_mm (0.87). bill_length_mm and bill_depth_mm, on the other hand, are only weakly and negatively correlated (about -0.24).
</div>
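
A correlation computed over the whole table can differ from the correlation inside each group. The sketch below demonstrates this on synthetic data: the groups `A` and `B` and the columns `x` and `y` are invented, and whether a similar effect shows up in your own dataset is something to check rather than assume.

```python
import pandas as pd

# Synthetic data: within each group y rises with x, but group B sits lower overall
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "x": [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
    "y": [10.0, 11.0, 12.0, 1.0, 2.0, 3.0],
})

# Pearson correlation over all rows
overall_corr = df["x"].corr(df["y"])

# Pearson correlation computed separately for each group
per_group_corr = pd.Series(
    {name: g["x"].corr(g["y"]) for name, g in df.groupby("group")}
)

print("Overall:", round(overall_corr, 2))
print(per_group_corr.round(2))
```

Here the overall correlation is negative although both groups show a perfectly positive relationship, so it can be worth computing the correlation matrix per class as well.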

<h1 style="color:rgb(0,120,170)">Data Analysis - Your turn</h1>

-----------------------------------------------

Now it's your turn to analyse your chosen dataset from the previous tasks. There are no limits: you can reuse parts of the previous section, take ideas from the Info box below, or try out new things and implement them yourself (which we recommend).

First load your dataset and try to visualize it in a similar manner as above. You can also play around with different libraries (mentioned in "Useful Resources") and try to create interactive plots.

And just as a last side note: don't try to make as many plots as possible. Instead ask yourself: what information do I want to get from the dataset, and what do I want to analyse? How will this specific plot help me gain useful insights into the data, and how can I use these insights for the next step, preprocessing?

Other than that, just feel free to explore your dataset.

<details>
<summary style="font-size: larger; color: white; background-color: #3498db; border: 1px solid #3498db; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Info</summary>

<div style="background-color: #E6F7FF; padding: 10px; border-radius: 5px;">
    After watching the lectures, you might feel a bit lost, as there are tons of ways to analyse and visualize things. To make it easier for you to know which methods to apply, here is a short overview of what you can do to analyse your dataset:

   - **5-number-summary**
     - Minimum
     - Maximum
     - Lower and upper quartile
     - Median

   - **Mean & Standard Deviation**
    
   - **Investigate Outliers**

   - **Plots**
     - Violin-Plot
     - Box-plot
     - Histogram
     - Bar charts
     - Pie charts
   - **Correlation Matrix**

Of course this is only a small fraction of all available methods, therefore you might want to have a look at the "Useful Resources" to get some more inspiration.

</div>
</details>
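
The first three items of the Info list can be sketched in a few lines. The example below computes a 5-number summary, the mean and standard deviation, and flags outliers with the common 1.5 × IQR rule. The values are invented, and the 1.5 factor is a convention rather than a fixed requirement, so treat the result as a starting point for investigation.

```python
import pandas as pd

# Invented sample with one suspiciously large value
values = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

# 5-number summary: minimum, lower quartile, median, upper quartile, maximum
five_number = values.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Mean and (sample) standard deviation
mean, std = values.mean(), values.std()

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = five_number.loc[0.25], five_number.loc[0.75]
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are outlier candidates
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(five_number)
print("Mean:", round(mean, 2), "Std:", round(std, 2))
print("Outlier candidates:", outliers.tolist())
```

Notice how the single large value pulls the mean well above the median, which is one reason to look at both.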


<details>
<summary style="font-size: larger; color: white; background-color: darkgreen; border: 1px solid darkgreen; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Useful Resources</summary>

<div style="background-color: rgba(0, 128, 0, 0.2); padding: 10px; border-radius: 8px; margin-top: 10px;">
    - <a href="https://matplotlib.org/stable/contents.html">Matplotlib</a>: A popular plotting library for creating visualizations.
    <br><br>
    - <a href="https://seaborn.pydata.org/">Seaborn</a>: A data visualization library based on Matplotlib, providing a high-level interface for creating informative and attractive statistical graphics.
    <br><br>
    - <a href="https://plotly.com/python/">Plotly</a>: A library for creating interactive plots and dashboards. It supports a wide range of chart types and is great for sharing data visualizations online.
    <br><br>
    - <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html">Pandas Plotting</a>: Pandas has built-in plotting capabilities that allow you to create basic plots directly from DataFrames.
    <br><br>
    - <a href="https://bokeh.org/">Bokeh</a>: Bokeh is a library for creating interactive, web-ready visualizations. It's well-suited for building interactive dashboards.
    <br><br>
    - <a href="https://altair-viz.github.io/">Altair</a>: Altair is a declarative statistical visualization library that is especially useful for creating complex, layered visualizations.
</div>
</details>


In [None]:
# TODO: Explore your dataset, find formatting errors, missing values and much more
