<a href="https://colab.research.google.com/github/aleksejalex/EIEE9E_2025_ZS/blob/main/PyPEF_07_data_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyPEF, lecture 08. Data in Python (part 2).

Prepared by: Aleksej Gaj ( pythonforstudents24@gmail.com )

🔗 Course website: [https://aleksejgaj.cz/pef_python/](https://aleksejgaj.cz/pef_python/)


In this tutorial we will
 - recall basics of data manipulation in Python from last time (library `pandas`)
 - continue to work with data in Python and get familiar with:
    - advanced plotting (library `seaborn`)
    - more of pandas functionality
    - practice understanding of data
 - study linear regression in Python

In [None]:
# imports for today (we are already familiar with those libraries)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Recall last time: pandas

 - load dataframe (from csv, ...)
 - preview first *n* rows of data
 - basic statistics to get first impression
 - changing variable type of some properties (in dataframe represented by columns)

 **(Quite) universal approach when importing data:**
 1) technical import, optionally checks if the file/variable is valid, nonempty, ...
 2) checking type of variables - retyping according to interpretation
 3) basic statistics (mean, standard deviation, min&max value, CI, ...)
 4) basic plots (histograms, boxplots, pie charts, ...)
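
The first three steps above can be sketched on a tiny inline CSV. This is only an illustration: the data are made up, the column names merely imitate the cereals file used in this notebook, and the plotting step is left out because it is shown in the cells that follow.

```python
import io
import pandas as pd

# Hypothetical mini-dataset standing in for a real CSV file (values are made up).
csv_text = """name,type_of,calories,shelf
Alpha Flakes,C,110,1
Beta Bran,C,70,3
Gamma Oats,H,100,2
Delta Puffs,C,120,2
"""

# 1) technical import + basic validity checks
df = pd.read_csv(io.StringIO(csv_text))
if df.empty:
    raise ValueError("the file contains no rows")
missing_per_column = df.isna().sum()   # number of missing values in each column

# 2) retype columns according to their interpretation
df["calories"] = df["calories"].astype("float")
df["type_of"] = df["type_of"].astype("category")
df["shelf"] = df["shelf"].astype("category")   # a shelf number is a label, not a quantity

# 3) basic statistics
summary = df["calories"].describe()
print(df.dtypes)
print(summary[["mean", "std", "min", "max"]])
```

Retyping `shelf` as a category is a judgment call: it stops `describe()` from reporting a meaningless "mean shelf".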


In [None]:
url_to_data = "https://gist.githubusercontent.com/aleksejalex/26a83646c03120af1eaeb117572d895e/raw/2ddc8661d86fbf1b7d09204ff39fdf74ce3723b6/cereals.csv"

df_cereals = pd.read_csv(url_to_data, delimiter=',')

### Preview first $n$ rows of data:

In [None]:
df_cereals.head(3)

### Checking types of variables and changing some of them

In [None]:
df_cereals["calories"] = df_cereals["calories"].astype("float")
df_cereals["type_of"] = df_cereals["type_of"].astype("category")
df_cereals.dtypes

### Basic statistical description of provided data

In [None]:
df_cereals.describe(include='all')

In [None]:
df_cereals['shelf'].dtypes

In [None]:
# get the 'x' and 'y' series
x = df_cereals['sugars']
y = df_cereals['calories']

# plot them in usual way:
plt.figure(figsize=(4,3), dpi = 120)
plt.scatter(x,y)
plt.xlabel("sugars")
plt.ylabel("calories")
plt.show()

In [None]:
# select the ratings of the cereals on each shelf

r1 = df_cereals[df_cereals['shelf'] == 1]['rating']
r2 = df_cereals[df_cereals['shelf'] == 2]['rating']
r3 = df_cereals[df_cereals['shelf'] == 3]['rating']

# plot them in usual way:
plt.figure(figsize=(4,3), dpi = 120)
plt.hist(r1, label="Shelf 1")
plt.hist(r2, label="Shelf 2")
plt.hist(r3, label="Shelf 3")
plt.legend()
plt.title("Rating of cereals for different shelves")
plt.show()

In [None]:
plt.figure(figsize=(5,3))
plt.hist(df_cereals['fiber'], bins=15, label="fiber")
plt.hist(df_cereals['sugars'], bins=15, label="sugars")
plt.legend()
plt.show()

The graphs (plotted with Matplotlib) show everything we need, but... is there a *simple* way to plot the same figures so that they look *more modern*? 🤔

Of course there is:

## Seaborn
<a href="https://seaborn.pydata.org/"><img src="https://seaborn.pydata.org/_static/logo-wide-lightbg.svg" alt="banner" width="380" align="right"></a>
 = library for high-level visualisation
 - based on matplotlib (great!)
 - "a high level API for statistical visualisation"
 - homepage: [here](https://seaborn.pydata.org/)


In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize=(5,3))
sns.histplot(data=df_cereals['fiber'], bins=15, kde=True, label="fiber")
sns.histplot(data=df_cereals['sugars'], bins=15, kde=True, label="sugars")
plt.xlabel("value (in g)")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(5,3))
sns.histplot(data=df_cereals[df_cereals['shelf'] == 1]['rating'], kde=True, label="Shelf 1")
sns.histplot(data=df_cereals[df_cereals['shelf'] == 2]['rating'], kde=True, label="Shelf 2")
sns.histplot(data=df_cereals[df_cereals['shelf'] == 3]['rating'], kde=True, label="Shelf 3")
plt.legend()
plt.show()

### Let's move on to different dataset:
We will have a look at the very famous 'iris' dataset, see [some notes about it](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [None]:
df_iris = sns.load_dataset('iris')

df_iris.head(8)

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.integratedots.com%2Fwp-content%2Fuploads%2F2019%2F06%2Firis_petal-sepal-e1560211020463.png&f=1&nofb=1&ipt=db4bab5b341bab9c097c2f4f974becf8c756a2ee4d80d1ef38153a8ed583e6c7&ipo=images" alt="banner" width="400" align="center">

([image source](https://www.integratedots.com/determine-number-of-iris-species-with-k-means/))

In [None]:
df_iris.describe(include='all')

The dataset contains 150 observations of iris flowers. The measured parameters are the lengths and widths of sepals and petals. The "response" variable is the only categorical one in this dataset: the species of the observed flower (3 possible values).

In [None]:
df_iris["species"] = df_iris["species"].astype("category")

In [None]:
plt.figure(figsize=(3,2), dpi=120)
sns.histplot(data=df_iris['sepal_length'])
plt.grid()
plt.show()

📈 3D picture of the dataset --> separate script

In [None]:
plt.figure(figsize=(9,3), dpi=120)
plt.subplot(1,3,1)
sns.boxplot(x='species', y='sepal_length', data=df_iris, hue='species')
plt.title("sepal length")
plt.subplot(1,3,2)
sns.boxplot(x='species', y='sepal_width', data=df_iris, hue='species')
plt.title("sepal width")
plt.subplot(1,3,3)
sns.boxplot(x='species', y='petal_length', data=df_iris, hue='species')
plt.title("petal length")
plt.tight_layout()
plt.show()

❓ What does this image tell us?

**Boxplot:** a basic tool for visualising the distribution of a variable
![ima](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.sharpsightlabs.com%2Fwp-content%2Fuploads%2F2019%2F11%2Fboxplot-simple-explanation.png&f=1&nofb=1&ipt=3f329f8ea83e1d89a24bb4bac240ba2b6596c69184886fd238d1adf20d6ebe13&ipo=images)

([image source](https://www.sharpsightlabs.com/blog/seaborn-boxplot/))
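
The numbers a boxplot draws can also be computed by hand. The sketch below assumes the common convention that whiskers extend to the most extreme points within 1.5 × IQR of the box (the default in Matplotlib and seaborn). The sample values are made up.

```python
import numpy as np

# Made-up sample with one obviously extreme value.
values = np.array([4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])

# The box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1   # interquartile range = height of the box

# Whiskers reach the most extreme data points still within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences is drawn as an individual point (outlier).
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```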

### PairPlot - first impression of the dataset

In [None]:
sns.pairplot(data=df_iris, hue='species')
plt.show()

### visualise statistical relationship

In [None]:
sns.jointplot(x="sepal_length", y="petal_length", data=df_iris, hue='species')

In [None]:
sns.lmplot(data=df_iris, x="sepal_length", y="petal_length", hue="species", height=4)

In [None]:
sns.lmplot(data=df_iris, x="sepal_length", y="sepal_width", hue="species", height=4)
sns.lmplot(data=df_iris, x="petal_length", y="petal_width", hue="species", height=4)

Based on the last plots we see that:
 - the lengths of the sepal and the petal are correlated (the bigger one is, the bigger you can expect the other to be)
 - the length and width of the petal are much less correlated (though the relationship still holds)
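
The visual impression from the plots above can be checked numerically. The helper below is a hypothetical sketch (the function name and the demo data are not part of the course material). For each group it computes the Pearson correlation coefficient and the slope of a least-squares straight line, which is roughly what `lmplot` draws per hue level.

```python
import numpy as np
import pandas as pd

def group_correlation_and_slope(df, x, y, group):
    """Return Pearson r and the least-squares line of y on x for each group."""
    rows = []
    for name, part in df.groupby(group, observed=True):
        r = part[x].corr(part[y])                                 # Pearson correlation
        slope, intercept = np.polyfit(part[x], part[y], deg=1)    # straight-line fit
        rows.append({group: name, "pearson_r": r,
                     "slope": slope, "intercept": intercept})
    return pd.DataFrame(rows).set_index(group)

# Tiny made-up example: group "a" is exactly linear, group "b" has no linear trend.
demo = pd.DataFrame({
    "kind": ["a"] * 4 + ["b"] * 4,
    "length": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "width": [2.0, 4.0, 6.0, 8.0, 3.0, 1.0, 4.0, 2.0],
})
result = group_correlation_and_slope(demo, "length", "width", "kind")
print(result)
```

In this notebook the same helper could be called as `group_correlation_and_slope(df_iris, "petal_length", "petal_width", "species")` to compare the species.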

### Gallery of possibilities:
[https://seaborn.pydata.org/examples/index.html](https://seaborn.pydata.org/examples/index.html)

### to conclude:
The iris dataset is simple but not very intuitive. As we have seen, it is suitable for a classification task, i.e. predicting the species of an iris from the dimensions of its sepal and petal.            **=> next time**

## Your time: Titanic

**task:** get your own feel for "unknown" data - *the Titanic dataset*

Description from [kaggle](https://www.kaggle.com/datasets/vinicius150987/titanic3/code):

>The sinking of the Titanic is one of the most infamous shipwrecks in history.
>
>On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
>
>While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
>
>In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).


In [None]:
df_titanic = sns.load_dataset('titanic')

In [None]:
df_titanic.head()

In [None]:
df_titanic['town_num'] = pd.factorize(df_titanic['embark_town'])[0] + 1
df_titanic['alive'] = pd.factorize(df_titanic['alive'])[0] + 1

In [None]:
df_titanic.town_num
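
A short look at what `pd.factorize` returns may help to read the `town_num` column above. The example uses a made-up series. The point to notice is that missing values get the code `-1`, so after adding 1 they appear as `0`.

```python
import pandas as pd

# Made-up column with one missing value.
towns = pd.Series(["Southampton", "Cherbourg", None, "Southampton", "Queenstown"])

# factorize returns an integer code per row and the labels behind the codes.
codes, uniques = pd.factorize(towns)
print(codes)           # codes follow the order of first appearance; missing -> -1
print(list(uniques))   # label for code 0, 1, 2, ...

# Adding 1 shifts the codes to start at 1, so missing values end up as 0.
town_num = codes + 1
print(town_num)
```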

In [None]:
sns.pairplot(df_titanic, hue='sex')

In [None]:
df_titanic.dtypes

In [None]:
plt.figure(figsize=(5,3), dpi=120)
sns.boxplot(x='age', y='class', data=df_titanic, hue='survived')
plt.title("age by class and survival")

plt.tight_layout()
plt.show()

In [None]:
sns.histplot(data=df_titanic, x = 'age', hue='alive', palette='tab10')

In [None]:
sns.histplot(data=df_titanic['town_num'])
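
To put numbers on what the Titanic plots suggest, one possible next step is a grouped survival rate. The sketch below uses a few made-up passengers with the same column names as seaborn's Titanic data (`sex`, `class`, `survived`). It illustrates the technique and is not a result about the real dataset.

```python
import pandas as pd

# Made-up passengers using the same column names as seaborn's Titanic data.
demo = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "class": ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# `survived` is 0/1, so its mean within a group is that group's survival rate.
rate_by_sex = demo.groupby("sex")["survived"].mean()

# Two grouping columns at once; `count` shows how many passengers each rate is based on.
rate_by_sex_and_class = demo.groupby(["sex", "class"])["survived"].agg(["mean", "count"])

print(rate_by_sex)
print(rate_by_sex_and_class)
```

On the real data the same lines should work with `df_titanic` in place of `demo`. Small groups give unreliable rates, which is why the count is kept next to the mean.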