<img src="header.png" align="left"/>

# Exercise: Analysis and quality control of data (10 points)

The goal of this exercise is to get an overview of typical basic data analysis steps:

- Datatypes and shapes of data
- Printing data
- Missing values
- Basic statistics
- Outliers
- Correlations between features


Code and background taken from:

- [https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)
- [https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623](https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623)
- [https://github.com/Viveckh/HiPlotTutorial/blob/master/Hiplot-Tutorial.ipynb](https://github.com/Viveckh/HiPlotTutorial/blob/master/Hiplot-Tutorial.ipynb)

# Import of Python modules

In [None]:
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import hiplot as hip

from scipy import stats
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.cluster import DBSCAN

from keras.datasets import mnist
from keras.utils import to_categorical


In [None]:
#
# Turn off some warnings
#
from warnings import simplefilter
# Ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
# Ignore all other warnings as well
simplefilter(action='ignore', category=Warning)

#
# Set the size of figures
#
plt.rcParams['figure.figsize'] = [16, 9]

# Datatypes and shapes of data

https://numpy.org/devdocs/user/basics.types.html

<img src="info.png" align="left"/> 

In [None]:
# 
# Load some data
# 
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = pd.read_csv('data/iris/iris_mutilated.csv', names=names)

In [None]:
#
# Print shape of data
#
print(iris.shape)

In [None]:
#
# Task: interpret those numbers in a short statement. (1 point)
# Hint: write your interpretation into a Markdown cell of your notebook.

In [None]:
#
# Print datatypes
#
print(iris.info())

# Print data

In [None]:
#
# Print head samples to see some data
#
print(iris.head())

In [None]:
#
# Task: describe what a NaN is (1 point)
#

In [None]:
print(iris.tail())

# Missing data

In [None]:
#
# Print all rows with invalid data.
# Task: explain the function of this statement (2 points)
# 
iris[iris.isna().any(axis=1)]
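
Besides listing the affected rows, it can help to count the missing values per feature. The following is a minimal sketch, assuming a hypothetical toy frame `toy` with made-up values instead of the real iris file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    'sepal-length': [5.1, np.nan, 6.3, 5.8],
    'sepal-width': [3.5, 3.0, np.nan, np.nan],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Total number of missing values in the whole frame
missing_total = int(missing_per_column.sum())

print(missing_per_column)
print('Total missing values:', missing_total)
```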

In [None]:
#
# Print some statistical measures
#
# numeric_only=True skips the non-numeric 'class' column
iris.mean(numeric_only=True)

In [None]:
#
# Replace missing values by mean value of feature
#
iris_non = iris.fillna(iris.mean(numeric_only=True))

In [None]:
iris_non[iris_non.isna().any(axis=1)]

Filling missing values with the feature mean can distort the data: it reduces the variance of the feature and may weaken its relationship to other features. An alternative is to delete every row that contains a missing value, at the cost of losing samples.
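
Both strategies can be compared on a small example. The following is a minimal sketch, assuming a hypothetical toy frame `toy` with made-up values rather than the real iris file; `dropna()` is the pandas call that deletes incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    'sepal-length': [5.1, np.nan, 6.3, 5.8],
    'sepal-width': [3.5, 3.0, np.nan, 2.7],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

# Option 1: fill missing values with the mean of the numeric column
toy_filled = toy.fillna(toy.mean(numeric_only=True))

# Option 2: drop every row that contains at least one missing value
toy_dropped = toy.dropna()

print(toy_filled)
print(toy_dropped)
print(len(toy), 'rows before,', len(toy_dropped), 'rows after dropna()')
```

Which option is preferable depends on how many rows are affected: dropping is simple and keeps only measured values, while filling keeps the sample count unchanged.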

# Duplicates

In [None]:
#
# Test data for duplicates (the rows are only displayed here, not removed)
# Task: explain this code (2 points)
#
iris_non[iris_non.duplicated(keep=False)]
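
The cell above only displays duplicated rows. One possible way to actually remove them is `drop_duplicates()`. The following is a minimal sketch on a hypothetical toy frame `toy` with made-up values; whether duplicates in the real data should be removed at all is a judgment call, since identical measurements can occur legitimately.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    'petal-length': [1.4, 1.4, 4.7, 5.1, 4.7],
    'petal-width': [0.2, 0.2, 1.4, 1.9, 1.4],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
              'Iris-virginica', 'Iris-versicolor'],
})

# Remove repeated rows; by default the first occurrence is kept
toy_unique = toy.drop_duplicates().reset_index(drop=True)

print(len(toy), 'rows before,', len(toy_unique), 'rows after drop_duplicates()')
print(toy_unique)
```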

# Simple statistics

In [None]:
#
# Distribution of class labels
#
print(iris_non.groupby('class').size())

In [None]:
#
#
# Bar chart of class distribution
#
counts = iris_non.groupby('class').size()
class_pos = np.arange(len(counts))
plt.bar(class_pos, counts, align='center', alpha=0.4)
# Label each bar with its class name
plt.xticks(class_pos, counts.index)
plt.xlabel('Class')
plt.ylabel('Number of samples')
plt.title('Samples per class')
plt.show()

In [None]:
#
# Distribution of values in columns (features)
#
iris_non.describe()

# Outliers in the data

In [None]:
#
# Boxplots of features (outliers)
# Task: spot the outliers in the boxplots and describe the feature and the value range of the outliers (2 points)
#
iris_non.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
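
Outliers can also be flagged numerically. The following is a minimal sketch of the common 1.5 × IQR rule, which is also the default whisker length of matplotlib boxplots. The series `values` is hypothetical and its numbers are made up; the factor 1.5 is a convention, not a fixed law.

```python
import pandas as pd

# Hypothetical feature values with one obviously wrong entry (made up)
values = pd.Series([4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.4, 58.0],
                   name='sepal-length')

# Interquartile range: distance between the 25 % and 75 % quantiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print('Accepted range:', lower, 'to', upper)
print(outliers)
```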

# Correlations between features

In [None]:
#
# Distribution of values per feature
#
iris_non.hist()
plt.show()

In [None]:
#
# Calculation of correlation coefficients between features
# (numeric_only=True skips the non-numeric 'class' column)
#
iris_non.corr(numeric_only=True)
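
By default, `corr()` computes the Pearson correlation coefficient. The following is a minimal sketch that recomputes it from its definition for two columns of a hypothetical toy frame `toy` (made-up values) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    'petal-length': [1.4, 1.5, 4.5, 4.7, 5.9, 6.0],
    'petal-width': [0.2, 0.2, 1.5, 1.4, 2.1, 2.5],
})

x = toy['petal-length']
y = toy['petal-width']

# Pearson r: sum of products of deviations from the mean, divided by
# the square root of the product of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value as computed by pandas
r_pandas = toy.corr().loc['petal-length', 'petal-width']

print('manual:', r_manual)
print('pandas:', r_pandas)
```

A value close to 1 or -1 indicates a strong linear relationship, a value close to 0 indicates little or no linear relationship.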

In [None]:
#
# Visual presentation of correlation between features
#
sns.heatmap(iris_non.corr(numeric_only=True), annot=True, cmap='Blues_r')

In [None]:
#
# Visualization as pair plot (scatter matrix)
#
scatter_matrix(iris_non)
plt.show()

In [None]:
#
# Advanced pair plot (seaborn library) now including the class of each data point
# Task: what do you think? Which of the three classes are separable? (2 points)
#
sns.pairplot(iris_non,hue='class')

In [None]:
#
# Very advanced form of visualization of relations between features
#

In [None]:
iris_data = iris_non.to_dict('records')
iris_data[:2]

In [None]:
hip.Experiment.from_iterable(iris_data).display(force_full_width=True)