# Machine Learning - Data Analysis Introduction

# Part 1

## Get your Python environment running


Familiarize yourself with the most important packages in the field of machine learning. Others may follow later, but these are the ones you will need every time:
- numpy
- pandas
- matplotlib
- scikit-learn

Make sure you can import all of them in your notebook.
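One way to check the environment is to try each import and print the installed version. This is a minimal sketch, assuming the usual import names (scikit-learn is imported as `sklearn`); the helper name `check_packages` is made up for illustration.

```python
import importlib

# Import names of the packages listed above (scikit-learn is imported as "sklearn").
REQUIRED_PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each import name to its version, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in check_packages(REQUIRED_PACKAGES).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed needs to be added to the environment before continuing.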

## Warm-up Exercise
- Write a function that samples N data points in the p-dimensional unit cube $[0,1]^p$ and sorts the values by the squared distance to the origin divided by the squared distance of the corner farthest from the origin (the corner $(1,\dots,1)$, whose squared distance is $p$).

- Execute the function for N=10000 and $p \in \{1, 2, \dots, 20\}$.
- Plot the minimum distance as a function of p.

In [None]:
# TODO
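A possible starting point for the warm-up is sketched below. It assumes the points are drawn uniformly at random and reads "minimum distance" as the smallest normalized squared distance per dimension. The function name and the fixed seed are illustrative choices, not part of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt


def sorted_normalized_sq_distances(n, p, rng):
    """Sample n points in [0, 1]^p and return their sorted normalized squared distances."""
    points = rng.random((n, p))             # uniform samples, shape (n, p)
    sq_dist = np.sum(points ** 2, axis=1)   # squared Euclidean distance to the origin
    # The farthest corner is (1, ..., 1), whose squared distance to the origin is p.
    return np.sort(sq_dist / p)


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility
dims = np.arange(1, 21)
# After sorting, the first entry is the minimum for that dimension.
min_dist = [sorted_normalized_sq_distances(10_000, p, rng)[0] for p in dims]

plt.plot(dims, min_dist, marker="o")
plt.xlabel("p")
plt.ylabel("minimum normalized squared distance")
plt.title("Closest of N=10000 samples to the origin")
```

Because the values are divided by p, they always lie between 0 and 1, which makes the curves for different dimensions comparable.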


## Data Processing

The most useful Python library for data science is pandas: https://pandas.pydata.org/

It provides two basic data structures, Series and DataFrame.
- Series: one-dimensional labeled data array of any type (a single data column with an optional name)
- DataFrame: two-dimensional data structure with multiple columns, each with a header (column title)
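The two structures can be tried out on a few hand-typed values. In the pandas API the one-dimensional structure is the class `pd.Series`; the numbers and column names below are toy values made up for illustration.

```python
import pandas as pd

# A Series: one labeled column of values (toy measurements, not from iris.csv).
lengths = pd.Series([5.1, 4.9, 6.3], name="length")

# A DataFrame: several named columns sharing one row index.
flowers = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "width": [3.5, 3.0, 3.3],
    "label": ["a", "a", "b"],
})

print(lengths.name, lengths.dtype)
print(flowers.shape)
# Selecting a single column of a DataFrame yields a Series.
print(type(flowers["width"]).__name__)
```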

Pandas provides an easy way of working with data alongside state-of-the-art machine learning libraries.
We typically use it for
- descriptive statistics of the data
- data visualization in connection with plotting libraries like matplotlib or seaborn (pandas also has some integrated visualization capabilities)
- data transformation
- combining and splitting datasets
- ...


__We will need these commands and skills throughout the lecture, so make sure that you familiarize yourself with the pandas library.__

To get fluent with pandas, carry out the following __exercises__.
- Use the documentation and API reference of pandas to learn the basics about these functions.
- These exercises guide you through a set of standard data science tasks.



## Data loading and easy transformations
Load the following two datasets into dataframes:
- iris.csv into a dataframe named iris_df
- decision-tree.txt into a dataframe named tree_df


In [None]:
import pandas as pd

iris_df = pd.read_csv("iris.csv")
iris_df.head()


In [None]:
tree_df = pd.read_csv("decision-tree.txt")
# tree_df.head()
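The default separator of `pd.read_csv` is a comma, which may not match a `.txt` file. As a sketch, assuming the delimiter is unknown, the Python parsing engine can try to detect it when `sep=None` is passed. The in-memory sample below is a hypothetical stand-in, not the real content of decision-tree.txt.

```python
import io
import pandas as pd

# Hypothetical tab-separated stand-in for a text file with an unknown delimiter.
sample_text = "outlook\ttemperature\tplay\nsunny\thot\tno\nrainy\tmild\tyes\n"

# sep=None asks the Python engine to sniff the delimiter instead of assuming a comma.
sample_df = pd.read_csv(io.StringIO(sample_text), sep=None, engine="python")
print(sample_df.columns.tolist())
print(sample_df.shape)
```

If a loaded dataframe shows a single wide column, a wrong separator is a likely cause, and passing the correct `sep` explicitly is the more robust fix once the delimiter is known.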

For iris_df, do the following:
- Look at the dataframe.
- Get the column names.
- Rename the columns such that all names are written in UPPERCASE.
- Get a simple statistic for the data.
- Generate a one-hot encoding for the class values.

In [None]:
# Look at the dataframe
iris_df

In [None]:
# Get column names
iris_df.columns.values

In [None]:
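# Rename the columns such that all names are written in UPPERCASE.
# rename() returns a renamed copy; iris_df keeps its lowercase names for the cells below.
iris_df.rename(columns=str.upper).head()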


In [None]:
# Get a simple statistic for the data.
iris_df.describe(include='all')

In [None]:
# Generate a one-hot encoding for the class values
# To get 1/0 instead of True/False, add parameter: dtype='int'
pd.get_dummies(iris_df, columns=['class'])
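The renaming and encoding steps can be checked on a small toy frame. This is a sketch with made-up values; only the column layout mimics the exercise, and the names `toy_df`, `upper_df` and `encoded` are illustrative.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column layout mimics the exercise.
toy_df = pd.DataFrame({
    "sepal length": [5.1, 7.0, 6.3],
    "class": ["setosa", "versicolor", "virginica"],
})

# Assigning to .columns changes the names in place; str.upper() alone does not.
upper_df = toy_df.copy()
upper_df.columns = upper_df.columns.str.upper()
print(upper_df.columns.tolist())

# One indicator column per class value, encoded as 0/1 integers.
encoded = pd.get_dummies(toy_df, columns=["class"], dtype=int)
print(encoded)
```

`get_dummies` names each new column after the original column and the value it indicates, joined by an underscore.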

## Data joins 

Load the individual datasets iris1, iris2, iris3, iris4, iris5. Join them appropriately into one dataframe taking into account column names and indices.

Check that your joined dataframe corresponds to iris_df.

In [None]:
iris_df1 = pd.read_csv("iris1.csv")
iris_df2 = pd.read_csv("iris2.csv")
iris_df3 = pd.read_csv("iris3.csv")
iris_df4 = pd.read_csv("iris4.csv")
iris_df5 = pd.read_csv("iris5.csv")

# Merge all the five dataframes above into one dataframe
iris_df12 = pd.concat([iris_df1, iris_df2])
iris_df34 = pd.concat([iris_df3, iris_df4])
iris_df1234 = pd.merge(iris_df12, iris_df34)

iris_merged_df = pd.merge(iris_df1234, iris_df5)
iris_merged_df.drop("Unnamed: 0", axis=1, inplace=True)
iris_merged_df


In [None]:
# Confirm that iris_merged_df has the same content as iris_df
iris_merged_df.equals(iris_df)
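How the pieces fit together depends on whether they share columns (stack the rows) or share rows (align on the index). The sketch below illustrates both cases on a hypothetical table that is split and rebuilt; it does not assume anything about how the iris files are actually partitioned.

```python
import pandas as pd

# Hypothetical full table, standing in for the reference dataframe.
full_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": ["w", "x", "y", "z"],
})

# Split it into pieces: two row blocks of columns a and b, plus column c on its own.
top = full_df.loc[:1, ["a", "b"]]
bottom = full_df.loc[2:, ["a", "b"]]
right = full_df[["c"]]

# Same columns, different rows: stack vertically.
left = pd.concat([top, bottom], axis=0)
# Same rows, different columns: align on the index.
rebuilt = left.join(right)

# assert_frame_equal raises with a description of the first difference, if any.
pd.testing.assert_frame_equal(rebuilt, full_df)
print(rebuilt.equals(full_df))
```

When `equals` returns False, `pd.testing.assert_frame_equal` is often more helpful because its error message points to the mismatching column, dtype, or index.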

# Part 2

## Elementary data analysis and visualization of the Iris dataset

How is each of the quantities sepal / petal length / width distributed?
- Compute statistical quantities like mean, standard deviation.
- Are there any values far from the average?
- Visualize the data distribution by appropriate histogram plots. 
    - Use matplotlib.pyplot.hist()
    - Familiarize yourself with the hist() method and its parameters
    - Try at least the following two strategies for the bins parameter: define appropriate binning yourself and at least one pre-defined strategy (e.g. 'auto'). 
    - See the documentation of matplotlib to understand how hist() works. 
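The two binning strategies can be compared side by side. The sketch below uses synthetic data as a stand-in for a dataframe column and calls `Axes.hist`, which takes the same `bins` parameter as `matplotlib.pyplot.hist()`. The choice of 10 manual bins is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data; replace with a dataframe column such as a petal or sepal measurement.
rng = np.random.default_rng(0)
values = rng.normal(loc=5.0, scale=1.0, size=150)

# Strategy 1: explicit bin edges, here 10 equal-width bins spanning the data range.
manual_edges = np.linspace(values.min(), values.max(), 11)

fig, (ax_manual, ax_auto) = plt.subplots(1, 2, figsize=(10, 4))
counts_manual, edges_manual, _ = ax_manual.hist(values, bins=manual_edges)
ax_manual.set_title("bins = 10 manual edges")

# Strategy 2: let NumPy pick the binning with the predefined 'auto' rule.
counts_auto, edges_auto, _ = ax_auto.hist(values, bins="auto")
ax_auto.set_title("bins = 'auto'")

for ax in (ax_manual, ax_auto):
    ax.set_xlabel("value")
    ax.set_ylabel("frequency")
```

`hist` returns the bin counts and bin edges, so the effect of each strategy can also be inspected numerically rather than only by eye.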



In [None]:
# TODO: How is each of the quantities sepal / petal length / width distributed?


In [None]:
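# Compute the mean, rounded to 2 decimal places
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Sepal Length: {iris_df['sepal length'].mean():.2f}")
print(f"Sepal Width: {iris_df['sepal width'].mean():.2f}")
print(f"Petal Length: {iris_df['petal length'].mean():.2f}")
print(f"Petal Width: {iris_df['petal width'].mean():.2f}")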

In [None]:
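# Compute the standard deviation, rounded to 2 decimal places
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Sepal Length: {iris_df['sepal length'].std():.2f}")
print(f"Sepal Width: {iris_df['sepal width'].std():.2f}")
print(f"Petal Length: {iris_df['petal length'].std():.2f}")
print(f"Petal Width: {iris_df['petal width'].std():.2f}")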

In [None]:
# Calculate skew for each of the columns which are numeric

iris_df.skew(numeric_only=True)

In [None]:
# Calculate kurtosis for each of the columns which are numeric
iris_df.kurt(numeric_only=True)

In [None]:
# Are there any values far from the average?
# The describe() method provides generic information about the data such as min, max, mean, standard deviation, etc.
# Judging from the output of describe(), there seem to be no obvious outliers

# iris_df.describe(include='all')
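A more explicit check than reading the summary table is to flag values by their distance from the mean in units of the standard deviation. The sketch below assumes the common rule-of-thumb threshold of 3 standard deviations; the function name `far_from_mean` and the toy data are illustrative.

```python
import pandas as pd


def far_from_mean(df, threshold=3.0):
    """Return a boolean frame marking numeric values more than `threshold` std devs from the mean."""
    numeric = df.select_dtypes(include="number")
    z_scores = (numeric - numeric.mean()) / numeric.std()
    return z_scores.abs() > threshold


# Toy data with one deliberately extreme value (not taken from iris.csv).
toy_df = pd.DataFrame({"x": [1.0] * 20 + [100.0], "label": ["a"] * 21})
mask = far_from_mean(toy_df)
print(toy_df[mask["x"]])
```

Applied to a dataframe, `far_from_mean(df).sum()` would give the number of flagged values per numeric column.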

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Draw Histogram for sepal length
iris_df["sepal length"].hist()
plt.xlabel("Sepal Length")
plt.ylabel("Frequency")
plt.title("Distribution of Sepal Length")


In [None]:
# Draw Histogram for petal length
iris_df["petal length"].hist()
plt.xlabel("Petal Length")
plt.ylabel("Frequency")
plt.title("Distribution of Petal Length")


In [None]:
# Draw Histogram for sepal width
iris_df["sepal width"].hist()
plt.xlabel("Sepal Width")
plt.ylabel("Frequency")
plt.title("Distribution of Sepal Width")


In [None]:
# Try at least the following two strategies for the bins parameter: define appropriate binning yourself and at least one pre-defined strategy (e.g. 'auto').


In [None]:
# Draw histogram for petal length with a manually chosen number of bins
iris_df["petal length"].hist(bins=7)
plt.xlabel("Petal Length")
plt.ylabel("Frequency")
plt.title("Distribution of Petal Length")


In [None]:
# Draw Histogram for petal length with parameter bins="auto"
iris_df["petal length"].hist(bins="auto")
plt.xlabel("Petal Length")
plt.ylabel("Frequency")
plt.title("Distribution of Petal Length")
