# Day 1 - Introduction to scikit-learn and basic algorithms

These notebooks complement the lecture slides.
The idea is to apply in practice what the lecture covered conceptually.

## 1.0 Exploratory data analysis (EDA)

Before we work with the data, we should take a careful look at it. We import the data, examine it, visualise it, and check it for potential errors and missing values.

### 1.1 The dataset

* For the initial exercises, we use a dataset about wine quality from the University of Massachusetts Amherst (http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/).
* The dataset contains 1599 rows and 12 columns that describe different red wines.
* We will find out more about the dataset during our EDA.

## 1.2 Opening the dataset

### 1.2.1 Importing dependencies for the EDA

In [5]:
#############################################################
# Importing the libraries we need (further imports follow later)
#############################################################

# NumPy is a library for scientific computing with Python and provides methods to work with arrays and matrices
# Documentation: https://numpy.org/doc/
import numpy as np

# pandas provides DataFrames, tabular structures similar to R's data frames, whose columns can hold different data types
# Documentation: https://pandas.pydata.org/docs/
import pandas as pd

### 1.2.2 Opening the dataset in Pandas
We can load the dataset directly from its remote location.

In [19]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url)

# Print the data type of "data"
print(type(data)) # We are dealing with a Pandas DataFrame

# Print out the shape of the dataframe (number rows, number columns)
print(data.shape)

# Print out the first 5 rows and the column headers
print(data.head(5))

<class 'pandas.core.frame.DataFrame'>
(1599, 1)
  fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
0   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                                                                                                     
1   7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5                                                                                                                     
2  7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...                                                                                                                     
3  11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...                                                                                                                     
4   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                                  

The shape of the DataFrame is not what we expected: there is only a single column. The printed rows show that the values are separated by ";" rather than the default ",". So, we pass this separator to the CSV reader via the `sep` parameter.

In [24]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

# Print the data type of "data"
print(type(data)) # We are dealing with a Pandas DataFrame

# Print out the shape of the dataframe (number rows, number columns)
print(data.shape)

# Print out the first 10 rows and the column headers
# Change the argument to print more rows, or omit it to use the default of 5
print(data.head(10))

<class 'pandas.core.frame.DataFrame'>
(1599, 12)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   
5            7.4              0.66         0.00             1.8      0.075   
6            7.9              0.60         0.06             1.6      0.069   
7            7.3              0.65         0.00             1.2      0.065   
8            7.8              0.58         0.02             2.0      0.073   
9            7.5              0.50         0.36             6.1      0.071   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 

Now we have 1599 rows and 12 columns, which matches the description of the dataset.
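If the separator of a file is not known in advance, it can also be guessed from the first few lines instead of being read off the printed output. The following is a minimal sketch using `csv.Sniffer` from the Python standard library on lines copied from the output above. The names `sample` and `detected_sep` are chosen for illustration only, and the list of candidate separators is an assumption.

```python
import csv

# First lines of the file, copied from the printed output above.
sample = (
    'fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    '7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n'
    '7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n'
)

# Restrict the candidates to a few common separators (an assumption),
# so that characters such as "." or " " are not considered.
dialect = csv.Sniffer().sniff(sample, delimiters=';,\t')
detected_sep = dialect.delimiter
print(repr(detected_sep))

# The detected value could then be passed on to pandas, e.g.
# pd.read_csv(dataset_url, sep=detected_sep)
```

pandas can do similar sniffing itself when `read_csv` is called with `sep=None` and `engine='python'`. Stating the separator explicitly, as in the cell above, remains the more predictable choice once it is known.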

### 1.2.3 Critically examining the dataset

* What are the dimensions of the dataset?
* What does each column mean?
* What are the data types in each column?
  * Numerical values
    * Continuous values?
    * Discrete values?
  * Any non-numerical values, like category names or boolean values like "true"/"false"?
* What could we do with the dataset? Which questions could we ask?
* Given that question, do we have a dataset for supervised or unsupervised learning? Are the correct "answers" provided in the data?
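Several of these questions can be answered directly with pandas. The following is a minimal sketch, not a complete EDA. To stay runnable without network access it uses only the three complete rows visible in the first output above (rows 0, 1 and 4); the same calls apply to the full DataFrame `data`. Treating `quality` as the target column is an assumption for illustration, and the variable names are hypothetical.

```python
import io
import pandas as pd

# Three complete rows copied from the printed output above (rows 0, 1 and 4).
sample_csv = (
    'fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    '7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n'
    '7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n'
    '7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n'
)
sample = pd.read_csv(io.StringIO(sample_csv), sep=';')

# Dimensions: (number of rows, number of columns)
print(sample.shape)

# Data type that pandas inferred for each column.
# Note: on such a tiny sample the inferred types can differ from the full file.
print(sample.dtypes)

# Missing values per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows that exactly repeat an earlier row (a potential data quality issue)
n_duplicate_rows = int(sample.duplicated().sum())
print(n_duplicate_rows)

# Few distinct values in a numerical column hint at a discrete variable
print(sample['quality'].nunique())

# Summary statistics (count, mean, std, min, quartiles, max) per column
print(sample.describe())

# Assumption: if "quality" is the answer we want to predict (supervised learning),
# the remaining columns are the features.
features = sample.drop(columns='quality')
target = sample['quality']
```

Note that row 4 repeats row 0 exactly, so even this small sample contains a duplicate. Whether such duplicates should be dropped depends on the question being asked.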

In [2]:
import sklearn


## 2.0 Basic algorithms

### 2.1 Classification with kNN (k nearest neighbours)

### 2.2 Clustering with k-means

### 2.3 Classification with decision trees
