# Introduction to Machine Learning in Python, May 2020
By Dr. Anders Christensen `anders.christensen @ unibas.ch`

---

## Part 1: Exploratory Data Analysis and Classification: Fisher's *Iris* Data Set

Our data is in the file `iris.csv`: https://www.dropbox.com/s/g3njhhml16kvaci/iris.csv

The file contains 151 lines (one header line plus 150 samples); the first 9 are shown here:


```
Petal length,Petal Width,Sepal Length,Sepal Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
```





Download the file using Linux's `wget` command:

In [0]:
!wget -O iris.csv https://www.dropbox.com/s/g3njhhml16kvaci/iris.csv

In [0]:
#Pandas DataFrames
import pandas as pd

# Read in the CSV as a Pandas DataFrame
data = pd.read_csv("iris.csv")

type(data)

Pandas can also read Excel files and many other formats.

In [0]:
data.head(n=8)    


Let's get an overview of the data:

In [0]:
data.describe()
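
`describe()` only summarizes the numeric columns. As a minimal sketch, one could also check how many samples each class has and what the per-class means look like. The small `toy` DataFrame below is a made-up stand-in for `iris.csv` (its values are illustrative, not the real measurements); only the column names follow the file header.

```python
import pandas as pd

# Toy stand-in for iris.csv: the values below are made up for illustration.
toy = pd.DataFrame({
    "Petal length": [5.0, 5.2, 6.1, 6.3, 7.0, 7.2],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Number of samples per class: a balanced data set has equal counts
counts = toy["Species"].value_counts()
print(counts)

# Mean of the numeric column, computed separately for every species
means = toy.groupby("Species").mean()
print(means)
```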


## Plotting Pandas DataFrames with Seaborn

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

grid = sns.pairplot(data)
plt.show()

There is plenty of room for customization: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt


grid = sns.pairplot(data, kind="reg", diag_kind="kde",  
                    hue="Species", corner=True)
grid.fig.suptitle("Fisher's Iris Data Set", size=32)
plt.show()

In [0]:
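plt.figure(figsize=(10, 11))
# Correlations are only defined for the numeric columns, so drop the labels
sns.heatmap(data.drop(columns="Species").corr(), annot=True)
plt.show()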

## Code demonstration 1: K-Nearest Neighbors (KNN) Classifier for the Iris Dataset


Look at the labels of the $K$ nearest neighbors: a given point probably belongs to the same class as most of its closest neighbors!

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/1280px-KnnClassification.svg.png" width="250">

Training and test data?

Splitting of dataset?

Often a dataset is split into two sets:

*   **Training set:** Data to train your model
*   **Test set:** Data to evaluate/benchmark your model

It is important that the two do ***not*** overlap: otherwise the model is tested on data it has already seen, and overfitting goes undetected!
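
A quick way to convince yourself that a split has no overlap is to compare the row indices of the two sets. This is a minimal sketch on a toy DataFrame; the column name `value` and the fixed `random_state=0` are assumptions made here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows; the column name "value" is hypothetical
toy = pd.DataFrame({"value": np.arange(10)})

# random_state fixes the shuffle so the split is reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.30, random_state=0)

# The row indices of the two sets must have nothing in common
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```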



In [0]:
# "Cheating" using function from Scikit-learn
from sklearn.model_selection import train_test_split

# Make a 70%/30% train/test split
train, test = train_test_split(data, test_size=0.30)

Let's see what is in the training and test sets:

In [0]:
train.head()

In [0]:
test.head()

In [0]:
train_features = train[["Sepal Length","Sepal Width","Petal length","Petal Width"]]
train_labels = train["Species"]

test_features = test[["Sepal Length","Sepal Width","Petal length","Petal Width"]]
test_labels = test["Species"]

First step to a simple classifier:

In [0]:
import numpy as np

# L2 distance
def distance(A, B):
  
  difference = A - B
  d = np.linalg.norm(difference)
  
  return d


query = np.asarray(test_features)[0]
train_features = np.asarray(train_features)

for i in range(len(train_features)):

  dist = distance(query, train_features[i])

  print(i, dist)
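
From these distances, a complete KNN prediction takes only a majority vote among the $K$ closest training points. The following is a minimal sketch, not a tuned implementation; the function name `knn_predict` and the toy 2-D data are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(query, features, labels, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # L2 distance from the query to every training point
    dists = np.linalg.norm(features - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data (made up for illustration): two well-separated clusters
toy_features = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                         [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels = np.array(["a", "a", "a", "b", "b", "b"])

prediction = knn_predict(np.array([0.3, 0.3]), toy_features, toy_labels, k=3)
print(prediction)
```

With an even `k`, ties are possible; here `Counter` simply returns the label it encountered first, which is one of several reasonable tie-breaking choices.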

## Demonstration of Scikit-Learn:

Fortunately, we don't always have to implement a classifier from scratch. Scikit-learn already provides a KNN classifier we can use:

https://scikit-learn.org/stable/index.html

In [0]:
# Import the classifier
from sklearn.neighbors import KNeighborsClassifier

# Initialize the object
classifier = KNeighborsClassifier(n_neighbors=15)

# Fit the model
classifier.fit(train_features, train_labels)

KNN documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

In [0]:
# Make a prediction (as a NumPy array, matching the format used for fitting)
predicted_labels = classifier.predict(np.asarray(test_features))
print(predicted_labels)

In [0]:

from sklearn import metrics

# accuracy_score expects the true labels first, then the predictions
accuracy = metrics.accuracy_score(test_labels, predicted_labels)
print("KNN Classifier Accuracy is {:4.1f}%".format(accuracy * 100))

How about another classifier?

In [0]:
from sklearn.neighbors import NearestCentroid
classifier = NearestCentroid()
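
Like the KNN classifier, `NearestCentroid` follows the usual scikit-learn pattern of `fit`, `predict`, and scoring. Here is a self-contained sketch; it assumes scikit-learn's bundled copy of the Iris data (`load_iris`) instead of `iris.csv`, and a fixed `random_state=0`, so the numbers may differ from those obtained with the split above.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Assumption: use scikit-learn's bundled Iris data instead of iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Fit: compute one centroid (mean feature vector) per class
centroid_clf = NearestCentroid()
centroid_clf.fit(X_train, y_train)

# Predict: assign each test point to the class of its nearest centroid
centroid_pred = centroid_clf.predict(X_test)
centroid_accuracy = metrics.accuracy_score(y_test, centroid_pred)
print("Nearest Centroid Accuracy is {:4.1f}%".format(centroid_accuracy * 100))
```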

## Part 2: Linear Least Squares Regression

In the exercises today, you will implement a linear least squares regression model (often just called linear regression). Here is an introduction to linear least squares regression.

First, let's start with some data for a linear function:
\begin{equation}
y = 1.2x + \mathrm{noise}
\end{equation}

In [0]:
np.random.seed(666)

# X-values
x = np.arange(0,20.0, 0.2)

# Y-values: Y = 1.2*X + random noise 
y = 1.2 * x + np.random.normal(scale=2.0, size=len(x))

print(x.shape)
print(y.shape)

Our little "Exploratory Data Analysis":

In [0]:
import matplotlib.pyplot as plt

plt.scatter(x,y, color="C0", label="Training")
plt.plot(x, x*1.2, color="C2", label="Truth")
plt.grid(True)
plt.legend()
plt.show()

Linear Regression approximates a linear function, e.g.:

\begin{equation}
y(\mathbf{x}) = x_1 \alpha_1 + x_2 \alpha_2 + \dots + x_n \alpha_n
\end{equation}
or in vector notation:
\begin{equation}
y(\mathbf{x}) = \mathbf{x} \cdot \mathbf{\alpha}
\end{equation}

Here $\mathbf{x}$ is the feature vector (also called descriptor or representation) of a given data point, and $\mathbf{\alpha}$ is the vector of regression coefficients.

"Fitting" means finding the best set of $\alpha$-values. Stacking the feature vectors of all training points as the rows of a matrix $\mathbf{X}$, we would ideally like to satisfy:

\begin{equation}
\mathbf{y} = \mathbf{X}\mathbf{\alpha}
\end{equation}

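Noisy data generally has no exact solution, so we instead minimize the squared error against the reference values $\mathbf{y}^\text{ref}$ (hence "least squares"):
\begin{equation}
\mathbf{\hat{\alpha}} = \underset{\mathbf{\alpha}}{\text{arg min}} \, \lVert \mathbf{y}^\text{ref} - \mathbf{X}\mathbf{\alpha} \rVert^2
\end{equation}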
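
One textbook route to this minimum is the normal equations, $\mathbf{X}^T\mathbf{X}\mathbf{\alpha} = \mathbf{X}^T\mathbf{y}$. The sketch below solves them directly and compares the result with NumPy's least-squares solver. It is meant only as an illustration: the toy data, the seed, and the variable names are assumptions made here, and for ill-conditioned problems a dedicated solver such as `np.linalg.lstsq` is usually the safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data following the same model as above: y = 1.2 * x + noise
x_toy = np.arange(0, 20.0, 0.2)
y_toy = 1.2 * x_toy + rng.normal(scale=2.0, size=len(x_toy))
X_toy = x_toy.reshape((-1, 1))

# Normal equations: (X^T X) alpha = X^T y
alpha_normal = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Reference solution from NumPy's least-squares solver
alpha_lstsq = np.linalg.lstsq(X_toy, y_toy, rcond=None)[0]

print(alpha_normal, alpha_lstsq)
```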

In [0]:
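# Small hack for 1-D representations: lstsq expects a 2-D matrix
X_matrix = x.reshape((-1, 1))

# Run the solver (rcond=None uses NumPy's current default and
# avoids a FutureWarning in older versions)
alpha, residual, rank, singular_values = np.linalg.lstsq(X_matrix, y, rcond=None)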

print(alpha)

How can we interpret $\alpha$ for our 1-D function?
\begin{equation}
y(\mathbf{x}) = x_1 \alpha_1
\end{equation}
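
One way to approach this question, sketched under the assumption that we minimize the squared error over the training pairs $(x_i, y_i)$ (the index $i$ over data points is notation introduced here): setting the derivative with respect to $\alpha_1$ to zero gives
\begin{equation}
\frac{\partial}{\partial \alpha_1} \sum_i (y_i - \alpha_1 x_i)^2 = -2 \sum_i x_i (y_i - \alpha_1 x_i) = 0
\quad \Rightarrow \quad
\alpha_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
\end{equation}
so $\alpha_1$ is the slope of the best-fit line through the origin, and for our data it should come out close to the true slope of 1.2.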

Let's try some predictions:

In [0]:
# Random x-values:
x_test = np.random.random(size=(15,1)) * 20.0

print(x_test)

In [0]:
print(alpha.shape)
print(x_test.shape)

y_test = np.matmul(x_test, alpha)

plt.plot(x, x*1.2, color="C2", label="Truth", zorder=-1)
plt.scatter(x,y, color="C0", label="Training")
plt.scatter(x_test,y_test, color="C3", label="Test")

plt.legend()
plt.show()