# Introduction to Machine Learning
This material is adapted from: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/

### Getting started
```
0. Ensure that Python version > 3.8.0 is installed.
1. Set up a virtual environment called '.venv'
    a. In your terminal: $ python -m venv .venv
2. Activate the virtual environment
    a. In Powershell or Windows CMD: $ .venv\Scripts\activate
    b. In Linux or macOS: $ source .venv/bin/activate
3. Install all necessary libraries
    a. $ pip install -r requirements.txt
    b. Upgrade pip if necessary: $ python -m pip install --upgrade pip
4. Open up Jupyter Notebook
    a. $ jupyter notebook
```


In [None]:
#ensure all packages import without error
import scipy
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

#our model imports to use later
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, accuracy_score

### Prepare dataset

In [None]:
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)
print(len(dataset))
dataset.head()

### Analyze dataset

In any machine learning project, it is absolutely critical to understand the distribution of your variables of interest. Not only will it give you clues about the relationships present in your dataset, it will also help you select the most appropriate machine learning method.

In [None]:
# class distribution - is it balanced or unbalanced?
print(dataset.groupby('class').size())
classes = dataset['class'].unique()

In [None]:
# pandas ships with built-in plotting helpers, such as a scatter matrix of every feature pair

pd.plotting.scatter_matrix(dataset, figsize = (10,10));

In [None]:
# it may be handy to visualize the distribution of our values for each class as well

# the following code iterates through each class and plots a histogram of each feature
fig, axs = plt.subplots(1,3, figsize = (15,5))
plot_id = 0
for cls in classes: # for each class
    for feature in dataset.columns[:-1]: # for each feature
        axs[plot_id].hist(
            dataset.loc[dataset['class'] == cls][feature],
            bins = np.arange(0,8,0.5)
            ) 
        axs[plot_id].set_ylim(0,40)
        axs[plot_id].set_xlabel(cls)
    plot_id += 1
axs[-1].legend(dataset.columns[:-1])


In [None]:
# what relationships can we spot within our feature set?
# are there any clear correlations between our distribution of features and our classes? 
# what exactly are petal and sepal length? understanding what the data represents can help you massively later
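
One way to start answering these questions is to compare per-class feature means and pairwise feature correlations. The sketch below is only an illustration: it assumes the copy of the iris data bundled with scikit-learn (`load_iris`) instead of the CSV loaded above, so that it runs offline, and its column names (for example `petal length (cm)`) differ from the `names` list used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data so the example runs offline
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer 'target' column

# replace the integer target with readable class names
df['class'] = df['target'].map(dict(enumerate(iris.target_names)))
df = df.drop(columns='target')

# mean of each feature within each class: large gaps hint at useful features
class_means = df.groupby('class').mean()
print(class_means)

# Pearson correlation between every pair of features
feature_corr = df.drop(columns='class').corr()
print(feature_corr)
```

Features whose class means are far apart relative to their spread are good candidates for separating the classes, and strongly correlated feature pairs carry partly redundant information.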

### Define the problem

Given a set of petal and sepal dimensions, predict the type of iris. Simple!

### Split into train and validation sets

In practice, splitting is very important. Understanding how your dataset is sampled from the population is key to deciding on a splitting strategy. If you sample your irises from one garden only, then no matter how you split it, you can never say your model generalizes outside of that garden. 

For the sake of argument, let's say that this dataset is extracted from two gardens, garden 1 and garden 2. Then, we can split by garden and answer the question "does training on garden 1 let me predict irises in garden 2?"

In [None]:
#shuffle the dataset randomly
dataset = dataset.sample(frac=1, random_state=42) #set the seed to make it repeatable

#split the dataset in half and label it garden 1 and garden 2
df_garden1 = dataset.iloc[:75]
df_garden2 = dataset.iloc[75:]

#is class balance going to be affected?

#split up each dataframe into inputs and class labels
x_train = df_garden1.loc[:, df_garden1.columns != 'class']
y_train = df_garden1['class']

print(f'x_train has shape {x_train.shape} and y_train has shape {y_train.shape}')

x_val = df_garden2.loc[:, df_garden2.columns != 'class']
y_val = df_garden2['class']

print(f'x_val has shape {x_val.shape} and y_val has shape {y_val.shape}')
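
The question of whether a random half-and-half split affects class balance can be checked directly by counting classes in each half. The following is a minimal sketch, assuming scikit-learn's bundled iris data (integer `target` labels) rather than the CSV above; the names `garden1`, `garden2`, `strat1`, and `strat2` are illustrative. It also shows `train_test_split` with `stratify`, which is one way to keep class proportions equal in both halves.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data, with classes encoded as integers in 'target'
df = load_iris(as_frame=True).frame

# random split, mirroring the shuffle-then-slice approach
shuffled = df.sample(frac=1, random_state=42)
garden1, garden2 = shuffled.iloc[:75], shuffled.iloc[75:]

# count how many examples of each class landed in each half
balance = pd.DataFrame({
    'garden 1': garden1['target'].value_counts(),
    'garden 2': garden2['target'].value_counts(),
}).sort_index()
print(balance)

# stratified split: each half receives the same share of every class
strat1, strat2 = train_test_split(
    df, test_size=0.5, stratify=df['target'], random_state=42
)
print(strat1['target'].value_counts().sort_index())
```

A purely random split usually leaves the halves slightly unbalanced, which matters more as the dataset gets smaller or the classes get rarer.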

### Apply some models

We will use several different classic machine learning techniques to predict the iris class, given the petal and sepal features.

But first, we need to decide on our metric. Because this is a classification problem, we have several options.

- Accuracy: what fraction of all examples were classified correctly?
- Precision: of all the times a given class was predicted, what fraction of those predictions were correct?
- Recall: of all the true instances of a given class, what fraction were predicted correctly?

Accuracy is the simplest, but it can obscure failures of our model, particularly if classes are imbalanced.
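
To make the difference between these metrics concrete, here is a small hand-made example. The labels `'rare'` and `'common'` and the predictions are invented for illustration and are not taken from the iris data. For the `'rare'` class there is one true positive, one false positive, and one false negative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy, made-up labels: 2 'rare' examples and 8 'common' ones
y_true = ['rare', 'rare'] + ['common'] * 8
# The model finds one 'rare', misses the other, and raises one false alarm
y_pred = ['rare', 'common', 'rare'] + ['common'] * 7

# scikit-learn metrics take the true labels first, then the predictions
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, pos_label='rare')
rec = recall_score(y_true, y_pred, pos_label='rare')

print(f'accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}')
```

Accuracy comes out at 0.80 even though only half of the `'rare'` predictions are right (precision 0.50) and only half of the true `'rare'` examples are found (recall 0.50).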

In [None]:
models = {}
models['LR'] = {'model': LogisticRegression(solver='liblinear', multi_class='ovr')}
models['LDA'] = {'model': LinearDiscriminantAnalysis()}
models['KNN'] = {'model': KNeighborsClassifier()}
models['CART'] = {'model': DecisionTreeClassifier()}
models['NB'] = {'model': GaussianNB()}
models['SVM'] = {'model': SVC(gamma='auto')}

# evaluate each model in turn
for name, m_dict in models.items():
    m_dict['results'] = m_dict['model'].fit(x_train, y_train).predict(x_val) #fit and predict with the model
    m_dict['accuracy'] = sum(m_dict['results'] == y_val) / len(y_val) #calculate accuracy
    print(name, '{:.3f}'.format(m_dict['accuracy']))

    #log precision and recall metrics as well for each class (one-vs-rest)
    #note: sklearn metrics expect (y_true, y_pred) in that order;
    #passing predictions first would swap precision and recall
    for cls in classes:
        m_dict[f'{cls}_precision'] = precision_score(
            y_val == cls,
            m_dict['results'] == cls,
            average = 'binary'
            )
        m_dict[f'{cls}_recall'] = recall_score(
            y_val == cls,
            m_dict['results'] == cls,
            average = 'binary'
            )
        m_dict[f'{cls}_accuracy'] = accuracy_score(
            y_val == cls,
            m_dict['results'] == cls,
            )
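
The per-class metrics above can also be read off a confusion matrix, which shows exactly which classes get mistaken for which. This is a minimal sketch, not part of the original tutorial: it assumes scikit-learn's bundled iris data, a random 50/50 split via `train_test_split`, and a single `KNeighborsClassifier` with default settings.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data with integer labels 0, 1, 2
iris = load_iris()
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42
)

# fit one model and predict on the held-out half
y_pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_val)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_val, y_pred, labels=[0, 1, 2])
print(cm)

# precision, recall and F1 per class in one table
print(classification_report(
    y_val, y_pred, labels=[0, 1, 2], target_names=iris.target_names
))
```

Off-diagonal entries of the matrix are the misclassifications; a row with many off-diagonal counts means low recall for that class, and a column with many means low precision.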

### Visualize by class
Are any classes more difficult to predict than others?

In [None]:
#try seeing what different metrics look like for different classes
metric = 'accuracy'
bar_offset = -.2
plt.figure(figsize = (10,10))

for cls in classes:
    #practice list comprehension to obtain results from the dict
    vals = [m_dict[cls + '_' + metric] for _,m_dict in models.items()]

    #make a bar plot
    plt.bar(np.array(range(len(models))) + bar_offset, vals, width = 0.2)
    plt.xticks(range(len(models)), labels = list(models.keys()), fontsize = 20)

    bar_offset += 0.2

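#label the bars with `classes`, the order used in the loop above;
#the dataset has been shuffled since, so dataset['class'].unique() may return a different order
plt.legend(classes, loc = 'lower right', fontsize = 16)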

### Follow up problems

1. Which feature in the dataset is most useful for prediction? How would you test that?
2. How does the quantity of training data affect performance metrics? Which methods are capable of learning on less data?
3. What other forms of data could you collect from garden 1 to improve your model?
4. What does it mean to have high recall and low precision? If I develop a computer vision model that warns a driver of obstacles in the road, do I care more about precision or recall?