# Intro to Machine Learning using the Iris dataset

## Explore the data

Fetch the Iris dataset from sklearn. `X` holds the input features and `y` is the target.

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris(as_frame=True)
target_names = dataset["target_names"]
X = dataset["data"]
y = dataset["target"].astype("category")

model_feature_id = "target"
X

In [None]:
y

What do the target integers mean?

In [None]:
target_names
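
To make the integer codes easier to read, they can be mapped to the species names. The following is a minimal sketch; the names `iris`, `code_to_name` and `species` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris

# Load a fresh copy so the notebook's own variables are left untouched.
iris = load_iris(as_frame=True)

# Position i in target_names is the label for integer code i.
code_to_name = dict(enumerate(iris["target_names"]))

# Replace each integer code with its species name.
species = iris["target"].map(code_to_name)
print(code_to_name)
print(species.value_counts())
```

Keeping the integer codes for modelling and using the mapped names only for display avoids changing what the model is trained on.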

Combine the data and look at the distributions

In [None]:
df = pd.concat([X,y], axis=1)
df

What does the input look like? Is it numeric or categorical?

In [None]:
df.info()

Examine the basic stats for the column values

In [None]:
df.describe()

Are there any missing values?

In [None]:
df.isna().mean()

In [None]:
import seaborn as sns

sns.pairplot(df, hue='target')

Can you spot any trends?

Now split the data into sub-samples for train and test. The idea here is to keep some samples back that the model hasn't seen before and then use them to check how well the model generalises. We will use 80% of records for training and save 20% back for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle=False, stratify=None)

What does the training data look like? Could we have prepared it any better to give the model a better chance of having a more general view of the data? (clue: shuffle and stratify)

In [None]:
# Count how many training samples fall into each class
y_train.value_counts()
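
One way to act on the shuffle and stratify clue is sketched below. It assumes a fixed `random_state` (an arbitrary seed, chosen here only for reproducibility) and uses illustrative variable names with an `_s` suffix so the split above is not overwritten.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a fresh copy so the notebook's own X and y are left untouched.
iris = load_iris(as_frame=True)
X_all, y_all = iris["data"], iris["target"]

# shuffle=True mixes the rows before splitting; stratify=y_all keeps the
# class proportions the same in the train and test subsets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, train_size=0.8, shuffle=True, stratify=y_all, random_state=42
)

train_counts = y_train_s.value_counts().sort_index()
test_counts = y_test_s.value_counts().sort_index()
print(train_counts)
print(test_counts)
```

Because Iris has 50 samples per class, a stratified 80/20 split places 40 samples of each class in the training set and 10 of each in the test set, whereas the unshuffled split leaves the last class under-represented in training.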

## Train an ML model

We will use a simple Logistic Regression from sklearn to start with. The default scoring metric, from the `model.score()` function, is accuracy, i.e. the fraction of predictions that are correct.

The model does pretty well on known data

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_train, y_train)

How does the model do on unseen data?

In [None]:
model.score(X_test, y_test)
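
The accuracy returned by `model.score()` can be reproduced by hand as the fraction of predictions that match the true labels. The sketch below is self-contained and, for simplicity, fits and scores on the full dataset; `clf` and the other names are illustrative, and `max_iter=1000` is an assumption made only to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X_all, y_all = iris["data"], iris["target"]

# Fit a separate classifier so the notebook's `model` is not modified.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_all, y_all)

# Accuracy = number of correct predictions / total number of predictions.
predictions = clf.predict(X_all)
manual_accuracy = np.mean(predictions == y_all.to_numpy())
library_accuracy = clf.score(X_all, y_all)
print(manual_accuracy, library_accuracy)
```

The two values should agree, which confirms that the score is a proportion between 0 and 1 rather than a raw count.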

We can also use a bunch of other metrics from sklearn to understand how well the model performed. First get some predictions

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

For classification problems, a confusion matrix gives you an idea of how well you are predicting classes and helps visualise true positives, false positives, etc.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
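
To see how the plotted matrix is built, it can help to compute one on a tiny made-up example. The labels below are invented purely for illustration and are not results from the Iris model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six samples.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)

# The diagonal holds the correct predictions for each class.
correct = cm.trace()
print(correct, "of", cm.sum(), "predictions are correct")
```

Entries off the diagonal show which classes are being confused with each other, for example a true class 0 sample predicted as class 1.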

## This is great but how would I actually use a model?

Put simply, we save it to a file and then reload it when we need to make predictions. In ML lingo we would say that the model is trained `offline` and then predictions are served `online`.

In the following we will save a model to disk, delete the object, reload it from disk and then make predictions.

In [None]:
from joblib import dump, load
model_path = 'iris_model.joblib'
dump(model, model_path)
del model
model = load(model_path)
y_pred2 = model.predict(X_test)
y_pred2
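
A quick sanity check after reloading is to confirm that the restored model gives the same predictions as the original. This is a minimal sketch with illustrative names (`clf`, `reloaded`); it writes to a temporary directory so that `iris_model.joblib` above is not overwritten.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X_all, y_all = iris["data"], iris["target"]

# max_iter=1000 is an assumption to give the solver room to converge.
clf = LogisticRegression(max_iter=1000).fit(X_all, y_all)
before = clf.predict(X_all)

# Save and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_model_check.joblib")
    dump(clf, path)
    reloaded = load(path)

after = reloaded.predict(X_all)
same_predictions = np.array_equal(before, after)
print(same_predictions)
```

Note that joblib files are pickle-based, so they should only be loaded from sources you trust, ideally with the same library versions that were used to save them.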
