<a href="https://colab.research.google.com/github/ZKisielewska/learning-git-task/blob/master/M_12_5_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In Google Colab, write a program that loads data for the task using two selected loaders from the **sklearn.datasets** module.

**Recommended datasets:**

- digits
- wine classification
- classification of irises
- handwritten numbers (loaded with load_mnist_data)

To complete:

- Select a few **classifiers** (these may be those listed in this chapter).
- **Train and calculate metrics** for each classifier, for test and training datasets (any split).
- Draw conclusions about the performance of each classifier: does it **work well** or does it **overfit**?
- Write down the conclusions as a simple report with a verbal description: how the data was split, what models were used, and what conclusions were drawn.

In [None]:
# import loaders for the selected datasets
from sklearn.datasets import load_iris
from sklearn.datasets import load_wine

# import selected classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings("ignore")

### **Load the Iris dataset**

In [None]:
# store the dataset in a variable
# 'iris' is a dictionary-like Bunch object; its 'data' (features) and 'target' (labels) entries are NumPy arrays
iris = load_iris()
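To make the structure described in the comment above concrete, the following minimal sketch prints the main fields of the object returned by `load_iris`. It is an illustration added here and only uses attributes that the loader provides.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# the returned Bunch behaves like a dictionary; list its keys
print(iris.keys())

# one row per flower, one column per measured feature
print("data shape:", iris.data.shape)
print("feature names:", iris.feature_names)

# target holds integer class labels that index into target_names
print("target names:", iris.target_names)
print("first 5 rows:\n", iris.data[:5])
print("first 5 labels:", iris.target[:5])
```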

**Logistic Regression**

In [None]:
from sklearn.model_selection import train_test_split

# divide our data into predictors (X) and target values (y)
X = iris.data
y = iris.target

# split the data, 80% in training set and 20% in test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify = y)
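Because the split uses `stratify=y`, each class should appear in the training and test sets in the same proportion as in the full dataset. A quick check, sketched here with `numpy.bincount` (a choice made for this illustration), counts the labels in each part:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# count how many samples of each class landed in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("train size:", len(y_train), "class counts:", train_counts)
print("test size: ", len(y_test), "class counts:", test_counts)
```

With 50 samples per class and a 20% test share, a stratified split puts 40 samples of each class in the training set and 10 in the test set.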

In [None]:
# create the classifiers
log_clf = LogisticRegression()
log_clf_all = LogisticRegression()

print("fitting - training...")
log_clf.fit(X_train, y_train)

print("training on whole dataset...")
log_clf_all.fit(X, y)

print("predicting...")
y_pred = log_clf.predict(X_test)

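# print the first 10 predictions for the test set
# note: y[:10] holds the first 10 labels of the full, unshuffled dataset (all class 0),
# so it is not aligned with y_pred; y_test[:10] is the matching slice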
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
log_clf_score = log_clf.score(X_train, y_train)
print("Train score = ", log_clf_score)

log_clf_score = log_clf.score(X_test, y_test)
print("Test score = ", log_clf_score)

log_clf_score = log_clf_all.score(X, y)
print("whole set score = ", log_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 2 1 1 0 1 0 0 2 1]
scoring...
Train score =  0.975
Test score =  0.9666666666666667
whole set score =  0.9733333333333334
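The "true values" printed above come from `y[:10]`, the first rows of the full dataset, so they cannot be compared element by element with `y_pred`. A like-for-like comparison uses `y_test` instead. The sketch below repeats the split and the fit so that it runs on its own; `max_iter=1000` is an assumption added here so the solver converges without warnings, and `accuracy_score` and `confusion_matrix` are one possible choice of metrics.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# max_iter is raised here as an assumption, to avoid convergence warnings
log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(X_train, y_train)
y_pred = log_clf.predict(X_test)

# compare predictions with the labels of the same test samples
print("true values ", y_test[:10])
print("predicted   ", y_pred[:10])

# accuracy equals what log_clf.score(X_test, y_test) reports
acc = accuracy_score(y_test, y_pred)
print("test accuracy =", acc)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```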


**Decision Tree Classifier**

In [None]:
# create the classifiers
tree_clf = DecisionTreeClassifier()
tree_clf_all = DecisionTreeClassifier()

print("fitting - training...")
tree_clf.fit(X_train, y_train)

print("training on whole dataset...")
tree_clf_all.fit(X, y)

print("predicting...")
y_pred = tree_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
tree_clf_score = tree_clf.score(X_train, y_train)
print("Train score = ", tree_clf_score)

tree_clf_score = tree_clf.score(X_test, y_test)
print("Test score = ", tree_clf_score)

tree_clf_score = tree_clf_all.score(X, y)
print("whole set score = ", tree_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 2 1 1 0 1 0 0 2 1]
scoring...
Train score =  1.0
Test score =  0.9333333333333333
whole set score =  1.0
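A training score of 1.0 next to a lower test score is typical of a tree that is allowed to grow until every training sample is classified correctly. One common way to probe this, sketched here as an illustration rather than as part of the original experiment, is to limit `max_depth` and watch how the two scores move. The depth values and `random_state=42` are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# None means no depth limit, which is the default used in the cell above
results = {}
for depth in (1, 2, 3, None):
    # random_state is fixed here (an assumption) so repeated runs give the same tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f} test={results[depth][1]:.3f}")
```

If a shallow tree reaches a similar test score with a lower training score, the extra depth mostly memorizes the training data.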


**SVC**

In [None]:
# create the classifiers
svc_clf = SVC()
svc_clf_all = SVC()

print("fitting - training...")
svc_clf.fit(X_train, y_train)

print("training on whole dataset...")
svc_clf_all.fit(X, y)

print("predicting...")
y_pred = svc_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
svc_clf_score = svc_clf.score(X_train, y_train)
print("Train score = ", svc_clf_score)

svc_clf_score = svc_clf.score(X_test, y_test)
print("Test score = ", svc_clf_score)

svc_clf_score = svc_clf_all.score(X, y)
print("whole set score = ", svc_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 2 1 1 0 1 0 0 2 1]
scoring...
Train score =  0.9833333333333333
Test score =  0.9666666666666667
whole set score =  0.9733333333333334


**KNeighborsClassifier**

In [None]:
# create the classifiers
knn_clf = KNeighborsClassifier()
knn_clf_all = KNeighborsClassifier()

print("fitting - training...")
knn_clf.fit(X_train, y_train)

print("training on whole dataset...")
knn_clf_all.fit(X, y)

print("predicting...")
y_pred = knn_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
knn_clf_score = knn_clf.score(X_train, y_train)
print("Train score = ", knn_clf_score)

knn_clf_score = knn_clf.score(X_test, y_test)
print("Test score = ", knn_clf_score)

knn_clf_score = knn_clf_all.score(X, y)
print("whole set score = ", knn_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 2 1 1 0 1 0 0 2 1]
scoring...
Train score =  0.9666666666666667
Test score =  1.0
whole set score =  0.9666666666666667


The **IRIS** dataset was used to build the **classifiers**. The dataset contains information for three classes of the IRIS plant, namely:

- IRIS **Setosa**
- IRIS **Versicolour**
- IRIS **Virginica**

with the following attributes: **sepal length, sepal width, petal length, and petal width.**

We trained each model on the training set and tested it on the test set. The split was random and stratified by class: **80%** of the data went into the **training** set and **20%** into the **test** set.

We used the following classifiers:

- **Logistic Regression**

We can see that our model works well. The training score is 97.5% and the test score is about 96.7%, so the two are close.

- **Decision Tree**

It might seem that in the case of the DecisionTreeClassifier we are dealing with overfitting. For the training data we have a metric of **1.0**, while for the test data the metric is lower (about 93.3%), which means the model generalizes less well than its training score suggests.

- **SVC**

The training score is about 98.3% and it is higher than the test score, which is about 96.7%. It might seem that our model is overfitting, although with only 30 test samples a single misclassification changes the test score by about 3.3 percentage points, so this gap is small.

- **KNeighborsClassifier**

This model works very well. Its scores are about 96.7% on the training set and 100% on the test set.
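The four per-classifier cells above follow the same fit-and-score pattern, so the comparison in this report can also be produced in one loop. The sketch below is one possible way to do that; the dictionary of models, `max_iter=1000` and `random_state=42` are assumptions made for the illustration, and the gap column is simply the training score minus the test score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# max_iter and random_state are assumptions added for reproducible, warning-free runs
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    # a large positive gap (train much higher than test) hints at overfitting
    scores[name] = (train_score, test_score, train_score - test_score)
    print(f"{name:<24} train={train_score:.3f} test={test_score:.3f} "
          f"gap={train_score - test_score:+.3f}")
```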



### **Load the Wine dataset**

In [None]:
# save data as variable
wine = load_wine()

In [None]:
# divide our data into predictors (X) and target values (y)
X = wine.data
y = wine.target

# split the data, 80% in training set and 20% in test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify = y)
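Before fitting models on the Wine data it can help to look at the value range of each feature, because distance-based models such as SVC and KNeighborsClassifier are sensitive to features measured on very different scales. The following sketch prints the minimum and maximum of every column; it is an added illustration that uses only the loader's own `feature_names`.

```python
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data

print("data shape:", X.shape)
print("classes:", wine.target_names)

# per-feature minimum and maximum over all samples
mins = X.min(axis=0)
maxs = X.max(axis=0)
for name, lo, hi in zip(wine.feature_names, mins, maxs):
    print(f"{name:<30} min={lo:10.2f} max={hi:10.2f}")

# ratio between the widest and the narrowest feature range
ranges = maxs - mins
print("largest range / smallest range =", ranges.max() / ranges.min())
```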

**Logistic Regression**

In [None]:
# create the classifiers
log_clf = LogisticRegression()
log_clf_all = LogisticRegression()

print("fitting - training...")
log_clf.fit(X_train, y_train)

print("training on whole dataset...")
log_clf_all.fit(X, y)

print("predicting...")
y_pred = log_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
log_clf_score = log_clf.score(X_train, y_train)
print("Train score = ", log_clf_score)

log_clf_score = log_clf.score(X_test, y_test)
print("Test score = ", log_clf_score)

log_clf_score = log_clf_all.score(X, y)
print("whole set score = ", log_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 2 0 1 1 0 0 1 1 2]
scoring...
Train score =  0.9788732394366197
Test score =  0.9722222222222222
whole set score =  0.9662921348314607


**Decision Tree Classifier**

In [None]:
# create the classifiers
tree_clf = DecisionTreeClassifier()
tree_clf_all = DecisionTreeClassifier()

print("fitting - training...")
tree_clf.fit(X_train, y_train)

print("training on whole dataset...")
tree_clf_all.fit(X, y)

print("predicting...")
y_pred = tree_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
tree_clf_score = tree_clf.score(X_train, y_train)
print("Train score = ", tree_clf_score)

tree_clf_score = tree_clf.score(X_test, y_test)
print("Test score = ", tree_clf_score)

tree_clf_score = tree_clf_all.score(X, y)
print("whole set score = ", tree_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 1 0 0 1 0 0 1 1 2]
scoring...
Train score =  1.0
Test score =  0.9166666666666666
whole set score =  1.0


**SVC**

In [None]:
# create the classifiers
svc_clf = SVC()
svc_clf_all = SVC()

print("fitting - training...")
svc_clf.fit(X_train, y_train)

print("training on whole dataset...")
svc_clf_all.fit(X, y)

print("predicting...")
y_pred = svc_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
svc_clf_score = svc_clf.score(X_train, y_train)
print("Train score = ", svc_clf_score)

svc_clf_score = svc_clf.score(X_test, y_test)
print("Test score = ", svc_clf_score)

svc_clf_score = svc_clf_all.score(X, y)
print("whole set score = ", svc_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 1 0 1 1 0 0 1 1 1]
scoring...
Train score =  0.676056338028169
Test score =  0.6944444444444444
whole set score =  0.7078651685393258


**KNeighborsClassifier**

In [None]:
# create the classifiers
knn_clf = KNeighborsClassifier()
knn_clf_all = KNeighborsClassifier()

print("fitting - training...")
knn_clf.fit(X_train, y_train)

print("training on whole dataset...")
knn_clf_all.fit(X, y)

print("predicting...")
y_pred = knn_clf.predict(X_test)

# print the first 10 predictions for the test set
# note: y[:10] comes from the full dataset and is not aligned with y_pred (y_test[:10] is)
print("true values ", y[:10])
print("predicted ", y_pred[:10])

print("scoring...")
knn_clf_score = knn_clf.score(X_train, y_train)
print("Train score = ", knn_clf_score)

knn_clf_score = knn_clf.score(X_test, y_test)
print("Test score = ", knn_clf_score)

knn_clf_score = knn_clf_all.score(X, y)
print("whole set score = ", knn_clf_score)

fitting - training...
training on whole dataset...
predicting...
true values  [0 0 0 0 0 0 0 0 0 0]
predicted  [0 2 0 1 2 0 0 1 1 2]
scoring...
Train score =  0.7816901408450704
Test score =  0.8055555555555556
whole set score =  0.7865168539325843


The **WINE** dataset was used to build the **classifiers**.
We trained each model on the training set and tested it on the test set. The split was random and stratified by class: **80%** of the data went into the **training** set and **20%** into the **test** set.

We used the following classifiers:

- **Logistic Regression**

We can see that our model works well. The training score is about 97.9% and the test score is about 97.2%.

- **Decision Tree**

In this case we are dealing with overfitting. For the training data we have a metric of **1.0**, while for the test data the metric is lower (about 91.7%), which means the model generalizes less well than its training score suggests.

- **SVC**

The training score is about 67.6% and it is lower than the test score, which is about 69.4%. These results are quite poor and tell us that our model does not work well.

- **KNeighborsClassifier**

In this case, as with SVC, we are dealing with low metrics (about 78.2% on the training set and 80.6% on the test set). This model will not work well either.
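The low scores for SVC and KNeighborsClassifier on this dataset may be related to feature scaling: both models rely on distances between samples, and the Wine features span very different numeric ranges. The sketch below is an assumption-based follow-up, not part of the original experiment. It wraps each model in a pipeline with `StandardScaler` and compares the test score with and without scaling.

```python
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {"SVC": SVC(), "KNeighborsClassifier": KNeighborsClassifier()}

scaled_scores = {}
for name, model in models.items():
    # same model fitted on the raw features
    plain = clone(model).fit(X_train, y_train)
    # the scaler is fitted on the training data only, then applied to the test data
    scaled = make_pipeline(StandardScaler(), clone(model)).fit(X_train, y_train)
    scaled_scores[name] = (plain.score(X_test, y_test), scaled.score(X_test, y_test))
    print(f"{name:<22} raw test={scaled_scores[name][0]:.3f} "
          f"scaled test={scaled_scores[name][1]:.3f}")
```

Putting the scaler inside the pipeline keeps the test set out of the scaling statistics, so the test score stays an honest estimate.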

It seems that in both cases, for the **IRIS** and **WINE** datasets, the best solution is to use **Logistic Regression** for modelling. Its training and test scores are above 96% on both datasets, so this classifier should predict well.
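A single 80/20 split leaves only 30 (Iris) or 36 (Wine) test samples, so one or two misclassifications move the test score noticeably. A possible way to check how stable the ranking of the classifiers is, sketched here as an assumption and not as part of the original report, is stratified 5-fold cross-validation with `cross_val_score`. The fold count, `max_iter=10000` and `random_state=42` are choices made for this example.

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# shuffle=True because both datasets are stored sorted by class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_means = {}
for data_name, loader in (("iris", load_iris), ("wine", load_wine)):
    X, y = loader(return_X_y=True)
    # fresh, unfitted models for each dataset; hyperparameters are assumptions
    classifiers = {
        "LogisticRegression": LogisticRegression(max_iter=10000),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
        "SVC": SVC(),
        "KNeighborsClassifier": KNeighborsClassifier(),
    }
    for clf_name, clf in classifiers.items():
        fold_scores = cross_val_score(clf, X, y, cv=cv)
        cv_means[(data_name, clf_name)] = fold_scores.mean()
        print(f"{data_name:<5} {clf_name:<24} "
              f"mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```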