
# Model Assessment
You should build a machine learning pipeline with a complete model assessment step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Define a grid of hyperparameters for every selected model.
- Conduct [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with k-fold cross-validation on the training set to find the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
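
The steps in this list can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual run: it assumes scikit-learn's bundled 8x8 digits dataset (`load_digits`) as a stand-in for `mnist.csv` so that it runs offline, and the grids and the `candidates` dictionary are hypothetical and deliberately small.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled digits, not mnist.csv.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Hypothetical, deliberately small grids: one (estimator, grid) pair per algorithm.
candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [5, 10]}),
    "svm": (SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 5]}),
}

# Run a 5-fold cross-validated grid search per algorithm on the training set only.
searches = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    searches[name] = search
    print(name, round(search.best_score_, 3), search.best_params_)

# Select the algorithm with the highest cross-validation accuracy.
best_name = max(searches, key=lambda n: searches[n].best_score_)
best_search = searches[best_name]

# With the default refit=True, the best estimator is already retrained on the
# whole training set, so it can be evaluated on the held-out test set directly.
test_accuracy = accuracy_score(y_test, best_search.predict(X_test))
print("selected:", best_name, "test accuracy:", round(test_accuracy, 3))
```

The test set is touched only once, after model selection, so the reported test accuracy is not biased by the hyperparameter search.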

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/mnist.csv")
df.head()

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,36953,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1981,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
df.describe()

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,...,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,34415.17925,4.4395,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.07675,0.01525,0.013,0.0015,0.0,0.0,0.0,0.0,0.0,0.0
std,20508.890104,2.879655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.616022,0.964495,0.822192,0.094868,0.0,0.0,0.0,0.0,0.0,0.0
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16575.75,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,34435.5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,52111.5,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,69998.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,125.0,61.0,52.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
df.columns

Index(['id', 'class', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
       'pixel6', 'pixel7', 'pixel8',
       ...
       'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779', 'pixel780',
       'pixel781', 'pixel782', 'pixel783', 'pixel784'],
      dtype='object', length=786)

In [6]:
y = df['class']
y.head()

0    5
1    8
2    5
3    0
4    3
Name: class, dtype: int64

In [7]:
x = df.drop(['class','id'], axis=1)

In [13]:
x.head()

Unnamed: 0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
print("df size is:", df.shape)
print("train size is:", x_train.shape)
print("test size is:", x_test.shape)

df size is: (4000, 786)
train size is: (2800, 784)
test size is: (1200, 784)
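
The split above is purely random, so the class proportions in the training and test sets can drift slightly from those in the full dataset. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical, imbalanced toy labels (the names `features` and `labels` are made up for illustration) to show that stratification preserves the class ratio in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 samples of class 0 and 20 of class 1.
labels = pd.Series([0] * 80 + [1] * 20, name="class")
features = pd.DataFrame({"pixel1": range(100)})

# stratify=labels keeps the 80/20 class ratio in both the train and test parts.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.30, random_state=42, stratify=labels
)

print(l_train.value_counts(normalize=True).sort_index())
print(l_test.value_counts(normalize=True).sort_index())
```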


## Decision Tree

In [10]:
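from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid: split criterion, maximum depth (1, 4, ..., 19),
# and minimum number of samples required to split a node (2, 5, ..., 17).
parameters_grid_1 = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(1, 20, 3),
    "min_samples_split": range(2, 20, 3),
}
model_1 = DecisionTreeClassifier()
# 5-fold cross-validated grid search on the training set only.
grid_search_1 = GridSearchCV(model_1, parameters_grid_1, cv=5, scoring='accuracy')
grid_search_1.fit(x_train, y_train)
# best_score_ is the mean cross-validation accuracy of the best parameter combination.
print("Accuracy of best decision tree classifier = {:.2f}".format(grid_search_1.best_score_))
print("best found parameters of decision tree classifier = {}".format(grid_search_1.best_params_))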

Accuracy of best decision tree classifier = 0.75
best found parameters of decision tree classifier = {'criterion': 'entropy', 'max_depth': 13, 'min_samples_split': 2}
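
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` exposes `cv_results_` and `best_estimator_`, which help when checking which attributes and methods matter in practice. The sketch below is illustrative only: it assumes scikit-learn's bundled digits dataset as a stand-in for the notebook's data, and the names `demo_grid` and `demo_search` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): bundled digits dataset, not mnist.csv.
X_demo, y_demo = load_digits(return_X_y=True)

# Hypothetical small grid: 2 criteria x 2 depths = 4 candidates.
demo_grid = {"criterion": ["gini", "entropy"], "max_depth": [4, 13]}
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), demo_grid, cv=5, scoring="accuracy"
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))

# best_estimator_ is the tree refit on all data passed to fit().
best_tree = demo_search.best_estimator_
print("depth:", best_tree.get_depth(), "leaves:", best_tree.get_n_leaves())

# Indices of the five features (pixels) the tree relies on most.
top_features = best_tree.feature_importances_.argsort()[::-1][:5]
print("most important feature indices:", top_features)
```

Looking at `std_test_score` next to `mean_test_score` shows whether the gap between two candidates is larger than the fold-to-fold variation.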


## SVM

In [11]:
from sklearn import svm

# Hyperparameter grid: kernel type and regularization strength C.
parameters_grid_2 = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.01, 0.1, 1, 10, 100],
}
model_2 = svm.SVC()
grid_search_2 = GridSearchCV(model_2, parameters_grid_2, cv=5, scoring='accuracy')
grid_search_2.fit(x_train, y_train)
print("Accuracy of best SVM classifier = {:.2f}".format(grid_search_2.best_score_))
print("best found parameters of SVM classifier = {}".format(grid_search_2.best_params_))


Accuracy of best SVM classifier = 0.95
best found parameters of SVM classifier = {'C': 10, 'kernel': 'rbf'}
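
The SVM above is fitted on raw pixel values. If preprocessing such as scaling is added, one common approach is to wrap the scaler and the classifier in a `Pipeline`, so that the scaler is refit on each cross-validation training fold and never sees the validation fold. The sketch below is an assumption-laden illustration, not part of the original experiment: it uses the bundled digits dataset as stand-in data, and the step names `scale` and `svc` are chosen here for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (assumption): bundled digits dataset, not mnist.csv.
X_pix, y_pix = load_digits(return_X_y=True)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_pix, y_pix, test_size=0.30, random_state=42
)

# Scaling and classification as one estimator; step names are illustrative.
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Pipeline hyperparameters are addressed as <step name>__<parameter name>.
pipe_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [1, 10]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(Xp_train, yp_train)

print("best CV accuracy:", round(pipe_search.best_score_, 3))
print("best parameters:", pipe_search.best_params_)
pipe_test_accuracy = pipe_search.score(Xp_test, yp_test)
print("test accuracy:", round(pipe_test_accuracy, 3))
```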


## KNN

In [15]:
from sklearn.neighbors import KNeighborsClassifier

# Note: with the default p=2, "minkowski" is the same distance as "euclidean",
# so these two grid entries produce equivalent models.
parameters_grid_3 = {
    "n_neighbors": [1, 5, 10, 15, 20],
    "metric": ["minkowski", "euclidean", "manhattan"],
}
model_3 = KNeighborsClassifier()
grid_search_3 = GridSearchCV(model_3, parameters_grid_3, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_3.fit(x_train, y_train)
print("Accuracy of best KNN classifier = {:.2f}".format(grid_search_3.best_score_))
print("best found parameters of KNN classifier = {}".format(grid_search_3.best_params_))


Accuracy of best KNN classifier = 0.91
best found parameters of KNN classifier = {'metric': 'minkowski', 'n_neighbors': 1}


## Testing the Best Model


In [21]:

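from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# grid_search_2 (SVM) achieved the highest cross-validation accuracy.
# With the default refit=True, GridSearchCV has already retrained the best
# SVM on the whole training set, so predict() uses that refit model.
y_pred = grid_search_2.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
# Without an `average` argument, these are per-class arrays (one value per digit).
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred)
cr = classification_report(y_test, y_pred)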

print("accuracy=",accuracy)
print("precision=",precision)
print("recall=",recall)
print(" f1=", f1)
print("confusion_matrix:\n",cm)
print("classification_report:\n",cr)


accuracy= 0.9491666666666667
precision= [0.99019608 0.94482759 0.92307692 0.94017094 0.94656489 0.90517241
 0.98412698 0.99065421 0.9453125  0.92792793]
recall= [1.         0.98561151 0.92307692 0.91666667 0.96124031 0.96330275
 0.94656489 0.95495495 0.90977444 0.93636364]
 f1= [0.99507389 0.96478873 0.92307692 0.92827004 0.95384615 0.93333333
 0.96498054 0.97247706 0.92720307 0.9321267 ]
confusion_matrix:
 [[101   0   0   0   0   0   0   0   0   0]
 [  0 137   0   1   0   0   0   0   1   0]
 [  1   3 108   1   3   0   0   0   1   0]
 [  0   0   4 110   0   5   0   0   1   0]
 [  0   0   0   0 124   0   1   0   1   3]
 [  0   0   0   2   0 105   0   0   1   1]
 [  0   0   1   0   0   3 124   1   2   0]
 [  0   1   0   0   1   0   0 106   0   3]
 [  0   2   2   3   0   3   1   0 121   1]
 [  0   2   2   0   3   0   0   0   0 103]]
classification_report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       101
           1       0.94  