## **How can we understand which Machine learning algorithm is better to use for our problem?**

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_digits
digits = load_digits()

### **Each time we run this cell, the  sample change totally and the score of each model will change, too!⬇️**

In [24]:
from sklearn.model_selection import train_test_split
X_test, X_train, y_test, y_train = train_test_split(digits.data, digits.target, test_size=0.3)

In [25]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9554494828957836

In [4]:
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9745425616547335

In [5]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9626093874303898

### **K FOLD**

**Importing K-Fold Cross-Validation *(It will divide the dataset into 5 different subsets, using 4 for training and 1 for testing in each round)*:**

In [6]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
kf

KFold(n_splits=5, random_state=None, shuffle=False)

**Splitting a Small Simplified Example Dataset *(This shows how different splits work)*:**

In [7]:
for train_index, test_index in kf.split([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]):
  print("Train: ", train_index, "   Test: ", test_index)

Train:  [2 3 4 5 6 7 8 9]    Test:  [0 1]
Train:  [0 1 4 5 6 7 8 9]    Test:  [2 3]
Train:  [0 1 2 3 6 7 8 9]    Test:  [4 5]
Train:  [0 1 2 3 4 5 8 9]    Test:  [6 7]
Train:  [0 1 2 3 4 5 6 7]    Test:  [8 9]


**Function to Evaluate a Model:**

In [8]:
def get_score(model, X_train, X_test, y_train, y_test):
  model.fit(X_train, y_train)
  return model.score(X_test, y_test)

**Running K-Fold on the Digits Dataset *(For each K-Fold split: It extracts the training and testing sets. It evaluates each model using `get_score()`. It appends the scores to the respective lists)*:**

In [None]:
scores_l = []
scores_svm = []
scores_rf = []

for train_index, test_index in kf.split(digits.data):
  X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                      digits.target[train_index], digits.target[test_index]

  scores_l.append(get_score(LogisticRegression(), X_train, X_test, y_train, y_test))
  scores_svm.append(get_score(SVC(C=15), X_train, X_test, y_train, y_test))
  scores_rf.append(get_score(RandomForestClassifier(n_estimators=45), X_train, X_test, y_train, y_test))

**Displaying the Scores:**

In [10]:
scores_l

[0.9277777777777778,
 0.8666666666666667,
 0.9387186629526463,
 0.935933147632312,
 0.9080779944289693]

In [11]:
scores_svm

[0.9805555555555555,
 0.9611111111111111,
 0.9832869080779945,
 0.9888579387186629,
 0.958217270194986]

In [12]:
scores_rf

[0.9361111111111111,
 0.9138888888888889,
 0.9637883008356546,
 0.9637883008356546,
 0.9220055710306406]

### **K Fold Cross Validation**

**Using Stratified K-Fold  *(ensures that each fold has the same proportion of target classes to handle class imbalances)*:**

In [13]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

**Instead of manually looping, `cross_val_score()`: Splits the dataset automatically using folds. Trains and evaluates each model on different folds. Returns accuracy scores for each fold:**

In [14]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_results_l = cross_val_score(LogisticRegression(), digits.data, digits.target, cv=folds)

In [16]:
cross_val_results_svm = cross_val_score(SVC(C=15), digits.data, digits.target, cv=folds)

In [17]:
cross_val_results_rf = cross_val_score(RandomForestClassifier(n_estimators=45), digits.data, digits.target, cv=folds)

**Printing Cross-Validation Results:**

In [18]:
print("Cross-Validation Results for Logistic Regression (Accuracy):")
for i, result in enumerate(cross_val_results_l, 1):
    print(f"  Fold {i}: {result * 100:.2f}%")

print(f'Mean Accuracy: {cross_val_results_l.mean() * 100:.2f}%')

Cross-Validation Results for Logistic Regression (Accuracy):
  Fold 1: 92.15%
  Fold 2: 94.16%
  Fold 3: 91.65%
Mean Accuracy: 92.65%


In [19]:
print("Cross-Validation Results for SVM (Accuracy):")
for i, result in enumerate(cross_val_results_svm, 1):
    print(f"  Fold {i}: {result * 100:.2f}%")

print(f'Mean Accuracy: {cross_val_results_svm.mean() * 100:.2f}%')

Cross-Validation Results for SVM (Accuracy):
  Fold 1: 96.83%
  Fold 2: 98.00%
  Fold 3: 97.33%
Mean Accuracy: 97.38%


In [20]:
print("Cross-Validation Results for SVM (Accuracy):")
for i, result in enumerate(cross_val_results_rf, 1):
    print(f"  Fold {i}: {result * 100:.2f}%")

print(f'Mean Accuracy: {cross_val_results_rf.mean() * 100:.2f}%')

Cross-Validation Results for SVM (Accuracy):
  Fold 1: 92.15%
  Fold 2: 95.66%
  Fold 3: 93.49%
Mean Accuracy: 93.77%
