# Train, Validate $\rightarrow$ Train, Test

In this exercise, you will perform an empirical comparison between the ten-fold cross-validation results of a model and the test-set results of the same model trained on the full training set.

## Notes and Guidelines
* Read a dataset and use it for a classification task.
* Construct a Gaussian Naive Bayes classifier and fit it to the phoneme dataset provided.
* Save and re-load a trained classifier.
* Compare K-fold cross-validation scores with the success rate of a fully-trained model.


### Dataset
* Dataset acquired from [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105), an excellent resource for finding 'toy' datasets (and a few more serious ones).
    * A description of the dataset is provided at the above link - **read it.**
    * Excerpt: 
    *The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
    The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
    The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.*
    
* It is not necessary to fully understand the nature or context of the values in the dataset - only that there are five columns of input (featural) data and one column of output (class) data.
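
The class counts in the excerpt also give a useful reference point: a classifier that always predicts the majority class would already be right most of the time. A minimal sketch of that arithmetic, using only the counts quoted above (the variable names are illustrative):

```python
# Class counts quoted in the KEEL description of the phoneme dataset
class_counts = {0: 3818, 1: 1586}  # 0 = nasal, 1 = oral

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"Total samples: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported later in the notebook is easier to interpret against this baseline of roughly 70.7%.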

## Handling imports and checking the dataset

In [1]:
import os
import pandas as pd
import numpy as np


# <import the necessary modules here> 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
import pickle

# locate dataset
DATASET = '/dsa/data/all_datasets/phoneme.csv'  # phoneme classification dataset
assert os.path.exists(DATASET)  # check if the file actually exists

## Constructing data frame from raw dataset

<span style="background:yellow">**Note**</span>: Variable `dataset` should be used for the data frame.

In [2]:

# load the CSV and shuffle the rows (frac=1 returns every row in random order)
dataset = pd.read_csv(DATASET, header=0).sample(frac=1)

# verify dataset shape
print("Dataset shape: ", dataset.shape)

Dataset shape:  (5404, 6)


In [3]:
# show first few lines of the dataset
dataset.head()

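|      | Aa    | Ao     | Dcl    | Iy     | Sh     | Class |
|------|-------|--------|--------|--------|--------|-------|
| 890  | 0.277 | 1.136  | 2.468  | -0.763 | -0.41  | 0     |
| 2328 | 1.721 | 0.578  | -0.193 | -0.13  | -0.1   | 0     |
| 5279 | 3.087 | -0.304 | 0.15   | -0.072 | 0.057  | 0     |
| 3650 | 0.401 | 1.813  | 1.245  | 0.505  | -0.234 | 0     |
| 344  | 0.174 | 0.938  | 0.513  | 1.097  | 0.886  | 1     |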


## Splitting data into training and test sets

Split the dataset into training (80%) and testing (20%) sets.

The snippet below is only needed if you want to visualize
the data or produce neatly labeled output within the program.

```python
# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels
```

---

**Activity 01:** Extract features and class data from the primary data frame.

In [4]:

X = dataset.iloc[:, :-1]  # every column except the last: the five features
y = dataset.iloc[:, -1]   # last column: the class label


In [5]:
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)

Training shapes (X, y):  (4323, 5) (4323,)
Testing shapes (X, y):  (1081, 5) (1081,)
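
The shapes printed above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to a whole number of samples and assigns the remainder to the training set, which is consistent with the output shown; the variable names are illustrative.

```python
import math

n_samples = 5404   # rows in the phoneme dataset
test_size = 0.2    # fraction passed to train_test_split

# Assumed rounding rule: the test count is rounded up, the rest is training data
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Training samples: {n_train}")
print(f"Testing samples:  {n_test}")
```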


## Constructing the classifier and running automated cross-validation

---

**Activity 02:**

* Run 10-fold cross-validation with a `GaussianNB` classifier on the training set.
* Print the accuracy scores for these 10 folds.

In [6]:
# Your code below this line (Question #02)
# --------------------------

gnb = GaussianNB()

# Perform 10-fold cross-validation
cv_scores = cross_val_score(gnb, X_train, y_train, cv=10)

# Print accuracy scores for the 10 folds
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation accuracy: ", np.mean(cv_scores))





Cross-validation scores:  [0.75057737 0.77829099 0.78290993 0.74768519 0.75925926 0.76851852
 0.73148148 0.74305556 0.7962963  0.77083333]
Mean cross-validation accuracy:  0.7628907920622701
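
To make the fold mechanics concrete, here is a minimal, library-free sketch of how K-fold splitting partitions sample indices. It is a simplification: for classifiers, `cross_val_score` with an integer `cv` uses stratified folds that also preserve class proportions, which this sketch does not do. The helper `kfold_indices` is hypothetical and not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each of n_splits folds.

    The first n_samples % n_splits folds receive one extra sample, so every
    sample lands in exactly one validation fold.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# 4323 training samples split into 10 folds, as in the exercise
folds = list(kfold_indices(4323, 10))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

Each fold score is therefore an accuracy measured on only 432 or 433 samples, which is one reason the ten scores differ from one another.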


## Training the classifier and pickling to disk

---

**Activity 03:** Train the model on all of the training instances and save it to disk.

In [7]:
# Your code below this line (Question #03)
# --------------------------

gnb.fit(X_train, y_train)

# Save trained model
with open("naive_bayes_model.pkl", "wb") as model_file:
    pickle.dump(gnb, model_file)



## Unpickling the model and making predictions


---

**Activity 04:**
* Load the saved model. 
* Make predictions for the testing set.


In [9]:
# Your code below this line (Question #04)
# --------------------------

# load pickled model
with open("naive_bayes_model.pkl", "rb") as model_file:
    loaded_model = pickle.load(model_file)

# make predictions with freshly loaded model
y_pred = loaded_model.predict(X_test)

# verify input and output shape are appropriate
print("Input vs. output shape:")
print(X_test.shape, y_pred.shape)




Input vs. output shape:
(1081, 5) (1081,)
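
A quick way to gain confidence in a save-and-reload step is to check that the reloaded object matches the original. The sketch below shows the round trip with the standard-library `pickle` module; the `model_params` dictionary is a hypothetical stand-in for a fitted classifier, and the temporary file path is illustrative.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable Python object works
model_params = {"class_prior": [0.7, 0.3], "n_features": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Serialize to disk in binary mode
    with open(path, "wb") as model_file:
        pickle.dump(model_params, model_file)

    # Load it back from the same file
    with open(path, "rb") as model_file:
        restored_params = pickle.load(model_file)

# The restored object should be equal to, but not the same object as, the original
round_trip_ok = restored_params == model_params
print("Round trip preserved the object:", round_trip_ok)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.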


## Performing final performance comparison

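**Activity 05:** Tally the correct and incorrect predictions on the test set, and compare the test accuracy with the mean cross-validation accuracy.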


In [11]:
# tally up right + wrong 'guesses' by model
correct, incorrect = 0, 0
for actual, predicted in zip(y_test, y_pred):
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

# report results numerically and by percentage
percent_correct = correct / (correct + incorrect) * 100
print(f"Correct guesses: {correct}\nIncorrect guesses: {incorrect}")
print(f"Percent correct: {percent_correct}")

# compare to average of cross-validation scores
avg_cv = np.sum(cv_scores) / len(cv_scores) * 100
print(f"Percent cross-validation score (10 folds, average): {avg_cv}")

Correct guesses: 818
Incorrect guesses: 263
Percent correct: 75.67067530064755
Percent cross-validation score (10 folds, average): 76.289079206227
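
How small is the gap between the two percentages above? One hedged way to judge it is to compare it with the fold-to-fold spread of the cross-validation scores. The sketch below copies the fold scores and the tally from the outputs in this notebook; using the sample standard deviation as the yardstick is an assumption of this example, not something the exercise prescribes.

```python
import statistics

# Fold accuracies and test tally copied from the outputs above
cv_scores = [0.75057737, 0.77829099, 0.78290993, 0.74768519, 0.75925926,
             0.76851852, 0.73148148, 0.74305556, 0.7962963, 0.77083333]
test_accuracy = 818 / 1081  # correct guesses / total test samples

cv_mean = statistics.mean(cv_scores)
cv_std = statistics.stdev(cv_scores)  # sample standard deviation across folds
gap = cv_mean - test_accuracy

print(f"CV mean: {cv_mean:.4f}, CV std: {cv_std:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}, gap: {gap:.4f}")
print("Gap within one standard deviation:", abs(gap) <= cv_std)
```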


## Measure performance using sklearn

---

**Activity 06:**

Compute and display the following on the test set:
 1. Confusion matrix
 1. Accuracy
 1. Precision
 1. Recall
 1. $F_1$-Score
 
Add additional cells if required. 

In [12]:
# Your code below this line  (Question #06)
# --------------------------
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Compute Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Compute Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Compute Precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Compute Recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)

# Compute F1-Score
f1 = f1_score(y_test, y_pred)
print("F1-Score:", f1)


Confusion Matrix:
 [[581 173]
 [ 90 237]]
Accuracy: 0.7567067530064755
Precision: 0.5780487804878048
Recall: 0.7247706422018348
F1-Score: 0.6431478968792401
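
Each of these metrics can be recovered from the confusion matrix alone, which is a useful cross-check. The sketch below uses the matrix printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes) with class 1 (oral) as the positive class; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
conf_matrix = [[581, 173],
               [90, 237]]

tn, fp = conf_matrix[0]  # true nasal: predicted nasal, predicted oral
fn, tp = conf_matrix[1]  # true oral:  predicted nasal, predicted oral

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the samples predicted oral, the share that are oral
recall = tp / (tp + fn)     # of the truly oral samples, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```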


## Conclusions?

---

**Activity 07:**

How did your trained model perform on the test set relative to your expectations based on the cross-validation?
Provide your answer in the cell below.

# Add your answer below this comment  (Question #07)
# -----------------------------------

The trained model performed slightly below the accuracy expected from cross-validation. The 10-fold cross-validation gave a mean accuracy of 76.29%, while the model trained on the full training set achieved 75.67% accuracy on the test set. The gap of about 0.6 percentage points is small compared with the spread across folds (73.15% to 79.63%), which suggests that the cross-validation estimate provided a reliable assessment of the model’s performance.

The confusion matrix shows that the model classified nasal sounds (class 0) more reliably than oral sounds (class 1): 581 of 754 nasal samples (77.1%) were correct, against 237 of 327 oral samples (72.5%). The precision for class 1 (57.80%) means that when the model predicts an oral sound, it is correct only about 58% of the time, because 173 nasal samples were labeled as oral. The recall (72.48%) means the model finds most oral sounds but still misses 90 of the 327.

Overall, the model’s test performance aligns well with the cross-validation results, which supports cross-validation as a technique for estimating how well the classifier generalizes. The slightly lower accuracy on the test set is plausibly due to random variation in the data split, since the individual fold scores themselves vary by several percentage points.





## Logistic Regression

---

**Activity 08:**
* Run 10-fold cross-validation with a logistic regression classifier on the training set.
* Print the accuracy scores for these 10 folds.


In [14]:
# Your code below this line  (Question #08)
# -----------------------------------
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression classifier
log_reg = LogisticRegression(max_iter=1000)

# Perform 10-fold cross-validation
cv_scores_log = cross_val_score(log_reg, X_train, y_train, cv=10)

# Print accuracy scores for the 10 folds
print("Cross-validation scores: ", cv_scores_log)
print("Mean cross-validation accuracy: ", np.mean(cv_scores_log))





Cross-validation scores:  [0.75750577 0.7482679  0.77829099 0.71296296 0.75462963 0.76157407
 0.72222222 0.75462963 0.78703704 0.75      ]
Mean cross-validation accuracy:  0.7527120220682576



---

**Activity 09:** Compute and display the confusion matrix of the logistic regression model for the test set. 


In [15]:
# Your code below this line  (Question #09)
# -----------------------------------

# Train Logistic Regression model on the full training set
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred_log = log_reg.predict(X_test)

# Compute Confusion Matrix
conf_matrix_log = confusion_matrix(y_test, y_pred_log)
print("Confusion Matrix:\n", conf_matrix_log)

# Compute Accuracy
accuracy_log = accuracy_score(y_test, y_pred_log)
print("Accuracy:", accuracy_log)

# Compute Precision
precision_log = precision_score(y_test, y_pred_log)
print("Precision:", precision_log)

# Compute Recall
recall_log = recall_score(y_test, y_pred_log)
print("Recall:", recall_log)

# Compute F1-Score
f1_log = f1_score(y_test, y_pred_log)
print("F1-Score:", f1_log)





Confusion Matrix:
 [[651 103]
 [172 155]]
Accuracy: 0.7456059204440333
Precision: 0.6007751937984496
Recall: 0.4740061162079511
F1-Score: 0.5299145299145298



---

**Activity 10:** Compare the two models by their confusion matrices; how do you interpret their performance? 
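
One way to approach this comparison is to compute per-class recall from each confusion matrix, since overall accuracy hides how the errors are distributed between the two classes. The sketch below uses the two matrices printed earlier in this notebook; the helper `per_class_recall` is hypothetical, and averaging the per-class recalls into a balanced accuracy is one possible summary rather than the required answer.

```python
def per_class_recall(conf_matrix):
    """Return the recall of each class: diagonal count divided by the row total."""
    return [row[i] / sum(row) for i, row in enumerate(conf_matrix)]


# Confusion matrices copied from the outputs above (rows = true, columns = predicted)
confusion = {
    "GaussianNB": [[581, 173], [90, 237]],
    "LogisticRegression": [[651, 103], [172, 155]],
}

recalls = {name: per_class_recall(cm) for name, cm in confusion.items()}
balanced_accuracy = {name: sum(r) / len(r) for name, r in recalls.items()}

for name in confusion:
    nasal_recall, oral_recall = recalls[name]
    print(f"{name}: nasal recall {nasal_recall:.3f}, "
          f"oral recall {oral_recall:.3f}, "
          f"balanced accuracy {balanced_accuracy[name]:.3f}")
```

On these numbers, logistic regression recognizes more of the nasal sounds, while Gaussian Naive Bayes recovers considerably more of the oral sounds; which trade-off is preferable depends on the cost of each kind of error.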


# Save your notebook!  Then `File > Close and Halt`