
<center>DTSC 5502.022 - Principles and Techniques of Data Science, Data Science Master Program, Dept. of Data Science, University of North Texas, Denton, TX</center>

# <center>Machine Learning Model Building, Evaluation and Parameter Tuning</center> 

<center>Invited Lecture - Bishnu Sarker, Assistant Professor of Health Informatics, Dept. of Information Science, University of North Texas, Denton TX</center> 
<center> bishnu.sarker@unt.edu || Office: DP E295P || Hours: Monday-Tuesday; 3:00-5:00PM </center>




**Module Description:**
This module provides a comprehensive overview of the entire modeling lifecycle, from model selection to performance evaluation. 
Students will learn to:
1. Choose appropriate models for different types of problems.
2. Evaluate model effectiveness using a variety of metrics.
3. Apply techniques such as cross-validation and hyperparameter tuning to ensure that models are both robust and accurate.
4. Build apps to deploy the model.

**Learning Outcomes**
By the end of this task, students will:
1. Understand the end-to-end workflow for building a classification model
2. Evaluate models using multiple metrics
3. Tune hyperparameters. 
4. Make informed decisions for model selection

**Machine Learning Life Cycle** 

0. Problem Statement
1. Data collection, preprocessing and normalization, feature selection and data matrix structuring.
2. Preparing training dataset, validation dataset and testing dataset
3. Training machine learning models on the training data, tuning on validation data, and  evaluating on testing data. 
4. Repeating step 3 on multiple machine learning algorithms and tracking and recording the evaluation metrics. 
5. Model selection based on the performance criteria. 
6. Building apps, deploying the best model, and continuous monitoring for drift.
7. Repeating steps 1-6 if the model fails to serve its purpose in the future.  

**Agenda**

1. Introduction to machine learning - a refresher
2. Building a Machine Learning Pipeline - from basic model building to cross-validation and parameter tuning. 
3. Building a Streamlit App to deploy the model. 
4. Discussing my research and opportunities. 


#### 0. Problem Statement

The very first step is to define the machine learning task. 

In this practice note, the objective is to predict whether a patient has diabetes from the features given in the dataset. We will work with the Pima Indians Diabetes data to predict the onset of diabetes based on diagnostic measurements. 
The data and problem are defined here: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database 

The data can also be accessed directly at: "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv" 

The following description of the attributes of the data is taken from: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names

Here is the list of attributes: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)     --- target variable/response variable/dependent variable/labels/class
   
   
 The data was part of the following publication:
 
 Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.
 
The proposed ADAP algorithm used 576 instances for training and, on the remaining 192 instances held out as a test set, achieved a sensitivity and specificity of 76%. 

#### Bonus Step: Setting up the environment. 

**Download and Install Python**

You should already have a working Python environment from your past experience. If you don't, please go to the Anaconda webpage and download the latest Anaconda distribution for your OS: https://www.anaconda.com/download 

**Understanding how to work with Numpy, Scipy, matplotlib and Pandas**

I expect you to have working knowledge of numpy, scipy, matplotlib and pandas.  

**Importing the packages**

In most cases, you will need the following packages to work on a machine learning project in Python. 
1. Pandas 
2. Matplotlib
3. Numpy
4. Scipy
5. Sklearn

Let's import the above packages and also check if the system is ready for the project

In [None]:
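import pandas as pd
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
import sys

# Quick check that the environment is ready: print the Python and package versions
print(sys.version)
print(pd.__version__, np.__version__, sc.__version__)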

### 1. Data collection, preprocessing and normalization, feature selection and data matrix structuring.
Let's use Pandas to read the data from the online source and get it into a dataframe. 

In [None]:
data_src="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"   # data source
features=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']   # column names
data=pd.read_csv(data_src, names=features)    # pandas function for reading CSV file
print(data.shape)  # prints the shape of the dataframe. number of rows, number of columns. 

The output shows that there are 768 instances/examples/observations/rows and 9 columns/fields in the data: 8 features/attributes plus the class label. 

In [None]:
# First few rows of the data
data.head(5)

In [None]:
# Last few rows of the data
data.tail(5)

**Exploratory Analysis**

It is always a good idea to look at a summary of the dataset. This first step often helps with feature engineering, for example finding the right features for the model. In other words, it helps identify the set of columns that best explain the target variable. 

Let us look at the statistical description of the data. This gives the basic statistical measures for each column. 

In [None]:
data.describe()

A good way to find the data type of each feature/column is to use the info() function. It gives the total number of rows and columns as well as the type of each column and its count of non-null values. 

In [None]:
data.info()

In [None]:
# This is a matrix of box plots. Each subplot is the box plot of one column
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,10))
plt.show()

A box plot is a nice way to locate outliers. The charts above show that many points fall outside the whiskers of the box plots.  

In [None]:
#Histogram of each column
data.hist()
plt.show()

Many of the features do not appear to be normally distributed. This is an important insight because some machine learning models, such as LDA and Gaussian naive Bayes, assume that the features are normally distributed.  

Let's now count how many instances of each class we have in the dataset. groupby() is an easy way to get this.  

In [None]:
# to look at class wise counts of rows
data.groupby("class").size()

The result shows that there are two classes: 0 and 1. There are 500 instances of class 0 and 268 instances of class 1. The dataset is imbalanced, since the number of instances differs significantly between the two classes.  
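
A quick way to see why the imbalance matters is to compute the accuracy of a trivial baseline that always predicts the majority class. The sketch below uses only the two class counts reported above; any model worth keeping should beat this number.

```python
# Class counts reported above for the Pima dataset
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())

# Share of each class in the dataset
proportions = {label: count / total for label, count in class_counts.items()}

# A trivial classifier that always predicts the most frequent class
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(proportions)
print(f"Always predicting class {majority_label}: accuracy = {baseline_accuracy:.3f}")
```

The baseline comes out at roughly 65%, which is why accuracy alone can be misleading on this dataset and why precision and recall are reported later.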

In [None]:
# Scatter-plot matrix: pairwise scatter plots between the columns
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(data)
plt.show()

A scatter-plot matrix is a nice way to visualize the pairwise correlations among features.  

**Normalizing/Standardization of data**

Often, the features are measured in different units and on different scales. It is important to normalize the data before moving on to model development. Normalization brings every feature onto a common scale. There are many normalization techniques, and the right choice depends on the assumed distribution of the data. 

In [None]:
## From DataFrame values to a NumPy array. You could still do everything on DataFrames; the following example shows how to 
## use NumPy in the process. 
data_array=data.values

In [None]:
## Again to look at the shape of the array
data_array.shape

In [None]:
data_array

Remember that the last column of the dataframe is the label column. That means the task of the model is to predict the values of the last column given the values of the other columns, i.e. the features. 
Let's separate these two parts into two variables. 

In [None]:
## Separating the feature columns and the label column. In this example, the last column is the label column. 
X=data_array[:, 0:8]
Y=data_array[:,8]

In [None]:
## this is the feature matrix
X

In [None]:
# this is list of labels
Y

Now, let's apply the normalization function. StandardScaler computes the Z-score of each value in each column, Z = (X - mean) / SD, so every feature ends up with zero mean and unit variance. This works best when the features are roughly normally distributed. 

In [None]:
# Function for normalizing the numerical values. Z=(X-mean)/SD 
from sklearn.preprocessing import StandardScaler

In [None]:
rescaledX=StandardScaler().fit_transform(X)

In [None]:
rescaledX

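rescaledX holds the scaled/normalized/standardized data points of X. Note that the scaler here is fit on the full dataset before splitting; in a stricter workflow it would be fit on the training split only, so that no test-set statistics leak into training.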
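
To make the Z-score concrete, here is a minimal sketch that standardizes one column by hand using only the standard library. The column values are made up for illustration. It uses the population standard deviation (dividing by n), which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev

# Hypothetical values for a single feature column (e.g., a few ages in years)
column = [50.0, 31.0, 32.0, 21.0, 33.0]

mean = fmean(column)
sd = pstdev(column)  # population standard deviation (divides by n)

# Z-score: how many standard deviations each value lies from the column mean
z_scores = [(value - mean) / sd for value in column]

print([round(z, 3) for z in z_scores])
print(round(fmean(z_scores), 6), round(pstdev(z_scores), 6))
```

After standardization the column has mean 0 and standard deviation 1, regardless of its original unit.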

Once you have prepared your data matrix, the next step is to fit a machine learning model. There are a few steps: splitting the data into training and test sets, fitting the model on the training set, evaluating the model on the test set, and displaying the performance metrics to judge the model's strength. 

### 2. Preparing training dataset, validation dataset and testing dataset

At this point, we know how to read the data and pre-process it by rescaling or normalizing. Now, we will see how to apply a machine learning model using the sklearn package. 
For this step, we need to import some additional libraries.

In [None]:
### Import the following packages for any machine learning project. 
# Preparing the dataset
from sklearn.model_selection import train_test_split   # split the dataset into training and testing set
from sklearn.model_selection import cross_val_score    # Perform cross validation 
from sklearn.model_selection import StratifiedKFold    # Stratify the data in each fold
from sklearn.model_selection import KFold              # Defining the number of fold in Cross-validation

# Evaluation Metrics
from sklearn.metrics import classification_report     # to get the performance measures
from sklearn.metrics import confusion_matrix          # To compute the false positives and false negatives
from sklearn.metrics import accuracy_score            # Accuracy measures

# Machine learning models
from sklearn.linear_model import LogisticRegression   # Logistics Regression Model
from sklearn.tree import DecisionTreeClassifier       # Decision Tree Model
from sklearn.neighbors import KNeighborsClassifier    # K-nearest Neighbor model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis   #LDA model
from sklearn.naive_bayes import GaussianNB                             # naive Bayes
from sklearn.svm import SVC                                            # Support Vector Machine


Essentially, the following line splits rescaledX and Y into four new variables: 

- X_train: the feature matrix for training
- X_test: the feature matrix for testing
- Y_train: label vector for training
- Y_test: label vector for testing, also known as the ground-truth labels. 

test_size=0.2 ensures that 80% of the rows of rescaledX go to training and 20% to testing. 

In [None]:
X_train, X_test, Y_train, Y_test=train_test_split(rescaledX, Y, test_size=0.2) # Adhering to 80:20 rule for train:test split
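
The split above is random on every run and is applied after the scaler has already seen all rows. As one possible refinement, the sketch below makes the split reproducible (`random_state`), keeps the class ratio equal in both parts (`stratify`), and fits the scaler on the training rows only. The data is synthetic and the names `X_demo`, `Y_demo`, `X_tr`, `X_te` are placeholders chosen for this illustration, not variables from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 8))
Y_demo = np.array([0] * 130 + [1] * 70)

# stratify keeps the class ratio the same in both splits; random_state makes it repeatable
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, stratify=Y_demo, random_state=42
)

# Fit the scaler on the training split only, then reuse its mean/SD on the test split
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
print(Y_tr.mean(), Y_te.mean())  # class-1 share in each split
```

With this ordering the test rows never influence the mean and standard deviation used for scaling, so the test score is a more honest estimate.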

In [None]:
train_test_split?

### 3. Training machine learning models on the training data, tuning on validation data, and evaluating on testing data. 
Now we are going to use X_train and Y_train as our primary data for building the model. Once the model is built, we will use X_test and Y_test to evaluate its performance. 

The following two lines of code build a model on the training data, X_train and Y_train. 

**Training Machine Learning Models: Logistic Regression**

In [None]:
LogisticRegression?

In [None]:
LR=LogisticRegression(solver='liblinear', max_iter=200, penalty='l1')
#fit(X_train, Y_train) 
LR.fit(X_train, Y_train) 
LR

**Testing an ML model**

Once the model is trained, a single line of code can predict the classes for the test set, X_test.

In [None]:
X_test

**On a single data point**

In [None]:
x1=X_test[1]
x1

In [None]:
Y_pred=LR.predict([x1])
Y_pred

In [None]:
LR.predict_proba([x1])

In [None]:
LR.predict_log_proba([x1])

In [None]:
Y_test[1]
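
To see how the three outputs above relate, here is a small standard-library sketch of what a binary logistic regression does for one data point: a linear score is passed through the sigmoid to get P(class 1), and the predicted label is the class with the higher probability. The weights, bias, and data point are hypothetical values chosen for illustration, not the ones learned by `LR`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters and one standardized data point (illustration only)
weights = [0.4, 1.1, -0.2]
bias = -0.8
x = [0.5, 1.2, -0.3]

# Linear score, then probability of class 1
score = bias + sum(w * xi for w, xi in zip(weights, x))
p_class1 = sigmoid(score)
proba = [1.0 - p_class1, p_class1]          # same ordering as predict_proba: [P(0), P(1)]
log_proba = [math.log(p) for p in proba]    # what predict_log_proba would report
predicted_label = 1 if p_class1 > 0.5 else 0

print(proba, log_proba, predicted_label)
```

In scikit-learn the corresponding learned values are available as `LR.coef_` and `LR.intercept_` after fitting.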

**On the entire test dataset**

In [None]:
Y_pred=LR.predict(X_test)

In [None]:
Y_pred 

In [None]:
Y_test 

**Model Performance: Confusion Matrix**


**Definition:** A confusion matrix is a table that summarizes the performance of a classification model by comparing **actual vs predicted labels**. It helps compute precision, recall, F1-score, accuracy, and other metrics.

* **Structure (Binary Classification):**

| Actual \ Predicted  | Predicted Positive  | Predicted Negative  |
| ------------------- | ------------------- | ------------------- |
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |

* **Interpretation:**

  * **TP:** Correctly predicted positive
  * **TN:** Correctly predicted negative
  * **FP:** Incorrectly predicted positive (false alarm)
  * **FN:** Incorrectly predicted negative (missed positive)

* **Example:**
  Suppose a medical test predicts disease presence for 100 patients:

* TP = 40 (correctly detected disease)

* TN = 45 (correctly identified healthy)

* FP = 10 (healthy predicted as sick)

* FN = 5 (sick predicted as healthy)

**Confusion Matrix Table:**

| Actual \ Predicted | Positive | Negative |
| ------------------ | -------- | -------- |
| Positive           | 40       | 5        |
| Negative           | 10       | 45       |



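To better understand model performance, we can use the following line with Y_test (actual classes) and Y_pred (predicted classes). Note that scikit-learn orders rows and columns by sorted label, so for labels 0 and 1 the output is laid out as [[TN, FP], [FN, TP]], which differs from the layout of the table above. 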



In [None]:
confusion_matrix(Y_test, Y_pred)
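
Because scikit-learn sorts the labels, the returned matrix for 0/1 labels has the negative class in the first row and column. The sketch below uses a tiny hand-made example (the labels are hypothetical) to show how the four counts can be unpacked with `ravel()`.

```python
from sklearn.metrics import confusion_matrix

# Small hand-made example (hypothetical labels): 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat  = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# For binary 0/1 labels, ravel() unpacks the matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Unpacking the counts this way avoids misreading the matrix when comparing it with a table that lists the positive class first.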

**Model Performance: Classification Report**

It provides a set of metrics to better evaluate the model's performance. The following portion was formatted using ChatGPT. 

**1. Precision**

* **Definition:** Precision measures the proportion of positive predictions that are actually correct. It focuses on how many predicted positives are true positives.

* **Formula:**
  
  $\text{Precision} = \frac{TP}{TP + FP}$
  
  where:
  TP = True Positives, FP = False Positives

* **Example:**
  Suppose a model predicts 50 patients as having a disease:

* 40 truly have the disease (TP = 40)

* 10 do not (FP = 10)

$\text{Precision} = \frac{40}{40 + 10} = \frac{40}{50} = 0.8$

**Interpretation:** 80% of predicted positives are correct.


**2. Recall (Sensitivity / True Positive Rate)**

* **Definition:** Recall measures the proportion of actual positives that are correctly identified. It focuses on how many true positives are captured.

* **Formula:**
  $\text{Recall} = \frac{TP}{TP + FN}$
  where FN = False Negatives

* **Example:**
  Suppose there are 50 actual patients with the disease:

* The model correctly predicts 40 as positive (TP = 40)

* Misses 10 (FN = 10)

$\text{Recall} = \frac{40}{40 + 10} = \frac{40}{50} = 0.8$

**Interpretation:** The model identifies 80% of actual positive cases.


**3. F1-Score**

* **Definition:** F1-score is the harmonic mean of precision and recall. It balances the two metrics and is useful when both false positives and false negatives matter.

* **Formula:**
  $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

* **Example:**
  Using the previous example:

* Precision = 0.8, Recall = 0.8

$F1 = 2 \times \frac{0.8 \times 0.8}{0.8 + 0.8} = 0.8$

**Interpretation:** Good balance between precision and recall.


**4. Accuracy**

* **Definition:** Accuracy measures the proportion of total predictions (both positive and negative) that are correct.

* **Formula:**
  $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

* **Example:**
  Suppose we have:

* TP = 40, TN = 45, FP = 10, FN = 5

$\text{Accuracy} = \frac{40 + 45}{40 + 45 + 10 + 5} = \frac{85}{100} = 0.85$

**Interpretation:** 85% of all predictions are correct.
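
The four formulas can be checked together in a few lines of plain Python. The sketch below uses the counts from the confusion matrix example (TP = 40, TN = 45, FP = 10, FN = 5). Note that recall for these counts is 40/45, because this example has 5 false negatives, whereas the standalone recall example above assumed 10.

```python
# Counts from the confusion-matrix example above
TP, TN, FP, FN = 40, 45, 10, 5

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

For these counts the values are about 0.800, 0.889, 0.842 and 0.850, which shows that a single accuracy figure can hide the difference between precision and recall.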


In [None]:
print(classification_report(Y_test, Y_pred))

In [None]:
confusion_matrix(Y_test, Y_pred)

**Model Performance: AUC and Precision-Recall Curves**

The area under the ROC curve (AUC) and precision-recall curves are worth exploring for a more robust evaluation. 

A summary table of popular metrics to help:

| Metric                     | Definition                                           | Formula                            | Example                                          |
| -------------------------- | ---------------------------------------------------- | ---------------------------------- | ------------------------------------------------ |
| **Precision**              | Correctness of positive predictions                  | $TP / (TP + FP) $                    | TP=40, FP=10 → 0.8                               |
| **Recall**                 | Coverage of actual positives                         | $TP / (TP + FN) $                    | TP=40, FN=10 → 0.8                               |
| **F1-Score**               | Harmonic mean of precision & recall                  | $2 * (P*R)/(P+R)  $                  | P=0.8, R=0.8 → 0.8                               |
| **Accuracy**               | Overall correct predictions                          | $(TP + TN)/(TP + TN + FP + FN) $     | TP=40, TN=45, FP=10, FN=5 → 0.85                 |
| **ROC Curve**              | TPR vs FPR at different thresholds                   | $TPR = TP/(TP+FN), FPR = FP/(FP+TN)$ | Plot TPR vs FPR                                  |
| **AUC**                    | Area under ROC curve                                 | $\int_0^1 TPR \, d(FPR)$             | AUC=0.95 indicates excellent model               |
| **Precision-Recall Curve** | Trade-off between precision and recall at thresholds | Vary threshold, plot P vs R        | PR curve shows how precision changes with recall |


### 4. Repeating step 3 on multiple machine learning algorithms and tracking and recording the evaluation metrics. 

**Repeating training, test steps for Decision Tree**

A decision tree is a popular machine learning model that works by generating if-then-else rules from the data. 

In [None]:
DT=DecisionTreeClassifier() 
DT.fit(X_train, Y_train)
Y_pred=DT.predict(X_test)
print(classification_report(Y_test, Y_pred)) 

**Repeating training, test steps for Random Forest**

Apply a random forest model in a similar fashion and compare the performance. 

In [None]:
## Your code here

**Repeating training, test steps for Model X** 
Apply another machine learning model of your choice and compare the performance here. 

In [None]:
# your code goes here

**Model Validation and Evaluation**

There are many ways to validate the performance of a model, such as cross-validation. In k-fold cross-validation, the data is split into k folds, i.e. k chunks. The model is trained k times, each time holding out one fold as the validation set and combining the remaining k-1 folds to form the training data. The average performance across the k runs indicates the overall performance of the selected model. 

Cross-validation does not by itself prevent overfitting, but it helps detect it and gives a more reliable performance estimate than a single split, because every instance is used for validation exactly once and for training in the other k-1 runs. 

#### Cross-validation
To perform cross-validation, the following steps are necessary:

- Defining the k folds using KFold(): it takes the number of folds and whether the data should be shuffled. 
- Defining a model object. 
- Applying cross-validation using cross_val_score(): it takes the model, features, labels and the KFold object. 
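
To show what the fold definition amounts to, here is a small plain-Python sketch that generates train/validation index sets for k folds without shuffling. The helper `kfold_indices` is a hypothetical name written for this illustration; it is not part of scikit-learn, and KFold handles shuffling and other details on top of this idea.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each sample is in the validation fold exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each row index lands in exactly one validation fold, so averaging the k validation scores uses every instance once for evaluation.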

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
LR = LogisticRegression(solver='liblinear')
results = cross_val_score(LR, rescaledX,Y, cv=kfold)


In [None]:
results

As you can see, results holds an array of 10 numbers. These are the accuracies from the 10 runs, one per held-out fold. 

We can look at the mean performance using the following line of code.

In [None]:
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Interpretation of the result: the model achieves a mean accuracy of 77.216% with a standard deviation of +/- 4.76%. Out of 100 test examples fed to the model, roughly 72 to 82 will be correctly classified, so there is about a 23% chance of misclassification.  

### 5. Model selection based on the performance criteria

Sometimes, to justify the superiority of a model, we need to compare it with other models.

The following lines of code define multiple machine learning models, apply cross-validation to each of them, and print the mean performance of each.

They also draw a set of box plots showing how performance varies among the models and within the k folds. 

In [None]:
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))  # multi_class is deprecated in recent scikit-learn and not needed for a binary target
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
plt.boxplot(results)
plt.xticks(range(1, len(names) + 1), names)  # label each box with its model name
plt.title('Algorithm Comparison')
plt.show()

It may seem that training a machine learning model is a fairly easy task. However, the real complexity arises in finding the best-fitting model, which requires a lot of effort in tuning model parameters. So far, we have run the models with their default parameters. Now it is time to search for the best model. 

### 6. Parameter Tuning. 

We now have some idea of which model performs with what accuracy. However, each algorithm has many parameters that require tuning to reach the best performance. 
We will see how to do parameter tuning to get the best outcome. 

**Import the grid search function from sklearn**

In [None]:
from sklearn.model_selection import GridSearchCV



##### Define the model and parameters values to try

In [None]:
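LR=LogisticRegression(solver='liblinear')   # liblinear supports both the l1 and l2 penalties
solvers=['lbfgs', 'sag', 'saga', 'newton-cg'] ## other possible solvers (not searched here)
Cs=np.logspace(-3,3,7)  # inverse regularization strengths to try: 0.001, 0.01, ..., 1000
penalties=["l1", 'l2'] ## possible regularizers
max_iters=[50, 100, 200] ## range of iterations (not searched here)

# Only C and penalty are searched in this grid
params=dict(C=Cs, penalty=penalties)
params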

##### Apply grid search to find the best parameters from the values we have selected. 

In [None]:
grid=GridSearchCV(estimator=LR, param_grid=params, cv=10)
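
Grid search tries every combination of the listed values, so its cost grows multiplicatively. The sketch below counts the work for the grid above (7 values of C, 2 penalties, 10 folds) using only the standard library; the list of C values is rebuilt by hand here as powers of ten to mirror `np.logspace(-3, 3, 7)`.

```python
from itertools import product

# Same candidate values as the grid above: C = 10^-3 ... 10^3 and two penalties
Cs = [10.0 ** e for e in range(-3, 4)]
penalties = ["l1", "l2"]
n_folds = 10

candidates = list(product(Cs, penalties))
n_candidates = len(candidates)
n_cv_fits = n_candidates * n_folds   # one fit per candidate per fold

print(n_candidates, n_cv_fits)
```

That is 14 candidates and 140 cross-validation fits, plus one final refit of the best candidate on the whole training set, since GridSearchCV refits by default. Adding a third parameter with three values would triple the count, which is worth keeping in mind before widening the grid.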


##### Now run the grid search on the training data

In [None]:
grid.fit(X_train, Y_train)

We can now access a number of elements from our grid search to know better about the best fitted model. 

For example, `what is the best score found?` 

In [None]:
print(grid.best_score_)

`What are the best parameters?`

In [None]:
grid.best_params_

`What is the best model/estimator built?`

In [None]:
bestModel=grid.best_estimator_

In [None]:
Y_pred=bestModel.predict(X_test)
Y_pred


In [None]:
print(classification_report(Y_test,Y_pred ))

### 7. Building apps and deploying the best model and continuous monitoring for drifting.
This is an important part that is missing from many ML lectures. 

Now that you have trained a model and selected the best one, what should you do next?

The next step is to save the model and use it when you have new data to classify. You don't need to train the model every time. 
Train once, and use it many times until you have enough new data to train the model again. 

**Saving the Model**

In [None]:
from pickle import dump, load


In [None]:
# Use a context manager so the file is closed after writing
with open("bestModel.model", 'wb') as f:
    dump(bestModel, f)

**Loading the Model**

In [None]:
# Use a context manager so the file is closed after reading
with open('bestModel.model', 'rb') as f:
    model=load(f)


**Using the model to predict**

In [None]:
Y_pred=model.predict(X_test)
#model.score(X_test, Y_test)


In [None]:
Y_pred
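
The saved model above expects inputs that have already been standardized, so a deployed app would also need the fitted scaler. One possible approach, sketched below on synthetic data, is to bundle the scaler and the classifier in a scikit-learn pipeline and pickle the pipeline as a single object. The data, the file name `pipeline_demo.model` and the variable names are assumptions made for this illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 rows, 8 raw (unscaled) features
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 8))
y = (X_raw[:, 0] + X_raw[:, 1] > 100.0).astype(int)

# Scaler and classifier travel together, so new raw data is scaled the same way
pipeline = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipeline.fit(X_raw, y)

# Save and reload with context managers so the file handles are closed
model_path = os.path.join(tempfile.gettempdir(), "pipeline_demo.model")  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X_raw) == pipeline.predict(X_raw)).all())
print(same_predictions)
```

The reloaded pipeline accepts raw, unscaled rows directly. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.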

# Hurray!! We have done a complete Machine learning project. 


Task : Building and Evaluating a Machine Learning Model


Task Description:
Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases and affected over 2.1 Million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.

In this task, we will build and evaluate a machine learning model using the dataset we prepared earlier. We will follow these steps:

1. Split the dataset into training and testing sets.
2. Train a machine learning model on the training set.
3. Evaluate the model's performance on the testing set.
4. Save the best-performing model for future use.
    
Dataset link: https://www.kaggle.com/datasets/nancyalaswad90/breast-cancer-dataset/data



In [None]:
##Your Code goes here

Pitch your best model and tell us why. 