# Model Training Research: Candidate Models and Considerations

This document outlines candidate models to be considered for our wine quality prediction task, along with key factors to evaluate during the research phase.

## Candidate Models

Here are some potential machine learning models that could be explored for predicting wine quality:

1. **Decision Trees:**
   - **Strengths:** Flexible and can handle both continuous and categorical features without extensive data preprocessing. Robust to outliers.
   - **Weaknesses:** Prone to overfitting if not properly tuned. Can be less interpretable than linear models.

2. **Random Forests:**
   - **Strengths:** Ensemble method combining multiple decision trees, leading to improved accuracy and reduced overfitting compared to single decision trees.
   - **Weaknesses:** Interpretability can be challenging due to the ensemble nature. May require hyperparameter tuning for optimal performance.

3. **Support Vector Machines (SVMs):**
   - **Strengths:** Powerful for classification tasks, especially with high-dimensional data. Effective at handling non-linear relationships using kernel functions.
   - **Weaknesses:** Sensitive to outliers and feature scaling. Can be computationally expensive for large datasets.

4. **K-Nearest Neighbors (KNN):**
   - **Strengths:** Simple and easy to implement. No explicit model training required. Effective for both classification and regression.
   - **Weaknesses:** Performance can be affected by the "curse of dimensionality." Sensitive to noisy data and choice of distance metric.

5. **Logistic Regression:**
   - **Strengths:** Simple and interpretable. Suitable for binary classification tasks. Can handle large datasets efficiently.
   - **Weaknesses:** Assumes linear relationship between features and target variable. May underperform if the data is not linearly separable.

6. **Gradient Boosting Machines (GBM):**
   - **Strengths:** Builds decision trees sequentially, focusing on correcting errors made by previous trees. Typically achieves high accuracy.
   - **Weaknesses:** Prone to overfitting if not properly regularized. Requires careful tuning of hyperparameters.

7. **Naive Bayes:**
   - **Strengths:** Simple and fast to train. Performs well on text classification tasks and with categorical features.
   - **Weaknesses:** Assumes independence between features, which may not hold true in practice. Can be sensitive to imbalanced class distributions.

8. **Neural Networks:**
   - **Strengths:** Capable of learning complex patterns in data. Can handle high-dimensional inputs and non-linear relationships.
   - **Weaknesses:** Requires large amounts of data for training. May suffer from overfitting if not properly regularized. Computationally intensive.

## Selection Criteria

When evaluating these models, we should consider the following factors:

- **Problem Type:** Since we're predicting wine quality (likely a categorical variable), we'll be focusing on classification models.
- **Data Characteristics:**
  - Feature types (continuous, categorical)
  - Data size and dimensionality
  - Presence of missing values or outliers
- **Model Performance:**
  - Accuracy, precision, recall, F1-score (depending on class imbalance)
  - Overfitting potential
- **Interpretability:**
  - Importance of understanding model predictions and feature relationships
  - Trade-off between accuracy and interpretability

## Research Plan

The research will involve:

1. **Data Preprocessing:** Exploring data cleaning, handling missing values, and feature scaling if necessary.
2. **Model Training and Evaluation:** Implementing and training each candidate model with appropriate hyperparameter tuning. Evaluating performance metrics on a separate validation set.
3. **Model Selection:** Choosing the model with the best balance of accuracy, interpretability, and suitability for the data.
4. **Model Interpretation:** Analyzing the chosen model to understand how features influence predictions (if applicable).

Through this research, we aim to identify the most effective model for predicting wine quality based on the given dataset and evaluation criteria.
ed on the given dataset and evaluation criteria.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [2]:
df = pd.read_csv("new_df_wine.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,medium
1,1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,medium
2,2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,medium
3,3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,medium
4,5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,medium


I will encode my target variable since some of the models can handle categorical data...

In [3]:
label_mapping = {'low': 0, 'medium': 1, 'high': 2}
# Map the labels to numerical values
df['quality'] = df['quality'].map(label_mapping)
df.head(20)

Unnamed: 0.1,Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,1
1,1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,1
2,2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,1
3,3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,1
4,5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,1
5,6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,1
6,7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,2
7,8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,2
8,9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,1
9,10,6.7,0.58,0.08,1.8,0.097,15.0,65.0,0.9959,3.28,0.54,9.2,1


#### Train test split
Doing this now to avoid data leakage.

In [4]:
def train_test(df,target_column,test_size,random_state):
    """
    Perform train-test split on a DataFrame.
    
    Parameters:
    df (DataFrame): The DataFrame containing features and target variable.
    target_variable (str): The name of the target variable.
    random_state (int): Random state for reproducibility.
    test_size (float): The proportion of the dataset to include in the test split.
    
    Returns:
    X_train (DataFrame): The features for training.
    X_test (DataFrame): The features for testing.
    y_train (Series): The target variable for training.
    y_test (Series): The target variable for testing.
    """
    X = df.drop(target_column,axis = 1)
    X = df.drop('Unnamed: 0',axis = 1)
    y = df[target_column]

    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size,random_state=random_state)
    return X_train, X_test, y_train, y_test

In [5]:
X_train, X_test, y_train, y_test = train_test(df,'quality',0.2,42)

In [6]:
X_train.shape, y_test.shape

((1087, 12), (272,))

In [8]:
from sklearn.compose import ColumnTransformer
num_features = X.select_dtypes(exclude="object").columns
numeric_transformer = StandardScaler()

NameError: name 'X' is not defined

In [None]:
preprocessor = ColumnTransformer(
    [
         ("StandardScaler", numeric_transformer, num_features)       
    ]
)

In [None]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

X_train.shape,X_test.shape

In [None]:
X_train

#### Evaluation and model training:

In [9]:
def evaluate_model(true, predicted):
    accuracy = accuracy_score(true, predicted)
    precision = precision_score(true, predicted, average='weighted')  # Update here
    recall = recall_score(true, predicted, average='weighted')  # Update here
    f1 = f1_score(true, predicted, average='weighted')  # Update here
    return accuracy, precision, recall, f1

In [10]:
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Multilayer Perceptron": MLPClassifier(),
    "CatBoost Classifier": CatBoostClassifier(),
    "XGBoost Classifier": XGBClassifier()
}

model_list = []
accuracy_list = []

for model_name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Evaluate Train and Test dataset
    accuracy_train, precision_train, recall_train, f1_train = evaluate_model(y_train, y_train_pred)
    accuracy_test, precision_test, recall_test, f1_test = evaluate_model(y_test, y_test_pred)

    print(model_name)
    model_list.append(model_name)
    
    print('Model performance for Training set')
    print("- Accuracy: {:.4f}".format(accuracy_train))
    print("- Precision: {:.4f}".format(precision_train))
    print("- Recall: {:.4f}".format(recall_train))
    print("- F1 Score: {:.4f}".format(f1_train))
    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Accuracy: {:.4f}".format(accuracy_test))
    print("- Precision: {:.4f}".format(precision_test))
    print("- Recall: {:.4f}".format(recall_test))
    print("- F1 Score: {:.4f}".format(f1_test))
    print('='*35)
    print('\n')

    accuracy_list.append(model.score(X_test, y_test))

Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Support Vector Machine
Model performance for Training set
- Accuracy: 0.8188
- Precision: 0.8075
- Recall: 0.8188
- F1 Score: 0.7381
----------------------------------
Model performance for Test set
- Accuracy: 0.8235
- Precision: 0.7996
- Recall: 0.8235
- F1 Score: 0.7473




  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


K-Nearest Neighbors
Model performance for Training set
- Accuracy: 0.8804
- Precision: 0.8845
- Recall: 0.8804
- F1 Score: 0.8567
----------------------------------
Model performance for Test set
- Accuracy: 0.8382
- Precision: 0.8310
- Recall: 0.8382
- F1 Score: 0.8013


Logistic Regression
Model performance for Training set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Gradient Boosting
Model performance for Training set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Gaussian Naive Bayes
Model performance for Training set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance



Multilayer Perceptron
Model performance for Training set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Learning rate set to 0.079464
0:	learn: 0.9616873	total: 148ms	remaining: 2m 27s
1:	learn: 0.8507430	total: 152ms	remaining: 1m 15s
2:	learn: 0.7599621	total: 155ms	remaining: 51.4s
3:	learn: 0.6777414	total: 157ms	remaining: 39s
4:	learn: 0.6077261	total: 159ms	remaining: 31.7s
5:	learn: 0.5513941	total: 162ms	remaining: 26.9s
6:	learn: 0.4996991	total: 165ms	remaining: 23.4s
7:	learn: 0.4533942	total: 168ms	remaining: 20.9s
8:	learn: 0.4137263	total: 171ms	remaining: 18.8s
9:	learn: 0.3761021	total: 172ms	remaining: 17.1s
10:	learn: 0.3444367	total: 175ms	remaining: 15.7s
11:	learn: 0.3144831	total: 176ms	remaining: 14.5s
12:	learn: 0.2898540	total: 178ms	remaining: 13.5s
13:	learn: 0.2669705	total: 181ms	remainin

In [11]:
results = pd.DataFrame(list(zip(model_list, accuracy_list)), 
                       columns=['Model Name', 'Accuracy']).sort_values(by=["Accuracy"],
                                                                       ascending=False)
results

Unnamed: 0,Model Name,Accuracy
0,Decision Tree,1.0
1,Random Forest,1.0
4,Logistic Regression,1.0
5,Gradient Boosting,1.0
6,Gaussian Naive Bayes,1.0
7,Multilayer Perceptron,1.0
8,CatBoost Classifier,1.0
9,XGBoost Classifier,1.0
3,K-Nearest Neighbors,0.838235
2,Support Vector Machine,0.823529


### Top 4 Models Selection for Industry Readiness

Considering computational efficiency, interpretability, scalability, and robustness, the following top 4 models are selected for further tuning and industry readiness:

1. **Decision Tree:**
   - **Reasoning:** Decision trees are computationally efficient, interpretable, and suitable for tasks where transparency is crucial. They are well-suited for decision-making processes and require minimal preprocessing.

2. **Random Forest:**
   - **Reasoning:** Random Forest combines the simplicity of decision trees with improved accuracy and robustness. It generalizes well to unseen data and is efficient for large-scale datasets, common in industry applications.

3. **Logistic Regression:**
   - **Reasoning:** Logistic Regression is simple, interpretable, and computationally efficient. It provides probabilistic outputs and works well for binary classification tasks, making it suitable for applications where transparency and simplicity are prioritized.

4. **Gradient Boosting:**
   - **Reasoning:** Gradient Boosting models like XGBoost and CatBoost offer superior performance and scalability. Despite being more complex, they are optimized for efficiency and robustness, making them suitable for industry-scale applications.

**Why These Models:**

- **Interpretability:** Decision trees, Logistic Regression, and ensemble methods like Random Forests provide interpretability, crucial for understanding model decisions in industry settings.
  
- **Scalability:** Gradient Boosting methods are scalable and efficient, capable of handling large-scale datasets commonly encountered in industry applications.

- **Robustness:** Random Forests and Gradient Boosting models are robust against overfitting and generalize well to unseen data, ensuring reliable performance in real-world scenarios.

- **Ease of Deployment:** All selected models are relatively easy to deploy in production environments, with mature implementations available in popular machine learning libraries and frameworks.

Focusing on these top 4 models strikes a balance between computational efficiency, interpretability, scalability, and robustness, ensuring readiness for deployment in real-world industry applications. Further tuning and optimization can enhance their performance and suitability for specific use cases.
gration into your pipeline.
