In [3]:
from IPython.display import Markdown

import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

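def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=1000):
    """Send chat messages to the model and return the text of the first reply."""
    # Note: openai.ChatCompletion is the pre-1.0 interface of the openai package.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # 0 keeps the output as deterministic as possible
        max_tokens=max_tokens,  # upper bound on the length of the reply
    )
    return response.choices[0].message["content"]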

In [5]:
instructions = """
Revise the title and the content:


# Modelling
On the [Data Wrangling](exploring-data.qmd) section, it was found that the 6-machine paramaters to be studied were discrete values with three levels each, and the target variable was a highly imbalanced binary variable.

In this section, the dataset will be fed into different machine learning models to find an optimal model that can help us understand the mapping between the input and output variables. To do this, the dataset should be splitted into two parts namely, the training set, and the testing set with test size of 20%. The training set will be used during model training, and cross validation while the testing set will be used to evaluate the model performance on unseen data.

Once the data was splitted, it will be loaded to different out-of-the-box machine learning models from scikit-learn library.
At this stage, the only goal is to find the optimal model using their default hyperparameters.

Finally, each of the model's training and testing performance will be compared and evaluated. The goal is to shortlist the top performing model.

## Importing Relevant Libraries
The following code below contains all of the import statements for this section.

## Importing the Dataset
The following code below loads the dataset pre-processed from [EDA](exploring-data.qmd) section into pandas `DataFrame`.

## Vertical Data Splitting
Once loaded, the dataset will be splitted into feature matrix `X`, and target vector `y`.

## Horizontal Data Splitting
Once the feature matrix and target vector was defined, they will be splitted horizontally into training set and testing set.
The code below will perform the horizontal split with test size of 20% ensuring that both sets contains the same proportion of `0`s and `1`s on their target variable.

## Modelling
To compare different models, the yellowbrick data visualization library was used. This library contains out-of-the-box data visualization tools for comparing different machine learning algorithms implemented in scikit-learn.

The class definition below provides an abstraction in comparing different machine learning models.
Its constructor accepts a list of estimators, an input feature X, and output variable y. It has a show_classification_report method that returns a data visualization of each of the model's ROC curve, confusion matrix, and classification report.

To generate these reports, the ClassifierEvaluator class internally splits the input feature X and output variable y into training, and testing set. Each of the classifier will be trained on the training set. To evaluate the model on unseen data, the score method of yellowbrick visualizers will be called using the testing set.

## Training the Model
The code below will train the 5 machine learning models using the preprocessed dataset from [Data Wrangling](exploring-data.qmd) section.

## Classification Report
The following series of plots below shows the ROC curve, confusion matrix, and classification report of each models.
This visualization can be used to compare the performance of each of the model.



While generating the content, follow these rules:
1. Keep it short and concise, but interesting.
2. The tone should be professional, academic, technical, formal, and compelling.
3. The copy should be cohesive and tell a compelling, interesting story.
4. Return your results in markdown.
5. Generate markdown headings when necessary.
6. End with a paragraph that summarizes what has been done so far and states the next steps.
"""

messages = [
    {
        "role": "user",
        "content": instructions
    },
]

results = get_completion_from_messages(messages=messages)
Markdown(results)

# Machine Learning Model Comparison

In the [Data Wrangling](exploring-data.qmd) section, we found that the six machine parameters under study take discrete values with three levels each, and that the target variable is a highly imbalanced binary variable.

In this section, we will feed the dataset into different machine learning models to find the optimal model that can help us understand the relationship between the input and output variables. To do this, we will split the dataset into a training set and a testing set, with a test size of 20%. The training set will be used for model training and cross-validation, while the testing set will be used to evaluate the model's performance on unseen data.

Once the data is split, we will fit several out-of-the-box machine learning models from the scikit-learn library. At this stage, the goal is to find the best-performing model with every estimator left at its default hyperparameters.

Finally, we will compare and evaluate the training and testing performance of each model. The objective is to shortlist the top-performing model.

## Importing Relevant Libraries

The following code contains all the necessary import statements for this section.

## Importing the Dataset

The code below loads the pre-processed dataset from the [EDA](exploring-data.qmd) section into a pandas `DataFrame`.

## Vertical Data Splitting

Once the dataset is loaded, we will split it into a feature matrix `X` and a target vector `y`.
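As a minimal sketch of the vertical split, the snippet below separates a toy `DataFrame` into `X` and `y`. The frame and its column names (`param_a`, `param_b`, `target`) are hypothetical placeholders, not the actual dataset from this project.

```python
import pandas as pd

# Hypothetical toy frame standing in for the pre-processed dataset;
# the column names are assumptions made for illustration only.
df = pd.DataFrame({
    "param_a": [1, 2, 3, 1, 2, 3],
    "param_b": [3, 3, 2, 2, 1, 1],
    "target": [0, 0, 0, 1, 0, 0],
})

# Vertical split: every column except the target goes into X,
# and the target column alone becomes y.
X = df.drop(columns="target")
y = df["target"]
```

Dropping the target by name, rather than slicing by position, keeps the split correct even if the column order changes.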

## Horizontal Data Splitting

After defining the feature matrix and target vector, we will split them horizontally into a training set and a testing set. The code below performs the horizontal split with a test size of 20%, ensuring that both sets contain the same proportion of `0`s and `1`s in their target variable.
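A stratified split of this kind can be sketched with scikit-learn's `train_test_split`. The synthetic arrays, the 5% positive rate, and the `random_state` value below are assumptions chosen for illustration; they are not taken from the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): six parameters with levels 1..3
# and an imbalanced binary target with 5% positives.
rng = np.random.default_rng(0)
X = rng.integers(1, 4, size=(1000, 6))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without `stratify`, a random 20% sample of a highly imbalanced target could end up with very few positive cases, or none, which would make the test metrics unreliable.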

## Modelling

To compare the models, we will use Yellowbrick, a visualization library that provides out-of-the-box diagnostic plots for machine learning algorithms implemented in scikit-learn.

The class definition below provides an abstraction for comparing different machine learning models. Its constructor accepts a list of estimators, a feature matrix `X`, and a target vector `y`. Its `show_classification_report` method displays each model's ROC curve, confusion matrix, and classification report.

To generate these reports, the `ClassifierEvaluator` class internally splits `X` and `y` into training and testing sets. Each classifier is trained on the training set, and the `score` method of each Yellowbrick visualizer is then called on the testing set to evaluate the model on unseen data.
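The structure described above might look roughly like the following sketch. It is not the actual `ClassifierEvaluator`: the class name `ClassifierEvaluatorSketch` is hypothetical, and plain scikit-learn metrics are swapped in for the Yellowbrick visualizers so the example returns numbers and text rather than plots. The data is a synthetic full-factorial grid of six three-level parameters with an arbitrary rule for the target, both assumptions.

```python
from itertools import product

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


class ClassifierEvaluatorSketch:
    """Hypothetical stand-in for the ClassifierEvaluator described in the text."""

    def __init__(self, estimators, X, y, test_size=0.2, random_state=42):
        self.estimators = estimators
        # Internal stratified split into training and testing sets.
        (self.X_train, self.X_test,
         self.y_train, self.y_test) = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=random_state
        )

    def show_classification_report(self):
        """Fit each estimator on the training set and score it on the testing set."""
        reports = {}
        for est in self.estimators:
            est.fit(self.X_train, self.y_train)
            y_pred = est.predict(self.X_test)
            y_score = est.predict_proba(self.X_test)[:, 1]
            reports[type(est).__name__] = {
                "roc_auc": roc_auc_score(self.y_test, y_score),
                "confusion_matrix": confusion_matrix(self.y_test, y_pred),
                "report": classification_report(self.y_test, y_pred, zero_division=0),
            }
        return reports


# Synthetic data (an assumption): all 3**6 combinations of six three-level
# parameters, with a rare positive class defined by an arbitrary rule.
X = np.array(list(product([1, 2, 3], repeat=6)))
y = (X.sum(axis=1) >= 16).astype(int)

evaluator = ClassifierEvaluatorSketch([DummyClassifier(), LogisticRegression()], X, y)
reports = evaluator.show_classification_report()
```

Including a `DummyClassifier` gives a no-skill baseline, which is useful on imbalanced data where a high accuracy alone says little about a model.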

## Training the Model

The code below trains five machine learning models on the preprocessed dataset from the [Data Wrangling](exploring-data.qmd) section.
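The text does not name the five models, so the sketch below picks five common scikit-learn classifiers purely as an assumption and fits each with its default hyperparameters. The data is a synthetic grid standing in for the real dataset, and F1 is used as an example metric, not necessarily the one used in this project.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): six three-level parameters
# and a rare positive class defined by an arbitrary rule.
X = np.array(list(product([1, 2, 3], repeat=6)))
y = (X.sum(axis=1) >= 16).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Five classifiers with default hyperparameters; this selection is
# illustrative and may differ from the models actually compared.
models = [
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    KNeighborsClassifier(),
    GaussianNB(),
]

rows = []
for model in models:
    model.fit(X_train, y_train)
    rows.append({
        "model": type(model).__name__,
        "train_f1": f1_score(y_train, model.predict(X_train), zero_division=0),
        "test_f1": f1_score(y_test, model.predict(X_test), zero_division=0),
    })

# One row per model, so training and testing scores can be compared side by side.
scores = pd.DataFrame(rows).set_index("model")
```

A large gap between `train_f1` and `test_f1` for a given model is a sign of overfitting, which is worth checking before shortlisting it.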

## Classification Report

The following series of plots shows the ROC curve, confusion matrix, and classification report of each model. These visualizations can be used to compare the models' performance.

In conclusion, we have trained and evaluated five machine learning models on our dataset using their default hyperparameters. The next step is to analyze the results and shortlist the top-performing model based on the evaluation metrics. We can then fine-tune the chosen model to improve its performance.