In [2]:
from IPython.display import Markdown

import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

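def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=1000):
    """Send chat messages to the model and return the text of the first reply.

    Uses the legacy `openai.ChatCompletion` interface (openai < 1.0).
    """
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # 0 keeps the output as deterministic as possible
        max_tokens=max_tokens,  # upper bound on the length of the reply
    )
    return response.choices[0].message["content"]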

In [8]:
instructions = """
Revise the title and the content:


# Data Wrangling
Since the objective of this study is to develop a machine learning model that helps us understand and highlight the correlation between input parameters and the incidence of defects, we will use version 1.0 of the dataset.

Version 1.0 contains the different machine parameter combinations
and the binary target variable. Although version 1.1 contains additional information about the weld width, weld gap, crack count, and crack length, these features would compromise our models because they contain information about the occurrence of cracks. They can therefore be considered leaky features.

## Dropping Unwanted Variables
Before wrangling the data, it's important to understand the purpose of each column. Knowing which columns are features, targets, or unwanted will be a great help later during the modelling phase.

### Identifying the Feature Variables
As highlighted in the [Laser Welding Process](laser-welding-process.qmd) page,
this dataset contains 6 features representing the 6 factors to be studied and 2 features identifying the weld number and cross section position.

#### Factors to be Studied
1. Laser beam power (W)
2. Welding speed (m/min)
3. Angular position in welding direction (°)
4. Focal position (mm)
5. Gas flow rate (l/min)
6. Material thickness of the steel sheet (mm)

#### Experiment Identification
1. weld number
2. cross section position in the weld (mm)

### Identifying the Target Variables
For the target variables,
this dataset contains 4 continuous target variables
and 1 binary target variable.
In this study,
the binary target variable will be used,
and the remaining 4 should be dropped.

#### Binary Target Variable
1. cracking in the weld metal

#### Continuous Variables
1. weld seam width steel (µm)
2. weld seam width copper (µm)
3. weld depth copper (µm)
4. gap

## Encoding Categorical Variables
The dataset will soon be fed into different machine learning models; however, most of these models cannot accept string data. Thus, we should encode the `cracking in the weld metal` variable as a numeric one by mapping `yes` to `1` and `no` to `0`. The Python code below performs that.

## Quantifying Missing Data
Most machine learning models can't be trained if missing data exists, so missing data should be quantified and addressed. Fortunately, this dataset does not contain any.

## Data Visualization
In this section, the dataset will be explored visually to gain a deeper understanding of the distribution of each variable and the relationships between variables.

### Data Distribution
The purpose of visualizing the distribution of our data is to help us understand its central tendency and spread. This can be achieved using histograms.

### Correlation Heatmap
Some machine learning models assume that the features they are trained on are not correlated with one another. We can check this by generating a heatmap of the correlation matrix of our input features. In addition, this heatmap can be used to check which features are correlated with the target variable.

Based on the above heatmap, the following can be observed:

1. There was no correlation between the input features (no multicollinearity).
2. `power`, `angular position`, and `material thickness` were the top positively correlated features with `cracking in the weld metal`.
3. `gas flow rate` was the top negatively correlated feature with `cracking in the weld metal`.

While generating content, follow these guidelines:
1. Make it short and concise, but interesting.
2. The tone should be professional, academic, technical, formal, and compelling.
3. The copy should be cohesive and tell a compelling and interesting story.
4. Return your results in markdown.
5. Generate markdown headings when necessary.
6. End with a paragraph that summarizes what has been done so far and outlines the next steps.
"""

messages = [
    {
        "role": "user",
        "content": instructions
    },
]

results = get_completion_from_messages(messages=messages)
Markdown(results)

# Exploring Data for Machine Learning Model Development

In order to develop a machine learning model that can effectively highlight the correlation between input parameters and the incidence of defects, we will be using version 1.0 of the dataset. This version contains the necessary machine parameter combinations and a binary target variable.

While version 1.1 of the dataset includes additional information about weld width, weld gap, crack count, and crack length, we have decided to exclude these features from our analysis. Because they encode information about crack occurrence, including them would leak the target into the model and compromise its results.

Before we begin data wrangling, it is important to understand the purpose of each column in the dataset. This will help us identify which columns are features, targets, and unwanted variables. 

The dataset contains 6 features representing different factors to be studied, such as laser beam power, welding speed, angular position, focal position, gas flow rate, and material thickness of the steel sheet. Additionally, there are 2 features for identifying the weld number and cross section position.

For the target variables, the dataset includes 4 continuous variables and 1 binary variable. In this study, we will focus on the binary target variable, which indicates the presence or absence of cracking in the weld metal. The remaining 4 continuous variables will be dropped from our analysis.
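As a minimal sketch of this step, assuming the data is loaded into a pandas DataFrame, the continuous targets could be dropped as follows. The column names and values here are hypothetical stand-ins; the actual headers in the dataset may differ.

```python
import pandas as pd

# Hypothetical two-row sample; names and values are illustrative only.
df = pd.DataFrame({
    "power": [1000, 1200],
    "weld seam width steel": [310.0, 295.0],
    "weld seam width copper": [280.0, 270.0],
    "weld depth copper": [150.0, 140.0],
    "gap": [0.0, 5.0],
    "cracking in the weld metal": ["no", "yes"],
})

# The four continuous target variables that are not used in this study.
continuous_targets = [
    "weld seam width steel",
    "weld seam width copper",
    "weld depth copper",
    "gap",
]

# Keep only the input features and the binary target.
df = df.drop(columns=continuous_targets)
```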

Next, we will encode the binary target variable into a numeric format, mapping "yes" to 1 and "no" to 0. This is necessary as most machine learning models cannot accept string data.
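The mapping described above can be sketched with pandas as shown below. The sample values are made up for illustration, and the column name is assumed to match the description in the text.

```python
import pandas as pd

# Hypothetical sample of the binary target; values are illustrative only.
df = pd.DataFrame({"cracking in the weld metal": ["yes", "no", "no", "yes"]})

# Map the string labels to integers. Any label missing from the
# dictionary would become NaN, which makes unexpected values easy to spot.
df["cracking in the weld metal"] = df["cracking in the weld metal"].map(
    {"yes": 1, "no": 0}
)
```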

Fortunately, the dataset does not contain any missing data. However, it is important to quantify and address missing data if it exists, as most machine learning models cannot be trained with missing data.
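One way to quantify missing data, sketched here on a hypothetical DataFrame that deliberately contains a gap, is to count the missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value; not taken from the dataset.
df = pd.DataFrame({
    "power": [1000.0, np.nan, 1200.0],
    "welding speed": [2.0, 2.5, 3.0],
})

missing_counts = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column
```

A result of zero for every column would confirm that no imputation or row removal is needed.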

To gain a deeper understanding of the distribution and relationship of each variable, we will visualize the dataset. This will involve exploring the data distribution using histograms and generating a correlation heatmap to identify any correlations between the input features and the target variable.

Based on the heatmap analysis, we observed that there is no multicollinearity among the input features. The features that showed the highest positive correlation with the presence of cracking in the weld metal were power, angular position, and material thickness. On the other hand, gas flow rate showed the highest negative correlation with the presence of cracking.
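The numbers behind such a heatmap come from a correlation matrix. The sketch below computes one with pandas and ranks the features by their correlation with the target. The data is a made-up toy sample, so the values it produces do not reflect the actual dataset.

```python
import pandas as pd

# Toy sample for illustration only; the target is already encoded as 0/1.
df = pd.DataFrame({
    "power": [1000, 1100, 1200, 1300, 1400, 1500],
    "gas flow rate": [20, 18, 12, 16, 10, 14],
    "cracking in the weld metal": [0, 0, 1, 0, 1, 1],
})

target = "cracking in the weld metal"

# Pairwise Pearson correlation (the pandas default) between all columns.
corr = df.corr()

# Correlation of each input feature with the target, highest first.
target_corr = corr[target].drop(target).sort_values(ascending=False)
```

The `corr` matrix could then be passed to a plotting function such as seaborn's `heatmap` to render the figure.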

In conclusion, we have explored the dataset and identified the relevant features and target variables for our machine learning model. We have also encoded the binary target variable and visualized the data distribution and correlations. The next steps will involve training and evaluating our machine learning models using this prepared dataset.