## Feature Engineering

In [None]:
from typing import List
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## Initializing the variables

In [None]:
DATA_DIR = (
    Path("..")
    / ".."
    / ".."
    / "hfactory_magic_folders"
    / "plastic_cost_prediction"
    / "data"
)
MAIN_FILE = "PA6_cleaned_dataset.csv"

In [None]:
df = pd.read_csv(DATA_DIR / MAIN_FILE)
# convert time from string to datetime
df["time"] = pd.to_datetime(df["time"])

In [None]:
def grouping_vars(df: pd.DataFrame) -> List[List[str]]:
    """Find groups of columns that share the same prefix.

    Parameters
    ----------
    df: pd.DataFrame
        Dataframe for which we want to find groups.

    Returns
    -------
    groups: List[List[str]]
        List with the groups of variables.
    """
    prefix_dict = dict()
    cols = list(df.columns)

    # finding the prefixes for each column and adding them to the dictionary.
    for col in cols:
        prefix = col.split("_")[0]
        if prefix in prefix_dict:
            prefix_dict[prefix] += [col]
        else:
            prefix_dict[prefix] = [col]

    groups = []

    for _, value in prefix_dict.items():
        if len(value) > 1:
            groups += [value]
    return groups

In [None]:
groups = grouping_vars(df)
CRUDE_vars = groups[0]
NGAS_vars = groups[1]
Electricity_vars = groups[2]
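
Indexing `groups` by position depends on the column order of the CSV file. As a minimal sketch of a name-based alternative, the variant below returns a dictionary keyed by prefix. The function name `grouping_vars_by_prefix` and the toy column names are hypothetical, and the prefixes `CRUDE` and `NGAS` are assumed from the variable names used above.

```python
from typing import Dict, List

import pandas as pd


def grouping_vars_by_prefix(df: pd.DataFrame) -> Dict[str, List[str]]:
    """Map each prefix shared by more than one column to its columns."""
    prefix_dict: Dict[str, List[str]] = {}
    for col in df.columns:
        prefix_dict.setdefault(col.split("_")[0], []).append(col)
    # Keep only prefixes that actually group several columns.
    return {p: cols for p, cols in prefix_dict.items() if len(cols) > 1}


# Toy frame with hypothetical column names, for illustration only.
toy = pd.DataFrame(
    columns=["time", "CRUDE_A", "CRUDE_B", "NGAS_A", "NGAS_B", "iNATGAS"]
)
groups_by_prefix = grouping_vars_by_prefix(toy)
crude_cols = groups_by_prefix["CRUDE"]
ngas_cols = groups_by_prefix["NGAS"]
```

With this variant, a change in column order would not silently assign the wrong group to a variable, and a column such as `iNATGAS`, which has no underscore-separated prefix in common with the others, stays out of every group.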

## Time Variables

Since we have access to the time feature, we believe that creating variables for the year and the month may help our models later on. Companies that produce PA6 have contracts for different energy resources with a quarterly fixed rate, so from a business perspective it also makes sense to add a variable representing the quarter.

In [None]:
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
df["quarter"] = (df["month"] - 1) // 3 + 1

Since month and quarter are cyclical variables, we apply a sine and cosine decomposition to both of them.

In [None]:
df["month_sin"] = np.sin(df["month"] * 2 * np.pi / 12)
df["month_cos"] = np.cos(df["month"] * 2 * np.pi / 12)
df.drop("month", axis=1, inplace=True)

df["quarter_sin"] = np.sin(df["quarter"] * 2 * np.pi / 4)
df["quarter_cos"] = np.cos(df["quarter"] * 2 * np.pi / 4)
df.drop("quarter", axis=1, inplace=True)
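
To illustrate why the decomposition helps, the short sketch below (toy values, not taken from the dataset) compares distances between months in the encoded space. As raw integers, December (12) and January (1) are 11 units apart, whereas on the sine/cosine circle they are neighbours.

```python
import numpy as np

months = np.array([1, 6, 12])  # January, June, December
month_sin = np.sin(months * 2 * np.pi / 12)
month_cos = np.cos(months * 2 * np.pi / 12)
points = np.column_stack([month_sin, month_cos])

# Euclidean distances between the encoded months on the unit circle.
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_jun_jan = np.linalg.norm(points[1] - points[0])

print(f"December-January: {dist_dec_jan:.3f}")  # 0.518
print(f"June-January:     {dist_jun_jan:.3f}")  # 1.932
```

Both components are needed: the sine alone maps distinct months (for example months 1 and 5) to the same value.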

## Correlation analysis

Our initial dataset has 97 rows and 23 columns. With so few observations relative to the number of features, the curse of dimensionality makes it important to reduce the number of features. With this in mind, we now analyze the correlations between the features, hoping to find out which ones can be removed or grouped together.

In [None]:
time_variables = [
    "time",
    "year",
    "month_sin",
    "month_cos",
    "quarter_sin",
    "quarter_cos",
]
data_for_corr = df.drop(time_variables, axis=1)
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(
    data_for_corr.corr(), cmap="flare", vmin=-1, vmax=1, annot=True
)
heatmap.set_title("Correlation Heatmap", fontdict={"fontsize": 12}, pad=12)

From this plot, and from examining the features at our disposal, we can identify different groups:

- Crude Prices
- Natural Gas Prices
- Chemical Prices
- Electricity Prices

We will now investigate different ways to aggregate them, in order to find the most meaningful variables for our future models.

## Crude Prices

In [None]:
df["CRUDE_AVG"] = df[CRUDE_vars].mean(axis=1)
df["CRUDE_MIN"] = df[CRUDE_vars].min(axis=1)
df["CRUDE_MAX"] = df[CRUDE_vars].max(axis=1)

In [None]:
data_crude = df[["CRUDE_AVG", "CRUDE_MIN", "CRUDE_MAX", "best_price_compound"]]
plt.figure(figsize=(10, 5))
heatmap = sns.heatmap(
    data_crude.corr(), cmap="flare", vmin=-1, vmax=1, annot=True
)
heatmap.set_title("Correlation Heatmap", fontdict={"fontsize": 12}, pad=12)

As we can see, all three aggregated variables are highly correlated with one another, and they all have roughly the same correlation with the target variable. Since the mean is the most stable of the three, we use it to aggregate the variables representing crude prices.

## Natural Gas Prices

In [None]:
gas_cols = [col for col in df.columns.to_list() if "GAS" in col]

plt.figure()

for column in gas_cols:
    plt.plot(df["time"], df[column], label=column)

plt.title("Time Series Plot - Natural Gas Prices")
plt.xlabel("time")
plt.ylabel("natural gas prices")
plt.legend();

As the plot above shows, all the natural gas price variables are very similar to one another except for iNATGAS. The correlation matrix also shows that iNATGAS has a correlation of 0.99 with NGAS_EUR. Given these two facts, and to avoid iNATGAS heavily influencing the aggregate variable we want to define, we aggregate only the remaining natural gas variables and drop iNATGAS.

In [None]:
df["NATGAS_AVG"] = df[NGAS_vars].mean(axis=1)
df["NATGAS_MIN"] = df[NGAS_vars].min(axis=1)
df["NATGAS_MAX"] = df[NGAS_vars].max(axis=1)

In [None]:
data_NatGas = df[
    ["NATGAS_AVG", "NATGAS_MIN", "NATGAS_MAX", "best_price_compound"]
]
plt.figure(figsize=(10, 5))
heatmap = sns.heatmap(
    data_NatGas.corr(), cmap="flare", vmin=-1, vmax=1, annot=True
)
heatmap.set_title("Correlation Heatmap", fontdict={"fontsize": 12}, pad=12)

As we can see, NATGAS_AVG and NATGAS_MAX are very highly correlated, while NATGAS_MIN is less correlated with the other two aggregates. NATGAS_AVG also has the highest correlation with the target variable. Since the mean is also the most stable aggregate, we use it to aggregate the natural gas variables.

## Chemical Prices

The prices of three chemicals are features of our dataset: Benzene, Cyclohexane and Caprolactam. Let us see how they relate to PA6 from a chemical point of view:

Benzene --> Cyclohexane --> Caprolactam --> PA6

As we can see, only Caprolactam is directly used in the production of PA6. An expert in Chemical Engineering told us that the majority of companies that produce PA6 buy their Caprolactam instead of producing it. With this in mind, we make the educated assumption that the Caprolactam price will be much more influential in predicting the price of PA6 than the other chemical variables, and we therefore discard the latter.

We can also see that PA6 GLOBAL_ EMEAS _ EUR per TON has a correlation of 0.97 with Caprolactam_price, and we therefore drop the former.

## Electricity Prices

In [None]:
df["Electricity_AVG"] = df[Electricity_vars].mean(axis=1)
df["Electricity_MIN"] = df[Electricity_vars].min(axis=1)
df["Electricity_MAX"] = df[Electricity_vars].max(axis=1)

In [None]:
data_electricity = df[
    [
        "Electricity_AVG",
        "Electricity_MIN",
        "Electricity_MAX",
        "best_price_compound",
    ]
]
plt.figure(figsize=(10, 5))
heatmap = sns.heatmap(
    data_electricity.corr(), cmap="flare", vmin=-1, vmax=1, annot=True
)
heatmap.set_title("Correlation Heatmap", fontdict={"fontsize": 12}, pad=12)

As we can see, all three variables are highly correlated and have nearly equal correlations with the target variable. We therefore use the mean to aggregate the electricity price variables.
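
The same three-step pattern (mean, minimum, maximum, then correlation with the target) is applied to the crude, natural gas and electricity groups. As a sketch only, it could be factored into a small helper. The function name `add_aggregates`, the toy column names and the toy values are hypothetical; only `best_price_compound` and the `_AVG`/`_MIN`/`_MAX` suffixes come from the notebook.

```python
from typing import List

import pandas as pd


def add_aggregates(df: pd.DataFrame, cols: List[str], name: str) -> pd.DataFrame:
    """Return a copy of df with the row-wise mean, min and max of cols."""
    out = df.copy()
    out[f"{name}_AVG"] = out[cols].mean(axis=1)
    out[f"{name}_MIN"] = out[cols].min(axis=1)
    out[f"{name}_MAX"] = out[cols].max(axis=1)
    return out


# Toy data with hypothetical column names and made-up values.
toy = pd.DataFrame(
    {
        "Electricity_A": [10.0, 20.0, 30.0],
        "Electricity_B": [14.0, 18.0, 40.0],
        "best_price_compound": [1.0, 2.0, 3.5],
    }
)
toy = add_aggregates(toy, ["Electricity_A", "Electricity_B"], "Electricity")

# Correlation of each aggregate with the target column.
agg_cols = ["Electricity_AVG", "Electricity_MIN", "Electricity_MAX"]
target_corr = toy[agg_cols].corrwith(toy["best_price_compound"])
print(target_corr)
```

Returning a copy instead of modifying `df` in place keeps the helper free of side effects, at the cost of one extra assignment per group.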

## Correlation analysis of the processed variables

Let us now take a look at the correlation matrix for the variables after the preprocessing is done.

In [None]:
processed_data_for_corr = df[
    [
        "Caprolactam_price",
        "CRUDE_AVG",
        "Electricity_AVG",
        "NATGAS_AVG",
        "Inflation_rate_france",
        "Automotive Value",
        "best_price_compound",
    ]
]
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(
    processed_data_for_corr.corr(), cmap="flare", vmin=-1, vmax=1, annot=True
)
heatmap.set_title("Correlation Heatmap", fontdict={"fontsize": 12}, pad=12)

As we can see, we have drastically reduced the number of variables that are extremely correlated with one another, and we have also reduced the total number of variables, which was one of our main goals.
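
As a complement to reading the heatmap by eye, the pairs of variables that remain highly correlated can be listed programmatically. The sketch below is one possible way to do so. The helper name `high_corr_pairs` and the 0.9 threshold are assumptions chosen for illustration, and the data is randomly generated rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data, made up for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": x,
        "b": x + 0.01 * rng.normal(size=200),  # nearly identical to "a"
        "c": rng.normal(size=200),  # generated independently of "a"
    }
)
pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to `processed_data_for_corr`, the same call would return the pairs above the chosen threshold, which can be compared with the result on the original columns.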