## Standardisation of the data: scaling

In [143]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [144]:
df = pd.read_csv("../data/data_imputed_2025_11_05.csv")
print(df.head())

       C    Si    Mn      S      P   Ni        Cr        Mo         V  \
0  0.037  0.30  0.65  0.008  0.012  0.0  1.308508  0.541548  0.063321   
1  0.037  0.30  0.65  0.008  0.012  0.0  1.308508  0.541548  0.063321   
2  0.037  0.30  0.65  0.008  0.012  0.0  1.308508  0.541548  0.063321   
3  0.037  0.31  1.03  0.007  0.014  0.0  1.040208  0.461698  0.099845   
4  0.037  0.31  1.03  0.007  0.014  0.0  1.040208  0.461698  0.099845   

         Cu  ...  AcicularFerrite  Martensite  FerriteCarbide  \
0  0.202337  ...              NaN         NaN             NaN   
1  0.202337  ...              NaN         NaN             NaN   
2  0.202337  ...              NaN         NaN             NaN   
3  0.190271  ...              NaN         NaN             NaN   
4  0.190271  ...             40.0         0.0             0.0   

                          WeldID  MechanicalTestDone  PrimaryFerrite_missing  \
0    Evans-Ni/CMn-1990/1991-0Aaw                   1                       1   
1  Evans-N

In [145]:
columns = [
    "C", "Si", "Mn", "S", "P", "Ni", "Cr", "Mo", "V", "Cu", "Co", "W",
    "O", "Ti", "N", "Al", "B", "Nb", "Sn", "As", "Sb",
    "Current", "Voltage", "AC_DC", "ElectrodePolarity", "HeatInput",
    "InterpassTemp", "WeldType", "PWHT_Temp", "PWHT_Time",
    "YieldStrength", "UTS", "Elongation", "ReductionArea",
    "CharpyTemp", "CharpyImpact", "Hardness", "FATT50", "PrimaryFerrite",
    "Ferrite2ndPhase", "AcicularFerrite", "Martensite", "FerriteCarbide",
    "WeldID"
]

chem_cols = ['C', 'Si', 'Mn', 'S', 'P', 'Ni', 'Cr', 'Mo', 'V', 'Cu', 'Co', 'W', 'O','Ti', 'N', 'Al', 'B', 'Nb', 'Sn', 'As', 'Sb']

micro_cols = [
    'PrimaryFerrite', 'Ferrite2ndPhase', 'AcicularFerrite',
    'Martensite', 'FerriteCarbide'
]

process_param_columns = ['Current', 'Voltage','AC_DC', 'ElectrodePolarity', 'HeatInput', 'InterpassTemp', 'WeldType', 'PWHT_Temp', 'PWHT_Time']

mech_cols = [
    'UTS', 'Elongation', 'ReductionArea', # Do not include YieldStrength
    'CharpyTemp', 'CharpyImpact', 'Hardness', 'FATT50'
]

process_num = ["Current","Voltage","HeatInput","InterpassTemp"]
process_cat = ["AC_DC","ElectrodePolarity","WeldType"]

The columns are grouped by the distribution they follow (see the EDA for the plots). The distribution determines which transformer is applied:
* StandardScaler: used on columns that are approximately Gaussian.
* MinMaxScaler: used on bounded columns that do not follow a recognisable distribution.
* OneHotEncoder: used on categorical columns.
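
As a rough illustration of what each of the three transformers does, here is a minimal sketch on toy arrays. The values are invented for the example and are not taken from the weld dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy values, invented for illustration only (not from the weld dataset).
roughly_gaussian = np.array([[480.0], [500.0], [520.0], [510.0], [490.0]])
bounded = np.array([[0.0], [10.0], [40.0], [100.0]])
categorical = np.array([["AC"], ["DC"], ["DC"], ["AC"]])

# StandardScaler: subtract the mean, divide by the standard deviation.
standardised = StandardScaler().fit_transform(roughly_gaussian)

# MinMaxScaler: map the observed minimum to 0 and the observed maximum to 1.
rescaled = MinMaxScaler().fit_transform(bounded)

# OneHotEncoder: one indicator column per category (sparse by default,
# hence the conversion to a dense array).
encoded = OneHotEncoder().fit_transform(categorical).toarray()

print(standardised.ravel())
print(rescaled.ravel())
print(encoded)
```

The standardised column ends up with zero mean and unit variance, the rescaled column lies in [0, 1], and each encoded row contains exactly one 1.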

In [None]:
# Columns grouped by the distribution they follow (see the EDA for the full distribution of each column).
# Each group is assigned its own transformer further down.

GAUSSIAN = [
    "C", "Si", "S", "P", "V", "Co", "W", "O", "N", "B","Sn", "As", 
    "Sb", "UTS", "Elongation", "ReductionArea", "CharpyTemp", 
    "CharpyImpact", "Hardness", "FATT50"
    ]

MINMAX = ["Mn", "Ni", "Cr", "Mo", "Cu", "Ti", "Al", "Nb",
          "Current", "Voltage", "PWHT_Temp", 
          "PWHT_Time", "PrimaryFerrite", "Ferrite2ndPhase", "AcicularFerrite", 
          "Martensite", "FerriteCarbide", "HeatInput", "InterpassTemp"
          ]

OHE = ["AC_DC","ElectrodePolarity","WeldType"]

In [147]:
features = chem_cols + process_num + process_cat + micro_cols
targets = "YieldStrength"

x = df[features].copy()
y = df[targets].copy()

A train/test split is performed before fitting the scalers, so that they never see the test data (no leakage). Parameters:
* test_size = 0.2: 80% of the labelled data is kept for training, and the remaining 20% is held out to evaluate the model reliably.
* random_state = 42: fixes the seed of the random shuffle, so that the split is identical at each execution.
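
The effect of these two parameters can be checked on a small stand-in frame. The sketch below uses a hypothetical ten-row DataFrame (`toy`) and a hypothetical helper (`split`); only `train_test_split` and its arguments come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the labelled rows.
toy = pd.DataFrame({"feature": range(10), "YieldStrength": range(100, 110)})


def split(frame, seed):
    """Split features and target with the same settings as the notebook."""
    return train_test_split(
        frame[["feature"]],
        frame["YieldStrength"],
        test_size=0.2,
        random_state=seed,
        shuffle=True,
    )


X_tr_a, X_te_a, y_tr_a, y_te_a = split(toy, seed=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(toy, seed=42)

# With the same seed, the two calls select exactly the same test rows.
same_partition = X_te_a.index.equals(X_te_b.index)
print(len(X_tr_a), len(X_te_a), same_partition)
```

With ten rows and `test_size=0.2`, eight rows go to the training set and two to the test set, and repeating the call with the same seed reproduces the same partition.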

In [155]:
# Separation of labelled and unlabelled data
df_labeled = df[df[targets].notna()].copy()
df_unlabeled = df[df[targets].isna()].copy()

X_labeled = df_labeled.drop(columns=[targets])
y_labeled = df_labeled[targets]

# Train/test split on labelled data
X_train_lab, X_test, y_train_lab, y_test = train_test_split(
    X_labeled, y_labeled,
    test_size=0.2,
    random_state=42,
    shuffle=True
)

# Build the full train set (labelled train rows plus all unlabelled rows); the test set contains labelled rows only
X_train = pd.concat([X_train_lab, df_unlabeled.drop(columns=[targets])], axis=0)
# Unlabelled rows get NaN targets (float NaN keeps y_train numeric instead of object dtype)
y_train = pd.concat([y_train_lab, pd.Series(float("nan"), index=df_unlabeled.index, name=targets)])

# Reset the indices
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

X_train.shape, y_train.shape, X_test.shape, y_test.shape


((1500, 47), (1500,), (152, 47), (152,))

Definition of the scaling transformation considered for each block (this block-wise scheme corresponds to the commented-out transformer in the next cell; the active code groups the columns by distribution instead):
* **Chemical composition columns**: MinMaxScaler. The data is bounded and skewed, with a majority of small values (concentrations). MinMaxScaler maps these values linearly onto the [0, 1] interval, which preserves their relative spacing.
* **Process parameters, numerical**: RobustScaler. These columns contain the most outliers; RobustScaler centres on the median and scales by the interquartile range, so extreme values have little influence.
* **Process parameters, categorical**: OneHotEncoder. Categorical data is encoded rather than scaled: each category becomes a binary indicator column.
* **Microstructure**: MinMaxScaler, for the same reason as the chemical block: the data is bounded.
* **Mechanical data**: StandardScaler. The data derives from controlled physical processes, which tend to produce approximately normal distributions.
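
To make the difference between RobustScaler and StandardScaler concrete, the following sketch fits both on a toy column with a single outlier. The numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy column with one extreme value (invented, not from the weld dataset).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit(values)
standard = StandardScaler().fit(values)

# RobustScaler stores the median in center_ and the interquartile range
# (75th minus 25th percentile) in scale_.
print("median:", robust.center_[0], "IQR:", robust.scale_[0])

# StandardScaler stores the mean and the standard deviation; both are
# pulled upwards by the outlier.
print("mean:", standard.mean_[0], "std:", standard.scale_[0])

robust_scaled = robust.transform(values)
standard_scaled = standard.transform(values)
```

The median (3) and interquartile range (2) ignore the outlier, whereas the mean (22) is dominated by it, which is why the four ordinary values stay well spread out after robust scaling but are squeezed together after standard scaling.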

In [156]:
# # Numerical columns (block)
# num_chem = [c for c in chem_cols if c in X_train.columns]
# num_proc = [c for c in process_num if c in X_train.columns]
# num_micro = [c for c in micro_cols if c in X_train.columns]
# num_mech = [c for c in mech_cols if c in X_train.columns]

# # Categorial columns
# cat_proc = [c for c in process_cat if c in X_train.columns]

# # Composition of the column transformer
# preprocess = ColumnTransformer(
#     transformers=[
#         ("chem_minmax", MinMaxScaler(), num_chem),
#         ("proc_robust", RobustScaler(), num_proc),
#         ("micro_minmax", MinMaxScaler(), num_micro),
#         ("mech_standard", StandardScaler(), num_mech),
#         ("proc_ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_proc),
#     ],
#     remainder="drop"
# )

gauss = [c for c in GAUSSIAN if c in X_train.columns]
minmax = [c for c in MINMAX if c in X_train.columns]
ohe = [c for c in OHE if c in X_train.columns]


# Composition of the column transformer
preprocess = ColumnTransformer(
    transformers=[
        ("Gaussian", StandardScaler(), gauss),
        ("MinMax", MinMaxScaler(), minmax),
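        # Dense output so the result can be wrapped in a DataFrame; categories unseen during fit are encoded as all zeros
        ("OneHotEncoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ohe)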
    ],
    remainder="drop"
)


Creation of a pipeline that is fitted on the training set only and then used to transform both the train and test sets in a reproducible way.
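
The fit-on-train, transform-both pattern can be sketched on toy arrays as follows. The step name `scale` and the values are assumptions made for the illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data (invented): the test set contains a value above the train maximum.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

pipe = Pipeline([("scale", MinMaxScaler())])

# The minimum and maximum are learned from the training rows only.
pipe.fit(train)

train_scaled = pipe.transform(train)
test_scaled = pipe.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the scaler only knows the training range [0, 10], the test value 20 is mapped to 2.0, outside [0, 1]. This is expected and is the sign that no information from the test set leaked into the fitted parameters.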

In [157]:
pipeline = Pipeline([
    ("preprocess", preprocess)
])

pipeline.fit(X_train)

X_train_scaled = pipeline.transform(X_train)
X_test_scaled  = pipeline.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape

((1500, 54), (152, 54))

Converting the scaled arrays back to DataFrames.

In [158]:
# Scaled columns keep their original names. The categorical columns are replaced
# by their one-hot column names, so the raw names in `ohe` must not be listed as
# well: doing so gives 57 names for 54 columns and raises
# "ValueError: Shape of passed values is (1500, 54), indices imply (1500, 57)".
output_feature_names = []

output_feature_names += gauss
output_feature_names += minmax

if len(ohe) > 0:
    onehot = pipeline.named_steps["preprocess"].named_transformers_["OneHotEncoder"]
    onehot_feature_names = onehot.get_feature_names_out(ohe).tolist()
else:
    onehot_feature_names = []

output_feature_names += onehot_feature_names

# Sanity check: one name per output column
assert len(output_feature_names) == X_train_scaled.shape[1]

X_train_scaled_df = pd.DataFrame(X_train_scaled, index=X_train.index, columns=output_feature_names)
X_test_scaled_df  = pd.DataFrame(X_test_scaled,  index=X_test.index,  columns=output_feature_names)

X_train_scaled_df.head()

In [None]:
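# Scaled train/test sets, with the target column re-attached
train_out = X_train_scaled_df.copy()
test_out  = X_test_scaled_df.copy()

# The indices were reset after the split, so the targets align row by row.
# The target is missing for the unlabelled rows of the train set.
train_out[targets] = y_train
test_out[targets]  = y_test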

# Saving the datasets
train_out.to_csv("../data/train_normalised.csv", index=False)
test_out.to_csv("../data/test_normalised.csv", index=False)
