## Standardisation of the data: scaling

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
df = pd.read_csv("../data/data_imputed_2025_11_05.csv")
print(df.head())

       C    Si    Mn      S      P   Ni        Cr        Mo         V  \
0  0.037  0.30  0.65  0.008  0.012  0.0  2.160911  0.611092  0.087085   
1  0.037  0.30  0.65  0.008  0.012  0.0  2.160911  0.611092  0.087085   
2  0.037  0.30  0.65  0.008  0.012  0.0  2.160911  0.611092  0.087085   
3  0.037  0.31  1.03  0.007  0.014  0.0  1.733540  0.497154  0.086635   
4  0.037  0.31  1.03  0.007  0.014  0.0  1.733540  0.497154  0.086635   

         Cu  ...  CharpyTemp  CharpyImpact  Hardness  FATT50  PrimaryFerrite  \
0  0.191183  ...         NaN           NaN       NaN     NaN             NaN   
1  0.191183  ...       -28.0         100.0       NaN     NaN             NaN   
2  0.191183  ...       -38.0         100.0       NaN     NaN             NaN   
3  0.180019  ...         NaN           NaN       NaN     NaN             NaN   
4  0.180019  ...       -48.0         100.0       NaN     NaN            32.0   

   Ferrite2ndPhase  AcicularFerrite  Martensite  FerriteCarbide  \
0            

In [4]:
columns = [
    "C", "Si", "Mn", "S", "P", "Ni", "Cr", "Mo", "V", "Cu", "Co", "W",
    "O", "Ti", "N", "Al", "B", "Nb", "Sn", "As", "Sb",
    "Current", "Voltage", "AC_DC", "ElectrodePolarity", "HeatInput",
    "InterpassTemp", "WeldType", "PWHT_Temp", "PWHT_Time",
    "YieldStrength", "UTS", "Elongation", "ReductionArea",
    "CharpyTemp", "CharpyImpact", "Hardness", "FATT50", "PrimaryFerrite",
    "Ferrite2ndPhase", "AcicularFerrite", "Martensite", "FerriteCarbide",
    "WeldID"
]

In [5]:
process_param_columns = ['Current', 'Voltage','AC_DC', 'ElectrodePolarity', 'HeatInput', 'InterpassTemp', 'WeldType', 'PWHT_Temp', 'PWHT_Time']

In [6]:
chem_cols = ['C', 'Si', 'Mn', 'S', 'P', 'Ni', 'Cr', 'Mo', 'V', 'Cu', 'Co', 'W', 'O','Ti', 'N', 'Al', 'B', 'Nb', 'Sn', 'As', 'Sb']

In [7]:
mech_cols = [
    'YieldStrength', 'UTS', 'Elongation', 'ReductionArea',
    'CharpyTemp', 'CharpyImpact', 'Hardness', 'FATT50'
]

In [8]:
micro_cols = [
    'PrimaryFerrite', 'Ferrite2ndPhase', 'AcicularFerrite',
    'Martensite', 'FerriteCarbide'
]

Definition of the column groups: the columns are separated into categorical and numerical data, since this determines which transformation is applied to each block.  
The targets are then separated from the features, so that they are excluded from the scaling step.

In [11]:
process_num = ["Current","Voltage","HeatInput","InterpassTemp"]
process_cat = ["AC_DC","ElectrodePolarity","WeldType"]

features = chem_cols + process_num + process_cat + micro_cols
targets = ["YieldStrength","UTS","Elongation","ReductionArea","CharpyTemp","CharpyImpact","Hardness","FATT50"]

x = df[features].copy()
y = df[targets].copy()
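
Before going further, it can help to check that the column lists are consistent. The cell below is a minimal sketch on a made-up miniature frame (`toy`, `toy_num`, `toy_cat` and `toy_targets` are hypothetical names, and the values are invented): it verifies that no column is both a feature and a target, and that every column declared numerical was actually parsed as a number.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df (values are invented).
toy = pd.DataFrame({
    "C": [0.04, 0.05],
    "Current": [170.0, 180.0],
    "AC_DC": ["DC", "AC"],
    "YieldStrength": [392.0, 370.0],
})

toy_num = ["C", "Current"]
toy_cat = ["AC_DC"]
toy_features = toy_num + toy_cat
toy_targets = ["YieldStrength"]

# No column may be both a feature and a target.
overlap = set(toy_features) & set(toy_targets)

# Columns declared numerical that pandas did not parse as numbers.
non_numeric = [c for c in toy_num if not pd.api.types.is_numeric_dtype(toy[c])]

print("overlap:", overlap)
print("non-numeric numerical columns:", non_numeric)
```

A non-empty `non_numeric` list on the real data would point to columns that still contain text entries and need cleaning before scaling.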

Performing a train/test split before fitting the scalers, so that no information from the test set leaks into them. Parameters: 
* test_size = 0.2 keeps 80% of the data for training and holds out the remaining 20% to evaluate the model on data it has not seen. 
* random_state = 42 fixes the seed of the random shuffling, so that the split is identical at each execution.

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

x_train.shape, x_test.shape

((1321, 33), (331, 33))
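
The two points above can be illustrated on synthetic data. This is a minimal sketch, not part of the original analysis: `data`, `a_train`, `b_train` and `scaler` are hypothetical names. It shows that the same `random_state` gives the same split, and that a scaler fitted on the training part only uses training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic single-column data: 0.0, 1.0, ..., 19.0
data = np.arange(20, dtype=float).reshape(-1, 1)

# Two splits with the same seed are identical.
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

# The scaler only sees the training part; the test part is transformed afterwards.
scaler = MinMaxScaler().fit(a_train)
test_scaled = scaler.transform(a_test)

print(same_split, a_train.shape, a_test.shape)
```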

Definition of the transformation applied to each block:
* **Chemical composition columns**: MinMaxScaler. The data is bounded and skewed, with a majority of small values (concentrations). MinMaxScaler rescales each column linearly to the [0, 1] interval, which preserves the shape of its distribution.  
* **Process parameters, numerical**: RobustScaler. This block contains the most outliers; RobustScaler centers on the median and scales by the interquartile range, so extreme values have little influence on the scaling.  
* **Process parameters, categorical**: OneHotEncoder. Categorical data is not scaled but encoded, with one binary column per category.  
* **Microstructure**: MinMaxScaler, for the same reason as the chemical block: the data is bounded.  
* **Mechanical data**: StandardScaler would suit this block, since the data derives from controlled physical processes, which tend to produce approximately normal distributions. These columns are the targets, however, so they are excluded from the scaling and do not appear in the transformer.

In [13]:
# Numerical columns (block)
num_chem = [c for c in chem_cols if c in x_train.columns]
num_proc = [c for c in process_num if c in x_train.columns]
num_micro = [c for c in micro_cols if c in x_train.columns]

# Categorical columns
cat_proc = [c for c in process_cat if c in x_train.columns]

# Composition of the column transformer
preprocess = ColumnTransformer(
    transformers=[
        ("chem_minmax", MinMaxScaler(), num_chem),
        ("proc_robust", RobustScaler(), num_proc),
        ("micro_minmax", MinMaxScaler(), num_micro),
        ("proc_ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_proc),
    ],
    remainder="drop"
)
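
To illustrate why the choice of scaler matters when outliers are present, the sketch below applies MinMaxScaler and RobustScaler to a made-up column with one extreme value. The numbers are invented and `col`, `minmax` and `robust` are hypothetical names, not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Invented column: four ordinary values and one extreme value.
col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler maps the minimum to 0 and the maximum to 1,
# so the extreme value squeezes the ordinary values close to 0.
minmax = MinMaxScaler().fit_transform(col).ravel()

# RobustScaler subtracts the median (3) and divides by the
# interquartile range (4 - 2 = 2), so the ordinary values stay spread out.
robust = RobustScaler().fit_transform(col).ravel()

print(minmax)
print(robust)
```

With MinMaxScaler the four ordinary values end up below 0.04, while RobustScaler keeps them between -1 and 0.5 and leaves the extreme value far from the rest.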


Following errors raised by the pipeline, some entries had to be cleaned: a few columns still contained text instead of numbers. In the following cells, I look for the columns that contain the entries I found, and replace them. 
* I found an interval, and decided to take its midpoint. 
* I found 2 values '<1', that I replaced with 0.5 for scaling.

In [14]:
import re
# I am here looking for values that resemble an interval.

# Columns read as text, which may hide numerical values
maybe_numeric = [c for c in x_train.columns if x_train[c].dtype == "object"]

# Looking for a potential interval: two numbers separated by a hyphen or an en dash
pattern = re.compile(r"^\s*-?\d+(?:[.,]\d+)?\s*[-–]\s*-?\d+(?:[.,]\d+)?\s*$")

# Columns containing at least one interval, with up to 10 example values each
suspects = {}
for c in maybe_numeric:
    m = x_train[c].astype(str).str.match(pattern, na=False)
    if m.any():
        suspects[c] = x_train.loc[m, c].unique()[:10]
print(suspects)

def interval_midpoint(series: pd.Series) -> pd.Series:
    # Normalise to stripped strings with "." as decimal separator
    s = series.astype(str).str.strip()
    s = s.str.replace(",", ".", regex=False)
    # Extract both bounds of an interval; rows that are not intervals give NaN
    ab = s.str.extract(r"^\s*(-?\d+(?:\.\d+)?)\s*[-–]\s*(-?\d+(?:\.\d+)?)\s*$")
    mid = ab.astype(float).mean(axis=1)
    # Rows that are not intervals are parsed as plain numbers (NaN if not parseable)
    fallback = pd.to_numeric(s, errors="coerce")
    return mid.fillna(fallback)

# The same correction is applied to the training and the test sets
for c in suspects:
    x_train[c] = interval_midpoint(x_train[c])
    x_test[c] = interval_midpoint(x_test[c])


In [15]:
# I am here looking for columns containing a "<", and replacing the matching values with 0.5.
suspect_cols = {}

for c in x_train.columns:
    mask = x_train[c].astype(str).str.contains("<", na=False)
    if mask.any():
        suspect_cols[c] = x_train.loc[mask, c].unique()[:10]


for col in suspect_cols.keys():
    mask = x_train[col].astype(str).str.contains("<", na=False)
    x_train.loc[mask, col] = 0.5
    x_train[col] = pd.to_numeric(x_train[col], errors="coerce")

    if col in x_test.columns:
        mask_test = x_test[col].astype(str).str.contains("<", na=False)
        x_test.loc[mask_test, col] = 0.5
        x_test[col] = pd.to_numeric(x_test[col], errors="coerce")
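
The two cleaning rules can be checked on a small made-up series. The sketch below is a hypothetical, value-by-value variant of the cells above: `clean_value` and `to_float` are illustrative helpers, the sample strings are invented, and taking half of the bound for a `<` entry is an assumption that generalises the 0.5 chosen for '<1'.

```python
import re
import pandas as pd

# Two numbers separated by a hyphen or an en dash, as in the cell above.
INTERVAL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*[-–]\s*(-?\d+(?:\.\d+)?)\s*$")

def to_float(text):
    """Parse a plain number, returning NaN when the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return float("nan")

def clean_value(raw):
    """Hypothetical helper: midpoint for 'a-b', half the bound for '<b'."""
    text = str(raw).strip().replace(",", ".")
    match = INTERVAL.match(text)
    if match:
        low, high = map(float, match.groups())
        return (low + high) / 2
    if text.startswith("<"):
        # Assumption: a censored value '<b' is replaced by b / 2.
        return to_float(text[1:]) / 2
    return to_float(text)

# Invented sample values, not taken from the dataset.
raw = pd.Series(["150-200", "<1", "175", "n/a"])
cleaned = raw.map(clean_value)
print(cleaned.tolist())
```

Entries that match neither rule and are not numbers become NaN, which is the same behaviour as `errors="coerce"` in the cells above.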


Creation of a pipeline that is fitted on the training set only, and then used to transform both the training and the test sets in a reproducible way.

In [16]:
pipeline = Pipeline([
    ("preprocess", preprocess)
])

pipeline.fit(x_train)

X_train_scaled = pipeline.transform(x_train)
X_test_scaled  = pipeline.transform(x_test)

X_train_scaled.shape, X_test_scaled.shape

((1321, 45), (331, 45))
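
Because the transformer is fitted on the training set only, the test set can behave differently from the training set after transformation. The sketch below uses a made-up two-column frame (`toy_train`, `toy_test` and `toy_ct` are hypothetical names, and the values are invented) to show two consequences: a test value above the training maximum is scaled above 1, and a category unseen during fitting is encoded as all zeros with `handle_unknown="ignore"`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Invented data: the test row has an out-of-range value and an unseen category.
toy_train = pd.DataFrame({"C": [0.02, 0.06, 0.10], "WeldType": ["MMA", "SA", "MMA"]})
toy_test = pd.DataFrame({"C": [0.12], "WeldType": ["TSA"]})

toy_ct = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["C"]),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["WeldType"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_ct.fit(toy_train)                   # statistics come from the training set only
train_arr = toy_ct.transform(toy_train)
test_arr = toy_ct.transform(toy_test)

print(train_arr.shape)  # 1 scaled column + 1 column per training category
print(test_arr)
```

The number of output columns is the number of numerical columns plus the number of distinct categories seen during fitting, which is why the real output has more columns than the input.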

Converting the scaled arrays back to DataFrames, with the original index and named columns.

In [17]:
output_feature_names = []

output_feature_names += num_chem
output_feature_names += num_proc
output_feature_names += num_micro

if len(cat_proc) > 0:
    onehot = pipeline.named_steps["preprocess"].named_transformers_["proc_ohe"]
    onehot_feature_names = onehot.get_feature_names_out(cat_proc).tolist()
else:
    onehot_feature_names = []

output_feature_names += onehot_feature_names

# The number of names must match the number of scaled columns
assert len(output_feature_names) == X_train_scaled.shape[1]

X_train_scaled_df = pd.DataFrame(X_train_scaled, index=x_train.index, columns=output_feature_names)
X_test_scaled_df  = pd.DataFrame(X_test_scaled,  index=x_test.index,  columns=output_feature_names)

X_train_scaled_df.head()

             C        Si        Mn         S         P        Ni        Cr        Mo         V        Cu  ...  \
306   0.311258       0.3  0.550505  0.028777  0.028226  0.287541  0.087169  0.353414  0.025465  0.211625  ...   
192   0.410596  0.145455  0.388889  0.043165  0.044355  0.233041  0.890409  0.730539   0.07215  0.311987  ...   
309   0.311258       0.3  0.550505  0.028777  0.028226  0.287541  0.087169  0.353414  0.025465  0.211625  ...   
1360  0.298013  0.209091  0.424242  0.057554   0.08871  0.307985  0.266295  0.703043  0.028876  0.268421  ...   
63    0.298013  0.254545  0.580808   0.05036  0.020161  0.359333   0.14695  0.321025  0.044662   0.27547  ...   

      WeldType_FCA  WeldType_GMAA  WeldType_GTAA  WeldType_MMA  WeldType_NGGMA  \
306            0.0            0.0            0.0           1.0             0.0   
192            0.0            0.0            0.0           1.0             0.0   
309            0.0            0.0            0.0           1.0             0.0   
1360           1.0            0.0            0.0           0.0             0.0   
63             0.0            0.0            0.0           1.0             0.0   

      WeldType_NGSAW  WeldType_SA  WeldType_SAA  WeldType_ShMA  WeldType_TSA  
306              0.0          0.0           0.0            0.0           0.0  
192              0.0          0.0           0.0            0.0           0.0  
309              0.0          0.0           0.0            0.0           0.0  
1360             0.0          0.0           0.0            0.0           0.0  
63               0.0          0.0           0.0            0.0           0.0  


In [18]:
# Normalised train/test
train_out = X_train_scaled_df.copy()
test_out  = X_test_scaled_df.copy()

if y_train is not None:
    train_out[targets] = y_train
    test_out[targets]  = y_test

# Saving the datasets
train_out.to_csv("../data/train_normalised.csv", index=False)
test_out.to_csv("../data/test_normalised.csv", index=False)
