# **0. Import dataset**

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("dataset.csv")

In [None]:
# Confirm the dataset has been imported correctly
df.head()

# **1. Data Exploration**

## **1.1. Nulls**

In [None]:
df.info()

In [None]:
# Impute the NaN values in 'finger_line_37_ring' with the column mean
mask_nan = df["finger_line_37_ring"].isna()
mean_val = df["finger_line_37_ring"].mean()
df.loc[mask_nan, "finger_line_37_ring"] = mean_val

In [None]:
print(df.isna().sum())

In [None]:
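# Almost every column has exactly one NaN, which suggests a single empty row.
# 'finger_line_37_ring' was already imputed above, so look for rows where
# most (rather than all) columns are missing.
empty_rows = df[df.isna().mean(axis=1) > 0.5]
print(empty_rows)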

In [None]:
# Dropping row 176 from the dataframe
df = df.drop(index=176)
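
As a hedged alternative to hard-coding the row index, the sketch below drops fully empty rows first and only then imputes the remaining gaps, so an empty row is never filled with a column mean. The toy frame and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
# Row 1 is completely empty; row 3 has a single genuine gap.
toy = pd.DataFrame({
    "finger_line_37_ring": [1.0, np.nan, 3.0, np.nan],
    "ring_GT": [10.0, np.nan, 12.0, 14.0],
})

# 1) Drop rows where every column is NaN, without naming the index
toy = toy.dropna(how="all")

# 2) Impute the remaining gaps with the column mean
col = "finger_line_37_ring"
toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

With this ordering, a check such as `isna().all(axis=1)` still finds the empty row, because nothing has been imputed yet.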

## **1.2. Distribution**

In [None]:
df.describe()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical (float) columns
numerical_columns = df.select_dtypes(include=['float64']).columns.tolist()

# Histogram
df[numerical_columns].hist(bins=20, figsize=(12, 10))
plt.suptitle('Histogram')
plt.show()

**Right-skewed distributions** for most of the variables.

**Long tails**, which might indicate anatomical variability.

Many finger_line columns have very similar distributions, which suggests they carry redundant information. The right skew also implies that we should expect outliers. RobustScaler is a good fit here: we are fitting a regression model and want to keep the outliers, since they are real data, while limiting their influence on the scaling.
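
The visual impression of right skew can be backed by a number. The sketch below uses `DataFrame.skew()` on synthetic data (the column names and distributions are assumptions chosen for illustration, not the real features): values clearly above 0 indicate a right-skewed column, values near 0 a roughly symmetric one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one right-skewed (log-normal), one symmetric (normal)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.8, size=500),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=500),
})

# Sample skewness per column; positive means a longer right tail
skewness = toy.skew()
print(skewness.round(2))
```

On the real data, the same call would be `df[numerical_columns].skew()`.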

## **1.3. Outliers**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select only numeric (float) columns
numeric_columns = df.select_dtypes(include=['float64']).columns.tolist()

# Calculate number of rows for subplots (2 plots per row)
num_cols = len(numeric_columns)
num_rows = (num_cols + 1) // 2

# Create the subplots dynamically (2 columns per row)
fig, axes = plt.subplots(num_rows, 2, figsize=(12, num_rows * 5))
axes = axes.flatten()

# Plot one boxplot per numeric column
for i, column in enumerate(numeric_columns):
    sns.boxplot(y=df[column], ax=axes[i])
    axes[i].set_title(f'Distribution of {column}')

# Remove any unused subplot axes
for j in range(i + 1, len(axes)):
    axes[j].remove()

# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()

The boxplots confirm what the histograms showed: the distributions are right-skewed, with outliers in the long tails.
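
The points drawn beyond the whiskers of a boxplot follow the 1.5 × IQR rule by default, so they can also be counted directly. This is a minimal sketch on made-up values; `iqr_outlier_mask` is a hypothetical helper, not part of pandas or seaborn.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up measurements with one extreme value
toy = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 20.0])
mask = iqr_outlier_mask(toy)

print(toy[mask])
```

Applied per column, for example `df[numeric_columns].apply(iqr_outlier_mask).sum()`, this gives the number of flagged values for each feature.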

## **1.4. Collinearity**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Generate a mask for the upper triangle: the matrix is symmetric, so hiding one half gives a cleaner visualization
# numeric_only=True skips any non-numeric columns (e.g. the 'file' identifier)
correlation_matrix = df.corr(numeric_only=True)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Drawing the heatmap with the mask
plt.figure(figsize = (11,9))
sns.heatmap(correlation_matrix, mask=mask, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, cmap="coolwarm")
plt.show()

The heatmap confirms high collinearity among the finger-line features. This is expected, since these are sequential measurements along the same finger. Such collinearity can destabilize the coefficients of linear models but has little effect on the predictions of tree-based models. Therefore, we will proceed with Random Forest, which is robust to redundant features.

We could quantify the multicollinearity further but, since the heatmap already reveals extreme collinearity, it would add little.
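
If a quick numeric summary is still wanted, one lightweight option is to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below does this on synthetic data; the column names and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features: line_2 is almost a copy of line_1, the third is independent
toy = pd.DataFrame({
    "line_1": base,
    "line_2": base + rng.normal(scale=0.05, size=200),
    "unrelated": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format (feature_a, feature_b) -> |r|, then filter by threshold
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9].sort_values(ascending=False)
print(high_pairs)
```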


In [None]:
# Drop the 'file' column since it's just an identification feature
df = df.drop(columns=["file"])

# **2. Preprocessing**

We selected RobustScaler to scale the input features because of the outliers and skewed distributions discussed above.

Unlike StandardScaler, which uses the mean and standard deviation and is sensitive to outliers, RobustScaler uses the median and interquartile range (IQR), making it more resilient to extreme values.
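
The difference can be seen on a tiny made-up column with one extreme value. The sketch below reproduces RobustScaler by hand as (x − median) / IQR and compares it with StandardScaler; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler by hand: centre on the median, divide by the IQR
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

print("robust:  ", robust.ravel().round(3))
print("standard:", standard.ravel().round(3))
```

With StandardScaler the extreme value inflates the standard deviation, so the four ordinary values are squeezed into a much narrower range than with RobustScaler.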

We also apply PCA because of the high collinearity observed.

These choices simplify the preprocessing: we avoid handling outliers manually and performing feature selection.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

# Separate features and target
X = df.drop(columns=["ring_GT"])
y = df["ring_GT"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features using RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA to reduce the number of features
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
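
One way to sanity-check a fixed number of components is to look at the cumulative explained variance. The sketch below is run on synthetic data built from three hidden factors (the data, the 95% target, and the variable names are assumptions); on the real data, `X_train_scaled` would take the place of `X_scaled`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# 20 highly collinear features generated from 3 hidden factors plus small noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X_toy = latent @ mixing + rng.normal(scale=0.05, size=(300, 20))

X_scaled = RobustScaler().fit_transform(X_toy)

# Fit PCA with all components and accumulate the explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components reaching 95% of the variance
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative[:5].round(3), n_for_95)
```

scikit-learn can also make this choice itself: passing a float such as `PCA(n_components=0.95)` keeps the smallest number of components that explains at least that share of the variance.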

# **3. Model Selection & Training**

We selected a Random Forest Regressor as our baseline model for several reasons:

* It handles high-dimensional input and multicollinearity well, which is essential given the redundancy of the features.
* It is robust to outliers and does not require feature scaling (although we applied RobustScaler).
* It captures non-linear relationships between features and the target (ring_GT) without requiring prior transformation or feature engineering.

We expect Random Forest to perform reliably as a first model, providing a strong baseline against which other models, such as Linear Regression, can be compared.


To improve the performance of our Random Forest, we ran a grid search over the following hyperparameters:
* n_estimators: number of trees in the forest.
* max_depth: maximum depth of each tree.
* min_samples_split: minimum number of samples required to split an internal node.

This process identifies the best-performing combination of hyperparameters using 5-fold cross-validation, scored with negative RMSE (scikit-learn maximizes scores, so the RMSE is negated).

In summary, this optimization for our Random Forest helps reduce overfitting and selects a model that better generalizes to unseen data.
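
A possible refinement, sketched here under assumptions rather than taken from this notebook, is to wrap the scaler, PCA, and model in a scikit-learn `Pipeline`. The scaler and PCA are then refit inside every cross-validation fold, and the number of PCA components can be tuned alongside the forest. The data, step names, and the reduced grid below are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the real features and target
X_toy = rng.normal(size=(120, 10))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=120)

pipe = Pipeline([
    ("scaler", RobustScaler()),
    ("pca", PCA()),
    ("rf", RandomForestRegressor(random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "pca__n_components": [3, 5],
    "rf__n_estimators": [20, 50],
    "rf__max_depth": [None, 10],
}

search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"CV RMSE: {-search.best_score_:.3f}")
```

The grid here is deliberately small (8 combinations, 3 folds) so the sketch runs quickly; the full grid from this section could be substituted under the `rf__` prefix.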

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Hyperparameter grid
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize model
rf_model = RandomForestRegressor(random_state=42)

# GridSearchCV using RMSE as scoring
rf_grid_search = GridSearchCV(
    rf_model,
    rf_params,
    cv=5,
    scoring='neg_root_mean_squared_error'
)

# Fitting
rf_grid_search.fit(X_train_pca, y_train)

# **4. Evaluation**

In [None]:
# Evaluate best model
best_rf_model = rf_grid_search.best_estimator_
y_rf_pred = best_rf_model.predict(X_test_pca)

# Metrics
# mean_squared_error returns the MSE, so take the square root to obtain the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_rf_pred))
mae = mean_absolute_error(y_test, y_rf_pred)
r2 = r2_score(y_test, y_rf_pred)

print("Random Forest Regressor:")
print(f"Best Parameters: {rf_grid_search.best_params_}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R² Score: {r2:.4f}")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_rf_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Ideal')

# Labels and title
plt.xlabel("Actual ring_GT")
plt.ylabel("Predicted ring_GT")
plt.title("Predicted vs Actual ring_GT")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The Random Forest Regressor with RobustScaler and PCA performed well at predicting the ring_GT target variable, with an RMSE of 0.53, an MAE of 0.50, and an R² score of 0.69. These results indicate that the model captures a good portion of the variance in the target.

On the one hand, since the measurements are in millimeters, the RMSE tells us that the model's predictions typically deviate from the actual values by about 0.53 mm, while the MAE of 0.50 mm is the average absolute deviation.

On the other hand, the R² score of 0.69 suggests that there is still room for improvement: roughly 31% of the variance in the target remains unexplained.
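
To make the relationship between these error metrics concrete, here is a small worked example on made-up numbers. `mean_squared_error` returns the MSE, the RMSE is its square root, and the MAE can never exceed the RMSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions (errors: 0.5, 0.5, 0.5, 2.0)
y_true = np.array([10.0, 11.0, 12.0, 13.0])
y_pred = np.array([10.5, 10.5, 12.5, 15.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

The single large error (2.0) pulls the RMSE further above the MAE, which is why a gap between the two hints at a few large misses rather than uniformly sized errors.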

We could experiment with other models such as SVR, which can be resistant to overfitting when regularized appropriately and may capture non-linearities more effectively. Other ways to refine the model include testing more hyperparameter values, feature engineering, or trying a different number of PCA components.
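
Such a comparison could be set up as in the sketch below, which scores a few candidate models with the same cross-validation and metric. The synthetic data, the candidate settings, and the dictionary names are assumptions for illustration; no conclusion about the real dataset should be drawn from it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic data with one non-linear and one linear effect
X_toy = rng.normal(size=(150, 8))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=150)

candidates = {
    "linear": make_pipeline(RobustScaler(), LinearRegression()),
    "svr": make_pipeline(RobustScaler(), SVR(kernel="rbf", C=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean cross-validated RMSE per candidate (scores are negated RMSE)
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

for name, value in sorted(cv_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
```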