<a href="https://colab.research.google.com/github/ezzywd78/MS_AI_Engineer_Course/blob/main/Mercedes_Benz_Greener_Manufacturing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Load and preprocess data from "train.zip" and "test.zip" including unzipping, initial exploration, cleaning zero-variance columns, applying label encoding, and performing dimensionality reduction. Then, train an XGBoost regression model on the processed training data to predict the 'y' column, make predictions on the preprocessed test set, and summarize the entire process.

## Unzip and Load Data

### Subtask:
Unzip the provided 'train.zip' and 'test.zip' files and load the data into pandas DataFrames for training and testing.


**Reasoning**:
The first step is to import the necessary libraries, create directories for the extracted data, and then unzip the provided 'train.zip' and 'test.zip' files into their respective directories.



In [None]:
import zipfile
import os

# Create 'train_data' directory and extract train.zip
os.makedirs('train_data', exist_ok=True)
with zipfile.ZipFile('/content/train.zip', 'r') as zip_ref:
    zip_ref.extractall('train_data')
print("train.zip extracted to train_data/")

# Create 'test_data' directory and extract test.zip
os.makedirs('test_data', exist_ok=True)
with zipfile.ZipFile('/content/test.zip', 'r') as zip_ref:
    zip_ref.extractall('test_data')
print("test.zip extracted to test_data/")

train.zip extracted to train_data/
test.zip extracted to test_data/
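
Before loading, it can help to confirm which files each archive actually contains, since the next cell assumes the names `train.csv` and `test.csv`. The following is a minimal sketch of that check. It builds a throwaway archive in a temporary directory so it runs on its own; `demo.zip`, `demo.csv`, and `demo_data` are hypothetical names, not files from this notebook.

```python
import os
import tempfile
import zipfile

# Build a small throwaway archive so the sketch is self-contained.
# 'demo.zip' and 'demo.csv' are hypothetical names, not files from this notebook.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo.csv", "ID,y\n0,130.81\n")

# List the archive members first, then extract them.
extract_dir = os.path.join(workdir, "demo_data")
with zipfile.ZipFile(zip_path, "r") as zf:
    members = zf.namelist()
    zf.extractall(extract_dir)

# Confirm that what landed on disk matches what the archive listed.
extracted = sorted(os.listdir(extract_dir))
print(f"Archive members: {members}")
print(f"Extracted files: {extracted}")
```

Applied to the real archives, the same `namelist()` call would show whether the CSV files sit at the top level or inside a subfolder.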


**Reasoning**:
Now that the files are unzipped, the next step is to load the data from 'train.csv' and 'test.csv' into pandas DataFrames, and then display the first 5 rows of each to verify the loading process.



In [None]:
import pandas as pd

# Load the training data
train_df = pd.read_csv('train_data/train.csv')
print("Train DataFrame loaded. First 5 rows:")
print(train_df.head())

# Load the test data
test_df = pd.read_csv('test_data/test.csv')
print("\nTest DataFrame loaded. First 5 rows:")
print(test_df.head())

Train DataFrame loaded. First 5 rows:
   ID       y  X0 X1  X2 X3 X4 X5 X6 X8  ...  X375  X376  X377  X378  X379  \
0   0  130.81   k  v  at  a  d  u  j  o  ...     0     0     1     0     0   
1   6   88.53   k  t  av  e  d  y  l  o  ...     1     0     0     0     0   
2   7   76.26  az  w   n  c  d  x  j  x  ...     0     0     0     0     0   
3   9   80.62  az  t   n  f  d  x  l  e  ...     0     0     0     0     0   
4  13   78.02  az  v   n  f  d  h  d  n  ...     0     0     0     0     0   

   X380  X382  X383  X384  X385  
0     0     0     0     0     0  
1     0     0     0     0     0  
2     0     1     0     0     0  
3     0     0     0     0     0  
4     0     0     0     0     0  

[5 rows x 378 columns]

Test DataFrame loaded. First 5 rows:
   ID  X0 X1  X2 X3 X4 X5 X6 X8  X10  ...  X375  X376  X377  X378  X379  X380  \
0   1  az  v   n  f  d  t  a  w    0  ...     0     0     0     1     0     0   
1   2   t  b  ai  a  d  b  g  y    0  ...     0     0     1     0

## Initial Data Exploration and Cleaning

### Subtask:
Perform an initial inspection of both the training and test datasets. This includes checking for null values, examining unique values for categorical features, and understanding the data types and distributions.


**Reasoning**:
The first step in data exploration is to check for missing values, understand data types, and get basic descriptive statistics for numerical columns. This can be achieved for both train_df and test_df using `.isnull().sum()`, `.info()`, and `.describe()` methods.



In [None]:
print("--- Training Data Info ---")
train_df.info()
print("\n--- Missing Values in Training Data ---")
print(train_df.isnull().sum()[train_df.isnull().sum() > 0])
print("\n--- Descriptive Statistics for Training Data ---")
print(train_df.describe())

print("\n\n--- Test Data Info ---")
test_df.info()
print("\n--- Missing Values in Test Data ---")
print(test_df.isnull().sum()[test_df.isnull().sum() > 0])
print("\n--- Descriptive Statistics for Test Data ---")
print(test_df.describe())

--- Training Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB

--- Missing Values in Training Data ---
Series([], dtype: int64)

--- Descriptive Statistics for Training Data ---
                ID            y          X10     X11          X12  \
count  4209.000000  4209.000000  4209.000000  4209.0  4209.000000   
mean   4205.960798   100.669318     0.013305     0.0     0.075077   
std    2437.608688    12.679381     0.114590     0.0     0.263547   
min       0.000000    72.110000     0.000000     0.0     0.000000   
25%    2095.000000    90.820000     0.000000     0.0     0.000000   
50%    4220.000000    99.150000     0.000000     0.0     0.000000   
75%    6314.000000   109.010000     0.000000     0.0     0.000000   
max    8417.000000   265.320000     1.000000     0.0     1.000000   

               X13          X14          X15          X16      

**Reasoning**:
Now that we have reviewed the basic information and descriptive statistics, the next step is to examine the categorical features as specified in the instructions. This involves identifying object-type columns, counting their unique values, and showing value counts for a few to understand their distribution.



In [None]:
print("\n--- Categorical Features in Training Data ---")
train_object_cols = train_df.select_dtypes(include='object').columns
print(f"Found {len(train_object_cols)} object columns in train_df: {list(train_object_cols)}")
for col in train_object_cols:
    print(f"Column '{col}': {train_df[col].nunique()} unique values")
    if train_df[col].nunique() < 20: # Display value counts for columns with fewer unique values
        print(train_df[col].value_counts())
    print("\n")

print("\n--- Categorical Features in Test Data ---")
test_object_cols = test_df.select_dtypes(include='object').columns
print(f"Found {len(test_object_cols)} object columns in test_df: {list(test_object_cols)}")
for col in test_object_cols:
    print(f"Column '{col}': {test_df[col].nunique()} unique values")
    if test_df[col].nunique() < 20: # Display value counts for columns with fewer unique values
        print(test_df[col].value_counts())
    print("\n")


--- Categorical Features in Training Data ---
Found 8 object columns in train_df: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
Column 'X0': 47 unique values


Column 'X1': 27 unique values


Column 'X2': 44 unique values


Column 'X3': 7 unique values
X3
c    1942
f    1076
a     440
d     290
g     241
e     163
b      57
Name: count, dtype: int64


Column 'X4': 4 unique values
X4
d    4205
a       2
b       1
c       1
Name: count, dtype: int64


Column 'X5': 29 unique values


Column 'X6': 12 unique values
X6
g    1042
j    1039
d     625
i     488
l     478
a     206
h     190
k      43
c      38
b      28
f      20
e      12
Name: count, dtype: int64


Column 'X8': 25 unique values



--- Categorical Features in Test Data ---
Found 8 object columns in test_df: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
Column 'X0': 49 unique values


Column 'X1': 27 unique values


Column 'X2': 45 unique values


Column 'X3': 7 unique values
X3
c    1900
f    1083
a     476
d     274
g 
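
The unique-value counts above differ between the two sets (`X0` has 47 values in training and 49 in test), so some categories must occur in only one of them. A set difference lists them explicitly. This is a minimal sketch on toy frames; the values and the names `toy_train`, `toy_test`, and `category_gaps` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df; the values are hypothetical.
toy_train = pd.DataFrame({"X0": ["k", "az", "t", "k"], "X3": ["a", "c", "f", "c"]})
toy_test = pd.DataFrame({"X0": ["az", "t", "bb", "ae"], "X3": ["a", "c", "f", "a"]})

# For each categorical column, list the values seen in only one of the frames.
category_gaps = {}
for col in ["X0", "X3"]:
    train_values = set(toy_train[col].unique())
    test_values = set(toy_test[col].unique())
    category_gaps[col] = {
        "only_in_train": sorted(train_values - test_values),
        "only_in_test": sorted(test_values - train_values),
    }

for col, gaps in category_gaps.items():
    print(col, gaps)
```

Categories that appear only in the test set matter later, because an encoder fitted on training values alone would not know how to map them.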

## Remove Zero-Variance Columns

### Subtask:
Identify and remove columns from both the training and test datasets that have a variance equal to zero, as these columns do not contribute any information.


**Reasoning**:
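First, I will identify columns in the training data (excluding 'ID' and 'y') that have zero variance. A column has zero variance exactly when it holds a single distinct value, so the check counts unique values with `nunique()` and keeps the columns where that count is 1. Unlike a variance calculation, this also works for the text columns.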



In [None]:
zero_variance_cols_train = [col for col in train_df.columns if col not in ['ID', 'y'] and train_df[col].nunique() == 1]
print(f"Columns with zero variance in train_df (excluding 'ID' and 'y'): {len(zero_variance_cols_train)}")
print(zero_variance_cols_train)

Columns with zero variance in train_df (excluding 'ID' and 'y'): 12
['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']


**Reasoning**:
Next, I will identify columns in the test data (excluding 'ID') that have zero variance. This involves checking the number of unique values for each column and filtering those with only one unique value.



In [None]:
zero_variance_cols_test = [col for col in test_df.columns if col != 'ID' and test_df[col].nunique() == 1]
print(f"Columns with zero variance in test_df (excluding 'ID'): {len(zero_variance_cols_test)}")
print(zero_variance_cols_test)

Columns with zero variance in test_df (excluding 'ID'): 5
['X257', 'X258', 'X295', 'X296', 'X369']


**Reasoning**:
Now that zero-variance columns for both train and test datasets have been identified separately, I will combine them into a single set to ensure consistent removal across both dataframes.



In [None]:
all_zero_variance_cols = list(set(zero_variance_cols_train + zero_variance_cols_test))
print(f"Combined zero-variance columns: {len(all_zero_variance_cols)}")
print(all_zero_variance_cols)

Combined zero-variance columns: 17
['X369', 'X93', 'X296', 'X235', 'X268', 'X11', 'X295', 'X290', 'X293', 'X258', 'X330', 'X289', 'X347', 'X297', 'X257', 'X233', 'X107']
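
The detection and union steps above can be sketched on toy data as follows. The frames, column names, and the helper `constant_columns` are hypothetical. The sketch also wraps the union in `sorted(...)`, which is a small deviation from the cell above: a bare `set` has no fixed order, so sorting makes the printed list reproducible between runs.

```python
import pandas as pd

# Toy frames standing in for train_df / test_df; column names are hypothetical.
toy_train = pd.DataFrame({"A": [0, 0, 0], "B": [0, 1, 0], "C": ["d", "d", "d"], "D": [1, 0, 1]})
toy_test = pd.DataFrame({"A": [0, 1, 0], "B": [1, 1, 1], "C": ["d", "a", "d"], "D": [0, 0, 1]})


def constant_columns(df):
    """Return the columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique() == 1]


const_train = constant_columns(toy_train)
const_test = constant_columns(toy_test)

# Union of both lists, sorted so the order is the same on every run.
to_drop = sorted(set(const_train) | set(const_test))

reduced_train = toy_train.drop(columns=to_drop)
reduced_test = toy_test.drop(columns=to_drop)

print(f"Constant in train: {const_train}")
print(f"Constant in test: {const_test}")
print(f"Dropped from both: {to_drop}")
print(f"Remaining columns: {list(reduced_train.columns)}")
```

Column `C` is text, so `.var()` would not apply to it, while the `nunique()` check handles it like any other column.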


**Reasoning**:
Now that the combined set (the union) of zero-variance columns has been identified, I will remove these columns from both the training and test DataFrames to ensure consistency and eliminate non-informative features.



In [None]:
initial_train_cols = train_df.shape[1]
initial_test_cols = test_df.shape[1]

train_df = train_df.drop(columns=all_zero_variance_cols, errors='ignore')
test_df = test_df.drop(columns=all_zero_variance_cols, errors='ignore')

removed_train_cols = initial_train_cols - train_df.shape[1]
removed_test_cols = initial_test_cols - test_df.shape[1]

print(f"Number of columns removed from train_df: {removed_train_cols}")
print(f"New shape of train_df: {train_df.shape}")
print(f"Number of columns removed from test_df: {removed_test_cols}")
print(f"New shape of test_df: {test_df.shape}")

Number of columns removed from train_df: 17
New shape of train_df: (4209, 361)
Number of columns removed from test_df: 17
New shape of test_df: (4209, 360)


## Apply Label Encoding

### Subtask:
Apply Label Encoding to all categorical features in both the training and test datasets to convert them into numerical representations suitable for machine learning models.


**Reasoning**:
First, I will identify the categorical columns (object type) present in both the training and test datasets. It is crucial to find the intersection of these columns to ensure that encoding is applied consistently to features common in both datasets. I will list these columns to verify they have been correctly identified.



In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Identify categorical columns in train_df
train_categorical_cols = train_df.select_dtypes(include='object').columns.tolist()

# Identify categorical columns in test_df
test_categorical_cols = test_df.select_dtypes(include='object').columns.tolist()

# Get the common categorical columns present in both train and test datasets
common_categorical_cols = list(set(train_categorical_cols) & set(test_categorical_cols))

print(f"Categorical columns in train_df: {train_categorical_cols}")
print(f"Categorical columns in test_df: {test_categorical_cols}")
print(f"Common categorical columns for encoding: {common_categorical_cols}")

Categorical columns in train_df: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
Categorical columns in test_df: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
Common categorical columns for encoding: ['X0', 'X6', 'X1', 'X2', 'X4', 'X8', 'X5', 'X3']


**Reasoning**:
Now that the common categorical columns have been identified, I will iterate through each of them, create a comprehensive set of unique values from both the training and test datasets for each column, fit a LabelEncoder to these combined values, and then transform the respective columns in both DataFrames. This ensures consistent encoding across the datasets.



In [None]:
label_encoders = {}
for col in common_categorical_cols:
    le = LabelEncoder()
    # Combine unique values from train and test for comprehensive fitting
    combined_values = pd.concat([train_df[col], test_df[col]], axis=0).astype(str).unique()
    le.fit(combined_values)

    train_df[col] = le.transform(train_df[col].astype(str))
    test_df[col] = le.transform(test_df[col].astype(str))
    label_encoders[col] = le

print("Label Encoding applied to common categorical columns.")
print("First 5 rows of train_df after encoding:")
print(train_df.head())
print("\nFirst 5 rows of test_df after encoding:")
print(test_df.head())
print("\nData types of encoded columns in train_df:")
print(train_df[common_categorical_cols].dtypes)
print("\nData types of encoded columns in test_df:")
print(test_df[common_categorical_cols].dtypes)

Label Encoding applied to common categorical columns.
First 5 rows of train_df after encoding:
   ID       y  X0  X1  X2  X3  X4  X5  X6  X8  ...  X375  X376  X377  X378  \
0   0  130.81  37  23  20   0   3  27   9  14  ...     0     0     1     0   
1   6   88.53  37  21  22   4   3  31  11  14  ...     1     0     0     0   
2   7   76.26  24  24  38   2   3  30   9  23  ...     0     0     0     0   
3   9   80.62  24  21  38   5   3  30  11   4  ...     0     0     0     0   
4  13   78.02  24  23  38   5   3  14   3  13  ...     0     0     0     0   

   X379  X380  X382  X383  X384  X385  
0     0     0     0     0     0     0  
1     0     0     0     0     0     0  
2     0     0     1     0     0     0  
3     0     0     0     0     0     0  
4     0     0     0     0     0     0  

[5 rows x 361 columns]

First 5 rows of test_df after encoding:
   ID  X0  X1  X2  X3  X4  X5  X6  X8  X10  ...  X375  X376  X377  X378  X379  \
0   1  24  23  38   5   3  26   0  22    0  ...   
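
To illustrate why the encoder above is fitted on the combined train and test values, here is a minimal sketch on toy lists (the values and variable names are hypothetical). It shows that an encoder fitted on training values alone raises an error for a category it has never seen, while the combined fit covers both sets.

```python
from sklearn.preprocessing import LabelEncoder

# Toy category values; 'bb' never appears in toy_train.
toy_train = ["k", "az", "t"]
toy_test = ["az", "bb", "t"]

# Fitting on the training values alone fails on the unseen test category.
train_only = LabelEncoder().fit(toy_train)
try:
    train_only.transform(toy_test)
    unseen_error = None
except ValueError as exc:
    unseen_error = str(exc)

# Fitting on the combined values covers both sets.
combined = LabelEncoder().fit(toy_train + toy_test)
train_codes = combined.transform(toy_train)
test_codes = combined.transform(toy_test)

# Codes follow the sorted order of the classes and can be mapped back.
decoded_test = list(combined.inverse_transform(test_codes))
print(f"Error from train-only fit: {unseen_error}")
print(f"Classes: {list(combined.classes_)}")
print(f"Train codes: {train_codes}, test codes: {test_codes}")
print(f"Decoded test values: {decoded_test}")
```

The integer codes only reflect alphabetical order, not any real ranking of the categories. Tree-based models can usually work with that, but scaling and PCA treat the codes as ordinary numeric magnitudes.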

## Perform Dimensionality Reduction

### Subtask:
Apply a dimensionality reduction technique, such as Principal Component Analysis (PCA), to the processed training and test datasets to reduce the number of features while retaining most of the important information.


**Reasoning**:
The first step in applying PCA is to separate the target variable and identifier columns from the feature sets for both the training and test datasets. This ensures that PCA is applied only to the features, and that 'ID' and 'y' columns are handled appropriately.



In [None]:
y_train = train_df['y']
train_ids = train_df['ID']
test_ids = test_df['ID']

X_train = train_df.drop(columns=['ID', 'y'])
X_test = test_df.drop(columns=['ID'])

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of train_ids: {train_ids.shape}")
print(f"Shape of test_ids: {test_ids.shape}")

Shape of X_train: (4209, 359)
Shape of y_train: (4209,)
Shape of X_test: (4209, 359)
Shape of train_ids: (4209,)
Shape of test_ids: (4209,)


**Reasoning**:
Now that the data is separated, I will initialize a StandardScaler and fit it to the training features (`X_train`) to learn the scaling parameters. Then, I will transform both the training (`X_train`) and test (`X_test`) features using this fitted scaler to standardize them. PCA is sensitive to feature scale, so standardizing first keeps the wider-ranging label-encoded columns from dominating the components.



In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Shape of X_train_scaled: {X_train_scaled.shape}")
print(f"Shape of X_test_scaled: {X_test_scaled.shape}")

Shape of X_train_scaled: (4209, 359)
Shape of X_test_scaled: (4209, 359)


**Reasoning**:
Now that the data has been scaled, I will initialize a PCA model and fit it to the scaled training data (`X_train_scaled`). I will set `n_components=0.95` to retain 95% of the variance, and then transform both the scaled training and test datasets using this fitted PCA model.



In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"Original number of features: {X_train_scaled.shape[1]}")
print(f"Number of components after PCA (train): {X_train_pca.shape[1]}")
print(f"Number of components after PCA (test): {X_test_pca.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_.sum():.2f}")

Original number of features: 359
Number of components after PCA (train): 146
Number of components after PCA (test): 146
Explained variance ratio: 0.95
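
Passing a float to `n_components` tells scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The following sketch reproduces that choice by hand on synthetic data. The data, the three latent factors, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# Full PCA: cumulative share of variance explained by the first k components.
full_pca = PCA().fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest k whose cumulative share exceeds 0.95.
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# PCA(n_components=0.95) makes the same choice automatically.
auto_pca = PCA(n_components=0.95).fit(X_scaled)

print(f"Cumulative explained variance: {np.round(cumulative, 3)}")
print(f"Components chosen by hand: {k_manual}")
print(f"Components chosen by PCA(n_components=0.95): {auto_pca.n_components_}")
```

Printing the cumulative curve for the real data would show how quickly the 146 components accumulate variance, which helps judge whether 0.95 is a sensible threshold.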


**Reasoning**:
Now that the PCA transformation has been applied, I will convert the PCA-transformed arrays into pandas DataFrames, `X_train_pca_df` and `X_test_pca_df`, and ensure they retain an appropriate index. This will make them ready for subsequent machine learning model training.



In [None]:
X_train_pca_df = pd.DataFrame(X_train_pca, index=X_train.index)
X_test_pca_df = pd.DataFrame(X_test_pca, index=X_test.index)

print("PCA-transformed data converted to DataFrames.")
print(f"Shape of X_train_pca_df: {X_train_pca_df.shape}")
print(f"Shape of X_test_pca_df: {X_test_pca_df.shape}")

PCA-transformed data converted to DataFrames.
Shape of X_train_pca_df: (4209, 146)
Shape of X_test_pca_df: (4209, 146)


## Train XGBoost Model

### Subtask:
Train an XGBoost regression model using the preprocessed training data.


**Reasoning**:
I will import the XGBoost regressor, instantiate it with some initial parameters, and then fit the model using the PCA-transformed training features and the target variable.



In [None]:
from xgboost import XGBRegressor

# Instantiate XGBRegressor with some common parameters
xgb_model = XGBRegressor(
    n_estimators=1000, # Number of boosting rounds
    learning_rate=0.05, # Step size shrinkage to prevent overfitting
    max_depth=5, # Maximum depth of a tree
    subsample=0.8, # Subsample ratio of the training instance
    colsample_bytree=0.8, # Subsample ratio of columns when constructing each tree
    random_state=42, # Random seed for reproducibility
    n_jobs=-1 # Use all available CPU cores
)

# Fit the model to the PCA-transformed training data
xgb_model.fit(X_train_pca_df, y_train)

print("XGBoost Regressor model trained successfully.")

XGBoost Regressor model trained successfully.
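
The cell above fits on all training rows, so it gives no indication of how well the model generalizes. One way to get a first estimate is to hold out part of the training data, sketched below on synthetic data. The data, the smaller hyperparameter values, and the variable names are assumptions made for illustration. The import falls back to scikit-learn's `GradientBoostingRegressor` where `xgboost` is not installed, which is a stand-in, not the model used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Fallback so the sketch still runs where xgboost is not installed.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

rng = np.random.default_rng(42)
# Hypothetical stand-ins for X_train_pca_df and y_train.
X = rng.normal(size=(300, 5))
y = 100 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.0, size=300)

# Keep 20% of the rows aside; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = Regressor(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)
model.fit(X_fit, y_fit)

# Score the held-out rows.
val_pred = model.predict(X_val)
val_rmse = float(np.sqrt(mean_squared_error(y_val, val_pred)))
val_r2 = float(r2_score(y_val, val_pred))
print(f"Validation RMSE: {val_rmse:.3f}")
print(f"Validation R^2: {val_r2:.3f}")
```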


## Predict Test Set Values

### Subtask:
Use the trained XGBoost model to predict the test bench times for the preprocessed test dataset.


**Reasoning**:
I will use the trained XGBoost model to make predictions on the PCA-transformed test dataset (`X_test_pca_df`) and store the results in `y_pred_test`.



In [None]:
y_pred_test = xgb_model.predict(X_test_pca_df)
print("Predictions on test data generated successfully.")
print(f"Shape of y_pred_test: {y_pred_test.shape}")
print(f"First 5 predictions: {y_pred_test[:5]}")

Predictions on test data generated successfully.
Shape of y_pred_test: (4209,)
First 5 predictions: [103.721275 118.361115 101.11171   80.81917  106.33356 ]
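
The `test_ids` Series saved earlier is not used by the prediction cell. A natural use for it is to pair each prediction with its row ID, for example to export the results. This is a minimal sketch under that assumption; the toy IDs, the `submission` name, and the two-column `ID`/`y` layout are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_ids and y_pred_test from the cells above.
toy_test_ids = pd.Series([1, 2, 3, 4, 5], name="ID")
toy_predictions = np.array([103.72, 118.36, 101.11, 80.82, 106.33], dtype=np.float32)

# Pair each ID with its prediction; the lengths must match row for row.
assert len(toy_test_ids) == len(toy_predictions)
submission = pd.DataFrame({"ID": toy_test_ids.to_numpy(), "y": toy_predictions})

# to_csv without a path returns the CSV text; pass a file name to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
print(submission.head())
```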


## Final Task

### Subtask:
Summarize the entire process, including the steps taken for data preparation, model training, and the final predictions made for the test set. Discuss potential next steps or insights from the model's performance.


## Summary:

### Data Analysis Key Findings

*   **Data Loading and Initial Exploration**:
    *   Successfully loaded `train.csv` (4209 rows, 378 columns) and `test.csv` (4209 rows, 377 columns) after unzipping.
    *   Both datasets were found to have no missing values.
    *   Identified 8 categorical columns common to both datasets ('X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8').
*   **Data Cleaning (Zero-Variance Columns)**:
    *   17 distinct zero-variance columns were identified: 12 in the training set and 5 in the test set, with no overlap between the two.
    *   Removing these 17 columns reduced `train_df` to shape (4209, 361) and `test_df` to shape (4209, 360).
*   **Feature Engineering (Label Encoding)**:
    *   Label Encoding was applied to the 8 common categorical columns in both `train_df` and `test_df`, converting them to numerical representations.
*   **Dimensionality Reduction (PCA)**:
    *   Features were standardized using `StandardScaler`.
    *   Principal Component Analysis (PCA) was applied to the standardized data, reducing the 359 features (8 label-encoded categorical columns plus 351 integer columns) to 146 components while retaining 95% of the variance.
    *   The PCA-transformed training and test feature sets (`X_train_pca_df`, `X_test_pca_df`) both have a shape of (4209, 146).
*   **Model Training**:
    *   An XGBoost Regressor model was trained using the PCA-transformed training features (`X_train_pca_df`) and the target variable (`y_train`).
    *   Key hyperparameters included `n_estimators=1000`, `learning_rate=0.05`, and `max_depth=5`.
*   **Prediction**:
    *   The trained XGBoost model successfully generated predictions (`y_pred_test`) for the preprocessed test set (`X_test_pca_df`).
    *   The predictions consist of 4209 values, with the first five predictions being approximately \[103.72, 118.36, 101.11, 80.82, 106.33].

### Insights or Next Steps

*   Label Encoding converted the 8 categorical features to integers, and PCA then compressed all 359 standardized features (label-encoded and integer alike) into 146 components that retain 95% of the variance. The smaller feature space can shorten training and may help with the curse of dimensionality, but no validation score has been computed yet, so whether the reduction helps or hurts accuracy is still open: retained variance does not guarantee retained predictive signal.
*   The next crucial steps involve evaluating the model's performance on a validation set (or using cross-validation) to assess its generalization capability and fine-tuning the XGBoost hyperparameters to optimize prediction accuracy.
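
The cross-validation step mentioned above can be sketched as follows. Wrapping the scaler, PCA, and regressor in a `Pipeline` means the scaler and PCA are refitted on the training part of each fold, so the validation rows never influence the preprocessing. The synthetic data, the fold count, and the hyperparameter values are assumptions. To keep the sketch dependent on scikit-learn only, it uses `GradientBoostingRegressor` as a stand-in; `XGBRegressor` follows the same estimator interface and could take its place.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the encoded training features and the target.
X = rng.normal(size=(300, 8))
y = 100 + 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=300)

# Preprocessing and model in one estimator, refitted from scratch on every fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("model", GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.05, max_depth=3, random_state=42)),
])

# Five shuffled folds, scored with R^2.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"Mean R^2: {scores.mean():.3f} (std {scores.std():.3f})")
```

Running the same loop with the `pca` step removed would give a direct comparison of the model with and without dimensionality reduction.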
