# **Data Segregation and Preprocessing for CO2 Emissions Modeling**

## **1. Introduction**

This notebook is the second step in our project to model CO2 emissions. Building on the exploratory data analysis (EDA), it prepares the data for the modeling phase. The key steps are:
* Loading the cleaned dataset, which is versioned as a Weights & Biases (Wandb) artifact.
* Scaling the numerical features to a common range.
* Splitting the dataset into training and testing sets.
* Versioning the final, processed datasets back to Wandb for use in model training.

## **2. Library Imports**

We begin by importing the necessary Python libraries.

In [None]:
import wandb
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import warnings

# Suppress warnings for a cleaner output
warnings.filterwarnings("ignore")

# To run this notebook, you need a Wandb account and an API key.
# Create a file named my_key.py containing the line: WANDB_KEY = 'your_api_key_here'
# so that the import below succeeds. Keep that file out of version control.
from my_key import WANDB_KEY

* **wandb**: For interacting with the Weights & Biases platform, managing experiments, and handling data artifacts.
* **os**: To interact with the operating system, here for building file paths and creating the output directory.
* **pandas**: For data manipulation using its powerful DataFrame structures.
* **sklearn.model_selection.train_test_split**: A function to split arrays or matrices into random train and test subsets.
* **sklearn.preprocessing.MinMaxScaler**: A tool to scale features to a given range, typically [0, 1].
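As an alternative to importing the key from `my_key.py`, the key can be read from the environment. The sketch below is an assumption about how one might arrange this, not part of the original notebook: `WANDB_API_KEY` is the environment variable the wandb client itself reads, while the helper name `resolve_wandb_key` is hypothetical.

```python
import os


def resolve_wandb_key(env=os.environ):
    """Return the API key from WANDB_API_KEY, falling back to my_key.py.

    Returns None when neither source is available.
    """
    key = env.get("WANDB_API_KEY")
    if key:
        return key
    try:
        # Local, untracked file as described in the import cell.
        from my_key import WANDB_KEY
        return WANDB_KEY
    except ImportError:
        return None
```

The returned value could then be passed to `wandb.login(key=...)` in place of the imported constant.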

## **3. Initialization of Weights & Biases (Wandb)**

A new Wandb run is initiated to log this data segregation and preprocessing job. This ensures that every step of our machine learning pipeline is tracked and reproducible.

In [2]:
# Log in to Wandb using the API key imported from my_key.py.
wandb.login(key=WANDB_KEY)

# Initialize a new Wandb run. We define a specific job_type for clarity.
run = wandb.init(project="SBAI 2025", job_type="data-segregation", save_code=True)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: C:\Users\thomm\_netrc
[34m[1mwandb[0m: Currently logged in as: [33mthommasflores[0m ([33mthommasflores-ufrn[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## **4. Loading the Cleaned Dataset**

Instead of reading from a local CSV file, we retrieve the cleaned dataset directly from Wandb. We use the `use_artifact` method to pull the latest version of our `clean_dataset`, ensuring that we are working with the correct data from the previous EDA step.

In [3]:
# Use the 'latest' alias to get the most recent version of the artifact.
artifact = run.use_artifact(artifact_or_name="clean_dataset:latest")

# Download the artifact's contents to a local directory.
# Wandb manages the path and returns it.
path = artifact.download()

# Construct the full path to the CSV file within the downloaded directory.
csv_file_path = os.path.join(path, 'emission_clean.csv')

# Load the data into a pandas DataFrame.
df = pd.read_csv(csv_file_path)
print("Cleaned dataset loaded successfully from Wandb artifact.")
display(df.head())

[34m[1mwandb[0m:   1 of 1 files downloaded.  


Cleaned dataset loaded successfully from Wandb artifact.


| | CO2 (g/s) [estimated maf] | intake_pressure | intake_temperature | rpm | speed |
|---|---|---|---|---|---|
| 0 | 0.809921 | 26.0 | 54.0 | 1568.0 | 43.0 |
| 1 | 1.796942 | 57.0 | 53.0 | 1582.0 | 43.0 |
| 2 | 2.199995 | 69.0 | 53.0 | 1600.0 | 43.0 |
| 3 | 1.226761 | 38.0 | 54.0 | 1625.0 | 44.0 |
| 4 | 0.756202 | 24.0 | 54.0 | 1586.0 | 45.0 |
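Before scaling, it can help to confirm that the downloaded file has the columns the rest of the notebook relies on. The following is a minimal sketch using only the standard library; the expected column list is taken from the preview above, and the helper name `missing_columns` is hypothetical.

```python
import csv

# Column names as they appear in the preview of the cleaned dataset.
EXPECTED_COLUMNS = [
    "CO2 (g/s) [estimated maf]",
    "intake_pressure",
    "intake_temperature",
    "rpm",
    "speed",
]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Hypothetical usage with the path built in the loading cell:
# problems = missing_columns(csv_file_path)
# if problems:
#     raise ValueError(f"Missing columns: {problems}")
```

An empty list means every expected column is present; anything else is worth investigating before the scaler is applied.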


## **5. Feature Scaling**

Many machine learning algorithms, particularly distance-based and gradient-based ones, perform better when numerical inputs share a common range. Without scaling, columns with large magnitudes (such as `rpm`, which is in the thousands in the preview above) can dominate columns with small magnitudes. Here, we use `MinMaxScaler` to map every column to the [0, 1] range.
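With its default settings, `MinMaxScaler` applies x' = (x - min) / (max - min) to each column independently. The sketch below reproduces that arithmetic in plain Python and also shows the inverse mapping back to original units. The function names are hypothetical, and the sample values are only the five `intake_pressure` rows from the preview, so their minimum and maximum are not those of the full dataset.

```python
def min_max_scale(values):
    """Map values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def min_max_inverse(scaled, lo, hi):
    """Undo the scaling given the original minimum and maximum."""
    return [s * (hi - lo) + lo for s in scaled]


# intake_pressure values from the five preview rows only.
pressure_preview = [26.0, 57.0, 69.0, 38.0, 24.0]
scaled_preview = min_max_scale(pressure_preview)
restored = min_max_inverse(
    scaled_preview, min(pressure_preview), max(pressure_preview)
)
```

The inverse step matters later: because the target column is scaled along with the inputs, model predictions would need to be mapped back to g/s before they can be interpreted.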

In [4]:
# Initialize the scaler.
scaler = MinMaxScaler()

# Apply the scaler to the whole DataFrame, i.e. all five columns, including the CO2 target.
# fit_transform computes each column's scaling parameters (min, max) and applies the transformation.
df_scaled_values = scaler.fit_transform(df)

# The output of the scaler is a NumPy array. We convert it back to a pandas DataFrame,
# preserving the original column names.
df_scaled = pd.DataFrame(df_scaled_values, columns=df.columns.tolist())

print("Data scaled successfully. Displaying descriptive statistics of the scaled data:")
display(df_scaled.describe())

Data scaled successfully. Displaying descriptive statistics of the scaled data:


| | CO2 (g/s) [estimated maf] | intake_pressure | intake_temperature | rpm | speed |
|---|---|---|---|---|---|
| count | 10230.0 | 10230.0 | 10230.0 | 10230.0 | 10230.0 |
| mean | 0.164184 | 0.408729 | 0.449839 | 0.261053 | 0.379858 |
| std | 0.166003 | 0.252708 | 0.224329 | 0.170076 | 0.284558 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.046714 | 0.197531 | 0.285714 | 0.118919 | 0.126582 |
| 50% | 0.074022 | 0.382716 | 0.464286 | 0.267027 | 0.379747 |
| 75% | 0.26337 | 0.54321 | 0.571429 | 0.385495 | 0.582278 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


As shown by the descriptive statistics, all five columns, including the target `CO2 (g/s) [estimated maf]`, now have a minimum value of 0 and a maximum value of 1.
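Note that the scaling cell fits `MinMaxScaler` on the full dataset, so the minimum and maximum of rows that later land in the test set influence the transformation. A common alternative, sketched here on synthetic data as an assumption rather than as this notebook's method, is to split first, fit the scaler on the training rows only, and reuse the fitted scaler on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data: 50 rows, 3 columns.
rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=(50, 3))

# Split first, so the test rows play no part in fitting the scaler.
train, test = train_test_split(data, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max from training rows only
test_scaled = scaler.transform(test)        # reuses the training min/max
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected: the test set is treated as genuinely unseen data.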

## **6. Data Splitting (Train-Test Split)**

To evaluate our model's performance on unseen data, we split the dataset into two parts: a training set and a testing set.
* **Training Set (80%)**: Used to train the machine learning model.
* **Testing Set (20%)**: Used to evaluate the final performance of the trained model.

We set a `random_state` to ensure that the split is reproducible every time the code is run.

In [5]:
# Split the scaled DataFrame into training and testing sets.
# test_size=0.2 means 20% of the data will be used for testing.
train_df, test_df = train_test_split(df_scaled, test_size=0.2, random_state=42)

print("Data split into training and testing sets.")
print(f"Training set shape: {train_df.shape}")
print(f"Testing set shape: {test_df.shape}")

Data split into training and testing sets.
Training set shape: (8184, 5)
Testing set shape: (2046, 5)
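The reproducibility that `random_state` provides can be checked directly. The sketch below uses a short list of integers as a stand-in for row indices (an assumption for illustration) and calls `train_test_split` twice with the same seed.

```python
from sklearn.model_selection import train_test_split

# Stand-in for row indices; the real notebook splits df_scaled instead.
rows = list(range(20))

first_train, first_test = train_test_split(rows, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(rows, test_size=0.2, random_state=42)

# Same seed, same input: both calls select identical rows.
print(first_test == second_test)
print(len(first_train), len(first_test))
```

Omitting `random_state`, or changing its value, would generally yield a different partition on each run, which makes downstream metrics harder to compare.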


## **7. Versioning the Processed Datasets**

Finally, we save the training and testing sets as new artifacts in Wandb. This is a crucial step for maintaining a clear data lineage. The next step in our pipeline (model training) can now pull these specific artifacts, ensuring a seamless and organized workflow.

In [6]:
# Create a local directory to store the split data files.
os.makedirs("split_data", exist_ok=True)
train_path = "split_data/train.csv"
test_path = "split_data/test.csv"

# Save the DataFrames to local CSV files.
train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)
print("Train and test CSV files saved locally.")

# Create a new Wandb artifact for the training dataset.
train_artifact = wandb.Artifact("train_dataset", type="dataset", description="Training data for the CO2 emission model.")
train_artifact.add_file(train_path)
wandb.log_artifact(train_artifact)
print("Training data artifact created and logged.")

# Create a new Wandb artifact for the testing dataset.
test_artifact = wandb.Artifact("test_dataset", type="dataset", description="Testing data for the CO2 emission model.")
test_artifact.add_file(test_path)
wandb.log_artifact(test_artifact)
print("Testing data artifact created and logged.")

# Finish the Wandb run to save all logs and artifacts.
wandb.finish()

Train and test CSV files saved locally.
Training data artifact created and logged.
Testing data artifact created and logged.


## **8. Conclusion**

This notebook has successfully prepared the data for model training. We have loaded the cleaned data, scaled its features, split it into training and testing sets, and versioned these final datasets as artifacts in Weights & Biases. The next logical step is to train a predictive model using the `train_dataset` artifact.
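As a sketch of how the training step might consume these artifacts, the helper below wraps the same `use_artifact` and `download` calls used earlier in this notebook. The helper name `artifact_file_path` is hypothetical, and the file name `train.csv` assumes the artifact stores the file under its base name, as logged above.

```python
import os


def artifact_file_path(run, artifact_name, filename):
    """Download an artifact through an active run and return the path to one file."""
    artifact = run.use_artifact(artifact_name)
    directory = artifact.download()
    return os.path.join(directory, filename)


# Hypothetical usage inside the training notebook, after wandb.init(...):
# train_csv = artifact_file_path(run, "train_dataset:latest", "train.csv")
# train_df = pd.read_csv(train_csv)
```

Passing the run in as an argument keeps the helper independent of any global state, so the same function works for both `train_dataset` and `test_dataset`.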