# Feature Engineering: Creating Derived Variables

To improve the predictive power of our model, we generate new features from the raw dataset.  
These engineered variables capture useful patterns that may not be obvious in the original columns.

In this step, we:

- **Load the modified dataset** from the local `products_modified.csv` file using `pandas.read_csv`

- **Transform product views**:
  - `views_log = np.log1p(df["number_of_views"])`
  - Applies a log transformation (`log(1 + x)`) to the number of views
  - This reduces skewness and makes highly variable counts easier for the model to handle

- **Extract temporal features from listing date**:
  - `year = pd.to_datetime(df["listing_date"]).dt.year`
  - `month = pd.to_datetime(df["listing_date"]).dt.month`
  - Converts the listing date into numeric year and month values
  - Helps the model capture seasonal or time-related trends in product listings

By adding these engineered features (`views_log`, `year`, `month`), we enrich the dataset with more informative signals that can improve category prediction accuracy.
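
To make the effect of these transformations concrete, here is a minimal sketch on a small toy frame. The data values are hypothetical and chosen only for illustration; the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "number_of_views": [0, 9, 99, 9999],
    "listing_date": ["2023-01-15", "2023-06-30", "2024-02-01", "2024-12-25"],
})

# log1p(x) = log(1 + x), so a count of 0 maps to 0 instead of -inf.
toy["views_log"] = np.log1p(toy["number_of_views"])

# Parse the dates once, then read the components through the .dt accessor.
dates = pd.to_datetime(toy["listing_date"])
toy["year"] = dates.dt.year
toy["month"] = dates.dt.month

print(toy)
```

In this toy example the raw counts span 0 to 9999, while `views_log` spans 0 to roughly 9.2, which is the compression the log transform is meant to provide.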

In [35]:
import numpy as np
import pandas as pd

df = pd.read_csv("../data/products_modified.csv")

df["views_log"] = np.log1p(df["number_of_views"])
df["year"] = pd.to_datetime(df["listing_date"]).dt.year
df["month"] = pd.to_datetime(df["listing_date"]).dt.month


# Building the Machine Learning Pipeline

Once the dataset is cleaned and features are engineered, we need to define how the data will be transformed and passed into a classifier.  
This is done using a **scikit-learn Pipeline**, which chains together preprocessing steps and the final model.

In this step, we:

- **Define a text vectorizer**:
  - `TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=100000)`
  - Converts product titles into numerical features using TF-IDF
  - Captures both single words (unigrams) and pairs of words (bigrams)
  - Ignores terms that appear in fewer than two titles (`min_df=2`) and caps the vocabulary at 100,000 terms (`max_features=100000`) for efficiency

- **Create a ColumnTransformer**:
  - `"text"` applies the TF-IDF vectorizer to the `product_title` column
  - `"num"` applies `StandardScaler` to numeric features (`views_log`, `merchant_rating`, `year`, `month`)
  - Each transformer sees only its own columns, and the outputs are concatenated into a single feature matrix

- **Build the Pipeline**:
  - `"prep"` step runs the ColumnTransformer (text + numeric preprocessing)
  - `"clf"` step trains a `LogisticRegression` classifier
    - `max_iter=2000` allows more iterations for convergence
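    - `n_jobs=4` requests up to four CPU cores; scikit-learn documents this as parallelizing over classes in one-vs-rest mode, so it may have little effect with the multinomial setting
    - `multi_class="multinomial"` fits a single softmax model over all categories instead of one binary model per category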
    - `class_weight="balanced"` adjusts for imbalanced class distributions

This pipeline transforms raw product titles and numeric features into a single feature matrix and passes it to a logistic regression model for category prediction.
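
As a rough illustration of how a `ColumnTransformer` combines text and numeric columns, the sketch below fits one on three hypothetical rows. The toy titles and values are assumptions, and only two numeric columns are used to keep it short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data reusing some column names from this notebook.
toy = pd.DataFrame({
    "product_title": ["intel core cpu", "samsung smart tv", "bosch dishwasher"],
    "views_log": [1.2, 3.4, 2.2],
    "merchant_rating": [4.5, 3.0, 5.0],
})

toy_prep = ColumnTransformer([
    # A single column name (string) hands the vectorizer a 1-D series of text.
    ("text", TfidfVectorizer(), "product_title"),
    # A list of names hands the scaler a 2-D block of numeric columns.
    ("num", StandardScaler(), ["views_log", "merchant_rating"]),
])

features = toy_prep.fit_transform(toy)
n_terms = len(toy_prep.named_transformers_["text"].vocabulary_)

# The output has one column per TF-IDF term plus one per numeric feature.
print(features.shape, n_terms)
```

Note that the text column is passed as a plain string while the numeric columns are passed as a list. This distinction matters because `TfidfVectorizer` expects a one-dimensional sequence of documents, whereas `StandardScaler` expects a two-dimensional array.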

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

text_vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=100000)

preprocessor = ColumnTransformer(
    transformers=[
        ("text", text_vectorizer, "product_title"),
        ("num", StandardScaler(), ["views_log", "merchant_rating", "year", "month"])
    ]
)

pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=2000, n_jobs=4, multi_class="multinomial", class_weight="balanced"))
])

# Training and Evaluating Multiple Models

To identify the best algorithm for product category prediction, we train and compare several classifiers using the same preprocessing pipeline.

In this step, we:

- **Check for missing values** with `df.isna().sum()` and drop rows where `product_title` is missing.  
  This ensures that all training samples have valid text input.

- **Define features and labels**:
  - `X` includes both text (`product_title`) and numeric features (`views_log`, `merchant_rating`, `year`, `month`)
  - `y` is the target column `category_label`

- **Split the dataset** into training and validation sets using `train_test_split`:
  - 80% of the data is used for training
  - 20% is reserved for validation
  - `stratify=y` ensures class proportions are preserved
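  - `random_state=42` makes the split itself reproducible (models with internal randomness, such as Random Forest, need their own `random_state` for fully repeatable results)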

- **Set up a dictionary of models to test**:
  - Logistic Regression
  - Naive Bayes
  - Decision Tree
  - Random Forest
  - Linear SVM

- **Build pipelines for each model**:
  - For **Naive Bayes**, only text features are used (TF-IDF vectorization), since `MultinomialNB` requires non-negative inputs and `StandardScaler` produces negative values
  - For the other models, both text (TF-IDF) and numeric features (scaled with `StandardScaler`) are included via `ColumnTransformer`

- **Train and evaluate each model**:
  - `pipeline.fit(X_train, y_train)` trains the model
  - `pipeline.predict(X_val)` generates predictions on the validation set
  - `accuracy_score` reports the overall accuracy
  - `classification_report` provides precision, recall, and F1-score for each category

This loop lets us compare multiple algorithms on the same train/validation split and with the same preprocessing (Naive Bayes being the text-only exception), helping us identify which classifier performs best for product category prediction.
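
The reason Naive Bayes gets a text-only pipeline can be checked directly. The following is a minimal sketch on hypothetical toy data, showing that standardized features contain negative values and that `MultinomialNB` refuses to fit on them.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature and two classes.
X_raw = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])

# Standardizing centers the feature on zero, so some values become negative.
X_scaled = StandardScaler().fit_transform(X_raw)
has_negative = bool((X_scaled < 0).any())

# MultinomialNB accepts the non-negative raw values...
MultinomialNB().fit(X_raw, y_toy)

# ...but raises a ValueError on the standardized ones.
try:
    MultinomialNB().fit(X_scaled, y_toy)
    nb_error = None
except ValueError as exc:
    nb_error = str(exc)

print(has_negative, nb_error)
```

TF-IDF weights are never negative, which is why the text branch alone is compatible with `MultinomialNB`.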

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

print(df.isna().sum())
df = df.dropna(subset=["product_title"])

# Features and label
X = df[["product_title", "views_log", "merchant_rating", "year", "month"]]
y = df["category_label"]

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Models to test
models = {
    "Logistic Regression": LogisticRegression(max_iter=2000, class_weight="balanced", multi_class="multinomial"),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Linear SVM": LinearSVC()
}

# Loop through models
for name, model in models.items():
    print(f"\n {name}")
    
    if name == "Naive Bayes":
        pipeline = Pipeline([
            ("preprocessing", ColumnTransformer([
                ("text", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=100000), "product_title")
            ])),
            ("classifier", model)
        ])
    else:
        pipeline = Pipeline([
            ("preprocessing", ColumnTransformer([
                ("text", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_features=100000), "product_title"),
                ("num", StandardScaler(), ["views_log", "merchant_rating", "year", "month"])
            ])),
            ("classifier", model)
        ])
    
    # Training and assessment
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)
    
    print("Accuracy:", accuracy_score(y_val, y_pred))
    print(classification_report(y_val, y_pred, zero_division=0))

Unnamed: 0         0
product_id         0
product_title      0
merchant_id        0
category_label     0
product_code       0
number_of_views    0
merchant_rating    0
listing_date       0
views_log          0
year               0
month              0
dtype: int64

Logistic Regression

Accuracy: 0.9247698504027618
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        16
            CPUs       0.98      0.94      0.96       742
 Digital Cameras       1.00      0.99      0.99       532
     Dishwashers       0.91      0.96      0.93       675
        Freezers       0.87      0.95      0.91       436
 Fridge Freezers       0.96      0.89      0.93      1085
         Fridges       0.85      0.80      0.83       681
      Microwaves       0.98      0.97      0.97       461
    Mobile Phone       0.03      0.09      0.04        11
   Mobile Phones       0.93      0.95      0.94       794
             TVs       0.99      0.97      0.98       701
Washing Machines       0.98      0.94      0.96       794
          fridge       0.04      0.12      0.06        24

        accuracy                           0.92      6952
       macro avg       0.73      0.74      0.73      6952
    weighted avg       0.94      0.92    



Accuracy: 0.9567031070195627
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        16
            CPUs       0.98      1.00      0.99       742
 Digital Cameras       1.00      0.99      1.00       532
     Dishwashers       0.93      0.96      0.94       675
        Freezers       0.97      0.95      0.96       436
 Fridge Freezers       0.93      0.95      0.94      1085
         Fridges       0.91      0.90      0.91       681
      Microwaves       0.98      0.97      0.97       461
    Mobile Phone       0.00      0.00      0.00        11
   Mobile Phones       0.96      0.99      0.97       794
             TVs       0.99      0.98      0.99       701
Washing Machines       0.97      0.95      0.96       794
          fridge       0.00      0.00      0.00        24

        accuracy                           0.96      6952
       macro avg       0.74      0.74      0.74      6952
    weighted avg       0.95      0.96    