# About the dataset
## Alibaba E-commerce User Behavior Dataset

Each row in the dataset represents a **page of 50 items** returned to a user during a session (up to 12 pages per session). The dataset contains **15 columns**, explained below:

### 🧾 Column Descriptions

| Column | Name               | Description |
|--------|--------------------|-------------|
| 1–3    | `user_profile`     | Tuple of `[age-level, gender, purchase power]` used to represent an anonymous user. |
| 4      | `page_id`          | ID of the returned page (0–11). Each page contains 50 items. |
| 5      | `hour`             | Hour of the request (0–23), e.g., 14 = 2 PM. |
| 6      | `item_positions`   | Positions of each item in the overall session (range: 0–600). |
| 7–9    | `item_predictions` | Predicted values for each item: `[CTR, CVR, price]`. |
| 10–12  | `user_actions`     | User interactions with each item: `[isClick, isCart, isFav]` (1 or 0 for each). |
| 13     | `purchase_amounts` | Amount (in Yuan) spent per item (0.0 = no purchase). |
| 14     | `item_feature`     | A powerful numerical feature for each item (used in state representation). |
| 15     | `is_terminal`      | Flag indicating if this is the last page viewed by the user (1 = yes, 0 = no). |
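
As a rough illustration of this layout, the sketch below splits one raw row into the column groups from the table. It assumes that fields are separated by semicolons and per-item lists by commas, which is the format the parsing code later in this notebook expects. The `parse_row` helper, the treatment of `item_feature` as a list of floats, and the shortened three-item sample row are hypothetical (real rows carry 50 items).

```python
def parse_row(row_string):
    """Split one raw row into the column groups from the table (hypothetical helper)."""
    fields = row_string.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")

    def as_list(field, cast):
        # Per-item values are assumed to be comma-separated
        return [cast(x) for x in field.split(",")]

    return {
        "user_profile": [int(f) for f in fields[0:3]],      # columns 1-3
        "page_id": int(fields[3]),                          # column 4
        "hour": int(fields[4]),                             # column 5
        "item_positions": as_list(fields[5], int),          # column 6
        "item_predictions": [as_list(f, float) for f in fields[6:9]],  # columns 7-9
        "user_actions": [as_list(f, int) for f in fields[9:12]],       # columns 10-12
        "purchase_amounts": as_list(fields[12], float),     # column 13
        "item_feature": as_list(fields[13], float),         # column 14 (assumed float list)
        "is_terminal": int(fields[14]),                     # column 15
    }


# Made-up row with 3 items instead of 50, for readability
sample = (
    "3;1;5;0;14;0,1,2;0.1,0.2,0.3;0.01,0.02,0.03;9.9,19.9,5.0;"
    "1,0,0;0,0,0;0,0,1;0.0,0.0,0.0;0.5,0.6,0.7;0"
)
page = parse_row(sample)
print(page["user_profile"], page["hour"], page["user_actions"][0])
```

Keeping the three prediction lists and the three action lists grouped mirrors the table, so `page["user_actions"][0]` is the click vector and `page["user_actions"][2]` the favorites vector.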


## Feature Engineering


To prepare the data for the baseline models (LightGBM in this case), which require fixed-size input vectors rather than sequences, we engineered features by aggregating information across the 50 items in each row (one page of a session). The goal of these features is to summarize key aspects of user behavior and session value:

* **Interaction Totals:**
    * **Features Added:** Calculated the total count for each key interaction type: `Clicks`, `AddToCarts`, `Wishlists`, and `Purchases`.
    * **Rationale:** These counts provide a direct measure of the user's overall engagement level and intent signals within the session. For instance, a high number of `AddToCarts` might indicate strong purchase intent, potentially correlating with the target variables (`age-level`, `gender`, `purchase power`).

* **Interaction Position Statistics:**
    * **Features Added:** Computed the `Average`, `Minimum`, and `Maximum` position index for items that the user interacted with (e.g., clicked, purchased).
    * **Rationale:** This captures *how* the user interacts with the ranked list of items. Behavior might differ significantly if users primarily interact with items at the top (low position index) versus items further down (high position index), reflecting aspects like attention span, decisiveness, or search effort.

* **Purchase Value Aggregation:**
    * **Features Added:** Calculated the `Sum` and `Average` of `Purchase Amounts` for the session.
    * **Rationale:** These metrics directly represent the monetary value generated during the session and provide insight into the user's spending behavior in that context. They are particularly relevant for predicting `purchase power` and may also correlate with other demographic segments.
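
The three feature groups above can be illustrated on a toy page of five items. This is a minimal sketch, not the notebook's full function: `summarize_page` is a hypothetical helper, the input values are made up, and the `-1` placeholder for "no interaction" mirrors the convention used in the parsing code of this notebook.

```python
from statistics import mean


def summarize_page(positions, clicks, purchase_amounts):
    """Aggregate per-item lists into fixed-size features (hypothetical helper)."""
    clicked = [p for p, c in zip(positions, clicks) if c == 1]
    bought = [a for a in purchase_amounts if a > 0]
    return {
        # Interaction totals
        "total_clicks": sum(clicks),
        # Interaction position statistics (-1 marks "no clicks on this page")
        "avg_pos_clicked": mean(clicked) if clicked else -1.0,
        "min_pos_clicked": min(clicked) if clicked else -1,
        "max_pos_clicked": max(clicked) if clicked else -1,
        # Purchase value aggregation
        "total_items_purchased": len(bought),
        "total_purchase_amount": sum(bought),
        "avg_purchase_amount": mean(bought) if bought else 0.0,
    }


summary = summarize_page(
    positions=[50, 51, 52, 53, 54],
    clicks=[1, 0, 0, 1, 0],
    purchase_amounts=[0.0, 0.0, 0.0, 25.0, 0.0],
)
print(summary)
```

Here the two clicks at positions 50 and 53 give an average clicked position of 51.5, and the single purchase makes the total and the average purchase amount both 25.0.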

In [109]:
import numpy as np
import pandas as pd 
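def engineer_features_from_row(row_string):
    """Parse one semicolon-delimited row (one page of items) into a flat feature dict.

    Returns None if the row is malformed.
    """
    try:
        fields = row_string.strip().split(';')
        # A valid row has exactly 15 semicolon-separated fields
        if len(fields) != 15:
            return None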

        #  Parse Original Features 
        age_level = int(fields[0])
        gender = int(fields[1])
        purchasing_power = int(fields[2])
        page_id = int(fields[3]) 
        hour = int(fields[4])
        terminal_flag = int(fields[14]) 

        # Parse comma-separated lists
        item_positions_str = fields[5].split(',')
        clicks_str = fields[9].split(',')
        add_to_carts_str = fields[10].split(',')
        wishlists_str = fields[11].split(',')
        purchase_amounts_str = fields[12].split(',')

       
        # Reject the row if any list is missing (empty tokens fail later in int()/float())
        if not all([item_positions_str, clicks_str, add_to_carts_str, wishlists_str, purchase_amounts_str]):
            return None

        item_positions = [int(x) for x in item_positions_str]
        clicks = [int(x) for x in clicks_str]
        add_to_carts = [int(x) for x in add_to_carts_str]
        wishlists = [int(x) for x in wishlists_str]
        purchase_amounts = [float(x) for x in purchase_amounts_str]

        
        # All per-item lists must be non-empty and of equal length
        list_len = len(item_positions)
        if list_len == 0 or not all(len(lst) == list_len for lst in [clicks, add_to_carts, wishlists, purchase_amounts]):
            return None

        #  Feature Engineering 
        features = {
            'age_level': age_level,
            'gender': gender,
            'purchasing_power': purchasing_power,
            'page_id': page_id,
            'hour': hour,
            'terminal_flag': terminal_flag
        }

        # Click-Based Features
        total_clicks = sum(clicks)
        features['total_clicks'] = total_clicks
        features['click_rate'] = total_clicks / list_len
        features['any_clicks'] = 1 if total_clicks > 0 else 0

        clicked_positions = [pos for pos, click in zip(item_positions, clicks) if click == 1]
        if total_clicks > 0:
            features['avg_pos_clicked'] = np.mean(clicked_positions)
            features['min_pos_clicked'] = np.min(clicked_positions)
            features['max_pos_clicked'] = np.max(clicked_positions)
            features['std_pos_clicked'] = np.std(clicked_positions) if total_clicks > 1 else 0.0
        else:
            features['avg_pos_clicked'] = -1.0
            features['min_pos_clicked'] = -1
            features['max_pos_clicked'] = -1
            features['std_pos_clicked'] = -1.0

        # Add-to-Cart-Based Features
        total_add_to_carts = sum(add_to_carts)
        features['total_add_to_carts'] = total_add_to_carts
        features['any_add_to_carts'] = 1 if total_add_to_carts > 0 else 0

        added_positions = [pos for pos, add in zip(item_positions, add_to_carts) if add == 1]
        if total_add_to_carts > 0:
            features['avg_pos_added'] = np.mean(added_positions)
            features['min_pos_added'] = np.min(added_positions)
            features['max_pos_added'] = np.max(added_positions)
            features['std_pos_added'] = np.std(added_positions) if total_add_to_carts > 1 else 0.0
        else:
            features['avg_pos_added'] = -1.0
            features['min_pos_added'] = -1
            features['max_pos_added'] = -1
            features['std_pos_added'] = -1.0

        #  Wishlist-Based Features
        total_wishlists = sum(wishlists)
        features['total_wishlists'] = total_wishlists
        features['any_wishlists'] = 1 if total_wishlists > 0 else 0

        wishlisted_positions = [pos for pos, wish in zip(item_positions, wishlists) if wish == 1]
        if total_wishlists > 0:
            features['avg_pos_wishlisted'] = np.mean(wishlisted_positions)
            features['min_pos_wishlisted'] = np.min(wishlisted_positions)
            features['max_pos_wishlisted'] = np.max(wishlisted_positions)
            features['std_pos_wishlisted'] = np.std(wishlisted_positions) if total_wishlists > 1 else 0.0
        else:
            features['avg_pos_wishlisted'] = -1.0
            features['min_pos_wishlisted'] = -1
            features['max_pos_wishlisted'] = -1
            features['std_pos_wishlisted'] = -1.0

        #  Purchase-Based Features
        purchased_items_info = [(pos, amount) for pos, amount in zip(item_positions, purchase_amounts) if amount > 0]
        total_items_purchased = len(purchased_items_info)
        features['total_items_purchased'] = total_items_purchased
        features['any_purchases'] = 1 if total_items_purchased > 0 else 0

        if total_items_purchased > 0:
            purchased_positions = [item[0] for item in purchased_items_info]
            purchase_values = [item[1] for item in purchased_items_info]
            features['total_purchase_amount'] = np.sum(purchase_values)
            features['avg_purchase_amount'] = np.mean(purchase_values)
            features['avg_pos_purchased'] = np.mean(purchased_positions)
            features['min_pos_purchased'] = np.min(purchased_positions)
            features['max_pos_purchased'] = np.max(purchased_positions)
            features['std_pos_purchased'] = np.std(purchased_positions) if total_items_purchased > 1 else 0.0
        else:
            features['total_purchase_amount'] = 0.0
            features['avg_purchase_amount'] = 0.0 # Or -1.0 or np.nan
            features['avg_pos_purchased'] = -1.0
            features['min_pos_purchased'] = -1
            features['max_pos_purchased'] = -1
            features['std_pos_purchased'] = -1.0

        return features

    except Exception:
        # Any parsing failure (e.g. a non-numeric token) marks the row as invalid
        return None

file_path = 'www.data'
all_engineered_data = []
rows_processed = 0
rows_skipped = 0
row_limit=100000

print(f"Processing up to {row_limit} rows from file: {file_path}...")

try:
    with open(file_path, 'r') as f:
        # Use enumerate to get row index (starts from 0)
        for i, line in enumerate(f):
            # Stop if the row limit is reached
            if i >= row_limit:
                print(f"\nReached row limit ({row_limit}). Stopping file reading.")
                break

            
            if not line.strip():
                rows_skipped += 1
                continue

            engineered_row_data = engineer_features_from_row(line)

            if engineered_row_data:
                all_engineered_data.append(engineered_row_data)
                rows_processed += 1
            else:
               
                rows_skipped += 1

            
            if (i + 1) % 10000 == 0:
                print(f"  Processed {i+1} lines...")


    print(f"\nFinished processing.")
    print(f"Attempted to read: {i+1 if 'i' in locals() else 0} lines (up to limit of {row_limit})") # Check if loop ran at all
    print(f"Successfully processed rows: {rows_processed}")
    print(f"Skipped rows (due to errors, formatting, or empty): {rows_skipped}")

    # Convert the list of dictionaries into a Pandas DataFrame
    if all_engineered_data:
        df = pd.DataFrame(all_engineered_data)

        print("\nDataFrame Info:")
        df.info() 

        print("\nDataFrame Head (first 5 rows):")
        print(df.head())

        print("\nDataFrame Description (basic stats):")
        print(df.describe())

        

    elif rows_processed == 0:
        print(f"\nNo valid rows were processed from the first {row_limit} lines (or file was smaller/empty).")

except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
except Exception as e:
    print(f"An unexpected error occurred during file processing: {e}")

Processing up to 100000 rows from file: www.data...
  Processed 10000 lines...
  Processed 20000 lines...
  Processed 30000 lines...
  Processed 40000 lines...
  Processed 50000 lines...
  Processed 60000 lines...
  Processed 70000 lines...
  Processed 80000 lines...
  Processed 90000 lines...
  Processed 100000 lines...

Reached row limit (100000). Stopping file reading.

Finished processing.
Attempted to read: 100001 lines (up to limit of 100000)
Successfully processed rows: 100000
Skipped rows (due to errors, formatting, or empty): 0

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 33 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   age_level              100000 non-null  int64  
 1   gender                 100000 non-null  int64  
 2   purchasing_power       100000 non-null  int64  
 3   page_id                100000 non-null  int64  
 4   hour   

In [110]:
df

Unnamed: 0,age_level,gender,purchasing_power,page_id,hour,terminal_flag,total_clicks,click_rate,any_clicks,avg_pos_clicked,...,max_pos_wishlisted,std_pos_wishlisted,total_items_purchased,any_purchases,total_purchase_amount,avg_purchase_amount,avg_pos_purchased,min_pos_purchased,max_pos_purchased,std_pos_purchased
0,6,2,6,1,17,0,0,0.00,0,-1.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
1,6,2,4,0,22,0,0,0.00,0,-1.000000,...,8,0.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
2,5,2,6,9,7,0,1,0.02,1,497.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
3,5,2,6,7,21,0,0,0.00,0,-1.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
4,6,2,5,1,20,0,0,0.00,0,-1.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,4,2,2,2,7,0,3,0.06,1,128.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
99996,4,2,2,6,15,0,3,0.06,1,311.333333,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
99997,7,1,6,0,16,0,2,0.04,1,10.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0
99998,5,2,6,1,7,0,1,0.02,1,53.000000,...,-1,-1.0,0,0,0.0,0.0,-1.0,-1,-1,-1.0


In [111]:
feature_to_encode = 'page_id'

# Perform one-hot encoding on page_id
encoded_df = pd.get_dummies(df, columns=[feature_to_encode], prefix=feature_to_encode)

# Display the result
print("Original DataFrame:")
print(df)
print("\nDataFrame after One-Hot Encoding:")

print(encoded_df)

Original DataFrame:
       age_level  gender  purchasing_power  page_id  hour  terminal_flag  \
0              6       2                 6        1    17              0   
1              6       2                 4        0    22              0   
2              5       2                 6        9     7              0   
3              5       2                 6        7    21              0   
4              6       2                 5        1    20              0   
...          ...     ...               ...      ...   ...            ...   
99995          4       2                 2        2     7              0   
99996          4       2                 2        6    15              0   
99997          7       1                 6        0    16              0   
99998          5       2                 6        1     7              0   
99999          7       1                 3        1    17              0   

       total_clicks  click_rate  any_clicks  avg_pos_clicked  ...  

In [112]:
encoded_df['page_id']=df['page_id']

In [113]:
encoded_df.head(2)

Unnamed: 0,age_level,gender,purchasing_power,hour,terminal_flag,total_clicks,click_rate,any_clicks,avg_pos_clicked,min_pos_clicked,...,page_id_3,page_id_4,page_id_5,page_id_6,page_id_7,page_id_8,page_id_9,page_id_10,page_id_11,page_id
0,6,2,6,17,0,0,0.0,0,-1.0,-1,...,False,False,False,False,False,False,False,False,False,1
1,6,2,4,22,0,0,0.0,0,-1.0,-1,...,False,False,False,False,False,False,False,False,False,0


In [114]:
encoded_df.columns

Index(['age_level', 'gender', 'purchasing_power', 'hour', 'terminal_flag',
       'total_clicks', 'click_rate', 'any_clicks', 'avg_pos_clicked',
       'min_pos_clicked', 'max_pos_clicked', 'std_pos_clicked',
       'total_add_to_carts', 'any_add_to_carts', 'avg_pos_added',
       'min_pos_added', 'max_pos_added', 'std_pos_added', 'total_wishlists',
       'any_wishlists', 'avg_pos_wishlisted', 'min_pos_wishlisted',
       'max_pos_wishlisted', 'std_pos_wishlisted', 'total_items_purchased',
       'any_purchases', 'total_purchase_amount', 'avg_purchase_amount',
       'avg_pos_purchased', 'min_pos_purchased', 'max_pos_purchased',
       'std_pos_purchased', 'page_id_0', 'page_id_1', 'page_id_2', 'page_id_3',
       'page_id_4', 'page_id_5', 'page_id_6', 'page_id_7', 'page_id_8',
       'page_id_9', 'page_id_10', 'page_id_11', 'page_id'],
      dtype='object')

In [115]:
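# Convert session-level positions (which grow by 50 per page) to within-page positions
# by subtracting 50 * page_id, so the model sees comparable values across pages.
# The placeholder values 0 and -1 are left unchanged.
# NOTE: avg_purchase_amount (a monetary value) and std_pos_purchased (a spread) are not
# positions, so shifting them by 50 * page_id is questionable. They are kept because the
# model cell below uses avg_purchase_amount_1.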


cols_to_update = [
    'avg_pos_clicked', 'min_pos_clicked', 'max_pos_clicked',
    'avg_pos_added', 'min_pos_added', 'max_pos_added',
    'avg_pos_wishlisted', 'min_pos_wishlisted', 'max_pos_wishlisted',
    'avg_purchase_amount', 'avg_pos_purchased', 'std_pos_purchased', 'min_pos_purchased','max_pos_purchased',
]

for col in cols_to_update:
    encoded_df[f'{col}_1'] = encoded_df.apply(
        lambda row: row[col] - 50 * row['page_id'] if row[col] not in [0, -1] else row[col],
        axis=1
    )
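
The row-wise `apply` above calls a Python lambda once per row and column, which can be slow on 100,000 rows. A vectorized variant is sketched below on a small toy array. It assumes, as the loop does, that 0 and -1 are placeholders to leave untouched and that each page holds 50 items; the `to_page_position` helper is hypothetical.

```python
import numpy as np

PAGE_SIZE = 50  # items per page, per the dataset description


def to_page_position(values, page_ids, page_size=PAGE_SIZE):
    """Shift session-level positions to within-page positions (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    page_ids = np.asarray(page_ids)
    # 0 and -1 are treated as placeholders and passed through unchanged
    is_placeholder = np.isin(values, [0, -1])
    return np.where(is_placeholder, values, values - page_size * page_ids)


# Toy inputs: average clicked positions and the page each row belongs to
avg_pos = [497.0, -1.0, 128.0, 10.0]
page_id = [9, 7, 2, 0]
print(to_page_position(avg_pos, page_id))
```

With the notebook's DataFrame, the same helper could be applied per column, for example `to_page_position(encoded_df[col], encoded_df['page_id'])`, since `np.asarray` accepts pandas Series.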


In [116]:
encoded_df.drop(['terminal_flag'],axis=1,inplace=True)

In [117]:
encoded_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 58 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   age_level              100000 non-null  int64  
 1   gender                 100000 non-null  int64  
 2   purchasing_power       100000 non-null  int64  
 3   hour                   100000 non-null  int64  
 4   total_clicks           100000 non-null  int64  
 5   click_rate             100000 non-null  float64
 6   any_clicks             100000 non-null  int64  
 7   avg_pos_clicked        100000 non-null  float64
 8   min_pos_clicked        100000 non-null  int64  
 9   max_pos_clicked        100000 non-null  int64  
 10  std_pos_clicked        100000 non-null  float64
 11  total_add_to_carts     100000 non-null  int64  
 12  any_add_to_carts       100000 non-null  int64  
 13  avg_pos_added          100000 non-null  float64
 14  min_pos_added          100000 non-nul

### LightGBM model for classification of age

In [122]:

import lightgbm as lgb
from imblearn.over_sampling import SMOTE  # needed by the resampling step below
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

#  Define target and features
target_age = 'age_level'
features = [
    'hour',
    'total_clicks', 'click_rate', 'any_clicks', 
    'avg_pos_clicked_1', 'min_pos_clicked_1', 'max_pos_clicked_1', 'std_pos_clicked',
    
    'total_add_to_carts', 'any_add_to_carts', 
    'avg_pos_added_1', 'min_pos_added_1', 'max_pos_added_1', 'std_pos_added',
    
    'total_wishlists', 'any_wishlists', 
    'avg_pos_wishlisted_1', 'min_pos_wishlisted_1', 'max_pos_wishlisted_1', 'std_pos_wishlisted',
    
    'total_items_purchased', 'any_purchases', 'total_purchase_amount',
    'avg_purchase_amount_1', 'avg_pos_purchased_1', 'min_pos_purchased_1',
    'max_pos_purchased_1', 'std_pos_purchased',
    
    'page_id_0', 'page_id_1', 'page_id_2', 'page_id_3', 'page_id_4', 'page_id_5',
    'page_id_6', 'page_id_7', 'page_id_8', 'page_id_9', 'page_id_10', 'page_id_11'
]



X = encoded_df[features]
y = encoded_df[target_age]


num_classes = y.nunique()
print(f"Number of unique age levels (classes): {num_classes}")


#  Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"\nOriginal training set shape: X={X_train.shape}, y={y_train.shape}")
print(f"Original training class distribution:\n{y_train.value_counts().sort_index()}")
print(f"\nTest set shape: X={X_test.shape}, y={y_test.shape}")
print(f"Test set class distribution:\n{y_test.value_counts().sort_index()}")


#  Apply SMOTE to the Training Data 
print("\nApplying SMOTE to the training data...")


min_class_size = y_train.value_counts().min()
safe_k_neighbors = max(1, min_class_size - 1) 

print(f"Smallest class size in training set: {min_class_size}. Setting SMOTE k_neighbors={safe_k_neighbors}")

smote = SMOTE(random_state=42, k_neighbors=safe_k_neighbors)

try:
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
    print(f"\nResampled training set shape: X={X_train_res.shape}, y={y_train_res.shape}")
    print(f"Resampled training class distribution:\n{pd.Series(y_train_res).value_counts().sort_index()}") 
    print("SMOTE applied successfully.")
    use_resampled_data = True
except ValueError as e:
    print(f"\nSMOTE Error: {e}")
    print("Could not apply SMOTE (likely too few samples in a minority class).")
    print("Proceeding with original training data. Consider adding 'class_weight=\'balanced\'' back to LGBMClassifier if using original data.")
    X_train_res, y_train_res = X_train, y_train 
    use_resampled_data = False


#  Initialize and Train LightGBM Model (without class_weight) 
# Use slightly increased patience and estimators as SMOTE might need more training
lgbm_classifier = lgb.LGBMClassifier(
    objective='multiclass',
    num_class=num_classes,
    metric='multi_logloss',
    n_estimators=6000,          
    learning_rate=0.04,
    num_leaves=31,               
    max_depth=-1,
    random_state=42,
    n_jobs=-1,
    colsample_bytree=0.8,
    subsample=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1
    
)

# Add class_weight back if SMOTE failed and we fell back to original data
if not use_resampled_data:
    print("\nSMOTE failed, adding class_weight='balanced' to the model.")
    lgbm_classifier.set_params(class_weight='balanced')


print(f"\nTraining LightGBM model {'on RESAMPLED data' if use_resampled_data else 'on ORIGINAL data (with weighting if SMOTE failed)'}...")

# Train the model using the (potentially) resampled training data
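# NOTE: the test set also serves as the early-stopping set here, so the number of
# trees is tuned on the same data used for the final metrics (they may be optimistic).
lgbm_classifier.fit(
    X_train_res, y_train_res,
    eval_set=[(X_test, y_test)],
    eval_metric='multi_logloss',
    callbacks=[lgb.early_stopping(stopping_rounds=150, verbose=True)]  # Increased patience
)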

print("\nTraining complete.")

# Make Predictions on the Original Test Set 
print("\nMaking predictions on the test set...")
y_pred = lgbm_classifier.predict(X_test)

# Evaluate the Model 
print("\nEvaluating the model...")
accuracy = accuracy_score(y_test, y_pred)

f1_micro = f1_score(y_test, y_pred, average='micro', zero_division=0)
f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0)
f1_weighted = f1_score(y_test, y_pred, average='weighted', zero_division=0)

print(f"\nTest Set Evaluation:")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score (Micro): {f1_micro:.4f}")
print(f"F1 Score (Macro): {f1_macro:.4f}")
print(f"F1 Score (Weighted): {f1_weighted:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))


# Feature Importance
print("\nFeature Importances:")
  
# SMOTE may return a NumPy array; fall back to the original column names in that case
if isinstance(X_train_res, np.ndarray):
    feature_names = X_train.columns
else:
    feature_names = X_train_res.columns

feature_imp = pd.DataFrame({
    'feature': feature_names,
    'importance': lgbm_classifier.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_imp.head(20)) 



Number of unique age levels (classes): 9

Original training set shape: X=(80000, 40), y=(80000,)
Original training class distribution:
age_level
0      287
1      343
2     4586
3    24194
4    18817
5    13076
6     8136
7     8978
8     1583
Name: count, dtype: int64

Test set shape: X=(20000, 40), y=(20000,)
Test set class distribution:
age_level
0      72
1      85
2    1147
3    6049
4    4704
5    3269
6    2034
7    2245
8     395
Name: count, dtype: int64

Applying SMOTE to the training data...
Smallest class size in training set: 287. Setting SMOTE k_neighbors=286

Resampled training set shape: X=(217746, 40), y=(217746,)
Resampled training class distribution:
age_level
0    24194
1    24194
2    24194
3    24194
4    24194
5    24194
6    24194
7    24194
8    24194
Name: count, dtype: int64
SMOTE applied successfully.

Training LightGBM model on RESAMPLED data...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.009805 seconds.
You can s
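
The log above shows `k_neighbors=286`, because `max(1, min_class_size - 1)` grows with the size of the smallest class. SMOTE interpolates between a minority sample and one of its k nearest same-class neighbours, so a very large k lets synthetic points span almost the whole class. A common alternative, sketched here as an assumption rather than as this notebook's method, is to cap k at imbalanced-learn's default of 5 and only go lower when a class is tiny. The `capped_k_neighbors` helper is hypothetical.

```python
from collections import Counter


def capped_k_neighbors(labels, default_k=5):
    """Return a SMOTE k_neighbors value no larger than default_k (hypothetical helper)."""
    smallest = min(Counter(labels).values())
    # A class needs more than k samples for k same-class neighbours to exist;
    # keep k at 1 or above so the value stays valid
    return max(1, min(default_k, smallest - 1))


# Class sizes loosely modelled on the age-level training split above
k_large = capped_k_neighbors([0] * 287 + [3] * 24194)
# A tiny minority class forces k below the default
k_small = capped_k_neighbors([0] * 3 + [1] * 100)
print(k_large, k_small)
```

The returned value could then be passed as `SMOTE(random_state=42, k_neighbors=k_large)` in place of `safe_k_neighbors`.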

### Applying LightGBM for purchasing power prediction

In [None]:


target = 'purchasing_power'


X = encoded_df[features]
y = encoded_df[target]


num_classes = y.nunique()
print(f"Number of unique purchasing power levels (classes): {num_classes}")


#  Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"\nOriginal training set shape: X={X_train.shape}, y={y_train.shape}")
print(f"Original training class distribution:\n{y_train.value_counts().sort_index()}")
print(f"\nTest set shape: X={X_test.shape}, y={y_test.shape}")
print(f"Test set class distribution:\n{y_test.value_counts().sort_index()}")


#  Apply SMOTE to the Training Data 
print("\nApplying SMOTE to the training data...")


min_class_size = y_train.value_counts().min()
safe_k_neighbors = max(1, min_class_size - 1) 

print(f"Smallest class size in training set: {min_class_size}. Setting SMOTE k_neighbors={safe_k_neighbors}")

smote = SMOTE(random_state=42, k_neighbors=safe_k_neighbors)

try:
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
    print(f"\nResampled training set shape: X={X_train_res.shape}, y={y_train_res.shape}")
    print(f"Resampled training class distribution:\n{pd.Series(y_train_res).value_counts().sort_index()}") 
    print("SMOTE applied successfully.")
    use_resampled_data = True
except ValueError as e:
    print(f"\nSMOTE Error: {e}")
    print("Could not apply SMOTE (likely too few samples in a minority class).")
    print("Proceeding with original training data. Consider adding 'class_weight=\'balanced\'' back to LGBMClassifier if using original data.")
    X_train_res, y_train_res = X_train, y_train 
    use_resampled_data = False


#  Initialize and Train LightGBM Model (without class_weight) 
# Use slightly increased patience and estimators as SMOTE might need more training
lgbm_classifier = lgb.LGBMClassifier(
    objective='multiclass',
    num_class=num_classes,
    metric='multi_logloss',
    n_estimators=2000,          
    learning_rate=0.09,
    num_leaves=31,               
    max_depth=-1,
    random_state=42,
    n_jobs=-1,
    colsample_bytree=0.8,
    subsample=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1
    
)

# Add class_weight back if SMOTE failed and we fell back to original data
if not use_resampled_data:
    print("\nSMOTE failed, adding class_weight='balanced' to the model.")
    lgbm_classifier.set_params(class_weight='balanced')


print(f"\nTraining LightGBM model {'on RESAMPLED data' if use_resampled_data else 'on ORIGINAL data (with weighting if SMOTE failed)'}...")

# Train the model using the (potentially) resampled training data
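# NOTE: the test set also serves as the early-stopping set here, so the number of
# trees is tuned on the same data used for the final metrics (they may be optimistic).
lgbm_classifier.fit(
    X_train_res, y_train_res,
    eval_set=[(X_test, y_test)],
    eval_metric='multi_logloss',
    callbacks=[lgb.early_stopping(stopping_rounds=150, verbose=True)]  # Increased patience
)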

print("\nTraining complete.")

# Make Predictions on the Original Test Set 
print("\nMaking predictions on the test set...")
y_pred = lgbm_classifier.predict(X_test)

# Evaluate the Model 
print("\nEvaluating the model...")
accuracy = accuracy_score(y_test, y_pred)

f1_micro = f1_score(y_test, y_pred, average='micro', zero_division=0)
f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0)
f1_weighted = f1_score(y_test, y_pred, average='weighted', zero_division=0)

print(f"\nTest Set Evaluation:")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score (Micro): {f1_micro:.4f}")
print(f"F1 Score (Macro): {f1_macro:.4f}")
print(f"F1 Score (Weighted): {f1_weighted:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))


# Feature Importance
print("\nFeature Importances:")
  
# SMOTE may return a NumPy array; fall back to the original column names in that case
if isinstance(X_train_res, np.ndarray):
    feature_names = X_train.columns
else:
    feature_names = X_train_res.columns

feature_imp = pd.DataFrame({
    'feature': feature_names,
    'importance': lgbm_classifier.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_imp.head(20)) 



Number of unique age levels (classes): 8

Original training set shape: X=(80000, 40), y=(80000,)
Original training class distribution:
purchasing_power
0     3123
1     5152
2    12576
3    16018
4    17150
5    13356
6     8798
7     3827
Name: count, dtype: int64

Test set shape: X=(20000, 40), y=(20000,)
Test set class distribution:
purchasing_power
0     781
1    1288
2    3144
3    4004
4    4287
5    3339
6    2200
7     957
Name: count, dtype: int64

Applying SMOTE to the training data...
Smallest class size in training set: 3123. Setting SMOTE k_neighbors=3122

Resampled training set shape: X=(137200, 40), y=(137200,)
Resampled training class distribution:
purchasing_power
0    17150
1    17150
2    17150
3    17150
4    17150
5    17150
6    17150
7    17150
Name: count, dtype: int64
SMOTE applied successfully.

Training LightGBM model on RESAMPLED data...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006435 seconds.
You can set `force
