# INSIGHT MEMO: Gender Classification by Content Activity

### WHAT WE WANT TO ACHIEVE
The goal of this project is to predict the gender (male/female) of a device_id based on the web content viewed in its browser.


### HOW WE GET THERE
1. **Data Collection and Preprocessing:**
   - Load the datasets `ground_truth_ml.csv` and `variables_ml.csv`.
   - Merge the datasets on the `id` column.
   - Aggregate the content watched by each device_id.
   - Preprocess the text data using TF-IDF Vectorizer with Spanish stop words.

2. **Addressing Class Imbalance:**
   - The dataset is highly imbalanced with 83% females and 17% males.
   - To handle this, we applied undersampling to the majority class (females) to balance the classes.

3. **Model Building and Evaluation:**
   - Train three models: Logistic Regression, XGBoost, and a simple Neural Network.
   - Evaluate the models based on accuracy, precision, recall, and F1 score.
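
The aggregation step listed above could be sketched as follows. This is a minimal example on made-up rows, assuming `variables_ml.csv` holds one row per (`id`, `content`) pair; the toy titles and the `docs` name are illustrative and not taken from the project data.

```python
import pandas as pd

# Toy stand-in for variables_ml.csv: one row per (id, content) pair (assumed layout).
variables_df = pd.DataFrame({
    "id": ["a", "a", "b", "c", "c", "c"],
    "content": ["futbol hoy", "liga resumen", "moda verano",
                "recetas faciles", "moda invierno", "cocina rapida"],
})

# Join every title seen by a device into a single text document,
# so that each device_id contributes exactly one row to the model.
docs = (
    variables_df.groupby("id", sort=True)["content"]
    .agg(" ".join)
    .reset_index()
)

print(docs)
```

With one document per device, the train/test split cannot place rows from the same device_id on both sides.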

### VARIABLES
- **id:** Unique identifier of a device_id (essentially a single person).
- **content:** The name of a single web content.
- **gender:** Ground truth gender of the device_id (male 'm' or female 'f').

### MODELS PROPOSED
1. **Logistic Regression:**
   - Simple and interpretable model.
   - Moderate performance but good as a baseline.
2. **XGBoost:**
   - Ensemble model known for its high performance.
   - More complex and requires tuning.
3. **Neural Network:**
   - Flexible and capable of modeling complex patterns.
   - Requires more data and computational resources.

### ISSUES: UNBALANCED CLASSES PROBLEM
The dataset had a significant imbalance with a majority of female samples (83%). 

To address this, we chose to undersample the female class instead of oversampling the male class. This approach was selected to simplify the data preprocessing steps, since oversampling requires a deeper understanding of synthetic data generation techniques.
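
To make the effect of undersampling concrete, here is a small sketch using only the standard library. The 83/17 split mirrors the proportions above, but the 1,000 labels are synthetic and the helper name `undersample` is hypothetical.

```python
import random

def undersample(labels, majority, seed=42):
    """Return sorted row indices that keep every minority row and an
    equally sized random subset of the majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    return sorted(kept_majority + minority_idx)

# Synthetic labels with the same 83% / 17% proportions as the dataset.
labels = ["f"] * 830 + ["m"] * 170
kept = undersample(labels, majority="f")
balanced = [labels[i] for i in kept]

print(len(balanced), balanced.count("f"), balanced.count("m"))  # 340 170 170
```

Balancing this way discards 660 of the 830 majority rows, which is the main cost of the approach.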

### RESULTS
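All three models reach roughly 60% accuracy on the balanced test set. XGBoost is the most balanced (precision 0.594, recall 0.589), while Logistic Regression and the Neural Network trade precision for recall, identifying more actual females at the cost of more false positives.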


How to read the results (reference): https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

Accuracy: Measures the overall correctness of the model. For example, Logistic Regression has an accuracy of 59.73%, meaning it correctly predicts the gender 59.73% of the time.

Precision: Indicates the proportion of positive predictions (females) that are actually correct. For instance, XGBoost's precision of 0.5939 means that when it predicts a device ID as female, it is correct 59.39% of the time.

Recall: Measures the model's ability to identify all actual positive cases. The Neural Network's recall of 0.8078 (from the 20-epoch run) shows it successfully identifies 80.78% of all actual females; the shorter 3-epoch run recorded in the code section reaches 0.7499.

F1 Score: The harmonic mean of precision and recall. It provides a balance between precision and recall. For example, Logistic Regression's F1 score of 0.6459 indicates a balanced performance between identifying actual positives and minimizing false positives.
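
As a worked check, the four metrics can be recomputed by hand from a confusion matrix. The sketch below uses the XGBoost counts printed in the code section and treats female as the positive class; the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for a binary confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # share of predicted females that are female
    recall = tp / (tp + fn)      # share of actual females that were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# XGBoost confusion matrix: rows = actual (male, female), columns = predicted.
acc, prec, rec, f1 = binary_metrics(tn=18012, fp=11966, fn=12187, tp=17501)

print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")  # 0.5952 0.5939 0.5895 0.5917
```

These values match the XGBoost row of the metrics table, which confirms that precision and recall are reported for the female class.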


### WHAT WE CAN DO BETTER WITH MORE TIME AND RESOURCES
1. **Data Augmentation:** Instead of undersampling, explore oversampling techniques such as SMOTE to generate synthetic samples for the minority class.
2. **Hyperparameter Tuning:** Perform extensive hyperparameter tuning for models, especially XGBoost and Neural Networks, to improve performance.
3. **Feature Engineering:** Investigate additional features or more sophisticated text preprocessing techniques (e.g., word embeddings).
4. **Model Complexity:** Implement more complex models such as deep neural networks with more layers and neurons.
5. **Ensemble Methods:** Combine predictions from multiple models to create a more robust ensemble model.
6. **Cross-Validation:** Use cross-validation to ensure the stability and reliability of the model performance.
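
As one possible shape for the cross-validation item, the sketch below runs stratified k-fold on a TF-IDF plus Logistic Regression pipeline. The texts and labels are made up for illustration, and the choice of 3 folds and F1 scoring is an assumption rather than a project setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up documents and labels (1 = female, 0 = male), for illustration only.
texts = ["moda verano", "recetas faciles", "moda cocina", "cocina rapida",
         "futbol hoy", "liga resumen", "futbol liga", "motor carreras"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# Keeping the vectorizer inside the pipeline refits TF-IDF on each training
# fold, so vocabulary and IDF weights never see the held-out fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1")

print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds gives a more stable estimate than the single 70/30 split used in the code section.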

### CONCLUSION
This project demonstrates the process of building and evaluating predictive models on a synthetic dataset with an imbalanced class distribution. The three models reach roughly 60% accuracy on the balanced test set, a modest gain over the 50% chance level, so there is clear room for improvement through further refinement and additional resources.

# CODE

In [37]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample  
import xgboost as xgb
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder


## Logistic Regression for Gender Prediction

In [38]:
# Load datasets
ground_truth_df = pd.read_csv('ground_truth_ml.csv')
variables_df = pd.read_csv('variables_ml.csv')

# Merge datasets on 'id' column
final_df = pd.merge(variables_df, ground_truth_df, on='id')

In [39]:
# Download the stopwords from nltk
nltk.download('stopwords')

# Get Spanish stop words
spanish_stop_words_lg = stopwords.words('spanish')

# Initialize TF-IDF Vectorizer with Spanish stop words
tfidf_vectorizer_lg = TfidfVectorizer(stop_words=spanish_stop_words_lg, max_features=5000)

# Transform the content data into TF-IDF features
X_lg = tfidf_vectorizer_lg.fit_transform(final_df['content'])

# Encode the gender labels
y_lg = final_df['gender'].map({'m': 0, 'f': 1})



In [40]:
# Create a DataFrame with features and labels
data_with_features_lg = pd.DataFrame(X_lg.toarray(), columns=tfidf_vectorizer_lg.get_feature_names_out())
data_with_features_lg['gender'] = y_lg

# Separate majority and minority classes
majority_class_lg = data_with_features_lg[data_with_features_lg['gender'] == 1]  # Female
minority_class_lg = data_with_features_lg[data_with_features_lg['gender'] == 0]  # Male

# Undersample majority class
majority_class_undersampled_lg = resample(majority_class_lg, 
                                          replace=False,   # Do not replace
                                          n_samples=len(minority_class_lg),  # Match number of minority class
                                          random_state=42)  # For reproducibility

# Combine minority class with undersampled majority class
balanced_df_lg = pd.concat([majority_class_undersampled_lg, minority_class_lg])

# Separate features and labels
X_balanced = balanced_df_lg.drop('gender', axis=1)
y_balanced = balanced_df_lg['gender']

In [41]:
# Split the balanced data into training and testing sets
X_train_lg, X_test_lg, y_train_lg, y_test_lg = train_test_split(X_balanced, y_balanced, test_size=0.3, random_state=42)

# Initialize Logistic Regression model
logistic_model_lg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
logistic_model_lg.fit(X_train_lg, y_train_lg)

# Make predictions on the test set
y_pred_lg = logistic_model_lg.predict(X_test_lg)

# Calculate evaluation metrics (female = 1 is the positive class)
accuracy_lg = accuracy_score(y_test_lg, y_pred_lg)
precision_lg = precision_score(y_test_lg, y_pred_lg)
recall_lg = recall_score(y_test_lg, y_pred_lg)
f1_lg = f1_score(y_test_lg, y_pred_lg)

In [42]:
# Create DataFrame with predictions
# y_test_lg.index holds row labels, so look them up by label rather than by position
test_ids_lg = final_df.loc[y_test_lg.index, 'id']
predictions_df_models = pd.DataFrame({
    'id': test_ids_lg,
    'ground_truth_gender': y_test_lg.map({0: 'm', 1: 'f'}).values,
    'predicted_gender_lg': pd.Series(y_pred_lg).map({0: 'm', 1: 'f'}).values
})

# Merge predictions_df_models with the original variables to get the content
content_df_lg = variables_df[variables_df['id'].isin(test_ids_lg)]
result_df_lg = pd.merge(predictions_df_models, content_df_lg, on='id', how='left')


In [43]:
result_df_lg

# Save metrics to a DataFrame
metrics_df = pd.DataFrame({
    'Model': ['Logistic Regression'],
    'Accuracy': [accuracy_lg],
    'Precision': [precision_lg],
    'Recall': [recall_lg],
    'F1 Score': [f1_lg]
})

## XGBoost for Gender Prediction

In [44]:
# Initialize XGBoost model
# (use_label_encoder is no longer used by XGBoost, so it is not passed)
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)

# Train the XGBoost model
xgb_model.fit(X_train_lg, y_train_lg)

# Make predictions on the test set
y_pred_xgb = xgb_model.predict(X_test_lg)

# Calculate metrics for XGBoost model
accuracy_xgb = accuracy_score(y_test_lg, y_pred_xgb)
precision_xgb = precision_score(y_test_lg, y_pred_xgb)
recall_xgb = recall_score(y_test_lg, y_pred_xgb)
f1_xgb = f1_score(y_test_lg, y_pred_xgb)

In [45]:

# Generate and print the confusion matrix for XGBoost
conf_matrix_xgb = confusion_matrix(y_test_lg, y_pred_xgb, labels=[0, 1])
print("Confusion Matrix for XGBoost:")
print(pd.DataFrame(conf_matrix_xgb, index=['Male', 'Female'], columns=['Predicted Male', 'Predicted Female']))


Confusion Matrix for XGBoost:
        Predicted Male  Predicted Female
Male             18012             11966
Female           12187             17501


In [46]:
# Create DataFrame with predictions for XGBoost
predictions_df_models_xgboost = pd.DataFrame({
    'id': test_ids_lg,
    'ground_truth_gender_xgb': y_test_lg.map({0: 'm', 1: 'f'}).values,
    'predicted_gender_xgb': pd.Series(y_pred_xgb).map({0: 'm', 1: 'f'}).values
})

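# Add the XGBoost predictions to predictions_df_models, keeping only the
# predicted_gender_xgb column. Note: `id` is not unique (a device can appear on
# several rows), so this merge is many-to-many and repeats rows; the resulting
# table has more rows than the test set.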
predictions_df_models = predictions_df_models.merge(predictions_df_models_xgboost[['id', 'predicted_gender_xgb']], on='id', how='left')
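
The predictions table has several rows per `id`, so it is worth seeing how `pd.merge` behaves with repeated keys. The toy frames below are made up; `validate` is a standard `pd.merge` parameter, and the positional assignment at the end is only one possible alternative, valid when both frames come from the same rows in the same order.

```python
import pandas as pd

left = pd.DataFrame({"id": ["a", "a", "b"], "pred_lg": ["f", "m", "f"]})
right = pd.DataFrame({"id": ["a", "a", "b"], "pred_xgb": ["f", "f", "m"]})

# Each of the two "a" rows on the left matches both "a" rows on the right,
# so the result has 2 * 2 + 1 = 5 rows instead of 3.
merged = left.merge(right, on="id", how="left")
print(len(merged))  # 5

# Asking pandas to check key uniqueness turns the silent blow-up into an error.
try:
    left.merge(right, on="id", how="left", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)  # True

# With identically ordered frames, assigning the column by position
# keeps exactly one row per prediction.
aligned = left.assign(pred_xgb=right["pred_xgb"].to_numpy())
print(len(aligned))  # 3
```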

In [47]:
predictions_df_models

| | id | ground_truth_gender | predicted_gender_lg | predicted_gender_xgb |
|---|---|---|---|---|
| 0 | cee8a7ca-8fe5-43fb-b01f-4ff5f54b5f6c | f | f | m |
| 1 | 9fdd46ec-6760-4b1d-9ac1-b380d4401873 | m | f | f |
| 2 | 9fdd46ec-6760-4b1d-9ac1-b380d4401873 | m | f | f |
| 3 | 9fdd46ec-6760-4b1d-9ac1-b380d4401873 | m | f | f |
| 4 | 9fdd46ec-6760-4b1d-9ac1-b380d4401873 | m | f | f |
| ... | ... | ... | ... | ... |
| 252785 | fab9230e-a491-402b-b308-b9c94d85a495 | f | f | f |
| 252786 | fab9230e-a491-402b-b308-b9c94d85a495 | f | f | f |
| 252787 | 8110bc79-bbd0-41e0-b766-4c67c638b93a | f | f | f |
| 252788 | 23623383-7fb7-4d87-98d7-7887169f8fc8 | m | m | m |


In [48]:
# Append metrics to the existing metrics DataFrame
new_metrics_df = pd.DataFrame({
    'Model': ['XGBoost'],
    'Accuracy': [accuracy_xgb],
    'Precision': [precision_xgb],
    'Recall': [recall_xgb],
    'F1 Score': [f1_xgb]
})

# Append new metrics to the existing metrics DataFrame
metrics_df = pd.concat([metrics_df, new_metrics_df], ignore_index=True)


## NN for Gender Prediction

In [49]:
# Encode the gender labels for neural network
encoder = LabelEncoder()
y_nn = encoder.fit_transform(y_balanced)
y_nn = to_categorical(y_nn)

# Split the balanced data into training and testing sets
X_train_nn, X_test_nn, y_train_nn, y_test_nn = train_test_split(X_balanced, y_nn, test_size=0.3, random_state=42)

# Define the neural network model
nn_model = Sequential()
nn_model.add(Dense(64, input_dim=X_train_nn.shape[1], activation='relu'))
nn_model.add(Dense(32, activation='relu'))
nn_model.add(Dense(2, activation='softmax'))


In [50]:

#! #########################################################################
#! BEFORE RUNNING THIS, PLEASE NOTE:
#! this cell takes several minutes to complete. Increase or decrease the number of epochs as needed.
#! #########################################################################

# Compile the model
nn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
nn_model.fit(X_train_nn, y_train_nn, 
             epochs=3,  #! Choose the number of epochs; the performance results reported in this project are based on 20 epochs
             batch_size=10, 
             verbose=1)


Epoch 1/3


Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x1784097b0>

In [51]:
# Evaluate the model
loss, accuracy_nn = nn_model.evaluate(X_test_nn, y_test_nn, verbose=0)
print(f"Neural Network Accuracy: {accuracy_nn * 100:.2f}%")




Neural Network Accuracy: 59.77%


In [52]:
# Make predictions on the test set
y_pred_nn = nn_model.predict(X_test_nn)
y_pred_nn = np.argmax(y_pred_nn, axis=1)
y_test_nn = np.argmax(y_test_nn, axis=1)

# Calculate metrics for Neural Network
accuracy_nn = accuracy_score(y_test_nn, y_pred_nn)
precision_nn = precision_score(y_test_nn, y_pred_nn)
recall_nn = recall_score(y_test_nn, y_pred_nn)
f1_nn = f1_score(y_test_nn, y_pred_nn)

# Generate confusion matrix for Neural Network
conf_matrix_nn = confusion_matrix(y_test_nn, y_pred_nn)
print("Confusion Matrix for Neural Network:")
print(pd.DataFrame(conf_matrix_nn, index=['Male', 'Female'], columns=['Predicted Male', 'Predicted Female']))




Confusion Matrix for Neural Network:
        Predicted Male  Predicted Female
Male             13399             16579
Female            7424             22264


In [53]:
# Create DataFrame with predictions for Neural Network
predictions_df_nn = pd.DataFrame({
    'id': final_df.iloc[y_test_lg.index]['id'],
    'ground_truth_gender': pd.Series(y_test_nn).map({0: 'm', 1: 'f'}).values,
    'predicted_gender_nn': pd.Series(y_pred_nn).map({0: 'm', 1: 'f'}).values
})

# Merge Neural Network predictions with the existing predictions DataFrame
predictions_df_models = pd.merge(predictions_df_models, predictions_df_nn[['id', 'predicted_gender_nn']], on='id', how='left')


In [54]:
# Append metrics to the existing metrics DataFrame
new_metrics_nn = pd.DataFrame({
    'Model': ['Neural Network'],
    'Accuracy': [accuracy_nn],
    'Precision': [precision_nn],
    'Recall': [recall_nn],
    'F1 Score': [f1_nn]
})

# Append new metrics to the existing metrics DataFrame
metrics_df = pd.concat([metrics_df, new_metrics_nn], ignore_index=True)

# Save updated metrics DataFrame to a CSV file
metrics_df.to_csv('model_metrics_combined.csv', index=False)


In [55]:
metrics_df

| | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.597258 | 0.574112 | 0.738177 | 0.645889 |
| 1 | XGBoost | 0.595197 | 0.593919 | 0.589497 | 0.5917 |
| 2 | Neural Network | 0.597711 | 0.573179 | 0.749933 | 0.64975 |


In [59]:
predictions_df_models.to_csv('predictions_df_models.csv', index=False)


In [None]:
predictions_df_models