# Gender Identification Classification

## Student Identification
- **Name**: Abdul Razzak Ghouri
- **Student ID**: FA22BSCS0097
- **Assignment**: AI Assignment 3
- **Date**: May 26, 2025

## Overview
This notebook tackles a binary gender classification task with supervised machine learning, written for an Artificial Intelligence course assignment. It uses a synthetic dataset with three features (height, weight, voice pitch) to predict gender (male/female). Two classifiers, Logistic Regression and Random Forest, are trained and compared by their accuracy on a held-out test set.

## 1. Import Libraries

Import necessary Python libraries for data manipulation, model training, and evaluation.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

## 2. Create Synthetic Dataset

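Generate a synthetic dataset with 1000 samples (500 male, 500 female) using height, weight, and voice pitch as features. Each feature is drawn from a normal distribution whose mean and standard deviation depend on the class. A fixed random seed makes the data reproducible.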

In [2]:
np.random.seed(42)
n_samples = 1000

In [3]:
data = {
    'height': np.concatenate([np.random.normal(175, 8, n_samples//2), np.random.normal(162, 7, n_samples//2)]),
    'weight': np.concatenate([np.random.normal(80, 10, n_samples//2), np.random.normal(60, 8, n_samples//2)]),
    'voice_pitch': np.concatenate([np.random.normal(120, 20, n_samples//2), np.random.normal(200, 25, n_samples//2)]),
    'gender': ['male'] * (n_samples//2) + ['female'] * (n_samples//2)
}
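A quick sanity check on the generated data is to compare per-class means and standard deviations against the parameters passed to `np.random.normal`. The sketch below regenerates the same dictionary so it runs on its own; the names `half` and `summary` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_samples = 1000
half = n_samples // 2  # illustrative shorthand, not in the original cells

# Same generating distributions as the cell above
data = {
    'height': np.concatenate([np.random.normal(175, 8, half), np.random.normal(162, 7, half)]),
    'weight': np.concatenate([np.random.normal(80, 10, half), np.random.normal(60, 8, half)]),
    'voice_pitch': np.concatenate([np.random.normal(120, 20, half), np.random.normal(200, 25, half)]),
    'gender': ['male'] * half + ['female'] * half,
}

# Per-class mean and standard deviation of every feature
summary = pd.DataFrame(data).groupby('gender').agg(['mean', 'std']).round(1)
print(summary)
```

The sample statistics should land close to the requested parameters; large deviations would point to a mistake in how the arrays were concatenated.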

## 3. Data Preprocessing

Convert the dataset into a DataFrame and encode gender as binary (0 for male, 1 for female). Select features and target variable.

In [4]:
df = pd.DataFrame(data)

In [5]:
df['gender'] = df['gender'].map({'male': 0, 'female': 1})

In [6]:
X = df[['height', 'weight', 'voice_pitch']]
y = df['gender']

## 4. Train-Test Split

Split the dataset into 80% training and 20% testing sets.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
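The split above shuffles the rows but does not guarantee that both classes appear in equal proportion in each subset. One possible variation, shown here as a sketch rather than as part of the assignment, is to pass `stratify=y` so the 50/50 class balance is preserved exactly. The arrays below are placeholders standing in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: balanced labels (0 = male, 1 = female) and dummy features
y = np.array([0] * 500 + [1] * 500)
X = np.arange(1000).reshape(-1, 1)

# stratify=y keeps the class proportions identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(np.bincount(y_train), np.bincount(y_test))
```

With balanced classes the effect is small, but stratification removes one source of run-to-run variation when comparing models.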

## 5. Feature Scaling

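Standardize features to have mean 0 and variance 1. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training. Scaling mainly benefits Logistic Regression; Random Forest is insensitive to feature scale.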

In [8]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
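To see what `fit_transform` and `transform` do, the following sketch scales two randomly generated arrays and prints the resulting column statistics. The data here is made up for the demonstration (the means and spreads are arbitrary assumptions), and the `_scaled` variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with three columns on different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[170, 70, 160], scale=[10, 12, 45], size=(800, 3))
X_test = rng.normal(loc=[170, 70, 160], scale=[10, 12, 45], size=(200, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

# Training columns are exactly standardized; test columns only approximately
print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```

The test set ends up near, but not exactly at, mean 0 and variance 1, which is expected because its own statistics are never used.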

## 6. Train Logistic Regression Model

Train a Logistic Regression model and calculate its accuracy on the test set.

In [10]:
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)
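Because the features are standardized, the fitted coefficients give a rough picture of how each feature pushes the prediction. The sketch below rebuilds the data and model so it runs standalone; `half`, `X_train_s`, and `coefs` are names introduced for this example. A positive coefficient moves the prediction toward class 1 (female), a negative one toward class 0 (male).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
half = 500
X = pd.DataFrame({
    'height': np.concatenate([np.random.normal(175, 8, half), np.random.normal(162, 7, half)]),
    'weight': np.concatenate([np.random.normal(80, 10, half), np.random.normal(60, 8, half)]),
    'voice_pitch': np.concatenate([np.random.normal(120, 20, half), np.random.normal(200, 25, half)]),
})
y = np.array([0] * half + [1] * half)  # 0 = male, 1 = female

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_s = StandardScaler().fit_transform(X_train)

lr_model = LogisticRegression(random_state=42).fit(X_train_s, y_train)

# One coefficient per standardized feature
coefs = pd.Series(lr_model.coef_[0], index=X.columns)
print(coefs.round(2))
```

Given how the data is generated, height and weight should receive negative coefficients and voice pitch a positive one.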

## 7. Train Random Forest Model

Train a Random Forest model with 100 trees and calculate its accuracy on the test set.

In [11]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
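A fitted Random Forest also exposes `feature_importances_`, which can help explain what the model relies on. The sketch below is a standalone illustration: it trains on the unscaled features, which is an assumption made here for brevity since tree splits do not depend on feature scale, and the names `half` and `importances` are introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)
half = 500
X = pd.DataFrame({
    'height': np.concatenate([np.random.normal(175, 8, half), np.random.normal(162, 7, half)]),
    'weight': np.concatenate([np.random.normal(80, 10, half), np.random.normal(60, 8, half)]),
    'voice_pitch': np.concatenate([np.random.normal(120, 20, half), np.random.normal(200, 25, half)]),
})
y = np.array([0] * half + [1] * half)  # 0 = male, 1 = female

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Impurity-based importances; they are non-negative and sum to 1
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).round(3))
```

These importances are impurity-based, so they describe how the trees split the training data rather than a causal effect of each feature.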

## 8. Evaluate and Compare Models

Print the accuracy scores of both models and determine which performed better.

In [12]:
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")
print(f"Random Forest Accuracy: {rf_accuracy:.2f}")

Logistic Regression Accuracy: 0.98
Random Forest Accuracy: 0.97


In [13]:
if lr_accuracy > rf_accuracy:
    print("Logistic Regression performed better.")
elif rf_accuracy > lr_accuracy:
    print("Random Forest performed better.")
else:
    print("Both models achieved the same accuracy.")

Logistic Regression performed better.
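Accuracy alone does not show which class the errors fall on. As a possible extension, the sketch below rebuilds the same pipeline and prints a confusion matrix per model (rows are true classes, columns are predicted classes, in the order 0 = male, 1 = female). The `models` and `matrices` dictionaries are names introduced for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
half = 500
X = np.column_stack([
    np.concatenate([np.random.normal(175, 8, half), np.random.normal(162, 7, half)]),    # height
    np.concatenate([np.random.normal(80, 10, half), np.random.normal(60, 8, half)]),     # weight
    np.concatenate([np.random.normal(120, 20, half), np.random.normal(200, 25, half)]),  # voice pitch
])
y = np.array([0] * half + [1] * half)  # 0 = male, 1 = female

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and collect its 2x2 confusion matrix on the test set
matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    matrices[name] = confusion_matrix(y_test, model.predict(X_test))
    print(name)
    print(matrices[name])
```

The off-diagonal entries count the misclassified test samples, which makes it easy to see how many predictions separate the two models.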


## 9. Conclusion

Logistic Regression achieved slightly higher accuracy (0.98) than Random Forest (0.97) on the test set. With 200 test samples, a gap of 0.01 amounts to only a few predictions, so the difference is small and may not hold for a different split or seed. The result is consistent with the classes being close to linearly separable in these three features, which is expected given that each class is drawn from a normal distribution with well-separated means. Random Forest, while well suited to non-linear data, gains little from its extra flexibility on a dataset this simple and may have overfit slightly.
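One way to check whether the accuracy gap is stable is k-fold cross-validation, which was not part of the original experiment. The sketch below is an assumption-laden illustration: it uses 5 stratified folds and wraps the scaler and Logistic Regression in a pipeline so scaling is refit inside each fold. The names `cv`, `candidates`, and `cv_scores` are introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
half = 500
X = np.column_stack([
    np.concatenate([np.random.normal(175, 8, half), np.random.normal(162, 7, half)]),    # height
    np.concatenate([np.random.normal(80, 10, half), np.random.normal(60, 8, half)]),     # weight
    np.concatenate([np.random.normal(120, 20, half), np.random.normal(200, 25, half)]),  # voice pitch
])
y = np.array([0] * half + [1] * half)  # 0 = male, 1 = female

# Shuffled, stratified folds; the fold count of 5 is an arbitrary choice
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(random_state=42)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Accuracy on each of the 5 held-out folds, per model
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is as large as the difference between the two means, the ranking of the models should be treated as inconclusive.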