## Week 6: Logistic Regression
In week 6, we've covered:
* Logistic regression
* Imbalanced Data Sets

Each section comes with instructions, and the parts you need to implement are marked with a **TODO** in the code cells.

Upload **Week6_LogisticRegression_Homework.ipynb**, **train.csv** and **test.csv** to Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

**TODO**: Replace the folder name in the path below (currently `Homegrown_examples`) with the name of the Google Drive folder where you put `train.csv` and `test.csv`. Run the cell and check that both files are listed.

In [None]:
!ls /content/drive/My\ Drive/Homegrown_examples

Alternative: upload the files directly if mounting Google Drive doesn't work.

In [3]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test.csv
Saving train.csv to train.csv


## 1. Introduction

### E-commerce Website Conversion Prediction Dataset

#### Overview
This dataset contains user session data from an e-commerce website, designed to predict whether a visitor will convert (make a purchase) during their session. This is a classic **binary classification problem** with **significant class imbalance** - only about 5% of visitors actually convert.
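
To see why the imbalance matters, here is a minimal sketch on synthetic labels (not the homework data), assuming exactly 5% conversions. A "model" that always predicts no conversion looks accurate while finding no converters at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels for illustration only: 950 non-conversions, 50 conversions.
y_true = np.array([0] * 950 + [1] * 50)

# A trivial "model" that always predicts the majority class (0).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.95
print(f"Recall:   {recall:.2f}")    # Recall:   0.00
```

This is why the later sections look at precision, recall, and F1-score rather than accuracy alone.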

### Dataset Features

#### Target Variable
- **`converted`**: Binary target variable (0 = No conversion, 1 = Conversion)
  - **Class imbalance**: ~95% non-conversions, ~5% conversions
  - This imbalance makes the dataset well suited to practicing resampling techniques

### Features

#### User Demographics
- **`age`**: User's age in years (18-75)
- **`income`**: User's annual income in USD

#### Session Characteristics
- **`session_id`**: Unique identifier for each session
- **`device_type`**: Device used to access the site (`desktop`, `mobile`, `tablet`)
- **`traffic_source`**: How the user found the website (`organic`, `paid_search`, `social`, `email`, `direct`, `referral`)
- **`pages_viewed`**: Number of pages viewed during the session
- **`session_duration_seconds`**: Time spent on the site in seconds
- **`day_of_week`**: Day of the week (0 = Monday, 6 = Sunday)
- **`previous_visits`**: Number of previous visits to the site


## 2. Exploring the data

In [27]:
import pandas as pd
import io

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# TODO: Display the shape of the training data
# TODO: Print the number of samples and features
# TODO: Show the first 5 rows of the dataset

In [28]:
# TODO: Display data types and non-null counts
# TODO: Check for missing values

In [29]:
# TODO: Calculate the conversion rate
# TODO: Display value counts for the target variable (converted)
# TODO: Show the percentage distribution

In [30]:
# TODO: Show summary statistics for numerical columns

In [31]:
# TODO: For each categorical column, display:
#       - Number of unique values
#       - Value counts
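
The exploration steps above can be sketched on a small made-up DataFrame. The frame `toy` and its values are assumptions for illustration, not the homework data; the same pandas calls should apply to `train_data`.

```python
import pandas as pd

# Toy stand-in for train_data; the values are made up for illustration.
toy = pd.DataFrame({
    "device_type": ["desktop", "mobile", "mobile", "tablet", "mobile"],
    "pages_viewed": [3, 1, 7, 2, 4],
    "converted": [0, 0, 1, 0, 0],
})

n_rows, n_cols = toy.shape                   # (rows, columns)
missing_per_column = toy.isnull().sum()      # missing values per column
conversion_rate = toy["converted"].mean()    # mean of a 0/1 column = share of 1s
class_share = toy["converted"].value_counts(normalize=True)

# Treat every non-numeric column as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows, {n_cols} columns")
print(f"Conversion rate: {conversion_rate:.1%}")
print(toy.describe())                        # summary statistics, numeric columns
for col in categorical_cols:
    print(col, toy[col].nunique(), toy[col].value_counts().to_dict())
```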

## 3. Baseline Logistic Regression


In [32]:
# TODO: Create X_train and y_train from train_data (drop 'converted' and 'session_id')
# TODO: Create X_test and y_test from test_data (drop 'converted' and 'session_id')
# TODO: Print shapes and conversion rates for both datasets
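
As a rough sketch of the feature/target split, the snippet below uses a made-up frame `toy` with only a few of the columns. The identifier column is dropped because it carries no predictive signal, and the target is dropped so it cannot leak into the features.

```python
import pandas as pd

# Toy stand-in for train_data; values are made up for illustration.
toy = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4"],
    "age": [25, 41, 33, 58],
    "pages_viewed": [3, 1, 7, 2],
    "converted": [0, 0, 1, 0],
})

# Features: everything except the target and the identifier.
X_toy = toy.drop(columns=["converted", "session_id"])
# Target: the 0/1 conversion flag.
y_toy = toy["converted"]

print(X_toy.shape, y_toy.shape)                 # (4, 2) (4,)
print(f"Conversion rate: {y_toy.mean():.1%}")   # Conversion rate: 25.0%
```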

In [33]:
from sklearn.preprocessing import LabelEncoder

# TODO: Create encoded copies of X_train and X_test
# TODO: For each categorical column, fit LabelEncoder on training data
# TODO: Transform both training and test data using the same encoder
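
The encoding pattern described in the TODOs (fit on training data, reuse on test data) might look like the sketch below. The frames `toy_train` and `toy_test` and the `encoders` dictionary are illustrative names, not part of the assignment.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for X_train / X_test (values made up).
toy_train = pd.DataFrame({"device_type": ["desktop", "mobile", "tablet", "mobile"]})
toy_test = pd.DataFrame({"device_type": ["tablet", "desktop"]})

toy_train_encoded = toy_train.copy()
toy_test_encoded = toy_test.copy()

encoders = {}  # one fitted encoder per categorical column
for col in ["device_type"]:
    encoder = LabelEncoder()
    # Learn the category -> integer mapping from the training data only.
    toy_train_encoded[col] = encoder.fit_transform(toy_train[col])
    # Reuse the same mapping on the test data.
    toy_test_encoded[col] = encoder.transform(toy_test[col])
    encoders[col] = encoder

print(encoders["device_type"].classes_.tolist())  # ['desktop', 'mobile', 'tablet']
print(toy_test_encoded["device_type"].tolist())   # [2, 0]
```

Note that `transform` raises a `ValueError` if the test data contains a category that was not present when the encoder was fitted, so it is worth checking the category lists from the exploration step.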

In [34]:
from sklearn.preprocessing import StandardScaler

# TODO: Create a StandardScaler instance
# TODO: Fit the scaler on X_train_encoded and transform both datasets
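
A minimal sketch of the scaling step on made-up arrays: the scaler learns the mean and standard deviation from the training data and applies those same statistics to the test data, so no information from the test set influences the transformation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for X_train_encoded / X_test_encoded.
toy_train = np.array([[20.0, 1.0], [40.0, 3.0], [60.0, 5.0]])
toy_test = np.array([[40.0, 7.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns mean/std from train
toy_test_scaled = scaler.transform(toy_test)        # reuses the train statistics

print(scaler.mean_)                    # per-column training means
print(toy_train_scaled.mean(axis=0))   # approximately zero in each column
print(toy_test_scaled)                 # test row expressed in training units
```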

In [35]:
from sklearn.linear_model import LogisticRegression

baseline_model = LogisticRegression(
    random_state=42,
    max_iter=1000  # Increase max iterations to ensure convergence
)

# TODO: Fit the model on X_train_scaled and y_train

In [36]:
# TODO: Use the baseline model to make predictions on X_test_scaled
# TODO: Compare predicted vs actual conversion counts and rates

## 4. Baseline Model Evaluation

In [37]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# TODO: Calculate accuracy, precision, recall, and F1-score for the baseline model

In [38]:
from sklearn.metrics import confusion_matrix, classification_report

# TODO: Generate and display the confusion matrix
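
To make the metrics concrete, here is a small worked example on hand-written labels (not model output from this assignment). It shows how each score follows from the four cells of the confusion matrix.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-written labels for illustration: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# For binary labels, rows are actual classes and columns are predicted classes.
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.700 precision=0.667 recall=0.500 f1=0.571
```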

How is the baseline model performing?

**TODO**


## 5. Addressing the Class Imbalance

In [15]:
!pip install imbalanced-learn

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd



In [39]:
print("=== RANDOM OVERSAMPLING ===")

ros = RandomOverSampler(random_state=42)

# TODO: Apply fit_resample to X_train_scaled and y_train
# TODO: Print the before/after sizes and class distributions

=== RANDOM OVERSAMPLING ===


In [40]:
print("=== RANDOM UNDERSAMPLING ===")

rus = RandomUnderSampler(random_state=42)

# TODO: Apply fit_resample to X_train_scaled and y_train
# TODO: Print the before/after sizes and class distributions

=== RANDOM UNDERSAMPLING ===


In [41]:
print("=== SMOTE OVERSAMPLING ===")

smote = SMOTE(random_state=42)

# TODO: Apply fit_resample to X_train_scaled and y_train
# TODO: Print the before/after sizes and class distributions

=== SMOTE OVERSAMPLING ===


In [42]:
# TODO: Create a comparison table showing the dataset sizes and class distributions
# TODO: Display method name, total size, class 0 count, class 1 count, and balance ratio
# This helps visualize the effect of each sampling technique

## 6. Balanced Models

In [43]:
print("=== TRAINING MODEL WITH RANDOM OVERSAMPLING ===")

model_ros = LogisticRegression(random_state=42, max_iter=1000)

# TODO: Train the model on X_train_ros and y_train_ros
# TODO: Make predictions on X_test_scaled

=== TRAINING MODEL WITH RANDOM OVERSAMPLING ===


In [44]:
print("=== TRAINING MODEL WITH RANDOM UNDERSAMPLING ===")

model_rus = LogisticRegression(random_state=42, max_iter=1000)

# TODO: Train the model on X_train_rus and y_train_rus
# TODO: Make predictions on X_test_scaled

=== TRAINING MODEL WITH RANDOM UNDERSAMPLING ===


In [45]:
print("=== TRAINING MODEL WITH SMOTE ===")

model_smote = LogisticRegression(random_state=42, max_iter=1000)

# TODO: Train the model on X_train_smote and y_train_smote
# TODO: Make predictions on X_test_scaled

=== TRAINING MODEL WITH SMOTE ===


In [46]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# TODO: Calculate accuracy, precision, recall, F1-score, and ROC-AUC for all models
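
One detail worth illustrating: ROC-AUC is computed from predicted probabilities (or scores), while the other four metrics use hard 0/1 predictions. The sketch below uses synthetic data and a hypothetical helper `evaluate`; it is one possible structure, not the required solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

def evaluate(model, X, y):
    """Return five metrics for a fitted binary classifier (hypothetical helper)."""
    y_pred = model.predict(X)               # hard 0/1 labels
    y_score = model.predict_proba(X)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y, y_pred),
        "precision": precision_score(y, y_pred, zero_division=0),
        "recall": recall_score(y, y_pred, zero_division=0),
        "f1": f1_score(y, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y, y_score),
    }

# Synthetic imbalanced data for illustration only.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

toy_model = LogisticRegression(random_state=42, max_iter=1000).fit(X_tr, y_tr)
toy_metrics = evaluate(toy_model, X_te, y_te)

for name, value in toy_metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

Calling the same helper once per model and collecting the results in a DataFrame would give a side-by-side comparison.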

Compare the performance of the models. Which model performs best, and on which metrics?

**TODO**
