# 🧑‍🏫 Task 1 Part 2: Build Your Own Logistic Regression Model for Sentiment Analysis
In this exercise, you will build a **logistic regression model** from scratch to perform sentiment analysis.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like `LogisticRegression` from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Load the Data
**Task:** Use `pandas` to load the dataset from a file named `IMDB_reviews.csv`.

> **Hint:** Use `pd.read_csv()` to load the file and display the first 5 rows.

**Question:** What are the key features and the target variable in this dataset?

In [5]:
# Load the dataset and display the first few rows
import pandas as pd

# Load the dataset (the local copy used here is named IMDB_Dataset.csv, not IMDB_reviews.csv)
df = pd.read_csv('IMDB_Dataset.csv')

# Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Answer: The only feature is the `review` text; the target variable is `sentiment` (positive or negative).

First 5 rows of the dataset:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
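
The log loss functions later in this notebook expect labels of 0 or 1, while `sentiment` holds the strings `positive` and `negative`. A minimal sketch of one way to encode the target and check class balance, shown on a toy DataFrame; the column name `label` and the toy rows are assumptions, not part of the dataset:

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the rows are illustrative only.
toy_df = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "Great cast"],
    "sentiment": ["positive", "negative", "positive"],
})

# Hypothetical column `label`: 1 for positive, 0 for negative.
toy_df["label"] = toy_df["sentiment"].map({"positive": 1, "negative": 0})

# Class balance: how many reviews fall into each class.
class_counts = toy_df["sentiment"].value_counts()

print(toy_df[["sentiment", "label"]])
print(class_counts)
```

The same two lines should apply to `df` unchanged, assuming its `sentiment` column contains only these two values.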


## Step 3: Tokenization and Text Cleaning
**Task:** Implement your own function to:
1. Convert all text to lowercase.
2. Remove punctuation and special characters.
3. Split the text into words (tokenization).

> **Hint:** Use Python string methods and list comprehensions.

**Question:** Why is tokenization important for text-based models?

In [18]:
import pandas as pd
import string

def preprocess_text(text):
    """
    Preprocess text by converting to lowercase, removing punctuation,
    and tokenizing into words.
    
    Args:
        text (str): Input text string
        
    Returns:
        list: List of preprocessed tokens
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    
    # Split into words (tokenization)
    tokens = text.split()
    
    return tokens

# Apply preprocessing to each review
df['preprocessed_tokens'] = df['review'].apply(preprocess_text)

# Display a sample of original and preprocessed text
print("Sample of preprocessed reviews:")
for original, tokens in zip(df['review'].head(), df['preprocessed_tokens'].head()):
    print("\nOriginal:", original)
    print("Preprocessed:", tokens)

# Answer: Tokenization turns raw text into discrete units (words) that can be counted and
# mapped to numeric features. It standardizes the input and is the basis for vocabulary
# building, feature extraction, and pattern recognition.

Sample of preprocessed reviews:

Original: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say th
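
The reviews contain HTML line breaks such as `<br />`. Stripping punctuation alone turns `me.<br /><br />The` into the tokens `mebr`, `br`, `the`, which pollutes the vocabulary. A possible variant of `preprocess_text`, sketched under the assumption that anything between `<` and `>` is markup; the function name `preprocess_text_no_html` is hypothetical:

```python
import re
import string

def preprocess_text_no_html(text):
    """Variant of preprocess_text that drops HTML tags first (illustrative)."""
    text = text.lower()
    # Replace anything that looks like an HTML tag, e.g. <br />, with a space
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove punctuation, as in the original function
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace
    return text.split()

sample = "They are right, as this is exactly what happened with me.<br /><br />The first thing"
tokens = preprocess_text_no_html(sample)
print(tokens)
```

Replacing the tag with a space rather than an empty string keeps the words on either side of the tag from being glued together.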

## Step 4: Create a Vocabulary
**Task:** Create a **vocabulary** (a list of unique words) from the tokenized dataset.

> **Hint:** Use a set to store unique words, then convert it to a list.

**Question:** How does vocabulary size affect model performance?

In [22]:
# Your code here
vocabulary = sorted(set(word for tokens in df['preprocessed_tokens'] for word in tokens))
print(f"Vocabulary size: {len(vocabulary)}\nFirst 10 words: {vocabulary[:10]}")

# Answer: A larger vocabulary increases model complexity and coverage, but requires more data to train effectively and may lead to sparsity issues.

Vocabulary size: 181685
First 10 words: ['\x08\x08\x08\x08a', '\x10own', '0', '00', '000', '0000000000001', '00000001', '000001', '00000110', '0001']
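
A vocabulary of 181,685 entries includes many rare strings, such as long digit runs and tokens with control characters. One common way to shrink it is to keep only words that appear in at least a minimum number of reviews. A sketch on toy data; the threshold `min_doc_freq` and the names `pruned_vocabulary` and `word_to_index` are assumptions:

```python
from collections import Counter

# Toy tokenized reviews standing in for df['preprocessed_tokens']
toy_tokens = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot", "xyzzy"],
]

# Document frequency: in how many reviews each word appears
doc_freq = Counter(word for tokens in toy_tokens for word in set(tokens))

min_doc_freq = 2  # hypothetical threshold; tune it on the real data
pruned_vocabulary = sorted(w for w, n in doc_freq.items() if n >= min_doc_freq)

# Map each kept word to a column index for later feature vectors
word_to_index = {w: i for i, w in enumerate(pruned_vocabulary)}

print(pruned_vocabulary)
print(word_to_index)
```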


## Step 5: Implement Word Count
**Task:** For every review, calculate and store the number of times each word appears in that review.

In [28]:
# Your code here
# Example: Write functions to calculate word counts
from collections import Counter

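# Calculate word counts for each review from the preprocessed tokens, so that
# punctuation attached to a word (e.g. "hooked.") does not create a separate entry
df['word_counts'] = df['preprocessed_tokens'].apply(lambda tokens: dict(Counter(tokens)))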

# Display results for the first review
print("\nFirst review text:")
print(df['review'].iloc[0])
print("\nWord counts for first review:")
print(df['word_counts'].iloc[0])

# Calculate total word counts across all reviews
total_counts = Counter()
for counts in df['word_counts']:
    total_counts.update(counts)

print("\nTop 10 most frequent words across all reviews:")
print(dict(total_counts.most_common(10)))


First review text:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the sh
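
The model in Step 7 works on numeric feature vectors, so the per-review word counts need to be laid out as fixed-length arrays. A minimal sketch, assuming a word-to-column mapping called `word_to_index` (hypothetical) built from the vocabulary; words outside the mapping are ignored:

```python
import numpy as np
from collections import Counter

# Hypothetical toy mapping from word to column index
word_to_index = {"great": 0, "movie": 1, "boring": 2}

def counts_to_vector(word_counts, word_to_index):
    """Turn a {word: count} dict into a dense count vector."""
    vec = np.zeros(len(word_to_index))
    for word, count in word_counts.items():
        idx = word_to_index.get(word)
        if idx is not None:  # skip words that are not in the vocabulary
            vec[idx] = count
    return vec

toy_counts = [
    dict(Counter(["great", "great", "movie"])),
    dict(Counter(["boring", "movie", "plot"])),
]
X = np.array([counts_to_vector(c, word_to_index) for c in toy_counts])
print(X)
```

With the full vocabulary of 181,685 words and 50,000 reviews, a dense matrix would hold roughly 9 billion entries, so in practice a pruned vocabulary or a sparse representation is likely needed.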

## Step 6: Train-Test Split
**Task:** Split the data into **80% training** and **20% testing** sets.

> **Hint:** Use `numpy` or list slicing to split the data manually.

**Question:** Why do we need to split the data for training and testing?

In [32]:
# Your code here

# Calculate split index
split_idx = int(len(df) * 0.8)

# Split the data
train_data = df[:split_idx]
test_data = df[split_idx:]

print(f"Training set size: {len(train_data)}, Testing set size: {len(test_data)}")

# Answer: We split the data to evaluate the model on examples it has not seen, because we
# must ensure that it generalizes well and doesn't just memorize the training data.

Training set size: 40000, Testing set size: 10000
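
Plain slicing keeps the rows in file order. If the file happened to be grouped by label or by source (not verified for this dataset), the two sets could end up with different class balances. A sketch of a shuffled split using only `numpy` and `pandas`, shown on toy data; the seed value is arbitrary:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df, deliberately sorted by label
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive"] * 5 + ["negative"] * 5,
})

# Shuffle row positions with a fixed seed so the split is reproducible
rng = np.random.default_rng(seed=42)
shuffled_idx = rng.permutation(len(toy_df))

split_idx = int(len(toy_df) * 0.8)
train_toy = toy_df.iloc[shuffled_idx[:split_idx]]
test_toy = toy_df.iloc[shuffled_idx[split_idx:]]

print(f"Training set size: {len(train_toy)}, Testing set size: {len(test_toy)}")
```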


## Step 7: Building the Logistic Regression Model (Divided Steps)

### Part 1: The Prediction functions
The **prediction function** returns the predicted value of the data point using the weights and the bias. It uses the sigmoid function to convert the prediction into a value in the range of 0 to 1.

**Task:** Implement the sigmoid and prediction functions

In [34]:
import numpy as np

def sigmoid(x):
    """
    Compute sigmoid function: f(x) = 1 / (1 + e^(-x))
    """
    return 1 / (1 + np.exp(-x))

def lr_prediction(weights, bias, features):
    """
    Calculate prediction using logistic regression
    weights: array of feature weights
    bias: scalar bias term
    features: input feature vector
    """
    z = np.dot(features, weights) + bias
    return sigmoid(z)

# Test the functions
test_weights = np.array([0.1, -0.2, 0.3])
test_features = np.array([2, 1, 3])
test_bias = 0.5

prediction = lr_prediction(test_weights, test_bias, test_features)
print(f"Test prediction: {prediction}")

# Test sigmoid function separately
print(f"Sigmoid of 0: {sigmoid(0)}") # Should be 0.5
print(f"Sigmoid of 2: {sigmoid(2)}") # Should be ~0.88
print(f"Sigmoid of -2: {sigmoid(-2)}") # Should be ~0.12

Test prediction: 0.8021838885585817
Sigmoid of 0: 0.5
Sigmoid of 2: 0.8807970779778823
Sigmoid of -2: 0.11920292202211755
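
With word-count features, `z` can become large in magnitude, and `np.exp(-x)` overflows for very negative `x`. A numerically safer variant, sketched here for array inputs, picks a different but algebraically equivalent formula depending on the sign of `x`; the name `stable_sigmoid` is hypothetical:

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that avoids overflow in np.exp for large |x| (array input)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use exp(x) / (1 + exp(x)), where exp(x) < 1
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

values = stable_sigmoid([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(values)
```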


### Part 2: Implementing the Error functions
**Task:** Implement the log loss (error) for a single data point, and the average log loss over a whole dataset.

In [36]:
import numpy as np

def log_loss(weights, bias, features, label):
    """
    Calculate log loss for a single data point

    Args:
        weights: feature weights
        bias: bias term
        features: input feature vector
        label: true label (0 or 1)

    Returns:
        float: log loss for this example
    """
    prediction = lr_prediction(weights, bias, features)
    # Add small epsilon to avoid log(0)
    epsilon = 1e-15
    return -label * np.log(prediction + epsilon) - (1 - label) * np.log(1 - prediction + epsilon)

def total_log_loss(weights, bias, X, y):
    """
    Calculate average log loss for entire dataset

    Args:
        weights: feature weights
        bias: bias term
        X: feature matrix (n_samples x n_features)
        y: true labels array

    Returns:
        float: average log loss across all examples
    """
    total_loss = 0
    for features, label in zip(X, y):
        total_loss += log_loss(weights, bias, features, label)
    return total_loss / len(y)

# Test the functions
test_weights = np.array([0.1, -0.2, 0.3])
test_features = np.array([2, 1, 3])
test_bias = 0.5
test_label = 1

# Test single example loss
single_loss = log_loss(test_weights, test_bias, test_features, test_label)
print(f"Single example log loss: {single_loss:.4f}")

# Test total loss
test_X = np.array([[2, 1, 3], [1, 2, 1], [3, 1, 2]])
test_y = np.array([1, 0, 1])
total_loss = total_log_loss(test_weights, test_bias, test_X, test_y)
print(f"Average log loss: {total_loss:.4f}")

Single example log loss: 0.2204
Average log loss: 0.4859
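
The loop in `total_log_loss` calls the prediction function once per row, which may be slow on 40,000 training reviews. A vectorized sketch that computes the same average in one pass; it clips the predictions instead of adding epsilon inside the logarithm, which is a slightly different (assumed) way of avoiding `log(0)`:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def total_log_loss_vectorized(weights, bias, X, y, epsilon=1e-15):
    """Average log loss over all rows of X, computed without a Python loop."""
    predictions = sigmoid(X @ weights + bias)
    # Keep predictions strictly inside (0, 1) so both logs are finite
    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    losses = -y * np.log(predictions) - (1 - y) * np.log(1 - predictions)
    return float(np.mean(losses))

test_weights = np.array([0.1, -0.2, 0.3])
test_bias = 0.5
test_X = np.array([[2, 1, 3], [1, 2, 1], [3, 1, 2]])
test_y = np.array([1, 0, 1])

vectorized_loss = total_log_loss_vectorized(test_weights, test_bias, test_X, test_y)
print(f"Average log loss (vectorized): {vectorized_loss:.4f}")
```

On the three test rows above this should agree with the loop-based result to four decimal places.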


### Part 3: Update Weights
The **`lr_update_weights`** function adjusts the weights and bias based on whether a point is correctly or incorrectly classified. It is a simple way to improve the model at every iteration:
1. **Correctly classified points:** Move the line **away** from the point.
2. **Incorrectly classified points:** Move the line **towards** the point.

**Task:** Implement the gradient update function based on these rules.

In [None]:
# Your code
def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    return
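
One common way to realize these rules is a single gradient step on the log loss: each weight moves by `learning_rate * (label - prediction) * feature`, and the bias by `learning_rate * (label - prediction)`. The step is large when the prediction is far from the label and small when it is already close. A sketch under that assumption, not necessarily the intended solution; the name `lr_update_weights_sketch` is hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lr_update_weights_sketch(weights, bias, features, label, learning_rate=0.01):
    """One gradient step on the log loss for a single data point."""
    prediction = sigmoid(np.dot(features, weights) + bias)
    error = label - prediction  # positive if the prediction is too low
    new_weights = weights + learning_rate * error * features
    new_bias = bias + learning_rate * error
    return new_weights, new_bias

test_weights = np.array([0.1, -0.2, 0.3])
test_features = np.array([2, 1, 3])
test_bias = 0.5

new_weights, new_bias = lr_update_weights_sketch(test_weights, test_bias, test_features, label=1)
print(new_weights, new_bias)
```

The function returns new arrays rather than modifying `weights` in place, so the caller decides whether to keep the update.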

### Part 4: Implementing the Logistic Regression Algorithm
**Task:** Use the function to update weights to train the logistic regression model over multiple epochs. Keep track of the total error for each epoch. You will later plot these errors.

In [None]:
# Implement the logistic regression model
def lr_algorithm(features, labels, learning_rate=0.01, epochs=200):
    return
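
A possible shape for the training loop, sketched on toy data: start from zero weights, apply a gradient step for every training point in each epoch, and record the average log loss after each epoch. Visiting every point per epoch, the zero initialization, and the name `lr_algorithm_sketch` are assumptions; sampling one random point per epoch is another option.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lr_algorithm_sketch(features, labels, learning_rate=0.01, epochs=200):
    """Train logistic regression with per-point gradient steps."""
    weights = np.zeros(features.shape[1])
    bias = 0.0
    errors = []  # average log loss recorded after each epoch
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = y - sigmoid(np.dot(x, weights) + bias)
            weights = weights + learning_rate * error * x
            bias = bias + learning_rate * error
        preds = np.clip(sigmoid(features @ weights + bias), 1e-15, 1 - 1e-15)
        loss = np.mean(-labels * np.log(preds) - (1 - labels) * np.log(1 - preds))
        errors.append(float(loss))
    return weights, bias, errors

# Toy, linearly separable data: label 1 when the first feature is active
toy_X = np.array([[2., 0.], [3., 0.], [1., 0.], [0., 2.], [0., 3.], [0., 1.]])
toy_y = np.array([1, 1, 1, 0, 0, 0])

weights, bias, errors = lr_algorithm_sketch(toy_X, toy_y, learning_rate=0.1, epochs=50)
print(f"Loss after first epoch: {errors[0]:.4f}, after last epoch: {errors[-1]:.4f}")
```

The returned `errors` list is what the later plotting step would consume.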

## Step 8: Evaluate Your Model
**Task:** Calculate the accuracy of the model. Compare the predicted labels with the actual labels.

> **Hint:** Use the formula for accuracy: (Correct Predictions / Total Predictions) * 100

**Question:** Which metric—accuracy, precision, or recall—is most important for sentiment analysis?

In [None]:
# Your code here
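
Accuracy, precision, and recall can all be derived from the four confusion-matrix counts. A minimal sketch using only `numpy`, assuming labels are encoded as 1 (positive) and 0 (negative); the function name `classification_metrics` and the toy arrays are hypothetical:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return (accuracy in percent, precision, recall) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    accuracy = (tp + tn) / len(y_true) * 100
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
accuracy, precision, recall = classification_metrics(toy_true, toy_pred)
print(f"Accuracy: {accuracy:.1f}%, Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Predicted labels can be obtained by thresholding the model's probabilities, for example at 0.5.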


## Step 9: Visualize the Errors
**Task:** Create a scatter plot of the total errors over the training epochs. The plot should show a gradual decrease in errors, stabilizing as the model converges.

In [None]:
#Your code here

## Step 10: Make Predictions on New Data
**Task:** Use your trained model to predict the sentiment of the following review:

> _"The movie was absolutely fantastic and kept me hooked till the end."_

**Question:** What challenges might arise when predicting on new data?

In [None]:
# Your code here
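
A new review has to go through the same preprocessing and vectorization as the training data, and any word the model never saw contributes nothing to the prediction. A sketch of that flow; the three-word vocabulary, the weights, and the bias below are made-up placeholders, not trained values:

```python
import string
import numpy as np

# Hypothetical toy vocabulary and parameters, standing in for the trained model
word_to_index = {"fantastic": 0, "boring": 1, "hooked": 2}
weights = np.array([1.5, -2.0, 0.8])
bias = -0.1

def predict_sentiment(review, word_to_index, weights, bias):
    """Preprocess a review, build its count vector, and classify it."""
    tokens = review.lower().translate(str.maketrans("", "", string.punctuation)).split()
    features = np.zeros(len(word_to_index))
    unknown = []  # tokens that are not in the vocabulary
    for token in tokens:
        idx = word_to_index.get(token)
        if idx is None:
            unknown.append(token)
        else:
            features[idx] += 1
    probability = 1 / (1 + np.exp(-(np.dot(features, weights) + bias)))
    label = "positive" if probability >= 0.5 else "negative"
    return label, probability, unknown

review = "The movie was absolutely fantastic and kept me hooked till the end."
label, probability, unknown = predict_sentiment(review, word_to_index, weights, bias)
print(f"Predicted: {label} (p = {probability:.4f}), unknown tokens: {len(unknown)}")
```

Tracking the unknown tokens makes one of the main challenges visible: when most of a review falls outside the vocabulary, the prediction rests on very few words.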


## Step 11: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

### Notes (if any):