<a href="https://colab.research.google.com/github/adulu-bang/ML_BOOTCAMP_24_IET_NITK/blob/main/Session1/Tasks/Task%202%20-%20Logistic%20Regression/Task-1_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧑‍🏫 Task 1 Part 2: Build Your Own Logistic Regression Model for Sentiment Analysis
In this exercise, you will build a **logistic regression model** from scratch to perform sentiment analysis.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like `LogisticRegression` from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Load the Data
**Task:** Use `pandas` to load the dataset from a file named `IMDB_reviews.csv`.

> **Hint:** Use `pd.read_csv()` to load the file and display the first 5 rows.

**Question:** What are the key features and the target variable in this dataset?

In [4]:
# Load the dataset and display the first few rows
import pandas as pd
import numpy as np

reviews = pd.read_csv('/content/IMDB_Dataset.csv')
reviews.head()

# Key feature: the review text. Target variable: the sentiment label.









| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
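The `Unnamed: 0` column in the output above is the row index that was saved into the CSV, not a feature. A minimal sketch of one way to drop it at load time, using `index_col=0`, is shown below. The inline CSV text is a made-up stand-in for the real file so that the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV file (assumption for illustration).
csv_text = """,review,sentiment
0,A wonderful little production.,positive
1,Basically there's a family ...,negative
"""

# index_col=0 treats the first (unnamed) column as the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.columns.tolist())              # remaining columns: feature and target
print(sample["sentiment"].value_counts())   # class balance of the target
```

Checking `value_counts()` on the real data is also a quick way to see whether the two sentiment classes are balanced.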


## Step 3: Tokenization and Text Cleaning
**Task:** Implement your own function to:
1. Convert all text to lowercase.
2. Remove punctuation and special characters.
3. Split the text into words (tokenization).

> **Hint:** Use Python string methods and list comprehensions.

**Question:** Why is tokenization important for text-based models?

In [6]:
# Write your own tokenizer function
import re


def tokenizer(text):
    text = text.lower()  # convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation and special characters
    tokens = text.split()  # split on whitespace into separate words
    return tokens


# Quick checks (uncomment to run):
# print(tokenizer("This, !s a t$est.!"))
# print(tokenizer(reviews['review'][0]))

tokenized_reviews = [tokenizer(review) for review in reviews['review']]  # a list of token lists
print(tokenized_reviews[2])

# Tokenization matters for text-based models because the individual words are
# typically used as features, and their frequencies drive the predictions.
# It also yields a list of unique words (tokens), each of which can be assigned a numerical index.
# Finally, it reduces the amount of data to process, since punctuation and
# special characters no longer need to be handled.











['i', 'thought', 'this', 'was', 'a', 'wonderful', 'way', 'to', 'spend', 'time', 'on', 'a', 'too', 'hot', 'summer', 'weekend', 'sitting', 'in', 'the', 'air', 'conditioned', 'theater', 'and', 'watching', 'a', 'lighthearted', 'comedy', 'the', 'plot', 'is', 'simplistic', 'but', 'the', 'dialogue', 'is', 'witty', 'and', 'the', 'characters', 'are', 'likable', 'even', 'the', 'well', 'bread', 'suspected', 'serial', 'killer', 'while', 'some', 'may', 'be', 'disappointed', 'when', 'they', 'realize', 'this', 'is', 'not', 'match', 'point', '2', 'risk', 'addiction', 'i', 'thought', 'it', 'was', 'proof', 'that', 'woody', 'allen', 'is', 'still', 'fully', 'in', 'control', 'of', 'the', 'style', 'many', 'of', 'us', 'have', 'grown', 'to', 'lovebr', 'br', 'this', 'was', 'the', 'most', 'id', 'laughed', 'at', 'one', 'of', 'woodys', 'comedies', 'in', 'years', 'dare', 'i', 'say', 'a', 'decade', 'while', 'ive', 'never', 'been', 'impressed', 'with', 'scarlet', 'johanson', 'in', 'this', 'she', 'managed', 'to', 'to
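
The output above contains tokens such as `'br'` and `'lovebr'`. The reviews include `<br />` tags, and removing only the punctuation leaves the tag name behind, sometimes glued to the previous word. A possible variant replaces tags with a space before the other cleaning steps. The name `tokenizer_strip_html` is an illustrative choice, not part of the task.

```python
import re


def tokenizer_strip_html(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # replace HTML tags such as <br /> with a space
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation and special characters
    return text.split()                    # split on whitespace


sample_tokens = tokenizer_strip_html("Grown to love.<br /><br />This was GREAT!")
print(sample_tokens)
```

Replacing the tag with a space rather than an empty string keeps the words on either side of the tag separate.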

## Step 4: Create a Vocabulary
**Task:** Create a **vocabulary** (a list of unique words) from the tokenized dataset.

> **Hint:** Use a set to store unique words, then convert it to a list.

**Question:** How does vocabulary size affect model performance?

In [7]:
# Your code here




def create_vocabulary(tokenized_dataset):
    vocabulary = set()  # Use a set to store unique words in reviews
    for review_tokens in tokenized_dataset: # iteration through each review(list of tokens)
        for token in review_tokens:# iteration through individual tokens in each review
            vocabulary.add(token)  # Add each token to the vocabulary set
    return list(vocabulary)  # Convert the set to a list


vocabulary = create_vocabulary(tokenized_reviews)
print(vocabulary)

# A larger vocabulary can represent a wider variety of words, which can improve performance.
# However, it also raises the risk of overfitting and increases the computational cost.

# A smaller vocabulary reduces representational power, but lowers computational cost and overfitting risk.
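
One common way to control vocabulary size is to keep only words that occur at least a few times, and to give each kept word a fixed column index. The sketch below assumes a hypothetical helper `build_vocabulary` and a `min_count` threshold. Neither is required by the task.

```python
from collections import Counter


def build_vocabulary(tokenized_dataset, min_count=2):
    # Count how often each token occurs across all reviews
    counts = Counter(token for tokens in tokenized_dataset for token in tokens)
    # Keep tokens seen at least min_count times; sort for a reproducible order
    vocab = sorted(token for token, count in counts.items() if count >= min_count)
    # Map each kept token to a column index
    word_to_index = {token: i for i, token in enumerate(vocab)}
    return vocab, word_to_index


toy_reviews = [["good", "movie"], ["bad", "movie"], ["good", "plot", "twist"]]
vocab, word_to_index = build_vocabulary(toy_reviews, min_count=2)
print(vocab)
print(word_to_index)
```

Sorting matters because the iteration order of a `set` of strings is not guaranteed to be the same between runs, and the trained weights only make sense if each word keeps the same index.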





## Step 5: Implement Word Count
**Task:** For each review, calculate and store the number of times each word appears in that review.

In [11]:
# Your code here
# Example: Write functions to calculate word counts
from collections import Counter


def calculate_word_counts(tokenized_reviews, vocabulary):
    # A membership test on a list scans the whole list; a set makes each lookup O(1).
    vocabulary_set = set(vocabulary)
    word_counts = []
    for review_tokens in tokenized_reviews:
        # Counter tallies how often each token occurs in this review
        review_word_count = Counter(review_tokens)
        # Keep only the tokens that are in the vocabulary
        filtered_word_count = {token: count for token, count in review_word_count.items()
                               if token in vocabulary_set}
        word_counts.append(filtered_word_count)
    return word_counts


# One {word: count} dictionary per review
word_counts = calculate_word_counts(tokenized_reviews, vocabulary)
print(word_counts[0])




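
The logistic regression functions later in this notebook take numeric feature vectors, so the per-review counts eventually need to become rows of a matrix. A minimal sketch under the assumption that a word-to-column mapping is available (the names `build_feature_matrix` and `toy_index` are hypothetical):

```python
from collections import Counter

import numpy as np


def build_feature_matrix(tokenized_reviews, word_to_index):
    # One row per review, one column per vocabulary word
    X = np.zeros((len(tokenized_reviews), len(word_to_index)), dtype=np.float32)
    for row, tokens in enumerate(tokenized_reviews):
        for token, count in Counter(tokens).items():
            col = word_to_index.get(token)
            if col is not None:          # skip words outside the vocabulary
                X[row, col] = count
    return X


toy_reviews = [["good", "good", "movie"], ["bad", "movie"]]
toy_index = {"bad": 0, "good": 1, "movie": 2}
X_toy = build_feature_matrix(toy_reviews, toy_index)
print(X_toy)
```

With a full vocabulary, a dense matrix like this can become very large, so limiting the vocabulary first is one way to keep memory use manageable.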

## Step 6: Train-Test Split
**Task:** Split the data into **80% training** and **20% testing** sets.

> **Hint:** Use `numpy` or list slicing to split the data manually.

**Question:** Why do we need to split the data for training and testing?

In [9]:
# Your code here

split_index = int(0.8 * len(reviews))  # 80% of the data for training, 20% testing

train_reviews = reviews['review'][:split_index]
train_sentiments = reviews['sentiment'][:split_index]

test_reviews = reviews['review'][split_index:]
test_sentiments = reviews['sentiment'][split_index:]
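
Slicing takes the first 80% of the rows in file order, which is only representative if the file is not sorted by label. A hedged alternative is to shuffle the row indices first, and to encode the labels as 0/1 so they can be used in the loss later. The function name `shuffled_split` and the seed are illustrative choices, not part of the task.

```python
import numpy as np


def shuffled_split(n_rows, train_fraction=0.8, seed=0):
    rng = np.random.default_rng(seed)       # fixed seed for a reproducible split
    indices = rng.permutation(n_rows)       # shuffled row positions
    split_index = int(train_fraction * n_rows)
    return indices[:split_index], indices[split_index:]


# Toy stand-in for the sentiment column
toy_sentiments = np.array(["positive", "negative"] * 5)
toy_labels = (toy_sentiments == "positive").astype(int)   # positive -> 1, negative -> 0

train_idx, test_idx = shuffled_split(len(toy_labels))
y_train, y_test = toy_labels[train_idx], toy_labels[test_idx]
print(len(y_train), len(y_test))
```

With the DataFrame, the same index arrays could be applied as `reviews.iloc[train_idx]` and `reviews.iloc[test_idx]`. Keeping the test rows out of training is what makes the test accuracy an estimate of performance on unseen reviews.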






## Step 7: Building the Logistic Regression Model (Divided Steps)

### Part 1: The Prediction functions
The **prediction function** returns the predicted value of the data point using the weights and the bias. It uses the sigmoid function to convert the prediction into a value in the range of 0 to 1.

**Task:** Implement the sigmoid and prediction functions

In [14]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def lr_prediction(weights, bias, features):
    z = np.dot(features, weights) + bias
    return sigmoid(z)
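
For large negative inputs, `np.exp(-x)` overflows and NumPy emits a runtime warning. One common workaround, sketched here under the name `stable_sigmoid` (not part of the task template), only ever exponentiates non-positive numbers:

```python
import numpy as np


def stable_sigmoid(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))   # exponent is never positive, so this cannot overflow
    # For x >= 0: 1 / (1 + exp(-x)); for x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


values = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(values)
```

Both branches are algebraically the same function as the plain sigmoid, so this could be swapped in without changing the predictions.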



### Part 2: Implementing the Error functions
**Task:** Implement the log loss for a single data point and the total log loss over the whole dataset.

In [None]:
def log_loss(weights, bias, features, label):
    return

def total_log_loss(weights, bias, X, y):
    return
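
One possible way to fill in these stubs is the standard log loss, `-y*log(p) - (1-y)*log(1-p)`, where `p` is the predicted probability. The sketch below is a suggestion rather than the reference solution. The `eps` clipping parameter is an addition that guards against `log(0)`.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def lr_prediction(weights, bias, features):
    return sigmoid(np.dot(features, weights) + bias)


def log_loss(weights, bias, features, label, eps=1e-12):
    pred = lr_prediction(weights, bias, features)
    pred = np.clip(pred, eps, 1 - eps)   # assumption: clip to avoid log(0)
    return -label * np.log(pred) - (1 - label) * np.log(1 - pred)


def total_log_loss(weights, bias, X, y):
    # Sum of the per-point losses over the dataset
    return sum(log_loss(weights, bias, X[i], y[i]) for i in range(len(X)))


X_toy = np.array([[1.0, 0.0], [2.0, 3.0]])
y_toy = np.array([0, 1])
loss_at_zero = total_log_loss(np.zeros(2), 0.0, X_toy, y_toy)
print(loss_at_zero)
```

With all-zero weights every prediction is 0.5, so each point contributes `ln 2` to the total. That makes a handy sanity check before training.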

### Part 3: Update Weights
The **Update_Weights** function adjusts the weights and bias depending on whether a point is correctly or incorrectly classified. It is a simple way of improving the model at every iteration:
1. **Correctly classified points:** Move the line **away** from the point.
2. **Incorrectly classified points:** Move the line **towards** the point.

**Task:** Implement the gradient update function based on these rules.

In [None]:
# Your code here
def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    return
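
Both rules can be expressed through the single quantity `label - prediction`: it is close to zero for confidently correct points and large for misclassified ones. The sketch below is one possible way to complete the stub, not the reference solution. It applies that difference as a gradient step on the log loss for a single point.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def lr_prediction(weights, bias, features):
    return sigmoid(np.dot(features, weights) + bias)


def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    pred = lr_prediction(weights, bias, features)
    error = label - pred                                   # in (-1, 1)
    new_weights = weights + learning_rate * error * features
    new_bias = bias + learning_rate * error
    return new_weights, new_bias


w, b = lr_update_weights(np.zeros(2), 0.0, np.array([1.0, 2.0]),
                         label=1, learning_rate=0.1)
print(w, b)
```

Returning new arrays instead of modifying `weights` in place is a design choice here. It keeps the caller's original values intact, at the cost of one extra allocation per update.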

### Part 4: Implementing the Logistic Regression Algorithm
**Task:** Use the weight-update function to train the logistic regression model over multiple epochs. Keep track of the total error for each epoch. You will later plot these errors.

In [None]:
# Implement the logistic regression model
def lr_algorithm(features, labels, learning_rate=0.01, epochs=200):
    return
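
A minimal sketch of how such a training loop might look is given below. It assumes that each epoch visits every point once in order, which is one option among several (picking a random point per step is another). The toy data is invented for the example.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def lr_prediction(weights, bias, features):
    return sigmoid(np.dot(features, weights) + bias)


def total_log_loss(weights, bias, X, y, eps=1e-12):
    preds = np.clip(lr_prediction(weights, bias, X), eps, 1 - eps)
    return float(-np.sum(y * np.log(preds) + (1 - y) * np.log(1 - preds)))


def lr_algorithm(features, labels, learning_rate=0.01, epochs=200):
    weights = np.zeros(features.shape[1])
    bias = 0.0
    errors = []
    for _ in range(epochs):
        # One pass over the data, updating after every point
        for x_i, y_i in zip(features, labels):
            error = y_i - lr_prediction(weights, bias, x_i)
            weights = weights + learning_rate * error * x_i
            bias = bias + learning_rate * error
        # Record the total error at the end of the epoch
        errors.append(total_log_loss(weights, bias, features, labels))
    return weights, bias, errors


# Invented, linearly separable toy data
X_toy = np.array([[1, 0], [0, 2], [1, 1], [1, 2],
                  [1, 3], [2, 2], [2, 3], [3, 2]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

weights, bias, errors = lr_algorithm(X_toy, y_toy, learning_rate=0.1, epochs=200)
print(errors[0], errors[-1])
```

The returned `errors` list has one entry per epoch, which is the shape needed for the error plot requested later.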

## Step 8: Evaluate Your Model
**Task:** Calculate the accuracy of the model. Compare the predicted labels with the actual labels.

> **Hint:** Use the formula for accuracy: (Correct Predictions / Total Predictions) * 100

**Question:** Which metric—accuracy, precision, or recall—is most important for sentiment analysis?

In [None]:
# Your code here
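
As a rough sketch, accuracy, precision and recall can all be computed from 0/1 label arrays with plain `numpy`. The helper name `classification_metrics`, the 0.5 threshold and the toy arrays are assumptions for illustration.

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))   # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))   # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # false negatives
    accuracy = float(np.mean(y_pred == y_true)) * 100
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall


probabilities = np.array([0.9, 0.2, 0.6, 0.4])     # made-up model outputs
predicted = (probabilities >= 0.5).astype(int)     # threshold at 0.5
actual = np.array([1, 0, 0, 1])

accuracy, precision, recall = classification_metrics(actual, predicted)
print(accuracy, precision, recall)
```

Accuracy follows the formula from the hint and is returned as a percentage. Precision and recall are returned as fractions between 0 and 1.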


## Step 8: Visualize the Errors  
**Task:** Create a scatter plot of the total errors over the training epochs. The plot should show a gradual decrease in errors, stabilizing as the model converges.

In [None]:
#Your code here

## Step 9: Make Predictions on New Data
**Task:** Use your trained model to predict the sentiment of the following review:

> _"The movie was absolutely fantastic and kept me hooked till the end."_

**Question:** What challenges might arise when predicting on new data?

In [None]:
# Your code here
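
A new review has to go through the same tokenization and word-to-column mapping as the training data, and any word the model has never seen has no column to land in. The sketch below illustrates this with a hypothetical three-word vocabulary and hand-picked weights. A real run would use the trained values instead.

```python
import re

import numpy as np


def tokenizer(text):
    text = re.sub(r'[^\w\s]', '', text.lower())   # lowercase, strip punctuation
    return text.split()


def vectorize(text, word_to_index):
    features = np.zeros(len(word_to_index))
    unknown = []                                  # words outside the vocabulary
    for token in tokenizer(text):
        if token in word_to_index:
            features[word_to_index[token]] += 1
        else:
            unknown.append(token)
    return features, unknown


# Hypothetical tiny vocabulary and weights, for illustration only
word_to_index = {"fantastic": 0, "boring": 1, "movie": 2}
weights = np.array([2.0, -2.0, 0.0])
bias = 0.0

review = "The movie was absolutely fantastic and kept me hooked till the end."
features, unknown = vectorize(review, word_to_index)
probability = 1 / (1 + np.exp(-(np.dot(features, weights) + bias)))
print("positive" if probability >= 0.5 else "negative")
print("ignored words:", unknown)
```

Ignoring unknown words is the simplest policy. It also means a review made up mostly of unseen words falls back to whatever the bias alone predicts.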


## Step 10: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

### Notes (if any):