# Notebook 4: Baseline Modeling

Before we use a 'heavy' model like BERT, we need a simple, cheap model to compare it against. We call this a **Baseline**.

### The Plan:
1. **Vectorization**: Turn words into numbers.
2. **Logistic Regression**: A simple model that acts like a weighted checklist.
3. **Testing**: See if it actually works.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Load our clean data
df = pd.read_csv("../data/processed/clean_reviews.csv")
X = df['final_text'].astype(str)
y = df['label']

print(f"We have {len(X)} reviews ready for training.")

We have 1000 reviews ready for training.


## 1. Vectorization (Text to Numbers)

**In Plain English**: Computers can't read text the way we do. We have to turn each review into a row in a grid of numbers.

![TF-IDF Visual](assets/tfidf_visual.png)

- Imagine a giant table where every unique word has its own column.
- If a review contains a word, we put a number in that column.
- **TF-IDF** is a 'smart' version of this: a word scores higher when it appears often in one review but rarely across all reviews, so common words like 'the' or 'and' get fewer points.

In [2]:
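# Keep at most the 2,000 most frequent words as columns
vectorizer = TfidfVectorizer(max_features=2000)
# Learn the vocabulary from X and build the TF-IDF table in one step
X_numbers = vectorizer.fit_transform(X)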

print(f"Our 'Math Table' has {X_numbers.shape[1]} columns (unique words).")

Our 'Math Table' has 124 columns (unique words).


## 2. Logistic Regression (The Decision Maker)

**What it does**: It looks at our 'Math Table' and learns which words are linked to which sentiment.

- **The Goal**: To find mathematical boundaries that separate Positive, Neutral, and Negative reviews.
- **How it works**: It assigns a 'weight' to each word for each sentiment. We would expect a word like 'excellent' to get a high weight for Positive, and a word like 'terrible' a high weight for Negative.
- **The Output**: For any new review, it calculates a score and places it in the most likely category.

In [3]:
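# Default settings (L2 regularization, up to 100 iterations)
model = LogisticRegression()
# Fit on all reviews; no data is held out for evaluation here
model.fit(X_numbers, y)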

print("Training complete! The model has finished its checklist.")

Training complete! The model has finished its checklist.


## 3. Quick Test
Let's see if this simple 'Checklist' model can actually guess sentiments correctly.

In [4]:
test_sentences = [
    "This product is absolutely amazing!",
    "It was a total waste of money and time.",
    "It is an average product, nothing special."
]

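# Reuse the already-fitted vectorizer: transform only, never re-fit on test text
test_numbers = vectorizer.transform(test_sentences)
predictions = model.predict(test_numbers)

# Label encoding assumed by this notebook; confirm it matches the 'label' column
sentiment_map = {0: "Negative", 1: "Neutral", 2: "Positive"}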

for text, p in zip(test_sentences, predictions):
    print(f"Review: '{text}'")
    print(f"Guessed Sentiment: {sentiment_map[p]}\n")

Review: 'This product is absolutely amazing!'
Guessed Sentiment: Positive

Review: 'It was a total waste of money and time.'
Guessed Sentiment: Positive

Review: 'It is an average product, nothing special.'
Guessed Sentiment: Positive
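The model guessed Positive for all three sentences, including the clearly negative one and the neutral one, so three hand-written examples are not enough to judge it. A more reliable check is to hold out part of the data and count the labels. The sketch below is one way to do that, not the notebook's own method: `evaluate_baseline` is a hypothetical helper, and the toy data stands in for `X` and `y`.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def evaluate_baseline(texts, labels, test_size=0.25, seed=42):
    """Hypothetical helper: fit TF-IDF + logistic regression on a training
    split and return (label counts, accuracy on the held-out split)."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # The pipeline fits the vectorizer on training text only, so the
    # held-out reviews do not influence the vocabulary.
    pipeline = make_pipeline(
        TfidfVectorizer(max_features=2000), LogisticRegression()
    )
    pipeline.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, pipeline.predict(X_test))
    return Counter(labels), accuracy


# Toy stand-in data; in the notebook you would pass X and y instead.
toy_texts = [
    "terrible waste of money",
    "average nothing special",
    "absolutely amazing product",
] * 8
toy_labels = [0, 1, 2] * 8

label_counts, accuracy = evaluate_baseline(toy_texts, toy_labels)
print(f"Label counts: {dict(label_counts)}")
print(f"Held-out accuracy: {accuracy:.2f}")
```

If the label counts on the real data turn out to be heavily skewed toward one class, a model that always predicts that class can still score well on accuracy. That would be one possible explanation for the all-Positive guesses above, but the notebook does not show the label distribution, so it needs checking.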

