# Sentiment analysis model using scikit-learn with the IMDb reviews dataset

# 📘 Introduction

**Sentiment analysis** is a natural language processing (NLP) technique used to identify and classify opinions expressed in a piece of text, with the goal of determining whether the sentiment is positive, negative, or neutral. This kind of analysis is widely used in areas such as customer feedback monitoring, social media analysis, and product or service review mining.

In this project, we trained a binary sentiment classification model using the well-known **IMDb movie reviews dataset**. Each review is labeled as either positive or negative, making it an ideal dataset for applying and comparing various machine learning algorithms.

---

## 🎯 Objective

The main goal of this project was to build a system capable of **automatically predicting whether a movie review is positive or negative**, following these steps:

1. Text cleaning and preprocessing  
2. Text vectorization using **TF-IDF**  
3. Training and comparing multiple classification models  
4. Evaluating model performance using key metrics  
5. Using the best-performing model to make predictions on new, unseen text

---

## 🤖 Models evaluated

The following machine learning models were trained and compared:

- **Logistic Regression**
- **Random Forest**
- **Multinomial Naive Bayes**
- **Linear Support Vector Classifier (Linear SVC)**
- **K-Nearest Neighbors (KNN)**

After comparing accuracy, precision, recall, F1-score, and training time, **`Logistic Regression`** showed the best overall performance, combining high accuracy with fast execution.
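To make these metrics concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 follow from confusion-matrix counts. The helper `binary_metrics` is hypothetical and the counts are made-up placeholders chosen only to show the arithmetic; they are not results from this project.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Placeholder counts, for illustration only (not measured on IMDb)
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In the training code, `classification_report` prints the same quantities per class, so there is no need to compute them by hand.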



## ⬆️ Load dataset

```python
import os
import pandas as pd

def load_data(path):
    data = []
    labels = []

    for etiqueta in ['pos', 'neg']:
        folder = os.path.join(path, etiqueta)
        print(f"Loading {etiqueta} from {folder}")
        for file in os.listdir(folder):
            with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
                data.append(f.read())
                labels.append(1 if etiqueta == 'pos' else 0)
    print(f"Total loaded from {path}: {len(data)} reviews")
    return pd.DataFrame({'text': data, 'label': labels})

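if __name__ == "__main__":
    # Guarded so that `from load_dataset import load_data` does not reload the dataset on import
    train = load_data('aclImdb/train')
    test = load_data('aclImdb/test')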
```
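`load_data` expects each split folder to contain a `pos` and a `neg` subfolder with one review per file. The sketch below builds a tiny fake split in a temporary directory to show that layout and counts the files per label. `count_reviews` is a hypothetical helper, and it assumes the review files end in `.txt`.

```python
import tempfile
from pathlib import Path

def count_reviews(split_dir):
    """Count .txt files in the pos/ and neg/ subfolders of one split."""
    split_dir = Path(split_dir)
    return {label: len(list((split_dir / label).glob("*.txt"))) for label in ("pos", "neg")}

# Build a tiny fake split to illustrate the expected folder layout
with tempfile.TemporaryDirectory() as tmp:
    fake_reviews = {"pos": ["Loved it."], "neg": ["Dull.", "Too long."]}
    for label, texts in fake_reviews.items():
        folder = Path(tmp) / "train" / label
        folder.mkdir(parents=True)
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")
    counts = count_reviews(Path(tmp) / "train")

print(counts)
```

Running `count_reviews('aclImdb/train')` before loading is a quick way to confirm the dataset was extracted where the code expects it.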

## 🔃 Text preprocessing

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

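def clear_text(text):
    # Lowercase so that "Good" and "good" map to the same token
    text = text.lower()
    # Replace HTML tags such as <br /> with a space so adjacent words are not fused
    text = re.sub(r'<.*?>', ' ', text)
    # Keep only letters and whitespace (drops digits and punctuation)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove English stop words; split() also collapses repeated whitespace
    text = ' '.join(word for word in text.split() if word not in ENGLISH_STOP_WORDS)
    return text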

def apply_preprocessing(df):
    df['text'] = df['text'].apply(clear_text)
    return df
```
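One side effect of the stop-word step is worth knowing: scikit-learn's `ENGLISH_STOP_WORDS` includes negations such as "not", so "not good" is reduced to "good". The sketch below shows a possible variant that keeps a few negations. The `NEGATIONS` set and the function name are assumptions for illustration, and this project did not measure whether the change affects accuracy.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Assumption: these negations are worth keeping; the set is illustrative only
NEGATIONS = {"not", "no", "never"}
STOP_WORDS_KEEP_NEGATIONS = ENGLISH_STOP_WORDS - NEGATIONS

def clear_text_keep_negations(text):
    """Same cleaning steps as clear_text, but negation words survive."""
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)      # strip HTML tags
    text = re.sub(r'[^a-z\s]', '', text)    # keep letters and whitespace
    return ' '.join(w for w in text.split() if w not in STOP_WORDS_KEEP_NEGATIONS)

print("not" in ENGLISH_STOP_WORDS)                        # the default list drops "not"
print(clear_text_keep_negations("This was not good"))     # negation is preserved
```

If this variant were adopted, the same function would have to be used both for training and in `predict_sentiment`, so that new text is cleaned the same way as the training data.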

## 🔢 Vectorization (TF-IDF)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_text(train, test):
    vectorizer = TfidfVectorizer(max_features=5000)
    X_train = vectorizer.fit_transform(train['text'])
    X_test = vectorizer.transform(test['text'])
    y_train = train['label']
    y_test = test['label']
    return X_train, X_test, y_train, y_test, vectorizer
```
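To see what `fit_transform` and `transform` produce, here is a small sketch on a toy corpus (the sentences are invented for illustration). It also shows why the vectorizer is fitted on the training set only: words that were not seen during fitting are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only
train_texts = ["good movie", "bad movie", "good plot good acting"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)   # learns vocabulary and IDF weights

print(X_train.shape)                       # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))      # the learned terms

# Words never seen during fit contribute nothing to the new vector
X_new = vectorizer.transform(["terrible pacing"])
print(X_new.nnz)                           # count of non-zero entries
```

With the real dataset, `max_features=5000` caps the vocabulary at the 5,000 most frequent terms, so the matrix has at most 5,000 columns.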

## 🔣 Model training

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
import joblib
import time

def train_and_eval(X_train, X_test, y_train, y_test):
    models = {
        "Logistic Regression": LogisticRegression(solver='saga', max_iter=1000, n_jobs=-1),
        "Random Forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
        "Multinomial NB": MultinomialNB(),
        "Linear SVC": LinearSVC(max_iter=1000),
        # n_jobs=-1 uses all available cores instead of a hard-coded worker count
        "K-Nearest Neighbors": KNeighborsClassifier(n_jobs=-1)
    }

    results = {}

    for name, model in models.items():
        print(f"\nTraining: {name}")
        start = time.time()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        end = time.time()

        acc = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {acc:.4f}")
        print(classification_report(y_test, y_pred))

        results[name] = {
            "model": model,
            "accuracy": acc,
            "time": end - start  # seconds spent in fit() plus predict()
        }

    return results

def search_better_parameters(X_train, y_train):
    model = LogisticRegression(solver='saga', max_iter=1000)
    parameters = {
        'C': [0.01, 0.1, 1, 10],
        'penalty': ['l2', 'l1']
    }
    grid = GridSearchCV(model, parameters, cv=3, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)

    print("Best parameters:", grid.best_params_)
    print("Best cross-validated accuracy:", grid.best_score_)

    return grid.best_estimator_

def save_model(model, vectorizer, path_model='model.joblib', path_vectorizer='vectorizer.joblib'):
    joblib.dump(model, path_model)
    joblib.dump(vectorizer, path_vectorizer)
```
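`save_model` writes the model and the vectorizer to two separate files, which must always be loaded together. One possible alternative, sketched here under assumptions, is to wrap both in a scikit-learn `Pipeline` so that a single object is saved and restored. The toy texts, the file name, and the default solver are illustrative choices, not the project's setup.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; in the project these would be the cleaned reviews
texts = ["great film loved it", "wonderful acting", "terrible plot", "boring and bad"]
labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then classifies it in one call
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentiment_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

predictions = restored.predict(["loved the acting"])
print(predictions)
```

The trade-off is that the vectorizer is refitted for every model in a comparison loop, whereas the approach above vectorizes once and reuses the matrices.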

## 📶 Comparison of results

```python
import matplotlib.pyplot as plt

def print_results(results):
    names = list(results.keys())
    accuracies = [r["accuracy"] for r in results.values()]
    times = [r["time"] for r in results.values()]

    plt.figure(figsize=(10, 4))

    plt.subplot(1, 2, 1)
    plt.barh(names, accuracies)
    plt.title("Accuracy by model")
    plt.xlabel("Accuracy")

    plt.subplot(1, 2, 2)
    plt.barh(names, times)
    plt.title("Fit + predict time by model")
    plt.xlabel("Seconds")

    plt.tight_layout()
    plt.show()

```
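Alongside the charts, a sorted table can make the ranking easier to read. The sketch below assumes the `results` dictionary has the shape returned by `train_and_eval`; `results_table` is a hypothetical helper, and the numbers are placeholders that only show the output format, not measured results.

```python
import pandas as pd

def results_table(results):
    """Turn the results dict into a DataFrame sorted by accuracy, best first."""
    rows = [
        {"model": name, "accuracy": r["accuracy"], "seconds": r["time"]}
        for name, r in results.items()
    ]
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False).reset_index(drop=True)

# Placeholder values to show the output shape; NOT measured results
example_results = {
    "Model A": {"model": None, "accuracy": 0.80, "time": 1.5},
    "Model B": {"model": None, "accuracy": 0.90, "time": 3.0},
}
table = results_table(example_results)
print(table)
```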

## 📶 Run models

```python
import time
from load_dataset import load_data
from preprocessing import apply_preprocessing
from vectorization import vectorize_text
from model import train_and_eval, save_model
from print_results import print_results

start = time.time()

# 1. Load dataset
train = load_data('aclImdb/train')
test = load_data('aclImdb/test')

# 2. Clean text
train = apply_preprocessing(train)
test = apply_preprocessing(test)

# 3. Vectorize text
X_train, X_test, y_train, y_test, vectorizer = vectorize_text(train, test)

# 4. Train and evaluate models
results = train_and_eval(X_train, X_test, y_train, y_test)

# Choose the best model based on accuracy
better_name = max(results, key=lambda k: results[k]["accuracy"])
better_model = results[better_name]["model"]

print(f"\n🧠 Best model: {better_name} ({results[better_name]['accuracy']:.4f})")
# Save the best model and vectorizer
save_model(better_model, vectorizer)

end = time.time()
print(f"Training and evaluation completed in {end - start:.2f} seconds")

print_results(results)

```

## 🆒 Test model

```python
import joblib
from preprocessing import clear_text

# Load the trained model and vectorizer
model = joblib.load('model.joblib')
vectorizer = joblib.load('vectorizer.joblib')

def predict_sentiment(text):
    clean_text = clear_text(text)
    vector = vectorizer.transform([clean_text])
    pred = model.predict(vector)[0]
    return "Positive" if pred == 1 else "Negative"

# Test
if __name__ == "__main__":
    text_usuario = input("Enter a review: ")
    resultado = predict_sentiment(text_usuario)
    print("Prediction:", resultado)

```
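`predict_sentiment` returns only a label. If a confidence score is also useful, a sketch along these lines could report the predicted class probability when the saved model supports it. `predict_with_confidence` is a hypothetical helper, and the toy model exists only to make the example runnable. Of the models compared here, `LinearSVC` has no `predict_proba`, so the helper falls back to `None` in that case.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def predict_with_confidence(model, vectorizer, text):
    """Return (label, probability), or (label, None) if predict_proba is unavailable."""
    vector = vectorizer.transform([text])
    pred = model.predict(vector)[0]
    label = "Positive" if pred == 1 else "Negative"
    if hasattr(model, "predict_proba"):
        proba = model.predict_proba(vector)[0]
        # classes_ gives the column order of predict_proba
        confidence = float(proba[list(model.classes_).index(pred)])
        return label, confidence
    return label, None

# Toy model for illustration only; the project would load model.joblib instead
toy_texts = ["great film", "wonderful acting", "terrible plot", "boring film"]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer().fit(toy_texts)
toy_model = LogisticRegression().fit(toy_vectorizer.transform(toy_texts), toy_labels)

label, confidence = predict_with_confidence(toy_model, toy_vectorizer, "great acting")
print(label, confidence)
```

As in `predict_sentiment`, the input should first be passed through `clear_text` so that it matches the preprocessing used during training.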

# 🧠 Conclusions

This project demonstrated the full process of building a sentiment analysis pipeline using a real-world dataset of movie reviews from IMDb. Through text preprocessing, TF-IDF vectorization, model training, and evaluation, we were able to identify a model capable of accurately predicting whether a review expresses a positive or negative sentiment.

Here are the key takeaways:

- **Text preprocessing and vectorization are critical** steps in NLP tasks. Cleaning the text and using TF-IDF allowed us to effectively convert raw text into meaningful numerical features.
- Among the models evaluated (Logistic Regression, Random Forest, Multinomial Naive Bayes, Linear SVC, and K-Nearest Neighbors), **Logistic Regression performed best overall** in terms of accuracy and training time.
- The final model achieved an accuracy of approximately **87%**, demonstrating strong performance on unseen reviews.
- Simpler linear models can often outperform more complex models when the data is well preprocessed, especially on high-dimensional, sparse features such as TF-IDF vectors.
- This project can be extended to other domains (e.g., product reviews, tweets) or improved with more advanced techniques such as word embeddings, deep learning (LSTM, BERT), or hyperparameter tuning (for example, by calling the `search_better_parameters` helper, which the run script does not use).

Overall, this analysis highlights the importance of combining effective preprocessing with model selection, and it provides a strong foundation for more advanced NLP projects.
