# Introduction
This notebook uses helper functions from the `train_logistic_regression.py` script to train a logistic regression model on the Sentiment140 dataset with TF-IDF features.


### Setup

In [7]:
import sys
sys.path.append('../../src/models/')  # Add the path to the script

In [8]:
from train_logistic_regression import (
    load_data, vectorize_text, train_logistic_regression,
    evaluate_model, save_model_and_vectorizer
)

### Load the cleaned data

In [11]:
df = load_data('../../data/processed/cleaned_data.csv')
df = df.dropna(subset=['clean_text'])
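
A likely reason for the `dropna` step is that text which becomes empty during cleaning is read back from CSV as `NaN`, since `pandas.read_csv` treats empty fields as missing by default. The sketch below illustrates this on an in-memory CSV; the sample rows are made up, and only the column names `clean_text` and `label` come from this notebook.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the second row has an empty clean_text field.
csv_text = "clean_text,label\ngood day,1\n,0\nbad day,0\n"

df_demo = pd.read_csv(io.StringIO(csv_text))
rows_before = len(df_demo)

# Drop rows whose clean_text is missing, as in the cell above.
df_demo = df_demo.dropna(subset=["clean_text"])
rows_after = len(df_demo)

print(f"rows before: {rows_before}, rows after: {rows_after}")
```

Comparing `len(df)` before and after the drop in the real notebook shows how many tweets were lost to cleaning.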

### Feature Engineering: TF-IDF Vectorization

In [14]:
X, tfidf = vectorize_text(df, max_features=1000)
y = df['label']
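
For orientation, here is a minimal sketch of what TF-IDF vectorization with a `max_features` cap does, using scikit-learn's `TfidfVectorizer` on a toy corpus. Whether `vectorize_text` wraps this class is an assumption, since the helper's internals are not shown in this notebook. The `texts` list and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df['clean_text'].
texts = [
    "love this phone",
    "hate this weather",
    "love love the weather today",
]

# max_features keeps only the terms with the highest total count
# across the corpus; all other terms are dropped from the vocabulary.
vectorizer = TfidfVectorizer(max_features=3)
X_toy = vectorizer.fit_transform(texts)  # sparse matrix, one row per document

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X_toy.shape)
```

With `max_features=1000`, as in the cell above, each tweet would be represented by a sparse vector of at most 1000 term weights.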

### Train Logistic Regression Model

In [15]:
model = train_logistic_regression(X, y)

### Evaluate the Model

In [17]:
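# Note: X and y are the same data passed to train_logistic_regression above.
# Unless that helper holds out a split internally, this is a training-set score,
# not an estimate of performance on unseen tweets.
accuracy, report = evaluate_model(model, X, y)
print(f"Model Accuracy on Full Dataset: {round(accuracy, 2)}")
print("\nClassification Report:\n", report)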

Model Accuracy on Full Dataset: 0.75

Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.72      0.74    796302
           1       0.74      0.78      0.76    795668

    accuracy                           0.75   1591970
   macro avg       0.75      0.75      0.75   1591970
weighted avg       0.75      0.75      0.75   1591970
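
The figures above are computed on the full dataset. To estimate performance on unseen data, one option is to hold out a test split before fitting. The sketch below shows the pattern with scikit-learn on synthetic data; it calls `LogisticRegression` directly rather than the notebook's helpers, whose internals are not shown here, and the variable names, split ratio, and seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF matrix X and the labels y.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out 20% of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Compare the score on data the model has seen with data it has not.
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

When the text features come from TF-IDF, the vectorizer should be fitted on the training split only and then applied to the test split, so that no vocabulary statistics leak from the held-out data.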



### Visualization (e.g., Confusion Matrix, ROC Curve)

In [None]:
# TODO: plot the confusion matrix and ROC curve
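
One possible way to fill in this cell is sketched below. It computes the numbers behind both plots with `sklearn.metrics` on a small hand-made example; the labels, scores, and the 0.5 threshold are illustrative assumptions, not results from this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])

# Turn probabilities into hard predictions at a 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(cm)
print(f"ROC AUC: {auc:.3f}")
```

In the notebook, `y_score` would come from something like `model.predict_proba(X)[:, 1]`, assuming `model` is a fitted scikit-learn classifier. Recent scikit-learn versions also provide `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_predictions` for drawing the plots with matplotlib.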

### Save the model and TF-IDF transformer

In [19]:
save_model_and_vectorizer(
    model, tfidf,
    '../../models/logistic_regression_baseline.pkl',
    '../../models/tfidf_vectorizer.pkl'
)
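
To use the saved artifacts later, the model and the vectorizer need to be loaded together, because new text must be transformed with the same fitted vocabulary. The sketch below shows a save-and-reload round trip with the standard-library `pickle` module on a toy model. That `save_model_and_vectorizer` uses `pickle` is an assumption based only on the `.pkl` extension, and the temporary file names and toy data here are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy corpus standing in for the real training data.
texts = ["love it", "hate it", "love love", "hate hate"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    vec_path = Path(tmp) / "vectorizer.pkl"

    # Save both objects, then load them back from disk.
    model_path.write_bytes(pickle.dumps(clf))
    vec_path.write_bytes(pickle.dumps(vec))
    loaded_clf = pickle.loads(model_path.read_bytes())
    loaded_vec = pickle.loads(vec_path.read_bytes())

# The reloaded pair should reproduce the original predictions.
new_texts = ["love this", "hate this"]
original_pred = clf.predict(vec.transform(new_texts))
reloaded_pred = loaded_clf.predict(loaded_vec.transform(new_texts))
print(original_pred, reloaded_pred)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.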