A machine learning project implementing logistic regression from scratch for binary sentiment analysis of movie reviews, using Python with scikit-learn for feature extraction.
This repository contains a complete sentiment analysis pipeline that classifies movie reviews as positive (1) or negative (0) using logistic regression. The model is trained from scratch using gradient descent with custom implementations of the objective function and gradient calculations.
nlp-model/
├── functions.py # Core ML functions and utilities
├── nlp_train.ipynb # Main training notebook
├── data/
│ ├── debug/ # Small dataset for testing
│ │ ├── dict.txt # Vocabulary mapping for debug set
│ │ └── reviews.tsv # Sample reviews for debugging
│ └── full/ # Complete dataset
│ ├── dict.txt # Full vocabulary mapping (39,190 words)
│ ├── train_data.tsv # Training data (1,200 samples)
│ ├── valid_data.tsv # Validation data
│ └── test_data.tsv # Test data
└── figures/
└── nlls.png # Training/validation loss visualization
Computes the negative log-likelihood (NLL) loss function for logistic regression:
- Input: Feature matrix X, labels Y, parameters theta
- Output: Mean negative log-likelihood
- Uses the sigmoid function with probability clipping for numerical stability (prevents log(0))
- This is the cost function we minimize during training
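The snippet below is a minimal sketch of what this objective computes, assuming dense NumPy inputs; the name nll_sketch is illustrative and the actual signature in functions.py may differ.

import numpy as np

def nll_sketch(X, Y, theta):
    """Mean negative log-likelihood of labels Y given features X and parameters theta."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid probabilities
    p = np.clip(p, 1e-15, 1 - 1e-15)         # clip to avoid log(0)
    Y = np.ravel(Y)
    return float(np.mean(-(Y * np.log(p) + (1 - Y) * np.log(1 - p))))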
Calculates the gradient of the loss function for a single sample:
- Input: Single feature vector Xi, label Yi, parameters theta
- Output: Gradient vector for parameter updates
- Used in stochastic gradient descent for weight updates
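For a single sample, the gradient of the NLL reduces to (sigmoid(theta · Xi) - Yi) * Xi. The sketch below assumes Xi is a dense NumPy vector; gradient_sketch is an illustrative name, not the repository's.

import numpy as np

def gradient_sketch(Xi, Yi, theta):
    """Gradient of the per-sample NLL with respect to theta."""
    p = 1.0 / (1.0 + np.exp(-float(Xi @ theta)))   # predicted probability for this sample
    return (p - float(Yi)) * Xi                    # (prediction error) times the feature vector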
Implements stochastic gradient descent training:
- Parameters:
  - X_train: Training feature matrix
  - Y_train: Training labels
  - theta0: Initial parameters (usually zeros)
  - num_epochs: Number of training epochs
  - lr: Learning rate
- Returns: History of parameter values after each epoch
- Processes one sample at a time for parameter updates
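A minimal sketch of such an SGD loop, assuming dense NumPy arrays (e.g., after .toarray()) and an illustrative name rather than the one used in functions.py:

import numpy as np

def sgd_sketch(X_train, Y_train, theta0, num_epochs, lr):
    """Plain SGD: one pass over the samples per epoch, recording theta after each epoch."""
    theta = theta0.astype(float)
    history = [theta.copy()]
    for _ in range(num_epochs):
        for Xi, Yi in zip(X_train, np.ravel(Y_train)):
            p = 1.0 / (1.0 + np.exp(-float(Xi @ theta)))
            theta -= lr * (p - Yi) * Xi      # single-sample gradient step
        history.append(theta.copy())
    return history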
Complete training pipeline with validation:
- Trains the model and tracks performance on both training and validation sets
- Implements early stopping based on validation performance
- Returns: Dictionary containing:
  - best_theta: Optimal parameters
  - best_epoch: Epoch with best validation performance
  - best_train_nll & best_val_nll: Best loss values
  - train_error & val_error: Error rates
  - theta_history: Complete parameter evolution
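For orientation, here is a self-contained sketch of a training-plus-validation loop of this shape. It assumes dense NumPy arrays, omits the error-rate bookkeeping, and uses illustrative names; the repository's actual implementation may differ.

import numpy as np

def train_evaluate_sketch(X_train, Y_train, X_val, Y_val, num_epochs, lr):
    """SGD with per-epoch train/val NLL tracking; keeps the epoch with the lowest validation NLL."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mean_nll(X, Y, theta):
        p = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)
        Y = np.ravel(Y)
        return float(np.mean(-(Y * np.log(p) + (1 - Y) * np.log(1 - p))))

    theta = np.zeros(X_train.shape[1])
    out = {'best_theta': theta.copy(), 'best_epoch': 0,
           'best_train_nll': np.inf, 'best_val_nll': np.inf, 'theta_history': []}
    for epoch in range(num_epochs):
        for Xi, Yi in zip(X_train, np.ravel(Y_train)):      # one update per sample
            theta -= lr * (sigmoid(float(Xi @ theta)) - Yi) * Xi
        out['theta_history'].append(theta.copy())
        train_nll = mean_nll(X_train, Y_train, theta)
        val_nll = mean_nll(X_val, Y_val, theta)
        if val_nll < out['best_val_nll']:                    # model selection on validation loss
            out.update(best_theta=theta.copy(), best_epoch=epoch,
                       best_train_nll=train_nll, best_val_nll=val_nll)
    return out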
Makes binary predictions using trained parameters:
- Applies sigmoid function to get probabilities
- Uses 0.5 threshold for binary classification
- Returns: List of binary predictions (0 or 1)
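A hedged sketch of this prediction step (illustrative name; X can be a dense array or a NumPy-compatible sparse matrix):

import numpy as np

def predict_sketch(X, theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))         # sigmoid probabilities
    return [int(pi >= 0.5) for pi in np.ravel(p)]  # threshold at 0.5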
Calculates classification error rate:
- Input: True labels y, predicted labels y_hat
- Output: Proportion of misclassified samples
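A sketch of the computation, under the assumption that y and y_hat are array-like sequences of 0/1 labels:

import numpy as np

def error_rate_sketch(y, y_hat):
    """Fraction of samples where prediction and truth disagree."""
    return float(np.mean(np.ravel(y) != np.ravel(y_hat)))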
Creates training visualization:
- Plots training and validation negative log-likelihood vs epochs
- Saves the plot as ./figures/nlls.png
- Helps identify overfitting and convergence patterns
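A rough sketch of such a plot with matplotlib; the per-epoch NLL lists come from training, and the exact styling in the repository may differ.

import matplotlib.pyplot as plt

def plot_nlls_sketch(train_nlls, val_nlls, path='./figures/nlls.png'):
    epochs = range(1, len(train_nlls) + 1)
    plt.figure()
    plt.plot(epochs, train_nlls, label='training NLL')
    plt.plot(epochs, val_nlls, label='validation NLL')
    plt.xlabel('epoch')
    plt.ylabel('mean negative log-likelihood')
    plt.legend()
    plt.savefig(path)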
The notebook implements a complete machine learning workflow:
# Imports used by the snippets below
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import functions as fn

# Load datasets
trainData = pd.read_csv('./data/full/train_data.tsv', sep='\t', header=None)
valData = pd.read_csv('./data/full/valid_data.tsv', sep='\t', header=None)
testData = pd.read_csv('./data/full/test_data.tsv', sep='\t', header=None)

# Feature extraction using bag-of-words
cvect = CountVectorizer(binary=True, max_features=10000)
xTrain = cvect.fit_transform(trainData.iloc[:, 1])   # Text reviews
yTrain = trainData.iloc[:, 0].values.reshape(-1, 1)  # Labels (0/1)
# Validation/test features are built analogously (cvect.transform on their text columns)

# Training parameters
numEpochs = 1000
lr = 0.005

# Train model with validation
output = fn.train_evaluate_model(xTrain, yTrain, xVal, yVal,
                                 numEpochs, lr, visualize_nlls=True)
The notebook evaluates model performance on multiple metrics:
- Training Error Rate: Performance on training data
- Validation Error Rate: Performance on validation data (for model selection)
- Test Error Rate: Final performance evaluation
- Negative Log-Likelihood: Loss function values
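As a hypothetical continuation of the notebook (reusing its xTest, yTest, and output variables), a final test-set check might look like the following; the error computation mirrors the error-rate helper described earlier.

import numpy as np

testPredictions = fn.predict(xTest, output['best_theta'])
testError = float(np.mean(np.ravel(yTest) != np.array(testPredictions)))
print(f"Test error rate: {testError:.4f}")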
Demonstrates model predictions on individual reviews:
# Example prediction on first test sample
reviewPrediction = fn.predict(xTest[0], output['best_theta'])
print(f"True sentiment: {yTest[0]}")
print(f"Predicted sentiment: {reviewPrediction}")
- TSV files: Tab-separated values with two columns
- Column 1: Label (0 = negative, 1 = positive)
- Column 2: Review text
- Dictionary files: Word-to-index mappings for vocabulary
- Training: 1,200 movie reviews
- Validation: Used for hyperparameter tuning and early stopping
- Test: Final evaluation set
- Vocabulary: 39,190 unique words in full dataset
- Binary bag-of-words: Each word's presence/absence (not frequency)
- Max features: Limited to 10,000 most frequent words
- Bias term: Added automatically during training
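To make the representation concrete, here is a small illustrative example of binary bag-of-words features with an explicit bias column prepended. The texts are made up, and since the repository adds the bias during training, its mechanics may differ from this sketch.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["a great movie", "a terrible movie", "great great great"]
cvect = CountVectorizer(binary=True, max_features=10000)
X = cvect.fit_transform(texts).toarray()            # 0/1 word presence, not counts
X_bias = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a constant bias feature
print(cvect.get_feature_names_out())
print(X_bias)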
- Environment Setup:
  pip install numpy pandas scikit-learn matplotlib
- Run Training: Open nlp_train.ipynb in Jupyter and execute cells sequentially
- Custom Training:
  import functions as fn

  # Train with custom parameters
  output = fn.train_evaluate_model(xTrain, yTrain, xVal, yVal, numEpochs=500, lr=0.01)

  # Make predictions
  predictions = fn.predict(xTest, output['best_theta'])
The model shows signs of overfitting (perfect training accuracy vs 86.75% validation accuracy), which is common in text classification with limited data.
- Binary classification with logistic regression
- Gradient descent optimization
- Text preprocessing and feature extraction
- Model evaluation and validation techniques
- Overfitting detection and mitigation
- Scientific computing with NumPy and pandas
- The NLP model uses binary bag-of-words features (word presence, not frequency)
- Numerical stability is ensured through probability clipping in the objective function
- The visualization helps identify convergence and overfitting patterns
- This is an educational implementation built to understand the inner workings of logistic regression