Full machine learning pipeline including data cleaning, EDA, NLP (TF-IDF), feature engineering, model development, and evaluation for predicting recipe star ratings.
Project Overview
This project analyzes a large dataset of recipe reviews with the goal of predicting star ratings (1–5) using machine learning techniques.
The dataset includes:
- likes_score
- dislike_index
- vote_ratio
- ranking_value
- review text
- user & recipe identifiers
- timestamps
- numeric recipe attributes (steps, ingredients, time)
The workflow covers the entire data science lifecycle:
- ✔ Data Cleaning
- ✔ Exploratory Data Analysis (EDA)
- ✔ Feature Engineering
- ✔ Text Vectorization (TF-IDF)
- ✔ Modeling (Logistic Regression & Random Forest)
- ✔ Evaluation & Feature Importance
- ✔ Insights & Recommendations
Data Cleaning
Steps taken to prepare the dataset:
✔ Removing invalid and duplicate entries
- Ensured only 1–5 star ratings
- Removed duplicates to avoid bias
✔ Handling missing values
- Numeric columns → filled with median
- Text → filled with empty string
- Categorical IDs → filled with "unknown"
✔ Converting timestamps
- Extracted month, weekday, and hour
✔ Normalizing categories
- Lowercased and stripped whitespace for consistency
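A minimal sketch of these cleaning steps with pandas, assuming a DataFrame `df` and illustrative column names (`stars`, `review_text`, `user_id`, `recipe_id`, `created_at`) that may differ from the actual dataset:

```python
import pandas as pd

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only valid 1-5 star ratings and drop exact duplicates
    df = df[df["stars"].between(1, 5)].drop_duplicates().copy()

    # Missing values: median for numerics, empty string for text, "unknown" for IDs
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    df["review_text"] = df["review_text"].fillna("")
    df[["user_id", "recipe_id"]] = df[["user_id", "recipe_id"]].fillna("unknown")

    # Timestamps -> month, weekday, hour
    ts = pd.to_datetime(df["created_at"], errors="coerce")
    df["month"], df["weekday"], df["hour"] = ts.dt.month, ts.dt.dayofweek, ts.dt.hour

    # Normalize categorical IDs: lowercase and strip whitespace
    for col in ["user_id", "recipe_id"]:
        df[col] = df[col].astype(str).str.lower().str.strip()
    return df
```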
Exploratory Data Analysis (EDA)
Key findings from descriptive analysis:
Rating Distribution
- Highly imbalanced dataset: the majority of reviews are 5-star.
Relationship between numeric features & ratings
- High-ingredient or long recipes show more rating variance.
Review Length vs Rating
- Longer reviews generally correlate with higher ratings.
Correlation Heatmap
- Strong correlation between ranking score and star rating.
- Weak correlation between most numeric features and stars → text matters most.
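Both the imbalance and the review-length pattern are quick to verify with pandas; this assumes the cleaned `df` and the column names used in the sketch above:

```python
# Share of each star rating (shows the heavy 5-star skew)
print(df["stars"].value_counts(normalize=True).sort_index())

# Average review length in words per star rating
review_words = df["review_text"].str.split().str.len()
print(review_words.groupby(df["stars"]).mean())
```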
Feature Engineering
✔ Review Length
Calculated word count as a new feature.
✔ TF-IDF Vectorization
Converted review text into numerical vectors.
Helps identify strong positive & negative words:
- Positive: delicious, loved, good
- Negative: not, boring, bland
✔ Encoding Categorical Variables
Applied one-hot encoding to user_id and recipe_id.
✔ Scaling Numeric Features
Standardized continuous variables to improve model performance.
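A sketch of how these four steps could be wired together with scikit-learn's `ColumnTransformer`; the column names and the 5,000-term TF-IDF cap are illustrative assumptions, not the project's exact settings:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# New feature: review length in words
df["review_length"] = df["review_text"].str.split().str.len()

numeric_features = ["likes_score", "dislike_index", "vote_ratio",
                    "ranking_value", "review_length"]
categorical_features = ["user_id", "recipe_id"]

preprocessor = ColumnTransformer([
    # TF-IDF turns review text into a sparse weighted bag-of-words
    ("tfidf", TfidfVectorizer(max_features=5000), "review_text"),
    # One-hot encode the ID columns, ignoring unseen IDs at predict time
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    # Standardize continuous variables
    ("scale", StandardScaler(), numeric_features),
])
```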
Model Development
Two models were built and compared:
Logistic Regression (Baseline Model)
- Tuned using GridSearchCV
- Best parameters: C = 0.5, penalty = 'l2'
- Benefit: balanced performance across classes
- Accuracy: 78%
- Stronger performance on minority classes than Random Forest
Random Forest (Advanced Model)
- Tuned using GridSearchCV
- Best parameters: n_estimators = 200, max_depth = 20
- Captures non-linear relationships
- Accuracy: 82%
- Excellent on 5-star predictions, poor on minority classes
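A sketch of the tuning setup with `GridSearchCV`, assuming the `preprocessor` defined above and a stratified train/test split; the grids simply bracket the best parameters reported here and the macro-F1 tuning metric is an assumption, not the project's exact search space:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["stars"]), df["stars"], test_size=0.2,
    stratify=df["stars"], random_state=42)

searches = {
    "logreg": GridSearchCV(
        Pipeline([("prep", preprocessor),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 0.5, 1.0], "clf__penalty": ["l2"]},
        scoring="f1_macro", cv=3),
    "rf": GridSearchCV(
        Pipeline([("prep", preprocessor),
                  ("clf", RandomForestClassifier(random_state=42))]),
        {"clf__n_estimators": [100, 200], "clf__max_depth": [10, 20]},
        scoring="f1_macro", cv=3),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    # best_params_ holds the tuned settings; score() reports macro F1 on the test set
    print(name, search.best_params_, search.score(X_test, y_test))
```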
Evaluation
Logistic Regression
- Accuracy: 78%
- Macro F1: 0.45
- ROC AUC: 0.87
- More balanced across star categories
Random Forest
- Accuracy: 82%
- Macro F1: 0.34
- ROC AUC: 0.79
- Strong on 5-star class but weak on 1–3 stars
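These metrics come from scikit-learn's standard scorers; the sketch below assumes the fitted `searches` from the previous step and uses one-vs-rest ROC AUC for the multi-class setting:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, roc_auc_score)

for name, search in searches.items():
    y_pred = search.predict(X_test)
    y_proba = search.predict_proba(X_test)
    print(f"--- {name} ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
    print("ROC AUC (OvR):", roc_auc_score(y_test, y_proba, multi_class="ovr"))
    # Per-class precision/recall exposes the weakness on 1-3 star classes
    print(classification_report(y_test, y_pred))
```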
Feature Importance
Key Predictors:
- TF-IDF text features dominated importance
- Strong indicators:
- Positive words: delicious, great, loved
- Negative words: not, but, bland
- Review length was also an important contributor
- Numeric attributes had minimal impact
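One way to inspect these importances from the fitted random forest, with feature names recovered from the `ColumnTransformer` (assumes the `searches` object above and scikit-learn >= 1.0 for `get_feature_names_out`):

```python
import numpy as np

best_rf = searches["rf"].best_estimator_
feature_names = best_rf.named_steps["prep"].get_feature_names_out()
importances = best_rf.named_steps["clf"].feature_importances_

# Top 20 predictors; TF-IDF terms such as "delicious" or "bland" tend to dominate
top = np.argsort(importances)[::-1][:20]
for idx in top:
    print(f"{feature_names[idx]:40s} {importances[idx]:.4f}")
```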
Insights & Interpretation
Data Distribution
- Highly imbalanced → difficult for models
- Oversampling or synthetic balancing is needed
Model Performance
- Random Forest achieved higher accuracy
- Logistic Regression performed better across all classes
Feature Importance
- Text features drive most of the predictive power
- Numeric features contribute less
Practical Implications
The findings can be applied to:
- Recipe recommendation systems
- Improving user feedback understanding
- Identifying recipe strengths & weaknesses
- Content moderation & quality improvement
- Highlighting reliable user reviews
Recommendations for Improvement
- Address class imbalance: use SMOTE, undersampling, or hybrid methods (see the sketch after this list)
- Use advanced NLP models: replace TF-IDF with BERT or transformer embeddings
- Collect more metadata: include cuisine type, difficulty, demographics
- Hybrid modeling: combine ensemble methods with deep learning
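For the class-imbalance recommendation, a minimal sketch using imbalanced-learn's SMOTE inside a pipeline, so oversampling is applied only to training data; this illustrates the suggested approach under the assumptions above, not code from the project:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

balanced_model = ImbPipeline([
    ("prep", preprocessor),             # same feature engineering as before
    ("smote", SMOTE(random_state=42)),  # oversample minority star classes
    ("clf", LogisticRegression(max_iter=1000)),
])
balanced_model.fit(X_train, y_train)
print("Macro F1 with SMOTE:",
      f1_score(y_test, balanced_model.predict(X_test), average="macro"))
```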