
Python - Recipe Rating Prediction using Machine Learning

Full machine learning pipeline including data cleaning, EDA, NLP (TF-IDF), feature engineering, model development, and evaluation for predicting recipe star ratings.

Project Overview
This project analyzes a large dataset of recipe reviews with the goal of predicting star ratings (1–5) using machine learning techniques.
The dataset includes:

  • likes_score
  • dislike_index
  • vote_ratio
  • ranking_value
  • review text
  • user & recipe identifiers
  • timestamps
  • numeric recipe attributes (steps, ingredients, time)

The workflow covers the entire data science lifecycle:

  • ✔ Data Cleaning
  • ✔ Exploratory Data Analysis (EDA)
  • ✔ Feature Engineering
  • ✔ Text Vectorization (TF-IDF)
  • ✔ Modeling (Logistic Regression & Random Forest)
  • ✔ Evaluation & Feature Importance
  • ✔ Insights & Recommendations
1. Data Cleaning

Steps taken to prepare the dataset:

✔ Removing invalid and duplicate entries

  • Ensured only 1–5 star ratings
  • Removed duplicates to avoid bias

✔ Handling missing values

  • Numeric columns → filled with median
  • Text → filled with empty string
  • Categorical IDs → filled with "unknown"

✔ Converting timestamps

  • Extracted month, weekday, and hour

✔ Normalizing categories

  • Lowercased and stripped whitespace for consistency
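The cleaning steps above can be sketched with pandas. The column names used here (stars, review, user_id, n_steps, created_at) are illustrative placeholders, not necessarily the dataset's actual schema:

```python
import pandas as pd

# Tiny stand-in for the real reviews dataset
df = pd.DataFrame({
    "stars": [5, 3, 7, 5, 5],
    "review": ["Delicious!", None, "Too bland", "Loved it", "Loved it"],
    "user_id": ["u1", "u2", None, "u4", "u4"],
    "n_steps": [4, None, 6, 3, 3],
    "created_at": pd.to_datetime(["2021-01-03"] * 5),
})

# Keep only valid 1-5 star ratings and drop exact duplicates
df = df[df["stars"].between(1, 5)].drop_duplicates()

# Missing values: median for numerics, empty string for text, "unknown" for IDs
df["n_steps"] = df["n_steps"].fillna(df["n_steps"].median())
df["review"] = df["review"].fillna("")
df["user_id"] = df["user_id"].fillna("unknown")

# Extract month, weekday, and hour from the timestamp
df["month"] = df["created_at"].dt.month
df["weekday"] = df["created_at"].dt.weekday
df["hour"] = df["created_at"].dt.hour

# Normalize categories: lowercase and strip whitespace
df["user_id"] = df["user_id"].str.lower().str.strip()
```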
2. Exploratory Data Analysis (EDA)

Key findings from descriptive analysis:

Rating Distribution

  • Highly imbalanced dataset: the majority of reviews are 5-star.

Relationship Between Numeric Features & Ratings

  • High-ingredient or long recipes show more rating variance.

Review Length vs Rating

  • Longer reviews generally correlate with higher ratings.

Correlation Heatmap

  • Strong correlation between ranking score and star rating.
  • Weak correlation between most numeric features and stars → review text matters most.

3. Feature Engineering

✔ Review Length
Calculated word count as a new feature.

✔ TF-IDF Vectorization
Converted review text into numerical vectors.
Helps identify strong positive & negative words:

  • Positive: delicious, loved, good
  • Negative: not, boring, bland

✔ Encoding Categorical Variables
Applied one-hot encoding to user_id and recipe_id.

✔ Scaling Numeric Features
Standardized continuous variables to improve model performance.
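The encoding and scaling steps can be combined in one scikit-learn ColumnTransformer; the column names below are placeholders, not the project's actual preprocessing code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "recipe_id": ["r1", "r1", "r2"],
    "n_ingredients": [5, 12, 8],
    "minutes": [30, 90, 45],
})

# One-hot encode the ID columns, standardize the numeric ones
pre = ColumnTransformer([
    ("ids", OneHotEncoder(handle_unknown="ignore"), ["user_id", "recipe_id"]),
    ("num", StandardScaler(), ["n_ingredients", "minutes"]),
])
X = pre.fit_transform(df)  # 2 + 2 one-hot columns + 2 scaled numerics
```

On the full dataset, one-hot encoding user_id and recipe_id yields a very wide sparse matrix; handle_unknown="ignore" lets the transformer tolerate IDs unseen during training.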

4. Model Development

Two models were built and compared:

Logistic Regression (Baseline Model)

  • Tuned using GridSearchCV
  • Best parameters:
    • C = 0.5
    • penalty = l2
  • Benefits: Balanced performance
  • Accuracy: 78%
  • Stronger performance on minority classes than Random Forest

Random Forest (Advanced Model)

  • Tuned using GridSearchCV
  • Best parameters:
    • n_estimators = 200
    • max_depth = 20
  • Learning non-linear relationships
  • Accuracy: 82%
  • Excellent on 5-star predictions, poor on minority classes
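The tuning described above follows scikit-learn's GridSearchCV pattern. The synthetic data and the exact parameter grids below are illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered feature matrix
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

# Baseline: logistic regression, tuning C and penalty
lr = GridSearchCV(LogisticRegression(max_iter=1000),
                  {"C": [0.1, 0.5, 1.0], "penalty": ["l2"]}, cv=3)
lr.fit(X, y)

# Advanced: random forest, tuning tree count and depth
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [100, 200], "max_depth": [10, 20]}, cv=3)
rf.fit(X, y)
```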
5. Evaluation

Logistic Regression

  • Accuracy: 78%
  • Macro F1: 0.45
  • ROC AUC: 0.87
  • More balanced across star categories

Random Forest

  • Accuracy: 82%
  • Macro F1: 0.34
  • ROC AUC: 0.79
  • Strong on 5-star class but weak on 1–3 stars
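The three reported metrics can be computed for any fitted model with scikit-learn; the labels and probabilities below are toy values for a 3-class case, not the project's results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy true labels and predictions for a 3-class problem
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])
# Per-class probability scores (each row sums to 1)
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
    [0.2, 0.6, 0.2],
    [0.7, 0.2, 0.1],
])

acc = accuracy_score(y_true, y_pred)
# Macro F1 averages per-class F1 equally, so it exposes weak minority classes
macro_f1 = f1_score(y_true, y_pred, average="macro")
# One-vs-rest AUC handles the multiclass setting
auc = roc_auc_score(y_true, y_proba, multi_class="ovr")
```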
6. Feature Importance

Key Predictors:

  • TF-IDF text features dominated importance
  • Strong indicators:
    • Positive words: delicious, great, loved
    • Negative words: not, but, bland
  • Review length was also an important contributor
  • Numeric attributes had minimal impact
7. Insights & Interpretation

Data Distribution

  • Highly imbalanced → difficult for models
  • Oversampling or synthetic balancing is needed

Model Performance

  • Random Forest achieved higher accuracy
  • Logistic Regression performed more consistently across all classes

Feature Importance

  • Text features drive most of the predictive power
  • Numeric features contribute less

8. Practical Implications

The findings can be applied to:
  • Recipe recommendation systems
  • Improving user feedback understanding
  • Identifying recipe strengths & weaknesses
  • Content moderation & quality improvement
  • Highlighting reliable user reviews
9. Recommendations for Improvement

1. Address Class Imbalance
  • Use SMOTE, undersampling, or hybrid methods
2. Use Advanced NLP Models
  • Replace TF-IDF with BERT or transformer embeddings
3. Collect More Metadata
  • Include cuisine type, difficulty, demographics
4. Hybrid Modeling
  • Combine ensemble methods with deep learning
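SMOTE itself lives in the third-party imbalanced-learn package. As a dependency-free illustration of the same balancing idea, minority classes can be randomly oversampled with scikit-learn's resample:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced labels: many 5-star rows, few 1-star rows
X = np.arange(20).reshape(10, 2)
y = np.array([5, 5, 5, 5, 5, 5, 5, 5, 1, 1])

# Upsample the minority class (with replacement) to the majority count
minority = y == 1
X_min_up, y_min_up = resample(X[minority], y[minority],
                              replace=True,
                              n_samples=int((~minority).sum()),
                              random_state=0)
X_bal = np.vstack([X[~minority], X_min_up])
y_bal = np.concatenate([y[~minority], y_min_up])
```

Unlike plain duplication, SMOTE interpolates synthetic minority samples between neighbors, which usually generalizes better; the resample version above is only the simplest baseline.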
