Full machine learning pipeline including data cleaning, EDA, NLP (TF-IDF), feature engineering, model development, and evaluation for predicting recipe star ratings.
Project Overview
This project analyzes a large dataset of recipe reviews with the goal of predicting star ratings (1–5) using machine learning techniques.
The dataset includes:
- likes_score
- dislike_index
- vote_ratio
- ranking_value
- review text
- user & recipe identifiers
- timestamps
- numeric recipe attributes (steps, ingredients, time)
The workflow covers the entire data science lifecycle:
- ✔ Data Cleaning
- ✔ Exploratory Data Analysis (EDA)
- ✔ Feature Engineering
- ✔ Text Vectorization (TF-IDF)
- ✔ Modeling (Logistic Regression & Random Forest)
- ✔ Evaluation & Feature Importance
- ✔ Insights & Recommendations
Data Cleaning
Steps taken to prepare the dataset:
✔ Removing invalid and duplicate entries
- Ensured only 1–5 star ratings
- Removed duplicates to avoid bias
✔ Handling missing values
- Numeric columns → filled with median
- Text → filled with empty string
- Categorical IDs → filled with "unknown"
✔ Converting timestamps
- Extracted month, weekday, and hour
✔ Normalizing categories
- Lowercased and stripped whitespace for consistency
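A minimal sketch of these cleaning steps with pandas, assuming a DataFrame `df` and illustrative column names (`stars`, `review_text`, `user_id`, `recipe_id`, `created_at`) that may differ from the actual dataset:

```python
import pandas as pd

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only valid 1-5 star ratings and drop exact duplicates
    df = df[df["stars"].between(1, 5)].drop_duplicates().copy()

    # Missing values: median for numerics, empty string for text, "unknown" for IDs
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    df["review_text"] = df["review_text"].fillna("")
    df[["user_id", "recipe_id"]] = df[["user_id", "recipe_id"]].fillna("unknown")

    # Timestamps -> month, weekday, hour
    ts = pd.to_datetime(df["created_at"], errors="coerce")
    df["month"], df["weekday"], df["hour"] = ts.dt.month, ts.dt.dayofweek, ts.dt.hour

    # Normalize categorical IDs: lowercase and strip whitespace
    for col in ["user_id", "recipe_id"]:
        df[col] = df[col].astype(str).str.lower().str.strip()
    return df
```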
Exploratory Data Analysis (EDA)
Key findings from descriptive analysis:
Rating Distribution
- Highly imbalanced dataset: the majority of reviews are 5-star.
Relationship between numeric features & ratings
- High-ingredient or long recipes show more rating variance.
Review Length vs Rating
- Longer reviews generally correlate with higher ratings.
Correlation Heatmap
- Strong correlation between ranking score and star rating.
- Weak correlation between most numeric features and stars → text matters most.
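Both the imbalance and the review-length pattern are quick to verify with pandas; this assumes the cleaned `df` and the column names used in the sketch above:

```python
# Share of each star rating (shows the heavy 5-star skew)
print(df["stars"].value_counts(normalize=True).sort_index())

# Average review length in words per star rating
review_words = df["review_text"].str.split().str.len()
print(review_words.groupby(df["stars"]).mean())
```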
Feature Engineering
✔ Review Length
Calculated word count as a new feature.
✔ TF-IDF Vectorization
Converted review text into numerical vectors.
Helps identify strong positive & negative words:
- Positive: delicious, loved, good
- Negative: not, boring, bland
✔ Encoding Categorical Variables
Applied one-hot encoding to user_id and recipe_id.
✔ Scaling Numeric Features
Standardized continuous variables to improve model performance.
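A sketch of how these four steps could be wired together with scikit-learn's `ColumnTransformer`; the column names and the 5,000-term TF-IDF cap are illustrative assumptions, not the project's exact settings:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# New feature: review length in words
df["review_length"] = df["review_text"].str.split().str.len()

numeric_features = ["likes_score", "dislike_index", "vote_ratio",
                    "ranking_value", "review_length"]
categorical_features = ["user_id", "recipe_id"]

preprocessor = ColumnTransformer([
    # TF-IDF turns review text into a sparse weighted bag-of-words
    ("tfidf", TfidfVectorizer(max_features=5000), "review_text"),
    # One-hot encode the ID columns, ignoring unseen IDs at predict time
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    # Standardize continuous variables
    ("scale", StandardScaler(), numeric_features),
])
```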
Model Development
Two models were built and compared:
Logistic Regression (Baseline Model)
- Tuned using GridSearchCV
- Best parameters: C = 0.5, penalty = 'l2'
- Benefit: balanced performance across classes
- Accuracy: 78%
- Stronger performance on minority classes than Random Forest
Random Forest (Advanced Model)
- Tuned using GridSearchCV
- Best parameters: n_estimators = 200, max_depth = 20
- Captures non-linear relationships
- Accuracy: 82%
- Excellent on 5-star predictions, poor on minority classes
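A sketch of the tuning setup with `GridSearchCV`, assuming the `preprocessor` defined above and a stratified train/test split; the grids simply bracket the best parameters reported here and the macro-F1 tuning metric is an assumption, not the project's exact search space:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["stars"]), df["stars"], test_size=0.2,
    stratify=df["stars"], random_state=42)

searches = {
    "logreg": GridSearchCV(
        Pipeline([("prep", preprocessor),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 0.5, 1.0], "clf__penalty": ["l2"]},
        scoring="f1_macro", cv=3),
    "rf": GridSearchCV(
        Pipeline([("prep", preprocessor),
                  ("clf", RandomForestClassifier(random_state=42))]),
        {"clf__n_estimators": [100, 200], "clf__max_depth": [10, 20]},
        scoring="f1_macro", cv=3),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    # best_params_ holds the tuned settings; score() reports macro F1 on the test set
    print(name, search.best_params_, search.score(X_test, y_test))
```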
Evaluation
Logistic Regression
- Accuracy: 78%
- Macro F1: 0.45
- ROC AUC: 0.87
- More balanced across star categories
Random Forest
- Accuracy: 82%
- Macro F1: 0.34
- ROC AUC: 0.79
- Strong on 5-star class but weak on 1–3 stars
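These metrics come from scikit-learn's standard scorers; the sketch below assumes the fitted `searches` from the previous step and uses one-vs-rest ROC AUC for the multi-class setting:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, roc_auc_score)

for name, search in searches.items():
    y_pred = search.predict(X_test)
    y_proba = search.predict_proba(X_test)
    print(f"--- {name} ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
    print("ROC AUC (OvR):", roc_auc_score(y_test, y_proba, multi_class="ovr"))
    # Per-class precision/recall exposes the weakness on 1-3 star classes
    print(classification_report(y_test, y_pred))
```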
Feature Importance
Key Predictors:
- TF-IDF text features dominated importance
- Strong indicators:
- Positive words: delicious, great, loved
- Negative words: not, but, bland
- Review length was also an important contributor
- Numeric attributes had minimal impact
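One way to inspect these importances from the fitted random forest, with feature names recovered from the `ColumnTransformer` (assumes the `searches` object above and scikit-learn >= 1.0 for `get_feature_names_out`):

```python
import numpy as np

best_rf = searches["rf"].best_estimator_
feature_names = best_rf.named_steps["prep"].get_feature_names_out()
importances = best_rf.named_steps["clf"].feature_importances_

# Top 20 predictors; TF-IDF terms such as "delicious" or "bland" tend to dominate
top = np.argsort(importances)[::-1][:20]
for idx in top:
    print(f"{feature_names[idx]:40s} {importances[idx]:.4f}")
```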
Insights & Interpretation
Data Distribution
- Highly imbalanced → difficult for models
- Oversampling or synthetic balancing is needed
Model Performance
- Random Forest achieved higher accuracy
- Logistic Regression performed better across all classes
Feature Importance
- Text features drive most of the predictive power
- Numeric features contribute less
Practical Implications
The findings can be applied to:
- Recipe recommendation systems
- Improving user feedback understanding
- Identifying recipe strengths & weaknesses
- Content moderation & quality improvement
- Highlighting reliable user reviews
Recommendations for Improvement
- Address class imbalance: use SMOTE, undersampling, or hybrid methods (see the sketch after this list)
- Use advanced NLP models: replace TF-IDF with BERT or transformer embeddings
- Collect more metadata: include cuisine type, difficulty, demographics
- Hybrid modeling: combine ensemble methods with deep learning
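For the class-imbalance recommendation, a minimal sketch using imbalanced-learn's SMOTE inside a pipeline, so oversampling is applied only to training data; this illustrates the suggested approach under the assumptions above, not code from the project:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

balanced_model = ImbPipeline([
    ("prep", preprocessor),             # same feature engineering as before
    ("smote", SMOTE(random_state=42)),  # oversample minority star classes
    ("clf", LogisticRegression(max_iter=1000)),
])
balanced_model.fit(X_train, y_train)
print("Macro F1 with SMOTE:",
      f1_score(y_test, balanced_model.predict(X_test), average="macro"))
```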