# Report: Recipe Discovery - predicting popularity
## Data Validation & Preprocessing
Data preprocessing for model development included the following steps:  
- **Recipe Column**: The recipe column is a unique identifier. This was confirmed by verifying that the recipe column has 947 unique values. To ensure that the unique identifier does not influence the model's predictions, the recipe column was dropped from the training data.
```python
# Verify that every row has its own recipe identifier
identifier = df['recipe'].nunique()
print(f"There are {len(df)} rows in the dataset. The number of unique identifiers is {identifier}.")
# Output: There are 947 rows in the dataset. The number of unique identifiers is 947.
```
```python
# Dropping the 'recipe' column as it is a unique identifier and not included in the model
df = df.drop('recipe', axis=1)
```
- **Category Column**: The category column represents the type of cuisine and has no inherent order, classifying it as nominal data. Therefore, the category column was one-hot encoded to binary format using the pandas get_dummies() function.
```python
# category - get dummies to transform into binary form
df = pd.get_dummies(df, columns=['category'])
```
- **Servings Column**: The servings column has a natural order (1 < 2 < 4 < 6), making it ordinal data. To label encode the data, the servings column was stripped of trailing text and converted to integers.

```python
# servings - unify categories and convert to integer
df['servings'] = df['servings'].str.replace(' as a snack', '').astype(int)
```
- **High Traffic Column**: The high_traffic column was converted to binary format where 1 represents 'High' and 0 represents 'Low'.
```python
# high_traffic converted to binary form - 1 if 'High' else 0
df['high_traffic'] = df['high_traffic'].apply(lambda x: 1 if x == 'High' else 0)
```
- **Dropping Inconsistent Columns**: Columns calories, carbohydrate, sugar, and protein were dropped from the training data due to:
    - Missing values in over 5% of the rows.
    - High variance (e.g. values for calories ranging between 0.1 and 3600).
    - High number of outliers in each column.
    - Logical inconsistencies (e.g. 185 rows where sugar content is greater than carbohydrate content).
```
Variance in Calories: 205228.02388369216
Variance in Carbohydrate: 1931.5174120872857
Variance in Sugar: 215.47820227263065
Variance in Protein: 1322.757884850712
```
```
The number of rows where sugar is greater than carbohydrate is 185.
```
Consequently, the reliability of this data was judged as low, and it was disregarded to avoid introducing bias and negatively affecting model accuracy and training efficiency.
```python
# Drop columns with inconsistent data
df = df.drop(columns=['calories', 'carbohydrate', 'sugar', 'protein'])
```
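
The figures quoted above (missing-value share, variance, and the sugar-versus-carbohydrate check) are reported without the code that produced them. A minimal sketch of how such checks could be run is given here. The small `toy` frame and its values are invented for illustration, and on the real data the checks would have to run before the columns are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw data; values are illustrative only
toy = pd.DataFrame({
    "calories":     [35.5, 914.3, np.nan, 97.0],
    "carbohydrate": [38.6, 42.7, np.nan, 30.6],
    "sugar":        [0.7, 3.1, np.nan, 38.6],
    "protein":      [0.9, 2.9, np.nan, 0.0],
})
nutrition_cols = ["calories", "carbohydrate", "sugar", "protein"]

# Share of rows with a missing value, per column
missing_share = toy[nutrition_cols].isna().mean()

# Sample variance per column (NaN values are skipped by default)
variances = toy[nutrition_cols].var()

# Rows where sugar exceeds total carbohydrate, which should not happen
# because sugar is part of the carbohydrate content
n_inconsistent = int((toy["sugar"] > toy["carbohydrate"]).sum())

print(missing_share)
print(variances)
print(f"Rows where sugar > carbohydrate: {n_inconsistent}")
```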
To include data points about calories, carbohydrates, sugars, and protein in future models, there must be a special emphasis on improving data collection to ensure the availability of higher quality data required for machine learning model development.
## Exploratory Analysis
![Bar Chart](dscert_servings.png)
*Figure 1: The number of recipes categorized by the number of servings they provide. The serving sizes are divided into four categories: 1, 2, 4, and 6 servings.*  

Figure 1 shows that recipes designed for 4 servings are the most common, with a total of 391 recipes. This is followed by recipes for 6 servings (198 recipes), 2 servings (183 recipes), and 1 serving (175 recipes). This distribution shows that recipes catering to medium-sized groups (4 servings) are roughly twice as common on the portal as any other single serving size.  

![Pie Chart](dscert_traffic.png)
*Figure 2: The traffic distribution for all recipes on the portal.*  

Figure 2 shows that a majority, 61%, of the observations are classified as "High" traffic, indicating that most recipes receive a significant amount of visits. The remaining 39% of the observations are categorized as "Low" traffic, representing recipes with lower traffic.  

![Bar Chart](dscert_comparison.png)
*Figure 3: The distribution of recipe postings categorized by serving sizes and their corresponding traffic levels (High vs. Low).*  

Figure 3 shows that recipes with larger serving sizes tend to have proportionally more "High" traffic postings. Specifically, the 4- and 6-servings categories show "High" to "Low" traffic ratios of 1.54 and 1.86 respectively, compared with 1.43 and 1.35 for the 1- and 2-servings categories. The trend is not strictly monotonic (the 2-servings ratio is slightly lower than the 1-serving ratio) and no significance test was applied, but it suggests that recipes designed to serve more people attract somewhat higher traffic on the platform.
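
The "High" to "Low" ratios quoted for Figure 3 can be derived from a simple cross-tabulation. The sketch below assumes the preprocessed column names `servings` and `high_traffic`; the `toy` frame is a hypothetical miniature, not the report's data.

```python
import pandas as pd

# Hypothetical miniature of the preprocessed data; values are illustrative only
toy = pd.DataFrame({
    "servings":     [1, 1, 1, 4, 4, 4, 4, 4],
    "high_traffic": [1, 0, 1, 1, 1, 1, 0, 0],
})

# Count Low (0) and High (1) recipes per serving size
counts = pd.crosstab(toy["servings"], toy["high_traffic"])

# High-to-Low ratio per serving size
ratio = counts[1] / counts[0]
print(ratio.round(2))
```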
### Findings
- **Serving Size Popularity**: Recipes designed for 4 servings are the most common, indicating a preference for recipes suited to medium-sized group meals.
- **Traffic Distribution**: A significant majority of recipes receive high traffic, suggesting that the portal is effective in attracting visitors to its content.
- **Serving Size and Traffic**: Recipes with larger serving sizes (4 and 6 servings) tend to attract more traffic, highlighting the popularity of recipes that cater to more people.

## Model Development
**Problem Statement**: The task at hand is a binary classification problem where the goal is to predict whether a recipe will generate high traffic based on various features. The dataset consists of 947 rows with the following columns:

- One column of label encoded categorical data representing serving size with 4 potential values (1, 2, 4, 6).
- One-hot encoded categorical data with 11 potential values.
- A binary target variable indicating high traffic (1 for High, 0 for Low).

**Model Selection**:  
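```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Define the models 
models = {
    "Logistic Regression": LogisticRegression(random_state=12), 
    "Random Forest": RandomForestClassifier(random_state=12)
}
```  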

Given the nature of the dataset and the problem, Logistic Regression and Random Forest models were chosen for the following reasons:  

**Logistic Regression**:
- **Simplicity and Interpretability**: Logistic Regression serves as a baseline model due to its simplicity and interpretability. It is a straightforward model that provides easily interpretable coefficients, making it simple to understand the relationship between features and the target variable.
- **Performance on Small Datasets**: Logistic Regression performs well on small datasets, such as the one at hand with 947 rows.
- **Binary Classification**: It is well-suited for binary classification problems, which aligns with our target variable.

**Random Forest**:
- **Handling Categorical Data**: Random Forest can handle both numerical and categorical data effectively, making it a good fit for our dataset with label encoded and one-hot encoded features.
- **Robustness and Accuracy**: Random Forest is known for its robustness. Averaging many decorrelated trees reduces the overfitting that a single decision tree is prone to, which tends to give good accuracy and generalization performance.
- **Feature Importance**: It provides insights into feature importance, helping to understand which features contribute most to the prediction.

### Steps for fitting the models 

**Splitting the Data**:
Split the data into features and the target variable to prepare for model training.
```python
# Splitting the data into features (X) and target variable (y)
X = df.drop('high_traffic', axis=1)
y = df['high_traffic']
```

**Train-Test Split**:
Split the data into training and testing sets to evaluate the models' performance on unseen data. This step is crucial for assessing the generalization ability of the models.
```python
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
```
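
The split above is not stratified, so the share of "High" recipes can differ slightly between the training and testing sets. One possible alternative, not what was done in this report, is to pass `stratify=y` to `train_test_split`. The sketch below uses hypothetical data (`X_toy`, `y_toy`) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with a 60/40 class balance
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 60 + [0] * 40)

# stratify=y_toy keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)

print(f"Share of class 1 in train: {y_tr.mean():.2f}")
print(f"Share of class 1 in test:  {y_te.mean():.2f}")
```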

**Fitting the Models**:
Fit the models on the training data to prepare them for making predictions.
```python
# Fit the models on the training data
models["Logistic Regression"].fit(X_train, y_train)
models["Random Forest"].fit(X_train, y_train)
```

## Model Evaluation
**Stratified K-Fold Cross-Validation**: The team used Stratified K-Fold cross-validation to ensure the same proportion of classes in each fold. This technique helps in maintaining the distribution of the target variable across training and validation sets, leading to more reliable and unbiased performance estimates.
```python
from sklearn.model_selection import StratifiedKFold

# Define stratified K-Fold cross-validation (4 folds) so that each fold keeps the same proportion of classes
skf = StratifiedKFold(n_splits=4)
```
Both models were evaluated using cross-validation to obtain mean accuracy scores and standard deviations. Cross-validation helps in assessing the stability and reliability of the models by providing performance metrics averaged over multiple folds.

```python
from sklearn.model_selection import cross_val_score

# Model Development with Cross-Validation
def evaluate_model(model):
    # Perform cross-validation and return mean accuracy score and standard deviation of accuracy scores.
    cv_scores = cross_val_score(model, X_train, y_train, cv=skf)
    return cv_scores.mean(), cv_scores.std()

# Reuse the `models` dictionary fitted above. cross_val_score fits clones of each
# estimator, so the fitted models stay available for predictions on the test set.

# Evaluate each model using cross-validation
for model_name, model in models.items():
    mean_accuracy, std_accuracy = evaluate_model(model)
    print(f"{model_name} - Mean Accuracy: {mean_accuracy:.4f}, Std Dev: {std_accuracy:.4f}")
```
```
Logistic Regression - Mean Accuracy: 0.7557, Std Dev: 0.0324 
Random Forest - Mean Accuracy: 0.7504, Std Dev: 0.0351
```
Stratified K-Fold cross-validation keeps the class balance the same in every fold, which makes the accuracy estimates more reliable than those from a single split. Both Logistic Regression and Random Forest reach a mean accuracy of about 0.75 with a standard deviation of roughly 0.03 across the four folds, so their performance is stable and practically indistinguishable. For context, always predicting "High" would be correct for about 61% of recipes (Figure 2), so both models improve clearly on that naive baseline.  
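
To put the cross-validated accuracies in context, they can be compared with a majority-class baseline evaluated under the same scheme. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical data (`X_toy`, `y_toy`) that only mimics the 61/39 class balance; it is an illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical data with roughly the 61/39 class balance shown in Figure 2
X_toy = np.zeros((100, 1))          # features are ignored by the dummy model
y_toy = np.array([1] * 61 + [0] * 39)

# Always predicts the most frequent class seen in the training fold
baseline = DummyClassifier(strategy="most_frequent")
skf_toy = StratifiedKFold(n_splits=4)

baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=skf_toy)
print(f"Baseline mean accuracy: {baseline_scores.mean():.4f}")
```

Any model worth deploying should beat this baseline by a clear margin.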

### Performance Metrics
To evaluate the performance of the Logistic Regression and Random Forest models, we used several metrics: Accuracy, Precision, Recall, F1 Score, and ROC AUC Score. These metrics provide a comprehensive view of the models' effectiveness in predicting high traffic for recipes.
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Model Evaluation
def print_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # Note: computed from hard 0/1 predictions, so this equals balanced accuracy;
    # pass predict_proba scores instead for a threshold-independent ROC AUC
    roc_auc = roc_auc_score(y_true, y_pred)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"ROC AUC Score: {roc_auc:.4f}")

# Calculating metrics for each model
print("Logistic Regression Metrics:")
lr_predictions = models["Logistic Regression"].predict(X_test)
print_metrics(y_test, lr_predictions)

print("\nRandom Forest Metrics:")
rf_predictions = models["Random Forest"].predict(X_test)
print_metrics(y_test, rf_predictions)
```
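
The ROC AUC above is computed from predicted class labels, which evaluates only a single threshold. A minimal sketch of the difference between label-based and probability-based AUC follows. The data (`X_toy`, `y_toy`) and the classifier `clf` are hypothetical and only serve to contrast the two calls.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical one-feature data set, for illustration only
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

clf = LogisticRegression(random_state=12).fit(X_toy, y_toy)

# AUC from hard 0/1 labels: only a single threshold (0.5) is evaluated
auc_from_labels = roc_auc_score(y_toy, clf.predict(X_toy))

# AUC from predicted probabilities of class 1: all thresholds are evaluated
auc_from_proba = roc_auc_score(y_toy, clf.predict_proba(X_toy)[:, 1])

print(f"AUC from labels:        {auc_from_labels:.4f}")
print(f"AUC from probabilities: {auc_from_proba:.4f}")
```

The probability-based value is the one that corresponds to the area under a full ROC curve.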

![Bar Chart](dscert_performance.png)
*Figure 4: Comparison of accuracy, precision, recall, F1 Score, and ROC AUC Score for both models.*  

Figure 4 provides a visual comparison of model performance across different metrics. Both models achieved similar accuracy scores, with Logistic Regression slightly outperforming Random Forest in terms of precision and F1 Score. However, Random Forest had a marginally higher recall, indicating it was slightly better at identifying true positives.

#### ROC Curves  
ROC curves plot the true positive rate against the false positive rate for different threshold values. The area under the curve (AUC) provides a single metric to compare the models' performance.  

![Line Chart](dscert_roc.png)
*Figure 5: ROC Curves for Logistic Regression and Random Forest models.*  

Figure 5 shows that both models have similar ROC AUC scores, indicating comparable performance.

#### Confusion Matrices
Confusion matrices display the number of true positives, true negatives, false positives, and false negatives.  

![Confusion Matrix](dscert_matricies.png)
*Figure 6: Confusion matrices for Logistic Regression and Random Forest models*  

Figure 6 shows the classification counts for both models on the same test set. Because the test set is shared, the number of actual low traffic and actual high traffic recipes is identical for both models, so a difference in one cell is mirrored in the neighboring cell. Logistic Regression has one more true negative than Random Forest and therefore one fewer false positive, indicating slightly better performance in avoiding incorrect high traffic predictions. Random Forest correctly identified one more high traffic recipe than Logistic Regression and therefore has one fewer false negative, indicating it is slightly better at capturing true high traffic recipes.  
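
The four cells of a confusion matrix can be read out programmatically, which makes comparisons between models less error-prone. The sketch below uses hypothetical labels and predictions (`y_true_toy`, `y_pred_a`, `y_pred_b`); it also checks that, on a shared test set, TN + FP and FN + TP are fixed by the data rather than by the model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_a   = [1, 1, 0, 1, 0, 1, 0, 1]   # stand-in for one model
y_pred_b   = [1, 1, 1, 1, 1, 1, 0, 1]   # stand-in for another model

for name, y_pred in [("Model A", y_pred_a), ("Model B", y_pred_b)]:
    # For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
    tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred, labels=[0, 1]).ravel()
    print(f"{name}: TN={tn}, FP={fp}, FN={fn}, TP={tp}")
    # Row sums are fixed by the test set, not by the model
    assert tn + fp == y_true_toy.count(0)
    assert fn + tp == y_true_toy.count(1)
```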

### Feature Importance
The feature importance analysis provides insights into which features contribute most to the models' predictions.
```python
# Random Forest Feature Importances
rf_feature_importances = models["Random Forest"].feature_importances_
# Logistic Regression Feature Importances (absolute values of coefficients)
lr_feature_importances = abs(models["Logistic Regression"].coef_[0])

# Creating DataFrames to compare feature importances
features = X.columns
importance_df_rf = pd.DataFrame({
    'Feature': features,
    'Importance': rf_feature_importances,
    'Model': 'Random Forest'
}).sort_values(by='Importance', ascending=False)

importance_df_lr = pd.DataFrame({
    'Feature': features,
    'Importance': lr_feature_importances,
    'Model': 'Logistic Regression'
}).sort_values(by='Importance', ascending=False)

# Print importance scores for chosen models
print("\nFeature Importances for Logistic Regression:")
print(importance_df_lr)

print("Feature Importances for Random Forest:")
print(importance_df_rf)
```  

#### Logistic Regression
```
Feature Importances for Logistic Regression:
           Feature  Importance                Model
1        Beverages    2.944493  Logistic Regression
11       Vegetable    2.320975  Logistic Regression
10          Potato    2.122185  Logistic Regression
9             Pork    1.539368  Logistic Regression
3          Chicken    1.281188  Logistic Regression
2        Breakfast    1.239868  Logistic Regression
4   Chicken Breast    0.766422  Logistic Regression
7             Meat    0.294666  Logistic Regression
5          Dessert    0.259192  Logistic Regression
8    One Dish Meal    0.236877  Logistic Regression
6     Lunch/Snacks    0.024592  Logistic Regression
0         Servings    0.006461  Logistic Regression
```  

![Bar Chart](dscert_fiLR.png)
*Figure 7: Feature Importances for Logistic Regression model*  

Figure 7 shows that the most important features were Beverages, Vegetable, and Potato.  

#### Random Forest
```
Feature Importances for Random Forest:
           Feature  Importance          Model
1        Beverages    0.277469  Random Forest
11       Vegetable    0.126001  Random Forest
0         Servings    0.121834  Random Forest
10          Potato    0.118346  Random Forest
2        Breakfast    0.087162  Random Forest
9             Pork    0.079937  Random Forest
3          Chicken    0.071396  Random Forest
4   Chicken Breast    0.037178  Random Forest
7             Meat    0.022649  Random Forest
8    One Dish Meal    0.020592  Random Forest
5          Dessert    0.018921  Random Forest
6     Lunch/Snacks    0.018515  Random Forest
```  

![Bar Chart](dscert_fiRF.png)
*Figure 8: Feature Importances for Random Forest model*  

Figure 8 shows that the most important features were Beverages, Vegetable, Servings, and Potato.  

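Figures 7 and 8 demonstrate that both models identified Beverages, Vegetable, and Potato as the most important features for predicting "High" traffic recipes. Interestingly, Logistic Regression ranked Servings as the least important feature while Random Forest placed it third. The two measures are not directly comparable, however: a logistic regression coefficient depends on the scale of its feature (Servings ranges from 1 to 6, whereas the category indicators are 0 or 1), and taking absolute values hides whether a feature raises or lowers the probability of "High" traffic.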
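
Since absolute coefficients discard the direction of an effect, it can help to inspect signed coefficients on standardized features. The sketch below is an assumption-laden illustration: the data, the `indicator` column, and the use of `StandardScaler` are hypothetical and were not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one 0/1 indicator that lowers the odds of the positive
# class, and one integer feature with no real effect; illustration only
rng = np.random.default_rng(12)
n = 400
indicator = rng.integers(0, 2, size=n)
servings = rng.choice([1, 2, 4, 6], size=n)
p_high = np.where(indicator == 1, 0.2, 0.8)
y_toy = (rng.random(n) < p_high).astype(int)
X_toy = pd.DataFrame({"indicator": indicator, "servings": servings})

# Standardizing puts the coefficients on a comparable scale
X_scaled = StandardScaler().fit_transform(X_toy)
clf = LogisticRegression(random_state=12).fit(X_scaled, y_toy)

# Keep the sign: a large negative coefficient is "important" in absolute
# terms but pushes predictions towards the negative class
signed_coefs = pd.Series(clf.coef_[0], index=X_toy.columns)
print(signed_coefs.sort_values())
```

In this toy example the indicator ranks first by absolute value even though it reduces the probability of the positive class, which is why the sign matters before a category is prioritized.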

### Conclusion  
- **Overall Performance**: Both models perform similarly, with minor differences in their classification results. Logistic Regression has a slight edge in correctly identifying low traffic recipes (TN) and avoiding false positives (FP), while Random Forest performs marginally better in correctly identifying high traffic recipes (TP) and avoiding false negatives (FN).
- **Model Choice**: The choice between Logistic Regression and Random Forest can be guided by the specific priorities of the application. If minimizing false positives is crucial, Logistic Regression might be preferred. Conversely, if capturing as many true high traffic recipes as possible is more important, Random Forest could be the better choice.  

In summary, both models are effective for this classification task, and the differences in their confusion matrices are minimal, indicating comparable performance. Logistic Regression has a slight edge in precision and F1 Score, making it the better choice if minimizing false positives is crucial. Random Forest's higher recall makes it the better choice if capturing as many high traffic recipes as possible is more important. The final decision should therefore reflect the specific business objectives and the relative importance of precision versus recall.

## Business Metrics
To effectively compare model performance to business objectives, we need to define relevant business metrics that align with the goals of the organization. In this case, the primary business goal is to accurately predict high traffic recipes, which can drive more user engagement and potentially increase revenue through advertisements or premium content.  

**Key Metrics**:  
- **Conversion Rate**: Percentage of predicted high traffic recipes that result in high traffic. This metric helps assess the model's ability to identify recipes that will attract more users.
- **User Engagement**: Average time users spend on high traffic recipes. This metric indicates how engaging the predicted high traffic recipes are to users.
- **Revenue Impact**: Additional revenue generated from high traffic recipes. This metric quantifies the financial benefit of accurately predicting high traffic recipes.
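- **False Positive Rate**: Percentage of recipes predicted as high traffic that do not result in high traffic, calculated as 1 - precision. Strictly speaking, this quantity is the false discovery rate; the conventional false positive rate is the share of actual low traffic recipes that are predicted as high traffic. This metric helps understand the cost of incorrect predictions in terms of wasted resources and potential user dissatisfaction.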

### Model Performance Using Business Metrics
To evaluate the performance of the Logistic Regression and Random Forest models using the defined business metrics, we should analyze the following:  

**Conversion Rate**: The conversion rate, represented by precision, measures the percentage of recipes predicted to be high traffic that actually result in high traffic. A higher conversion rate indicates better model performance in accurately identifying high traffic recipes.
- **Logistic Regression**: Precision of 0.8435 (84.35%).  
- **Random Forest**: Precision of 0.8376 (83.76%).  

**User Engagement**: Track average time spent on high traffic recipes to compare engagement levels. Higher user engagement can lead to increased user satisfaction, loyalty, and potentially higher ad revenue.  

**Revenue Impact**: Quantify the financial benefit by analyzing the additional revenue generated from the high traffic recipes predicted by each model, comparing revenue before and after the models are implemented.  

**False Positive Rate**: Calculated here as 1 - precision (formally the false discovery rate), this metric helps assess the cost of incorrect predictions by indicating how often recipes predicted to be high traffic do not actually result in high traffic. A lower value is desirable as it minimizes wasted resources and potential user dissatisfaction.
- **Logistic Regression**: 15.65% (100% - 84.35%).  
- **Random Forest**: 16.24% (100% - 83.76%).  
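
The percentages above are 1 - precision, which differs from the conventional false positive rate FP / (FP + TN). A minimal sketch of how both quantities follow from confusion-matrix counts is shown below. The helper `rates_from_counts` and the counts passed to it are hypothetical and do not come from Figure 6.

```python
def rates_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive precision-style and recall-style rates from confusion counts."""
    return {
        "precision": tp / (tp + fp),             # share of predicted High that is truly High
        "false_discovery_rate": fp / (tp + fp),  # equals 1 - precision
        "recall": tp / (tp + fn),                # share of truly High that is found
        "false_positive_rate": fp / (fp + tn),   # share of truly Low flagged as High
    }

# Hypothetical confusion counts, for illustration only
rates = rates_from_counts(tn=50, fp=20, fn=10, tp=80)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```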

## Final Summary and Recommendations
Both Logistic Regression and Random Forest models perform well in predicting high traffic recipes, with Logistic Regression having a slight edge in precision. This indicates that both models can effectively identify recipes that drive user engagement and generate additional revenue. Key business metrics such as conversion rate, user engagement, and revenue impact will show whether using these models actually helps meet business goals. However, the false positive rate underscores the need to balance precision and recall to minimize incorrect predictions and optimize resource allocation. Continuous monitoring of these metrics ensures alignment with organizational goals and supports data-driven decisions to improve user engagement and revenue. 

**Model Selection**:
- **Baseline Model**: Logistic Regression for its simplicity and interpretability.
- **Comparison Model**: Random Forest to capture more complex relationships in the data.  

**Key Insights from Feature Importances**:
- **Beverages**: Most influential in predicting high traffic.
- **Vegetables and Potatoes**: Significant contributors to high traffic.
- **Servings**: Key feature in the Random Forest model, indicating its relevance in attracting high traffic.  

**Model Performance**:
- **Overall Performance**: Both models perform similarly, with minor differences in their classification results. Logistic Regression has a slight edge in correctly identifying low traffic recipes (TN) and avoiding false positives (FP), while Random Forest performs marginally better in correctly identifying high traffic recipes (TP) and avoiding false negatives (FN).
- **Model Choice**: The choice between Logistic Regression and Random Forest can be guided by the specific priorities of the application. If minimizing false positives is crucial, Logistic Regression might be preferred. Conversely, if capturing as many true high traffic recipes as possible is more important, Random Forest could be the better choice.  

In summary, both models are effective for this classification task, and the differences in their confusion matrices are minimal, indicating comparable performance. The final decision on which model to use should consider the specific business objectives and the relative importance of precision versus recall.

### Recommendations
The feature importances from the model results provide valuable insights for planning which recipes to publish. Below are suggestions on how to benefit from these insights:  

**Focus on High-Impact Categories**:
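- Beverages: Both models indicate that the Beverages category has the strongest influence on the traffic prediction. Because neither importance measure shows the direction of the effect, first confirm that beverage recipes are associated with high traffic; if so, prioritize publishing more of them.  
- Vegetables and Potatoes: These categories are also highly important. Subject to the same check on direction, increase the frequency of vegetable and potato recipes.  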

**Optimize Serving Sizes**:
- The number of servings is a significant feature in the Random Forest model. Analyze and publish recipes with popular serving sizes.  

**Diversify Recipe Categories**:
- Ensure a balanced mix of moderately important categories like Breakfast, Pork, and Chicken to maintain variety and cater to different audience preferences.  

**Minimize Low-Impact Categories**:
- Only publish in categories like Desserts, Meat, Lunch/Snacks and One Dish Meals occasionally. While it's still beneficial to include these recipes, they should not be the primary focus.  

**Seasonal and Trend Analysis**:
- **Seasonal Recipes**: Use the insights to plan seasonal recipes. For example, publish more vegetarian recipes during harvest seasons.  
- **Trend Analysis**: Monitor trends and adjust your recipe publication strategy accordingly. If a particular category gains popularity, increase the frequency of related recipes.  

**Marketing and Promotion**:
- Highlight high-impact categories in marketing campaigns. Promote beverage and vegetarian recipes on social media and newsletters to attract more traffic.  
- Use feedback and engagement metrics to refine strategy. If certain recipes receive more positive feedback, consider publishing similar recipes.  

**Measure Business Metrics**:
- **Traffic Increase**: Track the percentage increase in traffic after implementing model recommendations.  
- **Engagement Rate**: Monitor the average engagement rate (likes, shares, comments) for published recipes.  
- **Conversion Rate**: Measure the conversion rate of visitors to subscribers or customers. Note that this differs from the precision-based conversion rate defined under Business Metrics.  
- **Recipe Popularity**: Track the average traffic per recipe category to identify the most popular categories.  

By focusing on these recommendations, the business can leverage the predictive power of the models to drive more traffic, enhance user engagement, and ultimately increase revenue. Emphasizing beverages and vegetarian recipes, combined with content optimization and personalized recommendations, will help create a more engaging and user-friendly platform. This data-driven approach will optimize content strategy and achieve business objectives effectively.