# The Foodprocessor with new recipes with new features

In [58]:
# Start the notebook
!pip install datasets transformers



In [59]:
from datasets import load_dataset
import pandas as pd

In [60]:
from google.colab import drive
drive.mount('/content/drive')  # To save results

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [61]:
# Load dataset
def load_huggingface_dataset(dataset_name, subset=None):
    """
    Load a dataset from Hugging Face and convert to pandas DataFrame

    Args:
        dataset_name (str): Name of the dataset on Hugging Face
        subset (str, optional): Specific subset/configuration of the dataset

    Returns:
        dict: Dictionary of DataFrames for each split (train/test/validation)
    """
    try:
        # Load the dataset
        if subset:
            dataset = load_dataset(dataset_name, subset)
        else:
            dataset = load_dataset(dataset_name)

        # Convert to pandas DataFrames
        dataframes = {}
        for split in dataset.keys():
            dataframes[split] = dataset[split].to_pandas()

        return dataframes

    except Exception as e:
        print(f"Error loading dataset: {str(e)}")
        return None

# Call the function with the correct dataset name
DataFilesPatternsDict = load_huggingface_dataset("Thefoodprocessor/recipe_new_with_features_full")

In [62]:
# Display the first few rows of the dataset to get an overview
print(DataFilesPatternsDict['train'].head())

                                     recipe_original  \
0  Roasted Carrots and Beets with Pecan Pesto\n1 ...   
1  Escarole and Olive Salad with Raspberry Dressi...   
2  Apple Tansie\n12 egg yolks\n6 egg whites\n1/2 ...   
3  Sauteed Foie Gras with Ginger Poached Pears an...   
4  Lemon-Scented Broccolini\n1 1/2 pounds broccol...   

                                      title_original  \
0         Roasted Carrots and Beets with Pecan Pesto   
1   Escarole and Olive Salad with Raspberry Dressing   
2                                       Apple Tansie   
3  Sauteed Foie Gras with Ginger Poached Pears an...   
4                           Lemon-Scented Broccolini   

                                       title_cleaned  \
0                Roasted Vegetables with Pecan Pesto   
1   Escarole and Olive Salad with Raspberry Dressing   
2                                        Apple Tarts   
3  Sauteed Foie Gras with Ginger Poached PearsPin...   
4                                   Lemon Broc

## EDA Plan

### 1. Basic Dataset Overview

In [63]:
# Shape and Info: Get the number of rows and columns, and check the data types for each column to understand the structure.
# Access the 'train' DataFrame
train_df = DataFilesPatternsDict['train']

# Display shape and info of the train DataFrame
print(train_df.shape)
print(train_df.info())

(74465, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74465 entries, 0 to 74464
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   recipe_original           74465 non-null  object
 1   title_original            74465 non-null  object
 2   title_cleaned             74465 non-null  object
 3   recipe_new                74465 non-null  object
 4   wine_type                 74465 non-null  object
 5   allergy_type              74465 non-null  object
 6   diet_type                 74465 non-null  object
 7   holiday                   74465 non-null  object
 8   cuisine_type              74465 non-null  object
 9   meal_type                 74465 non-null  object
 10  ingredients_alternatives  74465 non-null  object
dtypes: object(11)
memory usage: 6.2+ MB
None


In [64]:
# Check for missing values: count the nulls in each column.
missing_values = train_df.isnull().sum()
print(missing_values)

# Display columns with missing values only, if any
missing_values = missing_values[missing_values > 0]
print("Columns with missing values:\n", missing_values)

recipe_original             0
title_original              0
title_cleaned               0
recipe_new                  0
wine_type                   0
allergy_type                0
diet_type                   0
holiday                     0
cuisine_type                0
meal_type                   0
ingredients_alternatives    0
dtype: int64
Columns with missing values:
 Series([], dtype: int64)


In [65]:
# Display column names
print(train_df.columns)

Index(['recipe_original', 'title_original', 'title_cleaned', 'recipe_new',
       'wine_type', 'allergy_type', 'diet_type', 'holiday', 'cuisine_type',
       'meal_type', 'ingredients_alternatives'],
      dtype='object')


In [66]:
# Check for entire duplicate rows
duplicate_rows = train_df[train_df.duplicated()]
print(f"Number of duplicate rows: {duplicate_rows.shape[0]}")

# Display a few duplicate rows, if any
if not duplicate_rows.empty:
    print(duplicate_rows.head())

Number of duplicate rows: 828
                                        recipe_original  \
1085  Cucumber Salsa\n2 cucumbers, thinly sliced\n3 ...   
1153  Toasted Baguette\n1 baguette\n1/4 cup olive oi...   
1198  Banana and Strawberries Delight\n1 1/2 cups gr...   
1232  Lemon Knots\n12 cups all-purpose flour, plus m...   
1326  Bread and Butter Pickles\n3 pounds pickling cu...   

                       title_original                   title_cleaned  \
1085                   Cucumber Salsa                  Cucumber Salsa   
1153                 Toasted Baguette                Toasted Baguette   
1198  Banana and Strawberries Delight   Banana and Strawberry Delight   
1232                      Lemon Knots                     Lemon Knots   
1326         Bread and Butter Pickles  Bread and Butter Pickle Recipe   

                                             recipe_new  \
1085  Cucumber Salsa\n2 cucumbers, thinly sliced\n3 ...   
1153  Toasted Baguette\n1 baguette\n1/4 cup olive oi...   

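In the duplicate rows shown above, `title_original` and `title_cleaned` are often identical.
Note that `duplicated()` flags a row only when all columns match an earlier row, so matching titles alone cannot create duplicates; the next cell counts how often the two title columns agree across the whole dataset.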

In [67]:
# Count rows where title_original and title_cleaned are the same
identical_titles_count = (train_df['title_original'] == train_df['title_cleaned']).sum()
print(f"Number of rows where title_original and title_cleaned are identical: {identical_titles_count}")

Number of rows where title_original and title_cleaned are identical: 17371
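
The count above measures something different from `duplicated()`: the latter flags a row only when every column matches an earlier row, whether or not the two title columns agree within that row. A minimal sketch on a made-up three-column frame (the toy rows are assumptions, not dataset records) shows the two checks side by side:

```python
import pandas as pd

# Toy frame (hypothetical rows) mimicking the title columns of the dataset
toy = pd.DataFrame({
    "title_original": ["Cucumber Salsa", "Cucumber Salsa", "Lemon Knots", "Hot Cocoa Smoothie"],
    "title_cleaned":  ["Cucumber Salsa", "Cucumber Salsa", "Lemon Knots", "Hot Chocolate Smoothie"],
    "recipe_new":     ["salsa text", "salsa text", "knots text", "smoothie text"],
})

# Whole-row duplicates: a row is flagged only if every column matches an earlier row
row_dupes = toy.duplicated()

# Within-row title match: unrelated to whether the row repeats elsewhere
same_title = toy["title_original"] == toy["title_cleaned"]

n_row_dupes = int(row_dupes.sum())
n_same_title = int(same_title.sum())

# Keep the first copy of each repeated row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_row_dupes, n_same_title, len(deduped))
```

On the real data, `train_df.drop_duplicates()` would remove the 828 repeated rows reported earlier; whether to do that before the later counts is a judgment call.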


### 2. Textual Column Analysis

In [68]:
# Display a few samples from the textual columns
print(train_df[['recipe_original', 'title_original', 'title_cleaned', 'recipe_new']].sample(5))

                                         recipe_original  \
18982  Whiskey Tri-Tip\n2 1/2 pounds tri-tip steak\n2...   
58033  Herb Stuffed Tomatoes\n5 beefsteak tomatoes\n1...   
62901  Cast-Iron Pizza\n1 pound frozen store-bought p...   
13165  Double-Yummy Chili Stuffed Baked Potatoes\n6 l...   
58403  Meatless Shepherd's Pie with Horseradish-Chedd...   

                                          title_original  \
18982                                    Whiskey Tri-Tip   
58033                              Herb Stuffed Tomatoes   
62901                                    Cast-Iron Pizza   
13165          Double-Yummy Chili Stuffed Baked Potatoes   
58403  Meatless Shepherd's Pie with Horseradish-Chedd...   

                                           title_cleaned  \
18982                                    Whiskey Tri-Tip   
58033                        Stuffed Tomatoes with Herbs   
62901                                 Pizza in cast iron   
13165                       Chili Stuf

In [69]:
# Display a random sample of 5 rows from all columns
train_df.sample(5)

Unnamed: 0,recipe_original,title_original,title_cleaned,recipe_new,wine_type,allergy_type,diet_type,holiday,cuisine_type,meal_type,ingredients_alternatives
72683,Brown Butter Risotto with Lobster\n1 pound (ab...,Brown Butter Risotto with Lobster,Brown Butter Risotto with Lobster,Brown Butter Risotto with Lobster\n1 pound (ab...,"[Chardonnay, Pinot Grigio, Sauvignon Blanc, Sp...","[Fish, Rice, Dairy, Nut]","[(Vegetarian,80),(Dairy-Free,70),(Lactovegetar...",Thanksgiving,"American, Italian","Dinner, Entree","Frozen lobster tails: fresh lobster tails, shr..."
58835,Hot Cocoa Smoothie\n3/4 cup chocolate oat milk...,Hot Cocoa Smoothie,Hot Chocolate Smoothie,Hot Chocolate Smoothie\n3/4 cup chocolate oat ...,"[Merlot, Cabernet Sauvignon, Malbec, Zinfandel]","[Milk, Nut]","[(Vegan,90),(Dairy-Free,80),(Lactovegetarians,...",Winter Solstice,"American, European","Breakfast, Dessert","Chocolate oat milk: almond milk, soy milk, coc..."
30473,Green Tomato Pie Turnovers\n5 green tomatoes\n...,Green Tomato Pie Turnovers,Green Tomato Pie Turnovers,Green Tomato Pie Turnovers\n5 green tomatoes\n...,"[Chardonnay, Sauvignon Blanc, Riesling, Pinot ...","[Nut, Dairy, Egg]","[(Vegan,80),(Vegetarian,70),(Dairy-Free,60),(F...",Thanksgiving,American,"Dinner, Appetizer","Green tomatoes: yellow tomatoes, cherry tomato..."
44744,Wild Rocket and Chilli Spaghetti\n1 pound 1-ou...,Wild Rocket and Chilli Spaghetti,Spaghetti with Wild Rocket and Chili,Spaghetti with Wild Rocket and Chili\n1 pound ...,"[Barolo, Barbaresco, Dolcetto, Aglianico]","[Milk, Anchovy, Citrus, Gluten]","[(Vegetarian,80),(Gluten-Free,70),(Dairy-Free,...",New Year's Eve,Italian,"Dinner, Entree","Dried spaghetti: bucatini, linguine, fettuccin..."
5964,Maple-Roasted Bacon\n3/4 pound thick-cut smoke...,Maple-Roasted Bacon,Maple Roasted Bacon,Maple Roasted Bacon\n3/4 pound thick-cut smoke...,"[Burgundy, Merlot, Pinot Gris]","[Meat, Sugar]","[(Paleo,80),(Ketogenic,70),(Atkins,60),(SouthB...",Thanksgiving,American,"Breakfast, Brunch, Dessert","Bacon: Canadian bacon, prosciutto, serrano ham..."


### 3. Categorical Data Analysis

In [70]:
# Columns like wine_type, allergy_type, diet_type, holiday, cuisine_type, and meal_type are categorical.
# Unique values and counts: for each categorical column, list the unique values and counts to understand the variety and frequency of categories.

# List unique values and their counts for each categorical column
categorical_columns = ['wine_type', 'allergy_type', 'diet_type', 'holiday', 'cuisine_type', 'meal_type']

for column in categorical_columns:
    print(f"\nUnique values and counts for '{column}':")
    print(train_df[column].value_counts())


Unique values and counts for 'wine_type':
wine_type
[Chardonnay, Sauvignon Blanc, Riesling, Pinot Grigio]            6708
[Chardonnay, Pinot Grigio, Sauvignon Blanc, Riesling]            4051
[Chardonnay, Sauvignon Blanc, Pinot Grigio, Riesling]            1921
[Chardonnay, Riesling, Gewürztraminer, Pinot Grigio]             1750
[Chardonnay, Sauvignon Blanc, Pinot Grigio, Vermentino]          1699
                                                                 ... 
[Rum, Bourbon, Brandy]                                              1
[Tequila, Moscato, Sauvignon Blanc, Pinot Grigio]                   1
[Lapsang Souchong, Chardonnay, Sauvignon Blanc, Pinot Grigio]       1
[Marsanne, Bois du Renard, Languedoc]\n\n                           1
[Chardonnay, Pinotage, Zinfandel]                                   1
Name: count, Length: 5871, dtype: int64

Unique values and counts for 'allergy_type':
allergy_type
[Alcohol]                              555
[Dairy, Sugar]                     

In [77]:
# List unique values for each categorical column
categorical_columns = ['wine_type', 'allergy_type', 'diet_type', 'holiday', 'cuisine_type', 'meal_type']

for column in categorical_columns:
    print(f"\nUnique values for '{column}':")
    print(train_df[column].unique())


Unique values for 'wine_type':
['[Pinot Noir, Beaujolais, Sangiovese, Grenache]'
 '[Chardonnay, Sauvignon Blanc, Vermentino, Grüner Veltliner]'
 '[Chardonnay, Riesling, Gewürztraminer]' ...
 '[Chardonnay, Riesling, Gouda Cheese, Tempranillo]'
 '[Pinot Grigio, Chianti, Riesling, Sparkling Shiraz]'
 '[Chardonnay, Pinotage, Zinfandel]']

Unique values for 'allergy_type':
['[Nut, Beet, Fruit, Cheese]' '[Fruit, Vegetable, Nut, Sesame]'
 '[Egg, Fruit]' ...
 "[Cow's milk, Eggs, Garlic, Onion, Rice, Spinach, Wheat]"
 '[Fish, Beans, Herbs, Dairy]' "[Cow's milk, Gluten, Garlic, Lemon]"]

Unique values for 'diet_type':
['[(Vegetarian,80),(Gluten-Free,70),(Dairy-Free,60),(Paleo,50),(Lactovegetarians,40)]'
 '[(Vegan,80),(Vegetarian,70),(Fruitarian,60),(Paleo,50),(LowFODMAP,40)]'
 '[(Vegan,0),(Fruitarian,80),(Paleo,70),(LowFODMAP,60),(Vegetarian,50)]'
 ...
 '[(Paleo,90),(LowFODMAP,80),(Gluten-Free,70),(Dairy-Free,60),(Vegetarian,50),(OvoVegetarian,40)]'
 '[(Paleo,80),(Ketogenic,70),(Atkins,60),(Low

Below is a summary of the unique values and their counts for each categorical column:

### Summary of Categorical Columns

1. **`wine_type`**:
   - Total unique types: **5,871**
   - Most common combinations:
     - `[Chardonnay, Sauvignon Blanc, Riesling, Pinot Grigio]` appears **6,708** times.
     - `[Chardonnay, Pinot Grigio, Sauvignon Blanc, Riesling]` appears **4,051** times.
   - There’s a wide variety of wine combinations, with some very specific or unique entries appearing only once.

2. **`allergy_type`**:
   - Total unique types: **31,229**
   - Most common types:
     - `[Alcohol]` appears **555** times.
     - `[Dairy, Sugar]` appears **436** times.
   - This high variety indicates that the dataset includes many specific combinations of allergens, possibly to cover a range of dietary restrictions and food sensitivities.

3. **`diet_type`**:
   - Total unique types: **9,300**
   - Most common types:
     - `[(Paleo,80),(Ketogenic,70),(Atkins,60),(SouthBeach,50)]` appears **3,893** times.
     - `[(Vegetarian,80),(Dairy-Free,70),(Lactovegetarians,60),(OvoVegetarian,50),(Paleo,40)]` appears **3,637** times.
   - The dataset covers a wide range of diet types, often combining several diets in one entry, which reflects complex dietary needs.

4. **`holiday`**:
   - Total unique types: **271**
   - Most common holidays:
     - `Thanksgiving` appears **30,103** times.
     - `Easter` appears **12,535** times.
     - `New Year’s Eve` appears **9,417** times.
   - The dataset is heavily centered around traditional American holidays, with a few unique or rare holidays.

5. **`cuisine_type`**:
   - Total unique types: **1,385**
   - Most common cuisines:
     - `American` appears **19,038** times.
     - `Italian` appears **6,659** times.
     - `American, Italian` appears **5,802** times.
   - The dataset is diverse in terms of cuisine, with combinations of multiple cuisines being common.

6. **`meal_type`**:
   - Total unique types: **4,449**
   - Most common meal types:
     - `Dinner, Appetizer` appears **8,398** times.
     - `Dinner, Entree` appears **7,542** times.
     - `Dessert, Snack` appears **6,581** times.
   - There are many meal-type combinations, with dinner, dessert, and snacks being the most frequently mentioned.

### Observations
- The dataset exhibits significant variety within each categorical column, especially for `allergy_type` and `diet_type`, which likely reflects the need to address a wide array of dietary restrictions.
- Certain categories, like `wine_type` and `meal_type`, have high counts for specific combinations, indicating common pairings (e.g., certain wine types paired together or typical meal courses).
- **Thanksgiving** is the most popular holiday in the dataset by a large margin, and **American cuisine** is the most common, which may suggest a bias towards American food culture in this dataset.

This diversity in categorical data can provide interesting insights and challenges when analyzing food-related trends, such as the popularity of specific cuisines during holidays or the commonality of certain dietary restrictions.
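
One caveat on the counts above: `value_counts()` compares the raw strings, so the same set of wines listed in a different order is counted as a separate category (the first three `wine_type` rows in the output contain the same four wines). Below is a minimal sketch of an order-insensitive count, using a few strings copied from the output rather than the full column. The helper name `normalize_label_list` is hypothetical.

```python
from collections import Counter

def normalize_label_list(raw):
    """Turn '[A, B, C]' into a sorted tuple so that item order no longer matters."""
    items = [item.strip() for item in raw.strip().strip("[]").split(",")]
    return tuple(sorted(item for item in items if item))

# Strings copied from the value_counts output above (counts omitted)
samples = [
    "[Chardonnay, Sauvignon Blanc, Riesling, Pinot Grigio]",
    "[Chardonnay, Pinot Grigio, Sauvignon Blanc, Riesling]",
    "[Chardonnay, Sauvignon Blanc, Pinot Grigio, Riesling]",
    "[Rum, Bourbon, Brandy]",
]

raw_counts = Counter(samples)                                           # one key per distinct string
normalized_counts = Counter(normalize_label_list(s) for s in samples)  # one key per distinct set

print(len(raw_counts))
print(len(normalized_counts))
```

Applied to the real column, this would be `train_df['wine_type'].map(normalize_label_list).value_counts()`. The helper suits plain lists such as `wine_type` and `allergy_type`; `diet_type` entries hold `(name,score)` pairs and need a different parser.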

### 4. Multi-Label Analysis

In [71]:
# Some columns, like allergy_type, diet_type, and wine_type, contain lists of multiple items.
# Multi-label columns

# Define the multi-label columns you want to analyze
multi_label_columns = ['allergy_type', 'diet_type', 'wine_type']

# Loop through each column and perform analysis
for column in multi_label_columns:
    print(f"\nAnalysis for '{column}' column:")

    # Explode the column so each item in the list becomes a separate row
    exploded_df = train_df.copy()
    exploded_df[column] = exploded_df[column].str.strip('[]').str.replace("'", "").str.split(',')
    exploded_df = exploded_df.explode(column)

    # Clean up whitespace if necessary
    exploded_df[column] = exploded_df[column].str.strip()

    # Display the value counts for the exploded column
    print(exploded_df[column].value_counts().head(10))  # Display top 10 most common items



Analysis for 'allergy_type' column:
allergy_type
Milk      19787
Nut       17233
Dairy     16635
Egg       14430
Wheat     13055
Fruit     12360
Soy        9909
Sugar      8496
Gluten     8467
Meat       7582
Name: count, dtype: int64

Analysis for 'diet_type' column:
diet_type
70)                  73274
80)                  72163
60)                  71318
(Paleo               64591
50)                  62555
(Vegetarian          47584
(Dairy-Free          43652
40)                  43181
(Lactovegetarians    32906
(OvoVegetarian       27658
Name: count, dtype: int64

Analysis for 'wine_type' column:
wine_type
Chardonnay            51387
Sauvignon Blanc       43221
Pinot Grigio          35952
Riesling              33077
Vermentino            12976
Tempranillo           10223
Moscato                7694
Cabernet Sauvignon     6198
Gewürztraminer         6092
Zinfandel              5551
Name: count, dtype: int64


Below is a summary of the exploded multi-label analysis for each of the columns.

### Summary of Multi-Label Analysis

#### 1. **`allergy_type`**:
   - The most common allergens identified in recipes are:
     - **Milk** (19,787 occurrences)
     - **Nut** (17,233 occurrences)
     - **Dairy** (16,635 occurrences)
     - **Egg** (14,430 occurrences)
     - **Wheat** (13,055 occurrences)
   - Other frequently listed allergens include **Fruit**, **Soy**, **Sugar**, **Gluten**, and **Meat**.
   - This distribution shows that many recipes contain common allergens, particularly milk and other dairy, nuts, and wheat, which matter for people with food allergies or sensitivities. Note that **Milk** and **Dairy** are counted as separate labels even though they overlap.

#### 2. **`diet_type`**:
   - The `diet_type` counts contain fragments such as `70)` and `(Paleo`. Each entry is a `(diet,score)` pair, and the code above splits on every comma, so each pair is cut into a name fragment and a score fragment. The fragments are parsing artifacts rather than missing data.
   - **Popular Diet Types**:
     - **Paleo** appears most often (64,591 occurrences), followed by **Vegetarian** (47,584 occurrences) and **Dairy-Free** (43,652 occurrences).
     - **Lactovegetarians** (32,906) and **OvoVegetarian** (27,658) are also common, indicating a mix of plant-based and specialized diets.
   - Parsing each `(diet,score)` pair as a unit, rather than splitting on commas, would give cleaner counts and keep each score attached to its diet.

#### 3. **`wine_type`**:
   - The most commonly paired wine types with recipes are:
     - **Chardonnay** (51,387 occurrences)
     - **Sauvignon Blanc** (43,221 occurrences)
     - **Pinot Grigio** (35,952 occurrences)
     - **Riesling** (33,077 occurrences)
   - Other popular wine types include **Vermentino**, **Tempranillo**, **Moscato**, **Cabernet Sauvignon**, **Gewürztraminer**, and **Zinfandel**.
   - This suggests a preference for light and versatile white wines (like Chardonnay, Sauvignon Blanc, and Pinot Grigio) in recipe pairings.

### Observations
- **Common Allergens**: Milk, nuts, dairy, and eggs are the most frequent allergens, underscoring their prevalence in many recipes and the importance of allergen labeling.
- **Diet Types**: Paleo and various vegetarian or dairy-free diets are well-represented, though the `diet_type` column needs pair-aware parsing so that fragments like `70)` are not counted as labels.
- **Wine Pairings**: White wines dominate the pairing choices, with Chardonnay being the top choice, followed by Sauvignon Blanc and Pinot Grigio.
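
The `70)` and `(Paleo` tokens in the `diet_type` counts come from splitting on every comma, which also cuts each `(name,score)` pair in half. Below is a minimal sketch of a parser that keeps the pairs intact, assuming every valid entry looks like `(Name,number)`. The function name `parse_diet_pairs` and the regular expression are illustrative, not part of the notebook.

```python
import re
from collections import Counter

# One '(Name,score)' pair: the name has no commas or parentheses, the score is an integer
PAIR_RE = re.compile(r"\(\s*([^,()]+?)\s*,\s*(\d+)\s*\)")

def parse_diet_pairs(raw):
    """Return [(diet_name, score), ...] for a string like '[(Vegan,90),(Paleo,50)]'."""
    return [(name, int(score)) for name, score in PAIR_RE.findall(raw)]

# Strings copied from the unique-value output above
samples = [
    "[(Vegetarian,80),(Gluten-Free,70),(Dairy-Free,60),(Paleo,50),(Lactovegetarians,40)]",
    "[(Vegan,80),(Vegetarian,70),(Fruitarian,60),(Paleo,50),(LowFODMAP,40)]",
]

# Count diet names only; the scores stay available in the parsed pairs
diet_counts = Counter(name for s in samples for name, _ in parse_diet_pairs(s))

print(parse_diet_pairs(samples[0])[:2])
print(diet_counts.most_common(2))
```

Text that does not match the pattern (for example, embedded notes) yields no pairs, so it is dropped rather than counted as a diet.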


### 5. Numerical Analysis

In [72]:
# Statistics for numerical columns (if any)
# Identify numerical columns in the DataFrame
numerical_columns = train_df.select_dtypes(include=['number']).columns

# Check if there are any numerical columns
if len(numerical_columns) > 0:
    # Display descriptive statistics for numerical columns
    descriptive_stats = train_df[numerical_columns].describe().transpose()
    descriptive_stats['median'] = train_df[numerical_columns].median()

    print("Descriptive Statistics for Numerical Columns:")
    print(descriptive_stats)
else:
    print("No numerical columns found in the dataset.")

No numerical columns found in the dataset.


### 6. Relationship Analysis

#### Diet vs. Allergies: Investigate if there are patterns or correlations between diet_type and allergy_type.

We have used .explode() to treat each list item separately and then create a crosstab.

In [73]:
# Explode diet_type and allergy_type columns
diet_allergy_df = train_df.copy()
diet_allergy_df['diet_type'] = diet_allergy_df['diet_type'].str.strip('[]').str.replace("'", "").str.split(',')
diet_allergy_df['allergy_type'] = diet_allergy_df['allergy_type'].str.strip('[]').str.replace("'", "").str.split(',')
diet_allergy_df = diet_allergy_df.explode('diet_type').explode('allergy_type')

# Clean up any extra whitespace
diet_allergy_df['diet_type'] = diet_allergy_df['diet_type'].str.strip()
diet_allergy_df['allergy_type'] = diet_allergy_df['allergy_type'].str.strip()

# Create a crosstab to view the frequency of each diet and allergy combination
diet_allergy_crosstab = pd.crosstab(diet_allergy_df['diet_type'], diet_allergy_df['allergy_type'])

# Display the crosstab
print("Diet vs. Allergy Crosstab:")
print(diet_allergy_crosstab)

Diet vs. Allergy Crosstab:
allergy_type                                           Acid  Acidic  Acidity  \
diet_type                                                                      
                                                    0     0       0        0   
"driedshrimp"isnotdairy-free                        0     0       0        0   
(                                                   0     0       0        0   
((Dairy-Free                                        0     0       0        0   
((Vegetarian                                        0     0       0        0   
...                                                ..   ...     ...      ...   
zestedandjuiced\n2tbspoliveoil\n2tbspchoppedfre...  0     0       0        0   
zestedandjuiced\n2tbspoliveoil\n2tbspchoppedfre...  0     0       0        0   
zestedandjuiced\n2tbspoliveoil\n2tbspchoppedfre...  0     0       0        0   
zestedandjuiced\n2tbspoliveoil\n2tbspchoppedfre...  0     0       0        0   
zestedandjuic

The output of the **Diet vs. Allergy** crosstab shows the frequency of each diet type alongside different allergens. Here’s a summary of what this large table indicates:

##### Summary of Diet vs. Allergy Crosstab

1. **Structure of the Crosstab**:
   - The crosstab has **4,562 rows** (unique `diet_type` tokens after splitting) and **1,694 columns** (unique `allergy_type` tokens).
   - Each cell counts how many times a diet token and an allergen token occur in the same recipe.

2. **Issues with Data Quality**:
   - Many row labels in `diet_type` are not real diet types. Some, such as `"(Dairy-Free"` and `"(Vegetarian"`, are fragments produced by splitting `(diet,score)` pairs on commas; others, such as `"you can substitute with an equal amount of granulated sugar"`, are free text embedded in the column itself.
   - Both kinds of entries add rows to the crosstab that do not represent true diet types.

3. **Insights from Valid Data**:
   - Despite the unstructured entries, valid diet types (like `Dairy-Free`, `Vegetarian`, `Paleo`) and allergens (like `Milk`, `Nut`, `Wheat`, `Egg`) can still provide insights:
     - **Common Allergens Across Diets**: Popular diet labels like `Dairy-Free` and `Vegetarian` appear to co-occur with common allergens like **Milk**, **Nuts**, and **Wheat**. Because each diet label carries a numeric score, co-occurrence alone does not show that a recipe fully satisfies the diet.
     - **Patterns by Diet**: The crosstab can be used to check which diet labels rarely appear with specific allergens. For example, if `Dairy-Free` recipes commonly lack **Milk** but contain **Nuts**, this might suggest a trade-off in recipe ingredients to maintain dietary restrictions.
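
Because both columns are exploded, a recipe with m diet labels and n allergens contributes m × n rows to the table behind the crosstab, so the cell values count label pairings. The sketch below builds the same kind of crosstab from two made-up recipes, using a pair-aware regular expression for `diet_type` so that the row labels are clean diet names. The toy data and the pattern are assumptions for illustration.

```python
import pandas as pd

# Two hypothetical recipes in the dataset's string format
toy = pd.DataFrame({
    "diet_type": ["[(Vegan,90),(Paleo,50)]", "[(Paleo,80),(Ketogenic,70)]"],
    "allergy_type": ["[Nut, Soy]", "[Egg]"],
})

tidy = toy.copy()
# Keep only the diet name from each '(Name,score)' pair
tidy["diet_type"] = tidy["diet_type"].str.findall(r"\(([^,()]+),\s*\d+\)")
# Allergens are a plain comma-separated list
tidy["allergy_type"] = tidy["allergy_type"].str.strip("[]").str.split(",")

# Explode one column at a time; ignore_index avoids duplicated index labels
tidy = tidy.explode("diet_type", ignore_index=True)
tidy = tidy.explode("allergy_type", ignore_index=True)
tidy["allergy_type"] = tidy["allergy_type"].str.strip()

crosstab = pd.crosstab(tidy["diet_type"], tidy["allergy_type"])
print(crosstab)
```

The first toy recipe yields 2 × 2 = 4 rows and the second 2 × 1 = 2, which is why totals in such a table exceed the number of recipes.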

#### Holiday vs. Cuisine: Check if specific holidays are associated with certain cuisines.

We can again use .explode() on cuisine_type if it contains multiple entries per recipe.

In [74]:
# Explode the cuisine_type column
holiday_cuisine_df = train_df.copy()
holiday_cuisine_df['cuisine_type'] = holiday_cuisine_df['cuisine_type'].str.strip('[]').str.replace("'", "").str.split(',')
holiday_cuisine_df = holiday_cuisine_df.explode('cuisine_type')

# Clean up any extra whitespace
holiday_cuisine_df['cuisine_type'] = holiday_cuisine_df['cuisine_type'].str.strip()

# Create a crosstab for holiday and cuisine_type
holiday_cuisine_crosstab = pd.crosstab(holiday_cuisine_df['holiday'], holiday_cuisine_df['cuisine_type'])

# Display the crosstab
print("Holiday vs. Cuisine Crosstab:")
print(holiday_cuisine_crosstab)

Holiday vs. Cuisine Crosstab:
cuisine_type                                 * African* Middle Eastern  \
holiday                                                                  
 Cinco de Mayo                            0                          0   
4th of July                               0                          0   
All Hallows Eve                           0                          0   
Anniversary                               0                          0   
Australia Day                             0                          0   
...                                      ..                        ...   
Winter/Holiday Season                     0                          0   
Winter/New Year's Eve                     0                          0   
Winter/Spring Break                       0                          0   
WinterPlease answer with only one phrase  0                          0   
Winterrecipe title:                       0                          0   

cuisine

The **Holiday vs. Cuisine** crosstab reveals the frequency of specific cuisines associated with different holidays. Below is a summary of the main points:

##### Summary of Holiday vs. Cuisine Crosstab

1. **Structure of the Crosstab**:
   - The crosstab contains **271 rows** (representing unique holidays) and **753 columns** (representing unique cuisines).
   - Each cell represents the number of recipes associated with a specific holiday and cuisine pairing.

2. **Observations on Data Quality**:
   - There are numerous irregular entries in the `cuisine_type` column, such as `"AfghanText: Lemon-"`, `"Winterrecipe title:"`, and `"gdsgesgbsgygeg"`. These likely result from unstructured data or incorrectly parsed text, which add unnecessary complexity to the analysis.
   - Similar issues appear in `holiday` (e.g., `"WinterPlease answer with only one phrase"`), indicating that some holiday values may also need cleaning for meaningful insights.

3. **Key Insights (Ignoring Data Artifacts)**:
   - **Most Popular Holidays**: Holidays such as `4th of July` have substantial counts across American cuisines, indicating a strong association with American culinary traditions.
   - **4th of July**: This holiday appears frequently with **American** cuisine, as well as combinations with Mexican (e.g., "AmericanMexican") and Western influences.
   - **Thanksgiving**: Thanksgiving is the most frequent holiday label overall (30,103 recipes in the categorical summary) and American is the most frequent cuisine, so this pairing is expected to hold the largest counts. The truncated printout does not show the exact value.
   - **Global Representation**: The crosstab also includes a broad range of cuisines from various regions (e.g., "Middle Eastern", "African", "Yucatecan"), suggesting some cultural diversity within holiday-themed recipes.

4. **Patterns by Cuisine Type**:
   - **American Cuisine**: Strongly represented across a variety of holidays, especially **4th of July** and **Thanksgiving**, reinforcing its significance in American holiday traditions.
   - **Niche Cuisines**: Less common cuisines (e.g., "West African" or "Yucatecan") have sparse representation across holidays, with few associations.
   - **Cultural Holidays and Specific Cuisines**: Certain holidays, like **Cinco de Mayo**, traditionally associated with Mexican culture, appear in the table but with limited counts due to the sparse data structure.
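
Entries such as `Winterrecipe title:` look like leftover generation text and are presumably rare. One hedged way to set them aside before building the crosstab is a frequency threshold: keep labels that occur at least `min_count` times and group the rest. The sketch below uses made-up counts and a hypothetical helper `keep_frequent`. The threshold is an assumption to tune, and rare but legitimate holidays would be grouped along with the noise.

```python
from collections import Counter

def keep_frequent(labels, min_count):
    """Replace labels seen fewer than min_count times with 'Other'."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else "Other" for label in labels]

# Made-up holiday column: two frequent labels plus two artifact-like strings
holidays = (
    ["Thanksgiving"] * 5
    + ["Easter"] * 3
    + ["WinterPlease answer with only one phrase", "Winterrecipe title:"]
)

cleaned = keep_frequent(holidays, min_count=2)
print(Counter(cleaned))
```

The same helper could be applied to the exploded `cuisine_type` values before calling `pd.crosstab`, which would shrink the table to labels that occur often enough to compare.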

#### Wine Pairings: Look into common wine types associated with specific cuisines.

In [75]:
# Explode both wine_type and cuisine_type columns
wine_cuisine_df = train_df.copy()
wine_cuisine_df['wine_type'] = wine_cuisine_df['wine_type'].str.strip('[]').str.replace("'", "").str.split(',')
wine_cuisine_df['cuisine_type'] = wine_cuisine_df['cuisine_type'].str.strip('[]').str.replace("'", "").str.split(',')
wine_cuisine_df = wine_cuisine_df.explode('wine_type').explode('cuisine_type')

# Clean up any extra whitespace
wine_cuisine_df['wine_type'] = wine_cuisine_df['wine_type'].str.strip()
wine_cuisine_df['cuisine_type'] = wine_cuisine_df['cuisine_type'].str.strip()

# Create a crosstab to view the relationship between wine type and cuisine type
wine_cuisine_crosstab = pd.crosstab(wine_cuisine_df['wine_type'], wine_cuisine_df['cuisine_type'])

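# Rank wine types by how many recipes list them (same parsing as the multi-label analysis)
wine_pairing_counts = (
    train_df['wine_type'].str.strip('[]').str.replace("'", "").str.split(',')
    .explode().str.strip().value_counts()
)

# Display the top 10 wine types with their associated cuisine counts
top_wines = wine_cuisine_crosstab.loc[wine_pairing_counts.head(10).index]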
print("Top 10 Wine Types and Associated Cuisine Counts:")
print(top_wines)

Top 10 Wine Types and Associated Cuisine Counts:
cuisine_type           * African* Middle Eastern  * Eastern European  \
wine_type                                                              
Chardonnay          2                          1                   0   
Sauvignon Blanc     2                          1                   1   
Pinot Grigio        1                          1                   1   
Riesling            1                          0                   1   
Vermentino          1                          1                   0   
Tempranillo         1                          0                   0   
Moscato             0                          0                   0   
Cabernet Sauvignon  0                          0                   0   
Gewürztraminer      1                          0                   0   
Zinfandel           0                          0                   0   

cuisine_type        * European* Middle Eastern  * Jewish* Eastern European  \
wine_typ

##### Summary of Top 10 Wines and Associated Cuisine Counts

Below is a summary of the **Top 10 Wine Types and Associated Cuisine Counts** crosstab outcome.

1. **Most Common Wine Pairings with American Cuisine**:
   - **Chardonnay**, **Sauvignon Blanc**, **Pinot Grigio**, and **Riesling** are heavily associated with **American cuisine**. For example:
     - **Chardonnay**: 37,171 associations with American cuisine.
     - **Sauvignon Blanc**: 29,764 associations with American cuisine.
     - **Pinot Grigio**: 25,445 associations with American cuisine.
     - This suggests that American dishes commonly recommend these lighter, versatile wines.

2. **Cuisines with Lower Representation**:
   - Cuisines like **African**, **West African**, **Middle Eastern**, and **Yucatecan** have much lower counts across all wine types. For example:
     - **Chardonnay** has only 42 associations with **African** cuisine and 9 with **West African**.
     - This could indicate that these cuisines are either less represented in the dataset or are not traditionally associated with these wine pairings.

3. **Irregular and Miscellaneous Entries**:
   - Several entries in `cuisine_type` seem to be unstructured or contain extraneous information, such as:
     - `"West AfricanNote: The term "West"`
     - `"gdsgesgbsgygeg"`
     - `"cook the end of the end of the end of"`
   - These appear to be data artifacts or unstructured text, which may result from parsing errors or extraneous notes in the original data.

4. **Insights on Specific Wine Types**:
   - **Tempranillo** and **Zinfandel** are associated with fewer cuisines, indicating they may be recommended for more specific dishes or cultural contexts.
   - **Moscato** and **Gewürztraminer** have relatively low counts across the board, with a slight presence in **American** and **Western** cuisines. Moscato is often associated with desserts or sweeter dishes, which may be less frequently represented in general cuisine categories.

5. **Common Combinations**:
   - **Chardonnay and American cuisine** is the most frequent pairing, followed by **Sauvignon Blanc with American cuisine**.
   - Some unique pairings include **Tempranillo with Western** and **Riesling with European cuisines**.
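
The irregular `cuisine_type` entries noted in point 3 can be reduced before building the crosstab. The following is a minimal sketch, not the cleaning used in this notebook: the helper names `normalize_cuisine` and `collapse_rare`, the `Note:` cut-off rule, and the `min_count` threshold are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def normalize_cuisine(label):
    """Trim a raw cuisine label (hypothetical helper, assumed rules)."""
    # Drop anything from "Note:" onward, e.g. 'West AfricanNote: The term "West"'.
    label = re.split(r'Note:', label, maxsplit=1)[0]
    # Remove surrounding whitespace and stray "*" list markers.
    return label.strip(' *\t\n')

def collapse_rare(labels, min_count=2, other='Other'):
    """Replace labels seen fewer than min_count times with a catch-all value."""
    cleaned = [normalize_cuisine(label) for label in labels]
    counts = Counter(cleaned)
    return [label if label and counts[label] >= min_count else other
            for label in cleaned]

# Toy labels for illustration only, modelled on the artifacts listed above.
raw_labels = [
    'American',
    '* American',
    'West AfricanNote: The term "West"',
    'West African',
    'gdsgesgbsgygeg',
    'cook the end of the end of the end of',
]
print(collapse_rare(raw_labels))
```

A frequency threshold is a blunt tool: it also folds genuinely rare cuisines into the catch-all value, so `min_count` would need tuning against the real label distribution.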

### 7. Alternative ingredients

We performed text mining to identify the most common ingredient substitutions.

**Explanation**

- **Function to extract alternatives**: The `extract_alternatives` function uses a regular expression to identify the primary ingredient and its alternatives on each line of the `ingredients_alternatives` text.
- **Flatten the list of alternatives**: We apply the function to each row of `ingredients_alternatives` and flatten the results into a single list of `(primary, alternative)` pairs.
- **Counting frequencies**:
  - `alternative_counter` counts how often each ingredient appears as a substitute, giving insight into popular substitutions.
  - `primary_counter` counts the primary ingredients that are most often substituted.

In [76]:
import re
from collections import Counter

# Step 1: Preprocess and extract ingredient alternatives
def extract_alternatives(text):
    # Split the text into lines, then use a regex to pull the primary ingredient and its alternatives from each line
    alternatives = []
    lines = text.split('\n')
    for line in lines:
        # Match 'Ingredient: alt1, alt2'. Note: \w+ only captures a single-word
        # ingredient at the very start of the line; other lines are skipped.
        match = re.match(r'(\w+):\s*(.+)', line)
        if match:
            primary = match.group(1).strip()
            substitutes = [alt.strip() for alt in match.group(2).split(',')]
            for substitute in substitutes:
                alternatives.append((primary, substitute))
    return alternatives

# Step 2: Apply the function to extract all alternatives
all_alternatives = []
for text in train_df['ingredients_alternatives'].dropna():
    all_alternatives.extend(extract_alternatives(text))

# Step 3: Count the frequency of each alternative
alternative_counter = Counter([alt[1] for alt in all_alternatives])

# Display the most common alternative ingredients
print("Most Common Alternative Ingredients:")
print(alternative_counter.most_common(10))

# Step 4 (Optional): Count the frequency of primary ingredient substitutions
primary_counter = Counter([alt[0] for alt in all_alternatives])
print("\nMost Common Primary Ingredients for Substitution:")
print(primary_counter.most_common(10))

Most Common Alternative Ingredients:
[('shallots', 30720), ('scallions', 22876), ('maple syrup', 17123), ('green onions', 16347), ('ghee', 15030), ('kosher salt', 15008), ('coconut oil', 13228), ('whole wheat flour', 12891), ('basil', 12626), ('sea salt', 11713)]

Most Common Primary Ingredients for Substitution:
[('Garlic', 47018), ('Butter', 44147), ('Sugar', 40982), ('Salt', 38712), ('Flour', 36744), ('Onion', 30425), ('Eggs', 26258), ('Milk', 18087), ('Water', 15957), ('Parsley', 15460)]
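
Because the pattern `(\w+):` requires a single word at the start of the line, a line such as `Olive oil: ...` or `- Butter: ...` does not match and is left out of the counts above. The sketch below shows a more permissive variant. It is an illustration under assumptions: the name `extract_alternatives_loose`, the optional bullet prefix, and the lowercasing are not part of the original analysis, and the exact line format of `ingredients_alternatives` is assumed to be `Ingredient: alt1, alt2`.

```python
import re

# Optional bullet, then everything up to the first colon as the primary
# ingredient, then the comma-separated alternatives. (Assumed line format.)
LINE_PATTERN = re.compile(r'^\s*(?:[-*•]\s*)?([^:\n]+?)\s*:\s*(.+)$')

def extract_alternatives_loose(text):
    """Return (primary, alternative) pairs, allowing multi-word primaries."""
    pairs = []
    for line in text.split('\n'):
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # no 'Ingredient: ...' structure on this line
        primary = match.group(1).strip().lower()
        for alt in match.group(2).split(','):
            alt = alt.strip().lower()
            if alt:  # skip empty pieces left by trailing commas
                pairs.append((primary, alt))
    return pairs

# Toy input for illustration only.
sample = "Olive oil: Canola oil, ghee\n- Butter: Margarine,\nno colon here"
print(extract_alternatives_loose(sample))
```

Lowercasing both sides also merges entries that differ only in case, which the original counters treat as distinct; rerunning the counts with this variant would therefore change the totals reported above.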


##### Summary of the top 10 most common **alternative ingredients** and **primary ingredients** for substitution

1. **Top 10 Most Frequent Alternative Ingredients**:
   - **Shallots**: 30,720 occurrences
   - **Scallions**: 22,876 occurrences
   - **Maple Syrup**: 17,123 occurrences
   - **Green Onions**: 16,347 occurrences
   - **Ghee**: 15,030 occurrences
   - **Kosher Salt**: 15,008 occurrences
   - **Coconut Oil**: 13,228 occurrences
   - **Whole Wheat Flour**: 12,891 occurrences
   - **Basil**: 12,626 occurrences
   - **Sea Salt**: 11,713 occurrences

2. **Insights**:
   - **Frequent Substitutes for Common Ingredients**: Many of these alternatives plausibly map onto the staples in the primary list (e.g., **shallots** for onions, **ghee** for butter, **maple syrup** for sugar), although the counts above do not show which primary ingredient each alternative replaces.
   - **Healthy Alternatives**: Ingredients like **coconut oil** and **whole wheat flour** suggest that some substitutions are aimed at health-conscious or dietary preferences. **Sea salt** and **kosher salt** are more likely swaps of texture or flavor, since they contain comparable sodium to table salt by weight.
   - **Interchangeable Ingredients**: **Scallions** and **green onions** are two names for the same vegetable, so their combined count (39,223) exceeds that of **shallots**. All three are alliums that can stand in for onion.
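
The two counters report substitutes and primaries separately, so they cannot confirm which substitute goes with which primary ingredient. One way to check is to group the `(primary, alternative)` pairs by primary. This is a minimal sketch; `top_substitutes` is a hypothetical helper, and the toy pairs stand in for the notebook's `all_alternatives` list.

```python
from collections import Counter, defaultdict

def top_substitutes(pairs, n=3):
    """Map each primary ingredient to its n most frequent substitutes."""
    by_primary = defaultdict(Counter)
    for primary, substitute in pairs:
        by_primary[primary][substitute] += 1
    return {primary: counts.most_common(n)
            for primary, counts in by_primary.items()}

# Toy pairs for illustration only; in the notebook this would be all_alternatives.
toy_pairs = [
    ('Onion', 'shallots'),
    ('Onion', 'shallots'),
    ('Onion', 'scallions'),
    ('Butter', 'ghee'),
]
result = top_substitutes(toy_pairs)
print(result['Onion'])   # [('shallots', 2), ('scallions', 1)]
print(result['Butter'])  # [('ghee', 1)]
```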

##### Summary of most common primary ingredients for substitution
1. **Top 10 Most Frequent Primary Ingredients**:
   - **Garlic**: 47,018 occurrences
   - **Butter**: 44,147 occurrences
   - **Sugar**: 40,982 occurrences
   - **Salt**: 38,712 occurrences
   - **Flour**: 36,744 occurrences
   - **Onion**: 30,425 occurrences
   - **Eggs**: 26,258 occurrences
   - **Milk**: 18,087 occurrences
   - **Water**: 15,957 occurrences
   - **Parsley**: 15,460 occurrences

2. **Insights**:
   - **Common Cooking Staples**: The primary ingredients for substitution are largely staples in cooking and baking, such as **butter**, **sugar**, **salt**, **flour**, **eggs**, and **milk**.
   - **Flavor and Dietary Adjustments**: Ingredients like **garlic** and **onion** are essential in flavor profiles, while **butter** and **milk** substitutions often cater to dairy-free or vegan alternatives.
   - **Nutritional Preferences**: The prevalence of substitutions for **sugar**, **flour**, and **butter** may reflect interest in healthier or allergen-free options, though the counts alone do not record why a substitution was suggested.

##### Overall insights
This analysis suggests that common cooking staples (such as butter, salt, and garlic) are the ingredients most often given substitutes, and that the suggested alternatives frequently fit dietary restrictions (e.g., dairy-free) or perceived health benefits (e.g., whole wheat flour or coconut oil). This points to a demand for ingredient flexibility in recipes, plausibly driven by dietary needs and health preferences.