In [7]:
print("Author: Yashvi Patodi\n"
      "NLP Project-5\n"
      "Exploratory Data Analysis on Indian Food Dataset\n"
      "\n"
      "This analysis delves into the Indian food dataset, focusing on systematic data assessment, cleansing, "
      "and feature engineering to enrich dataset usability and insights.\n\n"
      "1. Data Structure Evaluation: Analyzing dataset composition, including data types, column details, and unique "
      "value counts, to understand its foundational structure.\n"
      "2. Missing Value Handling: Precisely identifying and addressing null values within key columns, employing appropriate "
      "imputation strategies to maintain data integrity.\n"
      "3. Feature Engineering: Introducing new attributes to enhance analytical depth, such as total_time (summing "
      "preparation and cooking times) and num_ingredients (counting ingredients for each dish).\n\n"
      "Execution Details:\n\n"
      "-Data Inspection: Assess column types, unique names, and generate statistical summaries for a concise overview.\n"
      "-Imputation of Missing Data: Replace absent region values with a placeholder, and remove rows with critical missing entries for consistency.\n"
      "-Numerical and Categorical Breakdown: Segment numeric and categorical data, facilitating tailored analyses on dataset structure.\n"
      "-Data Export: Finalize and export the refined dataset (indian_food_updated.csv) with added features to support subsequent analysis.\n")


Author: Yashvi Patodi
NLP Project-5
Exploratory Data Analysis on Indian Food Dataset

This analysis delves into the Indian food dataset, focusing on systematic data assessment, cleansing, and feature engineering to enrich dataset usability and insights.

1. Data Structure Evaluation: Analyzing dataset composition, including data types, column details, and unique value counts, to understand its foundational structure.
2. Missing Value Handling: Precisely identifying and addressing null values within key columns, employing appropriate imputation strategies to maintain data integrity.
3. Feature Engineering: Introducing new attributes to enhance analytical depth, such as total_time (summing preparation and cooking times) and num_ingredients (counting ingredients for each dish).

Execution Details:

-Data Inspection: Assess column types, unique names, and generate statistical summaries for a concise overview.
-Imputation of Missing Data: Replace absent region values with a placeholder, and re

In [None]:
import pandas as pd
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed.
df = pd.read_csv('indian_food.csv')
print(df)

               name                                        ingredients  \
0        Balu shahi                    Maida flour, yogurt, oil, sugar   
1            Boondi                            Gram flour, ghee, sugar   
2    Gajar ka halwa       Carrots, milk, sugar, ghee, cashews, raisins   
3            Ghevar  Flour, ghee, kewra, milk, clarified butter, su...   
4       Gulab jamun  Milk powder, plain flour, baking powder, ghee,...   
..              ...                                                ...   
250       Til Pitha            Glutinous rice, black sesame seeds, gur   
251         Bebinca  Coconut milk, egg yolks, clarified butter, all...   
252          Shufta  Cottage cheese, dry dates, dried rose petals, ...   
253       Mawa Bati  Milk powder, dry fruits, arrowroot powder, all...   
254          Pinaca  Brown rice, fennel seeds, grated coconut, blac...   

           diet  prep_time  cook_time flavor_profile   course  \
0    vegetarian         45         25         

In [None]:
print(df.dtypes)
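# Bare expressions such as df.columns or df.shape are not displayed mid-cell;
# df.info() below already reports the column names and the shape (255 rows, 9 columns).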
df.info()

name              object
ingredients       object
diet              object
prep_time          int64
cook_time          int64
flavor_profile    object
course            object
state             object
region            object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255 entries, 0 to 254
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            255 non-null    object
 1   ingredients     255 non-null    object
 2   diet            255 non-null    object
 3   prep_time       255 non-null    int64 
 4   cook_time       255 non-null    int64 
 5   flavor_profile  255 non-null    object
 6   course          255 non-null    object
 7   state           255 non-null    object
 8   region          254 non-null    object
dtypes: int64(2), object(7)
memory usage: 18.1+ KB


In [None]:
print("Name", df['name'].nunique())

Name 255


In [None]:
df.describe()

        prep_time   cook_time
count  255.000000  255.000000
mean    31.105882   34.529412
std     72.554409   48.265650
min     -1.000000   -1.000000
25%     10.000000   20.000000
50%     10.000000   30.000000
75%     20.000000   40.000000
max    500.000000  720.000000
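A minimum of -1 cannot be a real duration, so it is likely a placeholder for unknown times. The source does not say so, so this is an assumption. The sketch below counts that assumed sentinel and replaces it with NaN before summarizing. It uses a small made-up frame (`sample`) in place of the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real data comes from indian_food.csv.
sample = pd.DataFrame({
    "prep_time": [45, 80, -1, 15],
    "cook_time": [25, 30, -1, -1],
})

time_cols = ["prep_time", "cook_time"]

# Count how often the assumed sentinel (-1) appears in each column.
sentinel_counts = (sample[time_cols] == -1).sum()
print(sentinel_counts)

# Replace the sentinel with NaN so summary statistics skip it.
cleaned_times = sample[time_cols].replace(-1, np.nan)
print(cleaned_times.describe())
```

With the sentinel masked, `describe()` reports `count` below the row count for affected columns, and `min` and `mean` reflect only real durations.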


In [None]:
na_values= df.isnull().sum()
na_values

name              0
ingredients       0
diet              0
prep_time         0
cook_time         0
flavor_profile    0
course            0
state             0
region            1
dtype: int64


In [None]:
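# Fill the single missing region with an explicit placeholder. Assigning the result back
# avoids the chained inplace call, which newer pandas versions warn about.
# 'Unknown' is used because the literal string 'NaN' is parsed as missing again
# when the exported CSV is reloaded with read_csv.
df['region'] = df['region'].fillna('Unknown')
print(df)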

               name                                        ingredients  \
0        Balu shahi                    Maida flour, yogurt, oil, sugar   
1            Boondi                            Gram flour, ghee, sugar   
2    Gajar ka halwa       Carrots, milk, sugar, ghee, cashews, raisins   
3            Ghevar  Flour, ghee, kewra, milk, clarified butter, su...   
4       Gulab jamun  Milk powder, plain flour, baking powder, ghee,...   
..              ...                                                ...   
250       Til Pitha            Glutinous rice, black sesame seeds, gur   
251         Bebinca  Coconut milk, egg yolks, clarified butter, all...   
252          Shufta  Cottage cheese, dry dates, dried rose petals, ...   
253       Mawa Bati  Milk powder, dry fruits, arrowroot powder, all...   
254          Pinaca  Brown rice, fennel seeds, grated coconut, blac...   

           diet  prep_time  cook_time flavor_profile   course  \
0    vegetarian         45         25         
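The choice of placeholder string matters once the frame is exported: by default, `read_csv` treats the literal text `NaN` as a missing value. The sketch below shows the round trip on a hypothetical two-row frame. The helper `missing_after_round_trip` is illustrative only and not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical two-row frame with one missing region.
sample = pd.DataFrame({"name": ["Dish A", "Dish B"], "region": ["East", None]})


def missing_after_round_trip(placeholder):
    """Fill region with `placeholder`, write to CSV, read back, count missing values."""
    filled = sample.assign(region=sample["region"].fillna(placeholder))
    buffer = io.StringIO()
    filled.to_csv(buffer, index=False)
    buffer.seek(0)
    return int(pd.read_csv(buffer)["region"].isna().sum())


print(missing_after_round_trip("NaN"))      # placeholder is read back as missing
print(missing_after_round_trip("Unknown"))  # placeholder survives as a plain string
```

A label such as `Unknown` keeps the imputation visible in the exported file, whereas the string `NaN` silently turns back into a null on reload.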

In [None]:
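na_values = df.isnull().sum()
print("NA values in each column:\n", na_values)
# Drop any rows that still lack a region (a no-op here, since the gap was filled above).
df_cleaned = df.dropna(subset=['region'])
# isnull().sum() counts missing values per column; nunique() would count distinct values instead.
na_values = df_cleaned.isnull().sum()
print("NA values after dropping rows:\n", na_values)

NA values in each column:
 name              0
ingredients       0
diet              0
prep_time         0
cook_time         0
flavor_profile    0
course            0
state             0
region            0
dtype: int64
NA values after dropping rows:
 name              0
ingredients       0
diet              0
prep_time         0
cook_time         0
flavor_profile    0
course            0
state             0
region            0
dtype: int64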


In [None]:
import numpy as np
df1 = df.select_dtypes(include=[np.number])
print("Numeric Features:")
print(df1)

# Text columns loaded by read_csv have dtype 'object', not 'category'.
df2 = df.select_dtypes(include=['object'])
print("\nCategorical Features:")
print(df2.columns.tolist())

num_numeric_features = df1.shape[1]
num_categorical_features = df2.shape[1]

print(f"\nNumber of Numeric Features: {num_numeric_features}")
print(f"Number of Categorical Features: {num_categorical_features}")

Numeric Features:
     prep_time  cook_time
0           45         25
1           80         30
2           15         60
3           15         30
4           15         40
..         ...        ...
250          5         30
251         20         60
252         -1         -1
253         20         45
254         -1         -1

[255 rows x 2 columns]

Categorical Features:
['name', 'ingredients', 'diet', 'flavor_profile', 'course', 'state', 'region']

Number of Numeric Features: 2
Number of Categorical Features: 7
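`select_dtypes(include=['category'])` matches only columns whose dtype is pandas' categorical type, while the text columns here were loaded as `object` (see the `df.dtypes` output earlier). The sketch below converts label-like columns explicitly so that the `category` selector finds them. The column names `diet` and `course` come from the dataset; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
sample = pd.DataFrame({
    "name": ["Dish A", "Dish B", "Dish C"],
    "diet": ["vegetarian", "vegetarian", "non vegetarian"],
    "course": ["dessert", "main course", "dessert"],
    "prep_time": [45, 80, 15],
})

# Freshly built text columns are not categorical, so nothing matches yet.
before = sample.select_dtypes(include=["category"]).shape[1]

# Convert the low-cardinality label columns explicitly.
sample = sample.astype({"diet": "category", "course": "category"})

after = sample.select_dtypes(include=["category"]).columns.tolist()
print(before, after)
```

Free-text columns such as `name` and `ingredients` are left as strings here, since nearly every value is unique and a categorical encoding would add little.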


In [None]:
df['total_time'] = df['prep_time'] + df['cook_time']
df.to_csv('indian_food_updated.csv', index=False)
print(df)

               name                                        ingredients  \
0        Balu shahi                    Maida flour, yogurt, oil, sugar   
1            Boondi                            Gram flour, ghee, sugar   
2    Gajar ka halwa       Carrots, milk, sugar, ghee, cashews, raisins   
3            Ghevar  Flour, ghee, kewra, milk, clarified butter, su...   
4       Gulab jamun  Milk powder, plain flour, baking powder, ghee,...   
..              ...                                                ...   
250       Til Pitha            Glutinous rice, black sesame seeds, gur   
251         Bebinca  Coconut milk, egg yolks, clarified butter, all...   
252          Shufta  Cottage cheese, dry dates, dried rose petals, ...   
253       Mawa Bati  Milk powder, dry fruits, arrowroot powder, all...   
254          Pinaca  Brown rice, fennel seeds, grated coconut, blac...   

           diet  prep_time  cook_time flavor_profile   course  \
0    vegetarian         45         25         
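Rows where `prep_time` and `cook_time` are both -1 (rows 252 and 254 in the numeric output earlier) get a `total_time` of -2 from a plain sum. The sketch below masks those values before adding, so the total is NaN whenever either duration is unknown. It assumes -1 marks a missing time, which the source does not confirm, and uses made-up rows.

```python
import pandas as pd

# Hypothetical rows; -1 is assumed to mean "time not recorded".
sample = pd.DataFrame({
    "prep_time": [45, -1, 15],
    "cook_time": [25, -1, -1],
})

# Keep non-negative durations and turn everything else into NaN.
times = sample[["prep_time", "cook_time"]]
times = times.where(times >= 0)

# min_count=2 yields NaN unless both durations are present.
sample["total_time"] = times.sum(axis=1, min_count=2)
print(sample)
```

Requiring both values is one possible choice; setting `min_count=1` would instead report the single known duration as the total.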

In [None]:
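# Count the comma-separated items in each ingredients string.
df['num_ingredients'] = df['ingredients'].apply(lambda x: len(x.split(',')))
# Export again so the saved file also contains num_ingredients.
df.to_csv('indian_food_updated.csv', index=False)
print(df)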

               name                                        ingredients  \
0        Balu shahi                    Maida flour, yogurt, oil, sugar   
1            Boondi                            Gram flour, ghee, sugar   
2    Gajar ka halwa       Carrots, milk, sugar, ghee, cashews, raisins   
3            Ghevar  Flour, ghee, kewra, milk, clarified butter, su...   
4       Gulab jamun  Milk powder, plain flour, baking powder, ghee,...   
..              ...                                                ...   
250       Til Pitha            Glutinous rice, black sesame seeds, gur   
251         Bebinca  Coconut milk, egg yolks, clarified butter, all...   
252          Shufta  Cottage cheese, dry dates, dried rose petals, ...   
253       Mawa Bati  Milk powder, dry fruits, arrowroot powder, all...   
254          Pinaca  Brown rice, fennel seeds, grated coconut, blac...   

           diet  prep_time  cook_time flavor_profile   course  \
0    vegetarian         45         25         
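Splitting on commas counts every fragment, including empty ones left by a trailing or doubled comma. The sketch below is a slightly more defensive variant that ignores blank items. The helper name `count_ingredients` and the sample strings are hypothetical, and whether the real data contains such cases is not shown in the source.

```python
import pandas as pd

# Hypothetical ingredient strings, including a stray empty item and a trailing comma.
sample = pd.DataFrame({
    "ingredients": ["Gram flour, ghee, sugar", "Carrots, milk, , sugar,", "Milk"],
})


def count_ingredients(text):
    """Count non-empty, whitespace-trimmed items in a comma-separated string."""
    return sum(1 for item in text.split(",") if item.strip())


sample["num_ingredients"] = sample["ingredients"].apply(count_ingredients)
print(sample["num_ingredients"].tolist())  # [3, 3, 1]
```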