In [1]:
# Indian food soooo yummmy

# Importing visualization and data handling libraries

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)

First things first: I always take a quick look at the data to get a better idea of what I am working with.
From there I will decide what needs to change before moving on to the data analysis portion.

In [2]:
df = pd.read_csv('../data/indian_food.csv')
df.sample(15)

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
114,Pindi chana,"Fennel, tea bags, tomato, kasuri methi, cinnamon",vegetarian,500,120,spicy,main course,Punjab,North
147,Papadum,"Lentils, black pepper, vegetable oil",vegetarian,5,5,spicy,snack,Kerala,South
55,Anarsa,"Rice flour, jaggery, khus-khus seeds",vegetarian,10,50,sweet,dessert,Maharashtra,West
103,Mushroom matar,"Canned coconut milk, frozen green peas, wild mushrooms, garam masala, tomatoes",vegetarian,10,30,spicy,main course,Punjab,North
234,Hando Guri,"Jaggery, raisins",vegetarian,-1,-1,sweet,dessert,Assam,North East
76,Butter chicken,"Chicken, greek yogurt, cream, garam masala powder, cashew nuts, butter",non vegetarian,10,35,spicy,main course,NCT of Delhi,North
211,Vindaloo,"Chicken, coconut oil, wine vinegar, ginger, green, cinnamon",non vegetarian,10,40,spicy,main course,Goa,West
193,Kombdi vade,"Rice flour, urad dal, wheat flour, gram flour, turmeric",vegetarian,10,25,spicy,snack,Maharashtra,West
22,Chhena poda,"Sugar, chenna cheese",vegetarian,10,45,sweet,dessert,Odisha,East
232,Chingri malai curry,"Coconut milk, lobster, fresh green chilli, ginger, red onion",non vegetarian,10,40,spicy,main course,West Bengal,East


Before going further I want to lowercase the text columns I will be matching on: `ingredients`, `name`, `state` and `region`.

In [3]:
lower_columns = ['ingredients', 'name', 'state', 'region']
df[lower_columns] = df[lower_columns].map(lambda x: str(x).lower())
df[lower_columns]

Unnamed: 0,ingredients,name,state,region
0,"maida flour, yogurt, oil, sugar",balu shahi,west bengal,east
1,"gram flour, ghee, sugar",boondi,rajasthan,west
2,"carrots, milk, sugar, ghee, cashews, raisins",gajar ka halwa,punjab,north
3,"flour, ghee, kewra, milk, clarified butter, sugar, almonds, pistachio, saffron, green cardamom",ghevar,rajasthan,west
4,"milk powder, plain flour, baking powder, ghee, milk, sugar, water, rose water",gulab jamun,west bengal,east
...,...,...,...,...
250,"glutinous rice, black sesame seeds, gur",til pitha,assam,north east
251,"coconut milk, egg yolks, clarified butter, all purpose flour",bebinca,goa,west
252,"cottage cheese, dry dates, dried rose petals, pistachio, badam",shufta,jammu & kashmir,north
253,"milk powder, dry fruits, arrowroot powder, all purpose flour",mawa bati,madhya pradesh,central
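One caveat with `str(x).lower()` is that it would also turn a genuinely empty cell into the text `'nan'`. A minimal sketch of an alternative, assuming the columns hold plain strings: the `.str.lower()` accessor lowercases text and leaves missing values alone. The `sample` frame and the name `lowered` are made up for illustration.

```python
import pandas as pd

# Made-up rows; the second dish deliberately has no state recorded.
sample = pd.DataFrame({
    'name': ['Pindi chana', 'Papadum'],
    'state': ['Punjab', None],
    'prep_time': [500, 5],
})

text_columns = ['name', 'state']
lowered = sample.copy()
for column in text_columns:
    # .str.lower() only touches strings, so missing entries stay missing.
    lowered[column] = lowered[column].str.lower()

print(lowered)
```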


#### Assessing -1 values in data set
There are 5 columns that contain `-1`, a value that does not make sense for them.
Aside from the possibility that the data is simply missing, I have some ideas as to why those values are there.
* For `prep_time` and `cook_time`, maybe this is indicative of dishes that take very little time to prepare or cook, but I doubt this.
* For `state` and `region`, I think it's possible that these dishes are common across a wider area of India and don't have as much of a single home as others.
* As for `flavor_profile`, maybe the flavor of these dishes is more complicated than spicy, sweet, bitter and sour, but I doubt this.
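If the `-1` entries are treated as missing rather than meaningful, they can be turned into real missing values so that pandas counts and skips them on its own. This is only a sketch under that assumption; the `sample` frame and the names `is_missing` and `cleaned` are made up for illustration.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
sample = pd.DataFrame({
    'name': ['papadum', 'hando guri', 'kheer'],
    'prep_time': [5, -1, 10],
    'state': ['kerala', 'assam', '-1'],
})

# The placeholder shows up as the number -1 and as the text '-1'.
is_missing = sample.isin([-1, '-1'])

# mask() blanks out every cell where the condition is True.
cleaned = sample.mask(is_missing)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```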

In [4]:
df.map(lambda x: x == -1 or x == '-1').sum()

name               0
ingredients        0
diet               0
prep_time         30
cook_time         28
flavor_profile    29
course             0
state             24
region            13
dtype: int64

#### Looking at categorical data
Below I want to see how many times each value occurs in most of the columns.
Doing so gives me a much better idea of what kinds of categories lie within the data,
which will give me more ideas on what I can do with my imaginary restaurant menu.
I did not include `name` and `ingredients`, as their values are almost all unique.
I also did not include `prep_time` and `cook_time`, as they are numeric and better represented with something like a bar chart.
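For the two time columns, one possible way to get bar-chart-ready numbers is to group the minutes into buckets first. The sketch below assumes `-1` should be left out, and the bucket edges, labels and sample values are my own illustrative choices, not something taken from the dataset.

```python
import pandas as pd

# Made-up prep times in minutes, including one -1 placeholder.
prep_time = pd.Series([5, 10, 10, 20, 45, 360, -1], name='prep_time')

# Leave the placeholder out before bucketing.
known = prep_time[prep_time != -1]

# Hypothetical bucket edges; each bucket includes its upper edge.
bins = [0, 15, 30, 60, 120, float('inf')]
labels = ['0-15', '16-30', '31-60', '61-120', '121+']
time_buckets = pd.cut(known, bins=bins, labels=labels)

# sort=False keeps the buckets in their natural order.
bucket_counts = time_buckets.value_counts(sort=False)
print(bucket_counts)
```

`bucket_counts.plot(kind='bar')` would then draw one bar per bucket.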

#### Key points from the categorical data
* Only 29 `non vegetarian` options. As someone who is sensitive to plants, I find this troubling for my stomach.
* I live in Medellín, Colombia, where people do not eat spicy food and peppers are not easy to find. That matters for this project because it describes the demographic that would be eating this food, but it is not a blocker: you can always take the spice out.
* Mostly main courses. There are only 2 `starter` dishes, so I will probably combine them with `snack`. The two are conceptually similar, and a category that small doesn't make much sense on its own.
* There are many states. Most of them have only a few dishes, but some have many. `-1` in this column makes up 24 dishes.
* The western region of India seems to have the most dishes. `-1` in this column makes up 13 dishes.
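Folding `starter` into `snack`, as suggested in the key points, could look roughly like this. It is a sketch on a made-up `course` series; the name `merged_course` is hypothetical.

```python
import pandas as pd

# Made-up course values; only the labels come from the dataset.
course = pd.Series(
    ['main course', 'dessert', 'snack', 'starter', 'starter'],
    name='course',
)

# Relabel every 'starter' as 'snack' and leave everything else as it is.
merged_course = course.replace({'starter': 'snack'})
course_counts = merged_course.value_counts()
print(course_counts)
```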

In [5]:
for col in df.columns:
    if col not in ['prep_time', 'ingredients', 'cook_time', 'name']:
        print(df[col].value_counts(), '\n\n')

diet
vegetarian        226
non vegetarian     29
Name: count, dtype: int64 


flavor_profile
spicy     133
sweet      88
-1         29
bitter      4
sour        1
Name: count, dtype: int64 


course
main course    129
dessert         85
snack           39
starter          2
Name: count, dtype: int64 


state
gujarat            35
punjab             32
maharashtra        30
west bengal        24
-1                 24
assam              21
tamil nadu         20
andhra pradesh     10
uttar pradesh       9
kerala              8
odisha              7
karnataka           6
rajasthan           6
telangana           5
bihar               3
goa                 3
manipur             2
jammu & kashmir     2
madhya pradesh      2
uttarakhand         1
tripura             1
nagaland            1
nct of delhi        1
chhattisgarh        1
haryana             1
Name: count, dtype: int64 


region
west          74
south         59
north         49
east          31
north east    25
-1            13
ce

## `-1` data
Now I would like to take a closer look at the `-1` data.

#### Flavor profile `-1` values
It looks like many of the items contain rice or flour, which could mean that many of them are
base foods: something that serves as the foundation for a meal, like rice dishes and bread.
I looked up all of these dishes online, and yes, they are pretty much entirely rice and bread dishes. They all look wonderful.

I did some research online and found more appropriate categories for everything that is `-1`.
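The rice-and-flour impression can also be checked with a quick count rather than by eye. This is a rough sketch on three rows copied from the dataset; the names `ingredient_counts` and `mentions_staple` are made up for illustration.

```python
import pandas as pd

# Three dishes whose flavor_profile is -1 in the dataset.
sample = pd.DataFrame({
    'name': ['chapati', 'puttu', 'puli sadam'],
    'ingredients': [
        'whole wheat flour, olive oil, hot water, all purpose flour',
        'brown rice flour, sugar, grated coconut',
        'urad dal, lemon, tamarind, cooked rice, curry leaves',
    ],
})

# One row per individual ingredient, then count how often each appears.
ingredient_counts = sample['ingredients'].str.split(', ').explode().value_counts()

# Flag dishes that mention rice or flour anywhere in the ingredient text.
mentions_staple = sample['ingredients'].str.contains(r'rice|flour')
print(mentions_staple.sum(), 'of', len(sample), 'dishes mention rice or flour')
```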

In [6]:
df[df['flavor_profile'] == '-1']

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
78,chapati,"whole wheat flour, olive oil, hot water, all purpose flour",vegetarian,10,10,-1,main course,maharashtra,west
104,naan,"whole wheat flour, honey, butter, garlic",vegetarian,60,30,-1,main course,punjab,north
116,rongi,"garam masala powder, tomato, kasuri methi, cinnamon, mustard oil",vegetarian,10,30,-1,main course,punjab,north
131,kanji,"carrot, yellow mustard, red chilli, black salt",vegetarian,10,45,-1,snack,kerala,south
145,pachadi,"coconut oil, cucumber, curd, curry leaves, mustard seeds",vegetarian,10,25,-1,main course,-1,south
146,paniyaram,"yogurt, ginger, curry leaves, baking soda, green chilli",vegetarian,10,20,-1,main course,tamil nadu,south
150,paruppu sadam,"arhar dal, sambar powder, tomato, curry leaves, fennel seeds",vegetarian,10,20,-1,main course,tamil nadu,south
153,puli sadam,"urad dal, lemon, tamarind, cooked rice, curry leaves",vegetarian,10,20,-1,main course,tamil nadu,south
155,puttu,"brown rice flour, sugar, grated coconut",vegetarian,495,40,-1,main course,kerala,south
157,sandige,"thin rice flakes, black sesame seeds, curry leaves",vegetarian,120,60,-1,main course,karnataka,south


In [7]:
flavor_sub1 = df[df['flavor_profile'] == '-1']['ingredients'].map(lambda x: str(x).split(', '))
flavor_sub1

78                           [whole wheat flour, olive oil, hot water, all purpose flour]
104                                            [whole wheat flour, honey, butter, garlic]
116                    [garam masala powder, tomato, kasuri methi, cinnamon, mustard oil]
131                                      [carrot, yellow mustard, red chilli, black salt]
145                            [coconut oil, cucumber, curd, curry leaves, mustard seeds]
146                             [yogurt, ginger, curry leaves, baking soda, green chilli]
150                        [arhar dal, sambar powder, tomato, curry leaves, fennel seeds]
153                                [urad dal, lemon, tamarind, cooked rice, curry leaves]
155                                             [brown rice flour, sugar, grated coconut]
157                                  [thin rice flakes, black sesame seeds, curry leaves]
158                                                      [sevai, parboiled rice, steamer]
159       

In [8]:
df[df['flavor_profile'] == '-1']['name']

78            chapati
104              naan
116             rongi
131             kanji
145           pachadi
146         paniyaram
150     paruppu sadam
153        puli sadam
155             puttu
157           sandige
158             sevai
159      thayir sadam
160           theeyal
171            bhakri
176        copra paak
179         dahi vada
180          dalithoy
189            kansar
216        farsi puri
222              khar
224             luchi
227    bengena pitika
228       bilahi maas
229        black rice
231        brown rice
236     chingri bhape
244           pakhala
245        pani pitha
248          red rice
Name: name, dtype: object

In [9]:
# Each dish belongs in exactly one list: the assignments below run in order,
# so a dish listed twice would end up with whichever label is applied last.
sweet_dishes = ['copra paak', 'kansar', 'puttu']
spicy_dishes = ['rongi', 'theeyal', 'khar', 'bengena pitika', 'bilahi maas', 'chingri bhape', 'paniyaram']
sour_dishes = ['kanji', 'puli sadam', 'pachadi', 'pakhala']
others = ['red rice', 'pani pitha', 'brown rice', 'black rice', 'luchi', 'farsi puri', 'dalithoy', 'dahi vada', 'bhakri', 'thayir sadam', 'sevai', 'sandige', 'paruppu sadam', 'naan', 'chapati']

# Updating the 'flavor_profile' column based on the dish names
df.loc[df['name'].isin(sweet_dishes), 'flavor_profile'] = 'sweet'
df.loc[df['name'].isin(spicy_dishes), 'flavor_profile'] = 'spicy'
df.loc[df['name'].isin(others), 'flavor_profile'] = 'neutral/other'
df.loc[df['name'].isin(sour_dishes), 'flavor_profile'] = 'sour'

In [10]:
# The flavor_profile column is now sorted out, so next up is region.

df['flavor_profile'].value_counts()

flavor_profile
spicy            140
sweet             91
neutral/other     15
sour               5
bitter             4
Name: count, dtype: int64
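An alternative to keeping one list per flavor, sketched here as an option rather than as what this notebook does, is a single dish-to-flavor dictionary. A dish can then only ever carry one label. The names `flavor_by_dish` and `looked_up` are hypothetical, and the sample rows are a small subset.

```python
import pandas as pd

# A few dishes with a -1 flavor plus one that already has a flavor.
sample = pd.DataFrame({
    'name': ['kansar', 'rongi', 'kanji', 'naan', 'anarsa'],
    'flavor_profile': ['-1', '-1', '-1', '-1', 'sweet'],
})

# One entry per dish, so no dish can be assigned two flavors.
flavor_by_dish = {
    'kansar': 'sweet',
    'rongi': 'spicy',
    'kanji': 'sour',
    'naan': 'neutral/other',
}

# Dishes missing from the dictionary map to NaN and keep their old value.
looked_up = sample['name'].map(flavor_by_dish)
sample['flavor_profile'] = looked_up.fillna(sample['flavor_profile'])
print(sample)
```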

### Region `-1` values

In [11]:
df[df['region'] == '-1']

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
7,kaju katli,"cashews, ghee, cardamom, sugar",vegetarian,10,20,sweet,dessert,-1,-1
9,kheer,"milk, rice, sugar, dried fruits",vegetarian,10,40,sweet,dessert,-1,-1
10,laddu,"gram flour, ghee, sugar",vegetarian,10,40,sweet,dessert,-1,-1
12,nankhatai,"refined flour, besan, ghee, powdered sugar, yoghurt, green cardamom",vegetarian,20,30,sweet,dessert,-1,-1
94,khichdi,"moong dal, green peas, ginger, tomato, green chili",vegetarian,40,20,spicy,main course,-1,-1
96,kulfi falooda,"rose syrup, falooda sev, mixed nuts, saffron, sugar",vegetarian,45,25,sweet,dessert,-1,-1
98,lauki ki subji,"bottle gourd, coconut oil, garam masala, ginger, green chillies",vegetarian,10,20,spicy,main course,-1,-1
109,pani puri,"kala chana, mashed potato, boondi, sev, lemon",vegetarian,15,2,spicy,snack,-1,-1
111,papad,"urad dal, sev, lemon juice, chopped tomatoes",vegetarian,5,5,spicy,snack,-1,-1
117,samosa,"potatoes, green peas, garam masala, ginger, dough",vegetarian,30,30,spicy,snack,-1,-1


Many of the foods listed seem to come from outside India, or from a wide range of places within India.
I researched as many of them as I could, but few had clear origin data for region and state.
I used Google searches, mostly landing on Wikipedia, to fill in the missing data for these foods.

In [15]:
region_state_data = {
    'kaju katli': {
        'region': 'north',
        'state': 'deccan'
    },
    'kheer': {
        'region': 'south'
    },
    'nankhatai': {
        'region': 'north',
        'state': 'gujarat'
    }
}

# Write each looked-up value into the matching dish's row.
for dish, values in region_state_data.items():
    for column, value in values.items():
        df.loc[df['name'] == dish, column] = value

In [16]:
df[df['name'].isin(['kaju katli', 'kheer', 'nankhatai'])]

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
7,kaju katli,"cashews, ghee, cardamom, sugar",vegetarian,10,20,sweet,dessert,deccan,north
9,kheer,"milk, rice, sugar, dried fruits",vegetarian,10,40,sweet,dessert,-1,south
12,nankhatai,"refined flour, besan, ghee, powdered sugar, yoghurt, green cardamom",vegetarian,20,30,sweet,dessert,gujarat,north
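Hand-entered values are easy to get slightly wrong, so a cross-check against the rest of the data may help. The sketch below assumes each state sits in exactly one region within this dataset, builds a state-to-region lookup from rows that were already filled in, and flags manual entries that disagree. The names `reference`, `manual_entries` and `conflicts` are made up for illustration.

```python
import pandas as pd

# Rows that already had both state and region in the dataset.
reference = pd.DataFrame({
    'name': ['ghooghra', 'boondi', 'gajar ka halwa'],
    'state': ['gujarat', 'rajasthan', 'punjab'],
    'region': ['west', 'west', 'north'],
})

# Hand-entered values to verify (same shape as region_state_data).
manual_entries = {
    'nankhatai': {'region': 'north', 'state': 'gujarat'},
    'kheer': {'region': 'south'},
}

# Assumption: one region per state, so a plain dict is enough.
state_to_region = dict(zip(reference['state'], reference['region']))

conflicts = []
for dish, values in manual_entries.items():
    state, region = values.get('state'), values.get('region')
    expected = state_to_region.get(state)
    # Only compare when both values were entered and the state is known.
    if state and region and expected and region != expected:
        conflicts.append((dish, state, region, expected))

print(conflicts)
```

Anything that lands in `conflicts` is worth a second look before relying on the region counts.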


#### State `-1` values
For the state values equal to `-1` I did the same thing as before and looked everything up on Wikipedia,
only to find that most dishes don't seem to have a single origin that I can find.

In [18]:
df[df['state'] == '-1']

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
9,kheer,"milk, rice, sugar, dried fruits",vegetarian,10,40,sweet,dessert,-1,south
10,laddu,"gram flour, ghee, sugar",vegetarian,10,40,sweet,dessert,-1,-1
94,khichdi,"moong dal, green peas, ginger, tomato, green chili",vegetarian,40,20,spicy,main course,-1,-1
96,kulfi falooda,"rose syrup, falooda sev, mixed nuts, saffron, sugar",vegetarian,45,25,sweet,dessert,-1,-1
98,lauki ki subji,"bottle gourd, coconut oil, garam masala, ginger, green chillies",vegetarian,10,20,spicy,main course,-1,-1
109,pani puri,"kala chana, mashed potato, boondi, sev, lemon",vegetarian,15,2,spicy,snack,-1,-1
111,papad,"urad dal, sev, lemon juice, chopped tomatoes",vegetarian,5,5,spicy,snack,-1,-1
115,rajma chaval,"red kidney beans, garam masala powder, ginger, tomato, mustard oil",vegetarian,15,90,spicy,main course,-1,north
117,samosa,"potatoes, green peas, garam masala, ginger, dough",vegetarian,30,30,spicy,snack,-1,-1
128,dosa,"chana dal, urad dal, whole urad dal, blend rice, rock salt",vegetarian,360,90,spicy,snack,-1,south


In [19]:
state_update_values = {
    'idli': {
        'region': 'south'
    },
    'masala dosa': {
        'region': 'south'
    },
    'pachadi': {
        'region': 'south'
    },
    'rasam': {
        'region': 'south'
    },
    'sambar': {
        'region': 'south'
    },
    'uttappam': {
        'region': 'south'
    }
}

# Write each looked-up value into the matching dish's row.
for dish, values in state_update_values.items():
    for column, value in values.items():
        df.loc[df['name'] == dish, column] = value
df[df['name'].isin(list(state_update_values))]

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
130,idli,"split urad dal, urad dal, idli rice, thick poha, rock salt",vegetarian,360,90,spicy,snack,-1,south
144,masala dosa,"chana dal, urad dal, potatoes, idli rice, thick poha",vegetarian,360,90,spicy,snack,-1,south
145,pachadi,"coconut oil, cucumber, curd, curry leaves, mustard seeds",vegetarian,10,25,sour,main course,-1,south
154,rasam,"tomato, curry leaves, garlic, mustard seeds, hot water",vegetarian,10,35,spicy,main course,-1,south
156,sambar,"pigeon peas, eggplant, drumsticks, sambar powder, tamarind",vegetarian,20,45,spicy,main course,-1,south


#### Cook time & prep time `-1` values
To save time I am going to have ChatGPT make me a dictionary of approximate cook and prep times.

In [24]:
df[df.eq(-1).any(axis = 1)]

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
19,sohan papdi,"gram flour, ghee, sugar, milk, cardamom",vegetarian,-1,60,sweet,dessert,maharashtra,west
21,chhena kheeri,"chhena, sugar, milk",vegetarian,-1,60,sweet,dessert,odisha,east
65,pork bharta,"boiled pork, onions, chillies, ginger and garlic",non vegetarian,-1,-1,spicy,main course,tripura,north east
132,kaara kozhambu,"sesame oil, drumstick, tamarind paste, sambar powder, tomato",vegetarian,-1,-1,spicy,main course,tamil nadu,south
134,keerai masiyal,"urad dal, curry leaves, sugar, mustard seeds, spinach",vegetarian,-1,-1,spicy,main course,tamil nadu,south
148,paravannam,"raw rice, jaggery, milk",vegetarian,-1,-1,spicy,main course,kerala,south
152,poriyal,"chana dal, urad dal, beans, coconut, mustard",vegetarian,-1,-1,spicy,main course,tamil nadu,south
167,kolim jawla,"baingan, fish, coconut oil, fresh coconut, ginger",non vegetarian,-1,-1,spicy,main course,maharashtra,west
172,bombil fry,"bombay duck, malvani masala, rice flour, bombay rava, green chilies",non vegetarian,-1,-1,spicy,main course,maharashtra,west
185,ghooghra,"dry fruits, semolina, all purpose flour",vegetarian,-1,-1,spicy,snack,gujarat,west


In [30]:
dish_cook_times = {
    'sohan papdi': 60,  
    'chhena kheeri': 45,  
    'pork bharta': 90,  
    'kaara kozhambu': 30,  
    'keerai masiyal': 25,  
    'paravannam': 35,  
    'poriyal': 20,  
    'kolim jawla': 25,  
    'bombil fry': 15,  
    'ghooghra': 30,  
    'halvasan': 45,  
    'mag dhokli': 60,  
    'farsi puri': 20,  
    'cheera doi': 240,  
    'kumol sawul': 20,  
    'bengena pitika': 30,  
    'black rice': 50,  
    'bora sawul': 25,  
    'hando guri': 60,  
    'kabiraji': 30,  
    'khorisa': 10,  
    'koldil chicken': 60,  
    'konir dom': 45,  
    'koldil duck': 120,  
    'masor koni': 20,  
    'pakhala': 10,  
    'payokh': 45,  
    'red rice': 50,  
    'shufta': 30,  
    'pinaca': 40  
}
dish_prep_times = {
    'sohan papdi': 30,
    'chhena kheeri': 15,
    'pork bharta': 20,
    'kaara kozhambu': 20,
    'keerai masiyal': 15,
    'paravannam': 15,
    'poriyal': 10,
    'kolim jawla': 15,
    'bombil fry': 20,
    'ghooghra': 45,
    'halvasan': 30,
    'mag dhokli': 30,
    'farsi puri': 40,
    'cheera doi': 15,
    'kumol sawul': 5,
    'bengena pitika': 10,
    'black rice': 10,
    'bora sawul': 10,
    'hando guri': 10,
    'kabiraji': 30,
    'khorisa': 10,
    'koldil chicken': 20,
    'konir dom': 20,
    'koldil duck': 20,
    'masor koni': 10,
    'pakhala': 10,
    'payokh': 15,
    'red rice': 10,
    'shufta': 30,
    'pinaca': 15
}

# Combine the two lookups into one {dish: {'prep_time': ..., 'cook_time': ...}} mapping.
prep_cook_times = {
    food: {'prep_time': dish_prep_times[food], 'cook_time': dish_cook_times[food]}
    for food in dish_prep_times
}

# Note: this overwrites both columns for every listed dish,
# including values that were not -1 to begin with.
for food, times in prep_cook_times.items():
    mask = df['name'] == food
    df.loc[mask, 'prep_time'] = times['prep_time']
    df.loc[mask, 'cook_time'] = times['cook_time']

df[df['name'].isin(list(prep_cook_times))]

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
19,sohan papdi,"gram flour, ghee, sugar, milk, cardamom",vegetarian,30,60,sweet,dessert,maharashtra,west
21,chhena kheeri,"chhena, sugar, milk",vegetarian,15,45,sweet,dessert,odisha,east
65,pork bharta,"boiled pork, onions, chillies, ginger and garlic",non vegetarian,20,90,spicy,main course,tripura,north east
132,kaara kozhambu,"sesame oil, drumstick, tamarind paste, sambar powder, tomato",vegetarian,20,30,spicy,main course,tamil nadu,south
134,keerai masiyal,"urad dal, curry leaves, sugar, mustard seeds, spinach",vegetarian,15,25,spicy,main course,tamil nadu,south
148,paravannam,"raw rice, jaggery, milk",vegetarian,15,35,spicy,main course,kerala,south
152,poriyal,"chana dal, urad dal, beans, coconut, mustard",vegetarian,10,20,spicy,main course,tamil nadu,south
167,kolim jawla,"baingan, fish, coconut oil, fresh coconut, ginger",non vegetarian,15,25,spicy,main course,maharashtra,west
172,bombil fry,"bombay duck, malvani masala, rice flour, bombay rava, green chilies",non vegetarian,20,15,spicy,main course,maharashtra,west
185,ghooghra,"dry fruits, semolina, all purpose flour",vegetarian,45,30,spicy,snack,gujarat,west
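A more conservative variant, offered only as a sketch, fills in an estimate where the original value is `-1` and keeps every time the dataset already had. The helper `fill_missing_times` and the names `estimated_prep` and `estimated_cook` are hypothetical; the numbers are the ones shown in the tables above.

```python
import pandas as pd

# chhena kheeri has a known cook_time of 60 but no prep_time.
sample = pd.DataFrame({
    'name': ['chhena kheeri', 'pork bharta', 'papadum'],
    'prep_time': [-1, -1, 5],
    'cook_time': [60, -1, 5],
})

estimated_prep = {'chhena kheeri': 15, 'pork bharta': 20}
estimated_cook = {'chhena kheeri': 45, 'pork bharta': 90}


def fill_missing_times(frame, column, estimates):
    """Return `column` with -1 replaced by an estimate where one exists."""
    estimates_by_row = frame['name'].map(estimates)
    needs_fill = (frame[column] == -1) & estimates_by_row.notna()
    # where() keeps the original value wherever the condition is True.
    filled = frame[column].where(~needs_fill, estimates_by_row)
    return filled.astype(int)


sample['prep_time'] = fill_missing_times(sample, 'prep_time', estimated_prep)
sample['cook_time'] = fill_missing_times(sample, 'cook_time', estimated_cook)
print(sample)
```

Here the recorded 60-minute cook time for chhena kheeri survives, whereas overwriting both columns replaces it with the 45-minute estimate.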


In [32]:
df.to_csv('../data/processed_indian_food.csv', index = False)

### Finishing up
Looking at the data now, I am satisfied with what I have. There are still some `-1` values in there, but I am fine with that.
I searched online for information about those particular data points and found either no answer or more than one answer.
So now it is time to move on to the data analysis portion of the project.