<a href="https://colab.research.google.com/github/ayten21/python-libraries/blob/main/Data_modification_with_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

<center><h1>  Data modification with Pandas</h1></center>

---


### `TABLE OF CONTENTS`
- How to calculate sum, mean, median and mode of a column?
- How to get the summary of the numerical variables?
- How to get the number of missing values in each column?
- How to impute the missing values in any column?
- How to group the data based on categories of one column?
- How to group the data based on categories of multiple columns?
- How to create new feature using the aggregated results of a column?
- How to update the values of a column with a new mapping?
- How to create a new column by modifying the existing column?
- How to convert categorical variables into numerical?


In [1]:
import pandas as pd

In [47]:
data = pd.read_csv('sales_data.csv')

In [4]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


How to calculate sum, mean, median and mode of a column?

In [38]:
data.Item_MRP.sum()

1201681.4808

In [39]:
data.Item_MRP.mean()

140.9927819781768

In [40]:
data.Item_MRP.median()

143.0128

In [48]:
data.Outlet_Type.mode()

0    Supermarket Type1
dtype: object

In [49]:
data.Outlet_Type.value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64
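
The same aggregations can be tried on a small made-up dataframe. The column names and values below are illustrative assumptions and are not taken from `sales_data.csv`. Note that `mode()` returns a Series rather than a single value, because a column can have several equally frequent values.

```python
import pandas as pd

# Hypothetical toy data, not from sales_data.csv
toy = pd.DataFrame({
    'price': [10.0, 20.0, 30.0, 40.0],
    'outlet': ['A', 'B', 'A', 'C'],
})

total = toy['price'].sum()          # sum of all values
average = toy['price'].mean()       # arithmetic mean
middle = toy['price'].median()      # mean of the two middle values here
most_common = toy['outlet'].mode()  # Series holding the most frequent value(s)

print(total, average, middle, most_common.iloc[0])
```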

How to get the summary of the numerical variables?

In [50]:
# get the summary
data.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


---

HOW TO IMPUTE THE MISSING VALUES USING LOC IN ANY COLUMN?



***Check the columns with null values.***

First, we will check the number of missing values in each column using `isna().sum()`.



---

In [5]:
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

So, we have missing values in the columns `Item_Weight` and `Outlet_Size`.
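
As a minimal sketch of the check above, the made-up dataframe below (hypothetical column names and values) shows that `isna().sum()` counts missing entries per column, and that `isna().mean()` gives the share of missing entries, which can be easier to read on large tables.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with gaps in both a numeric and a text column
toy = pd.DataFrame({
    'weight': [9.3, np.nan, 17.5, np.nan],
    'size': ['Medium', 'Small', None, 'High'],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```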

---

#### `Select the rows where Item_Weight has missing values using loc`

---

In [51]:
# rows with null values in Item_Weight
data.loc[data.Item_Weight.isna()]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
7,FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.6680
21,FDW12,,Regular,0.035400,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
...,...,...,...,...,...,...,...,...,...,...,...,...
8485,DRK37,,Low Fat,0.043792,Soft Drinks,189.0530,OUT027,1985,Medium,Tier 3,Supermarket Type3,6261.8490
8487,DRG13,,Low Fat,0.037006,Soft Drinks,164.7526,OUT027,1985,Medium,Tier 3,Supermarket Type3,4111.3150
8488,NCN14,,Low Fat,0.091473,Others,184.6608,OUT027,1985,Medium,Tier 3,Supermarket Type3,2756.4120
8490,FDU44,,Regular,0.102296,Fruits and Vegetables,162.3552,OUT019,1985,Small,Tier 1,Grocery Store,487.3656


In [52]:
# fill the null values in Item_Weight with the column mean
data.loc[data.Item_Weight.isna(), 'Item_Weight'] = data.Item_Weight.mean()

In [53]:
data.isna().sum()

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
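
The mean imputation with `loc` can be sketched on a made-up column (the name `weight` and its values are assumptions for illustration). The boolean mask picks the rows and the column label picks the cell to overwrite. The mean is computed before the assignment and skips missing values by default.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data
toy = pd.DataFrame({'weight': [10.0, np.nan, 20.0, np.nan]})

# mean() ignores missing values by default: (10 + 20) / 2
mean_weight = toy['weight'].mean()

# Boolean mask selects the rows; 'weight' selects the column to overwrite
missing = toy['weight'].isna()
toy.loc[missing, 'weight'] = mean_weight

print(toy['weight'].tolist())
```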

In [10]:
data.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

In [11]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


#### `Fill the missing values in the column: Outlet_Size by most frequent value using loc`

---

***Mode of Outlet Size***

In [12]:
data.Outlet_Size.mode()

0    Medium
dtype: object

In [13]:
data.Outlet_Size.value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [14]:
data.loc[data.Outlet_Size.isna(), 'Outlet_Size'] = 'Medium'

#### `Use the fillna function to impute the missing values.`

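- The `fillna` function is another way to impute the missing values. Assign the result back to the column to store it in the dataframe; calling `fillna` with `inplace=True` on a selected column triggers a warning in recent pandas versions and may leave the dataframe unchanged.

In [15]:
# fill the null values in Outlet Size by the most frequent value: "Medium"
data['Outlet_Size'] = data['Outlet_Size'].fillna('Medium')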

In [16]:
# check the null values again
data.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64
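
Instead of typing the most frequent value by hand, it can be taken from `mode()` directly. The sketch below uses a made-up `size` column (an assumption for illustration) and assigns the filled column back to the dataframe.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with two missing sizes
toy = pd.DataFrame({'size': ['Medium', np.nan, 'Small', 'Medium', np.nan]})

# mode() returns a Series; its first entry is used as the fill value
most_frequent = toy['size'].mode()[0]

# fillna returns a new Series, so assign it back to the column
toy['size'] = toy['size'].fillna(most_frequent)

print(toy['size'].tolist())
```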

#### `HOW TO UPDATE THE VALUES OF A COLUMN?`


---

In [17]:
# look at the data
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


---

***Let's have a look at the count of each category in the column `Item_Fat_Content`. We will use the `value_counts` function to do that.***


---

In [18]:
data.Item_Fat_Content.value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

---

***We can see that the categories `Low Fat`, `LF` and `low fat` are the same, and `Regular` and `reg` are also the same. To keep the data clean, we will map all of these to just two categories, `LF` and `R`, using the `map` function. Let's see how.***

In [19]:
# Create a new mapping (dictionary) 
mapping = {
    'Low Fat' : 'LF',
    'Regular' : 'R',
    'LF' : 'LF',
    'reg': 'R',
    'low fat' : 'LF'
}

In [20]:
data.Item_Fat_Content.map(mapping)

0       LF
1        R
2       LF
3        R
4       LF
        ..
8518    LF
8519     R
8520    LF
8521     R
8522    LF
Name: Item_Fat_Content, Length: 8523, dtype: object

In [21]:
# use the map function to update the values
data.Item_Fat_Content = data.Item_Fat_Content.map(mapping)

In [22]:
# Count of new categories in the column Item_Fat_Content
data.Item_Fat_Content.value_counts()

LF    5517
R     3006
Name: Item_Fat_Content, dtype: int64
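
One caveat is worth a small sketch: `map` with a dictionary turns every value that is not a key of the dictionary into `NaN`, which is why the mapping in this section also lists `LF` itself. The Series and the deliberately incomplete dictionary `partial` below are made up for illustration. `replace` is shown as an alternative that leaves unmatched values untouched.

```python
import pandas as pd

# Hypothetical toy column with the same kind of inconsistent labels
fat = pd.Series(['Low Fat', 'reg', 'LF', 'Regular', 'low fat'])

# Deliberately incomplete mapping: 'LF' is not a key
partial = {'Low Fat': 'LF', 'low fat': 'LF', 'Regular': 'R', 'reg': 'R'}

mapped = fat.map(partial)        # 'LF' has no key, so it becomes NaN
replaced = fat.replace(partial)  # values without a key are kept as they are

print(mapped.isna().sum())
print(replaced.tolist())
```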

How to group the data based on categories of one column?

In [54]:
d = data.groupby(['Item_Type'])
d

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fcbc0a30510>

In [55]:
d.first()

Unnamed: 0_level_0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Item_Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Baking Goods,FDP36,10.395,Regular,0.0,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
Breads,FDO23,17.85,Low Fat,0.0,93.1436,OUT045,2002,Medium,Tier 2,Supermarket Type1,2174.5028
Breakfast,FDP49,9.0,Regular,0.069089,56.3614,OUT046,1997,Small,Tier 1,Supermarket Type1,1547.3192
Canned,FDC14,12.857645,Regular,0.072222,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
Dairy,FDA15,9.3,Low Fat,0.016047,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
Frozen Foods,FDH17,16.2,Regular,0.016687,96.9726,OUT045,2002,Small,Tier 2,Supermarket Type1,1076.5986
Fruits and Vegetables,FDX07,19.2,Regular,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
Hard Drinks,DRI11,12.857645,Low Fat,0.034238,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.668
Health and Hygiene,NCB42,11.8,Low Fat,0.008596,115.3492,OUT018,2009,Medium,Tier 3,Supermarket Type2,1621.8888
Household,NCD19,8.93,Low Fat,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [56]:
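# numeric_only=True averages only the numeric columns (pandas 2.0+ raises an error on text columns otherwise)
d.mean(numeric_only=True)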

Unnamed: 0_level_0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
Item_Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Baking Goods,12.377448,0.069169,126.380766,1997.728395,1952.971207
Breads,11.629818,0.066255,140.952669,1997.657371,2204.132226
Breakfast,12.785278,0.085723,141.788151,1997.336364,2111.808651
Canned,12.399254,0.068129,139.763832,1998.152542,2225.194904
Dairy,13.329387,0.072427,148.499208,1997.681818,2232.542597
Frozen Foods,12.865543,0.065645,138.503366,1998.024533,2132.867744
Fruits and Vegetables,13.161297,0.068513,144.581235,1997.719968,2289.009592
Hard Drinks,11.611435,0.064943,137.077928,1998.17757,2139.221622
Health and Hygiene,13.093044,0.055216,130.818921,1997.734615,2010.000265
Household,13.297274,0.061322,149.424753,1997.784615,2258.7843


In [57]:
d['Item_MRP'].mean()

Item_Type
Baking Goods             126.380766
Breads                   140.952669
Breakfast                141.788151
Canned                   139.763832
Dairy                    148.499208
Frozen Foods             138.503366
Fruits and Vegetables    144.581235
Hard Drinks              137.077928
Health and Hygiene       130.818921
Household                149.424753
Meat                     139.882032
Others                   132.851430
Seafood                  141.841719
Snack Foods              146.194934
Soft Drinks              131.492506
Starchy Foods            147.838023
Name: Item_MRP, dtype: float64
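
The grouping steps above can be sketched on a made-up dataframe (hypothetical column names and values). Selecting the column before aggregating computes only what is needed, and `agg` returns several statistics per group in one call.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'item_type': ['Dairy', 'Meat', 'Dairy', 'Meat', 'Dairy'],
    'price': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Select the column first, then aggregate: only 'price' is averaged
mean_price = toy.groupby('item_type')['price'].mean()

# agg computes several statistics per group at once
summary = toy.groupby('item_type')['price'].agg(['mean', 'max', 'count'])

print(mean_price)
print(summary)
```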

In [58]:
d.max()


Unnamed: 0_level_0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Item_Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Baking Goods,FDZ60,20.85,reg,0.31109,265.5568,OUT049,2009,Tier 3,Supermarket Type3,7931.6754
Breads,FDZ35,20.85,reg,0.28151,263.6594,OUT049,2009,Tier 3,Supermarket Type3,8958.339
Breakfast,FDR37,21.1,reg,0.274592,234.93,OUT049,2009,Tier 3,Supermarket Type3,8209.314
Canned,FDZ49,21.35,reg,0.328391,266.8884,OUT049,2009,Tier 3,Supermarket Type3,10306.584
Dairy,FDZ50,20.7,reg,0.304737,266.6884,OUT049,2009,Tier 3,Supermarket Type3,10256.649
Frozen Foods,FDZ52,20.85,reg,0.294939,264.891,OUT049,2009,Tier 3,Supermarket Type3,9678.0688
Fruits and Vegetables,FDZ56,21.35,reg,0.321115,264.2252,OUT049,2009,Tier 3,Supermarket Type3,12117.56
Hard Drinks,DRQ35,19.7,low fat,0.298205,261.4278,OUT049,2009,Tier 3,Supermarket Type3,7843.124
Health and Hygiene,NCZ53,21.25,low fat,0.255348,266.6884,OUT049,2009,Tier 3,Supermarket Type3,9779.9362
Household,NCZ54,21.25,low fat,0.325781,264.791,OUT049,2009,Tier 3,Supermarket Type3,13086.9648


In [59]:
# group by two columns: Outlet_Size first, then Item_Type

d = data.groupby(['Outlet_Size', 'Item_Type']).mean(numeric_only=True)
d

Unnamed: 0_level_0,Unnamed: 1_level_0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
Outlet_Size,Item_Type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
High,Baking Goods,12.036918,0.05949,129.202044,1987.0,2050.901134
High,Breads,11.048,0.065586,133.75896,1987.0,2080.731528
High,Breakfast,12.564231,0.075118,147.490585,1987.0,2104.286508
High,Canned,11.922231,0.056733,135.442708,1987.0,2211.265203
High,Dairy,13.071875,0.068907,153.509173,1987.0,2453.181713
High,Frozen Foods,13.250707,0.065639,136.82925,1987.0,2214.096189
High,Fruits and Vegetables,13.259613,0.061302,145.57287,1987.0,2405.118103
High,Hard Drinks,11.741957,0.062271,141.927522,1987.0,2363.59
High,Health and Hygiene,13.02877,0.051031,135.11098,1987.0,1953.042439
High,Household,14.033398,0.053742,147.097522,1987.0,2408.217992


We can also do this using a pivot table.

In [60]:
d['Item_MRP']

Outlet_Size  Item_Type            
High         Baking Goods             129.202044
             Breads                   133.758960
             Breakfast                147.490585
             Canned                   135.442708
             Dairy                    153.509173
             Frozen Foods             136.829250
             Fruits and Vegetables    145.572870
             Hard Drinks              141.927522
             Health and Hygiene       135.110980
             Household                147.097522
             Meat                     137.244790
             Others                   132.576613
             Seafood                  134.864240
             Snack Foods              145.847086
             Soft Drinks              131.758473
             Starchy Foods            158.157074
Medium       Baking Goods             126.178568
             Breads                   140.861039
             Breakfast                134.537511
             Canned               

In [61]:
pd.pivot_table(data, index='Item_Type', values='Item_MRP', aggfunc='mean')

Unnamed: 0_level_0,Item_MRP
Item_Type,Unnamed: 1_level_1
Baking Goods,126.380766
Breads,140.952669
Breakfast,141.788151
Canned,139.763832
Dairy,148.499208
Frozen Foods,138.503366
Fruits and Vegetables,144.581235
Hard Drinks,137.077928
Health and Hygiene,130.818921
Household,149.424753


In [62]:
pd.pivot_table(data, index=['Outlet_Size', 'Item_Type'], values='Item_MRP', aggfunc='mean')

Unnamed: 0_level_0,Unnamed: 1_level_0,Item_MRP
Outlet_Size,Item_Type,Unnamed: 2_level_1
High,Baking Goods,129.202044
High,Breads,133.75896
High,Breakfast,147.490585
High,Canned,135.442708
High,Dairy,153.509173
High,Frozen Foods,136.82925
High,Fruits and Vegetables,145.57287
High,Hard Drinks,141.927522
High,Health and Hygiene,135.11098
High,Household,147.097522
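
To make the relationship between the two approaches concrete, the sketch below computes the same two-column mean with `groupby` and with `pivot_table` on a made-up dataframe (hypothetical names and values). `groupby` on a single selected column returns a Series, while `pivot_table` returns a DataFrame with one column per entry in `values`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'size': ['High', 'High', 'Small', 'Small'],
    'item_type': ['Dairy', 'Dairy', 'Dairy', 'Meat'],
    'price': [10.0, 30.0, 20.0, 40.0],
})

# Mean price per (size, item_type) pair, as a Series with a two-level index
by_groupby = toy.groupby(['size', 'item_type'])['price'].mean()

# The same numbers as a DataFrame with a single 'price' column
by_pivot = pd.pivot_table(toy, index=['size', 'item_type'],
                          values='price', aggfunc='mean')

print(by_groupby)
print(by_pivot)
```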


How to create new feature using the aggregated results of a column?

In [63]:
average_item_visibility = data.groupby(['Item_Identifier'])['Item_Visibility'].mean().reset_index()
average_item_visibility

Unnamed: 0,Item_Identifier,Item_Visibility
0,DRA12,0.031956
1,DRA24,0.048062
2,DRA59,0.134718
3,DRB01,0.082126
4,DRB13,0.008002
...,...,...
1554,NCZ30,0.024956
1555,NCZ41,0.051623
1556,NCZ42,0.009044
1557,NCZ53,0.027775


Now, we want to create a new feature `average_item_visibility` using the above dataframe.
Let's first define a function that takes an `Item_Identifier` value and returns the corresponding average `Item_Visibility` from the dataframe `average_item_visibility`.

In [64]:
def get_item_visibility(x):
    # look up the row for identifier x and return its average visibility
    return average_item_visibility.loc[(average_item_visibility.Item_Identifier == x), 'Item_Visibility'].values[0]

In [65]:
get_item_visibility('DRA24')

0.04806226414285714

Now, use the `apply` method to create the new feature: access the `Item_Identifier` column, call `apply` on it and pass the function defined above.

In [67]:
data['average_item_visibility'] = data.Item_Identifier.apply(get_item_visibility)

In [68]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,average_item_visibility
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,0.017387
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,0.019219
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,0.020145
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38,0.015274
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,0.008082


How to group the data based on categories of multiple columns?

A better way of doing the above task is to use the `transform` function.

On a large dataframe, `transform` takes considerably less time than the `apply`-based approach used above, which is a significant advantage.

In [69]:
data['average_item_visibility_2'] = data.groupby(['Item_Identifier'])['Item_Visibility'].transform('mean')

In [70]:
data[['average_item_visibility', 'average_item_visibility_2']]

Unnamed: 0,average_item_visibility,average_item_visibility_2
0,0.017387,0.017387
1,0.019219,0.019219
2,0.020145,0.020145
3,0.015274,0.015274
4,0.008082,0.008082
...,...,...
8518,0.061705,0.061705
8519,0.046952,0.046952
8520,0.035203,0.035203
8521,0.120686,0.120686
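
The difference between aggregating and transforming can be sketched on a made-up dataframe (hypothetical names and values). `mean()` returns one value per group, while `transform('mean')` returns one value per original row, so the result can be assigned straight to a new column. Mapping the identifier column through the per-group Series is shown as a third route to the same values.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'item': ['A', 'B', 'A', 'B', 'C'],
    'visibility': [0.25, 0.5, 0.75, 1.0, 0.125],
})

grouped = toy.groupby('item')['visibility']

per_group = grouped.mean()           # one value per item: 3 rows
per_row = grouped.transform('mean')  # one value per original row: 5 rows

# The transformed result is aligned with toy's index
toy['avg_visibility'] = per_row

# Alternative: look up each item's mean in the per-group Series
via_map = toy['item'].map(per_group)

print(toy)
```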


HOW TO CREATE A NEW COLUMN BY MODIFYING THE EXISTING COLUMN?

---

In [23]:
data.Item_MRP

0       249.8092
1        48.2692
2       141.6180
3       182.0950
4        53.8614
          ...   
8518    214.5218
8519    108.1570
8520     85.1224
8521    103.1332
8522     75.4670
Name: Item_MRP, Length: 8523, dtype: float64

#### `APPLY`

***Create a new column `Item_MRP_in_USD` by dividing each value in the column `Item_MRP` by 74 using the `apply` function. Let's see how.***

---


In [24]:
data.Item_MRP.apply(lambda x: x/74)

0       3.375800
1       0.652286
2       1.913757
3       2.460743
4       0.727857
          ...   
8518    2.898943
8519    1.461581
8520    1.150303
8521    1.393692
8522    1.019824
Name: Item_MRP, Length: 8523, dtype: float64

In [25]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,LF,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,R,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,LF,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,R,0.0,Fruits and Vegetables,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
4,NCD19,8.93,LF,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [26]:
def convert(price):
    price = price/74
    price = price + 1.28
    return price

In [30]:
data['Item_MRP_in_USD'] = data.Item_MRP.apply(convert)

In [31]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_MRP_in_USD_UPDATED,Item_MRP_in_USD
0,FDA15,9.3,LF,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,4.6558,4.6558
1,DRC01,5.92,R,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,1.932286,1.932286
2,FDN15,17.5,LF,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,3.193757,3.193757
3,FDX07,19.2,R,0.0,Fruits and Vegetables,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38,3.740743,3.740743
4,NCD19,8.93,LF,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,2.007857,2.007857


In [32]:
data[['Item_MRP','Item_MRP_in_USD']]

Unnamed: 0,Item_MRP,Item_MRP_in_USD
0,249.8092,4.655800
1,48.2692,1.932286
2,141.6180,3.193757
3,182.0950,3.740743
4,53.8614,2.007857
...,...,...
8518,214.5218,4.178943
8519,108.1570,2.741581
8520,85.1224,2.430303
8521,103.1332,2.673692
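
For simple arithmetic like this, `apply` is not the only option: the same result can be obtained by operating on the whole column at once. The sketch below uses a made-up Series and copies the constants 74 and 1.28 from the `convert` function in this section; the source does not state what they represent.

```python
import pandas as pd

# Hypothetical toy prices
prices = pd.Series([74.0, 148.0, 37.0])

def convert(price):
    # Same steps as convert() in this section: divide by 74, then add 1.28
    return price / 74 + 1.28

via_apply = prices.apply(convert)    # calls convert once per element
via_vectorized = prices / 74 + 1.28  # one expression over the whole column

print(via_apply.tolist())
print(via_vectorized.tolist())
```

On large columns the vectorized form usually runs faster, because the loop happens inside pandas rather than in Python.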


---

#### `HOW TO CONVERT CATEGORICAL VARIABLES INTO NUMERICAL?`

Most machine learning algorithms cannot take categorical variables as input, so we need to convert them into numerical ones. Pandas provides the `get_dummies` function for this task. It creates a binary column for each category.

For example, look at the image below. We have two genders, male and female, so it will create two binary columns.  
![](encode.png)

This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.
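
The gender example can be sketched in code as follows. The dataframe is made up for illustration. `dtype=int` is passed so that the indicator columns hold 0 and 1; without it, recent pandas versions return boolean columns.

```python
import pandas as pd

# Hypothetical toy data matching the two-gender example
people = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})

# One indicator column per category, named <column>_<category>
encoded = pd.get_dummies(people, dtype=int)

print(encoded)
```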

---

#### Let's look at the data again.

In [33]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_MRP_in_USD_UPDATED,Item_MRP_in_USD
0,FDA15,9.3,LF,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,4.6558,4.6558
1,DRC01,5.92,R,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,1.932286,1.932286
2,FDN15,17.5,LF,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,3.193757,3.193757
3,FDX07,19.2,R,0.0,Fruits and Vegetables,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38,3.740743,3.740743
4,NCD19,8.93,LF,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,2.007857,2.007857


In [34]:
# currently we have 14 columns in the data
data.shape

(8523, 14)

#### `USE GET_DUMMIES`

---

In [35]:
# convert categorical variables into numerical variables.
data = pd.get_dummies(data)

In [36]:
# view the data
data.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_MRP_in_USD_UPDATED,Item_MRP_in_USD,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,1999,3735.138,4.6558,4.6558,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,2009,443.4228,1.932286,1.932286,0,0,0,...,0,1,0,0,0,1,0,0,1,0
2,17.5,0.01676,141.618,1999,2097.27,3.193757,3.193757,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,19.2,0.0,182.095,1998,732.38,3.740743,3.740743,0,0,0,...,0,1,0,0,0,1,1,0,0,0
4,8.93,0.0,53.8614,1987,994.7052,2.007857,2.007857,0,0,0,...,1,0,0,0,0,1,0,1,0,0


In [37]:
# now, we have 1604 columns
data.shape

(8523, 1604)
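
Judging by the `Item_Identifier_...` columns in the output above, much of this width comes from encoding the identifier column, which gets one indicator per distinct item. One possible way to keep the result narrower, sketched on a made-up dataframe with hypothetical column names, is the `columns=` parameter, which restricts the encoding to the listed columns. Whether the identifier should be kept, dropped or encoded depends on the downstream task and is not decided by the source.

```python
import pandas as pd

# Hypothetical toy data with a high-cardinality identifier
toy = pd.DataFrame({
    'item_id': ['A1', 'B2', 'C3', 'D4'],
    'size': ['Small', 'High', 'Small', 'High'],
    'price': [10.0, 20.0, 30.0, 40.0],
})

# Default: every text column is encoded, including item_id
encode_all = pd.get_dummies(toy)

# Only 'size' is encoded; item_id and price are left untouched
encode_some = pd.get_dummies(toy, columns=['size'])

print(encode_all.shape, encode_some.shape)
```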