# Data Preprocessing

We have all heard the phrase "garbage in, garbage out". In the realm of Machine Learning, it translates to: bad data, bad results.

Throughout this bootcamp we will delve into Python and play with data. We will explore several valuable libraries such as NumPy, pandas, scikit-learn, et cetera. Python is widely used in ML (Machine Learning) due to its simple syntax and extensive open-source libraries.

## Introduction To Python

So what is Python? A snake? Yes, but also a programming language that encompasses well-known concepts such as conditionals (if-else statements), loops (for, while), arithmetic (+, -, *, /) and more.
Throughout this course we will explain all the syntax we use, but a basic understanding will still be useful. Let's start simple...

### The Basics

#### Printing

In [33]:
! pip install pandas
! pip install numpy






In [1]:
print("Heya World") # Will print to console, this is a comment

Heya World





Do note that, unlike many languages, Python does not use curly braces to delimit blocks of code, and it doesn't need a semicolon at the end of each statement.

Instead, it relies on **indentation** to decide which statements belong to which block.
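As a small sketch of what indentation-based blocks look like in practice (the variable names and values here are made up for illustration), the indented lines belong to the branch above them, and the first line back at the left margin ends the block:

```python
# Indentation (conventionally 4 spaces) marks which statements belong to a block.
temperature = 30  # hypothetical value, chosen only for illustration

if temperature > 25:
    message = "warm"           # inside the if-block
    advice = "stay hydrated"   # still inside the if-block
else:
    message = "cool"           # inside the else-block
    advice = "bring a jacket"

# Back at the left margin: this line runs whichever branch was taken.
print(message, "-", advice)
```

Removing the indentation from one of the inner lines would raise an `IndentationError`, which is how Python enforces this structure.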

#### Conditionals

In [2]:
a = 1+1 # basic arithmetic
if a == 2:
    print("Maths is real")
elif a == 1: # elif is how you write else if in python
    print("How??")
else:
    print("What??")

Maths is real


#### Loops

Loops in python are also simple.

In [3]:
for i in range(10): # loops 10 times from 0 to 9
    print(i)

###

i=0
while i<10: # Equivalent while loop of the above for
    print(i) 
    i += 1 # Shorthand for i = i + 1; -=, *= and /= work the same way

0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
7
8
9
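The shorthand operators mentioned in the loop comment can be sketched as follows. This is only an illustration with an arbitrary starting value; note that `/=` performs true division, so the result becomes a float:

```python
total = 10   # arbitrary starting value for illustration
total += 5   # same as total = total + 5
total -= 3   # same as total = total - 3
total *= 2   # same as total = total * 2
total /= 4   # same as total = total / 4; true division yields a float
print(total)
```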


#### Datatypes

All this is great, but what are the datatypes available in this language?

Well, the common datatypes are integers, floats, booleans and [strings](https://www.programiz.com/python-programming/string). More complex ones would be [Lists](https://www.programiz.com/python-programming/list), [Tuples](https://www.programiz.com/python-programming/tuple) and [Dictionaries](https://www.programiz.com/python-programming/dictionary). They are all fairly simple, and you can learn more about them by checking the links provided.
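As a quick, hedged illustration of the datatypes listed above (all names and values are invented for this example), each one can be created with a literal and inspected with `type()`:

```python
count = 3                               # int
price = 9.99                            # float
in_stock = True                         # bool
name = "notebook"                       # str
tags = ["paper", "office"]              # list: ordered and mutable
size = (21.0, 29.7)                     # tuple: ordered and immutable
item = {"name": name, "price": price}   # dict: maps keys to values

tags.append("school")    # lists can grow after creation
item["count"] = count    # dicts accept new key/value pairs

# Print the type name and value of each variable.
for value in (count, price, in_stock, name, tags, size, item):
    print(type(value).__name__, value)
```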

#### Functions

In [4]:
def ml_function(): # How to define a function
    print("It is 2023")

ml_function() # How to call a function

It is 2023


## Python Libraries

Libraries are collections of modules, and modules are simply files of Python code: functions, classes, constants etc. Below is a simple example of this.

In [5]:
import math

print(math.pi)

3.141592653589793


How does this work?

When we write `import math`, Python loads the `math` module and makes it available in our program under the name `math`. We then access `math.pi`, which is a constant defined in the module. You can always learn more about any module by referring to the [documentation](https://docs.python.org/3/library/math.html?highlight=math#module-math).

The libraries that matter most to us are those related to Machine Learning. We will start by seeing how to handle data with the help of a library known as Pandas.
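Besides the plain `import math` form, Python offers a couple of other import styles. The sketch below shows them side by side; the alias `m` and the variable `radius` are arbitrary choices for this example:

```python
import math              # access members as math.<name>
from math import sqrt    # import one name directly
import math as m         # import under a shorter alias

radius = 2.0
area = math.pi * radius ** 2   # constant accessed through the module name
print(area)
print(sqrt(16))                # no "math." prefix needed
print(m.floor(3.7))            # same module, reached through the alias
```

The alias form is the one used for pandas in this notebook (`import pandas as pd`).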

### Pandas

[Pandas Docs](https://pandas.pydata.org/docs/). The dataset we will be using for our first foray into Pandas is the [Insurance Targets List](https://drive.google.com/file/d/1FG6-KJxEZ7j2h3_0Ee04VwzYK3KMscjG/view?usp=sharing).

In [6]:
import pandas as pd

data = pd.read_csv('data.csv')
data.head()
# data.tail()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [7]:
data.shape

(10, 4)

In [8]:
data.columns

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')

#### Data Indexing

Pandas offers two ways to index data:

 * `.loc` - Label Based Indexing
 * `.iloc` - Integer Based Indexing

##### `.loc`

 * `.loc` interprets the values provided as index labels (these can be strings, integers, or any other label type).
 * It's often used for conditional indexing.
 * It includes the last index of a slice.
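To make the label-based behaviour easier to see, here is a minimal sketch on a made-up frame with string row labels (this is not the course dataset; the values are only borrowed for flavour):

```python
import pandas as pd

# Hypothetical mini-frame with string row labels.
df = pd.DataFrame(
    {"Age": [44, 27, 30, 38], "Salary": [72000, 48000, 54000, 61000]},
    index=["a", "b", "c", "d"],
)

# Label slice: both endpoints are included, so rows "b" and "c" come back.
label_slice = df.loc["b":"c"]

# Boolean condition for the rows, plus a column label for the columns.
high_earners = df.loc[df["Salary"] > 60000, "Age"]

print(label_slice)
print(high_earners)
```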


In [9]:
data.loc[2:8] # 2 and 8 are read as labels; with the default integer index this looks like iloc, except that row 8 is included

Unnamed: 0,Country,Age,Salary,Purchased
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No


In [10]:
data.loc[data['Salary']>70000] # loc accepts a boolean condition to filter rows

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No


In [11]:
data.loc[(data['Purchased'].str.contains('No')) & ( data['Salary']>70000)] # Conditions can be combined with & (each one wrapped in parentheses)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
8,Germany,50.0,83000.0,No


##### `.iloc`

* `.iloc` is an indexer whose name stands for "integer location".
* It treats all input as integer positions (starting from 0).
* In contrast to `.loc`, it doesn't include the last index of a slice.
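The difference between positions and labels is easiest to see when the index labels do not match the row positions. The following sketch uses a made-up frame (not the course dataset) whose labels are 10, 20, 30 and 40:

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions.
df = pd.DataFrame({"Age": [44, 27, 30, 38]}, index=[10, 20, 30, 40])

by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded) -> labels 20, 30
by_label = df.loc[20:30]     # labels 20 through 30 (end included) -> same rows
first_cell = df.iloc[0, 0]   # row position 0, column position 0

print(by_position)
print(by_label)
print(first_cell)
```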

In [12]:
data.iloc[2:8]

Unnamed: 0,Country,Age,Salary,Purchased
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes


In [13]:
data.iloc[2:8, 2:5]

Unnamed: 0,Salary,Purchased
2,54000.0,No
3,61000.0,No
4,,Yes
5,58000.0,Yes
6,52000.0,No
7,79000.0,Yes


In [14]:
X = data.iloc[:, 1:].values
print(X)

[[44.0 72000.0 'No']
 [27.0 48000.0 'Yes']
 [30.0 54000.0 'No']
 [38.0 61000.0 'No']
 [40.0 nan 'Yes']
 [35.0 58000.0 'Yes']
 [nan 52000.0 'No']
 [48.0 79000.0 'Yes']
 [50.0 83000.0 'No']
 [37.0 67000.0 'Yes']]


## Exploring Data
Data exploration is the first step of data analysis: we explore and visualize the data to uncover initial insights and to identify areas or patterns worth digging into further.

### `.unique()`

Returns a NumPy array of the unique values in a given Series, in order of first appearance

### `.value_counts()`

Returns a Series containing the count of each value in a given Series, sorted from most to least frequent
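Before applying these two methods to the real dataset, here is a minimal sketch on a tiny, invented Series so the return values are easy to inspect:

```python
import pandas as pd

# Hypothetical Series, small enough to check by eye.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

distinct = colors.unique()       # array of distinct values, in order of first appearance
counts = colors.value_counts()   # Series indexed by value, most frequent first

print(distinct)
print(counts)
```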

In [34]:
import IPython.display as ipd
import numpy as np
# Loading data
data = pd.read_csv("train.csv")
data.head() # Returns the first 5 rows by default; pass a number, e.g. data.head(10), to change how many
# data.tail()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [35]:
data["Item_Type"].unique()

array(['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables',
       'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods',
       'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned',
       'Breads', 'Starchy Foods', 'Others', 'Seafood'], dtype=object)

In [36]:
data["Item_Type"].value_counts()
# ipd.display(data["Item_Type"].count())

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

## Inconsistent Data

Checking the same for Item Fat Content:

In [37]:
data["Item_Fat_Content"].unique() 

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [38]:
data["Item_Fat_Content"].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

To deal with these inconsistent labels we could replace each one manually, or we could use pandas' built-in string methods by:
1. Converting all values to lower case, which takes care of capitalization inconsistencies
2. Converting all alternate labels to a single label: `lf` to `low fat` and `reg` to `regular`

In [39]:
dat_copy = data.copy() # Making a copy (deep by default) so the original DataFrame is left untouched
dat_copy["Item_Fat_Content"] = dat_copy["Item_Fat_Content"].str.lower()
ipd.display(dat_copy["Item_Fat_Content"].unique())


# dat_copy["Item_Fat_Content"] = dat_copy["Item_Fat_Content"].replace("lf","low fat")
dat_copy["Item_Fat_Content"] = dat_copy["Item_Fat_Content"].replace({"lf":"low fat", "reg": "regular"})
dat_copy["Item_Fat_Content"].unique()

# dat_copy["Item_Fat_Content"].value_counts()

array(['low fat', 'regular', 'lf', 'reg'], dtype=object)

array(['low fat', 'regular'], dtype=object)

In [40]:
data = dat_copy
data.head(15)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,low fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,low fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,low fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
5,FDP36,10.395,regular,0.0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
6,FDO10,13.65,regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
7,FDP10,,low fat,0.12747,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,FDH17,16.2,regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.2,regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.535


## Missing Values
Missing values in datasets are often represented by `NaN` or `None`. They usually need to be cleaned before the data is fed into a model.
But before we get into missing values, here's a quick revision:

1. The mean is the average of a data set.
2. The mode is the most common number in a data set.
3. The median is the middle of the set of numbers.

[![03.gif](https://i.postimg.cc/3Nv2XQRP/03.gif)](https://postimg.cc/QKjBDPcb)
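The three statistics above can be computed with Python's standard `statistics` module. The numbers below are invented for illustration; the second half of the sketch adds one extreme value to show that the mean shifts much more than the median, which is worth keeping in mind when choosing a replacement value:

```python
import statistics

values = [2, 3, 3, 5, 7, 10]   # hypothetical data

mean = statistics.mean(values)       # sum / count = 30 / 6
median = statistics.median(values)   # average of the two middle values, 3 and 5
mode = statistics.mode(values)       # most frequent value
print(mean, median, mode)

# One extreme value pulls the mean far more than the median.
with_outlier = values + [100]
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```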


**How to deal with missing data?**

1.  Drop data  
    a. Drop the whole row  
    b. Drop the whole column
    
2.  Replace data  
    a. Replace it by mean  
    b. Replace it by frequency  
    c. Replace it based on other functions
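The options above can be sketched on a small, made-up frame. The column names `weight`, `size` and `price` are placeholders, not columns of the course dataset; the pandas calls used are `dropna` and `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical gap.
df = pd.DataFrame({
    "weight": [9.3, np.nan, 17.5, 19.2],
    "size": ["Medium", "Small", None, "Medium"],
    "price": [249.8, 48.3, 141.6, 182.1],
})

# 1a. Drop every row that contains a missing value.
dropped_rows = df.dropna(axis=0)

# 1b. Drop every column that contains a missing value.
dropped_cols = df.dropna(axis=1)

# 2a / 2b. Replace by the mean (numeric) or the most frequent value (categorical).
filled = df.copy()
filled["weight"] = filled["weight"].fillna(filled["weight"].mean())
filled["size"] = filled["size"].fillna(filled["size"].mode()[0])

print(dropped_rows)
print(dropped_cols)
print(filled)
```

Dropping is the simplest choice but throws information away; replacing keeps every row at the cost of introducing estimated values.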

In [41]:
data["Outlet_Size"].unique() 

array(['Medium', nan, 'High', 'Small'], dtype=object)

Missing data can also be found with the `isnull()` or `isna()` methods, which are aliases of each other.
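Beyond a yes/no answer per column, it is often useful to know how many values are missing. A minimal sketch on an invented frame (the column names are placeholders) could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns.
df = pd.DataFrame({
    "weight": [9.3, np.nan, np.nan],
    "size": ["Medium", None, "High"],
})

missing_per_column = df.isna().sum()   # number of missing values in each column
missing_share = df.isna().mean()       # fraction of missing values in each column

print(missing_per_column)
print(missing_share)
```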

In [42]:
ipd.display(data.isnull().any())

Item_Identifier              False
Item_Weight                   True
Item_Fat_Content             False
Item_Visibility              False
Item_Type                    False
Item_MRP                     False
Outlet_Identifier            False
Outlet_Establishment_Year    False
Outlet_Size                   True
Outlet_Location_Type         False
Outlet_Type                  False
Item_Outlet_Sales            False
dtype: bool

We can figure out which of our rows are missing their `Item_Weight` with the following:

In [43]:
data[data['Item_Weight'].isnull()] # 1462 rows

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
7,FDP10,,low fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
18,DRI11,,low fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.6680
21,FDW12,,regular,0.035400,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432
23,FDC37,,low fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
29,FDC14,,regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
...,...,...,...,...,...,...,...,...,...,...,...,...
8485,DRK37,,low fat,0.043792,Soft Drinks,189.0530,OUT027,1985,Medium,Tier 3,Supermarket Type3,6261.8490
8487,DRG13,,low fat,0.037006,Soft Drinks,164.7526,OUT027,1985,Medium,Tier 3,Supermarket Type3,4111.3150
8488,NCN14,,low fat,0.091473,Others,184.6608,OUT027,1985,Medium,Tier 3,Supermarket Type3,2756.4120
8490,FDU44,,regular,0.102296,Fruits and Vegetables,162.3552,OUT019,1985,Small,Tier 1,Grocery Store,487.3656


**As `Item_Weight` is a ratio (continuous) variable, we replace its missing values with its mean**

In [44]:
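# Replace missing Item_Weight values with the column mean
# (Outlet_Size is replaced with its mode in the next cell)

avg_item_wt = data["Item_Weight"].astype("float").mean()

data_cpy = data.copy()
# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may not update the DataFrame in newer pandas versions.
data_cpy["Item_Weight"] = data_cpy["Item_Weight"].fillna(avg_item_wt)
data_cpy["Item_Weight"].isnull().any() 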

False

**As `Outlet_Size` is an ordinal variable, we replace its missing values with its mode**

In [45]:
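data_cpy = data.copy()  # note: copying from `data` again discards the Item_Weight fill from the previous cell

ipd.display(data_cpy["Outlet_Size"].value_counts())
outlet_mode = data_cpy["Outlet_Size"].mode()  # mode() returns a Series; [0] is the most frequent value
data_cpy["Outlet_Size"] = data_cpy["Outlet_Size"].fillna(outlet_mode[0])

data_cpy["Outlet_Size"].isnull().any()
# data_cpy["Outlet_Size"].value_counts()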

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

False

In [46]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,low fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,low fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,low fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [47]:
data = data_cpy
ipd.display(data.isnull().any())

Item_Identifier              False
Item_Weight                   True
Item_Fat_Content             False
Item_Visibility              False
Item_Type                    False
Item_MRP                     False
Outlet_Identifier            False
Outlet_Establishment_Year    False
Outlet_Size                  False
Outlet_Location_Type         False
Outlet_Type                  False
Item_Outlet_Sales            False
dtype: bool
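Note that `Item_Weight` still shows `True` in the output above. The cell that fills `Outlet_Size` starts from a fresh `data.copy()`, so the earlier mean replacement for `Item_Weight` never reaches the final `data`. One possible way to avoid this, sketched here on a small invented stand-in frame rather than the real file, is to copy once and apply both replacements to the same copy:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data, reusing the two column names.
data = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Medium"],
})

data_cpy = data.copy()  # copy once, then apply every replacement to this copy

# Continuous variable: fill with the mean.
data_cpy["Item_Weight"] = data_cpy["Item_Weight"].fillna(data_cpy["Item_Weight"].mean())

# Ordinal variable: fill with the mode (the most frequent value).
data_cpy["Outlet_Size"] = data_cpy["Outlet_Size"].fillna(data_cpy["Outlet_Size"].mode()[0])

# Verify that nothing is missing before replacing the original.
assert not data_cpy.isnull().any().any()
data = data_cpy
print(data)
```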