# Data Preprocessing

We have all heard the phrase "garbage in, garbage out". In the realm of machine learning, it translates to "bad data, bad results".

Throughout this bootcamp we will delve into Python and play with data. We will explore several valuable libraries such as NumPy, pandas, and scikit-learn. Python is widely used in ML (Machine Learning) due to its simple syntax and extensive open source libraries.

## Introduction To Python

So what is Python? A snake? Yes, but also a programming language that encompasses well-known concepts such as conditionals (if-else statements), loops (for, while), arithmetic (+, -, *, /) and more.
Throughout this course we will explain all the syntax we use, but a basic understanding is still useful. Let's start simple...

### The Basics

#### Printing

In [1]:
! pip install pandas
print("Heya World") # Will print to console, this is a comment

Collecting pandas
  Downloading pandas-1.5.3-cp38-cp38-win_amd64.whl (11.0 MB)
     ---------------------------------------- 11.0/11.0 MB 2.9 MB/s eta 0:00:00
Collecting numpy>=1.20.3
  Downloading numpy-1.24.2-cp38-cp38-win_amd64.whl (14.9 MB)
     ---------------------------------------- 14.9/14.9 MB 1.6 MB/s eta 0:00:00
Installing collected packages: numpy, pandas
Successfully installed numpy-1.24.2 pandas-1.5.3
Heya World





Note that, unlike many languages, Python does not use curly braces to delimit blocks, and it does not need a semicolon at the end of each statement.

Instead, it relies on **indentation** to mark where a block begins and ends.
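To make the role of indentation concrete, here is a minimal sketch. The variable names and the value `30` are made up for illustration.

```python
temperature = 30  # hypothetical value chosen for illustration
label = "cool"

if temperature > 25:
    # Both indented lines belong to the if block.
    label = "warm"
    print("Inside the if block")

# Back at the left margin: this line runs whether or not the condition held.
print("Label:", label)
```

Removing the indentation from the lines inside the block raises an `IndentationError`, because Python has no other way to tell where the block ends.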

#### Conditionals

In [2]:
a = 1 + 1 # basic arithmetic
if a == 2:
    print("Maths is real")
elif a == 1: # elif is how you write "else if" in Python
    print("How??")
else:
    print("What??")

Maths is real


#### Loops

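Loops in Python are just as simple: a `for` loop iterates over a sequence, and a `while` loop repeats as long as its condition holds.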

In [3]:
for i in range(10): # loops 10 times from 0 to 9
    print(i)

###

i = 0
while i < 10: # Equivalent while loop for the for loop above
    print(i)
    i += 1 # Shorthand for i = i + 1; -=, *= and /= work the same way

0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
7
8
9


#### Datatypes

All this is great, but what are the datatypes available in this language?

Well, the common datatypes are integers, floats, booleans and [strings](https://www.programiz.com/python-programming/string). More complex ones would be [Lists](https://www.programiz.com/python-programming/list), [Tuples](https://www.programiz.com/python-programming/tuple) and [Dictionaries](https://www.programiz.com/python-programming/dictionary). They are all fairly simple, and you can learn more about them by checking the links provided.
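As a quick illustration of the types listed above, the sketch below creates one value of each. The variable names and values are made up for this example.

```python
age = 44                                     # int
salary = 72000.0                             # float
purchased = False                            # bool
country = "France"                           # str
countries = ["France", "Spain", "Germany"]   # list: ordered and mutable
point = (44, 72000.0)                        # tuple: ordered and immutable
row = {"Country": "France", "Age": 44}       # dict: maps keys to values

# type() reports the type of any value.
for value in (age, salary, purchased, country, countries, point, row):
    print(type(value).__name__, value)

# Lists and tuples are indexed by position, dictionaries by key.
print(countries[0], point[1], row["Age"])
```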

#### Functions

In [4]:
def ml_function(): # How to define a function
    print("It is 2023")

ml_function() # How to call a function

It is 2023
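The function above takes no input and returns nothing. Functions can also accept parameters and hand back a result with `return`. The sketch below uses a hypothetical function, `monthly_salary`, that is not part of the course material.

```python
def monthly_salary(yearly_salary, months=12):
    """Return the salary per month; `months` defaults to 12."""
    return yearly_salary / months

print(monthly_salary(72000))      # uses the default of 12 months
print(monthly_salary(72000, 10))  # overrides the default
```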


## Python Libraries

Libraries are collections of modules, and modules are simply files of Python code: functions, classes, constants, etc. Below is a simple example.

In [5]:
import math

print(math.pi)

3.141592653589793


How does this work?

When we write `import math`, Python loads the `math` module and makes it available in our program under the name `math`. We then access `math.pi`, which is a constant defined in the module (note the lowercase: `math.PI` does not exist). You can always learn more about any module by referring to the [documentation](https://docs.python.org/3/library/math.html?highlight=math#module-math).
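There are a few common forms of the import statement. The sketch below shows three of them using only the standard `math` module; the alias `m` is an arbitrary choice for illustration.

```python
import math              # binds the whole module to the name `math`
from math import sqrt    # binds a single function directly
import math as m         # binds the module to a shorter alias

print(math.pi)           # constant accessed through the module name
print(sqrt(16))          # no prefix needed after `from ... import`
print(m.floor(2.7))      # same module, reached through the alias
```

The alias form is the one used for pandas, which is conventionally imported as `pd`.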

The libraries that matter most to us are those related to machine learning. We will start by seeing how to handle data with the help of a library known as pandas.

### Pandas

[Pandas Docs](https://pandas.pydata.org/docs/). The dataset we will be using for our first foray into pandas is the [Insurance Targets List](https://drive.google.com/file/d/1FG6-KJxEZ7j2h3_0Ee04VwzYK3KMscjG/view?usp=sharing).

In [6]:
import pandas as pd

data = pd.read_csv('data.csv')
data.head() # first 5 rows
# data.tail() # last 5 rows

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [7]:
data.shape

(10, 4)

In [8]:
data.columns

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')
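If you do not have `data.csv` at hand, the same inspection calls can be tried on a small in-memory CSV. This is a sketch under that assumption: the three rows are copied from the output above, and `io.StringIO` stands in for the file.

```python
import io
import pandas as pd

# A few rows in CSV form, standing in for data.csv.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))        # head(n) returns the first n rows
print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names as a plain list
```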

#### Data Indexing

Pandas offers two main ways to index data:

 * `.loc` - Label Based Indexing
 * `.iloc` - Integer Based Indexing

##### `.loc`

 * `.loc` interprets the values provided as index labels, which may be strings, integers, or other types.
 * It's often used for conditional (boolean) indexing.
 * Slices include the end label.
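Label-based indexing is easiest to see when the index is not numeric. The sketch below builds a small hypothetical frame with string labels; the names `people`, `ana`, `ben` and `cara` are invented for illustration.

```python
import pandas as pd

# Hypothetical three-row frame, indexed by string labels instead of 0, 1, 2.
people = pd.DataFrame(
    {"Age": [44, 27, 30], "Salary": [72000, 48000, 54000]},
    index=["ana", "ben", "cara"],
)

# Label slice: both endpoints, "ana" and "ben", are included.
subset = people.loc["ana":"ben"]
print(subset)

# A single cell is addressed by row label and column label.
print(people.loc["cara", "Salary"])
```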


In [9]:
data.loc[2:8] # 2 and 8 are read as labels; with the default numeric index this looks like iloc, except that row 8 is included

Unnamed: 0,Country,Age,Salary,Purchased
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No


In [10]:
data.loc[data['Salary'] > 70000] # loc accepts a boolean condition

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No


In [11]:
data.loc[(data['Purchased'].str.contains('No')) & (data['Salary'] > 70000)] # Conditions can be combined with & (and) and | (or)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
8,Germany,50.0,83000.0,No
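The conditional selections above work because comparing a column to a value produces a boolean Series, one `True` or `False` per row, and `.loc` keeps the rows marked `True`. A minimal sketch on a small made-up frame (the name `df` and its three rows are chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Salary": [72000.0, 48000.0, 83000.0],
    "Purchased": ["No", "Yes", "No"],
})

high_salary = df["Salary"] > 70000        # boolean Series, one entry per row
not_purchased = df["Purchased"] == "No"

print(high_salary.tolist())

# & means "and", | means "or". Each comparison needs its own parentheses
# when written inline, because & and | bind tighter than > and ==.
both = df.loc[high_salary & not_purchased]
either = df.loc[high_salary | (df["Country"] == "Spain")]

print(both)
print(either)
```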


##### `.iloc`

 * `.iloc` (short for "integer location") is an indexer, used with square brackets just like `.loc`.
 * It treats all input as integer positions, counted from 0.
 * In contrast to `.loc`, slices exclude the end position.
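The difference between labels and positions shows up as soon as the index does not start at 0. The sketch below uses a hypothetical frame whose index labels are 1 to 4, so the same slice `1:3` selects different rows with each indexer.

```python
import pandas as pd

# Hypothetical frame: labels 1, 2, 3, 4 sit at positions 0, 1, 2, 3.
df = pd.DataFrame(
    {"Salary": [72000, 48000, 54000, 61000]},
    index=[1, 2, 3, 4],
)

by_position = df.iloc[1:3]  # positions 1 and 2; position 3 is excluded
by_label = df.loc[1:3]      # labels 1, 2 and 3; the end label is included

print(by_position)
print(by_label)
```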

In [12]:
data.iloc[2:8]

Unnamed: 0,Country,Age,Salary,Purchased
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes


In [13]:
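data.iloc[2:8, 2:5] # rows at positions 2-7, columns from position 2 onward; a slice end past the last column is allowed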

Unnamed: 0,Salary,Purchased
2,54000.0,No
3,61000.0,No
4,,Yes
5,58000.0,Yes
6,52000.0,No
7,79000.0,Yes


In [14]:
X = data.iloc[:, 1:].values # all rows, every column except the first, as a NumPy array
print(X)

[[44.0 72000.0 'No']
 [27.0 48000.0 'Yes']
 [30.0 54000.0 'No']
 [38.0 61000.0 'No']
 [40.0 nan 'Yes']
 [35.0 58000.0 'Yes']
 [nan 52000.0 'No']
 [48.0 79000.0 'Yes']
 [50.0 83000.0 'No']
 [37.0 67000.0 'Yes']]
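The printed result is a NumPy array rather than a DataFrame. Because the selected columns mix numbers and strings, the array has dtype `object`. The sketch below checks this on a small made-up frame, and uses `.to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical two-row frame with one text column and one numeric column kept.
df = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Purchased": ["No", "Yes"],
})

X = df.iloc[:, 1:].to_numpy()  # drop the first column, convert to an array

print(type(X).__name__)  # the container is a NumPy ndarray
print(X.dtype)           # mixed numeric and string columns give dtype object
print(X.shape)           # (rows, columns)
```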
