# Handling Missing Values 
### What is a Missing Value?
Missing data are values that are not stored (or not present) for some variables in the given dataset.
### How are missing values represented in the dataset?
- In the raw dataset, a blank field indicates a missing value.
- In Pandas, missing values are usually represented by NaN.
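
As a small illustration of the two points above, the sketch below reads a made-up CSV string (the names and salaries are hypothetical, not taken from any dataset in this notebook) and shows that a blank field comes back from Pandas as NaN, which `isna()` can then count.

```python
import io
import pandas as pd

# Hypothetical CSV text: the blank field in the second row is a missing salary.
csv_text = "name,salary\nAsha,270000\nRavi,\nMeera,250000\n"

toy_df = pd.read_csv(io.StringIO(csv_text))

# The blank field is parsed as NaN, so the salary column becomes float64.
print(toy_df)

# isna() flags missing cells; sum() counts them per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```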
### Why is Data Missing from the Dataset?
There can be multiple reasons why certain values are missing from the data.

The reason the data is missing affects how it should be handled, so it is necessary to understand why values could be missing.

Some of the reasons are listed below:
+ Past data might have been corrupted due to improper maintenance.
+ Observations were not recorded for certain fields, for example because of human error while recording the values.
+ The user intentionally did not provide the values.

### Why do we need to care about handling missing values?
It is important to handle missing values appropriately.
- Many machine learning algorithms fail if the dataset contains missing values. Some algorithms, such as K-nearest neighbors and Naive Bayes, can be made to work with missing values, depending on the implementation.
- If the missing values are not handled properly, you may end up building a biased machine learning model, which will lead to incorrect results.
- Missing data can lead to a lack of precision in the statistical analysis.
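
To make the first point concrete, here is a minimal sketch with made-up numbers. It assumes scikit-learn's `LinearRegression` as the example estimator, which rejects input containing NaN.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix (hypothetical values) with one missing entry.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and refuses NaN values.
    fit_failed = True
    print("Fit failed:", err)
```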

## REPLACING MISSING VALUES
### 1. Replacing the Missing Value with Mean
This is the most common method of imputing missing values in numeric columns. If the column contains outliers, the mean is not appropriate, because outliers pull it away from typical values. In such cases, the outliers need to be treated first.

You can use the 'fillna' method to impute the columns 'LoanAmount' and 'Credit_History' with the mean of the respective column.

#### Replacing the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())

train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
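
The effect of outliers on mean imputation can be seen in a small sketch. The loan amounts below are invented for illustration. The median is used here only as a point of comparison and is not part of the method above.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts: one missing value and one extreme outlier (5000).
loan = pd.Series([100.0, 120.0, 110.0, np.nan, 5000.0])

mean_value = loan.mean()      # NaN is skipped: (100 + 120 + 110 + 5000) / 4
median_value = loan.median()  # middle of 100, 110, 120, 5000

filled_mean = loan.fillna(mean_value)
filled_median = loan.fillna(median_value)

print("mean used for imputation:  ", mean_value)
print("median used for imputation:", median_value)
```

The mean-imputed value is far from the three typical amounts, which is why outliers should be treated before imputing with the mean.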

### 2. Replacing Missing Values with Mode
The mode is the most frequently occurring value. It is used for categorical features.

You can use the 'fillna' method for imputing the categorical columns 'Gender', 'Married', and 'Self_Employed'.

#### Replacing the missing values for categorical columns with mode

train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])

train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])

train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])

train_df.isnull().sum()

## Imputing Missing Values for Categorical Features
There are two ways to impute missing values for categorical features:

- Impute the most frequent value. We can use 'SimpleImputer' for this. Because the column is non-numeric, the mean and median strategies are not available, but the most frequent value and constant strategies are.
- Impute the value "missing", which treats missing entries as a separate category.
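
Both options can be sketched with `SimpleImputer` on a toy column. The values below are made up, and the array is given `dtype=object` so that strings and NaN can sit side by side.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column (hypothetical values) with one missing entry.
gender = np.array([["M"], ["F"], [np.nan], ["M"]], dtype=object)

# Option 1: replace missing entries with the most frequent category.
most_frequent = SimpleImputer(strategy="most_frequent")
filled_mode = most_frequent.fit_transform(gender)

# Option 2: replace missing entries with the constant label "missing",
# which then behaves like a separate category.
constant = SimpleImputer(strategy="constant", fill_value="missing")
filled_constant = constant.fit_transform(gender)

print(filled_mode.ravel())
print(filled_constant.ravel())
```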

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('Placement_Data_Full_Class.csv')

In [5]:
df.head()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


In [7]:
df['salary'] = df['salary'].fillna(df['salary'].mode()[0]) 

In [8]:
df.isnull().sum()

sl_no             0
gender            0
ssc_p             0
ssc_b             0
hsc_p             0
hsc_b             0
hsc_s             0
degree_p          0
degree_t          0
workex            0
etest_p           0
specialisation    0
mba_p             0
status            0
salary            0
dtype: int64

In [9]:
df.head()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,300000.0
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


In [10]:
import numpy as np
#Importing the SimpleImputer class
from sklearn.impute import SimpleImputer

In [11]:
# Imputer object using the mean strategy and 
# Missing values type for imputation
imputer = SimpleImputer(missing_values = np.nan,strategy = 'mean')

data = [[12, np.nan, 36], [10, 12, np.nan], [np.nan, 11, 20]]
print("Original Data : \n", data)

Original Data : 
 [[12, nan, 36], [10, 12, nan], [nan, 11, 20]]


In [14]:
imputer = imputer.fit(data)

# Imputing the data
data = imputer.transform(data)

print("Imputed Data : \n", data)

Imputed Data : 
 [[12.  11.5 36. ]
 [10.  12.  28. ]
 [11.  11.  20. ]]
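
A common follow-up, sketched here under the assumption that the data is split into training and test sets (the `test` row is invented for illustration), is to fit the imputer on the training data only and reuse the learned column means on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy matrix as above, treated as training data.
train = np.array([[12, np.nan, 36], [10, 12, np.nan], [np.nan, 11, 20]])
# Hypothetical unseen row with two missing entries.
test = np.array([[np.nan, 9, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(train)

# statistics_ holds the per-column means learned from the training data.
print(imputer.statistics_)

# The test row is filled with the training means, not its own.
test_filled = imputer.transform(test)
print(test_filled)
```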


# Feature Encoding
### What are Encoding Techniques?
In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values. Since machine learning is based on mathematical equations, keeping categorical variables as they are causes a problem. Some algorithms support categorical values without further manipulation, but even in those cases, whether to encode the variables is still a topic of discussion. For algorithms that do not support categorical values, encoding is the only option.

## Encoding Methodologies

### Nominal Encoding: where the order of the data does not matter
Nominal encoding includes several techniques:
- One Hot Encoding
- One Hot Encoding with many categories
- Mean Encoding
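
Mean encoding is only named in the list above, so the following is a minimal sketch of the usual idea rather than a definitive recipe: each category is replaced by the mean of the target for that category. The `city` and `target` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: a nominal feature and a binary target.
toy = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Mean of the target within each category.
city_means = toy.groupby("city")["target"].mean()

# Replace each category by its target mean.
toy["city_mean_enc"] = toy["city"].map(city_means)
print(toy)
```

In practice the means would normally be computed on the training data only, so that information from the target does not leak into validation data.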

### Ordinal Encoding: where the order of the data matters
Ordinal encoding also includes several techniques:
- Label Encoding
- Target Guided Ordinal Encoding
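
Target guided ordinal encoding is also only named above. One common reading, sketched here with hypothetical `city` and `target` columns, is to rank the categories by their mean target value and use the rank as the code.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target.
toy = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Order the categories from lowest to highest mean target.
ordered = toy.groupby("city")["target"].mean().sort_values().index

# Assign ranks 0, 1, 2, ... in that order.
rank_map = {category: rank for rank, category in enumerate(ordered)}
toy["city_ordinal"] = toy["city"].map(rank_map)

print(rank_map)
```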

#### One Hot Encoding
In this method, we map each category to a vector of 1s and 0s that denote the presence or absence of that category. The number of vectors depends on the categories we want to keep. For high-cardinality features, this method produces many columns, which slows down learning significantly. One hot encoding and dummy encoding are often confused. They are much alike, except that one hot encoding produces as many columns as there are categories, while dummy encoding produces one fewer. The choice between them should ultimately be made by the modeler during the validation process.
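
The difference between one hot encoding and dummy encoding can be seen with `pd.get_dummies` and its `drop_first` parameter. The toy frame below is made up and reuses category names like those in the airport data.

```python
import pandas as pd

# Toy column with three distinct categories (hypothetical rows).
toy = pd.DataFrame(
    {"type": ["small_airport", "heliport", "large_airport", "heliport"]}
)

# One hot encoding: one column per category.
one_hot = pd.get_dummies(toy, columns=["type"])

# Dummy encoding: the first category (in sorted order) is dropped.
dummy = pd.get_dummies(toy, columns=["type"], drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```

A row whose remaining dummy columns are all 0 (or False) belongs to the dropped category, so no information is lost.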

In [24]:
#Example of one hot encoder
df = pd.read_csv('airport.csv')

In [25]:
df.head()

Unnamed: 0,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,year_built,year_tier,Closed,home_link,wikipedia_link,keywords,score
0,26555,VIDP,large_airport,Indira Gandhi International Airport,28.5665,77.103104,777.0,IN-DL,New Delhi,1,VIDP,DEL,,1930.0,1950.0,n,http://www.newdelhiairport.in/,http://en.wikipedia.org/wiki/Indira_Gandhi_Int...,Palam Air Force Station,51475
1,26434,VABB,large_airport,Chhatrapati Shivaji International Airport,19.088699,72.867897,39.0,IN-MM,Mumbai,1,VABB,BOM,,1942.0,1950.0,n,http://www.csia.in/,http://en.wikipedia.org/wiki/Chhatrapati_Shiva...,"Bombay, Sahar International Airport",1014475
2,26618,VOMM,large_airport,Chennai International Airport,12.990005,80.169296,52.0,IN-TN,Chennai,1,VOMM,MAA,,1910.0,1950.0,n,,http://en.wikipedia.org/wiki/Chennai_Internati...,,51150
3,35145,VOBL,large_airport,Kempegowda International Airport,13.1979,77.706299,3000.0,IN-KA,Bangalore,1,VOBL,BLR,,2008.0,2010.0,n,http://www.bengaluruairport.com/home/home.jspx,https://en.wikipedia.org/wiki/Kempegowda_Inter...,,51200
4,26444,VAGO,large_airport,Goa International Airport,15.3808,73.831398,150.0,IN-GA,Vasco da Gama,1,VOGO,GOI,,1955.0,1955.0,n,,http://en.wikipedia.org/wiki/Dabolim_Airport,"Goa Airport, Dabolim Navy Airbase, ________隷怄_...",875


In [26]:
#Taking the categorical column type as it has textual data inside the column
df['type'].value_counts()
#The below values are the different types of values that are present inside the type column of the data set 

type
small_airport     157
medium_airport    101
heliport           41
closed             27
large_airport      11
Name: count, dtype: int64

In [28]:
# Create dummy (one hot) columns for the 'type' column
pd.get_dummies(df,columns=['type'])

Unnamed: 0,id,ident,name,latitude_deg,longitude_deg,elevation_ft,iso_region,municipality,scheduled_service,gps_code,...,Closed,home_link,wikipedia_link,keywords,score,type_closed,type_heliport,type_large_airport,type_medium_airport,type_small_airport
0,26555,VIDP,Indira Gandhi International Airport,28.566500,77.103104,777.0,IN-DL,New Delhi,1,VIDP,...,n,http://www.newdelhiairport.in/,http://en.wikipedia.org/wiki/Indira_Gandhi_Int...,Palam Air Force Station,51475,False,False,True,False,False
1,26434,VABB,Chhatrapati Shivaji International Airport,19.088699,72.867897,39.0,IN-MM,Mumbai,1,VABB,...,n,http://www.csia.in/,http://en.wikipedia.org/wiki/Chhatrapati_Shiva...,"Bombay, Sahar International Airport",1014475,False,False,True,False,False
2,26618,VOMM,Chennai International Airport,12.990005,80.169296,52.0,IN-TN,Chennai,1,VOMM,...,n,,http://en.wikipedia.org/wiki/Chennai_Internati...,,51150,False,False,True,False,False
3,35145,VOBL,Kempegowda International Airport,13.197900,77.706299,3000.0,IN-KA,Bangalore,1,VOBL,...,n,http://www.bengaluruairport.com/home/home.jspx,https://en.wikipedia.org/wiki/Kempegowda_Inter...,,51200,False,False,True,False,False
4,26444,VAGO,Goa International Airport,15.380800,73.831398,150.0,IN-GA,Vasco da Gama,1,VOGO,...,n,,http://en.wikipedia.org/wiki/Dabolim_Airport,"Goa Airport, Dabolim Navy Airbase, ________隷怄_...",875,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,46625,IN-0062,Shirpur Airport,21.323999,74.956734,,IN-MM,Shirpur,0,,...,n,,,"Shirpur, Dhule",0,False,False,False,False,True
333,46640,IN-0064,Umaria Air Field,23.532514,80.808220,1510.0,IN-MP,Umaria,0,,...,n,,,,0,False,False,False,False,True
334,46561,IN-0038,Upper Tadong Indian Army Helipad,27.306717,88.599650,,IN-SK,Upper Tadong,0,,...,n,,,,0,False,True,False,False,False
335,319166,IN-0101,Vanasthali Airport,26.407627,75.870128,,IN-RJ,Vanasthali,0,,...,n,,,,0,False,False,False,False,True


#### Label Encoding
In this encoding, each category is assigned an integer from 0 through N-1, where N is the number of categories of the feature. The result may look like an ordering (Car < Bus < Truck, encoded as 0 < 1 < 2), even when the categories have no natural order. Categories that are tied or close to each other lose some information after encoding.
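
One detail worth checking is which integer each category receives. The sketch below uses made-up vehicle labels to show that scikit-learn's `LabelEncoder` numbers the classes in sorted order. The `size_order` dictionary is a hypothetical hand-written mapping, shown as one way to impose a specific order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels.
vehicles = ["Car", "Bus", "Truck", "Bus"]

le = LabelEncoder()
codes = le.fit_transform(vehicles)

# classes_ is sorted, so "Bus" gets 0, "Car" gets 1, "Truck" gets 2.
print(list(le.classes_))
print(codes)

# If a specific order such as Car < Bus < Truck is intended,
# an explicit mapping makes that order unambiguous.
size_order = {"Car": 0, "Bus": 1, "Truck": 2}
ordinal_codes = [size_order[v] for v in vehicles]
print(ordinal_codes)
```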

In [34]:
df1 = pd.read_csv('Placement_Data_Full_Class.csv')

In [38]:
df1.head()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


In [35]:
from sklearn.preprocessing import LabelEncoder

In [39]:
le = LabelEncoder()

In [41]:
df1['status'] = le.fit_transform(df1['status'])

In [42]:
df1.head()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,1,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,1,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,1,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,0,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,1,425000.0


In [43]:
df1.tail()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
210,211,M,80.6,Others,82.0,Others,Commerce,77.6,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,1,400000.0
211,212,M,58.0,Others,60.0,Others,Science,72.0,Sci&Tech,No,74.0,Mkt&Fin,53.62,1,275000.0
212,213,M,67.0,Others,67.0,Others,Commerce,73.0,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,1,295000.0
213,214,F,74.0,Others,66.0,Others,Commerce,58.0,Comm&Mgmt,No,70.0,Mkt&HR,60.23,1,204000.0
214,215,M,62.0,Central,58.0,Others,Science,53.0,Comm&Mgmt,No,89.0,Mkt&HR,60.22,0,


# Feature Scaling
Feature scaling is a technique to bring the independent features present in the data into a fixed range. It is performed during data pre-processing.

The most common techniques of feature scaling are Normalization and Standardization.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df = pd.DataFrame({
    'Weight': [15,18,12,10],
    'Price': [1,3,2,5]},
    index = ['Orange', 'Apple', 'Banana', 'Grape'])
print(df)

        Weight  Price
Orange      15      1
Apple       18      3
Banana      12      2
Grape       10      5


# 1) Min-Max Scaler
$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

The Min-Max scaler transforms features by scaling each feature to a given range. It scales and translates each feature individually, based on the minimum and maximum seen in the training set, so that the feature falls within the given range. By default the range is [0, 1]. A different range, such as [0, 5] or [-1, 1], can be set with the `feature_range` parameter.
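
As a quick check, the sketch below applies the min-max formula by hand to the same fruit data and compares the result with `MinMaxScaler`. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

fruit = pd.DataFrame(
    {"Weight": [15, 18, 12, 10], "Price": [1, 3, 2, 5]},
    index=["Orange", "Apple", "Banana", "Grape"],
)

# Manual min-max scaling, column by column: (X - Xmin) / (Xmax - Xmin).
manual = (fruit - fruit.min()) / (fruit.max() - fruit.min())

# The same result from scikit-learn, using the default range [0, 1].
scaled = MinMaxScaler().fit_transform(fruit)

# A custom target range is set with feature_range.
scaled_0_5 = MinMaxScaler(feature_range=(0, 5)).fit_transform(fruit)

print(manual)
print(np.allclose(manual.to_numpy(), scaled))
```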

In [4]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [8]:
df1 = pd.DataFrame(scaler.fit_transform(df),
                   columns=['Weight','Price'],
                   index=['Orange','Apple','Banana','Grape'])

In [12]:
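# Plot the original data first so that 'ax' exists, then overlay the scaled data
ax = df.plot.scatter(x='Weight', y='Price', color=['red','green','blue','yellow'],
                     marker = '*', s=80, label='BEFORE SCALING')
df1.plot.scatter(x='Weight', y='Price', color=['red','green','blue','yellow'],
                 marker = 'o', s=60, label='AFTER SCALING', ax=ax)
plt.axhline(0, color='red',alpha=0.2)
plt.axvline(0, color='red',alpha=0.2)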

# 2) Standard Scaler
$$X_{new} = \frac{X - \mu}{\sigma}$$

Here $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.

The standard scaler assumes that the data within each feature is normally distributed and scales each feature so that its distribution is centered around 0 with a standard deviation of 1.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. If the data is not normally distributed, this may not be the best scaler to use.
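
The formula can be verified by hand on the same fruit data. One detail in this sketch is that `StandardScaler` uses the population standard deviation, so the Pandas call needs `ddof=0` to match. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

fruit = pd.DataFrame(
    {"Weight": [15, 18, 12, 10], "Price": [1, 3, 2, 5]},
    index=["Orange", "Apple", "Banana", "Grape"],
)

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0; Pandas defaults to ddof=1).
manual = (fruit - fruit.mean()) / fruit.std(ddof=0)

scaled = StandardScaler().fit_transform(fruit)

print(np.allclose(manual.to_numpy(), scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```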

In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [16]:
df2 = pd.DataFrame(scaler.fit_transform(df),
                   columns = ['Weight', 'Price'],
                   index = ['Orange', 'Apple', 'Banana', 'Grape'])
ax = df.plot.scatter(x='Weight', y='Price', color=['red','green','blue','yellow'],
                     marker = '*',s=80, label='BEFORE SCALING');
df2.plot.scatter(x='Weight', y='Price', color=['red','green','blue','yellow'],
                 marker = 'o',s=60,label='AFTER SCALING', ax = ax)
plt.axhline(0, color='red',alpha=0.2)
plt.axvline(0, color='red',alpha=0.2);