# Data Normalization
"""Def: Normalization is the process of organizing data in a database. It includes
creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible
by eliminating redundancy and inconsistent dependency.

Source: Microsoft"""

# How to deal with missing values? - 5 examples

## 1. Deleting Rows That Have Null Values

df.dropna()

"""Pros:
Removing every record with missing values leaves a complete dataset, which can yield a robust and accurate model when only a few rows are affected
Deleting a row or column that carries no specific information costs little, since it has low weight in the model
Cons:
Loss of information and data
Works poorly if the percentage of missing values is high (say 30%), compared to the whole dataset"""
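
A minimal sketch of the deletion options, assuming a small made-up dataframe (`df_demo` and its column names are illustrative, not from the original data). It checks the share of missing values first, then shows three variants of `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df_demo = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Oslo", "Rome", None, "Lima"],
    "score": [0.5, 0.7, np.nan, 0.9],
})

# Share of missing values per column, to judge whether dropping is affordable.
missing_share = df_demo.isnull().mean()

# Drop every row that contains at least one null value.
rows_dropped = df_demo.dropna()

# Drop only the rows where one specific column is null.
subset_dropped = df_demo.dropna(subset=["age"])

# Keep only the rows that have at least two non-null values.
thresh_dropped = df_demo.dropna(thresh=2)

print(missing_share)
print(len(df_demo), len(rows_dropped), len(subset_dropped), len(thresh_dropped))
```

The `subset` and `thresh` variants are gentler than a plain `dropna()` and may be worth trying when deleting all incomplete rows would remove too much data.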

## 2. Replacing With Mean/Median/Mode

df.fillna(df.mean(numeric_only=True))

df.fillna(df.median(numeric_only=True))

df[...] = df[...].fillna(df[...].mode()[0])

""" Pros:
This is a better approach when the data size is small
It prevents the data loss that comes from removing rows and columns
Cons:
Imputing approximations can bias estimates and distort the variance of the data
Usually works poorly compared to multiple-imputation methods"""
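
A short sketch of the three replacements on a made-up dataframe (`df_demo` and its columns are assumptions for illustration). Mean and median are applied to the numeric columns only, and the mode is applied to a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df_demo = pd.DataFrame({
    "height": [150.0, np.nan, 170.0, 190.0],
    "weight": [50.0, 60.0, np.nan, 100.0],
    "colour": ["red", None, "red", "blue"],
})

# Mean and median are only defined for numeric columns.
num_cols = df_demo.select_dtypes(include="number").columns

# Fill each numeric column with its own mean.
mean_filled = df_demo.copy()
mean_filled[num_cols] = mean_filled[num_cols].fillna(mean_filled[num_cols].mean())

# Fill each numeric column with its own median.
median_filled = df_demo.copy()
median_filled[num_cols] = median_filled[num_cols].fillna(median_filled[num_cols].median())

# mode() returns a Series (there can be ties), so take its first entry.
mode_filled = df_demo.copy()
mode_filled["colour"] = mode_filled["colour"].fillna(mode_filled["colour"].mode()[0])

print(mean_filled)
print(median_filled)
print(mode_filled)
```

The median is often the safer choice for skewed numeric columns, since the mean is pulled toward outliers.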

## 3. Assigning a Unique Category

df[...] = df[...].fillna('U')

"""Pros:
Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since it is categorical
Negates the loss of data by adding a unique category
Cons:
Adds less variance
Adds another feature to the model while encoding, which may result in poor performance"""
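
A small sketch of the extra feature mentioned above, assuming a made-up categorical column named `cabin` (illustrative only). After the nulls are filled with `'U'`, one-hot encoding produces one additional column for the new category.

```python
import pandas as pd

# Hypothetical categorical column with missing entries.
df_demo = pd.DataFrame({"cabin": ["A", None, "B", None, "A"]})

# Treat "missing" as its own category, labelled 'U' (unknown).
df_demo["cabin"] = df_demo["cabin"].fillna("U")

# One-hot encode; the 'U' category becomes an additional indicator column.
encoded = pd.get_dummies(df_demo["cabin"], prefix="cabin")

print(encoded.columns.tolist())
print(encoded)
```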

## 4. Predicting The Missing Values

### Step 1: Separate the rows with null values from the dataframe (df) and store them in a variable "test data"

test_data = df[df['C'].isnull()].copy()

### Step 2: Drop the null values from the dataframe (df) and represent as 'train data'

df.dropna(inplace = True)

### Step 3: Create "x_train" & "y_train" from train data

x_train = df.drop('...', axis = 1) 
x_train
y_train = df['']
y_train

### Step 4: Build the linear regression model

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)

### Step 5: Create the x_test from test data

x_test = test_data.drop('...', axis = 1)
x_test

### Step 6: Apply the model on x_test of test data to make predictions

y_pred = lr.predict(x_test)
y_pred

### Step 7: Replacing the missing values with predicted values

test_data['y_pred'] = y_pred
test_data

"""Pros:
Imputing the missing variable is an improvement as long as the bias it introduces is smaller than the omitted-variable bias
Can yield unbiased estimates of the model parameters
Cons:
Bias also arises when an incomplete conditioning set is used for a categorical variable
The predictions are only a proxy for the true values"""
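
The seven steps above can be put together as one runnable sketch. Everything here is an assumption for illustration: the dataframe `df_demo` is synthetic, the feature columns `A` and `B` are made up, and the target `C` is an exact linear function of them so the result is easy to check. Real data will not be this clean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: C depends linearly on A and B (no noise, for illustration).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"A": rng.normal(size=50), "B": rng.normal(size=50)})
df_demo["C"] = 2.0 * df_demo["A"] - 1.0 * df_demo["B"] + 3.0

# Knock out a few target values to simulate missing data.
df_demo.loc[[3, 10, 27], "C"] = np.nan

# Steps 1-3: split rows by whether C is missing, then pick features and target.
missing = df_demo["C"].isnull()
features = ["A", "B"]
x_train = df_demo.loc[~missing, features]
y_train = df_demo.loc[~missing, "C"]

# Step 4: fit the regression on the complete rows.
lr = LinearRegression()
lr.fit(x_train, y_train)

# Steps 5-6: predict C for the rows where it is missing.
x_test = df_demo.loc[missing, features]
y_pred = lr.predict(x_test)

# Step 7: write the predictions back into the original column.
df_demo.loc[missing, "C"] = y_pred

print(df_demo.loc[[3, 10, 27]])
```

Selecting rows with a boolean mask instead of calling `dropna` keeps the original dataframe intact, so the predictions can be written straight back into column `C`.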


## 5. Using Algorithms That Support Missing Values

