## 1. The absence of the data are termed as missing values.  If missing values are not handled properly, they may lead to unreliable or biased model.  Decision trees, random forest techniques splits the missing values into an another category. They work by taking available data into consideration without imputing or deleting the missing values.  Other algorithms not affected by missing values are K-Nearest Neighbours, SVM, Gaussian Mixture Model,

## 2. The missing data can be handled by:
### i) Imputation technique
- **Mean imputation** :Replace the missing data with the mean of the existing values
- **Mode imputation** :Replace the missing data with the mode of the existing values
- **Median imputation**: Replace the missing data with the median of the existing values

### ii) Dropping missing values
- **Removing missing values** isn't a reliable approach since it leads to data loss




## Imputation techniques

In [23]:
import pandas as  pd
df = pd.read_csv('Bengaluru_House_Data.csv')  ## https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data?select=Bengaluru_House_Data.csv
df.head(3)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0


In [25]:
df.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

In [22]:
df.isnull().sum()

## The missing values in this dataset are:
# Society - 5502   string
# balcony - 609    float
# bath  -  73      float
# size -  16       string

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

> **Mean and Median imputations are best suitable for numeric data**

> **Categorical data uses Mode imputation, since mean cannot be calculated for non-numeric values**


### Mean Imputation on balcony


In [34]:
df['balcony'].value_counts()

2.0    5113
1.0    4897
3.0    1672
0.0    1029
Name: balcony, dtype: int64

In [None]:
mean_balcony = df['balcony'].mean()
mean_balcony

In [35]:
df['balcony'] = df['balcony'].fillna(round(mean_balcony,2))

In [36]:
# df

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.00,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.00,120.00
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.00,62.00
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.00,95.00
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.00,51.00
...,...,...,...,...,...,...,...,...,...
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.00,231.00
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,1.58,400.00
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.00,60.00
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.00,488.00


### Median Imputation on bath


In [38]:
df['bath'].value_counts()

2.0     6908
3.0     3286
4.0     1226
1.0      788
5.0      524
6.0      273
7.0      102
8.0       64
9.0       43
10.0      13
12.0       7
13.0       3
11.0       3
16.0       2
27.0       1
40.0       1
15.0       1
14.0       1
18.0       1
Name: bath, dtype: int64

In [49]:
median_bath = df['bath'].median()
median_bath

2.0

In [44]:
df['bath'] = df['bath'].fillna(median_bath)

In [45]:
# df

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.00,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.00,120.00
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.00,62.00
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.00,95.00
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.00,51.00
...,...,...,...,...,...,...,...,...,...
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.00,231.00
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,1.58,400.00
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.00,60.00
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.00,488.00


### Mode Imputation on size


In [47]:
df['size'].value_counts()

2 BHK         5199
3 BHK         4310
4 Bedroom      826
4 BHK          591
3 Bedroom      547
1 BHK          538
2 Bedroom      329
5 Bedroom      297
6 Bedroom      191
1 Bedroom      105
8 Bedroom       84
7 Bedroom       83
5 BHK           59
9 Bedroom       46
6 BHK           30
7 BHK           17
1 RK            13
10 Bedroom      12
9 BHK            8
8 BHK            5
11 BHK           2
11 Bedroom       2
10 BHK           2
14 BHK           1
13 BHK           1
12 Bedroom       1
27 BHK           1
43 Bedroom       1
16 BHK           1
19 BHK           1
18 Bedroom       1
Name: size, dtype: int64

In [51]:
mode_bedroom_size = df['size'].mode()
mode_bedroom_size

0    2 BHK
Name: size, dtype: object

In [52]:
df['size'] = df['size'].fillna(mode_bedroom_size)

In [53]:
# df

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.00,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.00,120.00
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.00,62.00
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.00,95.00
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.00,51.00
...,...,...,...,...,...,...,...,...,...
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.00,231.00
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,1.58,400.00
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.00,60.00
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.00,488.00


## Dropping null values

In [54]:
df.shape
## There are 13,320 rows initially

(13320, 9)

In [55]:
## calculating total null values
df.isnull().sum().sum()

5519

In [58]:
df.dropna().shape


(7804, 9)

- **Dropping all null values from dataframe will result in loss of 5516 rows, which is not reliable**

## 3. When there is a significant difference among the number of outputs of a dataset, then we say the dataset is imbalanced. For example there are 3 categorical ouputs in a dataset, If there are 1200 data samples of class-1, 800 of class-2 and 900 of class-3, then this dataset is imbalanced since class-1 has significantly larger number of data points. 
## If the imbalanced data isn't handled using required techniques then the model becomes biased. In the above example, the model becomes biased to the class-1.



## 4. Up-Sampling is the technique increasing the size of minority class (i.e which has less number of data points). In contrary Down-Sampling is removal of data points from the majority class. Both the techniques are performed to get the balanced data
## When a dataset has 1000 data samples belonging to students who passed the test and 500 samples  of students who failed . Then we can perform either up-scaling or down-scaling as follows:
- ## Up-Scaling 'Fail' category by increasing data points and make them equal to 1000
-  ## Down-Scaling 'Pass' category by reducing the data samples and make it equal to 500

## 5. Data Agumentation is the method of  increasing the size of the dataset by adding data points artificially. Synthetic Minority Over-sampling Technique (SMOTE) is a data augumentation technique, used to increase the size of the minority class by synthetically adding the data points to make a balanced dataset. It uses nearest neighbours to add the data points. In the up-sampling technique the existing data points are repeated to increase the size of minority class, whereas in smote new points are generated artificially by some techniques in the minority class. 

## 6. Outliers generally are the data points that are far away from the majority of the data points or the central tendancy of the data. Ouliers make the model to predict inaccurately, biased. Hence, it is necessary to handle outliers.

## 7. There are few techniques to handle missing values
### Deletion: 
  - ### Though this method is not the best method always, we can use this method if the amount of missing values is relatively less
  
### Imputation:
 - ### Filling the missing values with some predicted value is called imputation
 - ### If the data points are numerical, then we can use mean/median/mode imputation
 - ### Mode imputation is best suitable for the categorical data points
 - ### Mean imputation, median imputation, mode imputation involve filling the missing values with the mean, median, mode of the existing/available data respectively.
 - #### Regression imputation is another technique, where the missing values are predicted using regression models based on the available data.