# Module 4: Data preprocessing
- Data preprocessing is a crucial step in the data analysis pipeline.
- It involves cleaning and preparing raw data to make it suitable for further analysis or machine learning.
- The quality of your analysis depends on the quality of the preprocessed data.
- Here are some key steps in data preprocessing:
    - **Handling missing values:**
        - Identify and handle missing data using techniques like imputation or deletion, depending on the extent and nature of the missing values.
        - Impute (replace) missing values using simple methods like mean or median, or more advanced techniques.
    - **Data cleaning:**
        - Detect and correct errors or inconsistencies in the data, such as typos, duplicate records, and outliers.
        - Remove or correct data entries that are irrelevant, erroneous, or invalid.
    - **Data transformation:**
        - Encode categorical variables into a numerical format suitable for the analysis, using techniques like one-hot encoding or label encoding.
        - Apply binning or discretization to group continuous numeric features into bins or discrete intervals, if required.
    - **Data normalization:**
        - Normalize or standardize numerical features to bring them to a common scale.
        - Transform skewed data distributions using techniques like log transformation.
    - **Dealing with outliers:**
        - Identify and handle outliers that can adversely affect analysis results or model performance.
        - Consider whether outliers are genuine data points or errors before deciding to remove or transform them.   
    - **Feature selection:**
        - Select relevant features that contribute most to the analysis or model performance and remove less important ones.
        - Feature selection helps reduce dimensionality and computational complexity. 
    - **Feature engineering:**
        - Create new attributes through feature engineering or extraction, such as generating datetime features from timestamps.
        - Form new attributes that embody the information in existing attributes, using techniques like PCA, LDA etc.

- In this module, we will discuss some data preprocessing techniques.
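
As a quick illustration of the transformation and normalization steps listed above, here is a minimal sketch on a made-up table. The column names `fuel` and `km`, the values, and the bin edges are assumptions chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are hypothetical
df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "CNG"],
    "km": [1000.0, 20000.0, 50000.0, 250000.0],
})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df["fuel"], prefix="fuel")

# Binning: group the continuous values into three labelled intervals
df["km_bin"] = pd.cut(df["km"], bins=[0, 10000, 100000, np.inf],
                      labels=["low", "medium", "high"])

# Standardization: zero mean and unit (sample) standard deviation
df["km_std"] = (df["km"] - df["km"].mean()) / df["km"].std()

# Log transformation to compress a right-skewed distribution
df["km_log"] = np.log1p(df["km"])

print(encoded.columns.tolist())
print(df)
```

Which of these steps is appropriate depends on the analysis; the sketch only shows what each one does to the data.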

### Part 4.1.1  :  What to do with missing values? - Part I

## Treating Missing Data
- Dealing with missing values is an important step in data preprocessing and analysis, as missing data can lead to biased or unreliable results if not handled properly. 
- There are various strategies for treating missing values, depending on the nature of the data and the reason for the missingness. 
- Here are some common techniques:
    - **Deletion of Missing Data:**
        - Removing entire rows with missing values. This can lose valuable data, and it can bias the results if the missing values are not randomly distributed.
        - Deleting a column if a large share of its values is missing. However, "large" is relative: it depends on the business problem and the domain, and the decision is left to the discretion of the data analyst. Deleting a column removes an attribute from the dataset, so one must know how important that attribute is.
    - **Imputation Techniques:** Imputation is the technique to fill the missing value by some other relevant value.
        - Mean/Median Imputation: Replacing missing values with the mean or median of the non-missing values in the variable. This method assumes that the missing values are missing at random. When the attribute has many outliers, we might prefer median imputation, since the mean is more influenced by outliers. If there are not a lot of outliers, we can use mean imputation.
        - Mode Imputation: Replacing missing categorical values with the mode (most frequent category).
        - Interpolation: Linear or spline interpolation can be used for time-series data where missing values are interpolated based on neighboring points.
        - Regression Imputation: Predicting the missing values using regression models based on other variables.
        - K-Nearest Neighbors (KNN) Imputation: Replacing missing values with the values of the most similar (in terms of other attributes) data points.
        - Multiple Imputation: Creating multiple datasets where missing values are imputed differently in each dataset, then analyzing the results using standard methods.
    - **Domain Knowledge:**
        - Using domain expertise to make educated guesses and fill in missing values. 
        - This can be especially useful when the nature of the missing values is well understood.
- Keep in mind that the choice of missing value treatment depends on the characteristics of your data and the goals of your analysis. Careful consideration is necessary to avoid introducing bias or distorting the results.
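
A few of the simpler techniques above can be sketched with pandas as follows. The table is made up for illustration; the column names `price`, `colour` and `reading` are assumptions and do not come from the datasets used in this module.

```python
import numpy as np
import pandas as pd

# Toy data; all names and values are made up for illustration
df = pd.DataFrame({
    "price": [10.0, np.nan, 12.0, 100.0],   # 100.0 acts as an outlier
    "colour": ["red", "blue", None, "red"],
    "reading": [1.0, np.nan, 3.0, 4.0],     # ordered, e.g. a time series
})

# Deletion: keep only the rows that have no missing value
complete_rows = df.dropna()

# Median imputation: less affected by the outlier in "price" than the mean
price_filled = df["price"].fillna(df["price"].median())

# Mode imputation for a categorical column
colour_filled = df["colour"].fillna(df["colour"].mode()[0])

# Linear interpolation between neighbouring points
reading_filled = df["reading"].interpolate(method="linear")

print(len(complete_rows), "complete rows out of", len(df))
print(price_filled.tolist())
print(colour_filled.tolist())
print(reading_filled.tolist())
```

Regression, KNN and multiple imputation need a modelling library and are not shown here.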

### Part 4.1.2  :  What to do with missing values? - Part II

In [73]:
# import the required packages
import pandas as pd
import numpy as np
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

In [74]:
#read the data file
pos_data = pd.read_csv('POS_Data.csv')

In [75]:
# A quick look at the data
pos_data.head()   

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Unit_price,Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,,0,0.0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,,0,0.0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,,0,0.0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,,0,0.0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,,0,0.0


In [76]:
# how many rows and columns?
pos_data.shape

(31185, 11)

In [77]:
# find missing values in each column
pos_data.isna().sum()

SKU ID              0
Date                0
Manufacturer        0
Sector             50
Category           36
Segment            29
Brand              27
Revenue($)          0
Unit_price      11550
Units_sold          0
Page_traffic        0
dtype: int64

In [78]:
# check the proportion of missing data
round(pos_data.isna().sum()/pos_data.shape[0] * 100,2)

SKU ID           0.00
Date             0.00
Manufacturer     0.00
Sector           0.16
Category         0.12
Segment          0.09
Brand            0.09
Revenue($)       0.00
Unit_price      37.04
Units_sold       0.00
Page_traffic     0.00
dtype: float64

**Around 37% of data is missing from the column *Unit_price***

- We see that the following attributes have missing values:
    - Sector
    - Category
    - Segment
    - Brand
    - Unit_price

- The first four attributes are categorical variables and Unit_price is a numerical variable.
- Each of these variables needs a different strategy to deal with missing data.
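
One possible way to separate the affected columns by type programmatically is sketched below on a three-row toy frame. The frame only borrows column names from the POS data; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up
df = pd.DataFrame({
    "Sector": ["Oral Care", None, "Fabric Care"],
    "Unit_price": [2.5, np.nan, np.nan],
    "Units_sold": [10, 0, 0],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = (df.isna().mean() * 100).round(2)

# Keep only the columns that actually have missing values
cols_with_na = missing_pct[missing_pct > 0].index

# Split those columns into numerical and non-numerical ones
numeric_cols = df[cols_with_na].select_dtypes(include="number").columns.tolist()
categorical_cols = df[cols_with_na].select_dtypes(exclude="number").columns.tolist()

print(missing_pct)
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```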

#### Dealing with missing values in *Unit_price*
- We have discussed earlier in the course that *Unit_price* is a derived column. 
- That is, if we closely observe the values of *Unit_price*, *Revenue* and *Units_sold*, we can make out that 

$$ UnitPrice = \frac{Revenue}{UnitsSold} $$
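- So, wherever *Revenue* and *Units_sold* are not available (they are recorded as 0 in this dataset), *Unit_price* cannot be computed and is left empty, which pandas in turn treats as a missing value.
- Since the column can be recomputed from *Revenue* and *Units_sold* whenever it is needed, we can afford to delete the entire column.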

**NOTE** that, by deleting this column, we are actually doing two data preprocessing tasks:
- Dealing with missing values by removing the column, and 
- Feature selection by deciding that this attribute is not a relevant one for our analysis and hence reducing the dimensionality of the data.
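
Before dropping a column on the grounds that it is derived, it can be worth checking that claim. The following is a minimal sketch of such a check on made-up numbers; it assumes the same column names as the POS data but does not use the actual file.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are made up
df = pd.DataFrame({
    "Revenue($)": [0, 200, 450],
    "Units_sold": [0, 10, 30],
    "Unit_price": [np.nan, 20.0, 15.0],
})

# Recompute the price, treating zero units as "no price available"
recomputed = df["Revenue($)"] / df["Units_sold"].replace(0, np.nan)

# Check 1: wherever a price exists, it equals Revenue / Units_sold
is_derived = np.allclose(df["Unit_price"].dropna(), recomputed.dropna())

# Check 2: the price is missing exactly where no units were sold
missing_only_when_zero = (df["Unit_price"].isna() == (df["Units_sold"] == 0)).all()

print(is_derived, missing_only_when_zero)
```

If both checks hold on the real data, the column carries no information beyond the other two.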

In [79]:
#drop the column
pos_data = pos_data.drop(['Unit_price'], axis=1)
pos_data.shape 

(31185, 10)

### Part 4.1.3  :  What to do with missing values? - Part III

### Dealing with missing values in *Sector, Category, Segment, Brand*
- By now we know that there is a hierarchical relationship between these attributes in the same order of their appearance in the dataset. 
- So, we cannot treat these attributes independently and simply impute the missing values by mode of the column. 
- Hence, we must apply domain knowledge here and impute with appropriate value.
- For example, if a value is missing in *Sector*, we can check the corresponding value in *Category* and/or *Segment* to understand to which sector that particular product belongs.

**Some observations that will help in designing the logic for filling missing values in these columns:**
- We observed earlier in this course that the segment *Liquid* is listed under two different categories viz. *Laundry Detergents* and *Fabric Softeners*.
- So, if we write generalized code to search for segment name and then the corresponding category, there will be an ambiguity.
- When we go on to replace missing values in the *Brand* column, we will face similar problems.
- So, to achieve the right mapping, let us rename *Liquid* to *Laundry_Liquid* when the category is *Laundry Detergents* and to *Fabric_Liquid* when the category is *Fabric Softeners*.
- **There is a catch here** - There are a few records where Category is missing! How do we make out whether the segments in such records are to be filled with *Laundry_Liquid* or *Fabric_Liquid*?
- **Answer:** Take the help of *Brand*. We can observe from the dataset that *Tide* and *Gain* are the brand names under *Laundry Detergents* and *Downy* and *comfort* are the brands under *Fabric Softeners*. So, we will use this information as well while replacing the names of *Liquid*.
- **Problem is not solved yet!** We have 3 *Liquid* records where both *Category* and *Brand* are missing. We can now take the help of *Sector*: the segment whose value we are going to modify belongs to *Fabric Care*. So, we will take the mode of the category (which is *Laundry Detergents*) and replace the segment *Liquid* in these 3 rows with *Laundry_Liquid*.

**So let's give it a shot**

In [80]:
# how many of each unique segment do we have?
pos_data.Segment.value_counts()

Liquid                         7273
Powder                         6164
Anti-aging                     3406
Shampoo                        2545
Suncreens                      1676
Dryer Sheets                   1090
Conditioners                   1075
Kids Toothbrushes              1065
Manual Toothbrushes            1065
Electric Toothbrushes          1061
Pods                           1009
Fluoride-Free Toothpaste        648
Whitening Toothpaste            632
Sensitivity Toothpaste          623
Acne                            561
Alcohol-Free Mouthwash          433
Fluoride Mouthwash              426
Breath-Freshening Mouthwash     404
Name: Segment, dtype: int64

**Observe that Liquid shows up only once. That is, its count combines the records from two different categories, Laundry Detergents and Fabric Softeners.**

In [81]:
# find the mode of the category - this is required when category, segment, brand all are missing
pos_data.Category.mode()   

0    Laundry Detergents
Name: Category, dtype: object

In [82]:
# write a function to perform the replace operation
def Replace_Liquid_Segment(row):
    if row['Segment']=='Liquid':
        if (row['Category']=='Laundry Detergents' or row['Brand']=='Tide' or row['Brand']=='Gain') :
            return 'Laundry_Liquid'
        elif row['Category']=='Fabric Softeners' or row['Brand']=='Downy' or row['Brand']=='comfort':
            return "Fabric_Liquid"
        else:
            return 'Laundry_Liquid'        # using the modal value as the replacement when we run out of options 
    return row['Segment']

In [83]:
# apply the function to replace 
pos_data['Segment'] = pos_data.apply(Replace_Liquid_Segment, axis = 1)
pos_data.Segment.value_counts()

Powder                         6164
Fabric_Liquid                  4279
Anti-aging                     3406
Laundry_Liquid                 2994
Shampoo                        2545
Suncreens                      1676
Dryer Sheets                   1090
Conditioners                   1075
Kids Toothbrushes              1065
Manual Toothbrushes            1065
Electric Toothbrushes          1061
Pods                           1009
Fluoride-Free Toothpaste        648
Whitening Toothpaste            632
Sensitivity Toothpaste          623
Acne                            561
Alcohol-Free Mouthwash          433
Fluoride Mouthwash              426
Breath-Freshening Mouthwash     404
Name: Segment, dtype: int64

**We can now see that we have two new segment names Fabric_Liquid and Laundry_Liquid, instead of Liquid**

**Let us now list out the records/rows with missing values. This will later help us to observe these rows to view and confirm whether the correct value is filled or not.**

In [85]:
# check the records with missing values
missing_data = pos_data[pos_data.isnull().any(axis = 1)].index.tolist()
print('Total number of rows containing missing values:', len(missing_data))
print('The row numbers containing missing value:\n', missing_data)

Total number of rows containing missing values: 128
The row numbers containing missing value:
 [14, 16, 3223, 3229, 3234, 3236, 3238, 7280, 7281, 7282, 7286, 7287, 7335, 7336, 7337, 7351, 7352, 7354, 7355, 7357, 7360, 7361, 7364, 7374, 7375, 7379, 7382, 7395, 7396, 7398, 7400, 7401, 7403, 7410, 7412, 7429, 7430, 7432, 7434, 7463, 7464, 7465, 7467, 7468, 7471, 7472, 7488, 7489, 7493, 7494, 7495, 7497, 7498, 7577, 7578, 7579, 7580, 7581, 7582, 7583, 7586, 7587, 7601, 7602, 7604, 7611, 7612, 7613, 7614, 7615, 7619, 7620, 7623, 7624, 7625, 7626, 7627, 7628, 7629, 10335, 10336, 10337, 10338, 10339, 10340, 10342, 10343, 10344, 10345, 10346, 18061, 18062, 18065, 18066, 18069, 18070, 18071, 18072, 18073, 18074, 18075, 18601, 18605, 18606, 18611, 18614, 18617, 18620, 22790, 22794, 22795, 22796, 22798, 27694, 27697, 27700, 27701, 27703, 31165, 31166, 31167, 31172, 31173, 31176, 31178, 31181, 31182, 31183]


In [86]:
# a quick check on the extent of missing data
pos_data.isna().sum()

SKU ID           0
Date             0
Manufacturer     0
Sector          50
Category        36
Segment         29
Brand           27
Revenue($)       0
Units_sold       0
Page_traffic     0
dtype: int64

### Logic for filling missing values in the *Sector* column
1. Create a list of different Sectors (Oral Care, Fabric Care and Beauty & Personal Care)
2. Create a dictionary of sector and categories under each sector. Here, sector name will be the key, and a list of categories under each sector will be the corresponding value.
3. Create a user-defined function which maps the sector and category.
    - Pick a row where sector is missing.
    - Check what's the value of category in that row.
    - If the category is not a null value, then find out the corresponding sector name using the dictionary created in step (2) above. 
    - return the value of appropriate sector.
4. Use the `apply()` method from pandas to invoke the function created above.

In [90]:
# create a list of sectors
Sector_List = pos_data['Sector'].value_counts().index.tolist()
Sector_Category_dict={}

In [93]:
print(Sector_List)

['Fabric Care', 'Beauty and Personal Care', 'Oral Care']


In [94]:
# create a dictionary: sector as key, categories under each sector as a value-list.

for i in Sector_List:
    Sector_Category_dict[i] = pos_data.loc[pos_data['Sector']==i].Category.value_counts().index.tolist()

Sector_Category_dict      #display the dictionary

{'Fabric Care': ['Laundry Detergents', 'Fabric Softeners'],
 'Beauty and Personal Care': ['Skincare', 'Haircare'],
 'Oral Care': ['Toothbrushes', 'Toothpaste', 'Mouthwash']}

In [95]:
# define a function to choose the correct Sector based on the value of corresponding Category
def Sector_Category_map(row):
    if pd.isna(row['Sector']):                    # if sector is null
        if not pd.isna(row['Category']):          # if category is not null
            for k, v in Sector_Category_dict.items():
                if row['Category'] in v:          # v is the list of categories under sector k
                    return k                      # k is the sector that contains this category
    return row['Sector']     # sector was not null, or no match was found: return the existing value as is

In [96]:
# use apply() method to call the mapping function created above
pos_data['Sector'] = pos_data.apply(Sector_Category_map, axis=1)

In [97]:
# check missing values now
pos_data.isna().sum()

SKU ID           0
Date             0
Manufacturer     0
Sector           3
Category        36
Segment         29
Brand           27
Revenue($)       0
Units_sold       0
Page_traffic     0
dtype: int64

- There were 50 rows where *Sector* had a missing value.
- After the process, we are now left with 3 missing values in *Sector*.
- This is because the corresponding *Category* is also missing in those rows, and hence we couldn't find the right match.
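
The same parent lookup can also be written without a row-wise function. The sketch below, on made-up rows, learns the category-to-sector mapping from the complete rows and applies it with `Series.map` and `fillna`. The toy frame and the names `known` and `category_to_sector` are assumptions for illustration, and the sketch assumes each category belongs to exactly one sector.

```python
import pandas as pd

# Toy rows; the last row has neither Sector nor Category
df = pd.DataFrame({
    "Sector":   ["Oral Care", None, "Fabric Care", None],
    "Category": ["Toothpaste", "Toothpaste", "Fabric Softeners", None],
})

# Learn the child -> parent lookup from rows where both values are present
known = df.dropna(subset=["Sector", "Category"])
category_to_sector = known.drop_duplicates("Category").set_index("Category")["Sector"]

# Fill only the missing sectors; rows with a missing category stay missing
df["Sector"] = df["Sector"].fillna(df["Category"].map(category_to_sector))

print(df)
```

As with the function above, a row whose category is also missing cannot be resolved at this stage.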

### Part 4.1.4  :  What to do with missing values? - Part IV

### Logic for filling missing values in *Category* column
1. Create a list of different Categories (like Laundry Detergents, Fabric Softeners etc.)
2. Create a dictionary of category and segment under each category. Here, category name will be the key, and a list of segments under each category will be the corresponding value.
3. Create a user-defined function which maps the category and segment (similar to the one we created before for mapping sector and category), and return the appropriate value of category. 
4. Use the `apply()` method to invoke the function created above.

In [98]:
# create a list of categories
CategoryList = [c for cats in Sector_Category_dict.values() for c in cats]   # flatten the value-lists into one list of plain strings
print(CategoryList)

['Laundry Detergents', 'Fabric Softeners', 'Skincare', 'Haircare', 'Toothbrushes', 'Toothpaste', 'Mouthwash']


In [99]:
# create a dictionary: category as key, segments under each category as a value-list.
Category_Segment_dict={}
for i in CategoryList:
    Category_Segment_dict[i] = pos_data.loc[pos_data['Category']==i].Segment.value_counts().index.tolist()

In [100]:
#display the dictionary
Category_Segment_dict           

{'Laundry Detergents': ['Powder', 'Laundry_Liquid', 'Pods'],
 'Fabric Softeners': ['Fabric_Liquid', 'Dryer Sheets'],
 'Skincare': ['Anti-aging', 'Suncreens', 'Acne'],
 'Haircare': ['Shampoo', 'Conditioners'],
 'Toothbrushes': ['Manual Toothbrushes',
  'Kids Toothbrushes',
  'Electric Toothbrushes'],
 'Toothpaste': ['Fluoride-Free Toothpaste',
  'Whitening Toothpaste',
  'Sensitivity Toothpaste'],
 'Mouthwash': ['Alcohol-Free Mouthwash',
  'Fluoride Mouthwash',
  'Breath-Freshening Mouthwash']}

In [101]:
# define a function to choose the correct Category based on the value of corresponding segment
def Category_Segment_map(row):
    if pd.isna(row['Category']):                    # if category is null
        if not pd.isna(row['Segment']):             # if Segment is not null
            for k, v in Category_Segment_dict.items():
                if row['Segment'] in v:              # v is the list of segments under category k
                    return k                         # k is the category that contains this segment
    return row['Category']                  # category was not null, or no match was found: return the existing value as is

In [102]:
# use apply() method to call the mapping function created above
pos_data['Category']=pos_data.apply(Category_Segment_map, axis=1)
pos_data.isna().sum()

SKU ID           0
Date             0
Manufacturer     0
Sector           3
Category         0
Segment         29
Brand           27
Revenue($)       0
Units_sold       0
Page_traffic     0
dtype: int64

***Inference:***
- There were 36 rows where *Category* had a missing value.
- After the process, all missing values in *Category* are filled. 
- This means that there is no row where both *Category* and *Segment* are missing.
- Now, given that all categories have been filled, we can call the map function between sector and category once again to fill the 3 values of sector that remained missing in the previous step. 

#### Fill missing values in *Sector* column again, as all Categories have been filled now

In [103]:
# since all Categories are now filled, call the function to fill the remaining missing values for Sector
pos_data['Sector'] = pos_data.apply(Sector_Category_map, axis = 1)
pos_data.isna().sum()

SKU ID           0
Date             0
Manufacturer     0
Sector           0
Category         0
Segment         29
Brand           27
Revenue($)       0
Units_sold       0
Page_traffic     0
dtype: int64

- We can now observe that there are no more missing values under *Sector* and *Category*

### Part 4.1.5  :  What to do with missing values? - Part V

### Logic for filling missing values in *Segment* column
1. Create a list of different segments (like Powder, Pods etc.)
2. Create a dictionary of segment and brand under each segment. Here, segment name will be the key, and a list of brands under each segment will be the value corresponding to that key.
3. Create a user-defined function which maps the segment and brand, and return the appropriate value of segment. 
4. Use the `apply()` method to invoke the function created above.

In [104]:
# create the list of segment names
SegmentList = [s for segs in Category_Segment_dict.values() for s in segs]   # flatten the value-lists into one list of plain strings
print(SegmentList)

['Powder', 'Laundry_Liquid', 'Pods', 'Fabric_Liquid', 'Dryer Sheets', 'Anti-aging', 'Suncreens', 'Acne', 'Shampoo', 'Conditioners', 'Manual Toothbrushes', 'Kids Toothbrushes', 'Electric Toothbrushes', 'Fluoride-Free Toothpaste', 'Whitening Toothpaste', 'Sensitivity Toothpaste', 'Alcohol-Free Mouthwash', 'Fluoride Mouthwash', 'Breath-Freshening Mouthwash']


In [105]:
# create the dictionary
Segment_Brand_dict={}
for i in SegmentList:
    Segment_Brand_dict[i]=pos_data.loc[pos_data['Segment']==i].Brand.value_counts().index.tolist()
    
Segment_Brand_dict

{'Powder': ['Ariel', 'Gain'],
 'Laundry_Liquid': ['Tide', 'Gain'],
 'Pods': ['Tide', 'Gain'],
 'Fabric_Liquid': ['comfort', 'Downy'],
 'Dryer Sheets': ['Bounce', 'Gain'],
 'Anti-aging': ['Clinique', 'Olay'],
 'Suncreens': ['Aveeno', 'Cetaphil'],
 'Acne': ['Neutrogena', 'Cetaphil'],
 'Shampoo': ['Pantene', 'Dove'],
 'Conditioners': ['Dove', 'Pantene'],
 'Manual Toothbrushes': ['Colgate', 'Oral-B'],
 'Kids Toothbrushes': ['Colgate', 'Oral-B'],
 'Electric Toothbrushes': ['Philips', 'Oral-B'],
 'Fluoride-Free Toothpaste': ["Tom's of Maine", 'Himalaya Herbals'],
 'Whitening Toothpaste': ['Close-up', 'Colgate'],
 'Sensitivity Toothpaste': ['Sensodyne', 'Colgate'],
 'Alcohol-Free Mouthwash': ['Crest', 'Colgate'],
 'Fluoride Mouthwash': ['Listerine', 'Crest'],
 'Breath-Freshening Mouthwash': ['Scope', 'Colgate']}

In [106]:
# define a function to choose the correct Segment based on the value of corresponding Brand
def Segment_Brand_map(row):
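    if pd.isna(row['Segment']):                    # if segment is null
        if not pd.isna(row['Brand']):              # if brand is not null
            for k, v in Segment_Brand_dict.items():
                if row['Brand'] in v:              # v is the list of brands under segment k
                    return k                       # first segment (in dictionary order) that lists this brand
    return row['Segment']                          # segment was not null, or no match was found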

In [107]:
# apply the function
pos_data['Segment'] = pos_data.apply(Segment_Brand_map, axis = 1)
pos_data.isna().sum()

SKU ID           0
Date             0
Manufacturer     0
Sector           0
Category         0
Segment          0
Brand           27
Revenue($)       0
Units_sold       0
Page_traffic     0
dtype: int64
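
Note from the dictionary displayed above that some brands (for example *Gain*) are listed under more than one segment, and the mapping function returns the first segment, in dictionary order, that lists the brand. The sketch below shows one way to see how many candidate segments a brand has once the row's category is taken into account. The helper `candidate_segments` is hypothetical, and the two dictionaries are small excerpts of the ones built in this module.

```python
# Excerpts of the two lookup dictionaries built in this module
category_segment = {
    "Laundry Detergents": ["Powder", "Laundry_Liquid", "Pods"],
    "Fabric Softeners": ["Fabric_Liquid", "Dryer Sheets"],
}
segment_brand = {
    "Powder": ["Ariel", "Gain"],
    "Laundry_Liquid": ["Tide", "Gain"],
    "Pods": ["Tide", "Gain"],
    "Fabric_Liquid": ["comfort", "Downy"],
    "Dryer Sheets": ["Bounce", "Gain"],
}

def candidate_segments(category, brand):
    """Hypothetical helper: segments of `category` that list `brand`."""
    return [s for s in category_segment.get(category, [])
            if brand in segment_brand[s]]

print(candidate_segments("Fabric Softeners", "Gain"))    # one candidate
print(candidate_segments("Laundry Detergents", "Gain"))  # several candidates
```

When only one candidate remains, the segment is determined. When several remain, the choice is a judgment call, and the first-match rule used above is one such choice.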

### Logic for filling missing values in *Brand* column
- We don't have any further hierarchy after Brand. 
- So, missing values in Brand must be imputed in a different way:
    - Find the rows where Brand is NaN.
    - For each such row, take the corresponding Segment, say s.
    - Get the mode of Brand among the rows where Segment is s.
    - Fill Brand with that mode.
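
The steps above can also be expressed with a group-wise operation instead of a row-wise function. The following is a minimal sketch on made-up rows; it assumes every segment has at least one known brand, otherwise `mode()` would be empty and the lookup would fail.

```python
import pandas as pd

# Toy rows; the values are made up
df = pd.DataFrame({
    "Segment": ["Powder", "Powder", "Powder", "Pods", "Pods"],
    "Brand":   ["Ariel", "Ariel", None, "Tide", None],
})

# Most frequent brand within each segment, aligned to every row
segment_mode = df.groupby("Segment")["Brand"].transform(lambda s: s.mode().iloc[0])

# Fill only the missing brands with the mode of their segment
df["Brand"] = df["Brand"].fillna(segment_mode)

print(df)
```

This computes the mode once per segment rather than once per row with a missing brand.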


In [108]:
# write a function to fill the brand value based on the mode of existing brands for each segment
def Brand_fill(row):
    if pd.isna(row['Brand']):
        return pos_data.loc[pos_data['Segment']==row['Segment']]['Brand'].mode()[0]

    return row['Brand']

In [109]:
# apply the function
pos_data['Brand']=pos_data.apply(Brand_fill, axis=1)
pos_data.isna().sum()

SKU ID          0
Date            0
Manufacturer    0
Sector          0
Category        0
Segment         0
Brand           0
Revenue($)      0
Units_sold      0
Page_traffic    0
dtype: int64

- We can see that now our data has no more missing values.
- Let us save this DataFrame to a file, as we need it for further steps like data normalization, data transformation etc.

In [110]:
# observe those 128 rows which had missing values initially
# carefully check how the hierarchical relationship between the attributes is maintained while filling the missing values

pos_data.loc[missing_data].head(10)

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
14,SKU1068,10-02-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,24734,2091,5142.0
16,SKU1069,10/29/2022,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,29246,1774,3593.0
3223,SKU1130,12-10-22,Synergix solutions,Oral Care,Toothbrushes,Manual Toothbrushes,Colgate,0,0,0.0
3229,SKU1116,8/28/2021,Synergix solutions,Oral Care,Toothbrushes,Manual Toothbrushes,Oral-B,0,0,0.0
3234,SKU1119,04-03-21,Synergix solutions,Oral Care,Toothbrushes,Manual Toothbrushes,Oral-B,16435,790,4014.0
3236,SKU1080,10/23/2021,Synergix solutions,Oral Care,Toothbrushes,Electric Toothbrushes,Philips,29557,1262,3944.0
3238,SKU1096,6/18/2022,Synergix solutions,Oral Care,Toothbrushes,Kids Toothbrushes,Colgate,34629,1220,3663.0
7280,SKU1201,10/16/2021,Synergix solutions,Fabric Care,Fabric Softeners,Fabric_Liquid,comfort,28168,1609,3233.0
7281,SKU1199,11/13/2021,Synergix solutions,Fabric Care,Fabric Softeners,Fabric_Liquid,Downy,23290,1357,3289.0
7282,SKU1206,12-11-21,Synergix solutions,Fabric Care,Fabric Softeners,Fabric_Liquid,comfort,28117,1489,2714.0


In [111]:
# save this version of the data set
pos_data.to_csv('POS_Filled_Data.csv')

### Part 4.1.6  : What to do with missing values? - Part VI
- We have seen how to use domain knowledge to fill the missing values.
- As our POS data had hierarchical relationships among columns, we had to devise the logic needed and had to write a good amount of code to fill the missing values appropriately.
- However, not all data has to be treated like this. 
- We will take another dataset, *UsedCarsPrice.csv*, which has details about the resale price of cars.

**NOTE:**
- Sometimes, the dataset may contain special characters to indicate missing values; that is, instead of a blank cell, we may have symbols like *?*, *#* etc. 
- Sometimes it may be even noted with the text *unavailable*, *unknown* etc.
- So, it is always advisable to open the data file (for example, in Excel) and look for such inconsistencies.
- The pandas `read_csv()` function identifies a blank cell, and a small set of standard markers such as *NA*, *N/A*, *NaN* and *null*, as a missing value (NaN). It cannot identify other characters such as *??*; instead, it treats them as genuine cell values. 
- In such situations, while reading the data, we can specify what characters have to be treated as missing values or na values.
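
The effect of the `na_values` parameter can be seen on a small in-memory file. The sketch below uses made-up rows and `io.StringIO` in place of a file on disk; the markers `??`, `????` and `unknown` are examples of the kind mentioned in the note above.

```python
import io
import pandas as pd

# A tiny CSV held in memory; the rows are made up
raw = io.StringIO(
    "Price,KM,FuelType\n"
    "13500,46986,Diesel\n"
    "13750,??,Petrol\n"
    "13950,41711,????\n"
    "14950,,unknown\n"
)

# Without na_values, only the blank cell is recognised as missing
default_read = pd.read_csv(raw)

# Rewind and read again, declaring the extra markers explicitly
raw.seek(0)
custom_read = pd.read_csv(raw, na_values=["??", "????", "unknown"])

print(default_read.isna().sum())
print(custom_read.isna().sum())
print(custom_read.dtypes)
```

With the markers declared, *KM* is also read as a numeric column instead of text.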

In [112]:
#read the data - notice additional parameters to read_csv

cars_data = pd.read_csv('UsedCarsPrice.csv',index_col = 0, na_values = ["??", "????"])
cars_data.head(10)

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
0,13500,23.0,46986.0,Diesel,90.0,1.0,0,2000,three,1165
1,13750,23.0,72937.0,Diesel,90.0,1.0,0,2000,3,1165
2,13950,24.0,41711.0,Diesel,90.0,,0,2000,3,1165
3,14950,26.0,48000.0,Diesel,90.0,0.0,0,2000,3,1165
4,13750,30.0,38500.0,Diesel,90.0,0.0,0,2000,3,1170
5,12950,32.0,61000.0,Diesel,90.0,0.0,0,2000,3,1170
6,16900,27.0,,Diesel,,,0,2000,3,1245
7,18600,30.0,75889.0,,90.0,1.0,0,2000,3,1245
8,21500,27.0,19700.0,Petrol,192.0,0.0,0,1800,3,1185
9,12950,23.0,71138.0,Diesel,,,0,1900,3,1105


In [113]:
# check the size of the data set
cars_data.shape

(1436, 10)

In [114]:
# quick check of the amount of missing data
cars_data.isna().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64

***NOTE:*** The attributes Age, KM, FuelType, HP and MetColor have missing values

In [115]:
# get a basic summary
cars_data.describe()

Unnamed: 0,Price,Age,KM,HP,MetColor,Automatic,CC,Weight
count,1436.0,1336.0,1421.0,1430.0,1286.0,1436.0,1436.0,1436.0
mean,10730.824513,55.672156,68647.239972,101.478322,0.674961,0.05571,1566.827994,1072.45961
std,3626.964585,18.589804,37333.023589,14.768255,0.468572,0.229441,187.182436,52.64112
min,4350.0,1.0,1.0,69.0,0.0,0.0,1300.0,1000.0
25%,8450.0,43.0,43210.0,90.0,0.0,0.0,1400.0,1040.0
50%,9900.0,60.0,63634.0,110.0,1.0,0.0,1600.0,1070.0
75%,11950.0,70.0,87000.0,110.0,1.0,0.0,1600.0,1085.0
max,32500.0,80.0,243000.0,192.0,1.0,1.0,2000.0,1615.0


#### Logic for imputation
- Age, KM and HP are numeric variables.
- We could draw boxplots for these variables and check for the amount of outliers, and decide whether to impute these variables with mean or median.
- However, the summary statistics we get using `describe()` also give us some idea about the distribution of these variables and the possibility of outliers.
- After such analysis, we will impute Age and HP by mean, and KM by median. For example, KM has a maximum of 243000 against a 75th percentile of 87000, which suggests large outliers, so the median is the safer choice there.
- The categorical variables like FuelType and MetColor are imputed with mode. Note that MetColor has values 0 and 1, indicating whether or not the car has a metallic color body. Though it looks like a numerical variable, it is actually a categorical variable containing yes/no.
- The `fillna()` method of pandas dataframe is used to fill the missing values.
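
The outlier check that a boxplot performs visually can also be done numerically. The following is a minimal sketch using the common 1.5 × IQR rule on a made-up series; the values are not taken from the dataset, apart from reusing 243000 as an extreme point.

```python
import pandas as pd

# Toy values with one extreme point
km = pd.Series([40000, 45000, 50000, 55000, 60000, 243000], name="KM")

# Interquartile range and the usual boxplot fences
q1, q3 = km.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = km[(km < lower) | (km > upper)]

print("outliers:", outliers.tolist())
print("mean:", km.mean(), "median:", km.median())
```

When outliers pull the mean away from the median like this, median imputation is usually the more representative fill value.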

In [116]:
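# assign the result back instead of calling fillna(..., inplace=True) on a selected column,
# which may not update the DataFrame in recent pandas versions
cars_data['Age'] = cars_data['Age'].fillna(cars_data['Age'].mean())
cars_data['HP'] = cars_data['HP'].fillna(cars_data['HP'].mean())
cars_data['KM'] = cars_data['KM'].fillna(cars_data['KM'].median())
cars_data['FuelType'] = cars_data['FuelType'].fillna(cars_data['FuelType'].mode()[0])
cars_data['MetColor'] = cars_data['MetColor'].fillna(cars_data['MetColor'].mode()[0])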

In [117]:
cars_data.isna().sum()

Price        0
Age          0
KM           0
FuelType     0
HP           0
MetColor     0
Automatic    0
CC           0
Doors        0
Weight       0
dtype: int64

**We can now see that the data set does not contain any more missing values.**