# Data Cleaning (continued)

In [1]:
import numpy as np

In [2]:
import pandas as pd

df = pd.read_csv('../data/Train.csv')

## Data Cleaning: Nonsensical Data

Nonsensical data is data that is present in the dataset but holds values that cannot be real. For example, in a dataset of products sold at a store, a weight of 0 would be a nonsensical value, since a physical product cannot weigh nothing.

In most situations you can treat nonsensical data the same way as missing data, and replace it using the same strategies (e.g. the mean of the existing data, the mode, etc.).

The main difference between the two is that missing data is very easy to find: you can simply ask pandas for the null values.
Nonsensical data is much harder to spot, because you have to rely on your own knowledge of what counts as 'valid' data to distinguish nonsensical values from real ones.
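
One way to make nonsensical values as easy to find as missing ones is to convert them to NaN first. The sketch below is only an illustration: the `toy` DataFrame and its values are made up (not the real dataset), and it assumes a weight of 0 is the only nonsensical value.

```python
import numpy as np
import pandas as pd

# Made-up example data; a weight of 0 is treated as nonsensical here
toy = pd.DataFrame({
    'Item_Identifier': ['A1', 'A2', 'A3', 'A4'],
    'Item_Weight': [9.3, 0.0, 17.5, np.nan],
})

# Build a row filter for the nonsensical values and count them
zero_weight = toy['Item_Weight'] == 0
print(zero_weight.sum())

# Replace them with NaN so they show up alongside the truly missing values
toy.loc[zero_weight, 'Item_Weight'] = np.nan
print(toy['Item_Weight'].isnull().sum())
```

After this step, whatever strategy is used for missing data will also cover the values that were nonsensical.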

## Data Cleaning: Grouping Data

This isn't strictly part of the Data Cleaning process, but it is an extremely useful tool for analyzing the data.

Pandas allows you to group, or aggregate, data based on the values in certain columns. For example, let's say you want to find the average weight of each item, instead of just the overall average weight.

You can use the DataFrame's pivot_table method to do so.

In [10]:
# Using pivot_table to group Item_Weight by Item_Identifier
mean_weight_per_item = df.pivot_table(index='Item_Identifier', values='Item_Weight')
mean_weight_per_item

Item_Identifier,Item_Weight
DRA12,11.600
DRA24,19.350
DRA59,8.270
DRB01,7.390
DRB13,6.115
DRB24,8.785
DRB25,12.300
DRB48,16.750
DRC01,5.920
DRC12,17.850


As you can see above, it returned a dataframe showing each unique item, along with the average weight for that item. By default, pivot_table calculates the mean, but you can use the **aggfunc** argument to pass a different function that works on an array (for example np.sum or np.max), and it will use that function instead.
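
As a small sketch of how the aggregation can be swapped, the example below runs pivot_table twice on made-up data (the `toy` DataFrame is an assumption for illustration, not the real dataset). It passes the aggregation by its string name, `'max'`, which pandas also accepts in place of a function.

```python
import pandas as pd

# Made-up example data, not the real dataset
toy = pd.DataFrame({
    'Item_Identifier': ['A1', 'A1', 'B2', 'B2'],
    'Item_Weight': [10.0, 12.0, 5.0, 7.0],
})

# Default aggregation: mean weight per item
mean_weight = toy.pivot_table(index='Item_Identifier', values='Item_Weight')
print(mean_weight)

# Same grouping, but taking the maximum weight per item instead
max_weight = toy.pivot_table(index='Item_Identifier', values='Item_Weight', aggfunc='max')
print(max_weight)
```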

The point of grouping is to look at the data at a finer level of detail, so that values belonging to different items are not lumped together.

If we filled in the missing item weights with just the average of the entire column, that would cause significant inaccuracies in the data. Think about it: the items are many different kinds of products, which can have vastly different weights. A loaf of bread is certainly going to be lighter than a 36-pack box of canned soda. If we took the average of **all** item weights and used it for the missing weights, we would overestimate the weight of light items and underestimate the weight of heavy ones.

To fix this kind of issue, instead of using the average weight of all items as our fill value, we will take the average weight of each item separately (this is what we did with pivot_table above).

Once we have the average weight per item, we can use that value to fill in the missing weights in that item's rows.
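
To see the difference between the two fill strategies on a small scale, here is a sketch on made-up data (the `toy` DataFrame and the item names are assumptions for illustration). Note that it computes the per-item mean with `groupby(...).transform('mean')` rather than pivot_table; that is a different route to the same per-item average, chosen here only because it keeps the example short.

```python
import numpy as np
import pandas as pd

# Made-up example data: a light item and a heavy item, each with one missing weight
toy = pd.DataFrame({
    'Item_Identifier': ['BREAD', 'BREAD', 'SODA', 'SODA'],
    'Item_Weight': [0.5, np.nan, 12.0, np.nan],
})

# Filling with the overall mean gives both items the same fill value
overall_fill = toy['Item_Weight'].fillna(toy['Item_Weight'].mean())

# Filling with the per-item mean keeps each item's own weight
per_item_mean = toy.groupby('Item_Identifier')['Item_Weight'].transform('mean')
per_item_fill = toy['Item_Weight'].fillna(per_item_mean)

print(overall_fill.tolist())   # [0.5, 6.25, 12.0, 6.25]
print(per_item_fill.tolist())  # [0.5, 0.5, 12.0, 12.0]
```

With the overall mean, the bread is filled in far too heavy and the soda far too light; with the per-item mean, each missing weight matches the known weight of the same item.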

In [13]:
# For each row missing Item_Weight, use that row's Item_Identifier to look up the average weight for that item,
# and fill in the missing weight

# create a row filter to select the rows missing Item_Weight
missing_weight = df['Item_Weight'].isnull()

# use loc to assign a new value to each missing weight in the Item_Weight column:
# map each Item_Identifier to that item's average weight (a single number per row);
# an item with no recorded weight in any row has no average, so its weight stays NaN
df.loc[missing_weight, 'Item_Weight'] = df.loc[missing_weight, 'Item_Identifier'].map(mean_weight_per_item['Item_Weight'])

In [14]:
# the missing weights are now filled in with the average weight for that item
df

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
5,FDP36,10.395,Regular,0.000000,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
6,FDO10,13.650,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
7,FDP10,19.000,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,FDH17,16.200,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.200,Regular,0.094450,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.5350
