# Tutorial 1

**Data Preprocessing with Python**

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records in a dataset.  
Data Missing .....

### Step 1: Get data
**About dataset: [Clothing Fit Dataset for Size Recommendation](https://www.kaggle.com/rmisra/clothing-fit-dataset-for-size-recommendation)**  
In this tutorial, we use a dataset of customer feedback on clothing, together with related information such as reviews, ratings, product categories, catalog sizes, and customers' measurements.  

**Getting dataset using pandas**  
We use pandas, an open-source data analysis and manipulation tool. 
- Use the read_json() function to import a JSON file into Jupyter

In [26]:
import pandas as pd

modcloth = pd.read_json("./Dataset/modcloth_final_data.json", lines=True)
modcloth.head()

Unnamed: 0,bra size,bust,category,cup size,fit,height,hips,item_id,length,quality,review_summary,review_text,shoe size,shoe width,size,user_id,user_name,waist
0,34.0,36.0,new,d,small,5ft 6in,38.0,123373,just right,5.0,,,,,7,991571,Emily,29.0
1,36.0,,new,b,small,5ft 2in,30.0,123373,just right,3.0,,,,,13,587883,sydneybraden2001,31.0
2,32.0,,new,b,small,5ft 7in,,123373,slightly long,2.0,,,9.0,,7,395665,Ugggh,30.0
3,,,new,dd/e,fit,,,123373,just right,5.0,,,,,21,875643,alexmeyer626,
4,36.0,,new,b,small,5ft 2in,,123373,slightly long,5.0,,,,,18,944840,dberrones1,
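
The `lines=True` argument tells read_json() to treat every line of the file as a separate JSON object. The sketch below illustrates this on two made-up records held in memory, so it runs without the dataset file; the records and the names `raw` and `sample` are illustrative assumptions, not part of the ModCloth data.

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line.
raw = '{"item_id": 1, "fit": "small", "waist": 29}\n{"item_id": 2, "fit": "fit"}\n'

# lines=True parses each line as its own record. A key that is absent
# from a record (here "waist" in the second line) shows up as NaN.
sample = pd.read_json(io.StringIO(raw), lines=True)
print(sample)
```

This is also one reason the real dataset contains so many NaN values: any field that a record leaves out becomes a missing value in the DataFrame.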


### Step 2: Dealing with missing values
#### a. **Count the missing values in each feature**  
- Use the isnull() function to detect missing values.
- Use the sum() function to add up the values along the requested axis.  
*axis = 0: sum down the rows (the index), giving one total per column*  
*axis = 1: sum across the columns, giving one total per row*
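
To make the two axis values concrete, here is a small sketch on a made-up frame (the name `toy` and its values are assumptions for illustration only). isnull() returns a table of True/False flags, and sum() counts the True entries along the chosen axis.

```python
import numpy as np
import pandas as pd

# Made-up frame: 3 rows, 2 columns, 3 missing cells in total.
toy = pd.DataFrame({
    "bust": [36.0, np.nan, np.nan],
    "hips": [38.0, 30.0, np.nan],
})

missing = toy.isnull()             # True wherever a cell is NaN
per_column = missing.sum(axis=0)   # collapse the rows: one count per column
per_row = missing.sum(axis=1)      # collapse the columns: one count per row

print(per_column)
print(per_row)
```

Here `per_column` reports 2 missing values for bust and 1 for hips, while `per_row` reports 0, 1 and 2 missing values for the three rows.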

**Let's take a look**  
The output of the previous cell shows several columns containing NaN, which is how pandas marks a missing value.

In [37]:
# get the number of missing data in the bra size column
modcloth[["bra size"]].isnull().sum(axis=0)

bra size    6018
dtype: int64

**Exercise 1**  
How many missing values are there in the bust, cup size, and height columns?

In [None]:
## Todo

#### b. **Drop missing values**  
- Use notnull() to detect non-missing values.  
- Use [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) to remove missing values.  
- Use [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) to fill NA/NaN values with a given value or fill method.
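
The three functions can be compared side by side on a small made-up frame. This is only a sketch: `toy` and its values are assumptions, and the columns are numeric so that filling with 0 keeps one data type per column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bust": [36.0, np.nan, 39.0],
    "hips": [38.0, 30.0, np.nan],
})

kept = toy[toy["bust"].notnull()]   # boolean mask: keep rows where bust is present
dropped = toy.dropna()              # drop every row that has any missing value
filled = toy.fillna(0)              # keep all rows and put 0 in the empty cells

print(len(kept), len(dropped), len(filled))
```

Filtering with notnull() looks at one column only, dropna() looks at all of them by default, and fillna() keeps every row at the cost of inventing a value.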

**Let's take a look**  

In [42]:
# We are going to drop rows where the bra size column values are missing
modcloth_subset = modcloth[modcloth["bra size"].notnull()]

# get the number of missing data in the bra size column
modcloth_subset[["bra size"]].isnull().sum(axis=0)

bra size    0
dtype: int64

**Exercise 2**  
From the modcloth_subset above, get a subset in which no value of the **height** column is missing. 

In [43]:
# Todo:

We can also use the *dropna()* function to remove missing values.

In [56]:
# Drop every column that contains at least one missing value
columns_with_na_dropped = modcloth.dropna(axis=1)

# just how much data did we lose?
print("Columns in original dataset: %d \n" % modcloth.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 18 

Columns with na's dropped: 6
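
Dropping every column that has any missing value removed 12 of the 18 columns, which is usually too drastic. dropna() also accepts `axis=0`, `subset` and `thresh` to narrow what gets dropped. The sketch below tries them on a made-up frame; `toy` and the threshold of 2 are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "item_id": [1, 2, 3],
    "bust": [36.0, np.nan, np.nan],
    "hips": [38.0, 30.0, np.nan],
})

rows_any = toy.dropna(axis=0)                # drop rows that have any NaN
rows_subset = toy.dropna(subset=["hips"])    # drop rows only when hips is NaN
cols_thresh = toy.dropna(axis=1, thresh=2)   # keep columns with at least 2 non-missing values

print(rows_any.shape, rows_subset.shape, list(cols_thresh.columns))
```

With `thresh=2`, the bust column (one non-missing value) is removed while hips (two non-missing values) survives.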


Replacing missing values is a common way of dealing with them.

In [60]:
# replace all NaN's with 0
modcloth.fillna(0).head()

Unnamed: 0,bra size,bust,category,cup size,fit,height,hips,item_id,length,quality,review_summary,review_text,shoe size,shoe width,size,user_id,user_name,waist
0,34.0,36,new,d,small,5ft 6in,38.0,123373,just right,5.0,0,0,0.0,0,7,991571,Emily,29.0
1,36.0,0,new,b,small,5ft 2in,30.0,123373,just right,3.0,0,0,0.0,0,13,587883,sydneybraden2001,31.0
2,32.0,0,new,b,small,5ft 7in,0.0,123373,slightly long,2.0,0,0,9.0,0,7,395665,Ugggh,30.0
3,0.0,0,new,dd/e,fit,0,0.0,123373,just right,5.0,0,0,0.0,0,21,875643,alexmeyer626,0.0
4,36.0,0,new,b,small,5ft 2in,0.0,123373,slightly long,5.0,0,0,0.0,0,18,944840,dberrones1,0.0
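
Note that filling everything with 0 also writes 0 into text columns such as height. One possible alternative, sketched here on made-up data, is to pass fillna() a dictionary that maps column names to replacement values. The frame `toy` and the placeholder "unknown" are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bust": [36.0, np.nan],
    "height": ["5ft 6in", None],
})

# Per-column replacements: a number for the numeric column and a text
# placeholder for the text column. Columns not listed are left untouched.
filled = toy.fillna({"bust": 0, "height": "unknown"})
print(filled)
```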


In [63]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NAs with 0
# (pandas 2.1+ deprecates fillna(method=...); modcloth.bfill() is the equivalent call)
modcloth.fillna(method = 'bfill', axis=0).fillna(0).head()

Unnamed: 0,bra size,bust,category,cup size,fit,height,hips,item_id,length,quality,review_summary,review_text,shoe size,shoe width,size,user_id,user_name,waist
0,34.0,36,new,d,small,5ft 6in,38.0,123373,just right,5.0,Too much ruching,"I liked the color, the silhouette, and the fab...",9.0,wide,7,991571,Emily,29.0
1,36.0,39,new,b,small,5ft 2in,30.0,123373,just right,3.0,Too much ruching,"I liked the color, the silhouette, and the fab...",9.0,wide,13,587883,sydneybraden2001,31.0
2,32.0,39,new,b,small,5ft 7in,41.0,123373,slightly long,2.0,Too much ruching,"I liked the color, the silhouette, and the fab...",9.0,wide,7,395665,Ugggh,30.0
3,36.0,39,new,dd/e,fit,5ft 2in,41.0,123373,just right,5.0,Too much ruching,"I liked the color, the silhouette, and the fab...",8.5,wide,21,875643,alexmeyer626,27.0
4,36.0,39,new,b,small,5ft 2in,41.0,123373,slightly long,5.0,Too much ruching,"I liked the color, the silhouette, and the fab...",8.5,wide,18,944840,dberrones1,27.0
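
The direction of the two fill methods is easier to see on a short made-up Series. This sketch uses bfill() and ffill(), the method forms of the same two fills; the name `s` and its values are assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 36.0, np.nan, 39.0, np.nan])

backward = s.bfill()   # each NaN takes the next valid value below it
forward = s.ffill()    # each NaN takes the last valid value above it

print(backward.tolist())
print(forward.tolist())
```

A backward fill cannot fill the last entry and a forward fill cannot fill the first one, because there is no neighbour to copy from. That is why the cell above chains a final fillna(0) after the backward fill.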


**Exercise 3**  
- Read about the other fill methods available in [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
- Replace all NA values using fillna with the ffill method

In [64]:
## Todo
modcloth.fillna(method = 'ffill', axis=0)

Unnamed: 0,bra size,bust,category,cup size,fit,height,hips,item_id,length,quality,review_summary,review_text,shoe size,shoe width,size,user_id,user_name,waist
0,34.0,36,new,d,small,5ft 6in,38.0,123373,just right,5.0,,,,,7,991571,Emily,29.0
1,36.0,36,new,b,small,5ft 2in,30.0,123373,just right,3.0,,,,,13,587883,sydneybraden2001,31.0
2,32.0,36,new,b,small,5ft 7in,30.0,123373,slightly long,2.0,,,9.0,,7,395665,Ugggh,30.0
3,32.0,36,new,dd/e,fit,5ft 7in,30.0,123373,just right,5.0,,,9.0,,21,875643,alexmeyer626,30.0
4,36.0,36,new,b,small,5ft 2in,30.0,123373,slightly long,5.0,,,9.0,,18,944840,dberrones1,30.0
