# AB1_Data_Wrangling_New_Test_Set

This workbook **prepares the data** for feeding new test images to the model. The following dataset is used:

"Fashion Product Images" [Link](https://www.kaggle.com/paramaggarwal/fashion-product-images-dataset)
    
This entails the following steps:
    
| No.    | Step             |
|:-------|:-----------------|
| AB 1.1 | Import Libraries |
| AB 1.2 | Data Wrangling   |

## AB 1.1 Import Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings 

warnings.filterwarnings("ignore")

## AB 1.2 Data Wrangling 

To select images for the model, the **image ids** of women's tops, skirts, and dresses are collected.

In [2]:
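#Load data
#With a tab separator each comma-separated line is read as one string column (split in the next cell)
style_data = pd.read_csv('../final_project/styles.csv', sep='\t')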
style_data.head()

Unnamed: 0,"id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName"
0,"15970,Men,Apparel,Topwear,Shirts,Navy Blue,Fal..."
1,"39386,Men,Apparel,Bottomwear,Jeans,Blue,Summer..."
2,"59263,Women,Accessories,Watches,Watches,Silver..."
3,"21379,Men,Apparel,Bottomwear,Track Pants,Black..."
4,"53759,Men,Apparel,Topwear,Tshirts,Grey,Summer,..."
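
The output above shows each line of the file landing in a single column. A minimal sketch of why that happens, assuming a separator that never occurs in the text. The sample rows are made up for illustration and are not taken from `styles.csv`:

```python
import io
import pandas as pd

# Made-up sample in the same comma-separated layout as the header shown above
csv_text = (
    "id,gender,masterCategory,subCategory,articleType,baseColour,"
    "season,year,usage,productDisplayName\n"
    "1,Women,Apparel,Topwear,Tops,Red,Summer,2012,Casual,Example Top, Red\n"
    "2,Men,Apparel,Bottomwear,Jeans,Blue,Fall,2011,Casual,Example Jeans\n"
)

# No tab occurs in the text, so pandas keeps every line as one string field
single_col = pd.read_csv(io.StringIO(csv_text), sep="\t")

print(single_col.shape)       # one column, one row per data line
print(single_col.columns[0])  # the whole header line becomes the column name
```

Reading the file this way sidesteps rows whose last field contains extra commas (such as the first made-up row), at the cost of a manual split afterwards.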


In [3]:
#Split data
style_data.iloc[:,0] = style_data.iloc[:,0].str.split(',')
style_data.head()

Unnamed: 0,"id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName"
0,"[15970, Men, Apparel, Topwear, Shirts, Navy Bl..."
1,"[39386, Men, Apparel, Bottomwear, Jeans, Blue,..."
2,"[59263, Women, Accessories, Watches, Watches, ..."
3,"[21379, Men, Apparel, Bottomwear, Track Pants,..."
4,"[53759, Men, Apparel, Topwear, Tshirts, Grey, ..."


In [4]:
#Extract data
style_data['id'] = style_data.iloc[:,0].apply(lambda x: x[0])
style_data['gender'] = style_data.iloc[:,0].apply(lambda x: x[1])
style_data['type'] = style_data.iloc[:,0].apply(lambda x: x[4])
style_data.head()

Unnamed: 0,"id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName",id,gender,type
0,"[15970, Men, Apparel, Topwear, Shirts, Navy Bl...",15970,Men,Shirts
1,"[39386, Men, Apparel, Bottomwear, Jeans, Blue,...",39386,Men,Jeans
2,"[59263, Women, Accessories, Watches, Watches, ...",59263,Women,Watches
3,"[21379, Men, Apparel, Bottomwear, Track Pants,...",21379,Men,Track Pants
4,"[53759, Men, Apparel, Topwear, Tshirts, Grey, ...",53759,Men,Tshirts
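
The split and extract cells above could also be combined into one step. The following is only a sketch, not the approach used in this workbook: it assumes the ten column names from the header and uses made-up sample rows. The helper name `split_rows` is hypothetical.

```python
import pandas as pd

# Column names as they appear in the header of the single-column DataFrame
COLUMNS = [
    "id", "gender", "masterCategory", "subCategory", "articleType",
    "baseColour", "season", "year", "usage", "productDisplayName",
]

def split_rows(raw: pd.Series) -> pd.DataFrame:
    """Split comma-joined rows into named columns (hypothetical helper)."""
    # n=9 allows at most nine splits, so extra commas stay in the last field
    parts = raw.str.split(",", n=9, expand=True)
    parts.columns = COLUMNS
    return parts

# Made-up rows for illustration; the first has a comma inside the product name
sample = pd.Series([
    "1,Women,Apparel,Topwear,Tops,Red,Summer,2012,Casual,Example Top, Red",
    "2,Men,Apparel,Bottomwear,Jeans,Blue,Fall,2011,Casual,Example Jeans",
])

split = split_rows(sample)
print(split[["id", "gender", "articleType"]])
```

Note that assigning `COLUMNS` only works if at least one row yields all ten fields; otherwise `expand=True` returns fewer columns.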


Since the target group is **women**, only entries for women are kept.

In [5]:
#Keep only entries for women (copy so the later column assignment does not act on a view)
style_data = style_data.loc[style_data['gender']=='Women'].copy()

In [6]:
#Assign a new column
style_data.loc[style_data['type']=='Tops', 'cat'] = 'Top'
style_data.loc[style_data['type']=='Skirts', 'cat'] = 'Skirt'
style_data.loc[style_data['type']=='Dresses', 'cat'] = 'Dress'
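
The three assignments above can also be expressed as a single dictionary lookup. A minimal sketch, assuming the same three article types; the sample rows and the name `TYPE_TO_CAT` are made up for illustration:

```python
import pandas as pd

# Same type-to-category mapping as in the three assignments above
TYPE_TO_CAT = {"Tops": "Top", "Skirts": "Skirt", "Dresses": "Dress"}

# Made-up sample rows for illustration
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "type": ["Tops", "Watches", "Dresses", "Skirts"],
})

# Types missing from the mapping become NaN, just like the unassigned rows above
df["cat"] = df["type"].map(TYPE_TO_CAT)

# Dropping the NaN rows leaves only tops, skirts, and dresses
labelled = df.dropna(subset=["cat"]).reset_index(drop=True)
print(labelled)
```

Adding a further category would then only require one more dictionary entry.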

In [7]:
#Keep only id and category; drop rows without a category
style_data = style_data[['id', 'cat']]
style_data2 = style_data.dropna().reset_index(drop=True)
style_data2.head()

Unnamed: 0,id,cat
0,49653,Top
1,58513,Top
2,39716,Dress
3,28456,Skirt
4,31782,Top


In [11]:
#Save to csv
style_data2.to_csv('../final_project/style_data2.csv')
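
As written, `to_csv` also stores the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. A small sketch of the difference, using an in-memory string instead of a file; whether the index should be kept depends on how the file is consumed later, which this workbook does not show:

```python
import io
import pandas as pd

# Two rows in the same shape as style_data2
df = pd.DataFrame({"id": ["49653", "58513"], "cat": ["Top", "Top"]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```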

**2,026** images are available in the categories 'top', 'skirt', and 'dress'.

In [12]:
#Find number of images (one row per image)
len(style_data2)

2026

**Most** images (1,532) are in the category 'top', while far **fewer** exist for 'dress' (390) and 'skirt' (104).

In [13]:
#Count images per category
style_data2['cat'].value_counts()

Top      1532
Dress     390
Skirt     104
Name: cat, dtype: int64
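
To make the imbalance easier to judge, the counts above can be turned into shares. A short sketch that rebuilds the counts by hand from the output above; on the real data, `style_data2['cat'].value_counts(normalize=True)` would give the same proportions directly:

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"Top": 1532, "Dress": 390, "Skirt": 104}, name="cat")

# Share of each category in the 2,026 images, rounded to three decimals
shares = (counts / counts.sum()).round(3)
print(shares)
```

'Top' accounts for roughly three quarters of the images, which is worth keeping in mind when sampling a test set.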