# Sandbox

First, we create a pandas DataFrame from Products.csv and clean it.

In [1]:
from data_cleaning import DataCleaning
from pathlib import Path
from PIL import Image
import os
import pandas as pd


product_path = 'EC2_files/Products.csv'
products_df = pd.read_csv(product_path, lineterminator='\n')

products_df.drop(columns='Unnamed: 0', inplace=True) #dropping unnecessary index column

clean_products_df = DataCleaning(products_df).clean_product_table()


NUMBER OF ROWS REMAINING: 7156

Cleaning prices...
NUMBER OF ROWS REMAINING: 7156

Cleaning nulls...
NUMBER OF ROWS REMAINING: 7156

Displaying nulls in each column:
id                     0
product_name           0
category               0
product_description    0
price                  0
location               0
dtype: int64
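
The 'Unnamed: 0' column dropped above is typically the index that was written out when the CSV was saved. As an alternative, here is a minimal sketch that reads it back as the index instead of dropping it. The CSV text and its values are made up for illustration and only stand in for Products.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for Products.csv: the first, unnamed column is a saved index.
csv_text = ",id,product_name,price\n0,a1,Mirror,5.00\n1,b2,Clock,12.50\n"

# Default read: the saved index appears as a column named 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index, so there is nothing to drop.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either approach leaves the same data columns; the difference is only whether the saved index is discarded or reused.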


Next, we create a column called 'root_category', which extracts the main category from the 'category' column. For example, the root category of "Home & Garden / Dining, Living Room Furniture / Mirrors, Clocks & Ornaments" is "Home & Garden".

Making a set out of this column gives us all the unique entries, allowing us to build an encoder (mapping each category to an integer) and a decoder (mapping each integer back to its category).

In [2]:
clean_products_df['root_category'] = clean_products_df['category'].apply(lambda x:x.split(' / ')[0])

root_set = set(clean_products_df['root_category'])

encoder = {category: index for index, category in enumerate(root_set)}
decoder = {index: category for category, index in encoder.items()}  # encoder.items() yields (category, index) pairs

clean_products_df['label'] = clean_products_df['root_category'].apply(lambda x:encoder[x])
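
One thing to watch for: iterating over a set of strings can produce a different order from one Python session to the next, so the integer assigned to each category may change between runs. A minimal sketch of a reproducible variant, assuming it is acceptable to order the categories alphabetically, is shown below. The `toy_df` frame and its category strings are illustrative stand-ins for `clean_products_df`.

```python
import pandas as pd

# Toy data standing in for clean_products_df (category strings are illustrative).
toy_df = pd.DataFrame({
    'category': [
        'Home & Garden / Dining, Living Room Furniture / Mirrors, Clocks & Ornaments',
        'Appliances / Fridges',
        'Home & Garden / Garden & Patio',
    ]
})

# Vectorised equivalent of the lambda-based split on ' / '.
toy_df['root_category'] = toy_df['category'].str.split(' / ').str[0]

# Sorting the unique categories fixes the label order across runs.
root_categories = sorted(toy_df['root_category'].unique())

encoder = {category: index for index, category in enumerate(root_categories)}
decoder = {index: category for category, index in encoder.items()}

toy_df['label'] = toy_df['root_category'].map(encoder)

print(encoder)
print(toy_df['label'].tolist())
```

With a fixed ordering, labels saved in one session can be decoded correctly in another, which matters once a trained model's outputs need to be mapped back to category names.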


Next, we merge the Images and Products DataFrames on 'product_id' to produce a training set with the encoded label for each image.

In [3]:
images_df = pd.read_csv('EC2_files/Images.csv')
images_df.drop(columns='Unnamed: 0', inplace=True) #dropping unnecessary index column

clean_products_df.rename(columns={'id': 'product_id'}, inplace=True) #renaming id column to product_id for table merge

merged_df = pd.merge(clean_products_df, images_df, on='product_id')

training_df = merged_df.drop(columns=['product_id', 'product_name', 'category', 'product_description', 'price', 'location','root_category'])
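
The default inner merge silently discards images whose product was removed during cleaning, as well as products that have no images. A minimal sketch of how the merge could be checked is shown below, using the `validate` and `indicator` parameters of `pd.merge`. The two toy frames and the image 'id' column are assumptions for illustration, not the real contents of Images.csv.

```python
import pandas as pd

# Toy stand-ins for the cleaned products table and the images table.
products = pd.DataFrame({'product_id': ['p1', 'p2', 'p3'], 'label': [0, 1, 0]})
images = pd.DataFrame({
    'id': ['i1', 'i2', 'i3', 'i4'],            # hypothetical image id column
    'product_id': ['p1', 'p1', 'p2', 'p9'],    # 'p9' has no matching product
})

# validate='one_to_many' raises MergeError if product_id is duplicated in products.
# indicator=True adds a '_merge' column recording where each row came from.
checked = pd.merge(
    products, images,
    on='product_id', how='outer',
    validate='one_to_many', indicator=True,
)

# Count matched rows, products without images, and images without products.
counts = checked['_merge'].value_counts()
print(counts)

# Keeping only the matched rows reproduces what the inner merge returns.
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(len(matched))
```

Inspecting the 'left_only' and 'right_only' counts before training shows how many rows the inner merge would drop.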
