# Investigation notebook

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd

from process_labels import lower_dataframe

#Read file
file = pd.read_csv('../data/raw/care_labels.csv')
file.head()

Unnamed: 0,product_id,product_category,care_label
0,#113,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nCo..."
1,#212,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240..."
2,#213,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nCo..."
3,#214,PANTS,"Main: Canvas+, 60% Cotton, 40% Polyester, 340 ..."
4,#312,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240..."


There are 3 columns:
- one containing the product ID, which should be unique (to check)
- one with the category
- one with the care label, which contains more information but apparently no natural language

In [3]:
# Check shape
file.shape

(573, 3)

In [4]:
# Check if IDs are unique -> yes (the count matches the number of rows)
file.product_id.nunique()

573

In [5]:
# See what the categories are
file.product_category.sort_values().unique()

array(['ACCESSORY/BELT', 'ACCESSORY/CAP-HAT', 'ACCESSORY/HEADBAND',
       'ACCESSORY/KEYCHAIN', 'ACCESSORY/KNEEPAD', 'ACCESSORY/KNIT-CAP',
       'ACCESSORY/MASK', 'ACCESSORY/PHONE-CASE', 'ACCESSORY/SCARF',
       'ACCESSORY/SUN-HAT', 'ACCESSORY/WALLET', 'BAG/LARGE', 'BAG/MEDIUM',
       'GLOVES', 'HOSIERY/LEGGINGS', 'HOSIERY/SOCKS', 'JACKET',
       'JACKET/COAT', 'PANTS', 'PANTS/SHORTS', 'SHIRT', 'SWEATER',
       'SWEATER/HOODIE', 'TSHIRT', 'TSHIRT/LONG-SLEEVE',
       'TSHIRT/TANK-TOP', 'UNDERWEAR/BOXERS', 'UNDERWEAR/BRA',
       'UNDERWEAR/PANTIES', 'UNKNOWN', 'WORKWEAR/COVERALL'], dtype=object)

- "Clean" categories: all uppercase, consistent spelling (no typos, no plural/singular differences)
- Some categories have subcategories, others do not
- One UNKNOWN category
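
Since the subcategory is appended after a "/", the two levels can be separated with a string split. A minimal sketch on a toy sample of the values above; the output column names `category` and `subcategory` are assumptions, not names from the file:

```python
import pandas as pd

# Toy sample of category values seen above (not the real file).
categories = pd.Series(
    ["PANTS", "PANTS/SHORTS", "ACCESSORY/BELT", "UNKNOWN"],
    name="product_category",
)

# Split on the first "/" only; rows without "/" get a missing subcategory.
split = categories.str.split("/", n=1, expand=True)
split.columns = ["category", "subcategory"]  # hypothetical column names
print(split)
```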

In [6]:

groupby_product = file.groupby('product_category').size()
groupby_product

product_category
ACCESSORY/BELT           16
ACCESSORY/CAP-HAT         9
ACCESSORY/HEADBAND        1
ACCESSORY/KEYCHAIN        1
ACCESSORY/KNEEPAD         8
ACCESSORY/KNIT-CAP       17
ACCESSORY/MASK            5
ACCESSORY/PHONE-CASE     12
ACCESSORY/SCARF           1
ACCESSORY/SUN-HAT         2
ACCESSORY/WALLET          4
BAG/LARGE                 1
BAG/MEDIUM                2
GLOVES                   22
HOSIERY/LEGGINGS          8
HOSIERY/SOCKS            27
JACKET                  108
JACKET/COAT               4
PANTS                   139
PANTS/SHORTS             26
SHIRT                    38
SWEATER                   3
SWEATER/HOODIE           35
TSHIRT                   51
TSHIRT/LONG-SLEEVE       19
TSHIRT/TANK-TOP           1
UNDERWEAR/BOXERS          3
UNDERWEAR/BRA             1
UNDERWEAR/PANTIES         1
UNKNOWN                   5
WORKWEAR/COVERALL         3
dtype: int64

Some categories contain only one item (e.g. ACCESSORY/HEADBAND, BAG/LARGE).

## Unknown category investigation

In [7]:
# check if unknown category items have other information available
file[file.product_category=="UNKNOWN"]

Unnamed: 0,product_id,product_category,care_label
437,#9082,UNKNOWN,"100% Cotton, 340 g/m²."
539,#9700,UNKNOWN,100% ripstop polyamide
540,#9716,UNKNOWN,100% ripstop polyamide
541,#9736,UNKNOWN,100% ripstop polyamide
548,#9762,UNKNOWN,"100% PVC, conforms to EN 15777"


- The other columns are correctly filled in
- The care label could help us work out what type of item it is by comparing it with the labels of known items

In [8]:
file[file.care_label=='100% ripstop polyamide']

Unnamed: 0,product_id,product_category,care_label
537,#9623,BAG/MEDIUM,100% ripstop polyamide
538,#9626,BAG/MEDIUM,100% ripstop polyamide
539,#9700,UNKNOWN,100% ripstop polyamide
540,#9716,UNKNOWN,100% ripstop polyamide
541,#9736,UNKNOWN,100% ripstop polyamide


Items #9700, #9716 and #9736 could be bags, since they share their label with two BAG/MEDIUM items. This should be confirmed by the data supplier. The other UNKNOWN items should also be clarified.
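
The same comparison can be done for all UNKNOWN items at once by joining them to the known items on the exact label text. A rough sketch on toy rows copied from the outputs above; the `candidate_category` column name is an assumption, and an exact string match will miss labels that differ only in case or punctuation:

```python
import pandas as pd

# Toy rows mirroring the outputs above (not the full file).
df = pd.DataFrame({
    "product_id": ["#9623", "#9626", "#9700", "#9716", "#9082"],
    "product_category": ["BAG/MEDIUM", "BAG/MEDIUM", "UNKNOWN", "UNKNOWN", "UNKNOWN"],
    "care_label": ["100% ripstop polyamide"] * 4 + ["100% Cotton, 340 g/m²."],
})

is_unknown = df.product_category == "UNKNOWN"

# Distinct (label, category) pairs among the items with a known category.
known = (
    df.loc[~is_unknown, ["care_label", "product_category"]]
    .drop_duplicates()
    .rename(columns={"product_category": "candidate_category"})  # hypothetical name
)

# Left join keeps every UNKNOWN item; labels with no match get NaN.
candidates = df.loc[is_unknown, ["product_id", "care_label"]].merge(
    known, on="care_label", how="left"
)
print(candidates)
```

Any candidate found this way is only a hint and still needs confirmation from the data supplier.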

In [9]:
# Check if items can be identified thanks to their label 

groupby_product = pd.DataFrame(groupby_product,columns = ["item_count"])
groupby_product['label_count'] = file.groupby(['product_category'])['care_label'].nunique()
groupby_product

Unnamed: 0_level_0,item_count,label_count
product_category,Unnamed: 1_level_1,Unnamed: 2_level_1
ACCESSORY/BELT,16,14
ACCESSORY/CAP-HAT,9,9
ACCESSORY/HEADBAND,1,1
ACCESSORY/KEYCHAIN,1,1
ACCESSORY/KNEEPAD,8,7
ACCESSORY/KNIT-CAP,17,15
ACCESSORY/MASK,5,5
ACCESSORY/PHONE-CASE,12,5
ACCESSORY/SCARF,1,1
ACCESSORY/SUN-HAT,2,2


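In most of the categories shown there are almost as many distinct labels as items (ACCESSORY/PHONE-CASE, with 5 labels for 12 items, is an exception). So knowing the label doesn't really help to identify the article.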
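
To make this comparison easier to read, the ratio of distinct labels to items can be computed per category. A small sketch using a few of the counts displayed above; the `label_ratio` column is an illustrative name:

```python
import pandas as pd

# A few (item_count, label_count) pairs copied from the output above.
counts = pd.DataFrame(
    {"item_count": [16, 9, 8, 17, 12], "label_count": [14, 9, 7, 15, 5]},
    index=[
        "ACCESSORY/BELT",
        "ACCESSORY/CAP-HAT",
        "ACCESSORY/KNEEPAD",
        "ACCESSORY/KNIT-CAP",
        "ACCESSORY/PHONE-CASE",
    ],
)

# A ratio close to 1 means nearly every item has its own label.
counts["label_ratio"] = counts.label_count / counts.item_count
print(counts.sort_values("label_ratio"))
```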

# Care label investigation

In [10]:
# Show the full label text without truncation (None is accepted from pandas 1.0 on)
pd.set_option('display.max_colwidth', None)
file.care_label.head(20)

0     Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².                                                                                                                                                                                                        
1     Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m².\nReinforcement: 100% CORDURA®-Polyamide.                                                                                                                                                                                                                                                            
2     Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton, 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².                                                                                                                                                

The first 20 rows (at least) follow the same structure:
- Composition of the main part ("Main" + ":", followed by the percentages and names of the materials)
- Composition of the other parts, each introduced by its name (f"{other_part}" + ":", followed by the percentages and names of the materials)
- Weight is indicated for each part, though sometimes not for all parts, and not always in the same unit of measure (g/m2, gr)
- Some material names are registered trademarks (e.g. CORDURA®)
- Sometimes the country (of production?) is indicated
- No natural language, only composition
- Mixed upper and lower case
- Line-break escape characters (\n)
- Multiple consecutive spaces
- All in English
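
The "part: percentages and materials, weight" structure described above lends itself to a regex-based parse. The following is only a sketch under strong assumptions: one part per line, weights written as g/m² or g/m2, and hypothetical names (`parse_label`, `COMPONENT_RE`, `WEIGHT_RE`). Labels that put several parts on one line would need extra splitting.

```python
import re

label = (
    "Main: 40% Cotton, 60% Polyester, 290 g/m².\n"
    "Contrast: 53% Cotton 47% Polyester, 290 g/m².\n"
    "Reinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m²."
)

# A percentage followed by a material name (up to the next comma or digit).
COMPONENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*([^,\d]+)")
# A weight such as "290 g/m²" or "218 g/m2".
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*g/m[²2]")


def parse_label(text):
    """Parse a label with one 'part: composition' entry per line."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, sep, body = line.partition(":")
        if not sep:  # no part name given: assume the main part
            name, body = "main", line
        composition = [
            (float(pct), material.strip(" ."))
            for pct, material in COMPONENT_RE.findall(body)
        ]
        weight = WEIGHT_RE.search(body)
        parts.append({
            "part": name.strip().lower(),
            "composition": composition,
            "weight_g_m2": float(weight.group(1)) if weight else None,
        })
    return parts


parsed = parse_label(label)
for part in parsed:
    print(part)
```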

In [11]:
file.care_label.tail(20)

553    100% leather                                                                                                                                                                 
554    100% Nylon, 100% CORDURA®-Polyamide.\nLeather pouches.                                                                                                                       
555    100% Nylon,100% Polyamide.                                                                                                                                                   
556    100% Leather.                                                                                                                                                                
557    100% Polyamide.                                                                                                                                                              
558    31% polyester, 28% modacrylic, 20% Aramid Kermel®, 20% CV FR, 1% antistatic, 320 g/m2.  

The last 20 rows (at least) do not follow the same structure as the first rows:
- They don't follow a specific format (main part definition, then other part definitions, each followed by ":")
- There is a bit of natural language, e.g. "Belt in 100% Nylon, Pouches 100% Polyamide."
- Mixed upper and lower case
- They still have percentages followed by the name of the material

In [12]:
# display other lines
for i in range(0,500,40):
    print("---")
    print(i)
    print(file.iloc[i,2])

---
0
Main: 40% Cotton, 60% Polyester, 290 g/m².
Contrast: 53% Cotton 47% Polyester, 290 g/m².
Reinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².
---
40
Main: 100% polyester, 137 g/m². Lining: 100% solution dyed polyamide, 65 g/m². Pocket lining: 100% polyester, 69 g/m²
---
80
100% Cotton, Col. 2800: 95% Cotton, 5% Viscose 160 g/m².
---
120
65% Polyester, 35% Cotton 220 g/m².
---
160
Colour 0400, 1604, 3904, 9504: 60% cotton, 40% polyester, 400 g/m². Colour 2804: Main: 55% cotton, 45% polyester, 400 g/m². Contrast: 60% cotton, 40% polyester, 400 g/m².
---
200
Main: 88% polyester, 12% elastane, 218 g/m2.


---
240
Main: 63% polyester, 37% cotton, 360 g/m². Reinforcement 1: 100% polyamide CORDURA®, 205 g/m². Reinforcement 2: 100% polyester CORDURA®, 320 g/m².
---
280
Main: 55% Protal, 44% Cotton, 1% Antistatic, 275 gram. Reinforcement: 39% Modacrylic, 28% CORDURA®, 17% Cotton, 15% Aramid, 1% Antistatic, 270 gram.
---
320
Main: 100% polyester; 200 g/m2.Contrast: 88% polyamide CORDURA®

In addition, colors can be indicated. They seem to be introduced by 'color', 'colour' or 'col.' (to verify), followed by the color IDs (4 digits each).
The composition can differ depending on the color.

# Information we can retrieve from the file

- item ID
- category
- subcategory
- color
- part (main, reinforcement, ...), defaulting to main if no information is given
- composition: percentage, material and country of origin
- weight

But first the text needs to be cleaned:
- remove extra spaces
- remove line-break escapes
- remove small words
- lowercase all text
- normalize the unit of measure for weight
- normalize the 'color' identifier
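
Several of these cleaning steps can be sketched with plain regular expressions. The helper below is hypothetical (`clean_label` is not part of `process_labels`) and covers only whitespace, line breaks, lowercasing and weight units. It assumes that "gram" and "gr" after a number mean g/m², which should be verified with the data supplier. Small-word removal and color identifiers are left out.

```python
import re

# A number followed by one of the weight unit spellings seen in the labels.
# The lookahead avoids matching the start of a longer word such as "grey".
UNIT_RE = re.compile(r"(\d+)\s*(?:g/m[2²]|gram|gr)(?![a-z])")


def clean_label(text):
    """Hypothetical cleaning helper: whitespace, case and weight units only."""
    text = re.sub(r"\s+", " ", text)  # line breaks, tabs, repeated spaces
    text = text.strip().lower()
    # ASSUMPTION: "gram" and "gr" are treated as g/m².
    return UNIT_RE.sub(r"\1 g/m²", text)


print(clean_label("Main: 88% Polyester,  12% elastane, 218 g/m2.\n\n"))
print(clean_label("Main: 55% Protal, 44% Cotton, 1% Antistatic, 275 gram."))
```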


In [13]:
# Check all color identifiers
lowered_label = lower_dataframe(file[['care_label']])
[el for el in lowered_label.care_label.to_list() if 'col' in el]

['main: 100% polyester, 230 g/m². lining: 100% solution dyed polyamide 65 g/m². padding: 100% polyester 120 g/m². pocket lining: 100% polyester, 215 g/m². collar lining: 100% polyester, 350 g/m²',
 'main: 47% cotton, 53% polyester, 237 g/m².  contrast: 91.5% polyamide, 8.5% elastane, 250 g/m².   colour 0904; main: 61% polyester 39% sorona® polyester, 252 g/m².  contrast: 91.5% polyamide, 8.5% elastane, 250 g/m²',
 'color 9567: main: 100% polyester, 140 g/m².',
 'col 0400, 0900, 5800 and 9500: 100% cotton, 160 g/m2. \ncol 2800: 95% cotton, 5% viscose, 160 g/m2.\n',
 '100% cotton, col. 2800: 95% cotton, 5% viscose 160 g/m².',
 'color 9567: main: 100% polyester, 140 g/m².',
 'col 0400, 5800 and 9500: 100% cotton, 160 g/m2. \ncol 2800: 95% cotton, 5% viscose, 160 g/m2.\n',
 'colors: 0400, 0600, 4000, 9500: 100% cotton, 200 g/m². \ncolor: 3400: 65% polyester, 35% cotton.\ncolor: 2800: 95% cotton, 5% viscose',
 'col 0400 and 3100: 100% cotton, 200 g/m2. \ncol 2800: 95% cotton, 5 % viscose 20

"color" identifiers: col, col., color, colors, colour, colours
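
One possible way to normalize these variants is a single regular expression that rewrites each of them to "color" when it is followed by a 4-digit color ID. This is a sketch with hypothetical names (`COLOR_ID_RE`, `normalize_color_identifier`); the 4-digit lookahead is an assumption taken from the examples above and keeps words such as "collar" untouched.

```python
import re

# Matches col, col., color, colors, colour, colours (optionally followed by ":")
# only when a 4-digit color ID comes next.
COLOR_ID_RE = re.compile(
    r"\bcol(?:ou?rs?)?\b\.?\s*:?\s*(?=\d{4})",
    flags=re.IGNORECASE,
)


def normalize_color_identifier(text):
    """Rewrite every color identifier variant to 'color '."""
    return COLOR_ID_RE.sub("color ", text)


samples = [
    "col. 2800: 95% cotton, 5% viscose 160 g/m².",
    "colors: 0400, 0600, 4000, 9500: 100% cotton, 200 g/m².",
    "colour 0904; main: 61% polyester",
    "collar lining: 100% polyester, 350 g/m²",
]
for sample in samples:
    print(normalize_color_identifier(sample))
```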