# Week 4 Notebook


```
Level 1: Query Classification

Your first task is to generate training data that fastText can learn from. You’ve already worked with examples of parsing the category tree as XML, pruning the category tree to a maximum depth, and mapping the leaf category ids of queries to ancestor categories. Feel free to review these in the above reading material!
```

## Task 1: Prune the category taxonomy.
Start by cleaning the queries: normalize the text and stem each word.

In [158]:
import pandas as pd
df = pd.read_csv('/workspace/datasets/train.csv')[['category', 'query']]
df.head()

Unnamed: 0,category,query
0,abcat0101001,Televisiones Panasonic 50 pulgadas
1,abcat0101001,Sharp
2,pcmcat193100050014,nook
3,abcat0101001,rca
4,abcat0101005,rca


### Write a stemmer for words

In [3]:
import utilities.functions as fn

cleaned_queries = []

for index, row in df.head(20).iterrows():
    query = row["query"]
    # Clean the query using the shared function
    normalized_query = fn.clean_query(query)
    print('query="{0}" clean="{1}"'.format(query, normalized_query))
    # Keep the cleaned query alongside its original category
    cleaned_queries.append({"query": normalized_query, "category": row["category"]})

cleaned_queries_df = pd.DataFrame(cleaned_queries)

query="Televisiones Panasonic  50 pulgadas" clean="television panason 50 pulgada"
query="Sharp" clean="sharp"
query="nook" clean="nook"
query="rca" clean="rca"
query="rca" clean="rca"
query="Flat screen tvs" clean="flat screen tv"
query="macbook" clean="macbook"
query="Blue tooth headphones" clean="blue tooth headphon"
query="Tv antenna" clean="tv antenna"
query="memory card" clean="memori card"
query="AC power cord" clean="ac power cord"
query="Zagg iPhone" clean="zagg iphon"
query="Watch The Throne" clean="watch the throne"
query="Remote control extender" clean="remot control extend"
query="Camcorder" clean="camcord"
query="3ds" clean="3d"
query="hoya" clean="hoya"
query="wireless headphones" clean="wireless headphon"
query="wireless headphones" clean="wireless headphon"
query="Samsung 40" clean="samsung 40"
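
`fn.clean_query` lives in the shared `utilities.functions` module and is not shown in this notebook. Judging only by the output above, it lowercases the query, collapses extra whitespace and stems each token. Below is a minimal sketch of a cleaner of that kind, not the actual helper. The name `clean_query_sketch`, the pluggable `stem` argument and the removal of punctuation are all assumptions, and the toy `strip_plural_s` only stands in for a real stemmer (for example a Snowball or Porter stemmer), so its stems will not match the output above in general.

```python
import re

def strip_plural_s(token):
    # Toy stand-in for a real stemmer: drop one trailing "s" from longer tokens.
    return token[:-1] if len(token) > 2 and token.endswith("s") else token

def clean_query_sketch(query, stem=lambda token: token):
    # Lowercase, then turn every run of non-alphanumeric characters into a space.
    text = re.sub(r"[^a-z0-9]+", " ", query.lower())
    # split() with no arguments also collapses repeated whitespace.
    return " ".join(stem(token) for token in text.split())

print(clean_query_sketch("Televisiones Panasonic  50 pulgadas"))
print(clean_query_sketch("Flat screen tvs", stem=strip_plural_s))
```

Passing the stemmer in as a function keeps the normalization step testable on its own and makes it easy to swap in a library stemmer later.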


### Implement the query cleaner and category roll up
Stemming the queries and rolling sparse categories up into their ancestors gives a much cleaner dataset.

In [25]:
%%bash
python ../week4/create_labeled_queries.py --min_queries=200
head -n 10 /workspace/datasets/labeled_query_data.txt
wc -l /workspace/datasets/labeled_query_data.txt

> cleaning_queries
> checking for min_queries
> original unique categories=1451
> final categories=776
__label__abcat0102003 blu ray player
__label__abcat0201011 ipod
__label__cat02015 tae guk gi
__label__pcmcat180400050000 canon camera
__label__pcmcat209000050008 hp touchpad
__label__cat02002 the sim
__label__pcmcat209000050007 dryer
__label__abcat0101001 lcd tv
__label__abcat0403000 gopro
__label__pcmcat183800050007 usb car adapt
994279 /workspace/datasets/labeled_query_data.txt
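
Each line of `labeled_query_data.txt` above follows fastText's supervised format: a label with the `__label__` prefix, a space, then the text. As a small illustration (this is not the code in `create_labeled_queries.py`, and the helper names are hypothetical), the lines could be produced and parsed like this:

```python
def to_fasttext_line(category, query, label_prefix="__label__"):
    # One training example per line: "<prefix><category> <query text>".
    return f"{label_prefix}{category} {query}"

def from_fasttext_line(line, label_prefix="__label__"):
    # Split off the first token as the label and keep the rest as the query.
    label, _, query = line.strip().partition(" ")
    return label[len(label_prefix):], query

rows = [("abcat0102003", "blu ray player"), ("abcat0201011", "ipod")]
lines = [to_fasttext_line(category, query) for category, query in rows]
print("\n".join(lines))
```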


## Task 2: Train a query classifier.
Use the labeled data to build a fastText model.

In [27]:
%%bash
./create_qu_model.sh

Training data
__label__abcat0102003 blu ray player
__label__abcat0201011 ipod
__label__cat02015 tae guk gi
__label__pcmcat180400050000 canon camera
__label__pcmcat209000050008 hp touchpad
__label__cat02002 the sim
__label__pcmcat209000050007 dryer
__label__abcat0101001 lcd tv
__label__abcat0403000 gopro
__label__pcmcat183800050007 usb car adapt
Test data
__label__pcmcat174700050005 star war
__label__cat09000 tmnt
__label__cat02719 t pain
__label__abcat0515028 laptop case
__label__cat02015 land befor time
__label__cat02015 barbi
__label__abcat0807001 epson photo
__label__pcmcat180400050000 digit camera
__label__pcmcat158900050018 lcd projector
__label__pcmcat246100050002 bluetooth


Read 0M words
Number of words:  683
Number of labels: 672
Progress: 100.0% words/sec/thread:   28668 lr:  0.000000 avg.loss:  1.611424 ETA:   0h 0m 0s


p@1 test
N	9941
P@1	0.435
R@1	0.435
p@5 test
N	9941
P@5	0.125
R@5	0.623


## Updating labels
To try for higher precision and recall, I regenerated my training data with a larger `--min_queries` value (1000 instead of 200).

In [28]:
%%bash
python ../week4/create_labeled_queries.py --min_queries=1000
head -n 10 /workspace/datasets/labeled_query_data.txt
wc -l /workspace/datasets/labeled_query_data.txt

> cleaning_queries
> checking for min_queries
> original unique categories=1453
> final categories=502
__label__abcat0811004 g2 batteri
__label__pcmcat162100050040 virgin mobil
__label__abcat0208007 lcd
__label__pcmcat186100050006 extern hard drive
__label__pcmcat174700050005 age of empir
__label__abcat0511004 wireless printer
__label__pcmcat209000050007 ipad
__label__cat02716 ugk
__label__pcmcat156300050010 fridg
__label__pcmcat247400050001 macbook
994302 /workspace/datasets/labeled_query_data.txt


In [39]:
%%bash
./create_qu_model.sh

Training data
__label__abcat0101001 lcd tv
__label__pcmcat158900050018 projector
__label__abcat0403004 flip video camera
__label__cat02015 darker than black
__label__cat02015 appl
__label__cat02015 fast and furiou
__label__abcat0515028 carri case for laptop
__label__abcat0101001 42 panason plasma
__label__pcmcat232900050017 metal gear
__label__abcat0201011 samsung galaxi mp3
Test data
__label__cat02015 make the grade
__label__abcat0504010 usb memori
__label__abcat0703002 star war 3
__label__abcat0208011 bose portabl
__label__pcmcat218000050003 ipod case
__label__pcmcat247400050000 2398896 2402035 5386263 5386272 6804112 8579932 8589878 9374278 9650424
__label__pcmcat186400050002 camera
__label__pcmcat231700050017 googl tv
__label__abcat0101001 lcd tv
__label__pcmcat253700050020 kiss


Read 0M words
Number of words:  712
Number of labels: 426
Progress: 100.0% words/sec/thread:    9359 lr:  0.000000 avg.loss:  1.474405 ETA:   0h 0m 0s


p@1 test
N	9952
P@1	0.446
R@1	0.446
p@5 test
N	9952
P@5	0.128
R@5	0.64


In [31]:
df.head()

Unnamed: 0,category,query
0,abcat0101001,Televisiones Panasonic 50 pulgadas
1,abcat0101001,Sharp
2,pcmcat193100050014,nook
3,abcat0101001,rca
4,abcat0101005,rca


In [157]:
import xml.etree.ElementTree as ET
import pandas as pd
root_category_id = 'cat00000'

tree = ET.parse('/workspace/datasets/product_data/categories/categories_0001_abcat0010000_to_pcmcat99300050000.xml')
root = tree.getroot()

categories = []
parents = []
for child in root:
    # Each category lists its full path from the root down to itself
    cat_path = child.find('path')
    cat_path_ids = [cat.find('id').text for cat in cat_path]
    leaf_id = cat_path_ids[-1]
    # The root category has no parent, so skip it
    if leaf_id != root_category_id:
        categories.append(leaf_id)
        parents.append(cat_path_ids[-2])
parents_df = pd.DataFrame(list(zip(categories, parents)), columns=['category', 'parent'])

print(parents_df)

               category             parent
0          abcat0010000           cat00000
1          abcat0011000       abcat0010000
2          abcat0011001       abcat0011000
3          abcat0011002       abcat0011000
4          abcat0011003       abcat0011000
...                 ...                ...
4634  pcmcat97200050013           cat15205
4635  pcmcat97200050015           cat15063
4636  pcmcat99000050001  pcmcat50000050006
4637  pcmcat99000050002  pcmcat99000050001
4638  pcmcat99300050000           cat15063

[4639 rows x 2 columns]
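
With a child-to-parent mapping like the one above, pruning a category to a maximum depth comes down to walking up to the root and picking the ancestor at the wanted level. Here is a minimal sketch using a plain dict, assuming the mapping forms a tree that always reaches `cat00000`. The function names are hypothetical, and the tiny `parent_of` dict just copies three rows from the output above.

```python
def path_from_root(category, parent_of, root="cat00000"):
    # Follow parent links upward, then reverse to get root-first order.
    # Raises KeyError if a category is missing from parent_of.
    path = [category]
    while path[-1] != root:
        path.append(parent_of[path[-1]])
    path.reverse()
    return path

def prune_to_depth(category, parent_of, max_depth, root="cat00000"):
    # Depth 0 is the root; categories shallower than max_depth are unchanged.
    path = path_from_root(category, parent_of, root)
    return path[min(max_depth, len(path) - 1)]

parent_of = {
    "abcat0010000": "cat00000",
    "abcat0011000": "abcat0010000",
    "abcat0011001": "abcat0011000",
}
print(path_from_root("abcat0011001", parent_of))
print(prune_to_depth("abcat0011001", parent_of, max_depth=2))
```

A dict lookup is also much cheaper than filtering `parents_df` for every step up the tree.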


In [159]:
cat_value_counts = df['category'].value_counts()

print(cat_value_counts)

cat02015              177638
abcat0101001           80213
pcmcat247400050000     79245
pcmcat209000050008     74258
pcmcat144700050004     43991
                       ...  
pcmcat230600050054         1
pcmcat230600050036         1
pcmcat221400050012         1
pcmcat254000050002         1
pcmcat221400050013         1
Name: category, Length: 1540, dtype: int64


In [213]:
test = 'pcmcat230600050054'

def get_category_size(x):
    # Number of queries labeled directly with x (descendants are not counted)
    return df.category[df.category == x].count()

def get_parent_category(frame, category):
    # Return the parent id, or False when the category is not in the frame
    if any(frame.category == category):
        return frame[frame['category'] == category]['parent'].item()
    else:
        return False

def first_min_queries_match(x):
    # Walk up the tree until a category has more than 2 queries of its own
    size = get_category_size(x)

    if size > 2:
        return x
    else:
        parent = get_parent_category(parents_df, x)
        print('no_match', 'size=', size, 'parent=', parent)
        if parent:
            return first_min_queries_match(parent)
        return x

print(first_min_queries_match(test))

no_match size= 1 parent= pcmcat230600050007
no_match size= 0 parent= pcmcat230600050006
no_match size= 0 parent= pcmcat273800050017
no_match size= 0 parent= pcmcat242800050021
no_match size= 0 parent= cat00000
no_match size= 0 parent= False
cat00000
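
The trace above climbs all the way to `cat00000` because `get_category_size` only counts queries labeled directly with a category, so every ancestor reports a size of 0. One possible alternative is to relabel the queries pass by pass and re-count after each pass, so that a parent accumulates the queries of its small children. This is a sketch under the assumption that the taxonomy is a tree rooted at `cat00000`; the function name and the toy data are hypothetical, and it is not the code in `create_labeled_queries.py`.

```python
from collections import Counter

def roll_up(labels, parent_of, min_queries, root="cat00000"):
    labels = list(labels)
    while True:
        # Re-count on every pass so parents include queries rolled up so far.
        counts = Counter(labels)
        small = {c for c, n in counts.items() if n < min_queries and c != root}
        if not small:
            return labels
        # Move every query in a too-small category up one level.
        # Categories missing from parent_of fall back to the root.
        labels = [parent_of.get(c, root) if c in small else c for c in labels]

parent_of = {"a1": "a", "a2": "a", "a": "cat00000", "b": "cat00000"}
labels = ["a1", "a2", "a2", "b", "b", "b"]
print(roll_up(labels, parent_of, min_queries=3))
```

In the toy data, `a1` and `a2` are each too small on their own, but together they give their parent `a` enough queries, so the roll-up stops there instead of at the root.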


In [210]:
print(get_category_size('pcmcat209000050008'))

74258


In [147]:
# Avoid shadowing the built-in dict
category_counts = cat_value_counts.to_dict()

if 'cat02015' in category_counts:
    print('swag')

swag


In [42]:
zip(['__label__pcmcat247400050000', '__label__pcmcat164200050013', '__label__pcmcat247400050001', '__label__pcmcat189600050008', '__label__abcat0515025'], [0.68960935, 0.08535343, 0.07896541, 0.03031814, 0.0144454 ])[0]

TypeError: 'zip' object is not subscriptable
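
In Python 3, `zip` returns a lazy iterator rather than a list, which is why indexing it raises the `TypeError` above. A minimal sketch of two ways to get the first (label, score) pair, reusing the same values; the variable names are illustrative.

```python
labels = ['__label__pcmcat247400050000', '__label__pcmcat164200050013',
          '__label__pcmcat247400050001', '__label__pcmcat189600050008',
          '__label__abcat0515025']
scores = [0.68960935, 0.08535343, 0.07896541, 0.03031814, 0.0144454]

# Option 1: materialize the pairs, then index as usual.
pairs = list(zip(labels, scores))
top_label, top_score = pairs[0]

# Option 2: take only the first pair without building the whole list.
first_pair = next(zip(labels, scores))

print(top_label, top_score)
```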