# Week 4 Notebook


```
Level 1: Query Classification

Your first task is to generate training data that fastText can learn from. You’ve already worked with examples of parsing the category tree as XML, pruning the category tree to a maximum depth, and mapping the leaf category ids of queries to ancestor categories. Feel free to review these in the reading material above!
```

## Task 1: Prune the category taxonomy.
Start by cleaning the queries (normalizing and stemming) before rolling categories up.

In [158]:
import pandas as pd
df = pd.read_csv('/workspace/datasets/train.csv')[['category', 'query']]
df.head()

Unnamed: 0,category,query
0,abcat0101001,Televisiones Panasonic 50 pulgadas
1,abcat0101001,Sharp
2,pcmcat193100050014,nook
3,abcat0101001,rca
4,abcat0101005,rca


### Write a stemmer for words

In [3]:
import utilities.functions as fn

cleaned_queries = []

for index, row in df.head(20).iterrows():
    query = row["query"]
    #
    # Clean the queries using shared function
    #
    normalized_query = fn.clean_query(query)
    print('query="{0}" clean="{1}"'.format(query, normalized_query))
    #
    # Create new rows for cleaned queries
    #
    new_row = {"query": normalized_query, "category": row["category"]}
    cleaned_queries.append(new_row)
    
cleaned_queries_df = pd.DataFrame(cleaned_queries)

query="Televisiones Panasonic  50 pulgadas" clean="television panason 50 pulgada"
query="Sharp" clean="sharp"
query="nook" clean="nook"
query="rca" clean="rca"
query="rca" clean="rca"
query="Flat screen tvs" clean="flat screen tv"
query="macbook" clean="macbook"
query="Blue tooth headphones" clean="blue tooth headphon"
query="Tv antenna" clean="tv antenna"
query="memory card" clean="memori card"
query="AC power cord" clean="ac power cord"
query="Zagg iPhone" clean="zagg iphon"
query="Watch The Throne" clean="watch the throne"
query="Remote control extender" clean="remot control extend"
query="Camcorder" clean="camcord"
query="3ds" clean="3d"
query="hoya" clean="hoya"
query="wireless headphones" clean="wireless headphon"
query="wireless headphones" clean="wireless headphon"
query="Samsung 40" clean="samsung 40"
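
The body of `fn.clean_query` is not shown in this notebook. Here is a minimal sketch of the normalization half, assuming it lowercases the query, replaces non-alphanumeric characters with spaces, and collapses whitespace. The names `clean_query_sketch`, `stem`, and `strip_plural_s` are hypothetical. Outputs such as `memori` and `headphon` above look like the work of a Porter- or Snowball-style stemmer, so the stemming step is left as a pluggable callable with a toy stand-in.

```python
import re

def clean_query_sketch(query, stem=None):
    """Lowercase, drop punctuation, collapse whitespace, optionally stem tokens."""
    # Assumption: ASCII-only normalization; anything else becomes a space.
    text = re.sub(r"[^a-z0-9]+", " ", query.lower())
    tokens = text.split()
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return " ".join(tokens)

def strip_plural_s(token):
    # Toy stand-in for a real stemmer: drop one trailing "s" from longer tokens.
    return token[:-1] if len(token) > 2 and token.endswith("s") else token

print(clean_query_sketch("Televisiones  Panasonic 50 pulgadas"))
print(clean_query_sketch("Flat screen tvs", stem=strip_plural_s))
```

Swapping `strip_plural_s` for a real stemmer's per-token function should get closer to the output shown above.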


### Implement the query cleaner and category roll up
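Stemming the queries and rolling sparse categories up to their parents gives a much cleaner dataset: with `--min_queries=200`, the number of distinct categories drops from 1451 to 776.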

In [25]:
%%bash
python ../week4/create_labeled_queries.py --min_queries=200
head -n 10 /workspace/datasets/labeled_query_data.txt
wc -l /workspace/datasets/labeled_query_data.txt

> cleaning_queries
> checking for min_queries
> original unique categories=1451
> final categories=776
__label__abcat0102003 blu ray player
__label__abcat0201011 ipod
__label__cat02015 tae guk gi
__label__pcmcat180400050000 canon camera
__label__pcmcat209000050008 hp touchpad
__label__cat02002 the sim
__label__pcmcat209000050007 dryer
__label__abcat0101001 lcd tv
__label__abcat0403000 gopro
__label__pcmcat183800050007 usb car adapt
994279 /workspace/datasets/labeled_query_data.txt
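
Each line of the labeled file is a `__label__`-prefixed category id followed by the cleaned query, which is the supervised input format fastText reads by default. A small sketch of that formatting step follows; the helper name `to_fasttext_line` is hypothetical and is not taken from `create_labeled_queries.py`.

```python
def to_fasttext_line(category, query):
    """Format one example as '__label__<category> <query>'."""
    return "__label__{0} {1}".format(category, query)

# Two (category, query) pairs copied from the output above.
rows = [("abcat0102003", "blu ray player"), ("abcat0201011", "ipod")]
lines = [to_fasttext_line(category, query) for category, query in rows]
print("\n".join(lines))
```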


## Task 2: Train a query classifier.
Use the labeled data to build a fastText model.

In [27]:
%%bash
./create_qu_model.sh

Training data
__label__abcat0102003 blu ray player
__label__abcat0201011 ipod
__label__cat02015 tae guk gi
__label__pcmcat180400050000 canon camera
__label__pcmcat209000050008 hp touchpad
__label__cat02002 the sim
__label__pcmcat209000050007 dryer
__label__abcat0101001 lcd tv
__label__abcat0403000 gopro
__label__pcmcat183800050007 usb car adapt
Test data
__label__pcmcat174700050005 star war
__label__cat09000 tmnt
__label__cat02719 t pain
__label__abcat0515028 laptop case
__label__cat02015 land befor time
__label__cat02015 barbi
__label__abcat0807001 epson photo
__label__pcmcat180400050000 digit camera
__label__pcmcat158900050018 lcd projector
__label__pcmcat246100050002 bluetooth


Read 0M words
Number of words:  683
Number of labels: 672
Progress: 100.0% words/sec/thread:   28668 lr:  0.000000 avg.loss:  1.611424 ETA:   0h 0m 0s


p@1 test
N	9941
P@1	0.435
R@1	0.435
p@5 test
N	9941
P@5	0.125
R@5	0.623
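
P@5 is much lower than P@1 because each query has a single true label. If the model returns exactly k predictions per query, at most one of them can be correct, so P@k is roughly R@k divided by k (0.623 / 5 is about 0.125, matching the numbers above). The sketch below illustrates that relationship on toy data; the function name and the toy labels are made up for illustration.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k):
    """Compute P@k and R@k when every example has exactly one true label."""
    hits = 0
    predicted = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        predicted += len(top_k)
        if truth in top_k:
            hits += 1
    # Precision divides by predictions made, recall by true labels available.
    return hits / predicted, hits / len(true_labels)

truth = ["a", "b", "c", "d"]
ranked = [["a", "b"], ["c", "b"], ["a", "b"], ["d", "a"]]
print(precision_recall_at_k(truth, ranked, 1))
print(precision_recall_at_k(truth, ranked, 2))
```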


## Updating labels
To get higher precision and recall, I regenerated the training data with a larger `--min_queries` threshold, so a category needs more queries before it is kept as a label.

In [28]:
%%bash
python ../week4/create_labeled_queries.py --min_queries=1000
head -n 10 /workspace/datasets/labeled_query_data.txt
wc -l /workspace/datasets/labeled_query_data.txt

> cleaning_queries
> checking for min_queries
> original unique categories=1453
> final categories=502
__label__abcat0811004 g2 batteri
__label__pcmcat162100050040 virgin mobil
__label__abcat0208007 lcd
__label__pcmcat186100050006 extern hard drive
__label__pcmcat174700050005 age of empir
__label__abcat0511004 wireless printer
__label__pcmcat209000050007 ipad
__label__cat02716 ugk
__label__pcmcat156300050010 fridg
__label__pcmcat247400050001 macbook
994302 /workspace/datasets/labeled_query_data.txt
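
One way to think about the `--min_queries` threshold is as a loop that keeps replacing under-populated categories with their parents until every remaining label has enough queries. This is only a sketch of that idea, not the code in `create_labeled_queries.py`. The function name `roll_up`, the `parent_of` mapping, and the toy category ids are assumptions; it also assumes the parent links form a tree with no cycles.

```python
from collections import Counter

def roll_up(labels, parent_of, min_queries, root="cat00000"):
    """Replace categories with fewer than min_queries examples by their parent."""
    labels = list(labels)
    while True:
        counts = Counter(labels)
        # Categories below the threshold that still have somewhere to go.
        small = {c for c, n in counts.items()
                 if n < min_queries and c != root and c in parent_of}
        if not small:
            return labels
        labels = [parent_of[c] if c in small else c for c in labels]

parent_of = {"leaf_a": "mid", "leaf_b": "mid", "leaf_c": "mid", "mid": "cat00000"}
labels = ["leaf_a"] * 3 + ["leaf_b"] + ["leaf_c"] * 2
print(Counter(roll_up(labels, parent_of, min_queries=3)))
```

Because the counts are recomputed on every pass, queries from several small siblings can add up to a parent that clears the threshold.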


In [30]:
%%bash
./create_qu_model.sh

Training data
__label__abcat0811004 g2 batteri
__label__pcmcat162100050040 virgin mobil
__label__abcat0208007 lcd
__label__pcmcat186100050006 extern hard drive
__label__pcmcat174700050005 age of empir
__label__abcat0511004 wireless printer
__label__pcmcat209000050007 ipad
__label__cat02716 ugk
__label__pcmcat156300050010 fridg
__label__pcmcat247400050001 macbook
Test data
__label__abcat0206001 zx99i
__label__abcat0503002 router
__label__cat02015 big bang theori season four
__label__abcat0401004 olympu camera
__label__abcat0508018 roxio
__label__abcat0701001 xbox 360 consul
__label__cat02015 mont carlo
__label__pcmcat247400050000 hp laptop
__label__pcmcat209000050008 htc flyer
__label__cat02015 the big bang theori


Read 0M words
Number of words:  733
Number of labels: 429
Progress: 100.0% words/sec/thread:   29430 lr:  0.000000 avg.loss:  1.777434 ETA:   0h 0m 0s


p@1 test
N	9966
P@1	0.451
R@1	0.451
p@5 test
N	9966
P@5	0.129
R@5	0.645


In [31]:
df.head()

Unnamed: 0,category,query
0,abcat0101001,Televisiones Panasonic 50 pulgadas
1,abcat0101001,Sharp
2,pcmcat193100050014,nook
3,abcat0101001,rca
4,abcat0101005,rca


In [157]:
import xml.etree.ElementTree as ET
import pandas as pd
root_category_id = 'cat00000'

tree = ET.parse('/workspace/datasets/product_data/categories/categories_0001_abcat0010000_to_pcmcat99300050000.xml')
root = tree.getroot()

categories = []
parents = []
for child in root:
    id = child.find('id').text
    cat_path = child.find('path')
    cat_path_ids = [cat.find('id').text for cat in cat_path]
    leaf_id = cat_path_ids[-1]
    if leaf_id != root_category_id:
        categories.append(leaf_id)
        parents.append(cat_path_ids[-2])
parents_df = pd.DataFrame(list(zip(categories, parents)), columns =['category', 'parent'])

print(parents_df)

               category             parent
0          abcat0010000           cat00000
1          abcat0011000       abcat0010000
2          abcat0011001       abcat0011000
3          abcat0011002       abcat0011000
4          abcat0011003       abcat0011000
...                 ...                ...
4634  pcmcat97200050013           cat15205
4635  pcmcat97200050015           cat15063
4636  pcmcat99000050001  pcmcat50000050006
4637  pcmcat99000050002  pcmcat99000050001
4638  pcmcat99300050000           cat15063

[4639 rows x 2 columns]
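
The same `path` element can also be used to prune the tree to a maximum depth, as mentioned in the task description: instead of taking the last id on the path, take the id at the depth cap. The sketch below runs on a tiny inline XML string. Only the `id` and `path` element names come from the cell above; the `categories` and `category` tag names and the function `ancestor_at_depth` are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<categories>
  <category>
    <id>abcat0011001</id>
    <path>
      <category><id>cat00000</id></category>
      <category><id>abcat0010000</id></category>
      <category><id>abcat0011000</id></category>
      <category><id>abcat0011001</id></category>
    </path>
  </category>
</categories>
"""

def ancestor_at_depth(xml_text, max_depth):
    """Map each category id to its ancestor at most max_depth levels below the root."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for node in root:
        path_ids = [step.find("id").text for step in node.find("path")]
        # Index 0 is the root, so keeping max_depth + 1 entries caps the depth.
        pruned = path_ids[: max_depth + 1]
        mapping[path_ids[-1]] = pruned[-1]
    return mapping

print(ancestor_at_depth(SAMPLE_XML, 1))
```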


In [159]:
cat_value_counts = df['category'].value_counts()

print(cat_value_counts)

cat02015              177638
abcat0101001           80213
pcmcat247400050000     79245
pcmcat209000050008     74258
pcmcat144700050004     43991
                       ...  
pcmcat230600050054         1
pcmcat230600050036         1
pcmcat221400050012         1
pcmcat254000050002         1
pcmcat221400050013         1
Name: category, Length: 1540, dtype: int64


In [161]:
test = 'pcmcat230600050054'

def get_category_size(category):
    # Number of queries mapped to this category (0 if it never appears).
    counts = cat_value_counts.to_dict()
    if category in counts:
        print(counts[category])
        return counts[category]
    else:
        return 0

def get_parent_category(frame, category):
    # .item() raises ValueError unless exactly one row matches. parents_df has
    # no row for the root 'cat00000', so looking up its parent fails below.
    return frame[frame['category'] == category]['parent'].item()


# first_min_queries_match(test)

def first_min_queries_match(x):
    # Walk up the tree until a category has more than 2 queries.
    size = get_category_size(x)

    if size > 2:
        return x
    else:
        parent = get_parent_category(parents_df, x)
        print('no_match', 'size=', size, 'parent=', parent)
        if parent:
            return first_min_queries_match(parent)
        return x


print("")
print(first_min_queries_match('pcmcat221400050012'))

# def first_min_queries_match(cat):
#     condition = get_category_size(cat) >= 10000
#     print(condition)
#     while not condition:
#         print('swag')
#         parent = get_parent_category(cat)
#         print(parent)
        
        
# first_min_queries_match(test)
#     cat_parent = get_parent_categoy(parents_df, cat)
    
#     return [cat_size, cat_parent]
    
    
#     # while get_category_size(cat) < 10000
#     #     first_min_queries_match(cat)
    





1
no_match size= 1 parent= pcmcat221400050011
no_match size= 0 parent= pcmcat151600050019
no_match size= 0 parent= abcat0207000
no_match size= 0 parent= cat00000


ValueError: can only convert an array of size 1 to a Python scalar
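
The `ValueError` comes from calling `.item()` on an empty selection: the walk reaches `cat00000`, which has no row in `parents_df`. One possible fix, sketched here with plain dicts and made-up category ids, is to treat a missing parent as the end of the walk. The names `first_category_with_min_queries`, `counts`, and `parent_of` are hypothetical. With the real data, the mapping could be built with `dict(zip(parents_df['category'], parents_df['parent']))` and the counts with `cat_value_counts.to_dict()`.

```python
def first_category_with_min_queries(category, counts, parent_of, min_queries):
    """Walk up from category until one has at least min_queries queries."""
    current = category
    while counts.get(current, 0) < min_queries:
        parent = parent_of.get(current)
        if parent is None:
            # No parent entry (for example the root): stop instead of raising.
            return current
        current = parent
    return current

parent_of = {"leaf": "mid", "mid": "top", "top": "cat00000"}
counts = {"leaf": 1, "top": 5}
print(first_category_with_min_queries("leaf", counts, parent_of, 3))
print(first_category_with_min_queries("leaf", counts, parent_of, 10))
```

The loop also avoids deep recursion and rebuilding the counts dict on every call.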

In [131]:
print(cat_value_counts)
cat_value_counts.index.tolist().index('abcat0101001')

cat02015              177638
abcat0101001           80213
pcmcat247400050000     79245
pcmcat209000050008     74258
pcmcat144700050004     43991
                       ...  
pcmcat230600050054         1
pcmcat230600050036         1
pcmcat221400050012         1
pcmcat254000050002         1
pcmcat221400050013         1
Name: category, Length: 1540, dtype: int64


1

In [147]:
# Use a name other than `dict` so the built-in type is not shadowed.
counts = cat_value_counts.to_dict()

if 'cat02015' in counts:
    print('swag')

swag
