# Week 3 Notes

In [None]:
pip install -r ../requirements_week3.txt

---
## Product Name Stemmer & Tokenizer
First attempt at using the nltk library. The goal is to stem words and clean up product title strings as much as possible.

In [10]:
import nltk
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

names = [
    'Peavey - GT10 10W Guitar Amplifier', 
    'Roland - 20W Guitar Amplifier',
    'PhoneMate - 5.8GHz Expandable Cordless Phone',
    'Whirlpool - 17.5 Cu. Ft. Chest Freezer - White'
]

stemmer = SnowballStemmer('english')

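def transform_name(product_name):
    # Split the title into tokens
    words = word_tokenize(product_name)
    # Keep only alphanumeric tokens (drops '-', '5.8GHz', '17.5', ...)
    new_words = [word for word in words if word.isalnum()]
    # Stem each token; the Snowball stemmer also lowercases
    stemmed_name = " ".join(stemmer.stem(word) for word in new_words)
    return stemmed_name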


for name in names:
    print(name, '=>', transform_name(name))

Peavey - GT10 10W Guitar Amplifier => peavey gt10 10w guitar amplifi
Roland - 20W Guitar Amplifier => roland 20w guitar amplifi
PhoneMate - 5.8GHz Expandable Cordless Phone => phonem expand cordless phone
Whirlpool - 17.5 Cu. Ft. Chest Freezer - White => whirlpool cu ft chest freezer white
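
The results above lose `5.8GHz` and `17.5` entirely. That is consistent with the `word.isalnum()` filter: a token containing a period is not alphanumeric, so the whole token is discarded along with the standalone dashes. A minimal sketch of that behaviour, using a hand-written token list (an assumption about how the tokenizer splits the name, chosen to match the printed output) rather than calling nltk:

```python
# Hypothetical token list: an assumption about how the name is split,
# chosen to be consistent with the printed output above.
tokens = ['PhoneMate', '-', '5.8GHz', 'Expandable', 'Cordless', 'Phone']

# str.isalnum() is True only if every character is a letter or digit,
# so '-' and '5.8GHz' (which contains '.') are both rejected.
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print(kept)
print(dropped)
```

If sizes and frequencies turn out to matter for classification, stripping punctuation from each token instead of dropping the token is one possible alternative.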


---
## Exercise 1 
### Precision and Recall (Category Depth 4)
categoryPath ex. `abcat0712003` = `Board & Puzzle`

In [16]:
%%bash
echo "p@1"
~/fastText-0.9.2/fasttext test /workspace/datasets/categories/model_categories.bin /workspace/datasets/categories/contentCategories.test
echo "p@5"
~/fastText-0.9.2/fasttext test /workspace/datasets/categories/model_categories.bin /workspace/datasets/categories/contentCategories.test 5

p@1
N	9549
P@1	0.526
R@1	0.526
p@5
N	9549
P@5	0.154
R@5	0.771


### Precision and Recall (Category Depth 3)
categoryPath ex. `abcat0705002` = `PSP`

In [18]:
%%bash
echo "p@1"
~/fastText-0.9.2/fasttext test /workspace/datasets/categories/model_categories.bin /workspace/datasets/categories/contentCategories.test
echo "p@5"
~/fastText-0.9.2/fasttext test /workspace/datasets/categories/model_categories.bin /workspace/datasets/categories/contentCategories.test 5

p@1
N	9920
P@1	0.671
R@1	0.671
p@5
N	9920
P@5	0.172
R@5	0.86
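
In both runs P@5 is close to R@5 divided by 5 (0.771 / 5 ≈ 0.154 and 0.86 / 5 = 0.172). That is what one would expect if each test example carries exactly one label (an assumption about this dataset, not something stated in these notes): at most one of the five predictions can be correct. A small sketch of the two metrics on made-up data, not tied to fastText's own implementation:

```python
def precision_recall_at_k(true_labels, predictions, k):
    """Compute P@k and R@k over examples that each have one true label.

    true_labels: list of label strings, one per example.
    predictions: list of ranked label lists, one per example.
    """
    hits = sum(1 for truth, ranked in zip(true_labels, predictions)
               if truth in ranked[:k])
    n = len(true_labels)
    precision = hits / (n * k)   # k predictions are made per example
    recall = hits / n            # one relevant label per example
    return precision, recall


# Made-up labels and predictions, for illustration only.
truth = ['a', 'b', 'c', 'd']
preds = [['a', 'x', 'y'], ['x', 'b', 'y'], ['x', 'y', 'z'], ['d', 'x', 'y']]

p1, r1 = precision_recall_at_k(truth, preds, 1)
p3, r3 = precision_recall_at_k(truth, preds, 3)
print(p1, r1)
print(p3, r3)
```

Under the single-label assumption, P@1 and R@1 are always equal, which matches the outputs above.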


---
## Exercise 2
Derive Synonyms from Content

Below is my working function for cleaning up product names.

In [20]:
import nltk
import string
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
nltk.download('words', quiet=True)
nltk.download('stopwords', quiet=True)

names = [
    'Peavey - GT10 10W Guitar Amplifier', 
    'Roland - 20W Guitar Amplifier',
    'PhoneMate - 5.8GHz Expandable Cordless Phone',
    'Whirlpool - 17.5 Cu. Ft. Chest Freezer - White',
    'Apple - iPhone 13 Pro Max 5G 256GB - Sierra Blue (AT&T)',
    'Bose - Headphones 700 Wireless Noise Cancelling Over-the-Ear Headphones - Triple Black'
]

# Build the stop word set once instead of on every call
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


def transform_name(name):
    tokens = word_tokenize(name)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # keep only purely alphabetic tokens (drops numbers and mixed tokens like '5g')
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    return " ".join(words)


for name in names:
    print(transform_name(name))

peavey guitar amplifier
roland guitar amplifier
phonemate expandable cordless phone
whirlpool cu ft chest freezer white
apple iphone pro max sierra blue
bose headphones wireless noise cancelling overtheear headphones triple black


### Synonyms from Title Model
I used the default fastText skipgram settings and didn't experiment much with tuning.

Build the model
```
~/fastText-0.9.2/fasttext skipgram -input /workspace/datasets/fasttext/titles.txt -output /workspace/datasets/fasttext/title_model
```

Test the model

In [76]:
%%bash
python testSynonyms.py

         queries  score                                                                                                                                                                       neighbors
0          Phone    1.0                                     motorola(0.96), cell(0.95), mobil(0.95), verizon(0.94), earphon(0.94), tmobil(0.94), nocontract(0.94), gophon(0.94), htc(0.94), droid(0.94)
1         Camera    1.0                                                    xs(0.99), vr(0.99), rebel(0.99), dslr(0.99), slr(0.99), nikon(0.99), zoom(0.99), finepix(0.99), cybershot(0.99), sigma(0.99)
2         Laptop    1.0                                  processor(0.98), gateway(0.97), drive(0.97), ideapad(0.96), aspir(0.96), pavilion(0.96), vaio(0.96), ideacentr(0.96), duo(0.96), display(0.96)
3   Refrigerator    1.0                                 thruthedoor(0.99), ice(0.99), freezer(0.99), french(0.99), sidebysid(0.98), order(0.98), cu(0.98), counterdepth(0.98), frostfre(0.98), ft(0.98)
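
`testSynonyms.py` is not shown in these notes. A minimal sketch of how neighbor lists like the ones above could be filtered and formatted, assuming the fastText Python bindings and the 0.93 threshold reported in the project assessment; the function names are hypothetical and not taken from the actual script:

```python
def filter_neighbors(neighbors, threshold=0.93):
    """Keep (score, word) pairs whose similarity is at or above threshold."""
    return [(score, word) for score, word in neighbors if score >= threshold]


def format_neighbors(neighbors):
    """Render neighbors as 'word(0.96), word(0.95)' like the table above."""
    return ", ".join(f"{word}({score:.2f})" for score, word in neighbors)


def synonyms_for(model, query, k=10, threshold=0.93):
    """Look up nearest neighbors for a query in a loaded fastText model.

    `model` is expected to be the object returned by fasttext.load_model;
    its get_nearest_neighbors method returns (score, word) tuples.
    """
    return filter_neighbors(model.get_nearest_neighbors(query, k=k), threshold)


# Illustration with hand-written scores, so no model file is needed.
sample = [(0.96, 'motorola'), (0.95, 'cell'), (0.91, 'case')]
print(format_neighbors(filter_neighbors(sample)))
```

With a real model, `synonyms_for(fasttext.load_model('/workspace/datasets/fasttext/title_model.bin'), 'phone')` would be the likely call, assuming the query is normalized the same way as the training titles.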


---
## Project Assessment
### For classifying product names to categories:

What precision (P@1) were you able to achieve?
- .56

What fastText parameters did you use?
- epoch=25, wordNgrams=2, lr=1.0
- I also tried fastText's autotune option but didn't get it working

How did you transform the product names?
- Used the snowball stemmer, nltk tokenizer

How did you prune infrequent category labels, and how did that affect your precision?
- Used a pandas dataframe to count items in a category
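
The counting step described above could look roughly like this. It is a sketch under assumptions: the column names, the sample rows, and the `min_products` cutoff are all hypothetical, since these notes do not record the actual threshold.

```python
import pandas as pd

# Hypothetical sample data; column names and rows are assumptions.
df = pd.DataFrame({
    'category': ['abcat0712003', 'abcat0705002', 'abcat0705002',
                 'abcat0705002', 'abcat0712003', 'rare_category'],
    'name': ['puzzle a', 'psp game a', 'psp game b',
             'psp case', 'board game', 'one-off item'],
})

min_products = 2  # hypothetical cutoff, not the value used in the project

# Count products per category, then keep rows whose category is frequent enough.
counts = df['category'].value_counts()
frequent = counts[counts >= min_products].index
pruned = df[df['category'].isin(frequent)]

print(sorted(pruned['category'].unique()))
print(len(pruned))
```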

How did you prune the category tree, and how did that affect your precision?
- Using a higher level category increased precision by 16%

### For deriving synonyms from content:
What 20 tokens did you use for evaluation?
- Phone,Camera,Laptop,Refrigerator,Guitar,Dryer,Tv,Subwoofer,Beats,Mouse,Blender,Macbook,Apple,Samsung,Nintendo,Sony,Playstation,Xbox,Hp,Whirpool,Kitchenaid

What fastText parameters did you use?
- Defaults (`skipgram` with no extra flags, as in the build command above)

How did you transform the product names?
- Lowercase, stemmed, removed punctuation, removed stop words

What threshold score did you use?
- .93

What synonyms did you obtain for those tokens?
- See output above


### For integrating synonyms with search:
How did you transform the product names (if different than previously)?

What threshold score did you use?

Were you able to find the additional results by matching synonyms?
