# Creating product categories
In this notebook we cluster the product descriptions in our data. The resulting clusters will serve as product categories.

Steps:
   1. Load data
   2. Clean data
   3. Extract keywords from product descriptions as identifiers for our products
   4. Create a count vectorizer matrix using the keywords
   5. Build clusters using the k-modes algorithm

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

from sklearn.feature_extraction.text import CountVectorizer

## 1. Load data

In [2]:
data = pd.read_csv('../data/data.csv')

In [3]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


## 2. Cleaning the data

We need every product description. First, we use the StockCode variable to drop rows that are not products: we keep only rows whose stock code starts with a digit.

In [18]:
are_product = data['StockCode'].apply(lambda code: code[0].isnumeric())
df = data[are_product]['Description']

In [19]:
df.shape

(538914,)
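The filter above indexes `code[0]` on every value, which would raise if the column held a missing or non-string stock code. A minimal sketch of a vectorized alternative follows, assuming only the column names from the notebook; the rows in `toy` are illustrative, not taken from the data.

```python
import pandas as pd

# Illustrative rows; only the column names come from the notebook.
toy = pd.DataFrame({
    'StockCode': ['85123A', 'POST', 'D', '71053', None],
    'Description': ['WHITE HANGING HEART T-LIGHT HOLDER', 'POSTAGE',
                    'Discount', 'WHITE METAL LANTERN', 'UNKNOWN'],
})

# Vectorized check: does the stock code start with a digit?
# na=False treats a missing code as "not a product" instead of raising.
is_product = toy['StockCode'].str.match(r'\d', na=False)
products = toy.loc[is_product, 'Description']
print(products.tolist())
```

`str.match` anchors the pattern at the start of the string, so it plays the same role as `code[0].isnumeric()` for ASCII digits.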

Next, we drop missing values from the Description column, because it is the feature we use to build the count matrix.
We also drop duplicates.

In [20]:
df.isna().sum()

1439

In [21]:
df = df.dropna().drop_duplicates()

In [22]:
df.isna().sum()

0

In [23]:
df.shape

(4198,)
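`drop_duplicates` compares strings exactly, so two descriptions that differ only in whitespace stay separate (one description sampled later in this notebook ends with a trailing space). The sketch below shows one possible normalization step before deduplicating; the series `raw` is made up for illustration, and whether this matters for the real data is an assumption worth checking.

```python
import pandas as pd

# Made-up descriptions that differ only in whitespace, plus one missing value.
raw = pd.Series(['WOVEN BERRIES CUSHION COVER ',
                 'WOVEN BERRIES CUSHION COVER',
                 None,
                 'WHITE  METAL LANTERN',
                 'WHITE METAL LANTERN'])

# Same steps as in the notebook: whitespace variants survive as distinct rows.
naive = raw.dropna().drop_duplicates()

# Strip the ends and collapse internal runs of whitespace before deduplicating.
normalized = (raw.dropna()
                 .str.strip()
                 .str.replace(r'\s+', ' ', regex=True)
                 .drop_duplicates())

print(len(naive), len(normalized))
```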

In [25]:
df.head()

0     WHITE HANGING HEART T-LIGHT HOLDER
1                    WHITE METAL LANTERN
2         CREAM CUPID HEARTS COAT HANGER
3    KNITTED UNION FLAG HOT WATER BOTTLE
4         RED WOOLLY HOTTIE WHITE HEART.
Name: Description, dtype: object

## 3. Extracting product names from descriptions

Now we extract nouns from the product descriptions, using the NLTK module to tokenize, POS-tag, and stem the words.

In [91]:
sample = df.sample(5)
sample

2897           WOVEN BERRIES CUSHION COVER 
30788    WHITE DOVE HONEYCOMB PAPER GARLAND
15618        SET/6 EAU DE NIL BIRD T-LIGHTS
1419                  DOCTOR'S BAG SOFT TOY
43147          BLACK BAROQUE CARRIAGE CLOCK
Name: Description, dtype: object

In [92]:
sample_string = '. '.join(sample.values).lower()
tokens = nltk.word_tokenize(sample_string)
tokens[:7]

['woven', 'berries', 'cushion', 'cover', '.', 'white', 'dove']

In [99]:
# Tag each token from the previous cell with its part of speech.
tagged = nltk.pos_tag(tokens)
tagged

[('woven', 'JJ'),
 ('berries', 'NNS'),
 ('cushion', 'NN'),
 ('cover', 'NN'),
 ('.', '.'),
 ('white', 'JJ'),
 ('dove', 'NN'),
 ('honeycomb', 'NN'),
 ('paper', 'NN'),
 ('garland', 'NN'),
 ('.', '.'),
 ('set/6', 'JJ'),
 ('eau', 'NN'),
 ('de', 'IN'),
 ('nil', 'JJ'),
 ('bird', 'JJ'),
 ('t-lights', 'NNS'),
 ('.', '.'),
 ('doctor', 'NN'),
 ("'s", 'POS'),
 ('bag', 'NN'),
 ('soft', 'JJ'),
 ('toy', 'NN'),
 ('.', '.'),
 ('black', 'JJ'),
 ('baroque', 'NN'),
 ('carriage', 'NN'),
 ('clock', 'NN')]
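Since the goal is to keep nouns, the tagged pairs can be filtered on their Penn Treebank tag. The sketch below reuses part of the output above as literal data, so it runs without NLTK; `keep_nouns` is a hypothetical helper name, not something defined in the notebook.

```python
# Tagged tokens copied from the output above (first two descriptions only).
tagged = [('woven', 'JJ'), ('berries', 'NNS'), ('cushion', 'NN'), ('cover', 'NN'),
          ('.', '.'),
          ('white', 'JJ'), ('dove', 'NN'), ('honeycomb', 'NN'), ('paper', 'NN'),
          ('garland', 'NN')]

def keep_nouns(pos_tags):
    """Return the words whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in pos_tags if tag.startswith('NN')]

nouns = keep_nouns(tagged)
print(nouns)
```

Filtering on the `NN` prefix also drops the `.` separators introduced by joining the descriptions into one string.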

In [104]:
tokens

['woven',
 'berries',
 'cushion',
 'cover',
 '.',
 'white',
 'dove',
 'honeycomb',
 'paper',
 'garland',
 '.',
 'set/6',
 'eau',
 'de',
 'nil',
 'bird',
 't-lights',
 '.',
 'doctor',
 "'s",
 'bag',
 'soft',
 'toy',
 '.',
 'black',
 'baroque',
 'carriage',
 'clock']

In [None]:
"""
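keywords (dict):
    key:
        nouns from descriptions

    values:
        count: noun freq in the data
        words: list of words associated with the noun
"""
keywords = {}

def update_counter(pos_tags):
    """
        Process POS tagged words list to get product
        names and counts and update keywords dict
    """
    stemmer = nltk.stem.snowball.SnowballStemmer('english')

    for word, tag in pos_tags:
        # NOTE: the original cell stops at the loop header. The body below is
        # an assumed completion based on the docstring above.
        if not tag.startswith('NN'):
            continue  # keep nouns only (NN, NNS, NNP, NNPS)
        # Assumption: the stem is the key, so 'berry' and 'berries' share an entry.
        entry = keywords.setdefault(stemmer.stem(word), {'count': 0, 'words': []})
        entry['count'] += 1
        if word not in entry['words']:
            entry['words'].append(word)

    # Returned unchanged so the call can be chained with Series.apply.
    return pos_tags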

In [120]:
sample.str.lower().apply(nltk.word_tokenize).apply(nltk.pos_tag).apply(update_counter).tolist()

[[('woven', 'JJ'), ('berries', 'NNS'), ('cushion', 'NN'), ('cover', 'NN')],
 [('white', 'JJ'),
  ('dove', 'NN'),
  ('honeycomb', 'NN'),
  ('paper', 'NN'),
  ('garland', 'NN')],
 [('set/6', 'JJ'),
  ('eau', 'NN'),
  ('de', 'IN'),
  ('nil', 'JJ'),
  ('bird', 'NN'),
  ('t-lights', 'NNS')],
 [('doctor', 'NN'),
  ("'s", 'POS'),
  ('bag', 'NN'),
  ('soft', 'JJ'),
  ('toy', 'NN')],
 [('black', 'JJ'), ('baroque', 'NN'), ('carriage', 'NN'), ('clock', 'NN')]]