# Tests, Trials and Proof of Concept

The aim of this notebook is to explore possible solutions and build a proof of concept on a small subset of the full dataset, to save time. Along the way, we will get a better understanding of the [NLP pipeline](https://spacy.io/usage/processing-pipelines). We will call this random sample of 1000 descriptions the _subset_.

In [325]:
# general:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import time # will be used to choose the faster solution
import itertools # used for instance to flatten list of lists
from collections import Counter # e.g. to find most common words


# NLP:
import spacy

In [2]:
raw_winedata = pd.read_csv("../data/winemag-data-190314.csv").drop("Unnamed: 0", axis=1) # indexes were stored as col
initial_number_of_rows = raw_winedata.shape[0]
print(raw_winedata.shape)
raw_winedata.head()

(141617, 14)


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,vintage
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013.0
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0


In [38]:
# random subset of 1000 descriptions
# (in case there is any specific order in the original dataset)
subset = raw_winedata.sample(1000, random_state=51)
print(subset.shape)

(1000, 14)


From now on, we will work with this subset only.

## Data Processing

No need to remove rows with missing values (for now) in this subset, nor to remove duplicate descriptions.

In [39]:
subset.isnull().sum()

country                    0
description                0
designation              286
points                     0
price                     59
province                   0
region_1                 170
region_2                 637
taster_name              158
taster_twitter_handle    196
title                      0
variety                    0
winery                     0
vintage                   31
dtype: int64

In [40]:
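# rows whose description appears more than once (all occurrences are kept):
duplicates_description  = subset[subset.duplicated(subset='description', keep=False)].sort_values(by='description')
# dividing by 2 assumes that each duplicated description appears exactly twice
print('number of duplicated descriptions: ', duplicates_description.shape[0]/2)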

number of duplicated descriptions:  0.0


## Testing and Understanding spaCy

Some code to test and understand spaCy on the subset.


In [129]:
nlp = spacy.load('en_core_web_md')

In [130]:
# run the full pipeline (tokenizer + tagger, parser, ner), one description at a time:
time0 = time.time()
descriptions = subset['description'].apply(nlp)
print('parser run in:', time.time() - time0, ' s.')

parser run in: 10.827577829360962  s.


In [131]:
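# using nlp.pipe: process the texts as a stream
# and buffer them in batches, instead of one-by-one.
# This is usually much more efficient.
# CAUTION: nlp.pipe expects an iterable of texts and returns a lazy generator.
# Through .apply it is called once per description and nothing is processed
# until the generators are consumed, so this cell only times their creation.
# A fair comparison would time: list(nlp.pipe(subset['description']))

time0 = time.time()
descriptions = subset['description'].apply(nlp.pipe)
print('parser run in:', time.time() - time0, ' s.')

parser run in: 0.004770755767822266  s.


This timing is misleading: no text has actually been processed yet, so it does not show that _nlp.pipe_ is faster than _nlp_. To compare the two fairly, _nlp.pipe_ must receive the whole column and its output must be consumed.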
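
One caveat when timing `nlp.pipe`: it returns a lazy generator and expects an iterable of texts. The sketch below does not use spaCy at all. `fake_pipe` is a hypothetical stand-in that only mimics this lazy behaviour, to show why creating the generator costs almost nothing and what happens when a single string is passed in.

```python
processed = []  # records every text that actually gets processed

def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: lazily yields one result per item of `texts`."""
    for text in texts:
        processed.append(text)
        yield text.split()

texts = ["A delicious, balanced Pinot Noir.", "Tart and snappy."]

stream = fake_pipe(texts)      # returns at once: no text has been processed yet
count_before = len(processed)

docs = list(stream)            # consuming the generator does the real work
count_after = len(processed)

# Pitfall: a single string is itself an iterable, so it is read character by character.
chars = list(fake_pipe("ab"))
```

With spaCy itself, the batched version of the timing cell would be `descriptions = list(nlp.pipe(subset['description']))`, so that the whole column is passed in and the generator is consumed inside the timed section.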

In [74]:
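# the Series keeps the row labels of the random sample, so we select the first Doc by position
# (descriptions.iloc[0] would do the same without building a list):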
list(descriptions)[0]

A delicious, balanced Pinot Noir. Shows dry, pure flavors of raspberries, cherries, cola and sandalwood, with a delicious, drink-me-now appeal.

In [80]:
print("Pipeline", nlp.pipe_names) # the different components of the model, forming the pipeline
# the tokenizer is not listed: it always runs first, before these components.

Pipeline ['tagger', 'parser', 'ner']


In [75]:
for token in list(descriptions)[0]:
    print(token.text, # Text: The original word text.
          token.lemma_, # Lemma: The base form of the word.
          token.pos_, # POS: The simple part-of-speech tag.
          token.tag_, # Tag: The detailed part-of-speech tag.
          token.dep_, # Dep: Syntactic dependency, i.e. the relation between tokens.
          token.shape_, # Shape: The word shape – capitalization, punctuation, digits.
          token.is_alpha, # is alpha: Does the token consist of alphabetic characters only?
          token.is_stop # is stop: Is the token part of a stop list, i.e. the most common words of the language?
         )

A a DET DT det X True True
delicious delicious ADJ JJ amod xxxx True False
, , PUNCT , punct , False False
balanced balanced ADJ JJ amod xxxx True False
Pinot Pinot PROPN NNP compound Xxxxx True False
Noir Noir PROPN NNP ROOT Xxxx True False
. . PUNCT . punct . False False
Shows show NOUN NNS ROOT Xxxxx True False
dry dry ADJ JJ amod xxx True False
, , PUNCT , punct , False False
pure pure ADJ JJ amod xxxx True False
flavors flavor NOUN NNS dobj xxxx True False
of of ADP IN prep xx True True
raspberries raspberry NOUN NNS pobj xxxx True False
, , PUNCT , punct , False False
cherries cherry NOUN NNS conj xxxx True False
, , PUNCT , punct , False False
cola cola NOUN NN conj xxxx True False
and and CCONJ CC cc xxx True True
sandalwood sandalwood NOUN NN conj xxxx True False
, , PUNCT , punct , False False
with with ADP IN prep xxxx True True
a a DET DT det x True True
delicious delicious ADJ JJ amod xxxx True False
, , PUNCT , punct , False False
drink drink VERB VB compound xxxx True Fa

In [76]:
spacy.displacy.render(list(descriptions)[0], style='dep')

In [92]:
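# NER runs, but its labels are not reliable on wine vocabulary:
# e.g. 'Verdejo' (a grape variety) is tagged ORG and 'cool years' is tagged DATE,
# probably because the model was trained on general-purpose text.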
for ent in list(descriptions)[2].ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Verdejo 28 35 ORG
cool years 60 70 DATE
2009 78 82 DATE


In [93]:
spacy.displacy.serve(list(descriptions)[2], style="ent")
# need to stop kernel to shut down server.

Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [94]:
# print vectors:
for token in list(descriptions)[0]:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

A True 5.306696 False
delicious True 7.171026 False
, True 5.094723 False
balanced True 6.4426575 False
Pinot True 7.5997887 False
Noir True 6.878326 False
. True 4.9316354 False
Shows True 5.2839694 False
dry True 6.768264 False
, True 5.094723 False
pure True 6.4195275 False
flavors True 7.584025 False
of True 4.97793 False
raspberries True 7.4071903 False
, True 5.094723 False
cherries True 7.4071903 False
, True 5.094723 False
cola True 6.54571 False
and True 4.6577983 False
sandalwood True 7.4872804 False
, True 5.094723 False
with True 4.7708583 False
a True 5.306696 False
delicious True 7.171026 False
, True 5.094723 False
drink True 6.9833527 False
- True 5.6033444 False
me True 5.75488 False
- True 5.6033444 False
now True 4.8410244 False
appeal True 5.9267282 False
. True 4.9316354 False


See [this link](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) for the list of available token attributes.

In [71]:
# For now, we just want to tokenize and we will probably not use the dependency parser.
# How much faster is the processing without it?
# tokenizing (parser disabled):
time0 = time.time()
descriptions = subset['description'].apply(nlp, disable=["parser"])
print('parser run in:', time.time() - time0, ' s.')


parser run in: 5.848058223724365  s.


In [95]:
# if we disable all the components and keep only tokenization:
time0 = time.time()
descriptions = subset['description'].apply(nlp, disable=['tagger', 'parser', 'ner'])
print('parser run in:', time.time() - time0, ' s.')

# a lot faster! That will be important when processing all the descriptions (full dataset)

parser run in: 0.1649169921875  s.


In [97]:
# check if we still have the vectors when disabling ['tagger', 'parser', 'ner']
for token in list(descriptions)[0]:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

# WE DO !

A True 5.306696 False
delicious True 7.171026 False
, True 5.094723 False
balanced True 6.4426575 False
Pinot True 7.5997887 False
Noir True 6.878326 False
. True 4.9316354 False
Shows True 5.2839694 False
dry True 6.768264 False
, True 5.094723 False
pure True 6.4195275 False
flavors True 7.584025 False
of True 4.97793 False
raspberries True 7.4071903 False
, True 5.094723 False
cherries True 7.4071903 False
, True 5.094723 False
cola True 6.54571 False
and True 4.6577983 False
sandalwood True 7.4872804 False
, True 5.094723 False
with True 4.7708583 False
a True 5.306696 False
delicious True 7.171026 False
, True 5.094723 False
drink True 6.9833527 False
- True 5.6033444 False
me True 5.75488 False
- True 5.6033444 False
now True 4.8410244 False
appeal True 5.9267282 False
. True 4.9316354 False


In [109]:
# Each Doc, Span and Token comes with a .similarity() method 
doc = list(descriptions)[0] # build the list once instead of at every iteration
token1 = doc[3] # 'balanced'
for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

balanced A 0.33048978
balanced delicious 0.35654667
balanced , 0.1519814
balanced balanced 1.0
balanced Pinot 0.30750036
balanced Noir 0.15775882
balanced . 0.19880065
balanced Shows 0.19468865
balanced dry 0.34358656
balanced , 0.1519814
balanced pure 0.41958594
balanced flavors 0.4269035
balanced of 0.18883164
balanced raspberries 0.21687455
balanced , 0.1519814
balanced cherries 0.21687455
balanced , 0.1519814
balanced cola 0.20375118
balanced and 0.29367846
balanced sandalwood 0.29296434
balanced , 0.1519814
balanced with 0.33131588
balanced a 0.33048978
balanced delicious 0.35654667
balanced , 0.1519814
balanced drink 0.31249914
balanced - 0.059531394
balanced me 0.19236864
balanced - 0.059531394
balanced now 0.15937865
balanced appeal 0.23878936
balanced . 0.19880065
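
By default, `.similarity()` is the cosine similarity between the two vectors. Here is a minimal sketch of that formula with plain Python lists. The vectors are toy values chosen for illustration, not real spaCy vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
```

Only the direction of the vectors matters, not their length. This is why a token always has a similarity of 1.0 with itself, as in the `balanced balanced` line of the output.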


In [113]:
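# vectors for the medium model: 20'000 unique vectors of 300 dimensions.
# Several words can share one vector, which would explain why 'raspberries'
# and 'cherries' have exactly the same norm and similarity in the outputs above.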
len(nlp.vocab.vectors)

20000

In [118]:
# the vector of a word in a doc:
list(descriptions)[0][4].vector

array([-5.0629e-01, -9.2535e-02,  5.1896e-01, -1.0059e+00,  7.7771e-01,
        2.1284e-01,  1.2649e-01,  1.1408e-01, -4.2443e-02,  1.8741e+00,
       -6.3908e-01,  4.5790e-01, -4.0649e-01, -5.3637e-01,  1.0609e-01,
       -8.7617e-01, -5.6196e-01,  1.5024e+00,  2.1300e-01, -3.5320e-01,
       -1.4053e-01, -4.1087e-01,  3.0588e-01,  3.4842e-02, -5.3801e-01,
        1.1918e-01, -2.1134e-02,  4.3866e-01, -2.0190e-01, -2.2332e-01,
        1.8903e-02,  1.9194e-01,  8.1535e-01,  9.6442e-02, -4.3574e-01,
        9.5558e-01,  3.6137e-01, -4.1387e-01,  2.8160e-01,  3.1058e-01,
        7.1966e-03,  2.9827e-01,  2.7132e-01, -3.5750e-01,  7.2742e-02,
       -2.5117e-02,  2.3366e-01, -4.4472e-01, -1.7822e-01,  2.8943e-01,
       -9.4849e-02, -4.1734e-01, -1.4773e-03,  1.5498e-02,  7.4254e-01,
       -6.7351e-01,  1.1987e-01, -2.7886e-01, -2.5243e-02, -1.4184e-01,
        3.2295e-01, -5.7773e-01,  7.4962e-02,  3.0049e-01, -3.8984e-01,
       -1.6904e-01, -7.7433e-03, -3.7745e-01, -9.7340e-02,  1.05

In [119]:
# the vector of a doc is the average of all token vectors in the doc:
list(descriptions)[0].vector

array([-1.28362730e-01,  2.82365769e-01, -4.49462198e-02, -1.27498358e-01,
        1.22002006e-01,  1.47130564e-01, -1.50934458e-02, -1.09166056e-01,
       -1.10732287e-01,  1.67319226e+00, -1.56334013e-01,  1.80393815e-01,
       -1.33543879e-01, -2.66527552e-02, -1.14607863e-01, -2.78478980e-01,
       -1.24037325e-01,  1.17868316e+00, -7.19795898e-02, -4.13767211e-02,
       -3.36467177e-02,  3.48486146e-03, -2.11610962e-02, -2.91093774e-02,
        3.47571298e-02, -7.23614395e-02,  1.01035580e-01,  8.29180051e-03,
        1.51149690e-01, -3.24545950e-01,  8.69362522e-03,  1.80217456e-02,
       -7.20151421e-03, -9.33022238e-06,  7.96838179e-02,  1.25217214e-02,
        7.34353736e-02, -4.64807451e-02, -1.05573855e-01,  1.03139743e-01,
       -1.08939558e-01,  3.47541720e-02,  1.18087411e-01, -1.32807776e-01,
        6.54570311e-02,  2.25159004e-01, -1.84240416e-01,  2.59242281e-02,
        7.39087835e-02, -8.07781331e-03, -5.62592223e-02,  7.26155788e-02,
       -8.64307135e-02,  
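
To make the averaging concrete, here is a minimal sketch with three toy 3-dimensional "token vectors" (illustrative values, not real spaCy vectors). `average_vector` is a hypothetical helper written for this example.

```python
def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
doc_vector = average_vector(token_vectors)
```

On a real Doc, the same idea could be checked by comparing `doc.vector` with the mean of `token.vector` over its tokens.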

In [125]:
# save the Docs for later:
import pickle
with open("../data/test_save_subset.p", "wb") as f: # the file is closed automatically
    pickle.dump(descriptions, f)
# The resulting file is huge!

In [126]:
# try to load the saved Docs:
with open("../data/test_save_subset.p", "rb") as f:
    test = pickle.load(f)
print(len(list(test)[0][4].vector))
list(test)[0]


300


A delicious, balanced Pinot Noir. Shows dry, pure flavors of raspberries, cherries, cola and sandalwood, with a delicious, drink-me-now appeal.

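It works! The Docs created by the pipeline (here after tokenization only) can be saved as pickles, but the resulting file is huge, likely because each pickled Doc also carries data from the shared vocabulary.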

## Tokenization
We will use the large model, because we can, and because its vocabulary is larger. However, we only need to tokenize, so we disable the PoS tagger, the dependency parser and the named entity recognizer.  

We will clean the tokens afterwards: remove punctuation and stopwords, and convert everything to lower case.

In [271]:
# loading the nlp model. 
time0 = time.time()
nlp = spacy.load('en_core_web_lg', disable=["tagger", "parser", "ner"])
print('model loaded in:', time.time() - time0, ' s.')

model loaded in: 9.464062929153442  s.


In [287]:
# Creating our tokenizer function
def spacy_tokenizer(description):
    
    # Tokenize:
    mytokens = nlp(description)

    # Removing stopwords, punctuation and convert to lower_case
    mytokens = [token.lower_ for token in mytokens if not token.is_punct and not token.is_stop]

    # Return preprocessed list of tokens
    return mytokens


In [288]:
time0 = time.time()

descriptions = subset['description'].apply(spacy_tokenizer) 

# if we want, we can add the parsed descriptions to the dataframe:
subset.loc[:,'description_tokenized'] = descriptions

print('Finished! Tokenization and cleaning took ', time.time() - time0, ' s. to run.')

Finished! Tokenization and cleaning took  0.22167301177978516  s. to run.


In [255]:
# # tokenization and cleaning of descriptions:

# def clean_tokenized_text(doc):
#     # remove stopwords and punctuation,
#     # return the lower case in the form of a list of words (strings):
#     return([token.lower_ for token in doc if not token.is_punct and not token.is_stop]) # [] to not return the generator

# def tokenize_and_clean(texts):
#     # texts is a pd.Series (the description column of the wine dataset)
#     time0 = time.time()
    
#     # using nlp instead of nlp.pipe:
#     descriptions = texts.apply(nlp) # tokenize
#     print('Tokenization took ', time.time() - time0, ' s. to run.')
#     tokenized_serie = descriptions.apply(clean_tokenized_text) # clean text
    
# #     # using nlp.pipe, should be faster:
# #     tokenized_serie = []
# #     for doc in nlp.pipe(texts): 
# #         # clean text:
# #         tokenized_serie.append(clean_tokenized_text(doc))
# #     # returns a list of words per doc
    
#     print('Finished! Tokenization and cleaning took ', time.time() - time0, ' s. to run.')
#     return(tokenized_serie)
    
# descriptions = tokenize_and_clean(subset['description'])
# print(list(descriptions)[0])

Finished! Tokenization and cleaning took  0.22227191925048828  s. to run.
['delicious', 'balanced', 'pinot', 'noir', 'shows', 'dry', 'pure', 'flavors', 'raspberries', 'cherries', 'cola', 'sandalwood', 'delicious', 'drink', 'appeal']


Our descriptions now have the form of a list of lists of words, i.e. one list of words per doc. 

Ok, so now let's try some vectorization techniques!

## Bag of Words

In [327]:
# creates corpus based on the tokenized docs:
time0 = time.time()

corpus = list(itertools.chain.from_iterable(descriptions)) # should be faster than solution below
# corpus = [y for x in descriptions for y in x]

# the corpus is here a list of all the -clean- words found in the descriptions.

print('needed time: ', time.time() - time0)
print('number of words in corpus: ', len(corpus))

needed time:  0.0009322166442871094
number of words in corpus:  24092


In [328]:
# Utility function to create a list of the n most common words in corpus.
def bag_of_words(text, n):
    # Return the n most common words.
    return [item[0] for item in Counter(text).most_common(n)]

common_words = bag_of_words(corpus, 3000) # let's start with 3000 most common words
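
A quick check of `bag_of_words` on a toy corpus (made-up words, not taken from the dataset) shows what it returns. It also shows that asking for more words than the corpus contains simply returns the whole vocabulary.

```python
from collections import Counter

def bag_of_words(text, n):
    # Return the n most common words.
    return [item[0] for item in Counter(text).most_common(n)]

toy_corpus = ["cherry", "dry", "cherry", "cola", "dry", "cherry"]

top_two = bag_of_words(toy_corpus, 2)     # the two most frequent words
all_words = bag_of_words(toy_corpus, 10)  # n larger than the vocabulary
vocabulary_size = len(set(toy_corpus))
```

Comparing `len(set(corpus))` with 3000 tells us whether some words of the corpus are left out of `common_words`.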

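**This is super slow and will not scale to the full dataset! It also fails on the subset:** `df.loc[i, word] += 1` raises a `KeyError` as soon as a description contains a word that is not one of the 3000 most common ones (here `'screaming'`), because that word has no column in the data frame.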

In [354]:
# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(descriptions, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['description'] = descriptions
    df.loc[:, common_words] = 0
    df = df.reset_index()
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, description in enumerate(df['description']):
        words = [token for token in description]
    
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

word_counts = bow_features(descriptions, common_words)

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500


KeyError: 'screaming'
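
One possible way around both problems is to count the words of each description with a `Counter` and to ignore the words that are outside the common word set. This is a minimal sketch and not a tested replacement for `bow_features`. `bow_counts` and the toy data are hypothetical names and values introduced for the example.

```python
from collections import Counter

def bow_counts(tokenized_docs, vocabulary):
    """Return one row of counts per document, one column per vocabulary word."""
    vocab_set = set(vocabulary)  # fast membership test
    rows = []
    for tokens in tokenized_docs:
        # Words outside the vocabulary are skipped instead of raising an error.
        counts = Counter(token for token in tokens if token in vocab_set)
        rows.append([counts[word] for word in vocabulary])
    return rows

toy_docs = [["cherry", "dry", "cherry"], ["cola", "screaming", "dry"]]
toy_vocab = ["cherry", "dry", "cola"]

toy_matrix = bow_counts(toy_docs, toy_vocab)
```

The rows could then be wrapped in a data frame with `pd.DataFrame(rows, columns=vocabulary)`. Each description is visited once and there is no cell-by-cell `df.loc` update, which should make it much cheaper than the loop above.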