# Tests, Tries and Concept

The aim of this notebook is to explore possible solutions and build a proof of concept on a small subset of the full dataset, which saves time. Doing so should also give us a better understanding of the [NLP pipeline](https://spacy.io/usage/processing-pipelines). We will call this random sample of 1,000 descriptions the _subset_.

In [34]:
# general:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import time # will be used to choose the faster solution

# NLP:
import spacy

In [2]:
raw_winedata = pd.read_csv("../data/winemag-data-190314.csv").drop("Unnamed: 0", axis=1) # indexes were stored as col
initial_number_of_rows = raw_winedata.shape[0]
print(raw_winedata.shape)
raw_winedata.head()

(141617, 14)


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,vintage
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013.0
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0


In [38]:
# random subset of 1000 descriptions
# (in case there is any specific order in the original dataset)
subset = raw_winedata.sample(1000, random_state=51)
print(subset.shape)

(1000, 14)


From now on, we will work with this subset only.

## Data Processing

The checks below show that the `description` column has no missing values and no duplicates in this subset, so there is no need to remove any rows for now.

In [39]:
subset.isnull().sum()

country                    0
description                0
designation              286
points                     0
price                     59
province                   0
region_1                 170
region_2                 637
taster_name              158
taster_twitter_handle    196
title                      0
variety                    0
winery                     0
vintage                   31
dtype: int64

In [40]:
duplicates_description  = subset[subset.duplicated(subset='description', keep=False)].sort_values(by='description')
print('number of duplicated descriptions: ', duplicates_description.shape[0]/2)

number of duplicated descriptions:  0.0
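
Dividing the number of flagged rows by two only gives the right count when every duplicated description appears exactly twice. The sketch below, on a made-up toy frame (the data and variable names are assumptions, not part of the wine dataset), shows two counts that stay meaningful when a text appears three or more times.

```python
import pandas as pd

# Toy data (hypothetical): "crisp" appears 3 times, "oaky" twice, "fruity" once.
toy = pd.DataFrame(
    {"description": ["crisp", "oaky", "crisp", "crisp", "fruity", "oaky"]}
)

# Rows that repeat an earlier description, i.e. the rows drop_duplicates() would remove.
n_extra_rows = int(toy.duplicated(subset="description").sum())

# Distinct descriptions that occur more than once.
counts = toy["description"].value_counts()
n_repeated_texts = int((counts > 1).sum())

print("extra rows:", n_extra_rows, "| repeated texts:", n_repeated_texts)
```

On this toy frame the `keep=False` approach flags 5 rows, so halving would report 2.5, which is neither of the two counts above.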


## Vectorization

Many decisions have to be tested here. The first approach is to go through all the steps on the subset and try several solutions, without making many data-cleaning decisions yet. An optimization pass can follow later. Let's start with:

* spaCy medium sized English model


In [72]:
nlp = spacy.load('en_core_web_md')

In [73]:
# tokenization:
time0 = time.time()
descriptions = subset['description'].apply(nlp)
print('parser run in:', time.time() - time0, ' s.')

parser run in: 10.860634088516235  s.
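
Since the same `time0 = time.time()` pattern is repeated for every timing in this notebook, it could be wrapped in a small helper. The following is only a sketch: `timed` and the `timings` dictionary are hypothetical names, and the toy workload stands in for the call to `nlp`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Time the enclosed block, print the duration and store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label} run in: {results[label]:.3f} s.")

timings = {}  # label -> elapsed seconds, handy for comparing solutions later

with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.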


In [74]:
# the Series needs to be cast to a list to print the spaCy Doc objects:
list(descriptions)[0]

A delicious, balanced Pinot Noir. Shows dry, pure flavors of raspberries, cherries, cola and sandalwood, with a delicious, drink-me-now appeal.

In [80]:
print("Pipeline", nlp.pipe_names) # the components of the model, which together form the pipeline
# note: the tokenizer is not listed; it always runs first.

Pipeline ['tagger', 'parser', 'ner']


In [75]:
for token in list(descriptions)[0]:
    print(token.text, # Text: The original word text.
          token.lemma_, # Lemma: The base form of the word.
          token.pos_, # POS: The simple part-of-speech tag.
          token.tag_, # Tag: The detailed part-of-speech tag.
          token.dep_, # Dep: Syntactic dependency, i.e. the relation between tokens.
          token.shape_, # Shape: The word shape – capitalization, punctuation, digits.
          token.is_alpha, # is alpha: Does the token consist of alphabetic characters only?
          token.is_stop # is stop: Is the token part of a stop list, i.e. the most common words of the language?
         )

A a DET DT det X True True
delicious delicious ADJ JJ amod xxxx True False
, , PUNCT , punct , False False
balanced balanced ADJ JJ amod xxxx True False
Pinot Pinot PROPN NNP compound Xxxxx True False
Noir Noir PROPN NNP ROOT Xxxx True False
. . PUNCT . punct . False False
Shows show NOUN NNS ROOT Xxxxx True False
dry dry ADJ JJ amod xxx True False
, , PUNCT , punct , False False
pure pure ADJ JJ amod xxxx True False
flavors flavor NOUN NNS dobj xxxx True False
of of ADP IN prep xx True True
raspberries raspberry NOUN NNS pobj xxxx True False
, , PUNCT , punct , False False
cherries cherry NOUN NNS conj xxxx True False
, , PUNCT , punct , False False
cola cola NOUN NN conj xxxx True False
and and CCONJ CC cc xxx True True
sandalwood sandalwood NOUN NN conj xxxx True False
, , PUNCT , punct , False False
with with ADP IN prep xxxx True True
a a DET DT det x True True
delicious delicious ADJ JJ amod xxxx True False
, , PUNCT , punct , False False
drink drink VERB VB compound xxxx True Fa

In [76]:
spacy.displacy.render(list(descriptions)[0], style='dep')

In [92]:
# NER is not working well here (e.g. the grape variety Verdejo is labelled ORG)... not sure why?
for ent in list(descriptions)[2].ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Verdejo 28 35 ORG
cool years 60 70 DATE
2009 78 82 DATE


In [93]:
spacy.displacy.serve(list(descriptions)[2], style="ent")
# need to stop kernel to shut down server.


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [71]:
# For now we only want to tokenize, and we will probably not need the parser (syntactic dependencies).
# How much faster is the processing without it?
# tokenizing:
time0 = time.time()
descriptions = subset['description'].apply(nlp, disable=["parser"])
print('parser run in:', time.time() - time0, ' s.')


parser run in: 5.848058223724365  s.


In [70]:
# if we disable all the components and keep only tokenization:
time0 = time.time()
descriptions = subset['description'].apply(nlp, disable=['tagger', 'parser', 'ner'])
print('parser run in:', time.time() - time0, ' s.')

# a lot faster! That will be important when processing all the descriptions (full dataset)

parser run in: 0.1695239543914795  s.
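
When only tokens are needed, another option is to stream the texts through `nlp.pipe` rather than calling `nlp` once per row. The sketch below is a minimal illustration under one assumption: a blank English pipeline (`spacy.blank("en")`, tokenizer only) stands in for `en_core_web_md` so that it runs without downloading a model. The names `nlp_tok`, `texts` and `tokens` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) replaces
# en_core_web_md so the sketch runs without a downloaded model.
nlp_tok = spacy.blank("en")

texts = [
    "A delicious, balanced Pinot Noir.",
    "Tart and snappy, the flavors of lime flesh.",
]

# nlp.pipe processes the texts as a stream, in batches.
docs = list(nlp_tok.pipe(texts))
tokens = [[token.text for token in doc] for doc in docs]
print(tokens[0])
```

With the model loaded in this notebook, the equivalent call would presumably be `nlp.pipe(subset['description'], disable=['tagger', 'parser', 'ner'])`; whether it beats `Series.apply` on the full dataset would need to be timed.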


In [28]:
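# parsing and cleaning text:
def parse_and_clean(series):
    # parse
    parsed_series = series.apply(nlp)
    # make use of the spaCy tokens to clean text:
    # (cleaning step not written yet; the parsed Docs are returned unchanged)
    return parsed_series

parse_and_clean(subset['description'])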

112728    (A, delicious, ,, balanced, Pinot, Noir, ., Sh...
71255     (A, zippy, ,, lively, nose, of, citrus, and, l...
72503     (This, style, of, oak, -, fermented, Verdejo, ...
110799    (On, the, nose, ,, this, has, aromas, of, ston...
122732    (Chardonnay, is, emerging, as, a, leader, at, ...
                                ...                        
140024    (This, is, a, fruity, sparkler, made, in, the,...
100786    (Aromas, of, yellow, stone, fruit, and, choppe...
138571    (Light, in, color, ,, this, blend, 's, red, -,...
84362     (Rather, light, for, the, vintage, ,, this, is...
67914     (Simple, aromas, may, lack, vibrancy, but, are...
Name: description, Length: 100, dtype: object
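
The cleaning step is still open. One possible rule, shown here purely as an illustration and not as the choice made in this notebook, is to keep lower-cased alphabetic tokens that are not stop words, using the token attributes printed earlier. `clean_doc` is a hypothetical helper, and a blank English pipeline is assumed so the sketch runs without a model.

```python
import spacy

# Assumption: tokenizer-only pipeline; is_alpha and is_stop are lexical
# attributes, so they are available without a trained model.
nlp_blank = spacy.blank("en")

def clean_doc(doc):
    """Hypothetical rule: keep lower-cased alphabetic, non-stop-word tokens."""
    return [token.lower_ for token in doc if token.is_alpha and not token.is_stop]

doc = nlp_blank("A delicious, balanced Pinot Noir.")
cleaned = clean_doc(doc)
print(cleaned)
```

Lemmas (`token.lemma_`) could replace `token.lower_` once a full model is loaded; the blank pipeline used here is not guaranteed to provide them.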