In [1]:
import os
import spacy
import pandas as pd

```bash
# download the package first
pip install spacy

# after that, download the trained English model
python -m spacy download en
```

# spaCy

In [2]:
reviews = pd.read_table('hotelreviews.txt', names = ['text'])
reviews.head()

Unnamed: 0,text
0,Nice place Better than some reviews give it cr...
1,what a surprise What a surprise the Sheraton w...
2,Good location Boston from 17th Floor of ...
3,Find an alternative to the Sheraton We stayed ...
4,Barely Tolerable If it were possible to give o...


The first step in using `spaCy` is to construct a language processing pipeline. Here we're:

- Loading the pre-trained English model
- Grabbing a sample text, handing it over to spaCy and being prepared to wait...

In [3]:
# load the model/pipeline, once we have
# loaded the object, we can call it as
# though it were a function
nlp = spacy.load('en')

In [4]:
# grab a single document
doc = reviews.loc[0, 'text']
parsed_doc = nlp(doc)
parsed_doc

Nice place Better than some reviews give it credit for. Overall, the rooms were a bit small but nice. Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city). Overall, it was a good experience and the staff was quite friendly. 

...1/20th of a second or so. Although the text looks exactly the same as before, a lot has actually happened under the hood. Let's take a look at what we got during that time. From here on, we'll look at the functionality and properties that spaCy provides out of the box.

## Tokenization

The first one is sentence detection/segmentation (note that all of these features have already been computed; all we're doing now is accessing them via attributes). Every spaCy document is segmented into sentences and further into tokens, which can be accessed by iterating over the document.

In [5]:
# access the sents attribute, which is a
# generator that we can loop through
for num, sentence in enumerate(parsed_doc.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print()

# access the first token
print('tokens:')
print(parsed_doc[0])

Sentence 1:
Nice place Better than some reviews give it credit for.

Sentence 2:
Overall, the rooms were a bit small but nice.

Sentence 3:
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).

Sentence 4:
Overall, it was a good experience and the staff was quite friendly.

tokens:
Nice


## Part of Speech Tagging

Part-of-speech tags (POST) are properties of a word that are determined by how the word is used in a grammatically correct sentence. These tags can be used as text features in information filtering, statistical models and rule-based parsing.

In [6]:
# the orth_ attribute will give us
# the string representation of the
# token as opposed to a spaCy token object
token_text = [token.orth_ for token in parsed_doc]
token_pos = [token.pos_ for token in parsed_doc]

post = pd.DataFrame(list(zip(token_text, token_pos)),
                    columns = ['token_text', 'part_of_speech'])
post.head()

Unnamed: 0,token_text,part_of_speech
0,Nice,ADJ
1,place,NOUN
2,Better,PROPN
3,than,ADP
4,some,DET


From the table above, we can see that the word "Nice" is an adjective and so on.
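
As a rough illustration of using these tags as text features, the sketch below counts how often each tag occurs and filters tokens by tag. It only uses the standard library and hard-codes the five (token, tag) pairs from the table above instead of calling spaCy; the names `tagged_tokens`, `pos_counts` and `adjectives` are made up for this example.

```python
from collections import Counter

# (token, part of speech) pairs copied from the table above
tagged_tokens = [
    ('Nice', 'ADJ'),
    ('place', 'NOUN'),
    ('Better', 'PROPN'),
    ('than', 'ADP'),
    ('some', 'DET'),
]

# count how many times each tag appears; these counts could
# serve as a simple bag-of-tags feature vector for the text
pos_counts = Counter(tag for _, tag in tagged_tokens)

# keep only the tokens carrying a given tag, e.g. adjectives
adjectives = [text for text, tag in tagged_tokens if tag == 'ADJ']

print(pos_counts)
print(adjectives)
```

With a real parsed document, the same two lines would work on `(token.orth_, token.pos_)` pairs.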


## Named Entity Recognition

spaCy comes with a fast entity recognition model that is capable of identifying entity phrases in a document. Entities can be of different types, such as person, location, organization, date, numerical value, etc.

In [7]:
# For a given document, the standard way to access entities is
# to use the .ents attribute; for each entity, we can then
# access the .label_ attribute to check the entity type of
# the entity that got flagged;

# please check the documentation to see what each entity
# label means
# https://spacy.io/docs/usage/entity-recognition#entity-types
for num, entity in enumerate(parsed_doc.ents):
    print('Entity {}:'.format(num + 1), entity.orth_, '-', entity.label_)

Entity 1: Better - FAC
Entity 2: the Prudential Center - ORG


We can also perform token-level entity analysis. This is basically the named entity recognition that we've already looked at, but on a token-by-token basis. It also provides an inside-outside-begin (IOB) indicator, e.g. here "the Prudential Center" represents one single entity, so "the" is the beginning of the entity (B), while "Prudential" and "Center" are both inside that entity (I). Tokens that do not belong to any entity get labeled as outside (O).

In [8]:
token_entity_type = [token.ent_type_ for token in parsed_doc]
token_entity_iob = [token.ent_iob_ for token in parsed_doc]

entity = pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
                      columns = ['token_text', 'entity_type', 'inside_outside_begin'])
entity.iloc[37:41]

Unnamed: 0,token_text,entity_type,inside_outside_begin
37,the,ORG,B
38,Prudential,ORG,I
39,Center,ORG,I
40,makes,,O
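
To make the B/I/O scheme more concrete, here is a small sketch in plain Python (no spaCy required) that stitches token-level tags back into whole entities. The helper `group_entities` is a hypothetical function written for this illustration, not part of spaCy, and its input rows are copied from the table above.

```python
def group_entities(rows):
    """
    Merge token-level (text, entity_type, iob) rows into
    (entity_text, entity_type) pairs
    """
    entities = []
    current_tokens, current_type = [], None
    for text, ent_type, iob in rows:
        if iob == 'B':
            # a new entity starts; flush the one in progress, if any
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [text], ent_type
        elif iob == 'I' and current_tokens:
            # still inside the entity that was opened by a 'B'
            current_tokens.append(text)
        else:
            # 'O' (or a stray 'I'): close the entity in progress, if any
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # don't lose an entity that runs up to the last token
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


# rows copied from the table above
rows = [
    ('the', 'ORG', 'B'),
    ('Prudential', 'ORG', 'I'),
    ('Center', 'ORG', 'I'),
    ('makes', '', 'O'),
]
print(group_entities(rows))
```

In practice `parsed_doc.ents` already does this grouping for us; the sketch is only meant to show how the two views relate.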


## Token Level Attributes

What about the variety of other token-level attributes, such as the relative frequency of a token (how frequently each token/word appears in English text), and whether or not a token matches any of the following categories?

- stopword (grammatically functional words that don't contribute much to the context)
- punctuation
- whitespace
- number
- whether or not the token is included in spaCy's default vocabulary
- In terms of the token's relative frequency, spaCy expresses it as a log probability, so a negative number closer to 0 means the token appears more often. In other words, a smaller absolute value means a more common token

Please refer to the [documentation page](https://spacy.io/docs/api/token) to see all the available attributes at the token level.

In [9]:
token_attrs = [(token.orth_,
                token.lemma_,
                token.prob,
                token.is_stop,
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
                for token in parsed_doc]

df = pd.DataFrame(token_attrs,
                  columns = ['text',
                             'lemma',
                             'log_probability',
                             'stop',
                             'punctuation',
                             'whitespace',
                             'number',
                             'out_of_vocab'])

# we convert the boolean columns to only showing Yes for True
# and a blank string for False for a cleaner output
df.loc[:, 'stop':'out_of_vocab'] = (df.loc[:, 'stop':'out_of_vocab']
                                      .applymap(lambda x: 'Yes' if x else ''))
df.head(15)

Unnamed: 0,text,lemma,log_probability,stop,punctuation,whitespace,number,out_of_vocab
0,Nice,nice,-9.845901,,,,,
1,place,place,-8.045827,,,,,
2,Better,better,-10.571031,,,,,
3,than,than,-6.372464,Yes,,,,
4,some,some,-6.402781,Yes,,,,
5,reviews,review,-11.378132,,,,,
6,give,give,-7.725083,Yes,,,,
7,it,-PRON-,-4.50645,Yes,,,,
8,credit,credit,-8.618998,,,,,
9,for,for,-4.91397,Yes,,,,
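
To get a feel for the log probability column, the sketch below turns a few of the values from the table above back into relative frequencies. It assumes the values are natural logarithms, which is worth double-checking against the spaCy documentation; the dictionary names are made up for this example.

```python
import math

# log probabilities copied from the table above
log_probs = {'it': -4.50645, 'place': -8.045827, 'reviews': -11.378132}

# exponentiating undoes the log (assuming a natural log),
# giving an estimated relative frequency between 0 and 1
rel_freq = {word: math.exp(lp) for word, lp in log_probs.items()}

# sort from most to least frequent
for word, freq in sorted(rel_freq.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<8} {:.6f}'.format(word, freq))
```

The ordering matches the intuition from the list above: "it" has the log probability closest to 0 and therefore the highest estimated frequency.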


## Dependency Parsing

spaCy also offers a fast and accurate dependency parser. Let's parse the dependency tree of all the sentences that contain a targeted term and check which adjectival tokens are used with that term.

In [29]:
# toy example of how to get the dependency
token = parsed_doc[1]
print('target word:', token)
print('dependency:')

# to get the dependents of a token
# we can access the .children attribute
# and iterate through them; here we print each
# child's is_quote flag, printing c.orth_ and c.dep_
# instead would show the child and its relation
for c in token.children:
    print(c.is_quote)

target word: place
dependency:
False
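
To see what `.children` is iterating over, here is a minimal sketch of a dependency tree in plain Python. The parse below is hand-annotated for illustration and is not actual spaCy output; `toy_parse` and `children` are hypothetical names. Each token stores the index of its head, and a token's children are simply the tokens that point back at it.

```python
# hand-annotated toy parse of "the view was wonderful";
# each entry is (text, index of head token, relation label)
# and the root points at itself
toy_parse = [
    ('the', 1, 'det'),
    ('view', 2, 'nsubj'),
    ('was', 2, 'ROOT'),
    ('wonderful', 2, 'acomp'),
]


def children(parse, index):
    """Return (text, relation) for every token whose head is parse[index]"""
    return [(text, dep)
            for i, (text, head, dep) in enumerate(parse)
            if head == index and i != index]


print(children(toy_parse, 1))  # children of "view"
print(children(toy_parse, 2))  # children of "was"
```

Collecting children that carry a particular part of speech tag, as we do next, is the same idea with an extra filter.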


In [11]:
from collections import Counter


def post_words(document, token, post, topn = 5):
    """
    given a parsed document/corpus, find the sentences that contain
    the specified token, then return the most common lemmas among the
    dependency children in those sentences that carry the specified
    part of speech tag
    """
    target_sents = [sent for sent in document.sents if token in sent.lower_]    
    words = []
    for sentence in target_sents:
        for word in sentence: 
            words.extend([child.lemma_
                          for child in word.children
                          if child.pos_ == post and child.lemma_ != '-PRON-'])

    common_words = Counter(words).most_common(topn)
    return common_words

In [12]:
# lump all the documents into one giant document
corpus = ' '.join([review for review in reviews['text']])
document = nlp(corpus)
common_words = post_words(document, token = 'view', post = 'ADJ')
common_words

[('great', 41), ('good', 28), ('which', 13), ('small', 13), ('fantastic', 11)]

In [13]:
from joblib import cpu_count


def valid_word(token):
    """
    Returns False if the spacy token is either
    a punctuation, whitespace, number or a pronoun
    (indicated by the '-PRON-' flag)
    """
    pron_flag = token.lemma_ != '-PRON-'
    word_flag = not (token.is_punct or token.is_space or token.like_num)
    return word_flag and pron_flag


def clean_corpus(texts, parser, stopwords, batch_size = 10000, n_jobs = -1):
    """
    Generator function using spaCy to parse reviews:
    - lemmatize the text
    - remove punctuation, whitespace and number
    - remove pronoun, e.g. 'it'
    """
    n_threads = cpu_count()
    if n_jobs > 0 and n_jobs < n_threads:
        n_threads = n_jobs
    
    # use .pipe to process texts as a stream;
    # this functionality supports multi-threading
    for parsed_text in parser.pipe(texts, batch_size = batch_size, n_threads = n_threads):
        tokens = []
        for token in parsed_text:
            if valid_word(token) and token.lemma_ not in stopwords:
                tokens.append(token.lemma_)
        
        cleaned_text = ' '.join(tokens)
        yield cleaned_text
        

def export_unigrams(unigram_path, texts, parser, stopwords):
    """
    Clean up the text and export it to a .txt file,
    where each line is one document
    """
    with open(unigram_path, 'w', encoding = 'utf_8') as f:
        for cleaned_text in clean_corpus(texts, parser, stopwords):
            f.write(cleaned_text + '\n')
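
The streaming pattern used above can be tried out without spaCy. The sketch below uses a simplified stand-in, `toy_clean_corpus`, which is a hypothetical helper that only lowercases, strips punctuation and drops stopwords and numbers (no lemmatization), to show how a generator lets us write one cleaned document per line without holding the whole corpus in memory. The sample texts, stopword set and file name are made up for this example.

```python
import os
import string
import tempfile


def toy_clean_corpus(texts, stopwords):
    """
    Simplified stand-in for clean_corpus: lowercases, strips
    punctuation and drops stopwords/numbers, yielding one
    cleaned document at a time
    """
    strip_punct = str.maketrans('', '', string.punctuation)
    for text in texts:
        words = text.lower().translate(strip_punct).split()
        kept = [w for w in words if w not in stopwords and not w.isdigit()]
        yield ' '.join(kept)


texts = ['The view was great!', 'We stayed 2 nights, it was small.']
stopwords = {'the', 'was', 'we', 'it'}

# documents are written out one by one, so the whole
# cleaned corpus never has to sit in memory at once
path = os.path.join(tempfile.mkdtemp(), 'unigram_demo.txt')
with open(path, 'w', encoding='utf_8') as f:
    for cleaned_text in toy_clean_corpus(texts, stopwords):
        f.write(cleaned_text + '\n')

# read the file back to check that each line is one document
with open(path, encoding='utf_8') as f:
    lines = f.read().splitlines()
print(lines)
```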

In [14]:
# a set of stopwords built into spaCy
stopwords = spacy.en.STOP_WORDS

texts = reviews['text']
UNIGRAM_PATH = 'unigram.txt'
if not os.path.exists(UNIGRAM_PATH):
    export_unigrams(UNIGRAM_PATH, texts = texts, parser = nlp, stopwords = stopwords)

In [17]:
from gensim.models.word2vec import LineSentence



# Reference

- [Blog: Natural Language Processing Made Easy – using SpaCy (in Python)](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/)