In [None]:
import spacy
import pandas as pd

```bash
# download the package first
pip install spacy

# after that, download the trained English model
python -m spacy download en
```

In [12]:
reviews = pd.read_table('hotelreviews.txt', names = ['text'])
reviews.head()

Unnamed: 0,text
0,Nice place Better than some reviews give it cr...
1,what a surprise What a surprise the Sheraton w...
2,Good location Boston from 17th Floor of ...
3,Find an alternative to the Sheraton We stayed ...
4,Barely Tolerable If it were possible to give o...


The first step in using `spaCy` is to construct a language processing pipeline. Here we load the pre-trained English model.

In [14]:
nlp = spacy.load("en")

We'll grab a sample text and hand it over to spaCy, and be prepared to wait...

In [17]:
doc = reviews.loc[0, 'text']
parsed_doc = nlp(doc)
parsed_doc

Nice place Better than some reviews give it credit for. Overall, the rooms were a bit small but nice. Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city). Overall, it was a good experience and the staff was quite friendly. 

...1/20th of a second or so. Although the text looks exactly the same as before, a lot has actually happened under the hood. 

Let's take a look at what we got during that time. From here on, we'll look at the functionality that spaCy provides out of the box. The first is sentence detection/segmentation (note that all of these features have already been computed; all we're doing now is accessing them via attributes).

Every spaCy document is tokenized into sentences and further into tokens which can be accessed by iterating over the document.

In [20]:
# access the sents attribute, which is a
# generator that we can loop through
for num, sentence in enumerate(parsed_doc.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print()

# access the first token
print('tokens:')
print(parsed_doc[0])

Sentence 1:
Nice place Better than some reviews give it credit for.

Sentence 2:
Overall, the rooms were a bit small but nice.

Sentence 3:
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).

Sentence 4:
Overall, it was a good experience and the staff was quite friendly.

tokens:
Nice


What about other token-level attributes, such as the relative frequency of a token (how frequently each token/word appears in the English vocabulary), and whether or not a token matches any of the following categories?

- stopword (grammatically functional words that don't contribute too much to the context)
- punctuation
- whitespace
- number
- whether or not the token is included in spaCy's default vocabulary

spaCy expresses a token's relative frequency as a log probability, so a negative number closer to 0 means the token appears more often. In other words, the smaller the absolute value, the more common the token.
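To make the log-probability scale concrete, here is a minimal sketch using made-up word counts. The counts and the `log_prob` helper are hypothetical and are not spaCy's data or API; the sketch only illustrates why more frequent words end up with values closer to 0.

```python
import math

# hypothetical word counts from a tiny corpus,
# these are NOT the numbers spaCy uses
counts = {'it': 500, 'place': 40, 'reviews': 5}
total = sum(counts.values())

def log_prob(word):
    """natural log of the word's relative frequency"""
    return math.log(counts[word] / total)

# frequent words have a relative frequency closer to 1,
# hence a log probability closer to 0
for word in counts:
    print(word, round(log_prob(word), 3))
```

Because a relative frequency is always between 0 and 1, its logarithm is never positive, which is why every value in the `log_probability` column is negative.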

In [36]:
# the orth_ attribute will give us
# the string representation of the
# token as opposed to a spaCy Token object
token_attrs = [(token.orth_,
                token.prob,
                token.is_stop,
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
                for token in parsed_doc]

df = pd.DataFrame(token_attrs,
                  columns = ['text',
                             'log_probability',
                             'stop',
                             'punctuation',
                             'whitespace',
                             'number',
                             'out of vocab'])

# for a cleaner output, convert the boolean columns to show
# 'Yes' for True and a blank string for False
df.loc[:, 'stop':'out of vocab'] = (df.loc[:, 'stop':'out of vocab']
                                      .applymap(lambda x: 'Yes' if x else ''))
df.head(15)

Unnamed: 0,text,log_probability,stop,punctuation,whitespace,number,out of vocab
0,Nice,-9.845901,,,,,
1,place,-8.045827,,,,,
2,Better,-10.571031,,,,,
3,than,-6.372464,Yes,,,,
4,some,-6.402781,Yes,,,,
5,reviews,-11.378132,,,,,
6,give,-7.725083,Yes,,,,
7,it,-4.50645,Yes,,,,
8,credit,-8.618998,,,,,
9,for,-4.91397,Yes,,,,


In [40]:
import string
punctuations = string.punctuation


In [44]:
token = parsed_doc[0]
token

Nice

In [46]:
token.lemma_

'nice'

In [None]:
for token in parsed_doc:
    print(token.orth_, token.lemma_)

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
?TfidfVectorizer

In [None]:
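from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# assumption: the original stop word list is not shown,
# so we use spaCy's built-in English stop words here
stopwords = STOP_WORDS

def spacy_tokenizer(sentence):
    """splits a string/sentence into a sequence of lemmatized tokens"""
    tokens = nlp(sentence)
    # lemmatize and lowercase; spaCy (v2) lemmatizes pronouns to
    # "-PRON-", so for those we keep the lowercased text instead
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_
              for tok in tokens]
    # drop stop words and punctuation
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
    return tokens

# create a vectorizer that uses our custom spaCy tokenizer to generate feature vectors
vec = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
clf = LinearSVC()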
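
The filtering stage of `spacy_tokenizer` can be tried in isolation, without loading a spaCy model. The sketch below is a simplified stand-in, not the tokenizer above: it assumes a plain whitespace split in place of spaCy's tokenization and lemmatization, and `toy_stopwords` is a deliberately tiny, hypothetical stop word list.

```python
import string

# hypothetical, deliberately tiny stop word list for illustration
toy_stopwords = {'the', 'a', 'was', 'and', 'it', 'but'}
punctuations = string.punctuation

def simple_tokenizer(sentence):
    """lowercase, strip surrounding punctuation, drop stop words"""
    tokens = [tok.lower().strip(punctuations) for tok in sentence.split()]
    # an empty string is left over when a token was only punctuation
    return [tok for tok in tokens if tok and tok not in toy_stopwords]

print(simple_tokenizer('Overall, the rooms were a bit small but nice.'))
```

Any function with this shape (a string in, a list of string tokens out) can be passed to `CountVectorizer` through its `tokenizer` argument, in the same way as `spacy_tokenizer`.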

https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29