In [1]:
import pandas as pd
import numpy as np
import spacy

In [2]:
nlp = spacy.load('en')

In [3]:
doc = nlp("Tea is healthy and calming, don't you think?")

In [4]:
print("Token \t\tLemma \t\tStopword")
print('-'*40)
for token in doc:
    print(f'{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}')

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False


In [5]:
#Pattern Matching
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

#The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier.
#Setting attr='LOWER' matches the phrases on lowercased text, which makes the matching case-insensitive.

In [6]:
#Create a list of terms to match in the text.
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)

In [7]:
#Then create a document from the text to search and use the phrase matcher to find where the terms occur in the text
text_doc = nlp("Glowing review overall, and some really interesting side-by-side"
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year's iphone XS and Google Pixel 3.")
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 16, 18), (3766102292120407359, 21, 23), (3766102292120407359, 29, 31), (3766102292120407359, 32, 34)]


#### Each match is a tuple (match ID, start token index, end token index of the phrase)

In [8]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
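
To make the effect of attr='LOWER' concrete, here is a minimal pure-Python sketch of case-insensitive phrase matching over a list of tokens. It is not how spaCy implements PhraseMatcher; the helper `find_phrases` and the hand-written token list are assumptions made for illustration. The token list covers only the tail of the review text, so the offsets are relative to this shortened list rather than to text_doc.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets where any phrase occurs, ignoring case."""
    lowered = [t.lower() for t in tokens]
    # Split each phrase on whitespace and lowercase it, mirroring attr='LOWER'.
    patterns = [[w.lower() for w in p.split()] for p in phrases]
    matches = []
    for start in range(len(lowered)):
        for pattern in patterns:
            end = start + len(pattern)
            if lowered[start:end] == pattern:
                matches.append((start, end))
    return matches

# Hand-tokenized fragment (an assumption, not spaCy output).
tokens = ["the", "iPhone", "11", "Pro", "against", "the", "Galaxy", "Note",
          "10", "Plus", "and", "last", "year", "'s", "iphone", "XS",
          "and", "Google", "Pixel", "3", "."]
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']

spans = find_phrases(tokens, terms)
for start, end in spans:
    print(start, end, " ".join(tokens[start:end]))
```

As with the spaCy matches, each end offset is exclusive, so tokens[start:end] recovers the matched phrase. The lowercase "iphone XS" is found because both sides are lowercased before comparison.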


## Text Classification with spaCy

In [10]:
#Loading spam data
spam = pd.read_csv('spam.csv')
spam.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


### Bag of Words
Machine learning models don't learn from raw text data. Instead, you need to convert the text to something numeric. The simplest common representation is a variation of one-hot encoding: you represent each document as a vector of term frequencies, with one element for each term in the vocabulary.

For each document, count up how many times a term occurs, and place that count in the appropriate element of a vector.
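
The counting step can be sketched in plain Python. This is a minimal illustration, not what spaCy does internally; the two example documents, the whitespace tokenization, and the helper `bow_vector` are all assumptions.

```python
from collections import Counter

# Two hypothetical documents, tokenized naively on whitespace.
docs = ["Tea is healthy", "Tea is calming and tea is cheap"]
tokenized = [d.lower().split() for d in docs]

# The vocabulary is every distinct term, in a fixed (sorted) order.
vocab = sorted({term for doc in tokenized for term in doc})

def bow_vector(tokens, vocab):
    """Count each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vectors = [bow_vector(doc, vocab) for doc in tokenized]
print(vocab)
for vec in vectors:
    print(vec)
```

Every document maps to a vector of the same length (the vocabulary size), which is what lets the vectors be fed to a model regardless of how long each text is.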

Another common representation is TF-IDF (term frequency - inverse document frequency). TF-IDF is similar to bag of words, except that each term count is scaled down according to how common the term is across the corpus, so terms that appear in many documents carry less weight.
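
As a rough sketch of the idea, the snippet below weights each term count by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common textbook variant; libraries often add smoothing or normalization, so treat the exact formula, the toy documents, and the helper `tfidf` as assumptions.

```python
import math
from collections import Counter

# Three hypothetical messages, tokenized naively on whitespace.
docs = ["call now to win", "call me now", "call free free entry"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(tokens):
    """Scale each term count by log(N / df) for that term."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / df[term])
            for term, count in counts.items()}

weights = [tfidf(doc) for doc in tokenized]
for w in weights:
    print(w)
```

Here "call" appears in every document, so its weight is zero everywhere, while "free" appears twice in a single document and receives the largest weight.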

### Building a Bag of Words model
Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. spaCy handles the bag of words conversion and building a simple linear model for you with the 'TextCategorizer' class.

The TextCategorizer is a spaCy pipe. Pipes are classes for processing and transforming tokens. When you create a spaCy model with nlp = spacy.load('en_core_web_sm'), there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with doc = nlp('some text'), the output of the pipes is attached to the tokens in the doc object. The lemmas for token.lemma_ come from one of these pipes.

In [11]:
#Creates an empty model
nlp = spacy.blank('en')

#Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
            "textcat",
            config={
                "exclusive_classes":True,
                "architecture": "bow"
            })

#Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Since the classes are either ham or spam, we set "exclusive_classes" to True. We've also configured it with the bag of words ("bow") architecture. spaCy provides a convolutional neural network architecture as well.

In [12]:
#Add labels to text classifier
textcat.add_label('ham')
textcat.add_label('spam')

1

##### Training a Text Categorizer Model
Convert the labels in the data to the form TextCategorizer requires. For each document, create a dictionary of boolean values for each class.

If a text is "ham", we need a dictionary {'ham':True, 'spam':False}. The model is looking for these labels inside another dictionary with the key 'cats'.

In [13]:
train_texts = spam['text'].values
train_labels = [{'cats': {'ham':label == 'ham',
                          'spam':label == 'spam'}}
                for label in spam['label']]

##### Then combine the texts and labels into a single list

In [14]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

#### Ready to train the model
Create an optimizer using nlp.begin_training(). spaCy uses this optimizer to update the model. In general, it's more efficient to train models in small batches. spaCy provides the minibatch function, which returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then passed to nlp.update to update the model's parameters.

In [None]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

#Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)

#Iterate through minibatches
for batch in batches:
    #Each batch is a list of (text, label) but we need to send separate
    #lists for texts and labels to update(). This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
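
To see what the batching and the zip(*batch) split do without running spaCy, here is a small pure-Python sketch. `simple_minibatch` is a hypothetical stand-in written for this example, not spaCy's minibatch, and the toy data is made up; only the shape of the (text, label) pairs follows train_data above.

```python
from itertools import islice

def simple_minibatch(items, size):
    """Yield successive lists of at most `size` items (stand-in for minibatch)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Toy (text, label) pairs in the same shape as train_data.
toy_data = [(f"text {i}", {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
            for i in range(5)]

batch_sizes = []
for batch in simple_minibatch(toy_data, size=2):
    # zip(*batch) turns a list of (text, label) pairs into two parallel tuples.
    texts, labels = zip(*batch)
    batch_sizes.append(len(texts))
    print(texts, [label["cats"]["spam"] for label in labels])
```

Because the batches come from a generator, one pass over them consumes it. The loop above therefore amounts to a single pass over the data; training for several passes would need a fresh generator for each one.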