### Introduction
Data comes in many forms: numbers, timestamps, dates, images, and text. Text differs from the others in how we collect it, in its shape and nature, and especially in what we can do with it and how.
In this notebook, I will go through some basic operations and functions to see what kind of fun we can have with textual data.


In [1]:
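## Importing the spaCy library for NLP:
import spacy
## We are going to analyze text written in English, so let's load the English model as well.
## Note: the 'en' shortcut works in spaCy v2; spaCy v3 needs the full model name, e.g. 'en_core_web_sm'.
nlp = spacy.load('en')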


In [2]:
## Let's process a short text to see what we can do:
doc = nlp("Football is not my favorite sport")


In [3]:
## Let's split our text into tokens:
for token in doc:
    print(token)

Football
is
not
my
favorite
sport


In [4]:
# Slicing a Doc returns a Span; this one covers the tokens at positions 1 and 2
span = doc[1:3]
span.text

'is not'

#### Preprocessing
Before we start doing complex things, we need to apply some transformations to our data. These transformations are a bit different from the ones applied to non-text data. We will start by lemmatizing our words and flagging stopwords.
We need to do this because not all words carry the same weight: inflected forms such as "is" reduce to a single base form (the lemma "be"), and stopwords are very frequent words that usually add little meaning on their own.


In [5]:
# Plain string header: the original f-string had no placeholders and the .format() call did nothing
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")


Token 		Lemma 		Stopword
----------------------------------------
Football		football		False
is		be		True
not		not		True
my		-PRON-		True
favorite		favorite		False
sport		sport		False
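
As a small follow-up to the table above, here is a minimal sketch of how the lemma and stopword flags could be used to keep only the content words. The rows are copied from the printed output rather than recomputed, and the names `rows` and `content_lemmas` are illustrative assumptions, not part of spaCy.

```python
# Rows copied from the table above: (token, lemma, is_stop).
rows = [
    ("Football", "football", False),
    ("is", "be", True),
    ("not", "not", True),
    ("my", "-PRON-", True),
    ("favorite", "favorite", False),
    ("sport", "sport", False),
]

# Keep the lemma of every token that is not flagged as a stopword.
content_lemmas = [lemma for _, lemma, is_stop in rows if not is_stop]
print(content_lemmas)
```

With a real `doc`, the same filter would read `[token.lemma_ for token in doc if not token.is_stop]`. Note that "not" is flagged as a stopword here, so dropping stopwords blindly would flip the meaning of this sentence.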


There are various other transformations that we could apply to our text, such as converting it to lower case or matching words and phrases against a list of terms.

In [6]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting attr='LOWER' makes the matcher compare the lowercased form of each token, which gives case-insensitive matching.

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model.

In [7]:
terms = ['Player', 'goal', 'ball', 'game']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

In [8]:
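# Adjacent string literals are joined without any separator, so "times" and "Just"
# end up fused into a single token ("timesJust"). The three matches are unaffected.
text_doc = nlp("I really am looking forward to the game tonight "
               "They missed the goal 5 times"
               "Just don't drop the ball")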
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 7, 8), (3766102292120407359, 12, 13), (3766102292120407359, 19, 20)]


Each match is a tuple of the match ID and the start and end token positions of the matched phrase. The end position is exclusive, so text_doc[start:end] returns the matched span.



In [9]:
match_id, start, end = matches[1]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList goal
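
To make the (id, start, end) convention concrete, here is a rough pure-Python sketch of case-insensitive term matching. It is a simplified stand-in, not how PhraseMatcher is implemented: it splits on whitespace instead of using spaCy's tokenizer, it only handles single-word terms, and the function name `match_terms` is hypothetical.

```python
def match_terms(tokens, terms):
    """Return (term, start, end) for every token equal to a term, ignoring case."""
    lowered_terms = {term.lower() for term in terms}
    matches = []
    for i, token in enumerate(tokens):
        if token.lower() in lowered_terms:
            # `end` is exclusive, so tokens[start:end] gives back the match.
            matches.append((token.lower(), i, i + 1))
    return matches


tokens = "They missed the Goal 5 times".split()
found = match_terms(tokens, ["Player", "goal", "ball", "game"])
for term, start, end in found:
    print(term, tokens[start:end])
```

The capitalised "Goal" still matches the lowercase term "goal", which is the effect attr='LOWER' has in the real matcher.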


#### Text Classification
Text classification is one of the most common NLP tasks. We will use spaCy to analyze and build an understanding of a piece of text, then use it to detect whether the text is spam or not. This should be interesting:

In [10]:
import pandas as pd

# Loading the spam data, we took this from a Kaggle competition dataset
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')
spam.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


### Building a Bag of Words

In [11]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)
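
For readers new to the term, here is a minimal sketch of what a bag-of-words representation is: a count of tokens that ignores their order. This only illustrates the idea; spaCy's "bow" architecture is more elaborate than a plain counter, and the helper name `bag_of_words` is an assumption made for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


bow_a = bag_of_words("free tea free entry")
bow_b = bag_of_words("entry free tea free")

print(bow_a)
# Same words in a different order give the same bag.
print(bow_a == bow_b)
```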

In [12]:
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

1

### Training a Text Categorizer Model

In [13]:
''' 
Next, you'll convert the labels in the data to the form TextCategorizer requires. For each document, 
you'll create a dictionary of boolean values for each class.
For example, if a text is "ham", we need a dictionary {'ham': True, 'spam': False}. 
The model is looking for these labels inside another dictionary with the key 'cats'.
'''
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}} 
                for label in spam['label']]

In [14]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]
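
The label conversion can also be tried on its own, without pandas or spaCy. The sketch below wraps it in a small helper; the name `to_cats` and its `classes` parameter are illustrative assumptions, and the three labels are the first three rows of the dataset shown above.

```python
def to_cats(label, classes=("ham", "spam")):
    """Build the {'cats': {...}} annotation for a single label."""
    return {"cats": {name: label == name for name in classes}}


labels = ["ham", "ham", "spam"]
annotations = [to_cats(label) for label in labels]
print(annotations[2])
```

Writing the classes as a parameter makes it easy to see that exactly one entry is True per text, which is what "exclusive_classes" expects.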

In [15]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
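
If the batching and the zip(*batch) trick look opaque, this pure-Python sketch shows both steps on toy data. `simple_minibatch` is a hypothetical, simplified stand-in for spacy.util.minibatch with a fixed batch size, and `toy_data` is made up for the example.

```python
def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy (text, annotation) pairs shaped like train_data.
toy_data = [
    ("text %d" % i, {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
    for i in range(5)
]

batches = list(simple_minibatch(toy_data, size=2))

# zip(*batch) turns a list of (text, label) pairs into two parallel tuples.
texts, labels = zip(*batches[0])
print(texts)
print(labels)
```

Five items with a batch size of 2 give three batches, the last of which holds a single item.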

In [16]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

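# `losses` is created once and never reset, so the values printed after each
# epoch are running totals rather than per-epoch losses.
losses = {}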
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 0.4348172624983704}
{'textcat': 0.6518694226331547}
{'textcat': 0.7900564276918112}
{'textcat': 0.8785885196629231}
{'textcat': 0.9360322702493176}
{'textcat': 0.9741643182592479}
{'textcat': 1.0029479640338683}
{'textcat': 1.022063153248012}
{'textcat': 1.0369556128306696}
{'textcat': 1.0472958829307644}
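
The printed loss goes up after every epoch, which can look like the model is getting worse. A likely reading, assuming spaCy adds each batch's loss to the existing entry in the `losses` dictionary and the dictionary is never reset, is that these are running totals. Under that assumption, the sketch below recovers the per-epoch loss by differencing the values printed above.

```python
# Values copied from the training output above.
cumulative = [
    0.4348172624983704,
    0.6518694226331547,
    0.7900564276918112,
    0.8785885196629231,
    0.9360322702493176,
    0.9741643182592479,
    1.0029479640338683,
    1.022063153248012,
    1.0369556128306696,
    1.0472958829307644,
]

# The first epoch's loss is the first total; later ones are successive differences.
per_epoch = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

for epoch, loss in enumerate(per_epoch, start=1):
    print(f"epoch {epoch}: {loss:.4f}")
```

Read this way, the loss shrinks at every epoch. Creating `losses = {}` inside the epoch loop would print per-epoch values directly.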


### Making Predictions

In [17]:
''' 
Now that you have a trained model, you can make predictions with the predict() method. 
The input text needs to be tokenized with nlp.tokenizer. 
Then you pass the tokens to the predict method which returns scores. 
Each row of scores holds the probabilities that the input text belongs to each class, in the order of textcat.labels.
''' 
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

[[9.9994671e-01 5.3246935e-05]
 [1.2245358e-02 9.8775464e-01]]


In [18]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['ham', 'spam']
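
The argmax step can be reproduced without NumPy. In this sketch the scores are copied from the output above, the label order ('ham', 'spam') is assumed to match textcat.labels as the printed predictions suggest, and `best_label` is a hypothetical helper name.

```python
labels = ("ham", "spam")

# Scores copied from the prediction output above, one row per text.
scores = [
    [9.9994671e-01, 5.3246935e-05],
    [1.2245358e-02, 9.8775464e-01],
]


def best_label(row, labels):
    """Return the label with the highest score in `row`, with that score."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return labels[best_index], row[best_index]


predictions = [best_label(row, labels) for row in scores]
for label, score in predictions:
    print(f"{label}: {score:.4f}")
```

Keeping the score next to the label is handy when you want to apply a confidence threshold instead of always taking the top class.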


   © www.wajdibensaad.com 