In [1]:
import spacy
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# The file is tab-separated and has no header row, so column names are supplied here.
spam = pd.read_csv('SMSSpamCollection', sep="\t", names=['type', 'text'])

In [3]:
spam

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


## Bag of Words

Machine learning models don't learn from raw text data. Instead, you need to convert the text to something numeric.

The simplest common representation is a variation of one-hot encoding. You represent each document as a vector of term frequencies for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).

As an example, take the sentences "Tea is life. Tea is love." and "Tea is healthy, calming, and delicious." as our corpus. The vocabulary then is {"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"} (ignoring punctuation).

For each document, count up how many times a term occurs, and place that count in the appropriate element of a vector. The first sentence has "tea" twice and that is the first position in our vocabulary, so we put the number 2 in the first element of the vector. Our sentences as vectors then look like

`v1=[2,2,1,1,0,0,0,0]`

`v2=[1,1,0,0,1,1,1,1]`
 
This is called the bag of words representation. You can see that documents with similar terms will have similar vectors. Vocabularies frequently have tens of thousands of terms, so these vectors can be very large.
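
The counting procedure above can be sketched in plain Python. This is a minimal illustration rather than how spaCy does it internally: the `tokenize` and `bag_of_words` helpers are hypothetical names introduced here, and the tokenizer simply lowercases the text and keeps runs of letters.

```python
import re

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    # Lowercase and keep only runs of letters, which drops punctuation.
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance across the corpus.
vocab = []
for document in corpus:
    for term in tokenize(document):
        if term not in vocab:
            vocab.append(term)

def bag_of_words(text):
    # Count how often each vocabulary term occurs in the document.
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocab]

v1, v2 = (bag_of_words(document) for document in corpus)
print(vocab)
print(v1)  # [2, 2, 1, 1, 0, 0, 0, 0]
print(v2)  # [1, 1, 0, 0, 1, 1, 1, 1]
```

The list-based vocabulary keeps the example readable; for a real corpus with tens of thousands of terms, a dictionary mapping each term to its index would be the more practical choice.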

Another common representation is **TF-IDF (Term Frequency - Inverse Document Frequency)**. TF-IDF is similar to bag of words except that each term count is scaled down according to how many documents in the corpus contain the term, so terms that appear almost everywhere carry less weight. Using TF-IDF can potentially improve your models. You won't need it here, but feel free to look it up.
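
To make the weighting concrete, here is a small sketch using the unsmoothed textbook formula, term count times `log(N / document frequency)`, on the same two tea sentences. The `tokenize`, `idf`, and `tfidf` helpers are hypothetical names for illustration. Libraries generally use smoothed or normalized variants, so their numbers will differ from these.

```python
import math
import re

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    # Lowercase and keep only runs of letters, which drops punctuation.
    return re.findall(r"[a-z]+", text.lower())

documents = [tokenize(text) for text in corpus]
vocab = sorted({term for tokens in documents for term in tokens})

def idf(term):
    # Number of documents that contain the term at least once.
    doc_freq = sum(term in tokens for tokens in documents)
    return math.log(len(documents) / doc_freq)

def tfidf(tokens):
    # Term count in this document, scaled by the inverse document frequency.
    return {term: tokens.count(term) * idf(term) for term in vocab}

weights = tfidf(documents[0])
print(weights["tea"])   # 0.0, because "tea" appears in every document
print(weights["life"])  # log(2), roughly 0.693
```

Terms shared by both sentences ("tea", "is") end up with zero weight, while terms unique to one sentence keep a positive weight.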

## Bag of Words with spaCy


Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. spaCy handles the bag of words conversion and builds a simple linear model for you with the `TextCategorizer` class.

The `TextCategorizer` is a spaCy **pipe**. **Pipes** are classes for processing and transforming tokens. When you create a spaCy model with `nlp = spacy.load('en_core_web_sm')`, there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with `doc = nlp("Some text here")`, the outputs of the pipes are attached to the tokens in the `doc` object. The lemmas for `token.lemma_` come from one of these pipes.

You can remove or add pipes to models. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we'll create a `TextCategorizer` pipe and add it to the empty model.
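
As a rough sketch of that setup, assuming spaCy v3 (where `add_pipe` takes the registered component name as a string), an empty English model can be created with `spacy.blank` and given a `textcat` pipe. The component is added with spaCy's default settings here, which may not be a pure bag of words model; the architecture can be chosen through the component config.

```python
import spacy

# A blank English pipeline contains only a tokenizer.
nlp = spacy.blank("en")

# Add a text categorizer with spaCy's default settings for this component.
textcat = nlp.add_pipe("textcat")

# Register the two classes the model should predict.
textcat.add_label("ham")
textcat.add_label("spam")

print(nlp.pipe_names)  # ['textcat']
print(textcat.labels)  # ('ham', 'spam')
```

Starting from a blank model keeps the pipeline small, since the tagger, parser, and entity recognizer are not needed for classifying messages.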

In [4]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup VERB dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


In [5]:
# Append a text categorizer to the end of the loaded pipeline.
textcat = nlp.add_pipe("textcat")

In [6]:
textcat.add_label("ham")
textcat.add_label("spam")

1

In [7]:
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham', 'spam': label == 'spam'}}
                for label in spam['type']]

In [8]:
train_texts[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [9]:
train_labels[0]

{'cats': {'ham': True, 'spam': False}}

In [10]:
train_data = list(zip(train_texts, train_labels))

In [11]:
train_data[0]

('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 {'cats': {'ham': True, 'spam': False}})

In [12]:
type(train_data)

list

In [13]:
doc

Apple is looking at buying U.K. startup for $1 billion

In [14]:
from spacy.training import Example
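
In spaCy v3, training data is passed to the pipeline as `Example` objects rather than raw `(text, annotations)` tuples. The following is a minimal sketch of turning one such pair into an `Example`, assuming a blank English pipeline for tokenization; the text and label are taken from row 1 of the data set shown earlier.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# One (text, annotations) pair in the same shape as the entries of train_data.
text = "Ok lar... Joking wif u oni..."
annotations = {"cats": {"ham": True, "spam": False}}

# make_doc only tokenizes; no other pipeline components are run.
example = Example.from_dict(nlp.make_doc(text), annotations)

# The gold-standard categories are stored on the reference document.
print(example.reference.cats)
```

The same call could be applied to every pair in `train_data` with a list comprehension to build the full training set.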

In [15]:
words = [token for token in doc]

In [16]:
words

[Apple, is, looking, at, buying, U.K., startup, for, $, 1, billion]

In [17]:
tags = [token.pos_ for token in doc]