# Vectorizing Raw Data: N-Grams

### N-Grams 

N-gram vectorization creates a document-term matrix in which each cell still holds a count, but each column represents a sequence of n adjacent words from the text rather than a single term.

For example, the sentence below produces the following n-grams:

"NLP is an interesting topic"

| n | Name      | Tokens                                                         |
|---|-----------|----------------------------------------------------------------|
| 2 | bigram    | ["nlp is", "is an", "an interesting", "interesting topic"]      |
| 3 | trigram   | ["nlp is an", "is an interesting", "an interesting topic"] |
| 4 | four-gram | ["nlp is an interesting", "is an interesting topic"]    |
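
The table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `ngrams` is hypothetical, and it tokenizes by lowercasing and splitting on whitespace, which is simpler than what a real vectorizer does.

```python
def ngrams(text, n):
    """Return every run of n adjacent lowercase tokens, joined by spaces."""
    tokens = text.lower().split()
    # A text with k tokens has k - n + 1 windows of length n (none if n > k)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "NLP is an interesting topic"
for n in (2, 3, 4):
    print(n, ngrams(sentence, n))
```

A five-word sentence yields four bigrams, three trigrams, and two four-grams, matching the rows of the table.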

### Read in text

In [3]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("Ex_Files_NLP_Python_ML_EssT\\Exercise Files\\Ch03\\03_04\\Start\\SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [4]:
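def clean_text(text):
    # Drop punctuation characters and lowercase what remains
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove stopwords, stem the remaining tokens, and rejoin them into one string
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

data['cleaned_text'] = data['body_text'].apply(clean_text)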
data.head()

|   | label | body_text | cleaned_text |
|---|-------|-----------|--------------|
| 0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd... |
| 1 | ham | Nah I don't think he goes to usf, he lives around here though | nah dont think goe usf live around though |
| 2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | even brother like speak treat like aid patent |
| 3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | date sunday |
| 4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend... |
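
To see what the first two steps of `clean_text` do to a single message, here is a standard-library-only sketch. The helper name `strip_and_tokenize` is hypothetical, and stopword removal and stemming are left out because they need the NLTK data. Unlike the original function, it also drops empty tokens.

```python
import re
import string

def strip_and_tokenize(text):
    # Drop every punctuation character and lowercase the rest
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters; discard empty strings from the split
    return [tok for tok in re.split(r"\W+", no_punct) if tok]

tokens = strip_and_tokenize("Nah I don't think he goes to usf, he lives around here though")
print(tokens)
```

Because the apostrophe is removed before tokenizing, `don't` becomes `dont`. That token survives into the cleaned text for row 1, presumably because the stopword list does not contain that spelling.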


### Apply CountVectorizer (w/ N-Grams)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vect = CountVectorizer(ngram_range=(2,4))
X_counts = ngram_vect.fit_transform(data["cleaned_text"])
print(X_counts.shape)
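# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(ngram_vect.get_feature_names_out().tolist())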

(5567, 90667)
['008704050406 sp', '008704050406 sp arrow', '0089mi last', '0089mi last four', '0089mi last four digit', '0121 2025050', '0121 2025050 visit', '0121 2025050 visit wwwshortbreaksorguk', '01223585236 xx', '01223585236 xx luv', '01223585236 xx luv nikiyu4net', '01223585334 cum', '01223585334 cum wan', '01223585334 cum wan 2c', '0125698789 ring', '0125698789 ring ur', '0125698789 ring ur around', '02 user', '02 user today', '02 user today lucki', '020603 2nd', '020603 2nd attempt', '020603 2nd attempt reach', '0207 153', '0207 153 9153', '0207 153 9153 offer', '0207 153 9996', '0207 153 9996 offer', '02072069400 bx', '02072069400 bx 526', '02072069400 bx 526 sw73ss', '02073162414 cost', '02073162414 cost 20pmin', '02073162414 cost 20pmin gsex', '02085076972 repli', '02085076972 repli stop', '02085076972 repli stop end', '020903 2nd', '020903 2nd attempt', '020903 2nd attempt contact', '021 3680', '021 3680 subject', '021 3680 subject ts', '021 3680offer', '021 3680offer end'
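
To make the shape of this output easier to reason about, the following sketch builds a small n-gram count matrix in plain Python. It is a simplified stand-in, not scikit-learn's implementation: the function name `ngram_count_matrix` and the three toy documents are assumptions, and it splits on whitespace, whereas `CountVectorizer`'s default token pattern also drops single-character tokens.

```python
from collections import Counter

def ngram_count_matrix(docs, ngram_range=(2, 4)):
    """Return (sorted vocabulary, dense count matrix) for word n-grams."""
    lo, hi = ngram_range
    per_doc = []
    for doc in docs:
        tokens = doc.split()
        # Count every n-gram for each n in the inclusive range lo..hi
        counts = Counter(
            " ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)
        )
        per_doc.append(counts)
    # One column per distinct n-gram seen in any document
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix

docs = ["date sunday", "free entri win", "free entri win cup"]
vocab, matrix = ngram_count_matrix(docs)
print(len(docs), "documents x", len(vocab), "n-gram columns")
for row in matrix:
    print(row)
```

Each extra word in a document adds up to three new columns (one bigram, one trigram, one four-gram), which is why the column count grows so much faster than the number of distinct words.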

### Apply CountVectorizer (w/ N-Grams) to smaller sample

In [15]:
data_sample = data[0:20]

ngram_vect_sample = CountVectorizer(ngram_range=(2,4))
X_counts_sample = ngram_vect_sample.fit_transform(data_sample["cleaned_text"])
print(X_counts_sample.shape)
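# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(ngram_vect_sample.get_feature_names_out().tolist())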

(20, 537)
['09061701461 claim', '09061701461 claim code', '09061701461 claim code kl341', '100 20000', '100 20000 pound', '100 20000 pound txt', '100000 prize', '100000 prize jackpot', '100000 prize jackpot txt', '11 month', '11 month entitl', '11 month entitl updat', '12 hour', '150pday 6day', '150pday 6day 16', '150pday 6day 16 tsandc', '16 tsandc', '16 tsandc appli', '16 tsandc appli repli', '20000 pound', '20000 pound txt', '20000 pound txt csh11', '2005 text', '2005 text fa', '2005 text fa 87121', '21st may', '21st may 2005', '21st may 2005 text', '4txtú120 poboxox36504w45wq', '4txtú120 poboxox36504w45wq 16', '6day 16', '6day 16 tsandc', '6day 16 tsandc appli', '81010 tc', '81010 tc wwwdbuknet', '81010 tc wwwdbuknet lccltd', '87077 eg', '87077 eg england', '87077 eg england 87077', '87077 trywal', '87077 trywal scotland', '87077 trywal scotland 4txtú120', '87121 receiv', '87121 receiv entri', '87121 receiv entri questionstd', '87575 cost', '87575 cost 150pday', '87575 cost 150pday

### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. For efficient storage, a sparse matrix records only the locations and values of its non-zero elements._
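
The idea can be illustrated with a dictionary keyed by `(row, column)`. This is a minimal sketch of the concept, with hypothetical helper names `to_sparse` and `to_dense`; it is not how SciPy or scikit-learn store their sparse matrices internally.

```python
def to_sparse(dense):
    """Keep only the non-zero cells as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }

def to_dense(sparse, shape):
    """Rebuild the full matrix; every cell not stored is 0."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (r, c), value in sparse.items():
        dense[r][c] = value
    return dense

dense = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [2, 0, 0, 0],
]
sparse = to_sparse(dense)
print(sparse)
print(len(sparse), "stored cells instead of", len(dense) * len(dense[0]))
```

The shape has to be kept alongside the stored cells, since rows or columns that are entirely zero leave no trace in the dictionary.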

In [16]:
X_counts_df = pd.DataFrame(X_counts_sample.toarray())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_counts_df.columns = ngram_vect_sample.get_feature_names_out()
X_counts_df

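|   | 09061701461 claim | 09061701461 claim code | 09061701461 claim code kl341 | 100 20000 | 100 20000 pound | 100 20000 pound txt | 100000 prize | 100000 prize jackpot | 100000 prize jackpot txt | 11 month | ... | word claim 81010 tc | wwwdbuknet lccltd | wwwdbuknet lccltd pobox | wwwdbuknet lccltd pobox 4403ldnw1a7rw18 | xxxmobilemovieclub use | xxxmobilemovieclub use credit | xxxmobilemovieclub use credit click | ye naughti | ye naughti make | ye naughti make wet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |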
