# Classifying Text Using Naive Bayes

## Tokenization

In [1]:
lines = [
    'How to tokenize?\nLike a boss.',
    'Google is accessible via http://www.google.com',
    '1000 new followers! #TwitterFamous',
]

lines

['How to tokenize?\nLike a boss.',
 'Google is accessible via http://www.google.com',
 '1000 new followers! #TwitterFamous']

In [2]:
for line in lines:
    print(line.split())

['How', 'to', 'tokenize?', 'Like', 'a', 'boss.']
['Google', 'is', 'accessible', 'via', 'http://www.google.com']
['1000', 'new', 'followers!', '#TwitterFamous']


### Regex

In [3]:
import re

_token_pattern = r"\w+"
token_pattern = re.compile(_token_pattern)
    
for line in lines:
    print(token_pattern.findall(line))

['How', 'to', 'tokenize', 'Like', 'a', 'boss']
['Google', 'is', 'accessible', 'via', 'http', 'www', 'google', 'com']
['1000', 'new', 'followers', 'TwitterFamous']


In [4]:
import re

_token_pattern = r"(?u)\b\w\w+\b"
token_pattern = re.compile(_token_pattern)
    
for line in lines:
    print(token_pattern.findall(line))


['How', 'to', 'tokenize', 'Like', 'boss']
['Google', 'is', 'accessible', 'via', 'http', 'www', 'google', 'com']
['1000', 'new', 'followers', 'TwitterFamous']


In [5]:
_token_pattern = r"\w+"
token_pattern = re.compile(_token_pattern)

def tokenizer(line):
    line = line.lower()
    line = re.sub(r'http[s]?://[\w\/\-\.\?]+','_url_', line)
    line = re.sub(r'\d+:\d+','_time_', line)
    line = re.sub(r'#\w+', '_hashtag_', line)
    line = re.sub(r'\d+','_num_', line)
    return token_pattern.findall(line)

for line in lines:
    print(tokenizer(line))

['how', 'to', 'tokenize', 'like', 'a', 'boss']
['google', 'is', 'accessible', 'via', '_url_']
['_num_', 'new', 'followers', '_hashtag_']
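
The order of the substitutions inside `tokenizer` matters: the `_time_` rule has to run before the `_num_` rule, otherwise the digits of a time are replaced before the time pattern can match. The sketch below illustrates this with a hypothetical sentence that is not part of the examples above; the two helper functions are illustrative names, not part of the notebook.

```python
import re

token_pattern = re.compile(r"\w+")

def tokenize_time_first(line):
    # Same order as the tokenizer above: times first, then bare numbers.
    line = line.lower()
    line = re.sub(r'\d+:\d+', '_time_', line)
    line = re.sub(r'\d+', '_num_', line)
    return token_pattern.findall(line)

def tokenize_num_first(line):
    # Reversed order: the time rule can no longer match,
    # because every digit has already been replaced.
    line = line.lower()
    line = re.sub(r'\d+', '_num_', line)
    line = re.sub(r'\d+:\d+', '_time_', line)
    return token_pattern.findall(line)

sample = 'Meet me at 10:30 with 2 friends'  # hypothetical input
print(tokenize_time_first(sample))
# ['meet', 'me', 'at', '_time_', 'with', '_num_', 'friends']
print(tokenize_num_first(sample))
# ['meet', 'me', 'at', '_num_', '_num_', 'with', '_num_', 'friends']
```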


# Vectorizing text into matrices

## Bag of words

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(lowercase=True, tokenizer=tokenizer)

X = vec.fit_transform(lines)

type(X)

scipy.sparse.csr.csr_matrix

In [7]:
X

<3x15 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [8]:
import pandas as pd


pd.DataFrame(X.todense(), columns=vec.get_feature_names_out())

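| | `_hashtag_` | `_num_` | `_url_` | a | accessible | boss | followers | google | how | is | like | new | to | tokenize | via |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |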


### Different sentences, same representation

In [9]:
flight = [
          'Flight was delayed, I am not happy',
          'Flight was not delayed, I am happy'
]

In [10]:
flight_vec = CountVectorizer(tokenizer=tokenizer)
X_flight = flight_vec.fit_transform(flight)

flight_df = pd.DataFrame(X_flight.todense(), columns=flight_vec.get_feature_names_out())
flight_df

| | am | delayed | flight | happy | i | not | was |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


The order of the tokens in the sentences is lost. That is why this method is known as bag of words: the result is like a bag into which the words are thrown in no particular order. As a consequence, the two sentences get identical representations even though they express opposite sentiments, so we cannot tell which of the two passengers is happy and which is not. To fix this problem, we can use n-grams.
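
The loss of word order can also be checked without scikit-learn. The following is a minimal sketch that builds a bag of words with `collections.Counter`; the helper name `bag_of_words` and the simple `\w+` pattern are assumptions made for illustration.

```python
import re
from collections import Counter

flight = [
    'Flight was delayed, I am not happy',
    'Flight was not delayed, I am happy',
]

def bag_of_words(sentence):
    # Count each lowercase token; a Counter keeps no positions.
    return Counter(re.findall(r"\w+", sentence.lower()))

bags = [bag_of_words(sentence) for sentence in flight]

# Both sentences contain the same words the same number of times.
print(bags[0] == bags[1])  # True
```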

## N-grams

Rather than treating each term as a token, we can treat each pair of consecutive terms as a single token.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

flight_ngram_vec = CountVectorizer(ngram_range=(2,2))
X_ngram_flight = flight_ngram_vec.fit_transform(flight)

flight_ngram_df = pd.DataFrame(
    X_ngram_flight.todense(), 
    columns=flight_ngram_vec.get_feature_names_out()
)

In [13]:
flight_ngram_df.index.name = 'doc-id'
flight_ngram_df

| doc-id | am happy | am not | delayed am | flight was | not delayed | not happy | was delayed | was not |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |


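Now the two rows differ, so we can tell who is happy and who is not: the first document contains `not happy`, while the second contains `am happy`. Note the token `delayed am`: this vectorizer uses the default token pattern, which drops single-character words such as "I". Tokens made of word pairs are known as bigrams. We can also use 3-grams (three consecutive words), 4-grams, or any other value of n.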
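
To make the idea concrete, here is a minimal sketch of how word n-grams can be generated from a list of tokens with a sliding window. The function name `ngrams` is an illustrative choice, and this is not how `CountVectorizer` is implemented internally.

```python
def ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'flight was not delayed'.split()

print(ngrams(tokens, 2))  # ['flight was', 'was not', 'not delayed']
print(ngrams(tokens, 3))  # ['flight was not', 'was not delayed']
```

In `CountVectorizer`, the same choice is made through `ngram_range`; for instance, `ngram_range=(1, 2)` would keep both single words and bigrams.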

## Using characters instead of words

Some situations may require us to tokenize our documents based on characters instead. In situations where word boundaries are not clear, such as in hashtags and URLs, the use of characters as tokens may help. Natural languages tend to have different frequencies for their characters. The letter **e** is the most commonly used character in the English language, and character combinations such as **th**, **er**, and **on** are also very common. Other languages, such as French and Dutch, have different character frequencies. If our aim is to classify documents based on their languages, the use of characters instead of words can come in handy.
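
As a rough illustration of why character statistics help with language identification, the sketch below compares the character bigrams of a short text against tiny per-language profiles. Everything in it is an assumption made for the example: the two sample sentences, the helper names, and the overlap score, which is a deliberately simple stand-in for a real classifier such as Naive Bayes.

```python
from collections import Counter

def char_profile(text, n=2):
    # Count every run of n consecutive characters, spaces included.
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical one-sentence "training sets"; real profiles need far more text.
samples = {
    'english': 'the weather is nice and the children are playing in the garden',
    'dutch': 'het weer is mooi en de kinderen spelen in de tuin',
}
profiles = {lang: char_profile(text) for lang, text in samples.items()}

def guess_language(text):
    grams = char_profile(text)

    def overlap(profile):
        # Number of bigram occurrences shared with the language profile.
        return sum(min(count, profile[gram]) for gram, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))

print(guess_language('the garden'))  # english
print(guess_language('de tuin'))     # dutch
```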

In [14]:
flight = [
          'Flight was delayed, I am not happy',
          'Flight was not delayed, I am happy'
]

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

char_vec = CountVectorizer(ngram_range=(4,4), analyzer='char')
X_char = char_vec.fit_transform(flight)

char_df = pd.DataFrame(
    X_char.todense(),
    columns= char_vec.get_feature_names_out()
)

char_df.index.name = 'doc-id'
char_df

| doc-id | `' am '` | `' del'` | `' hap'` | `' i a'` | `' not'` | `' was'` | `', i '` | `'am h'` | `'am n'` | `'appy'` | ... | `'not '` | `'ot d'` | `'ot h'` | `'s de'` | `'s no'` | `'t de'` | `'t ha'` | `'t wa'` | `'was '` | `'yed,'` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |


All our tokens are now four characters long. Whitespace is also treated as a character, so a token can span the boundary between two words (for example, `am h`). With characters, it is common to use higher values of n than with words.
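
A minimal sketch of character n-gram extraction, assuming we lowercase the text and collapse runs of whitespace into single spaces first. The function name `char_ngrams` is illustrative and not part of scikit-learn.

```python
def char_ngrams(text, n):
    # Lowercase and collapse runs of whitespace into single spaces.
    text = ' '.join(text.lower().split())
    # Slide a window of n characters over the text; spaces are kept.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams('I am happy', 4)
print(grams)
# ['i am', ' am ', 'am h', 'm ha', ' hap', 'happ', 'appy']
```

If n-grams that cross word boundaries are not wanted, `CountVectorizer` also accepts `analyzer='char_wb'`, which builds character n-grams only from text inside word boundaries.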

## Capturing important words with TF-IDF

In [18]:
lines_fruits = [
    'I like apples',
    'I like oranges',
    'I like pears',
]

In [19]:
# CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer


vec = CountVectorizer(token_pattern=r'\w+')
X = vec.fit_transform(lines_fruits)

df1 = pd.DataFrame(
    X.todense().astype(float).round(2),
    columns=vec.get_feature_names_out()
)

df1.index.name = 'CountVectorizer'

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(token_pattern=r'\w+')
X = vec.fit_transform(lines_fruits)

df2 = pd.DataFrame(
    X.todense().round(2),
    columns=vec.get_feature_names_out()
)

df2.index.name = 'TF-IDF'
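
To see where the TF-IDF weights come from, the sketch below recomputes them by hand. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, raw term counts), for which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the term. The helper name `tfidf_row` and the whitespace tokenization are simplifications for this example.

```python
import math

lines_fruits = ['I like apples', 'I like oranges', 'I like pears']
docs = [line.lower().split() for line in lines_fruits]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    # Term count times IDF, then scale the row to unit L2 norm.
    weights = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [round(w / norm, 2) for w in weights]

for doc in docs:
    print(dict(zip(vocab, tfidf_row(doc))))
# {'apples': 0.77, 'i': 0.45, 'like': 0.45, 'oranges': 0.0, 'pears': 0.0}
# {'apples': 0.0, 'i': 0.45, 'like': 0.45, 'oranges': 0.77, 'pears': 0.0}
# {'apples': 0.0, 'i': 0.45, 'like': 0.45, 'oranges': 0.0, 'pears': 0.77}
```

The words "i" and "like" occur in every document, so their IDF is the minimum value of 1, while each fruit name occurs in a single document and receives a larger weight.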