In [1]:
import pandas as pd

In [2]:
# toy example: 4 documents
X_train = [
    'call you tonight',
    'call me a cab',
    'please call me... PLEASE',
    'he called the police'
]
X_train

['call you tonight',
 'call me a cab',
 'please call me... PLEASE',
 'he called the police']

We will use scikit-learn's CountVectorizer to convert text into a matrix of word counts.

In [3]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer() # with default parameters

In [4]:
# "learn the vocabulary"
vect.fit(X_train)

In [5]:
# examine the fitted vocabulary
vect.get_feature_names_out()

array(['cab', 'call', 'called', 'he', 'me', 'please', 'police', 'the',
       'tonight', 'you'], dtype=object)

In [6]:
# transform the training data into a document-term matrix
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4x10 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [7]:
X_train_dtm.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
       [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 2, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]])

In [8]:
pd.DataFrame(X_train_dtm.toarray(), columns=vect.get_feature_names_out(), index=X_train)

,cab,call,called,he,me,please,police,the,tonight,you
call you tonight,0,1,0,0,0,0,0,0,1,1
call me a cab,1,1,0,0,1,0,0,0,0,0
please call me... PLEASE,0,1,0,0,1,2,0,0,0,0
he called the police,0,0,1,1,0,0,1,1,0,0


In [9]:
X_test = ["please don't call me"]
X_test_dtm = vect.transform(X_test)
pd.DataFrame(X_test_dtm.toarray(), columns=vect.get_feature_names_out(), index=X_test)

,cab,call,called,he,me,please,police,the,tonight,you
please don't call me,0,1,0,0,1,1,0,0,0,0
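
Note that "don't" contributes nothing to the row above: tokens that were not seen during fit are dropped by transform. The following is a minimal pure-Python sketch of that behavior, not scikit-learn's actual implementation. It assumes the default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters); the helper name `tokenize` is made up for illustration.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

# Vocabulary learned from the four training documents above.
vocabulary = ['cab', 'call', 'called', 'he', 'me', 'please', 'police',
              'the', 'tonight', 'you']

tokens = tokenize("please don't call me")
# Tokens outside the fitted vocabulary ('don') are silently dropped.
counts = Counter(t for t in tokens if t in vocabulary)
row = [counts[term] for term in vocabulary]

print(tokens)  # ['please', 'don', 'call', 'me']
print(row)     # [0, 1, 0, 0, 1, 1, 0, 0, 0, 0]
```

The single-character "t" left over from "don't" is too short to count as a token, and "don" is not in the vocabulary, so neither appears in the resulting row.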


### Tuning the Vectorizer

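**stop_words:** Stop words are common words such as "I", "a", "an", "this", and "the" that add little meaning to a sentence. Removing them reduces the number of features. Note that scikit-learn's built-in English list is broad: on the toy data it also removes "call", "me", "please", "he", and "you".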

In [10]:
vect = CountVectorizer(stop_words='english')
vect.fit(X_train)
vect.get_feature_names_out()

array(['cab', 'called', 'police', 'tonight'], dtype=object)

In [11]:
# list of scikit-learn's built-in English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sorted(ENGLISH_STOP_WORDS)

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'bill',
 'both',
 'bottom',
 'but',
 'by',
 'call',
 'can',
 'cannot',
 'cant',
 'co',
 'con',
 'could',
 'couldnt',
 'cry',
 'de',
 'describe',
 'detail',
 'do',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eg',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'etc',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'fill',
 'find',
 'fire',
 'first',
 'five',
 'for'

**ngram_range:** An n-gram is a sequence of n consecutive words. For example, "apple juice" is a 2-gram (a bigram) and "I love apple juice" is a 4-gram. The ngram_range parameter is a (min_n, max_n) tuple giving the range of n-gram sizes to include as features. The examples above used the default, ngram_range=(1, 1), which keeps only unigrams; ngram_range=(2, 2) keeps only bigrams, and ngram_range=(1, 2) keeps both.

In [12]:
vect = CountVectorizer(ngram_range=(1, 3)) # unigrams, bigrams, and trigrams
vect.fit(X_train)
vect.get_feature_names_out()

array(['cab', 'call', 'call me', 'call me cab', 'call me please',
       'call you', 'call you tonight', 'called', 'called the',
       'called the police', 'he', 'he called', 'he called the', 'me',
       'me cab', 'me please', 'please', 'please call', 'please call me',
       'police', 'the', 'the police', 'tonight', 'you', 'you tonight'],
      dtype=object)
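
The feature 'call me cab' in the output above may look odd, since the document reads "call me a cab". A rough pure-Python sketch of how word n-grams could be built helps explain it. It assumes the default tokenization (lowercasing, tokens of two or more word characters), so "a" is discarded before the n-grams are formed. The function name `word_ngrams` is hypothetical and this is not scikit-learn's own code.

```python
import re

def word_ngrams(text, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n (hypothetical helper)."""
    # Assumed default tokenization: lowercase, tokens of 2+ word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams('call me a cab', 1, 3))
# ['call', 'me', 'cab', 'call me', 'me cab', 'call me cab']
```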

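**max_df / min_df:** When building the vocabulary, we can set the maximum document frequency (max_df) and the minimum document frequency (min_df). The document frequency of a term is the number of documents that contain it. A term is ignored if its document frequency is below min_df or above max_df, which lets us exclude terms that are too rare or too common to be useful. A float is interpreted as a proportion of the documents and an integer as an absolute count.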

In [13]:
# ignore items that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
vect.fit(X_train)
vect.get_feature_names_out()

array(['cab', 'called', 'he', 'me', 'please', 'police', 'the', 'tonight',
       'you'], dtype=object)

In [14]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
vect.fit(X_train)
vect.get_feature_names_out()

array(['call', 'me'], dtype=object)
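
To see where these two results come from, the document frequencies can be computed by hand. The sketch below is pure Python and only approximates what the vectorizer does internally; it assumes the default tokenization (lowercasing, tokens of two or more word characters). Note that "please" occurs twice in one document but still has a document frequency of 1.

```python
import re
from collections import Counter

X_train = [
    'call you tonight',
    'call me a cab',
    'please call me... PLEASE',
    'he called the police',
]

doc_freq = Counter()
for doc in X_train:
    # set(): each document counts a term at most once
    doc_freq.update(set(re.findall(r"\b\w\w+\b", doc.lower())))

n_docs = len(X_train)
# max_df=0.5 keeps terms found in at most 50% of the documents
kept_max_df = sorted(t for t, df in doc_freq.items() if df <= 0.5 * n_docs)
# min_df=2 keeps terms found in at least 2 documents
kept_min_df = sorted(t for t, df in doc_freq.items() if df >= 2)

print(doc_freq['call'], doc_freq['me'], doc_freq['please'])  # 3 2 1
print(kept_max_df)  # every term except 'call'
print(kept_min_df)  # ['call', 'me']
```

"call" appears in 3 of 4 documents (75%), so max_df=0.5 drops it, while "me" sits exactly at 50% and is kept.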

### Naive Bayes Classifier

Naive Bayes is not a single algorithm but a family of classification algorithms based on Bayes' theorem. They share one principle: every pair of features is assumed to be independent of each other, given the class label.

The "naive" part is this assumption that, within a class, the presence of one feature is unrelated to the presence of any other. For text, it means the words in a document are treated as independent of each other given the label, which is rarely true but often works well in practice.

The classifier works on the principle of conditional probability: the probability of an event given that another event has already occurred.

Bayes' theorem:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

Pros:
- Very fast
- Simple to implement


<u>Example</u>

Training documents:

| ID | Document | Label |
| :-- | :-- | :-- |
| Doc1 | "basketball is a great game" | not politics |
| Doc2 | "the election is over" | politics |
| Doc3 | "very clean debate" | politics |
| Doc4 | "a close but forgettable race" | not politics |
| Doc5 | "the election is a race" | politics |

Vocabulary: basketball, great, ... , race (stop words are ignored)

New document: "a very close race"

<u>Goal</u>: predict the label of the new document, i.e. estimate both $P(\text{politics} \mid \text{"a very close race"})$ and $P(\text{not politics} \mid \text{"a very close race"})$ and pick the larger.

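Naive Bayes predicts the label with the largest posterior probability. By Bayes' theorem:

$$
P(\text{politics} \mid \text{"a ... race"}) = \frac{P(\text{"a ... race"} \mid \text{politics})\, P(\text{politics})}{P(\text{"a ... race"})}
$$

$$
P(\text{not politics} \mid \text{"a ... race"}) = \frac{P(\text{"a ... race"} \mid \text{not politics})\, P(\text{not politics})}{P(\text{"a ... race"})}
$$

In each numerator, the last factor is the prior (the probability of the label) and the factor before it is the likelihood. The denominator is the same for both labels, so it can be ignored when comparing them.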

$P(\text{politics}) = \frac{\text{number of politics docs}}{\text{total number of docs}} = \frac{3}{5}$

$P(\text{not politics}) = \frac{2}{5}$

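Naive Bayes assumption (words are independent given the label; stop words are dropped):

$$
P(\text{"a ... race"} \mid \text{politics}) = P(\text{"very"} \mid \text{politics}) \cdot P(\text{"close"} \mid \text{politics}) \cdot P(\text{"race"} \mid \text{politics})
$$

Each factor is estimated from word counts, with smoothing:

$$
P(w \mid \text{politics}) = \frac{\text{number of times } w \text{ appears in politics docs} + \alpha}{\text{total number of words in politics docs} + \alpha \cdot \text{vocabulary size}}
$$

After removing stop words, the politics docs contain 7 words and the vocabulary has 11 words, so:

$$
P(\text{"very"} \mid \text{politics}) = \frac{1+\alpha}{7+11\alpha}
$$

$$
P(\text{"close"} \mid \text{politics}) = \frac{0+\alpha}{7+11\alpha}
$$

$$
P(\text{"race"} \mid \text{politics}) = \frac{1+\alpha}{7+11\alpha}
$$

"close" never appears in the politics docs, so without smoothing its probability would be 0 and the whole product would be 0. Laplace smoothing avoids this with a parameter $\alpha$ (typically between 0 and 1).

We add $\alpha$ to the numerator and $\alpha$ times the number of words in the vocabulary to the denominator.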
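
The whole calculation can be checked with a short pure-Python sketch. It is an illustration under stated assumptions, not a reference implementation: the stop-word set {a, is, the, but} is assumed because it reproduces the 7 words and 11 vocabulary entries used above, $\alpha = 1$ is chosen arbitrarily, and the names `tokenize` and `score` are made up. Exact fractions are used so the results can be compared with the formulas by hand.

```python
from collections import Counter
from fractions import Fraction

STOP_WORDS = {'a', 'is', 'the', 'but'}  # assumed stop-word list

train = [
    ('basketball is a great game', 'not politics'),
    ('the election is over', 'politics'),
    ('very clean debate', 'politics'),
    ('a close but forgettable race', 'not politics'),
    ('the election is a race', 'politics'),
]

def tokenize(text):
    """Split on whitespace and drop stop words (hypothetical helper)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

word_counts = {}        # label -> Counter of word occurrences
doc_counts = Counter()  # label -> number of documents
for text, label in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(tokenize(text))

vocabulary = set().union(*word_counts.values())  # 11 distinct words

def score(text, label, alpha=1):
    """Prior times smoothed word likelihoods (the shared denominator is omitted).

    alpha must be an int or Fraction here because exact fractions are used.
    """
    total_words = sum(word_counts[label].values())
    result = Fraction(doc_counts[label], len(train))  # prior
    for word in tokenize(text):
        result *= Fraction(word_counts[label][word] + alpha,
                           total_words + alpha * len(vocabulary))
    return result

new_doc = 'a very close race'
scores = {label: score(new_doc, label) for label in doc_counts}
prediction = max(scores, key=scores.get)

print(scores['politics'])      # 1/2430
print(scores['not politics'])  # 8/24565
print(prediction)              # politics
```

With $\alpha = 1$ the politics score is $\frac{3}{5} \cdot \frac{2}{18} \cdot \frac{1}{18} \cdot \frac{2}{18} \approx 0.00041$ and the not-politics score is $\frac{2}{5} \cdot \frac{1}{17} \cdot \frac{2}{17} \cdot \frac{2}{17} \approx 0.00033$, so the new document is labeled politics. In scikit-learn, the corresponding estimator would be MultinomialNB, which also takes an alpha parameter.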