# Naive Bayes and NLP (Natural Language Processing)
- text classification...
- supervised learning with raw text.
- Bayes' theorem is a probability formula that uses previously known probabilities to compute the probability of related events occurring.
- Naive Bayes: we model the probability of belonging to a class GIVEN a vector of features.
- supervised learning: we must have existing labels for historical data.
- NB can treat each word OR letter as a feature to classify a record.
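
For reference, the two ideas above can be written out as formulas. The notation is chosen here and is not the author's: $C$ is the class and $x_1, \dots, x_n$ are the features (for example, word counts). Bayes' theorem gives the probability of the class given the features:

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
$$

The "naive" part is the assumption that the features are conditionally independent given the class, which reduces the numerator to a product of per-feature terms:

$$
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$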

## Feature Extraction from Text Data
Most ML algorithms cannot take in raw text as data. We instead need to transform the text into features (numerical) that we can pass into the machine.
We can use:
- Count vectorisation: create a vocabulary of all possible words across the documents (texts), then count how many times each word occurs in each document (most values will be 0). This creates a DTM (Document Term Matrix) that maps word frequencies.
    + must account for very common words, as they don't help distinguish documents.
    + must account for words that are common across all documents (not too important, so they can be excluded).
    + most NLP libraries ship common lists of "Stop Words" to ignore.
    + inverse document frequency factor: decreases the weighting of common words and increases the weighting of rare words.
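
As a rough illustration of stop-word removal, here is a minimal sketch in plain Python. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names made up for this example; real libraries ship much longer lists.

```python
# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"this", "is", "a", "about", "our", "are"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


doc = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
kept = remove_stop_words(doc)
print(kept)
```

Only the content-bearing words survive, which is what makes the remaining counts more useful as features.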
    
We need to create a "Bag of words" model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
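with open('One.txt') as mytext:
    a = mytext.read()
    # read() consumes the whole file, so a later readlines() would return [];
    # split the string we already have instead
    alines = a.splitlines()
    arr = a.lower().split()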

In [9]:
a

'This is a story about dogs\nour canine pets\nDogs are furry animals\n'

In [4]:
print(a)

This is a story about dogs
our canine pets
Dogs are furry animals



In [12]:
arr

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [13]:
with open('One.txt') as mytext:
    words_one = mytext.read().lower().split()
    uni_words_one = set(words_one)

In [14]:
uni_words_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [17]:
with open('Two.txt') as mytext:
    words_two = mytext.read().lower().split()
    uni_words_two = set(words_two)

In [18]:
uni_words_two

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [19]:
# Combine the two sets...
all_uni_words = set()
all_uni_words.update(uni_words_one)
all_uni_words.update(uni_words_two)

In [20]:
all_uni_words

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [21]:
full_vocab = dict()

In [22]:
i = 0

In [23]:
for word in all_uni_words:
    full_vocab[word] = i
    i = i + 1

In [24]:
full_vocab

{'sport': 0,
 'surfing': 1,
 'catching': 2,
 'story': 3,
 'waves': 4,
 'furry': 5,
 'a': 6,
 'about': 7,
 'popular': 8,
 'our': 9,
 'dogs': 10,
 'fun': 11,
 'pets': 12,
 'this': 13,
 'canine': 14,
 'are': 15,
 'animals': 16,
 'water': 17,
 'is': 18}
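
The indices above depend on the iteration order of a set of strings, which Python does not guarantee to be the same from one interpreter run to the next. A minimal sketch of one way to make the mapping reproducible, assuming alphabetical column order is acceptable, is to sort the words before numbering them:

```python
# The 19 unique words from the two documents, as printed earlier
all_uni_words = {
    'a', 'about', 'animals', 'are', 'canine', 'catching', 'dogs', 'fun',
    'furry', 'is', 'our', 'pets', 'popular', 'sport', 'story', 'surfing',
    'this', 'water', 'waves',
}

# Sorting first makes the word -> column index mapping the same on every run
full_vocab = {word: index for index, word in enumerate(sorted(all_uni_words))}

print(full_vocab['a'], full_vocab['waves'])
```

With this variant the column order will differ from the output shown above, but the counting steps that follow work unchanged.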

In [25]:
one_freq = [0] * len(full_vocab)
two_freq = [0] * len(full_vocab)
all_words = ['']*len(full_vocab)

In [26]:
one_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [27]:
with open('One.txt') as f:
    one_text = f.read().lower().split()

In [28]:
# Nice algo trick right there to count frequencies!
for word in one_text:
    word_ind = full_vocab[word]
    one_freq[word_ind] += 1

In [29]:
with open('Two.txt') as f:
    two_text = f.read().lower().split()

In [30]:
# Nice algo trick right there to count frequencies!
for word in two_text:
    word_ind = full_vocab[word]
    two_freq[word_ind] += 1

In [31]:
for word in full_vocab:
    word_index = full_vocab[word]
    all_words[word_index] = word

In [32]:
all_words

['sport',
 'surfing',
 'catching',
 'story',
 'waves',
 'furry',
 'a',
 'about',
 'popular',
 'our',
 'dogs',
 'fun',
 'pets',
 'this',
 'canine',
 'are',
 'animals',
 'water',
 'is']

In [36]:
# Bag of words...
bow = pd.DataFrame(data=[one_freq, two_freq], columns=all_words)

In [37]:
bow

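|   | sport | surfing | catching | story | waves | furry | a | about | popular | our | dogs | fun | pets | this | canine | are | animals | water | is |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 3 |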
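
The manual steps above (build a vocabulary, then count per document) can be condensed with `collections.Counter`. This is a sketch rather than the notebook's method: `bag_of_words` is a hypothetical helper, and the second document is an invented stand-in because the contents of `Two.txt` are not shown.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, rows): a sorted vocabulary and one count row per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenised for word in tokens})
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)  # word -> occurrences in this document
        # A Counter returns 0 for words it has not seen
        rows.append([counts[word] for word in vocab])
    return vocab, rows


docs = [
    "This is a story about dogs\nour canine pets\nDogs are furry animals\n",  # contents of One.txt
    "This is a story about surfing",  # hypothetical stand-in, not the real Two.txt
]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

Passing the result to `pd.DataFrame(rows, columns=vocab)` gives the same layout as `bow`, with one row per document and one column per word.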


## Doing it with scikit-learn

In [38]:
text = ['This is a leggolas line', 'This is another big line', 'completely different line mate']

In [39]:
# Count vectoriser to do that work for us automatically...
from sklearn.feature_extraction.text import CountVectorizer

In [48]:
cv = CountVectorizer(stop_words='english')
# Get unique vocab on fit and do the freq counts on transform
sparse_matrix = cv.fit_transform(text)

In [49]:
cv

CountVectorizer(stop_words='english')

In [50]:
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 0],
        [1, 0, 0, 0, 1, 0],
        [0, 1, 1, 0, 1, 1]], dtype=int64)

In [51]:
cv.vocabulary_

{'leggolas': 3,
 'line': 4,
 'big': 0,
 'completely': 1,
 'different': 2,
 'mate': 5}
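
`cv.vocabulary_` maps each term to its column index, so it can be used to read the dense matrix. The sketch below uses plain Python with the values copied from the two outputs above; the names `columns` and `decoded` are illustrative. Recent scikit-learn versions also provide `cv.get_feature_names_out()`, which should return the terms in the same column order.

```python
# Values copied from the cv.vocabulary_ and todense() outputs above
vocabulary = {'leggolas': 3, 'line': 4, 'big': 0,
              'completely': 1, 'different': 2, 'mate': 5}
counts = [
    [0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
]

# Order the terms by column index so that position i names column i
columns = sorted(vocabulary, key=vocabulary.get)

# For each document, keep only the terms with a non-zero count
decoded = [
    {term: n for term, n in zip(columns, row) if n > 0}
    for row in counts
]

print(columns)
for doc_terms in decoded:
    print(doc_terms)
```

The first two documents keep only two terms each because `this`, `is`, `a` and `another` were removed by `stop_words='english'`.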

In [55]:
from sklearn.feature_extraction.text import TfidfTransformer
# TF-IDF = term frequency - inverse document frequency

In [56]:
tfidf = TfidfTransformer()

In [58]:
results = tfidf.fit_transform(sparse_matrix) # Pass in a bag of words!

In [61]:
results.todense() # Don't call on big data sets: it builds the full dense matrix in memory

matrix([[0.        , 0.        , 0.        , 0.861037  , 0.50854232,
         0.        ],
        [0.861037  , 0.        , 0.        , 0.        , 0.50854232,
         0.        ],
        [0.        , 0.54645401, 0.54645401, 0.        , 0.32274454,
         0.54645401]])
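
To see where these numbers come from, the weights can be recomputed by hand from the count matrix. The sketch below assumes the `TfidfTransformer` defaults: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. Variable names are illustrative.

```python
import math

# Count matrix from CountVectorizer above
# (columns: big, completely, different, leggolas, line, mate)
counts = [
    [0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed to match the smooth_idf=True default)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf_rows = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf_rows.append([v / norm for v in weighted])

for row in tfidf_rows:
    print([round(v, 8) for v in row])
```

`line` appears in every document, so its IDF is exactly 1 and it ends up with the lowest weight in each row, while terms unique to one document are weighted up.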

In [62]:
# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step
from sklearn.feature_extraction.text import TfidfVectorizer

In [63]:
# No stop_words argument here, so 'this', 'is' and 'another' stay in the vocabulary
tv = TfidfVectorizer()

In [64]:
tv_results = tv.fit_transform(text)

In [65]:
tv_results.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.4804584 ,
         0.63174505, 0.37311881, 0.        , 0.4804584 ],
        [0.53409337, 0.53409337, 0.        , 0.        , 0.40619178,
         0.        , 0.31544415, 0.        , 0.40619178],
        [0.        , 0.        , 0.54645401, 0.54645401, 0.        ,
         0.        , 0.32274454, 0.54645401, 0.        ]])
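
This matrix has 9 columns rather than 6 because no stop-word list was passed. The word `a` is still missing: the vectorizers' default token pattern only keeps tokens of two or more word characters. A small sketch with the standard `re` module, assuming that default pattern, reproduces the vocabulary:

```python
import re

text = ['This is a leggolas line', 'This is another big line',
        'completely different line mate']

# Assumed to mirror the default token pattern: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Lower-case each document, extract tokens, and collect the unique terms
vocab = sorted({token for doc in text for token in TOKEN_PATTERN.findall(doc.lower())})

print(vocab)
print(len(vocab))
```

The sorted terms line up with the column order of the matrix above, so column 6 (`line`) is the one that is non-zero in every row.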