# 1. Intro

### Bayes' Theorem

Naive Bayes classifiers are a family of algorithms that share one common assumption: each feature used for classification is independent of every other feature, given the class. In other words, the presence or absence of one feature is assumed not to affect the presence or absence of any other feature.

Bayes' theorem is used to calculate the probability of an event occurring based on prior knowledge of conditions related to that event.
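
Written out, the theorem and the way the independence assumption is usually applied to classification look roughly like this. The symbols are chosen here for illustration: $A$ and $B$ are generic events, $c$ is a class (tag), and $x_1, \dots, x_n$ are the features of one sample.

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
$$

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The product in the second line is where the "naive" independence assumption enters: each feature contributes its own factor $P(x_i \mid c)$ without regard to the others.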

### Multinomial Naive Bayes

Multinomial Naive Bayes is a Bayesian learning approach popular in NLP. The program guesses the tag of a text, such as an email or newspaper story, using Bayes' theorem: it calculates each tag's probability for a given sample and outputs the tag with the highest probability.

Steps:
1 - Create a frequency table
2 - Find the probabilities and create a likelihood table
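
The two steps above can be sketched in plain Python on a toy corpus. This is only an illustration under stated assumptions: the four headlines are made up (not taken from the dataset), the names `samples`, `likelihood`, and `predict` are hypothetical, and additive smoothing with `alpha = 1.0` is assumed so that unseen words do not zero out a tag.

```python
import math
from collections import Counter, defaultdict

# Toy labelled headlines (made up for illustration, not from the dataset).
samples = [
    ("team wins final game", "SPORTS"),
    ("coach praises team", "SPORTS"),
    ("senate passes budget bill", "POLITICS"),
    ("president signs bill", "POLITICS"),
]

# Step 1: frequency table, i.e. word counts per tag.
freq = defaultdict(Counter)
tag_counts = Counter()
for text, tag in samples:
    freq[tag].update(text.split())
    tag_counts[tag] += 1

vocab = {word for counts in freq.values() for word in counts}
alpha = 1.0  # assumed smoothing value for this sketch


def likelihood(word, tag):
    # Step 2: smoothed P(word | tag) taken from the frequency table.
    total = sum(freq[tag].values())
    return (freq[tag][word] + alpha) / (total + alpha * len(vocab))


def predict(text):
    # Score each tag with log P(tag) + sum of log P(word | tag).
    scores = {}
    for tag in tag_counts:
        score = math.log(tag_counts[tag] / len(samples))
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(likelihood(word, tag))
        scores[tag] = score
    return max(scores, key=scores.get)


print(predict("team wins game"))     # SPORTS
print(predict("senate signs bill"))  # POLITICS
```

Working in log space avoids multiplying many small probabilities together, which is a common choice for this kind of classifier.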

# 2. Import Libraries and Read the Dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
import time

import re

In [2]:
df = pd.read_csv("./Datasets/NewsCategorizer.csv")

In [3]:
df.head()

Unnamed: 0,category,headline,links,short_description,keywords
0,WELLNESS,143 Miles in 35 Days: Lessons Learned,https://www.huffingtonpost.com/entry/running-l...,Resting is part of training. I've confirmed wh...,running-lessons
1,WELLNESS,Talking to Yourself: Crazy or Crazy Helpful?,https://www.huffingtonpost.com/entry/talking-t...,Think of talking to yourself as a tool to coac...,talking-to-yourself-crazy
2,WELLNESS,Crenezumab: Trial Will Gauge Whether Alzheimer...,https://www.huffingtonpost.com/entry/crenezuma...,The clock is ticking for the United States to ...,crenezumab-alzheimers-disease-drug
3,WELLNESS,"Oh, What a Difference She Made",https://www.huffingtonpost.com/entry/meaningfu...,"If you want to be busy, keep trying to be perf...",meaningful-life
4,WELLNESS,Green Superfoods,https://www.huffingtonpost.com/entry/green-sup...,"First, the bad news: Soda bread, corned beef a...",green-superfoods


# 3. Pre-Processing

In [4]:
df.drop(labels = ['links','short_description','keywords'],axis = 'columns', inplace=True)

In [5]:
df = df[['headline','category']]

In [6]:
df['category'].value_counts()

WELLNESS          5000
POLITICS          5000
ENTERTAINMENT     5000
TRAVEL            5000
STYLE & BEAUTY    5000
PARENTING         5000
FOOD & DRINK      5000
WORLD NEWS        5000
BUSINESS          5000
SPORTS            5000
Name: category, dtype: int64

Convert the categorical variables to numeric values. I show two different ways of doing this.

In [7]:
df['categories_numeric']=df.category.map({'WELLNESS':0, 'POLITICS':1,
                                         'ENTERTAINMENT':2, 'TRAVEL':3,
                                         'STYLE & BEAUTY':4, 'PARENTING':5,
                                         'FOOD & DRINK':6, 'WORLD NEWS':7,
                                         'BUSINESS':8, 'SPORTS':9})
df.drop(labels = ['category'], axis = 'columns', inplace = True)

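Second way: an explicit loop, which takes more code. This cell is an alternative to the one above and was not run; it expects the original `category` column to still be present.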

In [None]:
cols = list(df['category'])

for i in range (0, len(cols)):
    if cols[i] == 'WELLNESS':
        cols[i] = 0
    elif cols[i] == 'POLITICS':
        cols[i] = 1
    elif cols[i] == 'ENTERTAINMENT':
        cols[i] = 2
    elif cols[i] == 'TRAVEL':
        cols[i] = 3
    elif cols[i] == 'STYLE & BEAUTY':
        cols[i] = 4
    elif cols[i] == 'PARENTING':
        cols[i] = 5
    elif cols[i] == 'FOOD & DRINK':
        cols[i] = 6
    elif cols[i] == 'WORLD NEWS':
        cols[i] = 7
    elif cols[i] == 'BUSINESS':
        cols[i] = 8
    elif cols[i] == 'SPORTS':
        cols[i] = 9
new = pd.DataFrame(cols, columns = ['categories_numeric'], dtype=int)

# Store the codes under the column name that the later cells use
df['categories_numeric'] = new['categories_numeric']

In [8]:
df.categories_numeric.unique()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
#print how the new dataset looks
df.head()

Unnamed: 0,headline,categories_numeric
0,143 Miles in 35 Days: Lessons Learned,0
1,Talking to Yourself: Crazy or Crazy Helpful?,0
2,Crenezumab: Trial Will Gauge Whether Alzheimer...,0
3,"Oh, What a Difference She Made",0
4,Green Superfoods,0


# 4. Data Preparation for Model Training

In [10]:
from sklearn.model_selection import train_test_split

def split_data (features, labels, test_percent):
    x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=test_percent, random_state=42)
    return x_train, x_test, y_train, y_test

In [11]:
x_train, x_test, y_train, y_test = split_data(df.headline, df.categories_numeric, 0.2)

Convert the dataset to a "Bag of Words" model

In [12]:
# Convert the headlines into a Bag of Words model using bigrams (2-word sequences)

vect = CountVectorizer(ngram_range=(2,2))
# Fit the vocabulary on the training headlines and convert them to count vectors
X_train = vect.fit_transform(x_train)
# Convert the test headlines to count vectors using the same vocabulary
X_test = vect.transform(x_test)
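
To make the bigram Bag of Words idea concrete, here is a small stdlib-only sketch of roughly what happens to each headline. It is an approximation, not the library's implementation: the helper names `bigrams` and `to_vector` are hypothetical, and the tokenization (lowercasing, keeping tokens of two or more word characters) is assumed to mirror the `CountVectorizer` defaults.

```python
import re
from collections import Counter


def bigrams(text):
    # Lowercase, keep tokens of 2+ word characters, then pair adjacent tokens.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Two example "training" headlines (the first is invented for illustration).
headlines = ["Green Superfoods for a Green Kitchen", "Green Superfoods"]

# The vocabulary is built from the training headlines only.
vocabulary = sorted({bg for h in headlines for bg in bigrams(h)})


def to_vector(text):
    # Count how often each known bigram occurs in the text.
    counts = Counter(bigrams(text))
    return [counts[bg] for bg in vocabulary]


print(vocabulary)
# ['for green', 'green kitchen', 'green superfoods', 'superfoods for']
print(to_vector("Green Superfoods"))                 # [0, 0, 1, 0]
print(to_vector("Tom Brady traded to the Celtics"))  # [0, 0, 0, 0]
```

Note the last line: a headline whose bigrams were never seen during fitting maps to an all-zero vector, so a classifier has nothing but the class priors to go on for that input.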

# 5. Training Multinomial Naive Bayes Model

In [13]:
mnb = MultinomialNB(alpha = 0.2)

mnb.fit(X_train,y_train)

result = mnb.predict(X_test)

The accuracy initially looks low, but with ten balanced classes random guessing (the benchmark) would score about 10%, so the model performs roughly 55 percentage points better than random guessing.

In [14]:
# Accuracy (true labels first, predictions second)
accuracy_score(y_test, result)

0.651
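
The comparison with random guessing can be checked with a few lines of arithmetic. This sketch assumes a uniform random guesser over the ten categories; the variable names are hypothetical and the accuracy value is simply copied from the output above.

```python
n_classes = 10     # ten categories in the dataset
accuracy = 0.651   # score reported above

baseline = 1 / n_classes                   # expected accuracy of uniform random guessing
gain_points = (accuracy - baseline) * 100  # improvement in percentage points
ratio = accuracy / baseline                # how many times better than the baseline

print(f"baseline: {baseline:.1%}")
print(f"gain: {gain_points:.1f} percentage points")
print(f"ratio: {ratio:.1f}x the baseline")
```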

# 6. Testing on unseen data

In [15]:
#Function that maps the predicted class id back to its category name
#(expects a list containing a single headline)
def predict_title(news):
    test = vect.transform(news)
    pred = mnb.predict(test)
    if pred == 0: return 'WELLNESS'
    elif pred == 1: return 'POLITICS'
    elif pred == 2: return 'ENTERTAINMENT'
    elif pred == 3: return 'TRAVEL'
    elif pred == 4: return 'STYLE & BEAUTY'
    elif pred == 5: return 'PARENTING'
    elif pred == 6: return 'FOOD & DRINK'
    elif pred == 7: return 'WORLD NEWS'
    elif pred == 8: return 'BUSINESS'
    elif pred == 9: return 'SPORTS' 
    else: return 'no class found'
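
The if/elif chain above can also be expressed as a dictionary lookup. A minimal sketch, assuming the same ten category codes as the mapping used during pre-processing; `CATEGORY_TO_ID`, `ID_TO_CATEGORY`, and `label_for` are hypothetical names, not part of the notebook.

```python
# Same codes as the mapping applied to the 'category' column.
CATEGORY_TO_ID = {
    'WELLNESS': 0, 'POLITICS': 1, 'ENTERTAINMENT': 2, 'TRAVEL': 3,
    'STYLE & BEAUTY': 4, 'PARENTING': 5, 'FOOD & DRINK': 6,
    'WORLD NEWS': 7, 'BUSINESS': 8, 'SPORTS': 9,
}

# Invert the mapping so a predicted id can be turned back into a name.
ID_TO_CATEGORY = {idx: name for name, idx in CATEGORY_TO_ID.items()}


def label_for(pred_id):
    # int() also accepts NumPy integer scalars such as an element of a prediction array.
    return ID_TO_CATEGORY.get(int(pred_id), 'no class found')


print(label_for(9))   # SPORTS
print(label_for(42))  # no class found
```

In the notebook this could be called as `label_for(mnb.predict(vect.transform(news))[0])`, which keeps the forward and inverse mappings defined in one place.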

In [25]:
print(predict_title(["Tom Brady traded to the Celtics"]), 
      predict_title(["Don't give your kids crack cocaine"]),
      predict_title(["Morbius Breaks all time box office ticket sales"]),
      predict_title(["India Begins to Invade Canada starting at the Eastern Border"]))


SPORTS PARENTING ENTERTAINMENT ENTERTAINMENT


Clearly the model is imperfect, especially on the last prediction, but this is my first introduction to NLP.