# Multinomial Naive Bayes

![](pic1.png)

- Multinomial Naive Bayes is a probabilistic classification model that is widely used in NLP, for example to categorize text.
- Unlike distance-based methods such as K-Means clustering, or tree-based methods such as decision tree forests, it scores each class by the probability (the “likelihood”) that the data belongs to it.
- **ADVANTAGE**: Its main advantage is significantly reduced complexity. It can classify using small training sets and does not need to be continuously re-trained.
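
To make the "likelihood" idea in the bullets above concrete, here is a minimal from-scratch sketch of the scoring rule. The toy headlines, the `fit` and `predict` helpers, and the smoothing value `alpha=1.0` are all assumptions made for illustration; they are not taken from the dataset used in this notebook.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy headlines with labels (b = business, t = technology).
train = [
    ("fed raises interest rates", "b"),
    ("stocks fall after fed remarks", "b"),
    ("new phone launches with faster chip", "t"),
    ("chip maker unveils new processor", "t"),
]

def fit(samples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Denominator: all word occurrences in class c plus smoothing mass.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict(text, log_prior, log_lik, vocab):
    """Return the class with the highest log prior + summed log likelihood."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are ignored.
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

log_prior, log_lik, vocab = fit(train)
print(predict("fed official comments on rates", log_prior, log_lik, vocab))
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that never appeared in a class from forcing that class's score to zero.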

In [1]:
import numpy as np 
import pandas as pd

In [2]:
df = pd.read_csv("sample_news.csv")

In [3]:
df.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


In [4]:
df.shape

(422419, 8)

In [6]:
df['CATEGORY'].value_counts()

e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64

e - entertainment
b - business
t - technology
m - health
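
The single-letter codes can be turned into readable labels with `Series.map`. This is only a sketch: the `categories` series is a hypothetical stand-in for `df['CATEGORY']`, and the `category_names` dictionary is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for df['CATEGORY']; the real column comes from sample_news.csv.
categories = pd.Series(["b", "e", "t", "m", "b"], name="CATEGORY")

# Code-to-name legend. The "m" entry ("health") is an assumption taken from
# the description of the public News Aggregator dataset.
category_names = {
    "e": "entertainment",
    "b": "business",
    "t": "technology",
    "m": "health",
}

# Replace each code with its readable name, then count as before.
readable = categories.map(category_names)
print(readable.value_counts())
```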

### Basics of NLP

In [21]:
! pip install nltk



In [None]:
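import nltk
from nltk import word_tokenize
nltk.download('punkt')  # tokenizer models needed by word_tokenize; newer NLTK releases may ask for 'punkt_tab'
data = "Fed official says weak data caused by weather, should not slow taper"
print(word_tokenize(data))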

In [25]:
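import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop word lists
stop_words = set(stopwords.words('english'))  # new name so the imported module is not shadowed
print(stop_words)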

{"you'll", 'am', "don't", 'by', 'his', 'then', 'this', "mustn't", 'shan', 'she', 'are', "shouldn't", 'itself', 'and', 'ours', 'being', 'wouldn', 'ourselves', 'once', 'having', "didn't", 'couldn', 'who', 'few', "that'll", "you're", 'where', 'doesn', 'about', 'it', 'as', 'same', 'hadn', 'had', 'hers', "haven't", 'them', 'off', 'mightn', 'any', 'such', 'aren', 'for', 'do', "should've", 'does', 'herself', 'we', 'isn', "couldn't", 'ain', 'yourself', 'i', 'theirs', 'but', 'at', 'with', 'why', 'before', 'you', 'myself', 'from', 'me', 'most', "isn't", 'ma', 'has', 'to', 'if', 'these', 'themselves', 'll', 'just', "won't", 'a', 'own', 'after', "it's", 'will', 'was', "she's", 'through', 'under', 'didn', 'when', 'how', 'weren', 'some', "you'd", 'each', 'is', 'd', 'needn', 'haven', 'down', 'wasn', 'whom', 'above', 'be', 'been', "needn't", 'shouldn', 'hasn', "shan't", 'my', 'during', 'very', 't', 'on', 's', 'over', 'between', 'only', 'their', 'up', 'into', 'have', "mightn't", 'again', 're', 'won', '

### Text Cleaning

In [3]:
import string 
data = "Fed official says weak data caused by weather, should not slow taper"
data_1 = [char for char in data if char not in string.punctuation]
print(data_1)

['F', 'e', 'd', ' ', 'o', 'f', 'f', 'i', 'c', 'i', 'a', 'l', ' ', 's', 'a', 'y', 's', ' ', 'w', 'e', 'a', 'k', ' ', 'd', 'a', 't', 'a', ' ', 'c', 'a', 'u', 's', 'e', 'd', ' ', 'b', 'y', ' ', 'w', 'e', 'a', 't', 'h', 'e', 'r', ' ', 's', 'h', 'o', 'u', 'l', 'd', ' ', 'n', 'o', 't', ' ', 's', 'l', 'o', 'w', ' ', 't', 'a', 'p', 'e', 'r']


In [4]:
data_1 = ''.join(data_1)
print(data_1)

Fed official says weak data caused by weather should not slow taper


In [5]:
data_1 = data_1.split()  # split on whitespace; word_tokenize(data_1) is an alternative
print(data_1)

['Fed', 'official', 'says', 'weak', 'data', 'caused', 'by', 'weather', 'should', 'not', 'slow', 'taper']
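
A common next cleaning step is to drop stop words from the token list. The sketch below uses a small hand-written subset of English stop words so that it runs without any download; in the notebook the full NLTK set would take its place. The names `sample_stop_words` and `filtered` are illustrative.

```python
# Tokens produced by the cleaning cells above.
tokens = ['Fed', 'official', 'says', 'weak', 'data', 'caused', 'by',
          'weather', 'should', 'not', 'slow', 'taper']

# Assumption: a tiny subset of the NLTK English stop word list, hard-coded
# here only to keep the example self-contained.
sample_stop_words = {"by", "should", "not"}

# Compare in lower case so that capitalised words are matched as well.
filtered = [word for word in tokens if word.lower() not in sample_stop_words]
print(filtered)
```

Note that removing "not" changes the meaning of this headline, so whether to filter negations is a choice that depends on the task.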


### Count vectorizer 

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(data_1)
print(vectorizer.vocabulary_)

{'fed': 3, 'official': 5, 'says': 6, 'weak': 10, 'data': 2, 'caused': 1, 'by': 0, 'weather': 11, 'should': 7, 'not': 4, 'slow': 8, 'taper': 9}


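In [10]:
# Encode the tokens: each list element is treated as its own document,
# so every row of the sparse matrix holds a single count
vector = vectorizer.transform(data_1)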
print(vector)

  (0, 3)	1
  (1, 5)	1
  (2, 6)	1
  (3, 10)	1
  (4, 2)	1
  (5, 1)	1
  (6, 0)	1
  (7, 11)	1
  (8, 7)	1
  (9, 4)	1
  (10, 8)	1
  (11, 9)	1
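
The output above has one row per token because the vectorizer was given a list of single words. For comparison, the sketch below passes whole headlines as documents, so each row is the word-count vector of one headline. The second headline is hypothetical and only there to show a row with zeros.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Fed official says weak data caused by weather, should not slow taper",
    "Fed says weather should not slow data",  # hypothetical second headline
]

vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix in one step.
counts = vectorizer.fit_transform(docs)

# Vocabulary words ordered by their column index.
print(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
# One row per headline, one column per vocabulary word.
print(counts.toarray())
```

This document-by-word count matrix is the input format a multinomial Naive Bayes classifier expects.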


In [None]:
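import string

def text_cleaning(a):
    # Drop every punctuation character, then rebuild the string
    remove_punctuation = [char for char in a if char not in string.punctuation]
    remove_punctuation = ''.join(remove_punctuation)
    # Split on whitespace, as in the cells above, and return the tokens
    return remove_punctuation.split()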

![](Multinomial.png)
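
Putting the pieces together, a cleaning function can be handed to `CountVectorizer` as its `analyzer` and chained with `MultinomialNB`. This is a minimal sketch under stated assumptions: the four training titles and their labels are hypothetical, the cleaning function is restated here with an added lower-casing step, and the notebook's real data would come from the `TITLE` and `CATEGORY` columns.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def text_cleaning(text):
    """Strip punctuation, lower-case, and split into word tokens."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return no_punct.lower().split()

# Hypothetical training titles and labels (b = business, t = technology).
titles = [
    "Fed raises interest rates",
    "Stocks fall after Fed remarks",
    "New phone launches with faster chip",
    "Chip maker unveils new processor",
]
labels = ["b", "b", "t", "t"]

# The vectorizer counts the tokens returned by text_cleaning;
# the classifier is then fitted on those counts.
model = make_pipeline(CountVectorizer(analyzer=text_cleaning), MultinomialNB())
model.fit(titles, labels)

headline = "Fed official says weak data caused by weather, should not slow taper"
print(model.predict([headline]))
```

With the real dataset, a held-out test split would be needed before reading anything into the predictions.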