## Classifying News Headlines with Naive Bayes
Reference: Classifying News Headlines and Explaining the Result from Kaggle[^1]

[^1]: http://nbviewer.jupyter.org/github/dreamgonfly/lime-examples/blob/master/Classifying%20News%20Headlines%20and%20Explaining%20the%20Result.ipynb


In [23]:
import pandas as pd
news = pd.read_csv('uci-news-aggregator.csv').sample(frac=0.1)

In [24]:
len(news)

42242

In [25]:
news.head(3)

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 131117 | 131453 | Renowned producer accused of raping a teenage boy | http://www.standardmedia.co.ke/entertainment/a... | The Standard Digital News | e | dD2SKRjEC3OpoSMBZ2YngvPPLyq8M | www.standardmedia.co.ke | 1397779477911 |
| 193151 | 193487 | Nigerian official: Hundreds killed in attack | http://www.pratttribune.com/article/20140507/N... | Pratt Tribune | t | dIm0mkuI23GU6LMzEn7rv3Rk99ZaM | www.pratttribune.com | 1399565617078 |
| 270836 | 271282 | Sinead O'Connor Reveals New Vampy and Glamorou... | http://www.celebuzz.com/2014-06-03/sinead-ocon... | Celebuzz | e | dR9_9_z6SwLZEhMjrEsD_8k6JG1iM | www.celebuzz.com | 1401834527043 |


In [26]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

X = news['TITLE']
y = encoder.fit_transform(news['CATEGORY'])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
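
The cell above turns the `CATEGORY` letters into integers with `LabelEncoder`. As a minimal sketch of what that mapping looks like and how to reverse it, the example below uses a short hypothetical list of category codes in place of `news['CATEGORY']`; the list contents are an assumption for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category codes standing in for news['CATEGORY'].
categories = ['e', 't', 'e', 'b', 'm', 't']

encoder = LabelEncoder()
y = encoder.fit_transform(categories)

# classes_ holds the original labels in sorted order;
# a label's position in classes_ is its integer code.
print(encoder.classes_.tolist())   # ['b', 'e', 'm', 't']
print(y.tolist())                  # [1, 3, 1, 0, 2, 3]

# inverse_transform maps integer codes (e.g. predictions) back to labels.
decoded = encoder.inverse_transform(y)
print(decoded.tolist())            # ['e', 't', 'e', 'b', 'm', 't']
```

Keeping the fitted `encoder` around makes it possible to report predictions as category letters rather than integers.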

In [27]:
len(X_train)

31681

In [28]:
len(X_test)

10561
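
The 31681 / 10561 split follows from `train_test_split`'s default test fraction of 0.25 applied to 42242 rows. The sketch below reproduces that arithmetic and cross-checks it on placeholder data. `X_demo` and `y_demo` are hypothetical stand-ins, and the `random_state` and `stratify` arguments are optional extras that the notebook itself does not use; they are shown only as one way to make the split reproducible and class-balanced.

```python
import math
from sklearn.model_selection import train_test_split

n_samples = 42242   # len(news) from the earlier cell
test_size = 0.25    # default when neither test_size nor train_size is given

# The test set size is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)   # 31681 10561

# Placeholder data (hypothetical) with four balanced classes.
X_demo = list(range(n_samples))
y_demo = [i % 4 for i in X_demo]

# random_state fixes the shuffle; stratify keeps class proportions equal
# in both parts. Neither is used in the notebook's own split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)
print(len(X_tr), len(X_te))
```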

In [29]:
type(X_train)

pandas.core.series.Series

In [30]:
X_train.head(3)

232076    Ryan Gosling's Lost River: Matt Smith in first...
386342            New HIV Infections Decreasing, UN Reports
117145         Google buys US drone startup Titan Aerospace
Name: TITLE, dtype: object

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=3)

train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

In [32]:
train_vectors

<31681x9782 sparse matrix of type '<class 'numpy.int64'>'
	with 266772 stored elements in Compressed Sparse Row format>

In [33]:
X_train.iloc[1]

'New HIV Infections Decreasing, UN Reports'

In [34]:
train_vectors[1]

<1x9782 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [35]:
type(train_vectors)

scipy.sparse.csr.csr_matrix

In [36]:
# dense bag-of-words count vector (entries are term counts, not one-hot flags)
train_vectors[1].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])
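
The dense row above is mostly zeros, so it is hard to see what it encodes. The sketch below shows one way to map the non-zero entries of a `CountVectorizer` row back to terms through the fitted `vocabulary_` attribute. The first document is the headline from the notebook; the second is a hypothetical sentence added so that a count above 1 appears, and `term_counts` is a helper introduced here for illustration. Unlike the notebook, no `min_df` filter is applied, so every token is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "New HIV Infections Decreasing, UN Reports",   # headline from the notebook
    "New reports: new infections reported",        # hypothetical, repeats "new"
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; invert it to look up terms by column.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

def term_counts(row):
    """Return {term: count} for the non-zero entries of a 1 x n sparse row."""
    coo = row.tocoo()
    return {index_to_term[j]: int(v) for j, v in zip(coo.col, coo.data)}

print(term_counts(vectors[0]))
print(term_counts(vectors[1]))
```

In the second document the entry for `new` is 2, which is why these rows are count vectors rather than one-hot vectors.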

In [37]:
from sklearn.metrics import accuracy_score

In [38]:
train_vectors.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Gaussian Naive Bayes

In [39]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
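# GaussianNB requires dense input, so the sparse matrix is densified here
# (31681 x 9782 int64 values, roughly 2.5 GB of memory)
clf.fit(train_vectors.toarray(), y_train)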

GaussianNB(priors=None)

In [40]:
pred = clf.predict(test_vectors.toarray())
accuracy_score(y_test, pred)

0.81744153015812893
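
Gaussian Naive Bayes needs the training matrix as a dense array, and it models each feature as normally distributed within a class, which is a loose fit for sparse word counts. The memory side of that trade-off can be estimated from the shape and stored-element count reported for `train_vectors` earlier. This is a back-of-the-envelope sketch that assumes 8-byte values and 4-byte CSR indices; the real index width depends on the SciPy build and matrix size.

```python
# Figures reported for train_vectors in the notebook output.
n_rows, n_cols, n_stored = 31681, 9782, 266772

value_bytes = 8   # int64 values
index_bytes = 4   # assumption: 32-bit CSR index arrays

# Dense array: one value per cell.
dense_bytes = n_rows * n_cols * value_bytes

# CSR: one value and one column index per stored element,
# plus a row-pointer array of length n_rows + 1.
sparse_bytes = n_stored * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

density = n_stored / (n_rows * n_cols)

print(f"dense:  {dense_bytes / 1e9:.2f} GB")    # dense:  2.48 GB
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")   # sparse: 3.33 MB
print(f"non-zero fraction: {density:.4%}")
```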

### Multinomial Naive Bayes

In [41]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(train_vectors, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [42]:
pred = clf.predict(test_vectors)
accuracy_score(y_test, pred)

0.89915727677303281
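
Multinomial Naive Bayes works directly on the sparse count matrix, so the vectorizer and classifier can be chained. The sketch below shows one possible arrangement using `make_pipeline`, which the notebook does not use. The four training headlines and the query headline are hypothetical (two are adapted from titles shown earlier), and string labels are passed directly instead of going through `LabelEncoder`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 't' and 'm' are category codes as in the dataset.
headlines = [
    "Google buys drone startup",
    "Apple unveils new phone",
    "New HIV infections decreasing",
    "Flu vaccine trial shows promise",
]
labels = ['t', 't', 'm', 'm']

# The pipeline fits the vectorizer and the classifier together,
# and applies the same vocabulary at prediction time.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(headlines, labels)

query = ["Google unveils new drone"]
pred = model.predict(query)
proba = model.predict_proba(query)

print(pred[0])                      # t
print(model.classes_.tolist())      # ['m', 't']
print(proba.round(3))
```

Wrapping both steps in one object means raw headline strings can be passed to `predict`, which is also the form that explanation tools such as the one in the referenced notebook typically expect.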

### Bernoulli Naive Bayes

In [43]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(train_vectors, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [44]:
# BernoulliNB accepts sparse input, so no dense conversion is needed
pred = clf.predict(test_vectors)
accuracy_score(y_test, pred)

0.89963071678818296
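
The `binarize=0.0` setting shown in the `BernoulliNB` repr means every count above zero is treated as 1, so the model only sees whether a word is present. A minimal sketch of that behaviour, on a small hypothetical count matrix, compares a model fitted on raw counts with one fitted on explicitly binarized input:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical term-count matrix (4 documents, 3 terms) and labels.
X_counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 0, 0],
    [0, 1, 2],
])
y = np.array([0, 1, 0, 1])

# Default binarize=0.0: values > 0 become 1 inside the estimator.
clf_counts = BernoulliNB().fit(X_counts, y)

# Same thing done by hand; binarize=None tells the estimator
# that the input is already binary.
X_binary = (X_counts > 0).astype(int)
clf_binary = BernoulliNB(binarize=None).fit(X_binary, y)

same = np.allclose(
    clf_counts.predict_proba(X_counts),
    clf_binary.predict_proba(X_binary),
)
print(same)   # True
```

This is the main difference from `MultinomialNB`, which uses the counts themselves; on short headlines where most words occur once, the two models see nearly the same information, which is consistent with their similar accuracies above.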