# notes
- The UCI ML News Aggregator Dataset contains headlines and categories for over 400k news articles.
- The classifier currently uses scikit-learn's multinomial Naive Bayes (`MultinomialNB`).

In [6]:
# get some libraries that will be useful
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data into training and test sets
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder


# grab the data
news = pd.read_csv("../data_news_aggregator/news_data_uci.csv")

In [7]:
news.head() # let's take a look at our data

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027


One thing we'll want to do is normalize the TITLE column a bit: remove punctuation and lowercase everything. This will give us a smaller set of words, which will decrease the size of our model, and ensure that words are treated the same even if they occur capitalized at the beginning of the headline or lowercase in the middle.
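
The `normalize_text` function used in the next cell is imported from a local `train` module that is not shown in this notebook. As a rough illustration only, here is a minimal sketch of what such a function might look like. The name `normalize_text_sketch` and the exact regular expressions are assumptions, chosen so that word-internal marks (as in `fed's` or `half-baked`) survive while punctuation next to whitespace is dropped.

```python
import re

def normalize_text_sketch(s):
    """Hypothetical stand-in for train.normalize_text (not the original code)."""
    # lowercase so "Fed" and "fed" map to the same token
    s = s.lower()
    # drop a non-word character that directly follows or precedes whitespace,
    # which leaves word-internal apostrophes and hyphens untouched
    s = re.sub(r"\s\W", " ", s)
    s = re.sub(r"\W\s", " ", s)
    # collapse any runs of whitespace introduced above
    s = re.sub(r"\s+", " ", s)
    return s.strip()

print(normalize_text_sketch("Fed's Plosser: Nasty Weather Has Curbed Job Growth"))
```

Note that a sketch like this does not strip punctuation that sits at the very end of a string or between two words with no space, so some marks can remain in the normalized text.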

In [14]:
from train import normalize_text
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
print(len(news['TEXT']),news['TEXT'])

422419 0         fed official says weak data caused by weather ...
1         fed's charles plosser sees high bar for change...
2         us open stocks fall after fed official hints a...
3         fed risks falling behind the curve' charles pl...
4         fed's plosser nasty weather has curbed job growth
5          plosser fed may have to accelerate tapering pace
6                  fed's plosser taper pace may be too slow
7         fed's plosser expects us unemployment to fall ...
8         us jobs growth last month hit by weather:fed p...
9         ecb unlikely to end sterilisation of smp purch...
10        ecb unlikely to end sterilization of smp purch...
11                    eu's half-baked bank union could work
12             europe reaches crunch point on banking union
13        ecb focus-stronger euro drowns out ecb's messa...
14               eu aims for deal on tackling failing banks
15         forex pound drops to one-month lows against euro
16        noyer says strong euro 

# format into matrix

In [10]:
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])

encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(337935, 54637)
(337935,)
(84484, 54637)
(84484,)


So the x training matrix contains 337935 observations, each with 54637 features -- this latter number is the number of unique words in the entire collection of headlines, and each feature is the count of one word in one headline. The y training vector contains the 337935 labels associated with each observation in the x training matrix.
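
To make the shape of the count matrix concrete, here is a small sketch on a made-up corpus. The three headlines and the `toy_` names are invented for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three invented headlines (not taken from the dataset)
toy_headlines = [
    "fed raises rates",
    "fed holds rates steady",
    "new phone released",
]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_headlines)

# one row per headline, one column per unique word in the corpus
print(toy_x.shape)

# vocabulary_ maps each word to its column index
fed_column = toy_vectorizer.vocabulary_["fed"]
# counts of "fed" in each of the three headlines
print(toy_x[:, fed_column].toarray().ravel())
```

The matrix is stored in sparse form, which matters at full scale: most headlines contain only a handful of the 54637 words, so nearly every entry is zero.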

So we're ready to go. Let's make the classifier!

In [7]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

How well did it do?

In [8]:
nb.score(x_test, y_test)

0.92754841153354484
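
For a classifier, `score` reports mean accuracy: the fraction of test observations whose predicted label matches the true label. A small sketch on invented count data (the `toy_` arrays are assumptions made for illustration) shows the equivalence:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# invented word-count rows for two classes
toy_x = np.array([[2, 0], [1, 0], [0, 2], [0, 1]])
toy_y = np.array([0, 0, 1, 1])

toy_nb = MultinomialNB().fit(toy_x, toy_y)

# accuracy computed by hand from the predictions
toy_pred = toy_nb.predict(toy_x)
manual_accuracy = np.mean(toy_pred == toy_y)

# score() should agree with the manual computation
print(manual_accuracy, toy_nb.score(toy_x, toy_y))
```

Accuracy is a single summary number, so it does not show whether some categories are classified much better than others.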

If you feel like exploring what words are characteristic of each category, you can pull out the coefficients of the Naive Bayes classifier:
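
One way to do this is sketched below on invented data. It reads the per-class log-probabilities from the fitted model's `feature_log_prob_` attribute, since recent scikit-learn releases no longer provide `coef_` on `MultinomialNB`. The toy headlines, the labels, and the helper `top_words` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# invented headlines and category labels
toy_titles = [
    "fed raises interest rates",
    "stocks fall after fed hints",
    "new phone released by apple",
    "apple unveils new tablet",
]
toy_labels = ["b", "b", "t", "t"]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_titles)
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_labels)

toy_nb = MultinomialNB().fit(toy_x, toy_y)

# words ordered by column index, so terms[j] names column j of toy_x
terms = np.array(sorted(toy_vectorizer.vocabulary_,
                        key=toy_vectorizer.vocabulary_.get))

def top_words(model, terms, class_index, n=3):
    """Hypothetical helper: the n words with the highest log P(word | class)."""
    order = np.argsort(model.feature_log_prob_[class_index])[::-1]
    return list(terms[order[:n]])

for i, label in enumerate(toy_encoder.classes_):
    print(label, top_words(toy_nb, terms, i))
```

Words ranked this way are the most probable within a class, which often just means the most frequent. To find words that distinguish one category from the others, comparing the log-probabilities across classes may be more informative.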