# notes
- The UCI ML News Aggregator Dataset contains headlines and categories for over 400k news articles.
- Currently uses the multinomial Naive Bayes classifier from scikit-learn.

In [1]:
# get some libraries that will be useful
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data into training and test sets (a single hold-out split)
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder


# grab the data
news = pd.read_csv("../data_news_aggregator/news_data_uci.csv")

In [3]:
#news.head() # let's take a look at our data

One thing we'll want to do is normalize the TITLE column a bit: remove punctuation and lowercase everything. This will give us a smaller set of words, which will decrease the size of our model, and ensure that words are treated the same even if they occur capitalized at the beginning of the headline or lowercase in the middle.

In [4]:
from train import normalize_text
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
#print(len(news['TEXT']),news['TEXT'])
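The `normalize_text` helper is imported from a local `train` module that is not shown in this notebook. As a rough sketch only, assuming it does what the paragraph above describes (lowercase everything and remove punctuation), a stand-in might look like the following. The regular expressions and the example headline are assumptions, not the actual implementation.

```python
import re

def normalize_text(s):
    """Hypothetical stand-in for train.normalize_text: lowercase and strip punctuation."""
    s = s.lower()
    # replace anything that is not a lowercase letter, digit, or whitespace with a space
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    # collapse the runs of whitespace left behind and trim the ends
    s = re.sub(r"\s+", " ", s).strip()
    return s

# made-up headline, not taken from the dataset
print(normalize_text("Stocks Rally, Then Fall: What's Next?"))
# -> stocks rally then fall what s next
```

Replacing punctuation with a space (rather than deleting it) keeps words separated when a headline uses punctuation without surrounding spaces, at the cost of splitting contractions such as "what's" into two tokens.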

In [14]:
print(len(news['STORY']))
print(len(news['STORY'][0]))

422419
29


In [None]:
def get_related_article_idxs(x_1, story_1):
    # boolean mask of articles that share the same STORY id (x_1 is currently unused)
    idxs = news['STORY'] == story_1
    return idxs
def get_liberal_idxs(idxs):
    pubs = [s.lower() for s in news['PUBLISHER'][idxs]]
    # placeholder: flags no articles yet; 1-D boolean mask so it lines up with idxs
    idxs_lib = np.zeros(len(idxs), dtype=bool)
    return idxs_lib
def get_conservative_idxs(idxs):
    pubs = [s.lower() for s in news['PUBLISHER'][idxs]]
    # placeholder: returns the input mask unchanged, so idxs_fair below is empty for now
    return idxs
idxs_init = get_related_article_idxs(news['TEXT'][0],news['STORY'][0])
idxs_liberal = get_liberal_idxs(idxs_init)
idxs_conservative = get_conservative_idxs(idxs_init)
idxs_fair = np.logical_and(idxs_init,np.logical_not(np.logical_or(idxs_liberal,idxs_conservative)))
print("sums",np.sum(idxs_init),np.sum(idxs_liberal),np.sum(idxs_conservative),np.sum(idxs_fair))
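The two publisher functions in this cell are still stubs. One possible way to fill them in, sketched here on a toy frame, is to match lowercased publisher names against a set and combine the result with the story mask. The publisher names, the `publisher_mask` helper, and the group sets are all hypothetical; the notebook does not say which publishers belong to which group.

```python
import pandas as pd

# toy stand-in for the real `news` frame; publisher names are invented
toy = pd.DataFrame({
    "PUBLISHER": ["Outlet A", "outlet b", "Outlet C", None],
    "STORY": ["s1", "s1", "s1", "s2"],
})

def publisher_mask(frame, story_mask, publishers):
    """Boolean mask: rows in story_mask whose publisher is in `publishers`."""
    # lowercase for case-insensitive matching; missing publishers become ""
    pubs = frame["PUBLISHER"].fillna("").str.lower()
    return story_mask & pubs.isin(publishers)

story_mask = toy["STORY"] == "s1"
group_1 = publisher_mask(toy, story_mask, {"outlet a"})  # hypothetical group
group_2 = publisher_mask(toy, story_mask, {"outlet b"})  # hypothetical group
# articles on the same story that belong to neither group
rest = story_mask & ~(group_1 | group_2)
print(story_mask.sum(), group_1.sum(), group_2.sum(), rest.sum())
```

Keeping every mask as a 1-D boolean Series with the same index means `&`, `|`, and `~` combine them element-wise without any reshaping.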

In [36]:
pubs=news['PUBLISHER']
myset = set(pubs)
print(myset)
#pubs = np.array(pubs,dtype="str")
#x = np.unique(pubs,return_counts=True)
#print(x)



# format into matrix

In [6]:
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])

encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(337935, 54637)
(337935,)
(84484, 54637)
(84484,)


So the x training matrix contains 337935 observations of 54637 features; this latter number is the number of unique words in the entire collection of headlines, and each entry counts how often a word occurs in a headline. The y training vector contains the 337935 labels associated with each observation in the x training matrix.
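To make the shape concrete, here is a small sketch with three made-up headlines (not from the dataset). It shows that `CountVectorizer` produces one row per headline and one column per unique word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# invented headlines, already normalized
toy_headlines = ["stocks rally", "stocks fall", "new phone launch"]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_headlines)

print(toy_x.shape)                         # (number of headlines, number of unique words)
print(sorted(toy_vectorizer.vocabulary_))  # the words, in column order
print(toy_x.toarray())                     # dense view of the sparse count matrix
```

The real matrix is kept sparse because almost every headline uses only a handful of the 54637 words; calling `toarray()` is only reasonable on a toy example like this one.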

So we're ready to go. Let's make the classifier!

In [7]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

How well did it do?

In [8]:
nb.score(x_test, y_test)

0.92754841153354484
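For a classifier such as `MultinomialNB`, `score` reports mean accuracy: the fraction of samples whose predicted label matches the true label. The sketch below checks that on a tiny invented count matrix; the data is an assumption for illustration and is scored on the same rows it was trained on, which the notebook above avoids by using a held-out test set.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy word-count matrix (rows: documents, columns: words) and labels, invented for illustration
toy_x = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]])
toy_y = np.array([0, 0, 1, 1])

toy_nb = MultinomialNB().fit(toy_x, toy_y)

# score() is the fraction of predictions that match the true labels
manual_accuracy = np.mean(toy_nb.predict(toy_x) == toy_y)
print(toy_nb.score(toy_x, toy_y), manual_accuracy)
```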

If you feel like exploring what words are characteristic of each category, you can pull out the coefficients of the Naive Bayes classifier: