## NEWS ARTICLE CLASSIFICATION

### 1. Load data
* Records in the file are tab delimited;
* No column titles in the text file;
* Set column titles when loading;
* Load the data into a Pandas data frame;
* View unique values for the `category` column for later transformation to discrete numerical values.

In [246]:
import pandas as pd

news_df = pd.read_table(
    filepath_or_buffer = '~/ws/python/data/news_articles/newsCorpora.csv', 
    sep = '\t', 
    header = None, 
    names = ['id', 'title', 'url', 'publisher', 'category', 'story', 'hostname', 'timestamp']
)

# news_df.category.unique()
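
The inspection step that is commented out above can be tried on a small in-memory sample first. This is a minimal sketch: the two rows, their URLs and the names `sample`, `columns` and `sample_df` are made up for illustration and only mimic the tab-delimited, header-less layout described above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-delimited, header-less layout.
sample = (
    "1\tFed official says weak data caused by weather\thttp://example.com/a\t"
    "Example Times\tb\tstory1\texample.com\t1394470370698\n"
    "2\tNew phone released\thttp://example.com/b\t"
    "Example Tech\tt\tstory2\texample.com\t1394470371207\n"
)
columns = ['id', 'title', 'url', 'publisher', 'category', 'story', 'hostname', 'timestamp']

sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)

# Distinct category codes, and how many rows carry each one.
categories = sorted(sample_df['category'].unique())
counts = sample_df['category'].value_counts()
print(categories)
print(counts)
```

On the full file, `value_counts()` would also show whether the classes are balanced, which matters when reading the weighted scores in step 7.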

### 2. Preprocess data
* Transform categories into discrete numerical values;
* Transform all words to lowercase;
* Remove all punctuation.

In [247]:
import string

news_df['category'] = news_df.category.map({ 'b': 1, 't': 2, 'e': 3, 'm': 4 })
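# Build the translation table once, then lowercase each title and strip every
# character listed in string.punctuation.
punctuation_table = str.maketrans('', '', string.punctuation)
news_df['title'] = news_df.title.map(lambda x: x.lower().translate(punctuation_table))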
news_df

| | id | title | url | publisher | category | story | hostname | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | fed official says weak data caused by weather ... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 1 | 2 | feds charles plosser sees high bar for change ... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 2 | 3 | us open stocks fall after fed official hints a... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 3 | 4 | fed risks falling behind the curve charles plo... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 4 | 5 | feds plosser nasty weather has curbed job growth | http://www.moneynews.com/Economy/federal-reser... | Moneynews | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |
| 5 | 6 | plosser fed may have to accelerate tapering pace | http://www.nasdaq.com/article/plosser-fed-may-... | NASDAQ | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.nasdaq.com | 1394470372212 |
| 6 | 7 | feds plosser taper pace may be too slow | http://www.marketwatch.com/story/feds-plosser-... | MarketWatch | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.marketwatch.com | 1394470372405 |
| 7 | 8 | feds plosser expects us unemployment to fall t... | http://www.fxstreet.com/news/forex-news/articl... | FXstreet.com | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.fxstreet.com | 1394470372615 |
| 8 | 9 | us jobs growth last month hit by weatherfed pr... | http://economictimes.indiatimes.com/news/inter... | Economic Times | 1 | ddUyU0VZz0BRneMioxUPQVP6sIxvM | economictimes.indiatimes.com | 1394470372792 |
| 9 | 10 | ecb unlikely to end sterilisation of smp purch... | http://www.iii.co.uk/news-opinion/reuters/news... | Interactive Investor | 1 | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.iii.co.uk | 1394470501265 |
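
Row 8 of the output contains the token `weatherfed`, which looks like two words joined together when the punctuation between them was deleted. The sketch below compares the deletion approach used above with one possible alternative that replaces punctuation with spaces. The sample title and the helper names `strip_punctuation` and `space_punctuation` are hypothetical and not part of the original notebook.

```python
import string

# Table that deletes punctuation (the approach used in the cell above).
strip_table = str.maketrans('', '', string.punctuation)
# Table that maps each punctuation character to a space instead.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def strip_punctuation(title):
    """Lowercase and delete punctuation; neighbouring words may merge."""
    return title.lower().translate(strip_table)


def space_punctuation(title):
    """Lowercase, turn punctuation into spaces, then collapse repeated spaces."""
    return ' '.join(title.lower().translate(space_table).split())


sample_title = "Weather-Fed: Jobs growth hit"  # made-up title for illustration
stripped = strip_punctuation(sample_title)
spaced = space_punctuation(sample_title)
print(stripped)  # weatherfed jobs growth hit
print(spaced)    # weather fed jobs growth hit
```

Neither variant is strictly better: replacing with spaces keeps `weather` and `fed` as separate tokens, but it also splits possessives such as `fed's` into `fed` and `s`.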


### 3. Split into train and test data sets

In [216]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    news_df['title'], 
    news_df['category'], 
    random_state = 1
)
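
Because no `test_size` is passed, `train_test_split` falls back to its default and holds out 25% of the rows for testing. The sketch below shows this on synthetic data and also passes `stratify`, an optional argument that the cell above does not use, to keep class proportions similar in both parts. The arrays `titles` and `labels` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the title and category columns (100 rows, 4 classes).
titles = np.array([f"title {i}" for i in range(100)])
labels = np.array([1, 2, 3, 4] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    titles,
    labels,
    random_state=1,   # fixed seed so the split is reproducible
    stratify=labels,  # optional: preserve class proportions in both parts
)

print(len(X_tr), len(X_te))  # 75 25
```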

### 4. Extract features
* Apply bag-of-words processing: fit the vocabulary on the training titles only, then reuse it to transform the test titles.

In [217]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer(stop_words = 'english')
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)
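
To see what the vectorizer produces, the same two calls can be run on a tiny corpus. This is a minimal sketch: the titles are shortened examples and the names `train_titles`, `test_titles` and `vectorizer` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_titles = [
    "fed official says weak data caused by weather",
    "feds plosser taper pace may be too slow",
]
test_titles = ["weather data may be weak tomorrow"]

vectorizer = CountVectorizer(stop_words='english')
# fit_transform learns the vocabulary from the training titles and counts words.
train_matrix = vectorizer.fit_transform(train_titles)
# transform reuses that vocabulary; words unseen in training are ignored.
test_matrix = vectorizer.transform(test_titles)

vocab = vectorizer.vocabulary_  # maps each kept word to its column index
print(sorted(vocab))
print(train_matrix.toarray())
print(test_matrix.toarray())
```

Each row is one title and each column one vocabulary word. A word such as `tomorrow`, which appears only in the test title, has no column, and stop words such as `by` are dropped.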

### 5. Train Multinomial Naive Bayes classifier

In [218]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
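
The printed defaults mean additive smoothing with `alpha=1.0` and class priors estimated from the training labels (`fit_prior=True`). A toy run makes the behaviour easier to follow. The four titles and their labels below are invented for illustration and are far too few to train a useful model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: label 1 for business-like titles, 2 for technology-like ones.
toy_titles = [
    "stocks fall after fed hints",
    "fed sees slow job growth",
    "new phone released today",
    "tablet sales beat phone sales",
]
toy_labels = [1, 1, 2, 2]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_titles)

# alpha=1.0 adds one pseudo-count per word, so a word never seen with a class
# does not force that class's probability to zero.
toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(toy_counts, toy_labels)

new_counts = toy_vectorizer.transform(["fed stocks growth", "phone tablet"])
toy_predictions = toy_model.predict(new_counts)
toy_probabilities = toy_model.predict_proba(new_counts)
print(toy_predictions)  # [1 2]
```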

### 6. Generate predictions

In [219]:
predictions = naive_bayes.predict(testing_data)
predictions

array([1, 1, 3, ..., 1, 1, 4])
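
The predictions are the numeric codes assigned in step 2. To read them as the original category letters, the mapping can be inverted. A minimal sketch, using a short made-up array in place of the full `predictions` output:

```python
import numpy as np

# Inverse of the mapping used in step 2: { 'b': 1, 't': 2, 'e': 3, 'm': 4 }.
code_to_category = {1: 'b', 2: 't', 3: 'e', 4: 'm'}

sample_predictions = np.array([1, 1, 3, 1, 1, 4])  # stand-in for `predictions`
predicted_categories = [code_to_category[int(code)] for code in sample_predictions]
print(predicted_categories)  # ['b', 'b', 'e', 'b', 'b', 'm']
```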

### 7. Evaluate model performance
* This is a multi-class classification problem, so for the recall, precision and F1 scores, explicitly specify `average='weighted'`.

In [220]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

print("Accuracy score: ", accuracy_score(y_test, predictions))
print("Recall score: ", recall_score(y_test, predictions, average = 'weighted'))
print("Precision score: ", precision_score(y_test, predictions, average = 'weighted'))
print("F1 score: ", f1_score(y_test, predictions, average = 'weighted'))

Accuracy score:  0.928336726481
Recall score:  0.928336726481
Precision score:  0.928405793865
F1 score:  0.928361147966
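
The accuracy and recall lines above are identical. That is expected: recall averaged with `average='weighted'` weights each class by its share of the true labels, which works out to the overall fraction of correct predictions. The sketch below checks this on a small made-up example and adds a confusion matrix and per-class report, which the notebook does not compute but which show where the errors fall.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Made-up labels: 6 of the 8 predictions are correct.
y_true = [1, 1, 1, 2, 2, 3, 3, 4]
y_pred = [1, 1, 2, 2, 2, 3, 1, 4]

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average='weighted')
print(accuracy, weighted_recall)  # both 0.75

# Rows are true classes, columns are predicted classes, in the order of `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(matrix)

# Per-class precision, recall and F1.
print(classification_report(y_true, y_pred, labels=[1, 2, 3, 4]))
```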
