# News Document Classification

## Assignment Question:

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  http://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can either use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

## Solution:

For this assignment, we have selected the dataset from short news articles(https://drive.google.com/open?id=0Bz8a_Dbh9QhbUDNpeUdjb0wxRms). This dataset contains a training and testing dataset. Train has around 120K short news articles with the classification, title and description. We need to classify the test document which has around 7.2K short news. 

As it has four different labels, this problem is multilabel classification. We will also categorize the news with multiple algorithms and generate confidence interval for the predicted class. 

As a next step will try to create convolutional neural network to classify the news which has different classification labels from various algorithms.

In [460]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode, median
import operator

In [461]:
# Load the train and test dataset
df = pd.read_csv("./data/train.csv",sep=",",header=None)
df_test = pd.read_csv("./data/test.csv",sep=",",header=None)

As the dataset is huge, we will take 1/6th of the dataset from the train class.

In [462]:
df = df.sample(int(df.count()[1]/6))

In [463]:
#Peek of the train input file
df.head()

Unnamed: 0,0,1,2
50826,4,Intel Pushes Further into Mobile Space (NewsFa...,NewsFactor - Intel has served notice that it w...
10307,2,Dominating Dirrell dispatches Despaigne,Leon Lawson was watching over Andre Dirrell #3...
114369,3,Honeywell Agrees Takeover of UK #39;s Novar,US manufacturer Honeywell International agreed...
79604,3,"Lord James Hanson, prominent British businessm...","Lord James Hanson, a wealthy industrialist who..."
56353,2,Gallacher #39;s wedge shot wraps up #39;dream...,A famous name once more adorns a European Tour...


As the function contains title and description, we will combine it to a single string and its classification label. Also drop unnecessary columns and convert it to text.

In [464]:
def data_cleaning(df,only_str = 'N'):
    """Function for combining and cleaning the text"""
    # Combine heading and news
    df['news'] = df[1].map(str) + ' - ' + df[2].map(str)
    # Drop previous column
    df.drop([1,2],axis=1,inplace=True)
    documents = list(df.to_records(index=False))
    strings = df['news'].to_string(index=False)
    
    # Return the format according to type required
    if only_str=='N':
        return documents
    else:
        return strings

In [465]:
# Combined data for training. All the data is converted to string bytes
news = data_cleaning(df,'Y')
news[:50]

'Intel Pushes Further into Mobile Space (NewsFa...\n'

In [466]:
# Combined data for testing. All the data is converted to records
doc = data_cleaning(df_test)

Now lets try to build the features list. This features will contain all the top words sorted by frequency. Once this is list is compiled, we will build the individual features list for training and testing.

In [467]:
all_words = nltk.word_tokenize(news)
all_words[:50]

['Intel',
 'Pushes',
 'Further',
 'into',
 'Mobile',
 'Space',
 '(',
 'NewsFa',
 '...',
 'Dominating',
 'Dirrell',
 'dispatches',
 'Despaigne',
 '-',
 'Leon',
 '...',
 'Honeywell',
 'Agrees',
 'Takeover',
 'of',
 'UK',
 '#',
 '39',
 ';',
 's',
 'Novar',
 '-',
 '...',
 'Lord',
 'James',
 'Hanson',
 ',',
 'prominent',
 'British',
 'businessm',
 '...',
 'Gallacher',
 '#',
 '39',
 ';',
 's',
 'wedge',
 'shot',
 'wraps',
 'up',
 '#',
 '39',
 ';',
 'dream',
 '...']

In [468]:
#Total words in the train dataset without cleaning
len(all_words)

202028

In [469]:
#from nltk.stem import WordNetLemmatizer
#lem=WordNetLemmatizer()

Now perform series of cleaning steps on 'all_words'. Remove the stopwords and non-alpha characters. Also convert it to lowercase.

In [470]:
filtered_words = [word.lower() for word in all_words if word not in stopwords.words('english') and
                 word.isalpha()]

Converting the filtered words to FrequencyDistribution chart and use the top 10000 words for features list.

In [471]:
all_words_freq = nltk.FreqDist(filtered_words)

In [472]:
word_features = [freq[0] for freq in all_words_freq.most_common(n=10000)]

In [473]:
#As the features list is huge, we will save it for future purposes if required
save_word_features = open("pickled_files/word_features.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()

In [475]:
def find_features(document):
    """Finding feature set in any strings. This requires list of all words (word_features)"""
    words = nltk.word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

In [476]:
# Finally generate the features set for training dataset
documents = list(df.to_records(index=False))
featuresets = [(find_features(rev.lower()), category) for (category, rev) in documents]

In [477]:
# Saving the feature set as pickle file for future purposes
save_featuresets = open("./pickled_files/featureset.pickle","wb")
pickle.dump(featuresets, save_featuresets)
save_featuresets.close()

We have selected 20000 features in random as our input. Now we will split that into training and testing features to calculate the accuracy of our model.


In [479]:
# Training and testing split
testing_set = featuresets[5000:]
training_set = featuresets[:15000]

Now we will build a list of different algorithms from sklearn library.

1. Naive Bayes
2. Multinomial Naive Bayes
3. Bernoulli Naive Bayes
4. Logistic Regression
5. LinearSVC
6. SGD Classifier

In [480]:

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

save_classifier = open("./pickled_files/originalnaivebayes5k.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

save_classifier = open("./pickled_files/MNB_classifier5k.pickle","wb")
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

save_classifier = open("./pickled_files/BernoulliNB_classifier5k.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

save_classifier = open("./pickled_files/LogisticRegression_classifier5k.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()


LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

save_classifier = open("./pickled_files/LinearSVC_classifier5k.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()

SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier accuracy percent:",nltk.classify.accuracy(SGDC_classifier, testing_set)*100)

save_classifier = open("./pickled_files/SGDC_classifier5k.pickle","wb")
pickle.dump(SGDC_classifier, save_classifier)
save_classifier.close()

Original Naive Bayes Algo accuracy percent: 90.6133333333
Most Informative Features
                  prices = True                3 : 2      =    282.4 : 1.0
                internet = True                4 : 2      =    205.6 : 1.0
                  league = True                2 : 4      =    190.0 : 1.0
                software = True                4 : 1      =    157.2 : 1.0
                   iraqi = True                1 : 2      =    155.6 : 1.0
                   coach = True                2 : 3      =    151.7 : 1.0
              technology = True                4 : 2      =    149.3 : 1.0
                minister = True                1 : 2      =    147.8 : 1.0
                  market = True                3 : 2      =    143.7 : 1.0
                military = True                1 : 2      =    137.8 : 1.0
                 olympic = True                2 : 3      =    129.5 : 1.0
                     web = True                4 : 2      =    125.3 : 1.0
                

Wow! We get good accuracy scores from different algorithms. As a next step we will create a ensemble approach with all these classifiers.

Creating a class to get the majority votes from all classifiers and create a confidence interval for these classifiers. 

In [481]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            #print(v)
            votes.append(v)
            try:
                final =  mode(votes)
            except:
                final = votes[0]
        return final
                
    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

We will create a instance of above class and create the final classification for each news.

In [482]:
voted_classifier = VoteClassifier(
                                  classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)


def sentiment(text):
    """This classify the news by combining all the features"""
    feats = find_features(text)
    return voted_classifier.classify(feats)#,voted_classifier.confidence(feats)

Finally, we will validate the model against test file. We will find the news which has been misclassified and calculate the accuracy score.

In [504]:
featuresets_testing = [(find_features(rev.lower()), category) for (category, rev) in doc]
content=[]
for d in doc:
    content.append((sentiment(d[1]),d[0],d[1]))
results = pd.DataFrame(content,columns=['Predicted','Actual','News'])

In [507]:
#As the features testing list is huge, we will save it for future purposes if required
save_featuresets_testing = open("pickled_files/featuresets_testing.pickle","wb")
pickle.dump(featuresets_testing, save_featuresets_testing)
save_featuresets_testing.close()

Below is the list of news which were actually mis-classified. But this classification is the result of ensemble approach. We will also calculate the accuracy score of each individual algorithms.

In [534]:
results[results['Predicted']!=results['Actual']].head()#.count()/7600

Unnamed: 0,Predicted,Actual,News
3,2,4,Prediction Unit Helps Forecast Wildfires (AP) ...
15,3,4,Teenage T. rex's monster growth - Tyrannosauru...
19,3,4,"Storage, servers bruise HP earnings - update E..."
20,3,4,IBM to hire even more new workers - By the end...
24,3,4,Rivals Try to Turn Tables on Charles Schwab - ...


In [542]:
print("Accuracy of the ensemble mode: {}".format(100-(results[results['Predicted']!=results['Actual']].count()[1]/df_test.count()[1])*100))

Accuracy of the ensemble mode:82.97368421052632


In [505]:
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, featuresets_testing))*100)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, featuresets_testing))*100)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, featuresets_testing))*100)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, featuresets_testing))*100)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, featuresets_testing))*100)
print("SGDClassifier accuracy percent:",nltk.classify.accuracy(SGDC_classifier, featuresets_testing)*100)

Original Naive Bayes Algo accuracy percent: 88.1710526316
MNB_classifier accuracy percent: 88.0921052632
BernoulliNB_classifier accuracy percent: 88.0131578947
LogisticRegression_classifier accuracy percent: 87.7368421053
LinearSVC_classifier accuracy percent: 85.2763157895
SGDClassifier accuracy percent: 85.6052631579


## Summary:

Below are some of the findings from this assignment:
1. Integrating NLTK and SKLearn library brings wider access to different algorithms.
2. Performing data cleaning on original yields better results. Like removing stopwords, symbols.
3. Creating ensemble models approach will provide better confidence on the predictions. If each model gives different results then we can take out the list for further investigation.
4. Multiple models accuracy score is around 83%
5. Each individual algorithms provides better results than multiple models approach.

### Future Steps which can be performed:
1. Create a word to vec approach and test models using that approach.
2. Creating a convolutional neural net and perform classification using it.

### References:
1. pythonprogramming.net for combining different models.
2. Natural Language Processing with Python book.