Doc2Vec for document classification

Dataset - http://mlg.ucd.ie/datasets/bbc.html

Citation - D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.

The dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005.
Class Labels: 5 (business, entertainment, politics, sport, tech)

- 510 business
- 386 entertainment
- 417 politics
- 511 sport
- 401 tech

The first documents of each category, in the quantities listed below, go into the training corpus (2140 documents in total). The remaining 85 documents are held out for testing.
- 500 business
- 350 entertainment
- 400 politics
- 500 sport
- 390 tech
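
The size of the held-out test set follows from the two lists above. A minimal sketch of that arithmetic (the dictionary names `totals`, `train` and `test` are illustrative and are not used elsewhere in the notebook):

```python
# Documents per category in the full BBC data set (from the first list above).
totals = {'business': 510, 'entertainment': 386, 'politics': 417, 'sport': 511, 'tech': 401}
# Documents per category assigned to the training corpus (from the second list above).
train = {'business': 500, 'entertainment': 350, 'politics': 400, 'sport': 500, 'tech': 390}

# Whatever is not used for training is left over for testing.
test = {category: totals[category] - train[category] for category in totals}

print(test)
print("training:", sum(train.values()), "testing:", sum(test.values()))
```

The test set is small and unevenly spread (10 business, 36 entertainment, 17 politics, 11 sport, 11 tech), which is worth keeping in mind when reading the accuracy figure at the end.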

In [1]:
from pathlib import Path

trainingDocuments = []  # [category, text] pairs used to train the model
testingDocuments = []   # [category, text] pairs held out for evaluation

# Number of documents per category that go into the training set.
sample = {'business': 500, 'entertainment': 350, 'politics': 400, 'sport': 500, 'tech': 390}

def dataSetup(folder, category):
    size = sample[category]
    # Keep regular files only, and sort the paths so the train/test split is the same on every run.
    allDocs = sorted(p for p in Path(folder).glob('**/*') if p.is_file())

    for count, news in enumerate(allDocs):
        # The context manager closes each file as soon as it has been read.
        with open(news, "r") as file:
            data = file.read()

        if count < size:
            trainingDocuments.append([category, data])
        else:
            testingDocuments.append([category, data])

# dataSetup fills the two lists in place and returns nothing.
dataSetup("bbc-fulltext/business", "business")
dataSetup("bbc-fulltext/entertainment", "entertainment")
dataSetup("bbc-fulltext/politics", "politics")
dataSetup("bbc-fulltext/sport", "sport")
dataSetup("bbc-fulltext/tech", "tech")
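
After loading, it can help to confirm that each category contributed the expected number of documents. The following is a minimal sketch, assuming records are stored as `[category, text]` pairs as in the cell above; the helper `countByCategory` and the toy lists are hypothetical and only illustrate the check.

```python
from collections import Counter

def countByCategory(documents):
    """Count [category, text] records per category."""
    return Counter(category for category, _ in documents)

# Toy stand-ins for trainingDocuments / testingDocuments.
toyTraining = [['business', 'text a'], ['business', 'text b'], ['sport', 'text c']]
toyTesting = [['sport', 'text d']]

trainCounts = countByCategory(toyTraining)
testCounts = countByCategory(toyTesting)
print("train:", dict(trainCounts))
print("test:", dict(testCounts))
```

Applied to the real lists, the training counts should match the `sample` dictionary and the test counts should add up to 85.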


In [2]:
from gensim.models import doc2vec
from gensim.models.doc2vec import TaggedDocument
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk import download

download('stopwords') # For stopword removal
download('punkt') # For tokenizer

# Build the stopword set once; set membership tests are faster than scanning a list for every token.
stop_words = set(stopwords.words('english'))

def removeStopwords(text):
    # Removing stopwords improved results on the BBC news data, but it is worth testing both with and without them.
    # Tokens that are not purely alphabetic (numbers, punctuation) are dropped as well.
    return [w for w in text if w not in stop_words and w.isalpha()]

def text2tokens(text):
    text = text.lower()
    wordList = word_tokenize(text)
    wordList = removeStopwords(wordList)
    return wordList

training_corpus = []
testing_corpus = []

for record in trainingDocuments:
    words = text2tokens(record[1])
    tag = [record[0]] # IMPORTANT - the news category is used as the document tag, so the model learns one vector per category.
    training_corpus.append(TaggedDocument(words=words, tags=tag))

# Test documents are kept as [category, tokens] pairs; they are never shown to the model during training.
for record in testingDocuments:
    testing_corpus.append([record[0], text2tokens(record[1])])



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dhiraj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dhiraj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
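
To make the preprocessing steps concrete, here is a small self-contained sketch of the same idea: lowercase, tokenize, drop stopwords and non-alphabetic tokens. It is not the notebook's pipeline: the regular expression is a simplified stand-in for `word_tokenize`, and `TOY_STOPWORDS` is a hypothetical, tiny substitute for NLTK's English stopword list, chosen so the example runs without any downloads.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {'the', 'a', 'of', 'in', 'to', 'and', 'is', 'by'}

def toyText2tokens(text):
    # Lowercase, keep alphabetic runs only (a simplification of word_tokenize + isalpha),
    # then drop stopwords.
    tokens = re.findall(r'[a-z]+', text.lower())
    return [w for w in tokens if w not in TOY_STOPWORDS]

tokens = toyText2tokens("The price of oil rose by 3% in 2004.")
print(tokens)
```

Numbers and punctuation disappear along with the stopwords, leaving only the content words for the model to learn from.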


In [3]:
# The model parameters below can affect the outcome.
# 1. size - dimensionality of the document vectors. 100 worked best on the BBC news data set (tried values from 50 to 300).
# 2. window - context window, i.e. the number of words to the left and right of a word that
#    define its "context" when learning the word's meaning. A window of 1 gave the best result
#    (tried values from 1 to 10), probably because the documents and vocabulary are very small.
# 3. alpha / min_alpha - setting both to 0.025 keeps the learning rate constant over all 20 iterations.
# Note: size and iter are the gensim 3.x parameter names; gensim 4.x renamed them to vector_size and epochs.

model = doc2vec.Doc2Vec(training_corpus, size = 100, negative=5, window = 1, iter = 20, min_count = 2, workers = 4, alpha=0.025, min_alpha=0.025)
model.save("bbc_news_doc2vec.model")
print("Doc2Vec Model Saved")

Doc2Vec Model Saved
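
The `window` parameter described above can be illustrated with a few lines of plain Python. This is only a sketch of which neighbours count as context for a given window size; gensim's training loop does considerably more (negative sampling, for example), and the helper `contextWindows` is hypothetical.

```python
def contextWindows(tokens, window):
    """Pair each token with the neighbours at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

tokens = ['oil', 'price', 'rose', 'sharply']
for target, context in contextWindows(tokens, window=1):
    print(target, context)
```

With `window=1` each word sees only its immediate neighbours; a larger window pulls in words further away.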


In [4]:
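def classifyDoc(doc, category):
    model.random.seed(0) # Force model to use a static seed rather than a random one, so inference is repeatable.
    new_vec = model.infer_vector(doc)
    # There is one trained vector per category tag, so the five nearest tags are all five categories, ranked.
    # model.docvecs is the gensim 3.x name; gensim 4.x exposes the same vectors as model.dv.
    sims = model.docvecs.most_similar(positive=[new_vec], topn=5)
    
    print("Test document category - ", category)
    print("Similarity results -")
    for neighbor in sims:
        print(neighbor)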
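
The scores printed by `most_similar` are cosine similarities between the inferred vector and each category's trained vector. The ranking step can be sketched in plain Python as follows; the two-dimensional vectors are made up for illustration (the model above uses 100 dimensions), and `cosine` and `rankCategories` are hypothetical helpers, not gensim functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    normU = math.sqrt(sum(a * a for a in u))
    normV = math.sqrt(sum(b * b for b in v))
    return dot / (normU * normV)

def rankCategories(vector, categoryVectors):
    """Return (category, similarity) pairs, most similar first."""
    scores = [(name, cosine(vector, vec)) for name, vec in categoryVectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up category vectors and a made-up inferred document vector.
toyVectors = {'sport': [1.0, 0.0], 'tech': [0.0, 1.0], 'politics': [1.0, 1.0]}
ranking = rankCategories([0.9, 0.1], toyVectors)
for name, score in ranking:
    print(name, round(score, 3))
```

The category at the top of the ranking is taken as the predicted class.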

In [5]:
testCategory = testing_corpus[10][0]
testDoc = testing_corpus[10][1]
classifyDoc(testDoc, testCategory)

Test document category -  entertainment
Similarity results -
('politics', 0.5703830122947693)
('tech', 0.5324764251708984)
('entertainment', 0.5239570736885071)
('sport', 0.42939186096191406)
('business', 0.4044690728187561)


In [6]:
testCategory = testing_corpus[80][0]
testDoc = testing_corpus[80][1]
classifyDoc(testDoc, testCategory)

Test document category -  tech
Similarity results -
('tech', 0.642406702041626)
('politics', 0.4994729459285736)
('sport', 0.44506770372390747)
('business', 0.43930721282958984)
('entertainment', 0.39877429604530334)


In [7]:
testCategory = testing_corpus[70][0]
testDoc = testing_corpus[70][1]
classifyDoc(testDoc, testCategory)

Test document category -  sport
Similarity results -
('sport', 0.8286051750183105)
('politics', 0.588887095451355)
('tech', 0.5434785485267639)
('entertainment', 0.523962676525116)
('business', 0.451291024684906)


In [8]:
testCategory = testing_corpus[0][0]
testDoc = testing_corpus[0][1]
classifyDoc(testDoc, testCategory)

Test document category -  business
Similarity results -
('business', 0.614448070526123)
('sport', 0.5170470476150513)
('politics', 0.49962225556373596)
('tech', 0.4433245360851288)
('entertainment', 0.41390979290008545)


In [9]:
testCategory = testing_corpus[60][0]
testDoc = testing_corpus[60][1]
classifyDoc(testDoc, testCategory)

Test document category -  politics
Similarity results -
('politics', 0.7728608250617981)
('tech', 0.5455231666564941)
('entertainment', 0.5144734382629395)
('business', 0.504486620426178)
('sport', 0.5038948655128479)


In [10]:
correct_classification = 0
incorrect_classification = 0

for category, words in testing_corpus:
    model.random.seed(0) # Same static seed as in classifyDoc, so inference is repeatable.
    new_vec = model.infer_vector(words)
    sims = model.docvecs.most_similar(positive=[new_vec], topn=5)
    # The top-ranked tag is the predicted category.
    if sims[0][0] == category:
        correct_classification += 1
    else:
        incorrect_classification += 1

print("Number of correctly classified documents - ", correct_classification)
print("Number of incorrectly classified documents - ", incorrect_classification)

Number of correctly classified documents -  81
Number of incorrectly classified documents -  4
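
The two counts above translate directly into an accuracy figure. The sketch below does that arithmetic and also shows one possible way to break accuracy down per category; `perCategoryAccuracy` is a hypothetical helper, and the `(actual, predicted)` pairs fed to it are toy values, not results from this notebook.

```python
from collections import Counter

# Counts reported by the evaluation cell above.
correct = 81
incorrect = 4
total = correct + incorrect
accuracy = correct / total
print(f"Accuracy: {accuracy:.1%} ({correct} of {total})")

def perCategoryAccuracy(pairs):
    """Fraction of correct predictions for each actual category."""
    totals, hits = Counter(), Counter()
    for actual, predicted in pairs:
        totals[actual] += 1
        if actual == predicted:
            hits[actual] += 1
    return {category: hits[category] / totals[category] for category in totals}

# Toy (actual, predicted) pairs for illustration only.
toyPairs = [('sport', 'sport'), ('sport', 'sport'), ('tech', 'politics'), ('tech', 'tech')]
breakdown = perCategoryAccuracy(toyPairs)
print(breakdown)
```

With only 85 test documents, a per-category breakdown rests on as few as 10 examples per class, so it should be read as a rough indication rather than a precise estimate.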
