# Document Tagging: BBC News Articles 

This corpus used in this project includes 2,225 documents from BBC's news website corresponding to stories in five topical areas (business, entertainment, politics, sport, tech) from 2004-2005. 

The CSV file includes two columns: category (the five class labels) and text (pre-processed article content). In this project, I will use only the text column.

More information on this data set as well as a paper written using this data set is available here http://mlg.ucd.ie/datasets/bbc.html.

#### Import Libraries

In [1]:
import pprint
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import gensim

import random
random.seed(42)

#### Load Data

In [2]:
df = pd.read_csv("data/BBC-articles.csv")
df = df[['text']]#[:100]
df.head()

Unnamed: 0,text
0,tv future in the hands of viewers with home th...
1,worldcom boss left books alone former worldc...
2,tigers wary of farrell gamble leicester say ...
3,yeading face newcastle in fa cup premiership s...
4,ocean s twelve raids box office ocean s twelve...


#### Initial Prep

In [3]:
'''
This function takes as input a string and does basic text preprocessing on it.
It returns a string
'''

def preprocess(text):
    import contractions
    import string
    from nltk.stem import WordNetLemmatizer

    # load the text and convert to lowercase
    text = text.lower()

    # expand contractions
    expanded_words = [contractions.fix(word) for word in text.split()]
    text = ' '.join(expanded_words)

    # remove punctuations: using translate
    text = text.translate(str.maketrans('', '', string.punctuation))

    # tokenize
    tokens_raw = text.split(" ")

    # limit to tokens with more than 2 characters
    tokens_raw = [token for token in tokens_raw if len(token) > 2]

    # remove stopwords
    stop_words = set(stopwords.words('english'))

    # lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens_filtered = [lemmatizer.lemmatize(token) for token in tokens_raw if not token in stop_words]
    text = ' '.join(tokens_filtered)
    
    return text

In [4]:
# preprocess the text column using the defined function
df['text'] = df.text.apply(lambda x: preprocess(x))

In [5]:
splitText = df['text'].apply(lambda x: x.split())

#### Modeling

In [6]:
'''
This function takes as input a string of text and returns a list of nouns, noun phrases and named entities.
The function has a high complexity, and there may be more efficient ways to go about it.
However, this gives me the output I desire more compared to available methods/packages.
'''
import nltk
# nltk.download('brown')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

def getNouns(text):
        from nltk import ne_chunk, pos_tag, sent_tokenize, word_tokenize
        from nltk.tree import Tree
        
        global nouns
        nouns = []

        for sentence in sent_tokenize(text):
                for word, pos in pos_tag(word_tokenize(sentence)):
                        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
                                nouns.append(word)

                chunked = ne_chunk(pos_tag(word_tokenize(sentence)))
                continuous_chunk = []
                current_chunk = []
                
                for i in chunked:
                        if type(i) == Tree:
                                current_chunk.append(" ".join([token for token, pos in i.leaves()]))
                        if current_chunk:
                                named_entity = " ".join(current_chunk)
                                if named_entity not in nouns:
                                        nouns.append(named_entity)
                                        current_chunk = []
                        else:
                                continue
        return nouns

In [7]:
'''
This function takes as input a dataframe, a column name (str) and a a TFIDF data format (basic, filtered & nouns)
basic -> only basic preprocessing done before building tfidf
filtered -> tfidf filtered to remove top 10% of the most frequent words and words that appear less than 5 times in the documents
nouns -> tfidf built on text limited to nouns, noun phrases, and named entity recognition.

It returns a dictionary and a TF-IDF corpus.
'''

def getCorpus(df=df, column="text", tfidfFormat="basic"):
    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    # TF-IDF with basic cleaning
    if tfidfFormat=="basic":
        tokens = df[column].apply(lambda x: x.split())
        dictionary = Dictionary(tokens)
        dtm = [dictionary.doc2bow(doc) for doc in tokens]
        vectorizer = TfidfModel(dtm)
        tfidfCorpus = vectorizer[dtm]
        return dictionary, tfidfCorpus
    
    # TF-IDF with term frequency filter cleaning    
    elif tfidfFormat=="filtered":
        tokens = df[column].apply(lambda x: x.split())
        dictionary = Dictionary(tokens)
        dictionary.filter_extremes(no_below=5, no_above=0.90)
        dtm = [dictionary.doc2bow(doc) for doc in tokens]
        vectorizer = TfidfModel(dtm)
        tfidfCorpus = vectorizer[dtm]
        return dictionary, tfidfCorpus

    # TF-IDF with only nouns, noun phrases and NER
    elif tfidfFormat=="nouns":
        tokens = df[column].apply(lambda x: getNouns(x))
        dictionary = Dictionary(tokens)
        dtm = [dictionary.doc2bow(doc) for doc in tokens]
        vectorizer = TfidfModel(dtm)
        tfidfCorpus = vectorizer[dtm]
        return dictionary, tfidfCorpus               

In [8]:
'''
This function takes as input a dictionary, a corpus, the type of model (lda or lsi) and the number of topics.
It builds a model using these parameters and returns the model and its coherence score.
'''

def buildModel(dictionary, corpus, modelType:str, num_topics):
    from gensim.models import LsiModel,LdaModel,CoherenceModel

    if modelType=="lda":
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        coherenceModel = CoherenceModel(model=model, texts=splitText, dictionary=dictionary, coherence='c_v')
        coherenceScore = coherenceModel.get_coherence()
        return model, coherenceScore
    
    elif modelType=="lsi":
        model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        coherenceModel = CoherenceModel(model=model, texts=splitText, dictionary=dictionary, coherence='c_v')
        coherenceScore = coherenceModel.get_coherence()
        return model, coherenceScore

In [9]:
'''
This function takes as input a corpus and a model (like the one returned by the buildModel function).
It returns a list of keywords found from running on the corpus
'''

def getKeywords(model, corpus):
    n = len(corpus)
    keywords = []

    for i in range(n):    
        for index, score in sorted(model[corpus[i]], key=lambda tup: -1*tup[1]):
            elements = model.print_topic(index, 5).split("+")
            keywords.append([x.strip().replace('"', '').split("*")[1] for x in elements])
    keywords = keywords[:n]
    return keywords

In [20]:
'''
In this step we create a list of formats (basic/filtered/nouns) and model types (lda/lsa) which we will use to build our models.
We use getKeywords() function and add new columns to the initial dataframe with these keywords.
'''
# dictionaries to store models and their coherence scores
models = {'lda': [], 'lsi': []}
coherenceScores = {'lda': [], 'lsi': []}

lstFormats = ['basic', 'filtered', 'nouns']
modelTypes = ['lda', 'lsi']

# iterate through the list of TF-IDF corpus formats
for lstFormat in lstFormats:
    dictionary, corpus = getCorpus(df=df, column="text", tfidfFormat=lstFormat)

    # iterate through the list of model types
    for modelType in modelTypes:
        model, coherence = buildModel(dictionary=dictionary, corpus=corpus, modelType=modelType, num_topics =4)
        
        # save models and their coherence scores to the dictionaries
        models[modelType].append(model)
        coherenceScores[modelType].append(coherence)

        # get keywords from the given text
        kw = getKeywords(model, corpus)

        # add keywords as new columns
        colname = lstFormat + "_" + modelType
        df[colname] = kw

  sparsetools.csc_matvecs(
  sparsetools.csc_matvecs(
  sparsetools.csc_matvecs(


In [21]:
# let us look at the new columns
df.head(3)

Unnamed: 0,text,basic_lda,basic_lsi,filtered_lda,filtered_lsi,nouns_lda,nouns_lsi
0,future hand viewer home theatre system plasma ...,"[holmes, mobile, email, music, british]","[labour, election, blair, tax, brown]","[film, award, bank, game, dollar]","[mobile, phone, film, award, best]","[blair, party, election, tax, phone]","[film, growth, economy, rate, bank]"
1,worldcom bos left book alone former worldcom b...,"[mobile, phone, search, blog, people]","[mobile, phone, film, award, best]","[search, phone, mobile, virus, user]","[film, award, england, best, oscar]","[search, sale, virus, film, figure]","[film, game, england, award, oscar]"
2,tiger wary farrell gamble leicester say rushed...,"[blair, labour, party, election, oil]","[film, award, best, oscar, england]","[price, profit, game, 2004, sale]","[labour, election, blair, brown, tax]","[award, film, nomination, game, band]","[election, tax, blair, party, film]"


#### Looking through the created columns of keywords, I opine that the LSI model trained on a TF-IDF corpus that has been filtered to remove the top 10% of the most frequent words and words that appear less than 5 times in the documents does a better job.

In [22]:
# function for highlighting best model using coherence score
def highlight_cells(val):
    color = 'yellow' if val == maxVal else ''
    return 'background-color: {}'.format(color)

coherenceScoresDF = pd.DataFrame.from_dict(coherenceScores)
coherenceScoresDF.set_index([pd.Index(lstFormats)], inplace=True)

maxVal = coherenceScoresDF.max().max()
coherenceScoresDF.style.applymap(highlight_cells)

Unnamed: 0,lda,lsi
basic,0.353226,0.650648
filtered,0.437494,0.700128
nouns,0.342381,0.656276


#### Looking at the coherence values, it is clear that the LDA model trained on the basically cleaned corpus performes better than the rest.

### LDA Interactive for the best model

In [27]:
# the winning model is stored in the models list 
winningModel = models['lda'][1]

In [28]:
# interacting with LDA output
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

d, c = getCorpus(df=df, column="text", tfidfFormat="basic")

vis = gensimvis.prepare(winningModel, c, d)
vis

IndexError: index 8328 is out of bounds for axis 1 with size 8328

### Evidently, there is an overlap among many of the topics as seen in the visualization above. It is important to objectively select the best number of topics on which to base the model.