# Natural Language Processing (NLP) Project

Author: Argha Sarkar (1221352) <a.sarkar@warwick.ac.uk>

## Part A: Text preprocessing

### Declaring file paths for the corpus

Here the path to the corpus file is declared and validated to ensure the file exists.

In [1]:
from os.path import normpath
import os

DIRECTORY_PATH = "signal-news1"
# FILE_NAME = "test_1.jsonl"  # alternative file name; overridden by the line below
FILE_NAME = "signal-news1.jsonl"
FILE_PATH = normpath(os.path.join(os.getcwd(), DIRECTORY_PATH, FILE_NAME))

if os.path.isfile(FILE_PATH):
    print("The file has been found.")
else:
    raise IOError("File not found.")
    


The file has been found.


### Declaring the file paths for the positive and negative keywords

In [2]:
# Subdirectory 
SUB_DIRECTORY = "opinion-lexicon-English"

# Read the list of positive words into a hash set 
POSITIVE_WORDS_FILE_NAME = "positive-words.txt"
POSITIVE_FILE_PATH = normpath(os.path.join(os.getcwd(), DIRECTORY_PATH, SUB_DIRECTORY, POSITIVE_WORDS_FILE_NAME))

if os.path.isfile(POSITIVE_FILE_PATH):
    print("The positive words file has been found.")
else:
    raise IOError("File not found: positive words file at path: {}".format(POSITIVE_FILE_PATH))

# Read the list of negative words into a hash set
NEGATIVE_WORDS_FILE_NAME = "negative-words.txt"
NEGATIVE_FILE_PATH = normpath(os.path.join(os.getcwd(), DIRECTORY_PATH, SUB_DIRECTORY, NEGATIVE_WORDS_FILE_NAME))

if os.path.isfile(NEGATIVE_FILE_PATH):
    print("The negative words file has been found.")
else:
    raise IOError("File not found: negative words file at path: {}".format(NEGATIVE_FILE_PATH))

The positive words file has been found.
The negative words file has been found.


### NLTK Path

The code cell below is used for setting the NLTK path on DCS machines and Joshua. On my personal computer where this assignment was done, I had downloaded the NLTK corpus using **nltk.download()**


In [3]:
import getpass
username = getpass.getuser()

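# Only add the shared NLTK data path when not running on my personal machines.
# (Strings are compared by value here; "is not" combined with "or" was always True.)
if username not in ("ArghaWin10", "arghasarkar"):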
    import nltk
    nltk.data.path.append('/modules/cs918/nltk_data/')


### Creates a corpus object

The corpus class loads the file and converts the *json lines* into separate json documents. It then adds them to a list called **json_docs**. Using a pointer to keep track of the current object, it's possible to iterate over the corpus, processing each document step by step. This can also be done from any given index and not just from the start. Furthermore, the corpus class also handles the storage of the preprocessed news articles.
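The loading and parsing steps described above can be sketched in a few lines. This is a minimal illustration rather than the class used in this notebook: `load_json_lines` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the corpus file.

```python
import io
import json


def load_json_lines(handle):
    """Parse a JSON Lines stream: one JSON document per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# In-memory stand-in for the corpus file (hypothetical sample data)
sample = io.StringIO(
    '{"content": "First article."}\n'
    '{"content": "Second article."}\n'
)
docs = load_json_lines(sample)
print(len(docs), docs[1]["content"])
```

Keeping the parsed documents in a list is what makes indexed access and a movable pointer cheap.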

In [4]:
import json
import itertools

from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize

class Corpus:
    
    def __init__(self, file_path):
        # Private vars
        self._corpus_loaded = False
        self._corpus_parsed = False
        self._sentence_tokenized = False
        
        self._document_view_pointer = 0
        
        self.file_path = file_path 
        # Checks if the file exists
        if os.path.isfile(file_path):
            print("The file has been found.")
        else:
            raise IOError("File not found.")
        
        self.raw_docs = []
        self.json_docs = []
        
        # Tokenized sentences
        self.sent_tokens = []
        # Flat list of sentences
        self.flat_sent_tokens = []
        
    
    def load_corpus(self):
        with open(self.file_path, "r") as corpus:
            for json_line in corpus:
                self.raw_docs.append(json_line)
        
        self._corpus_loaded = True
    
    
    def parse_corpus(self):
        for json_line in self.raw_docs:
            parsed_json = json.loads(json_line)
            self.json_docs.append(parsed_json)
        
        self._corpus_parsed = True
    
    
    def tokenize_sentences(self):

        for article in self.json_docs:
            content = article["content"]
            sent_tokenized = sent_tokenize(content)
            self.sent_tokens.append(sent_tokenized)
            
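        self.flat_sent_tokens = list(itertools.chain.from_iterable(self.sent_tokens))
        # Mark as done so print_summary_of_corpus() does not tokenize again
        self._sentence_tokenized = True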
        
    
    def print_summary_of_corpus(self):
        if not self._corpus_loaded:
            self.load_corpus()
        
        if not self._corpus_parsed:
            self.parse_corpus()
            
        if not self._sentence_tokenized:
            self.tokenize_sentences()
        
        print("Number of documents: {}.".format(str(len(self.json_docs))))
        print("Number of sentences: {}.".format(str(len(self.flat_sent_tokens))))
    
    
    def print_document(self, doc_idx):
        print("-- Printing document with index: {} --".format(doc_idx))
        print(self.json_docs[doc_idx])

    
    def get_document(self, doc_idx):
        return self.json_docs[doc_idx]
    
    
    def get_next_document(self):        
        self._document_view_pointer += 1
        if self._document_view_pointer > len(self.json_docs):
            return None
        else:
            return self.json_docs[(self._document_view_pointer - 1)]
        
    
    def set_current_document(self, json_doc):
        self.json_docs[(self._document_view_pointer - 1)] = json_doc
    
    
    def reset_pointer(self):
        self._document_view_pointer = 0
        
    
    def set_pointer(self, pointer):
        self._document_view_pointer = pointer        


### Preprocessing

The preprocessing of text takes place here. The following tasks are performed:

- All the text is converted to lower case.
- All URLs are removed from the text using the following regex: <br>
  ```\b(http)(s)?:\/\/((([\w])(\.)?)+)\b```
  
  The regex anchors on word boundaries at the start and the end. It matches "http" with an optional "s", then the colon and two slashes, then one or more word characters, each optionally followed by a full stop. This removes the scheme and host name; any path after the host is left for the later steps.
  
  
- All numbers are removed which are not concatenated with any non-numeric characters, e.g. 5 is removed but not 5pm.
- All non-alphanumeric characters except for whitespace are removed.
- All words that are one character long are removed. The method is named `removeAllWordsLessThanFourCharsLong`, but the regex in use only matches single-character tokens (the `{1,3}` variant is commented out):
  ```\b[A-Za-z0-9]{1}\b```
  
  Any single alphanumeric character standing between word boundaries is replaced with a space.
  
  
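Using the **WordNetLemmatizer** from NLTK's stem module, words can be lemmatized with the default (noun) part-of-speech setting, which aims to reduce the vocabulary and make the words more consistent. Note that the call to `lemmatize_document` is commented out in `pre_process_document`, so the figures reported later are for unlemmatized text.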
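The effect of the regex steps listed above can be checked on a single sentence. The following is a small sketch that applies the same patterns in the same order; `preprocess_demo` and the sample sentence are made up for illustration and are not part of the coursework class.

```python
import re

# Same patterns as in the list above
URL_RE = re.compile(r'\b(http)(s)?:\/\/((([\w])(\.)?)+)\b')
NUMBER_RE = re.compile(r'\b[0-9]+\b')
NON_ALNUM_RE = re.compile(r'[^\sa-zA-Z0-9]')
ONE_CHAR_RE = re.compile(r'\b[A-Za-z0-9]{1}\b')


def preprocess_demo(text):
    """Apply the preprocessing steps in order and collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(' ', text)        # drop URLs (scheme and host)
    text = NUMBER_RE.sub(' ', text)     # drop standalone numbers, keep "5pm"
    text = NON_ALNUM_RE.sub(' ', text)  # drop punctuation
    text = ONE_CHAR_RE.sub(' ', text)   # drop one-character words
    return re.sub(r'\s+', ' ', text).strip()


sample = "Visit https://www.example.com at 5pm, or call 5 times: a B-side!"
print(preprocess_demo(sample))  # visit at 5pm or call times side
```

The standalone "5" disappears while "5pm" survives, and "B-side" loses both the hyphen and the resulting one-character token "b".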

In [5]:
import re

from nltk.stem import WordNetLemmatizer

class Coursework:
    
    def __init__(self, corpus):
        self.corpus = corpus
        self.lemmatizer = WordNetLemmatizer()
    
    def process(self):
        
        # Debug flag
        DEBUG = True
        
        json_doc = self.corpus.get_next_document()
        
        i = 0
        
        temp_doc = json_doc
        
        while json_doc != None:
            json_doc["content"] = self.pre_process_document(json_doc)
            
            # This entire block can be deleted before submission
            if DEBUG:
                if i < 10:
                    with open("output.txt", "a") as w:
                        w.write(json_doc["content"])
                        w.write("--------------------------")

                    temp_doc = json_doc
                i += 1

            # Update the list with the pre-processed document
            self.corpus.set_current_document(json_doc)
            json_doc = self.corpus.get_next_document()
                
            
    def pre_process_document(self, json_doc):
        content = json_doc["content"]
        
        # Converting it to lower case
        content = self.lowercase(content)
              
        # Remove all URLs
        content = self.removeAllURLs(content)
        
        # Remove all numbers
        content = self.removeAllNumbers(content)
        
        # Removing all non-alphanumeric characters
        content = self.removeNonAlphanumericCharacters(content)
            
        # Removing short words
        content = self.removeAllWordsLessThanFourCharsLong(content)
        
        # Remove white space
        content = self.removeNewLines(content)
        content = self.removeTabs(content)
        content = self.removeMultipleSpaces(content)
        
        # Using the default lemmatizer
        #content = self.lemmatize_document(content)
        
        return content
    

    def lemmatize_document(self, content):
        lemmatizer = self.lemmatizer
        words = content.split()
        for i in range(len(words)):
            words[i] = lemmatizer.lemmatize(words[i])
        return " ".join(words)

    
    def lowercase(self, content):
        return content.lower()
        
        
    def removeNonAlphanumericCharacters(self, content):
        return re.sub(r'[^\sa-zA-Z0-9]', ' ', content)


    def removeNewLines(self, content):
        return re.sub(r'\n',' ', content)
    
    
    def removeMultipleSpaces(self, content):
        return re.sub(r'\s\s+',' ', content)
    
    
    def removeTabs(self, content):
        return re.sub(r'\t', ' ', content)
    
    
    def removeAllWordsLessThanFourCharsLong(self, content):
        #shortword = re.compile(r'\W*\b\w{1,3}\b')
        return re.sub(r'\b[A-Za-z0-9]{1}\b', ' ', content)


    def removeAllNumbers(self, content):
        return re.sub(r'\b[0-9]+\b', ' ', content)
   

    def removeAllURLs(self, content):
        return re.sub(r'\b(http)(s)?:\/\/((([\w])(\.)?)+)\b', ' ', content)
       
    

In [6]:
corpus = Corpus(FILE_PATH)
corpus.print_summary_of_corpus()

p = Coursework(corpus)
p.process()


The file has been found.
Number of documents: 19228.
Number of sentences: 270722.


## Part B: N-grams

The preprocessing of the data has been performed. 

For the first part of this task, the vocabulary size and the number of tokens need to be calculated. For this, a class has been created which uses a **dictionary** with the words as keys and their counts as values. The **number of keys** is the **vocabulary size** and the **total sum of the word counts** is the **number of tokens**.

The **TokenCounter** class uses a dictionary to keep count of how many times each word has occurred in the corpus so far. If the word is already in the dictionary, its value is incremented by one. If the word is appearing for the first time, it is added to the dictionary with a value of 1.
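The same counting scheme could also be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the class defined in this notebook; `vocab_and_token_counts` and the two toy documents are hypothetical.

```python
from collections import Counter


def vocab_and_token_counts(documents):
    """Return (V, N): number of distinct words and total number of words."""
    counts = Counter()
    for content in documents:
        counts.update(content.split())
    return len(counts), sum(counts.values())


toy_docs = ["the cat sat", "the dog sat down"]
V_demo, N_demo = vocab_and_token_counts(toy_docs)
print("V: {} N: {}".format(V_demo, N_demo))  # V: 5 N: 7
```

The number of keys gives the vocabulary size and the sum of the values gives the token count, exactly as in the dictionary-based description above.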

In [7]:
class TokenCounter:
    
    def __init__(self):
        self.tokens_dict = {}
    
    
    def add_word(self, word):
        if word in self.tokens_dict:
            count = self.tokens_dict[word]
            count += 1
            self.tokens_dict[word] = count
        else:
            self.tokens_dict[word] = 1

    
    def get_vocab_size(self):
        return len(self.tokens_dict.keys())
    
    
    def get_token_count(self):
        num_tokens = 0
        keys = self.tokens_dict.keys()
        for key in keys:
            num_tokens += self.tokens_dict[key]
        
        return num_tokens


The **VocabAndTokens** class iterates through the preprocessed corpus. For each of the preprocessed documents, the text is split into words and each word is added to the **TokenCounter**. After that, the size of the vocabulary and the number of tokens can be read off and printed.

In [8]:
class VocabAndTokens:
    
    def __init__(self, corpus):
        self.token_counter = TokenCounter()
        self.corpus = corpus
        
        
    def calculate_tokens_and_vocab_size(self):
        json_doc = self.corpus.get_next_document()
        
        while json_doc != None:
            content = json_doc["content"]
            words = content.split()
            for word in words:
                self.token_counter.add_word(word)
            
            json_doc = self.corpus.get_next_document()
        
    
    def get_V(self):
        return self.token_counter.get_vocab_size()
    
    
    def get_N(self):
        return self.token_counter.get_token_count()
    
    
    def get_vocab_size(self):
        return self.get_V()
    
    
    def get_number_of_tokens(self):
        return self.get_N()
            


#### Part B.1) **N** is the number of tokens and **V** is the vocabulary size

Using the ```get_V()``` and ```get_N()``` methods of the **VocabAndTokens** class, get the vocabulary size and the total number of tokens. In this case, **V: 102166 N: 5775947.**

In [9]:
# Resets it to the first document
corpus.reset_pointer()

# Initializes the VocabAndTokens class
vn = VocabAndTokens(corpus)
vn.calculate_tokens_and_vocab_size()

print("V: {} N: {}".format(vn.get_V(), vn.get_N()))


V: 102166 N: 5775947


### Trigrams calculator: NGramHolder

The **NGramHolder** class uses a dictionary to store n-gram counts and provides helper functionality. For example, it takes a string, splits it into words and generates the n-grams for it, where "n" is set when the class is instantiated (3 by default). It can also return the top n-grams in descending order of count and, given an n-gram, return how many times it has occurred in the text processed so far. Finally, it can search for n-grams which begin with a certain phrase: searching for **"is this"** using ```get_n_grams_with_phrase(phrase)``` returns all n-grams which begin with "is this", for example **"is this the"**.
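As a compact illustration of the three operations described above (generating n-grams, ranking them by count and searching by prefix), here is a sketch built on list slicing and `collections.Counter`. The function `n_grams` and the toy sentence are assumptions for the example, not the class defined in this notebook.

```python
from collections import Counter


def n_grams(words, n=3):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


toy_words = "is this the end is this the start".split()
trigram_counts = Counter(n_grams(toy_words, 3))

# Most frequent trigram first
top_one = trigram_counts.most_common(1)

# Trigrams starting with the words "is this" (note the trailing space)
with_prefix = [(g, c) for g, c in trigram_counts.most_common()
               if g.startswith("is this ")]

print(top_one)      # [('is this the', 2)]
print(with_prefix)  # [('is this the', 2)]
```

The trailing space in the prefix keeps the match on whole words, so a prefix such as "of the" would not pick up "of them".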

In [10]:
import operator

class NGramHolder:
    
    def __init__(self, n=3):
        self.n_grams_dict = {}
        self.n = n
        
    def add_n_grams_from_sentence(self, sentence):
        words = sentence.split()
        
        for i in range(len(words) - (self.n - 1)):
            n_gram = ""
            
            for j in range(self.n):
                n_gram += words[i + j] + " "
                
            n_gram = n_gram.strip()
            self.add_n_gram(n_gram)
            
    
    def add_n_gram(self, n_gram):
        if n_gram in self.n_grams_dict:
            count = self.n_grams_dict[n_gram]
            count += 1
            self.n_grams_dict[n_gram] = count
        else:
            self.n_grams_dict[n_gram] = 1
    
    
    def get_n_gram_count(self, n_gram):
        if n_gram in self.n_grams_dict:
            return self.n_grams_dict[n_gram]
        
        return 0
    
    def get_top_n_grams(self, top_threshold=25):
        sorted_list = sorted(self.n_grams_dict.items(), key=lambda kv: kv[1], reverse=True)
        return sorted_list[:top_threshold]

    
    def get_n_grams_with_phrase(self, phrase):
        temp_dict = {}
        
        for key, value in self.n_grams_dict.items():
            # Match on whole words: "of the" must not match "of them ..."
            if key.startswith(phrase + " "):
                temp_dict[key] = value
                
        sorted_list = sorted(temp_dict.items(), key=lambda kv: kv[1], reverse=True)
                
        return sorted_list
        

To calculate the top 25 trigrams, NGramHolder is used with an initialization value of 3. After that, all the news articles are added using the ```add_n_grams_from_sentence()``` method.

In [11]:
class TriGramCalculator:
    
    def __init__(self, corpus, n_gram_holder):
        self.corpus = corpus
        self.n_gram_holder = n_gram_holder

        
    def calculate_top_trigrams(self, top_threshold=25):
        json_doc = self.corpus.get_next_document()
        
        while json_doc != None:
            content = json_doc["content"]
            self.n_gram_holder.add_n_grams_from_sentence(content)
            json_doc = self.corpus.get_next_document()
        
        return self.n_gram_holder.get_top_n_grams(top_threshold)        
    

#### Part B.2) The top 25 trigrams are listed below

In the preprocessing that took place earlier, punctuation and all words with fewer than 2 characters were removed. As a result, the trigrams listed below consist only of words which are two or more characters long.

In [12]:
# Reset corpus's pointer
corpus.reset_pointer()

# Number of top trigrams to display
TOP_X_NUM_OF_TRIGRAMS = 25

trigram_holder = NGramHolder()
trigrams_calc = TriGramCalculator(corpus, trigram_holder)

top_trigrams = trigrams_calc.calculate_top_trigrams()

# Prints the top 25 trigrams
print(top_trigrams)


[('one of the', 2439), ('on shares of', 2093), ('day moving average', 1972), ('on the stock', 1567), ('as well as', 1427), ('in research report', 1417), ('in research note', 1375), ('the year old', 1255), ('the united states', 1225), ('for the quarter', 1221), ('average price of', 1193), ('research report on', 1177), ('research note on', 1138), ('the end of', 1134), ('in report on', 1124), ('earnings per share', 1123), ('shares of the', 1081), ('buy rating to', 1075), ('cell phone plan', 1073), ('phone plan details', 1070), ('according to the', 1068), ('of the company', 1039), ('appeared first on', 995), ('moving average price', 995), ('price target on', 968)]


#### Part B.3) Calculating the list of positive and negative words in the corpus

The class **WordCount** is used to read a file of positive or negative words. After reading the file, the words are stripped of any trailing spaces or newline characters using **rstrip**, as the lines read from the file end with trailing **\n** characters.

After the words have been added to the dictionary, the **add_word** method can be used to increase the count of a word if it is already in the dictionary; words that are not in the dictionary are ignored. Alternatively, **add_sentence** can be used to process all the words in a piece of text at once.
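The idea of counting only the words that belong to a lexicon can be sketched as follows. The tiny lexicons and documents here are invented for illustration (the real word lists come from the files declared earlier), and `count_lexicon_words` is a hypothetical helper.

```python
from collections import Counter


def count_lexicon_words(documents, lexicon):
    """Count occurrences of lexicon words across all documents."""
    counts = Counter()
    for content in documents:
        counts.update(w for w in content.split() if w in lexicon)
    return counts


# Toy lexicons and documents, for illustration only
toy_positive = {"good", "great"}
toy_negative = {"bad", "awful"}
toy_articles = ["good great bad", "good awful plain"]

pos_counts = count_lexicon_words(toy_articles, toy_positive)
neg_counts = count_lexicon_words(toy_articles, toy_negative)
print(sum(pos_counts.values()), sum(neg_counts.values()))  # 3 2
```

Words outside the lexicon, such as "plain", never enter the counter, which mirrors the behaviour of **add_word** described above.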

In [13]:
class WordCount:

    def __init__(self, file_path):
        self._words_dict = {}
        self._file_path = file_path
        
        if os.path.isfile(file_path):
            print("The file has been found.")
        else:
            raise IOError("File not found: {}.".format(file_path))
            
        # Read all the words and add them to the dictionary with a count of 0
        with open(file_path, "r") as file:
            words_str = file.readlines()
        
        for word in words_str:
            word = word.rstrip()
            self._words_dict[word] = 0

            
    def get_words_counts_dict(self):
        return self._words_dict

    
    def add_sentence(self, sentence):
        for word in sentence.split():
            self.add_word(word)
    
    def add_word(self, word):
        if word in self._words_dict:
            count = self._words_dict[word]
            count += 1
            self._words_dict[word] = count
    
    def get_total_word_count(self):
        total_count = 0
        for word, count in self._words_dict.items():
            total_count += count
        
        return total_count
    

Iterates through all the documents and adds each document to the positive and negative word counter.

In [14]:
class CorpusPositiveNegativeWordsCalculator:
    
    def __init__(self, pos_file_path, neg_file_path, corpus):
        # Positive and negative word counter
        self._pos = WordCount(pos_file_path)
        self._neg = WordCount(neg_file_path)
        
        corpus.reset_pointer()
        self.corpus = corpus
        
        self._calculated = False
        
        
    def calculate(self):
        json_doc = self.corpus.get_next_document()
        
        while json_doc != None:
            content = json_doc["content"]
            self._pos.add_sentence(content)
            self._neg.add_sentence(content)
         
            json_doc = self.corpus.get_next_document()
        
        self._calculated = True
            

    def get_positive_word_count(self):
        if self._calculated == False:
            self.calculate()
        return self._pos.get_total_word_count()
    
    
    def get_negative_word_count(self):
        if self._calculated == False:
            self.calculate()
        return self._neg.get_total_word_count()
    

Calculates the positive and negative word counts. After that, it prints out the counts of the positive and negative words in the corpus. The code for a bar graph showing the difference is included but commented out.

In [15]:
# import matplotlib.pyplot as plt

# Resets it to the first document
corpus.reset_pointer()
corpus_pos_neg_words_calc = CorpusPositiveNegativeWordsCalculator(POSITIVE_FILE_PATH, NEGATIVE_FILE_PATH, corpus)
corpus_pos_neg_words_calc.calculate()


The file has been found.
The file has been found.


In [16]:

print("Positive word count: {}  Negative word count: {}".format(corpus_pos_neg_words_calc.get_positive_word_count(),corpus_pos_neg_words_calc.get_negative_word_count() ))

'''
    Commented out as it's not running on Joshua due to issues with numpy
'''
#import numpy as np

# # Plot a bar graph showing the distribution
# def plot_sentiment_graph():
#     word_count = [corpus_pos_neg_words_calc.get_positive_word_count(), corpus_pos_neg_words_calc.get_negative_word_count()]
#     label = ["positive", "negative"]
#     index = np.arange(len(label))
#     plt.bar(index, word_count)
#     plt.xlabel('Sentiment', fontsize=10)
#     plt.ylabel('Number of words', fontsize=10)
#     plt.xticks(index, label, fontsize=10)
#     plt.title('Positive and negative word counts in corpus')
#     plt.show()
    
# plot_sentiment_graph()

Positive word count: 171508  Negative word count: 125916


"\n    Commented out as it's not running on Joshua due to issues with numpy\n"

#### Part B.4: Calculating the number of news stories with more positive than negative words

A class **ArticleSentimentAnalyser** has been used to count the positive and negative words in each news story.

Firstly, the class takes the file paths of the two files containing the positive and negative words. After that, the two lists of words are loaded into **sets** (sets are used instead of lists because membership tests are faster). Additionally, there are two counters for positive and negative articles. Using the **Corpus** class's iterator, all the articles are processed.

For each article, the numbers of positive and negative words are calculated. If the article has more positive than negative words, the positive article counter is incremented. Similarly, if it has more negative than positive words, the negative counter is incremented. If the two numbers are equal, neither counter changes.

After the processing takes place, the numbers of positive and negative articles can be accessed via ```analyser.get_positive_article_count()``` and ```analyser.get_negative_article_count()```.
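The per-article decision rule described above can be sketched as a single function. This is an illustration with made-up lexicons, not the class used in this notebook; `classify_article` is a hypothetical name and the label strings are chosen for the example.

```python
def classify_article(content, pos_words, neg_words):
    """Label an article by comparing its positive and negative word counts."""
    words = content.split()
    pos = sum(1 for w in words if w in pos_words)
    neg = sum(1 for w in words if w in neg_words)
    if pos > neg:
        return "POSITIVE"
    if neg > pos:
        return "NEGATIVE"
    return "NEUTRAL"  # equal counts: neither counter would be incremented


# Toy lexicons, for illustration only
demo_pos = {"good", "great"}
demo_neg = {"bad", "awful"}

labels = [
    classify_article("good great bad", demo_pos, demo_neg),
    classify_article("bad awful good", demo_pos, demo_neg),
    classify_article("good bad plain", demo_pos, demo_neg),
]
print(labels)  # ['POSITIVE', 'NEGATIVE', 'NEUTRAL']
```

Because membership tests on a set take constant time on average, the cost per article is proportional to its length rather than to the size of the lexicon.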

In [17]:
# Constants
POSITIVE = "POSITIVE"
NEUTRAL = "NEUTRAL"
NEGATIVE = "NEGATIVE"

class ArticleSentimentAnalyser:
    
    def __init__(self, pos_file_path, neg_file_path, corpus):
        self._pos_path = pos_file_path
        self._neg_path = neg_file_path
        
        # Hash sets of the positive and negative words
        self._pos_words = set()
        self._neg_words = set()
        
        # Storing the positive / negative counts of each article
        self._pos_article_count = 0
        self._neg_article_count = 0
        
        # Calculated flag
        self._calculated = False
        
        # Resets the pointer
        corpus.reset_pointer()
        self.corpus = corpus
        
        words_str = ""
        
        # Read the list of positive words
        with open(pos_file_path, "r") as file:
            words_str = file.readlines()
        
        for word in words_str:
            word = word.rstrip()
            self._pos_words.add(word)
            
        words_str = ""
        # Read the list of negative words
        with open(neg_file_path, "r") as file:
            words_str = file.readlines()
            
        for word in words_str:
            word = word.rstrip()
            self._neg_words.add(word)
            
    
    def analyse_sentence(self, sentence):      
        pos_count = 0
        neg_count = 0
        
        words = sentence.split()
        for word in words:
            if word in self._pos_words:
                pos_count += 1
            
            if word in self._neg_words:
                neg_count += 1
        
        if pos_count > neg_count:
            self._pos_article_count += 1
        
        if neg_count > pos_count:
            self._neg_article_count += 1
            

    def calculate(self):
        json_doc = self.corpus.get_next_document()
        
        while json_doc != None:
            content = json_doc["content"]
            self.analyse_sentence(content)
         
            json_doc = self.corpus.get_next_document()
        
        self._calculated = True
            
    
    def get_positive_article_count(self):
        if self._calculated == False:
            self.calculate()  
            
        return self._pos_article_count
    
    def get_negative_article_count(self):
        if self._calculated == False:
            self.calculate()  
            
        return self._neg_article_count
              

Before running the analyser, the pointer of the corpus is reset to 0. This ensures that all the articles are processed. 

The **ArticleSentimentAnalyser** class is instantiated with the positive, negative file paths as well as the corpus being passed in as arguments. 

After that, the results are printed out. 

**NB: In addition to printing out the results, I had created a bar graph showing the article counts but had to comment out the code as there was an issue running numpy and matplotlib.pyplot on Joshua.**

In [18]:
# Resets the pointer
corpus.reset_pointer()

analyser = ArticleSentimentAnalyser(POSITIVE_FILE_PATH, NEGATIVE_FILE_PATH, corpus)      

print("Positive article count: {} Negative article count: {}".format(analyser.get_positive_article_count(), analyser.get_negative_article_count()))

'''
    Commented out as it's not running on Joshua due to issues with numpy
'''
# # Plot a bar graph showing the distribution
# def plot_article_sentiment_graph():
#     word_count = [analyser.get_positive_article_count(), analyser.get_negative_article_count()]
#     label = ["positive", "negative"]
#     index = np.arange(len(label))
#     plt.bar(index, word_count)
#     plt.xlabel('Sentiment of articles', fontsize=10)
#     plt.ylabel('Number of articles', fontsize=10)
#     plt.xticks(index, label, fontsize=10)
#     plt.title('Positive and negative articles in corpus')
#     plt.show()
    
# plot_article_sentiment_graph()

Positive article count: 11044 Negative article count: 6255


"\n    Commented out as it's not running on Joshua due to issues with numpy\n"

## Part C: Language models 

Calculate the language model for trigrams based on the first 16,000 rows of the corpus.

To do this, the trigrams, bigrams and unigrams were generated from the first 16,000 rows of the corpus.

In [19]:
# The first 16,000 documents will be used for building up the trigrams language model


class LanguageModelBuilder:
    
    def __init__(self, corpus):
        
        # Reset corpus's pointer
        corpus.reset_pointer()
        self.corpus = corpus
        
        self.TRAIN_UPTO = 16000
        
        # Uses the previously defined NGramHolder as a trigram holder
        self.tri_holder = NGramHolder(3)
        self.bi_holder = NGramHolder(2)
        self.uni_holder = NGramHolder(1)
        
        # Flags 
        self.tri_calculated = False
        self.bi_calculated = False
        self.uni_calculated = False
        
        self.build_trigram_model()
        self.build_bigram_model()
        self.build_unigram_model()
    
    
    def build_trigram_model(self):
        
        for i in range(self.TRAIN_UPTO):
            json_doc = self.corpus.get_document(i)
            content = json_doc["content"]
            self.tri_holder.add_n_grams_from_sentence(content)
        
        self.tri_calculated = True
        
    
    def build_bigram_model(self):
        
        for i in range(self.TRAIN_UPTO):
            json_doc = self.corpus.get_document(i)
            content = json_doc["content"]
            self.bi_holder.add_n_grams_from_sentence(content)
                 
        self.bi_calculated = True
        
        
    def build_unigram_model(self):

        for i in range(self.TRAIN_UPTO):
            json_doc = self.corpus.get_document(i)
            content = json_doc["content"]
            self.uni_holder.add_n_grams_from_sentence(content)
                 
        self.uni_calculated = True
     

In [20]:
corpus.reset_pointer()
lm = LanguageModelBuilder(corpus)

### Part C.2: Sentence generation

To generate the sentence, I take the last two words and search for all trigrams in which those two words are the first two words.

For example, for the phrase **"is this"**, all trigrams like **"is this the"** are returned. The count of each such trigram, e.g. **is this the**, is divided by the count of the bigram **is this** to get its probability. The trigram with the highest probability is selected, and its last word is the next word added to the sentence being generated.

**P("the" | "is this") = P("is this the") / P("is this")** 

**P("the" | "is this") = count(tri_grams("is this the")) / count(bi_grams("is this"))**

The same calculation is repeated for every candidate word in place of "the". The candidate with the highest probability is appended to the current sentence.
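To make the formula concrete, here is a small worked sketch on a toy token sequence. The helper names (`count_n_grams`, `most_likely_next_word`) and the toy text are assumptions for illustration; the notebook's own implementation uses the **NGramHolder** class.

```python
from collections import Counter


def count_n_grams(words, n):
    """Count every run of n consecutive words."""
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def most_likely_next_word(w1, w2, tri_counts, bi_counts):
    """Return (word, probability) maximising count(w1 w2 w) / count(w1 w2)."""
    context_count = bi_counts[w1 + " " + w2]
    if context_count == 0:
        return None, 0.0  # context never seen in training
    best_word, best_prob = None, 0.0
    for trigram, count in tri_counts.items():
        a, b, c = trigram.split()
        if (a, b) == (w1, w2):
            prob = count / context_count
            if prob > best_prob:
                best_word, best_prob = c, prob
    return best_word, best_prob


toy_tokens = "is this the way is this the end is this a test".split()
toy_tri = count_n_grams(toy_tokens, 3)
toy_bi = count_n_grams(toy_tokens, 2)

# "is this" occurs 3 times; "is this the" twice and "is this a" once
word, prob = most_likely_next_word("is", "this", toy_tri, toy_bi)
print(word, round(prob, 3))  # the 0.667
```

Since the denominator is the same for every candidate, picking the highest probability is equivalent to picking the trigram with the highest count.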

In [21]:

def generate_sentence(phrase, lang_mod, sent_len=10):
    '''
        phrase: Seed text; must contain at least two words.
        lang_mod: The language model.
        sent_len: Number of words at which generation stops.
    '''
    # Stop once the sentence is sent_len words long
    while len(phrase.split()) < sent_len:
        words = phrase.split()
        
        # The last two words form the bigram context
        context = words[-2] + " " + words[-1]
        context_count = lang_mod.bi_holder.get_n_gram_count(context)
        if context_count == 0:
            break
        
        matching_phrases = lang_mod.tri_holder.get_n_grams_with_phrase(context)
        
        max_prob = 0
        next_word = ""
        for mp in matching_phrases:
            # P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
            prob = mp[1] / context_count
            
            if prob > max_prob:
                max_prob = prob
                next_word = mp[0].split()[-1]
        
        # No trigram continues this context, so stop rather than loop forever
        if not next_word:
            break
        
        phrase += " " + next_word
    
    return phrase
        
print(generate_sentence("is this", lm))


is this the company has market capitalization of billion and


### Part C.3: Evaluation and Perplexity

After building the language model on the first 16,000 articles, it's time to see how well the model performs. This can be done by calculating the perplexity.

For calculating the perplexity, the articles from 16001 to the end will be used. The perplexity will be calculated on the entire document.

 

In [22]:
TEST_FROM = 16001

corpus.set_pointer(TEST_FROM)

def concat_test_str(corpus, start_point=16001):
    corpus.set_pointer(start_point)
    
    contents = []
    
    json_doc = corpus.get_next_document()
    while json_doc is not None:
        contents.append(json_doc["content"])
        json_doc = corpus.get_next_document()
    
    # Join with a space so the last word of one article does not fuse
    # with the first word of the next
    return " ".join(contents)

test_sentence = concat_test_str(corpus, TEST_FROM)

#### Laplace smoothing

To handle n-grams that do not appear in the training corpus, **Laplace smoothing** has been used. This means that 1 is added to the numerator (the bigram count) and **V** (the vocabulary size) is added to the denominator (the count of the first word). It's not usually the best choice for language models because a large share of the probability mass is moved to the many unseen n-grams. However, it has been used here as it's very easy to implement.
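The add-1 estimate described above amounts to one line of arithmetic. The sketch below uses small made-up counts; `laplace_bigram_probability` is a hypothetical helper, not a function from this notebook.

```python
def laplace_bigram_probability(bigram_count, first_word_count, vocab_size):
    """Add-1 estimate: (count(w1 w2) + 1) / (count(w1) + V)."""
    return (bigram_count + 1) / (first_word_count + vocab_size)


# Toy numbers: the first word occurs 10 times and V = 5
seen = laplace_bigram_probability(3, 10, 5)    # bigram seen 3 times -> 4/15
unseen = laplace_bigram_probability(0, 10, 5)  # bigram never seen   -> 1/15
print(round(seen, 4), round(unseen, 4))  # 0.2667 0.0667
```

An unseen bigram still gets a non-zero probability, which is the point of the smoothing; with a vocabulary as large as the one in this corpus, that probability becomes very small.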

#### Calculating Perplexity

1) Generate the bigrams for the article.

2) For each bigram in the article, get its count from the bigram model built in Part C.1 and add 1.

3) For the first word of the bigram, get its count from the unigram model and add the vocabulary size **V**, following the "add-1" estimate.

4) Divide the two to get the smoothed probability and store it in a list of probabilities.

5) Replace each probability in the list with its inverse (1 divided by the probability).

6) Multiply the inverse probabilities together and take the N-th root of the product, where N is the number of words in the article, to get the perplexity.

After the perplexities for all the test articles have been calculated, they are stored in a list called **perplexities**.

For many articles the perplexity shows up as infinite. This is floating-point overflow rather than a genuinely infinite value: each inverse probability is at least 1 and often very large (the denominator includes **V**, which is 102166 here), so the running product can exceed the largest representable float and become **inf**.
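One common way to avoid multiplying many large numbers together is to sum logarithms instead and exponentiate once at the end. The sketch below contrasts the two approaches on made-up probabilities; `perplexity_from_probabilities` is a hypothetical helper, and it normalises by the number of probabilities, which is an assumption that differs from the word count used in this notebook.

```python
import math


def perplexity_from_probabilities(probabilities):
    """Perplexity as exp of the negative mean log-probability."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# Toy input: 100 bigrams, each with probability 1e-5
toy_probs = [1e-5] * 100

# Direct product of inverse probabilities overflows to inf
direct_product = 1.0
for p in toy_probs:
    direct_product *= 1 / p

log_space = perplexity_from_probabilities(toy_probs)
print(direct_product)        # inf
print(round(log_space, 2))   # 100000.0
```

Both routes compute the same quantity mathematically; only the log-space version stays within floating-point range for long articles.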

In [23]:

def calculate_perplexity(lang_mod, sentence):

    '''
        lang_mod: Language model
        sentence: The preprocessed text to score (here, one test article)
    '''
    
    V = vn.get_V()
    
    bi_holder = NGramHolder(2)
    bi_holder.add_n_grams_from_sentence(sentence)
    
    bi_grams_dict = bi_holder.n_grams_dict 
    
    probs = []
    
    for bi_gram, count in bi_grams_dict.items():
        lm_bi_count = lang_mod.bi_holder.get_n_gram_count(bi_gram) + 1
        lm_uni_count = lang_mod.uni_holder.get_n_gram_count(bi_gram.split()[0]) + V
        
        prob = lm_bi_count / lm_uni_count
        probs.append(prob)
    
    for i in range(len(probs)):
        probs[i] = 1 / probs[i]
        

    total_prob = float(1)
    
    for i in range(len(probs)):
        total_prob *= probs[i]
    
    n_value = len(sentence.split())
 
    return (total_prob**(1/ n_value))


Using the perplexity function, the perplexities for all of the remaining news articles after 16001 are calculated and stored in the perplexities list. The perplexities are printed at the end.

## The results

The perplexities for many articles are **inf**. For the rest they range from the tens to several thousands. Values this high indicate that the smoothed bigram model fits the test articles poorly.

In [24]:

def calculate_perplexities(corpus, lang_mod):
    perplexities = []
    corpus.set_pointer(TEST_FROM)

    json_doc = corpus.get_next_document()
    while json_doc != None:
        
        content = json_doc["content"]
        perplexities.append(calculate_perplexity(lang_mod, content))
        
        json_doc = corpus.get_next_document()
    
    return perplexities


perplexities = calculate_perplexities(corpus, lm)
        
print(perplexities)


[inf, inf, inf, 85.22271501345406, inf, inf, inf, inf, inf, inf, inf, inf, inf, 58.72866833305018, inf, inf, inf, inf, inf, 9149.517136229177, inf, 3478.3667911379607, 773.863488487506, 4210.326414404433, inf, 11614.912859094531, inf, inf, inf, inf, inf, inf, inf, 94.32810935569726, inf, inf, inf, inf, inf, 8692.482644396674, inf, inf, inf, inf, inf, 5449.676485684123, inf, inf, inf, inf, inf, inf, inf, inf, 793.9173623628344, 6242.3983778278625, inf, inf, 228.9016369257342, inf, inf, inf, inf, 294.59226013195934, 11626.16677647464, 4097.013088150379, inf, inf, inf, inf, inf, inf, inf, inf, inf, 10300.482588569994, 2248.7158729524385, inf, inf, inf, 3475.9227231961977, inf, inf, inf, 5232.486521305419, inf, inf, inf, inf, 10113.786827451866, inf, inf, 196.62126276803642, inf, inf, inf, inf, 1764.3261575634453, inf, inf, inf, 9342.37631952564, inf, 4452.886322670646, 9275.107153992596, inf, inf, inf, inf, inf, inf, inf, inf, inf, 3955.7350140651156, inf, inf, inf, inf, inf, inf, 5574.29