<h1 align="center"> Natural Language Processing </h1>

## <font color="green"> Home assignment 1 </font>: N-gram Language Model

### Work done by: Ryabykin Aleksey
---


In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [3]:
cfd = nltk.ConditionalFreqDist(nltk.bigrams(text2))

In [4]:
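import random

def generate(cfd, first_word, length=50):
    """Print `first_word` followed by `length` words sampled from bigram frequencies."""
    word = first_word
    print(word, end=' ')
    for _ in range(length):
        # All words observed after `word`, together with their counts.
        pairs = cfd[word].most_common()
        next_words = [w for w, _freq in pairs]
        freqs = [freq for _w, freq in pairs]
        # Sample the next word with probability proportional to its count.
        word = random.choices(next_words, weights=freqs)[0]
        print(word, end=' ')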

In [5]:
generate(cfd, "Me")

Me , and it to a gentleman , " I am sure Marianne ; and words in the next morning was summoned to have been taken the drawing - trees in many pursuits and all advise you to Mrs . Sir John Dashwood alone could even SHE would give offence would 
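The sampling step above can be illustrated without NLTK. The following is a minimal sketch on a toy corpus, using only the standard library; `build_bigram_table`, `sample_next` and `toy_tokens` are hypothetical names introduced for illustration, and a `defaultdict(Counter)` stands in for `nltk.ConditionalFreqDist`.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def sample_next(table, word, rng=random):
    """Draw a follower of `word` with probability proportional to its count."""
    followers = table.get(word)  # .get avoids creating an empty entry
    if not followers:
        return None  # dead end: nothing was ever observed after `word`
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights)[0]

# Toy corpus (assumption, not taken from the NLTK texts).
toy_tokens = "the cat sat on the mat and the cat slept".split()
toy_table = build_bigram_table(toy_tokens)
```

Here `toy_table["the"]` counts `cat` twice and `mat` once, so `sample_next(toy_table, "the")` should return `cat` about two thirds of the time. Returning `None` for a word with no followers is a design choice of this sketch; the `generate` function above would raise an error in that situation instead.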

Let's concatenate all texts.

In [8]:
tokens = []
texts_list = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
for text in texts_list:
    tokens += text.tokens

In [9]:
from tqdm import tqdm
from typing import List, Tuple
from collections import deque 


class Generator:
    def __init__(self, tokens: List[str], n_grams: int = 2) -> None:
        '''
        Initialize the text generator by collecting all `2..n`-grams.
        If a context is not found among the `n`-grams, generation falls back
        to the `(n-1)`-gram statistics, and so on.

        ## Inputs:

        * `tokens` - list of tokens from all texts (this could be generalized to other inputs).
        * `n_grams` - the maximum n-gram order to collect and use in the algorithm.
        '''

        assert 1 < n_grams < 100, "Wrong `n_grams` value"

        print(f"Collecting {n_grams}-grams freq...")
        ngrams_as_bigrams = []

        for n in tqdm(range(2, n_grams + 1)):
            ngrams = nltk.ngrams(tokens, n)
            # Store each n-gram as (context of n-1 tokens, next token).
            ngrams_as_bigrams.extend((t[:-1], t[-1]) for t in ngrams)
        self.cfd = nltk.ConditionalFreqDist(ngrams_as_bigrams)
        self.n_grams = n_grams

    def generate_word(self, input_sentence: Tuple[str, ...]) -> str:
        '''
        Generate the next word for the given context.
        The approach is the same as in the lecture: a random choice among all
        words that follow the context in the source texts, weighted by frequency.

        ## Inputs:

        * `input_sentence` - context: a tuple of at most `n_grams - 1` preceding tokens.

        ## Returns:

        * `word` - the next word.
        '''

        sentence = input_sentence
        pairs = self.cfd[sentence].most_common()
        next_words = [word for word, freq in pairs]
        freqs = [freq for word, freq in pairs]
        word = random.choices(next_words, weights=freqs)[0]
        return word
    
    def generate(self, first_word: str="I", length=100) -> List[str]:
        '''
        Generate text starting from the first word. This could be generalized to a given starting sequence.
        With a single first word the approach is simple: the second word is generated from 2-gram statistics
        and appended to the context, the third from 3-gram statistics, and so on until the context
        holds `n_grams - 1` words.
        A fixed-length deque keeps the context used to predict the next word,
        which simplifies sliding over the `n`-grams.

        ## Inputs:

        * `first_word` - string with the first word;
        * `length` - number of words to generate after the first word and its immediate successor.

        ## Returns:

        * `generated` - list of generated tokens.
        '''
        queue = deque([first_word], maxlen=self.n_grams - 1)
        generated = [first_word]
        print(first_word, end=' ')

        # `.get` does not insert an empty entry for a missing context.
        assert self.cfd.get(tuple(queue)), "Cannot find this word in the dictionary"

        new_word = self.generate_word(tuple(queue))
        print(new_word, end=' ')
        queue.append(new_word)
        generated.append(new_word)
        for _ in range(length):
            # Back off: drop the oldest context word until the context has a known continuation.
            while not self.cfd.get(tuple(queue)):
                assert len(queue) > 1, "Something went wrong"
                queue.popleft()
            new_word = self.generate_word(tuple(queue))
            queue.append(new_word)
            print(new_word, end=' ')
            generated.append(new_word)
        return generated

In [10]:
generator = Generator(tokens, n_grams=5)

Collecting 5-grams freq...


100%|██████████| 4/4 [00:01<00:00,  3.34it/s]


In [11]:
generated_text = generator.generate()

I say ; your whales must be seen before they can be killed ; and this sunken - eyed young Platonist will tow you ten wakes round the world , at another she would seclude herself from it for ever , and has made all those over whom she had any influence , cast him off likewise . Surely , after doing so , she cannot be imagined liable to any impression of sorrow or of joy on his account -- she cannot be interested in any thing that befalls him .-- She would not be frightened from paying him those attentions 
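The back-off idea described in the class docstrings can be sketched without NLTK. This is an illustrative stand-in rather than the class above: `build_backoff_table`, `generate_with_backoff`, `toy_tokens` and the `seed` parameter are hypothetical names, and a `defaultdict(Counter)` replaces `nltk.ConditionalFreqDist`.

```python
import random
from collections import Counter, defaultdict, deque

def build_backoff_table(tokens, max_n):
    """Map every context of 1..max_n-1 tokens to a Counter of next tokens."""
    table = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
    return table

def generate_with_backoff(table, first_word, max_n, length, seed=None):
    """Generate up to `length` words after `first_word`, shortening the context when needed."""
    rng = random.Random(seed)
    context = deque([first_word], maxlen=max_n - 1)
    output = [first_word]
    for _ in range(length):
        # Drop the oldest word until the remaining context was seen in the corpus.
        while context and tuple(context) not in table:
            context.popleft()
        if not context:
            break  # even the last word alone has no recorded follower
        followers = table[tuple(context)]
        words = list(followers)
        next_word = rng.choices(words, weights=[followers[w] for w in words])[0]
        output.append(next_word)
        context.append(next_word)
    return output

# Toy corpus (assumption, for illustration only).
toy_tokens = "a b c a b d a b c".split()
toy_table = build_backoff_table(toy_tokens, max_n=3)
toy_output = generate_with_backoff(toy_table, "a", max_n=3, length=5, seed=0)
```

The `deque` with `maxlen=max_n - 1` discards the oldest word automatically once the context is full, so the explicit `popleft` is needed only when a context was never observed. Stopping early at a dead end is a choice made in this sketch.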

Let's check that it works as expected.

In [12]:
generator.cfd[('I', 'say', ';', 'your')]

FreqDist({'whales': 1})

In [13]:
generator.cfd[('I', 'remained', 'in', 'this', 'whales')]

FreqDist({})
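One caveat about lookups like the two above: `nltk.ConditionalFreqDist` is built on `defaultdict`, so indexing a missing context with `[...]` should create an empty entry for it as a side effect. The sketch below shows that behaviour with a plain `defaultdict(Counter)` as a stand-in (an assumption made to keep the example free of NLTK); the variable names are illustrative.

```python
from collections import Counter, defaultdict

# Stand-in for nltk.ConditionalFreqDist (assumption: same defaultdict behaviour).
table = defaultdict(Counter)
table[("I", "say")]["whales"] += 1

unseen = ("never", "seen")
seen_before = unseen in table      # membership test, no side effect
safe_lookup = table.get(unseen)    # .get returns None, no side effect
size_before = len(table)

_ = table[unseen]                  # indexing silently creates an empty entry
size_after = len(table)
seen_after = unseen in table
```

After the indexing step the unseen context counts as a key, so a later `in` check alone no longer tells whether the context has any continuation. Checking that the looked-up distribution is non-empty, for example via `.get`, avoids the problem.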

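Yes, it does: the 4-word context `('I', 'say', ';', 'your')` has a single continuation, and an unseen context returns an empty distribution. Let's compare with the initial algorithm by building the same kind of cfd for it.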

In [15]:
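ngrams_as_bigrams = []
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))  # bigram contexts are plain strings here
for n in tqdm(range(3, 5 + 1)):
    ngrams = nltk.ngrams(tokens, n)
    ngrams_as_bigrams.extend([((t[:-1]), t[-1]) for t in ngrams])
    # Note: the list is cumulative, so earlier n-grams are added again on every pass
    # (3-gram counts end up tripled, 4-gram counts doubled).
    cfd += nltk.ConditionalFreqDist(ngrams_as_bigrams)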

100%|██████████| 3/3 [00:33<00:00, 11.15s/it]


In [16]:
generate(cfd, 'I', 100)

I that of hope of them , influence your behaviour of a paper . " I don ' s sleek , a trifling importance , says 0 it might plug it ; for the open is called . What are committed . " You Gotta Have a moment . The Constitution of appointment is technically fast asleep ; yet now 18 / New Hampshire , Quohog , bravely he ' t repeat it ! That is the Britons . But he U6 ... weighs ... CONCORDE : Burn ! FRENCH GUARD # 2 days of blubber into the principle , flows 

In [17]:
cfd[('I', 'that')]

FreqDist({'I': 3, 'you': 3})

In [28]:
cfd[('the', 'open')]

FreqDist({'air': 24, 'sea': 18, 'ocean': 9, 'independence': 3, 'atmosphere': 3, 'field': 3, 'jaw': 3, 'firmament': 3, 'door': 3, 'market': 3, ...})

In [29]:
cfd[('the', 'open', 'is')]

FreqDist({})

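As we can see, it does not work in the same way. The simple `generate` function only ever looks up the last word as a plain string, so it uses bigram statistics alone; the tuple contexts such as `('I', 'that')` are never consulted, and there is no fallback when a longer context such as `('the', 'open', 'is')` has no continuation.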
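The counts shown for `('I', 'that')` and `('the', 'open')` are all multiples of 3. This follows from the loop in cell 15: the list of n-grams keeps growing and is added to `cfd` again on every pass. The sketch below reproduces the effect on a toy corpus with a plain `Counter` and compares it with counting once after the loop; the function names and `toy_tokens` are hypothetical.

```python
from collections import Counter

def context_pairs(tokens, n):
    """Yield (context of n-1 tokens, next token) for every n-gram."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n - 1]), tokens[i + n - 1]

def count_cumulative(tokens, max_n):
    """Mirror the loop above: the growing list is re-counted on every pass."""
    total = Counter()
    pairs = []
    for n in range(3, max_n + 1):
        pairs.extend(context_pairs(tokens, n))
        total += Counter(pairs)
    return total

def count_once(tokens, max_n):
    """Count every (context, next token) pair exactly once."""
    pairs = []
    for n in range(3, max_n + 1):
        pairs.extend(context_pairs(tokens, n))
    return Counter(pairs)

# Toy corpus (assumption, for illustration only).
toy_tokens = "the open air and the open sea".split()
inflated = count_cumulative(toy_tokens, max_n=5)
exact = count_once(toy_tokens, max_n=5)
```

In this toy run the pair `(("the", "open"), "air")` is counted once by `count_once` but three times by `count_cumulative`. Since the inflation factor differs by n-gram order, the relative weights inside a single context stay the same, while counts are no longer comparable across orders.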