# Model Functions

This notebook contains the functions needed by the production model for the web application.

## Start Notebook
Run all of the commands below to start the model training session in this notebook.

Uncomment and run the code below to kill the runtime in Google Colaboratory

In [None]:
# !kill -9 -1


Uncomment and run the code below to install the necessary packages in Colaboratory

In [None]:
# %tensorflow_version 1.x

# !pip install gpt_2_simple
# !pip install wikipedia

# !nvidia-smi


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')


In [None]:
# !mkdir -p drive/My\ Drive/Project\ Writer/datasets
# !mkdir -p drive/My\ Drive/Project\ Writer/samples


Uncomment and run the code below to download the model in Colaboratory

In [None]:
# import gpt_2_simple as gpt2
# gpt2.download_gpt2(model_name='355M')


Import necessary libraries

In [3]:
# Import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import datetime
import gpt_2_simple as gpt2
import tensorflow as tf
import wikipedia
import re
import nltk


## Text Scraping

This class scrapes all of the text contained in a given web page or Wikipedia article.

In [86]:
class Extract():
    def __init__(self, return_as_file=True):
        self.return_as_file = return_as_file  # Export the dataset as a file

    def remove_blank_lines(self, paragraph):
        '''

        Function to remove blank lines from a paragraph.

        @paragraph: A string that may contain blank lines

        return: The paragraph without blank lines

        '''

        lines = paragraph.split('\n')

        non_empty_lines = [line for line in lines if line.strip() != '']

        string_without_empty_lines = ''
        for line in non_empty_lines:
            string_without_empty_lines += line + '\n'

        return string_without_empty_lines

    def extract_from_investopedia(self, urls):
        '''

        Function to extract the text from a list of Investopedia URLs.

        @urls: List of Investopedia URLs

        return: String containing the text from the URLs, or None when the text is written to a file

        '''

        # List of elements containing text
        elements = [
            'article'
        ]

        # List of elements to delete
        delete_elements = [
            'header',
            'span',
            'footer'
        ]

        # Initialize string container
        texts = ''

        # Loop over the urls
        for url in urls:
            page = urlopen(url).read()

            soup = BeautifulSoup(page, 'html.parser')

            # Delete some elements
            for element in soup(delete_elements):
                element.decompose()

            # Remove the breadcrumb navigation divs
            for div in soup.find_all('div', ['breadcrumbs']):
                div.decompose()

            list_text_tags = soup.find_all(elements)

            for tag in list_text_tags:
                text = tag.text

                # Remove extra spaces
                text = text.strip()

                # Add the text to the container, separated by a newline
                # so that consecutive articles do not run together
                texts += text + '\n'

        # Remove blank lines and surrounding whitespace
        texts = self.remove_blank_lines(texts)
        texts = texts.strip()

        if self.return_as_file:
            filename = 'investopedia_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())  # Export a text file
            path = r'./datasets/' + filename

            # Write the text file in the datasets folder
            with open(path, 'w') as f:
                f.write(texts)

        else:
            return texts

    def extract_from_wikipedia(self, titles):
        '''

        Function to extract text from Wikipedia.

        @titles: List of Wikipedia article titles

        return: String containing the text of the articles, or None when the text is written to a file

        '''

        # Initialize the container
        texts = ''

        for title in titles:
            # Get the wikipedia page
            page = wikipedia.page(title)

            # Extract the text
            text = page.content

            # Remove section headings such as '== History =='
            text = re.sub(r'==.*?==+', '', text)

            # Separate consecutive articles with a newline
            texts += text + '\n'
        
        texts = self.remove_blank_lines(texts)

        if self.return_as_file:
            filename = 'wikipedia_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())  # Export a text file
            path = r'./datasets/' + filename

            # Write the text file in the datasets folder
            with open(path, 'w') as f:
                f.write(texts)

        else:
            return texts
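
The two cleaning steps used by `Extract` can be tried without any network access. The following is a minimal sketch, not the class itself: `strip_headings` and `drop_blank_lines` are hypothetical helper names, and the sample string only imitates the `== Heading ==` markers found in the text returned by the `wikipedia` package.

```python
import re

def strip_headings(text):
    # Same pattern as in extract_from_wikipedia: removes '== Title ==' style markers
    return re.sub(r'==.*?==+', '', text)

def drop_blank_lines(text):
    # Keep only the lines that contain something other than whitespace
    return ''.join(line + '\n' for line in text.split('\n') if line.strip())

# Hand-written sample that imitates Wikipedia section markers (assumption)
sample = (
    'Intro line.\n\n\n'
    '== History ==\n'
    'First paragraph.\n\n'
    '=== Early work ===\n'
    'Second paragraph.\n'
)

cleaned = drop_blank_lines(strip_headings(sample))
print(cleaned)
```

Stripping the headings first matters here: each removed heading leaves an empty line behind, which the blank-line pass then drops.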


Testing the text extraction function

In [89]:
extract = Extract(return_as_file=False)

# test_investopedia_texts = extract.extract_from_investopedia(['https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp'])
test_wikipedia_texts = extract.extract_from_wikipedia(['Artificial Intelligence'])

print(test_wikipedia_texts)


ast, the rare loyal robots such as Gort from The Day the Earth Stood Still (1951) and Bishop from Aliens (1986) are less prominent in popular culture.Isaac Asimov introduced the Three Laws of Robotics in many books and stories, most notably the "Multivac" series about a super-intelligent computer of the same name. Asimov's laws are often brought up during lay discussions of machine ethics; while almost all artificial intelligence researchers are familiar with Asimov's laws through popular culture, they generally consider the laws useless for many reasons, one of which is their ambiguity.Transhumanism (the merging of humans and machines) is explored in the manga Ghost in the Shell and the science-fiction series Dune. In the 1980s, artist Hajime Sorayama's Sexy Robots series were painted and published in Japan depicting the actual organic human form with lifelike muscular metallic skins and later "the Gynoids" book followed that was used by or influenced movie makers including George Luc

## Finetuning Model
This function finetunes the model on the latest dataset.

In [12]:
def finetune_model(dataset, any_checkpoint=False, reset_session=False):
    '''

    Function to finetune the model and save it to the checkpoint folder at every checkpoint.

    @dataset: Path to the training data (TXT) with a minimum of 1024 tokens
    @any_checkpoint: Boolean, True if a previous checkpoint exists
    @reset_session: Boolean, True if the session graph needs to be reset

    return: None

    '''

    # Parameters
    STEPS = 10000
    MODEL_NAME = '355M'
    LEARNING_RATE = 0.0001
    RUN_NAME = 'trained_model'

    MODEL_DIR = 'models'
    CHECKPOINT_DIR = 'checkpoint'

    # Clear session graph
    if reset_session:
        tf.reset_default_graph()

    # Initialize training session
    sess = gpt2.start_tf_sess()

    # Load the previous checkpoint
    if any_checkpoint:
        gpt2.load_gpt2(
            sess,
            run_name=RUN_NAME,
            checkpoint_dir=CHECKPOINT_DIR,
            model_name=None,
            model_dir=MODEL_DIR,
            multi_gpu=False
        )

    # Finetune the model
    gpt2.finetune(
        sess,
        dataset=dataset,  # Dataset file
        steps=STEPS,
        model_name=MODEL_NAME,  # Model name: 124M, 355M, etc.
        model_dir=MODEL_DIR,
        combine=50000,
        batch_size=1,
        learning_rate=LEARNING_RATE,  # Learning rate
        accumulate_gradients=5,
        restore_from='latest',  # Start training the model from the latest model
        run_name=RUN_NAME,  # Name of the trained model
        checkpoint_dir=CHECKPOINT_DIR,  # Directory to save the model
        sample_every=1000,
        sample_length=300,  # Number of tokens generated per sample
        sample_num=1,
        multi_gpu=False,
        save_every=1000,
        print_every=10,
        max_checkpoints=1,
        use_memory_saving_gradients=False,
        only_train_transformer_layers=False,
        optimizer='adam',
        overwrite=True  # Overwrite the current model when training
    )
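
To make the parameter choices above easier to reason about, here is a small arithmetic sketch. It assumes that each training step processes `batch_size * accumulate_gradients` sequences of 1024 tokens, which is how `gpt_2_simple` appears to sample its batches; treat the numbers as estimates rather than guarantees. `training_budget` is a hypothetical helper, not part of the library.

```python
def training_budget(steps, batch_size, accumulate_gradients,
                    save_every, sample_every, context_tokens=1024):
    # Sequences that contribute to one optimizer update (assumption, see above)
    sequences_per_step = batch_size * accumulate_gradients
    return {
        'sequences_per_step': sequences_per_step,
        'tokens_per_step': sequences_per_step * context_tokens,
        'total_tokens_seen': steps * sequences_per_step * context_tokens,
        'periodic_saves': steps // save_every,
        'periodic_samples': steps // sample_every,
    }

# Values copied from finetune_model above
budget = training_budget(steps=10000, batch_size=1, accumulate_gradients=5,
                         save_every=1000, sample_every=1000)

for name, value in budget.items():
    print(f'{name}: {value}')
```

Under these assumptions, a small dataset is revisited many times over 10000 steps, so lowering `STEPS` may be worth considering when the training file is short.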


Uncomment and run the code below to test the finetuning model function

In [None]:
# test_training_data = ''
# finetune_model(test_training_data)


## Generating Text
This function generates text based on input from the user.

In [None]:
def generate_text(outline_to_length):
    '''

    Function to generate the text.

    @outline_to_length: A 2D list pairing each outline with its desired length
        [[outline, length],
        [outline, length],
        [outline, length]]

    return: List of generated text

    '''

    # Parameters
    MODEL_NAME = '355M'
    RUN_NAME = 'trained_model'

    MODEL_DIR = 'models'
    CHECKPOINT_DIR = 'checkpoint'
    SAMPLE_DIR = 'samples'

    # Clear session graph
    tf.reset_default_graph()

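    # Initialize TensorFlow session
    sess = gpt2.start_tf_sess()

    # Load the finetuned checkpoint into the session before generating
    gpt2.load_gpt2(
        sess,
        run_name=RUN_NAME,
        checkpoint_dir=CHECKPOINT_DIR,
        model_name=None,
        model_dir=MODEL_DIR
    )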

    # Create an empty list to store the generated paragraphs
    essay = []

    # Loop over the list
    for record in outline_to_length:
        prefix = record[0]  # The first sentence of the paragraph
        length = record[1]  # The length of the paragraph (max: 1023)

        text = gpt2.generate(
            sess,
            run_name=RUN_NAME,
            checkpoint_dir=CHECKPOINT_DIR,
            model_name=None,
            model_dir=MODEL_DIR,
            sample_dir=SAMPLE_DIR,
            return_as_list=True,  # Return as list of string
            truncate=None,
            destination_path=None,
            sample_delim='\n\n' + '=' * 20 + '\n\n',
            prefix=prefix,
            seed=None,
            nsamples=1,  # Number of samples to be generated
            batch_size=1,
            length=length,
            temperature=0.7,
            top_k=0,
            top_p=0.0,
            include_prefix=True
        )

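        # Join the samples into one paragraph and store it as a single list item
        # (using += with a string would add it character by character)
        text = ''.join(text) + '\n\n'
        essay.append(text)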

    return essay
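
Because `generate_text` indexes each record as `[outline, length]`, malformed input fails only once the TensorFlow session has been started. A minimal sketch of an up-front check follows. `validate_outline` is a hypothetical helper, and the 1023-token ceiling is taken from the comment in the function above.

```python
MAX_LENGTH = 1023  # Upper bound noted in generate_text

def validate_outline(outline_to_length):
    '''Return a cleaned list of (prefix, length) tuples or raise ValueError.'''
    cleaned = []
    for index, record in enumerate(outline_to_length):
        # Every record must hold exactly an outline and a length
        if len(record) != 2:
            raise ValueError(f'Record {index} must be [outline, length]')
        prefix, length = record
        if not isinstance(prefix, str) or not prefix.strip():
            raise ValueError(f'Record {index} has an empty outline')
        if not isinstance(length, int) or not 1 <= length <= MAX_LENGTH:
            raise ValueError(f'Record {index} length must be 1..{MAX_LENGTH}')
        cleaned.append((prefix.strip(), length))
    return cleaned

print(validate_outline([['AI is a branch of computer science.', 200]]))
```

Calling such a check at the top of `generate_text` would reject an input like `[[]]` before any model work is done.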


Uncomment and run the code below to test the function

In [None]:
# test_outline_to_length = [[]]
# print(generate_text(test_outline_to_length))


## Paraphrasing Text (Under Development)
This function is used to paraphrase the sentences to avoid direct plagiarism.

In [124]:
def paraphrase_paragraph(paragraphs):
    '''

    Function to paraphrase a list of paragraphs.

    @paragraphs: A list containing the paragraphs

    return: Paraphrased paragraphs

    '''

    def tag(sentence):
        '''

        Function to tag each word with its part of speech.

        @sentence: String sentence

        return: List of words with their tags

        '''

        words = nltk.tokenize.word_tokenize(sentence)
        words = nltk.tag.pos_tag(words)

        return words

    def paraphraseable(tag):
        return tag.startswith('NN') or tag == 'VB' or tag.startswith('JJ')

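    def pos(tag):
        # Map a Penn Treebank tag to the matching WordNet part of speech
        if tag.startswith('NN'):
            return nltk.corpus.wordnet.NOUN
        elif tag.startswith('V'):
            return nltk.corpus.wordnet.VERB
        elif tag.startswith('JJ'):
            return nltk.corpus.wordnet.ADJ

    def synonyms(word, tag):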
        lemma_lists = [ss.lemmas() for ss in nltk.corpus.wordnet.synsets(word, pos(tag))]
        lemmas = [lemma.name() for lemma in sum(lemma_lists, [])]
        return set(lemmas)

    def if_synonym_exists(sentence):
        for (word, t) in tag(sentence):
            if paraphraseable(t):
                syns = synonyms(word, t)
                if syns:
                    if len(syns) > 1:
                        yield [word, list(syns)[1]]
                        continue
            yield [word, '']

    def sentence_paraphrase(sentence):
        return [w for w in if_synonym_exists(sentence)]

    # 2D array of sentences in each paragraph
    list_sentences = []

    # Convert the list of paragraphs into lists of sentences
    for paragraph in paragraphs:
        sentences = nltk.tokenize.sent_tokenize(paragraph)

        # Loop over the sentences
        # NOTE: the paraphrased result is not stored yet (under development),
        # so the original sentences are returned unchanged
        for sentence in sentences:
            sentence = sentence_paraphrase(sentence)

        list_sentences.append(sentences)

    return list_sentences
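
The helper `if_synonym_exists` yields `[word, synonym]` pairs, with an empty string when no synonym was chosen. One way those pairs could be turned back into a sentence is sketched below. `apply_synonyms` is a hypothetical name, the sample pairs are hand-written rather than real WordNet output, and the spacing rule for punctuation is a simplification.

```python
import re

def apply_synonyms(pairs):
    # Prefer the synonym when one was found, otherwise keep the original word
    words = [synonym if synonym else word for word, synonym in pairs]
    # WordNet lemma names use underscores for multi-word expressions
    words = [w.replace('_', ' ') for w in words]
    sentence = ' '.join(words)
    # Remove the space that joining leaves before common punctuation marks
    return re.sub(r'\s+([.,;:!?])', r'\1', sentence)

# Hand-written pairs in the shape produced by if_synonym_exists (assumption)
pairs = [['The', ''], ['goal', 'end'], ['is', ''], ['clear', ''], ['.', '']]
print(apply_synonyms(pairs))  # The end is clear.
```

The example also shows why blind substitution needs review: "end" is a valid WordNet synonym of "goal" in some senses, yet it changes the meaning of this sentence.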


Testing the paraphraser function

In [125]:
test_paragraphs = ['At its core, AI is the branch of computer science that aims to answer Turing question in the affirmative. It is the endeavor to replicate or simulate human intelligence in machines.', 'The expansive goal of artificial intelligence has given rise to many questions and debates. So much so, that no singular definition of the field is universally accepted.']

print(paraphrase_paragraph(test_paragraphs))


[['At its core, AI is the branch of computer science that aims to answer Turing question in the affirmative.', 'It is the endeavor to replicate or simulate human intelligence in machines.'], ['The expansive goal of artificial intelligence has given rise to many questions and debates.', 'So much so, that no singular definition of the field is universally accepted.']]


## Grammar Check (Under Development)
This function is used to check and correct any grammatical error in the paragraph.