# Model Functions

This notebook contains the functions needed by the production model for the web application.

## Start Notebook
Run all of the commands below to start the notebook's model training session.

Uncomment and run the code below to kill the runtime in Google Colaboratory:

In [None]:
# !kill -9 -1


Uncomment and run the code below to install the necessary packages and create the working folders in Colaboratory:

In [None]:
# %tensorflow_version 1.x
# !pip install gpt_2_simple
# !nvidia-smi
# !mkdir -p datasets
# !mkdir -p checkpoint
# !mkdir -p samples


Uncomment and run the code below to download the model in Colaboratory

In [None]:
# import gpt_2_simple as gpt2
# gpt2.download_gpt2(model_name='124M')


Import necessary libraries

In [3]:
# Import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import datetime
import tensorflow as tf
import gpt_2_simple as gpt2  # Used by finetune_model and generate_text


## Text Scraping

This class is used for scraping all the text contained in the article body of a given web page.

In [72]:
class Extract():
    def __init__(self, urls, return_as_file=True):
        self.urls = urls  # List of urls for the dataset
        self.return_as_file = return_as_file  # Export the dataset as a file

    def remove_blank_lines(self, paragraph):
        '''

        Function to remove blank lines from a paragraph.

        @paragraph: A string that may contain blank lines

        return: The paragraph without blank lines

        '''

        lines = paragraph.split('\n')

        non_empty_lines = [line for line in lines if line.strip() != '']

        # Join the remaining lines, ending each one with a newline
        return ''.join(line + '\n' for line in non_empty_lines)

    def extract_from_investopedia(self):
        '''

        Function to extract the text from the list of Investopedia urls.

        return: String containing the text from the urls, or None when the
            text is written to a file in the datasets folder instead

        '''

        # List of elements containing text
        elements = [
            'article'
        ]

        # List of elements to delete
        delete_elements = [
            'header',
            'span',
            'footer'
        ]

        # Initialize string container
        texts = ''

        # Loop over the urls
        for url in self.urls:
            page = urlopen(url).read()

            soup = BeautifulSoup(page, 'html.parser')

            # Delete some elements
            for element in soup(delete_elements):
                element.decompose()

            # Remove the breadcrumb navigation divs
            for div in soup.find_all('div', class_='breadcrumbs'):
                div.decompose()

            list_text_tags = soup.find_all(elements)

            for tag in list_text_tags:
                text = tag.text

                # Remove extra spaces
                text = text.strip()

                # Add the text to the container, one article per line block
                texts += text + '\n'

        # Remove extra spaces
        texts = self.remove_blank_lines(texts)
        texts = texts.strip()

        if self.return_as_file:
            filename = 'dataset_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())  # Export a text file
            path = r'./datasets/' + filename

            # Write the text file in the datasets folder (UTF-8 keeps non-ASCII characters intact)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(texts)

        else:
            return texts
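
The blank-line cleanup used by the class can be tried on its own, without any network access. The snippet below is a minimal sketch: `strip_blank_lines` is a hypothetical standalone name (not part of the `Extract` class), and it assumes the same behaviour as `remove_blank_lines`, namely that every kept line ends with a newline.

```python
def strip_blank_lines(paragraph):
    """Drop empty and whitespace-only lines; end each kept line with a newline."""
    kept = [line for line in paragraph.split('\n') if line.strip()]
    return ''.join(line + '\n' for line in kept)


# Hypothetical sample text with empty and whitespace-only lines
sample = 'First line\n\n   \nSecond line\n\n'
cleaned = strip_blank_lines(sample)
print(repr(cleaned))  # 'First line\nSecond line\n'
```

Because each kept line ends with a newline, the class still calls `strip()` on the final text to drop the last one.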


Testing the text extraction function

In [73]:
# Function test
extract = Extract(['https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp'], return_as_file=False)
test_investopedia_texts = extract.extract_from_investopedia()
print(test_investopedia_texts)


Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.
The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal.
When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth. But nothing could be further from the truth.
Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include learning, reasoning, and perception.
As tech

## Finetuning Model
This function is used to finetune the model on the latest dataset.

In [12]:
def finetune_model(dataset):
    '''

    Function to finetune the model and save a checkpoint to the checkpoint folder every 250 steps.

    @dataset: Path to the training data (TXT) with minimum 1024 tokens

    The number of steps, the model name (124M, 355M, etc.), and the learning rate
    are set as constants inside the function rather than passed as arguments.

    return: None

    '''
    # Parameters
    STEPS = 1000
    MODEL_NAME = '124M'
    LEARNING_RATE = 0.0001

    # Clear session graph
    tf.reset_default_graph()

    # Initialize training session
    sess = gpt2.start_tf_sess()

    # Finetune the model
    gpt2.finetune(
        sess,
        dataset=dataset,  # Path to the dataset text file
        steps=STEPS,
        model_name=MODEL_NAME,  # Model name: 124M, 355M, etc.
        model_dir='models',
        combine=50000,
        batch_size=1,
        learning_rate=LEARNING_RATE,  # Learning rate
        accumulate_gradients=5,
        restore_from='latest',  # Start training the model from the latest model
        run_name='trained_model',  # Name of the trained model
        checkpoint_dir='checkpoint',  # Directory to save the model
        sample_every=250,
        sample_length=500,  # Number of tokens generated per sample
        sample_num=1,
        multi_gpu=False,
        save_every=250,
        print_every=10,
        max_checkpoints=1,
        use_memory_saving_gradients=False,
        only_train_transformer_layers=False,
        optimizer='adam',
        overwrite=True  # Overwrite the current model when training
    )


Uncomment and run the code below to test the finetuning model function

In [None]:
# data_file = ''
# finetune_model(data_file)
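
Since the scraping step writes files named `dataset_YYYYMMDD_HHMMSS.txt` into the `datasets` folder, the latest dataset can be picked by filename. The helper below is a sketch under that assumption; `latest_dataset` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import glob
import os


def latest_dataset(folder='datasets'):
    """Return the path of the newest dataset_*.txt file in folder, or None."""
    paths = glob.glob(os.path.join(folder, 'dataset_*.txt'))
    # The timestamps are zero-padded, so the largest filename is the newest
    return max(paths, key=os.path.basename) if paths else None


# Uncomment to finetune on the newest dataset, if one exists
# data_file = latest_dataset()
# if data_file is not None:
#     finetune_model(data_file)
```

Sorting by filename rather than by modification time keeps the choice stable if files are copied between machines.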


## Generating Text
This function is used to generate text from outlines supplied by the user.

In [None]:
def generate_text(outline_to_length):
    '''

    Function to generate the text.

    @outline_to_length: A 2D array pairing each outline with the desired length
        [[outline, length],
        [outline, length],
        [outline, length]]

    return: String containing the generated paragraphs separated by blank lines

    '''

    # Clear session graph
    tf.reset_default_graph()

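    # Initialize TensorFlow session and load the finetuned checkpoint
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name='trained_model', checkpoint_dir='checkpoint')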

    # Create an empty list to store the generated paragraphs
    essay = []

    # Loop over the list
    for record in outline_to_length:
        prefix = record[0]  # The first sentence of the paragraph
        length = record[1]  # The length of the paragraph (max: 1023)

        text = gpt2.generate(
            sess,
            run_name='trained_model',
            checkpoint_dir='checkpoint',
            model_name=None,
            model_dir='models',
            sample_dir='samples',
            return_as_list=True,  # Return a list of strings
            truncate=None,
            destination_path=None,
            sample_delim='\n' + '=' * 20 + '\n\n',
            prefix=prefix,
            seed=None,
            nsamples=1,  # Number of samples to generate
            batch_size=1,
            length=length,
            temperature=0.7,
            top_k=0,
            top_p=0.0,
            include_prefix=True
        )

        essay += text

        # Add double newline
        essay += ['\n\n']

    return ''.join(essay)


Uncomment and run the code below to test the function

In [None]:
# test_outline_to_length = [[]]
# print(generate_text(test_outline_to_length))
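
The input to `generate_text` is a list of `[outline, length]` pairs, and the length is capped at 1023. The sketch below shows one way to build and check that structure before calling the function; `build_outline` and `MAX_LENGTH` are hypothetical names introduced here for illustration, and the example sentences are made up.

```python
MAX_LENGTH = 1023  # Maximum paragraph length noted in generate_text


def build_outline(pairs):
    """Validate (outline, length) pairs and return them as a list of lists."""
    outline_to_length = []
    for outline, length in pairs:
        if not outline.strip():
            raise ValueError('Each outline needs a non-empty first sentence.')
        if not 1 <= length <= MAX_LENGTH:
            raise ValueError('Length must be between 1 and %d.' % MAX_LENGTH)
        outline_to_length.append([outline, length])
    return outline_to_length


# Hypothetical outlines: first sentence of each paragraph and its length
example = build_outline([
    ('Artificial intelligence is changing finance.', 200),
    ('Regulators are responding slowly.', 150),
])
print(example)
# print(generate_text(example))
```

Checking the pairs up front means a bad entry fails before the TensorFlow session is started.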


## Paraphrasing Text
This function is used to paraphrase the sentences to avoid direct plagiarism.

## Grammar Check
This function is used to check and correct any grammatical errors in the paragraph.