# Model Functions

This notebook contains the functions needed by the production model for the web application.

## Start Notebook
Run all of the commands below to start the notebook's model training session.

Uncomment and run the code below to kill the runtime in Google Colaboratory

In [None]:
# !kill -9 -1


Uncomment and run the code below to install the necessary packages and create the working folders in Colaboratory

In [None]:
# %tensorflow_version 1.x
# !pip install gpt_2_simple
# !nvidia-smi
# !mkdir -p datasets
# !mkdir -p checkpoint
# !mkdir -p samples


Uncomment and run the code below to download the model in Colaboratory

In [None]:
# import gpt_2_simple as gpt2
# gpt2.download_gpt2(model_name='124M')


Import necessary libraries

In [3]:
# Import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import datetime
import gpt_2_simple as gpt2
import tensorflow as tf


## Text Scraping

This function scrapes all the text contained in one or more web pages.

In [8]:
def extract_text(urls, export_as_file=True):
    '''

    Function to extract text from a website.

    @urls: The list of website url
    @export_as_file: Boolean to export the text result as a file

    return: The extracted text as a single string, or None when the text is written to a file

    '''


    def remove_blank_lines(paragraph):
        '''

        Function to remove extra blank lines in a paragraph.

        @paragraph: A string that may contain blank lines

        return: The paragraph without blank lines

        '''

        lines = paragraph.split('\n')

        non_empty_lines = [line for line in lines if line.strip() != '']

        string_without_empty_lines = ''
        for line in non_empty_lines:
            string_without_empty_lines += line + '\n'

        return string_without_empty_lines


    def clean_table_data(soup):
        '''

        Function to clean the table.

        @soup: HTML page object from BeautifulSoup

        return: Clean string containing table data

        '''

        table_elements = [
            'table',
            'thead',
            'tbody',
            'tfoot',
            'tr',
            'th',
            'td'
        ]

        table_data = soup.find_all(table_elements, string=True)

        string_table_data = ''
        for data in table_data:
            string_table_data += data.get_text() + ' '

        return string_table_data


    def delete_elements(soup, elements):
        '''

        Function to delete some elements.

        @soup: HTML page object from BeautifulSoup
        @elements: List of tags to delete

        return: BeautifulSoup object without deleted elements

        '''

        for element in soup(elements):
            element.decompose()

        return soup

    # List of elements to remove
    ELEMENTS = [
        'head',
        'script',
        'style',
        'header',
        'nav',
        'div',
        'table',
        'form',
        'input',
        'button',
        'footer'
    ]

    # Initialize strings
    texts = ''

    # Loop over the list of urls
    for url in urls:
        page = urlopen(url).read()

        soup = BeautifulSoup(page, 'html.parser')

        # Collect the table text, then remove the unwanted elements
        # (table_text is not currently added to the output)
        table_text = clean_table_data(soup)
        soup = delete_elements(soup, ELEMENTS)

        # Fetch the text from the soup
        text = soup.get_text()

        # Clean the text
        text = text.strip()
        text = remove_blank_lines(text)

        # Append the text
        texts += text

    # Export the text to a timestamped file
    if export_as_file:
        filename = 'dataset_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())
        path = r'./datasets/' + filename
        
        # Write the text file in the datasets folder
        with open(path, 'w') as f:
            f.write(texts)

    else:
        return texts


Testing the text extraction function

In [9]:
# Function test
url = ['https://simple.wikipedia.org/wiki/Zeus']
print(extract_text(url, export_as_file=False))


Zeus
From Wikipedia, the free encyclopedia
Jump to navigation
Jump to search
Zeus is the god of the sky, lightning and the thunder in Ancient Greek religion and legends, and ruler of all the gods on Mount Olympus. Zeus is the sixth child of Cronos and Rhea, king and queen of the Titans. His father, Cronos, swallowed his children as soon as they were born for fear of a prophecy which foretold that one of them would overthrow him. When Zeus was born, Rhea hid him in a cave on Mount Ida in Crete, giving Cronos a stone wrapped in swaddling clothes to swallow instead. When Zeus was older he went to free his brothers and sisters; together with their allies, the Hekatonkheires and the Elder Cyclopes, Zeus and his siblings fought against the Titans in a ten-year war known as the Titanomachy. At the end of the war, Zeus took Cronos' scythe and cut him into pieces, throwing his remains into Tartarus. He then became the king of gods. 
The supreme deity of the Greek pantheon, Zeus was universally 
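
The blank-line cleanup inside `extract_text` can be tried on its own, without fetching a page. The following is a minimal sketch, assuming a standalone helper (the name `strip_blank_lines` is hypothetical and not part of the notebook) that mirrors the behaviour of `remove_blank_lines` using `str.join`:

```python
def strip_blank_lines(text):
    """Drop empty or whitespace-only lines; end every kept line with a newline."""
    # Keep only the lines that contain something other than whitespace
    kept = [line for line in text.split('\n') if line.strip() != '']
    # Re-attach a trailing newline to each kept line, as remove_blank_lines does
    return ''.join(line + '\n' for line in kept)


# Illustrative input with an empty line and a whitespace-only line
sample = 'Zeus\n\n   \nFrom Wikipedia, the free encyclopedia\n'
print(strip_blank_lines(sample))
```

Building the result with `join` avoids repeated string concatenation in a loop, which may matter when a page contains many lines.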

## Finetuning Model
This function finetunes the model on the latest dataset.

In [12]:
def finetune_model(dataset):
    '''

    Function to finetune the model and save the trained model at every checkpoint in the checkpoint folder.

    @dataset: Path to the training data (TXT) with minimum 1024 tokens

    The number of steps, the model name (124M, 355M, etc.), and the learning
    rate are set by the constants at the top of the function body.

    return: None

    '''
    # Parameters
    STEPS = 1000
    MODEL_NAME = '124M'
    LEARNING_RATE = 0.0001

    # Clear session graph
    tf.reset_default_graph()

    # Initialize training session
    sess = gpt2.start_tf_sess()

    # Finetune the model
    gpt2.finetune(
        sess,
        dataset=dataset,  # Path to the dataset text file
        steps=STEPS,
        model_name=MODEL_NAME,  # Model name: 124M, 355M, etc.
        model_dir='models',
        combine=50000,
        batch_size=1,
        learning_rate=LEARNING_RATE,  # Learning rate
        accumulate_gradients=5,
        restore_from='latest',  # Start training the model from the latest model
        run_name='trained_model',  # Name of the trained model
        checkpoint_dir='checkpoint',  # Directory to save the model
        sample_every=250,
        sample_length=500,  # Number of tokens generated per sample
        sample_num=1,
        multi_gpu=False,
        save_every=250,
        print_every=10,
        max_checkpoints=1,
        use_memory_saving_gradients=False,
        only_train_transformer_layers=False,
        optimizer='adam',
        overwrite=True  # Overwrite the current model when training
    )


Uncomment and run the code below to test the finetuning model function

In [None]:
# data_file = ''
# finetune_model(data_file)
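
Because `extract_text` writes files named `dataset_YYYYMMDD_HHMMSS.txt`, the newest dataset can be found by sorting the file names. The following is a minimal sketch, assuming the files live in the `datasets` folder; the helper name `latest_dataset` is hypothetical and not part of the notebook:

```python
from pathlib import Path


def latest_dataset(folder='datasets'):
    """Return the path of the newest dataset_*.txt file in folder, or None."""
    # The YYYYMMDD_HHMMSS timestamp sorts chronologically as plain text,
    # so the last name in sorted order is the most recent file
    files = sorted(Path(folder).glob('dataset_*.txt'), key=lambda p: p.name)
    return str(files[-1]) if files else None


# Hypothetical usage: finetune only when a dataset file exists
# data_file = latest_dataset()
# if data_file is not None:
#     finetune_model(data_file)
```

Sorting by name rather than by modification time keeps the choice stable if files are copied or restored, at the cost of relying on the naming scheme.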


## Generating Text
This function generates text from input provided by the user.

In [None]:
def generate_text(outline_to_length):
    '''

    Function to generate the text.

    @outline_to_length: A 2D array containing the list of outline and the length desired
        [[outline, length],
        [outline, length],
        [outline, length]]

    return: The generated paragraphs joined into a single string

    '''

    # Clear session graph
    tf.reset_default_graph()

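    # Initialize TensorFlow session
    sess = gpt2.start_tf_sess()

    # Load the finetuned weights into the new session before generating
    gpt2.load_gpt2(sess, run_name='trained_model', checkpoint_dir='checkpoint')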

    # Create an empty list to store the generated paragraphs
    essay = []

    # Loop over the list
    for record in outline_to_length:
        prefix = record[0]  # The first sentence of the paragraph
        length = record[1]  # The length of the paragraph (max: 1023)

        text = gpt2.generate(
            sess,
            run_name='trained_model',
            checkpoint_dir='checkpoint',
            model_name=None,
            model_dir='models',
            sample_dir='samples',
            return_as_list=True,  # Return as list of string
            truncate=None,
            destination_path=None,
            sample_delim='\n' + '=' * 20 + '\n\n',
            prefix=prefix,
            seed=None,
            nsamples=1,  # Number of samples to generate
            batch_size=1,
            length=length,
            temperature=0.7,
            top_k=0,
            top_p=0.0,
            include_prefix=True
        )

        essay += text

        # Add double newline
        essay += ['\n\n']

    return ''.join(essay)


Uncomment and run the code below to test the function

In [None]:
# test_outline_to_length = [[]]
# print(generate_text(test_outline_to_length))
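
`generate_text` reads `record[0]` and `record[1]` from every entry, so an empty or malformed record fails only once generation has started. The following is a minimal sketch of an input check that could run first; the helper `check_outline` and the constant `MAX_LENGTH` are hypothetical names, and the upper bound of 1023 is taken from the comment in `generate_text`:

```python
MAX_LENGTH = 1023  # Upper bound noted in the generate_text comments


def check_outline(outline_to_length):
    """Raise ValueError unless every record is [non-empty str, int in 1..1023]."""
    for index, record in enumerate(outline_to_length):
        # Each record must hold exactly an outline and a length
        if len(record) != 2:
            raise ValueError(f'Record {index} must be [outline, length]')
        prefix, length = record
        if not isinstance(prefix, str) or not prefix.strip():
            raise ValueError(f'Record {index}: outline must be a non-empty string')
        if not isinstance(length, int) or not 1 <= length <= MAX_LENGTH:
            raise ValueError(f'Record {index}: length must be between 1 and {MAX_LENGTH}')
    return outline_to_length


# Illustrative input: one outline sentence with a desired length of 100 tokens
valid = check_outline([['Zeus is the god of the sky.', 100]])
print(valid)
```

Validating up front means a bad record is reported before the TensorFlow session is created.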


## Paraphrasing Text
This function is used to paraphrase the sentences to avoid direct plagiarism.

## Grammar Check
This function is used to check and correct any grammatical errors in the paragraph.