# Model Functions

This notebook contains the functions needed by the production model for the web application.

## Text Scraping

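This function scrapes all the text contained in a given web page. It removes non-content elements such as navigation, scripts, and forms, collects table text separately, and wraps the result in `<|startoftext|>` and `<|endoftext|>` markers.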
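As a minimal offline sketch of the core idea (drop non-content tags, then keep only the non-empty lines), the snippet below runs on a hypothetical inline HTML string instead of a live URL. The sample page and the shortened tag list are assumptions made for illustration. Unlike the full function, it also strips leading and trailing whitespace from each line.

```python
from bs4 import BeautifulSoup

# Hypothetical inline page, used here instead of fetching a live URL
html = """
<html>
<head><title>Demo</title></head>
<body>
<nav>Menu</nav>
<script>var x = 1;</script>
<p>First paragraph.</p>

<p>Second paragraph.</p>
<footer>Contact</footer>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Remove elements that rarely carry article text
for element in soup(['head', 'script', 'nav', 'footer']):
    element.decompose()

# Keep only the non-empty lines, stripped of surrounding whitespace
lines = [line.strip() for line in soup.get_text().split('\n') if line.strip()]
clean_text = '\n'.join(lines)

print(clean_text)
```

Only the two paragraphs survive, because every other element in the sample page is in the list of tags to delete.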

In [104]:
# Import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
from re import sub


def extract_text(url, export_as_file=True):
    '''

    Function to extract text from a website.

    @url: The website url
    @export_as_file: Boolean to export the text result as a file

    return: String containing the text (None when export_as_file is True, because file export is not implemented yet)

    '''


    def remove_blank_lines(paragraph):
        '''

        Function to remove blank lines from a paragraph.

        @paragraph: A string that may contain blank lines

        return: The same string without blank lines

        '''

        lines = paragraph.split('\n')

        non_empty_lines = [line for line in lines if line.strip() != '']

        # Re-join the remaining lines, each terminated by a newline
        return ''.join(line + '\n' for line in non_empty_lines)


    def clean_table_data(soup):
        '''

        Function to collect the text of the table elements on the page.

        @soup: HTML page object from BeautifulSoup

        return: Clean string containing table data

        '''

        table_elements = [
            'table',
            'thead',
            'tbody',
            'tfoot',
            'tr',
            'th',
            'td'
        ]

        table_data = soup.find_all(table_elements, string=True)

        string_table_data = ''
        for data in table_data:
            string_table_data += data.get_text() + ' '

        return string_table_data


    def delete_elements(soup, elements):
        '''

        Function to delete some elements.

        @soup: HTML page object from BeautifulSoup
        @elements: List of tags to delete

        return: BeautifulSoup object without deleted elements

        '''

        for element in soup(elements):
            element.decompose()

        return soup

    page = urlopen(url).read()

    soup = BeautifulSoup(page, 'html.parser')

    # Clean table
    table_text = clean_table_data(soup)

    # Delete some elements
    elements = [
        'head',
        'script',
        'style',
        'header',
        'nav',
        'table',
        'form',
        'input',
        'footer'
    ]

    soup = delete_elements(soup, elements)

    # Fetch the text from the soup
    text = soup.get_text()

    # Clean the text
    text = text.strip()
    text = remove_blank_lines(text)

    # Combine the text with the text from table
    text = '<|startoftext|>' + text + table_text + '<|endoftext|>'

    if export_as_file:
        # NOTE: file export is not implemented yet, so the function returns None here
        pass
    else:
        return text
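
The `export_as_file` branch above is a placeholder. One possible way to fill it in is sketched here, assuming a hypothetical helper `save_text` and a hypothetical default output path `scraped_text.txt`. Neither name appears in the original notebook.

```python
from pathlib import Path


def save_text(text, path='scraped_text.txt'):
    '''
    Hypothetical helper: write the scraped text to a UTF-8 file.

    @text: The string returned by the scraping step
    @path: Output file path (assumed default, not from the notebook)

    return: Path object pointing to the written file
    '''
    output_path = Path(path)
    output_path.write_text(text, encoding='utf-8')
    return output_path
```

With such a helper, the `pass` statement could become `return save_text(text)`, so that both branches return something the caller can use.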

In [105]:
# Function test
url = 'https://simple.wikipedia.org/wiki/Zeus'
print(extract_text(url, export_as_file=False))

<|startoftext|>Zeus
From Wikipedia, the free encyclopedia
Jump to navigation
Jump to search
Zeus is the god of the sky, lightning and the thunder in Ancient Greek religion and legends, and ruler of all the gods on Mount Olympus. Zeus is the sixth child of Cronos and Rhea, king and queen of the Titans. His father, Cronos, swallowed his children as soon as they were born for fear of a prophecy which foretold that one of them would overthrow him. When Zeus was born, Rhea hid him in a cave on Mount Ida in Crete, giving Cronos a stone wrapped in swaddling clothes to swallow instead. When Zeus was older he went to free his brothers and sisters; together with their allies, the Hekatonkheires and the Elder Cyclopes, Zeus and his siblings fought against the Titans in a ten-year war known as the Titanomachy. At the end of the war, Zeus took Cronos' scythe and cut him into pieces, throwing his remains into Tartarus. He then became the king of gods. 
The supreme deity of the Greek pantheon, Zeus w