# Dataset Class

This notebook contains the functions needed to scrape the dataset from various sources on the internet and export it as a file.

## Start Notebook
Run all of the cells below to start the notebook session.

Uncomment and run the code below to mount Google Drive and create the project folders in Colaboratory.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')


In [None]:
# !mkdir -p drive/My\ Drive/Project\ Writer/datasets
# !mkdir -p drive/My\ Drive/Project\ Writer/samples
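
The class defined later in this notebook writes to `./datasets/` by default, and opening a file for writing fails if that folder does not exist. As a minimal sketch, assuming a local run outside Colaboratory, the same folders can be created from Python. The helper name `ensure_dirs` and its arguments are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def ensure_dirs(base, names=("datasets", "samples")):
    """Create each sub-directory under `base` if it is missing and return the paths."""
    base = Path(base)
    created = []
    for name in names:
        path = base / name
        # parents=True and exist_ok=True mirror the behaviour of `mkdir -p`
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_dirs('.')` once before running the extraction cells would make the default `./datasets/` path available.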


## Text Scraping

This class scrapes the text contained in a given website and exports it as a file.

In [86]:
# Import libraries at module level so that the methods below can use them
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import datetime
import wikipedia


class Extract():
    '''

    This class is used to extract text resources from the web and export them as a file.

    @return_as_file: Boolean to define whether the resources will be exported as a file.
    
    Methods:
    extract_from_investopedia
    extract_from_wikipedia

    '''

    def __init__(self, return_as_file=True):
        self.return_as_file = return_as_file  # Export the dataset as a file

    def remove_blank_lines(self, paragraph):
        '''

        Function to remove blank lines from a string.

        @paragraph: A string that may contain blank lines

        return: The string without blank lines, each remaining line ending with a newline

        '''

        lines = paragraph.split('\n')

        non_empty_lines = [line for line in lines if line.strip() != '']

        string_without_empty_lines = ''
        for line in non_empty_lines:
            string_without_empty_lines += line + '\n'

        return string_without_empty_lines

    def extract_from_investopedia(self, urls, dataset_dir=None):
        '''

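        Function to extract the text from a list of Investopedia urls.

        @urls: List of Investopedia urls
        @dataset_dir: Path to the dataset directory, with a trailing slash: ./datasets/path/

        return: String containing text from urls if return_as_file is False; otherwise the text is written to a file and None is returned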

        '''

        # List of elements containing text
        elements = [
            'article'
        ]

        # List of elements to delete
        delete_elements = [
            'header',
            'span',
            'footer'
        ]

        # Initialize string container
        texts = ''

        # Loop over the urls
        for url in urls:
            page = urlopen(url).read()

            soup = BeautifulSoup(page, 'html.parser')

            # Delete some elements
            for element in soup(delete_elements):
                element.decompose()

            # Remove useless div
            for div in soup.find_all('div', class_='breadcrumbs'):
                div.decompose()

            list_text_tags = soup.find_all(elements)

            for tag in list_text_tags:
                text = tag.text

                # Remove extra spaces
                text = text.strip()

                # Add the text to the container, keeping each page on its own lines
                texts += text + '\n'

        # Remove extra spaces
        texts = self.remove_blank_lines(texts)
        texts = texts.strip()

        if self.return_as_file:

            filename = 'investopedia_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())  # Export a text file

            if dataset_dir:
                path = dataset_dir + filename

            else:
                path = r'./datasets/' + filename

            # Write the text file in the datasets folder (UTF-8 so non-ASCII text is preserved)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(texts)

        else:
            return texts

    def extract_from_wikipedia(self, titles, dataset_dir=None):
        '''

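        Function to extract text from Wikipedia.

        @titles: List of Wikipedia article titles
        @dataset_dir: Path to the dataset directory, with a trailing slash: ./datasets/path/

        return: String containing the text of the articles if return_as_file is False; otherwise the text is written to a file and None is returned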

        '''

        # Initialize the container
        texts = ''

        for title in titles:
            # Get the wikipedia page
            page = wikipedia.page(title)

            # Extract the text
            text = page.content

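            # Remove section headings such as '== History =='
            text = re.sub(r'==.*?==+', '', text)

            # Add the text to the container, keeping each article on its own lines
            texts += text + '\n'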
        
        texts = self.remove_blank_lines(texts)

        if self.return_as_file:
            filename = 'wikipedia_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())  # Export a text file

            if dataset_dir:
                path = dataset_dir + filename

            else:
                path = r'./datasets/' + filename

            # Write the text file in the datasets folder (UTF-8 so non-ASCII text is preserved)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(texts)

        else:
            return texts
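
The two cleaning steps used by the class, stripping Wikipedia `== Heading ==` markers and dropping blank lines, can be tried without network access. The following is a minimal sketch that reuses the same regular expression and the same blank-line rule as standalone functions. The name `strip_wiki_headings` and the `sample` string are assumptions made up for illustration.

```python
import re


def remove_blank_lines(text):
    """Drop empty or whitespace-only lines; every kept line ends with a newline."""
    return ''.join(line + '\n' for line in text.split('\n') if line.strip() != '')


def strip_wiki_headings(text):
    """Remove heading markers such as '== History ==' or '=== Early work ==='."""
    return re.sub(r'==.*?==+', '', text)


# Hypothetical sample shaped like the plain-text content of a Wikipedia page
sample = (
    "Intro line.\n\n\n"
    "== History ==\n"
    "First paragraph.\n\n"
    "=== Early work ===\n"
    "Second paragraph.\n"
)

# Headings are removed first, which leaves empty lines for the second step to drop
cleaned = remove_blank_lines(strip_wiki_headings(sample))
print(cleaned)
```

Removing the headings before the blank lines matters here, because each stripped heading leaves an empty line behind that the second step then cleans up.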


Testing the text extraction function

In [89]:
# extract = Extract(return_as_file=False)

# test_investopedia_texts = extract.extract_from_investopedia(['https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp'])

# test_wikipedia_texts = extract.extract_from_wikipedia(['Artificial Intelligence'])

# print(test_wikipedia_texts)


ast, the rare loyal robots such as Gort from The Day the Earth Stood Still (1951) and Bishop from Aliens (1986) are less prominent in popular culture.Isaac Asimov introduced the Three Laws of Robotics in many books and stories, most notably the "Multivac" series about a super-intelligent computer of the same name. Asimov's laws are often brought up during lay discussions of machine ethics; while almost all artificial intelligence researchers are familiar with Asimov's laws through popular culture, they generally consider the laws useless for many reasons, one of which is their ambiguity.Transhumanism (the merging of humans and machines) is explored in the manga Ghost in the Shell and the science-fiction series Dune. In the 1980s, artist Hajime Sorayama's Sexy Robots series were painted and published in Japan depicting the actual organic human form with lifelike muscular metallic skins and later "the Gynoids" book followed that was used by or influenced movie makers including George Luc