# Project Gutenberg Web Scraping
In this notebook, we scrape Project Gutenberg and Google to obtain the content and other details of political philosophy texts. These will be the building blocks of an open-source database for Natural Language Processing (NLP), intended for anyone interested in the intersection of data science and political thought.


## Table of Contents
1. Environment set-up
    * Importing Libraries
2. Books Web Scraping
    * Variable Definition
    * Info Details Extraction
    * Text Content Extraction
3. Data Cleaning
    * Null Values
    * Duplicates

### 1. Environment set-up

In [170]:
# importing libraries

# Web scraping tools
from urllib import request
from bs4 import BeautifulSoup
import requests
import json

# Dataframe manipulation
import pandas as pd
import numpy as np

# Text manipulation
import re

### 2. Books Web Scraping

### Note: 
As a start, we define a list of authors' names to search for on Project Gutenberg. We compiled that list by researching influential thinkers in the political philosophy tradition. As one may note, it is heavily biased towards Western thought, but as we progress, we will broaden our perspective with texts from across the world. We eventually hope to build a comprehensive library for Natural Language Processing purposes.

In [194]:
# Missing a few authors:
# Hannah Arendt, Étienne de La Boétie, Carl Schmitt,
# Simone de Beauvoir, Jean-Paul Sartre, Michel Foucault
# Jacques Derrida, Gilles Deleuze, Jean Baudrillard,
# W. E. B. Du Bois, Aimé Césaire, Léopold Senghor
author_names = [
    'Plato','Aristotle','Rousseau, Jean-Jacques', 'Hegel, Georg Wilhelm Friedrich',
    'Hume, David', 'Locke, John', 'Machiavelli, Niccolò','Mill, John Stuart',
    'Kant, Immanuel','Nietzsche, Friedrich Wilhelm', 'Hobbes, Thomas',
    'Montesquieu, Charles de Secondat, baron de','Russell, Bertrand',
    'Burke, Edmund', 'Priestley', 'Spencer, Herbert', 'Comte, Auguste',
    'Epictetus','Bodin, Jean','Godwin, William','Harrington, James',
    'Jellinek, Georg', 'Lieber, Francis', 'Proudhon, P.-J. (Pierre-Joseph)',
    'Labriola, Antonio', 'Dante Alighieri', 'Jefferson, Thomas', 'Adams, John',
    'Croce, Benedetto', 'Marx, Karl', 'Montaigne, Michel de', 
    'Sunzi, active 6th century B.C.', 'Anarchism Emma Goldman',
    'A Vindication of the Rights of Woman Wollstonecraft, Mary',
    'Bakunin, Mikhail Aleksandrovich', 'Kropotkin',
]

### Note: 
As a first step in scraping Project Gutenberg, we collect details on each book, such as its title, topics/themes, and language. These details help contextualize a book before we have its text content.

In [205]:
# TODO: clean it up & make it a class object
def book_details_extraction(names):
    '''Query Gutendex for each name and return the book details as a DataFrame.'''
    books_formatted = []
    
    for name in names:
        # Let requests URL-encode the search term (spaces, commas, accents)
        req = requests.get('http://gutendex.com/books', params={'search': name})
        # Stop early on HTTP errors instead of failing later on a bad payload
        req.raise_for_status()
        books = req.json()['results']
        
        for book in books:
            # Keep the first listed author, or an empty value if there is none
            book_authors = book['authors'][0] if book['authors'] else ''
            # Keep only links to plain-text (.txt) versions of the book
            text_urls = book['formats']
            res = [val for key, val in text_urls.items() if 'text/plain' in key 
                                                        and '.txt' in val]
            topics = book['subjects']+book['bookshelves']
            # Keep the first listed language, or an empty string if there is none
            lang = book['languages'][0] if book['languages'] else ''
            book.update({'name':book_authors})
            book.update({'topics':topics})
            book.update({'language':lang})
            if len(res) == 0:
                book.update({'text_url':''})
            else:
                book.update({'text_url':res[0]})  

            keys_to_remove = ['id', 'authors','translators', 'subjects',
                          'bookshelves', 'languages','copyright', 
                          'media_type','formats','download_count']
            for key in keys_to_remove:
                # The default avoids a KeyError when a key is absent
                book.pop(key, None)
            
            books_formatted.append(book)
    
    return pd.json_normalize(books_formatted)

In [None]:
book_details_extraction(author_names)
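
The search above reads only the first page of each response. As a minimal sketch, assuming the response also carries a `next` field holding the URL of the following page (or `None` on the last one), the helpers below build an encoded search URL and walk through the pages. The names `build_search_url`, `collect_results`, `fetch_json` and `max_pages` are hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

# Same endpoint as the one used in book_details_extraction
GUTENDEX_SEARCH = 'http://gutendex.com/books'

def build_search_url(name):
    # urlencode escapes spaces, commas and accented characters
    return GUTENDEX_SEARCH + '?' + urlencode({'search': name})

def collect_results(first_url, fetch_json, max_pages=10):
    '''Gather the `results` of every page by following `next` links.

    `fetch_json` is any callable mapping a URL to a parsed JSON dict
    (assumption: each page has a `results` list and a `next` URL or None).
    '''
    results, url, pages = [], first_url, 0
    while url and pages < max_pages:
        page = fetch_json(url)
        results.extend(page.get('results', []))
        url = page.get('next')
        pages += 1
    return results
```

With `requests`, one possible call is `collect_results(build_search_url('Locke, John'), lambda url: requests.get(url).json())`. Passing the fetcher in as an argument keeps the paging logic easy to check without network access.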

### Note: 
In this section, we extract the texts in three main formats: the full body, the individual paragraphs, and the sentences. This gives more flexibility for NLP projects and the various forms they may take.

In [100]:
# Putting everything into a Class with different methods
class GutenbergTextRetrieveal():
    ''' 
    A class extracting the details and content of texts 
    pulled from Project Gutenberg
    '''
    def __init__(self, links):
        self.links = links
    
    def text_scraper(self):
        self.texts = []

        content_begs = ['*** START OF', '***START OF']
        content_ends = ['*** END OF', '***END OF']
        for link in self.links:
            # Details on the text
            req = requests.get(link)
            soup = BeautifulSoup(req.content, "html.parser")
            raw_text = soup.text
            
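            # Locate the start marker; default to the beginning of the file
            idx0, content_beg = 0, ''
            for beg in content_begs:
                if raw_text.find(beg) != -1:
                    idx0, content_beg = raw_text.find(beg), beg
                    break
                    
            # Locate the end marker; default to the end of the file
            idx1 = len(raw_text)
            for end in content_ends:
                if raw_text.find(end) != -1:
                    idx1 = raw_text.find(end)
                    break
                    
            # Drop the marker, URLs, and runs of non-word characters
            text_body = raw_text[idx0:idx1].replace(content_beg, '')
            text_body = re.sub(r'http\S+', '', text_body)
            text_body = re.sub(r'\W+', ' ', text_body)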
            
            self.texts.append(text_body)
        return self.texts

# Example usage
# links = list(hume_texts['text_url'])
# ret = GutenbergTextRetrieveal(links)
# texts = ret.text_scraper()
# hume_texts['text_content'] = texts
# hume_texts

In [22]:
## Think about the inclusion of paragraphs & sentences delimitations
# req = requests.get('https://www.gutenberg.org/files/4320/4320-h/4320-h.htm')
# soup = BeautifulSoup(req.content, "html.parser")
# paragraph=soup.find_all("p")
# for para in paragraph:
#     print(para.text)
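
One possible way to get paragraphs and sentences from the plain-text files, rather than from the HTML `<p>` tags, is sketched below. It assumes paragraphs are separated by blank lines and uses a naive punctuation rule for sentences, so abbreviations such as "Mr." will be split incorrectly. The function names are hypothetical. Note that `text_scraper` collapses every run of non-word characters into a space, so a split like this would have to run on the raw text before that step.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines
    blocks = re.split(r'\n\s*\n', text.replace('\r\n', '\n'))
    # Collapse the hard line wraps inside each paragraph
    return [' '.join(block.split()) for block in blocks if block.strip()]

def split_sentences(paragraph):
    # Naive rule: break after ., ! or ? when followed by whitespace
    # and an uppercase letter or an opening quote
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"\'])', paragraph)
    return [part for part in parts if part]

sample = ("First line\r\nof one paragraph. It has two sentences.\r\n\r\n"
          "Second paragraph here.")
paragraphs = split_paragraphs(sample)
sentences = split_sentences(paragraphs[0])
```

A dedicated sentence tokenizer would likely handle edge cases better; the regular expression is only meant to show where the split fits in the pipeline.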

In [None]:
## Think about duplicates
## Consider multiple authors
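
As a starting point for both notes, here is a minimal sketch that works on the raw book records, before `book_details_extraction` removes the `id` and `authors` keys. It assumes each record has a unique `id` and an `authors` list of dicts with a `name` field, as the code above relies on. The helper names `dedupe_books` and `author_names_of` are hypothetical.

```python
def dedupe_books(books):
    '''Keep the first record seen for each book id.'''
    seen, unique = set(), []
    for book in books:
        if book['id'] not in seen:
            seen.add(book['id'])
            unique.append(book)
    return unique

def author_names_of(book):
    # Join every listed author instead of keeping only the first one
    return '; '.join(author['name'] for author in book.get('authors', []))
```

Duplicates are likely because the same book can match several search terms, for example a work returned under both an author's name and a title-based search.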