# Project Gutenberg Web Scraping
In this notebook, we scrape Project Gutenberg and Google to obtain the content and other information about political philosophy texts. These will be the building blocks of an open-source database for Natural Language Processing (NLP), available to anyone interested in the intersection of data science and political thought.


## Table of Contents
1. Environment set-up
2. Variable Definition
3. Project Gutenberg: Text Data Retrieval 
4. Google Search API: Date Retrieval

### 1. Environment set-up

In [30]:
# importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

### 2. Variable Definition

In [31]:
# Base URL
base_url = 'https://www.gutenberg.org/files'
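Because `base_url` has no trailing slash and the relative links used later have no leading slash, plain string concatenation would produce a malformed address. A minimal sketch of a safer join, assuming a hypothetical helper named `build_url` that is not part of the original notebook:

```python
base_url = 'https://www.gutenberg.org/files'

def build_url(base, link):
    """Join a base URL and a relative path with exactly one slash.

    Hypothetical helper for illustration only.
    """
    return f"{base.rstrip('/')}/{link.lstrip('/')}"

# The relative link is the one used later in this notebook
print(build_url(base_url, '4320/4320-h/4320-h.htm'))
```

Stripping slashes on both sides means the helper gives the same result whether or not the caller includes them.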

In [32]:
# Isolating the actual text from the Project Gutenberg license statements
text_beg = 'START OF THIS PROJECT GUTENBERG EBOOK'
text_end = 'End of the Project Gutenberg EBook'

# Delimiters to extract info from text
title_del = '\r\nTitle: '
author_del0 = '\r\n\r\nAuthor: '
author_del1 = '\r\n\r\nRelease Date: '
lang_del0 = '\r\n\r\nLanguage: '
lang_del1 = '\r\n\r\nCharacter set encoding: '
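The delimiters above mark where each metadata field starts and where the next one begins, so a field can be recovered by slicing between two of them. The sketch below illustrates the idea on a made-up header string. The helper `extract_field` and the `sample_header` value are assumptions for illustration, and real Project Gutenberg headers may differ in spacing or field order.

```python
# Delimiters as defined in this notebook
title_del = '\r\nTitle: '
author_del0 = '\r\n\r\nAuthor: '
author_del1 = '\r\n\r\nRelease Date: '
lang_del0 = '\r\n\r\nLanguage: '
lang_del1 = '\r\n\r\nCharacter set encoding: '

def extract_field(raw_text, start_delim, end_delim):
    """Return the text between two delimiters, or None if either is missing.

    Hypothetical helper, not part of the original notebook.
    """
    start = raw_text.find(start_delim)
    if start == -1:
        return None
    start += len(start_delim)              # skip past the delimiter itself
    end = raw_text.find(end_delim, start)  # search only after the start
    if end == -1:
        return None
    return raw_text[start:end].strip()

# Made-up header; the placeholder values are not real metadata
sample_header = (
    '\r\nTitle: An Enquiry Concerning the Principles of Morals'
    '\r\n\r\nAuthor: David Hume'
    '\r\n\r\nRelease Date: (placeholder)'
    '\r\n\r\nLanguage: English'
    '\r\n\r\nCharacter set encoding: (placeholder)\r\n'
)

title = extract_field(sample_header, title_del, author_del0)
author = extract_field(sample_header, author_del0, author_del1)
language = extract_field(sample_header, lang_del0, lang_del1)
print(title, author, language, sep=' | ')
```

Returning `None` when a delimiter is absent avoids the silent mis-slicing that can occur when `str.find` returns `-1`.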

### 3. Project Gutenberg: Text Data Retrieval 

In [51]:
# Retrieve the title, author, language, and body text of each link into a DataFrame
def text_retrieval(links):
    # Lists hosting text data info
    titles, authors, languages = [],[],[]
    texts = []

    # Looping through the links for data retrieval
    for link in links:
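        # Send HTTP request (base_url has no trailing slash, so add one)
        req = requests.get(f'{base_url}/{link}', timeout=30)
        req.raise_for_status()
        html_content = req.content
        
        # Get the raw text
        raw_text = BeautifulSoup(html_content, "lxml").text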
        
        # Locate the metadata delimiters, then slice out title, author, and language
        delims = ['\r\nTitle: ', '\r\n\r\nAuthor: ', 
                '\r\nRelease Date: ', '\r\nLanguage: ',
                '\r\nCharacter set encoding: ']
        idx = []
        for delim in delims:
            idx.append(raw_text.find(delim))

        title = raw_text[idx[0]:idx[1]]
        title = title.replace('Title: ', '').replace('\r\n', '')
        titles.append(title)
        
        author = raw_text[idx[1]:idx[2]]
        author = author.replace('Author: ', '').replace('\r\n', '')
        authors.append(author)
        
        lang = raw_text[idx[3]:idx[4]]
        lang = lang.replace('Language: ', '').replace('\r\n', '')
        languages.append(lang)
        
        # Get the body of the text (raw_text was already parsed above)
        text_idx0 = raw_text.find(text_beg)
        text_idx1 = raw_text.find(text_end)
        text_body = raw_text[text_idx0:text_idx1]
        texts.append(text_body)

    
    # Creating the dictionary for the dataframe structure
    books_dict = {'title': titles,
                  'author': authors,
                  'language': languages,
                  'text': texts}

    # Creating the dataframe
    df = pd.DataFrame.from_dict(data=books_dict, orient='columns')
       
    return df

links = ['4320/4320-h/4320-h.htm']
text_retrieval(links)

Unnamed: 0,title,author,language,text
0,An Enquiry Concerning the Principles of Morals,David Hume,English,START OF THIS PROJECT GUTENBERG EBOOK PRINCIPL...
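The `text` column above is produced by slicing between the start and end markers, which is why it begins with the start marker itself. One caveat is that `str.find` returns `-1` when a marker is absent, and slicing with `-1` silently yields a truncated or empty string. A minimal sketch of a guarded version, assuming a hypothetical helper `extract_body` and a made-up `sample` string:

```python
# Markers as defined earlier in this notebook
text_beg = 'START OF THIS PROJECT GUTENBERG EBOOK'
text_end = 'End of the Project Gutenberg EBook'

def extract_body(raw_text, begin_marker=text_beg, end_marker=text_end):
    """Return the text from the start marker up to the end marker.

    Returns None when a marker is missing or they appear out of order.
    Hypothetical helper, not part of the original notebook.
    """
    start = raw_text.find(begin_marker)
    end = raw_text.find(end_marker)
    if start == -1 or end == -1 or end <= start:
        return None
    # Like the notebook's slice, this keeps the start marker in the result
    return raw_text[start:end]

# Made-up document used only to exercise the helper
sample = (
    'License preamble *** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\r\n'
    'Body text.\r\n'
    'End of the Project Gutenberg EBook of Sample'
)

body = extract_body(sample)
print(body)
print(extract_body('no markers here'))
```

Rows that come back as `None` can then be inspected separately, since editions may word their markers differently.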


### 4. Google Search API: Date Retrieval