# Data Acquisition for Digital Humanities Projects: Webscraping

The first step in any Digital Humanities project is **Data Acquisition**. Data acquisition simply means **1) locating the data that you want to work with** (where is it, and can I use it?) and **2) actually getting it**. In this notebook we are learning how to do something called **Webscraping**, which, in essence, simply means getting data that is on the internet onto your laptop.

# 1. We import the libraries

In [77]:
from urllib import request
from bs4 import BeautifulSoup

import re

import pandas as pd

# 2. Locating the Data Source

Project Gutenberg (https://www.gutenberg.org/) is an excellent database that contains thousands of free ebooks, most of them in the public domain, ready to be used for Data Mining purposes.

In this notebook, we will be using **Around the World in Eighty Days** by Jules Verne. So:

* Let's go to the Project Gutenberg website: https://www.gutenberg.org/ebooks/103
* And let's use the "Plain Text UTF-8" file (it's easier to work with it!): https://www.gutenberg.org/cache/epub/103/pg103.txt

# 3. We get the data from the Internet

The first thing that we need to do is to get the data from the internet into our Jupyter Notebook. For that, let's use the urllib Python Library (https://docs.python.org/3/library/urllib.html).

In [78]:
from urllib import request #we already imported this in step one: this is just to show you that it is here where we are using it!

In [79]:
url = "https://www.gutenberg.org/cache/epub/103/pg103.txt" #we create a variable with the url of our target book

In [80]:
response = request.urlopen(url)
raw = response.read().decode("utf-8")
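
A slightly more defensive version of the download step might look like the sketch below. The helper name fetch_text and the 30-second timeout are assumptions made for illustration; they are not part of the original notebook. Decoding with "utf-8-sig" instead of "utf-8" drops the byte-order mark (the \ufeff visible at the very start of the raw output). To keep the example runnable offline, it reads a small temporary file through a file:// URL; passing the Project Gutenberg URL instead should work the same way.

```python
import tempfile
from pathlib import Path
from urllib import request


def fetch_text(url, timeout=30):
    """Download a URL and return its content as text (hypothetical helper)."""
    # The context manager closes the connection even if reading fails.
    with request.urlopen(url, timeout=timeout) as response:
        # "utf-8-sig" decodes UTF-8 and removes a leading byte-order mark, if any.
        return response.read().decode("utf-8-sig")


# Offline demonstration: a tiny stand-in file that starts with a byte-order mark.
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.txt"
    sample_path.write_bytes(
        "\ufeffTitle: Around the World in Eighty Days\r\n".encode("utf-8")
    )
    sample_text = fetch_text(sample_path.as_uri())

print(repr(sample_text))
```

The timeout keeps the notebook from hanging forever if the server does not answer.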

Now we have the data from that website stored on our computer. But it is a total mess! We need to transform it into a readable format.

In [81]:
raw

'\ufeffThe Project Gutenberg eBook of Around the World in Eighty Days\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.\r\n\r\nTitle: Around the World in Eighty Days\r\n\r\nAuthor: Jules Verne\r\n\r\nRelease date: January 1, 1994 [eBook #103]\r\n                Most recently updated: October 29, 2024\r\n\r\nLanguage: English\r\n\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK AROUND THE WORLD IN EIGHTY DAYS ***\r\n\r\n[Illustration]\r\n\r\n\r\n\r\n\r\nAround the World in Eighty Days\r\n\r\nby Jules Verne\r\n\r\n\r\nContents\r\n\r\n CHAPTER I. IN WHICH PHILEAS FOGG AND

# 4. We prettify the data

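Now we use another library called Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to turn that data into a more legible version. Because our file is plain text rather than HTML, there are no tags for Beautiful Soup to tidy up here; most of the visible improvement comes from print(), which shows the \r\n sequences as real line breaks.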

In [82]:
from bs4 import BeautifulSoup #we already imported this library in step 1, but it is useful to see that this is where we use it!

In [83]:
soup = BeautifulSoup(raw, "html.parser")

print(soup.prettify()) 

﻿The Project Gutenberg eBook of Around the World in Eighty Days
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Around the World in Eighty Days

Author: Jules Verne

Release date: January 1, 1994 [eBook #103]
                Most recently updated: October 29, 2024

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK AROUND THE WORLD IN EIGHTY DAYS ***

[Illustration]




Around the World in Eighty Days

by Jules Verne


Contents

 CHAPTER I. IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN
 CHAPT

In [84]:
data = soup.prettify()
data

'\ufeffThe Project Gutenberg eBook of Around the World in Eighty Days\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.\r\n\r\nTitle: Around the World in Eighty Days\r\n\r\nAuthor: Jules Verne\r\n\r\nRelease date: January 1, 1994 [eBook #103]\r\n                Most recently updated: October 29, 2024\r\n\r\nLanguage: English\r\n\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK AROUND THE WORLD IN EIGHTY DAYS ***\r\n\r\n[Illustration]\r\n\r\n\r\n\r\n\r\nAround the World in Eighty Days\r\n\r\nby Jules Verne\r\n\r\n\r\nContents\r\n\r\n CHAPTER I. IN WHICH PHILEAS FOGG AND

In [85]:
print(data)

﻿The Project Gutenberg eBook of Around the World in Eighty Days
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Around the World in Eighty Days

Author: Jules Verne

Release date: January 1, 1994 [eBook #103]
                Most recently updated: October 29, 2024

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK AROUND THE WORLD IN EIGHTY DAYS ***

[Illustration]




Around the World in Eighty Days

by Jules Verne


Contents

 CHAPTER I. IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN
 CHAPT

That looks so much better!
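
The difference between the two views comes from how Python shows strings: a cell that ends with a bare variable name displays the repr() of the string, with escape sequences such as \r\n written out, while print() renders them as real line breaks. A minimal sketch with a made-up sample (the variable names are illustrative only):

```python
sample = "Title: Around the World in Eighty Days\r\n\r\nAuthor: Jules Verne\r\n"

# repr() is what a notebook shows when a cell ends with a bare variable name.
print(repr(sample))

# print() interprets the escape sequences, so the text appears on separate lines.
print(sample)

# Optionally, normalise Windows-style line endings (\r\n) to plain \n.
normalised = sample.replace("\r\n", "\n")
lines = normalised.splitlines()
print(lines)
```

Normalising the line endings is optional, but it can make later pattern matching a little simpler.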

# 5. We select the data

Now let's get only the actual text of the book. We need to get rid of the header and of the end matter to get to "the beef" (the book itself). Let's use the **Regular Expressions Python library** (https://docs.python.org/3/library/re.html) to do that. Specifically, let's use the re.search function to find the positions in the text where the book begins and ends.

In [86]:
import re #we already imported this library but again it's good to know that we are using it now!

In [87]:
text = re.search("START OF THE PROJECT GUTENBERG EBOOK", data)
text

<re.Match object; span=(720, 756), match='START OF THE PROJECT GUTENBERG EBOOK'>

That means that the header marker runs from character 720 to character 756 in the text. Let's repeat the same thing with the end of the book.

In [88]:
text_2 = re.search("END OF THE PROJECT GUTENBERG EBOOK", data)
text_2

<re.Match object; span=(377929, 377963), match='END OF THE PROJECT GUTENBERG EBOOK'>

Now let's select only the data that we need.

In [89]:
data = data[756:377933]

In [91]:
data #voilà!

' AROUND THE WORLD IN EIGHTY DAYS ***\r\n\r\n[Illustration]\r\n\r\n\r\n\r\n\r\nAround the World in Eighty Days\r\n\r\nby Jules Verne\r\n\r\n\r\nContents\r\n\r\n CHAPTER I. IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN\r\n CHAPTER II. IN WHICH PASSEPARTOUT IS CONVINCED THAT HE HAS AT LAST FOUND HIS IDEAL\r\n CHAPTER III. IN WHICH A CONVERSATION TAKES PLACE WHICH SEEMS LIKELY TO COST PHILEAS FOGG DEAR\r\n CHAPTER IV. IN WHICH PHILEAS FOGG ASTOUNDS PASSEPARTOUT, HIS SERVANT\r\n CHAPTER V. IN WHICH A NEW SPECIES OF FUNDS, UNKNOWN TO THE MONEYED MEN, APPEARS ON ’CHANGE\r\n CHAPTER VI. IN WHICH FIX, THE DETECTIVE, BETRAYS A VERY NATURAL IMPATIENCE\r\n CHAPTER VII. WHICH ONCE MORE DEMONSTRATES THE USELESSNESS OF PASSPORTS AS AIDS TO DETECTIVES\r\n CHAPTER VIII. IN WHICH PASSEPARTOUT TALKS RATHER MORE, PERHAPS, THAN IS PRUDENT\r\n CHAPTER IX. IN WHICH THE RED SEA AND THE INDIAN OCEAN PROVE PROPITIOUS TO THE DESIGNS OF PHILEAS FOGG\r\n CHAPTER X. 

In [92]:
type(data) #let's check what type of data variable we have

str

In [93]:
len(data)

377177
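
Copying the positions by hand works, but the numbers will change whenever Project Gutenberg updates the file. One possible alternative, sketched here on a made-up miniature text, is to read the positions directly from the match objects with .end() and .start(). The sample string and the variable names are assumptions for illustration only.

```python
import re

# A tiny stand-in for the real file, with the same kind of markers.
sample = (
    "Header text\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
    "Body of the book.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
    "License text\r\n"
)

# Match the whole START line, up to and including its closing asterisks.
start_match = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*?\*\*\*", sample)
# Match the beginning of the END line, including its opening asterisks.
end_match = re.search(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK", sample)

# re.search returns None when nothing is found, so check before slicing.
if start_match is None or end_match is None:
    raise ValueError("Could not find the Project Gutenberg markers")

print(start_match.span())  # the (start, end) positions of the START marker

# Keep everything between the end of the START marker and the start of the END marker.
body = sample[start_match.end():end_match.start()].strip()
print(body)
```

Because the slice uses the match positions themselves, no fragment of either marker is left in the result.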

# Exercise 1

Let's do exercise number one in the "Exercises Data Acquisition Notebook".

# 5. Parsing the data

For Text Data Mining purposes, we need to divide the content of that very long string into chapters. Having a look at our text file (https://www.gutenberg.org/cache/epub/103/pg103.txt), we can see that the book is divided into chapters. That is great news!

We can see that the text file includes an index with all the chapters. This can be confusing for parsing purposes, so first of all let's get rid of that part of the text. Let's cut at the last entry of that index, Chapter 37 (XXXVII).

In [94]:
text_2 = re.search("CHAPTER XXXVII. IN WHICH IT IS SHOWN THAT PHILEAS FOGG GAINED NOTHING BY HIS TOUR AROUND THE WORLD, UNLESS IT WERE HAPPINESS.", data)
text_2

<re.Match object; span=(3361, 3486), match='CHAPTER XXXVII. IN WHICH IT IS SHOWN THAT PHILEAS>

In [95]:
data = data[3486:]
print(data)






CHAPTER I.
IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS
MASTER, THE OTHER AS MAN


Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington
Gardens, the house in which Sheridan died in 1814. He was one of the
most noticeable members of the Reform Club, though he seemed always to
avoid attracting attention; an enigmatical personage, about whom little
was known, except that he was a polished man of the world. People said
that he resembled Byron—at least that his head was Byronic; but he was
a bearded, tranquil Byron, who might live on a thousand years without
growing old.

Certainly an Englishman, it was more doubtful whether Phileas Fogg was
a Londoner. He was never seen on ’Change, nor at the Bank, nor in the
counting-rooms of the “City”; no ships ever came into London docks of
which he was the owner; he had no public employment; he had never been
entered at any of the Inns of Court, either at the Temple, or Lincoln’s
Inn, o
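
Another way to skip the index, sketched below on a made-up sample, is to look for the first chapter heading that sits alone on its line. This assumes that the index entries are indented and continue on the same line, as they appear to be in the output above, while the real headings are not. The sample text and variable names are hypothetical.

```python
import re

sample = (
    "Contents\r\n\r\n"
    " CHAPTER I. IN WHICH THE STORY BEGINS\r\n"
    " CHAPTER II. IN WHICH THE STORY ENDS\r\n"
    "\r\n\r\n"
    "CHAPTER I.\r\nIN WHICH THE STORY BEGINS\r\n\r\nOnce upon a time.\r\n"
    "\r\nCHAPTER II.\r\nIN WHICH THE STORY ENDS\r\n\r\nThe end.\r\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The optional \r allows for Windows-style line endings.
first_heading = re.search(r"^CHAPTER I\.\r?$", sample, flags=re.MULTILINE)

if first_heading is None:
    raise ValueError("Could not find the first chapter heading")

# Keep everything from the first real heading onwards.
body = sample[first_heading.start():]
print(body)
```

This avoids typing out the full title of the last index entry, at the cost of relying on the layout of the file.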

Great! Now let's find how many chapters we have.

In [96]:
chapters = re.findall(r"CHAPTER", data)
print(chapters)
print(len(chapters))

['CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER', 'CHAPTER']
37
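
The plain pattern CHAPTER counts every occurrence of that word, including any that might appear inside a sentence. A stricter pattern, sketched here on a made-up sample, only accepts the word at the start of a line, followed by a Roman numeral and a full stop. The sample text and the character class [IVXLC] are assumptions chosen for illustration.

```python
import re

sample = (
    "\r\nCHAPTER I.\r\nIN WHICH SOMETHING HAPPENS\r\n\r\n"
    "Text that mentions a CHAPTER of life.\r\n"
    "\r\nCHAPTER II.\r\nIN WHICH MORE HAPPENS\r\n\r\nMore text.\r\n"
)

# The loose pattern also counts the word inside the sentence.
loose = re.findall(r"CHAPTER", sample)

# The strict pattern needs a line start, a Roman numeral and a full stop.
strict = re.findall(r"^CHAPTER [IVXLC]+\.", sample, flags=re.MULTILINE)

print(len(loose))
print(strict)
```

If both counts agree on the real text, the simple pattern is safe to use for splitting.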


All looks good. Now let's split them.

In [97]:
chapters = re.split(r"CHAPTER", data)
print(chapters)

['\n\r\n\r\n\r\n\r\n', ' I.\r\nIN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS\r\nMASTER, THE OTHER AS MAN\r\n\r\n\r\nMr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington\r\nGardens, the house in which Sheridan died in 1814. He was one of the\r\nmost noticeable members of the Reform Club, though he seemed always to\r\navoid attracting attention; an enigmatical personage, about whom little\r\nwas known, except that he was a polished man of the world. People said\r\nthat he resembled Byron—at least that his head was Byronic; but he was\r\na bearded, tranquil Byron, who might live on a thousand years without\r\ngrowing old.\r\n\r\nCertainly an Englishman, it was more doubtful whether Phileas Fogg was\r\na Londoner. He was never seen on ’Change, nor at the Bank, nor in the\r\ncounting-rooms of the “City”; no ships ever came into London docks of\r\nwhich he was the owner; he had no public employment; he had never been\r\nentered at any of the Inns of Court

In [98]:
print(len(chapters))

38


Let's have a look at the first element in that list.

In [99]:
chapters[0]

'\n\r\n\r\n\r\n\r\n'

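That happens very frequently: re.split() also returns whatever comes before the first match, which here is just a few line breaks. Let's remove it using the pop() method!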

In [100]:
chapters.pop(0)

'\n\r\n\r\n\r\n\r\n'

In [101]:
len(chapters)

37
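
Instead of removing the first element by position, the empty pieces can also be filtered out in one step. The sketch below uses a made-up sample and hypothetical variable names; it also strips the surrounding whitespace from each chapter, which the notebook itself does not do.

```python
import re

sample = (
    "\r\n\r\nCHAPTER I.\r\nFirst chapter text.\r\n"
    "\r\nCHAPTER II.\r\nSecond chapter text.\r\n"
)

# re.split() returns the text before the first match as the first piece.
pieces = re.split(r"CHAPTER", sample)
print(len(pieces))

# Keep only the pieces that contain something other than whitespace.
chapters_demo = [piece.strip() for piece in pieces if piece.strip()]
print(chapters_demo)
```

The filter works no matter where an empty piece shows up, so it does not depend on the empty element being first.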

In [102]:
chapters[0]

' I.\r\nIN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS\r\nMASTER, THE OTHER AS MAN\r\n\r\n\r\nMr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington\r\nGardens, the house in which Sheridan died in 1814. He was one of the\r\nmost noticeable members of the Reform Club, though he seemed always to\r\navoid attracting attention; an enigmatical personage, about whom little\r\nwas known, except that he was a polished man of the world. People said\r\nthat he resembled Byron—at least that his head was Byronic; but he was\r\na bearded, tranquil Byron, who might live on a thousand years without\r\ngrowing old.\r\n\r\nCertainly an Englishman, it was more doubtful whether Phileas Fogg was\r\na Londoner. He was never seen on ’Change, nor at the Bank, nor in the\r\ncounting-rooms of the “City”; no ships ever came into London docks of\r\nwhich he was the owner; he had no public employment; he had never been\r\nentered at any of the Inns of Court, either at the Temple,

And now let's have a look at the end of the book.

In [103]:
chapters[-1] #looks good!

' XXXVII.\r\nIN WHICH IT IS SHOWN THAT PHILEAS FOGG GAINED NOTHING BY HIS TOUR\r\nAROUND THE WORLD, UNLESS IT WERE HAPPINESS\r\n\r\n\r\nYes; Phileas Fogg in person.\r\n\r\nThe reader will remember that at five minutes past eight in the\r\nevening—about five and twenty hours after the arrival of the travellers\r\nin London—Passepartout had been sent by his master to engage the\r\nservices of the Reverend Samuel Wilson in a certain marriage ceremony,\r\nwhich was to take place the next day.\r\n\r\nPassepartout went on his errand enchanted. He soon reached the\r\nclergyman’s house, but found him not at home. Passepartout waited a\r\ngood twenty minutes, and when he left the reverend gentleman, it was\r\nthirty-five minutes past eight. But in what a state he was! With his\r\nhair in disorder, and without his hat, he ran along the street as never\r\nman was seen to run before, overturning passers-by, rushing over the\r\nsidewalk like a waterspout.\r\n\r\nIn three minutes he was in Saville R

# Exercise 2

# 6. Creating a Data Frame and saving it to our laptop

Now we need to transform those **37 chapters** into a **CSV file** (CSV stands for Comma-Separated Values), which we will later use for Text Data Mining (TDM) analysis.

To do that, we need to follow two steps:

1. First, we need to create a list of chapter labels
2. Then, we need to pair those labels with the text of each chapter

In [104]:
chapter = []

x = list(range(1, 38)) #range() leaves out its end value, so we write 38 to get the numbers 1 to 37
for i in x:
    chapter.append(f"Chapter {i}")
print(chapter)

['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9', 'Chapter 10', 'Chapter 11', 'Chapter 12', 'Chapter 13', 'Chapter 14', 'Chapter 15', 'Chapter 16', 'Chapter 17', 'Chapter 18', 'Chapter 19', 'Chapter 20', 'Chapter 21', 'Chapter 22', 'Chapter 23', 'Chapter 24', 'Chapter 25', 'Chapter 26', 'Chapter 27', 'Chapter 28', 'Chapter 29', 'Chapter 30', 'Chapter 31', 'Chapter 32', 'Chapter 33', 'Chapter 34', 'Chapter 35', 'Chapter 36', 'Chapter 37']


What we just did (telling Python that we want 37 strings that start with "Chapter" followed by the numbers from 1 to 37) is called string formatting (you can read about it here: https://realpython.com/python-string-formatting/).
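
The same list can be written more compactly with a list comprehension. This is only an alternative sketch; the name chapter_labels is an assumption and is not used elsewhere in the notebook.

```python
# range(1, 38) yields the numbers 1 to 37; the f-string inserts each one.
chapter_labels = [f"Chapter {number}" for number in range(1, 38)]

print(len(chapter_labels))
print(chapter_labels[0], chapter_labels[-1])
```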

And now we need to zip both lists (chapter, chapters) into a dictionary that will contain our data.

In [105]:
key_list = chapter
value_list = chapters

In [106]:
data = dict(zip(key_list, value_list))
data

{'Chapter 1': ' I.\r\nIN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS\r\nMASTER, THE OTHER AS MAN\r\n\r\n\r\nMr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington\r\nGardens, the house in which Sheridan died in 1814. He was one of the\r\nmost noticeable members of the Reform Club, though he seemed always to\r\navoid attracting attention; an enigmatical personage, about whom little\r\nwas known, except that he was a polished man of the world. People said\r\nthat he resembled Byron—at least that his head was Byronic; but he was\r\na bearded, tranquil Byron, who might live on a thousand years without\r\ngrowing old.\r\n\r\nCertainly an Englishman, it was more doubtful whether Phileas Fogg was\r\na Londoner. He was never seen on ’Change, nor at the Bank, nor in the\r\ncounting-rooms of the “City”; no ships ever came into London docks of\r\nwhich he was the owner; he had no public employment; he had never been\r\nentered at any of the Inns of Court, either 
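
One thing to keep in mind is that zip() stops at the end of the shorter list without any warning, so it is worth checking that both lists have the same length first. A small sketch with made-up lists (all names here are illustrative):

```python
labels = ["Chapter 1", "Chapter 2", "Chapter 3"]
texts = ["First text", "Second text"]  # one text is missing on purpose

# zip() pairs items by position and stops at the shorter list,
# so "Chapter 3" is silently dropped.
paired = dict(zip(labels, texts))
print(paired)

# A simple guard: compare the lengths before zipping.
lengths_match = len(labels) == len(texts)
print(lengths_match)
```

In the notebook both lists have 37 elements, so nothing is lost there.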

We are ready to transform that into a Pandas Dataframe!

In [107]:
data = pd.DataFrame(chapter, columns = ["Chapter"]) #so the only thing that we need to do is to create a dataframe with pandas.
data["Text"] = chapters

In [108]:
data

|   | Chapter | Text |
|---|---------|------|
| 0 | Chapter 1 | I.\r\nIN WHICH PHILEAS FOGG AND PASSEPARTOUT ... |
| 1 | Chapter 2 | II.\r\nIN WHICH PASSEPARTOUT IS CONVINCED THA... |
| 2 | Chapter 3 | III.\r\nIN WHICH A CONVERSATION TAKES PLACE W... |
| 3 | Chapter 4 | IV.\r\nIN WHICH PHILEAS FOGG ASTOUNDS PASSEPA... |
| 4 | Chapter 5 | V.\r\nIN WHICH A NEW SPECIES OF FUNDS, UNKNOW... |
| 5 | Chapter 6 | VI.\r\nIN WHICH FIX, THE DETECTIVE, BETRAYS A... |
| 6 | Chapter 7 | VII.\r\nWHICH ONCE MORE DEMONSTRATES THE USEL... |
| 7 | Chapter 8 | VIII.\r\nIN WHICH PASSEPARTOUT TALKS RATHER M... |
| 8 | Chapter 9 | IX.\r\nIN WHICH THE RED SEA AND THE INDIAN OC... |
| 9 | Chapter 10 | X.\r\nIN WHICH PASSEPARTOUT IS ONLY TOO GLAD ... |


And now let's save that to our laptop.

In [109]:
data.to_csv("around_the_world_chapters.csv")
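
To see what ends up in such a file, the sketch below writes two made-up rows with Python's built-in csv module and reads them back. It is a standard-library alternative to the pandas call above, not a replacement for it; the file name and sample rows are assumptions. In pandas itself, to_csv accepts index=False if you prefer not to write the row numbers as an extra column.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"Chapter": "Chapter 1", "Text": "I.\r\nFirst chapter text."},
    {"Chapter": "Chapter 2", "Text": "II.\r\nSecond chapter text, with a comma."},
]

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "sample_chapters.csv"

    # newline="" lets the csv module handle line breaks inside the fields.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Chapter", "Text"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to check that nothing was lost.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        loaded = list(csv.DictReader(handle))

print(loaded == rows)
```

Fields that contain commas or line breaks are wrapped in quotes automatically, which is why the chapter texts survive the round trip.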

# Exercise 3