# Step 2. Cleaning and Pre-processing data

Now that you have your data (from web scraping, APIs, PDFs, databases...), the next step in our Digital Humanities project is **cleaning and pre-processing**. In this notebook we will use the file that we created in the previous notebook (*Around the World in Eighty Days*) and we will:

* Tokenize
* Lowercase
* Remove Punctuation
* Remove Stopwords
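
As a rough preview, the four steps can be sketched on a single sentence using only the standard library. This is an illustration under assumptions, not the method used in this notebook: a regular expression stands in for NLTK's `word_tokenize`, and `TOY_STOPWORDS` is a made-up list that is much shorter than NLTK's.

```python
import re
import string

# Hypothetical mini stopword list, for illustration only.
TOY_STOPWORDS = {"in", "at", "the", "a", "of"}
ASCII_PUNCT = set(string.punctuation)

sentence = "Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row."

# 1. Tokenize: runs of word characters, or single non-space symbols.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
# 2. Lowercase every token.
lowered = [t.lower() for t in tokens]
# 3. Drop tokens that are a single ASCII punctuation character.
no_punct = [t for t in lowered if t not in ASCII_PUNCT]
# 4. Drop stopwords.
cleaned = [t for t in no_punct if t not in TOY_STOPWORDS]

print(cleaned)
```

The rest of the notebook applies the same four steps to every chapter, with NLTK doing the tokenization and providing the stopword list.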

# 1. We import our libraries

In [20]:
import string

import nltk
import pandas as pd
from nltk.corpus import gutenberg, stopwords
from nltk.tokenize import word_tokenize

# 2. We get our data

In [21]:
data = pd.read_csv("around_the_world_chapters.csv", index_col = 0)

In [22]:
data

Unnamed: 0,Chapter,Text
0,Chapter 1,I.\r\nIN WHICH PHILEAS FOGG AND PASSEPARTOUT ...
1,Chapter 2,II.\r\nIN WHICH PASSEPARTOUT IS CONVINCED THA...
2,Chapter 3,III.\r\nIN WHICH A CONVERSATION TAKES PLACE W...
3,Chapter 4,IV.\r\nIN WHICH PHILEAS FOGG ASTOUNDS PASSEPA...
4,Chapter 5,"V.\r\nIN WHICH A NEW SPECIES OF FUNDS, UNKNOW..."
5,Chapter 6,"VI.\r\nIN WHICH FIX, THE DETECTIVE, BETRAYS A..."
6,Chapter 7,VII.\r\nWHICH ONCE MORE DEMONSTRATES THE USEL...
7,Chapter 8,VIII.\r\nIN WHICH PASSEPARTOUT TALKS RATHER M...
8,Chapter 9,IX.\r\nIN WHICH THE RED SEA AND THE INDIAN OC...
9,Chapter 10,X.\r\nIN WHICH PASSEPARTOUT IS ONLY TOO GLAD ...


# 3. We extract the text

In [23]:
text = data["Text"].to_list()  # convert the pandas Series into a plain Python list, which is easier to loop over

In [24]:
len(text)

37

# 4. Tokenization

In [25]:
import nltk  # already imported above; repeated here only to show where it is used
from nltk.tokenize import word_tokenize

In [26]:
tokens = []

for i in text:
    tokens.append(word_tokenize(i))

In [27]:
tokens

[['I',
  '.',
  'IN',
  'WHICH',
  'PHILEAS',
  'FOGG',
  'AND',
  'PASSEPARTOUT',
  'ACCEPT',
  'EACH',
  'OTHER',
  ',',
  'THE',
  'ONE',
  'AS',
  'MASTER',
  ',',
  'THE',
  'OTHER',
  'AS',
  'MAN',
  'Mr.',
  'Phileas',
  'Fogg',
  'lived',
  ',',
  'in',
  '1872',
  ',',
  'at',
  'No',
  '.',
  '7',
  ',',
  'Saville',
  'Row',
  ',',
  'Burlington',
  'Gardens',
  ',',
  'the',
  'house',
  'in',
  'which',
  'Sheridan',
  'died',
  'in',
  '1814',
  '.',
  'He',
  'was',
  'one',
  'of',
  'the',
  'most',
  'noticeable',
  'members',
  'of',
  'the',
  'Reform',
  'Club',
  ',',
  'though',
  'he',
  'seemed',
  'always',
  'to',
  'avoid',
  'attracting',
  'attention',
  ';',
  'an',
  'enigmatical',
  'personage',
  ',',
  'about',
  'whom',
  'little',
  'was',
  'known',
  ',',
  'except',
  'that',
  'he',
  'was',
  'a',
  'polished',
  'man',
  'of',
  'the',
  'world',
  '.',
  'People',
  'said',
  'that',
  'he',
  'resembled',
  'Byron—at',
  'least',
  'that',


In [28]:
len(tokens)

37

# 5. Lower casing

To lowercase our data, we need to write a nested loop, because `tokens` is a list of lists: the outer loop goes over the chapters, and the inner loop goes over the tokens in each chapter.

In [29]:
lower_tokens = []

In [30]:
for i in tokens:  # each chapter (a list of tokens)
    list_1 = []
    for x in i:  # each token in that chapter
        list_1.append(x.lower())
    lower_tokens.append(list_1)
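
The nested loop above can also be written as a nested list comprehension. The sketch below uses a tiny hand-made `sample_tokens` list (an assumption standing in for the real `tokens`) so that it runs on its own.

```python
# Hypothetical stand-in for `tokens`: one inner list of tokens per chapter.
sample_tokens = [["I", ".", "IN", "WHICH", "Phileas", "Fogg"],
                 ["II", ".", "Passepartout"]]

# The outer comprehension walks the chapters, the inner one walks each chapter's tokens.
sample_lower = [[token.lower() for token in chapter] for chapter in sample_tokens]

print(sample_lower)
```

Both versions give the same result; the explicit loops are easier to read when you are starting out.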

In [31]:
lower_tokens

[['i',
  '.',
  'in',
  'which',
  'phileas',
  'fogg',
  'and',
  'passepartout',
  'accept',
  'each',
  'other',
  ',',
  'the',
  'one',
  'as',
  'master',
  ',',
  'the',
  'other',
  'as',
  'man',
  'mr.',
  'phileas',
  'fogg',
  'lived',
  ',',
  'in',
  '1872',
  ',',
  'at',
  'no',
  '.',
  '7',
  ',',
  'saville',
  'row',
  ',',
  'burlington',
  'gardens',
  ',',
  'the',
  'house',
  'in',
  'which',
  'sheridan',
  'died',
  'in',
  '1814',
  '.',
  'he',
  'was',
  'one',
  'of',
  'the',
  'most',
  'noticeable',
  'members',
  'of',
  'the',
  'reform',
  'club',
  ',',
  'though',
  'he',
  'seemed',
  'always',
  'to',
  'avoid',
  'attracting',
  'attention',
  ';',
  'an',
  'enigmatical',
  'personage',
  ',',
  'about',
  'whom',
  'little',
  'was',
  'known',
  ',',
  'except',
  'that',
  'he',
  'was',
  'a',
  'polished',
  'man',
  'of',
  'the',
  'world',
  '.',
  'people',
  'said',
  'that',
  'he',
  'resembled',
  'byron—at',
  'least',
  'that',


In [32]:
len(lower_tokens)

37

# 6. Punctuation

In [33]:
punctuation_free = []

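for i in lower_tokens:
    # drop tokens found in string.punctuation (ASCII only), then re-join each chapter into a single string
    punctuation_free.append(" ".join(c for c in i if c not in string.punctuation))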

In [34]:
punctuation_free

['i in which phileas fogg and passepartout accept each other the one as master the other as man mr. phileas fogg lived in 1872 at no 7 saville row burlington gardens the house in which sheridan died in 1814 he was one of the most noticeable members of the reform club though he seemed always to avoid attracting attention an enigmatical personage about whom little was known except that he was a polished man of the world people said that he resembled byron—at least that his head was byronic but he was a bearded tranquil byron who might live on a thousand years without growing old certainly an englishman it was more doubtful whether phileas fogg was a londoner he was never seen on ’ change nor at the bank nor in the counting-rooms of the “ city ” no ships ever came into london docks of which he was the owner he had no public employment he had never been entered at any of the inns of court either at the temple or lincoln ’ s inn or gray ’ s inn nor had his voice ever resounded in the court 

In [35]:
len(punctuation_free)

37
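
Note that `string.punctuation` only contains ASCII punctuation, which is why tokens such as `’`, `“` and `”` survive in the output above. A possible stricter check, offered as a sketch rather than as this notebook's method, treats a token as punctuation when every character is either in `string.punctuation` or has a Unicode category starting with `P`. The helper name `is_punctuation` and the sample tokens are assumptions made for this example.

```python
import string
import unicodedata

ASCII_PUNCT = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(
        ch in ASCII_PUNCT or unicodedata.category(ch).startswith("P")
        for ch in token
    )

sample = ["mr.", "fogg", ",", "’", "change", "“", "city", "”", "counting-rooms"]
kept = [t for t in sample if not is_punctuation(t)]
print(kept)
```

Tokens that merely contain punctuation, such as `mr.` or `counting-rooms`, are left untouched; handling those would need an extra step.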

# 7. Stopwords

Let's first have a look at stopwords in English.

In [36]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [37]:
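english_stopwords = set(stopwords.words("english"))  # build the set once instead of reloading the list for every word

super_clean = []
for i in punctuation_free:
    super_clean.append(" ".join(c for c in i.split() if c not in english_stopwords))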

In [38]:
super_clean

['phileas fogg passepartout accept one master man mr. phileas fogg lived 1872 7 saville row burlington gardens house sheridan died 1814 one noticeable members reform club though seemed always avoid attracting attention enigmatical personage little known except polished man world people said resembled byron—at least head byronic bearded tranquil byron might live thousand years without growing old certainly englishman doubtful whether phileas fogg londoner never seen ’ change bank counting-rooms “ city ” ships ever came london docks owner public employment never entered inns court either temple lincoln ’ inn gray ’ inn voice ever resounded court chancery exchequer queen ’ bench ecclesiastical courts certainly manufacturer merchant gentleman farmer name strange scientific learned societies never known take part sage deliberations royal institution london institution artisan ’ association institution arts sciences belonged fact none numerous societies swarm english capital harmonic entomol

In [39]:
len(super_clean)

37
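
NLTK's English stopword list is entirely lowercase, as the printout above shows, which is one reason lowercasing comes before this step. The sketch below illustrates the point with a short, made-up `toy_stopwords` set (an assumption; the real list is much longer) and a hypothetical helper `remove_stopwords`.

```python
def remove_stopwords(text, stop_set):
    """Keep only the whitespace-separated words that are not in stop_set."""
    return " ".join(word for word in text.split() if word not in stop_set)

# Made-up subset for illustration; the notebook uses stopwords.words("english").
toy_stopwords = {"the", "of", "a", "he", "was"}

lowercased = remove_stopwords("he was a polished man of the world", toy_stopwords)
mixed_case = remove_stopwords("He was a polished man of the World", toy_stopwords)

print(lowercased)
print(mixed_case)  # "He" is kept because it does not match the lowercase "he"
```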

Great! We now have 37 cleaned chapters of *Around the World in Eighty Days*!

# Exercise 1

# 8. Using that data

Now that we have **very clean data**, we have two options:

1. Use it chapter by chapter, as it is now (in case we want to see how things evolve over the novel).
2. Merge it into a single string (in case we want to analyse the whole book at once).

Let's do both!

**8.1. Chapters**

In [40]:
chapter = []

x = list(range(1, 38))  # the end value of range() is exclusive, so 38 gives us 1 to 37
for i in x:
    chapter.append(f"Chapter {i}")
print(chapter)

['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9', 'Chapter 10', 'Chapter 11', 'Chapter 12', 'Chapter 13', 'Chapter 14', 'Chapter 15', 'Chapter 16', 'Chapter 17', 'Chapter 18', 'Chapter 19', 'Chapter 20', 'Chapter 21', 'Chapter 22', 'Chapter 23', 'Chapter 24', 'Chapter 25', 'Chapter 26', 'Chapter 27', 'Chapter 28', 'Chapter 29', 'Chapter 30', 'Chapter 31', 'Chapter 32', 'Chapter 33', 'Chapter 34', 'Chapter 35', 'Chapter 36', 'Chapter 37']
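
`range(1, 38)` stops at 37 because the end value of `range` is exclusive. A variant that avoids hard-coding the number derives it from the list of chapters. The sketch below uses a three-item placeholder list `sample_chapters` (an assumption) in place of `super_clean`.

```python
# Placeholder for `super_clean`; the real list has 37 items.
sample_chapters = ["text of chapter one", "text of chapter two", "text of chapter three"]

# range(1, n + 1) yields 1..n, so the labels always match the number of chapters.
labels = [f"Chapter {i}" for i in range(1, len(sample_chapters) + 1)]
print(labels)
```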


In [41]:
key_list = chapter
value_list = super_clean

We can store things in a dictionary:

In [42]:
data = dict(zip(key_list, value_list))
data

{'Chapter 1': 'phileas fogg passepartout accept one master man mr. phileas fogg lived 1872 7 saville row burlington gardens house sheridan died 1814 one noticeable members reform club though seemed always avoid attracting attention enigmatical personage little known except polished man world people said resembled byron—at least head byronic bearded tranquil byron might live thousand years without growing old certainly englishman doubtful whether phileas fogg londoner never seen ’ change bank counting-rooms “ city ” ships ever came london docks owner public employment never entered inns court either temple lincoln ’ inn gray ’ inn voice ever resounded court chancery exchequer queen ’ bench ecclesiastical courts certainly manufacturer merchant gentleman farmer name strange scientific learned societies never known take part sage deliberations royal institution london institution artisan ’ association institution arts sciences belonged fact none numerous societies swarm english capital har
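
With a dictionary like this, a chapter's text can be retrieved by its label. A minimal sketch, using two shortened placeholder chapters (assumptions) rather than the real data:

```python
sample_keys = ["Chapter 1", "Chapter 2"]
sample_values = ["phileas fogg passepartout accept one master man",
                 "ii passepartout convinced last found ideal"]

# zip pairs each label with the text at the same position.
chapters = dict(zip(sample_keys, sample_values))

print(chapters["Chapter 2"])

# Count the words in each chapter.
word_counts = {label: len(chapter_text.split()) for label, chapter_text in chapters.items()}
print(word_counts)
```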

We can also store the chapters in a pandas DataFrame:

In [50]:
data = pd.DataFrame(chapter, columns = ["Chapter"]) #so the only thing that we need to do is to create a dataframe with pandas.
data["Text"] = super_clean

In [51]:
data

Unnamed: 0,Chapter,Text
0,Chapter 1,phileas fogg passepartout accept one master ma...
1,Chapter 2,ii passepartout convinced last found ideal “ f...
2,Chapter 3,iii conversation takes place seems likely cost...
3,Chapter 4,iv phileas fogg astounds passepartout servant ...
4,Chapter 5,v. new species funds unknown moneyed men appea...
5,Chapter 6,vi fix detective betrays natural impatience ci...
6,Chapter 7,vii demonstrates uselessness passports aids de...
7,Chapter 8,viii passepartout talks rather perhaps prudent...
8,Chapter 9,ix red sea indian ocean prove propitious desig...
9,Chapter 10,x passepartout glad get loss shoes everybody k...


And now let's save that DataFrame to a CSV file:

In [52]:
data.to_csv("clean_around_the_world_chapters.csv")

**8.2. Full text**

In [53]:
full_text = " ".join(super_clean)

In [54]:
full_text

'phileas fogg passepartout accept one master man mr. phileas fogg lived 1872 7 saville row burlington gardens house sheridan died 1814 one noticeable members reform club though seemed always avoid attracting attention enigmatical personage little known except polished man world people said resembled byron—at least head byronic bearded tranquil byron might live thousand years without growing old certainly englishman doubtful whether phileas fogg londoner never seen ’ change bank counting-rooms “ city ” ships ever came london docks owner public employment never entered inns court either temple lincoln ’ inn gray ’ inn voice ever resounded court chancery exchequer queen ’ bench ecclesiastical courts certainly manufacturer merchant gentleman farmer name strange scientific learned societies never known take part sage deliberations royal institution london institution artisan ’ association institution arts sciences belonged fact none numerous societies swarm english capital harmonic entomolo

In [55]:
len(full_text)

237700
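
Keep in mind that `len()` on a string counts characters, not words, so 237700 is the number of characters in `full_text`. To count words, split the string first. A minimal sketch on a short placeholder string (an assumption standing in for `full_text`):

```python
sample_text = "phileas fogg passepartout accept one master man"

n_characters = len(sample_text)      # characters, spaces included
n_words = len(sample_text.split())   # whitespace-separated words

print(n_characters, n_words)
```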

Let's now save that as a plain-text (.txt) file, a format that is widely used in Digital Humanities!

In [56]:
with open("full_text_aroundtheworld.txt", "w", encoding = "utf-8") as f:
    f.write(full_text)

You should now have that file on your computer! We will be using it over the following days.
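
One way to confirm that a text file was written correctly is to read it back and compare it with the original string. The sketch below writes to a temporary folder so that it does not touch your real file; the file name `demo.txt` and the sample string are assumptions.

```python
import tempfile
from pathlib import Path

sample_text = "phileas fogg passepartout accept one master man"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "demo.txt"
    # Write and read with the same encoding, as in the cell above.
    path.write_text(sample_text, encoding="utf-8")
    read_back = path.read_text(encoding="utf-8")

print(read_back == sample_text)
```

Using the same `encoding` for writing and reading matters for this book, since the text contains non-ASCII characters such as curly quotes.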

# Exercise 2