This Jupyter notebook can be used with any book available on Project Gutenberg. However, each book has to be inspected manually to check how its text is laid out there, because the markers and offsets used below are specific to one book.

The first part of the notebook (A) covers data acquisition. The second part (B) covers cleaning and preprocessing the data and saving it for later word counting (difficult character removal >> tokenization >> lower casing >> punctuation removal >> stop word removal). The third part (C) covers cleaning and preprocessing the data for later NER; it does not follow the previous pipeline and only removes noise characters.

# A. Data Acquisition

### 1. We import the libraries

In [55]:
from urllib import request
from bs4 import BeautifulSoup

import re
import pandas as pd
import string

import nltk 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

### 2. We get the data

In [56]:
url = "https://www.gutenberg.org/cache/epub/580/pg580.txt"  # URL of our target book: the plain-text edition of
                                                            # "The Pickwick Papers" by Charles Dickens

In [57]:
response = request.urlopen(url)
raw = response.read().decode("utf-8")
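
Plain-text files served as UTF-8 can begin with a byte order mark (BOM), which `decode("utf-8")` keeps as a `\ufeff` character at the start of the string. Whether this particular file has one should be checked by hand. A minimal sketch of a decoder that tolerates both cases; `decode_book` is a hypothetical helper, not part of the notebook:

```python
def decode_book(raw_bytes: bytes) -> str:
    """Decode downloaded bytes, dropping a leading UTF-8 BOM if one is present."""
    # "utf-8-sig" decodes like "utf-8" but strips a leading BOM.
    return raw_bytes.decode("utf-8-sig")


# Toy input (an assumption for illustration, not the real download).
sample = b"\xef\xbb\xbfThe Project Gutenberg eBook"
print(repr(sample.decode("utf-8")[:4]))   # the BOM survives plain utf-8 decoding
print(repr(decode_book(sample)[:4]))      # the BOM is gone
```

With the notebook's variables this would be called as `decode_book(response.read())`.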

### 3. We prettify the data

In [58]:
soup = BeautifulSoup(raw, "html.parser")

#print(soup.prettify())

In [59]:
data = soup.prettify()
#data

### 4. We select the data

In [60]:
text = re.search("START OF THE PROJECT GUTENBERG EBOOK THE PICKWICK PAPERS", data)

In [61]:
text

<re.Match object; span=(748, 804), match='START OF THE PROJECT GUTENBERG EBOOK THE PICKWICK>

In [62]:
text_2 = re.search("END OF THE PROJECT GUTENBERG EBOOK", data)
text_2

<re.Match object; span=(1783956, 1783990), match='END OF THE PROJECT GUTENBERG EBOOK'>

In [63]:
text_3 = data[text.end():text_2.start()]  # same as data[804:1783956], but read from the two matches above
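
The selection step can also be written so that it does not depend on one book's title or offsets. A minimal sketch, assuming the book uses the same START and END marker wording as the one above; `extract_body` is a hypothetical helper, and stripping asterisks at the edges is a simplification that would also remove a legitimate leading or trailing `*`:

```python
import re


def extract_body(full_text: str) -> str:
    """Return the text between the Gutenberg START and END marker lines."""
    # Consume the rest of the START line so the book title and trailing "***" are dropped.
    start = re.search(r"START OF THE PROJECT GUTENBERG EBOOK[^\r\n]*", full_text)
    end = re.search(r"END OF THE PROJECT GUTENBERG EBOOK", full_text)
    if start is None or end is None:
        raise ValueError("Gutenberg START/END markers not found")
    # Strip the "*** " that precedes the END marker plus surrounding whitespace.
    return full_text[start.end():end.start()].strip("* \r\n")


# Toy document (an assumption for illustration).
sample = (
    "header\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE PICKWICK PAPERS ***\r\n"
    "CHAPTER I\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE PICKWICK PAPERS ***\r\n"
    "license"
)
body = extract_body(sample)
print(body)
```

Raising an error when a marker is missing makes a layout mismatch visible immediately instead of producing an empty or wrongly sliced text.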

In [64]:
text_3



In [65]:
text_4 = re.search("THE POSTHUMOUS PAPERS OF THE PICKWICK CLUB\r\n\r\n\r\n\r\nCHAPTER I. THE PICKWICKIANS\r\n\r\n", text_3)
text_4

<re.Match object; span=(11818, 11899), match='THE POSTHUMOUS PAPERS OF THE PICKWICK CLUB\r\n\r\>

In [66]:
data = text_3[text_4.start():]  # same as text_3[11818:], but read from the match above
data



The slice above cleans the text further: it drops everything before the title that precedes Chapter I, including the chapter index.

# B) Cleaning and Pre-Processing. Word counts

This part depends heavily on what we want to do with our data. I have chosen the pipeline "difficult character removal >> tokenization >> lower casing >> punctuation removal >> stop word removal", but this is a personal choice and entirely up to us. This cleaning and preprocessing part is tailor-made for counting words (NER will follow a different cleaning and preprocessing process).
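
The five steps can be sketched end to end on a single sentence. This is an illustration only: the regex tokenizer and the tiny stop-word set are assumptions standing in for `word_tokenize` and NLTK's English stop-word list, so the real output will differ in detail.

```python
import re
import string

# Assumption: a tiny hand-written set standing in for NLTK's English stop words.
TOY_STOP_WORDS = {"the", "of", "and", "a", "in", "is"}
PUNCT_CHARS = set(string.punctuation)


def preprocess(text: str) -> list[str]:
    # 1. Difficult character removal: runs of dashes and curly quotes become spaces.
    text = re.sub(r"--+", " ", text)
    text = re.sub(r"[‘’“”]", " ", text)
    # 2. Tokenization (assumption: a simple regex, not word_tokenize).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # 3. Lower casing.
    tokens = [t.lower() for t in tokens]
    # 4. Punctuation removal (this tokenizer only emits single-character punctuation).
    tokens = [t for t in tokens if t not in PUNCT_CHARS]
    # 5. Stop word removal.
    return [t for t in tokens if t not in TOY_STOP_WORDS]


result = preprocess("“The first ray of light,” said Mr. Pickwick--and smiled.")
print(result)
```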

**1. Difficult character removal**

In [83]:
text = re.sub(r'--+', ' ', data)

In [84]:
text_1 = re.sub(r'[‘’“”]', " ", text)
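
One side effect worth checking: replacing the apostrophe ’ with a space splits contractions and possessives, so "Don’t" becomes "Don" and "t". A sketch of an alternative that maps curly quotes to their straight equivalents instead; `normalize_quotes` is a hypothetical helper, and whether it is preferable depends on how the tokens are used later:

```python
import re

# Map each curly quote to its straight ASCII counterpart.
CURLY_TO_STRAIGHT = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"'})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with straight ones, keeping contractions intact."""
    return text.translate(CURLY_TO_STRAIGHT)


sample = "“Don’t,” said Sam’s father"
split_by_space = re.sub(r"[‘’“”]", " ", sample).split()  # the notebook's approach
normalized = normalize_quotes(sample)                    # the alternative
print(split_by_space)
print(normalized)
```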

In [85]:
text_1



**2. Tokenization**

In [97]:
tokens = word_tokenize(text_1)

In [98]:
tokens

['THE',
 'POSTHUMOUS',
 'PAPERS',
 'OF',
 'THE',
 'PICKWICK',
 'CLUB',
 'CHAPTER',
 'I',
 '.',
 'THE',
 'PICKWICKIANS',
 'The',
 'first',
 'ray',
 'of',
 'light',
 'which',
 'illumines',
 'the',
 'gloom',
 ',',
 'and',
 'converts',
 'into',
 'a',
 'dazzling',
 'brilliancy',
 'that',
 'obscurity',
 'in',
 'which',
 'the',
 'earlier',
 'history',
 'of',
 'the',
 'public',
 'career',
 'of',
 'the',
 'immortal',
 'Pickwick',
 'would',
 'appear',
 'to',
 'be',
 'involved',
 ',',
 'is',
 'derived',
 'from',
 'the',
 'perusal',
 'of',
 'the',
 'following',
 'entry',
 'in',
 'the',
 'Transactions',
 'of',
 'the',
 'Pickwick',
 'Club',
 ',',
 'which',
 'the',
 'editor',
 'of',
 'these',
 'papers',
 'feels',
 'the',
 'highest',
 'pleasure',
 'in',
 'laying',
 'before',
 'his',
 'readers',
 ',',
 'as',
 'a',
 'proof',
 'of',
 'the',
 'careful',
 'attention',
 ',',
 'indefatigable',
 'assiduity',
 ',',
 'and',
 'nice',
 'discrimination',
 ',',
 'with',
 'which',
 'his',
 'search',
 'among',
 'the'

In [99]:
type(tokens)

list

**3. Lower casing**

In [100]:
lower_tokens = [token.lower() for token in tokens]  # lower-case every token

In [101]:
lower_tokens

['the',
 'posthumous',
 'papers',
 'of',
 'the',
 'pickwick',
 'club',
 'chapter',
 'i',
 '.',
 'the',
 'pickwickians',
 'the',
 'first',
 'ray',
 'of',
 'light',
 'which',
 'illumines',
 'the',
 'gloom',
 ',',
 'and',
 'converts',
 'into',
 'a',
 'dazzling',
 'brilliancy',
 'that',
 'obscurity',
 'in',
 'which',
 'the',
 'earlier',
 'history',
 'of',
 'the',
 'public',
 'career',
 'of',
 'the',
 'immortal',
 'pickwick',
 'would',
 'appear',
 'to',
 'be',
 'involved',
 ',',
 'is',
 'derived',
 'from',
 'the',
 'perusal',
 'of',
 'the',
 'following',
 'entry',
 'in',
 'the',
 'transactions',
 'of',
 'the',
 'pickwick',
 'club',
 ',',
 'which',
 'the',
 'editor',
 'of',
 'these',
 'papers',
 'feels',
 'the',
 'highest',
 'pleasure',
 'in',
 'laying',
 'before',
 'his',
 'readers',
 ',',
 'as',
 'a',
 'proof',
 'of',
 'the',
 'careful',
 'attention',
 ',',
 'indefatigable',
 'assiduity',
 ',',
 'and',
 'nice',
 'discrimination',
 ',',
 'with',
 'which',
 'his',
 'search',
 'among',
 'the'

**4. Punctuation**

In [102]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [103]:
punctuation_free = [token for token in lower_tokens if token not in string.punctuation]
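
Note that `token not in string.punctuation` is a substring test against one long string, so it removes single punctuation characters but keeps multi-character tokens such as `...`. A sketch of a stricter check, assuming that a token made up entirely of punctuation characters should be dropped; `is_punctuation` is a hypothetical helper:

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punctuation(token: str) -> bool:
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


# Toy tokens (an assumption for illustration).
tokens = ["the", ",", "...", "?!", "club", "mr."]
substring_filtered = [t for t in tokens if t not in string.punctuation]
strict_filtered = [t for t in tokens if not is_punctuation(t)]
print(substring_filtered)
print(strict_filtered)
```

Tokens that mix letters and punctuation, such as `mr.`, are kept by both versions.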

In [104]:
punctuation_free

['the',
 'posthumous',
 'papers',
 'of',
 'the',
 'pickwick',
 'club',
 'chapter',
 'i',
 'the',
 'pickwickians',
 'the',
 'first',
 'ray',
 'of',
 'light',
 'which',
 'illumines',
 'the',
 'gloom',
 'and',
 'converts',
 'into',
 'a',
 'dazzling',
 'brilliancy',
 'that',
 'obscurity',
 'in',
 'which',
 'the',
 'earlier',
 'history',
 'of',
 'the',
 'public',
 'career',
 'of',
 'the',
 'immortal',
 'pickwick',
 'would',
 'appear',
 'to',
 'be',
 'involved',
 'is',
 'derived',
 'from',
 'the',
 'perusal',
 'of',
 'the',
 'following',
 'entry',
 'in',
 'the',
 'transactions',
 'of',
 'the',
 'pickwick',
 'club',
 'which',
 'the',
 'editor',
 'of',
 'these',
 'papers',
 'feels',
 'the',
 'highest',
 'pleasure',
 'in',
 'laying',
 'before',
 'his',
 'readers',
 'as',
 'a',
 'proof',
 'of',
 'the',
 'careful',
 'attention',
 'indefatigable',
 'assiduity',
 'and',
 'nice',
 'discrimination',
 'with',
 'which',
 'his',
 'search',
 'among',
 'the',
 'multifarious',
 'documents',
 'confided',
 '

**5. Stop words**

In [105]:
stop_words = set(stopwords.words("english"))

clean = [token for token in punctuation_free if token not in stop_words]

super_clean = ' '.join(clean)

In [106]:
super_clean



**6. Saving things into a text file**

In [107]:
with open("The Pickwick Papers_word_counts.txt", "w", encoding = "utf-8") as f:
    f.write(super_clean)
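
Since the saved file is one long string of space-separated tokens, the later word count can be as simple as the sketch below. It uses `collections.Counter` from the standard library; `top_words` is a hypothetical helper and the sample string is made up for illustration.

```python
from collections import Counter


def top_words(cleaned_text: str, n: int = 3) -> list[tuple[str, int]]:
    """Count space-separated tokens and return the n most frequent ones."""
    return Counter(cleaned_text.split()).most_common(n)


sample = "pickwick club pickwick sam weller sam pickwick"
counts = top_words(sample, 2)
print(counts)
```

To apply it to the real data, read the saved text file back with `encoding="utf-8"` and pass its contents to `top_words`.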

# C) Cleaning and Pre-Processing. NER

To use spaCy pipelines for NER (Named Entity Recognition), we need to keep the text in its original form and only remove noise characters such as \r\n. So let's go back to the data variable and do that!

In [52]:
# Replace each run of carriage returns and line feeds with a single space
clean_text = re.sub(r'[\r\n]+', ' ', data)
clean_text = clean_text.replace('\xa0', ' ')  # replace non-breaking spaces with regular spaces

# Optionally, collapse multiple spaces and strip
clean_text = re.sub(r'\s+', ' ', clean_text).strip()

In [53]:
clean_text



In [54]:
with open("The Pickwick Papers_NER.txt", "w", encoding = "utf-8") as f:
    f.write(clean_text)
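
A practical point for the NER step: spaCy's `nlp.max_length` defaults to 1,000,000 characters, and a full novel like this one can exceed it. A sketch of splitting the cleaned text into smaller pieces first; `split_into_chunks` is a hypothetical helper, and the default chunk size and the cut after ". " are arbitrary assumptions:

```python
def split_into_chunks(text: str, max_chars: int = 100_000) -> list[str]:
    """Split text into pieces of at most max_chars, cutting after a sentence end when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last ". " inside the current window.
            cut = text.rfind(". ", start, end)
            if cut > start:
                end = cut + 1
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    return chunks


# Toy text with a tiny limit, so the cuts are easy to follow.
sample = "One. Two. Three."
chunks = split_into_chunks(sample, max_chars=8)
print(chunks)
```

The resulting list could then be passed to a spaCy pipeline piece by piece, for example with `nlp.pipe(chunks)`.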