The website I selected is Project Gutenberg, specifically the page that contains Journal 1 of Henry David Thoreau, since I will use that data in my final project.

First I import all the libraries I need. I included re (regex) too, since I need to extract dates from the text I scrape:

In [10]:
import requests
from scrapy.selector import Selector
import pandas as pd
import re

I fetch the webpage:

In [11]:
url = "https://www.gutenberg.org/cache/epub/57393/pg57393-images.html"
response = requests.get(url)
response.raise_for_status()  # stop here if the request failed, so the message below is accurate

print("Webpage successfully fetched.")

Webpage successfully fetched.


I use the selector to collect every top-level element under the body, which includes the paragraphs that hold the journal text:

In [12]:
sel = Selector(text=response.text)

all_nodes = sel.xpath('//body/*')

print(f"Found {len(all_nodes)} elements to process.")

Found 2057 elements to process.


This is the main loop. I first define a date regex that looks for a month abbreviation followed by a number. I then loop over the elements of all_nodes and check whether the first 20 characters of each paragraph start with a date; if they do, I start a new entry. I remove the date from the entry text, since my goal with the data is to train a UML and I don't want the dates to be included in the training. I also added a small check to keep track of titles, since some journal entries have one; the others are labeled "Untitled". Paragraphs without a date are appended to the current entry. When the loop ends, I append the last entry picked up to journal_entries.

In [19]:
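journal_entries = []

current_entry = None          # entry being built; None until the first date is found
last_seen_title = "Untitled"  # default when no title precedes an entry

# Month abbreviation (optionally spelled out or followed by a period), whitespace, then a 1-2 digit day
date_regex = re.compile(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*\s\d{1,2}', re.IGNORECASE)

for node in all_nodes:
    html_class = node.xpath('@class').get()
    tag_name = node.root.tag
    text_content = node.xpath('string(.)').get().strip()

    # Elements with class 'p2 center' are treated as titles for the next entry
    if html_class == 'p2 center':
        last_seen_title = text_content
        continue

    if tag_name == 'p':
        # match() anchors at the start, so only a date that opens the paragraph counts
        date_match = date_regex.match(text_content[:20])

        if date_match:
            # A new date closes the previous entry
            if current_entry:
                journal_entries.append(current_entry)

            found_date = date_match.group(0)
            clean_text = text_content.replace(found_date, '', 1).strip()

            # Drop punctuation left behind by the removed date
            clean_text = re.sub(r'^[.,-]\s*', '', clean_text)

            current_entry = {
                'Title': last_seen_title,
                'Date': found_date,
                'Content': clean_text
            }

            last_seen_title = "Untitled"

        else:
            # Undated paragraph: it continues the current entry
            if current_entry:
                current_entry['Content'] += " " + text_content

# Save the entry that is still open when the loop ends
if current_entry:
    journal_entries.append(current_entry)

print(f"Extraction complete. Found {len(journal_entries)} entries.")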

Extraction complete. Found 396 entries.
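
A quick way to see what the date pattern accepts is to run it on a few short strings. The sketch below reuses the same regex and the same first-20-characters check as the loop; the helper name leading_date and all sample strings are invented for illustration and are not taken from the journal.

```python
import re

# Same pattern as in the extraction loop
date_regex = re.compile(
    r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*\s\d{1,2}',
    re.IGNORECASE,
)

def leading_date(text):
    """Return the date string that opens the text, or None (hypothetical helper)."""
    match = date_regex.match(text[:20])
    return match.group(0) if match else None

# Invented sample paragraphs, for illustration only
samples = [
    "Oct. 24. Every part of nature teaches",
    "March 13 The river is open",
    "The prospect is limited",
    "Sep. 5 A walk to the pond",
    "Maybe 3 miles from here",
]

for sample in samples:
    print(repr(leading_date(sample)), "<-", sample)
```

On these samples the pattern finds "Oct. 24" and "March 13", skips the undated paragraph, does not match "Sep. 5" (the pattern lists Sept, not Sep), and does match "Maybe 3", because "May" followed by letters and a number fits the pattern. Whether either case actually occurs in the journal text would need to be checked against the scraped data.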


Lastly, I put all the entries in a pandas DataFrame and save it as a CSV file.

In [20]:
df = pd.DataFrame(journal_entries)

print(df.head())

df.to_csv('thoreau_data.csv', index=False)

print("File saved as 'thoreau_data.csv'")

                       Title     Date  \
0                   Untitled   Oct 22   
1  THE MOULD OUR DEEDS LEAVE  Oct. 24   
2                     SPRING   Oct 25   
3                   THE POET   Oct 26   
4                    THE FOG  Oct. 27   

                                             Content  
0  "What are you doing now?" he asked.\r\n"Do you...  
1  Every part of nature teaches that the passing\...  
2  She appears, and we are once more children;\r\...  
3   "A noble man has not to thank a private circl...  
4  The prospect is limited to Nobscot and\r\nAnnu...  
File saved as 'thoreau_data.csv'


When I check the data, it looks quite clean. It is well organized by date, and no dated entry seems to have been skipped. However, I suspect the writer himself might have skipped some dates, especially towards the end; some entries are far too long, and I don't have many data points. I need to find the most logical way to separate those long rows.
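
One way to put a number on "too long" is to count the words per entry and flag entries well above the median. This is only a sketch: the three entries below are made-up stand-ins for journal_entries, the helper name long_entries is hypothetical, and the factor of 3 is an arbitrary assumption rather than a tested threshold.

```python
from statistics import median

# Made-up stand-in for journal_entries, same keys as the scraped data
sample_entries = [
    {'Title': 'Untitled', 'Date': 'Oct 22', 'Content': 'one two three'},
    {'Title': 'A', 'Date': 'Oct. 24', 'Content': 'word ' * 40},
    {'Title': 'B', 'Date': 'Oct 25', 'Content': 'a b c d e'},
]

def long_entries(entries, factor=3):
    """Return (date, word count) for entries longer than factor x the median length."""
    counts = [len(entry['Content'].split()) for entry in entries]
    cutoff = factor * median(counts)
    return [
        (entry['Date'], count)
        for entry, count in zip(entries, counts)
        if count > cutoff
    ]

print(long_entries(sample_entries))
```

Running the same function on the real journal_entries list would show which dates hold the longest entries and how far they sit from a typical one.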

My top pick right now is to separate the data into rows by paragraph breaks rather than directly by date. This way I will have more manageable units and a much larger number of text units for training. I can then impute the dates for undated rows, either by repeating the previous date (e.g. multiple rows are all March 13) or by estimating from the dated entries before and after (e.g. if 7 undated rows fall between March 13 and June 16, they are spread between those dates at equal intervals). I am also considering expanding my data set with other volumes of his journal.
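
The two imputation options can be sketched as follows, assuming each paragraph row carries either a parsed date or None. The function names forward_fill and spread_evenly are hypothetical, and the year 1840 is a placeholder assumption, since the scraped date strings contain no year.

```python
from datetime import date, timedelta

def forward_fill(dates):
    """Option 1: repeat the most recent known date for undated rows."""
    filled, last = [], None
    for d in dates:
        if d is not None:
            last = d
        filled.append(last)
    return filled

def spread_evenly(dates):
    """Option 2: place undated rows at equal intervals between the
    known dates before and after them. Rows before the first or after
    the last known date stay None."""
    filled = list(dates)
    known = [i for i, d in enumerate(dates) if d is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left                        # number of steps between the two known rows
        span = (dates[right] - dates[left]).days  # days between the two known dates
        for k in range(1, gap):
            filled[left + k] = dates[left] + timedelta(days=round(span * k / gap))
    return filled

# The example from the text: 7 undated rows between March 13 and June 16
# (placeholder year, assumed for illustration)
rows = [date(1840, 3, 13)] + [None] * 7 + [date(1840, 6, 16)]

for d in spread_evenly(rows):
    print(d)
```

Forward filling keeps every row tied to a date that appears in the text, while even spacing produces estimated dates that the journal itself never states, so it may be worth keeping a column that marks which dates are imputed.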