# Book Preprocessing and Chunking

## 1. Introduction and Preliminary Work

This notebook details the first step in our analysis: dataset creation. We collected our own dataset using the following steps:
- Selected books by feminist writers that are publicly available on the Project Gutenberg website
- Collected titles, authors, release years and the corresponding links in `book_links.csv` (data/)
- Extracted each book's content into a text file using `extract_data.py` (src/)
- Saved the .txt files in (data/raw/), along with `metadata.csv` (data/)

These raw .txt files are not included in our final deliverable; we include their cleaned versions instead, which are produced in Step 3 of this notebook. All other files mentioned above can be found in the (extra/) folder we sent.

Our task here is to process the books carefully: removing unnecessary content, normalizing the text, tokenizing it and splitting it into chunks. The final outputs are two datasets, `processed_book_chunks.csv` and `processed_book_chunks_context.csv` (data/processed/), which are then used for our 3-part analysis.

**Summary of Preprocessing and Chunking:**
- We cleaned the text files by removing the standard Gutenberg header and footer
- We normalized line endings and whitespace
- We manually cleaned the text contents that could not be removed in a standard way (data/cleaned/)
- We performed chunking with and without context, tokenizing and splitting sections in batches for efficiency
- The two outputs are dataframes containing the text chunk (~500 words), title, author, year of release and chunk ID

## 2. Setup and File Paths

In [4]:
# standard libraries
from pathlib import Path
import pandas as pd
import spacy
import re
import shutil
import os
from tqdm import tqdm

# load NLP model (a lighter model to start with; can later be swapped for en_core_web_trf)
nlp = spacy.load("en_core_web_sm")



In [5]:
# set base repository path (parents[1] is two levels above the current working directory)
REPO_DIR = Path(".").resolve().parents[1]

# key data paths
RAW_DATA_DIR = REPO_DIR / "data" / "raw"
PROCESSED_DIR = REPO_DIR / "data" / "processed"
BOOK_LINKS_PATH = REPO_DIR / "data" / "book_links.csv"
METADATA_PATH = REPO_DIR / "data" / "metadata.csv"

# confirm setup
print("Repository Path:", REPO_DIR)
print("Raw Data Directory:", RAW_DATA_DIR)
print("Processed Book Chunk Output:", PROCESSED_DIR)
print("Metadata CSV:", METADATA_PATH)

Repository Path: /Users/emmamora/Documents/GitHub/feminist_nlp
Raw Data Directory: /Users/emmamora/Documents/GitHub/feminist_nlp/data/raw
Processed Book Chunk Output: /Users/emmamora/Documents/GitHub/feminist_nlp/data/processed
Metadata CSV: /Users/emmamora/Documents/GitHub/feminist_nlp/data/metadata.csv


## 3. Clean Raw Text: Gutenberg Add-ons and Normalize

At this stage, we process the raw .txt files to remove boilerplate text added by Project Gutenberg and normalize formatting. The goals are:
- Remove standardized headers and footers using regular expressions.
- Normalize inconsistent newlines, tabs, and spacing for cleaner input downstream.
- Attempt to isolate the actual content of the book, excluding prefaces, introductions, chapter indices, etc.

We implemented a `clean_gutenberg_text()` function to handle the standardized parts of the Gutenberg text. It performs:
- Header removal using the `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` marker
- Footer removal using the `*** END OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` marker
- Normalization of line breaks and collapsing of excess whitespace

However, this was **not sufficient for all files**:
- Many books include content before the actual novel begins (e.g., prefaces, illustrations, chapter listings)
- These vary widely across titles and could not be reliably removed with a universal pattern
- For example, “Pride and Prejudice” includes a full preface that the regex cannot detect

Because of this variability, we decided to inspect and clean the start of each file manually:
- This ensures that only the actual narrative content is preserved
- Manual cleaning was done on the files saved in (data/cleaned/)

After this:
- We proceed with tokenization and chunking
- Further NLP steps such as lemmatization are not applied at this stage, to keep the text close to its original form

The cleaned files then serve as input for the downstream chunking and analysis tasks.
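Manual inspection can be sped up by listing likely starting points first. The sketch below is not part of our pipeline: `find_candidate_starts` and its heading pattern are hypothetical, and the pattern only covers headings such as `CHAPTER I`, `Chapter 1.` or `VOLUME I`. Every candidate still has to be checked by eye, because tables of contents contain the same headings.

```python
import re

# Hypothetical heading pattern: "CHAPTER I", "Chapter 1.", "VOLUME I", ...
HEADING_RE = re.compile(r"^\s*(?:chapter|volume)\s+(?:\d+|[ivxlc]+)\b", re.IGNORECASE)


def find_candidate_starts(text: str, max_candidates: int = 5) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines that look like chapter headings."""
    candidates = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if HEADING_RE.match(line):
            candidates.append((line_number, line.strip()))
            if len(candidates) == max_candidates:
                break
    return candidates


# Synthetic sample: a preface followed by the first chapter
sample = (
    "PREFACE.\n\nSome editorial remarks.\n\n"
    "Chapter I.\n\nIt is a truth universally acknowledged..."
)
for line_number, heading in find_candidate_starts(sample):
    print(line_number, heading)
```

Books without chapter headings, such as "The Yellow Wallpaper", return no candidates and have to be inspected from the top.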

In [4]:
def clean_gutenberg_text(text: str) -> str:
    """Strip the Project Gutenberg header/footer and normalize whitespace."""
    # remove Gutenberg header: keep everything after the START marker
    text = re.split(r'\*\*\* START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK .* \*\*\*', text, flags=re.IGNORECASE)[-1]

    # remove Gutenberg footer: keep everything before the END marker
    text = re.split(r'\*\*\* END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK .* \*\*\*', text, flags=re.IGNORECASE)[0]

    # normalize line endings and whitespace
    text = text.replace('\r\n', '\n').replace('\r', '\n')          # normalize newlines
    text = re.sub(r'\n{2,}', '\n\n', text)                         # collapse runs of blank lines into a single blank line
    text = re.sub(r'[ \t]+', ' ', text)                            # collapse runs of spaces/tabs into one space
    text = text.strip()                                            # trim leading/trailing whitespace

    return text

In [5]:
# test file
sample_txt_path = RAW_DATA_DIR / "The_Yellow_Wallpaper.txt"

# load and clean it
with open(sample_txt_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

cleaned_text = clean_gutenberg_text(raw_text)

# show result
print("Original length:", len(raw_text))
print("Cleaned length :", len(cleaned_text))
print("\n--- Cleaned Sample ---\n")
print(cleaned_text[:1000])

Original length: 31602
Cleaned length : 31496

--- Cleaned Sample ---

The Yellow Wallpaper

By Charlotte Perkins Gilman

It is very seldom that mere ordinary people like John and myself secure
ancestral halls for the summer.

A colonial mansion, a hereditary estate, I would say a haunted house,
and reach the height of romantic felicity—but that would be asking too
much of fate!

Still I will proudly declare that there is something queer about it.

Else, why should it be let so cheaply? And why have stood so long
untenanted?

John laughs at me, of course, but one expects that in marriage.

John is practical in the extreme. He has no patience with faith, an
intense horror of superstition, and he scoffs openly at any talk of
things not to be felt and seen and put down in figures.

John is a physician, and perhaps—(I would not say it to a living soul,
of course, but this is dead paper and a great relief to my
mind)—perhaps that is one reason I do not get well faster.

You see, he does not

In [6]:
# test file
sample_txt_path = RAW_DATA_DIR / "Pride_and_Prejudice.txt"

# load and clean it
with open(sample_txt_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

cleaned_text = clean_gutenberg_text(raw_text)

# show result
print("Original length:", len(raw_text))
print("Cleaned length :", len(cleaned_text))
print("\n--- Cleaned Sample ---\n")
print(cleaned_text[:1000])

Original length: 728842
Cleaned length : 721387

--- Cleaned Sample ---

[Illustration:

 GEORGE ALLEN
 PUBLISHER

 156 CHARING CROSS ROAD
 LONDON

 RUSKIN HOUSE
 ]

 [Illustration:

 _Reading Jane’s Letters._ _Chap 34._
 ]

 PRIDE.
 and
 PREJUDICE

 by
 Jane Austen,

 with a Preface by
 George Saintsbury
 and
 Illustrations by
 Hugh Thomson

 [Illustration: 1894]

 Ruskin 156. Charing
 House. Cross Road.

 London
 George Allen.

 CHISWICK PRESS:--CHARLES WHITTINGHAM AND CO.
 TOOKS COURT, CHANCERY LANE, LONDON.

 [Illustration:

 _To J. Comyns Carr
 in acknowledgment of all I
 owe to his friendship and
 advice, these illustrations are
 gratefully inscribed_

 _Hugh Thomson_
 ]

PREFACE.

[Illustration]

_Walt Whitman has somewhere a fine and just distinction between “loving
by allowance” and “loving with personal love.” This distinction applies
to books as well as to men and women; and in the case of the not very
numerous authors who are the objects of the personal affection, it
brings

In [6]:
# create data/cleaned/: the automatically cleaned files are saved here and then cleaned further by hand
CLEANED_DATA_DIR = REPO_DIR / "data" / "cleaned"
CLEANED_DATA_DIR.mkdir(parents=True, exist_ok=True)
print(f"Cleaned directory: {CLEANED_DATA_DIR}")

Cleaned directory: /Users/emmamora/Documents/GitHub/feminist_nlp/data/cleaned


In [10]:
# loop through each .txt file and clean + save
for txt_file in RAW_DATA_DIR.glob("*.txt"):
    with open(txt_file, "r", encoding="utf-8") as f:
        raw_text = f.read()

    cleaned_text = clean_gutenberg_text(raw_text)

    cleaned_file_path = CLEANED_DATA_DIR / txt_file.name
    with open(cleaned_file_path, "w", encoding="utf-8") as f:
        f.write(cleaned_text)

    print(f"Cleaned: {txt_file.name}")

Cleaned: Herland.txt
Cleaned: The_Voyage_Out.txt
Cleaned: The_Womans_Bible.txt
Cleaned: A_Vindication_of_the_Rights_of_Woman.txt
Cleaned: Narrative_of_Sojourner_Truth.txt
Cleaned: Woman_in_the_Nineteenth_Century.txt
Cleaned: Strife_and_Peace.txt
Cleaned: Mrs_Dalloway.txt
Cleaned: Wuthering_Heights.txt
Cleaned: An_OldFashioned_Girl.txt
Cleaned: The_Mill_on_the_Floss.txt
Cleaned: The_Yellow_Wallpaper.txt
Cleaned: Eighty_Years_and_More.txt
Cleaned: Emma.txt
Cleaned: Pride_and_Prejudice.txt
Cleaned: The_Awakening.txt
Cleaned: Little_Women.txt
Cleaned: Mary_A_Fiction.txt
Cleaned: Middlemarch.txt
Cleaned: Bayou_Folk.txt
Cleaned: Jane_Eyre.txt


## 4. Chunking: Process Text into Final Dataset

After cleaning, we proceed with tokenizing and chunking each book’s content to prepare for analysis. The goal is to transform long book texts into manageable, context-rich units (~500 words each), while preserving relevant metadata for each chunk.

**Steps Overview:**
- Load the metadata from metadata.csv and align it with cleaned .txt files
- Define functions for sentence tokenization and word-based chunking
- Tokenize using spaCy in batches (to efficiently handle large books like Middlemarch)
- Generate two datasets:
	- `processed_book_chunks_context.csv`: with a 50-word overlap between consecutive chunks
	- `processed_book_chunks.csv`: without overlap (independent chunks)
- Store both datasets for downstream use in the analysis notebooks
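The alignment between `metadata.csv` and the cleaned files can be checked before chunking starts. The following is a minimal sketch rather than code from our pipeline: `check_alignment` is a hypothetical helper, and the temporary directory stands in for (data/cleaned/).

```python
import tempfile
from pathlib import Path


def check_alignment(metadata_filenames, text_dir):
    """Compare the filenames listed in the metadata with the .txt files on disk."""
    listed = set(metadata_filenames)
    on_disk = {p.name for p in Path(text_dir).glob("*.txt")}
    return {
        "missing_files": sorted(listed - on_disk),     # in metadata, not on disk
        "missing_metadata": sorted(on_disk - listed),  # on disk, not in metadata
    }


# Toy demonstration with a throwaway directory (stand-in for data/cleaned/)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Emma.txt", "Herland.txt"]:
        (Path(tmp) / name).write_text("placeholder", encoding="utf-8")
    report = check_alignment(["Emma.txt", "Jane_Eyre.txt"], tmp)

print(report)
```

Once the metadata is loaded with `filename` as its index, the equivalent call would be `check_alignment(metadata.index, CLEANED_DATA_DIR)`; two empty lists mean every book has both a file and a metadata row.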

**Tokenization and Chunking Functions**
- `tokenize_sentences()`:
	- Splits the text into paragraphs
	- Uses spaCy’s pipe() method to batch-process and tokenize paragraphs into sentences
- `chunk_sentences()`:
	- Groups tokenized sentences into ~500 word chunks
	- Allows optional overlap between chunks to preserve continuity
	- Returns plain text chunks; metadata (title, author, year, chunk ID) is attached to each chunk afterwards, in the processing loop

**Execution**

We ran the pipeline twice:
1. With 50-word overlap (`processed_book_chunks_context.csv`): Ensures semantic continuity across adjacent chunks, useful for contextual models.
2. Without overlap (`processed_book_chunks.csv`): Ensures strict independence between chunks, useful for classification or indexing tasks.
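To make the difference between the two runs concrete, here is a simplified sketch of sentence-based chunking with toy parameters (`max_words=8`, `overlap=3`, instead of 500 and 50). `chunk_with_overlap` is an illustrative name, not the function used in the notebook, and the sentences are made up.

```python
def chunk_with_overlap(sentences, max_words, overlap=0):
    """Group sentences into chunks of at most max_words words; optionally start
    each new chunk with the last `overlap` words of the previous chunk."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n_words = len(sent.split())
        # flush the current chunk if this sentence would push it over the limit
        if current and count + n_words > max_words:
            chunk_text = " ".join(current)
            chunks.append(chunk_text)
            carried = chunk_text.split()[-overlap:] if overlap > 0 else []
            current = [" ".join(carried)] if carried else []
            count = len(carried)  # carried words count towards the next chunk
        current.append(sent)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["one two three four.", "five six seven.", "eight nine ten eleven.", "twelve."]
with_context = chunk_with_overlap(sentences, max_words=8, overlap=3)
without_context = chunk_with_overlap(sentences, max_words=8, overlap=0)

print(with_context)
# ['one two three four. five six seven.', 'five six seven. eight nine ten eleven. twelve.']
print(without_context)
# ['one two three four. five six seven.', 'eight nine ten eleven. twelve.']
```

Note that the carried-over words use up part of the next chunk's word budget, so a chunk with context holds slightly less new text than one without.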

Both versions include:
- `chunk`: the raw text of the chunk
- `title`: book title
- `author`: author name
- `year`: publication year
- `chunk_id`: an identifier formatted as AuthorInitials_ChunkNumber (e.g. `JA_001`); numbering restarts for each book, so it is unique only in combination with `title`
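The ID format can be reproduced in a few lines. In the sketch below, `make_chunk_id` follows the format used in this notebook, while `make_scoped_chunk_id` and its use of the metadata `id` column as a prefix are assumptions for illustration only; they show one possible way to obtain IDs that stay distinct across several books by the same author.

```python
def make_chunk_id(author: str, chunk_number: int) -> str:
    """Notebook format: initials of the first two name words plus a 3-digit counter."""
    author_code = "".join(word[0].upper() for word in author.split()[:2])
    return f"{author_code}_{chunk_number:03d}"


def make_scoped_chunk_id(book_id: str, chunk_number: int) -> str:
    """Hypothetical alternative: prefix with the book id instead of the author initials."""
    return f"{book_id}_{chunk_number:03d}"


print(make_chunk_id("Jane Austen", 1))                 # JA_001 (same for both Austen novels)
print(make_scoped_chunk_id("Emma", 1))                 # Emma_001
print(make_scoped_chunk_id("Pride_and_Prejudice", 1))  # Pride_and_Prejudice_001
```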

This structure forms the foundation of our subsequent exploratory and modeling notebooks.

In [13]:
# set path for final dataset
OUTPUT_PATH_CONTEXT = PROCESSED_DIR / "processed_book_chunks_context.csv"
OUTPUT_PATH_NO_CONTEXT = PROCESSED_DIR / "processed_book_chunks.csv"

# load metadata
metadata = pd.read_csv(METADATA_PATH)
metadata.set_index("filename", inplace=True)

# checks
print("Metadata loaded. Example:")
display(metadata.head())

Metadata loaded. Example:


| filename | id | title | author | year |
|---|---|---|---|---|
| Pride_and_Prejudice.txt | Pride_and_Prejudice | Pride and Prejudice | Jane Austen | 1813 |
| Emma.txt | Emma | Emma | Jane Austen | 1815 |
| Jane_Eyre.txt | Jane_Eyre | Jane Eyre | Charlotte Brontë | 1847 |
| Wuthering_Heights.txt | Wuthering_Heights | Wuthering Heights | Emily Brontë | 1847 |
| Middlemarch.txt | Middlemarch | Middlemarch | George Eliot | 1871 |


In [15]:
# functions for tokenization and chunking

def tokenize_sentences(text, batch_size=20, show_progress=False):
    """Split text into sentences with spaCy, processing paragraphs in batches via nlp.pipe.

    Working paragraph by paragraph avoids building one huge Doc for large books like Middlemarch.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # show progress bar if requested
    if show_progress:
        paragraphs = tqdm(paragraphs, desc="Paragraphs")
    sentences = []
    for doc in nlp.pipe(paragraphs, batch_size=batch_size):
        sentences.extend([sent.text.strip() for sent in doc.sents if sent.text.strip()])
    return sentences

def chunk_sentences(sentences, max_words=500, overlap=0, debug=False):
    """Split a list of sentences into word-limited chunks, optionally carrying the
    last `overlap` words of each chunk into the next one for context retention."""
    chunks = []
    current_chunk = []
    current_word_count = 0

    for sent in sentences:
        words_in_sent = len(sent.split())

        # if adding this sentence exceeds max_words, save current chunk
        # (the current_chunk check avoids emitting an empty chunk when the
        # very first sentence is already longer than max_words)
        if current_chunk and current_word_count + words_in_sent > max_words:
            chunk_text = " ".join(current_chunk)
            chunks.append(chunk_text)

            # reset, but include overlap from end of previous chunk
            if overlap > 0:
                overlap_tokens = chunk_text.split()[-overlap:]
                current_chunk = [" ".join(overlap_tokens)]
                current_word_count = len(overlap_tokens)
            else:
                current_chunk = []
                current_word_count = 0

        current_chunk.append(sent)
        current_word_count += words_in_sent

    # add the final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    # debug first 3 chunks (optional)
    if debug:
        for i, chunk in enumerate(chunks[:3]):
            print(f"\n--- Chunk {i+1} (words: {len(chunk.split())}) ---\n{chunk[:300]}...\n")

    return chunks

In [9]:
for book_file in CLEANED_DATA_DIR.glob("*.txt"):
    book_name = book_file.name
    print(f"Processing {book_name}...") 


Processing Herland.txt...
Processing The_Voyage_Out.txt...
Processing The_Womans_Bible.txt...
Processing A_Vindication_of_the_Rights_of_Woman.txt...
Processing Narrative_of_Sojourner_Truth.txt...
Processing Woman_in_the_Nineteenth_Century.txt...
Processing Strife_and_Peace.txt...
Processing Mrs_Dalloway.txt...
Processing Wuthering_Heights.txt...
Processing An_OldFashioned_Girl.txt...
Processing The_Mill_on_the_Floss.txt...
Processing The_Yellow_Wallpaper.txt...
Processing Eighty_Years_and_More.txt...
Processing Emma.txt...
Processing Pride_and_Prejudice.txt...
Processing The_Awakening.txt...
Processing Little_Women.txt...
Processing Mary_A_Fiction.txt...
Processing Middlemarch.txt...
Processing Bayou_Folk.txt...
Processing Jane_Eyre.txt...


In [11]:
# process all cleaned books => CONTEXT CHUNKS
all_chunk_data = []

book_files = list(CLEANED_DATA_DIR.glob("*.txt"))

for book_file in tqdm(book_files, desc="Processing books"):
    book_name = book_file.name
    if book_name not in metadata.index:
        print(f"Skipping {book_name} (no metadata)")
        continue

    with open(book_file, "r", encoding="utf-8") as f:
        text = f.read()

    # tokenize
    sentences = tokenize_sentences(text, show_progress=False)

    # chunk
    chunks = chunk_sentences(sentences, overlap=50, debug=True)

    # metadata
    meta = metadata.loc[book_name]
    title, author, year = meta["title"], meta["author"], meta["year"]
    author_code = "".join([w[0].upper() for w in author.split()[:2]])

    # collect chunks
    for i, chunk in enumerate(chunks):
        chunk_id = f"{author_code}_{i+1:03d}"
        all_chunk_data.append({
            "chunk": chunk,
            "title": title,
            "author": author,
            "year": year,
            "chunk_id": chunk_id
        })

print(f"Finished processing: {len(all_chunk_data)} total chunks from {len(book_files)} books.")

Processing books:   5%|▍         | 1/21 [00:06<02:13,  6.68s/it]


--- Chunk 1 (words: 466) ---
CHAPTER 1.
A Not Unnatural Enterprise This is written from memory, unfortunately. If I could have brought with
me the material I so carefully prepared, this would be a very different
story. Whole books full of notes, carefully copied records, firsthand
descriptions, and the pictures--that’s the wors...


--- Chunk 2 (words: 499) ---
few things that don’t. We three had a chance to join a big scientific expedition. They needed a doctor, and that gave Jeff an excuse for dropping his just opening practice; they needed Terry’s experience, his machine, and his money; and as for me, I got in through Terry’s influence. The expedition w...


--- Chunk 3 (words: 499) ---
I had understood, so I showed him a red and blue pencil I carried, and asked again. Yes, he pointed to the river, and then to the southwestward. “River--good water--red and blue.” Terry was close by and interested in the fellow’s pointing. “What does he say, Van?” I told him. Terry blazed up at once

Processing books:  10%|▉         | 2/21 [00:23<04:03, 12.81s/it]


--- Chunk 1 (words: 475) ---
CHAPTER I As the streets that lead from the Strand to the Embankment are very
narrow, it is better not to walk down them arm-in-arm. If you persist,
lawyers’ clerks will have to make flying leaps into the mud; young lady
typists will have to fidget behind you. In the streets of London where
beauty g...


--- Chunk 2 (words: 482) ---
case they should proceed to tease his wife, Mr. Ambrose flourished his stick at them, upon which they decided that he was grotesque merely, and four instead of one cried “Bluebeard!” in chorus. Although Mrs. Ambrose stood quite still, much longer than is natural, the little boys let her be. Some one...


--- Chunk 3 (words: 447) ---
already occupied by two city men. The fixity of her mood was broken by the action of walking. The shooting motor cars, more like spiders in the moon than terrestrial objects, the thundering drays, the jingling hansoms, and little black broughams, made her think of the world she lived in. Somewhere u

Processing books:  14%|█▍        | 3/21 [00:39<04:16, 14.26s/it]


--- Chunk 1 (words: 494) ---
PART I. Comments on Genesis, Exodus, Leviticus, Numbers and Deuteronomy. "In every soul there is bound up some truth and some error, and each
gives to the world of thought what no other one possesses."--Cousin. 1898. By Elizabeth Cady Stanton REVISING COMMITTEE. "We took sweet counsel together. "--P...


--- Chunk 2 (words: 498) ---
the work of the various committees into one consistent whole. IV. The completed work will be submitted to an advisory committee assembled at some central point, as London, New York, or Chicago, to sit in final judgment on "The Woman's Bible." As to the manner of doing the practical work: Those who h...


--- Chunk 3 (words: 472) ---
transfigure this mournful object of pity into an exalted, dignified personage, worthy our worship as the mother of the race, are to be congratulated as having a share of the occult mystic power of the eastern Mahatmas. The plain English to the ordinary mind admits of no such liberal interpretation. 

Processing books:  19%|█▉        | 4/21 [00:49<03:28, 12.29s/it]


--- Chunk 1 (words: 487) ---
INTRODUCTION. After considering the historic page, and viewing the living world
with anxious solicitude, the most melancholy emotions of sorrowful
indignation have depressed my spirits, and I have sighed when
obliged to confess, that either nature has made a great difference
between man and man, or ...


--- Chunk 2 (words: 477) ---
way, and I cannot pass it over without subjecting the main tendency of my reasoning to misconstruction, I shall stop a moment to deliver, in a few words, my opinion. In the government of the physical world, it is observable that the female, in general, is inferior to the male. The male pursues, the ...


--- Chunk 3 (words: 495) ---
spread corruption through the whole mass of society! As a class of mankind they have the strongest claim to pity! the education of the rich tends to render them vain and helpless, and the unfolding mind is not strengthened by the practice of those duties which dignify the human character. They only 

Processing books:  24%|██▍       | 5/21 [00:53<02:30,  9.41s/it]


--- Chunk 1 (words: 441) ---
HER BIRTH AND PARENTAGE. THE subject of this biography, SOJOURNER TRUTH, as she now calls
herself-but whose name, originally, was Isabella-was born, as near as
she can now calculate, between the years 1797 and 1800. She was the
daughter of James and Betsey, slaves of one Colonel Ardinburgh, Hurley,
...


--- Chunk 2 (words: 488) ---
cellar, and sees its inmates, of both sexes and all ages, sleeping on those damp boards, like the horse, with a little straw and a blanket; and she wonders not at the rheumatisms, and fever-sores, and palsies, that distorted the limbs and racked the bodies of those fellow-slaves in after-life. Still...


--- Chunk 3 (words: 485) ---
so near at hand, but of which his parents had an uncertain, but all the more cruel foreboding. There was snow on the ground, at the time of which we are speaking; and a large old-fashioned sleigh was seen to drive up to the door of the late Col. Ardinburgh. This event was noticed with childish pleas

Processing books:  29%|██▊       | 6/21 [01:06<02:39, 10.62s/it]


--- Chunk 1 (words: 492) ---
INTRODUCTION. * * * * * The problem of Woman's position, or "sphere,"--of her duties,
responsibilities, rights and immunities as Woman,--fitly attracts a
large and still-increasing measure of attention from the thinkers and
agitators of our time, The legislators, so called,--those who
ultimately ena...


--- Chunk 2 (words: 445) ---
which springs from the ripening of profound reflection into assured conviction. She wrote as one who had observed, and who deeply felt what she deliberately uttered. Others have since spoken more fluently, more variously, with a greater affluence of illustration; but none, it is believed, more earne...


--- Chunk 3 (words: 474) ---
preferred, partly for the reason others do not like it,--that is, that it requires some thought to see what it means, and might thus prepare the reader to meet me on my own ground. Besides, it offers a larger scope, and is, in that way, more just to my desire. I meant by that
title to intimate the f

Processing books:  33%|███▎      | 7/21 [01:12<02:08,  9.17s/it]


--- Chunk 1 (words: 500) ---
OLD NORWAY. Still the old tempests rage around the mountains,
 And ocean's billows as of old appear;
 The roaring wood and the resounding fountains
 Time has not silenced in his long career,
 For Nature is the same as ever. MUNCH. The shadow of God wanders through Nature. LINNÆUS. Before yet a song ...


--- Chunk 2 (words: 481) ---
to the fresh mighty throbbing of the heart of nature; alone with the quiet, calm, and yet so eloquent, objects of nature, and there wilt thou gain strength and life! There falls no dust. Fresh and clear stand the thoughts of life there, as in the days of their creation. "Wilt thou behold the great a...


--- Chunk 3 (words: 496) ---
down upon a younger generation; observe in these valleys the morning and evening play of colours upon the heights, in the depths; see the affluent pomp of the storm; see the calm magnificence of the rainbow, as it vaults itself over the waterfall,--depressed spirit, see this, understand it, and---- 

Processing books:  38%|███▊      | 8/21 [01:20<01:53,  8.73s/it]


--- Chunk 1 (words: 469) ---
Mrs. Dalloway said she would buy the flowers herself. For Lucy had her work cut out for her. The doors would be taken off
their hinges; Rumpelmayer’s men were coming. And then, thought Clarissa
Dalloway, what a morning--fresh as if issued to children on a beach. What a lark! What a plunge! For so it...


--- Chunk 2 (words: 500) ---
making it up, building it round one, tumbling it, creating it every moment afresh; but the veriest frumps, the most dejected of miseries sitting on doorsteps (drink their downfall) do the same; can’t be dealt with, she felt positive, by Acts of Parliament for that very reason: they love life. In peo...


--- Chunk 3 (words: 461) ---
as children. “Where are you off to?” “I love walking in London,” said Mrs. Dalloway. “Really it’s better than walking in the country.” They had just come up--unfortunately--to see doctors. Other people came to see pictures; go to the opera; take their daughters out; the Whitbreads came “to see docto

Processing books:  43%|████▎     | 9/21 [01:34<02:06, 10.55s/it]


--- Chunk 1 (words: 481) ---
CHAPTER I 1801—I have just returned from a visit to my landlord—the solitary
neighbour that I shall be troubled with. This is certainly a beautiful
country! In all England, I do not believe that I could have fixed on a
situation so completely removed from the stir of society. A perfect
misanthropist...


--- Chunk 2 (words: 485) ---
times, indeed: one may guess the power of the north wind, blowing over the edge, by the excessive slant of a few stunted firs at the end of the house; and by a range of gaunt thorns all stretching their limbs one way, as if craving alms of the sun. Happily, the architect had foresight to build
it st...


--- Chunk 3 (words: 479) ---
arm-chair, his mug of ale frothing on the round table before him, is to be seen in any circuit of five or six miles among these hills, if you go at the right time after dinner. But Mr. Heathcliff forms a singular contrast to his abode and style of living. He is a dark-skinned gipsy in aspect, in dre

Processing books:  48%|████▊     | 10/21 [01:47<02:03, 11.20s/it]


--- Chunk 1 (words: 475) ---
CHAPTER I. POLLY ARRIVES “IT'S time to go to the station, Tom.” “Come on, then.” “Oh, I'm not going; it's too wet. Should n't have a crimp left if I
went out such a day as this; and I want to look nice when Polly comes.” “You don't expect me to go and bring home a strange girl alone, do you?”
 And T...


--- Chunk 2 (words: 474) ---
fellows” know too well. “Do go along, or you'll be too late; and then, what will Polly think of me?” cried Fanny, with the impatient poke which is peculiarly aggravating to masculine dignity. “She'll think you cared more about your frizzles than your friends, and she'll be about right, too.” Feeling...


--- Chunk 3 (words: 495) ---
behind him made him turn in time to see a fresh-faced little girl running down the long station, and looking as if she rather liked it. As she smiled, and waved her bag at him, he stopped and waited for her, saying to himself, “Hullo! I wonder if that's Polly?” Up came the little girl, with her hand

Processing books:  52%|█████▏    | 11/21 [02:12<02:34, 15.50s/it]


--- Chunk 1 (words: 488) ---
BOOK FIRST BOY AND GIRL. Chapter I. Outside Dorlcote Mill A wide plain, where the broadening Floss hurries on between its green
banks to the sea, and the loving tide, rushing to meet it, checks its
passage with an impetuous embrace. On this mighty tide the black
ships—laden with the fresh-scented fi...


--- Chunk 2 (words: 489) ---
the booming of the mill bring a dreamy deafness, which seems to heighten the peacefulness of the scene. They are like a great curtain of sound, shutting one out from the world beyond. And now there is the thunder of the huge covered wagon coming home with sacks of grain. That honest wagoner is think...


--- Chunk 3 (words: 483) ---
in the left-hand parlour, on that very afternoon I have been dreaming of. Chapter II. Mr Tulliver, of Dorlcote Mill, Declares His Resolution about Tom “What I want, you know,” said Mr Tulliver,—“what I want is to give Tom a good eddication; an eddication as’ll be a bread to him. That was what
I was 

Processing books:  57%|█████▋    | 12/21 [02:13<01:39, 11.01s/it]


--- Chunk 1 (words: 489) ---
It is very seldom that mere ordinary people like John and myself secure
ancestral halls for the summer. A colonial mansion, a hereditary estate, I would say a haunted house,
and reach the height of romantic felicity—but that would be asking too
much of fate! Still I will proudly declare that there i...


--- Chunk 2 (words: 496) ---
were greenhouses, too, but they are all broken now. There was some legal trouble, I believe, something about the heirs and co-heirs; anyhow, the place has been empty for years. That spoils my ghostliness, I am afraid; but I don’t care—there is something strange about the house—I can feel it. I even ...


--- Chunk 3 (words: 489) ---
is repellant, almost revolting; a smouldering, unclean yellow, strangely faded by the slow-turning sunlight. It is a dull yet lurid orange in some places, a sickly sulphur tint in others. No wonder the children hated it! I should hate it myself if I had to live in this room long. There comes John, a

Processing books:  62%|██████▏   | 13/21 [02:28<01:38, 12.35s/it]


--- Chunk 1 (words: 485) ---
CHAPTER I. CHILDHOOD. The psychical growth of a child is not influenced by days and years, but
by the impressions passing events make on its mind. What may prove a
sudden awakening to one, giving an impulse in a certain direction that
may last for years, may make no impression on another. People won...


--- Chunk 2 (words: 493) ---
New York, was elected to Congress. Perhaps the excitement of a political campaign, in which my mother took the deepest interest, may have had an influence on my prenatal life and given me the strong desire that I have always felt to participate in the rights and duties of government. My father was a...


--- Chunk 3 (words: 495) ---
the reader will see that, under such conditions, nothing but strong self-will and a good share of hope and mirthfulness could have saved an ordinary child from becoming a mere nullity. The first event engraved on my memory was the birth of a sister when I was four years old. It was a cold morning in

Processing books:  67%|██████▋   | 14/21 [02:47<01:40, 14.38s/it]


--- Chunk 1 (words: 487) ---
VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her. She was the youngest of the two daughters...


--- Chunk 2 (words: 484) ---
of every day. She recalled her past kindness—the kindness, the affection of sixteen years—how she had taught and how she had played with her from five years old—how she had devoted all her powers to attach and amuse her in health—and how nursed her through the various illnesses of childhood. A large...


--- Chunk 3 (words: 498) ---
one among them who could be accepted in lieu of Miss Taylor for even half a day. It was a melancholy change; and Emma could not but sigh over it, and wish for impossible things, till her father awoke, and made it necessary to be cheerful. His spirits required support. He
was a nervous man, easily de

Processing books:  71%|███████▏  | 15/21 [03:03<01:27, 14.59s/it]


--- Chunk 1 (words: 497) ---
Chapter I. It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families,...


--- Chunk 2 (words: 498) ---
not often much beauty to think of.” “But, my dear, you must indeed go and see Mr. Bingley when he comes into the neighbourhood.” “It is more than I engage for, I assure you.” “But consider your daughters. Only think what an establishment it would be for one of them. Sir William and Lady Lucas are de...


--- Chunk 3 (words: 500) ---
of those who waited on Mr. Bingley. He had always intended to visit him, though to the last always assuring his wife that he should not go; and till the evening after the visit was paid she had no knowledge of it. It was then disclosed in the following manner. Observing his second daughter employed 

Processing books:  76%|███████▌  | 16/21 [03:10<01:02, 12.59s/it]


--- Chunk 1 (words: 483) ---
I A green and yellow parrot, which hung in a cage outside the door, kept
repeating over and over: “_Allez vous-en! Allez vous-en! Sapristi!_ That’s all right!” He could speak a little Spanish, and also a language which nobody
understood, unless it was the mocking-bird that hung on the other side
of ...


--- Chunk 2 (words: 497) ---
persons of the _pension_ had gone over to the _Chênière Caminada_ in Beaudelet’s lugger to hear mass. Some young people were out under the water-oaks playing croquet. Mr. Pontellier’s two children were there—sturdy little fellows of four and five. A quadroon nurse followed them about with a faraway,...


--- Chunk 3 (words: 500) ---
holding it out to him. He accepted the sunshade, and lifting it over his head descended the steps and walked away. “Coming back to dinner?” his wife called after him. He halted a moment and shrugged his shoulders. He felt in his vest pocket; there was a ten-dollar bill there. He did not know; perhap

Processing books:  81%|████████  | 17/21 [03:34<01:02, 15.75s/it]


--- Chunk 1 (words: 500) ---
PART 1 CHAPTER ONE
PLAYING PILGRIMS “Christmas won’t be Christmas without any presents,” grumbled Jo, lying
on the rug. “It’s so dreadful to be poor!” sighed Meg, looking down at her old
dress. “I don’t think it’s fair for some girls to have plenty of pretty
things, and other girls nothing at all,” ...


--- Chunk 2 (words: 498) ---
but I do think washing dishes and keeping things tidy is the worst work in the world. It makes me cross, and my hands get so stiff, I can’t practice well at all.” And Beth looked at her rough hands with a sigh that any one could hear that time. “I don’t believe any of you suffer as I do,” cried Amy,...


--- Chunk 3 (words: 477) ---
look as prim as a China Aster! It’s bad enough to be a girl, anyway, when I like boy’s games and work and manners! I can’t get over my disappointment in not being a boy. And it’s worse than ever now, for I’m dying to go and fight with Papa. And
I can only stay home and knit, like a poky old woman!” 

Processing books:  86%|████████▌ | 18/21 [03:36<00:35, 11.88s/it]


--- Chunk 1 (words: 452) ---
MARY CHAP. I. Mary, the heroine of this fiction, was the daughter of Edward, who
married Eliza, a gentle, fashionable girl, with a kind of indolence in
her temper, which might be termed negative good-nature: her virtues,
indeed, were all of that stamp. She carefully attended to the _shews_ of
things...


--- Chunk 2 (words: 477) ---
called _hell_, the regions below; but whether her's was a mounting spirit, I cannot pretend to determine; or what sort of a planet would have been proper for her, when she left her _material_ part in this world, let metaphysicians settle; I have nothing to say to her unclothed spirit. As she was som...


--- Chunk 3 (words: 497) ---
that kind of _attendrissement_ which makes a person take pleasure in providing for the subsistence and comfort of a living creature; but it proceeded from vanity, it gave her an opportunity of lisping out the prettiest French expressions of ecstatic fondness, in accents that had never been attuned b

Processing books:  90%|█████████ | 19/21 [04:14<00:39, 19.70s/it]


--- Chunk 1 (words: 494) ---
PRELUDE. Who that cares much to know the history of man, and how the mysterious
mixture behaves under the varying experiments of Time, has not dwelt,
at least briefly, on the life of Saint Theresa, has not smiled with
some gentleness at the thought of the little girl walking forth one
morning hand-i...


--- Chunk 2 (words: 491) ---
the living stream in fellowship with its own oary-footed kind. Here and there is born a Saint Theresa, foundress of nothing, whose loving heart-beats and sobs after an unattained goodness tremble off and are dispersed among hindrances, instead of centring in some long-recognizable deed. BOOK I. MISS...


--- Chunk 3 (words: 500) ---
by heart; and to her the destinies of mankind, seen by the light of Christianity, made the solicitudes of feminine fashion appear an occupation for Bedlam. She could not reconcile the anxieties of a spiritual life involving eternal consequences, with a keen interest in gimp and artificial protrusion

Processing books:  95%|█████████▌| 20/21 [04:22<00:16, 16.01s/it]


--- Chunk 1 (words: 476) ---
A NO-ACCOUNT CREOLE I. One agreeable afternoon in late autumn two young men stood together on
Canal Street, closing a conversation that had evidently begun within
the club-house which they had just quitted. "There's big money in it, Offdean," said the elder of the two. "I would
n't have you touch it...


--- Chunk 2 (words: 488) ---
business man may be said alternately to exist, and which reduce him, naturally, to a rather ragged condition of soul. Offdean had done, in a temperate way, the usual things which young men do who happen to belong to good society, and are possessed of moderate means and healthy instincts. He had gone...


--- Chunk 3 (words: 500) ---
in a slovenly fashion, but so rich that cotton and corn and weed and "cocoa-grass" grew rampant if they had only the semblance of a chance. The negro quarters were at the far end of this open stretch, and consisted of a long row of old and very crippled cabins. Directly back of these a dense wood gr

Processing books: 100%|██████████| 21/21 [04:45<00:00, 13.60s/it]


--- Chunk 1 (words: 498) ---
CHAPTER I There was no possibility of taking a walk that day. We had been
wandering, indeed, in the leafless shrubbery an hour in the morning;
but since dinner (Mrs. Reed, when there was no company, dined early)
the cold winter wind had brought with it clouds so sombre, and a rain
so penetrating, th...


--- Chunk 2 (words: 487) ---
I was, I could not pass quite as a blank. They were those which treat of the haunts of sea-fowl; of “the solitary rocks and promontories” by them only inhabited; of the coast of Norway, studded with isles from its southern extremity, the Lindeness, or Naze, to the North Cape— “Where the Northern Oce...


--- Chunk 3 (words: 450) ---
soon. The breakfast-room door opened. “Boh! Madam Mope!” cried the voice of John Reed; then he paused: he found the room apparently empty. “Where the dickens is she!” he continued. “Lizzy! Georgy! (calling to his sisters) Joan is not here: tell mama she is run out into the rain—bad animal!” “It is w




In [12]:
# save chunks to csv
df_chunks = pd.DataFrame(all_chunk_data)
df_chunks.to_csv(OUTPUT_PATH_CONTEXT, index=False)

print(f"Final dataset saved to: {OUTPUT_PATH_CONTEXT}")
display(df_chunks.head())

Final dataset saved to: /Users/emmamora/Documents/GitHub/feminist_nlp/data/processed/processed_book_chunks.csv


|   | chunk | title | author | year | chunk_id |
|---|---|---|---|---|---|
| 0 | CHAPTER 1.\nA Not Unnatural Enterprise This is... | Herland | Charlotte Perkins Gilman | 1915 | CP_001 |
| 1 | few things that don’t. We three had a chance t... | Herland | Charlotte Perkins Gilman | 1915 | CP_002 |
| 2 | I had understood, so I showed him a red and bl... | Herland | Charlotte Perkins Gilman | 1915 | CP_003 |
| 3 | glass and squatted down to investigate. “Chemi... | Herland | Charlotte Perkins Gilman | 1915 | CP_004 |
| 4 | a dozen Montenegroes up and down these great r... | Herland | Charlotte Perkins Gilman | 1915 | CP_005 |
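
A quick sanity check on the saved frame can catch chunks that overshoot the ~500-word target. The sketch below is illustrative only. It assumes words are counted with a plain whitespace split, which may differ from the counter used in the notebook. `summarize_chunks` and the toy frame are hypothetical stand-ins, not part of the pipeline.

```python
import pandas as pd


def summarize_chunks(df: pd.DataFrame, limit: int = 500) -> pd.DataFrame:
    """Per-title chunk statistics; expects 'chunk' and 'title' columns."""
    # whitespace word count (an assumption; the notebook may count differently)
    n_words = df["chunk"].str.split().str.len()
    stats = (
        df.assign(n_words=n_words)
        .groupby("title")["n_words"]
        .agg(n_chunks="count", shortest="min", longest="max")
    )
    # flag titles that contain at least one chunk above the word limit
    stats["over_limit"] = stats["longest"] > limit
    return stats


# toy frame standing in for df_chunks (hypothetical data)
toy = pd.DataFrame({
    "chunk": ["one two three", "four five", "six"],
    "title": ["A", "A", "B"],
})
summary = summarize_chunks(toy, limit=2)
print(summary)
```

On the real data, the call would presumably be `summarize_chunks(df_chunks)` with the default limit of 500.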


In [16]:
# process books => NO CONTEXT CHUNKS
all_chunk_data = []

book_files = list(CLEANED_DATA_DIR.glob("*.txt"))

for book_file in tqdm(book_files, desc="Processing books"):
    book_name = book_file.name
    if book_name not in metadata.index:
        print(f"Skipping {book_name} (no metadata)")
        continue

    with open(book_file, "r", encoding="utf-8") as f:
        text = f.read()

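    # split the cleaned text into sentences
    sentences = tokenize_sentences(text, show_progress=False)

    # group sentences into chunks (debug=True prints a preview of the first chunks)
    chunks = chunk_sentences(sentences, debug=True)

    # look up the book's metadata by file name
    meta = metadata.loc[book_name]
    title, author, year = meta["title"], meta["author"], meta["year"]
    # author code: initials of the first two words of the author's name
    author_code = "".join(w[0].upper() for w in author.split()[:2])

    # collect chunks; numbering restarts at 001 for each book, so chunk_id
    # is not unique across several books by the same author
    for i, chunk in enumerate(chunks):
        chunk_id = f"{author_code}_{i+1:03d}"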
        all_chunk_data.append({
            "chunk": chunk,
            "title": title,
            "author": author,
            "year": year,
            "chunk_id": chunk_id
        })

print(f"Finished processing: {len(all_chunk_data)} total chunks from {len(book_files)} books.")

Processing books:   5%|▍         | 1/21 [00:06<02:12,  6.60s/it]


--- Chunk 1 (words: 466) ---
CHAPTER 1.
A Not Unnatural Enterprise This is written from memory, unfortunately. If I could have brought with
me the material I so carefully prepared, this would be a very different
story. Whole books full of notes, carefully copied records, firsthand
descriptions, and the pictures--that’s the wors...


--- Chunk 2 (words: 500) ---
The expedition was up among the thousand tributaries and enormous
hinterland of a great river, up where the maps had to be made, savage
dialects studied, and all manner of strange flora and fauna expected. But this story is not about that expedition. That was only the merest
starter for ours. My int...


--- Chunk 3 (words: 499) ---
It was early yet; we had just breakfasted; and leaving word that we’d
be back before night, we got away quietly, not wishing to be thought
too gullible if we failed, and secretly hoping to have some nice little
discovery all to ourselves. It was a long two hours, nearer three. I fancy the savage cou

Processing books:  10%|▉         | 2/21 [00:23<04:03, 12.82s/it]


--- Chunk 1 (words: 475) ---
CHAPTER I As the streets that lead from the Strand to the Embankment are very
narrow, it is better not to walk down them arm-in-arm. If you persist,
lawyers’ clerks will have to make flying leaps into the mud; young lady
typists will have to fidget behind you. In the streets of London where
beauty g...


--- Chunk 2 (words: 499) ---
Some one is always looking into the river
near Waterloo Bridge; a couple will stand there talking for half an
hour on a fine afternoon; most people, walking for pleasure,
contemplate for three minutes; when, having compared the occasion with
other occasions, or made some sentence, they pass on. Some...


--- Chunk 3 (words: 474) ---
She knew how to read the people who were passing her; there were the
rich who were running to and from each others’ houses at this hour;
there were the bigoted workers driving in a straight line to their
offices; there were the poor who were unhappy and rightly malignant. Already, though there was s

Processing books:  14%|█▍        | 3/21 [00:40<04:21, 14.53s/it]


--- Chunk 1 (words: 494) ---
PART I. Comments on Genesis, Exodus, Leviticus, Numbers and Deuteronomy. "In every soul there is bound up some truth and some error, and each
gives to the world of thought what no other one possesses."--Cousin. 1898. By Elizabeth Cady Stanton REVISING COMMITTEE. "We took sweet counsel together. "--P...


--- Chunk 2 (words: 471) ---
Those who have been engaged this summer have adopted the following
plan, which may be suggestive to new members of the committee. Each
person purchased two Bibles, ran through them from Genesis to
Revelations, marking all the texts that concerned women. The passages
were cut out, and pasted in a bla...


--- Chunk 3 (words: 476) ---
These familiar texts are quoted by clergymen in their pulpits, by
statesmen in the halls of legislation, by lawyers in the courts, and
are echoed by the press of all civilized nations, and accepted by woman
herself as "The Word of God." So perverted is the religious element in
her nature, that with 

Processing books:  19%|█▉        | 4/21 [00:49<03:33, 12.56s/it]


--- Chunk 1 (words: 487) ---
INTRODUCTION. After considering the historic page, and viewing the living world
with anxious solicitude, the most melancholy emotions of sorrowful
indignation have depressed my spirits, and I have sighed when
obliged to confess, that either nature has made a great difference
between man and man, or ...


--- Chunk 2 (words: 451) ---
The male pursues, the female yields--this is
the law of nature; and it does not appear to be suspended or
abrogated in favour of woman. This physical superiority cannot be
denied--and it is a noble prerogative! But not content with this
natural pre-eminence, men endeavour to sink us still lower, mer...


--- Chunk 3 (words: 498) ---
But as I purpose taking a separate view of the different ranks of
society, and of the moral character of women, in each, this hint
is, for the present, sufficient; and I have only alluded to the
subject, because it appears to me to be the very essence of an
introduction to give a cursory account of 

Processing books:  24%|██▍       | 5/21 [00:54<02:32,  9.53s/it]


--- Chunk 1 (words: 441) ---
HER BIRTH AND PARENTAGE. THE subject of this biography, SOJOURNER TRUTH, as she now calls
herself-but whose name, originally, was Isabella-was born, as near as
she can now calculate, between the years 1797 and 1800. She was the
daughter of James and Betsey, slaves of one Colonel Ardinburgh, Hurley,
...


--- Chunk 2 (words: 438) ---
Still, she does not attribute this cruelty-for cruelty
it certainly is, to be so unmindful of the health and comfort of any
being, leaving entirely out of sight his more important part, his
everlasting interests,-so much to any innate or constitutional cruelty
of the master, as to that gigantic inco...


--- Chunk 3 (words: 481) ---
This event was noticed with childish pleasure by the
unsuspicious boy; but when he was taken and put into the sleigh, and
saw his little sister actually shut and locked into the sleigh box, his
eyes were at once opened to their intentions; and, like a frightened
deer he sprang from the sleigh, and r

Processing books:  29%|██▊       | 6/21 [01:06<02:39, 10.66s/it]


--- Chunk 1 (words: 492) ---
INTRODUCTION. * * * * * The problem of Woman's position, or "sphere,"--of her duties,
responsibilities, rights and immunities as Woman,--fitly attracts a
large and still-increasing measure of attention from the thinkers and
agitators of our time, The legislators, so called,--those who
ultimately ena...


--- Chunk 2 (words: 493) ---
It is due to her memory, as well as to the great and living
cause of which she was so eminent and so fearless an advocate, that
what she thought and said with regard to the position of her sex and
its limitations, should be fully and fairly placed before the public. For several years past her princi...


--- Chunk 3 (words: 471) ---
I lay no
especial stress on the welfare of either. I believe that the
development of the one cannot be effected without that of the other. My highest wish is that this truth should be distinctly and rationally
apprehended, and the conditions of life and freedom recognized as the
same for the daughte

Processing books:  33%|███▎      | 7/21 [01:13<02:08,  9.21s/it]


--- Chunk 1 (words: 500) ---
OLD NORWAY. Still the old tempests rage around the mountains,
 And ocean's billows as of old appear;
 The roaring wood and the resounding fountains
 Time has not silenced in his long career,
 For Nature is the same as ever. MUNCH. The shadow of God wanders through Nature. LINNÆUS. Before yet a song ...


--- Chunk 2 (words: 487) ---
"Wilt thou behold the great and the
majestic? Behold the Gausta, which raises its colossal knees six
thousand feet above the surface of the earth; behold the wild giant
forms of Hurrungen, Fannarauken, Mugnafjeld; behold the Rjukan (the
rushing), the Vöring, and Vedal rivers foaming and thundering o...


--- Chunk 3 (words: 491) ---
We set ourselves down in a region whose name and
situation we counsel nobody to seek out in maps, and which we call-- HEIMDAL. Knowest thou the deep, cool dale,
 Where church-like stillness doth prevail;
 Where neither flock nor herd you meet;
 Which hath no name nor track of feet? VELHAVEN. Heimdal

Processing books:  38%|███▊      | 8/21 [01:20<01:53,  8.77s/it]


--- Chunk 1 (words: 469) ---
Mrs. Dalloway said she would buy the flowers herself. For Lucy had her work cut out for her. The doors would be taken off
their hinges; Rumpelmayer’s men were coming. And then, thought Clarissa
Dalloway, what a morning--fresh as if issued to children on a beach. What a lark! What a plunge! For so it...


--- Chunk 2 (words: 466) ---
In people’s eyes, in the swing, tramp, and trudge; in
the bellow and the uproar; the carriages, motor cars, omnibuses, vans,
sandwich men shuffling and swinging; brass bands; barrel organs; in the
triumph and the jingle and the strange high singing of some aeroplane
overhead was what she loved; life...


--- Chunk 3 (words: 499) ---
Evelyn was a good deal out of sorts, said Hugh, intimating by a kind
of pout or swell of his very well-covered, manly, extremely handsome,
perfectly upholstered body (he was almost too well dressed always,
but presumably had to be, with his little job at Court) that his wife
had some internal ailmen

Processing books:  43%|████▎     | 9/21 [01:35<02:07, 10.59s/it]


--- Chunk 1 (words: 481) ---
CHAPTER I 1801—I have just returned from a visit to my landlord—the solitary
neighbour that I shall be troubled with. This is certainly a beautiful
country! In all England, I do not believe that I could have fixed on a
situation so completely removed from the stir of society. A perfect
misanthropist...


--- Chunk 2 (words: 480) ---
Happily, the architect had foresight to build
it strong: the narrow windows are deeply set in the wall, and the
corners defended with large jutting stones. Before passing the threshold, I paused to admire a quantity of
grotesque carving lavished over the front, and especially about the
principal doo...


--- Chunk 3 (words: 494) ---
Possibly, some people might suspect him of a degree of under-bred
pride; I have a sympathetic chord within that tells me it is nothing of
the sort: I know, by instinct, his reserve springs from an aversion to
showy displays of feeling—to manifestations of mutual kindliness. He’ll
love and hate equal

Processing books:  48%|████▊     | 10/21 [01:48<02:03, 11.23s/it]


--- Chunk 1 (words: 475) ---
CHAPTER I. POLLY ARRIVES “IT'S time to go to the station, Tom.” “Come on, then.” “Oh, I'm not going; it's too wet. Should n't have a crimp left if I
went out such a day as this; and I want to look nice when Polly comes.” “You don't expect me to go and bring home a strange girl alone, do you?”
 And T...


--- Chunk 2 (words: 473) ---
Feeling that he said rather a neat and cutting thing, Tom sauntered
leisurely away, perfectly conscious that it was late, but bent on not
being hurried while in sight, though he ran himself off his legs to make
up for it afterward. “If I was the President, I'd make a law to shut up all boys till the...


--- Chunk 3 (words: 480) ---
“Oh, Fan told me you'd got curly hair, and a funny nose, and kept
whistling, and wore a gray cap pulled over your eyes; so I knew you
directly.” And Polly nodded at him in the most friendly manner, having
politely refrained from calling the hair “red,” the nose “a pug,” and
the cap “old,” all of whi

Processing books:  52%|█████▏    | 11/21 [02:13<02:35, 15.52s/it]


--- Chunk 1 (words: 488) ---
BOOK FIRST BOY AND GIRL. Chapter I. Outside Dorlcote Mill A wide plain, where the broadening Floss hurries on between its green
banks to the sea, and the loving tide, rushing to meet it, checks its
passage with an impetuous embrace. On this mighty tide the black
ships—laden with the fresh-scented fi...


--- Chunk 2 (words: 470) ---
That honest wagoner is thinking of his
dinner, getting sadly dry in the oven at this late hour; but he will
not touch it till he has fed his horses,—the strong, submissive,
meek-eyed beasts, who, I fancy, are looking mild reproach at him from
between their blinkers, that he should crack his whip at ...


--- Chunk 3 (words: 491) ---
The two years at th’ academy ’ud ha’ done well enough, if I’d meant to
make a miller and farmer of him, for he’s had a fine sight more
schoolin’ nor _I_ ever got. All the learnin’ _my_ father ever paid for
was a bit o’ birch at one end and the alphabet at th’ other. But I
should like Tom to be a bit

Processing books:  57%|█████▋    | 12/21 [02:14<01:39, 11.03s/it]


--- Chunk 1 (words: 489) ---
It is very seldom that mere ordinary people like John and myself secure
ancestral halls for the summer. A colonial mansion, a hereditary estate, I would say a haunted house,
and reach the height of romantic felicity—but that would be asking too
much of fate! Still I will proudly declare that there i...


--- Chunk 2 (words: 479) ---
I even said so to John one moonlight evening, but he said what I felt
was a draught, and shut the window. I get unreasonably angry with John sometimes. I’m sure I never used to
be so sensitive. I think it is due to this nervous condition. But John says if I feel so I shall neglect proper self-contro...


--- Chunk 3 (words: 466) ---
I am sitting by the window now, up in this atrocious nursery, and there
is nothing to hinder my writing as much as I please, save lack of
strength. John is away all day, and even some nights when his cases are serious. I am glad my case is not serious! But these nervous troubles are dreadfully depre

Processing books:  62%|██████▏   | 13/21 [02:29<01:38, 12.35s/it]


--- Chunk 1 (words: 485) ---
CHAPTER I. CHILDHOOD. The psychical growth of a child is not influenced by days and years, but
by the impressions passing events make on its mind. What may prove a
sudden awakening to one, giving an impulse in a certain direction that
may last for years, may make no impression on another. People won...


--- Chunk 2 (words: 476) ---
My father was a man of firm character and unimpeachable integrity, and
yet sensitive and modest to a painful degree. There were but two places
in which he felt at ease--in the courthouse and at his own fireside. Though gentle and tender, he had such a dignified repose and reserve of
manner that, as ...


--- Chunk 3 (words: 492) ---
The large,
pleasant room with the white curtains and bright wood fire on the
hearth, where panada, catnip, and all kinds of little messes which we
were allowed to taste were kept warm, was the center of attraction for
the older children. I heard so many friends remark, "What a pity it is
she's a gir

Processing books:  67%|██████▋   | 14/21 [02:48<01:40, 14.34s/it]


--- Chunk 1 (words: 487) ---
VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her. She was the youngest of the two daughters...


--- Chunk 2 (words: 461) ---
A large debt of
gratitude was owing here; but the intercourse of the last seven years,
the equal footing and perfect unreserve which had soon followed
Isabella’s marriage, on their being left to each other, was yet a
dearer, tenderer recollection. She had been a friend and companion such
as few poss...


--- Chunk 3 (words: 485) ---
Matrimony, as the origin of change, was always disagreeable; and he was
by no means yet reconciled to his own daughter’s marrying, nor could
ever speak of her but with compassion, though it had been entirely a
match of affection, when he was now obliged to part with Miss Taylor
too; and from his hab

Processing books:  71%|███████▏  | 15/21 [03:03<01:26, 14.48s/it]


--- Chunk 1 (words: 497) ---
Chapter I. It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families,...


--- Chunk 2 (words: 492) ---
Sir William and Lady Lucas are determined to go,
merely on that account; for in general, you know, they visit no new
comers. Indeed you must go, for it will be impossible for _us_ to visit
him, if you do not.” “You are over scrupulous, surely. I dare say Mr. Bingley will be very
glad to see you; and...


--- Chunk 3 (words: 499) ---
“But you forget, mamma,” said Elizabeth, “that we shall meet him at the
assemblies, and that Mrs. Long has promised to introduce him.” “I do not believe Mrs. Long will do any such thing. She has two nieces
of her own. She is a selfish, hypocritical woman, and I have no opinion
of her.” “No more have

Processing books:  76%|███████▌  | 16/21 [03:11<01:02, 12.45s/it]


--- Chunk 1 (words: 483) ---
I A green and yellow parrot, which hung in a cage outside the door, kept
repeating over and over: “_Allez vous-en! Allez vous-en! Sapristi!_ That’s all right!” He could speak a little Spanish, and also a language which nobody
understood, unless it was the mocking-bird that hung on the other side
of ...


--- Chunk 2 (words: 497) ---
Mr. Pontellier finally lit a cigar and began to smoke, letting the
paper drag idly from his hand. He fixed his gaze upon a white sunshade
that was advancing at snail’s pace from the beach. He could see it
plainly between the gaunt trunks of the water-oaks and across the
stretch of yellow camomile. T...


--- Chunk 3 (words: 487) ---
Both children wanted to follow their father when they saw him starting
out. He kissed them and promised to bring them back bonbons and
peanuts. II Mrs. Pontellier’s eyes were quick and bright; they were a yellowish
brown, about the color of her hair. She had a way of turning them
swiftly upon an obj

Processing books:  81%|████████  | 17/21 [03:33<01:02, 15.56s/it]


--- Chunk 1 (words: 500) ---
PART 1 CHAPTER ONE
PLAYING PILGRIMS “Christmas won’t be Christmas without any presents,” grumbled Jo, lying
on the rug. “It’s so dreadful to be poor!” sighed Meg, looking down at her old
dress. “I don’t think it’s fair for some girls to have plenty of pretty
things, and other girls nothing at all,” ...


--- Chunk 2 (words: 491) ---
“I don’t believe any of you suffer as I do,” cried Amy, “for you don’t
have to go to school with impertinent girls, who plague you if you
don’t know your lessons, and laugh at your dresses, and label your
father if he isn’t rich, and insult you when your nose isn’t nice.” “If you mean libel, I’d say...


--- Chunk 3 (words: 495) ---
So you must try to be
contented with making your name boyish, and playing brother to us
girls,” said Beth, stroking the rough head with a hand that all the
dish washing and dusting in the world could not make ungentle in its
touch. “As for you, Amy,” continued Meg, “you are altogether too particular

Processing books:  86%|████████▌ | 18/21 [03:36<00:35, 11.70s/it]


--- Chunk 1 (words: 452) ---
MARY CHAP. I. Mary, the heroine of this fiction, was the daughter of Edward, who
married Eliza, a gentle, fashionable girl, with a kind of indolence in
her temper, which might be termed negative good-nature: her virtues,
indeed, were all of that stamp. She carefully attended to the _shews_ of
things...


--- Chunk 2 (words: 427) ---
As she was sometimes obliged to be alone, or only with her French
waiting-maid, she sent to the metropolis for all the new publications,
and while she was dressing her hair, and she could turn her eyes from
the glass, she ran over those most delightful substitutes for bodily
dissipation, novels. I s...


--- Chunk 3 (words: 486) ---
She was chaste, according to the vulgar acceptation of the word, that
is, she did not make any actual _faux pas_; she feared the world, and
was indolent; but then, to make amends for this seeming self-denial, she
read all the sentimental novels, dwelt on the love-scenes, and, had she
thought while s

Processing books:  90%|█████████ | 19/21 [04:14<00:39, 19.52s/it]


--- Chunk 1 (words: 494) ---
PRELUDE. Who that cares much to know the history of man, and how the mysterious
mixture behaves under the varying experiments of Time, has not dwelt,
at least briefly, on the life of Saint Theresa, has not smiled with
some gentleness at the thought of the little girl walking forth one
morning hand-i...


--- Chunk 2 (words: 441) ---
Since I can do no good because a woman,
Reach constantly at something that is near it. —_ The Maid’s Tragedy:_ BEAUMONT AND FLETCHER. Miss Brooke had that kind of beauty which seems to be thrown into
relief by poor dress. Her hand and wrist were so finely formed that she
could wear sleeves not less ...


--- Chunk 3 (words: 463) ---
Her mind was theoretic, and yearned
by its nature after some lofty conception of the world which might
frankly include the parish of Tipton and her own rule of conduct there;
she was enamoured of intensity and greatness, and rash in embracing
whatever seemed to her to have those aspects; likely to s

Processing books:  95%|█████████▌| 20/21 [04:21<00:15, 15.86s/it]


--- Chunk 1 (words: 476) ---
A NO-ACCOUNT CREOLE I. One agreeable afternoon in late autumn two young men stood together on
Canal Street, closing a conversation that had evidently begun within
the club-house which they had just quitted. "There's big money in it, Offdean," said the elder of the two. "I would
n't have you touch it...


--- Chunk 2 (words: 492) ---
He had gone to college, had traveled a
little at home and abroad, had frequented society and the clubs, and
had worked in his uncle's commission-house; in all of which employments
he had expended much time and a modicum of energy. But he felt all through that he was simply in a preliminary stage
of ...


--- Chunk 3 (words: 494) ---
A dozen rods or more from the Red River bank stood the dwelling-house,
and nowhere upon the plantation had time touched so sadly as here. The
steep, black, moss-covered roof sat like an extinguisher above the
eight large rooms that it covered, and had come to do its office so
poorly that not more th

Processing books: 100%|██████████| 21/21 [04:44<00:00, 13.56s/it]


--- Chunk 1 (words: 498) ---
CHAPTER I There was no possibility of taking a walk that day. We had been
wandering, indeed, in the leafless shrubbery an hour in the morning;
but since dinner (Mrs. Reed, when there was no company, dined early)
the cold winter wind had brought with it clouds so sombre, and a rain
so penetrating, th...


--- Chunk 2 (words: 497) ---
“Where the Northern Ocean, in vast whirls,
Boils round the naked, melancholy isles
Of farthest Thule; and the Atlantic surge
Pours in among the stormy Hebrides.” Nor could I pass unnoticed the suggestion of the bleak shores of
Lapland, Siberia, Spitzbergen, Nova Zembla, Iceland, Greenland, with
“the...


--- Chunk 3 (words: 497) ---
And I came out immediately, for I trembled at the idea of being dragged
forth by the said Jack. “What do you want?” I asked, with awkward diffidence. “Say, ‘What do you want, Master Reed?’” was the answer. “I want you to
come here;” and seating himself in an arm-chair, he intimated by a
gesture that
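
Comparing the two debug outputs, the no-context chunks start at sentence boundaries, while the context chunks appear to open with the tail of the preceding text. The sketch below shows one way such behaviour could be produced. It is a guess at the mechanism, not the notebook's `chunk_sentences`. The names `chunk_by_words` and `overlap_words` are hypothetical, and the tiny word budget is only for illustration.

```python
from typing import List


def chunk_by_words(sentences: List[str], max_words: int = 500,
                   overlap_words: int = 0) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    With overlap_words > 0, every chunk after the first is seeded with the
    last overlap_words words of the previous chunk, and those words count
    toward the budget (hypothetical "context" behaviour).
    """
    chunks: List[str] = []
    current: List[str] = []  # text pieces in the open chunk
    count = 0                # words in the open chunk
    has_new = False          # True once the open chunk holds a new sentence
    for sent in sentences:
        n = len(sent.split())
        # close the open chunk when the next sentence would not fit
        if has_new and count + n > max_words:
            text = " ".join(current)
            chunks.append(text)
            tail = text.split()[-overlap_words:] if overlap_words > 0 else []
            current = [" ".join(tail)] if tail else []
            count = len(tail)
            has_new = False
        current.append(sent)
        count += n
        has_new = True
    if has_new:
        chunks.append(" ".join(current))
    return chunks


sents = ["a b c.", "d e.", "f g h.", "i j."]
no_context = chunk_by_words(sents, max_words=5)
with_context = chunk_by_words(sents, max_words=5, overlap_words=2)
print(no_context)    # ['a b c. d e.', 'f g h. i j.']
print(with_context)  # ['a b c. d e.', 'd e. f g h.', 'g h. i j.']
```

In this sketch, overlap makes each chunk carry less new text, so the same book yields more chunks with context than without.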




In [17]:
# save chunks to csv
df_chunks = pd.DataFrame(all_chunk_data)
df_chunks.to_csv(OUTPUT_PATH_NO_CONTEXT, index=False)

print(f"Final dataset saved to: {OUTPUT_PATH_NO_CONTEXT}")
display(df_chunks.head())

Final dataset saved to: /Users/emmamora/Documents/GitHub/feminist_nlp/data/processed/processed_book_chunks.csv


|   | chunk | title | author | year | chunk_id |
|---|---|---|---|---|---|
| 0 | CHAPTER 1.\nA Not Unnatural Enterprise This is... | Herland | Charlotte Perkins Gilman | 1915 | CP_001 |
| 1 | The expedition was up among the thousand tribu... | Herland | Charlotte Perkins Gilman | 1915 | CP_002 |
| 2 | It was early yet; we had just breakfasted; and... | Herland | Charlotte Perkins Gilman | 1915 | CP_003 |
| 3 | “Woman Country--up\nthere.” Then we were inter... | Herland | Charlotte Perkins Gilman | 1915 | CP_004 |
| 4 | It had a special covering of fitted armor, thi... | Herland | Charlotte Perkins Gilman | 1915 | CP_005 |
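
The `chunk_id` prefix is built from the author's initials alone and the counter restarts for every book, so two books by the same author receive identical IDs. The sketch below reproduces that scheme and contrasts it with one possible alternative that adds a title-derived part. `title_slug`, `make_chunk_id`, and the placeholder titles are hypothetical and not part of the notebook.

```python
import re


def author_code(author: str) -> str:
    """Initials of the first two words of the author's name (as in the loop)."""
    return "".join(w[0].upper() for w in author.split()[:2])


def title_slug(title: str, max_len: int = 12) -> str:
    """Hypothetical helper: lowercase alphanumerics of the title, truncated."""
    return re.sub(r"[^a-z0-9]+", "", title.lower())[:max_len]


def make_chunk_id(author: str, title: str, i: int) -> str:
    """Hypothetical ID that stays distinct across books by the same author."""
    return f"{author_code(author)}_{title_slug(title)}_{i + 1:03d}"


author = "Charlotte Perkins Gilman"
titles = ("Book A", "Book B")  # placeholder titles
# original scheme: the first chunk of every book by this author gets the same ID
original_ids = [f"{author_code(author)}_{0 + 1:03d}" for _ in titles]
# alternative scheme: the title part keeps the IDs apart
new_ids = [make_chunk_id(author, t, 0) for t in titles]
print(original_ids)  # ['CP_001', 'CP_001']
print(new_ids)       # ['CP_booka_001', 'CP_bookb_001']
```

If downstream steps only ever combine `chunk_id` with `title`, the original IDs may be sufficient; the alternative matters when `chunk_id` is used as a key on its own.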
