# PDF Dataset Pre-processing

This notebook attempts to extract text from a very heterogeneous set of literature papers collected as PDF files. All the papers relate to _romance fiction_ and _post-feminist femininity_. The dataset can be found in a [shared Proton Drive folder](https://drive.proton.me/urls/XHCN6HYPTW#dKr8VEhPePbt).

### Required Modules

In [9]:
import os
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
import re

from typing import List, Dict

### Get Filenames

In [7]:
DIR = "../data/raw/pdfs/"

def get_filenames_from(directory: str = DIR) -> List[str]:
    """Return the paths of all PDF files in the given directory."""
    try:
        # Keep regular files only, matching the extension case-insensitively.
        return [
            os.path.join(directory, filename)
            for filename in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, filename))
            and filename.lower().endswith('.pdf')
        ]
    except OSError as error:
        # os.listdir raises OSError for missing or unreadable directories.
        print(f"Error gathering filenames: {error}")
        return []

# filenames = get_filenames_from()
# for filename in filenames:
#     print(filename)

### Extract Text

In [8]:
def extract_text_from(filename: str) -> str:
    """Extract text from PDF using pdfminer.six"""
    laparams = LAParams(
        char_margin=2.0,
        word_margin=0.1,
        line_margin=0.5)
    return extract_text(
        filename,
        laparams=laparams)

def extract_all_texts() -> Dict[str, str]:
    """Extract text from all PDF files"""
    texts: Dict[str, str] = {}
    filenames = get_filenames_from()
    for filename in filenames:
        texts[filename] = extract_text_from(filename)
    return texts

# print(extract_text_from(filenames[11]))
# print(len(extract_all_texts()))

#### Issues with Text Extraction from PDF Files
1. Broken words due to quirks in the file formats.
```{txt}
[...] Rape of 
P
ossession, and the [...]
```
2. Layouts differ from paper to paper, so the layout parameters cannot be tuned when processing in bulk, and some papers end up with poor formatting.
```{txt}
[...]
Haskell  relates  the  popularity  of  domination 

fantasies  to  the  growth  of  the  women's  liberation  movement 
[...]
```
3. Some characters from the fonts used in the PDF files have no Unicode mapping, so they are extracted as `(cid:N)` placeholders.
```{txt}
[...]
(cid:0)
(cid:0)
 a woman underwent [...]
```
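
The three issues above could be softened with a light post-processing pass. The following is only a minimal sketch, not something already used in this notebook: the name `clean_extracted_text` and every regular expression in it are assumptions, and the rule for rejoining a stranded capital letter is a heuristic that can misfire (for example on a line containing only "I" or "A").

```python
import re

# Placeholder emitted for glyphs without a Unicode mapping, e.g. "(cid:0)".
CID_MARKER = re.compile(r'\(cid:\d+\)')


def clean_extracted_text(text: str) -> str:
    """Apply heuristic cleanup to text extracted from a PDF (hypothetical helper)."""
    # Issue 3: drop the "(cid:N)" placeholders.
    text = CID_MARKER.sub('', text)
    # Issue 2: collapse runs of spaces and tabs left by justified layouts.
    text = re.sub(r'[ \t]+', ' ', text)
    # Remove the single space that may remain on either side of a newline.
    text = re.sub(r' ?\n ?', '\n', text)
    # Issue 1: a single capital letter alone on its own line, followed by a
    # line starting in lowercase, is glued back onto that following line.
    text = re.sub(r'\n([A-Z])\n(?=[a-z])', r' \1', text)
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


print(clean_extracted_text("Rape of \nP\nossession, and the"))
```

The function is deliberately conservative: it does not try to merge paragraphs split by extra blank lines, since that would need per-paper knowledge of the layout.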

In [16]:
def extract_abstracts(texts: Dict[str, str]) -> Dict[str, str]:
    """Extract abstract from each text"""
    abstracts: Dict[str, str] = {}
    pattern = re.compile(
        r'(?i)(?:^|\n)\s*abstract\s*[:.\n]\s*'
        r'(.*?)'
        r'(?=\n\s*(?:keywords|introduction|[0-9]+\s)|\Z)',
        re.S)
    for filename, text in texts.items():
        result = pattern.search(text)
        if result:
            abstract = re.sub(r'\s+', ' ', result.group(1)).strip()
            abstracts[filename] = abstract
    return abstracts

# abstracts = extract_abstracts(extract_all_texts())
# print(len(abstracts))

My original pattern finds only **10** abstracts in the **23** texts.
Not all PDFs have an abstract, but this number can definitely improve.
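
One possible direction for improving the match rate is a more permissive pattern. The sketch below is untested against this dataset and rests on assumptions: that some of the missed papers use a letter-spaced heading (`A B S T R A C T`), a `Summary` heading, or start the abstract on the same line as the heading. The names `ABSTRACT_PATTERN` and `find_abstract` are hypothetical. It also rejoins words hyphenated across line breaks, which can wrongly merge genuine compounds such as "middle-class" when they happen to break at the hyphen.

```python
import re
from typing import Optional

# Looser heading match than the original pattern; the separator is optional.
ABSTRACT_PATTERN = re.compile(
    r'(?:^|\n)[ \t]*(?:a\s?b\s?s\s?t\s?r\s?a\s?c\s?t|summary)'
    r'[ \t]*[:.\-]?[ \t]*\n?'
    r'(.*?)'
    r'(?=\n\s*(?:key\s?words|introduction|[0-9]+\.?\s)|\Z)',
    re.I | re.S)


def find_abstract(text: str) -> Optional[str]:
    """Return the abstract found in text, or None (hypothetical helper)."""
    match = ABSTRACT_PATTERN.search(text)
    if match is None:
        return None
    # Rejoin words split by a hyphen at the end of a line.
    body = re.sub(r'(\w)-\n\s*(\w)', r'\1\2', match.group(1))
    # Normalise all remaining whitespace to single spaces.
    body = re.sub(r'\s+', ' ', body).strip()
    return body or None


sample = "Title\nA B S T R A C T\nLove is com-\nplex.\nKeywords: love\n"
print(find_abstract(sample))
```

Because the separator after the heading is optional, this pattern is more likely than the original to produce false positives, such as a body line that merely begins with the word "abstract", so the results would need a manual check.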

In [17]:
abstracts = extract_abstracts(extract_all_texts())
keys = list(abstracts.keys())
print(keys[1])
print(abstracts[keys[1]])

../data/raw/pdfs/The Ethics and Economics of Middle Class Romance.pdf
This article shows the philosophical kinship between Adam Smith and Mary Wollstonecraft on the subject of love. Though the two major 18th century think- ers are not traditionally brought into conversation with each other, Wollstonecraft and Smith share deep moral concerns about the emerging commercial society. As the new middle class continues to grow along with commerce, vanity becomes an ever more common vice among its members. But a vain person is preoccupied with appearance, status, and flattery—things that get in the way of what Smith and Wol- lstonecraft regard as the deep human connection they variously describe as love, sympathy, and esteem. Commercial society encourages inequality, Smith argues, and Wollstonecraft points out that this inequality is particularly obvious in the rela- tionships between men and women. Men are vain about their wealth, power and sta- tus; women about their appearance. Added to thi