# PDF Dataset Pre-processing

This notebook attempts to extract text from a very heterogeneous set of literature papers collected as PDF files. All the papers are related to _romance fiction_ and _post-feminist femininity_. The dataset can be found in a [shared Proton Drive folder](https://drive.proton.me/urls/XHCN6HYPTW#dKr8VEhPePbt).

### Required Modules

In [32]:
import os
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
import re
import nltk
import json

from typing import List, Dict, Optional, Union

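Download required corpora. The tokenization and stop-word steps further below also rely on NLTK's sentence tokenizer models (`punkt`, or `punkt_tab` in recent NLTK releases) and the `stopwords` corpus, which are assumed to be installed already.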

In [33]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/guaya/nltk_data...


True

### Get Filenames

In [None]:
def get_filenames_from(
    dir: str="../data/raw/pdfs/") -> List[str]:
    """Return all PDF filenames in given directory"""
    try:
        filenames: List[str] = [
            os.path.join(
                dir,
                filename) for
            filename in os.listdir(dir)
            if os.path.isfile(
                os.path.join(
                    dir,
                    filename))
                and filename.lower().endswith(
                    '.pdf')]
        return filenames
        
    except OSError as error:
        print(f"Error gathering filenames: {error}")
        return []

### Extract Text

In [3]:
def extract_text_from(filename: str) -> str:
    """Extract text from PDF using pdfminer.six"""
    laparams = LAParams(
        char_margin=2.0,
        word_margin=0.1,
        line_margin=0.5)
    return extract_text(
        filename,
        laparams=laparams)

def extract_all_texts() -> Dict[str, str]:
    """Extract text from all PDF files"""
    texts: Dict[str, str] = {}
    filenames = get_filenames_from()
    for filename in filenames:
        texts[filename] = extract_text_from(filename)
    return texts

texts = extract_all_texts()

#### Issues with Text Extraction from PDF Files
1. Broken words due to quirks in the file formats.
```{txt}
[...] Rape of 
P
ossession, and the [...]
```
2. Layouts differ between papers, so no single set of layout parameters suits bulk processing, and some papers end up poorly formatted.
```{txt}
[...]
Haskell  relates  the  popularity  of  domination 

fantasies  to  the  growth  of  the  women's  liberation  movement 
[...]
```
3. Some characters from fonts used in the PDF files have no Unicode mapping, so they are extracted as `(cid:N)` placeholders.
```{txt}
[...]
(cid:0)
(cid:0)
 a woman underwent [...]
```
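
Issue 3 can be measured before any cleaning takes place. The sketch below counts `(cid:N)` placeholders per text so that badly decoded files can be flagged for manual review. The helper name `cid_marker_count` and the sample strings are illustrative assumptions and not part of the original pipeline.

```python
import re
from typing import Dict

# Same marker shape as in the sample above: "(cid:" + digits + ")"
CID_PATTERN = re.compile(r'\(cid:\d+\)')


def cid_marker_count(texts: Dict[str, str]) -> Dict[str, int]:
    """Count (cid:N) placeholders in each extracted text (hypothetical helper)."""
    return {
        name: len(CID_PATTERN.findall(text))
        for name, text in texts.items()}


# Made-up sample data standing in for the real `texts` dictionary
sample_texts = {
    "clean.pdf": "a woman underwent a transformation",
    "broken.pdf": "(cid:0)\n(cid:0)\n a woman underwent",
}
counts = cid_marker_count(sample_texts)
print(counts)
```

A high count relative to the text length suggests that a file may need a different extraction approach rather than simple marker removal.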

### Clean Text

In [6]:
for filename in texts.keys():

    # Rejoin words hyphenated across line breaks
    texts[filename] = re.sub(
        r'-\n\s*',
        '',
        texts[filename])

    # Remove CID markers
    texts[filename] = re.sub(
        r'\(cid:\d+\)',
        '',
        texts[filename])
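
To make the effect of the two substitutions concrete, here is a minimal sketch that applies the same regular expressions to short, made-up strings. The wrapper `clean_text` and the samples are assumptions for illustration only.

```python
import re


def clean_text(text: str) -> str:
    """Apply the two substitutions from the cleaning cell (hypothetical wrapper)."""
    # Rejoin words hyphenated across a line break
    text = re.sub(r'-\n\s*', '', text)
    # Drop (cid:N) placeholders
    text = re.sub(r'\(cid:\d+\)', '', text)
    return text


hyphenated = "the women's libera-\n  tion movement"
with_cid = "(cid:0)\n(cid:0)\n a woman underwent"
split_without_hyphen = "Rape of \nP\nossession"

cleaned_hyphenated = clean_text(hyphenated)
cleaned_cid = clean_text(with_cid)
cleaned_split = clean_text(split_without_hyphen)

print(repr(cleaned_hyphenated))
print(repr(cleaned_cid))
print(repr(cleaned_split))
```

Two limitations are visible here. Words split without a hyphen, as in the first issue listed earlier, are left untouched, and a genuine compound that happens to break at its hyphen (for example `post-` followed by `feminist` on the next line) would be merged into a single word.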

### Separate Abstract when Available

In [None]:
def extract_abstracts(
    texts: Dict[str, str]) -> Dict[str, str]:
    """Extract abstract from each text"""
    abstracts: Dict[str, str] = {}
    pattern = re.compile(
        r'(?i)(?:^|\n)\s*abstract\s*[:.\n]\s*'
        r'(.*?)'
        r'(?=\n\s*(?:keywords|introduction|about|citation|iv|lay|[0-9]+\s)|\Z)',
        re.S)
        
    for filename, text in texts.items():
        result = pattern.search(text)
        if result:
            abstract = re.sub(r'\s+', ' ', result.group(1)).strip()
            abstracts[filename] = abstract
    return abstracts

abstracts = extract_abstracts(texts)
print(len(abstracts))

10


The pattern finds only **10** abstracts in the **23** texts.
Not all PDFs have an abstract or, at least, a _clearly identified_ one.
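
The behaviour of the pattern can be explored on small synthetic inputs. The sketch below reuses the same regular expression; the sample texts and the helper `find_abstract` are made up for illustration, and the two failure cases shown are only possible explanations for misses, not a diagnosis of the actual dataset.

```python
import re
from typing import Optional

# Same pattern as in extract_abstracts
ABSTRACT_PATTERN = re.compile(
    r'(?i)(?:^|\n)\s*abstract\s*[:.\n]\s*'
    r'(.*?)'
    r'(?=\n\s*(?:keywords|introduction|about|citation|iv|lay|[0-9]+\s)|\Z)',
    re.S)


def find_abstract(text: str) -> Optional[str]:
    """Return the whitespace-normalized abstract, or None (hypothetical helper)."""
    match = ABSTRACT_PATTERN.search(text)
    if match is None:
        return None
    return re.sub(r'\s+', ' ', match.group(1)).strip()


# Heading on its own line: matched as intended
with_heading = (
    "A Study of Romance Readers\n\n"
    "Abstract\nThis study examines romance\nfiction readers.\n\n"
    "Keywords: romance, readers\n")

# Heading followed by text on the same line, without ':' or '.': no match
inline_heading = (
    "A Study of Romance Readers\n\n"
    "Abstract This study examines romance fiction readers.\n\n"
    "Keywords: romance, readers\n")

# A line starting with "Lay" ends the capture early, because the
# terminator alternatives are matched as prefixes, not whole words
early_stop = (
    "Abstract\nRomance fiction is popular.\nLay readers value it.\n\n"
    "Keywords: romance\n")

result_heading = find_abstract(with_heading)
result_inline = find_abstract(inline_heading)
result_early = find_abstract(early_stop)

print(result_heading)
print(result_inline)
print(result_early)
```

Adding word boundaries to the terminators, or accepting plain whitespace after the heading, might recover more abstracts, at the risk of more false positives.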

In [20]:
for key in abstracts.keys():
    print(key)
    print(abstracts[key])
    print()

../data/raw/pdfs/How Male and Female Literary Authors Write About Affect Across Cultures and Over Historical Periods.pdf
A wealth of literature suggests the existence of sex differences in how emotions are experienced, recognized, expressed, and regulated. However, to what extent these differences result from the put in place of stereotypes and social rules is still a matter of debate. Literature is an essential cultural institution, a transposition of the social life of people but also of their intimate affective experiences, which can serve to address questions of psychological relevance. Here, we created a large corpus of literary fiction enriched by authors’ metadata to measure the extent to which culture influences how men and women write about emotion. Our results show that even though before the twenty-first century and across 116 countries women more than men have written about affect, starting from 2000, this difference has diminished substantially. Also, in the past, women’s 

### Tokenize and Lemmatize

In [47]:
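def save_to_json(
    data: Dict[str, Union[str, List[List[str]]]],
    dir: str="../data/literature/") -> None:
    """Save data to JSON file"""
    # Create the output directory if it does not exist yet
    os.makedirs(dir, exist_ok=True)
    filename = "_".join(data["name"].split(" "))
    path = os.path.join(dir, filename + ".json")
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2)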


def pre_process_text(text: Optional[str]) -> List[List[str]]:
    """Split text into sentences, drop stop words and lemmatize"""
    if text is None:
        return []

    stop_words = set(nltk.corpus.stopwords.words("english"))
    lemmatizer = nltk.stem.WordNetLemmatizer()

    sentence_split = nltk.sent_tokenize(text)
    word_split = [
        nltk.word_tokenize(sentence) for
        sentence in sentence_split]

    result = []
    for sentence in word_split:
        result.append([])

        for word in sentence:
            token = word.lower()
            if token.isalpha() and token not in stop_words:
                result[-1].append(lemmatizer.lemmatize(token))

        # Discard empty sentences
        if not result[-1]:
            result.pop()
    return result


def build_data_json(
    filename: str,
    text: str,
    abstract: Optional[str]=None) -> Dict[str, Union[str, List[List[str]]]]:
    """Organize data with JSON structure"""
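    # Strip the directory and only the final extension, so that
    # titles containing dots are not truncated
    name = os.path.splitext(os.path.basename(filename))[0]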
    data = {
        "name": name,
        "text": pre_process_text(text),
        "abstract": pre_process_text(abstract)}
    return data

### Save to JSON Files

In [49]:
for filename, text in texts.items():
    # abstracts.get() returns None when no abstract was found
    data = build_data_json(
        filename,
        text,
        abstracts.get(filename))
    save_to_json(data)
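
Each saved file holds three keys: `name`, `text` and `abstract`. `text` and `abstract` are lists of sentences, each sentence being a list of lemmas, and `abstract` is an empty list when no abstract was found. The sketch below writes and reloads one such record in a temporary directory; the record contents are invented for illustration and do not come from the dataset.

```python
import json
import os
import tempfile

# Made-up record following the structure returned by build_data_json
record = {
    "name": "Sample Paper",
    "text": [["romance", "fiction", "reader"], ["heroine", "narrative"]],
    "abstract": [],
}

with tempfile.TemporaryDirectory() as out_dir:
    # Same naming rule as save_to_json: spaces become underscores
    path = os.path.join(
        out_dir, "_".join(record["name"].split(" ")) + ".json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(record, file, indent=2)
    with open(path, encoding="utf-8") as file:
        loaded = json.load(file)
    saved_name = os.path.basename(path)

print(saved_name)
print(loaded["text"][0])
```

Because the structure consists only of strings and nested lists, the reloaded data is identical to what was saved, so later analysis steps can read the files back without any conversion.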

To be continued...