# PDF Dataset Pre-processing

This is an attempt to extract text from a very heterogeneous set of literature papers collected as PDF files.

### Required Modules

In [18]:
import os
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

from typing import List

### Get Filenames

In [15]:
DIR = "../data/raw/pdfs/"

def get_filenames_from(directory: str = DIR) -> List[str]:
    """Return the paths of all PDF files in the given directory."""
    try:
        return [
            os.path.join(directory, filename)
            for filename in os.listdir(directory)
            if filename.lower().endswith(".pdf")
            and os.path.isfile(os.path.join(directory, filename))
        ]
    except OSError as err:
        # Raised when the directory is missing or cannot be read.
        print(f"Error gathering filenames: {err}")
        return []

filenames = get_filenames_from()

# for filename in filenames:
#     print(filename)

### Extract Text

In [30]:
def extract_text_from(filename: str) -> str:
    """Extract text using pdfminer.six"""
    laparams = LAParams(
        char_margin=2.0,
        word_margin=0.1,
        line_margin=0.5)
    return extract_text(
        filename,
        laparams=laparams)

print(extract_text_from(filenames[4]))

Affective Science (2023) 4:770–780
https://doi.org/10.1007/s42761-023-00219-9

RESEARCH ARTICLE 

How Male and Female Literary Authors Write About Affect Across 
Cultures and Over Historical Periods

Giada Lettieri1,2 

 · Giacomo Handjaras2 

 · Erika Bucci2 

 · Pietro Pietrini3 

 · Luca Cecchetti2 

Received: 28 February 2023 / Accepted: 9 August 2023 / Published online: 5 September 2023 
© The Author(s) 2023

Abstract
A wealth of literature suggests the existence of sex differences in how emotions are experienced, recognized, expressed, 
and regulated. However, to what extent these differences result from the put in place of stereotypes and social rules is still 
a matter of debate. Literature is an essential cultural institution, a transposition of the social life of people but also of their 
intimate affective experiences, which can serve to address questions of psychological relevance. Here, we created a large 
corpus of literary fiction enriched by authors’ metadata to measure

#### Issues with Text Extraction from PDF Files
1. Broken words due to quirks in the file format.
```{txt}
[...] Rape of 
P
ossession, and the [...]
```
2. Layouts differ so much between papers that no single set of layout parameters can be tuned to suit the whole collection.
3. Some fonts used in the PDF files provide no Unicode mapping for certain characters, which then appear in the output as `(cid:N)` placeholders.
```{txt}
[...]
(cid:0)
(cid:0)
 a woman underwent [...]
```
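
Issues 1 and 3 can be partly mitigated by post-processing the extracted string. The following is only a minimal sketch, not part of the original notebook: the helper names `count_cid_placeholders`, `strip_cid_placeholders` and `rejoin_split_initials` are hypothetical, and the word-rejoining rule is a heuristic that assumes the break always isolates a single capital letter on its own line, as in the example above. It may produce false positives on other layouts.

```python
import re

# Pattern for the "(cid:N)" placeholders that appear in the extracted
# text when a glyph has no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def count_cid_placeholders(text: str) -> int:
    """Count unmapped-glyph placeholders, e.g. to flag badly encoded files."""
    return len(CID_PATTERN.findall(text))


def strip_cid_placeholders(text: str) -> str:
    """Remove "(cid:N)" placeholders; the lost characters are not recovered."""
    return CID_PATTERN.sub("", text)


def rejoin_split_initials(text: str) -> str:
    """Heuristic (assumption): a line holding one capital letter, followed by
    a line starting with a lowercase letter, is a word split after its initial."""
    return re.sub(r"[ \t]*\n([A-Z])\n(?=[a-z])", r" \1", text)


sample = "[...] Rape of \nP\nossession, and the [...]"
print(rejoin_split_initials(sample))
```

Counting the placeholders before stripping them gives a rough per-file quality signal, which may help decide which PDFs need a different extraction route.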