<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1 - eyamrog

The aim of this phase is to establish the basic usage of the following Python libraries:

- `pypdf` and `PyMuPDF` for PDF document scraping
- `python-docx` for DOCX document scraping

## What is `pypdf`?

`pypdf` is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

Please refer to:
- [Scrape Data from PDF: A Comprehensive Guide for Data Analysts](https://parser.expert/blog/scrape-data-from-pdf)
- [How to Extract Data from PDF Files with Python](https://www.freecodecamp.org/news/extract-data-from-pdf-files-with-python/)
- [Extracting Semi-Structured Data from PDFs on a large scale](https://github.com/janedoesrepo/pdfreader/blob/master/README.md)

## What is `PyMuPDF`?

`PyMuPDF` is a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF (and other) documents.

## What is `python-docx`?

`python-docx` is a Python library for reading, creating, and updating Microsoft Word 2007+ (.docx) files.

Please refer to:
- [5 Best Ways to Read Microsoft Word Documents with Python](https://blog.finxter.com/5-best-ways-to-read-microsoft-word-documents-with-python/)

## Required Python packages

- pypdf
- PyMuPDF
- python-docx
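
Two of these packages are imported under a name that differs from the name used to install them: `PyMuPDF` is imported as `pymupdf` (older releases expose only `fitz`), and `python-docx` is imported as `docx`. The following is a minimal sketch for checking that all three can be found before running the notebook. The helper name `check_packages` and the `REQUIRED` mapping are illustrative and not part of any of the libraries.

```python
from importlib.util import find_spec

# Map each pip distribution name to the module name imported in this notebook.
REQUIRED = {
    'pypdf': 'pypdf',
    'PyMuPDF': 'pymupdf',
    'python-docx': 'docx',
}

def check_packages(required=REQUIRED):
    """Return {distribution name: bool} telling whether each module can be located."""
    return {dist: find_spec(module) is not None for dist, module in required.items()}

# Report the status of each required package.
for dist, available in check_packages().items():
    print(f'{dist}: {"installed" if available else "missing"}')
```

A package reported as missing can usually be installed with `pip install` followed by the distribution name shown on the left.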

## Importing the required libraries

In [1]:
import pypdf
import pymupdf
from docx import Document
import os
import sys

## Defining input variables

In [2]:
output_directory = 'cl_st1_ph1_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.
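
The same check-then-create logic can be written more compactly with the `exist_ok` parameter of `os.makedirs`. The sketch below is one possible variant, not a replacement for the cell above; the helper name `ensure_directory` is hypothetical, and the demonstration runs inside a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile

def ensure_directory(path):
    """Create path if needed and return True only when it was newly created."""
    already_there = os.path.isdir(path)
    # exist_ok=True makes the call a no-op when the directory is already present.
    os.makedirs(path, exist_ok=True)
    return not already_there

# Demonstration in a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, 'cl_st1_ph1_eyamrog')
    print(ensure_directory(demo_dir))  # first call creates the directory
    print(ensure_directory(demo_dir))  # second call finds it already present
```

Note that `os.makedirs(..., exist_ok=True)` still raises an error if the path exists but is a regular file, so the `try`/`except OSError` handling used above remains relevant in that case.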


## `pypdf` PDF scraping

### Setting required variables

In [4]:
pdf_path = f'{output_directory}/monte_mor_1.pdf'
output_txt = f'{output_directory}/monte_mor_1_pypdf.txt'
page_num = 0

### Sampling pages

In [5]:
pdf_reader = pypdf.PdfReader(pdf_path)
page = pdf_reader.pages[page_num]
text = page.extract_text()
text

'Universidad Nacional de Colombia - Facultad de Ciencias Humanas – Bogotá \nwww.revistamatices.unal.edu.co \n \nRevista Electrónica Matices en Lenguas  Extranjeras No. 2, Diciembre 2008 \n 1 \n Critical Literacies, Meaning Ma king and New Epistemological \nPerspectives \n \nWalkyria Monte Mór \uf02a \nwalsil@uol.com.br \nUniversity of São Paulo, Brazil \n \nThis article presents a research analysis in which Brazilian university students were the subjects of \nresearch in regard to their reading of cinema images. The analysis discusses the way the meanings are \nconstructed by these students and reflects on interpretation and meaning construction in accordance with new \nepistemological perspectives that have been postulated recently (Morin, 1998; Lankshear & Knobel, 2003). It, \nthus, considers the present needs of the multimodal and hypertextual communication approached in the \nmultiliteracy studies (Cope & Kalantzis, 2000), and the university preparation for a critical and participa

### Scraping the entire document

In [6]:
def scrape_pdf(pdf_path, output_txt):
    # Creating a PdfReader object
    pdf_reader = pypdf.PdfReader(pdf_path)
    
    # Initialising an empty string to store the text
    text = ''
    
    # Iterating through all the pages and extracting their text
    for page in pdf_reader.pages:
        text += page.extract_text()
    
    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

scrape_pdf(pdf_path, output_txt)
print('PDF scraped successfully!')

PDF scraped successfully!
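
The function above concatenates the page texts directly, so the last word of one page can end up fused to the first word of the next when a page does not end in whitespace. A possible variant, sketched below under that assumption, collects the page texts in a list and joins them with an explicit separator. The names `join_page_texts` and `scrape_pdf_joined` are hypothetical; only `pypdf.PdfReader`, `pages`, and `extract_text()` come from the library.

```python
def join_page_texts(pages, separator='\n'):
    """Join the text of each page with a separator, skipping pages without text."""
    texts = []
    for page in pages:
        # Defensive: treat a missing result as empty text.
        page_text = page.extract_text() or ''
        if page_text.strip():
            texts.append(page_text)
    return separator.join(texts)

def scrape_pdf_joined(pdf_path, output_txt):
    # pypdf is imported here so that join_page_texts can be reused without it.
    import pypdf

    pdf_reader = pypdf.PdfReader(pdf_path)
    text = join_page_texts(pdf_reader.pages)

    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)
```

It would be called in the same way as the original function, for example `scrape_pdf_joined(pdf_path, output_txt)`.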


## `PyMuPDF` PDF scraping

### Setting required variables

In [7]:
pdf_path = f'{output_directory}/monte_mor_1.pdf'
output_txt = f'{output_directory}/monte_mor_1_PyMuPDF.txt'
page_num = 0

### Sampling pages

In [8]:
doc = pymupdf.open(pdf_path)
page = doc[page_num]
text = page.get_text()
text

'Universidad Nacional de Colombia - Facultad de Ciencias Humanas – Bogotá \nwww.revistamatices.unal.edu.co \n \nRevista Electrónica Matices en Lenguas Extranjeras No. 2, Diciembre 2008 \n 1\n \n \nCritical Literacies, Meaning Making and New Epistemological \nPerspectives \n \nWalkyria Monte Mór \uf02a \nwalsil@uol.com.br \nUniversity of São Paulo, Brazil \n \nThis article presents a research analysis in which Brazilian university students were the subjects of \nresearch in regard to their reading of cinema images. The analysis discusses the way the meanings are \nconstructed by these students and reflects on interpretation and meaning construction in accordance with new \nepistemological perspectives that have been postulated recently (Morin, 1998; Lankshear & Knobel, 2003). It, \nthus, considers the present needs of the multimodal and hypertextual communication approached in the \nmultiliteracy studies (Cope & Kalantzis, 2000), and the university preparation for a critical and partici

### Scraping the entire document

In [9]:
def scrape_pdf(pdf_path, output_txt):
    # Opening the PDF file
    doc = pymupdf.open(pdf_path)
    
    # Initialising an empty string to store the text
    text = ''
    
    # Iterating through all the pages and extracting their text
    for page in doc:
        text += page.get_text()
    
    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

scrape_pdf(pdf_path, output_txt)
print('PDF scraped successfully!')

PDF scraped successfully!
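
The function above leaves the document open after the text has been extracted. A PyMuPDF document can also be used as a context manager, which closes the file as soon as the block ends. The sketch below shows that variant, assuming a PyMuPDF release that exposes the `pymupdf` module name as imported earlier; the function name `scrape_pdf_closing` is hypothetical.

```python
def scrape_pdf_closing(pdf_path, output_txt):
    # Imported inside the function so the definition itself has no dependency.
    import pymupdf

    # The document is closed automatically when the with-block ends.
    with pymupdf.open(pdf_path) as doc:
        text = ''.join(page.get_text() for page in doc)

    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)
```

It takes the same arguments as the original, for example `scrape_pdf_closing(pdf_path, output_txt)`.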


## `python-docx` DOCX scraping

### Setting required variables

In [10]:
docx_path = f'{output_directory}/yamada_1.docx'
output_txt = f'{output_directory}/yamada_1_python-docx.txt'
paragraph_num = 0

### Sampling paragraphs

In [11]:
doc = Document(docx_path)
paragraph = doc.paragraphs[paragraph_num]
text = paragraph.text
text

'THE RELEVANCE OF AI-POWERED TOOLS IN THE ENGLISH ACADEMIC WRITING OF BRAZILIAN SCHOLARS IN APPLIED LINGUISTICS AND IN THE VISUALISATION OF RESEARCH DATA'

### Scraping the entire document

In [12]:
def scrape_docx(docx_path, output_txt):
    # Opening the DOCX file
    doc = Document(docx_path)

    # Initialising an empty list to store the text of each paragraph
    text_list = []

    # Iterating through all the paragraphs and extracting their text
    for paragraph in doc.paragraphs:
        text_list.append(paragraph.text)

    text = '\n'.join(text_list)
    
    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

scrape_docx(docx_path, output_txt)
print('DOCX scraped successfully!')

DOCX scraped successfully!
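
In `python-docx`, `doc.paragraphs` contains only the paragraphs in the document body; text inside tables is reached through `doc.tables`. If the DOCX files in the corpus contain tables, a variant along the following lines may be needed. This is a sketch only: the names `collect_docx_text` and `scrape_docx_with_tables` are hypothetical, table text is appended after the body paragraphs rather than in document order, and merged cells may be reported more than once.

```python
def collect_docx_text(doc):
    """Return body paragraphs followed by table rows, one item per line."""
    lines = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Cells of one row are joined with tabs to keep them on one line.
            lines.append('\t'.join(cell.text for cell in row.cells))
    return '\n'.join(lines)

def scrape_docx_with_tables(docx_path, output_txt):
    # Imported inside the function so collect_docx_text can be reused without it.
    from docx import Document

    doc = Document(docx_path)

    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(collect_docx_text(doc))
```

It would be called as `scrape_docx_with_tables(docx_path, output_txt)`.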


## Results

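Comparing the two outputs for the same PDF, `pypdf` inserts spurious spaces both between words and inside them: the first page contains "Meaning Ma king" and a double space in "Lenguas  Extranjeras". `PyMuPDF` renders the same passages as "Meaning Making" and "Lenguas Extranjeras", so on this document it produces the cleaner text.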
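
The difference can also be checked programmatically rather than by eye. The sketch below compares the whitespace-delimited tokens of two extractions and lists those that occur in only one of them. The two sample strings are taken from the first-page outputs shown earlier; the function name `token_differences` is hypothetical.

```python
from collections import Counter

def token_differences(text_a, text_b):
    """Return the tokens that occur more often in one text than in the other."""
    count_a = Counter(text_a.split())
    count_b = Counter(text_b.split())
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b

# Title line as extracted by each library from the first page of the PDF.
pypdf_sample = 'Critical Literacies, Meaning Ma king and New Epistemological'
pymupdf_sample = 'Critical Literacies, Meaning Making and New Epistemological'

only_pypdf, only_pymupdf = token_differences(pypdf_sample, pymupdf_sample)
print(only_pypdf)    # ['Ma', 'king']
print(only_pymupdf)  # ['Making']
```

Applied to the two full output files, the same function would give a rough count of the words that `pypdf` splits, although it cannot tell a wrongly split word from a genuine difference in extraction order or content.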