# Extracting Text Data

## Overview:
- In this lesson, we will practice extracting text data from various documents such as PDF, DOCX, and JSON files.
- Then, we will clean the extracted text using regular expressions.
- The exercises require knowledge of Python and the libraries `PyPDF2`, `python-docx` (imported as `docx`), `json`, and `re`.


## Question 1: Extracting Data from a PDF File

Using the `PyPDF2` library, write a Python script to extract the entire text from a PDF file. Ensure that you handle cases where the PDF file has multiple pages.
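A minimal sketch of the multi-page case, assuming PyPDF2 3.x, where `PdfReader.pages` yields page objects with an `extract_text()` method. The helper names `join_page_texts` and `extract_pdf_text` are hypothetical, and the `or ""` guard is a defensive assumption for pages that yield no text.

```python
from typing import Iterable


def join_page_texts(pages: Iterable) -> str:
    """Concatenate the text of every page, separating pages with a newline."""
    # `or ""` guards against pages with no extractable text (e.g. scanned images).
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_pdf_text(path: str) -> str:
    """Read every page of the PDF at `path` (hypothetical helper)."""
    # Imported here so that join_page_texts stays free of third-party dependencies.
    from PyPDF2 import PdfReader

    with open(path, "rb") as fh:
        reader = PdfReader(fh)
        # Extract while the file is still open.
        return join_page_texts(reader.pages)
```

Separating the joining logic from the file handling makes it possible to try the page loop on stand-in objects before a real PDF is uploaded.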

In [1]:
# Installing the PyPDF2 Library
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:

# Import the library
import PyPDF2
from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0


Task Completion
- Find a PDF File with More Than 20,000 Words
- Read the Content and Page Information
- Store the Content in a String Variable
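The 20,000-word requirement can be checked with a rough count. The sketch below assumes that splitting on whitespace is an acceptable definition of a word; `word_count` and `sample` are illustrative names, not part of the exercise.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough proxy for words."""
    return len(text.split())


sample = "Natural language processing\nstarts with   raw text."
print(word_count(sample))            # 7
print(word_count(sample) > 20_000)   # False
```

Applied to the extracted string, the same comparison shows whether the chosen file is large enough.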

In [2]:
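#### YOUR CODE HERE ####
from google.colab import files

# Upload the PDF into the Colab session
uploaded = files.upload()

with open("nlp-book.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    print(f"Number of pages: {len(reader.pages)}")
    text = ""
    for page in reader.pages:
        # `or ""` guards against pages without extractable text
        text += (page.extract_text() or "") + "\n"

print(text[:100])
#### END YOUR CODE #####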

Saving nlp-book.pdf to nlp-book.pdf



## Question 2: Extracting Data from a DOCX File


In [10]:
# Installing the python-docx library (imported as docx)
!pip install python-docx

Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.2.0


In [11]:
# Import the library
from docx import Document

Task Completion
- Find a DOCX File with More Than 20,000 Words
- Read the Content and Page Information
- Store the Content in a String Variable
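One thing to watch for with `python-docx`: `doc.paragraphs` covers body paragraphs only, so text inside tables is not included. The sketch below is one possible way to collect both. The helper names are hypothetical, and appending table rows after all paragraphs (rather than in document order) is a simplifying assumption.

```python
def docx_to_text(doc) -> str:
    """Join paragraph text, then table-row text, one item per line."""
    lines = [para.text for para in doc.paragraphs]
    # Table text is not part of `doc.paragraphs`, so collect it separately.
    for table in doc.tables:
        for row in table.rows:
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)


def load_docx_text(path: str) -> str:
    """Open the DOCX file at `path` and return its text (hypothetical helper)."""
    from docx import Document  # provided by the python-docx package

    return docx_to_text(Document(path))
```

If the document has no tables, the result is the same as joining `doc.paragraphs` alone.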

In [14]:
#### YOUR CODE HERE ####
uploaded = files.upload()
doc = Document("The Project Gutenberg eBook of The movie boys in the jungle.docx")
doc_text = "\n".join([para.text for para in doc.paragraphs])
print(type(doc_text))
#### END YOUR CODE #####

Saving The Project Gutenberg eBook of The movie boys in the jungle.docx to The Project Gutenberg eBook of The movie boys in the jungle.docx


## Question 3: Extracting Data from a JSON File

In [None]:
# Import the library
import json

Task Completion
- Find a JSON File with More Than 20,000 Words
- Store the Content in a String Variable
- Then append the results from the previous questions to this variable, with each result on a new line.
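Reading the file as raw text keeps the JSON syntax (braces, quotes, keys) in the string. If only the textual values are wanted, one option is to parse the file and walk the structure. The sketch below uses a made-up sample, not the structure of any particular dataset, and `iter_strings` is a hypothetical helper.

```python
import json
from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found in a parsed JSON structure, depth-first."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans, and null are skipped: they carry no prose.


# Made-up sample; a real file would be loaded with json.load(f) instead.
raw = '[{"name": "Bulbasaur", "types": ["Grass", "Poison"], "id": 1}]'
json_values = "\n".join(iter_strings(json.loads(raw)))
print(json_values)
```

Dictionary keys are deliberately ignored here. Whether they count as text depends on the dataset.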

In [12]:
#### YOUR CODE HERE ####
uploaded = files.upload()

with open("pokemonDB_dataset.json", "r", encoding="utf-8") as f:
    json_text = f.read()
print(type(json_text))
print(len(json_text))
#### END YOUR CODE #####

Saving pokemonDB_dataset.json to pokemonDB_dataset.json


## Question 4: Processing the Extracted Data

### Question 4.1: Concatenate the data extracted in Questions 1, 2, and 3 into a single string variable.

In [21]:
#### YOUR CODE HERE ####
# Put each extracted result on its own line, as required in Question 3
data = "\n".join([text, doc_text, json_text])
#### END YOUR CODE #####

### Question 4.2: Complete the String Processing Function

Description of the function: it takes a string as input and returns a processed version of that string. The function performs the following steps in order:

- Replace every character matching the pattern ``[^A-Za-z0-9(),!?'`]`` with a space (" ").
- Replace `'s` with ` 's`.
- Replace `'ve` with ` 've`.
- Replace `n't` with ` n't`.
- Replace `'re` with ` 're`.
- Replace `'d` with ` 'd`.
- Replace `'ll` with ` 'll`.
- Replace `,` with ` , `.
- Replace `!` with ` ! `.
- Replace `(` with ` ( `.
- Replace `)` with ` ) `.
- Replace `?` with ` ? `.
- Replace runs of two or more whitespace characters (`\s{2,}`) with a single space.
- Trim leading and trailing spaces.
- Convert the text to lowercase.


In [23]:
#### YOUR CODE HERE ####
import re

def clean_str(text: str) -> str:
    # 1. Replace non-alphanumeric and some punctuation with space
    text = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", text)

    # 2. Handle contractions
    text = re.sub(r"\'s", " \'s", text)
    text = re.sub(r"\'ve", " \'ve", text)
    text = re.sub(r"n\'t", " n\'t", text)
    text = re.sub(r"\'re", " \'re", text)
    text = re.sub(r"\'d", " \'d", text)
    text = re.sub(r"\'ll", " \'ll", text)

    # 3. Separate punctuation with spaces
    text = re.sub(r",", " , ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\(", " ( ", text)
    text = re.sub(r"\)", " ) ", text)
    text = re.sub(r"\?", " ? ", text)

    # 4. Replace multiple spaces with single space
    text = re.sub(r"\s{2,}", " ", text)

    # 5. Trim + lowercase
    return text.strip().lower()

result = clean_str(data)

#### END YOUR CODE #####

Check the results with the function just written on the extracted data.
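Beyond reading the output by eye, the cleaning rules imply a few properties that can be checked automatically. The sketch below is one possible sanity check, not part of the original exercise. `looks_cleaned` is a hypothetical helper, and it assumes the cleaned text is stripped on both ends.

```python
import re

# Characters that can remain after cleaning: lowercase letters, digits,
# the punctuation that is kept, and single spaces.
_ALLOWED = re.compile(r"[a-z0-9(),!?'` ]*")


def looks_cleaned(s: str) -> bool:
    """Return True if `s` satisfies the invariants the cleaning rules imply."""
    return (
        s == s.strip()                          # no leading/trailing whitespace
        and s == s.lower()                      # fully lowercased
        and re.search(r"\s{2,}", s) is None     # no runs of whitespace
        and _ALLOWED.fullmatch(s) is not None   # only allowed characters
    )


print(looks_cleaned("do n't panic , it 's fine !"))  # True
print(looks_cleaned("Don't  panic"))                 # False
```

Calling `looks_cleaned(result)` on the processed data should return `True` if every step was applied.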


In [24]:
#### YOUR CODE HERE ####
# Print only the first 1,000 characters; the full result is very long
print(result[:1000])
#### END YOUR CODE #####

