Collecting Data from PDFs and Text Documents
--------------------------------------------------------------------

Collecting Data from PDFs
-------------------------------------
Data is often stored in PDF files. To analyze it, we first need to extract
the text from these files and store it.

Problem
------------
You want to read a PDF file.

Solution
------------
The simplest way to do this is by using the PyPDF2 library.

How It Works
-------------------
Let’s follow the steps in this section to extract data from PDF files.



In [1]:
# Install and import the necessary library.
!pip install PyPDF2
import PyPDF2





In [2]:
# Now we extract the text.

# Open the PDF in read-binary mode; the with block closes the file for us.
with open("sample.pdf", "rb") as pdf:
    # Create a PDF reader object. PdfReader replaces the older
    # PdfFileReader name, which PyPDF2 3.x no longer accepts.
    pdf_reader = PyPDF2.PdfReader(pdf)

    # Check the number of pages in the PDF file.
    print(len(pdf_reader.pages))

    # Get the first page (pages are indexed from 0).
    page = pdf_reader.pages[0]

    # Finally, extract the text from the page.
    print(page.extract_text())

1
One minute read on Web Scrapping
 
Page 
1
 
 
1 minute read on Web scraping
 
 
Web scraping is a term used to describe the use of a program or algorithm to extract and process large 
amounts of data from the web. 
 
Whether you are a data scientist, engineer, or anybody who analyzes large amounts of datasets, the 
ability to scrape data from the web is a useful skill to have. 
 
Let's say you find data from the web, and there is no direct way to download it, web scraping
 
using 
Python is a skill you can use to extract the data into a useful form that can be imported.
 

Training 
batch
!!
 
Schedule on 
https://training.suven.net
 
 
and 
 
here
 
 
 



Note : 
-------

> 1. The code above does not work for scanned PDFs or images, because they hold pictures of pages rather than text. PyPDF2 does not have a way to extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string.

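> 2. The addPage() method of a PdfFileWriter object only adds pages to the end. To place a page in the middle, use insertPage() instead (these methods are named add_page() and insert_page() on PdfWriter in newer PyPDF2 releases).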
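
The extracted text shown above contains many stray line breaks, which gets in the way of storing it for further analysis. The following is a minimal sketch of one way to clean it up and to process every page rather than only the first. The helper names `normalize_whitespace` and `extract_all_pages` are hypothetical (they are not part of PyPDF2), and the sketch assumes a PyPDF2 release that provides `PdfReader` and `extract_text()`.

```python
import re


def normalize_whitespace(raw_text):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", raw_text).strip()


def extract_all_pages(pdf_path):
    """Return the cleaned text of every page in the PDF as a list of strings."""
    # Imported here so normalize_whitespace() can be used without PyPDF2.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Pages without a text layer (for example, scans) yield no text.
        return [
            normalize_whitespace(page.extract_text() or "")
            for page in reader.pages
        ]


# A fragment of the raw output shown above, with its stray line breaks
sample = "there is no direct way to download it, web scraping\n \nusing \nPython is a skill"
print(normalize_whitespace(sample))
```

Collapsing all whitespace also removes genuine paragraph breaks, so this suits bag-of-words style analysis better than tasks that depend on the document layout. With a file on disk, `extract_all_pages("sample.pdf")` would return one cleaned string per page.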

Collecting Data from Word Files
----------------------------------------------

Problem
-------------
You want to read Word files.

Solution
------------
The simplest way to do this is by using the python-docx module.

How It Works
-------------------
Python can create and modify Word documents, which have the .docx file extension, with the python-docx module. 

In [3]:
# Install the module by running the command below.
!pip install python-docx
# This also installs lxml, which python-docx depends on.





In [4]:
# Use the file "demo.docx" for this code.
import docx

# A raw string (r'...') keeps the backslashes in the Windows path literal.
fileName = r'C:\Program Files\Python36\suven\Adv ML\datasets\datasets\demo.docx'

doc = docx.Document(fileName)

# The Document object contains a list of Paragraph objects,
# one for each paragraph in the document.
len_of_docx = len(doc.paragraphs)

print(f"#paras in {fileName} is {len_of_docx}")

#paras in C:\Program Files\Python36\suven\Adv ML\datasets\datasets\demo.docx is 7


In [5]:
# python-docx treats the entire Word document as a tree (DOM):
# Document is the root and refers to the entire document,
# and every element after it is a child of the root.

line_1 = doc.paragraphs[0].text      # text of the first paragraph
print(f"first Line of {fileName} is {line_1}")


# The second paragraph is then
line_2 = doc.paragraphs[1].text
print(f"Next Line of {fileName} is {line_2}")

first Line of C:\Program Files\Python36\suven\Adv ML\datasets\datasets\demo.docx is Document Title
Next Line of C:\Program Files\Python36\suven\Adv ML\datasets\datasets\demo.docx is A plain paragraph with some bold and some italic. This is the second line !!


In [6]:
# Each of these Paragraph objects contains a list of one
# or more Run objects. A Run object is a contiguous run
# of text with the same style.
second_para_runs = doc.paragraphs[1].runs
num_of_textStyles_in_second_line = len(second_para_runs)
print(f"number of runs in second line is {num_of_textStyles_in_second_line}")


# Print every run in the second paragraph.
for run in second_para_runs:
    print(run.text)

print("--------------------------------------")
print("reading all text contents in the file")

# Define a function that reads the entire file as text.
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(' ' + para.text)
    # Join all the strings in the list with newlines.
    return '\n'.join(fullText)

# Call the function defined above.
print(getText(fileName))

number of runs in second line is 6
A plain paragraph with some
 
bold
 and some 
italic
. This is the second line !!
--------------------------------------
reading all text contents in the file
 Document Title
 A plain paragraph with some bold and some italic. This is the second line !!
 Heading, level 1
 Intense quote
 first item in unordered list
 first item in ordered list
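
The run listing above shows where the text is split but not why. As a rough sketch, the helper below labels each run with its character style by reading the `bold` and `italic` attributes that python-docx exposes on Run objects. The function name `describe_runs` and the stand-in paragraph are hypothetical; the stand-in only mimics the shape of a python-docx paragraph so that the sketch runs without a .docx file.

```python
from types import SimpleNamespace


def describe_runs(paragraph):
    """Return (text, style) pairs for each run in a paragraph-like object."""
    described = []
    for run in paragraph.runs:
        # run.bold and run.italic are True, False, or None (None = inherited).
        if run.bold and run.italic:
            style = "bold+italic"
        elif run.bold:
            style = "bold"
        elif run.italic:
            style = "italic"
        else:
            style = "plain"
        described.append((run.text, style))
    return described


# Hypothetical stand-in for a paragraph; real code would pass doc.paragraphs[1].
fake_paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text="A plain paragraph with some ", bold=None, italic=None),
    SimpleNamespace(text="bold", bold=True, italic=None),
    SimpleNamespace(text=" and some ", bold=None, italic=None),
    SimpleNamespace(text="italic", bold=None, italic=True),
])

for text, style in describe_runs(fake_paragraph):
    print(f"{style}: {text!r}")
```

With the document loaded earlier, `describe_runs(doc.paragraphs[1])` would work the same way, because the function only reads the `text`, `bold`, and `italic` attributes of each run.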
 



Note : 
-------

> Conceptually, Word documents have two layers, a text layer and a drawing layer. In the text layer, text objects are flowed from left to right and from top to bottom, starting a new page when the prior one is filled. In the drawing layer, drawing objects, called shapes, are placed at arbitrary positions. These are sometimes referred to as floating shapes.

> A picture is a shape that can appear in either the text or drawing layer. When it appears in the text layer it is called an inline shape, or more  specifically, an inline picture.

> Inline shapes are treated like a big text character (a character glyph). python-docx only supports inline pictures, and we can only add them; we cannot get, remove, or process the images.

> Hence, use docx2txt to extract images. (This code was covered in the Python-Core+Advanced Course; refer to Chapter 9.)
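
The note above points to docx2txt for image extraction. As a dependency-free alternative (not the docx2txt approach itself), the sketch below relies on the fact that a .docx file is a ZIP archive that stores its embedded images under `word/media/`. The function names `list_docx_images` and `extract_docx_images` are hypothetical, and the in-memory archive is only a stand-in for a real document.

```python
import io
import zipfile


def list_docx_images(docx_file):
    """Return the archive paths of the images stored inside a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        return [
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]


def extract_docx_images(docx_file, output_dir):
    """Copy every embedded image into output_dir and return the saved paths."""
    saved = []
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                # extract() recreates word/media/ beneath output_dir.
                saved.append(archive.extract(name, output_dir))
    return saved


# Tiny in-memory archive standing in for a real .docx (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"not a real image")
buffer.seek(0)

print(list_docx_images(buffer))
```

Both functions accept either a file path or a file-like object, so `list_docx_images("demo.docx")` would work on a document saved to disk.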