# Reading Text from PDF Files
On this page, we review some basic commands for reading text from PDF files. We consider two different libraries:
### PyPDF2
and 
### pdfreader

In [6]:
import PyPDF2

In [7]:
f1 = open("us0.pdf", "rb") # the mode "rb" opens the file for reading in binary mode
reader = PyPDF2.PdfFileReader(f1) # in PyPDF2 3.0 and later this class is named PdfReader

In [8]:
reader.numPages

5

In [9]:
page1 = reader.getPage(1) # Page indices start at 0, so this reads page 2 of the file
print(page1.extractText()) # This extracts only the text part of the page

He has dissolved Representative Houses repeatedly, for opposing with manlyfirmness his invasions on the rights of the people.He has refused for a long time, after such dissolutions, to cause others to beelected; whereby the Legislative powers, incapable of Annihilation, have returnedto the People at large for their exercise; the State remaining in the mean timeexposed to all the dangers of invasion from without, and convulsions within.He has endeavoured to prevent the population of these States; for that purposeobstructing the Laws for Naturalization of Foreigners; refusing to pass others toencourage their migrations hither, and raising the conditions of newAppropriations of Lands.He has obstructed the Administration of Justice, by refusing his Assent to Lawsfor establishing Judiciary powers.He has made Judges dependent on his Will alone, for the tenure of their offices,and the amount and payment of their salaries.He has erected a multitude of New Offices, and sent hither swarms of Off

In [11]:
f1.close()

#### Note that the end of each line in the PDF text is attached (without any whitespace) to the beginning of the next line.
#### Some extra post-processing is therefore needed to remedy this problem.
#### The same thing happens when using "pdfreader".
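
One possible form of that extra work is sketched below. It is only a heuristic and not part of either library: the hypothetical helper `restore_missing_spaces` re-inserts a space where punctuation is glued directly to the next word (as in "people.He" or "offices,and"). It cannot repair words that were merged without punctuation, such as "manlyfirmness", and it may also touch abbreviations or URLs, so the result should be checked by hand.

```python
import re

def restore_missing_spaces(text):
    """Heuristically re-insert spaces lost where two lines were glued together.

    Hypothetical helper for illustration; it only handles joins that
    happen right after punctuation.
    """
    # lowercase letter + sentence-ending punctuation + uppercase letter,
    # e.g. "people.He" becomes "people. He"
    text = re.sub(r"([a-z][.!?])([A-Z])", r"\1 \2", text)
    # letter + comma or semicolon + letter, e.g. "offices,and" becomes "offices, and"
    text = re.sub(r"([A-Za-z][,;])([A-Za-z])", r"\1 \2", text)
    return text

sample = "the rights of the people.He has made Judges dependent on his Will alone, for the tenure of their offices,and the amount"
print(restore_missing_spaces(sample))
```

Restricting the first rule to an uppercase letter after the period is a deliberate choice: it leaves strings such as "e.g." alone in most cases, at the cost of missing some joins.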

### Note that the PyPDF2-based method does not work for every PDF file. One example is the file used below, which was exported from Wikipedia.
### For such files we have to use another library, such as "pdfreader":

# Reading a pdf file using the package "pdfreader"

In [12]:
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer

In [13]:
fd = open("nlp0.pdf", "rb") # nlp0.pdf is a Wikipedia article about natural language processing
doc = PDFDocument(fd) # creating a PDFDocument instance

In [14]:
print("version = ", doc.header.version)

version =  1.4
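
The version reported here corresponds to the header at the start of the file: a PDF file begins with a comment line such as `%PDF-1.4`. As a rough cross-check that does not depend on "pdfreader", the sketch below reads that header with the standard library only. The function name `pdf_header_version` is hypothetical, and an in-memory byte stream stands in for a real file opened in "rb" mode.

```python
import io
import re

def pdf_header_version(stream):
    """Return the version from a PDF header such as b'%PDF-1.4', or None.

    Hypothetical helper for illustration; expects a binary stream
    positioned at the start of the file.
    """
    first_bytes = stream.read(16)
    match = re.match(rb"%PDF-(\d\.\d)", first_bytes)
    return match.group(1).decode("ascii") if match else None

# In-memory stand-in for a file opened with open("nlp0.pdf", "rb")
fake_pdf = io.BytesIO(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
print("version = ", pdf_header_version(fake_pdf))
```

Returning `None` for a missing header, rather than raising, keeps the check easy to use on files that may not be PDFs at all.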


### Catalog or Document root

In [15]:
print("type = ", doc.root.Type)
# print("metadata = ", doc.root.Metadata.Subtype) # Not available for every document
# print("title = ", doc.root.Outlines.First['Title']) # Not available for every document
# For a list of the attributes available in a PDF file, see the PDF-<version> specifications.

type =  Catalog


## Various pages of a pdf document

In [16]:
all_pages = list(doc.pages())
print("number of pages = ", len(all_pages))

number of pages =  11


In [18]:
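viewer = SimplePDFViewer(fd)
for index, canvas in enumerate(viewer): # iterating over the viewer yields one canvas per page; index starts at 0
    page_strings = canvas.strings # the text strings found on the page
    text = "".join(page_strings) # joined without a separator
    print(f"=============================  Page {index}  ==============================\n", text)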

 An automated online assistantproviding customer service on aweb page, an example of anapplication where naturallanguage processing is a majorcomponent.[1]Natural language processingNatural language processing (NLP) is a subfield of linguistics, computerscience, and artificial intelligence concerned with the interactions betweencomputers and human language, in particular how to program computers toprocess and analyze large amounts of natural language data. The result is acomputer capable of ‘understanding’ the contents of documents, including thecontextual nuances of the language within them. The technology can thenaccurately extract information and insights contained in the documents as well ascategorize and organize the documents themselves.Challenges in natural language processing frequently involve speech recognition,natural language understanding, and natural-language generation.HistorySymbolic NLP (1950s - early 1990s)Statistical NLP (1990s - 2010s)Neural NLP (present)Methods: Ru

 each input feature. Such models have the advantage that they can express the relative certainty of many differentpossible answers rather than only one, producing more reliable results when such a model is included as a componentof a larger system.Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rulessimilar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models tonatural language processing, and increasingly, research has focused on statistical models, which make soft,probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cachelanguage models upon which many speech recognition systems now rely are examples of such statistical models. Suchmodels are generally more robust when given unfamiliar input, especially input that contains errors (as is very commonfor real-world data), and produce more reliable results when integ

 through the front door", "the front door" is a referring expression and the bridging relationship to beidentified is the fact that the door being referred to is the front door of John's house (rather than ofsome other structure that might also be referred to).Discourse analysisThis rubric includes several related tasks. One task is discourse parsing, i.e., identifying thediscourse structure of a connected text, i.e. the nature of the discourse relationships betweensentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing andclassifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement,assertion, etc.).Implicit Semantic Role LabellingGiven a single sentence, identify and disambiguate semantic predicates (e.g., verbal frames) andtheir explicit semantic roles in the current sentence (see Semantic Role Labelling above). Then,identify semantic roles that are not explicitly realized in the current sentence, classify them in

 12. Winograd, Terry (1971). Procedures as a Representation for Data in a Computer Program forUnderstanding Natural Language (http://hci.stanford.edu/winograd/shrdlu/) (Thesis).13. Schank, Roger C.; Abelson, Robert P. (1977). Scripts, Plans, Goals, and Understanding: An InquiryInto Human Knowledge Structures. Hillsdale: Erlbaum. ISBN 0-470-99033-3.14. Mark Johnson. How the statistical revolution changes (computational) linguistics. (http://www.aclweb.org/anthology/W09-0103) Proceedings of the EACL 2009 Workshop on the Interaction betweenLinguistics and Computational Linguistics.15. Philip Resnik. Four revolutions. (http://languagelog.ldc.upenn.edu/nll/?p=2946) Language Log,February 5, 2011.16. Socher, Richard. "Deep Learning For NLP-ACL 2012 Tutorial" (https://www.socher.org/index.php/Main/DeepLearningForNLP-ACL2012Tutorial). www.socher.org. Retrieved 2020-08-17. This was an earlyDeep Learning tutorial at the ACL 2012 and met with both interest and (at the time) skepticism by mostparti

In [20]:
fd.close()

#### This library has the same problem as PyPDF2: it attaches the end of each line to the beginning of the next line.
# For more information about "pdfreader" visit:
https://pdfreader.readthedocs.io/en/latest/tutorial.html