<a href="https://colab.research.google.com/github/2003mahi/AI_Intern_Projects/blob/main/Extract_Text_from_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Text from PDF Files

Let's look at how to extract text from a PDF file, using the [`pdfx`](https://www.metachris.com/pdfx/) library in Python.

First we need to install the library:

In [4]:
!pip install pdfx

Collecting pdfx
  Downloading pdfx-1.4.1-py2.py3-none-any.whl.metadata (7.9 kB)
Collecting pdfminer.six==20201018 (from pdfx)
  Downloading pdfminer.six-20201018-py3-none-any.whl.metadata (3.3 kB)
Collecting chardet==4.0.0 (from pdfx)
  Downloading chardet-4.0.0-py2.py3-none-any.whl.metadata (3.5 kB)
Collecting sortedcontainers (from pdfminer.six==20201018->pdfx)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Downloading pdfx-1.4.1-py2.py3-none-any.whl (21 kB)
Downloading chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Downloading pdfminer.six-20201018-py3-none-any.whl (5.6 MB)
Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Installing collected packages: sortedcontainers, chardet, pdfminer.six, pdfx
 

Next, let's work with an example from the corpus in the [Rich Context leaderboard competition](https://github.com/Coleridge-Initiative/rclc/blob/master/corpus.ttl) – a machine learning competition about parsing named entities from PDFs of open access research publications.

The following snippets in [TTL format](https://en.wikipedia.org/wiki/Turtle_(syntax)) show a research paper `publication-7aa3d69253e37668541c` hosted on [EuropePMC](https://europepmc.org/) that has a known link to a dataset `dataset-0a7b604ab2e52411d45a` hosted by the [Centers for Disease Control and Prevention](https://wwwn.cdc.gov/nchs/nhanes/).

```
:publication-7aa3d69253e37668541c
  rdf:type :ResearchPublication ;
  foaf:page "http://europepmc.org/articles/PMC3001474"^^xsd:anyURI ;
  dct:publisher "PLoS One" ;
  dct:title "VKORC1 common variation and bone mineral density in the Third National Health and Nutrition Examination Survey" ;
  dct:identifier "10.1371/journal.pone.0015088" ;
  :openAccess "http://europepmc.org/articles/PMC3001474?pdf=render"^^xsd:anyURI ;
  cito:citesAsDataSource :dataset-0a7b604ab2e52411d45a ;
.

:dataset-0a7b604ab2e52411d45a
  rdf:type :Dataset ;
  foaf:page "https://wwwn.cdc.gov/nchs/nhanes/"^^xsd:anyURI ;
  dct:publisher "Centers for Disease Control and Prevention" ;
  dct:title "National Health and Nutrition Examination Survey" ;
  dct:alternative "NHANES" ;
  dct:alternative "NHANES I" ;
  dct:alternative "NHANES II" ;
  dct:alternative "NHANES III" ;
.
```

The paper is:

  * ["VKORC1 common variation and bone mineral density in the Third National Health and Nutrition Examination Survey"](http://europepmc.org/articles/PMC3001474); Dana C. Crawford, Kristin Brown-Gentry, Mark J. Rieder; _PLoS One_. 2010; 5(12): e15088.
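
The open access URL can also be read programmatically from the record. The following is a minimal sketch, assuming the TTL snippet is available as a string. The helper name `get_literal` is hypothetical, and its regular expression only handles the simple `predicate "value"` pattern shown above, not general Turtle syntax; a full RDF parser would be the more robust choice.

```python
import re

# Abbreviated copy of the publication record shown above
TTL_RECORD = '''
:publication-7aa3d69253e37668541c
  rdf:type :ResearchPublication ;
  foaf:page "http://europepmc.org/articles/PMC3001474"^^xsd:anyURI ;
  dct:publisher "PLoS One" ;
  dct:identifier "10.1371/journal.pone.0015088" ;
  :openAccess "http://europepmc.org/articles/PMC3001474?pdf=render"^^xsd:anyURI ;
  cito:citesAsDataSource :dataset-0a7b604ab2e52411d45a ;
.
'''

def get_literal(record, predicate):
    """Return the first quoted literal following `predicate`, or None.

    Hypothetical helper: it matches only lines of the form
    `predicate "value"`, which is all this snippet needs.
    """
    pattern = r'^\s*' + re.escape(predicate) + r'\s+"([^"]*)"'
    match = re.search(pattern, record, flags=re.MULTILINE)
    return match.group(1) if match else None

pdf_url = get_literal(TTL_RECORD, ":openAccess")
print(pdf_url)
```

The resulting `pdf_url` is the same string that is passed to `pdfx` in the next step.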

We'll use `pdfx` to download the PDF file directly from the open access URL:

In [None]:
import pdfx

pdf = pdfx.PDFx("http://europepmc.org/articles/PMC3001474?pdf=render")

pdf

Next, call the `get_text()` method on the `pdf` object to extract its text:

In [None]:
text = pdf.get_text()
text
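
Text extracted from a PDF usually carries layout residue such as ligature characters, words hyphenated across line breaks, and hard line wraps. The following is a minimal cleanup sketch using only the standard library. The function name `clean_pdf_text` and the sample string are hypothetical, and the rules are deliberately conservative: the de-hyphenation step, for instance, will also join a genuine compound that happens to break at a line end.

```python
import re
import unicodedata

def clean_pdf_text(raw):
    """Apply a few conservative cleanup steps to text extracted from a PDF."""
    # Normalize compatibility characters, e.g. the "fi" ligature U+FB01
    text = unicodedata.normalize("NFKC", raw)
    # Turn form feeds (page breaks) into ordinary line breaks
    text = text.replace("\f", "\n")
    # Rejoin words hyphenated across a line break: "min-\neral" -> "mineral"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single line breaks with spaces, keeping blank lines
    # as paragraph separators
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Hypothetical sample imitating typical extraction residue
sample = "Bone min-\neral density is de\ufb01ned in\nNHANES III.\n\nResults   follow.\f"
print(clean_pdf_text(sample))
```

In the notebook, the same function could be applied as `clean_pdf_text(text)` before the text is handed to the parser.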

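Now we can use `spaCy` to parse that text. The cell below loads the `en_core_web_sm` model, which can be installed with `python -m spacy download en_core_web_sm` if it is missing: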

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

Let's look at a dataframe of the parsed tokens:

In [None]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
df
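
For a long document the token table is easier to digest in summary form. The following is a minimal sketch that counts part-of-speech tags and collects content lemmas from rows laid out like `rows` above. The sample rows are the ones this notebook shows later for the sentence "This is your sample text."; the variable names `sample_rows`, `pos_counts`, and `content_lemmas` are hypothetical.

```python
from collections import Counter

# Rows in the same layout as `rows` above:
# [text, lemma, POS, explain, stopword]
sample_rows = [
    ["This", "this", "PRON", "pronoun", True],
    ["is", "be", "AUX", "auxiliary", True],
    ["your", "your", "PRON", "pronoun", True],
    ["sample", "sample", "NOUN", "noun", False],
    ["text", "text", "NOUN", "noun", False],
    [".", ".", "PUNCT", "punctuation", False],
]

# Count how often each part-of-speech tag occurs
pos_counts = Counter(row[2] for row in sample_rows)

# Keep the lemmas of tokens that are neither stopwords nor punctuation
content_lemmas = [
    lemma for _, lemma, pos, _, is_stop in sample_rows
    if not is_stop and pos != "PUNCT"
]

print(pos_counts.most_common())
print(content_lemmas)
```

Replacing `sample_rows` with the notebook's own `rows` list would give the same summary for the full paper.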

The parsed text contains many stray characters that could be cleaned up, but for this demo, let's run *named entity recognition* in `spaCy` to extract the entities:

In [1]:
for ent in doc.ents:
    print(ent.text, ent.label_)

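NameError: name 'doc' is not defined

This error appears when the cell runs in a fresh session, because `doc` is created by the earlier `spaCy` cell. The cell `In [2]` below rebuilds `doc` from a short sample sentence so that the remaining steps run on their own.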
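
Because the corpus record lists the dataset's title and alternative labels, a simple complement to named entity recognition is to search the text for those strings directly. The following is a minimal sketch, not the competition's method: the helper `find_dataset_mentions` is hypothetical, and the sample text is only the paper's title from the record above rather than the full extracted text.

```python
import re

# Title and alternative labels from the dataset record above
DATASET_LABELS = [
    "National Health and Nutrition Examination Survey",
    "NHANES",
    "NHANES I",
    "NHANES II",
    "NHANES III",
]

def find_dataset_mentions(text, labels):
    """Return (label, start, end) for each whole-word label match.

    Hypothetical helper: longer labels are tried first so that
    "NHANES III" is reported instead of the shorter "NHANES".
    """
    ordered = sorted(labels, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(label) for label in ordered) + r")\b"
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(pattern, text)]

# The paper's title, used here as a stand-in for the extracted text
title = (
    "VKORC1 common variation and bone mineral density in the "
    "Third National Health and Nutrition Examination Survey"
)
for label, start, end in find_dataset_mentions(title, DATASET_LABELS):
    print(label, start, end)
```

Matching is case-sensitive and exact, so variant spellings in the paper would need extra labels or a looser pattern.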

In [2]:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

text = "This is your sample text."

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
df

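| | text | lemma | POS | explain | stopword |
|---|---|---|---|---|---|
| 0 | This | this | PRON | pronoun | True |
| 1 | is | be | AUX | auxiliary | True |
| 2 | your | your | PRON | pronoun | True |
| 3 | sample | sample | NOUN | noun | False |
| 4 | text | text | NOUN | noun | False |
| 5 | . | . | PUNCT | punctuation | False |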


In [3]:
import string

from spacy.lang.en.stop_words import STOP_WORDS

# Strip ASCII punctuation characters from each token
df["text_no_punct"] = df["text"].str.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    # STOP_WORDS entries are lowercase, so compare case-insensitively
    return " ".join(word for word in text.split() if word.lower() not in STOP_WORDS)

df["text_no_stopwords"] = df["text_no_punct"].apply(remove_stopwords)

df

| | text | lemma | POS | explain | stopword | text_no_punct | text_no_stopwords |
|---|---|---|---|---|---|---|---|
| 0 | This | this | PRON | pronoun | True | This | |
| 1 | is | be | AUX | auxiliary | True | is | |
| 2 | your | your | PRON | pronoun | True | your | |
| 3 | sample | sample | NOUN | noun | False | sample | sample |
| 4 | text | text | NOUN | noun | False | text | text |
| 5 | . | . | PUNCT | punctuation | False | | |
