#### Extract All Names from a PDF File
*In this tutorial we build a pipeline that extracts all the names from a PDF file. We will go through the following steps:*

*> Load the PDF File*

*> Extract the Text*

*> Make a Text File (Optional)*

*> Data Preprocessing*

*> Extract Names*

##### Load the PDF File and Extract Text
*To load the PDF file we use **PdfFileReader**, a class provided by **PyPDF2**. Once the file is loaded, we extract its text. Loading and extraction can be done together in one step.*

*First we import the library that loads the PDF and extracts its text. If PyPDF2 is not installed yet, open a terminal and run:*

*> "pip install pypdf2"*

In [1]:
# Import the PDF reader class from PyPDF2.
# Note: PdfFileReader is the PyPDF2 1.x/2.x name; PyPDF2 3.0 removed it in favor of PdfReader.
from PyPDF2 import PdfFileReader

*To extract the text we need to know how many pages the PDF has. We then extract the text of every page and concatenate the results into a single string.*

In [2]:
# Load the pdf file
my_file = PdfFileReader("Bridging_The_Gap_Between_Training_&_Inference_For_Neural_Machine_Translation.pdf")
# Loop over every page instead of hard-coding the page count
num_pages = my_file.getNumPages()
pdf_text = ""  # not named `str`, which would shadow the built-in used later
for i in range(num_pages):
    pdf_text += my_file.getPage(i).extractText()

##### Make a Text File
*A text file makes further processing easier. Here the extracted text has already been saved as **my_file.txt**, so the writing step is commented out. Once the file exists, you can reopen it whenever you need it. By default it is saved in the current working directory.*

In [13]:
# Making a text file
#with open("my_file.txt", "w", encoding="utf-8") as f:
#    f.write(pdf_text)

In [17]:
# Open the text file
with open("my_file.txt", "r", encoding="utf-8") as f:
    text = f.read()
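
*The write-once, read-many idea above can be sketched as a small helper. This is only an illustration: **load_or_create_text** and the **make_text** callback are hypothetical names, and the demo writes to a temporary directory with a stand-in string instead of real PDF text.*

```python
from pathlib import Path
import tempfile

def load_or_create_text(path, make_text):
    """Return the text stored at path, creating the file first if it is missing."""
    path = Path(path)
    if not path.exists():
        # Only run the (possibly slow) extraction step when no cached file exists.
        path.write_text(make_text(), encoding="utf-8")
    return path.read_text(encoding="utf-8")

# Demo in a temporary directory, with a stand-in for the PDF extraction step.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "my_file.txt"
    first = load_or_create_text(cache, lambda: "extracted text")
    second = load_or_create_text(cache, lambda: "never used")
    print(first == second)
```

*The second call never invokes its callback, because the file already exists.*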

##### Preprocessing
*Before preprocessing, we replace every tab with a newline and split the text on newlines. Python's **split()** method breaks a string at a separator and returns the pieces as a list.*
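
*A minimal sketch of this step on a made-up sample string (the names are illustrative and not taken from the PDF): tabs become newlines, **split()** cuts the text into a list, and **join()** flattens the list back into one string.*

```python
# Illustrative sample only; not taken from the PDF used in this tutorial.
sample = "Ada Lovelace\tAlan Turing\nAnalytical Engine Notes"

# Turn every tab into a newline, then cut the text at each newline.
lines = sample.replace("\t", "\n").split("\n")
print(lines)

# Joining with spaces gives one flat string for the later steps.
flat = " ".join(lines)
print(flat)
```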

In [6]:
# Import the libraries
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Abs_Sayem\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
# Replace tabs (\t) with newlines (\n), then split into lines
replaced_data = text.replace('\t', '\n').split('\n')
# Join the lines back into a single space-separated string
string_data = " ".join(replaced_data)
# Remove numbers
nonnumbered_text = re.sub(r'\d+', '', string_data)
# Remove punctuation tokens
punctuation = set(string.punctuation)
tokenized_text = word_tokenize(nonnumbered_text)
#print(tokenized_text)
filtered_text = [token for token in tokenized_text if token not in punctuation]
# Remove stopwords (this also lowercases every remaining token)
stop = set(stopwords.words("english"))
stopped_text = [word.lower() for word in filtered_text if word.lower() not in stop]
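
*The same preprocessing steps can be tried on a single sentence with the standard library only. This is a rough sketch, not the tutorial's pipeline: it swaps NLTK's **word_tokenize** for **str.translate** plus **split()**, and **STOPWORDS** is a tiny hypothetical list standing in for NLTK's English stopwords.*

```python
import re
import string

# Hypothetical mini stopword list, standing in for stopwords.words("english").
STOPWORDS = {"the", "a", "of", "and", "in", "by"}

def preprocess(text, stopwords=STOPWORDS):
    """Strip digits and punctuation, lowercase, and drop stopwords."""
    no_digits = re.sub(r"\d+", "", text)
    # Replace each punctuation character with a space so words stay separated.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = no_digits.translate(table).split()
    return [tok.lower() for tok in tokens if tok.lower() not in stopwords]

sample = "In 2019, the paper by Ada Lovelace and Alan Turing won a prize."
result = preprocess(sample)
print(result)
```

*Note that the names come out lowercased, which matters for any later step that uses capitalization as a cue.*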

##### Parsing Name

###### Using nameparser

In [8]:
import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet

In [23]:
person_list = []
person_names=person_list
def get_human_names(text):
    """Collect multi-token PERSON entities from text into the global person_list."""
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # avoid grabbing lone surnames
            name = " ".join(person)
            if name not in person_list:
                person_list.append(name)
    return person_list

In [24]:
# Convert list to string
new_string_data = " ".join(stopped_text)
#print(new_string_data)

In [25]:
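# Note: new_string_data was lowercased during preprocessing. ne_chunk leans on
# capitalization, which likely explains an empty result here.
names = get_human_names(new_string_data)  # fills the global person_list
# Drop every candidate that contains a WordNet dictionary word. Build a new list
# instead of removing items from the list being iterated, which skips elements.
person_names = [
    person for person in person_list
    if not any(wordnet.synsets(part) for part in person.split(" "))
]

print(person_names)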

[]
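
*Removing items from a list while looping over that same list skips the element after each removal, so filtering into a new list is the safer pattern for this step. The sketch below shows the difference on made-up candidates; **drop_matching** and the **common** word set are hypothetical stand-ins for the WordNet lookup.*

```python
def drop_matching(candidates, is_common_word):
    """Return the candidates that contain no part flagged by is_common_word."""
    return [c for c in candidates
            if not any(is_common_word(p) for p in c.split())]

# Hypothetical stand-in for the WordNet dictionary lookup.
common = {"neural", "machine", "translation"}
candidates = ["Neural Machine", "Machine Translation", "Ada Lovelace"]

# Removing while iterating skips the element that follows each removal.
buggy = list(candidates)
for c in buggy:
    if any(p.lower() in common for p in c.split()):
        buggy.remove(c)

kept = drop_matching(candidates, lambda p: p.lower() in common)
print(buggy)
print(kept)
```

*In the in-place version "Machine Translation" survives because the loop never visits it, while the filtered version keeps only "Ada Lovelace".*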


###### Using NERTagger

In [16]:
import nltk
# NLTK 3.x renamed NERTagger to StanfordNERTagger; importing the old name raises
# ImportError: cannot import name 'NERTagger' from 'nltk.tag.stanford'
from nltk.tag.stanford import StanfordNERTagger

In [15]:
for sent in nltk.sent_tokenize(new_string_data):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

In [None]:
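# StanfordNERTagger is the current NLTK name for the class formerly called NERTagger
from nltk.tag.stanford import StanfordNERTagger
st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
#text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(new_string_data):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)  # list of (token, tag) pairs
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)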
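
*The tagger labels one token at a time, so a full name arrives as several separate PERSON pairs. One possible way to stitch them together is sketched below. **group_person_names** is a hypothetical helper, and the **tagged** list is hand-written in the (token, tag) shape that **st.tag()** returns; it is not real tagger output.*

```python
from itertools import groupby

def group_person_names(tagged_tokens):
    """Join runs of consecutive PERSON-tagged tokens into full names."""
    names = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "PERSON":
            names.append(" ".join(token for token, _ in run))
    return names

# Hand-written (token, tag) pairs for illustration only.
tagged = [("Ada", "PERSON"), ("Lovelace", "PERSON"), ("met", "O"),
          ("Alan", "PERSON"), ("Turing", "PERSON"), ("in", "O"),
          ("London", "LOCATION")]
full_names = group_person_names(tagged)
print(full_names)
```

*A limitation of this approach is that two different people mentioned back to back, with no other token between them, would be merged into one name.*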