### Working with PDF and text files for NLP

#### String Formatter

In [1]:
name = "Anmol Pant"

In [2]:
print('The name of the person is {}'.format(name))

The name of the person is Anmol Pant


In [4]:
print(f'The name of the person is {name}')

The name of the person is Anmol Pant


In [5]:
# minimum width and alignment between columns

In [11]:
mylist = [('tuple 1',12),
         ('tuple 2',11),
         ('phirse tuple',10),
         ('ek aur tuple', 23),
         ('last tuple', 12)]

In [12]:
mylist

[('tuple 1', 12),
 ('tuple 2', 11),
 ('phirse tuple', 10),
 ('ek aur tuple', 23),
 ('last tuple', 12)]

In [15]:
for i in mylist:
    print (i)

('tuple 1', 12)
('tuple 2', 11)
('phirse tuple', 10)
('ek aur tuple', 23)
('last tuple', 12)


In [19]:
# Pad each field to a minimum width: 50 for the name, 10 for the number
for i in mylist:
    print(f'{i[0]:{50}} {i[1]:{10}}')

tuple 1                                                    12
tuple 2                                                    11
phirse tuple                                               10
ek aur tuple                                               23
last tuple                                                 12


In [21]:
# <, > and ^ align left, right and center; a character placed before them is the fill
for i in mylist:
    print(f'{i[0]:<{50}} {i[1]:.>{10}}')

tuple 1                                            ........12
tuple 2                                            ........11
phirse tuple                                       ........10
ek aur tuple                                       ........23
last tuple                                         ........12
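
The alignment specifiers used above can also be tried on a single value. This is a minimal sketch; `label` and `width` are illustrative names that do not appear in the notebook.

```python
# Format-spec pattern: [fill]align width, where align is one of <, > or ^
label = "tuple 1"
width = 12

left = f"{label:<{width}}"     # left-align, pad on the right with spaces
right = f"{label:.>{width}}"   # right-align, pad on the left with '.'
center = f"{label:*^{width}}"  # center, pad on both sides with '*'

# repr() makes the padding visible
print(repr(left))
print(repr(right))
print(repr(center))
```

When the padding cannot be split evenly, centering places the extra fill character on the right.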


### Working with .csv and .tsv files

In [22]:
import pandas as pd

In [23]:
data = pd.read_csv('moviereviews.tsv',sep = '\t')

In [24]:
data.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [25]:
data.shape

(2000, 2)

In [26]:
data['label'].value_counts()

pos    1000
neg    1000
Name: label, dtype: int64

In [28]:
pos = data[data['label'] == 'pos']

In [29]:
pos.head()

Unnamed: 0,label,review
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
11,pos,"with stars like sigourney weaver ( "" alien "" t..."
16,pos,i remember hearing about this film when it fir...
18,pos,garry shandling makes his long overdue starrin...


In [32]:
pos.to_csv('pos.tsv', sep = '\t', index = False)

In [34]:
pd.read_csv('pos.tsv', sep = '\t').head()

Unnamed: 0,label,review
0,pos,this has been an extraordinary year for austra...
1,pos,according to hollywood movies made in last few...
2,pos,"with stars like sigourney weaver ( "" alien "" t..."
3,pos,i remember hearing about this film when it fir...
4,pos,garry shandling makes his long overdue starrin...
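
The same read, filter and write round trip can be sketched with the standard-library `csv` module when pandas is not available. The rows below are made-up stand-ins for `moviereviews.tsv`, and an in-memory buffer replaces the real files.

```python
import csv
import io

# Tiny in-memory stand-in for a label/review TSV (hypothetical rows)
raw = "label\treview\nneg\tdull plot\npos\tgreat cast\npos\tsharp script\n"

# Read tab-separated rows into dictionaries keyed by the header line
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# Keep only the positive reviews, mirroring data[data['label'] == 'pos']
pos_rows = [row for row in rows if row["label"] == "pos"]

# Write the filtered rows back out as TSV
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["label", "review"], delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(pos_rows)
pos_tsv = out.getvalue()
print(pos_tsv)
```

pandas is still the more convenient choice for analysis; this sketch only shows that the `sep='\t'` argument corresponds to a tab delimiter in the underlying file.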


In [39]:
%%writefile text1.txt
hello, writing with magic command.
This is line two.
#built in magic command

Overwriting text1.txt


In [40]:
%%writefile -a text1.txt
Appending New Line

Appending to text1.txt


### Using Built-in Functions

In [1]:
file = open('text1.txt','r')

In [2]:
file

<_io.TextIOWrapper name='text1.txt' mode='r' encoding='cp1252'>

In [3]:
file.read()

'hello, writing with magic command.\nThis is line two.\n#built in magic command\nAppending New Line\n'

In [4]:
file.read()

''

In [7]:
file.seek(0)

0

In [6]:
file.read()

'hello, writing with magic command.\nThis is line two.\n#built in magic command\nAppending New Line\n'

In [9]:
file.readline()

'This is line two.\n'

In [10]:
file.seek(0)

0

In [11]:
file.readlines()

['hello, writing with magic command.\n',
 'This is line two.\n',
 '#built in magic command\n',
 'Appending New Line\n']

In [12]:
file.close()
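
The cells above depend on the position of the file cursor, which is why a second `read()` returns an empty string until `seek(0)` is called. A minimal sketch of that behaviour, using a throwaway temporary file (an assumption, so that `text1.txt` is left untouched) and an explicit `encoding` instead of the platform default shown in the output above:

```python
import os
import tempfile

# Write a small throwaway file for the demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, encoding="utf-8") as f:
    whole = f.read()       # reads everything; the cursor is now at the end
    again = f.read()       # nothing left to read
    f.seek(0)              # move the cursor back to the start
    first = f.readline()   # reads up to and including the first newline
    rest = f.readlines()   # the remaining lines, as a list

print(repr(whole), repr(again), repr(first), rest)
```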

In [13]:
with open('text1.txt') as file:
    text_data = file.readlines()
    print(text_data)

['hello, writing with magic command.\n', 'This is line two.\n', '#built in magic command\n', 'Appending New Line\n']


In [15]:
for temp in text_data:
    print(temp.strip())

hello, writing with magic command.
This is line two.
#built in magic command
Appending New Line


In [16]:
for i, temp in enumerate(text_data):
    print(str(i) + "--->" + temp.strip())

0--->hello, writing with magic command.
1--->This is line two.
2--->#built in magic command
3--->Appending New Line
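
For larger files it may be preferable to iterate over the file object directly instead of building a list with `readlines()`. A sketch under that assumption, with a temporary file standing in for `text1.txt` and numbering starting at 1:

```python
import os
import tempfile

# Throwaway file with three short lines
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("alpha\nbeta\ngamma\n")

numbered = []
with open(path, encoding="utf-8") as f:
    # A file object yields one line at a time, so no full list is built first
    for number, line in enumerate(f, start=1):
        numbered.append(f"{number}--->{line.rstrip()}")

print("\n".join(numbered))
```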


In [17]:
## Write file

In [18]:
file = open('text2.txt','w')

In [19]:
file

<_io.TextIOWrapper name='text2.txt' mode='w' encoding='cp1252'>

In [24]:
# Note: this cell was re-run after file.close() (In [21] below), so the write fails
file.write("This is just another file")

ValueError: I/O operation on closed file.

In [21]:
# Close the file to flush the buffer and complete the write operation
file.close()

In [25]:
with open('text3.txt','w') as file:
    file.write('This is third file \n')

In [26]:
with open('text3.txt','a') as file:
    for temp in text_data:
        file.write(temp)
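
The difference between the `'w'` and `'a'` modes, and the `ValueError` raised when writing to a closed file, can be reproduced in one place. A minimal sketch using a temporary file (an assumption, to avoid touching the notebook's own files):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w", encoding="utf-8") as f:   # "w" creates or truncates the file
    f.write("first\n")

with open(path, "a", encoding="utf-8") as f:   # "a" appends to the end
    f.write("second\n")

# The with-block has already closed the file, so this write raises ValueError
try:
    f.write("third\n")
    error = None
except ValueError as exc:
    error = str(exc)

with open(path, encoding="utf-8") as f:
    content = f.read()

print(repr(content))
print(error)
```

Using `with` closes the file automatically, which avoids both the forgotten `close()` and accidental writes to a closed handle.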

### Text Extraction from PDF files

In [27]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
Installing collected packages: PyPDF2
  Running setup.py install for PyPDF2: started
    Running setup.py install for PyPDF2: finished with status 'done'
Successfully installed PyPDF2-1.26.0


In [28]:
import PyPDF2 as pdf

In [30]:
file = open('NLP.pdf','rb')

In [31]:
file

<_io.BufferedReader name='NLP.pdf'>

In [32]:
pdf_reader = pdf.PdfFileReader(file)

In [33]:
pdf_reader

<PyPDF2.pdf.PdfFileReader at 0x6c20db0>

In [34]:
help(pdf_reader)

Help on PdfFileReader in module PyPDF2.pdf object:

class PdfFileReader(builtins.object)
 |  
 |  Initializes a PdfFileReader object.  This operation can take some time, as
 |  the PDF stream's cross-reference tables are read into memory.
 |  
 |  :param stream: A File object or an object that supports the standard read
 |      and seek methods similar to a File object. Could also be a
 |      string representing a path to a PDF file.
 |  :param bool strict: Determines whether user should be warned of all
 |      problems and also causes some correctable problems to be fatal.
 |      Defaults to ``True``.
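 |  :param warndest: Destination for logging warnings (defaults to
 |      ``sys.stderr``).
 |  :param bool overwriteWarnings: Determines whether to override Python's
 |      ``warnings.py`` module with a custom implementation (defaults to
 |      ``True``).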
 |  
 |  Methods defined here:
 |  
 |  __init__(self, stream, strict=True, warndest=None, overwriteWarnings=True)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  cacheGetIndirectObject(self, generation, idnum)
 |  
 |  cacheIndirectObject(self, generation, idnum, obj)
 |  
 |  decrypt(self, password)
 |      When using an encrypted / secured PDF file with the PDF Standard
 |      encryp

In [35]:
pdf_reader.getIsEncrypted()

False

In [37]:
pdf_reader.getNumPages()

9

In [40]:
page4 = pdf_reader.getPage(3)

In [41]:
page4.extractText()

'patientprobablyhasaleft-sidedcerebrovascularaccident;post-\nconvulsivestateislesslikely.\n™Negation,uncertainty,and\nafrmationformacontinuum.Uncertaintydetectionwasthe\nfocusofarecentNLPcompetition.\n365.Relationshipextraction\n:determiningrelationshipsbetween\nentitiesorevents,suchas\n‚treats,™‚causes,™and‚occurswith.\n™Lookupofproblem-speci\ncinformation\ndforexample,thesauri,\ndatabasesdfacilitatesrelationshipextraction.\nAnaphorareferenceresolution\n37isasub-taskthatdetermines\nrelationshipsbetween\n‚hierarchicallyrelated\n™entities:suchrela-\ntionshipsinclude:\n<Identity:oneentity\ndforexample,apronounlike\n‚s/he,™‚hers/his,™oranabbreviation\ndreferstoapreviouslymentioned\nnamedentity;\n<Part/whole\n:forexample,citywithinstate;\n<Superset/subset:forexample,antibiotic/penicillin.\n6.Temporalinferences/relationshipextraction\n3839\n:makinginfer-\nencesfromtemporalexpressionsandtemporalrelations\ndforexample,inferringthatsomethinghasoccurredinthepastor\nmayoccurinthefuture,andorderi

In [45]:
page5 = pdf_reader.getPage(4)

In [46]:
page5.extractText()

'AtutorialbyHearst\netal\n62andtheDTREGonlinedocu-\nmentation63provideapproachableintroductionstoSVMs.\nFradkinandMuchnik\n64provideamoretechnicaloverview.\nHiddenMarkovmodels(HMMs)\nAnHMMisasystemwhereavariablecanswitch(withvarying\n\nprobabilities)betweenseveralstates,generatingoneofseveral\npossibleoutputsymbolswitheachswitch(alsowithvarying\n\nprobabilities).Thesetsofpossiblestatesanduniquesymbols\nmaybelarge,but\nniteandknown(see\ngure2).Wecanobserve\ntheoutputs,butthesystem\n™sinternals(ie,state-switchproba-\nbilitiesandoutputprobabilities)are\n‚hidden.™Theproblemsto\nbesolvedare:\nA.Inference:givenaparticularsequenceofoutputsymbols,\ncomputetheprobabilitiesofoneormorecandidatestate-\n\nswitchsequences.\nB.Patternmatching\n:ndthestate-switchsequencemostlikelyto\nhavegeneratedaparticularoutput-symbolsequence.\nC.Training\n:givenexamplesofoutput-symbolsequence\n(training)data,computethestate-switch/outputprobabili-\nties(ie,systeminternals)that\ntthisdatabest.\nBandCareactuallyNaiv
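
The extracted strings above contain line breaks in the middle of sentences and words split by a hyphen at a line end (for example `infer-\nences`). A heuristic cleanup can be sketched with the standard library; `tidy_extracted_text` is a hypothetical helper, not part of PyPDF2, and it cannot restore the spaces between words that the extractor dropped.

```python
import re

def tidy_extracted_text(raw: str) -> str:
    """Heuristic cleanup for text returned by a PDF extractor."""
    # Join words split by a hyphen at a line break: "infer-\nences" -> "inferences"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat the remaining newlines as ordinary spaces
    text = text.replace("\n", " ")
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

sample = "making infer-\nences from temporal\nexpressions"
cleaned = tidy_extracted_text(sample)
print(cleaned)
```

The hyphen rule is a guess: a genuine compound such as `post-\nconvulsive` would also lose its hyphen, so the result should be checked before it feeds an NLP pipeline.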

### Append and Merge PDFs

In [42]:
pdf_writer = pdf.PdfFileWriter()

In [47]:
pdf_writer.addPage(page4)
pdf_writer.addPage(page5)

In [48]:
output = open('Pages.pdf','wb')
pdf_writer.write(output)
output.close()
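
The cells above use the PyPDF2 1.26 API. PyPDF2 3.0 removed the camelCase names (`PdfFileReader`, `getPage`, `extractText`, `addPage`), and the project continues as `pypdf`. A sketch of the same extract-and-copy steps with the newer names, assuming `pypdf` is installed; `copy_pages` is a hypothetical helper written for this illustration.

```python
def copy_pages(src_path, dst_path, page_indexes):
    """Copy selected zero-based pages from src_path into a new PDF.

    Assumes the `pypdf` package is installed (pip install pypdf).
    Returns the extracted text of each copied page.
    """
    # Imported lazily so the function can be defined without pypdf present
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    texts = []
    for index in page_indexes:
        page = reader.pages[index]          # replaces getPage(index)
        texts.append(page.extract_text())   # replaces extractText()
        writer.add_page(page)               # replaces addPage(page)

    # The with-block closes the output file once the PDF has been written
    with open(dst_path, "wb") as output:
        writer.write(output)
    return texts

# Hypothetical usage, mirroring the cells above:
# texts = copy_pages("NLP.pdf", "Pages.pdf", [3, 4])
```

Extraction quality may differ between versions, so the output for `NLP.pdf` is not guaranteed to match the strings shown earlier.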