 * Working with f-strings (formatted string literals) to format printed text
 * Working with Files - opening, reading, writing and appending text files

In [499]:
name = 'Fred'

In [500]:
print('His name is {var}.'.format(var=name))

His name is Fred.


In [501]:
print(f'His name is {name}.')

His name is Fred.


### Minimum Widths, Alignment and Padding

In [502]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

for book in library:
    print(f'{book[0]:{10}} {book[1]:{8}} {book[2]:{7}}')

Author     Topic    Pages  
Twain      Rafting      601
Feynman    Physics       95
Hamilton   Mythology     144


In [503]:
# To set the alignment, use the character `<` for left-align, `^` for center, `>` for right.
# To set padding, precede the alignment character with the padding character (`-` and `.` are common choices).

In [504]:
for book in library:
    print(f'{book[0]:{10}} {book[1]:{10}} {book[2]:.^{7}}') # here .^ was added to center the last column and pad it with dots

Author     Topic      .Pages.
Twain      Rafting    ..601..
Feynman    Physics    ..95...
Hamilton   Mythology  ..144..
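The three alignment characters can be compared side by side. This is a minimal sketch; the value `word` and the width of 7 are chosen only for illustration and are not part of the library example above.

```python
# Hypothetical sample value, used only to show the three alignments
word = 'cat'

left = f'{word:<7}'      # left-align, pad on the right with spaces
center = f'{word:-^7}'   # center, pad both sides with '-'
right = f'{word:.>7}'    # right-align, pad on the left with '.'

print(repr(left))    # 'cat    '
print(repr(center))  # '--cat--'
print(repr(right))   # '....cat'
```

When no alignment character is given, strings are left-aligned and numbers are right-aligned, which is why the page counts in the first table line up on the right.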


In [505]:
from datetime import datetime

today = datetime(year=2018, month=1, day=27)

print(f'{today:%B %d, %Y}')

January 27, 2018
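The format codes after the colon are the same ones `strftime` accepts, so the f-string above should produce the same text as an explicit `strftime` call. A small sketch to check that, reusing the date from the cell above:

```python
from datetime import datetime

today = datetime(year=2018, month=1, day=27)

via_fstring = f'{today:%B %d, %Y}'           # format spec inside the f-string
via_strftime = today.strftime('%B %d, %Y')   # the explicit method call

print(via_fstring == via_strftime)  # True
```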


In [506]:
%%writefile test.txt
Hello, this is a quick test file.
This is the second line of the file.

Overwriting test.txt


In [507]:
# If open() cannot find the file, make sure your .txt file is saved in the same location as your notebook. To check your notebook location, use **pwd**:
pwd = "../kaggle/working/"
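Outside of a notebook, the working directory can also be checked from plain Python. The sketch below assumes nothing about your setup beyond the file name `test.txt` used in this section; `target` is a name introduced here for illustration.

```python
import os
from pathlib import Path

cwd = os.getcwd()                 # current working directory as a string
target = Path(cwd) / 'test.txt'   # where open('test.txt') would look

print(cwd)
print(target.exists())  # True only if test.txt is in the working directory
```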

In [508]:
my_file = open('test.txt')

In [509]:
# Read the whole file into a single string
my_file.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

In [510]:
# Now read again
my_file.read()

''

In [511]:
# Readlines returns a list of the lines in the file
my_file.seek(0)
my_file.readlines()

['Hello, this is a quick test file.\n',
 'This is the second line of the file.\n']
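The empty string from the second `read()` comes from the file cursor sitting at the end of the file. Here is a self-contained sketch of the same behavior; it writes to a hypothetical scratch file in a temporary directory so it does not touch `test.txt`.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'cursor_demo.txt')
with open(path, 'w') as f:
    f.write('line one\nline two\n')

with open(path) as f:
    first = f.read()    # reads everything; the cursor is now at the end
    second = f.read()   # nothing left to read
    f.seek(0)           # move the cursor back to the start
    third = f.read()    # the full contents again

print(repr(second))     # ''
print(first == third)   # True
```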

In [512]:
my_file.close()

## Writing to a File

In [513]:
my_file = open('test.txt','w+')

In [514]:
# Write to the file
my_file.write('This is a new first line')

24

In [515]:
# Read the file
my_file.seek(0)
my_file.read()

'This is a new first line'

In [516]:
my_file.close()  # always do this when you're done with a file

In [517]:
my_file = open('test.txt','a+')
my_file.write('\nThis line is being appended to test.txt')
my_file.write('\nAnd another line here.')

23

In [518]:
my_file.seek(0)
print(my_file.read())

This is a new first line
This line is being appended to test.txt
And another line here.


In [519]:
my_file.close()
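The difference between the write and append modes is easiest to see on a throwaway file. This sketch uses plain `'w'` and `'a'` (without the `+`) and a hypothetical scratch file, so treat it as an illustration rather than a replacement for the cells above.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:   # 'w' creates the file, or truncates it if it exists
    f.write('first')
with open(path, 'w') as f:   # opening with 'w' again wipes the old content
    f.write('second')
with open(path, 'a') as f:   # 'a' keeps the content and writes at the end
    f.write(' third')

with open(path) as f:
    contents = f.read()

print(contents)  # second third
```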

In [520]:
%%writefile -a test.txt

This is more text being appended to test.txt
And another line here.

Appending to test.txt


In [521]:
with open('test.txt','r') as txt:
    first_line = txt.readlines()[0]
    
print(first_line)

This is a new first line



In [522]:
with open('test.txt','r') as txt:
    for line in txt:
        print(line, end='')  # the end='' argument removes extra linebreaks

This is a new first line
This line is being appended to test.txt
And another line here.
This is more text being appended to test.txt
And another line here.
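One benefit of the `with` statement is that the file is closed automatically when the block ends, so no explicit `close()` is needed. A minimal sketch on a hypothetical scratch file:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')
with open(path, 'w') as f:
    f.write('a\nb\n')

with open(path) as txt:
    # Strip the trailing newline from each line while iterating
    lines = [line.rstrip('\n') for line in txt]

print(txt.closed)  # True: the with block closed the file
print(lines)       # ['a', 'b']
```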


### Working with PDF

In [523]:
pip install PyPDF2

Note: you may need to restart the kernel to use updated packages.


In [524]:
# note the capitalization
import PyPDF2

In [525]:
# Notice we read it as a binary with 'rb'
f = open('/kaggle/input/us-declaration-pdf-file/US_Declaration.pdf','rb')

In [526]:
pdf_reader = PyPDF2.PdfReader(f)

In [527]:
len(pdf_reader.pages)

5

In [528]:
page_one = pdf_reader.pages[0]

In [529]:
page_one_text = page_one.extract_text()

In [530]:
page_one_text

"Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or to 

In [531]:
f.close()

## Adding to PDFs

With PyPDF2 we cannot write arbitrary text into a PDF: a Python string is a single plain type, while PDF content carries fonts, placements, and other layout parameters that a string does not describe.

What we *can* do is copy pages and append pages to the end.

In [532]:
f = open('/kaggle/input/us-declaration-pdf-file/US_Declaration.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)

In [533]:
first_page = pdf_reader.pages[0]

In [534]:
pdf_writer = PyPDF2.PdfWriter()

In [535]:
# Add the copied page to the writer; without this step the output PDF has no pages
pdf_writer.add_page(first_page)

In [536]:
pdf_output = open("Some_New_Doc.pdf","wb")

In [537]:
pdf_writer.write(pdf_output)

(False, <_io.BufferedWriter name='Some_New_Doc.pdf'>)

In [538]:
pdf_output.close()
f.close()

In [539]:
f = open('/kaggle/input/us-declaration-pdf-file/US_Declaration.pdf','rb')

# List of every page's text.
# The index will correspond to the page number.
pdf_text = [0]  # zero is a placeholder to make page 1 = index 1

pdf_reader = PyPDF2.PdfReader(f)

for page in pdf_reader.pages:
    pdf_text.append(page.extract_text())

f.close()

In [540]:
pdf_text

[0,
 "Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter o

In [541]:
print(pdf_text[2])

He has dissolved Re presentative Ho uses repeatedly , for opposing wit h manly
firmness his invasions on the rights of the people.
He has refused for a long time, after such dissolutions, to cause others to be
elected; whereby the Leg islative powers, incapable of Annihilation, have returned
to the People at lar ge for their exe rcise; the State r emaining in the me an time
exposed to all the dangers of invasion from without, and convulsions within.
He has endeavou red to prevent the  population of these  States; for that pur pose
obstructing the L aws for Natural ization of Foreig ners; refusing  to pass others to
encourage their migrations hither, and raising the conditions of new
Appropriations of  Lands.
He has obstructed the Administration of Justice, by refusing his Assent to Laws
for establishing  Judiciary pow ers.
He has made Judge s dependent on his Wil l alone, for the te nure of their off ices,
and the amount and  payment of t heir salaries.
He has erected  a multitude of N

In [542]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [543]:
'phone' in text

True

In [544]:
import re

In [545]:
pattern = 'phone'

In [546]:
re.search(pattern,text)

<re.Match object; span=(12, 17), match='phone'>

In [547]:
pattern = "number"

In [548]:
re.search(pattern,text)

<re.Match object; span=(18, 24), match='number'>

In [549]:
match = re.search(pattern,text)

In [550]:
match

<re.Match object; span=(18, 24), match='number'>

In [551]:
match.span()

(18, 24)

In [552]:
match.start()

18

In [553]:
match.end()

24

# But what if the pattern occurs more than once?

In [554]:
text = "my phone is a new phone"

In [555]:
match = re.search("phone",text)

In [556]:
match.span()

(3, 8)

In [557]:
matches = re.findall("phone",text)

In [558]:
matches

['phone', 'phone']

In [559]:
len(matches)

2

In [560]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


In [561]:
match.group()

'phone'
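To recap the three calls side by side: `re.search` gives the first match (or `None`), `re.findall` gives the matched strings, and `re.finditer` gives match objects. The sketch below reuses the sentence from above; the word "tablet" is only a stand-in for a pattern that does not occur.

```python
import re

text = "my phone is a new phone"

first = re.search("phone", text)          # first match only
all_strings = re.findall("phone", text)   # list of matched strings
all_spans = [m.span() for m in re.finditer("phone", text)]  # one span per match
missing = re.search("tablet", text)       # no match, so this is None

print(first.span())     # (3, 8)
print(all_strings)      # ['phone', 'phone']
print(all_spans)        # [(3, 8), (18, 23)]
print(missing is None)  # True
```

Because `re.search` returns `None` when nothing matches, it is worth checking the result before calling `.span()` or `.group()` on it.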

# Patterns

So far we've learned how to search for a basic string. What about more complex cases, such as finding a telephone number or an email address in a large string of text?

We could just use the search method if we knew the exact phone number or email address, but what if we don't? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with it; often it's just a matter of looking up the pattern code.

Let's begin!

## Identifiers for Characters in Patterns

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

In [562]:
text = "My telephone number is 408-555-1234"

In [563]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [564]:
phone.group()

'408-555-1234'

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

In [565]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>
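A few rows of the quantifier table can be checked directly. This sketch uses `re.fullmatch`, which is not used elsewhere in this notebook; it succeeds only when the whole string fits the pattern. The sentence in the last line is made up for illustration.

```python
import re

two_to_four = re.fullmatch(r'\d{2,4}', '123')     # 3 digits: within 2 to 4
too_many = re.fullmatch(r'\d{2,4}', '12345')      # 5 digits: no full match
optional_s = re.fullmatch(r'plurals?', 'plural')  # the trailing s is optional
zero_bs = re.fullmatch(r'A*B*C*', 'AAACC')        # B* is satisfied by zero B's
long_words = re.findall(r'\w{3,}', 'an ox ran away')  # words of 3+ characters

print(two_to_four is not None)  # True
print(too_many is None)         # True
print(optional_s is not None)   # True
print(zero_bs is not None)      # True
print(long_words)               # ['ran', 'away']
```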

In [566]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [567]:
results = re.search(phone_pattern,text)

In [568]:
# The entire result
results.group()

'408-555-1234'

In [569]:
# Can then also call by group position.
# remember groups were separated by parentheses ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1)

'408'

In [570]:
results.group(2)

'555'
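All of the groups can also be pulled out at once with `.groups()`, which returns a tuple. A short sketch using the same pattern and text; the names `area`, `prefix`, and `line` are introduced here only for illustration.

```python
import re

text = "My telephone number is 408-555-1234"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# A compiled pattern has its own search() method
results = phone_pattern.search(text)

# Unpack the three parenthesized groups in order
area, prefix, line = results.groups()

print(results.groups())   # ('408', '555', '1234')
print(area, prefix, line) # 408 555 1234
print(results.group(0) == results.group())  # True: group 0 is the whole match
```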

## Additional Regex Syntax

### Or operator |

Use the pipe operator to write an **or** statement. For example:

In [571]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [572]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>

### The Wildcard Character

Use a "wildcard" as a placeholder that will match any character placed there. You can use a simple period **.** for this. For example:

In [573]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [574]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

In [575]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

However, this still grabs extra characters in front of the word. Really we only want words that end with "at".

In [576]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
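One caveat: `\S+at` only requires that "at" appears somewhere after at least one non-whitespace character, so it can also match the start of a longer word. A possible refinement, not used in the original example, is the word-boundary token `\b`. The sample sentence below is made up for illustration.

```python
import re

# Hypothetical sentence containing a word that merely starts with "sat"
sample = "The bat saw a satellite"

loose = re.findall(r'\S+at', sample)       # also picks up 'sat' from 'satellite'
strict = re.findall(r'\b\w+at\b', sample)  # whole words that end with 'at'

print(loose)   # ['bat', 'sat']
print(strict)  # ['bat']
```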

### Starts With and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [577]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']

In [578]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']
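By default, **^** and **$** anchor to the start and end of the whole string. With text that spans several lines, the `re.MULTILINE` flag (not used elsewhere in this notebook) makes them apply to each line. A small sketch with a made-up two-line string:

```python
import re

# Hypothetical two-line string, used only for this demonstration
lines = '1 apple 7\n2 pears 8'

start_default = re.findall(r'^\d', lines)                   # start of string only
start_multi = re.findall(r'^\d', lines, flags=re.MULTILINE) # start of every line
end_default = re.findall(r'\d$', lines)                     # end of string only
end_multi = re.findall(r'\d$', lines, flags=re.MULTILINE)   # end of every line

print(start_default)  # ['1']
print(start_multi)    # ['1', '2']
print(end_default)    # ['8']
print(end_multi)      # ['7', '8']
```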

### Exclusion

To exclude characters, we can use the **^** symbol in conjunction with a set of brackets **[]**. A pattern of the form `[^...]` matches any single character that is not listed inside the brackets. For example:

In [579]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [580]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [581]:
# Adding + groups consecutive non-digit characters into one match
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [582]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [583]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [584]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [585]:
clean

'This is a string But it has punctuation How can we remove it'
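An alternative to the find-and-join approach is `re.sub`, which replaces every match with a given string. This sketch removes the same three punctuation marks by substituting an empty string; for this particular sentence it should give the same result as the join above.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each !, . or ? with nothing; spaces are left untouched
clean_sub = re.sub(r'[!.?]', '', test_phrase)

# The find-and-join version from the cells above, for comparison
clean_join = ' '.join(re.findall('[^!.? ]+', test_phrase))

print(clean_sub)                # This is a string But it has punctuation How can we remove it
print(clean_sub == clean_join)  # True
```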

In [586]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [587]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']

In [588]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [589]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [590]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [591]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
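One thing to watch for when moving from `re.search` to `re.findall` with this pattern: if the pattern contains a capturing group, `findall` returns the group contents rather than the whole match. A non-capturing group `(?:...)` avoids this. The sentence below is made up for illustration.

```python
import re

# Hypothetical sentence containing two of the target words
sample = 'Hello, would you like some catfish or a catnap?'

# Capturing group: findall returns only what the parentheses matched
captured = re.findall(r'cat(fish|nap|claw)', sample)

# Non-capturing group: findall returns the whole match
whole = re.findall(r'cat(?:fish|nap|claw)', sample)

print(captured)  # ['fish', 'nap']
print(whole)     # ['catfish', 'catnap']
```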

## f-Strings
#### 1. Print an f-string that displays `NLP stands for Natural Language Processing` using the variables provided.

In [592]:
abbr = 'NLP'
full_text = 'Natural Language Processing'

# Enter your code here:
print(f'{abbr} stands for {full_text}')

NLP stands for Natural Language Processing


## Files
#### 2. Create a file in the current working directory called `contacts.txt` by running the cell below:

In [593]:
%%writefile contacts.txt
First_Name Last_Name, Title, Extension, Email

Overwriting contacts.txt


#### 3. Open the file and use .read() to save the contents of the file to a string called `fields`.  Make sure the file is closed at the end.

In [594]:
# Write your code here:
with open('contacts.txt') as c:
    fields = c.read()

    
# Run fields to see the contents of contacts.txt:
fields

'First_Name Last_Name, Title, Extension, Email\n'

## Working with PDF Files
#### 4. Use PyPDF2 to open the file `Business_Proposal.pdf`. Extract the text of page 2.

In [595]:
# Perform import
import PyPDF2

# Open the file as a binary object
f = open('/kaggle/input/businessproposaltestfile/Business_Proposal.pdf','rb')

# Use PyPDF2 to read the text of the file
pdf_reader = PyPDF2.PdfReader(f)


# Get the text from page 2 (CHALLENGE: Do this in one step!)
# Index 1 is the second page; extract_text must be called with parentheses
page_two_text = pdf_reader.pages[1].extract_text()


# Close the file
f.close()

# Print the contents of page_two_text
print(page_two_text)
