# Outline:
## 1) Working with Text Files
 * f-strings to format printed text
 * Creating a text file, then opening and reading it
 * Writing to text files
 * Appending to text files
 
## 2) Working with PDF Files
 * Reading PDFs and extracting text
 * Adding PDF pages 

## 3) Regular Expressions
 * Searching for Basic Patterns
     * Patterns
     * Quantifiers
     * Groups
 * Additional Regex Syntax
     * Wildcard character
     * Starts with and Ends with
     * Exclusion
     * Brackets for grouping
     * Parentheses for multiple options

# 1) Working with Text Files

## f-strings to format printed text

In [1]:
name = 'Mido'
print(f'His name is {name}.')

His name is Mido.


In [4]:
d = {'a':123,'b':456}

print(f"Address: {d['a']} Main Street")

Address: 123 Main Street


You can pass arguments inside a nested set of curly braces to set a minimum width for the field, the alignment and even padding characters.

In [5]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

for book in library:
    print(f'{book[0]:{10}} {book[1]:{8}} {book[2]:{7}}')

Author     Topic    Pages  
Twain      Rafting      601
Feynman    Physics       95
Hamilton   Mythology     144


In [6]:
for book in library:
    print(f'{book[0]:{10}} {book[1]:{10}} {book[2]:.>{7}}')

Author     Topic      ..Pages
Twain      Rafting    ....601
Feynman    Physics    .....95
Hamilton   Mythology  ....144
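
As a small sketch of the alignment and padding options described above (the names `word` and `width` are illustrative and not part of the original notebook): `<`, `>` and `^` select left, right and centre alignment, and a character placed before the alignment symbol is used as padding. Strings are left-aligned by default and numbers are right-aligned, which is why the `Pages` header and the page counts line up differently in the first table.

```python
word = 'Twain'   # hypothetical sample value
width = 9        # hypothetical field width

left = f'{word:<{width}}'       # left-align, pad on the right with spaces
right = f'{word:>{width}}'      # right-align, pad on the left with spaces
centered = f'{word:*^{width}}'  # centre, padding with '*'
number = f'{601:{width}}'       # numbers are right-aligned by default

print(repr(left), repr(right), repr(centered), repr(number))
```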


In [7]:
from datetime import datetime

today = datetime(year=2018, month=1, day=27)

print(f'{today:%B %d, %Y}')

January 27, 2018
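
The format codes after the colon are the same ones `strftime` accepts, so the f-string above should be equivalent to calling `today.strftime(...)` directly. A minimal sketch (the variable names are illustrative):

```python
from datetime import datetime

today = datetime(year=2018, month=1, day=27)

via_fstring = f'{today:%B %d, %Y}'          # format spec inside the f-string
via_strftime = today.strftime('%B %d, %Y')  # same codes passed to strftime
iso_style = f'{today:%Y-%m-%d}'             # numeric year-month-day

print(via_fstring == via_strftime, iso_style)
```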


## Creating a text file and reading it

In [9]:
%%writefile test.txt
Hello, this is a quick test file.
This is the second line of the file.

Writing test.txt


In [10]:
my_file = open('test.txt')

In [11]:
my_file.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

Once `read()` has been called, calling it again returns an empty string. The reading "cursor" sits at the end of the file after the first read, so there is nothing left to read. We can reset the cursor to the start like this:

In [12]:
my_file.seek(0)

0

In [13]:
my_file.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'
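
The cursor behaviour can be checked in isolation. The sketch below assumes nothing about `test.txt`; it writes a throwaway file (the name `cursor_demo.txt` is hypothetical) into a temporary directory and reads it three times.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file so the example does not touch test.txt
    path = os.path.join(tmp_dir, 'cursor_demo.txt')
    with open(path, 'w') as f:
        f.write('line one\nline two\n')

    with open(path) as f:
        first_read = f.read()   # cursor moves to the end of the file
        second_read = f.read()  # nothing left to read
        f.seek(0)               # move the cursor back to the start
        third_read = f.read()   # full contents again

print(repr(first_read), repr(second_read), repr(third_read))
```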

In [14]:
# Readlines returns a list of the lines in the file
my_file.seek(0)
my_file.readlines()

['Hello, this is a quick test file.\n',
 'This is the second line of the file.\n']

When you have finished using a file, it is always good practice to close it.

In [16]:
my_file.close()

## Writing to a File

In [19]:
# Add a second argument to the function, 'w' which stands for write.
# Passing 'w+' lets us read and write to the file
# Opening a file with 'w' or 'w+' *truncates the original*, meaning that anything that was in the original file is deleted!

my_file = open('test.txt','w+')

In [20]:
my_file.write('This is a new first line')

24

In [21]:
my_file.seek(0)
my_file.read()

'This is a new first line'

In [22]:
my_file.close()

## Appending to a File

In [24]:
my_file = open('test.txt','a+')
my_file.write('\nThis line is being appended to test.txt')
my_file.write('\nAnd another line here.')

23

In [25]:
my_file.seek(0)
print(my_file.read())

This is a new first line
This line is being appended to test.txt
And another line here.


In [26]:
my_file.close()
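
To make the difference between the modes concrete, here is a minimal sketch on a scratch file (the name `modes_demo.txt` is hypothetical): `'a'` keeps the existing contents and adds to the end, while `'w'` discards them.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical scratch file

    with open(path, 'w') as f:   # 'w' creates the file (or truncates it)
        f.write('original')

    with open(path, 'a') as f:   # 'a' keeps what is there and writes at the end
        f.write(' plus appended')
    with open(path) as f:
        after_append = f.read()

    with open(path, 'w') as f:   # reopening with 'w' wipes the previous contents
        f.write('replaced')
    with open(path) as f:
        after_write = f.read()

print(after_append)
print(after_write)
```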

Appending with `%%writefile`

In [28]:
%%writefile -a test.txt

This is more text being appended to test.txt
And another line here.

Appending to test.txt


You can assign temporary variable names as aliases, and manage the opening and closing of files automatically using a context manager:

In [30]:
with open('test.txt','r') as txt:
    first_line = txt.readlines()[0]
    
print(first_line)

This is a new first line



In [31]:
with open('test.txt','r') as txt:
    for line in txt:
        print(line, end='')

This is a new first line
This line is being appended to test.txt
And another line here.
This is more text being appended to test.txt
And another line here.
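
One way to confirm that the context manager really closes the file is to inspect the file object's `closed` attribute inside and after the `with` block. A minimal sketch using a scratch file (the name `with_demo.txt` is hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'with_demo.txt')  # hypothetical scratch file
    with open(path, 'w') as out:
        out.write('first\nsecond\n')

    with open(path, 'r') as txt:
        open_inside = not txt.closed                  # still open inside the block
        lines = [line.rstrip('\n') for line in txt]   # iterate line by line

    closed_after = txt.closed  # closed automatically when the block ends

print(lines, open_inside, closed_after)
```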


# 2) Working with PDF Files

## Reading a PDF and extracting text

In [1]:
import PyPDF2

In [2]:
# Notice we read it as a binary with 'rb'
f = open('US_Declaration.pdf','rb')

In [3]:
# PdfFileReader is the class name in PyPDF2 1.x/2.x; PyPDF2 3.x renamed it to PdfReader
pdf_reader = PyPDF2.PdfFileReader(f)

In [4]:
pdf_reader.numPages

5

In [5]:
page_one = pdf_reader.getPage(0)

In [6]:
page_one_text = page_one.extractText()

In [7]:
page_one_text

"Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the\npolitical bands which have connected them with another, and to assume among the powers of the\nearth, the separate and equal station to which the Laws of Nature and of Nature's God entitle\n\nthem, a decent respect to the opinions of mankind requires that they should declare the causes\n\nwhich impel them to the separation. \nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by\n\ntheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving\n\ntheir just powers from the consent of the governed,ŠThat whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or 

In [8]:
f.close()

## Adding to PDFs
PyPDF2 is not designed for writing arbitrary new text into a PDF, but we *can* copy pages and append them to the end of a new document.

In [10]:
f = open('US_Declaration.pdf','rb')
pdf_reader = PyPDF2.PdfFileReader(f)

In [11]:
first_page = pdf_reader.getPage(0)

In [12]:
pdf_writer = PyPDF2.PdfFileWriter()

In [13]:
pdf_writer.addPage(first_page)

In [14]:
pdf_output = open("Some_New_Doc.pdf","wb")

In [15]:
pdf_writer.write(pdf_output)

In [16]:
pdf_output.close()
f.close()

Extract all the text from this PDF file:

In [20]:
f = open('US_Declaration.pdf','rb')

pdf_text = [0]  # zero is a placeholder to make page 1 = index 1

pdf_reader = PyPDF2.PdfFileReader(f)

for p in range(pdf_reader.numPages):
    
    page = pdf_reader.getPage(p)
    
    pdf_text.append(page.extractText())

f.close()

In [21]:
pdf_text

[0,
 "Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the\npolitical bands which have connected them with another, and to assume among the powers of the\nearth, the separate and equal station to which the Laws of Nature and of Nature's God entitle\n\nthem, a decent respect to the opinions of mankind requires that they should declare the causes\n\nwhich impel them to the separation. \nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by\n\ntheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving\n\ntheir just powers from the consent of the governed,ŠThat whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alte

# 3) Regular Expressions

## Searching for Basic Patterns

In [23]:
import re

In [25]:
pattern = 'phone'
text = "The agent's phone number is 408-555-1234. Call soon!"

In [26]:
re.search(pattern,text)

<_sre.SRE_Match object; span=(12, 17), match='phone'>

In [29]:
# If there is no match, re.search returns None (so the cell shows no output)
pattern = "NOT IN TEXT"
re.search(pattern,text)
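
Because a failed search returns `None`, calling `.span()` or `.group()` on the result directly would raise an `AttributeError`. A minimal sketch of guarding against that (the helper name `find_span` is hypothetical):

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if absent."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

found = find_span('phone', text)
missing = find_span('NOT IN TEXT', text)
print(found, missing)
```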

In [30]:
pattern = 'phone'
text = "The agent's phone number is 408-555-1234. Call soon!"
match = re.search(pattern,text)

In [31]:
match.span()

(12, 17)

In [32]:
match.start()

12

In [34]:
match.end()

17

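What if the pattern occurs more than once? `re.search` only returns the first match, so use `re.findall` or `re.finditer` to get all of them.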

In [35]:
text = "my phone is a new phone"
match = re.search("phone",text)
match.span()
matches = re.findall("phone",text)
len(matches)

2

In [36]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


In [37]:
match.group()

'phone'
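
The two functions return different things: `re.findall` gives a list of matched strings, while `re.finditer` yields match objects that still carry their positions. Note that `match.group()` above refers to the last match left over from the loop. A short sketch that collects both (variable names are illustrative):

```python
import re

text = "my phone is a new phone"

strings = re.findall("phone", text)                     # plain strings only
spans = [m.span() for m in re.finditer("phone", text)]  # positions of each match
first = re.search("phone", text).span()                 # search stops at the first hit

print(strings, spans, first)
```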

### Patterns

<table><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th>Example Match</th></tr>
<tr><td><span>\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>
<tr><td><span>\w</span></td><td>Alphanumeric (letter, digit, or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>
<tr><td><span>\s</span></td><td>Whitespace</td><td>a\sb\sc</td><td>a b c</td></tr>
<tr><td><span>\D</span></td><td>A non-digit</td><td>\D\D\D</td><td>ABC</td></tr>
<tr><td><span>\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>
<tr><td><span>\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

In [40]:
text = "My telephone number is 408-555-1234"
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)
phone.group()

'408-555-1234'
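
The character classes can be checked with `re.fullmatch`, which only succeeds when the whole string fits the pattern. The sketch below pairs a few patterns with sample strings (the `examples` dictionary is illustrative):

```python
import re

# Pattern -> sample string that the pattern should match in full
examples = {
    r'file_\d\d': 'file_25',
    r'\w-\w\w\w': 'A-b_1',
    r'a\sb\sc': 'a b c',
    r'\D\D\D': 'ABC',
    r'\W\W\W\W\W': '*-+=)',
    r'\S\S\S\S': 'Yoyo',
}

results = {
    pattern: re.fullmatch(pattern, sample) is not None
    for pattern, sample in examples.items()
}

digit_rejected = re.fullmatch(r'\D\D\D', 'A1C') is None  # '1' is a digit, so \D fails

print(results, digit_rejected)
```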

### Quantifiers

<table><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th>Example Match</th></tr>
<tr><td><span>+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>
<tr><td><span>{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>
<tr><td><span>{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>
<tr><td><span>{3,}</span></td><td>Occurs 3 or more times</td><td>\w{3,}</td><td>anycharacters</td></tr>
<tr><td><span>*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>
<tr><td><span>?</span></td><td>Occurs once or not at all</td><td>plurals?</td><td>plural</td></tr></table>

In [42]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<_sre.SRE_Match object; span=(23, 35), match='408-555-1234'>
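
Quantifiers are greedy by default: they take as many repetitions as the pattern allows. Appending `?` to a quantifier makes it lazy, so it takes as few as possible. A minimal sketch (the sample strings are illustrative):

```python
import re

digits = '123456'

greedy = re.search(r'\d{2,4}', digits).group()   # takes up to four digits
lazy = re.search(r'\d{2,4}?', digits).group()    # stops after the minimum of two

zero_or_more = re.fullmatch(r'A*B*C*', 'AAACC') is not None  # B may occur zero times
optional_s = [bool(re.fullmatch(r'plurals?', w)) for w in ('plural', 'plurals')]

print(greedy, lazy, zero_or_more, optional_s)
```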

### Groups

In [45]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = re.search(phone_pattern,text)
# The entire result
results.group()

'408-555-1234'

In [46]:
results.group(1)

'408'

In [47]:
results.group(3)

'1234'
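
Groups can also be retrieved all at once with `groups()`, or given names with the `(?P<name>...)` syntax so they can be looked up by label instead of position. A sketch under the assumption that the labels `area`, `exchange` and `line` are acceptable names for the three parts (they are not from the original notebook):

```python
import re

text = "My telephone number is 408-555-1234"

# Same pattern as above, with hypothetical names attached to each group
phone_pattern = re.compile(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')
results = phone_pattern.search(text)  # compiled patterns have their own search()

all_groups = results.groups()   # tuple of every captured group
area = results.group('area')    # look up a group by name
as_dict = results.groupdict()   # names mapped to captured text

print(all_groups, area, as_dict)
```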

## Additional Regex Syntax

Use the pipe operator `|` to express an **or** between alternatives. For example:

In [50]:
re.search(r"man|woman","This man was here.")

<_sre.SRE_Match object; span=(5, 8), match='man'>
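
Two details of alternation are worth a quick check: the match that starts earliest in the text wins, and at any one position the alternatives are tried from left to right. A minimal sketch (the sample strings are illustrative):

```python
import re

# 'woman' starts at index 5, before the 'man' inside it, so it is found first
earliest = re.search(r"man|woman", "This woman was here.").group()

# At the same starting position, the first alternative that fits is used
short_first = re.search(r"cat|catfish", "catfish").group()
long_first = re.search(r"catfish|cat", "catfish").group()

print(earliest, short_first, long_first)
```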

### The Wildcard Character

In [52]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [53]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

In [54]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
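
By default the wildcard matches any character except a newline; passing `re.DOTALL` lifts that restriction. To match a literal period, escape it as `\.`. A short sketch (the sample strings are illustrative):

```python
import re

two_lines = "cat\nat"

default_dot = re.findall(r".at", two_lines)             # '.' will not match '\n'
dotall_dot = re.findall(r".at", two_lines, re.DOTALL)   # now '\n' counts as a character

literal_dots = re.findall(r"\.", "v1.2.3")              # escaped dot matches only '.'

print(default_dot, dotall_dot, literal_dots)
```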

### Starts With and Ends With

In [56]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']

In [57]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']
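
`^` and `$` anchor to the start and end of the whole string by default. With the `re.MULTILINE` flag they also match at the start and end of each line. A minimal sketch (the two-line strings are illustrative):

```python
import re

starts = "1 is the loneliest number.\n2 can be as bad as one."
ends = "line 1\nline 2"

start_default = re.findall(r'^\d', starts)                 # start of the string only
start_multiline = re.findall(r'^\d', starts, re.MULTILINE) # start of every line

end_default = re.findall(r'\d$', ends)                     # end of the string only
end_multiline = re.findall(r'\d$', ends, re.MULTILINE)     # end of every line

print(start_default, start_multiline, end_default, end_multiline)
```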

### Exclusion

To exclude characters, use the **^** symbol as the first character inside a set of brackets **[]**. The set then matches any single character that is *not* listed. For example:

In [59]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [60]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [62]:
# To get the words back together, use a + sign 
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [63]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [64]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))
clean

'This is a string But it has punctuation How can we remove it'
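
An alternative to collecting the wanted pieces and joining them is to delete the unwanted characters with `re.sub`. This is a sketch of a different technique from the one used above; for this phrase both give the same result, though they can differ when the text contains runs of several spaces.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Approach used above: keep everything that is not punctuation or a space, then join
joined = ' '.join(re.findall('[^!.? ]+', test_phrase))

# Alternative: replace each punctuation character with an empty string
substituted = re.sub(r'[!.?]', '', test_phrase)

print(substituted)
print(joined == substituted)
```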

### Brackets for Grouping

In [66]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']
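
Brackets define a set of allowed characters, and a set can hold explicit characters or ranges such as `a-z`. A minimal sketch with illustrative sample text, restricting the hyphenated words to lowercase letters instead of `\w`:

```python
import re

sample = 'Use a well-known, user-friendly tool'  # illustrative text

# One or more lowercase letters, a hyphen, then one or more lowercase letters
hyphenated = re.findall(r'[a-z]+-[a-z]+', sample)

# A set listing individual characters: every vowel in the word 'regex'
vowels = re.findall(r'[aeiou]', 'regex')

print(hyphenated, vowels)
```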

### Parentheses for Multiple Options

In [68]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"

In [69]:
re.search(r'cat(fish|nap|claw)',text)

<_sre.SRE_Match object; span=(27, 34), match='catfish'>

In [70]:
re.search(r'cat(fish|nap|claw)',texttwo)

<_sre.SRE_Match object; span=(32, 38), match='catnap'>
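
One caveat when combining parentheses with `re.findall`: if the pattern contains a capturing group, `findall` returns only the group's text rather than the whole match. Writing the group as non-capturing, `(?:...)`, returns the full words. A short sketch (the sample sentence is illustrative):

```python
import re

sample = 'A catfish took a catnap.'

captured_only = re.findall(r'cat(fish|nap|claw)', sample)   # just the group contents
whole_words = re.findall(r'cat(?:fish|nap|claw)', sample)   # non-capturing group

print(captured_only, whole_words)
```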

For more info:
    https://docs.python.org/3/howto/regex.html