## Course Information
🎥 **[COURSE VIDEOS](https://www.udemy.com/course/nlp-natural-language-processing-with-python/learn)**

## Setup

In [None]:
# DRIVE
from google.colab import drive
drive.mount('/content/drive')

data_path = 'drive/MyDrive/Colab Notebooks/NLP Course/data'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# NLP 02 Text Basics

## String Formatting

In [None]:
name = "bob"
print(f"My name is {name}")

My name is bob


In [None]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]
for book in library:
  print(book)
print('--------')
for book in library:
  print(f"author is {book[0]}")



('Author', 'Topic', 'Pages')
('Twain', 'Rafting', 601)
('Feynman', 'Physics', 95)
('Hamilton', 'Mythology', 144)
--------
author is Author
author is Twain
author is Feynman
author is Hamilton


In [None]:

# tuple unpack
for author,topic,pages in library:
  print(f"{author} {topic} {pages}")
# the columns don't line up yet, so let's add some width and alignment

Author Topic Pages
Twain Rafting 601
Feynman Physics 95
Hamilton Mythology 144


In [None]:
print("A Little Minimum Spacing")
print('--------')
for author,topic,pages in library:
  print(f"{author:{14}} {topic:{20}} {pages:{10}}")

A Little Minimum Spacing
--------
Author         Topic                Pages     
Twain          Rafting                     601
Feynman        Physics                      95
Hamilton       Mythology                   144


In [None]:
print("A Little Minimum Spacing")
print('--------')
for author,topic,pages in library:
  print(f"{author:{14}} {topic:{20}} {pages:>{10}}")

A Little Minimum Spacing
--------
Author         Topic                     Pages
Twain          Rafting                     601
Feynman        Physics                      95
Hamilton       Mythology                   144


In [None]:
print("A Little Minimum Spacing")
print('--------')
for author,topic,pages in library:
  print(f"{author:{14}} {topic:{20}} {pages:.>{10}}")

A Little Minimum Spacing
--------
Author         Topic                .....Pages
Twain          Rafting              .......601
Feynman        Physics              ........95
Hamilton       Mythology            .......144
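
The cells above rely on the f-string format spec `{value:[fill][align][width]}`. By default strings are left-aligned and numbers are right-aligned, which is why the `Pages` header and the page counts ended up on opposite sides of the column in the first attempt. A minimal sketch of the three alignment flags follows; the `rows` list is a made-up subset of `library`, used only for illustration.

```python
# Format spec inside an f-string: {value:[fill][align][width]}
# '<' left-aligns, '>' right-aligns, '^' centers; the fill character is optional.
rows = [('Twain', 601), ('Feynman', 95)]  # hypothetical subset of the library list

formatted = [f"{author:<10}|{pages:>6}|{pages:*^7}" for author, pages in rows]
for line in formatted:
    print(line)
```

The width can also come from a variable, for example `f"{author:{width}}"`, which is what the nested braces in `{author:{14}}` allow.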


## Date Time formatting

Put a `strftime` format code after the colon in the f-string to control how the date is rendered. The full list of codes is at https://strftime.org

In [None]:
from datetime import datetime

In [None]:
today = datetime.now()

In [None]:
print(f"{today}")

2022-02-06 17:31:45.231193


In [None]:
print(f"{today:%B}")
print(f'Now Showing {datetime.today():%m/%d/%Y}')


February
Now Showing 02/06/2022
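
The same format codes work in both directions: `strftime` (or an f-string) turns a `datetime` into text, and `datetime.strptime` parses text back into a `datetime`. The sketch below uses a fixed timestamp, chosen only so that the result is reproducible.

```python
from datetime import datetime

# A fixed timestamp so the output does not depend on when the cell runs
moment = datetime(2022, 2, 6, 17, 31, 45)

# Formatting in an f-string is equivalent to moment.strftime("%Y-%m-%d %H:%M")
as_text = f"{moment:%Y-%m-%d %H:%M}"

# strptime goes the other way: text plus format code gives a datetime
parsed = datetime.strptime(as_text, "%Y-%m-%d %H:%M")

print(as_text)
print(parsed)
```

Note that the seconds are lost in the round trip because the format string does not include `%S`.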


## Read and Write text files

In [None]:
%%writefile test.txt 
Hello this is a quick test text file
that I'm creating from a notebook

Writing test.txt


In [None]:
pwd

'/content'

In [None]:
myfile = open('test.txt')
myfile

<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

In [None]:
myfile.read()

"Hello this is a quick test text file\nthat I'm creating from a notebook"

In [None]:
myfile.read()

''

In [None]:
myfile.seek(0)

0

In [None]:
myfile.read()

"Hello this is a quick test text file\nthat I'm creating from a notebook"

In [None]:
myfile.seek(0)
content = myfile.read()
content

"Hello this is a quick test text file\nthat I'm creating from a notebook"

In [None]:
myfile.close()

### Reading each line

In [None]:
myfile = open('test.txt')
mylines = myfile.readlines()
myfile.close()

In [None]:
for line in mylines:
  print(line)

Hello this is a quick test text file

that I'm creating from a notebook
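
The blank lines in the output above appear because each string returned by `readlines()` keeps its trailing `\n`, and `print` adds another one. A small sketch of one way around that, iterating over the file object and stripping the newline; the temporary file and its name are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Write a small throwaway file so the sketch does not touch test.txt
path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# Iterating over the file object yields one line at a time,
# so the whole file never has to be held in memory at once
with open(path, encoding="utf-8") as f:
    clean_lines = [line.rstrip("\n") for line in f]

for line in clean_lines:
    print(line)
```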


### Writing to the file

In [None]:
myfile = open('test.txt',mode='w+')
# 'w' truncates the file on open; the '+' also lets us read from it

In [None]:
myfile.read()

''

In [None]:
myfile.write("my brand new text")
myfile.seek(0)
myfile.read()


'my brand new text'

In [None]:
myfile.close()

In [None]:
myfile = open('test.txt',mode='a+')
# 'a' appends to the end of the file and the
# '+' also lets us do other things like read

In [None]:
myfile.write("A new section of code!")
myfile.seek(0)
myfile.read()

'A new section of code!A new section of code!'

In [None]:
myfile.close()

### Context manager

In [None]:
with open('test.txt','r') as mynewfile:
  # a freshly opened file already starts at position 0, so no seek is needed
  print(mynewfile.read())
# auto-closed when done with the block! yay!

A new section of code!A new section of code!
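
To recap the modes used in this section, here is a minimal sketch that runs `'w'`, `'a'` and `'r'` back to back, each inside its own `with` block. It writes to a temporary file with a made-up name rather than `test.txt`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file name

with open(path, "w") as f:   # 'w' creates the file, or truncates an existing one
    f.write("first")
with open(path, "a") as f:   # 'a' always writes at the end
    f.write(" second")
with open(path) as f:        # the default mode is 'r'
    after_append = f.read()

with open(path, "w") as f:   # opening with 'w' again discards the old content
    f.write("replaced")
with open(path) as f:
    after_truncate = f.read()

print(after_append)
print(after_truncate)
print(f.closed)  # the with block closed the file for us
```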


## PDFs!

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61102 sha256=ef367249f6c71b45ce9c8e16cd9644a2b55aec860ef886b52d62b97c2677bfe5
  Stored in directory: /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3
Successfu

In [None]:
import PyPDF2


In [None]:

myfile = open(f'{data_path}/US_Declaration.pdf', mode='rb')

In [None]:
pdf_reader = PyPDF2.PdfFileReader(myfile)

In [None]:
pdf_reader.numPages

5

In [None]:
page_one = pdf_reader.getPage(0)

In [None]:
mytext = page_one.extractText()

In [None]:
myfile.close()

In [None]:

with open(f'{data_path}/US_Declaration.pdf', mode='rb') as f:
  pdf_reader = PyPDF2.PdfFileReader(f)
  first_page = pdf_reader.getPage(0)
  pdf_writer = PyPDF2.PdfFileWriter()
  pdf_writer.addPage(first_page)
  # now write that page
  with open(f'{data_path}/MY_NEW_PDF.pdf', 'wb') as pdf_output:
    pdf_writer.write(pdf_output)



In [None]:
from PyPDF2.pdf import PdfFileReader
pdf_text = [0]  # index 0 is a placeholder, so pdf_text[1] holds page 1
with open(f'{data_path}/US_Declaration.pdf', mode='rb') as f:
  pdf_reader = PdfFileReader(f)
  for p in range(pdf_reader.numPages):
    page = pdf_reader.getPage(p)
    pdf_text.append(page.extractText())

for page in pdf_text:
  print(page)

0
Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish i

## Regular Expressions

In [103]:
text = "The guys phone number was 171-122-3349. He wants a call!"


In [104]:
"phone" in text

True

In [105]:
import re

In [108]:
pattern = "phone"
my_match = re.search(pattern, text)

In [109]:
my_match.span()

(9, 14)
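
One thing to keep in mind is that `re.search` returns `None` when the pattern is not found, so calling `.span()` directly on the result raises an `AttributeError` in that case. A small sketch of the usual guard; the word `"email"` is just an example of something that is absent from `text`.

```python
import re

text = "The guys phone number was 171-122-3349. He wants a call!"

results = {}
for word in ("phone", "email"):  # "email" is a made-up word that does not occur
    match = re.search(word, text)
    # re.search returns None on failure, so check before calling .span()
    results[word] = match.span() if match else None

print(results)
```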

### Multiple matches

In [119]:
look_for = "phone"
phrase = "I have a new phone that replaces my old phone."

In [120]:
# search returns only the first match
my_match = re.search(look_for, phrase)
my_match.span()

(13, 18)

In [121]:
# findall returns every match as a list of strings
my_matches = re.findall(look_for, phrase)
len(my_matches)

2

In [124]:

# or even an iterator of match objects
for match in re.finditer(look_for, phrase):
  print(match.span())

(13, 18)
(40, 45)
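
Because `re.finditer` yields full match objects, the matched text and its position can be collected together, and the spans can be used directly as slice indices. A short sketch using the same phrase:

```python
import re

phrase = "I have a new phone that replaces my old phone."

# Each match object carries the matched text plus its start and end index
found = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", phrase)]
print(found)

# The span is a half-open range, so it slices the original string exactly
for _, start, end in found:
    print(phrase[start:end])
```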


### Patterns

In [125]:
text

'The guys phone number was 171-122-3349. He wants a call!'

In [128]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
phone = re.search(pattern, text)
phone

<re.Match object; span=(26, 38), match='171-122-3349'>

In [131]:
phone.group()

'171-122-3349'

In [133]:
pattern = r'\d{3}-\d{3}-\d{4}'
phone = re.search(pattern, text)
phone.group()

'171-122-3349'

### Groups

In [135]:
pattern = r'(\d{3})-(\d{3})-(\d{4})'
phone = re.search(pattern, text)
phone.group()

'171-122-3349'

In [137]:
phone.group(3)

'3349'
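
Groups are numbered from 1, and `group()` with no argument (or `group(0)`) is the whole match. As a possible extension, groups can also be given names with `(?P<name>...)`. The names `area`, `prefix` and `line` below are invented for this sketch and are not part of the original notebook.

```python
import re

text = "The guys phone number was 171-122-3349. He wants a call!"

# Same pattern as above, but each group gets a (made-up) name
pattern = r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})'
phone = re.search(pattern, text)

all_groups = phone.groups()   # every group at once, as a tuple
area = phone.group('area')    # look up one group by name
by_name = phone.groupdict()   # all named groups as a dict

print(all_groups)
print(area)
print(by_name)
```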