## Python Text Basics

#### f-Strings

In [1]:
person = 'Bala'

In [2]:
print('my name is {}'.format(person))

my name is Bala


In [3]:
print(f'my name is {person}')

my name is Bala


In [4]:
d = {'a': 123, 'b': 456}

In [5]:
mylist = [0, 1, 2]

In [8]:
print(f'my number is {d["b"]}')

my number is 456


In [9]:
print(f'my number is {mylist[0]}')

my number is 0


In [10]:
book_store = [('Author', 'Book', 'Pages'), ('Ben', 'Good things', 900), ('Bruce', 'Man From Earth', 321), ('J.K. Rowling', 'Harry Potter', 1098)]

In [11]:
for book in book_store:
    print(book)

('Author', 'Book', 'Pages')
('Ben', 'Good things', 900)
('Bruce', 'Man From Earth', 321)
('J.K. Rowling', 'Harry Potter', 1098)


In [13]:
for author, book, pages in book_store:
    print(f'{author} {book} {pages}')

Author Book Pages
Ben Good things 900
Bruce Man From Earth 321
J.K. Rowling Harry Potter 1098


In [14]:
for author, book, pages in book_store:
    print(f'{author:{20}} {book:{20}} {pages:{20}}')

Author               Book                 Pages               
Ben                  Good things                           900
Bruce                Man From Earth                        321
J.K. Rowling         Harry Potter                         1098


In [15]:
for author, book, pages in book_store:
    print(f'{author:{20}} {book:{20}} {pages:>{20}}')

Author               Book                                Pages
Ben                  Good things                           900
Bruce                Man From Earth                        321
J.K. Rowling         Harry Potter                         1098


In [16]:
for author, book, pages in book_store:
    print(f'{author:{20}} {book:{20}} {pages:.>{20}}')

Author               Book                 ...............Pages
Ben                  Good things          .................900
Bruce                Man From Earth       .................321
J.K. Rowling         Harry Potter         ................1098
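
The width, alignment and fill character used above all belong to Python's format specification mini-language. As a minimal sketch (the sample row is taken from `book_store`; the trailing `|` is only there to make the padding visible), the options can be compared side by side:

```python
# One row from book_store, reused for illustration.
author, book, pages = 'Ben', 'Good things', 900

left = f'{author:<20}|'      # left-align (the default for strings)
right = f'{pages:>20}|'      # right-align (the default for numbers)
centered = f'{book:^20}|'    # center within 20 characters
filled = f'{pages:.>20}|'    # right-align, padding with '.' instead of spaces

# A nested {20} is just another way of writing the literal width 20.
same_width = f'{author:{20}}' == f'{author:20}'

for row in (left, right, centered, filled):
    print(row)
print(same_width)
```

The nested form is mainly useful when the width comes from a variable, for example `f'{author:{width}}'`.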


#### Date and time formatting with f-strings (format codes from https://strftime.org/)

In [17]:
from datetime import datetime

In [19]:
today = datetime(year = 2020, month = 2, day = 20)

In [20]:
today

datetime.datetime(2020, 2, 20, 0, 0)

In [21]:
print(f'{today}')

2020-02-20 00:00:00


In [23]:
print(f'{today:%B}')

February


In [24]:
print(f'{today:%B %d}')

February 20


In [25]:
print(f'{today:%B %d, %Y}')

February 20, 2020
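
A format spec placed after the colon on a datetime is handed to `strftime`, so the f-string form and the method call should give the same text. A short sketch of that equivalence, reusing the date from above (the variable names are illustrative):

```python
from datetime import datetime

today = datetime(year=2020, month=2, day=20)

# The f-string format spec and strftime accept the same % codes.
via_fstring = f'{today:%B %d, %Y}'
via_strftime = today.strftime('%B %d, %Y')
print(via_fstring == via_strftime)

# A purely numeric layout that does not depend on month names.
iso_like = f'{today:%Y-%m-%d}'
print(iso_like)
```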


#### File Basics - Writing, Reading and Appending

In [26]:
%%writefile test.txt
Hello World, Python is good language and it used by milions.
Machine Learning and Deep learning are good fields to work

Writing test.txt


In [28]:
test_file = open('test.txt')

In [29]:
test_file

<_io.TextIOWrapper name='test.txt' mode='r' encoding='cp1252'>

In [30]:
test_file.read()

'Hello World, Python is good language and it used by milions.\nMachine Learning and Deep learning are good fields to work\n'

In [31]:
test_file.read()

''

In [34]:
test_file.seek(0)

0

In [35]:
content = test_file.read()

In [36]:
content

'Hello World, Python is good language and it used by milions.\nMachine Learning and Deep learning are good fields to work\n'

In [37]:
print(content)

Hello World, Python is good language and it used by milions.
Machine Learning and Deep learning are good fields to work



In [38]:
test_file.close()

In [39]:
myfile = open('test.txt')

In [40]:
myfile.readlines()

['Hello World, Python is good language and it used by milions.\n',
 'Machine Learning and Deep learning are good fields to work\n']

In [41]:
myfile.seek(0)

0

In [42]:
mylines = myfile.readlines()

In [43]:
mylines

['Hello World, Python is good language and it used by milions.\n',
 'Machine Learning and Deep learning are good fields to work\n']

In [44]:
myfile = open('test.txt', 'w+')

#### 'w+' opens the file for both reading and writing and always truncates it, erasing any existing content

In [45]:
myfile.read()

''

In [46]:
myfile.write('Hello World, this is a very new text')

36

In [47]:
myfile.seek(0)

0

In [48]:
myfile.read()

'Hello World, this is a very new text'

In [49]:
myfile.close()
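
The truncation performed by 'w+' can be confirmed in isolation. This is a small sketch that uses a scratch file in a temporary directory (the path and file name are made up so that test.txt is left alone):

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('original content')

# Opening with 'w+' empties the file immediately, before anything is written.
with open(path, 'w+') as f:
    after_open = f.read()
    f.write('replacement')
    f.seek(0)
    after_write = f.read()

print(repr(after_open))
print(after_write)
```

If the existing content has to be kept, 'r+' (read and write without truncating) or 'a+' (append) are the usual alternatives.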

In [50]:
myfile = open('whoops.txt', 'a+')

In [51]:
myfile.write('MY FIRST LINE IN A+ OPENING')

27

In [52]:
myfile.close()

In [54]:
newfile = open('whoops.txt')

In [55]:
newfile.read()

'MY FIRST LINE IN A+ OPENING'

In [56]:
newfile.close()

In [57]:
newfile = open('whoops.txt', 'a+')

In [58]:
newfile.write('Appending another line bro')

26

In [59]:
newfile

<_io.TextIOWrapper name='whoops.txt' mode='a+' encoding='cp1252'>

In [60]:
newfile.read()

''

In [61]:
newfile.seek(0)

0

In [62]:
newfile.read()

'MY FIRST LINE IN A+ OPENINGAppending another line bro'

In [63]:
newfile.write('\nThis is a new line')

19

In [64]:
newfile.seek(0)

0

In [65]:
newfile.read()

'MY FIRST LINE IN A+ OPENINGAppending another line bro\nThis is a new line'

In [66]:
newfile.seek(0)

0

In [67]:
content = newfile.read()

In [68]:
print(content)

MY FIRST LINE IN A+ OPENINGAppending another line bro
This is a new line


In [69]:
newfile.close()
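
Notice that the first two writes above ran together on one line: `write()` never adds a line break on its own. A minimal sketch with explicit `\n` characters, again using a made-up scratch file:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')

with open(path, 'a+') as f:
    f.write('first line\n')    # the '\n' must be written explicitly
    f.write('second line\n')
    f.seek(0)                  # move back to the start before reading
    lines = f.read().splitlines()

print(lines)
```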

#### 'with' automatically closes the file when the block ends

In [70]:
with open('whoops.txt', 'r') as newfile:
    newlines = newfile.readlines()

In [71]:
newlines

['MY FIRST LINE IN A+ OPENINGAppending another line bro\n',
 'This is a new line']
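
To see the automatic close happen, the `closed` attribute of the file object can be checked inside and outside the block. A short sketch, assuming a throwaway file in a temporary directory:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('line one\nline two\n')

with open(path) as f:
    # Iterating over a file object yields one line at a time.
    lines = [line.rstrip('\n') for line in f]
    closed_inside = f.closed

closed_outside = f.closed

print(closed_inside)    # still open inside the block
print(closed_outside)   # closed once the block has ended
print(lines)
```

The file is also closed if an exception is raised inside the block, which is the main reason to prefer `with` over a manual `close()`.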

### Regular Expressions

Regular Expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document. 

Regular expressions are notorious for their seemingly strange syntax. This syntax is a byproduct of their flexibility: regular expressions must be able to describe almost any string pattern you can imagine, which is why the pattern format is so compact and complex.

Regular expressions are handled using Python's built-in **re** library. See [the docs](https://docs.python.org/3/library/re.html) for more information.
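
As a quick preview of the two tasks mentioned above (finding capital letters and finding a phone number), here is a minimal sketch. The sample sentence is made up, and the pattern syntax is explained in the sections that follow:

```python
import re

sentence = 'Call Alice or Bob at 408-555-1234'   # made-up sample text

# [A-Z] matches any single capital letter.
capitals = re.findall(r'[A-Z]', sentence)

# Three digits, a hyphen, three digits, a hyphen, four digits.
phone = re.search(r'\d{3}-\d{3}-\d{4}', sentence)

print(capitals)
print(phone.group())
```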

In [72]:
'phone' in 'Hey did you see my phone'

True

In [73]:
'Phone' in 'Hey did you see my phone'

False
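
The `in` operator is case-sensitive, which is why 'Phone' is not found. Two common workarounds, shown as a small sketch: lowercase both sides, or use the re module with the `re.IGNORECASE` flag.

```python
import re

sentence = 'Hey did you see my phone'

# Option 1: normalize both strings before using the in operator.
found_lower = 'Phone'.lower() in sentence.lower()

# Option 2: let the regex engine ignore case.
match = re.search('Phone', sentence, flags=re.IGNORECASE)

print(found_lower)
print(match.group())
```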

In [74]:
text = 'The phone number of the agent is 408-555-1234. Call soon'

In [75]:
'phone' in text

True

In [76]:
'408-555-1234' in text

True

In [77]:
import re

In [78]:
re.search('phone', text)

<re.Match object; span=(4, 9), match='phone'>

In [79]:
match = re.search('phone', text)

In [80]:
match

<re.Match object; span=(4, 9), match='phone'>

In [81]:
match.span()

(4, 9)

In [82]:
text = 'My phone is lost, give your phone'

In [84]:
match = re.search('phone', text)

In [85]:
match.span()

(3, 8)

In [86]:
all_matches = re.findall('phone', text)

In [87]:
all_matches

['phone', 'phone']

In [88]:
len(all_matches)

2

In [89]:
re.finditer('phone', text)

<callable_iterator at 0x290660c4508>

In [90]:
for match in re.finditer('phone', text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(28, 33), match='phone'>


In [91]:
for match in re.finditer('phone', text):
    print(match.span())

(3, 8)
(28, 33)
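
Besides `span()`, each match object also offers `start()`, `end()` and `group()`. A brief sketch that collects the start positions and slices the matched text back out of the string:

```python
import re

text = 'My phone is lost, give your phone'

# start() and end() are the two halves of span().
starts = [m.start() for m in re.finditer('phone', text)]
slices = [text[m.start():m.end()] for m in re.finditer('phone', text)]

print(starts)
print(slices)
```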


In [92]:
text

'My phone is lost, give your phone'

In [93]:
text = 'My telephone number is 480-555-1234'

### Identifiers for Characters in Patterns

Character types such as digits or whitespace have special codes that represent them. You can combine these codes to build up a pattern string. Notice that they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, so Python treats each \ in the pattern as a literal backslash rather than as the start of an escape sequence.
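
A short sketch of what the r prefix changes (the variable names are illustrative):

```python
import re

plain = '\\d\\d'   # ordinary literal: each backslash has to be doubled
raw = r'\d\d'      # raw literal: backslashes are kept exactly as written

print(plain == raw)

# '\n' is one newline character; r'\n' is a backslash followed by 'n'.
print(len('\n'), len(r'\n'))

digits = re.findall(raw, 'file_25')
print(digits)
```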

The table below lists the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>
<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>
<tr ><td><span >\w</span></td><td>Alphanumeric (or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>
<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>
<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>
<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>
<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
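
Each row of the table can be checked directly. The sketch below pairs every example pattern with its example match and uses `re.fullmatch`, which succeeds only when the whole string fits the pattern:

```python
import re

# (pattern, example) pairs copied from the table above.
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

for pattern, example in examples:
    matched = re.fullmatch(pattern, example) is not None
    print(f'{pattern:15} {example:10} {matched}')
```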

In [94]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [95]:
phone_number = re.search(pattern, text)

In [97]:
phone_number

<re.Match object; span=(23, 35), match='480-555-1234'>

In [98]:
phone_number.group()

'480-555-1234'

### Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>
<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>
<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>
<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>
<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>
<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>
<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
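
To compare the quantifiers, it helps to run several of them over the same input. A minimal sketch, using a made-up string of digit runs of increasing length:

```python
import re

numbers = '1 22 333 4444 55555'   # made-up sample string

one_or_more = re.findall(r'\d+', numbers)
exactly_three = re.findall(r'\d{3}', numbers)
two_to_four = re.findall(r'\d{2,4}', numbers)
three_or_more = re.findall(r'\d{3,}', numbers)
optional_s = re.findall(r'plurals?', 'plural plurals')

print(one_or_more)     # every run of digits
print(exactly_three)   # the first three digits of each run that is long enough
print(two_to_four)     # as many digits as possible, up to four
print(three_or_more)   # runs of at least three digits
print(optional_s)      # with or without the trailing s
```

Quantifiers are greedy by default: they take as many characters as the pattern allows, which is why `\d{2,4}` returns four digits from the five-digit run rather than two.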

In [99]:
pattern = r'\d{3}-\d{3}-\d{4}'

In [100]:
phone_number = re.search(pattern, text)

In [101]:
phone_number.group()

'480-555-1234'

### Groups

What if we want to do two tasks: find phone numbers and also quickly extract their area code (the first three digits)? We can use groups for any task that involves grouping parts of a regular expression together so that we can later extract them individually.

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [109]:
pattern = r'(\d{3})-(\d{3})-(\d{4})'

In [110]:
phone_number = re.search(pattern, text)

In [111]:
phone_number.group(1)

'480'

In [112]:
phone_number.group(2)

'555'

In [113]:
phone_number.group(3)

'1234'
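
Groups can also be retrieved all at once with `groups()`, or given names with the `(?P<name>...)` syntax. The sketch below is one way to write it; the names `area`, `exchange` and `line` are illustrative labels, not part of the original example:

```python
import re

text = 'My telephone number is 480-555-1234'

# Named groups: the labels are chosen here only for readability.
pattern = r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})'
m = re.search(pattern, text)

print(m.groups())        # all groups as a tuple
print(m.group('area'))   # a single group, looked up by name
print(m.group(0))        # group 0 is always the whole match
```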

In [114]:
re.search(r'man|woman', 'This woman was here')

<re.Match object; span=(5, 10), match='woman'>

In [116]:
re.findall(r'.at', 'The cat in the hat sat from the bat splat')

['cat', 'hat', 'sat', 'bat', 'lat']

In [117]:
re.findall(r'..at', 'The cat in the hat sat from the bat splat')

[' cat', ' hat', ' sat', ' bat', 'plat']

#### Starts with and Ends with

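^ Starts with (anchors the match to the beginning of the string)

$ Ends with (anchors the match to the end of the string)

Both anchors apply to the entire string, not to individual words. Inside square brackets, a leading ^ has a different meaning: it excludes the listed characters, as the [^\d] examples below show.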

In [120]:
re.findall(r'^\d', '100 people ran in streets')

['1']
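
The `$` anchor works the same way at the other end of the string. A small sketch with made-up sentences, showing that a digit in the middle of the string does not count:

```python
import re

ends_with_digit = re.findall(r'\d$', 'The number is 2')
digit_in_middle = re.findall(r'\d$', 'The 2 is a number')
starts_with_digit = re.findall(r'^\d', '1 is a number')

print(ends_with_digit)
print(digit_in_middle)
print(starts_with_digit)
```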

In [122]:
re.findall(r'[^\d]+', '100 people ran in streets')

[' people ran in streets']

In [123]:
re.findall(r'[^\d]', '100 people ran in streets')

[' ',
 'p',
 'e',
 'o',
 'p',
 'l',
 'e',
 ' ',
 'r',
 'a',
 'n',
 ' ',
 'i',
 'n',
 ' ',
 's',
 't',
 'r',
 'e',
 'e',
 't',
 's']

In [124]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [125]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [126]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [127]:
clean

'This is a string But it has punctuation How can we remove it'
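
An alternative to finding the words and joining them again is to delete the punctuation directly with `re.sub`. This is a sketch of that approach, not a replacement for the method above:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every '!', '.' or '?' with the empty string.
clean = re.sub(r'[!.?]', '', test_phrase)

print(clean)
```

For this phrase both approaches give the same result. They can differ on other input, for example when the text contains repeated spaces, which `re.sub` leaves untouched.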

#### Brackets for Grouping

As shown above, square brackets define a set of characters, any one of which may match at that position. For example, to find hyphenated words:

In [129]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [130]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']
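
Because `\w` already stands for a single character class, the brackets around it are optional here. Brackets become necessary when listing several characters or a range. A minimal sketch (the short sample strings are made up):

```python
import re

text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

# With and without brackets around \w: the results are the same.
with_brackets = re.findall(r'[\w]+-[\w]+', text)
without_brackets = re.findall(r'\w+-\w+', text)

# Brackets listing individual characters, and brackets with a range.
vowels = re.findall(r'[aeiou]', 'hyphen')
first_three = re.findall(r'[a-c]', 'abcdef')

print(with_brackets == without_brackets)
print(vowels)
print(first_three)
```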

#### Parentheses for Multiple Options

If we have multiple options for matching, we can list them inside parentheses, separated by |. For example:

In [131]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [132]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [133]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [134]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
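
Since `re.search` returns None when nothing matches, calling `.group()` on the result without checking raises an AttributeError. A short sketch of the usual guard, run over the three sentences from above:

```python
import re

pattern = r'cat(fish|nap|claw)'
sentences = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]

results = []
for sentence in sentences:
    m = re.search(pattern, sentence)
    # Test the match before using it; group(1) is the option that matched.
    results.append(m.group(1) if m else None)

print(results)
```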