### <font color="brown">Writing to a File</font>

In [1]:
grades = {'Jenna': 80, 'Dylan': 78, 'Anis': 65, 'Keisha': 82}
print(grades)

{'Jenna': 80, 'Dylan': 78, 'Anis': 65, 'Keisha': 82}


**Open file in "w" (write) mode, and write to it**

In [2]:
scores_file = open("scores_file.txt","w")   # open a file in "write" mode
for key,value in grades.items():
    scores_file.write(key + ' has ' + str(value) + '\n')  # call write method on file
scores_file.close()  # make sure to close the file
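As a variation on the cell above, the same write can be sketched with a `with` block, which closes the file automatically. The temporary directory and the read-back step are assumptions added here so the sketch can be run without touching existing files; they are not part of the original notebook.

```python
import os
import tempfile

grades = {'Jenna': 80, 'Dylan': 78, 'Anis': 65, 'Keisha': 82}

# Assumption: write into a fresh temporary directory instead of the working directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'scores_file.txt')

# The with block closes the file when it ends, even if write() raises.
with open(path, 'w') as scores_file:
    for name, score in grades.items():
        scores_file.write(f'{name} has {score}\n')

# Read the file back to confirm what was written.
with open(path) as scores_file:
    lines = scores_file.read().splitlines()

print(lines)
```

With `with`, there is no separate `close()` call to forget.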

---

#### Exercise: Parsing a population file with '|' field separator

In [5]:
# read populations from file into a dictionary
# country name is key, population is value
# each line of file is <country>|<population>
# population may have commas, need to remove

def getPopulations(file):
    pops = {}
    with open(file) as f:  # with closes the file when the block ends
        for line in f:
            country, pop = line.strip().split('|')  # strip the trailing newline first
            population = int(pop.replace(',',''))  # using string replace method
            pops[country] = population
    return pops

In [6]:
populations = getPopulations('population.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'population.txt'

In [None]:
populations['China']

In [None]:
populations['Nepal']

**Use list comprehension to get countries with population over 100 million**

In [None]:
large_pops = [c for c,p in populations.items() if p > 100000000 ]
print(large_pops)
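Since `population.txt` is not available here, the exercise can be tried on a small stand-in file. The sketch below is one possible version: the country names and figures are made up purely for illustration (they are not real population data), and `get_populations` and `population_sample.txt` are hypothetical names. It uses the same threshold as the cell above (100000000).

```python
import os
import tempfile

def get_populations(path):
    """Parse lines of the form <country>|<population> into a dict."""
    pops = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:          # skip blank lines
                continue
            country, pop = line.split('|')
            pops[country] = int(pop.replace(',', ''))  # drop thousands separators
    return pops

# Made-up sample data (NOT real population figures), only to exercise the parser.
sample = "Aland|1,500,000,000\nBorduria|250,000,000\nCascara|30,000,000\n"

path = os.path.join(tempfile.mkdtemp(), 'population_sample.txt')
with open(path, 'w') as f:
    f.write(sample)

populations = get_populations(path)
large_pops = [c for c, p in populations.items() if p > 100000000]
print(populations)
print(large_pops)
```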

---

#### Exercise: Counting words in a document using Counter

In [7]:
from collections import Counter

word_counts = Counter()
with open('metamorphosis.txt') as f:  # with closes the file when the block ends
    for line in f:
        tokens = line.split()  # separate into non-whitespace sequences
        for token in tokens:
            word_counts.update([token.lower().strip(',.')])  # strip ',' and '.' from words

FileNotFoundError: [Errno 2] No such file or directory: 'metamorphosis.txt'

In [8]:
print(word_counts)

Counter()


**Find top 5 most common words**

In [9]:
for word, count in word_counts.most_common(5):
    print(word, count)

**Find words of length 4 or more that occur at least twice**

In [10]:
commons = [(w,c) for w,c in word_counts.most_common() if len(w) > 3 and c > 1]
commons

[]

**Find up to 5 words of length 4 or more that occur at least twice**

In [None]:
commons = [(w,c) for w,c in word_counts.most_common() if len(w) > 3 and c > 1][:5]
commons
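Because `metamorphosis.txt` was not found, the cells above operate on an empty Counter. A minimal sketch of the same steps on an inline string, assuming a made-up three-line text in place of the file:

```python
from collections import Counter

# Made-up stand-in text; the notebook reads 'metamorphosis.txt' instead.
text = """The bird saw the tree.
The tree hid the bird, and the bird sang.
Then the wind sang."""

word_counts = Counter()
for line in text.splitlines():
    # lower-case each token and strip ',' and '.' from its ends
    word_counts.update(token.lower().strip(',.') for token in line.split())

# two most common words
top_two = word_counts.most_common(2)

# words of length 4 or more that occur at least twice
commons = [(w, c) for w, c in word_counts.most_common() if len(w) > 3 and c > 1]

print(top_two)
print(commons)
```

Here `update` receives one iterable of words per line, rather than a one-element list per token.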

---

### <font color="brown">Regular Expressions</font>

Tutorials can be found at the following sites:

1. https://www.w3schools.com/python/python_regex.asp
2. https://developers.google.com/edu/python/regular-expressions#basic-patterns
3. https://docs.python.org/3/howto/regex.html?highlight=regular%20expressions

And the site https://regex101.com/ has a regular expression engine you can use to try things out.


---

#### <font color="brown">Import the re module</font>

In [None]:
import re

---

#### <font color="brown">Search for a pattern in a string using re.search function</font>

In [None]:
res = re.search('a','cat')  # search for pattern 'a' in target 'cat'
res

**search returns a Match object: span=(1, 2) gives the start index and end index (exclusive)<br>
of the match within the target string "cat", and match='a' gives the text that was matched**
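The pieces of a Match object can also be read individually. A short sketch using the standard `span()`, `start()`, `end()`, and `group()` methods on the same search:

```python
import re

res = re.search('a', 'cat')

print(res)                   # the repr shows the span and the matched text
print(res.span())            # (start, end) indices, end exclusive
print(res.start(), res.end())
print(res.group())           # the matched substring

# Slicing the target with the span gives the same text as group().
same_text = 'cat'[res.start():res.end()]
print(same_text)
```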

In [None]:
res = re.search('a','dog')
print(res)

**If you simply echo res, nothing is displayed because res is None, see below**

In [None]:
res

**So it's good practice to print the return value of search, or test it in a condition, since it may be None**

In [None]:
print('matched' if re.search('a','dog') else 'not matched')

**search returns the first occurrence of a match, in case there are multiple matches**

In [None]:
res = re.search('ar','barbaric')  
print(res)  
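To see every occurrence rather than only the first, `re.finditer` and `re.findall` can be used. This is a small sketch for comparison with `re.search`:

```python
import re

# re.search stops at the first occurrence
first = re.search('ar', 'barbaric')

# re.finditer yields one Match object per non-overlapping occurrence
all_spans = [m.span() for m in re.finditer('ar', 'barbaric')]

# re.findall returns just the matched strings
all_texts = re.findall('ar', 'barbaric')

print(first.span())
print(all_spans)
print(all_texts)
```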

In [None]:
# when searching, because failure is possible, use condition
def searchit(pattern,astr): 
    if re.search(pattern,astr):   # same as if re.search(pattern,astr) is not None
        return 'Matched'
    else:
        return 'No match' 

print(searchit('a','cat'))
print(searchit('a','dog'))
print(searchit('ar','barbaric'))

**<font color="red">Matching literal strings is faster with string method</font>**

In [None]:
def findit(litstr,target):
    if target.find(litstr) == -1:
        return 'No match'
    else:
        return 'Matched'
    
print(findit('a','cat'))
print(findit('a','dog'))
print(findit('ar','barbaric'))

In [None]:
def findit(litstr,target):
    res = 'No match' if target.find(litstr) == -1 else 'Matched'
    return res
    
print(findit('a','cat'))
print(findit('a','dog'))
print(findit('ar','barbaric'))
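For a plain yes/no test on a literal substring, the `in` operator is another option alongside `find`. The sketch below also includes a rough `timeit` comparison against `re.search`; the timings are only indicative and depend on the machine and Python version, so no expected numbers are given.

```python
import re
import timeit

def findit(litstr, target):
    # the in operator tests for a literal substring without any regex machinery
    return 'Matched' if litstr in target else 'No match'

results = [findit('a', 'cat'), findit('a', 'dog'), findit('ar', 'barbaric')]
print(results)

# Rough timing comparison; absolute numbers vary from run to run.
t_in = timeit.timeit(lambda: 'ar' in 'barbaric', number=100_000)
t_re = timeit.timeit(lambda: re.search('ar', 'barbaric'), number=100_000)
print(f'in: {t_in:.4f}s  re.search: {t_re:.4f}s')
```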

---

#### <font color="brown">Writing regexp patterns with metacharacters</font>

**Metacharacter [ ] is used for a class of characters<br>
Metacharacter * means 0 or more of preceding character/class<br>
Metacharacter + means 1 or more of preceding character/class**

**Example 1**<br>
Search for any sequence of characters that starts with 'a', ends with 't', and has zero or more 'c's in between

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('ac*t',astr)  # uses metacharacter *
    print('match' if res else 'no match')

**Example 2**<br>
Search for any sequence of characters that starts with 'a', ends with 't', and has AT LEAST one 'c' in between

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('ac+t',astr)  # uses metacharacter +
    print('match' if res else 'no match')

**Example 3**<br>
Search for any sequence that starts with a, ends with t, and has any number of digits (zero included) in between

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[0-9]*t',astr)  # uses metacharacters [] and *
    print('match' if res else 'no match')

**Example 4**<br>
Search for any sequence that starts with a, ends with t, and has any number of letters or digits (zero included) in between

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z0-9]*t',astr)  # uses metacharacters [] and *
    print('match' if res else 'no match')

**Example 5**<br>
Search for any sequence that starts with a, ends with t, and has one or more letters followed by one or more digits in between<br>
i.e. between a and t, all letters must precede all digits

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a[a-zA-Z]+[0-9]+t',astr)  # uses metacharacters [] and +
    print('match' if res else 'no match')
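The five patterns from Examples 1 to 5 can also be compared side by side without typing inputs interactively. In this sketch the list of target strings is an assumption chosen to show where the patterns differ:

```python
import re

patterns = ['ac*t', 'ac+t', 'a[0-9]*t', 'a[a-zA-Z0-9]*t', 'a[a-zA-Z]+[0-9]+t']

# Assumed test inputs, picked so that each pattern accepts a different subset.
targets = ['at', 'act', 'acct', 'a42t', 'ab42t', 'a4b2t']

# For each pattern, collect the targets in which re.search finds a match.
results = {p: [s for s in targets if re.search(p, s)] for p in patterns}

for p, matched in results.items():
    print(f'{p:20} {matched}')
```

Note that `'a4b2t'` is accepted only by Example 4's pattern, because it is the only one that allows letters and digits to be mixed in any order.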

---

**Metacharacter . matches any character except a newline (by default)**

**Example**<br>
Search for any sequence that starts with a, ends with t, and has any character any number of times (including zero) between

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('a.*t',astr)  # uses metacharacters . and *
    print('match' if res else 'no match')
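Two details of `.*` are easy to miss: it is greedy, so the match runs to the last possible `t`, and by default `.` does not match a newline unless the `re.DOTALL` flag is passed. A small sketch, with target strings chosen here for illustration:

```python
import re

# Greedy: .* extends as far as it can, so the match ends at the LAST 't'.
greedy = re.search('a.*t', 'a cat sat on a mat').group()
print(repr(greedy))

# By default . does not match a newline, so this search fails.
no_newline = re.search('a.*t', 'a\nt')
print(no_newline)

# With re.DOTALL, . matches newlines as well.
with_dotall = re.search('a.*t', 'a\nt', re.DOTALL)
print(repr(with_dotall.group()))
```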

---

**Metacharacter ? matches zero or one occurrence of the preceding character/class**

**Example**<br>
Search for the sequence 'act' or 'at' in any string

In [None]:
res = re.search('ac?t','at')
print(res)
res = re.search('ac?t','act')
print(res)
res = re.search('ac?t','tractor')
print(res)
res = re.search('ac?t','art')
print(res)
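To make the results easier to compare, the searches can be collected into a dictionary that maps each target to the text that matched, or to None when nothing matched. The extra target `'acct'` is an addition for illustration; it shows that `?` allows at most one `c`.

```python
import re

targets = ['at', 'act', 'tractor', 'art', 'acct']

found = {}
for s in targets:
    m = re.search('ac?t', s)
    # keep the matched text, or None when the search failed
    found[s] = m.group() if m else None

print(found)
```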

---

**Metacharacter ^ matches start of target string when used outside of a [ ] class<br>
Metacharacter $ matches end of target string**

**Example 1**<br>
Match all target strings that start with p

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('^p',astr)  # uses metacharacter ^
    print('match' if res else 'no match')

**Example 2**<br>
Match all target strings that end with p

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('p$',astr)  # uses metacharacter $
    print('match' if res else 'no match')

**Example 3**<br>
Match all target strings that start with ar, end with t, and have one or more lowercase letters (and nothing else) between

In [None]:
while True:
    astr = input("string? ('quit' to stop) ")
    if astr == 'quit':
        break
    res = re.search('^ar[a-z]+t$',astr)  # uses metacharacters ^, [ ], +, and $
    print('match' if res else 'no match')
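The three anchor examples can be checked non-interactively as well. In the sketch below, `matches` is a hypothetical helper and the word lists are assumptions. The last line uses `re.fullmatch`, which requires the whole string to match and so gives the same result as `^...$` for these inputs.

```python
import re

def matches(pattern, targets):
    """Return the targets in which re.search finds the pattern."""
    return [s for s in targets if re.search(pattern, s)]

words = ['python', 'happy', 'stop']
starts_with_p = matches('^p', words)   # only strings beginning with p
ends_with_p = matches('p$', words)     # only strings ending with p

candidates = ['artist', 'art', 'argument', 'ar1t', 'smartest']
ar_t = matches('^ar[a-z]+t$', candidates)

# re.fullmatch anchors the pattern at both ends implicitly.
ar_t_full = [s for s in candidates if re.fullmatch('ar[a-z]+t', s)]

print(starts_with_p, ends_with_p)
print(ar_t, ar_t_full)
```

`'art'` is rejected because `[a-z]+` needs at least one letter between `ar` and the final `t`, and `'ar1t'` is rejected because a digit is not in the class.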