<a href="https://colab.research.google.com/github/Yenaaa/24spring_hss510/blob/main/SelectingCleaning_Mar6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **HSS 510 Guide Coding: Selecting and Cleaning Texts**

### **2024 Mar 6, Taegyoon Kim**


---

## **Topics**
- String manipulation
- Regular expressions

## **String manipulation**
-   In Python, "methods" are functions that belong to an object
-   A method is available only on objects of the type that defines it (string methods work on strings)
-   They are used with dot notation: `object.method()`
-   Python has a rich set of string methods
-   See a full list [here](https://www.w3schools.com/python/python_ref_string.asp)
-   Let's explore some of them
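
As a quick illustration of dot notation, here is a minimal sketch. The variable names `course` and `shouted` are made up for this example; `hasattr()` is a built-in that checks whether an object has a given attribute or method.

```python
course = 'hss 510'

# A method is called on an object with dot notation
shouted = course.upper()
print(shouted)

# Methods belong to a type: str defines upper(), int does not
print(hasattr(course, 'upper'))
print(hasattr(510, 'upper'))
```

The last line prints `False` because `upper()` is defined for strings, not for integers.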

### `startswith()` and `endswith()`
- Return a Boolean value

In [None]:
url = 'www.google.com'
url.startswith('www.g')
url.startswith('www')
url.endswith('com')
url.endswith('.co.kr')

False
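
Both methods also accept a tuple of candidates and return `True` if any of them matches. A small sketch, assuming a made-up list of addresses called `urls`:

```python
# Hypothetical list of addresses for illustration
urls = ['www.google.com', 'www.kaist.ac.kr', 'www.naver.com', 'example.org']

# A tuple means "ends with any of these"
kr_or_org = [u for u in urls if u.endswith(('.kr', '.org'))]
print(kr_or_org)

# A single string works as before
with_www = [u for u in urls if u.startswith('www.')]
print(with_www)
```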

### Use the `in` operator to check whether a string contains a substring

In [None]:
print('www' in url and '.ac.kr' in url)
print('www' in url and '.com' in url)

False
True


### `capitalize()`, `title()`, `upper()`, `lower()`
- `capitalize()` capitalizes the first character (and lowercases the rest)

In [None]:
paper = 'the New York times'
paper.capitalize()

'The new york times'

- `title()` capitalizes the first character of every word

In [None]:
paper.title()

'The New York Times'

- `upper()` capitalizes everything

In [None]:
paper.upper()

'THE NEW YORK TIMES'

- `lower()` converts string to all lowercase

In [None]:
paper.lower()

'the new york times'

### `split()`
- `split()` splits strings into a list of strings
- If no separator is specified, it splits on whitespace (spaces, tabs, newlines)

In [None]:
print(paper)

list_split = paper.split()
print(list_split)

the New York times
['the', 'New', 'York', 'times']


In [None]:
print(url)
url.split('.')

In [None]:
'MICHIGAN'.split('I')

['M', 'CH', 'GAN']

### `strip`, `lstrip`, `rstrip`
-   Trim from the right (`rstrip`), the left (`lstrip`), or both sides (`strip`)
-   By default they remove leading and/or trailing whitespace characters such as `' '`, `\t`, `\n`

In [None]:
msg = '\n   Hello, world! Again!   \n'
print(msg)


   Hello, world! Again!   



In [None]:
print(msg.rstrip())


   Hello, world! Again!
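
The strip methods also take an optional argument listing the characters to remove instead of whitespace. A minimal sketch with made-up strings (`raw`, `padded`):

```python
raw = '...Hello, world!!!'

# Remove any leading/trailing '.' or '!' characters
both = raw.strip('.!')
left = raw.lstrip('.')
right = raw.rstrip('!')
print(both)
print(left)
print(right)

# With no argument, all surrounding whitespace is removed
padded = '\t  spaced out \n'
trimmed = padded.strip()
print(repr(trimmed))
```

Note that the argument is treated as a set of characters, not as a prefix or suffix, and characters in the middle of the string are never touched.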


### Modification in place?
- As you might have noticed, string methods never modify a string in place; they return a new string

In [None]:
paper = 'hss510'
paper.upper()
print(paper)
paper_mod = paper.upper()
print(paper_mod)

hss510
HSS510

### Indexing and Slicing Strings
- Strings can be indexed and sliced just like lists

In [None]:
word = 'slicing'
print(word[:3])
print(word[2:5:2]) # [start:stop:step] framework
print(word[0])
print(word[-1])
print(word[2])

sli
ii
s
g
i

- Strings are immutable, but they are iterable
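
To make the immutability point concrete, here is a small sketch. Assigning to a position in a string raises a `TypeError`, so a changed version has to be built as a new string. The names `error_message` and `capitalized` are introduced only for this example.

```python
word = 'slicing'

try:
    word[0] = 'S'  # item assignment is not allowed on strings
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# Build a new string from a literal and a slice instead
capitalized = 'S' + word[1:]
print(word)
print(capitalized)
```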

In [None]:
for char in 'hey you':
    if char != ' ':
        print(char.upper() + '!')
    else:
        break # stop at the first space

H!
E!
Y!


- The length of a string is its number of characters (including white spaces)

In [None]:
len('Ann Arbor')

9

### Strings and lists
- Convert string to list: gives a list with every character in string

In [None]:
list('sci art')

['s', 'c', 'i', ' ', 'a', 'r', 't']

In [None]:
'sci art'.split() # what about this?

['sci', 'art']

* One can also join a list of elements into a single string (e.g., when you want to combine multiple sentences)

In [None]:
letters = ['U', 'M', 'I', 'C', 'H']
print(''.join(letters))

UMICH


- We could join by another string too

In [None]:
'-'.join(letters)

'U-M-I-C-H'
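
Combining `split()` and `join()` is a common way to normalize irregular whitespace when cleaning text. A minimal sketch, using a made-up string `messy`:

```python
messy = '  the   New York \n times  '

# split() with no argument drops all runs of whitespace
tokens = messy.split()
print(tokens)

# join() puts the pieces back together with single spaces
clean = ' '.join(tokens)
print(clean)
```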

## **Regular expressions**

### Why regex?
* We use regex to identify patterns
* Very useful when we parse texts and extract certain components

### Let's look at `find()` and `count()` string methods first
* `find()` returns the index at which the specified substring first occurs
* `count()` returns the number of (non-overlapping) times the specified substring occurs

In [None]:
print("strawberry".find("berry"))
print("strawberry and blueberry".count("berry"))

5
2


* `find()` returns -1 for non-match

In [None]:
print("strawberry".find("better"))
print("strawberry".count("better"))

-1
0


* `find()` returns only the first match

In [None]:
print("berryberrystrawberry".find("berry"))

0
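
Because `find()` stops at the first match, collecting every position with string methods alone needs a loop. The helper `find_all` is a hypothetical function written for this sketch; it relies on the optional second argument of `find()`, which sets where the search starts.

```python
def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:  # an empty substring would never advance the search
        return []
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # resume searching right after the match just found
        start = text.find(sub, start + len(sub))
    return positions

positions = find_all('berryberrystrawberry', 'berry')
print(positions)
```

This kind of bookkeeping is one reason regular expressions are convenient for pattern search.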


* Let's read a list of 80 fruits

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/taegyoon-kim/programming_dhcss_23fw/main/week_14/fruit.txt'
fruit = pd.read_csv(url, header = None)[0].to_list() # Series to list

print(len(fruit))
print(fruit)

80
['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilberry', 'blackberry', 'blackcurrant', 'blood orange', 'blueberry', 'boysenberry', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudberry', 'coconut', 'cranberry', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderberry', 'feijoa', 'fig', 'goji berry', 'gooseberry', 'grape', 'grapefruit', 'guava', 'honeydew', 'huckleberry', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulberry', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspberry', 'redcurrant', 'rock melon', 'salal berry', 'satsuma', 'star fruit', 'strawberry', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']


- Use list comprehension to extract matches

In [None]:
[i for i in fruit if i.find("berry") != -1] # find
[i for i in fruit if i.count("berry") > 0] # count

['bilberry',
 'blackberry',
 'blueberry',
 'boysenberry',
 'cloudberry',
 'cranberry',
 'elderberry',
 'goji berry',
 'gooseberry',
 'huckleberry',
 'mulberry',
 'raspberry',
 'salal berry',
 'strawberry']

### Import `re` module and search
- Regex is a sequence of characters that forms a flexible search pattern
- Can be used to check if a string contains the specified pattern

In [None]:
import re
mo = re.search(r'berry', 'strawberry.') # the prefix 'r' denotes raw string
print(mo)
print(type(mo)) # a match object

<re.Match object; span=(5, 10), match='berry'>
<class 're.Match'>


* Escape with `\` and denote raw string with `r`

In [None]:
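sent = 'NLP\nTEXT' # \n is a newline (line break)
print(sent)

sent_new = 'NLP\\nTEXT' # to get a literal backslash followed by n, escape the backslash
print(sent_new)
print(repr(sent_new)) # repr() shows the string with its escapes

print('This is a backslash \\n character') # like here too

NLP
TEXT
NLP\nTEXT
'NLP\\nTEXT'
This is a backslash \n character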

In [None]:
result_without_r = re.search('\\n', 'This is a backslash \\n character')
result_with_r = re.search(r'\\n', 'This is a backslash \\n character')

print(result_without_r) # re.search returns None when there is no match
print(result_with_r) # a literal match
result_with_r.group()

None
<re.Match object; span=(20, 22), match='\\n'>


'\\n'
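
The difference between a regular and a raw string can be checked directly by looking at lengths. A minimal sketch (the variable names are made up for this example):

```python
import re

plain = '\n'   # one character: a newline
raw = r'\n'    # two characters: a backslash and the letter n
print(len(plain), len(raw))

# A raw string is just a shorthand for doubling the backslashes
same = (raw == '\\n')
print(same)

# As a regex, the two-character pattern \n still means "newline"
text = 'first line\nsecond line'
newlines = re.findall(r'\n', text)
print(newlines)
```

Writing patterns as raw strings keeps Python's own escape handling out of the way, so the regex engine sees exactly what was typed.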

- Extraction into a list


In [None]:
mo = re.search(r'berry', 'strawberry.')

print(mo)
print(mo.group()) # returns the part of the string where there was a match
print(mo.span()) # returns the related position

berries = [i for i in fruit if re.search(r'berry', i)]
print(berries)

c = re.compile(r'berry') # *compile* the pattern into an object
berries_compile = [i for i in fruit if c.search(i)]

for i, j in zip(berries, berries_compile):
  print(i == j)

<re.Match object; span=(5, 10), match='berry'>
berry
(5, 10)
['bilberry', 'blackberry', 'blueberry', 'boysenberry', 'cloudberry', 'cranberry', 'elderberry', 'goji berry', 'gooseberry', 'huckleberry', 'mulberry', 'raspberry', 'salal berry', 'strawberry']
True
True
True
True
True
True
True
True
True
True
True
True
True
True


- Multiple matches

In [None]:
mo_m = re.search(r'berry',"berryberrystrawberry")
print(mo_m) # returns the first match

mo_m2 = re.findall(r'berry',"berryberrystrawberry")
print(mo_m2) # this is a list of strings

mo_m3 = re.finditer(r'berry',"berryberrystrawberry")

for i in mo_m3: # this returns an iterator
  print(i)

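<re.Match object; span=(0, 5), match='berry'>
['berry', 'berry', 'berry']
<re.Match object; span=(0, 5), match='berry'>
<re.Match object; span=(5, 10), match='berry'>
<re.Match object; span=(15, 20), match='berry'>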


In [None]:
texts = ['Contact us at 123-456-7890 or 111-999-1842',
         'Our phone number is 987-654-3210 (987-654-3211)',
         'This text has no phone number',
         'Reach out at 555-555-5555 for more information']

phone_pattern = r'\d{3}-\d{3}-\d{4}' # three digits, three digits, four digits
phone_numbers = []

for text in texts:
    # re.findall treats phone_pattern as a regex and returns a list of all matches
    found_numbers = re.findall(phone_pattern, text)
    # text.find(phone_pattern) would look for the pattern as literal text and return -1
    phone_numbers.append(found_numbers)

print(phone_numbers)

[['123-456-7890', '111-999-1842'], ['987-654-3210', '987-654-3211'], [], ['555-555-5555']]


### `[]` for multiple characters
- Matches "any one of" the characters in `[]`
- Read data first

In [None]:
url = 'https://raw.githubusercontent.com/taegyoon-kim/programming_dhcss_23fw/main/week_14/sentences.txt'
sentences = pd.read_csv(url, header = None, sep = '@')[0].to_list()
print(len(sentences))
print(sentences[0])

720
The birch canoe slid on the smooth planks.




-   For `beat`, `heat`, `peat`

In [None]:
c = re.compile(r' [bhp]eat ')
l_mo = [i for i in sentences if c.search(i)]

for i in l_mo:
  print(i)

The heart beat strongly and with firm strokes.
Burn peat after the logs give out.
Feel the heat of the weak dying flame.
A speedy man can beat this track mark.
Even the worst will beat his low score.
It takes heat to bring out the odor.


-   Use `-` to indicate a range of contiguous characters


In [None]:
c = re.compile(r' [b-p]eat ') # any one letter from b through p
l_mo = [i for i in sentences if c.search(i)]

for i in l_mo:
  print(i)

The heart beat strongly and with firm strokes.
Burn peat after the logs give out.
Feel the heat of the weak dying flame.
A speedy man can beat this track mark.
Even the worst will beat his low score.
Pack the records in a neat thin case.
It takes heat to bring out the odor.
A clean neck means a neat collar.


-   Use `^` as the first character inside the square brackets to match anything but the listed characters

In [None]:
c = re.compile(r' [^bhp]eat ')
l_mo = [i for i in sentences if c.search(i)]

for i in l_mo:
  print(i)

Pack the records in a neat thin case.
A clean neck means a neat collar.
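
The three bracket forms can be compared side by side on a small word list. This is a sketch under assumptions: `words` is a made-up list, and `re.fullmatch()` (which requires the whole string to match) stands in for the surrounding spaces used in the sentence examples.

```python
import re

# Hypothetical word list for illustration
words = ['beat', 'heat', 'peat', 'neat', 'seat', 'eat']

any_of = [w for w in words if re.fullmatch(r'[bhp]eat', w)]     # b, h, or p
in_range = [w for w in words if re.fullmatch(r'[b-p]eat', w)]   # b through p
none_of = [w for w in words if re.fullmatch(r'[^bhp]eat', w)]   # anything but b, h, p

print(any_of)
print(in_range)
print(none_of)
```

`'eat'` matches none of the three patterns because a bracket expression always consumes exactly one character.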


### `|` over multi-character patterns

-   `|` operator is for OR

-   Parentheses group parts of the pattern, so that `|` applies only within each group

In [None]:
c = re.compile(r'(black|blue|red)(currant|berry)')
l_mo = [i for i in fruit if c.search(i)]
l_mo

['blackberry', 'blackcurrant', 'blueberry', 'redcurrant']

### The group() method

* Used to access the parts of the string that match the various parts of the regular expression


In [None]:
c = re.compile(r'(black|blue|red)(currant|berry)')
mo = c.search('strawberry and blackberry')
print(mo.group())
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))

blackberry
blackberry
black
berry
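
Groups can also be read off every match in a text, not just the first one. A minimal sketch, assuming a made-up sentence `text`; when a pattern contains several groups, `findall()` returns one tuple of group contents per match.

```python
import re

c = re.compile(r'(black|blue|red)(currant|berry)')
text = 'strawberry, blackberry, and redcurrant jam'

# finditer yields one match object per match; group(1) and group(2) are the two parts
pairs = [(mo.group(1), mo.group(2)) for mo in c.finditer(text)]
print(pairs)

# findall gives the same tuples directly
pairs_findall = c.findall(text)
print(pairs_findall)
```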


### Special characters and the backslash

-   Let's create a random example string

In [None]:
eg_str = 'Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.'

-   Several characters have a special meaning in regex and have to be **escaped** with a backslash in order to match the literal character
-   They include `.`, `^`, `$`, `*`, `+`, `?`, `|`, `\`, `(`, `)`, `[`, `]`, `{`, and `}`; characters such as `!`, `<`, and `>` are special only inside group extensions like `(?!...)` and `(?<=...)`

-   For example, `.` means "any character but a newline"


In [None]:
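allchars = re.findall(r'.', eg_str) # an unescaped . matches every character (there are no newlines here)

print(allchars)
print(len(allchars) == len(eg_str))

['E', 'x', 'a', 'm', 'p', 'l', 'e', ' ', 'S', 'T', 'R', 'I', 'N', 'G', ',', ' ', 'w', 'i', 't', 'h', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', '(', '1', '2', ',', ' ', '1', '5', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 's', 'o', ' ', '1', '0', '.', '2', ')', '?', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']
True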


In [None]:
allperiods = re.findall(r'\.', eg_str) # get the literal periods in the example

print(allperiods)

['.', '.']


In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'a.', eg_str)
print(matches)

matches = re.findall(r'a\.', eg_str)
print(matches)

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
['am', 'an', 'al']
[]
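
When the text to search for comes from data rather than being typed by hand, `re.escape()` adds the needed backslashes automatically. A small sketch with a made-up string `values`:

```python
import re

values = 'Values: 10.2, 1012, 10x2'

# Unescaped: the . matches any character, so three substrings match
loose = re.findall('10.2', values)
print(loose)

# re.escape turns '10.2' into a pattern that matches only the literal text
literal_pattern = re.escape('10.2')
strict = re.findall(literal_pattern, values)
print(literal_pattern)
print(strict)
```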


### Class shorthands

-   `\w` (any word character: letters, digits, and the underscore), `\s` (any whitespace character), and `\d` (any numeric digit)

-   The capitalized versions (`\W`, `\S`, `\D`) mean "anything but" that class

In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'\w',eg_str) # any alphanumeric character
print(matches)

matches = re.findall(r'\W',eg_str) # any non-alphanumeric character
print(matches)

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
['E', 'x', 'a', 'm', 'p', 'l', 'e', 'S', 'T', 'R', 'I', 'N', 'G', 'w', 'i', 't', 'h', 'n', 'u', 'm', 'b', 'e', 'r', 's', '1', '2', '1', '5', 'a', 'n', 'd', 'a', 'l', 's', 'o', '1', '0', '2', 'W', 'o', 'w', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's']
[' ', ',', ' ', ' ', ' ', '(', ',', ' ', ' ', ' ', ' ', '.', ')', '?', '!', ' ', ',', ' ', ' ', '.']


In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'\s', eg_str) # any whitespace character
print(matches)

matches = re.findall(r'\S', eg_str) # any non-whitespace character
print(matches)

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
['E', 'x', 'a', 'm', 'p', 'l', 'e', 'S', 'T', 'R', 'I', 'N', 'G', ',', 'w', 'i', 't', 'h', 'n', 'u', 'm', 'b', 'e', 'r', 's', '(', '1', '2', ',', '1', '5', 'a', 'n', 'd', 'a', 'l', 's', 'o', '1', '0', '.', '2', ')', '?', '!', 'W', 'o', 'w', ',', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'\d', eg_str) # any digit character
print(matches)

matches = re.findall(r'\D', eg_str) # any non-digit character
print(matches)

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
['1', '2', '1', '5', '1', '0', '2']
['E', 'x', 'a', 'm', 'p', 'l', 'e', ' ', 'S', 'T', 'R', 'I', 'N', 'G', ',', ' ', 'w', 'i', 't', 'h', ' ', 'n', 'u', 'm', 'b', 'e', 'r', 's', ' ', '(', ',', ' ', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 's', 'o', ' ', '.', ')', '?', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']
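
Combined with `+` (one or more), the shorthands give a rough tokenizer. A minimal sketch on a made-up string; note that `\w` includes the underscore but not the hyphen.

```python
import re

text = 'snake_case, kebab-case & CamelCase2'

# Runs of word characters: the underscore stays inside a token, the hyphen splits one
tokens = re.findall(r'\w+', text)
print(tokens)

# Runs of everything else: the separators between tokens
separators = re.findall(r'\W+', text)
print(separators)
```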


### Quantifiers: `*` (zero or more of the previous)


In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'\d*', eg_str) # any string of zero or more digits
print(matches) # every position without a digit yields an empty match ('')

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '12', '', '', '15', '', '', '', '', '', '', '', '', '', '', '10', '', '2', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


### Quantifiers: `+` (one or more of the previous)


In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'\d+', eg_str) # any string of one or more digits
print(matches)

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
['12', '15', '10', '2']
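
In the output above, `10.2` is split into `'10'` and `'2'` because `.` is not a digit. One possible way to keep decimals together, sketched here as an illustration rather than a general number parser, is to try a "digits, literal period, digits" alternative before falling back to plain digits.

```python
import re

eg_str = 'Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.'

# Try the decimal form first; otherwise match a plain run of digits
numbers = re.findall(r'\d+\.\d+|\d+', eg_str)
print(numbers)

# The matches are still strings; convert them if numeric values are needed
values = [float(n) for n in numbers]
print(values)
```

The order of the alternatives matters: with `\d+|\d+\.\d+` the plain-digit branch would win first and `10.2` would be split again.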


### Quantifiers: `{n}`, `{n,m}`, and `{n,}`

-   `{n}` = "exactly n" of the previous
-   `{n,m}` = "between n and m" of the previous
-   `{n,}` = "n or more" of the previous

In [None]:
print(eg_str) # just reprint it nearby

matches = re.findall(r'xy{1}','x xx xxy xxxx xxxxx') # an 'x' followed by exactly one 'y'
print(matches)

matches = re.findall(r'x{3,4}','x xx xxx xxxx xxxxx') # 3 or 4 'x's
print(matches)

matches = re.findall(r'x{3,}','x xx xxx xxxx xxxxx') # 3 or more 'x's
print(matches)

Example STRING, with numbers (12, 15 and also 10.2)?! Wow, two sentences.
['xy']
['xxx', 'xxxx', 'xxxx']
['xxx', 'xxxx', 'xxxxx']
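
Counted quantifiers are most useful with character classes, for example when matching fixed-width fields. A small sketch on a made-up string `note`; the date format is an assumption for illustration only.

```python
import re

note = 'Class on 2024-03-06; makeup on 2024-3-13; room 1234567.'

# Exactly 4, 2, and 2 digits
strict = re.findall(r'\d{4}-\d{2}-\d{2}', note)
print(strict)

# Month and day may have 1 or 2 digits
flexible = re.findall(r'\d{4}-\d{1,2}-\d{1,2}', note)
print(flexible)

# Runs of 5 or more digits
long_runs = re.findall(r'\d{5,}', note)
print(long_runs)
```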


### Quantifiers: `?` (zero or one of the previous)


In [None]:
c = re.compile(r' [bp]?eat ')

matches = [i for i in sentences if c.search(i)]
print(matches)

['The heart beat strongly and with firm strokes.', 'Burn peat after the logs give out.', 'A speedy man can beat this track mark.', 'Even the worst will beat his low score.', 'Quench your thirst, then eat the crackers.']


### Quantifiers: `+`, `+?`

-   `+` is a greedy quantifier: it matches one or more of the previous element, as many times as possible
-   `+?` is the non-greedy (lazy) version: it matches one or more of the previous element, as few times as possible

In [None]:
matches = re.findall(r'\(.+\)', '(First bracketed statement) Other text (Second bracketed statement)')
print(matches)

matches = re.findall(r'\(.+?\)', '(First bracketed statement) Other text (Second bracketed statement)')
print(matches)

['(First bracketed statement) Other text (Second bracketed statement)']
['(First bracketed statement)', '(Second bracketed statement)']


In [None]:
matches = re.findall(r'x.+x','x xx xxx xxxx xxxxx') # as long as possible
print(matches)

matches = re.findall(r'x.+?x','x xx xxx xxxx xxxxx') # as short as possible
print(matches)

['x xx xxx xxxx xxxxx']
['x x', 'x x', 'xx x', 'xxx', 'xxx']
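
A typical place where the greedy/non-greedy difference matters is text with repeated delimiters, such as markup tags. A minimal sketch on a made-up snippet `html`; this is an illustration of the quantifiers, not a recommended way to parse HTML.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: runs from the first '<' to the last '>'
greedy = re.findall(r'<.+>', html)
print(greedy)

# Non-greedy: stops at the first '>' after each '<'
lazy = re.findall(r'<.+?>', html)
print(lazy)
```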
