Why do you want to manipulate text?
- Transform text into meaningful data. E.g., remove profanity from comments before analyzing the sentiment of the text.
- Extract information from text. E.g., extract the date and time from a log file generated by a system.

## Python String Methods

Python and pandas (a Python package) provide many string methods for manipulating text. For example:

In [4]:
import pandas as pd

### Transformation

In [5]:
str1 = "abracadabra"
str1.upper()

'ABRACADABRA'

In [6]:
series1 = pd.Series(["tyler", "john", "alexandra"])
series1.str.upper()

0        TYLER
1         JOHN
2    ALEXANDRA
dtype: object

### Replacement

In [9]:
str1.replace('cad', 'CAD')

'abraCADabra'

In [10]:
series1.str.replace('le', 'LE')

0        tyLEr
1         john
2    aLExandra
dtype: object
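
The replacements above use literal text. As a rough sketch of how `Series.str.replace` also accepts a pattern, the example below passes the `regex` argument explicitly; the vowel pattern and the variable names `masked` and `unchanged` are illustrative choices, not part of the original notebook. Regular expressions themselves are introduced later in these notes.

```python
import pandas as pd

series1 = pd.Series(["tyler", "john", "alexandra"])

# regex=True: the pattern is read as a regular expression (any lowercase vowel).
masked = series1.str.replace(r"[aeiou]", "*", regex=True)

# regex=False: the same pattern is searched for as literal text, so nothing changes.
unchanged = series1.str.replace(r"[aeiou]", "*", regex=False)

print(masked.tolist())     # ['tyl*r', 'j*hn', '*l*x*ndr*']
print(unchanged.tolist())  # ['tyler', 'john', 'alexandra']
```

Stating `regex=` explicitly avoids depending on the default, which has differed between pandas versions.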

In [12]:
### Task: 

#Assume we want to merge the following tables: 

df1 = pd.DataFrame({
                    'County': ['De Witt County', 'Lac qui Parle County', 'Lewis and Clark County', 'St John the Baptist Parish'],
                    'State': ['IL', 'MN', 'MT', 'LA']
                    })

df2 = pd.DataFrame({'County': ['DeWitt', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist'],
                   'Population': [16798, 8067, 55716, 43044]})

In [13]:
df1

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LA


In [14]:
df2

Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


In [16]:
pd.merge(df1, df2, how = 'inner')

Unnamed: 0,County,State,Population
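
The inner merge returns no rows because no `County` value is spelled identically in both tables. One possible way to confirm this, shown here on a two-row subset of the tables, is an outer merge with `indicator=True`; the names `check` and `n_matched` are made up for this sketch.

```python
import pandas as pd

df1 = pd.DataFrame({
    'County': ['De Witt County', 'Lac qui Parle County'],
    'State': ['IL', 'MN'],
})
df2 = pd.DataFrame({
    'County': ['DeWitt', 'Lac Qui Parle'],
    'Population': [16798, 8067],
})

# how='outer' keeps unmatched rows from both sides; indicator=True adds a
# '_merge' column recording whether each row came from the left table,
# the right table, or both.
check = pd.merge(df1, df2, how='outer', on='County', indicator=True)
print(check[['County', '_merge']])

# No key appears in both tables, which is why the inner merge is empty.
n_matched = int((check['_merge'] == 'both').sum())
print(n_matched)  # 0
```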


In [20]:
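def clean_text(srs):
    # regex=False makes every pattern a literal string; read as a regex,
    # '.' would mean "any character" rather than a period.
    return (srs
    .str.lower()
    .str.replace(' ', '', regex=False)
    .str.replace('&', 'and', regex=False)
    .str.replace('.', '', regex=False)
    .str.replace('county', '', regex=False)
    .str.replace('parish', '', regex=False)
    )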

In [25]:
df1['cleaned_county'] = clean_text(df1['County'])
df1

Unnamed: 0,County,State,cleaned_county
0,De Witt County,IL,dewitt
1,Lac qui Parle County,MN,lacquiparle
2,Lewis and Clark County,MT,lewisandclark
3,St John the Baptist Parish,LA,stjohnthebaptist


In [24]:
df2['cleaned_county'] = clean_text(df2['County'])
df2

Unnamed: 0,County,Population,cleaned_county
0,DeWitt,16798,dewitt
1,Lac Qui Parle,8067,lacquiparle
2,Lewis & Clark,55716,lewisandclark
3,St. John the Baptist,43044,stjohnthebaptist


In [26]:
pd.merge(df1, df2, how='inner', on='cleaned_county')

Unnamed: 0,County_x,State,cleaned_county,County_y,Population
0,De Witt County,IL,dewitt,DeWitt,16798
1,Lac qui Parle County,MN,lacquiparle,Lac Qui Parle,8067
2,Lewis and Clark County,MT,lewisandclark,Lewis & Clark,55716
3,St John the Baptist Parish,LA,stjohnthebaptist,St. John the Baptist,43044


### Extraction

Extraction means pulling useful pieces of information out of text.

In [27]:
with open('log.txt', 'r') as f:
    lines = f.readlines()  # one string per line, newline included
lines

['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /csci4165/Spring04/ HTTP/1.1" 200 2585 "http://singhd1.sfasu.edu/courses/"\n',

 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /csci4165/Notes/dim.html HTTP/1.0" 404 302 "http://abowden.sfasu.edu/csci4165/Notes/session.html"\n',

 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /csci4165/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']


Suppose we want to extract the day, month, and year from each line.

In [32]:
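with open('log.txt', 'r') as f:
    for line in f:
        # The timestamp looks like [26/Jan/2014:10:47:58 -0800]
        lst = line.split('/')
        day = lst[0].split('[')[-1]   # text after the opening '['
        mon = lst[1]
        year = lst[2].split(':')[0]   # drop the ':10:47:58 ...' tail
        
        print(day, mon, year)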

26 Jan 2014
2 Feb 2005
3 Feb 2006
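
The splitting approach works but depends on the exact position of each `/` and `:`. As an alternative sketch that still avoids regex, the standard library's `datetime.strptime` can parse the whole timestamp in one step. The format string below is an assumption based on the sample lines, and `%b` expects English month abbreviations (the default C locale).

```python
from datetime import datetime

line = ('193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] '
        '"GET /csci4165/Notes/dim.html HTTP/1.0" 404 302 '
        '"http://abowden.sfasu.edu/csci4165/Notes/session.html"\n')

# Take the text between the first '[' and the following ']'.
stamp = line.split('[', 1)[1].split(']', 1)[0]

# Day/month-name/year, then the time of day and the UTC offset.
parsed = datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')

print(parsed.day, parsed.month, parsed.year)  # 2 2 2005
```

The result is a `datetime` object, so the hour, minute, and offset are available as well.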


### Regex Basics

A regular expression (regex) is a sequence of characters that specifies a matching pattern. Regexes are usually written to extract some pattern from text.

For example, extract a Social Security number from text:

In [34]:
text = "Our records show that your Social Security Number, 111-23-4567, has been successfully registered. Please let us know if any information is incorrect."

In [35]:
import re

In [38]:
re.findall('[0-9]{3}-[0-9]{2}-[0-9]{4}', text)

['111-23-4567']
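
The pattern above also matches when the digits are part of a longer number. A minimal sketch of one way to tighten it, assuming word boundaries (`\b`) are acceptable for the data at hand; the sample sentence and the names `loose` and `strict` are invented for illustration.

```python
import re

# Hypothetical text: the second number only contains an SSN-shaped substring.
text = "SSN 111-23-4567 is on file; order number 9111-23-45678 is not an SSN."

# Without boundaries, the pattern also matches inside the longer number.
loose = re.findall(r'[0-9]{3}-[0-9]{2}-[0-9]{4}', text)

# \b requires a word boundary, so digits may not touch either end of the match.
ssn_pattern = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
strict = ssn_pattern.findall(text)

print(loose)   # ['111-23-4567', '111-23-4567']
print(strict)  # ['111-23-4567']
```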

You can learn more about it here: 

[Regex documentation](https://docs.python.org/3/howto/regex.html)


[Regex 101](https://regex101.com/)

### Basic Regex Syntax

There are four basic operations with regular expressions. The Order column gives their precedence, with 1 binding most tightly.

| Operation | Order | Syntax Example | Matches | Doesn't Match |
| ---- | --- | ---- | ---- | --- | 
| Or: \| | 	4 | AA\|BAAB | `AA` `BAAB` | every other string |
| Concatenation	| 3	| AABAAB	| `AABAAB`	| every other string |
| Closure: * (zero or more)	| 2	| AB*A	| `AA` `ABBBBBBA`| `AB` `ABABA` |
| Group: () (parenthesis) |	1 |	A(A\|B)AAB | `AAAAB` `ABAAB` | every other string| 
| | | (AB)*A | `A` `ABABABABA` | `AA` `ABBA`|
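
The rows of the table can be checked with `re.fullmatch`, which succeeds only when the whole string fits the pattern. This is a small sketch: the helper `check` is hypothetical, and for the "every other string" rows the non-matching examples are picked arbitrarily.

```python
import re

# (pattern, strings that should match, strings that should not match)
rows = [
    (r'AA|BAAB',    ['AA', 'BAAB'],       ['AAB', 'B']),
    (r'AABAAB',     ['AABAAB'],           ['AAB']),
    (r'AB*A',       ['AA', 'ABBBBBBA'],   ['AB', 'ABABA']),
    (r'A(A|B)AAB',  ['AAAAB', 'ABAAB'],   ['AAAB']),
    (r'(AB)*A',     ['A', 'ABABABABA'],   ['AA', 'ABBA']),
]

def check(pattern, should_match, should_not):
    """Return True if the pattern behaves as the table row claims."""
    ok_pos = all(re.fullmatch(pattern, s) for s in should_match)
    ok_neg = not any(re.fullmatch(pattern, s) for s in should_not)
    return ok_pos and ok_neg

results = {pattern: check(pattern, yes, no) for pattern, yes, no in rows}
print(results)  # every value is True
```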

Task:  Give a regular expression that matches `moon`, `moooon`, etc. Your expression should match any even number of `o`s except zero (i.e. don’t match `mn`).

Task: Using only basic operations, formulate a regex that matches `muun`, `muuuun`, `moon`, `moooon`, etc. Your expression should match any even number of `u`s or `o`s except zero (i.e. don’t match `mn`).

### More Complex Regex

| Operation	| Syntax Example	| Matches	| Doesn’t Match |
| ----- | ---- | ---- | ---- |
| Any Character: `.` (except newline) | 	.U.U.U. | 	`CUMULUS` `JUGULUM`	| `SUCCUBUS` `TUMULTUOUS` |
| Character Class: `[ ]` (match one character in []) |	[A-Za-z][a-z]*	|  `Capitalized` `word`	| `camelCase` `4illegal` |
| Repeated "a" Times: `{a}` | j[aeiou]{3}hn| `jaoehn` `jooohn` |	`jhn` `jaeiouhn` |
| Repeated "from a to b" Times: `{a,b}` (no space after the comma) | j[ou]{1,2}hn	| `john` `juohn` |	`jhn` `jooohn` |
| At Least One: `+`	| jo+hn | 	`john` `joooooohn` |`jhn` `jjohn` |
| Zero or One: `?` |	joh?n	| `jon` `john`	| any other string |

Shorthand for ranges of characters inside a character class:

- [A-Z]: Any uppercase letter
- [a-z]: Any lowercase letter
- [0-9]: Any single digit
- [A-Za-z]: Any uppercase or lowercase letter
- [A-Za-z0-9]: Any uppercase or lowercase letter or single digit
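
To see the quantifiers and character classes side by side, the sketch below runs four of the table's patterns against the same list of strings with `re.fullmatch`. The labels and the `names` list are illustrative, assembled from the examples in the table.

```python
import re

names = ['jhn', 'jon', 'john', 'juohn', 'jooohn', 'jaoehn', 'jaeiouhn']

patterns = {
    'exactly three vowels': r'j[aeiou]{3}hn',
    'one or two of o/u':    r'j[ou]{1,2}hn',
    'at least one o':       r'jo+hn',
    'optional h':           r'joh?n',
}

# For each pattern, keep the names that it matches in full.
matches = {
    label: [n for n in names if re.fullmatch(pattern, n)]
    for label, pattern in patterns.items()
}

for label, found in matches.items():
    print(label, found)
# exactly three vowels ['jooohn', 'jaoehn']
# one or two of o/u ['john', 'juohn']
# at least one o ['john', 'jooohn']
# optional h ['jon', 'john']
```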

### Examples

For each pattern below, which of the listed strings does it match?

1. `.*SPB.*`

A. RASPBERRY<br>
B. SPBOO<br>
C. SUBSPACE<br>
D. SUBSPECIES<br>


2. `[0-9]{3}-[0-9]{2}-[0-9]{4}`

A. 231-41-5121<br>
B. 231415121<br>
C. 57-3571821<br>
D. 573-57-1821<br>

3. `[a-z]+@([a-z]+\.)+(edu|com)`
   
A. horse@pizza.com<br>
B. frank_99@yahoo.com<br>
C. horse@pizza.food.com<br>
D. hug@cs<br>
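
One way to check your answers is to test every candidate with `re.fullmatch`. This sketch assumes "match" means the whole string must fit the pattern; the dictionary names `quiz` and `answers` are made up here.

```python
import re

quiz = {
    r'.*SPB.*': ['RASPBERRY', 'SPBOO', 'SUBSPACE', 'SUBSPECIES'],
    r'[0-9]{3}-[0-9]{2}-[0-9]{4}': ['231-41-5121', '231415121',
                                    '57-3571821', '573-57-1821'],
    r'[a-z]+@([a-z]+\.)+(edu|com)': ['horse@pizza.com', 'frank_99@yahoo.com',
                                     'horse@pizza.food.com', 'hug@cs'],
}

# Keep only the candidates that each pattern matches in full.
answers = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in quiz.items()
}

for pattern, matched in answers.items():
    print(pattern, matched)
```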


### Convenient Regex

| Operation |	Syntax Example |	Matches	| Doesn’t Match |
| ---- | ---- | ---- | ---- |
| built-in character classes: `\w` (letter, digit, or underscore), `\d` (digit), `\s` (whitespace) | \w+ | `Fawef_03` | `this person` | 
|  | \d+ | `231123` | `423 people` |
| | \s+ | `white    space` | `white-space` |
| character class negation: [^] (everything except the given characters)	| [^a-z]+.	| `PEPPERS3982` `17211!↑å`	| `porch` `CLAmS` |
| escape character: \ (match the literal next character)	| cow\\.com	| `cow.com` | 	`cowscom` |
| beginning of line: ^	| ^ark	| `ark two ark o` `ark`	| `dark` |
|end of line: \$ |	ark$ |	`dark` `ark o ark`	| `ark two` |
| lazy version of zero or more : *? |	5.*?5	| `5005` `55` |	`5005005`| 
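
The Matches columns in these tables read each pattern as covering the whole string, whereas `re.search` and `re.findall` look for the pattern anywhere inside it. The sketch below, with variable names chosen for illustration, shows that difference along with the anchors and the escape character.

```python
import re

# \d+ does not cover all of '423 people', but findall still finds the digits.
digits = re.findall(r'\d+', '423 people')

# ^ and $ tie the pattern to the start or the end of the string.
starts = [s for s in ['ark two ark o', 'ark', 'dark'] if re.search(r'^ark', s)]
ends = [s for s in ['dark', 'ark o ark', 'ark two'] if re.search(r'ark$', s)]

# A backslash makes the dot literal; unescaped, it matches any character.
literal_dot = [s for s in ['cow.com', 'cowscom'] if re.search(r'cow\.com', s)]
any_char = [s for s in ['cow.com', 'cowscom'] if re.search(r'cow.com', s)]

print(digits)       # ['423']
print(starts)       # ['ark two ark o', 'ark']
print(ends)         # ['dark', 'ark o ark']
print(literal_dot)  # ['cow.com']
print(any_char)     # ['cow.com', 'cowscom']
```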

In [48]:
### Greediness

text = "This is a <div>example</div> of greediness <div>in</div> regular expressions"

In [53]:
re.findall("<div>.*</div>", text)

['<div>example</div> of greediness <div>in</div>']

In [54]:
re.findall("<div>.*?</div>", text)

['<div>example</div>', '<div>in</div>']
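
The greedy `.*` takes as many characters as it can, so the first pattern runs from the first `<div>` to the last `</div>`; the lazy `.*?` stops at the earliest closing tag. A possible alternative to laziness, sketched below, is a negated character class that cannot cross a `<`. The second call uses a capture group (covered in the Capture Groups section) to return only the inner text.

```python
import re

text = "This is a <div>example</div> of greediness <div>in</div> regular expressions"

# [^<]* cannot cross a '<', so each match ends at the first closing tag.
tags = re.findall(r'<div>[^<]*</div>', text)

# With a capture group, findall returns only the part inside the parentheses.
inner = re.findall(r'<div>(.*?)</div>', text)

print(tags)   # ['<div>example</div>', '<div>in</div>']
print(inner)  # ['example', 'in']
```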

Task: Extract the day, month, and year from the following text.

In [56]:
text = "['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /csci4165/Spring04/ HTTP/1.1\" 200 2585 \"http://singhd1.sfasu.edu/courses/\"\n'"

In [70]:
### Problem : Remove the text surrounded by <> in the following text. In other words, replace the HTML tags with empty string including the brackets <>. 

srs = pd.Series(["<div><td valign='top'>Moo</td></div>", "<a href='http://sfasu.edu'>Link</a>", "<b>Bold text</b>"])
srs

0    <div><td valign='top'>Moo</td></div>
1     <a href='http://sfasu.edu'>Link</a>
2                        <b>Bold text</b>
dtype: object
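
One possible solution, offered as a sketch rather than the only answer: replace every run that starts with `<`, continues with characters other than `>`, and ends with `>`. It assumes tags never contain a literal `>` inside an attribute value; the name `cleaned` is illustrative.

```python
import pandas as pd

srs = pd.Series(["<div><td valign='top'>Moo</td></div>",
                 "<a href='http://sfasu.edu'>Link</a>",
                 "<b>Bold text</b>"])

# <[^>]+> matches '<', then one or more characters that are not '>', then '>'.
cleaned = srs.str.replace(r'<[^>]+>', '', regex=True)

print(cleaned.tolist())  # ['Moo', 'Link', 'Bold text']
```

A pattern like `<.*>` would be greedy and remove everything from the first `<` to the last `>`, including the text between the tags.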

### Capture Groups

Earlier we used parentheses ( ) to specify the highest order of operation in regular expressions. Parentheses also have a second role: they define capture groups. Each capture group is a sub-pattern whose matched substring is returned separately, so a single regex can pull several pieces out of the text at once.

In [71]:
text = "Observations: 03:04:53 - Horse awakens. 03:05:14 - Horse goes back to sleep."
text

'Observations: 03:04:53 - Horse awakens. 03:05:14 - Horse goes back to sleep.'

Say we want to capture all occurrences of time data (hour, minute, and second) as separate entities.

In [73]:
re.findall(r'(\d{2}):(\d{2}):(\d{2})', text)

[('03', '04', '53'), ('03', '05', '14')]
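
Positional tuples can be hard to read once a pattern has several groups. As a sketch of one alternative, Python's `(?P<name>...)` syntax names each group so the pieces can be retrieved by name; the group names `hour`, `minute`, and `second` are chosen here for illustration.

```python
import re

text = "Observations: 03:04:53 - Horse awakens. 03:05:14 - Horse goes back to sleep."

pattern = re.compile(r'(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})')

# finditer yields one match object per occurrence; groupdict maps names to text.
times = [m.groupdict() for m in pattern.finditer(text)]
print(times[0])  # {'hour': '03', 'minute': '04', 'second': '53'}

# group(0) is the whole match; a named group can be looked up directly.
first = pattern.search(text)
print(first.group(0), first.group('minute'))  # 03:04:53 04
```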

Task: Extract the day, month and year from the text

In [74]:
text = "['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] \"GET /csci4165/Spring04/ HTTP/1.1\" 200 2585 \"http://singhd1.sfasu.edu/courses/\"\n'"
text

'[\'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /csci4165/Spring04/ HTTP/1.1" 200 2585 "http://singhd1.sfasu.edu/courses/"\n\''

In [83]:
re.findall(r'(\d{1,2})\/([A-Z][a-z]{2})\/(\d{4})', text)

[('26', 'Jan', '2014')]
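
The same capture groups carry over to pandas. The sketch below applies `Series.str.extract` to the three log lines shown earlier, using named groups so that each group becomes a column; the Series name `log` and the column names are assumptions made for this example.

```python
import pandas as pd

log = pd.Series([
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /csci4165/Spring04/ HTTP/1.1" 200 2585 "http://singhd1.sfasu.edu/courses/"\n',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /csci4165/Notes/dim.html HTTP/1.0" 404 302 "http://abowden.sfasu.edu/csci4165/Notes/session.html"\n',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /csci4165/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n',
])

# Each named group becomes one column; only the first match per row is used.
pattern = r'(?P<day>\d{1,2})/(?P<month>[A-Z][a-z]{2})/(?P<year>\d{4})'
dates = log.str.extract(pattern)

print(dates)
```

The extracted values are strings, so convert them (for example with `astype(int)` on `day` and `year`) before doing arithmetic.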