# Text Processing: Regular Expressions

Regular expressions (or _regex_) are a fundamental tool for text processing.
They provide a robust and flexible way to define patterns of characters within text documents.  
Common uses include pattern matching and term extraction.
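
As a minimal illustration of pattern matching and term extraction, the sketch below applies Python's `re` module to a made-up sentence. The sample text and variable names are assumptions for illustration and are not part of the exercise data.

```python
import re

# Hypothetical sample text, not taken from the log data in this notebook.
sample_text = "Lab 1 opened at 10:19, lab 2 opened at 10:25."

# Pattern matching: does the text contain a time of the form HH:MM?
has_time = re.search(r'\d{2}:\d{2}', sample_text) is not None

# Term extraction: pull out every HH:MM occurrence.
times = re.findall(r'\d{2}:\d{2}', sample_text)

print(has_time)
print(times)
```

`re.search` stops at the first match, while `re.findall` returns every non-overlapping match as a list of strings.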

Below is some log information from the JupyterHub environment.
The user ID has been changed to protect the innocent.
The task is to extract the following fields into a data frame, using **only** regular expressions to search and extract.
 * User ID
 * Time Stamp
 * Notebook Name
 * Day of the Course
 * Course ID


**Resources:**
 * [Python Regular Expressions](https://docs.python.org/3/library/re.html)
 * https://www.debuggex.com/cheatsheet/regex/python


In [1]:
import pandas as pd
import numpy as np

# Python's built-in regular expression library
import re


## Data 


In [2]:
jupyter_log_data = """1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
1081,05:21:23.518 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day4/labs/L1_RLibraries.ipynb?content=0&_=1537784241736 to http://127.0.0.1:35215,2018-09-24T10:21:23.519Z
1080,05:22:25.808 - debug: [ConfigProxy] PROXY WEB /user/gjs.0002/api/contents/PSDS2120OP2-4_gjs.0002/Day3/labs/L1_RLibraries.ipynb?content=0&_=1537784241737 to http://127.0.0.1:35215,2018-09-24T10:22:25.809Z
1079,05:25:03.504 - debug: [ConfigProxy] PROXY WEB /user/gjs.0002/api/contents/PSDS2120OP2-4_gjs.0002/Day2/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580735 to http://127.0.0.1:35215,2018-09-24T10:25:03.505Z
1078,05:27:03.552 - debug: [ConfigProxy] PROXY WEB /user/gjs.0003/api/contents/PSDS2120OP2-4_gjs.0003/Day1/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580736 to http://127.0.0.1:35215,2018-09-24T10:27:03.553Z
1077,05:29:03.511 - debug: [ConfigProxy] PROXY WEB /user/gjs.0003/api/contents/PSDS2120OP2-4_gjs.0003/Day1/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580737 to http://127.0.0.1:35215,2018-09-24T10:29:03.511Z
1076,05:49:08.482 - debug: [ConfigProxy] PROXY WEB /user/gjs.0004/api/contents/PSDS2120OP2-4_gjs.0004/Day2/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580738 to http://127.0.0.1:35215,2018-09-24T10:49:08.483Z
1075,05:51:03.965 - debug: [ConfigProxy] PROXY WEB /user/gjs.0004/api/contents/PSDS2120OP2-4_gjs.0004/Day3/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580739 to http://127.0.0.1:35215,2018-09-24T10:51:03.966Z
1074,06:01:03.633 - debug: [ConfigProxy] PROXY WEB /user/gjs.0005/api/contents/PSDS2120OP2-4_gjs.0005/Day4/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580740 to http://127.0.0.1:35215,2018-09-24T11:01:03.633Z
1073,06:03:03.597 - debug: [ConfigProxy] PROXY WEB /user/gjs.0005/api/contents/PSDS2120OP2-4_gjs.0005/Day5/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580741 to http://127.0.0.1:35215,2018-09-24T11:03:03.597Z"""

In [3]:
print(jupyter_log_data)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
1081,05:21:23.518 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day4/labs/L1_RLibraries.ipynb?content=0&_=1537784241736 to http://127.0.0.1:35215,2018-09-24T10:21:23.519Z
1080,05:22:25.808 - debug: [ConfigProxy] PROXY WEB /user/gjs.0002/api/contents/PSDS2120OP2-4_gjs.0002/Day3/labs/L1_RLibraries.ipynb?content=0&_=1537784241737 to http://127.0.0.1:35215,2018-09-24T10:22:25.809Z
1079,05:25:03.504 - debug: [ConfigProxy] PROXY WEB /user/gjs.0002/api/contents/PSDS2120OP2-4_gjs.0002/Day2/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580735 to http://127.0.0.1:35215,2018-09-24T10:25:03.505Z
1078,05:27:03.552 - debug: [ConfigProxy] PROXY WEB /user/gjs.0003/api/contents/PSDS2120OP2-4_gjs.0003/Day1/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580

## Task 1: Split the data into lines!


In [4]:
# Add code between comments:
# -----------------------------

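# This pattern captures only text that is followed by '\n'; the final log
# line has no trailing newline, so it is not returned.
line_break_pattern = r'(.*)\n'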

lines = re.findall(line_break_pattern, jupyter_log_data)

# -----------------------------
for i,l in enumerate(lines):
    print("{}: {}".format(i,l))

0: 1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
1: 1081,05:21:23.518 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day4/labs/L1_RLibraries.ipynb?content=0&_=1537784241736 to http://127.0.0.1:35215,2018-09-24T10:21:23.519Z
2: 1080,05:22:25.808 - debug: [ConfigProxy] PROXY WEB /user/gjs.0002/api/contents/PSDS2120OP2-4_gjs.0002/Day3/labs/L1_RLibraries.ipynb?content=0&_=1537784241737 to http://127.0.0.1:35215,2018-09-24T10:22:25.809Z
3: 1079,05:25:03.504 - debug: [ConfigProxy] PROXY WEB /user/gjs.0002/api/contents/PSDS2120OP2-4_gjs.0002/Day2/labs/L2_RBasicDataTypes.ipynb?content=0&_=1537784580735 to http://127.0.0.1:35215,2018-09-24T10:25:03.505Z
4: 1078,05:27:03.552 - debug: [ConfigProxy] PROXY WEB /user/gjs.0003/api/contents/PSDS2120OP2-4_gjs.0003/Day1/labs/L2_RBasicDataTypes.ipynb?content
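
If every line is wanted, including a final line with no trailing newline, one option is to match runs of non-newline characters instead. The following is a minimal sketch on a made-up string; `sample_log` is an assumption standing in for `jupyter_log_data`.

```python
import re

# Hypothetical three-line sample; note there is no trailing newline.
sample_log = "line one\nline two\nline three"

# r'(.*)\n' only captures text that is followed by a newline.
with_newline = re.findall(r'(.*)\n', sample_log)

# r'[^\n]+' captures every non-empty run of non-newline characters.
all_lines = re.findall(r'[^\n]+', sample_log)

print(with_newline)
print(all_lines)
```

The first list omits `line three`, while the second includes it.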

## Task 2: Develop some extract patterns

In [5]:
test_line = lines[0]
print(test_line)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z


### 2.A: Develop an extract for User ID


In [8]:
# Add code between comments:
# -----------------------------

pattern = r'\/user\/(.+)\/api' 

test = re.findall(pattern,test_line)


# -----------------------------
print(test_line)
print(test)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
['gjs.0001']
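
The `(.+)` group is greedy, which works here because `/api` occurs only once in the line. As a hedged sketch of a stricter variant, the negated character class `[^/]+` stops at the next slash. The sample path below is invented to show the difference.

```python
import re

# Hypothetical path in which '/api' appears twice.
sample_path = "/user/gjs.0001/api/contents/other/api/file"

# Greedy '.+' runs to the LAST '/api' in the string.
greedy = re.findall(r'/user/(.+)/api', sample_path)

# '[^/]+' cannot cross a slash, so it stops at the first path segment.
bounded = re.findall(r'/user/([^/]+)/api', sample_path)

print(greedy)
print(bounded)
```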


### 2.B: Develop an extract for Date/Time stamp

E.g., 2018-09-24T10:19:23

In [9]:
# Add code between comments:
# -----------------------------

pattern = r',(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})' 

test = re.findall(pattern,test_line)


# -----------------------------
print(test_line)
print(test)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
['2018-09-24T10:19:23']
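
Each log line carries two time-like strings: the local time near the start and the ISO time stamp at the end. The sketch below, on a shortened copy of the test line (the shortening is an assumption for readability), shows why the pattern includes the date portion rather than matching `HH:MM:SS` alone.

```python
import re

# Shortened copy of the test line: the middle of the path is left out.
sample_line = ("1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api "
               "to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z")

# A loose pattern matches both the local time and the ISO time.
loose = re.findall(r'\d{2}:\d{2}:\d{2}', sample_line)

# Requiring the date and the 'T' separator isolates the ISO time stamp.
strict = re.findall(r',(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})', sample_line)

print(loose)
print(strict)
```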


### 2.C: Develop an extract for Notebook Name


In [10]:
# Add code between comments:
# -----------------------------

pattern = r'PROXY WEB .*\/([\w]*\.ipynb)' 

test = re.findall(pattern,test_line)


# -----------------------------
print(test_line)
print(test)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
['L1_RLibraries.ipynb']
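
The `\w` class covers letters, digits, and underscores only, so a notebook name containing a hyphen would not match. One possible variant, sketched here with an invented file name, accepts any character except a slash.

```python
import re

# Hypothetical log fragment with a hyphenated notebook name.
sample_fragment = "PROXY WEB /labs/L1-Intro.ipynb?content=0"

# '\w' does not match '-', so this pattern finds nothing.
word_only = re.findall(r'PROXY WEB .*/(\w*\.ipynb)', sample_fragment)

# '[^/]+' accepts any non-slash characters before '.ipynb'.
any_name = re.findall(r'/([^/]+\.ipynb)', sample_fragment)

print(word_only)
print(any_name)
```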


### 2.D: Develop an extract for Day of the Course
E.g., Day1, Day2, etc.


In [11]:
# Add code between comments:
# -----------------------------


pattern = r'Day(\d)' 

test = re.findall(pattern,test_line)


# -----------------------------
print(test_line)
print(test)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
['5']
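
The `\d` in this pattern matches exactly one digit, which is enough for Day1 through Day5. As a small sketch under the assumption that a course could run past nine days, `\d+` captures the whole number, and the result can be converted to an integer. The path segment is made up.

```python
import re

# Hypothetical path segment with a two-digit day.
sample_segment = "/Day12/labs/"

single_digit = re.findall(r'Day(\d)', sample_segment)
multi_digit = re.findall(r'Day(\d+)', sample_segment)

# findall returns strings; convert when a number is needed.
day_numbers = [int(d) for d in multi_digit]

print(single_digit)
print(multi_digit)
print(day_numbers)
```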


### 2.E: Develop an extract for Course ID

In [12]:
# Add code between comments:
# -----------------------------

pattern = r'/api/contents/([a-zA-Z0-9\-]*)_'

test = re.findall(pattern,test_line)


# -----------------------------
print(test_line)
print(test)

1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB /user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/L1_RLibraries.ipynb?content=0&_=1537784241735 to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z
['PSDS2120OP2-4']
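
The five extracts can also be combined into one compiled pattern with named groups. This is only a sketch: the group names, the use of `re.VERBOSE`, and the slightly stricter sub-patterns (`[^/]+`, `\d+`) are assumptions for illustration, not the notebook's required solution.

```python
import re

test_line = ("1082,05:19:23.433 - debug: [ConfigProxy] PROXY WEB "
             "/user/gjs.0001/api/contents/PSDS2120OP2-4_gjs.0001/Day5/labs/"
             "L1_RLibraries.ipynb?content=0&_=1537784241735 "
             "to http://127.0.0.1:35215,2018-09-24T10:19:23.433Z")

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
log_pattern = re.compile(r"""
    /user/(?P<uid>[^/]+)/api/contents/                 # user ID
    (?P<cid>[a-zA-Z0-9-]+)_[^/]+/                      # course ID, then repeated user ID
    Day(?P<day>\d+)/                                   # day of the course
    .*/(?P<nb>\w+\.ipynb)                              # notebook name
    .*,(?P<dttm>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})   # time stamp
""", re.VERBOSE)

# groupdict() maps each group name to the text it captured.
fields = log_pattern.search(test_line).groupdict()
print(fields)
```

A single pattern fails as a whole if any one field is missing, so separate patterns remain easier to debug while developing each extract.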


## Task 3: Construct a Data Frame

We can start this step by loading the lines into a data frame,
then use the Pandas Series `str.extract` method to pull each field into its own column.
 * https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
 
Create the following columns:
 * `uid :` User ID
 * `dttm:` Time Stamp
 * `nb  :` Notebook Name
 * `day :` Day of the Course
 * `cid :` Course ID

In [13]:
df = pd.DataFrame({'logline':lines})

# Add code between comments:
# -----------------------------

# uid : User ID
df['uid'] = df.logline.str.extract( r'\/user\/(.+)\/api' 
                                , expand=False)

# dttm: Time Stamp
df['dttm'] = df.logline.str.extract( r',(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'
                                , expand=False)

# nb : Notebook Name
df['nb'] = df.logline.str.extract(r'PROXY WEB .*\/([\w]*\.ipynb)', expand=False)

# day : Day of the Course
df['day'] = df.logline.str.extract(r'Day(\d)', expand=False)


# cid : Course ID
df['cid'] = df.logline.str.extract(r'/api/contents/([a-zA-Z0-9\-]*)\_', expand=False)


# -----------------------------
df.head(10)

Unnamed: 0,logline,uid,dttm,nb,day,cid
0,"1082,05:19:23.433 - debug: [ConfigProxy] PROXY...",gjs.0001,2018-09-24T10:19:23,L1_RLibraries.ipynb,5,PSDS2120OP2-4
1,"1081,05:21:23.518 - debug: [ConfigProxy] PROXY...",gjs.0001,2018-09-24T10:21:23,L1_RLibraries.ipynb,4,PSDS2120OP2-4
2,"1080,05:22:25.808 - debug: [ConfigProxy] PROXY...",gjs.0002,2018-09-24T10:22:25,L1_RLibraries.ipynb,3,PSDS2120OP2-4
3,"1079,05:25:03.504 - debug: [ConfigProxy] PROXY...",gjs.0002,2018-09-24T10:25:03,L2_RBasicDataTypes.ipynb,2,PSDS2120OP2-4
4,"1078,05:27:03.552 - debug: [ConfigProxy] PROXY...",gjs.0003,2018-09-24T10:27:03,L2_RBasicDataTypes.ipynb,1,PSDS2120OP2-4
5,"1077,05:29:03.511 - debug: [ConfigProxy] PROXY...",gjs.0003,2018-09-24T10:29:03,L2_RBasicDataTypes.ipynb,1,PSDS2120OP2-4
6,"1076,05:49:08.482 - debug: [ConfigProxy] PROXY...",gjs.0004,2018-09-24T10:49:08,L2_RBasicDataTypes.ipynb,2,PSDS2120OP2-4
7,"1075,05:51:03.965 - debug: [ConfigProxy] PROXY...",gjs.0004,2018-09-24T10:51:03,L2_RBasicDataTypes.ipynb,3,PSDS2120OP2-4
8,"1074,06:01:03.633 - debug: [ConfigProxy] PROXY...",gjs.0005,2018-09-24T11:01:03,L2_RBasicDataTypes.ipynb,4,PSDS2120OP2-4
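
When a pattern does not match a row, `str.extract` leaves `NaN` in that row, so counting missing values is a quick sanity check on the extraction. The sketch below uses an invented two-row frame; `sample_df` and its contents are assumptions, not the log data above.

```python
import pandas as pd

# Hypothetical frame: the second line has no '/user/.../api' segment.
sample_df = pd.DataFrame({'logline': [
    "PROXY WEB /user/gjs.0001/api/contents/x",
    "PROXY WEB /hub/login",
]})

sample_df['uid'] = sample_df['logline'].str.extract(r'/user/([^/]+)/api', expand=False)

# Count the rows in each column where nothing was extracted.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

Running the same check on `df` would show whether every pattern matched every log line.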


# Save your notebook, then `File > Close and Halt`

## Optional: Decompose the Date/Time

Change the extracted Date/Time into separate parts: 
 1. Time of day into a column, **`tod`**  
 2. Date into a separate column, **`date`**


In [14]:
# Add code between comments:
# -----------------------------

df['dttm'] = pd.to_datetime(df['dttm'])
# strftime zero-pads every field, e.g. 11:01 rather than 11:1
df['tod'] = df['dttm'].dt.strftime('%H:%M')
df['date'] = df['dttm'].dt.strftime('%Y-%m-%d')


# -----------------------------
df.head(10)

Unnamed: 0,logline,uid,dttm,nb,day,cid,tod,date
0,"1082,05:19:23.433 - debug: [ConfigProxy] PROXY...",gjs.0001,2018-09-24 10:19:23,L1_RLibraries.ipynb,5,PSDS2120OP2-4,10:19,2018-09-24
1,"1081,05:21:23.518 - debug: [ConfigProxy] PROXY...",gjs.0001,2018-09-24 10:21:23,L1_RLibraries.ipynb,4,PSDS2120OP2-4,10:21,2018-09-24
2,"1080,05:22:25.808 - debug: [ConfigProxy] PROXY...",gjs.0002,2018-09-24 10:22:25,L1_RLibraries.ipynb,3,PSDS2120OP2-4,10:22,2018-09-24
3,"1079,05:25:03.504 - debug: [ConfigProxy] PROXY...",gjs.0002,2018-09-24 10:25:03,L2_RBasicDataTypes.ipynb,2,PSDS2120OP2-4,10:25,2018-09-24
4,"1078,05:27:03.552 - debug: [ConfigProxy] PROXY...",gjs.0003,2018-09-24 10:27:03,L2_RBasicDataTypes.ipynb,1,PSDS2120OP2-4,10:27,2018-09-24
5,"1077,05:29:03.511 - debug: [ConfigProxy] PROXY...",gjs.0003,2018-09-24 10:29:03,L2_RBasicDataTypes.ipynb,1,PSDS2120OP2-4,10:29,2018-09-24
6,"1076,05:49:08.482 - debug: [ConfigProxy] PROXY...",gjs.0004,2018-09-24 10:49:08,L2_RBasicDataTypes.ipynb,2,PSDS2120OP2-4,10:49,2018-09-24
7,"1075,05:51:03.965 - debug: [ConfigProxy] PROXY...",gjs.0004,2018-09-24 10:51:03,L2_RBasicDataTypes.ipynb,3,PSDS2120OP2-4,10:51,2018-09-24
8,"1074,06:01:03.633 - debug: [ConfigProxy] PROXY...",gjs.0005,2018-09-24 11:01:03,L2_RBasicDataTypes.ipynb,4,PSDS2120OP2-4,11:01,2018-09-24
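
To stay with regular expressions only, the same split can be sketched with `str.extract` and two named groups applied to the time stamp strings. The Series below is a hypothetical stand-in for the `dttm` column as it looks before any datetime conversion; the group names are assumptions chosen to match the requested columns.

```python
import pandas as pd

# Hypothetical time stamp strings in the same format as the extracted `dttm`.
stamps = pd.Series(["2018-09-24T10:19:23", "2018-09-24T11:01:03"])

# With named groups, str.extract returns a DataFrame whose columns
# take the group names.
parts = stamps.str.extract(r'(?P<date>\d{4}-\d{2}-\d{2})T(?P<tod>\d{2}:\d{2})')
print(parts)
```

Because the groups copy the matched text verbatim, zero padding in the hours, minutes, and month is preserved.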


# Save your notebook, then `File > Close and Halt`