# Data Aggregation

Data aggregation is the process of gathering information and presenting it in summary form, typically to prepare the data for statistical analysis.
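
As a minimal illustration of the idea, the sketch below summarizes a handful of (class meeting time, exam day) rows by counting exams per day. The rows are copied from the exam schedule used later in this section; the variable names are only illustrative.

```python
from collections import Counter

# A few (class meeting time, exam day) rows from the Spring 2019 exam schedule
rows = [
    ('08:00 MWF', 'Friday'),
    ('08:00 TThF', 'Friday'),
    ('10:00 MW', 'Thursday'),
    ('11:00 MWF', 'Wednesday'),
]

# Aggregate: collapse the individual rows into one count per exam day
exams_per_day = Counter(day for _, day in rows)

for day, count in exams_per_day.items():
    print('%10s : %d' % (day, count))
```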

# Exam Scheduler

Task: integrate the [exam schedule](http://registrar.emory.edu/faculty-staff/exam-schedule/spring-2019.html) with the [course atlas](http://atlas.college.emory.edu/class-schedules/spring-2019.php).

## HTML Parsing

Retrieve the HTML source from the exam schedule page:

In [7]:
import requests

url = 'http://registrar.emory.edu/faculty-staff/exam-schedule/spring-2019.html'
r = requests.get(url)
print(r.text[:82])  # print only the first line

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">


Find the table containing the exam schedule information from the source:

<img src="res/exam-schedule-spring-2019.png">

```html
<table cellpadding="0" cellspacing="0" class="table table-striped">
<thead>
<tr>
<td>Class Meeting Time</td>
<td>Exam Day</td>
<td>Exam Date</td>
<td>Exam Time</td>
</tr>
</thead>
<tbody>
<tr>
<td>08:00 MWF</td>
<td>Friday</td>
<td>3-May</td>
<td>11:30 A.M - 2:00 P.M</td>
</tr>
<tr>
<td>08:00 TThF</td>
<td>Friday</td>
<td>3-May</td>
<td>3:00 P.M - 5:30 P.M</td>
</tr>
```

Retrieve the exam schedule information from the table:

In [9]:
!pip install beautifulsoup4
from bs4 import BeautifulSoup

html = BeautifulSoup(r.text, 'html.parser')
tbody = html.find('tbody')
schedule = []

for tr in tbody.find_all('tr'):
    tds = tr.find_all('td')
    class_time = tds[0].string.strip()
    exam_day   = tds[1].string.strip()
    exam_date  = tds[2].string.strip()
    exam_time  = tds[3].string.strip()
    schedule.append([class_time, exam_day, exam_date, exam_time])

print(schedule[0])
print(schedule[1])

['08:00 MWF', 'Friday', '3-May', '11:30 A.M - 2:00 P.M']
['08:00 TThF', 'Friday', '3-May', '3:00 P.M - 5:30 P.M']
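
If `beautifulsoup4` is not available, roughly the same rows can be collected with the standard library's `html.parser`. This is a swapped-in alternative to the BeautifulSoup approach above, not a replacement for it; the `TableRows` class and the inline `sample` markup are hypothetical, and unlike `html.find('tbody')` this sketch would also pick up header rows if the markup had any.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])        # start a new row
        elif tag == 'td' and self.rows:
            self._in_td = True
            self.rows[-1].append('')    # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data   # accumulate text inside the cell


# Inline sample mirroring the first two rows of the table above
sample = '''
<table><tbody>
<tr><td>08:00 MWF</td><td>Friday</td><td>3-May</td><td>11:30 A.M - 2:00 P.M</td></tr>
<tr><td>08:00 TThF</td><td>Friday</td><td>3-May</td><td>3:00 P.M - 5:30 P.M</td></tr>
</tbody></table>
'''

parser = TableRows()
parser.feed(sample)
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows[0])
print(rows[1])
```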


## Regular Expressions

Split each class meeting time into (hour, minute, days):

In [2]:
import re

TIME_DAYS = re.compile(r'(\d{1,2}):(\d\d)\s+([A-Za-z]+)')  # raw string keeps the backslashes intact

m = TIME_DAYS.match('8:00 MW')
print('Hour: %2s, Minute: %2s, Day(s): %s' % (m.group(1), m.group(2), m.group(3)))

m = TIME_DAYS.match('12:30 TThF')
print('Hour: %2s, Minute: %2s, Day(s): %s' % (m.group(1), m.group(2), m.group(3)))

Hour:  8, Minute: 00, Day(s): MW
Hour: 12, Minute: 30, Day(s): TThF


If the input string does not match the expression, `match` returns `None`, so calling a method such as `groups()` on the result raises an `AttributeError`:

In [3]:
m = TIME_DAYS.match('Math*')
print(m)
print(m.groups())

None


AttributeError: 'NoneType' object has no attribute 'groups'
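
One way to avoid this error is to check the result before using it. The sketch below wraps the pattern in a hypothetical helper, `split_time_days`, which returns `None` for input that does not match instead of raising.

```python
import re
from typing import Optional, Tuple

TIME_DAYS = re.compile(r'(\d{1,2}):(\d\d)\s+([A-Za-z]+)')


def split_time_days(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (hour, minute, days), or None when the text does not match."""
    m = TIME_DAYS.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


print(split_time_days('12:30 TThF'))
print(split_time_days('Math*'))
```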

### Exercise

Write a regular expression that handles various ways to indicate time:

```python
['08:00', '12:30', '2:30pm', '2:30 pm', '2:30PM', '2:30P.M', '2:30P.M.', '2:30 PM.']
```

In [5]:
times = ['08:00', '12:30', '2:30pm', '2:30 pm', '2:30PM', '2:30P.M', '2:30P.M.', '2:30 PM.']
TIME = re.compile(r'(\d{1,2}):(\d\d)\s*([AaPp]\.?\s*[Mm]\.?)?')  # raw string keeps the backslashes intact

for time in times:
    m = TIME.match(time)
    hour   = m.group(1)
    minute = m.group(2)
    period = m.group(3)
    print('%10s : (%2s, %2s, %s)' % (time, hour, minute, period))

     08:00 : (08, 00, None)
     12:30 : (12, 30, None)
    2:30pm : ( 2, 30, pm)
   2:30 pm : ( 2, 30, pm)
    2:30PM : ( 2, 30, PM)
   2:30P.M : ( 2, 30, P.M)
  2:30P.M. : ( 2, 30, P.M.)
  2:30 PM. : ( 2, 30, PM.)
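
Note that `match` only anchors the pattern at the start of the string, so input with trailing text still matches. If the whole string should be a time expression, `fullmatch` is one option; the helper name `is_time` below is only illustrative.

```python
import re

TIME = re.compile(r'(\d{1,2}):(\d\d)\s*([AaPp]\.?\s*[Mm]\.?)?')


def is_time(text: str) -> bool:
    """True only when the entire string is a time expression."""
    return TIME.fullmatch(text) is not None


for text in ['2:30 PM.', '2:30pm sharp']:
    # match() accepts any string that merely starts with a time
    starts_with_time = TIME.match(text) is not None
    print('%14s  match: %-5s  fullmatch: %s' % (text, starts_with_time, is_time(text)))
```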


## Normalization

Convert the input data into a uniform format.

### Exercise

Write a function that converts the above matched results to military time (e.g., `"3:30 P.M"` &rarr; `1530`):

```python
def norm_time(hour: str, minute: str, period: str) -> int:
    # TODO: to be updated
    return 0
```

In [4]:
from typing import Optional

def norm_time(hour: str, minute: str, period: Optional[str]=None) -> int:
    h = int(hour)
    m = int(minute)

    if period:
        is_pm = period[0].upper() == 'P'
        if is_pm and h != 12:
            h += 12  # 2:30 PM -> 1430, while 12:30 PM stays 1230
        elif not is_pm and h == 12:
            h = 0    # 12:30 AM -> 30

    return h * 100 + m

In [7]:
for time in times:
    m = TIME.match(time)
    n = norm_time(m.group(1), m.group(2), m.group(3))
    print('%10s : %4d' % (time, n))

     08:00 :  800
     12:30 : 1230
    2:30pm : 1430
   2:30 pm : 1430
    2:30PM : 1430
   2:30P.M : 1430
  2:30P.M. : 1430
  2:30 PM. : 1430
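
The same idea can be applied to the exam time ranges in the schedule, such as `'11:30 A.M - 2:00 P.M'`. The sketch below is self-contained and makes a few assumptions: the range is separated by a single hyphen, and `to_24h` and `norm_range` are hypothetical helper names. It treats the 12 o'clock hour as a special case so that 12 PM stays 12 and 12 AM becomes 0.

```python
import re
from typing import Optional, Tuple

TIME = re.compile(r'(\d{1,2}):(\d\d)\s*([AaPp]\.?\s*[Mm]\.?)?')


def to_24h(hour: str, minute: str, period: Optional[str] = None) -> int:
    """Convert a matched (hour, minute, period) triple to an integer such as 1430."""
    h, m = int(hour), int(minute)
    if period:
        is_pm = period[0].upper() == 'P'
        if is_pm and h != 12:
            h += 12
        elif not is_pm and h == 12:
            h = 0
    return h * 100 + m


def norm_range(text: str) -> Tuple[int, int]:
    """Convert 'START - END' into a pair of 24-hour integers."""
    parts = text.split('-')
    if len(parts) != 2:
        raise ValueError('expected exactly one hyphen: %r' % text)
    matches = [TIME.match(p.strip()) for p in parts]
    if None in matches:
        raise ValueError('not a time range: %r' % text)
    start, end = matches
    return to_24h(*start.groups()), to_24h(*end.groups())


print(norm_range('11:30 A.M - 2:00 P.M'))
print(norm_range('3:00 P.M - 5:30 P.M'))
```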


### Exercise

Write a function that converts days into a binary form, then converts the binary form into an integer (e.g., `"MWF"` &rarr; `"10101"` &rarr; `21`):

```python
def norm_days(days: str) -> int:
    # TODO: to be updated
    return 0
```

In [8]:
def norm_days(days: str) -> int:
    DAYS = [('M', 0), ('TU', 1), ('W', 2), ('TH', 3), ('F', 4)]
    days = days.upper()
    b = ['0'] * 5

    for d, i in DAYS:
        if d in days:
            b[i] = '1'
            days = days.replace(d, '')
    
    if 'T' in days:
        b[1] = '1'
        days = days.replace('T', '')

    return int(''.join(b), 2)

In [9]:
days = ['MWF', 'TuTh', 'MTuWThF', 'TThF', 'MWFf']

for day in days:
    n = norm_days(day)
    print('%7s %5s %2d' % (day, bin(n)[2:], n))

    MWF 10101 21
   TuTh  1010 10
MTuWThF 11111 31
   TThF  1011 11
   MWFf 10101 21
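
One benefit of the integer form is that day sets can be compared with bitwise operations. The sketch below assumes the same bit layout as above (Monday is the highest of five bits, Friday the lowest); `decode_days` and `share_a_day` are hypothetical helpers.

```python
from typing import List

DAY_NAMES = ['M', 'Tu', 'W', 'Th', 'F']


def decode_days(n: int) -> List[str]:
    """Turn the integer back into day names, e.g. 21 (10101) -> M, W, F."""
    # Position i in DAY_NAMES corresponds to bit (4 - i): Monday is 16, Friday is 1
    return [name for i, name in enumerate(DAY_NAMES) if n & (1 << (4 - i))]


def share_a_day(a: int, b: int) -> bool:
    """True when two day sets have at least one day in common."""
    return (a & b) != 0


print(decode_days(21))       # MWF
print(decode_days(10))       # TuTh
print(share_a_day(21, 10))   # MWF vs. TuTh
print(share_a_day(21, 11))   # MWF vs. TThF
```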


Write a function that takes the exam schedule URL and returns a dictionary where the key is the normalized class meeting time and the value is its exam schedule information.

In [10]:
from typing import Dict, Tuple

def extract_exam_schedule(url) -> Dict[Tuple[int, str], Tuple[str, str, str]]:
    r = requests.get(url)
    html = BeautifulSoup(r.text, 'html.parser')
    tbody = html.find('tbody')
    schedule = {}

    for tr in tbody.find_all('tr'):
        tds = tr.find_all('td')
        class_time = tds[0].string.strip()
        m = TIME_DAYS.match(class_time)
        if m:
            time = norm_time(m.group(1), m.group(2))
            days = m.group(3)
            key  = (time, days)
            exam_day  = tds[1].string.strip()
            exam_date = tds[2].string.strip()
            exam_time = tds[3].string.strip()
            schedule[key] = (exam_day, exam_date, exam_time)

    return schedule

In [11]:
exam_schedule = extract_exam_schedule(url)
for k, v in exam_schedule.items():
    print('%14s : %s' % (k, v))

  (800, 'MWF') : ('Friday', '3-May', '11:30 A.M - 2:00 P.M')
 (800, 'TThF') : ('Friday', '3-May', '3:00 P.M - 5:30 P.M')
   (830, 'MW') : ('Friday', '3-May', '11:30 A.M - 2:00 P.M')
  (830, 'TTh') : ('Friday', '3-May', '3:00 P.M - 5:30 P.M')
  (900, 'MWF') : ('Friday', '3-May', '11:30 A.M - 2:00 P.M')
 (900, 'TThF') : ('Friday', '3-May', '3:00 P.M - 5:30 P.M')
  (1000, 'MW') : ('Thursday', '2-May', '8:00 A.M - 10:30 A.M')
 (1000, 'MWF') : ('Thursday', '2-May', '8:00 A.M - 10:30 A.M')
 (1000, 'TTh') : ('Friday', '3-May', '8:00 A.M - 10:30 A.M')
(1000, 'TThF') : ('Friday', '3-May', '8:00 A.M - 10:30 A.M')
 (1100, 'MWF') : ('Wednesday', '8-May', '8:00 A.M - 10:30 A.M')
(1100, 'TThF') : ('Tuesday', '7-May', '8:00 A.M - 10:30 A.M')
  (1130, 'MW') : ('Wednesday', '8-May', '8:00 A.M - 10:30 A.M')
 (1130, 'TTh') : ('Tuesday', '7-May', '8:00 A.M - 10:30 A.M')
 (1200, 'MWF') : ('Wednesday', '8-May', '3:00 P.M - 5:30 P.M')
(1200, 'TThF') : ('Wednesday', '8-May', '3:00 P.M - 5:30 P.M')
   (100, 'M
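
Once the dictionary is built, finding the exam for a class is a plain key lookup. The sketch below uses three entries copied from the output above in place of the full result of `extract_exam_schedule(url)`; `find_exam` is a hypothetical helper. Because the key stores the day string as written on the page, a differently spelled day string such as `'TuTh'` does not find the `'TTh'` entry.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[int, str]
Exam = Tuple[str, str, str]

# Three entries copied from the extracted schedule shown above
exam_schedule: Dict[Key, Exam] = {
    (800, 'MWF'): ('Friday', '3-May', '11:30 A.M - 2:00 P.M'),
    (1000, 'MW'): ('Thursday', '2-May', '8:00 A.M - 10:30 A.M'),
    (1000, 'TTh'): ('Friday', '3-May', '8:00 A.M - 10:30 A.M'),
}


def find_exam(time: int, days: str) -> Optional[Exam]:
    """Look up the exam for a class meeting time; None when there is no entry."""
    return exam_schedule.get((time, days))


print(find_exam(1000, 'MW'))
print(find_exam(1000, 'TuTh'))  # spelled differently from the key 'TTh'
```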

### Question

Which exam schedules are not extracted by the `extract_exam_schedule` function, and why?