# Intermediate Regex Homework

## UFO sightings

The [ufo-reports](https://github.com/planetsig/ufo-reports) GitHub repository contains reports of UFO sightings downloaded from the [National UFO Reporting Center](http://www.nuforc.org/) website. One of the data fields is the **duration of the sighting**, which includes **free-form text**. These are some example entries:

- 45 minutes
- 1-2 hrs
- 20 seconds
- 1/2 hour
- about 3 mins
- several minutes
- one hour?
- 5min

Here is **how to read in the file:**

- Use the pandas **`read_csv()`** function to read directly from this [URL](https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv).
- Use the **`header=None`** parameter to specify that the data does not have a header row.
- Use the **`nrows=100`** parameter to specify that you only want to read in the first 100 rows.
- Save the relevant Series as a Python list, just like we did in a class exercise.

In [107]:
import pandas as pd
ufos = pd.read_csv('https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv',header=None,nrows=100)
durations = ufos[6].tolist()
durations[:5]

['45 minutes', '1-2 hrs', '20 seconds', '1/2 hour', '15 minutes']

Your assignment is to **normalize the duration data for the first 100 rows** by splitting each entry into two parts:

- The first part should be a **number**: either a whole number (such as '45') or a decimal (such as '0.5').
- The second part should be a **unit of time**: either 'hr' or 'min' or 'sec'

The expected output is a **list of tuples**, each containing the **original (unedited) string**, the **number**, and the **unit of time**. Here is what the output should look like:

> `clean_durations = [('45 minutes', '45', 'min'), ('1-2 hrs', '1', 'hr'), ('20 seconds', '20', 'sec'), ...]`

In [119]:
import re

# Utility function to get fractions
def frac(s):
    res = re.search(r'(\d+)/(\d+)',s)
    if res:
        n = float(res.group(1))
        d = float(res.group(2))
        return(str(n/d))
    else:
        return('')
    
# Utility function to get midpoint of ranges
def span(s):
    res = re.search(r'(\d+) *(to|-) *(\d+)',s)
    if res:
        t1 = float(res.group(1))
        t2 = float(res.group(3))
        return(str((t1+t2)/2))
    else:
        return('')
    
# Number words for the digits 0-9; each word's list index is its numeric value
numtext = r'zero one two three four five six seven eight nine'.split()

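def make_tuples(duration):
    # Order of substitutions: unit spellings, approximation words, fractions, ranges, number words.
    # Returns a list of (original, number, unit) tuples, or prints the failures and returns None.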
    timesub = [re.sub(r'( |-)?min.*$',r' min',x) for x in duration]
    timesub = [re.sub(r'( |-)?sec.*$',r' sec',x) for x in timesub]
    timesub = [re.sub(r'( |-)?(hour.*|hr.*)$',r' hr',x) for x in timesub]
    approx_sub = [re.sub(r'(<|>|less than|more than|or more|or less|about|approx\.?|~|\+/-) ?',r'',x) for x in timesub]
    approx_sub = [re.sub(r'several',r'5',x) for x in approx_sub]
    approx_sub = [re.sub(r'few',r'3',x) for x in approx_sub]
    approx_sub = [re.sub(r'couple',r'2',x) for x in approx_sub]
    frac_sub = [re.sub(r'\d+/\d+',frac(x),x) for x in approx_sub]
    span_sub = [re.sub(r'(\d+) *(to|-) *(\d+)',span(x),x) for x in frac_sub]
    text_sub = span_sub
    # Loop over all ten number words (range(0,9) stopped at 'eight' and skipped 'nine')
    for i, word in enumerate(numtext):
        text_sub = [re.sub(word,str(i),x) for x in text_sub]
    split_re = [re.search(r'([\d.]+) *(hr|min|sec)',x) for x in text_sub]
    if None in split_re:
        print('Match failures:')
        for i in range(0,len(duration)):
            if split_re[i] is None:
                print(duration[i])
        return
    else:
        split_times = [x.groups() for x in split_re]
        return([(duration[i],split_times[i][0],split_times[i][1]) for i in range(0,len(duration))])

make_tuples(durations)[:10]

[('45 minutes', '45', 'min'),
 ('1-2 hrs', '1.5', 'hr'),
 ('20 seconds', '20', 'sec'),
 ('1/2 hour', '0.5', 'hr'),
 ('15 minutes', '15', 'min'),
 ('5 minutes', '5', 'min'),
 ('about 3 mins', '3', 'min'),
 ('20 minutes', '20', 'min'),
 ('3  minutes', '3', 'min'),
 ('several minutes', '5', 'min')]

Here are the **"rules" and guiding principles** for this assignment:

- The normalized duration does not have to be exactly correct, but it must be at least **within the given range**. For example:
    - If the duration is '20-30 min', acceptable answers include '20 min' and '30 min'.
    - If the duration is '1/2 hour', the only acceptable answer is '0.5 hr'.
- When a number is not given, you should make a **"reasonable" substitution for the words**. For example:
    - If the duration is 'several minutes', you can approximate this as '5 min'.
    - If the duration is 'couple minutes', you can approximate this as '2 min'.
- You are not allowed to **skip any entries**. (Your list of tuples should have a length of 100.)
- Try to use **as few substitutions as possible**, and make your regular expression **as simple as possible**.
- Just because you don't get an error doesn't mean that your code was successful. Instead, you should **check each entry by hand** to see if it produced an acceptable result.
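
A small automated check can catch structural problems before the by-hand review; it does not replace it, because it cannot tell whether a number is within the reported range. The sketch below is only an illustration: `check_tuples` and `ALLOWED_UNITS` are hypothetical names, not part of the assignment.

```python
# Hypothetical helper: flags structural problems in a list of
# (original, number, unit) tuples. It does NOT judge whether the
# number is a reasonable reading of the original string.
ALLOWED_UNITS = {'hr', 'min', 'sec'}

def check_tuples(tuples, expected_len=100):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    if len(tuples) != expected_len:
        problems.append(f'expected {expected_len} entries, got {len(tuples)}')
    for original, number, unit in tuples:
        try:
            float(number)  # accepts whole numbers ('45') and decimals ('0.5')
        except ValueError:
            problems.append(f'{original!r}: {number!r} is not a number')
        if unit not in ALLOWED_UNITS:
            problems.append(f'{original!r}: unexpected unit {unit!r}')
    return problems

sample = [('45 minutes', '45', 'min'),
          ('1-2 hrs', '1.5', 'hr'),
          ('one hour?', 'one', 'hours')]  # deliberately bad entry
for problem in check_tuples(sample, expected_len=3):
    print(problem)
```

An empty result only means the tuples are well formed, so each entry still needs to be compared against its original string.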

**Bonus tasks:**

- Try reading in **more than 100 rows**, and see if your code still produces the correct results.
- When a range is specified (such as '1-2 hrs' or '10 to 15 sec'), **calculate the exact midpoint** ('1.5 hr' or '12.5 sec') to use in your normalized data.

In [120]:
more_ufos = pd.read_csv('https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv',header=None,nrows=200)
more_dur = more_ufos[6].tolist()
make_tuples(more_dur)

Match failures:
2/min.
less then a minute
00:43
5:00
one + minutes
1:00:00


Ugh. Does "00:43" mean 43 seconds or 43 minutes? And is "5:00" five hours, or five minutes, or an interval that began (or ended) at 5 o'clock? 
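
The data alone cannot settle that question, so any handling of these entries has to pick an interpretation and say so. As a sketch only, the hypothetical helper `parse_clock` below assumes the last field is always seconds ('MM:SS' or 'H:MM:SS'); under a different assumption the results would differ.

```python
import re

def parse_clock(s):
    """Read 'MM:SS' or 'H:MM:SS' as a duration; return (number, unit) or None.

    ASSUMPTION (not confirmed by the data): the last field is seconds.
    """
    m = re.fullmatch(r'(?:(\d+):)?(\d+):(\d+)', s.strip())
    if m is None:
        return None
    # A missing hours field comes back as None and is treated as 0
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    total = hours * 3600 + minutes * 60 + seconds
    # Report in the largest unit that divides the total evenly
    for unit, size in (('hr', 3600), ('min', 60)):
        if total and total % size == 0:
            return (str(total // size), unit)
    return (str(total), 'sec')

for entry in ['00:43', '5:00', '1:00:00', '45 minutes']:
    print(entry, parse_clock(entry))
```

Under this assumption '00:43' becomes 43 sec, '5:00' becomes 5 min, and '1:00:00' becomes 1 hr; entries without a colon return `None` and would go through the regular substitutions instead.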