# Intermediate Regex Homework

## UFO sightings

The [ufo-reports](https://github.com/planetsig/ufo-reports) GitHub repository contains reports of UFO sightings downloaded from the [National UFO Reporting Center](http://www.nuforc.org/) website. One of the data fields is the **duration of the sighting**, which includes **free-form text**. These are some example entries:

- 45 minutes
- 1-2 hrs
- 20 seconds
- 1/2 hour
- about 3 mins
- several minutes
- one hour?
- 5min

Here is **how to read in the file:**

- Use the pandas **`read_csv()`** function to read directly from this [URL](https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv).
- Use the **`header=None`** parameter to specify that the data does not have a header row.
- Use the **`nrows=100`** parameter to specify that you only want to read in the first 100 rows.
- Save the relevant Series as a Python list, just like we did in a class exercise.

Your assignment is to **normalize the duration data for the first 100 rows** by splitting each entry into two parts:

- The first part should be a **number**: either a whole number (such as '45') or a decimal (such as '0.5').
- The second part should be a **unit of time**: one of 'hr', 'min', or 'sec'.

The expected output is a **list of tuples**, each containing the **original (unedited) string**, the **number**, and the **unit of time**. Here is what the output should look like:

> `clean_durations = [('45 minutes', '45', 'min'), ('1-2 hrs', '1', 'hr'), ('20 seconds', '20', 'sec'), ...]`

Here are the **"rules" and guiding principles** for this assignment:

- The normalized duration does not have to be exactly correct, but it must be at least **within the given range**. For example:
    - If the duration is '20-30 min', acceptable answers include '20 min' and '30 min'.
    - If the duration is '1/2 hour', the only acceptable answer is '0.5 hr'.
- When a number is not given, you should make a **"reasonable" substitution for the words**. For example:
    - If the duration is 'several minutes', you can approximate this as '5 min'.
    - If the duration is 'couple minutes', you can approximate this as '2 min'.
- You are not allowed to **skip any entries**. (Your list of tuples should have a length of 100.)
- Try to use **as few substitutions as possible**, and make your regular expression **as simple as possible**.
- Just because you don't get an error doesn't mean that your code was successful. Instead, you should **check each entry by hand** to see if it produced an acceptable result.
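
One way to approach these rules is sketched below. It is only a starting point under stated assumptions: the names `WORD_NUMBERS`, `UNITS`, `NUMBER_RE`, `UNIT_RE`, and `normalize` are hypothetical, the word-to-number values are judgment calls, and the patterns have only been checked against the example entries listed in this assignment, not against all 100 rows.

```python
import re

# Hypothetical word-to-number substitutions; the values are judgment calls.
WORD_NUMBERS = {'several': '5', 'couple': '2', 'few': '3', 'one': '1'}

# Map the first letter of the matched unit to the normalized unit.
UNITS = {'h': 'hr', 'm': 'min', 's': 'sec'}

# A fraction such as '1/2', or a whole or decimal number such as '45' or '4.5'.
NUMBER_RE = re.compile(r'(\d+)/(\d+)|\d+(?:\.\d+)?')

# The first unit-like word fragment in the text.
UNIT_RE = re.compile(r'hr|hour|min|sec')


def normalize(duration):
    """Return (original, number, unit), or (original, None, None) if unparsed."""
    text = duration.lower()
    for word, digit in WORD_NUMBERS.items():
        text = re.sub(r'\b' + word + r'\b', digit, text)

    number = NUMBER_RE.search(text)
    unit = UNIT_RE.search(text)
    if number is None or unit is None:
        # Nothing is skipped: the entry is kept and flagged for manual review.
        return (duration, None, None)

    if number.group(1):
        # Fraction: convert '1/2' to '0.5'.
        value = str(int(number.group(1)) / int(number.group(2)))
    else:
        # For a range such as '20-30', this takes the first number.
        value = number.group()
    return (duration, value, UNITS[unit.group()[0]])


examples = ['45 minutes', '1-2 hrs', '20 seconds', '1/2 hour',
            'about 3 mins', 'several minutes', 'one hour?', '5min']
clean_durations = [normalize(d) for d in examples]
```

Any tuple that comes back with `None` in it marks an entry the patterns did not handle, which gives a short list to inspect when checking the results by hand.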

**Bonus tasks:**

- Try reading in **more than 100 rows**, and see if your code still produces the correct results.
- When a range is specified (such as '1-2 hrs' or '10 to 15 sec'), **calculate the exact midpoint** ('1.5 hr' or '12.5 sec') to use in your normalized data.
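
For the midpoint bonus task, a minimal sketch might look like the following. The names `RANGE_RE` and `midpoint` are hypothetical, and the pattern assumes a range is written as two numbers joined by `-` or `to`; other phrasings would need their own handling.

```python
import re

# Two numbers (whole or decimal) separated by '-' or 'to'.
RANGE_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)')


def midpoint(duration):
    """Return the midpoint of a range as a string, or None if no range is found."""
    match = RANGE_RE.search(duration)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    # The 'g' format drops a trailing '.0', so 25.0 becomes '25'.
    return f'{(low + high) / 2:g}'


print(midpoint('1-2 hrs'))       # 1.5
print(midpoint('10 to 15 sec'))  # 12.5
print(midpoint('45 minutes'))    # None
```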

In [184]:
import pandas as pd
import re

In [185]:
url = 'https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv'
ufo = pd.read_csv(url, header=None, nrows=100)  # no header row; first 100 rows only

In [186]:
ufo.head(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.883056 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.978333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.418056 | -157.803611 |


In [187]:
ufo.shape

(100, 11)

In [188]:
# Convert the duration column (column 6) to a Python list
durations = ufo[6].tolist()

In [189]:
durations

['45 minutes',
 '1-2 hrs',
 '20 seconds',
 '1/2 hour',
 '15 minutes',
 '5 minutes',
 'about 3 mins',
 '20 minutes',
 '3  minutes',
 'several minutes',
 '5 min.',
 '3 minutes',
 '30 min.',
 '3 minutes',
 '30 seconds',
 '20minutes',
 '2 minutes',
 '20-30 min',
 '20 sec.',
 '45 minutes',
 '20 minutes',
 'one hour?',
 '5-6 minutes',
 '1 minute',
 '3 seconds',
 '30 seconds',
 'approx: 30 seconds',
 '5min',
 '15 minutes',
 '4.5 or more min.',
 '3 minutes',
 '30mins.',
 '3 min',
 '5 minutes',
 '3 to 5 min',
 '2min',
 '1 minute',
 'couple minutes',
 '15-20 seconds',
 '10min',
 '3 minutes',
 '10 minutes',
 'few minutes',
 '1 minute',
 '2 sec.',
 'approx 5 min',
 '1 minute',
 '3min',
 '2 minutes',
 '30 minutes',
 '10 minutes',
 '1 hour(?)',
 '10 seconds',
 '1min. 39s',
 '30 seconds',
 '20 minutes',
 '8 seconds',
 'less than 1 min',
 '1 hour',
 '2 minutes',
 '5 seconds',
 '~1 hour',
 '2 min.',
 '1 minute',
 '3sec',
 '5 min',
 '5 min',
 '1 minute',
 '4 hours',
 '30 seconds',
 '<5 minutes',
 '1-hou

In [190]:
# First pass at minutes: append a normalized '<number> min' after each match
duration_min = [re.sub(r'(([0-9]+) *(minte|minutes|min.|mins|minute|min. ))', r'\1 \2 min', duration) for duration in durations]

In [191]:
duration_min

['45 minutes 45 min',
 '1-2 hrs',
 '20 seconds',
 '1/2 hour',
 '15 minutes 15 min',
 '5 minutes 5 min',
 'about 3 mins 3 min',
 '20 minutes 20 min',
 '3  minutes 3 min',
 'several minutes',
 '5 min. 5 min',
 '3 minutes 3 min',
 '30 min. 30 min',
 '3 minutes 3 min',
 '30 seconds',
 '20minutes 20 min',
 '2 minutes 2 min',
 '20-30 min',
 '20 sec.',
 '45 minutes 45 min',
 '20 minutes 20 min',
 'one hour?',
 '5-6 minutes 6 min',
 '1 minu 1 minte',
 '3 seconds',
 '30 seconds',
 'approx: 30 seconds',
 '5min',
 '15 minutes 15 min',
 '4.5 or more min.',
 '3 minutes 3 min',
 '30mins 30 min.',
 '3 min',
 '5 minutes 5 min',
 '3 to 5 min',
 '2min',
 '1 minu 1 minte',
 'couple minutes',
 '15-20 seconds',
 '10min',
 '3 minutes 3 min',
 '10 minutes 10 min',
 'few minutes',
 '1 minu 1 minte',
 '2 sec.',
 'approx 5 min',
 '1 minu 1 minte',
 '3min',
 '2 minutes 2 min',
 '30 minutes 30 min',
 '10 minutes 10 min',
 '1 hour(?)',
 '10 seconds',
 '1min. 1 min 39s',
 '30 seconds',
 '20 minutes 20 min',
 '8 sec
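
Results such as `'1 minu 1 minte'` in the output above come from the unescaped `.` in the `min.` alternative: in a regular expression, `.` matches any character, so `min.` matches the `minu` in `minute`, and the leftover `te` ends up after the replacement. The sketch below reproduces that behavior and shows one possible alternative. The name `escaped` and its pattern are assumptions for illustration, and unlike the original cell it replaces the matched text instead of appending to it.

```python
import re

# The pattern from the cell above, applied to a single entry.
original = r'(([0-9]+) *(minte|minutes|min.|mins|minute|min. ))'
print(re.sub(original, r'\1 \2 min', '1 minute'))  # 1 minu 1 minte

# Hypothetical alternative: 'min', optional 'ute', optional 's',
# and an optional literal period (escaped as '\.').
escaped = r'([0-9]+) *min(?:ute)?s?\.?'

samples = ['1 minute', '30mins.', '5min', '3  minutes', '20 seconds']
cleaned = [re.sub(escaped, r'\1 min', s) for s in samples]
print(cleaned)  # ['1 min', '30 min', '5 min', '3 min', '20 seconds']
```

Because a single optional-suffix pattern covers `min`, `mins`, `minute`, `minutes`, and the trailing period, the order of alternatives no longer matters.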

In [200]:
[re.search(r'1-2 hrs', duration) for duration in duration_min]

AttributeError: 'NoneType' object has no attribute 'group'

AttributeError: 'list' object has no attribute 'group'
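
Both errors above have the same root cause. `re.search()` returns `None` when the pattern does not match, so calling `.group()` on that result fails for every non-matching entry, and a list of match objects has no `.group()` method at all. A minimal sketch of one way to guard against both, using a shortened stand-in for `duration_min`:

```python
import re

# Shortened stand-in for the duration_min list built earlier.
duration_min = ['45 minutes 45 min', '1-2 hrs', '20 seconds']

# re.search returns a match object or None for each entry.
matches = [re.search(r'1-2 hrs', d) for d in duration_min]

# Call .group() on each match individually, skipping the None results.
found = [m.group() for m in matches if m is not None]
print(found)  # ['1-2 hrs']
```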