**Bin Hu's Submission of LING5801 HW1**

In [1]:
import re
import codecs
from pprint import pprint

################# FUNCTION DEFINITIONS -- DO NOT EDIT #######################

def precision(system,gold):
    true_pos = [x for x in system if x in gold]
    return len(true_pos)/len(system)

def recall(system,gold):
    true_pos = [x for x in system if x in gold]
    return len(true_pos)/len(gold)

def f1(p,r):
    return (2 * p * r)/(p + r)

def analyze(system,gold):
    p = precision(system,gold)
    r = recall(system,gold)
    f = f1(p,r)
    print("Precision: %f\tRecall: %f\tF1: %f" % (p,r,f))

### Assignment 1: Information extraction using regular expressions

#### Due: February 14, 2023

Your task in this assignment is to write a Python regular expression that searches a text for expressions formatted as **dates**.

This might seem like a straightforward task, but there are many
different conventions for formatting dates (e.g., `month-day-year`,
`day-month-year` and `year-month-day`) and for each of these
conventions there are a number of additional variables to consider
(what is the separator? is the month given numerically? is the month
name abbreviated? etc.).  It might also be the case that the text mentions just a year, a span of years, or refers somehow to a particular decade.

We have deliberately obfuscated the
document that you will search so that there is no one consistent
format for date expressions.  This is not an unrealistic
situation. For example, if we were processing a large collection of
documents, it is entirely plausible that different documents adopt
different conventions for date expressions.  Therefore your regular expression
needs to match as many of these patterns as possible without
mistakenly matching non-date expressions.

The file `DevList.txt`, the "gold standard", contains the full list of date expressions that your regular expression should match.  The code below reads the dates from the gold standard file into the list `gold_dates`.   

In [2]:
gold_dates = [date.strip() for date in open("DevList.txt", "r", encoding='utf-8').readlines() if date.strip()]

You can inspect the list of date expressions that your regular expression needs to match by running the code cell below.  

In [3]:
pprint(gold_dates)

["1800's",
 "1900's",
 '1854',
 '1854',
 '1854',
 '1854',
 '1854',
 '2019',
 'June 4, 2019',
 '1851',
 'May 11, 1858',
 '1851-1900',
 '1851',
 '1858',
 '1861',
 '7/2/1862',
 '1868',
 'May 1873',
 'May of 1875',
 'Sep 1877',
 'April 1st, 1880',
 '1881',
 'Mar, 2, 1887',
 '1888',
 '2nd of Nov 1898',
 '05. 01.1900',
 "1900's",
 '1901',
 '1960',
 '30th of January, 1904',
 '24th of Sep 1904',
 '1907',
 '1908',
 'Sep 1909',
 'September 1949',
 '1909',
 'Feb 14']


The following code reads a text document `DevText.txt` into the string `doc`.  This is the object that your regular expression will search for date expressions.

In [4]:
doc = codecs.open("DevText.txt", encoding='utf-8').read()
#doc = open("DevText.txt").read()

In [5]:
print(doc)

This document lists a history of the University of Minnesota since establishment in the 1800's until the 1900's. Before the chronological laying out of the significant milestones of the university, a land acknowledgement of one of the campuses sets a backdrop for the history of the land on which the campuses were established. 
The University of Minnesota Duluth's Land Acknowledgment

We collectively acknowledge that the University of Minnesota Duluth is located on the traditional, ancestral, and contemporary lands of Indigenous people. The University resides on land that was cared for and called home by the Ojibwe people, before them the Dakota and Northern Cheyenne people, and other Native peoples from time immemorial. Ceded by the Ojibwe in an 1854 treaty, this land holds great historical, spiritual, and personal significance for its original stewards, the Native nations and peoples of this region. We recognize and continually support and advocate for the sovereignty of the Native

Define your regular expression in the string variable `pattern`.  **This is the only part of the notebook that you need to edit.**  There is an initial definition for `pattern` that simply looks for three groups of one or more numeric characters with any single character occurring between them.

Note that we will compile the regular expression using the `VERBOSE` flag, which allows you to leverage multi-line strings, whitespace, and comments within a regex definition.  See the instruction sheet for further details.
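As a small illustration of what `VERBOSE` mode allows, the sketch below compiles a toy pattern (not the assignment solution) that is split over several lines and carries inline comments. The names `toy_pattern` and `toy_regex` are hypothetical. Because unescaped whitespace in the pattern is ignored in this mode, a space that should be matched has to be written explicitly, here as `\s`.

```python
import re

# Hypothetical toy pattern, for illustration only.
toy_pattern = r"""
\b\d{1,2}/\d{1,2}/\d{4}\b      # numeric date such as 7/2/1862
|
\b(?:Sep|Nov)\s\d{4}\b         # abbreviated month, one whitespace character, year
"""
toy_regex = re.compile(toy_pattern, re.VERBOSE)

toy_matches = toy_regex.findall("Opened 7/2/1862, expanded Sep 1877.")
print(toy_matches)
```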

In [6]:
pattern = r"""
\b\d{2,4}'s\b
|
\b\d{4}\s*-\s*\d{4}\b
|
\b\d{1,2}[^1-9]{2}-century\b
|
\b(?:\d{1,2}[^0-9]|Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?!until|in|to|the|Until|In|To|The).){1,18}\d{4}\b
|
\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[^a-zA-Z0-9]{1,2}\d{1,4}\b
|
\b\d{1,4}[^a-zA-Z0-9]{1,2}\d{1,2}[^a-zA-Z0-9]{1,2}\d{1,4}\b
|
\b\d{4}\b
"""
len(pattern)

535

In [7]:
regex = re.compile(pattern, re.VERBOSE)

## Uncomment the line below if you do not want to use the VERBOSE mode.
#regex = re.compile(pattern)

Next, we search the string `doc` for strings that match your regular expression and store any matches in the list `dates`. We then evaluate the results using the **precision**, **recall** and **F1** metrics. Recall is the fraction of the gold-standard dates that your regular expression found, precision is the fraction of its matches that appear in the gold standard, and F1 is the harmonic mean of precision and recall.
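To make the three metrics concrete, here is a minimal worked example on two made-up lists. The names `toy_system` and `toy_gold` are hypothetical and the values are not taken from `DevList.txt`; the arithmetic follows the function definitions at the top of the notebook.

```python
# Hypothetical toy lists, not taken from DevList.txt.
toy_system = ["1854", "May 1873", "42"]          # what a regex returned
toy_gold = ["1854", "May 1873", "1881", "1888"]  # what it should have returned

true_pos = [x for x in toy_system if x in toy_gold]   # correct guesses
p = len(true_pos) / len(toy_system)                   # 2 of 3 guesses are correct
r = len(true_pos) / len(toy_gold)                     # 2 of 4 gold dates were found
f = (2 * p * r) / (p + r)                             # harmonic mean of p and r
print("Precision: %f\tRecall: %f\tF1: %f" % (p, r, f))
```

Here the one wrong guess (`"42"`) lowers precision to 2/3, the two missed dates lower recall to 1/2, and F1 comes out at 4/7.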

In [8]:
dates = re.findall(regex, doc)
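
One detail of `re.findall` is worth keeping in mind when the pattern contains groups: if the pattern has a capturing group, `findall` returns the text of the group rather than the whole match. The sketch below shows the difference on a made-up sentence (the variable names are hypothetical); this is presumably why the pattern above uses non-capturing `(?:...)` groups throughout.

```python
import re

text = "Founded in May 1873."

# Capturing group: findall returns only the group's text.
captured = re.findall(r"(May|Sep) \d{4}", text)
# Non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:May|Sep) \d{4}", text)

print(captured)
print(whole)
```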

In [9]:
# Evaluate your regex using the analyze function (it prints the scores itself and returns None).
print(analyze(dates, gold_dates))

Precision: 1.000000	Recall: 1.000000	F1: 1.000000
None


Our initial regular expression only matches 5\% of the dates in the development text but 67\% of
the strings it returned were actual date expressions.  Not bad for
such a simple regular expression. **But you can do better!**

Note that if your pattern does not match any strings, the list `dates` is empty and the evaluation code will throw an error due
to a division by zero.
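
If that error gets in the way while experimenting, one option is a guarded variant of the evaluation. The sketch below is an assumption, not part of the provided code: `safe_analyze` is a hypothetical helper that mirrors the logic of `analyze` but returns zeros where the original would divide by zero (an empty `dates` list, or no true positives at all).

```python
def safe_analyze(system, gold):
    """Hypothetical variant of analyze() that returns zeros instead of raising."""
    true_pos = [x for x in system if x in gold]
    p = len(true_pos) / len(system) if system else 0.0
    r = len(true_pos) / len(gold) if gold else 0.0
    f = (2 * p * r) / (p + r) if (p + r) > 0 else 0.0
    return p, r, f

# An empty match list no longer raises ZeroDivisionError.
print(safe_analyze([], ["1854"]))
```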
    

### Error analysis

As you develop your regular expression it may be useful to conduct some error analysis.  The code cells below provide some simple first steps, but feel free to explore the development data as you see fit.

Have you looked at the content of `dates`?  These are the strings your regular expression is matching.  Inspecting the contents of this list is a good first step to improving the performance of your regular expression.

In [10]:
pprint(dates)

["1800's",
 "1900's",
 '1854',
 '1854',
 '1854',
 '1854',
 '1854',
 '2019',
 'June 4, 2019',
 '1851',
 'May 11, 1858',
 '1851-1900',
 '1851',
 '1858',
 '1861',
 '7/2/1862',
 '1868',
 'May 1873',
 'May of 1875',
 'Sep 1877',
 'April 1st, 1880',
 '1881',
 'Mar, 2, 1887',
 '1888',
 '2nd of Nov 1898',
 '05. 01.1900',
 "1900's",
 '1901',
 '1960',
 '30th of January, 1904',
 '24th of Sep 1904',
 '1907',
 '1908',
 'Sep 1909',
 'September 1949',
 '1909',
 'Feb 14']


You can zero in on the errors your regular expression is making by inspecting the **false positives**, that is, the matches that do not occur in the gold standard list `gold_dates`.  Minimizing false positives will improve your **precision** score.

In [11]:
false_positives = [d for d in dates if d not in gold_dates]
pprint(false_positives)

[]


Which dates in `gold_dates` is your regular expression missing; that is, what are the **false negatives**?  Minimizing false negatives will improve your **recall**.

In [12]:
false_negatives = [d for d in gold_dates if d not in dates]
pprint(false_negatives)

[]
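
The two membership checks above ignore how often a string occurs. For example, `'1854'` appears five times in `gold_dates`, and a regex that found it only once would still produce two empty lists. A count-aware comparison can be sketched with `collections.Counter`; the toy lists and the names `missed` and `extra` below are hypothetical.

```python
from collections import Counter

# Hypothetical toy lists: the gold list has '1854' three times, the system found it once.
toy_gold = ["1854", "1854", "1854", "May 1873"]
toy_dates = ["1854", "May 1873", "May 1873"]

# Counter subtraction keeps only positive counts.
missed = Counter(toy_gold) - Counter(toy_dates)   # gold occurrences not matched
extra = Counter(toy_dates) - Counter(toy_gold)    # matches beyond the gold count
print(dict(missed))
print(dict(extra))
```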


Finally, feel free to use the empty code cell below to perform your own error analyses.

### Evaluation

**Your grade will be based on your F1 score!**  However, we
  will not evaluate your regular expression only against the development
  text, but also against a new, unseen **test text**.  The test text will
  contain the same date *formats* as found in the
  development text, so if your regular expression is general enough,
  your score on the test text should be similar to your score on the
  development text. The formula for your score will be:
  
  $\mathrm{Score} = (F1_{\mathit{dev}} \times 0.75) + (F1_{\mathit{test}} \times 0.25)$
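
As a worked instance of the formula, the snippet below plugs in the two F1 values that this notebook's own output cells report (1.000000 on the development text and 0.898551 on the test text). It assumes those are the values the graders would also obtain, which is not guaranteed.

```python
# F1 values as reported by this notebook's own output cells.
f1_dev = 1.000000
f1_test = 0.898551

score = (f1_dev * 0.75) + (f1_test * 0.25)
print("Score: %f" % score)
```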

When you are satisfied with the performance of your regular
  expression, submit this notebook, `DateFinder.ipynb`, via Canvas. Submissions are
  due by **11:59pm on Tuesday, February 14**.

In [13]:
gold_dates = [date.strip() for date in open("TestList.txt", "r", encoding='utf-8').readlines() if date.strip()]
doc = codecs.open("TestText.txt", encoding='utf-8').read()
dates = re.findall(regex, doc)
print(analyze(dates, gold_dates))
false_positives = [d for d in dates if d not in gold_dates]
print("False Positives\n")
pprint(false_positives)
false_negatives = [d for d in gold_dates if d not in dates]
print("False Negatives\n")
pprint(false_negatives)

Precision: 0.968750	Recall: 0.837838	F1: 0.898551
None
False Positives

['December 7']
False Negatives

['1940s', '1940s', 'December 7, 41', '1950s', 'June', '1960s']
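
Several of the false negatives are decades written without an apostrophe (`'1940s'`, `'1950s'`, `'1960s'`), which the `\b\d{2,4}'s\b` alternative does not cover. One possible direction, sketched here under the assumption that such forms should always count as dates, is to make the apostrophe optional. The name `decade` is hypothetical, and the effect on precision for other texts would still need to be checked.

```python
import re

# Hypothetical extra alternative; the optional apostrophe covers both decade spellings.
decade = re.compile(r"\b\d{4}'?s\b")

decade_matches = decade.findall("from the 1800's through the 1940s and 1950s")
print(decade_matches)
```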
