# Assignment 1

In this assignment, you'll work with messy medical data and use regex to extract relevant information from it.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010
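
As a rough illustration of how the variants above could be covered, the sketch below tries a list of patterns from most to least specific and returns the first match. The names `MONTH`, `PATTERNS`, and `find_date` are hypothetical, and the patterns are assumptions for illustration rather than a reference solution; real notes may need extra handling for typos.

```python
import re

# Hypothetical month fragment: a three-letter prefix plus any trailing letters,
# so both "mar" and "march" match (input is lowercased first).
MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"

# Ordered from most specific to least specific.
PATTERNS = [
    # 04/20/2009; 04/20/09; 4/3/09
    re.compile(r"(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{4}|\d{2})\b"),
    # Mar-20-2009; Mar. 20, 2009; Mar 20th, 2009
    re.compile(
        rf"(?P<month>{MONTH})\.?[\s-]+(?P<day>\d{{1,2}})(?:st|nd|rd|th)?,?[\s-]+(?P<year>\d{{4}})"
    ),
    # 20 Mar 2009; 20 March, 2009
    re.compile(rf"(?P<day>\d{{1,2}})\s+(?P<month>{MONTH})\.?,?\s+(?P<year>\d{{4}})"),
    # Feb 2009
    re.compile(rf"(?P<month>{MONTH})\.?,?\s+(?P<year>\d{{4}})"),
    # 6/2008
    re.compile(r"(?P<month>\d{1,2})/(?P<year>\d{4})"),
    # 2009
    re.compile(r"(?P<year>\d{4})"),
]


def find_date(note):
    """Return the named groups of the first pattern that matches, else None."""
    lowered = note.lower()
    for pattern in PATTERNS:
        match = pattern.search(lowered)
        if match:
            return match.groupdict()
    return None


print(find_date("sshe plans to move as of 7/8/71 In-Home Services"))
print(find_date("seen on Mar 22nd, 2009 for follow-up"))
```

Ordering matters here: if the year-only pattern came first, it would match the `2009` inside `Mar 20, 2009` and discard the month and day.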

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy.
* Assume all dates where the year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989).
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
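
The rules above can be sketched as a small normalization helper. This is a minimal illustration, not the required approach: `normalize` and `MONTH_NUMBERS` are hypothetical names, and reading only the first three letters of a month name is an assumed way to tolerate misspellings such as "Janaury".

```python
from datetime import date

# Hypothetical lookup from three-letter month prefix to month number.
MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def normalize(month=None, day=None, year=None):
    """Turn extracted string parts into a date, applying the assignment rules."""
    # Two-digit years are assumed to be from the 1900s.
    full_year = int("19" + year) if len(year) == 2 else int(year)

    if month is None:
        month_number = 1  # missing month: January
    elif month.isdigit():
        month_number = int(month)
    else:
        # Assumption: the first three letters identify the month even if
        # the rest of the word is misspelled.
        month_number = MONTH_NUMBERS[month[:3].lower()]

    day_number = 1 if day is None else int(day)  # missing day: the 1st
    return date(full_year, month_number, day_number)


print(normalize(month="1", day="5", year="89"))  # 1989-01-05
print(normalize(month="9", year="2009"))         # 2009-09-01
print(normalize(year="2010"))                    # 2010-01-01
```

`date` raises `ValueError` for impossible values such as month 13, which gives a simple way to flag extractions that need a second look.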

With these rules in mind, find the correct date in each note and return a pandas Series containing the original Series' indices in chronological order. **Break ties by sorting on ("extracted date", "original row number").**

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
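
For intuition about the metric, here is a minimal pure-Python sketch of Kendall's tau (the tau-a variant, which assumes no tied values). How the grader applies it to the returned Series is not specified here, so the comparison of rank positions below is an assumption for illustration, and `kendall_tau` is a hypothetical helper name.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a for two equal-length sequences without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # A pair is concordant when both sequences order it the same way.
        product = (a[i] - a[j]) * (b[i] - b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


# Rank position of each original row: expected order vs. an order
# that differs by one swap of adjacent rows.
expected_ranks = [2, 3, 0, 4, 1]
submitted_ranks = [1, 3, 0, 4, 2]
print(kendall_tau(expected_ranks, expected_ranks))   # 1.0
print(kendall_tau(expected_ranks, submitted_ranks))  # 0.8
```

A value of 1.0 means the two orderings agree on every pair of rows, and each pair that is ordered differently lowers the score.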

*This function should return a Series of length 500 and dtype int.*

In [82]:
import pandas as pd
import numpy as np

import re


doc = []


with open("assets/dates.txt") as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)


df.head()

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
dtype: object

In [99]:
def date_sorter():
    # Work on a lowercased copy so month names match regardless of case
    # and the global `df` is left unmodified.
    text = df.copy()
    text = text.str.lower()
    extracted_dates = text.str.extract(
        r"(?P<month>\d?\d)[/-](?P<day>\d?\d)[/-](?P<year>\d{4})"
    )

    not_extracted = ~text.index.isin(extracted_dates.dropna().index)
    extracted_dates = pd.concat(
        [
            extracted_dates,
            text[not_extracted].str.extract(
                r"(?P<month>\d?\d)[/-](?P<day>(?:[0-2]?[0-9])|(?:[3][01]))[/-](?P<year>\d{2})"
            ),
        ],
        axis=0,
        join="inner",
    )
    not_extracted = ~text.index.isin(extracted_dates.dropna().index)

    extracted_dates = pd.concat(
        [
            extracted_dates,
            text[not_extracted].str.extract(
                r"(?P<day>\d?\d)\s?(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*)\.?,?\s?(?P<year>\d{4})"
            ),
        ],
        axis=0,
        join="inner",
    )
    not_extracted = ~text.index.isin(extracted_dates.dropna().index)

    extracted_dates = pd.concat(
        [
            extracted_dates,
            text[not_extracted].str.extract(
                r"(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*)\.?-?\s?(?P<day>\d\d?)(?:th|nd|st)?,?-?\s?(?P<year>\d{4})"
            ),
        ],
        axis=0,
        join="inner",
    )
    not_extracted = ~text.index.isin(extracted_dates.dropna().index)

    without_day = text[not_extracted].str.extract(
        r"(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*),?\.?\s(?P<year>\d{4})"
    )
    without_day = pd.concat(
        [
            without_day,
            text[not_extracted].str.extract(r"(?P<month>\d\d?)/(?P<year>\d{4})"),
        ]
    )
    without_day["day"] = 1
    extracted_dates = pd.concat([extracted_dates, without_day])
    not_extracted = ~text.index.isin(extracted_dates.dropna().index)

    without_month = text[not_extracted].str.extract(r"(?P<year>\d{4})")
    without_month["day"] = 1
    without_month["month"] = 1
    extracted_dates = pd.concat([extracted_dates, without_month])
    not_extracted = ~text.index.isin(extracted_dates.dropna().index)
    extracted_dates = extracted_dates.dropna()

    # Year
    extracted_dates["year"] = extracted_dates["year"].apply(
        lambda x: "19" + x if len(x) == 2 else x
    )
    extracted_dates["year"] = extracted_dates["year"].apply(lambda x: str(x))

    # Month
    extracted_dates["month"] = extracted_dates["month"].apply(
        lambda x: x[1:] if type(x) is str and x.startswith("0") else x
    )
    month_dict = {
        "january": 1,
        "jan": 1,
        "february": 2,
        "feb": 2,
        "march": 3,
        "mar": 3,
        "april": 4,
        "apr": 4,
        "may": 5,
        "june": 6,
        "jun": 6,
        "july": 7,
        "jul": 7,
        "august": 8,
        "aug": 8,
        "september": 9,
        "sep": 9,
        "october": 10,
        "oct": 10,
        "november": 11,
        "nov": 11,
        "december": 12,
        "dec": 12,
        "age": 1,
        "janaury": 1,
        "decemeber": 12,
    }

    extracted_dates = extracted_dates.replace(month_dict)
    extracted_dates["month"] = extracted_dates["month"].apply(lambda x: str(x))
    extracted_dates["day"] = extracted_dates["day"].apply(lambda x: str(x))

    extracted_dates["day"] = extracted_dates["day"].apply(
        lambda x: pd.NaT if int(x) < 1 or int(x) > 31 else x
    )
    extracted_dates["month"] = extracted_dates["month"].apply(
        lambda x: pd.NaT if int(x) < 1 or int(x) > 12 else x
    )
    extracted_dates["year"] = extracted_dates["year"].apply(
        lambda x: pd.NaT if int(x) < 1900 or int(x) > 2050 else x
    )

    extracted_dates["date"] = (
        extracted_dates["month"]
        + "/"
        + extracted_dates["day"]
        + "/"
        + extracted_dates["year"]
    )

    extracted_dates["date"] = pd.to_datetime(extracted_dates["date"])
    extracted_dates["index"] = extracted_dates.index

    extracted_dates = extracted_dates.sort_values(by=["date", "index"])
    return_rank = pd.Series(list(extracted_dates.index))

    return return_rank

In [100]:
date_sorter()

0        9
1       84
2        2
3       53
4       28
      ... 
495    427
496    141
497    186
498    161
499    413
Length: 500, dtype: int64

## Test

In [102]:
import numpy as np

s_test = date_sorter()


def run_df_modified_check():
    """
    Check if df appears to be modified.
    """
    try:
        assert type(df) == pd.Series
        assert (df.index == pd.RangeIndex(start=0, stop=500, step=1)).all()
        assert (df.apply(type) == str).all()
        assert df.str.len().min() >= 6
        assert df.str[5].apply(ord).sum() == 38354
        print("Passed df modification check")
    except:
        print("Failed df modification check")


run_df_modified_check()

# check if running the code twice produces the same result
try:
    assert (date_sorter() == s_test).all()
    print("Passed repeatability check")
except:
    print("Failed repeatability check")

# check if the result has the expected index
try:
    # assert type(date_sorter().index) == pd.RangeIndex
    # assert (date_sorter().index == pd.RangeIndex(start=0, stop=500, step=1)).all()
    assert list(date_sorter().index) == list(range(500))
    print("Passed index check")
except:
    print("Failed index check")

# check the tie-break sort for a sample of records where some have the same date
# note that this only tests a sample and does not check the entire answer
try:
    test_indices = [
        335,
        415,
        323,
        405,
        370,
        382,
        303,
        488,
        283,
        395,
        318,
        369,
        493,
        252,
        314,
        410,
        490,
    ]
    answer_lkp = {
        original_index: answer_index
        for answer_index, original_index in s_test.to_dict().items()
    }
    i_test = [answer_lkp[i] for i in test_indices]
    assert sorted(i_test) == i_test
    print("Passed secondary sort sample check")
except:
    print("Failed secondary sort sample check")


def run_v_check(s_test):
    """
    Check if the parsed dates appear to be correct and correctly sorted.
    The check works by computing a checksum for each block of 10 output
    rows. For example, a False entry in the agree column for index
    value 20 means at least one date is incorrectly parsed or
    incorrectly sorted in the **output** index range 20, 21, ..., 29.
    The results of the test are printed.
    Args:
        s_test: Series such as produced by date_sorter()
    Returns:
        None
    """
    try:
        v_check = pd.DataFrame(
            {
                "correct": [
                    6695,
                    14428,
                    16742,
                    9275,
                    12290,
                    14654,
                    9421,
                    10185,
                    11464,
                    16491,
                    11797,
                    14036,
                    15459,
                    9412,
                    13069,
                    10400,
                    10498,
                    14322,
                    13274,
                    11001,
                    11383,
                    11910,
                    10977,
                    9692,
                    10199,
                    10187,
                    15456,
                    13491,
                    9186,
                    13646,
                    11142,
                    13724,
                    10994,
                    12905,
                    15968,
                    16648,
                    13966,
                    14607,
                    16932,
                    14622,
                    17942,
                    18220,
                    17818,
                    18305,
                    19633,
                    12522,
                    13978,
                    18445,
                    20156,
                    14797,
                ],
                "learner": [
                    (
                        s_test.iloc[10 * i : (i + 1) * 10].values
                        * np.array(range(1, 11))
                    ).sum()
                    for i in range(50)
                ],
            },
            index=range(0, 500, 10),
        ).assign(agree=lambda x: x["correct"] == x["learner"])
        print("Values checksums:")
        print(v_check)
        assert v_check["agree"].all()
        print("Passed values check")
    except:
        print("Failed values check")
    return


run_v_check(s_test)

Passed df modification check
Passed repeatability check
Passed index check
Passed secondary sort sample check
Values checksums:
     correct  learner  agree
0       6695     6695   True
10     14428    14428   True
20     16742    16742   True
30      9275     9275   True
40     12290    12290   True
50     14654    14654   True
60      9421     9421   True
70     10185    10185   True
80     11464    11464   True
90     16491    16491   True
100    11797    11797   True
110    14036    14036   True
120    15459    15459   True
130     9412     9412   True
140    13069    13069   True
150    10400    10400   True
160    10498    10498   True
170    14322    14322   True
180    13274    13274   True
190    11001    11001   True
200    11383    11383   True
210    11910    11910   True
220    10977    10977   True
230     9692     9692   True
240    10199    10199   True
250    10187    10187   True
260    15456    15456   True
270    13491    13491   True
280     9186     9186   True
29