# Weekly Challenge 04

*Original URL* https://community.alteryx.com/t5/Weekly-Challenge/Challenge-4-Date-Parsing/td-p/36731 and [**My Alteryx Approach**](https://github.com/dsmdavid/Alteryx-Weekly-Challenge/tree/master/submitted/sub_Challenge%2304)

## Brief

A dataset contains a text field that has a date embedded within the text. The problem is that the date is represented in a few different ways. For example:

*	16-APR-2005  
*	Nov•16,•1900  
*	4-SEP-00  
*	Jan•5•2000  

The goal is to create a new Date/Time field populated with the dates contained within the text field. You will also need to standardize the dates so that they are all formatted the same.


In [1]:
import pandas as pd
import re #regex module
from datetime import date
from dateutil.relativedelta import relativedelta

## Approach I want to follow:
1. Read the data.
1. Create a function to parse the dates.
1. Apply the function to create a new column.

In [2]:
#Read the dataframe
df = pd.read_csv("./04_files/input.csv")

In [3]:
def getDates(string):
    r'''Extracts the first date-like substring from the text and converts it to a Timestamp.
    Pattern 1: r'\d+[-\s]\w+[-\s]\d{2,4}' should match: 16-APR-2005 and 4-SEP-00
    Pattern 2: r'\w*[\s\d,]*\d{4}' should match: Nov•16,•1900 and Jan•5•2000
    Returns pd.NaT when neither pattern matches.
    '''
    
    patterns = [r'\d+[-\s]\w+[-\s]\d{2,4}',
                r'\w*[\s\d,]*\d{4}'
               ]
    
    for pat in patterns:
        match = re.search(pat, string)  # a Match object (truthy) on success, None otherwise
        if match:
            # group(0) holds the entire regex match; pandas "to_datetime" converts it
            return pd.to_datetime(match.group(0))
    return pd.NaT  # neither pattern matched
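
To see what each pattern captures before any date conversion happens, the sketch below runs the same two regular expressions, in the same order, over a few made-up sentences. The helper name `first_match` and the sample strings are illustrative assumptions, not part of the challenge data. It uses only the standard library, so it shows what the regex extracts and nothing about how `pd.to_datetime` interprets the result.

```python
import re

# Same two patterns as in getDates, tried in the same order.
PATTERNS = [
    r'\d+[-\s]\w+[-\s]\d{2,4}',   # e.g. 16-APR-2005, 4-SEP-00
    r'\w*[\s\d,]*\d{4}',          # e.g. Nov 16, 1900, Jan 5 2000
]

def first_match(text):
    """Return the substring captured by the first pattern that matches, else None."""
    for pat in PATTERNS:
        match = re.search(pat, text)
        if match:
            return match.group(0)
    return None

# Hypothetical sample sentences, invented for illustration.
samples = [
    "get someone else to do it.15-APR-1944This record",
    "it was said Nov 16, 1900 and again later",
    "due on 4-SEP-00 at noon",
    "no date here",
]
for s in samples:
    print(repr(first_match(s)))
```

One thing worth keeping in mind: `\w*` in the second pattern is greedy, so any word characters glued directly to the month name become part of the match. For instance, `first_match("shoutNov 16, 1900")` returns the whole string `"shoutNov 16, 1900"`, which a date parser may not accept.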

In [4]:
df['Date'] = df.Field_1.apply(getDates)

In [5]:
df.head()

Unnamed: 0,Field_1,Date
0,He who sleeps on the floor will not fall off t...,2005-04-16
1,"After all is said and done, more is said than ...",1856-01-09
2,I want to see you shoot the way you shoutTeddy...,1900-11-16
3,get someone else to do it.15-APR-1944This reco...,1944-04-15
4,Why do they call it rush hour when nothing mov...,1970-06-27


In [6]:
# Year should be 1969 instead of 2069
df['Date'].iloc[11]

Timestamp('2069-09-16 00:00:00')

In [7]:
def correct_year(date_col):
    """If the parsed year lies in the future, shift the date back 100 years into the previous century."""
    
    year_today = date.today().year
    if date_col.year > year_today:
        return date_col - relativedelta(years=100)
    else:
        return date_col
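
The century correction can also be tried in isolation. The following is a minimal standard-library sketch of the same idea, not the function used in this notebook: the name `shift_future_year` and the explicit `reference_year` parameter are assumptions introduced here so that the result does not depend on the day the code is run.

```python
from datetime import date

def shift_future_year(d, reference_year):
    """Move a date 100 years back when its year lies after reference_year."""
    if d.year <= reference_year:
        return d
    try:
        return d.replace(year=d.year - 100)
    except ValueError:
        # 29 February with no counterpart 100 years earlier: fall back to 28 February
        return d.replace(year=d.year - 100, day=28)

print(shift_future_year(date(2069, 9, 16), reference_year=2024))  # 1969-09-16
print(shift_future_year(date(2005, 4, 16), reference_year=2024))  # 2005-04-16
```

Like `correct_year`, this rule assumes that no date in the data is genuinely in the future, and it cannot recover dates that are more than a century old but were written with a two-digit year.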

In [8]:
df['Date'] = df['Date'].apply(correct_year)
df.head()

Unnamed: 0,Field_1,Date
0,He who sleeps on the floor will not fall off t...,2005-04-16
1,"After all is said and done, more is said than ...",1856-01-09
2,I want to see you shoot the way you shoutTeddy...,1900-11-16
3,get someone else to do it.15-APR-1944This reco...,1944-04-15
4,Why do they call it rush hour when nothing mov...,1970-06-27


In [9]:
df['Date'].iloc[11]

Timestamp('1969-09-16 00:00:00')

## No differences with the Alteryx Solution:

In [10]:
alteryx = pd.read_csv('./04_files/output_alteryx.csv', parse_dates=[1])

In [11]:
alteryx.head()

Unnamed: 0,Field_1,DateTime_Out
0,He who sleeps on the floor will not fall off t...,2005-04-16
1,"After all is said and done, more is said than ...",1856-01-09
2,I want to see you shoot the way you shoutTeddy...,1900-11-16
3,get someone else to do it.15-APR-1944This reco...,1944-04-15
4,Why do they call it rush hour when nothing mov...,1970-06-27


In [12]:
# With no "on" argument, merge performs an inner join on every column name the two frames share
test = pd.merge(df,alteryx)

In [13]:
test.head()

Unnamed: 0,Field_1,Date,DateTime_Out
0,He who sleeps on the floor will not fall off t...,2005-04-16,2005-04-16
1,"After all is said and done, more is said than ...",1856-01-09,1856-01-09
2,I want to see you shoot the way you shoutTeddy...,1900-11-16,1900-11-16
3,get someone else to do it.15-APR-1944This reco...,1944-04-15,1944-04-15
4,Why do they call it rush hour when nothing mov...,1970-06-27,1970-06-27


In [14]:
# Rows where the two solutions disagree; an empty result means every merged row matches
test[test['Date'] != test['DateTime_Out']]

Unnamed: 0,Field_1,Date,DateTime_Out
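
An inner join only keeps rows that appear in both frames, so an empty comparison does not by itself show that every input row found a partner. One possible way to check that as well, sketched here on small invented frames (the data and the choice of `Field_1` as the join key are assumptions for illustration), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Invented stand-ins for the two result tables.
ours = pd.DataFrame({
    "Field_1": ["a", "b", "c"],
    "Date": pd.to_datetime(["2005-04-16", "1969-09-16", "1944-04-15"]),
})
theirs = pd.DataFrame({
    "Field_1": ["a", "b"],
    "DateTime_Out": pd.to_datetime(["2005-04-16", "1969-09-16"]),
})

# how="outer" keeps rows from both sides; indicator=True adds a "_merge" column
# with the values "both", "left_only" or "right_only".
merged = ours.merge(theirs, on="Field_1", how="outer", indicator=True)

# Rows present in only one of the two tables.
unmatched = merged[merged["_merge"] != "both"]

# Among rows present in both tables, those whose dates differ.
both = merged[merged["_merge"] == "both"]
mismatched = both[both["Date"] != both["DateTime_Out"]]

print(unmatched["Field_1"].tolist())
print(len(mismatched))
```

Here row `"c"` is reported as unmatched because it exists only in `ours`, while the rows found in both tables carry identical dates.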
