---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
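
The normalization rules above can be sketched as a small helper. This is only an illustration, not the assignment's reference solution: the name `normalize_date` and its string arguments are assumptions made for the example, and it expects the date parts to have been extracted already.

```python
from datetime import date

def normalize_date(year, month=None, day=None):
    """Apply the assignment's defaults to already-extracted date parts (hypothetical helper)."""
    y = int(year)
    if len(str(year)) == 2:          # two-digit years are assumed to be 19xx
        y += 1900
    m = int(month) if month else 1   # missing month -> January
    d = int(day) if day else 1       # missing day -> first of the month
    return date(y, m, d)

print(normalize_date("89", "1", "5"))   # 1989-01-05
print(normalize_date("2009", "9"))      # 2009-09-01
print(normalize_date("2010"))           # 2010-01-01
```

Returning `datetime.date` objects is one design choice; any representation that compares chronologically would work just as well for sorting.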

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
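
The expected output above is the list of original positions, read off in chronological order. A minimal sketch in plain Python, using the five example years from this section (the function name `chronological_order` is made up for the illustration):

```python
original = [1999, 2010, 1978, 2015, 1985]

def chronological_order(values):
    """Return the original positions, ordered by the value stored at each position."""
    return sorted(range(len(values)), key=lambda i: values[i])

order = chronological_order(original)
print(order)  # [2, 4, 0, 1, 3]
```

For a pandas Series with a default integer index, `pd.Series(s.sort_values().index)` gives the same positions.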

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
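
To get a feel for how Kendall's tau rewards a nearly correct ordering, here is a small pure-Python sketch of the tau-a variant for sequences without ties. Which variant the grader actually uses is not stated here, so treat this as an illustration of the idea only; `kendall_tau` is a hypothetical helper.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Tau-a for two equal-length sequences without ties (illustrative only)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1      # the pair is ordered the same way in both sequences
        elif s < 0:
            discordant += 1      # the pair is ordered in opposite ways
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

print(kendall_tau([0, 1, 2, 3], [0, 1, 2, 3]))  # 1.0
print(kendall_tau([0, 1, 2, 3], [3, 2, 1, 0]))  # -1.0
```

A single adjacent swap, such as `[0, 2, 1, 3]` against `[0, 1, 2, 3]`, leaves five of the six pairs concordant, so the score drops only to 4/6.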

*This function should return a Series of length 500 and dtype int.*

In [78]:
import re

In [79]:
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

In [80]:
# make the dataframe global so you can access it in functions
global df

In [81]:
dates = ['04/20/2009;' ,'04/20/09;' ,'4/20/09;' ,'4/3/09', 'Mar-20-2009;',' Mar 20, 2009;', 'March 20, 2009;', 'Mar. 20, 2009;', 
         'Mar 20 2009;','20 Mar 2009;', '20 March 2009;', '20 Mar. 2009;', '20 March, 2009','Mar 20th, 2009;',
         'Mar 21st, 2009;', 'Mar 22nd, 2009','Feb 2009;' ,'Sep 2009;' ,'Oct 2010','6/2008; ','12/2009','2010']

In [82]:
type1 = r'(\d{1,2})*[/-]*(\d{1,2})*[/-]*(\d{2,4})'  # numeric dates: 03/25/93, 3/5/1993, 6/2008, 12/2009, 2010, with slashes or hyphens
type2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[./-]?[- ]*(\d{1,2})*[a-z]*[,]?[- /](\d{2,4})'
# month name first: Mar-20-2009, Mar. 20, 2009, January/20/2000, Mar 20th, 2009, Feb 2009; two- or four-digit year
type3 = r'(\d{1,2})[ -/,.]((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[.,]? (\d{2,4})'
# day first: 20 Mar 2009, 20 March 2009, ...

In [83]:
for i in range(1,4):
    print('type'+str(i),end=',')

type1,type2,type3,

In [84]:
types = [type1,type2,type3]

In [85]:
for i in range(len(types)):
    typ = types[i]
    print('type'+str(i+1),'\n')
    for date in dates:
        print(date,'\t',re.sub(typ,'',date))
    print('\n\n')

type1 

04/20/2009; 	 ;
04/20/09; 	 ;
4/20/09; 	 ;
4/3/09 	 
Mar-20-2009; 	 Mar;
 Mar 20, 2009; 	  Mar , ;
March 20, 2009; 	 March , ;
Mar. 20, 2009; 	 Mar. , ;
Mar 20 2009; 	 Mar  ;
20 Mar 2009; 	  Mar ;
20 March 2009; 	  March ;
20 Mar. 2009; 	  Mar. ;
20 March, 2009 	  March, 
Mar 20th, 2009; 	 Mar th, ;
Mar 21st, 2009; 	 Mar st, ;
Mar 22nd, 2009 	 Mar nd, 
Feb 2009; 	 Feb ;
Sep 2009; 	 Sep ;
Oct 2010 	 Oct 
6/2008;  	 ; 
12/2009 	 
2010 	 



type2 

04/20/2009; 	 04/20/2009;
04/20/09; 	 04/20/09;
4/20/09; 	 4/20/09;
4/3/09 	 4/3/09
Mar-20-2009; 	 ;
 Mar 20, 2009; 	  ;
March 20, 2009; 	 ;
Mar. 20, 2009; 	 ;
Mar 20 2009; 	 ;
20 Mar 2009; 	 20 ;
20 March 2009; 	 20 ;
20 Mar. 2009; 	 20 ;
20 March, 2009 	 20 
Mar 20th, 2009; 	 ;
Mar 21st, 2009; 	 ;
Mar 22nd, 2009 	 
Feb 2009; 	 ;
Sep 2009; 	 ;
Oct 2010 	 
6/2008;  	 6/2008; 
12/2009 	 12/2009
2010 	 2010



type3 

04/20/2009; 	 04/20/2009;
04/20/09; 	 04/20/09;
4/20/09; 	 4/20/09;
4/3/09 	 4/3/09
Mar-20-2009; 	 Mar-20-2009;
 Mar 20, 

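The three patterns above each handle different cases, but they can't be applied in series: a loose pattern such as type1 also strips the digits out of dates it was not written for (for example, `Mar-20-2009;` becomes `Mar;`), so nothing usable is left for the later patterns.
Either divide the dates into types and merge them later, or keep separate, stricter patterns that each take care of one case.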
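
The problem and one possible way around it can be sketched as follows. The loose pattern is the `type1` from the cell above; the two stricter patterns and the `classify` helper are hypothetical and cover only two of the variants, to show the "first match wins" idea rather than a full solution.

```python
import re

# Loose pattern from the cell above: everything except the year is optional.
loose = r'(\d{1,2})*[/-]*(\d{1,2})*[/-]*(\d{2,4})'
print(repr(re.sub(loose, '', 'Mar-20-2009;')))  # 'Mar;'

MONTHS = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'

# Hypothetical stricter patterns, tried in order.
patterns = [
    ('numeric', r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'),
    ('month_name', MONTHS + r'[a-z]*[. -]+\d{1,2}(?:st|nd|rd|th)?,?[ -]\d{2,4}'),
]

def classify(text):
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in patterns:
        if re.search(pattern, text):
            return label
    return None

print(classify('04/20/09'))        # numeric
print(classify('Mar 20th, 2009'))  # month_name
print(classify('Mar-20-2009;'))    # month_name
```

Because each pattern demands its full set of parts, a pattern either recognizes the whole date or leaves the string alone.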

In [86]:
for i in range(len(types)):
    typ = types[i]
    print('type'+str(i+1),'\n')
    for i,date in enumerate(dates):
        
        print(date,'\t',re.sub(typ,'',date))
        dates[i] = re.sub(typ,'',date)
    print('\n\n')

type1 

04/20/2009; 	 ;
04/20/09; 	 ;
4/20/09; 	 ;
4/3/09 	 
Mar-20-2009; 	 Mar;
 Mar 20, 2009; 	  Mar , ;
March 20, 2009; 	 March , ;
Mar. 20, 2009; 	 Mar. , ;
Mar 20 2009; 	 Mar  ;
20 Mar 2009; 	  Mar ;
20 March 2009; 	  March ;
20 Mar. 2009; 	  Mar. ;
20 March, 2009 	  March, 
Mar 20th, 2009; 	 Mar th, ;
Mar 21st, 2009; 	 Mar st, ;
Mar 22nd, 2009 	 Mar nd, 
Feb 2009; 	 Feb ;
Sep 2009; 	 Sep ;
Oct 2010 	 Oct 
6/2008;  	 ; 
12/2009 	 
2010 	 



type2 

; 	 ;
; 	 ;
; 	 ;
 	 
Mar; 	 Mar;
 Mar , ; 	  Mar , ;
March , ; 	 March , ;
Mar. , ; 	 Mar. , ;
Mar  ; 	 Mar  ;
 Mar ; 	  Mar ;
 March ; 	  March ;
 Mar. ; 	  Mar. ;
 March,  	  March, 
Mar th, ; 	 Mar th, ;
Mar st, ; 	 Mar st, ;
Mar nd,  	 Mar nd, 
Feb ; 	 Feb ;
Sep ; 	 Sep ;
Oct  	 Oct 
;  	 ; 
 	 
 	 



type3 

; 	 ;
; 	 ;
; 	 ;
 	 
Mar; 	 Mar;
 Mar , ; 	  Mar , ;
March , ; 	 March , ;
Mar. , ; 	 Mar. , ;
Mar  ; 	 Mar  ;
 Mar ; 	  Mar ;
 March ; 	  March ;
 Mar. ; 	  Mar. ;
 March,  	  March, 
Mar th, ; 	 Mar th, ;
Mar st, ; 	 Mar 

In [87]:
print(dates)  # if the substitutions above had worked cleanly, only empty strings or lone semicolons would remain

[';', ';', ';', '', 'Mar;', ' Mar , ;', 'March , ;', 'Mar. , ;', 'Mar  ;', ' Mar ;', ' March ;', ' Mar. ;', ' March, ', 'Mar th, ;', 'Mar st, ;', 'Mar nd, ', 'Feb ;', 'Sep ;', 'Oct ', '; ', '', '']


## Carefully matching only specific cases, so that no pattern replaces parts of other dates it can't fully detect

In [88]:
type1 = r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})'  # 03/25/93, 3/5/1993, with slashes or hyphens
type2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[.]?[-/ ](\d{1,2})[a-z{2}]*[,/.]?[- ](\d{2,4})'
# month name first: Mar-20-2009, Mar. 20, 2009, March 20th, 2009; two- or four-digit year
# note: [a-z{2}]* is a character class (letters plus '{', '2', '}'), not "two letters"; [a-z]* alone would cover st/nd/rd/th
type3 = r'(\d{1,2})[ -/,.]((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[.,]? (\d{2,4})'
# day first: 20 Mar 2009, 20 March, 2009
type4 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[,]? (\d{2,4})'  # month name and year only: Feb 2009
type5 = r'(\d{1,2})[-/. ](\d{2,4})'  # numeric month and year: 6/2008, 10 2019, 10.2019
type6 = r'\d{4}'  # bare four-digit year: 2009, 2010

In [112]:
dates = ['04/20/2009;' ,'04/20/09;' ,'4/20/09;' ,'4/3/09', 'Mar-20-2009;',' Mar 20, 2009;', 'March 20, 2009;', 'Mar. 20, 2009;', 
         'Mar 20 2009;','20 Mar 2009;', '20 March 2009;', '20 Mar. 2009;', '20 March, 2009','Mar 20th, 2009;',
         'Mar 21st, 2009;', 'Mar 22nd, 2009','Feb 2009;' ,'Sep 2009;' ,'Oct 2010','6/2008; ','12/2009','2010']

In [90]:
for i in range(1,7):
    print('type'+str(i),end=',')

type1,type2,type3,type4,type5,type6,

In [113]:
types = [type1,type2,type3,type4,type5,type6]

In [116]:
for i in range(len(types)):
    typ = types[i]
    print('type'+str(i+1),'\n')
    for date in dates:
        print(date,'\t',list(re.findall(typ,date)))
    print('\n\n')

type1 

04/20/2009; 	 [('04', '20', '2009')]
04/20/09; 	 [('04', '20', '09')]
4/20/09; 	 [('4', '20', '09')]
4/3/09 	 [('4', '3', '09')]
Mar-20-2009; 	 []
 Mar 20, 2009; 	 []
March 20, 2009; 	 []
Mar. 20, 2009; 	 []
Mar 20 2009; 	 []
20 Mar 2009; 	 []
20 March 2009; 	 []
20 Mar. 2009; 	 []
20 March, 2009 	 []
Mar 20th, 2009; 	 []
Mar 21st, 2009; 	 []
Mar 22nd, 2009 	 []
Feb 2009; 	 []
Sep 2009; 	 []
Oct 2010 	 []
6/2008;  	 []
12/2009 	 []
2010 	 []



type2 

04/20/2009; 	 []
04/20/09; 	 []
4/20/09; 	 []
4/3/09 	 []
Mar-20-2009; 	 [('Mar', '20', '2009')]
 Mar 20, 2009; 	 [('Mar', '20', '2009')]
March 20, 2009; 	 [('Mar', '20', '2009')]
Mar. 20, 2009; 	 [('Mar', '20', '2009')]
Mar 20 2009; 	 [('Mar', '20', '2009')]
20 Mar 2009; 	 []
20 March 2009; 	 []
20 Mar. 2009; 	 []
20 March, 2009 	 []
Mar 20th, 2009; 	 [('Mar', '20', '2009')]
Mar 21st, 2009; 	 [('Mar', '21', '2009')]
Mar 22nd, 2009 	 [('Mar', '22', '2009')]
Feb 2009; 	 []
Sep 2009; 	 []
Oct 2010 	 []
6/2008;  	 []
12/2009 	 []
20

In [93]:
for i in range(len(types)):
    typ = types[i]
    print('type'+str(i+1),'\n')
    for i,date in enumerate(dates):
        
        print(date,'\t',re.sub(typ,'',date))
        dates[i] = re.sub(typ,'',date)
    print('\n\n')

type1 

04/20/2009; 	 ;
04/20/09; 	 ;
4/20/09; 	 ;
4/3/09 	 
Mar-20-2009; 	 Mar-20-2009;
 Mar 20, 2009; 	  Mar 20, 2009;
March 20, 2009; 	 March 20, 2009;
Mar. 20, 2009; 	 Mar. 20, 2009;
Mar 20 2009; 	 Mar 20 2009;
20 Mar 2009; 	 20 Mar 2009;
20 March 2009; 	 20 March 2009;
20 Mar. 2009; 	 20 Mar. 2009;
20 March, 2009 	 20 March, 2009
Mar 20th, 2009; 	 Mar 20th, 2009;
Mar 21st, 2009; 	 Mar 21st, 2009;
Mar 22nd, 2009 	 Mar 22nd, 2009
Feb 2009; 	 Feb 2009;
Sep 2009; 	 Sep 2009;
Oct 2010 	 Oct 2010
6/2008;  	 6/2008; 
12/2009 	 12/2009
2010 	 2010



type2 

; 	 ;
; 	 ;
; 	 ;
 	 
Mar-20-2009; 	 ;
 Mar 20, 2009; 	  ;
March 20, 2009; 	 ;
Mar. 20, 2009; 	 ;
Mar 20 2009; 	 ;
20 Mar 2009; 	 20 Mar 2009;
20 March 2009; 	 20 March 2009;
20 Mar. 2009; 	 20 Mar. 2009;
20 March, 2009 	 20 March, 2009
Mar 20th, 2009; 	 ;
Mar 21st, 2009; 	 ;
Mar 22nd, 2009 	 
Feb 2009; 	 Feb 2009;
Sep 2009; 	 Sep 2009;
Oct 2010 	 Oct 2010
6/2008;  	 6/2008; 
12/2009 	 12/2009
2010 	 2010



type3 

; 	 ;
; 	 ;
; 	 ;


In [94]:
print(dates)  # this works: every date has been removed, leaving only separators

[';', ';', ';', '', ';', ' ;', ';', ';', ';', ';', ';', ';', '', ';', ';', '', ';', ';', '', '; ', '', '']


In [132]:
def date_sorter():
    import re
    import pandas as pd
    
    doc = []
    with open('dates.txt') as file:
        for line in file:
            doc.append(line)

    df = pd.Series(doc)
    
    type1 = r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})'  # 03/25/93, 3/5/1993, with slashes or hyphens
    type2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[.]?[-/ ](\d{1,2})[a-z{2}]*[,/.]?[- ](\d{2,4})'
    # month name first: Mar-20-2009, Mar. 20, 2009, March 20th, 2009; two- or four-digit year
    # note: [a-z{2}]* is a character class (letters plus '{', '2', '}'), not "two letters"; [a-z]* alone would cover st/nd/rd/th
    type3 = r'(\d{1,2})[ -/,.]((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[.,]? (\d{2,4})'
    # day first: 20 Mar 2009, 20 March, 2009
    type4 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[a-z]*[,]? (\d{2,4})'  # month name and year only: Feb 2009
    type5 = r'(\d{1,2})[-/. ](\d{2,4})'  # numeric month and year: 6/2008, 10 2019, 10.2019
    type6 = r'\d{4}'  # bare four-digit year: 2009, 2010
    
    
    month_dict = {'Jan': '01', 
                  'Feb': '02' ,
                  'Mar': '03',
                  'Apr':'04',
                  'May':'05',
                  'Jun':'06',
                  'Jul':'07',
                  'Aug':'08',
                  'Sep':'09',
                  'Oct':'10',
                  'Nov':'11',
                  'Dec':'12'}
    
    # Try the patterns in order and stop at the first match: most dates are caught
    # by the first two types, so running every pattern on every note would waste work.
    dates = []
    for string in df:
        # type1: 04/20/2009 (mm/dd/yy)
        if (re.findall(type1,string)):
            date=list(re.findall(type1,string)[0])  # only the first match is used; a loop over findall() would be needed if a note held several dates
            if len(date[2])==2:   # year: two digits means 19xx
                date[2] ='19'+date[2]
            if len(date[0])==1:   # month
                date[0] = '0'+ date[0]
            if len(date[1])==1:   # day
                date[1] = '0'+ date[1]
            dates.append(date[2]+date[0]+date[1])  # YYYYMMDD
                     
#         '''Mar 22nd, 2009    Mar 22, 2009'''
        elif (re.findall(type2,string)): 
            date=list(re.findall(type2,string)[0])
            if len(date[2])==2: # year
                date[2] ='19'+date[2]
            if len(date[1])==1:  #date
                date[1] = '0'+ date[1]
            date[0] = month_dict[date[0]] #month
            dates.append(date[2]+date[0]+date[1])

#         '''20 March 2009 20 Mar. 2009'''
        elif re.findall(type3,string):
            date=list(re.findall(type3,string)[0])
            if len(date[2])==2: # year
                date[2] ='19'+date[2]
            if len(date[0])==1:  #date
                date[0] = '0'+ date[0]
            date[1] = month_dict[date[1]]
            dates.append(date[2]+date[1]+date[0])

#         '''Sep 2009'''
        elif re.findall(type4,string):
            date=list(re.findall(type4,string)[0])
            date[0] = month_dict[date[0]] #month
            if len(date[1])==2: # year
                date[1] ='19'+date[1]
            dates.append(date[1]+date[0]+'01')

#         '''6/2008'''
        elif re.findall(type5,string):
            date=list(re.findall(type5,string)[0])
            if len(date[0])== 1: #month
                date[0] = '0'+date[0]
            if len(date[1])==2: #year
                date[1] = '19'+date[1]
            dates.append(date[1]+date[0]+'01')

#         '''2010'''
        elif re.findall(type6,string):
            date=(re.findall(type6,string)[0])
            dates.append(date+'01'+'01')

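        else :
            # no pattern matched: print the note so the miss is visible
            # (nothing is appended, so the positions of later notes would shift)
            print(string)

    # sort the YYYYMMDD strings once, after the loop, and return the original indices in that order
    datesdf = pd.Series(dates)
    datesdf.sort_values(inplace=True)
    result  = pd.Series(datesdf.index)

    return result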
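
The function above sorts plain strings, which works only because every date is written as a zero-padded `YYYYMMDD` key. A minimal sketch of why the padding matters (the helper `sortable_key` is hypothetical and mirrors the padding done by hand in each branch):

```python
def sortable_key(year, month="1", day="1"):
    """Build a zero-padded YYYYMMDD string from string parts (hypothetical helper)."""
    if len(year) == 2:               # two-digit years are assumed to be 19xx
        year = "19" + year
    return year + month.zfill(2) + day.zfill(2)

keys = [sortable_key("2008", "6"), sortable_key("71", "7", "8"), sortable_key("2009", "12", "1")]
print(sorted(keys))  # ['19710708', '20080601', '20091201']

# Without padding, string comparison puts December 2009 before June 2009.
unpadded = ["2009" + "6" + "1", "2009" + "12" + "1"]
print(sorted(unpadded))  # ['2009121', '200961']
```

Converting the parts to `datetime.date` objects would avoid the padding step entirely, at the cost of rejecting impossible dates that typos in the raw data might produce.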

In [134]:
ok = date_sorter()

In [135]:
ok

0        9
1       84
2        2
3       53
4       28
5      474
6      153
7       13
8      129
9       98
10     111
11     225
12      31
13     171
14     191
15     486
16     335
17     415
18      36
19     405
20     323
21     422
22     375
23     380
24     345
25      57
26     481
27     436
28     104
29     299
      ... 
470    243
471    139
472    320
473    383
474    244
475    286
476    480
477    431
478    279
479    198
480    381
481    463
482    366
483    439
484    255
485    401
486    475
487    257
488    152
489    235
490    464
491    253
492    231
493    427
494    141
495    186
496    161
497    413
498    392
499    490
Length: 500, dtype: int64