# Week 1 Assignment
- **Assignment Description**
Dates are expressed in many different ways. In India, the date is usually expressed in **`dd-MM-yy`**; while in the US, it is usually expressed in **`MM-dd-yyyy`**. Sometimes people also use month abbreviations, like **`Aug.`**. 

 This assignment aims to help you get familiar with regular expressions (or "regex") and apply your knowledge to real-world data. You will be able to:
 - Use Regular Expressions to extract dates in various formats from text


- **Requirements**
 - Extract dates from the provided .txt file
 - Transform/Normalize extracted dates into desired format **`yyyy-MM-dd`**
   - Example: Extracted date **`Sep 27, 2021`**; Formatted date **`2021-09-27`**

- **Rules for Normalizing Formatted Dates**
 - Assume all dates in **`xx/xx/xx`** format are **`mm/dd/yy`**
 - Assume dates in **`xx/xx`** format are **`mm/yy`**
 - Assume all dates where the year is encoded in only two digits are years from the 1900s (e.g. `1/5/89` is `January 5th, 1989`)
 - If the day is missing (e.g. `9/2009`), assume it is the first day of the month (e.g. `September 1, 2009`)
 - If the month is missing (e.g. `2010`), assume it is the first of January of that year (e.g. `January 1, 2010`).
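
 The rules above can be sketched as a pair of small functions. This is only an illustrative sketch, not the required solution: the names `expand_year` and `normalize_numeric` are hypothetical, and it assumes the input is already a purely numeric date such as `1/5/89`, `9/2009`, or `2010`.

```python
def expand_year(year: str) -> int:
    """Hypothetical helper: treat a two-digit year as 19xx."""
    value = int(year)
    return value + 1900 if len(year) <= 2 else value


def normalize_numeric(date: str) -> str:
    """Normalize mm/dd/yy(yy), mm/yy(yy), or yyyy to yyyy-MM-dd."""
    parts = date.replace('-', '/').split('/')
    if len(parts) == 3:        # xx/xx/xx is read as mm/dd/yy
        month, day, year = parts
    elif len(parts) == 2:      # xx/xx is read as mm/yy; day defaults to 1
        month, year = parts
        day = '1'
    else:                      # year only; month and day default to 1
        month, day, year = '1', '1', parts[0]
    return f"{expand_year(year):04d}-{int(month):02d}-{int(day):02d}"


print(normalize_numeric('1/5/89'))   # 1989-01-05
print(normalize_numeric('9/2009'))   # 2009-09-01
print(normalize_numeric('2010'))     # 2010-01-01
```

 Converting each part with `int()` and formatting with `:02d` zero-pads single-digit months and days in one step.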

- **Additional Info**
 - You can use this [external tool](https://regex101.com/) to help complete the assignment
 - [Regular Expression Documentation](https://docs.python.org/3/library/re.html)
---



### Data File Description
The file we're using for this assignment is named **`dates.txt`**. Each row consists of a row index, a tab, and the corresponding content. For example:
- `0<tab>: 10/12/92Total time of visit (in minutes):`
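
A row in this layout can be separated into its index and content with a single split on the first tab. The snippet below is a minimal sketch; the `sample` string is just the example row above written out with a literal tab.

```python
# The example row from above, with <tab> written as an actual tab character.
sample = "0\t: 10/12/92Total time of visit (in minutes):\n"

# maxsplit=1 keeps any later tabs inside the content untouched.
row_index, content = sample.rstrip("\n").split("\t", 1)

print(row_index)  # 0
print(content)    # : 10/12/92Total time of visit (in minutes):
```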


### Code Template

##### Please modify the code template as much as you want. If you find that using functions would be useful, please do that!

#### Step 1: Read File
- Try to print out some lines and observe what kinds of dates are included in the file

In [1]:
import re

In [2]:
with open('data/dates.txt') as file:
  lines = file.readlines()

print(*lines[:5],sep='\n')  # print the first 5 rows in the file

0	: Na 130 on 7/21/1999Pertinent Medical Review of Systems Constitutional:

1	"""Hx of suicidal ideation and last felt suicidal in Marc, 1981. No Hx of suicide attempts. ""Felt that after being sober for 5 years and in custody for 22 months that I just wasn't getting it. I couldn't do it. He had my parents come in the following week and we talked and that's when we decided I should go on the methadone clinic,""Hx of Non Suicidal Self Injurious Behavior: No"

2	s 03/1980 Positive PPD: treated with INH for 6 months

3	: 7/11/90CPT code: 99205

4	: 6/02/1986CPT Code: 90792: With medical services



#### Step 2: Extract Dates
- In our template, we're using `reg = "\w+"`, which means that we want to select all "words" (consecutive word characters) in a string.
- We're also using the function `re.findall()`, which returns a list of the items matching the regular expression you choose. You can use any other function that helps you find the dates you want.
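
To make the behaviour of `re.findall()` concrete, here is a small sketch run on a shortened version of row 0. The date pattern used here is an illustrative assumption that only covers the `mm/dd/yyyy` shape, not every format in the file.

```python
import re

text = "Na 130 on 7/21/1999"

# \w+ selects every run of word characters, so the date is broken apart at "/".
words = re.findall(r"\w+", text)
print(words)  # ['Na', '130', 'on', '7', '21', '1999']

# A pattern that includes the separators keeps the date in one piece.
dates = re.findall(r"\d{1,2}/\d{1,2}/\d{2,4}", text)
print(dates)  # ['7/21/1999']
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting backslashes such as `\d` before the regex engine sees them.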

In [54]:
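# raw strings keep escapes such as \d intact for the regex engine
reg = r"(\d{1,2}[/-]\d{1,2}?[/-]?\d{2,4})"  # numeric dates: mm/dd/yy(yy) or mm/yy(yy)
reg_2 = r'(?:Jan|Feb|Marc|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*,? \d{2,4}'  # month name followed by a number
reg_3 = r'(\d{4})'  # bare four-digit year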
extracted_contents = {}  # {row_index: extracted_contents}

for line in lines:
    line_content = line.split('\t')
    extracted_contents[line_content[0]] = re.findall(reg, line_content[1])
    if len(extracted_contents[line_content[0]]) == 0 : 
        extracted_contents[line_content[0]] = re.findall(reg_2 , line_content[1])
    if len(extracted_contents[line_content[0]]) == 0 :
        extracted_contents[line_content[0]] = re.findall(reg_3 , line_content[1])


In [55]:
extracted_contents # run this to see what you have for each row

{'0': ['7/21/1999'],
 '1': ['Marc, 1981'],
 '2': ['03/1980'],
 '3': ['7/11/90'],
 '4': ['6/02/1986'],
 '5': ['Mar 2012'],
 '6': ['6/18/85'],
 '7': ['Sep 2007'],
 '8': ['Dec 1975'],
 '9': ['8/1986'],
 '10': ['March 1979'],
 '11': ['9/17/72'],
 '12': ['Oct 2003'],
 '13': ['11/24/94'],
 '14': ['November 1987'],
 '15': ['8/16/92'],
 '16': ['12/12/97'],
 '17': ['9/30/76'],
 '18': ['May 2010'],
 '19': ['June 1981'],
 '20': ['March 1973'],
 '21': ['12/8/97'],
 '22': ['09/13/94'],
 '23': ['9/1984'],
 '24': ['4/18/86'],
 '25': ['2/14/77'],
 '26': ['7/13/97'],
 '27': ['Apr 2007'],
 '28': ['4/12/88'],
 '29': ['6/10/71'],
 '30': ['12/12/1993'],
 '31': ['2002'],
 '32': ['3/26/81'],
 '33': ['Aug 2010'],
 '34': ['8/2010'],
 '35': ['10/6/79'],
 '36': ['8/31/77'],
 '37': ['1975'],
 '38': ['9/09/94'],
 '39': ['12/1975'],
 '40': ['9/21/79'],
 '41': ['3/18/80'],
 '42': ['1/1983'],
 '43': ['8/30/77'],
 '44': ['4/1972'],
 '45': ['05/12/1995'],
 '46': ['6/08/90'],
 '47': ['2/9/2008'],
 '48': ['08/21/77'],
 '

#### Step 3: Date Normalization
- Here, we show in code some of the rules explained above.

  - Example 1: If a date looks like `09/1975` or `12/1989`, you need to transform it into `1975-09-01` or `1989-12-01`, respectively.


In [5]:
# example 1
date = '09/1975'
date_split = date.split('/')
new_date = f"{date_split[1]}-{date_split[0]}-01"
new_date

'1975-09-01'

  - Example 2: If a date looks like `June 1981`, you need to transform it into `1981-06-01`.

In [6]:
# example 2
date = 'June 1981'
date_split = date.split(' ')
i = '06' if date_split[0]=='June' else 'write your own code'
new_date = f"{date_split[1]}-{i}-01"
new_date

'1981-06-01'

Based on the rules we provided at the beginning, try to transform all the dates in the file into our desired format. If you used the template for Step 2 provided above, all the dates will be in the variable `extracted_contents`. Otherwise, get the extracted dates from wherever you stored them. You will need each line's number and your normalized date to write to the file (see the submission requirements below).
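
One possible way to replace the `'write your own code'` placeholder in Example 2 is a lookup on the first three letters of the month name. This is a sketch under the assumption that every month word starts with a standard English three-letter abbreviation; `MONTH_ABBRS` and `month_number` are hypothetical names.

```python
MONTH_ABBRS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def month_number(name: str) -> str:
    """Return the two-digit month for a name such as 'June', 'Sep', or 'Marc'."""
    prefix = name[:3].title()
    # list.index raises ValueError if the word is not a month at all
    return f"{MONTH_ABBRS.index(prefix) + 1:02d}"


date = 'June 1981'
month_word, year = date.split(' ')
print(f"{year}-{month_number(month_word)}-01")  # 1981-06-01
```

Because only the first three letters are compared, abbreviations, full names, and misspellings such as `Marc` all resolve to the same month.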

In [124]:
########### YOUR CODE GOES HERE ###########
## here we split the extracted dates into several dictionaries so that each kind can be processed separately:
######################
#three pieces and two#
#####################
reg_three = r"(\d{1,2}[/-]\d{1,2}?[/-]?\d{2,4})"
three_dict ={}
for line in lines :
    line_cont = line.split('\t')
    finding = re.findall(reg_three , line_cont[1])
    if len(finding) !=0:
        three_dict[line_cont[0]]= finding 
        
# print(three_dict)
##############
# one piece ##
##############
one_dict ={}
keys = extracted_contents.keys()
for i in keys :
    # skip rows with no match; a 4-character first match is a bare year
    if extracted_contents[i] and len(extracted_contents[i][0])==4:
        one_dict[i] = extracted_contents[i]

# print(one_dict)
##############
# MONTHS #####
##############
reg_months = r'(?:Jan|Feb|Marc|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*,? \d{2,4}'
month_dict ={}
for line in lines :
    line_cont = line.split('\t')
    finding = re.findall(reg_months , line_cont[1])
    if len(finding) !=0:
        month_dict[line_cont[0]]= finding 
        

# print(month_dict)
############################################
## here we are ready to process our data 

#####################
## three&two process#
#####################

process_three_dict = {}
for key in three_dict.keys() :
    cont = three_dict[key][0]
    pro_cont = re.sub('/','-',cont)
    spliting = pro_cont.split('-')
    length = len(spliting)
    month_date = spliting[0]
    if len(spliting[0])<2:
        month_date = f'0{month_date}'
        
        
    if length == 2 :
        if len(spliting[1])< 4 :
            new_time = int(spliting[1]) + 1900
            form = f'{new_time}-{month_date}-01'
        else :
            form = f'{spliting[1]}-{month_date}-01'
            
        
    elif length ==3:
        if len(spliting[2])< 4 : 
            new_time = int(spliting[2]) + 1900
            form = f'{new_time}-{month_date}-{spliting[1]}'
        else :
            form = f'{spliting[2]}-{month_date}-{spliting[1]}'
            
    process_three_dict[key] = form 
    
# print(process_three_dict)


#####################
##### one process ###
#####################

process_one_dict = {}
for key in one_dict.keys():
    cont = one_dict[key][0]
    form = f'{cont}-01-01'
    process_one_dict[key] = form


# print(process_one_dict)

#####################
## months process ###
#####################
# accepted spellings for each month, keyed by the two-digit month number
MONTH_NAMES = {
    '01': ['Jan', 'January'],
    '02': ['Feb', 'February'],
    '03': ['Mar', 'Marc', 'March'],
    '04': ['Apr', 'April'],
    '05': ['May'],
    '06': ['Jun', 'June'],
    '07': ['Jul', 'July'],
    '08': ['Aug', 'August'],
    '09': ['Sep', 'September'],
    '10': ['Oct', 'October'],
    '11': ['Nov', 'November'],
    '12': ['Dec', 'December'],
}
# invert to {spelling: month number} for a direct lookup
MONTH_NUMBER = {name: num for num, names in MONTH_NAMES.items() for name in names}

process_month_dict ={}

for key in month_dict.keys() :
    cont = month_dict[key][0]
    date = cont.split(' ')[1]
    first_word = cont.split(' ')[0].rstrip(',')  # drop a trailing comma, e.g. 'Marc,'
    month = MONTH_NUMBER.get(first_word)
    if month is not None:  # ignore words that only start like a month name
        process_month_dict[key] = f'{date}-{month}-01'
           
        
# print(process_month_dict)
##############
# collecting #
##############


### finally, we have three dictionaries that hold all the processed data
#process_three_dict
#process_one_dict
#process_month_dict

                        ###############################################################
                        ### now we will collect our dictionaries in final_dictionary ##
                        ###############################################################

final_dictionary ={}

#process_three_dict
for key in process_three_dict.keys():
    final_dictionary[int(key)] = process_three_dict[key]
            
#process_one_dict
for key in process_one_dict.keys():
    final_dictionary[int(key)] = process_one_dict[key]
            
#process_month_dict
for key in process_month_dict.keys():
    final_dictionary[int(key)] =process_month_dict[key]
            

        
        
## we need to sort our dictionary by row index
our_keys = list(final_dictionary.keys())
sorted_keys = sorted(our_keys)
# print(sorted_keys)


final_output_dict ={i:final_dictionary[i] for i in sorted_keys}

# print(final_output_dict)



output = []

for key in final_output_dict.keys():
    line = f'{key}\t{final_output_dict[key]}'
    output.append(line)
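

In the three-part branch above, the month is zero-padded but the day is inserted exactly as it was extracted, so a one-digit day such as the `8` in `12/8/97` comes out as `1997-12-8`. A minimal sketch of one way to pad both parts uses `str.zfill`; the helper name `pad2` is hypothetical.

```python
def pad2(value: str) -> str:
    """Hypothetical helper: left-pad a numeric string with zeros to two digits."""
    return value.zfill(2)


month, day, year = "12/8/97".split("/")
form = f"{int(year) + 1900}-{pad2(month)}-{pad2(day)}"
print(form)  # 1997-12-08
```

Applying the same helper to the day in both the two-digit-year and four-digit-year branches would keep every result in `yyyy-MM-dd` form.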
    
       


#### Step 4: Write Your Results to  `.txt` File

- You may change this if you find it easier to write to the file differently, as long as it follows the submission requirements.

In [126]:
with open("assignment1.txt", "w") as f:
    for line in output:
        f.write(line)
        f.write('\n')

        
        


In [128]:
with open("assignment1.txt", "r") as f:
    for lin in f :
        print(lin)

0	1999-07-21

1	1981-03-01

2	1980-03-01

3	1990-07-11

4	1986-06-02

5	2012-03-01

6	1985-06-18

7	2007-09-01

8	1975-12-01

9	1986-08-01

10	1979-03-01

11	1972-09-17

12	2003-10-01

13	1994-11-24

14	1987-11-01

15	1992-08-16

16	1997-12-12

17	1976-09-30

18	2010-05-01

19	1981-06-01

20	1973-03-01

21	1997-12-8

22	1994-09-13

23	1984-09-01

24	1986-04-18

25	1977-02-14

26	1997-07-13

27	2007-04-01

28	1988-04-12

29	1971-06-10

30	1993-12-12

31	2002-01-01

32	1981-03-26

33	2010-08-01

34	2010-08-01

35	1979-10-6

36	1977-08-31

37	1975-01-01

38	1994-09-09

39	1975-12-01

40	1979-09-21

41	1980-03-18

42	1983-01-01

43	1977-08-30

44	1972-04-01

45	1995-05-12

46	1990-06-08

47	2008-02-9

48	1977-08-21

49	1998-01-27

50	1982-05-02

51	1979-02-18

52	25-07-01

53	1979-08-01

54	1987-10-11

55	1980-10-17

56	1983-02-01

57	2013-05-18

58	1980-01-02

59	2011-08-24

60	1979-01-29

61	1977-05-21

62	1979-02-12

63	1978-09-08

64	1981-07-9

65	2013-06-01

66	2007-06-01

67	1985-09-

## **Submission Requirements**
1. Make sure your code writes the results to a file named **`assignment1.txt`** for submission
2. Submit your results as a **.txt** file
3. Each line in your file must follow this **`<line_number><tab><formatted_date>`** format.
  > Example: `0    1992-10-12`
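
Before submitting, each output line can be checked against this format. The following is a sketch only: `LINE_PATTERN` and `is_valid_line` are hypothetical names, and it assumes the formatted date must be a zero-padded, real calendar date.

```python
import re
from datetime import datetime

# <line_number><tab><yyyy-MM-dd>, with nothing before or after
LINE_PATTERN = re.compile(r"^(\d+)\t(\d{4}-\d{2}-\d{2})$")


def is_valid_line(line: str) -> bool:
    """Return True if the line is '<line_number>\\t<yyyy-MM-dd>' with a real date."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return False
    try:
        # rejects impossible dates such as month 13 or February 30
        datetime.strptime(match.group(2), "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(is_valid_line("0\t1992-10-12\n"))  # True
print(is_valid_line("21\t1997-12-8\n"))  # False (day is not zero-padded)
```

Running this check over every line of `assignment1.txt` is a quick way to spot rows whose dates were not fully normalized.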



In [None]:
# Check if you got 200/500
# 30 points



In [None]:
# Check if you got 250/500
# 20 points

In [None]:
# Check if you got 300/500
# 20 points

In [None]:
# Check if you got 350/500
# 15 points

In [None]:
# Check if you got 400/500
# 15 points