### The `dateutil` Library

https://dateutil.readthedocs.io/en/stable/

This library has a lot of functionality, but we're going to focus on just one aspect of it - the parser.

This parser provides a very simple mechanism to parse a date/time string into a `datetime` object.

Remember how we always had to specify a format string to use the `strptime` function:

In [1]:
from datetime import datetime

In [2]:
datetime.strptime('2020-01-01T10:30:00', '%Y-%m-%dT%H:%M:%S')

But if we have a different time format, we have to accommodate it in our format string:

In [3]:
datetime.strptime('2020-01-01 10:30:00 am', '%Y-%m-%d %I:%M:%S %p')

This is where `dateutil` comes in handy - it can recognize many different formats automatically:

In [4]:
from dateutil import parser

In [5]:
parser.parse('2020-01-01T10:30:00')

In [6]:
parser.parse('2020-01-01 10:30:00 am')
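To make the comparison concrete, here is a small sketch (assuming `python-dateutil` is installed) that checks the parser gives the same result as the explicit `strptime` calls for the two strings used above:

```python
from datetime import datetime
from dateutil import parser

# Explicit format strings with strptime
a = datetime.strptime('2020-01-01T10:30:00', '%Y-%m-%dT%H:%M:%S')
b = datetime.strptime('2020-01-01 10:30:00 am', '%Y-%m-%d %I:%M:%S %p')

# Same strings, no format string needed
c = parser.parse('2020-01-01T10:30:00')
d = parser.parse('2020-01-01 10:30:00 am')

print(a == c, b == d)
```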

It can even recognize timezone information:

In [7]:
parser.parse('2020-01-01 10:30:00 am +01:00')
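As a quick sketch of what "recognizing timezone information" means in practice, the parsed result should be an aware datetime whose UTC offset matches the `+01:00` in the string, so it can be converted to UTC directly:

```python
from datetime import datetime, timedelta, timezone
from dateutil import parser

dt = parser.parse('2020-01-01 10:30:00 am +01:00')

# The offset from the string is attached as tzinfo
print(dt.tzinfo)
print(dt.utcoffset())

# Because the datetime is aware, converting to UTC is straightforward
dt_utc = dt.astimezone(timezone.utc)
print(dt_utc.isoformat())
```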

It can even differentiate between `M/D/Y` and `D/M/Y` in **some** cases:

In [8]:
parser.parse('12/31/2020')

In [9]:
parser.parse('31/12/2020')
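The reason this works is that `31` cannot be a month, so only one reading is valid. A minimal check that both strings end up as the same date:

```python
from datetime import datetime
from dateutil import parser

mdy = parser.parse('12/31/2020')
dmy = parser.parse('31/12/2020')

# Both should resolve to December 31, 2020
print(mdy, dmy, mdy == dmy)
```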

Some dates are ambiguous, though. For example, what is this date?

```
3/6/2020
```

Is it June 3, or March 6?

Let's see the default that `parser` uses:

In [10]:
parser.parse('3/6/2020')

As we can see, the default is to use `M/D/Y`.

But if we know that the dates are specified in `D/M/Y` format, we can tell the parser about it:

In [11]:
parser.parse('3/6/2020', dayfirst=True)

datetime.datetime(2020, 6, 3, 0, 0)
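Here is a short sketch that puts the two interpretations side by side. It also illustrates that `dayfirst` acts as a hint for ambiguous strings: when the second number cannot be a month, the parser should still fall back to the only valid reading.

```python
from datetime import datetime
from dateutil import parser

# Ambiguous string: default is M/D/Y, dayfirst=True switches to D/M/Y
default = parser.parse('3/6/2020')
day_first = parser.parse('3/6/2020', dayfirst=True)
print(default, day_first)

# Unambiguous string: 31 cannot be a month, so dayfirst=True does not
# turn this into an invalid "month 31"
unambiguous = parser.parse('12/31/2020', dayfirst=True)
print(unambiguous)
```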

Remember also that we dealt with datetime strings that may have had additional text in them that we need to ignore when converting to a datetime:

In [12]:
s = "Today is May the 23, 2020 at 3pm UTC"

We can sometimes get `parser` to understand these by applying a "fuzzy" parse. A regular parse fails on this string:

In [13]:
try:
    parser.parse(s)
except Exception as ex:
    print(type(ex), ex)

<class 'dateutil.parser._parser.ParserError'> Unknown string format: Today is May the 23, 2020 at 3pm UTC


But if we set the `fuzzy_with_tokens` argument to `True` (it defaults to `False`), the parser will actually be able to parse a datetime out of that string:

In [14]:
parser.parse(s, fuzzy_with_tokens=True)

(datetime.datetime(2020, 5, 23, 15, 0, tzinfo=tzutc()),
 ('Today is ', ' the ', ' ', 'at ', ' '))

With the `fuzzy_with_tokens=True` option, the returned value is actually a tuple that contains two elements:
- first item is the parsed datetime
- second item is a tuple containing the text elements that were ignored when parsing the datetime
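If the skipped text is not needed, the parser also accepts a plain `fuzzy=True` argument, which returns just the datetime. A small sketch comparing the two options on the same string:

```python
from datetime import datetime, timezone
from dateutil import parser

s = "Today is May the 23, 2020 at 3pm UTC"

# fuzzy=True returns only the datetime
dt_only = parser.parse(s, fuzzy=True)

# fuzzy_with_tokens=True returns (datetime, skipped_tokens)
dt, skipped = parser.parse(s, fuzzy_with_tokens=True)

expected = datetime(2020, 5, 23, 15, 0, tzinfo=timezone.utc)
print(dt_only == dt == expected)
print(skipped)
```

Fuzzy parsing is best treated as a convenience: since it silently ignores whatever it cannot interpret, it is worth inspecting the skipped tokens when the input is not well understood.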

The parser will raise a `ParserError` if it cannot parse a datetime string, either because of an unrecognized format, or, if the format is recognized, but an invalid datetime is specified:

In [15]:
try:
    parser.parse('2020/30/30')
except Exception as ex:
    print(type(ex), ex)

<class 'dateutil.parser._parser.ParserError'> month must be in 1..12: 2020/30/30


In [16]:
try:
    parser.parse('May the fourth, 2020')
except Exception as ex:
    print(type(ex), ex)

<class 'dateutil.parser._parser.ParserError'> Unknown string format: May the fourth, 2020
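In recent `dateutil` versions `ParserError` is a subclass of `ValueError`, so catching `ValueError` is a reasonably portable way to handle both failure modes. The helper below is a hypothetical convenience wrapper (the name `try_parse` is not part of `dateutil`) that returns `None` instead of raising:

```python
from datetime import datetime
from dateutil import parser

def try_parse(text):
    """Hypothetical helper: return a datetime, or None if parsing fails."""
    try:
        return parser.parse(text)
    except (ValueError, OverflowError):
        # ValueError covers ParserError (unknown format or invalid date);
        # OverflowError can occur for out-of-range numeric values
        return None

print(try_parse('2020-01-01'))
print(try_parse('2020/30/30'))
print(try_parse('May the fourth, 2020'))
```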


Let's apply all this to the following problem:

We have a file (located in the same directory as this notebook) called `DEXUSEU.csv` which we've seen before.

In [17]:
source_file = 'DEXUSEU.csv'

In [18]:
with open(source_file) as f:
    for _ in range(5):
        print(next(f), end='')

DATE,DEXUSEU
2015-04-03,1.0990
2015-04-06,1.1008
2015-04-07,1.0850
2015-04-08,1.0818


Because we know our data source, we know that the dates are all 5pm New York local times (ET).

Our objective is to transform this data into a new CSV file that will contain an ISO-formatted UTC datetime string (with timezone information), e.g.:

```
'2015-04-03T21:00:00+00:00'
```

In [19]:
import csv

with open(source_file) as source:
    csv_reader = csv.reader(source)
    
    header = next(csv_reader)
    print(header)
    for row in csv_reader:
        print(row)

['DATE', 'DEXUSEU']
['2015-04-03', '1.0990']
['2015-04-06', '1.1008']
['2015-04-07', '1.0850']
['2015-04-08', '1.0818']
['2015-04-09', '1.0671']
['2015-04-10', '1.0598']
['2015-04-13', '1.0582']
['2015-04-14', '1.0672']
['2015-04-15', '1.0596']
['2015-04-16', '1.0742']
['2015-04-17', '1.0780']
['2015-04-20', '1.0763']
['2015-04-21', '1.0758']
['2015-04-22', '1.0729']
['2015-04-23', '1.0803']
['2015-04-24', '1.0876']
['2015-04-27', '1.0892']
['2015-04-28', '1.0979']
['2015-04-29', '1.1174']
['2015-04-30', '1.1162']
['2015-05-01', '1.1194']
['2015-05-04', '1.1145']
['2015-05-05', '1.1174']
['2015-05-06', '1.1345']
['2015-05-07', '1.1283']
['2015-05-08', '1.1241']
['2015-05-11', '1.1142']
['2015-05-12', '1.1240']
['2015-05-13', '1.1372']
['2015-05-14', '1.1368']
['2015-05-15', '1.1428']
['2015-05-18', '1.1354']
['2015-05-19', '1.1151']
['2015-05-20', '1.1079']
['2015-05-21', '1.1126']
['2015-05-22', '1.1033']
['2015-05-25', '.']
['2015-05-26', '1.0876']
['2015-05-27', '1.0888']
['2015-05-

So now, we need to process that date string by adding 5pm and attaching the New York (ET) timezone.
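There is more than one way to "add 5pm" to a date-only string. The sketch below shows two options that should give the same naive datetime: calling `replace(hour=17)` on the parsed result, or passing a `default` datetime to the parser, which supplies any fields missing from the string. The particular default value used here is just an illustration.

```python
from datetime import datetime
from dateutil import parser

date_str = '2015-04-03'

# Option 1: parse the date (time defaults to midnight), then set the hour
via_replace = parser.parse(date_str).replace(hour=17)

# Option 2: missing fields (hour, minute, second) come from `default`
five_pm = datetime(2000, 1, 1, 17, 0)  # illustrative default, date part is overridden
via_default = parser.parse(date_str, default=five_pm)

print(via_replace, via_default)
```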

We could use either the `US/Eastern` timezone name or `America/New_York`:

In [20]:
import pytz

In [21]:
dt_eastern = pytz.timezone('US/Eastern')
dt_nyc = pytz.timezone('America/New_York')

In [22]:
dt_eastern.localize(parser.parse('5/1/2020 5:00 pm'))

datetime.datetime(2020, 5, 1, 17, 0, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)

In [23]:
dt_nyc.localize(parser.parse('5/1/2020 5:00 pm'))

datetime.datetime(2020, 5, 1, 17, 0, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)

In [24]:
dt_eastern.localize(parser.parse('12/1/2020 5:00 pm'))

datetime.datetime(2020, 12, 1, 17, 0, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)

In [25]:
dt_nyc.localize(parser.parse('12/1/2020 5:00 pm'))

datetime.datetime(2020, 12, 1, 17, 0, tzinfo=<DstTzInfo 'America/New_York' EST-1 day, 19:00:00 STD>)

As you can see, either one will work for us, and both handle DST just fine.
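The DST handling can also be checked numerically rather than by reading the `repr`. This sketch localizes a summer and a winter 5pm and looks at the UTC offset and the resulting UTC time (the variable names are chosen for this example only):

```python
from datetime import datetime, timedelta
import pytz

tz_nyc = pytz.timezone('America/New_York')

# localize() picks the correct offset for each date
summer = tz_nyc.localize(datetime(2020, 5, 1, 17, 0))
winter = tz_nyc.localize(datetime(2020, 12, 1, 17, 0))

print(summer.utcoffset(), winter.utcoffset())

# 5pm EDT is 21:00 UTC, 5pm EST is 22:00 UTC
summer_utc = summer.astimezone(pytz.UTC).isoformat()
winter_utc = winter.astimezone(pytz.UTC).isoformat()
print(summer_utc, winter_utc)
```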

So let's apply this to our data:

In [26]:
import csv

with open(source_file) as source:
    csv_reader = csv.reader(source)
    
    header = next(csv_reader)
    print(header)
    for dt, rate in csv_reader:
        dt_naive = parser.parse(dt).replace(hour=17)
        dt_aware = dt_nyc.localize(dt_naive)
        dt_utc = dt_aware.astimezone(pytz.UTC)
        print(f'{dt_naive}  :  {dt_aware}  :  {dt_utc.isoformat()}  :  {rate}')

['DATE', 'DEXUSEU']
2015-04-03 17:00:00  :  2015-04-03 17:00:00-04:00  :  2015-04-03T21:00:00+00:00  :  1.0990
2015-04-06 17:00:00  :  2015-04-06 17:00:00-04:00  :  2015-04-06T21:00:00+00:00  :  1.1008
2015-04-07 17:00:00  :  2015-04-07 17:00:00-04:00  :  2015-04-07T21:00:00+00:00  :  1.0850
2015-04-08 17:00:00  :  2015-04-08 17:00:00-04:00  :  2015-04-08T21:00:00+00:00  :  1.0818
2015-04-09 17:00:00  :  2015-04-09 17:00:00-04:00  :  2015-04-09T21:00:00+00:00  :  1.0671
2015-04-10 17:00:00  :  2015-04-10 17:00:00-04:00  :  2015-04-10T21:00:00+00:00  :  1.0598
2015-04-13 17:00:00  :  2015-04-13 17:00:00-04:00  :  2015-04-13T21:00:00+00:00  :  1.0582
2015-04-14 17:00:00  :  2015-04-14 17:00:00-04:00  :  2015-04-14T21:00:00+00:00  :  1.0672
2015-04-15 17:00:00  :  2015-04-15 17:00:00-04:00  :  2015-04-15T21:00:00+00:00  :  1.0596
2015-04-16 17:00:00  :  2015-04-16 17:00:00-04:00  :  2015-04-16T21:00:00+00:00  :  1.0742
2015-04-17 17:00:00  :  2015-04-17 17:00:00-04:00  :  2015-04-17T21:00

2019-01-24 17:00:00  :  2019-01-24 17:00:00-05:00  :  2019-01-24T22:00:00+00:00  :  1.1322
2019-01-25 17:00:00  :  2019-01-25 17:00:00-05:00  :  2019-01-25T22:00:00+00:00  :  1.1407
2019-01-28 17:00:00  :  2019-01-28 17:00:00-05:00  :  2019-01-28T22:00:00+00:00  :  1.1438
2019-01-29 17:00:00  :  2019-01-29 17:00:00-05:00  :  2019-01-29T22:00:00+00:00  :  1.1424
2019-01-30 17:00:00  :  2019-01-30 17:00:00-05:00  :  2019-01-30T22:00:00+00:00  :  1.1418
2019-01-31 17:00:00  :  2019-01-31 17:00:00-05:00  :  2019-01-31T22:00:00+00:00  :  1.1454
2019-02-01 17:00:00  :  2019-02-01 17:00:00-05:00  :  2019-02-01T22:00:00+00:00  :  1.1474
2019-02-04 17:00:00  :  2019-02-04 17:00:00-05:00  :  2019-02-04T22:00:00+00:00  :  1.1438
2019-02-05 17:00:00  :  2019-02-05 17:00:00-05:00  :  2019-02-05T22:00:00+00:00  :  1.1406
2019-02-06 17:00:00  :  2019-02-06 17:00:00-05:00  :  2019-02-06T22:00:00+00:00  :  1.138
2019-02-07 17:00:00  :  2019-02-07 17:00:00-05:00  :  2019-02-07T22:00:00+00:00  :  1.1357


And now we can create our CSV output file using the ISO-formatted UTC datetime and rate.

We'll make a function out of it while we're at it, and for completeness we'll repeat the imports we actually need for this function to work:

In [27]:
import csv

from dateutil import parser
import pytz

def convert(source_file, target_file):
    # 'US/Eastern' and 'America/New_York' behave the same way here
    tz_eastern = pytz.timezone('US/Eastern')
    
    with open(source_file) as source:
        # newline='' lets the csv module control the line endings
        with open(target_file, 'w', newline='') as target:
            csv_reader = csv.reader(source)
            csv_writer = csv.writer(target)
    
            # copy the header row unchanged
            header = next(csv_reader)
            csv_writer.writerow(header)
            
            for dt, rate in csv_reader:
                # source dates are 5pm New York local time
                dt_naive = parser.parse(dt).replace(hour=17)
                dt_aware = tz_eastern.localize(dt_naive)
                dt_utc = dt_aware.astimezone(pytz.UTC)
                csv_writer.writerow([dt_utc.isoformat(), rate])
    

And let's run our function:

In [28]:
target_file = 'converted.csv'

In [29]:
convert(source_file, target_file)

And let's read our file back and see what we have:

In [30]:
with open(target_file) as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        print(row)

['DATE', 'DEXUSEU']
['2015-04-03T21:00:00+00:00', '1.0990']
['2015-04-06T21:00:00+00:00', '1.1008']
['2015-04-07T21:00:00+00:00', '1.0850']
['2015-04-08T21:00:00+00:00', '1.0818']
['2015-04-09T21:00:00+00:00', '1.0671']
['2015-04-10T21:00:00+00:00', '1.0598']
['2015-04-13T21:00:00+00:00', '1.0582']
['2015-04-14T21:00:00+00:00', '1.0672']
['2015-04-15T21:00:00+00:00', '1.0596']
['2015-04-16T21:00:00+00:00', '1.0742']
['2015-04-17T21:00:00+00:00', '1.0780']
['2015-04-20T21:00:00+00:00', '1.0763']
['2015-04-21T21:00:00+00:00', '1.0758']
['2015-04-22T21:00:00+00:00', '1.0729']
['2015-04-23T21:00:00+00:00', '1.0803']
['2015-04-24T21:00:00+00:00', '1.0876']
['2015-04-27T21:00:00+00:00', '1.0892']
['2015-04-28T21:00:00+00:00', '1.0979']
['2015-04-29T21:00:00+00:00', '1.1174']
['2015-04-30T21:00:00+00:00', '1.1162']
['2015-05-01T21:00:00+00:00', '1.1194']
['2015-05-04T21:00:00+00:00', '1.1145']
['2015-05-05T21:00:00+00:00', '1.1174']
['2015-05-06T21:00:00+00:00', '1.1345']
['2015-05-07T21:00:0

['2017-05-29T21:00:00+00:00', '.']
['2017-05-30T21:00:00+00:00', '1.1183']
['2017-05-31T21:00:00+00:00', '1.1236']
['2017-06-01T21:00:00+00:00', '1.1214']
['2017-06-02T21:00:00+00:00', '1.1270']
['2017-06-05T21:00:00+00:00', '1.1250']
['2017-06-06T21:00:00+00:00', '1.1266']
['2017-06-07T21:00:00+00:00', '1.1236']
['2017-06-08T21:00:00+00:00', '1.1217']
['2017-06-09T21:00:00+00:00', '1.1190']
['2017-06-12T21:00:00+00:00', '1.1204']
['2017-06-13T21:00:00+00:00', '1.1194']
['2017-06-14T21:00:00+00:00', '1.1277']
['2017-06-15T21:00:00+00:00', '1.1152']
['2017-06-16T21:00:00+00:00', '1.1194']
['2017-06-19T21:00:00+00:00', '1.1160']
['2017-06-20T21:00:00+00:00', '1.1124']
['2017-06-21T21:00:00+00:00', '1.1143']
['2017-06-22T21:00:00+00:00', '1.1148']
['2017-06-23T21:00:00+00:00', '1.1196']
['2017-06-26T21:00:00+00:00', '1.1196']
['2017-06-27T21:00:00+00:00', '1.1300']
['2017-06-28T21:00:00+00:00', '1.1364']
['2017-06-29T21:00:00+00:00', '1.1420']
['2017-06-30T21:00:00+00:00', '1.1411']
['201