# Better date handling


In [1]:
import pandas as pd
import numpy as np

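We will work on the `orders` and `details` dataframes (the latter is loaded from `OrderDetails.csv`).
This time, we do not use the `parse_dates` option of `read_csv`, so the date columns are read as plain strings.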

In [2]:
orders = pd.read_csv("https://github.com/gdv/foundationsCS/raw/main/students/ex-data/Northwind/Orders.csv")

In [3]:
details = pd.read_csv("https://github.com/gdv/foundationsCS/raw/main/students/ex-data/Northwind/OrderDetails.csv")

Sometimes we need to specify the format of the field containing the date. We can use `to_datetime` and its `format` option (see the format [specification](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)).

In [4]:
orders['OrderDate'].tail()

16813    2013-06-29 21:05:55
16814    2014-01-19 12:27:11
16815    2014-10-15 09:51:09
16816    2013-02-07 02:06:05
16817    2013-08-31 02:59:28
Name: OrderDate, dtype: object

In [5]:
pd.to_datetime(orders['OrderDate'], format = '%Y-%m-%d %H:%M:%S')

0       2012-07-04 00:00:00
1       2012-07-05 00:00:00
2       2012-07-08 00:00:00
3       2012-07-08 00:00:00
4       2012-07-09 00:00:00
                ...        
16813   2013-06-29 21:05:55
16814   2014-01-19 12:27:11
16815   2014-10-15 09:51:09
16816   2013-02-07 02:06:05
16817   2013-08-31 02:59:28
Name: OrderDate, Length: 16818, dtype: datetime64[ns]

In [6]:
orders['parsed_date'] = pd.to_datetime(orders['OrderDate'], format = '%Y-%m-%d %H:%M:%S')
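
As a minimal sketch of how the `format` option behaves, the following cell parses a tiny hand-made series instead of the downloaded file. The name `sample` and the malformed third value are assumptions added for illustration; `errors="coerce"` is an optional `to_datetime` argument, not something the cells above use.

```python
import pandas as pd

# Hand-made sample (hypothetical): two valid timestamps and one malformed value.
sample = pd.Series(["2012-07-04 00:00:00", "2013-06-29 21:05:55", "not a date"])

# With an explicit format, every string must match the pattern.
# errors="coerce" turns values that do not match into NaT instead of raising.
parsed = pd.to_datetime(sample, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.dtype)
print(parsed.isna().tolist())
```

Without `errors="coerce"`, the malformed value would raise an error, which is often what you want while still exploring the data.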

Parse the column `RequiredDate`.

In [7]:
pd.to_datetime(orders['RequiredDate'], format = '%Y-%m-%d %H:%M:%S')

0       2012-08-01 00:00:00
1       2012-08-16 00:00:00
2       2012-08-05 00:00:00
3       2012-08-05 00:00:00
4       2012-08-06 00:00:00
                ...        
16813   2013-08-02 04:10:53
16814   2014-01-24 15:15:31
16815   2014-11-11 14:31:37
16816   2013-03-14 09:43:16
16817   2013-09-15 23:11:49
Name: RequiredDate, Length: 16818, dtype: datetime64[ns]
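
One possible reason to parse both columns is that datetime columns can be subtracted. The sketch below uses two rows copied from the outputs above; the dataframe `sample` and the column `days_allowed` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Two rows taken from the outputs shown above (rows 0 and 16813).
sample = pd.DataFrame({
    "OrderDate": ["2012-07-04 00:00:00", "2013-06-29 21:05:55"],
    "RequiredDate": ["2012-08-01 00:00:00", "2013-08-02 04:10:53"],
})

fmt = "%Y-%m-%d %H:%M:%S"
for col in ["OrderDate", "RequiredDate"]:
    sample[col] = pd.to_datetime(sample[col], format=fmt)

# Subtracting two datetime columns gives a timedelta column;
# .dt.days keeps only the whole days.
sample["days_allowed"] = (sample["RequiredDate"] - sample["OrderDate"]).dt.days
print(sample)
```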

## String/Regex manipulation

Extract the orders shipped to Europe (look at the `ShipRegion` column).

In [8]:
orders['ShipRegion'].head()

0    Western Europe
1    Western Europe
2     South America
3    Western Europe
4    Western Europe
Name: ShipRegion, dtype: object

In [9]:
orders['ShipRegion'].str.contains('Europe')

0         True
1         True
2        False
3         True
4         True
         ...  
16813    False
16814     True
16815     True
16816    False
16817    False
Name: ShipRegion, Length: 16818, dtype: bool

In [10]:
orders['ShipRegion'].str.contains('[Ee]urope')

0         True
1         True
2        False
3         True
4         True
         ...  
16813    False
16814     True
16815     True
16816    False
16817    False
Name: ShipRegion, Length: 16818, dtype: bool
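
The two cells above only produce a boolean mask. A minimal sketch of how that mask could be used to select the matching rows is shown below, on a hand-made dataframe; the `OrderID` column and the missing value are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hand-made sample (hypothetical), including one missing region.
sample = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "ShipRegion": ["Western Europe", "South America", "Scandinavia", None],
})

# na=False makes missing regions count as "no match",
# so the mask is purely boolean and can be used for indexing.
mask = sample["ShipRegion"].str.contains("Europe", na=False)
european = sample[mask]
print(european)
```

Note that a region such as `Scandinavia` is not selected, because the match is purely textual.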

Build a new column that contains the continent `Europe` instead of the specific region.

In [11]:
orders['ShipRegion'].str.extract(r'([Ee]urope)')

            0
0      Europe
1      Europe
2         NaN
3      Europe
4      Europe
...       ...
16813     NaN
16814  Europe
16815  Europe
16816     NaN
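
The call above returns a one-column dataframe and does not store it. A minimal sketch of actually building the new column follows, assuming a hypothetical column name `Continent` and a small hand-made dataframe; `expand=False` makes `str.extract` return a Series when the pattern has a single group.

```python
import pandas as pd

# Hand-made sample (hypothetical) with the same kind of values as ShipRegion.
sample = pd.DataFrame({
    "ShipRegion": ["Western Europe", "South America", "Western Europe"],
})

# expand=False returns a Series for a single capture group,
# which can be assigned directly to a new column.
sample["Continent"] = sample["ShipRegion"].str.extract(r"([Ee]urope)", expand=False)
print(sample)
```

Rows that do not match the pattern get a missing value in the new column.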


Another way is to remove the portion of the text preceding `Europe`.

In [12]:
orders['ShipRegion'].str.replace(r'.*\s[Ee]urope', 'Europe', regex=True)

0               Europe
1               Europe
2        South America
3               Europe
4               Europe
             ...      
16813    South America
16814           Europe
16815           Europe
16816      Scandinavia
16817    South America
Name: ShipRegion, Length: 16818, dtype: object
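
Unlike `str.extract`, the replacement keeps the regions that do not match. The sketch below illustrates this on a hand-made series; the values are chosen for illustration and the name `regions` is hypothetical.

```python
import pandas as pd

# Hand-made sample (hypothetical values).
regions = pd.Series(["Western Europe", "South America", "Scandinavia", "Europe"])

# The raw string keeps \s as a regex token rather than a Python escape.
# Only values with some text and whitespace before "Europe" are rewritten.
replaced = regions.str.replace(r".*\s[Ee]urope", "Europe", regex=True)
print(replaced.tolist())
```

A value that is already just `Europe` does not match the pattern (there is no preceding whitespace), so it is left unchanged.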

## Zipping lists

In [13]:
lista = [1, 2, 3, 4]
listb = "abcd"

In [14]:
list(zip(lista, listb))

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
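
A few further properties of `zip` may be worth keeping in mind; the sketch below reuses `lista` and `listb` from above, while the other variable names are introduced only for this example. It shows that `zip` accepts any iterable (here a string), stops at the shortest input, and can be reversed with `zip(*...)`.

```python
lista = [1, 2, 3, 4]
listb = "abcd"  # a string is iterable, so zip pairs its characters

# Build a dictionary from the pairs (letters as keys).
letter_to_number = dict(zip(listb, lista))

# zip stops as soon as the shortest input is exhausted.
truncated = list(zip(lista, "ab"))

# zip(*pairs) "unzips" a list of pairs back into two tuples.
numbers, letters = zip(*zip(lista, listb))

print(letter_to_number)
print(truncated)
print(numbers, letters)
```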