# Null values


In [1]:
import pandas as pd
import numpy as np

We will work on the `orders` and `order_details` dataframes.

In [7]:
orders = pd.read_csv("https://github.com/gdv/foundationsCS/raw/main/students/ex-data/Northwind/Orders.csv", 
                     parse_dates=['OrderDate'],
                     date_format = '%Y-%m-%d-%H-%M-%S'
)

In [5]:
orders['OrderDate']

0                 2012-07-04
1                 2012-07-05
2                 2012-07-08
3                 2012-07-08
4                 2012-07-09
                ...         
16813    2013-06-29 21:05:55
16814    2014-01-19 12:27:11
16815    2014-10-15 09:51:09
16816    2013-02-07 02:06:05
16817    2013-08-31 02:59:28
Name: OrderDate, Length: 16818, dtype: object
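
The `dtype: object` in the output above indicates that, in this run, the values were kept as strings: the column mixes two layouts (date only, and date with time), and neither matches the given `date_format`. A minimal sketch of one way to convert such strings, assuming a made-up two-element Series rather than the real column:

```python
import pandas as pd

# Made-up sample mimicking the two layouts seen in `OrderDate`.
raw = pd.Series(["2012-07-04", "2013-06-29 21:05:55"])

# Parse each string on its own, so the two layouts do not have to agree.
# (pandas 2.0+ also accepts pd.to_datetime(raw, format="mixed").)
parsed = pd.Series([pd.Timestamp(value) for value in raw], index=raw.index)

print(parsed.dtype)              # a datetime64 dtype instead of object
print(parsed.dt.year.tolist())
```

Parsing element by element is slow on large columns, so it is only meant as an illustration of the idea.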

In [3]:
details = pd.read_csv("https://github.com/gdv/foundationsCS/raw/main/students/ex-data/Northwind/OrderDetails.csv")

### Null values in Python

In [4]:
type(None)

NoneType

In [5]:
type(np.nan)

float

In [6]:
type(pd.NaT)

pandas._libs.tslibs.nattype.NaTType
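
Although the three markers have different types, pandas treats all of them as missing. The sketch below, on made-up values, also shows why `isnull` is needed instead of `==`:

```python
import numpy as np
import pandas as pd

# The three markers are different objects...
markers = [None, np.nan, pd.NaT]

# ...but pandas reports each of them as missing.
flags = [bool(pd.isnull(m)) for m in markers]
print(flags)

# NaN is not equal to itself, so `==` cannot be used to find it.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# In a float Series, None is stored as NaN.
s = pd.Series([1.0, None, 3.0])
print(s.isnull().tolist())
```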

### Looking for null values

The `ShipPostalCode` column of `orders` contains some missing values.

In [7]:
orders['ShipPostalCode'].isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
16813    False
16814    False
16815    False
16816    False
16817    False
Name: ShipPostalCode, Length: 16818, dtype: bool

### Extract the rows with a missing `ShipPostalCode`

The result of `isnull()` is a boolean mask, so we can use it to select the rows.

In [8]:
orders[orders['ShipPostalCode'].isnull()].head()

,Id,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
50,10298,HUNGO,6,2012-09-05,2012-10-03,2012-09-11,2,41.25,Hungry Owl All-Night Grocers,8 Johnstown Road,Cork,British Isles,,Ireland
61,10309,HUNGO,3,2012-09-19,2012-10-17,2012-10-23,1,28.75,Hungry Owl All-Night Grocers,8 Johnstown Road,Cork,British Isles,,Ireland
87,10335,HUNGO,7,2012-10-22,2012-11-19,2012-10-24,2,31.5,Hungry Owl All-Night Grocers,8 Johnstown Road,Cork,British Isles,,Ireland
125,10373,HUNGO,4,2012-12-05,2013-01-02,2012-12-11,3,42.5,Hungry Owl All-Night Grocers,8 Johnstown Road,Cork,British Isles,,Ireland
132,10380,HUNGO,8,2012-12-12,2013-01-09,2013-01-16,3,28.5,Hungry Owl All-Night Grocers,8 Johnstown Road,Cork,British Isles,,Ireland


### How many rows have a missing `ShipPostalCode`?

In [9]:
orders['ShipPostalCode'].isnull().sum()

195

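This works because `True` counts as 1 and `False` as 0 when summed. An alternative that does not rely on this is to select the rows and count them; `count` reports the number of non-null values per column, which is why `ShipPostalCode` shows 0 in the output.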

In [10]:
orders[orders['ShipPostalCode'].isnull()].count()

Id                195
CustomerId        195
EmployeeId        195
OrderDate         195
RequiredDate      195
ShippedDate       195
ShipVia           195
Freight           195
ShipName          195
ShipAddress       195
ShipCity          195
ShipRegion        195
ShipPostalCode      0
ShipCountry       195
dtype: int64

Or we can use `len`, which returns the number of rows of a dataframe.

In [11]:
len(orders[orders['ShipPostalCode'].isnull()])

195
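
The three counting approaches can be compared side by side. This is a minimal sketch on a made-up frame standing in for `orders`:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `orders`, with two missing postal codes.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "ShipPostalCode": ["51100", np.nan, "69004", None],
})

mask = toy["ShipPostalCode"].isnull()

by_sum = int(mask.sum())                   # True counts as 1, False as 0
by_count = int(toy[mask]["Id"].count())    # non-null Ids among the selected rows
by_len = len(toy[mask])                    # number of selected rows

print(by_sum, by_count, by_len)
```

Note that `count` only agrees with the other two when it is read on a column that has no missing values among the selected rows, such as `Id` here.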

### Extract the rows that do not have a missing `ShipPostalCode`

In [12]:
orders[orders['ShipPostalCode'].notnull()].head()

,Id,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
0,10248,VINET,5,2012-07-04,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France
1,10249,TOMSP,6,2012-07-05,2012-08-16,2012-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany
2,10250,HANAR,4,2012-07-08,2012-08-05,2012-07-12,2,25.0,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil
3,10251,VICTE,3,2012-07-08,2012-08-05,2012-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France
4,10252,SUPRD,4,2012-07-09,2012-08-06,2012-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium
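
`notnull` is simply the complement of `isnull`. A small sketch on a made-up Series of postal codes:

```python
import numpy as np
import pandas as pd

codes = pd.Series(["51100", np.nan, "B-6000"])

present = codes.notnull()
also_present = ~codes.isnull()   # `~` negates a boolean mask

print(present.equals(also_present))
print(codes[present].tolist())
```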


### Remove all rows with a missing value

In [13]:
orders.dropna()

,Id,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
0,10248,VINET,5,2012-07-04 00:00:00,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France
1,10249,TOMSP,6,2012-07-05 00:00:00,2012-08-16,2012-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany
2,10250,HANAR,4,2012-07-08 00:00:00,2012-08-05,2012-07-12,2,25.00,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil
3,10251,VICTE,3,2012-07-08 00:00:00,2012-08-05,2012-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France
4,10252,SUPRD,4,2012-07-09 00:00:00,2012-08-06,2012-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16813,27061,FOLKO,5,2013-06-29 21:05:55,2013-08-02 04:10:53,2013-07-02 16:05:51,3,307.25,Familia Arquibaldo,"Rua Orós, 92",Sao Paulo,South America,05442-030,Brazil
16814,27062,FRANK,2,2014-01-19 12:27:11,2014-01-24 15:15:31,2014-01-27 02:14:31,2,550.50,Bon app',"12, rue des Bouchers",Marseille,Western Europe,13008,France
16815,27063,ALFKI,5,2014-10-15 09:51:09,2014-11-11 14:31:37,2014-10-16 06:26:55,1,328.50,Furia Bacalhau e Frutos do Mar,Jardim das rosas n. 32,Lisboa,Southern Europe,1675,Portugal
16816,27064,TRADH,8,2013-02-07 02:06:05,2013-03-14 09:43:16,2013-02-24 10:15:47,3,357.00,Wilman Kala,Keskuskatu 45,Helsinki,Scandinavia,21240,Finland


`dropna` has an argument `how` that controls which rows are removed: `how='any'` (the default) removes the rows with at least one missing value, while `how='all'` removes only the rows where every value is missing.
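
A minimal sketch of the `how` argument on a made-up frame. The `subset` argument, which restricts the check to the listed columns, is an addition not discussed in the text:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "ShipRegion": ["Western Europe", np.nan, np.nan],
    "ShipPostalCode": ["51100", "44087", np.nan],
})

any_missing = toy.dropna()                          # same as how="any"
all_missing = toy.dropna(how="all")                 # no row is entirely empty here
by_column = toy.dropna(subset=["ShipPostalCode"])   # look at one column only

print(len(any_missing), len(all_missing), len(by_column))
```

`dropna` returns a new dataframe and leaves `toy` unchanged.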

## Datetime values

The `.dt` accessor exposes the datetime properties of a Series whose values are already datetimes; it does not convert anything itself. For this reason, we have read the `OrderDate` column with the `parse_dates` argument of `read_csv`.

In [14]:
orders['OrderDate'].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x7fb8616a2080>
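
The accessor is only available on datetime values. The sketch below, on made-up dates, contrasts a Series of strings with the same Series converted by `pd.to_datetime`:

```python
import pandas as pd

as_text = pd.Series(["2012-07-04", "2012-07-05"])   # plain strings
as_dates = pd.to_datetime(as_text)                   # datetime64 values

try:
    as_text.dt.year
    text_has_dt = True
except AttributeError:
    # pandas refuses `.dt` on values that are not datetime-like
    text_has_dt = False

print(text_has_dt)
print(as_dates.dt.year.tolist())
```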

There are several components that can be extracted from a datetime Series.

In [15]:
orders['OrderDate'].dt.year

0        2012
1        2012
2        2012
3        2012
4        2012
         ... 
16813    2013
16814    2014
16815    2014
16816    2013
16817    2013
Name: OrderDate, Length: 16818, dtype: int64

In [16]:
orders['OrderDate'].dt.day

0         4
1         5
2         8
3         8
4         9
         ..
16813    29
16814    19
16815    15
16816     7
16817    31
Name: OrderDate, Length: 16818, dtype: int64

In [17]:
orders['OrderDate'].dt.is_leap_year

0         True
1         True
2         True
3         True
4         True
         ...  
16813    False
16814    False
16815    False
16816    False
16817    False
Name: OrderDate, Length: 16818, dtype: bool
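
Missing dates are represented by `pd.NaT`, and they propagate through the `.dt` properties. A short sketch on made-up values:

```python
import pandas as pd

# The second entry is missing and becomes NaT.
dates = pd.Series(pd.to_datetime(["2012-07-04", None]))

years = dates.dt.year   # the year of a missing date is itself missing

print(dates.isnull().tolist())
print(years.tolist())
```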

## Join tables

The pandas equivalent to a join is a `merge`.

In [18]:
details.columns

Index(['Id', 'OrderId', 'ProductId', 'UnitPrice', 'Quantity', 'Discount'], dtype='object')

Join the `orders` and `details` dataframes, exploiting the `Id` column of `orders` and `OrderId` of `details`.

In [19]:
pd.merge(orders, details, left_on='Id', right_on='OrderId')

,Id_x,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry,Id_y,OrderId,ProductId,UnitPrice,Quantity,Discount
0,10248,VINET,5,2012-07-04 00:00:00,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France,10248/11,10248,11,14.00,12,0.0
1,10248,VINET,5,2012-07-04 00:00:00,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France,10248/42,10248,42,9.80,10,0.0
2,10248,VINET,5,2012-07-04 00:00:00,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France,10248/72,10248,72,34.80,5,0.0
3,10249,TOMSP,6,2012-07-05 00:00:00,2012-08-16,2012-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany,10249/14,10249,14,18.60,9,0.0
4,10249,TOMSP,6,2012-07-05 00:00:00,2012-08-16,2012-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany,10249/51,10249,51,42.40,40,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
621878,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,27065/22,27065,22,21.00,20,0.0
621879,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,27065/77,27065,77,13.00,11,0.0
621880,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,27065/17,27065,17,39.00,45,0.0
621881,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,27065/6,27065,6,25.00,7,0.0


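Notice that both dataframes have an `Id` column, so the result contains `Id_x` (from `orders`) and `Id_y` (from `details`). The `Id` of `details` identifies a detail line and is not the column we have used for the join.

If the columns used to link the dataframes have the same name in both, we can use the `on` argument instead of `left_on` and `right_on`.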
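
To illustrate the `on` argument, here is a minimal sketch with two made-up frames in which the key is called `OrderId` on both sides:

```python
import pandas as pd

orders_toy = pd.DataFrame({"OrderId": [10248, 10249],
                           "CustomerId": ["VINET", "TOMSP"]})
details_toy = pd.DataFrame({"OrderId": [10248, 10248, 10249],
                            "ProductId": [11, 42, 14]})

# Both frames call the key `OrderId`, so `on` replaces left_on/right_on.
joined = pd.merge(orders_toy, details_toy, on="OrderId")
print(joined)
```

The key appears only once in the result, so no `_x` or `_y` suffix is needed.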

What if both dataframes have `Id` as the index?

In [20]:
orders_id = orders.set_index('Id')
orders_id.head()

Id,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
10248,VINET,5,2012-07-04,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France
10249,TOMSP,6,2012-07-05,2012-08-16,2012-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany
10250,HANAR,4,2012-07-08,2012-08-05,2012-07-12,2,25.0,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil
10251,VICTE,3,2012-07-08,2012-08-05,2012-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France
10252,SUPRD,4,2012-07-09,2012-08-06,2012-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium


In [21]:
details_id = details.set_index('Id')
details_id.head()

Id,OrderId,ProductId,UnitPrice,Quantity,Discount
10248/11,10248,11,14.0,12,0.0
10248/42,10248,42,9.8,10,0.0
10248/72,10248,72,34.8,5,0.0
10249/14,10249,14,18.6,9,0.0
10249/51,10249,51,42.4,40,0.0



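Now we want to merge those dataframes. Note that the call below passes `orders` and `details`, not `orders_id` and `details_id`. With `left_index=True`, the default positional index of `orders` (0, 1, 2, ...) is matched against `OrderId`, which is why the output pairs order 20496 with the details of order 10248. To join on the order identifier, the left dataframe must be `orders_id`.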

In [22]:
pd.merge(orders, details, left_index = True, right_on='OrderId')

,Id_x,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry,Id_y,OrderId,ProductId,UnitPrice,Quantity,Discount
0,20496,HUNGC,6,2014-04-04 12:22:02,2014-04-05 23:01:55,2014-04-20 16:55:53,1,273.25,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,10248/11,10248,11,14.0,12,0.0
1,20496,HUNGC,6,2014-04-04 12:22:02,2014-04-05 23:01:55,2014-04-20 16:55:53,1,273.25,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,10248/42,10248,42,9.8,10,0.0
2,20496,HUNGC,6,2014-04-04 12:22:02,2014-04-05 23:01:55,2014-04-20 16:55:53,1,273.25,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,10248/72,10248,72,34.8,5,0.0
3,20497,OCEAN,2,2015-08-04 04:28:53,2015-09-04 13:17:27,2015-08-16 16:53:47,1,219.50,Comércio Mineiro,"Av. dos Lusíadas, 23",Sao Paulo,South America,05432-043,Brazil,10249/14,10249,14,18.6,9,0.0
4,20497,OCEAN,2,2015-08-04 04:28:53,2015-09-04 13:17:27,2015-08-16 16:53:47,1,219.50,Comércio Mineiro,"Av. dos Lusíadas, 23",Sao Paulo,South America,05432-043,Brazil,10249/51,10249,51,42.4,40,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225259,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/71,16817,71,21.5,30,0.0
225260,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/33,16817,33,2.5,10,0.0
225261,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/32,16817,32,32.0,23,0.0
225262,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/31,16817,31,12.5,33,0.0
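
A minimal sketch of a merge between an index and a column, using made-up frames that mirror `orders_id` and `details_id` (the names `orders_toy` and `details_toy` are only illustrative):

```python
import pandas as pd

orders_toy = pd.DataFrame({"Id": [10248, 10249],
                           "CustomerId": ["VINET", "TOMSP"]}).set_index("Id")
details_toy = pd.DataFrame({"Id": ["10248/11", "10248/42", "10249/14"],
                            "OrderId": [10248, 10248, 10249],
                            "ProductId": [11, 42, 14]}).set_index("Id")

# Left key: the index of orders_toy. Right key: the OrderId column.
joined = pd.merge(orders_toy, details_toy,
                  left_index=True, right_on="OrderId")
print(joined)
```

Since `Id` is the index on both sides, it is no longer a regular column and the result has no `Id_x` or `Id_y`.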


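We can compute an outer join too. With `how='left'` we get a left outer join: every row of `orders` is kept, and the columns coming from `details` are filled with missing values when there is no match.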

In [23]:
pd.merge(orders, details, left_index = True, right_on='OrderId',
        how = 'left')

,Id_x,CustomerId,EmployeeId,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry,Id_y,OrderId,ProductId,UnitPrice,Quantity,Discount
,10248,VINET,5,2012-07-04 00:00:00,2012-08-01,2012-07-16,3,16.75,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,Western Europe,51100,France,,0,,,,
,10249,TOMSP,6,2012-07-05 00:00:00,2012-08-16,2012-07-10,1,22.25,Toms Spezialitäten,Luisenstr. 48,Münster,Western Europe,44087,Germany,,1,,,,
,10250,HANAR,4,2012-07-08 00:00:00,2012-08-05,2012-07-12,2,25.00,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,South America,05454-876,Brazil,,2,,,,
,10251,VICTE,3,2012-07-08 00:00:00,2012-08-05,2012-07-15,1,20.25,Victuailles en stock,"2, rue du Commerce",Lyon,Western Europe,69004,France,,3,,,,
,10252,SUPRD,4,2012-07-09 00:00:00,2012-08-06,2012-07-11,2,36.25,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,Western Europe,B-6000,Belgium,,4,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225259.0,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/71,16817,71.0,21.5,30.0,0.0
225260.0,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/33,16817,33.0,2.5,10.0,0.0
225261.0,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/32,16817,32.0,32.0,23.0,0.0
225262.0,27065,ANATR,1,2013-08-31 02:59:28,2013-09-15 23:11:49,2013-09-03 14:09:08,3,233.75,LINO-Delicateses,Ave. 5 de Mayo Porlamar,I. de Margarita,South America,4980,Venezuela,16817/31,16817,31.0,12.5,33.0,0.0
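
The difference between the join types is easier to see on a small case. The sketch below uses made-up frames and the `indicator` argument of `merge`, which is not used in the text above and adds a `_merge` column telling where each row comes from:

```python
import pandas as pd

orders_toy = pd.DataFrame({"Id": [10248, 10249, 10250]})
details_toy = pd.DataFrame({"OrderId": [10248, 10248, 10251],
                            "ProductId": [11, 42, 7]})

# Left outer join: all orders are kept, matched or not.
left = pd.merge(orders_toy, details_toy, left_on="Id",
                right_on="OrderId", how="left", indicator=True)

# Full outer join: unmatched rows from both sides are kept.
outer = pd.merge(orders_toy, details_toy, left_on="Id",
                 right_on="OrderId", how="outer", indicator=True)

print(left)
print(len(left), len(outer))
```

The unmatched orders get missing values in `ProductId`, so the tools of the first section (`isnull`, `dropna`) apply to the result of an outer join.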
