# Dataset normalization

The sheets of the **dims.xlsx** file contain dictionaries (lookup tables) for the data in the **product_prices_cleaned.csv** file. Use `merge` to normalize the data by following these steps:

1. Read each sheet of the **dims.xlsx** file into a separate `DataFrame`.
For readability, name each frame after its sheet.

1. Read the data from the **product_prices_cleaned.csv** file into the `df` variable.

1. Based on the **d_province** sheet, use its id column (`province_id`) to add the `province_id` column to the `df` frame.

1. Based on the **d_product** sheet, add the `product_id` column to the `df` frame.

1. From the resulting table, keep only the columns that refer to other tables, e.g. **product_id**, plus the columns **value** and **date**. Do you think this is more readable? What are the potential benefits of this approach?

> We will show you how to read many sheets at once when we discuss `openpyxl`.

You can read more about database normalization in [this article](https://www.sqlshack.com/what-is-database-normalization-in-sql-server/).

In [9]:
import pandas as pd

In [10]:
# Read the d_province sheet into a DataFrame named after the sheet
d_province = pd.read_excel('../../01_Data/dims.xlsx', sheet_name='d_province')
d_province.head()

Unnamed: 0,province_id,province
0,8,SUBCARPATHIA
1,14,ŁÓDŹ
2,2,KUYAVIA-POMERANIA
3,1,LOWER SILESIA
4,11,WARMIA-MASURIA


In [11]:
# Read the d_product sheet into a DataFrame named after the sheet
d_product = pd.read_excel('../../01_Data/dims.xlsx', sheet_name='d_product')
d_product.head()

Unnamed: 0,product_id,product,product_group_id
0,20,pork ham cooked - per 1kg,2
1,26,bread - per 1kg,4
2,10,barley groats sausage - per 1kg,2
3,12,dressed chickens - per 1kg,2
4,19,Italian head cheese - per 1kg,2


In [12]:
# Read the cleaned price data; the file uses ';' as the field separator
df = pd.read_csv('../../01_Data/product_prices_cleaned.csv', sep=';')
df.head()

Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date,product2,product
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3,pork ham cooked - per 1kg,pork ham cooked - per 1kg
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2,bread - per 1kg,bread - per 1kg
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12,barley groats sausage - per 1kg,barley groats sausage - per 1kg
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2,dressed chickens - per 1kg,dressed chickens - per 1kg
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3,Italian head cheese - per 1kg,Italian head cheese - per 1kg


In [13]:
# Preview the merge result without assigning it back to df
pd.merge(df, d_province, on='province', how='left')

Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date,product2,product,province_id
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3,pork ham cooked - per 1kg,pork ham cooked - per 1kg,8.0
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2,bread - per 1kg,bread - per 1kg,14.0
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12,barley groats sausage - per 1kg,barley groats sausage - per 1kg,2.0
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2,dressed chickens - per 1kg,dressed chickens - per 1kg,1.0
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3,Italian head cheese - per 1kg,Italian head cheese - per 1kg,11.0
...,...,...,...,...,...,...,...,...,...,...
128498,SILESIA,,PLN,2,smoked bacon with ribs - per 1kg,15.95,2015-9,smoked bacon with ribs - per 1kg,smoked bacon with ribs - per 1kg,15.0
128499,SILESIA,,PLN,2,barley groats sausage - per 1kg,4.50,2004-8,barley groats sausage - per 1kg,barley groats sausage - per 1kg,15.0
128500,KUYAVIA-POMERANIA,,PLN,2,pork meat (raw bacon) - per 1kg,12.15,2016-11,pork meat (raw bacon) - per 1kg,pork meat (raw bacon) - per 1kg,2.0
128501,ŁÓDŹ,"beet sugar white, bagged - per 1kg",PLN,3,,0.00,2012-5,"beet sugar white, bagged - per 1kg","beet sugar white, bagged - per 1kg",14.0


In [14]:
# Merge df with d_province to add the province_id column
df = pd.merge(df, d_province, on='province', how='left')
df.head()

Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date,product2,product,province_id
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3,pork ham cooked - per 1kg,pork ham cooked - per 1kg,8.0
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2,bread - per 1kg,bread - per 1kg,14.0
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12,barley groats sausage - per 1kg,barley groats sausage - per 1kg,2.0
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2,dressed chickens - per 1kg,dressed chickens - per 1kg,1.0
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3,Italian head cheese - per 1kg,Italian head cheese - per 1kg,11.0
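
The `province_id` values above are displayed as floats (`8.0`, `14.0`). In pandas this often happens when a left merge leaves some rows without a match, because the missing ids are stored as `NaN`; whether that is the cause here is an assumption, since the output does not confirm it. A minimal sketch of how one might check for unmatched keys, using small made-up frames (`facts`, `dim` and the `UNKNOWN` row are hypothetical, not the course data):

```python
import pandas as pd

# Toy stand-ins for df and d_province (hypothetical values, not the course data)
facts = pd.DataFrame({
    "province": ["SILESIA", "ŁÓDŹ", "UNKNOWN"],
    "value": [15.95, 4.50, 1.00],
})
dim = pd.DataFrame({
    "province_id": [15, 14],
    "province": ["SILESIA", "ŁÓDŹ"],
})

# validate="many_to_one" raises MergeError if the dictionary has duplicate keys;
# indicator=True adds a _merge column saying which frame each row came from
merged = pd.merge(facts, dim, on="province", how="left",
                  validate="many_to_one", indicator=True)

# Rows marked left_only found no match in the dictionary
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["province", "value"]])

# The missing id forces a float column; the nullable Int64 dtype keeps integers
merged["province_id"] = merged["province_id"].astype("Int64")
print(merged["province_id"].dtype)
```

If `unmatched` is empty on the real data, every province name was found in the dictionary and the ids can safely be cast to a plain integer type.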


In [15]:
# Merge df with d_product to add the product_id column
df = pd.merge(df, d_product, on='product', how='left')
df.head()


Unnamed: 0,province,product_types,currency,product_group_id_x,product_line,value,date,product2,product,province_id,product_id,product_group_id_y
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3,pork ham cooked - per 1kg,pork ham cooked - per 1kg,8.0,20.0,2.0
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2,bread - per 1kg,bread - per 1kg,14.0,26.0,4.0
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12,barley groats sausage - per 1kg,barley groats sausage - per 1kg,2.0,10.0,2.0
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2,dressed chickens - per 1kg,dressed chickens - per 1kg,1.0,12.0,2.0
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3,Italian head cheese - per 1kg,Italian head cheese - per 1kg,11.0,19.0,2.0
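
The merge above produces `product_group_id_x` and `product_group_id_y`, because both frames contain a `product_group_id` column that is not part of the join key. One possible way to avoid the duplicate is to pass only the needed columns of the dictionary to `merge`. The sketch below uses small illustrative frames (`facts`, `dim`; the price values are made up) rather than the course files:

```python
import pandas as pd

# Toy stand-ins for df and d_product (illustrative values only)
facts = pd.DataFrame({
    "product": ["bread - per 1kg", "pork ham cooked - per 1kg"],
    "product_group_id": [4, 2],
    "value": [2.10, 21.37],
})
dim = pd.DataFrame({
    "product_id": [26, 20],
    "product": ["bread - per 1kg", "pork ham cooked - per 1kg"],
    "product_group_id": [4, 2],
})

# Take only the join key and the id from the dictionary, so the shared
# product_group_id column is not duplicated with _x/_y suffixes
merged = pd.merge(facts, dim[["product", "product_id"]], on="product", how="left")
print(merged.columns.tolist())
```

The result keeps a single `product_group_id` column from the left frame and gains only `product_id`.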


In [16]:
# Extract only the columns that refer to other tables, and the columns value, date
df_normalized = df[['province_id', 'product_id', 'value', 'date']]

df_normalized.head()


Unnamed: 0,province_id,product_id,value,date
0,8.0,20.0,21.37,2013-3
1,14.0,26.0,,2018-2
2,2.0,10.0,3.55,2019-12
3,1.0,12.0,6.14,2019-2
4,11.0,19.0,5.63,2002-3
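
The normalized frame is harder to read on its own, but the names are not lost: they can be restored at any time by merging the dictionaries back in. A possible benefit is that each name is stored once, so renaming a province or product means changing a single dictionary row. A minimal sketch of the reverse step, using the first two rows shown above and hypothetical frame names (`normalized`, `provinces`, `products`):

```python
import pandas as pd

# First two rows of the normalized table, with ids written as integers
normalized = pd.DataFrame({
    "province_id": [8, 14],
    "product_id": [20, 26],
    "value": [21.37, None],
    "date": ["2013-3", "2018-2"],
})
# Matching rows of the two dictionaries
provinces = pd.DataFrame({
    "province_id": [8, 14],
    "province": ["SUBCARPATHIA", "ŁÓDŹ"],
})
products = pd.DataFrame({
    "product_id": [20, 26],
    "product": ["pork ham cooked - per 1kg", "bread - per 1kg"],
})

# Join each dictionary on its id to bring the readable names back
readable = (
    normalized
    .merge(provinces, on="province_id", how="left")
    .merge(products, on="product_id", how="left")
)
print(readable[["province", "product", "value", "date"]])
```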
