# AUTONORMALIZE DEMO 
Using Autonormalize to normalize a Kaggle dataset about food purchasing.

In [1]:
import os

import pandas as pd
import autonormalize as an

In [10]:
food_df = pd.read_csv(os.path.join(os.getcwd(), 'autonormalize/downloads/food.csv'), encoding='latin1')
food_df = food_df.drop(columns=food_df.columns[10:])
food_df.head(3)

Unnamed: 0,Area Abbreviation,Area Code,Area,Item Code,Item,Element Code,Element,Unit,latitude,longitude
0,AFG,2,Afghanistan,2511,Wheat and products,5142,Food,1000 tonnes,33.94,67.71
1,AFG,2,Afghanistan,2805,Rice (Milled Equivalent),5142,Food,1000 tonnes,33.94,67.71
2,AFG,2,Afghanistan,2513,Barley and products,5521,Feed,1000 tonnes,33.94,67.71


This dataset has 21477 rows, and we've cut it down to 10 columns. As you can see, there are many dependencies between the columns, which suggests the table should be split up. For example, Area, Area Code, and Area Abbreviation should all determine one another.
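To make the idea of a column dependency concrete, here is a minimal pandas-only sketch that checks whether one set of columns determines another. The helper `dependency_holds` is a hypothetical name introduced for illustration and is not part of Autonormalize; the sample frame simply reuses the three rows shown above.

```python
import pandas as pd


def dependency_holds(df, lhs, rhs):
    """Return True if each combination of values in `lhs` maps to one value of `rhs`.

    Hypothetical helper for illustration only; not part of Autonormalize.
    """
    distinct_per_group = df.groupby(lhs)[rhs].nunique(dropna=False)
    return bool((distinct_per_group <= 1).all())


# The three rows shown in the head() output above.
sample = pd.DataFrame({
    "Area Abbreviation": ["AFG", "AFG", "AFG"],
    "Area Code": [2, 2, 2],
    "Area": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Item": ["Wheat and products", "Rice (Milled Equivalent)", "Barley and products"],
    "Element Code": [5142, 5142, 5521],
    "Element": ["Food", "Food", "Feed"],
})

# Area Code determines Area: every code maps to a single area name.
print(dependency_holds(sample, ["Area Code"], "Area"))

# Element does not determine Item: "Food" appears with two different items.
print(dependency_holds(sample, ["Element"], "Item"))
```

On three rows this check is only a toy, but the same function can be applied to the full `food_df`.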

In [13]:
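# The second argument is the accuracy threshold: 1.00 keeps only dependencies
# that hold for every row, while 0.96 also tolerates a small share of violations.
deps_exact = an.find_dependencies(food_df, 1.00)
deps_approx = an.find_dependencies(food_df, 0.96)
print("\nExact dependencies...")
print(deps_exact)
print("\nApproximate dependencies...")
print(deps_approx)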


100%|██████████| 10/10 [00:01<00:00,  9.72it/s]
100%|██████████| 10/10 [00:02<00:00,  3.91it/s]


Exact dependencies...
 {Area Code}  {Area}  --> Area Abbreviation
 {Area}  --> Area Code
 {Area Code}  --> Area
 --> Item Code
 --> Item
 {Element}  --> Element Code
 {Element Code}  --> Element
 {Item Code}  {Area}  {Element}  {latitude}  {Area Abbreviation}  {Area Code}  {Element Code}  {Item}  {longitude}  --> Unit
 {Area Code}  {Area}  --> latitude
 {Area Code}  {Area}  --> longitude

Approximate dependencies...
 {Area Code}  {Area}  --> Area Abbreviation
 {Area Abbreviation}  {Area}  --> Area Code
 {Area Abbreviation}  {Area Code}  --> Area
 --> Item Code
 --> Item
 {Element}  --> Element Code
 {Element Code}  --> Element
 {Item Code}  {Area}  {Element}  {latitude}  {Area Abbreviation}  {Area Code}  {Element Code}  {Item}  {longitude}  --> Unit
 {Area Abbreviation}  {Area Code}  {Area}  --> latitude
 {Area Abbreviation}  {Area Code}  {Area}  --> longitude
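One plausible way to read the difference between the 1.00 and 0.96 thresholds is as the share of rows that agree with a dependency. The sketch below computes such a score with plain pandas. It is an illustrative measure under that assumption, not necessarily how Autonormalize scores dependencies internally. The helper name `dependency_score` and the five-row frame (including the misspelled area name) are hypothetical.

```python
import pandas as pd


def dependency_score(df, lhs, rhs):
    """Fraction of rows that can be kept so that `lhs` --> `rhs` holds exactly.

    Hypothetical helper for illustration only; not part of Autonormalize.
    """
    # Count how often each (lhs values, rhs value) combination occurs.
    counts = df.groupby(lhs + [rhs]).size()
    # For each lhs combination, keep only the rows with the most common rhs value.
    kept = counts.groupby(level=lhs).max().sum()
    return kept / len(df)


# Hypothetical data: one row carries a misspelled area name.
toy = pd.DataFrame({
    "Area Code": [2, 2, 2, 2, 3],
    "Area": ["Afghanistan", "Afghanistan", "Afghanistan", "Afganistan", "Albania"],
})

score = dependency_score(toy, ["Area Code"], "Area")
print(score)  # 4 of the 5 rows agree, so the score is 0.8
```

Under this reading, the dependency would be rejected at a threshold of 1.00 but accepted at 0.8 or lower.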





In [9]:
groupings = an.normalize_dependencies(deps_approx)
for grp in groupings:
    print('\n~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~\n')
    print(grp)


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

 --> Area Abbreviation
 --> Item Code
 --> Item
 {Element}  --> Element Code
 {Element Code}  --> Element

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

 --> Area Abbreviation
 {Area Abbreviation}  --> Area Code

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

 --> Area Code
 {Area Code}  --> Area

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

 --> Area
 {Area}  --> latitude
 {Area}  --> longitude

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

 --> latitude
 {latitude}  --> Unit
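Each group above can be read as one table, with the key on the left and the dependent columns on the right. As a rough sketch of what that split means in practice, the pandas-only example below pulls `Area` out into a lookup table keyed by `Area Code` and then joins it back. This is an assumption-laden illustration on the three rows shown earlier, not the library's own splitting routine.

```python
import pandas as pd

# The three rows shown in the head() output earlier, restricted to a few columns.
sample = pd.DataFrame({
    "Area Code": [2, 2, 2],
    "Area": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Item Code": [2511, 2805, 2513],
    "Element": ["Food", "Food", "Feed"],
})

# Lookup table for the group {Area Code} --> Area: one row per distinct key.
area_table = sample[["Area Code", "Area"]].drop_duplicates().reset_index(drop=True)

# The main table keeps the key and drops the column that now lives in the lookup.
main_table = sample.drop(columns=["Area"])

# Joining on the key recovers the original columns, so no information is lost.
rebuilt = main_table.merge(area_table, on="Area Code")[list(sample.columns)]

print(area_table)
print(rebuilt)
```

The area name is now stored once per area code instead of once per row, which is the redundancy that normalization removes.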
