# What is a Data Dictionary

I believe understanding the data is the cornerstone of any analysis or modeling.
When you have a Data Dictionary, you don't have to guess, imagine, or remember (you can never do it all perfectly): you take a look and know exactly what each feature is for.

In this notebook you will find how I created a Data Dictionary for the Titanic data set.

Let's get started.

In [1]:
import pandas as pd
from IPython.display import HTML, display

In [2]:
def ddict(df, tdict, path):
    """
    Create a Data Dictionary, display it, and write it to an Excel file.

        Parameters:
            df: data set
            tdict: temp data dictionary frame (index=df.columns, columns: def, des, type, unit)
            path: where to write the output file
        Output:
            Data Dictionary frame with feature names in the index and these columns:
                'Definition' - meaning of the feature
                'Description' - meaning of the feature's values
                '#Unique' - number of unique values in the column (NaN is not counted)
                'TopValue' - the most frequent value
                '%UsedTop' - % of rows holding the top value
                '%Missing' - % of missing values
                'Unit' - measurement unit
                'Type' - measurement scale
                'Dtype' - pandas dtype of the column
    """
    definitions, descriptions, n_unique = [], [], []
    top_values, top_pcts, missing_pcts = [], [], []
    units, scales, dtypes = [], [], []
    n_rows = len(df)
    cols = df.columns.tolist()
    for col in cols:
        definitions.append(tdict.loc[col, 'def'])
        descriptions.append(tdict.loc[col, 'des'])

        n_unique.append(df[col].nunique())  # number of unique values (NaN excluded)

        counts = df[col].value_counts()  # sorted by frequency, NaN excluded
        top_values.append(counts.index[0])  # the most frequent value
        top_pcts.append(round(counts.iloc[0] * 100 / n_rows, 2))  # % of rows holding it

        n_missing = df[col].isnull().sum()  # number of missing values
        missing_pcts.append(round(n_missing * 100 / n_rows, 2))  # % of missing

        units.append(tdict.loc[col, 'unit'])
        scales.append(tdict.loc[col, 'type'])

        dtypes.append(str(df[col].dtype))

    # The dict keys fix the column order; index=cols gives a plain (single-level) index
    tdf = pd.DataFrame({'Definition': definitions, 'Description': descriptions, '#Unique': n_unique,
                        'TopValue': top_values, '%UsedTop': top_pcts, '%Missing': missing_pcts,
                        'Unit': units, 'Type': scales, 'Dtype': dtypes}, index=cols)
    tdf = tdf.sort_values('Dtype')
    
    display(HTML(tdf.to_html()))

    # The context manager saves the file on exit (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter(path) as writer:
        tdf.to_excel(writer, sheet_name='Data_Dictionary', index=True)

First, we read the data set and check its shape (number of rows and columns).

In [3]:
df = pd.read_excel('...Titanic.xls')
df.shape

(1309, 11)

Because there are only 11 columns, we can get a good enough feel for the data by looking at the first rows of the table.

In [4]:
df.head(3)

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |


Next, simply copy and paste the feature descriptions from Kaggle into an Excel file and open the result here:

In [5]:
tdict = pd.read_excel('...TitanicDDict_raw_1.xlsx', index_col='f')
tdict

| f | def | des | type | unit |
|---|---|---|---|---|
| survival | Survival | 0 = No, 1 = Yes | binary | - |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | ordinal | - |
| sex | Sex | NaN | binary | - |
| Age | Age in years | NaN | cont | years |
| sibsp | # of siblings / spouses aboard the Titanic | NaN | discr | person |
| parch | # of parents / children aboard the Titanic | NaN | discr | person |
| ticket | Ticket number | NaN | nominal | - |
| fare | Passenger fare | NaN | cont | usd |
| cabin | Cabin number | NaN | nominal | - |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | nominal | - |


Where
- "f" - feature names
- "def" - feature definitions
- "des" - description of the feature's values
- "unit" - measurement units
- "type" - measurement scale, one of the following:
    - Nominal - truly categorical: labels or groups without order
    - Binary - dichotomous, a nominal scale with only two categories
    - Ordinal - categorical groups with order
    - Discrete (`discr`) - discrete numerical: values that are counted rather than measured
    - Continuous (`cont`) - continuous with an absolute zero, without a temporal component
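The scale labels can be checked mechanically before the dictionary is built. This is only a minimal sketch: it assumes the five short labels seen in the table (`nominal`, `binary`, `ordinal`, `discr`, `cont`) are the full set, and the helper name `invalid_scales` and the toy frame are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Short scale labels as they appear in the "type" column (assumed to be the full set)
ALLOWED_SCALES = {"nominal", "binary", "ordinal", "discr", "cont"}

def invalid_scales(tdict):
    """Return the rows of tdict whose 'type' is not a known scale label."""
    return tdict[~tdict["type"].isin(ALLOWED_SCALES)]

# Toy stand-in for the temporary dictionary; "continuos" is a deliberate typo
toy = pd.DataFrame(
    {"type": ["binary", "ordinal", "continuos"]},
    index=pd.Index(["survived", "pclass", "fare"], name="f"),
)
bad = invalid_scales(toy)
print(bad.index.tolist())  # ['fare']
```

An empty result means every feature carries one of the expected labels.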

To check whether we are ready to create the dictionary, let's compare the feature names in both tables: the main data frame and the temporary dictionary.

In [6]:
set1 = set(df.columns.tolist())
set2 = set(tdict.index.tolist())
set1.difference(set2) 

{'age', 'name', 'survived'}
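The difference above runs in one direction only: it lists columns that have no dictionary row. Checking the other direction as well shows which dictionary rows match no column, which usually points at the misspelled labels. A small sketch in plain Python, with the names copied from the two tables above (the variable names `data_cols` and `dict_rows` are hypothetical):

```python
# Feature names as they appear in the data set and in the first draft of the dictionary
data_cols = {"pclass", "survived", "name", "sex", "age", "sibsp",
             "parch", "ticket", "fare", "cabin", "embarked"}
dict_rows = {"survival", "pclass", "sex", "Age", "sibsp",
             "parch", "ticket", "fare", "cabin", "embarked"}

missing_in_dict = sorted(data_cols - dict_rows)  # columns with no dictionary row
unknown_in_dict = sorted(dict_rows - data_cols)  # dictionary rows matching no column

print(missing_in_dict)  # ['age', 'name', 'survived']
print(unknown_in_dict)  # ['Age', 'survival']
```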

So we need to work "by hand" again:
- change the index values (they must be identical to the feature names)
- add the missing `name` row
- fill in all NaNs
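I made these fixes in Excel, but the renaming and the NaN filling could also be done in pandas. The sketch below is one possible way, not what the notebook does: the `renames` mapping and the toy frame are hypothetical, and it fills every gap with `-`, while the hand-edited file uses more specific placeholders for some rows.

```python
import pandas as pd

# Toy stand-in for the first draft of the dictionary (one description left empty)
tdict_raw = pd.DataFrame(
    {"def": ["Survival", "Age in years"],
     "des": ["0 = No, 1 = Yes", None],
     "type": ["binary", "cont"],
     "unit": ["-", "years"]},
    index=pd.Index(["survival", "Age"], name="f"),
)

# Hypothetical mapping from the Kaggle labels to the real column names
renames = {"survival": "survived", "Age": "age"}

# Rename the index labels, then replace every missing cell with "-"
tdict_fixed = tdict_raw.rename(index=renames).fillna("-")

print(tdict_fixed.index.tolist())     # ['survived', 'age']
print(tdict_fixed.loc["age", "des"])  # -
```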

In [7]:
tdict = pd.read_excel('...TitanicDDict_raw_2.xlsx', index_col='f')
tdict

| f | def | des | type | unit |
|---|---|---|---|---|
| survived | Survival | 0 = No, 1 = Yes | binary | - |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | ordinal | - |
| sex | Sex | f,m | binary | - |
| age | Age in years | 123 | cont | years |
| sibsp | # of siblings / spouses aboard the Titanic | 123 | discr | person |
| parch | # of parents / children aboard the Titanic | 123 | discr | person |
| ticket | Ticket number | - | nominal | - |
| fare | Passenger fare | - | cont | usd |
| cabin | Cabin number | - | nominal | - |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | nominal | - |


Now we're ready to create a dictionary:

In [8]:
ddict(df, tdict, '...TitanicDDict.xlsx')

| | Definition | Description | #Unique | TopValue | %UsedTop | %Missing | Unit | Type | Dtype |
|---|---|---|---|---|---|---|---|---|---|
| age | Age in years | 123 | 98 | 24 | 3.59 | 20.09 | years | cont | float64 |
| fare | Passenger fare | - | 281 | 8.05 | 4.58 | 0.08 | usd | cont | float64 |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | 3 | 3 | 54.16 | 0.0 | - | ordinal | int64 |
| survived | Survival | 0 = No, 1 = Yes | 2 | 0 | 61.8 | 0.0 | - | binary | int64 |
| sibsp | # of siblings / spouses aboard the Titanic | 123 | 7 | 0 | 68.07 | 0.0 | person | discr | int64 |
| parch | # of parents / children aboard the Titanic | 123 | 8 | 0 | 76.55 | 0.0 | person | discr | int64 |
| name | passengers names | - | 1307 | Connolly, Miss. Kate | 0.15 | 0.0 | - | nominal | object |
| sex | Sex | f,m | 2 | male | 64.4 | 0.0 | - | binary | object |
| ticket | Ticket number | - | 939 | CA. 2343 | 0.84 | 0.0 | - | nominal | object |
| cabin | Cabin number | - | 186 | C23 C25 C27 | 0.46 | 77.46 | - | nominal | object |


Where
- index - feature names
- "Definition" - feature definitions
- "Description" - descriptions of the feature's values
- "#Unique" - number of unique values
- "TopValue" - the most frequent value
- "%UsedTop" - how often the top value occurs, in percent
- "%Missing" - % of missing values
- "Unit" - measurement units
- "Type" - measurement scales
- "Dtype" - pandas data types
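To make the computed columns concrete, here is a small worked example on a toy column of five rows, one of them missing. The column and its values are made up for illustration. Note that both percentages are taken over all rows, missing ones included.

```python
import pandas as pd

# Toy column: 5 rows, one of them missing
s = pd.Series(["S", "S", "C", None, "S"])
n_rows = len(s)

n_unique = s.nunique()                                   # distinct non-missing values: 2
counts = s.value_counts()                                # most common first, NaN excluded
top_value = counts.index[0]                              # 'S'
pct_top = round(counts.iloc[0] * 100 / n_rows, 2)        # 3 of 5 rows: 60.0
pct_missing = round(s.isnull().sum() * 100 / n_rows, 2)  # 1 of 5 rows: 20.0

print(n_unique, top_value, pct_top, pct_missing)  # 2 S 60.0 20.0
```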

I believe data engineering becomes faster and more accurate with this table, because we can see a lot of important information at once, in a single place.
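As one example of putting the table to work, the dictionary can be filtered like any other data frame. The sketch below uses a few rows copied from the output above; the 20% cutoff is an arbitrary assumption for illustration, not a recommendation.

```python
import pandas as pd

# A few rows copied from the dictionary output above
dd = pd.DataFrame(
    {"%Missing": [20.09, 0.08, 0.0, 77.46],
     "Type": ["cont", "cont", "nominal", "nominal"]},
    index=["age", "fare", "ticket", "cabin"],
)

# Hypothetical cutoff: flag features with more than 20% missing values
threshold = 20
needs_attention = dd[dd["%Missing"] > threshold].index.tolist()

print(needs_attention)  # ['age', 'cabin']
```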

I'll be glad to see any questions or suggestions. Criticism is welcome too.  
Many thanks for your time.  
Best,  
Lana  