__[Open and try this file online (Colab)](https://colab.research.google.com/github/djekra/pandasklar/blob/master/jupyter/15_Analyse_Redundancy.ipynb)__

# Analyse uniqueness, discrepancies and redundancy
* `analyse_groups`: Analyses a DataFrame for uniqueness and redundancy.
* `same_but_different`: Returns the rows of a DataFrame that agree in the columns listed in `same` but differ in the column named in `different`. This is useful for checking whether fields correlate 100% with each other or are independent.

In [1]:
# blab init
try:
    import blab
except ImportError as e:
    !pip install blab
    import blab    
startup_notebook = blab.blab_startup()
%run $startup_notebook 

blab init
environment['dropbox_path'] = /home/me/Data_Linux/Dropbox
environment['lib_path']     = /home/me/Data_Linux/Dropbox/31_Projekte/01_Python/libs
Start Time: 21:38:01


time: 483 ms


In [11]:
import pandas     as pd 
import bpyth      as bpy

# pandasklar
try:
    import pandasklar as pak 
except ImportError as e:
    !pip install pandasklar
    import pandasklar as pak   
    
# verbose
#pak.Config.set('VERBOSE', True)

time: 18.5 ms


In [3]:
# Generate random data
anz = 500
v = pak.random_series( anz, 'name',                  p_nan=0)
w = v.str[:1]
g = pak.random_series( anz, 'int',   min=2, max=7 ) * 10

s = pak.random_series( anz, 'string',                p_nan=0)
o = pak.random_series( anz, 'choice', choice=['Bremen','Berlin','Hamburg'], p_nan=0.2   )
p = pak.random_series( anz, 'choice', choice=['cats','dogs']   )
a = pak.random_series( anz, 'int',   min=0, max=anz*10, p_dup=0 ) # there will be no dups
b = pak.random_series( anz, 'int',   min=0, max=anz-10          ) # fewer possible values than rows, so dups are forced

df = pak.dataframe( [ v, w, g, s, o, p, a, b] )
df.columns = ['first_name','firstletter','age_class','secret','city','loves','int_fine','int_rough',]
df

Input rtype=('list', 'Series', 'str') shape=(8, 500)
rotated=True Output rtype=('DataFrame', 'Series') shape=(500, 8)


Unnamed: 0,first_name,firstletter,age_class,secret,city,loves,int_fine,int_rough
0,Birgit,B,40,1idJe,Hamburg,dogs,1080,453
1,Ole,O,30,üaXDiJ,Berlin,cats,1542,325
2,Pia,P,50,äOJ19Kq,,dogs,919,64
3,Petra,P,20,vHDeDÄ5,,cats,4391,272
4,Willi,W,60,Wnq1f0,Bremen,dogs,944,467
...,...,...,...,...,...,...,...,...
495,Anja,A,20,VQÜH,,dogs,2372,207
496,Lieselotte,L,50,umcz3Z,Berlin,dogs,1752,362
497,Tom,T,50,ReiWKö,Berlin,dogs,3771,261
498,Anette,A,50,rtY4k,Bremen,dogs,4775,101


time: 136 ms


## analyse_groups

In [4]:
?pak.analyse_groups

time: 75.6 ms


Signature: pak.analyse_groups(df, exclude=[], tiefe_max=3)
Docstring:
Analyses a DataFrame for uniqueness and redundancy.
Groups by many combinations of columns and counts the duplicates that are created in the process.
Interpretation:
0 dups => This combination of columns is unique
Same number of dups than other combination of columns => Indication of redundancy
File:      ~/Data_Linux/Dropbox/31_Projekte/01_Python/git/pandasklar/src/pandasklar/analyse.py
Type:      function


In [5]:
# Analyse for uniqueness and redundancy
a = pak.analyse_groups(df)
a

Unnamed: 0,columns,level,dups_abs,dups_rel
0,[secret],1,0,0.0
1,[int_fine],1,0,0.0
2,"[int_rough, first_name]",2,0,0.0
3,"[int_rough, firstletter]",2,0,0.0
4,"[int_rough, age_class]",2,0,0.0
5,"[int_rough, city]",2,4,0.008
6,"[int_rough, loves]",2,4,0.008
7,[int_rough],1,9,0.018
8,"[first_name, age_class]",2,18,0.036
9,"[first_name, city]",2,21,0.042


time: 101 ms


_Interpretation:_ 
 * column `int_fine` uniquely identifies all records
 * column `secret` uniquely identifies all records
 * columns `first_name` and `int_rough` together uniquely identify all records (this depends on the random data)
 * column `firstletter` is redundant to `first_name` (it was derived from it with `v.str[:1]`)
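
The idea behind this interpretation can be reproduced in plain pandas. The following is only a minimal sketch, not pandasklar's implementation: the helper `count_dups` is hypothetical, and it assumes that a "dup" is a row whose values in the chosen columns already occurred in an earlier row (which is what `DataFrame.duplicated` counts).

```python
from itertools import combinations

import pandas as pd


def count_dups(df: pd.DataFrame, max_depth: int = 2) -> pd.DataFrame:
    """Count duplicate rows for every combination of up to max_depth columns.

    Hypothetical helper for illustration, not part of pandasklar.
    """
    rows = []
    for level in range(1, max_depth + 1):
        for cols in combinations(df.columns, level):
            # Assumed definition: a row is a duplicate if an earlier row has
            # the same values in all columns of this combination.
            dups_abs = int(df.duplicated(subset=list(cols)).sum())
            rows.append(
                {
                    "columns": list(cols),
                    "level": level,
                    "dups_abs": dups_abs,
                    "dups_rel": dups_abs / len(df),
                }
            )
    return pd.DataFrame(rows).sort_values(["dups_abs", "level"], ignore_index=True)


# Small hand-made example: firstletter is derived from first_name
demo = pd.DataFrame(
    {
        "first_name": ["Anna", "Anna", "Ole", "Pia", "Petra"],
        "firstletter": ["A", "A", "O", "P", "P"],
        "id": [1, 2, 3, 4, 5],
    }
)
result = count_dups(demo)
print(result)
```

In this small example `[id]` has 0 dups, so it is unique. `[first_name]` and `[first_name, firstletter]` have the same number of dups: adding `firstletter` does not separate any further rows, which is the indication of redundancy.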

## same_but_different

In [6]:
?pak.same_but_different

time: 30.2 ms


Signature: pak.same_but_different(df, same, different, sort=True, return_mask=False)
Docstring:
Returns the rows of a DataFrame that are the same on the one hand and different on the other:
They are the same in the fields named in same.
And they differ in the field named in different.
This is useful for analysing whether fields correlate 100% with each other or are independent.
* same:       Array of column names.
* different:  Single column name.  This column is used to search for differences.
File:      ~/Data_Linux/Dropbox/31_Projekte/01_Python/git/pandasklar/src/pandasklar/analyse.py
Type:      function


In [7]:
# There is one discrepancy in the dataframe

df2 = pak.dataframe( [ list('Packesel'), 
                      list('Packesel'), 
                      list('Packesel'), 
                      list('Packese#'), 
                      list('Packesel'), 
                      list('Packesel'), 
                      list('Packesel'),       
                      list('Packesel'),                     
                ] )
df2

Input rtype=('list', 'list', 'str') shape=(8, 8)
rotated=False Output rtype=('DataFrame', 'Series') shape=(8, 8)


Unnamed: 0,A,B,C,D,E,F,G,H
0,P,a,c,k,e,s,e,l
1,P,a,c,k,e,s,e,l
2,P,a,c,k,e,s,e,l
3,P,a,c,k,e,s,e,#
4,P,a,c,k,e,s,e,l
5,P,a,c,k,e,s,e,l
6,P,a,c,k,e,s,e,l
7,P,a,c,k,e,s,e,l


time: 36.2 ms


In [8]:
# no discrepancy in column E
pak.same_but_different(df2, ['A','B','C','D'], 'E')

Unnamed: 0,A,B,C,D,E,F,G,H


time: 38 ms


In [9]:
# but in column H
pak.same_but_different(df2, ['A','B','C','D'], 'H')

Unnamed: 0,A,B,C,D,E,F,G,H
0,P,a,c,k,e,s,e,l
1,P,a,c,k,e,s,e,l
2,P,a,c,k,e,s,e,l
3,P,a,c,k,e,s,e,#
4,P,a,c,k,e,s,e,l
5,P,a,c,k,e,s,e,l
6,P,a,c,k,e,s,e,l
7,P,a,c,k,e,s,e,l


time: 28.5 ms
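
The behaviour of the two calls above can be approximated in plain pandas. This is a sketch under assumptions, not the code pandasklar actually uses: `same_but_different_sketch` is a hypothetical name, and the `sort` and `return_mask` parameters of the real function are not reproduced.

```python
import pandas as pd


def same_but_different_sketch(df, same, different):
    """Rows that agree on all columns in `same` but disagree on `different`.

    Hypothetical helper for illustration, not part of pandasklar.
    """
    # For every row: how many distinct values of `different` exist
    # among the rows that share the same values in the `same` columns
    n_distinct = df.groupby(same)[different].transform("nunique")
    return df[n_distinct > 1]


demo = pd.DataFrame(
    {
        "A": ["P", "P", "P", "Q"],
        "E": ["e", "e", "e", "x"],
        "H": ["l", "#", "l", "l"],
    }
)

no_discrepancy = same_but_different_sketch(demo, ["A"], "E")
discrepancy = same_but_different_sketch(demo, ["A"], "H")
print(no_discrepancy)  # empty: E never varies within a group of equal A
print(discrepancy)     # the three rows with A == "P", because H varies there
```

An empty result means that, within the data, `different` is fully determined by the columns in `same`. As in the output above, a single deviating value returns the whole group, so the deviating row can be compared with its peers.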


In [10]:
# Another example with the DataFrame from above
# (if you don't see any result, run the notebook again to generate different random data)
pak.same_but_different( df, same=['first_name','age_class','city'], different='loves' )

Unnamed: 0,first_name,firstletter,age_class,secret,city,loves,int_fine,int_rough
13,Anna,A,70,louJo3Q,Berlin,dogs,3977,447
469,Anna,A,70,cÄim,Berlin,cats,3895,89


time: 32.9 ms
