__[Open and try this file online (Colab)](https://colab.research.google.com/github/djekra/pandasklar/blob/master/jupyter/16_Compare_Series_and_DataFrames.ipynb)__

# Compare Series and DataFrames
* `compare_series`: Compares the content of two Series. Returns several indicators of equality.
* `compare_dataframes`: Compares the content of two DataFrames column by column. Returns several indicators of equality.
* `check_equal`: Compares the content of two DataFrames column by column and returns a single boolean.
* `compare_col_dtype`: Returns the names of the columns whose dtype differs between two DataFrames.
* `get_different_rows`: Returns the rows of two DataFrames that differ.

In [1]:
# blab init
try:
    import blab
except ImportError as e:
    !pip install blab
    import blab    
startup_notebook = blab.blab_startup()
%run $startup_notebook 

blab init
environment['in_colab']     = False
environment['dropbox_path'] = D:\Dropbox
environment['lib_path']     = D:\Dropbox\31_Projekte\01_Python\libs
Start Time: 21:36:20


In [2]:
import pandas     as pd 
import bpyth      as bpy

# pandasklar
try:
    import pandasklar as pak 
except ImportError as e:
    !pip install pandasklar
    import pandasklar as pak   
    
# verbose
pak.Config.set('VERBOSE', True)

# copy_on_write
pd.set_option("mode.copy_on_write", True)

VERBOSE = True
--> setting verbose=True as default for all pandasklar functions



## compare_series()

In [3]:
help(pak.compare_series)



<span style="font-size:larger;">compare_series(s, t, format='dict', decimals=None):</span>

**Compares two Pandas Series and returns indicators of equality.**

This function compares two Pandas Series and provides detailed information about their similarities and differences.
It checks for equality in various aspects, including name, data type, length, number of NaNs, content, sort order, and index-data relations.

**Args:**
- `s` (`pd.Series`): The first Pandas Series.
- `t` (`pd.Series`): The second Pandas Series.
- `format` (`str`, optional): Output format for the comparison results.
  - `'dict'` or `'d'`: Returns a dictionary.
  - `'series'` or `'Series'` or `'s'`: Returns a Pandas Series.
  - `'dataframe'` or `'DataFrame'` or `'Dataframe'` or `'df'`: Returns a Pandas DataFrame.
  Defaults to `'dict'`.
- `decimals` (`int`, optional): The number of decimal places to round to when comparing numeric values.
  If `None`, no rounding is performed. Defaults to `None`.

**Returns:**
`dict`, `pd.Series`, or `pd.DataFrame`: Comparison results, depending on the `'format'` parameter.
The output contains the following keys/indices:
- `'name'`: `True` if the series have the same name
- `'dtype'`: `True` if the series have the same dtype (or both are `float32`/`float64`), `False` otherwise.
- `'len'`: `True` if the series have the same length, `False` otherwise.
- `'nnan'`: `True` if the series have the same number of NaNs, `False` otherwise.
- `'nan_pat'`: `True` if the series have the same pattern of NaNs, `False` otherwise.
- `'content'`: `True` if the series have the same content (ignoring index, sort and NaNs), `False` otherwise.
  - For numeric series: If `decimals` is not `None`, values are rounded before comparison.
- `'sort'`: `True` if the series have the same sort order (ignoring index), `False` otherwise.
- `'eq'`: `True` if the series have the same index-data relations (ignoring sort), `False` otherwise.

**Examples:**
```python
>>> s1 = pd.Series([1, 2, 3], name='numbers')
>>> s2 = pd.Series([1, 2, 3], name='numbers')
>>> compare_series(s1, s2, format='dict')
{'name': True, 'dtype': True, 'len': True, 'nnan': True, 'nan_pat': True, 'content': True, 'sort': True, 'eq': True}

>>> s3 = pd.Series([1.1, 2.2, np.nan], name='floats')
>>> s4 = pd.Series([1.1, 2.2, np.nan], name='floats')
>>> compare_series(s3, s4, format='series', decimals=1)
name        True
dtype       True
len         True
nnan        True
nan_pat     True
content     True
sort        True
eq          True
Name: floats, dtype: object

>>> s5 = pd.Series([1, 2, 3], name='numbers')
>>> s6 = pd.Series([3, 2, 1], name='numbers')
>>> compare_series(s5, s6, format='df')
         name  dtype   len  nnan  nan_pat  content   sort    eq
numbers  True   True  True  True     True     True  False  True
```

In [4]:
# Generate test data
s = pak.random_series( 100, 's')
s = s.apply(pak.decorate, p=0.1) # nan
s

0       9O8Q6
1     7mgGj8p
2       8U321
3        eIu7
4     a9uZND2
       ...   
95       llvi
96      ÄFmX3
97    eIvyHQw
98    ÄfANvGR
99      pYTX9
Name: rnd_string, Length: 100, dtype: object

In [5]:
# Generate compare data
# Play with it!

t = s.copy()
# t.name = 's' # name
# t = t[:99] # len
# t = t.apply(pak.decorate, p=0.5) # nan
#t = t.astype('object') # dtype

t[0], t[1] = t[1], t[0] 
#t = t.sort_values()



In [6]:
r = pak.compare_series(s,t, format='df')
r

| | rnd_string |
|---|---|
| name | True |
| dtype | True |
| len | True |
| nnan | True |
| nan_pat | True |
| content | True |
| sort | False |
| eq | False |
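
The swap in cell 5 leaves the set of values unchanged but changes both the positional order and the pairing of index and value, which is why only `sort` and `eq` turn `False`. The following is a minimal plain-pandas sketch of what these three indicators distinguish. It is an illustration under assumptions, not pandasklar's implementation; the helper `indicators` and the sample data are hypothetical.

```python
import pandas as pd

def indicators(a: pd.Series, b: pd.Series) -> dict:
    """Hypothetical helper: three simplified equality checks on two Series."""
    return {
        # same values, regardless of order and index
        "content": sorted(a) == sorted(b),
        # same values in the same positional order, index ignored
        "sort": a.tolist() == b.tolist(),
        # same index -> value pairing, row order ignored
        "eq": a.sort_index().equals(b.sort_index()),
    }

s = pd.Series(["a", "b", "c", "d"], name="letters")

# Swap the values stored under index 0 and 1 (as in the notebook cell)
t = s.copy()
t.loc[0], t.loc[1] = s.loc[1], s.loc[0]

# Reverse the row order but keep each value attached to its index
u = s.iloc[::-1]

print(indicators(s, t))  # content stays True, sort and eq become False
print(indicators(s, u))  # content and eq stay True, only sort becomes False
```

Reordering rows without touching the index, as in `u`, affects only `sort`; reassigning values to different index labels, as in `t`, also breaks `eq`.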


## compare_dataframes()

In [7]:
help(pak.compare_dataframes)



<span style="font-size:larger;">compare_dataframes(df1, df2, format='df', decimals=None):</span>

**Compares two DataFrames column by column and returns indicators of equality.**

This function compares two Pandas DataFrames and provides detailed information about their similarities and differences.
It checks for equality in various aspects for each column, including name, data type, number of NaNs, content, sort order, and index-data relations.
It also provides a summary row (`'(Total)'`) indicating the overall equality of the DataFrames.

**Args:**
- `df1` (`pd.DataFrame`): The first DataFrame.
- `df2` (`pd.DataFrame`): The second DataFrame.
- `format` (`str`, optional): Output format for the comparison results.
  - `'dataframe'` or `'DataFrame'` or `'Dataframe'` or `'df'`: Returns a Pandas DataFrame.
  - `'series'` or `'Series'` or `'s'`: Returns a Pandas Series (only the `'(Total)'` row).
  - `'dict'` or `'d'`: Returns a dictionary (only the `'(Total)'` row).
  - `'bool'` or `'b'`: Returns a boolean (only the `'eq'` value of the `'(Total)'` row).
  Defaults to `'df'`.
- `decimals` (`int`, optional): The number of decimal places to round to when comparing numeric values.
  If `None`, no rounding is performed. Defaults to `None`.

**Returns:**
`pd.DataFrame`, `pd.Series`, `dict`, or `bool`: Comparison results, depending on the `'format'` parameter.
The output contains the following columns/keys:
- `'name'`: `True` if columns exist in both DataFrames, `'left_only'` if the column is only in `df1`, `'right_only'` if the column is only in `df2`.
- `'dtype'`: `True` if columns have the same dtype (or both are `float32`/`float64`), `False` otherwise.
- `'nnan'`: `True` if columns have the same number of NaNs, `False` otherwise.
- `'nan_pat'`: `True` if the columns have the same pattern of NaNs, `False` otherwise.
- `'content'`: `True` if columns have the same content (ignoring index and sort), `False` otherwise.
  - For numeric columns: If `decimals` is not `None`, values are rounded before comparison.
- `'sort'`: `True` if columns have the same sort order (ignoring index), `False` otherwise.
- `'eq'`: `True` if columns have the same index-data relations (ignoring sort), `False` otherwise.
- `'(Total)'`: A summary row indicating the overall equality of the DataFrames.

**Examples:**
```python
>>> df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> compare_dataframes(df1, df2, format='df')
         name  dtype  nnan  nan_pat  content  sort    eq
A        True   True  True     True     True  True  True
B        True   True  True     True     True  True  True
(Total)  True   True  True     True     True  True  True
```

In [8]:
# Generate test data
s = pak.people(10)
s

| | first_name | age | age_class | postal_code | birthplace | secret | features | history |
|---|---|---|---|---|---|---|---|---|
| 0 | Linda | 20 | 20 | 83692 | Bremen | vGAcÖ | {r, Q, u, F} | [c, b, a] |
| 1 | Manfred | 36 | 30 | 64354 | Berlin | Kwo4Xwb0Ol | {T, o} | [a, b, c] |
| 2 | Helga | 30 | 30 | 59344 | Bremen | tQClma | {} | [A, x] |
| 3 | Yannik | 39 | 30 | 44111 | Berlin | ÜkTC88W | {g, i, S, n} | [A, A, A] |
| 4 | Lucas | 25 | 20 | 79960 | Bremen | GeQÜuwq | {s} | [A, B, C] |
| 5 | Rita | 34 | 30 | 78442 | Berlin | NW7N1xKxz | {p, n, L} | [A, B, C, C] |
| 6 | Hannes | 32 | 30 | 95604 | Bremen | üSVCMY | {I, 0} | [A, C, C, B] |
| 7 | Linda | 26 | 20 | 64354 | Bremen | Hr2TFBbC | {m} | [] |
| 8 | Lucas | 40 | 40 | 78442 | Bremen | uqSeH3 | {T, o} | [b, b, a, b] |
| 9 | Lucas | 27 | 20 | 83692 | Bremen | nüpo5IÖ | {} | [A, x] |


In [9]:
# Generate compare data
# Play with it!

t = s.copy()
# t.name = 's' # name
# t = t[:99] # len
# t = t.apply(pak.decorate, p=0.5) # nan
t['age'] = t.age.astype('float') # dtype

#t = t.sort_values()
#t = pak.drop_cols(t, 'age')
t['AAGE'] = 0
#t = t.sort_values(['first_name'])
#t.loc[0,'age'] = None
#t = t.head(50)
t= pak.move_cols(t,'age',-1)
t



| | first_name | age_class | postal_code | birthplace | secret | features | history | AAGE | age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Linda | 20 | 83692 | Bremen | vGAcÖ | {r, Q, u, F} | [c, b, a] | 0 | 20.0 |
| 1 | Manfred | 30 | 64354 | Berlin | Kwo4Xwb0Ol | {T, o} | [a, b, c] | 0 | 36.0 |
| 2 | Helga | 30 | 59344 | Bremen | tQClma | {} | [A, x] | 0 | 30.0 |
| 3 | Yannik | 30 | 44111 | Berlin | ÜkTC88W | {g, i, S, n} | [A, A, A] | 0 | 39.0 |
| 4 | Lucas | 20 | 79960 | Bremen | GeQÜuwq | {s} | [A, B, C] | 0 | 25.0 |
| 5 | Rita | 30 | 78442 | Berlin | NW7N1xKxz | {p, n, L} | [A, B, C, C] | 0 | 34.0 |
| 6 | Hannes | 30 | 95604 | Bremen | üSVCMY | {I, 0} | [A, C, C, B] | 0 | 32.0 |
| 7 | Linda | 20 | 64354 | Bremen | Hr2TFBbC | {m} | [] | 0 | 26.0 |
| 8 | Lucas | 40 | 78442 | Bremen | uqSeH3 | {T, o} | [b, b, a, b] | 0 | 40.0 |
| 9 | Lucas | 20 | 83692 | Bremen | nüpo5IÖ | {} | [A, x] | 0 | 27.0 |


In [10]:
# Output as DataFrame
pak.compare_dataframes(s,t)

| | name | dtype | nnan | nan_pat | content | sort | eq |
|---|---|---|---|---|---|---|---|
| first_name | True | True | True | True | True | True | True |
| age_class | True | True | True | True | True | True | True |
| postal_code | True | True | True | True | True | True | True |
| birthplace | True | True | True | True | True | True | True |
| secret | True | True | True | True | True | True | True |
| features | True | True | True | True | True | True | True |
| history | True | True | True | True | True | True | True |
| age | True | False | True | True | True | True | True |
| AAGE | right_only | | | | False | | False |
| (Total) | False | False | False | False | False | False | False |
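
In the `age` row, `dtype` is `False` while `content` is `True`: the cast to float changed the column's type but not its numeric values. The `AAGE` row is marked `right_only` because that column exists only in `t`. A small plain-pandas sketch of both observations follows; the frames `left` and `right` are made-up stand-ins for `s` and `t`, and the code is not how pandasklar computes its indicators.

```python
import pandas as pd

# Hypothetical stand-ins for s and t from the notebook
left = pd.DataFrame({"first_name": ["Linda", "Manfred", "Helga"],
                     "age": [20, 36, 30]})
right = left.copy()
right["age"] = right["age"].astype("float")  # dtype changes, values do not
right["AAGE"] = 0                            # column only present on the right

# The dtype differs ...
same_dtype = left["age"].dtype == right["age"].dtype
# ... but the values still compare equal element by element
same_values = bool((left["age"] == right["age"]).all())

# Columns that exist in only one of the two frames
left_only = [c for c in left.columns if c not in right.columns]
right_only = [c for c in right.columns if c not in left.columns]

print(same_dtype, same_values)  # False True
print(left_only, right_only)    # [] ['AAGE']
```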


In [11]:
# Output as dict
pak.compare_dataframes(s,t, format='dict')

{'name': False,
 'dtype': np.False_,
 'nnan': np.False_,
 'nan_pat': np.False_,
 'content': np.False_,
 'sort': np.False_,
 'eq': np.False_}
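
The dict holds only the `(Total)` row, and in the table above that row is `False` in every column, including columns such as `nnan` where the only non-`True` entry is the empty cell of `AAGE`. One way to reproduce that behaviour, as an assumption about the aggregation rather than a description of pandasklar's code, is to treat anything other than `True` as a failure. The frame `flags` below is hypothetical sample data.

```python
import pandas as pd

# Hypothetical per-column indicators, loosely modelled on the table above
flags = pd.DataFrame(
    {
        "name":  [True, "right_only"],
        "dtype": [False, None],
        "eq":    [True, False],
    },
    index=["age", "AAGE"],
)

# A column total is True only if every entry is exactly True;
# strings such as 'right_only' and missing values count as not equal.
total = flags.eq(True).all()
print(total.to_dict())
```

The values in the notebook output print as `np.False_` because they are NumPy booleans; wrapping them in `bool(...)` yields plain Python booleans.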

In [12]:
# Output as bool
pak.compare_dataframes(s,t, format='bool')

np.False_

In [13]:
# This is the same as check_equal
pak.check_equal(s,t)

False

## check_equal()

In [14]:
?pak.check_equal

Signature: pak.check_equal(obj1, obj2)
Docstring:
Compares the content of two DataFrames column by column.
Two DataFrames are equal if
* they have the same shape
* they have the same column names
* and compare_dataframes(format='bool') is True
File:      d:\dropbox\31_projekte\01_python\88_pycharm\pandasklar\src\pandasklar\compare.py
Type:      function

In [15]:
df1 = pak.dataframe( [ list('Babykorb'), 
                       list('abfällig'), 
                       list('Abgründe'), 
                       list('Kätzchen'), 
                       list('Landwirt'), 
                       list('lebendig'), 
                       list('Saugrohr'),       
                       list('Trugbild'),                     
                ] )

df2 = pak.dataframe( [ list('Babykorb'), 
                       list('abfällig'), 
                       list('Abgründe'), 
                       list('Kätzchen'), 
                       list('Landwirt'), 
                       list('lebendig'), 
                       list('Saugrohr'),       
                       list('Trugbild'),                     
                ] )

df1

Input rtype=('list', 'list', 'str') shape=(8, 8)
rotated=False Output rtype=('DataFrame', 'Series', 'str') shape=(8, 8)
Input rtype=('list', 'list', 'str') shape=(8, 8)
rotated=False Output rtype=('DataFrame', 'Series', 'str') shape=(8, 8)


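| | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 0 | B | a | b | y | k | o | r | b |
| 1 | a | b | f | ä | l | l | i | g |
| 2 | A | b | g | r | ü | n | d | e |
| 3 | K | ä | t | z | c | h | e | n |
| 4 | L | a | n | d | w | i | r | t |
| 5 | l | e | b | e | n | d | i | g |
| 6 | S | a | u | g | r | o | h | r |
| 7 | T | r | u | g | b | i | l | d |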


In [16]:
# Initially the DataFrames are equal
assert pak.check_equal(df1, df2)

In [17]:
# One change >> not equal
mask = df2['A'] == 'L'
df2.loc[mask,'A'] = 'R'
assert not pak.check_equal(df1, df2)

In [18]:
# Change back >> equal again
mask = df2['A'] == 'R'
df2.loc[mask,'A'] = 'L'
assert pak.check_equal(df1, df2)

In [19]:
# change column order and row order 
df2 = pak.move_cols(df2,'D').sort_values('D')
df2

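| | D | A | B | C | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 4 | d | L | a | n | w | i | r | t |
| 5 | e | l | e | b | n | d | i | g |
| 6 | g | S | a | u | r | o | h | r |
| 7 | g | T | r | u | b | i | l | d |
| 2 | r | A | b | g | ü | n | d | e |
| 0 | y | B | a | b | k | o | r | b |
| 3 | z | K | ä | t | c | h | e | n |
| 1 | ä | a | b | f | l | l | i | g |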


In [20]:
# still equal
assert pak.check_equal(df1, df2)
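
The DataFrames still count as equal because neither column order nor row order matters, as long as each index label keeps its row content. A rough plain-pandas equivalent is sketched below. The function `equal_ignoring_order` is a hypothetical helper that assumes unique column names and a unique index; it is not the code behind `check_equal`.

```python
import pandas as pd

def equal_ignoring_order(df1: pd.DataFrame, df2: pd.DataFrame) -> bool:
    """Hypothetical helper: equality that ignores column and row order."""
    if df1.shape != df2.shape:
        return False
    if set(df1.columns) != set(df2.columns):
        return False
    cols = sorted(df1.columns)
    # Bring both frames into the same column and index order, then compare
    return df1[cols].sort_index().equals(df2[cols].sort_index())

words = ["Babykorb", "Landwirt", "Trugbild"]
df1 = pd.DataFrame([list(w) for w in words], columns=list("ABCDEFGH"))

# Move column D to the front and sort the rows by it
df2 = df1[list("DABCEFGH")].sort_values("D")
print(equal_ignoring_order(df1, df2))  # True

# A single changed cell breaks equality
df3 = df2.copy()
df3.loc[1, "A"] = "R"
print(equal_ignoring_order(df1, df3))  # False
```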

## compare_col_dtype()

In [21]:
?pak.compare_col_dtype

Signature: pak.compare_col_dtype(df1, df2)
Docstring: Returns the column names of two DataFrames whose dtype differs.
File:      d:\dropbox\31_projekte\01_python\88_pycharm\pandasklar\src\pandasklar\compare.py
Type:      function

In [22]:
pak.compare_col_dtype(s, t)

['age']
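
Only `age` is reported: its dtype changed from integer to float, while `AAGE` exists in one DataFrame only and therefore has no dtype to compare against. The same check can be sketched in plain pandas as follows. The helper `differing_dtypes` and the two small frames are hypothetical, and how pandasklar treats columns missing on one side is an assumption based on the output above.

```python
import pandas as pd

def differing_dtypes(df1: pd.DataFrame, df2: pd.DataFrame) -> list:
    """Hypothetical helper: shared columns whose dtype differs."""
    return [
        col for col in df1.columns
        if col in df2.columns and df1[col].dtype != df2[col].dtype
    ]

left = pd.DataFrame({"first_name": ["Linda", "Manfred"], "age": [20, 36]})
right = left.copy()
right["age"] = right["age"].astype("float")  # int -> float
right["AAGE"] = 0                            # extra column, skipped by the check

print(differing_dtypes(left, right))  # ['age']
```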

## get_different_rows()

In [23]:
?pak.get_different_rows

Signature: pak.get_different_rows(df1, df2, use_index=True, indicator=True)
Docstring:
Returns the rows of two DataFrames that differ.

This function compares two DataFrames and returns the rows that are different.
It offers two modes of comparison, controlled by the `use_index` parameter:

- **`use_index=True` (Index-based comparison):**
  The DataFrames are compared row by row, based on their index.
  Rows with the same index but different content are returned.
  Rows that exist only in one DataFrame are also returned.

- **`use_index=False` (Content-based comparison):**
  The indexes of the DataFrames are completely ignored.
  Rows are compared based solely on their content (based on the hashable columns).
  Rows that exist in one DataFrame but not in the other (regardless of index) are returned.
  Duplicate rows are considered as one.

Additional or missing columns are ignored.
Float columns may cause mistakes due to flo

In [24]:
df1 = pak.people(size=10, seed=84)
df2 = pak.people(size=10, seed=84).sort_values('secret')
df2.loc[3, 'first_name'] = 'Test'

In [25]:
pak.get_different_rows(df1, df2)

| | first_name | age | age_class | postal_code | birthplace | secret | _merge |
|---|---|---|---|---|---|---|---|
| 3 | Lothar | 22 | 20 | | Bremen | Gls3üXXIT | left_only |
| 3 | Test | 22 | 20 | | Bremen | Gls3üXXIT | right_only |
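
The `_merge` column with `left_only` and `right_only` looks like the output of an outer merge with `indicator=True`. The index-based mode can be approximated in plain pandas as shown below. This is a sketch under that assumption, with made-up sample data; it does not claim to mirror the internals of `get_different_rows`.

```python
import pandas as pd

df1 = pd.DataFrame({"first_name": ["Anna", "Lothar", "Rita"],
                    "age": [31, 22, 34]})
df2 = df1.sort_values("first_name")   # different row order, same index labels
df2.loc[1, "first_name"] = "Test"     # change one row

# Turn the index into a regular column so it takes part in the comparison,
# then merge on all shared columns and keep an indicator of each row's origin.
merged = df1.reset_index().merge(df2.reset_index(), how="outer", indicator=True)

# Rows found in both frames are identical; everything else differs.
diff = merged[merged["_merge"] != "both"].set_index("index")
print(diff)
```

For a content-based comparison in the spirit of `use_index=False`, the two `reset_index()` calls would be dropped so that only the column values are matched.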


# Playground