<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-pandas" data-toc-modified-id="Introduction-to-pandas-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to pandas</a></span></li><li><span><a href="#get_dummies_nan-version" data-toc-modified-id="get_dummies_nan-version-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><code>get_dummies_nan</code> version</a></span></li><li><span><a href="#Make-new-columns-from-a-series-containing-lists" data-toc-modified-id="Make-new-columns-from-a-series-containing-lists-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Make new columns from a series containing lists</a></span></li></ul></div>

# Introduction to pandas

In [8]:
import pandas as pd
import numpy as np
from collections.abc import Iterable 

# `get_dummies_nan` version

In [35]:
df_cars = pd.DataFrame([[2,"mercedes","middleclass"], 
                        [np.nan,"mercedes","middleclass"],
                        [3,"Audi",np.nan]],
                        columns= ["members","vehicles","status"])


In [36]:
print(df_cars)

   members  vehicles       status
0      2.0  mercedes  middleclass
1      NaN  mercedes  middleclass
2      3.0      Audi          NaN


In [14]:

def contains_nan(df_col):
    '''
    This function checks whether a column contains any NaNs
    '''
    return df_col.isna().any()

def get_dummies_nan(df, return_nancols=False,inplace=False):
    '''
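    `get_dummies_nan` adds one boolean column `<col>_nan` for every column of `df` that contains
    at least one NaN. The new column marks the rows where the value is missing.

    If `inplace` is False (default), the columns are added to a deep copy of `df`, which is returned.
    If `inplace` is True, `df` itself is modified and no dataframe is returned.
    If `return_nancols` is True, the list of columns containing NaNs is returned as well.
    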
    Examples:
    --------
    
    
    >>> df = pd.DataFrame([[2,"mercedes","middleclass"], 
                           [np.nan,"mercedes","middleclass"],
                           [3,"Audi",np.nan]],
                           columns= ["members","vehicles","status"])
    
    >>> df 

       members  vehicles       status
    0      2.0  mercedes  middleclass
    1      NaN  mercedes  middleclass
    2      3.0      Audi          NaN
    
    >>> df_ = get_dummies_nan(df)
    
    >>> df_
    
       members  vehicles       status  members_nan  status_nan
    0      2.0  mercedes  middleclass        False       False
    1      NaN  mercedes  middleclass         True       False
    2      3.0      Audi          NaN        False        True

    >>> df_, nancols = get_dummies_nan(df, return_nancols=True)
    
    >>> nancols
    
    ['members', 'status']

    '''
    # Work on the caller's dataframe when `inplace` is set, otherwise on a deep copy.
    target = df if inplace else df.copy(deep=True)

    cols_with_nan = []
    # Iterate over a snapshot of the column names, since `target` may be `df` itself.
    for c in list(df.columns):
        if contains_nan(df[c]):
            cols_with_nan.append(c)
            target[c + "_nan"] = df[c].isna().values

    if inplace:
        # `df` was modified in place, so only the column list (or None) is returned.
        return cols_with_nan if return_nancols else None
    return (target, cols_with_nan) if return_nancols else target

In [41]:
df_= get_dummies_nan(df_cars)

In [43]:
df_

Unnamed: 0,members,vehicles,status,members_nan,status_nan
0,2.0,mercedes,middleclass,False,False
1,,mercedes,middleclass,True,False
2,3.0,Audi,,False,True


In [48]:
df_, nancols = get_dummies_nan(df_cars, return_nancols=True)

In [49]:
nancols

['members', 'status']

In [50]:
df_

Unnamed: 0,members,vehicles,status,members_nan,status_nan
0,2.0,mercedes,middleclass,False,False
1,,mercedes,middleclass,True,False
2,3.0,Audi,,False,True
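
For comparison, the same indicator columns can be built without an explicit loop. The following is only a sketch of an alternative, not a replacement for `get_dummies_nan`: it relies on `DataFrame.isna`, `DataFrame.add_suffix` and `DataFrame.join`, and the names `nan_mask`, `flags` and `df_flags` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df_cars = pd.DataFrame(
    [[2, "mercedes", "middleclass"],
     [np.nan, "mercedes", "middleclass"],
     [3, "Audi", np.nan]],
    columns=["members", "vehicles", "status"],
)

# Boolean mask of missing values, one column per original column.
nan_mask = df_cars.isna()

# Keep only the mask columns that contain at least one True (i.e. one NaN).
flags = nan_mask.loc[:, nan_mask.any()]
nancols = flags.columns.tolist()

# Append the indicator columns using the same "_nan" suffix as get_dummies_nan.
df_flags = df_cars.join(flags.add_suffix("_nan"))

print(nancols)
print(df_flags)
```

Like the default `inplace=False` path of `get_dummies_nan`, this leaves `df_cars` untouched, because `join` returns a new dataframe.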


# Make new columns from a series containing lists

Column `vehicles` contains lists of vehicle names. Let us treat this feature as the list of vehicles a family owns, ordered from most used to least used.

For example: Family 0 has 2 vehicles, and the most used one is a Mercedes, then a Toyota.

Now we want to create 3 features from this column, `vehicle_1`, `vehicle_2` and `vehicle_3`, and write each
model in the corresponding column.

In [6]:
df_cars = pd.DataFrame([[2,["mercedes","toyota",None],"middleclass"], 
                        [3,["Renault","Mercedes",None],"middleclass"],
                        [3,["Audi","Mercedes","Tesla"],"uppermiddleclass"]],
                        columns= ["members","vehicles","status"])

df_cars

Unnamed: 0,members,vehicles,status
0,2,"[mercedes, toyota, None]",middleclass
1,3,"[Renault, Mercedes, None]",middleclass
2,3,"[Audi, Mercedes, Tesla]",uppermiddleclass


In [7]:
def proc_df_collist(df: pd.DataFrame, colname: str, inplace=False):
    """
    
    `proc_df_collist` takes a dataframe and a column made of lists and generates new columns containing
    the values from the lists. It generates one new column per list position, so the number of generated
    columns equals the length of the longest list in `df[colname]`. Each new column k is filled
    with the values found at position k of the lists. If a list has no position k (because it is shorter),
    the entry is filled with `NaN`. 
    
    Given `df` and `colname`, create as many new columns as `df[colname].apply(len).max()` and
    write in row j of column `colname_k` the value found at `df[colname].iloc[j][k]`.
    
    
    Examples:
    ---------
    >>> df = pd.DataFrame([[2,["p","b",None]], 
                   [3,["a","c",None]],
                  [3,["d","w","a"]]],columns= ["first","second"])
                  
    >>> df
    
       first        second
    0      2  [p, b, None]
    1      3  [a, c, None]
    2      3     [d, w, a]

    >>> newcols = proc_df_collist(df, "second")

    >>> newcols
      second_0 second_1 second_2
    0        p        b     None
    1        a        c     None
    2        d        w        a

    
    >>> df2 = pd.DataFrame([[2,["p"]], 
                   [3,["a",2,3]],
                   [3,[4]]],columns= ["A","B"])
                   
    >>> df2
       A          B
    0  2        [p]
    1  3  [a, 2, 3]
    2  3        [4]

    >>> proc_df_collist(df2, "B")

      B_0  B_1  B_2
    0   p  NaN  NaN
    1   a  2.0  3.0
    2   4  NaN  NaN

    """
    assert isinstance(df, pd.DataFrame), "type(df)={} but it should be pd.DataFrame".format(type(df))
    assert isinstance(colname, str), "type(colname)={} but it should be str".format(type(colname))
    assert isinstance(df[colname].iloc[0],(list,set, np.ndarray)), "type(df[colname].iloc[0])={} but it should be in [list, set, np.ndarray]".format(type(df[colname].iloc[0]))
    
    
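    # One new column per position of the longest list in the column.
    n_new_cols = df[colname].apply(len).max()
    colnames   = [colname + "_" + str(i) for i in range(n_new_cols)]   
    
    # Shorter lists are padded with missing values; reusing df.index keeps the
    # result aligned with `df` even when its index is not the default 0..n-1.
    return pd.DataFrame(df[colname].tolist(), columns=colnames, index=df.index)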

In [4]:
df = pd.DataFrame([[2,["p","b",None]], 
                   [3,["a","c",None]],
                   [3,["d","w","a"]]],columns= ["first","second"])


In [5]:
df

Unnamed: 0,first,second
0,2,"[p, b, None]"
1,3,"[a, c, None]"
2,3,"[d, w, a]"


In [6]:
proc_df_collist(df, "second")

Unnamed: 0,second_0,second_1,second_2
0,p,b,
1,a,c,
2,d,w,a


If the column passed to `proc_df_collist` contains iterables of different sizes, it generates as many columns as the longest iterable has elements and fills the positions for which there is no information with `NaN`.

In [7]:
df2 = pd.DataFrame([[2,["p"]], 
                   [3,["a",2,3]],
                   [3,[4]]],columns= ["A","B"])

In [8]:
proc_df_collist(df2, "B")

Unnamed: 0,B_0,B_1,B_2
0,p,,
1,a,2.0,3.0
2,4,,
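
Returning to the `vehicles` example from the start of this section, the sketch below shows one possible way to obtain the `vehicle_1`, `vehicle_2` and `vehicle_3` features and attach them to the dataframe. It is self-contained and repeats the core idea of `proc_df_collist` inline. The 1-based renaming and the names `expanded` and `df_features` are assumptions made for this illustration only.

```python
import pandas as pd

df_cars = pd.DataFrame(
    [[2, ["mercedes", "toyota", None], "middleclass"],
     [3, ["Renault", "Mercedes", None], "middleclass"],
     [3, ["Audi", "Mercedes", "Tesla"], "uppermiddleclass"]],
    columns=["members", "vehicles", "status"],
)

# Expand each list into positional columns (0, 1, 2), keeping the row index.
expanded = pd.DataFrame(df_cars["vehicles"].tolist(), index=df_cars.index)

# Rename the positions to 1-based feature names: vehicle_1, vehicle_2, vehicle_3.
expanded.columns = [f"vehicle_{k + 1}" for k in expanded.columns]

# Replace the original list column with the new feature columns.
df_features = df_cars.drop(columns="vehicles").join(expanded)

print(df_features)
```

Joining on the index is what makes it necessary for the expanded frame to share the index of `df_cars`; with a mismatched index the new columns would not line up with the right rows.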
