<a id="toc"></a>

# THEORY: pandas DataFrames

- [pd.DataFrame](#df)

- [IO](#io):
   - read_csv, read_excel, read_html, read_json (example: calling an API)
   - to_csv, to_json

- [OPERATIONS](#op):
   - +, -, <, add, multiply

- [INDEXING](#idx) :
    - at, loc, iat, iloc, boolean indexing, head, tail
    - series [] vs dataframe [[]]

- [ORDER BY](#sort)

- [USEFUL FUNCTIONS](#func)
    - isnull, fillna
    - drop
    - string methods
    - datetime/timestamp
        - timeindex
        - moving average 

- [AGGREGATION](#agg)

- [GROUPBY/PIVOT](#gb)
    - value_counts, groupby, pivot


- [JOINS & concat](#join)
    - merge

- [RESTRUCTURE](#restr)
    - get_dummies
    - (unpivot, unstack, explode)

- [APPLY/MAP](#map)



official doc  
https://pandas.pydata.org/docs/index.html

comparison to sql  
https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html

In [3]:
import numpy as np
import pandas as pd

<a id="df"></a>

# pd.DataFrame
[toc](#toc)

In [5]:
series_1 = pd.Series(data=np.random.randint(5,size=(5),dtype=np.uint8),
                    index=list('abcde'))

series_a = pd.Series(data=list('abcab'),
                     index=list('abcde'),
                     dtype='category')

series_b = pd.Series(data=list('xyzzz'),
                     index=list('abcde'),
                     dtype='O'
                    )


series_2 = pd.Series([1,1,1,0,np.nan], 
                     index = list('abcde'))

df = pd.DataFrame(data = {'num':series_1,'cat':series_a,'cat2':series_b,'num2':series_2})
print(df)
display(df)

   num cat cat2  num2
a    2   a    x   1.0
b    4   b    y   1.0
c    3   c    z   1.0
d    2   a    z   0.0
e    4   b    z   NaN


Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,


In [6]:

df1 = pd.DataFrame(data = {'A':[1,2,3], 'B':[3,2,1], 'C':[4,5,6]})
df1

Unnamed: 0,A,B,C
0,1,3,4
1,2,2,5
2,3,1,6


In [7]:
df2 = pd.DataFrame(data = np.random.randint(2,10,size=(8,3)),columns = list('ABC'))
df2

Unnamed: 0,A,B,C
0,6,6,8
1,7,6,8
2,8,5,9
3,9,4,5
4,8,8,4
5,7,2,3
6,2,7,5
7,7,3,7


In [8]:
df3 = pd.DataFrame(data = [{'i':1, 'j':5}, {'i':2,'k':3},{'i':6,'j':2,'k':7}])
df3

Unnamed: 0,i,j,k
0,1,5.0,
1,2,,3.0
2,6,2.0,7.0


## datatypes

https://pandas.pydata.org/docs/reference/arrays.html

## attributes

In [11]:
df.shape

(5, 4)

In [12]:
df.dtypes

num        uint8
cat     category
cat2      object
num2     float64
dtype: object

In [13]:
df['cat'] ## known set of authorized values --> less flexible, more efficient

a    a
b    b
c    c
d    a
e    b
Name: cat, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [14]:
df["cat2"] ## object = 'python object', anything possible --> more flexible, less efficient

a    x
b    y
c    z
d    z
e    z
Name: cat2, dtype: object
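
The trade-off between `category` and `object` noted in the two cells above can be made visible with a rough sketch. The column below is hypothetical (100 000 strings drawn from three values), and the exact byte counts depend on the pandas version, so only the order of magnitude matters.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100_000 strings drawn from only three distinct values.
rng = np.random.default_rng(0)
as_object = pd.Series(rng.choice(['x', 'y', 'z'], size=100_000), dtype='object')
as_category = as_object.astype('category')

# deep=True also counts the Python string objects, not only the pointers.
mem_object = as_object.memory_usage(deep=True)
mem_category = as_category.memory_usage(deep=True)
print(mem_object, mem_category)

# "Less flexible": a value outside the known categories is rejected.
# (The exception type has changed between pandas versions, hence the tuple.)
try:
    as_category.iloc[0] = 'w'
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```

If a new value is really needed, `as_category.cat.add_categories(['w'])` returns a Series that accepts it.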

In [15]:
df.columns

Index(['num', 'cat', 'cat2', 'num2'], dtype='object')

In [16]:
df.values

array([[2, 'a', 'x', 1.0],
       [4, 'b', 'y', 1.0],
       [3, 'c', 'z', 1.0],
       [2, 'a', 'z', 0.0],
       [4, 'b', 'z', nan]], dtype=object)

In [17]:
df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [18]:
list(df.items())

[('num',
  a    2
  b    4
  c    3
  d    2
  e    4
  Name: num, dtype: uint8),
 ('cat',
  a    a
  b    b
  c    c
  d    a
  e    b
  Name: cat, dtype: category
  Categories (3, object): ['a', 'b', 'c']),
 ('cat2',
  a    x
  b    y
  c    z
  d    z
  e    z
  Name: cat2, dtype: object),
 ('num2',
  a    1.0
  b    1.0
  c    1.0
  d    0.0
  e    NaN
  Name: num2, dtype: float64)]

In [19]:
for col_name,series in df.items():
    print(col_name)
    print(series)
    print('-'*50)
    for row in series:
        print(row)
    break

num
a    2
b    4
c    3
d    2
e    4
Name: num, dtype: uint8
--------------------------------------------------
2
4
3
2
4


<a id="io"></a>
# IO
[toc](#toc)

In [21]:
df.to_csv('my_df.csv',index=False,sep=';')

In [22]:
df_from_csv = pd.read_csv('my_df.csv', sep=';')
df_from_csv

Unnamed: 0,num,cat,cat2,num2
0,2,a,x,1.0
1,4,b,y,1.0
2,3,c,z,1.0
3,2,a,z,0.0
4,4,b,z,


In [23]:
### prerequisite: install Beautiful Soup 4 (bs4) or lxml
try:
    url = 'https://fr.wikipedia.org/wiki/Liste_des_villes_de_Belgique'
    
    list_of_dfs = pd.read_html(url) ## requires internet connection
    display(list_of_dfs[1])
except Exception as err:
    print("install Beautiful Soup 4 (bs4) or lxml?")
    print("-"*50)
    print(err)

Unnamed: 0,Ville,Arrondissement,Province,Habitants de la commune (2019),Année de l'acte,Charte communale
0,Aerschot (Aarschot),Louvain (Leuven),Province du Brabant flamand,3 128,1825,1194
1,Alost (Aalst),Alost (Aalst),Province de Flandre-Orientale,90 931,1825,1174
2,Andenne,Namur,Province de Namur,28 169,1825,
3,Ans,Liège,Province de Liège,28 998,2021[3],
4,Antoing,Tournai,Province de Hainaut,7 619,1825,1817 (titre)
...,...,...,...,...,...,...
132,Wavre,Nivelles,Province du Brabant wallon,35 541,1825,1222
133,Wervicq (Wervik),Ypres,Province de Flandre-Occidentale,19 195,1825,
134,Ypres (Ieper),Ypres,Province de Flandre-Occidentale,35 534,1825,1174
135,Zottegem,Alost,Province de Flandre-Orientale,28 062,1985,


In [24]:
with pd.ExcelWriter(path='my_new_excel.xlsx') as EW: ## the context manager saves and closes the file
    df_from_csv.to_excel(excel_writer=EW, sheet_name='from csv to excel')

In [25]:
### pd.read_json() see doc: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
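
The table of contents mentions `read_json` with an API call and `to_json`. Here is a minimal sketch that needs no network: the JSON text stands in for the body of an HTTP response, and its field names (`name`, `price`) are invented for the illustration.

```python
import io
import json
import pandas as pd

# Stand-in for the text returned by an API (hypothetical payload).
payload = json.dumps([
    {"name": "item_a", "price": 10},
    {"name": "item_b", "price": 25},
])

# StringIO makes the text file-like, which read_json accepts.
items = pd.read_json(io.StringIO(payload), orient='records')
print(items)

# to_json goes the other way; orient='records' yields one object per row.
round_trip = items.to_json(orient='records')
print(round_trip)
```

With a real API, the text would typically come from the HTTP client of your choice; the parsing step stays the same.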

<a id="op"></a>
# Operations
[toc](#toc)

In [27]:
try:
    display(df+1)
except Exception as err:
    print(err)

unsupported operand type(s) for +: 'Categorical' and 'int'


In [28]:
df1+100

Unnamed: 0,A,B,C
0,101,103,104
1,102,102,105
2,103,101,106


In [29]:
df1<5

Unnamed: 0,A,B,C
0,True,True,True
1,True,True,False
2,True,True,False


In [30]:
df1+df2

Unnamed: 0,A,B,C
0,7.0,9.0,12.0
1,9.0,8.0,13.0
2,11.0,6.0,15.0
3,,,
4,,,
5,,,
6,,,
7,,,


In [31]:
df1.add(other=df2, fill_value=0)

Unnamed: 0,A,B,C
0,7.0,9.0,12.0
1,9.0,8.0,13.0
2,11.0,6.0,15.0
3,9.0,4.0,5.0
4,8.0,8.0,4.0
5,7.0,2.0,3.0
6,2.0,7.0,5.0
7,7.0,3.0,7.0
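
The same label alignment applies to the other arithmetic methods. A small sketch with `mul` (the `multiply` entry of the table of contents), using two made-up frames whose indexes only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'A': [10, 20]}, index=[1, 2])

# Operators align on labels first: index 0 has no partner, so the result is NaN.
plain = left * right

# The method form lets the missing side default to a neutral value (1 for a product).
filled = left.mul(right, fill_value=1)

print(plain)
print(filled)
```

The choice of `fill_value` depends on the operation: 0 is neutral for `add`, 1 for `mul`.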


In [32]:
display(df1)
np.exp(df1)-1 ## numpy functions are compatible

Unnamed: 0,A,B,C
0,1,3,4
1,2,2,5
2,3,1,6


Unnamed: 0,A,B,C
0,1.718282,19.085537,53.59815
1,6.389056,6.389056,147.413159
2,19.085537,1.718282,402.428793


<a id="idx"></a>
# Indexing & co
[toc](#toc)

## simple indexing: columns only

In [35]:
## simple indexing
print(type(df['cat']))

print('-'*50)
print(df['cat']) ## return series
print(df.cat) ## return series

<class 'pandas.core.series.Series'>
--------------------------------------------------
a    a
b    b
c    c
d    a
e    b
Name: cat, dtype: category
Categories (3, object): ['a', 'b', 'c']
a    a
b    b
c    c
d    a
e    b
Name: cat, dtype: category
Categories (3, object): ['a', 'b', 'c']


## at,loc & iat,iloc

In [37]:
### row 'a', column 'num'
display(df.loc['a','num']) ## return value
display(df.at['a','num']) ## faster, but one cell only
print('-'*50)

### all rows, 'cat' column
display(df.loc[:,'cat']) ## return series

print('-'*50)

### row 'a', all columns
display(df.loc['a',:]) ## return series

print('-'*50)
### rows ['a','c'], columns ['num','num2']
display(df.loc[['a','c'],['num','num2']]) ## return dataframe

2

2

--------------------------------------------------


a    a
b    b
c    c
d    a
e    b
Name: cat, dtype: category
Categories (3, object): ['a', 'b', 'c']

--------------------------------------------------


num       2
cat       a
cat2      x
num2    1.0
Name: a, dtype: object

--------------------------------------------------


Unnamed: 0,num,num2
a,2,1.0
c,3,1.0


In [38]:
### row 0, column 1
display(df.iloc[0,1]) ## return value
display(df.iat[0,1])
print('-'*50)

### row 0, all columns
display(df.iloc[0,:]) ## return series
print('-'*50)

### all rows, column 1
display(df.iloc[:,1]) ## return series
print('-'*50)

### rows [0,1] , columns [1,2]
display(df.iloc[[0,1],[1,2]]) ## return dataframe


'a'

'a'

--------------------------------------------------


num       2
cat       a
cat2      x
num2    1.0
Name: a, dtype: object

--------------------------------------------------


a    a
b    b
c    c
d    a
e    b
Name: cat, dtype: category
Categories (3, object): ['a', 'b', 'c']

--------------------------------------------------


Unnamed: 0,cat,cat2
a,a,x
b,b,y
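
One difference between `loc` and `iloc` is easy to miss when slicing: label slices include the end label, position slices exclude the end position. A short sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'num': [2, 4, 3, 2, 4]}, index=list('abcde'))

# Label-based slice: both ends included, so rows 'a', 'b' and 'c'.
by_label = frame.loc['a':'c']

# Position-based slice: end excluded (like Python lists), so rows 0 and 1.
by_position = frame.iloc[0:2]

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
```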


## Boolean indexing

In [40]:
cond = df['cat'].isin(['a','c'])
df.loc[cond,:]

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
c,3,c,z,1.0
d,2,a,z,0.0


In [41]:
cond1 = df['num2']>0
df.loc[cond1,:]

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0


In [42]:
df.loc[cond1 & ~cond,:] ### & = "AND", | = "OR", ~ = "NOT"

Unnamed: 0,num,cat,cat2,num2
b,4,b,y,1.0
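
When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&` and `|` bind tighter than `<` and `>`. A sketch on a frame that mimics the numeric columns above, with `query` as an alternative spelling:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'num': [2, 4, 3, 2, 4],
                      'num2': [1, 1, 1, 0, np.nan]},
                     index=list('abcde'))

# Parentheses are required around each comparison.
mask = (frame['num'] > 2) & (frame['num2'] > 0)
selected = frame.loc[mask]

# query() expresses the same filter as a string.
same = frame.query('num > 2 and num2 > 0')

print(selected)
```

Row `e` is dropped because a comparison with NaN evaluates to False.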


## head,tail

In [44]:
df.head(2)

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
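
`tail` is the mirror image of `head`. A minimal sketch on a made-up frame, including the negative-`n` form:

```python
import pandas as pd

frame = pd.DataFrame({'num': [2, 4, 3, 2, 4]}, index=list('abcde'))

last_two = frame.tail(2)        # rows 'd' and 'e'
all_but_first = frame.tail(-1)  # negative n: everything except the first row

print(last_two.index.tolist())
print(all_but_first.index.tolist())
```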


## Series \[\] vs DataFrame \[\[\]\]

In [46]:
### indexing 1 value for columns
df.loc[:,'cat'] ## series

a    a
b    b
c    c
d    a
e    b
Name: cat, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [47]:
### selecting a list of columns
display(df.loc[:,['cat']]) ## dataframe

print(df.loc[:,['cat']])

Unnamed: 0,cat
a,a
b,b
c,c
d,a
e,b


  cat
a   a
b   b
c   c
d   a
e   b


<a id="sort"></a>
# Order by
[toc](#toc)

In [49]:
df['num'].sort_values(ascending=False)

b    4
e    4
c    3
a    2
d    2
Name: num, dtype: uint8

In [50]:
df.sort_values(by=['num','num2'],ascending=[False,True])

Unnamed: 0,num,cat,cat2,num2
b,4,b,y,1.0
e,4,b,z,
c,3,c,z,1.0
d,2,a,z,0.0
a,2,a,x,1.0
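
In the output above the NaN of `num2` is treated as the largest value within its tie on `num`. Two related options, sketched on a made-up frame: `na_position` to choose where missing values go, and `sort_index` to order by the index instead of a column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'num': [2, 4, 3, 2, 4],
                      'num2': [1, 1, 1, 0, np.nan]},
                     index=list('abcde'))

# Missing values first instead of last (the default).
nan_first = frame.sort_values(by='num2', na_position='first')

# Order by index labels, descending.
by_index_desc = frame.sort_index(ascending=False)

print(nan_first)
print(by_index_desc)
```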


<a id="func"></a>

# Useful functions and methods
[toc](#toc)

## unique, isna/isnull, fillna, drop


In [52]:
df

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,


In [53]:
### unique
print("unique:")
display(df['cat'].unique())


### isna, isnull
print('-'*50)
print('isna,isnull:')
display(df.isna()) ## isnull is an alias of isna

### fillna
print('-'*50)
print('fillna:')
display(df.fillna(df.median(numeric_only=True))) 
#df.fillna(np.inf, inplace = True) ## inplace=True => modifies the dataframe directly

### drop
print('-'*50)
print('drop, dropna:')

df['static'] = 1
display(df)

df.drop(columns=['static'], inplace=True)
display(df)

unique:


['a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

--------------------------------------------------
isna,isnull:


Unnamed: 0,num,cat,cat2,num2
a,False,False,False,False
b,False,False,False,False
c,False,False,False,False
d,False,False,False,False
e,False,False,False,True


--------------------------------------------------
fillna:


Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,1.0


--------------------------------------------------
drop, dropna:


Unnamed: 0,num,cat,cat2,num2,static
a,2,a,x,1.0,1
b,4,b,y,1.0,1
c,3,c,z,1.0,1
d,2,a,z,0.0,1
e,4,b,z,,1


Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,
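
The table of contents lists string methods under this section. A minimal sketch of the `.str` accessor, on invented names with stray whitespace, inconsistent case and one missing value:

```python
import pandas as pd

names = pd.Series(['  alice smith', 'bob JONES ', None])

# .str methods work element-wise and leave missing values missing.
clean = names.str.strip().str.title()

# na=False decides what a missing value yields in a boolean test.
has_smith = clean.str.contains('Smith', na=False)

# split returns lists; .str[0] picks the first element of each list.
first_name = clean.str.split(' ').str[0]

print(clean)
print(has_smith)
print(first_name)
```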


<a id="agg"></a>
# Aggregation
[toc](#toc)

In [55]:
### sum, mean, (count), std, median, quantile, ...
print(df.mean(numeric_only=True)) ## ignore non-numeric columns/values
print('-'*50)
print(df1.sum(axis=0)) ## return series
print(df1.sum(axis=1)) ## return series
print('-'*50)
print(df1.sum().sum()) ## return (series.sum()) --> value

num     3.00
num2    0.75
dtype: float64
--------------------------------------------------
A     6
B     6
C    15
dtype: int64
0     8
1     9
2    10
dtype: int64
--------------------------------------------------
27


In [56]:
### mixed aggregations
df1.agg({'A':'sum','B':"mean"})

A    6.0
B    2.0
dtype: float64

In [57]:
df1.agg({'A':['sum','mean'],'B':['count','mean',]})

Unnamed: 0,A,B
sum,6.0,
mean,2.0,2.0
count,,3.0


In [58]:
df1.nunique()

A    3
B    3
C    3
dtype: int64

In [59]:
df1['A'].unique()

array([1, 2, 3], dtype=int64)

### descriptive statistics

In [61]:
df.describe(include='all')

Unnamed: 0,num,cat,cat2,num2
count,5.0,5,5,4.0
unique,,3,3,
top,,a,z,
freq,,2,3,
mean,3.0,,,0.75
std,1.0,,,0.5
min,2.0,,,0.0
25%,2.0,,,0.75
50%,3.0,,,1.0
75%,4.0,,,1.0
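
The table of contents also lists a time index and a moving average. A moving average is an aggregation over a sliding window, so a minimal sketch fits here; the daily series and its dates are made up.

```python
import pandas as pd

# Hypothetical daily series with a DatetimeIndex.
dates = pd.date_range('2024-01-01', periods=6, freq='D')
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# 3-day moving average: the first two windows are incomplete, hence NaN.
moving = daily.rolling(window=3).mean()

# A DatetimeIndex accepts date strings as labels (both ends included).
first_two = daily.loc['2024-01-01':'2024-01-02']

# Strings are parsed with to_datetime; .dt exposes the date components.
parsed = pd.to_datetime(pd.Series(['2024-01-01', '2024-02-15']))
months = parsed.dt.month

print(moving)
```

`rolling(window=3, min_periods=1)` would return a value for the incomplete windows as well.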


<a id="gb"></a>
# Groupby & co
[toc](#toc)

## value counts

In [64]:
VC = df.value_counts('cat') ### groupby count
#VC = df['cat'].value_counts()
VC

cat
a    2
b    2
c    1
Name: count, dtype: int64

## groupby

In [66]:
GB = df.groupby('cat2').count()

display(GB)

print(GB.index)
print(GB.columns)


Unnamed: 0_level_0,num,cat,num2
cat2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,1,1
y,1,1,1
z,3,3,2


Index(['x', 'y', 'z'], dtype='object', name='cat2')
Index(['num', 'cat', 'num2'], dtype='object')


In [67]:
GB_no_index = df.groupby(by=['cat2'], as_index=False).count()
display(GB_no_index)

Unnamed: 0,cat2,num,cat,num2
0,x,1,1,1
1,y,1,1,1
2,z,3,3,2


In [68]:
GB = df.groupby('cat2').agg({'num':['sum','count'],'num2':'mean'})
display(GB)
#print(GB.index)
print(GB.columns)

Unnamed: 0_level_0,num,num,num2
Unnamed: 0_level_1,sum,count,mean
cat2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
x,2,1,1.0
y,4,1,1.0
z,9,3,0.5


MultiIndex([( 'num',   'sum'),
            ( 'num', 'count'),
            ('num2',  'mean')],
           )


In [69]:
GB.loc[:, ('num','sum')]

cat2
x    2
y    4
z    9
Name: (num, sum), dtype: uint8

In [70]:
GB2 = GB.copy()
GB2.columns = ['num_sum','num_count','num2_mean']
GB2

Unnamed: 0_level_0,num_sum,num_count,num2_mean
cat2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,2,1,1.0
y,4,1,1.0
z,9,3,0.5


In [71]:
GB3 = GB2.copy().reset_index()
GB3

Unnamed: 0,cat2,num_sum,num_count,num2_mean
0,x,2,1,1.0
1,y,4,1,1.0
2,z,9,3,0.5
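
The rename and `reset_index` steps above can be avoided with named aggregation, where each output column is given as `name=(column, function)`. A sketch on a frame that mimics `df`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'cat2': list('xyzzz'),
                      'num': [2, 4, 3, 2, 4],
                      'num2': [1, 1, 1, 0, np.nan]})

# Flat column names directly, no MultiIndex to rename afterwards.
flat = frame.groupby('cat2', as_index=False).agg(
    num_sum=('num', 'sum'),
    num_count=('num', 'count'),
    num2_mean=('num2', 'mean'),
)
print(flat)
```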


In [72]:
### watch out for categories !
df[['cat','cat2']].dtypes

cat     category
cat2      object
dtype: object

In [73]:
### "observed" should be passed explicitly for "category" datatypes: omitting it still works here, but triggers a warning in this pandas version
try:
    GB_ = df.groupby(by= 'cat').agg({'num':'sum'})
    print(GB_)
except Exception as err:
    print(err)

print("-"*50)
GB__ = df.groupby(by= 'cat', observed=True).agg({'num':'sum'})
GB__

     num
cat     
a      4
b      8
c      3
--------------------------------------------------


  GB_ = df.groupby(by= 'cat').agg({'num':'sum'})


Unnamed: 0_level_0,num
cat,Unnamed: 1_level_1
a,4
b,8
c,3
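
Both calls above give the same table because every category appears in the data. The difference shows up when a category is declared but unused; a sketch with a made-up column where `'c'` never occurs:

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
frame = pd.DataFrame({'cat': cat, 'num': [1, 2, 3]})

# observed=False: one group per declared category, even the empty one.
all_levels = frame.groupby('cat', observed=False)['num'].sum()

# observed=True: only the categories that actually occur.
seen_only = frame.groupby('cat', observed=True)['num'].sum()

print(all_levels)
print(seen_only)
```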


## pivot table

In [75]:
df4 = df.copy()
#df4['cat3']=np.random.choice(list('ij'),size=(df.shape[0]))
df4

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,


In [76]:
pt = df4.pivot_table(index='cat',columns='cat2',aggfunc='sum')
pt

Unnamed: 0_level_0,num,num,num,num2,num2,num2
cat2,x,y,z,x,y,z
cat,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
a,2,0,2,1.0,0.0,0.0
b,0,4,4,0.0,1.0,0.0
c,0,0,3,0.0,0.0,1.0
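
Without `values`, `pivot_table` aggregates every remaining column, which gives the two-level header above. A sketch that restricts the table to one column and fills the missing combinations explicitly, on a frame that mimics `df` with plain string columns:

```python
import pandas as pd

frame = pd.DataFrame({'cat': list('abcab'),
                      'cat2': list('xyzzz'),
                      'num': [2, 4, 3, 2, 4]})

# One value column, one aggregation, 0 for combinations that never occur.
table = frame.pivot_table(index='cat', columns='cat2',
                          values='num', aggfunc='sum', fill_value=0)
print(table)
```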


<a id="join"></a>
# Joins
[toc](#toc)

In [78]:
series_ = pd.Series(data=[1,2,3],
                    index= np.random.choice(['a','b'],size=(3)))
df5 = series_.to_frame()
display(df5)
display(df)

Unnamed: 0,0
b,1
a,2
b,3


Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,


In [79]:
df.merge(df5, left_index=True, right_index=True, how='inner')

Unnamed: 0,num,cat,cat2,num2,0
a,2,a,x,1.0,2
b,4,b,y,1.0,1
b,4,b,y,1.0,3


In [80]:
pd.merge(df,df5, left_on='cat', right_index=True, how='inner')

Unnamed: 0,num,cat,cat2,num2,0
a,2,a,x,1.0,2
d,2,a,z,0.0,2
b,4,b,y,1.0,1
b,4,b,y,1.0,3
e,4,b,z,,1
e,4,b,z,,3
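
The two merges above are inner joins on the index. A sketch of the more common case, a left join on a shared column, with `indicator=True` to see where each row came from; both frames and the `key` column are invented.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'num': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'b'], 'val': [10, 20, 30]})

# how='left' keeps every row of `left`; 'b' matches twice, 'c' never.
merged = left.merge(right, on='key', how='left', indicator=True)
print(merged)
```

The `_merge` column reads `both` for matched rows and `left_only` for `c`, whose `val` is NaN.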


# Concatenate
[toc](#toc)

In [82]:
display(df1)
display(df2)
display(df3)

Unnamed: 0,A,B,C
0,1,3,4
1,2,2,5
2,3,1,6


Unnamed: 0,A,B,C
0,6,6,8
1,7,6,8
2,8,5,9
3,9,4,5
4,8,8,4
5,7,2,3
6,2,7,5
7,7,3,7


Unnamed: 0,i,j,k
0,1,5.0,
1,2,,3.0
2,6,2.0,7.0


In [83]:

df_concat = pd.concat([df1,df2],
                      axis=0,  ## along rows
                      ignore_index=False
                     )
#df_concat = pd.concat([df1,df2])
display(df_concat)
display(df_concat.loc[0,:])

print('-'*50)
df_concat = pd.concat([df1,df3],
                      axis=1    ## along columns
                     ) 
display(df_concat)

Unnamed: 0,A,B,C
0,1,3,4
1,2,2,5
2,3,1,6
0,6,6,8
1,7,6,8
2,8,5,9
3,9,4,5
4,8,8,4
5,7,2,3
6,2,7,5


Unnamed: 0,A,B,C
0,1,3,4
0,6,6,8


--------------------------------------------------


Unnamed: 0,A,B,C,i,j,k
0,1,3,4,1,5.0,
1,2,2,5,2,,3.0
2,3,1,6,6,2.0,7.0
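
As `df_concat.loc[0,:]` showed above, stacking rows keeps the original labels, so an index value can appear twice. A sketch of `ignore_index=True`, which renumbers the rows instead, on two made-up frames:

```python
import pandas as pd

top = pd.DataFrame({'A': [1, 2]})
bottom = pd.DataFrame({'A': [3, 4]})

# Labels are kept: 0, 1, 0, 1, so .loc[0] returns two rows.
duplicated = pd.concat([top, bottom])

# Labels are rebuilt: 0, 1, 2, 3.
renumbered = pd.concat([top, bottom], ignore_index=True)

print(duplicated.index.tolist())
print(renumbered.index.tolist())
```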


<a id="restr"></a>
# Restructure table
[toc](#toc)

## get_dummies
One-hot encoding

In [86]:
display(df)
gd = pd.get_dummies(df)
gd

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,


Unnamed: 0,num,num2,cat_a,cat_b,cat_c,cat2_x,cat2_y,cat2_z
a,2,1.0,True,False,False,True,False,False
b,4,1.0,False,True,False,False,True,False
c,3,1.0,False,False,True,False,False,True
d,2,0.0,True,False,False,False,False,True
e,4,,False,True,False,False,False,True
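
Two options of `get_dummies` are often useful for modelling: `dtype` to get 0/1 instead of booleans, and `drop_first` to drop one redundant column per variable. A sketch on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'num': [2, 4, 3], 'cat': ['a', 'b', 'c']})

# Encode only 'cat', as integers, without the first level ('a').
encoded = pd.get_dummies(frame, columns=['cat'], dtype=int, drop_first=True)
print(encoded)
```

A row with 0 in every remaining dummy column then stands for the dropped level.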


## unpivot ≈ melt

In [88]:
up = gd.melt(['num','num2'])
display(up)
up = up.loc[up['value']!=0]
up

Unnamed: 0,num,num2,variable,value
0,2,1.0,cat_a,True
1,4,1.0,cat_a,False
2,3,1.0,cat_a,False
3,2,0.0,cat_a,True
4,4,,cat_a,False
5,2,1.0,cat_b,False
6,4,1.0,cat_b,True
7,3,1.0,cat_b,False
8,2,0.0,cat_b,False
9,4,,cat_b,True


Unnamed: 0,num,num2,variable,value
0,2,1.0,cat_a,True
3,2,0.0,cat_a,True
6,4,1.0,cat_b,True
9,4,,cat_b,True
12,3,1.0,cat_c,True
15,2,1.0,cat2_x,True
21,4,1.0,cat2_y,True
27,3,1.0,cat2_z,True
28,2,0.0,cat2_z,True
29,4,,cat2_z,True
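
`melt` (wide to long) and `pivot` (long to wide) undo each other. A sketch of the round trip on a small invented table, with the `id_vars`, `var_name` and `value_name` arguments spelled out:

```python
import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'x': [10, 20], 'y': [30, 40]})

# One row per (id, variable) pair.
long = wide.melt(id_vars='id', var_name='variable', value_name='value')

# pivot reverses it; it requires each (index, columns) pair to be unique.
back = long.pivot(index='id', columns='variable', values='value')

print(long)
print(back)
```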


## explode

In [90]:
df7 = pd.DataFrame(data={'A':[[1,2],[2,3],[4,5]],'B':[['a','b'],['c','d'],['e','f']]})
df7

Unnamed: 0,A,B
0,"[1, 2]","[a, b]"
1,"[2, 3]","[c, d]"
2,"[4, 5]","[e, f]"


In [91]:
df7.explode(['A'])

Unnamed: 0,A,B
0,1,"[a, b]"
0,2,"[a, b]"
1,2,"[c, d]"
1,3,"[c, d]"
2,4,"[e, f]"
2,5,"[e, f]"


In [92]:
### explode several columns together (the lists in a row must have the same length)
df7.explode(['A','B'])

Unnamed: 0,A,B
0,1,a
0,2,b
1,2,c
1,3,d
2,4,e
2,5,f
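
`unstack`, also listed in the table of contents, moves one level of a MultiIndex from the rows to the columns. A sketch starting from a two-key `groupby`, on a frame that mimics the `cat` and `cat2` columns:

```python
import pandas as pd

frame = pd.DataFrame({'cat': list('abcab'), 'cat2': list('xyzzz')})

# Series indexed by (cat, cat2).
counts = frame.groupby(['cat', 'cat2']).size()

# Inner level 'cat2' becomes the columns; absent pairs are filled with 0.
wide = counts.unstack(fill_value=0)

print(counts)
print(wide)
```

`wide.stack()` goes back to the long form.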


<a id="map"></a>
# Apply&map
[toc](#toc)


## map
Element-wise, on a Series (or a DataFrame)

In [95]:
def my_funct(x,val=0):
    
    if x < val:
        return -1
    elif x > val:
        return +1
    else:
        #print(x)
        return 0

df8= df.copy()

df8['res'] = df8['num2'].map(my_funct) ## NaN: both comparisons are False, so the else branch returns 0
display(df8)

Unnamed: 0,num,cat,cat2,num2,res
a,2,a,x,1.0,1
b,4,b,y,1.0,1
c,3,c,z,1.0,1
d,2,a,z,0.0,0
e,4,b,z,,0
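
Row `e` above gets 0 although `num2` is missing. A sketch of two related behaviours of `Series.map`: `na_action='ignore'` to leave missing values untouched, and mapping with a dictionary. The `sign` helper and the label dictionary are invented for the illustration.

```python
import numpy as np
import pandas as pd

def sign(x):
    # Same logic as my_funct with val=0.
    return -1 if x < 0 else (1 if x > 0 else 0)

series = pd.Series([1.0, 0.0, np.nan])

naive = series.map(sign)                       # NaN falls through to 0
kept = series.map(sign, na_action='ignore')    # NaN stays NaN

# A dict works as a lookup table; values without a key become NaN.
labels = pd.Series(['a', 'b', 'z']).map({'a': 'first', 'b': 'second'})

print(naive.tolist())
print(kept.tolist())
print(labels.tolist())
```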


## apply
More general than `map`:  
works on the rows or columns of a DataFrame,  
and can pass extra arguments to the function

In [97]:
display(df)
print('-'*50)
print(df['num'].apply(my_funct,args=(2,)))

print(df['num'].apply(my_funct, **{'val':2}))

Unnamed: 0,num,cat,cat2,num2
a,2,a,x,1.0
b,4,b,y,1.0
c,3,c,z,1.0
d,2,a,z,0.0
e,4,b,z,


--------------------------------------------------
a    0
b    1
c    1
d    0
e    1
Name: num, dtype: int64
a    0
b    1
c    1
d    0
e    1
Name: num, dtype: int64


In [98]:

def my_funct2 (x):
    return pd.isnull(x).sum()

print(df.apply(my_funct2,axis=0))
print(df.apply(my_funct2,axis=1))


num     0
cat     0
cat2    0
num2    1
dtype: int64
a    0
b    0
c    0
d    0
e    1
dtype: int64


### rewrite your problem to avoid apply/map

In [100]:
### vectorisation: faster alternative, recommended
### requires thinking about the problem differently than "for each line"
def my_funct_2(series, val=0.5):
    cond1 = series<val
    cond2 = series>val
    return -1*cond1 + cond2

my_funct_2(df['num'],val=0.75)

a    1
b    1
c    1
d    1
e    1
Name: num, dtype: int32
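
For this particular function there is an even shorter vectorised form, `np.sign`. The sketch below checks it against an element-wise `apply` on a made-up series; it assumes a signed integer or float dtype, since with unsigned integers the subtraction may wrap around.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 5, 9, 5])
threshold = 5

# Element-wise version, one Python call per value.
looped = values.apply(lambda x: -1 if x < threshold else (1 if x > threshold else 0))

# Vectorised version: one subtraction and one ufunc on the whole column.
vectorised = np.sign(values - threshold)

print(looped.tolist())      # [-1, 0, 1, 0]
print(vectorised.tolist())  # [-1, 0, 1, 0]
```

Unlike `my_funct`, `np.sign` keeps NaN as NaN on float data.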

### benchmarks

In [102]:
series_rdm = pd.Series(np.random.randint(0,10,1_000_000))
series_rdm

0         9
1         5
2         1
3         6
4         9
         ..
999995    3
999996    2
999997    6
999998    3
999999    8
Length: 1000000, dtype: int32

In [103]:
%timeit series_rdm.apply(my_funct, val=5)

695 ms ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [104]:
%timeit my_funct_2(series_rdm,val=5)

9.18 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
