# Summary Functions and Maps

## Introduction

Sometimes, we'll have to do some work to reformat the data we are reading to get the format we need. In this section, we will see the different operations we can apply to our data to get the input "just right".

In [13]:
import pandas as pd
import math

br_small_caps = pd.read_csv('statusinvest-busca-avancada.csv', delimiter=';')

## Summary Functions

Pandas provides many simple "summary functions" which restructure the data in some useful way. For example, consider the ```describe()``` method:

In [45]:
br_small_caps['P/L'].describe()

count    56.000000
mean      6.868571
std       2.863761
min       2.500000
25%       4.667500
50%       6.530000
75%       8.527500
max      15.220000
Name: P/L, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [6]:
br_small_caps['TICKER'].describe()

count        56
unique       56
top       AGRO3
freq          1
Name: TICKER, dtype: object
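
The type-aware behavior can be reproduced without the CSV file. The following is a minimal sketch on a made-up DataFrame (the `toy` name, the tickers, and the values are all hypothetical); it only compares which statistics ```describe()``` reports for a numeric column versus a text column.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "TICKER": ["AAAA3", "BBBB3", "CCCC3", "DDDD3"],
    "P/L": [4.0, 6.0, 8.0, 10.0],
})

numeric_summary = toy["P/L"].describe()    # count, mean, std, min, quartiles, max
text_summary = toy["TICKER"].describe()    # count, unique, top, freq

print(list(numeric_summary.index))
print(list(text_summary.index))
```

The two summaries are ordinary Series, so a single entry can be read with a label, for example `numeric_summary["mean"]`.
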

If you want just one of the summary statistics returned by ```describe()```, you can call the corresponding method directly:

In [8]:
br_small_caps['TICKER'].count()

np.int64(56)
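
As a small sketch of the same idea for numeric data (the values below are invented for illustration), each statistic reported by ```describe()``` can also be obtained from its own method:

```python
import pandas as pd

# Hypothetical P/L values, not taken from the real dataset.
pl = pd.Series([2.5, 4.0, 6.5, 9.0], name="P/L")

summary = pl.describe()

# Each individual method matches the corresponding describe() entry.
print(pl.count(), summary["count"])
print(pl.mean(), summary["mean"])     # mean of the four values is 5.5
print(pl.median(), summary["50%"])    # the median is the 50% quantile
print(pl.min(), pl.max())
```
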

These methods return different data types. For example, the ```unique()``` method returns an array (not a Python list) with the unique values of the given column:

In [9]:
br_small_caps['TICKER'].unique()

array(['AGRO3', 'ATOM3', 'BLAU3', 'BOAS3', 'BRBI11', 'BRIT3', 'CAMB3',
       'CAMB4', 'CAML3', 'CEBR3', 'CEBR5', 'CEBR6', 'CEDO3', 'CEDO4',
       'CGRA3', 'CGRA4', 'CSRN3', 'CSRN5', 'CSRN6', 'CSUD3', 'DEXP3',
       'DEXP4', 'EALT3', 'EALT4', 'EEEL3', 'EEEL4', 'EUCA3', 'EUCA4',
       'JHSF3', 'JSLG3', 'KEPL3', 'LEVE3', 'LJQQ3', 'MOAR3', 'MTSA3',
       'MTSA4', 'NAFG3', 'NAFG4', 'NEMO5', 'RANI3', 'RANI4', 'RAPT3',
       'RAPT4', 'ROMI3', 'RSUL4', 'SCAR3', 'SHOW3', 'SOJA3', 'SOMA3',
       'TECN3', 'TGMA3', 'TUPY3', 'VLID3', 'VULC3', 'WLMM3', 'WLMM4'],
      dtype=object)

There are other handy methods that provide statistics about a column. For example, if you want the unique values together with how often each one occurs in the dataset, you can use the ```value_counts()``` method:

In [10]:
br_small_caps['TICKER'].value_counts()

TICKER
AGRO3     1
ATOM3     1
KEPL3     1
LEVE3     1
LJQQ3     1
MOAR3     1
MTSA3     1
MTSA4     1
NAFG3     1
NAFG4     1
NEMO5     1
RANI3     1
RANI4     1
RAPT3     1
RAPT4     1
ROMI3     1
RSUL4     1
SCAR3     1
SHOW3     1
SOJA3     1
SOMA3     1
TECN3     1
TGMA3     1
TUPY3     1
VLID3     1
VULC3     1
WLMM3     1
JSLG3     1
JHSF3     1
EUCA4     1
CEDO4     1
BLAU3     1
BOAS3     1
BRBI11    1
BRIT3     1
CAMB3     1
CAMB4     1
CAML3     1
CEBR3     1
CEBR5     1
CEBR6     1
CEDO3     1
CGRA3     1
EUCA3     1
CGRA4     1
CSRN3     1
CSRN5     1
CSRN6     1
CSUD3     1
DEXP3     1
DEXP4     1
EALT3     1
EALT4     1
EEEL3     1
EEEL4     1
WLMM4     1
Name: count, dtype: int64
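
Because every ticker appears exactly once, the output above is not very informative. A minimal sketch on a column with repeated values shows the method more clearly; the `SECTOR` column and its values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical categorical column with repeated values.
sectors = pd.Series(
    ["energy", "retail", "energy", "agro", "energy", "retail"],
    name="SECTOR",
)

counts = sectors.value_counts()                 # sorted from most to least frequent
shares = sectors.value_counts(normalize=True)   # relative frequencies instead of counts

print(counts)
print(shares)
print(sectors.nunique())                        # number of distinct values
```
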

## Maps

In data science, we often need to create new representations from existing data, or to transform data from the format it is in now to the format we want it to be in later. **Maps** are what handle this work, making them extremely important for getting your work done!
There are two mapping methods that you will use often.
```map()``` is the first, and slightly simpler, one. For example, suppose that we wanted to replace the ```NaN``` values in the ```DY``` column with 0:

In [18]:
br_small_caps['DY'].map(lambda dy: 0 if math.isnan(dy) else dy)

0     11.66
1      0.00
2      2.48
3      0.00
4      9.10
5      2.06
6      1.32
7      0.43
8      2.95
9      9.07
10    10.55
11    10.81
12     0.00
13     0.00
14    10.17
15    10.04
16     6.06
17     6.97
18     5.21
19     5.37
20     5.10
21     5.37
22     4.27
23     4.44
24     0.00
25     0.00
26     3.63
27     4.02
28     9.51
29     3.17
30    10.03
31    23.96
32     0.00
33     6.50
34     1.40
35     2.79
36     0.00
37     0.00
38     2.33
39     8.30
40     0.00
41     5.91
42     4.96
43     6.84
44     3.28
45     0.00
46     0.00
47     6.15
48     1.41
49     2.49
50     5.34
51     2.62
52    10.53
53    18.89
54     2.60
55     2.29
Name: DY, dtype: float64

The function you pass to ```map()``` should expect a single value from the Series (a ```DY``` value, in the above example) and return a transformed version of that value. ```map()``` returns a new Series where all the values have been transformed by your function.
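
The NaN replacement can be sketched on a small, made-up Series (the values are hypothetical). As a cross-check, the sketch also compares the result with ```fillna(0)```, a pandas built-in that performs the same replacement; that comparison is an addition for illustration and is not part of the original notebook.

```python
import math

import pandas as pd

# Hypothetical dividend-yield values with two missing entries.
dy = pd.Series([11.66, float("nan"), 2.48, float("nan")], name="DY")

# map() calls the function once per value and collects the results.
mapped = dy.map(lambda value: 0 if math.isnan(value) else value)

# Built-in alternative that fills missing values directly.
filled = dy.fillna(0)

print(mapped.tolist())
print((mapped == filled).all())
```
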
```apply()``` is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row:

In [24]:
def replace_nan_dy(row):
    row['DY'] = 0 if math.isnan(row['DY']) else row['DY']
    return row

br_small_caps.apply(replace_nan_dy, axis='columns')

Unnamed: 0,TICKER,PRECO,DY,P/L,P/VP,P/ATIVOS,MARGEM BRUTA,MARGEM EBIT,MARG. LIQUIDA,P/EBIT,...,PATRIMONIO / ATIVOS,PASSIVOS / ATIVOS,GIRO ATIVOS,CAGR RECEITAS 5 ANOS,CAGR LUCROS 5 ANOS,LIQUIDEZ MEDIA DIARIA,VPA,LPA,PEG Ratio,VALOR DE MERCADO
0,AGRO3,27.55,11.66,11.95,1.45,0.85,27.98,15.46,21.58,16.68,...,0.59,0.41,0.33,27.22,13.38,6.609.378.40,19.06,2.31,0.04,2.828.928.882.20
1,ATOM3,2.03,0.0,2.5,1.23,0.91,91.76,-3.39,84.24,-62.11,...,0.74,0.21,0.43,31.18,22.25,13.557.37,1.65,0.81,0.01,48.323.942.94
2,BLAU3,11.29,2.48,8.51,1.0,0.64,34.09,22.18,16.14,6.19,...,0.64,0.36,0.47,11.91,14.06,2.278.661.06,11.26,1.33,-0.26,2.025.357.571.31
3,BOAS3,7.95,0.0,15.22,1.83,1.7,56.53,19.25,32.39,25.6,...,0.93,0.07,0.34,8.81,74.34,53.310.276.50,4.34,0.52,5.59,4.212.250.617.75
4,BRBI11,14.5,9.1,8.88,1.84,0.14,100.0,8.88,61.44,61.47,...,0.07,0.93,0.02,-16.03,27.88,3.910.827.37,7.87,1.63,0.4,1.522.437.708.00
5,BRIT3,4.24,2.06,12.01,1.25,0.58,45.24,18.58,12.48,8.07,...,0.47,0.53,0.39,41.71,44.43,2.218.034.23,3.4,0.35,0.11,1.904.162.443.84
6,CAMB3,11.15,1.32,6.52,1.86,1.29,48.26,21.89,15.87,4.73,...,0.69,0.31,1.24,13.16,34.14,363.010.12,5.98,1.71,1.58,471.367.142.00
7,CAMB4,6.25,0.43,3.66,1.05,0.72,48.26,21.89,15.87,2.65,...,0.69,0.31,1.24,13.16,34.14,,5.98,1.71,0.89,471.367.142.00
8,CAML3,8.71,2.95,4.41,1.02,0.33,19.82,6.85,4.12,2.66,...,0.33,0.67,1.82,18.83,13.77,6.617.186.57,8.57,1.97,0.01,3.048.500.000.00
9,CEBR3,20.93,9.07,8.64,1.43,1.02,50.78,61.36,49.19,6.93,...,0.71,0.11,0.24,-32.89,14.18,72.491.66,14.59,2.42,-0.52,1.450.093.620.45


If we had called ```apply()``` with ```axis='index'``` or ```axis=0```, then instead of receiving each *row*, the mapping function would receive each *column*.

In [39]:
def replace_nan_values(col):
    mapped_col = col.map(lambda x: 0 if isinstance(x, float) and math.isnan(x) else x)
    return mapped_col

br_small_caps.apply(replace_nan_values, axis='index')

Unnamed: 0,TICKER,PRECO,DY,P/L,P/VP,P/ATIVOS,MARGEM BRUTA,MARGEM EBIT,MARG. LIQUIDA,P/EBIT,...,PATRIMONIO / ATIVOS,PASSIVOS / ATIVOS,GIRO ATIVOS,CAGR RECEITAS 5 ANOS,CAGR LUCROS 5 ANOS,LIQUIDEZ MEDIA DIARIA,VPA,LPA,PEG Ratio,VALOR DE MERCADO
0,AGRO3,27.55,11.66,11.95,1.45,0.85,27.98,15.46,21.58,16.68,...,0.59,0.41,0.33,27.22,13.38,6.609.378.40,19.06,2.31,0.04,2.828.928.882.20
1,ATOM3,2.03,0.0,2.5,1.23,0.91,91.76,-3.39,84.24,-62.11,...,0.74,0.21,0.43,31.18,22.25,13.557.37,1.65,0.81,0.01,48.323.942.94
2,BLAU3,11.29,2.48,8.51,1.0,0.64,34.09,22.18,16.14,6.19,...,0.64,0.36,0.47,11.91,14.06,2.278.661.06,11.26,1.33,-0.26,2.025.357.571.31
3,BOAS3,7.95,0.0,15.22,1.83,1.7,56.53,19.25,32.39,25.6,...,0.93,0.07,0.34,8.81,74.34,53.310.276.50,4.34,0.52,5.59,4.212.250.617.75
4,BRBI11,14.5,9.1,8.88,1.84,0.14,100.0,8.88,61.44,61.47,...,0.07,0.93,0.02,-16.03,27.88,3.910.827.37,7.87,1.63,0.4,1.522.437.708.00
5,BRIT3,4.24,2.06,12.01,1.25,0.58,45.24,18.58,12.48,8.07,...,0.47,0.53,0.39,41.71,44.43,2.218.034.23,3.4,0.35,0.11,1.904.162.443.84
6,CAMB3,11.15,1.32,6.52,1.86,1.29,48.26,21.89,15.87,4.73,...,0.69,0.31,1.24,13.16,34.14,363.010.12,5.98,1.71,1.58,471.367.142.00
7,CAMB4,6.25,0.43,3.66,1.05,0.72,48.26,21.89,15.87,2.65,...,0.69,0.31,1.24,13.16,34.14,0,5.98,1.71,0.89,471.367.142.00
8,CAML3,8.71,2.95,4.41,1.02,0.33,19.82,6.85,4.12,2.66,...,0.33,0.67,1.82,18.83,13.77,6.617.186.57,8.57,1.97,0.01,3.048.500.000.00
9,CEBR3,20.93,9.07,8.64,1.43,1.02,50.78,61.36,49.19,6.93,...,0.71,0.11,0.24,-32.89,14.18,72.491.66,14.59,2.42,-0.52,1.450.093.620.45


In the example above, each value passed to the mapping function is a Series containing all the values of the column being iterated. So, inside the mapping function, we call ```map()``` on that column to reach each individual value.
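
To make the difference between the two axis settings concrete, here is a minimal sketch on a hypothetical two-row DataFrame. The function simply joins the labels of whatever Series it receives, which reveals whether ```apply()``` handed it a row or a column.

```python
import pandas as pd

# Hypothetical toy frame; tickers and values are made up.
toy = pd.DataFrame(
    {"DY": [1.5, 2.0], "P/L": [6.0, 8.0]},
    index=["AAAA3", "BBBB3"],
)

# axis='columns': the function receives one row at a time,
# so the Series it sees is labeled by column names.
labels_per_row = toy.apply(lambda row: ", ".join(row.index), axis="columns")

# axis='index': the function receives one column at a time,
# so the Series it sees is labeled by row labels.
labels_per_col = toy.apply(lambda col: ", ".join(col.index), axis="index")

print(labels_per_row)
print(labels_per_col)
```
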
Note that ```map()``` and ```apply()``` return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on.
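
A short sketch on hypothetical data illustrates this: the result of ```apply()``` has no missing values, while the original DataFrame still does until the result is assigned back.

```python
import math

import pandas as pd

# Hypothetical toy frame with one NaN in each column.
toy = pd.DataFrame({"DY": [11.66, float("nan")], "P/L": [float("nan"), 2.5]})

cleaned = toy.apply(
    lambda col: col.map(lambda x: 0 if math.isnan(x) else x),
    axis="index",
)

missing_in_original = int(toy.isna().sum().sum())    # unchanged by apply()
missing_in_result = int(cleaned.isna().sum().sum())  # NaNs replaced in the copy

print(missing_in_original, missing_in_result)

# To keep a transformation, assign the result explicitly.
toy["DY"] = toy["DY"].map(lambda x: 0 if math.isnan(x) else x)
```
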
Pandas provides many common mapping operations as built-ins. For example, we could append a string to every value in our ```TICKER``` column this way:

In [40]:
br_small_caps['TICKER'] + " - B3"

0      AGRO3 - B3
1      ATOM3 - B3
2      BLAU3 - B3
3      BOAS3 - B3
4     BRBI11 - B3
5      BRIT3 - B3
6      CAMB3 - B3
7      CAMB4 - B3
8      CAML3 - B3
9      CEBR3 - B3
10     CEBR5 - B3
11     CEBR6 - B3
12     CEDO3 - B3
13     CEDO4 - B3
14     CGRA3 - B3
15     CGRA4 - B3
16     CSRN3 - B3
17     CSRN5 - B3
18     CSRN6 - B3
19     CSUD3 - B3
20     DEXP3 - B3
21     DEXP4 - B3
22     EALT3 - B3
23     EALT4 - B3
24     EEEL3 - B3
25     EEEL4 - B3
26     EUCA3 - B3
27     EUCA4 - B3
28     JHSF3 - B3
29     JSLG3 - B3
30     KEPL3 - B3
31     LEVE3 - B3
32     LJQQ3 - B3
33     MOAR3 - B3
34     MTSA3 - B3
35     MTSA4 - B3
36     NAFG3 - B3
37     NAFG4 - B3
38     NEMO5 - B3
39     RANI3 - B3
40     RANI4 - B3
41     RAPT3 - B3
42     RAPT4 - B3
43     ROMI3 - B3
44     RSUL4 - B3
45     SCAR3 - B3
46     SHOW3 - B3
47     SOJA3 - B3
48     SOMA3 - B3
49     TECN3 - B3
50     TGMA3 - B3
51     TUPY3 - B3
52     VLID3 - B3
53     VULC3 - B3
54     WLMM3 - B3
55     WLMM4 - B3
Name: TICKER, dtype: object

Or we could get the decimal representation of the ```MARGEM BRUTA``` column:

In [41]:
br_small_caps['MARGEM BRUTA'] / 100

0     0.2798
1     0.9176
2     0.3409
3     0.5653
4     1.0000
5     0.4524
6     0.4826
7     0.4826
8     0.1982
9     0.5078
10    0.5078
11    0.5078
12    0.3085
13    0.3085
14    0.5160
15    0.5160
16    0.3423
17    0.3423
18    0.3423
19    0.4044
20    0.2193
21    0.2193
22    0.2512
23    0.2512
24    0.6119
25    0.6119
26    0.3323
27    0.3323
28    0.5925
29    0.1817
30    0.2987
31    0.3046
32    0.3476
33       NaN
34    0.2425
35    0.2425
36    0.4115
37    0.4115
38    0.3461
39    0.4124
40    0.4124
41    0.2551
42    0.2551
43    0.2881
44    0.4274
45    0.4851
46    0.2138
47    0.1601
48    0.5680
49    0.5523
50    0.1988
51    0.1698
52    0.3594
53    0.4186
54    0.1150
55    0.1150
Name: MARGEM BRUTA, dtype: float64

These operators are faster than ```map()``` or ```apply()``` because they use vectorized implementations built into pandas, which operate on the whole column at once instead of calling a Python function for each value. All of the standard Python operators (>, <, ==, and so on) work in this manner.
However, they are not as flexible as ```map()``` or ```apply()```, which can do more advanced things, like applying conditional logic as we saw in the previous examples.
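
The trade-off can be sketched on a few made-up margin values (hypothetical, not read from the CSV): the built-in operators cover arithmetic and comparisons, while ```map()``` handles an arbitrary per-value rule.

```python
import pandas as pd

# Hypothetical gross-margin percentages.
margins = pd.Series([27.98, 91.76, 34.09], name="MARGEM BRUTA")

# Built-in operators work on the whole Series at once.
as_fraction = margins / 100
above_30 = margins > 30

# The same comparison written with map(), one value at a time.
above_30_via_map = margins.map(lambda m: m > 30)

# Conditional logic that returns arbitrary labels is where map() is convenient.
labels = margins.map(lambda m: "high" if m > 50 else "low")

print(above_30.tolist())
print(labels.tolist())
```
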