# Pandas - Data Alignment and Arithmetic Between DataFrames
## Pandas

pandas is a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation,
built on top of the Python programming language.

https://pandas.pydata.org/


## Pandas Data Structures


In [1]:
import numpy as np
import pandas as pd

## Data Alignment and Arithmetic
Arithmetic between DataFrame objects automatically aligns on both the columns and the index (row labels). The resulting object has the union of the column and row labels.

In [2]:
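# The two Series list the same sellers in different orders;
# the DataFrame constructor aligns them by label, not by position
dados = {'percentualcomissao' : pd.Series([.05, .05, .5], index=['João','Alexandre','Willian']),
          'vendas' : pd.Series([100, 200, 300], index=['Willian','João','Alexandre']) }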
dfcomissoes_janeiro = pd.DataFrame(dados)
dfcomissoes_janeiro

Unnamed: 0,percentualcomissao,vendas
Alexandre,0.05,300
João,0.05,200
Willian,0.5,100


In [3]:
dados = {'percentualcomissao' : pd.Series([.05, .05, .5], index=['João','Alexandre','Willian']),
          'vendas' : pd.Series([300, 400, 100], index=['Willian','João','Alexandre']) }
dfcomissoes_fevereiro = pd.DataFrame(dados)
dfcomissoes_fevereiro

Unnamed: 0,percentualcomissao,vendas
Alexandre,0.05,100
João,0.05,400
Willian,0.5,300


In [4]:
dfcomissoes_fevereiro + dfcomissoes_janeiro

Unnamed: 0,percentualcomissao,vendas
Alexandre,0.1,400
João,0.1,600
Willian,1.0,400
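
The sum above matches rows by label, even though the sellers were listed in a different order in each month. The following is a minimal sketch of that difference, using only the sales figures from the two tables; the names `jan`, `fev`, `by_label`, and `by_position` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same sellers, deliberately listed in different orders
jan = pd.DataFrame({"vendas": [100, 200, 300]},
                   index=["Willian", "João", "Alexandre"])
fev = pd.DataFrame({"vendas": [100, 400, 300]},
                   index=["Alexandre", "João", "Willian"])

# pandas aligns on the row labels before adding
by_label = jan + fev

# Dropping to NumPy arrays discards the labels and adds by position
by_position = jan.to_numpy() + fev.to_numpy()

print(by_label)
print(by_position)
```

Adding by label gives Alexandre 400, João 600, and Willian 400, which agrees with the table above; adding by position pairs Willian's January sales with Alexandre's February sales.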


In [5]:
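# df has 10 rows and columns A-D; df2 has only 7 rows and columns A-C
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

# Labels present in only one operand (column D, rows 7-9) yield NaN
df + df2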

Unnamed: 0,A,B,C,D
0,-0.117509,0.142167,0.026226,
1,0.215175,-0.188215,-0.924195,
2,1.052859,0.072932,2.690337,
3,2.649584,2.257874,0.939033,
4,0.996707,0.142679,-0.256889,
5,-0.539836,-0.338355,1.464915,
6,1.725524,-3.145453,0.684305,
7,,,,
8,,,,
9,,,,
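
The missing values above appear wherever a label exists in only one of the two operands. If treating the absent side as zero is acceptable for the analysis, the `add` method with `fill_value` is one way to avoid them. This is a sketch with small made-up frames (`left` and `right` are hypothetical names), not a step from the original notebook.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["x", "y", "z"])
right = pd.DataFrame({"A": [100, 200]}, index=["x", "y"])

# The + operator leaves NaN wherever a label is missing from one operand
plain = left + right

# add() with fill_value substitutes 0 for the missing side before adding;
# a cell missing from BOTH operands would still be NaN
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```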


In an operation between a DataFrame and a Series, the default behavior is to align the Series index on the DataFrame columns, so the Series is broadcast row by row.

In [6]:
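# The file has no header row, so the column names are supplied explicitly
df = pd.read_csv('totaisestados_arr.csv', sep=';', names=['UF', 'CASOS', 'OBITOS'])
# Keep only the numeric columns
df = df[['CASOS', 'OBITOS']]
df.head()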

Unnamed: 0,CASOS,OBITOS
0,955540,12025
1,2060518,28669
2,637169,11305
3,921447,15253
4,348701,7355


In [7]:
# Column-wise means: a Series indexed by CASOS and OBITOS
media = df.mean()

In [8]:
df.iloc[0]

CASOS     955540
OBITOS     12025
Name: 0, dtype: int64

In [9]:
(df - media).head()

Unnamed: 0,CASOS,OBITOS
0,-488601.1,-14422.333333
1,616376.9,2221.666667
2,-806972.1,-15142.333333
3,-522694.1,-11194.333333
4,-1095440.0,-19092.333333


In [10]:
# axis=0 aligns the Series on the row index, so CASOS is subtracted from every column
df.sub(df['CASOS'], axis=0)

Unnamed: 0,CASOS,OBITOS
0,0,-943515
1,0,-2031849
2,0,-625864
3,0,-906194
4,0,-341346
5,0,-1812447
6,0,-1479920
7,0,-490018
8,0,-715248
9,0,-1212120
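
The two subtractions above differ in which axis the Series is matched against. The sketch below contrasts them on a tiny made-up frame (the values and the names `row_wise`, `col_wise`, and `mismatched` are illustrative), and also shows what happens when a column is subtracted with the plain `-` operator.

```python
import pandas as pd

df = pd.DataFrame({"CASOS": [10, 20], "OBITOS": [1, 2]}, index=["DF", "GO"])

# Default: the Series index (CASOS, OBITOS) is matched against the columns
row_wise = df - df.mean()

# axis=0: the Series index (DF, GO) is matched against the row labels
col_wise = df.sub(df["CASOS"], axis=0)

# Without axis=0 the row labels are compared with the column names;
# nothing matches, so every cell of the result is NaN
mismatched = df - df["CASOS"]

print(row_wise)
print(col_wise)
print(mismatched)
```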


## Operations with Scalars

In [11]:
(df / 1000).head()

Unnamed: 0,CASOS,OBITOS
0,955.54,12.025
1,2060.518,28.669
2,637.169,11.305
3,921.447,15.253
4,348.701,7.355


In [12]:
(df ** 2).head()

Unnamed: 0,CASOS,OBITOS
0,913056691600,144600625
1,4245734428324,821911561
2,405984334561,127803025
3,849064573809,232654009
4,121592387401,54096025
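
Scalar operations are applied element by element and keep the row and column labels. As a small sketch, using two rows copied from the table above (the variable names are illustrative), the result dtype depends on the operation:

```python
import pandas as pd

df = pd.DataFrame({"CASOS": [955540, 2060518], "OBITOS": [12025, 28669]})

in_thousands = df / 1000   # true division of integers yields floats
squared = df ** 2          # an integer power keeps the integer dtype

print(in_thousands.dtypes)
print(squared.dtypes)
```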


Boolean operators also work:

In [13]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df1.head()

Unnamed: 0,a,b
0,True,False
1,False,True
2,True,True


In [14]:
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)
df2.head()

Unnamed: 0,a,b
0,False,True
1,True,True
2,True,False


In [15]:
df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False


In [16]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [17]:
df1

Unnamed: 0,a,b
0,True,False
1,False,True
2,True,True


In [18]:
-df1

Unnamed: 0,a,b
0,False,True
1,True,False
2,False,False
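
For logical negation, the `~` operator is the more conventional spelling and gives the same result on boolean data. The sketch below rebuilds the two boolean frames from the cells above and also shows `^` (exclusive or); the names `negated` and `xor` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

negated = ~df1   # element-wise logical NOT
xor = df1 ^ df2  # True where exactly one of the two operands is True

print(negated)
print(xor)
```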


## Transposing the DataFrame
To transpose, use the T attribute or the transpose method.

In [19]:
df = pd.read_csv('totaisestados_arr.csv', sep=';', names=['UF', 'CASOS', 'OBITOS'])
df.head()

Unnamed: 0,UF,CASOS,OBITOS
0,DF,955540,12025
1,GO,2060518,28669
2,MS,637169,11305
3,MT,921447,15253
4,AL,348701,7355


In [20]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
UF,DF,GO,MS,MT,AL,BA,CE,MA,PB,PE,...,RO,RR,TO,ES,MG,RJ,SP,PR,RS,SC
CASOS,955540,2060518,637169,921447,348701,1844481,1508135,501121,725917,1235360,...,505111,190318,382320,1384800,4335714,2966219,6907741,3031840,3144956,2089867
OBITOS,12025,28669,11305,15253,7355,32034,28215,11103,10669,23240,...,7527,2202,4305,15214,66850,78238,184235,47029,43044,23145


In [21]:
# Use UF as the row index so that it becomes the column labels after transposing
df.index = df['UF']
df = df[['CASOS', 'OBITOS']]
df.T

UF,DF,GO,MS,MT,AL,BA,CE,MA,PB,PE,...,RO,RR,TO,ES,MG,RJ,SP,PR,RS,SC
CASOS,955540,2060518,637169,921447,348701,1844481,1508135,501121,725917,1235360,...,505111,190318,382320,1384800,4335714,2966219,6907741,3031840,3144956,2089867
OBITOS,12025,28669,11305,15253,7355,32034,28215,11103,10669,23240,...,7527,2202,4305,15214,66850,78238,184235,47029,43044,23145


In [22]:
df.T.T

UF,CASOS,OBITOS
DF,955540,12025
GO,2060518,28669
MS,637169,11305
MT,921447,15253
AL,348701,7355
BA,1844481,32034
CE,1508135,28215
MA,501121,11103
PB,725917,10669
PE,1235360,23240


## DataFrame Interoperability with NumPy Functions
NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions can be used directly on Series and DataFrame objects, provided the data is numeric.

In [23]:
df[:]

UF,CASOS,OBITOS
DF,955540,12025
GO,2060518,28669
MS,637169,11305
MT,921447,15253
AL,348701,7355
BA,1844481,32034
CE,1508135,28215
MA,501121,11103
PB,725917,10669
PE,1235360,23240


In [27]:
np.sum(df[['CASOS','OBITOS']], axis=0)

CASOS     38991809
OBITOS      714078
dtype: int64

In [28]:
np.cumsum(df[['CASOS','OBITOS']]).tail()

UF,CASOS,OBITOS
RJ,23817405,416625
SP,30725146,600860
PR,33756986,647889
RS,36901942,690933
SC,38991809,714078


In [29]:
np.asarray(df)

array([[ 955540,   12025],
       [2060518,   28669],
       [ 637169,   11305],
       [ 921447,   15253],
       [ 348701,    7355],
       [1844481,   32034],
       [1508135,   28215],
       [ 501121,   11103],
       [ 725917,   10669],
       [1235360,   23240],
       [ 438676,    8445],
       [ 601493,    9321],
       [ 368073,    6575],
       [ 169625,    2083],
       [ 643745,   14531],
       [ 191877,    2175],
       [ 901045,   19291],
       [ 505111,    7527],
       [ 190318,    2202],
       [ 382320,    4305],
       [1384800,   15214],
       [4335714,   66850],
       [2966219,   78238],
       [6907741,  184235],
       [3031840,   47029],
       [3144956,   43044],
       [2089867,   23145]], dtype=int64)
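
The cells above use NumPy reductions and array conversion. Element-wise ufuncs such as `np.sqrt` and `np.log10` can be applied in the same way, and they return a DataFrame that keeps the original labels. This is a brief sketch with two rows copied from the data; `roots` and `logs` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"CASOS": [955540, 2060518], "OBITOS": [12025, 28669]},
                  index=["DF", "GO"])

# The ufunc is applied element by element; index and columns are preserved
roots = np.sqrt(df)
logs = np.log10(df)

print(type(roots).__name__)
print(roots)
print(logs)
```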

When a ufunc takes several labeled inputs, pandas aligns them by label before applying the function.

In [30]:
ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([1, 3, 5], index=['b', 'a', 'c'])
np.remainder(ser1, ser2)

a    1
b    0
c    3
dtype: int64

In [31]:
ser3 = pd.Series([2, 4, 6], index=['b', 'c', 'd'])
np.remainder(ser1, ser3)

a    NaN
b    0.0
c    3.0
d    NaN
dtype: float64
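
In the last result, labels found in only one of the two Series produce NaN, which is also why the dtype changes to float64. When positional behavior is wanted instead, one option is to pass a plain NumPy array for one operand, since an array carries no labels to align on. The following sketch reuses `ser1` and `ser2` from above; `aligned` and `positional` are illustrative names.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])

# Both inputs are labeled: values are paired by label (a with a, b with b, ...)
aligned = np.remainder(ser1, ser2)

# The second input is a bare array: values are paired by position,
# and the result keeps the index of ser1
positional = np.remainder(ser1, ser2.to_numpy())

print(aligned)
print(positional)
```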