# DataFrames

The DataFrame is the most important data structure in pandas and is directly inspired by the R programming language. You can think of a DataFrame as a collection of Series objects that share the same index. Let's use pandas to explore this topic!
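To make the "Series sharing one index" picture concrete, here is a minimal sketch. The names `s1`, `s2`, and `frame`, and all values, are made up for illustration.

```python
import pandas as pd

# Two Series whose labels only partially overlap (illustrative values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["A", "B", "C"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["B", "C", "D"])

# Building a DataFrame from a dict of Series aligns them on the union of labels;
# positions with no data are filled with NaN.
frame = pd.DataFrame({"s1": s1, "s2": s2})
print(frame)

# Every column is a Series again, and it carries the DataFrame's index.
print(frame["s1"].index.equals(frame.index))
```

Label "D" has no value in `s1`, so that cell should come out as `NaN` rather than raising an error.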

In [113]:
import pandas as pd
import numpy as np
from numpy.random import randn

np.random.seed(101)
# To make a few things clearer:

print(f"{'A B C D E'.split()} \n \n")

print(f"5x4 matrix: \n \n {randn(5,4)}")

['A', 'B', 'C', 'D', 'E'] 
 

5x4 matrix: 
 
 [[ 2.70684984  0.62813271  0.90796945  0.50382575]
 [ 0.65111795 -0.31931804 -0.84807698  0.60596535]
 [-2.01816824  0.74012206  0.52881349 -0.58900053]
 [ 0.18869531 -0.75887206 -0.93323722  0.95505651]
 [ 0.19079432  1.97875732  2.60596728  0.68350889]]


### Creating a DataFrame with random data, specifying the index and the columns

In [114]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


## Selection and indexing

Let's look at the different ways to pull data out of a DataFrame.

Note that selecting a single column returns a `pandas.Series`.

In [115]:
df['W']



A    0.302665
B   -0.134841
C    0.807706
D   -0.497104
E   -0.116773
Name: W, dtype: float64

In [116]:
# Passing a list of column names returns a DataFrame
df[['W','Z']]

,W,Z
A,0.302665,-1.159119
B,-0.134841,0.184502
C,0.807706,0.329646
D,-0.497104,0.484752
E,-0.116773,1.996652


In [117]:
# Attribute (dot) access, not SQL syntax; prefer df['W'], which works for any column name

df.W

A    0.302665
B   -0.134841
C    0.807706
D   -0.497104
E   -0.116773
Name: W, dtype: float64
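
Attribute access such as `df.W` is convenient, but it has limits. The sketch below uses a made-up DataFrame `demo` with hypothetical column names chosen to show two cases where only bracket access reaches the column.

```python
import pandas as pd

# Hypothetical column names: one safe, one that collides with a DataFrame
# method, and one that is not a valid Python identifier.
demo = pd.DataFrame({"W": [1, 2], "count": [3, 4], "my col": [5, 6]})

print(demo.W.tolist())          # [1, 2]: valid identifier, no clash
print(callable(demo.count))     # True: this is the DataFrame.count method
print(demo["count"].tolist())   # [3, 4]: brackets reach the column
print(demo["my col"].tolist())  # [5, 6]: names with spaces need brackets
```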

---
The columns of a DataFrame are Series:

In [118]:
type(df['W'])

pandas.core.series.Series

**Creating a column:**

In [119]:
df['new'] = df['W'] + 2*df['Y']
df

,W,X,Y,Z,new
A,0.302665,1.693723,-1.706086,-1.159119,-3.109506
B,-0.134841,0.390528,0.166905,0.184502,0.198969
C,0.807706,0.07296,0.638787,0.329646,2.08528
D,-0.497104,-0.75407,-0.943406,0.484752,-2.383917
E,-0.116773,1.901755,0.238127,1.996652,0.359481


---
**Removing columns**

In [120]:
df.drop('new', axis=1)  # axis=1 targets columns

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


**Removing rows by index label**

In [121]:
df.drop('A', axis=0)

,W,X,Y,Z,new
B,-0.134841,0.390528,0.166905,0.184502,0.198969
C,0.807706,0.07296,0.638787,0.329646,2.08528
D,-0.497104,-0.75407,-0.943406,0.484752,-2.383917
E,-0.116773,1.901755,0.238127,1.996652,0.359481


As the outputs above show, these changes were not saved: `drop` returns a new DataFrame and leaves `df` untouched. To keep the change, either assign the result back to `df` or pass `inplace=True`:

In [122]:
df.drop('new',axis=1,inplace=True)
df

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652
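
The other option, assigning the result back, can be sketched as follows. The small DataFrame `demo` is made up for illustration, and `drop(columns=...)` is used as an equivalent spelling of `axis=1`.

```python
import pandas as pd

demo = pd.DataFrame({"W": [1.0, 2.0], "new": [3.0, 4.0]}, index=["A", "B"])

# drop returns a new DataFrame; the original still has the column.
dropped = demo.drop("new", axis=1)
print(list(demo.columns))     # ['W', 'new']
print(list(dropped.columns))  # ['W']

# Rebinding the name keeps the change without inplace=True.
demo = demo.drop(columns="new")
print(list(demo.columns))     # ['W']
```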


**Selecting rows:**

`loc` is a label-based accessor: you pass the labels (names) of the rows and/or columns you want to select. Example:


In [123]:
df.loc['A']

W    0.302665
X    1.693723
Y   -1.706086
Z   -1.159119
Name: A, dtype: float64

`iloc` is an integer-position-based accessor: you pass the zero-based positions of the rows and/or columns you want to select, regardless of their labels.

In [124]:
# Selecting the row at position 2, which has the label C

df.iloc[2]

W    0.807706
X    0.072960
Y    0.638787
Z    0.329646
Name: C, dtype: float64
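
One practical difference between the two accessors shows up when slicing. The sketch below uses a made-up DataFrame `demo`: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing.

```python
import pandas as pd

demo = pd.DataFrame({"W": [10, 20, 30, 40]}, index=["A", "B", "C", "D"])

by_label = demo.loc["A":"C"]   # label slice: the end label "C" is included
by_position = demo.iloc[0:2]   # position slice: position 2 is excluded

print(by_label.index.tolist())     # ['A', 'B', 'C']
print(by_position.index.tolist())  # ['A', 'B']
```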

**Selecting a subset of rows and columns**

In [125]:
#      row,   column
df.loc['B',   'Y']

0.16690463609281317

In [126]:
#      row_list,      column_list
df.loc[['A',  'B'],   ['W', 'Y' ]]

,W,Y
A,0.302665,-1.706086
B,-0.134841,0.166905


### Conditional selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [127]:
import numpy as np

# In NumPy...

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Comparing builds a boolean mask; arr[resultado] would give the values [3 4 5]
resultado = arr > 2
resultado


array([False, False,  True,  True,  True])

In [128]:
resultado = arr[(arr > 2) & (arr < 4)]
resultado

array([3])

### Conditionals in pandas

In [129]:
df

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


In [130]:
df>0

,W,X,Y,Z
A,True,True,False,False
B,False,True,True,True
C,True,True,True,True
D,False,False,False,True
E,False,True,True,True


In [131]:
df[df>0]

,W,X,Y,Z
A,0.302665,1.693723,,
B,,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,,,,0.484752
E,,1.901755,0.238127,1.996652
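
Indexing a DataFrame with a boolean DataFrame keeps the shape and replaces every cell where the condition is false with `NaN`, which is why the empty cells appear above. A minimal sketch on a made-up DataFrame `demo` shows that this matches `DataFrame.where`:

```python
import pandas as pd

demo = pd.DataFrame({"W": [1.0, -2.0], "X": [-3.0, 4.0]}, index=["A", "B"])

masked = demo[demo > 0]        # cells failing the condition become NaN
same = demo.where(demo > 0)    # explicit spelling of the same operation

print(masked)
print(masked.equals(same))          # True
print(int(masked.isna().sum().sum()))  # 2 cells were masked out
```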


To keep only the rows where column `W` is greater than 0:

In [132]:
df[ df['W'] >0 ]

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
C,0.807706,0.07296,0.638787,0.329646


In [133]:
df[df['W']>0]['Y']

A   -1.706086
C    0.638787
Name: Y, dtype: float64

In [134]:
df[df['W']>0][['Y','X']]

,Y,X
A,-1.706086,1.693723
C,0.638787,0.07296
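
The two-step form `df[mask][cols]` filters rows first and then picks columns. The same selection can be written in one step with `loc`, passing the row mask and the column list together. The sketch below uses a made-up DataFrame `demo` and only checks that both spellings give the same result.

```python
import pandas as pd

demo = pd.DataFrame(
    {"W": [0.3, -0.1, 0.8], "X": [1.7, 0.4, 0.1], "Y": [-1.7, 0.2, 0.6]},
    index=["A", "B", "C"],
)

chained = demo[demo["W"] > 0][["Y", "X"]]      # filter rows, then select columns
single = demo.loc[demo["W"] > 0, ["Y", "X"]]   # rows and columns in one call

print(single)
print(chained.equals(single))  # True
```

The single `loc` call is usually the safer habit when you later want to assign to the selection, since assigning through a chained selection may end up modifying a temporary copy.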


To combine two conditions, wrap each one in parentheses and join them with:
- `|` for or
- `&` for and

In [135]:
# Column W > 0 and column Y > 0.5

df[(df['W']>0) & (df['Y'] > 0.5)]

,W,X,Y,Z
C,0.807706,0.07296,0.638787,0.329646


In [136]:
# Naming each condition keeps the expression readable:

condicao_1 = df['W'] > 0
condicao_2 = df['Y'] > 0.5

df[(condicao_1) & (condicao_2)]

,W,X,Y,Z
C,0.807706,0.07296,0.638787,0.329646


In [137]:
df[(condicao_1) | (condicao_2)]

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
C,0.807706,0.07296,0.638787,0.329646
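
Python's `and` and `or` keywords do not work here, because they ask for the truth value of a whole Series. The sketch below, on a made-up DataFrame `demo`, shows the resulting `ValueError` and also uses `~` to negate a condition.

```python
import pandas as pd

demo = pd.DataFrame(
    {"W": [0.3, -0.1, 0.8], "Y": [-1.7, 0.2, 0.6]}, index=["A", "B", "C"]
)
cond_w = demo["W"] > 0
cond_y = demo["Y"] > 0.5

# `and` needs a single True/False, which a Series cannot provide.
raised = False
try:
    demo[cond_w and cond_y]
except ValueError as err:
    raised = True
    print("ValueError:", err)

# `~` negates element-wise: rows where neither condition holds.
neither = demo[~(cond_w | cond_y)]
print(neither.index.tolist())  # ['B']
```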


## More on the index

Let's go over a few more indexing features, including resetting the index and setting it to something else. We will also cover index hierarchy!

In [138]:
df

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


In [139]:
# Reset to the default 0, 1, ..., n-1 integer index; the old index becomes a column
df.reset_index()

,index,W,X,Y,Z
0,A,0.302665,1.693723,-1.706086,-1.159119
1,B,-0.134841,0.390528,0.166905,0.184502
2,C,0.807706,0.07296,0.638787,0.329646
3,D,-0.497104,-0.75407,-0.943406,0.484752
4,E,-0.116773,1.901755,0.238127,1.996652
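
If the old labels are not needed, `reset_index` accepts `drop=True` to discard them instead of moving them into a column. A minimal sketch on a made-up DataFrame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"W": [1, 2]}, index=["A", "B"])

kept = demo.reset_index()                # old labels become a column named "index"
discarded = demo.reset_index(drop=True)  # old labels are thrown away

print(kept.columns.tolist())       # ['index', 'W']
print(discarded.columns.tolist())  # ['W']
print(demo.index.tolist())         # ['A', 'B']: the original is unchanged
```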


In [140]:
novoind = 'CA NY WY OR CO'.split()
df['Estados'] = novoind
df

,W,X,Y,Z,Estados
A,0.302665,1.693723,-1.706086,-1.159119,CA
B,-0.134841,0.390528,0.166905,0.184502,NY
C,0.807706,0.07296,0.638787,0.329646,WY
D,-0.497104,-0.75407,-0.943406,0.484752,OR
E,-0.116773,1.901755,0.238127,1.996652,CO


### Setting a new index

In [141]:
df.set_index('Estados')

Estados,W,X,Y,Z
CA,0.302665,1.693723,-1.706086,-1.159119
NY,-0.134841,0.390528,0.166905,0.184502
WY,0.807706,0.07296,0.638787,0.329646
OR,-0.497104,-0.75407,-0.943406,0.484752
CO,-0.116773,1.901755,0.238127,1.996652


In [142]:
df

,W,X,Y,Z,Estados
A,0.302665,1.693723,-1.706086,-1.159119,CA
B,-0.134841,0.390528,0.166905,0.184502,NY
C,0.807706,0.07296,0.638787,0.329646,WY
D,-0.497104,-0.75407,-0.943406,0.484752,OR
E,-0.116773,1.901755,0.238127,1.996652,CO


In [143]:
df.set_index('Estados' , inplace=True)
df

Estados,W,X,Y,Z
CA,0.302665,1.693723,-1.706086,-1.159119
NY,-0.134841,0.390528,0.166905,0.184502
WY,0.807706,0.07296,0.638787,0.329646
OR,-0.497104,-0.75407,-0.943406,0.484752
CO,-0.116773,1.901755,0.238127,1.996652


---

## Index hierarchy and multiple indexes

Let's look at how to work with a MultiIndex. First we'll create a quick example of what a multi-indexed DataFrame looks like:

In [144]:
# Index levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [145]:
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

### As we can see, the index now has layers

In [146]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

,,A,B
G1,1,-0.993263,0.1968
G1,2,-1.136645,0.000366
G1,3,1.025984,-0.156598
G2,1,-0.031579,0.649826
G2,2,2.154846,-0.610259
G2,3,-0.755325,-0.346419


Now let's see how to index this! For an index hierarchy on the rows, we use `df.loc[]`. If the hierarchy were on the columns axis, you would use normal bracket notation, `df[]`. Selecting a label from the outer level returns a sub-DataFrame:

In [147]:
df.loc['G1']

,A,B
1,-0.993263,0.1968
2,-1.136645,0.000366
3,1.025984,-0.156598


In [148]:
df.loc['G1'].loc[1]

A   -0.993263
B    0.196800
Name: 1, dtype: float64
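
Instead of chaining two `loc` calls, a single `loc` call also accepts a tuple with one label per level. The sketch below rebuilds a similar MultiIndex with `pd.MultiIndex.from_product` and uses made-up integer data so the results are easy to check.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]])
demo = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=index)

row = demo.loc[("G1", 1)]          # one row, returned as a Series
value = demo.loc[("G2", 3), "B"]   # tuple for the row, then the column label

print(row.tolist())  # [0, 10]
print(value)         # 15
```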

---
#### Naming the index levels

In [149]:
df.index.names

FrozenList([None, None])

In [150]:
df.index.names = ['Grupo','Número']
df

Grupo,Número,A,B
G1,1,-0.993263,0.1968
G1,2,-1.136645,0.000366
G1,3,1.025984,-0.156598
G2,1,-0.031579,0.649826
G2,2,2.154846,-0.610259
G2,3,-0.755325,-0.346419
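
Once the levels have names, they can be used to take a cross-section with `DataFrame.xs`. The sketch below is an illustration on made-up integer data, reusing the level names `Grupo` and `Número`; it selects every row whose inner level equals 1, across all groups.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["G1", "G2"], [1, 2, 3]], names=["Grupo", "Número"]
)
demo = pd.DataFrame({"A": range(6)}, index=index)

# Cross-section on the inner level; the selected level is dropped from the result.
first_rows = demo.xs(1, level="Número")

print(first_rows.index.tolist())  # ['G1', 'G2']
print(first_rows["A"].tolist())   # [0, 3]
```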
