[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gcmatos/python-para-geociencias/blob/master/notebooks/3.1%20Limpeza%20e%20transformação%20de%20dados.ipynb)

Ctrl/Cmd + click to open in a new browser tab and run the tutorial on Google Colab.

# Data cleaning and transformation

__What we will learn__
- Tools for handling null values
- Data manipulation
- Grouping operations
- Generating summary statistics by groups of data

## Environment setup

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Seed the random number generator so the results are reproducible
np.random.seed(0)

## Data cleaning
Detecting and correcting invalid records in datasets are generically known as cleaning tasks, which can be done interactively (*e.g.* Excel) or programmatically (*e.g.* R, Python). Invalid records can originate at acquisition time, from equipment malfunction or from human mistakes such as typing errors.

### Null values

In [3]:
s = pd.Series([1, np.nan, 3.5, np.nan, 7])
s

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [4]:
# Total number of null values
s.isnull().sum()

2

In [5]:
# Remove null values
s.dropna()
# s[s.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [6]:
df = pd.DataFrame([[1., 6.5, 3.], 
                   [2., np.nan, np.nan],
                   [np.nan, np.nan, np.nan], 
                   [np.nan, 7.5, 4]])
df

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,2.0,,
2,,,
3,,7.5,4.0


In [7]:
# Total number of null values per column (default) or per row (axis=1)
df.isnull().sum()
# df.isnull().sum(axis=1)

0    2
1    2
2    2
dtype: int64

In [8]:
# Remove rows (default) or columns (axis=1) that contain null values
df.dropna()
# df.dropna(axis=1)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [9]:
# Replace null values with a constant or, as done here, with the column means
# df.fillna(-9999)
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,2.0,7.0,3.5
2,1.5,7.0,3.5
3,1.5,7.5,4.0
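
Filling with the column mean is only one possible strategy. As a minimal sketch, the block below compares three other options on a small frame: a different constant per column, forward fill, and linear interpolation. The frame `demo` and its column names are hypothetical and were made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'depth': [10.0, 20.0, np.nan, 40.0],
                     'gr': [55.0, np.nan, np.nan, 80.0]})

# A different constant for each column, passed as a dict
filled_const = demo.fillna({'depth': 0.0, 'gr': -9999.0})

# Propagate the last valid observation forward
filled_ffill = demo.ffill()

# Linear interpolation between valid neighbours
filled_interp = demo.interpolate()

print(filled_interp)
```

Which strategy is appropriate depends on the data: interpolation assumes that consecutive rows are ordered measurements, which may not hold for every table.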


### Replacing values

In [10]:
df = pd.Series([1., -999., 2., -999., -1000., 3.])
df

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [11]:
df.replace(-999, np.nan)
# df.replace([-999, -1000], np.nan)
# df.replace([-999, -1000], [np.nan, 0])
# df.replace({-999: np.nan, -1000: 0})

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

### *Outliers*
Records with values outside pre-established acceptable limits are called *outliers*; they can distort summary statistics and predictive models.

In [12]:
df = pd.DataFrame(np.random.randn(1000, 4))
df

Unnamed: 0,0,1,2,3
0,1.764052,0.400157,0.978738,2.240893
1,1.867558,-0.977278,0.950088,-0.151357
2,-0.103219,0.410599,0.144044,1.454274
3,0.761038,0.121675,0.443863,0.333674
4,1.494079,-0.205158,0.313068,-0.854096
5,-2.552990,0.653619,0.864436,-0.742165
6,2.269755,-1.454366,0.045759,-0.187184
7,1.532779,1.469359,0.154947,0.378163
8,-0.887786,-1.980796,-0.347912,0.156349
9,1.230291,1.202380,-0.387327,-0.302303


In [13]:
df.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.062966,-0.002087,-0.025777,-0.010981
std,0.983517,0.967146,0.983671,0.993891
min,-3.740101,-3.046143,-3.116857,-3.3923
25%,-0.75572,-0.68368,-0.684833,-0.686776
50%,-0.029995,-0.02321,-0.025068,-0.038192
75%,0.604792,0.652095,0.624139,0.648778
max,2.929096,2.662727,3.80166,3.427539


In [14]:
# Select values outside the interval [-3, 3] in one column
col = df[2]
col[np.abs(col) > 3]

263    3.170975
504   -3.116857
770    3.306574
779    3.801660
865   -3.006499
Name: 2, dtype: float64

In [15]:
# Select rows with at least one value outside the interval [-3, 3]
df[(np.abs(df) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
147,-0.392389,-3.046143,0.543312,0.439043
263,-0.156024,1.049093,3.170975,0.1895
504,0.70886,0.422819,-3.116857,0.644452
770,-0.387313,-0.347585,3.306574,-1.5102
779,-1.091033,-0.126856,3.80166,2.315171
861,0.236225,-0.752582,0.045113,3.427539
865,-0.087328,-0.553965,-3.006499,-0.047166
919,-0.953179,-0.479297,-1.345508,-3.3923
938,-3.740101,0.973577,1.175155,-1.124703


In [16]:
# Cap values outside the interval at -3 or 3, keeping the sign of the original value
df[np.abs(df) > 3] = np.sign(df) * 3
# df[(np.abs(df) == 3).any(axis=1)]
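
The capping step above can also be written with `DataFrame.clip`. The sketch below is a self-contained check that both forms give the same result; the frame `demo` and the seed are assumptions made for this example and are not the data used in the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seed, illustration only
demo = pd.DataFrame(rng.standard_normal((1000, 4)))

# Manual capping, same idea as the cell above
capped = demo.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# One-step alternative: limit every value to the interval [-3, 3]
clipped = demo.clip(lower=-3, upper=3)

print(capped.equals(clipped))
```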

### Removing duplicates

In [17]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [18]:
data.duplicated()
# data.drop_duplicates()
# data.drop_duplicates(['k1'])
# data.drop_duplicates(['k1', 'k2'], keep='first')

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
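
The `keep` and `subset` parameters control which rows count as duplicates. As a small sketch on the same kind of table (rebuilt here as `demo` so the block runs on its own), `keep=False` flags every member of a duplicated group, and `subset` restricts the comparison to selected columns.

```python
import pandas as pd

demo = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Flag every row that belongs to a duplicated group, not only the repeats
all_dupes = demo.duplicated(keep=False)

# Compare only 'k1' and keep the last occurrence of each value
last_per_k1 = demo.drop_duplicates(subset=['k1'], keep='last')

print(all_dupes)
print(last_per_k1)
```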

## Transformation with mapping functions


__Example:__

To illustrate the `map()` method, we create a dataset with one column of effective porosity measurements and another with the lithology. Using a mapping, we will create a new 'rock_type' column that tells which samples are reservoir rock and which are non-reservoir rock.

In [19]:
data = pd.DataFrame({'rock': ['Sandstone', 'Sandstone', 
                              'Sandstone', 'Shale', 'Wackstone', 
                              'Siltstone', 'Sandstone', 
                              'Conglomerate', 'Limestone'],
                     'porosity': [0.2, 0.25, 0.1, 0.03, 0.07, 0.08, 0.3, 0.15, 0.36]})
data.head()

Unnamed: 0,rock,porosity
0,Sandstone,0.2
1,Sandstone,0.25
2,Sandstone,0.1
3,Shale,0.03
4,Wackstone,0.07


In [20]:
# Define a dictionary that maps each lithology to a reservoir type
rock_type = {
  'sandstone': 'reservoir',
  'conglomerate': 'reservoir',
  'siltstone': 'non-reservoir',
  'wackstone': 'non-reservoir',
  'limestone': 'reservoir',
  'shale': 'non-reservoir'
}

In [21]:
# Apply the mapping directly to the (lowercased) elements of the 'rock' column
data['rock'].str.lower().map(rock_type)
# data['rock_type'] = data['rock'].map(rock_type)
# data['rock_type'] = data['rock'].str.lower().map(rock_type)
# data.head()

0        reservoir
1        reservoir
2        reservoir
3    non-reservoir
4    non-reservoir
5    non-reservoir
6        reservoir
7        reservoir
8        reservoir
Name: rock, dtype: object

In [22]:
# Another way to do the mapping: lowercase first, then map
lowercased = data['rock'].str.lower()
# lowercased

# Create the new 'rock_type' column with map()
data['rock_type'] = lowercased.map(rock_type)
data.head()

Unnamed: 0,rock,porosity,rock_type
0,Sandstone,0.2,reservoir
1,Sandstone,0.25,reservoir
2,Sandstone,0.1,reservoir
3,Shale,0.03,non-reservoir
4,Wackstone,0.07,non-reservoir
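
Values that are missing from the dictionary are mapped to `NaN`, which is easy to overlook. The sketch below shows one way to detect and label them; the shortened dictionary, the `'Basalt'` entry and the `'unknown'` label are hypothetical choices for this illustration.

```python
import pandas as pd

# Shortened, hypothetical mapping for illustration
rock_type = {'sandstone': 'reservoir', 'shale': 'non-reservoir'}
rocks = pd.Series(['Sandstone', 'Shale', 'Basalt'])  # 'basalt' is not in the dict

mapped = rocks.str.lower().map(rock_type)

# Lithologies that the dictionary does not cover
missing = rocks[mapped.isnull()]

# Give unmapped entries an explicit label instead of NaN
mapped_filled = mapped.fillna('unknown')

print(missing)
print(mapped_filled)
```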


## Discretization
pandas objects can be discretized into value intervals with the `pandas.cut` function. Applied to a Series, it returns a Series with categorical dtype (`CategoricalDtype`).

__Example__

To illustrate `pd.cut()`, we create a dataset with 100 Gamma Ray (API) measurements. We then create 4 categories that represent value intervals.

In [23]:
gamma = pd.DataFrame(abs(np.random.randn(100)) * 60, columns=['Gamma Ray'])
# gamma.sample(5)
gamma.plot.hist();

In [24]:
# Split into 4 equal-width intervals
gamma_cut = pd.cut(gamma['Gamma Ray'], 4)
# gamma_cut = pd.cut(gamma['Gamma Ray'], 4, labels=['low', 'medium', 'high', 'extreme'])

gamma_cut.head()
# gamma_cut.dtype
# pd.value_counts(gamma_cut)
# gamma_cut.value_counts()
# gamma_cut.values

0     (0.0503, 37.596]
1    (74.992, 112.389]
2     (0.0503, 37.596]
3     (37.596, 74.992]
4    (74.992, 112.389]
Name: Gamma Ray, dtype: category
Categories (4, interval[float64]): [(0.0503, 37.596] < (37.596, 74.992] < (74.992, 112.389] < (112.389, 149.785]]

In [25]:
# Define the cutoff bin edges
cutoff = [0, 50, 100, 150, 200]

gamma_cut = pd.cut(gamma['Gamma Ray'], cutoff)

# gamma_cut.head()
# gamma_cut.dtype
# gamma_cut.value_counts().plot(kind='barh')
# gamma_cut.values

In [26]:
# Partition into quartiles
gamma_cut_q = pd.qcut(gamma['Gamma Ray'], 4)
# gamma_cut_q = pd.qcut(gamma['Gamma Ray'], [0, 0.1, 0.5, 0.9, 1.])

# gamma_cut_q.head()
gamma_cut_q.value_counts()
# gamma_cut_q.values

(72.573, 149.785]    25
(45.245, 72.573]     25
(17.91, 45.245]      25
(0.199, 17.91]       25
Name: Gamma Ray, dtype: int64
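
By default `pd.cut` builds intervals that are closed on the right, so a value equal to a bin edge falls in the lower bin. The sketch below contrasts that default with `right=False` and custom labels; the values in `gr`, the bin edges and the label names are made up for the example.

```python
import pandas as pd

gr = pd.Series([12.0, 50.0, 75.0, 100.0, 149.0], name='Gamma Ray')  # hypothetical values
bins = [0, 50, 100, 150]

# Default: intervals (0, 50], (50, 100], (100, 150]
right_closed = pd.cut(gr, bins)

# Left-closed intervals [0, 50), [50, 100), [100, 150) with readable labels
left_closed = pd.cut(gr, bins, right=False, labels=['low', 'medium', 'high'])

print(pd.DataFrame({'value': gr, 'right_closed': right_closed, 'left_closed': left_closed}))
```

With these inputs the value 50 lands in the first bin under the default and in the second bin with `right=False`, which matters whenever measurements coincide with a cutoff.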

## Manipulation

### Hierarchical indexing
pandas objects can have an index with two or more hierarchical levels. This is very useful in the geosciences, because columns with coordinates (x, y) and point identifiers from data acquisition spreadsheets can be used as index levels.
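
As a rough sketch of that idea, the block below indexes a small sampling table by point identifier, coordinates and depth. Every name and value in `samples` (including `cu_ppm`) is hypothetical and only serves the illustration.

```python
import pandas as pd

# Hypothetical field samples: point id, coordinates, depth and one measurement
samples = pd.DataFrame({'point': ['P1', 'P1', 'P2', 'P2'],
                        'x': [350100.0, 350100.0, 350250.0, 350250.0],
                        'y': [7500200.0, 7500200.0, 7500400.0, 7500400.0],
                        'depth': [0.5, 1.0, 0.5, 1.0],
                        'cu_ppm': [35, 42, 18, 27]})

# Use the identifying columns as index levels
indexed = samples.set_index(['point', 'x', 'y', 'depth'])

# Select every record of one point through the outermost level
p1 = indexed.loc['P1']

# Cross-section on an inner level: all points at 1.0 m depth
at_1m = indexed.xs(1.0, level='depth')

print(p1)
print(at_1m)
```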

In [27]:
s = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
s

a  1   -1.714165
   2   -0.049642
   3   -1.299395
b  1    0.625355
   3   -0.791719
c  1   -0.582943
   2   -1.552681
d  2    2.110154
   3    0.758830
dtype: float64

In [28]:
s.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

In [29]:
s['b']
# s.loc[['b', 'd']]

1    0.625355
3   -0.791719
dtype: float64

In [30]:
# Unstack the hierarchical index (inner level becomes columns)
s.unstack()

Unnamed: 0,1,2,3
a,-1.714165,-0.049642,-1.299395
b,0.625355,,-0.791719
c,-0.582943,-1.552681,
d,,2.110154,0.75883


### Converting columns to hierarchical indexes

In [31]:
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
df

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [32]:
df2 = df.set_index(['c', 'd'])
# df2 = df.set_index(['c', 'd'], drop=False)
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [33]:
df2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


### Statistics by hierarchical index level


__Example:__


In [34]:
# DataFrame with facies and rock data
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['facies1', 'facies1', 'facies2'],
                              ['Sand', 'Shale', 'Limestone']])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,facies1,facies1,facies2
Unnamed: 0_level_1,Unnamed: 1_level_1,Sand,Shale,Limestone
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [35]:
# Name the index and column levels
df.index.names = ['point', 'sample']
df.columns.names = ['Facies', 'Rock']
df

Unnamed: 0_level_0,Facies,facies1,facies1,facies2
Unnamed: 0_level_1,Rock,Sand,Shale,Limestone
point,sample,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [36]:
df['facies2']

Unnamed: 0_level_0,Rock,Limestone
point,sample,Unnamed: 2_level_1
a,1,2
a,2,5
b,1,8
b,2,11


In [37]:
# sum(level=...) was removed in pandas 2.0; group by the index level instead
df.groupby(level='sample').sum()
# df.T.groupby(level='Facies').sum().T

Facies,facies1,facies1,facies2
Rock,Sand,Shale,Limestone
sample,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


## Combining datasets
The two pandas functions for combining objects covered in this course are:

- **`pandas.concat`**, which stacks pandas objects by concatenating them along an axis.

- **`pandas.merge`**, which connects rows of DataFrames based on one or more keys, as in a relational database (SQL).


__Table of relational join options__

`pd.merge` option | SQL | Description
--- | --- | ---
`how='inner'` | `INNER JOIN` | Combine using only the keys present in both tables (intersection)
`how='left'` | `LEFT OUTER JOIN` | Combine using only the keys present in the left table
`how='right'` | `RIGHT OUTER JOIN` | Combine using only the keys present in the right table
`how='outer'` | `FULL OUTER JOIN` | Combine using all keys present in either table (union)
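
The four options in the table can be compared side by side. The sketch below uses two tiny hypothetical tables (`left_demo`, `right_demo`) that share only part of their keys, and passes `indicator=True` so that the `_merge` column reports where each row came from.

```python
import pandas as pd

# Hypothetical tables: 'K0' exists only on the left, 'K3' only on the right
left_demo = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right_demo = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'B': ['B1', 'B2', 'B3']})

results = {}
for how in ['inner', 'left', 'right', 'outer']:
    # indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
    results[how] = pd.merge(left_demo, right_demo, how=how, on='key', indicator=True)
    print(how, results[how]['key'].tolist())
```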

### `merge`

In [38]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [39]:
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [40]:
# pd.merge(left, right, on=['key1', 'key2'])
pd.merge(left, right, how='inner', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


### `concat`

In [41]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

In [42]:
# Concatenate along the column axis
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [43]:
s4 = pd.concat([s1, s3])
s4

a    0
b    1
f    5
g    6
dtype: int64

In [44]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,1


In [45]:
# join_axes was removed in pandas 1.0; reindex the result instead
pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])

Unnamed: 0,0,1
a,0.0,0.0
c,,
b,1.0,1.0
e,,


In [46]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df1

Unnamed: 0,a,b,c,d
0,-0.70993,0.151147,1.323091,-0.927825
1,1.906599,-1.032175,-0.177361,-1.650378
2,-2.538511,1.010091,0.085702,-1.733861


In [47]:
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df2

Unnamed: 0,b,d,a
0,-1.640601,1.145361,-0.150595
1,1.431443,0.636587,-0.066563


In [48]:
# Concatenate row-wise (pass ignore_index=True to discard the original row labels)
pd.concat([df1, df2])
# pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,-0.70993,0.151147,1.323091,-0.927825
1,1.906599,-1.032175,-0.177361,-1.650378
2,-2.538511,1.010091,0.085702,-1.733861
0,-0.150595,-1.640601,,1.145361
1,-0.066563,1.431443,,0.636587
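
Note the repeated row labels `0` and `1` in the result above. Two common ways to deal with them are `ignore_index=True`, which renumbers the rows, and `keys=`, which records the origin of each piece in an extra index level. The sketch below uses two small hypothetical Series and the made-up labels `'first'` and `'second'`.

```python
import pandas as pd

# Hypothetical inputs, for illustration only
a = pd.Series([0, 1], index=['a', 'b'])
b = pd.Series([5, 6], index=['f', 'g'])

# keys= labels each piece and builds a hierarchical index
labelled = pd.concat([a, b], keys=['first', 'second'])

# ignore_index=True discards the original labels and renumbers from 0
renumbered = pd.concat([a, b], ignore_index=True)

print(labelled)
print(renumbered)
```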


## Reshaping

Methods:

__**`stack`**__
> Rotates (pivots) from columns to rows

__**`unstack`**__
> Rotates (pivots) from rows to columns

In [49]:
df = pd.DataFrame(np.arange(6).reshape((2, 3)),
                  index=pd.Index(['A', 'B'], name='index'),
                  columns=pd.Index(['one', 'two', 'three'],
                  name='columns'))
df

columns,one,two,three
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,0,1,2
B,3,4,5


In [50]:
df.stack()
# df.stack().unstack()

index  columns
A      one        0
       two        1
       three      2
B      one        3
       two        4
       three      5
dtype: int64

In [51]:
# Unstack specifying which index level to move to the columns
df_stacked = df.stack()

df_stacked.unstack('index')

index,A,B
columns,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


## Grouping

The main grouping operations are:
- Splitting the data into groups (split)
- Computing statistics per subgroup
- Applying methods to specific groups within a table
- Computing pivot tables

![](https://i.stack.imgur.com/sgCn1.jpg)

Example of grouping with the *split-apply-combine* approach using the `groupby` method.
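
To make the three steps concrete, the sketch below performs split, apply and combine by hand and compares the result with `groupby`. The frame `demo` is hypothetical, and the manual version is only meant to illustrate the idea, not to describe how pandas implements it internally.

```python
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                     'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# split: one piece of 'value' per distinct key
pieces = {key: demo.loc[demo['key'] == key, 'value'] for key in demo['key'].unique()}

# apply: reduce each piece to a scalar
means = {key: piece.mean() for key, piece in pieces.items()}

# combine: assemble the scalars into a single Series
manual = pd.Series(means, name='value').sort_index()
manual.index.name = 'key'

# The same three steps in one call
by_groupby = demo.groupby('key')['value'].mean()

print(manual)
print(by_groupby)
```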

### GroupBy

### *split-apply-combine*

In [52]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,0.03233,-0.163418
1,a,two,-0.555074,0.823555
2,b,one,0.097786,0.920632
3,b,two,-0.060985,0.180763
4,a,one,0.837517,-1.312968


In [53]:
# split
grouped = df['data1'].groupby(df['key1'])
# grouped

# apply-combine
grouped.mean()

key1
a    0.104924
b    0.018401
Name: data1, dtype: float64

In [54]:
# split-apply-combine with multiple keys
means = df.groupby([df['key1'], df['key2']]).mean()
# means = df['data1'].groupby([df['key1'], df['key2']]).mean()

means

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.434923,-0.738193
a,two,-0.555074,0.823555
b,one,0.097786,0.920632
b,two,-0.060985,0.180763


In [55]:
means.unstack()

Unnamed: 0_level_0,data1,data1,data2,data2
key2,one,two,one,two
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,0.434923,-0.555074,-0.738193,0.823555
b,0.097786,-0.060985,0.920632,0.180763


In [56]:
# Produce a DataFrame with GroupBy (column selection with a list)
df_grouped = df.groupby(['key1', 'key2'])[['data2']].mean()
type(df_grouped)

pandas.core.frame.DataFrame

In [57]:
# Produce a Series with GroupBy (column selection with a single label)
s_grouped = df.groupby(['key1', 'key2'])['data2'].mean()
type(s_grouped)

pandas.core.series.Series

### Iterating over groups

In [58]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one  0.032330 -0.163418
1    a  two -0.555074  0.823555
4    a  one  0.837517 -1.312968
b
  key1 key2     data1     data2
2    b  one  0.097786  0.920632
3    b  two -0.060985  0.180763


In [59]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2     data1     data2
0    a  one  0.032330 -0.163418
4    a  one  0.837517 -1.312968
('a', 'two')
  key1 key2     data1     data2
1    a  two -0.555074  0.823555
('b', 'one')
  key1 key2     data1     data2
2    b  one  0.097786  0.920632
('b', 'two')
  key1 key2     data1     data2
3    b  two -0.060985  0.180763


In [60]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [61]:
# Group the columns by dtype (note: axis=1 in groupby is deprecated in recent pandas versions)
grouped = df.groupby(df.dtypes, axis=1)

for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0  0.032330 -0.163418
1 -0.555074  0.823555
2  0.097786  0.920632
3 -0.060985  0.180763
4  0.837517 -1.312968
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


### Grouping with a dictionary

In [62]:
df = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['line 1', 'line 2', 'line 3', 'line 4', 'line 5'])
df

Unnamed: 0,a,b,c,d,e
line 1,-0.160477,-1.906075,-1.206622,0.730418,0.494893
line 2,-0.003209,-0.302433,-0.739401,-0.512812,0.965252
line 3,0.476679,-1.371216,0.198853,0.139963,1.648673
line 4,-1.757552,-0.78313,0.973626,-1.110932,2.385622
line 5,-1.178944,0.029122,0.559548,0.881037,0.715208


In [63]:
facies = {'a': 'type 1', 'b': 'type 1', 'c': 'type 1',
          'd': 'type 2', 'e': 'type 2'}

In [64]:
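# Group the columns with the dictionary: transpose, group the rows, transpose back
# (replaces the deprecated df.groupby(facies, axis=1).mean())
df.T.groupby(facies).mean().T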

Unnamed: 0,type 1,type 2
line 1,-1.091058,0.612656
line 2,-0.348348,0.22622
line 3,-0.231895,0.894318
line 4,-0.522352,0.637345
line 5,-0.196758,0.798123


### Aggregating values
Aggregating means producing scalar values by reducing arrays of numbers. Some of the most commonly used aggregation functions are:
- `sum` 
- `mean`
- `median`
- `min`, `max`
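
Several of these functions can be applied in a single `agg` call. The sketch below does so on a hypothetical porosity table and also shows named aggregation, where each output column gets an explicit name; `demo`, `phi_mean` and `phi_range` are names invented for this example.

```python
import pandas as pd

# Hypothetical porosity measurements per well
demo = pd.DataFrame({'well': ['W1', 'W1', 'W2', 'W2', 'W2'],
                     'porosity': [0.10, 0.20, 0.05, 0.15, 0.25]})

# Several built-in aggregations at once, referenced by name
summary = demo.groupby('well')['porosity'].agg(['sum', 'mean', 'median', 'min', 'max'])

# Named aggregation: output_name=(column, function)
named = demo.groupby('well').agg(
    phi_mean=('porosity', 'mean'),
    phi_range=('porosity', lambda x: x.max() - x.min()),
)

print(summary)
print(named)
```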

In [65]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.462078,1.509577
1,a,two,0.913207,-0.228171
2,b,one,-0.754652,-0.890342
3,b,two,-0.534977,1.209772
4,a,one,0.456642,-1.2648


In [66]:
# 0.9 quantile of 'data1' within each group
df.groupby('key1')['data1'].quantile(0.9)

key1
a    0.821894
b   -0.556945
Name: data1, dtype: float64

In [67]:
# Aggregate with a user-defined function (range of the values)

def interval(arr):
    return arr.max() - arr.min()


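# Select the numeric columns: subtracting strings ('key2') raises a TypeError in pandas 2.x
df.groupby('key1')[['data1', 'data2']].agg(interval)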

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,1.375285,2.774377
b,0.219675,2.100113


In [68]:
df.groupby('key1').describe()

Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
a,3.0,0.302591,0.700465,-0.462078,-0.002718,0.456642,0.684925,0.913207,3.0,0.005535,1.401876,-1.2648,-0.746486,-0.228171,0.640703,1.509577
b,2.0,-0.644815,0.155334,-0.754652,-0.699734,-0.644815,-0.589896,-0.534977,2.0,0.159715,1.485004,-0.890342,-0.365313,0.159715,0.684744,1.209772


In [69]:
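# Select the numeric columns so the object column 'key2' is not aggregated
grouped = df.groupby('key1')[['data1', 'data2']]
# grouped = df.groupby('key1', as_index=False)[['data1', 'data2']]
# grouped = df.groupby('key1', group_keys=False)[['data1', 'data2']]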

grouped.agg('mean')

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.302591,0.005535
b,-0.644815,0.159715


In [70]:
grouped.agg(['mean', 'median', interval])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,mean,median,interval,mean,median,interval
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
a,0.302591,0.456642,1.375285,0.005535,-0.228171,2.774377
b,-0.644815,-0.644815,0.219675,0.159715,0.159715,2.100113


### Pivot tables and *cross-tabulations*
Pivot tables produce summary statistics, and *cross-tabulations* produce summaries of value frequencies.

In [71]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.462078,1.509577
1,a,two,0.913207,-0.228171
2,b,one,-0.754652,-0.890342
3,b,two,-0.534977,1.209772
4,a,one,0.456642,-1.2648


In [72]:
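# Using a pivot table (the default aggregation is the mean); list the numeric
# columns explicitly so the object column 'key2' is not aggregated
df.pivot_table(values=['data1', 'data2'], index='key1')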
# df.pivot_table(index='key1', columns='key2')

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.302591,0.005535
b,-0.644815,0.159715


In [73]:
pd.crosstab(df.key1, df.key2, margins=True)

key2,one,two,All
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,2,1,3
b,1,1,2
All,3,2,5
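
Both tools can be read as shortcuts for a `groupby` followed by `unstack`. The sketch below illustrates that correspondence on a small hypothetical frame with fixed values, so the numbers are easy to check by hand.

```python
import pandas as pd

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'key2': ['one', 'two', 'one', 'two', 'one'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Pivot table of means, and the same summary through groupby + unstack
pivot = demo.pivot_table(values='data1', index='key1', columns='key2', aggfunc='mean')
by_groupby = demo.groupby(['key1', 'key2'])['data1'].mean().unstack()

# Cross-tabulation of frequencies, and the same counts through groupby + size
counts = pd.crosstab(demo['key1'], demo['key2'])
by_size = demo.groupby(['key1', 'key2']).size().unstack(fill_value=0)

print(pivot)
print(counts)
```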


__References__

- [Data cleansing](https://en.wikipedia.org/wiki/Data_cleansing)