## Lesson 6 - Data Cleaning and Transformation with Pandas

**Documentation:** https://pandas.pydata.org/docs/index.html

**Lesson contents:**

- General inspection of a pandas table
    - info
    - describe
    - dtypes
    - duplicated
    - NA (notna, isna)
- Removing and replacing data
    - drop
    - fillna
    - replace
    - np.where
    - map
    - apply


**Example dataset:** titanic (https://github.com/pandas-dev/pandas/blob/main/doc/data/titanic.csv)

- **PassengerId:** Identifier of each passenger.
- **Survived:** 0 if the passenger did not survive; 1 if the passenger survived.
- **Pclass:** Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- **Name:** Passenger name.
- **Sex:** Passenger gender.
- **Age:** Passenger age.
- **SibSp:** Number of siblings/spouses aboard.
- **Parch:** Number of parents/children aboard.
- **Ticket:** Passenger ticket number.
- **Fare:** Fare paid by the passenger.
- **Cabin:** Passenger cabin number.
- **Embarked:** Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/titanic.csv')

### Data cleaning

What does it mean to clean data, and what are the benefits?

Cleaning a dataset is a fundamental step in data analysis and data science:
- Fields with missing values can break or distort the functions applied to them;
- Fields with outliers can lead to misinterpretation and, in turn, to poor decisions;
- Columns may have a data type different from the expected one, which prevents some functions from working. Example: an age column stored as strings cannot be used to correctly compute the mean, maximum, or minimum age.

To mitigate these problems, we can "clean" the data in several ways. But first, we have to identify them!

In [4]:
df.Age.dtype

dtype('float64')

In [5]:
idade_str = df['Age'].astype(str)

In [6]:
## if the Series does not have the expected dtype, some functions will not return the correct answer (strings are compared lexicographically)
idade_str.max()

'nan'

In [8]:
df.Age.max()

80.0

In [32]:
idade_str.min()

'0.42'

In [33]:
## if the Series does not have the expected dtype, some functions will not work at all
idade_str.mean()

TypeError: Could not convert 22.038.026.035.035.0nan54.02.027.0... [concatenation of every value in the column, truncated] ...26.032.0 to numeric

In [34]:
df.Age.mean()

29.69911764705882

In [35]:
df.Age.max()

80.0

In [36]:
df.Age.min()

0.42
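When a numeric column arrives as text, one possible fix is to convert it back to a numeric dtype before computing statistics. The sketch below uses a small made-up Series (`ages_text` is hypothetical, not the Titanic data) and `pd.to_numeric` with `errors='coerce'`, which turns entries that cannot be parsed into NaN.

```python
import pandas as pd

# Hypothetical ages stored as text, including one entry that is not a number
ages_text = pd.Series(['22', '38', 'unknown', '0.42'])

# errors='coerce' converts values that cannot be parsed into NaN
ages_num = pd.to_numeric(ages_text, errors='coerce')

print(ages_num.dtype)         # float64
print(ages_num.max())         # 38.0
print(ages_num.isna().sum())  # 1
```

After the conversion, `max`, `min`, and `mean` behave numerically again, and the unparseable entry shows up as a missing value that can be handled explicitly.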

**Common functions for inspecting a table:**
- 'dtypes' - shows the data type of each column
- 'describe' - by default, summarizes the numeric columns with descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum)
- 'info' - returns general information about the table: number of rows and columns, column types, number of non-null values per column, and a summary of how many columns have each data type.
- 'isnull' or 'isna' - flags null values
- 'notnull' or 'notna' - flags non-null values
- 'duplicated' - flags duplicated rows in a dataset
- 'unique' - returns the unique values of a Series (column)
- 'nunique' - returns the number of unique values of a Series (column)
- 'value_counts' - returns how many times each value appears in the Series (column)
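As a quick illustration of the functions listed above, the sketch below runs a few of them on a tiny made-up table (`toy` and its contents are assumptions for the example, not part of the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical table with one repeated row and two missing values
toy = pd.DataFrame({
    'city': ['Recife', 'Recife', 'Natal', None],
    'temp': [30.5, 30.5, np.nan, 28.0],
})

print(toy.dtypes)                  # data type of each column
print(toy.isna().sum())            # missing values per column
print(toy.duplicated().sum())      # rows identical to an earlier row
print(toy['city'].nunique())       # distinct non-null values
print(toy['city'].value_counts())  # how often each value appears
```

Note that `nunique` and `value_counts` ignore missing values by default.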


In [39]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [42]:
df.describe()
## careful: df.describe without parentheses returns the method itself instead of calling it

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [43]:
df.describe(include='all') ## Considers all columns, including non-numeric ones.
## This adds statistics for categorical data:
## number of unique values, the most frequent value (top), and the frequency of that value (freq)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [44]:
df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [46]:
df[df.Ticket == '347082'].shape

(7, 12)

In [60]:
df.Cabin.value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

In [62]:
df.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [63]:
df.describe(percentiles=[.1,.2,.3])

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
10%,90.0,0.0,1.0,14.0,0.0,0.0,7.55
20%,179.0,0.0,1.0,19.0,0.0,0.0,7.8542
30%,268.0,0.0,2.0,22.0,0.0,0.0,8.05
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [65]:
df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [66]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [68]:
df.notnull().sum()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [70]:
df.isna().sum() ## same result as df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [73]:
df.notna().sum()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [76]:
df.duplicated().sum()

0

In [78]:
df[['Survived','Pclass']].duplicated().sum()

885

### Replacing and removing values

After identifying the values that need treatment, we can take several approaches:
- Remove the record (drop the row)
- Remove the column (if a variable has many odd/missing values, we may choose not to use it in the analysis, but we must be CAREFUL, since it may carry important information).
- Replace values with a fixed value (0, the mean, etc.)
- Replace values conditionally (using functions)

**Some functions for removing rows and/or columns:**
- drop_duplicates
- dropna
- drop
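The three functions can be combined in sequence. The sketch below is one possible order on a small made-up table (`toy` is hypothetical): drop a column that carries no information, then remove repeated rows, then remove rows that still have missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical table: 'note' is entirely missing, id 2 appears twice
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'score': [9.0, 7.5, 7.5, np.nan],
    'note': [np.nan, np.nan, np.nan, np.nan],
})

without_note = toy.drop(columns=['note'])   # remove a column by label
no_dupes = without_note.drop_duplicates()   # keep the first of each repeated row
complete = no_dupes.dropna()                # drop rows with any missing value

print(without_note.shape)  # (4, 2)
print(no_dupes.shape)      # (3, 2)
print(complete.shape)      # (2, 2)
```

None of these calls modifies `toy`; each returns a new DataFrame unless `inplace=True` is passed.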

In [81]:
df.shape

(891, 12)

In [80]:
df.drop_duplicates().shape

(891, 12)

In [82]:
df[['Survived','Pclass']].shape

(891, 2)

In [85]:
df[['Survived','Pclass']].drop_duplicates().shape

(6, 2)

In [86]:
df[['Survived','Pclass']].drop_duplicates()

Unnamed: 0,Survived,Pclass
0,0,3
1,1,1
2,1,3
6,0,1
9,1,2
20,0,2


In [87]:
df.dropna().shape

(183, 12)

In [88]:
df.dropna(axis=1).shape

(891, 9)

In [89]:
df.dropna(subset=['Age']).shape

(714, 12)

In [90]:
df.shape[0] - df.Age.isnull().sum()

714

In [94]:
df.dropna(how = 'all').shape

(891, 12)

In [162]:
df.shape

(891, 12)

In [161]:
## thresh=3 keeps only the rows that have at least 3 non-null values
df.dropna(thresh=3).shape

(891, 12)
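On the Titanic table every row has at least three non-null values, so `thresh=3` removes nothing. The effect of `thresh` and `how` is easier to see on a small made-up table (`toy` below is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Row 0 has 3 non-null values, row 1 has 2, row 2 has none
toy = pd.DataFrame({
    'a': [1, np.nan, np.nan],
    'b': [2, 5, np.nan],
    'c': [3, 6, np.nan],
})

print(toy.dropna(thresh=2).shape)   # (2, 3): rows with at least 2 non-null values
print(toy.dropna(thresh=3).shape)   # (1, 3): only the fully populated row
print(toy.dropna(how='all').shape)  # (2, 3): drops only the all-NaN row
```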

**Some ways to replace values:**
- Point edits
    - Selection by position (iloc; iat)
    - Selection by label (loc; at)
- The "fillna" function: a quick way to replace null values
- More complex rules/functions
    - np.where
    - replace
    - map
    - apply
    - applymap
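A minimal sketch of the first two approaches, assuming a small made-up table (`toy` is hypothetical): two point edits with `at` and `iat`, followed by `fillna` using the column mean as the replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing age
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'name': ['Ana', 'Bia', 'Caio'],
})

toy.at[2, 'name'] = 'Caio S.'  # label-based edit of a single cell
toy.iat[0, 0] = 22.0           # position-based edit of a single cell

# Replace the missing age with the mean of the known ages (22 and 40)
toy['age'] = toy['age'].fillna(toy['age'].mean())

print(toy['age'].tolist())  # [22.0, 31.0, 40.0]
```

Filling with the mean keeps the column average unchanged but reduces its variance, so whether it is appropriate depends on the analysis.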

In [40]:
df2 = pd.DataFrame([[np.nan, 2, np.nan, 0],
                 [3, 4, np.nan, 1],
                 [np.nan, np.nan, np.nan, 5],
                 [np.nan, 3, np.nan, 4]],
                columns=['A', 'B', 'C', 'D'])

In [41]:
df2.iloc[0,1] = 'A'

In [43]:
df2.loc[1,'B'] = 'Q'

In [44]:
df2

Unnamed: 0,A,B,C,D
0,,A,,0
1,3.0,Q,,1
2,,,,5
3,,3.0,,4


In [65]:
## iloc vs iat
## iloc can select a slice of the DataFrame with more than one value, while iat can only be used to
## access a single value (but does so faster)
## note: with no `high` argument, randint draws from [0, low); 100,000,000 rows x 7 columns needs several GB of RAM

df3 = pd.DataFrame({'colunaA' : np.random.randint(low=10000, size=100000000),
                   'colunaB' : np.random.randint(low=10000, size=100000000),
                   'colunaC' : np.random.randint(low=10000, size=100000000),
                   'colunaD' : np.random.randint(low=10000, size=100000000),
                   'colunaE' : np.random.randint(low=10000, size=100000000),
                   'colunaF' : np.random.randint(low=10000, size=100000000),
                   'colunaG' : np.random.randint(low=10000, size=100000000)})

In [66]:
%%time
df3.iloc[4,0]

Wall time: 20.2 ms


302

In [67]:
%%time
df3.iat[4,0] ## 1 ms = 1,000,000 ns

Wall time: 0 ns


302
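A single `%%time` run of such a fast operation is likely dominated by timer resolution, which would explain a reading of 0 ns. As a hedged alternative, the sketch below repeats each lookup many times with the standard-library `timeit` module on a small made-up frame (`small` is an assumption; the size of the frame should not matter much for a single-cell lookup). Exact timings vary by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Small hypothetical frame: values 0..29 laid out in 10 rows and 3 columns
small = pd.DataFrame(np.arange(30).reshape(10, 3), columns=['a', 'b', 'c'])

# Both accessors return the same scalar
assert small.iloc[4, 0] == small.iat[4, 0]

n = 10_000  # number of repetitions per measurement
t_iloc = timeit.timeit(lambda: small.iloc[4, 0], number=n)
t_iat = timeit.timeit(lambda: small.iat[4, 0], number=n)

print(f"iloc: {t_iloc / n * 1e6:.2f} microseconds per call")
print(f"iat:  {t_iat / n * 1e6:.2f} microseconds per call")
```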

In [68]:
del df3

In [77]:
del df2

df2 = pd.DataFrame([[np.nan, 2, np.nan, 0],
                 [3, 4, np.nan, 1],
                 [np.nan, np.nan, np.nan, 5],
                 [np.nan, 3, np.nan, 4]],
                columns=['A', 'B', 'C', 'D'])

In [78]:
df2

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [79]:
df2.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [80]:
df2.fillna('H')

Unnamed: 0,A,B,C,D
0,H,2.0,H,0
1,3.0,4.0,H,1
2,H,H,H,5
3,H,3.0,H,4


In [81]:
df2

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [82]:
df2.fillna('G', inplace = True)

In [83]:
df2

Unnamed: 0,A,B,C,D
0,G,2.0,G,0
1,3.0,4.0,G,1
2,G,G,G,5
3,G,3.0,G,4


In [85]:
df2.dtypes

A    object
B    object
C    object
D     int64
dtype: object

In [86]:
del df2

df2 = pd.DataFrame([[np.nan, 2, np.nan, 0],
                 [3, 4, np.nan, 1],
                 [np.nan, np.nan, np.nan, 5],
                 [np.nan, 3, np.nan, 4]],
                columns=['A', 'B', 'C', 'D'])

In [11]:
dfa = pd.DataFrame({'A' : [np.nan, 2, np.nan, 0],
                 'B' : [3, 4, np.nan, 1],
                 'C' : [np.nan, np.nan, np.nan, 5],
                 'D' : [ 3, np.nan, 4, np.nan]})

In [15]:
dfa

Unnamed: 0,A,B,C,D
0,,3.0,,3.0
1,2.0,4.0,,
2,,,,4.0
3,0.0,1.0,5.0,


In [14]:
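## 'pad' is an alias of 'ffill' (propagate the last valid value forward)
## note: fillna(method=...) is deprecated since pandas 2.1; dfa.ffill() is the equivalent call
dfa.fillna(method='pad')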

Unnamed: 0,A,B,C,D
0,,3.0,,3.0
1,2.0,4.0,,3.0
2,2.0,4.0,,4.0
3,0.0,1.0,5.0,4.0


In [87]:
df2

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [88]:
df2.fillna(method="ffill")

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4


In [103]:
df2.fillna(method="pad")

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4


In [102]:
df2.ffill()

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4


In [104]:
df2.fillna(method="backfill")

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4


In [105]:
df2.fillna(method="bfill")

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4


In [106]:
df2.bfill()

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4


In [91]:
prova1 = pd.DataFrame({'Aluno' : ['Jorge', 'Gabriel','Paulo'],
                      'Idade' : [12, 12, 13],
                      'Nota' : [6.0, 7.8, 9.0]})

In [92]:
prova1

Unnamed: 0,Aluno,Idade,Nota
0,Jorge,12,6.0
1,Gabriel,12,7.8
2,Paulo,13,9.0


In [93]:
prova2 = pd.DataFrame({'Aluno' : ['Jorge', 'Gabriel','Paulo'],
                      'Idade' : [np.nan, np.nan, np.nan],
                      'Nota' : [2.0, 8, 7.2]})

In [94]:
prova2

Unnamed: 0,Aluno,Idade,Nota
0,Jorge,,2.0
1,Gabriel,,8.0
2,Paulo,,7.2


In [95]:
estudantes = pd.concat([prova1, prova2])

In [96]:
estudantes

Unnamed: 0,Aluno,Idade,Nota
0,Jorge,12.0,6.0
1,Gabriel,12.0,7.8
2,Paulo,13.0,9.0
0,Jorge,,2.0
1,Gabriel,,8.0
2,Paulo,,7.2


In [97]:
estudantes.sort_values(by = ['Aluno'], inplace = True)

In [98]:
estudantes

Unnamed: 0,Aluno,Idade,Nota
1,Gabriel,12.0,7.8
1,Gabriel,,8.0
0,Jorge,12.0,6.0
0,Jorge,,2.0
2,Paulo,13.0,9.0
2,Paulo,,7.2


In [100]:
estudantes.fillna(method="ffill", inplace=True)
estudantes

Unnamed: 0,Aluno,Idade,Nota
1,Gabriel,12.0,7.8
1,Gabriel,12.0,8.0
0,Jorge,12.0,6.0
0,Jorge,12.0,2.0
2,Paulo,13.0,9.0
2,Paulo,13.0,7.2


In [112]:
## replace
estudantes.replace("Jorge", "Jorgia")

Unnamed: 0,Aluno,Idade,Nota
1,Gabriel,12.0,7.8
1,Gabriel,12.0,8.0
0,Jorgia,12.0,6.0
0,Jorgia,12.0,2.0
2,Paulo,13.0,9.0
2,Paulo,13.0,7.2


In [113]:
nome = 'Jorge'

In [114]:
nome.replace('ge', 'gia')

'Jorgia'

In [115]:
estudantes.replace("ge", "gia")

Unnamed: 0,Aluno,Idade,Nota
1,Gabriel,12.0,7.8
1,Gabriel,12.0,8.0
0,Jorge,12.0,6.0
0,Jorge,12.0,2.0
2,Paulo,13.0,9.0
2,Paulo,13.0,7.2


In [116]:
estudantes.replace({'Gabriel' : 'Gabriela', 'Jorge' : 'Jorgia', 'Paulo' : 'Paula'})

Unnamed: 0,Aluno,Idade,Nota
1,Gabriela,12.0,7.8
1,Gabriela,12.0,8.0
0,Jorgia,12.0,6.0
0,Jorgia,12.0,2.0
2,Paula,13.0,9.0
2,Paula,13.0,7.2


In [119]:
## regex
estudantes.replace(r"\s*ge\s*", np.nan, regex=True)

Unnamed: 0,Aluno,Idade,Nota
1,Gabriel,12.0,7.8
1,Gabriel,12.0,8.0
0,,12.0,6.0
0,,12.0,2.0
2,Paulo,13.0,9.0
2,Paulo,13.0,7.2


## replace vs. np.where vs. map vs. apply vs. applymap

These use dictionaries and/or functions to transform data according to rules.

![map_apply_applymap](images/map_apply_applymap.png)

**Map:** Used to transform the values of a Series

**Apply:** Works on both DataFrames and Series

**Applymap:** Works only on DataFrames, when we want to transform every value regardless of column (since pandas 2.1 it is deprecated in favor of `DataFrame.map`)
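One practical difference between these tools is what happens to values that the dictionary does not cover. The sketch below uses a small made-up Series (`sex` and the `'unknown'` entry are assumptions) to compare `map`, `replace`, and `apply`.

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'unknown'])
mapping = {'male': 'masculino', 'female': 'feminino'}

mapped = sex.map(mapping)        # values missing from the dict become NaN
replaced = sex.replace(mapping)  # values missing from the dict are kept
applied = sex.apply(lambda v: mapping.get(v, v))  # explicit fallback to the original

print(mapped.isna().tolist())  # [False, False, True]
print(replaced.tolist())       # ['masculino', 'feminino', 'unknown']
print(applied.tolist())        # ['masculino', 'feminino', 'unknown']
```

`map` is a reasonable choice when unmapped values should surface as missing; `replace` is safer when they should pass through untouched.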

In [138]:
df.replace({'male' : 'masculino', 'female' : 'feminino'})

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",masculino,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",feminino,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",feminino,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",feminino,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",masculino,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",masculino,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",feminino,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",feminino,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",masculino,26.0,0,0,111369,30.0000,C148,C


In [140]:
df.Sex.replace({'male' : 'masculino', 'female' : 'feminino'})

0      masculino
1       feminino
2       feminino
3       feminino
4      masculino
         ...    
886    masculino
887     feminino
888     feminino
889    masculino
890    masculino
Name: Sex, Length: 891, dtype: object

In [141]:
df.Sex.unique()

array(['male', 'female'], dtype=object)

In [143]:
df.Sex.isnull().sum()

0

In [144]:
np.where(df.Sex == 'male', 'masculino', 'feminino')

array(['masculino', 'feminino', 'feminino', 'feminino', 'masculino',
       'masculino', 'masculino', 'masculino', 'feminino', 'feminino',
       'feminino', 'feminino', 'masculino', 'masculino', 'feminino',
       'feminino', 'masculino', 'masculino', 'feminino', 'feminino',
       'masculino', 'masculino', 'feminino', 'masculino', 'feminino',
       'feminino', 'masculino', 'masculino', 'feminino', 'masculino',
       'masculino', 'feminino', 'feminino', 'masculino', 'masculino',
       'masculino', 'masculino', 'masculino', 'feminino', 'feminino',
       'feminino', 'feminino', 'masculino', 'feminino', 'feminino',
       'masculino', 'masculino', 'feminino', 'masculino', 'feminino',
       'masculino', 'masculino', 'feminino', 'feminino', 'masculino',
       'masculino', 'feminino', 'masculino', 'feminino', 'masculino',
       'masculino', 'feminino', 'masculino', 'masculino', 'masculino',
       'masculino', 'feminino', 'masculino', 'feminino', 'masculino',
       'masculino', 'fem
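Checking for nulls before the `np.where` call matters: a two-way `np.where` sends everything that is not `'male'`, including missing values, to the second label. The sketch below shows this on a made-up Series (`sex` is hypothetical) and uses `np.select` as one way to keep a separate default; the label `'desconhecido'` is an assumption for the example.

```python
import numpy as np
import pandas as pd

sex = pd.Series(['male', 'female', np.nan])

# NaN == 'male' is False, so the missing row is labelled 'feminino'
two_way = np.where(sex == 'male', 'masculino', 'feminino')

# np.select takes several conditions plus a default for unmatched rows
three_way = np.select(
    [sex == 'male', sex == 'female'],
    ['masculino', 'feminino'],
    default='desconhecido',
)

print(two_way.tolist())    # ['masculino', 'feminino', 'feminino']
print(three_way.tolist())  # ['masculino', 'feminino', 'desconhecido']
```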

In [145]:
## map
df.Sex.map({'male' : 'masculino', 'female' : 'feminino'})

0      masculino
1       feminino
2       feminino
3       feminino
4      masculino
         ...    
886    masculino
887     feminino
888     feminino
889    masculino
890    masculino
Name: Sex, Length: 891, dtype: object

In [153]:
def troca_lingua(x):
    if x == 'male':
        return 'masculino'
    elif x == 'female': 
        return 'feminino'
    else:
        return x

In [154]:
df.Sex.apply(troca_lingua)

0      masculino
1       feminino
2       feminino
3       feminino
4      masculino
         ...    
886    masculino
887     feminino
888     feminino
889    masculino
890    masculino
Name: Sex, Length: 891, dtype: object

In [156]:
df[['Sex']].applymap(troca_lingua)

Unnamed: 0,Sex
0,masculino
1,feminino
2,feminino
3,feminino
4,masculino
...,...
886,masculino
887,feminino
888,feminino
889,masculino


In [157]:
df.applymap(troca_lingua)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",masculino,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",feminino,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",feminino,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",feminino,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",masculino,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",masculino,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",feminino,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",feminino,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",masculino,26.0,0,0,111369,30.0000,C148,C


### Dummy variables

Dummy variables are binary variables (0 or 1) created to represent categories.

**Example:**
A 'Female Gender' variable with binary values (0 or 1) indicates whether the observation is female (1) or not (0).

The get_dummies function: converts categorical variables into dummy variables

In [None]:
pandas.get_dummies(data, 
                   prefix=None, 
                   prefix_sep='_', 
                   dummy_na=False, 
                   columns=None, 
                   sparse=False, 
                   drop_first=False, 
                   dtype=None)
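Two of the parameters in the signature above are worth a short illustration on a made-up table (`toy` is hypothetical). `dtype=int` requests 0/1 columns explicitly, since recent pandas versions (2.0 and later, as far as I know) return booleans by default, and `drop_first=True` drops the first category, which is redundant when there are only two.

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# One 0/1 column per category
full = pd.get_dummies(toy, columns=['Sex'], dtype=int)

# Drop the first category (alphabetically 'female'); 'Sex_male' alone is enough
reduced = pd.get_dummies(toy, columns=['Sex'], drop_first=True, dtype=int)

print(list(full.columns))         # ['Sex_female', 'Sex_male']
print(full['Sex_male'].tolist())  # [1, 0, 0]
print(list(reduced.columns))      # ['Sex_male']
```

Dropping one category avoids perfectly collinear columns, which matters for models such as linear regression.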

In [16]:
df['Sex']

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object

In [18]:
pd.get_dummies(df, prefix='sex', columns=['Sex'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_female,sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,1,0
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,0,1


In [19]:
pd.get_dummies(df, columns=['Sex'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,1,0
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,0,1


In [20]:
pd.get_dummies(df, columns=['Sex', 'Survived'])

Unnamed: 0,PassengerId,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male,Survived_0,Survived_1
0,1,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,0,1,1,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,1,0,0,1
2,3,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,1,0,0,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,1,0,0,1
4,5,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,0,1,1,0
887,888,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,1,0,0,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,1,0,1,0
889,890,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,0,1,0,1
