# Census Dataset
## Importing Libraries

In [16]:
library(repr)
options(repr.plot.width = 4, repr.plot.height = 4)

## Loading the Data

In [37]:
census <- read.csv('../datasets/census.csv')
head(census)

X,age,workclass,final.weight,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loos,hour.per.week,native.country,income
1,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
2,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
3,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
4,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
5,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
6,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


## Dropping a Column

- The column X only holds row indices, so it can be removed

In [38]:
census$X <- NULL
head(census)

age,workclass,final.weight,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loos,hour.per.week,native.country,income
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


## Data Analysis

- age - numeric discrete
- workclass (type of employment) - categorical nominal
- final.weight - numeric continuous
- education - categorical ordinal
- education.num (years of schooling) - numeric discrete
- marital.status - categorical nominal
- occupation - categorical nominal
- relationship - categorical nominal
- race - categorical nominal
- sex - categorical nominal
- capital.gain - numeric continuous
- capital.loos (capital loss) - numeric continuous
- hour.per.week - numeric discrete
- native.country - categorical nominal
- income (annual income) - the class we want to predict

In [39]:
str(census)

'data.frame':	30162 obs. of  15 variables:
 $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
 $ workclass     : Factor w/ 7 levels " Federal-gov",..: 6 5 3 3 3 3 3 5 3 3 ...
 $ final.weight  : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
 $ education.num : int  13 13 9 7 13 14 5 9 14 13 ...
 $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation    : Factor w/ 14 levels " Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
 $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ capital.gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capital.loos  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hour.per.week : int  4

## Converting Nominal Classes to Discrete Values

- The **factor** function uses a mapping from levels to labels to recode the values

```R
factor(column, levels = c("categorical values"), labels = c(numeric values))
```
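As a minimal sketch of the mapping above, the snippet below recodes a small made-up vector (the values are hypothetical, chosen to resemble the `sex` column). It also shows that the result is still a factor whose labels are the strings "0" and "1", so an explicit conversion is needed to get real numbers.

```R
# Toy vector standing in for a categorical column (hypothetical data)
sex <- c(" Male", " Female", " Female", " Male")

# unique() returns values in order of first appearance,
# so " Male" is mapped to 0 and " Female" to 1
encoded <- factor(sex, levels = unique(sex), labels = 0:1)
print(encoded)

# The labels are stored as text; go through character to get integers
codes <- as.integer(as.character(encoded))
print(codes)  # 0 1 1 0
```

Because the codes depend on the order in which values first appear, the same category can receive a different number if the rows are shuffled.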

In [40]:
# unique() lists the values in order of first appearance, so the codes follow that order.
# The length of each labels vector must match the number of distinct values in the column.
census$sex = factor(census$sex, levels = unique(census$sex), labels = 0:1)
census$workclass = factor(census$workclass, levels = unique(census$workclass), labels = 1:7)
census$education = factor(census$education, levels = unique(census$education), labels = 1:16)
census$marital.status = factor(census$marital.status, levels = unique(census$marital.status), labels = 1:7)
census$occupation = factor(census$occupation, levels = unique(census$occupation), labels = 1:14)
census$relationship = factor(census$relationship, levels = unique(census$relationship), labels = 1:6)
census$race = factor(census$race, levels = unique(census$race), labels = 1:5)
census$native.country = factor(census$native.country, levels = unique(census$native.country), labels = 1:41)

# The class values carry a leading space in the raw file
census$income = factor(census$income, levels = c(' <=50K', ' >50K'), labels = c(0, 1))

head(census)

age,workclass,final.weight,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loos,hour.per.week,native.country,income
39,1,77516,1,13,1,1,1,1,0,2174,0,40,1,0
50,2,83311,1,13,2,2,2,1,0,0,0,13,1,0
38,3,215646,2,9,3,3,1,1,0,0,0,40,1,0
53,3,234721,3,7,2,3,2,2,0,0,0,40,1,0
28,3,338409,1,13,2,4,3,2,1,0,0,40,2,0
37,3,284582,4,14,2,2,3,1,1,0,0,40,1,0


## Feature Scaling

- **Normalization**
    
    $$x = \frac{x - \min(x)}{\max(x) - \min(x)}$$
    
- **Standardization**
    $$x = \frac{x - mean(x)}{std(x)}$$
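The two formulas above can be sketched directly in R. The helper names `normalize` and `standardize` and the sample values are made up for illustration; the last line checks that base R's `scale()`, with its default arguments, gives the same result as the standardization formula.

```R
# Hypothetical sample of ages
x <- c(18, 25, 40, 63)

# Normalization: rescales the values to the [0, 1] interval
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Standardization: zero mean and unit (sample) standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

x_norm <- normalize(x)
x_std  <- standardize(x)

# scale() centers on the mean and divides by sd() by default;
# it returns a one-column matrix, hence the as.numeric()
same <- isTRUE(all.equal(as.numeric(scale(x)), x_std))
print(same)  # TRUE
```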

In [42]:
census$age = scale(census$age)
census$final.weight = scale(census$final.weight)
census[, c('capital.gain','capital.loos','hour.per.week', 'education.num')] = scale(census[, c('capital.gain','capital.loos','hour.per.week', 'education.num')])

head(census)

age,workclass,final.weight,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loos,hour.per.week,native.country,income
0.042795,1,-1.062704,1,1.1288997,1,1,1,1,0,0.1460899,-0.2185824,-0.07773282,1,0
0.88027355,2,-1.0078546,1,1.1288997,2,2,2,1,0,-0.1474422,-0.2185824,-2.33149205,1,0
-0.03333941,3,0.2446894,2,-0.4397309,3,3,1,1,0,-0.1474422,-0.2185824,-0.07773282,1,0
1.10867679,3,0.4252333,3,-1.2240462,2,3,2,2,0,-0.1474422,-0.2185824,-0.07773282,1,0
-0.79468355,3,1.406635,1,1.1288997,2,4,3,2,1,-0.1474422,-0.2185824,-0.07773282,2,0
-0.10947383,3,0.8971652,4,1.5210573,2,2,3,1,1,-0.1474422,-0.2185824,-0.07773282,1,0


# Credit Dataset

In [44]:
credit = read.csv('datasets/credit_data.csv')
head(credit)

clientid,income,age,loan,default
1,66155.93,59.01702,8106.5321,0
2,34415.15,48.11715,6564.745,0
3,57317.17,63.10805,8020.9533,0
4,42709.53,45.75197,6103.6423,0
5,66952.69,18.58434,8770.0992,1
6,24904.06,57.47161,15.4986,0


## Data Analysis

- clientid - nominal (excluded from prediction)
- income - numeric continuous
- age - numeric continuous (in this case)
- loan (debt) - numeric continuous
- default - 0 (repaid the loan) 1 (did not repay) - numeric discrete

In [45]:
str(credit)

'data.frame':	2000 obs. of  5 variables:
 $ clientid: int  1 2 3 4 5 6 7 8 9 10 ...
 $ income  : num  66156 34415 57317 42710 66953 ...
 $ age     : num  59 48.1 63.1 45.8 18.6 ...
 $ loan    : num  8107 6565 8021 6104 8770 ...
 $ default : int  0 0 0 0 1 0 0 1 0 0 ...


## Removing a Column

- The clientid column is excluded from prediction and can be removed

In [46]:
credit$clientid = NULL
head(credit)

income,age,loan,default
66155.93,59.01702,8106.5321,0
34415.15,48.11715,6564.745,0
57317.17,63.10805,8020.9533,0
42709.53,45.75197,6103.6423,0
66952.69,18.58434,8770.0992,1
24904.06,57.47161,15.4986,0


## Missing and Invalid Values

- Alternatives
    1. Drop the entire column (recommended only in a few cases)
```R
base$col = NULL
```

    2. Drop only the invalid records
```R
base = base[condition, ]
```

    3. Fill in the invalid values manually with the mean, the mode, a case-by-case analysis, etc. (best option)
```R
base$col = ifelse(condition, value_if_true, value_if_false)
```
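The three alternatives can be tried side by side on a small made-up data frame. The data and the names `no_age`, `valid`, and `filled` are hypothetical; this is only a sketch of the pattern, not the notebook's own cell.

```R
# Toy data: one negative (invalid) age and one missing age
base <- data.frame(age = c(25, -3, NA, 40),
                   income = c(100, 200, 300, 400))

# 1. Drop the entire column
no_age <- base
no_age$age <- NULL

# 2. Keep only the valid records; which() discards the NA comparison,
#    which would otherwise produce a row full of NA
valid <- base[which(base$age > 0), ]

# 3. Replace invalid and missing ages with the mean of the valid ones
valid_mean <- mean(base$age[base$age > 0], na.rm = TRUE)
filled <- base
filled$age <- ifelse(is.na(filled$age) | filled$age < 0, valid_mean, filled$age)
print(filled)
```

Option 2 shrinks the dataset, while option 3 keeps every row at the cost of introducing estimated values.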

### Replacing Invalid Values

- Replace negative ages with the mean age
- The **mean** function computes the mean
    - The **na.rm** parameter ignores NA values
    - **BE CAREFUL not to include invalid values, such as negative ages, when computing the mean**
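A small made-up vector illustrates why both precautions matter: without `na.rm` the result is NA, and with the negative age included the mean is pulled far below any plausible value.

```R
# Hypothetical ages: one invalid (negative) and one missing
ages <- c(30, 40, -50, NA)

m_raw   <- mean(ages)                          # NA, because of the missing value
m_na_rm <- mean(ages, na.rm = TRUE)            # 6.666667, distorted by the -50
m_valid <- mean(ages[ages > 0], na.rm = TRUE)  # 35, only valid ages

print(c(m_raw, m_na_rm, m_valid))
```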

In [50]:
media = mean(credit$age[credit$age > 0], na.rm = TRUE)
credit$age = ifelse(credit$age < 0, media, credit$age)
head(credit)

summary(credit$age)

income,age,loan,default
66155.93,59.01702,8106.5321,0
34415.15,48.11715,6564.745,0
57317.17,63.10805,8020.9533,0
42709.53,45.75197,6103.6423,0
66952.69,18.58434,8770.0992,1
24904.06,57.47161,15.4986,0


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.06   29.07   41.32   40.93   52.59   63.97       3 

### Replacing Missing Values

In [52]:
credit[is.na(credit['age']), ]

Unnamed: 0,income,age,loan,default
29,59417.81,,2082.626,0
31,48528.85,,6155.785,0
32,23526.3,,2862.01,0


In [54]:
media = mean(credit$age, na.rm = TRUE)
credit$age = ifelse(is.na(credit$age), media, credit$age)

credit[is.na(credit['age']), ]

income,age,loan,default


## Feature Scaling

- **Normalization**
    
    $$x = \frac{x - \min(x)}{\max(x) - \min(x)}$$
    
- **Standardization**
    $$x = \frac{x - mean(x)}{std(x)}$$
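When standardized data is later compared with new records, the new records need the same mean and standard deviation. As a sketch under that assumption, `scale()` stores both statistics as attributes of its result; the vectors `train` and `new` below are made up for illustration.

```R
# Hypothetical values of one numeric column, plus two new records
train <- c(20, 30, 40, 50)
new   <- c(35, 60)

scaled <- scale(train)

# scale() keeps the statistics it used as attributes
center <- as.numeric(attr(scaled, "scaled:center"))  # mean of train
spread <- as.numeric(attr(scaled, "scaled:scale"))   # sd of train

# Standardize the new records with the training statistics
new_scaled <- (new - center) / spread
print(new_scaled)
```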

In [56]:
credit[, 1:3] = scale(credit[, 1:3])
head(credit)

income,age,loan,default
1.4535704,1.3650387,1.2025187,0
-0.761985,0.5425236,0.6962528,0
0.8366115,1.6737524,1.1744178,0
-0.1830243,0.3640446,0.5448437,0
1.5091858,-1.6860537,1.4204096,1
-1.4258739,1.2484206,-1.4542774,0


# Credit Risk Dataset

In [57]:
risco = read.csv('datasets/risco_credito.csv')
head(risco)

historia,divida,garantias,renda,risco
ruim,alta,nenhuma,0_15,alto
desconhecida,alta,nenhuma,15_35,alto
desconhecida,baixa,nenhuma,15_35,moderado
desconhecida,baixa,nenhuma,acima_35,alto
desconhecida,baixa,nenhuma,acima_35,baixo
desconhecida,baixa,adequada,acima_35,baixo


## Data Analysis

- historia (credit history) - categorical nominal
- divida (debt) - categorical ordinal
- garantias (collateral) - categorical nominal
- renda (income) - categorical ordinal
- risco (risk) - categorical ordinal

In [58]:
str(risco)

'data.frame':	14 obs. of  5 variables:
 $ historia : Factor w/ 3 levels "boa","desconhecida",..: 3 2 2 2 2 2 3 3 1 1 ...
 $ divida   : Factor w/ 2 levels "alta","baixa": 1 1 2 2 2 2 2 2 2 1 ...
 $ garantias: Factor w/ 2 levels "adequada","nenhuma": 2 2 2 2 2 1 2 1 2 1 ...
 $ renda    : Factor w/ 3 levels "0_15","15_35",..: 1 2 2 3 3 3 1 3 3 3 ...
 $ risco    : Factor w/ 3 levels "alto","baixo",..: 1 1 3 1 2 2 1 3 2 2 ...


## Converting Categorical Values to Integers

- The **factor** function uses a mapping from levels to labels to recode the values

```R
factor(column, levels = c("categorical values"), labels = c(numeric values))
``` 
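For ordinal columns such as `renda`, the order of the codes may matter. The sketch below uses made-up values and compares codes assigned by order of appearance with codes assigned from explicitly listed levels. Passing `ordered = TRUE` is an optional choice here, not something the original cells do.

```R
# Hypothetical income brackets, not in rank order
renda <- c("15_35", "0_15", "acima_35", "15_35")

# Codes follow first appearance: "15_35" = 1, "0_15" = 2, "acima_35" = 3
by_appearance <- factor(renda, levels = unique(renda), labels = 1:3)

# Codes follow the listed rank: "0_15" = 1, "15_35" = 2, "acima_35" = 3
by_rank <- factor(renda, levels = c("0_15", "15_35", "acima_35"),
                  labels = 1:3, ordered = TRUE)

print(as.integer(by_appearance))  # 1 2 3 1
print(as.integer(by_rank))        # 2 1 3 2
```

With explicit levels the numeric codes preserve the ranking of the brackets, which some distance-based or tree-based methods can exploit.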

In [61]:
risco$historia = factor(risco$historia, levels = unique(risco$historia), labels = 1:3)
risco$divida = factor(risco$divida, levels = unique(risco$divida), labels = 1:2)
risco$garantias = factor(risco$garantias, levels = unique(risco$garantias), labels = 1:2)
risco$renda = factor(risco$renda, levels = unique(risco$renda), labels = 1:3)

head(risco)

historia,divida,garantias,renda,risco
1,1,1,1,alto
2,1,1,2,alto
2,2,1,2,moderado
2,2,1,3,alto
2,2,1,3,baixo
2,2,2,3,baixo
