# Kickoff - CHALLENGE - RAIS

**E**xploratory **D**ata **A**nalysis on RAIS Database - Florianópolis, SC - Brasil

**Authors:**
- Luis Felipe Pelison
- Fernando Battisti
- Ígor Yamamoto

## Objective

How do socioeconomic characteristics affect how much you earn?

# Imports
Here is where you declare the external dependencies required to run the notebook.

In [2]:
import pandas as pd
import numpy as np
# here you can import your libraries

# Use the fully qualified option name; the short form 'max_rows' can match several options
pd.set_option('display.max_rows', 200)

# Open Data

Here is where your data is loaded from different file formats (e.g., .csv, .json, .parquet, .xlsx) into pandas data frames.

In [3]:
df = pd.read_parquet('data/rais_floripa_2018.parquet')
print(df.shape)
df.head(2)

(432486, 16)


Unnamed: 0,CNAE 2.0 Subclasse,Escolaridade após 2005,Idade,Mês Admissão,Mês Desligamento,Motivo Desligamento,Município,Qtd Hora Contr,Raça Cor,Sexo Trabalhador,Tamanho Estabelecimento,Tempo Emprego,Tipo Defic,Vl Remun Média Nom,UF,CBO 2002
4470590,8112500,1,23,0,3,12,420540,44,9,1,4,29,0,"1.484,79",42,514320
4470599,7112000,1,33,3,4,11,420540,44,9,1,3,19,0,"1.290,00",42,717020


# Pre Processing
The real world is a mess. We need to do some manipulations in order to clean the data.

## Converting to the right types

To make the data easier (or even possible) to work with, we need to assign the most appropriate type to each column.

In [5]:
CAT_FEATURES = ['CNAE 2.0 Subclasse', 'Escolaridade após 2005', 'Mês Admissão', 'Mês Desligamento', 'Motivo Desligamento', 'Município', 'Raça Cor', 'Sexo Trabalhador', 'Tamanho Estabelecimento', 'Tipo Defic', 'UF', 'CBO 2002']

for cat_feat in CAT_FEATURES:
    df[cat_feat] = df[cat_feat].astype('str')

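# Decimal comma -> decimal point
df['Tempo Emprego'] = df['Tempo Emprego'].str.replace(',', '.', regex=False).astype('float')

# Drop the thousands separator first, then convert the decimal comma.
# regex=False treats '.' as a literal dot rather than the regex wildcard.
df['Vl Remun Média Nom'] = df['Vl Remun Média Nom'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype('float')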
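
The conversion above can be checked on a few values before running it on the full column. The sketch below uses made-up strings in the Brazilian number format (`.` for thousands, `,` for decimals); the names `raw` and `parsed` are illustrative only. Passing `regex=False` explicitly is one way to make sure `.` is treated as a literal dot regardless of the pandas version.

```python
import pandas as pd

# Made-up values in Brazilian number format ('.' = thousands, ',' = decimals).
raw = pd.Series(['1.484,79', '1.290,00', '980,50'])

parsed = (
    raw.str.replace('.', '', regex=False)   # remove the thousands separator
       .str.replace(',', '.', regex=False)  # decimal comma -> decimal point
       .astype('float')
)

print(parsed.tolist())
```

If the thousands separator were removed after the decimal comma had been converted, the decimal point would be stripped as well, so the order of the two replacements matters.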

## Mapping categories

Sometimes the raw category codes are hard to understand, so we map them to more readable labels.

In [6]:
df['Tamanho Estabelecimento'].value_counts()

10    174502
5      40351
4      36555
7      34376
3      32859
2      31635
6      25907
9      25797
8      22098
1       8406
Name: Tamanho Estabelecimento, dtype: int64

In [7]:
df['Tamanho Estabelecimento'] = (
    df['Tamanho Estabelecimento']
    .map(
        {
            '1': 'ZERO',
            '2': 'ATE_4',
            '3': 'DE_5_A_9',
            '4': 'DE_10_A_19',
            '5': 'DE_20_A_49',
            '6': 'DE_50_A_99',
            '7': 'DE_100_A_249',
            '8': 'DE_250_A_499',
            '9': 'DE_500_A_999',
            '10': '1000_OU_MAIS',
            '-1': 'IGNORADO',
        }
    )
)

In [8]:
df['Tamanho Estabelecimento'].value_counts()

1000_OU_MAIS    174502
DE_20_A_49       40351
DE_10_A_19       36555
DE_100_A_249     34376
DE_5_A_9         32859
ATE_4            31635
DE_50_A_99       25907
DE_500_A_999     25797
DE_250_A_499     22098
ZERO              8406
Name: Tamanho Estabelecimento, dtype: int64
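
One thing to keep in mind with `Series.map` and a dictionary is that any code missing from the dictionary becomes `NaN`. In the output above the counts match before and after, so nothing was lost here. For other columns, a quick check along the following lines may help. It is a minimal sketch on made-up codes, and `size_labels` is a shortened, hypothetical version of the mapping.

```python
import pandas as pd

# Made-up codes; '11' is deliberately absent from the mapping below.
codes = pd.Series(['1', '10', '11', '2'])

# Shortened, hypothetical mapping used only for this illustration.
size_labels = {'1': 'ZERO', '2': 'ATE_4', '10': '1000_OU_MAIS'}

mapped = codes.map(size_labels)

# Codes that are not keys of the dictionary come back as NaN.
unmapped = codes[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every code was covered by the mapping.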

## Removing wrong categories

Sometimes there are wrong or meaningless categories. In those cases, we need to handle them explicitly.

In [9]:
# Relabel the placeholder occupation code as 'Unknown'
df['CBO 2002'] = df['CBO 2002'].replace('0000-1', 'Unknown')
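
Before relabeling a placeholder, it can be useful to count how many rows it affects. The sketch below does this on a made-up series; `occupations` and its values are illustrative and not taken from the real data.

```python
import pandas as pd

# Made-up occupation codes, including the '0000-1' placeholder.
occupations = pd.Series(['514320', '0000-1', '717020', '0000-1'])

# Count the placeholder rows before touching them.
n_placeholder = int((occupations == '0000-1').sum())

# Exact-match replacement of the placeholder value.
cleaned = occupations.replace('0000-1', 'Unknown')

print(n_placeholder)
print(cleaned.tolist())
```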

# Analysis

Now we can do our exploratory analysis. Be creative!

In [10]:
# All records are from Florianópolis

df['Município'].value_counts()

420540    432486
Name: Município, dtype: int64

## Challenge 0. List the most popular occupations (CBO)

In [12]:
# Code here
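
One possible starting point for this challenge is `value_counts`, which counts rows per category and sorts them in descending order. The sketch below runs on a toy data frame with made-up codes; on the real data, `toy` would be replaced by `df`, and the cutoff passed to `head` is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the real data frame; the codes are made up.
toy = pd.DataFrame(
    {'CBO 2002': ['514320', '717020', '514320', '411005', '514320', '717020']}
)

# Count rows per occupation code and keep the most frequent ones.
top_occupations = toy['CBO 2002'].value_counts().head(2)
print(top_occupations)
```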

# Main Challenge

Here we will develop the answer to the main challenge described at the beginning.

In [13]:
# Code here
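
A simple first look at the main question is to compare a robust summary of pay, such as the median, across the groups of one characteristic. The sketch below does this on a toy data frame; all values are made up, and the choice of education as the grouping column is only an assumption for illustration. The median is used because wage distributions tend to be skewed.

```python
import pandas as pd

# Toy stand-in for the real data frame; codes and wages are made up.
toy = pd.DataFrame({
    'Escolaridade após 2005': ['1', '1', '9', '9', '9'],
    'Vl Remun Média Nom': [1200.0, 1100.0, 5200.0, 4300.0, 4700.0],
})

# Median nominal pay per education code, highest first.
median_by_education = (
    toy.groupby('Escolaridade após 2005')['Vl Remun Média Nom']
       .median()
       .sort_values(ascending=False)
)
print(median_by_education)
```

The same pattern can be repeated for other columns, such as `Sexo Trabalhador` or `Raça Cor`, or with a list of columns passed to `groupby` to look at combinations.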

# Future Work

If you run out of time and still have other insights, you can list them here for future work.

- Insight 1
- Insight 2