# COLLECTING AND TREATING RAIS DATA


To support the analysis of **scientific, technological and innovative ecosystems in Brazilian metropolises**, this notebook documents how Brazilian formal employment data from the **Annual List of Social Information (RAIS, from the Portuguese *Relação Anual de Informações Sociais*)** is collected and treated to build a database that carries the attributes relevant to the project in a format that is fast to query.

RAIS data is available via its [microdata server](ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/) as a set of 7zip-compressed files organized by year and federation unit (states and the Federal District), with about 30 columns (the 2006 file for PR loaded later in this notebook has 43) and one row per formal employment contract. With that structure, retrieving information often demands a lot of processing power and RAM. In its raw form, the data also does not guarantee that the right kind of information can be obtained unless it is grouped according to well-established concepts and criteria.

To work around these issues, the sections ahead apply Python packages, modules and functions to collect and treat RAIS data. The data structure that results from this workflow allows fast retrieval, encoding suited to exploratory and comparative analysis, better performance in potential modeling operations, and public availability through web applications.

## PREPARING THE NOTEBOOK

The virtual environment used in this data analysis project has a number of packages installed (listed in the output of the code cell below). A Python package called app_rais was developed for this project; it provides modules and functions for operating on RAIS data: download, transformation, cleaning, optimization and analysis.

This notebook therefore imports the Python packages and modules needed to collect and treat RAIS data. The imports appear in the second code cell below.

In [1]:
! conda list

# packages in environment at C:\Users\augus\Anaconda3\envs\innovation_panel:
#
# Name                    Version                   Build  Channel
aiohttp                   3.6.2                    pypi_0    pypi
astroid                   2.4.2                    pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     19.3.0                     py_0  
backcall                  0.2.0                      py_0  
beautifulsoup4            4.9.1                    pypi_0    pypi
blas                      1.0                         mkl  
bleach                    3.1.5                      py_0  
brotlipy                  0.7.0           py36he774522_1000  
bzip2                     1.0.8                he774522_0  
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py36h9f0ad1d_0    conda-forge
cffi                      1.14.0           py36h7a1dbc1_0  
cfitsio               

In [1]:
import os
import glob

import pandas as pd

# Modules and functions created for this project
import app_rais.treat as treat
import app_rais.dicts_constants as dc
from app_rais.collect import download_rais
from app_rais.collect import extract_rais_original

# Display settings
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_columns = 50

## COLLECTING RAIS DATA

In the next cells, the **download_rais** function from the **collect** module was called for 14 designated Brazilian federation units (FUs), namely AM, BA, CE, DF, ES, GO, MG, PA, PE, PR, RJ, RS, SC and SP. As a result, the compressed files from the **[RAIS Microdata Server](http://pdet.mte.gov.br/microdados-rais-e-caged)** were downloaded into the rais_original subdirectories. Subsequently, the **extract_rais_original** function was applied to these files, extracting the .txt files they contain and deleting the compressed originals. For the 14 FUs, from 1993 to 2018, the original database comprises 399 files, which occupy 374.9GB of disk space.
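
The download log that follows suggests that each archive is stored as `rais_original/<year>/<UF><year>.7z`. As a minimal sketch under that assumption, the expected archive paths for a set of FUs and years can be listed as shown here. The helper `expected_archives` and its default `root` are hypothetical and are not part of app_rais.

```python
from pathlib import PurePosixPath


def expected_archives(ufs, years, root="app_rais/data/rais_original"):
    """Return the archive paths expected for each (year, UF) pair.

    Hypothetical helper: the layout <root>/<year>/<UF><year>.7z is inferred
    from the download log, not taken from the app_rais source.
    """
    return [
        str(PurePosixPath(root) / str(year) / f"{uf}{year}.7z")
        for year in years
        for uf in ufs
    ]


paths = expected_archives(["PR"], range(2006, 2017))
print(len(paths))  # 11 archives: 2006 through 2016 (the range end is exclusive)
print(paths[0])    # app_rais/data/rais_original/2006/PR2006.7z
```

Listing the expected paths before a long download makes it easy to check afterwards which FU and year combinations are still missing.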

In [5]:
%%time
download_rais(
    list_ufs=['PR']
    , list_year=list(range(2006,2017))
)

-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2006/PR2006.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2007/PR2007.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2008/PR2008.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2009/PR2009.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2010/PR2010.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2011/PR2011.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2012/PR2012.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2013/PR2013.7z downloaded.
-1 / unknown - D:\innovation_panel\innovation_panel\app_rais\data/rais_original/2014/PR2014.7z downloaded.
-1 / unknown - D:\innovation_panel\in

In [6]:
%%time
extract_rais_original(
    path_pattern='D:/innovation_panel/innovation_panel/app_rais/data/rais_original/**/**.7z'
    , remove_7z=True
)

"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2006\PR2006.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2007\PR2007.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2008\PR2008.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2009\PR2009.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2010\PR2010.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2011\PR2011.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2012\PR2012.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2013\PR2013.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2014\PR2014.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_original\2015\PR2015.txt" extracted.
"D:/innovation_panel/innovation_panel/app_rais/data/rais_ori

In [17]:
df_txt = pd.read_table('app_rais/data/rais_original/2006/PR2006.txt', sep=';', encoding='Latin-1', decimal=',')
df_txt.head()


Unnamed: 0,Bairros SP,Bairros Fortaleza,Bairros RJ,Causa Afastamento 1,Causa Afastamento 2,Causa Afastamento 3,Motivo Desligamento,CBO Ocupação 2002,CNAE 2.0 Classe,CNAE 95 Classe,Distritos SP,Vínculo Ativo 31/12,Faixa Etária,Faixa Hora Contrat,Faixa Remun Dezem (SM),Faixa Remun Média (SM),Faixa Tempo Emprego,Escolaridade após 2005,Qtd Hora Contr,Idade,Ind CEI Vinculado,Ind Simples,Mês Admissão,Mês Desligamento,Mun Trab,Município,Nacionalidade,Natureza Jurídica,Qtd Dias Afastamento,Raça Cor,Regiões Adm DF,Vl Remun Dezembro Nom,Vl Remun Dezembro (SM),Vl Remun Média Nom,Vl Remun Média (SM),CNAE 2.0 Subclasse,Sexo Trabalhador,Tamanho Estabelecimento,Tempo Emprego,Tipo Admissão,Tipo Estab,Tipo Estab.1,Tipo Vínculo
0,0,0,0,{ñ,{ñ,{ñ,11,724315,74901,74993,0,0,3,6,0,3,6,1,44,22,0,1,5,9,0000-1,410690,10,2062,0,2,{ñ c,0.0,0.0,562.54,1.68,7490101,1,2,40.9,0,1,CNPJ,10
1,0,0,0,{ñ,{ñ,{ñ,11,761205,13111,17213,0,0,5,6,0,3,3,1,44,30,0,0,7,2,0000-1,412770,10,2062,0,2,{ñ c,0.0,0.0,545.87,1.81,1311100,1,8,7.0,0,1,CNPJ,10
2,0,0,0,{ñ,{ñ,{ñ,11,514210,81214,74705,0,0,7,6,0,2,7,1,44,58,0,0,6,5,411990,411990,10,2062,0,2,{ñ c,0.0,0.0,425.13,1.33,8121400,1,8,95.0,0,1,CNPJ,10
3,0,0,0,{ñ,{ñ,{ñ,11,784205,16218,20214,0,0,7,6,0,2,3,1,44,52,0,0,10,6,411990,411990,10,2062,0,9,{ñ c,0.0,0.0,458.25,1.42,1621800,1,2,8.1,0,1,CNPJ,10
4,0,0,0,40,{ñ,{ñ,11,514210,33198,28991,0,0,7,6,0,2,6,1,44,58,0,1,10,2,410860,410860,10,2062,10,2,{ñ c,0.0,0.0,366.38,1.21,3319800,1,3,52.3,0,1,CNPJ,10
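
A file of this size does not have to be loaded in one piece. The sketch below reads a small in-memory sample in chunks with `pandas.read_csv` and keeps only the rows of interest, which bounds memory use. The sample values are made up for illustration, the column names follow the table above, and the assumption that `1` in `Vínculo Ativo 31/12` flags an active contract is ours. A real RAIS file would also need `encoding='Latin-1'`, as in the cell above.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a RAIS .txt file (illustrative values only).
sample = (
    "Vínculo Ativo 31/12;Município;Vl Remun Média Nom\n"
    "1;410690;562,54\n"
    "0;412770;545,87\n"
    "1;411990;425,13\n"
)

chunks = pd.read_csv(
    io.StringIO(sample),
    sep=";",
    decimal=",",
    dtype={"Município": str},  # read codes as text instead of letting pandas guess
    chunksize=2,               # rows per chunk; far larger for real files
)

# Keep only contracts flagged as active (assumed to be the value 1)
# while streaming through the file chunk by chunk.
active = pd.concat(
    [chunk[chunk["Vínculo Ativo 31/12"] == 1] for chunk in chunks]
)
print(len(active))  # 2
```

Declaring a `dtype` for code-like columns avoids type inference on them, and the chunk size trades speed against peak memory.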


In [14]:
df_txt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3307909 entries, 0 to 3307908
Data columns (total 43 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   Bairros SP               int64  
 1   Bairros Fortaleza        int64  
 2   Bairros RJ               int64  
 3   Causa Afastamento 1      object 
 4   Causa Afastamento 2      object 
 5   Causa Afastamento 3      object 
 6   Motivo Desligamento      int64  
 7   CBO Ocupação 2002        object 
 8   CNAE 2.0 Classe          int64  
 9   CNAE 95 Classe           int64  
 10  Distritos SP             int64  
 11  Vínculo Ativo 31/12      int64  
 12  Faixa Etária             object 
 13  Faixa Hora Contrat       int64  
 14  Faixa Remun Dezem (SM)   object 
 15  Faixa Remun Média (SM)   object 
 16  Faixa Tempo Emprego      object 
 17  Escolaridade após 2005   int64  
 18  Qtd Hora Contr           int64  
 19  Idade                    int64  
 20  Ind CEI Vinculado        int64  
 21  Ind Simp

In [18]:
list_files = glob.glob('app_rais/data/rais_original/**/**.txt')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Extracted original files occupy {size}GB on disk.')

Extracted original files occupy 16.84GB on disk.


## TREATING DATA - FIRST ROUND

Next, the **original_to_treated_loop** function from the **treat** module is called for the designated Brazilian federation units (FU, or UF in Portuguese). This function:
1. loads the original RAIS data from the .txt files inside the rais_original subdirectories;
2. removes the employment records that were not active on December 31 of each year;
3. casts the columns to more efficient dtypes for better performance;
4. drops the columns that are not relevant for the analysis of the S&T structure;
5. adds data on S&T personnel, on services classified by knowledge intensity and on industry technological level;
6. for each FU and year, saves a compressed pickle file in the rais_treated subdirectories.

For the 14 selected FUs, from 2007 to 2017, the saved files add up to about 7GB on disk.
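
Steps 2 to 4 of the list above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the app_rais implementation: `first_round` is a hypothetical function, the rows are made up, the value `1` is assumed to flag an active contract, and step 5 is left out because it depends on the project's classification tables.

```python
import pandas as pd

# Illustrative rows only; column names follow the RAIS layout shown earlier.
raw = pd.DataFrame({
    "Vínculo Ativo 31/12": [1, 0, 1, 1],
    "Bairros SP": [0, 0, 0, 0],
    "Município": ["410690", "412770", "411990", "411990"],
    "Sexo Trabalhador": ["01", "02", "01", "02"],
})


def first_round(df, drop_cols=("Bairros SP",)):
    """Hypothetical stand-in for steps 2 to 4; not the app_rais code."""
    out = df[df["Vínculo Ativo 31/12"] == 1]                     # step 2: active on 12/31
    out = out.drop(columns=["Vínculo Ativo 31/12", *drop_cols])  # step 4: drop columns
    return out.astype("category").reset_index(drop=True)         # step 3: compact dtypes


treated = first_round(raw)
print(treated.shape)  # (3, 2)

# Step 6 could then be written as:
# treated.to_pickle("PR2007.zip")  # pandas infers zip compression from the suffix
```

Casting repeated codes to the `category` dtype stores each distinct value once, which is where most of the memory saving in this kind of data tends to come from.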

In [None]:
%%time
treat.pretreat_2018(list_ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG','PA','PE','PR','RJ','RS','SC','SP'])

In [4]:
df = treat.pretreat_rais(uf='PR', year=2007, treating=True)

NameError: name 'get_dict_mun' is not defined

In [None]:
%%time
treat.original_to_treated_loop(
    ufs=['PR']
    , years=list(range(2007,2018))
)

PR2007.zip saved.
PR2008.zip saved.


In [11]:
pd_pickle_ce2008 = pd.read_pickle('app_rais/data/rais_treated/2008/CE2008.zip')

In [12]:
pd_pickle_ce2008

Unnamed: 0,CBO Ocupação 2002,CNAE 2.0 Classe,CNAE 95 Classe,Faixa Etária,Faixa Remun Dezem (SM),Faixa Remun Média (SM),Escolaridade após 2005,Idade,Ind Simples,Município,Nacionalidade,Natureza Jurídica,Raça Cor,Vl Remun Dezembro Nom,Vl Remun Dezembro (SM),Vl Remun Média Nom,Vl Remun Média (SM),Sexo Trabalhador,Tamanho Estabelecimento,Tipo Estab,Tipo Estab.1,territorio_tese,arranjo,knowledge_services,technology_industries,potec
0,784205,11216,15946,07,03,02,1,61,0,230440,10,2062,08,550.83,1.32,500.64,1.21,01,8,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Without Classification,Low-technology,Demais Ocupações
1,517420,45111,50105,08,,,1,69,0,230440,10,2062,09,0.00,0.00,0.00,0.00,01,6,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Less knowledge-intensive market services,Without Classification,Demais Ocupações
2,622405,01229,01228,07,02,02,1,53,0,231020,10,2062,08,394.92,0.95,416.97,1.01,01,6,01,CNPJ,,Demais Municípios do Ceará,Without Classification,Without Classification,Demais Ocupações
3,622405,01229,01228,05,03,02,1,31,0,231020,10,2062,04,574.84,1.38,418.60,1.02,01,6,01,CNPJ,,Demais Municípios do Ceará,Without Classification,Without Classification,Demais Ocupações
4,622405,01229,01228,06,02,01,1,42,0,231020,10,2062,08,348.06,0.83,393.39,0.95,01,6,01,CNPJ,,Demais Municípios do Ceará,Without Classification,Without Classification,Demais Ocupações
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1129994,514320,82997,74993,05,03,02,11,31,0,230370,10,2062,08,437.71,1.05,437.71,1.05,02,10,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Less knowledge-intensive market services,Without Classification,Demais Ocupações
1129995,223115,87301,85316,06,09,08,11,49,0,230440,10,3999,08,3200.00,7.71,3287.27,7.91,02,6,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification,Demais Ocupações
1129996,223110,87301,85316,05,10,09,11,38,0,230440,10,3999,08,4500.00,10.84,4500.00,10.84,02,6,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification,Demais Ocupações
1129997,234520,85139,80152,05,11,10,11,37,0,230440,10,2062,02,7221.26,17.40,7118.89,17.36,02,7,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification,Professores de ensino superior


In [7]:
list_files = glob.glob('app_rais/data/rais_treated/**/**.zip')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB on disk.')

Resulting treated files occupy 7.03GB on disk.


## TREATING DATA - SECOND ROUND

The pickle files generated by the first round of data treatment are used next as input for generate_rais_dataframe, which applies a second round of treatment: it groups the data for better memory use, merges information from different UFs and creates columns for personnel data, namely:

* scientific and technical (S&T) personnel (POTEC);
* organization legal status;
* economic sectors;
* categories of services according to intensity of applied knowledge;
* industry technological levels; 
* level of education.

This function is called by another function, load_save_rais_dataframe_loop, which applies it to different years and saves each result as a feather file, a format optimized for fast reading. When applied to the selected federation units from 2007 to 2017, the resulting database comprised 11 files totaling 0.14GB of disk space.
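
The grouping idea behind this second round can be sketched as follows: contract-level rows are collapsed into one row per combination of attributes, with a count column. The rows are made up, the column names follow the tables in this notebook, and reading `Pessoal` as the number of contracts in each group is an assumption. The actual generate_rais_dataframe may group on different columns.

```python
import pandas as pd

# Illustrative contract-level rows; values are made up.
contracts = pd.DataFrame({
    "UF": ["PR", "PR", "PR", "PR"],
    "Município": ["410010", "410010", "410010", "412880"],
    "knowledge_services": ["Other knowledge-intensive services"] * 4,
    "potec": [
        "Demais Ocupações",
        "Demais Ocupações",
        "Profissionais da Informática",
        "Demais Ocupações",
    ],
})

group_cols = ["UF", "Município", "knowledge_services", "potec"]

# One row per combination of attributes; "Pessoal" holds the group size.
grouped = (
    contracts.groupby(group_cols, observed=True)
    .size()
    .reset_index(name="Pessoal")
)
print(len(grouped))              # 3 groups
print(grouped["Pessoal"].sum())  # 4, the number of contracts is preserved
```

Because only the distinct combinations are stored, the grouped table is usually far smaller than the contract-level data. It could then be written with `DataFrame.to_feather`, which requires the pyarrow package.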

In [2]:
%%time
df_2007 = treat.generate_rais_dataframe(list_ufs=['PR'], year=2007)

Wall time: 34.5 s


In [4]:
df_2007

Unnamed: 0,UF,Município,Sectors,Tamanho Estabelecimento,Natureza Jurídica Grupo,Natureza Jurídica,knowledge_services,technology_industries,Potec Grupo,potec,Escolaridade1,Escolaridade2,Pessoal
0,PR,410010,Services,Até 49,Entidades Empresariais,SOC QT LTDA,Less knowledge-intensive market services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,2
1,PR,410010,Services,Até 49,Entidades Empresariais,SOC QT LTDA,Less knowledge-intensive market services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,1
2,PR,410010,Services,Até 49,Entidades Empresariais,SOC QT LTDA,Less knowledge-intensive market services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,5
3,PR,410010,Services,Até 49,Entidades Empresariais,SOC QT LTDA,Less knowledge-intensive market services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,3
4,PR,410010,Services,Até 49,Entidades Empresariais,SOC QT LTDA,Less knowledge-intensive market services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Ensino Médio ao Superior Incompleto,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...
104081,PR,412880,Services,de 100 a 249,Administração Pública,POD EXEC MU,Other knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,12
104082,PR,412880,Services,de 100 a 249,Administração Pública,POD EXEC MU,Other knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,6
104083,PR,412880,Services,de 100 a 249,Administração Pública,POD EXEC MU,Other knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Ensino Médio ao Superior Incompleto,56
104084,PR,412880,Services,de 100 a 249,Administração Pública,POD EXEC MU,Other knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Ensino Médio ao Superior Incompleto,3


In [9]:
df_2007.head()

Unnamed: 0,UF,Território,Sectors,Tamanho Estabelecimento,Natureza Jurídica Grupo,Natureza Jurídica,knowledge_services,technology_industries,Potec Grupo,potec,Escolaridade1,Escolaridade2,Pessoal
0,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,1
1,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,7
2,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Profissionais científicos,Profissionais da Informática,Não-Superior,Inferior ao Ensino Médio Completo,1
3,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,10
4,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13700,PR,Arranjo Populacional de Curitiba/PR,Services,a partir de 1000,Administração Pública,POD JUD FED,Other knowledge-intensive services,Without Classification,Profissionais científicos,Profissionais da Informática,Superior,Superior,11
13701,PR,Arranjo Populacional de Curitiba/PR,Services,a partir de 1000,Administração Pública,POD JUD FED,Other knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Superior,Superior,2785
13702,PR,Arranjo Populacional de Curitiba/PR,Services,a partir de 1000,Administração Pública,POD JUD FED,Other knowledge-intensive services,Without Classification,Profissionais científicos,Profissionais da Informática,Mestrado,Mestrado,4
13703,PR,Arranjo Populacional de Curitiba/PR,Services,a partir de 1000,Administração Pública,POD JUD FED,Other knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Mestrado,Mestrado,46


In [10]:
%%time
treat.load_save_rais_dataframe_loop(
    list_ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG', 'PA', 'PE', 'PR', 'RJ', 'RS', 'SC', 'SP']
    , list_years=list(range(2007, 2019))
)

ufs_with_motropolises_2007 saved.
ufs_with_motropolises_2008 saved.
ufs_with_motropolises_2009 saved.
ufs_with_motropolises_2010 saved.
ufs_with_motropolises_2011 saved.
ufs_with_motropolises_2012 saved.
ufs_with_motropolises_2013 saved.
ufs_with_motropolises_2014 saved.
ufs_with_motropolises_2015 saved.
ufs_with_motropolises_2016 saved.
ufs_with_motropolises_2017 saved.
Wall time: 20min 5s


In [11]:
list_files = glob.glob('app_rais/data/rais_dataframes/**.ftd')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB on disk.')

Resulting treated files occupy 0.14GB on disk.
