# COLLECTING AND TREATING RAIS DATA


To support the analysis of **scientific, technological and innovative ecosystems in Brazilian metropolises**, this notebook documents how Brazilian formal employment data from the **Annual List of Social Information** (**RAIS**, its Portuguese acronym) is collected and treated to build a database that carries the attributes the project needs in a format that is fast to query.

RAIS data is available via its [microdata server](ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/) and comprises several 7zip-compressed files, organized by year and federation unit (states and the Federal District), with about 30 columns and one row for each formal employment contract. With that structure, retrieving information often demands a lot of processing power and RAM. The raw layout also does not guarantee that the right kind of information can be obtained unless the data is grouped according to well-established concepts and criteria.
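To illustrate why row-per-contract files are costly to query, and how streaming keeps memory use flat, here is a minimal sketch that counts rows per value of one column while holding a single row in memory at a time. The sample data, the column names (`municipio`, `cbo`, `ativo_3112`) and the semicolon delimiter are assumptions made for illustration, not the actual RAIS layout.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample; real RAIS files have about 30 columns.
SAMPLE = (
    "municipio;cbo;ativo_3112\n"
    "410690;212405;1\n"
    "410690;411005;0\n"
    "411370;212405;1\n"
)


def count_by_column(lines, column, delimiter=";"):
    """Count rows per value of `column`, reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines, delimiter=delimiter):
        counts[row[column]] += 1
    return counts


# io.StringIO stands in for an open file handle on a large .txt file.
counts = count_by_column(io.StringIO(SAMPLE), "municipio")
print(counts)
```

The same function accepts any iterable of lines, so an open file object can replace the in-memory sample without loading the whole file.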

To work around these issues, the sections ahead apply Python packages, modules and functions to collect and treat RAIS data. The data structure produced by this workflow allows fast retrieval, a coding scheme suited to exploratory and comparative analysis, and better performance for potential modeling operations and for public release through web applications.

## PREPARING THE NOTEBOOK

The project's virtual environment has a number of packages installed (as shown in the output of the code cell below). A Python package called app_rais was developed for this project; it carries modules and functions for operating on RAIS data: download, transformation, cleaning, optimization and analysis.

This notebook therefore imports the Python packages and modules needed to collect and treat RAIS data. The imports can be seen in the second code cell ahead.

In [1]:
! conda list

# packages in environment at C:\Users\augus\Anaconda3\envs\inovation_panel:
#
# Name                    Version                   Build  Channel
aiohttp                   3.6.2                    pypi_0    pypi
astroid                   2.4.2                    pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     19.3.0                     py_0  
backcall                  0.2.0                      py_0  
beautifulsoup4            4.9.1                    pypi_0    pypi
blas                      1.0                         mkl  
bleach                    3.1.5                      py_0  
brotlipy                  0.7.0           py36he774522_1000  
bzip2                     1.0.8                he774522_0  
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py36h9f0ad1d_0    conda-forge
cffi                      1.14.0           py36h7a1dbc1_0  
cfitsio                

In [2]:
import os
import glob
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format

# Below some Python modules and functions created for this project are imported
import app_rais.treat as treat
from app_rais.collect import download_rais
from app_rais.collect import extract_rais_original

## COLLECTING RAIS DATA

In the next cells, the **download_rais** function from the **collect** module was called for 14 designated Brazilian federation units (FUs), namely AM, BA, CE, DF, ES, GO, MG, PA, PE, PR, RJ, RS, SC and SP. As a result, the compressed files from the **[RAIS Microdata Server](ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/)** were downloaded into rais_original subdirectories. Subsequently, the **extract_rais_original** function was applied to these files, extracting their .txt files and deleting the original compressed ones. For the 14 FUs, from 1993 to 2017, the original database comprises 399 files, which occupy 374.9GB of disk space.
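As a rough sketch of the directory layout the download step produces, the snippet below builds the relative archive paths for every FU and year, assuming one archive per FU and year named as in the download log (for example `rais_original/1993/AM1993.7z`). The helper `expected_archives` is hypothetical and is not part of app_rais.

```python
from itertools import product

UFS = ['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG',
       'PA', 'PE', 'PR', 'RJ', 'RS', 'SC', 'SP']
YEARS = list(range(1993, 2018))  # 1993..2017; the stop value 2018 is excluded


def expected_archives(ufs, years, root="rais_original"):
    """Return paths such as 'rais_original/1993/AM1993.7z', year by year."""
    return [f"{root}/{year}/{uf}{year}.7z" for year, uf in product(years, ufs)]


paths = expected_archives(UFS, YEARS)
print(len(paths), paths[0], paths[-1])
```

A list like this can be compared against the files actually present on disk to spot downloads that failed or were skipped.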

In [3]:
%%time
download_rais(
    list_ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG','PA','PE','PR','RJ','RS','SC','SP']
    , list_year=list(range(1993,2018))
)

-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/AM1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/BA1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/CE1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/DF1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/ES1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/GO1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/MG1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/PA1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/1993/PE1993.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_r

-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/CE2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/DF2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/ES2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/GO2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/MG2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/PA2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/PE2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/PR2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2004/RJ2004.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_r

-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/ES2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/GO2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/MG2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/PA2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/PE2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/PR2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/RJ2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/RS2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_rais\data/rais_original/2015/SC2015.7z downloaded.
-1 / unknown - D:\inovation_panel\inovation_panel\app_r

In [4]:
%%time
extract_rais_original(
    path_pattern='D:/inovation_panel/inovation_panel/app_rais/data/rais_original/**/**.7z'
    , remove_7z=True
)

"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\AM1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\BA1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\CE1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\DF1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\ES1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\GO1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\MG1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\PA1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\PE1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\PR1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\RJ1993.txt"

"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\MG1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\PA1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\PE1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\PR1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\RJ1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\RS1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\SC1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1999\SP1999.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2000\AM2000.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2000\BA2000.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2000\CE2000.txt"

"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2005\SC2005.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2005\SP2005.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\AM2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\BA2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\CE2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\DF2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\ES2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\GO2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\MG2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\PA2006.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2006\PE2006.txt"

"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\ES2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\GO2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\MG2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\PA2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\PE2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\PR2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\RJ2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\RS2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\SC2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2012\SP2012.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\2013\AM2013.txt"

In [5]:
list_files = glob.glob(f'app_rais/data/rais_original/**/**.txt')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Original extracted files occupy {size}GB in disk.')

Original extracted files occupy 374.87GB in disk.
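One detail of the size check above: without `recursive=True`, `glob.glob` treats `**` like a single `*`, so the pattern matches files exactly one directory below rais_original, which is the layout the extraction log shows. The sketch below illustrates the difference on a temporary directory; the helper name `total_size_gb` and the file names are hypothetical.

```python
import glob
import os
import tempfile


def total_size_gb(pattern, recursive=False):
    """Sum the sizes of all files matching `pattern`, in decimal gigabytes."""
    files = glob.glob(pattern, recursive=recursive)
    return sum(os.path.getsize(f) for f in files) * 1e-9


with tempfile.TemporaryDirectory() as root:
    # Two hypothetical files: one a single level deep, one two levels deep.
    shallow = os.path.join(root, "1993", "AM1993.txt")
    deep = os.path.join(root, "1993", "extra", "BA1993.txt")
    for path in (shallow, deep):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(b"x" * 1000)

    pattern = os.path.join(glob.escape(root), "**", "*.txt")
    one_level = total_size_gb(pattern)                   # '**' acts like '*'
    all_levels = total_size_gb(pattern, recursive=True)  # '**' spans any depth

print(one_level, all_levels)
```

Only the recursive call counts the file nested two levels deep, so the choice matters whenever the directory tree is not strictly one folder per year.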


## TREATING DATA - FIRST ROUND

Ahead, the **original_to_treated_loop** function from the **treat** module was called for the designated Brazilian federation units (FU, or UF in Portuguese). This function:
1. loads original RAIS data from the .txt files inside rais_original subdirectories;
2. removes employment records not active on 12/31 of each year;
3. casts columns to more efficient dtypes for better performance;
4. drops columns not relevant to the analysis of the S&T structure;
5. adds data on S&T personnel, on services classified by knowledge intensity and on industries classified by technological level;
6. for each FU and year, saves a compressed pickle file in rais_treated subdirectories.

For the 14 selected FUs, from 2007 to 2017, the saved files add up to about 7GB on disk.
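The filtering, column-dropping and saving steps listed above can be sketched with the standard library alone. This is not the code of original_to_treated_loop: the rows, the column names (`active_dec31`, `unused_column`) and the helper functions are hypothetical, and dtype casting and the S&T classifications are left out.

```python
import os
import pickle
import tempfile
import zipfile

# Hypothetical raw rows; every value arrives as text, as in a .txt export.
RAW_ROWS = [
    {"uf": "PR", "cbo": "212405", "active_dec31": "1", "unused_column": "a"},
    {"uf": "PR", "cbo": "411005", "active_dec31": "0", "unused_column": "b"},
    {"uf": "PR", "cbo": "317110", "active_dec31": "1", "unused_column": "c"},
]
KEEP = ("uf", "cbo")  # assumed subset of columns relevant to the analysis


def treat_rows(rows, keep=KEEP):
    """Keep contracts active on 31 December and only the columns in `keep`."""
    return [
        {column: row[column] for column in keep}
        for row in rows
        if row["active_dec31"] == "1"
    ]


def save_zipped_pickle(obj, zip_path, member="data.pkl"):
    """Write `obj` as a pickle stored inside a deflate-compressed zip file."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member, pickle.dumps(obj))


def load_zipped_pickle(zip_path, member="data.pkl"):
    """Read back an object written by save_zipped_pickle."""
    with zipfile.ZipFile(zip_path) as archive:
        return pickle.loads(archive.read(member))


treated = treat_rows(RAW_ROWS)
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "PR2007.zip")
    save_zipped_pickle(treated, target)
    restored = load_zipped_pickle(target)

print(restored)
```

Filtering before saving is what shrinks the data most: inactive contracts and unused columns never reach the disk.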

In [6]:
%%time
treat.original_to_treated_loop(
    ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG','PA','PE','PR','RJ','RS','SC','SP']
    , years=list(range(2007,2018))
)

AM2007.zip saved.
AM2008.zip saved.
AM2009.zip saved.
AM2010.zip saved.
AM2011.zip saved.
AM2012.zip saved.
AM2013.zip saved.
AM2014.zip saved.
AM2015.zip saved.
AM2016.zip saved.
AM2017.zip saved.
BA2007.zip saved.
BA2008.zip saved.
BA2009.zip saved.
BA2010.zip saved.
BA2011.zip saved.
BA2012.zip saved.
BA2013.zip saved.
BA2014.zip saved.
BA2015.zip saved.
BA2016.zip saved.
BA2017.zip saved.
CE2007.zip saved.
CE2008.zip saved.
CE2009.zip saved.
CE2010.zip saved.
CE2011.zip saved.
CE2012.zip saved.
CE2013.zip saved.
CE2014.zip saved.
CE2015.zip saved.
CE2016.zip saved.
CE2017.zip saved.
DF2007.zip saved.
DF2008.zip saved.
DF2009.zip saved.
DF2010.zip saved.
DF2011.zip saved.
DF2012.zip saved.
DF2013.zip saved.
DF2014.zip saved.
DF2015.zip saved.
DF2016.zip saved.
DF2017.zip saved.
ES2007.zip saved.
ES2008.zip saved.
ES2009.zip saved.
ES2010.zip saved.
ES2011.zip saved.
ES2012.zip saved.
ES2013.zip saved.
ES2014.zip saved.
ES2015.zip saved.
ES2016.zip saved.
ES2017.zip saved.
GO2007.zip

In [7]:
list_files = glob.glob(f'app_rais/data/rais_treated/**/**.zip')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB in disk.')

Resulting treated files occupy 7.03GB in disk.


## TREATING DATA - SECOND ROUND

The pickle files generated by the first round of data treatment are used ahead as input for generate_rais_dataframe, which applies a second round of treatment: it groups the data to reduce memory use, merges information from different UFs and creates columns for personnel data, namely:

* scientific and technical (S&T) personnel (POTEC);
* organization legal status;
* economic sectors;
* categories of services according to intensity of applied knowledge;
* industry technological levels; 
* level of education.

This function is called by another function, load_save_rais_dataframe_loop, which applies it to each year and saves the result as feather files, a format optimized for fast reading. When applied to the selected federation units, from 2007 to 2017, the resulting database comprises 11 files that total 0.14GB on disk.
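The grouping idea behind this second round can be sketched as follows: contract-level rows are collapsed into one row per combination of attributes, with a count column. The sketch assumes that `Pessoal` holds the number of contracts in each group; the input rows and the helper `group_contracts` are hypothetical, and the real function works on many more columns.

```python
from collections import Counter

# Hypothetical contract-level rows after the first round of treatment.
CONTRACTS = [
    {"UF": "PR", "Sectors": "Services", "potec": "Demais Ocupações"},
    {"UF": "PR", "Sectors": "Services", "potec": "Demais Ocupações"},
    {"UF": "PR", "Sectors": "Services", "potec": "Profissionais da Informática"},
    {"UF": "PR", "Sectors": "Services", "potec": "Demais Ocupações"},
]
GROUP_KEYS = ("UF", "Sectors", "potec")


def group_contracts(rows, keys=GROUP_KEYS, count_name="Pessoal"):
    """Collapse rows into one record per key combination, with a count."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return [
        {**dict(zip(keys, combo)), count_name: n}
        for combo, n in counts.items()
    ]


grouped = group_contracts(CONTRACTS)
for record in grouped:
    print(record)
```

Because many contracts share the same attribute combination, the grouped table has far fewer rows than the contract-level data while the counts still add up to the original total.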

In [8]:
%%time
df_2007 = treat.generate_rais_dataframe(list_ufs=['PR'], year=2007)

Wall time: 4.31 s


In [9]:
df_2007.head()

| | UF | Território | Sectors | Tamanho Estabelecimento | Natureza Jurídica Grupo | Natureza Jurídica | knowledge_services | technology_industries | Potec Grupo | potec | Escolaridade1 | Escolaridade2 | Pessoal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PR | Demais Municípios do Paraná | Services | Até 49 | Entidades Empresariais | SOC QT LTDA | High-tech knowledge-intensive services | Without Classification | Demais Ocupações | Demais Ocupações | Não-Superior | Inferior ao Ensino Médio Completo | 1 |
| 1 | PR | Demais Municípios do Paraná | Services | Até 49 | Entidades Empresariais | SOC QT LTDA | High-tech knowledge-intensive services | Without Classification | Demais Ocupações | Demais Ocupações | Não-Superior | Inferior ao Ensino Médio Completo | 7 |
| 2 | PR | Demais Municípios do Paraná | Services | Até 49 | Entidades Empresariais | SOC QT LTDA | High-tech knowledge-intensive services | Without Classification | Profissionais científicos | Profissionais da Informática | Não-Superior | Inferior ao Ensino Médio Completo | 1 |
| 3 | PR | Demais Municípios do Paraná | Services | Até 49 | Entidades Empresariais | SOC QT LTDA | High-tech knowledge-intensive services | Without Classification | Demais Ocupações | Demais Ocupações | Não-Superior | Inferior ao Ensino Médio Completo | 10 |
| 4 | PR | Demais Municípios do Paraná | Services | Até 49 | Entidades Empresariais | SOC QT LTDA | High-tech knowledge-intensive services | Without Classification | Demais Ocupações | Demais Ocupações | Não-Superior | Inferior ao Ensino Médio Completo | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13700 | PR | Arranjo Populacional de Curitiba/PR | Services | a partir de 1000 | Administração Pública | POD JUD FED | Other knowledge-intensive services | Without Classification | Profissionais científicos | Profissionais da Informática | Superior | Superior | 11 |
| 13701 | PR | Arranjo Populacional de Curitiba/PR | Services | a partir de 1000 | Administração Pública | POD JUD FED | Other knowledge-intensive services | Without Classification | Demais Ocupações | Demais Ocupações | Superior | Superior | 2785 |
| 13702 | PR | Arranjo Populacional de Curitiba/PR | Services | a partir de 1000 | Administração Pública | POD JUD FED | Other knowledge-intensive services | Without Classification | Profissionais científicos | Profissionais da Informática | Mestrado | Mestrado | 4 |
| 13703 | PR | Arranjo Populacional de Curitiba/PR | Services | a partir de 1000 | Administração Pública | POD JUD FED | Other knowledge-intensive services | Without Classification | Demais Ocupações | Demais Ocupações | Mestrado | Mestrado | 46 |


In [10]:
%%time
treat.load_save_rais_dataframe_loop(
    list_ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG','PA','PE','PR','RJ','RS','SC','SP']
    , list_years=list(range(2007,2018))
)

ufs_with_motropolises_2007 saved.
ufs_with_motropolises_2008 saved.
ufs_with_motropolises_2009 saved.
ufs_with_motropolises_2010 saved.
ufs_with_motropolises_2011 saved.
ufs_with_motropolises_2012 saved.
ufs_with_motropolises_2013 saved.
ufs_with_motropolises_2014 saved.
ufs_with_motropolises_2015 saved.
ufs_with_motropolises_2016 saved.
ufs_with_motropolises_2017 saved.
Wall time: 20min 5s


In [11]:
list_files = glob.glob(f'app_rais/data/rais_dataframes/**.ftd')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB in disk.')

Resulting treated files occupy 0.14GB in disk.
