# COLLECTING AND TREATING RAIS DATA


To support the analysis of **scientific, technological and innovative ecosystems in Brazilian metropolises**, this notebook documents how Brazilian formal employment data from the **Annual List of Social Information** (**RAIS**, from the Portuguese *Relação Anual de Informações Sociais*) is collected and treated to build a database with attributes meaningful to the project, in a format that is quick to query.

RAIS data is available from its [microdata server](ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/) as a set of 7zip-compressed files, organized by year and federation unit (the states and the Federal District), with about 30 columns and one row per formal employment contract. With that structure, retrieving information often demands a lot of processing power and RAM. Nor does the raw layout ensure that the right kind of information can be obtained, unless the data is grouped according to well-established concepts and criteria.

To work around these issues, the sections ahead apply Python packages, modules and functions to collect and treat RAIS data. The data structure that results from this workflow allows fast retrieval, a codification suited to exploratory and comparative analysis, and better performance both for potential modeling operations and for public availability through web applications.
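As a rough illustration of the memory concern, the sketch below reads a semicolon-separated table in small chunks and accumulates a count, so that the whole file never has to sit in RAM at once. It is not part of app_rais: the function name `count_active_by_municipality`, the toy data and the chunk size are assumptions made for this example, and only the column names are borrowed from the RAIS layout shown later in this notebook.

```python
import io
import pandas as pd

# Toy stand-in for a RAIS .txt file: semicolon-separated, one row per contract.
# The values are invented for illustration only.
raw = io.StringIO(
    "Município;Vínculo Ativo 31/12;Idade\n"
    "310620;1;34\n"
    "310620;0;51\n"
    "314480;1;28\n"
    "314480;1;45\n"
    "317040;0;19\n"
)

def count_active_by_municipality(source, chunksize=2):
    """Count active contracts per municipality, reading `chunksize` rows at a time."""
    totals = {}
    for chunk in pd.read_csv(source, sep=";", chunksize=chunksize):
        # Keep only contracts flagged as active on December 31
        active = chunk[chunk["Vínculo Ativo 31/12"] == 1]
        for municipality, n in active["Município"].value_counts().items():
            totals[municipality] = totals.get(municipality, 0) + int(n)
    return pd.Series(totals, dtype="int64").sort_index()

active_counts = count_active_by_municipality(raw)
print(active_counts)
```

On real files the chunk size would be much larger (hundreds of thousands of rows), and the `encoding='Latin-1'` argument used elsewhere in this notebook would likely be needed as well.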

## PREPARING THE NOTEBOOK

The virtual environment used in this data analysis project has a number of packages installed (listed in the output of the code cell below). A Python package called app_rais was developed for this project; its modules and functions cover the operations needed on RAIS data: download, transformation, cleaning, optimization and analysis.

The packages and modules imported for collecting and treating RAIS data can be seen in the second code cell ahead.

In [1]:
! conda list

# packages in environment at C:\Users\augus\Anaconda3\envs\innovation_panel:
#
# Name                    Version                   Build  Channel
aiohttp                   3.6.2                    pypi_0    pypi
astroid                   2.4.2                    pypi_0    pypi
async-timeout             3.0.1                    pypi_0    pypi
attrs                     19.3.0                     py_0  
backcall                  0.2.0                      py_0  
beautifulsoup4            4.9.1                    pypi_0    pypi
blas                      1.0                         mkl  
bleach                    3.1.5                      py_0  
brotlipy                  0.7.0           py36he774522_1000  
bzip2                     1.0.8                he774522_0  
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py36h9f0ad1d_0    conda-forge
cffi                      1.14.0           py36h7a1dbc1_0  
cfitsio               

In [1]:
import os
import glob
import pandas as pd

# Modules and functions created for this project
import app_rais.treat as treat
from app_rais.collect import download_rais
from app_rais.collect import extract_rais_original

# pandas display options: two decimal places with thousands separators, up to 50 columns
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_columns = 50

## COLLECTING RAIS DATA

In the next cells, the **download_rais** function from the **collect** module was called for 14 designated Brazilian federation units (FUs), namely AM, BA, CE, DF, ES, GO, MG, PA, PE, PR, RJ, RS, SC and SP. As a result, the compressed files from the **[RAIS Microdata Server](ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/)** were downloaded into the rais_original subdirectories. Subsequently, the **extract_rais_original** function was applied to these files, extracting their .txt files and deleting the original compressed ones. For the 14 FUs, from 1993 to 2018, the original database comprises 399 files, which occupy 374.9GB of disk space.
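Before moving on to treatment, it can be useful to check which files the download and extraction steps actually left on disk. The sketch below is not part of app_rais: the helpers `expected_txt_files` and `missing_files` are hypothetical, and the `<year>/<UF><year>.txt` layout is an assumption inferred from the extraction log further down. It also assumes one file per FU and year, which may not hold for every FU.

```python
from pathlib import Path
import tempfile

def expected_txt_files(root, ufs, years):
    """Return the paths the extraction step is assumed to leave behind."""
    return [Path(root) / str(year) / f"{uf}{year}.txt" for year in years for uf in ufs]

def missing_files(root, ufs, years):
    """List the expected files that are not on disk."""
    return [p for p in expected_txt_files(root, ufs, years) if not p.exists()]

# Demonstration on a temporary directory with a single file present
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1993").mkdir()
    (Path(tmp) / "1993" / "AM1993.txt").write_text("placeholder")
    # range(1993, 1995) yields 1993 and 1994: the end value is exclusive
    missing = [p.name for p in missing_files(tmp, ["AM", "BA"], range(1993, 1995))]

print(missing)
```

Because the end of `range` is exclusive, covering 1993 to 2018 inclusive requires `range(1993, 2019)`.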

In [1]:
%%time
# Note: range(1993, 2018) stops at 2017; use range(1993, 2019) to include 2018
download_rais(
    list_ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG', 'PA', 'PE', 'PR', 'RJ', 'RS', 'SC', 'SP']
    , list_year=list(range(1993, 2018))
)

NameError: name 'download_rais' is not defined

In [4]:
%%time
extract_rais_original(
    path_pattern='D:/inovation_panel/inovation_panel/app_rais/data/rais_original/**/**.7z'
    , remove_7z=True
)

"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\AM1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\BA1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\CE1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\DF1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\ES1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\GO1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\MG1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\PA1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\PE1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\PR1993.txt" extracted.
"D:/inovation_panel/inovation_panel/app_rais/data/rais_original\1993\RJ1993.txt"

In [3]:
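# Preview the first 6 rows of one original file
# (the variable name says 2006, but the file read here is MG2000.txt)
df_txt_mg2006 = pd.read_table('app_rais/data/rais_original/2000/MG2000.txt', encoding='Latin-1', sep=';', nrows=6)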

In [3]:
df_txt_mg2006

Unnamed: 0,Bairros SP,Bairros Fortaleza,Bairros RJ,Causa Afastamento 1,Causa Afastamento 2,Causa Afastamento 3,Motivo Desligamento,CBO Ocupação 2002,CNAE 2.0 Classe,CNAE 95 Classe,Distritos SP,Vínculo Ativo 31/12,Faixa Etária,Faixa Hora Contrat,Faixa Remun Dezem (SM),Faixa Remun Média (SM),Faixa Tempo Emprego,Escolaridade após 2005,Qtd Hora Contr,Idade,Ind CEI Vinculado,Ind Simples,Mês Admissão,Mês Desligamento,Mun Trab,Município,Nacionalidade,Natureza Jurídica,Qtd Dias Afastamento,Raça Cor,Regiões Adm DF,Vl Remun Dezembro Nom,Vl Remun Dezembro (SM),Vl Remun Média Nom,Vl Remun Média (SM),CNAE 2.0 Subclasse,Sexo Trabalhador,Tamanho Estabelecimento,Tempo Emprego,Tipo Admissão,Tipo Estab,Tipo Estab.1,Tipo Vínculo
0,0,0,0,{ñ,{ñ,{ñ,11,511215,49221,60240,0,0,3,6,0,2,2,1,44,18,0,0,10,1,0000-1,317040,10,2062,0,8,{ñ c,0,0,40740,135,4922102,1,7,36,0,1,CNPJ,10
1,0,0,0,{ñ,{ñ,{ñ,11,621005,2306,2135,0,0,7,6,0,2,8,1,44,55,0,1,1,4,0000-1,314180,10,2062,0,4,{ñ c,0,0,33000,105,230600,1,4,1479,0,1,CNPJ,10
2,0,0,0,{ñ,{ñ,{ñ,11,992225,38319,37109,0,0,7,6,0,2,6,1,44,57,0,1,7,8,314930,314930,10,2062,0,4,{ñ c,0,0,36000,109,3831999,1,3,374,0,1,CNPJ,10
3,0,0,0,{ñ,{ñ,{ñ,11,715230,41204,45217,0,0,7,6,0,2,2,1,44,57,0,0,7,11,311830,314480,10,2062,0,4,{ñ c,0,0,45872,131,4120400,1,7,40,2,1,CNPJ,10
4,0,0,0,{ñ,{ñ,{ñ,10,715230,41204,45217,0,0,6,6,0,2,2,1,44,47,0,0,5,10,310620,314480,10,2062,0,6,{ñ c,0,0,51458,146,4120400,1,7,47,2,1,CNPJ,10
5,0,0,0,{ñ,{ñ,{ñ,11,715505,41204,45217,0,0,6,6,0,2,1,1,44,48,0,0,7,9,311830,314480,10,2062,0,2,{ñ c,0,0,49405,141,4120400,1,7,23,2,1,CNPJ,10


In [4]:
df_txt_mg2006.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Bairros SP                6 non-null      int64 
 1   Bairros Fortaleza         6 non-null      int64 
 2   Bairros RJ                6 non-null      int64 
 3   Motivo Desligamento       6 non-null      int64 
 4   CBO 94 Ocupação           6 non-null      int64 
 5   CNAE 95 Classe            6 non-null      int64 
 6   Distritos SP              6 non-null      int64 
 7   Vínculo Ativo 31/12       6 non-null      int64 
 8   Faixa Etária              6 non-null      int64 
 9   Faixa Hora Contrat        6 non-null      int64 
 10  Faixa Remun Dezem (SM)    6 non-null      int64 
 11  Faixa Remun Média (SM)    6 non-null      int64 
 12  Faixa Tempo Emprego       6 non-null      int64 
 13  Grau Instrução 2005-1985  6 non-null      int64 
 14  Qtd Hora Contr            6 no

In [5]:
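# Total size of the extracted original .txt files
# (the printed label says "treated", but these are the original files)
list_files = glob.glob(f'app_rais/data/rais_original/**/**.txt')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB in disk.')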

Resulting treated files occupy 374.87GB in disk.
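
The same size check is repeated later for the treated and grouped files, so it could be wrapped in a small helper. The sketch below is an assumption for illustration, not a function from app_rais: `total_size_bytes` and `total_size_gb` are hypothetical names, and `Path.rglob` is used so that files are found at any directory depth.

```python
from pathlib import Path
import tempfile

def total_size_bytes(root, suffix):
    """Sum the sizes of all files under `root` whose names end with `suffix`."""
    return sum(p.stat().st_size for p in Path(root).rglob(f"*{suffix}") if p.is_file())

def total_size_gb(root, suffix):
    """Same total, expressed in GB (1e9 bytes) and rounded to two decimals."""
    return round(total_size_bytes(root, suffix) * 1e-9, 2)

# Demonstration with two small files in year subdirectories
with tempfile.TemporaryDirectory() as tmp:
    for year, n_bytes in (("1993", 1000), ("1994", 500)):
        folder = Path(tmp) / year
        folder.mkdir()
        (folder / f"AM{year}.txt").write_bytes(b"x" * n_bytes)
    # A file with another extension is ignored
    (Path(tmp) / "1993" / "AM1993.7z").write_bytes(b"x" * 999)
    size_in_bytes = total_size_bytes(tmp, ".txt")

print(size_in_bytes)
```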


## TREATING DATA - FIRST ROUND

Next, the **original_to_treated_loop** function from the **treat** module was called for the designated Brazilian federation units (FU, or UF in Portuguese). This function:
1. loads the original RAIS data from the .txt files inside the rais_original subdirectories;
2. removes the employment records that were not active on December 31 of each year;
3. casts the columns to more efficient dtypes for better performance;
4. drops columns not relevant to the analysis of the S&T structure;
5. adds data on S&T personnel, on services classified by knowledge intensity and on industries classified by technological level;
6. for each FU and year, saves a compressed pickle file in the rais_treated subdirectories.

For the 14 selected FUs, from 2007 to 2017, the saved files add up to 7GB on disk.
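
A minimal sketch of steps 2, 3, 4 and 6 on toy data follows. It is not the implementation of original_to_treated_loop: the function `treat_sketch`, the choice of dropped columns and the blanket cast to `category` are assumptions made for illustration. The classification of step 5 is left out because it depends on project-specific lookup tables.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy rows; the real RAIS files have many more columns and rows
original = pd.DataFrame({
    "Vínculo Ativo 31/12": [1, 0, 1, 1],
    "Município": [230440, 230440, 231290, 230440],
    "Sexo Trabalhador": [1, 2, 2, 1],
    "Bairros SP": [0, 0, 0, 0],
})

def treat_sketch(df, drop_cols=("Bairros SP",)):
    """Hypothetical stand-in for the filtering, dropping and casting steps."""
    # Step 2: keep only contracts active on December 31
    active = df[df["Vínculo Ativo 31/12"] == 1]
    # Step 4: drop the filter column and other columns assumed irrelevant
    active = active.drop(columns=["Vínculo Ativo 31/12", *drop_cols])
    # Step 3: coded columns stored as categoricals use less memory
    return active.astype("category").reset_index(drop=True)

treated = treat_sketch(original)

# Step 6: pandas infers zip compression from the .zip extension
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "CE2000.zip")
    treated.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Pickle preserves the categorical dtypes, which is one reason to prefer it over re-reading text files at each step.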

In [2]:
%%time
treat.original_to_treated_loop(
    ufs=['PR']
    , years=list(range(2000,2019))
)

PR2000.zip saved.
PR2001.zip saved.
PR2002.zip saved.
PR2003.zip saved.
PR2004.zip saved.
PR2005.zip saved.
PR2006.zip saved.
PR2007.zip saved.
PR2008.zip saved.
PR2009.zip saved.
PR2010.zip saved.
PR2011.zip saved.
PR2012.zip saved.
PR2013.zip saved.
PR2014.zip saved.
PR2015.zip saved.
PR2016.zip saved.
PR2017.zip saved.
PR2018.zip saved.
Wall time: 10min 23s


In [4]:
ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG','PA','PE','PR','RJ','RS','SC','SP']
dict_dfs = dict()
for uf in ufs:
    df = pd.read_pickle(f'app_rais/data/rais_treated/2007/{uf}2007.zip')
    dict_dfs[uf] = df
    print(f'{uf} done.')

df = pd.concat(objs=dict_dfs.values(), keys=dict_dfs.keys())

AM done.
BA done.
CE done.
DF done.
ES done.
GO done.
MG done.
PA done.
PE done.
PR done.
RJ done.
RS done.
SC done.
SP done.


In [5]:
df.head()

Unnamed: 0,Unnamed: 1,CBO Ocupação 2002,CNAE 2.0 Classe,CNAE 95 Classe,Faixa Etária,Faixa Remun Dezem (SM),Faixa Remun Média (SM),Escolaridade após 2005,Idade,Ind Simples,Município,Nacionalidade,Natureza Jurídica,Raça Cor,Vl Remun Dezembro Nom,Vl Remun Dezembro (SM),Vl Remun Média Nom,Vl Remun Média (SM),Sexo Trabalhador,Tamanho Estabelecimento,Tipo Estab,Tipo Estab.1,territorio_tese,arranjo,knowledge_services,technology_industries,potec
AM,0,715135,46745,51535,6,8,7,1,45,0,130260,10,2062,2,2054.14,5.4,1943.26,5.21,1,5,1,CNPJ,,Manaus,Less knowledge-intensive market services,Without Classification,Demais Ocupações
AM,1,717020,42995,45292,7,3,2,1,58,0,130260,10,2062,2,558.12,1.46,569.54,1.49,1,7,1,CNPJ,,Manaus,Without Classification,Without Classification,Demais Ocupações
AM,2,717020,42995,45292,7,4,3,1,52,0,130260,10,2062,8,623.31,1.64,581.28,1.52,1,7,1,CNPJ,,Manaus,Without Classification,Without Classification,Demais Ocupações
AM,3,715210,42995,45292,5,4,3,1,37,0,130260,10,2062,2,749.37,1.97,763.74,2.0,1,7,1,CNPJ,,Manaus,Without Classification,Without Classification,Demais Ocupações
AM,4,717020,42995,45292,5,4,3,1,36,0,130260,10,2062,4,641.96,1.68,610.01,1.59,1,7,1,CNPJ,,Manaus,Without Classification,Without Classification,Demais Ocupações


In [23]:
df['potec'].value_counts()

Demais Ocupações                                                                                                32505060
Professores de ensino superior                                                                                    389720
Profissionais da Informática                                                                                      165216
Engenheiros Civis etc                                                                                             148643
Professores de ensino Profissional                                                                                129917
Engenheiros agrônomos, de alimentos, florestais e de pesca                                                         15141
Pesquisadores                                                                                                      11980
Biólogos e biomédicos                                                                                              11597
Físicos, químicos e afins     

In [24]:
df['potec'].isna().sum()

0

In [19]:
pd.DataFrame({'potec':list(df['potec'].unique())}).to_clipboard()

In [7]:
import app_rais.dicts_constants as dc

In [9]:
df_ce_2000[['CNAE 95 Classe', 'arranjo', 'knowledge_services', 'technology_industries']]

Unnamed: 0,CNAE 95 Classe,arranjo,knowledge_services,technology_industries
0,75116,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
1,75116,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
2,75116,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
3,75116,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
4,75116,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
...,...,...,...,...
691088,80110,Arranjo Populacional de Fortaleza/CE,,
691089,85154,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
691090,74128,Demais Municípios do Ceará,Knowledge-intensive market services,Without Classification
691091,85154,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification


In [7]:
df_ce_2000

Unnamed: 0,CBO 94 Ocupação,CNAE 95 Classe,Faixa Etária,Faixa Remun Dezem (SM),Faixa Remun Média (SM),Grau Instrução 2005-1985,Idade,Município,Nacionalidade,Natureza Jurídica,Vl Remun Dezembro Nom,Vl Remun Dezembro (SM),Vl Remun Média Nom,Vl Remun Média (SM),Sexo Trabalhador,Tamanho Estabelecimento,Tipo Estab,Tipo Estab.1,territorio_tese,arranjo,knowledge_services,technology_industries
0,21490,75116,06,12,11,1,47,230440,10,1104,3124.77,20.69,4343.13,29.86,01,8,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
1,31990,75116,07,08,07,1,51,230440,10,1104,878.83,5.82,854.52,5.81,01,9,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
2,31990,75116,07,09,08,1,56,230440,10,1104,1139.96,7.55,1132.08,7.70,01,9,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
3,31990,75116,06,08,07,1,47,230440,10,1104,1012.22,6.70,934.72,6.36,01,9,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
4,31990,75116,07,08,07,1,60,230440,10,1104,891.60,5.90,894.93,6.09,01,9,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
691088,14320,80110,05,02,02,9,31,230440,10,2135,151.00,1.00,151.44,1.03,02,3,03,CEI,,Arranjo Populacional de Fortaleza/CE,,
691089,32105,85154,04,06,05,9,28,230440,10,4049,500.00,3.31,500.00,3.31,02,2,03,CEI,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification
691090,39310,74128,05,02,01,9,37,231290,10,4049,151.00,1.00,147.25,1.00,02,2,03,CEI,,Demais Municípios do Ceará,Knowledge-intensive market services,Without Classification
691091,07110,85154,05,05,04,9,35,230440,10,4049,420.00,2.78,420.00,2.86,02,2,03,CEI,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification


In [11]:
pd_pickle_ce2008 = pd.read_pickle('app_rais/data/rais_treated/2008/CE2008.zip')

In [12]:
pd_pickle_ce2008

Unnamed: 0,CBO Ocupação 2002,CNAE 2.0 Classe,CNAE 95 Classe,Faixa Etária,Faixa Remun Dezem (SM),Faixa Remun Média (SM),Escolaridade após 2005,Idade,Ind Simples,Município,Nacionalidade,Natureza Jurídica,Raça Cor,Vl Remun Dezembro Nom,Vl Remun Dezembro (SM),Vl Remun Média Nom,Vl Remun Média (SM),Sexo Trabalhador,Tamanho Estabelecimento,Tipo Estab,Tipo Estab.1,territorio_tese,arranjo,knowledge_services,technology_industries,potec
0,784205,11216,15946,07,03,02,1,61,0,230440,10,2062,08,550.83,1.32,500.64,1.21,01,8,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Without Classification,Low-technology,Demais Ocupações
1,517420,45111,50105,08,,,1,69,0,230440,10,2062,09,0.00,0.00,0.00,0.00,01,6,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Less knowledge-intensive market services,Without Classification,Demais Ocupações
2,622405,01229,01228,07,02,02,1,53,0,231020,10,2062,08,394.92,0.95,416.97,1.01,01,6,01,CNPJ,,Demais Municípios do Ceará,Without Classification,Without Classification,Demais Ocupações
3,622405,01229,01228,05,03,02,1,31,0,231020,10,2062,04,574.84,1.38,418.60,1.02,01,6,01,CNPJ,,Demais Municípios do Ceará,Without Classification,Without Classification,Demais Ocupações
4,622405,01229,01228,06,02,01,1,42,0,231020,10,2062,08,348.06,0.83,393.39,0.95,01,6,01,CNPJ,,Demais Municípios do Ceará,Without Classification,Without Classification,Demais Ocupações
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1129994,514320,82997,74993,05,03,02,11,31,0,230370,10,2062,08,437.71,1.05,437.71,1.05,02,10,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Less knowledge-intensive market services,Without Classification,Demais Ocupações
1129995,223115,87301,85316,06,09,08,11,49,0,230440,10,3999,08,3200.00,7.71,3287.27,7.91,02,6,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification,Demais Ocupações
1129996,223110,87301,85316,05,10,09,11,38,0,230440,10,3999,08,4500.00,10.84,4500.00,10.84,02,6,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification,Demais Ocupações
1129997,234520,85139,80152,05,11,10,11,37,0,230440,10,2062,02,7221.26,17.40,7118.89,17.36,02,7,01,CNPJ,,Arranjo Populacional de Fortaleza/CE,Other knowledge-intensive services,Without Classification,Professores de ensino superior


In [7]:
list_files = glob.glob(f'app_rais/data/rais_treated/**/**.zip')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB in disk.')

Resulting treated files occupy 7.03GB in disk.


## TREATING DATA - SECOND ROUND

The pickle files generated by the first round of data treatment are used ahead as input for generate_rais_dataframe, which applies a new round of treatment: it groups the data for better use of memory, merges information from different UFs and creates columns for personnel data, namely:

* scientific and technical (S&T) personnel (POTEC);
* organization legal status;
* economic sectors;
* categories of services according to intensity of applied knowledge;
* industry technological levels; 
* level of education.

This function is called by another function, load_save_rais_dataframe_loop, which applies it to different years and saves the results as feather files, optimized for fast loading. When applied to the selected federation units, from 2007 to 2017, the resulting database comprised 11 files that add up to 0.14GB of disk space.
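
The grouping idea can be sketched as follows: instead of keeping one row per contract, the data keeps one row per combination of attributes, with a personnel count. This is an illustration under assumptions, not the code of generate_rais_dataframe; the toy rows and the choice of grouping columns are invented, and only the column names `knowledge_services`, `potec` and `Pessoal` come from the outputs in this notebook.

```python
import pandas as pd

# One row per employment contract (toy data)
treated = pd.DataFrame({
    "UF": ["PR", "PR", "PR", "CE"],
    "knowledge_services": [
        "High-tech knowledge-intensive services",
        "High-tech knowledge-intensive services",
        "High-tech knowledge-intensive services",
        "Other knowledge-intensive services",
    ],
    "potec": [
        "Demais Ocupações",
        "Demais Ocupações",
        "Profissionais da Informática",
        "Demais Ocupações",
    ],
})

group_cols = ["UF", "knowledge_services", "potec"]

# One row per attribute combination; "Pessoal" holds the number of contracts
grouped = (
    treated.groupby(group_cols, observed=True)
    .size()
    .reset_index(name="Pessoal")
)

print(grouped)
```

The sum of `Pessoal` equals the number of original rows, so totals are preserved while the table shrinks to the number of distinct attribute combinations.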

In [33]:
%%time
df_2006 = treat.generate_rais_dataframe(list_ufs=['PR'], year=2006)

Wall time: 3.9 s


In [34]:
df_2006.head()

Unnamed: 0,UF,Território,Sectors,Tamanho Estabelecimento,Natureza Jurídica Grupo,Natureza Jurídica,knowledge_services,technology_industries,Potec Grupo,potec,Escolaridade1,Escolaridade2,Pessoal
0,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Profissionais científicos,Profissionais da Informática,Não-Superior,Inferior ao Ensino Médio Completo,1
1,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,7
2,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,19
3,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Demais Ocupações,Demais Ocupações,Não-Superior,Inferior ao Ensino Médio Completo,54
4,PR,Demais Municípios do Paraná,Services,Até 49,Entidades Empresariais,SOC QT LTDA,High-tech knowledge-intensive services,Without Classification,Profissionais científicos,Profissionais da Informática,Não-Superior,Inferior ao Ensino Médio Completo,1


In [4]:
%%time
treat.load_save_rais_dataframe_loop(
    list_ufs=['AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MG', 'PA', 'PE', 'PR', 'RJ', 'RS', 'SC', 'SP']
    , list_years=list(range(2000, 2019))
)

ufs_with_motropolises_2000 saved.
ufs_with_motropolises_2001 saved.
ufs_with_motropolises_2002 saved.
ufs_with_motropolises_2003 saved.
ufs_with_motropolises_2004 saved.
ufs_with_motropolises_2005 saved.
ufs_with_motropolises_2006 saved.
ufs_with_motropolises_2007 saved.
ufs_with_motropolises_2008 saved.
ufs_with_motropolises_2009 saved.
ufs_with_motropolises_2010 saved.
ufs_with_motropolises_2011 saved.
ufs_with_motropolises_2012 saved.
ufs_with_motropolises_2013 saved.
ufs_with_motropolises_2014 saved.
ufs_with_motropolises_2015 saved.
ufs_with_motropolises_2016 saved.
ufs_with_motropolises_2017 saved.
ufs_with_motropolises_2018 saved.
Wall time: 36min 33s


In [36]:
list_files = glob.glob(f'app_rais/data/rais_dataframes/**.ftd')
size = round(sum(os.path.getsize(file) for file in list_files) * 1e-9, 2)
print(f'Resulting treated files occupy {size}GB in disk.')

Resulting treated files occupy 0.32GB in disk.
