## Module 00 - Loading Data and Creating Training and Test Sets

In this first module, we perform the following steps:

1. load the data from Google Drive in two parts (the file is too big to load at once);
2. exclude irrelevant variables and variables with too many missing values;
3. rename variables to English, using shorter names where possible;
4. split the data into a training set and a test set.

### 1 - Loading the data in two parts

In [1]:
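import pandas as pd

# Sharing link of the first part; the file ID is the second-to-last path segment
url_a = 'https://drive.google.com/file/d/1prPbFSiXFTHmTHzXTGxy4HrtRxXUHhce/view?usp=sharing'
# Build a direct-download link from the file ID so pandas can read the file
path_a = 'https://drive.google.com/uc?export=download&id='+url_a.split('/')[-2]
base_df_a = pd.read_excel(path_a)
base_df_a.shape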

(309999, 37)

In [2]:
url_b = 'https://drive.google.com/file/d/1nGckSszPPifPvR3o5FeYaKArUYbfjHGn/view?usp=sharing'
path_b = 'https://drive.google.com/uc?export=download&id='+url_b.split('/')[-2]
base_df_b = pd.read_excel(path_b)
base_df_b.shape

(327823, 37)
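
Both cells above build the download link the same way: the file ID is taken as the second-to-last segment of the sharing link and appended to a direct-download prefix. A minimal sketch of that step as a reusable helper follows. The name `drive_download_url` is hypothetical (it is not part of pandas or of any Google library), and the helper assumes the sharing link has the `.../file/d/<file_id>/view?...` form used in this module.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive sharing link into a direct-download link.

    Assumes the link looks like .../file/d/<file_id>/view?..., so that the
    file ID is the second-to-last '/'-separated segment.
    """
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


url_a = 'https://drive.google.com/file/d/1prPbFSiXFTHmTHzXTGxy4HrtRxXUHhce/view?usp=sharing'
path_a = drive_download_url(url_a)
print(path_a)
```

The resulting string can be passed to `pd.read_excel` exactly as `path_a` and `path_b` are above.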

In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two parts row-wise
complete_set = pd.concat([base_df_a, base_df_b])
complete_set.shape

(637822, 37)
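
Stacking the two parts row-wise keeps each part's original row labels, so the combined index contains duplicates (both parts start at 0). The sketch below uses two small toy frames, not the real data, to show this and how `ignore_index=True` yields a fresh index instead. Whether a fresh index is wanted in this project is an assumption; the cell above keeps the original labels.

```python
import pandas as pd

# Toy stand-ins for the two loaded parts (hypothetical data)
part_a = pd.DataFrame({'x': [1, 2, 3]})
part_b = pd.DataFrame({'x': [4, 5]})

# Default behaviour: row labels of each part are preserved
stacked = pd.concat([part_a, part_b])

# Alternative: discard the old labels and number the rows 0..n-1
reindexed = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())    # [0, 1, 2, 0, 1]
print(reindexed.index.tolist())  # [0, 1, 2, 3, 4]
```

Duplicate labels are harmless for the steps in this module, but they matter for later label-based lookups with `.loc`.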

### 2 - Excluding variables

In [11]:
# Drop the municipality name column
complete_set = complete_set.drop(columns=['NO_MUNICIPIO'])
complete_set.shape

(637822, 36)
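
The overview also lists "too many missing values" as a reason to exclude a variable. One way such a screen might look is sketched below on toy data. The 0.5 cutoff and the toy column names are hypothetical, chosen only for illustration; they are not the criterion used to pick `NO_MUNICIPIO`.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (hypothetical)
toy = pd.DataFrame({
    'income': [1000.0, np.nan, 2500.0, 1800.0],
    'city': [None, None, None, 'Recife'],
    'age': [21, 34, 29, 40],
})

# Share of missing values in each column
missing_share = toy.isna().mean()

# Hypothetical cutoff: drop columns with more than half of the values missing
threshold = 0.5
too_sparse = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=too_sparse)
print(missing_share)
print(too_sparse)
```

On the real data, `complete_set.isna().mean().sort_values(ascending=False)` would give the same per-column overview.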

### 3 - Renaming variables

In [12]:
complete_set.dtypes

NU_ANO_SEMESTRE_INSCRICAO                  int64
SG_SEXO                                   object
DS_OCUPACAO                               object
DS_ESTADO_CIVIL                           object
VL_RENDA_FAMILIAR_BRUTA_MENSAL           float64
VL_RENDA_PESSOAL_BRUTA_MENSAL            float64
SG_UF                                     object
DS_RACA_COR                               object
ST_ENSINO_MEDIO_ESCOLA_PUBLICA            object
NU_ANO_CONCLUSAO_ENSINO_MEDIO            float64
NU_SEMESTRE_REFERENCIA                     int64
SG_UF_CURSO                               object
NO_CURSO                                  object
VL_AVALIACAO_IGC                         float64
VL_FAIXA_CPC                             float64
VL_FAIXA_CC                              float64
QT_SEMESTRES_CURSO                         int64
QT_SEMESTRE_CONCLUIDO                      int64
QT_SEMESTRE_FINANCIAMENTO                  int64
QT_MESES_FINANC_SEMESTRE_ATUAL             int64
QT_MEMBRO           

In [14]:
new_names ={'NU_ANO_SEMESTRE_INSCRICAO':'year_enroll',
           'SG_SEXO':'gender',
           'DS_OCUPACAO':'occupation'}

# Rename columns only; the row index is left unchanged
complete_set = complete_set.rename(columns=new_names)
complete_set.dtypes

year_enroll                                int64
gender                                    object
occupation                                object
DS_ESTADO_CIVIL                           object
VL_RENDA_FAMILIAR_BRUTA_MENSAL           float64
VL_RENDA_PESSOAL_BRUTA_MENSAL            float64
SG_UF                                     object
DS_RACA_COR                               object
ST_ENSINO_MEDIO_ESCOLA_PUBLICA            object
NU_ANO_CONCLUSAO_ENSINO_MEDIO            float64
NU_SEMESTRE_REFERENCIA                     int64
SG_UF_CURSO                               object
NO_CURSO                                  object
VL_AVALIACAO_IGC                         float64
VL_FAIXA_CPC                             float64
VL_FAIXA_CC                              float64
QT_SEMESTRES_CURSO                         int64
QT_SEMESTRE_CONCLUIDO                      int64
QT_SEMESTRE_FINANCIAMENTO                  int64
QT_MESES_FINANC_SEMESTRE_ATUAL             int64
QT_MEMBRO           
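
By default `rename` silently ignores mapping keys that do not match any column, so a typo in `new_names` would go unnoticed. A small sketch, on an empty toy frame with a few of the real column names, of how `errors='raise'` can guard against that; the misspelled key is deliberately wrong and hypothetical.

```python
import pandas as pd

# Empty toy frame carrying a few of the original column names
toy = pd.DataFrame(columns=['NU_ANO_SEMESTRE_INSCRICAO', 'SG_SEXO',
                            'DS_OCUPACAO', 'SG_UF'])

new_names = {'NU_ANO_SEMESTRE_INSCRICAO': 'year_enroll',
             'SG_SEXO': 'gender',
             'DS_OCUPACAO': 'occupation'}

# errors='raise' makes rename fail if a key is not an existing column
renamed = toy.rename(columns=new_names, errors='raise')
print(list(renamed.columns))

# A deliberately misspelled key (hypothetical) is now reported, not ignored
try:
    toy.rename(columns={'SG_SEXX': 'gender'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```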

### 4 - Creating a training and a test set

In this section we create the training and test sets using the function *train_test_split* from Scikit-Learn. Two considerations guided this choice:

* Our dataset is a sample provided by the Brazilian Government and will not be updated. Therefore, we do not need to keep the split stable across future versions of the data;

* Our data includes 637,822 instances. We assume this is large enough for a purely random split to be representative, so we do not employ stratified sampling.
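
The second assumption can be sanity-checked by comparing category shares in the full data with those in the two subsets of a random split. The sketch below does this on synthetic data: the `gender` column, its 60/40 distribution, and the 10,000 rows are all made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic categorical column (hypothetical distribution)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'gender': rng.choice(['F', 'M'], size=10_000, p=[0.6, 0.4])})

# Purely random split, with the same parameters used in this module
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)

# Category shares in the full data and in each subset, side by side
shares = pd.DataFrame({
    'full': toy['gender'].value_counts(normalize=True),
    'train': train_toy['gender'].value_counts(normalize=True),
    'test': test_toy['gender'].value_counts(normalize=True),
})

# Largest absolute deviation of the test-set shares from the full data
max_gap = (shares['test'] - shares['full']).abs().max()
print(shares)
print(max_gap)
```

If the same comparison on the real columns showed a large gap for an important variable, passing that column to the `stratify` argument of `train_test_split` would be the usual alternative.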

In [15]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(complete_set, test_size=0.2, random_state=42)

In [16]:
train_set.shape

(510257, 36)

In [17]:
test_set.shape

(127565, 36)
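
The two shapes can be checked by hand. As far as we understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the training set receives the remainder; the arithmetic below relies on that assumption and reproduces the row counts shown above.

```python
import math

n_rows = 637_822   # rows in complete_set
test_size = 0.2    # fraction passed to train_test_split

# Assumed rule: test rows are rounded up, training rows are the remainder
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 510257 127565
```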