# Preprocessing
This **Jupyter Notebook** applies **preprocessing** to the dataset (or to part of it).

# Summary of the preliminary analysis
The previous stage was a brief analysis of the dataset. In **summary**:

 - We have a large dataset to work with:
   - 244,768 samples and 12 columns (attributes/fields/features).
 - However, most columns will need preprocessing, because they hold free text.
 - Some columns have many missing values, most notably **ContractType**, with **73%** of its values missing.
 - Statistics for the **"SalaryNormalized"** variable (feature):
   - The lowest annual salary was 5,000;
   - The highest annual salary was 200,000;
   - The mean annual salary was 34,122;
   - The median (2nd quartile, the 50% point) annual salary was 30,000:
     - Note that the median is not far from the mean.
   - The mode (most frequent salary) was 35,000, with 9,178 samples;
   - The 10 most frequent salaries were:
     - 35,000 with 9,178 samples;
     - 30,000 with 8,319 samples;
     - 40,000 with 7,688 samples;
     - 45,000 with 6,735 samples;
     - 25,000 with 6,309 samples;
     - 32,500 with 6,215 samples;
     - 37,500 with 5,756 samples;
     - 27,500 with 5,544 samples;
     - 50,000 with 5,424 samples;
     - 42,500 with 4,555 samples.
   - The standard deviation (how far values typically fall from the mean) is 17,640.
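
As a rough illustration of how statistics like the ones above can be obtained, the sketch below runs the usual pandas calls on a toy series. The values are hypothetical stand-ins for `df_train["SalaryNormalized"]`, not the real data.

```python
import pandas as pd

# Toy salaries standing in for df_train["SalaryNormalized"] (hypothetical values).
salaries = pd.Series([25000, 30000, 35000, 35000, 45000])

summary = {
    "min": salaries.min(),
    "max": salaries.max(),
    "mean": salaries.mean(),
    "median": salaries.median(),
    "mode": salaries.mode().iloc[0],  # most frequent value
    "std": salaries.std(),            # sample standard deviation (ddof=1)
}

# Most frequent salaries, sorted by how many samples share each value.
top_frequent = salaries.value_counts().head(10)

print(summary)
print(top_frequent)
```

On the real column, `value_counts().head(10)` would produce a ranking of the same kind as the top-10 list above.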

# The "Preprocessing" class
One of **GRIA**'s requirements for the challenge was that the code be *reusable* and *documented*, both to avoid duplicated code and to allow it to be reused in future work.

In [1]:
import pandas as pd
import scipy.sparse
import platform
import py7zr
import nltk
import os

from collections import Counter
from datetime import datetime

from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


class Preprocessing:

  def extract_7z_data(self, path):
    start_time = datetime.now()
    if platform.system() == 'Windows':
      try:
        # For Windows users.
        with py7zr.SevenZipFile(path, mode='r') as archive:
          archive.extractall(path=r"C:\Windows\Temp")
      except FileNotFoundError:
        print("File or path not found!")
      else:
        print("File extracted!")
    elif platform.system() == 'Linux':
      try:
        # For Linux users.
        with py7zr.SevenZipFile(path, mode='r') as archive:
          archive.extractall(path="/tmp")
      except FileNotFoundError:
        print("File or path not found!")
      else:
        print("File extracted!")
    else:
      print("This method only works with Windows and Linux Operating Systems.")
    end_time = datetime.now()
    print('Method runtime: {}'.format(end_time - start_time))


  def get_training_data(self):
    start_time = datetime.now()
    if platform.system() == 'Windows':
      try:
        # For Windows users.
        df_training = pd.read_csv(r"C:\Windows\Temp\Train_rev1.csv")
      except FileNotFoundError:
        print("File or path not found!")
      else:
        print("Training data ready!")
        end_time = datetime.now()
        print('Method runtime: {}'.format(end_time - start_time))
        return df_training
    elif platform.system() == 'Linux':
      try:
        # For Linux users.
        df_training = pd.read_csv("/tmp/Train_rev1.csv")
      except FileNotFoundError:
        print("File or path not found!")
      else:
        print("Training data ready!")
        end_time = datetime.now()
        print('Method runtime: {}'.format(end_time - start_time))
        return df_training


  def get_testing_data(self):
    start_time = datetime.now()
    if platform.system() == 'Windows':
      try:
        # For Windows users.
        df_testing = pd.read_csv(r"C:\Windows\Temp\Test_rev1.csv")
      except FileNotFoundError:
        print("File or path not found!")
      else:
        print("Testing data ready!")
        end_time = datetime.now()
        print('Method runtime: {}'.format(end_time - start_time))
        return df_testing
    elif platform.system() == 'Linux':
      try:
        # For Linux users.
        # Read the extracted CSV, not the .7z archive.
        df_testing = pd.read_csv("/tmp/Test_rev1.csv")
      except FileNotFoundError:
        print("File or path not found!")
      else:
        print("Testing data ready!")
        end_time = datetime.now()
        print('Method runtime: {}'.format(end_time - start_time))
        return df_testing


  def missing_by_numbers(self, df):
    missing = df.isnull().sum()
    return missing


  def missing_by_percent(self, df):
    try:
      percentMissing = (df.isnull().sum() / len(df.index)) * 100
    except ZeroDivisionError:
      print("Sorry! You are dividing by zero.")
    else:
      return percentMissing


  def apply_lower_casing(self, df):
    df = df.str.lower()
    return df


  def remove_punctuations(self, df):
    df = df.str.replace(r'[^\w\s]', ' ', regex=True)
    return df


  def remove_numbers(self, df):
    df = df.str.replace('[0-9]+', '', regex=True)
    return df


  def apply_stemming(self, df):
    stemmer = PorterStemmer()  # Stemmer instance.
    return " ".join([stemmer.stem(word) for word in str(df).split() ])


  def apply_lemmatization(self, df):
    def lemmatize(text):
      lemmatizer = WordNetLemmatizer()
      # Map the first letter of each Penn Treebank tag to a WordNet POS.
      wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
      pos_tagged_text = nltk.pos_tag(text.split())
      return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

    try:
      return lemmatize(df)
    except (LookupError, OSError):
      # NLTK resources are missing: download them once, then retry.
      nltk.download('wordnet')
      nltk.download('omw-1.4')
      nltk.download('averaged_perceptron_tagger')
      return lemmatize(df)


  def check_most_common_words(self, df):
    start_time = datetime.now()
    cnt_df = Counter() # Instance
    for text in df.values:
      for word in text.split():
        cnt_df[word] += 1
    df_most_common = pd.DataFrame(
      cnt_df.most_common(),
      columns = ["Word", "Frequency"]
    )
    end_time = datetime.now()
    print('Method runtime: {}'.format(end_time - start_time))
    return df_most_common.sort_values(by=["Frequency"], ascending=False)


  def apply_tfidf_vectorizer(self, df, max_df=1.0, min_df=1):
    start_time = datetime.now()
    vectorizer = TfidfVectorizer(
      stop_words="english",
      max_df = max_df,
      min_df = min_df,
    )
    df_vectorized = vectorizer.fit_transform(df)
    end_time = datetime.now()
    print("DataFrame Vectorized!")
    print('Method runtime: {}'.format(end_time - start_time))
    return df_vectorized


  def save_feature(self, feature_name, feature):
    scipy.sparse.save_npz(f"../resources/processed_features/{feature_name}", feature)
    print("Feature saved!")


  def save_to_csv(self, df, df_name):
    df.to_csv(
      path_or_buf=f"../resources/load/{df_name}.csv",
      header=True,
      encoding='utf-8',
      index=False,
    )
    print("DataFrame saved!")


Now that we have a class responsible for **preprocessing** our data, let's create an instance to work with:

In [2]:
# Preprocessing instance.
preprocessing = Preprocessing()

---

# 01 - Preparing the data & environment for preprocessing
In this stage we prepare the *data* and the *environment (Jupyter Notebook)* for our **preprocessing**.

---

## 01.1 - Installing the required libraries
First, let's install the libraries needed for our analysis (I already have them all installed in my virtual environment, but you can uncomment the line below to install them on your local machine or in a virtual environment).

In [3]:
#!pip install --upgrade -r ../requirements.txt --user

---

## 01.2 - Extracting the training and test datasets
Now let's extract the **training** and **test** datasets. To do so, we use the **extract_7z_data()** method of the **Preprocessing** class:

In [4]:
# Extract training set.
preprocessing.extract_7z_data("../datasets/Train_rev1.7z")

File extracted!
Method runtime: 0:00:09.909677


In [5]:
# Extract testing set.
preprocessing.extract_7z_data("../datasets/Test_rev1.7z")

File extracted!
Method runtime: 0:00:05.012095


**NOTE:**  
Because the dataset is very large, I downloaded the more compressed **.7z** version. I also chose to decompress it into a temporary location (the **/tmp** directory in my case, since I am on Linux), which works like a **staging area**.
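
The class hard-codes one temporary directory per operating system. As a minimal sketch of an alternative, the standard library can report the platform's temporary directory directly. The helper names `staging_dir` and `staged_file` are hypothetical and are not part of the `Preprocessing` class.

```python
import tempfile
from pathlib import Path


def staging_dir() -> Path:
    """Return the platform's temporary directory (for example /tmp on most Linux systems)."""
    return Path(tempfile.gettempdir())


def staged_file(filename: str) -> Path:
    """Build the path of a file inside the staging directory."""
    return staging_dir() / filename


# Where the extracted training CSV would be expected under this scheme.
train_csv = staged_file("Train_rev1.csv")
print(train_csv)
```

With this approach the same path could be passed both to the extraction call and to `pd.read_csv`, without branching on `platform.system()`.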

---

## 01.3 - Loading the training and test datasets
Finally, let's load the **training** and **test** datasets that were extracted with the **extract_7z_data()** method:

In [6]:
df_train = preprocessing.get_training_data()

Training data ready!
Method runtime: 0:00:06.855862


In [7]:
df_test = preprocessing.get_testing_data()

Testing data ready!
Method runtime: 0:00:03.755819


---

# 02 - Overview of the datasets
Now let's take a general look at the **training** and **test** datasets with the *Pandas* methods **info()** and **head()**.

In [8]:
df_train.info()
df_train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  244768 non-null  int64 
 1   Title               244767 non-null  object
 2   FullDescription     244768 non-null  object
 3   LocationRaw         244768 non-null  object
 4   LocationNormalized  244768 non-null  object
 5   ContractType        65442 non-null   object
 6   ContractTime        180863 non-null  object
 7   Company             212338 non-null  object
 8   Category            244768 non-null  object
 9   SalaryRaw           244768 non-null  object
 10  SalaryNormalized    244768 non-null  int64 
 11  SourceName          244767 non-null  object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


---

In [9]:
df_test.info()
df_test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  122463 non-null  int64 
 1   Title               122463 non-null  object
 2   FullDescription     122463 non-null  object
 3   LocationRaw         122463 non-null  object
 4   LocationNormalized  122463 non-null  object
 5   ContractType        33013 non-null   object
 6   ContractTime        90702 non-null   object
 7   Company             106202 non-null  object
 8   Category            122463 non-null  object
 9   SourceName          122463 non-null  object
dtypes: int64(1), object(9)
memory usage: 9.3+ MB


Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SourceName
0,11888454,Business Development Manager,The Company: Our client is a national training...,"Tyne Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Teaching Jobs,cv-library.co.uk
1,11988350,Internal Account Manager,The Company: Founded in **** our client is a U...,"Tyne and Wear, North East",Newcastle Upon Tyne,,permanent,Asset Appointments,Consultancy Jobs,cv-library.co.uk
2,12612558,Engineering Systems Analysts,Engineering Systems Analysts Surrey ****K Loca...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,cv-library.co.uk
3,12613014,CIS Systems Engineering Consultant,CIS Systems Engineering Consultant Bristol So...,"Bristol, South West, South West",Bristol,,permanent,Gregory Martin International,Engineering Jobs,cv-library.co.uk
4,22454872,CNC Miller / Programmer Fanac,"CNC Miller / Programmer Fanac Fleet, Hampshire...","Fleet, Hampshire",Fleet,,permanent,Gregory Martin International,Manufacturing Jobs,cv-library.co.uk


**NOTE:**  
If you look closely, you will see that the **test dataset** does not contain the **target** variables:
 - SalaryRaw
 - SalaryNormalized
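
The same observation can be checked programmatically. The sketch below uses plain lists copied from the two `info()` outputs. With the real DataFrames, the lists would come from `df_train.columns` and `df_test.columns`.

```python
# Column names as listed by df_train.info() and df_test.info().
train_columns = [
    "Id", "Title", "FullDescription", "LocationRaw", "LocationNormalized",
    "ContractType", "ContractTime", "Company", "Category",
    "SalaryRaw", "SalaryNormalized", "SourceName",
]
test_columns = [
    "Id", "Title", "FullDescription", "LocationRaw", "LocationNormalized",
    "ContractType", "ContractTime", "Company", "Category", "SourceName",
]

# Columns present in the training set but absent from the test set.
missing_in_test = [column for column in train_columns if column not in test_columns]
print(missing_in_test)
```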

---

# 03 - Applying preprocessing to the columns (features)
In this stage we preprocess each column individually. Note that whatever **preprocessing** is applied to a column (feature) in the *training* data must also be applied to the corresponding column in the *test* data, because that column (feature) will later be used to make predictions.
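
One way to keep the training and test steps in sync is to put them in a single function and call it on both sets. The sketch below does this for the three regex-style steps used later (lower casing, punctuation removal, number removal), using only the standard library. The name `clean_text` is hypothetical, and the final whitespace collapse is an extra step that the class does not perform.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same cleaning steps to any title, train or test."""
    text = text.lower()                   # lower casing
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    text = re.sub(r"[0-9]+", "", text)    # drop digits
    return " ".join(text.split())         # collapse repeated whitespace (extra step)


train_titles = ["Pioneer, Miser Engineering Systems Analyst"]
test_titles = ["CNC Miller / Programmer Fanac"]

# The same function is applied to both sets, so the steps cannot drift apart.
clean_train = [clean_text(title) for title in train_titles]
clean_test = [clean_text(title) for title in test_titles]
print(clean_train, clean_test)
```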

---

## 03.1 - Preprocessing the "Id" column (feature)
> This column (feature) does not need preprocessing. As we know, it is just the unique identifier of each sample.

---

## 03.2 - Preprocessing the "Title" column (feature)
> In short, the **Title** is a summary of the *position* or *role*.

### Preparing the "Title" *column (feature)* and assigning it the most suitable data type:

In [10]:
df_title_train = df_train[["Title"]].copy()
df_title_train = df_title_train.astype({'Title': 'string'})
df_title_train.info()
df_title_train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Title   244767 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,Title
0,Engineering Systems Analyst
1,Stress Engineer Glasgow
2,Modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Mod...
4,"Pioneer, Miser Engineering Systems Analyst"


In [11]:
df_title_test = df_test[["Title"]].copy()
df_title_test = df_title_test.astype({'Title': 'string'})
df_title_test.info()
df_title_test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122463 entries, 0 to 122462
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Title   122463 non-null  string
dtypes: string(1)
memory usage: 956.9 KB


Unnamed: 0,Title
0,Business Development Manager
1,Internal Account Manager
2,Engineering Systems Analysts
3,CIS Systems Engineering Consultant
4,CNC Miller / Programmer Fanac


### Checking what percentage (%) of the data is missing:

Let's start by checking the **number** of missing values in the **Title** column (feature):

In [12]:
preprocessing.missing_by_numbers(df_title_train)

Title    1
dtype: int64

In [13]:
preprocessing.missing_by_numbers(df_title_test)

Title    0
dtype: int64

**NOTE:**  
Note that in the training dataset, out of 244,768 samples, only one is missing its **title**. Now let's see what percentage this single missing title represents:

In [14]:
preprocessing.missing_by_percent(df_title_train)

Title    0.000409
dtype: float64

In [15]:
preprocessing.missing_by_percent(df_title_test)

Title    0.0
dtype: float64
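
The counts and percentages above can be cross-checked, and the affected row located, with a few lines of pandas. This is only a sketch on a toy frame with hypothetical rows. On the real data the same expressions would be applied to `df_train`.

```python
import pandas as pd

# Toy frame with one missing title (hypothetical rows).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Stress Engineer Glasgow", None, "Internal Account Manager"],
}).astype({"Title": "string"})

missing_count = toy["Title"].isna().sum()         # same idea as missing_by_numbers()
missing_percent = toy["Title"].isna().mean() * 100  # same idea as missing_by_percent()
missing_rows = toy[toy["Title"].isna()]           # the rows worth inspecting by hand

# The figure reported for the training set: 1 missing title out of 244,768 samples.
real_percent = 1 / 244768 * 100

print(missing_count, round(missing_percent, 2), round(real_percent, 6))
print(missing_rows)
```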

**NOTE:**  
Now comes the key question:

> **Why is only one of the samples missing its title?**

### Applying lower casing:

In [16]:
df_title_train["processed_title"] = preprocessing.apply_lower_casing(df_title_train["Title"])

In [17]:
df_title_train.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engineering systems analyst
1,Stress Engineer Glasgow,stress engineer glasgow
2,Modelling and simulation analyst,modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Mod...,engineering systems analyst / mathematical mod...
4,"Pioneer, Miser Engineering Systems Analyst","pioneer, miser engineering systems analyst"


In [18]:
df_title_test["processed_title"] = preprocessing.apply_lower_casing(df_title_test["Title"])

In [19]:
df_title_test.head()

Unnamed: 0,Title,processed_title
0,Business Development Manager,business development manager
1,Internal Account Manager,internal account manager
2,Engineering Systems Analysts,engineering systems analysts
3,CIS Systems Engineering Consultant,cis systems engineering consultant
4,CNC Miller / Programmer Fanac,cnc miller / programmer fanac


### Removing punctuation:

In [20]:
df_title_train["processed_title"] = preprocessing.remove_punctuations(df_title_train["processed_title"])

In [21]:
df_title_train.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engineering systems analyst
1,Stress Engineer Glasgow,stress engineer glasgow
2,Modelling and simulation analyst,modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Mod...,engineering systems analyst mathematical mod...
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engineering systems analyst


In [22]:
df_title_test["processed_title"] = preprocessing.remove_punctuations(df_title_test["processed_title"])

In [23]:
df_title_test.head()

Unnamed: 0,Title,processed_title
0,Business Development Manager,business development manager
1,Internal Account Manager,internal account manager
2,Engineering Systems Analysts,engineering systems analysts
3,CIS Systems Engineering Consultant,cis systems engineering consultant
4,CNC Miller / Programmer Fanac,cnc miller programmer fanac


### Removing numbers:

In [24]:
df_title_train["processed_title"] = preprocessing.remove_numbers(df_title_train["processed_title"])

In [25]:
df_title_train.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engineering systems analyst
1,Stress Engineer Glasgow,stress engineer glasgow
2,Modelling and simulation analyst,modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Mod...,engineering systems analyst mathematical mod...
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engineering systems analyst


In [26]:
df_title_test["processed_title"] = preprocessing.remove_numbers(df_title_test["processed_title"])

In [27]:
df_title_test.head()

Unnamed: 0,Title,processed_title
0,Business Development Manager,business development manager
1,Internal Account Manager,internal account manager
2,Engineering Systems Analysts,engineering systems analysts
3,CIS Systems Engineering Consultant,cis systems engineering consultant
4,CNC Miller / Programmer Fanac,cnc miller programmer fanac


### Applying stemming:

In [28]:
df_title_train["processed_title"] = df_title_train["processed_title"].apply(lambda text: preprocessing.apply_stemming(text))

In [29]:
df_title_train.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engin system analyst
1,Stress Engineer Glasgow,stress engin glasgow
2,Modelling and simulation analyst,model and simul analyst
3,Engineering Systems Analyst / Mathematical Mod...,engin system analyst mathemat model
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engin system analyst


In [30]:
df_title_test["processed_title"] = df_title_test["processed_title"].apply(lambda text: preprocessing.apply_stemming(text))

In [31]:
df_title_test.head()

Unnamed: 0,Title,processed_title
0,Business Development Manager,busi develop manag
1,Internal Account Manager,intern account manag
2,Engineering Systems Analysts,engin system analyst
3,CIS Systems Engineering Consultant,ci system engin consult
4,CNC Miller / Programmer Fanac,cnc miller programm fanac


### Applying lemmatization + part-of-speech tagging:

In [32]:
df_title_train["processed_title"] = df_title_train["processed_title"].apply(lambda text: preprocessing.apply_lemmatization(text))

In [33]:
df_title_train.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engin system analyst
1,Stress Engineer Glasgow,stress engin glasgow
2,Modelling and simulation analyst,model and simul analyst
3,Engineering Systems Analyst / Mathematical Mod...,engin system analyst mathemat model
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engin system analyst


In [34]:
df_title_test["processed_title"] = df_title_test["processed_title"].apply(lambda text: preprocessing.apply_lemmatization(text))

In [35]:
df_title_test.head()

Unnamed: 0,Title,processed_title
0,Business Development Manager,busi develop manag
1,Internal Account Manager,intern account manag
2,Engineering Systems Analysts,engin system analyst
3,CIS Systems Engineering Consultant,ci system engin consult
4,CNC Miller / Programmer Fanac,cnc miller programm fanac


### Analyzing the most frequent words (max_df & min_df):
Now we analyze which words are the most **frequent** and which are the most **rare**. This matters because it lets us remove words that will not help our models learn.

In [36]:
train_words_stats = preprocessing.check_most_common_words(df_title_train["processed_title"])

Method runtime: 0:00:01.075269


In [37]:
test_words_stats = preprocessing.check_most_common_words(df_title_test["processed_title"])

Method runtime: 0:00:00.509862


**NOTE:**  
Analyzing the most frequent words (max_df):

In [38]:
train_words_stats.sort_values("Frequency", ascending=False).head(30)

Unnamed: 0,Word,Frequency
0,manag,54152
1,engin,26731
2,develop,21242
3,sale,19814
4,senior,16987
5,assist,12760
6,account,12648
7,k,11336
8,execut,10976
9,consult,10521


In [39]:
test_words_stats.sort_values("Frequency", ascending=False).head(30)

Unnamed: 0,Word,Frequency
0,manag,27092
1,engin,13447
2,develop,10387
3,sale,9670
4,senior,8473
5,assist,6386
6,account,6316
7,k,5797
8,execut,5413
9,consult,5264


**NOTE:**  
Analyzing the rarest words (min_df):

In [40]:
train_words_stats.sort_values("Frequency", ascending=True).head(6000)

Unnamed: 0,Word,Frequency
15021,mlnlyck,1
15001,omniscop,1
15020,hydrol,1
15019,techniciancivil,1
15018,improvemen,1
...,...,...
8841,progam,2
8840,nft,2
8839,managerbdm,2
8838,sunsystem,2


In [41]:
test_words_stats.sort_values("Frequency", ascending=True).head(6000)

Unnamed: 0,Word,Frequency
11229,latent,1
11213,_graduat,1
11222,javascriptperm,1
11228,reimburs,1
11227,replac,1
...,...,...
6460,wadhurst,2
6556,dive,2
6558,arzneimittelsicherheit,2
6541,portrush,2
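
One detail is worth keeping in mind when reading these tables: `check_most_common_words()` counts every occurrence of a word, whereas `max_df` and `min_df` in scikit-learn are thresholds on *document frequency*, that is, the number or share of documents that contain the term. The sketch below contrasts the two counts on a few toy titles (hypothetical data, standard library only).

```python
from collections import Counter

# Toy processed titles (hypothetical); "engin" appears twice in the last one.
docs = [
    "engin system analyst",
    "stress engin glasgow",
    "model and simul analyst",
    "engin engin manag",
]

# Term frequency: every occurrence counts.
term_freq = Counter(word for doc in docs for word in doc.split())

# Document frequency: each document counts at most once per word.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Share of documents containing each word, comparable to float max_df/min_df values.
doc_share = {word: count / len(docs) for word, count in doc_freq.items()}

print(term_freq["engin"], doc_freq["engin"], doc_share["engin"])
```

For short texts such as job titles the two counts are usually close, but thresholds such as `max_df=0.60` are compared against the document share, not against the raw frequency shown in the tables.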


### Vectorizing with TfidfVectorizer:

In [42]:
df_title_train_vectorized = preprocessing.apply_tfidf_vectorizer(df_title_train["processed_title"], max_df=0.60, min_df=0.05)

DataFrame Vectorized!
Method runtime: 0:00:04.227998


In [43]:
df_title_train_vectorized

<244768x7 sparse matrix of type '<class 'numpy.float64'>'
	with 160292 stored elements in Compressed Sparse Row format>

In [44]:
df_title_test_vectorized = preprocessing.apply_tfidf_vectorizer(df_title_test["processed_title"], max_df=0.60, min_df=0.05)

DataFrame Vectorized!
Method runtime: 0:00:01.996416


In [45]:
df_title_test_vectorized

<122463x7 sparse matrix of type '<class 'numpy.float64'>'
	with 79802 stored elements in Compressed Sparse Row format>
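
In the cells above, each dataset is vectorized by its own `fit_transform` call, so the vocabulary and IDF weights are learned separately for training and test. Both matrices happen to have 7 columns here, but separate fits do not guarantee that the columns refer to the same terms. A common alternative, sketched below on toy titles (hypothetical data, not the notebook's method), is to fit on the training set only and reuse the fitted vectorizer on the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy processed titles (hypothetical).
train_titles = ["engin system analyst", "stress engin glasgow", "busi develop manag"]
test_titles = ["intern account manag", "engin system analyst"]

vectorizer = TfidfVectorizer(stop_words="english")

# Learn the vocabulary and IDF weights from the training titles only.
X_train = vectorizer.fit_transform(train_titles)

# Reuse the fitted vocabulary and weights, so test columns match train columns.
X_test = vectorizer.transform(test_titles)

print(X_train.shape, X_test.shape)
```

Terms that appear only in the test titles are ignored by `transform`, which is the behavior a model trained on `X_train` expects.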

---

## 03.3 - Preprocessing the "SalaryNormalized" column (feature)
> It has the same meaning as the **"SalaryRaw"** column, but **Adzuna** normalized the data so that it is expressed as an annualized value. This is the **target** variable.

**NOTE:**  
For now, we will work with it as normalized by **Adzuna**, without checking whether the normalization was done well. The goal of this approach is to have something available for the **training** and **validation** stage as quickly as possible.

---

# 04 - Load
> The **load** stage is responsible for saving the data already ***preprocessed*** for one or more columns (features).

**NOTE:**  
This stage follows an incremental logic: in each iteration **(Load-v1, Load-v2, ..., Load-vn)** we save the data manipulated so far, with the goal of finding a better metric or a better modeling of the data.

---

## 04.1 - Load-v1
For this first **Load**, we start with the simplest possible columns (features):
 - **Title** as the **independent** variable:
   - This variable was also *preprocessed* in the *test* data.
 - **SalaryNormalized** as the **dependent** variable.

**NOTE:**  
I chose these columns (features) because they have already received some **basic preprocessing** (which does not mean that further changes cannot be made).

### Saving the sparse matrix "df_title_vectorized" (result of preprocessing the Title feature):
First, let's save the result of **preprocessing** the **Title** column (feature).

In [46]:
preprocessing.save_feature(feature_name="df_title_train_vectorized", feature=df_title_train_vectorized)

Feature saved!


In [47]:
preprocessing.save_feature(feature_name="df_title_test_vectorized", feature=df_title_test_vectorized)

Feature saved!
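
For reference, a matrix saved this way can be read back with `scipy.sparse.load_npz`. The sketch below does a round trip on a tiny toy matrix inside a temporary directory. The file name `toy_feature.npz` is hypothetical. Note that `save_npz` appends the `.npz` extension when the given name does not already end with it, which is why `save_feature()` can be called without one.

```python
import tempfile
from pathlib import Path

import numpy as np
import scipy.sparse

# Tiny toy matrix standing in for a vectorized feature (hypothetical values).
matrix = scipy.sparse.csr_matrix(np.array([[0.0, 0.5], [1.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "toy_feature.npz")
    scipy.sparse.save_npz(target, matrix)     # write the sparse matrix to disk
    restored = scipy.sparse.load_npz(target)  # read it back

# The two matrices match when no element differs.
same = (matrix != restored).nnz == 0
print(restored.shape, same)
```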


### SalaryNormalized:
The **"SalaryNormalized"** column (feature) will be read directly at model training time, since no changes were made to it.

---

# Summaries

 - **Load-v1:**
   - In *Load-v1*, the *Title* column (feature) was *preprocessed*:
     - This variable was also *preprocessed* in the *test* data.
   - The *Title* column (feature) was vectorized with TfidfVectorizer using the following arguments:
     - stop_words="english"
     - max_df=0.60 (ignores terms that appear in MORE than 60% of documents)
     - min_df=0.05 (ignores terms that appear in LESS than 5% of documents)
   - The *SalaryNormalized* column (feature), already normalized by *Adzuna*, was also used.
   - The goal was to have *features* available for the *Modeling & Validation* stage as quickly as possible.


**Rodrigo Leite -** *drigols*