# Preprocessing
This **Jupyter Notebook** applies **preprocessing** to the dataset (or to part of it).

# Summary of the preliminary analysis
In the previous step we ran a brief analysis of the dataset. The **summary** of that analysis is as follows:

 - We have a large dataset to work with:
   - 244,768 samples and 12 columns (attributes/fields/features).
 - However, most columns will need preprocessing, because they hold free text.
 - Some columns have many missing values, most notably **ContractType**, with **73%** of its values missing.
 - Statistics for the **"SalaryNormalized"** variable (feature):
   - The lowest annual salary is 5,000;
   - The highest annual salary is 200,000;
   - The mean annual salary is 34,122;
   - The median (2nd quartile, the 50% point) annual salary is 30,000:
     - Note that the median is not far from the mean.
   - The mode (most frequent salary) is 35,000, with 9,178 samples;
   - The top 10 most frequent salaries are:
     - 35,000 with 9,178 samples;
     - 30,000 with 8,319 samples;
     - 40,000 with 7,688 samples;
     - 45,000 with 6,735 samples;
     - 25,000 with 6,309 samples;
     - 32,500 with 6,215 samples;
     - 37,500 with 5,756 samples;
     - 27,500 with 5,544 samples;
     - 50,000 with 5,424 samples;
     - 42,500 with 4,555 samples.
   - The standard deviation (how far values typically lie from the mean) is 17,640.
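
Statistics like the ones above can be computed directly with pandas. The following is a minimal sketch on a made-up salary sample; the values are illustrative and are not taken from the dataset, where the real numbers live in the "SalaryNormalized" column:

```python
import pandas as pd

# Hypothetical toy sample standing in for the "SalaryNormalized" column.
salaries = pd.Series(
    [25000, 30000, 30000, 35000, 35000, 35000, 50000],
    name="SalaryNormalized",
)

summary = {
    "min": salaries.min(),
    "max": salaries.max(),
    "mean": salaries.mean(),
    "median": salaries.median(),
    "mode": salaries.mode().iloc[0],  # most frequent value
    "std": salaries.std(),            # sample standard deviation (ddof=1)
}

# Salary values ordered by how often they occur (most frequent first).
top_frequent = salaries.value_counts().head(10)
```

On the full data, the same calls applied to `full_df["SalaryNormalized"]` should give the figures listed in the summary.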

# Classe "Preprocessing"
One of **GRIA**'s requirements for the challenge was that the code be *reusable* and *documented*, both to avoid duplicated code and to allow it to be reused in future work.

In [1]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

import pandas as pd
import scipy.sparse
import platform
import py7zr
import nltk
import os


class Preprocessing:

  def extract_7z_data(self, path):
    """Extract a .7z archive into the OS temporary directory (Windows/Linux only)."""
    # Raw string so the backslashes are not read as escape sequences.
    targets = {"Windows": r"C:\Windows\Temp", "Linux": "/tmp"}
    target = targets.get(platform.system())
    if target is None:
      print("This method only works with Windows and Linux Operating Systems.")
      return
    try:
      with py7zr.SevenZipFile(path, mode='r') as archive:
        archive.extractall(path=target)
    except FileNotFoundError:
      print("File or path not found!")
    else:
      print("File extracted!")


  def get_traing_data(self):
    """Load the extracted training CSV from the OS temporary directory."""
    paths = {"Windows": r"C:\Windows\Temp\Train_rev1.csv", "Linux": "/tmp/Train_rev1.csv"}
    path = paths.get(platform.system())
    if path is None:
      print("This method only works with Windows and Linux Operating Systems.")
      return None
    try:
      df_training = pd.read_csv(path)
    except FileNotFoundError:
      print("File or path not found!")
      return None
    print("Training data ready!")
    return df_training


  def get_testing_data(self):
    """Load the extracted testing CSV from the OS temporary directory."""
    # Assumes Test_rev1.7z was extracted to Test_rev1.csv first;
    # pd.read_csv cannot read a .7z archive directly.
    paths = {"Windows": r"C:\Windows\Temp\Test_rev1.csv", "Linux": "/tmp/Test_rev1.csv"}
    path = paths.get(platform.system())
    if path is None:
      print("This method only works with Windows and Linux Operating Systems.")
      return None
    try:
      df_testing = pd.read_csv(path)
    except FileNotFoundError:
      print("File or path not found!")
      return None
    print("Testing data ready!")
    return df_testing

  def missing_by_numbers(self, df):
    missing = df.isnull().sum()
    return missing


  def missing_by_percent(self, df):
    # Use the DataFrame passed in (not a global variable) so the method works for any column.
    if len(df.index) == 0:
      print("Sorry! The DataFrame is empty.")
      return None
    return (df.isnull().sum() / len(df.index)) * 100


  def apply_lower_casing(self, df):
    df = df.str.lower()
    return df


  def remove_punctuations(self, df):
    df = df.str.replace(r'[^\w\s]', ' ', regex=True)  # Raw string: \w and \s are regex classes.
    return df

  
  def remove_numbers(self, df):
    df = df.str.replace('[0-9]+', '', regex=True)
    return df


  def apply_stemming(self, df):
    # Receives a single text value (used with Series.apply), not a whole DataFrame.
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in str(df).split()])


  def apply_lemmatization(self, df):
    # Receives a single text value (used with Series.apply), not a whole DataFrame.
    def lemmatize(text):
      lemmatizer = WordNetLemmatizer()
      # Map the first letter of each POS tag to the matching WordNet POS.
      wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
      pos_tagged_text = nltk.pos_tag(text.split())
      return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

    try:
      return lemmatize(df)
    except (LookupError, OSError):
      # NLTK resources are missing: download them, then retry once.
      nltk.download('wordnet')
      nltk.download('omw-1.4')
      nltk.download('averaged_perceptron_tagger')
      return lemmatize(df)


  def apply_countVectorizer(self, df):
    vectorizer = CountVectorizer(
      stop_words="english", # Add stopwords.
      max_df=0.60, # Ignores terms that appear in MORE than 60% of documents.
      min_df=0.05, # Ignores terms that appear in LESS than 5% of documents
    )
    df_vectorized = vectorizer.fit_transform(df)
    return df_vectorized


  def save_feature(self, feature_name, feature):
    scipy.sparse.save_npz(f"../resources/processed_features/{feature_name}", feature)
    print("Feature saved!")

Now that we have a class responsible for **preprocessing** our data, let's create an instance to work with:

In [2]:
# Preprocessing instance.
preprocessing = Preprocessing()

---

# 01 - Preparing the data & environment for preprocessing
In this step we prepare the *data* and the *environment (Jupyter Notebook)* for our **preprocessing**.

---

## 01.1 - Installing the required libraries
First, let's install the libraries needed for the analysis. (I already have them all installed in my virtual environment, but you can uncomment the line below to install them on your local machine or in your own virtual environment.)

In [3]:
#!pip install --upgrade -r ../requirements.txt --user

---

## 01.2 - Extracting the training dataset
Now let's extract the training dataset. To do that, we use the **extract_7z_data()** method of the **Preprocessing** class:

In [4]:
preprocessing.extract_7z_data("../datasets/Train_rev1.7z")

File extracted!


**NOTE:**  
Because the dataset is very large, I downloaded the most compressed version, **.7z**. I also chose to decompress it into a temporary location (the **/tmp** directory in my case, since I am on Linux), which works like a **staging area**.
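
The class above hard-codes one temporary directory per operating system. As a hedged alternative, the standard library's `tempfile.gettempdir()` returns the temporary directory of the current platform. The sketch below uses a hypothetical helper, `staging_path`, that is not part of the `Preprocessing` class:

```python
import os
import tempfile


def staging_path(filename):
    """Return the path of `filename` inside the OS temporary directory."""
    # Hypothetical helper: gettempdir() picks the platform's temp directory.
    return os.path.join(tempfile.gettempdir(), filename)


train_csv = staging_path("Train_rev1.csv")
```

On Windows, `gettempdir()` may point somewhere other than `C:\Windows\Temp`, so extraction and loading would both need to use the same helper for this to work.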

---

## 01.3 - Configuring the output size
Before starting our **preprocessing**, let's configure *pandas* to display the full content of each sample:

In [5]:
import pandas as pd
pd.options.display.max_colwidth = 100000

---

## 01.4 - Loading the training dataset
Finally, let's load the training dataset that was extracted with the **extract_7z_data()** method:

In [6]:
full_df = preprocessing.get_traing_data()

Training data ready!


---

# 02 - Overview of the training dataset
Now let's take a general look at the training dataset with the *pandas* **info()** method.

In [7]:
full_df.info()
full_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  244768 non-null  int64 
 1   Title               244767 non-null  object
 2   FullDescription     244768 non-null  object
 3   LocationRaw         244768 non-null  object
 4   LocationNormalized  244768 non-null  object
 5   ContractType        65442 non-null   object
 6   ContractTime        180863 non-null  object
 7   Company             212338 non-null  object
 8   Category            244768 non-null  object
 9   SalaryRaw           244768 non-null  object
 10  SalaryNormalized    244768 non-null  int64 
 11  SourceName          244767 non-null  object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,"Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K","Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,"Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****","Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,"Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire Up to ****K AAE pension contribution, private medical and dental The opportunity Our client is an independent consultancy firm which has an opportunity for a Data Analyst with 35 years experience. The role will require the successful candidate to demonstrate their ability to analyse a problem and arrive at a solution, with varying levels of data being available. Essential skills Thorough knowledge of Excel and proven ability to utilise this to create powerful decision support models Experience in Modelling and Simulation Techniques, Experience of techniques such as Discrete Event Simulation and/or SD modelling Mathematical/scientific background minimum degree qualified Proven analytical and problem solving skills Self Starter Ability to develop solid working relationships In addition to formal qualifications and experience, the successful candidate will require excellent written and verbal communication skills, be energetic, enterprising and have a determination to succeed. They will be required to build solid working relationships, both internally with colleagues and, most importantly, externally with our clients. They must be comfortable working independently to deliver against challenging client demands. The offices are located in Basingstoke, Hampshire, but our client work for clients worldwide. The successful candidate must therefore be prepared to undertake work at client sites for short periods of time. Physics, Mathematics, Modelling, Simulation, Analytical, Operational Research, Mathematical Modelling Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire ****K AAE pension contribution, private medical and dental","Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Modeller,"Engineering Systems Analyst / Mathematical Modeller. Our client is a highly successful and respected Consultancy providing specialist software development MISER, PIONEER, Maths, Mathematical, Optimisation, Risk Analysis, Asset Management, Water Industry, Access, Excel, VBA, SQL, Systems . Engineering Systems Analyst / Mathematical Modeller. Salary ****K****K negotiable Location Dorking, Surrey","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K Located in Surrey, our client provides specialist software development Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


---

# 03 - Applying preprocessing to the columns (features)
In this step we apply **preprocessing** to each column individually.

---

## 03.1 - Preprocessing the "Id" column (feature)
> This column (feature) does not need preprocessing. As we know, it is just the unique identifier of each sample.

---

## 03.2 - Preprocessing the "Title" column (feature)
> In short, the **Title** is a summary of the *position* or *role*.

### Preparing the "Title" *column (feature)* and giving it the most suitable data type:

In [8]:
df_Title = full_df[["Title"]].copy()
df_Title = df_Title.astype({'Title': 'string'})
df_Title.info()
df_Title.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Title   244767 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,Title
0,Engineering Systems Analyst
1,Stress Engineer Glasgow
2,Modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Modeller
4,"Pioneer, Miser Engineering Systems Analyst"


### Checking what percentage (%) of the data is missing:

Let's start by checking the **number** of missing values in the **Title** column (feature):

In [9]:
preprocessing.missing_by_numbers(df_Title)

Title    1
dtype: int64

Of the 244,768 samples, only one is missing its **title**. Now let's see what percentage that single missing title represents:

In [10]:
preprocessing.missing_by_percent(df_Title)

Title    0.000409
dtype: float64

**NOTE:**  
Now comes the key question:

> **Why is only one of the samples missing its title?**
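
One way to start answering that question is to look at the affected row itself. The sketch below uses a small made-up DataFrame, `toy_df`, as a stand-in for `full_df`; the rows are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for full_df, with one missing title.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Stress Engineer Glasgow", None, "Engineering Systems Analyst"],
    "FullDescription": ["text a", "text b", "text c"],
})

# Boolean mask selects only the rows where Title is missing.
missing_rows = toy_df[toy_df["Title"].isnull()]

# The mean of a boolean mask is the fraction of True values.
missing_percent = toy_df["Title"].isnull().mean() * 100
```

Applied to the real data, `full_df[full_df["Title"].isnull()]` would show the other fields of that sample (for example its description), which may hint at why the title is absent.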

### Applying lower casing:

In [11]:
df_Title["processed_title"] = preprocessing.apply_lower_casing(df_Title["Title"])

In [12]:
df_Title.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engineering systems analyst
1,Stress Engineer Glasgow,stress engineer glasgow
2,Modelling and simulation analyst,modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Modeller,engineering systems analyst / mathematical modeller
4,"Pioneer, Miser Engineering Systems Analyst","pioneer, miser engineering systems analyst"


### Removing punctuation:

In [13]:
df_Title["processed_title"] = preprocessing.remove_punctuations(df_Title["processed_title"])

In [14]:
df_Title.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engineering systems analyst
1,Stress Engineer Glasgow,stress engineer glasgow
2,Modelling and simulation analyst,modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Modeller,engineering systems analyst mathematical modeller
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engineering systems analyst


### Removing numbers:

In [15]:
df_Title["processed_title"] = preprocessing.remove_numbers(df_Title["processed_title"])

In [16]:
df_Title.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engineering systems analyst
1,Stress Engineer Glasgow,stress engineer glasgow
2,Modelling and simulation analyst,modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Modeller,engineering systems analyst mathematical modeller
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engineering systems analyst
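
The three cleaning steps so far (lower casing, punctuation removal, number removal) can also be chained on a single pandas Series. This is a minimal sketch: the third title is made up to show number removal, and the final whitespace-collapsing step is an extra assumption that the `Preprocessing` class does not perform:

```python
import pandas as pd

titles = pd.Series(
    [
        "Engineering Systems Analyst / Mathematical Modeller",
        "Pioneer, Miser Engineering Systems Analyst",
        "Java Developer 2 Positions",  # hypothetical title containing a number
    ],
    dtype="string",
)

cleaned = (
    titles.str.lower()
    .str.replace(r"[^\w\s]", " ", regex=True)  # punctuation -> space
    .str.replace(r"[0-9]+", "", regex=True)    # drop digits
    .str.replace(r"\s+", " ", regex=True)      # extra step: collapse repeated spaces
    .str.strip()
)
```

Replacing punctuation with a space can leave runs of spaces behind (for example around the "/"), which is why the sketch collapses whitespace at the end. The later tokenization with `split()` ignores repeated spaces either way.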


### Applying stemming:

In [17]:
df_Title["processed_title"] = df_Title["processed_title"].apply(lambda text: preprocessing.apply_stemming(text))

In [18]:
df_Title.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engin system analyst
1,Stress Engineer Glasgow,stress engin glasgow
2,Modelling and simulation analyst,model and simul analyst
3,Engineering Systems Analyst / Mathematical Modeller,engin system analyst mathemat model
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engin system analyst


### Applying lemmatization + part-of-speech tagging:

In [19]:
df_Title["processed_title"] = df_Title["processed_title"].apply(lambda text: preprocessing.apply_lemmatization(text))

In [20]:
df_Title.head()

Unnamed: 0,Title,processed_title
0,Engineering Systems Analyst,engin system analyst
1,Stress Engineer Glasgow,stress engin glasgow
2,Modelling and simulation analyst,model and simul analyst
3,Engineering Systems Analyst / Mathematical Modeller,engin system analyst mathemat model
4,"Pioneer, Miser Engineering Systems Analyst",pioneer miser engin system analyst
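
In the first five rows, the lemmatized output is identical to the stemmed output. A likely explanation is that the lemmatizer looks words up in WordNet, and stems such as "engin" or "simul" are not dictionary words, so they come back unchanged. The sketch below reproduces only the stemming step on the first title, using a hypothetical helper `stem_text`:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stem_text(text):
    """Stem every whitespace-separated word of `text` with the Porter stemmer."""
    return " ".join(stemmer.stem(word) for word in text.split())


stemmed = stem_text("engineering systems analyst")
```

If lemmatization is meant to have an effect, one option is to apply it to the cleaned text instead of the stemmed text and compare both variants downstream.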


### Vectorizing with CountVectorizer:

In [21]:
df_title_vectorized = preprocessing.apply_countVectorizer(df_Title["processed_title"])

In [22]:
df_title_vectorized

<244768x7 sparse matrix of type '<class 'numpy.int64'>'
	with 160292 stored elements in Compressed Sparse Row format>
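
The resulting matrix has only 7 columns because `min_df=0.05` keeps just the terms that appear in at least 5% of the titles. The sketch below shows how `min_df` and `max_df` prune the vocabulary on a made-up corpus of four documents; the thresholds are chosen for the toy data and `stop_words` is left out to keep the result easy to follow:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of already-stemmed titles.
docs = [
    "engin system analyst",
    "stress engin glasgow",
    "model simul analyst",
    "engin system analyst model",
]

vectorizer = CountVectorizer(
    min_df=0.5,  # drop terms found in fewer than 50% of documents
    max_df=0.7,  # drop terms found in more than 70% of documents
)
matrix = vectorizer.fit_transform(docs)

# Terms that survived both thresholds, in alphabetical order.
vocabulary = sorted(vectorizer.vocabulary_)
```

Here "engin" and "analyst" appear in 3 of 4 documents (75%) and are removed by `max_df`, while "stress", "glasgow" and "simul" appear in 1 of 4 (25%) and are removed by `min_df`, leaving "model" and "system".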

---

## 03.3 - Preprocessing the "SalaryNormalized" column (feature)
> It has the same meaning as the **"SalaryRaw"** column, but **Adzuna** normalized the values to an annualized figure. This is the **target** variable.

**NOTE:**  
For now, we will use it as normalized by **Adzuna**, setting aside whether that normalization was done well. The goal of this approach is to have something available for the **training** and **validation** step as quickly as possible.

---

# 04 - Load
> The **load** step is responsible for saving the data that has already been ***preprocessed***, for one or more columns (features).

**NOTE:**  
This step follows an incremental logic: in each iteration **(Load-v1, Load-v2, ..., Load-vn)** we save the data manipulated so far, aiming to find a better metric or a better way of modelling the data.

---

## 04.1 - Load-v1
For this first **Load** we start with the simplest possible columns (features):
 - **Title** as the **independent** variable.
 - **"SalaryNormalized"** as the **dependent** variable.

**NOTE:**  
I chose these columns (features) because they have already received some **basic preprocessing** (which does not mean that further changes cannot be made).

### Saving the sparse matrix "df_title_vectorized" (result of preprocessing the Title feature):
First, let's save the result of the **preprocessing** of the **Title** column (feature).

In [23]:
preprocessing.save_feature(feature_name="df_title_vectorized", feature=df_title_vectorized)

Feature saved!
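
A saved sparse matrix can be read back with `scipy.sparse.load_npz`. The sketch below does a save/load round trip in a temporary directory with a tiny made-up matrix; the file name `toy_feature.npz` is hypothetical:

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Hypothetical toy matrix standing in for df_title_vectorized.
feature = scipy.sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_feature.npz")
    scipy.sparse.save_npz(path, feature)
    restored = scipy.sparse.load_npz(path)

# The element-wise inequality has no stored entries when both matrices match.
same = (feature != restored).nnz == 0
```

Note that `save_npz` appends the `.npz` extension when the file name lacks it, so the feature saved above should end up as `df_title_vectorized.npz`.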


### SalaryNormalized:
The **"SalaryNormalized"** column (feature) will be read at model training time, since no changes were made to it.

---

# Summaries

 - **Load-v1:**
   - In *Load-v1* the *Title* column (feature) was *preprocessed*.
   - The *Title* column (feature) was vectorized with CountVectorizer using the following arguments:
     - stop_words="english"
     - max_df=0.60 (ignores terms that appear in MORE than 60% of documents)
     - min_df=0.05 (ignores terms that appear in LESS than 5% of documents)
   - The *SalaryNormalized* column (feature), already normalized by *Adzuna*, was also used.
   - The goal was to have *features* available as quickly as possible for the *Modelling & Validation* step.


**Rodrigo Leite -** *drigols*