# Preprocessing
This **Jupyter Notebook** applies **preprocessing** to the dataset (or to part of it).

# Summary of the preliminary analysis
The previous step was a brief analysis of the dataset. The **summary** of that analysis was:

 - We have a large dataset to work with:
   - 244,768 samples and 12 columns (features).
 - However, most columns will need preprocessing, because they hold free text.
 - Some columns have many missing values, mainly **ContractType**, which is missing in **73%** of the samples.

# 01 - Downloading, importing & initial settings

Let's start by installing the required libraries (I already have them all installed in my virtual environment, but you can uncomment the line and install them on your local machine or in a virtual environment):

In [1]:
# !pip install --upgrade -r ../requirements.txt

Now let's import the required libraries:

In [2]:
import pandas as pd
import py7zr

Now let's extract the dataset:

In [3]:
with py7zr.SevenZipFile("../datasets/Train_rev1.7z", mode='r') as archive:
  archive.extractall(path="/tmp") # For Linux users.

**NOTE:**  
Because the dataset is very large, I downloaded the more heavily compressed **.7z** version. I also chose to extract it to a temporary location (the **/tmp** directory, since I am on Linux), which works like a **staging area**.
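The extraction above hard-codes `/tmp`, which only exists on Unix-like systems. A minimal sketch of a more portable staging path, assuming the extracted CSV keeps the archive's base name (the constant names `ARCHIVE_PATH` and `STAGING_DIR` are hypothetical and not part of the notebook):

```python
import tempfile
from pathlib import Path

# Hypothetical constants for illustration; only the archive name comes from the notebook.
ARCHIVE_PATH = Path("../datasets/Train_rev1.7z")
STAGING_DIR = Path(tempfile.gettempdir())  # resolves to "/tmp" on most Linux systems

# Assumption: the CSV inside the archive shares the archive's base name.
csv_path = STAGING_DIR / (ARCHIVE_PATH.stem + ".csv")
print(csv_path)
```

`STAGING_DIR` could then be passed as the `path` argument of `extractall`, and `csv_path` to `pd.read_csv`.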

**Configuring the output size:**  
Before starting the analysis, let's configure Pandas to display the full content of each sample:

In [4]:
pd.options.display.max_colwidth = 100000

Finally, let's load the extracted dataset:

In [5]:
full_df = pd.read_csv("/tmp/Train_rev1.csv")

# 02 - Overview of the dataset
Let's start with an overview of the data:

In [6]:
full_df.info()
full_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  244768 non-null  int64 
 1   Title               244767 non-null  object
 2   FullDescription     244768 non-null  object
 3   LocationRaw         244768 non-null  object
 4   LocationNormalized  244768 non-null  object
 5   ContractType        65442 non-null   object
 6   ContractTime        180863 non-null  object
 7   Company             212338 non-null  object
 8   Category            244768 non-null  object
 9   SalaryRaw           244768 non-null  object
 10  SalaryNormalized    244768 non-null  int64 
 11  SourceName          244767 non-null  object
dtypes: int64(2), object(10)
memory usage: 22.4+ MB


Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,"Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K","Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,"Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****","Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,"Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire Up to ****K AAE pension contribution, private medical and dental The opportunity Our client is an independent consultancy firm which has an opportunity for a Data Analyst with 35 years experience. The role will require the successful candidate to demonstrate their ability to analyse a problem and arrive at a solution, with varying levels of data being available. Essential skills Thorough knowledge of Excel and proven ability to utilise this to create powerful decision support models Experience in Modelling and Simulation Techniques, Experience of techniques such as Discrete Event Simulation and/or SD modelling Mathematical/scientific background minimum degree qualified Proven analytical and problem solving skills Self Starter Ability to develop solid working relationships In addition to formal qualifications and experience, the successful candidate will require excellent written and verbal communication skills, be energetic, enterprising and have a determination to succeed. They will be required to build solid working relationships, both internally with colleagues and, most importantly, externally with our clients. They must be comfortable working independently to deliver against challenging client demands. The offices are located in Basingstoke, Hampshire, but our client work for clients worldwide. The successful candidate must therefore be prepared to undertake work at client sites for short periods of time. Physics, Mathematics, Modelling, Simulation, Analytical, Operational Research, Mathematical Modelling Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire ****K AAE pension contribution, private medical and dental","Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Modeller,"Engineering Systems Analyst / Mathematical Modeller. Our client is a highly successful and respected Consultancy providing specialist software development MISER, PIONEER, Maths, Mathematical, Optimisation, Risk Analysis, Asset Management, Water Industry, Access, Excel, VBA, SQL, Systems . Engineering Systems Analyst / Mathematical Modeller. Salary ****K****K negotiable Location Dorking, Surrey","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K Located in Surrey, our client provides specialist software development Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


# 03 - Preprocessing the columns (features)
In this step we apply **preprocessing** to each column individually.

---

## 03.1 - Preprocessing the "Id" column (feature)
> This column (feature) needs no preprocessing. As we know, it is just the unique identifier of each sample.

In [8]:
full_df['Id'].head()

0    12612628
1    12612830
2    12612844
3    12613049
4    12613647
Name: Id, dtype: int64

---

## 03.2 - Preprocessing the "Title" column (feature)
> In short, **Title** is a summary of the *position* or *role*.

### Preparing the "Title" *column (feature)* and setting the most suitable data type:

In [9]:
df_Title = full_df[["Title"]].copy()
df_Title = df_Title.astype({'Title': 'string'})
df_Title.info()
df_Title.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Title   244767 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,Title
0,Engineering Systems Analyst
1,Stress Engineer Glasgow
2,Modelling and simulation analyst
3,Engineering Systems Analyst / Mathematical Modeller
4,"Pioneer, Miser Engineering Systems Analyst"


### Checking what percentage (%) of the data is missing:

Let's start by checking the **number** of missing values in the **Title** column (feature):

In [10]:
# Data missing sum.
missing = df_Title.isnull().sum()
missing

Title    1
dtype: int64

Of the 244,768 samples, only one is missing its **Title**. Let's see what percentage that single missing title represents:

In [11]:
# Data missing in percent.
percentMissing = (missing / len(df_Title.index)) * 100
percentMissing

Title    0.000409
dtype: float64

**NOTE:**  
Now comes the key question:

> **Why is only one of the samples missing its title?**
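One way to start answering that question is to look at the whole row whose title is missing, since the other columns may hint at the cause. A small sketch on a toy DataFrame (the values below are made up; in the notebook the same mask would be applied to `full_df`):

```python
import pandas as pd

# Toy stand-in for full_df; these values are invented for illustration.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Stress Engineer", None, "Data Analyst"],
    "Category": ["Engineering Jobs", "HR & Recruitment Jobs", "IT Jobs"],
})

# Boolean mask: True where Title is missing.
missing_title_mask = toy_df["Title"].isna()

# Select the full rows so the remaining columns can be inspected.
rows_without_title = toy_df[missing_title_mask]
print(rows_without_title)
```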

---

## 03.3 - Preprocessing the "FullDescription" column (feature)
> The full text of the job ad, as provided by the job advertiser.

**NOTE:**  
Wherever a salary appeared in the description, the values were stripped out (they show up as `****`) so that no salary information leaks through the descriptions. There may be some collateral damage here, since other numbers were removed as well.
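To get a feel for how often values were stripped, one option is to count the `****` masks in each description. A minimal sketch on three sample strings taken from the descriptions in this dataset (the variable names are hypothetical):

```python
import pandas as pd

descriptions = pd.Series([
    "Stress Engineer Glasgow Salary **** to ****",
    "Salary ****K****K negotiable Location Dorking, Surrey",
    "A full clean driving license is also required.",
], dtype="string")

# Count non-overlapping occurrences of four consecutive asterisks per description.
mask_counts = descriptions.str.count(r"\*{4}")

# Flag descriptions that contain at least one stripped value.
has_mask = mask_counts > 0
print(mask_counts.tolist(), has_mask.tolist())
```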

### Preparing the "FullDescription" *column (feature)* and setting the most suitable data type:

In [12]:
df_FullDescription = full_df[["FullDescription"]].copy()
df_FullDescription = df_FullDescription.astype({'FullDescription': 'string'})
df_FullDescription.info()
df_FullDescription.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   FullDescription  244768 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,FullDescription
0,"Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K"
1,"Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****"
2,"Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire Up to ****K AAE pension contribution, private medical and dental The opportunity Our client is an independent consultancy firm which has an opportunity for a Data Analyst with 35 years experience. The role will require the successful candidate to demonstrate their ability to analyse a problem and arrive at a solution, with varying levels of data being available. Essential skills Thorough knowledge of Excel and proven ability to utilise this to create powerful decision support models Experience in Modelling and Simulation Techniques, Experience of techniques such as Discrete Event Simulation and/or SD modelling Mathematical/scientific background minimum degree qualified Proven analytical and problem solving skills Self Starter Ability to develop solid working relationships In addition to formal qualifications and experience, the successful candidate will require excellent written and verbal communication skills, be energetic, enterprising and have a determination to succeed. They will be required to build solid working relationships, both internally with colleagues and, most importantly, externally with our clients. They must be comfortable working independently to deliver against challenging client demands. The offices are located in Basingstoke, Hampshire, but our client work for clients worldwide. The successful candidate must therefore be prepared to undertake work at client sites for short periods of time. Physics, Mathematics, Modelling, Simulation, Analytical, Operational Research, Mathematical Modelling Mathematical Modeller / Simulation Analyst / Operational Analyst Basingstoke, Hampshire ****K AAE pension contribution, private medical and dental"
3,"Engineering Systems Analyst / Mathematical Modeller. Our client is a highly successful and respected Consultancy providing specialist software development MISER, PIONEER, Maths, Mathematical, Optimisation, Risk Analysis, Asset Management, Water Industry, Access, Excel, VBA, SQL, Systems . Engineering Systems Analyst / Mathematical Modeller. Salary ****K****K negotiable Location Dorking, Surrey"
4,"Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K Located in Surrey, our client provides specialist software development Pioneer, Miser Engineering Systems Analyst Dorking Surrey Salary ****K"
5,"Engineering Systems Analyst Water Industry Location: Dorking Surrey Salary: **** to **** Located in Surrey, our client provides specialist software development Systems Analysis and Software Engineering. The projects cover a wide variety of topics typically working in small teams. Our client can offer you Intellectually challenging work undertaken within a supportive environment where personal development is nurtured and rewarded. This role will be working within a small team working on the modelling of water industry asset deterioration and asset failure consequences, including the uploading of these models onto industryleading optimal asset management software Strong maths, stats and IT skills needed, Any previous experience within the Water industry would be an advantage. Candidate requirements Candidates should have a good honours degree in a numerate discipline i.e. Engineering; Mathematics; Science; Computing/Software. Candidates need to be highly qualified within Physics, Mathematics or Engineering discipline ideally First Class Degree, or **** PhD or Masters, MEng. Candidates should ideally should have a background in Technical Consultancy; Systems Analysis; Software Engineering or a Graduate level looking to develop their career. Any experience of Pioneer or Miser software would be an advantage. Key areas: Mathematical modelling, leakage management, optimisation. risk analysis. Physics, Mathematics, Engineering Engineering; Mathematics; Science; Computing/Software Pioneer or Miser Engineering Systems Analyst Water Industry Location: Dorking, Surrey Salary **** to ****"
6,"A globally renowned engineering and training company in the Oil Develop and manage both internal and external inspection plans and plan remedial and preventative maintenance work Implement, apply and update pipeline databases and spreadsheets and assist operators with projects, including project management, assurance support and procedure reviews Handle consulting, problem solving, risk assessments and presentations Be responsible for technical representation offshore Prepare proposals and tenders for clients and write reports and specifications. You will offer analysis using various types of software such as MathCAD, ABAQUS, Olga or Orcaflex. You will also present the company s technical courses and handle project management and sales and customer relationship management. For this role, you must have a minimum of 10 years experience in subsea engineering, pipelines design or construction. Background in controls, corrosion, decommissioning or structures would be an advantage. A degree in Engineering, Aeronautics, Naval Architecture, Maths, or Physics, preferably with honours is essential for this role. A full clean driving license is also required. If you are a Subsea Engineering professional with excellent design/construction skills and exposure to the Oil & Gas/Subsea Engineering industry, we would love to hear from you. Send in your CV now"
7,"THIS IS A LIVE VACANCY NOT A GENERIC ADVERTISEMENT ) DO YOU WANT TO EARN UP TO ****K BASIC SALARY WITH UNCAPPED OTE, FOR A NATIONAL RECRUITER WHO ARE EXPANDING, RECRUITING AND CAN OFFER YOU JOB SECURITY ? ARE YOU AN EXPERIENCED RECRUITMENT CONSULTANT, SALES OR BUSINESS DEVELOPMENT EXECUTIVE? WHO WANTS INNOVATIVE TRAINING, GENUINE CAREER PROSPECTS WHERE ****5% OF THE MANAGEMENT TEAM ARE HOME GROWN, DYNAMIC WORKING ENVIRONMENT BASED IN MANCHESTER(ONE OF THE TOP PERFORMING LOCATIONS IN THE UK) AND EXCELLENT EARNING POTENTIAL UP TO 30% COMMISSION FOR ACHIEVERS Job details: Generating business for one of my clients **** specialist divisions across **** branches in the UK. Access to all the leading websites for candidate and job opportunities, over ****k annual investment why make the job harder? Full and comprehensive Administration and IT back up for every Consultant Working established bespoke database to generate opportunities with both existing and new clients. Inter division cross fertilisation, working closely with 8 professional divisions providing clients with the complete recruitment solution Corporate and HO support for national business/tenders and PSL s. Managing own desk to achieve daily, weekly and monthly targets. Person: Energy and desire to succeed, with a competitive streak that wants to beat the competition. Professional personal appearance with the ability to build sustainable business relationships. Positive mental attitude. Able to demonstrate working to and exceeding targets. Career focussed, good listening skills and a natural confidence. Naturally positive and looking for the opportunity to work for a national player who are committed to your success Recruitment experience or 2 years business to business sales experience required. Benefits: ****k ****K basic benefits Pension scheme. Uncapped OTE Innovative award winning training academy. Genuine career opportunities. Incentives, awards and excellent team building UK wide. 
IF YOU FEEL YOU HAVE THE ENERGY AND DRIVE TO SUCCEED IN A FAST PACED, DYNAMIC SALES ENVIRONMENT WITH REWARDS THEN CALL OR EMAIL FRO AN INFORMAL AND DISCREET DISCUSSION. PLEASE NOTE YOUR CV WILL NOT BE FORWARDED TO MY CLIENT WITHOUT YOUR WRITTEN AUTHORISATION. Code Blue Recruitment handle nearly **** LIVE VACANCIES, for all sectors and levels of the Recruitment industry across the whole of the UK . Positions include Junior and Trainee Consultants, Recruitment and Senior Recruitment Consultants, Team Leaders, Account Managers, Branch Managers, Sales Managers & Directors. Sectors covered include Professional Recruitment such as Finance, Accountancy & Banking, Legal & HR, Commercial markets such as Office Support, Industrial, Driving, Hospitality & Catering, the Public Sector including Medical, Care, Education, Technical markets such as I.T., Telecoms, Media, Construction, Engineering, Energy & Environment, and Pharmaceutical, as well as diverse areas such as Sales Recruitment, Search, and Supply Chain"
8,"This is an exceptional opportunity to join a construction / technical agency that hasn t shrunk in the current market one bit Our client is seeking a nononsense and highly skilled Recruiter with at least a couple of years experience under their belt. They specialise in placing highcalibre candidates both in the UK and worldwide, within blue and white collar on both a temp and perm basis. You will have a genuine servicefocus but not definately not be afraid to pick up the phone and develop new business, acting on leads and referrals like any professional should. Your matching skills should be excellent and your CVInterviewPlacement ratios should be impressive. The incentives and benefits here are excellent with the real opportunity to earn an aboveaverage basic salary and competitive commission package: Basic to ****k car allowan Bonus to 22.5%(OTE ****k) Car Alloowance . Additional quarterly and annual bonuses Sensible & Supportive atmosphere If you are looking to join a serious competitor in the marketplace who take a genuine interest in their people contact Donna Turner now, and please have your billing figures to hand Code Blue Recruitment handle nearly **** LIVE VACANCIES, for all sectors and levels of the Recruitment industry across the whole of the UK . Positions include Junior and Trainee Consultants, Recruitment and Senior Recruitment Consultants, Team Leaders, Account Managers, Branch Managers, Sales Managers & Directors. Sectors covered include Professional Recruitment such as Finance, Accountancy & Banking, Legal & HR, Commercial markets such as Office Support, Industrial, Driving, Hospitality & Catering, the Public Sector including Medical, Care, Education, Technical markets such as I.T., Telecoms, Media, Construction, Engineering, Energy & Environment, and Pharmaceutical, as well as diverse areas such as Sales Recruitment, Search, and Supply Chain"
9,"A subsea engineering company is looking for an experienced Subsea Cable Engineer who will be responsible for providing all issues related to cables. They will need someone who has at least 1015 years of subsea cable engineering experience with significant experience within offshore oil and gas industries. The qualified candidate will be responsible for developing new modelling methods for FEA and CFD. You will also be providing technical leadership to all staff therefore you must be an expert in problem solving and risk assessments. You must also be proactive and must have strong interpersonal skills. You must be a Chartered Engineer or working towards it the qualification. The company offers an extremely competitive salary, health care plan, training, professional membership sponsorship, and relocation package"


### Checking what percentage (%) of the data is missing:

In [13]:
# Data missing sum.
missing = df_FullDescription.isnull().sum()
missing

FullDescription    0
dtype: int64

In [14]:
# Data missing in percent.
percentMissing = (missing / len(df_FullDescription.index)) * 100
percentMissing

FullDescription    0.0
dtype: float64
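The count-then-percentage check is repeated for every column in this notebook. As a sketch, it could be folded into one helper that reports both numbers for all columns at once (`missing_summary` is a hypothetical name, and the toy data is made up):

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing_count = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values.
    missing_percent = df.isna().mean() * 100
    summary = pd.DataFrame({
        "missing_count": missing_count,
        "missing_percent": missing_percent,
    })
    # Most incomplete columns first.
    return summary.sort_values("missing_percent", ascending=False)

# Toy stand-in for full_df.
toy_df = pd.DataFrame({
    "Title": ["a", None, "c", "d"],
    "ContractType": [None, None, None, "full_time"],
    "Category": ["x", "y", "z", "w"],
})
summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(full_df)` would give the same figures as the per-column cells in a single table.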

---

## 03.4 - Preprocessing the "LocationRaw" column (feature)
> Think of this column as the location of the vacancy, but expressed with cardinal directions and/or reference points.

**NOTE:**  
I already have an equivalent column (feature), normalized by **Adzuna**. So when could I use this one instead of the one normalized by **Adzuna**?

> **When preprocessing this column ourselves gives us a better metric or model than the Adzuna-normalized one.**
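As an illustration of what such preprocessing might look like, the sketch below splits a raw location on commas, unifies the casing, and drops repeated parts while keeping their order. This is only one possible approach, not the normalization Adzuna uses, and `normalize_location` is a hypothetical helper:

```python
def normalize_location(raw: str) -> list[str]:
    """Split a raw location into unique, title-cased parts, preserving order."""
    parts = [part.strip().title() for part in raw.split(",")]
    seen = set()
    unique_parts = []
    for part in parts:
        # Skip empty fragments and parts that were already emitted.
        if part and part not in seen:
            seen.add(part)
            unique_parts.append(part)
    return unique_parts

print(normalize_location("Dorking, Surrey, Surrey, Surrey"))
print(normalize_location("MANCHESTER, Greater Manchester"))
```

Note that `str.title()` would turn an acronym such as "UK" into "Uk", so a real pipeline would need a list of exceptions.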

### Preparing the "LocationRaw" *column (feature)* and setting the most suitable data type:

In [15]:
df_LocationRaw = full_df[["LocationRaw"]].copy()
df_LocationRaw = df_LocationRaw.astype({'LocationRaw': 'string'})
df_LocationRaw.info()
df_LocationRaw.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   LocationRaw  244768 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,LocationRaw
0,"Dorking, Surrey, Surrey"
1,"Glasgow, Scotland, Scotland"
2,"Hampshire, South East, South East"
3,"Surrey, South East, South East"
4,"Surrey, South East, South East"
5,"Dorking, Surrey, Surrey, Surrey"
6,"Aberdeen, Borders"
7,"MANCHESTER, Greater Manchester"
8,"LEEDS, West Yorkshire"
9,"Aberdeen, UK"


### Checking what percentage (%) of the data is missing:

In [16]:
# Data missing sum.
missing = df_LocationRaw.isnull().sum()
missing

LocationRaw    0
dtype: int64

In [17]:
# Data missing in percent.
percentMissing = (missing / len(df_LocationRaw.index)) * 100
percentMissing

LocationRaw    0.0
dtype: float64

---

## 03.5 - Preprocessing the "LocationNormalized" column (feature)
> It has the same meaning as the **LocationRaw** column, but with less information and/or fewer reference points.

**NOTE:**  
That is because this column is the result of **preprocessing** applied to the **LocationRaw** column by **Adzuna**.

### Preparing the "LocationNormalized" *column (feature)* and setting the most suitable data type:

In [18]:
df_LocationNormalized = full_df[["LocationNormalized"]].copy()
df_LocationNormalized = df_LocationNormalized.astype({'LocationNormalized': 'string'})
df_LocationNormalized.info()
df_LocationNormalized.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   LocationNormalized  244768 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,LocationNormalized
0,Dorking
1,Glasgow
2,Hampshire
3,Surrey
4,Surrey
5,Dorking
6,UK
7,Manchester
8,Leeds
9,Aberdeen


### Checking what percentage (%) of the data is missing:

In [19]:
# Data missing sum.
missing = df_LocationNormalized.isnull().sum()
missing

LocationNormalized    0
dtype: int64

In [20]:
# Data missing in percent.
percentMissing = (missing / len(df_LocationNormalized.index)) * 100
percentMissing

LocationNormalized    0.0
dtype: float64

---

## 03.6 - Preprocessing the "ContractType" column (feature)
> This column holds the contract type of each job ad, which is either **full_time** or **part_time**. In other words, it tells us whether the employee works full time (for example, 40 hours a week) or part time (for example, 20 hours a week).

**NOTE:**  
This field was inferred by **Adzuna** from the description or from a specific additional field.

### Preparing the "ContractType" column (feature) and setting the most suitable data type:

In [21]:
df_ContractType = full_df[["ContractType"]].copy()
df_ContractType = df_ContractType.astype({'ContractType': 'string'})
df_ContractType.info()
df_ContractType.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   ContractType  65442 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,ContractType
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,


### Checking what percentage (%) of the data is missing:

In [22]:
# Data missing sum.
missing = df_ContractType.isnull().sum()
missing

ContractType    179326
dtype: int64

In [23]:
# Data missing in percent.
percentMissing = (missing / len(df_ContractType.index)) * 100
percentMissing

ContractType    73.263662
dtype: float64

**NOTE:**  
Since this *column (feature)* is missing 73% of its values, it may be worth dropping it. With more than 70% of the data missing, this variable may contribute almost nothing to our model.

**NOTE:**  
However, let's just set it aside for now. The 27% that is available may still turn out to be relevant, depending on how important the column (feature) is.

**NOTE:**  
To finish, we have the key question:

> **Why is such a large share of the data (73%) missing in this column (feature)?**
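If the column is kept rather than dropped, one common option is to treat "missing" as a category of its own instead of discarding those rows. A minimal sketch on toy data, where the `"unknown"` label is a hypothetical placeholder and not a value from the dataset:

```python
import pandas as pd

# Toy stand-in for the ContractType column.
contract_type = pd.Series(
    [None, "full_time", None, "part_time", None, None], dtype="string"
)

# Count every value, including the missing ones.
counts = contract_type.value_counts(dropna=False)

# Replace missing values with an explicit placeholder category (assumed label).
filled = contract_type.fillna("unknown")
print(counts)
print(filled.tolist())
```

Whether this helps depends on the model; it simply lets the 27% of known values be used without losing the other rows.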

---

## 03.7 - Preprocessing the "ContractTime" column (feature)
> The kind of contract, which can be **permanent (in Brazilian terms, CLT)** or **contract (in Brazilian terms, PJ)**.

### Preparing the "ContractTime" *column (feature)* and setting the most suitable data type:

In [24]:
df_ContractTime = full_df[["ContractTime"]].copy()
df_ContractTime = df_ContractTime.astype({'ContractTime': 'string'})
df_ContractTime.info()
df_ContractTime.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   ContractTime  180863 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


Unnamed: 0,ContractTime
0,permanent
1,permanent
2,permanent
3,permanent
4,permanent
5,permanent
6,permanent
7,permanent
8,permanent
9,permanent


### Checking what percentage (%) of the data is missing:

In [25]:
# Data missing sum.
missing = df_ContractTime.isnull().sum()
missing

ContractTime    63905
dtype: int64

In [26]:
# Data missing in percent.
percentMissing = (missing / len(df_ContractTime.index)) * 100
percentMissing

ContractTime    26.108397
dtype: float64

So, of the 244,768 samples, 63,905 are missing the **ContractTime** field, which is **26%**.

**NOTE:**  
Again, the key question:

> **Why is 26% of the data missing in this column (feature)?**
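A first step toward that question could be to check whether the missingness relates to the target, for example by comparing `SalaryNormalized` between rows with and without a `ContractTime`. The sketch below uses made-up numbers, so its output says nothing about the real dataset:

```python
import pandas as pd

# Toy stand-in for full_df; the salaries are invented for illustration.
toy_df = pd.DataFrame({
    "ContractTime": ["permanent", None, "contract", None, "permanent", "permanent"],
    "SalaryNormalized": [30000, 20000, 40000, 22000, 32000, 28000],
})

# Label each row by whether ContractTime is present.
group_label = toy_df["ContractTime"].isna().map({True: "missing", False: "present"})

# Compare the size and median salary of the two groups.
salary_by_group = toy_df.groupby(group_label)["SalaryNormalized"].agg(["count", "median"])
print(salary_by_group)
```

A large gap between the two medians on the real data would suggest that the missing values are not random and carry information of their own.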

---

## 03.8 - Preprocessing the "Company" column (feature)
> The name of the employer, as provided by the job advertiser.

### Preparing the "Company" *column (feature)* and setting the most suitable data type:

In [27]:
df_Company = full_df[["Company"]].copy()
df_Company = df_Company.astype({'Company': 'string'})
df_Company.info()
df_Company.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Company  212338 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


|   | Company |
|---|---------|
| 0 | Gregory Martin International |
| 1 | Gregory Martin International |
| 2 | Gregory Martin International |
| 3 | Gregory Martin International |
| 4 | Gregory Martin International |
| 5 | Gregory Martin International |
| 6 | Indigo 21 Ltd |
| 7 | Code Blue Recruitment |
| 8 | Code Blue Recruitment |
| 9 | Indigo 21 Ltd |


### Checking what percentage (%) of the data is missing:

In [28]:
# Data missing sum.
missing = df_Company.isnull().sum()
missing

Company    32430
dtype: int64

In [29]:
# Data missing in percent.
percentMissing = (missing / len(df_Company.index)) * 100
percentMissing

Company    13.249281
dtype: float64

**NOTE:**  
This column has far fewer missing values, **13%**. Still, it raises some questions:
 - **Why does this column (feature) have this percentage of missing data?**
 - **Should we be concerned?**
 - **What should we do with the remaining 87%?**
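
One possible way to handle the missing values, sketched under assumptions: keep every row, replace the missing company with an explicit placeholder, and record in a separate flag that the value was originally absent. The `"Unknown"` label, the `CompanyMissing` column, and the toy data are hypothetical and are not taken from the notebook.

```python
import pandas as pd

# Toy data standing in for df_Company (values are illustrative only).
toy_df = pd.DataFrame(
    {"Company": ["Indigo 21 Ltd", None, "Code Blue Recruitment", None]}
)
toy_df = toy_df.astype({"Company": "string"})

# Remember which rows were missing before filling them.
toy_df["CompanyMissing"] = toy_df["Company"].isna()

# Replace missing values with an explicit placeholder label (hypothetical).
toy_df["Company"] = toy_df["Company"].fillna("Unknown")
print(toy_df)
```

Keeping the flag means the information "the advertiser did not supply a company" is not lost, in case it turns out to be informative.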

---

## 03.9 - Preprocessing the "Category" column (feature)
> Which of the 30 standard job categories this ad fits into.

**NOTE:**  
We know that there is a lot of noise and error in this field.
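
Since the field is described as having 30 standard categories and as being noisy, a first check is to count the distinct values and inspect their frequencies. This is a minimal sketch on toy data; with the real data, `df_Category` would take the place of `toy_df`. Converting to the pandas `category` dtype is one option for such a low-cardinality column, not something the notebook does.

```python
import pandas as pd

# Toy data standing in for df_Category (values are illustrative only).
toy_df = pd.DataFrame({
    "Category": [
        "Engineering Jobs",
        "Engineering Jobs",
        "HR & Recruitment Jobs",
        "Engineering Jobs",
    ]
})

# Number of distinct labels (the field description mentions 30 categories).
n_categories = toy_df["Category"].nunique()

# Frequency of each label, most common first.
counts = toy_df["Category"].value_counts()

# The "category" dtype stores each distinct label once.
toy_df["Category"] = toy_df["Category"].astype("category")

print(n_categories)
print(counts)
```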

### Preparing the "Category" *column (feature)* and assigning it the most suitable data type:

In [30]:
df_Category = full_df[["Category"]].copy()
df_Category = df_Category.astype({'Category': 'string'})
df_Category.info()
df_Category.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Category  244768 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


|   | Category |
|---|----------|
| 0 | Engineering Jobs |
| 1 | Engineering Jobs |
| 2 | Engineering Jobs |
| 3 | Engineering Jobs |
| 4 | Engineering Jobs |
| 5 | Engineering Jobs |
| 6 | Engineering Jobs |
| 7 | HR & Recruitment Jobs |
| 8 | HR & Recruitment Jobs |
| 9 | Engineering Jobs |


### Checking what percentage (%) of the data is missing:

In [31]:
# Data missing sum.
missing = df_Category.isnull().sum()
missing

Category    0
dtype: int64

In [32]:
# Data missing in percent.
percentMissing = (missing / len(df_Category.index)) * 100
percentMissing

Category    0.0
dtype: float64

---

## 03.10 - Preprocessing the "SalaryRaw" column (feature)
This column represents the salary stated in the ad (sample). However:
 - It is unformatted;
 - It may include bonuses;
 - The pay may be quoted:
   - Per hour;
   - Per month;
   - Per year.

**NOTE:**  
This column follows the same logic as the **LocationRaw x LocationNormalized** columns. In other words, for this column (feature) we have the same *question* and *answer*:

**When could I use this column instead of the one normalized by *Adzuna*?**  
> **After applying a preprocessing to it that gives us a better metric or model than the one normalized by Adzuna.**
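
As an illustration of what such a preprocessing could look like, here is a minimal sketch that parses the leading `<min> - <max>/<period>` part seen in the sample rows of this column and takes the midpoint of the range. The names `RANGE_PATTERN`, `parse_salary_range`, and `midpoint` are hypothetical, the string `"competitive"` is a made-up example of an unparseable value, and any format other than the one in the displayed rows is not handled.

```python
import re
from typing import Optional, Tuple

# Matches the leading "<min> - <max>/<period>" part seen in the sample rows,
# e.g. "20000 - 30000/annum 20-30K". Other formats are NOT handled.
RANGE_PATTERN = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*/\s*([A-Za-z]+)")


def parse_salary_range(raw: str) -> Optional[Tuple[float, float, str]]:
    """Return (minimum, maximum, period), or None if the text does not match."""
    match = RANGE_PATTERN.match(raw)
    if match is None:
        return None
    low, high, period = match.groups()
    return float(low), float(high), period.lower()


def midpoint(raw: str) -> Optional[float]:
    """Midpoint of the parsed range, or None when parsing fails."""
    parsed = parse_salary_range(raw)
    if parsed is None:
        return None
    low, high, _ = parsed
    return (low + high) / 2


print(midpoint("25000 - 30000/annum 25K-30K negotiable"))  # 27500.0
print(midpoint("competitive"))  # None
```

For the ten sample rows shown in this notebook, this midpoint coincides with the `SalaryNormalized` values displayed for the same rows; whether that holds for the full dataset is not checked here.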

### Preparing the "SalaryRaw" *column (feature)* and assigning it the most suitable data type:

In [33]:
df_SalaryRaw = full_df[["SalaryRaw"]].copy()
df_SalaryRaw = df_SalaryRaw.astype({'SalaryRaw': 'string'})
df_SalaryRaw.info()
df_SalaryRaw.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   SalaryRaw  244768 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


|   | SalaryRaw |
|---|-----------|
| 0 | 20000 - 30000/annum 20-30K |
| 1 | 25000 - 35000/annum 25-35K |
| 2 | 20000 - 40000/annum 20-40K |
| 3 | 25000 - 30000/annum 25K-30K negotiable |
| 4 | 20000 - 30000/annum 20-30K |
| 5 | 20000 - 30000/annum 20K to 30K |
| 6 | 50000 - 100000/annum |
| 7 | 18000 - 26000/annum TO 26K BASIC + COMM + BENS |
| 8 | 18000 - 28000/annum 18 - 28K BASIC + COMM + BENS |
| 9 | 70000 - 100000/annum |


### Checking what percentage (%) of the data is missing:

In [34]:
# Data missing sum.
missing = df_SalaryRaw.isnull().sum()
missing

SalaryRaw    0
dtype: int64

In [35]:
# Data missing in percent.
percentMissing = (missing / len(df_SalaryRaw.index)) * 100
percentMissing

SalaryRaw    0.0
dtype: float64

---

## 03.11 - Preprocessing the "SalaryNormalized" column (feature)
It has the same meaning as the **"SalaryRaw"** column, but *Adzuna* normalized the data so that it is expressed on an **annualized** basis.
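
What "annualized" means can be illustrated with a small conversion. Adzuna's actual normalization rules are not given in this notebook, so the factors below are assumptions: 12 months per year, and a hypothetical 37.5-hour week over 52 weeks for hourly rates. The names `PERIODS_PER_YEAR` and `annualize` are also hypothetical.

```python
# Assumed conversion factors to a yearly amount (not Adzuna's documented rules).
HOURS_PER_WEEK = 37.5  # hypothetical full-time week
WEEKS_PER_YEAR = 52
PERIODS_PER_YEAR = {
    "annum": 1,
    "month": 12,
    "hour": HOURS_PER_WEEK * WEEKS_PER_YEAR,
}


def annualize(amount: float, period: str) -> float:
    """Convert an amount paid per `period` into an amount per year."""
    try:
        factor = PERIODS_PER_YEAR[period]
    except KeyError:
        raise ValueError(f"unknown pay period: {period!r}") from None
    return amount * factor


print(annualize(2500, "month"))  # 30000
print(annualize(10, "hour"))     # 19500.0
```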

### Preparing the "SalaryNormalized" *column (feature)* and assigning it the most suitable data type:

In [36]:
df_SalaryNormalized = full_df[["SalaryNormalized"]].copy()
df_SalaryNormalized = df_SalaryNormalized.astype({'SalaryNormalized': 'float64'})
df_SalaryNormalized.info()
df_SalaryNormalized.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   SalaryNormalized  244768 non-null  float64
dtypes: float64(1)
memory usage: 1.9 MB


|   | SalaryNormalized |
|---|------------------|
| 0 | 25000.0 |
| 1 | 30000.0 |
| 2 | 30000.0 |
| 3 | 27500.0 |
| 4 | 25000.0 |
| 5 | 25000.0 |
| 6 | 75000.0 |
| 7 | 22000.0 |
| 8 | 23000.0 |
| 9 | 85000.0 |


### Checking what percentage (%) of the data is missing:

In [37]:
# Data missing sum.
missing = df_SalaryNormalized.isnull().sum()
missing

SalaryNormalized    0
dtype: int64

In [38]:
# Data missing in percent.
percentMissing = (missing / len(df_SalaryNormalized.index)) * 100
percentMissing

SalaryNormalized    0.0
dtype: float64

---

## 03.12 - Preprocessing the "SourceName" column (feature)
> The name of the website or advertiser from whom we received the job ad.

### Preparing the "SourceName" *column (feature)* and assigning it the most suitable data type:

In [39]:
df_SourceName = full_df[["SourceName"]].copy()
df_SourceName = df_SourceName.astype({'SourceName': 'string'})
df_SourceName.info()
df_SourceName.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244768 entries, 0 to 244767
Data columns (total 1 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   SourceName  244767 non-null  string
dtypes: string(1)
memory usage: 1.9 MB


|   | SourceName |
|---|------------|
| 0 | cv-library.co.uk |
| 1 | cv-library.co.uk |
| 2 | cv-library.co.uk |
| 3 | cv-library.co.uk |
| 4 | cv-library.co.uk |
| 5 | cv-library.co.uk |
| 6 | cv-library.co.uk |
| 7 | cv-library.co.uk |
| 8 | cv-library.co.uk |
| 9 | cv-library.co.uk |


### Checking what percentage (%) of the data is missing:

In [40]:
# Data missing sum.
missing = df_SourceName.isnull().sum()
missing

SourceName    1
dtype: int64

In [41]:
# Data missing in percent.
percentMissing = (missing / len(df_SourceName.index)) * 100
percentMissing

SourceName    0.000409
dtype: float64

**NOTE:**  
Only 1 of the samples is missing the advertiser's site. Even so, we ask the usual key question:

> **Why, among so many samples (244,768), is the advertiser's site missing in just one?**
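
A first step toward answering this is to look at the affected row itself. The sketch below shows the selection on toy data; on the real data, `full_df` would take the place of `toy_df`, so that the other columns of the affected ad can be inspected too.

```python
import pandas as pd

# Toy data standing in for full_df (values are illustrative only).
toy_df = pd.DataFrame({
    "SourceName": ["cv-library.co.uk", None, "cv-library.co.uk"],
    "Company": ["Indigo 21 Ltd", "Code Blue Recruitment", None],
})

# Boolean mask that keeps only the rows where SourceName is missing.
missing_rows = toy_df[toy_df["SourceName"].isna()]
print(missing_rows)
```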

# 04 - Load
> The **load** step is responsible for saving the data already ***preprocessed*** for one or more columns (features).

**NOTE:**  
This step follows an incremental logic: in each iteration *(Load-v1, Load-v2, ..., Load-vn)* we save the data manipulated so far, with the goal of finding a better metric or model for the data.
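
The incremental idea could be sketched as follows, assuming each iteration is written to its own CSV file. The helper `save_version`, the `load-v<n>.csv` naming scheme, and the temporary output directory are hypothetical choices for illustration, not the notebook's actual implementation.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_version(df: pd.DataFrame, out_dir: Path, version: int) -> Path:
    """Write df to out_dir as 'load-v<version>.csv' and return the file path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"load-v{version}.csv"
    df.to_csv(path, index=False)
    return path


# Toy data and a temporary directory stand in for the real dataset and folder.
toy_df = pd.DataFrame(
    {"Category": ["Engineering Jobs"], "SalaryNormalized": [25000.0]}
)
with tempfile.TemporaryDirectory() as tmp:
    saved = save_version(toy_df, Path(tmp), version=1)
    reloaded = pd.read_csv(saved)
    print(saved.name)
```

Writing each iteration to a separate file keeps earlier versions available for comparison when a later preprocessing does not improve the metric.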

## 04.1 - Load-v1 (?)

# Summaries

x