# Binary Classification of Machine Failures

This is a Kaggle dataset of Data Science job postings from 2024. The data comes from Glassdoor, which provides information on the job market, industry trends, and employer characteristics.
The dataset is available at: https://www.kaggle.com/competitions/playground-series-s3e17/data

In [1]:
%config InteractiveShell.ast_node_interactivity = 'all'

import pandas as pd

## Loading the data

Make sure the file downloaded from Kaggle is in the same folder as this Jupyter Notebook.

In [2]:
dataset = pd.read_csv("Glassdoor_Job_Postings.csv")
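If the CSV is not in the notebook's folder, `pd.read_csv` fails with a fairly terse error. The sketch below checks for the file first and raises a message that says where it was expected. The helper name `require_file` is hypothetical and is not part of pandas or of the original notebook.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path if it is an existing file, else raise FileNotFoundError."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} was not found in {Path.cwd()}; "
            "download it from Kaggle and place it next to this notebook."
        )
    return csv_path


# Possible usage in the notebook (pd is pandas, imported in the first cell):
# dataset = pd.read_csv(require_file("Glassdoor_Job_Postings.csv"))
```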

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   company                      899 non-null    object 
 1   job_title                    900 non-null    object 
 2   company_rating               656 non-null    float64
 3   job_description              888 non-null    object 
 4   location                     900 non-null    object 
 5   salary_avg_estimate          636 non-null    object 
 6   salary_estimate_payperiod    636 non-null    object 
 7   company_size                 774 non-null    object 
 8   company_founded              774 non-null    object 
 9   employment_type              774 non-null    object 
 10  industry                     774 non-null    object 
 11  sector                       774 non-null    object 
 12  revenue                      774 non-null    object 
 13  career_opportunities

In [4]:
dataset.describe()

| | company_rating | career_opportunities_rating | comp_and_benefits_rating | culture_and_values_rating | senior_management_rating | work_life_balance_rating |
|---|---|---|---|---|---|---|
| count | 656.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 |
| mean | 3.948171 | 3.838304 | 3.678796 | 3.90301 | 3.677702 | 3.804378 |
| std | 0.440294 | 0.52002 | 0.525854 | 0.545233 | 0.589133 | 0.559777 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 3.7 | 3.6 | 3.4 | 3.6 | 3.3 | 3.6 |
| 50% | 4.0 | 3.8 | 3.7 | 3.9 | 3.6 | 3.8 |
| 75% | 4.2 | 4.1 | 4.0 | 4.2 | 4.0 | 4.1 |
| max | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |


## Overview of the full dataset

In [5]:
ds1 = dataset[["company","job_title","company_rating","job_description","location"]].head()
ds2 = dataset[["company","salary_avg_estimate","salary_estimate_payperiod","company_size","company_founded","employment_type"]].head()
ds3 = dataset[["company","industry","sector","revenue","career_opportunities_rating","comp_and_benefits_rating"]].head()
ds4 = dataset[["company","culture_and_values_rating","senior_management_rating", "work_life_balance_rating"]].head()

ds1
ds2
ds3
ds4

| | company | job_title | company_rating | job_description | location |
|---|---|---|---|---|---|
| 0 | ABB | Junior Data Analyst | 4.0 | Junior Data Analyst\nTake your next career ste... | Bengaluru |
| 1 | Philips | Data Scientist - AI/ML | 4.0 | Job Title\nData Scientist - AI/ML\nJob Descrip... | Bengaluru |
| 2 | HSBC | Data Science GSC’s | 3.9 | Job description\nGraduate/ Post-graduate degre... | Bengaluru |
| 3 | Facctum Solutions | Data Analyst | | Job Description\nExperience: 0 - 2 years in da... | Karnataka |
| 4 | JPMorgan Chase & Co | Data and Analytics - Associate | 4.0 | JOB DESCRIPTION\n\nYou are a strategic thinker... | India |


| | company | salary_avg_estimate | salary_estimate_payperiod | company_size | company_founded | employment_type |
|---|---|---|---|---|---|---|
| 0 | ABB | ₹3,25,236 | /yr (est.) | 10000+ Employees | 1883 | Company - Public |
| 1 | Philips | | | 10000+ Employees | 1891 | Company - Public |
| 2 | HSBC | | | 10000+ Employees | 1865 | Company - Public |
| 3 | Facctum Solutions | | | 1 to 50 Employees | -- | Company - Private |
| 4 | JPMorgan Chase & Co | | | 10000+ Employees | 1799 | Company - Public |


| | company | industry | sector | revenue | career_opportunities_rating | comp_and_benefits_rating |
|---|---|---|---|---|---|---|
| 0 | ABB | Electronics Manufacturing | Manufacturing | $10+ billion (USD) | 3.7 | 3.6 |
| 1 | Philips | Healthcare Services & Hospitals | Healthcare | $10+ billion (USD) | 3.8 | 3.7 |
| 2 | HSBC | Banking & Lending | Finance | $10+ billion (USD) | 3.6 | 3.6 |
| 3 | Facctum Solutions | -- | -- | Unknown / Non-Applicable | | |
| 4 | JPMorgan Chase & Co | Banking & Lending | Finance | $10+ billion (USD) | 4.0 | 3.9 |


| | company | culture_and_values_rating | senior_management_rating | work_life_balance_rating |
|---|---|---|---|---|
| 0 | ABB | 4.0 | 3.5 | 3.9 |
| 1 | Philips | 4.0 | 3.5 | 4.0 |
| 2 | HSBC | 3.8 | 3.4 | 3.7 |
| 3 | Facctum Solutions | | | |
| 4 | JPMorgan Chase & Co | 3.9 | 3.6 | 3.7 |
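Note that `salary_avg_estimate` is stored as text (`object` dtype), with a currency symbol and Indian-style digit grouping, as in `₹3,25,236`. A minimal sketch of converting such strings to integers follows. It assumes every non-missing value is a whole number with only a currency symbol and comma separators, which is what the single populated row above suggests. Decimals or ranges would need different handling. `parse_salary` is a hypothetical helper name.

```python
import re


def parse_salary(text):
    """Convert a string such as '₹3,25,236' to the int 325236.

    Returns None for missing values (non-strings such as NaN) and for
    strings that contain no digits at all.
    """
    if not isinstance(text, str):
        return None
    # Assumption: the amount is a whole number, so every non-digit can be dropped.
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


salary = parse_salary("₹3,25,236")

# Possible usage on the DataFrame:
# dataset["salary_numeric"] = dataset["salary_avg_estimate"].map(parse_salary)
```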


As discussed in the course material, variables that contain text can be classified as *Structured Data* or *Unstructured Data*. Let's look at some examples.
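One rough way to tell the two kinds of text column apart is to look at how often values repeat and how long they are: a categorical field such as `sector` has few distinct, short values, while free text such as `job_description` is long and almost never repeats. The sketch below is only a heuristic. The function names and both thresholds are illustrative assumptions, not a standard definition.

```python
def profile_text_column(values):
    """Summarize a text column: share of distinct values and mean length in characters."""
    texts = [v for v in values if isinstance(v, str)]  # skip NaN and other non-strings
    if not texts:
        return {"count": 0, "unique_ratio": 0.0, "mean_length": 0.0}
    return {
        "count": len(texts),
        "unique_ratio": len(set(texts)) / len(texts),
        "mean_length": sum(len(t) for t in texts) / len(texts),
    }


def looks_unstructured(profile, min_unique_ratio=0.5, min_mean_length=50):
    """Flag a column as free text when values rarely repeat AND are long.

    Both thresholds are arbitrary illustrative choices.
    """
    return (
        profile["unique_ratio"] > min_unique_ratio
        and profile["mean_length"] > min_mean_length
    )


sectors = ["Finance", "Finance", "Information Technology", "Finance"]
descriptions = [
    "Job Summary: we are currently seeking a full-time, office-based Data Coordinator.",
    "Job Description: experience of 0 - 2 years in data analysis is required for this role.",
]

sector_profile = profile_text_column(sectors)
description_profile = profile_text_column(descriptions)
```

Because the helpers accept any iterable, a column such as `dataset["job_title"]` can be passed in directly.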

In [6]:
print('Unique job titles:', dataset["job_title"].unique().size)
print('===============')
dataset["job_title"].value_counts()

Unique job titles: 505


Data Analyst                           121
Data Scientist                          93
Data Engineer                           59
Data Entry Operator                     19
Data Science Intern                      8
                                      ... 
IT Business Analyst 2024 Internship      1
Data Visualization                       1
Data Engineer (WFH)                      1
Data Analyst (License Compliance)        1
Data Science Internship                  1
Name: job_title, Length: 505, dtype: int64
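Some titles in the output above differ only by a parenthetical qualifier, for example `Data Engineer` and `Data Engineer (WFH)`. A minimal sketch of grouping such variants, assuming that dropping parenthetical text and ignoring case is acceptable for the analysis (`normalize_title` is a hypothetical helper):

```python
import re
from collections import Counter


def normalize_title(title):
    """Drop parenthetical qualifiers, collapse whitespace, and lowercase the title."""
    without_parens = re.sub(r"\([^)]*\)", " ", title)
    return re.sub(r"\s+", " ", without_parens).strip().lower()


titles = [
    "Data Engineer",
    "Data Engineer (WFH)",
    "Data Analyst (License Compliance)",
    "Data Analyst",
]
grouped = Counter(normalize_title(t) for t in titles)
```

This loses information (a license-compliance analyst is not necessarily a generic analyst), so the original column is worth keeping alongside the normalized one.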

In [7]:
print('Unique sectors:', dataset["sector"].unique().size)
print('===============')
dataset["sector"].value_counts()

Unique sectors: 24


Information Technology                         295
--                                             146
Finance                                         80
Management & Consulting                         55
Manufacturing                                   46
Pharmaceutical & Biotechnology                  28
Media & Communication                           23
Retail & Wholesale                              14
Healthcare                                      14
Human Resources & Staffing                      13
Education                                       10
Energy, Mining, Utilities                        7
Real Estate                                      6
Aerospace & Defence                              6
Telecommunications                               6
Agriculture                                      5
Non-profit & NGO                                 5
Construction, Repair & Maintenance Services      4
Insurance                                        4
Transportation & Logistics     

In [8]:
print('Unique annual revenue ranges:', dataset["revenue"].unique().size)
print('===============')
dataset["revenue"].value_counts()

Unique annual revenue ranges: 11


Unknown / Non-Applicable            369
$10+ billion (USD)                  191
$2 to $5 billion (USD)               56
$100 to $500 million (USD)           32
$5 to $25 million (USD)              27
$25 to $50 million (USD)             25
$5 to $10 billion (USD)              23
$1 to $5 million (USD)               21
Less than $1 million (USD)           18
$500 million to $1 billion (USD)     12
Name: revenue, dtype: int64
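The counts above include placeholder strings that really mean "no information": `--` in `sector` and `Unknown / Non-Applicable` in `revenue`. Since these are ordinary strings, `info()` counts them as non-null. A minimal sketch of treating them as missing follows. It assumes these two strings are the only placeholders in the dataset, and the helper names are hypothetical.

```python
from collections import Counter

# Assumption: these are the only placeholder strings used for missing values.
PLACEHOLDERS = {"--", "Unknown / Non-Applicable"}


def normalize_missing(value):
    """Return None for placeholders and non-strings (e.g. NaN), else the stripped string."""
    if not isinstance(value, str):
        return None
    text = value.strip()
    return None if text in PLACEHOLDERS else text


def count_known(values):
    """Count only the values that carry real information."""
    cleaned = (normalize_missing(v) for v in values)
    return Counter(v for v in cleaned if v is not None)


sample = [
    "Finance",
    "--",
    "Information Technology",
    float("nan"),
    "Unknown / Non-Applicable",
    "Finance",
]
counts = count_known(sample)

# Possible usage on the DataFrame:
# dataset["sector_clean"] = dataset["sector"].map(normalize_missing)
```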

In [9]:
print('Unique job descriptions:', dataset["job_description"].unique().size)
print('===============')
dataset["job_description"].value_counts()

Unique job descriptions: 882


Job Summary :\nOur corporate activities are growing rapidly, and we are currently seeking a full-time, office-based Data Coordinator to join our Core Laboratory/Imaging Services team in Mumbai, India. This position will work on a team to accomplish tasks and projects that are instrumental to the company’s success. If you want an exciting career where you use your previous expertise and can develop and grow your career even further, then this is the opportunity for you.\nResponsibilities :\nAssist in the preparation of Core Lab Data Management documents;\nPerform validation on data transfers and edit programs;\nMaintain data cleanup via edits, data review and data changes; and\nSend data transfers externally.\nQualifications :\nBachelor's degree in life science/ pharmacy/ health related field with strong attention to detail and working knowledge of Excel and Word; and\n1-2 years of experience in a pharmaceutical or CRO setting preferred.\nMedpace Overview :\nMedpace is a full-service cl