# Data Exploration
This notebook gives an overview of our dataset and shows how we clean it.

## Imports

In [1]:
import pandas as pd
import textwrap
import sys
sys.path.append("../src")
from preprocessing import clean_text

## Data Inspection

### A look at the raw dataset

In [2]:
df = pd.read_csv("../Engineer_20230826.csv")
df.head()

Unnamed: 0,RequisitionID,OrigJobTitle,JobTitle,JobDescription
0,,Licensed Stationary Engineer,ENGINEER (all other),Licensed Stationary Engineer \n\n Froedtert So...
1,224907.0,Guidance Navigation and Control (GN&C) Enginee...,ENGINEER (all other),**The Boeing Company** is in search of a **L...
2,331804.0,"Propulsion Engineer - Associate, Mid-Level and...",ENGINEER (all other),"**Job Description**\n\nAt Boeing, we innovate ..."
3,336462.0,Senior Process Controls Engineer,ENGINEER (all other),"**Job Description**\n\nAt Boeing, we innovate ..."
4,338951.0,RF/Microwave Engineer (Level 2 or 3),ENGINEER (all other),"**Job Description**\n\nAt Boeing, we innovate ..."


### Features of the dataset

In [3]:
print(f"Columns in data set: {df.columns.tolist()}")

Columns in data set: ['RequisitionID', 'OrigJobTitle', 'JobTitle', 'JobDescription']


Column Descriptions
 - RequisitionID - A unique identifier for each job posting in the dataset (missing for one row).
 - OrigJobTitle - The original job title as listed in the source data.
 - JobTitle - A generic category label. The only value in this column is "ENGINEER (all other)", which is not very descriptive.
 - JobDescription - The full text of the job posting, including responsibilities, qualifications, and other details.
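
One way to check the statement about JobTitle is to count the distinct values in each column. The sketch below runs on a small stand-in DataFrame (toy_df and its rows are made up for illustration); on the real data the same calls would be applied to df.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the rows are illustrative only.
toy_df = pd.DataFrame(
    {
        "OrigJobTitle": ["Chief Engineer", "Field Engineer I", "Cybersecurity Engineer"],
        "JobTitle": ["ENGINEER (all other)"] * 3,
    }
)

# Number of distinct values in each column.
distinct_counts = toy_df.nunique()
print(distinct_counts)

# Frequency of each JobTitle value; a single entry means the column
# does not help distinguish one posting from another.
title_counts = toy_df["JobTitle"].value_counts()
print(title_counts)
```

A column with one distinct value can usually be dropped before modeling, whereas OrigJobTitle keeps one distinct value per row in this toy example.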

### Job titles
The output below **demonstrates the use case for this project.** Engineering jobs in the dataset lack standardized, descriptive titles, which makes it difficult for job seekers to distinguish one type of engineer from another or to tell exactly what kind of engineer (e.g. aerospace vs. computer) a company is looking for.

In [4]:
random_job_postings = df.sample(n=10)
random_job_postings["OrigJobTitle"]

78       Materials and Processes Flammability Engineer ...
8837                                  Research Engineer iV
13055                                     Engineer-Precast
8642                          Postgres Automation Engineer
16251                           Senior Stationary Engineer
12124                                   Service Engineer 1
10969                                       Chief Engineer
3273                                      Field Engineer I
17676    Sr. Principal Aircraft Mechanical Systems (Mec...
10117                               Cybersecurity Engineer
Name: OrigJobTitle, dtype: object
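
Because the sample is drawn at random, the titles listed above will differ each time the notebook is re-run. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for this exploration (the seed value 42 and the toy_titles frame are arbitrary choices, not part of the original notebook):

```python
import pandas as pd

# Hypothetical stand-in for the OrigJobTitle column.
toy_titles = pd.DataFrame({"OrigJobTitle": [f"Engineer {i}" for i in range(100)]})

# Passing the same random_state yields the same rows on every call.
first_draw = toy_titles.sample(n=10, random_state=42)
second_draw = toy_titles.sample(n=10, random_state=42)

print(first_draw["OrigJobTitle"].tolist() == second_draw["OrigJobTitle"].tolist())
```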

### A sample job description
Below, the job description at an arbitrary row shows why cleaning is needed: the text contains literal newline sequences (\n), and descriptions must also be stripped of HTML tags before being processed.

In [5]:
print(f"The elements in the JobDescription column are of data type: {df['JobDescription'].dtype}")

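def print_job_desc(col: int):
    """Print the wrapped job description at positional row `col`."""
    # Despite its name, `col` is a row position; 3 is the position of the JobDescription column.
    job_desc = df.iat[col, 3]
    print(f"Job description at col {col}:\n {textwrap.fill(job_desc, width=175)}")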

print_job_desc(col=1000)

The elements in the JobDescription column are of data type: object
Job description at col 1000:
 ATS Company:  PA Solutions \n\n Requisition ID:  10392 \n\n Location:  \n\n Greenville, SC, US, 29615 Lewis Center, OH, US, 43035-9445 Guaynabo, PR, US, 00968-8058 Concord,
NH, US, 3811 Indianapolis, IN, US, 46250 Raleigh, NC, US, 27603 \n\n Date:  Jul 17, 2023 \n\n Automation Engineer \n\nJob Description\n\nProcess Automation Solutions is one of
the leading manufacturer-independent suppliers of complete automation solutions for the process and manufacturing industries. The company currently employs more than1,500
people with a global presence in Europe, the Americas, and Asia. Our operational activities focus on the design of process control systems and their vertical integration into
the overall business process. We offer complete services from the concept to commissioning, from the field level through process control level to corporate management level.
Process Automation Solutions is a 
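
The helper above reads the description by integer position (column 3), which silently breaks if the columns are ever reordered. A sketch of an equivalent lookup by column name follows; toy_df and get_job_desc are hypothetical names used only for illustration.

```python
import textwrap

import pandas as pd

# Hypothetical stand-in with the same column order as the real dataset.
toy_df = pd.DataFrame(
    {
        "RequisitionID": [1.0, 2.0],
        "OrigJobTitle": ["Chief Engineer", "Field Engineer I"],
        "JobTitle": ["ENGINEER (all other)"] * 2,
        "JobDescription": ["First posting \\n\\n text", "Second posting text"],
    }
)


def get_job_desc(frame: pd.DataFrame, row: int) -> str:
    """Return the description at positional row `row`, selecting the column by name."""
    return frame["JobDescription"].iat[row]


by_name = get_job_desc(toy_df, 1)
by_position = toy_df.iat[1, 3]  # same cell, addressed by position
print(textwrap.fill(by_name, width=40))
```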

### Checking for Missing or Messy Data

In [6]:
mask = df["JobDescription"].isna() | (df["JobDescription"].str.strip() == "")
num_blank = mask.sum()
print(f"There are {num_blank} empty or whitespace job descriptions.")

There are 0 empty or whitespace job descriptions.


In [7]:
df.isna().sum()

RequisitionID     1
OrigJobTitle      0
JobTitle          0
JobDescription    0
dtype: int64
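
Two follow-up checks fit naturally here: pulling out the row or rows with a missing RequisitionID, and counting how many descriptions contain a literal \n sequence or an HTML-like tag. The sketch below uses made-up rows, and the tag pattern is a rough assumption rather than the pattern used inside clean_text.

```python
import pandas as pd

# Hypothetical rows that mimic the kinds of mess seen in the raw data.
toy_df = pd.DataFrame(
    {
        "RequisitionID": [None, 224907.0, 331804.0],
        "JobDescription": [
            "Licensed Stationary Engineer \\n\\n Overview",
            "<p>Design control systems</p>",
            "Plain description with no markup",
        ],
    }
)

# Rows whose identifier is missing.
missing_id_rows = toy_df[toy_df["RequisitionID"].isna()]

# Literal backslash + n (two characters), matched as plain text, not as a regex.
has_literal_newline = toy_df["JobDescription"].str.contains("\\n", regex=False)

# Rough pattern for HTML-like tags such as <p> or </div>.
has_html_tag = toy_df["JobDescription"].str.contains(r"</?[A-Za-z][^>]*>", regex=True)

print(len(missing_id_rows), has_literal_newline.sum(), has_html_tag.sum())
```

Counts like these give a quick sense of how much of the column actually needs cleaning before any transformation is applied.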

## Cleaning the dataset
The clean_text function we wrote removes HTML tags, markdown markup, literal escape sequences (such as \n), and unnecessary whitespace from job descriptions.

Here's an example:

In [8]:
df['JobDescription'] = df['JobDescription'].apply(clean_text)
print_job_desc(1000)

Job description at col 1000:
 ATS Company: PA Solutions Requisition ID: 10392 Location: Greenville, SC, US, 29615 Lewis Center, OH, US, 43035-9445 Guaynabo, PR, US, 00968-8058 Concord, NH, US, 3811
Indianapolis, IN, US, 46250 Raleigh, NC, US, 27603 Date: Jul 17, 2023 Automation Engineer Job Description Process Automation Solutions is one of the leading manufacturer-
independent suppliers of complete automation solutions for the process and manufacturing industries. The company currently employs more than1,500 people with a global presence
in Europe, the Americas, and Asia. Our operational activities focus on the design of process control systems and their vertical integration into the overall business process.
We offer complete services from the concept to commissioning, from the field level through process control level to corporate management level. Process Automation Solutions is
a company of ATS Corporation. Overview: This position participates in the design and implementation of c
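
The source of clean_text lives in ../src/preprocessing and is not shown in this notebook. For readers who want a feel for the steps involved, here is a minimal sketch of a cleaner with similar goals. The name clean_text_sketch, the regular expressions, and the sample string are all assumptions made for illustration and may differ from the project's implementation.

```python
import html
import re


def clean_text_sketch(text: str) -> str:
    """Illustrative cleaner; NOT the project's clean_text implementation."""
    text = html.unescape(text)                # decode entities, e.g. &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML-like tags
    text = re.sub(r"\\[nrt]", " ", text)      # drop literal \n, \r, \t sequences
    text = re.sub(r"\*+", " ", text)          # drop markdown emphasis markers
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Made-up sample combining the kinds of markup seen in the raw data.
sample = "**Job Description**\\n\\n<p>Design  control systems</p> &amp; commission them"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

The patterns are deliberately crude: for instance, the tag pattern would also remove any plain text that happens to sit between angle brackets, so a production cleaner may need more careful rules.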