# US Monster Jobs Dataset Cleansing

Personal checklist:
-> Standardization of data: fix inconsistent column names and convert them to a standard format

In [44]:
import pandas as pd

import numpy as np

In [3]:
df = pd.read_csv('C:\\Users\\jorge\\Desktop\\monster_com-job_sample.csv')

# Overview of dataset

In [4]:
df.shape

(22000, 14)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          22000 non-null  object
 1   country_code     22000 non-null  object
 2   date_added       122 non-null    object
 3   has_expired      22000 non-null  object
 4   job_board        22000 non-null  object
 5   job_description  22000 non-null  object
 6   job_title        22000 non-null  object
 7   job_type         20372 non-null  object
 8   location         22000 non-null  object
 9   organization     15133 non-null  object
 10  page_url         22000 non-null  object
 11  salary           3446 non-null   object
 12  sector           16806 non-null  object
 13  uniq_id          22000 non-null  object
dtypes: object(14)
memory usage: 2.3+ MB


# 1. Standardization of data
-> Column names/format: I don't see any relevant errors in the column name format. Am I missing something?
 <br />All names are lowercase with underscores, and there are no format differences between one name and another.

In [6]:
df.columns

Index(['country', 'country_code', 'date_added', 'has_expired', 'job_board',
       'job_description', 'job_title', 'job_type', 'location', 'organization',
       'page_url', 'salary', 'sector', 'uniq_id'],
      dtype='object')

# 2. Data Type Conversion
Every column in df has the 'object' dtype. We'll have to convert each column to a dtype that matches the data it holds.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          22000 non-null  object
 1   country_code     22000 non-null  object
 2   date_added       122 non-null    object
 3   has_expired      22000 non-null  object
 4   job_board        22000 non-null  object
 5   job_description  22000 non-null  object
 6   job_title        22000 non-null  object
 7   job_type         20372 non-null  object
 8   location         22000 non-null  object
 9   organization     15133 non-null  object
 10  page_url         22000 non-null  object
 11  salary           3446 non-null   object
 12  sector           16806 non-null  object
 13  uniq_id          22000 non-null  object
dtypes: object(14)
memory usage: 2.3+ MB
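As a rough idea of what "converting according to the data" could look like, here is a minimal sketch on a small stand-in frame (`sample` is hypothetical, built from values visible in this notebook). It assumes `has_expired` only ever holds 'Yes'/'No'; only 'No' has been seen so far, so that is an assumption to verify with `value_counts()` first.

```python
import pandas as pd

# Small stand-in for df, using values seen in the notebook output.
sample = pd.DataFrame({
    'has_expired': ['No', 'No', 'No'],
    'job_type': ['Full Time Employee', 'Full Time', None],
    'country_code': ['US', 'US', 'US'],
})

# Map the Yes/No strings to booleans (assumed vocabulary);
# any value outside the mapping would become NaN.
sample['has_expired'] = sample['has_expired'].map({'Yes': True, 'No': False})

# Text columns with few distinct values can be stored as 'category'.
sample = sample.astype({'job_type': 'category', 'country_code': 'category'})

print(sample.dtypes)
```

Free-text columns such as job_description would probably stay as text; the category dtype mainly pays off for columns with few distinct values.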


# 2.1 Empty cells per column?

Out of 22,000 rows, a lot of data is missing in date_added (21,878 nulls), salary (18,554), organization (6,867), sector (5,194) and job_type (1,628).

In [8]:
df.isnull().sum()

country                0
country_code           0
date_added         21878
has_expired            0
job_board              0
job_description        0
job_title              0
job_type            1628
location               0
organization        6867
page_url               0
salary             18554
sector              5194
uniq_id                0
dtype: int64
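Raw counts are easier to compare as a share of the 22,000 rows. The sketch below recomputes that share from the counts printed above; `null_counts` and `missing_pct` are names introduced here for illustration. On the real frame, `df.isnull().mean() * 100` should give the same percentages directly.

```python
import pandas as pd

# Null counts copied from the df.isnull().sum() output (non-zero columns only).
null_counts = pd.Series({
    'date_added': 21878,
    'job_type': 1628,
    'organization': 6867,
    'salary': 18554,
    'sector': 5194,
})
total_rows = 22000  # from df.shape

# Percentage of missing values per column, largest first.
missing_pct = (null_counts / total_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Seen this way, date_added is almost entirely empty, while job_type is missing in well under a tenth of the rows.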

The only value in the 'country' column is 'United States of America' (and 'US' in 'country_code'). I guess that makes sense for a 'US Monster Jobs' dataset.

In [9]:
print(df['country'].value_counts())

print('-----------------------------------')

print(df['country_code'].value_counts())

United States of America    22000
Name: country, dtype: int64
-----------------------------------
US    22000
Name: country_code, dtype: int64
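Columns that hold a single value carry no information for analysis, so it may be worth listing them in one pass. This is only a sketch on a hypothetical `sample` frame built from values seen in this notebook; whether to drop such columns is a separate decision.

```python
import pandas as pd

sample = pd.DataFrame({
    'country': ['United States of America'] * 3,
    'country_code': ['US'] * 3,
    'location': ['Madison, WI 53702', 'Dixon, CA', 'Camphill, PA'],
})

# Columns with exactly one distinct non-null value.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_cols)
```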


# 2.2 Convert date_added to date dtype

In [10]:
df.loc[df['date_added'].notnull()]

Unnamed: 0,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector,uniq_id
133,United States of America,US,5/10/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Multibed Technician Job in Deer Park,Full Time Employee,"Deer Park, TX",Other/Not Classified,http://jobview.monster.com/Multibed-Technician...,,Other,6f6e952b8b0a2bb55e9feada54db2347
140,United States of America,US,5/13/2016,No,jobs.monster.com,Equal Opportunity Employer: Minority/Female/Di...,Principal Cyber Security Engineer Job in Houston,Full Time Employee,"Houston, TX",Computer SoftwareComputer/IT Services,http://jobview.monster.com/Principal-Cyber-Sec...,,IT/Software Development,1127457851cf28d79a39fd4b35867982
251,United States of America,US,5/9/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Field Supervisor IS Job in Deer Park,Full Time Employee,"Deer Park, TX",Other/Not Classified,http://jobview.monster.com/Field-Supervisor-IS...,,Other,94b49d291a16d01b27378ca97e653910
279,United States of America,US,6/10/2016,No,jobs.monster.com,"At American Family Insurance, we're firmly com...",Insurance Sales - Customer Service Job in Eden...,Full Time Employee,"Eden Prairie, MN 55344",Insurance,http://jobview.monster.com/insurance-sales-cus...,15.00 - 21.00 $ /hour,Accounting/Finance/Insurance,64a597e5dd17740aadf4b0e8047b51a5
366,United States of America,US,1/2/2017,No,jobs.monster.com,Description The Opportunity The Vehicle Mainte...,Vehicle Maintenance Mechanic - Las Vegas,Full Time Employee,"Las Vegas, NV",Energy and Utilities,http://jobview.monster.com/vehicle-maintenance...,,Installation/Maintenance/Repair,886903d4dda03046c2a826c44bfff3dc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20760,United States of America,US,9/27/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Central Maintenance Planner Job in Norwell,Full Time Employee,"Norwell, MA",Other/Not Classified,http://jobview.monster.com/central-maintenance...,,Administrative/Clerical,7b115d764f741821bae4ac95bfcf3f04
21342,United States of America,US,3/30/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Branch Manager Job in Cincinnati,Full Time Employee,"Cincinnati, OH",Other/Not Classified,http://jobview.monster.com/Branch-Manager-Job-...,,Other,1db7d013265871214d3f4e7ed80d8a23
21391,United States of America,US,3/24/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Field Service Driver Job in Cincinnati,Full Time Employee,"Cincinnati, OH",Other/Not Classified,http://jobview.monster.com/Field-Service-Drive...,,Logistics/Transportation,4f304e6285b240f8442a028bc3716273
21631,United States of America,US,4/4/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Field Project Manager Job in Cincinnati,Full Time Employee,"Cincinnati, OH",Other/Not Classified,http://jobview.monster.com/Field-Project-Manag...,,Other,6f86e1f35ad082be591bcec15c75f947
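The non-null values look like month/day/year strings (for example 5/13/2016 and 9/27/2016, where the second number exceeds 12). Under that assumption, a minimal sketch of the conversion with `pd.to_datetime` on a small stand-in frame:

```python
import pandas as pd

# Stand-in for df['date_added'], using values from the rows above plus a missing one.
sample = pd.DataFrame({'date_added': ['5/10/2016', '5/13/2016', None, '1/2/2017']})

# Assumed format: month/day/year. errors='coerce' turns anything
# that does not match into NaT instead of raising.
sample['date_added'] = pd.to_datetime(
    sample['date_added'], format='%m/%d/%Y', errors='coerce'
)

print(sample.dtypes)
print(sample)
```

Missing entries stay missing (as NaT), so the conversion does not solve the problem that most rows have no date at all.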


In [11]:
# Build a list of 'column: dtype' strings
lista = [f'{col}: {dtype}' for col, dtype in df.dtypes.items()]

# Equivalent loop:
# lista = []
# for col, dtype in df.dtypes.items():
#     lista.append(col + ': ' + str(dtype))

print(lista)

['country: object', 'country_code: object', 'date_added: object', 'has_expired: object', 'job_board: object', 'job_description: object', 'job_title: object', 'job_type: object', 'location: object', 'organization: object', 'page_url: object', 'salary: object', 'sector: object', 'uniq_id: object']


# Fix salary column

# Visualize full df rows

The code in the next cell prints every row of the salary column without truncation, so the full output can be viewed in a text editor. I wanted a better idea of what the data in the column looks like.

In [50]:
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(df[['salary']])

                                                  salary
0                                                    NaN
1                                                    NaN
2                                                    NaN
3                                                    NaN
4                                                    NaN
5                                                    NaN
6                                                    NaN
7                                                    NaN
8                                                    NaN
9                                                    NaN
10                                                   NaN
11                                                   NaN
12                                                   NaN
13                                  9.00 - 13.00 $ /hour
14                         80,000.00 - 95,000.00 $ /year
15                                                   NaN
16                             

In [49]:
df[['salary']]

Unnamed: 0,salary
0,
1,
2,
3,
4,
...,...
21995,"120,000.00 - 160,000.00 $ /yearbonus"
21996,"45,000.00 - 60,000.00 $ /year"
21997,
21998,25.00 - 28.00 $ /hour


Dictionary + astype to convert column dtype

In [12]:
convert_dict = {'salary': float}

df = df.astype(convert_dict)

ValueError: could not convert string to float: '9.00 - 13.00 $ /hour'
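`astype(float)` fails because each non-null salary is a text range such as '9.00 - 13.00 $ /hour', not a single number. One possible way forward is to split the range into separate numeric columns with a regular expression. The sketch below is an assumption about the format based only on the few values printed above; the column names `salary_min`, `salary_max` and `salary_period` are made up here, and rows that do not match the pattern simply come out as NaN.

```python
import pandas as pd

# Stand-in for df['salary'], using values seen in the output above.
sample = pd.DataFrame({'salary': [
    None,
    '9.00 - 13.00 $ /hour',
    '80,000.00 - 95,000.00 $ /year',
    '120,000.00 - 160,000.00 $ /yearbonus',
]})

# Two amounts separated by '-', then '$ /' and the pay period.
pattern = (
    r'(?P<salary_min>[\d,]+\.\d+)\s*-\s*'
    r'(?P<salary_max>[\d,]+\.\d+)\s*\$\s*/'
    r'(?P<salary_period>hour|year)'
)
parsed = sample['salary'].str.extract(pattern)

# Remove thousands separators, then convert the amounts to floats.
for col in ['salary_min', 'salary_max']:
    parsed[col] = pd.to_numeric(parsed[col].str.replace(',', '', regex=False))

print(parsed)
```

Keeping the period in its own column avoids mixing hourly and yearly figures in the same numeric column.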

In [None]:
df.dtypes

country            object
country_code       object
date_added         object
has_expired        object
job_board          object
job_description    object
job_title          object
job_type           object
location           object
organization       object
page_url           object
salary             object
sector             object
uniq_id            object
dtype: object

'Apply' method to convert column dtype
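A per-row parser passed to `Series.apply` is one way this could go. The sketch below is not a finished solution: `first_amount` is a hypothetical helper that keeps only the first number of each salary string and returns NaN for anything it cannot read, and `sample` is a stand-in built from values seen in this notebook.

```python
import pandas as pd

def first_amount(value):
    """Return the first number in a salary string as a float, or NaN."""
    if pd.isna(value):
        return float('nan')
    parts = value.split()
    if not parts:
        return float('nan')
    try:
        # Drop thousands separators before converting.
        return float(parts[0].replace(',', ''))
    except ValueError:
        return float('nan')

sample = pd.Series(
    [None, '25.00 - 28.00 $ /hour', '45,000.00 - 60,000.00 $ /year'],
    name='salary',
)

salary_low = sample.apply(first_amount)
print(salary_low)
```

`apply` calls the function once per row, which is easy to read and debug but slower than vectorized string methods on large frames.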

Empty cells by column
What can we do about these columns?

-> date_added:
 <br />-> job_type: 
 <br />-> organization: 
 <br />-> salary:
 <br />-> sector: 

In [None]:
df.isnull().sum()

country                0
country_code           0
date_added         21878
has_expired            0
job_board              0
job_description        0
job_title              0
job_type            1628
location               0
organization        6867
page_url               0
salary             18554
sector              5194
uniq_id                0
dtype: int64

In [None]:
df.head()

Unnamed: 0,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector,uniq_id
0,United States of America,US,,No,jobs.monster.com,TeamSoft is seeing an IT Support Specialist to...,IT Support Technician Job in Madison,Full Time Employee,"Madison, WI 53702",,http://jobview.monster.com/it-support-technici...,,IT/Software Development,11d599f229a80023d2f40e7c52cd941e
1,United States of America,US,,No,jobs.monster.com,The Wisconsin State Journal is seeking a flexi...,Business Reporter/Editor Job in Madison,Full Time,"Madison, WI 53708",Printing and Publishing,http://jobview.monster.com/business-reporter-e...,,,e4cbb126dabf22159aff90223243ff2a
2,United States of America,US,,No,jobs.monster.com,Report this job About the Job DePuy Synthes Co...,Johnson & Johnson Family of Companies Job Appl...,"Full Time, Employee",DePuy Synthes Companies is a member of Johnson...,Personal and Household Services,http://jobview.monster.com/senior-training-lea...,,,839106b353877fa3d896ffb9c1fe01c0
3,United States of America,US,,No,jobs.monster.com,Why Join Altec? If you’re considering a career...,Engineer - Quality Job in Dixon,Full Time,"Dixon, CA",Altec Industries,http://jobview.monster.com/engineer-quality-jo...,,Experienced (Non-Manager),58435fcab804439efdcaa7ecca0fd783
4,United States of America,US,,No,jobs.monster.com,Position ID# 76162 # Positions 1 State CT C...,Shift Supervisor - Part-Time Job in Camphill,Full Time Employee,"Camphill, PA",Retail,http://jobview.monster.com/shift-supervisor-pa...,,Project/Program Management,64d0272dc8496abfd9523a8df63c184c
