# Comprehension: Cleaning the Data

In [164]:
import pandas as pd
import numpy as np

In [165]:
job_postings_df = pd.read_csv('job_postings.csv')

In [166]:
job_postings_df.set_index('job_id', inplace=True)

## Question 1

What is the number of null values in the salary_range column?

In [167]:
job_postings_df['salary_range'].isnull().sum()

7490

## Question 2

As part of your task to create the application:
1. Check for missing values in all the variables (features) of the dataset.
2. Count how many columns have more than 50% missing values.

In [168]:
(job_postings_df.isnull().sum() > (job_postings_df.shape[0] / 2))

title                  False
location               False
department              True
salary_range            True
company_profile        False
description            False
requirements           False
benefits               False
telecommuting          False
has_company_logo       False
has_questions          False
employment_type        False
required_experience    False
required_education     False
industry               False
function               False
fraudulent             False
dtype: bool
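
The boolean Series above marks which columns cross the threshold, but the question asks for a count. A minimal sketch of one way to get both the names and the count is shown here. The helper name `columns_above_missing_threshold` and the toy data are made up for illustration and are not part of the original notebook. The same helper also works for a different cutoff such as 0.30.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df, threshold):
    """Return the names of columns whose share of nulls exceeds threshold."""
    # isnull().mean() gives the fraction of nulls per column (True counts as 1).
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()


# Toy stand-in for job_postings_df (hypothetical values).
toy = pd.DataFrame({
    "title": ["Analyst", "Engineer", "Clerk", "Designer"],
    "department": [np.nan, np.nan, np.nan, "Sales"],
    "salary_range": [np.nan, "20000-30000", np.nan, np.nan],
    "industry": ["Banking", np.nan, "Internet", np.nan],
})

mostly_missing = columns_above_missing_threshold(toy, 0.5)
print(mostly_missing)       # names of columns with more than 50% nulls
print(len(mostly_missing))  # how many such columns there are
```

Note that `industry` is exactly 50% null in the toy data, so the strict `>` comparison leaves it out, which matches the "more than 50%" wording of the question.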

## Question 3

- In the previous question, you can observe that two columns, department and salary_range, have more than 50% missing values.
- Remove these two columns from the dataset.
- After dropping these columns, identify the number of rows that have more than five missing values.

### 3.1: Removing the columns which have more than 50% missing values.

In [169]:
columns_to_drop = job_postings_df.loc[:, (job_postings_df.isnull().sum() > (job_postings_df.shape[0] / 2))].columns

job_postings_df = job_postings_df.drop(columns=columns_to_drop)

In [170]:
# None of the dropped column names should remain in the DataFrame.
if not set(columns_to_drop) & set(job_postings_df.columns):
    print('dropped successfully!')

dropped successfully!


### 3.2: Identifying the number of rows that have more than five missing values.

In [171]:
job_postings_df.loc[job_postings_df.isnull().sum(axis=1) > 5, :]

job_id,title,location,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
6,Accounting Clerk,"US, MD,",,Job OverviewApex is an environmental consultin...,,,0,0,0,,,,,,0
19,Visual Designer,"US, NY, New York",Kettle is an independent digital agency based ...,Kettle is hiring a Visual Designer!Job Locatio...,,,0,1,0,,,,,,0
25,Customer Service,"GB, LND, London",,We are a canary wharf based e-commerce company...,,,0,0,0,,,,,,0
26,H1B SPONSOR FOR L1/L2/OPT,"US, NY, New York",i28 Technologies has demonstrated expertise in...,"Hello,Wish you are doing good... ...","JAVA, .NET, SQL, ORACLE, SAP, Informatica, Big...",,0,1,1,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9929,Careers At Apcera,"US, CA, San Francisco","From the lands of Can Do, Roll Our Sleeves Up ...",Place holder for candidates that apply to jobs...,,,0,1,0,Full-time,,,,,0
9931,I can't see a job for me but I'm really intere...,"GB, ,",Lost My Name combines the power of storytellin...,"We're always looking for smart, creative peopl...",,,0,1,1,,,,,,0
9951,System Analyst,"US, TX,",,"Hello Candidate, there is a job opportunity fo...",,,0,0,0,,,,,,0
9964,I want to work @Workable,"US, ,",Workable is a venture-backed startup making cl...,If you'd like to be part of what we do at Work...,,,0,1,0,,,,Computer Software,,0
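
The cell above displays the matching rows, while the question asks how many there are. A small sketch of counting them directly follows. It uses a made-up three-column frame, so the cutoff is 1 missing value rather than the notebook's 5.

```python
import numpy as np
import pandas as pd

# Toy stand-in for job_postings_df (hypothetical values).
toy = pd.DataFrame(
    {
        "requirements": ["SQL", np.nan, np.nan],
        "benefits": ["Health plan", np.nan, "Bonus"],
        "industry": ["Banking", np.nan, np.nan],
    },
    index=pd.Index([3, 6, 19], name="job_id"),
)

# Number of missing values in each row.
missing_per_row = toy.isnull().sum(axis=1)

# Boolean mask of rows above the cutoff (the notebook uses 5 here).
sparse_rows = missing_per_row > 1

# Summing a boolean mask counts the True entries.
n_sparse = int(sparse_rows.sum())
print(n_sparse)
```

Keeping the mask in a variable such as `sparse_rows` also lets the same condition be reused later, for example `toy.loc[~sparse_rows]` to keep only the remaining rows.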


## Question 4

- In the previous question, you identified the rows which contain more than five missing values. Now remove all such rows.

- After filling the empty data in company_profile, description, requirements, and benefits with 0, how many of the remaining columns have more than 30% null values?

### 4.1: Remove the rows which contain more than five missing values

In [172]:
rows_to_drop = job_postings_df.loc[job_postings_df.isnull().sum(axis=1) > 5, :]

In [173]:
job_postings_df.drop(index=rows_to_drop.index, inplace=True)

### 4.2: Fill the empty data with 0.

Fill the missing values in these four columns:
1. company_profile
2. description
3. requirements
4. benefits

In [174]:
job_postings_df.loc[:,['company_profile', 'description', 'requirements', 'benefits']] = job_postings_df[['company_profile', 'description', 'requirements', 'benefits']].fillna(0)
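
As an alternative to assigning through `.loc`, `DataFrame.fillna` also accepts a dict that maps column names to fill values and leaves every other column untouched. The sketch below shows this on a made-up frame. The columns are created with `dtype=object` as an assumption, so that mixing strings with the integer 0 is allowed.

```python
import pandas as pd

# Toy stand-in for job_postings_df (hypothetical values, object dtype assumed).
toy = pd.DataFrame({
    "description": pd.Series(["Build APIs", None, "Sell software"], dtype=object),
    "benefits": pd.Series([None, "Health plan", None], dtype=object),
    "industry": pd.Series(["Internet", None, "Banking"], dtype=object),
})

text_columns = ["description", "benefits"]

# Fill only the listed columns; fillna returns a new DataFrame by default.
filled = toy.fillna({column: 0 for column in text_columns})

print(filled[text_columns].isnull().sum())  # no nulls left in the filled columns
print(filled["industry"].isnull().sum())    # nulls elsewhere are untouched
```

Filling text columns with the integer 0 leaves them with mixed string and integer values, which is worth keeping in mind for any later string processing.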

### 4.3: Find the columns with more than 30% empty / null values.

In [175]:
job_postings_df.loc[:, (job_postings_df.isnull().sum() > (job_postings_df.shape[0] * 0.30))].columns

Index(['required_experience', 'required_education'], dtype='object')

### 4.4: Remove all rows that contain a missing value in any of these columns.

In [231]:
job_postings_df.dropna(subset=['required_experience', 'required_education'], inplace=True)

In [232]:
job_postings_df

job_id,title,location,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
4,Account Executive - Washington DC,"US, DC, Washington",Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,(3-5),Bachelor's Degree,Computer Software,Sales,0
5,Bill Review Manager,"US, FL, Fort Worth",SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,(3-5),Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
7,Head of Content (m/f),"DE, BE, Berlin","Founded in 2009, the Fonpit AG rose with its i...",Your Responsibilities: Manage the English-spea...,Your Know-How: ...,Your Benefits: Being part of a fast-growing co...,0,1,1,Full-time,(3-5),Master's Degree,Online Media,Management,0
11,ASP.net Developer Job opportunity at United St...,"US, NJ, Jersey City",0,Position : #URL_86fd830a95a64e2b30ceed829e63fd...,Position : #URL_86fd830a95a64e2b30ceed829e63fd...,Benefits - FullBonus Eligible - YesInterview T...,0,0,0,Full-time,(3-5),Bachelor's Degree,Information Technology and Services,Information Technology,0
13,"Applications Developer, Digital","US, CT, Stamford","Novitex Enterprise Solutions, formerly Pitney ...","The Applications Developer, Digital will devel...",Requirements:4 – 5 years’ experience in develo...,0,0,1,0,Full-time,(0-2),Bachelor's Degree,Management Consulting,Information Technology,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9981,UX Lead,"US, TX, Austin","We're an emerging technology agency, and we bu...",We’re looking for an experienced UX Designer t...,Here are our firm requirements for the role:Yo...,Why work for Mutual Mobile? We craft beautiful...,0,1,1,Full-time,(3-5),Bachelor's Degree,Computer Software,Design,0
9983,"[Raleigh, NC] Fundraising Consultant","US, NC, Raleigh",All American classics is a leading fundraising...,Position: Fundraising ConsultantLocation: Rale...,Must be able to work remotely.Must be self-dri...,Competitive Quarterly Bonus Structure Weekly ...,1,1,0,Full-time,0,Unspecified,Fund-Raising,Sales,0
9985,Junior Art Director (m/f),"DE, HH, Hamburg",Goodmates is a studio for digital communicatio...,Define and implement innovative solutions for ...,One year of experience in an agency or complet...,Get to be part of something amazing. With real...,0,1,0,Full-time,0,Bachelor's Degree,Marketing and Advertising,Advertising,0
9994,Executive Assistant (Full-Time) for Tech Compa...,"US, GA, Atlanta",352 Inc. is a full-service digital agency crea...,352 Inc. is a full-service digital agency in M...,QualificationsExperience as an Executive Assis...,What You’ll GetFreedom: We trust you to do you...,0,1,1,Full-time,(0-2),Unspecified,Internet,Administrative,0


## Question 8

You decide to compute the 'Fraud Rate' only for job postings that don't have a logo. What does this value come out to be?

In [233]:
job_postings_df[job_postings_df['has_company_logo'] == 0]

job_id,title,location,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
11,ASP.net Developer Job opportunity at United St...,"US, NJ, Jersey City",0,Position : #URL_86fd830a95a64e2b30ceed829e63fd...,Position : #URL_86fd830a95a64e2b30ceed829e63fd...,Benefits - FullBonus Eligible - YesInterview T...,0,0,0,Full-time,(3-5),Bachelor's Degree,Information Technology and Services,Information Technology,0
38,PROJECT MANAGER,"US, TX, HOUSTON",0,we are looking for a Project Manager. The Proj...,0,0,0,0,0,Full-time,(3-5),Bachelor's Degree,Oil & Energy,Engineering,0
43,Jr. Developer,US,0,Entry level Software DeveloperLocation : Atlan...,0,0,0,0,0,Full-time,0,Bachelor's Degree,Computer Software,Engineering,0
47,Entry Level,"EG, C,",History &amp; Background The Bank started its ...,We offer diversified opportunities in various ...,-0-1 years experience English Bachelor in comm...,0,0,0,0,Full-time,0,Bachelor's Degree,Banking,,0
54,Technical Project Manager,"US, NY, New York City",0,GBI is a growing company developing several cu...,Must have excellent oral and written communica...,"Experience with CRM, such as SugarCRM.Past emp...",0,0,0,Full-time,(0-2),Bachelor's Degree,Financial Services,Information Technology,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9945,Manufacturing Engineering Manager - Cedar Falls,"US, MO, St. Louis",We Provide Full Time Permanent Positions for m...,"Establishing objectives, schedules and priorit...",Manufacturing Engineering Manager - Cedar Fall...,0,0,0,0,Full-time,(3-5),Bachelor's Degree,Electrical/Electronic Manufacturing,Manufacturing,0
9946,Certified Medical Assistant,"US, AZ, Scottsdale",0,Back office medical assistant for a busy PM&am...,Certified Medical AssistantEMR / EHR experience,"Medical, Dental, and Vison after 90 day probat...",0,0,0,Full-time,0,Certification,Medical Practice,Health Care Provider,0
9949,Quality Assurance Engineer,"US, UT, Murray",0,POSITION SUMMARY:If you are a top notch Qualit...,Experience breaking complex software systemsE...,0,0,0,0,Full-time,(3-5),Bachelor's Degree,Internet,Quality Assurance,0
9952,Talented Web Developer,"GR, I, Halandri",0,FACTS ABOUT THE COMPANY#1 We are a small agenc...,You’ve gained experience by working on a team ...,We offer competitive salary depending on exper...,0,0,0,Full-time,(3-5),Bachelor's Degree,Marketing and Advertising,Information Technology,0


In [260]:
job_postings_df[(job_postings_df['has_company_logo'] == 0)]['fraudulent'].value_counts()

fraudulent
0    598
1     41
Name: count, dtype: int64

In [259]:
round(64 / job_postings_df[(job_postings_df['has_company_logo'] == 1)]['fraudulent'].shape[0], 2)

0.02

In [261]:
round(41 / job_postings_df[(job_postings_df['has_company_logo'] == 0)]['fraudulent'].shape[0], 2)

0.06

In [263]:
round(job_postings_df[(job_postings_df['has_questions'] == 0) & (job_postings_df['fraudulent'] == 1)].shape[0] / job_postings_df[(job_postings_df['has_questions'] == 0)]['fraudulent'].shape[0], 2)

0.03
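
The cells above type the fraudulent counts (41 and 64) in by hand. Because fraudulent is a 0/1 column, its mean within a group is already the fraud rate, so the counts do not need to be hardcoded. A minimal sketch on made-up data, not the notebook's own numbers:

```python
import pandas as pd

# Toy stand-in for job_postings_df (hypothetical values).
toy = pd.DataFrame({
    "has_company_logo": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "fraudulent":       [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Fraud rate for postings without a logo: mean of the 0/1 flag in that subset.
no_logo_rate = toy.loc[toy["has_company_logo"] == 0, "fraudulent"].mean()
print(round(no_logo_rate, 2))

# Fraud rate for every value of has_company_logo in one step.
fraud_rate_by_logo = toy.groupby("has_company_logo")["fraudulent"].mean().round(2)
print(fraud_rate_by_logo)
```

The same pattern applies to any other binary flag, for example grouping by has_questions instead of has_company_logo.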