# Part 2: Data Cleaning, Lemmatization and Feature Extraction

### In this section, the data will be cleaned to remove undesirable information for NLP, taking into account the findings in the previous part (EDA).

### Then, text data will be combined and TF-IDF will be used to extract features.

In [78]:
import numpy as np
import pandas as pd
import sys
import matplotlib.pyplot as plt
import pandas_profiling
import seaborn as sb
import spacy
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit
from imblearn.over_sampling import ADASYN

adasyn=ADASYN()
sb.set()

""" NOTE: SET YOUR PROJECT ROOT DIRECTORY HERE """
PROJ_DIR = r""
RANDOM_SPLIT_SEED = 11

In [77]:
!{sys.executable} -m pip install pandas_profiling
!{sys.executable} -m pip install spacy
!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install imblearn
!{sys.executable} -m spacy download en_core_web_md



Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
     ------------------------------------- 226.0/226.0 kB 14.4 MB/s eta 0:00:00
Collecting joblib>=1.1.1
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
     ---------------------------------------- 298.0/298.0 kB ? eta 0:00:00
Installing collected packages: joblib, imbalanced-learn, imblearn
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.0
    Uninstalling joblib-1.1.0:
      Successfully uninstalled joblib-1.1.0
Successfully installed imbalanced-learn-0.10.1 imblearn-0.0 joblib-1.2.0
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     --------------------------------------- 42.8/42.8 MB 10.4 MB/s eta 0:00:00
[38;5;2m[+] Download and installation successful[

In [67]:
df = pd.read_csv(PROJ_DIR + r"\sc1015-project\dataset\fake_job_postings.csv")
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [48]:
for column in df.columns:
    df[column]=df[column].fillna(f'missing_{column}')

We observe that there are 4 boolean valued columns (including `fraudulent`). However, they are stored as int64 and will need to be encoded into text for the nlp

In [49]:
boolean_cols = ['has_company_logo','has_questions','telecommuting']

# changing boolean values to text and appending their column names
for col in boolean_cols:
    df.loc[df[col]==0,col] = f'no_{col}'
    df.loc[df[col]==1,col] = f'yes_{col}'

df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,missing_salary_range,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,missing_benefits,no_telecommuting,yes_has_company_logo,no_has_questions,Other,Internship,missing_required_education,missing_industry,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,missing_salary_range,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,no_telecommuting,yes_has_company_logo,no_has_questions,Full-time,Not Applicable,missing_required_education,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",missing_department,missing_salary_range,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,missing_benefits,no_telecommuting,yes_has_company_logo,no_has_questions,missing_employment_type,missing_required_experience,missing_required_education,missing_industry,missing_function,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,missing_salary_range,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,no_telecommuting,yes_has_company_logo,no_has_questions,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",missing_department,missing_salary_range,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,no_telecommuting,yes_has_company_logo,yes_has_questions,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


Similarly, categorical values will be modified to the format of the boolean values (i.e {data value}_{col name})

In [50]:
categorical_cols = ['required_experience','required_education','employment_type']
for col in categorical_cols:
    val = list(df[col].values)
    val = ["_".join(re.findall(r"[\w']+",i)) + f"_{col}" for i in val]
    df[col] = np.array(val)
    
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,missing_salary_range,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,missing_benefits,no_telecommuting,yes_has_company_logo,no_has_questions,Other_employment_type,Internship_required_experience,missing_required_education_required_education,missing_industry,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,missing_salary_range,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,no_telecommuting,yes_has_company_logo,no_has_questions,Full_time_employment_type,Not_Applicable_required_experience,missing_required_education_required_education,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",missing_department,missing_salary_range,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,missing_benefits,no_telecommuting,yes_has_company_logo,no_has_questions,missing_employment_type_employment_type,missing_required_experience_required_experience,missing_required_education_required_education,missing_industry,missing_function,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,missing_salary_range,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,no_telecommuting,yes_has_company_logo,no_has_questions,Full_time_employment_type,Mid_Senior_level_required_experience,Bachelor's_Degree_required_education,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",missing_department,missing_salary_range,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,no_telecommuting,yes_has_company_logo,yes_has_questions,Full_time_employment_type,Mid_Senior_level_required_experience,Bachelor's_Degree_required_education,Hospital & Health Care,Health Care Provider,0


Now, the text of all the columns will be concatenated into a single block to be processed by tf-idf. The text will also be converted to lowercase to reduce variables that have little to no effect (in this case) on the meaning of the text.

In [51]:
list_of_cols = list(df)
list_of_cols.remove('fraudulent') # remove fraudulent because it is the desired classification
list_of_cols.remove('job_id') # remove job_id because useless data
print(list_of_cols)

ser=pd.Series(np.full(shape=(df.shape[0],), fill_value=''))
for col in list_of_cols:
    ser += (" " + df[col].str.lower())  # Convert text to lowercase

    
    
print(ser.describe())
ser.head()

['title', 'location', 'department', 'salary_range', 'company_profile', 'description', 'requirements', 'benefits', 'telecommuting', 'has_company_logo', 'has_questions', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
count                                                 17880
unique                                                17596
top        title insurance ops: sr title officer/counsel...
freq                                                      6
dtype: object


0     marketing intern us, ny, new york marketing m...
1     customer service - cloud video production nz,...
2     commissioning machinery assistant (cma) us, i...
3     account executive - washington dc us, dc, was...
4     bill review manager us, fl, fort worth missin...
dtype: object

We now have 17880 rows of text blocks, each representing a single job posting.


Here we will generate a list of stopwords, and exclude 'no' from it since this word encodes information regarding the boolean columns.

In [52]:
nltk.download('stopwords')
list_of_stopwords = nltk.corpus.stopwords.words('english')
list_of_stopwords.remove('no')

nlp = spacy.load('en_core_web_md')


def lemmatize_text(text):
    text = nlp(text)
    #-PRON- happens when the word is a pronoun. In that case, return the word again
    text = ' '.join([word.lemma_ if (word.lemma_ != '-PRON-' and word.lemma not in list_of_stopwords) else word.text for word in text])
    return text

def remove_special_chars(text):
    """
    if the character is not a space and regular alphabet,
    the character will be removed. However "_" will not be
    removed
    """
    char = r'[^A-z0-9\s]'
    possessive_char = r'\ss\s'
    white_space=r'  '
    text = re.sub(char, '', text) #removing
    text = re.sub(possessive_char,'',text)
    text = re.sub(white_space,' ',text)
    return text

## WARNING: SLOW PROCESS
ser_lemma = ser.apply(lambda x: lemmatize_text(x))
ser_lemma = ser.apply(lambda x: remove_special_chars(x))

print(ser_lemma.describe())
ser_lemma.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Workstation\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


count                                                 17880
unique                                                17582
top        title insurance ops sr title officercounsel u...
freq                                                      6
dtype: object


0     marketing intern us ny new york marketing mis...
1     customer service cloud video production nz au...
2     commissioning machinery assistant cma us ia w...
3     account executive washington dc us dc washing...
4     bill review manager us fl fort worth missing_...
dtype: object

The lemmatized and cleaned text is saved to csv because it takes a long time to process again

In [58]:
ser_lemma.to_csv('lemmatized_text.csv', index = False)

### This section will use the TF-IDF vectorizer to convert the text into vectors, which will then be used as inputs to various ML models in the following parts.

Note: This code will be repeated again in each seperate ML model file, to avoid issues associated with storing the ndarray in csv

In [111]:
cleaned_text = pd.read_csv('lemmatized_text.csv').squeeze() # convert to pd series
cleaned_text.head()

<class 'pandas.core.series.Series'>


In [112]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(cleaned_text)
tfidf_data = vectorizer.transform(cleaned_text)

Use `StratifiedShuffleSplit` to allow for stratified sampling. 80/20 train test split with fixed seed to ensure consistency


In [113]:
sss=StratifiedShuffleSplit(n_splits=1, random_state=RANDOM_SPLIT_SEED, test_size=0.2)
X=np.zeros(shape=tfidf_data.shape[0],dtype=np.bool_)
y=np.array(df['fraudulent']) #prediction target
for train_index, test_index in sss.split(X, y):
    X_train, X_test = tfidf_data[train_index,:], tfidf_data[test_index,:]
    y_train, y_test = y[train_index], y[test_index]

From Part 1 EDA, we have seen that there exists a significant class imbalance (roughly 20-1) between fraudulent and not. Hence, it is important to consider the effects of this. Here we use ADASYN to generate synthetic data for the minority class.

In [None]:
X_res, y_res = adasyn.fit_resample(X_train, y_train)