# Data Preprocessing

This notebook cleans the data frame using the preprocessing steps identified in the data analysis phase and stores the cleaned data frame for the modelling phase.

In [1]:
# Import packages
import string

import pandas as pd
import numpy as np

import wordninja

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action

from sklearn.preprocessing import LabelEncoder

import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ramra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ramra\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ramra\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [13]:
#read the data as a pandas dataframe
df = pd.read_json('../Data/data.json')

### Dropping non-English entries

Since only a negligible number of descriptions and titles are not in English, I drop them. A language translation model is also available in *Spark NLP*; it is implemented in a separate notebook.
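The rows dropped in this section are selected by index. As a hedged illustration of how such rows could be flagged automatically, the sketch below uses a simple stop-word ratio. The heuristic itself, the `looks_english` helper, the word list and the threshold are assumptions made for illustration and are not part of this notebook.

```python
import re

# Small, hand-picked list of common English words (an assumption, not a standard list).
ENGLISH_STOPWORDS = {"the", "and", "is", "in", "of", "to", "a", "for", "with", "we", "are", "you"}

def looks_english(text, threshold=0.1):
    """Return True if at least `threshold` of the tokens are common English words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return False
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens) >= threshold

english = looks_english("Snapchat is seeking an experienced market researcher to join the team")
french = looks_english("Teads.Tv est spécialisé dans la diffusion de")
```

Very short strings such as job titles rarely contain stop words, so a check like this is only meaningful on the longer description column, and any flagged rows should still be inspected by hand.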

In [7]:
# Non-English entries
display(df[166:167])
display(df[146:147])

| | level | description | title |
|---|---|---|---|
| 166 | | Teads.Tv est spécialisé dans la diffusion de c... | Finance Controller |


| | level | description | title |
|---|---|---|---|
| 146 | | no | Praktikant im Bereich Operations & Customer Se... |


In [14]:
# Drop rows with non-english entries
df.drop(index=[146, 166], inplace=True)

# Reset the index.
df.reset_index(inplace=True, drop = True)

In [15]:
# Use the rows without labels (levels) as the test set and the remaining rows as the train set
test_df = df[df['level'].isnull()].reset_index(drop=True)

# Drop the unlabeled rows from the train set and reset the index
df.dropna(inplace = True)
df.reset_index(inplace = True, drop = True)

print(f"Test size: {test_df.shape[0]}")
print(f"Train size: {df.shape[0]}")

Test size: 73
Train size: 141


### Split combined words

Some words in the text are joined together, for example "job descriptionInVision". In this part, such joined words are split using *wordninja*.

In [16]:
def splitwords(text):
    """
    Split combined or joined words in each entry of the given pandas Series and return the cleaned entries as a list.
    """
    # wordninja.split returns the list of words it finds in a string
    return [' '.join(wordninja.split(entry)) for entry in text.tolist()]
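For intuition about what a frequency-based word splitter does, here is a toy dynamic-programming sketch. It is not wordninja's implementation: the eight-word vocabulary, its ordering, the cost formula and the `toy_split` name are assumptions chosen only to make the idea visible on the example from the text.

```python
import math

# Toy vocabulary, ordered from most to least frequent (an assumption for illustration).
VOCAB = ["the", "is", "in", "job", "description", "vision", "world", "leading"]
# More frequent words get a lower cost.
COST = {word: math.log((rank + 1) * math.log(len(VOCAB))) for rank, word in enumerate(VOCAB)}
MAX_LEN = max(len(word) for word in VOCAB)
UNKNOWN_PENALTY = 10.0  # cost per character for pieces outside the vocabulary

def toy_split(text):
    """Split a string into the cheapest sequence of vocabulary words."""
    s = text.lower()
    best_cost = [0.0]   # best_cost[i] is the cheapest cost of splitting s[:i]
    best_start = [0]    # best_start[i] is where the last piece of that split begins
    for end in range(1, len(s) + 1):
        options = []
        for start in range(max(0, end - MAX_LEN), end):
            piece = s[start:end]
            piece_cost = COST.get(piece, UNKNOWN_PENALTY * len(piece))
            options.append((best_cost[start] + piece_cost, start))
        cost, start = min(options)
        best_cost.append(cost)
        best_start.append(start)
    # Walk backwards through the recorded start positions to recover the words.
    words = []
    end = len(s)
    while end > 0:
        start = best_start[end]
        words.append(s[start:end])
        end = start
    return words[::-1]

pieces = toy_split("jobdescriptionInVision")
```

With this vocabulary, `pieces` is `['job', 'description', 'in', 'vision']`. A splitter of this kind can only recover words it knows, which is one plausible reason a company name may come out as several short fragments in the cleaned output.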

In [17]:
# Split the combined words in the actual dataframe
cleaned_df = pd.DataFrame()

cleaned_df['description'] = splitwords(df.description)
cleaned_df['title'] = splitwords(df.title)
# Assign the labels by position so that the result does not depend on df's index
cleaned_df['level'] = df.level.values
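A point worth keeping in mind when copying a column between data frames: pandas aligns a Series on its index labels, not on row position. The small sketch below, with made-up frames named `source` and `target`, shows how a filtered frame with gaps in its index can introduce missing labels, and how a positional assignment avoids that.

```python
import pandas as pd

# After dropna the remaining rows keep their original labels 0 and 2.
source = pd.DataFrame({"level": ["a", None, "b"]}).dropna()

# A frame built from plain lists gets a fresh index 0, 1.
target = pd.DataFrame({"text": ["x", "y"]})

# Index-aligned assignment: row 1 finds no label 1 in `source` and becomes NaN.
target["aligned"] = source["level"]

# Positional assignment: the values are copied in order, ignoring index labels.
target["positional"] = source["level"].to_numpy()
```

In this notebook the labeled rows appear to occupy a contiguous block starting at index 0, so both forms would give the same result here; the positional form simply does not rely on that.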

In [18]:
# For the test set
cleaned_test_df = pd.DataFrame()

cleaned_test_df['description'] = splitwords(test_df.description)
cleaned_test_df['title'] = splitwords(test_df.title)

In [19]:
cleaned_df[4:10]

| | description | title | level |
|---|---|---|---|
| 4 | JOB DESCRIPTION Pa met is looking for a IBM We... | IBM Web Sphere Portal Developer | Entry Level |
| 5 | JOB DESCRIPTION Pa met is looking for a iOS de... | iOS Developer | Entry Level |
| 6 | JOB DESCRIPTION Pa met is looking for a Java D... | Java Developer | Entry Level |
| 7 | Snap chat is seeking an experienced market res... | Market Researcher | Senior Level |
| 8 | Description In Vision is the world s leading d... | Senior Full Stack Developer remote | Senior Level |
| 9 | Description In Vision is the world s leading d... | Enterprise Account Executive remote | Senior Level |


### Simple Preprocessing Steps 

In [20]:
def preprocessing(df):
    """
    This function does the following preprocessing steps for a given dataframe and returns the cleaned frame.
        1. remove "\n",
        2. expand contractions (e.g., "can't" as "can not"),
        3. replace digits with words (e.g., "3 years" as "three years"),
        4. remove all special characters,
        5. convert text to lowercase,
        6. remove "description", "job description", "job purpose", "company description", "no description available" and "no",
        7. replace abbreviations (e.g., "sr", "jr"),
        8. remove single characters if available, for example: "m", "f" from (m/f),
        9. collapse repeated white spaces and remove leading and trailing white spaces.
    """
    # remove "\n"
    df = df.replace('\n', ' ', regex=True)
    
    # expand contractions
    df = df.replace({r"won't":"will not",
                     r"can\'t":"can not", 
                     "n't":" not", 
                     r"\'re":" are",
                     r"\'s":" is",
                     r"\'d":" would",
                     r"\'ll":" will",
                     r"\'t":" not",
                     r"\'ve":" have",
                     r"\'m":" am"}, regex=True)
    
    # replace each digit with its word
    df = df.replace({'0': ' zero',
       '1': ' one',
       '2': ' two',
       '3': ' three',
       '4': ' four',
       '5': ' five',
       '6': ' six',
       '7': ' seven',
       '8': ' eight',
       '9': ' nine'}, regex=True)
    
    # remove special characters
    df = df.replace(r'[^A-Za-z0-9]+', ' ', regex=True)
    
    # convert to lower case
    df = df.apply(lambda x: x.astype(str).str.lower())
   
    # remove boilerplate phrases; \b restricts each match to whole words,
    # so "no" inside "technology" or "description" inside a longer word is left alone
    df = df.replace([r'\bjob description\b', r'\bjob purpose\b', r'\bcompany description\b',
                     r'\bno description available\b', r'\bdescription\b', r'\bno\b'], ' ', regex=True)
    
    # replace abbreviations (whole words only)
    df = df.replace({r'\bsr\b': 'senior',
                     r'\bjr\b': 'junior'}, regex=True)

    for column in ['description', 'title']:
        # remove single characters, collapse repeated white spaces, then strip both ends
        df[column] = (df[column]
                      .str.replace(r'\b\w\b', ' ', regex=True)
                      .str.replace(r'\s+', ' ', regex=True)
                      .str.strip())
    
    return df
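The same steps can be tried on a single string with only the standard library, which makes the effect of each rule easy to check. This is a minimal sketch rather than the notebook's function: the `clean_text` name is hypothetical, and the patterns use word boundaries (`\b`) as an assumption about the intended behaviour, so that short tokens such as "no" or "sr" are only touched when they stand alone.

```python
import re

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

# Longer phrases come first so that they win over their shorter parts.
BOILERPLATE = re.compile(
    r"\b(?:job description|job purpose|company description|"
    r"no description available|description|no)\b"
)

def clean_text(text):
    """Apply the cleaning rules to one string and return the result."""
    text = text.replace("\n", " ")
    # Spell out every digit, keeping a space in front of the word.
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[m.group()], text)
    # Replace runs of special characters with a single space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    text = text.lower()
    text = BOILERPLATE.sub(" ", text)
    text = re.sub(r"\bsr\b", "senior", text)
    text = re.sub(r"\bjr\b", "junior", text)
    # Drop single characters, collapse white space and strip both ends.
    text = re.sub(r"\b\w\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

example = clean_text("Job Description\nSr. Developer (m/f) with 3 years")
```

Here `example` is `"senior developer with three years"`. Without the word boundaries, a plain `"no"` or `"sr"` pattern would also match inside words such as "technology" or "israel".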

In [21]:
# Preprocess dataframes
cleaned_df = preprocessing(cleaned_df)
cleaned_test_df = preprocessing(cleaned_test_df)



In [22]:
cleaned_df[4:10]

| | description | title | level |
|---|---|---|---|
| 4 | pa met is looking for ibm web sphere developer... | ibm web sphere portal developer | entry level |
| 5 | pa met is looking for ios developers to join u... | ios developer | entry level |
| 6 | pa met is looking for java developers to join ... | java developer | entry level |
| 7 | snap chat is seeking an experienced market res... | market researcher | senior level |
| 8 | in vision is the world leading design collabor... | senior full stack developer remote | senior level |
| 9 | in vision is the world leading design collabor... | enterprise account executive remote | senior level |


In [24]:
# Combine job title and description
cleaned_df['text'] = cleaned_df["title"] + " " + cleaned_df["description"]
cleaned_test_df['text'] = cleaned_test_df["title"] + " " + cleaned_test_df["description"]

In [25]:
cleaned_df[4:10]

| | description | title | level | text |
|---|---|---|---|---|
| 4 | pa met is looking for ibm web sphere developer... | ibm web sphere portal developer | entry level | ibm web sphere portal developer pa met is look... |
| 5 | pa met is looking for ios developers to join u... | ios developer | entry level | ios developer pa met is looking for ios develo... |
| 6 | pa met is looking for java developers to join ... | java developer | entry level | java developer pa met is looking for java deve... |
| 7 | snap chat is seeking an experienced market res... | market researcher | senior level | market researcher snap chat is seeking an expe... |
| 8 | in vision is the world leading design collabor... | senior full stack developer remote | senior level | senior full stack developer remote in vision i... |
| 9 | in vision is the world leading design collabor... | enterprise account executive remote | senior level | enterprise account executive remote in vision ... |


In [26]:
cleaned_test_df[0:10]

| | description | title | text |
|---|---|---|---|
| 0 | outfitter is europe biggest personal shopping ... | customer service netherlands in berlin | customer service netherlands in berlin outfit... |
| 1 | outfitter is europe biggest personal shopping ... | devo ps engineer | devo ps engineer outfitter is europe biggest ... |
| 2 | outfitter is europe biggest personal shopping ... | head of product management it | head of product management it outfitter is eu... |
| 3 | outfitter is europe biggest personal shopping ... | help desk support | help desk support outfitter is europe biggest... |
| 4 | outfitter is europe biggest personal shopping ... | intern help desk | intern help desk outfitter is europe biggest ... |
| 5 | outfitter is europe biggest personal shopping ... | sales consultant belgium in berlin | sales consultant belgium in berlin outfitter ... |
| 6 | outfitter is europe biggest personal shopping ... | sales consultant denmark in berlin | sales consultant denmark in berlin outfitter ... |
| 7 | outfitter is europa is groot ste personal shop... | sales consultant in berlin | sales consultant in berlin outfitter is europ... |
| 8 | outfitter is europe biggest personal shopping ... | sales consultant netherlands | sales consultant netherlands outfitter is eur... |
| 9 | outfitter is europe biggest personal shopping ... | sales consultant sweden in berlin | sales consultant sweden in berlin outfitter i... |


In [27]:
# Encode the categorical levels as numerical values.
encoder = LabelEncoder()
cleaned_df.level = encoder.fit_transform(cleaned_df.level)

In [28]:
# Check the frequencies of the classes
(unique, counts) = np.unique(cleaned_df.level, return_counts=True)

frequencies = np.asarray((unique, counts)).T
frequencies

array([[ 0, 37],
       [ 1, 15],
       [ 2, 32],
       [ 3, 57]], dtype=int64)

In [19]:
# Save the dataframes for the modelling phase.
#cleaned_df.to_pickle("../Data/cleanedpandasDf.pkl") 
#cleaned_test_df.to_pickle("../Data/cleanedpandastestDf.pkl")
#cleaned_df.to_parquet("../Data/cleanedpandasDf.parquet") 
#cleaned_test_df.to_parquet("../Data/cleanedpandastestDf.parquet")

In [20]:
#cleaned_df.to_csv("../Data/cleanedpandasDf.csv")

In [21]:
#cleaned_test_df.to_csv("../Data/cleanedpandastestDf.csv")

### Balancing the dataset 

As noted in the data analysis phase, the data set is imbalanced. The number of entries per level in the train set is:
1. entry level: 37
2. senior level: 57
3. mid level: 32
4. internship: 15

This part addresses the imbalance by augmenting the under-represented levels with synonym replacement.

The encoded labels map to the levels as follows:

map: {0: entry level,
      1: internship,
      2: mid level,
      3: senior level}
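The numbering in this map follows from how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order. A plain-Python sketch of the same rule is shown below; the exact label strings are assumed from the lowercased levels seen in the cleaned frame.

```python
# Label strings as they are assumed to appear after lowercasing.
levels = ["senior level", "entry level", "internship", "mid level", "entry level"]

# Sort the distinct labels and number them in that order.
classes = sorted(set(levels))
mapping = {label: code for code, label in enumerate(classes)}

# Encode the original sequence with the mapping.
encoded = [mapping[label] for label in levels]
```

In the notebook itself, the fitted encoder exposes the same ordering through its `classes_` attribute, and `encoder.inverse_transform` turns codes back into level names.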

In [29]:
# Augment the "description" column of the dataframe.
X = cleaned_df['description']
y = cleaned_df.level 

In [30]:
# Augment text by replacing words with synonyms from WordNet (via nltk).
# aug_max=8 caps the number of replaced words per sentence at 8.
aug = naw.SynonymAug(aug_src='wordnet', aug_max=8)
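To make the idea of capped synonym replacement concrete without WordNet, here is a toy sketch. It is not how nlpaug works internally: the tiny `TOY_SYNONYMS` table, the `toy_synonym_augment` name and the `max_swaps` parameter are assumptions for illustration, with `max_swaps` playing a role loosely comparable to `aug_max`.

```python
import random

# Hypothetical synonym table; a real augmenter would look these up in WordNet.
TOY_SYNONYMS = {
    "looking": ["searching"],
    "join": ["enter"],
    "experienced": ["seasoned"],
    "leading": ["top"],
}

def toy_synonym_augment(sentence, max_swaps=2, seed=0):
    """Replace at most `max_swaps` words that have an entry in the synonym table."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, word in enumerate(words) if word in TOY_SYNONYMS]
    # Pick which positions to replace, never more than max_swaps.
    chosen = rng.sample(candidates, min(max_swaps, len(candidates)))
    for i in chosen:
        words[i] = rng.choice(TOY_SYNONYMS[words[i]])
    return " ".join(words)

original = "pa met is looking for java developers to join us"
one_swap = toy_synonym_augment(original, max_swaps=1)
all_swaps = toy_synonym_augment(original, max_swaps=8)
```

With a cap of 8, both replaceable words are swapped and `all_swaps` is `"pa met is searching for java developers to enter us"`; with a cap of 1, exactly one of them changes. The sentence length stays the same, which matches what the augmented rows in this notebook look like.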

In [31]:
augmented_text=[]
augmented_labels=[]
augmented_title=[]
augmented_description=[]

def append(temps, i, label):
    """
    Append the generated sentences to a list with the corresponding labels, titles and description.
    """
    for sent in temps:
        #print(sent)
        augmented_description.append(sent)        
        augmented_labels.append(label)
        augmented_title.append(cleaned_df['title'][i])
        augmented_text.append(cleaned_df['title'][i] + " " + sent)

def augmentData(X,y):
    
    """
    Augment data using the nlpaug synonym augmenter. n=2 requests two augmented sentences per description.
    Only the first 6 entry-level and the first 10 mid-level descriptions are augmented; every internship description is.
    """
    level_two_counter = 0
    level_zero_counter = 0
    
    for i in range(0, X.size):
        # entry level (0): augment only the first 6 descriptions
        if (y[i]==0 and level_zero_counter < 6):
            level_zero_counter = level_zero_counter + 1
            temps = aug.augment(X[i], n=2)
            append(temps, i, y[i])
            
        # internship (1): augment every description
        if y[i]==1:
            temps = aug.augment(X[i], n=2)
            append(temps, i, y[i])
            
        # mid level (2): augment only the first 10 descriptions
        if (y[i]==2 and level_two_counter < 10):
            level_two_counter = level_two_counter + 1
            temps = aug.augment(X[i], n=2)
            append(temps, i, y[i])
        

augmentData(X,y)

In [32]:
cleaned_df.level = y

In [33]:
augmentedDf = pd.DataFrame({"description": augmented_description, "title": augmented_title, "text": augmented_text, "level": augmented_labels})

In [34]:
augmentedDf

| | description | title | text | level |
|---|---|---|---|---|
| 0 | outfitter is europe biggest personal shopping ... | frontend engineer | frontend engineer outfitter is europe biggest... | 0 |
| 1 | outfitter is europe biggest personal shopping ... | frontend engineer | frontend engineer outfitter is europe biggest... | 0 |
| 2 | pa met is looking for android developer to joi... | android developer | android developer pa met is looking for androi... | 0 |
| 3 | pa fit is looking for android developer to joi... | android developer | android developer pa fit is looking for androi... | 0 |
| 4 | pa met is looking for ibm web sphere developer... | ibm web sphere portal developer | ibm web sphere portal developer pa met is look... | 0 |
| 5 | pa met is looking for ibm web sphere developer... | ibm web sphere portal developer | ibm web sphere portal developer pa met is look... | 0 |
| 6 | pa met is looking for ios developer to join us... | ios developer | ios developer pa met is looking for ios develo... | 0 |
| 7 | pa met is looking for ios developers to join u... | ios developer | ios developer pa met is looking for ios develo... | 0 |
| 8 | pa met is looking for java developers to join ... | java developer | java developer pa met is looking for java deve... | 0 |
| 9 | pa met is looking for java developers to join ... | java developer | java developer pa met is looking for java deve... | 0 |


In [35]:
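# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
internAugmentedDf = pd.concat([cleaned_df, augmentedDf], ignore_index=True)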

In [37]:
(unique, counts) = np.unique(internAugmentedDf['level'], return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

array([[ 0, 49],
       [ 1, 45],
       [ 2, 50],
       [ 3, 57]], dtype=int64)
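The counts above can be checked against a simple upper bound: each augmented description contributes at most two new rows. The sketch below recomputes that bound from the class sizes and the per-class limits used in `augmentData`; the variable names are only for illustration.

```python
# Class sizes before augmentation (from the frequency table of the train set).
original = {0: 37, 1: 15, 2: 32, 3: 57}
# Number of descriptions augmented per class: first 6, all 15, first 10, none.
augmented_inputs = {0: 6, 1: 15, 2: 10, 3: 0}
sentences_per_input = 2  # the n=2 argument

upper_bound = {level: original[level] + sentences_per_input * augmented_inputs[level]
               for level in original}

# Counts recorded in the output above.
observed = {0: 49, 1: 45, 2: 50, 3: 57}
shortfall = {level: upper_bound[level] - observed[level] for level in original}
```

The bound is 49, 45, 52 and 57, so classes 0, 1 and 3 match the recorded counts while class 2 is two rows short. One possible, unverified explanation is that the augmenter returned fewer than two distinct sentences for some mid-level descriptions.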

In [30]:
#internAugmentedDf.to_pickle("../Data/cleanedAugmentedPandasDf.pkl")
#internAugmentedDf.to_parquet("../Data/cleanedAugmentedPandasDf.parquet")

In [31]:
#internAugmentedDf.to_csv("../Data/cleanedAugmentedPandasDf.csv")