### Job Recommender

TBD

First, let's retrieve the data and extract the useful features.

In [1]:
import pandas as pd

raw_jobs = pd.read_csv('data/Combined_Jobs_Final.csv')

# Replace NaN values with empty strings so that string concatenation does not propagate NaN
raw_jobs.fillna('', inplace=True)

# Concatenate the informative text columns
jobs = raw_jobs['Title'] + ' ' + raw_jobs['Position'] + ' ' + raw_jobs['Job.Description']
# Convert the resulting Series into a single-column DataFrame
jobs = jobs.to_frame(name='aggregated')
jobs.head(5)

| | aggregated |
|---|---|
| 0 | Server @ Tacolicious Server Tacolicious' first... |
| 1 | Kitchen Staff/Chef @ Claude Lane Kitchen Staff... |
| 2 | Bartender @ Machka Restaurants Corp. Bartender... |
| 3 | Server @ Teriyaki House Server ● Serve food/d... |
| 4 | Kitchen Staff/Chef @ Rosa Mexicano - Sunset Ki... |
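The `fillna('')` call matters because a missing value in any one column would otherwise turn the whole concatenated string into NaN. The following is a minimal sketch of that effect on a toy frame; the column names `Title` and `Position` come from the dataset, while the two rows are made-up values for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values); the second row has a missing Position
toy = pd.DataFrame({
    "Title": ["Server @ Cafe", "Bartender"],
    "Position": ["Server", None],
})

# Without filling, the missing value propagates through the concatenation
with_nan = toy["Title"] + " " + toy["Position"]

# After filling with empty strings, every row yields a usable string
filled = toy.fillna("")
without_nan = filled["Title"] + " " + filled["Position"]

print(with_nan.isna().tolist())
print(without_nan.tolist())
```

The trade-off is that rows with a missing column keep a stray space, which the later tokenization step discards anyway.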


As the output above shows, the raw text contains many noisy characters and uninformative words, so it has to be cleaned before use. The function below optionally strips punctuation, tokenizes the text, removes stop words, and stems the remaining tokens.

In [2]:
# Importing the required libraries and tools
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')


def text_processing(df, col_name, punctuation=True):
    '''
    Convert the raw text in the given column into processed tokens.
    :param df: dataframe
    :param col_name: column name
    :param punctuation: if True, keep punctuation; if False, strip it before tokenizing
    :return: the same dataframe with the intermediate and final columns added
    '''
    if not punctuation:
        # punctuation cleaning (regex=True is required on recent pandas versions)
        df['clean_{}'.format(col_name)] = df[col_name].str.replace(r'[^\w\s]', '', regex=True)

        # tokenizing phase
        df['tokens_{}'.format(col_name)] = df['clean_{}'.format(col_name)].apply(nltk.word_tokenize)
    else:
        # tokenizing phase
        df['tokens_{}'.format(col_name)] = df[col_name].apply(nltk.word_tokenize)

    # stopwords cleaning (case-sensitive: capitalized stop words such as "We" are kept)
    stop = set(stopwords.words('english'))
    df['stop_{}'.format(col_name)] = df['tokens_{}'.format(col_name)]. \
        apply(lambda x: [word for word in x if word not in stop])

    # stemming
    ps = PorterStemmer()
    df['stemmed_{}'.format(col_name)] = df['stop_{}'.format(col_name)]. \
        apply(lambda words: [ps.stem(word) for word in words])

    return df


[nltk_data] Downloading package punkt to /home/amin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/amin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
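To make the first cleaning steps concrete, here is a minimal standard-library sketch applied to a single string. It is not the notebook's function: it assumes whitespace splitting in place of `nltk.word_tokenize`, uses a tiny hypothetical stop list (`STOP_WORDS`) instead of NLTK's English list, and omits stemming. The helper name `clean_text` and the sample sentence are made up for illustration.

```python
import re

# Hypothetical, tiny stop list for illustration; the notebook uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def clean_text(text):
    """Strip punctuation, split on whitespace, and drop stop words (case-sensitive)."""
    # Same pattern as in the notebook: remove anything that is not a word character or whitespace
    no_punct = re.sub(r"[^\w\s]", "", text)
    # Simplified tokenization: split on runs of whitespace
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


sample = "Kitchen Staff/Chef @ Claude Lane: prepare food for the guests."
print(clean_text(sample))
```

Because the slash is deleted rather than replaced by a space, `Staff/Chef` collapses into the single token `StaffChef`. Replacing punctuation with a space instead would keep the two words apart.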


In [3]:
# Clean the data; punctuation=False strips punctuation before tokenizing

processed_jobs = text_processing(jobs, 'aggregated', punctuation=False)
processed_jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84090 entries, 0 to 84089
Data columns (total 5 columns):
aggregated            84090 non-null object
clean_aggregated      84090 non-null object
tokens_aggregated     84090 non-null object
stop_aggregated       84090 non-null object
stemmed_aggregated    84090 non-null object
dtypes: object(5)
memory usage: 3.2+ MB


The intermediate columns are not needed in this project; hence, only the first and last columns are kept so that the raw and processed text can be compared.

In [4]:
final_jobs = processed_jobs[['aggregated', 'stemmed_aggregated']]
final_jobs.head(5)

| | aggregated | stemmed_aggregated |
|---|---|---|
| 0 | Server @ Tacolicious Server Tacolicious' first... | [server, tacolici, server, tacolici, first, pa... |
| 1 | Kitchen Staff/Chef @ Claude Lane Kitchen Staff... | [kitchen, staffchef, claud, lane, kitchen, sta... |
| 2 | Bartender @ Machka Restaurants Corp. Bartender... | [bartend, machka, restaur, corp, bartend, We, ... |
| 3 | Server @ Teriyaki House Server ● Serve food/d... | [server, teriyaki, hous, server, serv, fooddri... |
| 4 | Kitchen Staff/Chef @ Rosa Mexicano - Sunset Ki... | [kitchen, staffchef, rosa, mexicano, sunset, k... |
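Row 2 of the output still contains the token `We`. The likely cause is that the stop-word comparison is case-sensitive while NLTK's English list is lowercase. The sketch below shows one possible adjustment, lowercasing tokens before filtering. It assumes a small hypothetical stop list (`STOP_WORDS`) and a hypothetical helper `drop_stop_words`, neither of which is part of the notebook.

```python
# Hypothetical subset of a lowercase stop list, for illustration only
STOP_WORDS = {"we", "are", "a", "for"}


def drop_stop_words(tokens, lowercase=True):
    """Remove stop words, optionally lowercasing the tokens first."""
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = ["Bartender", "We", "are", "looking", "for", "a", "bartender"]

# Case-sensitive filtering keeps the capitalized "We"
print(drop_stop_words(tokens, lowercase=False))

# Lowercasing first removes it
print(drop_stop_words(tokens, lowercase=True))
```

Lowercasing also merges `Bartender` and `bartender` into one token, which is usually desirable for matching job texts but discards capitalization that could mark proper nouns.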
