<a href="https://colab.research.google.com/github/felixzhao/title_catgories_classification/blob/main/Job_title_classification_V2.1_preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, the raw data is preprocessed in the following steps:

1. Merge data.csv and categories.csv to create a raw dataset.
2. Preprocess the "x_title" column:

    1. remove stopwords
    2. remove punctuation
    3. stemming
    4. lemmatization

3. Save the result to raw.csv for the next step.

In [2]:
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [3]:
# Download the NLTK resources used below (stopwords, punkt, wordnet, omw-1.4).
# Alternatively, call nltk.download() with no arguments, enter "d" in the
# interactive downloader, then enter a package name to start the download.
import nltk
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

load data
This notebook loads data from Google Drive. Put the files into a "trademe_data" folder at the root of your Google Drive.

Then run the following code to mount Google Drive in Colab.

For more detail, see this guide: https://towardsdatascience.com/different-ways-to-connect-google-drive-to-a-google-colab-notebook-pt-1-de03433d2f7a

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
root_path = "drive/MyDrive/trademe_data/"

# load data

In [6]:
cat_raw = pd.read_csv(f"{root_path}categories.csv")
print(len(cat_raw))
cat_raw.head(1)

196


Unnamed: 0,y_cat_id,cat_1,cat_2,cat_3
0,4957,construction & roading,health & safety,health & safety


In [7]:
cat_raw[cat_raw.duplicated()]

Unnamed: 0,y_cat_id,cat_1,cat_2,cat_3


In [8]:
data_raw = pd.read_csv(f"{root_path}data.csv")
print(len(data_raw))
data_raw.head(1)

20000


Unnamed: 0,x_title,y_cat_id
0,unqualified asbestos remover,5192


In [9]:
raw_df = pd.merge(data_raw, cat_raw, on='y_cat_id', how='left')
print(len(raw_df))
raw_df.head(1)

20000


Unnamed: 0,x_title,y_cat_id,cat_1,cat_2,cat_3
0,unqualified asbestos remover,5192,trades & services,labourers,labourers
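The row count staying at 20000 shows the left merge did not duplicate rows, but it does not show whether every `y_cat_id` found a category. A minimal sketch of one way to check both, using toy frames (the values below are illustrative, not taken from the real files) and the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for data.csv and categories.csv (illustrative values only).
data = pd.DataFrame({
    "x_title": ["senior test analyst", "civil site engineer", "mystery role"],
    "y_cat_id": [5123, 5058, 9999],  # 9999 has no matching category
})
cats = pd.DataFrame({
    "y_cat_id": [5123, 5058],
    "cat_1": ["it", "engineering"],
})

# validate="many_to_one" raises MergeError if y_cat_id repeats in cats;
# indicator=True adds a "_merge" column recording where each row matched.
merged = pd.merge(data, cats, on="y_cat_id", how="left",
                  validate="many_to_one", indicator=True)

# Rows present only on the left side have no category and NaN in cat_1.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

Applied to the real frames, an empty `unmatched` result would suggest that every title received its three category columns.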


# functions

In [None]:
# translate_table = dict((ord(char), ' ') for char in string.punctuation)
# def remove_punctuation(s):
#     return s.translate(translate_table)

In [None]:
# def remove_stopwords_punctuation(s):
#     stop = set(stopwords.words('english') + list(string.punctuation))
#     return ' '.join([i for i in word_tokenize(s.lower()) if i not in stop])

In [10]:

stop = set(stopwords.words('english') + list(string.punctuation))
# Initialize Python porter stemmer
ps = PorterStemmer()
# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

def sentence_preprocess(sentence, stemming=False, lemmatization=True):
    # lowercase, tokenize, then remove stopword & punctuation tokens
    words = [i for i in word_tokenize(sentence.lower()) if i not in stop]
    if stemming:
        words = [ps.stem(i) for i in words]
    if lemmatization:
        # pos="v" lemmatizes every token as a verb, so noun plurals such as
        # "programmers" are generally left unchanged
        words = [wnl.lemmatize(word, pos="v") for word in words]
    return ' '.join(words)


# preprocess function analysis

From the following outputs, we can see that stemming alone truncates some words and makes them unreadable (e.g. "people" becomes "peopl"), while lemmatization alone does not reduce every word to its root (e.g. "programmers" is left unchanged).

In this work, we use lemmatization only for now.

This code is adapted from
https://www.datacamp.com/tutorial/stemming-lemmatization-python
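To isolate the truncation effect of stemming, here is a minimal sketch that calls `PorterStemmer` directly on a few words from the example sentence, without the tokenizing and stopword steps. The word list and the `stems` name are chosen here for illustration only.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # rule-based; needs no downloaded corpora

# Words taken from the example sentence used in this notebook.
words = ["programmers", "programming", "people", "pythonistas"]
stems = {w: ps.stem(w) for w in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The stems are not guaranteed to be dictionary words, which is why the stemmed output is harder to read than the lemmatized one.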

In [11]:
example_sentence = "Python programmers often tend like programming in python because it's like english. We call people who program in python pythonistas."


In [12]:
sentence_preprocess(example_sentence, stemming=True, lemmatization=False)

"python programm often tend like program python 's like english call peopl program python pythonista"

In [13]:
sentence_preprocess(example_sentence, stemming=False, lemmatization=True)

"python programmers often tend like program python 's like english call people program python pythonistas"

In [14]:
sentence_preprocess(example_sentence, stemming=True, lemmatization=True)

"python programm often tend like program python 's like english call peopl program python pythonista"

# preprocess x_title column

In [15]:
raw_df['x_title']

0                           unqualified asbestos remover
1                                    senior test analyst
2                               ict trainer / supervisor
3        automotive specialists *multi faceted position*
4                                       business analyst
                              ...                       
19995                                    quality control
19996           showroom consultant | project specialist
19997                                civil site engineer
19998                                  senior scaffolder
19999              registered mental health professional
Name: x_title, Length: 20000, dtype: object

In [20]:
raw_df['x_title_feature'] = raw_df['x_title'].apply(sentence_preprocess)

In [24]:
raw_df['cats'] = raw_df.cat_1.str.cat(raw_df.cat_2, sep='-').str.cat(raw_df.cat_3, sep='-')

In [25]:
raw_df

Unnamed: 0,x_title,y_cat_id,cat_1,cat_2,cat_3,x_title_feature,cats
0,unqualified asbestos remover,5192,trades & services,labourers,labourers,unqualified asbestos remover,trades & services-labourers-labourers
1,senior test analyst,5123,it,testing,testing,senior test analyst,it-testing-testing
2,ict trainer / supervisor,6894,education,tutoring & training,tutoring & training,ict trainer supervisor,education-tutoring & training-tutoring & training
3,automotive specialists *multi faceted position*,5197,trades & services,technicians,technicians,automotive specialists multi faceted position,trades & services-technicians-technicians
4,business analyst,5114,it,business & systems analysts,business & systems analysts,business analyst,it-business & systems analysts-business & syst...
...,...,...,...,...,...,...,...
19995,quality control,5140,manufacturing & operations,quality assurance,quality assurance,quality control,manufacturing & operations-quality assurance-q...
19996,showroom consultant | project specialist,5167,sales,sales reps,sales reps,showroom consultant project specialist,sales-sales reps-sales reps
19997,civil site engineer,5058,engineering,civil & structural,civil & structural,civil site engineer,engineering-civil & structural-civil & structural
19998,senior scaffolder,5042,construction & roading,supervisors & forepersons,supervisors & forepersons,senior scaffolder,construction & roading-supervisors & foreperso...


In [26]:
output_path = "raw.csv"
raw_df.to_csv(output_path)
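Because `to_csv` writes the index by default, the next step will see an extra unnamed leading column when it reads raw.csv back with a plain `pd.read_csv`. A minimal sketch of the round trip on a toy frame (values are illustrative; an in-memory buffer stands in for the file), assuming the next step wants the original columns only:

```python
import io
import pandas as pd

df = pd.DataFrame({"x_title": ["business analyst"], "y_cat_id": [5114]})

# With no path argument, to_csv returns the CSV text; the index is written
# as a leading column with an empty header.
csv_text = df.to_csv()

# A plain read turns that column into "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores it as the index instead of a data column.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is the other common option; either choice works as long as the writing and reading sides agree.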