# Building a model to classify the type of job from a job description

## Introduction
The goals of this project are to:
* Sort emails into three classes: Job description, Alert and Others. (Asma's model)
* Classify job descriptions by job type.
* Rate CVs and skills.

We will focus on the following:
* jobs: Data Scientist, Data Engineer, Big Data Developer, Data Analyst and Others (a mix of other job types)
* datasets: Glassdoor, job_emails1 from Assan and Kaggle

## Libraries

In [1]:
!pip install ipynb



In [2]:
import pandas as pd
import numpy as np
import string
import spacy
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report

import joblib
# from ipynb.fs.full.text_preprocessing_with_spacy import text_preprocess

In [3]:
# Load the large English spaCy model
nlp = spacy.load("en_core_web_lg")

# Add some stopwords to spaCy's default list
nlp.Defaults.stop_words |= {"or","per","like",'-','_','',
                            '–','[]','\n','\n\n','\n\n ','i.e.'}

# Create our list of stopwords (the set updated above)
stopWords = spacy.lang.en.stop_words.STOP_WORDS  # alternative: set(stopwords.words('english'))


## Read Data

In [4]:
emails = pd.read_csv('data/unbalanced.csv')
emails

| Unnamed: 0 | Content | source | job_title | label |
|---|---|---|---|---|
| 0 | Infoserv LLC\nData Scientist\nRemote\nEmployer... | glassdoor | Data Scientist | job_description |
| 1 | ExxonMobil\n3.1\nData Scientist\nClinton, NJ\n... | glassdoor | Data Scientist | job_description |
| 2 | eBay Inc.\n4.1\nData Scientist/Applied Researc... | glassdoor | Data Scientist | job_description |
| 3 | TikTok\n3.7\nData Scientist, University Gradua... | glassdoor | Data Scientist | job_description |
| 4 | Mastercard\n4.3\nData Scientist, AI Services -... | glassdoor | Data Scientist | job_description |
| ... | ... | ... | ... | ... |
| 1031 | Are you an experienced Data Analyst? Are you e... | Kaggle | Data Analyst | job_description |
| 1032 | Data Analyst Data extraction, Storage, Back u... | Kaggle | Data Analyst | job_description |
| 1033 | Database Analyst ****K Wetherby This role sha... | Kaggle | Data Analyst | job_description |
| 1034 | Data Analyst / Data Analysis / Modelling / SQL... | Kaggle | Data Analyst | job_description |


In [5]:
print(emails['label'].unique(),"\n\n",emails['job_title'].unique())


['job_description'] 

 ['Data Scientist' 'Data Engineer' 'Data Analyst' 'Big Data Developer'
 'other']


## Preprocessing

### Text preprocessing with spacy

In [6]:
def text_preprocess(text):
    nlp.max_length = 2030000 # Raise the maximum text length (in characters) that spaCy accepts
    
### Creation of function that we'll use to process data


# Lemmatization
    def lemmatize_word(text):
        """Lemmatize tokens, i.e. reduce each word to its root form (for example, "did" becomes "do")
        input: A spaCy Doc (sequence of tokens)
        output: A list of lemmatized tokens
        """
        lemma_word = [] 
        for token in text:
            lemma_word.append(token.lemma_)
        return lemma_word
    
    
# Split tokens that contain special characters
    def check_character_in_words(text):
        """Split tokens that contain special characters
        input: A list of tokens, some containing special characters
        output: A new list of tokens, split on those characters
        """
        charact = ["\n", ":", '$']
        # Build a new list instead of modifying `text` while iterating over it
        result = []
        for word in text:
            if any(char in word for char in charact):
                # Replace every special character with a space, then split
                for char in charact:
                    word = word.replace(char, " ")
                result.extend(word.split())
            else:
                result.append(word)
        return result
    
# Remove punctuation    
    def remove_punct(text):
        """
        Remove punctuation from text (List of tokens)
        input: A list of tokens with punctuation
        output: A list of tokens without punctuation
        """
        l=[]
        for word in text:
            if word not in string.punctuation:  # substring test against the punctuation characters
                l.append(word)
        # resultat=" ".join(l)   
        return l


# Remove stopwords
    def remove_stopword(liste,stopWords):
        """
        Remove stopwords from a list of tokens
        input:A list of tokens with stopwords
        output:A list of tokens without stopwords
        """
        list_tokens = [tok.lower() for tok in liste]
        l=[]
        for word in list_tokens:  
            if word not in stopWords:
                l.append(word)
        return l
    
# Remove duplicated words
    # In the text, a word is sometimes repeated twice or more,
    # for example "salary" in both the job title and the description.

    def remove_duplicates(text):
        """Remove duplicated tokens, keeping the first occurrence of each
        input: A list of tokens
        output: A string of the unique tokens joined by spaces
        """
        # dict.fromkeys keeps insertion order and drops repeats
        unique_tokens = list(dict.fromkeys(text))
        resultat = " ".join(unique_tokens)
        return resultat

    
### Process the data

    # Tokenization
    doc = nlp(text)
    lemmatize_text = lemmatize_word(doc)
    checked_text = check_character_in_words(lemmatize_text)
    removed_punctuation_text = remove_punct(checked_text)
    removed_stopwords_punctuation = remove_stopword(removed_punctuation_text,
                                                    stopWords)
    # Remove duplicated words (returns a single space-joined string)
    clean_text = remove_duplicates(removed_stopwords_punctuation)
    return clean_text
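
The cleaning steps applied after lemmatization can be tried without loading a spaCy model. The sketch below is only an illustration: `clean_tokens` and the tiny `demo_stopwords` set are hypothetical names that do not appear in the notebook, and the input is a hand-written token list rather than spaCy output.

```python
import string

def clean_tokens(tokens, stopwords):
    """Drop punctuation and stopwords, lowercase, and keep first occurrences only."""
    # Same test as in the notebook: drop tokens found inside string.punctuation
    # (single punctuation marks and empty strings)
    no_punct = [tok for tok in tokens if tok not in string.punctuation]
    # Lowercase before comparing against the stopword set
    lowered = [tok.lower() for tok in no_punct]
    no_stop = [tok for tok in lowered if tok not in stopwords]
    # dict.fromkeys preserves insertion order, so only repeats are dropped
    return " ".join(dict.fromkeys(no_stop))

demo_stopwords = {"the", "a", "of", "and"}  # hypothetical, far smaller than spaCy's list
demo_tokens = ["Data", "Scientist", ",", "the", "data", "team", "and", "Scientist", "!"]
cleaned = clean_tokens(demo_tokens, demo_stopwords)
print(cleaned)
```

Because duplicates are removed, each word appears at most once per description, so the later word counts only record presence or absence.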
    
    

In [7]:
emails['list_skills'] = emails['Content'][:].apply(text_preprocess)
emails.head()

| Unnamed: 0 | Content | source | job_title | label | list_skills |
|---|---|---|---|---|---|
| 0 | Infoserv LLC\nData Scientist\nRemote\nEmployer... | glassdoor | Data Scientist | job_description | infoserv llc data scientist remote employer pr... |
| 1 | ExxonMobil\n3.1\nData Scientist\nClinton, NJ\n... | glassdoor | Data Scientist | job_description | exxonmobil 3.1 data scientist clinton nj 94 k ... |
| 2 | eBay Inc.\n4.1\nData Scientist/Applied Researc... | glassdoor | Data Scientist | job_description | ebay inc. 4.1 data scientist applied researche... |
| 3 | TikTok\n3.7\nData Scientist, University Gradua... | glassdoor | Data Scientist | job_description | tiktok 3.7 data scientist university graduate ... |
| 4 | Mastercard\n4.3\nData Scientist, AI Services -... | glassdoor | Data Scientist | job_description | mastercard 4.3 data scientist ai services laun... |


## Modeling

### Vectorization, feature engineering and train/test split

After the preprocessing step, we end up with cleaned text matched with its label. Since a machine learning model cannot work on text strings directly, we need a way to convert the text into a numerical representation.

* One tool we can use for this is called Bag of Words (BoW). BoW converts text into a matrix of word occurrences per document: each row is a document, each column is a word of the vocabulary, and each cell records how many times that word occurs in that document. Word order is ignored. This matrix is often referred to as a BoW matrix or a document-term matrix.

We can generate a BoW matrix for our text data by using scikit-learn's CountVectorizer.
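
As a minimal sketch of what such a matrix looks like, the example below runs `CountVectorizer` on two made-up documents. The documents are illustrative only and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "job descriptions" (illustrative only)
docs = ["python sql python", "sql spark"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = bow.toarray().tolist()  # one row per document, one column per term

print(vocab)
print(counts)
```

Each row of `counts` lines up with the terms in `vocab`, so the first document has a count of 2 in the `python` column.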

* Train test split
Split the data into train and test sets.

* Model fitting
To classify our data, we chose a decision tree classifier. The model will be exported and then loaded to predict new data in the deployment step.
 Since the BoW vectorizer must be exported together with the model, we need to create a pipeline.
 We can use scikit-learn's Pipeline class for this.
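
The benefit of bundling the vectorizer and the classifier can be sketched on toy data. The texts and labels below are hypothetical and serve only to make the example self-contained. `random_state=0` is an added assumption for reproducibility and is not set in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data (hypothetical, for illustration only)
texts = ["python machine learning statistics", "spark kafka etl pipeline",
         "machine learning model statistics", "etl pipeline spark airflow"]
labels = ["Data Scientist", "Data Engineer", "Data Scientist", "Data Engineer"]

toy_model = Pipeline([
    ("countVectorizer", CountVectorizer()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
toy_model.fit(texts, labels)

# The fitted vocabulary travels with the pipeline, so raw strings can be passed in
prediction = toy_model.predict(["statistics and machine learning"])
print(prediction)
```

Without the pipeline, the fitted `CountVectorizer` would have to be saved and applied separately before every call to the classifier.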

In [8]:
X_Data = emails["list_skills"] # Data to analyse
Y_Data = emails["job_title"] # Labels of data

In [9]:
# Pipeline( model and BoW)
model = Pipeline([('countVectorizer', CountVectorizer()),
         ('classifier', tree.DecisionTreeClassifier())])

In [10]:
# Train test split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X_Data, 
                                                Y_Data, test_size=0.3, random_state=101)

In [11]:
Y_Train.value_counts()

Data Scientist        177
Data Engineer         168
other                 150
Data Analyst          141
Big Data Developer     89
Name: job_title, dtype: int64

In [12]:
Y_Test.value_counts()

other                 81
Data Scientist        71
Data Engineer         64
Data Analyst          59
Big Data Developer    36
Name: job_title, dtype: int64

In [13]:
# bow_vector = CountVectorizer(tokenizer = text_preprocess)

In [14]:
# tfidf_vector = TfidfVectorizer(tokenizer = text_preprocess)

In [15]:
# classifier = tree.DecisionTreeClassifier()
# pipe = Pipeline([ ('vectorizer', bow_vector),
#          ('classifier', classifier)])

# pipe.fit(X_Train, Y_Train)

### Model fitting

In [16]:
# model = tree.DecisionTreeClassifier()
model.fit(X_Train, Y_Train)


Pipeline(steps=[('countVectorizer', CountVectorizer()),
                ('classifier', DecisionTreeClassifier())])

In [17]:
model.score(X_Train, Y_Train)

0.9917241379310345

### Model testing

In [18]:
predicted = model.predict(X_Test)
#     joblib.dump(model, 'model_job_class.joblib')
print(classification_report(Y_Test, predicted))

                    precision    recall  f1-score   support

Big Data Developer       0.82      0.64      0.72        36
      Data Analyst       0.95      0.92      0.93        59
     Data Engineer       0.77      0.86      0.81        64
    Data Scientist       0.82      0.90      0.86        71
             other       0.92      0.88      0.90        81

          accuracy                           0.86       311
         macro avg       0.86      0.84      0.84       311
      weighted avg       0.86      0.86      0.86       311



In [19]:
model.score(X_Test,Y_Test)

0.8585209003215434

## Export the model

In [20]:
joblib.dump(model, 'model.joblib')

['model.joblib']
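
For the deployment step, the exported file can be loaded back with `joblib.load`. The sketch below uses a small stand-in pipeline trained on hypothetical data and a temporary directory, so it runs on its own. In practice the file would be the `model.joblib` written above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Small stand-in pipeline (hypothetical data) so the example is self-contained
demo_model = Pipeline([
    ("countVectorizer", CountVectorizer()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
demo_model.fit(["sql dashboard reporting", "spark etl kafka"],
               ["Data Analyst", "Data Engineer"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(demo_model, path)  # same call as in the notebook
    restored = joblib.load(path)   # what a deployment script would do

sample = ["spark etl kafka streaming"]
same_prediction = list(restored.predict(sample)) == list(demo_model.predict(sample))
print(restored.predict(sample), same_prediction)
```

New text should go through the same `text_preprocess` function before being passed to the loaded model, since the pipeline was trained on preprocessed text.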

In [21]:
## Test the model

In [22]:
test = "Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****"

In [23]:
test = text_preprocess(test)
test

'stress engineer glasgow salary currently look talented join grow team variety level role ideally suit high calibre engineering graduate appropriate experience opportunity use technical skill provide quality input aerospace project span aerostructure aeroengine return expect good career chance advancement personal professional development support gain chartership possibly travel work office outside uk requirement need degree include structural analysis aeronautical mechanical automotive civil environment relevant limit sector demonstrate follow area composite industry linear nonlinear finite element fatigue damage tolerance dynamic thermal strong desire progress quickly position leadership approach communication write verbal commercial awareness comfortable international self manage note security clearance required'

In [24]:
model.predict([test])

array(['other'], dtype=object)