# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Deepa Rose Thomas
#### Student ID: S3952532

Date: 30-09-2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* nltk
* itertools
* scikit-learn
* gensim

## Introduction
<p style="text-align: justify;">    
In this task, the primary objective is to create diverse feature representations for a collection of job advertisements. These representations will be derived exclusively from the textual descriptions provided within each job advertisement. The task encompasses three distinct feature representation methods:</p>

- ***Bag-of-Words Model:***
<p style="text-align: justify;">
The initial feature representation is based on a classic approach known as the Bag-of-Words (BoW) model.
For each job advertisement description, a Count Vector representation will be generated.
These Count Vectors will be constructed using a predefined vocabulary established in Task 1, as stored in the "vocab.txt" file.
The Count Vector representation will provide a clear indication of the frequency of words within each job advertisement's description.</p>

- ***Models Based on Word Embeddings:***
<p style="text-align: justify;">
In this phase, we will employ a specific language model for creating feature representations based on word embeddings.
One word embedding model can be selected from options such as FastText, GoogleNews300, other pretrained Word2Vec models, or GloVe; this notebook uses the pretrained `word2vec-google-news-300` model.
Two variations of feature representations will be generated for each job advertisement description using the chosen language model:</p>

***TF-IDF Weighted Vector***
<p style="text-align: justify;">
This representation will be TF-IDF (Term Frequency-Inverse Document Frequency) weighted.
TF-IDF is a statistical measure that evaluates the importance of a word within a document relative to a corpus of documents.
The TF-IDF weighted vector representation will provide a sense of the uniqueness and relevance of words in each job advertisement.</p>

***Unweighted Vector***
<p style="text-align: justify;">
In contrast to the TF-IDF weighted version, this representation will not include TF-IDF weighting.
It will capture the raw word embeddings without adjusting for document-specific importance.
The unweighted vector representation offers a different perspective on the job advertisement descriptions, focusing solely on the words used.

By generating these three different types of feature representations, we aim to provide a comprehensive and versatile toolkit for analyzing and understanding the textual content of job advertisements. These representations can be used for various natural language processing (NLP) tasks, such as text classification, clustering, and information retrieval, to help extract meaningful insights and patterns from the job advertisements dataset.</p>

## Importing libraries 

In [1]:
# Import the libraries needed in this assessment
import re  # regex matching
import pandas as pd  # dataframe manipulation
import nltk  # text processing
import numpy as np  # numerical arrays
from nltk.probability import *
from itertools import chain  # flattening nested token lists
import warnings
warnings.filterwarnings("ignore")

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

In [2]:
jobs_info = []  # one dictionary per job advertisement

# load the job records written to data.txt in Task 1
with open("data.txt") as input_file:
    data_info = input_file.read().splitlines()

    for idx, text in enumerate(data_info):
        jobs_info.append({})

        # fields are separated by "||"; the piece after the final "||" is dropped
        text_info = text.split("||")[:-1]
        for line in text_info:
            # split each field into key and value at the first ":"
            key, value = re.match("(.+?):(.+)", line).groups()
            jobs_info[idx][key] = value


In [3]:
#check the structure of jobs_info to see if it complies with our requirement
jobs_info

[{'id': '00232',
  'title': 'Accounting_Finance',
  'Title': 'FP&A  Blue Chip',
  'Webindex': '68802053',
  'Company': 'Hays Senior Finance'},
 {'id': '00233',
  'title': 'Accounting_Finance',
  'Title': 'Part time Management Accountant',
  'Webindex': '70757636',
  'Company': 'FS2 UK Ltd'},
 {'id': '00234',
  'title': 'Accounting_Finance',
  'Title': 'IFA  EMPLOYED',
  'Webindex': '71356489',
  'Company': 'Clark James Ltd'},
 {'id': '00235',
  'title': 'Accounting_Finance',
  'Title': 'Finance Manager',
  'Webindex': '69073629',
  'Company': 'Accountancy Action Ltd'},
 {'id': '00236',
  'title': 'Accounting_Finance',
  'Title': 'Management Accountant',
  'Webindex': '70656648',
  'Company': 'Alexander Lloyd'},
 {'id': '00237',
  'title': 'Accounting_Finance',
  'Title': 'Customer Service Administrator',
  'Webindex': '68531828'},
 {'id': '00238',
  'title': 'Accounting_Finance',
  'Title': 'Accounts Assistant',
  'Webindex': '72451165',
  'Company': 'Cherry Professional Limited'},
 {'

In [4]:
len(jobs_info)

776

***Explanation***

Here our goal is to read and parse the information in the text file "data.txt" and store it in a list called `jobs_info`.
We first create an empty list, `jobs_info`, to hold the extracted records. The contents of "data.txt" are read and split into lines with `splitlines()`, giving a list called `data_info`.
We then iterate over `data_info` with a `for` loop, where each line holds the information for one job.

Inside the loop, a new empty dictionary is appended to `jobs_info` for each job.
The `text_info` variable holds the result of splitting the current line on the "||" delimiter, which breaks the line into `key:value` fields.

For each field in `text_info`, a regular expression extracts the key and the value, which are then stored in the dictionary for the current job.
As a result, `jobs_info` is a list of dictionaries, where each dictionary holds one job's details.
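
To make the parsing step concrete, here is a minimal sketch on a single line. The exact layout of "data.txt" is not shown in this notebook, so the sample line below is an assumption inferred from the parsing code (fields of the form `key:value`, each followed by "||"); its values are taken from the first record printed above.

```python
import re

# Hypothetical line in the layout the parsing code expects:
# key:value fields, each terminated by "||"
sample_line = "id:00232||title:Accounting_Finance||Title:FP&A  Blue Chip||Webindex:68802053||"

record = {}
# [:-1] drops the empty string left after the final "||"
for field in sample_line.split("||")[:-1]:
    # non-greedy key up to the first ":", everything after it is the value
    key, value = re.match(r"(.+?):(.+)", field).groups()
    record[key] = value

print(record)
```

Because the key is matched non-greedily, a value that itself contains ":" stays intact: only the first colon separates key from value.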

In [5]:
#list to store the tokens generated in task1
jobs_token_list = []  
with open("tokens.txt") as token:  # open the token file 
    tokens= token.read().splitlines()    
    for text_data in tokens:
        jobs_token_list.append(text_data.split()) # append the data into the empty list

In [6]:
jobs_token_list

[['market',
  'leading',
  'retail',
  'business',
  'rapid',
  'growth',
  'due',
  'expansion',
  'add',
  'financial',
  'planning',
  'analyst',
  'central',
  'team',
  'based',
  'central',
  'london',
  'fantastic',
  'opportunity',
  'join',
  'newly',
  'created',
  'team',
  'driving',
  'forward',
  'financial',
  'planning',
  'analysis',
  'group',
  'reporting',
  'directly',
  'head',
  'fp',
  'assist',
  'revenue',
  'analysis',
  'product',
  'channel',
  'region',
  'provide',
  'commercial',
  'input',
  'review',
  'business',
  'cases',
  'presenting',
  'proposals',
  'approval',
  'work',
  'business',
  'develop',
  'endtoend',
  'planning',
  'cycle',
  'processes',
  'lead',
  'regional',
  'planning',
  'processes',
  'ensure',
  'completeness',
  'key',
  'business',
  'channels',
  'products',
  'provide',
  'finance',
  'support',
  'year',
  'strategic',
  'plan',
  'addition',
  'work',
  'globally',
  'business',
  'regions',
  'develop',
  'capital',


***Inference***

Here we read each line of "tokens.txt", split it into individual tokens, and store the tokens for each job as a list inside `jobs_token_list`. This gives us the tokenized descriptions generated in Task 1 in a form that is ready for further processing.

In [7]:
#storing the data from vocab.txt generated in task1
vocabulary_list = []
with open("vocab.txt") as vocabulary:
    vocab_data = vocabulary.read().splitlines()
    for text_data in vocab_data:
        vocabulary_list.append(text_data.split(":")[0])

In [8]:
len(vocabulary_list)

5218

***Inference***

Here we read each line of "vocab.txt", keep the word before the ":" separator, and store the words in `vocabulary_list`. The order of this list fixes the column order of the feature matrices built below.

Now let's start with generating the count vector representation

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cVectorizer = CountVectorizer(analyzer = "word",vocabulary = vocabulary_list,lowercase = True) # initialised the CountVectorizer

In [10]:
count_features = cVectorizer.fit_transform([' '.join(i) for i in jobs_token_list])

In [11]:
count_features.shape

(776, 5218)

In [12]:
count_features

<776x5218 sparse matrix of type '<class 'numpy.int64'>'
	with 75446 stored elements in Compressed Sparse Row format>

In [13]:
# Converting the sparse matrix into a dense array
vect_array = count_features.toarray()

### Saving outputs
Saving the count vector representation as per specification
- count_vectors.txt

In [14]:
# Creating the count vectors.txt file as per the required output
with open("count_vectors.txt","w") as count_file_output:   
    for k,v in enumerate(vect_array): 
        count_file_output.write(f"#{jobs_info[k]['Webindex']}") # Creating the output file based on the given format and structure
        for i,j in enumerate(v):          
            if j>0:    
                count_file_output.write(f",{i}:{j}")
        count_file_output.write("\n") 
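
The line format written by the cell above can be illustrated on a tiny dense vector. This is only a sketch: `format_count_line` is a hypothetical helper, not part of the notebook, and the Webindex and counts are made up for illustration.

```python
def format_count_line(webindex, counts):
    """Return '#webindex,index:count,...' keeping only the non-zero counts."""
    pairs = [f"{i}:{c}" for i, c in enumerate(counts) if c > 0]
    return ",".join([f"#{webindex}"] + pairs)

# vocabulary indices 1 and 4 occur 2 times and 1 time respectively
line = format_count_line("12345678", [0, 2, 0, 0, 1])
print(line)  # #12345678,1:2,4:1
```

Storing only the non-zero entries keeps the file small, since each advertisement uses a small fraction of the 5218 vocabulary words.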

In [15]:
# import the api reference 
import gensim.downloader as api

In [16]:
# install gensim with pip (run this before the import above if gensim is missing)
!pip install -U gensim





In [17]:
word_vector = api.load('word2vec-google-news-300') # load the pretrained Google News Word2Vec vectors (300 dimensions)

In [18]:
word_vector

<gensim.models.keyedvectors.KeyedVectors at 0x1f95b43d0d0>

In [19]:
# User-defined function to build unweighted document embeddings
# (DataFrame.append was removed in pandas 2.0, so vectors are collected in a list)

def unweighted_vector(text_vector, word_tokens):
    """Sum the embedding vectors of the in-vocabulary tokens of each document."""
    doc_vectors = []

    for text_tokens in word_tokens:
        # keep only tokens that the embedding model knows about
        vectors = [text_vector[token] for token in text_tokens if token in text_vector]

        if vectors:
            doc_vectors.append(np.sum(vectors, axis=0))
        else:
            # assumption: fall back to a zero vector when no token is found
            doc_vectors.append(np.zeros(text_vector.vector_size))

    # one row per document, one column per embedding dimension
    return pd.DataFrame(doc_vectors)

***Explanation***

Here I have defined a function called `unweighted_vector` that generates a vector representation of each document from pretrained word vectors. It takes the word vectors and a list of token lists, looks up the vector of every token that exists in the embedding model, and sums these vectors into one document-level vector. Tokens that are missing from the model are skipped. The document vectors are collected in a DataFrame, with one row per document.
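
The summing step can be seen on a toy example. This is a minimal sketch, not the notebook's function: `toy_embeddings` and `sum_known_vectors` are hypothetical names, and the 3-dimensional vectors stand in for the 300-dimensional Word2Vec vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the real model has 300 dimensions)
toy_embeddings = {
    "finance": np.array([1.0, 0.0, 2.0]),
    "manager": np.array([0.5, 1.0, 0.0]),
}

def sum_known_vectors(tokens, embeddings, size=3):
    """Sum the vectors of tokens found in `embeddings`; unknown tokens are skipped."""
    total = np.zeros(size)
    for token in tokens:
        if token in embeddings:
            total += embeddings[token]
    return total

# "nonlife" has no embedding here, so it contributes nothing
doc_vector = sum_known_vectors(["finance", "manager", "nonlife"], toy_embeddings)
print(doc_vector.tolist())  # [1.5, 1.0, 2.0]
```

Because the vectors are summed and not averaged, longer descriptions tend to produce vectors with larger magnitudes.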

In [20]:
#Function call with vector and tokens 
unweighted_vector = unweighted_vector(word_vector,jobs_token_list)

In [21]:
unweighted_vector

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-7.160126,1.856628,-2.982376,3.630592,-4.704243,-1.182510,4.649521,-11.098745,7.324951,2.495411,...,-9.989853,6.407013,-4.160652,2.415009,-0.373993,2.523761,5.930438,0.620413,2.135750,-5.032024
1,-6.163345,0.061256,-0.407925,1.485382,-5.662848,-2.037109,4.310364,-7.707872,14.754150,0.745209,...,-13.678940,7.189995,-8.961487,3.975644,-0.868500,-0.756126,3.910271,-1.689117,0.821995,-7.342186
2,-4.949142,1.451538,-1.643799,1.790756,-4.067078,1.990873,5.423538,-6.774933,12.768120,2.964249,...,-9.782871,5.047195,-6.183914,0.038132,-2.153198,4.663071,6.192434,0.117004,0.741644,-9.068665
3,-3.432510,0.581108,-2.067047,-0.240967,-0.368744,-0.924429,2.958416,-5.528075,5.968628,0.622406,...,-3.846985,5.863441,-4.374390,0.901390,-1.786919,0.229401,2.147186,0.574051,1.451103,0.326584
4,-0.526581,0.514513,-5.035576,2.997803,-0.259148,1.090897,3.080891,-8.735336,9.242828,4.306641,...,-7.355980,4.525032,-9.773010,0.841362,-1.050568,-1.113660,3.434298,-1.639522,-3.710297,-4.472168
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
771,-0.401176,0.705744,1.405548,-0.995697,-1.968414,0.581604,3.142334,-4.481281,2.261033,-0.783447,...,-2.849060,1.711060,-2.805054,1.311924,-0.558594,2.071762,0.323410,-1.449493,0.793654,1.600983
772,-1.311035,-2.520634,-2.221619,3.701843,-1.536255,1.105083,1.884621,-5.574921,4.487061,-2.290710,...,-2.554054,1.739929,-5.733536,1.969170,0.403961,1.198993,1.826187,-4.165314,-0.754837,-2.374603
773,-7.416153,3.537933,2.989914,0.020447,-2.975037,3.340071,5.201202,-6.814972,6.812744,3.555664,...,-9.048771,5.632198,-6.506836,-0.290340,-2.951813,5.673668,3.358704,-1.569595,-1.777512,-1.717133
774,-10.500214,1.547220,-8.317028,9.061890,-8.450244,6.820450,17.256111,-9.053597,7.853255,-2.204483,...,-25.071526,14.419046,-21.816326,-5.035713,-8.192616,2.617218,6.903275,0.729652,-0.836781,-10.397064


***Inference***
- The result is a DataFrame with one row per job advertisement and 300 columns, one per embedding dimension.

In [22]:
#Creating the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer  

In [23]:
tfidvector = TfidfVectorizer(analyzer = "word",vocabulary = vocabulary_list, lowercase = True)

In [24]:
#vector transformation
vector_features = tfidvector.fit_transform([' '.join(text) for text in jobs_token_list]) 

In [25]:
vector_features.shape

(776, 5218)

In [26]:
# convert to array
featurearray = vector_features.toarray() 
featurearray

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [27]:
# Build one {word: TF-IDF weight} dictionary per document
weighted = []
for k, v in enumerate(featurearray):  # k: document index, v: that document's TF-IDF row
    weighted.append({})
    for i, j in enumerate(v):
        if j != 0:  # keep only the non-zero weights
            weighted[k][vocabulary_list[i]] = j

In [28]:
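# Function to generate the TF-IDF weighted document embeddings
# (DataFrame.append was removed in pandas 2.0, so vectors are collected in a list)

def weightedtfid(model, tk, tfid):
    """Sum each document's word vectors, scaled by the word's TF-IDF weight."""
    doc_vectors = []

    for doc_index, doc_tokens in enumerate(tk):
        weights = tfid[doc_index]  # {word: TF-IDF weight} for this document

        # use each distinct token once; skip tokens missing from the
        # embedding model or from this document's TF-IDF weights
        vectors = [model[token] * float(weights[token])
                   for token in set(doc_tokens)
                   if token in model and token in weights]

        if vectors:
            doc_vectors.append(np.sum(vectors, axis=0))
        else:
            # assumption: fall back to a zero vector when no token is usable
            doc_vectors.append(np.zeros(model.vector_size))

    return pd.DataFrame(doc_vectors)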

***Explanation***

Here the function `weightedtfid` generates TF-IDF weighted document vectors using pre-trained word embeddings. For each document it takes the distinct tokens, looks up each token's precomputed TF-IDF weight in `tfid` and its word vector in the embedding model, and multiplies the two. Tokens missing from either source are skipped. The weighted vectors are then summed into one document-level vector, and the function collects these into a DataFrame in which each row corresponds to one document.
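
Written as a formula, the representation built by `weightedtfid` is a weighted sum. The symbols are introduced here for illustration and do not appear in the notebook: $V_d$ is the set of distinct tokens of document $d$ that occur both in the embedding model and in the TF-IDF vocabulary, $\mathbf{e}_w$ is the embedding of word $w$, $N$ is the number of documents, and $\operatorname{df}(w)$ is the number of documents containing $w$. The TF-IDF definition assumes the scikit-learn `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), which the vectorizer above does not override.

$$
\mathbf{v}_d = \sum_{w \in V_d} \operatorname{tfidf}(w, d)\, \mathbf{e}_w,
\qquad
\operatorname{tfidf}(w, d) = \frac{\operatorname{tf}(w, d)\,\operatorname{idf}(w)}{\left\lVert \operatorname{tf}(\cdot, d)\,\operatorname{idf}(\cdot) \right\rVert_2},
\qquad
\operatorname{idf}(w) = \ln\frac{1 + N}{1 + \operatorname{df}(w)} + 1
$$

Setting every weight $\operatorname{tfidf}(w, d)$ to 1 and counting repeated tokens recovers the unweighted representation.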

In [29]:
#Function call
weighteddata = weightedtfid(word_vector,jobs_token_list,weighted) 

In [30]:
weighteddata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.504901,0.124200,-0.194863,0.444976,-0.376841,-0.187031,0.369994,-0.809447,0.601400,0.235539,...,-0.648685,0.316571,-0.150151,0.243624,0.025920,0.119206,0.334186,-0.027812,0.201235,-0.389140
1,-0.346544,0.012133,0.017820,0.252876,-0.408563,-0.203890,0.242627,-0.483708,1.054773,0.130530,...,-0.937903,0.407780,-0.643838,0.201115,-0.058875,-0.119046,0.225163,-0.163693,0.047410,-0.505439
2,-0.384041,0.183813,-0.035027,0.195429,-0.334177,0.015410,0.349709,-0.507052,0.943370,0.208999,...,-0.672141,0.255909,-0.457822,0.053206,-0.177337,0.315515,0.411777,0.015256,0.057018,-0.588495
3,-0.315696,-0.000681,-0.251565,0.078043,-0.004102,-0.224366,0.362503,-0.528244,0.670161,0.047146,...,-0.391224,0.630915,-0.428363,0.101103,-0.194548,-0.009123,0.169657,0.106556,0.194715,0.067267
4,-0.071831,0.023072,-0.349967,0.398567,0.078827,-0.072465,0.077801,-0.589724,0.746227,0.331160,...,-0.429002,0.346106,-0.761660,0.045379,-0.040122,-0.204916,0.223594,-0.158430,-0.307753,-0.400928
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
771,-0.047423,0.147106,0.251730,-0.043298,-0.229455,0.038578,0.430429,-0.575377,0.325212,-0.013075,...,-0.314619,0.242557,-0.415510,0.239343,-0.070296,0.225778,-0.053129,-0.230121,0.104070,0.192955
772,-0.113014,-0.239486,-0.252883,0.501169,-0.135353,0.084937,0.114793,-0.542925,0.400639,-0.259761,...,-0.210502,0.201918,-0.554553,0.175493,-0.035822,0.015681,0.124400,-0.432376,-0.022502,-0.102307
773,-0.618682,0.331870,0.298890,-0.002147,-0.304266,0.198030,0.399435,-0.504132,0.638252,0.337135,...,-0.759424,0.411816,-0.442678,-0.025788,-0.310605,0.506795,0.269019,-0.094427,-0.154614,-0.094301
774,-0.397234,0.022102,-0.370746,0.371090,-0.340698,0.223981,0.633819,-0.178182,0.258703,-0.081291,...,-0.954404,0.577336,-0.813300,-0.164958,-0.309768,0.058736,0.233311,0.045546,-0.033589,-0.322281


***Inference***
- The result is a DataFrame of TF-IDF weighted document embeddings, again with one row per job advertisement and 300 columns.

## Task 3. Job Advertisement Classification

In [31]:
lab=[data["title"] for data in jobs_info]  

### Q1: Language model comparisons

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
regression = LogisticRegression(max_iter = 1000,random_state=0)

In [33]:
# Calculating accuracy 
print(f"FEATURE ACCURACY: {round(cross_val_score(regression, count_features, lab, cv=5).mean()*100,2)}%")
print(f"UNWEIGHTED ACCURACY: {round(cross_val_score(regression,unweighted_vector, lab, cv=5).mean()*100,2)}%")
print(f"TFIDF ACCURACY: {round(cross_val_score(regression, weighteddata, lab, cv=5).mean()*100,2)}%")

FEATURE ACCURACY: 87.5%
UNWEIGHTED ACCURACY: 84.54%
TFIDF ACCURACY: 85.96%


***Explanation***

Here, the accuracy of a logistic regression classifier is computed for each feature representation: first the count vectors (`count_features`), then the unweighted word embeddings, and finally the TF-IDF weighted word embeddings.

All three accuracies are mean scores over 5-fold cross-validation, reported as percentages. Comparing them shows which feature representation works best for this classification task.
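
As a rough illustration of how 5-fold cross-validation partitions the 776 documents, the sketch below splits the indices into five contiguous folds. It is a simplification under stated assumptions: `kfold_indices` is a hypothetical helper, and for classifiers `cross_val_score` with an integer `cv` uses stratified folds, so the real fold membership differs.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # the first n_samples % n_splits folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

for fold, (train_idx, test_idx) in enumerate(kfold_indices(776), start=1):
    print(f"fold {fold}: train on {len(train_idx)} documents, test on {len(test_idx)}")
```

Each document is used for testing exactly once, and the reported accuracy is the mean of the five per-fold test accuracies.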

In [34]:
#Creating the logistic regression classifier with a higher iteration limit
regression = LogisticRegression(max_iter = 2000,random_state=0)

In [35]:
# Calculating accuracy
print(f"FEATURE ACCURACY: {round(cross_val_score(regression, count_features, lab, cv=5).mean()*100,2)}%")
print(f"UNWEIGHTED ACCURACY: {round(cross_val_score(regression,unweighted_vector, lab, cv=5).mean()*100,2)}%")
print(f"TFIDF ACCURACY: {round(cross_val_score(regression, weighteddata, lab, cv=5).mean()*100,2)}%")

FEATURE ACCURACY: 87.5%
UNWEIGHTED ACCURACY: 84.54%
TFIDF ACCURACY: 85.96%


***Inference***
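- In both runs, the basic count vector representation is the most accurate (87.5%).
- The TF-IDF weighted Word2Vec embedding (85.96%) is more accurate than the unweighted one (84.54%).
- Raising `max_iter` from 1000 to 2000 does not change any of the three scores.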

### Q2: Does more information provide higher accuracy

- Here we investigate whether adding the job title improves the accuracy of the classification model.
- For this, we tokenise the job titles and evaluate a logistic regression model with five-fold cross-validation on the titles alone, the descriptions alone, and both combined.
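
The title tokenisation in this section relies on the pattern `[a-zA-Z]+(?:[-'][a-zA-Z]+)?`. A short sketch of how it behaves: the first title is taken from `jobs_info` above, the second is made up to show the optional hyphen group, and the name `TITLE_PATTERN` is introduced here for illustration.

```python
import re

# one or more letters, optionally followed by a single "-" or "'" plus more letters
TITLE_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

real_title = re.findall(TITLE_PATTERN, "FP&A  Blue Chip")
made_up_title = re.findall(TITLE_PATTERN, "Part-time Accounts Assistant")

print(real_title)     # ['FP', 'A', 'Blue', 'Chip']
print(made_up_title)  # ['Part-time', 'Accounts', 'Assistant']
```

Characters such as "&" and digits act as separators, which is why "FP&A" becomes the two tokens "FP" and "A".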

In [36]:
label_new = [data["Title"] for data in jobs_info]

In [37]:
tokens = []     

for text in label_new:
   
    tokens.append(re.findall("[a-zA-Z]+(?:[-'][a-zA-Z]+)?",text))

In [38]:
tokens

[['FP', 'A', 'Blue', 'Chip'],
 ['Part', 'time', 'Management', 'Accountant'],
 ['IFA', 'EMPLOYED'],
 ['Finance', 'Manager'],
 ['Management', 'Accountant'],
 ['Customer', 'Service', 'Administrator'],
 ['Accounts', 'Assistant'],
 ['Pensions', 'Administrator'],
 ['Senior', 'Technical', 'Account', 'Analyst'],
 ['Financial', 'Administrator', 'and', 'Support', 'Fleet', 'Dept'],
 ['Pensions', 'Administrator', 'Month', 'Contract'],
 ['Financial', 'Accountant'],
 ['Senior', 'Internal', 'Audit'],
 ['MORTGAGE', 'SERVICES', 'CONSULTANT', 'UK', 'LEADING', 'ESTATE', 'A'],
 ['Brokers', 'Wanted', 'Imediate', 'Start'],
 ['Value', 'Risk', 'Manager'],
 ['Stress', 'Testing', 'Risk', 'Manager', 'Nonlife'],
 ['Company', 'Accountant'],
 ['Technical', 'Project', 'Manager', 'Mobile', 'Payments'],
 ['Pensions', 'Administration', 'Specialist'],
 ['Commercial', 'Underwriting', 'Motor', 'Team', 'Leader'],
 ['Bookkeeper', 'and', 'SMT', 'support', 'SAGE', 'Line', 'and', 'Forecasting'],
 ['Dividends', 'Corporate', 'Ac

In [39]:
# Function for preprocessing the title tokens
def data_preprocess(tokens_list):

    # convert every token to lower case
    list1 = [[text.lower() for text in word] for word in tokens_list]

    # load the stopword list into a set for fast membership tests
    with open("stopwords_en.txt") as stopwords_file:
        stop_text = set(stopwords_file.read().splitlines())

    # keep only tokens with at least 2 characters
    list1 = [[text for text in word if len(text) >= 2] for word in list1]

    # remove stopwords
    list1 = [[text for text in word if text not in stop_text] for word in list1]

    return list1

In [40]:
newtokens = data_preprocess(tokens) 
newtokens

[['fp', 'blue', 'chip'],
 ['part', 'time', 'management', 'accountant'],
 ['ifa', 'employed'],
 ['finance', 'manager'],
 ['management', 'accountant'],
 ['customer', 'service', 'administrator'],
 ['accounts', 'assistant'],
 ['pensions', 'administrator'],
 ['senior', 'technical', 'account', 'analyst'],
 ['financial', 'administrator', 'support', 'fleet', 'dept'],
 ['pensions', 'administrator', 'month', 'contract'],
 ['financial', 'accountant'],
 ['senior', 'internal', 'audit'],
 ['mortgage', 'services', 'consultant', 'uk', 'leading', 'estate'],
 ['brokers', 'wanted', 'imediate', 'start'],
 ['risk', 'manager'],
 ['stress', 'testing', 'risk', 'manager', 'nonlife'],
 ['company', 'accountant'],
 ['technical', 'project', 'manager', 'mobile', 'payments'],
 ['pensions', 'administration', 'specialist'],
 ['commercial', 'underwriting', 'motor', 'team', 'leader'],
 ['bookkeeper', 'smt', 'support', 'sage', 'line', 'forecasting'],
 ['dividends', 'corporate', 'actions', 'administrator'],
 ['collection'

In [41]:
tokens_con = []
for i in range(len(jobs_token_list)):
    tokens_con.append(jobs_token_list[i] + tokens[i])

In [42]:
tokens_con

[['market',
  'leading',
  'retail',
  'business',
  'rapid',
  'growth',
  'due',
  'expansion',
  'add',
  'financial',
  'planning',
  'analyst',
  'central',
  'team',
  'based',
  'central',
  'london',
  'fantastic',
  'opportunity',
  'join',
  'newly',
  'created',
  'team',
  'driving',
  'forward',
  'financial',
  'planning',
  'analysis',
  'group',
  'reporting',
  'directly',
  'head',
  'fp',
  'assist',
  'revenue',
  'analysis',
  'product',
  'channel',
  'region',
  'provide',
  'commercial',
  'input',
  'review',
  'business',
  'cases',
  'presenting',
  'proposals',
  'approval',
  'work',
  'business',
  'develop',
  'endtoend',
  'planning',
  'cycle',
  'processes',
  'lead',
  'regional',
  'planning',
  'processes',
  'ensure',
  'completeness',
  'key',
  'business',
  'channels',
  'products',
  'provide',
  'finance',
  'support',
  'year',
  'strategic',
  'plan',
  'addition',
  'work',
  'globally',
  'business',
  'regions',
  'develop',
  'capital',


In [43]:
# Accuracy calculation: count vectors over the tokens' own vocabulary, scored with 5-fold cross-validation
def calculate_accuracy(tokens,label):    
    
    vocab = set(chain.from_iterable(tokens))
    vector = CountVectorizer(analyzer = "word",vocabulary = vocab, binary = False)
    features = vector.fit_transform([' '.join(text) for text in tokens]) 
    regression = LogisticRegression(max_iter = 2000,random_state=0)
    
    return round(cross_val_score(regression,features, label, cv=5).mean()*100,2)

In [44]:
print(f"ACCURACY With Titles : {calculate_accuracy(tokens,lab)}%")
print(f"ACCURACY With Jobs Description : {calculate_accuracy(jobs_token_list,lab)}%")
print(f"ACCURACY with Title and Description : {calculate_accuracy(tokens_con,lab)}%")

ACCURACY With Titles : 59.15%
ACCURACY With Jobs Description : 87.5%
ACCURACY with Title and Description : 88.02%


***Inference***:

From this analysis we can infer that:

- Accuracy is lowest (59.15%) when only the job titles are used as tokens.
- Adding the titles to the descriptions changes the accuracy only slightly (87.5% to 88.02%). A title contributes only a handful of tokens compared with a full description, which may explain the small difference.
- These runs use the raw `tokens` list and not the preprocessed `newtokens`. Because `CountVectorizer` lowercases documents by default while the vocabulary here is built from the raw tokens, capitalised title words in the vocabulary cannot match any document, which may understate the contribution of the titles.

## Summary

The assessment tasks presented a significant challenge, requiring thoughtful problem-solving and data analysis. It was initially a bit of a puzzle to find the right approach, but once a logical rationale was established, it effectively addressed all instances.

Among the tasks, the one that stood out the most was the investigation of various metrics to determine accuracy. This aspect of the assessment sparked curiosity and offered a deep dive into understanding the intricacies of machine learning models and feature representations.