# KarriereAI
#### A deep learning model that predicts viable career paths for a user based on their skills and interests.

#### Purpose
KarriereAI classifies the most suitable career within technology for a user, based on input from an interactive quiz. The quiz determines the user's skills and interests, and the model then predicts a fitting career from that input.

#### Dataset
The dataset comes from the paper Skill2vec: A Machine Learning Approach for Determining the Relevant Skills from Job Description, by Van-Duyet Le et al., available <a href="https://arxiv.org/pdf/1707.09751">here</a>. It contains job titles, each paired with a free-text job description field that lists the relevant skills.

#### Model Architecture
The main model is a feed-forward neural network (FNN) for classification. It uses a multilayer perceptron (MLP) architecture, which is suitable for classifying structured data.

Building the main model requires preprocessing the dataset in a natural language processing (NLP) pipeline, which prepares it for a sub-model with a transformer encoder architecture.

The main model also requires choosing between a basic and a deep MLP architecture, deciding whether to include batch normalization, regularization and/or dropout, and experimenting with different activation functions.
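
To make these design choices concrete, here is a minimal NumPy sketch of one forward pass through a small MLP with ReLU activations and optional dropout. The layer sizes, the `mlp_forward` name and the random weights are illustrative assumptions, not the model trained later in this notebook.

```python
import numpy as np

def relu(x):
    # ReLU activation: negative values are clipped to zero
    return np.maximum(x, 0.0)

def mlp_forward(x, weights, biases, dropout_rate=0.0, training=False, rng=None):
    """Forward pass of a simple MLP; returns raw class scores (logits)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
        if training and dropout_rate > 0.0:
            # Inverted dropout: zero random units and rescale the remaining ones
            mask = rng.random(h.shape) >= dropout_rate
            h = h * mask / (1.0 - dropout_rate)
    # The output layer has no activation
    return h @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]  # hypothetical: 8 input features, two hidden layers, 3 classes
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

batch = rng.normal(size=(4, 8))
logits = mlp_forward(batch, weights, biases)
print(logits.shape)
```

A "deep" variant only needs more entries in `sizes`; dropout is active only when `training=True`, so inference stays deterministic.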

The NLP sub-model vectorizes the data so that it works with the MLP classifier rather than with a language model.

#### Evaluation
For evaluation, a confusion matrix and an F1 score will be computed, along with standard metrics such as accuracy, recall and precision.

### Step 1 - Importing Libraries and Loading the Data 
We need several libraries, including <a href="https://keras.io/api/">Keras</a> and <a href="https://www.tensorflow.org/api_docs/python/tf">TensorFlow</a>, to run computations on the dataset.

In [11]:
# Model processing
import sklearn
import numpy
import pandas
import tensorflow
import keras

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from keras import layers, Sequential
from scipy import sparse
from scipy.sparse import hstack

# Other
import datetime

# For plotting
%matplotlib inline
import matplotlib as plot
import matplotlib.pyplot as pyplot

from pathlib import Path
import seaborn
seaborn.set_theme (style = "whitegrid")

%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


After importing relevant libraries, we load the dataset we wish to train the model on.

In [None]:
# Load dataset
data = pandas.read_csv ("data/mustHaveSkills.csv", header = 0, encoding ='ISO-8859-1')
del data ['job_brief_id']

# REF: Van-Duyet Le


### Step 2 - Taking a Look at the Data

To write efficient, smooth-running Python for the model, we first study the shape of the data.

In [None]:
# Basic information about dataset
print ("Shape of dataset:", data.shape, "\n")
print ("Information about dataset:")
data.info()

Shape of dataset: (261724, 2) 

Information about dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261724 entries, 0 to 261723
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   keyword_name  261717 non-null  object
 1   job_title     261724 non-null  object
dtypes: object(2)
memory usage: 4.0+ MB
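
The `info()` output shows 261717 non-null values out of 261724 rows for `keyword_name`, so seven rows have no skill. Below is a small sketch of how such rows can be counted and dropped. It uses a made-up frame with the same two column names; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    "keyword_name": ["Java", None, "Python", "C++"],
    "job_title": ["software engineer", "engineer", "data scientist", "software engineer"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only rows that actually have a skill
cleaned = toy.dropna(subset=["keyword_name"]).reset_index(drop=True)
print(cleaned.shape)
```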



### Step 3 - Exploratory Data Analysis (EDA)

To understand what data the model will ingest, we take a closer look at it by constructing plots and counting how often each job title and skill occurs.

#### 3.1 Plotting Jobs and Skills
##### 3.1.1 Jobs

In [None]:
# Function to count number of occurrences 
def count_items(series):
    items = series.dropna().apply(lambda x: x.split(";"))
    flat_list = [item.strip() for sublist in items for item in sublist]
    return pandas.Series(flat_list).value_counts()

# REF: Adil Shamim


In [None]:
# Count Jobs
job_count = count_items(data["job_title"])
print("Most Common Jobs:\n", job_count)

Most Common Jobs:
 Software Engineer                       4140
software engineer                       3954
software developer                      2612
engineer                                2190
consultant                              1826
                                        ... 
css programmer                             1
css developer                              1
Clinical Research Asscociate               1
Business Solution Security Architect       1
Acccount                                   1
Name: count, Length: 5650, dtype: int64
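
The counts above treat `Software Engineer` and `software engineer` as different titles. One possible way to merge such variants is to normalize case and whitespace before counting. The helper name `normalize_title` and the sample titles below are assumptions for illustration.

```python
from collections import Counter

def normalize_title(title):
    # Lower-case and collapse repeated whitespace so case variants match
    return " ".join(title.lower().split())

sample_titles = ["Software Engineer", "software engineer", "software  developer", "Engineer"]
normalized_counts = Counter(normalize_title(t) for t in sample_titles)
print(normalized_counts.most_common())
```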



##### 3.1.2 Skills

In [None]:
# Count Skills
skills_count = count_items(data["keyword_name"])
print("Most Common Skills:\n", skills_count)

Most Common Skills:
 C++                                               3024
Java                                              2928
Python                                            1606
J2EE                                              1518
C#                                                1362
                                                  ... 
Shaders                                              1
Mel                                                  1
Accessible Rich Internet Applications WAI-ARIA       1
LTL                                                  1
RTL page designs                                     1
Name: count, Length: 8586, dtype: int64



### Step 4 - Preprocessing of Data

We clean the dataset so that the model does not learn from errors that would skew its predictions.

#### 4.1 String Cleanup

In [None]:
# Drop duplicates
data = data.drop_duplicates (subset = ['keyword_name', 'job_title'], keep = 'last')
data = data [data ["job_title"] != 0]
print (data.info())

# String magic
data ['Count'] = data.groupby ('job_title')['keyword_name'].transform (pandas.Series.value_counts)
data.drop_duplicates (inplace = True)
# data ['keyword_name'] = data ['keyword_name'].str.lower()
# data ['keyword_name'] = data ['keyword_name'].str.replace(' ', '_')
# data ['job_title'] = data ['job_title'].str.lower()

# Clean data
data_jobtitle = data.groupby ('job_title')['keyword_name'].apply(list)

# REF: Van-Duyet Le

<class 'pandas.core.frame.DataFrame'>
Index: 80810 entries, 0 to 261723
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   keyword_name  80803 non-null  object
 1   job_title     80810 non-null  object
dtypes: object(2)
memory usage: 1.8+ MB
None



#### 4.2 Vectorization
This step turns skills and interests into numerical feature vectors for classification.
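
As a rough illustration of what "vectorized" means here, the sketch below turns lists of skills into multi-hot rows over a fixed vocabulary. It is plain Python and NumPy, independent of the Keras layer used below, and `build_vocabulary` and `multi_hot` are hypothetical helper names.

```python
import numpy as np

def build_vocabulary(skill_lists):
    # Map every distinct skill to a column index (sorted for reproducibility)
    skills = sorted({skill for skill_list in skill_lists for skill in skill_list})
    return {skill: index for index, skill in enumerate(skills)}

def multi_hot(skill_lists, vocabulary):
    # One row per sample, one column per skill; 1.0 marks a skill that is present
    matrix = np.zeros((len(skill_lists), len(vocabulary)), dtype=np.float32)
    for row, skill_list in enumerate(skill_lists):
        for skill in skill_list:
            if skill in vocabulary:  # skills missing from the vocabulary are skipped
                matrix[row, vocabulary[skill]] = 1.0
    return matrix

samples = [["python", "sql"], ["java", "sql"], ["c++"]]
vocabulary = build_vocabulary(samples)
features_demo = multi_hot(samples, vocabulary)
print(vocabulary)
print(features_demo)
```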

In [None]:
# Define features to be used
features = data [['job_title', 'keyword_name']]
skills = data ['keyword_name']
target = data ['job_title']


In [None]:
# Look at contents of dataset
data_jobtitle

job_title
.NET                               [Microsoft Office SharePoint Server]
.NET Developer        [sharepoint, MOSS, ASP.Net, HTML/HTML5, CSS/CS...
.NET developer        [SQL, Finance, Bank, banking, FSI, FI, FS, Fin...
.Net Application                         [architect, .net, asp.net, C#]
.Net Developer                               [Microsoft .NET, .Net, C#]
                                            ...                        
workday consultant    [workday, workday HCM, Mandarin, Cantonese, Ch...
world design          [Level, Levels, world, worlds, design, Multipl...
world designer        [Level, Levels, world, worlds, design, Multipl...
writer                [agency, copywriting, copywriter, writer, writ...
writing               [agency, copywriting, copywriter, writer, writ...
Name: keyword_name, Length: 5649, dtype: object


In [None]:
vectorizer = keras.layers.TextVectorization (
    max_tokens = None,
    standardize = "lower_and_strip_punctuation",
    split = "character",
    ngrams = None, # we can insert a bigram (N-gram of 2)
    output_mode = "int",
    output_sequence_length = None,
    pad_to_max_tokens = False,
    vocabulary = None,
    idf_weights = None,
    sparse = False,
    ragged = False,
    encoding = "utf-8",
    name = None
)

# adapt() recomputes the vocabulary on each call, so fit it once on all text.
# Missing skills are NaN (float) and cannot be converted to a string tensor, so drop them first.
text_corpus = pandas.concat ([target, skills]).dropna().astype (str)
vectorizer.adapt (text_corpus.to_numpy())

# REF: Keras Documentation

#### 4.3 Embedding the Data
Before the data can be run through a transformer, it needs to be embedded.
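
An embedding is essentially a lookup table: each token id selects one row of a matrix. The NumPy sketch below shows the lookup plus mean pooling into a fixed-size vector. The vocabulary size, the embedding dimension and the random table are placeholder assumptions; in a trained model the table is learned.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embedding_dim = 10, 4
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Two token-id sequences, padded with 0 to the same length
token_ids = np.array([[3, 7, 1, 0],
                      [5, 2, 0, 0]])

embedded = embedding_table[token_ids]  # shape: (batch, sequence, embedding_dim)

# Average only over real tokens, ignoring the padding id 0
mask = (token_ids != 0)[..., np.newaxis]
pooled = (embedded * mask).sum(axis=1) / mask.sum(axis=1)
print(embedded.shape, pooled.shape)
```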

#### 4.4 The Transformer Model
The transformer uses the tokenized version of the data to produce an output suitable for the MLP classifier.
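
The core operation of a transformer encoder is scaled dot-product self-attention. The sketch below is a single-head version in NumPy with random projection matrices. It illustrates the mechanism only and is not the sub-model planned for this project; all names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attention = softmax(scores, axis=-1)  # each row sums to 1
    return attention @ v, attention

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))  # hypothetical embedded tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, attention = self_attention(x, w_q, w_k, w_v)
sentence_vector = context.mean(axis=0)  # fixed-size vector a classifier could consume
print(context.shape, sentence_vector.shape)
```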

### Step 5 - Building the MLP Model

Building the FNN model and training it on the training and validation sets.

#### 5.1 Constructing the Feature Matrix

Combining the numerical Age value with the vectorized Skills and Interests features by concatenating them into one sparse matrix.
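
Below is a sketch of the concatenation step, using small dense NumPy arrays in place of the real features. The values are invented, and scaling the age column is an assumed design choice. For sparse inputs, `scipy.sparse.hstack` (imported in Step 1) performs the same column-wise stacking.

```python
import numpy as np

# Hypothetical inputs for three users
age = np.array([[23.0], [35.0], [41.0]])  # one numeric column
skills_vec = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=float)
interests_vec = np.array([[0, 1], [1, 0], [1, 1]], dtype=float)

# Scale age to [0, 1] so it does not dominate the binary columns
age_scaled = (age - age.min()) / (age.max() - age.min())

# Stack column-wise: one row per user, all features side by side
feature_matrix_demo = np.hstack([age_scaled, skills_vec, interests_vec])
print(feature_matrix_demo.shape)
```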

#### 5.2 Training, Validation and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

print (type(feature_matrix), type(encoded_target))

X_train = feature_matrix 
y_train = encoded_target 

X_train = tensorflow.sparse.reorder(X_train)

# X_train, X_test, y_train, y_test = train_test_split (feature_matrix, encoded_target, test_size = 0.2, random_state = 42)

# tensorflow.sparse.reorder(X_train)

# this is trash:
# training_set, validation_set, test_set = train_val_test_split (age_matrix, feature_matrix, )
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<class 'tensorflow.python.framework.sparse_tensor.SparseTensor'> <class 'numpy.ndarray'>
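
A three-way split into training, validation and test sets can be obtained by shuffling the row indices once and slicing them. The sketch below assumes 70/15/15 proportions and a hypothetical helper name; applying scikit-learn's `train_test_split` twice would be an alternative.

```python
import numpy as np

def split_indices(n_samples, val_fraction=0.15, test_fraction=0.15, seed=42):
    # Shuffle once so the three sets are disjoint and reproducible
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be used to select rows from both the features and the labels, which keeps the two aligned.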



#### 5.3 Training the MLP Model

In [None]:
# Remove previous logs
import shutil
shutil.rmtree ("logs/fit", ignore_errors = True)


In [None]:
model = Sequential([
    keras.Input (shape = (feature_matrix.shape[1], )),
    # to complete the embedding, a one-hot layer:layers.StringLookup (output_mode = "one-hot")
    layers.Dropout (0.1),
    layers.Dense (16, activation = 'relu'),
    layers.Dense (16, activation = 'relu'),
    # followed by a Dense layer: layers.Dense (units = embedding_dim, use_bias = False, activation = None)
    layers.Dense (target.unique().size)
])

# model.compile(loss = 'SparseCategoricalCrossentropy', optimizer = 'adam')
model.compile (loss = 'hinge', 
              optimizer = 'adam', 
              metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tensorflow.keras.callbacks.TensorBoard(log_dir = log_dir, histogram_freq = 1)

model.fit (X_train, y_train, 
          epochs = 30, 
          batch_size = 1, 
          validation_split = 0.2,
          callbacks=[tensorboard_callback])

Epoch 1/30
160/160 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.0437 - loss: 0.5685 - val_accuracy: 0.0250 - val_loss: 0.0411
Epoch 2/30
160/160 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.0378 - loss: 0.0684 - val_accuracy: 0.0250 - val_loss: 0.0181
Epoch 3/30
160/160 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.0841 - loss: 0.0267 - val_accuracy: 0.0250 - val_loss: 0.0108
Epoch 4/30
160/160 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.0272 - loss: 0.0433 - val_accuracy: 0.0000e+00 - val_loss: 0.0097
Epoch 5/30
160/160 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.0524 - loss: 0.0202 - val_accuracy: 0.0250 - val_loss: 0.0147
Epoch 6/30
160/160 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.0425 - loss: 0.0249 - val_accuracy: 0.0000e+00 - val_loss: 0.0098
Epoch 7/30
...

<keras.src.callbacks.history.History at 0x27e43a62c50>
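
The final `Dense` layer above outputs raw scores (logits) per class. For integer-encoded class labels, a common loss is sparse categorical cross-entropy computed from those logits, which is the commented-out option above. The NumPy sketch below shows what that loss computes; the logits and labels are made-up numbers.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, labels):
    """Mean cross-entropy between integer labels and softmax(logits)."""
    # Log-softmax computed in a numerically stable way
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for every sample
    true_class_log_probs = log_probs[np.arange(len(labels)), labels]
    return -true_class_log_probs.mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
loss = sparse_categorical_crossentropy(logits, labels)
print(round(float(loss), 4))
```

With uniform scores over `k` classes this loss equals `log(k)`, which is a useful sanity check for an untrained model.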


In [None]:
model.summary()


In [None]:
%tensorboard --logdir logs/fit

Reusing TensorBoard on port 6006 (pid 20476), started 1 day, 13:53:40 ago. (Use '!kill 20476' to kill it.)


### Step 6 - Evaluation


#### 6.1 Confusion Matrix

In [None]:
# Predict on the test set
# y_pred = model.predict(X_train)


In [None]:
# Confusion matrix (y_pred holds one score per class, so take the argmax first)
# cm = confusion_matrix(y_train, y_pred.argmax(axis = 1))
# pyplot.figure(figsize = (10, 8))
# seaborn.heatmap(cm, annot = True, fmt = "d", cmap = "Blues",
#             xticklabels = target_encoder.classes_,
#             yticklabels = target_encoder.classes_)
# pyplot.xlabel("Predicted")
# pyplot.ylabel("Actual")
# pyplot.title("Confusion Matrix")
# pyplot.show()
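
`model.predict` returns one score per class, so the scores have to be reduced to a class index before a confusion matrix can be built. Below is a NumPy sketch with invented scores, where rows are actual classes and columns are predicted classes (the same layout as `sklearn.metrics.confusion_matrix`).

```python
import numpy as np

def confusion_from_scores(y_true, scores, n_classes):
    # The predicted class is the index of the highest score in each row
    y_pred = np.argmax(scores, axis=1)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual, predicted] += 1
    return matrix

y_true = np.array([0, 1, 2, 2, 1])
scores = np.array([[2.0, 0.1, 0.3],   # predicted 0
                   [0.2, 1.5, 0.1],   # predicted 1
                   [0.9, 0.2, 0.4],   # predicted 0 (wrong, actual is 2)
                   [0.1, 0.3, 2.2],   # predicted 2
                   [0.3, 0.8, 0.1]])  # predicted 1
cm_demo = confusion_from_scores(y_true, scores, n_classes=3)
print(cm_demo)
```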


#### 6.2 F1-Score
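
F1 is the harmonic mean of precision and recall. With many classes it is usually reported per class and then averaged (macro F1). The sketch below computes it from a confusion matrix whose rows are actual and columns are predicted classes; the matrix values are invented.

```python
import numpy as np

def f1_per_class(cm):
    true_positives = np.diag(cm).astype(float)
    predicted_totals = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual_totals = cm.sum(axis=1)     # row sums: how often each class occurs
    # Guard against division by zero for classes never predicted or never present
    precision = np.divide(true_positives, predicted_totals,
                          out=np.zeros_like(true_positives), where=predicted_totals > 0)
    recall = np.divide(true_positives, actual_totals,
                       out=np.zeros_like(true_positives), where=actual_totals > 0)
    denominator = precision + recall
    return np.divide(2 * precision * recall, denominator,
                     out=np.zeros_like(true_positives), where=denominator > 0)

cm_example = np.array([[5, 1, 0],
                       [2, 3, 1],
                       [0, 0, 4]])
f1_scores = f1_per_class(cm_example)
macro_f1 = f1_scores.mean()
print(f1_scores, macro_f1)
```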


#### 6.3 Precision, Accuracy, Recall
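
Accuracy is the share of all predictions that are correct, precision asks how many predictions of a class were right, and recall asks how many members of a class were found. The sketch below works directly on label arrays; macro averaging is an assumed choice and the labels are invented.

```python
import numpy as np

def accuracy_precision_recall(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    precisions, recalls = [], []
    for label in np.unique(y_true):
        predicted_as_label = y_pred == label
        actually_label = y_true == label
        hits = np.sum(predicted_as_label & actually_label)
        # A class that was never predicted gets precision 0
        precisions.append(hits / predicted_as_label.sum() if predicted_as_label.any() else 0.0)
        recalls.append(hits / actually_label.sum())
    return accuracy, float(np.mean(precisions)), float(np.mean(recalls))

y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
accuracy, macro_precision, macro_recall = accuracy_precision_recall(y_true_demo, y_pred_demo)
print(accuracy, macro_precision, macro_recall)
```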
