# **Resume Categorization**

![Resume Categorization](CreatingaPhysicianCVthatShines.jpg)

This project tackles the challenge of **resume categorization** using machine learning and deep learning techniques. Companies often face the daunting task of sifting through numerous resumes for each job opening. This app aims to automate and streamline this process by predicting the job category a given resume belongs to. By using a trained model, the app can quickly suggest the appropriate job category for each resume, saving time and resources for recruiters.

## **Set Environment**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix
from sklearn.neighbors import KNeighborsClassifier

## **Reading The Dataset**

**Dataset Acquisition:** The project uses the [Resume Dataset](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset) from Kaggle. This dataset consists of resumes categorized into 25 distinct job fields, providing a solid foundation for training our models.

In [2]:
resumeDataSet = pd.read_csv("E:\\ERU\\Level 4\\S1\\ML\\Project\\UpdatedResumeDataSet.csv\\UpdatedResumeDataSet.csv")
resumeDataSet.head()

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


### **Displaying the distinct categories of resume**

In [3]:
print ("Displaying the distinct categories of resume:\n\n ")
print (resumeDataSet['Category'].unique())

Displaying the distinct categories of resume:

 
['Data Science' 'HR' 'Advocate' 'Arts' 'Web Designing'
 'Mechanical Engineer' 'Sales' 'Health and fitness' 'Civil Engineer'
 'Java Developer' 'Business Analyst' 'SAP Developer' 'Automation Testing'
 'Electrical Engineering' 'Operations Manager' 'Python Developer'
 'DevOps Engineer' 'Network Security Engineer' 'PMO' 'Database' 'Hadoop'
 'ETL Developer' 'DotNet Developer' 'Blockchain' 'Testing']


### **Displaying the number of resumes in each category**

In [4]:
print ("Displaying the distinct categories of resume and the number of records belonging to each category:\n\n")
print (resumeDataSet['Category'].value_counts())

Displaying the distinct categories of resume and the number of records belonging to each category:


Category
Java Developer               84
Testing                      70
DevOps Engineer              55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
Sales                        40
Data Science                 40
Mechanical Engineer          40
ETL Developer                40
Blockchain                   40
Operations Manager           40
Arts                         36
Database                     33
Health and fitness           30
PMO                          30
Electrical Engineering       30
Business Analyst             28
DotNet Developer             28
Automation Testing           26
Network Security Engineer    25
Civil Engineer               24
SAP Developer                24
Advocate                     20
Name: count, dtype: int64


### **Check the dataset**

In [5]:
resumeDataSet['Category'][0]

'Data Science'

In [6]:
resumeDataSet['Resume'][0]

'Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, NaÃ¯ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details \r\n\r\nData Science Assurance Associate \r\n\r\nData Science Assurance Associate - Ernst & Young LLP\r\nSkill Details \r\nJAVASCRIPT- Exprience - 24 months\r\njQuery- Exprience - 24 months\r\nPython- Exprience - 24 monthsCompany Details \r\ncompany - Ernst & Young LLP\r\ndescription - Fraud Investigatio

## **Text preprocessing**

**Text Preprocessing:** Raw resume text is often messy. We apply preprocessing techniques to clean the resume data, including:
- Removing URLs, `RT` and `cc` markers, hashtags, and mentions
- Replacing punctuation and non-ASCII characters with spaces
- Collapsing extra whitespace

In [7]:
import re

def cleanResume(txt):
    # Raw strings keep regex escapes such as \S and \s from being read as (invalid) string escapes
    cleanText = re.sub(r'http\S+\s', ' ', txt)  # URLs that are followed by a whitespace character
    cleanText = re.sub(r'RT|cc', ' ', cleanText)  # "RT" and "cc"; note that this also matches inside words
    cleanText = re.sub(r'#\S+\s', ' ', cleanText)  # hashtags
    cleanText = re.sub(r'@\S+', '  ', cleanText)  # mentions
    cleanText = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', cleanText)  # punctuation
    cleanText = re.sub(r'[^\x00-\x7f]', ' ', cleanText)  # non-ASCII characters
    cleanText = re.sub(r'\s+', ' ', cleanText)  # collapse runs of whitespace into one space
    return cleanText

In [8]:
resumeDataSet['Resume'] = resumeDataSet['Resume'].apply(cleanResume)

### **Check cleaned text**

In [9]:
resumeDataSet['Resume'][0]

'Skills Programming Languages Python pandas numpy scipy scikit learn matplotlib Sql Java JavaScript JQuery Machine learning Regression SVM Na ve Bayes KNN Random Forest Decision Trees Boosting techniques Cluster Analysis Word Embedding Sentiment Analysis Natural Language processing Dimensionality reduction Topic Modelling LDA NMF PCA Neural Nets Database Visualizations Mysql SqlServer Cassandra Hbase ElasticSearch D3 js DC js Plotly kibana matplotlib ggplot Tableau Others Regular Expression HTML CSS Angular 6 Logstash Kafka Python Flask Git Docker computer vision Open CV and understanding of Deep learning Education Details Data Science Assurance Associate Data Science Assurance Associate Ernst Young LLP Skill Details JAVASCRIPT Exprience 24 months jQuery Exprience 24 months Python Exprience 24 monthsCompany Details company Ernst Young LLP description Fraud Investigations and Dispute Services Assurance TECHNOLOGY ASSISTED REVIEW TAR Technology Assisted Review assists in a elerating the 
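The cleaned text above contains fragments such as `a elerating`, because the pattern `RT|cc` also matches inside words like "accelerating". Below is a minimal sketch of a stricter variant, assuming the intent is to drop only standalone `RT` and `cc` tokens. The helper name `strip_rt_cc` and the sample string are illustrative and not part of the original notebook.

```python
import re

def strip_rt_cc(text: str) -> str:
    """Replace 'RT' and 'cc' with a space only when they stand alone as words."""
    return re.sub(r'\b(?:RT|cc)\b', ' ', text)

sample = 'RT accelerating SUPPORT cc team'  # made-up sample text

# Pattern used in cleanResume: matches anywhere, including inside words
unbounded = re.sub(r'RT|cc', ' ', sample)

# Word-boundary variant: leaves "accelerating" and "SUPPORT" intact
bounded = strip_rt_cc(sample)

print(repr(unbounded))
print(repr(bounded))
```

Swapping this pattern into `cleanResume` would change the cleaned text and therefore the TF-IDF vocabulary, so the accuracy figures reported later would need to be re-measured.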

## **Model Preprocessing**

**Data Preparation:**
- **Category Encoding:** We transform the textual categories into integer labels using Label Encoding, so that the machine learning algorithms can work with them.
- **TF-IDF Vectorization:** We convert the cleaned text into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency), which gives more weight to terms that occur often in a given resume but rarely in the rest of the dataset.
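To make the TF-IDF weighting concrete, here is a small standard-library sketch. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which are the scikit-learn defaults. The whitespace tokenizer and the three toy documents are assumptions made for illustration; `TfidfVectorizer` tokenizes differently and also removes stop words in this notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy documents, invented for this example
docs = ["python sql pandas", "java sql spring", "python java docker"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first toy document, `pandas` appears in no other document and therefore receives a higher weight than `python` or `sql`, which each appear in two documents.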


### **Category Encoding**

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [11]:
le.fit(resumeDataSet['Category'])
resumeDataSet['Category'] = le.transform(resumeDataSet['Category'])

In [12]:
resumeDataSet.Category.unique()

array([ 6, 12,  0,  1, 24, 16, 22, 14,  5, 15,  4, 21,  2, 11, 18, 20,  8,
       17, 19,  7, 13, 10,  9,  3, 23])
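The integers above follow the alphabetical order of the category names. The sketch below reproduces the mapping with plain Python, assuming the encoder assigns labels in sorted order, which is consistent with the array shown. The names `label_of` and `name_of` are illustrative.

```python
# Category names in the order they first appear in the dataset
categories = [
    'Data Science', 'HR', 'Advocate', 'Arts', 'Web Designing',
    'Mechanical Engineer', 'Sales', 'Health and fitness', 'Civil Engineer',
    'Java Developer', 'Business Analyst', 'SAP Developer', 'Automation Testing',
    'Electrical Engineering', 'Operations Manager', 'Python Developer',
    'DevOps Engineer', 'Network Security Engineer', 'PMO', 'Database', 'Hadoop',
    'ETL Developer', 'DotNet Developer', 'Blockchain', 'Testing',
]

# Sort the names and number them 0..24
label_of = {name: idx for idx, name in enumerate(sorted(categories))}
# Reverse lookup: integer label back to category name
name_of = {idx: name for name, idx in label_of.items()}

encoded = [label_of[name] for name in categories]
print(encoded[:5])
print(name_of[6])
```

With the fitted encoder itself, `le.inverse_transform` performs the same reverse lookup, which is what turns a model's integer prediction back into a job category.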

### **TF-IDF**

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

tfidf.fit(resumeDataSet['Resume'])
requredTaxt  = tfidf.transform(resumeDataSet['Resume'])

### **Data Splitting**

**Data Splitting:** The dataset is divided into a training set (80%) and a testing set (20%) so that the models are evaluated on resumes they have not seen during training.

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(requredTaxt, resumeDataSet['Category'], test_size=0.2, random_state=42)
X_train.shape

(769, 7351)

In [17]:
X_test.shape

(193, 7351)
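The two shapes can be checked by hand. This is a short sketch, assuming that `train_test_split` rounds the test share up and gives the remaining rows to training, which is consistent with the output above. The total of 962 rows is the sum of the per-category counts listed earlier.

```python
import math

n_samples = 962   # sum of the category counts shown earlier
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test share, rounded up
n_train = n_samples - n_test                # everything else is used for training
print(n_train, n_test)
```

Because the smallest category has only 20 resumes, passing `stratify=resumeDataSet['Category']` to `train_test_split` is an option worth considering. It keeps the category proportions the same in both splits, but it would change which rows land in the test set and therefore the reported scores.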

## **Model Development**

**Model Development:** We build and train multiple models:
-   **Machine Learning Models:**
    *   **K-Nearest Neighbors (KNN):** A simple yet effective classification algorithm based on distance between points.
    *   **Support Vector Machine (SVC):** A powerful algorithm that finds an optimal hyperplane to separate classes.
    *   **Random Forest:** An ensemble method that combines multiple decision trees for more robust predictions.
-   **Deep Learning Model:**
    *   **Multilayer Perceptron (MLP):** A neural network model with multiple hidden layers to learn complex patterns from the data.
-   **Ensemble Model:**
    *   **Voting Classifier:** Combines the predictions from the machine learning models to achieve more accurate and robust results.
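Hard voting, the mode used for the ensemble in this notebook, can be sketched in a few lines of plain Python. The function name `hard_vote` and the sample predictions are illustrative. Ties are broken in favor of the smallest label, on the assumption that this mirrors scikit-learn's ascending-order tie rule.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over per-model label lists of equal length."""
    voted = []
    for labels in zip(*predictions):          # labels from all models for one sample
        counts = Counter(labels)
        top = max(counts.values())
        # On a tie, pick the smallest label among the most frequent ones
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted

# Made-up predictions from three models for three resumes
knn_pred = [6, 12, 0]
svc_pred = [6, 3, 0]
rf_pred = [2, 12, 1]
print(hard_vote([knn_pred, svc_pred, rf_pred]))
```

Each sample receives the label that most models agree on, so a single model's mistake is outvoted as long as the other two are correct.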

### **Machine Learning**

In [18]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

In [19]:
# Ensure that X_train and X_test are dense if they are sparse
X_train = X_train.toarray() if hasattr(X_train, 'toarray') else X_train
X_test = X_test.toarray() if hasattr(X_test, 'toarray') else X_test

#### **KNN**

In [86]:
# 1. Train KNeighborsClassifier
knn_model = OneVsRestClassifier(KNeighborsClassifier())
knn_model.fit(X_train, y_train)

# Accuracy for the training set
y_train_pred_knn = knn_model.predict(X_train)
train_accuracy_knn = accuracy_score(y_train, y_train_pred_knn)
print(f"Training Accuracy: {train_accuracy_knn:.4f}")

# Accuracy for the test set
y_pred_knn = knn_model.predict(X_test)
test_accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"Testing Accuracy: {test_accuracy_knn:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_knn)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_knn)}")

ValueError: could not convert string to float: ' Excellent grasping power in learning new concepts and technology Highly motivated team player with strong work ethics ...'

**Note:** This error appears to come from a stale run of the cell (`In [86]`) in which `X_train` still held raw resume text rather than TF-IDF features. The message echoes the full text of the offending resume and is shortened here.

#### **Support Vector Machine**

In [122]:
# 2. Train SVC
svc_model = OneVsRestClassifier(SVC())
svc_model.fit(X_train, y_train)

# Accuracy for the training set
y_train_pred_svc = svc_model.predict(X_train)
train_accuracy_svc = accuracy_score(y_train, y_train_pred_svc)
print(f"Training Accuracy: {train_accuracy_svc:.4f}")

# Accuracy for the test set
y_pred_svc = svc_model.predict(X_test)
test_accuracy_svc = accuracy_score(y_test, y_pred_svc)
print("\nSVC Results:")
print(f"Testing Accuracy: {test_accuracy_svc:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_svc)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_svc)}")

Training Accuracy: 1.0000

SVC Results:
Testing Accuracy: 0.9948
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  0  6
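The testing accuracy of 0.9948 is consistent with a single misclassified resume among the 193 test rows, which matches the lone off-diagonal `1` in the confusion matrix. The sketch below shows how accuracy follows from a confusion matrix. The 3x3 matrix is a toy example and `accuracy_from_confusion` is an illustrative helper.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (the diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy matrix: rows are true classes, columns are predicted classes
toy = [
    [3, 0, 0],
    [0, 13, 1],   # one sample of class 1 predicted as class 2
    [0, 0, 5],
]
print(f"{accuracy_from_confusion(toy):.4f}")

# One error out of 193 test rows, as in the results above
print(f"{192 / 193:.4f}")
```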

#### **Random Forest**

In [123]:
# 3. Train RandomForestClassifier
rf_model = OneVsRestClassifier(RandomForestClassifier())
rf_model.fit(X_train, y_train)

# Accuracy for the training set
y_train_pred_rf = rf_model.predict(X_train)
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)
print(f"Training Accuracy: {train_accuracy_rf:.4f}")

# Accuracy for the test set
y_pred_rf = rf_model.predict(X_test)
test_accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("\nRandomForestClassifier Results:")
print(f"Testing Accuracy: {test_accuracy_rf:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_rf)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_rf)}")

Training Accuracy: 1.0000

RandomForestClassifier Results:
Testing Accuracy: 0.9948
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  

### **Ensemble Model**

#### **Voting**

In [124]:
from sklearn.ensemble import VotingClassifier

# Create a VotingClassifier with the three models
ensemble_model = VotingClassifier(estimators=[
    ('knn', knn_model),
    ('svc', svc_model),
    ('rf', rf_model)
], voting='hard')

# Train the ensemble model
ensemble_model.fit(X_train, y_train)

# Make predictions with the ensemble model
y_pred_ensemble = ensemble_model.predict(X_test)

# Evaluate the ensemble model
ensemble_accuracy = accuracy_score(y_test, y_pred_ensemble)
print(f"Ensemble Model Accuracy: {ensemble_accuracy:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_ensemble)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_ensemble)}")

Ensemble Model Accuracy: 0.9948
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0

#### **MLP**
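The training log of the cell below shows `22/22` steps per epoch. Here is a short sketch of where that number comes from, assuming Keras holds out the last 10% of the 769 training rows for validation and rounds the training share down.

```python
import math

n_train_rows = 769       # rows in X_train
validation_split = 0.1
batch_size = 32

n_fit = int(n_train_rows * (1 - validation_split))   # rows used for weight updates
n_val = n_train_rows - n_fit                         # rows held out for validation
steps_per_epoch = math.ceil(n_fit / batch_size)      # the last batch may be smaller
print(n_fit, n_val, steps_per_epoch)
```

A validation set of 77 rows also explains why the logged validation accuracy moves in coarse steps: 0.6753, for example, is 52 correct out of 77.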

In [125]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense
import numpy as np
from sklearn.metrics import classification_report

# Assuming X_train, X_test, y_train, y_test are already defined

# Define the MLP model
model = keras.Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),  # Input layer
    Dense(64, activation='relu'),  # Hidden layer
    Dense(len(np.unique(y_train)), activation='softmax')  # Output layer (number of classes)
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # Use sparse_categorical_crossentropy for integer labels
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy = history.history['accuracy'][-1]  # Last epoch's training accuracy
val_accuracy = history.history['val_accuracy'][-1]  # Last epoch's validation accuracy

print(f"Final Training Accuracy: {train_accuracy:.4f}")
print(f"Final Validation Accuracy: {val_accuracy:.4f}")

# Evaluate the model on the test set
loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

# Make predictions
y_pred_mlp = np.argmax(model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_mlp))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 40ms/step - accuracy: 0.3595 - loss: 3.1767 - val_accuracy: 0.6753 - val_loss: 2.9307
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - accuracy: 0.8042 - loss: 2.7127 - val_accuracy: 0.8312 - val_loss: 2.0922
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 0.8990 - loss: 1.6404 - val_accuracy: 0.8961 - val_loss: 1.0710
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 0.9652 - loss: 0.6920 - val_accuracy: 0.9870 - val_loss: 0.4553
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 28ms/step - accuracy: 1.0000 - loss: 0.2786 - val_accuracy: 1.0000 - val_loss: 0.2075
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step - accuracy: 1.0000 - loss: 0.0966 - val_accuracy: 1.0000 - val_loss: 0.1147
Epoch 7/10
22/22 ━━━━

## **Deep Learning**

### **RNN**

In [35]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

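#### **RNN with SGD**

In [34]:
# Re-compile the RNN with SGD (learning rate 0.01, momentum 0.9).
# `rnn_model` and the tokenized X_train/X_test split are created in the "RNN with Adam" cell below,
# so that cell has to be run before this one.
from tensorflow.keras.optimizers import SGD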

sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)
rnn_model.compile(optimizer=sgd_optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Train the model
history_rnn = rnn_model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_rnn = history_rnn.history['accuracy'][-1]
val_accuracy_rnn = history_rnn.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_rnn:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_rnn:.4f}")

# Evaluate the model on the test set
loss_rnn, test_accuracy_rnn = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_rnn:.4f}")
print(f"Test Accuracy: {test_accuracy_rnn:.4f}")

# Make predictions
y_pred_rnn = np.argmax(rnn_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_rnn))

Epoch 1/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 60ms/step - accuracy: 0.1946 - loss: 5.0641 - val_accuracy: 0.0909 - val_loss: 3.2306
Epoch 2/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 44ms/step - accuracy: 0.0728 - loss: 3.1867 - val_accuracy: 0.1039 - val_loss: 3.2158
Epoch 3/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 42ms/step - accuracy: 0.0737 - loss: 3.1777 - val_accuracy: 0.0909 - val_loss: 3.2337
Epoch 4/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 43ms/step - accuracy: 0.0878 - loss: 3.1283 - val_accuracy: 0.0260 - val_loss: 3.3137
Epoch 5/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 42ms/step - accuracy: 0.0607 - loss: 3.1741 - val_accuracy: 0.1039 - val_loss: 3.2499
Epoch 6/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 50ms/step - accuracy: 0.1005 - loss: 3.1294 - val_accuracy: 0.1039 - val_loss: 3.2593
Epoch 7/100
22/22

#### **RNN with Adam**
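The cell below turns each resume into a list of integer token ids and then pads or truncates it to exactly 200 entries with `padding='post'` and `truncating='post'`. This plain-Python sketch shows that behavior; `pad_post` is a hypothetical helper and not a Keras function.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the first `maxlen` ids, then fill the tail with `value` if too short."""
    truncated = sequence[:maxlen]                       # truncating='post' drops the end
    return truncated + [value] * (maxlen - len(truncated))  # padding='post' pads the end

short_resume = [5, 3, 9]             # made-up token ids
long_resume = [1, 2, 3, 4, 5, 6, 7]  # made-up token ids

print(pad_post(short_resume, maxlen=5))
print(pad_post(long_resume, maxlen=5))
```

With `maxlen=200`, only the first 200 tokens of each resume reach the network, so anything a candidate writes later in a long resume is ignored by the RNN and LSTM models.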

In [36]:
# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(resumeDataSet['Resume'])
sequences = tokenizer.texts_to_sequences(resumeDataSet['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Encode the labels
le = LabelEncoder()
le.fit(resumeDataSet['Category'])
encoded_labels = le.transform(resumeDataSet['Category'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the RNN model
rnn_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    SimpleRNN(64, return_sequences=False),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
rnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Train the model
history_rnn = rnn_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_rnn = history_rnn.history['accuracy'][-1]
val_accuracy_rnn = history_rnn.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_rnn:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_rnn:.4f}")

# Evaluate the model on the test set
loss_rnn, test_accuracy_rnn = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_rnn:.4f}")
print(f"Test Accuracy: {test_accuracy_rnn:.4f}")

# Make predictions
y_pred_rnn = np.argmax(rnn_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_rnn))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 8s 134ms/step - accuracy: 0.2494 - loss: 3.0171 - val_accuracy: 0.6234 - val_loss: 2.5586
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 98ms/step - accuracy: 0.6748 - loss: 2.1932 - val_accuracy: 0.7143 - val_loss: 1.9266
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 138ms/step - accuracy: 0.7559 - loss: 1.4965 - val_accuracy: 0.7922 - val_loss: 1.5176
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 106ms/step - accuracy: 0.8021 - loss: 1.1339 - val_accuracy: 0.8182 - val_loss: 1.1271
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 121ms/step - accuracy: 0.8883 - loss: 0.7577 - val_accuracy: 0.8442 - val_loss: 0.8956
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 90ms/step - accuracy: 0.8738 - loss: 0.6838 - val_accuracy: 0.8701 - val_loss: 0.6662
Epoch 7/10
22/22

### **LSTM**

In [None]:
from tensorflow.keras.layers import LSTM, Embedding, Dense, GlobalMaxPool1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(resumeDataSet['Resume'])
sequences = tokenizer.texts_to_sequences(resumeDataSet['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Encode the labels
le = LabelEncoder()
le.fit(resumeDataSet['Category'])
encoded_labels = le.transform(resumeDataSet['Category'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the LSTM model
lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    LSTM(64, return_sequences=True),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
lstm_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

# Train the model
history_lstm = lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_lstm = history_lstm.history['accuracy'][-1]
val_accuracy_lstm = history_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_lstm:.4f}")

# Evaluate the model on the test set
loss_lstm, test_accuracy_lstm = lstm_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_lstm:.4f}")

# Make predictions
y_pred_lstm = np.argmax(lstm_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_lstm))

In [20]:
from tensorflow.keras.layers import LSTM, Embedding, Dense, GlobalMaxPool1D
from tensorflow.keras.models import Sequential

# Ensure padded_sequences is defined
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(resumeDataSet['Resume'])
sequences = tokenizer.texts_to_sequences(resumeDataSet['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Define the LSTM model
lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    LSTM(64, return_sequences=True),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
lstm_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

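# Train the model
# Note: this fits on the full dataset, so test rows evaluated below are also seen during training
# and the reported test accuracy is optimistic.
history_lstm = lstm_model.fit(padded_sequences, resumeDataSet['Category'], epochs=10, batch_size=32, validation_split=0.1)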

# Retrieve training and validation accuracy from history
train_accuracy_lstm = history_lstm.history['accuracy'][-1]
val_accuracy_lstm = history_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_lstm:.4f}")

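# Evaluate the model on the test set
# Look up the resume text for the test rows. This requires y_test to be a pandas Series
# (as returned by the TF-IDF split); after the sequence split above it is a NumPy array without `.index`.
test_resume_texts = resumeDataSet['Resume'].iloc[y_test.index].tolist()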

# Now use this list of texts for tokenization
test_sequences = tokenizer.texts_to_sequences(test_resume_texts)  
padded_test_sequences = pad_sequences(test_sequences, maxlen=200, padding='post', truncating='post')

loss_lstm, test_accuracy_lstm = lstm_model.evaluate(padded_test_sequences, y_test)
print(f"Test Loss: {loss_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_lstm:.4f}")

# Make predictions
y_pred_lstm = np.argmax(lstm_model.predict(padded_test_sequences), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_lstm))

NameError: name 'padded_sequences' is not defined
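
The `pad_sequences(..., maxlen=200, padding='post', truncating='post')` call in the cell above gives every resume the same length. As a rough illustration of that behaviour (a pure-Python sketch, not the Keras implementation; `pad_post` is a hypothetical helper name), short sequences get zeros appended and long ones keep only their first `maxlen` tokens:

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences(padding='post', truncating='post')."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:maxlen])                  # truncating='post': drop tokens from the end
        clipped += [value] * (maxlen - len(clipped))  # padding='post': append filler at the end
        padded.append(clipped)
    return padded

# One sequence shorter than maxlen, one longer (made-up token ids).
example = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(example)
```

With `maxlen=200`, any resume longer than 200 tokens is cut off, so content near the end of a long resume never reaches the model.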

### **LSTM with SGD**

In [45]:
from tensorflow.keras.optimizers import SGD

# Define the LSTM model
lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    LSTM(64, return_sequences=True),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model with SGD optimizer
sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)
lstm_model.compile(optimizer=sgd_optimizer,
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

# Train the model
history_lstm_sgd = lstm_model.fit(padded_sequences, resumeDataSet['Category'], epochs=70, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_lstm_sgd = history_lstm_sgd.history['accuracy'][-1]
val_accuracy_lstm_sgd = history_lstm_sgd.history['val_accuracy'][-1]

print(f"Final Training Accuracy with SGD: {train_accuracy_lstm_sgd:.4f}")
print(f"Final Validation Accuracy with SGD: {val_accuracy_lstm_sgd:.4f}")

# Evaluate the model on the test set
loss_lstm_sgd, test_accuracy_lstm_sgd = lstm_model.evaluate(padded_test_sequences, y_test)
print(f"Test Loss with SGD: {loss_lstm_sgd:.4f}")
print(f"Test Accuracy with SGD: {test_accuracy_lstm_sgd:.4f}")

# Make predictions
y_pred_lstm_sgd = np.argmax(lstm_model.predict(padded_test_sequences), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_lstm_sgd))

Epoch 1/70
28/28 ━━━━━━━━━━━━━━━━━━━━ 6s 113ms/step - accuracy: 0.0458 - loss: 3.2169 - val_accuracy: 0.0000e+00 - val_loss: 3.3401
Epoch 2/70
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 91ms/step - accuracy: 0.1071 - loss: 3.1881 - val_accuracy: 0.0000e+00 - val_loss: 3.4938
Epoch 3/70
28/28 ━━━━━━━━━━━━━━━━━━━━ 2s 78ms/step - accuracy: 0.0984 - loss: 3.1647 - val_accuracy: 0.0000e+00 - val_loss: 3.6221
Epoch 4/70
28/28 ━━━━━━━━━━━━━━━━━━━━ 2s 79ms/step - accuracy: 0.0867 - loss: 3.1649 - val_accuracy: 0.0000e+00 - val_loss: 3.7305
Epoch 5/70
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 98ms/step - accuracy: 0.0996 - loss: 3.1339 - val_accuracy: 0.0000e+00 - val_loss: 3.8851
Epoch 6/70
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 102ms/step - accuracy: 0.0927 - loss: 3.1337 - val_accuracy: 0.0000e+00 - val_loss: 4.0037
Epoch 7/
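
The `SGD(learning_rate=0.01, momentum=0.9)` optimizer used in this cell is gradient descent with a momentum term. The sketch below applies that update rule as the Keras documentation describes it (`velocity = momentum * velocity - learning_rate * gradient`, then `w = w + velocity`) to a toy one-dimensional loss. The loss function and the `sgd_momentum` helper are illustrative assumptions, not part of the project:

```python
def sgd_momentum(grad_fn, w, learning_rate=0.01, momentum=0.9, steps=500):
    """Minimize a 1-D function with SGD plus momentum, given its gradient."""
    velocity = 0.0
    for _ in range(steps):
        gradient = grad_fn(w)
        # Momentum keeps a decaying running sum of past update steps.
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
    return w

# Toy loss (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(round(w_final, 4))
```

With `momentum=0`, the same loop reduces to plain gradient descent.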

### **BI-LSTM**

In [46]:
from tensorflow import keras
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, GlobalMaxPool1D, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(resumeDataSet['Resume'])
sequences = tokenizer.texts_to_sequences(resumeDataSet['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Encode the labels
le = LabelEncoder()
le.fit(resumeDataSet['Category'])
encoded_labels = le.transform(resumeDataSet['Category'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the BI-LSTM model
bi_lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    Bidirectional(LSTM(64, return_sequences=True)),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
bi_lstm_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

# Train the model
history_bi_lstm = bi_lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_bi_lstm = history_bi_lstm.history['accuracy'][-1]
val_accuracy_bi_lstm = history_bi_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_bi_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_bi_lstm:.4f}")

# Evaluate the model on the test set
loss_bi_lstm, test_accuracy_bi_lstm = bi_lstm_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_bi_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_bi_lstm:.4f}")

# Make predictions
y_pred_bi_lstm = np.argmax(bi_lstm_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_bi_lstm))

In [48]:
from tensorflow import keras
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(resumeDataSet['Resume'])
sequences = tokenizer.texts_to_sequences(resumeDataSet['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Define the BI-LSTM model
bi_lstm_model = keras.Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    Bidirectional(LSTM(64, return_sequences=True)),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
bi_lstm_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

# Train the model
history_bi_lstm = bi_lstm_model.fit(padded_sequences, resumeDataSet['Category'], epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_bi_lstm = history_bi_lstm.history['accuracy'][-1]
val_accuracy_bi_lstm = history_bi_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_bi_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_bi_lstm:.4f}")

# Evaluate the model on the test set
# Get the original resume text corresponding to the test set indices
test_resume_texts = resumeDataSet['Resume'].iloc[y_test.index].tolist()  

# Now use this list of texts for tokenization
test_sequences = tokenizer.texts_to_sequences(test_resume_texts)  
padded_test_sequences = pad_sequences(test_sequences, maxlen=200, padding='post', truncating='post')

loss_bi_lstm, test_accuracy_bi_lstm = bi_lstm_model.evaluate(padded_test_sequences, y_test)
print(f"Test Loss: {loss_bi_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_bi_lstm:.4f}")


# Make predictions
y_pred_bi_lstm = np.argmax(bi_lstm_model.predict(padded_test_sequences), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_bi_lstm))

Epoch 1/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 9s 167ms/step - accuracy: 0.1145 - loss: 3.1771 - val_accuracy: 0.0000e+00 - val_loss: 3.6304
Epoch 2/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 4s 127ms/step - accuracy: 0.1657 - loss: 2.8476 - val_accuracy: 0.0000e+00 - val_loss: 4.4476
Epoch 3/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 121ms/step - accuracy: 0.4423 - loss: 2.1083 - val_accuracy: 0.0000e+00 - val_loss: 5.7524
Epoch 4/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 121ms/step - accuracy: 0.7394 - loss: 1.1998 - val_accuracy: 0.0000e+00 - val_loss: 6.5063
Epoch 5/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 115ms/step - accuracy: 0.9042 - loss: 0.6566 - val_accuracy: 0.0515 - val_loss: 6.8324
Epoch 6/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 4s 145ms/step - accuracy: 0.9592 - loss: 0.4018 - val_accuracy: 0.2784 - val_loss: 6.7604
Epoch 7/10
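
A validation accuracy stuck near zero while training accuracy climbs is what one would expect if the validation rows contain categories the model rarely or never saw during training. Keras takes the `validation_split` fraction from the last samples of the arrays, before any shuffling, and the first rows of this dataset all share one category, which suggests the file may be grouped by category. Under that assumption (not verified here), the toy sketch below shows the effect and how shuffling before the split avoids it. The labels and the `tail_split` helper are made up for illustration:

```python
import random

def tail_split(labels, fraction):
    """Hold out the last `fraction` of the rows, mimicking how validation_split picks rows."""
    cut = len(labels) - int(len(labels) * fraction)
    return labels[:cut], labels[cut:]

# Toy labels grouped by class: 10 classes with 10 rows each (assumed ordering).
labels = [cls for cls in range(10) for _ in range(10)]

train, val = tail_split(labels, 0.1)
unseen = set(val) - set(train)
print("Validation classes never seen in training:", unseen)

# Shuffling first mixes the classes across both parts.
shuffled = labels[:]
random.Random(42).shuffle(shuffled)
train_s, val_s = tail_split(shuffled, 0.1)
unseen_after_shuffle = set(val_s) - set(train_s)
print("Unseen after shuffling:", unseen_after_shuffle)
```

In the notebook, the equivalent would be to shuffle `padded_sequences` and the labels together before calling `fit`, or to build the validation set with `train_test_split` and its `stratify` argument.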


### **BI-LSTM with SGD**

In [None]:
from tensorflow.keras.optimizers import SGD

# Define the BI-LSTM model
bi_lstm_model_sgd = keras.Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    Bidirectional(LSTM(64, return_sequences=True)),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model with SGD optimizer
sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)
bi_lstm_model_sgd.compile(optimizer=sgd_optimizer,
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])

# Train the model
history_bi_lstm_sgd = bi_lstm_model_sgd.fit(padded_sequences, resumeDataSet['Category'], epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_bi_lstm_sgd = history_bi_lstm_sgd.history['accuracy'][-1]
val_accuracy_bi_lstm_sgd = history_bi_lstm_sgd.history['val_accuracy'][-1]

print(f"Final Training Accuracy with SGD: {train_accuracy_bi_lstm_sgd:.4f}")
print(f"Final Validation Accuracy with SGD: {val_accuracy_bi_lstm_sgd:.4f}")

# Evaluate the model on the test set
loss_bi_lstm_sgd, test_accuracy_bi_lstm_sgd = bi_lstm_model_sgd.evaluate(padded_test_sequences, y_test)
print(f"Test Loss with SGD: {loss_bi_lstm_sgd:.4f}")
print(f"Test Accuracy with SGD: {test_accuracy_bi_lstm_sgd:.4f}")

# Make predictions
y_pred_bi_lstm_sgd = np.argmax(bi_lstm_model_sgd.predict(padded_test_sequences), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_bi_lstm_sgd))

Epoch 1/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 7s 134ms/step - accuracy: 0.0657 - loss: 3.2117 - val_accuracy: 0.0000e+00 - val_loss: 3.3698
Epoch 2/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 110ms/step - accuracy: 0.0936 - loss: 3.1778 - val_accuracy: 0.0000e+00 - val_loss: 3.5718
Epoch 3/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 108ms/step - accuracy: 0.1052 - loss: 3.1500 - val_accuracy: 0.0000e+00 - val_loss: 3.7622
Epoch 4/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 100ms/step - accuracy: 0.1086 - loss: 3.1333 - val_accuracy: 0.0000e+00 - val_loss: 3.9132
Epoch 5/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 106ms/step - accuracy: 0.1094 - loss: 3.1167 - val_accuracy: 0.0000e+00 - val_loss: 4.0698
Epoch 6/10
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 122ms/step - accuracy: 0.0926 - loss: 3.1298 - val_accuracy: 0.0000e+00 - val_loss: 4.2260
Epoc

In [None]:
import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Tokenize the dataset
# The checkpoint is DistilBERT, so the matching DistilBert classes are used
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

class TextDataset(Dataset):
    # Dunder methods need double underscores, otherwise PyTorch never calls them
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# X_train / X_test must hold raw resume strings here,
# not the padded integer sequences built for the LSTM cells
train_dataset = TextDataset(X_train, y_train, tokenizer)
test_dataset = TextDataset(X_test, y_test, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Define the model, optimizer, and loss function
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=25)
optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Train the model
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    model = model.train()
    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d['input_ids'].to(device)
        attention_mask = d['attention_mask'].to(device)
        labels = d['labels'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        _, preds = torch.max(outputs.logits, dim=1)
        loss = loss_fn(outputs.logits, labels)

        correct_predictions += torch.sum(preds == labels)
        losses.append(loss.item())

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    return correct_predictions.double() / n_examples, np.mean(losses)

def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()
    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            input_ids = d['input_ids'].to(device)
            attention_mask = d['attention_mask'].to(device)
            labels = d['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            _, preds = torch.max(outputs.logits, dim=1)
            loss = loss_fn(outputs.logits, labels)

            correct_predictions += torch.sum(preds == labels)
            losses.append(loss.item())

    return correct_predictions.double() / n_examples, np.mean(losses)

# Training loop
EPOCHS = 3

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch( model, train_loader, loss_fn, optimizer, device, None, len(train_dataset))

    print(f'Train loss {train_loss} accuracy {train_acc}')

    test_acc, test_loss = eval_model(model, test_loader, loss_fn, device, len(test_dataset))

    print(f'Test loss {test_loss} accuracy {test_acc}')

## **Model Deployment**

**Model Deployment:** The trained models, along with the TF-IDF vectorizer and label encoder, are serialized (pickled) for use in the Streamlit application. This step makes the models reusable without retraining each time, **and it is where Streamlit comes into play**. The app was built with Streamlit, a Python library for building interactive, shareable web applications. Streamlit handles the user interface and input/output, and makes the predictions of the different models accessible through a web browser.

In [126]:
import pickle

# Save the vectorizer, each model, and the label encoder; `with` closes each file
artifacts = {
    'tfidf.pkl': tfidf,
    'knn.pkl': knn_model,
    'svc.pkl': svc_model,
    'rf.pkl': rf_model,
    'mlp.pkl': model,
    'ensemble.pkl': ensemble_model,
    'encoder.pkl': le,
}
for filename, obj in artifacts.items():
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)
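
On the application side, the saved files would be read back with `pickle.load` before any prediction is made. A minimal round-trip sketch follows, using a plain dictionary as a stand-in for a fitted object and a temporary directory instead of the project folder. Both are assumptions for illustration, and `save_artifact` and `load_artifact` are hypothetical helper names:

```python
import os
import pickle
import tempfile

def save_artifact(obj, path):
    """Serialize `obj` to `path`; the with-block closes the file afterwards."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_artifact(path):
    """Read a pickled object back from `path`."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for a fitted object such as the label encoder (assumed for the demo).
fake_encoder = {'classes': ['Advocate', 'Data Science', 'HR']}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'encoder.pkl')
    save_artifact(fake_encoder, path)
    restored = load_artifact(path)

print(restored == fake_encoder)
```

Loading a pickle can execute arbitrary code, so the app should only unpickle files produced by this notebook.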

## **Testing**

**Prediction:** The app provides predictions from all the trained models, allowing for an extensive comparison of their results. The user receives a result from each model, including KNN, SVC, Random Forest, MLP, and the ensemble model.

In [127]:
# Function to predict the category of a resume and print results for each model
def pred(input_resume):
    # Preprocess the input text (e.g., cleaning, etc.)
    cleaned_text = cleanResume(input_resume)

    # Vectorize the cleaned text using the same TF-IDF vectorizer used during training
    vectorized_text = tfidf.transform([cleaned_text])

    # Convert sparse matrix to dense
    vectorized_text = vectorized_text.toarray()

    # Prediction
    predicted_category_knn = knn_model.predict(vectorized_text)
    predicted_category_svc = svc_model.predict(vectorized_text) 
    predicted_category_rf = rf_model.predict(vectorized_text)
    predicted_category_mlp = np.argmax(model.predict(vectorized_text), axis=1)
    predicted_category_ensemble = ensemble_model.predict(vectorized_text)

    # Get name of predicted category for each model
    category_knn = le.inverse_transform(predicted_category_knn)[0]
    category_svc = le.inverse_transform(predicted_category_svc)[0]
    category_rf = le.inverse_transform(predicted_category_rf)[0]
    category_mlp = le.inverse_transform(predicted_category_mlp)[0]
    category_ensemble = le.inverse_transform(predicted_category_ensemble)[0]

    # Print results for each model
    print(f"KNN Model Prediction: {category_knn}")
    print(f"SVC Model Prediction: {category_svc}")
    print(f"Random Forest Model Prediction: {category_rf}")
    print(f"MLP Model Prediction: {category_mlp}")
    print(f"Ensemble Model Prediction: {category_ensemble}")

    # Return the category name predicted by the first model (as an example)
    return category_knn
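
Because `pred` prints one label per model, a natural follow-up is to summarize how far the models agree. The sketch below tallies predicted labels with `collections.Counter`. The `summarize_predictions` helper and the sample labels are hypothetical and not part of the notebook:

```python
from collections import Counter

def summarize_predictions(predictions):
    """Return the most common label and the share of models that chose it.

    `predictions` maps a model name to its predicted category string.
    """
    counts = Counter(predictions.values())
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)

# Made-up predictions in which one model disagrees with the rest.
sample = {
    'KNN': 'HR',
    'SVC': 'HR',
    'Random Forest': 'HR',
    'MLP': 'Data Science',
    'Ensemble': 'HR',
}
label, agreement = summarize_predictions(sample)
print(f"Majority label: {label} ({agreement:.0%} of models agree)")
```

A low agreement share would flag resumes that may deserve a manual look.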

In [128]:
myresume = """Name: Sarah Johnson
Contact Information:

Phone: +1 555-123-4567
Email: sarah.johnson@email.com
LinkedIn: linkedin.com/in/sarahjohnson
Address: New York, NY
Professional Summary
A results-driven HR professional with over 6 years of experience in talent acquisition, employee engagement, and HR operations. Skilled in building strong teams, fostering positive work environments, and implementing HR strategies that align with organizational goals. Proficient in HR software and data-driven decision-making to improve workforce management.

Key Skills
Talent Acquisition and Recruitment
Employee Onboarding and Training
Performance Management
HR Policies and Compliance
Employee Relations and Engagement
Compensation and Benefits Administration
HR Analytics and Reporting
HR Software: SAP, Workday, BambooHR
Strong Interpersonal and Communication Skills
Professional Experience
HR Manager
BrightPath Solutions | January 2020 – Present

Led end-to-end recruitment processes, successfully hiring over 50 candidates annually across various roles.
Designed and implemented onboarding programs, reducing new hire turnover by 20%.
Developed performance management frameworks, increasing employee productivity by 15%.
Conducted employee satisfaction surveys and implemented strategies to enhance engagement.
Ensured compliance with labor laws and company policies, minimizing legal risks.
HR Generalist
Global Reach Inc. | June 2016 – December 2019

Supported HR operations, including recruitment, payroll, and benefits administration.
Assisted in developing HR policies and communicated updates to employees.
Resolved employee grievances, fostering a collaborative workplace.
Analyzed HR metrics to identify trends and presented actionable insights to management.
HR Coordinator
TalentFirst Consulting | March 2014 – May 2016

Scheduled interviews and coordinated recruitment activities.
Maintained employee records and ensured accuracy in HR databases.
Assisted in planning company events and training sessions.
Education
Bachelor’s Degree in Human Resource Management
University of California, Berkeley | 2013

Certifications

Certified Professional in Human Resources (PHR)
SHRM Certified Professional (SHRM-CP)
Advanced HR Analytics (Coursera)
Achievements
Reduced time-to-hire by 30% through process optimization.
Increased employee retention by 25% by implementing a mentorship program.
Spearheaded diversity and inclusion initiatives, leading to a 40% increase in diverse hires.
Languages
English (Fluent)
Spanish (Intermediate)
"""

pred(myresume)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
KNN Model Prediction: HR
SVC Model Prediction: HR
Random Forest Model Prediction: HR
MLP Model Prediction: HR
Ensemble Model Prediction: HR


'HR'

In [129]:
myresume = """I am a Business Analyst specializing in developing dashboards, 
reports, and data models to drive performance insights. 
Proficient in Python, R, SQL, Excel, and Power BI, I excel in data 
analysis, advanced analytics, and automation of data processes. 
Skilled in statistical analysis and data visualization, I derive 
insights for data-driven decisions. Experienced in designing and 
optimizing data warehouse solutions, managing ETL processes, 
and ensuring data integrity and security. Additionally, I hold a 
CCNA certification from Cisco, showcasing my knowledge in 
networking.
"""

pred(myresume)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
KNN Model Prediction: Data Science
SVC Model Prediction: Data Science
Random Forest Model Prediction: Data Science
MLP Model Prediction: Data Science
Ensemble Model Prediction: Data Science


'Data Science'