# ***IRM : Intelligent Recruitment Model***

## **Problem Statement**

***Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to screen qualified candidates.***

***Hiring the right talent is a challenge for all businesses. This challenge is magnified by the high volume of applicants if the business is labor-intensive, growing, and facing high attrition rates.***

***IT departments in growing markets are short of talent. In a typical service organization, professionals with a variety of technical skills and business domain expertise are hired and assigned to projects to resolve customer issues. The task of selecting the best talent from among many applicants is known as Resume Screening.***

***Typically, large companies do not have enough time to open each CV, so they use machine learning algorithms for the Resume Screening task.***

## **Libraries**

In [5]:
import pandas as pd
import plotly.express as px
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import string
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
df = pd.read_csv('data\\UpdatedResumeDataSet.csv')

df.head()

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


In [7]:
df.shape

(962, 2)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  962 non-null    object
 1   Resume    962 non-null    object
dtypes: object(2)
memory usage: 15.2+ KB


In [9]:
print(df.isnull().sum())
df.drop_duplicates(subset="Resume", keep='first', inplace=True)

Category    0
Resume      0
dtype: int64


## **Visualization : Category (Jobs)**

In [10]:


fig = px.histogram(df, x="Category", title="Distribution of Job Categories", category_orders={"Category": df["Category"].value_counts().index})

fig

In [11]:
# Count resumes per category and turn the counts into a two-column table
df_result = (
    df['Category']
    .value_counts()
    .rename_axis('Category')
    .reset_index(name='Total')
)

print(df_result)

                     Category  Total
0              Java Developer     13
1                    Database     11
2                Data Science     10
3                    Advocate     10
4                          HR     10
5            DotNet Developer      7
6                      Hadoop      7
7             DevOps Engineer      7
8          Automation Testing      7
9                     Testing      7
10             Civil Engineer      6
11           Business Analyst      6
12              SAP Developer      6
13         Health and fitness      6
14           Python Developer      6
15                       Arts      6
16     Electrical Engineering      5
17                      Sales      5
18  Network Security Engineer      5
19        Mechanical Engineer      5
20              ETL Developer      5
21                 Blockchain      5
22         Operations Manager      4
23              Web Designing      4
24                        PMO      3


In [12]:
fig = px.pie(df_result,
                 values='Total',
                 names='Category')

fig

## **Data Preprocessing**




In [13]:

def cleanResume(resumeText):
    # Raw strings keep regex escapes such as \S from being read as string escapes
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    resumeText = re.sub(r'RT|cc', ' ', resumeText)  # remove RT and cc (also matches inside words such as "success")
    resumeText = re.sub(r'#\S+', '', resumeText)  # remove hashtags
    resumeText = re.sub(r'@\S+', '  ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape(string.punctuation), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # remove non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse extra whitespace
    return resumeText

df['cleaned_resume'] = df['Resume'].apply(cleanResume)
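
One caveat of the pattern `RT|cc` is that it also matches inside ordinary words, so "Successful" becomes "Su essful". Below is a minimal sketch of a stricter variant, assuming that only standalone `RT` and `cc` tokens should be dropped. The name `clean_resume_strict` and the sample sentence are hypothetical and are not used elsewhere in this notebook.

```python
import re
import string

def clean_resume_strict(text):
    """Variant of cleanResume that strips 'RT' and 'cc' only as whole words."""
    text = re.sub(r'http\S+\s*', ' ', text)        # remove URLs
    text = re.sub(r'\b(?:RT|cc)\b', ' ', text)     # remove RT/cc only as whole words
    text = re.sub(r'#\S+', ' ', text)              # remove hashtags
    text = re.sub(r'@\S+', ' ', text)              # remove mentions
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # remove punctuation
    text = re.sub(r'[^\x00-\x7f]', ' ', text)      # remove non-ASCII characters
    return re.sub(r'\s+', ' ', text).strip()       # collapse whitespace

# What the original pattern does to ordinary words
loose = re.sub('RT|cc', ' ', "Successful accounts")

# Hypothetical sample sentence, for illustration only
sample = "Successful accounts manager, cc: RT http://example.com"
strict = clean_resume_strict(sample)

print(loose)   # Su essful a ounts
print(strict)  # Successful accounts manager
```

Switching the notebook to this variant would change the TF-IDF vocabulary, so the feature counts and accuracies reported later would no longer match.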

In [14]:
df.head()

Unnamed: 0,Category,Resume,cleaned_resume
0,Data Science,Skills * Programming Languages: Python (pandas...,Skills Programming Languages Python pandas num...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,Education Details May 2013 to May 2017 B E UIT...
2,Data Science,"Areas of Interest Deep Learning, Control Syste...",Areas of Interest Deep Learning Control System...
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skills R Python SAP HANA Tableau SAP HANA SQL ...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",Education Details MCA YMCAUST Faridabad Haryan...


## **Checking the most common words from resume**

In [15]:
oneSetOfStopWords = set(stopwords.words('english')+['``',"''"])
totalWords =[]
Sentences = df['Resume'].values
cleanedSentences = ""
for records in Sentences:
    cleanedText = cleanResume(records)
    cleanedSentences += cleanedText
    requiredWords = nltk.word_tokenize(cleanedText)
    for word in requiredWords:
        if word not in oneSetOfStopWords and word not in string.punctuation:
            totalWords.append(word)
    
wordfreqdist = nltk.FreqDist(totalWords)
mostcommon = wordfreqdist.most_common(50)
print(mostcommon)

[('Experience', 665), ('company', 520), ('months', 515), ('Details', 510), ('description', 458), ('1', 348), ('Project', 299), ('data', 242), ('project', 231), ('6', 227), ('Maharashtra', 217), ('year', 215), ('SQL', 215), ('team', 207), ('Less', 199), ('using', 197), ('January', 189), ('Skill', 175), ('Management', 167), ('Ltd', 159), ('Pune', 158), ('C', 151), ('Education', 144), ('management', 143), ('Data', 140), ('Developer', 137), ('Engineering', 134), ('database', 133), ('Java', 130), ('Database', 127), ('monthsCompany', 125), ('System', 123), ('University', 123), ('Server', 123), ('Pvt', 122), ('India', 120), ('like', 118), ('The', 117), ('Responsibilities', 117), ('various', 116), ('A', 113), ('business', 113), ('2', 113), ('development', 112), ('reports', 111), ('application', 110), ('issues', 106), ('system', 106), ('Mumbai', 106), ('Test', 105)]
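
The list above counts `Project` and `project`, or `Data` and `data`, as different words because the tokens keep their original case. The following is a small sketch of case-insensitive counting with `collections.Counter`, using a hypothetical token list rather than the resume data:

```python
from collections import Counter

# Hypothetical tokens standing in for the cleaned resume words
tokens = ["Project", "project", "Data", "data", "data", "SQL"]

case_sensitive = Counter(tokens)
case_insensitive = Counter(token.lower() for token in tokens)

print(case_sensitive["data"])            # 2
print(case_insensitive.most_common(2))   # [('data', 3), ('project', 2)]
```

Whether to lowercase depends on the goal: it merges duplicates, but it also hides acronyms such as `SQL`.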


## **Label Encoding**

In [16]:
from sklearn.preprocessing import LabelEncoder

# Encode the job category names as integer labels
le = LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])

In [17]:
df.Category.value_counts()

Category
15    13
7     11
6     10
0     10
12    10
9      7
13     7
8      7
2      7
23     7
5      6
4      6
21     6
14     6
20     6
1      6
11     5
22     5
17     5
16     5
10     5
3      5
18     4
24     4
19     3
Name: count, dtype: int64

## **Model Training (KNeighborsClassifier)**



In [18]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

requiredText = df['cleaned_resume'].values
requiredTarget = df['Category'].values

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    stop_words='english')
word_vectorizer.fit(requiredText)
WordFeatures = word_vectorizer.transform(requiredText)

print ("Feature completed .....")

Feature completed .....


In [19]:
X_train,X_test,y_train,y_test = train_test_split(WordFeatures,requiredTarget,random_state=1, test_size=0.2,shuffle=True, stratify=requiredTarget)
print(X_train.shape)
print(X_test.shape)

(132, 7350)
(34, 7350)


In [20]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

In [21]:
clf = OneVsRestClassifier(KNeighborsClassifier())
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
print('Accuracy of KNeighbors Classifier on training: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of KNeighbors Classifier on test:     {:.2f}'.format(clf.score(X_test, y_test)))

Accuracy of KNeighbors Classifier on training: 0.91
Accuracy of KNeighbors Classifier on test:     0.82
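
In the cells above, `word_vectorizer` is fitted on every resume before the train/test split, so the vocabulary and IDF weights are influenced by the test resumes. A minimal sketch of one way to avoid this is shown below. It assumes scikit-learn's `make_pipeline` and a small hypothetical corpus in place of the resume data, and it leaves out the `OneVsRestClassifier` wrapper for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical, for illustration only)
texts = [
    "python pandas machine learning", "deep learning python statistics",
    "python data analysis numpy", "machine learning models python",
    "java spring hibernate", "java servlets jsp maven",
    "java microservices spring boot", "core java collections jdbc",
]
labels = ["Data Science"] * 4 + ["Java Developer"] * 4

# Split the raw text first, so the vectorizer never sees the test documents
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

# fit() learns the TF-IDF vocabulary and the classifier from the training text only
pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words='english'),
    KNeighborsClassifier(n_neighbors=3),
)
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
print(list(zip(X_test, predictions)))
```

A fitted pipeline also accepts raw text directly, so new resumes only need to be cleaned before calling `predict` or `predict_proba`.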


In [22]:
le.classes_

array(['Advocate', 'Arts', 'Automation Testing', 'Blockchain',
       'Business Analyst', 'Civil Engineer', 'Data Science', 'Database',
       'DevOps Engineer', 'DotNet Developer', 'ETL Developer',
       'Electrical Engineering', 'HR', 'Hadoop', 'Health and fitness',
       'Java Developer', 'Mechanical Engineer',
       'Network Security Engineer', 'Operations Manager', 'PMO',
       'Python Developer', 'SAP Developer', 'Sales', 'Testing',
       'Web Designing'], dtype=object)

# Resume Upload

In [23]:
# Import only the names this upload app uses instead of a wildcard import
from flask import Flask, render_template, request
import os

In [24]:

app = Flask(__name__)

UPLOAD_FOLDER = 'Resumes'
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
os.makedirs(UPLOAD_FOLDER, exist_ok=True)  # make sure the upload folder exists

ALLOWED_EXTENSIONS = {'docx'}

def allowed_file(filename):
    # Accept only file names whose extension is in ALLOWED_EXTENSIONS
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/upload', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        if 'resume' not in request.files:
            return 'No file part'

        file = request.files['resume']
        name = request.form['name'].strip()

        if file.filename == '':
            return 'No selected file'

        # The name becomes the file name, so reject anything that could leave the upload folder
        if not name or name in ('.', '..') or os.path.basename(name) != name:
            return render_template('submitted.html', mes='Invalid candidate name.', type='error')

        if file and allowed_file(file.filename):
            # Resumes are stored as "<candidate name>.docx"
            file.save(os.path.join(app.config['UPLOAD_FOLDER'], name + '.docx'))
            return render_template('submitted.html', mes='File uploaded successfully!', type='success')
        else:
            return render_template('submitted.html', mes='Invalid file type. Please upload a .docx file.', type='error')

    else:
        return render_template('upload.html')
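
The extension check in `allowed_file` is case-insensitive and looks only at the text after the last dot. Here is a quick sketch of its behaviour on a few hypothetical file names:

```python
ALLOWED_EXTENSIONS = {'docx'}

def allowed_file(filename):
    # Same check as in the upload cell above
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

# Hypothetical file names, for illustration only
examples = ["resume.docx", "Resume.DOCX", "resume.pdf", "docx", "archive.docx.zip"]
results = {filename: allowed_file(filename) for filename in examples}

for filename, accepted in results.items():
    print(f"{filename}: {accepted}")
```

The check inspects the name only, not the file contents, so a renamed file of another type would still pass.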

In [26]:
if __name__ == '__main__':
    app.run(debug=False)

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [13/May/2024 10:39:19] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:39:20] "GET /static/index.css HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:39:20] "GET /static/logobg.png HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:39:20] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [13/May/2024 10:39:24] "GET /upload HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:39:24] "GET /static/upload.css HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:39:24] "GET /static/logobg.png HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:39:35] "POST /upload HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:39:35] "GET /static/submit.css HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:39:35] "GET /static/tick.png HTTP/1.1" 304 -


## **Resume Screening**

In [None]:
!pip install python-docx
!pip install tabulate



In [None]:
import pandas as pd
import glob
import os
import docx
import re
from datetime import datetime

def read_docx(file_path):
    doc = docx.Document(file_path)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

def extract_overall_experience_years(resume_text):
    # Optional month name followed by a four-digit year, e.g. "Jan 2019" or "2019"
    month = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
    # Date range such as "Jan 2019 - Mar 2021" or "2019 - Present"
    date_pattern = rf'({month}?\s*\d{{4}})\s*-\s*({month}?\s*(?:\d{{4}}|Present))'
    matches = re.findall(date_pattern, resume_text, re.IGNORECASE)

    if not matches:
        return "Null"

    current_year = datetime.now().year
    min_start_year = current_year
    max_end_year = 0

    for start, end in matches:
        start_year = int(re.search(r'\d{4}', start).group())
        # The pattern is case-insensitive, so "present" must be detected case-insensitively too
        end_year = current_year if "present" in end.lower() else int(re.search(r'\d{4}', end).group())

        min_start_year = min(min_start_year, start_year)
        max_end_year = max(max_end_year, end_year)

    # Span from the earliest start year to the latest end year
    return max_end_year - min_start_year

def process_resumes(resume_paths, model, vectorizer, le, threshold=0.1):
    rows = []
    # Category names in the same order as the columns of predict_proba
    categories = le.inverse_transform(model.classes_)

    for resume_path in resume_paths:
        resume_text = read_docx(resume_path)
        # The file name (without extension) is the candidate's name
        candidate_name = os.path.splitext(os.path.basename(resume_path))[0]

        resume_vec = vectorizer.transform([cleanResume(resume_text)])
        pred_proba = model.predict_proba(resume_vec)[0]
        overall_experience = extract_overall_experience_years(resume_text)

        # Keep every category whose probability reaches the threshold, best match first
        ranked = sorted(zip(categories, pred_proba), key=lambda pair: pair[1], reverse=True)
        for category, proba in ranked:
            if proba >= threshold:
                rows.append({
                    'Candidate Name': candidate_name,
                    'Job Category': category,
                    'Match Score': proba * 100,  # percentage
                    'Experience': overall_experience,
                })

    # Build the table once instead of concatenating one row at a time
    return pd.DataFrame(rows, columns=['Candidate Name', 'Job Category', 'Match Score', 'Experience'])


resume_paths = glob.glob('./Resumes/*.docx')
result_table = process_resumes(resume_paths, clf, word_vectorizer, le, threshold=0.1)
csv_file = "data\\resume_classification_results.csv"

result_table.to_csv(csv_file, index=False)

print(result_table)
print(f"Results saved to {csv_file}.")
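
`extract_overall_experience_years` reports the span between the earliest start year and the latest end year it finds, not the sum of the individual jobs. The sketch below is a simplified version of that calculation. It assumes year-only ranges and a fixed `current_year` argument so that the result is reproducible; both are simplifications, and `experience_span` is a hypothetical helper.

```python
import re

def experience_span(text, current_year):
    """Years between the earliest start and the latest end of 'YYYY - YYYY|Present' ranges."""
    ranges = re.findall(r'(\d{4})\s*-\s*(\d{4}|present)', text, re.IGNORECASE)
    if not ranges:
        return None
    starts = [int(start) for start, _ in ranges]
    ends = [current_year if end.lower() == 'present' else int(end) for _, end in ranges]
    return max(ends) - min(starts)

# Hypothetical resume fragment with a one-year gap between the two jobs
sample = "Developer 2015 - 2018, Team Lead 2019 - Present"
span = experience_span(sample, current_year=2024)
print(span)  # 9
```

The gap between 2018 and 2019 is included in the result, so the figure is best read as "years since the first listed job" rather than years actually worked.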




In [None]:
from tabulate import tabulate

def print_results_by_category(df, le):
    print("\nAvailable job categories:")
    for category in le.classes_:
        print("-", category)
    print("Enter 'exit' to stop.")
    
    while True:
        target_category = input("\nEnter the job category you are interested in or 'exit' to stop: ")
        
        if target_category.lower() == 'exit':
            break
        
        if target_category not in le.classes_:
            print("Invalid job category. Please try again.")
            continue

        category_df = df[df['Job Category'] == target_category]
        
        headers = ['Candidate Name', 'Match Score', 'Experience (Years)']

        data = category_df[['Candidate Name', 'Match Score', 'Experience']].values.tolist()
        
        table_width = sum(len(str(header)) for header in headers) + len(headers) * 3 + 10 
        
        print("\n" + target_category.center(table_width) + "\n")

        if not data:
            print(f"No candidates found for {target_category}.")
        else:
            print(tabulate(data, headers=headers, tablefmt="pretty"))


print_results_by_category(result_table, le)



Available job categories:
- Advocate
- Arts
- Automation Testing
- Blockchain
- Business Analyst
- Civil Engineer
- Data Science
- Database
- DevOps Engineer
- DotNet Developer
- ETL Developer
- Electrical Engineering
- HR
- Hadoop
- Health and fitness
- Java Developer
- Mechanical Engineer
- Network Security Engineer
- Operations Manager
- PMO
- Python Developer
- SAP Developer
- Sales
- Testing
- Web Designing
Enter 'exit' to stop.


# Candidate Assessment Test (CAT)

In [27]:
# Import only the names the assessment app uses instead of a wildcard import
from flask import Flask, render_template, request
from flask_sqlalchemy import SQLAlchemy
import pandas as pd
import os

In [28]:
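app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///candidates.sqlite"
# Read the secret key from the environment instead of hard-coding it in the notebook.
# The variable name SECRET_KEY is an assumption; a random key is used when it is not set.
app.config["SECRET_KEY"] = os.environ.get("SECRET_KEY") or os.urandom(24).hex()
db = SQLAlchemy()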

In [29]:
# Define the Candidate database through class
class Candidate(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(100), nullable=False)
    email = db.Column(db.String(100), nullable=False)
    score = db.Column(db.Integer, nullable=False)

In [30]:
# Re-export all candidates to CSV whenever a Candidate row is inserted, updated, or deleted
@db.event.listens_for(Candidate, 'after_insert')
@db.event.listens_for(Candidate, 'after_update')
@db.event.listens_for(Candidate, 'after_delete')
def after_candidate_changes(mapper, connection, candidate):
    candidates = Candidate.query.all()
    # Convert the raw score to a percentage; the divisor assumes a 20-question test
    export_df = pd.DataFrame([(c.id, c.name, c.email, (c.score/20)*100) for c in candidates], columns=['id', 'name', 'email', 'score'])
    export_df.to_csv('data\\candidates.csv', index=False)

In [31]:
db.init_app(app)

with app.app_context():
    db.create_all()

In [32]:
questions_df = pd.read_csv("data\\Questions.csv")
questions = questions_df.to_dict('records')  

In [33]:

@app.route('/')
def index():
    return render_template('index1.html')

@app.route('/quiz')
def quiz():
    return render_template('quiz.html', questions=questions)

@app.route('/submit', methods=['POST'])
def submit():
    name = request.form['name']
    email = request.form['email']
    score = calculate_score(request.form)

    # Check if a candidate with the same email already exists
    candidate = Candidate.query.filter_by(email=email).first()

    if candidate:
        candidate.score = score
    else:
        candidate = Candidate(name=name, email=email, score=score)
        db.session.add(candidate)

    db.session.commit()

    return render_template('submit.html')

def calculate_score(form_data):
    score = 0
    for question in questions:
        answer_key = f"answer_{question['Question ID']}"
        if form_data.get(answer_key) == str(question['Correct Answer']):
            score += 1
    return score
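
To make the scoring rule concrete, here is a small worked example. The questions and the submitted form are hypothetical; only the column names `Question ID` and `Correct Answer` and the `answer_<id>` field naming come from the code above.

```python
# Hypothetical questions in the same shape as the rows read from Questions.csv
questions = [
    {"Question ID": 1, "Correct Answer": "B"},
    {"Question ID": 2, "Correct Answer": 4},
    {"Question ID": 3, "Correct Answer": "True"},
]

def calculate_score(form_data):
    score = 0
    for question in questions:
        answer_key = f"answer_{question['Question ID']}"
        # Form values arrive as strings, so the stored answer is converted before comparing
        if form_data.get(answer_key) == str(question['Correct Answer']):
            score += 1
    return score

# Hypothetical submission: questions 1 and 2 are correct, question 3 is unanswered
form_data = {"name": "Jane Doe", "answer_1": "B", "answer_2": "4"}
score = calculate_score(form_data)
print(score)  # 2
```

An unanswered question simply scores zero, because `form_data.get` returns `None` for a missing field.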

In [34]:
@app.route('/leaderboard')
def leaderboard():
    candidates = Candidate.query.order_by(Candidate.score.desc()).all()
    ranked_candidates = generate_ranks(candidates)
    return render_template('leaderboard.html', candidates=ranked_candidates)

def generate_ranks(candidates):
    ranked_candidates = []
    rank = 1
    prev_score = None

    sorted_candidates = sorted(candidates, key=lambda c: (-c.score, c.name))

    for candidate in sorted_candidates:
        if candidate.score != prev_score:
            rank = len(ranked_candidates) + 1
            prev_score = candidate.score
        ranked_candidates.append({'rank': rank, 'candidate': candidate})

    return ranked_candidates
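
`generate_ranks` implements standard competition ranking: candidates with equal scores share a rank, and the next rank skips ahead by the number of tied candidates. The sketch below runs the same function on hypothetical data, with a `namedtuple` standing in for the database model.

```python
from collections import namedtuple

# Stand-in for the Candidate database model (only the fields used for ranking)
Candidate = namedtuple("Candidate", ["name", "score"])

def generate_ranks(candidates):
    ranked_candidates = []
    rank = 1
    prev_score = None

    # Highest score first; ties are ordered alphabetically by name
    sorted_candidates = sorted(candidates, key=lambda c: (-c.score, c.name))

    for candidate in sorted_candidates:
        if candidate.score != prev_score:
            # A new score starts at position + 1, which skips ranks after a tie
            rank = len(ranked_candidates) + 1
            prev_score = candidate.score
        ranked_candidates.append({'rank': rank, 'candidate': candidate})

    return ranked_candidates

# Hypothetical raw scores, for illustration only
candidates = [Candidate("Tim", 12), Candidate("Alice", 12), Candidate("Harry", 15), Candidate("Jane", 8)]
ranks = [(entry['rank'], entry['candidate'].name) for entry in generate_ranks(candidates)]
print(ranks)  # [(1, 'Harry'), (2, 'Alice'), (2, 'Tim'), (4, 'Jane')]
```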

In [35]:
if __name__ == '__main__':
    app.run(debug=False)

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [13/May/2024 10:41:58] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:41:58] "GET /static/index.css HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:41:58] "GET /static/logobg.png HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:41:58] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [13/May/2024 10:42:02] "GET /quiz HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:42:02] "GET /static/styles.css HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:42:02] "GET /static/logobg.png HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:42:08] "GET /leaderboard HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:42:08] "GET /static/leaderboard.css HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:42:08] "GET /static/logobg.png HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:42:57] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [13/May/2024 10:42:57] "GET /static/index.css HTTP/1.1" 304 -
127.0.0.1 - - [13/May/2024 10:42:57] "GET /static/logobg.png HTTP/1.1" 304 

## Total Scores of Top Candidates

In [None]:
import pandas as pd
from tabulate import tabulate

candidates_df = pd.read_csv("data\\candidates.csv")
resume_scores_df = pd.read_csv("data\\resume_classification_results.csv")

# Join the CAT results with the resume predictions on the candidate's name
merged_df = pd.merge(candidates_df, resume_scores_df, left_on='name', right_on='Candidate Name', how='inner')

merged_df = merged_df.rename(columns={'Match Score': 'Resume Match Score', 'score': 'CAT Score'})

# Both scores are percentages, so their mean is the total score out of 100
merged_df['Total Score'] = (merged_df['CAT Score'] + merged_df['Resume Match Score']) / 2

# Keep only the best-scoring job category for each candidate
merged_df = merged_df.sort_values(by=['Candidate Name', 'Total Score'], ascending=[True, False])
unique_candidates_df = merged_df.drop_duplicates(subset=['Candidate Name'], keep='first')

# Rank candidates by total score, highest first
ranked_df = unique_candidates_df.sort_values(by='Total Score', ascending=False)

final_df = ranked_df[['Candidate Name', 'Job Category', 'Resume Match Score', 'CAT Score', 'Total Score']].reset_index(drop=True)
final_df.index += 1

print("Total scores of candidates, in percent (%).")
# headers="keys" tells tabulate to use the DataFrame's column names as headers
print(tabulate(final_df, headers="keys", tablefmt="fancy_outline"))


Total scores of candidates, in percent (%).
╒════╤══════════════════╤═════════════════════╤══════════════════════╤═════════════╤═══════════════╕
│    │ Candidate Name   │ Job Category        │   Resume Match Score │   CAT Score │   Total Score │
╞════╪══════════════════╪═════════════════════╪══════════════════════╪═════════════╪═══════════════╡
│  1 │ Harry            │ Mechanical Engineer │                   80 │          50 │            65 │
│  2 │ Jane Doe         │ Data Science        │                  100 │          20 │            60 │
│  3 │ Alice Johnson    │ Advocate            │                  100 │           0 │            50 │
│  4 │ Tim David        │ Data Science        │                   80 │          20 │            50 │
╘════╧══════════════════╧═════════════════════╧══════════════════════╧═════════════╧═══════════════╛
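
The total score is the equal-weight mean of the two percentages. `CAT Score` is already a percentage at this point because the CSV export scales the raw score by 100/20. A short sketch that reproduces the first two rows of the table above; the helper `total_score` is a hypothetical name:

```python
# Scores taken from the first two rows of the table above (both in percent)
scores = {
    "Harry": {"resume_match": 80, "cat": 50},
    "Jane Doe": {"resume_match": 100, "cat": 20},
}

def total_score(resume_match, cat):
    # Equal-weight mean of the resume match score and the CAT score
    return (resume_match + cat) / 2

totals = {name: total_score(**parts) for name, parts in scores.items()}
print(totals)  # {'Harry': 65.0, 'Jane Doe': 60.0}
```

Giving the two components different weights would only require replacing the mean with a weighted sum.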
