# ***Resume Screening : The HR solution***

## **Essence of Resume Screening :**

* **It is the process of determining whether a candidate is qualified for a role based on his or her education, experience, and other information captured on their resume.**

* **It is a crucial yet challenging part of the hiring process.**

* **On average, a recruiter spends 23 hours screening resumes for a single hire.**

* **Even with automated processes, it is still the most time-consuming part of recruiting.**

* ***Save your time. Having a defined process will:***

  * ***Make the process more efficient and more accurate.***
  * ***Find unqualified applicants quickly.***
  * ***Result in a shortlist of candidates to interview who are aligned with the qualifications you’re looking for.***

  ***This process will help you achieve your goal: to hire the most qualified and best-fitting applicant for the position.***

## **Problem Statement : -**

***Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to screen qualified candidates.***

***Hiring the right talent is a challenge for all businesses. This challenge is magnified by the high volume of applicants if the business is labor-intensive, growing, and facing high attrition rates.***

***IT departments in growing markets are often short of talent. In a typical service organization, professionals with a variety of technical skills and business domain expertise are hired and assigned to projects to resolve customer issues. The task of selecting the best talent among many applicants is known as Resume Screening.***

***Typically, large companies do not have enough time to open each CV, so they use machine learning algorithms for the Resume Screening task.***

## **Import Libraries**

In [62]:
import pandas as pd
import plotly.express as px
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import string
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Read the table**

In [63]:
df = pd.read_csv('data\\UpdatedResumeDataSet.csv')

df.head()

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


## **Checking the length of the table**

In [64]:
df.shape

(962, 2)

***The data contains 962 observations. Each observation holds the complete details of one candidate, so we have 962 resumes for screening.***

## **Checking the info for the columns**

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  962 non-null    object
 1   Resume    962 non-null    object
dtypes: object(2)
memory usage: 15.2+ KB


## **Check for missing values and drop duplicate resumes**

In [66]:
print(df.isnull().sum())
df.drop_duplicates(subset="Resume", keep='first', inplace=True)

Category    0
Resume      0
dtype: int64
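Note that the cell above also drops duplicate resumes in place, so later cells work with fewer rows than the 962 reported by `df.shape` (the train/test split further down shows 132 + 34 = 166 rows). A minimal sketch of the same call on a made-up frame; the `toy` data is purely illustrative and not part of the dataset:

```python
import pandas as pd

# Hypothetical toy data: two of the three resumes are identical.
toy = pd.DataFrame({
    "Category": ["HR", "HR", "Sales"],
    "Resume": ["hr generalist", "hr generalist", "sales executive"],
})

rows_before = len(toy)
# Same arguments as in the notebook, but returning a new frame instead of mutating in place.
deduped = toy.drop_duplicates(subset="Resume", keep="first")
rows_after = len(deduped)

print(f"rows before: {rows_before}, rows after: {rows_after}")
```

Comparing the row count before and after this call on the real `df` is a quick way to see how much of the dataset is duplicated.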


## **Visualization : Category (Jobs)**

In [67]:


fig = px.histogram(df, x="Category", title="Distribution of Job Categories", category_orders={"Category": df["Category"].value_counts().index})

fig

**Jobs distribution: the bar chart above shows 25 different job categories in the data.**

**After duplicate resumes are dropped, the largest categories are Java Developer (13 resumes) and Database (11), followed by Data Science, Advocate, and HR (10 each), as the counts below show.**

In [68]:
series = df['Category'].value_counts()

df_result = pd.DataFrame(series)

df_result = df_result.reset_index()  

df_result.columns = ['Category', 'Total']

print(df_result)

                     Category  Total
0              Java Developer     13
1                    Database     11
2                Data Science     10
3                    Advocate     10
4                          HR     10
5            DotNet Developer      7
6                      Hadoop      7
7             DevOps Engineer      7
8          Automation Testing      7
9                     Testing      7
10             Civil Engineer      6
11           Business Analyst      6
12              SAP Developer      6
13         Health and fitness      6
14           Python Developer      6
15                       Arts      6
16     Electrical Engineering      5
17                      Sales      5
18  Network Security Engineer      5
19        Mechanical Engineer      5
20              ETL Developer      5
21                 Blockchain      5
22         Operations Manager      4
23              Web Designing      4
24                        PMO      3


In [69]:
df_result_part = df_result.head(21)
print(df_result_part)

                     Category  Total
0              Java Developer     13
1                    Database     11
2                Data Science     10
3                    Advocate     10
4                          HR     10
5            DotNet Developer      7
6                      Hadoop      7
7             DevOps Engineer      7
8          Automation Testing      7
9                     Testing      7
10             Civil Engineer      6
11           Business Analyst      6
12              SAP Developer      6
13         Health and fitness      6
14           Python Developer      6
15                       Arts      6
16     Electrical Engineering      5
17                      Sales      5
18  Network Security Engineer      5
19        Mechanical Engineer      5
20              ETL Developer      5


In [70]:
fig = px.pie(df_result_part,
                 values='Total',
                 names='Category')

fig

***Instead of counts, the pie chart above shows the distribution of the 21 largest job categories as percentages.***
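The same percentages can also be computed directly rather than read off the chart. A small sketch, assuming a hypothetical `categories` series standing in for `df['Category']`:

```python
import pandas as pd

# Hypothetical category labels standing in for df['Category'].
categories = pd.Series(["Java Developer"] * 3 + ["Database"] * 1, name="Category")

# normalize=True returns each category's share of all rows instead of a raw count.
share = categories.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Unlike the pie chart, which is built from the first 21 rows of `df_result`, this computes shares over every category in the series.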

## **Data Preprocessing**


### **We remove unnecessary information from the resumes, such as URLs, hashtags, and special characters, and store the result in a new ‘cleaned_resume’ column.**

In [71]:

def cleanResume(resumeText):
    """Strip URLs, hashtags, mentions, punctuation, and non-ASCII characters from a resume."""
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    # remove "RT" and "cc"; note that this pattern also matches inside words (e.g. "account")
    resumeText = re.sub(r'RT|cc', ' ', resumeText)
    resumeText = re.sub(r'#\S+', '', resumeText)  # remove hashtags
    resumeText = re.sub(r'@\S+', '  ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape(string.punctuation), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # remove non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse runs of whitespace
    return resumeText

df['cleaned_resume'] = df['Resume'].apply(cleanResume)
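
One side effect of the `RT|cc` pattern is that it also removes those letters from the middle of ordinary words. The sketch below contrasts it with a word-boundary variant; the `strict` pattern and the sample sentence are assumptions for illustration, not part of the original notebook, and switching to it would change the features and scores reported later.

```python
import re

# The notebook's pattern: matches "RT" or "cc" anywhere, including inside words.
loose = re.compile(r'RT|cc')
# Assumed alternative (not from the notebook): match only whole tokens.
strict = re.compile(r'\b(?:RT|cc)\b')

sample = "cc the accounts team about the SMART goals"

loose_result = loose.sub(' ', sample)
strict_result = strict.sub(' ', sample)

print(loose_result)   # "accounts" and "SMART" are broken apart
print(strict_result)  # only the standalone "cc" token is removed
```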

In [72]:
df.head()

Unnamed: 0,Category,Resume,cleaned_resume
0,Data Science,Skills * Programming Languages: Python (pandas...,Skills Programming Languages Python pandas num...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...,Education Details May 2013 to May 2017 B E UIT...
2,Data Science,"Areas of Interest Deep Learning, Control Syste...",Areas of Interest Deep Learning Control System...
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,Skills R Python SAP HANA Tableau SAP HANA SQL ...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab...",Education Details MCA YMCAUST Faridabad Haryan...


## **Checking the most common words from resume**

In [73]:
oneSetOfStopWords = set(stopwords.words('english')+['``',"''"])
totalWords =[]
Sentences = df['Resume'].values
cleanedSentences = ""
for records in Sentences:
    cleanedText = cleanResume(records)
    cleanedSentences += cleanedText
    requiredWords = nltk.word_tokenize(cleanedText)
    for word in requiredWords:
        if word not in oneSetOfStopWords and word not in string.punctuation:
            totalWords.append(word)
    
wordfreqdist = nltk.FreqDist(totalWords)
mostcommon = wordfreqdist.most_common(50)
print(mostcommon)

[('Exprience', 616), ('company', 520), ('months', 515), ('Details', 510), ('description', 458), ('1', 348), ('Project', 299), ('data', 242), ('project', 231), ('6', 227), ('Maharashtra', 217), ('year', 215), ('SQL', 215), ('team', 207), ('Less', 199), ('using', 197), ('January', 189), ('Skill', 175), ('Management', 167), ('Ltd', 159), ('Pune', 158), ('C', 151), ('Education', 144), ('management', 143), ('Data', 140), ('Developer', 137), ('Engineering', 134), ('database', 133), ('Java', 130), ('Database', 127), ('monthsCompany', 125), ('System', 123), ('University', 123), ('Server', 123), ('Pvt', 122), ('India', 120), ('like', 118), ('The', 117), ('Responsibilities', 117), ('various', 116), ('A', 113), ('business', 113), ('2', 113), ('development', 112), ('reports', 111), ('application', 110), ('issues', 106), ('system', 106), ('Mumbai', 106), ('Test', 105)]


***Now we will encode the ‘Category’ column using label encoding. Even though ‘Category’ is nominal data, we use LabelEncoder because ‘Category’ is our target column. After encoding, each category becomes a class, and we will build a multiclass classification model.***

In [74]:
from sklearn.preprocessing import LabelEncoder

var_mod = ['Category']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])

In [75]:
df.Category.value_counts()

Category
15    13
7     11
6     10
0     10
12    10
9      7
13     7
8      7
2      7
23     7
5      6
4      6
21     6
14     6
20     6
1      6
11     5
22     5
17     5
16     5
10     5
3      5
18     4
24     4
19     3
Name: count, dtype: int64

## **Building Model**

### **Splitting data into train and test**

***Here we will preprocess and convert the ‘cleaned_resume’ column into vectors. There are many ways to do that like ‘Bag of Words’, ‘Tf-Idf’, ‘Word2Vec’ and a combination of these methods.***

**We will be using the ‘Tf-Idf’ method to get the vectors in this approach.**

In [76]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
#from scipy.sparse import hstack

requiredText = df['cleaned_resume'].values
requiredTarget = df['Category'].values

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    stop_words='english')
word_vectorizer.fit(requiredText)
WordFeatures = word_vectorizer.transform(requiredText)

print ("Feature completed .....")

Feature completed .....


In [77]:
X_train,X_test,y_train,y_test = train_test_split(WordFeatures,requiredTarget,random_state=1, test_size=0.2,shuffle=True, stratify=requiredTarget)
print(X_train.shape)
print(X_test.shape)

(132, 7351)
(34, 7351)
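
In the cells above, the vectorizer is fitted on all resumes before the split, so its vocabulary and IDF weights also see the test documents. A hedged alternative, which is not the notebook's method, is to split the raw text first and fit the vectorizer on the training part only. The `toy_*` names and texts below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for df['cleaned_resume'] and df['Category'].
toy_texts = [
    "python pandas machine learning", "deep learning python models",
    "python statistics data analysis", "data science python numpy",
    "java spring hibernate", "java servlet jsp",
    "java microservices maven", "core java j2ee",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first ...
toy_text_train, toy_text_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=1, stratify=toy_labels)

# ... then learn the vocabulary and IDF weights from the training texts only.
toy_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
toy_X_train = toy_vectorizer.fit_transform(toy_text_train)
toy_X_test = toy_vectorizer.transform(toy_text_test)

print(toy_X_train.shape, toy_X_test.shape)
```

Words that occur only in the test texts are simply ignored by `transform`, which mirrors what happens when an unseen resume is screened later.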


**We will be using the ‘One vs Rest’ method with ‘KNeighborsClassifier’ to build this multiclass classification model.**

In [78]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

In [79]:
clf = OneVsRestClassifier(KNeighborsClassifier())
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
print('Accuracy of KNeighbors Classifier on training: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of KNeighbors Classifier on test:     {:.2f}'.format(clf.score(X_test, y_test)))

Accuracy of KNeighbors Classifier on training: 0.91
Accuracy of KNeighbors Classifier on test:     0.82


**The classifier reaches 91% accuracy on the training set and 82% on the test set. Note that the test set contains only 34 resumes, so this estimate is rough.**
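
With so few test resumes, a single split gives a noisy estimate. One common complement, which is an addition here rather than part of the original notebook, is stratified k-fold cross-validation. The smallest category in the counts above has 3 resumes, so at most 3 folds can be stratified on this data. A sketch on made-up texts (all `toy_*` names are hypothetical, and `n_neighbors=3` is chosen only to suit the tiny sample):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned resumes: three categories with four resumes each.
toy_texts = [
    "java spring hibernate developer", "java servlet jsp backend",
    "core java maven microservices", "java j2ee struts developer",
    "recruitment payroll employee onboarding", "hr policies employee relations",
    "talent acquisition recruitment hr", "employee engagement hr payroll",
    "manual testing test cases selenium", "regression testing bug tracking",
    "selenium automation testing scripts", "testing defects test plans",
]
toy_labels = [0] * 4 + [1] * 4 + [2] * 4

# Putting the vectorizer inside the pipeline refits it on each training fold.
toy_model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words='english'),
    OneVsRestClassifier(KNeighborsClassifier(n_neighbors=3)),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(toy_model, toy_texts, toy_labels, cv=cv)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```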

***We can also check the detailed classification report for each class or category.***

In [80]:
print("\n Classification report for classifier %s:\n%s\n" % (clf, metrics.classification_report(y_test, prediction)))


 Classification report for classifier OneVsRestClassifier(estimator=KNeighborsClassifier()):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       0.00      0.00      0.00         1
           2       0.50      1.00      0.67         1
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         1
           5       1.00      1.00      1.00         1
           6       1.00      0.50      0.67         2
           7       1.00      1.00      1.00         2
           8       1.00      0.50      0.67         2
           9       0.67      1.00      0.80         2
          10       0.00      0.00      0.00         1
          11       1.00      1.00      1.00         1
          12       1.00      1.00      1.00         2
          13       0.50      1.00      0.67         1
          14       1.00      1.00      1.00         1
          15       0.75      1.00      0.


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





**Here 0, 1, 2, … are the encoded job categories. We can recover the actual labels from the label encoder we used.**

In [81]:
le.classes_

array(['Advocate', 'Arts', 'Automation Testing', 'Blockchain',
       'Business Analyst', 'Civil Engineer', 'Data Science', 'Database',
       'DevOps Engineer', 'DotNet Developer', 'ETL Developer',
       'Electrical Engineering', 'HR', 'Hadoop', 'Health and fitness',
       'Java Developer', 'Mechanical Engineer',
       'Network Security Engineer', 'Operations Manager', 'PMO',
       'Python Developer', 'SAP Developer', 'Sales', 'Testing',
       'Web Designing'], dtype=object)

**Here ‘Advocate’ is class 0, ‘Arts’ is class 1, and so on…**
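
The report can also print category names directly, and the `zero_division` parameter mentioned in the warning above controls what is reported for classes that were never predicted. A minimal sketch with made-up labels; in the notebook, `y_test`, `prediction`, and `le` would take the place of the `toy_*` and `y_*` names used here:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; "Arts" is never predicted, which would normally trigger the warning.
toy_le = LabelEncoder()
y_true = toy_le.fit_transform(["Advocate", "Arts", "HR", "HR"])
y_pred = toy_le.transform(["Advocate", "HR", "HR", "HR"])

report = classification_report(
    y_true, y_pred,
    labels=list(range(len(toy_le.classes_))),
    target_names=toy_le.classes_,  # show names instead of 0, 1, 2, ...
    zero_division=0,               # report 0.0 without a warning for unpredicted classes
)
print(report)

# Map an encoded class back to its name.
first_class = toy_le.inverse_transform([0])[0]
print(first_class)
```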

## **Screening**

In [82]:
pip install python-docx


Note: you may need to restart the kernel to use updated packages.


In [83]:
# import docx
# from sklearn.metrics.pairwise import cosine_similarity

# def read_docx(file_path):
#     """
#     Reads a .docx file and returns the text.
#     """
#     doc = docx.Document(file_path)
#     fullText = []
#     for para in doc.paragraphs:
#         fullText.append(para.text)
#     return '\n'.join(fullText)

# def resume_screening(model, vectorizer, le, resume_path):
#     """
#     Screens a resume against the job categories.
#     """
#     # Read resume content
#     resume_text = read_docx(resume_path)
    
#     # Clean the resume text
#     cleaned_resume = cleanResume(resume_text)
    
#     # Convert to TF-IDF features
#     resume_vec = vectorizer.transform([cleaned_resume])
    
#     # Predict the job category
#     pred = model.predict(resume_vec)
#     pred_proba = model.predict_proba(resume_vec)
    
#     # Find the best match category name
#     best_match_category = le.inverse_transform([pred[0]])[0]
#     best_match_score = max(pred_proba[0]) * 100  # Convert to percentage
    
#     print(f"Best Match Job Category: {best_match_category}")
#     print(f"Match Score: {best_match_score:.2f}%")
    
#     return best_match_category, best_match_score

# # Example usage:
# resume_path = "John Doe.docx"
# best_match_category, best_match_score = resume_screening(clf, word_vectorizer, le, resume_path)


In [84]:
# import docx

# def read_docx(file_path):
#     """
#     Reads a .docx file and returns the text.
#     """
#     doc = docx.Document(file_path)
#     fullText = []
#     for para in doc.paragraphs:
#         fullText.append(para.text)
#     return '\n'.join(fullText)

# def resume_screening(model, vectorizer, le, resume_path):
#     """
#     Screens a resume against a specified job category.
#     """
#     # Read resume content
#     resume_text = read_docx(resume_path)
    
#     # Clean the resume text
#     cleaned_resume = cleanResume(resume_text)
    
#     # Convert to TF-IDF features
#     resume_vec = vectorizer.transform([cleaned_resume])
    
#     # Prompt user to specify the job category
#     print("Available job categories:")
#     for category in le.classes_:
#         print("-", category)
#     target_category = input("Enter the job category you are interested in: ")
    
#     if target_category not in le.classes_:
#         print("Invalid job category.")
#         return
    
#     # Predict the job category
#     pred = model.predict(resume_vec)
#     pred_proba = model.predict_proba(resume_vec)
    
#     # Find the match score for the specified job category
#     target_category_index = le.transform([target_category])[0]
#     match_score = pred_proba[0][target_category_index] * 100  # Convert to percentage
    
#     print(f"Match Score for {target_category}: {match_score:.2f}%")
    
#     return target_category, match_score

# # Example usage:
# resume_path = "John Doe.docx"
# target_category, match_score = resume_screening(clf, word_vectorizer, le, resume_path)


In [85]:
# def resume_screening_with_threshold(model, vectorizer, le, resume_path, threshold=0.1):
#     """
#     Screens a resume and suggests job categories based on a confidence threshold.
#     """
#     # Read and clean the resume
#     resume_text = read_docx(resume_path)
#     cleaned_resume = cleanResume(resume_text)
    
#     # Convert to TF-IDF features
#     resume_vec = vectorizer.transform([cleaned_resume])
    
#     # Get predictions with probabilities
#     pred_proba = model.predict_proba(resume_vec)[0]
    
#     # Mapping probabilities to job categories
#     proba_category_mapping = [(le.inverse_transform([i])[0], proba) for i, proba in enumerate(pred_proba)]
    
#     # Filter based on threshold
#     suggested_categories = [ (category, proba*100) for category, proba in proba_category_mapping if proba >= threshold]
    
#     if not suggested_categories:
#         print("No matches found above the threshold.")
#     else:
#         print("Suggested categories based on resume content:")
#         for category, match_score in suggested_categories:
#             print(f"{category}: {match_score:.2f}%")

# # Example usage
# resume_path = "John Doe.docx"
# resume_screening_with_threshold(clf, word_vectorizer, le, resume_path, threshold=0.1)


In [86]:
# def resume_screening_with_suggestions(model, vectorizer, le, resume_path, threshold=0.1):
#     """
#     Screens a resume and suggests job categories based on a confidence threshold.
#     """
#     # Read and clean the resume
#     resume_text = read_docx(resume_path)
#     cleaned_resume = cleanResume(resume_text)
    
#     # Convert to TF-IDF features
#     resume_vec = vectorizer.transform([cleaned_resume])
    
#     # Get predictions with probabilities
#     pred_proba = model.predict_proba(resume_vec)[0]
    
#     # Mapping probabilities to job categories
#     proba_category_mapping = [(le.inverse_transform([i])[0], proba) for i, proba in enumerate(pred_proba)]
    
#     # Sort categories by probability
#     proba_category_mapping.sort(key=lambda x: x[1], reverse=True)
    
#     # Filter based on threshold
#     suggested_categories = [ (category, proba*100) for category, proba in proba_category_mapping if proba >= threshold]
    
#     if not suggested_categories:
#         print("No matches found above the threshold.")
#     else:
#         print("Suggested job categories based on resume content:")
#         for category, match_score in suggested_categories:
#             print(f"- {category}: {match_score:.2f}%")

# # Example usage
# resume_path = "John Doe.docx"
# resume_screening_with_suggestions(clf, word_vectorizer, le, resume_path, threshold=0.1)
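
The commented-out cells above depend on a `.docx` file that is not included here. The sketch below shows the same threshold idea on plain strings so that it can be run on its own. All `toy_*` names, the texts, and `suggest_categories` are hypothetical, and `n_neighbors=3` is chosen only to suit the tiny sample:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical cleaned resumes and their categories.
toy_texts = [
    "java spring hibernate developer", "java servlet jsp backend",
    "core java maven microservices",
    "recruitment payroll employee onboarding", "hr policies employee relations",
    "talent acquisition recruitment hr",
    "manual testing test cases selenium", "regression testing bug tracking",
    "selenium automation testing scripts",
]
toy_categories = ["Java Developer"] * 3 + ["HR"] * 3 + ["Testing"] * 3

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_categories)
toy_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
toy_X = toy_vectorizer.fit_transform(toy_texts)
toy_clf = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=3)).fit(toy_X, toy_y)

def suggest_categories(text, threshold=0.1):
    """Return (category, score in percent) pairs at or above the threshold, best first."""
    proba = toy_clf.predict_proba(toy_vectorizer.transform([text]))[0]
    ranked = sorted(zip(toy_le.classes_, proba), key=lambda pair: pair[1], reverse=True)
    return [(name, score * 100) for name, score in ranked if score >= threshold]

suggestions = suggest_categories("experienced java developer with spring and maven")
for name, score in suggestions:
    print(f"- {name}: {score:.2f}%")
```

With a nearest-neighbor model, these scores are derived from neighbor vote fractions, so they are coarse and should be read as a rough ranking rather than a calibrated probability.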


In [87]:
# import re
# from datetime import datetime

# def extract_overall_experience_years(resume_text):
#     """
#     Calculates overall experience years from the resume text, accounting for overlapping jobs.
#     """
#     date_pattern = r'((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)?\s*\d{4})\s*-\s*((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)?\s*(?:\d{4}|Present))'
#     matches = re.findall(date_pattern, resume_text, re.IGNORECASE)

#     min_start_year = datetime.now().year
#     max_end_year = 0

#     for start, end in matches:
#         start_year = int(re.search(r'\d{4}', start).group())
#         end_year = datetime.now().year if "Present" in end else int(re.search(r'\d{4}', end).group())

#         # Update the earliest start year and latest end year found
#         if start_year < min_start_year:
#             min_start_year = start_year
#         if end_year > max_end_year:
#             max_end_year = end_year

#     # No date ranges found: return 0 instead of a negative span
#     if not matches:
#         return 0

#     total_experience = max_end_year - min_start_year
#     return total_experience
    
   


# def resume_screening_with_suggestions(model, vectorizer, le, resume_path, threshold=0.1):
#     """
#     Screens a resume, calculates total experience, and suggests job categories based on a confidence threshold.
#     """
#     # Read and clean the resume
#     resume_text = read_docx(resume_path)
#     cleaned_resume = cleanResume(resume_text)
    
#     # Calculate experience years
#     experience_years = extract_overall_experience_years(resume_text)
#     print(f"Total experience: {experience_years} years")
    
#     # Convert to TF-IDF features
#     resume_vec = vectorizer.transform([cleaned_resume])
    
#     # Get predictions with probabilities
#     pred_proba = model.predict_proba(resume_vec)[0]
    
#     # Mapping probabilities to job categories
#     proba_category_mapping = [(le.inverse_transform([i])[0], proba) for i, proba in enumerate(pred_proba)]
    
#     # Sort categories by probability
#     proba_category_mapping.sort(key=lambda x: x[1], reverse=True)
    
#     # Filter based on threshold
#     suggested_categories = [ (category, proba*100) for category, proba in proba_category_mapping if proba >= threshold]
    
#     if not suggested_categories:
#         print("No matches found above the threshold.")
#     else:
#         print("Suggested job categories based on resume content:")
#         for category, match_score in suggested_categories:
#             print(f"- {category}: {match_score:.2f}%")

# # Example usage
# resume_path = "John Doe.docx"
# resume_screening_with_suggestions(clf, word_vectorizer, le, resume_path, threshold=0.1)
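
The experience calculation can be tried without any `.docx` input. The following is a simplified, runnable sketch of the same idea; `experience_span_years`, its `current_year` parameter, and the sample text are assumptions made for illustration. Like the commented-out function above, it measures the span from the earliest start year to the latest end year, so gaps between jobs are counted as experience.

```python
import re
from datetime import datetime

# Optional month name (full or abbreviated) followed by a four-digit year.
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
RANGE = re.compile(
    rf'(?:{MONTH}\s+)?(\d{{4}})\s*-\s*(?:(?:{MONTH}\s+)?(\d{{4}})|(Present))',
    re.IGNORECASE,
)

def experience_span_years(text, current_year=None):
    """Years between the earliest start and the latest end found in the text."""
    current_year = current_year or datetime.now().year
    spans = []
    for start, end, present in RANGE.findall(text):
        end_year = current_year if present else int(end)
        spans.append((int(start), end_year))
    if not spans:
        return 0  # no date ranges found
    return max(end for _, end in spans) - min(start for start, _ in spans)

sample = "Acme Corp, Jan 2015 - Mar 2018. Globex, April 2017 - Present."
print(experience_span_years(sample, current_year=2024))  # 2024 - 2015 = 9
```

Passing `current_year` explicitly keeps the result reproducible; leaving it out falls back to the current calendar year.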


In [88]:
# import pandas as pd
# import glob

# def process_resumes(resume_paths, model, vectorizer, le, threshold=0.1):
#     """
#     Processes multiple resumes and generates classification results for each one.
#     Returns a DataFrame summarizing the job categories with the highest match scores for each candidate.
#     """
#     results = []

#     for resume_path in resume_paths:
#         # Read and clean the resume
#         resume_text = read_docx(resume_path)
#         cleaned_resume = cleanResume(resume_text)
        
#         # Convert to TF-IDF features
#         resume_vec = vectorizer.transform([cleaned_resume])
        
#         # Get predictions with probabilities
#         pred_proba = model.predict_proba(resume_vec)[0]
        
#         # Mapping probabilities to job categories
#         proba_category_mapping = [(le.inverse_transform([i])[0], proba) for i, proba in enumerate(pred_proba)]
        
#         # Sort categories by probability
#         proba_category_mapping.sort(key=lambda x: x[1], reverse=True)
        
#         # Filter based on threshold
#         suggested_categories = [(category, proba*100) for category, proba in proba_category_mapping if proba >= threshold]
        
#         # Store results
#         results.append({
#             'Resume': resume_path,
#             'Job Categories': suggested_categories
#         })
    
#     # Create DataFrame from results
#     df_results = pd.DataFrame(results)
    
#     # Generate a table summarizing the job categories with the highest match scores for each candidate
#     summary_table = pd.DataFrame(columns=['Job Category', 'Candidate', 'Match Score'])

#     for _, row in df_results.iterrows():
#         resume = row['Resume']
#         categories = row['Job Categories']
        
#         for category, score in categories:
#             candidate = resume.rsplit('.', 1)[0]  # Strip only the file extension; split('.')[0] is empty for paths like './name.docx'
#             summary_table = pd.concat([summary_table, pd.DataFrame({'Job Category': [category], 'Candidate': [candidate], 'Match Score': [score]})], ignore_index=True)
    
#     return summary_table

# # Automatically find all .docx files in the code file's directory
# resume_paths = glob.glob('./*.docx')

# # Process the found resume files
# result_table = process_resumes(resume_paths, clf, word_vectorizer, le, threshold=0.1)
# print(result_table)


In [89]:
# import pandas as pd
# import glob

# def process_resumes(resume_paths, model, vectorizer, le, threshold=0.1):
#     """
#     Processes multiple resumes and generates classification results for each one.
#     Returns a DataFrame summarizing the job categories with the highest match scores for each candidate.
#     """
#     results = []

#     for resume_path in resume_paths:
#         # Read and clean the resume
#         resume_text = read_docx(resume_path)
#         cleaned_resume = cleanResume(resume_text)
        
#         # Convert to TF-IDF features
#         resume_vec = vectorizer.transform([cleaned_resume])
        
#         # Get predictions with probabilities
#         pred_proba = model.predict_proba(resume_vec)[0]
        
#         # Mapping probabilities to job categories
#         proba_category_mapping = [(le.inverse_transform([i])[0], proba) for i, proba in enumerate(pred_proba)]
        
#         # Sort categories by probability
#         proba_category_mapping.sort(key=lambda x: x[1], reverse=True)
        
#         # Filter based on threshold
#         suggested_categories = [(category, proba*100) for category, proba in proba_category_mapping if proba >= threshold]
        
#         # Extract experience years
#         experience_years = extract_overall_experience_years(resume_text)
        
#         # Store results
#         results.append({
#             'Resume': resume_path,
#             'Job Categories': suggested_categories,
#             'Experience Years': experience_years
#         })
    
#     # Create DataFrame from results
#     df_results = pd.DataFrame(results)
    
#     # Generate a table summarizing the job categories with the highest match scores for each candidate
#     summary_table = pd.DataFrame(columns=['Job Category', 'Candidate', 'Match Score', 'Experience Years'])

#     for _, row in df_results.iterrows():
#         resume = row['Resume']
#         categories = row['Job Categories']
#         experience_years = row['Experience Years']
        
#         for category, score in categories:
#             candidate = resume.rsplit('.', 1)[0]  # Strip only the file extension; split('.')[0] is empty for paths like './name.docx'
#             summary_table = pd.concat([summary_table, pd.DataFrame({'Job Category': [category], 'Candidate': [candidate], 'Match Score': [score], 'Experience Years': [experience_years]})], ignore_index=True)
    
#     return summary_table

# # Automatically find all .docx files in the code file's directory
# resume_paths = glob.glob('./*.docx')

# # Process the found resume files
# result_table = process_resumes(resume_paths, clf, word_vectorizer, le, threshold=0.1)
# print(result_table)


In [90]:
# import pandas as pd
# import glob
# import os
# import docx

# def read_docx(file_path):
#     doc = docx.Document(file_path)
#     fullText = []
#     for para in doc.paragraphs:
#         fullText.append(para.text)
#     return '\n'.join(fullText)

# def process_resumes(resume_paths, model, vectorizer, le, threshold=0.1):
#     """
#     Processes multiple resumes and generates classification results for each one.
#     Returns a DataFrame summarizing the job categories with the highest match scores for each candidate.
#     """
#     results = []

#     for resume_path in resume_paths:
#         # Read the resume text
#         resume_text = read_docx(resume_path)
        
#         # Extract candidate's name from the file name
#         candidate_name = os.path.splitext(os.path.basename(resume_path))[0]
        
#         # Clean the resume
#         cleaned_resume = cleanResume(resume_text)
        
#         # Convert to TF-IDF features
#         resume_vec = vectorizer.transform([cleaned_resume])
        
#         # Get predictions with probabilities
#         pred_proba = model.predict_proba(resume_vec)[0]
        
#         # Mapping probabilities to job categories
#         proba_category_mapping = [(le.inverse_transform([i])[0], proba) for i, proba in enumerate(pred_proba)]
        
#         # Sort categories by probability
#         proba_category_mapping.sort(key=lambda x: x[1], reverse=True)
        
#         # Filter based on threshold
#         suggested_categories = [(category, proba*100) for category, proba in proba_category_mapping if proba >= threshold]
        
#         # Store results
#         results.append({
#             'Candidate Name': candidate_name,
#             'Job Categories': suggested_categories
#         })
    
#     # Create DataFrame from results
#     df_results = pd.DataFrame(results)
    
#     # Generate a table summarizing the job categories with the highest match scores for each candidate
#     summary_table = pd.DataFrame(columns=['Candidate Name', 'Job Category', 'Match Score'])

#     for _, row in df_results.iterrows():
#         candidate_name = row['Candidate Name']
#         categories = row['Job Categories']
        
#         for category, score in categories:
#             summary_table = pd.concat([summary_table, pd.DataFrame({'Candidate Name': [candidate_name], 'Job Category': [category], 'Match Score': [score]})], ignore_index=True)
    
#     return summary_table

# # Automatically find all .docx files in the code file's directory
# resume_paths = glob.glob('./*.docx')

# # Process the found resume files
# result_table = process_resumes(resume_paths, clf, word_vectorizer, le, threshold=0.1)
# print(result_table)


In [91]:
# def print_results_by_category(df):
#     # Extract unique job categories
#     job_categories = df['Job Category'].unique()
    
#     for category in job_categories:
#         # Filter DataFrame for the current category
#         category_df = df[df['Job Category'] == category]
        
#         print(f"{category} | Candidate Name | Match Score")
#         print("-" * (len(category) + 34))  # Adjust based on your column width
        
#         # Iterate through rows in the filtered DataFrame
#         for index, row in category_df.iterrows():
#             print(f"  {row['Candidate Name']: <20} | {row['Match Score']: >10}%")
        
#         print("\n")  # Print a newline for better separation between categories

# # Assuming 'result_table' is your DataFrame
# print_results_by_category(result_table)


In [92]:
# def print_results_by_category(df, le):
#     # Prompt user to specify the job category
#     print("Available job categories:")
#     for category in le.classes_:
#         print("-", category)
#     target_category = input("Enter the job category you are interested in: ")
    
#     if target_category not in le.classes_:
#         print("Invalid job category.")
#         return

#     # Filter DataFrame for the chosen category
#     category_df = df[df['Job Category'] == target_category]
    
#     print(f"\n{target_category} | Candidate Name | Match Score")
#     print("-" * (len(target_category) + 50))  # Adjust based on your column width
    
#     # Iterate through rows in the filtered DataFrame
#     for index, row in category_df.iterrows():
#         print(f"  {row['Candidate Name']: <20} | {row['Match Score']: >10}%")
    
#     print("\n")  # Print a newline for better separation

# # Assuming 'result_table' is your DataFrame
# print_results_by_category(result_table, le)


In [93]:
# from tabulate import tabulate

# def print_results_by_category(df, le):
#     # Prompt user to specify the job category
#     print("Available job categories:")
#     for category in le.classes_:
#         print("-", category)
#     target_category = input("Enter the job category you are interested in: ")
    
#     if target_category not in le.classes_:
#         print("Invalid job category.")
#         return

#     # Filter DataFrame for the chosen category
#     category_df = df[df['Job Category'] == target_category]
    
#     # Prepare data for tabulate
#     headers = ['Candidate Name', 'Match Score']
#     data = category_df[['Candidate Name', 'Match Score']].values.tolist()
    
#     # Calculate the width of the table
#     table_width = sum(len(str(header)) for header in headers) + len(headers) * 3
    
#     # Print table header with category name centered
#     print("\n" + target_category.center(table_width) + "\n")
    
#     # Print table using tabulate
#     print(tabulate(data, headers=headers, tablefmt="pretty"))

# # Assuming 'result_table' is your DataFrame
# print_results_by_category(result_table, le)


In [94]:
# from tabulate import tabulate

# def print_results_by_category(df, le):
#     # Print the list of available job categories only once at the beginning
#     print("\nAvailable job categories:")
#     for category in le.classes_:
#         print("-", category)
#     print("Enter 'exit' to stop.")
    
#     while True:
#         # Prompt user to specify the job category
#         target_category = input("\nEnter the job category you are interested in or 'exit' to stop: ")
        
#         # Check for exit condition
#         if target_category.lower() == 'exit':
#             break
        
#         if target_category not in le.classes_:
#             print("Invalid job category. Please try again.")
#             continue

#         # Filter DataFrame for the chosen category
#         category_df = df[df['Job Category'] == target_category]
        
#         # Prepare data for tabulate
#         headers = ['Candidate Name', 'Match Score']
#         data = category_df[['Candidate Name', 'Match Score']].values.tolist()
        
#         # Calculate the width of the table
#         table_width = sum(len(str(header)) for header in headers) + len(headers) * 3 + 5  # Adjust for spacing
        
#         # Print table header with category name centered
#         print("\n" + target_category.center(table_width) + "\n")
        
#         # Print table using tabulate
#         if not data:
#             print(f"No candidates found for {target_category}.")
#         else:
#             print(tabulate(data, headers=headers, tablefmt="pretty"))

# # Assuming 'result_table' is your DataFrame and 'le' is your LabelEncoder
# print_results_by_category(result_table, le)


In [95]:
import pandas as pd
import glob
import os
import docx
import re
from datetime import datetime

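def read_docx(file_path):
    """Return the text of a .docx file, one line per body paragraph.

    Only body paragraphs are read; text inside tables, headers and footers is skipped.
    """
    doc = docx.Document(file_path)
    return '\n'.join(para.text for para in doc.paragraphs)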

def extract_overall_experience_years(resume_text):
    """
    Estimate overall experience as the span, in whole years, between the earliest
    start year and the latest end year of all date ranges found in the resume text.

    Overlapping jobs are therefore not double-counted, but gaps between jobs and
    non-employment ranges (for example education dates) are included in the span.
    Returns the string "Null" when no date range is found.
    """
    month = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
    # "<month> <year> - <month> <year|Present>"; the month is optional and
    # the separator may be a hyphen or an en dash
    date_pattern = rf'({month}?\s*\d{{4}})\s*[-\u2013]\s*({month}?\s*(?:\d{{4}}|Present))'
    matches = re.findall(date_pattern, resume_text, re.IGNORECASE)

    if not matches:
        return "Null"  # Return "Null" if no dates found

    current_year = datetime.now().year
    start_years, end_years = [], []

    for start, end in matches:
        start_years.append(int(re.search(r'\d{4}', start).group()))
        # The pattern is matched case-insensitively, so "present" must be detected the same way
        if 'present' in end.lower():
            end_years.append(current_year)
        else:
            end_years.append(int(re.search(r'\d{4}', end).group()))

    return max(end_years) - min(start_years)

def process_resumes(resume_paths, model, vectorizer, le, threshold=0.1):
    """
    Classify each .docx resume and return one row per (candidate, category) pair
    whose predicted probability is at least `threshold`.
    """
    # predict_proba columns follow model.classes_, so decode those labels
    # instead of assuming they are exactly 0..n-1
    category_names = le.inverse_transform(model.classes_)
    rows = []

    for resume_path in resume_paths:
        resume_text = read_docx(resume_path)
        # The file name (without extension) is used as the candidate name
        candidate_name = os.path.splitext(os.path.basename(resume_path))[0]
        cleaned_resume = cleanResume(resume_text)  # cleanResume is expected to be defined in an earlier cell

        resume_vec = vectorizer.transform([cleaned_resume])
        pred_proba = model.predict_proba(resume_vec)[0]

        # Extract overall experience
        overall_experience = extract_overall_experience_years(resume_text)

        # Rank categories by probability, highest first, and keep those at or above the threshold
        ranked = sorted(zip(category_names, pred_proba), key=lambda x: x[1], reverse=True)
        for category, proba in ranked:
            if proba >= threshold:
                rows.append({
                    'Candidate Name': candidate_name,
                    'Job Category': category,
                    'Match Score': proba * 100,  # probability expressed as a percentage
                    'Experience': overall_experience
                })

    # Build the table once instead of concatenating one row at a time
    return pd.DataFrame(rows, columns=['Candidate Name', 'Job Category', 'Match Score', 'Experience'])

# clf, word_vectorizer and le are expected to come from the training cells earlier in this notebook.
# glob returns files in arbitrary order, so sort the paths for a reproducible table.
resume_paths = sorted(glob.glob('./Resumes/*.docx'))
result_table = process_resumes(resume_paths, clf, word_vectorizer, le, threshold=0.1)

# Build the path with os.path.join so it is not tied to Windows separators
csv_file = os.path.join("data", "resume_classification_results.csv")

# Save the DataFrame to a CSV file
result_table.to_csv(csv_file, index=False)

print(result_table)
print(f"Results saved to {csv_file}.")



  Candidate Name         Job Category  Match Score Experience
0  Alice Johnson             Advocate        100.0          8
1          Ayush   Health and fitness         40.0       Null
2          Ayush           Blockchain         20.0       Null
3          Ayush  Mechanical Engineer         20.0       Null
4          Ayush                Sales         20.0       Null
5       Jane Doe         Data Science        100.0         10
6     John smith         Data Science         60.0         12
7     John smith     Python Developer         40.0         12
8      Tim David         Data Science         80.0          4
9      Tim David     Python Developer         20.0          4
Results saved to data\resume_classification_results.csv.
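
The experience figure above is the span between the earliest start year and the latest end year, so any gap between jobs is counted as experience. A minimal sketch of an alternative that merges overlapping date ranges and sums only the covered months is shown below. The names `experience_months`, `RANGE_RE` and `MONTHS` are hypothetical and not part of the notebook, and the sketch assumes that a range without a month starts and ends in January.

```python
import re
from datetime import date

# Month abbreviations mapped to 1..12 (hypothetical helper table)
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# "<month> <year> - <month> <year>" or "<year> - Present"; months are optional
RANGE_RE = re.compile(
    r"(?:([A-Za-z]{3})[A-Za-z]*\.?\s+)?\b((?:19|20)\d{2})\s*[-\u2013]\s*"
    r"(?:(present)|(?:([A-Za-z]{3})[A-Za-z]*\.?\s+)?((?:19|20)\d{2})\b)",
    re.IGNORECASE,
)


def experience_months(text, today=None):
    """Sum the months covered by all date ranges, counting overlaps only once."""
    today = today or date.today()
    spans = []
    for m in RANGE_RE.finditer(text):
        s_mon, s_year, present, e_mon, e_year = m.groups()
        # Encode each date as a month index; an unknown or missing month defaults to January
        start = int(s_year) * 12 + MONTHS.get((s_mon or "").lower(), 1) - 1
        if present:
            end = today.year * 12 + today.month - 1
        else:
            end = int(e_year) * 12 + MONTHS.get((e_mon or "").lower(), 1) - 1
        if end > start:
            spans.append((start, end))

    # Classic interval merge: sort by start, extend the current span while ranges overlap
    spans.sort()
    total, cur_start, cur_end = 0, None, None
    for start, end in spans:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Made-up resume text for illustration only
sample = (
    "Acme Corp, Jan 2015 - Jan 2018\n"
    "Freelance, Jan 2017 - Jan 2019\n"
    "Beta Ltd, 2021 - Present"
)
months = experience_months(sample, today=date(2023, 1, 1))
print(months, months / 12)  # 72 months, i.e. 6.0 years
```

For the same sample text the span-based approach would report 8 years (2015 to 2023), because the two years without a listed job are included.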


In [96]:
from tabulate import tabulate

def print_results_by_category(df, le):
    # Print the list of available job categories only once at the beginning
    print("\nAvailable job categories:")
    for category in le.classes_:
        print("-", category)
    print("Enter 'exit' to stop.")
    
    while True:
        # Prompt user to specify the job category
        target_category = input("\nEnter the job category you are interested in or 'exit' to stop: ").strip()
        
        # Check for exit condition
        if target_category.lower() == 'exit':
            break
        
        if target_category not in le.classes_:
            print("Invalid job category. Please try again.")
            continue

        # Filter DataFrame for the chosen category
        category_df = df[df['Job Category'] == target_category]
        
        # Prepare data for tabulate
        headers = ['Candidate Name', 'Match Score', 'Experience (Years)']
        # Include the experience data in the listing
        data = category_df[['Candidate Name', 'Match Score', 'Experience']].values.tolist()
        
        # Calculate the width of the table
        table_width = sum(len(str(header)) for header in headers) + len(headers) * 3 + 10  # Adjust for spacing and column
        
        # Print table header with category name centered
        print("\n" + target_category.center(table_width) + "\n")
        
        # Print table using tabulate
        if not data:
            print(f"No candidates found for {target_category}.")
        else:
            print(tabulate(data, headers=headers, tablefmt="pretty"))

# 'result_table' is the DataFrame returned by process_resumes (it includes the 'Experience' column)
# and 'le' is the fitted LabelEncoder
print_results_by_category(result_table, le)



Available job categories:
- Advocate
- Arts
- Automation Testing
- Blockchain
- Business Analyst
- Civil Engineer
- Data Science
- Database
- DevOps Engineer
- DotNet Developer
- ETL Developer
- Electrical Engineering
- HR
- Hadoop
- Health and fitness
- Java Developer
- Mechanical Engineer
- Network Security Engineer
- Operations Manager
- PMO
- Python Developer
- SAP Developer
- Sales
- Testing
- Web Designing
Enter 'exit' to stop.

                           Advocate                           

+----------------+-------------+--------------------+
| Candidate Name | Match Score | Experience (Years) |
+----------------+-------------+--------------------+
| Alice Johnson  |    100.0    |         8          |
+----------------+-------------+--------------------+


## Total Scores of Top Candidates

In [97]:
import os
import pandas as pd

candidates_df = pd.read_csv(os.path.join("data", "candidates.csv"))
resume_scores_df = pd.read_csv(os.path.join("data", "resume_classification_results.csv"))

# Inner-join the two tables on the candidate's name; a candidate missing from either file is dropped
merged_df = pd.merge(candidates_df, resume_scores_df, left_on='name', right_on='Candidate Name', how='inner')

# Average the two scores so the total stays out of 100 (this assumes 'score' is also out of 100)
merged_df['Total Score'] = (merged_df['score'] + merged_df['Match Score']) / 2

# Keep only each candidate's best-scoring job category
best_df = (
    merged_df.sort_values(by=['Candidate Name', 'Total Score'], ascending=[True, False])
    .drop_duplicates(subset=['Candidate Name'], keep='first')
)

# Rank candidates by total score, highest first; ties are ordered by name
ranked_df = best_df.sort_values(by=['Total Score', 'Candidate Name'], ascending=[False, True])

# Select the output columns and renumber the rows from 0 in rank order
final_df = ranked_df[['Candidate Name', 'Total Score', 'Job Category']].reset_index(drop=True)

# Display the final DataFrame
print(final_df)


  Candidate Name  Total Score  Job Category
0       Jane Doe         60.0  Data Science
1  Alice Johnson         50.0      Advocate
2      Tim David         50.0  Data Science
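
The final table lists three of the five candidates that were scored earlier. An inner merge silently drops any name that appears in only one of the two files, and a difference in case or spacing is enough to prevent a match. The sketch below shows one way to list the names that did not pair up, using the `indicator` option of `DataFrame.merge`. The helper names and the sample data are made up for illustration; the actual contents of `candidates.csv` are not shown in this notebook.

```python
import pandas as pd


def normalise(name):
    """Collapse whitespace and ignore case so near-identical names compare equal."""
    return " ".join(str(name).split()).casefold()


def unmatched_names(candidates, resume_scores):
    """Return (names only in candidates, names only in resume_scores)."""
    left = candidates.assign(_key=candidates["name"].map(normalise))
    right = resume_scores.assign(_key=resume_scores["Candidate Name"].map(normalise))
    # One row per candidate is enough to test whether a name matches
    right = right.drop_duplicates("_key")
    merged = left.merge(right, on="_key", how="outer", indicator=True)
    only_candidates = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
    only_resumes = merged.loc[merged["_merge"] == "right_only", "Candidate Name"].tolist()
    return only_candidates, only_resumes


# Hypothetical sample data, not taken from the notebook's CSV files
candidates = pd.DataFrame({"name": ["Jane Doe", "john smith ", "Priya Rao"],
                           "score": [20, 70, 55]})
resume_scores = pd.DataFrame({"Candidate Name": ["Jane Doe", "John smith", "John smith", "Ayush"],
                              "Match Score": [100.0, 60.0, 40.0, 40.0]})

missing_resume, missing_candidate = unmatched_names(candidates, resume_scores)
print("In candidates only:", missing_resume)
print("In resume results only:", missing_candidate)
```

Running a check like this before the merge makes it clear whether a candidate is absent from the ranking because of a low score or because the names never matched.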
