# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


In [1]:
#📥 Download the Dataset
!wget https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv

--2025-03-11 18:15:51--  https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 646072 (631K) [text/plain]
Saving to: ‘data.csv’


2025-03-11 18:15:52 (3.00 MB/s) - ‘data.csv’ saved [646072/646072]



### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (see the `wget` cell above).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [4]:
import pandas as pd

# Load the dataset into a Pandas DataFrame
df = pd.read_csv('data.csv')

# View the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())
print(df.shape)

First few rows of the dataset:
   Unnamed: 0                          company  \
0           1          Visual BI Solutions Inc   
1           2                       Jobvertise   
2           3           Santander Consumer USA   
3           4   Federal Reserve Bank of Dallas   
4           5                           Aviall   

                                            position  \
0  Graduate Intern (Summer 2017) - SAP BI / Big D...   
1                          Digital Marketing Manager   
2    Manager, Pricing Management Information Systems   
3               Treasury Services Analyst Internship   
4                              Intern, Sales Analyst   

                                                 url     location  \
0  https://www.glassdoor.com/partner/jobListing.h...    Plano, TX   
1  https://www.glassdoor.com/partner/jobListing.h...   Dallas, TX   
2  https://www.glassdoor.com/partner/jobListing.h...   Dallas, TX   
3  https://www.glassdoor.com/partner/jobListing.h...   
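
The printed frame starts with an `Unnamed: 0` column, which usually means a row index was saved into the CSV. A minimal sketch, assuming a tiny in-memory stand-in for `data.csv` (the two rows are made up; only the column names mirror the real file), of reading that column as the index and counting missing descriptions before any text processing:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for data.csv; the values are invented for illustration.
sample_csv = io.StringIO(
    "Unnamed: 0,company,position,Job Description\n"
    "1,Acme,Data Analyst,Build SQL reports\n"
    "2,Globex,Marketing Intern,\n"
)

# index_col=0 treats the first column as the row index instead of as data.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())
# Empty fields are read as NaN, which would break a later call to text.lower().
print(sample_df["Job Description"].isna().sum())
```

If the count is non-zero, one option is to drop or fill those rows before applying a string-based cleaning function.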

### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.

In [8]:
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK resources (skipped if already up to date)
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once, rather than on every call
stop_words = set(stopwords.words('english'))

# Define a function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text into words
    words = word_tokenize(text)

    # Keep only purely alphanumeric tokens; this drops punctuation
    # and any token that contains a symbol
    words = [word for word in words if word.isalnum()]

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Join the cleaned words back into a single string
    return ' '.join(words)

# Apply preprocessing to the 'Job Description' column
df['Cleaned_Job_Description'] = df['Job Description'].apply(preprocess_text)

# View the first few rows of the cleaned job descriptions
print("First few rows of cleaned job descriptions:")
print(df[['Job Description', 'Cleaned_Job_Description']].head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First few rows of cleaned job descriptions:
                                     Job Description  \
0   Location: Plano, TX or Oklahoma City, OK Dura...   
1   The Digital Marketing Manager is the front li...   
2   Summary of Responsibilities:The Manager Prici...   
3   ORGANIZATIONAL SUMMARY:   As part of the nati...   
4     Aviall is the world's largest provider of n...   

                             Cleaned_Job_Description  
0  location plano tx oklahoma city ok duration in...  
1  digital marketing manager front line patient c...  
2  summary responsibilities manager pricing mis r...  
3  organizational summary part nation central ban...  
4  aviall world largest provider new aviation par...  
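
One side effect of the `word.isalnum()` filter is that any token containing a non-alphanumeric character is dropped whole, which can remove skill names that include symbols. A small sketch of this effect, using a plain whitespace split instead of `word_tokenize` and a three-word stopword set (both are simplifications for illustration, not the notebook's actual tokenizer or stopword list):

```python
# Simplified stand-in for the notebook's preprocessing:
# whitespace tokenization and a tiny, hypothetical stopword set.
STOP_WORDS = {"and", "with", "in"}

def simple_preprocess(text):
    tokens = text.lower().split()
    # Same filter as preprocess_text: keep only purely alphanumeric tokens.
    tokens = [t for t in tokens if t.isalnum()]
    return [t for t in tokens if t not in STOP_WORDS]

def relaxed_preprocess(text):
    # Variation: trim surrounding punctuation, then keep any token
    # that still contains at least one alphanumeric character.
    tokens = [t.strip(",;:()") for t in text.lower().split()]
    tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return [t for t in tokens if t not in STOP_WORDS]

sample = "Experience with C++ and Node.js in production"
print(simple_preprocess(sample))   # ['experience', 'production']
print(relaxed_preprocess(sample))  # ['experience', 'c++', 'node.js', 'production']
```

Whether the relaxed filter is preferable depends on how much punctuation noise the downstream step can tolerate.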


### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect and extract skills from job descriptions.

1. Load spaCy’s English model.
2. Use NER to identify important keywords.
3. Extract words related to technical skills, tools, and expertise.

In [9]:
# Import necessary libraries
import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Define a function to extract skills using spaCy's NER
def extract_skills(text):
    # Process the text using spaCy
    doc = nlp(text)

    # Keep entity types that sometimes coincide with tools or vendors.
    # en_core_web_sm has no skill or "TECH" label, so only ORG and PRODUCT can match.
    skills = []
    for ent in doc.ents:
        if ent.label_ in ("ORG", "PRODUCT"):
            skills.append(ent.text)

    return skills

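# Apply the function to the cleaned job descriptions
# (the NER model was trained on normally cased sentences, so lowercased,
# stopword-stripped input is likely to reduce its accuracy)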
df['Extracted_Skills'] = df['Cleaned_Job_Description'].apply(extract_skills)

# View the first few rows with extracted skills
print("First few rows with extracted skills:")
print(df[['Cleaned_Job_Description', 'Extracted_Skills']].head())

First few rows with extracted skills:
                             Cleaned_Job_Description  \
0  location plano tx oklahoma city ok duration in...   
1  digital marketing manager front line patient c...   
2  summary responsibilities manager pricing mis r...   
3  organizational summary part nation central ban...   
4  aviall world largest provider new aviation par...   

                                    Extracted_Skills  
0          [gpa scores, hone bi analytics expertise]  
1                                          [digital]  
2                                              [sas]  
3  [central bank federal reserve bank, federal re...  
4                                                 []  
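
The entities above are mostly organization names and phrase fragments rather than skills, because the model's entity labels do not include a skill category. A common complement, sketched here as an assumption and not as part of the original notebook, is to match the text against a curated skill vocabulary. This swaps NER for plain dictionary matching; the `SKILL_VOCAB` list and the `match_skills` helper are hypothetical:

```python
import re

# Hypothetical vocabulary; a real one would come from a maintained skills list.
SKILL_VOCAB = ["sql", "sas", "power bi", "tableau", "python"]

# Longest phrases first so multi-word entries win over shorter overlapping ones.
# The lookarounds stop "sas" from matching inside a longer word.
_alternatives = "|".join(
    re.escape(s) for s in sorted(SKILL_VOCAB, key=len, reverse=True)
)
_pattern = re.compile(r"(?<!\w)(" + _alternatives + r")(?!\w)")

def match_skills(text):
    """Return vocabulary skills found in text, in order of first appearance, without duplicates."""
    found = []
    for m in _pattern.finditer(text.lower()):
        if m.group(1) not in found:
            found.append(m.group(1))
    return found

print(match_skills("microsoft power bi tableau similar platforms ideal sql sql"))
# ['power bi', 'tableau', 'sql']
```

A vocabulary only finds what it lists, so it trades recall on unseen skills for far fewer false positives such as company names.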


### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.

In [10]:
# Import necessary libraries
from nltk.probability import FreqDist

# Combine all extracted skills into a single list
all_skills = [skill for sublist in df['Extracted_Skills'] for skill in sublist]

# Create a frequency distribution of skills using NLTK's FreqDist
skill_freq = FreqDist(all_skills)

# Get the top 10 most frequently mentioned skills
top_10_skills = skill_freq.most_common(10)

# Print the top 10 skills
print("Top 10 Most In-Demand Skills:")
for skill, freq in top_10_skills:
    print(f"{skill}: {freq} occurrences")

Top 10 Most In-Demand Skills:
microsoft: 83 occurrences
ibm: 31 occurrences
gpa: 12 occurrences
phoenix house: 10 occurrences
deloitte university: 6 occurrences
deloitte consulting llp: 6 occurrences
grant thornton international ltd one: 6 occurrences
sas: 5 occurrences
central bank federal reserve bank: 4 occurrences
sql: 4 occurrences
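
The list above mixes company names such as `deloitte university` with actual tools such as `sql`, and a posting that repeats one term contributes every repetition to the count. One possible refinement, sketched with `collections.Counter` on made-up data, is to count each skill at most once per posting so that the number reads as "postings that mention it":

```python
from collections import Counter

# Hypothetical per-posting skill lists, shaped like df['Extracted_Skills'].
extracted = [
    ["sql", "sql", "microsoft"],
    ["sql", "sas"],
    ["microsoft", "microsoft", "microsoft"],
]

# Raw mention counts: every repetition counts.
mention_counts = Counter(skill for skills in extracted for skill in skills)

# Posting counts: set() collapses repeats inside a single posting.
posting_counts = Counter(skill for skills in extracted for skill in set(skills))

print(mention_counts["microsoft"], posting_counts["microsoft"])  # 4 2
print(mention_counts["sql"], posting_counts["sql"])              # 3 2
```

In this toy data `microsoft` leads by raw mentions only because one posting repeats it three times; by posting count it ties with `sql`.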


### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare industries, e.g. IT vs. Marketing vs. Healthcare.
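
When comparing industries, raw counts depend on how many postings each industry has. A minimal sketch, on made-up rows, of dividing each skill's posting count by the number of postings in that industry so the values become comparable shares; the `rows` data and the `skill_share_by_industry` helper are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Hypothetical (industry, extracted skills) pairs standing in for DataFrame rows.
rows = [
    ("Information Technology", ["sql", "microsoft"]),
    ("Information Technology", ["sql"]),
    ("Finance", ["sas", "microsoft"]),
]

def skill_share_by_industry(rows):
    """Map industry -> {skill: fraction of that industry's postings mentioning it}."""
    postings = Counter()
    counts = defaultdict(Counter)
    for industry, skills in rows:
        postings[industry] += 1
        # set() counts a skill once per posting.
        counts[industry].update(set(skills))
    return {
        industry: {skill: n / postings[industry] for skill, n in skill_counts.items()}
        for industry, skill_counts in counts.items()
    }

shares = skill_share_by_industry(rows)
print(shares["Information Technology"]["sql"])        # 1.0
print(shares["Information Technology"]["microsoft"])  # 0.5
```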

In [12]:
# Import necessary libraries
from collections import defaultdict

# Group job descriptions by industry
industry_skills = defaultdict(list)

# Group each row's extracted skills under its industry
for industry, skills in zip(df['industry'], df['Extracted_Skills']):
    industry_skills[industry].extend(skills)

# Analyze the most in-demand skills for each industry
for industry, skills in industry_skills.items():
    # Create a frequency distribution of skills for the industry
    skill_freq = FreqDist(skills)

    # Get the top 10 most frequently mentioned skills for the industry
    top_10_skills = skill_freq.most_common(10)

    # Print the results
    print(f"\nTop 10 Most In-Demand Skills in {industry}:")
    for skill, freq in top_10_skills:
        print(f"{skill}: {freq} occurrences")


Top 10 Most In-Demand Skills in Information Technology:
microsoft: 19 occurrences
microsoft power bi tableau similar platforms ideal: 2 occurrences
microsoft power: 2 occurrences
sap academy: 2 occurrences
sap academy presales sap presales academy: 2 occurrences
iot logistics supply chain: 2 occurrences
metadata: 2 occurrences
gpa scores: 1 occurrences
hone bi analytics expertise: 1 occurrences
ge: 1 occurrences

Top 10 Most In-Demand Skills in Unknown:
microsoft: 3 occurrences
digital: 1 occurrences
google: 1 occurrences
401k: 1 occurrences
metadata: 1 occurrences
treasury: 1 occurrences
agile development environment: 1 occurrences

Top 10 Most In-Demand Skills in Finance:
microsoft: 6 occurrences
central bank federal reserve bank: 4 occurrences
federal reserve bank: 3 occurrences
dallas treasury services: 3 occurrences
new york stock exchange: 3 occurrences
yardi erp: 2 occurrences
sas: 1 occurrences
ibm: 1 occurrences
meta data: 1 occurrences
texasdata: 1 occurrences

Top 10 Most I