
# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


In [1]:
#📥 Download the Dataset
!wget https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv

--2025-04-18 13:31:47--  https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 646072 (631K) [text/plain]
Saving to: ‘data.csv’


2025-04-18 13:31:47 (19.3 MB/s) - ‘data.csv’ saved [646072/646072]



### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (the `wget` cell above saves it as `data.csv`).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [4]:
# your code here
import pandas as pd
data = pd.read_csv('data.csv')
data.head()

| | Unnamed: 0 | company | position | url | location | headquaters | employees | founded | industry | Job Description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Visual BI Solutions Inc | Graduate Intern (Summer 2017) - SAP BI / Big D... | https://www.glassdoor.com/partner/jobListing.h... | Plano, TX | Plano, TX | 51 to 200 employees | 2010 | Information Technology | Location: Plano, TX or Oklahoma City, OK Dura... |
| 1 | 2 | Jobvertise | Digital Marketing Manager | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Berlin, Germany | 1 to 50 employees | 2011 | Unknown | The Digital Marketing Manager is the front li... |
| 2 | 3 | Santander Consumer USA | Manager, Pricing Management Information Systems | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 5001 to 10000 employees | 1995 | Finance | Summary of Responsibilities:The Manager Prici... |
| 3 | 4 | Federal Reserve Bank of Dallas | Treasury Services Analyst Internship | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 1001 to 5000 employees | 1914 | Finance | ORGANIZATIONAL SUMMARY: As part of the nati... |
| 4 | 5 | Aviall | Intern, Sales Analyst | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 1001 to 5000 employees | Boeing | Subsidiary or Business Segment | Aviall is the world's largest provider of n... |
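
The last preview row shows `Boeing` under `founded` and `Subsidiary or Business Segment` under `industry`, which suggests that some rows in the source file are shifted. Below is a minimal sketch of a sanity check for this, using only the standard library on a small in-memory sample. The sample text and the helper name `rows_with_bad_year` are illustrative assumptions, not part of the dataset or of pandas.

```python
import csv
import io

# Tiny in-memory sample that mimics the pattern seen in the preview (illustrative only).
SAMPLE = """company,founded,industry
Visual BI Solutions Inc,2010,Information Technology
Aviall,Boeing,Subsidiary or Business Segment
"""

def rows_with_bad_year(csv_text, column="founded"):
    """Return the rows whose `column` value is not a 4-digit year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if not (row[column].isdigit() and len(row[column]) == 4)
    ]

bad_rows = rows_with_bad_year(SAMPLE)
print([row["company"] for row in bad_rows])
# ['Aviall']
```

On the loaded DataFrame, `pd.to_numeric(data['founded'], errors='coerce').isna()` should give a comparable mask of suspicious rows.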


### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.
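
The three steps above can be sketched as a small filter that runs on an already tokenized description. This is a dependency-free illustration under stated assumptions: the function name `clean_tokens` and the demo inputs are hypothetical, and in the notebook the tokens would come from `word_tokenize` and the stopword set from `stopwords.words('english')`. The rule used here (keep a token only if it contains at least one letter or digit) is one possible choice, picked so that punctuation-only tokens are dropped while skills such as `c++` survive.

```python
def clean_tokens(tokens, stop_words):
    """Lowercase tokens, then drop stopwords and tokens with no letter or digit."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in stop_words:
            continue
        # Keeps "c++" or "pl/sql" but drops ",", ":" and "•".
        if not any(ch.isalnum() for ch in token):
            continue
        cleaned.append(token)
    return cleaned

# Hypothetical stand-ins for tokenizer output and the NLTK stopword list.
demo_tokens = ["Strong", "SQL", ",", "C++", "and", "•", "ETL", "skills", "."]
demo_stop_words = {"and", "the", "of"}

print(clean_tokens(demo_tokens, demo_stop_words))
# ['strong', 'sql', 'c++', 'etl', 'skills']
```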

In [12]:
data.columns

Index(['Unnamed: 0', 'company', 'position', 'url', 'location', 'headquaters',
       'employees', 'founded', 'industry', 'Job Description'],
      dtype='object')

In [22]:
# your code here
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Fetch the stopword list and tokenizer models (skipped if already present)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

descriptions = data['Job Description']
stop_words = set(stopwords.words('english'))

preprocessed_descriptions = []
for description in descriptions:
  # Lowercase, tokenize, then drop English stopwords.
  # Note: punctuation tokens are still kept at this stage.
  words = word_tokenize(description.lower())
  filtered_words = [word for word in words if word not in stop_words]
  preprocessed_descriptions.append(' '.join(filtered_words))
preprocessed_descriptions

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


["location : plano , tx oklahoma city , ok duration : internship summer 2017 term job summary : visual bi solutions inc seeking graduate interns strong expertise/passion bi big data & analytics solutions ( sap bw sap hana oracle / ms sql edw / pl/sql / bods / sas / big data / visualization tools ) join college recruiting hiring program . role , would building best-in-class bi , analytics & big data solutions would consumed leaders executives fortune 500 organizations . strong sense business analysis , etl , data modeling , data warehousing , visualization , reporting , advanced analytics data interpretation key attributes look . market leader sap bi & analytics - visual bi selective student hiring candidates portfolio non-academic project work/technical blogs preferred work experience academic gpa scores . asked hone bi & analytics expertise internship working best customers bi talent world requirements-2+ years bi / edw / etl / big data development relevant work bi/datawarehousing exp

### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect and extract skills from job descriptions.

1. Load spaCy’s English model.
2. Use NER to identify important keywords.
3. Extract words related to technical skills, tools, and expertise.

In [24]:
len(preprocessed_descriptions)

157

In [34]:
# your code here
import spacy
nlp = spacy.load('en_core_web_sm')
skills = []
for description in preprocessed_descriptions:
  doc = nlp(description)
  for ent in doc.ents:
    # en_core_web_sm has no 'JOB' entity label, so only PRODUCT entities are collected
    if ent.label_ == 'PRODUCT':
      skills.append(ent.text)

skills


['•',
 '401k',
 '401k',
 'perspective.experience',
 'perspective.experience',
 '•',
 '•',
 '•',
 '•',
 '•',
 '•',
 'pbxs',
 'microsoft visio',
 '•']
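
Most of the `PRODUCT` entities returned here are noise (bullet characters, `401k`, a sentence-boundary artifact) rather than skills. A minimal post-filter sketch follows, assuming a hand-maintained noise set built from the output above. The names `NOISE` and `plausible_skills` are hypothetical, and the "must start with a letter" rule is a heuristic, not a spaCy feature.

```python
# Noise strings observed in the NER output; extend by hand as needed (assumption).
NOISE = {"401k", "perspective.experience"}

def plausible_skills(entities, noise=NOISE):
    """Keep entity strings that start with a letter and are not known noise."""
    kept = []
    for text in entities:
        text = text.strip().lower()
        if text and text[0].isalpha() and text not in noise:
            kept.append(text)
    return kept

raw_entities = ["•", "401k", "perspective.experience", "pbxs", "microsoft visio", "•"]
print(plausible_skills(raw_entities))
# ['pbxs', 'microsoft visio']
```

Even after filtering, very little remains, which is consistent with a general-purpose NER model not being trained to recognize skills.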

In [47]:
# your code here
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

# A sample list of known technical skills/tools – customize this
skill_list = [
    "Python", "TensorFlow", "Keras", "PyTorch", "Docker", "Git", "SQL",
    "Java", "C++", "AWS", "GCP", "Azure", "Pandas", "NumPy", "Scikit-learn"
]

# Initialize matcher
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp(skill) for skill in skill_list]
matcher.add("SKILL", patterns)

skills = []

for description in preprocessed_descriptions:
    doc = nlp(description)

    # Rule 1: Use NER for PRODUCT entities
    for ent in doc.ents:
        if ent.label_ == 'PRODUCT':
            skills.append(ent.text)

    # Rule 2: Use PhraseMatcher
    matches = matcher(doc)
    for match_id, start, end in matches:
        skills.append(doc[start:end].text)

# Normalize (strip whitespace, lowercase); duplicates are kept so Step 4 can count them
skills = [s.strip().lower() for s in skills]

skills


['sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'java',
 '•',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 '401k',
 '401k',
 'java',
 'perspective.experience',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'python',
 'java',
 'sql',
 'java',
 'sql',
 'sql',
 'sql',
 'perspective.experience',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'python',
 'java',
 'sql',
 'java',
 'sql',
 'sql',
 'sql',
 '•',
 '•',
 '•',
 '•',
 '•',
 '•',
 'sql',
 'python',
 'java',
 'java',
 'sql',
 'java',
 'sql',
 'sql',
 'sql',
 'sql',
 'java',
 'python',
 'sql',
 'java',
 'python',
 'sql',
 'python',
 'sql',
 'sql',
 'pbxs',
 'microsoft visio',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'sql',
 'java',
 'java',
 'sql',
 'java',
 'sql',
 'c++',
 'python',
 'java',
 'java',
 'sql',
 'sql',
 'python',
 '•']

### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.
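
Counting raw mentions lets a single posting that repeats a skill several times inflate that skill's total. One alternative is to count the number of postings that mention each skill at least once. The sketch below assumes the skills are kept as one list per posting, which is a hypothetical structure (the flat `skills` list built in Step 3 does not preserve posting boundaries), and `posting_frequency` is an illustrative helper name.

```python
from collections import Counter

def posting_frequency(skills_per_posting):
    """Count in how many postings each skill appears at least once."""
    counts = Counter()
    for posting_skills in skills_per_posting:
        # set() removes repeated mentions within a single posting
        counts.update(set(posting_skills))
    return counts

# Hypothetical per-posting skill lists.
demo = [["sql", "sql", "python"], ["sql", "java"], ["java", "java", "java"]]
freq = posting_frequency(demo)
print(sorted(freq.items()))
# [('java', 2), ('python', 1), ('sql', 2)]
```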

In [48]:
from collections import Counter

# 'skills' is the flat list of skill names extracted in Step 3
# (lowercased, with repeated mentions kept so they can be counted).
# If the entries are not yet lowercased:
# skills = [s.lower() for s in skills]

# Create frequency distribution
skill_freq = Counter(skills)

# Display top 10 most required skills
top_10_skills = skill_freq.most_common(10)

print("Top 10 most required skills:")
for skill, freq in top_10_skills:
    print(f"{skill}: {freq}")


Top 10 most required skills:
sql: 58
java: 16
•: 8
python: 8
401k: 2
perspective.experience: 2
pbxs: 1
microsoft visio: 1
c++: 1


### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare IT vs. Marketing vs. Healthcare, etc.

In [49]:
import spacy
from spacy.matcher import PhraseMatcher
from collections import defaultdict, Counter

nlp = spacy.load("en_core_web_sm")

# Define your skill list
skill_list = [
    "Python", "TensorFlow", "Docker", "Git", "SQL", "AWS", "GCP", "React", "SEO",
    "Google Analytics", "Content Marketing", "EMR", "HIPAA", "Patient Care", "Data Analysis"
]
patterns = [nlp(skill) for skill in skill_list]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("SKILL", patterns)

# Sample job postings: one dict per posting, with its industry and description
job_data = [
    {"industry": "IT", "description": "Experience with Python, Docker, and AWS."},
    {"industry": "Marketing", "description": "Skilled in SEO, Google Analytics, and content marketing."},
    {"industry": "Healthcare", "description": "Familiarity with EMR systems, HIPAA, and patient care protocols."},
    {"industry": "IT", "description": "Proficient in SQL, Git, and cloud platforms like GCP and AWS."},
    {"industry": "Marketing", "description": "Experienced in content marketing and SEO strategies."},
    {"industry": "Healthcare", "description": "Worked with patient care software and maintained HIPAA compliance."},
]

# skills per industry
industry_skills = defaultdict(list)

for job in job_data:
    industry = job["industry"]
    doc = nlp(job["description"])
    matches = matcher(doc)
    for match_id, start, end in matches:
        skill = doc[start:end].text.lower().strip()
        industry_skills[industry].append(skill)

# frequency distribution per industry
industry_skill_freq = {
    industry: Counter(found) for industry, found in industry_skills.items()
}

# top 5 for each industry
print("🔍 Top skills by industry:\n")
for industry, freq_dist in industry_skill_freq.items():
    print(f"📂 {industry}:")
    for skill, freq in freq_dist.most_common(5):
        print(f"   {skill}: {freq}")
    print()


🔍 Top skills by industry:

📂 IT:
   aws: 2
   python: 1
   docker: 1
   sql: 1
   git: 1

📂 Marketing:
   seo: 2
   content marketing: 2
   google analytics: 1

📂 Healthcare:
   hipaa: 2
   patient care: 2
   emr: 1
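
Raw counts are hard to compare when industries have different numbers of postings. A minimal sketch of one way to normalize is shown below: for each industry, compute the fraction of its postings that mention a skill. The helper name `skill_shares` and the `(industry, skills)` input shape are assumptions for illustration, and the demo values mirror the sample IT and Marketing postings above.

```python
from collections import Counter

def skill_shares(postings):
    """Map industry -> {skill: fraction of that industry's postings mentioning it}."""
    totals = Counter()   # number of postings per industry
    mentions = {}        # industry -> Counter of postings mentioning each skill
    for industry, skills in postings:
        totals[industry] += 1
        mentions.setdefault(industry, Counter()).update(set(skills))
    return {
        industry: {skill: n / totals[industry] for skill, n in counts.items()}
        for industry, counts in mentions.items()
    }

demo_postings = [
    ("IT", ["python", "docker", "aws"]),
    ("IT", ["sql", "git", "gcp", "aws"]),
    ("Marketing", ["seo", "google analytics", "content marketing"]),
    ("Marketing", ["content marketing", "seo"]),
]
shares = skill_shares(demo_postings)
print(shares["IT"]["aws"], shares["IT"]["python"])
# 1.0 0.5
```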

