<a href="https://colab.research.google.com/github/abolfazlaghdaee/LLM_journey/blob/main/Project01_DS04_S01_NLTK_SpaCy_RezaShokrzad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


In [1]:
#📥 Download the Dataset
!wget https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv

--2025-03-19 12:00:13--  https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 646072 (631K) [text/plain]
Saving to: ‘data.csv’


2025-03-19 12:00:17 (2.95 MB/s) - ‘data.csv’ saved [646072/646072]



### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (the `wget` cell above saves it as `data.csv`).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [9]:
# your code here
import pandas as pd

# Step 2: load the CSV (the path assumes Colab's /content working directory)
df = pd.read_csv('/content/data.csv')

# Step 3: preview the first rows
df.head()

Unnamed: 0.1,Unnamed: 0,company,position,url,location,headquaters,employees,founded,industry,Job Description
0,1,Visual BI Solutions Inc,Graduate Intern (Summer 2017) - SAP BI / Big D...,https://www.glassdoor.com/partner/jobListing.h...,"Plano, TX","Plano, TX",51 to 200 employees,2010,Information Technology,"Location: Plano, TX or Oklahoma City, OK Dura..."
1,2,Jobvertise,Digital Marketing Manager,https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Berlin, Germany",1 to 50 employees,2011,Unknown,The Digital Marketing Manager is the front li...
2,3,Santander Consumer USA,"Manager, Pricing Management Information Systems",https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Dallas, TX",5001 to 10000 employees,1995,Finance,Summary of Responsibilities:The Manager Prici...
3,4,Federal Reserve Bank of Dallas,Treasury Services Analyst Internship,https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Dallas, TX",1001 to 5000 employees,1914,Finance,ORGANIZATIONAL SUMMARY: As part of the nati...
4,5,Aviall,"Intern, Sales Analyst",https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Dallas, TX",1001 to 5000 employees,Boeing,Subsidiary or Business Segment,Aviall is the world's largest provider of n...
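
In the preview above, row 4 has `Boeing` under `founded` and `Subsidiary or Business Segment` under `industry`, which suggests that some rows have shifted columns. The following is a minimal sketch of one way to surface such rows before grouping by industry. The helper name `flag_non_numeric_founded` and the toy frame are illustrative assumptions, not part of the original notebook, and the check also flags rows where `founded` is simply missing.

```python
import pandas as pd


def flag_non_numeric_founded(frame):
    """Return rows whose `founded` value cannot be parsed as a number."""
    # errors="coerce" turns unparseable values (e.g. "Boeing") into NaN
    founded = pd.to_numeric(frame["founded"], errors="coerce")
    return frame[founded.isna()]


# Toy data that mimics three rows of the preview (hypothetical subset)
toy = pd.DataFrame({
    "company": ["Visual BI Solutions Inc", "Aviall", "Santander Consumer USA"],
    "founded": ["2010", "Boeing", "1995"],
    "industry": ["Information Technology", "Subsidiary or Business Segment", "Finance"],
})

suspect = flag_non_numeric_founded(toy)
print(suspect["company"].tolist())
```

On the real data, the same call would be `flag_non_numeric_founded(df)`; whether to repair or drop the flagged rows is a judgment call that depends on how many there are.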


### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.

In [3]:
# your code here
!pip install nltk



In [10]:
import nltk

nltk.download('punkt_tab')
nltk.download('stopwords')

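from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain set of words
stopwords = set(stopwords.words('english'))

# Step 1: tokenize each job description
df['Tokenized Job Description'] = df['Job Description'].map(word_tokenize)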


df['Tokenized Job Description']

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Tokenized Job Description
0,"[Location, :, Plano, ,, TX, or, Oklahoma, City..."
1,"[The, Digital, Marketing, Manager, is, the, fr..."
2,"[Summary, of, Responsibilities, :, The, Manage..."
3,"[ORGANIZATIONAL, SUMMARY, :, As, part, of, the..."
4,"[Aviall, is, the, world, 's, largest, provider..."
...,...
152,"[Real-world, Experience, ., Life-long, Connect..."
153,"[The, Internship, Program, Our, paid, internsh..."
154,"[Are, you, an, analytical, thinker, with, a, p..."
155,"[The, Internship, Program, Our, paid, internsh..."


In [11]:
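# Steps 2 and 3: drop stopwords and non-alphanumeric tokens, then lowercase

def clear(words):
  # isalnum() also discards tokens that contain symbols or hyphens, e.g. "Real-world"
  clear_token = [word.lower() for word in words if word.isalnum() and word.lower() not in stopwords]
  return clear_token


df['Tokenized Job Description'] = df['Tokenized Job Description'].map(clear)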

df['Tokenized Job Description']


Unnamed: 0,Tokenized Job Description
0,"[location, plano, tx, oklahoma, city, ok, dura..."
1,"[digital, marketing, manager, front, line, pat..."
2,"[summary, responsibilities, manager, pricing, ..."
3,"[organizational, summary, part, nation, centra..."
4,"[aviall, world, largest, provider, new, aviati..."
...,...
152,"[experience, connections, intern, seasoned, he..."
153,"[internship, program, paid, internship, progra..."
154,"[analytical, thinker, passion, actuarial, scie..."
155,"[internship, program, paid, internship, progra..."
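
Comparing the two outputs for row 152 shows that `Real-world` and `Life-long` vanish after cleaning, because `str.isalnum()` rejects any token containing a hyphen or symbol. Skill names such as `c++`, `c#`, or `node.js` would be lost for the same reason. A minimal sketch of an alternative cleaner that keeps such tokens is shown below. It uses a regular expression instead of NLTK so that it is self-contained; the names `TOKEN_RE`, `TOY_STOPWORDS`, and `clean_keep_symbols` are hypothetical, and the tiny stopword set is a stand-in for NLTK's English list.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"and", "or", "in", "with", "the", "a", "is"}

# A token starts with a letter or digit and may continue with
# letters, digits, '+', '#', '.', or '-'
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9+#.\-]*")


def clean_keep_symbols(text):
    """Lowercase, tokenize, and drop stopwords while keeping tokens like 'c++'."""
    tokens = TOKEN_RE.findall(text.lower())
    # Strip sentence-final periods and dangling hyphens from each token
    tokens = [token.rstrip(".-") for token in tokens]
    return [token for token in tokens if token and token not in TOY_STOPWORDS]


print(clean_keep_symbols("Experience with C++, C# and Node.js is a plus."))
# → ['experience', 'c++', 'c#', 'node.js', 'plus']
```

The trade-off is that a permissive pattern also lets through some noise (for example version strings), so the character class may need tuning for a given corpus.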


### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect entities, such as organizations and products, that can stand in for skills in job descriptions.

1. Load spaCy’s English model.
2. Use NER to identify important keywords.
3. Extract words related to technical skills, tools, and expertise.

In [33]:
# your code here

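# Step 1: install spaCy and load the small English model
!pip install spacy
# If loading fails because the model is missing, run:
# !python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')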


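# Steps 2 and 3: run NER and keep entity types that may name tools or vendors

def ner(data):
  # The tokens were lowercased and stripped of stopwords in Step 2, which removes
  # casing and context the model relies on, so expect noisy entities.
  text = ' '.join(data)
  doc = nlp(text)

  extract_words = []
  for ent in doc.ents:
    # en_core_web_sm has no "SKILL" label, so only ORG and PRODUCT can match here
    if ent.label_ in ["ORG", "PRODUCT"]:
      extract_words.append(ent.text)

  return extract_words


df['extract_words'] = df['Tokenized Job Description'].map(ner)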

In [34]:
df[['Job Description','Tokenized Job Description', 'extract_words']]

Unnamed: 0,Job Description,Tokenized Job Description,extract_words
0,"Location: Plano, TX or Oklahoma City, OK Dura...","[location, plano, tx, oklahoma, city, ok, dura...",[gpa scores]
1,The Digital Marketing Manager is the front li...,"[digital, marketing, manager, front, line, pat...",[digital]
2,Summary of Responsibilities:The Manager Prici...,"[summary, responsibilities, manager, pricing, ...",[]
3,ORGANIZATIONAL SUMMARY: As part of the nati...,"[organizational, summary, part, nation, centra...","[federal reserve bank, dallas treasury service..."
4,Aviall is the world's largest provider of n...,"[aviall, world, largest, provider, new, aviati...",[]
...,...,...,...
152,Real-world Experience. Life-long Connections....,"[experience, connections, intern, seasoned, he...",[]
153,The Internship Program Our paid internship p...,"[internship, program, paid, internship, progra...",[tx department]
154,Are you an analytical thinker with a passion ...,"[analytical, thinker, passion, actuarial, scie...","[deloitte university, deloitte university, del..."
155,The Internship Program Our paid internship p...,"[internship, program, paid, internship, progra...","[tx department name ag, ag, ag, microsoft]"
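
The extracted values above are mostly organization names rather than skills, which is expected because the stock model does not ship a skill label. One common workaround is to match job text against a curated skill vocabulary. The sketch below uses spaCy's `PhraseMatcher` on a blank English pipeline, so no model download is needed. It assumes spaCy v3 (where `matcher.add` takes a list of patterns); `SKILL_VOCAB`, `nlp_blank`, and `match_skills` are hypothetical names, and the five-term vocabulary is only a placeholder.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Placeholder vocabulary; a real list would be much longer and domain-specific
SKILL_VOCAB = ["python", "sql", "hadoop", "machine learning", "microsoft office"]

# A blank pipeline provides tokenization only, which is all the matcher needs
nlp_blank = spacy.blank("en")

# attr="LOWER" makes matching case-insensitive
matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
matcher.add("SKILL", [nlp_blank.make_doc(term) for term in SKILL_VOCAB])


def match_skills(text):
    """Return vocabulary terms found in `text`, in order of appearance."""
    doc = nlp_blank.make_doc(text)
    # Each match is a (match_id, start, end) tuple of token offsets
    matches = sorted(matcher(doc), key=lambda match: match[1])
    return [doc[start:end].text.lower() for _, start, end in matches]


print(match_skills("Experience with Python, SQL and Machine Learning; Microsoft Office a plus."))
```

Applied to the notebook, this would run on the raw `Job Description` column, for example `df['Job Description'].map(match_skills)`, rather than on the cleaned tokens. Overlapping vocabulary entries (such as `microsoft` and `microsoft office`) would both match and may need deduplication.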


### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.

In [25]:
# your code here
from collections import Counter


# Flatten the per-posting entity lists into one list
skills = []
for i in df['extract_words']:
  skills.extend(i)

# Ten most frequent extracted entities
Counter(skills).most_common(10)


[('microsoft', 46),
 ('ibm', 31),
 ('microsoft office', 23),
 ('deloitte', 20),
 ('texas usa', 7),
 ('deloitte university', 7),
 ('grant thornton international ltd one', 6),
 ('phoenix house', 6),
 ('google', 4),
 ('microsoft sql', 4)]
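
These counts are raw mentions, so a posting that repeats a term (row 154 lists `deloitte university` several times) inflates its rank. A minimal sketch of counting each term at most once per posting, often called document frequency, is shown below. The function name `document_frequency` and the toy rows are illustrative assumptions.

```python
from collections import Counter


def document_frequency(rows):
    """Count in how many postings each term appears, ignoring repeats within a posting."""
    counts = Counter()
    for row in rows:
        # set() collapses repeated mentions inside one posting
        counts.update(set(row))
    return counts


# Toy data shaped like df['extract_words'] (hypothetical values)
toy_rows = [
    ["deloitte university", "deloitte university", "microsoft"],
    ["microsoft"],
    [],
]

doc_freq = document_frequency(toy_rows)
print(doc_freq["microsoft"], doc_freq["deloitte university"])
# → 2 1
```

On the notebook's data the call would be `document_frequency(df['extract_words']).most_common(10)`, which answers "how many postings ask for this" instead of "how often is this mentioned".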

### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare IT vs. Marketing vs. Healthcare, etc.

In [29]:
# your code here
# dropna returns a new DataFrame, so assign the result back to keep the change
df = df.dropna(subset=["industry", "Job Description"])
df["industry"].value_counts()

industry,count
Business Services,23
Information Technology,21
Accounting & Legal,20
Finance,17
Media,12
Manufacturing,11
Health Care,10
Unknown,8
Subsidiary or Business Segment,8
Insurance,5


In [35]:
dic_skills = {}

for industry in df["industry"].unique():
  df_industry = df[df["industry"] == industry]['extract_words']
  skills = []
  for i in df_industry:
    skills.extend(i)
  dic_skills[industry] = Counter(skills).most_common(10)
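
The same per-industry grouping can also be sketched with pandas, using `explode` to get one row per extracted term and `groupby` to split by industry. This is an alternative under stated assumptions, not the notebook's method: `top_skills_by_industry` is a hypothetical helper, and the toy frame only mimics the `industry` and `extract_words` columns.

```python
from collections import Counter

import pandas as pd


def top_skills_by_industry(frame, n=10):
    """Return {industry: [(term, count), ...]} with the n most common terms per industry."""
    # explode() yields one row per list element; empty lists become NaN and are dropped
    exploded = frame.explode("extract_words").dropna(subset=["extract_words"])
    return {
        industry: Counter(group["extract_words"]).most_common(n)
        for industry, group in exploded.groupby("industry")
    }


# Toy data shaped like the notebook's DataFrame (hypothetical values)
toy = pd.DataFrame({
    "industry": ["Information Technology", "Information Technology", "Health Care", "Health Care"],
    "extract_words": [["microsoft", "microsoft sql"], ["microsoft"], ["phoenix house"], []],
})

top = top_skills_by_industry(toy, n=2)
print(top["Information Technology"])
# → [('microsoft', 2), ('microsoft sql', 1)]
```

Because `groupby` skips rows with a missing `industry` by default, this version does not produce an entry for missing industries.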

In [42]:
for industry, skills in dic_skills.items():
    print(f"Top skills in {industry}:")
    print(skills)
    print("-" * 50)

Top skills in Information Technology:
[('microsoft', 15), ('texas usa', 6), ('microsoft sql', 4), ('microsoft office', 2), ('sap academy', 2), ('sap academy presales successfully', 2), ('sap academy presales sap presales academy', 2), ('sap academy presales looks', 2), ('metadata management data integration technologies hadoop', 2), ('gpa scores', 1)]
--------------------------------------------------
Top skills in Unknown:
[('microsoft', 2), ('digital', 1), ('google', 1), ('401k', 1), ('metadata management data integration technologies hadoop', 1)]
--------------------------------------------------
Top skills in Finance:
[('federal reserve bank', 3), ('dallas treasury services department regularly apply analytical problem', 3), ('microsoft', 3), ('new york stock exchange', 3), ('microsoft office', 2), ('google', 2), ('hris research', 1), ('ibm', 1), ('north america', 1), ('united states america', 1)]
--------------------------------------------------
Top skills in Subsidiary or Busine

In [50]:
# IT, Marketing (approximated here by Business Services), and Health Care
print(dic_skills['Information Technology'])

print(50*'-')

print(dic_skills['Business Services'])

print(50*'-')

dic_skills['Health Care']

[('microsoft', 15), ('texas usa', 6), ('microsoft sql', 4), ('microsoft office', 2), ('sap academy', 2), ('sap academy presales successfully', 2), ('sap academy presales sap presales academy', 2), ('sap academy presales looks', 2), ('metadata management data integration technologies hadoop', 2), ('gpa scores', 1)]
--------------------------------------------------
[('ibm', 30), ('microsoft', 7), ('environmentstrong prioritization skillsstrong', 2), ('dmv', 2), ('epsilonmktg alliance', 2), ('cms', 1), ('irving us', 1), ('texas usa', 1), ('ceb nyse ceb', 1), ('department ceb corporate department', 1)]
--------------------------------------------------


[('phoenix house', 6),
 ('texas phoenix house', 1),
 ('phoenix house facebo phoenix house', 1),
 ('microsoft', 1),
 ('microsoft visio', 1)]
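
To make the comparison between industries more explicit than reading the printed lists side by side, the top terms can be compared as sets. The sketch below assumes a dictionary shaped like `dic_skills`; the helper name `compare_industries` is hypothetical, and the sample values are copied from the output above.

```python
def compare_industries(top_skills, first, second):
    """Split the top terms of two industries into shared and industry-specific sets."""
    first_terms = {term for term, _ in top_skills[first]}
    second_terms = {term for term, _ in top_skills[second]}
    return {
        "shared": sorted(first_terms & second_terms),
        "only_first": sorted(first_terms - second_terms),
        "only_second": sorted(second_terms - first_terms),
    }


# Excerpt of the printed results, in the same shape as dic_skills
sample = {
    "Information Technology": [("microsoft", 15), ("microsoft sql", 4)],
    "Health Care": [("phoenix house", 6), ("microsoft", 1)],
}

comparison = compare_industries(sample, "Information Technology", "Health Care")
print(comparison)
# → {'shared': ['microsoft'], 'only_first': ['microsoft sql'], 'only_second': ['phoenix house']}
```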