# Job Description Analyzer

As part of this project, we have been given a few job descriptions, each with its own set of requirements. The aim of the project is to use plain Python and some simple text processing to analyze these job descriptions and get a list of the most in-demand skills in the industry.

To do that, please answer the questions that follow:

## Reading data

#### Q1: All the required files can be found in the `files` folder. One of the files is called `BussinessAnalyst.txt`. Using simple Python file handling, read the contents of that file and save them in a variable.

In [2]:
with open("files/BussinessAnalyst.txt", "r") as f:
    text = f.read()

print(text[:300])


The Business Analyst will gather requirements, prepare documentation, and collaborate with developers. Required skills include SQL, Excel, problem-solving, communication, and knowledge of ERP systems.


Now we must read all the files. Since the files do not follow a common naming convention, typing out every name by hand would be cumbersome. Instead, we can use the `os` module.

#### Q2: Use the `os.listdir()` function to get the names of all files to read automatically.

In [3]:
import os

file_names = os.listdir("files")
file_names


['BussinessAnalyst.txt',
 'DataAnalyst1.txt',
 'DeveloperFresher.txt',
 'ML_Engineer.txt',
 'ProjectManager.txt',
 'PythonDeveloper1.txt',
 'security_analyst.txt',
 'SoftwareEngineer.txt',
 'SoftwareEngineer2.txt',
 'SQL_developer.txt']

#### Q3: Now using a for loop, read all these files and save their contents in a list.



In [4]:
all_texts = []

for file in file_names:
    with open("files/" + file, "r") as f:
        all_texts.append(f.read())

len(all_texts)


10

#### Q4: Wrap this entire logic of reading all files into a function called `load_job_descriptions` that returns the list of contents.

In [5]:
def load_job_descriptions(folder_path="files"):
    texts = []
    for file in os.listdir(folder_path):
        with open(os.path.join(folder_path, file), "r") as f:
            texts.append(f.read())
    return texts

job_descriptions = load_job_descriptions()
len(job_descriptions)


10
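
A plain list of texts loses track of which file each description came from, and `os.listdir()` returns names in arbitrary order. As a possible variant, the sketch below keys each text by its file name. It assumes the files are UTF-8 encoded and that only `.txt` files are wanted, neither of which the exercise states; the name `load_job_descriptions_by_name` is hypothetical.

```python
import os

def load_job_descriptions_by_name(folder_path="files"):
    """Map each .txt file name in folder_path to its contents."""
    texts = {}
    # Sort the names so the result is in a stable, repeatable order
    for name in sorted(os.listdir(folder_path)):
        if not name.endswith(".txt"):
            continue  # assumption: anything else in the folder is ignored
        path = os.path.join(folder_path, name)
        with open(path, "r", encoding="utf-8") as f:
            texts[name] = f.read()
    return texts

# Usage (assuming the `files` folder from the exercise exists):
# job_descriptions_by_name = load_job_descriptions_by_name("files")
```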

## Cleaning text data for analysis

The second part of this mini project is to clean the job descriptions so that we can get some meaningful insights. We plan to do a simple count-based analysis that finds the most frequently occurring words/skills in these files.


To make that analysis meaningful, we must apply a few cleaning steps: converting everything to the same case and removing special characters such as periods (.) and commas (,). Just as important, we need to remove frequently occurring words that carry no useful information, because these words would have the highest counts without telling us anything about skills. Let's go through these steps one by one.

#### Q5: Create a function to clean job descriptions. The function should take one job description at a time, make everything lowercase, and remove special characters including periods (.), commas (,), and round brackets. Finally, it should split the text into words and return that list of words.

In [16]:
import re

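def clean_job_description(text):
    # Lowercase so that "SQL" and "sql" are counted as the same word
    text = text.lower()
    # Keep only letters and whitespace; this also strips digits and hyphens
    text = re.sub(r"[^a-z\s]", "", text)
    # Split on whitespace into a list of words
    words = text.split()
    return words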
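
It can help to run the function on a short sample to see what the regular expression actually removes. The sketch below repeats the function so that it runs on its own; the sample sentence is made up for illustration. Note that hyphens are stripped along with punctuation, so `problem-solving` becomes `problemsolving`.

```python
import re

def clean_job_description(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()

# Made-up sample sentence, not taken from the data files
sample = "Required skills include SQL, Excel (advanced), and problem-solving."
words = clean_job_description(sample)
print(words)
# ['required', 'skills', 'include', 'sql', 'excel', 'advanced', 'and', 'problemsolving']
```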


#### Q6: Now `map` this function to each job description.

In [17]:


# map() applies the function to every description; list() materializes the result
cleaned_descriptions = list(map(clean_job_description, job_descriptions))
len(cleaned_descriptions)



10

#### Q7: Since we will need the most common skills across all these job descriptions, create a single list of all words from all job descriptions.

In [18]:

all_words = []

for words in cleaned_descriptions:
    all_words.extend(words)
len(all_words)


305

#### Q8: Before we start counting the number of occurrences of each of these words, we must first remove common English words, as they will have very high counts but are not needed in our analysis. Remove the words in the list below from our `all_words` list.

In [19]:
stopwords = ["and", "the", "to", "with", "a", "of", "in", "for", "is", "on", "or", "an",
             "using", "such", "as", "we", "need", "will", "are", "also", "be",
             "it", "include", "skills"]

filtered_words = []

for word in all_words:
    if word not in stopwords:
        filtered_words.append(word)
len(filtered_words)



204
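
A membership test on a list scans it element by element, while a `set` gives the same answer with average constant-time lookups. The sketch below shows the same filtering step as a list comprehension over a set; the shortened stopword set and the word list are illustrative only.

```python
# Illustrative subset of the stopwords, stored as a set for fast lookups
stopword_set = {"and", "the", "to", "with", "a", "of"}

# Made-up word list standing in for all_words
sample_words = ["knowledge", "of", "python", "and", "sql"]

# Keep every word that is not a stopword, preserving the original order
filtered = [w for w in sample_words if w not in stopword_set]
print(filtered)  # ['knowledge', 'python', 'sql']
```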

#### Q9: Finally, create a function that takes a list of words and returns their word counts as a dictionary where the word is the key and the count is the value. Also sort that dictionary by word count using the `sorted` function in Python.

In [22]:
def word_count(words_list):
    counts = {}

    # Tally how often each word occurs
    for word in words_list:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    # Sort by count, highest first; dicts keep insertion order in Python 3.7+
    sorted_counts = dict(
        sorted(counts.items(), key=lambda item: item[1], reverse=True)
    )

    return sorted_counts


word_counts = word_count(filtered_words)
list(word_counts.items())[:20]


[('python', 6),
 ('required', 4),
 ('sql', 4),
 ('knowledge', 4),
 ('data', 4),
 ('experience', 4),
 ('communication', 3),
 ('looking', 3),
 ('design', 3),
 ('engineer', 3),
 ('candidate', 3),
 ('strong', 3),
 ('cloud', 3),
 ('business', 2),
 ('analyst', 2),
 ('excel', 2),
 ('problemsolving', 2),
 ('datasets', 2),
 ('tools', 2),
 ('like', 2)]
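
For comparison, the standard library's `collections.Counter` performs the same tally, and its `most_common()` method returns the pairs sorted by descending count. A minimal sketch with a made-up word list; the name `word_count_with_counter` is hypothetical.

```python
from collections import Counter

def word_count_with_counter(words_list):
    """Return word counts as a dict ordered from most to least frequent."""
    return dict(Counter(words_list).most_common())

sample_words = ["python", "sql", "python", "data", "sql", "python"]
print(word_count_with_counter(sample_words))
# {'python': 3, 'sql': 2, 'data': 1}
```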

#### Q10: Here we can see that the most frequently occurring words give some context and reveal the top skills required in the industry. However, there are still some irrelevant words that are not skills.

#### To make this even better, filter the above word counts to only include those key-value pairs where the key is a skill from the skills list below.

In [23]:
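# Note: cleaning stripped hyphens, so 'problem-solving' is stored as
# 'problemsolving' in word_counts and will not match here
skills = ['python', 'sql', 'powerbi', 'html', 'css', 'javascript', 'scripting', 'git', 'problem-solving', 'cloud']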

skill_counts = {}

for skill in skills:
    if skill in word_counts:
        skill_counts[skill] = word_counts[skill]

skill_counts




{'python': 6,
 'sql': 4,
 'html': 1,
 'css': 1,
 'javascript': 1,
 'scripting': 2,
 'git': 2,
 'cloud': 3}
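
In the output above, `problem-solving` is missing even though `problemsolving` appears twice in the word counts: the cleaning step removed the hyphen, so the key no longer matches the entry in the skills list. One possible fix, sketched under the assumption that skills should be normalized with the same regular expression as the text. The names `normalize` and `count_skills` are hypothetical, and the `word_counts` dictionary is a small excerpt of the results above.

```python
import re

def normalize(term):
    """Apply the same cleaning as the job descriptions to a single skill."""
    return re.sub(r"[^a-z\s]", "", term.lower())

def count_skills(skills, word_counts):
    """Look up each skill by its normalized form, keeping the original label."""
    result = {}
    for skill in skills:
        key = normalize(skill)
        if key in word_counts:
            result[skill] = word_counts[key]
    return result

# Excerpt of the word counts shown earlier
word_counts = {"python": 6, "sql": 4, "problemsolving": 2, "cloud": 3}
skills = ["python", "sql", "powerbi", "problem-solving", "cloud"]
print(count_skills(skills, word_counts))
# {'python': 6, 'sql': 4, 'problem-solving': 2, 'cloud': 3}
```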

#### Bonus Question: Now package this entire program into a single `.py` script that takes the location of the folder containing the files to read and outputs the top skills demanded in the industry.

In [24]:
import os
import re
import sys
from collections import Counter

SKILLS = [
    "python", "sql", "excel", "power bi", "tableau",
    "machine learning", "data analysis", "pandas",
    "numpy", "statistics", "communication"
]

def clean_text(text):
    text = text.lower()
    # Replace punctuation with spaces so multi-word skills still match
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text

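def extract_skills(text):
    # Return one entry per whole-word occurrence of each skill, so that
    # Counter() in main() reports how often each skill is mentioned
    found = []
    for skill in SKILLS:
        pattern = r"\b" + re.escape(skill) + r"\b"
        found.extend([skill] * len(re.findall(pattern, text)))
    return found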

def read_files_from_folder(folder_path):
    all_text = ""

    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)

        if filename.endswith(".txt"):
            with open(file_path, "r", encoding="utf-8") as file:
                all_text += file.read() + " "

    return all_text

def main():
    if len(sys.argv) < 2:
        print("Usage: python skills_extractor.py <folder_path>")
        return

    folder_path = sys.argv[1]

    # isdir() also rejects paths that exist but point to a regular file
    if not os.path.isdir(folder_path):
        print("Folder does not exist.")
        return

    print("Reading files...")

    text = read_files_from_folder(folder_path)
    text = clean_text(text)

    skills_found = extract_skills(text)
    skill_counts = Counter(skills_found)

    print("\nTop Skills Demanded:\n")

    for skill, count in skill_counts.most_common(10):
        print(f"{skill.title()} - {count}")

if __name__ == "__main__":
    main()


Folder does not exist.
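
The script matches skills against one concatenated string. An alternative worth considering is to count in how many job descriptions each skill appears, which keeps one very repetitive posting from dominating the ranking. The sketch below is one way to do that, using word boundaries so that `sql` does not match inside `mysql`. The function name `count_skill_mentions`, the shortened skills list, and the sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter

SKILLS = ["python", "sql", "power bi", "machine learning"]  # illustrative subset

def count_skill_mentions(texts, skills=SKILLS):
    """Count in how many texts each skill appears as a whole word or phrase."""
    counts = Counter()
    for text in texts:
        # Same cleaning idea as the script: lowercase, punctuation to spaces
        cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
        for skill in skills:
            if re.search(r"\b" + re.escape(skill) + r"\b", cleaned):
                counts[skill] += 1
    return counts

# Made-up job descriptions, not taken from the data files
docs = [
    "Python and SQL required. Power BI is a plus.",
    "Strong Python skills; MySQL experience helps.",
]
print(count_skill_mentions(docs).most_common())
# [('python', 2), ('sql', 1), ('power bi', 1)]
```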
