# Introduction
The job-finding industry has evolved from "help wanted" signs hung outside stores, to a newspaper section devoted to job posts, to what we have now: websites that connect employers with applicants. Due to the efficiency of these sites, most people looking for a job have to cast a wide net and apply to hundreds of jobs to even get an interview. With job posts being misleading, it can feel like you are spending more time looking for positions to apply to than actually applying. This inefficiency happens for a few reasons. The category "entry level" encompasses many positions, from those meant specifically for newly graduating seniors to others that require candidates to have 5+ years of industry experience. This means filtering for entry-level positions is not helpful. In addition, the keyword search that most job sites offer returns results that can match any level of experience: looking up "Software Engineer" may return senior-level, mid-level, and entry-level positions. At the end of the day, a lot of time is wasted reading through job descriptions only to find that they do not match what you are looking for, when that time could be spent applying to another position for which you may be qualified.


# The Approach
In the evolution of the job finding industry, from "help wanted" signs to sophisticated online platforms, the quest for efficiency and accuracy in connecting job seekers with suitable positions remains paramount. The current landscape, characterized by the need to sift through hundreds of job postings due to misleading titles and broad categorization, underscores the necessity for a refined approach. This essay draws upon recent research to justify an innovative method aimed at streamlining the job search process, leveraging advancements in Natural Language Processing (NLP) and Machine Learning (ML) to address inherent inefficiencies.

The foundational piece of this discussion, "NLP Techniques for Job Market Trend Analysis," provides insight into how NLP can be utilized to dissect and understand complex job descriptions. By analyzing the semantics and structure of job postings, NLP techniques can categorize positions more accurately than traditional keyword-based searches. This nuanced understanding allows for the differentiation between truly entry-level roles and those requiring significant experience, thus enabling job seekers to navigate listings more effectively and apply to positions that match their qualifications and career stage.

Furthermore, the "Skills and Requirements" document underscores the importance of aligning job seekers' skills with employer expectations. A detailed examination of common requirements across various industries reveals the gap between job descriptions and the actual skills needed for a position. By integrating these insights into job matching platforms, it's possible to develop more precise filters and recommendations, guiding applicants toward opportunities where they are most likely to succeed and fulfill employers' expectations.

The most compelling argument for the proposed approach comes from "Machine Learning and Job Posting Classification: A Comparative Study," which explores the application of ML classifiers in distinguishing real from fake job postings. Beyond this primary function, these classifiers hold the potential for categorizing jobs into detailed levels and types based on the requirements and experience needed. Implementing such a classification system can dramatically improve the job search experience, moving beyond the limitations of keyword searches to offer a curated list of opportunities tailored to each job seeker's profile.

The integration of NLP and ML into the job finding process addresses two primary challenges: the broad and often misleading categorization of "entry level" positions and the inefficiency of keyword-based searches. By accurately interpreting job descriptions and requirements, the proposed method ensures that job seekers spend less time filtering through irrelevant postings and more time applying to positions for which they are well-suited. Moreover, by classifying postings based on detailed criteria, platforms can offer personalized recommendations, further enhancing the efficiency of the job search process.

In conclusion, the transition to a more sophisticated job finding mechanism, underpinned by NLP and ML, represents a significant leap forward in matching job seekers with suitable employment opportunities. This approach not only mitigates the challenges posed by traditional job search methods but also aligns with the evolving dynamics of the labor market. By leveraging the insights and techniques outlined in the aforementioned research, the proposed project stands to offer a more streamlined, accurate, and user-friendly job search experience, marking a pivotal advancement in the industry's ongoing evolution.

# The Dataset
The dataset used to train this model will be created with JobSpy, an open-source program that scrapes job postings from various websites. The dataset will consist of job postings from LinkedIn, Indeed, ZipRecruiter, and Glassdoor returned by the searches "entry level software developer," "new graduate software developer," and "junior software engineer," as these search results should correlate most strongly with jobs that require candidates to have no industry experience. The dataset keeps growing: each time the program is run, the results are saved and appended to the previous dataset. This means that it will contain real-world job postings. I used software developer because it is what most closely relates to me, but the search input can be changed to anything.

Jobspy: https://github.com/Bunsly/JobSpy


In [1]:
pip install -U python-jobspy

Collecting python-jobspy
  Downloading python_jobspy-1.1.51-py3-none-any.whl.metadata (9.1 kB)
Downloading python_jobspy-1.1.51-py3-none-any.whl (29 kB)
Installing collected packages: python-jobspy
  Attempting uninstall: python-jobspy
    Found existing installation: python-jobspy 1.1.50
    Uninstalling python-jobspy-1.1.50:
      Successfully uninstalled python-jobspy-1.1.50
Successfully installed python-jobspy-1.1.51
Note: you may need to restart the kernel to use updated packages.


In [4]:
from jobspy import scrape_jobs
import pandas as pd
import time

In [6]:
jobs1: pd.DataFrame = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
    search_term="entry level software developer",
    hours_old=24,
    results_wanted=25,  # be wary: the higher it is, the more likely you'll get blocked (a rotating proxy can help)
    country_indeed="USA",
    # proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)

time.sleep(30)

jobs2: pd.DataFrame = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
    search_term="new graduate software developer",
    hours_old=24,
    results_wanted=25,  # be wary: the higher it is, the more likely you'll get blocked (a rotating proxy can help)
    country_indeed="USA",
    # proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)

time.sleep(30)

jobs3: pd.DataFrame = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
    search_term="junior software engineer",
    hours_old=24,
    results_wanted=25,  # be wary: the higher it is, the more likely you'll get blocked (a rotating proxy can help)
    country_indeed="USA",
    # proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)


# formatting for pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", 50)  # set to 0 to see full job url / desc


print(jobs1)



2024-03-15 17:06:19,369 - JobSpy - INFO - Indeed search page: 1
2024-03-15 17:06:19,377 - JobSpy - INFO - LinkedIn search page: 1
2024-03-15 17:06:19,791 - JobSpy - INFO - ZipRecruiter search page: 1
2024-03-15 17:06:20,333 - JobSpy - INFO - ZipRecruiter finished scraping
2024-03-15 17:06:20,968 - JobSpy - INFO - Glassdoor search page: 1
2024-03-15 17:06:21,388 - JobSpy - INFO - Linkedin finished scraping
2024-03-15 17:06:22,740 - JobSpy - INFO - Indeed search page: 2
2024-03-15 17:06:35,016 - JobSpy - INFO - Indeed found no jobs on page: 2
2024-03-15 17:06:35,016 - JobSpy - INFO - Indeed finished scraping
2024-03-15 17:06:42,649 - JobSpy - INFO - Glassdoor finished scraping
2024-03-15 17:07:12,978 - JobSpy - INFO - Indeed search page: 1
2024-03-15 17:07:12,985 - JobSpy - INFO - LinkedIn search page: 1
2024-03-15 17:07:13,407 - JobSpy - INFO - ZipRecruiter search page: 1
2024-03-15 17:07:14,339 - JobSpy - INFO - Glassdoor search page: 1
2024-03-15 17:07:14,494 - JobSpy - INFO - Indeed 

             site                                            job_url  \
52      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
53      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
54      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
55      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
56      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
57      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
58      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
59      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
60      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
61      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
62      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
63      glassdoor  https://www.glassdoor.com/job-listing/j?jl=100...   
64      glassdoor  https://www.glassdoor.com/job-listing/j?jl=10

# EDA
We will now perform an exploratory data analysis (EDA) to get a better understanding of our data, and we will also clean the data. This will happen in two parts, because the NLP step only requires us to care about duplicates and the quality of the description. After deleting duplicates, we will clean the description by checking whether the web scraper captured the whole description and by removing HTML code. We will also add the title of the position to the description so that we can use that data to further understand the job posting. Lastly, we will remove any fields that are not necessary for training the model.

In [7]:
existing_data = pd.read_json("./jobs.json", lines=True)
jobs = pd.concat([jobs1, jobs2, jobs3, existing_data], ignore_index=True)
jobs.shape

(529, 28)

In [8]:
jobs = jobs.drop_duplicates()
jobs.to_json('./jobs.json', orient='records', lines=True)

In [9]:
jobs.shape

(507, 28)

In [10]:
jobs.head(50)

Unnamed: 0,site,job_url,job_url_direct,title,company,location,job_type,date_posted,interval,min_amount,max_amount,currency,is_remote,emails,description,company_url,company_url_direct,company_addresses,company_industry,company_num_employees,company_revenue,company_description,logo_photo_url,banner_photo_url,ceo_name,ceo_photo_url,num_urgent_words,benefits
0,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,Associate ServiceNow Developer (Remote),ICF,"Reston, VA",,2024-03-15,yearly,57737.0,98153.0,USD,False,icfcareercenter@icf.com,\*We are open to supporting 100% remote work a...,https://www.glassdoor.com/Overview/W-EI_IE1865...,,,,,,,,,,,,
1,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,MuleSoft Developer (REMOTE),Everlight Solar,"Omaha, NE",,2024-03-15,yearly,60000.0,100000.0,USD,False,,Everlight Solar is seeking a skilled MuleSoft ...,https://www.glassdoor.com/Overview/W-EI_IE3183...,,,,,,,,,,,,
2,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,"Manager, Software Engineering","Pharmacy Data Management, Inc.",,,2024-03-15,,,,,True,,If you are a visionary leader with a passion f...,https://www.glassdoor.com/Overview/W-EI_IE1424...,,,,,,,,,,,,
3,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,Senior Software Engineer - Full Stack,"Apkudo, Inc.",United States,,2024-03-15,yearly,115000.0,130000.0,USD,False,,**Job Title:** Senior Software Engineer - Full...,https://www.glassdoor.com/Overview/W-EI_IE1467...,,,,,,,,,,,,
4,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,GraphQL Developer,Infoservices LLc,"Chicago, IL",,2024-03-15,,,,,False,,**Job Title: GraphQL Developer**\n\n**Location...,https://www.glassdoor.com/Overview/W-EI_IE4235...,,,,,,,,,,,,
5,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,Mainframe Developer,Talent Group,,,2024-03-15,hourly,40.0,42.0,USD,True,,* 5+ experience with CICS - Mainframe and MQ e...,https://www.glassdoor.com/Overview/W-EI_IE1255...,,,,,,,,,,,,
6,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,Sr. Clarion Consultant / Clarion Software Deve...,Finezi,,,2024-03-15,hourly,55.0,65.0,USD,True,,***ROLE: Clarion Software Developer***\n\n***L...,,,,,,,,,,,,,
7,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,Salesforce Developer with MuleSoft Experience ...,Everlight Solar,"Indianapolis, IN",,2024-03-15,yearly,95000.0,100000.0,USD,False,,Everlight Solar is seeking a skilled Salesforc...,https://www.glassdoor.com/Overview/W-EI_IE3183...,,,,,,,,,,,,
8,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,IT Application Developer,Northern Arizona Healthcare Corporation,"Flagstaff, AZ",,2024-03-15,yearly,66437.0,94375.0,USD,False,,Overview:\n\nUnder the direction of Leadership...,https://www.glassdoor.com/Overview/W-EI_IE1712...,,,,,,,,,,,,
9,glassdoor,https://www.glassdoor.com/job-listing/j?jl=100...,,iOS Developer with Swift Experience (REMOTE),Everlight Solar,"Minneapolis, MN",,2024-03-15,yearly,60000.0,80000.0,USD,False,,Everlight Solar is seeking a skilled iOS devel...,https://www.glassdoor.com/Overview/W-EI_IE3183...,,,,,,,,,,,,
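
The description-cleaning steps outlined at the start of this section could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cleaning code: the helper names (`strip_html`, `looks_truncated`, `build_model_text`) and the truncation heuristic (a minimum length and a trailing ellipsis) are assumptions made for the example.

```python
import html
import re

# Hypothetical helpers for the cleaning steps described in this section;
# the names and the truncation heuristic are assumptions.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def looks_truncated(text: str, min_chars: int = 200) -> bool:
    """Heuristic: very short text or a trailing ellipsis suggests an incomplete scrape."""
    stripped = text.strip()
    return len(stripped) < min_chars or stripped.endswith(("...", "…"))


def build_model_text(title: str, description: str) -> str:
    """Prepend the job title to the cleaned description."""
    return f"{title.strip()}. {strip_html(description)}"


sample = "<p>We are hiring a <b>Junior Developer</b> &amp; tester.</p>"
print(strip_html(sample))
```

On the `jobs` DataFrame, helpers like these could be applied to the `description` column with `apply`, once rows with a missing description have been dropped or filled.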


# NLP
In the endeavor to refine the job search process, the implementation of Natural Language Processing (NLP) stands as a cornerstone for extracting and analyzing nuanced information from job descriptions. This essay delineates the meticulous steps involved in deploying NLP to categorize jobs accurately based on experience levels and specific qualifications, thus enhancing the precision and efficiency of job matching.

## Preprocessing: The Foundation of Text Analysis
The journey begins with preprocessing, a critical phase where textual data undergoes a transformation to become conducive to analysis. This phase involves converting all text to lowercase to ensure consistency across the dataset. Subsequently, the text is tokenized, breaking down the job descriptions into individual words or tokens, laying the groundwork for a granular examination of the content. Common words, known as stop words, which hold minimal analytical value, are removed. Additionally, stemming or lemmatization processes are applied, standardizing words to their root form. This preprocessing not only cleans the data but also harmonizes it, setting a robust foundation for the intricate analysis to follow.
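
As a rough illustration of these preprocessing steps, the sketch below lowercases, tokenizes, removes stop words, and stems a description using only the standard library. The stop-word set and the suffix-stripping `crude_stem` function are deliberately tiny stand-ins invented for this example; a real pipeline would more likely rely on the stop-word list and lemmatizer of a library such as spaCy or NLTK.

```python
import re

# A tiny hand-picked stop-word list and a crude suffix stripper, used here only
# as stand-ins for a real stop-word list and a real stemmer or lemmatizer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are", "we"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list:
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("We are looking for Developers with 2+ years of testing experience"))
# ['look', 'developer', '2+', 'year', 'test', 'experience']
```

The tokenizer keeps `+` and `#` so that terms such as "2+" or "c#" survive as single tokens, which matters for job descriptions.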

## Feature Extraction: Unearthing Relevant Information
With the data primed, the focus shifts to feature extraction, a pivotal step where the trained NLP model identifies and extracts phrases and terms pivotal for categorizing jobs by experience level and qualifications. Techniques such as Named Entity Recognition (NER) are employed to classify key pieces of information into predefined categories, while pattern matching, especially through the use of regular expressions, aids in identifying specific expressions related to job qualifications, such as years of required experience.
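
The regular-expression part of this step might look something like the sketch below, which pulls "years of experience" phrases out of a description. The pattern and the function name `find_experience` are assumptions for illustration; the pattern covers forms such as "3+ years", "2-4 years", and "5 years", and will miss spelled-out numbers such as "five years".

```python
import re
from typing import List, Optional, Tuple

# Illustrative pattern: a 1-2 digit minimum, an optional "+" or an optional
# upper bound ("2-4", "2 to 4"), followed by "year(s)" or "yr(s)".
YEARS_RE = re.compile(
    r"\b(?P<min>\d{1,2})\s*(?:\+|(?:-|–|to)\s*(?P<max>\d{1,2}))?\s*(?:years?|yrs?)",
    re.IGNORECASE,
)


def find_experience(text: str) -> List[Tuple[int, Optional[int]]]:
    """Return (minimum, maximum-or-None) pairs for each experience phrase found."""
    return [
        (int(m.group("min")), int(m.group("max")) if m.group("max") else None)
        for m in YEARS_RE.finditer(text)
    ]


text = "Requires 3+ years of Python and 2-4 years of SQL; at least 5 years overall."
print(find_experience(text))
# [(3, None), (2, 4), (5, None)]
```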

## Leveraging Pre-trained Models: A Shortcut to Depth
The utilization of pre-trained models like BERT or GPT marks a significant leap in the analysis. These models, having been trained on vast expanses of text, possess a profound understanding of language nuances. By fine-tuning these models for specific tasks, such as text classification or entity recognition, they become adept at classifying job descriptions into precise categories (entry-level, mid-level, senior-level) and extracting detailed requirements (years of experience, required skills, educational qualifications).

## Post-processing and Categorization: The Art of Interpretation
The extracted information undergoes post-processing, where it is analyzed and mapped to specific categories. This process involves not just the categorization of jobs based on experience level but also the quantification of qualitative data, such as converting phrases indicating years of experience into numerical values. This nuanced interpretation and categorization enable the development of a sophisticated job filtering and search mechanism, directly addressing the needs and qualifications of job seekers.
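
One possible shape for this post-processing is sketched below: the numeric minimums found in a posting are reduced to a single requirement, which is then mapped to a level. The cut-offs (0-1 years for entry, 2-4 for mid, 5 or more for senior) and the choice to take the largest stated minimum are assumptions made for the example, not values taken from this project.

```python
from typing import List, Optional


def required_years(minimums: List[int]) -> Optional[int]:
    """Take the largest stated minimum as the posting's requirement."""
    return max(minimums) if minimums else None


def level_from_years(min_years: Optional[int]) -> str:
    """Map a numeric requirement to a level using hypothetical cut-offs."""
    if min_years is None:
        return "unspecified"
    if min_years <= 1:
        return "entry-level"
    if min_years <= 4:
        return "mid-level"
    return "senior-level"


print(level_from_years(required_years([3, 2, 5])))  # senior-level
print(level_from_years(required_years([])))         # unspecified
```

Taking the largest minimum is the conservative choice: a posting that asks for "3 years of Python" and "5 years overall" is treated as a 5-year role.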

## Implementation and Integration: Bringing Analysis to Action
The final stride in this journey is the integration of the processed and categorized data into the job finding platform. This crucial phase ensures that the insights gleaned from the NLP analysis directly enhance the platform's functionality, allowing users to navigate through job listings with unparalleled precision. The implementation of NLP thus transcends mere analysis, embodying a transformation in how job seekers connect with potential opportunities, making the search for the perfect job not just faster but significantly more meaningful.

In essence, the application of NLP within the job search domain is a testament to the power of language processing in bridging the gap between job seekers and the roles they aspire to fill. By meticulously preprocessing text, extracting relevant features, leveraging advanced models, and integrating these insights into job search platforms, the project not only simplifies the job search process but elevates it, ensuring that every job seeker finds their rightful place in the professional world.

In [None]:
import spacy
import pandas as pd

# Load spaCy English model
nlp = spacy.load('en_core_web_sm')

# Function to extract job level and years of experience from the job description
def extract_details(description):
    years_exp = "Not specified"
    job_level = "Not specified"

    # Missing descriptions (None/NaN) cannot be parsed
    if not isinstance(description, str):
        return pd.Series([job_level, years_exp])

    text = description.lower()
    doc = nlp(text)

    # Checking for years of experience: a number preceded by "with", "over",
    # or "at least" and followed by "year"/"years"
    for token in doc:
        # Skip the first and last tokens, where nbor() would raise an IndexError
        if token.text.isdigit() and 0 < token.i < len(doc) - 1:
            prev_word = token.nbor(-1).text
            # "at least" is split into two tokens, so rebuild the phrase
            if prev_word == 'least' and token.i >= 2 and token.nbor(-2).text == 'at':
                prev_word = 'at least'
            if prev_word in ['with', 'over', 'at least'] and token.nbor(1).text in ['year', 'years']:
                years_exp = token.text + "+ years"

    # Identifying job level based on common phrases (checked once per description)
    if "entry level" in text:
        job_level = "Entry Level"
    elif "senior" in text:
        job_level = "Senior"
    elif "new grad" in text:
        job_level = "New Grad"

    return pd.Series([job_level, years_exp])

# Assuming 'jobs' DataFrame has already been populated and has a 'description' column
# Apply the function and create new columns for job level and years of experience
jobs[['Job Level', 'Years of Experience']] = jobs['description'].apply(extract_details)

# Display the updated DataFrame
jobs


In [None]:
# This cell reuses the 'Job Level' and 'Years of Experience' columns that the
# previous cell added to `jobs` with extract_details.

# Verification Step:
# Display a random sample of the processed DataFrame to verify the extraction
print("Sample of extracted data:")
print(jobs[['description', 'Job Level', 'Years of Experience']].sample(5))

# Additionally, you can check for specific cases to ensure accuracy
print("\nVerifying specific cases:")
# Check entries marked as "Entry Level"
entry_level_cases = jobs[jobs['Job Level'] == "Entry Level"]
print(f"Number of 'Entry Level' cases: {len(entry_level_cases)}")

# Check entries with "Not specified" years of experience
not_specified_exp = jobs[jobs['Years of Experience'] == "Not specified"]
print(f"Number of cases with 'Not specified' years of experience: {len(not_specified_exp)}")


# Logistic Regression

In the realm of talent acquisition and career progression, the task of sorting job postings into various career levels represents a fundamental challenge. With an ever-growing pool of opportunities and candidates, efficiently categorizing these postings is crucial for both job seekers and employers. Leveraging machine learning techniques, particularly logistic regression, presents a pragmatic approach to address this challenge effectively.

At the core of the task lies a dataset comprising approximately 500 job postings, each characterized by a set of features including salary, years of experience, degree requirements, and career level indicators. The primary objective is to categorize these postings into distinct career levels such as entry-level, mid-level, and senior-level roles.

From a practical standpoint, logistic regression is a reasonable choice for several reasons. Firstly, its interpretability offers insight into the relationships between features and the predicted career levels: each coefficient shows how strongly, and in which direction, a feature pushes a posting toward a given level. Because there are more than two career levels, the multinomial form of logistic regression is used. In the context of talent acquisition, understanding the factors that separate one career level from another is valuable, and logistic regression allows stakeholders to discern the impact of individual features, thereby supporting informed decisions.

Moreover, logistic regression's simplicity and computational efficiency align seamlessly with the project's constraints. With a dataset of modest size, the need for a model that balances performance with resource utilization becomes apparent. Logistic regression fulfills this requirement admirably, offering respectable predictive capabilities without necessitating extensive computational resources or complex tuning procedures.

Furthermore, logistic regression works well with categorical features once they are encoded numerically. In job postings, attributes such as degree requirements are inherently categorical, and a simple one-hot encoding is enough to feed them to the model. This keeps preprocessing light, streamlining the model development process and expediting time-to-insight.

Additionally, logistic regression's amenability to regularization techniques further enhances its suitability for the task at hand. By applying techniques such as L1 (Lasso) or L2 (Ridge) regularization, the risk of overfitting inherent in small datasets is reduced. This helps the model generalize to unseen postings, even in the presence of limited data.
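
A minimal sketch of such a classifier, assuming scikit-learn is available, is shown below. The training rows are made up purely for illustration, and the column names (`min_salary`, `years_exp`, `degree`, `level`) are assumptions rather than the project's actual schema. The categorical column is one-hot encoded, the numeric columns are standardized, and the model relies on scikit-learn's default L2 regularization.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; the column names are assumptions.
train = pd.DataFrame(
    {
        "min_salary": [55000, 60000, 62000, 85000, 90000, 95000, 130000, 140000, 150000],
        "years_exp": [0, 0, 1, 3, 4, 3, 7, 8, 10],
        "degree": ["bachelors", "none", "bachelors", "bachelors", "masters",
                   "bachelors", "masters", "bachelors", "masters"],
        "level": ["entry"] * 3 + ["mid"] * 3 + ["senior"] * 3,
    }
)

# Scale numeric columns and one-hot encode the categorical one.
features = ColumnTransformer(
    [
        ("numeric", StandardScaler(), ["min_salary", "years_exp"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["degree"]),
    ]
)

# LogisticRegression applies L2 regularization by default;
# C is the inverse regularization strength (smaller C = stronger penalty).
model = Pipeline(
    [
        ("features", features),
        ("classifier", LogisticRegression(C=1.0, max_iter=1000)),
    ]
)
model.fit(train.drop(columns="level"), train["level"])

new_posting = pd.DataFrame({"min_salary": [58000], "years_exp": [0], "degree": ["bachelors"]})
predicted = model.predict(new_posting)
print(predicted[0])
```

With only around 500 postings, a smaller `C` (stronger regularization) chosen by cross-validation would be a natural next thing to try.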

In summary, the decision to employ logistic regression for the classification of job postings into career levels represents a pragmatic and well-founded approach. By prioritizing interpretability, simplicity, and computational efficiency, logistic regression enables stakeholders to glean actionable insights from the dataset while achieving respectable predictive performance. As the landscape of talent acquisition continues to evolve, logistic regression stands as a steadfast tool for navigating the complexities of job posting classification with confidence and efficacy.