# **Capstone Part 1:** Collecting Job Data Using APIs


#### Student Author: Abigail Hedden

## Objectives
*   Collect job data from Jobs API
*   Save the collected data to an Excel spreadsheet


## Dataset Used in this Assignment

* The dataset used in this lab comes from https://www.kaggle.com/promptcloud/jobs-on-naukricom and is distributed under a **Public Domain license**.
* Note: This lab uses a modified subset of that dataset.
* The original dataset is a CSV file. The course authors converted it to JSON.

## Lab: Collect Jobs Data using Jobs API


In [17]:
# import required libraries
import pandas as pd
import json
from openpyxl import Workbook

### Objective: Determine the number of jobs currently open for various technologies and for various locations


Collect the number of job postings for the following locations using the API:

* Los Angeles
* New York
* San Francisco
* Washington DC
* Seattle
* Austin
* Detroit


In [5]:
# define list of locations
locations = ['Los Angeles', 'New York', 'San Francisco', 'Washington DC','Seattle', 'Austin', 'Detroit']

# initialize counters for target locations
location_counts = {location: 0 for location in locations}

try:
    # read and parse the JSON file
    with open("jobs.json", 'r', encoding='utf-8') as file:
        jobs_data = json.load(file)
    
    # handle different JSON structures
    # if a list, simply assign to jobs_list
    if isinstance(jobs_data, list):
        jobs_list = jobs_data
    # if a dictionary, take all the values from the dictionary (ignores the keys), convert them into a list, and assign it to jobs_list
    elif isinstance(jobs_data, dict):
        jobs_list = list(jobs_data.values())
    # if structure is something else, print an error message and set jobs_list to an empty list 
    else:
        print("Unexpected JSON structure")
        jobs_list = []
    
    print(f"Total jobs in dataset: {len(jobs_list)}")
    
    # count the number of jobs for each location
    for job in jobs_list:
        if isinstance(job, dict) and 'Location' in job:
            job_location = job['Location'].strip()
            # check if job location matches any target location
            for location in locations:
                if location.lower() in job_location.lower():
                    location_counts[location] += 1
                    break

except Exception as e:
    print("An error occurred while processing the jobs data:", e)
    
# display results
# sort by count 
sorted_locations = sorted(location_counts.items(), key=lambda x: x[1], reverse=True)
 
for location, count in sorted_locations:
    print(f"{location:<15}: {count:>6} jobs")

total_jobs = sum(location_counts.values())
print(f"{'Total in list of locations':<15}: {total_jobs:>6} jobs")

Total jobs in dataset: 27005
Washington DC  :   5316 jobs
Detroit        :   3945 jobs
Seattle        :   3375 jobs
New York       :   3226 jobs
Los Angeles    :    640 jobs
San Francisco  :    435 jobs
Austin         :    434 jobs
Total in list of locations:  17371 jobs
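
The list-or-dictionary handling in the cell above is repeated in each function later in this lab. As a minimal sketch, it could be factored into one helper. `load_jobs` is a hypothetical name, not part of the original lab, and unlike the cell above it raises an error on an unexpected structure instead of printing a message.

```python
import json


def load_jobs(path="jobs.json"):
    """Return the job records stored in `path` as a list (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)

    # a top-level list is already the list of job records
    if isinstance(data, list):
        return data
    # a top-level dictionary: keep the values and ignore the keys
    if isinstance(data, dict):
        return list(data.values())
    # anything else is not a shape this lab knows how to read
    raise ValueError(f"Unexpected JSON structure: {type(data).__name__}")
```

With a helper like this, each counting function could take the already-loaded list as an argument, so the file is read once rather than once per call.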


### Write a function to get the number of jobs for the Python technology
  
##### The keys in the JSON are:
* Job Title
* Job Experience Required
* Key Skills
* Role Category
* Location
* Functional Area
* Industry
* Role

#### Define function

In [8]:
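# API endpoint from the original lab; not used below, because this notebook reads jobs.json directly
api_url="http://127.0.0.1:5000/data"

def get_number_of_jobs_T(technology):
    """Return (technology, count) for jobs that mention the technology.

    The match is a case-insensitive substring search over Job Title, Key Skills,
    Role Category, Functional Area, Industry, and Role. Location and
    Job Experience Required are not searched.
    """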
    try:
        # read and parse the JSON file
        with open('jobs.json', 'r', encoding='utf-8') as file:
            jobs_data = json.load(file)
        
        if isinstance(jobs_data, list):
            jobs_list = jobs_data
        elif isinstance(jobs_data, dict):
            jobs_list = list(jobs_data.values())
        else:
            print('Unexpected JSON structure')
            return technology, 0
        
        technology_count = 0
        technology_lower = technology.lower()
        
        # search through each job
        for job in jobs_list:
            if isinstance(job, dict):
                # Fields to search for the technology
                search_fields = [
                    job.get('Job Title', ''),
                    job.get('Key Skills', ''),
                    job.get('Role Category', ''),
                    job.get('Functional Area', ''),
                    job.get('Industry', ''),
                    job.get('Role', '')
                ]
                
                # check if technology is mentioned in any of the keys
                for field in search_fields:
                    if field and technology_lower in field.lower():
                        technology_count += 1
                        break  
        
        return technology, technology_count
    
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return technology, 0

#### Call the function for **Python** to ensure it works

In [9]:
# call function for python
get_number_of_jobs_T('Python')

('Python', 1189)
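
Because `get_number_of_jobs_T` uses a plain substring test, a short name such as `'C'` also matches any field that merely contains the letter c, and `'Java'` also matches `JavaScript`. The sketch below shows one possible stricter test that treats the technology as a whole token. `mentions_technology` is a hypothetical helper, not part of the original lab, and the set of characters treated as part of a token (letters, digits, `+`, `#`) is an assumption chosen to keep `C`, `C++`, and `C#` apart.

```python
import re


def mentions_technology(text, technology):
    """Return True if `technology` occurs in `text` as a whole token, ignoring case."""
    # the lookbehind and lookahead reject matches that touch another
    # letter, digit, '+' or '#', so 'C' does not match inside 'C++' or 'Scala'
    pattern = (
        r"(?<![A-Za-z0-9+#])"
        + re.escape(technology)
        + r"(?![A-Za-z0-9+#])"
    )
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

Swapping this test in for the substring check would lower the counts for short names, so the results would no longer match the output shown above.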

### Write a function to find number of jobs in US for a location of your choice

#### Define function

In [11]:
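def get_number_of_jobs_L(location):
    """Return (location, count) for jobs whose Location field contains the location (case-insensitive)."""
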
    try:
        # read and parse the JSON file
        with open('jobs.json', 'r', encoding='utf-8') as file:
            jobs_data = json.load(file)
        
        # handle different JSON structures
        if isinstance(jobs_data, list):
            jobs_list = jobs_data
        elif isinstance(jobs_data, dict):
            jobs_list = list(jobs_data.values())
        else:
            print("Unexpected JSON structure")
            return location, 0
        
        number_of_jobs = 0
        location_lower = location.lower()
        
        # count every job whose Location field mentions the requested location
        for job in jobs_list:
            if isinstance(job, dict) and 'Location' in job:
                job_location = job['Location'].strip().lower()
                if location_lower in job_location:
                    number_of_jobs += 1
        
        return location, number_of_jobs
    
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return location, 0

#### Call the function for Los Angeles to ensure it works

In [13]:
# call function for L.A.
get_number_of_jobs_L('Los Angeles')

('Los Angeles', 640)

## Store the results in an Excel file
#### *Locations*

In [36]:
# create a python list of all locations for which you need to find the number of jobs postings (already written above as well)
locations = ['Los Angeles', 'New York', 'San Francisco', 'Washington DC','Seattle', 'Austin', 'Detroit']

In [37]:
# create a workbook and select the active worksheet
wb = Workbook()
ws1 = wb.active

# give worksheet a title
ws1.title = "location-job-counts"

# add headers
ws1.append(['Location', 'Number of Jobs'])

#### Add the job counts for each location to the spreadsheet
* *Find the number of job postings for each location in the above list*
* *Write the location name and the number of job postings into the Excel spreadsheet*

In [38]:
# loop through each location and find number of jobs, adding info to excel sheet
for location in locations:
    location_name, number_of_jobs = get_number_of_jobs_L(location)
    ws1.append([location_name, number_of_jobs])

In [39]:
# save into an excel spreadsheet named 'job-postings.xlsx'
wb.save("job-postings.xlsx")


#### *Technologies*

Collect the number of job postings for the following technologies using the API:

*   C
*   C#
*   C++
*   Java
*   JavaScript
*   Python
*   Scala
*   Oracle
*   SQL Server
*   MySQL Server
*   PostgreSQL
*   MongoDB


In [40]:
# create a python list of all technologies 
technologies = ['C', 'C#', 'C++', 'Java', 'JavaScript', 'Python', 'Scala', 'Oracle', 'SQL Server', 'MySQL Server', 'PostgreSQL','MongoDB']

In [41]:
# make new worksheet in existing workbook
ws2 = wb.create_sheet(title="tech-job-counts")

# add headers
ws2.append(['Technology', 'Number of Jobs'])

#### Add the job counts for each technology to the spreadsheet

In [42]:
# loop through each technology and find number of jobs, adding info to excel sheet
for technology in technologies:
    tech_name, number_of_jobs = get_number_of_jobs_T(technology)
    ws2.append([tech_name, number_of_jobs])

In [43]:
# save into an excel spreadsheet named 'job-postings.xlsx'
wb.save("job-postings.xlsx")
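
To confirm that both worksheets were written, the saved file can be read back. This is a minimal sketch using `openpyxl.load_workbook`. `read_counts` is a hypothetical helper, not part of the original lab, and it assumes the first row of each sheet is the header row added above.

```python
from openpyxl import load_workbook


def read_counts(path, sheet_name):
    """Return the data rows of `sheet_name` as tuples, skipping the header row."""
    workbook = load_workbook(path)
    worksheet = workbook[sheet_name]
    # min_row=2 starts after the header; values_only=True yields cell values, not Cell objects
    return [tuple(row) for row in worksheet.iter_rows(min_row=2, values_only=True)]
```

For example, `read_counts("job-postings.xlsx", "location-job-counts")` should return one `(location, count)` tuple per entry in `locations`, and the same call with `"tech-job-counts"` should return one per entry in `technologies`.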

# *Original Course Lab Authors & Contributors*

## Author


Ayushi Jain


### Other Contributors


Rav Ahuja

Lakshmi Holla

Malika


Copyright © 2022 IBM Corporation. All rights reserved. 
