Run the cell below to output a pandas dataframe showing, for each job title, the percentage of listings that mention each skill. For example, if the `data` row has a `python` value of 40.03, then 40.03% of the Indeed listings with "data" in the job title mentioned "Python" at the time the cell was run.

Notes:
* It takes a few minutes to run because it pauses for two seconds after every request to keep the load on Indeed's servers low.
* The code fails if any skill appears in only a very small number of listings. You can add new titles or skills and modify the code accordingly, but very rare job titles or skills will probably require substantial rewriting.
* Automating the export would be nicer, but for now the printed dataframe copy-pastes easily into a spreadsheet.
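
One plausible cause of the small-count failure in the second note is that the cell reads the count as the fourth whitespace-separated token of the `searchCountPages` text and assumes that element is always present. Below is a minimal sketch of a more tolerant parser. The helper name `parse_job_count` and the sample strings are hypothetical, and the assumed text shape ("Page 1 of 42,021 jobs") is inferred from the indexing in the cell, not checked against Indeed's current markup.

```python
import re


def parse_job_count(count_text):
    """Return the listing count found in a results-count string, or 0 if none.

    Hypothetical helper: assumes the total is the last number in the text,
    for example "Page 1 of 42,021 jobs".
    """
    # Grab every run of digits, allowing thousands separators.
    numbers = re.findall(r"\d[\d,]*", count_text)
    if not numbers:
        return 0
    return int(numbers[-1].replace(",", ""))


print(parse_job_count("Page 1 of 42,021 jobs"))  # 42021
print(parse_job_count("Page 1 of 1 job"))        # 1
print(parse_job_count(""))                       # 0
```

Returning 0 for missing text treats "no count shown" as "no listings", which is itself an assumption; raising an error instead may be safer if the page layout changes.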

In [6]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

titles = ['data', 'data+analyst', 'data+scientist', 'data+engineer']
skills = ['python', 'sql', 'r', 'machine learning', 'spark']
final_list = []

    
def get_listing_count(query):
    # Fetch one Indeed search page and return the listing count it reports.
    page = requests.get("https://www.indeed.com/jobs?q={query_here}&l=United+States".format(query_here = query))
    soup = BeautifulSoup(page.content, 'html.parser')
    raw = soup.find_all(id="searchCountPages")
    # The count is read from the fourth whitespace-separated token of the text.
    number = raw[0].get_text().split()[3]
    # Pause between requests to keep the load on Indeed's servers low.
    time.sleep(2)
    return int(number.replace(',',''))


def jobs_getter(titles_list):
    for title in titles_list:
        title_query = "title%3A%22{title_here}%22".format(title_here = title)
        # Each row starts with the title and its total number of listings.
        add_to_list = [title, get_listing_count(title_query)]

        # Then one count per skill, in the order given in `skills`.
        for skill in skills:
            add_to_list.append(get_listing_count(title_query + "+" + skill))

        final_list.append(add_to_list)
    # Convert each skill count into a percentage of the title's total listings.
    percentages = []
    for job in final_list:
        title = job[0]
        jobtotal = job[1]
        skill_percents = [round((count / jobtotal)*100, 2) for count in job[2:]]
        percentages.append([title, jobtotal] + skill_percents)

    # Column names follow the `skills` list, with spaces replaced by underscores.
    columns = ['job_title', 'total'] + [skill.replace(' ', '_') for skill in skills]
    df = pd.DataFrame(percentages, columns=columns)
    return df

jobs_getter(titles)


    


        job_title  total  python    sql      r  machine_learning  spark
0            data  42021   39.74  47.04  20.31             26.04  19.02
1    data+analyst   5967   31.84  56.21  24.22              8.38   2.71
2  data+scientist   5279   71.64  52.59  64.97             76.83  22.31
3   data+engineer   4997   74.48  72.06  12.97             29.56  51.27


Next steps:
* Find a good way to track all of these variables over time. To make that easy to read, we might need a separate dataframe for each job title, with one row per date.
* Link this to a Google Sheet so that it updates automatically whenever the cell is run (perhaps monthly or bimonthly?).
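
The per-title tracking idea in the first bullet could be sketched roughly as follows. This is only one possible layout, not a settled design: the helper names `add_snapshot` and `split_by_title` and the two run dates are hypothetical, and the same snapshot (two rows and three columns taken from the output above) is reused for both dates purely to show the resulting shape.

```python
import pandas as pd


def add_snapshot(history, snapshot, run_date):
    """Append one run's results to the history, stamped with run_date."""
    dated = snapshot.copy()
    dated.insert(0, "date", pd.Timestamp(run_date))
    if history is None:
        return dated
    return pd.concat([history, dated], ignore_index=True)


def split_by_title(history):
    """Return {job_title: dataframe indexed by date} from the long history."""
    return {
        title: group.drop(columns="job_title").set_index("date").sort_index()
        for title, group in history.groupby("job_title")
    }


# Two rows from the output above, standing in for what jobs_getter returns.
snapshot = pd.DataFrame(
    [["data", 42021, 39.74, 47.04], ["data+analyst", 5967, 31.84, 56.21]],
    columns=["job_title", "total", "python", "sql"],
)

# Hypothetical run dates; the snapshot is reused only to illustrate the shape.
history = add_snapshot(None, snapshot, "2020-01-01")
history = add_snapshot(history, snapshot, "2020-02-01")

per_title = split_by_title(history)
print(per_title["data"])
```

Keeping one long history table and splitting it on demand avoids maintaining several dataframes by hand, and the long table is also the shape that would be simplest to write out to a spreadsheet later.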