#### Web scraping for Data Scientist jobs in CO (9 points total)



In this exercise we will scrape the web for **Data Scientist jobs in CO**.


Here is the link to the search query

https://www.indeed.com/jobs?q=data+scientist&l=CO

As you can see at the bottom of the page, there are links to several more result pages for this search.
If you click on the second page, the search URL changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10

If you click on the third page, the URL changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20

Hence, to reach more pages we can format the search string (**change the start=?? part**) and call **requests.get in a loop**.
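
The pagination pattern above can be sketched as a small helper that builds one search URL per results page. The names `page_urls` and `results_per_page` are illustrative assumptions, not part of the assignment; the step of 10 is taken from the example URLs, and the sketch assumes that `start=0` returns the first page.

```python
BASE_SEARCH_URL = "https://www.indeed.com/jobs?q=data+scientist&l=CO"

def page_urls(num_pages, results_per_page=10):
    """Return the search URL for each results page (hypothetical helper)."""
    return [
        f"{BASE_SEARCH_URL}&start={page * results_per_page}"
        for page in range(num_pages)
    ]

urls = page_urls(10)
print(urls[0])   # first page
print(urls[-1])  # tenth page
```

Each URL in the list can then be passed to `requests.get` inside a loop.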


# Q1 (5 points = 4 for non-indicator columns + 1 for indicator columns) Please complete the following task

- Scrape 10 pages (**the URL of the last (10th) page will be https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90**) and build a pandas DataFrame containing the following information:
    + **job title, name of the company, location, summary of job description**
    + **Indicator columns (with value True/False) for the keywords Python, SQL, AWS, RESTFUL, Machine learning, Deep Learning, Text Mining, NLP, SAS, Tableau, Sagemaker, TensorFlow, Spark**

Note:
- Make sure that you do a case-insensitive search for keywords when filling in (True/False) the indicator columns.
- You need to go to the detail page of each job posting for the keyword search, because the main search results contain only a summary of the job description. Build the URL of each detail page from the links scraped from the main search results.
- If you run into difficulties that you cannot overcome, skip this question and import the DataFrame from the provided pickle file instead.
- If you find this entire homework too difficult at your current level of expertise, please feel free to complete AlternateHwk5 instead.
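
The case-insensitive keyword search from the notes can be sketched as follows. This is one possible approach, not the required one: the helper name `keyword_flags` and the shortened keyword list are assumptions made for illustration. It uses word boundaries, which is stricter than a plain lowercase substring test; a substring test would also report `SAS` for a word such as "Kansas".

```python
import re

# Shortened keyword list, for illustration only
KEYWORDS = ["Python", "SQL", "AWS", "NLP", "SAS"]

def keyword_flags(text, keywords=KEYWORDS):
    """Map each keyword to True/False using a case-insensitive whole-word match."""
    flags = {}
    for kw in keywords:
        # re.escape guards against keywords containing regex metacharacters
        pattern = rf"\b{re.escape(kw)}\b"
        flags[kw] = re.search(pattern, text, flags=re.IGNORECASE) is not None
    return flags

flags = keyword_flags("We use python and sql at our office in Kansas.")
print(flags)
```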

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

base_url = "https://www.indeed.com"
base_search_url = f"{base_url}/jobs?q=data+scientist&l=CO"
print("calling",base_search_url)
response = requests.get(base_search_url)
zip_code_re = re.compile(r'[0-9]+')
job_df = None
job_keywords = {"Python":"has_python",
                "SQL": "has_sql", 
                "AWS": "has_aws", 
                "RESTFUL": "has_rest", 
                "Machine learning": "has_ml", 
                "Deep Learning": "has_dl", 
                "Text Mining": "has_mining", 
                "NLP": "has_nlp", 
                "SAS": "has_sas", 
                "Tableau": "has_tableau", 
                "Sagemaker": "has_sagemaker", 
                "TensorFlow": "has_tf", 
                "Spark": "has_spark"}
names = ["job_title","company","location","description"]


calling https://www.indeed.com/jobs?q=data+scientist&l=CO


In [2]:
soup = BeautifulSoup(response.text, 'html.parser')
# collect every anchor tag on the first results page
links = soup.find_all('a')

In [3]:
def get_job_details_main(href_value):
    """
    Extract the details from a job by making an http call to the specific job page
    @param href_value the hyperlink url path
    """

 #   print(job_keywords)
    job_url = f'{base_url}{href_value}'
    job_response = requests.get(job_url)
#    print(job_response.status_code)
    if job_response.status_code == 200:
        # name the parser explicitly, as in the other BeautifulSoup calls
        job_soup = BeautifulSoup(job_response.text, 'html.parser')
        content_container = job_soup.findAll(class_="jobsearch-ViewJobLayout-mainContent")
        main_content = content_container.pop()
        # convert the bs4 NavigableString to a plain str, otherwise we get a pickle error later
        job_title = str(main_content.find("h1",class_="jobsearch-JobInfoHeader-title").string)
        company_name_div = main_content.findAll("div",class_="icl-u-lg-mr--sm icl-u-xs-mr--xs")
        # we want the div that has a child <a>, but one is not always present, so default company_name to NA
        company_name = pd.NA
        for d in company_name_div:
            if d.find("a"):
                company_name = str(d.find("a").string)
        empty_divs = main_content.findAll("div",class_="")
        # the location seems to be the empty div at index 4 (the fifth one)
        location = str(list(empty_divs)[4].string)
        # locations may have zip codes in them, so regex them out
        location = re.sub(zip_code_re,"",location)
        location = location.replace(",", " ")
        location = location.strip()
        job_description_div = main_content.find(class_="jobsearch-jobDescriptionText")
        
        # there seems to be too much variation in how the job descriptions are structured to pick out a "job summary",
        # so we just pull in the entire description
        job_description = str(job_description_div.text)
        job_dict = {"job_title":[job_title],
                           "company":[company_name],
                           "location": [location],
                           "description": [job_description]}
        for keyword, col_name in job_keywords.items():
            job_dict[col_name] = keyword.lower() in job_description.lower()
        # print(job_title,"|",company_name,"|",location,"|",job_description[0:30], "|", has_keyword)
        return pd.DataFrame(job_dict)
        

In [4]:
job_df = None
def process_one_page(links):
    """
    Go through each link tag and look for ones that include /rc in the href value.
    Such an href points to a job details page.
    @param links a list of hyperlink tags
    """
    page_df = None
    for link in links:
        href = link.get("href")
    #    print(href)
        if href:
    #        print(href)
            if "/rc" in href:
    #            print("rc: ",href)
                df = get_job_details_main(href)
    #            print(df)
                if page_df is None:
                    page_df = df
                else:
                    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
                    page_df = pd.concat([page_df, df], ignore_index=True)
    #           print(page_df["job_title"])
    return page_df
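
Growing a DataFrame one row at a time inside a loop copies the data on every step. A common alternative, sketched here with toy data, is to collect the per-job frames in a list and call `pd.concat` once. The helper name `combine_frames` is an assumption for illustration; skipping `None` entries covers the case where `get_job_details_main` returns nothing because the HTTP status was not 200.

```python
import pandas as pd

def combine_frames(frames):
    """Concatenate per-job DataFrames, skipping None entries (hypothetical helper)."""
    valid = [frame for frame in frames if frame is not None]
    if not valid:
        # nothing was scraped: return an empty frame instead of failing
        return pd.DataFrame()
    return pd.concat(valid, ignore_index=True)

# Toy frames standing in for the output of get_job_details_main
a = pd.DataFrame({"job_title": ["Data Scientist"], "has_python": [True]})
b = pd.DataFrame({"job_title": ["ML Engineer"], "has_python": [False]})
combined = combine_frames([a, None, b])
print(combined)
```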


In [5]:
def process_pages(num_pages):
    """
    Loop through the desired number of pages
    @param num_pages the number of pages to process
    """
    pages_df = None
    for page_number in range(num_pages):
        start_value = page_number*10
        page_url = f'{base_search_url}&start={start_value}'
        print(f'searching for page {page_number} with url {page_url}')
        # request the paginated URL, not the base search URL, so each iteration fetches a different page
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        html_links = soup.findAll('a')
        print(f"found {len(html_links)} links")
        df = process_one_page(html_links)
        if pages_df is None:
            print("setting")
            pages_df = df
        else:
            print("appending")
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            pages_df = pd.concat([pages_df, df], ignore_index=True)
        print(f'now have {pages_df.size} after page {page_number}')
    return pages_df
        

In [6]:
indeed_job_df = process_pages(4)

searching for page 0 with url https://www.indeed.com/jobs?q=data+scientist&l=CO&start=0
found 189 links
setting
now have 187 after page 0
searching for page 1 with url https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10
found 189 links
appending
now have 374 after page 1
searching for page 2 with url https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20
found 198 links
appending
now have 561 after page 2
searching for page 3 with url https://www.indeed.com/jobs?q=data+scientist&l=CO&start=30
found 190 links
appending
now have 748 after page 3


# Q2 (1 point) Save your DataFrame to a pickle file named *indeed_job_co.pkl*.
   Load this pkl file into a DataFrame and use that DataFrame to answer the following questions.

   <font color='red'>Upload the pickle file (indeed_job_co.pkl) along with the solution notebook to Canvas.</font>

In [7]:
#write code here
# save the pickle file
indeed_job_df.to_pickle("indeed_job_co.pkl", compression="gzip")

In [8]:
# read it back in
job_df = pd.read_pickle("indeed_job_co.pkl", compression="gzip")
job_df.head()

Unnamed: 0,job_title,company,location,description,has_python,has_sql,has_aws,has_rest,has_ml,has_dl,has_mining,has_nlp,has_sas,has_tableau,has_sagemaker,has_tf,has_spark
0,Data Scientist,Visa,Highlands Ranch CO,\n Company Description\n Visa is a world lead...,True,True,False,False,False,False,False,False,True,True,False,False,False
1,Data Scientist (Junior),BDSA,Louisville CO,\nJob Summary: The Analytics team empowers BDS...,True,False,False,False,False,False,False,False,True,False,False,False,False
2,Data Scientist,ISSAC Corp,Colorado Springs CO,\nTop Reasons to work with us\n\nA small team ...,True,False,False,False,True,True,False,False,False,False,False,False,False
3,Data Scientist,NomiSo,Englewood CO,"\n\n\nLocation: Englewood, CO\n \n\n About Nom...",True,False,True,False,True,False,False,True,False,False,True,True,True
4,Data Scientist (3-5 years experience),Datalab USA,Broomfield CO,\n\n DataLab USA\n ™ is an analytics and tec...,True,True,False,False,False,False,False,False,False,True,False,False,False
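
The save and load steps can be checked with a small round trip on toy data, sketched below. The file name `example.pkl` and the toy frame are assumptions for illustration. Because the file name ends in `.pkl` rather than `.gz`, pandas cannot infer gzip compression from the extension, so `compression="gzip"` is passed explicitly on both the write and the read.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the scraped job data
df = pd.DataFrame({"job_title": ["Data Scientist"], "has_python": [True]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    df.to_pickle(path, compression="gzip")
    # the read must name the same compression used for the write
    restored = pd.read_pickle(path, compression="gzip")

print(restored.equals(df))
```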


<font size = "6" color='red'> Use pandas functionality to answer question 3</font>
# Q3 a (1 point) Which city has the most job postings?



In [11]:
job_df.groupby(["location"])['job_title'].count().sort_values(ascending=False).head(1)

location
Denver  CO    18
Name: job_title, dtype: int64
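
An alternative sketch for the same question uses `value_counts` after normalizing whitespace. The result above shows `Denver  CO` with two spaces (a side effect of replacing the comma with a space), so differently spaced variants of one city could be counted separately. The toy data below is invented for illustration and is not taken from the scraped results.

```python
import pandas as pd

# Toy locations with inconsistent spacing
jobs = pd.DataFrame(
    {"location": ["Denver  CO", "Boulder CO", "Denver  CO", " Denver CO "]}
)

# split on any run of whitespace, then rejoin with single spaces
normalized = jobs["location"].str.split().str.join(" ")
counts = normalized.value_counts()
print(counts.head(1))
```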

# Q3 b (1.5 points) Top 3 most in-demand skills (like Python, AWS, SQL ...)



In [10]:
# sum the boolean indicator columns only; each True counts as 1
job_df.filter(like="has_").sum().sort_values(ascending=False).head(3)

has_python    35
has_ml        24
has_sql       24
dtype: int64

# Q3 c (0.5 points) What other questions would you like to ask based on the Indeed data?

This is a free response question.

- Which skill pairings are most in demand (over all possible combinations of the indicator columns)? With more data we could break that down by location.
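
The skill-pairing question could be approached as sketched below: for every pair of indicator columns, count the postings where both are True. The toy DataFrame is invented for illustration; with the real data, the same loop would run over the `has_*` columns of `job_df`.

```python
from itertools import combinations

import pandas as pd

# Toy indicator columns, one row per job posting
toy = pd.DataFrame({
    "has_python": [True, True, True, False],
    "has_sql": [True, False, True, True],
    "has_aws": [False, False, True, False],
})

# count postings that mention both skills of each pair
pair_counts = {
    (a, b): int((toy[a] & toy[b]).sum())
    for a, b in combinations(toy.columns, 2)
}
top_pairs = pd.Series(pair_counts).sort_values(ascending=False)
print(top_pairs)
```

Grouping by `location` before counting would give the per-city breakdown mentioned above.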
