# Web Scraping Indeed.com for Data Science Job Requirements in Hong Kong

This analysis uses Python's Scrapy framework to scrape Indeed.com for Data Science job requirements. Indeed does have a [REST API](https://github.com/indeedassessments/api-documentation), but at the time of writing it is still under construction, so I will scrape the web pages instead.
<img src="data/indeed-750.jpg" alt="Indeed.com" style="width: 300px;"/>
After scraping the pages returned for the search term `Data Science`, I will analyze the job descriptions and generate a word cloud of the top skills required for Data Science roles in Hong Kong.

<img src="data/Generic_DataScienceWordCloud.jpeg" alt="Data Science Word Cloud" style="width: 500px;"/>

### Business Question: What are the main skills that a Data Scientist needs in Hong Kong?

<img src="data/Data Science Life Cycle.png" alt="Data Science Life Cycle" style="width: 500px;"/>

### Part 1: Data Mining

What is web scraping? It is the extraction of content from websites. First you study the website to learn what information it holds and where that information sits in the page, specifically which HTML tags and ids contain it.

Why would you do it? When a site provides no API for the information you need, you can still collect the data programmatically from its web pages.
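To make the idea of "HTML tags and ids" concrete, here is a minimal sketch that uses only Python's standard library. The HTML fragment, the `jobDescription` id, and the names `ListItemParser` and `extract_list_items` are all made up for illustration; they are not taken from Indeed's real markup. It is also deliberately naive: it assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the id and the text are invented for this sketch.
SAMPLE_HTML = """
<html>
  <head><title>Data Scientist - Hong Kong</title></head>
  <body>
    <div id="jobDescription">
      <ul>
        <li>Python and SQL</li>
        <li>Machine learning experience</li>
      </ul>
    </div>
  </body>
</html>
"""


class ListItemParser(HTMLParser):
    """Collect the text of every <li> nested inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth_in_target = 0  # > 0 while the parser is inside the target element
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if self.depth_in_target:
            self.depth_in_target += 1
            if tag == "li":
                self.in_li = True
                self.items.append("")
        elif dict(attrs).get("id") == self.target_id:
            self.depth_in_target = 1

    def handle_endtag(self, tag):
        if self.depth_in_target:
            self.depth_in_target -= 1
            if tag == "li":
                self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def extract_list_items(html, target_id):
    """Return the stripped text of each <li> found under the element with target_id."""
    parser = ListItemParser(target_id)
    parser.feed(html)
    return [item.strip() for item in parser.items]


requirements = extract_list_items(SAMPLE_HTML, "jobDescription")
print(requirements)
```

Writing a parser like this by hand for every site gets tedious quickly, which is one reason to reach for a framework with selector support.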

We will be using Scrapy, a complete Python framework for automated web crawling and scraping. It handles the communication between the server hosting the target website and our Python console.


#### First, we inspect the website and test an XPath selector string using Chrome Developer Tools.

<img src="data/IndeedInspect.png" alt="Data Science Life Cycle" style="width: 1500px;"/>

It seems that the requirements and job description details are always listed in bullet points. Therefore, our XPath selector for Scrapy will be `//div/ul/li/text()` to get the list of requirements and `//head/title/text()` to get the title of each page.
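As a rough illustration of what those two selectors pick out, the sketch below runs equivalent queries on a small, hypothetical page using the standard library's `xml.etree.ElementTree`. Two assumptions to keep in mind: ElementTree supports only a subset of XPath and has no `text()` step, so the code selects the elements and reads `.text` instead; and it requires well-formed markup, which real job pages often are not. Scrapy's own selectors do not have these limitations.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a job page.
SAMPLE_PAGE = """
<html>
  <head><title>Data Scientist | Indeed.hk</title></head>
  <body>
    <div>
      <ul>
        <li>Experience with Python or R</li>
        <li>Strong SQL skills</li>
      </ul>
    </div>
    <p>Not a requirement</p>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)  # root is the <html> element

# Rough equivalent of //head/title/text()
title = root.findtext("./head/title")

# Rough equivalent of //div/ul/li/text(): select the <li> elements, then read their text
requirements = [li.text for li in root.findall(".//div/ul/li")]

print(title)
print(requirements)
```

Note that the `<p>` element is skipped: only list items nested as `div > ul > li` match the path.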

![image.png](attachment:image.png)

You can see from a sample webpage that we are getting the text we want. Now, we need to create the spider to crawl across all the Data Science jobs in Hong Kong.


This is the spider I wrote to crawl Indeed:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from indeedSpider.items import Article


class ArticleSpider(CrawlSpider):
    # The name of the spider
    name = "article"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["indeed.hk"]

    # The URLs to run on. The site's pagination maxes out at start=990.
    start_urls = [f"https://www.indeed.hk/jobs?q=data+science&l=Hong+Kong&start={i}" for i in range(0, 1000, 10)]

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_item method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_item"
        )
    ]

    # Spiders can take custom settings, which have higher priority than settings.py; these apply only to this spider.
    custom_settings = {
        'DEPTH_LIMIT': 1, # this will only go one deep, come back, go to next link, and then come back, etc.
        'DOWNLOAD_DELAY' : 0.25
    }

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)


    def parse_item(self, response):

        # Create a new Article instance to hold the data; Article is defined in items.py and inherits from Scrapy's Item
        item = Article()

        # The crawl passes each response in here; get the page title.
        # Fall back to an empty title so title[0] below cannot raise IndexError on pages without a <title>.
        title = response.xpath('//head/title/text()').getall() or [""]
        # get the content
        content = response.xpath('//div/ul/li/text()').getall()

        # This is my flag - if more than 25% of all the characters are white space, we have a problem, dump the content:
        space_percent = str(content).count(' ') / len(str(content))
        if space_percent > .25:
            content = []
        # another flag, if the content size is less than this, it must be nothing:
        elif len(str(content)) < 30:
            content = []
        # another flag, removing terms of service link:
        elif title[0] == "Cookies, Privacy and Terms of Service | Indeed.hk":
            content = []
        
        # get the string of the first title inside the list of strings
        item['title'] = title[0]
        # passing content to our item
        item['content'] = content
        
        # Tracing
        print("Title is: ", title[0])
        print("Content is: ", content)

        # only return content if we didn't dump it
        if content:
            return item
```            
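The three checks inside `parse_item` can be tried out without running a crawl. The sketch below restates them as a standalone function; the name `keep_content` and its parameters are hypothetical, and the sample inputs are invented, but the thresholds and the order of the checks follow the spider above.

```python
TERMS_TITLE = "Cookies, Privacy and Terms of Service | Indeed.hk"


def keep_content(title, content, max_space_ratio=0.25, min_length=30):
    """Return True if a scraped list of strings looks like a job description."""
    text = str(content)  # the spider measures the string form of the list
    # Flag 1: too much white space suggests layout debris rather than text
    if text.count(" ") / len(text) > max_space_ratio:
        return False
    # Flag 2: very short content is treated as empty
    if len(text) < min_length:
        return False
    # Flag 3: skip the terms of service page
    if title == TERMS_TITLE:
        return False
    return True


good = [
    "Experience with Python, SQL and machine learning",
    "Degree in statistics or computer science",
]

print(keep_content("Data Scientist", good))             # kept
print(keep_content("Data Scientist", ["Python"]))       # too short
print(keep_content("Data Scientist", [" " * 20, "x"]))  # mostly white space
print(keep_content(TERMS_TITLE, good))                  # terms of service page
```

Pulling the checks into a pure function like this makes it easier to tune the thresholds on a few saved pages before committing to an overnight crawl.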

The spider has filters in place to dump pages that are not relevant. After running the crawl overnight, these are the resulting statistics:

![image.png](attachment:image.png)

The spider crawled through 4,617 links and returned 713 job descriptions.

### Part 2: Data Cleaning

Let's import the JSON file and start cleaning it up.

In [None]:
import json
import pandas as pd

indeed