# Web Scraping Indeed.com for Data Science Job Requirements in Hong Kong

*By Danny Vu*

This analysis uses Python's Scrapy framework to scrape Indeed.com for Data Science job requirements. Indeed does have a [REST API](https://github.com/indeedassessments/api-documentation), but at the time of writing it was still under construction, so I scrape the site instead.

<img src="data/indeed-750.jpg" alt="Indeed.com" style="width: 300px;"/>

After scraping the pages returned for the search term `Data Science`, I analyze the job descriptions and visualize the top skills required for Data Science roles in Hong Kong.


### My Question: What are the main skills that a Data Scientist needs in Hong Kong?

As an American expat moving to Hong Kong, I wanted to find out how difficult it would be to get a job as a data scientist there. My main concerns are that I only speak English and that I only have a Bachelor's degree in Computer Science / Informatics. What skill sets does the Hong Kong job market look for?


### Summary of Results:
The most sought-after skills were Computer Science, English, and Communication, which is no surprise. What is surprising is that only about 3% of job descriptions in Hong Kong mention a PhD, which is often seen as a requirement in the United States. Another plus for expats looking for Data Science jobs in Hong Kong is that Cantonese and Mandarin each appear in only about 15-17% of job descriptions, while English, mentioned in about half of them, is the dominant language in the job market here. Among the hard skills, SQL, Python, Statistics, and Machine Learning are the most sought after.

<img src="data/IndeedAnalysis2.png" alt="Summary of Results" style="width: 1500px;"/>

<img src="data/requirementbuildingblock.png" alt="Summary of Results 2" style="width: 1500px;"/>


### Please read on for the full Data Analysis:


<img src="data/Data Science Life Cycle.png" alt="Data Science Life Cycle" style="width: 400px;"/>


### Part 1: Data Mining

What is web scraping? It is extracting content from websites. You first study the site to learn what information it holds and where that information sits in the page, specifically which HTML tags and CSS ids contain it.

Why would you do it? When a site offers no usable API, scraping lets you collect data from its web pages programmatically.

I will be using Scrapy, a complete Python framework for automated web crawling and scraping. It handles the communication between the server hosting the target website and our Python process.


#### First, I inspect the website and test out an XPath selector string using Chrome Developer Tools.

<img src="data/IndeedInspect.png" style="width: 1500px;"/>

The requirements and job description details appear to be listed in bullet points on every page. Therefore, the XPath selector for Scrapy will be `//div/ul/li/text()` to get the list of requirements and `//head/title/text()` to get the title of each page. Now, I need to create the spider to crawl across all the Data Science jobs in Hong Kong.
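
To make the two selectors concrete, here is a minimal sketch that applies the same paths to a tiny hand-written page. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy so that it runs without a crawl. ElementTree supports only a subset of XPath and needs well-formed markup, and the sample page is made up for illustration, not real Indeed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a job page (not real Indeed markup).
SAMPLE_PAGE = """
<html>
  <head><title>Data Scientist - Hong Kong - Indeed.hk</title></head>
  <body>
    <div>
      <ul>
        <li>Degree in Computer Science or Statistics</li>
        <li>Experience with Python and SQL</li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# Rough equivalent of //head/title/text()
title = root.findtext("./head/title")

# Rough equivalent of //div/ul/li/text(): every <li> directly inside a <ul>
# that is directly inside any <div>
requirements = [li.text for li in root.findall(".//div/ul/li")]

print(title)
print(requirements)
```

In the spider itself the same paths are passed to `response.xpath(...)`, which accepts full XPath and tolerates real-world HTML.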

#### This is where I crawl across the full set of Indeed search results for Data Science jobs in Hong Kong

<img src="data/indeedcrawl.png" style="width: 1500px;"/>

This is the code I created in my spider to crawl Indeed:

```
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from indeedSpider.items import Article


class ArticleSpider(CrawlSpider):
    # The name of the spider
    name = "article"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["indeed.hk"]

    # The URLs to start from. The search results max out at start=990.
    start_urls = [f"https://www.indeed.hk/jobs?q=data+science&l=Hong+Kong&start={i}" for i in range(0, 1000, 10)]

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_item method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_item"
        )
    ]

    # Spiders can take custom settings; these take priority over settings.py and apply only to this spider.
    custom_settings = {
        'DEPTH_LIMIT': 1, # this will only go one deep, come back, go to next link, and then come back, etc.
        'DOWNLOAD_DELAY' : 0.25
    }

    # No start_requests() override is needed: Scrapy's default already issues one
    # unfiltered request per URL in start_urls, and CrawlSpider then applies the rules.


    def parse_item(self, response):

        # creating a new instance of our Article() object that will store the data that inherits from Item parent object defined in our framework (items.py)
        item = Article() 

        # passes a response into this from our crawl, and here we get the title
        title = response.xpath('//head/title/text()').getall()
        # get the content
        content = response.xpath('//div/ul/li/text()').getall()

        # This is my flag - if more than 25% of all the characters are white space, we have a problem, dump the content:
        space_percent = str(content).count(' ') / len(str(content))
        if space_percent > .25:
            content = []
        # another flag, if the content size is less than this, it must be nothing:
        elif len(str(content)) < 30:
            content = []
        # another flag, removing terms of service link:
        elif title[0] == "Cookies, Privacy and Terms of Service | Indeed.hk":
            content = []
        
        # get the string of the first title inside the list of strings
        item['title'] = title[0]
        # passing content to our item
        item['content'] = content
        
        # Tracing
        print("Title is: ", title[0])
        print("Content is: ", content)

        # only return content if we didn't dump it
        if content:
            return item
```            

The spider has filters in place to dump articles that are not relevant. After running the spider crawl, these are the result statistics:

<img src="data/spiderfinish.png" style="width: 700px;"/>

The spider crawled through 4,617 links and returned 713 job descriptions.
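
The filtering rules inside `parse_item` can be pulled out into a small pure function, which makes them easy to check without running a crawl. This is a sketch only: `keep_content`, `TERMS_TITLE`, and the two keyword parameters are names introduced here for illustration, while the thresholds are the ones used in the spider.

```python
# Title of the terms-of-service page that the spider discards.
TERMS_TITLE = "Cookies, Privacy and Terms of Service | Indeed.hk"


def keep_content(title, content, max_space_ratio=0.25, min_length=30):
    """Return True when the scraped bullet points look like a real job description."""
    text = str(content)  # the spider measures the list's string form
    space_ratio = text.count(" ") / len(text)
    if space_ratio > max_space_ratio:
        return False  # mostly whitespace
    if len(text) < min_length:
        return False  # too short to be a job description
    if title == TERMS_TITLE:
        return False  # terms-of-service page
    return True


good = keep_content(
    "Data Analyst - Hong Kong - Indeed.hk",
    ["Strong SQL and Python skills", "Degree in Statistics"],
)
empty = keep_content("Some page", [])
blank = keep_content("Some page", ["   ", "      "])
terms = keep_content(TERMS_TITLE, ["Strong SQL and Python skills", "Degree in Statistics"])
print(good, empty, blank, terms)
```

Keeping the rules in one function also makes it simple to tune the thresholds against a handful of known good and bad pages.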

### Part 2: Data Cleaning

Let's import the JSON file and start cleaning it up.

In [1]:
import json
import pandas as pd

# reading the JSON data using json.load()
with open('data/indeedscrape.json') as f:
    scrapedjson = json.load(f)

# exploring what some of our JSON data looks like:
scrapedjson[3:5]

[{'title': 'Assistant Training Executive - Hong Kong - Indeed.hk',
  'content': ['Administer the Learning Management System',
   'Be the administrator of accreditation body',
   'Provide on-site support and registration of public courses',
   'Coordinate training course arrangement, including registration, certificate issuance and post-course evaluation etc.',
   'Work location: Hong Kong Science Park',
   'Diploma holder or above with at least 2 years of working experience',
   'Previous training administration experience is preferred',
   'Computer knowledge in Chinese word processor and Windows application – MS Word, Excel',
   'Outgoing, organized and hardworking',
   'Good command of written and spoken English and Chinese',
   '5-day Work Week',
   'Double Pay and Performance Bonus',
   'Dental and Medical Benefits',
   'Life Insurance',
   'Annual Leave',
   'Paid Maternity Leave, Paternity Leave, and Marriage Leave',
   'Career Advancement Opportunities',
   'On-the-Job Training

In [2]:
# let's strip the new lines:
for i in range(len(scrapedjson)):
    scrapedjson[i]['title'] = scrapedjson[i]['title'].strip()
    for j in range(len(scrapedjson[i]['content'])):
        scrapedjson[i]['content'][j] = scrapedjson[i]['content'][j].strip()
        
# take a look:
scrapedjson[4]

{'title': 'IBM Graduate Trainee (Business and Technology Consultant) - Hong Kong - Indeed.hk',
 'content': ['Business Analyst',
  'Project Manager Assistant',
  'Change Management Consultant',
  'Business Process and Functional Consultant',
  'Digital Strategy Consultant',
  'Customer Experience Designer',
  'Cloud Application Consultant',
  'Data Scientist',
  'Oracle / SAP / SalesForce Consultant',
  'Full Stack Developer',
  'Experience Platform and Mobile Developer',
  'Blockchain Developer',
  '- How do you cope with changing demands and stress? Are you flexible? Have you successfully completed several projects with competing deadlines?',
  '- Do you present information clearly, precisely and succinctly? Adapt the way you communicate to your audience? And listen to others?',
  '- Can you see a situation from a client’s viewpoint, whether that’s colleagues or customers? Can you anticipate their needs?',
  '- Do you use ingenuity, supported by logical methods and analysis, to propos

In [3]:
# Do we have any duplicate data or repeated job descriptions?
# let's put it into a dataframe then run unique on that:
import pandas as pd

df = pd.DataFrame.from_dict(scrapedjson)

df.head()

|   | content | title |
|---|---|---|
| 0 | [Involve in inhouse application development, d... | Information Technology Internship (1-year prog... |
| 1 | [Assess and analyze customers’ nutritional nee... | Nutritionist /Dietitian (R0301-D) - Hong Kong ... |
| 2 | [Perform database installation, patching, upgr... | Database Administrator - Hong Kong - Indeed.hk |
| 3 | [Administer the Learning Management System, Be... | Assistant Training Executive - Hong Kong - Ind... |
| 4 | [Business Analyst, Project Manager Assistant, ... | IBM Graduate Trainee (Business and Technology ... |


In [4]:
print(f'There are {df.shape[0]} total job descriptions.')

print(f'There are {len(df.title.unique())} unique job descriptions.')

print(f'We have {df.shape[0] - len(df.title.unique())} duplicates that need to be removed.')

There are 713 total job descriptions.
There are 556 unique job descriptions.
We have 157 duplicates that need to be removed.


157 duplicates need removing. Let's clean up our dataframe and do this.

In [5]:
df.drop_duplicates('title', inplace = True)

print(f'We now have {df.shape[0]} total job descriptions.')

We now have 556 total job descriptions.


Now, let's reorder our dataframe so the title is first.

In [6]:
# reordering the dataframe:
cols = list(df.columns)

df = df[cols[::-1]]

df.head()

|   | title | content |
|---|---|---|
| 0 | Information Technology Internship (1-year prog... | [Involve in inhouse application development, d... |
| 1 | Nutritionist /Dietitian (R0301-D) - Hong Kong ... | [Assess and analyze customers’ nutritional nee... |
| 2 | Database Administrator - Hong Kong - Indeed.hk | [Perform database installation, patching, upgr... |
| 3 | Assistant Training Executive - Hong Kong - Ind... | [Administer the Learning Management System, Be... |
| 4 | IBM Graduate Trainee (Business and Technology ... | [Business Analyst, Project Manager Assistant, ... |


### Data Exploration

I start with a large bank of data-science-related keywords and then check whether each job description mentions each skill or requirement.

The bank of words I will use is:
```
'bachelor', 'masters', 'phd', 'computer science', 'math', 'algebra', 'calculus', 'statistics', 'informatics', 'python', ' r,', ' r ', 'c+', 'c#', 'java ', 'javascript', 'node', 'angular', 'html', 'php', 'bootstrap', 'django', 'css', 'wordpress', 'd3.js', 'tableau', 'jupyter', 'seaborn', 'sql', 'nosql', 'mysql', 'postgres', 'mongodb', ' api', 'scala', 'hadoop', 'spark', 'tensor', 'nltk', ' ai', 'cloud', ' etl', 'azure', 'data lake', 'data model', 'hortonworks', 'pyspark', 'numpy', 'pandas', 'linux', 'unix', 'deep learning', 'neural network', 'machine learning', 'data mining', 'research', 'recommendation systems', 'scrap', 'scikit', 'keras', 'agile', 'communication', 'interpersonal', 'fresh grad', 'people skills', 'english', 'mandarin', 'cantonese'
```

In [7]:
# content visualization testing
str(df.iloc[0,1]).lower()

"['involve in inhouse application development, documentation and testing', 'provide insight and analysis on web/social applications', 'perform r&d related projects as assigned by supervisors', 'undergraduate, preferable penultimate students who are studying it programming, computer science or related disciplines', 'with basic knowledge on nodejs, javascript, web service, mysql/mongo db', 'understand facebook/wechat app development will be an advantage', 'good interpersonal and communication skill', 'a team player and is ready to take on challenges', 'an individual who is highly motivated and with keen interest to learn and explore new technology skills and business applications']"

In [8]:
# base conditional test:
'node' in str(df.iloc[0,1]).lower()

True
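
A plain substring check like the one above also matches inside longer words, for example `'java'` inside `'javascript'`, which is presumably why some keywords in the bank carry padding spaces. As a hedged alternative, the sketch below matches a keyword only when it stands as a whole token. The helper name `mentions` is introduced here for illustration and is not part of the notebook.

```python
import re


def mentions(keyword, text):
    """True if `keyword` appears in `text` as a whole token, ignoring case."""
    # The lookarounds behave like word boundaries but also work for keywords
    # that end in a symbol, such as 'c#', where a plain \b would fail.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


jd = "With basic knowledge on NodeJS, JavaScript, web service, MySQL/Mongo DB"

substring_hit = "java" in jd.lower()      # plain substring check
token_hit = mentions("java", jd)          # whole-token check
javascript_hit = mentions("javascript", jd)
csharp_hit = mentions("c#", "experience in c#/.net")

print(substring_hit, token_hit, javascript_hit, csharp_hit)
```

The trade-off is that whole-token matching is stricter: it would no longer count `node` inside `NodeJS` or `sql` inside `MySQL`, so the keyword bank would need those variants listed explicitly.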

In [9]:
# building a list of dictionaries containing keywords and values of 0 or 1 (depending on if the word is contained or not in the JD) with the index value == index of our df
keywords = ['bachelor', 'masters', 'phd', 'computer science', 'math', 'algebra', 'calculus', 'statistics', 'informatics', 'python', ' r,', ' r ', 'c+', 'c#', 'java ', 'javascript', 'node', 'angular', 'html', 'php', 'bootstrap', 'django', 'css', 'wordpress', 'd3.js', 'tableau', 'jupyter', 'seaborn', 'sql', 'nosql', 'mysql', 'postgres', 'mongodb', ' api', 'scala', 'hadoop', 'spark', 'tensor', 'nltk', ' ai', 'cloud', ' etl', 'azure', 'data lake', 'data model', 'hortonworks', 'pyspark', 'numpy', 'pandas', 'linux', 'unix', 'deep learning', 'neural network', 'machine learning', 'data mining', 'research', 'recommendation systems', 'scrap', 'scikit', 'keras', 'agile', 'communication', 'interpersonal', 'fresh grad', 'people skills', 'english', 'mandarin', 'cantonese']
listdict = []

# for each row in the dataframe
for i in range(df.shape[0]):
    # create empty dictionary for that row
    keyword_value = {}
    # set index number
    keyword_value['index'] = i
    # for that row, go through each of the keywords
    for word in keywords:
        # check if word is in the content of that row, make it a 1 if it is, 0 if not
        if word in str(df.iloc[i,1]).lower():
            keyword_value[word] = 1
        else:
            keyword_value[word] = 0
    # appending to our list the dictionary corresponding to that row of data
    listdict.append(keyword_value)

listdict[0]

{'index': 0,
 'bachelor': 0,
 'masters': 0,
 'phd': 0,
 'computer science': 1,
 'math': 0,
 'algebra': 0,
 'calculus': 0,
 'statistics': 0,
 'informatics': 0,
 'python': 0,
 ' r,': 0,
 ' r ': 0,
 'c+': 0,
 'c#': 0,
 'java ': 0,
 'javascript': 1,
 'node': 1,
 'angular': 0,
 'html': 0,
 'php': 0,
 'bootstrap': 0,
 'django': 0,
 'css': 0,
 'wordpress': 0,
 'd3.js': 0,
 'tableau': 0,
 'jupyter': 0,
 'seaborn': 0,
 'sql': 1,
 'nosql': 0,
 'mysql': 1,
 'postgres': 0,
 'mongodb': 0,
 ' api': 0,
 'scala': 0,
 'hadoop': 0,
 'spark': 0,
 'tensor': 0,
 'nltk': 0,
 ' ai': 0,
 'cloud': 0,
 ' etl': 0,
 'azure': 0,
 'data lake': 0,
 'data model': 0,
 'hortonworks': 0,
 'pyspark': 0,
 'numpy': 0,
 'pandas': 0,
 'linux': 0,
 'unix': 0,
 'deep learning': 0,
 'neural network': 0,
 'machine learning': 0,
 'data mining': 0,
 'research': 0,
 'recommendation systems': 0,
 'scrap': 0,
 'scikit': 0,
 'keras': 0,
 'agile': 0,
 'communication': 1,
 'interpersonal': 1,
 'fresh grad': 0,
 'people skills': 0,
 'eng

In [10]:
# Turning our list into a temporary dataframe
df2 = pd.DataFrame(listdict)
df2.set_index('index', inplace=True, drop=True)

df2.head()

| index | ai | api | etl | r | r, | agile | algebra | angular | azure | bachelor | ... | scikit | scrap | seaborn | spark | sql | statistics | tableau | tensor | unix | wordpress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [11]:
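# merging the dfs together and aligning them by index.
# Caution: merge() does an inner join on index labels here. df still carries its
# original labels after drop_duplicates(), while df2 is labelled 0..n-1 by position,
# so rows only line up if df's index is reset first, e.g. df.reset_index(drop=True).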
indeed = df.merge(df2, left_index = True, right_index = True)

indeed.head()

|   | title | content | ai | api | etl | r | r, | agile | algebra | angular | ... | scikit | scrap | seaborn | spark | sql | statistics | tableau | tensor | unix | wordpress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Information Technology Internship (1-year prog... | [Involve in inhouse application development, d... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | Nutritionist /Dietitian (R0301-D) - Hong Kong ... | [Assess and analyze customers’ nutritional nee... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Database Administrator - Hong Kong - Indeed.hk | [Perform database installation, patching, upgr... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | Assistant Training Executive - Hong Kong - Ind... | [Administer the Learning Management System, Be... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | IBM Graduate Trainee (Business and Technology ... | [Business Analyst, Project Manager Assistant, ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
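
One thing to watch with an index-based merge: `drop_duplicates` keeps the original row labels, whereas a frame built by looping over positions is labelled 0..n-1. The toy example below (made-up data, hypothetical variable names) sketches what an inner join on the index does in that situation and how resetting the index first changes the result.

```python
import pandas as pd

# Toy stand-in for the scraped data: rows 0 and 1 share a title.
jobs = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "content": ["python", "python", "sql", "python"],
})
jobs = jobs.drop_duplicates(subset="title")   # remaining labels: 0, 2, 3

# Keyword flags built by position and labelled 0..n-1.
flags = pd.DataFrame(
    {"python": [int("python" in c) for c in jobs["content"]]},
    index=range(len(jobs)),
)

# Inner join on labels: label 3 has no partner and is dropped, and job "B"
# (label 2) picks up the flag that was computed for job "C".
naive = jobs.merge(flags, left_index=True, right_index=True)

# Resetting the labels first keeps every row and lines them up by position.
aligned = jobs.reset_index(drop=True).merge(flags, left_index=True, right_index=True)

print(naive[["title", "python"]])
print(aligned[["title", "python"]])
```

Comparing `indeed.shape[0]` with `df.shape[0]` right after the merge is a quick way to see whether any rows were lost this way.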


As the next two cells show, a keyword that appears in the scraped content is recorded as a 1 in the final dataframe:

In [12]:
scrapedjson[0]

{'title': 'Information Technology Internship (1-year programme) - Hong Kong - Indeed.hk',
 'content': ['Involve in inhouse application development, documentation and testing',
  'Provide insight and analysis on web/social applications',
  'Perform R&D related projects as assigned by supervisors',
  'Undergraduate, preferable penultimate students who are studying IT Programming, Computer Science or related disciplines',
  'With basic knowledge on NodeJS, JavaScript, web service, MySQL/Mongo DB',
  'Understand Facebook/WeChat App development will be an advantage',
  'Good interpersonal and communication skill',
  'A team player and is ready to take on challenges',
  'An individual who is highly motivated and with keen interest to learn and explore new technology skills and business applications']}

In [13]:
indeed.iloc[0,:]

title                     Information Technology Internship (1-year prog...
content                   [Involve in inhouse application development, d...
 ai                                                                       0
 api                                                                      0
 etl                                                                      0
 r                                                                        0
 r,                                                                       0
agile                                                                     0
algebra                                                                   0
angular                                                                   0
azure                                                                     0
bachelor                                                                  0
bootstrap                                                                 0
c#          

### Data Analysis

Once we have our final dataframe, it is very easy to do data analysis. For instance, if we wanted to see the percentage of data science jobs that mention SQL in their requirements, we could run the following:

In [14]:
# % jobs requiring SQL
round(indeed.sql.sum() / indeed.shape[0] * 100, 2)

24.26

What about the percentage of Data Science jobs in Hong Kong mentioning Python?

In [15]:
# % jobs requiring python
round(indeed.python.sum() / indeed.shape[0] * 100, 2)

22.68

What percentage of jobs here mention a PhD?

In [16]:
# % jobs requiring phd
round(indeed.phd.sum() / indeed.shape[0] * 100, 2)

3.17

Only 3 percent! That's comforting to hear. However, many job descriptions might not explicitly state what their requirements are and thus the percentages found might be a bit low.

The most useful data that we might find would be what requirements got mentioned most. Let's find the sum of these keywords and sort them.

In [17]:
indeed.iloc[:,2:].sum().sort_values(ascending=False)

computer science          263
english                   228
communication             212
bachelor                  127
interpersonal             110
sql                       107
python                    100
math                       90
mandarin                   77
research                   73
statistics                 71
cantonese                  66
machine learning           55
fresh grad                 48
 ai                        47
cloud                      42
c+                         37
c#                         32
scala                      31
agile                      30
unix                       29
linux                      29
javascript                 29
hadoop                     24
 r,                        23
java                       23
data model                 22
 api                       22
html                       22
tableau                    18
                         ... 
node                       12
 etl                       12
php       

That's a useful series. Let's turn that into a dataframe and add percentages as a column.

In [18]:
columns = ['keyword', 'mentions', 'percentage']

# getting our series
summary = indeed.iloc[:,2:].sum().sort_values(ascending=False)

# turning it into a dataframe
summary = summary.to_frame()

# resetting the index:
summary = summary.reset_index()

# renaming the columns:
summary.columns = ['keyword', 'mentions']

# adding a column that is the percentage
# remember that the percentage is the number of mentions divided by the total number of articles 
# so we need to divide the count of the mentions here by the previous dataframe's shape, not this one!
summary['percentage'] = round(summary.mentions / indeed.shape[0] * 100, 2)

If you remember, there were two keywords made to catch all mentions of the programming language 'r' - they were `' r '` and `' r,'`. Let's combine them.

In [19]:
# boolean masks for the two 'r' keyword rows:
is_r_comma = summary.keyword == ' r,'
is_r_space = summary.keyword == ' r '

# adding the mentions together (.sum() pulls the single matching value out as a scalar):
summary.loc[is_r_comma, 'mentions'] += summary.loc[is_r_space, 'mentions'].sum()

# recalculating the percentage:
summary.loc[is_r_comma, 'percentage'] = round(summary.loc[is_r_comma, 'mentions'] / indeed.shape[0] * 100, 2)

# dropping the other 'r' row:
summary = summary.drop(summary[summary.keyword == ' r '].index)

# sorting it:
summary = summary.sort_values(by='mentions', ascending=False)
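
Adding the two counts assumes that no job description matched both `' r,'` and `' r '`; any description that matched both is counted twice. A hedged alternative is to combine the two flags per description before summing. The sketch below uses made-up, lower-cased descriptions purely for illustration.

```python
# Hypothetical lower-cased job descriptions, for illustration only.
descriptions = [
    "experience in python, r, or sas",
    "strong r skills; knowledge of r, python and sql",
    "java developer",
]

# One flag per description for each of the two 'r' patterns.
comma_hits = [" r," in d for d in descriptions]
space_hits = [" r " in d for d in descriptions]

# Adding the two totals counts the second description twice.
summed = sum(comma_hits) + sum(space_hits)

# OR-ing the flags row by row counts each description at most once.
either = sum(c or s for c, s in zip(comma_hits, space_hits))

print(summed, either)
```

With the notebook's dataframe, the same idea would be a row-wise OR of the two flag columns before calling `.sum()`.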

#### Now we have our final Data Analysis table with the most-mentioned requirements for Data Science related jobs in Hong Kong

In [20]:
summary

|   | keyword | mentions | percentage |
|---|---|---|---|
| 0 | computer science | 263 | 59.64 |
| 1 | english | 228 | 51.70 |
| 2 | communication | 212 | 48.07 |
| 3 | bachelor | 127 | 28.80 |
| 4 | interpersonal | 110 | 24.94 |
| 5 | sql | 107 | 24.26 |
| 6 | python | 100 | 22.68 |
| 7 | math | 90 | 20.41 |
| 8 | mandarin | 77 | 17.46 |
| 9 | research | 73 | 16.55 |
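
As a sanity check on the table, each percentage is the mention count divided by the number of rows in `indeed`, times 100. The sketch below back-solves that row count from a few of the published figures. The helper `pct` and the search range are assumptions made here, and the resulting row count is inferred from the numbers shown, not printed anywhere in the notebook.

```python
# (mentions, percentage) pairs copied from the summary table.
reported = [(263, 59.64), (228, 51.70), (107, 24.26), (100, 22.68)]


def pct(mentions, total):
    """Share of job descriptions mentioning a keyword, as a percentage."""
    return round(mentions / total * 100, 2)


# Try every row count up to the 713 scraped items and keep the ones that
# reproduce all of the reported percentages.
matching_totals = [
    n for n in range(1, 714)
    if all(pct(m, n) == p for m, p in reported)
]
print(matching_totals)
```

Under these assumptions the only row count that reproduces the published percentages is 441, which is lower than the 556 unique job descriptions counted earlier and worth comparing against `indeed.shape[0]`.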


In [21]:
### Export Data out to CSV files:
summary.to_csv(path_or_buf="data/summary_indeed.csv", index=False)

### Data Visualization

Let's take a look at a couple of data visualizations, created with Tableau and hosted on Tableau Public:

In [22]:
%%html
<div class='tableauPlaceholder' id='viz1551589175203' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;in&#47;indeed_Tableau_2&#47;BuildingBlocks&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='indeed_Tableau_2&#47;BuildingBlocks' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;in&#47;indeed_Tableau_2&#47;BuildingBlocks&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1551589175203');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

In [23]:
%%html

<div class='tableauPlaceholder' id='viz1551588363414' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;in&#47;indeed_Tableau_2&#47;BarChart&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='indeed_Tableau_2&#47;BarChart' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;in&#47;indeed_Tableau_2&#47;BarChart&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1551588363414');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

### Summary of Results

It is no surprise that Computer Science, English, and Communication skills are highly sought after in Data Science jobs here in Hong Kong. What is surprising is that only about 3% of job descriptions in Hong Kong mention a PhD, which is often seen as a requirement in the United States. Another plus for expats looking for Data Science jobs in Hong Kong is that Cantonese and Mandarin each appear in only about 15-17% of job descriptions, while English, mentioned in about half of them, is the dominant language in the job market here. Among the hard skills, SQL, Python, Statistics, and Machine Learning are the most sought after.