# Example - Netflix job postings

### Introduction

Like some other companies, Netflix posts its job offers on a platform called Lever. **Netflix job postings** can be found at `jobs.lever.co/netflix`. I call this page the **main page**. On any given day, it displays about 500 postings, which can be filtered by city, team and work type. Most of the postings on display are for teams in the Streaming division.

The main page contains, for each available position, basic information about the job, such as the job title, the location and the team, and a link to a page specific to that position, such as `jobs.lever.co/netflix/2d11d912-bfb3-4d9d-bfa1-0ce036214284`. I call that specific page the **individual page**. The individual page presents a description of the company and of the role of the new employee.

### Capturing the source code

**HTTP** is a protocol for communication between clients and servers. For instance, a client (such as your browser) sends an **HTTP request** to the server. Then the server returns a response to the client. The response contains status information about the request and, when the request is accepted, the requested content.

**GET** is one of the most common HTTP methods. It is used to request data from a specified resource. In the Python package `requests`, the function `get` is an implementation of the HTTP method GET. `requests` comes with the Anaconda distribution, so we can import it directly.

In [1]:
import requests

The `requests` function `get` returns an object of a special type (type `requests.models.Response`). The attribute `text` of this object is a string which, for an ordinary web page, is the HTML source code.

In [2]:
page = requests.get('https://jobs.lever.co/netflix').text

Now, `page` is a string containing the source code of the Netflix Lever main page.
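
In practice, a request can fail or be refused, so it may be worth checking the response before using its text. The following is a minimal sketch, not part of the workflow above: the helper name `fetch_html` and the 10-second timeout are my own assumptions, while the `timeout` argument and the method `raise_for_status` are standard features of `requests`.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML source of url as a string (hypothetical helper)."""
    # The timeout avoids waiting forever if the server does not answer
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, `page = fetch_html('https://jobs.lever.co/netflix')` would play the role of the cell above, stopping with an explicit error instead of silently returning an error page.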

### Parsing the source code

To parse HTML code and to learn the tree structure it conveys, I use the package `lxml` (not the only option). More specifically, I use the function `fromstring` from the subpackage `html`. We import this subpackage with:

In [3]:
from lxml import html

In [4]:
tree = html.fromstring(page)

`fromstring` returns a special `lxml` object: 

In [5]:
type(tree)

lxml.html.HtmlElement

### Getting the ID's for the job postings

I am going to extract from `tree` the pieces of information I am interested in, by means of adequate **XPath expressions**. How can you find an adequate XPath expression? There are many ways, and every web scraper has their own cookbook. The simplest approach is based on the *Inspect* tool of the browser. Right-click on the *APPLY* button of a job post, opening a contextual menu, and select *Inspect*. This will open a window showing a view of the source code in which the node containing a link to the page of that job post is highlighted.

Let me start with the ID's of the job postings. The nodes containing the ID are `div` nodes which have the ID as the value of the attribute `data-qa-posting-id`. So I use the XPath expression `'//div/@data-qa-posting-id'`:

In [6]:
id = tree.xpath('//div/@data-qa-posting-id')
id[:5]

['4dfd6b7b-e020-44ae-a9f5-64b631b20e9c',
 'a1291f11-99e7-4213-87c0-fcbac8e4c6a7',
 'f5296ea6-b3bd-4cf1-9167-268e70218838',
 'e8999fd7-e3ac-4af3-8667-0bd002123528',
 '44b91dcb-0735-4b94-a171-81c43133e5c1']

I get a list with the ID's of 523 job postings:

In [7]:
len(id)

523
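
To see what these XPath expressions select, it may help to try them on a tiny HTML fragment. The fragment below is invented for illustration (the ID's, titles and `example.com` links are hypothetical) and only mimics the attributes discussed in this example. It is not the actual Lever source.

```python
from lxml import html

# Hypothetical fragment imitating two job postings
snippet = """
<div class="postings-group">
  <div class="posting" data-qa-posting-id="id-001">
    <a href="https://example.com/id-001"><h5 data-qa="posting-name">Data Engineer</h5></a>
  </div>
  <div class="posting" data-qa-posting-id="id-002">
    <a href="https://example.com/id-002"><h5 data-qa="posting-name">Art Director</h5></a>
  </div>
</div>
"""
toy_tree = html.fromstring(snippet)

# '//div' matches div nodes at any depth; '/@name' picks an attribute value
toy_ids = toy_tree.xpath('//div/@data-qa-posting-id')
print(toy_ids)      # ['id-001', 'id-002']

# '[...]' filters nodes by attribute value; '/text()' picks the node's text
toy_titles = toy_tree.xpath('//h5[@data-qa="posting-name"]/text()')
print(toy_titles)   # ['Data Engineer', 'Art Director']

# The same logic applies to links: the href attribute of the a nodes
toy_links = toy_tree.xpath('//a/@href')
print(toy_links)    # ['https://example.com/id-001', 'https://example.com/id-002']
```

Note that the outer `div` has no `data-qa-posting-id` attribute, so it contributes nothing to the first list.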

The links to the individual pages can be directly obtained from the ID's (they can also be scraped with an adequate XPath expression):

In [8]:
link = ['https://jobs.lever.co/netflix/' + i for i in id]
link[:5]

['https://jobs.lever.co/netflix/4dfd6b7b-e020-44ae-a9f5-64b631b20e9c',
 'https://jobs.lever.co/netflix/a1291f11-99e7-4213-87c0-fcbac8e4c6a7',
 'https://jobs.lever.co/netflix/f5296ea6-b3bd-4cf1-9167-268e70218838',
 'https://jobs.lever.co/netflix/e8999fd7-e3ac-4af3-8667-0bd002123528',
 'https://jobs.lever.co/netflix/44b91dcb-0735-4b94-a171-81c43133e5c1']

### Job titles

We follow a similar approach to extract the job titles. Using *Inspect* with a job title, we find it as the text of an `h5` node, with a `data-qa` attribute whose value is `"posting-name"`:

In [9]:
job = tree.xpath('//h5[@data-qa="posting-name"]/text()')
job[:5]

['Lead Technical Director - Wendell & Wild',
 'Render Wrangler/Jr. Technical Director - Wendell & Wild',
 '2D Background Layout Artist - Blue Eye Samurai',
 'Art Director - Blue Eye Samurai',
 'Background Layout Supervisor - Blue Eye Samurai']

### Job location

Next comes the job location, which is found as the text of a `span` node with a `class` attribute whose value is `"sort-by-location posting-category small-category-label"`:

In [10]:
location = tree.xpath('//span[@class="sort-by-location posting-category small-category-label"]/text()')
location[:5]

['Oregon',
 'Oregon',
 'Los Angeles, California',
 'Los Angeles, California',
 'Los Angeles, California']
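
Extracting each field with a separate query works as long as every posting has every field. If a posting lacked, say, the location, the lists would have different lengths and would no longer line up. A possible safeguard, sketched here on an invented fragment, is to loop over the posting nodes and run relative queries (starting with `.`) inside each one. I am assuming that the title and location nodes sit inside the `div` carrying the ID, which should be checked with *Inspect* on the real page.

```python
from lxml import html

# Hypothetical fragment: the second posting has no location
toy_html = """
<div>
  <div data-qa-posting-id="id-001">
    <h5 data-qa="posting-name">Art Director</h5>
    <span class="sort-by-location posting-category small-category-label">Oregon</span>
  </div>
  <div data-qa-posting-id="id-002">
    <h5 data-qa="posting-name">Data Engineer</h5>
  </div>
</div>
"""
toy_tree = html.fromstring(toy_html)

rows = []
for node in toy_tree.xpath('//div[@data-qa-posting-id]'):
    # Relative queries only look inside the current posting node
    title = node.xpath('.//h5[@data-qa="posting-name"]/text()')
    place = node.xpath('.//span[contains(@class, "sort-by-location")]/text()')
    rows.append({
        'id': node.get('data-qa-posting-id'),
        'job': title[0] if title else None,
        'location': place[0] if place else None,   # None when missing
    })

for row in rows:
    print(row)
```

Each posting yields exactly one row, so a missing field shows up as `None` instead of shifting the remaining values.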

### Team

The team is the last piece of information that we scrape from this page. It is found as the text of a `span` node which has a `class` attribute whose value is `"sort-by-team posting-category small-category-label"`.

In [11]:
team = tree.xpath('//span[@class="sort-by-team posting-category small-category-label"]/text()')
team[:5]

['Animation – Animation',
 'Animation – Animation',
 'Animation – Art',
 'Animation – Art',
 'Animation – Art']

The team comes in two parts: (a) a division, such as *Animation* or *Gaming*, and (b) a department, such as *Art* or *Production Management*. It might be interesting to split it into these two parts, which are separated by a symbol that looks like a hyphen but is a bit longer. It is the **en dash** (see `jkorpela.fi/dashes.html` if you are curious about this). You can copy and paste it into a Jupyter interface, or use the Unicode escape `\u2013`.

In [12]:
team = [t.split(' – ') for t in team]
team[:5]

[['Animation', 'Animation'],
 ['Animation', 'Animation'],
 ['Animation', 'Art'],
 ['Animation', 'Art'],
 ['Animation', 'Art']]

Once the split has been performed, I name the two parts:

In [13]:
division = [t[0] for t in team]
division[:5]

['Animation', 'Animation', 'Animation', 'Animation', 'Animation']

In [14]:
dept = [t[1] for t in team]
dept[:5]

['Animation', 'Animation', 'Art', 'Art', 'Art']
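
The list comprehensions above assume that every team label contains the separator. If some label came without a department, `t[1]` would raise an `IndexError`. The following sketch uses the Unicode escape and `str.partition` to cope with that case. The helper name `split_team` and the label `'Legal'` are hypothetical, introduced only for illustration.

```python
EN_DASH = '\u2013'   # the en dash, written as a Unicode escape

def split_team(team_label):
    """Split 'Division – Department' into a (division, department) pair."""
    # partition splits at the first separator only and never fails
    division, sep, dept = team_label.partition(f' {EN_DASH} ')
    # sep is the empty string when the separator is absent
    return division.strip(), (dept.strip() if sep else None)

print(split_team('Animation \u2013 Art'))   # ('Animation', 'Art')
print(split_team('Legal'))                  # ('Legal', None)
```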

### JSON format

The **JSON** (JavaScript Object Notation) format is very practical for storing certain types of information, such as Twitter or news data, for which the tabular format is not adequate. A JSON document is a collection of **key/value pairs**, organized in a special way, which accounts for a hierarchy of information. For instance, the following example stores family information:

    [{'Name': 'John', 'Age': 27},

     {'Name': 'Peter', 'Age': 32, 'Children': 'Louis'},

     {'Name': 'Maria', 'Age': 29, 'Children': ['Edward', 'Christine']}]
 
In this example, you can see how to include information about the children in a flexible way, allowing for zero, one or more children. To use a tabular format for these data, you would have to create a collection of columns "Child1", "Child2", etc, with many missing values. The flexible structure of JSON allows you to cope with different family sizes in a simple way. 
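
The price of this flexibility is that the code reading the data has to cope with the three cases (no key, a single value, a list). A minimal sketch of one way to do it, where the helper `children_of` is a name of my own:

```python
family = [{'Name': 'John', 'Age': 27},
          {'Name': 'Peter', 'Age': 32, 'Children': 'Louis'},
          {'Name': 'Maria', 'Age': 29, 'Children': ['Edward', 'Christine']}]

def children_of(person):
    """Return the children of a person as a list (hypothetical helper)."""
    value = person.get('Children', [])   # missing key means no children
    # Wrap a single name in a list, so the result is always a list
    return value if isinstance(value, list) else [value]

n_children = {p['Name']: len(children_of(p)) for p in family}
print(n_children)   # {'John': 0, 'Peter': 1, 'Maria': 2}
```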

To the Pythonista, the JSON document looks like a nested structure of lists and dictionaries. This makes importing and exporting JSON data in Python straightforward. The package `json`, which is part of the Python Standard Library, provides this functionality.

Now, a more complex example. The following JSON document has been extracted from the Lever webpage of one of the Netflix job posts:

    {"@context" : "http://schema.org",
     "@type" : "JobPosting",
     "title" : "Senior Systems Development Engineer",
     "hiringOrganization" : 
         {"@type" : "Organization",
          "name": "Netflix",
          "logo": "https://lever-client-logos.s3.amazonaws.com/84963f7c-5208-4789-813f-59b515174479-1442905953849.png"},
     "jobLocation":
         {"@type" : "Place",
              "address" :
              {"@type" : "PostalAddress",
               "addressLocality" : "Los Angeles, California",
               "addressRegion" : null,
               "addressCountry" : null,
               "postalCode" : null}},
     "employmentType" : null,
     "datePosted" : "2019-10-05",
     "description" : "The Creative Compute and Storage team designs, develops and delivers technology infrastructure globally for the evolving needs of our creatives. As we continue to expand our content creation globally, we are looking for the best and brightest engineering talent to be part of our growth. \n\nOur team is looking for a Senior Systems Development Engineer to be part of the development and build-out of our purposefully developed infrastructure platforms. You will work with internal engineering teams, technical creatives, production teams and external vendors around the world to deliver amazing technology experiences for our creative users. We are looking for an experienced engineer that brings a broad set of technical skills and achievements, a development and automation focused mindset to solving problems and unique career and life experiences to join our teams as we continue to evolve entertainment around the world.\n\nBe sure to review our culture page and long-term view to learn more about the unique Netflix culture and the opportunity to be part of our team.\n\n"}

This is one record. Sometimes these units are managed separately, as in this case, which is related to a single job post, but they can also come in a list, enclosed in square brackets, as in our former example. Please note that the indentation has been added after copying the JSON document from the source page, to help you see the structure. Webpage maintainers are rarely so polite.

### The package json

We import `json` in the usual way for small packages:

In [15]:
import json

The package `json` provides two basic functions, `loads` and `dumps`. `loads` converts a JSON string to a Python object (typically a list or a dictionary), and `dumps` performs the converse conversion. This is needed, since in Python every object has a type. To illustrate this point, suppose that we enter our first JSON example in a Python shell:

In [16]:
json_list = [{'Name': 'John', 'Age': 27},
     {'Name': 'Peter', 'Age': 32, 'Children': 'Louis'},
     {'Name': 'Maria', 'Age': 29, 'Children': ['Edward', 'Christine']}]

To convert this to a string:

In [17]:
json_doc = json.dumps(json_list)
json_doc

'[{"Name": "John", "Age": 27}, {"Name": "Peter", "Age": 32, "Children": "Louis"}, {"Name": "Maria", "Age": 29, "Children": ["Edward", "Christine"]}]'

To parse this string, recovering the list of dictionaries:  

In [18]:
json.loads(json_doc)

[{'Name': 'John', 'Age': 27},
 {'Name': 'Peter', 'Age': 32, 'Children': 'Louis'},
 {'Name': 'Maria', 'Age': 29, 'Children': ['Edward', 'Christine']}]

*Note*. On the fly, `json.loads` performs, when needed, some conversions from JSON (that is, JavaScript notation) to Python, such as `null` to `None`, or `true` to `True`. Also, the quotes are double in `json_doc`, because that is the rule of the JSON format.
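
These conversions can be checked on a small string. The sketch below uses an abridged version of the job posting record shown earlier. The key `"remote"` is hypothetical, added only to illustrate the conversion of `true`.

```python
import json

# Abridged job posting record; the "remote" key is invented for illustration
record_text = '''
{"title": "Senior Systems Development Engineer",
 "jobLocation": {"address": {"addressLocality": "Los Angeles, California",
                             "addressRegion": null}},
 "employmentType": null,
 "remote": true}
'''
record = json.loads(record_text)

# Nested values are reached by chaining keys
locality = record['jobLocation']['address']['addressLocality']
print(locality)                   # Los Angeles, California
print(record['employmentType'])   # None
print(record['remote'])           # True

# dumps goes the other way, restoring the JSON spelling
back = json.dumps({'employmentType': None, 'remote': True})
print(back)                       # {"employmentType": null, "remote": true}
```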

### Scraping data in JSON format

Some webpages include a JSON document in a `script` node. In general, `script` nodes are used in HTML code to embed executable code or data. Although most of those nodes embed or refer to JavaScript code, they can also be used to store metadata in JSON format. In this case, they have the attribute `type="application/ld+json"`. Sometimes, these JSON documents contain information which can also be found in other parts of the web page. They are used to mark up web contents so that they can be understood by major search engines such as Google and Bing. The data stored in `script` nodes is not displayed by the browser.

Let us see how this works for the individual pages of the Netflix jobs. The link for the first job is:

In [19]:
link[0]

'https://jobs.lever.co/netflix/4dfd6b7b-e020-44ae-a9f5-64b631b20e9c'

This is a page containing details for a post related to a Lead Technical Director position (the first title in the list `job`). By applying `requests.get`, I capture the source as a string. Then I parse that string with `html.fromstring`:

In [20]:
page = requests.get(link[0]).content
tree = html.fromstring(page)

The potential JSON documents contained in `script` nodes as described above are easy to get:

In [21]:
json_doc = tree.xpath('//script[@type="application/ld+json"]/text()')

`xpath` returns a list, which can have any length, depending on the number of elements of this type included in the source code. In this case, I am lucky:

In [22]:
len(json_doc)

1

I convert the single element of this list with `json.loads`. I call the outcome `json_dict`, because it is a dictionary:

In [23]:
json_dict = json.loads(json_doc[0])
type(json_dict)

dict
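
On other pages, the list returned by `xpath` could be empty or contain several JSON documents, in which case `json_doc[0]` would fail or pick the wrong one. The following sketch shows one defensive way to choose. The helper `first_job_posting` is hypothetical, and the test on `'@type'` relies on the `JobPosting` value seen in the example record.

```python
import json

def first_job_posting(snippets):
    """Return the first snippet that parses to a JobPosting dict, else None."""
    for text in snippets:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue   # skip anything that is not valid JSON
        if isinstance(data, dict) and data.get('@type') == 'JobPosting':
            return data
    return None

# Toy inputs standing in for the output of tree.xpath(...)
print(first_job_posting([]))   # None
print(first_job_posting(['not json', '{"@type": "JobPosting", "title": "X"}']))
# {'@type': 'JobPosting', 'title': 'X'}
```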

To get an insight into the contents, I list the keys:

In [24]:
json_dict.keys()

dict_keys(['@context', '@type', 'title', 'hiringOrganization', 'jobLocation', 'employmentType', 'datePosted', 'description'])

The employment type, the date the job was posted and the description are potentially interesting:

In [25]:
json_dict['employmentType']

'Contractor'

In [26]:
json_dict['datePosted']

'2021-06-17'

In [27]:
json_dict['description']

'With over 204 million subscribers enjoying great content in over 190 countries, it’s an exciting time to work at Netflix.\xa0\n\nBy serving as a platform for original storytelling, we’re fueled by the broad appeal of being able to instantly enjoy unlimited movies and TV shows and seek to create joy for our members around the world. This guiding principle has informed our commitment to animation, a universal language, and the establishment of Netflix Animation Studios where creative visionaries can perform their best work.\xa0\n\nThis is the new wave of animation — and you can help shape it. Our goal is to tell stories that no one has seen but that everyone will remember.\n\n\n\nLooking for someone to start ASAP!\nEstimated project end date: 12/17/2021\n'

### A function to scrape information from an individual page

We would like to capture this information for all postings. But there are 523 postings, so we need a procedure to repeat, at scale, what we have done with one posting. People typically use loops for this kind of massive repetition. Creating a function which scrapes the information from an individual page simplifies the code.

In the definition of the **scraping function** I just collect the operations that I have performed above, using the link as the argument of the function:

In [28]:
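def f(l):
    # Download the individual page and parse its HTML source
    page = requests.get(l).content
    tree = html.fromstring(page)
    # The job metadata is stored as JSON in a script node
    json_doc = tree.xpath('//script[@type="application/ld+json"]/text()')
    json_dict = json.loads(json_doc[0])
    # Return the three fields of interest as a list
    return [json_dict['employmentType'], json_dict['datePosted'], json_dict['description']]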

Let us check that this works as expected:

In [29]:
f(link[0])

['Contractor',
 '2021-06-17',
 'With over 204 million subscribers enjoying great content in over 190 countries, it’s an exciting time to work at Netflix.\xa0\n\nBy serving as a platform for original storytelling, we’re fueled by the broad appeal of being able to instantly enjoy unlimited movies and TV shows and seek to create joy for our members around the world. This guiding principle has informed our commitment to animation, a universal language, and the establishment of Netflix Animation Studios where creative visionaries can perform their best work.\xa0\n\nThis is the new wave of animation — and you can help shape it. Our goal is to tell stories that no one has seen but that everyone will remember.\n\n\n\nLooking for someone to start ASAP!\nEstimated project end date: 12/17/2021\n']

### Looping over the individual pages

A loop needs a place to start. Here, the starting point is three empty lists, one for each field:

In [30]:
employmentType, datePosted, description = [], [], []

Now we create a loop which adds the outcome of our scraping for every link, one by one:

In [31]:
for l in link:
    data = f(l)
    employmentType.append(data[0])
    datePosted.append(data[1])
    description.append(data[2])

*Note*. It may be that the loop stops before reaching the end of the list of links. This may be due to various reasons: the connection fails, the server decides that you are a robot, etc. In that case, you can restart the loop at the point where it stopped, which you can learn from the length of the lists that you have already obtained.
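
The restart described in the note can be made automatic. The following is a sketch under my own assumptions, not the procedure used above: the function `scrape_all`, its arguments and the optional pause between requests are hypothetical names and choices. It keeps the rows already obtained and resumes from the first link not yet scraped.

```python
import time

def scrape_all(links, scrape, results=None, pause=0.0):
    """Scrape links in order, resuming after the rows already in results."""
    results = [] if results is None else results
    # len(results) tells where a previous run stopped
    for url in links[len(results):]:
        try:
            results.append(scrape(url))
        except Exception as error:   # deliberately broad: any failure stops the run
            print(f'Stopped at position {len(results)}: {error!r}')
            break
        time.sleep(pause)            # optional pause between requests
    return results
```

Assuming `f` and `link` as defined above, `rows = scrape_all(link, f, pause=1)` would start the scraping, and `rows = scrape_all(link, f, results=rows, pause=1)` would resume it after an interruption. Once complete, the three lists can be recovered with `employmentType, datePosted, description = map(list, zip(*rows))`.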

### Gather the information collected in a data frame

I can now pack the information collected as a data frame, so that it can be easily exported to a text file or a database:

In [32]:
import pandas as pd
df = pd.DataFrame({'job': job, 'location': location, 'division': division, 'dept': dept,
  'employmentType': employmentType, 'datePosted': datePosted, 'description': description}, index=id)

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 523 entries, 4dfd6b7b-e020-44ae-a9f5-64b631b20e9c to 836292b3-4b02-4f35-97f7-0b9e4aeda4e7
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   job             523 non-null    object
 1   location        523 non-null    object
 2   division        523 non-null    object
 3   dept            523 non-null    object
 4   employmentType  520 non-null    object
 5   datePosted      523 non-null    object
 6   description     523 non-null    object
dtypes: object(7)
memory usage: 32.7+ KB


In [34]:
df.head()

| | job | location | division | dept | employmentType | datePosted | description |
|---|---|---|---|---|---|---|---|
| 4dfd6b7b-e020-44ae-a9f5-64b631b20e9c | Lead Technical Director - Wendell & Wild | Oregon | Animation | Animation | Contractor | 2021-06-17 | With over 204 million subscribers enjoying gre... |
| a1291f11-99e7-4213-87c0-fcbac8e4c6a7 | Render Wrangler/Jr. Technical Director - Wende... | Oregon | Animation | Animation | Contractor | 2021-06-17 | With over 204 million subscribers enjoying gre... |
| f5296ea6-b3bd-4cf1-9167-268e70218838 | 2D Background Layout Artist - Blue Eye Samurai | Los Angeles, California | Animation | Art | Contractor | 2021-05-04 | With over 204 million subscribers enjoying gre... |
| e8999fd7-e3ac-4af3-8667-0bd002123528 | Art Director - Blue Eye Samurai | Los Angeles, California | Animation | Art | Contractor | 2020-04-26 | With over 204 million subscribers enjoying gre... |
| 44b91dcb-0735-4b94-a171-81c43133e5c1 | Background Layout Supervisor - Blue Eye Samurai | Los Angeles, California | Animation | Art | Contractor | 2021-06-16 | With over 204 million subscribers enjoying gre... |
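
As for the export mentioned at the beginning of this section, here is a minimal sketch on a toy data frame. The data, the table name `postings` and the in-memory SQLite database are assumptions made for illustration. With the real `df`, you would pass a file name to `to_csv` and connect to a database file instead.

```python
import sqlite3
import pandas as pd

# Toy data frame standing in for df (invented values)
toy_df = pd.DataFrame(
    {'job': ['Art Director', 'Data Engineer'],
     'datePosted': ['2021-06-17', '2021-05-04']},
    index=['id-001', 'id-002'])
toy_df.index.name = 'id'

# Without a path, to_csv returns the CSV content as a string
csv_text = toy_df.to_csv()
print(csv_text)

# to_sql writes the data frame (index included) to a database table
with sqlite3.connect(':memory:') as con:
    toy_df.to_sql('postings', con)
    n_rows = con.execute('SELECT COUNT(*) FROM postings').fetchone()[0]
print(n_rows)   # 2
```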


### Homework

**Kraken** is a US-based cryptocurrency exchange and bank, founded in 2011. It provides cryptocurrency-to-fiat-money trading and supplies price information to the Bloomberg Terminal. At `jobs.lever.co/kraken`, you will find a job posting site, organized in the same way as Netflix's site. Capture the information about the Kraken job posts that you find interesting and summarize it.