# Analysing the `DICE` Dataset

In [None]:
import pandas as pd
dice_df = pd.read_csv('dice_com-job_us_sample.csv')

In [None]:
with pd.option_context('display.max_colwidth', None):
    display(dice_df['jobdescription'])

In [None]:
print(dice_df.columns)

# Information Based Use Cases

Modern engineers are expected to be proactive.
This is especially true for ML engineers.

The engineer's job is to make connections between technical capability and available information.



# Exercise [15 min]: Installing spaCy

For the machine learning part we will use a framework called spaCy (Python package name: `spacy`).

The first step is to install the library.

See https://spacy.io/usage for installation instructions.
The spaCy documentation is excellent.

Since the installation can take some time, we start it now and let it run in the background.

## Exercise [15 min]: Brainstorming Use Cases for DICE

Use case analysis should always start with the users.
There must be some form of utility from the functionality that we create.

### What are some interesting use cases that come to mind?

What would you be interested in from the point of view of a job seeker?

* As a
* I would ...

* As a person responsible for the data engineering curriculum, I would like to better understand the key skill requirements in industry.

### Machine Learning as a Way to Scale Things

Two machine learning tasks that we looked at might be useful for the use cases you are considering.

* Similarity: The ML-based ability to calculate how similar items are.
* Information Extraction: The ability to detect patterns in data and extract them.
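
To make the similarity task concrete, here is a minimal sketch that compares texts by the cosine similarity of their word counts. It is a simple stand-in, not the ML-based similarity a trained model would provide, and the three postings are invented examples rather than rows from the DICE dataset.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its tokens (letters, digits, '+', '#')."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical snippets standing in for job descriptions.
posting_1 = "Senior Python developer with machine learning experience"
posting_2 = "Machine learning engineer, Python and SQL required"
posting_3 = "Registered nurse for night shifts"

print(cosine_similarity(posting_1, posting_2))  # shares three tokens
print(cosine_similarity(posting_1, posting_3))  # shares no tokens
```

Word counts ignore meaning, so "developer" and "engineer" count as unrelated here; closing that gap is what learned representations are for.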

## Extracting Named Entities

Named entity recognition (NER) is one form of information extraction.

Named entities are mentions in a given text that refer to entities of a particular type.
Typical types are:

* Date
* Location
* Person
* Organization

### Exercise [20 min]: Apply spaCy NER on the DICE Dataset

See the following link for loading a default NER model:

https://spacy.io/api/entityrecognizer#config

To obtain the `nlp` variable used in that example, follow the instructions here:

https://spacy.io/usage/processing-pipelines



In [None]:
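# Applying spaCy NER, shown here on a sample sentence;
# replace the list with DICE job descriptions for the exercise.
# Requires the model: python -m spacy download en_core_web_md

import spacy
nlp = spacy.load("en_core_web_md")

for doc in nlp.pipe(["The University of Chicago and MIT have a pub crawl on Sunday."]):
    # Print each detected entity together with its label
    print([(ent.text, ent.label_) for ent in doc.ents])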




### Exercise [15 min]: Model Selection

Machine Learning often begins with selecting a pre-trained model (i.e. a model that has been trained by someone else). 

Model selection is a non-trivial task.
To experience this, try to select the `best` model for the extraction using:

https://spacy.io/usage/models
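
One possible way to compare candidate models is to run them on the same text and look at where their entity lists differ. The sketch below assumes each model's output has already been collected as `(text, label)` tuples, for example with `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `compare_entities` and both entity lists are hypothetical, written by hand for illustration, and are not real model output.

```python
def compare_entities(ents_a, ents_b):
    """Split two (text, label) entity lists into shared and model-specific entries."""
    set_a, set_b = set(ents_a), set(ents_b)
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Invented outputs of two models on the same sentence (assumption, not measured).
ents_model_a = [("MIT", "ORG"), ("Sunday", "DATE")]
ents_model_b = [("The University of Chicago", "ORG"), ("MIT", "ORG"), ("Sunday", "DATE")]

diff = compare_entities(ents_model_a, ents_model_b)
print("Found by both:", diff["both"])
print("Only model A: ", diff["only_a"])
print("Only model B: ", diff["only_b"])
```

Disagreements are a starting point for manual inspection; they do not by themselves show which model is right.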

## Discussion: Observations

* What are some observations based on the application of the NER models?

* What are some things that we should focus on in our analysis?

### Exercise [45 min]: Key Phrase Extraction

PyTextRank is a spaCy extension for detecting key phrases (noun chunks) in text.

The following sources will give you an introduction to its application:

https://spacy.io/universe/project/spacy-pytextrank
https://derwen.ai/docs/ptr/
https://derwen.ai/docs/ptr/sample/

Use it to extract key phrases from the DICE dataset.
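
Once phrases have been extracted per posting, a simple way to summarise them is to count in how many postings each phrase appears. The sketch below assumes the phrases are available as plain strings, for example from `[p.text for p in doc._.phrases[:5]]`. The helper name `top_phrases` and the sample data are hypothetical.

```python
from collections import Counter


def top_phrases(phrases_per_posting, n=10):
    """Return the n phrases that occur in the most postings."""
    counts = Counter()
    for phrases in phrases_per_posting:
        # Normalise case and whitespace; the set makes a phrase count once per posting.
        counts.update({p.strip().lower() for p in phrases if p.strip()})
    return counts.most_common(n)


# Invented phrase lists for three postings (not taken from the DICE dataset).
sample = [
    ["Python", "machine learning"],
    ["python ", "SQL"],
    ["SQL", "python"],
]

for phrase, posting_count in top_phrases(sample, n=2):
    print(posting_count, phrase)
```

Counting postings rather than raw occurrences keeps one long, repetitive description from dominating the summary.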

## Discussion: Observations

* What do you think about the utility of the results?
* Would this be of use for one of the use cases you have thought of?
* What criteria should we consider before putting these extractions into a production system?

In [None]:
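import spacy
import pytextrank  # importing registers the "textrank" pipeline component

nlp = spacy.load('en_core_web_sm')
# Add PyTextRank to the end of the spaCy pipeline
nlp.add_pipe("textrank")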

In [None]:
# Draw five random postings and show their top-ranked key phrases
sampled_df = dice_df.sample(n=5)
separator = '~' * 80
print(separator)
for jobdesc, jobtitle in zip(sampled_df['jobdescription'], sampled_df['jobtitle']):
    doc = nlp(jobdesc)
    print(jobtitle)
    print(separator)
    # doc._.phrases is sorted by rank, highest first
    for p in doc._.phrases[:5]:
        print('{:.4f} {:5d}  {}'.format(p.rank, p.count, p.text))
        print(p.chunks)
    print(separator)

   