In [1]:
import pandas as pd
import dvc.api as DvcApi

In [2]:
import sys, os

sys.path.append(os.path.abspath(os.path.join("../..")))
sys.path.append(os.path.abspath(os.path.join("../scripts")))

In [3]:
# pd.set_option('display.max_colwidth', None)

---------------------------------------------------------------------------------------------------------------------------

# Project 1

### Project Overview
A client has a system that collects news artifacts from web pages, tweets, Facebook posts, etc. The client is interested in scoring a given news artifact against a topic and has hired experts to score a few of these news items. Scores range from 0 to 10 and signify the degree of relevance of the news item to the topic.

We want to explore how well GPT-3-like LLMs perform on this task. If the recommendation is positive, we must demonstrate that our prompt-design strategies are reproducible and produce consistent results.

Design a pipeline that takes a news item (e.g. title + description + body) and returns a score for the news item.

The columns of this dataset are as follows:

- Domain: the base URL or a reference to the source this item comes from
- **Title:** title of the item (part of the item's content)
- **Description:** the content of the item
- **Body:** the content of the item
- Link: URL to the item source (it may no longer be functional)
- timestamp: the time at which this item was collected
- **Analyst_Average_Score:** target variable - the score to be estimated
- Analyst_Rank: score as rank
- Reference_Final_Score: not relevant for now - it is a transformed quantity
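
The pipeline described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `build_prompt`, `parse_score` and `score_item` are hypothetical helper names, the prompt wording is purely illustrative, and `complete` stands in for whatever LLM completion call is eventually chosen. Only the 0 to 10 range and the Title, Description and Body fields come from the project description.

```python
import re
from typing import Callable, Mapping, Optional


def build_prompt(item: Mapping[str, str], topic: str) -> str:
    """Assemble a prompt from the Title, Description and Body fields (illustrative wording)."""
    return (
        f"Rate how relevant the news item below is to the topic '{topic}' "
        "on a scale from 0 (not relevant) to 10 (highly relevant). "
        "Answer with a single number.\n\n"
        f"Title: {item['Title']}\n"
        f"Description: {item['Description']}\n"
        f"Body: {item['Body']}\n"
        "Score:"
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the first unsigned number in the reply, or None if absent or outside 0-10."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 10.0 else None


def score_item(
    item: Mapping[str, str], topic: str, complete: Callable[[str], str]
) -> Optional[float]:
    """Score one news item; `complete` is an assumed callable wrapping the LLM."""
    return parse_score(complete(build_prompt(item, topic)))


# Usage with a stub in place of a real model (hypothetical item, no API call is made).
stub_item = {"Title": "Example title", "Description": "Example description", "Body": "Example body"}
print(score_item(stub_item, "public unrest", lambda prompt: "7.5"))
```

Injecting `complete` as a callable keeps prompt construction and score parsing testable without network access, which helps when checking that a prompt strategy gives consistent results.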

In [4]:
path = "data/Example_data1.csv"
repo = "../"
version = "v0"

data_url = DvcApi.get_url(path=path, repo=repo, rev=version)  # rev can be a tag or a git commit
data_news = pd.read_csv(data_url)

In [5]:
data_news.head(2)

Unnamed: 0,Domain,Title,Description,Body,Link,timestamp,Analyst_Average_Score,Analyst_Rank,Reference_Final_Score
0,rassegnastampa.news,Boris Johnson using a taxpayer-funded jet for ...,…often trigger a protest vote that can upset…t...,Boris Johnson using a taxpayer-funded jet for ...,https://rassegnastampa.news/boris-johnson-usin...,2021-09-09T18:17:46.258006,0.0,4,1.96
1,twitter.com,"Stumbled across an interesting case, a woman f...","Stumbled across an interesting case, a woman f...","Stumbled across an interesting case, a woman f...",http://twitter.com/CoruscaKhaya/status/1435585...,2021-09-08T13:02:45.802298,0.0,4,12.0


In [6]:
data_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Domain                 10 non-null     object 
 1   Title                  10 non-null     object 
 2   Description            10 non-null     object 
 3   Body                   10 non-null     object 
 4   Link                   10 non-null     object 
 5   timestamp              10 non-null     object 
 6   Analyst_Average_Score  10 non-null     float64
 7   Analyst_Rank           10 non-null     int64  
 8   Reference_Final_Score  10 non-null     float64
dtypes: float64(2), int64(1), object(6)
memory usage: 848.0+ bytes


We have only 10 examples, and none of the columns has missing values.
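
Note that `info()` only counts null values, so an empty or whitespace-only string still counts as non-null. A minimal sketch of a stricter check is shown below; `blank_counts` is a hypothetical helper name and the toy frame is made up for illustration, it is not the news data.

```python
import pandas as pd


def blank_counts(df: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells that are null or hold only whitespace."""
    counts = {}
    for col in df.columns:
        is_null = df[col].isna()
        # Non-string values (numbers, NaN, None) are never treated as blank here.
        is_blank = df[col].map(lambda v: isinstance(v, str) and v.strip() == "").astype(bool)
        counts[col] = int((is_null | is_blank).sum())
    return pd.Series(counts)


# Toy example: one whitespace-only title and one missing title.
toy = pd.DataFrame({"Title": ["a headline", "   ", None], "Score": [0.0, 1.0, 2.0]})
print(blank_counts(toy))
```

Applied to the news data, this would be called as `blank_counts(data_news)`.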

In [7]:
# let's take a look at the titles of news items that have a non-zero score with respect to the topic of "public unrest"

pd.set_option('display.max_colwidth', None)
data_news.loc[data_news['Analyst_Average_Score']!= 0, 'Title' ]

5    Male arrested for the murder of an elderly female in Cofimvaba – SAPS Crime Report: 2021-09-09 13:22:58
7                             The construction sector is expected to be boosted by riots and looting repairs
8                       News24.com | Court dismisses attempt by former Eskom CEO to 'punish' woman for tweet
Name: Title, dtype: object

In [8]:
pd.set_option('display.max_colwidth', 50)

---------------------------------------------------------------------------------------------------------------------------

# Project 2:

### Project Overview
The data are job descriptions (together with named entities) and relationships between entities, in JSON format. To understand more about where the data comes from, read [How to Train a Joint Entities and Relation Extraction Classifier using BERT Transformer with spaCy 3](https://towardsdatascience.com/how-to-train-a-joint-entities-and-relation-extraction-classifier-using-bert-transformer-with-spacy-49eb08d91b5c).

- **Dataset 1:** For development and training
- **Dataset 2:** For testing and final reporting


In [9]:
path = "data/relations_dev.json"
repo = "../"
version = "v0"

data_url = DvcApi.get_url(path = path, repo = repo, rev = version)
data_JobDev = pd.read_json(data_url)

In [10]:
# let's look at a few of the examples
data_JobDev.head(3)

Unnamed: 0,document,tokens,relations
0,Bachelor's degree in Mechanical Engineering or...,"[{'text': 'Bachelor', 'start': 0, 'end': 8, 't...","[{'child': 4, 'head': 0, 'relationLabel': 'DEG..."
1,10+ years of software engineering work experie...,"[{'text': '10+ years', 'start': 0, 'end': 9, '...","[{'child': 4, 'head': 0, 'relationLabel': 'EXP..."
2,3+ years Swift & Objective-C and experience wi...,"[{'text': '3+ years', 'start': 0, 'end': 8, 't...","[{'child': 3, 'head': 0, 'relationLabel': 'EXP..."


In [11]:
# The first example
data_JobDev.loc[0]

document     Bachelor's degree in Mechanical Engineering or...
tokens       [{'text': 'Bachelor', 'start': 0, 'end': 8, 't...
relations    [{'child': 4, 'head': 0, 'relationLabel': 'DEG...
Name: 0, dtype: object
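
Each entry of the `relations` column is a list of dictionaries with `child`, `head` and `relationLabel` keys, as the output above shows. One quick way to get a feel for the data is to count how often each relation label occurs. The sketch below is only an illustration: `relation_label_counts` is a hypothetical helper name, and the toy input uses placeholder labels rather than the real label set.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def relation_label_counts(relations_column: Iterable[Sequence[Mapping]]) -> Counter:
    """Count relationLabel values over the relation lists of all documents."""
    return Counter(
        relation["relationLabel"]
        for relations in relations_column
        for relation in relations
    )


# Toy stand-in for data_JobDev.relations; LABEL_A and LABEL_B are placeholders.
toy_relations = [
    [{"child": 4, "head": 0, "relationLabel": "LABEL_A"}],
    [
        {"child": 3, "head": 0, "relationLabel": "LABEL_B"},
        {"child": 5, "head": 0, "relationLabel": "LABEL_B"},
    ],
    [],  # a document with no extracted relations
]
print(relation_label_counts(toy_relations))
```

On the real data the call would be `relation_label_counts(data_JobDev.relations)`.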

In [12]:
data_JobDev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   document   22 non-null     object
 1   tokens     22 non-null     object
 2   relations  22 non-null     object
dtypes: object(3)
memory usage: 656.0+ bytes


At first glance there seems to be no missing data, but there might be examples with no extracted entities.

In [13]:
# how many documents have zero extracted tokens?

empty_tokens = data_JobDev.tokens.apply(lambda x: len(x) == 0)

print("Number of documents with zero extracted entities is {}".format(sum(empty_tokens)))

Number of documents with zero extracted entities is 2


In [14]:
data_JobDev[empty_tokens]

Unnamed: 0,document,tokens,relations
15,Experience with Golang (Go Programming Languag...,[],[]
16,Experience with C/C++ and Python. Experience w...,[],[]


In [15]:
# print the two documents that have zero extracted tokens, maybe we can see some pattern

print(data_JobDev[empty_tokens].iloc[0].document, '\n--\n', data_JobDev[empty_tokens].iloc[1].document)

Experience with Golang (Go Programming Language) Experience with Agile software development Strong debug skills, effective verbal and written communication skills, team oriented Preferred Tech and Prof Experience Experience with Kubernetes, microservices architecture, and Docker containers Familiarity with the Linux OS GitHub, JIRA , RESTful APIs, Agile tools EO Statement IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status. . 
--
 Experience with C/C++ and Python. Experience with LabWindows/CVI, LabVIEW and/or TestStand. Experience developing software that interfaces and controls hardware devices o

In [16]:
# We drop these cases as they won't be useful

data_JobDev.drop(data_JobDev[empty_tokens].index, inplace= True)

In [17]:
# write it out 
data_JobDev.to_json("../data/relations_dev.json")

--------------------------------------------------------------------------------------------------------------------------

In [18]:
path = "data/relations_test.json"
repo = "../"
version = "v0"

data_url = DvcApi.get_url(path = path, repo = repo, rev = version)

data_JobTest = pd.read_json(data_url)

In [19]:
data_JobTest.head(2)

Unnamed: 0,document,tokens,relations
0,"\nCurrently holding a faculty, industry, or go...","[{'text': 'Ph.D.', 'start': 75, 'end': 80, 'to...","[{'child': 18, 'head': 14, 'relationLabel': 'D..."
1,\n2+ years experience in the online advertisin...,"[{'text': '2+ years', 'start': 1, 'end': 9, 't...","[{'child': 7, 'head': 1, 'relationLabel': 'EXP..."


In [20]:
data_JobTest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   document   11 non-null     object
 1   tokens     11 non-null     object
 2   relations  11 non-null     object
dtypes: object(3)
memory usage: 392.0+ bytes


In [21]:
# how many documents have zero extracted tokens?

empty_tokens = data_JobTest.tokens.apply(lambda x: len(x) == 0)

print("Number of documents with zero extracted entities is {}".format(sum(empty_tokens)))

Number of documents with zero extracted entities is 1


In [22]:
data_JobTest[empty_tokens]

Unnamed: 0,document,tokens,relations
6,"\nPh.D. with 5+ years of experience, MS with 7...",[],[]


In [23]:
# print the one document that has zero extracted tokens, maybe we can see some pattern

data_JobTest[empty_tokens].iloc[0].document

'\nPh.D. with 5+ years of experience, MS with 7+ years of experience, or BS with 10+ years of experience in Physics, Electrical Engineering, Computer Science, or a related technical field such us architecting, developing, and launching hardware/software projects and/or services\nDemonstrated knowledge dissemination through authored publications, international conference presentations or shipped products\nML/AI basics, and systems basics, including the requisite programming experience (python or equivalent, and at least one systems-level programming language: C, C++, Java, Go, Rust, or equivalent)\nExperience with data analytics  (data collection, storage, cleaning, processing with statistics, visualization, and other data related processes)\nExperience working on communication systems in a research and/or development capacity\nTechnical leadership in leading research efforts with a demonstrated experience handling multiple priorities\n\nPREFERRED \nDeep understanding of how culture and

In [24]:
# We drop this case as it won't be useful

data_JobTest.drop(data_JobTest[empty_tokens].index, inplace= True)

In [25]:
# write it out 
data_JobTest.to_json("../data/relations_test.json")
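
The same clean-up is applied to both the development and the test set, so it could be factored into one helper. The sketch below is one possible version, not the notebook's code: `drop_empty_tokens` is a hypothetical name, and unlike the in-place `drop` above it returns a new frame and resets the index, which is a design choice made here so that the written JSON has contiguous row labels.

```python
import pandas as pd


def drop_empty_tokens(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without rows whose tokens list is empty, re-indexed from 0."""
    has_tokens = df["tokens"].apply(len) > 0
    return df[has_tokens].reset_index(drop=True)


# Toy frame with the same three columns as the job-description data.
toy_jobs = pd.DataFrame(
    {
        "document": ["doc a", "doc b", "doc c"],
        "tokens": [[{"text": "x"}], [], [{"text": "y"}]],
        "relations": [[], [], []],
    }
)
cleaned = drop_empty_tokens(toy_jobs)
print(cleaned["document"].tolist())
```

With this helper, the two clean-up cells would reduce to `data_JobDev = drop_empty_tokens(data_JobDev)` and `data_JobTest = drop_empty_tokens(data_JobTest)`.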