Pipeline for Extracting Keywords from the Dataset

In [1]:
import pandas as pd
import spacy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import numpy as np

In [2]:
nlp = spacy.load("en_core_web_sm")

path = "/Users/ethanvirtudazo/Desktop/DS105_Dataset/6622_jobs.xls"
df_import = pd.read_excel(path)

df_import.shape

(6623, 11)

In [3]:
df_import.head(2)

Unnamed: 0.1,Unnamed: 0,title,details,deadline,opport_type,commence_date,contract_type,location,Renumeration,company,links
0,0,Rothschild & Co - Private Equity Long-Term Int...,This London-based 6-month internship is an exc...,2023-04-30,Internship,2023-07-01 00:00:00,Temporary,London,,Rothschild & Co,https://careers.lse.ac.uk//students/jobs/detai...
1,1,2023 HSBC Global Graduate Programme (Hong Kong...,You’re excited about starting your career and ...,2023-01-06,Graduate employment,2023-07-03 00:00:00,Temporary,Hong KongSingapore,,HSBC (HSBC) - Hong Kong,https://careers.lse.ac.uk//students/jobs/detai...


In [4]:
df1 = df_import.iloc[:, 1:5]
df1.head()

df2 = df_import.iloc[:, [7,9,10]]
df2.head()

pdList = [df1,df2]
df = pd.concat(pdList,axis=1)
df.head()

df.shape

types = df.dtypes
print(types)

df_title = df.iloc[:,0]
df_text_both = df.iloc[:, [0,1]]
df_text = df.iloc[:, [1]]

df_text_both.head()


title                  object
details                object
deadline       datetime64[ns]
opport_type            object
location               object
company                object
links                  object
dtype: object


Unnamed: 0,title,details
0,Rothschild & Co - Private Equity Long-Term Int...,This London-based 6-month internship is an exc...
1,2023 HSBC Global Graduate Programme (Hong Kong...,You’re excited about starting your career and ...
2,2023 HSBC Global Internship Programme (Hong Ko...,You’re excited about starting your career and ...
3,"Graduate Training Scheme, Capital Markets",Graduate Training Scheme – LondonGreySpark Par...
4,6-Months Internship – Sell-side Tech M&A,"At IPTP, we understand software from decades o..."
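The positional `iloc` selection in the cell above depends on the column order of the spreadsheet. A minimal sketch of the same selection by column name follows. The one-row `toy` frame is a hypothetical stand-in for `df_import`; its column names are copied from the `head()` output earlier in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_import: one dummy row, same 11 column names.
cols = ["Unnamed: 0", "title", "details", "deadline", "opport_type",
        "commence_date", "contract_type", "location", "Renumeration",
        "company", "links"]
toy = pd.DataFrame([list(range(len(cols)))], columns=cols)

# Select by name instead of by position.
keep = ["title", "details", "deadline", "opport_type",
        "location", "company", "links"]
df_named = toy[keep]

# The positional version used in the notebook, for comparison.
df_positional = pd.concat([toy.iloc[:, 1:5], toy.iloc[:, [7, 9, 10]]], axis=1)

print(df_named.equals(df_positional))
```

Selecting by name should keep working if columns are added or reordered in a later export, whereas the positional version would silently pick different columns.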


In [5]:
df_text.head()

Unnamed: 0,details
0,This London-based 6-month internship is an exc...
1,You’re excited about starting your career and ...
2,You’re excited about starting your career and ...
3,Graduate Training Scheme – LondonGreySpark Par...
4,"At IPTP, we understand software from decades o..."


In [6]:
type(df_text)

pandas.core.frame.DataFrame

In [7]:
#ar_text = df_text.to_numpy()
#ar_text = ar_text[:101]

df_text_1 = df_text[:101]
len(df_text_1)

101

In [8]:
# Converting text data of the sample data into nlp object called 'document'

#txt_lst = df_text['details'].tolist()

docs_1 = df_text_1['details'].apply(nlp)
len(docs_1)

101

In [9]:
# Viewing specific nlp doc
docs_1[0]

This London-based 6-month internship is an exciting opportunity to intern with Rothschild & Co's European corporate private equity business. Rothschild & Co’s European corporate private equity business is comprised of three strategies: Five Arrows Principal Investments (FAPI), Five Arrows Growth Capital (FAGC), and Five Arrows Long Term Fund (FALT). FAPI manages c. €1.3 billion through its latest fund and is the flagship corporate private equity / buy-out vehicle investing in mid-cap companies, while FAGC manages c. €500 million through growth investments and small-cap buy-outs, and FALT makes investments in larger companies with the option for longer-term holding periods. All three funds are focused on investments within Western Europe (with FALT additionally on North America) and primarily invest in three main verticals: Data & Software, Healthcare, and Technology-enabled Business Services. The respective teams are based in London, Paris and Luxembourg and are comprised of c. 30 inve

In [10]:
# Individual job view of NER
# NER = Named Entity Recognition
rend_doc_1 = docs_1[73]

from spacy import displacy
displacy.render(rend_doc_1,style="ent")

Part-of-Speech Filter

In [11]:
# Keep only proper nouns, nouns, and adjectives from each document
filtered_docs_1 = [[token.text for token in doc if token.pos_ in ('PROPN', 'NOUN', 'ADJ')] for doc in docs_1]
len(filtered_docs_1)

101

In [12]:
# Viewing first filtered list of tokens
# words that are not proper nouns, nouns, or adjectives are removed from the documents.
filtered_docs_1[0]

['London',
 'month',
 'internship',
 'exciting',
 'opportunity',
 'Rothschild',
 'Co',
 'European',
 'corporate',
 'private',
 'equity',
 'business',
 'Rothschild',
 'Co',
 'European',
 'corporate',
 'private',
 'equity',
 'business',
 'strategies',
 'Arrows',
 'Principal',
 'Investments',
 'FAPI',
 'Arrows',
 'Growth',
 'Capital',
 'FAGC',
 'Arrows',
 'Long',
 'Term',
 'Fund',
 'FALT',
 'FAPI',
 'c.',
 'latest',
 'fund',
 'flagship',
 'corporate',
 'private',
 'equity',
 'vehicle',
 'mid',
 '-',
 'cap',
 'companies',
 'FAGC',
 'c.',
 'growth',
 'investments',
 'small',
 'cap',
 'buy',
 'outs',
 'FALT',
 'investments',
 'larger',
 'companies',
 'option',
 'longer',
 'term',
 'holding',
 'periods',
 'funds',
 'investments',
 'Western',
 'Europe',
 'FALT',
 'North',
 'America',
 'main',
 'verticals',
 'Data',
 'Software',
 'Healthcare',
 'Technology',
 'Business',
 'Services',
 'respective',
 'teams',
 'London',
 'Paris',
 'Luxembourg',
 'c.',
 'investment',
 'professionals',
 'countries
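The filtered list above still contains tokens such as '-' and 'c.', which pass the part-of-speech filter but are not useful keywords. One possible refinement, sketched below under assumptions, is to also require alphabetic text. The `(text, pos)` pairs are hand-tagged, hypothetical examples so the sketch runs without spaCy; on real spaCy tokens the corresponding check would be `token.is_alpha`.

```python
# Part-of-speech tags to keep, as in the notebook's filter.
KEEP_POS = {"PROPN", "NOUN", "ADJ"}

def filter_tagged(tagged_tokens, keep_pos=KEEP_POS):
    """Keep token texts whose tag is in keep_pos and whose text is purely alphabetic."""
    return [text for text, pos in tagged_tokens
            if pos in keep_pos and text.isalpha()]

# Hypothetical hand-tagged tokens (not produced by spaCy here).
tagged = [("mid", "ADJ"), ("-", "NOUN"), ("cap", "NOUN"),
          ("companies", "NOUN"), ("c.", "PROPN"), ("manages", "VERB")]

filtered = filter_tagged(tagged)
print(filtered)  # ['mid', 'cap', 'companies']
```

A caveat on this design: an alphabetic-only check would also drop skill names such as "C++" or "C#", so it may need exceptions before the dictionary filter for programming languages is applied.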

In [13]:
# text_1 for testing
text_1 = filtered_docs_1[0]

In [14]:
# Might not need this after all
# join all tokens for each job into a single string
fil_lst_1 = [' '.join(lst) for lst in filtered_docs_1]
len(fil_lst_1)

101

In [15]:
type(fil_lst_1[0])

str

In [16]:
type(filtered_docs_1[0])

list

Dictionary Filter

Technical/Hard Skills: 

1. IT Skills
* MS Office 
    * PowerPoint
    * Excel 
    * Word
* Pages
* Numbers
2. Financial Modelling (Modeling)
3. Programming
* Java  
* C# (C #)
* C++ 
* SQL 
* NoSQL
* Perl 
* JavaScript 
* HTML 
* CSS
* Python

4. Data Platform Navigation and Utilisation:
* FactSet
* Bloomberg

5. Technical Knowledge
* Interest/Knowledge/Understanding in corporate finance/financial markets/finance/financial services

6. Cognitive Ability
* Analytical/Numerical/Quantitative skills/Problem-Solving


Soft Skills
1. Interpersonal 
* Communication
* Presentation
2. Project Management
* Leadership
* Attention to detail
* Work under pressure







In [182]:
# Testing fuzzy matching with a single search term
term_1 = 'Microsoft'
# extractOne returns a (best_match, score) tuple for the closest token in text_1
match = process.extractOne(term_1, text_1, scorer=fuzz.ratio)
word = match[0]
fuzz_score = fuzz.ratio(term_1, word)
print(word, fuzz_score)

first 57
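To make the score above easier to interpret, here is a standard-library sketch of a similarity score of the same kind: twice the number of matching characters divided by the combined length, scaled to 0-100. `simple_ratio` is a hypothetical helper built on `difflib`, not part of fuzzywuzzy, and is only meant to illustrate the idea.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale: 2 * matches / (len(a) + len(b))."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(simple_ratio("Microsoft", "first"))      # 57
print(simple_ratio("Bloombreg", "Bloomberg"))  # 89
```

A score of 57 means 'first' was merely the least bad candidate in that document, which is why a cutoff is needed before treating a result as a real match.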


In [None]:
# text_1 contains a list of strings (the filtered tokens of one job description)
text_1 = filtered_docs_1[22]
text_1

In [179]:
# Setting the threshold: minimum fuzzy-match score (0-100) for a token to count as a match
score_cutoff = 75

In [178]:
# Filtering for IT skills: Microsoft
MS_keys = ["Microsoft", "MS", "Powerpoint", "Excel"]

# extractOne returns None when no token reaches score_cutoff, so only None needs filtering out
MS_candidates = [process.extractOne(key, text_1, scorer=fuzz.token_set_ratio, score_cutoff=score_cutoff) for key in MS_keys]
MS_match = [match for match in MS_candidates if match is not None]

print(MS_match)

[('MS', 100)]


In [181]:
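# Filtering for data platform skills: Bloomberg and/or FactSet
# NOTE: the first key is spelled "Bloombreg"; the fuzzy match still finds 'Bloomberg' (score 89 in the output)
DATA_keys = ["Bloombreg", "FactSet"]

# extractOne returns None when no token reaches score_cutoff, so only None needs filtering out
DATA_candidates = [process.extractOne(key, text_1, scorer=fuzz.token_set_ratio, score_cutoff=score_cutoff) for key in DATA_keys]
DATA_match = [match for match in DATA_candidates if match is not None]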

print(DATA_match)

[('Bloomberg', 89)]


In [195]:
# Creating a function to pass the texts through

def dict_filter(text):
    """Return (MS_match, DATA_match): the best fuzzy match per key at or above score_cutoff."""
    MS_candidates = [process.extractOne(key, text, scorer=fuzz.token_set_ratio, score_cutoff=score_cutoff) for key in MS_keys]
    DATA_candidates = [process.extractOne(key, text, scorer=fuzz.token_set_ratio, score_cutoff=score_cutoff) for key in DATA_keys]
    MS_match = [match for match in MS_candidates if match is not None]
    DATA_match = [match for match in DATA_candidates if match is not None]
    return MS_match, DATA_match


In [203]:
text = filtered_docs_1[23]
text

['British',
 'International',
 'Investment',
 'UK',
 'development',
 'finance',
 'institution',
 'impact',
 'investor',
 'UK',
 'Government',
 'years',
 'experience',
 'investment',
 'partner',
 'businesses',
 'Africa',
 'Asia',
 'Caribbean',
 'productive',
 'sustainable',
 'inclusive',
 'economies',
 'people',
 'better',
 'lives',
 'communities',
 'businesses',
 'investees',
 'impactful',
 'businesses',
 'website',
 'more',
 'information',
 'www.bii.co.uk/en•',
 'look',
 'videos',
 'overview',
 'approach',
 'climate',
 'change',
 'impact',
 'www.youtube.com/channel/UCcgTGOpDZ4mPdkuZ2U_9FRgIf',
 'Master',
 'MBA',
 'equivalent',
 'prior',
 'work',
 'experience',
 'other',
 'internship',
 'week',
 'summer',
 'internships',
 'start',
 'date',
 'Investment',
 'Corporate',
 'Impact',
 'Group',
 'teams',
 'opportunity',
 'live',
 'investments',
 'impact',
 'issues',
 'comprehensive',
 'induction',
 'networking',
 'sessions',
 'intern',
 'project',
 'speaker',
 'series',
 'events',
 'interest

In [202]:
dict_filter(text)

([], [])
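The two-list filter above could be generalised to the whole skills dictionary by keeping the keys in one mapping from category to search terms. The sketch below is a standard-library stand-in and rests on several assumptions: `ratio`, `best_match`, `SKILL_KEYS` and `dict_filter_all` are hypothetical names, the scorer is a lowercased `difflib` ratio rather than fuzzywuzzy's `token_set_ratio`, and the token list is invented for illustration.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in for a fuzzy scorer)."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))

def best_match(key, tokens, cutoff):
    """Return (token, score) for the best-scoring token at or above cutoff, else None."""
    best = None
    for tok in tokens:
        score = ratio(key, tok)
        if score >= cutoff and (best is None or score > best[1]):
            best = (tok, score)
    return best

# Hypothetical category -> search terms mapping; extend with the other skill groups.
SKILL_KEYS = {
    "ms_office": ["Microsoft", "MS", "Powerpoint", "Excel"],
    "data_platforms": ["Bloomberg", "FactSet"],
}

def dict_filter_all(tokens, skill_keys=SKILL_KEYS, cutoff=75):
    """Return {category: [(token, score), ...]} with one best match per key that hits."""
    results = {}
    for category, keys in skill_keys.items():
        hits = [best_match(key, tokens, cutoff) for key in keys]
        results[category] = [hit for hit in hits if hit is not None]
    return results

# Invented token list standing in for one filtered job description.
tokens = ["strong", "Excel", "skills", "Bloomberg", "terminal"]
print(dict_filter_all(tokens))
```

Returning a dictionary keyed by category keeps the output shape stable as more skill groups are added, which may make it easier to turn the results into DataFrame columns later.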

In [None]:
# CREATING DATAFRAME FOR R

# 1. Turning the first 101 titles into a DataFrame
df_title_r_1 = df_title[0:101]
df_title_r_1 = pd.DataFrame(df_title_r_1)
type(df_title_r_1)

pandas.core.frame.DataFrame

In [None]:
df_title_r_1.head()

Unnamed: 0,title
0,Rothschild & Co - Private Equity Long-Term Int...
1,2023 HSBC Global Graduate Programme (Hong Kong...
2,2023 HSBC Global Internship Programme (Hong Ko...
3,"Graduate Training Scheme, Capital Markets"
4,6-Months Internship – Sell-side Tech M&A


In [None]:
df_fil_1 = pd.DataFrame(fil_lst_1)
type(df_fil_1)

pandas.core.frame.DataFrame

In [None]:
len(df_fil_1)

101

In [None]:
# CREATING DATAFRAME FOR R

df_r_1 = pd.concat([df_title_r_1, df_fil_1], axis = 1)
df_r_1

Unnamed: 0,title,0
0,Rothschild & Co - Private Equity Long-Term Int...,London month internship exciting opportunity R...
1,2023 HSBC Global Graduate Programme (Hong Kong...,excited career many paths possibilities global...
2,2023 HSBC Global Internship Programme (Hong Ko...,excited career many paths possibilities global...
3,"Graduate Training Scheme, Capital Markets",Graduate Training Scheme LondonGreySpark Partn...
4,6-Months Internship – Sell-side Tech M&A,IPTP software decades deep experience technolo...
...,...,...
96,M&A Analyst Intern,MAJOR RESPONSIBILITIESGather financial operat...
97,12 Month Internship - Financial Crimes and San...,Job summaryFinancial Crime Financial Security ...
98,Investment Associate - Fixed Income,Position OverviewPutnam energetic curious indi...
99,12 Month Internship - Central Compliance,SummaryThe Central Compliance team responsible...


In [None]:
df_r_1.iloc[95,1]

'2R Capital Investment Management Limited independent investment company London UK successful credit business process new initiatives equity investing private assets equity primary objective long term capital clients commensurate reasonable risk attention mid - sized European companies fundamental investors extensive research businesses regions industry sectors significant expertise private assets space equity debt small medium sized companies significant growth potential sectors regions Job Opportunities analysts investment opportunities Europe internship full time positions available Targeted training successful candidates self starters activities little supervision keen interest securities investing good research writing financial modelling abilities European languages important Day day activities search origination potential investment opportunities primary research analysis specific sectors companies valuation investment opportunities Direct interaction entrepreneurs managers inve

In [None]:
#Export as .csv file
#df_r_1.to_csv('df_r_1.csv')

Below is the code for applying the pipeline to the entire dataset

1. Cleaning the data 
* Identify the index of non-text data
* Remove NaN (non-text) data

2. Convert to nlp object 'document'


3. 
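A minimal sketch of step 1, under assumptions: `clean_details` is a hypothetical helper, the `toy` frame stands in for the full dataset, and "non-text" is taken to mean any `details` value that is not a non-empty string (which covers NaN). Step 2 is only indicated in a comment, since it needs the spaCy model loaded at the top of the notebook.

```python
import pandas as pd

def clean_details(df, column="details"):
    """Return (cleaned_df, bad_index): rows with text in `column`, and the index labels dropped."""
    is_text = df[column].apply(lambda v: isinstance(v, str) and v.strip() != "")
    bad_index = df.index[~is_text.to_numpy()].tolist()
    cleaned = df.loc[is_text].reset_index(drop=True)
    return cleaned, bad_index

# Hypothetical stand-in for the full dataset, with one missing description.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "details": ["Some text", float("nan"), "More text"],
})

cleaned, bad_index = clean_details(toy)
print(bad_index)      # [1]
print(len(cleaned))   # 2

# Step 2 (not run here): convert the cleaned text to spaCy documents, for example
# docs = list(nlp.pipe(cleaned["details"]))
```

Keeping the dropped index labels makes it possible to check afterwards which job postings were excluded.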