<h1>Named Entity Recognition (NER) for job data 0: Data Loading</h1>
<h3>Adel Rahmani</h3>
<hr style="height:5px;border:none;color:#333;background-color:#333;" />

<div style="background-color:#F2FBEF;">
<h2><font color=#04B404>Summary</font></h2>
This notebook loads and merges the Adzuna job data sets and saves the result to a Parquet file.
</div>
<hr>

In [1]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')

----
# The Data

This pipeline uses multiple data sources to construct an annotated data set for Named Entity Recognition (NER) for job ads.

The job ads come from the [Kaggle Adzuna](https://www.kaggle.com/c/job-salary-prediction/data) data set, which contains over 300,000 job ads, mostly from the UK.
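The loading cell further down passes `.zip` paths straight to `pd.read_csv`. As a minimal sketch of why that works, the snippet below builds a tiny single-file zip archive and reads it back; by default pandas infers the compression from the file extension. The file name `toy_ads.csv.zip` and its contents are made up for illustration and are not part of the Adzuna data.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical CSV content standing in for a real Adzuna file.
csv_text = (
    "Id,Title,Company\n"
    "1,Sub Agent,VGC\n"
    "2,Purchase Ledger Clerk,Hays Sheffield\n"
)

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "toy_ads.csv.zip"
    # Write a zip archive that contains exactly one CSV file.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("toy_ads.csv", csv_text)
    # compression='infer' is the default, so the .zip suffix is enough.
    toy = pd.read_csv(archive)

print(toy)
```

This only works when the archive holds a single file; an archive with several members would have to be opened with `zipfile` first.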


---
## Loading the Adzuna data sets

In [2]:
%%time

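df_list = []

# Load each Adzuna file; pandas infers the zip compression from the extension.
for source in ('data/Train_rev1.csv.zip', 'data/Test_rev1.zip', 'data/Valid_rev1.csv.zip'):

    df_ = (
        pd.read_csv(source)
        # Strip surrounding whitespace from the main text fields.
        .assign(
            Title=lambda df: df.Title.str.strip(),
            FullDescription=lambda df: df.FullDescription.str.strip(),
            Company=lambda df: df.Company.str.strip(),
        )
        # Keep only ads that have a title, a company and a description.
        .dropna(subset=['Title', 'Company', 'FullDescription'])
        # Drop ads whose title contains a literal asterisk
        # (regex=False avoids the invalid '\*' escape sequence).
        .loc[lambda df: ~df.Title.str.contains('*', regex=False)]
    )

    df_list.append(df_)

# Stack the three files and drop the salary columns.
data = (
    pd.concat(df_list, axis=0, ignore_index=True)
    .drop(['SalaryRaw', 'SalaryNormalized'], axis=1)
)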

CPU times: user 7.85 s, sys: 662 ms, total: 8.51 s
Wall time: 8.59 s
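To make the effect of the cleaning chain easier to see, here is a small sketch that applies the same three steps (strip whitespace, drop rows with missing key fields, drop titles containing an asterisk) to a toy frame. All values in `toy` are invented for illustration, and the asterisk filter is written with `.loc` and `regex=False`, which is assumed to be equivalent to the `query` call in the notebook.

```python
import pandas as pd

# Invented rows: one clean ad, one with an asterisk in the title,
# one with a missing title, and one with a missing company.
toy = pd.DataFrame({
    "Title": ["  Data Engineer ", "****** Manager", None, "Nurse"],
    "FullDescription": ["Build pipelines. ", "Lead a team", "Orphan ad", "Care for patients"],
    "Company": ["Acme Ltd", "Initech", "Globex", None],
})

cleaned = (
    toy
    .assign(
        Title=lambda df: df.Title.str.strip(),
        FullDescription=lambda df: df.FullDescription.str.strip(),
        Company=lambda df: df.Company.str.strip(),
    )
    .dropna(subset=["Title", "Company", "FullDescription"])
    # Literal match on '*', so no regular expression escaping is needed.
    .loc[lambda df: ~df.Title.str.contains("*", regex=False)]
)

print(cleaned)
```

Only the first row survives: its title and description are stripped, while the other three rows are removed by the asterisk filter or by `dropna`.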


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320306 entries, 0 to 320305
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Id                  320306 non-null  int64 
 1   Title               320306 non-null  object
 2   FullDescription     320306 non-null  object
 3   LocationRaw         320306 non-null  object
 4   LocationNormalized  320306 non-null  object
 5   ContractType        88223 non-null   object
 6   ContractTime        246321 non-null  object
 7   Company             320306 non-null  object
 8   Category            320306 non-null  object
 9   SourceName          320305 non-null  object
dtypes: int64(1), object(9)
memory usage: 24.4+ MB
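The `info()` output shows that `ContractType` and `ContractTime` are only partly populated (88,223 and 246,321 non-null values out of 320,306, i.e. roughly 72% and 23% missing). A quick way to get such shares directly is sketched below on an invented toy frame; on the real data the same expression would be applied to `data`.

```python
import pandas as pd

# Invented sample with the same kind of gaps as the Adzuna columns.
toy = pd.DataFrame({
    "Title": ["Sub Agent", "Purchase Ledger Clerk", "Trainer", "Developer"],
    "ContractType": [None, "full_time", None, None],
    "ContractTime": ["contract", None, "permanent", "permanent"],
})

# isna() gives a boolean frame; the column mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_share)
```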


In [4]:
data.sample(5, random_state=0)

| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|
| 135923 | 71469886 | VBNet Developer (SQL Server, ASPNET) Harrogate | VB.Net Developer (SQL Server, ASP.NET) Harrog... | Harrogate | Harrogate | | permanent | Applause IT Limited | IT Jobs | jobsite.co.uk |
| 178054 | 72446229 | Centre Based Trainer in IT and AutoCAD | CAD Centre (UK) Ltd Centre Based Trainer in IT... | East London London South East | South East London | | permanent | The CAD Centre Ltd | IT Jobs | totaljobs.com |
| 188359 | 72635123 | Sub Agent | Sub Agent required for a major Rail/Civil Engi... | Newcastle upon Tyne, Tyne and Wear | Newcastle Upon Tyne | | contract | VGC | Engineering Jobs | cv-library.co.uk |
| 126675 | 71288596 | Purchase Ledger Clerk | Hays Accountancy and Finance are currently rec... | Sheffield | Sheffield | | | Hays Sheffield | Accounting & Finance Jobs | MyUkJobs |
| 211726 | 68680609 | Managing Consultant Construction/Civils/FM | Managing Consultant Construction/Civils/FM B... | Bristol Avon South West | UK | | permanent | Fresh Partnership | HR & Recruitment Jobs | totaljobs.com |


In [5]:
data.to_parquet('data/Adzuna.parq')
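A simple sanity check after saving is to read the file back and compare it with the frame in memory. The sketch below does this on an invented toy frame in a temporary directory; `parquet_roundtrip_equal` is a hypothetical helper, not part of the notebook. Parquet support in pandas needs an engine such as `pyarrow` or `fastparquet`, so the call is guarded in case neither is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def parquet_roundtrip_equal(df: pd.DataFrame) -> bool:
    """Write df to a temporary Parquet file, read it back, and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.parq"
        df.to_parquet(path)
        restored = pd.read_parquet(path)
    return df.equals(restored)


# Invented rows, including a missing value as in the real ContractType column.
toy = pd.DataFrame({
    "Id": [1, 2],
    "Title": ["Sub Agent", "Purchase Ledger Clerk"],
    "ContractType": [None, "full_time"],
})

try:
    roundtrip_ok = parquet_roundtrip_equal(toy)
except ImportError:
    # No Parquet engine available in this environment.
    roundtrip_ok = None

print(roundtrip_ok)
```

On the real data, the same idea would be `pd.read_parquet('data/Adzuna.parq').equals(data)`.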