# FIT5196 Task 2 in Assessment 2
#### Student Name: Chuangfu Xie
#### Student ID: 27771539

Date: 04/05/2018

Version: 1.0

Environment: Python 3.6.4

Packages list:  
1. **Pandas** (0.22.0)  
2. **numpy** (1.14.0)
3. **recordlinkage** (0.11.2): tools needed for record linkage and deduplication

In [1]:
import sys
print (sys.version)

3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


## 1.  Import libraries 

In [2]:
import pandas as pd
import numpy as np
import recordlinkage

## 2. Identify conflicts and resolution

In [3]:
df = pd.read_csv("./dataset2_integration.csv")
df_sln = pd.read_csv("./dataset1_solution.csv")

### 2.1 Schema conflicts
First, let's have a look at these two datasets:

In [4]:
df.head()

Unnamed: 0,Id,Source Name,Title,location,Contract Type,Contract Time,Company,Category,Salary per month,OpenDate,CloseDate
0,69247680,jobs.guardian.co.uk,Business Development Exec research / insight ...,London,ft,perm.,BOYCE RECRUITMENT,"PR, Advertising & Marketing Jobs",2125,2012-03-01 15:00:00,2012-03-31 15:00:00
1,69247682,icaewjobs.com,Audit Senior London,London,ft,contr.,Pro Finance,Finance & Accounting Jobs,3600,2013-03-14 12:00:00,2013-04-13 12:00:00
2,69247685,jobs.guardian.co.uk,PR & Social Media Account Executive Top Inter...,Central London,ft,perm.,ECOM RECRUITMENT LTD,"PR, Advertising & Marketing Jobs",1917,2012-02-14 12:00:00,2012-03-15 12:00:00
3,69247688,jobs.guardian.co.uk,Content Manager,London,ft,perm.,NAKAMA LONDON,"PR, Advertising & Marketing Jobs",2083,2012-10-21 00:00:00,2012-12-20 00:00:00
4,69247694,jobs.guardian.co.uk,AV PRODUCTION MANAGER,Worcestershire,ft,perm.,LIVE RECRUITMENT,"PR, Advertising & Marketing Jobs",2292,2012-02-22 00:00:00,2012-04-22 00:00:00


In [5]:
df_sln.head()

Unnamed: 0,Id,Title,Location,ContractType,ContractTime,Company,Category,Salary per annum,SourceName,OpenDate,CloseDate
0,12612628,Engineering Systems Analyst,Dorking,not available,permanent,Gregory Martin International,Engineering Jobs,24996,cv-library.co.uk,2012-11-03 00:00:00,2012-12-03 00:00:00
1,12612830,Stress Engineer Glasgow,Glasgow,not available,permanent,Gregory Martin International,Engineering Jobs,30000,cv-library.co.uk,2013-01-08 15:00:00,2013-04-08 15:00:00
2,12612844,Modelling and simulation analyst,Hampshire,not available,permanent,Gregory Martin International,Engineering Jobs,30000,cv-library.co.uk,2013-07-26 15:00:00,2013-09-24 15:00:00
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Surrey,not available,permanent,Gregory Martin International,Engineering Jobs,27504,cv-library.co.uk,2012-12-14 00:00:00,2013-03-14 00:00:00
4,12613647,"Pioneer, Miser Engineering Systems Analyst",Surrey,not available,permanent,Gregory Martin International,Engineering Jobs,24996,cv-library.co.uk,2013-10-25 00:00:00,2013-12-24 00:00:00


Comparing the two datasets reveals several naming conflicts, where different names are used for the same attribute:
1. <font color='blue'>**location**</font> versus <font color='blue'>**Location**</font>
2. <font color='blue'>**Contract Type**</font> versus <font color='blue'>**ContractType**</font>
3. <font color='blue'>**Contract Time**</font> versus <font color='blue'>**ContractTime**</font>
4. <font color='blue'>**Source Name**</font> versus <font color='blue'>**SourceName**</font>

First, we resolve all **naming issues** with the `rename()` function:

In [6]:
df.rename(columns={'location':'Location', 
                   'Contract Type':'ContractType', 
                   'Contract Time':'ContractTime',
                   'Source Name':'SourceName'}, inplace=True)
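
The conflicts listed above were found by eye. As a minimal sketch, assuming the variants differ only in letter case or spacing, they could also be detected programmatically. The helper names `normalise` and `naming_conflicts` are hypothetical and not part of the original notebook:

```python
def normalise(name):
    """Lower-case a column name and drop spaces so spelling variants compare equal."""
    return name.replace(" ", "").lower()


def naming_conflicts(cols_a, cols_b):
    """Return (name_in_a, name_in_b) pairs that differ only in case or spacing."""
    lookup = {normalise(c): c for c in cols_b}
    return [(c, lookup[normalise(c)])
            for c in cols_a
            if normalise(c) in lookup and c != lookup[normalise(c)]]


# Column names copied from the two headers shown above
cols_df = ['Id', 'Source Name', 'Title', 'location', 'Contract Type',
           'Contract Time', 'Company', 'Category', 'Salary per month',
           'OpenDate', 'CloseDate']
cols_sln = ['Id', 'Title', 'Location', 'ContractType', 'ContractTime',
            'Company', 'Category', 'Salary per annum', 'SourceName',
            'OpenDate', 'CloseDate']

conflicts = naming_conflicts(cols_df, cols_sln)
print(conflicts)
```

The salary columns are not reported because they differ in meaning (month versus annum), not just in spelling, so they still need a manual decision.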

### 2.2 Identify attribute-level problems
With the schema conflicts resolved, we also notice that the two datasets report salary over different time periods:
1. <font color='blue'>**Salary per month**</font> versus <font color='blue'>**Salary per annum**</font>.

We then unify the **salary** columns with a simple arithmetic formula:  
$$Annual = Monthly \times 12$$

In [7]:
df['Salary per month'] = df['Salary per month']*12
# Then, rename this column
df.rename(columns={'Salary per month':'Salary per annum'}, inplace=True)

We also notice that the values in the following columns are encoded differently in the two datasets:

In [8]:
print("From df:")
print(df.ContractTime.value_counts())
print(df.ContractType.value_counts())
print('\n')
print("From df_sln:")
print(df_sln.ContractTime.value_counts())
print(df_sln.ContractType.value_counts())

From df:
perm.     16505
contr.     3258
Name: ContractTime, dtype: int64
ft    4353
pt     466
Name: ContractType, dtype: int64


From df_sln:
permanent        16194
not available     6212
contract          2671
Name: ContractTime, dtype: int64
not available    19499
full-time         4883
part-time          695
Name: ContractType, dtype: int64


In [9]:
print('NaN in ContractTime: ' + str(df.ContractTime.isnull().sum()))
print('NaN in ContractType: ' + str(df.ContractType.isnull().sum()))

NaN in ContractTime: 5513
NaN in ContractType: 20457


Dataset `df` contains many missing values in <font color='blue'>**ContractTime**</font> (5513) and <font color='blue'>**ContractType**</font> (20457). In addition, the same entity is represented in different forms:
1. `perm.` versus `permanent`
2. `contr.` versus `contract`
3. `ft` versus `full-time`
4. `pt` versus `part-time`

The values in <font color='blue'>**Company**</font> are also inconsistent in letter case:
1. `BOYCE RECRUITMENT` (upper case) versus `Gregory Martin International` (title case)

We can use the `map()` function to replace the abbreviated values, `str.title()` to normalise the company names, and `fillna()` to deal with the `NaN` values:

In [10]:
df.Company = df.Company.str.title()
df.ContractType = df.ContractType.map({'ft':'full-time',
                                       'pt':'part-time'})
df.ContractType.fillna('not available', inplace=True)
df.ContractTime = df.ContractTime.map({'perm.':'permanent',
                                       'contr.':'contract'})
df.ContractTime.fillna('not available', inplace=True)
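
One property of `map()` with a dictionary is worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. The toy series below is made up for illustration (the `'FT'` entry does not come from the dataset) and sketches how an unexpected code would silently end up as `'not available'` after `fillna()`:

```python
import numpy as np
import pandas as pd

# Toy data: two known codes, one missing value, one unexpected spelling
raw = pd.Series(['ft', 'pt', np.nan, 'FT'])

# Values absent from the dictionary (NaN and 'FT') are mapped to NaN
mapped = raw.map({'ft': 'full-time', 'pt': 'part-time'})

# Every NaN, whatever its origin, is then replaced by the placeholder
filled = mapped.fillna('not available')
print(filled.tolist())
```

Checking `value_counts()` before mapping, as done above, is what guarantees that only the expected codes are present.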

### 2.3 Deduplication 

In this section, we identify duplicates within and between the datasets. Each dataset has its own key, **`Id`**, as the **identifier** for each record. However, **`Id`** alone is not sufficient to detect duplicates, because the same job can appear under different identifiers while the other 10 attributes characterise each record as well. Hence, we need to find a **suitable combination key** made of multiple attributes.

Earlier we worked on <font color='blue'>**ContractTime**</font> and <font color='blue'>**ContractType**</font>. These are not suitable attributes for discovering duplicates, since each has only 3 possible values and therefore cannot differentiate records well. The same reasoning applies to other attributes:

In [11]:
def check_unique(df):
    for col in df.columns.tolist():
        print(col + ": " + str(len(df[col].unique())))
check_unique(df)

Id: 25276
SourceName: 92
Title: 25276
Location: 475
ContractType: 3
ContractTime: 3
Company: 5546
Category: 8
Salary per annum: 1505
OpenDate: 2193
CloseDate: 2395


From the output we can see that, excluding **`Id`**, 6 attributes have the highest numbers of unique values: <font color='blue'>**Title**</font>, <font color='blue'>**Location**</font>, <font color='blue'>**Company**</font>, <font color='blue'>**Salary per annum**</font>, <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font>. 
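
As an aside, the per-column count of distinct values can also be obtained with the built-in `DataFrame.nunique()`. The sketch below uses a made-up toy frame rather than the real data. Passing `dropna=False` counts a missing value as a value of its own, which matches what `len(df[col].unique())` does:

```python
import pandas as pd

# Toy frame with one missing company, for illustration only
toy = pd.DataFrame({
    'Title': ['Audit Senior London', 'Content Manager', 'Care Assistant Derby'],
    'Location': ['London', 'London', 'Derby'],
    'ContractType': ['full-time', 'full-time', 'full-time'],
    'Company': ['Pro Finance', None, 'Albion Health'],
})

# Number of distinct values per column, most discriminating columns first
uniqueness = toy.nunique(dropna=False).sort_values(ascending=False)
print(uniqueness)
```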

However, under the rule of null equality, **a null value is never equal to itself**. Let's check whether any `null` values exist in the datasets:

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25276 entries, 0 to 25275
Data columns (total 11 columns):
Id                  25276 non-null int64
SourceName          25276 non-null object
Title               25276 non-null object
Location            25276 non-null object
ContractType        25276 non-null object
ContractTime        25276 non-null object
Company             24178 non-null object
Category            25276 non-null object
Salary per annum    25276 non-null int64
OpenDate            25276 non-null object
CloseDate           25276 non-null object
dtypes: int64(2), object(9)
memory usage: 2.1+ MB


In [13]:
df_sln.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25077 entries, 0 to 25076
Data columns (total 11 columns):
Id                  25077 non-null int64
Title               25077 non-null object
Location            25077 non-null object
ContractType        25077 non-null object
ContractTime        25077 non-null object
Company             21242 non-null object
Category            25077 non-null object
Salary per annum    25077 non-null int64
SourceName          25077 non-null object
OpenDate            25077 non-null object
CloseDate           25077 non-null object
dtypes: int64(2), object(9)
memory usage: 2.1+ MB


These null values in <font color='blue'>**Company**</font> would interfere with the comparison process, so we replace them with 'Not Given':

In [14]:
df.Company.fillna("Not Given",inplace=True)
df_sln.Company.fillna("Not Given",inplace=True)
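
The null-equality rule mentioned earlier can be illustrated in plain Python. This is a small sketch with made-up key tuples, not a reproduction of how the record linkage library handles missing values internally:

```python
import math

nan = float('nan')

# IEEE 754: NaN compares unequal to everything, including itself
print(nan == nan)        # False
print(math.isnan(nan))   # True


def keys_equal(key_a, key_b):
    """Compare two combination keys attribute by attribute."""
    return all(a == b for a, b in zip(key_a, key_b))


# Two otherwise identical keys with a missing company never match ...
with_nan = keys_equal(('Care Assistant', float('nan')),
                      ('Care Assistant', float('nan')))

# ... but they do once the missing value is replaced by a placeholder
with_placeholder = keys_equal(('Care Assistant', 'Not Given'),
                              ('Care Assistant', 'Not Given'))
print(with_nan, with_placeholder)  # False True
```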

Now that we have all the candidate attributes for our combination key, we can work on the deduplication.

We use the **Python Record Linkage Toolkit** to link records within or between data sources for record linkage and deduplication. In particular, the `BlockIndex` class is useful for pairing records that agree on more than one attribute:

In [15]:
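def check_dup(key, df1, df2=None):
    # Block on the key columns: only records agreeing on every key attribute are paired
    indexer = recordlinkage.BlockIndex(on=key)
    if df2 is None:
        # Deduplication: look for candidate pairs within one dataset
        pairs = indexer.index(df1)
    else:
        # Record linkage: look for candidate pairs between two datasets
        pairs = indexer.index(df1, df2)
    print(len(pairs))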

key = ['Title', 'Location', 'Company', 'Salary per annum', 'OpenDate', 'CloseDate']
check_dup(key, df)

0


It seems that all records in `df` are distinct.  
Let's check the other dataset:

In [16]:
check_dup(key, df_sln)

0


Since we are going to merge these two datasets into one, are there duplicates between them?

In [17]:
check_dup(key, df, df_sln)

154
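
To make the idea of blocking concrete, here is a minimal pandas-only sketch on made-up records. It assumes that blocking on a list of attributes amounts to an inner join on those attributes, and it is not the toolkit's own implementation:

```python
import pandas as pd

# Toy datasets; only the first record of each is the same job
left = pd.DataFrame({'Id': [1, 2],
                     'Title': ['Care Assistant', 'Content Manager'],
                     'Location': ['Derby', 'London']})
right = pd.DataFrame({'Id': [10, 11],
                      'Title': ['Care Assistant', 'Audit Senior'],
                      'Location': ['Derby', 'London']})

key = ['Title', 'Location']

# An inner join on the key keeps only records that agree on every key attribute;
# reset_index() exposes the row positions so they can be reported as pairs
joined = left.reset_index().merge(right.reset_index(), on=key,
                                  suffixes=('_left', '_right'))
candidate_pairs = list(zip(joined['index_left'], joined['index_right']))
print(candidate_pairs)
```

Each tuple is a (row in `left`, row in `right`) pair, which is the same kind of information the 154 pairs above carry.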


We have found **154 potential duplicates** between the datasets. We can now compare each attribute and compute a similarity score to support the decision on which records to drop. 

The `string()` method of `Compare` provides various algorithms for **string matching**: `'jaro'`, `'jarowinkler'`, `'levenshtein'`, `'damerau_levenshtein'`, `'qgram'` or `'cosine'`. We use `'jaro'` for <font color='blue'>**Title**</font>, <font color='blue'>**Location**</font> and <font color='blue'>**Company**</font>.
We also need to choose the threshold above which a similarity score counts as a match:
1. Values in <font color='blue'>**Title**</font> vary widely, so we set the threshold to 0.5.
2. For <font color='blue'>**Location**</font>, we set it to 0.5.
3. For <font color='blue'>**Company**</font>, we set it to 0.85.
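
To show what a Jaro score measures and how a threshold turns it into a match decision, here is a plain reference sketch of the textbook Jaro similarity. It is written for illustration only and is not the code the toolkit runs, so scores may differ in edge cases:

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


score = jaro("MARTHA", "MARHTA")
print(round(score, 4))   # 0.9444
# With a threshold, the score is reduced to a binary match decision
is_match = score >= 0.85
print(is_match)          # True
```

A lower threshold such as 0.5 accepts much looser matches, which is why it is reserved here for the highly varied titles and locations.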

For matching numeric values, the `numeric()` method also provides several options: `'step'`, `'linear'`, `'exp'`, `'gauss'` or `'squared'`. Considering each job posting as an independent event, we regard <font color='blue'>**Salary per annum**</font> as a random variable whose distribution is unknown. We therefore assume it is approximately normally distributed and use the `'gauss'` method.  
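
As a rough sketch of how a Gaussian-shaped numeric similarity behaves, the function below assumes the form $2^{-((d - \text{offset})/\text{scale})^2}$, where $d$ is the absolute difference between the two values. With this form the score is 1 for identical values and drops to 0.5 when the values are `scale` apart. The exact formula used by the toolkit is an assumption here and should be checked against its documentation:

```python
def gauss_similarity(a, b, scale, offset=0.0):
    """Similarity in (0, 1] that decays with |a - b| along a Gaussian-shaped curve.

    Assumed form: 2 ** (-((d - offset) / scale) ** 2), so the score is 0.5
    when the two values are offset + scale apart.
    """
    d = max(abs(a - b), offset)
    return 2.0 ** (-((d - offset) / scale) ** 2)


same = gauss_similarity(12144, 12144, scale=1000)   # identical salaries
near = gauss_similarity(12144, 13144, scale=1000)   # exactly one scale apart
far = gauss_similarity(12144, 17144, scale=1000)    # five scales apart
print(same, near, far)
```

Under this assumption, `scale=1000` means that two salaries 1000 apart are treated as half similar, while larger gaps quickly approach zero.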

For <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font>, we use exact matching because timestamps are precise values that should be identical in duplicate records.

In [18]:
key = ['Title', 'Location', 'Company', 'Salary per annum', 'OpenDate', 'CloseDate']
indexer = recordlinkage.BlockIndex(on=key)
pairs = indexer.index(df, df_sln)
compare = recordlinkage.Compare()
# Configure parameters for Compare()
# String matching
compare.string('Title','Title', threshold = 0.5, method='jaro', label='Title')
compare.string('Location','Location', threshold = 0.5, method='jaro', label='Location')
compare.string('Company','Company', threshold = 0.85, method='jaro', label='Company')
# Numeric matching
compare.numeric('Salary per annum','Salary per annum', scale=1000, method='gauss', label='Salary per annum')
# Exact matching
compare.exact('OpenDate','OpenDate',label='OpenDate')
compare.exact('CloseDate','CloseDate',label='CloseDate')
# yield the output
results = compare.compute(pairs, df, df_sln)

In [19]:
# Sum the comparison results.
results.sum(axis=1).value_counts().sort_index(ascending=False)

6.0    154
dtype: int64
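
The summing step can be sketched on a small made-up score table. The index pairs and scores below are invented for illustration. Each row holds the per-attribute scores of one candidate pair, and a pair is kept as a duplicate only when its total reaches the number of compared attributes:

```python
import pandas as pd

# Toy comparison result: one row per candidate pair (row in df, row in df_sln)
index = pd.MultiIndex.from_tuples([(0, 5), (1, 7), (2, 9)])
scores = pd.DataFrame({'Title': [1, 1, 0],
                       'Company': [1, 0, 0],
                       'OpenDate': [1, 1, 0]}, index=index)

# Total score per pair, then keep pairs that agree on every attribute
total = scores.sum(axis=1)
matches = scores[total >= scores.shape[1]]
print(matches.index.tolist())
```

In the notebook every pair reaches the maximum of 6.0, which is expected because the blocking key already requires exact agreement on all six attributes.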

All 154 candidate pairs score 6.0, meaning they match on every compared attribute. Let's take a closer look at these records:

In [20]:
indices = results.index.get_values()
dp_list1= []
dp_list2= []
# extract index from results
for index in indices:
    dp_list1.append(index[0])
    dp_list2.append(index[1])
# For double-check
dp_df = df.iloc[dp_list1]
dp_df.columns = df.columns
dp_df_sln = df_sln.iloc[dp_list2]
dp_df_sln.columns = df_sln.columns
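
As an alternative to the explicit loop, the two halves of each pair can be pulled out of the `MultiIndex` directly with `get_level_values()`. A minimal sketch on a made-up index:

```python
import pandas as pd

# Toy pair index: first level refers to rows of df, second level to rows of df_sln
pair_index = pd.MultiIndex.from_tuples([(3, 10), (8, 42)])

# Level 0 holds the rows from the first dataset, level 1 those from the second
left_rows = pair_index.get_level_values(0).tolist()
right_rows = pair_index.get_level_values(1).tolist()
print(left_rows, right_rows)
```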

It seems we have found all the duplicates. Let's pick a random pair to examine in detail:

In [21]:
from random import randint
pick = randint(0,len(results)-1)
check = dp_df.iloc[pick].append(dp_df_sln.iloc[pick])
check

Id                                   71198448
SourceName                   cv-library.co.uk
Title                   Care Assistant  Derby
Location                                Derby
ContractType                    not available
ContractTime                        permanent
Company                         Albion Health
Category            Healthcare & Nursing Jobs
Salary per annum                        12144
OpenDate                  2012-08-23 00:00:00
CloseDate                 2012-10-22 00:00:00
Id                                   69012230
Title                   Care Assistant  Derby
Location                                Derby
ContractType                    not available
ContractTime                        permanent
Company                         Albion Health
Category            Healthcare & Nursing Jobs
Salary per annum                        12144
SourceName                   cv-library.co.uk
OpenDate                  2012-08-23 00:00:00
CloseDate                 2012-10-

### 2.4 Drop duplicate data and Merge

Now we can drop all the duplicates from `df` by index:

In [22]:
df = df.drop(index=dp_list1)
# Double check
check_dup(key, df, df_sln)

0


Then, merge the datasets into one:

In [23]:
new = pd.concat([df, df_sln], ignore_index=True)
# Set 'Id' as the index (identifier)
new.set_index('Id',inplace=True)
new.head()

Id,Category,CloseDate,Company,ContractTime,ContractType,Location,OpenDate,Salary per annum,SourceName,Title
69247680,"PR, Advertising & Marketing Jobs",2012-03-31 15:00:00,Boyce Recruitment,permanent,full-time,London,2012-03-01 15:00:00,25500,jobs.guardian.co.uk,Business Development Exec research / insight ...
69247682,Finance & Accounting Jobs,2013-04-13 12:00:00,Pro Finance,contract,full-time,London,2013-03-14 12:00:00,43200,icaewjobs.com,Audit Senior London
69247685,"PR, Advertising & Marketing Jobs",2012-03-15 12:00:00,Ecom Recruitment Ltd,permanent,full-time,Central London,2012-02-14 12:00:00,23004,jobs.guardian.co.uk,PR & Social Media Account Executive Top Inter...
69247688,"PR, Advertising & Marketing Jobs",2012-12-20 00:00:00,Nakama London,permanent,full-time,London,2012-10-21 00:00:00,24996,jobs.guardian.co.uk,Content Manager
69247694,"PR, Advertising & Marketing Jobs",2012-04-22 00:00:00,Live Recruitment,permanent,full-time,Worcestershire,2012-02-22 00:00:00,27504,jobs.guardian.co.uk,AV PRODUCTION MANAGER
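
The two datasets list their columns in different orders, which is safe because `pd.concat` aligns columns by name rather than by position. A small sketch with made-up frames illustrates this:

```python
import pandas as pd

# Same columns, different order, mirroring the situation of df and df_sln
a = pd.DataFrame({'Id': [1], 'Title': ['A'], 'SourceName': ['x']})
b = pd.DataFrame({'Id': [2], 'SourceName': ['y'], 'Title': ['B']})

# Rows are stacked and values end up under the column with the matching name
merged = pd.concat([a, b], ignore_index=True)
print(merged)
```

The alphabetical column order in the output above is presumably a consequence of the pandas version used; newer versions are expected to keep the column order of the first frame.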


## 3. Export CSV

In [24]:
new.to_csv('./dataset1_dataset2_solution.csv')

## 4. Summary
From schema integration to data integration, various identification techniques must be applied to discover what the conflicts are. When deduplicating data with numerous attributes, matching numeric and categorical values requires different algorithms. It is also important to select attributes by their level of uniqueness in order to build a suitable global key for detecting duplicates.

## Reference
* NumPy User Guide - Miscellaneous. *IEEE 754 Floating Point Special Values*. Retrieved from: [https://docs.scipy.org/doc/numpy/user/misc.html](https://docs.scipy.org/doc/numpy/user/misc.html)
* Python Record Linkage Toolkit Documentation. *Data deduplication*. Retrieved from: [http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html](http://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html)
* Anderson J. (2017, July 5). *Pre-processing with recordlinkage*. Retrieved from:  [http://networkslab.org/2017/07/05/2017-07-05-preprocessing/](http://networkslab.org/2017/07/05/2017-07-05-preprocessing/)