# Assignment 2: Entity Resolution (Part 2)


## Objective

In Assignment 2 (Part 2), you will learn how to use Active Learning to address the entity resolution problem. After completing this assignment, you should be able to answer the following questions:

1. Why Active Learning?
2. How to implement uncertainty sampling, a popular query strategy for Active Learning?
3. How to solve an ER problem using Active Learning?


## Active Learning

[Active learning](http://tiny.cc/al-wiki) is a family of ML techniques that can train a high-quality model at a small data-labeling cost. The basic idea is easy to understand. A typical supervised ML problem requires a (relatively) large training dataset, yet often only a small number of the data points in it actually improve the trained model. In other words, labeling a small number of well-chosen data points is enough to train a high-quality model. The goal of active learning is to help us identify those data points.


In this assignment, we will develop an Active Learning approach for Entity Resolution. The following figure shows the architecture of an entity resolution solution. It consists of four major steps. **I will provide you with the source code for Steps 1, 2, and 4. Your job is to implement Step 3.**

<img src="img/arch.png" width=800/>

### Step 1. Read Data

Suppose we get a restaurant dataset `restaurant.csv`. The data contains many duplicate restaurants. For example, the first two rows shown below are duplicates (i.e., they refer to the same real-world entity). You can find all duplicate (matching) record pairs in `true_matches.json`.

In [2]:
import pandas as pd

df = pd.read_csv('restaurant.csv')
data = df.values.tolist()
print("(#Rows, #Cols) :", df.shape)
df.head(5)


### Step 2. Similar Pairs

We first use a similarity-join algorithm to generate similar pairs. 
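For intuition, a similarity join can be sketched as "compare every pair of records and keep those whose similarity passes a threshold". The snippet below is only an illustrative sketch, not the implementation in `a2_utils`: the function names `jaccard` and `naive_simjoin`, the Jaccard-on-word-tokens similarity, the threshold of 0.4, and the toy rows are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def naive_simjoin(rows, threshold=0.4):
    """Return (row_i, row_j, similarity) triples with similarity >= threshold,
    sorted from most to least similar. Hypothetical helper for illustration."""
    # Tokenize each record by joining its fields and splitting on whitespace.
    token_sets = [set(" ".join(map(str, row)).lower().split()) for row in rows]
    pairs = []
    for i, j in combinations(range(len(rows)), 2):
        sim = jaccard(token_sets[i], token_sets[j])
        if sim >= threshold:
            pairs.append((rows[i], rows[j], sim))
    pairs.sort(key=lambda t: t[2], reverse=True)
    return pairs


# Made-up records (not taken from restaurant.csv).
toy_rows = [
    ["blue door cafe", "12 main st", "vancouver"],
    ["blue door cafe", "12 main street", "vancouver"],
    ["golden wok", "88 king rd", "burnaby"],
]
toy_pairs = naive_simjoin(toy_rows)
total_pairs = len(toy_rows) * (len(toy_rows) - 1) // 2
print(len(toy_pairs), "similar pair(s) out of", total_pairs)
```

This brute-force version compares all pairs, so it only conveys the idea; a practical similarity join avoids most of those comparisons.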

The code below calls the provided `simjoin` function. After running it, we get 678 similar pairs sorted in decreasing order of similarity.

In [None]:
from a2_utils import *

data = df.values.tolist()
simpairs = simjoin(data)

print("Num of Pairs: ", len(data)*(len(data)-1)//2)
print("Num of Similar Pairs: ", len(simpairs))
print("The Most Similar Pair: ", simpairs[0])

We can see that `simjoin` reduces the number of pairs from 367653 to 678. However, there are still many non-matching pairs in `simpairs` (see below).

In [None]:
print(simpairs[-1])

Next, we will use active learning to train a classifier, and then use the classifier to classify each pair in `simpairs` as either "matching" or "nonmatching". 

### Step 3. Active Learning

Given a set of similar pairs, what you need to do next is to iteratively train a classifier to decide which pairs are truly matching. We are going to use [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) as our classifier. 
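The key property we rely on is that a fitted logistic regression model reports a probability for each class, not just a label. The following is a minimal sketch on made-up one-dimensional data (the feature values and labels are invented for illustration); it shows how `predict_proba` exposes those probabilities and how its columns line up with `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one similarity-like feature per pair; 1 = match, 0 = non-match.
X_toy = np.array([[0.10], [0.20], [0.30], [0.70], [0.80], [0.90]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Each row of predict_proba sums to 1; column k is the probability of clf.classes_[k].
proba = clf.predict_proba(np.array([[0.15], [0.50], [0.85]]))
print(clf.classes_)
print(proba.round(3))
```

The row whose two probabilities are closest to each other is the prediction the model is least sure about, which is exactly the kind of data point active learning wants to have labeled.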

#### Initialization

At the beginning, all the pairs are unlabeled. To initialize a model, we first pick ten pairs and then label each pair using the `crowdsourcing()` function. You can assume that `crowdsourcing()` asks a crowd worker (e.g., on Amazon Mechanical Turk) to label a pair.


`crowdsourcing(pair)` is a function that simulates the use of crowdsourcing to label a pair.
  
  - **Input:**	pair – A pair of records 

  - **Output:**	Boolean –  *True*: The two records match; *False*: The two records do NOT match.
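To make "simulates" concrete, one plausible way to build such a function is to look the pair up in a set of known matches. The sketch below is an assumption, not the code in `a2_utils`: the helper name `make_simulated_crowd`, the convention that each record starts with a unique ID, and the toy records are all hypothetical.

```python
def make_simulated_crowd(true_match_ids):
    """Build a labeling function from known matching ID pairs (hypothetical helper)."""
    # Store each known match as an unordered pair of IDs.
    truth = {frozenset(ids) for ids in true_match_ids}

    def label(pair):
        # Assumption: the first field of each record is its unique ID.
        rec_a, rec_b = pair
        return frozenset((rec_a[0], rec_b[0])) in truth

    return label


# Made-up ground truth and records for illustration.
toy_label = make_simulated_crowd([(1, 2)])
rec1 = [1, "blue door cafe", "12 main st"]
rec2 = [2, "blue door cafe", "12 main street"]
rec3 = [3, "golden wok", "88 king rd"]
print(toy_label((rec1, rec2)), toy_label((rec2, rec3)))
```

Using unordered ID pairs means the answer does not depend on the order of the two records in the pair.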

Please use the following code to do the initialization. 

In [3]:
from a2_utils import crowdsourcing

# choose the five most similar and the five least similar pairs as initial training data
init_pairs = simpairs[:5] + simpairs[-5:]
matches = []
nonmatches = []
for pair in init_pairs:
    is_match = crowdsourcing(pair)
    if is_match:
        matches.append(pair)
    else:
        nonmatches.append(pair)
        
print("Number of matches: ", len(matches))
print("Number of nonmatches: ", len(nonmatches))




Here is the only code you need to write in this assignment.


In [1]:
import random

from a2_utils import featurize, crowdsourcing
from sklearn.linear_model import LogisticRegression

labeled_pairs = matches + nonmatches
unlabeled_pairs = [p for p in simpairs if p not in labeled_pairs]
# number of active-learning iterations
iter_num = 5
# featurize the unlabeled pairs so that the model can score them
fea_unlabel = list(map(featurize, unlabeled_pairs))
# featurize the labeled pairs so that the model can be fit on them
x = list(map(featurize, labeled_pairs))
# reuse the labels collected during initialization instead of asking the crowd again
y = [True] * len(matches) + [False] * len(nonmatches)
# train the initial model on the labeled pairs
logmodel = LogisticRegression().fit(x, y)

for i in range(iter_num):
    # predicted class probabilities for every unlabeled pair
    pred = logmodel.predict_proba(fea_unlabel)
    # confidence = gap between the two class probabilities (smaller means less certain)
    confidence = [abs(p[1] - p[0]) for p in pred]
    # find the most uncertain pair; break ties randomly
    min_confidence = min(confidence)
    candidates = [j for j, c in enumerate(confidence) if c == min_confidence]
    most_uncertain_index = random.choice(candidates)
    print('most uncertain pair confidence:', confidence[most_uncertain_index])
    # ask the crowd to label the most uncertain pair
    result = crowdsourcing(unlabeled_pairs[most_uncertain_index])
    # move the pair (and its features) from the unlabeled set to the training data
    x.append(fea_unlabel.pop(most_uncertain_index))
    labeled_pairs.append(unlabeled_pairs.pop(most_uncertain_index))
    y.append(result)
    # retrain the model on the enlarged training set
    logmodel.fit(x, y)

model = logmodel


**[Algorithm Description].**   Active learning has many [query strategies](http://tiny.cc/al-wiki-qs) for deciding which data point should be labeled next. You need to implement uncertainty sampling. The algorithm first trains an initial model on `labeled_pairs` and then retrains it iteratively. At each iteration, it applies the model to `unlabeled_pairs` and predicts a label for each unlabeled pair along with a probability that indicates the confidence of the prediction. It then selects the most uncertain pair, i.e., the pair whose predicted probability is closest to 0.5 (if there is a tie, break it randomly), and calls the `crowdsourcing()` function to label it. After the pair is labeled, it updates `labeled_pairs` and `unlabeled_pairs` and retrains the model on `labeled_pairs`.

**[Input].** 
- `labeled_pairs`: 10 labeled pairs (by default)
- `unlabeled_pairs`: 668 unlabeled pairs (by default)
- `iter_num`: 5 (by default)

**[Output].** 
- `model`: A logistic regression model built by scikit-learn
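Because `a2_utils` is only available inside the assignment environment, the loop described above can also be tried on synthetic data. The following is a self-contained sketch under stated assumptions: the random one-feature pool, the hidden labeling rule (`> 0.6`), and the `oracle` function standing in for `crowdsourcing()` are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for featurized pairs: one similarity-like feature per pair.
X_pool = rng.uniform(0.0, 1.0, size=(200, 1))
y_pool = (X_pool[:, 0] > 0.6).astype(int)  # hidden labels, revealed only on request


def oracle(i):
    """Hypothetical stand-in for crowdsourcing(): reveal the label of pool item i."""
    return int(y_pool[i])


# Seed set: the five lowest and five highest feature values.
order = np.argsort(X_pool[:, 0])
labeled = [int(i) for i in order[:5]] + [int(i) for i in order[-5:]]
seed = set(labeled)
unlabeled = [i for i in range(len(X_pool)) if i not in seed]
labels = [oracle(i) for i in labeled]

model = LogisticRegression().fit(X_pool[labeled], labels)

for _ in range(5):
    proba = model.predict_proba(X_pool[unlabeled])
    # Gap between the two class probabilities; the smallest gap is the most uncertain.
    margin = np.abs(proba[:, 1] - proba[:, 0])
    ties = np.flatnonzero(margin == margin.min())
    pick = unlabeled[rng.choice(ties)]  # random tie-breaking
    labeled.append(pick)
    labels.append(oracle(pick))
    unlabeled.remove(pick)
    model.fit(X_pool[labeled], labels)

print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```

Each iteration spends exactly one label, so after five iterations the labeled set has grown from 10 to 15 items while the rest of the pool stays unlabeled.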


### Step 4. Model Evaluation

After training a model, you can use the following code to evaluate it.

In [146]:
import numpy as np
from a2_utils import evaluate, featurize

sp_features = np.array([featurize(sp) for sp in simpairs])
labels = model.predict(sp_features)

# keep the pairs that the model predicts as matching
identified_matches = []
for pair, label in zip(simpairs, labels):
    if label == 1:
        identified_matches.append(pair)
        
precision, recall, fscore = evaluate(identified_matches)

print("Precision:", precision)
print("Recall:", recall)
print("Fscore:", fscore)
   

Precision: 0.8660714285714286
Recall: 0.9150943396226415
Fscore: 0.8899082568807338
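For reference, the three metrics can be computed from the set of identified matches and the set of true matches. This is a minimal sketch of the standard definitions, not the code of `evaluate` in `a2_utils` (which presumably loads the ground truth itself); the function name `precision_recall_f1` and the toy pairs are assumptions for illustration.

```python
def precision_recall_f1(identified, truth):
    """Precision, recall, and F-score of identified pairs against true pairs."""
    identified, truth = set(identified), set(truth)
    true_positives = len(identified & truth)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Made-up pairs: 2 of the 3 identified pairs are correct, and 2 of the 4 true pairs are found.
toy_truth = {("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")}
toy_identified = {("r1", "r2"), ("r3", "r4"), ("r5", "r9")}
p, r, f = precision_recall_f1(toy_identified, toy_truth)
print("Precision:", p)
print("Recall:", r)
print("Fscore:", f)
```

The F-score is the harmonic mean of precision and recall, so it is high only when both are high.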


## Submission

Complete the code in A2-2.ipynb, and submit it to the CourSys activity Assignment 2.