# Final Replication - Result Analysis

Let's take a closer look at AutoPhrase's results.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
import re
import gensim

### 1. Try to randomly pick 100 multi-word phrases whose scores are greater than 0.5. Manually check them and see what's the percentage of high-quality phrases.
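
The annotated file loaded below already contains the sampled phrases, and the sampling step itself is not shown. As a rough sketch of how such a sample could be drawn reproducibly, assuming `(score, phrase)` pairs as input: the helper name `sample_phrases`, the fixed seed, and the toy data are illustrative assumptions, not part of the original pipeline.

```python
import random

def sample_phrases(scored_phrases, threshold=0.5, n=100, seed=0):
    """Draw up to n multi-word phrases whose score exceeds the threshold.

    scored_phrases: iterable of (score, phrase) pairs (hypothetical input format).
    """
    # Keep phrases above the threshold that contain at least one space.
    candidates = [
        phrase for score, phrase in scored_phrases
        if score > threshold and ' ' in phrase
    ]
    # A seeded generator makes the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))

# Toy data for illustration only.
toy = [(0.9, 'data mining'), (0.4, 'of the'),
       (0.7, 'query processing'), (0.8, 'database')]
toy_sample = sample_phrases(toy, n=2)
```

Fixing the seed keeps the manual annotations aligned with the same 100 phrases across reruns.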

In [None]:
sample = pd.read_csv("../references/annotated_multi-words.csv")
sample

In [None]:
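# Phrases manually labelled 1 are the high-quality ones
high_quality = sample[sample['label'] == 1].reset_index(drop=True)
# Share of high-quality phrases among all annotated phrases, in percent
p = high_quality.shape[0] / sample.shape[0] * 100

print(f'The percentage of high-quality phrases is: {p}%.')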

### 2. Since these 100 multi-word phrases can be ranked by their scores, please plot a precision-recall curve too.
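
To make the link between the score ranking and the curve concrete, the sketch below computes precision and recall at every rank cutoff in plain Python. It is a simplified illustration, not scikit-learn's implementation: `precision_recall_curve` works over distinct score thresholds and appends a final (precision = 1, recall = 0) point, and ties are not treated specially here. The function name and toy data are assumptions.

```python
def precision_recall_at_ranks(labels, scores):
    """Return (precisions, recalls) for cutoffs k = 1..n over the score-ranked items."""
    # Rank items from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(label for _, label in ranked)
    precisions, recalls = [], []
    hits = 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label  # label is 1 for a high-quality phrase, 0 otherwise
        precisions.append(hits / k)
        recalls.append(hits / total_positive if total_positive else 0.0)
    return precisions, recalls

# Toy data for illustration only.
toy_labels = [1, 0, 1, 1]
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_precision, toy_recall = precision_recall_at_ranks(toy_labels, toy_scores)
```

Each cutoff `k` treats the top `k` phrases as predicted positives, so recall only rises as `k` grows, while precision falls whenever a low-quality phrase enters the top `k`.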

In [None]:
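# Pass the arguments positionally: the keyword for the scores
# (`probas_pred`) has been renamed in recent scikit-learn releases
precision, recall, thresholds = precision_recall_curve(
    sample['label'],
    sample['score']
)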

plt.figure(figsize=(10, 8))
plt.plot(recall, precision, scalex=False, scaley=False)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for 100 Multi-Word Phrases')

#plt.savefig('../data/report/Precision-Recall Curve for 100 Multi-Word Phrases.png')
#plt.show()

### 3. Try to run the word2vec code on the phrasal segmentation results to obtain phrase embedding. 

To convert the phrasal segmentation results, we replace the spaces inside each phrase tag with underscores and remove the tags themselves.

- For example, if the line looks like: 
    - `<phrase>Overview</phrase> of the ADDS System. <phrase>Transaction Management</phrase> in <phrase>Multidatabase Systems</phrase>.`
- The converted line should look like:
    - `Overview of the ADDS System. Transaction_Management in Multidatabase_Systems.`
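
The conversion code is not shown in this notebook. A minimal sketch of it with a regular expression, assuming the tags are always written exactly as `<phrase>...</phrase>` and are never nested (the function name `convert_line` is hypothetical):

```python
import re

# Non-greedy match so each tag pair is handled separately.
PHRASE_TAG = re.compile(r'<phrase>(.*?)</phrase>')

def convert_line(line):
    """Drop the phrase tags and join the words inside each tag with underscores."""
    return PHRASE_TAG.sub(lambda m: m.group(1).replace(' ', '_'), line)

example = ('<phrase>Overview</phrase> of the ADDS System. '
           '<phrase>Transaction Management</phrase> in '
           '<phrase>Multidatabase Systems</phrase>.')
converted = convert_line(example)
```

After this step each multi-word phrase is a single whitespace-delimited token, so Word2Vec learns one vector per phrase.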
    
Then we train the Word2Vec model on the phrasal segmentation results.

In [None]:
multi = pd.read_csv('../data/out/AutoPhrase_multi-words.txt', sep="\t", header=None)
multi.columns = ['score', 'phrase']
# Keep only multi-word phrases with a score above 0.5
multi = multi[multi.score > 0.5].reset_index(drop=True)
# A set makes the membership checks in step 4 fast
multi = set(multi['phrase'])

In [None]:
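# Load the Word2Vec model trained on the converted segmentation results
model = gensim.models.Word2Vec.load("../data/report/word2vec.model")
# Keep the annotated high-quality phrases that appear in the model vocabulary
exist = []
for phrase in high_quality['phrase'].to_list():
    phrase = phrase.replace(' ', '_')
    if phrase in model.wv:
        exist.append(phrase)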

### 4. Pick 3 high-quality phrases from your previous annotations in step 1, run a similarity search among all multi-word phrases whose scores are greater than 0.5, and report the top-5 results. Comment on the results. 
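
The filtering used in this step can be sketched without gensim: walk a neighbor list that is already sorted by similarity and keep the first `k` tokens whose space-separated form is an allowed phrase. The helper name `top_k_allowed` and the toy neighbors are assumptions for illustration.

```python
def top_k_allowed(neighbors, allowed, k=5):
    """Return the first k neighbor tokens whose phrase form is in `allowed`.

    neighbors: (token, similarity) pairs sorted by decreasing similarity.
    allowed: set of space-separated phrases (e.g. those with score > 0.5).
    """
    kept = []
    for token, _similarity in neighbors:
        # Tokens use underscores; the phrase list uses spaces.
        if token.replace('_', ' ') in allowed:
            kept.append(token)
            if len(kept) == k:
                break
    return kept

# Toy data for illustration only.
toy_neighbors = [('data_mining', 0.91), ('the', 0.90), ('machine_learning', 0.88)]
toy_allowed = {'data mining', 'machine learning'}
toy_result = top_k_allowed(toy_neighbors, toy_allowed, k=2)
```

Because single words and low-scoring phrases are filtered out, a short neighbor list may yield fewer than `k` results; requesting more neighbors through the `topn` argument of `most_similar` is one way to compensate.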

In [None]:
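similar = dict()
#for p1 in ['hearing_aid', 'design_studio', 'waste_water_treatment']:
for p1 in exist:
    # most_similar returns (token, similarity) pairs, 10 by default
    sim = model.wv.most_similar(positive=[p1])
    lst = []
    for pair in sim:
        # Stop once five neighbors have been collected
        if len(lst) == 5:
            break
        p2 = pair[0].replace('_', ' ')
        # Keep only neighbors that are multi-word phrases with score > 0.5
        if p2 in multi:
            lst.append(pair[0])
    # Report the first three query phrases that have a full top-5 list
    if len(lst) == 5 and len(similar) < 3:
        similar[p1] = lst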

In [None]:
sim_5 = pd.DataFrame.from_dict(similar)
#sim_5.to_csv('../data/report/Top 5 Similar Multi-Word Phrases.csv')
sim_5

Each group of phrases shows syntactic and semantic similarity in a different way. This suggests that AutoPhrase extracts phrases that can be grouped into meaningful categories within the same corpus.

Note: Because Word2Vec models perform poorly on small corpora, this search returns no similar phrases for our test data.