<a href="https://colab.research.google.com/github/Arjavjain100/TOS-Summarization/blob/main/KeyPhrase_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing required libraries

In [38]:
!pip install keyphrase-vectorizers
!pip install sentence-transformers
# TODO: cite both libraries

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [39]:
!git clone "https://github.com/Arjavjain100/TOS-Summarization.git"

fatal: destination path 'TOS-Summarization' already exists and is not an empty directory.


### Extracting keyphrases from a document using noun phrases and BERT

A noun phrase is a phrase built around a noun. In its simplest form it consists of a determiner and a noun, for example: a tree, some sweets, the castle. An expanded noun phrase adds detail by placing one or more adjectives (words that describe a noun) before the noun, for example: a huge tree, some colourful sweets, the large, royal castle.
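As a rough illustration of the idea, the sketch below collects runs of adjectives followed by nouns from a sentence whose part-of-speech tags were written by hand. The function name `chunk_noun_phrases` and the tagged sentence are hypothetical and only serve the example; a real pipeline would obtain the tags from an NLP library, and the determiner is dropped here because only the adjectives and nouns are kept.

```python
# Toy illustration: tokens are tagged by hand, so no NLP model is needed.
# Tags follow the Penn Treebank convention
# (DT = determiner, JJ = adjective, NN = noun, VBZ = verb).
def chunk_noun_phrases(tagged_tokens):
    """Collect runs of zero or more adjectives followed by one or more nouns."""
    phrases = []
    adjectives, nouns = [], []

    def flush():
        # A candidate is only kept if it contains at least one noun.
        if nouns:
            phrases.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("JJ"):
            if nouns:
                # An adjective after a noun starts a new phrase.
                flush()
            adjectives.append(word)
        elif tag.startswith("NN"):
            nouns.append(word)
        else:
            # Any other tag ends the current phrase.
            flush()
    flush()
    return phrases


tagged = [("the", "DT"), ("large", "JJ"), ("royal", "JJ"), ("castle", "NN"),
          ("overlooks", "VBZ"), ("a", "DT"), ("huge", "JJ"), ("tree", "NN")]
noun_phrases = chunk_noun_phrases(tagged)
print(noun_phrases)
```
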


In [40]:
from keyphrase_vectorizers import KeyphraseCountVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from google.colab import files
import pandas as pd
import numpy as np

In [41]:
data = pd.read_csv('/content/TOS-Summarization/Dataset/all_v1_transpose.csv')

In [42]:
X = data['original_text']

In [44]:
X = X.to_frame()

In [45]:
X['keyphrase'] = np.zeros(X.shape[0])

In [46]:
X.head()

| | original_text | keyphrase |
|---|---|---|
| 0 | welcome to the pokémon go video game services ... | 0.0 |
| 1 | by using our services you are agreeing to thes... | 0.0 |
| 2 | if you want to use certain features of the ser... | 0.0 |
| 3 | during game play please be aware of your surro... | 0.0 |
| 4 | subject to your compliance with these terms ni... | 0.0 |


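First, candidate keyphrases are extracted from the document as noun phrases. Next, embeddings for each candidate and for the whole document are computed with a BERT-based sentence encoder, and each candidate is compared with the document using cosine similarity. The candidates most similar to the document are returned as its keyphrases. Restricting the candidates to noun phrases keeps the extracted keyphrases well-formed and meaningful rather than arbitrary word sequences.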
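The ranking step can be sketched without any model by using small hand-made vectors. The three-dimensional "embeddings" below are made up for illustration (real sentence embeddings have hundreds of dimensions), and `cosine` is a helper written for this example rather than a library function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: one for the document, one per candidate phrase.
doc_vec = [1.0, 1.0, 0.0]
candidate_vecs = {
    "copyright infringement": [1.0, 0.9, 0.1],
    "prior notice": [0.0, 0.2, 1.0],
    "youtube": [1.0, 0.0, 0.0],
}

# Rank candidates from most to least similar to the document.
ranked = sorted(candidate_vecs,
                key=lambda phrase: cosine(doc_vec, candidate_vecs[phrase]),
                reverse=True)
print(ranked)
```

A candidate whose vector points in nearly the same direction as the document vector scores close to 1 and is ranked first; a candidate pointing elsewhere scores close to 0.
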

In [57]:
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

In [73]:
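def get_keyphrase(text, n = 1):
  """Return the n candidate noun phrases most similar to `text` (most similar last)."""
  docs = [text]

  # Extract candidate keyphrases (noun phrases) from the document.
  vectorizer = KeyphraseCountVectorizer()
  vectorizer.fit_transform(docs)
  candidates = vectorizer.get_feature_names_out()

  # Embed the document and every candidate with the sentence-transformer model.
  doc_embedding = model.encode(docs)
  candidate_embeddings = model.encode(candidates)

  # Cosine similarity between the document and each candidate; shape (1, n_candidates).
  similarities = cosine_similarity(doc_embedding, candidate_embeddings)

  # argsort sorts ascending, so the last n indices belong to the n most similar candidates.
  keyphrases = [candidates[index] for index in similarities.argsort()[0][-n:]]

  return keyphrases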

In [75]:
X['original_text'][42]

'account termination policy youtube will terminate a user s access to the service if under appropriate circumstances the user is determined to be a repeat infringer. youtube reserves the right to decide whether content violates these terms of service for reasons other than copyright infringement such as but not limited to pornography obscenity or excessive length. youtube may at any time without prior notice and in its sole discretion remove such content and or terminate a user s account for submitting such material in violation of these terms of service.'

In [76]:
get_keyphrase(X['original_text'][42], n = 5)

['youtube',
 'youtube reserves',
 'copyright infringement',
 'pornography obscenity',
 'account termination policy youtube']
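Note that the list above runs from least to most similar among the selected phrases, because the indices come from an ascending sort and the tail of that ordering is sliced off. The sketch below reproduces this behaviour in plain Python with made-up scores (the values are hypothetical, not the model's actual similarities) and shows how reversing the slice would put the best phrase first.

```python
# Hypothetical similarity scores for four candidate phrases.
candidates = ["youtube", "prior notice",
              "account termination policy youtube", "copyright infringement"]
scores = [0.41, 0.12, 0.73, 0.55]  # made-up values for illustration

# Pure-Python stand-in for an ascending argsort:
# indices ordered by increasing score.
order = sorted(range(len(scores)), key=lambda i: scores[i])

top_n = 3
# Slicing the tail keeps the top_n best candidates, still in ascending order.
ascending_top = [candidates[i] for i in order[-top_n:]]
# Reversing the slice lists the most similar candidate first.
best_first = ascending_top[::-1]

print(ascending_top)
print(best_first)
```
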

In [64]:
X['keyphrase'] = X['original_text'].apply(get_keyphrase)

In [65]:
X.head()

| | original_text | keyphrase |
|---|---|---|
| 0 | welcome to the pokémon go video game services ... | [http pokemongo nianticlabs com] |
| 1 | by using our services you are agreeing to thes... | [trainer guidelines] |
| 2 | if you want to use certain features of the ser... | [pokémon trainer club ptc account] |
| 3 | during game play please be aware of your surro... | [game play] |
| 4 | subject to your compliance with these terms ni... | [apple store google play] |


In [66]:
X.to_csv("keyphrases.csv")

In [67]:
files.download('keyphrases.csv')
