## In this notebook

- We implement n-gram extraction based on spaCy entities and noun phrases to pick out key terms and phrases from reviews
    - benepar parser `currently not working --> needs investigation` (needed for verb phrases)
    - regular spaCy noun phrases


- We store the phrase2idx and phrase2vector mappings as a pickle

- We construct and save a nearest neighbor object to help find the closest phrases
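
The last two bullets can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the notebook's actual code: the three phrases and their 3-dimensional vectors are made up, the file name `phrase_maps.pkl` is hypothetical, and the brute-force cosine search only stands in for whatever nearest neighbor object is actually built (scikit-learn's `NearestNeighbors` would be one option).

```python
import math
import os
import pickle
import tempfile

# Toy data (hypothetical): in the notebook these would be the extracted phrases
# and their embedding vectors.
phrase2vector = {
    "picture quality": [0.9, 0.1, 0.0],
    "amazing pictures": [0.8, 0.2, 0.1],
    "sound": [0.0, 0.1, 0.9],
}
phrase2idx = {phrase: i for i, phrase in enumerate(phrase2vector)}

# Store both mappings in a single pickle file (hypothetical file name).
path = os.path.join(tempfile.gettempdir(), "phrase_maps.pkl")
with open(path, "wb") as f:
    pickle.dump({"phrase2idx": phrase2idx, "phrase2vector": phrase2vector}, f)

# Load them back to check the round trip.
with open(path, "rb") as f:
    loaded = pickle.load(f)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_phrases(query, vectors, k=2):
    """Return the k phrases most similar to `query`, excluding the query itself."""
    scored = [(cosine(vectors[query], vec), phrase)
              for phrase, vec in vectors.items() if phrase != query]
    return [phrase for _, phrase in sorted(scored, reverse=True)[:k]]


nearest = closest_phrases("picture quality", loaded["phrase2vector"], k=1)
```

Keeping `phrase2idx` alongside the vectors matters once the vectors are stacked into a matrix for a real index, because the index then returns row numbers rather than phrases.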


In [1]:
import pandas as pd

### load data

In [2]:
df = pd.read_excel("../data/test_data.xlsx", index_col=0)
df['contents'] = df['title'] + '.\n\n' + df['review']
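
One thing to watch in the concatenation above: in pandas, adding a string to a missing value gives a missing value, so any row without a title or without a review would end up with `NaN` in `contents`. Whether the test data has such rows is not known here. A minimal sketch of a more defensive join, where `build_contents` is a hypothetical helper and not part of the notebook:

```python
import math


def build_contents(title, review):
    """Join title and review, skipping parts that are missing (None or NaN)."""
    def present(value):
        # pandas represents missing cells as float NaN; None is handled as well.
        if value is None:
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
        return str(value).strip() != ""

    parts = [str(v).strip() for v in (title, review) if present(v)]
    return ".\n\n".join(parts)


example_full = build_contents("Great TV", "Picture quality is excellent")
example_missing_title = build_contents(float("nan"), "Picture quality is excellent")

# Possible use in the notebook (assumes the same column names as above):
# df['contents'] = df.apply(lambda row: build_contents(row['title'], row['review']), axis=1)
```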

### tokenization

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')


In [5]:
#df['review_tokens'] = df['contents'].apply(lambda x: [t.text.lower() for t in nlp(x)])

### extract entities and POS phrases using benepar

In [7]:
import benepar
benepar.download('benepar_en3')

[nltk_data] Downloading package benepar_en3 to /home/nino/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


True

In [8]:
spacy.__version__

'3.3.0'

In [9]:
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

NVIDIA GeForce RTX 3060 Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3060 Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



<benepar.integrations.spacy_plugin.BeneparComponent at 0x7f15ac3ca7a0>

In [11]:
review = df.iloc[0]['contents']

In [4]:
#doc = nlp(review)

In [56]:
parser = benepar.Parser('benepar_en3')




In [58]:
' '.join(dir(parser))

'__class__ __delattr__ __dict__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __gt__ __hash__ __init__ __init_subclass__ __le__ __lt__ __module__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _language_code _parser _tokenizer_lang _with_missing_fields_filled batch_size parse parse_sents'

In [None]:
df['contents']

In [None]:
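# parse() needs a sentence to parse; calling it with no argument raises a TypeError.
# For a multi-sentence review, parse_sents() (listed in the dir() output above) is probably the better fit.
parser.parse(review)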
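
Since benepar is wanted here for verb phrases, it may help to see what that extraction amounts to once a parse is available. Constituency parsers of this kind produce trees that can be written in Penn Treebank bracket notation, and verb phrases are the subtrees labeled `VP`. The sketch below uses only the standard library and a hand-written bracketed string; `tree_text` is an illustrative example, not actual benepar output, and all helper names are hypothetical.

```python
def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves are plain strings; every "(" must be followed directly by a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Words under a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def phrases_with_label(node, label):
    """Text of every subtree whose label matches, in pre-order."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1]:
        found.extend(phrases_with_label(child, label))
    return found


# Hand-written example tree (not parser output).
tree_text = "(S (NP (PRP I)) (VP (VBP love) (NP (DT this) (NN tv))) (. .))"
tree = parse_brackets(tree_text)
verb_phrases = phrases_with_label(tree, "VP")
noun_phrases = phrases_with_label(tree, "NP")
```

Here `verb_phrases` holds the full span `love this tv`, which is the kind of phrase spaCy's noun chunks cannot provide.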

### extract entities and noun phrases using spaCy

In [5]:
spacy_docs = df['contents'].apply(lambda review: nlp(review))

In [34]:
all_noun_chunks = [e.text.lower() for sd in spacy_docs for sent in sd.sents for e in sent.noun_chunks]
all_entities = [e.text.lower() for sd in spacy_docs for sent in sd.sents for e in sent.ents]

In [35]:
from collections import Counter

In [36]:
spacy_counts = Counter(all_noun_chunks + all_entities)
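
Concatenating the two lists counts a span twice whenever spaCy returns it both as a noun chunk and as an entity (a brand name such as `sony` is a plausible case), so some counts may be inflated. A minimal sketch of one way to avoid that, assuming the spans are grouped per sentence; the toy lists below are made up and stand in for real extraction results:

```python
from collections import Counter

# Hypothetical per-sentence extraction results (lowercased span texts).
sentences = [
    {"noun_chunks": ["sony", "this tv"], "entities": ["sony"]},
    {"noun_chunks": ["i", "picture quality"], "entities": []},
    {"noun_chunks": ["sony"], "entities": ["sony", "google tv"]},
]

# Naive approach: concatenate everything, as in the cell above.
naive_counts = Counter(
    phrase for s in sentences for phrase in s["noun_chunks"] + s["entities"]
)

# Deduplicated approach: count each phrase at most once per sentence.
dedup_counts = Counter(
    phrase for s in sentences for phrase in set(s["noun_chunks"]) | set(s["entities"])
)
```

With the toy data, `sony` is counted four times by the naive approach and twice after deduplication. Note the trade-off: a phrase that genuinely occurs twice in one sentence is also counted once.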

In [37]:
spacy_count_df = pd.DataFrame({
    'phrase': list(spacy_counts.keys()),
    'count': list(spacy_counts.values())
})

In [38]:
spacy_count_df.sort_values('count', ascending=False)

| | phrase | count |
|---|---|---|
| 2 | i | 810 |
| 3 | it | 585 |
| 42 | sony | 324 |
| 12 | you | 234 |
| 84 | this tv | 162 |
| ... | ... | ... |
| 1583 | amazing pictures | 1 |
| 1584 | so much detail | 1 |
| 1585 | awe | 1 |
| 1586 | space | 1 |


In [39]:
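# Import the stop-word set explicitly rather than relying on spacy.lang.en already being loaded
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords

def contains_stopword(phrase):
    # True if any whitespace-separated token of the phrase is a spaCy stop word
    return any(t in spacy_stopwords for t in phrase.split())

def contains_more_than_n_stopwords(phrase, n=1):
    # True if the phrase contains more than n stop-word tokens
    return sum(t in spacy_stopwords for t in phrase.split()) > n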
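
The two helpers give a strict filter and a more lenient one. The sketch below shows the difference on a few phrases taken from the count table; it uses a tiny made-up stop-word set (`toy_stopwords`) and a hypothetical `count_stopwords` helper so that it runs without spaCy, so the results with the real `STOP_WORDS` set may differ.

```python
# Tiny stand-in for spaCy's STOP_WORDS so the example runs without spaCy.
toy_stopwords = {"i", "it", "you", "this", "so", "much", "the"}


def count_stopwords(phrase, stopwords):
    """Number of whitespace-separated tokens of `phrase` found in `stopwords`."""
    return sum(token in stopwords for token in phrase.split())


phrases = ["this tv", "so much detail", "picture quality", "it"]

# Strict filter: drop a phrase if it contains any stop word.
strict = [p for p in phrases if count_stopwords(p, toy_stopwords) == 0]

# Lenient filter: allow up to n stop words per phrase (n=1 here).
lenient = [p for p in phrases if count_stopwords(p, toy_stopwords) <= 1]
```

The lenient filter keeps `this tv` but also lets through phrases that consist of a single stop word, such as `it`, so in practice it would probably need an extra check that at least one token is not a stop word.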

In [40]:
spacy_count_df[spacy_count_df['phrase'].apply(lambda p: not contains_stopword(p))].sort_values('count', ascending=False).head(25)

| | phrase | count |
|---|---|---|
| 42 | sony | 324 |
| 293 | sound | 39 |
| 294 | google tv | 35 |
| 554 | picture | 31 |
| 1054 | netflix | 30 |
| 196 | best buy | 26 |
| 332 | tv | 26 |
| 123 | samsung | 24 |
| 41 | lg | 24 |
| 320 | picture quality | 23 |
