# Solution to Problem Statement 2 for the Survey Buddy Internship

I use the alexa_reviews dataset from Kaggle (row 2000 onwards) for the demonstration, with TfidfVectorizer and cosine similarity for the task.
I also use a text file containing keyphrases that were generated in Assignment 1.

In [182]:
#Importing basic libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [183]:
#Loading dataset
df1=pd.read_csv('amazon_alexa.tsv', sep='\t')
df=df1[2000:].copy()  # copy the slice so that new columns are not added to a view of df1
df.head()

| | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 2000 | 1 | 21-Jul-18 | Black Plus | received the wrong product...was so excited to... | 0 |
| 2001 | 4 | 21-Jul-18 | Black Plus | I’m having trouble with it, the alarm most of ... | 1 |
| 2002 | 5 | 21-Jul-18 | White Plus | Love it just wish I could play my amazon music... | 1 |
| 2003 | 5 | 20-Jul-18 | White Plus | Sounds a little better than the original plus ... | 1 |
| 2004 | 4 | 20-Jul-18 | Black Plus | I got it to try side by side with the Echo ori... | 1 |


In [184]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1150 entries, 2000 to 3149
Data columns (total 5 columns):
rating              1150 non-null int64
date                1150 non-null object
variation           1150 non-null object
verified_reviews    1150 non-null object
feedback            1150 non-null int64
dtypes: int64(2), object(3)
memory usage: 45.1+ KB


In [185]:
keyphrase=pd.read_csv('keyphrases.txt')
kp=[i for sublist in keyphrase.values.tolist() for i in sublist]
kp

['love/ great',
 'smark design',
 'worthless',
 'meet every expectation',
 'handy',
 'wife hates',
 'outlet work dissappoint',
 'sound work great',
 'pretty cool',
 'extremely low volume']

In [186]:
#Text preprocessing
#Step 1: Transforming to lowercase

df['reviews_lc']=df['verified_reviews'].str.lower()
df['reviews_lc'].head()

2000    received the wrong product...was so excited to...
2001    i’m having trouble with it, the alarm most of ...
2002    love it just wish i could play my amazon music...
2003    sounds a little better than the original plus ...
2004    i got it to try side by side with the echo ori...
Name: reviews_lc, dtype: object

In [187]:
#Step 2: Removing stopwords and punctuation
import nltk
from nltk.corpus import stopwords

sw=stopwords.words('english')

In [188]:
import re
from nltk.tokenize import word_tokenize

def transform_text(s):
    # remove HTML-style tags (the placeholder is stripped by the next step)
    s = re.sub(r"</?.*?>", " <> ", s)

    # replace runs of digits and non-word characters with a single space
    s = re.sub(r"(\d|\W)+", " ", s)

    tokens = word_tokenize(s)

    # keep tokens longer than 2 characters that are not stopwords
    new_string = []
    for w in tokens:
        if len(w) > 2 and w not in sw:
            new_string.append(w)

    s = ' '.join(new_string)

    return s.strip()

In [189]:
df['reviews_sw'] = df['reviews_lc'].apply(transform_text)
df['reviews_sw'].head()

2000    received wrong product excited install excitem...
2001    trouble alarm times work timer ends alexa keep...
2002    love wish could play amazon music devices with...
2003    sounds little better original plus interrogate...
2004    got try side side echo originale essentially d...
Name: reviews_sw, dtype: object
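
To see what the two regular-expression steps in `transform_text` do on their own, here is a minimal sketch that uses only the standard `re` module. The sample string and the helper name `strip_tags_and_symbols` are made up for illustration, and the tag pattern is assumed to be meant as literal angle brackets.

```python
import re

def strip_tags_and_symbols(s):
    """Hypothetical helper: only the regex steps, no stopword removal."""
    # Replace anything that looks like an HTML tag with a placeholder.
    s = re.sub(r"</?.*?>", " <> ", s)
    # Collapse every run of digits and non-word characters into one space.
    # This also removes the placeholder inserted above.
    s = re.sub(r"(\d|\W)+", " ", s)
    return s.strip()

sample = "received the wrong product...was so excited <br/> 100%"
cleaned = strip_tags_and_symbols(sample)
print(cleaned)  # received the wrong product was so excited
```

Underscores count as word characters for `\W`, so they would survive this step.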

In [190]:
#Step 3: Lemmatizing
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer() 

def lemmatizer_text(s):
    tokens = nltk.word_tokenize(s)
    
    new_string = []
    for w in tokens:
        lem = lemmatizer.lemmatize(w, pos="v")
        # keep the lemma only if it is longer than 2 characters
        if len(lem) > 2:
            new_string.append(lem)
    
    s = ' '.join(new_string)
    return s.strip()

In [191]:
df['reviews_lm'] = df['reviews_sw'].apply(lemmatizer_text)
df['reviews_lm'].head()

2000    receive wrong product excite install excitemen...
2001    trouble alarm time work timer end alexa keep s...
2002    love wish could play amazon music devices with...
2003    sound little better original plus interrogate ...
2004    get try side side echo originale essentially d...
Name: reviews_lm, dtype: object

In [192]:
text = df['reviews_lm'].values.tolist()
text

['receive wrong product excite install excitement thank amazon',
 'trouble alarm time work timer end alexa keep say unavailable',
 'love wish could play amazon music devices without buy additional subscription',
 'sound little better original plus interrogate zigbee hub hub compatible link bulbs',
 'get try side side echo originale essentially dimension weight woofer save smaller tweeter point instead build hub many folks already point hub direct replacement philips hue wave anything else provide sub set feature available stand alone hub end lose functionality take grant dedicate hub voice recognition music play capability original echo may edge volume department due larger speakers new echo plus price lower original echo build hub really need hub less functionality addition think amazon software update hub dedicate hub would bug could end never fix philips hub get software update probably every couple months keep change new bulbs fix bug net net echo anything original echo sound quali

In [193]:
# For the sake of simplicity, I am taking a single review for the demonstration.
text1=df['reviews_lm'][2010]
text1

'sound fantastic classic music like orchestra home control sound screamig twist top also tell alexa volume'

# Using TfidfVectorizer and cosine similarity
For this approach I referred to the notebook https://www.kaggle.com/currie32/predicting-similarity-tfidfvectorizer-doc2vec

In [194]:
from sklearn.feature_extraction.text import TfidfVectorizer

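# Use TfidfVectorizer() to transform the two texts into TF-IDF vectors,
# then compute their cosine similarity.
vectorizer = TfidfVectorizer()
def cosine_sim(text1, text2):
    # The vectorizer is refit on each pair, so the IDF weights depend only on these two texts.
    tfidf = vectorizer.fit_transform([text1, text2])
    # Rows are L2-normalised, so the off-diagonal entry of tfidf * tfidf.T is the cosine similarity.
    return (tfidf * tfidf.T).toarray()[0, 1]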

In [195]:
Tfidf_scores = []
for i in range(len(kp)):
    score = cosine_sim(text1, kp[i])
    Tfidf_scores.append(score)

In [196]:
Tfidf_scores

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.15976420924144444,
 0.0,
 0.07642786054574813]
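
The scores are easier to read next to their keyphrases. A small sketch, with the keyphrases and scores copied from the outputs above (scores rounded to four decimals); `ranked` is just an illustrative name.

```python
kp = ['love/ great', 'smark design', 'worthless', 'meet every expectation',
      'handy', 'wife hates', 'outlet work dissappoint', 'sound work great',
      'pretty cool', 'extremely low volume']
tfidf_scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1598, 0.0, 0.0764]

# Pair each keyphrase with its score, keep the non-zero matches, best first.
ranked = sorted(
    ((phrase, score) for phrase, score in zip(kp, tfidf_scores) if score > 0),
    key=lambda pair: pair[1],
    reverse=True,
)
for phrase, score in ranked:
    print(f"{score:.4f}  {phrase}")
```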

# Result
The review text ('sound fantastic classic music like orchestra home control sound screamig twist top also tell alexa volume') has a non-zero cosine similarity with two keyphrases: 'sound work great' (0.16) and 'extremely low volume' (0.08). It shares the word 'sound' with the first and 'volume' with the second.
So the result looks reasonable. However, it needs further exploration: the match on 'volume' pairs a review that is positive about volume with a keyphrase whose sentiment is negative.
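
The two non-zero scores can be checked by hand. The sketch below assumes scikit-learn's default TfidfVectorizer settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) and uses whitespace splitting, which gives the same tokens for these particular texts. The function name `pair_tfidf_cosine` is illustrative.

```python
import math
from collections import Counter

def pair_tfidf_cosine(text_a, text_b):
    """Cosine similarity of two texts with TF-IDF fitted on just that pair."""
    docs = [Counter(text_a.split()), Counter(text_b.split())]
    n_docs = len(docs)
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF, as in scikit-learn's default settings (assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + sum(t in d for d in docs))) + 1
           for t in vocab}
    # Weight each term count by its IDF.
    vecs = [{t: count * idf[t] for t, count in d.items()} for d in docs]
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    # Only terms present in both texts contribute to the dot product.
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    return dot / (norms[0] * norms[1])

review = ('sound fantastic classic music like orchestra home control sound '
          'screamig twist top also tell alexa volume')
score_sound = pair_tfidf_cosine(review, 'sound work great')
score_volume = pair_tfidf_cosine(review, 'extremely low volume')
print(round(score_sound, 4))   # 0.1598
print(round(score_volume, 4))  # 0.0764
```

Each score comes from a single shared word, and the review's other words only lengthen its vector, which is why both values are small.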

# Using gensim
This approach uses the gensim library, following the article https://medium.com/better-programming/introduction-to-gensim-calculating-text-similarity-9e8b55de342d. It works in almost the same way as the method above (TF-IDF vectors compared by cosine similarity), but is implemented with gensim.

In [170]:
from gensim import corpora, models, similarities

In [171]:
texts=[nltk.word_tokenize(i) for i in text]

In [172]:
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
print(dictionary)
print(feature_cnt)

Dictionary(1537 unique tokens: ['amazon', 'excite', 'excitement', 'install', 'product']...)
1537


In [173]:
kp1 = [nltk.word_tokenize(i) for i in kp]
kp1

[['love/', 'great'],
 ['smark', 'design'],
 ['worthless'],
 ['meet', 'every', 'expectation'],
 ['handy'],
 ['wife', 'hates'],
 ['outlet', 'work', 'dissappoint'],
 ['sound', 'work', 'great'],
 ['pretty', 'cool'],
 ['extremely', 'low', 'volume']]

In [174]:
corpus = [dictionary.doc2bow(i) for i in kp1]
corpus

[[(232, 1)],
 [(621, 1)],
 [],
 [(61, 1), (490, 1)],
 [(1145, 1)],
 [(1386, 1)],
 [(17, 1), (743, 1)],
 [(17, 1), (37, 1), (232, 1)],
 [(330, 1), (404, 1)],
 [(109, 1), (875, 1), (1053, 1)]]

In [175]:
tfidf = models.TfidfModel(corpus) 

In [176]:
t1 = [nltk.word_tokenize(text1)]
t1

[['sound',
  'fantastic',
  'classic',
  'music',
  'like',
  'orchestra',
  'home',
  'control',
  'sound',
  'screamig',
  'twist',
  'top',
  'also',
  'tell',
  'alexa',
  'volume']]

In [177]:
t1_vector = [dictionary.doc2bow(i) for i in t1]
t1_vector

[[(9, 1),
  (23, 1),
  (37, 2),
  (109, 1),
  (178, 1),
  (184, 1),
  (185, 1),
  (186, 1),
  (187, 1),
  (188, 1),
  (189, 1),
  (190, 1),
  (191, 1),
  (192, 1),
  (193, 1)]]

In [178]:
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features = feature_cnt)
sim = index[tfidf[t1_vector]]

In [179]:
sim

array([[0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.6361048, 0.       , 0.2581989]], dtype=float32)

# Result
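The same two keyphrases match: 'sound work great' (0.64) and 'extremely low volume' (0.26). The scores are higher than before because the gensim TF-IDF model here is fitted on the keyphrase corpus, so only the review words that occur in some keyphrase ('sound' and 'volume') receive a weight, and the remaining review words no longer lengthen the review vector.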
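
The gensim scores can be reproduced by hand as well. The sketch assumes gensim's default TF-IDF weighting (raw term count times `log2(N / df)`, then L2 normalisation), with document frequencies taken from the ten keyphrases. Whitespace splitting stands in for `nltk.word_tokenize`, the dictionary filtering step is skipped (it does not affect the two matching keyphrases), and all names are illustrative.

```python
import math
from collections import Counter

kp = ['love/ great', 'smark design', 'worthless', 'meet every expectation',
      'handy', 'wife hates', 'outlet work dissappoint', 'sound work great',
      'pretty cool', 'extremely low volume']
review = ('sound fantastic classic music like orchestra home control sound '
          'screamig twist top also tell alexa volume')

kp_tokens = [phrase.split() for phrase in kp]
n_docs = len(kp_tokens)
# Document frequency of each term across the keyphrases.
df = Counter(term for tokens in kp_tokens for term in set(tokens))

def tfidf_unit_vector(tokens):
    """TF-IDF weights for terms known to the keyphrase corpus, scaled to unit length."""
    counts = Counter(t for t in tokens if t in df)
    vec = {t: c * math.log2(n_docs / df[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

query = tfidf_unit_vector(review.split())
scores = []
for tokens in kp_tokens:
    doc = tfidf_unit_vector(tokens)
    # Dot product of two unit vectors is their cosine similarity.
    scores.append(sum(w * doc.get(t, 0.0) for t, w in query.items()))

print(sorted(query))                  # ['sound', 'volume']
print([round(s, 4) for s in scores])  # 0.6361 at index 7, 0.2582 at index 9
```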

In this assignment, I have built a very basic model based on my understanding. I would like to work further on a few things:

1. Remove repeated copies of very long reviews, since a repeated long review is more likely a redundant entry than a coincidence.
2. First extract keywords/keyphrases from two- or three-word reviews, then use the remaining data for further analysis.
3. Use sentiment analysis methods to analyse data more while assigning ke