## Analysis on title
In my previous work, I calculated an overall score from Treatment Rating, Number of Photos, Number of Words and Provider Rating, without using the Title column. That column probably carries useful information, or at least should not be ignored, so here I apply latent semantic analysis (LSA) to the customers' review titles.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD  
from sklearn.feature_extraction.text import TfidfVectorizer  # convert words into weight matrix 


In [31]:
# Import data
df = pd.read_excel('/content/drive/MyDrive/Q2.xlsx')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Created Date      662 non-null    datetime64[ns]
 1   Treatment Rating  662 non-null    object        
 2   Number of Photos  662 non-null    int64         
 3   Number of Words   662 non-null    int64         
 4   Provider Rating   644 non-null    float64       
 5   Physician Type    547 non-null    object        
 6   Treatment Name    662 non-null    object        
 7   Title             662 non-null    object        
 8   Cost              662 non-null    float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 46.7+ KB


In [3]:
# Preview of data
df.head(10)

Unnamed: 0,Created Date,Treatment Rating,Number of Photos,Number of Words,Provider Rating,Physician Type,Treatment Name,Title,Cost
0,2019-05-31,Worth it,0,109,5.0,Physician,volbella,From Nervous to Loyal Customer in One Visit!,0.0
1,2019-02-13,Worth it,0,96,5.0,Dermatologic Surgeon,volbella,Dr. Schlessinger did a fantastic job giving me...,0.0
2,2019-01-13,Worth it,2,93,5.0,Plastic Surgeon,volbella,Amazing Artistry!,0.0
3,2019-06-30,Worth it,0,80,5.0,Family Physician,volbella,Expert Injector!,0.0
4,2019-03-21,Not worth it,3,172,,,volbella,Late Reaction to Volbella,750.0
5,2019-01-03,Worth it,0,38,5.0,Facial Plastic Surgeon,volbella,"I had my lips done, simply amazing!",0.0
6,2019-02-14,Worth it,0,42,5.0,Plastic Surgeon,volbella,I Had a Wonderful Experience,0.0
7,2019-04-06,Worth it,3,94,5.0,,volbella,"Wanted a natural, but fuller look to my lips",449.98
8,2019-05-26,Worth it,2,72,5.0,Oculoplastic Surgeon,volbella,Dark Circle Fillers,2200.0
9,2019-03-19,Worth it,0,115,5.0,Plastic Surgeon,volbella,5 Star Experience,0.0


#### Latent semantic analysis on titles 

Some consumers presumably use words that express strong emotion, either positive or negative, such as fantastic, amazing, horrible or disappointed, while others choose neutral words such as natural. So I will try to cluster the review titles, label the clusters, and give each title a score depending on how much the consumer likes or dislikes the treatment.

In [None]:
# Use all 662 titles as the corpus, converted to lowercase.
corpus = df.Title.str.lower()

In [None]:
# Convert the review titles, excluding English stop words, into a term frequency–inverse document frequency (TF-IDF) matrix.

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available from 1.0).
words = vectorizer.get_feature_names_out()

In [None]:
print('There are %s distinct words(features) in review titles.' %len(words))

There are 727 distinct words(features) in review titles.
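
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for a made-up three-title corpus. It assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The toy titles and the helper name `tfidf_rows` are illustrative only, and stop-word removal is skipped for brevity.

```python
import math
from collections import Counter

# Toy corpus (hypothetical titles, not taken from the data set).
toy_titles = ["amazing experience", "great experience", "horrible results"]

def tfidf_rows(titles):
    """Return (vocabulary, rows) with smoothed, L2-normalised TF-IDF weights."""
    docs = [t.lower().split() for t in titles]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency: in how many titles each word appears.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(toy_titles)
for title, row in zip(toy_titles, rows):
    weights = {w: round(x, 3) for w, x in zip(vocab, row) if x > 0}
    print(title, "->", weights)
```

In the first toy title, the word that occurs in only one title (`amazing`) receives a larger weight than the word shared with another title (`experience`), which is the effect TF-IDF is meant to produce.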


In [None]:
# Print the first vector in the matrix.
# It has only 4 non-zero elements, one for each non-stop word in the first title: nervous, loyal, customer, visit.
# Most elements of the matrix are 0, so it is sparse and its dimensionality needs to be reduced.
print(vectors.toarray()[0])

[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         

In [None]:
# Use truncated SVD to reduce the dimensionality of the matrix

categories = 5        # Keep 5 components, i.e. each review title is projected onto 5 latent topics ("categories") by similarity.
lsa = TruncatedSVD(n_components=categories)  
trunc_v = lsa.fit_transform(vectors)  
print("--------lsa singular value---------")
print(lsa.singular_values_)
print("--------662 review titles，in %s categories vector space---------" %categories)
print(trunc_v.shape)  

--------lsa singular value---------
[5.85804978 4.76231525 4.32992423 3.68126854 3.54158413]
--------662 review titles，in 5 categories vector space---------
(662, 5)


In [None]:
# Pick 5 most typical titles in each category.

pick_titles = 5  
title_docid = [trunc_v[:, i].argsort()[:-(pick_titles + 1):-1] for i in range(categories)]
#print("--------5 most typical titles in each category---------")
#print(title_docid)

In [None]:
# Pick the 5 key words in each category

pick_keywords = 5  
cat_keywdid = [lsa.components_[i].argsort()[:-(pick_keywords + 1):-1] for i in range(categories)]
#print("--------5 key words in each category---------")
#print(cat_keywdid)
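
The slice `[:-(k + 1):-1]` used in the two cells above can look cryptic, so here is a small sketch of what it does. It uses a plain list and `sorted` as a stand-in for NumPy's `argsort`; the `scores` values are made up for illustration.

```python
scores = [0.10, 0.75, 0.05, 0.40, 0.90]   # hypothetical loadings on one component

# Pure-Python stand-in for argsort: indices ordered by ascending score.
ascending = sorted(range(len(scores)), key=lambda i: scores[i])

k = 2
# Walk backwards from the end of the ascending order and stop after k items,
# which yields the indices of the k largest scores, largest first.
top_k = ascending[:-(k + 1):-1]
print(top_k)   # -> [4, 1]
```

The same slice applied to a column of `trunc_v` or a row of `lsa.components_` returns the row or word indices with the highest loadings.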

In [None]:
print("-------- Results---------")
for c in range(categories):
    print("\n rating categories {}".format(c+1))
    print("\t keywords：{}".format(", ".join(words[cat_keywdid[c][j]] for j in range(pick_keywords))))
    for i in range(pick_titles):
        print('\t\t titles %s: %s,' % ('{}'.format(i+1),corpus[title_docid[c][i]]))

-------- Results---------

 rating categories 1
	 keywords：experience, amazing, great, botox, results
		 titles 1: amazing experience,
		 titles 2: amazing experience...,
		 titles 3: i had an amazing experience,
		 titles 4: amazing experience,
		 titles 5: amazing experience,

 rating categories 2
	 keywords：botox, best, great, results, treatment
		 titles 1: botox,
		 titles 2: botox,
		 titles 3: botox,
		 titles 4: botox,
		 titles 5: botox,

 rating categories 3
	 keywords：great, experience, service, results, wonderful
		 titles 1: great experience, great results!,
		 titles 2: another great experience,
		 titles 3: always a great experience!!,
		 titles 4: great experience,
		 titles 5: great experience!,

 rating categories 4
	 keywords：best, experience, doctor, dr, plastic
		 titles 1: the best,
		 titles 2: best of the best,
		 titles 3: best of the best!,
		 titles 4: best experience!,
		 titles 5: best botox! best doctor! best medi spa!,

 rating categories 5
	 keywords：res

As the results show, the titles are grouped into 5 categories according to how strongly each title relates to each category (topic). However, there are several problems. First, although there are 5 categories, nothing indicates that one category expresses stronger emotion than another, so the categories cannot be graded.

The second problem is that many key words are shared between categories, so the categories are not well separated. This is understandable, because most customers chose the same words to express their satisfaction.
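
One way to put a number on this second problem is the Jaccard similarity between the keyword sets of each pair of categories. The sketch below copies the keywords from the printed results; category 5 is omitted because its output is cut off, and the helper name `jaccard` is my own.

```python
from itertools import combinations

# Top keywords per category, copied from the printed results.
# Category 5 is left out because its output is truncated.
keywords = {
    1: {"experience", "amazing", "great", "botox", "results"},
    2: {"botox", "best", "great", "results", "treatment"},
    3: {"great", "experience", "service", "results", "wonderful"},
    4: {"best", "experience", "doctor", "dr", "plastic"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

overlap = {
    (i, j): jaccard(keywords[i], keywords[j])
    for i, j in combinations(sorted(keywords), 2)
}
for (i, j), score in overlap.items():
    shared = ", ".join(sorted(keywords[i] & keywords[j]))
    print(f"categories {i} and {j}: {score:.2f} (shared: {shared})")
```

Every pair shares at least one keyword, and categories 1, 2 and 3 share two or three of their five, which is consistent with the weak separation described above.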

#### I noticed that there are no negative words in any category. So this time, I'll analyze only the titles written by consumers who gave a 'Not worth it' rating.

In [None]:
# Extract the titles written by users who gave a negative rating.
df_neg = df[df['Treatment Rating'] == 'Not worth it']
df_neg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31 entries, 4 to 633
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Created Date      31 non-null     datetime64[ns]
 1   Treatment Rating  31 non-null     object        
 2   Number of Photos  31 non-null     int64         
 3   Number of Words   31 non-null     int64         
 4   Provider Rating   23 non-null     float64       
 5   Physician Type    20 non-null     object        
 6   Treatment Name    31 non-null     object        
 7   Title             31 non-null     object        
 8   Cost              31 non-null     float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 2.4+ KB


In [None]:
# Generate a matrix of word occurrences instead of TF-IDF values.

cv = CountVectorizer(stop_words='english')
neg_v = cv.fit_transform(df_neg.Title)

In [None]:
word_list = cv.get_feature_names_out()  # replaces get_feature_names(), removed in scikit-learn 1.2
count_list = neg_v.toarray().sum(axis=0) 
print('%s words (features) in negative review titles' %len(word_list))

77 words (features) in negative review titles


In [None]:
# Print the 10 most frequent words.

d = dict(zip(word_list,count_list))
print(sorted(d.items(), key=lambda item: item[1],reverse=True)[:10])

[('botox', 12), ('experience', 5), ('horrible', 4), ('bad', 3), ('results', 3), ('crows', 2), ('feet', 2), ('jaw', 2), ('just', 2), ('masseter', 2)]


In the negative review titles, sentiment words such as horrible and bad appear only 4 and 3 times, respectively.
So the last problem is that only 31 consumers gave negative ratings, so the key words in their titles occur too rarely to carry weight in the leading components and are effectively discarded by the truncated SVD step.
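
The effect can be reproduced on a toy example. The sketch below builds a hypothetical count matrix (20 titles reading "amazing experience" and one reading "horrible") and finds the leading right singular vector with a plain power iteration on `A^T A`. This is a stand-in for `TruncatedSVD`, not the solver it uses, and it keeps a single component rather than five.

```python
import math

vocab = ["amazing", "experience", "horrible"]
# Hypothetical counts: 20 titles "amazing experience", 1 title "horrible".
A = [[1, 1, 0]] * 20 + [[0, 0, 1]]

def top_right_singular_vector(matrix, iterations=50):
    """Power iteration on A^T A; returns the leading right singular vector."""
    n_cols = len(matrix[0])
    # Gram matrix G = A^T A.
    G = [[sum(row[i] * row[j] for row in matrix) for j in range(n_cols)]
         for i in range(n_cols)]
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        w = [sum(G[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

v = top_right_singular_vector(A)
for word, weight in zip(vocab, v):
    print(f"{word:>10}: {weight:.4f}")
```

The leading component puts essentially all of its weight on the two frequent words and none on `horrible`, so a title containing only the rare word is projected to almost zero in the reduced space.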

### Sentiment Intensity Analyzer

LSA doesn't work well in this situation, so I instead assign a polarity score to each review title using the built-in sentiment analyzer (VADER) in the NLTK Python library.

In [4]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA



In [5]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [6]:
sia = SIA() #Instantiate

In [7]:
ss_df = pd.DataFrame(columns=['neg','pos','neu','compound'],dtype='float')

In [29]:
for index,row in df.iterrows():
    ss = sia.polarity_scores(row['Title'])
    ss_df.at[index,'neg'] = ss['neg']
    ss_df.at[index,'pos'] = ss['pos']
    ss_df.at[index,'neu']= ss['neu']
    ss_df.at[index,'compound'] = ss['compound']

In [34]:
ss_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 662 entries, 0 to 661
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   neg       662 non-null    float64
 1   pos       662 non-null    float64
 2   neu       662 non-null    float64
 3   compound  662 non-null    float64
dtypes: float64(4)
memory usage: 45.9 KB


In [35]:
ss_df

Unnamed: 0,neg,pos,neu,compound
0,0.183,0.295,0.522,0.3164
1,0.000,0.586,0.414,0.8176
2,0.000,0.804,0.196,0.6239
3,0.000,0.000,1.000,0.0000
4,0.000,0.000,1.000,0.0000
...,...,...,...,...
657,0.000,0.316,0.684,0.5719
658,0.000,0.000,1.000,0.0000
659,0.000,0.397,0.603,0.5574
660,0.000,0.666,0.334,0.6114
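
For reading the `compound` column: VADER sums the valence of the words it recognises (after its rule-based adjustments) and squashes that sum into the range -1 to 1. Below is a minimal sketch of that normalisation step only, assuming the constant `alpha = 15` from the reference VADER implementation. The input sums are made-up numbers, not lexicon lookups, and the function name is my own.

```python
import math

def compound_from_valence_sum(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into (-1, 1), VADER-style."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

for s in (-4.0, 0.0, 2.0, 6.0):   # hypothetical summed valences
    print(f"sum={s:+.1f} -> compound={compound_from_valence_sum(s):+.4f}")
```

A valence sum of 0 gives a compound of exactly 0, which is consistent with rows such as 3 and 4 above, where `neu` is 1.0 and `compound` is 0.0.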


In [10]:
#load the data set with satisfaction score (saved in previous work)
df_1 = pd.read_csv('/content/drive/MyDrive/with_overall_score.csv')

In [36]:
df = pd.concat([df_1,ss_df],axis=1)

In [37]:
df.head()

Unnamed: 0.1,Unnamed: 0,Created Date,Treatment Rating,Number of Photos,Number of Words,Provider Rating,Physician Type,Treatment Name,Title,Cost,if_pay,cost_level,sign,score_tr,score_ph,score_w,score_pr,score,neg,pos,neu,compound
0,0,2019-05-31,Worth it,0,109,5.0,Physician,volbella,From Nervous to Loyal Customer in One Visit!,0.0,NO cost with insurance,no payment,1,7,0.0,1.25,0.25,8.5,0.183,0.295,0.522,0.3164
1,1,2019-02-13,Worth it,0,96,5.0,Dermatologic Surgeon,volbella,Dr. Schlessinger did a fantastic job giving me...,0.0,NO cost with insurance,no payment,1,7,0.0,1.0,0.25,8.25,0.0,0.586,0.414,0.8176
2,2,2019-01-13,Worth it,2,93,5.0,Plastic Surgeon,volbella,Amazing Artistry!,0.0,NO cost with insurance,no payment,1,7,0.2,1.0,0.25,8.45,0.0,0.804,0.196,0.6239
3,3,2019-06-30,Worth it,0,80,5.0,Family Physician,volbella,Expert Injector!,0.0,NO cost with insurance,no payment,1,7,0.0,1.0,0.25,8.25,0.0,0.0,1.0,0.0
4,4,2019-03-21,Not worth it,3,172,,,volbella,Late Reaction to Volbella,750.0,pay,acceptable,-1,5,0.2,1.25,0.0,3.55,0.0,0.0,1.0,0.0


In [38]:
df.groupby('Treatment Rating').describe()['compound']

Treatment Rating,count,mean,std,min,25%,50%,75%,max
Not worth it,31.0,-0.266603,0.313217,-0.5859,-0.5423,-0.4767,0.0,0.5719
Worth it,631.0,0.464708,0.308264,-0.5255,0.1625,0.5859,0.6696,0.938


The polarity score appears to match the consumers' ratings: the mean compound score is about -0.27 for 'Not worth it' and about 0.46 for 'Worth it'. Next, I combine the sentiment score with the overall score.


In [39]:
# simply add the sentiment score to the calculated satisfaction score
df['new_score'] = df['score'] + df['compound']

In [40]:
df.groupby(['Treatment Rating']).describe()['new_score']

Treatment Rating,count,mean,std,min,25%,50%,75%,max
Not worth it,31.0,2.955977,0.792959,1.2641,2.5905,2.9904,3.3,5.0719
Worth it,631.0,8.541887,0.421,7.2394,8.3219,8.5217,8.81915,10.2351
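
Because `compound` lies between -1 and 1, a plain sum can move a review's score by at most one point. The sketch below redoes the addition for the five rows shown in `df.head()` and exposes a weight `w`. The weight and the helper `combine` are hypothetical additions for illustration; `w = 1` reproduces the sum used in this notebook.

```python
# score and compound for the five rows shown in df.head().
scores = [8.5, 8.25, 8.45, 8.25, 3.55]
compounds = [0.3164, 0.8176, 0.6239, 0.0, 0.0]

def combine(score, compound, w=1.0):
    """Overall score plus a weighted sentiment term (w is a hypothetical knob)."""
    return score + w * compound

new_scores = [round(combine(s, c), 4) for s, c in zip(scores, compounds)]
print(new_scores)

# A larger weight would let the title sentiment matter more.
print([round(combine(s, c, w=3.0), 4) for s, c in zip(scores, compounds)])
```

With `w = 1`, the titles with a compound of 0.0 keep their original score, and the others move by less than one point.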


In [41]:
# Select the column before aggregating, and call sort_values() (the parentheses were missing).
df.groupby('Treatment Name')['new_score'].mean().sort_values(ascending=False)

Treatment Name
botox            8.308456
chemical peel    8.113782
coolmini         7.971994
volbella         7.969443
Name: new_score, dtype: float64

In [42]:
# Same aggregation for the original score, sorted in descending order.
df.groupby('Treatment Name')['score'].mean().sort_values(ascending=False)

Treatment Name
botox            7.869916
chemical peel    7.766071
coolmini         7.606250
volbella         7.602174
Name: score, dtype: float64

With sentiment score added, the ranking of treatments doesn't change.