# Service Intelligence Practice
- TA : Jongkyung Shin
-  contact : shinjk1156@unist.ac.kr
-  If you have any questions, feel free to email me.

## Loading dataset
- Dataset: Hotel review data collected by web crawling from a hotel website in Singapore

In [None]:
import pandas as pd
df_star2 = pd.read_excel("HotelRev_less4STAR.xlsx", sheet_name= '2class')
df_star3 = pd.read_excel("HotelRev_less4STAR.xlsx", sheet_name= '3class')
df_star4 = pd.read_excel("HotelRev_less4STAR.xlsx", sheet_name= '4class')

In [None]:
def makeClearSent(sent):
    sent = str(sent)
    sent = sent.replace("\n", "")
    sent = sent.replace("\r", "")
    sent = sent.replace("  ", " ")
    return sent


def makeReviewList(df):
    review_list = []
    for i in range(len(df)):
        review_list.append(makeClearSent(df.loc[i, "title"]) + ". " + makeClearSent(df.loc[i, "body"]))

    # Equivalent list comprehension:
    # review_list = [makeClearSent(df.loc[i, "title"]) + ". " + makeClearSent(df.loc[i, "body"]) for i in range(len(df))]
    return review_list

-  Extract only the reviews and construct the datasets (with simple cleaning)

In [None]:
star2_review_list = makeReviewList(df_star2)
star3_review_list = makeReviewList(df_star3)
star4_review_list = makeReviewList(df_star4)

star2_rating_list = (df_star2.loc[:,"rating"]//10).to_list()
star3_rating_list = (df_star3.loc[:,"rating"]//10).to_list()
star4_rating_list = (df_star4.loc[:,"rating"]//10).to_list()


StarRatingList = star2_rating_list + star3_rating_list + star4_rating_list

## 1) Online Review Preprocessing

* Using package : nltk 

  
* Required steps
  - Tokenization
  - Lowercasing
  - Lemmatization
  - Stopwords removal
  - POS tagging (you have to extract only the nouns)
  - Stemming, etc. (you can skip stemming and other preprocessing methods)  


In [None]:
!pip install nltk

In [None]:
from tqdm import tqdm
import nltk
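nltk.download('punkt')
nltk.download('punkt_tab')  # needed by sent_tokenize/word_tokenize in recent NLTK releases
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # needed by pos_tag in recent NLTK releases
nltk.download('wordnet')
nltk.download('omw-1.4')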
from nltk.corpus import stopwords
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import WordNetLemmatizer
from nltk import pos_tag

In [None]:
ExampleReview = star2_review_list[0]
print(ExampleReview)

### 1-1) Tokenization
- Use 'sent_tokenize' and 'word_tokenize' functions in nltk package
- Two steps are required: sentence tokenization, then word tokenization.
- example : "Expect Pure Misery." -> Tokenization -> ['Expect', 'Pure', 'Misery', '.']
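
To see the two steps without downloading any NLTK data, here is a rough sketch that imitates them with regular expressions. The helper names `toy_sent_tokenize` and `toy_word_tokenize` are hypothetical, and their rules are much cruder than NLTK's `sent_tokenize` and `word_tokenize`, so use it only to understand the shape of the output.

```python
import re

def toy_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (a crude rule, not NLTK's tokenizer)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def toy_word_tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)

review = "Expect Pure Misery. The room was small!"
sentences = toy_sent_tokenize(review)                  # step 1: review -> sentences
tokens = [toy_word_tokenize(s) for s in sentences]     # step 2: sentence -> words
print(sentences)
print(tokens)
```

The result is a list of sentences, each of which is a list of word tokens. This is the same nesting that the real NLTK functions give you.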

In [None]:
SentenceTokenList = sent_tokenize(ExampleReview)
print("The number of sentences : {}".format(len(SentenceTokenList)))

In [None]:
ExampleSentence = SentenceTokenList[10]

ExampleWordTokenList = word_tokenize(ExampleSentence)

print("The number of words in a sentence : {}\n".format(len(ExampleWordTokenList)))
print("Original sentence : ",ExampleSentence + "\n")
print("Tokenized sentence : ", ExampleWordTokenList)

### 1-2) Lowercasing
- Lowercasing is required so that words that differ only in capitalization are treated as the same word
- example : Room -> room, Staff -> staff
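
A small sketch of why this matters for counting words. The token list is made up for illustration.

```python
from collections import Counter

tokens = ["Room", "room", "ROOM", "Staff", "staff"]

# Without lowercasing, each surface form is counted as a different word
raw_counts = Counter(tokens)
print(raw_counts)

# After lowercasing, the variants collapse into one entry per word
counts = Counter(t.lower() for t in tokens)
print(counts)
```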

In [None]:
print("Original sentence : ",ExampleSentence + "\n")

LowerCasedSentence = ExampleSentence.lower() # for string
print("Lowercased Sentence : ", LowerCasedSentence)

print("\nOriginal word token list : ", ExampleWordTokenList)

LowerCasedTokenList = []
for token in ExampleWordTokenList:
    LowerCapToken = token.lower() # for string
    LowerCasedTokenList.append(LowerCapToken)
    
print("\nLowercased word token list : ", LowerCasedTokenList)

ExampleWordTokenList = LowerCasedTokenList

### 1-3) Lemmatization
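- Lemmatization is required so that inflected forms of a word are mapped to one base form (lemma) and treated as the same word
- Example : rooms -> room, staffs -> staff
- Note : 'lemmatize' treats every token as a noun unless a 'pos' argument is given, so verb forms such as 'stayed' are left unchanged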

In [None]:
lemma = WordNetLemmatizer()

LemmatizationTokenList = []
for token in ExampleWordTokenList:
    LemmatizedToken = lemma.lemmatize(token)
    LemmatizationTokenList.append(LemmatizedToken)

print("\nOriginal word token list : ", ExampleWordTokenList)

print("\nLemmatized word token list : ", LemmatizationTokenList)

ExampleWordTokenList = LemmatizationTokenList

### 1-4) Stopwords removal
- Stopword : A word that appears frequently in sentences but contributes little to their meaning (e.g., 'the', 'is')
- Use 'stopwords.words' in the nltk package to get the list of stopwords
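
A minimal sketch of the filtering step, using a tiny hand-picked stopword set (`toy_stopwords` is an assumption for illustration; the real list comes from nltk). The comparison is case-sensitive, which is one reason lowercasing comes before this step.

```python
# A tiny hand-picked stopword set for illustration only
toy_stopwords = {"the", "was", "and", "a", "in"}

tokens = ["the", "room", "was", "clean", "and", "quiet"]

# Keep only tokens that are not stopwords (set membership is a fast lookup)
filtered = [t for t in tokens if t not in toy_stopwords]
print(filtered)

# "The" is not equal to "the", so it slips through unless tokens are lowercased first
not_lowercased = [t for t in ["The", "room"] if t not in toy_stopwords]
print(not_lowercased)
```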

In [None]:
StopWordList = stopwords.words('english')
print(StopWordList)

In [None]:
StopWordRemovalTokenList = []
for token in ExampleWordTokenList:
    if token not in StopWordList:
        StopWordRemovalTokenList.append(token)

print("\nOriginal word token list : ", ExampleWordTokenList)
        
print("\nStopword Removed token list : ", StopWordRemovalTokenList)


ExampleWordTokenList = StopWordRemovalTokenList

### 1-5) Part-Of-Speech Tagging (Pos tagging)

- POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech (e.g., nouns, verbs, adjectives, and adverbs)

In [None]:
PosTaggedTokenList = pos_tag(ExampleWordTokenList)

print("\nOriginal word token list : ", ExampleWordTokenList)
        
print("\nPos tagged token list : ", PosTaggedTokenList)

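### 1-6) Noun Extraction

- Keep only the tokens whose POS tag starts with "N" (the Penn Treebank noun tags NN, NNS, NNP, and NNPS)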
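
The filter can be tried on its own with hand-written (token, tag) pairs. The tags below are typed in by hand in the format that `pos_tag` returns; they are not the output of a real tagger.

```python
# Hand-written (token, tag) pairs in the format returned by nltk.pos_tag
tagged = [("room", "NN"), ("small", "JJ"), ("towels", "NNS"),
          ("singapore", "NNP"), ("smelled", "VBD"), ("quickly", "RB")]

# Noun tags (NN, NNS, NNP, NNPS) all start with "N"
nouns = [token for token, tag in tagged if tag.startswith("N")]
print(nouns)
```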




In [None]:
NounTokenList = []
for token, pos in PosTaggedTokenList:
    if pos[0] == "N":
        NounTokenList.append(token)
        
print("\nOriginal word token list : ", ExampleWordTokenList)
        
print("\nNoun token list : ", NounTokenList)

### Your work (1)

- The cells above show the result of each preprocessing step for one sample sentence.
- You need to preprocess all the review data (star2, star3, and star4).
- You have to create two datasets: a preprocessed token dataset (list) and a noun token dataset (list)
- Output variable names
  - TokenDataset (preprocessed token dataset, review\*sentence\*word matrix)
  - NounTokenDataset (noun token dataset, review\*noun matrix)
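
To make the expected nesting concrete, here is a toy version of the two outputs with hand-written values (the `_toy` variables are hypothetical and only show the structure, not how to compute it). Note that the `show_sentence_sentiment` function in section 3 looks up the original sentence by its position `j`, so it is probably safest to keep one entry per sentence in TokenDataset, even when preprocessing leaves a sentence empty.

```python
# Toy data: two reviews, written by hand to show the nesting only
TokenDataset_toy = [
    [["great", "location"], [], ["room", "small"]],  # review 0: three sentences, the second left empty
    [["breakfast", "cold"]],                         # review 1: one sentence
]
NounTokenDataset_toy = [
    ["location", "room"],  # review 0: nouns from all sentences in one flat list
    ["breakfast"],         # review 1
]

print(len(TokenDataset_toy))         # number of reviews
print(len(TokenDataset_toy[0]))      # number of sentences in the first review
print(len(TokenDataset_toy[0][0]))   # number of tokens in its first sentence
print(len(NounTokenDataset_toy[0]))  # number of nouns in the first review
```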


In [None]:
ReviewList = star2_review_list + star3_review_list + star4_review_list
NounTokenDataset = []
TokenDataset = []
################ Write Your code #####################





#######################################################

#### Output datasets shape checker

In [None]:
print("\nThe number of reviews in TokenDataset : ", len(TokenDataset)) # Answer : about 33,132 
print("\nThe number of sentences in first review : ", len(TokenDataset[0])) # Answer : about 38 (It's okay if the value is not the same)
print("\nThe number of tokens in first sentence of first review : ", len(TokenDataset[0][0])) # Answer : about 4 (It's okay if the value is not the same)

print("\nThe number of reviews in NounTokenDataset : ", len(NounTokenDataset)) # Answer : about 33,132 
print("\nThe number of noun tokens in first review : ", len(NounTokenDataset[0])) # Answer : about 96 (It's okay if the value is not the same)

## 2) LDA topic modeling (service feature engineering)
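
The cells in this section first turn NounTokenDataset into a dictionary and a bag-of-words corpus. As a rough illustration of what that means, the sketch below mimics the idea with the standard library only. `toy_docs`, `word2id`, and the thresholds are made up; the notebook itself uses gensim's `corpora.Dictionary`, `filter_extremes`, and `doc2bow`. In gensim, `no_below` is an absolute number of documents and `no_above` is a fraction of all documents.

```python
from collections import Counter

toy_docs = [["room", "staff", "pool"],
            ["room", "breakfast"],
            ["room", "staff"],
            ["room", "wifi"]]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

# Toy thresholds: keep words found in at least 2 documents and in at most 80% of documents
no_below, no_above = 2, 0.8
vocab = sorted(w for w, df in doc_freq.items()
               if df >= no_below and df / len(toy_docs) <= no_above)
word2id = {w: i for i, w in enumerate(vocab)}

# Bag-of-words: one list of (word id, count) pairs per document
corpus_toy = [sorted(Counter(word2id[w] for w in doc if w in word2id).items())
              for doc in toy_docs]
print(vocab)
print(corpus_toy)
```

Here "room" is dropped because it appears in every document, and the rare words are dropped because they appear only once, so only "staff" survives. Words that are too common or too rare tell LDA little about topics.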

In [None]:
!pip install gensim
!pip install pyLDAvis

In [None]:
#Create dictionary
import gensim
import gensim.corpora as corpora

id2word= corpora.Dictionary(NounTokenDataset)

# if you want, use filter_extremes()
id2word.filter_extremes(no_below=10, no_above=0.3)

#create count matrix
corpus = [id2word.doc2bow(rev) for rev in NounTokenDataset]

In [None]:
# Train the LDA model
model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                        id2word=id2word,
                                        num_topics=10,
                                        iterations=3000,
                                        alpha=0.1,
                                        eta=0.01)
model.print_topics()

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis 
#Visualize LDA model

pyLDAvis.enable_notebook()

vis = gensimvis.prepare(model, corpus, id2word)

vis
# pyLDAvis.save_html(vis, 'LDA visualization.html')

In [None]:
RawTopicTokenList = model.print_topics()

In [None]:
def parse_topic_token(RawTopicTokenList):
  # Each entry looks like (topic_id, '0.050*"room" + 0.030*"bed" + ...')
  TopicWords = []
  for topic_id, eqs in RawTopicTokenList:
    tokenlist = []
    for eq in eqs.split(" + "):
      # eq looks like '0.050*"room"'; keep only the word between the quotes
      word = eq.split("*")[1]
      tokenlist.append(word.split('"')[1])
    TopicWords.append(tokenlist)

  return TopicWords

In [None]:
TopicWordList = parse_topic_token(RawTopicTokenList)

### Your work (2)

- You should tune the number of topics yourself, but start with around 10.
- If many topics overlap, adjust the model by decreasing the number of topics or by increasing the number of iterations (additional preprocessing is also allowed if you want).

- Alternatively, there are metrics that help decide the number of topics, such as perplexity and topic coherence. If you are interested in these topic model evaluation methods, please look up the relevant resources.

- You have to name each topic appropriately by interpreting its high-weighted words
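
For those curious about topic coherence, here is a simplified sketch of a UMass-style score written with the standard library. The function name `umass_coherence` and the toy documents are assumptions for illustration. For real work, gensim provides a `CoherenceModel` class, and this sketch is not a replacement for it. The idea is that the top words of a good topic tend to appear together in the same documents.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Simplified UMass-style coherence of one topic (closer to 0 is better)."""
    doc_sets = [set(d) for d in docs]

    def df(*words):
        # Number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # top_words is assumed to be sorted from most to least probable,
    # and every word is assumed to occur in at least one document
    for earlier, later in combinations(top_words, 2):
        score += math.log((df(later, earlier) + 1) / df(earlier))
    return score

docs = [["room", "bed", "pillow"], ["room", "bed"], ["pool", "view"], ["room", "view"]]
coherent = umass_coherence(["room", "bed"], docs)     # words that co-occur often
incoherent = umass_coherence(["room", "pool"], docs)  # words that never co-occur
print(coherent, incoherent)
```

One way to use such a score is to train models with different numbers of topics and compare their average coherence.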

In [None]:
################ Write Your code #####################





#######################################################
RawTopicTokenList = model.print_topics()
TopicWordList = parse_topic_token(RawTopicTokenList)

TopicName = []  # enter the name of each topic, e.g. ["Location", "View", ...]

## 3) Sentiment Analysis

- The sentiment of each service feature is estimated from the sentiments of the sentences that contain its service feature words.
- We use VADER sentiment analysis to estimate the sentiment intensity (score) of each sentence
- Score range of VADER : -1 (extremely negative) ~ +1 (extremely positive)

In [None]:
!pip install vaderSentiment

In [None]:
# Example
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

example1 = "Location was great, nearby landmark"
example2 = "Awesome view, Wow"
example3 = "City view was not good"
example4 = "Terrible room condition"
example5 = "great quality internet"

print("Sentence  : [Sentiment score] ")
print("{0} : {1}".format(example1, analyser.polarity_scores(example1)['compound']))
print("{0} : {1}".format(example2, analyser.polarity_scores(example2)['compound']))
print("{0} : {1}".format(example3, analyser.polarity_scores(example3)['compound']))
print("{0} : {1}".format(example4, analyser.polarity_scores(example4)['compound']))
print("{0} : {1}".format(example5, analyser.polarity_scores(example5)['compound']))

In [None]:
#Sentiment analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

def show_sentence_sentiment (TopicWordList_topic, TokenDataset, ReviewList):
    for i,review in enumerate(TokenDataset):
        SentenceTokenizedList = nltk.sent_tokenize(ReviewList[i])
        for j,sent in enumerate(review):
            for word in sent:
                if word in TopicWordList_topic:
                    print("word: ",word)
                    print("tokenlist: ",sent)
                    print("Sent: ",SentenceTokenizedList[j])
                    print("Sentiment score: {}\n".format(analyser.polarity_scores(SentenceTokenizedList[j])['compound']))

# Only for topic 1
show_sentence_sentiment(TopicWordList[0], TokenDataset,  ReviewList)

### Your Work (3)
Modify the above code (**'show_sentence_sentiment'** function) to create a Review*SentimentLabel matrix (# of reviews * # of topics)
- Review*SentimentLabel matrix
    - shape : [review * service_feature(topic)]
    - value : Sentiment label corresponding to sentiment intensity (score)
    - Note : Convert sentiment intensity to sentiment label (see tutorial ppt file)

- Output variable name : ReviewSentimentMatrix
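
The sketch below shows only the general shape of the conversion, with several assumptions: the thresholds of ±0.05 (values commonly suggested for VADER's compound score), the labels -1/0/1, and averaging the sentence scores per topic are all my choices for illustration. The actual conversion rule for this practice is the one in the tutorial ppt file. `toy_scores` is hand-made data, not VADER output.

```python
def score_to_label(score, threshold=0.05):
    # Assumed rule: >= threshold -> 1 (positive), <= -threshold -> -1 (negative), otherwise 0
    if score >= threshold:
        return 1
    if score <= -threshold:
        return -1
    return 0

# For each review: topic index -> compound scores of the sentences that mention that topic
toy_scores = [
    {0: [0.62, 0.40], 2: [-0.48]},  # review 0 mentions topic 0 twice and topic 2 once
    {1: [0.02]},                    # review 1 mentions topic 1 in a near-neutral sentence
]
num_topics = 3

matrix = []
for review_scores in toy_scores:
    row = []
    for topic in range(num_topics):
        scores = review_scores.get(topic, [])
        mean = sum(scores) / len(scores) if scores else 0.0  # topic not mentioned -> 0
        row.append(score_to_label(mean))
    matrix.append(row)
print(matrix)
```

Each row is one review and each column is one topic, which is the shape required for ReviewSentimentMatrix.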

In [None]:
################ Write Your code #####################




#######################################################

## 4) Prediction Modeling on Customer Satisfaction

- Input data (X, independent variables) : Review*SentimentLabel matrix (just use the ReviewSentimentMatrix that you created in Your Work (3))
   - shape : [review * service_feature(topic)]
   - value : sentiment label
- Output data (y, dependent variable) : star rating
    - Use the **StarRatingList** variable (already defined above) and **preprocess** it!
    - label  
      - 0 : negative label (1,2,3 ratings)
      - 1 : positive label (4,5 ratings)
    - Output variable name : StarRatingLabels
- Note
    - We perform a classification (categorical) task, not a regression (continuous) task
    - You have to remove the reviews whose sentiment scores are zero for all features. (These reviews are meaningless for analysis)
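
A small sketch of the two preprocessing steps on toy data. `toy_matrix` and `toy_ratings` are made-up stand-ins for ReviewSentimentMatrix and StarRatingList. The point to notice is that a row and its label must be removed together so that X and y stay aligned.

```python
# Toy stand-ins for ReviewSentimentMatrix and StarRatingList
toy_matrix = [[1, 0, -1], [0, 0, 0], [0, 1, 0], [-1, -1, 0]]
toy_ratings = [5, 4, 3, 1]

# 4 and 5 stars -> 1 (positive), 1 to 3 stars -> 0 (negative)
toy_labels = [1 if r >= 4 else 0 for r in toy_ratings]

# Drop reviews whose row is all zeros, removing the matching label at the same time
kept = [(row, label) for row, label in zip(toy_matrix, toy_labels)
        if any(v != 0 for v in row)]
X = [row for row, _ in kept]
y = [label for _, label in kept]
print(X)
print(y)
```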

In [None]:
#example
#load the example input and output data. (This code and data are just for showing the example.)

import pickle
import numpy as np
# Load the example input and output for ML
with open("Example_Input_Data", "rb") as f:
    InputData_X = pickle.load(f)
with open("Example_Output_Data", "rb") as f:
    OutputData_Y = pickle.load(f)


print("Shape of input data :{}".format(InputData_X.shape))
print("Shape of output data :{}".format(OutputData_Y.shape))

print("Input data : \n{}".format(InputData_X))
print("Output data : {}".format(OutputData_Y))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# call the model
logit=LogisticRegression()

# train the model
logit=logit.fit(InputData_X,OutputData_Y)

print("Training Accuracy : {}\n".format(round(logit.score(InputData_X,OutputData_Y),3)))

#print("coefficient : {}".format(logit.coef_))

TopicNameList = ["Location", "View", "Breakfast", "Sleep Quality", "Bathroom", "Service", "Check", "Value", "Internet"]
print("Topic  (idx) : Coefficient")
for idx, coef_topic in enumerate(logit.coef_[0]):
    print("{} ({}) : {}".format(TopicNameList[idx],idx+1,round(coef_topic,3)))

In [None]:
# performance : AverageSentimentScores
Performance = InputData_X.mean(axis=0)
MeanPerformance = Performance.mean()

#importance
Importance = logit.coef_[0]
MeanImportance = Importance.mean()

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
plt.scatter(Performance, Importance, color = 'g')
for idx, topic_name in enumerate(TopicNameList):
  plt.text(Performance[idx]+0.005, Importance[idx]+0.005, topic_name)

plt.axhline(MeanImportance, color = 'b', linestyle = '--')
plt.axvline(MeanPerformance, color = 'b', linestyle = '--')
plt.ylabel("Importance")
plt.xlabel("Performance")
plt.title("Importance-Performance Analysis (IPA)", fontsize = 20)
plt.text(0.50, 0.26, "Q2", fontweight= "bold", fontsize = 15)
plt.text(2.20, 0.26, "Q1", fontweight= "bold", fontsize = 15)
plt.text(0.50, -0.10, "Q3", fontweight= "bold", fontsize = 15)
plt.text(2.20, -0.10, "Q4", fontweight= "bold", fontsize = 15)
plt.show()

### Your work (4)
Modify the above code to train the model on your own review*sentiment score matrix and conduct an importance-performance analysis (IPA)
- If you want, you can preprocess the dataset further to improve accuracy. 
- You can also use other machine learning models, but you must still interpret the result and conduct IPA with that model.
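
When interpreting the plot, it can help to assign each topic to a quadrant in code as well. The sketch below follows the quadrant numbering of the example plot (Q1 top-right, Q2 top-left, Q3 bottom-left, Q4 bottom-right) and splits at the mean values, as the dashed lines do. The helper `ipa_quadrant` and all numbers are made up for illustration.

```python
def ipa_quadrant(performance, importance, mean_perf, mean_imp):
    # Q1: high performance, high importance   Q2: low performance, high importance
    # Q3: low performance, low importance     Q4: high performance, low importance
    if importance >= mean_imp:
        return "Q1" if performance >= mean_perf else "Q2"
    return "Q4" if performance >= mean_perf else "Q3"

# Made-up performance and importance values for four topics
names = ["Location", "Breakfast", "Internet", "Bathroom"]
perf = [2.0, 1.0, 1.8, 1.2]
imp = [0.30, 0.20, -0.05, 0.0]

mean_perf = sum(perf) / len(perf)
mean_imp = sum(imp) / len(imp)
quadrants = {n: ipa_quadrant(p, i, mean_perf, mean_imp)
             for n, p, i in zip(names, perf, imp)}
print(quadrants)
```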

In [None]:
InputData_X = ReviewSentimentMatrix
OutputData_Y = StarRatingLabels

################ Write Your code #####################


#######################################################