# Sentiment Analysis  

Steps

1) Load the dataset
2) Clean the dataset
3) Encode sentiments
4) Split the dataset
5) Tokenize and pad/truncate reviews
6) Build the architecture/model
7) Train and test

# Approach one

# Import all the libraries needed


In [36]:
# !pip install -U -q segmentation-models
# !pip install -q tensorflow==2.1
# !pip install -q keras==2.3.1
# !pip install -q tensorflow-estimator==2.1.

# Standard library
import multiprocessing
import os
import re
import time
from re import sub

# Environment flags, set before TensorFlow is imported
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ["SM_FRAMEWORK"] = "tf.keras"

# Third-party libraries
import nltk
import numpy as np
import pandas as pd
import segmentation_models as sm
from gensim.models.phrases import Phrases, Phraser
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from unidecode import unidecode

# English stop-word list (a plain Python list of lower-case strings)
stopwords = nltk.corpus.stopwords.words('english')

# Loading Data and Preview dataset

In [4]:
df=pd.read_csv(r'C:\Users\veereshg\Downloads\Dataset.csv',encoding='latin-1')

In [5]:
df.head()

Unnamed: 0,ID,MEMBER_ID,REASONNPSSCORE__C
0,a2p1U000000RowfQAC,0011U00000rjFKdQAM,"I showed up for my appointment, but they had m..."
1,a2p1U000000RqQqQAK,0011U00000riCSHQA2,"Staff was polite, courteous, and on time"
2,a2p1U000000RqXyQAK,0011U00000riTw7QAE,Overall care is great! It's wonderful to be a...
3,a2p1U000000Rq1LQAS,0011U00000rhu8eQAA,Like the doctor and staff at this location. Ea...
4,a2p1U000000RpiuQAC,0011U00000rk4SHQAY,The convenience and the doctors


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3812 entries, 0 to 3811
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   ID                 3812 non-null   object
 1   MEMBER_ID          3812 non-null   object
 2   REASONNPSSCORE__C  3812 non-null   object
dtypes: object(3)
memory usage: 89.5+ KB


Drop the ID and MEMBER_ID columns: they are identifiers and carry no information about sentiment.

In [7]:
df.drop(['ID','MEMBER_ID'],inplace=True,axis=1)

# Creating labels for the unlabelled data using vader_lexicon

In [8]:
df.rename(columns = {'REASONNPSSCORE__C':'review'}, inplace = True)

In [9]:
df.drop_duplicates(keep='first',inplace=True) 

In [10]:
import nltk
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\veereshg\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [11]:
### USING nltk SENTIMENT VADER TO LABEL THE SENTENCE

In [12]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_sentiment_result(sent):
    """Return 0 (negative) if VADER's 'neg' score exceeds its 'pos' score, else 1 (positive)."""
    scores = analyzer.polarity_scores(sent)

    # Ties, including fully neutral text, fall through to the positive label.
    if scores["neg"] > scores["pos"]:
        return 0

    return 1

df["sentiment"] = df["review"].apply(vader_sentiment_result)
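The labelling rule above sends every tie to the positive class, so a fully neutral review is also labelled 1. The sketch below isolates that rule as a pure function and contrasts it with a three-way rule based on VADER's `compound` score. The function names, the 0.05 threshold, and the example score dictionary are assumptions for illustration; the dictionary only mimics the shape that `polarity_scores` returns.

```python
def label_from_scores(scores):
    """Notebook rule: 0 only when 'neg' strictly exceeds 'pos', otherwise 1."""
    return 0 if scores["neg"] > scores["pos"] else 1


def label_from_compound(scores, threshold=0.05):
    """Alternative three-way rule on the compound score (threshold is an assumption)."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores for a review with no sentiment-bearing words.
neutral_scores = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

print(label_from_scores(neutral_scores))    # 1
print(label_from_compound(neutral_scores))  # neutral
```

Keeping a neutral bucket, or dropping neutral reviews before training, is one way to avoid inflating the positive class.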

In [18]:
df.head()

Unnamed: 0,review,sentiment
0,"I showed up for my appointment, but they had m...",1
1,"Staff was polite, courteous, and on time",1
2,Overall care is great! It's wonderful to be a...,1
3,Like the doctor and staff at this location. Ea...,1
4,The convenience and the doctors,1


# Processing Data

Load and Clean Dataset
The raw reviews are noisy: they contain numbers, mixed case, and punctuation, which inflate the vocabulary without adding sentiment signal. The load_dataset() function therefore takes the reviews from the dataframe loaded above and pre-processes them by replacing non-alphabetic characters (punctuation and numbers) with spaces, removing stop words, lemmatizing, and lower-casing every token.

Stop words are very common words (e.g. "the", "a", "an", "of") that carry little meaning on their own; search engines are often programmed to ignore them.
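The cleaning steps described above can be sketched on a single string using only the standard library. This is a simplified illustration, not the notebook's function: `STOP_WORDS` is a tiny hypothetical list standing in for the NLTK list, lemmatization is omitted, and the text is lower-cased before the stop-word check so that capitalised stop words such as "The" are removed as well.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "was", "to"}


def clean_review(text, stop_words=STOP_WORDS):
    """Strip HTML tags and non-letters, lower-case, then drop stop words."""
    text = re.sub(r"<.*?>", "", text)       # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)  # replace punctuation and digits with spaces
    tokens = text.lower().split()           # lower-case before filtering
    return [tok for tok in tokens if tok not in stop_words]


print(clean_review("The staff was polite, courteous, and on time!"))
# ['staff', 'polite', 'courteous', 'on', 'time']
```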




In [21]:
def load_dataset():
    from nltk.stem import WordNetLemmatizer

    ls = WordNetLemmatizer()
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output (already encoded as 0/1 by VADER above)

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    # remove stop words and lemmatize; the stop-word list is lower case and this
    # check runs before lower-casing, so capitalised stop words such as "The" survive
    x_data = x_data.apply(lambda review: [ls.lemmatize(w) for w in review.split() if w not in stopwords])
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case

    return x_data, y_data


# print('Reviews')
# print(x_data, '\n')
# # print('Sentiment')
# print(y_data)

In [22]:
x_data,y_data=load_dataset()

# Split Dataset


The data is split into an 80% training set and a 20% test set using the train_test_split method from Scikit-Learn, which shuffles the rows by default before splitting. Shuffling matters whenever the source file is ordered in some way (for example, by date or by sentiment): without it, the training and test sets could end up with different class balances, and the test accuracy would not reflect performance on unseen data. Because no random_state is passed, each run produces a different split.
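With imbalanced labels, a plain shuffled split can still leave the test set with a slightly different class ratio than the training set. The sketch below shows the idea of a stratified split in pure Python; the function name, the seed, and the toy label list are assumptions for illustration. In Scikit-Learn the same effect is available through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                        # shuffle within each class
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    rng.shuffle(train_idx)                       # mix the classes again
    rng.shuffle(test_idx)
    return train_idx, test_idx


labels = [1] * 80 + [0] * 20                     # hypothetical 80/20 class balance
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))             # 80 20
print(sum(labels[i] == 0 for i in test_idx))     # 4
```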



In [23]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

# print('Train Set')
# print(x_train, '\n')
# print(x_test, '\n')
# print('Test Set')
# print(y_train, '\n')
# print(y_test)

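Function for choosing the padded review length. Despite its name, get_max_length() returns the mean token count of the training reviews (using numpy.mean), rounded up, not the maximum; longer reviews are truncated to this length.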

In [24]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))
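Using the mean as the cut-off means that every review longer than average loses its tail. The sketch below, on hypothetical token counts, shows how to measure the share of reviews that would be truncated and how a percentile-based length could be picked instead. The helper names and the sample lengths are assumptions, not part of the notebook.

```python
import math


def mean_length(lengths):
    """Mean review length, rounded up (the rule used by get_max_length)."""
    return math.ceil(sum(lengths) / len(lengths))


def truncated_fraction(lengths, max_length):
    """Share of reviews longer than max_length, i.e. reviews that lose tokens."""
    return sum(1 for n in lengths if n > max_length) / len(lengths)


def length_covering(lengths, fraction):
    """Smallest length that fits at least `fraction` of the reviews untruncated."""
    ordered = sorted(lengths)
    index = math.ceil(fraction * len(ordered)) - 1
    return ordered[index]


lengths = [3, 5, 8, 12, 40]                     # hypothetical token counts
cutoff = mean_length(lengths)
print(cutoff)                                   # 14
print(truncated_fraction(lengths, cutoff))      # 0.2
print(length_covering(lengths, 0.9))            # 40
```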

# Tokenize and Pad/Truncate Reviews

A neural network only accepts numeric data, so the reviews must be encoded. tensorflow.keras.preprocessing.text.Tokenizer maps each unique word to an integer index; the index is built from x_train only (using the fit_on_texts method), so words that appear only in x_test are dropped at encoding time.
x_train and x_test are then converted into integer sequences using the texts_to_sequences method.

Reviews have different lengths, so each sequence is padded (with 0) or truncated to the same length (here, the mean review length) using tensorflow.keras.preprocessing.sequence.pad_sequences.

post: pad or truncate at the end of a sequence
pre: pad or truncate at the start of a sequence
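The effect of the two modes can be reproduced for a single sequence in plain Python. The helper below is a sketch of the behaviour described above, not the Keras implementation; its name is hypothetical and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Bring one integer sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'post' drops items from the end, 'pre' drops them from the start.
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the padding, 'pre' prepends it.
    return seq + pad if padding == "post" else pad + seq


print(pad_or_truncate([1, 2, 3], 5))                               # [1, 2, 3, 0, 0]
print(pad_or_truncate([1, 2, 3], 5, padding="pre"))                # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))                      # [1, 2, 3, 4]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4, truncating="pre"))    # [3, 4, 5, 6]
```

Note that pad_sequences defaults to 'pre' for both arguments, which is why the notebook passes 'post' explicitly.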

In [25]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lower-case here: already done in load_dataset()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[  11    2    3 ...    0    0    0]
 [   1   11  144 ...    4  571    3]
 [  14    9  129 ...  251 1963   55]
 ...
 [  88   10    9 ...    0    0    0]
 [ 626    2  329 ...    0    0    0]
 [ 480    8    5 ...   27 1080 1364]] 

Encoded X Test
 [[   1   81  457 ...    0    0    0]
 [   1   68  908 ...   67  220  186]
 [ 353  355   48 ...    1  914   66]
 ...
 [ 170 2172 1252 ...    0    0    0]
 [  16   17   20 ...    0    0    0]
 [   1  373    5 ... 2683   19    6]] 

Maximum review length:  12


# Build Architecture/Model
Embedding Layer: in simple terms, it maps each word index in word_index to a dense vector of EMBED_DIM numbers. The vectors are learned during training, so words that play a similar role for the sentiment task tend to end up with similar vectors.

LSTM Layer: processes the sequence one token at a time and decides what to keep or throw away by considering the current input, the previous output (hidden state), and the previous memory (cell state). Its main components are:

Forget Gate, decides which parts of the previous cell state are kept or thrown away
Input Gate, decides which new information, computed from the previous output and the current input, is written to the cell state
Cell State, the long-term memory: the previous cell state is multiplied by the forget vector (values multiplied by a number near 0 are dropped), then the gated candidate values from the input gate are added
Output Gate, decides the next hidden state, which is passed to the next time step and used for predictions
Dense Layer: multiplies its input by a weight matrix, adds a bias, and applies an activation function. A Sigmoid activation is used here because the target is binary (0 or 1), so the output can be read as the probability of the positive class.

The optimizer is Adam and the loss function is Binary Crossentropy, again because the target takes only the values 0 and 1.
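For reference, the gate descriptions above correspond to the standard LSTM update equations. The symbols are notation introduced here rather than taken from the notebook: $x_t$ is the embedded input at step $t$, $h_{t-1}$ the previous hidden state, $c_{t-1}$ the previous cell state, $W$, $U$, $b$ the learned weights and biases, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

The final hidden state $h_T$ is what the Dense layer receives, and the prediction is $\hat{y} = \sigma(w^\top h_T + b)$.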

In [26]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 12, 32)            115264    
_________________________________________________________________
lstm (LSTM)                  (None, 64)                24832     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 140,161
Trainable params: 140,161
Non-trainable params: 0
_________________________________________________________________
None
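The parameter counts in the summary above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the usual formulas for each layer type; the vocabulary size of 3602 is not printed in the notebook and is inferred here from the embedding parameter count divided by EMBED_DIM.

```python
def embedding_params(vocab_size, embed_dim):
    """One vector of embed_dim weights per vocabulary index."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return units * (input_dim + 1)


EMBED_DIM = 32
LSTM_OUT = 64
vocab_size = 115264 // EMBED_DIM    # inferred from the summary: 3602

total = (embedding_params(vocab_size, EMBED_DIM)
         + lstm_params(EMBED_DIM, LSTM_OUT)
         + dense_params(LSTM_OUT, 1))

print(embedding_params(vocab_size, EMBED_DIM))  # 115264
print(lstm_params(EMBED_DIM, LSTM_OUT))         # 24832
print(dense_params(LSTM_OUT, 1))                # 65
print(total)                                    # 140161
```

About 82% of the weights sit in the embedding table, so vocabulary size is the main driver of model size here.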


# Training
Training only requires fitting the model on x_train (input) and y_train (output/label). Mini-batch learning is used, with a batch_size of 64 and 50 epochs.

A ModelCheckpoint callback named checkpoint saves the model locally after any epoch in which accuracy improved over the best value seen so far. Note that monitor='accuracy' tracks accuracy on the training data, not on held-out data, so the saved model is the one that fits the training set best.
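The choice of monitored metric decides which epoch ends up on disk. The sketch below simulates the "save only when the metric improves" rule on two hypothetical accuracy curves; the numbers are made up for illustration and are not taken from this notebook's runs.

```python
def saved_epochs(metric_per_epoch):
    """Return the 1-based epochs at which a save-best-only checkpoint would fire."""
    best = float("-inf")
    saved = []
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best = value
            saved.append(epoch)
    return saved


# Hypothetical curves: training accuracy keeps rising, validation accuracy peaks early.
train_acc = [0.67, 0.80, 0.89, 0.94, 0.95, 0.97]
val_acc = [0.70, 0.78, 0.81, 0.80, 0.79, 0.78]

print(saved_epochs(train_acc))  # [1, 2, 3, 4, 5, 6]
print(saved_epochs(val_acc))    # [1, 2, 3]
```

In Keras, monitoring held-out accuracy would mean passing a validation set to fit (for example via its validation_split argument) and setting monitor='val_accuracy' on the callback.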

In [27]:
checkpoint = ModelCheckpoint('LSTM.h5',monitor='accuracy',save_best_only=True,verbose=1)

In [29]:
model.fit(x_train, y_train, batch_size = 64, epochs = 50, callbacks=[checkpoint])

Epoch 1/50

Epoch 00001: accuracy improved from 0.99025 to 0.99260, saving model to LSTM.h5
Epoch 2/50

Epoch 00002: accuracy improved from 0.99260 to 0.99630, saving model to LSTM.h5
Epoch 3/50

Epoch 00003: accuracy improved from 0.99630 to 0.99798, saving model to LSTM.h5
Epoch 4/50

Epoch 00004: accuracy did not improve from 0.99798
Epoch 5/50

Epoch 00005: accuracy did not improve from 0.99798
Epoch 6/50

Epoch 00006: accuracy did not improve from 0.99798
Epoch 7/50

Epoch 00007: accuracy did not improve from 0.99798
Epoch 8/50

Epoch 00008: accuracy improved from 0.99798 to 0.99866, saving model to LSTM.h5
Epoch 9/50

Epoch 00009: accuracy did not improve from 0.99866
Epoch 10/50

Epoch 00010: accuracy did not improve from 0.99866
Epoch 11/50

Epoch 00011: accuracy did not improve from 0.99866
Epoch 12/50

Epoch 00012: accuracy did not improve from 0.99866
Epoch 13/50

Epoch 00013: accuracy did not improve from 0.99866
Epoch 14/50

Epoch 00014: accuracy did not improve from 0.998

<tensorflow.python.keras.callbacks.History at 0x1de8d57c3a0>

# Testing
To evaluate the model, we predict the sentiment of the x_test reviews and compare the predictions with y_test (the expected output). Accuracy is the number of correct predictions divided by the total number of test samples. The result is an accuracy of 84.81%.

In [30]:
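# predict_classes() is no longer available in recent TensorFlow releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
y_pred = (model.predict(x_test, batch_size = 128) > 0.5).astype('int32').ravel()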

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))



Correct Prediction: 631
Wrong Prediction: 113
Accuracy: 84.81182795698925


# Load Saved Model
Load the saved model and use it to predict the sentiment (positive or negative) of a hospital review.

In [31]:
loaded_model = load_model('LSTM.h5')

In [34]:
review = str(input('Hospital Review: '))

Hospital Review: Which types of hospital is this, very bad experience , don't go here, you know why , cus of discharge time of patients was today's evening 5-6 oclk and this hospital without giving any basic details of asking the charges of amount .


In [38]:
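# Pre-process input
# Note: this differs slightly from load_dataset(): punctuation is deleted rather than
# replaced by a space, and the words are not lemmatized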
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered = [w for w in words if w not in stopwords]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  Which types of hospital is this very bad experience  dont go here you know why  cus of discharge time of patients was todays evening  oclk and this hospital without giving any basic details of asking the charges of amount 
Filtered:  ['which types hospital bad experience  dont go know  cus discharge time patients todays evening  oclk hospital without giving basic details asking charges amount ']


In [39]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[ 881 1002  233   25  808   36   44    3 2714 1722 1002  244]]


In [40]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.11768821]]


If the score is close to 0, the statement is negative; if it is close to 1, the statement is positive. A threshold of 0.7 is used here: a score of 0.7 or higher is labelled positive, and anything below 0.7 is labelled negative. This is stricter than the 0.5 cut-off that predict_classes applies to the test set.

In [43]:
if result[0][0] >= 0.7:    # result has shape (1, 1); take the single score
    print('positive')
else:
    print('negative')

negative


# Model Performance

In [45]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.51      0.47      0.49       116
           1       0.90      0.92      0.91       628

    accuracy                           0.85       744
   macro avg       0.71      0.69      0.70       744
weighted avg       0.84      0.85      0.84       744
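Because the classes are imbalanced, the accuracy figure is easier to interpret next to a majority-class baseline. The sketch below uses only numbers printed above: the per-class support from the report and the count of correct predictions from the Testing section.

```python
# Support per class, taken from the classification report above.
support = {0: 116, 1: 628}
total = sum(support.values())                        # 744 test reviews

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(support.values()) / total

# Accuracy reported in the Testing section: 631 correct out of 744.
model_accuracy = 631 / total

print(round(majority_baseline, 3))                   # 0.844
print(round(model_accuracy, 3))                      # 0.848
print(round(model_accuracy - majority_baseline, 3))  # 0.004
```

The model beats "always positive" by less than half a percentage point, which matches the weak recall of 0.47 on the negative class.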



# SECOND APPROACH

# Using transformer based zero shot classification to label the sentences

In [46]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


In [47]:
df=pd.read_csv(r'C:\Users\veereshg\Downloads\Dataset.csv',encoding='latin-1')

# Applying zero-shot classification to every sentence to label it

In [48]:
the_labels = ["positive", "negative"]
DF = df.copy()   # DF is not defined before this cell; work on a copy of the raw dataframe
DF['results'] = DF.REASONNPSSCORE__C.apply(lambda text: classifier(text, the_labels))

In [52]:
DF.to_csv('NLPdata.csv')

In [85]:
DF=pd.read_csv('NLPdata.csv')
DF.head()

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,ID,MEMBER_ID,REASONNPSSCORE__C,results
0,0,0,0,0,a2p1U000000RowfQAC,0011U00000rjFKdQAM,"I showed up for my appointment, but they had m...","{'sequence': ""I showed up for my appointment, ..."
1,1,1,1,1,a2p1U000000RqQqQAK,0011U00000riCSHQA2,"Staff was polite, courteous, and on time","{'sequence': 'Staff was polite, courteous, and..."
2,2,2,2,2,a2p1U000000RqXyQAK,0011U00000riTw7QAE,Overall care is great! It's wonderful to be a...,"{'sequence': ""Overall care is great! It's won..."
3,3,3,3,3,a2p1U000000Rq1LQAS,0011U00000rhu8eQAA,Like the doctor and staff at this location. Ea...,{'sequence': 'Like the doctor and staff at thi...
4,4,4,4,4,a2p1U000000RpiuQAC,0011U00000rk4SHQAY,The convenience and the doctors,{'sequence': 'The convenience and the doctors'...


In [86]:
DF.drop(['ID','MEMBER_ID','Unnamed: 0.3','Unnamed: 0.1','Unnamed: 0.2','Unnamed: 0.1'],inplace=True,axis=1)

In [87]:
import ast
res = DF['results'].apply(lambda x: ast.literal_eval(x))

In [88]:
DF['output'] = [(i.get('labels')[0]) for i in res]
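Each stored result is the string form of a dictionary with the keys 'sequence', 'labels', and 'scores', where the labels are ordered from most to least likely. The sketch below parses one such string and also returns the winning score, which makes it possible to flag low-confidence labels. The example string, its scores, and the function name are hypothetical.

```python
import ast

# Hypothetical stored result, in the same shape as the 'results' column.
stored = ("{'sequence': 'Staff was polite', "
          "'labels': ['positive', 'negative'], "
          "'scores': [0.98, 0.02]}")


def top_label(result_str):
    """Parse a stored zero-shot result and return (best_label, best_score)."""
    result = ast.literal_eval(result_str)   # safe parsing of the dict literal
    return result["labels"][0], result["scores"][0]


label, score = top_label(stored)
print(label, score)                         # positive 0.98

# Example use: treat labels below a chosen confidence as uncertain.
is_confident = score >= 0.8                 # 0.8 is an arbitrary illustration
print(is_confident)                         # True
```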

In [89]:
DF1=DF[['REASONNPSSCORE__C','output']].copy()   # .copy() avoids SettingWithCopyWarning on later edits
DF1.head()

Unnamed: 0,REASONNPSSCORE__C,output
0,"I showed up for my appointment, but they had m...",positive
1,"Staff was polite, courteous, and on time",positive
2,Overall care is great! It's wonderful to be a...,positive
3,Like the doctor and staff at this location. Ea...,positive
4,The convenience and the doctors,positive


In [90]:
DF1['output']=DF1['output'].map({'positive':1,'negative':0})



In [91]:
DF1.rename(columns={'REASONNPSSCORE__C':'review'},inplace=True)



In [92]:
DF1['output'].value_counts()

1    2696
0    1116
Name: output, dtype: int64

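With the zero-shot labels, the number of reviews labelled negative rises from 569 (VADER) to 1116 out of 3812. The zero-shot model is expected to label more reliably than the lexicon-based approach, although neither set of labels is checked against human annotations in this notebook.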


In [97]:
def load_dataset():
    from nltk.stem import WordNetLemmatizer

    ls = WordNetLemmatizer()
    x_data = DF1['review']       # Reviews/Input
    y_data = DF1['output']       # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    x_data = x_data.apply(lambda review: [ls.lemmatize(w) for w in review.split() if w not in stopwords])  # remove stop words, lemmatize
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case

    return x_data, y_data

In [98]:
x_data, y_data = load_dataset()
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

In [99]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

In [100]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lower-case here: already done in load_dataset()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[   6    2   30 ...   22  544    0]
 [   2 1978 1130 ...  139    0    0]
 [  49   13   32 ...    0    0    0]
 ...
 [ 865 1929    0 ...    0    0    0]
 [  38   51    5 ...    0    0    0]
 [1015   27    0 ...    0    0    0]] 

Encoded X Test
 [[  12    9    0 ...    0    0    0]
 [  43    9    0 ...    0    0    0]
 [2031  125   10 ...  284  102  342]
 ...
 [ 144 1453   36 ...    0    0    0]
 [  11    7    0 ...    0    0    0]
 [ 843   88   13 ...    0    0    0]] 

Maximum review length:  12


In [101]:
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 12, 32)            116096    
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 140,993
Trainable params: 140,993
Non-trainable params: 0
_________________________________________________________________
None


In [102]:
checkpoint = ModelCheckpoint('LSTM1.h5',monitor='accuracy',save_best_only=True,verbose=1)

In [103]:
x_train.shape

(3049, 12)

In [105]:
model.fit(x_train, y_train, batch_size = 64, epochs = 50, callbacks=[checkpoint])

Epoch 1/50

Epoch 00001: accuracy did not improve from 0.98065
Epoch 2/50

Epoch 00002: accuracy did not improve from 0.98065
Epoch 3/50

Epoch 00003: accuracy improved from 0.98065 to 0.98262, saving model to LSTM1.h5
Epoch 4/50

Epoch 00004: accuracy did not improve from 0.98262
Epoch 5/50

Epoch 00005: accuracy improved from 0.98262 to 0.98623, saving model to LSTM1.h5
Epoch 6/50

Epoch 00006: accuracy improved from 0.98623 to 0.98754, saving model to LSTM1.h5
Epoch 7/50

Epoch 00007: accuracy improved from 0.98754 to 0.99049, saving model to LSTM1.h5
Epoch 8/50

Epoch 00008: accuracy improved from 0.99049 to 0.99213, saving model to LSTM1.h5
Epoch 9/50

Epoch 00009: accuracy did not improve from 0.99213
Epoch 10/50

Epoch 00010: accuracy improved from 0.99213 to 0.99344, saving model to LSTM1.h5
Epoch 11/50

Epoch 00011: accuracy improved from 0.99344 to 0.99377, saving model to LSTM1.h5
Epoch 12/50

Epoch 00012: accuracy did not improve from 0.99377
Epoch 13/50

Epoch 00013: accur

<tensorflow.python.keras.callbacks.History at 0x1df000a9370>

In [106]:
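# predict_classes() is no longer available in recent TensorFlow releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
y_pred = (model.predict(x_test, batch_size = 128) > 0.5).astype('int32').ravel()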

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))



Correct Prediction: 611
Wrong Prediction: 152
Accuracy: 80.07863695937091


In [107]:
model2 = load_model('LSTM1.h5')

In [108]:
y_pred1 = (model2.predict(x_test) > 0.5).astype('int32').ravel()   # equivalent to the older predict_classes()

In [109]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred1))

              precision    recall  f1-score   support

           0       0.66      0.59      0.62       218
           1       0.84      0.88      0.86       545

    accuracy                           0.80       763
   macro avg       0.75      0.73      0.74       763
weighted avg       0.79      0.80      0.79       763




# After downsampling

In [180]:
DF=pd.read_csv('NLPdata.csv')

In [181]:
DF.drop(['ID','MEMBER_ID','Unnamed: 0.3','Unnamed: 0.1','Unnamed: 0.2','Unnamed: 0.1'],inplace=True,axis=1)

In [182]:
import ast
res = DF['results'].apply(lambda x: ast.literal_eval(x))

In [183]:
DF['output'] = [(i.get('labels')[0]) for i in res]

In [184]:
DF.head()

Unnamed: 0.1,Unnamed: 0,REASONNPSSCORE__C,results,output
0,0,"I showed up for my appointment, but they had m...","{'sequence': ""I showed up for my appointment, ...",positive
1,1,"Staff was polite, courteous, and on time","{'sequence': 'Staff was polite, courteous, and...",positive
2,2,Overall care is great! It's wonderful to be a...,"{'sequence': ""Overall care is great! It's won...",positive
3,3,Like the doctor and staff at this location. Ea...,{'sequence': 'Like the doctor and staff at thi...,positive
4,4,The convenience and the doctors,{'sequence': 'The convenience and the doctors'...,positive


In [185]:
DF1=DF[['REASONNPSSCORE__C','output']].copy()   # .copy() avoids SettingWithCopyWarning on later edits
DF1.head()

Unnamed: 0,REASONNPSSCORE__C,output
0,"I showed up for my appointment, but they had m...",positive
1,"Staff was polite, courteous, and on time",positive
2,Overall care is great! It's wonderful to be a...,positive
3,Like the doctor and staff at this location. Ea...,positive
4,The convenience and the doctors,positive


In [186]:
DF1['output']=DF1['output'].map({'positive':1,'negative':0})



In [187]:
pos_review = DF1[DF1['output'] == 1]
neg_review  = DF1[DF1['output'] == 0]
pos_review.shape

(2696, 2)

In [188]:
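from sklearn.utils import resample
# Note: replace=True samples with replacement, so the same positive review can be
# drawn more than once; replace=False would give a downsample without duplicates.
pos_downsample = resample(pos_review,
             replace=True,
             n_samples=len(neg_review ),
             random_state=42)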

print(pos_downsample.shape)

(1116, 2)


In [189]:
data_downsampled = pd.concat([pos_downsample, neg_review])
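Sampling with replacement can place copies of the same review in both the training and test sets once the data is split, which makes test scores look better than they are. The sketch below contrasts the two sampling modes using the standard library; the row ids stand in for the 2696 positive reviews, and the function names and seed are assumptions for illustration.

```python
import random


def downsample_without_replacement(rows, n_samples, seed=42):
    """Draw n_samples distinct rows."""
    rng = random.Random(seed)
    return rng.sample(rows, n_samples)


def sample_with_replacement(rows, n_samples, seed=42):
    """Draw n_samples rows, allowing the same row to be picked repeatedly."""
    rng = random.Random(seed)
    return rng.choices(rows, k=n_samples)


positive_ids = list(range(2696))    # stand-in for the positive reviews
n_negative = 1116                   # size of the negative class

unique_draw = downsample_without_replacement(positive_ids, n_negative)
repeated_draw = sample_with_replacement(positive_ids, n_negative)

print(len(set(unique_draw)))        # 1116, every row is distinct
print(len(set(repeated_draw)))      # fewer than 1116, some rows repeat
```

With sklearn.utils.resample, the first behaviour corresponds to replace=False.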

In [190]:
data_downsampled

Unnamed: 0,REASONNPSSCORE__C,output
1223,Customer service is great and dr was very help...,1
1846,"Just started with them, so far so good",1
1600,"Save money, convenient",1
1558,I love my care provider - Ashley Giles. She r...,1
2331,"I was in the office for one symptom, however a...",1
...,...,...
3795,"While a good concept and convenient location, ...",0
3798,I sometimes feel as though the Dr is trying to...,0
3804,Nurse was unable to complete a blood draw beca...,0
3808,Very skeptical that you will soon be without a...,0


In [191]:
def load_dataset():
    from nltk.stem import WordNetLemmatizer

    ls = WordNetLemmatizer()
    x_data = data_downsampled['REASONNPSSCORE__C']     # Reviews/Input
    y_data = data_downsampled['output']                # Sentiment/Output

    # PRE-PROCESS REVIEW
    #x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    x_data = x_data.apply(lambda review: [ls.lemmatize(w)  for w in review.split() if w not in stopwords])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case

    return x_data, y_data

In [192]:
x_data, y_data = load_dataset()

In [193]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.08)

In [194]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

In [195]:
token = Tokenizer(lower=False)    # no need to lower-case here: already done in load_dataset()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[ 143 1895  865 ...    0    0    0]
 [ 260  756    0 ...    0    0    0]
 [   1   38    8 ... 1896  569   37]
 ...
 [  17  138    7 ...  165   13   70]
 [  25    9  101 ...    0    0    0]
 [ 593  129  182 ...    0    0    0]] 

Encoded X Test
 [[ 120    2   10 ...    0    0    0]
 [  17   51   14 ...    0    0    0]
 [ 128    0    0 ...    0    0    0]
 ...
 [ 450  164  336 ...    0    0    0]
 [  64  935   90 ...    0    0    0]
 [ 121    2 2842 ...    0    0    0]] 

Maximum review length:  15


In [196]:
checkpoint = ModelCheckpoint('LSTM3.h5',monitor='accuracy',save_best_only=True,verbose=1)

In [197]:
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 15, 32)            110976    
_________________________________________________________________
lstm_5 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 65        
Total params: 135,873
Trainable params: 135,873
Non-trainable params: 0
_________________________________________________________________
None


In [198]:
model.fit(x_train, y_train, batch_size = 64, epochs = 50, callbacks=[checkpoint])

Epoch 1/50

Epoch 00001: accuracy improved from -inf to 0.66683, saving model to LSTM3.h5
Epoch 2/50

Epoch 00002: accuracy improved from 0.66683 to 0.80419, saving model to LSTM3.h5
Epoch 3/50

Epoch 00003: accuracy improved from 0.80419 to 0.89479, saving model to LSTM3.h5
Epoch 4/50

Epoch 00004: accuracy improved from 0.89479 to 0.93619, saving model to LSTM3.h5
Epoch 5/50

Epoch 00005: accuracy improved from 0.93619 to 0.95373, saving model to LSTM3.h5
Epoch 6/50

Epoch 00006: accuracy improved from 0.95373 to 0.97126, saving model to LSTM3.h5
Epoch 7/50

Epoch 00007: accuracy improved from 0.97126 to 0.97906, saving model to LSTM3.h5
Epoch 8/50

Epoch 00008: accuracy improved from 0.97906 to 0.98246, saving model to LSTM3.h5
Epoch 9/50

Epoch 00009: accuracy improved from 0.98246 to 0.98636, saving model to LSTM3.h5
Epoch 10/50

Epoch 00010: accuracy improved from 0.98636 to 0.98782, saving model to LSTM3.h5
Epoch 11/50

Epoch 00011: accuracy did not improve from 0.98782
Epoch 12

<tensorflow.python.keras.callbacks.History at 0x1df00089040>

In [199]:
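# predict_classes() is no longer available in recent TensorFlow releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
y_pred = (model.predict(x_test, batch_size = 128) > 0.5).astype('int32').ravel()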

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))



Correct Prediction: 144
Wrong Prediction: 35
Accuracy: 80.44692737430168


In [200]:
model3 = load_model('LSTM3.h5')

In [201]:
y_pred1 = (model3.predict(x_test) > 0.5).astype('int32').ravel()   # equivalent to the older predict_classes()



In [202]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred1))

              precision    recall  f1-score   support

           0       0.82      0.78      0.80        87
           1       0.80      0.84      0.82        92

    accuracy                           0.81       179
   macro avg       0.81      0.81      0.81       179
weighted avg       0.81      0.81      0.81       179



# CREATING WORD EMBEDDING FOR THE SENTENCES USING PRETRAINED BERT 

In [None]:
#Below is the dataframe which holds the precomputed embeddings for the given dataset
#Here a pretrained BERT model encodes each sentence as a fixed-length vector; cosine similarity can be used to compare such vectors, but it is not how they are produced.

In [397]:
DF_EMB=pd.read_csv('EMBEDED.csv')

In [398]:
DF_EMB.head()

Unnamed: 0.1,Unnamed: 0,REASONNPSSCORE__C,output,emb
0,0,"I showed up for my appointment, but they had m...",positive,[-0.09676336 0.23733687 -0.03380208 -0.175840...
1,1,"Staff was polite, courteous, and on time",positive,[ 0.04106916 0.25190938 -0.12936194 -0.085945...
2,2,Overall care is great! It's wonderful to be a...,positive,[-6.6784762e-02 8.1635922e-02 -1.8907736e-01 ...
3,3,Like the doctor and staff at this location. Ea...,positive,[-2.04459503e-02 3.02443504e-01 2.63532002e-...
4,4,The convenience and the doctors,positive,[-0.17510815 0.36582455 -0.3491363 -0.244227...


In [399]:
DF_EMB=DF_EMB[['emb','output']]

In [400]:
DF_EMB['emb'] = DF_EMB['emb'].apply(lambda x: np.array(x.replace('[','').replace(']','').split()).astype(float))

In [401]:
DF_EMB['emb']

0       [-0.09676336, 0.23733687, -0.03380208, -0.1758...
1       [0.04106916, 0.25190938, -0.12936194, -0.08594...
2       [-0.066784762, 0.081635922, -0.18907736, 0.120...
3       [-0.0204459503, 0.302443504, 2.63532002e-05, 0...
4       [-0.17510815, 0.36582455, -0.3491363, -0.24422...
                              ...                        
3807    [-0.54986167, -0.7347374, -0.29466134, 0.34458...
3808    [-0.12956539, -0.17287686, -0.14482819, 0.0309...
3809    [-0.06594059, 0.14354283, -0.25912088, -0.4730...
3810    [0.04620619, 0.08419552, 0.6763907, -0.1371661...
3811    [-0.14128962, 0.03340659, 0.05415522, 0.021991...
Name: emb, Length: 3812, dtype: object
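The embeddings are stored in the CSV as the printed form of a NumPy array, which is why the cell above strips the brackets and splits on whitespace. The sketch below shows that parsing step on a short string and, since cosine similarity is the usual way to compare such vectors, a plain-Python version of that measure. The three-number vector is a shortened, hypothetical example, and the function names are assumptions.

```python
import math


def parse_embedding(text):
    """Turn a string like '[-0.1  0.2 3e-05]' into a list of floats."""
    return [float(tok) for tok in text.strip().strip("[]").split()]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vec = parse_embedding("[-0.09676336  0.23733687 -0.03380208]")
print(vec)                                      # [-0.09676336, 0.23733687, -0.03380208]
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A vector compared with itself scores 1 (up to floating-point rounding), and unrelated directions score near 0.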

In [402]:
DF_EMB['output']=DF_EMB['output'].map({'positive':1,'negative':0})

In [403]:
x_data=DF_EMB['emb']
y_data=DF_EMB['output']

In [404]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

# Training an SVM classifier on the embeddings

In [418]:
from sklearn import svm
X =np.array(x_train.tolist())
y = y_train
clf = svm.SVC()
clf.fit(X, y)


SVC()

In [421]:
y_pred=clf.predict(np.array(x_test.tolist()))

In [422]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.70      0.67      0.68       221
           1       0.87      0.88      0.87       542

    accuracy                           0.82       763
   macro avg       0.78      0.77      0.78       763
weighted avg       0.82      0.82      0.82       763

