# Introduction

This notebook demonstres steps of using the Natural Language Programming in a dataset of reviews.
In the first moment, we organized the data, removing columns, rating positive reviews and negative reviews, helping to reduce the volume of data of CSV file consumed.
On the next steps over notebook, we utilized ``pandas`` and ``nltk`` libraries to manipulate the data and create fuctions using concepts learned in the PLN classes.

Another annotations are disposed over notebook.

The dataset was downloaded from [Kaggle - Hotel Reviews in Europe](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe/downloads/515k-hotel-reviews-data-in-europe.zip/1)

In [None]:
import pandas
import nltk
import textblob
import matplotlib.pyplot as plotlib
import csv
import random

nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('brown')

In [7]:
# Reading the dataset
data = pandas.read_csv('dataset/review_hotel_6.csv')

In [8]:
# Exploring the top 5 observations
data.head()

Unnamed: 0,Review_Date,Hotel_Name,Reviewer_Nationality,Review
0,8/3/2017,Hotel Arena,Russia,I am so angry that i made this post available...
1,8/3/2017,Hotel Arena,Ireland,No Negative
2,7/31/2017,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...
3,7/31/2017,Hotel Arena,United Kingdom,My room was dirty and I was afraid to walk ba...
4,7/24/2017,Hotel Arena,New Zealand,You When I booked with your company on line y...


# Modeling the data

In Data Science, we deal with a big volume of data, so it's important to handle our data before start create a model.
There are many steps that we can use here, but in this case, we did the following:
- Remove unnecessary columns manually
- Make text all lower case
- Tokenize text
- Stemming / lemmatization
- Remove stop words
- Remove sentences with little sense (i.e. "No negative", "Nothing")

__Others steps to clean the data that can be done:__
- Remove numerical values
- Remove common non-sensical text (/n)
- Parts of speech tagging
- Create bi-grams or tri-grams
- Deal with typos
- And more...

In [9]:
sentences_tokens = []

# Set language stopwords to english
stop_words = set(nltk.corpus.stopwords.words('english')) 

# Do tokenization over each Review from dataset
for index, row in data.iterrows():
    sentences_tokens.append(nltk.word_tokenize(row['Review']))

# Put all words on singular and after find infinitive verbs of words

word_capital_letter = ''
lemmatized = ''
index = -1

for row_tokenizing in sentences_tokens:
    for token in row_tokenizing:
        index += 1
        word_capital_letter = token
        lemmatized = nltk.WordNetLemmatizer().lemmatize(word_capital_letter.lower(), 'v')
        if word_capital_letter.istitle():
            word_capital_letter = lemmatized.capitalize()
        else:
            word_capital_letter = lemmatized
        token = word_capital_letter
        row_tokenizing[index] = token
    
    index = -1           

# Remove stop words

index = -1

for words_sentence in sentences_tokens:
    for word in words_sentence:
        index += 1
        if word in stop_words:
            del words_sentence[index]
    
    index = -1

# Sentiment analysis

Our main goal here was extract pieces of information over each sentence that could help us to understand which valuable that assessment could be, analising if the sentence is more objective or subjective (driven by emotion).

When it comes to text data, there are few popular techniques that help to start with sentiment analysis, and they are:
1. __Text Blob Module:__ Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of words can vary based on where it is in a sentence. The Text Blob Module allows us to take advantage of these labels.
2. __Sentiment Labels:__ Each word in a corpus is labeled in terms of polarity and subjectivity. A corpus' sentiment is the average of these.

    - Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
    - Subjectivity: How subjective, or opinionated a word is. Zero is a fact. +1 is very much opinion.

At the end, we created a new data frame join all sentences modeled and their sentiment labels.

In [22]:
manual_sentences = ['No Negative', 'Nothing', 'Nothing at all', 'No Positive', 'Good location'] #sentences to not use
polarity_sentences = []
subjectivity_sentences = []
review = []
index = []
aux = 0
count = 0

for sentence_token in sentences_tokens:
    
    text = nltk.tokenize.treebank.TreebankWordDetokenizer().detokenize(sentence_token)

    if text in manual_sentences:
        del sentences_tokens[aux]
    else:
        polarity_sentences.append(textblob.TextBlob(text).sentiment.polarity)
        subjectivity_sentences.append(textblob.TextBlob(text).sentiment.subjectivity)
        index.append(count)
        review.append(text)
        count += 1
        
    aux += 1
    
data_modeling = pandas.DataFrame()
data_modeling = pandas.DataFrame(columns=['id', 'review', 'polarity', 'subjectivity'])
data_modeling['polarity'] = polarity_sentences
data_modeling['subjectivity'] = subjectivity_sentences
data_modeling['id'] = index
data_modeling['review'] = review

data_modeling.head(10)

Unnamed: 0,id,review,polarity,subjectivity
0,0,I so angry i make post available via possible ...,0.055789,0.416072
1,1,Room nice for elderly bite difficult most room...,0.032653,0.539541
2,2,My room dirty I afraid walk barefoot the floor...,-0.02716,0.554321
3,3,You When I book your company line show picture...,0.066667,0.464815
4,4,Backyard the hotel total mess t happen hotel 4...,-0.0875,0.4625
5,5,Cleaner not change sheet duvet everyday just m...,0.083333,0.65
6,6,Apart the price the brekfast Everything good,0.7,0.6
7,7,Even though picture show clean room actual roo...,-0.083333,0.525
8,8,The aircondition make much noise its hard slee...,-0.045833,0.370833
9,9,Nothing great,0.8,0.75


Below, we plot a graph to watch how close the evaluations may be to each other, or whether there is a big difference, taking into account the feelings when conducting the review.

In [None]:
plotlib.rcParams['figure.figsize'] = [10, 8]

for index, id in enumerate(data_modeling.index):
    x = data_modeling.polarity.loc[id]
    y = data_modeling.subjectivity.loc[id]
    plotlib.scatter(x, y, color='blue')
    plotlib.text(x+.001, y+.001, [index], fontsize=0)
    plotlib.xlim(-1., 1.)
    
plotlib.title('Sentiment Analysis of Hotel''s Review', fontsize=20)
plotlib.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plotlib.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plotlib.show()

# Manual notation

Our next step was utilize a bit part of the dataset to compare with the model. So we took a few sentences and rank according to what we think is best. This is a practice to see how good was our modeling, and maybe propose modifications or new steps.

The result was disposed in a table, showing that some moments we had a false positive value, where the model rated differently than expected.

In [23]:
aux_list = ['Review']
rand_list = []
aux = 0

for i in range(30): 
    rand_list.append(random.randint(0,len(data_modeling) -1))
    
for index, row in data_modeling.iterrows():
    if index in rand_list:
        aux_list.append(row['review'])
    

# Uncommend if necessary change the result of file manual_review.csv
    
resultFyle = open("dataset/manual_review.csv",'w')

for r in aux_list:
    resultFyle.write(r + "\n")

resultFyle.close()

In [24]:
data_analyze = pandas.read_csv('dataset/manual_review.csv')

In [25]:
data_analyze

Unnamed: 0,Review,result_review
0,Room small,NEGITIVE
1,We an issue sofa bed the room be It break the ...,POSITIVE
2,Unfriendly bar staff,NEGITIVE
3,We like all,POSITIVE
4,The wall very thin you hear every little thing...,NEGITIVE
5,room cold overall stay ok,POSITIVE
6,The tone the carpet could be little rosier Kid,NEGITIVE
7,Breakfast room bite chilly easier set for brea...,NEGITIVE
8,Small room small basic toilet old fashion,NEGITIVE
9,Double bed our room sag Not comfortable,NEGITIVE


In [26]:
listi = []

for index, row in data_modeling.iterrows():
    for index2, row2 in data_analyze.iterrows():
        if (row['review'] == row2['Review']) and (row['polarity'] > 0.00) and (row2['result_review'] == 'POSITIVE'):
            listi.append([row.review, row.polarity, row2.result_review, 'POSITIVE'])
        elif (row['review'] == row2['Review']) and (row['polarity'] < 0.00) and (row2['result_review'] == 'NEGATIVE'):
            listi.append([row.review, row.polarity, row2.result_review, 'NEGATIVE'])
        elif (row['review'] == row2['Review']) and (row['polarity'] > 0.00) and (row2['result_review'] == 'NEGATIVE'):
            listi.append([row.review, row.polarity, row2.result_review, 'Polarity greater than 0 and manual review NEGATIVE?'])
        elif (row['review'] == row2['Review']) and (row['polarity'] < 0.00) and (row2['result_review'] == 'POSITIVE'):
            listi.append([row.review, row.polarity, row2.result_review, 'Polarity less than 0 and manual review POSITIVE?'])
            
df = pandas.DataFrame(listi, columns=['review', 'polarity_sentence', 'manual_result', 'should_be'])

In [27]:
df

Unnamed: 0,review,polarity_sentence,manual_result,should_be
0,We an issue sofa bed the room be It break the ...,0.166667,POSITIVE,POSITIVE
1,room cold overall stay ok,-0.033333,POSITIVE,Polarity less than 0 and manual review POSITIVE?
2,Everything good There mention construction nea...,0.2875,POSITIVE,POSITIVE
3,Book online save us half cost book the recepti...,-0.166667,POSITIVE,Polarity less than 0 and manual review POSITIVE?
4,Clean quiet room,0.183333,POSITIVE,POSITIVE
5,There a small microwave sink dish refrigerator...,-0.25,POSITIVE,Polarity less than 0 and manual review POSITIVE?
6,Lively neighborhood access underground close H...,0.248939,POSITIVE,POSITIVE
7,Good reception specially indian guy I forget name,0.528571,POSITIVE,POSITIVE
8,Staff extra friendly,0.1875,POSITIVE,POSITIVE
9,Excellent gym pool Excellent location Excellen...,0.835,POSITIVE,POSITIVE


# Next steps

Making the previous result better visible, a confusion matrix was make approving incorrect categories over data.
And finally, a neural network was build to improve the results of the analyzes performed. 