In [3]:
! unzip '/content/trip_advisor_hackerearth_data.zip'

Archive:  /content/trip_advisor_hackerearth_data.zip
  inflating: test.csv                
  inflating: train.csv               


Problem Statement

“TripAdvisor is the world’s largest travel site where you can compare and book hotels, flights, restaurants etc. The data set provided in this challenge consists of a sample of hotel reviews provided by the customers. Analyzing customers reviews will help them understand about the hotels listed on their website i.e. if they are treating customers well or if they are providing hospitality services as expected.

In this challenge, you have to predict if a customer is happy or not happy.” 

Data-set Description

You are given three files to download: train.csv, test.csv and sample_submission.csv. The training data has 38932 rows, while the test data has 29404 rows. You can download the .csv files from 

Variable 	Description
User_ID 	unique ID of the customer
Description 	description of the review posted
Browser_Used 	browser used to post the review
Device_Used 	device used to post the review
Is_Response 	target variable

We are interested only in 2 columns. ‘Description’ which contains hotel reviews given by different users and ‘Is_Response’ which keeps the record of ‘happy’ or ‘not_happy’. So, in essence, this is simply a 2-class sentiment analysis problem.

The steps we are going to follow in this blog-post are as follows:

    Prepare Data
    Feature Extraction
    Build the Model
    Train the Model
    Checking Performance


In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import json
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout

Using TensorFlow backend.


In [0]:
train = pd.read_csv('/content/train.csv')

In [5]:
train.head()

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,Edge,Mobile,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,Internet Explorer,Mobile,not happy
2,id10328,I booked this hotel through Hotwire at the low...,Mozilla,Tablet,not happy
3,id10329,Stayed here with husband and sons on the way t...,InternetExplorer,Desktop,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,Edge,Tablet,not happy


Prepare Data

The training data is provided in .csv format, which can be ingested easily with pandas as shown in the code below. The ‘Is_Response’ field carries strings, i.e. ‘happy’ and ‘not_happy’, which need to be encoded as integers, i.e. 0 and 1. Here, this is done with the LabelEncoder class of the scikit-learn library. The function returns a list of hotel reviews and an array of their respective happiness labels.

In [0]:
def data_prepare(training_file_path):
    dataset = pd.read_csv(training_file_path)
    reviews = []
    labels = []    
 
    # Encoding Categorical Data
    labelencoder_y = LabelEncoder()
    dataset['Is_Response'] = labelencoder_y.fit_transform(dataset['Is_Response'])
    cLen = len(dataset['Description'])
 
    for i in range(0,cLen):
        review = dataset['Description'][i]
        reviews.append(review)
        label = dataset["Is_Response"][i]
        labels.append(label)
    labels = np.asarray(labels)
    return reviews,labels
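
To make the label encoding concrete, here is a minimal sketch that mimics what LabelEncoder does, using plain Python only. The four response values are made up for illustration; the point is that the class names are sorted first, so ‘happy’ maps to 0 and the unhappy class maps to 1.

```python
# Toy stand-in for the Is_Response column (values made up for illustration)
responses = ["not happy", "happy", "happy", "not happy"]

# LabelEncoder assigns integer codes to the sorted unique class names
classes = sorted(set(responses))
class_to_code = {name: code for code, name in enumerate(classes)}

# Replace every string label with its integer code
encoded = [class_to_code[name] for name in responses]

print(classes)
print(encoded)
```

This ordering is why the first training review, which is an unhappy one, ends up with label 1.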

Feature Extraction

In this task, words are the features, so the bag-of-words model can be used to create a feature vector. It can be done in the following steps:

1. Make a dictionary: we create a dictionary of word-index pairs for all the distinct words in the training reviews. We assume that the ordering of words is not important.

2. Convert the words of each text review into an array of word indices and store the index array of each review in a global array. Example of a text review and its index array:

The room was kind of clean but had a VERY strong smell of dogs. Generally below average but ok for a overnight stay if you're not too fussy. Would consider staying again if the price was right. Breakfast was free and just about better than nothing.
[1, 14, 5, 436, 9, 52, 17, 25, 3, 22, 1735, 628, 9, 1727, 1109, 943, 492, 17, 322, 11, 3, 1010, 34, 42, 411, 24, 131, 3754, 40, 941, 181, 72, 42, 1, 126, 5, 117, 60, 5, 89, 2, 56, 64, 172, 100, 268]

3. Convert the global array of indices into a feature matrix. Each text review is represented by a binary vector the size of the vocabulary, with 1 in the entries for words that occur in the review and 0 in all other entries. We cap the number of features at 10,000, so the final feature matrix has shape (38932, 10000).
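
The three steps can be sketched end to end in plain Python. This is a simplified illustration, not the Keras Tokenizer itself: the whitespace tokenizer, the helper names and the two toy reviews are assumptions made for this example, and ties between equally frequent words are broken alphabetically here.

```python
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    return text.lower().split()

def build_dictionary(texts):
    # Rank words by frequency; index 1 is the most frequent word, 0 stays unused
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_binary_vector(text, dictionary, max_words):
    # 1 where a word index occurs in the review, 0 elsewhere
    vector = [0] * max_words
    for word in tokenize(text):
        index = dictionary.get(word)
        # Unknown words and words outside the top max_words are ignored
        if index is not None and index < max_words:
            vector[index] = 1
    return vector

toy_reviews = ["the room was clean", "the room was small and the staff was rude"]
toy_dictionary = build_dictionary(toy_reviews)
toy_matrix = [to_binary_vector(r, toy_dictionary, max_words=6) for r in toy_reviews]

for row in toy_matrix:
    print(row)
```

With max_words=6 only the five most frequent words get a column (column 0 is never used), which mirrors how the 10,000-word cap drops rare words from the real feature matrix.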

In [7]:
train.shape

(38932, 5)

While training the model, we pass the feature matrix, the labels, the batch size, the number of epochs, etc. as parameters. We also save the dictionary and the NN model so that they can be used later to make predictions on the test data. Once the NN model has been trained, we can check its performance on the test.csv data.

In [0]:
def convert_text_to_index_array(text):
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

In [0]:
train_file_path = "/content/train.csv"
[reviews,labels] = data_prepare(train_file_path)

In [12]:
len(reviews), len(labels)

(38932, 38932)

In [14]:
print(reviews[0])
print(labels[0])

The room was kind of clean but had a VERY strong smell of dogs. Generally below average but ok for a overnight stay if you're not too fussy. Would consider staying again if the price was right. Breakfast was free and just about better than nothing.
1


In [0]:
# Create Dictionary of words and their indices
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(reviews)
dictionary = tokenizer.word_index

In [17]:
cnt=0
for key, word in dictionary.items():
  print(key, word)
  cnt+=1
  if(cnt==5):
    break

the 1
and 2
a 3
to 4
was 5


In [0]:
# save dictionary
with open('dictionary.json','w') as dictionary_file:
    json.dump(dictionary,dictionary_file)

In [0]:
# Replace words of each text review to indices
allWordIndices = []
for num,text in enumerate(reviews):
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

In [20]:
allWordIndices[0]

[1,
 14,
 5,
 436,
 9,
 52,
 17,
 25,
 3,
 22,
 1735,
 628,
 9,
 1727,
 1109,
 943,
 492,
 17,
 322,
 11,
 3,
 1010,
 34,
 42,
 411,
 24,
 131,
 3754,
 40,
 941,
 181,
 72,
 42,
 1,
 126,
 5,
 117,
 60,
 5,
 89,
 2,
 56,
 64,
 172,
 100,
 268]

In [0]:
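# Convert the index sequences into binary bag-of-words vectors (multi-hot encoding)
# The list of lists is passed directly: reviews differ in length, so np.asarray
# would build a ragged array, which recent NumPy versions reject
train_X = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')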
labels = keras.utils.to_categorical(labels,num_classes=2)

In [23]:
train_X, train_X.shape

(array([[0., 1., 1., ..., 0., 0., 0.],
        [0., 1., 1., ..., 0., 0., 0.],
        [0., 1., 1., ..., 0., 0., 0.],
        ...,
        [0., 1., 1., ..., 0., 0., 0.],
        [0., 1., 1., ..., 0., 0., 0.],
        [0., 1., 1., ..., 0., 0., 0.]]), (38932, 10000))

In [24]:
labels, labels.shape

(array([[0., 1.],
        [0., 1.],
        [0., 1.],
        ...,
        [0., 1.],
        [0., 1.],
        [1., 0.]], dtype=float32), (38932, 2))

In [25]:
# Creating Dense Neural Network Model
model = Sequential()
model.add(Dense(256, input_shape=(max_words,), activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
 
model.compile(loss='categorical_crossentropy',
  optimizer='adam',
  metrics=['accuracy'])
 
# Training the Model
model.fit(train_X, labels,
  batch_size=32,
  epochs=5,
  verbose=1,
  validation_split=0.1,
  shuffle=True)
 
# Save model to disk
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)
model.save_weights('model.h5')

Train on 35038 samples, validate on 3894 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Checking Performance

In the HackerEarth challenge, the test.csv file is provided and it consists of 29404 hotel reviews. We will now predict the sentiment for all of these reviews. To find the accuracy (score) of the model, one needs to upload the prediction csv file to the HackerEarth portal.

To check the performance of the “Predict the Happiness” system, the saved dictionary and the trained NN model are loaded. For each of the hotel reviews, we extract the bag-of-words features in the same way as in training. The softmax scores of the output layer are calculated by feeding the input features forward through the trained NN model. A higher score means a higher probability of that sentiment. Finally, the prediction csv file is written with User_ID and the predicted response. The Python code for performing predictions on the test data is shown below.

In [0]:
import json
import numpy as np
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json
import pandas as pd
 
def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
    return wordIndices
 
# Class names in LabelEncoder order (0 = happy, 1 = not_happy)
labels = ['happy','not_happy']
# Load the dictionary
with open('/content/dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)
 
# Load trained model
with open('/content/model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
model = model_from_json(loaded_model_json)
model.load_weights('/content/model.h5')
 
testset = pd.read_csv("/content/test.csv")
cLen = len(testset['Description'])
tokenizer = Tokenizer(num_words=10000)
 
# Predict happiness for each review in test.csv
y_pred = []
for i in range(0,cLen):
    review = testset['Description'][i]
    testArr = convert_text_to_index_array(review)
    features = tokenizer.sequences_to_matrix([testArr], mode='binary')
    pred = model.predict(features)
    # print(pred[0][np.argmax(pred)] * 100, labels[np.argmax(pred)])
    y_pred.append(labels[np.argmax(pred)])
 
# Write the results in submission csv file
raw_data = {'User_ID': testset['User_ID'],
        'Is_Response': y_pred}
df = pd.DataFrame(raw_data, columns = ['User_ID', 'Is_Response'])
df.to_csv('submission_model1.csv', sep=',',index=False)
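
The softmax scoring used in the prediction step can be illustrated in isolation. The sketch below is a plain-Python illustration, not the Keras implementation; the two raw output values are made up, and the class order is assumed to match the labels list in the prediction code.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ['happy', 'not_happy']   # assumed class order
example_logits = [2.0, 0.5]            # made-up raw outputs of the last layer

scores = softmax(example_logits)
# The predicted class is the one with the highest softmax score
predicted = class_names[scores.index(max(scores))]

print(scores, predicted)
```

The scores always sum to 1, so they can be read as the model's probability for each sentiment.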

# State-of-the-Art Text Classification using BERT model: “Predict the Happiness” Challenge

In October 2018, Google released a new language representation model called BERT, which stands for “Bidirectional Encoder Representations from Transformers”. According to the paper, it obtains new state-of-the-art results on a wide range of natural language processing tasks such as text classification, entity recognition, and question answering.

Installation

Setting up the TensorFlow-based experiment is easy. In your Python TensorFlow environment, just follow these two steps.

    Clone the BERT Github repository onto your own machine. On your terminal, type
    git clone https://github.com/google-research/bert.git
    Download the pre-trained model from the official BERT GitHub page. There are 4 types of pre-trained models.
    BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
    BERT-Base, Cased: 12-layer, 768-hidden, 12-heads , 110M parameters
    BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters

I downloaded BERT-Base, Cased for this experiment because the text in the data set is cased. The base models are also only 12 layers deep (as opposed to 24 layers for BERT-Large), so they can run on a GTX 1080Ti (11 GB VRAM). The BERT-Large models cannot run in 11 GB of GPU memory and require more (64 GB would suffice).

In [4]:
! wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip

--2020-05-12 08:21:27--  https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.216.128, 2607:f8b0:400c:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.216.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 404261442 (386M) [application/zip]
Saving to: ‘cased_L-12_H-768_A-12.zip’


2020-05-12 08:21:31 (90.1 MB/s) - ‘cased_L-12_H-768_A-12.zip’ saved [404261442/404261442]



In [5]:
! unzip '/content/cased_L-12_H-768_A-12.zip'

Archive:  /content/cased_L-12_H-768_A-12.zip
   creating: cased_L-12_H-768_A-12/
  inflating: cased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: cased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: cased_L-12_H-768_A-12/vocab.txt  
  inflating: cased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: cased_L-12_H-768_A-12/bert_config.json  


In [None]:
! git clone https://github.com/google-research/bert.git

Preparing Data for Model

We need to prepare the text data in a format that the BERT code accepts. The scripts released by Google for applying BERT expect tab-separated files in the following format.

train.tsv or dev.tsv

    an ID for the row
    the label for the row as an int (class labels: 0,1,2,3 etc)
    A column of all the same letter (weird throw away column expected by BERT)
    the text examples you want to classify

test.tsv

    an ID for the row
    the text sentences/paragraph you want to test
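
As a small illustration of the train.tsv layout described above, the sketch below writes two rows with Python's csv module and reads them back. The IDs and review texts are made up; only the column order (ID, integer label, throw-away letter, text) follows the description.

```python
import csv
import io

# Two made-up rows: row ID, integer label, throw-away letter, review text
rows = [
    ["id00001", 0, "a", "Great stay, friendly staff."],
    ["id00002", 1, "a", "Noisy room and rude reception."],
]

# Write the rows as tab-separated values without a header line
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)

# Read the text back to confirm every line has exactly four columns
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Commas inside the review text are harmless because the separator is a tab, but tabs and line breaks inside a review would break the layout and should be replaced before writing.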

The Python snippet below reads the HackerEarth training data (train.csv) and prepares it in the format BERT expects.

In [0]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from pandas import DataFrame
 
le = LabelEncoder()
 
df = pd.read_csv("/content/train.csv")
 
# Creating train and dev dataframes according to BERT
df_bert = pd.DataFrame({'user_id':df['User_ID'],
            'label':le.fit_transform(df['Is_Response']),
            'alpha':['a']*df.shape[0],
            'text':df['Description'].replace(r'\n',' ',regex=True)})
 
df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)
 
# Creating test dataframe according to BERT
df_test = pd.read_csv("/content/test.csv")
df_bert_test = pd.DataFrame({'User_ID':df_test['User_ID'],
                 'text':df_test['Description'].replace(r'\n',' ',regex=True)})
 
# Saving dataframes to .tsv format as required by BERT
import os
os.makedirs('/content/data', exist_ok=True)  # to_csv does not create missing folders
df_bert_train.to_csv('/content/data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('/content/data/dev.tsv', sep='\t', index=False, header=False)
df_bert_test.to_csv('/content/data/test.tsv', sep='\t', index=False, header=True)

In [8]:
df_bert_test.head(3)

Unnamed: 0,User_ID,text
0,id80132,Looking for a motel in close proximity to TV t...
1,id80133,Walking distance to Madison Square Garden and ...
2,id80134,Visited Seattle on business. Spent - nights in...


In [9]:
df_bert_train.head(3)

Unnamed: 0,user_id,label,alpha,text
14154,id24480,0,a,I would have to say this was nicer than what I...
26936,id37262,0,a,This Embassy Suites hotel is definitely nicer ...
33400,id43726,0,a,My husband and I spent five nights at this hot...


In [10]:
df_bert_dev.head(3)

Unnamed: 0,user_id,label,alpha,text
36036,id46362,1,a,"Terrible stay. Loud, tiny room. Incredibly ove..."
27418,id37744,0,a,My now husband and I started staying at this S...
2623,id12949,0,a,The hotel looks avergae on the outside but the...


The alpha column holds the same letter in every row. It is a throw-away column that you need to include because the BERT classifier script expects it.

Training Model using Pre-trained BERT model

If you have followed the blog post this far, half of the job is done. Just recheck the following things.

    All the .tsv files are in a folder having name “data”
    Make sure you have created a folder “bert_output”, where the fine-tuned model will be saved and the test results will be written as “test_results.tsv”
    Check that the pre-trained BERT model has been downloaded and unzipped into the folder “cased_L-12_H-768_A-12”
    Also, ensure that the paths in the command point to these locations (the command below uses absolute paths under /content)
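
The checklist can also be verified programmatically before starting a long fine-tuning run. This is a minimal sketch: the helper name missing_paths is hypothetical, and the paths assume the /content layout used in this notebook.

```python
import os

def missing_paths(paths):
    # Return the subset of paths that do not exist on disk
    return [p for p in paths if not os.path.exists(p)]

# Locations assumed from the notebook's /content layout
required = [
    '/content/data/train.tsv',
    '/content/data/dev.tsv',
    '/content/data/test.tsv',
    '/content/bert_output',
    '/content/cased_L-12_H-768_A-12/vocab.txt',
    '/content/cased_L-12_H-768_A-12/bert_config.json',
    '/content/cased_L-12_H-768_A-12/bert_model.ckpt.index',
]

for path in missing_paths(required):
    print('Missing:', path)
```

If nothing is printed, every file and folder on the list is in place.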

One can now fine-tune the downloaded pre-trained model on our data set by running the command below:

In [0]:
import os
os.chdir('/content/bert')

In [0]:
! python run_classifier.py --task_name=cola --do_train=true --do_eval=true --do_predict=True --data_dir='/content/data' --vocab_file='/content/cased_L-12_H-768_A-12/vocab.txt' --bert_config_file='/content/cased_L-12_H-768_A-12/bert_config.json' --init_checkpoint='/content/cased_L-12_H-768_A-12/bert_model.ckpt' --max_seq_length=400 --train_batch_size=8 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir='/content/bert_output' --do_lower_case=False

In [0]:
! pip install tensorflow-gpu==1.15

Preparing Results for Submission

The Python code below converts the results from the BERT model to .csv format for submission to the HackerEarth challenge.

In [0]:
df_results = pd.read_csv("/content/bert_output/test_results.tsv",sep="\t",header=None)
df_results_csv = pd.DataFrame({'User_ID':df_test['User_ID'],
                               'Is_Response':df_results.idxmax(axis=1)})
 
# Replacing index with string as required for submission
df_results_csv['Is_Response'] = df_results_csv['Is_Response'].map({0: 'happy', 1: 'not_happy'})
 
# writing into .csv
df_results_csv.to_csv('/content/data/result.csv',sep=",",index=None)
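
For readers who want to see what the conversion does row by row, here is a plain-Python sketch of the same idea. The two sample rows are made up; they follow the layout of test_results.tsv, where each line holds one tab-separated probability per class.

```python
import csv
import io

# Two made-up rows in the layout of test_results.tsv
sample = "0.91\t0.09\n0.20\t0.80\n"
class_names = ['happy', 'not_happy']   # class 0 and class 1

predictions = []
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    probabilities = [float(value) for value in row]
    # Pick the column with the highest probability, as idxmax(axis=1) does
    best = probabilities.index(max(probabilities))
    predictions.append(class_names[best])

print(predictions)
```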

In [21]:
! zip -r bert_output.zip '/content/bert_output'

  adding: content/bert_output/ (stored 0%)
  adding: content/bert_output/eval_results.txt (deflated 18%)
  adding: content/bert_output/predict.tf_record (deflated 80%)
  adding: content/bert_output/model.ckpt-14453.data-00000-of-00001 (deflated 13%)
  adding: content/bert_output/graph.pbtxt (deflated 97%)
  adding: content/bert_output/train.tf_record (deflated 80%)
  adding: content/bert_output/model.ckpt-14453.meta (deflated 92%)
  adding: content/bert_output/eval.tf_record (deflated 80%)
  adding: content/bert_output/events.out.tfevents.1589272608.2df7c7822df7 (deflated 90%)
  adding: content/bert_output/eval/ (stored 0%)
  adding: content/bert_output/eval/events.out.tfevents.1589279143.2df7c7822df7 (deflated 92%)
  adding: content/bert_output/model.ckpt-14453.index (deflated 69%)
  adding: content/bert_output/test_results.tsv (deflated 63%)
  adding: content/bert_output/.ipynb_checkpoints/ (stored 0%)
  adding: content/bert_output/checkpoint (deflated 75%)


In [20]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
! mv '/content/bert/bert_output.zip' '/content/drive/My Drive/Colab Notebooks'