![title](inn.png)

# Innoplexus Online Hiring Hackathon: Sentiment Analysis

## Problem Statement

### Sentiment Analysis for drugs/medicines
The narrative of a brand is no longer built and controlled only by the company that owns it. For this reason, companies constantly monitor blogs, forums, and other social media platforms to check the sentiment towards their own products and their competitors' products, and to learn how their brand resonates in the market. This kind of analysis is part of their post-launch market research. It is relevant for many industries, including pharma and its drugs.
 

**The challenge is that the language used in this type of content is not strictly grammatically correct. Some users use sarcasm. Others cover several topics with different sentiments in one post. Still others post comments and replies, thereby indicating their sentiment around the topic.**

Sentiment can be grouped into 3 major buckets - **Positive, Negative and Neutral.**


You are provided with data containing samples of text. This text can contain one or more drug mentions. Each row contains a unique combination of the text and the drug mention. Note that the same text can also have different sentiment for a different drug.

Given the text and drug name, the task is to predict the sentiment for texts contained in the test dataset. Given below is an example of text from the dataset:


Example:

*Stelara is still fairly new to Crohn's treatment. This is why you might not get a lot of replies. I've done some research, but most of the "time to work" answers are from Psoriasis boards. For Psoriasis, it seems to be about 4-12 weeks to reach a strong therapeutic level. The good news is, Stelara seems to be getting rave reviews from Crohn's patients. It seems to be the best med to come along since Remicade. I hope you have good success with it. My daughter was diagnosed Feb. 19/07, (13 yrs. old at the time of diagnosis), with Crohn's of the Terminal Illium. Has used Prednisone and Pentasa. Started Imuran (02/09), had an abdominal abscess (12/08). 2cm of Stricture. Started Remicade in Feb. 2014, along with 100mgs. of Imuran.*


For Stelara the above text is **positive** while for Remicade the above text is **negative**.

### Data Description
**train.csv**
Contains the labelled texts with sentiment values for a given drug.
 
|Variable|Definition|
|----|----|
|unique_hash|Unique ID|
|text|Text pertaining to the drugs|
|drug|Drug name for which the sentiment is provided|
|sentiment|(Target) 0-positive, 1-negative, 2-neutral|


**test.csv**
Contains texts with drug names for which participants are expected to predict the correct sentiment.
 

### Evaluation Metric
The metric used to evaluate the classification model is the macro F1 score.
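
As an illustration of the metric, here is a minimal pure-Python sketch of macro F1 for the three sentiment classes. The function name `macro_f1` and the toy label lists are assumptions made for this example; they are not part of the competition material.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples scores 0 here
        # (a convention chosen for this sketch).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: 0 = positive, 1 = negative, 2 = neutral
y_true = [0, 1, 2, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1, 1]
print(macro_f1(y_true, y_pred))

# Always predicting the majority class is punished by the macro average,
# because the two classes that are never predicted each contribute 0.
print(macro_f1([2, 2, 2, 0, 1], [2, 2, 2, 2, 2]))
```

Because every class counts equally, a model that only predicts the frequent neutral class gets a low score even when its accuracy looks acceptable.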
 

## Public and Private Split

The texts in the test data are further randomly divided into Public (40%) and Private (60%) data.
Your initial responses are checked and scored on the Public data.
The final rankings are based on your Private score, which is published once the competition is over.

# Approaches



# Leaderboard

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import textblob

import os
print(os.listdir("../input"))
os.environ['PYTHONHASHSEED'] = '10000'
np.random.seed(10001)
import random
import tensorflow as tf
random.seed(10002)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=6, inter_op_parallelism_threads=5)
from keras import backend

tf.set_random_seed(10003)
backend.set_session(tf.Session(graph=tf.get_default_graph(), config=session_conf))

print("Loading data...")
train = pd.read_csv("../input/av _ innoplexus hiring/train_F3WbcTw.csv")
print("Train shape:", train.shape)
test = pd.read_csv("../input/av _ innoplexus hiring/test_tOlRoBf.csv")
print("Test shape:", test.shape)

enc = OneHotEncoder(sparse=False)
enc.fit(train["sentiment"].values.reshape(-1, 1))
print("Number of classes:", enc.n_values_[0])

print("Class distribution:\n{}".format(train["sentiment"].value_counts()/train.shape[0]))

['av _ innoplexus hiring']


Using TensorFlow backend.


Loading data...
Train shape: (5279, 4)
Test shape: (2924, 3)
Number of classes: 3
Class distribution:
2    0.724569
1    0.158553
0    0.116878
Name: sentiment, dtype: float64




For examples that occur in both sets, we could directly use the labels from the train set as our prediction. The check below measures how many (text, drug) pairs of the test set also occur in the train set.

In [3]:
test.shape
730/2924
len(set(train["text"]+' '+train["drug"]).intersection(set(test["text"]+' '+test["drug"])))/test.shape[0]

0.0

### Removing duplicate text

In [4]:
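common=np.intersect1d(train.text,test.text)
# .copy() so that the later assignment to test_common['sentiment'] does not act on a view of test
test_common=test.query('text in @common').copy()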

In [5]:
train=train.query('text not in @common')

In [6]:
train.sentiment.value_counts()

2    2999
1     837
0     617
Name: sentiment, dtype: int64

In [7]:
# all_drugs=train.drug.unique()
# all_drugs

In [8]:
# train[train['drug']=='gilenya'].text.values

In [9]:
test=test.query('text not in @common')

In [10]:
test.shape,train.shape

((2081, 3), (4453, 4))

In [11]:
len(set(train["text"]).intersection(set(test["text"])))/test.shape[0]

0.0

In [12]:
len(np.intersect1d(train.text,test.text))

0

In [13]:
print("Ratio of test set examples which occur in the train set: {0:.2f}".format(len(set(train["text"]).intersection(set(test["text"])))/test.shape[0]))
# test = pd.merge(test, train[["text", "sentiment"]], on="text", how="left")

Ratio of test set examples which occur in the train set: 0.00


In [14]:
train.head()

Unnamed: 0,unique_hash,text,drug,sentiment
0,2e180be4c9214c1f5ab51fd8cc32bc80c9f612e0,Autoimmune diseases tend to come in clusters. ...,gilenya,2
1,9eba8f80e7e20f3a2f48685530748fbfa95943e4,I can completely understand why you’d want to ...,gilenya,2
2,fe809672251f6bd0d986e00380f48d047c7e7b76,Interesting that it only targets S1P-1/5 recep...,fingolimod,2
3,bd22104dfa9ec80db4099523e03fae7a52735eb6,"Very interesting, grand merci. Now I wonder wh...",ocrevus,2
4,b227688381f9b25e5b65109dd00f7f895e838249,"Hi everybody, My latest MRI results for Brain ...",gilenya,1


Let's see whether all the words in the test set occur in the train set:

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

cv1 = CountVectorizer()
cv1.fit(train["text"])

cv2 = CountVectorizer()
cv2.fit(test["text"])

print("Train Set Vocabulary Size:", len(cv1.vocabulary_))
print("Test Set Vocabulary Size:", len(cv2.vocabulary_))
print("Number of Words that occur in both:", len(set(cv1.vocabulary_.keys()).intersection(set(cv2.vocabulary_.keys()))))

Train Set Vocabulary Size: 36569
Test Set Vocabulary Size: 25257
Number of Words that occur in both: 17558
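
The three counts printed above can be turned into an overlap rate with a few lines of arithmetic. This is only a sketch over the printed numbers; it counts distinct words (vocabulary entries), not word occurrences.

```python
# Values copied from the output above
train_vocab_size = 36569
test_vocab_size = 25257
shared = 17558

# Words that appear in the test set but never in the train set
test_only = test_vocab_size - shared
print("Test-only words:", test_only)
print("Share of test vocabulary seen in train: {:.1%}".format(shared / test_vocab_size))
```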


## Data Augmentation for Text

In [16]:
from nltk import sent_tokenize
import json
random.seed(1994)
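def tokenize(text):
    '''text: a single text document; returns its list of sentences'''
    tokenized =  sent_tokenize(text)
    return tokenized

def shuffle_tokenized(text):
    '''Shuffle the sentence list in place and append a copy to the global list `shuffled`'''
    random.shuffle(text)
    newl=list(text)
    shuffled.append(newl)
    return text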


df_train=train[['text','sentiment']]

augmented = []
reps=[]
for ng_rev in df_train[df_train.sentiment==2].text:
    tok = tokenize(ng_rev)
    shuffled= [tok]
    #print(ng_rev)
    for i in range(2):
        # 2 shuffling passes; repeated orderings are filtered out below
        shuffle_tokenized(shuffled[-1])
    for k in shuffled:
        '''create new review by joining the shuffled sentences'''
        s = ' '
        new_rev = s.join(k)
        if new_rev not in augmented:
            augmented.append(new_rev)
        else:
            reps.append(new_rev)
df2=pd.DataFrame({'text':augmented,'sentiment':[2]*len(augmented)})
print(df2.shape)
df2.head()

(5669, 2)


Unnamed: 0,text,sentiment
0,Autoimmune diseases tend to come in clusters. ...,2
1,"I hope that it does work out, I really do. I c...",2
2,There so much still to do before this is convi...,2
3,Interesting that it only targets S1P-1/5 recep...,2
4,I'm very pleased that something is being devel...,2


In [17]:

augmented = []
reps=[]
for ng_rev in df_train[df_train.sentiment==1].text:
    tok = tokenize(ng_rev)
    shuffled= [tok]
    #print(ng_rev)
    for i in range(7):
        # 7 shuffling passes; repeated orderings are filtered out below
        shuffle_tokenized(shuffled[-1])
    for k in shuffled:
        '''create new review by joining the shuffled sentences'''
        s = ' '
        new_rev = s.join(k)
        if new_rev not in augmented:
            augmented.append(new_rev)
        else:
            reps.append(new_rev)
df1=pd.DataFrame({'text':augmented,'sentiment':[1]*len(augmented)})
print(df1.shape)
df1.head()

(5266, 2)


Unnamed: 0,text,sentiment
0,This could represent artifact or early axonal ...,1
1,What are the kind of symptoms from C2-C3 lesio...,1
2,What are the kind of symptoms from C2-C3 lesio...,1
3,What are the kind of symptoms from C2-C3 lesio...,1
4,This could represent artifact or early axonal ...,1


In [18]:

augmented = []
reps=[]
for ng_rev in df_train[df_train.sentiment==0].text:
    tok = tokenize(ng_rev)
    shuffled= [tok]
    #print(ng_rev)
    for i in range(9):
        # 9 shuffling passes; repeated orderings are filtered out below
        shuffle_tokenized(shuffled[-1])
    for k in shuffled:
        '''create new review by joining the shuffled sentences'''
        s = ' '
        new_rev = s.join(k)
        if new_rev not in augmented:
            augmented.append(new_rev)
        else:
            reps.append(new_rev)
df0=pd.DataFrame({'text':augmented,'sentiment':[0]*len(augmented)})
print(df0.shape)
df0.head()

(4713, 2)


Unnamed: 0,text,sentiment
0,"If you would like to talk, contact the Help Ce...",0
1,To learn more view our Understanding IBD Medic...,0
2,Humira and other biologics are very successful...,0
3,"First, I know you said that you are scared of ...",0
4,You can learn more about some of your treatmen...,0


In [19]:
df0.shape

(4713, 2)

In [20]:
df0=df0.append(df1)
df0=df0.append(df2)
df0.shape

(15648, 2)

In [21]:
train=df0

In [22]:
# # reps

# df2=augment_text(df_train[df_train.sentiment==2].text,2,2)
# df1=augment_text(df_train[df_train.sentiment==1].text,7,1)
# df0=augment_text(df_train[df_train.sentiment==0].text,9,0)
# print(df0.shape,df1.shape,df2.shape)
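
The commented-out calls above refer to an `augment_text` helper whose definition is not shown in the notebook. The following is a minimal sketch of what such a helper could look like, under stated assumptions: the signature `(texts, n_shuffles, label)` is guessed from the calls, `split_sentences` is a regex stand-in for `nltk.sent_tokenize`, and unlike the cells above this version keeps the original sentence order as one of the rows.

```python
import random
import re


def split_sentences(text):
    # Stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def augment_text(texts, n_shuffles, label, seed=1994):
    """Return (text, label) rows: each text plus up to n_shuffles
    distinct sentence-order variants of it."""
    rng = random.Random(seed)
    rows = []
    for text in texts:
        sentences = split_sentences(text)
        original = " ".join(sentences)
        seen = {original}                  # orderings already emitted for this text
        rows.append((original, label))
        for _ in range(n_shuffles):
            shuffled = sentences[:]        # copy, so the original list is untouched
            rng.shuffle(shuffled)
            candidate = " ".join(shuffled)
            if candidate not in seen:      # skip repeated orderings
                seen.add(candidate)
                rows.append((candidate, label))
    return rows


rows = augment_text(["First point. Second point! Third point?"], n_shuffles=5, label=1)
for text, label in rows:
    print(label, text)
```

Sentence shuffling keeps the bag of words of a text unchanged, so every variant carries the same label as its source text.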

In [23]:
train.head()

Unnamed: 0,text,sentiment
0,"If you would like to talk, contact the Help Ce...",0
1,To learn more view our Understanding IBD Medic...,0
2,Humira and other biologics are very successful...,0
3,"First, I know you said that you are scared of ...",0
4,You can learn more about some of your treatmen...,0


In [24]:
train[train.sentiment==2].shape,train[train.sentiment==0].shape,train[train.sentiment==1].shape

((5669, 2), (4713, 2), (5266, 2))

In [25]:
print("Class distribution:\n{}".format(train["sentiment"].value_counts()/train.shape[0]))

Class distribution:
2    0.362283
1    0.336529
0    0.301189
Name: sentiment, dtype: float64


In [26]:
train['text'].values

array(['If you would like to talk, contact the Help Center at 888-694-8872 or at info@crohnscolitisfoundation.org To reduce your level of fear it can help to learn more about your treatment option. Hi Jess Sorry to read about the challenges you are having with your health. You mentioned a lot in your post. I just want to share some info on a few of the points. To learn more view our Understanding IBD Medication brochure at: http://www.crohnscolitisfoundation.org/assets/pdfs/understanding-ibd-meds-nov.pdf . Reply posted for JessZidek. Humira and other biologics are very successful in reducing symptoms and inducing and maintain disease remission. First, I know you said that you are scared of Humira. You can learn more about some of your treatment options.',
       'To learn more view our Understanding IBD Medication brochure at: http://www.crohnscolitisfoundation.org/assets/pdfs/understanding-ibd-meds-nov.pdf . You can learn more about some of your treatment options. You mentioned a lot 

### Feature Extraction

In [27]:
#
#     df["phrase_count"] = df.groupby("drug")["text"].transform("count")
# df['drug']=pd.factorize(df['drug'])[0]

def transform(df):
#     df['drug_count']=df['text'].apply(lambda x: len(np.intersect1d(x.split(),all_drugs)))
#     df["word_count"] = df["text"].apply(lambda x: len(x.split()))
    df["has_upper"] = df["text"].apply(lambda x: x.lower() != x)
    df['upper'] = df['text'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
    df["sentence_end"] = df["text"].apply(lambda x: x.endswith("."))
    df["after_comma"] = df["text"].apply(lambda x: x.startswith(","))
    df["sentence_start"] = df["text"].apply(lambda x: "A" <= x[0] <= "Z")
    df["text"] = df["text"].apply(lambda x: x.lower())
    import string
    punctuation=string.punctuation
    df['word_count']=df['text'].apply(lambda x: len(str(x).split(" ")))
    df['char_count'] = df['text'].str.len()
    def avg_word(sentence):
        words = sentence.split()
        return (sum(len(word) for word in words)/len(words))

    df['avg_word'] = df['text'].apply(lambda x: avg_word(x))
    from nltk.corpus import stopwords
    stop = stopwords.words('english')

    df['stopwords'] = df['text'].apply(lambda x: len([x for x in x.split() if x in stop]))
    df['numerics'] = df['text'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
    
    df['word_density'] = df['char_count'] / (df['word_count']+1)
    df['punctuation_count'] = df['text'].apply(lambda x: len("".join(_ for _ in x if _ in punctuation))) 
    
    return df

train = transform(train)
test = transform(test)

# dense_features = ["phrase_count", "word_count", "has_upper", "after_comma", "sentence_start", "sentence_end",'char_count','avg_word','stopwords','numerics','word_density','punctuation_count','drug','upper']
dense_features = [ "word_count", "has_upper", "after_comma", "sentence_start", "sentence_end",'char_count','avg_word','stopwords','numerics','word_density','punctuation_count','upper']
train.groupby("sentiment")[dense_features].mean()

Unnamed: 0_level_0,word_count,has_upper,after_comma,sentence_start,sentence_end,char_count,avg_word,stopwords,numerics,word_density,punctuation_count,upper
sentiment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,354.334394,0.998939,0.0,0.924676,0.841502,2128.94356,4.821606,142.037556,4.30087,5.743732,62.471886,15.445152
1,272.493164,0.997152,0.0,0.930308,0.81428,1550.847133,4.625931,120.075389,2.698823,5.559593,43.669958,11.440562
2,310.137767,0.995414,0.000529,0.911272,0.786382,1869.66502,4.85218,122.55583,3.619862,5.75494,58.436409,13.941259


**Splitting Data into folds**

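If we split the data completely at random, we may bias our validation set, because variants of the same text can be distributed across the train and validation sets. We need to guarantee that all variants of one text are in the same fold. The original idea, `SentenceId % NUM_FOLDS`, relies on a sentence identifier that this dataset does not have; the code below uses the row index modulo `NUM_FOLDS` instead. After augmentation, consecutive rows are shuffled variants of the same text, so this assignment spreads them across folds and the validation scores are likely optimistic.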
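
A minimal sketch of a fold assignment that does keep related rows together. It assumes each augmented row carries a hypothetical `source_id` that identifies the original text it was derived from; the notebook does not keep such a column, so both the column and the toy rows are assumptions of this example.

```python
from collections import defaultdict

NUM_FOLDS = 7

# Hypothetical rows: (source_id, text variant)
rows = [
    (0, "A. B. C."), (0, "C. A. B."), (0, "B. C. A."),
    (1, "D. E."), (1, "E. D."),
    (2, "F."),
]

# Every variant of one source text gets the same fold,
# so no variant of a training text can end up in its validation fold.
fold_ids = [source_id % NUM_FOLDS for source_id, _ in rows]

# Check: each source text maps to exactly one fold
folds_per_source = defaultdict(set)
for (source_id, _), fold in zip(rows, fold_ids):
    folds_per_source[source_id].add(fold)

print(fold_ids)
print(all(len(folds) == 1 for folds in folds_per_source.values()))
```

scikit-learn's `GroupKFold` gives the same guarantee when the group labels are passed as `groups`.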

### Data Preprocessing

In [28]:
# train["drug"]%5

import re
import nltk
from nltk.corpus import stopwords
def url_to_words(raw_text):
    '''Strip URL fragments and non-alphanumeric characters, then drop English stop words.'''
    raw_text=raw_text.strip()
    no_coms=re.sub(r'\.com','',raw_text)
    no_urls=re.sub('https?://www','',no_coms)
    no_urls1=re.sub('https?://','',no_urls)
    # str has no .decode() in Python 3, so the former try/except always fell back to this value
    no_encoding = no_urls1
    letters_only = re.sub("[^a-zA-Z0-9]", " ",no_encoding) 
    words = letters_only.split()                             
    stops = set(stopwords.words('english'))  # set membership is faster than list membership
    meaningful_words = [w for w in words if w not in stops] 
    return( " ".join( meaningful_words ))


train['text']=train['text'].apply(url_to_words)
test['text']=test['text'].apply(url_to_words)


In [29]:
test.shape

(2081, 15)

In [30]:
train.sentiment.value_counts()

2    5669
1    5266
0    4713
Name: sentiment, dtype: int64

In [32]:
NUM_FOLDS = 7
train["fold_id"] = train.reset_index()['index'].apply(lambda x: x%NUM_FOLDS)
train.head()

Unnamed: 0,text,sentiment,has_upper,upper,sentence_end,after_comma,sentence_start,word_count,char_count,avg_word,stopwords,numerics,word_density,punctuation_count,fold_id
0,would like talk contact help center 888 694 88...,0,True,3,True,False,True,118,755,5.40678,61,0,6.344538,28,0
1,learn view understanding ibd medication brochu...,0,True,3,False,False,True,118,755,5.40678,61,0,6.344538,28,1
2,humira biologics successful reducing symptoms ...,0,True,3,True,False,True,118,755,5.40678,61,0,6.344538,28,2
3,first know said scared humira want share info ...,0,True,3,False,False,True,118,755,5.40678,61,0,6.344538,28,3
4,learn treatment options humira biologics succe...,0,True,3,True,False,True,118,755,5.40678,61,0,6.344538,28,4


In [33]:
# train["fold_id"].value_counts()
# train.groupby(['fold_id','sentiment']).count()
from sklearn.utils import shuffle
train=shuffle(train)

**Transfer Learning Using GLOVE Embeddings**

In [34]:
# EMBEDDING_FILE = "../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt"
EMBEDDING_DIM = 300

# all_words = set(cv1.vocabulary_.keys()).union(set(cv2.vocabulary_.keys()))

# def get_embedding():
#     embeddings_index = {}
#     f = open(EMBEDDING_FILE)
#     for line in f:
#         values = line.split()
#         word = values[0]
#         if len(values) == EMBEDDING_DIM + 1 and word in all_words:
#             coefs = np.asarray(values[1:], dtype="float32")
#             embeddings_index[word] = coefs
#     f.close()
#     return embeddings_index

# embeddings_index = get_embedding()
# print("Number of words that don't exist in GLOVE:", len(all_words - set(embeddings_index)))
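
The loading code above is commented out, so the models further down learn their embeddings from scratch. Also, the commented path points at the 100-dimensional GloVe file while `EMBEDDING_DIM` is 300, so the length check would skip every line if the code were re-enabled as is. The sketch below shows how the loaded vectors could be turned into an embedding matrix. The in-memory `glove_file`, the toy dimension, and the toy `word_index` are assumptions made to keep the example self-contained.

```python
import io

import numpy as np

EMBEDDING_DIM = 4  # toy size for this sketch; the notebook sets 300

# Stand-in for a GloVe text file: one word followed by its vector per line
glove_file = io.StringIO(
    "drug 0.1 0.2 0.3 0.4\n"
    "good 0.5 0.6 0.7 0.8\n"
)

# Hypothetical tokenizer vocabulary (word -> index, 0 is reserved for padding)
word_index = {"drug": 1, "good": 2, "stelara": 3}

embeddings_index = {}
for line in glove_file:
    values = line.split()
    if len(values) == EMBEDDING_DIM + 1:  # skip malformed or wrong-sized lines
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM), dtype="float32")
for word, idx in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:  # words missing from GloVe keep the all-zero row
        embedding_matrix[idx] = vector

print(embedding_matrix)
```

In Keras 2.x a matrix like this is commonly passed to the `Embedding` layer as `weights=[embedding_matrix]`, optionally with `trainable=False` to freeze the pretrained vectors.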

In [35]:
train.sentiment.value_counts()
# print(2999-837)

2    5669
1    5266
0    4713
Name: sentiment, dtype: int64

In [36]:
# 
train[train['sentiment']==1].shape

(5266, 15)

**Prepare the sequences for LSTM**

In [37]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 350

tokenizer = Tokenizer()
tokenizer.fit_on_texts(np.append(train["text"].values, test["text"].values))
word_index = tokenizer.word_index
seq = pad_sequences(tokenizer.texts_to_sequences(train["text"]), maxlen=MAX_SEQUENCE_LENGTH)
test_seq = pad_sequences(tokenizer.texts_to_sequences(test["text"]), maxlen=MAX_SEQUENCE_LENGTH)
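
To make the shape of `seq` concrete, here is a pure-Python emulation of what the tokenizer and `pad_sequences` do with their default settings: words are indexed by frequency starting at 1, sequences are padded with zeros on the left, and sequences that are too long keep their last `maxlen` tokens. The helper `to_padded` and the toy texts are assumptions of this sketch, and it skips the lowercasing and punctuation filtering that the Keras `Tokenizer` also performs.

```python
from collections import Counter

MAX_SEQUENCE_LENGTH = 5  # toy value; the notebook uses 350

texts = ["humira helped a lot", "humira did not help", "a"]

# Rank words by frequency; index 0 is reserved for padding
counts = Counter(word for text in texts for word in text.split())
word_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}
nb_words = len(word_index) + 1  # +1 for the padding index


def to_padded(text, maxlen):
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-maxlen:]                     # too long: keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids  # too short: left-pad with zeros


seq = [to_padded(t, MAX_SEQUENCE_LENGTH) for t in texts]
long_seq = to_padded("humira did not help a lot humira", MAX_SEQUENCE_LENGTH)

for row in seq:
    print(row)
print(long_seq)
```

With `MAX_SEQUENCE_LENGTH = 350`, texts longer than 350 tokens lose their beginning under these defaults, which matters for the longer forum posts.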

In [38]:
nb_words = len(word_index) + 1

In [39]:
nb_words

43974

**Define the Model**

### Attention

In [40]:
from keras.engine.topology import Layer
from keras import backend as K
from keras import initializers, regularizers, constraints, optimizers, layers
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim
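
The computation in `call` can be followed more easily on small arrays. Below is a NumPy sketch of the same steps (score, tanh, exponentiate, normalise over time steps, weighted sum). The shapes and random inputs are toy assumptions, masking is left out, and `1e-7` stands in for `K.epsilon()`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, step_dim, features_dim = 2, 4, 3

x = rng.normal(size=(batch, step_dim, features_dim))  # e.g. LSTM outputs per time step
W = rng.normal(size=(features_dim,))                   # one weight per feature
b = np.zeros(step_dim)                                 # one bias per time step

eij = np.tanh(x @ W + b)                       # (batch, step_dim) scores
a = np.exp(eij)
a = a / (a.sum(axis=1, keepdims=True) + 1e-7)  # attention weights over time steps
context = (x * a[..., None]).sum(axis=1)       # (batch, features_dim) weighted sum

print(a.sum(axis=1))   # each row sums to roughly 1
print(context.shape)
```

The layer therefore reduces a sequence of `step_dim` vectors to a single vector per sample, which is why its output shape is `(batch, features_dim)`.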

## Different Models

In [41]:
from keras.layers import *
from keras.models import Model
from keras.callbacks import EarlyStopping
from keras.models import Sequential

def build_model():
    embedding_layer = Embedding(nb_words,
                                300,
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    dropout = SpatialDropout1D(0.2)
    mask_layer = Masking()
    lstm_layer = LSTM(100,recurrent_dropout=0.2)
    seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype="int32")
    dense_input = Input(shape=(len(dense_features),))
    dense_vector = BatchNormalization()(dense_input)
    phrase_vector = lstm_layer(mask_layer(dropout(embedding_layer(seq_input))))
    
    feature_vector = concatenate([phrase_vector, dense_vector])
    feature_vector = Dense(64, activation="relu")(feature_vector)
#     feature_vector = Dense(20, activation="relu")(feature_vector)
    
    output = Dense(3, activation="softmax")(feature_vector)
    
    model = Model(inputs=[seq_input, dense_input], outputs=output)
    return model

def build_model_only():
    model5_CNN= Sequential()
    model5_CNN.add(Embedding(nb_words,300,input_length=MAX_SEQUENCE_LENGTH))
    model5_CNN.add(Dropout(0.2))
    model5_CNN.add(Conv1D(64,kernel_size=3,padding='same',activation='relu',strides=1))
    model5_CNN.add(GlobalMaxPooling1D())
    model5_CNN.add(Dense(128,activation='relu'))
    model5_CNN.add(Dropout(0.2))
    model5_CNN.add(Dense(3,activation='softmax'))
    model5_CNN.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    return model5_CNN

from keras.layers import Dense, Input, CuDNNLSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
def buildAtt_layer():

    inp = Input(shape=(MAX_SEQUENCE_LENGTH,))
    x = Embedding(nb_words, 300)(inp)
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)
    x = Attention(MAX_SEQUENCE_LENGTH)(x)
    x = Dense(64, activation="relu")(x)
    x = Dense(3, activation="softmax")(x)
    modelATT = Model(inputs=inp, outputs=x)
#     modelATT.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])
    modelATT.summary()
    return modelATT

**Train the Model:**

In [43]:
dense_features

['word_count',
 'has_upper',
 'after_comma',
 'sentence_start',
 'sentence_end',
 'char_count',
 'avg_word',
 'stopwords',
 'numerics',
 'word_density',
 'punctuation_count',
 'upper']

In [44]:
# test_seq
test[dense_features].head()

Unnamed: 0,word_count,has_upper,after_comma,sentence_start,sentence_end,char_count,avg_word,stopwords,numerics,word_density,punctuation_count,upper
0,10,False,False,False,False,72,6.3,3,1,6.545455,4,0
1,27,True,False,True,False,146,4.444444,15,0,5.214286,4,1
5,18,True,False,True,False,114,5.388889,7,0,6.0,4,1
6,105,True,False,True,False,559,4.333333,42,2,5.273585,26,8
7,167,True,False,True,True,881,4.281437,83,5,5.244048,19,7


In [45]:
from keras.optimizers import Adam

## Folds

In [46]:
test_preds1 = np.zeros((test.shape[0], 3))
from sklearn.metrics import f1_score
for i in range(NUM_FOLDS):
    print("FOLD", i+1)
    
    print("Splitting the data into train and validation...")
    train_seq, val_seq = seq[train["fold_id"] != i], seq[train["fold_id"] == i]
    train_dense, val_dense = train[train["fold_id"] != i][dense_features], train[train["fold_id"] == i][dense_features]
    y_train = enc.transform(train[train["fold_id"] != i]["sentiment"].values.reshape(-1, 1))
    y_val = enc.transform(train[train["fold_id"] == i]["sentiment"].values.reshape(-1, 1))
    
    print("Building the model...")
#     model = build_model_only()
    model = build_model_only()
    model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=["acc"])
    
    early_stopping = EarlyStopping(monitor="val_acc", patience=2, verbose=1)
    
    print("Training the model...")
    model.fit(train_seq, y_train, validation_data=(val_seq, y_val),
              epochs=10, batch_size=8, shuffle=True, callbacks=[early_stopping], verbose=1)
#     print(np.argmax(model.predict([val_seq, val_dense[dense_features]], batch_size=128, verbose=1),axis=1),y_val)
    # f1_score expects (y_true, y_pred)
    print('Evaluation',f1_score(train[train["fold_id"] == i]["sentiment"].values,np.argmax(model.predict(val_seq, batch_size=8, verbose=1),axis=1),average='macro'))
    print("Predicting...")
    test_preds1 += model.predict(test_seq, batch_size=8, verbose=1)
    print()
    
test_preds1 /= NUM_FOLDS

FOLD 1
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13410 samples, validate on 2238 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 00005: early stopping
Evaluation 0.9777258536766964
Predicting...

FOLD 2
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13410 samples, validate on 2238 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 00007: early stopping
Evaluation 0.9862046295710885
Predicting...

FOLD 3
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13413 samples, validate on 2235 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 00004: early stopping
Evaluation 0.9876733006228374
Predicting...

FOLD 4
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13413 samples, validate on 2235 samples
Epoch 1/10
Ep

In [47]:
test_preds2 = np.zeros((test.shape[0], 3))
from sklearn.metrics import f1_score
for i in range(NUM_FOLDS):
    print("FOLD", i+1)
    
    print("Splitting the data into train and validation...")
    train_seq, val_seq = seq[train["fold_id"] != i], seq[train["fold_id"] == i]
    train_dense, val_dense = train[train["fold_id"] != i][dense_features], train[train["fold_id"] == i][dense_features]
    y_train = enc.transform(train[train["fold_id"] != i]["sentiment"].values.reshape(-1, 1))
    y_val = enc.transform(train[train["fold_id"] == i]["sentiment"].values.reshape(-1, 1))
    
    print("Building the model...")
    model = build_model()
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["acc"])
    
    early_stopping = EarlyStopping(monitor="val_acc", patience=2, verbose=1)
    
    print("Training the model...")
    model.fit([train_seq, train_dense], y_train, validation_data=([val_seq, val_dense], y_val),
              epochs=10, batch_size=1024, shuffle=True, callbacks=[early_stopping], verbose=1)
#     print(np.argmax(model.predict([val_seq, val_dense[dense_features]], batch_size=128, verbose=1),axis=1),y_val)
    # f1_score expects (y_true, y_pred)
    print('Evaluation',f1_score(train[train["fold_id"] == i]["sentiment"].values,np.argmax(model.predict([val_seq, val_dense[dense_features]], batch_size=128, verbose=1),axis=1),average='macro'))
    print("Predicting...")
    test_preds2 += model.predict([test_seq, test[dense_features]], batch_size=1024, verbose=1)
    print()
    
test_preds2 /= NUM_FOLDS

FOLD 1
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13410 samples, validate on 2238 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Evaluation 0.9377159973285935
Predicting...

FOLD 2
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13410 samples, validate on 2238 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 00009: early stopping
Evaluation 0.9291658675665698
Predicting...

FOLD 3
Splitting the data into train and validation...
Building the model...
Training the model...
Train on 13413 samples, validate on 2235 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Evaluation 0.9356892220224338
Predicting...

FOLD 4
Splitting the data into train and validation...
Building the mode

In [48]:
test_preds3 = np.zeros((test.shape[0], 3))
from sklearn.metrics import f1_score
for i in range(NUM_FOLDS):
    print("FOLD", i+1)
    
    print("Splitting the data into train and validation...")
    train_seq, val_seq = seq[train["fold_id"] != i], seq[train["fold_id"] == i]
    train_dense, val_dense = train[train["fold_id"] != i][dense_features], train[train["fold_id"] == i][dense_features]
    y_train = enc.transform(train[train["fold_id"] != i]["sentiment"].values.reshape(-1, 1))
    y_val = enc.transform(train[train["fold_id"] == i]["sentiment"].values.reshape(-1, 1))
    
    print("Building the model...")
#     model = build_model_only()
    model = buildAtt_layer()
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    early_stopping = EarlyStopping(monitor="val_acc", patience=2, verbose=1)
    
    print("Training the model...")
    model.fit(train_seq, y_train, validation_data=(val_seq, y_val),
              epochs=10, batch_size=1024, shuffle=True, callbacks=[early_stopping], verbose=1)
#     print(np.argmax(model.predict([val_seq, val_dense[dense_features]], batch_size=128, verbose=1),axis=1),y_val)
    # f1_score expects (y_true, y_pred)
    print('Evaluation',f1_score(train[train["fold_id"] == i]["sentiment"].values,np.argmax(model.predict(val_seq, batch_size=512, verbose=1),axis=1),average='macro'))
    print("Predicting...")
    test_preds3 += model.predict(test_seq, batch_size=1024, verbose=1)
    print()
    
test_preds3 /= NUM_FOLDS

FOLD 1
Splitting the data into train and validation...
Building the model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_15 (InputLayer)        (None, 350)               0         
_________________________________________________________________
embedding_15 (Embedding)     (None, 350, 300)          13192200  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 350, 256)          440320    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 350, 128)          164864    
_________________________________________________________________
attention_1 (Attention)      (None, 128)               478       
_________________________________________________________________
dense_29 (Dense)             (None, 64)                8256      
_________________________________________________________________
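
The parameter counts in the summary above can be reproduced by hand. The sketch below assumes that `CuDNNLSTM` keeps two bias vectors per gate, which is consistent with the printed numbers; the helper name `bi_cudnn_lstm` is introduced only for this example.

```python
nb_words, emb_dim, steps = 43974, 300, 350

embedding = nb_words * emb_dim


def bi_cudnn_lstm(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and two bias vectors;
    # doubled because the layer runs in both directions.
    return 2 * 4 * (units * (input_dim + units) + 2 * units)


lstm_1 = bi_cudnn_lstm(emb_dim, 128)   # input: 300-d embeddings
lstm_2 = bi_cudnn_lstm(2 * 128, 64)    # input: 256-d output of the first layer
attention = 2 * 64 + steps             # W: one weight per feature, b: one bias per time step
dense = 2 * 64 * 64 + 64               # 128 inputs x 64 units, plus 64 biases

print(embedding, lstm_1, lstm_2, attention, dense)
```

The embedding layer alone accounts for most of the trainable parameters.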

**Making submission...**

In [49]:
test_preds=(test_preds1+test_preds2+test_preds3)/3
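
Averaging the three probability matrices is a simple form of soft voting. The toy example below shows that the averaged prediction can differ from what a single model would pick; the arrays are made-up values for illustration, not outputs of the models above.

```python
import numpy as np

# Toy class probabilities for two test rows (columns: classes 0, 1, 2)
preds_cnn = np.array([[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
preds_lstm = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
preds_att = np.array([[0.5, 0.3, 0.2], [0.3, 0.3, 0.4]])

avg = (preds_cnn + preds_lstm + preds_att) / 3
labels = avg.argmax(axis=1)

print(avg)
print(labels)  # row 0 becomes class 0 although the CNN alone would pick class 1
```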

In [50]:
print("Select the class with the highest probability as prediction...")
test["sentiment"] = test_preds.argmax(axis=1)
# test.loc[test["sentiment"].isnull(), "pred"]
test_common['sentiment']=2
sub=test[["unique_hash", "sentiment"]]
sub=sub.append(test_common[["unique_hash", "sentiment"]],ignore_index=True)
print(sub.shape)

print("Make the submission ready...")
sub["sentiment"] = sub["sentiment"].astype(int)
sub.to_csv("submissionkv10stacked.csv", index=False)
sub.sentiment.value_counts()

Select the class with the highest probability as prediction...
(2924, 2)
Make the submission ready...


2    2472
1     270
0     182
Name: sentiment, dtype: int64

In [51]:
print("Select the class with the highest probability as prediction...")
test["sentiment"] = test_preds1.argmax(axis=1)
# test.loc[test["sentiment"].isnull(), "pred"]
test_common['sentiment']=2
sub=test[["unique_hash", "sentiment"]]
sub=sub.append(test_common[["unique_hash", "sentiment"]],ignore_index=True)
print(sub.shape)

print("Make the submission ready...")
sub["sentiment"] = sub["sentiment"].astype(int)
sub.to_csv("submissionkv10CNN.csv", index=False)
sub.sentiment.value_counts()

Select the class with the highest probability as prediction...
(2924, 2)
Make the submission ready...


2    2535
1     223
0     166
Name: sentiment, dtype: int64

In [52]:
print("Select the class with the highest probability as prediction...")
test["sentiment"] = test_preds2.argmax(axis=1)
# test.loc[test["sentiment"].isnull(), "pred"]
test_common['sentiment']=2
sub=test[["unique_hash", "sentiment"]]
sub=sub.append(test_common[["unique_hash", "sentiment"]],ignore_index=True)
print(sub.shape)

print("Make the submission ready...")
sub["sentiment"] = sub["sentiment"].astype(int)
sub.to_csv("submissionkv10withFeat.csv", index=False)
sub.sentiment.value_counts()

Select the class with the highest probability as prediction...
(2924, 2)
Make the submission ready...


2    2234
1     370
0     320
Name: sentiment, dtype: int64

In [53]:
print("Select the class with the highest probability as prediction...")
test["sentiment"] = test_preds3.argmax(axis=1)
# test.loc[test["sentiment"].isnull(), "pred"]
test_common['sentiment']=2
sub=test[["unique_hash", "sentiment"]]
sub=sub.append(test_common[["unique_hash", "sentiment"]],ignore_index=True)
print(sub.shape)

print("Make the submission ready...")
sub["sentiment"] = sub["sentiment"].astype(int)
sub.to_csv("submissionkv10ATT.csv", index=False)
sub.sentiment.value_counts()

Select the class with the highest probability as prediction...
(2924, 2)
Make the submission ready...


2    2432
1     302
0     190
Name: sentiment, dtype: int64