## General information

In this kernel I'll work with data from the Movie Review Sentiment Analysis Playground Competition.

This dataset is interesting for NLP research. Sentences from the original dataset were split into separate phrases, and each phrase has a sentiment label. Many phrases are also very short, which makes classifying them quite challenging. Let's try!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from nltk.tokenize import TweetTokenizer
import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
pd.set_option('max_colwidth',400)

In [2]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, f1_score

In [3]:
# train = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/train.tsv', sep="\t")
# test = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/test.tsv', sep="\t")
# sub = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/sampleSubmission.csv', sep=",")

In [4]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
sub = pd.read_csv('../input/sample_submission.csv', sep=",")

In [5]:
train.head(10)

Unnamed: 0,unique_hash,text,drug,sentiment
0,2e180be4c9214c1f5ab51fd8cc32bc80c9f612e0,"Autoimmune diseases tend to come in clusters. As for Gilenya – if you feel good, don’t think about it, it won’t change anything but waste your time and energy. I’m taking Tysabri and feel amazing, no symptoms (other than dodgy color vision, but I’ve had it since always, so, don’t know) and I don’t know if it will last a month, a year, a decade, ive just decided to enjoy the ride, no point in w...",gilenya,2
1,9eba8f80e7e20f3a2f48685530748fbfa95943e4,"I can completely understand why you’d want to try it. But, results reported in lectures don’t always stand up to the scrutiny of peer-review during publication. There so much still to do before this is convincing. I hope that it does work out, I really do. And if you’re aware of and happy with the risks, then that’s great. I just think it’s important to present this in a balanced way, and to u...",gilenya,2
2,fe809672251f6bd0d986e00380f48d047c7e7b76,"Interesting that it only targets S1P-1/5 receptors rather than 1-5 like Fingolimod. Hoping to soon see what the AEs and SAEs were Yes. I'm not sure what this means, exactly: Quote Nine patients reported serious adverse events (2 mg: 3/29 [10.3%], 1.25 mg: 1/43 [2.3%], 0.5 mg: 4/29 [13.8%], and 0.25 mg: 1/50 [2.0%]; no serious adverse event was reported for more than 1 patient and no new safet...",fingolimod,2
3,bd22104dfa9ec80db4099523e03fae7a52735eb6,"Very interesting, grand merci. Now I wonder where lemtrada and ocrevus sales would go, if they prove anti-cd20 are induction",ocrevus,2
4,b227688381f9b25e5b65109dd00f7f895e838249,"Hi everybody, My latest MRI results for Brain and Cervical Cord are in and my next Neurologist appointment is in the next couple of weeks. There’re no new lesions in Brain/Cord and I’ve had no relapses while I was on Gilenya. This was a good sign. But there was one line in the cervical cord review that concerned me. It goes : “Lesions at C2-3 and T2 now show hypointensity on the post gadoliniu...",gilenya,1
5,a043780c757966243779bf3c0d11bf6eef721971,I can’t give you advice about Lemtrada because I chose Cladribine. Have you thought about this drug? The doctors at Barts are keen to give it to people with SPMS. You can read about it here: http://multiple-sclerosis-research.blogspot.com/2016/01/suppose-there-was-therapy-for-all.html,cladribine,2
6,be5a13376933a7f9bbf8e801c31691092f63260a,"Reply posted for JessZidek. Hi Jess Sorry to read about the challenges you are having with your health. You mentioned a lot in your post. I just want to share some info on a few of the points. First, I know you said that you are scared of Humira. Humira and other biologics are very successful in reducing symptoms and inducing and maintain disease remission. To reduce your level of fear it can ...",humira,0
7,08c3c0c702fc97d290204b37798ac62005da5626,"Well as expected my Neurologist wants me to start Tysabri, I kept saying that I wasn’t happy and he kept saying Yes you are! But I am still NOT, If Lemtrada was available here I think I would definitely go for that,but like every thing here we are way behind, it took 8 years longer than Australia to get Gilenya and Tysabri? I am taking Gilenya every second day,but my vitals are still low,white...",gilenya,2
8,8fd3d7ad80791c9343e5cf8a83bd1adf6577d516,"Why do you think that FIngolimod was such a miserable failure in progressive MS trial in humans (not animals) that was aborted by Biogen? If it is in fact stimulating neuronal gene expression, axon growth and regeneration, which is what you want in progressive patients, why do human trials fail?",fingolimod,1
9,793c5af7cc8332df17eb602247d886fbd1c80f89,"Thank you so much…I’m learning a lot here at GRACE. I should have mentioned my husband’s cancer is in his bones, liver, adrenal, in addition to lung and brain mets as I mentioned. I truly appreciate the comments on hospice as we just started hospice a few weeks ago…my insurance allows palliative care along with continued anti-cancer treatment. I only thought of hospice as end-of-life care,...",tagrisso,2


In [6]:
print('Average count of text per drug in train is {0:.0f}.'.format(train.groupby('drug')['text'].count().mean()))
print('Average count of text per drug in test is {0:.0f}.'.format(test.groupby('drug')['text'].count().mean()))

Average count of text per drug in train is 52.
Average count of text per drug in test is 31.


In [7]:
print('Number of text in train: {}. Number of drug in train: {}.'.format(train.shape[0], len(train.drug.unique())))
print('Number of text in test: {}. Number of drug in test: {}.'.format(test.shape[0], len(test.drug.unique())))

Number of text in train: 5279. Number of drug in train: 102.
Number of text in test: 2924. Number of drug in test: 95.


In [8]:
print('Average word count of text in train is {0:.0f}.'.format(np.mean(train['text'].apply(lambda x: len(x.split())))))
print('Average word count of text in test is {0:.0f}.'.format(np.mean(test['text'].apply(lambda x: len(x.split())))))

Average word count of text in train is 341.
Average word count of text in test is 397.


In [9]:
train['sentiment'].value_counts()

2    3825
1     837
0     617
Name: sentiment, dtype: int64

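We can see that sentences were split into 18-20 phrases on average, and many phrases contain each other. Sometimes one word or even one punctuation mark influences the sentiment.

Let's see, for example, the most common n-grams for positive phrases (selected as `sentiment == 0` in the code). The first cell counts 4-grams; the following cells count trigrams after stopword removal.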

In [10]:
text = ' '.join(train.loc[train.sentiment == 0, 'text'].values)
# n=4, so these are 4-grams rather than trigrams
text_fourgrams = list(ngrams(text.split(), 4))

In [11]:
Counter(text_fourgrams).most_common(50)

[(('2016', '(HealthDay', 'News)', '--'), 138),
 (('(subscription', 'or', 'payment', 'may'), 112),
 (('or', 'payment', 'may', 'be'), 112),
 (('payment', 'may', 'be', 'required)'), 112),
 (('according', 'to', 'a', 'study'), 90),
 (('to', 'a', 'study', 'published'), 90),
 (('Full', 'Text', '(subscription', 'or'), 84),
 (('Text', '(subscription', 'or', 'payment'), 84),
 (('.', 'Full', 'Text', '(subscription'), 78),
 (('a', 'study', 'published', 'online'), 65),
 (('for', 'the', 'treatment', 'of'), 65),
 (('study', 'published', 'online', 'Dec.'), 57),
 (('(HealthDay', 'News)', '--', 'For'), 33),
 (('in', 'the', 'Journal', 'of'), 31),
 (('Food', 'and', 'Drug', 'Administration'), 31),
 (('Editorial', '(subscription', 'or', 'payment'), 26),
 (('Medicine', '.', 'Full', 'Text'), 23),
 (('a', 'study', 'published', 'in'), 23),
 (('may', 'be', 'required)', 'Editorial'), 23),
 (('be', 'required)', 'Editorial', '(subscription'), 23),
 (('required)', 'Editorial', '(subscription', 'or'), 23),
 (('(Healt

In [12]:
text = ' '.join(train.loc[train.sentiment == 0, 'text'].values)
# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
text = [i for i in text.split() if i not in stop_words]
text_trigrams = list(ngrams(text, 3))
Counter(text_trigrams).most_common(30)

[(('(HealthDay', 'News)', '--'), 139),
 (('2016', '(HealthDay', 'News)'), 138),
 (('.', 'Full', 'Text'), 114),
 (('(subscription', 'payment', 'may'), 112),
 (('payment', 'may', 'required)'), 112),
 (('according', 'study', 'published'), 90),
 (('Full', 'Text', '(subscription'), 84),
 (('Text', '(subscription', 'payment'), 84),
 (('published', 'online', 'Dec.'), 80),
 (('study', 'published', 'online'), 67),
 (('News)', '--', 'For'), 33),
 (('Food', 'Drug', 'Administration'), 31),
 (('cell', 'lung', 'cancer'), 30),
 (('Editorial', '(subscription', 'payment'), 26),
 (('non-small', 'cell', 'lung'), 25),
 (('Medicine', '.', 'Full'), 23),
 (('may', 'required)', 'Editorial'), 23),
 (('required)', 'Editorial', '(subscription'), 23),
 (('News)', '--', 'The'), 22),
 (('mg', 'every', 'week'), 22),
 (('Drug', 'Administration', '(FDA)'), 18),
 (('primary', 'progressive', 'MS'), 17),
 (('40', 'mg', 'every'), 17),
 (('New', 'England', 'Journal'), 16),
 (('U.S.', 'Food', 'Drug'), 16),
 (('England', 'Jo

In [13]:
text = ' '.join(train.loc[train.sentiment == 1, 'text'].values)
# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
text = [i for i in text.split() if i not in stop_words]
text_trigrams = list(ngrams(text, 3))
Counter(text_trigrams).most_common(30)

[(("Crohn's", '&', 'Colitis'), 30),
 (('I', 'feel', 'like'), 29),
 (('I', 'think', 'I'), 19),
 (('bad', 'sometimes', 'deadly'), 17),
 (('face,', 'lips,', 'tongue,'), 17),
 (('doctor', 'right', 'away'), 17),
 (('I', 'know', 'I'), 17),
 (('I', 'don’t', 'know'), 16),
 (('report', 'side', 'effects'), 16),
 (('&', 'Colitis', 'Community'), 16),
 (('natural', 'products,', 'vitamins'), 15),
 (('side', 'effects', 'may'), 15),
 (('Recommended', 'safety', 'management'), 14),
 (('safety', 'management', 'strategies'), 14),
 (('management', 'strategies', 'recommended'), 14),
 (('This', 'medicine', 'may'), 13),
 (('Call', 'doctor', 'right'), 13),
 (('side', 'effects.', 'You'), 13),
 (('Lung', 'Cancer', ','), 13),
 (('This', 'topic', 'modified'), 12),
 (('swelling', 'face,', 'lips,'), 12),
 (('OTC,', 'natural', 'products,'), 12),
 (('lips,', 'tongue,', 'throat.'), 12),
 (('effects.', 'You', 'may'), 12),
 (('strategies', 'recommended', 'monitoring:'), 12),
 (('white', 'blood', 'cell'), 11),
 (('What', 

In [14]:
text = ' '.join(train.loc[train.sentiment == 2, 'text'].values)
# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
text = [i for i in text.split() if i not in stop_words]
text_trigrams = list(ngrams(text, 3))
Counter(text_trigrams).most_common(30)

[(('cell', 'lung', 'cancer'), 333),
 (('non-small', 'cell', 'lung'), 308),
 (('(Opens', 'new', 'window)'), 192),
 (('small', 'cell', 'lung'), 157),
 (('new', 'window)', 'Click'), 154),
 (('Cell', 'Lung', 'Cancer'), 138),
 ((',', "Crohn's", 'Disease'), 129),
 (('enhance', 'adverse/toxic', 'effect'), 124),
 (('cell', 'lung', 'cancer.'), 115),
 (('PM', 'Ulcerative', 'Colitis'), 112),
 (('(', 'NCI', 'Thesaurus'), 110),
 (('NCI', 'Thesaurus', ')'), 110),
 (('active', 'clinical', 'trials'), 109),
 (('clinical', 'trials', 'using'), 109),
 (('Check', 'active', 'clinical'), 108),
 (('trials', 'using', 'agent.'), 108),
 (('using', 'agent.', '('), 108),
 (('agent.', '(', 'NCI'), 108),
 (('Member', 'Joined', ':'), 102),
 ((',', 'Ulcerative', 'Colitis'), 93),
 (('Non-Small', 'Cell', 'Lung'), 92),
 (('doctor', 'right', 'away'), 91),
 (('every', '8', 'weeks'), 91),
 (('growth', 'factor', 'receptor'), 86),
 (('A', 'clinical', 'trial'), 85),
 (('lung', 'cancer', '(NSCLC)'), 83),
 (('Food', 'Drug', 'Adm

The results show the main problem with this dataset: there are too many common words because sentences were split into phrases. As a result, stopwords shouldn't be removed from the text.

### Thoughts on feature processing and engineering

So, we have only phrases as data. A phrase can consist of a single word, and a single punctuation mark can change the sentiment a phrase receives. The assigned sentiments can also be strange. This means several things:
- removing stopwords can be a bad idea, especially when a phrase consists of a single stopword;
- punctuation could be important, so it should be kept;
- n-grams are necessary to get the most information from the data;
- features like word count or sentence length won't be useful.
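The points in the list above can be illustrated with a minimal sketch. It assumes a hypothetical regex tokenizer, `tokenize_keep_punct`, standing in for the `TweetTokenizer` used later, and a small helper `make_ngrams`; neither is part of the original notebook.

```python
import re


def tokenize_keep_punct(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def make_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens_plain = tokenize_keep_punct("not good")
tokens_marked = tokenize_keep_punct("not good !")

# The punctuation mark becomes its own token, so it yields an extra bigram
# that a model can weight separately from ('not', 'good').
print(make_ngrams(tokens_plain, 2))
print(make_ngrams(tokens_marked, 2))
```

Because the punctuation token survives tokenization, two phrases that differ only in a final mark end up with different n-gram features.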

In [15]:
tokenizer = TweetTokenizer()

In [16]:
# remove stop words (the comparison is case-sensitive, so capitalised words such as 'We' or 'It' are kept)
train['text_orig'] = train['text']
test['text_orig'] = test['text']

# a set makes the membership test much faster than a list
stop = set(stopwords.words('english'))

train['text'] = train['text_orig'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
test['text'] = test['text_orig'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))


# train['text'] = train['text_orig'].apply(lambda x: [item for item in x if item not in stop])
# test['text'] = test['text_orig'].apply(lambda x: [item for item in x if item not in stop])

In [17]:
# #remove stop words
# test_filt = test[0:10].copy()
# test_filt['text_orig'] = test_filt['text']

# stop = stopwords.words('english')

# test_filt['text'] = test_filt['text_orig'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

# # test['text'] = test['text_orig'].apply(lambda x: [item for item in x if item not in stop])

# # train['text'] = [i for i in train['text_orig'].str.split() if i not in stopwords.words('english')]
# # test['text'] = [i for i in test['text_orig'].str.split() if i not in stopwords.words('english')]

In [18]:
print(test['text_orig'].tail())
test['text'].tail()

2919    Reply posted for Hippopostrous. We are sorry to read about your daughter’s potential Crohn’s Disease diagnosis. Crohn’s disease can be managed and many patients live well with the disease. It is important to learn all you can about the disease. A good place to start is with A Guide for Parents brochure http://www.crohnscolitisfoundation.org/assets/pdfs/parent_guide_final_052016.pdf and Underst...
2920    Aw Lorraine That's crap. What a shame you've had to stop Tysabri. I'm not in exactly the same position as you, but not currently on a DMD either. Tysabri gave me raised liver enzymes so I had to stop taking it. Then Tecfidera (which I presume is the other drug you can't take because of your JCV status) gave me depleted lymphocytes so I had to stop that too. I've already had bad side effects fr...
2921    jskozio14\n That sounds like nonsense to me.  Experimental?  Baloney!\n In March of 2105 Opdivo (Nivolumab) was FDA approved for use against squamous NSCLC and later in the yea

2919    Reply posted Hippopostrous. We sorry read daughter’s potential Crohn’s Disease diagnosis. Crohn’s disease managed many patients live well disease. It important learn disease. A good place start A Guide Parents brochure http://www.crohnscolitisfoundation.org/assets/pdfs/parent_guide_final_052016.pdf Understanding Medication brochure http://www.crohnscolitisfoundation.org/assets/pdfs/understandi...
2920    Aw Lorraine That's crap. What shame stop Tysabri. I'm exactly position you, currently DMD either. Tysabri gave raised liver enzymes I stop taking it. Then Tecfidera (which I presume drug can't take JCV status) gave depleted lymphocytes I stop too. I've already bad side effects Avonex failed copaxone (ie started relapsing again). Given history, fact I'm disabled MS 19 years, I think another DMD...
2921    jskozio14 That sounds like nonsense me. Experimental? Baloney! In March 2105 Opdivo (Nivolumab) FDA approved use squamous NSCLC later year (maybe October) Adenocaricnoma NSCLC.
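The filtered output above still contains words such as "We", "It", and "A". A likely reason is that the comparison is case-sensitive while the NLTK stopword list is lowercase. A minimal sketch of the effect, using a small hand-written set `toy_stop` as a stand-in for the NLTK list (an assumption made to keep the example self-contained; `remove_stop` is a hypothetical helper):

```python
# Hypothetical, tiny stand-in for stopwords.words('english')
toy_stop = {"we", "are", "to", "it", "is", "a", "about", "your"}


def remove_stop(text, stop, lowercase=False):
    """Drop whitespace-separated words found in `stop`; optionally compare in lowercase."""
    kept = []
    for word in text.split():
        key = word.lower() if lowercase else word
        if key not in stop:
            kept.append(word)
    return " ".join(kept)


sample = "We are sorry to read about your diagnosis"
case_sensitive = remove_stop(sample, toy_stop)
case_insensitive = remove_stop(sample, toy_stop, lowercase=True)
print(case_sensitive)
print(case_insensitive)
```

Whether lowercasing before the lookup is desirable depends on whether capitalised stopwords carry any signal for this task.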

In [19]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3), tokenizer=tokenizer.tokenize)
full_text = list(train['text'].values) + list(test['text'].values)
vectorizer.fit(full_text)
train_vectorized = vectorizer.transform(train['text'])
test_vectorized = vectorizer.transform(test['text'])

In [20]:
train_vectorized[0:2]

<2x1944141 sparse matrix of type '<class 'numpy.float64'>'
	with 610 stored elements in Compressed Sparse Row format>
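The matrix above has 1,944,141 columns because `ngram_range=(1, 3)` adds every distinct unigram, bigram, and trigram to the vocabulary. A rough sketch of that growth on a toy corpus, using plain whitespace tokens rather than `TweetTokenizer` (the corpus `toy_docs` and the helper `vocab_size` are illustrative assumptions):

```python
def vocab_size(docs, max_n):
    """Count distinct n-grams for n = 1..max_n over whitespace-tokenized documents."""
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(tuple(tokens[i:i + n]))
    return len(vocab)


toy_docs = [
    "the drug works well",
    "the drug works badly",
    "no side effects so far",
]

# Vocabulary size when allowing n-grams up to length 1, 2, and 3
sizes = {n: vocab_size(toy_docs, n) for n in (1, 2, 3)}
print(sizes)
```

On real text most higher-order n-grams occur only once, which is why the feature count grows so quickly relative to the 5279 training rows.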

In [21]:
y = train['sentiment']

In [22]:
logreg = LogisticRegression()
ovr = OneVsRestClassifier(logreg)

In [23]:
%%time
ovr.fit(train_vectorized, y)



CPU times: user 11.9 s, sys: 128 ms, total: 12.1 s
Wall time: 9.02 s


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='warn', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [24]:
scores = cross_val_score(ovr, train_vectorized, y, scoring='accuracy', n_jobs=-1, cv=5)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean accuracy 72.46%, std 0.07.


In [25]:
scores = cross_val_score(ovr, train_vectorized, y, scoring='f1_macro', n_jobs=-1, cv=5)
print('Cross-validation mean f1 score {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean f1 score 28.01%, std 0.02.
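Both scores above are worth comparing with a trivial baseline. Using only the class counts printed earlier (617, 837, 3825), the sketch below computes the accuracy and macro F1 of a classifier that always predicts class 2. The helper name is illustrative. The results agree with the cross-validation scores above to two decimals, which suggests, but does not prove, that the logistic regression predicts the majority class almost everywhere.

```python
# Class counts taken from train['sentiment'].value_counts() above
counts = {0: 617, 1: 837, 2: 3825}
total = sum(counts.values())
majority = max(counts, key=counts.get)

# Accuracy of always predicting the majority class
accuracy = counts[majority] / total


def f1_always_majority(label):
    """Per-class F1 when every sample is predicted as the majority class."""
    if label != majority:
        return 0.0  # the class is never predicted, so it has no true positives
    precision = counts[label] / total  # every sample is predicted as this class
    recall = 1.0                       # every true member is found
    return 2 * precision * recall / (precision + recall)


macro_f1 = sum(f1_always_majority(c) for c in counts) / len(counts)
print('Majority baseline accuracy {:.2f}%, macro F1 {:.2f}%.'.format(accuracy * 100, macro_f1 * 100))
```

This is why macro F1 is the more informative metric here: accuracy alone cannot distinguish a useful model from one that ignores the two rarer classes.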


In [26]:
%%time
svc = LinearSVC(dual=False)
scores = cross_val_score(svc, train_vectorized, y, scoring='accuracy', n_jobs=-1, cv=3)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean accuracy 72.65%, std 0.32.
CPU times: user 32 ms, sys: 24 ms, total: 56 ms
Wall time: 15.2 s


In [27]:
ovr.fit(train_vectorized, y);
svc.fit(train_vectorized, y);



In [28]:
import time
def runlgb(ispermutefeats,train,test,target,param,cur_features,score_function=None):

    
    oof = np.zeros((train.shape[0],3))
    predictions = np.zeros((test.shape[0],3))
    start = time.time()
    valid_scores =[]

    # n_splits is not a parameter: it is read from the notebook's global scope
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=4590)
    indices = folds.split(train, target.values)
        
    for fold_, (trn_idx, val_idx) in enumerate(indices):
        print()
        print("fold n°{}".format(fold_))

        tr = train[trn_idx]
        val = train[val_idx]
        y_val = target.iloc[val_idx]
        y_tr = target.iloc[trn_idx]
        
        trn_data = lgb.Dataset(tr, label=y_tr)#,, categorical_feature=categorical_feats)
        val_data = lgb.Dataset(val, label=y_val)#,, categorical_feature=categorical_feats)
        
        num_round = param['n_estimators']
        clf = lgb.train(param, trn_data, num_round, valid_sets = [val_data], verbose_eval=100, 
                        early_stopping_rounds = 200)


        oof[val_idx,:] = clf.predict(val, num_iteration=clf.best_iteration)

        valid_scores+=[clf.best_score['valid_0'][param['metric']]]
        predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

    print('valid scores log loss:',valid_scores)
#     pred_labels = oof.reshape(3,-1).argmax(axis=0)
# #     pred_labels = oof.reshape(len(np.unique(target)),-1).argmax(axis=0)
#     print(oof[0:10,:])
#     print(pred_labels[0:10])
#     print(oof.reshape(3,-1)[0:10,:])
#     print(oof.reshape(-1,3)[0:10,:])
    
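    # Per-class factors: dividing each class probability by its factor boosts
    # the rarer classes 0 and 1 relative to class 2 before taking the argmax.
    factors = [0.1304, 0.177, 0.8088]
    oof_factored = oof / factors
    oof_labels = oof_factored.argmax(axis=1)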
    print(oof_factored[0:10,:])
    print(oof_labels[0:10])
    
    print("CV F1 score: {:<8.5f}".format(f1_score(target, oof_labels,average='macro')))

    predictions_factored =  predictions / factors
    test_labels = predictions_factored.argmax(axis=1)
    print()
    print("TEST METRICS")
    print(predictions_factored[0:10,:])
    print(test_labels[0:10])
    
    return predictions,oof,oof_labels,test_labels
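The factor step at the end of `runlgb` can be seen in isolation. A minimal sketch with made-up probabilities (the array `probs` is hypothetical; the factors are those used in the function):

```python
import numpy as np

factors = np.array([0.1304, 0.177, 0.8088])

# Hypothetical predicted class probabilities for two samples
probs = np.array([
    [0.20, 0.20, 0.60],
    [0.05, 0.10, 0.85],
])

# Plain argmax picks the class with the highest raw probability
plain_labels = probs.argmax(axis=1)

# Broadcasting divides each column by its factor; the large factor for
# class 2 means a sample must be predicted with high confidence to stay there.
rescaled_labels = (probs / factors).argmax(axis=1)

print(plain_labels, rescaled_labels)
```

Only the first sample changes label: its class 2 probability is not high enough to survive the rescaling, while the second sample's is.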

In [29]:
target = train['sentiment']

In [30]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 30, 
         'objective':'multiclass',
         'num_class' : 3,
         'max_depth': -1,
         'learning_rate': 0.1,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'multi_logloss',
         "lambda_l1": 0.1,
         "verbosity": -1,
         "nthread": 4,
         'n_estimators' : 1000,
         "random_state": 4590}

In [31]:
%%time
n_splits=5
features =['text']

predictions_1,oof_1,pred_labels, test_labels = \
        runlgb(False,train_vectorized,test_vectorized,target,param,features,score_function=None)



fold n°0




Training until validation scores don't improve for 200 rounds.
[100]	valid_0's multi_logloss: 0.729865
[200]	valid_0's multi_logloss: 0.828237
Early stopping, best iteration is:
[46]	valid_0's multi_logloss: 0.700092

fold n°1
Training until validation scores don't improve for 200 rounds.
[100]	valid_0's multi_logloss: 0.752534
[200]	valid_0's multi_logloss: 0.858207
Early stopping, best iteration is:
[28]	valid_0's multi_logloss: 0.711047

fold n°2
Training until validation scores don't improve for 200 rounds.
[100]	valid_0's multi_logloss: 0.701654
[200]	valid_0's multi_logloss: 0.792456
Early stopping, best iteration is:
[51]	valid_0's multi_logloss: 0.678314

fold n°3
Training until validation scores don't improve for 200 rounds.
[100]	valid_0's multi_logloss: 0.737587
[200]	valid_0's multi_logloss: 0.833955
Early stopping, best iteration is:
[40]	valid_0's multi_logloss: 0.700706

fold n°4
Training until validation scores don't improve for 200 rounds.
[100]	valid_0's multi_logloss
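In the four folds whose logs are complete, the validation loss reaches its minimum early (best iterations 28 to 51) and then rises, so training stops 200 rounds after the best iteration. The sketch below mimics that rule on a hypothetical loss curve; `best_iteration_with_patience` is an illustrative helper, not LightGBM's implementation.

```python
def best_iteration_with_patience(losses, patience):
    """Return (best_iteration, best_loss, stopped_at) for a 1-indexed loss curve.

    Stops once `patience` iterations have passed without a new minimum.
    """
    best_iter, best_loss = 0, float('inf')
    for i, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, best_loss, i
    return best_iter, best_loss, len(losses)


# Hypothetical validation losses: improving, then overfitting
toy_losses = [0.90, 0.80, 0.72, 0.70, 0.71, 0.73, 0.76, 0.80]
best_iter, best_loss, stopped_at = best_iteration_with_patience(toy_losses, patience=3)
print(best_iter, best_loss, stopped_at)
```

With a learning rate of 0.1 and best iterations below 60, a smaller patience or a lower learning rate might be worth trying, though that is a guess rather than something the logs establish.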

In [32]:
print(target.shape)

(5279,)


In [33]:
# print(oof_1[0:10,:])
# print(oof_1.reshape(3,-1)[0:10,:])
# print(oof_1.reshape(3,-1).shape)
# pred_labels_new = oof_1.argmax(axis=1)
# print(pred_labels_new[1000:1020])
# print("CV F1 score: {:<8.5f}".format(f1_score(target, pred_labels_new,average='macro')))

# pred_labels_new[pred_labels_new==2].shape

In [34]:
# factors  =[0.1304,0.177,0.8088]
# oof_2 =  oof_1 / factors
# print(oof_2[0:10])

# pred_labels_new = oof_2.argmax(axis=1)
# print(pred_labels_new[1000:1020])
# print("CV F1 score: {:<8.5f}".format(f1_score(target, pred_labels_new,average='macro')))

# pred_labels_new[pred_labels_new==2].shape


In [35]:
sub['sentiment'] = test_labels
sub.to_csv("submission.csv", index=False)

In [36]:
# scores = cross_val_score(ovr, train_vectorized, y, scoring='f1_macro', n_jobs=-1, cv=5)
# print('Cross-validation mean f1 score {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

## Deep learning
Now let's try deep learning (DL). With several stacked layers, a neural network may work better for text classification than the linear and tree-based models above. I use an architecture similar to those used in the toxic comment competition.

In [37]:
# from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences
# from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Conv1D, GRU, CuDNNGRU, CuDNNLSTM, BatchNormalization
# from keras.layers import Bidirectional, GlobalMaxPool1D, MaxPooling1D, Add, Flatten
# from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
# from keras.models import Model, load_model
# from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
# from keras import backend as K
# from keras.engine import InputSpec, Layer
# from keras.optimizers import Adam

# from keras.callbacks import ModelCheckpoint, TensorBoard, Callback, EarlyStopping

In [38]:
# tk = Tokenizer(lower = True, filters='')
# tk.fit_on_texts(full_text)

In [39]:
# train_tokenized = tk.texts_to_sequences(train['text'])
# test_tokenized = tk.texts_to_sequences(test['text'])

In [40]:
# max_len = 50
# X_train = pad_sequences(train_tokenized, maxlen = max_len)
# X_test = pad_sequences(test_tokenized, maxlen = max_len)
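With `max_len = 50` and an average of several hundred words per text (see the word counts above), most texts would be truncated. As far as I know, Keras `pad_sequences` by default pads and truncates at the start of a sequence, so the last `maxlen` tokens are kept. A NumPy sketch of that default behaviour (`pad_pre` is a hypothetical stand-in, not the Keras function):

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]          # truncate from the front
        out[row, -len(kept):] = kept  # right-align, padding on the left
    return out


padded = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(padded)
```

For long forum posts it may be worth checking whether the beginning or the end of a text is more informative before settling on a truncation side.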

In [41]:
embedding_path = "../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec"

In [42]:
embed_size = 300
max_features = 30000

In [43]:
# def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
# embedding_index = dict(get_coefs(*o.strip().split(" ")) for o in open(embedding_path))

# word_index = tk.word_index
# nb_words = min(max_features, len(word_index))
# embedding_matrix = np.zeros((nb_words + 1, embed_size))
# for word, i in word_index.items():
#     if i >= max_features: continue
#     embedding_vector = embedding_index.get(word)
#     if embedding_vector is not None: embedding_matrix[i] = embedding_vector
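The commented-out cell above builds a matrix whose row `i` holds the pretrained vector of the word with index `i`; words without a pretrained vector keep a zero row. A small sketch of the same loop with toy vectors (all words, indices, and values here are hypothetical). The matrix has `nb_words + 1` rows because the tokenizer's word indices start at 1 and index 0 is left for padding.

```python
import numpy as np

embed_size = 3
max_features = 4

# Hypothetical tokenizer vocabulary (indices start at 1) and pretrained vectors
word_index = {"drug": 1, "good": 2, "zzz": 3, "rare": 4, "rarer": 5}
embedding_index = {
    "drug": np.array([1.0, 1.0, 1.0], dtype="float32"),
    "good": np.array([2.0, 2.0, 2.0], dtype="float32"),
    "rarer": np.array([5.0, 5.0, 5.0], dtype="float32"),
}

nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, embed_size), dtype="float32")
for word, i in word_index.items():
    if i >= max_features:
        continue  # word is outside the allowed vocabulary size
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # unknown words keep their zero row

print(embedding_matrix)
```

The first dimension of this matrix is what the `Embedding` layer's input size has to match.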

In [44]:
# from sklearn.preprocessing import OneHotEncoder
# ohe = OneHotEncoder(sparse=False)
# y_ohe = ohe.fit_transform(y.values.reshape(-1, 1))

In [45]:
# def build_model1(lr=0.0, lr_d=0.0, units=0, spatial_dr=0.0, kernel_size1=3, kernel_size2=2, dense_units=128, dr=0.1, conv_size=32):
#     file_path = "best_model.hdf5"
#     check_point = ModelCheckpoint(file_path, monitor = "val_loss", verbose = 1,
#                                   save_best_only = True, mode = "min")
#     early_stop = EarlyStopping(monitor = "val_loss", mode = "min", patience = 3)
    
#     inp = Input(shape = (max_len,))
#     x = Embedding(19479, embed_size, weights = [embedding_matrix], trainable = False)(inp)
#     x1 = SpatialDropout1D(spatial_dr)(x)

#     x_gru = Bidirectional(CuDNNGRU(units, return_sequences = True))(x1)
#     x1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_gru)
#     avg_pool1_gru = GlobalAveragePooling1D()(x1)
#     max_pool1_gru = GlobalMaxPooling1D()(x1)
    
#     x3 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_gru)
#     avg_pool3_gru = GlobalAveragePooling1D()(x3)
#     max_pool3_gru = GlobalMaxPooling1D()(x3)
    
#     x_lstm = Bidirectional(CuDNNLSTM(units, return_sequences = True))(x1)
#     x1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_lstm)
#     avg_pool1_lstm = GlobalAveragePooling1D()(x1)
#     max_pool1_lstm = GlobalMaxPooling1D()(x1)
    
#     x3 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_lstm)
#     avg_pool3_lstm = GlobalAveragePooling1D()(x3)
#     max_pool3_lstm = GlobalMaxPooling1D()(x3)
    
    
#     x = concatenate([avg_pool1_gru, max_pool1_gru, avg_pool3_gru, max_pool3_gru,
#                     avg_pool1_lstm, max_pool1_lstm, avg_pool3_lstm, max_pool3_lstm])
#     x = BatchNormalization()(x)
#     x = Dropout(dr)(Dense(dense_units, activation='relu') (x))
#     x = BatchNormalization()(x)
#     x = Dropout(dr)(Dense(int(dense_units / 2), activation='relu') (x))
#     x = Dense(3, activation = "sigmoid")(x)  # 3 sentiment classes in this dataset
#     model = Model(inputs = inp, outputs = x)
#     model.compile(loss = "binary_crossentropy", optimizer = Adam(lr = lr, decay = lr_d), metrics = ["accuracy"])
#     history = model.fit(X_train, y_ohe, batch_size = 128, epochs = 20, validation_split=0.1, 
#                         verbose = 1, callbacks = [check_point, early_stop])
#     model = load_model(file_path)
#     return model

An attempt at an ensemble:

In [46]:
# model1 = build_model1(lr = 1e-3, lr_d = 1e-10, units = 64, spatial_dr = 0.3, kernel_size1=3, kernel_size2=2, dense_units=32, dr=0.1, conv_size=32)

In [47]:
# model2 = build_model1(lr = 1e-3, lr_d = 1e-10, units = 128, spatial_dr = 0.5, kernel_size1=3, kernel_size2=2, dense_units=64, dr=0.2, conv_size=32)

In [48]:
# def build_model2(lr=0.0, lr_d=0.0, units=0, spatial_dr=0.0, kernel_size1=3, kernel_size2=2, dense_units=128, dr=0.1, conv_size=32):
#     file_path = "best_model.hdf5"
#     check_point = ModelCheckpoint(file_path, monitor = "val_loss", verbose = 1,
#                                   save_best_only = True, mode = "min")
#     early_stop = EarlyStopping(monitor = "val_loss", mode = "min", patience = 3)

#     inp = Input(shape = (max_len,))
#     x = Embedding(19479, embed_size, weights = [embedding_matrix], trainable = False)(inp)
#     x1 = SpatialDropout1D(spatial_dr)(x)

#     x_gru = Bidirectional(CuDNNGRU(units, return_sequences = True))(x1)
#     x_lstm = Bidirectional(CuDNNLSTM(units, return_sequences = True))(x1)
    
#     x_conv1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_gru)
#     avg_pool1_gru = GlobalAveragePooling1D()(x_conv1)
#     max_pool1_gru = GlobalMaxPooling1D()(x_conv1)
    
#     x_conv2 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_gru)
#     avg_pool2_gru = GlobalAveragePooling1D()(x_conv2)
#     max_pool2_gru = GlobalMaxPooling1D()(x_conv2)
    
    
#     x_conv3 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_lstm)
#     avg_pool1_lstm = GlobalAveragePooling1D()(x_conv3)
#     max_pool1_lstm = GlobalMaxPooling1D()(x_conv3)
    
#     x_conv4 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_lstm)
#     avg_pool2_lstm = GlobalAveragePooling1D()(x_conv4)
#     max_pool2_lstm = GlobalMaxPooling1D()(x_conv4)
    
    
#     x = concatenate([avg_pool1_gru, max_pool1_gru, avg_pool2_gru, max_pool2_gru,
#                     avg_pool1_lstm, max_pool1_lstm, avg_pool2_lstm, max_pool2_lstm])
#     x = BatchNormalization()(x)
#     x = Dropout(dr)(Dense(dense_units, activation='relu') (x))
#     x = BatchNormalization()(x)
#     x = Dropout(dr)(Dense(int(dense_units / 2), activation='relu') (x))
#     x = Dense(3, activation = "sigmoid")(x)  # 3 sentiment classes in this dataset
#     model = Model(inputs = inp, outputs = x)
#     model.compile(loss = "binary_crossentropy", optimizer = Adam(lr = lr, decay = lr_d), metrics = ["accuracy"])
#     history = model.fit(X_train, y_ohe, batch_size = 128, epochs = 20, validation_split=0.1, 
#                         verbose = 1, callbacks = [check_point, early_stop])
#     model = load_model(file_path)
#     return model

In [49]:
# model3 = build_model2(lr = 1e-4, lr_d = 0, units = 64, spatial_dr = 0.5, kernel_size1=4, kernel_size2=3, dense_units=32, dr=0.1, conv_size=32)

In [50]:
# model4 = build_model2(lr = 1e-3, lr_d = 0, units = 64, spatial_dr = 0.5, kernel_size1=3, kernel_size2=3, dense_units=64, dr=0.3, conv_size=32)

In [51]:
# model5 = build_model2(lr = 1e-3, lr_d = 1e-7, units = 64, spatial_dr = 0.3, kernel_size1=3, kernel_size2=3, dense_units=64, dr=0.4, conv_size=64)

In [52]:
# pred1 = model1.predict(X_test, batch_size = 1024, verbose = 1)
# pred = pred1
# pred2 = model2.predict(X_test, batch_size = 1024, verbose = 1)
# pred += pred2
# pred3 = model3.predict(X_test, batch_size = 1024, verbose = 1)
# pred += pred3
# pred4 = model4.predict(X_test, batch_size = 1024, verbose = 1)
# pred += pred4
# pred5 = model5.predict(X_test, batch_size = 1024, verbose = 1)
# pred += pred5

In [53]:
# predictions = np.argmax(pred, axis=1).astype(int)
# sub['sentiment'] = predictions
# sub.to_csv("blend.csv", index=False)
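The blend above sums the class probabilities of all models and takes the argmax. A minimal NumPy sketch with two hypothetical prediction arrays (three classes, as in this dataset). It accumulates into a fresh array, because `pred = pred1` followed by `pred += pred2` also modifies `pred1` in place.

```python
import numpy as np

# Hypothetical class probabilities from two models for two samples
pred_a = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
pred_b = np.array([[0.1, 0.2, 0.7],
                   [0.1, 0.6, 0.3]])

# Accumulate into a new array so the individual predictions stay untouched
blend = np.zeros_like(pred_a)
for model_pred in (pred_a, pred_b):
    blend += model_pred

# Summing and averaging give the same argmax, so no division is needed
blend_labels = blend.argmax(axis=1)
print(blend_labels)
```

In the first sample the models disagree and the more confident one wins; in the second they agree.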