# `wisesight-sentiment` Kaggle Competition

This notebook details the steps taken to compete in the [WISESIGHT Sentiment Analysis](https://www.kaggle.com/c/wisesight-sentiment/) competition. The competition metric is overall accuracy across the `neg`ative, `pos`itive, `neu`tral and `q`uestion classes.

Our optimal strategy was:
1. Train a logistic regression model (L2 regularization; C=2.0) with tf-idf features (minimum frequency = 20) and predict on the test set. Also output probabilities for each class of the test set.

2. Combine the training set with the test set, labeled by the previous logistic regression model, to create the augmented set.

3. Finetune a ULMFit language model (minimum frequency = 2) on all available data with the following hyperparameters:

```
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False, tie_weights=True, out_bias=True,
             output_p=0.25, hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
trn_args = dict(drop_mult=1., clip=0.12, alpha=2, beta=1)
```

4. Train a ULMFit classification model with the augmented set (minimum frequency = 20). Output probabilities for each class of the test set. The hyperparameters are as follows:

```
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False,
             output_p=0.4, hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5)
trn_args = dict(bptt=70, drop_mult=0.7, alpha=2, beta=1, max_len=500)
```

5. Average the probabilities output by the models in steps 1 and 4, and predict the class with the highest average probability.

At every step, we first trained with an 85/15 train-validation split to find a decent set of hyperparameters, then retrained with a 95/5 split before proceeding to the next step.

The results for Logistic Regression, FastText, ULMFit, ULMFit with semi-supervised data are as follows:

| Model                                    | Public Accuracy | Private Accuracy |
|------------------------------------------|-----------------|------------------|
| Logistic Regression                      | 0.72781         | 0.7499           |
| FastText                                 | 0.63144         | 0.6131           |
| ULMFit                                   | 0.71259         | 0.74194          |
| ULMFit Semi-supervised                   | 0.73119         | 0.75859          |
| ULMFit Semi-supervised Repeated One Time | **0.73372**     | **0.75968**      |
| [USE](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) | 0.63987\* |  |

\* Done after the competition with a test set that was cleaned from 3946 rows to 2674 rows.


Things that we tried and did not help:

* Rules based on error analysis
* Sub-model to predict positive class out of those that we predicted as neutral (since `true label = pos / predicted label = neutral` is by far the largest error group)
* SVD to decompose tf-idf features
* Adding average/sum embeddings of finetuned language model to features
* Sparse features and multi-layer perceptrons
* Removing duplicated rows
* Over/undersampling
* Randomly initialized bi-directional AWD-LSTM with "cleaner" processing rules
* Generate fake samples and retrain with ULMFit

In [None]:
#uncomment if running from colab
# !wget https://github.com/PyThaiNLP/wisesight-sentiment/archive/master.zip; unzip master.zip
# !mv wisesight-sentiment-master/kaggle-competition/* .
# !pip install tensorflow_text
# !pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
# !pip install emoji
# !ls

In [1]:
import pandas as pd
import numpy as np
from pythainlp import word_tokenize
from tqdm import tqdm_notebook
import re
import emoji

#viz
from plotnine import *
import matplotlib.pyplot as plt
import seaborn as sns

## Text Processor for Logistic Regression

In [2]:
def replace_url(text):
    URL_PATTERN = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw
)\b/?(?!@)))"""
    return re.sub(URL_PATTERN, 'xxurl', text)

def replace_rep(text):
    def _replace_rep(m):
        c,cc = m.groups()
        return f'{c}xxrep'
    re_rep = re.compile(r'(\S)(\1{2,})')
    return re_rep.sub(_replace_rep, text)

def ungroup_emoji(toks):
    res = []
    for tok in toks:
        if emoji.emoji_count(tok) == len(tok):
            for char in tok:
                res.append(char)
        else:
            res.append(tok)
    return res

def process_text(text):
    #pre rules
    res = text.lower().strip()
    res = replace_url(res)
    res = replace_rep(res)
    
    #tokenize
    res = [word for word in word_tokenize(res) if word and not re.search(pattern=r"\s+", string=word)]
    
    #post rules
    res = ungroup_emoji(res)
    
    return res
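
The pre-tokenization rules above can be tried in isolation. The following is a minimal sketch, not the notebook's full processor: `replace_url_simple` is a hypothetical, much simpler stand-in for the long URL pattern (it assumes only `http`/`https` links), and tokenization and emoji ungrouping are left out.

```python
import re

def replace_rep(text):
    """Collapse a character repeated 3+ times into `<char>xxrep`."""
    return re.sub(r'(\S)(\1{2,})', lambda m: f'{m.group(1)}xxrep', text)

def replace_url_simple(text):
    """Simplified stand-in for the notebook's URL rule (assumption: http/https links only)."""
    return re.sub(r'https?://\S+', 'xxurl', text)

def pre_rules(text):
    # same order as process_text: lowercase/strip, URLs, then repetitions
    return replace_rep(replace_url_simple(text.lower().strip()))

examples = ['Soooo good', '5555', 'good', 'see https://example.com now']
for text in examples:
    print(repr(text), '->', repr(pre_rules(text)))
```

Note that a doubled character (as in `good`) is left alone; only runs of three or more are collapsed.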

## Process Text Files to CSVs

In [6]:
with open('train.txt') as f:
    texts = [line.strip() for line in f.readlines()]

with open('train_label.txt') as f:
    categories = [line.strip() for line in f.readlines()]

all_df = pd.DataFrame({'category':categories, 'texts':texts})
all_df.to_csv('all_df.csv',index=False)
all_df.shape

(24063, 2)

In [7]:
with open('test.txt') as f:
    texts = [line.strip() for line in f.readlines()]

test_df = pd.DataFrame({'category':'test', 'texts':texts})
test_df.to_csv('test_df.csv',index=False)
test_df.shape

(2674, 2)

## Load Data

In [8]:
all_df = pd.read_csv('all_df.csv')
test_df = pd.read_csv('test_df.csv')

all_df['processed'] = all_df.texts.map(lambda x: '|'.join(process_text(x)))
all_df['wc'] = all_df.processed.map(lambda x: len(x.split('|')))
all_df['uwc'] = all_df.processed.map(lambda x: len(set(x.split('|'))))

test_df['processed'] = test_df.texts.map(lambda x: '|'.join(process_text(x)))
test_df['wc'] = test_df.processed.map(lambda x: len(x.split('|')))
test_df['uwc'] = test_df.processed.map(lambda x: len(set(x.split('|'))))

In [None]:
#prevalence
all_df.category.value_counts() / all_df.shape[0]

## Train-validation Split

We perform a random 85/15 train-validation split. Over/undersampling to balance the classes was also tried but did not help (see the list above), so it is not applied in the code below.

In [None]:
#when finding hyperparameters
from sklearn.model_selection import train_test_split
train_df, valid_df = train_test_split(all_df, test_size=0.15, random_state=1412)
train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)

#when actually doing it
# train_df = all_df.copy()
# valid_df = pd.read_csv('valid_df.csv')
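
The two-stage procedure (85/15 while searching for hyperparameters, 95/5 afterwards) can be sketched in plain Python. This is only an illustration of the idea; the notebook itself relies on `train_test_split`, and the helper name `split_rows` and the toy rows are assumptions.

```python
import random

def split_rows(rows, valid_frac, seed=1412):
    """Shuffle a copy of `rows` and hold out `valid_frac` of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    return shuffled[n_valid:], shuffled[:n_valid]

rows = list(range(1000))  # stand-in for the labeled training rows

# Stage 1: 85/15 split while searching for hyperparameters
train_85, valid_15 = split_rows(rows, valid_frac=0.15)

# Stage 2: 95/5 split once the hyperparameters are fixed
train_95, valid_5 = split_rows(rows, valid_frac=0.05)

print(len(train_85), len(valid_15), len(train_95), len(valid_5))
```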

In [None]:
valid_df.head()

In [None]:
#prevalence
print(train_df['category'].value_counts() / train_df.shape[0])

In [None]:
#prevalence
print(valid_df['category'].value_counts() / valid_df.shape[0])

## Logistic Regression

### Create Features

In [None]:
#dependent variables
y_train = train_df['category']
y_valid = valid_df['category']

In [None]:
#text features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer(tokenizer=process_text, ngram_range=(1,2), min_df=20, sublinear_tf=True)
tfidf_fit = tfidf.fit(all_df['texts'])
text_train = tfidf_fit.transform(train_df['texts'])
text_valid = tfidf_fit.transform(valid_df['texts'])
text_test = tfidf_fit.transform(test_df['texts'])
text_train.shape, text_valid.shape
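
To make the vectorizer settings concrete, here is a small pure-Python sketch of what `min_df` and `sublinear_tf=True` do. It assumes scikit-learn's default smoothed idf and L2 row normalization, covers unigrams only (the notebook also uses bigrams), and the function name `tfidf_sublinear` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_sublinear(docs, min_df=1):
    """docs: list of token lists. Returns (vocab, rows) where each row maps term -> weight."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # terms appearing in fewer than min_df documents are dropped
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    # smoothed idf (assumed to match scikit-learn's smooth_idf=True default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in idf)
        # sublinear tf: 1 + ln(count) instead of the raw count
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2-normalise each row; empty rows keep a norm of 1 to avoid dividing by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

docs = [['good', 'good', 'food'], ['bad', 'food'], ['good', 'service']]
vocab, rows = tfidf_sublinear(docs, min_df=2)
print(vocab)
print(rows[0])
```

With `min_df=2` only `food` and `good` survive in this toy corpus, which mirrors how `min_df=20` prunes rare tokens in the real data.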

In [None]:
#word count and unique word counts; actually might not be so useful
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler_fit = scaler.fit(all_df[['wc','uwc']].astype(float))
print(scaler_fit.mean_, scaler_fit.var_)
num_train = scaler_fit.transform(train_df[['wc','uwc']].astype(float))
num_valid = scaler_fit.transform(valid_df[['wc','uwc']].astype(float))
num_test = scaler_fit.transform(test_df[['wc','uwc']].astype(float))
num_train.shape, num_valid.shape

In [None]:
#concatenate text and word count features
X_train = np.concatenate([num_train,text_train.toarray()],axis=1)
X_valid = np.concatenate([num_valid,text_valid.toarray()],axis=1)
X_test = np.concatenate([num_test,text_test.toarray()],axis=1)
X_train.shape, X_valid.shape

### Fit Model

In [None]:
#fit logistic regression models
model = LogisticRegression(C=2., penalty='l2', solver='liblinear', dual=False, multi_class='ovr')
model.fit(X_train,y_train)
model.score(X_valid,y_valid)
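
With `multi_class='ovr'`, one binary classifier is fit per class. A minimal sketch of how per-class scores can be turned into class probabilities, assuming each score is passed through a sigmoid and the results are renormalized to sum to one (the scores below are invented, and `ovr_probabilities` is a hypothetical helper, not a scikit-learn function):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ovr_probabilities(scores):
    """scores: {class: decision value of that class's binary classifier}."""
    raw = {label: sigmoid(z) for label, z in scores.items()}
    total = sum(raw.values())
    # renormalise so the per-class probabilities sum to 1
    return {label: p / total for label, p in raw.items()}

scores = {'neg': -0.4, 'neu': 1.2, 'pos': 0.3, 'q': -2.0}  # made-up values
probs = ovr_probabilities(scores)
pred = max(probs, key=probs.get)
print(pred, probs)
```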

### See Results

In [None]:
probs = model.predict_proba(X_valid)
probs_df = pd.DataFrame(probs)
probs_df.columns = model.classes_
probs_df['preds'] = model.predict(X_valid)
probs_df['category'] = valid_df.category
probs_df['texts'] = valid_df.texts
probs_df['processed'] = valid_df.processed
probs_df['wc'] = valid_df.wc
probs_df['uwc'] = valid_df.uwc
probs_df['hit'] = (probs_df.preds==probs_df.category)
probs_df.to_csv('probs_df_linear.csv',index=False)

In [None]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(probs_df.category,probs_df.preds)
print(model.score(X_valid,y_valid))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
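
The heatmap is mainly used to find the largest error group. The same question can be answered without plotting; the sketch below uses invented labels and hypothetical helper names to show how accuracy and the most frequent off-diagonal (actual, predicted) pair are read off the counts.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs."""
    return Counter(zip(y_true, y_pred))

def largest_error(counts):
    """Return the most frequent off-diagonal (actual, predicted) pair and its count."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return max(errors.items(), key=lambda kv: kv[1])

# made-up labels for illustration only
y_true = ['pos', 'pos', 'neu', 'neg', 'pos', 'q', 'neu', 'pos']
y_pred = ['neu', 'pos', 'neu', 'neg', 'neu', 'neu', 'neu', 'neu']

counts = confusion_counts(y_true, y_pred)
accuracy = sum(n for (a, p), n in counts.items() if a == p) / len(y_true)
print(accuracy)
print(largest_error(counts))
```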

### Export Augmented Dataset

In [None]:
test_df['category'] = model.predict(X_test)
all_aug = pd.concat([test_df,all_df]).reset_index(drop=True)
print(all_aug.shape)
# all_aug.to_csv('all_aug.csv',index=False)
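
The augmentation step is a form of pseudo-labeling: the test rows receive the model's predictions as labels and are then treated like training rows. A minimal sketch with plain lists, where `toy_predict` is a hypothetical stand-in for `model.predict` and the texts are invented:

```python
def pseudo_label(train_rows, test_texts, predict):
    """Label `test_texts` with `predict` and combine them with the labeled rows."""
    pseudo_rows = [{'category': predict(t), 'texts': t} for t in test_texts]
    # same order as the notebook: pseudo-labeled test rows first, then the training rows
    return pseudo_rows + list(train_rows)

def toy_predict(text):
    # hypothetical stand-in for model.predict; NOT the logistic regression above
    return 'pos' if 'good' in text else 'neu'

train_rows = [{'category': 'pos', 'texts': 'really good'},
              {'category': 'neg', 'texts': 'too slow'}]
test_texts = ['good value', 'opens at nine']

all_aug = pseudo_label(train_rows, test_texts, toy_predict)
print(len(all_aug), [row['category'] for row in all_aug])
```

Because the pseudo-labels inherit the first model's mistakes, the augmented set is only as reliable as that model's accuracy on the test distribution.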

### Submission

In [None]:
# preds = model.predict(X_test) #the model was fit on X_train, which includes the word-count features
# submit = pd.read_csv('test_majority.csv')
# submit['Class'] = preds
# print(submit.shape)
# submit.to_csv('submit_linear.csv',index=False)
# submit.tail()

## [ULMFit](https://github.com/cstorm125/thai2fit) Model

In [None]:
from fastai.text import *
from fastai.callbacks import CSVLogger, SaveModelCallback
from pythainlp.ulmfit import *

model_path = 'wisesight_data/'

In [None]:
#when training to find hyperparameters
all_df = pd.read_csv('all_df.csv')
train_df, valid_df = train_test_split(all_df, test_size=0.15, random_state=1412)

#when training with augmented set
# train_df = pd.read_csv('all_aug.csv')

#test set
# test_df = pd.read_csv('test_df.csv')

### Finetune Language Model

In [None]:
tt = Tokenizer(tok_func = ThaiTokenizer, lang = 'th', pre_rules = pre_rules_th, post_rules=post_rules_th)
processor = [TokenizeProcessor(tokenizer=tt, chunksize=10000, mark_fields=False),
            NumericalizeProcessor(vocab=None, max_vocab=60000, min_freq=2)]

data_lm = (TextList.from_df(all_df, model_path, cols='texts', processor=processor)
    .random_split_by_pct(valid_pct = 0.01, seed = 1412)
    .label_for_lm()
    .databunch(bs=48))
data_lm.sanity_check()
# data_lm.save('wisesight_lm.pkl')

In [None]:
data_lm.sanity_check()
len(data_lm.train_ds), len(data_lm.valid_ds)

In [None]:
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False, tie_weights=True, out_bias=True,
             output_p=0.25, hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
trn_args = dict(drop_mult=1., clip=0.12, alpha=2, beta=1)

learn = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained=False, **trn_args)

#load pretrained models
learn.load_pretrained(**_THWIKI_LSTM)

In [None]:
#train frozen
print('training frozen')
learn.freeze_to(-1)
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))

In [None]:
#train unfrozen
print('training unfrozen')
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3, moms=(0.8, 0.7))

In [None]:
# learn.save('wisesight_lm')
# learn.save_encoder('wisesight_enc')

### Train Text Classifier

In [None]:
#lm data
data_lm = load_data(model_path,'wisesight_lm.pkl')
data_lm.sanity_check()

#classification data
tt = Tokenizer(tok_func = ThaiTokenizer, lang = 'th', pre_rules = pre_rules_th, post_rules=post_rules_th)
processor = [TokenizeProcessor(tokenizer=tt, chunksize=10000, mark_fields=False),
            NumericalizeProcessor(vocab=data_lm.vocab, max_vocab=60000, min_freq=20)]

data_cls = (ItemLists(model_path,train=TextList.from_df(train_df, model_path, cols=['texts'], processor=processor),
                     valid=TextList.from_df(valid_df, model_path, cols=['texts'], processor=processor))
    .label_from_df('category')
    .add_test(TextList.from_df(test_df, model_path, cols=['texts'], processor=processor))
    .databunch(bs=50)
    )
data_cls.sanity_check()
print(len(data_cls.vocab.itos))

In [None]:
#model
config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1, qrnn=False,
             output_p=0.4, hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5)
trn_args = dict(bptt=70, drop_mult=0.7, alpha=2, beta=1, max_len=500)

learn = text_classifier_learner(data_cls, AWD_LSTM, config=config, pretrained=False, **trn_args)
#load pretrained finetuned model
learn.load_encoder('wisesight_enc')

In [None]:
# #gradual unfreezing, then train fully unfrozen
# learn.freeze_to(-1)
# learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))
# learn.freeze_to(-2)
# learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))
# learn.freeze_to(-3)
# learn.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))
# learn.unfreeze()
# learn.fit_one_cycle(10, slice(1e-3 / (2.6 ** 4), 1e-3), moms=(0.8, 0.7),
#                    callbacks=[SaveModelCallback(learn, every='improvement', monitor='accuracy', name='bestmodel')])

Training takes about 20 minutes so we use the script `train_model.py` to do it with the following results (validation run):

```
epoch     train_loss  valid_loss  accuracy
1         0.812156    0.753478    0.687532
Total time: 00:56
epoch     train_loss  valid_loss  accuracy
1         0.740403    0.699093    0.714394
Total time: 00:57
epoch     train_loss  valid_loss  accuracy
1         0.727394    0.668807    0.723011
Total time: 01:34
epoch     train_loss  valid_loss  accuracy
1         0.722163    0.675351    0.723517
2         0.675266    0.654477    0.738723
3         0.669178    0.641070    0.737962
4         0.612528    0.637456    0.744551
5         0.618259    0.635149    0.749366
6         0.572621    0.651169    0.749873
7         0.561985    0.661739    0.747593
8         0.534753    0.673563    0.738469
9         0.530844    0.688871    0.746072
10        0.522788    0.670024    0.743031
Total time: 23:42
```
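
The schedule passes learning rates as `slice(lr / (2.6 ** 4), lr)`, so earlier layers train with smaller rates than the classifier head. The sketch below shows one way such a range can be spread over layer groups, assuming geometric spacing between the two ends. The helper `geometric_lrs` and the choice of five groups are hypothetical; the actual number of layer groups in the fastai model may differ.

```python
def geometric_lrs(lr_min, lr_max, n_groups):
    """Learning rates spaced geometrically from lr_min (earliest layers) to lr_max (head)."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

lr_max = 1e-3
n_groups = 5  # hypothetical; chosen so that adjacent groups differ by a factor of 2.6
lrs = geometric_lrs(lr_max / (2.6 ** 4), lr_max, n_groups)
for i, lr in enumerate(lrs):
    print(f'group {i}: {lr:.2e}')
```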

### See Results

In [None]:
learn.load('bestmodel');
#get predictions
probs, y_true, loss = learn.get_preds(ds_type = DatasetType.Valid, ordered=True, with_loss=True)
classes = learn.data.train_ds.classes
y_true = np.array([classes[i] for i in y_true.numpy()])
preds = np.array([classes[i] for i in probs.argmax(1).numpy()])
prob = probs.numpy()
loss = loss.numpy()

In [None]:
to_df = np.concatenate([y_true[:,None],preds[:,None],loss[:,None],prob],1)
probs_df = pd.DataFrame(to_df)
probs_df.columns = ['category','preds','loss'] + classes
probs_df['hit'] = (probs_df.category == probs_df.preds)
probs_df['texts'] = valid_df.texts.values #assign by position; valid_df keeps its shuffled index after the split
(y_true==preds).mean()

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

conf_mat = confusion_matrix(probs_df.category,probs_df.preds)
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=classes, yticklabels=classes)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### Submission

In [None]:
# submit = pd.read_csv('test_majority.csv')
# submit['Class'] = preds
# print(submit.shape)
# submit.to_csv('submit_ulmfit.csv',index=False)
# submit.tail()

## Average Class Probabilities

In [None]:
#ulmfit trained with augmented set
probs_df_ulmfit = pd.read_csv('probs_df_ulmfit.csv')

#logistic regression trained with training set
probs_df = pd.read_csv('probs_df_linear.csv')

probs_df_ulmfit.head()

In [None]:
#ulmfit probabilities
ulm = np.array(probs_df_ulmfit[['neg','neu','pos','q']])[None,:]

#logistic regression probabilities
lr = np.array(probs_df[['neg','neu','pos','q']])[None,:]

#take average
mean_probs = np.concatenate([lr,ulm],0).mean(0)
mean_preds = np.argmax(mean_probs,1)
mean_preds = np.array([['neg','neu','pos','q'][i] for i in mean_preds])
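
The averaging step can be checked by hand on a couple of rows. This is a minimal pure-Python sketch with invented probabilities; it assumes both models list their rows in the same order and their columns in the order of `CLASSES`.

```python
CLASSES = ['neg', 'neu', 'pos', 'q']

def average_predictions(*model_probs):
    """Each argument is a list of per-row probability lists ordered like CLASSES."""
    preds = []
    for rows in zip(*model_probs):
        # element-wise mean over the models for this row
        mean = [sum(vals) / len(vals) for vals in zip(*rows)]
        preds.append(CLASSES[mean.index(max(mean))])
    return preds

# made-up probabilities for two test rows
lr_probs = [[0.10, 0.50, 0.35, 0.05], [0.60, 0.30, 0.05, 0.05]]
ulm_probs = [[0.05, 0.30, 0.60, 0.05], [0.40, 0.50, 0.05, 0.05]]

print(average_predictions(lr_probs, ulm_probs))
```

In the first row the two models disagree (`neu` versus `pos`) and the more confident one wins, which is the behaviour the ensemble relies on.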

### Submit

In [None]:
# preds = model.predict(text_test)
# submit = pd.read_csv('test_majority.csv')
# submit['Class'] = mean_preds
# print(submit.shape)
# submit.to_csv('submit_mean.csv',index=False)
# submit.tail()

## FastText

In [None]:
import codecs

def replace_newline(t):
    return re.sub('[\n]{1,}', ' ', t)

ft_data = 'ft_data/'

train_df = pd.read_csv('all_df.csv')
test_df = pd.read_csv('test_df.csv')
all_df = pd.concat([train_df,test_df],axis=0).reset_index(drop=True)

In [None]:
df_txts = ['train','test']
dfs = [train_df,test_df]

for i in range(2):
    df = dfs[i]
    ft_lines = []
    for _,row in df.iterrows():
        ft_lab = f'__label__{row["category"]}'
        ft_text = replace_newline(f'{row["texts"]}')
        ft_line = f'{ft_lab} {ft_text}'
        ft_lines.append(ft_line)

    doc = '\n'.join(ft_lines)
    with codecs.open(f'{ft_data}{df_txts[i]}.txt','w', encoding="utf-8") as f:
        f.write(doc)

In [None]:
#for fasttext embedding finetuning
ft_lines = []
for _,row in all_df.iterrows():
    ft_lab = '__label__0'
    ft_text = replace_newline(f'{row["texts"]}')
    ft_line = f'{ft_lab} {ft_text}'
    ft_lines.append(ft_line)

doc = '\n'.join(ft_lines)
with codecs.open(f'{ft_data}all.txt','w', encoding="utf-8") as f:
    f.write(doc)

In [None]:
#finetune with all data
!/root/fastText/fasttext skipgram \
-pretrainedVectors 'model/wiki.th.vec' -dim 300 \
-input ft_data/all.txt -output 'model/finetuned'

In [None]:
#train classifier
!/root/fastText/fasttext supervised \
-input 'ft_data/train.txt' -output 'model/sentiment' \
-pretrainedVectors 'model/finetuned.vec' -epoch 5 -dim 300 -wordNgrams 2

In [None]:
#get prediction
preds = !/root/fastText/fasttext predict 'model/sentiment.bin' 'ft_data/test.txt'
preds = [i.split('__')[-1] for i in preds]
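
The fastText round trip used above comes down to two string conventions: each training line is `__label__<label> <text>` on a single line, and each predicted line carries the same `__label__` prefix. A small sketch of both directions, with hypothetical helper names and an invented example text:

```python
import re

def to_fasttext_line(label, text):
    """One example per line: `__label__<label> <text>` with newlines flattened."""
    flat = re.sub(r'\n+', ' ', text)
    return f'__label__{label} {flat}'

def parse_prediction(line):
    """Strip the `__label__` prefix from a predicted label."""
    return line.split('__')[-1]

line = to_fasttext_line('neg', 'slow delivery\n\nnever again')
print(line)
print(parse_prediction('__label__neg'))
```

Flattening newlines matters because fastText treats every line of the input file as a separate example.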

In [None]:
submit = pd.read_csv('test_majority.csv')
submit['Class'] = preds
print(submit.shape)
submit.to_csv('submit_fasttext.csv',index=False)
submit.tail()

## [Multilingual Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3)

In [None]:
import tensorflow_hub as hub
import tensorflow_text
import tensorflow as tf #tensorflow 2.1.0

enc = hub.load('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')

In [None]:
#dependent variables
y_train = train_df['category']
y_test = test_df['category']

In [None]:
X_trains = []
X_tests = []
bs = 10

In [None]:
for i in tqdm_notebook(range(y_test.shape[0]//bs+1)):
    X_tests.append(enc(test_df.texts[(i*bs):((i+1)*bs)]).numpy())

In [None]:
for i in tqdm_notebook(range(y_train.shape[0]//bs+1)):
    X_trains.append(enc(train_df.texts[(i*bs):((i+1)*bs)]).numpy())

In [None]:
X_test = np.concatenate(X_tests,0)
X_train = np.concatenate(X_trains,0)
X_train.shape, X_test.shape
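
The loops above iterate over `n // bs + 1` batches. That works for the row counts here (24063 and 2674 are not multiples of 10), but it produces an empty final slice whenever the number of rows is a multiple of the batch size. A sketch of a batching helper that avoids this, using a hypothetical `batches` function and toy texts; how the encoder handles an empty batch was not checked here.

```python
import math

def batches(items, bs):
    """Yield consecutive slices of at most `bs` items, never an empty one."""
    for i in range(math.ceil(len(items) / bs)):
        yield items[i * bs:(i + 1) * bs]

texts = [f'text {i}' for i in range(20)]  # length divisible by the batch size

sizes = [len(b) for b in batches(texts, bs=10)]
# the `n // bs + 1` pattern for comparison
sizes_floor_plus_one = [len(texts[i * 10:(i + 1) * 10]) for i in range(len(texts) // 10 + 1)]
print(sizes, sizes_floor_plus_one)
```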

In [None]:
from sklearn.svm import LinearSVC

text_clf = LinearSVC(class_weight='balanced')
text_clf.fit(X_train, y_train)

In [None]:
# preds = text_clf.predict(X_test)
# submit = pd.read_csv('test_majority.csv')
# submit['Class'] = preds
# print(submit.shape)
# submit.to_csv('submit_use.csv',index=False)
# submit.tail()

In [None]:
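#accuracy on the cleaned test set; assumes `lab` (true labels) and `submit` (predictions from the cell above) are already defined
(lab.Class==submit.Class).mean()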