Contents:

First I do some basic analysis of the tweets,
then I try several methods:

*  Ridge classifier that gets us up to 79.5% (from the getting-started tutorial)
*  Logistic Regression with differing parameters
*  Decision Tree
*  BERT

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud 
from nltk.tokenize import word_tokenize 

In [None]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [None]:
train_df[train_df["target"] == 0]["text"].values[1]

In [None]:
train_df[train_df["target"] == 1]["text"].values[1]

Let's analyze the tweets.

In [None]:
train_df.info()

In [None]:
keywords_vc = pd.DataFrame({"Count": train_df["keyword"].value_counts()})
sns.barplot(y=keywords_vc[0:30].index, x=keywords_vc[0:30]["Count"], orient='h')
plt.title("Top 30 Keywords")
plt.show()

This method of visualization was inspired by Zineb khanjari.

In [None]:
disaster_keywords = train_df.loc[train_df["target"] == 1]["keyword"].value_counts()
nondisaster_keywords = train_df.loc[train_df["target"] == 0]["keyword"].value_counts()

fig, ax = plt.subplots(1,2, figsize=(20,8))
sns.barplot(y=disaster_keywords[0:30].index, x=disaster_keywords[0:30], orient='h', ax=ax[0], palette="Reds_d")
sns.barplot(y=nondisaster_keywords[0:30].index, x=nondisaster_keywords[0:30], orient='h', ax=ax[1], palette="Blues_d")
ax[0].set_title("Top 30 Keywords - Disaster Tweets")
ax[0].set_xlabel("Keyword Frequency")
ax[1].set_title("Top 30 Keywords - Non-Disaster Tweets")
ax[1].set_xlabel("Keyword Frequency")
plt.tight_layout()
plt.show()

It's worth checking whether these high-frequency keywords are really indicative of the label.
Let's explore this for the top few keywords.

In [None]:
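def keyword_disaster_probabilities(x):
    # Fraction of tweets tagged with exactly this keyword that are labelled as disasters.
    # An exact match is used instead of str.contains, which treats x as a regex
    # and would also count keywords that merely contain x as a substring.
    has_keyword = train_df["keyword"].fillna("") == x
    tweets_w_keyword = np.sum(has_keyword)
    tweets_w_keyword_disaster = np.sum(has_keyword & (train_df["target"] == 1))
    return tweets_w_keyword_disaster / tweets_w_keyword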

keywords_vc["Disaster_Probability"] = keywords_vc.index.map(keyword_disaster_probabilities)
keywords_vc.head()

Let's see which keywords are most indicative of a disaster.

In [None]:
keywords_vc.sort_values(by="Disaster_Probability", ascending=False).head(10)

Let us look at the tweet length distribution.


In [None]:
train_df["tweet_length"] = train_df["text"].apply(len)
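# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(train_df["tweet_length"], kde=True, stat="density")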
plt.title("Histogram of Tweet Length")
plt.xlabel("Number of Characters")
plt.ylabel("Density")
plt.show()

Might be more useful to see if there is a difference based on the label.

In [None]:
g = sns.FacetGrid(train_df, col="target", height=5)
# histplot replaces the deprecated distplot
g = g.map(sns.histplot, "tweet_length", kde=True, stat="density")
plt.suptitle("Distribution of Tweet Length")
plt.show()

### Building vectors

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether they're about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process.


In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [None]:
train_df["text"][0:5]

In [None]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.
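
To make the counting concrete, here is a minimal standard-library sketch of what a count vectorizer does. It is an approximation of `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters), not its actual implementation, and the two example sentences are made up.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters (approximates the default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

# Two made-up example "tweets"
docs = [
    "Forest fire near the lake",
    "The fire spread fast",
]

# "fit": collect the sorted vocabulary and give each token a column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_count_vector(text):
    # "transform": one count per vocabulary token, in column order
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

vectors = [to_count_vector(doc) for doc in docs]
print(list(vocabulary))
for vec in vectors:
    print(vec)
```

Each vector has one slot per vocabulary token, so most slots are zero for any single tweet, which is why the real vectors are stored as sparse matrices.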

Now let's create vectors for all of our tweets.

In [None]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])
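
The difference between fitting and transforming can be sketched with plain Python. This is a simplified illustration with made-up sentences, not scikit-learn's code: the vocabulary is built from the training text only, and tokens that appear only in the test text are dropped.

```python
import re

def tokenize(text):
    # Simplified tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Made-up example data
train_texts = ["flood in the city", "city on fire"]
test_texts = ["earthquake in the city"]

# fit step: the vocabulary comes from the training text only
vocab = sorted({tok for text in train_texts for tok in tokenize(text)})
index = {tok: i for i, tok in enumerate(vocab)}

def transform(text):
    vec = [0] * len(index)
    for tok in tokenize(text):
        if tok in index:  # unseen tokens such as "earthquake" are ignored
            vec[index[tok]] += 1
    return vec

train_vecs = [transform(t) for t in train_texts]
test_vecs = [transform(t) for t in test_texts]
print(vocab)
print(test_vecs[0])
```

Because both sets are mapped through the same `index`, every train and test vector has the same length and the same column meaning, which is what the model needs.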

### Ridge Classifier

As we mentioned above, we think the words contained in each tweet are a good indicator of whether they're about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a _linear_ connection. So let's build a linear model and see!

In [None]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
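
To illustrate how the ridge penalty pushes weights toward 0, here is a small sketch for a single feature with no intercept. The data are hypothetical, and this is the textbook closed-form solution rather than what `RidgeClassifier` runs internally (it does, however, encode binary labels as -1/+1 and fit a ridge regression on them).

```python
def ridge_weight(xs, ys, alpha):
    # Minimizes sum((y - w * x)^2) + alpha * w^2 for one feature, no intercept.
    # Closed form: w = sum(x * y) / (sum(x^2) + alpha)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [0, 1, 1, 2]      # hypothetical counts of one word in four tweets
ys = [-1, 1, 1, 1]     # labels encoded as -1 / +1

weights = {alpha: ridge_weight(xs, ys, alpha) for alpha in (0.0, 1.0, 20.0)}
for alpha, w in weights.items():
    print(f"alpha={alpha:>4}: w={w:.3f}")
```

A larger `alpha` shrinks the weight further but never sets it exactly to zero, so no word is discarded outright.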

Let's test our model and see how well it does on the training data. For this we'll use `cross-validation` - where we train on a portion of the known data, then validate it with the rest. If we do this several times (with different portions) we can get a good idea for how a particular model or method performs.

The metric for this competition is F1, so let's use that here.
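
As a rough sketch of the two ideas above, the following standard-library code splits sample indices into folds and computes F1 from predictions. The fold splitter is a plain, unstratified version written for illustration; as far as I know, `cross_val_score` with an integer `cv` and a classifier uses stratified folds instead. The labels and predictions are made up.

```python
def k_fold_indices(n_samples, k):
    # Contiguous folds; the first n_samples % k folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val
        start = stop

def f1_score(y_true, y_pred):
    # F1 is the harmonic mean of precision and recall for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

folds = list(k_fold_indices(7, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)

# Made-up labels and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_score(y_true, y_pred)
print("F1:", score)
```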

In [None]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

The above scores aren't terrible! It looks like our assumption will score roughly 0.65 on the leaderboard. There are lots of ways to potentially improve on this (TFIDF, LSA, LSTM / RNNs, the list is long!) - give any of them a shot!

In the meantime, let's do predictions on our test set and build a submission for the competition.

In [None]:
clf.fit(train_vectors, train_df["target"])

In [None]:
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [None]:
sample_submission["target"] = clf.predict(test_vectors)

In [None]:
sample_submission.head()

In [None]:
sample_submission.to_csv("submission.csv", index=False)

Let's try different hyperparameters.

After some tuning, max_df = 180 seemed to work well.
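
For intuition, an integer `max_df` is a cap on document frequency: tokens that appear in more than that many documents are dropped from the vocabulary. A minimal sketch with made-up, pre-tokenized documents (and a much smaller cap than 180):

```python
from collections import Counter

# Made-up, already tokenized documents
docs = [
    ["fire", "in", "the", "hills"],
    ["the", "flood", "in", "town"],
    ["the", "fire", "truck"],
    ["love", "the", "song"],
]

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for doc in docs for tok in set(doc))

max_df = 2  # keep tokens that occur in at most 2 documents
kept = sorted(tok for tok, df in doc_freq.items() if df <= max_df)
print(kept)
```

Very common tokens such as "the" carry little signal about the label, so capping document frequency removes them much as a stop-word list does.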

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer(max_df = 180, stop_words = 'english')
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df['text'])
# both matrices share the same number of columns (the vocabulary size)
print(train_vectors.shape, test_vectors.shape)

In [None]:
clf = linear_model.RidgeClassifier(alpha = 20)
clf.fit(train_vectors, train_df['target'])
scores = model_selection.cross_val_score(clf, train_vectors, train_df['target'], cv=3, scoring = 'f1')
scores

This one gets about 79.5%.
(I just rerun the submission cells above to save the CSV.)

Now we try a decision tree from scikit-learn.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
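from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

# X_train_tfidf and y_train are not defined at this point in the notebook; as an
# assumption, build them from TF-IDF features of the full training set.
X_train_tfidf = TfidfVectorizer().fit_transform(train_df["text"])
y_train = train_df["target"]

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(splitter='random', class_weight='balanced'))
])
parameters = {
    'clf__max_depth':(150,160,165),
    'clf__min_samples_split':(18,20,23),
    'clf__min_samples_leaf':(5,6,7)
}

# verbose must be non-negative
df_tfidf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
df_tfidf.fit(X_train_tfidf, y_train)

print(df_tfidf.best_estimator_.get_params())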

Next, inspired by some high-scoring submissions, I decided to use a TF-IDF vectorizer. Starting with something simple, let's consider logistic regression.
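
As a rough illustration of what TF-IDF weighting does, here is a standard-library sketch using the smoothed IDF formula documented for scikit-learn, `idf = ln((1 + n) / (1 + df)) + 1`, sublinear term frequency `1 + ln(tf)`, and L2 normalization. The documents are made up and pre-tokenized, and this is a sketch of the formula, not the library's implementation.

```python
import math
from collections import Counter

# Made-up, already tokenized documents
docs = [
    ["fire", "fire", "forest"],
    ["forest", "walk"],
    ["walk", "home"],
]
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency
doc_freq = Counter(tok for doc in docs for tok in set(doc))
idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def tfidf(doc):
    counts = Counter(doc)
    # Sublinear tf: 1 + ln(count) dampens the effect of repeated words
    raw = {tok: (1 + math.log(c)) * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {tok: v / norm for tok, v in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "fire" ends up with a higher weight than "forest" because it is repeated and occurs in fewer documents.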

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

In [None]:
data = pd.concat([train_df, test_df])

data.shape

In [None]:
data_train = data[0:train_df.shape[0]]
data_test = data[train_df.shape[0]:]  # an end index of -1 here would drop the last test row

X_train, X_test, y_train, y_test = train_test_split(data_train['text'], data_train['target'],
                                            test_size = 0.2, random_state = 75)

In [None]:
# Fit TF-IDF on the combined train and test text, then split the matrix back.
# The *_full names keep these apart from the X_train / y_train split made above.
text_train = list(train_df.text)
text_test = list(test_df.text)
text = text_train + text_test
y_train_full = train_df.target.values

tfv = TfidfVectorizer(max_features=None, tokenizer=None, ngram_range=(1, 1),
                      analyzer='word', use_idf=True, smooth_idf=True, sublinear_tf=True)

X = tfv.fit_transform(text)

X_train_full = X[:len(text_train)]
X_test_full = X[len(text_train):]

lr = LogisticRegression(C=1, max_iter=10000)

print('Cross val score: {}'.format(np.mean(cross_val_score(lr, X_train_full, y_train_full, cv=10))))

lr.fit(X_train_full, y_train_full)
y_predict = lr.predict(X_test_full)

# submission_df = pd.DataFrame(y_predict)
# submission_df.to_csv('submission.csv',index=False)

In [None]:
train_tfidf = tfv.fit_transform(X_train)
# only transform the held-out split, so it shares the vocabulary fitted on the training split
test_tfidf = tfv.transform(X_test)

Before generating a submission, we can retrain on the whole labelled dataset (both the training and the held-out split)!

In [None]:
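# Separate names are used so the y_train from the split above is not overwritten.
y_full = train_df.target.values
X_full = tfv.fit_transform(train_df["text"])
# lr.fit(X_full, y_full)
# y_predict = lr.predict(tfv.transform(test_df["text"]))

# submission_df = pd.DataFrame(y_predict)
# submission_df.to_csv('submission.csv',index=False)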

In [None]:
lr = LogisticRegression(class_weight = 'balanced', solver = 'lbfgs', n_jobs = -1)
lr.fit(train_tfidf, y_train)
# y_predicted_lr = lr.predict(test_tfidf)

This gets us 74.23% accuracy with something as simple as logistic regression!

Inspired by the top-scoring submission, I decided to try out a pre-trained model like BERT,
most straightforwardly from Hugging Face.

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer
import numpy as np
import tensorflow as tf 
from transformers import TFBertModel
import transformers

bert_model = TFBertModel.from_pretrained('bert-large-uncased')

In [None]:
model = create_model(bert_model)
model.summary()

In [None]:
# labels come from train_df (there is no `train` variable in this notebook)
history = model.fit([train_input_ids,train_attention_masks],train_df.target,validation_split=0.2, epochs=2,batch_size=10)
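
The `train_input_ids` and `train_attention_masks` arrays are not built anywhere in this notebook. In practice they would come from the Hugging Face tokenizer; the sketch below only illustrates the idea with made-up token ids and a hypothetical helper `pad_and_mask`: every sequence is padded (or cut) to the same length, and the mask marks real tokens with 1 and padding with 0.

```python
def pad_and_mask(token_id_lists, max_len, pad_id=0):
    # Hypothetical helper: pad or truncate each id list to max_len
    # and build a matching attention mask (1 = real token, 0 = padding).
    input_ids, attention_masks = [], []
    for ids in token_id_lists:
        ids = ids[:max_len]  # simple truncation; real tokenizers keep the end marker
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_masks.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_masks

# Made-up token ids for two tweets of different lengths
toy_ids = [[1, 5, 2], [1, 7, 8, 9, 2]]
input_ids, attention_masks = pad_and_mask(toy_ids, max_len=4)
print(input_ids)
print(attention_masks)
```

The model uses the mask to ignore padded positions, so tweets of different lengths can share one fixed-size batch.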

In [None]:
out = model.predict([test_input_ids,test_attention_masks])
out = np.round(out).astype(int)
result = pd.DataFrame(out)  # wrap the rounded predictions (was pd.DataFrame(result), which is undefined)
submission = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
output = pd.DataFrame({'id':submission.id,'target':result[0]})
output.to_csv('submission.csv',index=False)

We finally managed to push our accuracy to 84%!