# Natural Language Processing with Disaster Tweets

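This notebook contains two simple models for the Kaggle competition [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started), a binary classification task: predicting whether a tweet is about a real disaster (`target = 1`) or not (`target = 0`).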

- The first is a Support Vector Machine trained on word embeddings from spaCy; the vectors are computed from the raw tweets, with no preprocessing. This model achieves an F1 score of around 0.81.

- The second is a stochastic gradient descent classifier trained on count and tf-idf vectors; here the tweets are preprocessed by lemmatization, stopword removal and lowercasing. It achieves an F1 score of around 0.8.

An averaging ensemble of both models is also included. It achieves an F1 score of 0.83.

In [1]:
import pandas as pd

import spacy

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier

import numpy as np

In [2]:
# F1 scorer for cross_val_score. Note: this rebinds the name f1_score to the scorer object
f1_score = make_scorer(f1_score, average='binary')
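
For reference, the binary F1 score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand in plain Python; the function name `f1_binary` and the toy labels are made up for illustration and are not part of the notebook's pipeline.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Without true positives, precision or recall is zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0]
demo_f1 = f1_binary(y_true_demo, y_pred_demo)
print(round(demo_f1, 3))
```

Here there are two true positives, one false positive and one false negative, so precision and recall are both 2/3.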

## Brief exploration of the dataset

In [3]:
# This loads the train dataset
df = pd.read_csv('train.csv')

In [4]:
df.shape

(7613, 5)

In [5]:
df.head(10)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | 8 | NaN | NaN | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | 10 | NaN | NaN | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | 13 | NaN | NaN | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | 14 | NaN | NaN | There's an emergency evacuation happening now ... | 1 |
| 9 | 15 | NaN | NaN | I'm afraid that the tornado is coming to our a... | 1 |


In [6]:
df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
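
The class counts above give a useful reference point for the F1 scores reported later. The following sketch, using only the two counts from the output, computes the share of disaster tweets and the F1 score of a trivial classifier that labels every tweet as a disaster. This baseline is an illustration added here, not a model from the notebook.

```python
# Class counts copied from the value_counts() output above
n_negative, n_positive = 4342, 3271
n_total = n_negative + n_positive

positive_share = n_positive / n_total

# A classifier that always predicts 1 has TP = n_positive, FP = n_negative, FN = 0,
# so F1 = 2*TP / (2*TP + FP + FN)
baseline_f1 = 2 * n_positive / (2 * n_positive + n_negative)

print(f"Share of disaster tweets: {positive_share:.3f}")
print(f"F1 of an always-positive baseline: {baseline_f1:.3f}")
```

Any model worth keeping should score clearly above this baseline under the same metric.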

In [7]:
# Number of tweets that contain at least one '#'
df['text'].str.contains('#').sum()

1761

In [8]:
# Number of tweets that contain at least one '@'
df['text'].str.contains('@').sum()

2039

In [9]:
df['text'].str.contains('http://').sum(), df['text'].str.contains('https://').sum()

(3604, 407)
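
The counts above say how many tweets contain each marker, not how many markers there are in total. To look at the markers themselves, a regular-expression sketch along these lines could be used. The patterns are simplifying assumptions (they are not Twitter's official tokenization rules), and the function name and sample tweet are made up for illustration.

```python
import re

# Simplified patterns, assumed for illustration
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def tweet_markers(text):
    """Return the hashtags, mentions and URLs found in a tweet."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "urls": URL_RE.findall(text),
    }


# Made-up sample tweet
sample = "#flood #disaster Heavy rain causes flash flooding @example http://t.co/abc"
markers = tweet_markers(sample)
print(markers)
```

Applied to a DataFrame, something like `df['text'].apply(tweet_markers)` would give one dictionary per tweet.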

In [10]:
for col in df.columns:
    count = pd.isna(df[col]).sum()
    print(f'Column {col} has {count} NaN values.')
    print()

Column id has 0 NaN values.

Column keyword has 61 NaN values.

Column location has 2533 NaN values.

Column text has 0 NaN values.

Column target has 0 NaN values.



In [11]:
df_pred = pd.read_csv('test.csv')

In [12]:
df_pred.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


In [13]:
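# Count NaN values per column of the test set. The output recorded below came
# from a run that read from df by mistake, so it shows the training-set counts.
for col in df_pred.columns:
    count = pd.isna(df_pred[col]).sum()
    print(f'Column {col} has {count} NaN values.')
    print()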

Column id has 0 NaN values.

Column keyword has 61 NaN values.

Column location has 2533 NaN values.

Column text has 0 NaN values.



## spaCy word embeddings on raw data + SVM

In [14]:
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [15]:
nlp = spacy.load('en_core_web_lg')

In [16]:
X = np.array([nlp(text).vector for text in df['text']])
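
According to the spaCy documentation, `Doc.vector` defaults to the average of the token vectors, so each tweet becomes one fixed-length vector regardless of its length. The sketch below illustrates that mean pooling in plain Python. The three-dimensional embedding table is made up, and treating unknown tokens as zero vectors is an assumption of this sketch.

```python
# Hypothetical 3-dimensional embeddings, made up for illustration
toy_vectors = {
    "forest": [0.2, 0.8, 0.0],
    "fire": [0.9, 0.1, 0.3],
    "near": [0.1, 0.1, 0.1],
}


def mean_pool(tokens, table, dim=3):
    """Average the vectors of the tokens; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        vec = table.get(token, [0.0] * dim)
        total = [a + b for a, b in zip(total, vec)]
    return [value / len(tokens) for value in total]


tweet_vector = mean_pool(["forest", "fire", "near"], toy_vectors)
print([round(v, 3) for v in tweet_vector])
```

Averaging discards word order, which is one reason a second model built on n-gram counts can complement this one.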

In [17]:
y = df['target'].values

In [18]:
# SVM model
svm_model = SVC(kernel='rbf', C=3.5, probability=True, verbose=1)

In [19]:
# 5-fold cross-validation, scored with F1
cv_scores = cross_val_score(svm_model, X, y, cv=5, scoring=f1_score, verbose=1)

print("Cross-Validation Scores:", cv_scores)
print()
print("Mean F1: {:.2f}".format(cv_scores.mean()))
print("Standard Deviation: {:.2f}".format(cv_scores.std()))

[LibSVM][LibSVM][LibSVM][LibSVM][LibSVM]
Cross-Validation Scores: [0.70424978 0.69521411 0.73633441 0.68793103 0.76595745]

Mean F1: 0.72
Standard Deviation: 0.03
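
The summary figures can be reproduced from the five fold scores printed above. This small check uses only the standard library; `pstdev` is the population standard deviation, which matches NumPy's default (`ddof=0`) used by `cv_scores.std()`.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above
fold_scores = [0.70424978, 0.69521411, 0.73633441, 0.68793103, 0.76595745]

mean_f1 = mean(fold_scores)
std_f1 = pstdev(fold_scores)  # population standard deviation (ddof=0)

print(f"{mean_f1:.2f} +/- {std_f1:.2f}")
```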


In [20]:
svm_model.fit(X, y)

[LibSVM]

In [21]:
X_test = np.array([nlp(text).vector for text in df_pred['text']])

In [22]:
df_pred['target'] = svm_model.predict(X_test)

In [23]:
svm_prob = svm_model.predict_proba(X_test)

In [24]:
nlp_svm = df_pred[['id', 'target']]

In [25]:
nlp_svm.to_csv('nlp_svm.csv', index=False)

## CountVectorizer on ngrams + SGDClassifier

In [26]:
def CountVectorize_tweets(df, df_pred):
    all_df = pd.concat([df, df_pred], ignore_index=True)
    train_text = []
    train_hashtag = []
    pred_text = []
    pred_hashtag = []
    for index, row in all_df.iterrows():
        # This is simply to display how the process moves on. It's off now
        # print(f'{index}/{len(all_df)}')

        # This chunk of code filters hashtags
        tweet = row['text'].split(' ')
        hashtags = [word for word in tweet if word.startswith('#')]

        # These lines were meant to remove urls and account names from the tweets. However,
        # the classifier seems to work better with them
        # remove = [word for word in tweet if word.startswith('@') or 'http' in word]
        # tweet = [word for word in tweet if word not in remove]

        # This puts the tweet and the hashtags together in a string
        tweet = ' '.join(tweet)
        hashtags = ' '.join(hashtags)

        # Lemmatization
        doc = nlp(tweet)
        tweet = [token.lemma_ for token in doc]
        tweet = ' '.join(tweet)

        # This separates the tweets into the training and prediction sets
        if index > df.index[-1]:
            pred_text.append(tweet)
            pred_hashtag.append(hashtags)
        else:
            train_text.append(tweet)
            train_hashtag.append(hashtags)

    # This checks whether the resulting data has the same length as the original datasets
    assert len(df) == len(train_text)
    assert len(df_pred) == len(pred_text)

    # This instantiates a count vectorizer over the unigrams and bigrams of each tweet
    count_vectorizer = CountVectorizer(stop_words='english', lowercase=True, ngram_range=(1, 2))
    train_text_m = count_vectorizer.fit_transform(train_text).toarray()
    pred_text_m = count_vectorizer.transform(pred_text).toarray()

    # This instantiates a tf-idf vectorizer over the hashtags
    tfidf_vectorizer = TfidfVectorizer(lowercase=True)
    train_hashtag_m = tfidf_vectorizer.fit_transform(train_hashtag).toarray()
    pred_hashtag_m = tfidf_vectorizer.transform(pred_hashtag).toarray()

    # This puts together the arrays
    X_train = np.concatenate((train_text_m, train_hashtag_m), axis=1)
    X_pred = np.concatenate((pred_text_m, pred_hashtag_m), axis=1)
    return X_train, X_pred

In [27]:
X_train, X_pred = CountVectorize_tweets(df, df_pred)
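
With `ngram_range=(1, 2)`, the count vectorizer builds one feature per single word and one per pair of adjacent words. The sketch below shows the idea in plain Python. It is a simplification: it splits on whitespace and does not remove stopwords, so it will not reproduce `CountVectorizer`'s vocabulary exactly, and the function name is made up for illustration.

```python
from collections import Counter


def ngram_counts(text, n_min=1, n_max=2):
    """Count all n-grams with n_min <= n <= n_max in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


demo_counts = ngram_counts("Forest fire near forest")
print(demo_counts)
```

Bigrams such as "forest fire" keep some local word order that a plain bag of words would lose, at the cost of a much larger feature space.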

In [28]:
sgd_classifier = SGDClassifier(loss='log_loss', verbose=1)

In [29]:
cv_scores_sgd = cross_val_score(sgd_classifier, X_train, y, cv=5, scoring=f1_score, verbose=1)

print("Cross-Validation Scores:", cv_scores_sgd)
print()
print("Mean F1: {:.2f}".format(cv_scores_sgd.mean()))
print("Standard Deviation: {:.2f}".format(cv_scores_sgd.std()))

-- Epoch 1
Norm: 209.66, NNZs: 58956, Bias: -3.290940, T: 6090, Avg. loss: 2.746111
Total training time: 1.70 seconds.
-- Epoch 2
Norm: 127.39, NNZs: 58956, Bias: -1.192990, T: 12180, Avg. loss: 0.422751
Total training time: 2.97 seconds.
-- Epoch 3
Norm: 91.81, NNZs: 58956, Bias: -2.624307, T: 18270, Avg. loss: 0.148670
Total training time: 4.26 seconds.
-- Epoch 4
Norm: 73.22, NNZs: 58956, Bias: -1.703017, T: 24360, Avg. loss: 0.094220
Total training time: 5.55 seconds.
-- Epoch 5
Norm: 62.51, NNZs: 58956, Bias: -2.219513, T: 30450, Avg. loss: 0.079268
Total training time: 6.83 seconds.
-- Epoch 6
Norm: 55.82, NNZs: 58956, Bias: -2.208719, T: 36540, Avg. loss: 0.074513
Total training time: 8.10 seconds.
-- Epoch 7
Norm: 51.44, NNZs: 58956, Bias: -1.949519, T: 42630, Avg. loss: 0.072501
Total training time: 9.39 seconds.
-- Epoch 8
Norm: 48.47, NNZs: 58956, Bias: -1.785196, T: 48720, Avg. loss: 0.073442
Total training time: 10.67 seconds.
-- Epoch 9
Norm: 46.38, NNZs: 58956, Bias: -1.

In [30]:
sgd_classifier.fit(X_train, y)

-- Epoch 1
Norm: 189.71, NNZs: 72144, Bias: -0.805957, T: 7613, Avg. loss: 2.521911
Total training time: 1.62 seconds.
-- Epoch 2
Norm: 114.12, NNZs: 72153, Bias: -2.454343, T: 15226, Avg. loss: 0.359162
Total training time: 3.23 seconds.
-- Epoch 3
Norm: 83.00, NNZs: 72153, Bias: -2.584523, T: 22839, Avg. loss: 0.132088
Total training time: 4.81 seconds.
-- Epoch 4
Norm: 67.35, NNZs: 72153, Bias: -2.301789, T: 30452, Avg. loss: 0.096600
Total training time: 6.82 seconds.
-- Epoch 5
Norm: 58.58, NNZs: 72153, Bias: -1.764644, T: 38065, Avg. loss: 0.085413
Total training time: 9.04 seconds.
-- Epoch 6
Norm: 53.26, NNZs: 72153, Bias: -1.939348, T: 45678, Avg. loss: 0.088937
Total training time: 10.78 seconds.
-- Epoch 7
Norm: 49.90, NNZs: 72153, Bias: -1.910645, T: 53291, Avg. loss: 0.088626
Total training time: 12.36 seconds.
-- Epoch 8
Norm: 47.66, NNZs: 72153, Bias: -1.869364, T: 60904, Avg. loss: 0.090767
Total training time: 13.95 seconds.
-- Epoch 9
Norm: 46.13, NNZs: 72153, Bias: -

In [31]:
df_pred['target'] = sgd_classifier.predict(X_pred)
sgd_prob = sgd_classifier.predict_proba(X_pred)
nlp_sgd = df_pred[['id', 'target']]
nlp_sgd.to_csv('nlp_sgd.csv', index=False)

## Ensemble

In [32]:
# Averaging the positive-class probabilities of the two models scores above 0.82 F1
combined_props = 1/2 * svm_prob[:, 1].reshape(-1, 1) + 1/2 * sgd_prob[:, 1].reshape(-1, 1)

In [33]:
combined_predictions = (combined_props > 0.5).astype(int)

In [34]:
nlp_ensemble = pd.DataFrame()
nlp_ensemble['id'] = df_pred['id']
nlp_ensemble['target'] = combined_predictions

In [35]:
nlp_ensemble.to_csv('nlp_ensemble.csv', index=False)
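
The equal-weight average used above is a special case of weighted soft voting. The sketch below generalizes it with a weight and a threshold. Both parameters, the function name and the probabilities are illustrative assumptions, and no weight other than 0.5 was evaluated in this notebook.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5, threshold=0.5):
    """Blend two lists of positive-class probabilities and threshold the result."""
    if len(prob_a) != len(prob_b):
        raise ValueError("probability lists must have the same length")
    blended = [weight_a * a + (1 - weight_a) * b for a, b in zip(prob_a, prob_b)]
    labels = [int(p > threshold) for p in blended]
    return blended, labels


# Made-up probabilities for three tweets
svm_demo = [0.90, 0.40, 0.20]
sgd_demo = [0.70, 0.70, 0.10]
blended, labels = soft_vote(svm_demo, sgd_demo)
print(labels)
```

For the second toy tweet the two models disagree (0.40 against 0.70), and the blended probability of 0.55 settles the label as 1. If a weight other than 0.5 were tried, it would be safer to choose it on held-out data rather than on leaderboard scores.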