# A Simple Classification Approach

Again, let's load the tweet dataset. This time we'll train a simple model, a support vector machine (SVM), to classify each tweet as either relating to a genuine disaster or not.

As seen in the previous notebook, the following cell downloads the dataset from the web and extracts it:

In [1]:
!rm -rf data
!mkdir -p data
!wget https://github.com/ghomasHudson/text-mining-demos-workshop/raw/main/disaster_tweets.zip -O data/data.zip
!unzip -j data/data.zip -d data
!rm data/data.zip

--2024-01-22 10:45:26--  https://github.com/ghomasHudson/text-mining-demos-workshop/raw/main/disaster_tweets.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ghomasHudson/text-mining-demos-workshop/main/disaster_tweets.zip [following]
--2024-01-22 10:45:26--  https://raw.githubusercontent.com/ghomasHudson/text-mining-demos-workshop/main/disaster_tweets.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 410737 (401K) [application/zip]
Saving to: ‘data/data.zip’


2024-01-22 10:45:26 (10.9 MB/s) - ‘data/data.zip’ saved [410737/410737]

Archive:  data/data.zip
  inflating: data/README.md       

This time we'll load both the training and test sets as pandas DataFrames.

In [2]:
import pandas as pd
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

train_df.sample(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 1304 | 1885 | burning | Gameday | .@StacDemon with five burning questions for Ch... | 1 |
| 3051 | 4378 | earthquake | | Contruction upgrading ferries to earthquake st... | 1 |
| 1723 | 2486 | collided | | We're happily collided :) | 0 |
| 1876 | 2695 | crush | w. Nykae | More than a crush ???????????? WCE @nykaeD_ ??... | 0 |
| 6635 | 9502 | terrorist | | Fresh encounter in Pulwama of J&amp;K one ... | 1 |


# Formatting the data

We now need to format this data ready for training. We'll use CountVectorizer to convert the text data into a numerical format that we can feed into the SVM. This process is called feature extraction.

First, let's fit our CountVectorizer to the training set. This will learn the vocabulary of the training set:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(train_df["text"])
print("Loaded a vocabulary of length: ", len(vectorizer.get_feature_names_out()))

Loaded a vocabulary of length:  21242


We can now use this to convert both train and test sets into features for the model:

In [4]:
x_train = vectorizer.transform(train_df["text"])
x_test = vectorizer.transform(test_df["text"])
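
One consequence worth seeing in a small example: because the vocabulary was learned from the training set only, words that appear just in the test set have no column and are silently ignored. The sketch below uses invented sentences (an assumption for illustration) to show that both matrices end up with the same number of columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: "tornado" never appears in the training sentences.
toy_train = ["flood in the city", "storm hits the coast"]
toy_test = ["tornado hits the city"]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_vectorizer.fit(toy_train)

x_toy_train = toy_vectorizer.transform(toy_train)
x_toy_test = toy_vectorizer.transform(toy_test)

# Same column count for both, because both use the training vocabulary.
print(x_toy_train.shape, x_toy_test.shape)
# Only "hits" and "city" are counted for the test sentence.
print(x_toy_test.toarray())
```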

Let's visualise what this has done:

In [5]:
df = pd.DataFrame(x_train[:15].todense())
df.columns = list(vectorizer.get_feature_names_out())
df

Unnamed: 0,00,000,0000,007npen6lg,00cy9vxeff,00end,00pm,01,02,0215,...,ûïyou,ûò,ûò800000,ûòthe,ûó,ûóher,ûókody,ûónegligence,ûótech,ûówe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The table is hard to read because almost every entry is 0, which is expected: any single tweet contains only a handful of the words in the vocabulary. Wherever a word does occur in a tweet we get a positive count, and a count above 1 if it occurs more than once, e.g.

In [6]:
print("For the tweet:")
print(train_df["text"][11])
print()
print("The word 'flooded' gets the value:", df["flooded"][11])
print("The word 'tampa' gets the value:", df["tampa"][11])

For the tweet:
Haha South Tampa is getting flooded hah- WAIT A SECOND I LIVE IN SOUTH TAMPA WHAT AM I GONNA DO WHAT AM I GONNA DO FVCK #flooding

The word 'flooded' gets the value: 1
The word 'tampa' gets the value: 2
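
The large number of zeros is also why `transform` returns a sparse matrix rather than a dense array. As a rough sketch, the fraction of non-zero entries can be measured with the matrix's `nnz` attribute. The three short texts below are invented for illustration, so the density here is far higher than it would be on the real tweets:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; the first text repeats "south" and "tampa".
toy_tweets = [
    "south tampa flooded south tampa",
    "storm hits coast",
    "earthquake shakes city",
]

toy_vectorizer = CountVectorizer()
x_toy = toy_vectorizer.fit_transform(toy_tweets)

n_rows, n_cols = x_toy.shape
density = x_toy.nnz / (n_rows * n_cols)  # share of entries that are non-zero
tampa_count = x_toy[0, toy_vectorizer.vocabulary_["tampa"]]

print(f"{x_toy.nnz} non-zero entries out of {n_rows * n_cols} ({density:.0%})")
print("Count for 'tampa' in the first text:", tampa_count)
```

Running the same calculation on `x_train` should give a much smaller density, since the vocabulary is far larger than any one tweet.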


# Training

Now let's initialize our SVM model and train it on our data. We'll feed in both the features we just extracted from the text and the labels (1 for "disaster", 0 for "not disaster"):

In [7]:
from sklearn.svm import SVC
clf = SVC(C=1)
clf.fit(x_train, train_df["target"])
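
As an optional alternative, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline, so that raw text goes in and predictions come out. The following is only a sketch on six invented sentences with invented labels (all `toy_` names are assumptions for illustration), not the notebook's actual training run:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical labelled examples: 1 = disaster, 0 = not disaster.
toy_texts = [
    "flood destroys homes in the valley",
    "earthquake damages bridge downtown",
    "wildfire forces thousands to evacuate",
    "this new album is fire",
    "my exam was a total disaster lol",
    "flooded with birthday messages today",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# SVC uses an RBF kernel by default; C controls the regularization
# (smaller C means stronger regularization).
toy_model = make_pipeline(CountVectorizer(), SVC(C=1))
toy_model.fit(toy_texts, toy_labels)

# The pipeline vectorizes raw strings internally before predicting.
toy_preds = toy_model.predict(toy_texts)
print(toy_preds)
```

A pipeline like this keeps feature extraction and the model together, which makes it harder to accidentally fit the vectorizer on test data.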

Let's now see how well we did. We can print out some evaluation metrics based on the model's performance on the test set.

In [8]:
from sklearn.metrics import classification_report, accuracy_score, f1_score

preds = clf.predict(x_test)

print(classification_report(test_df["target"], preds, target_names=["Not Disaster", "Disaster"]))
print()
print("Accuracy: ", accuracy_score(test_df["target"], preds))
print("F1: ", f1_score(test_df["target"], preds))

              precision    recall  f1-score   support

Not Disaster       0.83      1.00      0.91        39
    Disaster       1.00      0.85      0.92        55

    accuracy                           0.91        94
   macro avg       0.91      0.93      0.91        94
weighted avg       0.93      0.91      0.92        94


Accuracy:  0.9148936170212766
F1:  0.9215686274509803
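
To see where these numbers come from, the metrics can be recomputed by hand. The counts below are inferred from the report above rather than printed by the notebook (an assumption, with "Disaster" treated as the positive class): 55 disaster tweets with recall 0.85 implies 47 were found, and precision 1.00 implies no false alarms.

```python
# Confusion-matrix counts inferred from the classification report (assumption).
tp, fn = 47, 8   # disaster tweets: correctly found / missed
tn, fp = 39, 0   # non-disaster tweets: correctly rejected / false alarms

precision = tp / (tp + fp)                    # of predicted disasters, how many were real
recall = tp / (tp + fn)                       # of real disasters, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Precision:", precision)
print("Recall:   ", recall)
print("F1:       ", f1)
print("Accuracy: ", accuracy)
```

Read this way, the model never labels an ordinary tweet as a disaster on this test set, but it misses 8 of the 55 genuine disaster tweets.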


Now let's try the model on some inputs. Change the input to different sentences to test its performance:

In [9]:
input_sentence = "Plane crash near site of fire."

vector = vectorizer.transform([input_sentence])
prediction = clf.predict(vector)[0]
"Disaster" if prediction else "Not Disaster"

'Disaster'

# Additional Exercises

1. Try increasing the n-gram range of the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) with the parameter `ngram_range=(1, 2)`. The vectorizer will then extract both single words (unigrams) and pairs of adjacent words (bigrams).
2. Try modifying the [SVM parameters](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), such as `C` or `kernel`.
3. Try experimenting with other feature extraction methods, such as TF-IDF, instead of raw counts. In scikit-learn you can do this with the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
4. Try applying some of the cleaning methods from the previous notebook. Does removing links, punctuation, and stopwords help?
5. Try other simple scikit-learn models, such as decision trees and logistic regression.