<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Multi-Modal: Natural Language Binary Classification

A common kind of data set contains **just** a natural language feature and a binary label, as in sentiment analysis (positive or negative sentiment) or, in this case, deciding whether an email is "ham" (legitimate) or "spam" (unwanted).

With a single column of natural language text as our only feature, it makes sense to use the MultiModalPredictor from AutoGluon.

Our data set is sourced from here:
https://www.kaggle.com/datasets/venky73/spam-mails-dataset

## Imports

In [5]:
from autogluon.multimodal import MultiModalPredictor
from autogluon.tabular import TabularDataset

In [10]:
data = TabularDataset("data/sentiment/spam_ham_dataset.csv")

In [15]:
data.head()

| Unnamed: 0 | label | text |
|---|---|---|
| 0 | ham | Subject: enron methanol ; meter # : 988291\r\n... |
| 1 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... |
| 2 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... |
| 3 | spam | Subject: photoshop , windows , office . cheap ... |
| 4 | ham | Subject: re : indian springs\r\nthis deal is t... |


In [16]:
data['label'].value_counts()

ham     3672
spam    1499
Name: label, dtype: int64
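
The counts above show that the two classes are imbalanced, so any accuracy figure is easier to judge against a trivial baseline. As a small sketch that uses only the two counts printed above, the snippet below computes each class's share and the accuracy of a hypothetical classifier that always predicts the majority class. The variable names are illustrative and are not part of AutoGluon.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 3672, "spam": 1499}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class
# is correct exactly as often as that class occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"always-'{majority_label}' baseline accuracy: {baseline_accuracy:.1%}")
```

Roughly 71% of the emails are ham, so a useful model needs to score clearly above that level.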

In [17]:
print(data.iloc[4]['label'])
print(data.iloc[4]['text'])

ham
Subject: re : indian springs
this deal is to book the teco pvr revenue . it is my understanding that teco
just sends us a check , i haven ' t received an answer as to whether there is a
predermined price associated with this deal or if teco just lets us know what
we are giving . i can continue to chase this deal down if you need .


---

In [18]:
print(data.iloc[3]['label'])
print(data.iloc[3]["text"])

spam
Subject: photoshop , windows , office . cheap . main trending
abasements darer prudently fortuitous undergone
lighthearted charm orinoco taster
railroad affluent pornographic cuvier
irvin parkhouse blameworthy chlorophyll
robed diagrammatic fogarty clears bayda
inconveniencing managing represented smartness hashish
academies shareholders unload badness
danielson pure caffein
spaniard chargeable levin



## Train Test Split

In [19]:
train_size = int(len(data) * 0.8)
seed = 42
train_data = data.sample(train_size, random_state=seed)
test_data = data.drop(train_data.index)
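
To make the split sizes concrete, here is a minimal sketch of the same "sample 80% of the rows, keep the rest for testing" logic using only the standard library. It assumes the data set has exactly the 5171 rows counted by `value_counts()` above. Python's `random` module will not pick the same rows as pandas' `sample`, even with the same seed, so only the sizes and the disjointness carry over.

```python
import random

# Assumption: every row has a label, so the row count is ham + spam.
n_rows = 3672 + 1499

train_size = int(n_rows * 0.8)   # same formula as in the cell above
test_size = n_rows - train_size

# Mimic data.sample(...) followed by data.drop(train_data.index)
# on plain row indices.
rng = random.Random(42)
all_indices = list(range(n_rows))
train_idx = set(rng.sample(all_indices, train_size))
test_idx = [i for i in all_indices if i not in train_idx]

print("train rows:", train_size)
print("test rows: ", test_size)
print("overlap:   ", len(train_idx.intersection(test_idx)))
```

Because the test rows are whatever the sample left behind, the two sets never overlap and together cover the whole data set.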

## MultiModalPredictor

Let's train a MultiModalPredictor on this dataset. This predictor is a good fit when the only feature is natural language text. Keep in mind that a TabularPredictor is also quite capable; the difference is that with a MultiModalPredictor, AutoGluon will also attempt to train a Transformer network.

Note that without a GPU, training might take a while!

In [20]:
save_path="sentiment_analysis"

In [23]:
predictor = MultiModalPredictor(label="label", path=save_path)

In [None]:
# ONLY RUN THIS IF YOU HAVE A LOT OF TIME!
# THIS WILL TAKE HOURS TO RUN ON A CPU!
predictor.fit(train_data)

## Evaluation

After training, we can easily evaluate our predictor on separate test data formatted similarly to our training data.

In [25]:
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

{'acc': 0.9603864734299516, 'f1': 0.9313232830820771}
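
To show what the two scores measure, the sketch below computes accuracy and F1 from confusion-matrix counts. The counts are not reported by the source. They are back-computed under two assumptions: the test set has 1035 rows (20% of 5171), and "spam" is treated as the positive class. Under those assumptions the reported scores are consistent with 994 correct predictions, 278 of them correctly flagged spam, and 41 errors. How the 41 errors divide into false positives and false negatives cannot be recovered from these two scores, so the split used here is arbitrary.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Back-computed counts (assumptions, not values reported by the notebook).
tp, tn = 278, 716   # correctly predicted spam / correctly predicted ham
fp, fn = 20, 21     # arbitrary split of the 41 errors

print("accuracy:", round(accuracy(tp, fp, fn, tn), 4))
print("f1:      ", round(f1(tp, fp, fn), 4))

# F1 depends only on fp + fn, so a different split gives the same score.
print("f1 with another split:", round(f1(tp, 1, 40), 4))
```

Accuracy counts every correct prediction, while F1 ignores the true negatives, which is why it is the lower of the two scores on a data set where most emails are ham.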


## Prediction on New Data

In [27]:
email1 = "Subject: free pills\n Give me your social security number and bank account for free pills"
email2 = "Subject: Work Meeting\n Let me know if you have availability for Friday. Thanks!"

In [28]:
print(email1)

Subject: free pills
 Give me your social security number and bank account for free pills


In [29]:
print(email2)

Subject: Work Meeting
 Let me know if you have availability for Friday. Thanks!


In [30]:
predictions = predictor.predict({'email': [email1, email2]})
print('"email":', email1, '"Predicted Sentiment":', predictions[0])
print('"email":', email2, '"Predicted Sentiment":', predictions[1])

"email": Subject: free pills
 Give me your social security number and bank account for free pills "Predicted Sentiment": spam
"email": Subject: Work Meeting
 Let me know if you have availability for Friday. Thanks! "Predicted Sentiment": ham




For classification tasks, you can ask for predicted class-probabilities instead of predicted classes.

In [31]:
predictions = predictor.predict_proba({'email': [email1, email2]})
print('"email":', email1, '"Predicted Sentiment":', predictions[0])
print('"email":', email2, '"Predicted Sentiment":', predictions[1])

"email": Subject: free pills
 Give me your social security number and bank account for free pills "Predicted Sentiment": [0.39974084 0.6002591 ]
"email": Subject: Work Meeting
 Let me know if you have availability for Friday. Thanks! "Predicted Sentiment": [0.99675035 0.00324949]
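
Each row of probabilities has one entry per class. As a rough sketch of how to turn those rows back into labels, the snippet below uses the two rows printed above and assumes the column order is `["ham", "spam"]`, which is inferred from the predicted classes shown earlier rather than read from the predictor. The helper functions are hypothetical and not part of AutoGluon.

```python
# Probability rows copied from the predict_proba output above.
probabilities = [
    [0.39974084, 0.6002591],    # email1
    [0.99675035, 0.00324949],   # email2
]

# Assumed column order, inferred from the predicted classes shown earlier.
class_labels = ["ham", "spam"]


def most_likely_label(row, labels):
    """Return the label whose probability is highest."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return labels[best_index]


def label_with_threshold(row, labels, positive="spam", threshold=0.5):
    """Predict `positive` only if its probability reaches `threshold`."""
    p_positive = row[labels.index(positive)]
    negative = next(label for label in labels if label != positive)
    return positive if p_positive >= threshold else negative


for row in probabilities:
    print(most_likely_label(row, class_labels),
          label_with_threshold(row, class_labels, threshold=0.8))
```

Taking the most likely class reproduces the spam and ham predictions from the previous cell. Raising the threshold to 0.8 would relabel the first email as ham, since its spam probability is only about 0.60, which illustrates how a stricter cutoff trades missed spam for fewer false alarms.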


