# Let's build a spam classifier

We will use data from `SMS Spam Collection v. 1`, described as:

> a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-encoded messages, tagged according being legitimate (ham) or spam.

([source](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/))

#### Load useful libraries and data

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /Users/janice/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/janice/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
# Load the data
df = pd.read_csv(
    "../df/ChatGPT-play-reviews_prep.csv",
    encoding="utf-8",
)

In [5]:
# Looking at a sample of our data
df.sample(10)

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,at,replyContent,repliedAt,appVersion,reply,at_ym,at_m,at_wd,at_q,at_d,at_ymd
4649,6a7541ff-1214-4296-9c9c-d45d5b10e029,Maahi Prajapati,This is degital world and open ai chatgpt came...,5,0,2023-07-29 07:36:00,,,1.0.0023,0,2023-07,July,Saturday,3,29,07/29/23
28298,e7a4dd35-5448-4361-be03-0cbcd1e0b33f,Faridur Rahman,অসাধারণ,5,0,2023-07-25 23:35:54,,,1.0.0022,0,2023-07,July,Tuesday,3,25,07/25/23
10168,c1a9c1d4-589b-4eea-a6b2-8ff42ec94e54,Jordz Mindoza,the best app I ever use in my whole life 😉🧬 gr...,5,0,2023-09-01 11:01:23,,,1.0.0035,0,2023-09,September,Friday,3,1,09/01/23
19417,414b4d3e-cbef-4615-a998-aa637606b491,Wendy Engela,Struggles to connect to internet on a 20mbps l...,2,0,2023-09-02 00:38:51,Thank you for your feedback. Is this problem s...,2023-09-05 21:26:34,,1,2023-09,September,Saturday,3,2,09/02/23
30528,51dc8698-5e88-49be-80c4-8f8e0174c13e,Riyan Putri,Cant login,1,1,2023-08-23 13:52:19,,,,0,2023-08,August,Wednesday,3,23,08/23/23
10630,57d7e5a3-21e0-478f-b952-1ef78d3f3873,Subham Saud,This is soo good for students,5,0,2023-09-13 15:29:16,,,1.2023.242,0,2023-09,September,Wednesday,3,13,09/13/23
27759,72af2079-1841-428a-a1ae-48aed33ac453,coalws,🤯🤯,5,0,2023-07-28 06:29:31,,,1.0.0023,0,2023-07,July,Friday,3,28,07/28/23
11842,0542318e-aa65-4737-80d1-9e0440356139,the_man cave with jacob,Sign into an app to use any features? *sus*,2,0,2023-10-09 05:06:15,,,1.2023.263,0,2023-10,October,Monday,4,9,10/09/23
26510,79ee0633-c8df-475b-9dac-2457e289ad01,Ahtisham awan,best application,4,0,2023-08-10 18:06:16,,,1.0.0030,0,2023-08,August,Thursday,3,10,08/10/23
6099,6c3f1844-f106-4d8d-b061-7c6fc5e15d52,Jovany Sfive,My experience now is so good chat gpt is help ...,5,0,2023-07-31 23:00:42,,,1.0.0023,0,2023-07,July,Monday,3,31,07/31/23


## Classification

We will build a "vanilla" classifier here, without thinking too much about what the actual messages, spam or not, look like. To improve your model, you can of course take a closer look and investigate the data in more detail.

In [13]:
# Split dataset between train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['content'], df["reply"], random_state=0
)

In [14]:
X_train.head()

25900                                    Superb experience
11411    Waited for the app for too long,Now,It's here,...
11831                 Neet to enable the copy to clipboard
7576     Very useful for those who need opinions of oth...
27941                                        Excellent app
Name: content, dtype: object
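
Before modelling, it can help to check how balanced the target is and to keep that balance in both splits. The snippet below is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration); `stratify` is a standard `train_test_split` parameter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 of 40 texts (20%) belong to the positive class
toy = pd.DataFrame({
    "content": [f"text {i}" for i in range(40)],
    "reply": [1] * 8 + [0] * 32,
})

# How balanced is the target?
print(toy["reply"].value_counts(normalize=True))

# stratify keeps the class proportions the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["content"], toy["reply"], random_state=0, stratify=toy["reply"]
)
print(y_tr.mean(), y_te.mean())
```

With a rare positive class, an unstratified split can leave very few positives in the test set, which makes metrics such as AUC noisy.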

### CountVectorizer

In [17]:
# Fit the CountVectorizer (note: here on the full content column, not only X_train)
vect = CountVectorizer().fit(df['content'])

# transform the documents into a document-term matrix
# (the commented-out model cells below expect a matrix built from X_train instead)
X_train_vectorized = vect.transform(df['content'])
# print("X_train_vectorized: ", X_train_vectorized)

In [18]:
print("X_train shape = {}".format(df['content'].shape))
print("Vocabulary length = {}".format(len(vect.vocabulary_)))

X_train shape = (30956,)
Vocabulary length = 15245


So in 30,956 texts we found 15,245 different words.

In [20]:
# Let's look at our vocabulary list (sorted alphabetically)
# Does it look like you expected?
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[200:240]

[('5g', 200),
 ('5he', 201),
 ('5k', 202),
 ('5lacs', 203),
 ('5m', 204),
 ('5mb', 205),
 ('5o', 206),
 ('5star', 207),
 ('5stars', 208),
 ('5th', 209),
 ('60', 210),
 ('600', 211),
 ('62851', 212),
 ('64', 213),
 ('64th', 214),
 ('65', 215),
 ('68', 216),
 ('69', 217),
 ('6a', 218),
 ('6th', 219),
 ('70', 220),
 ('750', 221),
 ('777ffmahima', 222),
 ('78898', 223),
 ('7a', 224),
 ('7c', 225),
 ('7ec556a4dde829ea', 226),
 ('7ec5c6a5eb2dfb28', 227),
 ('7p', 228),
 ('7sdni', 229),
 ('7th', 230),
 ('7yati', 231),
 ('80', 232),
 ('800', 233),
 ('80286', 234),
 ('80mbps', 235),
 ('819', 236),
 ('838', 237),
 ('84', 238),
 ('841', 239)]

In [21]:
# We can also print the newly created feature matrix
# Note: you see it's a sparse matrix with many 0 values.
# With .toarray() the compressed sparse matrix is converted to a dense NumPy array
print(X_train_vectorized.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
# # Train the model
# model = LogisticRegression(max_iter=1500)
# model.fit(X_train_vectorized, y_train)

# # Predict the transformed test documents
# predictions = model.predict(vect.transform(X_test))
# predict_probab = model.predict_proba(vect.transform(X_test))[:,1]

# print("AUC = {:.3f}".format(roc_auc_score(y_test, predict_probab)))

In [None]:
# # get the feature names as numpy array
# feature_names = np.array(vect.get_feature_names_out())

# # Sort the coefficients from the model (from lowest to highest values)
# sorted_coef_index = model.coef_[0].argsort()

# # Find the 10 smallest and 10 largest coefficients
# # The 10 largest coefficients are being indexed using [:-11:-1]
# # so the list returned is in order of largest to smallest
# print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
# print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

The AUC of our first model was already pretty good (~0.95). Let's see if we can improve on this with another transformation of our data: next, we will test the TF-IDF transformation.
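
The commented-out model cells need a feature matrix with exactly one row per training label, so the vectorizer has to be fitted on and applied to `X_train` rather than the full column. A minimal sketch of that workflow, on hypothetical toy data (texts and labels are invented for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = complaint-like text, 0 = praise-like text
toy = pd.DataFrame({
    "content": ["cant login", "app crashes on start",
                "love this app", "best app ever"] * 10,
    "reply": [1, 1, 0, 0] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["content"], toy["reply"], random_state=0, stratify=toy["reply"]
)

# Learn the vocabulary from the training texts only
toy_vect = CountVectorizer().fit(X_tr)

# Rows of the training matrix now line up with y_tr
toy_model = LogisticRegression(max_iter=1500)
toy_model.fit(toy_vect.transform(X_tr), y_tr)

# Reuse the same fitted vectorizer for the test texts
proba = toy_model.predict_proba(toy_vect.transform(X_te))[:, 1]
auc = roc_auc_score(y_te, proba)
print("AUC = {:.3f}".format(auc))
```

Fitting the vectorizer on the training split only also avoids leaking vocabulary from the test set into the model.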




### TF-IDF

TF-IDF is short for **Term Frequency - Inverse Document Frequency**. 

It measures how important a word is to a document in a collection of texts (in our case, all the texts we collected). A word that is frequent in a document and also frequent across the corpus is less important to that document than a word that is frequent in the document but rare in the corpus.




In [24]:
# Fit the TfidfVectorizer, specifying a minimum document frequency of 30
# This means a word must appear in at least 30 reviews to be kept
# (note: as above, it is fitted on the full content column, not only X_train)
vect = TfidfVectorizer(min_df=30).fit(df['content'])

# transform the documents into a document-term matrix
X_train_vectorized = vect.transform(df['content'])

# let's look at some of the words gathered with this method
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[10:30]

[('absolutely', 10),
 ('access', 11),
 ('accessible', 12),
 ('account', 13),
 ('accuracy', 14),
 ('accurate', 15),
 ('across', 16),
 ('actually', 17),
 ('add', 18),
 ('added', 19),
 ('adding', 20),
 ('addition', 21),
 ('ads', 22),
 ('advanced', 23),
 ('advice', 24),
 ('after', 25),
 ('again', 26),
 ('age', 27),
 ('ago', 28),
 ('ai', 29)]

In [25]:
# how many words appear in at least 30 texts (min_df=30)
len(vect.vocabulary_)

1010
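
The effect of `min_df` is easy to see on a tiny example. This is a sketch with made-up documents and a threshold of 2 instead of 30:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = ["good app", "good answers", "bad login", "good login"]

all_words = sorted(CountVectorizer().fit(docs).vocabulary_)
# Keep only words that occur in at least 2 documents
frequent_words = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(all_words)       # ['answers', 'app', 'bad', 'good', 'login']
print(frequent_words)  # ['good', 'login']
```

An integer `min_df` is an absolute document count; a float between 0 and 1 would be read as a proportion of documents.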

We can check which words created the largest tfidf values for the texts.

In [26]:
# save all feature names == words in an array
feature_names = np.array(vect.get_feature_names_out())

# sort the column indices by the highest tf-idf value in each column
sorted_tfidf_index = X_train_vectorized.toarray().max(0).argsort()

# print words with highest and lowest tfidf values
print("Smallest tfidf:\n{}\n".format(feature_names[sorted_tfidf_index[:10]]))
print("Largest tfidf: \n{}".format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['exceeded' 'processing' 'consistently' 'versatility' 'powered' 'coherent'
 'natural' 'however' 'whether' 'interactions']

Largest tfidf: 
['কর' 'size' 'kudos' 'la' 'language' 'learning' 'let' 'level' 'life'
 'like']


In [None]:
# # Train the model
# model = LogisticRegression(max_iter=1500)
# model.fit(X_train_vectorized, y_train)

# # Predict the transformed test documents
# predictions = model.predict_proba(vect.transform(X_test))[:,1]

# print("AUC = {:.3f}".format(roc_auc_score(y_test, predictions)))

In [None]:
# # Sort the coefficients from the model
# sorted_coef_index = model.coef_[0].argsort()

# # Find the 10 smallest and 10 largest coefficients
# # The 10 largest coefficients are being indexed using [:-11:-1]
# # so the list returned is in order of largest to smallest
# print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
# print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

### Stemming

Stemming reduces a word to its stem. The result is less readable to humans, but makes the text more comparable across observations.

For example, the words "consult", "consultant", "consulting", "consultative", "consultants" have the same stem **"consult"**.

We will now add stemming as a preprocessing step to our workflow. The nltk PorterStemmer generates the stems of the words. The CountVectorizer then uses these stems as features to build the document-term matrix.
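
A quick way to get a feel for the stemmer is to call it on single words. The sketch below uses the "consult" word family from above, plus a few words from the sample review shown later in this notebook:

```python
import nltk

stemmer = nltk.PorterStemmer()

# The word family mentioned in the text
family = ["consult", "consultant", "consulting", "consultative", "consultants"]
print({w: stemmer.stem(w) for w in family})

# Words taken from the sample review further down in this notebook
review_words = ["wonderful", "stars", "editing", "message", "has"]
stems = [stemmer.stem(w) for w in review_words]
print(stems)  # ['wonder', 'star', 'edit', 'messag', 'ha']
```

Stems such as `messag` or `ha` are not dictionary words, which is the readability trade-off mentioned above.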

In [27]:
# Initializing stemmer and countvectorizer 
stemmer = nltk.PorterStemmer()
cv_analyzer = CountVectorizer().build_analyzer()
# tfidf_analyzer = TfidfVectorizer(min_df=15).build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in cv_analyzer(doc))

# define CountVectorizer with stemming function 
stem_vectorizer = CountVectorizer(analyzer = stemmed_words)
# stem_vectorizer = TfidfVectorizer(min_df=15, analyzer = stemmed_words)


# Fit on and transform the full content column
df_content_stem_vectorized = stem_vectorizer.fit_transform(df['content'])

In [31]:
r = 100
sample_text = X_train[r:r+1]
print("Sample Text - ", sample_text[sample_text.index[0]])
print("-"*30)
print("Text after passing through build_analyzer - ", cv_analyzer(sample_text[sample_text.index[0]]))
print("-"*30)
print("Text after stemming - ",[stemmer.stem(w) for w in cv_analyzer(sample_text[sample_text.index[0]])])


Sample Text -  It's wonderful. I would give it 5 stars, if it had EDITING option after the message has been sent.
------------------------------
Text after passing through build_analyzer -  ['it', 'wonderful', 'would', 'give', 'it', 'stars', 'if', 'it', 'had', 'editing', 'option', 'after', 'the', 'message', 'has', 'been', 'sent']
------------------------------
Text after stemming -  ['it', 'wonder', 'would', 'give', 'it', 'star', 'if', 'it', 'had', 'edit', 'option', 'after', 'the', 'messag', 'ha', 'been', 'sent']


You can also try uncommenting the tfidf lines in the cell above to use TfidfVectorizer instead of CountVectorizer.
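
Because several word forms collapse onto one stem, a stemmed vectorizer usually ends up with a smaller vocabulary. A minimal sketch on hypothetical documents, using the same custom-analyzer pattern as the cell above:

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus with several forms of the same words
docs = ["editing messages", "edit message", "edited the message"]

stemmer = nltk.PorterStemmer()
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # Tokenize with the default analyzer, then stem each token
    return [stemmer.stem(w) for w in base_analyzer(doc)]

plain_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
stemmed_vocab = sorted(CountVectorizer(analyzer=stemmed_words).fit(docs).vocabulary_)

print(len(plain_vocab), plain_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```

Fewer, denser features can help a linear model generalize, although the benefit depends on the data.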

### Lemmatization

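In the same way we used stemming, we can also apply lemmatization to our data.
Lemmatization reduces variant forms to their base form (e.g. am, are, is --> be; car, cars, car's, cars' --> car). Note that nltk's `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is passed, which is why verb forms such as "amazed" and "exists" come out unchanged in the sample below.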


In [33]:
# Initialization
WNlemma = nltk.WordNetLemmatizer()
cv_analyzer = CountVectorizer().build_analyzer()
# cv_analyzer = TfidfVectorizer(min_df=15).build_analyzer()

def lemmatize_word(doc):
    return (WNlemma.lemmatize(t) for t in cv_analyzer(doc))

lemm_vectorizer = CountVectorizer(analyzer = lemmatize_word)
# lemm_vectorizer = TfidfVectorizer(min_df=15, analyzer=lemmatize_word)

# Fit on and transform the full content column
df_content_lemm_vectorized = lemm_vectorizer.fit_transform(df['content'])

In [34]:
df_content_lemm_vectorized.shape

(30956, 14215)

In [37]:
r = 300
sample_text = X_train[r:r+1]
print("Sample Text - ", sample_text[sample_text.index[0]])
print("-"*30)
print("Text after passing through build_analyzer - ", cv_analyzer(sample_text[sample_text.index[0]]))
print("-"*30)
print("Text after lemmatization - ",[WNlemma.lemmatize(t) for t in cv_analyzer(sample_text[sample_text.index[0]])])

Sample Text -  I'm really amazed 👏 this kind of ai also exists in the world 😇
------------------------------
Text after passing through build_analyzer -  ['really', 'amazed', 'this', 'kind', 'of', 'ai', 'also', 'exists', 'in', 'the', 'world']
------------------------------
Text after lemmatization -  ['really', 'amazed', 'this', 'kind', 'of', 'ai', 'also', 'exists', 'in', 'the', 'world']


In [38]:
from transformers import pipeline
from tqdm import tqdm

In [39]:
classifier = pipeline("zero-shot-classification",device = 0)

No model was supplied, defaulted to roberta-large-mnli and revision 130fb28 (https://huggingface.co/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [56]:
candidate_labels = ["feature evaluation", "praise", "bug report", "feature request", "performance", "usage"]

In [66]:
predictedCategories = []
Content = []
# Classify the first 100 reviews
for i in tqdm(range(100)):
    text = df['content'].iloc[i]
    res = classifier(text, candidate_labels, multi_label=False)
    # The pipeline returns the labels sorted by descending score,
    # so the first label is the best match
    categories = res['labels'][0]

    predictedCategories.append(categories)
    Content.append(text)

100%|██████████| 100/100 [03:06<00:00,  1.87s/it]


In [81]:
res

{'sequence': "ChatGPT for Android is a fantastic language model app! Its responsiveness and accuracy in understanding my queries impressed me. It offers a wide range of information, making it a reliable source for quick answers. Additionally, the app's user-friendly interface and smooth navigation make the experience enjoyable. Overall, ChatGPT for Android is a must-have tool for anyone seeking assistance with language-related tasks on the go. Kudos to the developers for creating such a helpful and efficient",
 'labels': ['praise',
  'feature evaluation',
  'performance',
  'feature request',
  'usage',
  'bug report'],
 'scores': [0.6620474457740784,
  0.18852728605270386,
  0.05828249827027321,
  0.0422508679330349,
  0.04120683670043945,
  0.0076851374469697475]}
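
The result is a plain dictionary, so picking the winning label needs no extra tooling. The sketch below reuses the labels and scores printed above; the 0.5 threshold and the "uncertain" fallback are hypothetical choices for illustration, not something the pipeline provides.

```python
# Labels and scores copied from the output above
res_example = {
    "labels": ["praise", "feature evaluation", "performance",
               "feature request", "usage", "bug report"],
    "scores": [0.6620474457740784, 0.18852728605270386, 0.05828249827027321,
               0.0422508679330349, 0.04120683670043945, 0.0076851374469697475],
}

# Labels come sorted by descending score, so the first one is the best match
top_label = res_example["labels"][0]
top_score = res_example["scores"][0]

# With multi_label=False the scores behave like probabilities over the labels
total = sum(res_example["scores"])

# Hypothetical rule: only trust predictions above a confidence threshold
THRESHOLD = 0.5
final_label = top_label if top_score >= THRESHOLD else "uncertain"
print(top_label, round(top_score, 3), round(total, 3), final_label)
```

A threshold like this can keep low-confidence predictions out of downstream counts.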

predictedCategories

In [58]:
dataset = pd.DataFrame({'Category': predictedCategories, 'Content': Content})

In [64]:
pd.set_option('max_colwidth', 500)
dataset.head(40)

Unnamed: 0,Category,Content
0,praise,"ChatGPT on Android is a solid app with seamless OpenAI server connectivity, ensuring smooth interactions. However, it falls behind its Apple counterpart in features and updates. The voice input can be prematurely triggered by pauses, unlike on Apple. Additionally, the lack of a search function for previous messages is a drawback. Despite these, it remains a commendable app, deserving a 4-5 star rating. With minor improvements, it can match its Apple version."
1,praise,"I've been using chatGPT for a while but I've just tested out the microphone speech recognition option for the first time, and let's say... I'M COMPLETELY BLOWN AWAY. NO, SERIOUSLY. It literally puts ALL the expressions, punctuation, in the right place. No matter how you talk, it converts it without a problem. It's amazing and I will probably will never type to ChatGPT again! Still though... that's some outstanding work. Now we wait for voice responses from the Bot! Hopefully..."
2,praise,"The ChatGPT Android app has completely blown me away with its exceptional performance and versatility! As an AI language model, it consistently delivers impressive responses to my queries and provides insightful suggestions. The ease of use and intuitive interface make chatting with ChatGPT a delightful experience. In all seriousness, it's good. It works. I would still recommend using the browser though, as the app lacks a few features, such as editing a message after it's sent."
3,praise,"No subscription, free and accurate, unbiased answers. No fresh data since September 2021 but that doesn't make this app less accurate. I'm loving the fact it can even generate basic ASCII ART other than just text. The text is excellent, grammar precise and a great plethora of vocabulary. I can't wait for the next update and to see what that brings."
4,feature request,"I use this app for learning languages, which chatgpt is amazing at. However, there are a few things I'd like added. 1. A search function. Being able to search a word that's several tokens back would be great. 2. Ability to highlight a word and 'search with Google'. 3. An AI voice that can read back chatgpt's outputs. For language learning especially, it'd be very useful to hear how certain words are pronounced. But still, great client for chatgpt! Can't wait for future improvements"
5,feature evaluation,"Seems to work now. App seems nice but has two issues. The website mentions that users can enable voice conversations in ""Settings → App → New Features → Voice conversations (toggle on)"" but this setting doesn't exist. Also, when using voice dictation to compose a prompt, it writes mandarin text using simplified characters rather than standard form/traditional characters."
6,feature evaluation,"This app is a mixed bag. On one hand, response times are vastly superior to those in the browser, custom instructions set in the browser are carried over and the ability to say what you want as a message is a crucial accessibility feature. On the other hand, messages are not possible to edit, which is a severe shortcoming in my opinion. The app is also quite basic, and any discussions done in the browser can't be continued in the app. Overall, it has great potential and can easily be improved."
7,performance,"You've been using ChatGPT AI for a while, and it's among the best AI companions. It's lightning-fast and responsive, aiding you effectively. However, the app lags behind the web version in performance. The AI provides insightful and accurate responses for various tasks. The app interface is clean and user-friendly, but it can stutter compared to the web version. Despite this, it's a game-changer and a great on-the-go tool. Some performance optimizations could make it perfect"
8,bug report,"Voice conversations choppy, doesn't work anymore. It was working great, but then one day all of the voices never choppy and clicky. I can't make out what they're saying anymore. It also has a much harder time picking up what I'm saying. Sometimes it thinks I'm just saying ""bye"". Cleared cache, reinstalled, didn't help. Anyways, great app otherwise - looking forward to fixes!"
9,feature evaluation,"I love the app so far, and voice is great, but a few issues need to be ironed out before a five star is reasonable: 1. Voice doesn’t recognize for about a second or a little less. Once it tells you to talk, you have to wait about a second before it actually recognizes what you say. 2. When you leave voice to go back to messages, you always have to scroll down. 3. The scroll-to-bottom button appears even if you are at the bottom. 4. No way to view message log during voice communication."
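
To summarize the predictions, the categories can be counted with pandas. This sketch uses only the ten categories visible in the table above, in a stand-in frame called `shown` (a hypothetical name):

```python
import pandas as pd

# Categories of the ten rows displayed above
shown = pd.DataFrame({
    "Category": ["praise", "praise", "praise", "praise", "feature request",
                 "feature evaluation", "feature evaluation", "performance",
                 "bug report", "feature evaluation"],
})

# Number of reviews per predicted category, most frequent first
counts = shown["Category"].value_counts()
print(counts)
```

The same call on `dataset['Category']` would give the distribution over all 100 classified reviews.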


In [67]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


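In [72]:
def preprocess_function(examples):
    # Tokenize a batch of reviews; `examples` maps column names to lists of values
    return tokenizer(examples["content"], truncation=True)

In [75]:
# A pandas DataFrame has no batched `.map`; convert it to a Hugging Face Dataset first
# (assumes the `datasets` package is installed and `content` holds only strings)
from datasets import Dataset

tokenized_df = Dataset.from_pandas(df).map(preprocess_function, batched=True)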

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [84]:
df['content'][0:100]

0                                          ChatGPT on Android is a solid app with seamless OpenAI server connectivity, ensuring smooth interactions. However, it falls behind its Apple counterpart in features and updates. The voice input can be prematurely triggered by pauses, unlike on Apple. Additionally, the lack of a search function for previous messages is a drawback. Despite these, it remains a commendable app, deserving a 4-5 star rating. With minor improvements, it can match its Apple version.
1                      I've been using chatGPT for a while but I've just tested out the microphone speech recognition option for the first time, and let's say... I'M COMPLETELY BLOWN AWAY. NO, SERIOUSLY. It literally puts ALL the expressions, punctuation, in the right place. No matter how you talk, it converts it without a problem. It's amazing and I will probably will never type to ChatGPT again! Still though... that's some outstanding work. Now we wait for voice responses from the Bot! H

In [86]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(df['content'][1])

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'LABEL_1', 'score': 0.9982408285140991}]

In [88]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
# return_tensors="pt" requires PyTorch; the models in this notebook were loaded as
# TensorFlow models (TFDistilBert...), so request TensorFlow tensors instead
inputs = tokenizer(df['content'][1], return_tensors="tf")