# Let's build a spam classifier

We will use data from `SMS Spam Collection v. 1`, described as:

> a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-encoded messages, tagged according being legitimate (ham) or spam.

([source](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/))

#### Load useful libraries and data

In [24]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


from tqdm import tqdm

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /Users/janice/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/janice/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
# Load data
df = pd.read_csv(
    "../data/ChatGPT-play-reviews_prep.csv",
    encoding="utf-8",
)

In [3]:
# Look at a sample of our data
df.sample(10)

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,at,replyContent,repliedAt,appVersion,reply,at_ym,at_m,at_wd,at_q,at_d,at_ymd
6333,a58ae6d0-17a2-4cf9-b14a-e4d1c734fcda,WaHeEd BaLoCh Ms,App is fantastic but there should be image abi...,4,0,2023-08-02 07:58:07,,,1.0.0023,0,2023-08,August,Wednesday,3,2,08/02/23
26842,7882b5cc-01fc-4059-8efb-84b7207d852c,Aaa Aaa,Behatareen,5,0,2023-08-18 20:05:23,,,1.0.0026,0,2023-08,August,Friday,3,18,08/18/23
11040,18ba3dd5-90a6-401d-bfe1-fcacb9a7551a,mohammed mujtaba ali,I am loving it more than web version.!,5,0,2023-07-30 09:45:45,,,1.0.0023,0,2023-07,July,Sunday,3,30,07/30/23
27,75f60c3c-629d-4fe9-9d06-643e10a4ec15,Mysoul Hurts,This app is better than I expected. The interf...,4,151,2023-07-31 19:54:58,,,1.0.0023,0,2023-07,July,Monday,3,31,07/31/23
1800,358d07a6-45d3-49db-9432-00486912a85d,Orçun Altundağ,Requires privacy invasive Google Chrome just t...,1,0,2023-08-03 15:47:23,Thank you for your feedback. We've improved lo...,2023-08-09 22:41:24,1.0.0026,1,2023-08,August,Thursday,3,3,08/03/23
16394,5041eef1-8d39-458f-914b-de1841633357,Vishal Kumar,"Really fabulous 😻, sam love u for creating it",5,0,2023-07-25 19:12:54,,,1.0.0016,0,2023-07,July,Tuesday,3,25,07/25/23
4948,824e7846-fee1-4375-b19e-965bc2c819b4,Mr Himanshu Tyagi,Chat GPT is good application for every person....,5,0,2023-07-26 08:56:05,,,1.0.0022,0,2023-07,July,Wednesday,3,26,07/26/23
13971,076808e0-a156-45f1-b552-5245933824d9,Manioq V,When new training data,5,0,2023-09-17 11:32:35,,,1.2023.242,0,2023-09,September,Sunday,3,17,09/17/23
24878,4e2cc766-f362-42b7-965b-a36861c92868,Soch Prosper,Awesome,5,0,2023-09-04 19:36:24,,,1.0.0039,0,2023-09,September,Monday,3,4,09/04/23
10793,b16f30e1-c088-4815-a57f-ffbff83ad87c,Piyush Unjiya,This app is very awesome and wrote fast ⚡,5,0,2023-09-05 15:35:18,,,1.0.0039,0,2023-09,September,Tuesday,3,5,09/05/23


## Classification

Here we will build a "vanilla" classifier, without putting too much thought into what the actual messages, spam or not, look like. To improve your model you can of course take a closer look and investigate the data in more detail.

In [4]:
# Split dataset between train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['content'], df["reply"], random_state=0
)

In [5]:
X_train.head()

25900                                    Superb experience
11411    Waited for the app for too long,Now,It's here,...
11831                 Neet to enable the copy to clipboard
7576     Very useful for those who need opinions of oth...
27941                                        Excellent app
Name: content, dtype: object

### CountVectorizer

In [6]:
# Fit the CountVectorizer on the full 'content' column (all rows, not only X_train)
vect = CountVectorizer().fit(df['content'])

# Transform the same documents into a document-term matrix
X_train_vectorized = vect.transform(df['content'])
# print("X_train_vectorized: ", X_train_vectorized)

In [7]:
print("X_train shape = {}".format(df['content'].shape))
print("Vocabulary length = {}".format(len(vect.vocabulary_)))

X_train shape = (30956,)
Vocabulary length = 15245


So in 30,956 messages we found 15,245 distinct tokens.
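
To make the counting step concrete, here is a minimal pure-Python sketch of what a bag-of-words vectorizer does. The toy corpus and the helper `tokenize` are made up for illustration; the regular expression is assumed to mirror CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
from collections import Counter
import re

# Toy corpus (made up for illustration, not taken from the dataset)
docs = ["Great app, great answers", "App crashes on login", "great"]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# Document-term matrix: one row per document, one column per vocabulary entry
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts.get(tok, 0) for tok in vocabulary])

print(vocabulary)
print(matrix)
```

Each row has as many entries as the vocabulary has tokens, and most entries are 0, which is why the real vectorizer stores the result as a sparse matrix.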

In [8]:
# Let's look at our vocabulary list (sorted alphabetically)
# Does it look like you expected?
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[200:240]

[('5g', 200),
 ('5he', 201),
 ('5k', 202),
 ('5lacs', 203),
 ('5m', 204),
 ('5mb', 205),
 ('5o', 206),
 ('5star', 207),
 ('5stars', 208),
 ('5th', 209),
 ('60', 210),
 ('600', 211),
 ('62851', 212),
 ('64', 213),
 ('64th', 214),
 ('65', 215),
 ('68', 216),
 ('69', 217),
 ('6a', 218),
 ('6th', 219),
 ('70', 220),
 ('750', 221),
 ('777ffmahima', 222),
 ('78898', 223),
 ('7a', 224),
 ('7c', 225),
 ('7ec556a4dde829ea', 226),
 ('7ec5c6a5eb2dfb28', 227),
 ('7p', 228),
 ('7sdni', 229),
 ('7th', 230),
 ('7yati', 231),
 ('80', 232),
 ('800', 233),
 ('80286', 234),
 ('80mbps', 235),
 ('819', 236),
 ('838', 237),
 ('84', 238),
 ('841', 239)]

In [9]:
# We can also print the newly created feature matrix
# Note: it's a sparse matrix with many 0 values.
# .toarray() converts the compressed sparse matrix to a regular (dense) NumPy array
print(X_train_vectorized.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [10]:
# # Train the model
# model = LogisticRegression(max_iter=1500)
# model.fit(X_train_vectorized, y_train)

# # Predict the transformed test documents
# predictions = model.predict(vect.transform(X_test))
# predict_probab = model.predict_proba(vect.transform(X_test))[:,1]

# print("AUC = {:.3f}".format(roc_auc_score(y_test, predict_probab)))

In [11]:
# # get the feature names as numpy array
# feature_names = np.array(vect.get_feature_names_out())

# # Sort the coefficients from the model (from lowest to highest values)
# sorted_coef_index = model.coef_[0].argsort()

# # Find the 10 smallest and 10 largest coefficients
# # The 10 largest coefficients are being indexed using [:-11:-1]
# # so the list returned is in order of largest to smallest
# print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
# print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

The AUC of our first model was already pretty good (~0.95). Let's see if we can improve this with another transformation of our data. To that end, we will test the TF-IDF transformation next.
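
As a reminder of what the AUC measures, the sketch below computes it by hand as the share of (positive, negative) pairs in which the positive example receives the higher score. The function `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the notebook; in practice `roc_auc_score` does this job.

```python
def roc_auc(y_true, scores):
    # Split the scores by their true class
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the metric depends only on the order of the predicted probabilities, not on a decision threshold.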




### TF-IDF

TF-IDF is short for **Term Frequency - Inverse Document Frequency**. 

It measures how important a word is to a document within a collection of texts (in our case, all the messages we collected). A word that is frequent in one document but rare in the rest of the corpus is weighted higher than a word that is frequent in that document and also frequent across the corpus.
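
The weighting can be sketched by hand as follows. This is a minimal illustration on a made-up, pre-tokenized corpus; it assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each document to unit length, which to my understanding matches the TfidfVectorizer defaults. The helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for illustration)
docs = [
    ["good", "app", "good"],
    ["good", "answers"],
    ["app", "crashes"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rare terms get a larger value
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw term count times idf, then scaled to unit Euclidean length
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = [tfidf(doc) for doc in docs]
print(weights[1])
```

In the second document both words occur once, but "answers" appears in only one document of the corpus while "good" appears in two, so "answers" ends up with the larger weight.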




In [12]:
# Fit the TfidfVectorizer on the full 'content' column, specifying a minimum document frequency of 30
# This means a word must appear in at least 30 reviews to enter the vocabulary
vect = TfidfVectorizer(min_df=30).fit(df['content'])

# Transform the same documents into a document-term matrix
X_train_vectorized = vect.transform(df['content'])

# Let's look at some of the words gathered with this method
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[10:30]

[('absolutely', 10),
 ('access', 11),
 ('accessible', 12),
 ('account', 13),
 ('accuracy', 14),
 ('accurate', 15),
 ('across', 16),
 ('actually', 17),
 ('add', 18),
 ('added', 19),
 ('adding', 20),
 ('addition', 21),
 ('ads', 22),
 ('advanced', 23),
 ('advice', 24),
 ('after', 25),
 ('again', 26),
 ('age', 27),
 ('ago', 28),
 ('ai', 29)]

In [13]:
# How many words appear in at least 30 reviews (min_df=30)?
len(vect.vocabulary_)

1010

We can check which words created the largest tfidf values for the texts.

In [14]:
# save all feature names == words in an array
feature_names = np.array(vect.get_feature_names_out())

# Sort the column indices by the highest tfidf value found in each column
sorted_tfidf_index = X_train_vectorized.toarray().max(0).argsort()

# print words with highest and lowest tfidf values
print("Smallest tfidf:\n{}\n".format(feature_names[sorted_tfidf_index[:10]]))
print("Largest tfidf: \n{}".format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['exceeded' 'processing' 'consistently' 'versatility' 'powered' 'coherent'
 'natural' 'however' 'whether' 'interactions']

Largest tfidf: 
['কর' 'size' 'kudos' 'la' 'language' 'learning' 'let' 'level' 'life'
 'like']


In [15]:
# # Train the model
# model = LogisticRegression(max_iter=1500)
# model.fit(X_train_vectorized, y_train)

# # Predict the transformed test documents
# predictions = model.predict_proba(vect.transform(X_test))[:,1]

# print("AUC = {:.3f}".format(roc_auc_score(y_test, predictions)))

In [16]:
# # Sort the coefficients from the model
# sorted_coef_index = model.coef_[0].argsort()

# # Find the 10 smallest and 10 largest coefficients
# # The 10 largest coefficients are being indexed using [:-11:-1]
# # so the list returned is in order of largest to smallest
# print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
# print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

### Stemming

Stemming reduces a word to its stem. The result is less readable to humans, but makes the text more comparable across observations.

For example, the words "consult", "consultant", "consulting", "consultative", "consultants" have the same stem **"consult"**.

We will now add stemming as a preprocessing step to our workflow. The nltk PorterStemmer will generate the stems of the words. The CountVectorizer then uses these stems as features, producing a document-term matrix with one column per stemmed word.
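
The example above can be checked directly with the stemmer. This is a small sketch that assumes nltk is installed; the PorterStemmer needs no additional corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The word family from the example above
words = ["consult", "consultant", "consulting", "consultative", "consultants"]

# Stem each word individually
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Because all five words collapse to one stem, they end up in a single column of the document-term matrix instead of five.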

In [17]:
# Initialize the stemmer and the CountVectorizer analyzer
stemmer = nltk.PorterStemmer()
cv_analyzer = CountVectorizer().build_analyzer()
# tfidf_analyzer = TfidfVectorizer(min_df=15).build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in cv_analyzer(doc))

# define CountVectorizer with stemming function 
stem_vectorizer = CountVectorizer(analyzer = stemmed_words)
# stem_vectorizer = TfidfVectorizer(min_df=15, analyzer = stemmed_words)


# Fit and transform the full 'content' column
df_content_stem_vectorized = stem_vectorizer.fit_transform(df['content'])

In [18]:
r = 300
sample_text = X_train[r:r+1]
print("Sample Text - ", sample_text[sample_text.index[0]])
print("-"*30)
print("Text after passing through build_analyzer - ", cv_analyzer(sample_text[sample_text.index[0]]))
print("-"*30)
print("Text after stemming - ",[stemmer.stem(w) for w in cv_analyzer(sample_text[sample_text.index[0]])])


Sample Text -  I'm really amazed 👏 this kind of ai also exists in the world 😇
------------------------------
Text after passing through build_analyzer -  ['really', 'amazed', 'this', 'kind', 'of', 'ai', 'also', 'exists', 'in', 'the', 'world']
------------------------------
Text after stemming -  ['realli', 'amaz', 'thi', 'kind', 'of', 'ai', 'also', 'exist', 'in', 'the', 'world']


You can also try uncommenting the tfidf lines in the cell above to use the TfidfVectorizer instead of the CountVectorizer.

### Lemmatization

In the same way we used stemming, we can also apply lemmatization to our data.
Lemmatization reduces variant forms to their base form (e.g. am, are, is --> be; car, cars, car's, cars' --> car).


In [19]:
# Initialization
WNlemma = nltk.WordNetLemmatizer()
cv_analyzer = CountVectorizer().build_analyzer()
# cv_analyzer = TfidfVectorizer(min_df=15).build_analyzer()

def lemmatize_word(doc):
    return (WNlemma.lemmatize(t) for t in cv_analyzer(doc))

lemm_vectorizer = CountVectorizer(analyzer = lemmatize_word)
# lemm_vectorizer = TfidfVectorizer(min_df=15, analyzer=lemmatize_word)

# Fit and transform the full 'content' column
df_content_lemm_vectorized = lemm_vectorizer.fit_transform(df['content'])

In [20]:
df_content_lemm_vectorized.shape

(30956, 14215)

In [21]:
r = 500
sample_text = X_train[r:r+1]
print("Sample Text - ", sample_text[sample_text.index[0]])
print("-"*30)
print("Text after passing through build_analyzer - ", cv_analyzer(sample_text[sample_text.index[0]]))
print("-"*30)
print("Text after lemmatization - ",[WNlemma.lemmatize(t) for t in cv_analyzer(sample_text[sample_text.index[0]])])

Sample Text -  🥰🥰⚡ ⚡🔥🔥
------------------------------
Text after passing through build_analyzer -  []
------------------------------
Text after lemmatization -  []


In [22]:
from transformers import pipeline
from tqdm import tqdm

In [23]:
classifier = pipeline("zero-shot-classification",device = 0)

No model was supplied, defaulted to roberta-large-mnli and revision 130fb28 (https://huggingface.co/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [37]:
#candidate_labels = ["feature evaluation", "praise", "bug report", "feature request", "performance", "usage"]
candidate_labels = ["positive", "neutral", "negative"]

In [38]:
predictedCategories = []
Content = []
for i in tqdm(range(50)):
    text = df.iloc[i]['content']
    # Score the review against every candidate label
    res = classifier(text, candidate_labels, multi_label=False)
    # Pair each label with its score and keep the highest-scoring label
    top_label, _ = max(zip(res['labels'], res['scores']), key=lambda pair: pair[1])

    predictedCategories.append(top_label)
    Content.append(text)

100%|██████████| 50/50 [01:24<00:00,  1.68s/it]


In [44]:
df.iloc[0:100,]['score']

0     4
1     5
2     4
3     5
4     4
     ..
95    5
96    1
97    5
98    5
99    5
Name: score, Length: 100, dtype: int64

In [39]:
res

{'sequence': "Generally useful. I wish that it would offer the option to read aloud it's response to me, or even better, detect that I provided voice (to text) input in forming my query and THEN respond by reading aloud. Also, I appreciate that the app, by default, adjusts it's UI theme based on my system settings. Well done. However, that only kicked in after I had logged in after installation. It's initial login/setup screen used the default BRIGHT WHITE high contrast scheme that is very painful for me",
 'labels': ['positive', 'neutral', 'negative'],
 'scores': [0.45063674449920654, 0.2816002368927002, 0.26776307821273804]}

predictedCategories

In [46]:
dataset = pd.DataFrame({'Category': predictedCategories, 'Content': Content, 'Star': df.iloc[0:50,]['score']})

In [47]:
pd.set_option('max_colwidth', 700)
dataset.sort_values('Category', ascending=True)

Unnamed: 0,Category,Content,Star
0,negative,"ChatGPT on Android is a solid app with seamless OpenAI server connectivity, ensuring smooth interactions. However, it falls behind its Apple counterpart in features and updates. The voice input can be prematurely triggered by pauses, unlike on Apple. Additionally, the lack of a search function for previous messages is a drawback. Despite these, it remains a commendable app, deserving a 4-5 star rating. With minor improvements, it can match its Apple version.",4
44,negative,"Good at explaining things in specific ways i could never instantly get a person to do. Only issue is that there should be some options: Such as, being able to delete certain segments of a long conversation that you no longer want to be shown. Or at least being able to search a chat for mentions of a word. Things like that which prevent ppl from having to scroll through everything in a chat that's been going on for a while. This app is very helpful AND entertaining though lol",4
43,negative,"I have been using the ChatGPT since its release, and while I see its potential, there are a few issues that need to be addressed to enhance the user experience. I noticed that when I input a prompt text and press the backspace key, it unnecessarily removes the space between two immediate words. This can be quite frustrating as it disrupts the flow of my queries and makes the conversation feel disjointed.",3
37,negative,"I generally really like it but it's annoying that you have to press a button to get to the history, instead of it just being in the sidebar. Especially if you want to klick on multiple chats in a row it's REALLY annoying that you have to scroll down every time and find the chat where you left off. It also bugs me that editing a message isn't possible like it is on the website, because that's VERY useful. Also the option to add a secondary language would be nice, because I use German and English.",4
36,negative,"The animation that appears when I open the app runs twice, so if I start writing right away, everything I wrote disappears when the animation runs the second time and the big dot remains on the screen. It's annoying having to wait for the second animation or having to write everything all over again.",3
35,negative,"Solid app, very clean interface. However it'd be nice if there was an option to have your history on the sidebar like on the web version. More importantly, I find it a little baffling that there is no option to edit and re-submit a query on the app. That's arguably one of the key features of ChatGPT, so I'm disappointed that it didn't make it's way to the official android application.",3
34,negative,"The app mostly does what I expect it to : it's a good alternative from using ChatGPT in the browser on mobile. However, a feature that is present in the web version and that is missing in this app is the ability to render latex. This is a really important feature to me, as I don't want to have to go back and forth between the app and a latex/markdown renderer to discuss maths with ChatGPT, and the lack of this feature is the reason for the missing 2 stars in my review.",3
30,negative,"Definitely more handy than opening browser for the web version, but inferior in all other aspects. The biggest problem for me is that I can't edit my prompts. As a plus subscriber I find the lack of beta features to be inconvenient. Another thing is that the UI is too big. This is especially annoying when reading and code snippets, as only a few words can fit in a row of text.",3
25,negative,"Smooth design and wonderful experience. However, I have some criticisms: - I would like to have easy access to my conversation history without having to go through the entire chat history. - When I open an old conversation, it scrolls from the first text I sent to the last one, causing a long waiting time. - The code blocks don't come with colors; they only display as black text on a white background.""",4
18,negative,"Easily the best app I've ever downloaded. Highly impressed. I only have 1 concern at the moment, and that is the fear of potentially losing a chat or chats getting too long for the AI to ""remember"" past details. Would love to see 1.) A way to save chats permanently. 2.) A search function so that I can search for past information, so that I can copy and paste it to ""refresh"" the AI's memory for context that has fallen outside it's context window.",5


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
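def preprocess_function(examples):
    # Tokenize a batch of reviews, truncating to the model's maximum length
    return tokenizer(examples["content"], truncation=True)

In [None]:
# .map(..., batched=True) is a Hugging Face `datasets.Dataset` method, not a pandas one,
# so convert the DataFrame first (requires the `datasets` package)
from datasets import Dataset

tokenized_df = Dataset.from_pandas(df).map(preprocess_function, batched=True)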

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
df['content'][0:100]

0                                          ChatGPT on Android is a solid app with seamless OpenAI server connectivity, ensuring smooth interactions. However, it falls behind its Apple counterpart in features and updates. The voice input can be prematurely triggered by pauses, unlike on Apple. Additionally, the lack of a search function for previous messages is a drawback. Despite these, it remains a commendable app, deserving a 4-5 star rating. With minor improvements, it can match its Apple version.
1                      I've been using chatGPT for a while but I've just tested out the microphone speech recognition option for the first time, and let's say... I'M COMPLETELY BLOWN AWAY. NO, SERIOUSLY. It literally puts ALL the expressions, punctuation, in the right place. No matter how you talk, it converts it without a problem. It's amazing and I will probably will never type to ChatGPT again! Still though... that's some outstanding work. Now we wait for voice responses from the Bot! H

In [None]:
#id2label = {0: "NEGATIVE", 1: "POSITIVE"}
#label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [66]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(df['content'][1])

RuntimeError: Failed to import transformers.models.gpt_bigcode.modeling_gpt_bigcode because of the following error (look up to see its traceback):
Can't get source for <function softmax at 0x177b90680>. TorchScript requires source access in order to carry out compilation, make sure original .py files are available.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(df['content'][1], return_tensors="pt")

NameError: name 'df' is not defined