*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called `predict_message` that takes a message string as an argument and returns a list. The first element in the list should be a number between zero and one that indicates how likely the message is to be spam: values near 0 mean "ham" and values near 1 mean "spam". The second element in the list should be the word "ham" or "spam", depending on which is more likely.
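
As a minimal sketch of that return contract, a probability can be turned into the expected two-element list as shown here. The helper name `to_prediction` and the 0.5 cut-off are assumptions for illustration and are not part of the challenge text.

```python
def to_prediction(spam_probability, threshold=0.5):
    """Return [probability, label] in the shape the challenge asks for.

    `threshold` is an assumed cut-off; the challenge only says to pick
    whichever label is more likely.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    label = "spam" if spam_probability >= threshold else "ham"
    return [spam_probability, label]


print(to_prediction(0.008))  # [0.008, 'ham']
print(to_prediction(0.97))   # [0.97, 'spam']
```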

For this challenge, you will use the [SMS Spam Collection dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.


In [125]:
import tensorflow as tf
import pandas as pd
from tensorflow import keras

import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt
import requests

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense, Dropout, LSTM, Bidirectional

print(tf.__version__)

2.9.1


In [7]:
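def download(link):
    # Stream the file to disk under its original name (last URL path segment).
    response = requests.get(link, stream=True)
    response.raise_for_status()  # fail early on HTTP errors
    with open(link.split('/')[-1], 'wb') as out_file:
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)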

In [8]:
download('https://cdn.freecodecamp.org/project-data/sms/train-data.tsv')
download('https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv')
train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

In [307]:
df_train = pd.read_csv(train_file_path,sep='\t',names=['type','msg'])
print(df_train.head(5),len(df_train))

  type                                                msg
0  ham  ahhhh...just woken up!had a bad dream about u ...
1  ham                           you can never do nothing
2  ham  now u sound like manky scouse boy steve,like! ...
3  ham  mum say we wan to go then go... then she can s...
4  ham  never y lei... i v lazy... got wat? dat day ü ... 4179


In [308]:
df_test = pd.read_csv(test_file_path, sep='\t', names=['type','msg'])
print(df_test.head(5),len(df_test))

  type                                                msg
0  ham  i am in hospital da. . i will return home in e...
1  ham         not much, just some textin'. how bout you?
2  ham  i probably won't eat at all today. i think i'm...
3  ham  don‘t give a flying monkeys wot they think and...
4  ham                                who are you seeing? 1392


In [309]:
# Example spam message: the first row labelled 'spam'
df_train.loc[df_train['type'] == 'spam', 'msg'].iloc[0]

'urgent! call 09066350750 from your landline. your complimentary 4* ibiza holiday or 10,000 cash await collection sae t&cs po box 434 sk3 8wp 150 ppm 18+'

In [310]:
print(df_train.isna().sum())
print(df_test.isna().sum())
print(df_train.duplicated().sum(), df_test.duplicated().sum())

type    0
msg     0
dtype: int64
type    0
msg     0
dtype: int64
244 43


In [311]:
df_train.groupby('type').describe().T

type                           ham                                               spam
msg count                     3619                                                560
    unique                    3430                                                505
    top     sorry, i'll call later  hmv bonus special 500 pounds of genuine hmv vo...
    freq                        22                                                  3


In [312]:
df_test.groupby('type').describe().T

type                           ham                                               spam
msg count                     1205                                                187
    unique                    1173                                                176
    top     sorry, i'll call later  you have won a nokia 7250i. this is what you g...
    freq                         8                                                  2


In [313]:
# Print the first 20 duplicated messages
for msg in df_train.loc[df_train.duplicated(), 'msg'].head(20):
    print(msg)

want to funk up ur fone with a weekly new tone reply tones2u 2 this text. www.ringtones.co.uk, the original n best. tones 3gbp network operator rates apply
ok then i will come to ur home after half an hour
watching cartoon, listening music &amp; at eve had to go temple &amp; church.. what about u?
sorry, i'll call later
sorry, i'll call later
i am in escape theatre now. . going to watch kavalan in a few minutes
sorry, i'll call later
ok...
ok.
sorry, i'll call later
ok...
beautiful truth against gravity.. read carefully: our heart feels light when someone is in it.. but it feels very heavy when someone leaves it.. good night
ok...
i wnt to buy a bmw car urgently..its vry urgent.but hv a shortage of  &lt;#&gt; lacs.there is no source to arng dis amt. &lt;#&gt; lacs..thats my prob
sorry, i'll call later
private! your 2004 account statement for 07742676969 shows 786 unredeemed bonus points. to claim call 08719180248 identifier code: 45239 expires
raji..pls do me a favour. pls convey my bi

In [314]:
df_train = df_train.drop_duplicates()
df_test = df_test.drop_duplicates()

In [319]:
ham_msg = df_train[df_train.type =='ham']
spam_msg = df_train[df_train.type=='spam']
print(len(ham_msg),len(spam_msg))

3430 505


In [320]:
ham_msg_df = ham_msg.sample(n = len(spam_msg), random_state = 40)
spam_msg_df = spam_msg
print(ham_msg_df.shape, spam_msg_df.shape)

(505, 2) (505, 2)
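
The balancing step above is random undersampling: the majority class (ham) is cut down to the size of the minority class (spam). The following standard-library sketch shows the same idea on a toy list of labelled messages. The data and the helper name `undersample` are made up for illustration; the notebook itself does this with `DataFrame.sample`.

```python
import random
from collections import Counter


def undersample(rows, seed=40):
    """Randomly shrink every class to the size of the smallest one.

    `rows` is a list of (label, message) pairs.
    """
    by_label = {}
    for label, msg in rows:
        by_label.setdefault(label, []).append((label, msg))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # mix the classes, similar to sample(frac=1)
    return balanced


# Hypothetical toy data: 8 ham messages and 3 spam messages
toy = [("ham", f"ham message {i}") for i in range(8)] + [
    ("spam", f"spam message {i}") for i in range(3)
]
balanced = undersample(toy)
print(Counter(label for label, _ in balanced))
```

Undersampling discards most of the ham messages, which keeps the classes even at the cost of a smaller training set.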


In [326]:
df_train = pd.concat([ham_msg_df,spam_msg_df]).reset_index(drop=True)

In [328]:
df_train = df_train.sample(frac=1).reset_index(drop=True)
print(df_train)

      type                                                msg
0     spam  urgent ur £500 guaranteed award is still uncla...
1     spam  private! your 2003 account statement for shows...
2      ham   what part of don't initiate don't you understand
3      ham  ok i'm gonna head up to usf in like fifteen mi...
4      ham  the 2 oz guy is being kinda flaky but one frie...
...    ...                                                ...
1005  spam  i'd like to tell you my deepest darkest fantas...
1006   ham  sorry, i guess whenever i can get a hold of my...
1007  spam       private! your 2003 account statement for 078
1008  spam  u have a secret admirer who is looking 2 make ...
1009   ham  as per your request 'maangalyam (alaipayuthe)'...

[1010 rows x 2 columns]


In [329]:
ham_msg = df_test[df_test.type =='ham']
spam_msg = df_test[df_test.type=='spam']
print(len(ham_msg),len(spam_msg))

1173 176


In [330]:
ham_msg_df = ham_msg.sample(n = len(spam_msg), random_state = 40)
spam_msg_df = spam_msg
print(ham_msg_df.shape, spam_msg_df.shape)

(176, 2) (176, 2)


In [331]:
df_test = pd.concat([ham_msg_df,spam_msg_df]).reset_index(drop=True)
print(df_test)

     type                                                msg
0     ham  good! no, don‘t need any receipts—well done! (...
1     ham  allo! we have braved the buses and taken on th...
2     ham  that would be great. we'll be at the guild. co...
3     ham  daddy, shu shu is looking 4 u... u wan me 2 te...
4     ham  hi neva worry bout da truth coz the truth will...
..    ...                                                ...
347  spam  valentines day special! win over £1000 in our ...
348  spam  dear voucher holder, to claim this weeks offer...
349  spam  urgent! your mobile was awarded a £1,500 bonus...
350  spam  spjanuary male sale! hot gay chat now cheaper,...
351  spam  ur cash-balance is currently 500 pounds - to m...

[352 rows x 2 columns]


In [332]:
df_test = df_test.sample(frac=1).reset_index(drop=True)
print(df_test)

     type                                                msg
0    spam  u r subscribed 2 textcomp 250 wkly comp. 1st w...
1     ham  me 2 babe i feel the same lets just 4get about...
2     ham                            lol ok your forgiven :)
3     ham  hello u.call wen u finish wrk.i fancy meetin u...
4     ham  aiyah then i wait lor. then u entertain me. he...
..    ...                                                ...
347  spam  can u get 2 phone now? i wanna chat 2 set up m...
348   ham              shant disturb u anymore... jia you...
349  spam  sms. ac blind date 4u!: rodds1 is 21/m from ab...
350   ham               ok.ok ok..then..whats ur todays plan
351  spam  sunshine quiz! win a super sony dvd recorder i...

[352 rows x 2 columns]


In [333]:
df_train['type']= df_train['type'].map({'ham': 0, 'spam': 1})
train_label = df_train['type'].values

df_test['type']= df_test['type'].map({'ham': 0, 'spam': 1})
test_label = df_test['type'].values


In [334]:
df_train = df_train.pop('msg')
print(df_train)

0       urgent ur £500 guaranteed award is still uncla...
1       private! your 2003 account statement for shows...
2        what part of don't initiate don't you understand
3       ok i'm gonna head up to usf in like fifteen mi...
4       the 2 oz guy is being kinda flaky but one frie...
                              ...                        
1005    i'd like to tell you my deepest darkest fantas...
1006    sorry, i guess whenever i can get a hold of my...
1007         private! your 2003 account statement for 078
1008    u have a secret admirer who is looking 2 make ...
1009    as per your request 'maangalyam (alaipayuthe)'...
Name: msg, Length: 1010, dtype: object


In [335]:
df_test = df_test.pop('msg')
print(df_test)

0      u r subscribed 2 textcomp 250 wkly comp. 1st w...
1      me 2 babe i feel the same lets just 4get about...
2                                lol ok your forgiven :)
3      hello u.call wen u finish wrk.i fancy meetin u...
4      aiyah then i wait lor. then u entertain me. he...
                             ...                        
347    can u get 2 phone now? i wanna chat 2 set up m...
348                shant disturb u anymore... jia you...
349    sms. ac blind date 4u!: rodds1 is 21/m from ab...
350                 ok.ok ok..then..whats ur todays plan
351    sunshine quiz! win a super sony dvd recorder i...
Name: msg, Length: 352, dtype: object


In [336]:
train_label

array([1, 1, 0, ..., 1, 1, 0], dtype=int64)

In [337]:
# Reshape the labels into float32 column vectors of shape (n, 1), matching the model's single sigmoid output
train_label = np.asarray(train_label).astype('float32').reshape((-1,1))
test_label = np.asarray(test_label).astype('float32').reshape((-1,1))

In [338]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=500, char_level=False, oov_token="<OOV>")
tokenizer.fit_on_texts(df_train)

In [339]:
word_index = tokenizer.word_index
word_index

{'<OOV>': 1,
 'to': 2,
 'you': 3,
 'a': 4,
 'i': 5,
 'call': 6,
 'the': 7,
 'your': 8,
 'u': 9,
 'for': 10,
 'is': 11,
 'now': 12,
 'and': 13,
 '2': 14,
 'free': 15,
 'or': 16,
 'in': 17,
 'on': 18,
 'have': 19,
 'of': 20,
 'ur': 21,
 'txt': 22,
 '4': 23,
 'with': 24,
 'from': 25,
 'me': 26,
 'are': 27,
 'it': 28,
 'no': 29,
 'stop': 30,
 'this': 31,
 'text': 32,
 'just': 33,
 'reply': 34,
 'mobile': 35,
 'get': 36,
 'that': 37,
 'my': 38,
 'be': 39,
 'can': 40,
 'not': 41,
 'out': 42,
 'www': 43,
 'we': 44,
 'will': 45,
 'do': 46,
 'claim': 47,
 'our': 48,
 'only': 49,
 'but': 50,
 'so': 51,
 'if': 52,
 'send': 53,
 'new': 54,
 'prize': 55,
 'at': 56,
 'nokia': 57,
 "i'm": 58,
 'gt': 59,
 'cash': 60,
 't': 61,
 'lt': 62,
 '150p': 63,
 'been': 64,
 'go': 65,
 'uk': 66,
 'week': 67,
 'com': 68,
 'up': 69,
 '1': 70,
 'by': 71,
 'like': 72,
 'all': 73,
 'how': 74,
 'who': 75,
 'r': 76,
 'please': 77,
 'won': 78,
 'got': 79,
 'has': 80,
 'service': 81,
 'phone': 82,
 'win': 83,
 'urgent': 

In [340]:
len(word_index)

3943
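
`word_index` lists every distinct token seen during fitting (3943 here), even though the tokenizer was created with `num_words = 500`. The cap is applied later, when `texts_to_sequences` runs: only indices below `num_words` are kept, and every other word is replaced by the `<OOV>` index 1. The pure-Python imitation below sketches that behaviour on a toy vocabulary. The helpers `build_word_index` and `encode` and the sample texts are hypothetical, and the real tokenizer also lower-cases text and strips punctuation.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices, sending rare or unseen words to the OOV index."""
    oov_index = word_index[oov_token]
    encoded = []
    for word in text.split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            encoded.append(oov_index)
        else:
            encoded.append(index)
    return encoded


texts = ["call now to claim", "call me now", "call to win"]
word_index = build_word_index(texts)
print(len(word_index))  # 7: the full vocabulary, not capped
print(encode("call now to win a prize", word_index, num_words=4))  # [2, 3, 1, 1, 1, 1]
```

This is why a padded training row can contain many 1s: each marks a word that is either unseen or ranked outside the most frequent tokens.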

In [341]:
# Convert the texts to integer sequences, then pad or truncate each one to 50 tokens
training_sequences = tokenizer.texts_to_sequences(df_train)
training_padded = pad_sequences(training_sequences, maxlen=50, padding="post", truncating="post")

testing_sequences = tokenizer.texts_to_sequences(df_test)
testing_padded = pad_sequences(testing_sequences, maxlen=50, padding="post", truncating="post")

In [342]:
# Shapes of the padded training and testing tensors
print('Shape of training tensor: ', training_padded.shape)
print('Shape of testing tensor: ', testing_padded.shape)

Shape of training tensor:  (1010, 50)
Shape of testing tensor:  (352, 50)


In [343]:
# Before padding
print(len(training_sequences[0]), len(training_sequences[1]))
# After padding
print(len(training_padded[0]), len(training_padded[1]))

19 23
50 50
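
With `padding = "post"` and `truncating = "post"`, short sequences get zeros appended and long ones lose tokens from the end, so every row ends up with exactly `maxlen` entries. A small pure-Python imitation of that behaviour follows. The helper `pad_post` is hypothetical and only mirrors these two settings of `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut a sequence to `maxlen` from the end, then pad the end with `value`."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))


print(pad_post([84, 21, 223], maxlen=5))            # [84, 21, 223, 0, 0]
print(pad_post([5, 6, 7, 8, 9, 10, 11], maxlen=5))  # [5, 6, 7, 8, 9]
```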


In [344]:
print(training_padded[0])

[ 84  21 223 111 199  11 139   1   6   1  12   1   1   1   1   1 128   1
   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0]


In [345]:
model = Sequential()
model.add(Embedding(500, 16, input_length=50))
model.add(GlobalAveragePooling1D())
model.add(Dense(48, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy',optimizer='adam' ,metrics=['accuracy'])

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 50, 16)            8000      
                                                                 
 global_average_pooling1d_10  (None, 16)               0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense_21 (Dense)            (None, 48)                816       
                                                                 
 dropout_11 (Dropout)        (None, 48)                0         
                                                                 
 dense_22 (Dense)            (None, 1)                 49        
                                                                 
Total params: 8,865
Trainable params: 8,865
Non-trainable params: 0
___________________________________________________
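
The parameter counts in the summary can be checked by hand: an embedding layer stores one vector per vocabulary index, a dense layer stores one weight per input-output pair plus one bias per output, and the pooling and dropout layers have no weights. A short arithmetic check, using the layer sizes from the model above:

```python
vocab_size, embed_dim = 500, 16
hidden_units, output_units = 48, 1

embedding_params = vocab_size * embed_dim                   # 500 * 16
hidden_params = embed_dim * hidden_units + hidden_units     # weights + biases
output_params = hidden_units * output_units + output_units  # weights + bias

total = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)  # 8000 816 49 8865
```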

In [346]:
# Fit the dense spam detector; early stopping halts training when val_loss stalls for 3 epochs
history = model.fit(training_padded, train_label, epochs=25,
                    validation_data=(testing_padded, test_label),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3)], verbose=False)
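
`EarlyStopping(monitor='val_loss', patience=3)` ends training once the validation loss has gone three consecutive epochs without beating its best value. The loop below is a simplified imitation of that rule, not the Keras implementation. The function name `stopping_epoch` and the loss values are invented for illustration.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 2
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.30]
print(stopping_epoch(losses))  # 5
```

In this toy run the later improvement to 0.30 is never reached, which is the trade-off a small `patience` value makes.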

In [347]:
model.evaluate(testing_padded, test_label)



[0.13621830940246582, 0.9517045617103577]

In [372]:
# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
def predict_message(pred_text):
  # Tokenize and pad the message the same way as the training data
  new_seq = tokenizer.texts_to_sequences([pred_text])
  padded = pad_sequences(new_seq, maxlen=50, padding="post", truncating="post")
  # Run the model once; the output has shape (1, 1)
  prob = float(model.predict(padded)[0][0])
  return [prob, 'spam' if prob >= 0.5 else 'ham']

pred_text = "how are you doing today?"

prediction = predict_message(pred_text)
print(prediction)

[0.00542035885155201, 'ham']


In [373]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


You passed the challenge. Great job!
