# Instructions 

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called `predict_message` that takes a message string as an argument and returns a list. The first element in the list should be a number between zero and one that indicates the likelihood of "ham" (0) or "spam" (1). The second element in the list should be the word "ham" or "spam", depending on which is more likely.

For this challenge, you will use the [SMS Spam Collection dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.


In [1]:
# import libraries
# try:
#   # %tensorflow_version only exists in Colab.
#   !pip install tf-nightly
# except Exception:
#   pass
import tensorflow as tf
import pandas as pd
from tensorflow import keras
#!pip install tensorflow-datasets
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt


from keras.preprocessing import sequence


import os


#print(tf.__version__)

In [2]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

--2021-05-11 23:28:03--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 104.26.3.33, 172.67.70.149, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv.10’


2021-05-11 23:28:03 (8.53 MB/s) - ‘train-data.tsv.10’ saved [358233/358233]

--2021-05-11 23:28:03--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.3.33, 104.26.2.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv.10’


2021-05-11 23:28:03 (8.15 MB/s) - ‘valid-data.tsv.10’ saved [118774/118774]



In [3]:
train_dataset = pd.read_csv(train_file_path, sep = '\t', names=['type', 'text'])
test_dataset = pd.read_csv(test_file_path, sep = '\t', names = ['type', 'text'])


In [4]:
train_dataset.tail()

Unnamed: 0,type,text
4174,ham,just woke up. yeesh its late. but i didn't fal...
4175,ham,what do u reckon as need 2 arrange transport i...
4176,spam,free entry into our £250 weekly competition ju...
4177,spam,-pls stop bootydelious (32/f) is inviting you ...
4178,ham,tell my bad character which u dnt lik in me. ...


In [5]:
test_dataset.head()

Unnamed: 0,type,text
0,ham,i am in hospital da. . i will return home in e...
1,ham,"not much, just some textin'. how bout you?"
2,ham,i probably won't eat at all today. i think i'm...
3,ham,don‘t give a flying monkeys wot they think and...
4,ham,who are you seeing?


In [6]:
train_dataset.isnull().sum()

type    0
text    0
dtype: int64

In [7]:
test_dataset.isnull().sum()

type    0
text    0
dtype: int64

In [8]:
train_dataset.dtypes

type    object
text    object
dtype: object

In [9]:
train_dataset.text[2]

'now u sound like manky scouse boy steve,like! i is travelling on da bus home.wot has u inmind 4 recreation dis eve?'

In [10]:
# train_dataset.text.values.tolist()
# test_dataset.text.values.tolist()


In [12]:
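vocab = {}          # maps each word to its integer index
word_encoding = 1   # next free index; 0 is reserved for padding

def one_hot_encoding(text):
  # Despite the name, this is an integer (index) encoding, not a one-hot one:
  # each word is replaced by its index in `vocab`, and unseen words are
  # added to `vocab` on the fly.
  global word_encoding

  words = text.lower().split(" ")
  encoding = []

  for word in words:
    if word in vocab:
      encoding.append(vocab[word])
    else:
      vocab[word] = word_encoding
      encoding.append(word_encoding)
      word_encoding += 1

  return encoding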
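
The encoder above grows the vocabulary every time it sees a new word, including words that only occur in the test set. A common alternative, sketched here under the assumption that the vocabulary should come from the training messages only, is to build the vocabulary once and map unknown words to 0 at lookup time. The names `build_vocab`, `encode_with_vocab` and `demo_vocab` are hypothetical and not part of the notebook; the sketch also splits on any whitespace rather than on single spaces.

```python
def build_vocab(texts):
    """Assign each distinct lowercase token an index starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_with_vocab(text, vocab, unknown=0):
    """Look words up without modifying the vocabulary; unseen words become `unknown`."""
    return [vocab.get(word, unknown) for word in text.lower().split()]


# Toy training messages, made up for illustration only.
demo_vocab = build_vocab(["free entry now", "see you now"])
print(demo_vocab)                                        # {'free': 1, 'entry': 2, 'now': 3, 'see': 4, 'you': 5}
print(encode_with_vocab("free pizza now", demo_vocab))   # [1, 0, 3]
```

With this split, encoding a test message can never change the indices the model was trained on.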

In [None]:
# Another way to encode
def sentences_to_indices(X, word_to_index, max_len):
    m = len(X)  # number of sentences
    X_indices = np.zeros((m, max_len))
    # Assign indices to words; unknown words map to 0
    for i, sentence in enumerate(X):
        sentence_words = sentence.lower().split()[:max_len]  # truncate long sentences
        for j, word in enumerate(sentence_words):
            X_indices[i, j] = word_to_index.get(word, 0)
    return X_indices


In [13]:
len(train_dataset)

4179

In [14]:
train_dataset.text[0]

'ahhhh...just woken up!had a bad dream about u tho,so i dont like u right now :) i didnt know anything about comedy night but i guess im up for it.'

In [15]:
# Encode every training message; this also builds `vocab` as a side effect
train_dataset['text'] = train_dataset['text'].apply(one_hot_encoding)
train_dataset.head()

Unnamed: 0,type,text
0,ham,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 8, 13,..."
1,ham,"[27, 28, 29, 30, 31]"
2,ham,"[14, 8, 32, 12, 33, 34, 35, 36, 10, 37, 38, 39..."
3,ham,"[49, 50, 51, 52, 53, 54, 55, 56, 55, 57, 28, 5..."
4,ham,"[29, 63, 64, 10, 65, 66, 67, 68, 69, 70, 71, 7..."


In [16]:
# Encode every test message; words that only occur here are also added to `vocab`
test_dataset['text'] = test_dataset['text'].apply(one_hot_encoding)
test_dataset.head()

Unnamed: 0,type,text
0,ham,"[10, 305, 78, 3350, 338, 808, 10, 272, 2224, 3..."
1,ham,"[306, 5208, 148, 345, 11332, 108, 2432, 3541]"
2,ham,"[10, 936, 1387, 775, 202, 196, 630, 10, 851, 2..."
3,ham,"[9036, 234, 4, 11334, 11335, 2665, 115, 851, 1..."
4,ham,"[867, 231, 27, 11336]"


In [17]:
labels = {} # dictionary that keeps the category names
values = train_dataset.type.astype('category') # convert the column to the pandas category dtype
labels['type'] = values.cat.categories # Index object holding the category names, in code order
train_dataset['type'] = values.cat.codes  # replace each label with its integer code (ham = 0, spam = 1)
train_dataset.head()

Unnamed: 0,type,text
0,0,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 8, 13,..."
1,0,"[27, 28, 29, 30, 31]"
2,0,"[14, 8, 32, 12, 33, 34, 35, 36, 10, 37, 38, 39..."
3,0,"[49, 50, 51, 52, 53, 54, 55, 56, 55, 57, 28, 5..."
4,0,"[29, 63, 64, 10, 65, 66, 67, 68, 69, 70, 71, 7..."


In [18]:
labels = {} # dictionary that keeps the category names
values = test_dataset.type.astype('category') # convert the column to the pandas category dtype
labels['type'] = values.cat.categories # Index object holding the category names, in code order
test_dataset['type'] = values.cat.codes  # replace each label with its integer code (ham = 0, spam = 1)


In [19]:
test_dataset.tail()

Unnamed: 0,type,text
1387,0,"[329, 13480, 1326, 53, 149, 2554, 153, 4339, 1..."
1388,0,"[838, 272, 51, 30, 78, 84, 10869, 9523]"
1389,0,"[992, 231, 27, 218, 838, 231, 27, 378, 218, 23..."
1390,1,"[182, 9147, 37, 5987, 533, 3877, 390, 53, 9148..."
1391,1,"[306, 9466, 123, 10546, 4, 3711, 121, 45, 4223..."


In [20]:
test_dataset.text.values

array([list([10, 305, 78, 3350, 338, 808, 10, 272, 2224, 376, 78, 2554]),
       list([306, 5208, 148, 345, 11332, 108, 2432, 3541]),
       list([10, 936, 1387, 775, 202, 196, 630, 10, 851, 250, 404, 11333, 108, 243, 109, 422, 314, 8, 434, 1248]),
       ...,
       list([992, 231, 27, 218, 838, 231, 27, 378, 218, 231, 13483, 1013, 39, 661, 84, 8842, 53, 109, 3758, 218, 314, 27, 295, 4, 13484, 144, 257, 1107, 76, 218, 10, 384, 27]),
       list([182, 9147, 37, 5987, 533, 3877, 390, 53, 9148, 182, 9149, 14, 72, 131, 53, 3881, 268, 7360, 10222, 13485, 8426, 9586]),
       list([306, 9466, 123, 10546, 4, 3711, 121, 45, 4223, 261, 1020, 527, 12769, 53, 12770, 52, 12771, 1077, 94, 73, 3834, 12772, 55, 353, 1534, 53, 12773, 12774, 72, 1239, 12775, 1628, 172])],
      dtype=object)

In [21]:
MAXLEN = 40
a = sequence.pad_sequences(train_dataset.text.values, MAXLEN)
b = sequence.pad_sequences(test_dataset.text.values, MAXLEN)
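
To make the padding step concrete, here is a small NumPy sketch that mimics what `pad_sequences` does with its default settings as I understand them: zeros are added at the front of short sequences, and long sequences lose their earliest items. The function `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np


def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]          # drop items from the front if too long
        out[i, -len(trunc):] = trunc   # right-align the remaining items
    return out


print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0 1 2 3]
#  [3 4 5 6]]
```

This matches the arrays shown later in the notebook, where every row starts with zeros and ends with the word indices.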


In [22]:
# train_dataset = a
# test_dataset = b

In [23]:

# for i in range(0,len(test_dataset.text.values)):
# #for i in range(0,4):
#     longitud = len(test_dataset.text.values[i])
#     if longitud <= 30:
#         l = 30-longitud
#         test_dataset.text.values[i] = test_dataset.text.values[i] + [0 for j in range(0,l)]
#         #print(test_dataset.text.values)
#     else:
#         test_dataset.text.values[i] = test_dataset.text.values[i][0:30]
#         #print(test_dataset.text.values[i])
        
    

In [24]:
# for i in range(0,len(train_dataset.text.values)):
# #for i in range(0,4):
#     longitud = len(train_dataset.text.values[i])
#     if longitud <= 30:
#         l = 30-longitud
#         train_dataset.text.values[i] = train_dataset.text.values[i] + [0 for j in range(0,l)]
#         #print(test_dataset.text.values)
#     else:
#         train_dataset.text.values[i] = train_dataset.text.values[i][0:30]
#         #print(test_dataset.text.values[i])
        
    

In [30]:
train_dataset['type'].values

array([0, 0, 0, ..., 1, 1, 0], dtype=int8)

In [31]:
# train_labels = train_dataset.pop('type')
# test_labels = test_dataset.pop('type')
train_labels = train_dataset.type
test_labels = test_dataset.type
print(train_dataset.shape)
print(test_dataset.shape)

(4179, 2)
(1392, 2)


In [33]:
train_labels = train_labels.values
test_labels = test_labels.values

In [34]:
train_dataset = a
test_dataset = b

In [35]:
print(train_labels.shape)
print(test_labels.shape)

(4179,)
(1392,)


In [36]:
train_dataset

array([[    0,     0,     0, ...,    24,    25,    26],
       [    0,     0,     0, ...,    29,    30,    31],
       [    0,     0,     0, ...,    46,    47,    48],
       ...,
       [    0,     0,     0, ...,  1088,   729,  9664],
       [    0,     0,     0, ...,  1773,    53,  1240],
       [    0,     0,     0, ..., 11330, 11331,  3328]], dtype=int32)

In [37]:
train_labels

array([0, 0, 0, ..., 1, 1, 0], dtype=int8)

In [38]:
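# Caution: word indices run from 1 to len(vocab) and 0 is the padding value,
# so the largest index, len(vocab), falls outside an Embedding layer whose
# input size is len(vocab); len(vocab) + 1 would cover every index.
VOCAB_SIZE = len(vocab)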
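
Because the word indices start at 1 and 0 is used for padding, it is worth checking that every index fits inside the embedding table before training. The following is a minimal sketch of such a check; `required_input_dim` and `indices_fit` are hypothetical helper names, and the toy array stands in for the padded training data.

```python
import numpy as np


def required_input_dim(encoded):
    """Smallest embedding input size that covers every index in `encoded`."""
    return int(np.max(encoded)) + 1


def indices_fit(encoded, input_dim):
    """True if all indices lie in the valid range [0, input_dim)."""
    encoded = np.asarray(encoded)
    return bool(encoded.min() >= 0 and encoded.max() < input_dim)


# Toy example: a 3-word vocabulary yields indices 1..3, plus 0 for padding.
toy_encoded = np.array([[0, 1, 2], [0, 3, 1]])
print(required_input_dim(toy_encoded))        # 4
print(indices_fit(toy_encoded, input_dim=3))  # False
print(indices_fit(toy_encoded, input_dim=4))  # True
```

In the notebook, the same check could be run on the padded arrays `a` and `b` against the chosen vocabulary size.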

In [39]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32), # embedding layer: maps each word index to a 32-dimensional vector
    tf.keras.layers.LSTM(32), # recurrent layer that reads the sequence of word vectors
    tf.keras.layers.Dense(1, activation="sigmoid") # output layer: a single unit giving the spam probability
]) # sigmoid keeps the output between 0 and 1, so values near 0 mean "ham" and values near 1 mean "spam"

In [40]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          431520    
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 439,873
Trainable params: 439,873
Non-trainable params: 0
_________________________________________________________________


In [41]:
model.compile(loss="binary_crossentropy", # standard loss for two-class (ham/spam) problems
              optimizer="rmsprop", # RMSprop optimizer with default settings
              metrics=['acc']) # report accuracy during training
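
For readers who want to see what the loss and metric measure, here is a small NumPy sketch of binary cross-entropy and thresholded accuracy. It is an illustration under simple assumptions (a clipping constant `eps` to avoid `log(0)` and a 0.5 decision threshold), not the Keras implementation, and the function names are hypothetical.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    predicted = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))


# Toy labels (0 = ham, 1 = spam) and made-up probabilities.
print(binary_crossentropy([1, 0], [0.5, 0.5]))   # about 0.693, i.e. ln(2)
print(accuracy([0, 1, 1], [0.1, 0.9, 0.4]))      # about 0.667 (2 of 3 correct)
```

A confident wrong answer is penalised far more heavily by the loss than an uncertain one, which is why the loss and the accuracy can move differently during training.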


In [43]:
test_dataset

array([[    0,     0,     0, ...,   376,    78,  2554],
       [    0,     0,     0, ...,   108,  2432,  3541],
       [    0,     0,     0, ...,     8,   434,  1248],
       ...,
       [    0,     0,     0, ...,    10,   384,    27],
       [    0,     0,     0, ..., 13485,  8426,  9586],
       [    0,     0,     0, ..., 12775,  1628,   172]], dtype=int32)

In [44]:
#TRAIN
history = model.fit(train_dataset, train_labels, epochs=10, validation_split=0.3) # train 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
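
The `validation_split=0.3` argument holds back part of the training arrays for validation. As far as I understand the documented Keras behaviour, the held-out samples are taken from the end of the arrays before any shuffling. The sketch below estimates the resulting sizes; the helper name and the exact rounding rule are assumptions.

```python
import math


def validation_split_sizes(n_samples, validation_split):
    """Return (training size, validation size), holding out the trailing fraction."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return split_at, n_samples - split_at


# 4179 is the number of training messages reported earlier in the notebook.
print(validation_split_sizes(4179, 0.3))  # (2925, 1254)
```

Because the split is positional, a file sorted by label could leave one class out of the validation part, so shuffling the rows beforehand may be worth considering.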


In [60]:
test_labels

array([0, 0, 0, ..., 0, 1, 1], dtype=int8)

In [50]:
values

0        ham
1        ham
2        ham
3        ham
4        ham
        ... 
1387     ham
1388     ham
1389     ham
1390    spam
1391    spam
Name: type, Length: 1392, dtype: category
Categories (2, object): ['ham', 'spam']

In [84]:
#results = model.evaluate(test_dataset, test_labels)
#print(results)

In [66]:
word_index = vocab
def encode_text(text):
  tokens = keras.preprocessing.text.text_to_word_sequence(text) # lowercase, strip punctuation and split into words
  tokens = [word_index[word] if word in word_index else 0 for word in tokens] # look up each word's index; unknown words become 0
  return sequence.pad_sequences([tokens], MAXLEN)[0] # pad or truncate to MAXLEN and return the single sequence
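
Note that `encode_text` tokenizes differently from the training-time encoder: the training code splits on single spaces and keeps punctuation attached to words, while `text_to_word_sequence` removes punctuation first. The sketch below contrasts the two on a fragment of the first training message. The `FILTERS` string is my approximation of the Keras default filter set, so treat it as an assumption; both function names are hypothetical.

```python
# Approximation of the punctuation that Keras strips by default (assumption).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: " " for ch in FILTERS})


def split_on_spaces(text):
    """Tokenization used when building the vocabulary: lowercase, split on single spaces."""
    return text.lower().split(" ")


def split_stripping_punctuation(text):
    """Tokenization similar to the one used at prediction time: punctuation becomes a space."""
    return text.lower().translate(_TABLE).split()


sample = "woken up!had a dream"
print(split_on_spaces(sample))              # ['woken', 'up!had', 'a', 'dream']
print(split_stripping_punctuation(sample))  # ['woken', 'up', 'had', 'a', 'dream']
```

Tokens such as `up!had` exist in the vocabulary only in their unsplit form, so the same message can map to different indices at training and prediction time. Using one tokenizer for both steps would avoid this.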


In [67]:
encode_text('just woken up!had a bad dream about u tho,so i dont like u right now')

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,  148,    2,   24,  666,    4,    5,    6,    7,    8, 2475,
        165,   10,   11,   12,    8,   13,   14], dtype=int32)

In [68]:
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    PAD = 0
    text = ""
    for num in integers:
      if num != PAD:
        text += reverse_word_index[num] + " "

    return text[:-1]  # drop the trailing space


In [69]:
decode_integers(encode_text('just woken up!had a bad dream about u tho,so i dont like u right now'))

'just woken up had a bad dream about u tho so i dont like u right now'

In [74]:
# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
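def predict_message(pred_text):
    encoded_text = encode_text(pred_text)
    print(encoded_text, '\n')
    # The model expects a batch, so wrap the sequence in a (1, MAXLEN) array
    prediction = np.zeros((1, MAXLEN))
    prediction[0] = encoded_text
    print(prediction, '\n')
    result = model.predict(prediction)  # shape (1, 1): probability of "spam"
    #print('result ', result[0])
    # Note: the challenge asks for the probability as the first element;
    # this version returns the input text there instead.
    prediction1 = [pred_text, 'b']  # 'b' is a placeholder, overwritten below
    if result[0] <= 0.5:
        prediction1[1] = 'ham'
    else:
        prediction1[1] = 'spam'
    print(result[0])
    return (prediction1)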
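
The instructions ask for a list whose first element is the predicted probability and whose second element is the label. A minimal sketch of a function with that return shape follows. To keep it runnable on its own, the encoder and the model are passed in as callables; `predict_message_spec`, `fake_encode` and `fake_predict` are hypothetical names, and the fake model is a stand-in rule, not the trained network.

```python
def predict_message_spec(pred_text, encode_fn, predict_fn, threshold=0.5):
    """Return [probability_of_spam, label] for one message."""
    encoded = encode_fn(pred_text)
    probability = float(predict_fn(encoded))
    label = "spam" if probability > threshold else "ham"
    return [probability, label]


# Stand-ins for illustration only. In the notebook, encode_fn could be
# encode_text and predict_fn something like
# lambda x: model.predict(x.reshape(1, -1))[0][0]  (an assumption).
fake_encode = lambda text: text.lower().split()
fake_predict = lambda tokens: 0.9 if "won" in tokens else 0.1

print(predict_message_spec("you have won cash", fake_encode, fake_predict))   # [0.9, 'spam']
print(predict_message_spec("how are you doing", fake_encode, fake_predict))   # [0.1, 'ham']
```

The threshold mirrors the `<= 0.5` test in the cell above: values up to 0.5 are labelled "ham" and anything higher "spam".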



In [75]:
pred_text = "how are you doing today?"

prediction = predict_message(pred_text)


[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 108
 231  27 378 292] 

[[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0. 108. 231.  27. 378. 292.]] 

[2.6560787e-05]


In [76]:
print(prediction)

['how are you doing today?', 'ham']


In [77]:
test_messages1 = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

In [79]:
predict_message(test_messages1[4])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0   27  145 1034 1196  131  121   53 1038  109 1197] 

[[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
     0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
     0.    0.    0.    0.    0.    0.   27.  145. 1034. 1196.  131.  121.
    53. 1038.  109. 1197.]] 

[0.99920577]


['you have won £1000 cash! call to claim your prize.', 'spam']

In [80]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 108
 231  27 378 292] 

[[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0. 108. 231.  27. 378. 292.]] 

[2.6560787e-05]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0 7738  292   53 1239 5041  121    0] 

[[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
     0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
     0.    0.    0.    0.    0.    0.    0.    0.    0. 7738.  292.   53.
  1239. 5041.  121.    0.]] 

[0.99693584]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0   10 