## Data exploration
Two separate datasets are provided for this lab. Review the data and perform the operations
needed to merge them, so that the resulting dataset contains the email message and a label
indicating whether it is SPAM or not.
Show examples from the individual datasets and from the final dataset.

In [36]:
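from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import string
import nltk

# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing.
# A set keeps the membership test in remove_stopwords fast.
stopwords = set(nltk.corpus.stopwords.words('english'))
# Every punctuation character plus the newline; each one is replaced by a space below.
punctuation = list(string.punctuation + "\n")

def remove_punctuation(text):
    """Lowercase the text, replace punctuation with spaces, and split it into tokens."""
    no_punctuation_text = str(text).lower()
    for i in punctuation:
        no_punctuation_text = no_punctuation_text.replace(i, ' ')

    return no_punctuation_text.split()

def remove_stopwords(text):
    """Drop English stop words from a list of tokens."""
    return [i for i in text if i not in stopwords]

def generate_body(words):
    """Join a list of tokens back into a single space-separated string."""
    return ' '.join(words)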
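
To make the effect of these helpers concrete, here is a minimal sketch that runs the same three steps on a single message. It is only an illustration: `STOPWORDS` is a small hard-coded subset assumed for this example (the notebook uses the full NLTK English list), and the sample string is a shortened stand-in for a real body.

```python
import string

# Assumption: a tiny hand-picked subset of stop words, used here so the example
# runs without downloading the NLTK corpus.
STOPWORDS = {"up", "to", "on", "the", "of", "why"}
PUNCTUATION = list(string.punctuation + "\n")

def remove_punctuation(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    cleaned = str(text).lower()
    for char in PUNCTUATION:
        cleaned = cleaned.replace(char, " ")
    # split() with no argument also discards the leftover "\r" characters.
    return cleaned.split()

def remove_stopwords(tokens):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in STOPWORDS]

sample = "\r\nSave up to 70% on Life Insurance.\r\nWhy"
tokens = remove_punctuation(sample)
filtered = remove_stopwords(tokens)
body = " ".join(filtered)

print(tokens)
print(filtered)
print(body)
```

Note that the percent sign and the full stop become token boundaries, so `70%` turns into the bare token `70`.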

In [37]:
df_1 = pd.read_csv('completeSpamAssassin.csv')
df_2 = pd.read_csv('enronSpamSubset.csv')

# Drop the leftover index columns, which carry no useful information
df_1.drop(columns='Unnamed: 0', inplace=True)
df_2.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

In [38]:
df_1

,Body,Label
0,\r\nSave up to 70% on Life Insurance.\r\nWhy S...,1
1,1) Fight The Risk of Cancer!\r\nhttp://www.adc...,1
2,1) Fight The Risk of Cancer!\r\nhttp://www.adc...,1
3,##############################################...,1
4,I thought you might like these:\r\n1) Slim Dow...,1
5,A POWERHOUSE GIFTING PROGRAM You Don't Want To...,1
6,Help wanted. We are a 14 year old fortune 500...,1
7,ReliaQuote - Save Up To 70% On Life Insurance\...,1
8,TIRED OF THE BULL OUT THERE?\r\nWant To Stop L...,1
9,"Dear ricardo1 ,\r\nCOST EFFECTIVE Direct Email...",1


In [39]:
df_2

,Body,Label
0,Subject: stock promo mover : cwtd\r\n * * * ur...,1
1,Subject: are you listed in major search engine...,1
2,"Subject: important information thu , 30 jun 20...",1
3,Subject: = ? utf - 8 ? q ? bask your life with...,1
4,"Subject: "" bidstogo "" is places to go , things...",1
5,Subject: dont pay more than $ 100 for ur softw...,1
6,Subject: paliourg\r\n micros 0 ft for pennies\...,1
7,"Subject: all graphics software available , che...",1
8,"Subject: the man of stteel\r\n hello , welcome...",1
9,"Subject: adjourn pasteup\r\n paliourg ,\r\n lo...",1


In [40]:
# Stack both datasets and renumber the rows from 0
combined_df = pd.concat([df_1, df_2], ignore_index=True)
combined_df

,Body,Label
0,\r\nSave up to 70% on Life Insurance.\r\nWhy S...,1
1,1) Fight The Risk of Cancer!\r\nhttp://www.adc...,1
2,1) Fight The Risk of Cancer!\r\nhttp://www.adc...,1
3,##############################################...,1
4,I thought you might like these:\r\n1) Slim Dow...,1
5,A POWERHOUSE GIFTING PROGRAM You Don't Want To...,1
6,Help wanted. We are a 14 year old fortune 500...,1
7,ReliaQuote - Save Up To 70% On Life Insurance\...,1
8,TIRED OF THE BULL OUT THERE?\r\nWant To Stop L...,1
9,"Dear ricardo1 ,\r\nCOST EFFECTIVE Direct Email...",1
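
Rows 1 and 2 of the preview start with the same text, so it may be worth checking the merged dataset for repeated messages and for class balance. The sketch below shows one possible way to do that on toy data; the rows are invented for illustration, and whether the real dataset contains exact duplicates is not something the preview alone can confirm.

```python
import pandas as pd

# Hypothetical toy rows standing in for the two datasets (not real data).
toy_1 = pd.DataFrame({
    "Body": ["win a prize now", "win a prize now", "meeting at noon"],
    "Label": [1, 1, 0],
})
toy_2 = pd.DataFrame({
    "Body": ["Subject: cheap software", "Subject: lunch tomorrow"],
    "Label": [1, 0],
})

# Same merge as in the notebook: stack the frames and renumber the rows.
merged = pd.concat([toy_1, toy_2], ignore_index=True)

# Count repeated bodies and inspect how many messages each class has.
n_duplicates = int(merged.duplicated(subset="Body").sum())
label_counts = merged["Label"].value_counts().to_dict()

# Optionally keep only the first occurrence of each body.
deduplicated = merged.drop_duplicates(subset="Body").reset_index(drop=True)

print(n_duplicates)
print(label_counts)
print(len(deduplicated))
```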


## Preprocessing
Apply the natural language preprocessing techniques you consider necessary (lowercasing or
uppercasing, accent removal, contraction expansion, stop-word removal, etc.).
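
Two of the techniques listed above, accent removal and contraction expansion, can be sketched with the standard library alone. This is an illustrative example rather than part of the pipeline used in this notebook: `strip_accents`, `expand_contractions`, and the small `CONTRACTIONS` table are hypothetical names introduced here.

```python
import unicodedata

# Assumption: a deliberately tiny contraction table, for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_contractions(text):
    """Lowercase the text and replace each known contraction with its expansion."""
    lowered = text.lower()
    for short, full in CONTRACTIONS.items():
        lowered = lowered.replace(short, full)
    return lowered

plain = strip_accents("Exploración de datos")
expanded = expand_contractions("You Don't Want To")

print(plain)
print(expanded)
```

Contraction expansion would have to run before punctuation removal, since replacing the apostrophe with a space splits `don't` into the tokens `don` and `t`.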

In [41]:
# Punctuation removal and tokenization
combined_df['No_Punctuation_Body'] = combined_df['Body'].apply(remove_punctuation)
combined_df['No_Punctuation_Body'].head()

0    [save, up, to, 70, on, life, insurance, why, s...
1    [1, fight, the, risk, of, cancer, http, www, a...
2    [1, fight, the, risk, of, cancer, http, www, a...
3    [adult, club, offers, free, membership, instan...
4    [i, thought, you, might, like, these, 1, slim,...
Name: No_Punctuation_Body, dtype: object

In [42]:
# Stop-word removal
combined_df['No_StopWords_Body'] = combined_df['No_Punctuation_Body'].apply(remove_stopwords)
combined_df['No_StopWords_Body'].head()

0    [save, 70, life, insurance, spend, life, quote...
1    [1, fight, risk, cancer, http, www, adclick, w...
2    [1, fight, risk, cancer, http, www, adclick, w...
3    [adult, club, offers, free, membership, instan...
4    [thought, might, like, 1, slim, guaranteed, lo...
Name: No_StopWords_Body, dtype: object

In [43]:
combined_df.head()

,Body,Label,No_Punctuation_Body,No_StopWords_Body
0,\r\nSave up to 70% on Life Insurance.\r\nWhy S...,1,"[save, up, to, 70, on, life, insurance, why, s...","[save, 70, life, insurance, spend, life, quote..."
1,1) Fight The Risk of Cancer!\r\nhttp://www.adc...,1,"[1, fight, the, risk, of, cancer, http, www, a...","[1, fight, risk, cancer, http, www, adclick, w..."
2,1) Fight The Risk of Cancer!\r\nhttp://www.adc...,1,"[1, fight, the, risk, of, cancer, http, www, a...","[1, fight, risk, cancer, http, www, adclick, w..."
3,##############################################...,1,"[adult, club, offers, free, membership, instan...","[adult, club, offers, free, membership, instan..."
4,I thought you might like these:\r\n1) Slim Dow...,1,"[i, thought, you, might, like, these, 1, slim,...","[thought, might, like, 1, slim, guaranteed, lo..."


In [44]:
combined_df['Body'] = combined_df['No_StopWords_Body'].apply(generate_body)
combined_df.drop(columns=['No_Punctuation_Body', 'No_StopWords_Body'], inplace=True)

In [45]:
# Note: the bodies contained URLs, whose fragments survive as valid-looking but
# meaningless tokens (adclick, ws, p, and so on). Handling these edge cases goes
# beyond a standard preprocessing pass and would take more time and effort, so we
# decided to keep them.
combined_df.head()

,Body,Label
0,save 70 life insurance spend life quote saving...,1
1,1 fight risk cancer http www adclick ws p cfm ...,1
2,1 fight risk cancer http www adclick ws p cfm ...,1
3,adult club offers free membership instant acce...,1
4,thought might like 1 slim guaranteed lose 10 1...,1
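
For reference, one possible way to avoid the URL fragments mentioned in the note is to strip URLs before punctuation is removed. The sketch below is not what this notebook does (the fragments were kept); `URL_PATTERN` and `strip_urls` are hypothetical names, the regular expression is a simple heuristic, and the URL in the sample is a placeholder.

```python
import re

# Assumption: anything starting with http://, https:// or www. and running up to
# the next whitespace is treated as a URL. This is a heuristic, not a full parser.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Replace every URL-like span with a single space."""
    return URL_PATTERN.sub(" ", text)

# Placeholder URL; the real messages are truncated in the preview above.
sample = "1) Fight The Risk of Cancer!\r\nhttp://www.example.com/p.cfm?id=1 now"
remaining = strip_urls(sample).split()

print(remaining)
```

With this step applied first, tokens such as `http`, `www`, and the path pieces would no longer reach the vectorizers.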


## Text representation
Use the BoG (for n = 1, 2) and TF-IDF models. Show examples of the messages in their
numerical representation.

### Reference
https://www.analyticsvidhya.com/blog/2021/07/bag-of-words-vs-tfidf-vectorization-a-hands-on-tutorial/

In [61]:
# TF-IDF
# We use min_df=0.05 and max_df=0.9, chosen for the performance and precision of the
# models presented in "modelosSpam". The same values are used for BoG.
vector = TfidfVectorizer(min_df=0.05, max_df=0.9, use_idf=True)
matrix = vector.fit_transform(combined_df['Body'])
array = matrix.toarray()

In [62]:
names = vector.get_feature_names_out()  # requires scikit-learn >= 1.0
vector_df = pd.DataFrame(np.round(array, 2), columns=names)
vector_df['Label'] = combined_df['Label']
vector_df.to_csv('tf_idf.csv', index=False)

In [63]:
vector_df.head()

,00,000,01,02,05,08,09,10,100,11,...,without,work,working,world,would,wrote,www,year,years,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.61,0.0,0.0,1
3,0.0,0.0,0.0,0.07,0.0,0.07,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.36,0.0,0.0,1
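
To show where weights like the ones in this table come from, here is a hand-computed sketch of TF-IDF in plain Python. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the two-document corpus is invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = [["free", "offer", "free"], ["meeting", "offer"]]
n_docs = len(corpus)

vocabulary = sorted({token for doc in corpus for token in doc})
# Document frequency: in how many documents each term appears.
doc_freq = Counter(token for doc in corpus for token in set(doc))

# Smoothed inverse document frequency (assumed to mirror smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF weights of one document, rounded to 2 decimals."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [round(value / norm, 2) for value in raw]

rows = [tfidf_row(doc) for doc in corpus]

print(vocabulary)
print(rows)
```

The term `offer` appears in both documents, so it receives the lowest IDF and a smaller weight than a term of equal count that is specific to one document.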


In [58]:
# BoG (for n = 1, 2)
vector = CountVectorizer(ngram_range=(1,2), min_df=0.05, max_df=0.90)
matrix = vector.fit_transform(combined_df['Body'])
array = matrix.toarray()

In [59]:
names = vector.get_feature_names_out()  # requires scikit-learn >= 1.0
vector_df = pd.DataFrame(array, columns=names)
vector_df['Label'] = combined_df['Label']
vector_df.to_csv('Bog.csv', index=False)

In [60]:
vector_df.head()

,00,000,01,02,05,08,09,10,100,11,...,work,working,world,would,would like,wrote,www,year,years,Label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,7,0,0,1
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,6,0,0,1
3,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,5,0,0,1
