## Data Exploration
Two separate datasets are provided for this lab. Inspect the data and perform the operations
needed to merge them into a single dataset that contains the email message and a label
indicating whether the message is SPAM or not.
Show examples from the individual datasets and from the final dataset.

In [53]:
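from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np
import string
import nltk

# Fetch the stop word list if it is not already available locally
nltk.download('stopwords', quiet=True)

# A set makes the membership test in remove_stopwords fast
stopwords = set(nltk.corpus.stopwords.words('english'))
# Every punctuation character, plus the newline
punctuation = list(string.punctuation + "\n")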

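def remove_punctuation(text):
    """Lowercase the text, replace punctuation with spaces, and split into tokens."""
    no_punctuation_text = str(text).lower()
    for char in punctuation:
        no_punctuation_text = no_punctuation_text.replace(char, ' ')

    return no_punctuation_text.split()

def remove_stopwords(tokens):
    """Drop English stop words from a list of tokens."""
    return [token for token in tokens if token not in stopwords]

def generate_body(words):
    """Join the tokens back into a single space-separated string."""
    return ' '.join(words)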

In [54]:
df_1 = pd.read_csv('completeSpamAssassin.csv')
df_2 = pd.read_csv('enronSpamSubset.csv')

# Drop the leftover index columns, which carry no useful information
df_1.drop(columns='Unnamed: 0', inplace=True)
df_2.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

In [55]:
df_1

,Body,Label
0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,##############################################...,1
4,I thought you might like these:\n1) Slim Down ...,1
...,...,...
6041,empty,0
6042,___ ___ ...,0
6043,IN THIS ISSUE:01. Readers write\n02. Extension...,0
6044,empty,0


In [56]:
df_2

,Body,Label
0,Subject: stock promo mover : cwtd\n * * * urge...,1
1,Subject: are you listed in major search engine...,1
2,"Subject: important information thu , 30 jun 20...",1
3,Subject: = ? utf - 8 ? q ? bask your life with...,1
4,"Subject: "" bidstogo "" is places to go , things...",1
...,...,...
9995,"Subject: monday 22 nd oct\n louise ,\n do you ...",0
9996,Subject: missing bloomberg deals\n stephanie -...,0
9997,Subject: eops salary survey questionnaire\n we...,0
9998,"Subject: q 3 comparison\n hi louise ,\n i have...",0


In [57]:
# Stack both datasets and renumber the rows from 0
combined_df = pd.concat([df_1, df_2], ignore_index=True)
combined_df

,Body,Label
0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,##############################################...,1
4,I thought you might like these:\n1) Slim Down ...,1
...,...,...
16041,"Subject: monday 22 nd oct\n louise ,\n do you ...",0
16042,Subject: missing bloomberg deals\n stephanie -...,0
16043,Subject: eops salary survey questionnaire\n we...,0
16044,"Subject: q 3 comparison\n hi louise ,\n i have...",0
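
Before preprocessing, it may be worth checking the merged data for placeholder and repeated messages: the `df_1` preview shows bodies that read `empty`, and rows 1 and 2 appear to start with the same text. The sketch below uses a small made-up frame (`toy_df` and its rows are hypothetical stand-ins for `combined_df`) to show one possible cleanup, assuming that `empty` is a placeholder and that exact duplicates are not wanted.

```python
import pandas as pd

# Toy stand-in for combined_df (hypothetical rows, not the real data).
toy_df = pd.DataFrame({
    "Body": ["win money now", "win money now", "empty", "meeting at noon", None],
    "Label": [1, 1, 0, 0, 0],
})

# Drop missing bodies and the literal placeholder "empty".
cleaned = toy_df.dropna(subset=["Body"])
cleaned = cleaned[cleaned["Body"].str.strip().str.lower() != "empty"]

# Drop exact duplicate messages and renumber the index.
cleaned = cleaned.drop_duplicates(subset="Body").reset_index(drop=True)

# Class balance after cleaning.
label_counts = cleaned["Label"].value_counts().to_dict()
print(cleaned)
print(label_counts)
```

Whether to apply the same steps to the real data depends on the lab's goals; removing rows changes the class balance, so the label counts are worth comparing before and after.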


## Preprocessing
Apply the natural language preprocessing techniques you consider necessary (case conversion,
accent removal, contraction expansion, stop word removal, etc.).

In [58]:
# Lowercasing, punctuation removal, and tokenization
combined_df['No_Punctuation_Body'] = combined_df['Body'].apply(remove_punctuation)
combined_df['No_Punctuation_Body'].head()

0    [save, up, to, 70, on, life, insurance, why, s...
1    [1, fight, the, risk, of, cancer, http, www, a...
2    [1, fight, the, risk, of, cancer, http, www, a...
3    [adult, club, offers, free, membership, instan...
4    [i, thought, you, might, like, these, 1, slim,...
Name: No_Punctuation_Body, dtype: object

In [59]:
# Stop word removal
combined_df['No_StopWords_Body'] = combined_df['No_Punctuation_Body'].apply(remove_stopwords)
combined_df['No_StopWords_Body'].head()

0    [save, 70, life, insurance, spend, life, quote...
1    [1, fight, risk, cancer, http, www, adclick, w...
2    [1, fight, risk, cancer, http, www, adclick, w...
3    [adult, club, offers, free, membership, instan...
4    [thought, might, like, 1, slim, guaranteed, lo...
Name: No_StopWords_Body, dtype: object

In [60]:
combined_df.head()

,Body,Label,No_Punctuation_Body,No_StopWords_Body
0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1,"[save, up, to, 70, on, life, insurance, why, s...","[save, 70, life, insurance, spend, life, quote..."
1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1,"[1, fight, the, risk, of, cancer, http, www, a...","[1, fight, risk, cancer, http, www, adclick, w..."
2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1,"[1, fight, the, risk, of, cancer, http, www, a...","[1, fight, risk, cancer, http, www, adclick, w..."
3,##############################################...,1,"[adult, club, offers, free, membership, instan...","[adult, club, offers, free, membership, instan..."
4,I thought you might like these:\n1) Slim Down ...,1,"[i, thought, you, might, like, these, 1, slim,...","[thought, might, like, 1, slim, guaranteed, lo..."


In [61]:
# Rebuild Body from the cleaned tokens, then drop the intermediate columns
combined_df['Body'] = combined_df['No_StopWords_Body'].apply(generate_body)
combined_df.drop(columns=['No_Punctuation_Body', 'No_StopWords_Body'], inplace=True)

In [62]:
combined_df.head()

,Body,Label
0,save 70 life insurance spend life quote saving...,1
1,1 fight risk cancer http www adclick ws p cfm ...,1
2,1 fight risk cancer http www adclick ws p cfm ...,1
3,adult club offers free membership instant acce...,1
4,thought might like 1 slim guaranteed lose 10 1...,1
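
The task statement also mentions accent removal and contraction expansion, which the cells above do not perform. A minimal sketch of both follows, using only the standard library. The `CONTRACTIONS` map and the function names `strip_accents` and `expand_contractions` are hypothetical and deliberately tiny; a real contraction list would be much longer.

```python
import unicodedata

# Hypothetical, deliberately tiny contraction map for illustration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_contractions(text):
    """Replace known contractions word by word (expects lowercased input)."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())

sample = "It's a café, you can't miss it"
normalized = expand_contractions(strip_accents(sample.lower()))
print(normalized)
```

If steps like these were added to the notebook, they would need to run before `remove_punctuation`, since that function replaces apostrophes with spaces and so splits contractions apart.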


## Text Representation
Use the Bag-of-Words (BoW) model (for n = 1, 2) and TF-IDF. Show examples of the messages in
their numeric representation.

### Reference
https://www.analyticsvidhya.com/blog/2021/07/bag-of-words-vs-tfidf-vectorization-a-hands-on-tutorial/

In [76]:
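# TF-IDF, keeping only terms that appear in 10% to 90% of the messages
vector = TfidfVectorizer(min_df=0.1, max_df=0.9, use_idf=True)
matrix = vector.fit_transform(combined_df['Body'])
# Dense copy of the sparse matrix, for display in a DataFrame
array = matrix.toarray()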

In [77]:
names = vector.get_feature_names_out()
vector_df = pd.DataFrame(np.round(array, 2), columns=names)

In [78]:
vector_df.head()

,00,10,11,12,20,2000,2001,2002,30,also,...,time,today,us,use,want,way,well,work,would,www
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.11,0.0,0.13,0.0,0.0,0.0,0.0,0.13,0.0,...,0.0,0.0,0.0,0.12,0.0,0.0,0.0,0.0,0.0,0.68
2,0.0,0.13,0.0,0.15,0.0,0.0,0.0,0.0,0.16,0.0,...,0.0,0.0,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.7
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.13,0.0,0.15,0.0,0.0,0.0,0.0,0.16,0.0,...,0.0,0.0,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.58
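
To make the weights in the table above less opaque, the sketch below recomputes TF-IDF by hand on a made-up three-message corpus and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed IDF, raw term counts, L2 row normalization); the corpus and the variable names are illustrative only, and the notebook's `min_df`/`max_df` filtering is left out to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free money free", "meeting money", "free meeting today"]  # toy corpus

# Raw term counts (rows = documents, columns = vocabulary terms).
counter = CountVectorizer()
counts = counter.fit_transform(docs).toarray()

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts and scale every row to unit L2 norm.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The library result should match the manual computation.
library_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, library_tfidf))
```

Because each row is normalized to unit length, a term's weight depends on the other terms in the same message as well as its own count. This is consistent with `www` scoring 0.68 in row 1, where it is one of only a few retained terms.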


In [79]:
# BoW (n = 1, 2): raw counts of unigrams and bigrams
vector = CountVectorizer(ngram_range=(1,2), min_df=0.1, max_df=0.9)
matrix = vector.fit_transform(combined_df['Body'])
array = matrix.toarray()

In [80]:
names = vector.get_feature_names_out()
vector_df = pd.DataFrame(array, columns=names)

In [81]:
vector_df.head()

,00,10,11,12,20,2000,2001,2002,30,also,...,time,today,us,use,want,way,well,work,would,www
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,1,0,1,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,7
2,0,1,0,1,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,6
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,1,0,1,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,5
