## Data exploration
Two separate datasets are provided for this lab. Inspect the data and perform the operations
needed to merge them into a single dataset that contains each email's message and a label
indicating whether it is spam.
Show examples from the individual datasets and from the final dataset.

In [47]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import pandas as pd
import re
import string
import nltk

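# Requires the NLTK data packages: nltk.download('stopwords') and nltk.download('wordnet')
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes membership tests fast
porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()
punctuation = list(string.punctuation + "\n")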

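def remove_punctuation(text):
    """Lowercase the text, replace punctuation and newlines with spaces, and split into tokens."""
    no_punctuation_text = str(text).lower()
    for i in punctuation:
        no_punctuation_text = no_punctuation_text.replace(i, ' ')
    return no_punctuation_text.split()

def remove_stopwords(tokens):
    """Drop English stop words from a list of tokens."""
    return [i for i in tokens if i not in stopwords]

def stemming(tokens):
    """Reduce each token to its Porter stem."""
    return [porter_stemmer.stem(word) for word in tokens]

def lemmatizer(tokens):
    """Map each token to its WordNet lemma (treated as a noun by default)."""
    return [wordnet_lemmatizer.lemmatize(word) for word in tokens]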
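
For comparison, the punctuation step can also be written with `str.translate`, which replaces every listed character in a single pass instead of looping over `punctuation`. This is only a minimal sketch, not the notebook's implementation; the names `tokenize` and `_PUNCT_TABLE` are hypothetical.

```python
import string

# Map every punctuation character and the newline to a single space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation + "\n"})

def tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return str(text).lower().translate(_PUNCT_TABLE).split()

print(tokenize("1) Fight The Risk of Cancer!\nhttp://www.adcli"))
# ['1', 'fight', 'the', 'risk', 'of', 'cancer', 'http', 'www', 'adcli']
```

Like `remove_punctuation`, this splits URLs into fragments such as `http` and `www`, which may or may not be desirable for spam detection.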

In [48]:
df_1 = pd.read_csv('completeSpamAssassin.csv')
df_2 = pd.read_csv('enronSpamSubset.csv')

# Drop the leftover index columns, which carry no useful information
df_1.drop(columns='Unnamed: 0', inplace=True)
df_2.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

In [49]:
df_1

Unnamed: 0,Body,Label
0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,##############################################...,1
4,I thought you might like these:\n1) Slim Down ...,1
...,...,...
6041,empty,0
6042,___ ___ ...,0
6043,IN THIS ISSUE:01. Readers write\n02. Extension...,0
6044,empty,0


In [50]:
df_2

Unnamed: 0,Body,Label
0,Subject: stock promo mover : cwtd\n * * * urge...,1
1,Subject: are you listed in major search engine...,1
2,"Subject: important information thu , 30 jun 20...",1
3,Subject: = ? utf - 8 ? q ? bask your life with...,1
4,"Subject: "" bidstogo "" is places to go , things...",1
...,...,...
9995,"Subject: monday 22 nd oct\n louise ,\n do you ...",0
9996,Subject: missing bloomberg deals\n stephanie -...,0
9997,Subject: eops salary survey questionnaire\n we...,0
9998,"Subject: q 3 comparison\n hi louise ,\n i have...",0


In [51]:
# Stack both datasets and renumber the rows from 0
combined_df = pd.concat([df_1, df_2], ignore_index=True)
combined_df

Unnamed: 0,Body,Label
0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,##############################################...,1
4,I thought you might like these:\n1) Slim Down ...,1
...,...,...
16041,"Subject: monday 22 nd oct\n louise ,\n do you ...",0
16042,Subject: missing bloomberg deals\n stephanie -...,0
16043,Subject: eops salary survey questionnaire\n we...,0
16044,"Subject: q 3 comparison\n hi louise ,\n i have...",0
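
The previews above hint at two things worth checking before preprocessing: rows 1 and 2 begin with the same text, and some bodies are the literal string `empty`. The following is a sketch of such a check on a toy frame, assuming that `empty` marks a missing message; the rows in `toy_df` are invented for illustration and the names `toy_df`, `is_placeholder`, and `cleaned` are hypothetical.

```python
import pandas as pd

# Toy stand-in for combined_df; these rows are invented for illustration only.
toy_df = pd.DataFrame({
    "Body": ["win a prize", "win a prize", "empty", "meeting at noon", "empty"],
    "Label": [1, 1, 0, 0, 0],
})

# Assumption: a body equal to the string "empty" is a placeholder for a missing message.
is_placeholder = toy_df["Body"].str.strip().str.lower().eq("empty")

cleaned = (
    toy_df[~is_placeholder]          # drop placeholder rows
    .drop_duplicates(subset="Body")  # keep the first copy of each message
    .reset_index(drop=True)
)

print(cleaned)
print(cleaned["Label"].value_counts())  # class balance after cleaning
```

Whether duplicates and placeholder rows should actually be dropped depends on the lab's goals; the check itself only makes them visible.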


## Preprocessing
Apply the natural language preprocessing techniques you consider necessary (case conversion
to lowercase or uppercase, accent removal, contraction expansion, stop word removal,
etc.).
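
Two of the techniques listed above, accent removal and contraction expansion, can be sketched with the standard library alone. This is an illustrative sketch rather than part of the notebook: the `CONTRACTIONS` map is a hypothetical, deliberately tiny sample, and the function names are assumptions.

```python
import re
import unicodedata

# Hypothetical, deliberately small contraction map; a real one would be much larger.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_contractions(text):
    """Replace known contractions; expects lowercased text with straight apostrophes."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)

def normalize(text):
    """Lowercase, strip accents, then expand contractions."""
    return expand_contractions(strip_accents(str(text).lower()))

print(normalize("Don't miss the café"))  # do not miss the cafe
```

A step like this would need to run before punctuation removal, because once the apostrophe is replaced with a space the contraction can no longer be matched.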

In [52]:
# Punctuation removal and tokenization
combined_df['No_Punctuation_Body'] = combined_df['Body'].apply(remove_punctuation)
combined_df['No_Punctuation_Body'].head()

0    [save, up, to, 70, on, life, insurance, why, s...
1    [1, fight, the, risk, of, cancer, http, www, a...
2    [1, fight, the, risk, of, cancer, http, www, a...
3    [adult, club, offers, free, membership, instan...
4    [i, thought, you, might, like, these, 1, slim,...
Name: No_Punctuation_Body, dtype: object

In [None]:
# Stop word removal
combined_df['No_StopWords_Body'] = combined_df['No_Punctuation_Body'].apply(remove_stopwords)
combined_df['No_StopWords_Body'].head()

In [None]:
combined_df.head()
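
The `stemming` helper defined earlier is not applied in the cells shown here. As a sketch of how that step might look, the block below runs the Porter stemmer on a small token list; it assumes NLTK is installed (the Porter stemmer needs no extra data download), and the sample tokens and the column name `Stemmed_Body` are hypothetical.

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

def stemming(tokens):
    """Reduce each token to its Porter stem."""
    return [porter_stemmer.stem(word) for word in tokens]

# Sample tokens chosen for illustration.
sample_tokens = ["offers", "running", "membership"]
print(stemming(sample_tokens))

# On the notebook's DataFrame this could be applied as (hypothetical column name):
# combined_df['Stemmed_Body'] = combined_df['No_StopWords_Body'].apply(stemming)
```

Stemming and lemmatization are usually alternatives rather than consecutive steps, so in practice one would pick either `stemming` or `lemmatizer` for the final token column.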