#### Exercise - Spam Detection

1. Load a spam dataset
2. Preprocess the data with whichever NLP techniques you consider necessary
3. Use both bag-of-words techniques to generate your numeric dataset
4. Train a classifier of your choice
5. Evaluate the algorithm's results
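
Step 3 does not name the two bag-of-words techniques. The sketch below assumes they are raw term counts and TF-IDF weighting, and uses only the standard library so the arithmetic stays visible. The toy messages and the helper names `count_vector` and `tfidf_vector` are hypothetical. The plain `log(N / df)` weighting is one common variant; scikit-learn's `TfidfVectorizer` applies smoothing and normalisation by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy, already-preprocessed messages (hypothetical, not taken from spamraw.csv).
docs = [
    "free prize call now",
    "call me later",
    "free free prize",
]

tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the column order of both representations.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: number of documents in which each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def count_vector(tokens):
    """Raw term counts, one column per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(tokens):
    """Term counts weighted by an unsmoothed log inverse document frequency."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocab]


count_matrix = [count_vector(tokens) for tokens in tokenized]
tfidf_matrix = [tfidf_vector(tokens) for tokens in tokenized]

print(vocab)
print(count_matrix[2])  # [0, 2, 0, 0, 0, 1]
```

Both matrices have one row per message and one column per vocabulary word, so either can be passed to a classifier in step 4. Under this variant, a word that appears in every message gets weight `log(1) = 0`.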

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../dataset/spamraw.csv")

In [3]:
df

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...
...,...,...
5554,ham,You are a great role model. You are giving so ...
5555,ham,"Awesome, I remember the last time we got someb..."
5556,spam,"If you don't, your prize will go to another cu..."
5557,spam,"SMS. ac JSco: Energy is high, but u may not kn..."


In [13]:
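import re

import nltk
import spacy
from nltk.stem import RSLPStemmer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from unidecode import unidecode

# The NLTK stopwords corpus and tokenizer data must already be downloaded.
# Sets make the membership test in remove_stopwords fast.
words_en = set(nltk.corpus.stopwords.words('english'))
words_pt = set(nltk.corpus.stopwords.words('portuguese'))

# Assumption: the small English spaCy model is installed
# (python -m spacy download en_core_web_sm). The lemma step needs `nlp`.
nlp = spacy.load('en_core_web_sm')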

In [14]:
movies = pd.read_csv('../dataset/movies.csv', index_col=0)
movies

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
4995,This is the kind of picture John Lassiter woul...,1
4996,A MUST SEE! I saw WHIPPED at a press screening...,1
4997,NBC should be ashamed. I wouldn't allow my chi...,0
4998,This movie is a clumsy mishmash of various gho...,0


In [15]:
movies_sample = movies.sample(frac=0.1, replace=False, ignore_index=True)
movies_sample

Unnamed: 0,text,label
0,Erroll works for The Department of Public Safe...,1
1,Zoey 101 is basically about a girl named Zoey ...,0
2,"A lot of people hated this movie, but that I b...",1
3,ZERO stars out of ****<br /><br />Endless Desc...,0
4,Forget the recent dire American remake which s...,1
...,...,...
4995,"This ""space snippet"" was kind of dumb. I guess...",0
4996,I watched this movie the night it premiered on...,1
4997,This was a very faithful presentation of Lewis...,1
4998,What makes watching and reviewing films a plea...,1


In [16]:
class PreProcesssPhrase:
    """Text-cleaning steps that can be chained by name through `pipeline`."""

    def __init__(self):
        # Build the stemmer once instead of on every call.
        self.porter = PorterStemmer()

    def remove_accent(self, text):
        # Transliterate accented characters to their closest ASCII form.
        return unidecode(text)

    def remove_digits(self, text):
        return re.sub(r'\d', '', text)

    def remove_special_char(self, text):
        # Keep only ASCII letters and spaces.
        return re.sub(r'[^a-zA-Z ]', '', text)

    def word_lower(self, text):
        # Lowercase and, as a side effect, strip any remaining accents.
        return unidecode(text.lower())

    def tokenizer(self, text):
        # From this step on, the pipeline carries a list of tokens.
        return word_tokenize(text)

    def remove_stopwords(self, tokens):
        return [word for word in tokens if word not in words_en]

    def stemmer(self, tokens):
        # Stem each token and join the result back into a single string.
        return ' '.join(self.porter.stem(word) for word in tokens)

    def lemma(self, tokens):
        # Requires a loaded spaCy pipeline bound to the global name `nlp`.
        return ' '.join(nlp(word)[0].lemma_ for word in tokens)

    def pipeline(self, text, methods):
        # Map each step name to the method that implements it.
        steps = {
            'remove_accent': self.remove_accent,
            'remove_digits': self.remove_digits,
            'remove_special_char': self.remove_special_char,
            'word_lower': self.word_lower,
            'tokenizer': self.tokenizer,
            'remove_stopwords': self.remove_stopwords,
            'stemmer': self.stemmer,
            'lemma': self.lemma,
        }

        # Apply the requested steps in order, feeding each output to the next.
        for method in methods:
            text = steps[method](text)
        return text
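
To see what the regex-based steps do on their own, the sketch below applies the same two patterns to the visible start of one review from `movies_sample`, without NLTK. The helper name `clean` and the optional tag-stripping pattern `<[^>]+>` are assumptions added for illustration and are not part of the class. Without that step, HTML line breaks survive as `br` tokens, as can be seen in the `filtered_words` output of the next cell.

```python
import re


def clean(text, strip_tags=False):
    """Apply the digit and special-character patterns, then lowercase and split."""
    if strip_tags:
        # Hypothetical extra step: replace HTML tags such as <br /> with a space.
        text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'\d', '', text)          # same pattern as remove_digits
    text = re.sub(r'[^a-zA-Z ]', '', text)  # same pattern as remove_special_char
    # A whitespace split stands in for word_tokenize here.
    return text.lower().split()


sample = "ZERO stars out of ****<br /><br />Endless"

tokens_raw = clean(sample)
tokens_no_tags = clean(sample, strip_tags=True)

print(tokens_raw)      # ['zero', 'stars', 'out', 'of', 'br', 'br', 'endless']
print(tokens_no_tags)  # ['zero', 'stars', 'out', 'of', 'endless']
```

If tag stripping is wanted in the class, it has to run before `remove_special_char`, because that step deletes the angle brackets that mark the tags.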

In [17]:
preprocess = PreProcesssPhrase()
pipeline = [
    'remove_digits',
    'remove_special_char',
    'word_lower',
    'tokenizer',
    'remove_stopwords',
    'stemmer'
]
movies_sample['filtered_words'] = movies_sample['text'].apply(preprocess.pipeline, methods=pipeline)
movies_sample

Unnamed: 0,text,label,filtered_words
0,Erroll works for The Department of Public Safe...,1,errol work depart public safeti job check sex ...
1,Zoey 101 is basically about a girl named Zoey ...,0,zoey basic girl name zoey transfer boy board s...
2,"A lot of people hated this movie, but that I b...",1,lot peopl hate movi blame two fact want much l...
3,ZERO stars out of ****<br /><br />Endless Desc...,0,zero star br br endless descent absolut redeem...
4,Forget the recent dire American remake which s...,1,forget recent dire american remak sadli tarnis...
...,...,...,...
4995,"This ""space snippet"" was kind of dumb. I guess...",0,space snippet kind dumb guess suppos shocker u...
4996,I watched this movie the night it premiered on...,1,watch movi night premier mtv usual mtv movi ki...
4997,This was a very faithful presentation of Lewis...,1,faith present lewiss life mid dialogu theolog ...
4998,What makes watching and reviewing films a plea...,1,make watch review film pleasur everi least exp...
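
For step 5, accuracy alone can be misleading when one class dominates, which is common in spam data. The sketch below computes precision, recall and F1 for the spam class without any dependencies. The function name `binary_metrics` and the label lists are hypothetical; no real model predictions are behind them. With scikit-learn available, `sklearn.metrics.classification_report` reports the same quantities.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall and F1 with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    correct = sum(t == p for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true labels and predictions, using the `type` values of the spam dataset.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "spam"]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Passing `positive=1` makes the same function work with the 0/1 `label` column of the movies dataset.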
