# Logistic Regression: SPAM Detection
This exercise presents the fundamentals of logistic regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Exercise statement
The goal is to build a machine learning system capable of predicting whether a given email is SPAM or not. The following DataSet is used:

##### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/)

The corpus trec07p contains 75,419 messages:

    25220 ham
    50199 spam

These messages constitute all the messages delivered to a particular
server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400


In [1]:
# This class makes it easier to process emails that contain HTML code

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialize the base parser; convert_charrefs=True decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [2]:
# This function removes the HTML tags found in the text of the emails
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [3]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a><td>'
strip_tags(t)

'Phrack World News'
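
The same approach also decodes HTML character references, because `convert_charrefs` is enabled. The following is a minimal, self-contained sketch; the names `TagStripper` and `strip_tags_demo` and the sample strings are illustrative and not part of the original exercise. It calls `close()` so that any text still buffered by the parser is flushed.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment (illustrative name)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_demo(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags_demo("<p>Tom &amp; Jerry</p>"))
print(strip_tags_demo("<b>50% off</b> today"))
```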

Besides removing any HTML tags found in the email, other steps are needed to keep the messages free of unnecessary noise. These include removing punctuation marks, discarding email fields that are not relevant, and stripping the affixes of each word so that only its root remains (stemming). The class shown below performs these transformations.

In [4]:
import email
import string
import nltk

# The stopword list requires a one-time download: nltk.download('stopwords')

class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)
        
    def parse(self,email_path):
        """Parse an email."""
        with open(email_path,errors = 'ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)
        
    def get_email_content(self,msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body= self.get_email_body(msg.get_payload(),
                                 msg.get_content_type())
        content_type = msg.get_content_type()
        #Return the content of the email
        return {"Subject": subject,
               "body":body,
               "content_type": content_type}
        
    def get_email_body(self,payload,content_type):
        """Extract the body of the email."""
        body= []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body
        
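    def tokenize(self, text):
        """Transform a text string into tokens. Performs two main actions:
        cleans the punctuation symbols and stems the text."""
        for c in self.punctuation:
            text = text.replace(c, "")
        # NOTE: "/t" and "/n" are literal slash sequences, not the escapes "\t" and "\n",
        # so tabs and newlines are NOT removed here (the punctuation loop has already
        # deleted every "/"). This is why newlines still appear inside the tokens.
        text = text.replace("/t", "")
        text = text.replace("/n", "")
        tokens = list(filter(None, text.split(" ")))
        # Stem the tokens, skipping English stopwords
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]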
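
The cleanup steps of `tokenize` can also be sketched without NLTK. The version below is only an illustration under stated assumptions: it uses a tiny hypothetical stopword list, performs no stemming, lowercases the text first, and splits on any whitespace so that tabs and newlines never end up inside a token.

```python
import string

# Tiny illustrative stopword list (the exercise uses NLTK's English list instead).
STOPWORDS = {"this", "is", "a", "the", "to", "of"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def simple_tokenize(text):
    """Lowercase, drop punctuation, split on any whitespace, remove stopwords."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    # str.split() with no argument splits on runs of spaces, tabs and newlines.
    return [w for w in cleaned.split() if w not in STOPWORDS]


print(simple_tokenize("Hello,\n\tworld! This is a test."))
```

Splitting with `str.split()` instead of `split(" ")` is the design choice that matters here: it treats every run of whitespace as a separator.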
        

        
        

Reading an email in .raw format

In [5]:
inmail = open("datasets/datasets/trec07p/data/inmail.1").read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

##### Parsing the email

In [6]:
p = Parser()
p.parse("datasets/datasets/trec07p/data/inmail.1")

{'Subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['\n\n\n\n\n\n\ndo',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occasion\n\n\n\n\n\ntri',
  'viagra\nyour',
  'anxieti',
  'thing',
  'past',
  'will\nb',
  'back',
  'old',
  'self\n\n\n'],
 'content_type': 'multipart/alternative'}
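
The reported `content_type` is `multipart/alternative` because the message is a container for several parts, and `get_payload()` then returns a list of sub-messages instead of a string. A minimal sketch with the standard `email` module illustrates this; the raw message `RAW` is a made-up example, not taken from the corpus.

```python
import email

# Hypothetical two-part message, written by hand for illustration.
RAW = (
    "Subject: Demo message\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><b>html body</b></body></html>\n"
    "--XYZ--\n"
)

msg = email.message_from_string(RAW)
print(msg.get_content_type())
print(msg.is_multipart())

# For a multipart message get_payload() returns a list of sub-messages.
parts = [(p.get_content_type(), p.get_payload().strip()) for p in msg.get_payload()]
print(parts)
```

This is the case handled by the `type(payload) is list` branch of `get_email_body`, which recurses into each part.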

##### Reading the index
These helper functions load into memory the path of each email and its corresponding label {spam, ham}.

In [7]:
index = open("datasets/datasets/trec07p/full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [8]:
import os 

DATASET_PATH = "datasets/datasets/trec07p"

def parse_index(path_to_index, n_elements):
    ret_indexes = []
    index = open(path_to_index).readlines()
    for i in range(n_elements):
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes
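
As an alternative to splitting on `" ../"` and trimming the last character, a single index line can be parsed with `str.split()` and the relative path resolved against the `full/` folder where the index lives. This is only a sketch; the function name `parse_index_line` is hypothetical.

```python
import os


def parse_index_line(line, dataset_path):
    """Split one index line such as 'spam ../data/inmail.1' into label and path."""
    label, rel_path = line.split()  # split() also discards the trailing newline
    # The index lives in <dataset>/full/, so '../data/...' is relative to that folder.
    full_path = os.path.normpath(os.path.join(dataset_path, "full", rel_path))
    return {"label": label, "email_path": full_path}


entry = parse_index_line("spam ../data/inmail.1\n", "datasets/datasets/trec07p")
print(entry)
```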

In [9]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [10]:
indexes = parse_index("datasets/datasets/trec07p/full/index", 10)
indexes

[{'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.1'},
 {'label': 'ham', 'email_path': 'datasets/datasets/trec07p/data/inmail.2'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.3'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.4'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.5'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.6'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.7'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.8'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.9'},
 {'label': 'ham', 'email_path': 'datasets/datasets/trec07p/data/inmail.10'}]

# Preprocessing the DataSet

The functions presented above allow the emails to be read programmatically and processed to remove the components that are not useful for SPAM detection. However, each email is still represented by a Python dictionary containing a series of words.

In [11]:
# Load the index and the labels into memory.
index = parse_index("datasets/datasets/trec07p/full/index", 1)

In [12]:
# Read the first email

open(index[0]["email_path"]).read()

'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0

In [13]:
# Parse the first email
mail, label = parse_email(index[0])
print("The email is: ", label, "\n")
print(mail)

The email is:  spam 

{'Subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['\n\n\n\n\n\n\ndo', 'feel', 'pressur', 'perform', 'rise', 'occasion\n\n\n\n\n\ntri', 'viagra\nyour', 'anxieti', 'thing', 'past', 'will\nb', 'back', 'old', 'self\n\n\n'], 'content_type': 'multipart/alternative'}


The Logistic Regression algorithm cannot ingest text as part of the DataSet; therefore, additional functions must be applied to transform the text of the parsed emails into a numerical representation.
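
One common numerical representation is a bag of words: each document becomes a vector of word counts over a shared vocabulary. The following pure-Python sketch shows the idea, assuming plain whitespace tokenization; the function name `bag_of_words` and the two sample documents are illustrative.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


vocab, matrix = bag_of_words(["cheap pill cheap", "meet old friend"])
print(vocab)
print(matrix)
```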

### Applying CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the email as a single text string.
# NOTE: no separator is added between the subject and the body; they only stay
# apart here because the first body token begins with a newline.
prep_email = [" ".join(mail['Subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)

print("e-mail:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out(),"\n")

e-mail: ['gener ciali brand qualiti\n\n\n\n\n\n\ndo feel pressur perform rise occasion\n\n\n\n\n\ntri viagra\nyour anxieti thing past will\nb back old self\n\n\n'] 

Input features: ['anxieti' 'back' 'brand' 'ciali' 'do' 'feel' 'gener' 'occasion' 'old'
 'past' 'perform' 'pressur' 'qualiti' 'rise' 'self' 'thing' 'tri' 'viagra'
 'will' 'your'] 



In [15]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
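
The feature list has 20 entries, and the one-letter token `b` (from `'will\nb'`) is missing. This is consistent with the default `token_pattern` of `CountVectorizer`, which lowercases the text and keeps only runs of two or more word characters; it also means the embedded newlines act as separators at this stage. The sketch below reproduces that tokenization with the standard `re` module on a fragment of the email.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer:
# tokens are runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "will\nb back old self"
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)
```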


#### Applying OneHotEncoding

In [16]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['Subject'] + mail['body']]
enc = OneHotEncoder(handle_unknown = 'ignore')
X = enc.fit_transform(prep_email)

print("Features:", enc.get_feature_names_out(), "\n")
print("\nValues:\n", X.toarray())

Features: ['x0_\n\n\n\n\n\n\ndo' 'x0_anxieti' 'x0_back' 'x0_brand' 'x0_ciali'
 'x0_feel' 'x0_gener' 'x0_occasion\n\n\n\n\n\ntri' 'x0_old' 'x0_past'
 'x0_perform' 'x0_pressur' 'x0_qualiti' 'x0_rise' 'x0_self\n\n\n'
 'x0_thing' 'x0_viagra\nyour' 'x0_will\nb'] 


Values:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
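
In contrast to the count vector, one-hot encoding produces one row per token, with a single 1 in the column of that token's category. A minimal pure-Python sketch of the idea follows; the function name `one_hot` and the sample tokens are illustrative.

```python
def one_hot(tokens):
    """Encode each token as a row with a single 1 in the column of its category."""
    categories = sorted(set(tokens))
    column = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for tok in tokens:
        row = [0] * len(categories)
        row[column[tok]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["gener", "ciali", "brand", "ciali"])
print(categories)
print(rows)
```

Because the result has one row per token rather than one row per email, the count-based representation is the one that fits a per-email classifier more directly.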

#### Helper functions for preprocessing the DataSet

In [17]:
def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end = '')
        mail, label = parse_email(indexes[i])
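        # NOTE: " ".join(['Subject']) yields the literal string 'Subject', not the subject
        # tokens (that would be mail['Subject']), so every sample starts with 'Subject'.
        X.append(" ".join(['Subject']) + " ".join(mail['body']))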
        y.append(label)
    return X, y
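
As written, `create_prep_dataset` prepends the literal word `Subject` to every sample instead of the subject tokens. A possible variant, shown only as a sketch and not used for the results recorded in this exercise, joins the subject and body tokens with a single separator. The helper name `build_text` and the sample dictionary are hypothetical.

```python
def build_text(mail):
    """Join subject and body tokens into one space-separated string."""
    return " ".join(mail["Subject"] + mail["body"])


# Hypothetical parsed email, shaped like the dictionaries returned by Parser.parse.
sample = {"Subject": ["gener", "ciali"], "body": ["feel", "pressur"],
          "content_type": "text/plain"}
print(build_text(sample))
```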

In [18]:
# Read only a subset of 1000 emails
X_train, y_train = create_prep_dataset("datasets/datasets/trec07p/full/index", 1000)
X_train

Parsing email: 1000

['Subject\n\n\n\n\n\n\ndo feel pressur perform rise occasion\n\n\n\n\n\ntri viagra\nyour anxieti thing past will\nb back old self\n\n\n',
 'Subjecthi ive updat gulu i check mirrors\nit seem littl typo debianreadm file\n\nexample\nhttpgulususherbrookecadebianreadme\nftpftpfrdebianorgdebianreadme\n\ntest lenni access releas diststest the\ncurr test develop snapshot name etch packag which\nhav test unstabl pass autom test propog to\nthi release\n\netch replac lenni like readmehtml\n\n\n\n \nyan morin\nconsult en logiciel libre\nyanmorinsavoirfairelinuxcom\n5149941556\n\n\n \nto unsubscrib email debianmirrorsrequestlistsdebianorg\nwith subject unsubscrib troubl contact listmasterlistsdebianorg\n\n',
 'Subjectmega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click here\nhttpwwwmoujsjkhchumcom\n\n authent viagra\n\nmega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click here\n\n',
 'Subject\nhey billi \n\nit realli fun go night \nand talk

##### Applying vectorization to the data

In [19]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [20]:
print(X_train.toarray())
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features: 21970


In [21]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,00,000,0000,000000,000002,000048000000000,000099,0000ff,000115000000000,0001171749,...,绰tel,绰۹ϵͳctsƽ,肾ǝvă,鏗ėvłq,饻jwk,鵵χ,낢ȏglgwă,뼰ʱϵ,쫷ƹư,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam'

#### Training the Logistic Regression algorithm with the preprocessed DataSet

In [23]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
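
Conceptually, a fitted logistic regression model scores an email as a weighted sum of its word counts plus a bias, and passes that score through the logistic (sigmoid) function to obtain a probability. The sketch below illustrates this with hand-picked, hypothetical weights for a three-word vocabulary; it is not how scikit-learn is implemented, and the numbers are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(counts, weights, bias):
    """Probability of the positive class for one vector of word counts."""
    score = bias + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(score)


# Hypothetical weights for a three-word vocabulary: ['meet', 'pill', 'viagra'].
weights = [-1.2, 0.8, 2.5]
bias = -0.5

p_spam = predict_proba([0, 1, 1], weights, bias)
print(round(p_spam, 3), "spam" if p_spam >= 0.5 else "ham")
```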

# 4.- Prediction

In [24]:
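# Reading a DataSet of emails for prediction.

# Read 150 emails from the DataSet and keep only the last 50.
# NOTE: the model above was trained on the first 1000 emails of the same index, so these
# 50 emails were already seen during training. Reading 1500 emails and keeping X[1000:]
# would give a test set that was not used to train the algorithm.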
X, y = create_prep_dataset("datasets/datasets/trec07p/full/index", 150)
X_test = X[100:]
y_test = y[100:]

Parsing email: 150

#### Preprocessing the emails with the vectorizer created earlier

In [25]:
X_test = vectorizer.transform(X_test)
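
Only `transform` is called here, not `fit_transform`, so the test emails are mapped onto the vocabulary learned from the training data and the columns match what the classifier expects. Words that never appeared during training are simply dropped. A small pure-Python sketch of that behavior, with hypothetical helper names and sample documents:

```python
from collections import Counter


def fit_vocabulary(docs):
    """Learn the sorted vocabulary from the training documents only."""
    return sorted({tok for doc in docs for tok in doc.split()})


def transform(docs, vocab):
    """Count vocabulary terms per document; words outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[term] for term in vocab])
    return rows


vocab = fit_vocabulary(["cheap pill", "meet friend"])
print(vocab)
print(transform(["cheap cheap watch"], vocab))
```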

In [26]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham',
       'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam'], dtype='<U4')

In [27]:
print("Prediction\n", y_pred)
print("\nTrue labels", y_test)

Prediction
 ['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'ham' 'spam' 'spam' 'ham' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam']

True labels ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']


#### Evaluating the results

In [28]:
from sklearn.metrics import accuracy_score
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 1.000
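
Accuracy is the fraction of predictions that match the true labels. Since the classes are imbalanced, it can help to also count each (true, predicted) pair. The sketch below computes both by hand on a small set of hypothetical labels; the helper names are illustrative and the numbers are not results from this exercise.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


# Hypothetical labels, for illustration only.
y_true_demo = ["spam", "spam", "ham", "spam", "ham"]
y_pred_demo = ["spam", "spam", "spam", "spam", "ham"]

print(accuracy(y_true_demo, y_pred_demo))
print(confusion_counts(y_true_demo, y_pred_demo))
```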


# 5.- Enlarging the DataSet

In [29]:
# Read 20,000 emails
X, y = create_prep_dataset("datasets/datasets/trec07p/full/index", 20000)

Parsing email: 20000

In [30]:
# Use 15,000 emails to train the algorithm and 5,000 for testing
X_train, y_train = X[:15000], y[:15000]
X_test, y_test = X[15000:], y[15000:]

In [31]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [32]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [33]:
X_test = vectorizer.transform(X_test)
y_pred = clf.predict(X_test)

In [34]:
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.989
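
To put this figure in context, it can be compared with a trivial classifier that always answers "spam". The sketch below uses the class counts reported for the full corpus; it assumes the 5,000 test emails have roughly the same class ratio, which is not verified in this exercise.

```python
# Class counts reported for the full trec07p corpus.
ham, spam = 25220, 50199
total = ham + spam

# Accuracy of a trivial classifier that always answers "spam".
baseline = spam / total
print(total, round(baseline, 3))
```

Under that assumption, the majority-class baseline sits near two thirds, well below the accuracy obtained by the model.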
