# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of Logistic Regression
by working through one of the first problems solved with
Machine Learning techniques: SPAM detection.

The model's output is a probability in the range 0 to 1, not a categorical prediction; the class is obtained by applying a threshold to that probability.
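
As a brief reminder of the standard formulation (the symbols $x$, $w$ and $b$ are notation introduced here, not taken from the notebook), logistic regression maps a feature vector $x$ to a probability through the sigmoid function, and a label is then obtained by thresholding, typically at 0.5:

$$
P(\text{spam} \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},
\qquad
\hat{y} =
\begin{cases}
\text{spam} & \text{if } P(\text{spam} \mid x) \ge 0.5 \\
\text{ham} & \text{otherwise}
\end{cases}
$$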

Exercise statement:

The goal is to build a machine learning system capable of predicting
whether a given e-mail is SPAM or not.
The following dataset is used:
[DataSet](https://www.kaggle.com/datasets/imeepmind/preprocessed-trec-2007-public-corpus-dataset)

The trec07p corpus contains 75,419 messages, distributed as follows:
- 25,220 Ham
- 50,199 SPAM

The messages are all the e-mails delivered to a particular server
between the following dates:

- Sunday, April 8, 2007, 13:07:21 -0400
- Friday, July 6, 2007, 07:04:53 -0400


### 1.- Helper functions

In this case study on SPAM detection, the available dataset consists of raw e-mails, including their headers and additional fields.

They therefore require **preprocessing** before they can be used by the *Machine Learning* algorithm.
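
To make the structure of a raw e-mail concrete, here is a minimal sketch using the standard-library `email` module. The message text and addresses below are made up for illustration; they are not taken from the corpus.

```python
import email

# Hypothetical raw message, written inline so the example is self-contained.
RAW = (
    "From: sender@example.com\n"
    "To: rcpt@example.com\n"
    "Subject: Hello there\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XYZ--\n"
)

msg = email.message_from_string(RAW)

# Headers are accessed like dictionary keys.
print(msg["Subject"])
print(msg.get_content_type())

# walk() visits the container and every nested part; keep only the leaf parts.
parts = [
    (part.get_content_type(), part.get_payload().strip())
    for part in msg.walk()
    if not part.is_multipart()
]
print(parts)
```

Headers and body parts arrive mixed in one file, which is why the subject and each body part have to be extracted separately before any text processing.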


In [None]:
# This class helps process e-mails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # properly initialize the parent class
        self.strict = False
        self.convert_charrefs = True
        self.fed = []  # text fragments collected between tags

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


In [6]:
# This function removes the HTML tags found in the text of an e-mail
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


In [7]:
# Example: removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack world News</a></td></tr>'
print(strip_tags(t))


Phrack world News


Besides removing any HTML tags found in the e-mail,
further processing is needed so that the messages do not contain unnecessary noise.

This includes:
- Removing punctuation marks.
- Removing e-mail fields that are not relevant.
- Removing the affixes of each word and keeping only its root (*stemming*).

The class shown below performs these transformations.
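
The cleaning steps listed above can be sketched on a single sentence with the standard library only. This is an illustration under stated assumptions: `STOPWORDS` is a tiny hypothetical list and `naive_stem` is a crude suffix stripper standing in for a real stemmer. It is not the Porter algorithm used by the notebook's class.

```python
import string

# Hypothetical, very small stopword list used only for this illustration.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "are"}

def naive_stem(word):
    """Crude stand-in for stemming: strip a few common suffixes (NOT Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Remove punctuation, lowercase, drop stopwords and stem each token."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(clean("The offers are ending, claim the discounted pills!"))
# ['offer', 'end', 'claim', 'discount', 'pill']
```

A real stemmer applies many more rules, which is why stems produced by the notebook (for example `qualiti` or `pressur`) do not always look like dictionary words.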


In [None]:
import nltk
import string
import email
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

class Parser:
    def __init__(self):
        """Initialize the text processor for e-mails."""
        # Requires the NLTK stopword corpus: nltk.download("stopwords")
        self.stemmer = PorterStemmer()
        self.stopwords = set(stopwords.words("english"))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an e-mail file and return its structured content."""
        with open(email_path, errors="ignore") as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the content (subject and body) of the e-mail."""
        subject = self.tokenize(msg["Subject"]) if msg["Subject"] else []
        body = self.get_email_body(msg.get_payload(), msg.get_content_type())

        content_type = msg.get_content_type()

        # Return a dictionary with the e-mail content
        return {
            "subject": subject,
            "body": body,
            "content_type": content_type
        }

    def get_email_body(self, payload, content_type):
        """Extract the body of the e-mail as a list of tokens."""
        body = []

        if isinstance(payload, str) and content_type == "text/plain":
            return self.tokenize(payload)

        elif isinstance(payload, str) and content_type == "text/html":
            # The markup is tokenized as-is in this branch, which is why tokens such as
            # 'bgcolorffffff' show up in the parsed output; strip_tags(payload) would remove it.
            return self.tokenize(payload)

        elif isinstance(payload, list):
            # Multipart message: process each part recursively
            for part in payload:
                body += self.get_email_body(part.get_payload(), part.get_content_type())

        return body

    def tokenize(self, text):
        """Turn a string into a list of tokens.
        Removes punctuation and stopwords, and applies stemming.
        """
        # Remove punctuation marks
        for c in self.punctuation:
            text = text.replace(c, "")

        # Replace tabs and line breaks with spaces
        text = text.replace("\t", " ").replace("\n", " ")

        # Basic whitespace tokenization
        tokens = list(filter(None, text.split(" ")))

        # Stemming and stopword removal
        return [self.stemmer.stem(w.lower()) for w in tokens if w.lower() not in self.stopwords]
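
For HTML bodies, the markup can be stripped before tokenizing so that tag and attribute names do not end up as tokens. The following is a minimal, self-contained sketch of that idea. `TagStripper` and `html_body_tokens` are hypothetical names, and the tokenizer here only lowercases and splits on whitespace, without the stemming and stopword removal done by the class above.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text found between HTML tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    # Join with a space so that words separated only by a tag do not fuse
    return " ".join(stripper.chunks)

def html_body_tokens(payload):
    """Strip the markup first, then split the remaining text into tokens."""
    return strip_tags(payload).lower().split()

sample = '<html><body bgcolor="#ffffff"><center>Feel the pressure<br>today</center></body></html>'
print(html_body_tokens(sample))
# ['feel', 'the', 'pressure', 'today']
```

Joining the fragments with a space is a design choice of this sketch: an empty-string join would merge the words on either side of a tag such as `<br>`.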


Reading an e-mail in raw format

In [10]:
import os
print(os.getcwd())


c:\Users\yanet\OneDrive\Documentos\ANTLR


In [11]:
import os

# Path to the file
ruta_archivo = r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\data\inmail.1"

# Check that it exists
if os.path.exists(ruta_archivo):
    with open(ruta_archivo, "r", encoding="utf-8", errors="ignore") as f:
        inmail = f.read()
    print("✅ File read successfully. First 500 characters:\n")
    print(inmail[:500])
else:
    print("❌ File not found at the specified path.")


✅ File read successfully. First 500 characters:

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, brande


In [12]:
inmail = open(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\data\inmail.1", "r", encoding="utf-8", errors="ignore").read()
print(inmail[:500])  # print only the first 500 characters


From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, brande


Parsing the e-mail

In [13]:
p=Parser()
p.parse(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\data\inmail.1")

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['html',
  'bodi',
  'bgcolorffffff',
  'div',
  'stylebordercolor',
  '00ffff',
  'borderrightwidth',
  '0px',
  'borderbottomwidth',
  '0px',
  'marginbottom',
  '0px',
  'aligncent',
  'tabl',
  'stylebord',
  '1px',
  'borderstyl',
  'solid',
  'bordercolor000000',
  'cellpadding5',
  'cellspacing0',
  'bgcolorccffaa',
  'tr',
  'td',
  'stylebord',
  '0px',
  'borderbottom',
  '1px',
  'borderstyl',
  'solid',
  'bordercolor000000',
  'center',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occasionbr',
  'center',
  'tdtrtr',
  'td',
  'bgcolorffff33',
  'stylebord',
  '0px',
  'borderbottom',
  '1px',
  'borderstyl',
  'solid',
  'bordercolor000000',
  'center',
  'ba',
  'hrefhttpexcoriationtuhcomlzmfnrdklekstri',
  'spanvspanspaniaspanspangrspanaspanabcent',
  'tdtrtdcenteryour',
  'anxieti',
  'thing',
  'past',
  'willbr',
  'back',
  'old',
  'self',
  'centertdtrtabledivbodyhtml'],
 'content_type': 'multipart/a

Reading the index

These helper functions load into memory the path of each e-mail and its corresponding label {ham, spam}.

In [14]:
index = open (r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [15]:
import os

DATASET_PATH = r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p"

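def parse_index(path_to_index, n_elements):
    """Return the first n_elements index entries as dicts with 'label' and 'email_path'."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like "spam ../data/inmail.1\n"
        mail = index[i].split("../")
        label = mail[0]       # keeps the trailing space, e.g. "spam "
        path = mail[1][:-1]   # drop the trailing newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})

    return ret_indexes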
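
Because the index lines are split on `"../"`, the labels produced by `parse_index` keep a trailing space (`'spam '`, `'ham '`). A possible variant that splits on whitespace instead is sketched below. `parse_index_lines` and the `"trec07p"` root used in the example are hypothetical, and the function takes the lines directly so that it can be tried without the dataset.

```python
import os

def parse_index_lines(lines, dataset_root):
    """Turn index lines such as 'spam ../data/inmail.1' into label/path dicts."""
    entries = []
    for line in lines:
        # split() with no arguments also discards the trailing newline
        label, rel_path = line.split()
        # Drop the leading "../" so the path is relative to the dataset root
        rel_path = rel_path.replace("../", "", 1)
        entries.append({
            "label": label,
            "email_path": os.path.join(dataset_root, *rel_path.split("/")),
        })
    return entries

sample_lines = ["spam ../data/inmail.1\n", "ham ../data/inmail.2\n"]
for entry in parse_index_lines(sample_lines, "trec07p"):
    print(entry)
```

With clean labels, comparisons such as `label == "spam"` work as expected. The rest of the notebook is unaffected either way, since the classifier treats labels as opaque strings.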

In [16]:
def parse_email(index):
    """Parse the e-mail referenced by an index entry and return (content, label)."""
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]


In [17]:
indexes = parse_index(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\full/index", 10)
indexes

[{'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.1'},
 {'label': 'ham ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.2'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.3'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.4'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.5'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.6'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.7'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\yanet\\OneDrive\\Documentos\\ANTLR\\datasets\\trec07p\\data/inmail.8'},
 {'label': 'spam ',
  'email_path': 'C:\\

### 2.- Preprocessing the dataset

The functions presented above make it possible to read the e-mails programmatically and to preprocess them, removing the components that are not useful for SPAM detection.
However, each e-mail is still represented by a Python dictionary containing a series of words.

In [18]:
# Load the index and the labels into memory
index = parse_index(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\full/index", 1)

In [19]:
# Read the first e-mail
import os
open(index[0]["email_path"]).readlines()

['From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\n',
 'Return-Path: <RickyAmes@aol.com>\n',
 'Received: from 129.97.78.23 ([211.202.101.74])\n',
 '\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n',
 '\tSun, 8 Apr 2007 13:07:21 -0400\n',
 'Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\n',
 'Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\n',
 'From: "Tomas Jacobs" <RickyAmes@aol.com>\n',
 'Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>\n',
 'To: the00@speedy.uwaterloo.ca\n',
 'Subject: Generic Cialis, branded quality@ \n',
 'Date: Sun, 08 Apr 2007 21:00:48 +0300\n',
 'X-Mailer: Microsoft Outlook Express 6.00.2600.0000\n',
 'MIME-Version: 1.0\n',
 'Content-Type: multipart/alternative;\n',
 '\tboundary="--8896484051606557286"\n',
 'X-Priority: 3\n',
 'X-MSMail-Priority: Normal\n',
 'Status: RO\n',
 'Content-Length: 988\n',
 'Lines: 24\n',
 '\n',
 '----8896484051606557286\n',
 'Content-Type: text/html;\n',
 'Content-Transfer-Encoding: 7Bi

In [20]:
# Parse the first e-mail
mail, label = parse_email(index[0])
print("The e-mail is:", label)
print(mail)

The e-mail is: spam 
{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['html', 'bodi', 'bgcolorffffff', 'div', 'stylebordercolor', '00ffff', 'borderrightwidth', '0px', 'borderbottomwidth', '0px', 'marginbottom', '0px', 'aligncent', 'tabl', 'stylebord', '1px', 'borderstyl', 'solid', 'bordercolor000000', 'cellpadding5', 'cellspacing0', 'bgcolorccffaa', 'tr', 'td', 'stylebord', '0px', 'borderbottom', '1px', 'borderstyl', 'solid', 'bordercolor000000', 'center', 'feel', 'pressur', 'perform', 'rise', 'occasionbr', 'center', 'tdtrtr', 'td', 'bgcolorffff33', 'stylebord', '0px', 'borderbottom', '1px', 'borderstyl', 'solid', 'bordercolor000000', 'center', 'ba', 'hrefhttpexcoriationtuhcomlzmfnrdklekstri', 'spanvspanspaniaspanspangrspanaspanabcent', 'tdtrtdcenteryour', 'anxieti', 'thing', 'past', 'willbr', 'back', 'old', 'self', 'centertdtrtabledivbodyhtml'], 'content_type': 'multipart/alternative'}


The logistic regression algorithm cannot ingest text as part of the dataset.
Therefore, a series of additional functions must be applied to transform the text of the parsed e-mails into a numerical representation.
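
What such a numerical representation looks like can be sketched in plain Python: build a sorted vocabulary and count how often each word appears in each document. This is only an approximation of the idea behind a count vectorizer. `bag_of_words` is a hypothetical helper and splits on whitespace, whereas scikit-learn's `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    vocabulary = sorted({token for doc in documents for token in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

docs = ["gener ciali brand qualiti", "brand brand offer"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['brand', 'ciali', 'gener', 'offer', 'qualiti']
print(matrix)  # [[1, 1, 1, 0, 1], [2, 0, 0, 1, 0]]
```

Each e-mail becomes one row with as many columns as there are distinct words, which is why the feature count grows quickly with the number of e-mails.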

Applying CountVectorizer

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the e-mail as a single string
prep_email = [" ".join(mail["subject"]) + " " + " ".join(mail["body"])]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(prep_email)

print("e-mail:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out())


e-mail: ['gener ciali brand qualiti html bodi bgcolorffffff div stylebordercolor 00ffff borderrightwidth 0px borderbottomwidth 0px marginbottom 0px aligncent tabl stylebord 1px borderstyl solid bordercolor000000 cellpadding5 cellspacing0 bgcolorccffaa tr td stylebord 0px borderbottom 1px borderstyl solid bordercolor000000 center feel pressur perform rise occasionbr center tdtrtr td bgcolorffff33 stylebord 0px borderbottom 1px borderstyl solid bordercolor000000 center ba hrefhttpexcoriationtuhcomlzmfnrdklekstri spanvspanspaniaspanspangrspanaspanabcent tdtrtdcenteryour anxieti thing past willbr back old self centertdtrtabledivbodyhtml'] 

Input features: ['00ffff' '0px' '1px' 'aligncent' 'anxieti' 'ba' 'back' 'bgcolorccffaa'
 'bgcolorffff33' 'bgcolorffffff' 'bodi' 'borderbottom' 'borderbottomwidth'
 'bordercolor000000' 'borderrightwidth' 'borderstyl' 'brand'
 'cellpadding5' 'cellspacing0' 'center' 'centertdtrtabledivbodyhtml'
 'ciali' 'div' 'feel' 'gener' 'hrefhttpexcoriation

In [22]:
x=vectorizer.transform(prep_email)
print("\nValues:\n",x.toarray())


Values:
 [[1 5 3 1 1 1 1 1 1 1 1 2 1 3 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  3 1 3 1 1 2 1 1 1 1 1]]


In [23]:
from sklearn.preprocessing import OneHotEncoder
prep_email = [[w] for w in mail["subject"] + mail["body"]]

enc=OneHotEncoder(handle_unknown='ignore')
x = enc.fit_transform(prep_email)

print("Features: \n", enc.get_feature_names_out())
print("\nValues:\n", x.toarray())

Features: 
 ['x0_00ffff' 'x0_0px' 'x0_1px' 'x0_aligncent' 'x0_anxieti' 'x0_ba'
 'x0_back' 'x0_bgcolorccffaa' 'x0_bgcolorffff33' 'x0_bgcolorffffff'
 'x0_bodi' 'x0_borderbottom' 'x0_borderbottomwidth' 'x0_bordercolor000000'
 'x0_borderrightwidth' 'x0_borderstyl' 'x0_brand' 'x0_cellpadding5'
 'x0_cellspacing0' 'x0_center' 'x0_centertdtrtabledivbodyhtml' 'x0_ciali'
 'x0_div' 'x0_feel' 'x0_gener'
 'x0_hrefhttpexcoriationtuhcomlzmfnrdklekstri' 'x0_html' 'x0_marginbottom'
 'x0_occasionbr' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur' 'x0_qualiti'
 'x0_rise' 'x0_self' 'x0_solid'
 'x0_spanvspanspaniaspanspangrspanaspanabcent' 'x0_stylebord'
 'x0_stylebordercolor' 'x0_tabl' 'x0_td' 'x0_tdtrtdcenteryour' 'x0_tdtrtr'
 'x0_thing' 'x0_tr' 'x0_willbr']

Values:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [24]:
print(prep_email)

[['gener'], ['ciali'], ['brand'], ['qualiti'], ['html'], ['bodi'], ['bgcolorffffff'], ['div'], ['stylebordercolor'], ['00ffff'], ['borderrightwidth'], ['0px'], ['borderbottomwidth'], ['0px'], ['marginbottom'], ['0px'], ['aligncent'], ['tabl'], ['stylebord'], ['1px'], ['borderstyl'], ['solid'], ['bordercolor000000'], ['cellpadding5'], ['cellspacing0'], ['bgcolorccffaa'], ['tr'], ['td'], ['stylebord'], ['0px'], ['borderbottom'], ['1px'], ['borderstyl'], ['solid'], ['bordercolor000000'], ['center'], ['feel'], ['pressur'], ['perform'], ['rise'], ['occasionbr'], ['center'], ['tdtrtr'], ['td'], ['bgcolorffff33'], ['stylebord'], ['0px'], ['borderbottom'], ['1px'], ['borderstyl'], ['solid'], ['bordercolor000000'], ['center'], ['ba'], ['hrefhttpexcoriationtuhcomlzmfnrdklekstri'], ['spanvspanspaniaspanspangrspanaspanabcent'], ['tdtrtdcenteryour'], ['anxieti'], ['thing'], ['past'], ['willbr'], ['back'], ['old'], ['self'], ['centertdtrtabledivbodyhtml']]


Helper functions for preprocessing the dataset

In [25]:
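def create_pep_dtaset(index_path, n_elements):
    """Parse the first n_elements e-mails and return (texts, labels)."""
    x = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\nParsing email: {0}".format(i+1), end="")
        mail, label = parse_email(indexes[i])
        # Note: subject and body are concatenated without a separating space,
        # so the last subject token fuses with the first body token.
        x.append(" ".join(mail["subject"]) + " ".join(mail["body"]))
        y.append(label)
    return x, y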
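
In the function above, the subject and the body are joined without a separator between them, so the last subject token and the first body token merge into one word. One way to avoid this, sketched with a hypothetical helper `join_subject_and_body`, is to concatenate the two token lists before joining:

```python
def join_subject_and_body(mail):
    """Build one space-separated string from the subject and body token lists."""
    return " ".join(mail["subject"] + mail["body"])

example = {"subject": ["gener", "qualiti"], "body": ["html", "bodi"]}
print(join_subject_and_body(example))
# gener qualiti html bodi
```

Applying a change like this would alter the vocabulary slightly, so the feature counts and scores shown in this notebook would no longer match exactly.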


### 3.- Training the algorithm

In [26]:
# Read only a subset of 100 e-mails
x_train, y_train = create_pep_dtaset(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\full/index", 100)
x_train



['gener ciali brand qualitihtml bodi bgcolorffffff div stylebordercolor 00ffff borderrightwidth 0px borderbottomwidth 0px marginbottom 0px aligncent tabl stylebord 1px borderstyl solid bordercolor000000 cellpadding5 cellspacing0 bgcolorccffaa tr td stylebord 0px borderbottom 1px borderstyl solid bordercolor000000 center feel pressur perform rise occasionbr center tdtrtr td bgcolorffff33 stylebord 0px borderbottom 1px borderstyl solid bordercolor000000 center ba hrefhttpexcoriationtuhcomlzmfnrdklekstri spanvspanspaniaspanspangrspanaspanabcent tdtrtdcenteryour anxieti thing past willbr back old self centertdtrtabledivbodyhtml',
 'typo debianreadmhi ive updat gulu check mirror seem littl typo debianreadm file exampl httpgulususherbrookecadebianreadm ftpftpfrdebianorgdebianreadm test lenni access releas diststest current test develop snapshot name etch packag test unstabl pass autom test propog releas etch replac lenni like readmehtml yan morin consult en logiciel libr yanmorinsavoirfairel

In [27]:
vectorizer = CountVectorizer()
x_train_vectorized = vectorizer.fit_transform(x_train)

In [28]:
print(x_train_vectorized.toarray())
print("\nFeatures: ", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features:  6197


In [29]:
import pandas as pd
df = pd.DataFrame(x_train_vectorized.toarray(), columns=vectorizer.get_feature_names_out())
df


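|  | 0000 | 000000 | 00085 | 002 | 003 | 00450 | 009 | 00ffff | 01 | 01000u | ... | ö¹ | öð | öôööµæ | öø³ðåµ | öþ | öˆ | ùbp | úàí | þîñòµ¼ | šè |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |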


In [30]:
y_train

['spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ']

Training the logistic regression algorithm with the preprocessed dataset
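
To give an intuition for what fitting a logistic regression involves, here is a small pure-Python sketch that trains the model with plain batch gradient descent on a toy dataset. It is not how scikit-learn fits the model (the notebook's run uses the `lbfgs` solver with L2 regularization). The learning rate, the number of epochs and the two made-up word-count columns are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Unregularized logistic regression fitted with batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(w, b, xi) - yi  # gradient of the log-loss
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy data: columns are counts of two hypothetical words ("ciali", "debian");
# label 1 means spam, 0 means ham.
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)

print(predict_proba(w, b, [1, 0]) > 0.5)  # True: leans towards spam
print(predict_proba(w, b, [0, 1]) > 0.5)  # False: leans towards ham
```

The learned weights end up positive for the word associated with spam and negative for the word associated with ham, which is the same kind of per-word evidence the real model accumulates over thousands of features.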

In [31]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(x_train_vectorized, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


### 4.- Prediction

Reading a dataset of e-mails

In [32]:
# Read 150 e-mails from our dataset and keep only the last 50
# These 50 e-mails were not used to train the algorithm
x, y = create_pep_dtaset(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\full/index", 150)
x_test = x[100:]
y_test = y[100:]




Preprocessing the e-mails with the vectorizer created earlier
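
The test e-mails must be transformed with the vocabulary learned from the training set, not with a newly fitted one, so that each column keeps the meaning the model was trained on. A plain-Python sketch of that behaviour follows. `fit_vocabulary` and `transform` are hypothetical helpers that mimic the idea, not scikit-learn's implementation.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn the vocabulary from the training documents only."""
    return sorted({token for doc in train_docs for token in doc.split()})

def transform(docs, vocabulary):
    """Count words using a fixed vocabulary; unseen words are ignored."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token, count in Counter(doc.split()).items():
            if token in index:
                row[index[token]] = count
        rows.append(row)
    return rows

vocab = fit_vocabulary(["brand ciali", "debian mirror"])
print(vocab)                                     # ['brand', 'ciali', 'debian', 'mirror']
print(transform(["ciali ciali viagra"], vocab))  # [[0, 2, 0, 0]]
```

The word `viagra` never appeared during training, so it has no column and contributes nothing to the prediction. This is one reason why a larger training set tends to help.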

In [33]:
x_test = vectorizer.transform(x_test)

Predicting the type of each e-mail

In [34]:
y_pred= clf.predict(x_test)
y_pred

array(['spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam '], dtype='<U5')

In [36]:
print("Prediction:\n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction:
 ['spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'ham ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'ham ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ']

True labels:
 ['spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ']


Comparing the two lists by eye is tedious; a metric makes it easier:

Evaluating the results
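
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand, together with the four confusion-matrix counts. The label lists are synthetic but chosen to mirror the proportions of the 50-e-mail test above (5 ham messages, 3 of them predicted as spam); `accuracy` and `confusion_counts` are hypothetical helper names.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and true label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for the given positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts

y_true = ["spam"] * 45 + ["ham"] * 5
y_pred = ["spam"] * 45 + ["ham"] * 2 + ["spam"] * 3

print(accuracy(y_true, y_pred))                 # 0.94
print(confusion_counts(y_true, y_pred))         # {'tp': 45, 'fp': 3, 'fn': 0, 'tn': 2}
print(accuracy(y_true, ["spam"] * 50))          # 0.9
```

The last line shows why accuracy alone should be read with care on an imbalanced set: labelling every message as spam would already score 0.90 here.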

In [37]:
from sklearn.metrics import accuracy_score

print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))


Accuracy: 0.940


### 5.- Increasing the size of the dataset

In [38]:
# Read 12,000 e-mails: 10,000 to train the algorithm
# and 2,000 for testing
x, y = create_pep_dtaset(r"C:\Users\yanet\OneDrive\Documentos\ANTLR\datasets\trec07p\full/index", 12000)



Next, the data is split into a training set and a test set.
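
The notebook splits by position, taking the first 10,000 e-mails in corpus order for training and the rest for testing. If a random split is preferred, one possible sketch with the standard library is shown below. `shuffled_split` and its `seed` parameter are hypothetical, and the toy data only illustrates the mechanics.

```python
import random

def shuffled_split(x, y, n_train, seed=0):
    """Shuffle the samples reproducibly, then split them into train and test parts."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (
        [x[i] for i in train_idx], [y[i] for i in train_idx],
        [x[i] for i in test_idx], [y[i] for i in test_idx],
    )

texts = ["mail{}".format(i) for i in range(12)]
labels = ["spam" if i % 3 else "ham" for i in range(12)]
x_tr, y_tr, x_te, y_te = shuffled_split(texts, labels, n_train=10)
print(len(x_tr), len(x_te))  # 10 2
```

Keeping the corpus order, as the notebook does, has its own merit: the test e-mails are then all more recent than the training ones, which resembles how a deployed filter is used.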

In [39]:
x_train, y_train = x[:10000], y[:10000]
x_test, y_test = x[10000:], y[10000:]

In [40]:
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(x_train)

In [41]:
clf = LogisticRegression()
clf.fit(x_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


In [42]:
x_test = vectorizer.transform(x_test) 
y_pred= clf.predict(x_test)
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.992
