
# TEXT CLASSIFICATION ACTIVITY

In this activity we will classify texts, covering the whole process from downloading the dataset to performing the classification. Along the way we will build a vocabulary, use embeddings, and create models.

The exercises in this activity are based on a notebook created by François Chollet, one of the creators of Keras and author of the book "Deep Learning with Python".

This notebook works with the "Newsgroup20" dataset, which contains approximately 20,000 messages belonging to 20 different categories.

The goal is to understand the concepts involved and to be able to run small experiments that improve the notebook.

# Libraries

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Data Download

In [2]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

Downloading data from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
17329808/17329808 ━━━━━━━━━━━━━━━━━━━━ 10s 1us/step


In [3]:
import os
import pathlib

# Directory structure of the dataset
data_dir = pathlib.Path(data_path) / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

Number of directories: 20
Directory names: ['rec.sport.hockey', 'sci.med', 'sci.crypt', 'talk.politics.misc', 'talk.religion.misc', 'comp.sys.mac.hardware', 'rec.motorcycles', 'comp.graphics', 'misc.forsale', 'sci.space', 'rec.autos', 'talk.politics.guns', 'talk.politics.mideast', 'comp.windows.x', 'comp.os.ms-windows.misc', 'rec.sport.baseball', 'alt.atheism', 'sci.electronics', 'soc.religion.christian', 'comp.sys.ibm.pc.hardware']


In [4]:
print(data_dir)

/root/.keras/datasets/news20_extracted/20_newsgroup


In [5]:
# Some files from the "comp.graphics" category
fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of files in comp.graphics: 1000
Some example filenames: ['39669', '37922', '38379', '38225', '38544']


In [6]:
# Example text from the "comp.graphics" category
print(open(data_dir / "comp.graphics" / "37261").read())

Xref: cantaloupe.srv.cs.cmu.edu comp.graphics:37261 alt.graphics:519 comp.graphics.animation:2614
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu!zaphod.mps.ohio-state.edu!darwin.sura.net!dtix.dt.navy.mil!oasys!lipman
From: lipman@oasys.dt.navy.mil (Robert Lipman)
Newsgroups: comp.graphics,alt.graphics,comp.graphics.animation
Subject: CALL FOR PRESENTATIONS: Navy SciViz/VR Seminar
Message-ID: <32850@oasys.dt.navy.mil>
Date: 19 Mar 93 20:10:23 GMT
Article-I.D.: oasys.32850
Expires: 30 Apr 93 04:00:00 GMT
Reply-To: lipman@oasys.dt.navy.mil (Robert Lipman)
Followup-To: comp.graphics
Distribution: usa
Organization: Carderock Division, NSWC, Bethesda, MD
Lines: 65


			CALL FOR PRESENTATIONS
	
      NAVY SCIENTIFIC VISUALIZATION AND VIRTUAL REALITY SEMINAR

			Tuesday, June 22, 1993

	    Carderock Division, Naval Surface Warfare Center
	      (formerly the David Taylor Research Center)
			  Bethesda, Maryland

SPONSOR: NESS (Navy Engineering Software System) is sponsori

The code works, but we will use a seed so that the result is reproducible.

In [7]:
import spacy
import en_core_web_sm
import random
nlp = en_core_web_sm.load()

# total_tokens=0
# for i in (range(15)):
#     doc = nlp(pathlib.Path(data_dir / "comp.graphics" / fnames[i]).read_text(encoding="latin-1"))
#     doc.__len__()
#     total_tokens+=doc.__len__()

# average_tokens = total_tokens/15

# print(average_tokens)

# Fix the seed for reproducibility
random.seed(42)

# Select 15 random files from the 'comp.graphics' category
sample_files = random.sample(fnames, 15)

token_counts = []

for file in sample_files:
    file_path = data_dir / "comp.graphics" / file
    with open(file_path, "r", encoding="latin-1") as f:
        text = f.read()
        doc = nlp(text)
        token_counts.append(len(doc))

# Compute the average number of tokens
average_tokens = sum(token_counts) / len(token_counts)

print("Average number of tokens in the 15-file sample:", average_tokens)

Average number of tokens in the 15-file sample: 241.66666666666666
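
One caveat on reproducibility: `os.listdir` returns entries in arbitrary order, so a fixed seed only selects the same files if the list being sampled is in a fixed order too. The following is a minimal sketch of that idea, not part of the original notebook; the helper name `reproducible_sample` is hypothetical, and the filenames are a few taken from the listing above as stand-ins for the full directory.

```python
import random

def reproducible_sample(names, k, seed=42):
    """Return k names chosen with a private, seeded generator."""
    # Sort first: os.listdir() returns entries in arbitrary order, so the
    # same seed only selects the same files if the input order is fixed.
    ordered = sorted(names)
    rng = random.Random(seed)  # local generator, leaves the global state alone
    return rng.sample(ordered, k)

# A few filenames standing in for the contents of comp.graphics
listing_a = ["39669", "37922", "38379", "38225", "38544", "37261"]
listing_b = list(reversed(listing_a))  # same files, different listing order

# Both listings yield the same sample because they are sorted before sampling
print(reproducible_sample(listing_a, 3) == reproducible_sample(listing_b, 3))
```

Using a local `random.Random` instance rather than `random.seed` is a design choice here: it keeps the sampling independent of any other code that draws from the global generator.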


In [8]:
import spacy
import random
import numpy as np
import pathlib
import os

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Fix the seed for reproducibility
random.seed(42)

# Get the directory structure of the dataset
data_dir = pathlib.Path(data_path) / "20_newsgroup"
dirnames = os.listdir(data_dir)

token_counts = []
word_counts = []

# Process every file in the dataset
for dirname in dirnames:
    category_path = data_dir / dirname
    if not category_path.is_dir():
        continue

    for file in os.listdir(category_path):
        file_path = category_path / file
        with open(file_path, "r", encoding="latin-1") as f:
            text = f.read()
            doc = nlp(text)
            token_counts.append(len(doc))
            # Count only alphabetic tokens as words
            word_counts.append(len([token for token in doc if token.is_alpha]))

# Compute the average number of tokens over the whole dataset
average_tokens = sum(token_counts) / len(token_counts)
# Compute the average number of words over the whole dataset
average_words = sum(word_counts) / len(word_counts)

print("Average number of tokens over the whole dataset:", average_tokens)
print("Average number of words over the whole dataset:", average_words)

Average number of tokens over the whole dataset: 469.12666900035003
Average number of words over the whole dataset: 270.8842326348952


In [9]:
# Some files from the "talk.politics.misc" category
fnames = os.listdir(data_dir / "talk.politics.misc")
print("Number of files in talk.politics.misc:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of files in talk.politics.misc: 1000
Some example filenames: ['178692', '178799', '179012', '177023', '178932']


In [10]:
print(open(data_dir / "talk.politics.misc" / "178799").read())

Xref: cantaloupe.srv.cs.cmu.edu talk.politics.misc:178799 alt.society.civil-liberty:9098 misc.legal:60857 bit.listserv.lawsch-l:982
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!uunet!enterpoop.mit.edu!senator-bedfellow.mit.edu!senator-bedfellow.mit.edu!usenet
From: wdstarr@athena.mit.edu (William December Starr)
Newsgroups: talk.politics.misc,alt.society.civil-liberty,misc.legal,bit.listserv.lawsch-l
Subject: Re: Law and Economics
Date: 21 Apr 1993 14:33:22 GMT
Organization: Northeastern Law, Class of '93
Lines: 234
Message-ID: <1r3lviINN125@senator-bedfellow.MIT.EDU>
References: <1993Apr11.155955.23346@midway.uchicago.edu> <1qjq2nINN7ql@senator-bedfellow.MIT.EDU> <1993Apr15.143623.25813@midway.uchicago.edu>
NNTP-Posting-Host: nw12-326-1.mit.edu
In-reply-to: thf2@midway.uchicago.edu


I'm going to be mixing together here stuff from two of Ted Frank's
articles, <1993Apr15.143623.25813@midway.uchicago.e

In [11]:
# Example text from the "talk.politics.misc" category
print(open(data_dir / "talk.politics.misc" / "178463").read())

Xref: cantaloupe.srv.cs.cmu.edu talk.politics.guns:54219 talk.politics.misc:178463
Newsgroups: talk.politics.guns,talk.politics.misc
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!darwin.sura.net!martha.utcc.utk.edu!FRANKENSTEIN.CE.UTK.EDU!VEAL
From: VEAL@utkvm1.utk.edu (David Veal)
Subject: Re: Proof of the Viability of Gun Control
Message-ID: <VEAL.749.735192116@utkvm1.utk.edu>
Lines: 21
Sender: usenet@martha.utcc.utk.edu (USENET News System)
Organization: University of Tennessee Division of Continuing Education
References: <1qpbqd$ntl@access.digex.net> <C5otvp.ItL@magpie.linknet.com>
Date: Mon, 19 Apr 1993 04:01:56 GMT

[alt.drugs and alt.conspiracy removed from newsgroups line.]

In article <C5otvp.ItL@magpie.linknet.com> neal@magpie.linknet.com (Neal) writes:

>   Once the National Guard has been called into federal service,
>it is under the command of the present. Tha N

In [12]:
# Select only certain classes
list_all_dir = [
    'alt.atheism',
    'comp.graphics',
    'comp.sys.mac.hardware',
    'comp.windows.x',
    'misc.forsale',
    'rec.autos',
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.crypt',
    'sci.med',
    'sci.space',
    'soc.religion.christian',
    'talk.politics.guns',
    'talk.politics.misc',
    'talk.religion.misc'
]

In [13]:
samples = []
labels = []
class_names = []
class_index = 0
for dirname in list_all_dir:
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
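    for fname in fnames:
        fpath = dirpath / fname
        # Use a context manager so each file handle is closed after reading
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        # Drop the first 10 lines to strip (most of) the message header
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1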

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.misc', 'talk.religion.misc']
Number of samples: 14997
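
The loop above removes a fixed number of lines, but the messages printed earlier each carry more than 10 header lines, so some header fields can survive in the samples. In those messages the header ends at the first blank line, which suggests an alternative. The following is a sketch under that assumption, not the notebook's method; `strip_header` is a hypothetical helper and the message is a shortened, made-up example in the same layout.

```python
def strip_header(message: str) -> str:
    """Return the body of a Usenet-style message (text after the first blank line)."""
    header, separator, body = message.partition("\n\n")
    # If no blank line is found, keep the text unchanged rather than lose it
    return body if separator else message

# Shortened, hypothetical message in the same layout as the dataset files
raw = (
    "From: someone@example.com\n"
    "Subject: Re: Mars\n"
    "NNTP-Posting-Host: host.example.com\n"
    "\n"
    "Lets hear it for Dan Goldin...\n"
)
print(strip_header(raw))
```

Whether removing the whole header helps is an empirical question: fields such as `Subject` may carry useful signal for the classifier, so this is worth treating as one of the small experiments rather than a guaranteed improvement.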


# Shuffling the data to split it into Training and Validation

In [14]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)
keras.utils.set_random_seed(seed)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
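
The cell above keeps samples and labels aligned by shuffling both lists with two generators created from the same seed, which produce the same permutation for lists of equal length. A more explicit way to express the same idea is to shuffle one list of indices and apply it to both. The sketch below uses the standard-library `random` module instead of NumPy, so it will not reproduce the notebook's exact order; `shuffle_and_split` and the toy data are hypothetical.

```python
import random

def shuffle_and_split(samples, labels, validation_split=0.2, seed=1337):
    """Shuffle samples and labels together, then split off a validation tail."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    # One shared permutation keeps every sample paired with its own label
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    samples = [samples[i] for i in order]
    labels = [labels[i] for i in order]
    # Same arithmetic as the cell above: the last 20% become validation data
    num_val = int(validation_split * len(samples))
    split_at = len(samples) - num_val
    return (samples[:split_at], labels[:split_at]), (samples[split_at:], labels[split_at:])

# Toy data: each text encodes its own label, so alignment is easy to check
toy_samples = [f"text-{i % 3}" for i in range(10)]
toy_labels = [i % 3 for i in range(10)]
(train_x, train_y), (val_x, val_y) = shuffle_and_split(toy_samples, toy_labels)
print(len(train_x), len(val_x))
```

With the 14,997 samples loaded above, the same arithmetic gives `int(0.2 * 14997) = 2999` validation samples and 11,998 training samples. Slicing at `split_at` rather than with a negative index also avoids an empty training set in the edge case where the validation count rounds down to zero.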

In [15]:
print(train_samples[:1])

['NNTP-Posting-Host: hsc.usc.edu\n\nIn article <1993Apr26.184507.10511@aio.jsc.nasa.gov> kjenks@gothamcity.jsc.nasa.gov writes:\n>I know it\'s only wishful thinking, with our current President,\n>but this is from last fall:\n>\n>     "Is there life on Mars?  Maybe not now.  But there will be."\n>        -- Daniel S. Goldin, NASA Administrator, 24 August 1992\n>\n>-- Ken Jenks, NASA/JSC/GM2, Space Shuttle Program Office\n>      kjenks@gothamcity.jsc.nasa.gov  (713) 483-4368\n\nLets hear it for Dan Goldin...now if he can only convince the rest of\nour federal government that the space program is a worth while\ninvestment!\n\nI hope that I will live to see the day we walk on Mars, but\nwe need to address the technical hurdles first!  If there\'s sufficient\ninterest, maybe we should consider starting a sci.space group \ndevoted to the technical analysis of long-duration human spaceflight.\nMost of you regulars know that I\'m interested in starting this analysis\nas soon as possible.\n\nKe

In [16]:
print(val_samples[:1])

["I need a little help from a Texas Rangers expert.\n\nI was at Yankee Stadium Sunday (12-2 Texas rout) with my kids.  We\nwandered out to the outfield during Rangers batting practice and\nI caught a ball tossed into the stands (actually wrestled some guy\na bit, I might add) by #62 on the Rangers.  Who is he?  Looked like\na bullpen assistant type, youngish I think.  He was not in the\nroster listed in the Yankee scorecard.  Any ideas?\n\nPlease e-mail as I haven't been reading r.s.b regularly.\n\nThanks.\n- Bob\n--\nName:    Bob Dorin\nCompany: Kendall Square Research \nEmail:   dorin@ksr.com, ksr!dorin\n\n\n"]


In [17]:
print(train_labels[:1])

[10]


In [18]:
print(val_labels[:1])

[6]


# Tokenizing words with TextVectorization

In [19]:
from tensorflow.keras.layers import TextVectorization
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

In [20]:
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

In [21]:
len(vectorizer.get_vocabulary())

20000

# Inspecting the Vectorizer output

In [22]:
output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([   2, 3762, 1955,   18,    2, 5188])

In [23]:
output

<tf.Tensor: shape=(1, 200), dtype=int64, numpy=
array([[   2, 3762, 1955,   18,    2, 5188,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,   

In [24]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [25]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 3762, 1955, 18, 2, 5188]
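
To make the outputs above easier to interpret, here is a rough, framework-free sketch of what `TextVectorization` does with its default settings: lowercase the text, strip punctuation, split on whitespace, index words by frequency, and pad or truncate to a fixed length. It is a simplification for illustration only (punctuation handling and tie-breaking between equally frequent words may differ from Keras), and the function names `standardize`, `build_vocabulary`, and `vectorize` are hypothetical.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def build_vocabulary(texts, max_tokens):
    """Index 0 is padding, index 1 is the unknown token, then words by frequency."""
    counts = Counter(tok for text in texts for tok in standardize(text))
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text)][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["The cat sat on the mat.", "The dog sat."]
vocab = build_vocabulary(corpus, max_tokens=10)
print(vocab[:3])
print(vectorize("the cat saw the bird", vocab, sequence_length=8))
```

In this toy run, words missing from the vocabulary ("saw", "bird") map to index 1, and the trailing zeros are padding, which matches the pattern of the `(1, 200)` tensor shown above.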

# Tokenizing the training and validation data

In [26]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

# Creating and training the model

In [27]:
# put your ... here

# Evaluation

In [28]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = modeloEmbeddingGloveTransformers(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)

class_names[np.argmax(probabilities[0])]

NameError: name 'modeloEmbeddingGloveTransformers' is not defined

In [None]:
probabilities = end_to_end_model.predict(
    [["politics and federal courts law that people understand with politician and elects congressman"]]
)

class_names[np.argmax(probabilities[0])]

In [None]:
probabilities = end_to_end_model.predict(
    [["we are talking about religion"]]
)

class_names[np.argmax(probabilities[0])]
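
The evaluation cells report only the single most likely class via `np.argmax`. When checking predictions by hand it can help to see the runner-up classes as well. The following is a small sketch in plain Python, not part of the original notebook; `top_k_classes` is a hypothetical helper, and the probabilities are invented for illustration rather than produced by a trained model.

```python
def top_k_classes(probabilities, class_names, k=3):
    """Return the k (class_name, probability) pairs with the highest probability."""
    if len(probabilities) != len(class_names):
        raise ValueError("need exactly one probability per class")
    ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Invented softmax output for four of the classes used above
names = ["comp.graphics", "sci.space", "talk.politics.misc", "talk.religion.misc"]
probs = [0.70, 0.05, 0.15, 0.10]
for name, p in top_k_classes(probs, names, k=2):
    print(f"{name}: {p:.2f}")
```

With a trained model, the same helper could be applied to `probabilities[0]` and `class_names` from the cells above, assuming the model outputs one probability per class in the order of `class_names`.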