### Text vectorization and Naïve Bayes classification with the 20 newsgroups dataset

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score

# 20newsgroups is a classic NLP dataset, so it ships with sklearn
# already formatted
from sklearn.datasets import fetch_20newsgroups
import numpy as np

## Loading the data

In [3]:
# load the data (already split into train and test by default)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

## Vectorization

In [4]:
# instantiate a vectorizer
# see the sklearn documentation for the available constructor parameters
tfidfvect = TfidfVectorizer()

In [5]:
# the `data` attribute holds the raw text
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [6]:
# with the usual sklearn interface we can fit the vectorizer
# (learn the vocabulary and compute the IDF vector)
# and transform the data in a single step
X_train = tfidfvect.fit_transform(newsgroups_train.data)
# `X_train` is the document-term matrix
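
To make the fitting step more concrete, here is a minimal by-hand sketch of TF-IDF weighting on a toy corpus. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfVectorizer` uses with its default settings. The whitespace tokenizer, the function name `tfidf_vectors` and the toy documents are simplifications introduced only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    # simplification: lowercase + whitespace split instead of sklearn's regex tokenizer
    tokenized = [doc.lower().split() for doc in docs]
    all_terms = sorted({t for toks in tokenized for t in toks})
    vocab = {term: i for i, term in enumerate(all_terms)}
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for term, count in Counter(toks).items():
            vec[vocab[term]] = count * idf[term]
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm for w in vec] if norm else vec)
    return vocab, vectors


toy_docs = ["the car is fast", "the car is red", "the moon is far"]
vocab, vectors = tfidf_vectors(toy_docs)
first = vectors[0]
# a term found in one document outweighs a term found in all of them
print(round(first[vocab["fast"]], 3), round(first[vocab["the"]], 3))
```

Each row has unit length, which is why cosine similarity between TF-IDF rows reduces to a dot product.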

In [7]:
# remember that count-based vectorizations are sparse,
# so sklearn conveniently returns the document vectors
# as a sparse matrix
print(type(X_train))
print(f'shape: {X_train.shape}')
print(f'number of documents: {X_train.shape[0]}')
print(f'vocabulary size (dimensionality of the vectors): {X_train.shape[1]}')

<class 'scipy.sparse._csr.csr_matrix'>
shape: (11314, 101631)
number of documents: 11314
vocabulary size (dimensionality of the vectors): 101631


In [8]:
# once the vectorizer is fitted, we can access attributes such as the learned
# vocabulary. It is a dictionary that maps terms to indices.
# The index is the term's position in the document vector.
tfidfvect.vocabulary_['car']

25775

In [9]:
# it is very useful to have the inverse dictionary, mapping indices to terms
idx2word = {v: k for k,v in tfidfvect.vocabulary_.items()}

In [10]:
# `y_train` stores the targets, which are integers
y_train = newsgroups_train.target
y_train[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [11]:
# there are 20 classes, one for each of the 20 newsgroups
print(f'classes {np.unique(newsgroups_test.target)}')
newsgroups_test.target_names

classes [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Document similarity

In [12]:
# Let's look at document similarity. Pick a document
idx = 4811
print(newsgroups_train.data[idx])

THE WHITE HOUSE

                  Office of the Press Secretary
                   (Pittsburgh, Pennslyvania)
______________________________________________________________
For Immediate Release                         April 17, 1993     

             
                  RADIO ADDRESS TO THE NATION 
                        BY THE PRESIDENT
             
                Pittsburgh International Airport
                    Pittsburgh, Pennsylvania
             
             
10:06 A.M. EDT
             
             
             THE PRESIDENT:  Good morning.  My voice is coming to
you this morning through the facilities of the oldest radio
station in America, KDKA in Pittsburgh.  I'm visiting the city to
meet personally with citizens here to discuss my plans for jobs,
health care and the economy.  But I wanted first to do my weekly
broadcast with the American people. 
             
             I'm told this station first broadcast in 1920 when
it reported that year's presidential elec

In [13]:
# measure cosine similarity against every document in train
cossim = cosine_similarity(X_train[idx], X_train)[0]

In [14]:
# the similarity values, sorted from highest to lowest
np.sort(cossim)[::-1]

array([1.        , 0.70930477, 0.67474953, ..., 0.        , 0.        ,
       0.        ])

In [15]:
# and the documents they correspond to
np.argsort(cossim)[::-1]

array([ 4811,  6635,  4253, ...,  1534, 10055,  4750], dtype=int64)

In [16]:
# the 5 most similar documents (position 0 is the document itself):
mostsim = np.argsort(cossim)[::-1][1:6]
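
The slice `[1:6]` relies on the query document ranking first. That usually holds, but an exact duplicate elsewhere in the corpus would tie at similarity 1.0 and could push the query to a later position. The sketch below shows one way to exclude the query explicitly; `top_k_neighbors` and the toy similarity row are hypothetical names and values used only for illustration.

```python
import numpy as np


def top_k_neighbors(similarities, query_idx, k=5):
    """Indices of the k highest similarities, never returning the query itself."""
    order = np.argsort(similarities)[::-1]   # highest similarity first
    order = order[order != query_idx]        # drop the query explicitly
    return order[:k]


# toy similarity row: document 2 is the query, document 0 is an exact duplicate
sims = np.array([1.0, 0.3, 1.0, 0.8, 0.1])
neighbors = top_k_neighbors(sims, query_idx=2, k=3)
print(neighbors)
```

With the notebook's variables, the equivalent call would be `top_k_neighbors(cossim, idx)`.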

In [17]:
# the original document belongs to the class:
newsgroups_train.target_names[y_train[idx]]

'talk.politics.misc'

In [18]:
# and the 5 most similar documents belong to the classes:
for i in mostsim:
  print(newsgroups_train.target_names[y_train[i]])

talk.politics.misc
talk.politics.misc
talk.politics.misc
talk.politics.misc
talk.politics.misc


### Naïve Bayes classification model

In [19]:
# instantiating and training a Naïve Bayes classifier is straightforward with sklearn
clf = MultinomialNB()
clf.fit(X_train, y_train)

In [20]:
# with the vectorizer already fitted on train, vectorize the texts
# of the test set
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target
y_pred =  clf.predict(X_test)

In [21]:
# the macro-averaged F1-score is a suitable metric for reporting classifier
# performance because it is robust to class imbalance: it is the unweighted
# mean of the per-class F1-scores. The 'micro' average is equivalent to accuracy
# in single-label multiclass problems, and accuracy is not a good metric when
# the dataset is imbalanced
f1_score(y_test, y_pred, average='macro')

0.5854345727938506
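
The difference between the two averages can be checked by hand on a small imbalanced example. The sketch below is a plain-Python illustration, not sklearn's implementation; `f1_macro_micro` and the toy labels are hypothetical. It assumes the single-label setting used in this notebook, where each document has exactly one true class and one predicted class.

```python
from collections import Counter


def f1_macro_micro(y_true, y_pred):
    """Compute macro and micro F1 for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # macro: unweighted mean of per-class F1
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # micro: F1 over the pooled counts of all classes
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# imbalanced toy case: a classifier that always predicts the majority class
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
macro, micro = f1_macro_micro(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"macro={macro:.3f} micro={micro:.3f} accuracy={accuracy:.3f}")
```

The micro average matches accuracy and looks acceptable, while the macro average drops sharply because the minority class is never predicted.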

### Challenge 1 assignment

#### **1**. Vectorize the documents. Pick 5 documents at random and measure their similarity to the rest of the documents. Study the 5 most similar documents for each one and analyze whether the similarity makes sense given the text content and the classification label.

In [22]:
# the documents will be picked at random, so import `random`
import random

In [23]:
seed = 10
np.random.seed(seed)
random.seed(seed)
random_idx = []
for _ in range(5):
    # randrange excludes the upper bound, so every index is valid;
    # random.randint(0, 11314) could return 11314, which is out of range
    random_idx.append(random.randrange(X_train.shape[0]))
print(random_idx)

[9361, 533, 7026, 7906, 9471]
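
An alternative way to draw the indices is `random.sample`, which returns distinct values and never leaves the valid range. The sketch below is only an illustration: `sample_document_indices` is a hypothetical helper, and the indices it draws need not match the ones printed above.

```python
import random


def sample_document_indices(n_docs, k=5, seed=10):
    """Draw k distinct document indices in [0, n_docs) reproducibly."""
    rng = random.Random(seed)            # local generator, global state untouched
    return rng.sample(range(n_docs), k)  # sampling without replacement


indices = sample_document_indices(11314)
print(indices)
```

Using a local `random.Random` instance keeps the draw reproducible even if other cells consume random numbers in between.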


#### With the indices in hand, each one is analyzed in turn

#### Index 9361

##### 1 - Measure the similarity to the rest of the documents

In [31]:
# measure cosine similarity against every document in train
cossim = cosine_similarity(X_train[random_idx[0]], X_train)[0]
# indices of the 5 most similar documents (position 0 is the document itself)
mostsim = np.argsort(cossim)[::-1][1:6]

##### 2 - Inspect the similarity scores and the indices of the corresponding documents

In [32]:
# similarity scores of the 5 most similar documents
print(np.sort(cossim)[::-1][1:6])
# and the indices of those documents
print(np.argsort(cossim)[::-1][1:6])

[0.21649189 0.20618781 0.2029193  0.17855957 0.17843009]
[ 6201  8467 11137 10106  5989]


##### 3 - Get the class of the document and of its most similar documents

In [33]:
# Print the class of the document
print("The original document belongs to the class: ", newsgroups_train.target_names[y_train[random_idx[0]]])

# Print the classes of the 5 most similar documents
print("The most similar documents belong to the classes: ")
for i in mostsim:
  print(newsgroups_train.target_names[y_train[i]])

The original document belongs to the class:  alt.atheism
The most similar documents belong to the classes: 
talk.religion.misc
rec.motorcycles
rec.motorcycles
soc.religion.christian
sci.crypt


##### 4 - Inspect the content of the documents

In [34]:
print(newsgroups_train.data[random_idx[0]])


: 	Nice cop out bill.

I'm sure you're right, but I have no idea to what you refer. Would you
mind explaining how I copped out?


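A short reply in a discussion thread: the author asks what "cop out" they are being accused of. "Bill" here appears to be the name of the person addressed, not a legislative bill.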

In [35]:
print(newsgroups_train.data[mostsim[0]])




You might be sure, but you would also be wrong.



A one-sentence rebuttal, posted in a religion newsgroup.

In [36]:
print(newsgroups_train.data[mostsim[1]])


1) The next time you get stoped by a cop, never never never admit to anything.

2) Don't volunteer any information.

3) When a retoracle question is ask by the cop, like "...it <looked> like you were going kinda fast coming down highway 12.  You <must have> been going at least 70 or 75?" -- the correct reponse is to deny it. This technique is employed by police to help establish guilt, especially when (9 times out of 10) he/she is not sure who was doing the speeding. If the cop is unsure this may be the difference of him letting you off the hook or getting the tissue.

Hope this helps for next time.


Advice on how to avoid a speeding ticket when stopped by a police officer.

In [37]:
print(newsgroups_train.data[mostsim[2]])


Right on, it is every citizen's right and duty to FORCE government
accountability.

(anecdotes deleted)


Also keep in mind that cops will LIE in court to get their way! (don't get
me started by asking how I know ;) If you decide to fight you have to be ready
for this as well as devise strategy to make the cop's story doubtful in the
judge/jury's mind.


The post warns that police officers will lie in court to get their way, and that this must be taken into account when devising a strategy to cast doubt on their account.

In [38]:
print(newsgroups_train.data[mostsim[3]])

[In looking through my files this weekend, I ran across some lyrics from
various rock groups that have content.  Here are two from Black Sabbath's
"Master of Reality".  I'll say this much for the music of the '60's and early
'70's, at least they asked questions of significance.  Jethro Tull is another
to asked and wrote about things that caused one to wonder. --Rex] 

AFTER FOREVER

Have you ever thought about your soul--
     can it be saved?
Or perhaps you think that when you're dead
     you just stay in you grave.
Is God just a thought within you read in a book
     when you were at school?
When you think about death 
     Do you lose your breath
     Or do you keep your cool?

Would you like to see the Pope on the end of a rope?
Do you think he's a fool?
Well I have seen the truth.  Yes I have seen the light
     and I've changed my ways.
And I'll be prepared 
     When you're lonely and scared
     at the end of your days.

Could it be you're afraid of what your friends might say

Lyrics of a Black Sabbath song.

In [39]:
print(newsgroups_train.data[mostsim[4]])



Could you expand on this? I have a feeling you're right, but I don't quite
understand.


A request to expand on a point that the author did not fully understand.

Similarity scores are low here (at most about 0.22), and the classes of the neighbors do not match the class of the randomly selected document. One keyword that recurs across several of the documents is "cop".

#### Index 533

This document discusses advertising billboards in space.

##### 1 - Measure the similarity to the rest of the documents

In [42]:
# measure cosine similarity against every document in train
cossim = cosine_similarity(X_train[random_idx[1]], X_train)[0]
# indices of the 5 most similar documents (position 0 is the document itself)
mostsim = np.argsort(cossim)[::-1][1:6]

##### 2 - Inspect the similarity scores and the indices of the corresponding documents

In [43]:
# similarity scores of the 5 most similar documents
print(np.sort(cossim)[::-1][1:6])
# and the indices of those documents
print(np.argsort(cossim)[::-1][1:6])

[0.48255067 0.35558112 0.3520732  0.33276742 0.31324499]
[ 2061 10855  9934 11198  3285]


##### 3 - Get the class of the document and of its most similar documents

In [44]:
# Print the class of the document
print("The original document belongs to the class: ", newsgroups_train.target_names[y_train[random_idx[1]]])

# Print the classes of the 5 most similar documents
print("The most similar documents belong to the classes: ")
for i in mostsim:
  print(newsgroups_train.target_names[y_train[i]])

The original document belongs to the class:  sci.space
The most similar documents belong to the classes: 
sci.space
sci.space
sci.space
sci.space
sci.space


##### 4 - Inspect the content of the documents

In [46]:
print(newsgroups_train.data[random_idx[1]])

From the article "What's New" Apr-16-93 in sci.physics.research:

........
WHAT'S NEW (in my opinion), Friday, 16 April 1993  Washington, DC

1. SPACE BILLBOARDS! IS THIS ONE THE "SPINOFFS" WE WERE PROMISED?
In 1950, science fiction writer Robert Heinlein published "The
Man Who Sold the Moon," which involved a dispute over the sale of
rights to the Moon for use as billboard. NASA has taken the firsteps toward this
 hideous vision of the future.  Observers were
startled this spring when a NASA launch vehicle arrived at the
pad with "SCHWARZENEGGER" painted in huge block letters on the
side of the booster rockets.  Space Marketing Inc. had arranged
for the ad to promote Arnold's latest movie. Now, Space Marketing
is working with University of Colorado and Livermore engineers on
a plan to place a mile-long inflatable billboard in low-earth
orbit.  NASA would provide contractual launch services. However,
since NASA bases its charge on seriously flawed cost estimates
(WN 26 Mar 93) the taxp

The post quotes a newsletter item about space billboards (advertising in orbit), which mentions a science fiction story by Robert Heinlein.

In [47]:
print(newsgroups_train.data[mostsim[0]])

;From the article "What's New" Apr-16-93 in sci.physics.research:
;
;........
;WHAT'S NEW (in my opinion), Friday, 16 April 1993  Washington, DC
;
;1. SPACE BILLBOARDS! IS THIS ONE THE "SPINOFFS" WE WERE PROMISED?
;What about light pollution in observations? (I read somewhere else that
;it might even be visible during the day, leave alone at night).
;Is NASA really supporting this junk?
;Are protesting groups being organized in the States?
;Really, really depressed.
;
;             Enzo

I wouldn't worry about it.  There's enough space debris up there that
a mile-long inflatable would probably deflate in some very short
period of time (less than a year) while cleaning up LEO somewhat.
Sort of a giant fly-paper in orbit.

Hmm, that could actually be useful.

As for advertising -- sure, why not?  A NASA friend and I spent one
drunken night figuring out just exactly how much gold mylar we'd need
to put the golden arches of a certain American fast food organization
on the face of the Moon.

A reply that quotes the opening of the same article on space billboards, together with a comment signed by Enzo, and then responds to it.

In [48]:
print(newsgroups_train.data[mostsim[1]])

Brian Yamauchi asks: [Regarding orbital billboards...]
  
    Well, I had been collecting data for next edition of the
Commercial Space News/Space Technology Investor... To summarize:
  
SPACE ADVERTISING
    First, advertising on space vehicles is not new -- it is very
common practice to put the cooperating organization's logos on the
space launch vehicle.  For example, the latest GPS launcher had the
(very prominent) logos on its side of
   - McDonnell Douglas (the Delta launcher)
   - Rockwell International (who built the GPS satellite)
   - USAF (who paid for the satellite and launch), and
   - the GPS/Navstar program office
   This has not been considered "paid advertising" but rather
"public relations", since the restrictions have been such that only
organizations involved in the launch could put their logos on the
side, and there was no money exchanged for this.  [However, putting
a 10' high logo on the side of the launch vehicle facing the cameras
is "advertising" as much as it

The term "billboards" appears again. The post covers the logos of cooperating organizations placed on space launch vehicles.

In [49]:
print(newsgroups_train.data[mostsim[2]])

Two developments have brought these type of activities back to
the forefront in 1993.  First, in February, the Russians deployed a
20-m reflector from a Progress vehicle after it had departed from
the Mir Space Station.  While this "Banner" reflector was blank,
NPO Energia was very active in reporting that future  Banner
reflectors will be available to advertisers, who could use a space-
based video of their logo or ad printed on the Banner in a TV
commercial, as filmed from the Mir.
   The second development, has been that Space Marketing Inc, the
same company responsible for merchandising space on the Conestoga
booster and COMET spacecraft, is now pushing the "Environmental
Billboard".  As laid out by SMI Chief Engineer Dr Ron Humble of the
University of Colorado Space Laboratory and Preston Carter of the
Lawrence Livermore National Laboratory, the "Environmental
Billboard" is a large inflatable outer support structure of up to
804x1609 meters.  Advertising is carried by a mylar refl

The post describes two developments aimed at using space for advertising.

In [50]:
print(newsgroups_train.data[mostsim[3]])

Archive-name: space/controversy
Last-modified: $Date: 93/04/01 14:39:06 $

CONTROVERSIAL QUESTIONS

    These issues periodically come up with much argument and few facts being
    offered. The summaries below attempt to represent the position on which
    much of the net community has settled. Please DON'T bring them up again
    unless there's something truly new to be discussed. The net can't set
    public policy, that's what your representatives are for.


    WHAT HAPPENED TO THE SATURN V PLANS

    Despite a widespread belief to the contrary, the Saturn V blueprints
    have not been lost. They are kept at Marshall Space Flight Center on
    microfilm.

    The problem in re-creating the Saturn V is not finding the drawings, it
    is finding vendors who can supply mid-1960's vintage hardware (like
    guidance system components), and the fact that the launch pads and VAB
    have been converted to Space Shuttle use, so you have no place to launch
    from.

    By the time you 

The post explains that the Saturn V blueprints have not been lost; the problem is finding vendors who can supply the 1960s-era hardware.

In [51]:
print(newsgroups_train.data[mostsim[4]])

COMMERCIAL SPACE NEWS/SPACE TECHNOLOGY INVESTOR NUMBER 22

   This is number twenty-two in an irregular series on commercial 
space activities.  The commentaries included are my thoughts on 
these developments.  

   Sigh... as usual, I've gotten behind in getting this column 
written.  I can only plead the exigency of the current dynamics in 
the space biz.  This column is put together at lunch hour and after 
the house quiets down at night, so data can quickly build up if 
there's a lot of other stuff going on.  I've complied a lot of 
information and happenings since the last column, so I'm going to 
have to work to keep this one down to a readable length.  Have fun! 

CONTENTS:
1- US COMMERCIAL SPACE SALES FLATTEN IN 1993
2- DELTA WINS TWO KEY LAUNCH CONTRACTS
3- COMMERCIAL REMOTE SENSING VENTURE GETS DOC "GO-AHEAD"
4- INVESTMENT FIRM CALLS GD'S SPACE BIZ "STILL A GOOD INVESTMENT" 
5- ARIANE PREDICTS DIP IN LAUNCH DEMAND
6- NTSB INVESTIGATES PEGASUS LAUNCH OVER ABORTED ABORT
7- ANO

The post is a news column about commercial space activities.

Similarity is higher here (up to about 0.48), and all five neighbors belong to the same class as the original document.

#### **2**. Train Naïve Bayes classification models to maximize classification performance (macro F1-score) on the test set. Consider changing the constructor parameters of the vectorizer and the models, and try both the Multinomial Naïve Bayes and ComplementNB models.
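
One possible way to approach this task is a small grid over vectorizer options, model type and smoothing strength. The sketch below is only a starting point under stated assumptions: `compare_nb_models` is a hypothetical helper, the `alpha` values and vectorizer options are arbitrary candidates, and the toy texts stand in for the real train and test splits.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB


def compare_nb_models(train_texts, y_train, test_texts, y_test,
                      alphas=(0.01, 0.1, 1.0), vectorizer_options=None):
    """Return (macro_f1, model_name, alpha, options) tuples, best first."""
    if vectorizer_options is None:
        # illustrative candidates only, not a recommended configuration
        vectorizer_options = [{}, {"stop_words": "english"}, {"sublinear_tf": True}]
    results = []
    for options in vectorizer_options:
        vectorizer = TfidfVectorizer(**options)
        X_tr = vectorizer.fit_transform(train_texts)  # fit on train only
        X_te = vectorizer.transform(test_texts)
        for model_cls, alpha in product((MultinomialNB, ComplementNB), alphas):
            y_hat = model_cls(alpha=alpha).fit(X_tr, y_train).predict(X_te)
            score = f1_score(y_test, y_hat, average="macro")
            results.append((score, model_cls.__name__, alpha, options))
    return sorted(results, key=lambda r: r[0], reverse=True)


# toy stand-in for the real splits
train_texts = ["the rocket reached orbit", "nasa launched a satellite",
               "the engine needs new oil", "my car has a flat tire"]
train_labels = [0, 0, 1, 1]
test_texts = ["the satellite is in orbit", "the car engine is loud"]
test_labels = [0, 1]

results = compare_nb_models(train_texts, train_labels, test_texts, test_labels)
for score, name, alpha, options in results[:3]:
    print(f"{score:.3f} {name} alpha={alpha} {options}")
```

With the notebook's data the call would be `compare_nb_models(newsgroups_train.data, y_train, newsgroups_test.data, y_test)`. The task asks for the test-set score; outside of this exercise, a validation split or cross-validation is the more common choice for selecting hyperparameters.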

In [None]:
np.random.seed(10)
random.seed(10)
for _ in range(5):
    # draw one valid document index (randrange excludes the upper bound)
    doc_idx = random.randrange(X_train.shape[0])
    cossim = cosine_similarity(X_train[doc_idx], X_train)[0]
    # indices of the 5 most similar documents, skipping the document itself
    mostsim = np.argsort(cossim)[::-1][1:6]
    # print the class of each neighbor
    for j in mostsim:
        print(newsgroups_train.target_names[y_train[j]])
    print()

#### **3**. Transpose the document-term matrix. This yields a term-document matrix that can be interpreted as a collection of word vectors. Now study similarity between words by picking 5 words and examining the 5 most similar to each.
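
The idea can be sketched on a small dense matrix before applying it to the real data. Everything in the block below is a toy assumption made for illustration: the four terms, the counts, and the helper `most_similar_terms`.

```python
import numpy as np

# toy document-term matrix: rows are documents, columns are terms
terms = ["car", "engine", "moon", "orbit"]
doc_term = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 2.0],
])

term_doc = doc_term.T                                   # rows are now terms
norms = np.linalg.norm(term_doc, axis=1, keepdims=True)
unit = term_doc / norms                                 # L2-normalize each term vector
term_sim = unit @ unit.T                                # cosine similarity between terms


def most_similar_terms(term, k=1):
    """Return the k terms whose document profiles are closest to `term`."""
    i = terms.index(term)
    order = np.argsort(term_sim[i])[::-1]
    order = order[order != i]                           # exclude the term itself
    return [terms[j] for j in order[:k]]


print(most_similar_terms("car"), most_similar_terms("moon"))
```

With the notebook's sparse `X_train`, a dense term-by-term similarity matrix over 101631 terms would be too large to hold in memory, so a more practical route is likely to compute one row at a time, for example with something like `cosine_similarity(X_train.T[tfidfvect.vocabulary_['car']], X_train.T)[0]`, and then map the top indices back to words with `idx2word`.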