# GUIDED PRACTICE: Topic Modeling with Non-Negative Matrix Factorization (NMF)

Non-negative matrix factorization (NMF) is a group of algorithms in multivariate analysis and linear algebra in which a matrix V is factorized into (usually) two matrices W and H, such that V ≈ WH and none of the three matrices has negative elements.

Because the problem generally has no exact solution, it is commonly approximated numerically.
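
To make the numerical approximation concrete, here is a minimal sketch using the classic multiplicative update rules of Lee and Seung. The function name `nmf_multiplicative`, the toy matrix `V` and the iteration count are assumptions made for illustration. This is not the solver that scikit-learn's `NMF` uses by default (coordinate descent).

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative V (n x m) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Start from random non-negative factors
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative matrix (illustrative only)
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [2.0, 0.0, 4.0]])

W, H = nmf_multiplicative(V, k=2)
reconstruction_error = np.linalg.norm(V - W @ H)
print('W shape:', W.shape, 'H shape:', H.shape)
print('Reconstruction error:', reconstruction_error)
```

In the topic modeling setting used below, V plays the role of the documents-by-terms matrix, W the documents-by-topics matrix and H the topics-by-terms matrix.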

In [1]:
import ast
import numpy as np
from sklearn.decomposition import NMF

In [2]:
# Load the data (the context manager closes the file automatically)
with open('../Data/todasLasNoticias2008.txt', 'r') as fp:
    data = fp.read()

In [3]:
type(data)

str

In [4]:
# https://docs.python.org/3/library/ast.html
notes = ast.literal_eval(data)

# Print the number of notes
print('Number of notes:', len(notes))

Number of notes: 5808
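
As a small illustration of why `ast.literal_eval` is a reasonable choice for this file, the sketch below parses a string holding a Python literal (a list of dicts) and shows that anything beyond plain literals is rejected. The sample string and its URL are invented for the example; they only mimic the layout seen in the data.

```python
import ast

# Hypothetical sample mimicking the file layout: a Python-literal list of dicts
raw = "[{'url': 'https://example.com/a', 'title': ['First'], 'text': ['Hello', 'world']}]"

parsed = ast.literal_eval(raw)
print(type(parsed), len(parsed))

# Anything beyond plain literals (for example a function call) is rejected
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print('Expression rejected:', rejected)
```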


In [5]:
type(notes)

list

In [6]:
notes[0]

{'url': 'https://www.nytimes.com/2008/09/04/nyregion/04schley.html',
 'text': ['To the Editor:',
  'Re “Bush Says McCain Is Choice to Lead in Time of Danger” (front page, Sept. 3):',
  'Seeing President Bush (at a safe distance) and Senator Joseph I. Lieberman speaking to the Republican convention dispels any doubt that Senator John McCain, the prospective nominee for president on the Republican ticket, is running for President Bush’s third term.',
  'Katrina, the economy, windfall profits for the oil industry as gas prices spiral out of control and eavesdropping on Americans — but it is the war in Iraq that is Mr. Bush’s albatross and everlasting legacy, which Mr. McCain supports at our peril and which the electorate will be voting on in November.',
  'Morris Roth',
  'Fort Lee, N.J., Sept. 3, 2008',
  '\x95',
  'To the Editor:',
  'Yes, we live in dangerous times, but the person we need to lead the country at such times is one who knows his geography and understands the complexities 

In [7]:
# Add the media outlet's name to each note (not needed for now)
# and normalize the text: join each list of strings into a single string
for note in notes:

    if 'foxnews' in note['url']:
        note['media_name'] = 'Fox News'
    elif 'nytimes' in note['url']:
        note['media_name'] = 'NYTimes'
    elif 'breitbart' in note['url']:
        note['media_name'] = 'Breitbart'
    else:
        pass

    note['text'] = ' '.join(note['text'])
    note['title'] = ' '.join(note['title'])
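
The substring test above also matches when the outlet's name appears anywhere else in the URL. A stricter variant, sketched here under assumptions, compares the host name instead. The helper `media_name_from_url`, the `'Unknown'` default and the domains `foxnews.com` and `breitbart.com` are hypothetical; only `nytimes.com` is confirmed by the sample note.

```python
from urllib.parse import urlparse

# Assumed domain-to-outlet mapping (only nytimes.com appears in the sample note)
MEDIA_BY_DOMAIN = {
    'foxnews.com': 'Fox News',
    'nytimes.com': 'NYTimes',
    'breitbart.com': 'Breitbart',
}

def media_name_from_url(url, default='Unknown'):
    """Return the outlet name for a URL based on its host name."""
    host = (urlparse(url).hostname or '').lower()
    for domain, name in MEDIA_BY_DOMAIN.items():
        # Accept the bare domain and any subdomain such as www.
        if host == domain or host.endswith('.' + domain):
            return name
    return default

print(media_name_from_url('https://www.nytimes.com/2008/09/04/nyregion/04schley.html'))
```

Returning a default value also means every note ends up with a `media_name`, whereas the loop above leaves the key unset for unrecognized outlets.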

In [8]:
notes[0]

{'url': 'https://www.nytimes.com/2008/09/04/nyregion/04schley.html',
 'text': 'To the Editor: Re “Bush Says McCain Is Choice to Lead in Time of Danger” (front page, Sept. 3): Seeing President Bush (at a safe distance) and Senator Joseph I. Lieberman speaking to the Republican convention dispels any doubt that Senator John McCain, the prospective nominee for president on the Republican ticket, is running for President Bush’s third term. Katrina, the economy, windfall profits for the oil industry as gas prices spiral out of control and eavesdropping on Americans — but it is the war in Iraq that is Mr. Bush’s albatross and everlasting legacy, which Mr. McCain supports at our peril and which the electorate will be voting on in November. Morris Roth Fort Lee, N.J., Sept. 3, 2008 \x95 To the Editor: Yes, we live in dangerous times, but the person we need to lead the country at such times is one who knows his geography and understands the complexities of the world, one who listens and explore

In [9]:
# Build the list of texts to analyze (title followed by body)
texts = [' '.join([note['title'], note['text']]) for note in notes]

In [10]:
# Tf-idf and NMF analysis
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [11]:
stop_words_extra = ["host","play","episode","play","series","kelly","returns","husband","access","book","kind","asked","page",
                    "event","twitter","long","point","women","men","man","woman","30","board","chelsea","street","band","swing",
                    "45","speeches","wrote","44","staff","speech","things","lot","days","according","including","need","come",
                    "does","man","monday","great","today","big","washington","question","left","group","politics","change","end",
                    "little","old","better","woman","senate","office","won","help","let","believe","questions","wednesday",
                    "fact","press","issue","early","far","real","street","talk","use","trying","used","home","issues","nominee",
                    "donald","republican","et","trumps","hillary","rally","democratic","dont","ryan","hes","clintons","fox",
                    "hold","thats","im","gop","fox","news","000","poll","polls","points","johnson","florida","margin","race",
                    "voting","votes","lead","stein","likely","carolina","north","error","electoral","margin","polling","ads",
                    "october","electorate","utah","august","48","mitt","seats","49","recent","district","fraud","2016","et",
                    "romney","tuesday","seven","38","director","compared","november","weeks","hillary","clintons","pollsters",
                    "leaning","september","final","hold","rep","new","whites","average","270","close","map","hes","37","ad",
                    "senator","competitive","gov","2008","tied","pollster","36","national","polls","voter","registration",
                    "partys","chance","50","incumbent","campaigns","edge","committee","friday","convention","sept","july",
                    "georgia","key","breitbart","seat","data","roughly","cnn","51","challenger","ticket","favorable","vs",
                    "maine","missouri","trailing","52","battleground","state","13","40","indiana","bernie","31","clintons",
                    "anthony","weiner","don","white","york","new","york","want","americans","world","good","saying","times",
                    "polls","supporters","look","work","00","siriusxm","125","weekdays","patriot","125","weekdays","airs",
                    "siriusxm","airs","eastern","et","thats","continued","added","polls","hes","dont","listen","trumps","hold",
                    "breitbart","news","gop","rally","poll","polling","pointing","points","win","shes","im","thing","oh",
                    "theres","want","class","votes","mr","mr trump","mrs","mrs clinton","percent","voters","obama","party",
                    "debate","states","think","american","vote","going",'trump','clinton',"you","know","going","to","video",
                    "clip","williams","mean","you","re","ve","yes","got","we","re","really","this","is","letserver","the",
                    "clinton","podesta","debate","ms","republicans","going","to","democrats","media","government","million",
                    "america","candidates","the","united","states","says","this","is","city","pollstheater","west","sunday",
                    "saturday","brooklyn","avenue","city","night","the","debate","pence","mike","she","said","aug","park",
                    "music","museum","manhattanstreet","city","theater","west","government","economic","brooklyn","the","city",
                    "mrs","mr","re","pence","debate","kaine","mr","pence","the","debate","mr","kaine","mike","pence","mike",
                    "vice","vice","presidential","plan","debates","policy","running","mate"]

# Converted to a list: recent scikit-learn versions reject a frozenset for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(stop_words_extra))

In [12]:
# Build the count vectors for the notes
count_vect = CountVectorizer(ngram_range=(1, 3), max_df=0.2, min_df=0.01,
                             stop_words=stop_words, lowercase=True)
x_counts = count_vect.fit_transform(texts)

In [13]:
# Build the tf-idf weighted matrix
tfidf_transformer = TfidfTransformer(norm='l2')
x_tfidf = tfidf_transformer.fit_transform(x_counts)
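
To show what the tf-idf weighting with `norm='l2'` does, the following sketch reproduces the computation by hand on a tiny count matrix. It assumes the default smoothed idf formula of `TfidfTransformer`, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the toy counts are invented for illustration.

```python
import numpy as np

# Toy document-term counts: 3 documents, 3 terms (illustrative only)
counts = np.array([[2, 1, 0],
                   [0, 1, 3],
                   [1, 1, 0]], dtype=float)

n_docs = counts.shape[0]
# Document frequency: number of documents containing each term
df = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + df)) + 1
# Weight the counts, then scale each row to unit Euclidean length
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
row_norms = np.linalg.norm(tfidf, axis=1)

print('idf:', idf)
print('row norms:', row_norms)
```

A term that appears in every document gets the lowest idf, and rarer terms are weighted up. The l2 normalization keeps long notes from dominating short ones.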

In [14]:
x_tfidf.shape

(5808, 2413)

In [15]:
# Set the number of topics
dim = 6

In [16]:
# Run the NMF decomposition
nmf = NMF(n_components = dim)
nmf_array = nmf.fit_transform(x_tfidf)

In [17]:
# Check the shapes of the resulting matrices
print('Documents-by-topics matrix:', nmf_array.shape)
print('Topics-by-terms matrix:', nmf.components_.shape)

Documents-by-topics matrix: (5808, 6)
Topics-by-terms matrix: (6, 2413)


In [18]:
# Label of each note: the topic with the largest weight in its row
labels = nmf_array.argmax(axis=1)

In [19]:
# Components and feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
features = count_vect.get_feature_names_out()

for j, comp in enumerate(nmf.components_):

    # Features sorted by decreasing weight in this component
    prior_features = features[np.argsort(-comp, kind='stable')]

    # Save to files the features of each component, sorted by priority,
    # and the indices of the notes assigned to each topic.
    # Mode 'w' overwrites, so re-running the cell does not duplicate content.
    with open('NMFComponent{}.txt'.format(j), 'w') as fp:
        for k in prior_features:
            fp.write('{}, '.format(k))

    with open('NMFNotes{}.txt'.format(j), 'w') as fp:
        for k in range(len(labels)):
            if labels[k] == j:
                fp.write('{}, '.format(k))
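
For a quick look at the topics without opening the files, the highest-weighted terms of each component can be listed directly. This is a minimal sketch on toy data: the helper `top_terms`, the weights and the feature names are all made up for illustration. With the real model one would pass `nmf.components_` and the vectorizer's feature names instead.

```python
import numpy as np

def top_terms(components, feature_names, n_top=3):
    """Return the n_top highest-weighted terms for each component."""
    feature_names = np.asarray(feature_names)
    result = []
    for weights in components:
        # Indices of the largest weights, in decreasing order
        order = np.argsort(-weights, kind='stable')[:n_top]
        result.append(feature_names[order].tolist())
    return result

# Toy topics-by-terms matrix and vocabulary (illustrative only)
toy_components = np.array([[0.9, 0.0, 0.4, 0.1],
                           [0.0, 0.8, 0.1, 0.7]])
toy_features = ['economy', 'iraq', 'tax', 'war']

top = top_terms(toy_components, toy_features, n_top=2)
for topic_index, terms in enumerate(top):
    print('Topic {}: {}'.format(topic_index, ', '.join(terms)))
```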