# Topic Extraction

### We use one of the datasets available in the Scikit-Learn package, 20 newsgroups, a collection of thousands of posts from 20 different newsgroups, to test NMF (Non-negative Matrix Factorization) for extracting the main topics without using the labels.
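
As a rough sketch of the idea (not part of the original notebook): NMF approximates a non-negative document-term matrix `V` of shape (documents, terms) by a product `W @ H`, where `W` holds each document's weight on each topic and `H` holds each topic's weight on each term. The toy matrices below are made up purely to show the shapes involved.

```python
import numpy as np

# Hypothetical toy factors: 4 documents, 2 topics, 5 terms.
W_toy = np.array([[1.0, 0.0],
                  [0.8, 0.2],
                  [0.0, 1.0],
                  [0.1, 0.9]])                  # (documents, topics)
H_toy = np.array([[0.5, 0.4, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 0.2, 0.3, 0.5]])   # (topics, terms)

# The product of two non-negative factors is a non-negative
# (documents, terms) matrix; NMF goes the other way and searches
# for W and H given V.
V_toy = W_toy @ H_toy
print(V_toy.shape)
print((V_toy >= 0).all())
```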

In [11]:
import numpy as np
from sklearn.feature_extraction import text
from sklearn import datasets, decomposition
import random

In [12]:
n_samples = 1000  # number of documents used to fit the model
n_features = 1000  # vocabulary size: the most frequent terms kept across the corpus
n_topics = 20  # number of topics to extract
n_top_words = 20  # number of top-weighted words shown for each topic

In [13]:
# load the data

dataset = datasets.fetch_20newsgroups(shuffle=True, random_state=1)
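
A possible variation, sketched under the assumption that header text is not wanted in the topics: `fetch_20newsgroups` accepts a `remove` argument that strips headers, signature footers and quoted replies from each post. The helper name `load_body_only` is hypothetical, and the function is only defined here, not called, because calling it downloads the dataset on first use.

```python
from sklearn import datasets


def load_body_only(random_state=1):
    """Hypothetical helper: load the default (training) split with metadata stripped."""
    return datasets.fetch_20newsgroups(
        shuffle=True,
        random_state=random_state,
        # Drop header lines, signature blocks and quoted replies.
        remove=("headers", "footers", "quotes"),
    )
```

This may be worth trying because several topics printed at the end of this notebook consist mostly of header tokens such as `nntp`, `posting` and `host`.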

In [14]:
# check how many documents the dataset contains

print("the dataset contains {} documents".format(len(dataset["data"])))

the dataset contains 11314 documents


In [15]:
# list the categories the posts belong to

for target in dataset["target_names"]:
    print(target)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


In [16]:
# look at what the documents look like

print(dataset["data"][1000])
print("-----------------------------------------------------------------------------------------------------------")
print("-----------------------------------------------------------------------------------------------------------")
print("-----------------------------------------------------------------------------------------------------------")
print("\n",dataset["data"][510])

From: janzen@lichen.mpr.ca (Martin Janzen)
Subject: Re: how to put RPC in HP X/motif environment?
Nntp-Posting-Host: lichen
Reply-To: janzen@mprgate.mpr.ca
Organization: MPR Teltech Ltd.
Lines: 30

In article <C5r03J.Gu3@news2.cis.umn.edu>, ianhogg@milli.cs.umn.edu (Ian J. Hogg) writes:
>In article <1993Apr19.200740.17615@sol.ctr.columbia.edu> nchan@nova.ctr.columbia.edu (Nui Chan) writes:
>>has anybody implements an RPC server in the HP Xwindows? In SUN Xview, there
>>is a notify_enable_rpc_svc() call that automatically executes the rpc processes
>>when it detects an incoming request. I wonder if there is a similar function in
>>HP X/motif that perform the same function.
>
>I've been using the xrpc package for about a year now.  I believe I got it from
>export.  

Glad to hear that it's working for you!

I couldn't find it on "export".  However, Simon Leinen
<simon@liasun6.epfl.ch> has added an Imakefile and an Athena
version, and made it available for FTP in the file
liasun3.epfl.ch:

In [17]:
# feature extraction keeping the 1000 most frequent words (max_features=1000)
# words that appear in more than 95% of the documents are excluded
# English stop words are removed, then TF-IDF weighting is applied

vectorizer = text.CountVectorizer(max_df=0.95, max_features=n_features, stop_words='english')
counts = vectorizer.fit_transform(dataset.data[:n_samples])
tfidf = text.TfidfTransformer().fit_transform(counts)

print("the input matrix has shape {}".format(tfidf.shape))

the input matrix has shape (1000, 1000)
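
The two-step route above (counts first, TF-IDF second) can also be written in one step with `TfidfVectorizer`. The sketch below checks that equivalence on a made-up three-sentence corpus; the corpus and the variable names are illustrative assumptions, and `max_df` and `max_features` are left out because the toy vocabulary is tiny.

```python
import numpy as np
from sklearn.feature_extraction import text

# Made-up toy corpus, only for illustration.
docs = [
    "the key encrypts the message",
    "the shuttle reached the moon",
    "the chip stores the secret key",
]

# Two-step route: raw term counts, then TF-IDF weighting.
counts_toy = text.CountVectorizer(stop_words="english").fit_transform(docs)
two_step = text.TfidfTransformer().fit_transform(counts_toy)

# One-step route with the same settings.
one_step = text.TfidfVectorizer(stop_words="english").fit_transform(docs)

# Compare the two results element by element.
print(np.allclose(two_step.toarray(), one_step.toarray()))

# With the default norm="l2", each document row has unit length.
print(np.linalg.norm(one_step.toarray(), axis=1))
```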


In [18]:
# fit the NMF model

nmf = decomposition.NMF(n_components=n_topics)
W = nmf.fit_transform(tfidf)
H = nmf.components_

print("the matrix was factorized into W of shape {} and H of shape {}".format(W.shape, H.shape))

the matrix was factorized into W of shape (1000, 20) and H of shape (20, 1000)
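
To make the roles of the two factors more concrete, here is a minimal sketch on a small random matrix rather than the newsgroups data. It checks that both factors are non-negative, reads the reconstruction error that scikit-learn stores after fitting, and picks the highest-weight topic for each document. The `_toy` names and the `init`, `random_state` and `max_iter` settings are choices made for this sketch, not settings from the notebook.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
V_toy = rng.random((6, 8))  # made-up non-negative "document-term" matrix

# init, random_state and max_iter are assumptions for this sketch only.
nmf_toy = decomposition.NMF(
    n_components=2, init="nndsvda", random_state=0, max_iter=1000
)
W_toy = nmf_toy.fit_transform(V_toy)  # (documents, topics)
H_toy = nmf_toy.components_           # (topics, terms)

print(W_toy.shape, H_toy.shape)
print((W_toy >= 0).all() and (H_toy >= 0).all())

# Error between V_toy and W_toy @ H_toy, as reported by scikit-learn.
print(nmf_toy.reconstruction_err_)

# Index of the highest-weight topic for each document.
dominant_topic = W_toy.argmax(axis=1)
print(dominant_topic)
```

Applied to the `W` fitted above, the same `argmax(axis=1)` call would give one topic index per post, which could then be compared against `dataset.target` if the labels were brought back in.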


In [19]:
# print some of the extracted words

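# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
feature_names = list(vectorizer.get_feature_names_out())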
for i in range(20):
    print(random.choice(feature_names))

general
years
bob
mary
designed
video
original
30
previous
pc
uni
jews
uunet
keith
left
second
win
ms
worked
wpi
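
The next cell ranks the words of each topic with the slice `[:-n_top_words - 1:-1]` applied to `argsort()`. As a small illustration on made-up weights (the names `weights`, `names` and `k` are hypothetical), the slice takes the last `k` indices of the ascending sort and reverses them, which yields the `k` largest weights in descending order.

```python
import numpy as np

# Made-up weights of five terms within a single topic.
weights = np.array([0.1, 0.7, 0.0, 0.9, 0.3])
names = ["a", "b", "c", "d", "e"]
k = 2

# argsort() sorts ascending; [:-k - 1:-1] walks backwards from the end
# and stops after k elements.
top_idx = weights.argsort()[:-k - 1:-1]
top_words = [names[j] for j in top_idx]
print(top_words)

# An equivalent form that some readers find easier to follow.
alt_idx = np.argsort(-weights)[:k]
print([names[j] for j in alt_idx])
```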


In [20]:
# finally, print the most representative words of the 20 extracted topics

for i, topic in enumerate(H):
    print("Topic #{}:".format(i))
    # indices of the n_top_words largest weights, in descending order
    print(" ".join([feature_names[j] for j in topic.argsort()[:-n_top_words - 1:-1]]), "\n")

Topic #0:
people don like just think know good way time better say ve want ll thing things make government right use 

Topic #1:
edu university cc article posting host uiuc nntp writes state washington cso virginia distribution michael rochester berkeley csd news reply 

Topic #2:
com netcom writes article stratus sun news jim portal corp att posting distribution reply systems uunet nntp world host support 

Topic #3:
mit file help thanks color lcs hi problem image information internet send computer version faq program edu book ftp mail 

Topic #4:
clipper key chip encryption government keys public secure enforcement secret house brad algorithm use standard phone law na communications security 

Topic #5:
nasa gov space jpl center research moon shuttle program laboratory brian distribution world posting nntp host data 25 long running 

Topic #6:
uk ac university posting ___ sun host nntp thanks writes dc __ return sorry college united wrong application computing advance 

Topic #7:
gam