### Imports and Data Preprocessing
In this notebook we use:

1.   pandas
2.   nltk
3.   gensim

We work with the first 10,000 rows of a dataset of news headlines.

In [12]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import numpy as np
import nltk
nltk.download('wordnet')
import pandas as pd

# Optional, only when running on Google Colab with the CSV stored on Drive:
# from google.colab import drive
# drive.mount('/content/drive')

# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
# Keep only the first 10,000 rows; processing the full dataset takes too long (10+ min)
data_text = data[['headline_text']][:10000].copy()

data_text['index'] = data_text.index
documents = data_text
stemmer = PorterStemmer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [13]:
documents[:5]

   headline_text                                      index
0  aba decides against community broadcasting lic...  0
1  act fire witnesses must be aware of defamation     1
2  a g calls for infrastructure protection summit     2
3  air nz staff in aust strike for pay rise           3
4  air nz strike to affect australian travellers      4


In [14]:
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its Porter stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem what remains
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

# Preprocess the headlines, saving the result in 'processed_docs'



In [15]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0               [decid, commun, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object
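
The filtering step inside `preprocess` can be illustrated without gensim or nltk. The sketch below is a simplified stand-in, not the notebook's pipeline: `TOY_STOPWORDS` is a hypothetical two-word stopword set (gensim's `STOPWORDS` is much larger), the regex tokenizer only approximates `simple_preprocess`, and lemmatization and stemming are skipped.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses gensim.parsing.preprocessing.STOPWORDS instead.
TOY_STOPWORDS = {"must", "fire"}

def toy_preprocess(text):
    """Lowercase, split into alphabetic tokens, then drop stopwords
    and tokens of 3 characters or fewer (no stemming here)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 3]

kept = toy_preprocess("act fire witnesses must be aware of defamation")
print(kept)
```

The surviving tokens are the unstemmed counterparts of the stems shown for headline 1 above; the real pipeline then lemmatizes and stems each of them.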

### Bag of words on the dataset
We build a dictionary from `processed_docs`, mapping each unique token to an integer ID.

In [16]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
print(dictionary)
# Print the first entries of the dictionary (IDs 0 through 10)
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

Dictionary(6532 unique tokens: ['broadcast', 'commun', 'decid', 'licenc', 'awar']...)
0 broadcast
1 commun
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


We filter out the tokens that appear in fewer than 15 documents (an absolute count) or in more than 50% of the documents in the corpus.
After that, we keep only the 100,000 most frequent tokens.

In [17]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
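
The rule applied by `filter_extremes` can be sketched in plain Python. This is an illustration of the document-frequency thresholds only, not gensim's implementation: the function name and the toy corpus are hypothetical, and the `keep_n` cap is not modelled.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens found in at least `no_below` documents and in
    at most a fraction `no_above` of all documents."""
    # Count each token once per document (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus of four tokenized headlines.
toy_docs = [
    ["iraq", "war", "protest"],
    ["iraq", "troop", "protest"],
    ["iraq", "council", "plan"],
    ["iraq", "water", "plan"],
]
kept_tokens = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept_tokens))
```

Here "iraq" is dropped because it occurs in every document, and the tokens that occur in a single document are dropped for being too rare.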

For each document we build a bag of words: a list of (token ID, count) pairs recording which words appear and how many times.
We store the result in `bow_corpus`, then print the bag of words for one sample document (index 34).

A usage example from the gensim documentation:
https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py


```
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

```



In [18]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_doc_34 = bow_corpus[34]

# Print the word ID, the word itself, and the number of times it appears
for token_id, count in bow_doc_34:
    print('Word {} ("{}") appears {}'.format(token_id, dictionary[token_id], count))

Word 7 ("rise") appears 1
Word 78 ("expect") appears 1
Word 79 ("threat") appears 1
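
To make the (token ID, count) structure concrete, here is a minimal sketch of a bag-of-words conversion using only the standard library. The helper `toy_doc2bow` and the `token2id` mapping are hypothetical; the IDs simply reuse the ones printed above for readability, and the token list is invented.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return (token_id, count) pairs sorted by ID;
    tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary, reusing the IDs shown in the output above.
token2id = {"rise": 7, "expect": 78, "threat": 79}
bow = toy_doc2bow(["rise", "threat", "rise"], token2id)
print(bow)
```

A token that occurs twice yields a count of 2, while a vocabulary entry that does not occur in the document produces no pair at all, which keeps the representation sparse.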


### Training the LDA model using the Bag of Words approach
We use gensim.models.LdaMulticore and store the trained model in `lda_model`. Training takes about 6 seconds.

https://radimrehurek.com/gensim/models/ldamulticore.html


In [19]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=2, workers=2)

In [20]:
"""
Get the most significant topics (alias for show_topics() method).
Parameters
num_topics (int, optional) – The number of topics to be selected, 
if -1 - all topics will be in result (ordered by significance).
"""
# Print every topic in lda_model, with the weight of each word in the topic
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.020*"protest" + 0.014*"polic" + 0.013*"hold" + 0.012*"iraq" + 0.012*"death" + 0.012*"world" + 0.012*"take" + 0.011*"face" + 0.010*"open" + 0.010*"minist"
Topic: 1 
Words: 0.026*"plan" + 0.017*"water" + 0.012*"crash" + 0.012*"win" + 0.011*"aust" + 0.011*"iraq" + 0.010*"council" + 0.009*"troop" + 0.008*"kill" + 0.008*"world"
Topic: 2 
Words: 0.029*"iraq" + 0.019*"claim" + 0.018*"polic" + 0.017*"govt" + 0.016*"sar" + 0.015*"forc" + 0.012*"lead" + 0.011*"report" + 0.011*"boost" + 0.010*"hospit"
Topic: 3 
Words: 0.015*"continu" + 0.013*"govt" + 0.011*"iraqi" + 0.011*"australia" + 0.010*"kill" + 0.010*"urg" + 0.009*"titl" + 0.009*"iraq" + 0.008*"green" + 0.008*"report"
Topic: 4 
Words: 0.030*"say" + 0.030*"baghdad" + 0.026*"iraq" + 0.019*"charg" + 0.014*"warn" + 0.012*"council" + 0.012*"fund" + 0.011*"iraqi" + 0.010*"health" + 0.009*"seek"
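
Each topic above is printed as a single string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse that string; the helper below is a hypothetical sketch that assumes exactly the format shown in the output. gensim's `show_topic` method should return the (word, weight) pairs directly, so parsing is only useful when working from saved text output.

```python
import re

def parse_topic(topic_string):
    """Split a string such as '0.020*"protest" + 0.014*"polic"'
    into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)]

# First three terms of Topic 0, copied from the output above.
pairs = parse_topic('0.020*"protest" + 0.014*"polic" + 0.013*"hold"')
print(pairs)
```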


### Evaluating how an already-seen document is classified by the LDA Bag of Words model
For each topic assigned to the document we print its score and the topic's 10 most relevant words. Topics are sorted by descending score, so the first one printed is the best match.

In [21]:
for index, score in sorted(lda_model[bow_corpus[34]], key=lambda tup: -1*tup[1]):
    print("\nScore:"+ str(score) + " \t \nTopic: " + str(lda_model.print_topic(index, 10)))


Score:0.63855916 	 
Topic: 0.029*"iraq" + 0.019*"claim" + 0.018*"polic" + 0.017*"govt" + 0.016*"sar" + 0.015*"forc" + 0.012*"lead" + 0.011*"report" + 0.011*"boost" + 0.010*"hospit"

Score:0.20870966 	 
Topic: 0.020*"protest" + 0.014*"polic" + 0.013*"hold" + 0.012*"iraq" + 0.012*"death" + 0.012*"world" + 0.012*"take" + 0.011*"face" + 0.010*"open" + 0.010*"minist"

Score:0.051211562 	 
Topic: 0.026*"plan" + 0.017*"water" + 0.012*"crash" + 0.012*"win" + 0.011*"aust" + 0.011*"iraq" + 0.010*"council" + 0.009*"troop" + 0.008*"kill" + 0.008*"world"

Score:0.050791938 	 
Topic: 0.015*"continu" + 0.013*"govt" + 0.011*"iraqi" + 0.011*"australia" + 0.010*"kill" + 0.010*"urg" + 0.009*"titl" + 0.009*"iraq" + 0.008*"green" + 0.008*"report"

Score:0.050727695 	 
Topic: 0.030*"say" + 0.030*"baghdad" + 0.026*"iraq" + 0.019*"charg" + 0.014*"warn" + 0.012*"council" + 0.012*"fund" + 0.011*"iraqi" + 0.010*"health" + 0.009*"seek"
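
The scores printed above form a probability distribution over the five topics. The sketch below checks this and picks the dominant topic; the (topic ID, score) pairs are copied from the output, with the topic IDs inferred by matching each printed topic string against the topic list shown earlier.

```python
# (topic_id, score) pairs for document 34, copied from the output above.
# The topic IDs are inferred by matching the printed topic strings.
doc_topics = [
    (2, 0.63855916),
    (0, 0.20870966),
    (1, 0.051211562),
    (3, 0.050791938),
    (4, 0.050727695),
]

# The scores should add up to (approximately) 1.
total = sum(score for _, score in doc_topics)

# The dominant topic is the pair with the highest score.
best_topic, best_score = max(doc_topics, key=lambda pair: pair[1])

print("sum of scores:", round(total, 4))
print("dominant topic:", best_topic, "with score", best_score)
```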


### Testing the LDA BOW model on a new, unseen document




In [22]:
unseen_document = 'university kill healt government guns'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.5909245014190674	 Topic: 0.015*"continu" + 0.013*"govt" + 0.011*"iraqi" + 0.011*"australia" + 0.010*"kill"
Score: 0.10386024415493011	 Topic: 0.030*"say" + 0.030*"baghdad" + 0.026*"iraq" + 0.019*"charg" + 0.014*"warn"
Score: 0.10342563688755035	 Topic: 0.026*"plan" + 0.017*"water" + 0.012*"crash" + 0.012*"win" + 0.011*"aust"
Score: 0.10164479911327362	 Topic: 0.029*"iraq" + 0.019*"claim" + 0.018*"polic" + 0.017*"govt" + 0.016*"sar"
Score: 0.10014486312866211	 Topic: 0.020*"protest" + 0.014*"polic" + 0.013*"hold" + 0.012*"iraq" + 0.012*"death"
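
When scoring a new document it helps to know which of its tokens the model can actually use, since tokens absent from the dictionary carry no information. The sketch below separates known from unknown tokens. Everything in it is hypothetical: `toy_vocabulary` stands in for the filtered `dictionary`, and the token list is an assumed result of preprocessing the unseen document, not output taken from the notebook.

```python
def split_by_vocabulary(tokens, vocabulary):
    """Separate tokens into those present in the vocabulary
    and those absent from it, preserving order."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown

# Hypothetical vocabulary of stemmed tokens (the real one is `dictionary`).
toy_vocabulary = {"kill", "health", "govt", "iraq"}

# Assumed stems for the unseen document; a misspelled word such as
# 'healt' cannot match the correctly spelled vocabulary entry.
known, unknown = split_by_vocabulary(
    ["univers", "kill", "healt", "govern", "gun"], toy_vocabulary
)
print("known:", known)
print("unknown:", unknown)
```

With the real objects, the same membership test could be written against `dictionary.token2id`.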
