## Non-Negative Matrix Factorization for Exploring the European Parliament's Topic Agenda
### Assignment 2 for Machine Learning Complements class
By Alexandra de Carvalho, Luís Costa, Nuno Pedrosa

#### Importing the needed Python libraries
We use NLTK for text processing and scikit-learn for vectorization and NMF. Pandas is imported only to set display options.

In [2]:
import os
import re
import math
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

import pandas as pd

# for modeling 
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import NMF

# for text processing
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /home/alexa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/alexa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/alexa/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/alexa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


#### Importing the data

In [3]:
# expand pandas df column display width to enable easy inspection
pd.set_option('max_colwidth', 150)

# collect the paths of all text files in the folder and in its direct subfolders
dir_path = 'sample' # folder path
files = [] # list to store files

for path in os.listdir(dir_path):
    if os.path.isfile(os.path.join(dir_path, path)):
        files.append(os.path.join(dir_path, path))
    else:
        subpath = os.path.join(dir_path, path)
        for path2 in os.listdir(subpath):
            if os.path.isfile(os.path.join(subpath, path2)):
                files.append(os.path.join(subpath, path2))

#### Tokenizing
To make the text of the speeches as comparable as possible, we lowercase it and remove punctuation, numbers, and other non-word characters. We also drop English stop words and tokens of three characters or fewer, and keep the frequency of each remaining token per document.

In [4]:
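# build the stop-word set once; set membership tests are much faster than list scans
stop_words = set(stopwords.words('english'))

text_tokens = dict()
for filename in files:
    # read as text instead of bytes, so str() does not leak b'...' prefixes and escape codes into the tokens
    # assumption: the files are UTF-8; bytes that cannot be decoded are dropped
    with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        text_tokens[filename] = dict()

        for line in f:
            for token in re.split(r'\W+', line.lower()):
                if len(token) > 3 and not token.isnumeric() and token not in stop_words:
                    text_tokens[filename][token] = text_tokens[filename].get(token, 0) + 1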
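
To see what these filtering rules do to a single line of text, here is a minimal, self-contained sketch. The helper `count_tokens`, the sample sentence, and the small `STOP_WORDS` set are all made up for illustration; the notebook itself uses NLTK's full English stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word subset; the notebook uses NLTK's full English list.
STOP_WORDS = {"this", "that", "with", "have", "from"}

def count_tokens(text, stop_words=STOP_WORDS):
    """Count the tokens that pass the same filters as the tokenizing cell."""
    counts = Counter()
    for token in re.split(r'\W+', text.lower()):
        # keep tokens longer than 3 characters that are neither numbers nor stop words
        if len(token) > 3 and not token.isnumeric() and token not in stop_words:
            counts[token] += 1
    return counts

# Made-up sample sentence, not taken from the corpus.
sample = "The committee voted in 2004; this committee, with 25 members, debated trade."
print(count_tokens(sample))
```

Only `committee` (twice), `voted`, `members`, `debated`, and `trade` survive: short words, the numbers, and the stop words are filtered out.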

#### Lemmatizing

In [5]:
wordnet_lemmatizer = WordNetLemmatizer()   # lemmatizer instance, reused for every word
is_noun = lambda pos: pos[:2] == 'NN'      # Penn Treebank noun tags: NN, NNS, NNP, NNPS

nouns = dict()
for filename, tokens in text_tokens.items():
    nouns[filename] = dict()

    # tag the document's unique tokens and keep only the nouns
    for (word, pos) in pos_tag(list(tokens.keys())):
        if is_noun(pos):
            # tokens that share a lemma (e.g. 'games' and 'game') have their counts merged
            lemma = wordnet_lemmatizer.lemmatize(word)
            nouns[filename][lemma] = nouns[filename].get(lemma, 0) + tokens[word]

#### Building the matrix A

First, we build A using only the raw term-frequency weights: one row per document, one column per noun lemma.

In [6]:
dictvectorizer = DictVectorizer(sparse=False)
a = dictvectorizer.fit_transform(list(nouns.values()))

Building the list of all tokens (all columns of A, in order).

In [7]:
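# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
token_list = dictvectorizer.get_feature_names_out().tolist()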



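Now we update the weights to TF-IDF. Each non-zero count tf becomes (1 + log10(tf)) * log10(N / df), where N is the number of documents and df is the number of documents that contain the term.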

In [8]:
for column_idx in range(len(token_list)):
    idf = math.log(len(a[:, column_idx])/len([x for x in a[:, column_idx] if x != 0]), 10)

    for element_idx in range(len(files)):
        if a[element_idx,column_idx] != 0:
            a[element_idx,column_idx] = (math.log(a[element_idx,column_idx], 10) + 1) * idf
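
The same weighting can also be written without explicit Python loops. The following is a sketch under the assumption that the input is a dense, non-negative count matrix in which every column has at least one non-zero entry (true for a `DictVectorizer` output); the function name `tfidf_weights` and the toy matrix are hypothetical.

```python
import numpy as np

def tfidf_weights(counts):
    """Return (1 + log10(tf)) * log10(N / df) for non-zero counts, 0 elsewhere."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # df: number of documents in which each term occurs
    # (assumed >= 1 for every column, otherwise this divides by zero)
    df = np.count_nonzero(counts, axis=0)
    idf = np.log10(n_docs / df)

    tf = np.zeros_like(counts)
    mask = counts > 0
    tf[mask] = 1.0 + np.log10(counts[mask])   # log-scaled term frequency
    return tf * idf                            # idf is broadcast across the rows

# Hypothetical count matrix: 3 documents (rows) by 3 terms (columns).
toy = np.array([[1, 0, 10],
                [1, 2, 0],
                [1, 0, 0]])
weights = tfidf_weights(toy)
print(weights)
```

A term that occurs in every document (the first column here) gets an IDF of log10(1) = 0, so its weight vanishes everywhere. All weights stay non-negative, which NMF requires.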

#### TODO : USE W2V TO FIND BEST K

#### NMF

In [32]:
k = 2

In [9]:
nmf_model = NMF(k) 
w = nmf_model.fit_transform(a)

#### TODO : VISUALISING RESULTS

In [33]:
t = 10

For each topic, take the indices of the t highest weights in its row of `nmf_model.components_` and look up the tokens at those indices in the token list. These tokens are the descriptors of the topic.

In [31]:
for i, topic in enumerate(nmf_model.components_):
    print("Topic", i, ":",[token_list[x[1]] for x in sorted(zip(topic,range(len(topic))), reverse = True)[:t]])

Topic 0 : ['government', 'people', 'service', 'minister', 'system', 'technology', 'firm', 'company', 'country', 'plan']
Topic 1 : ['game', 'player', 'goal', 'chelsea', 'club', 'team', 'football', 'manager', 'minute', 'chance']
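
As an alternative to sorting every weight, the top-t lookup can be sketched with `heapq.nlargest`, which selects only the t best indices. The helper `top_descriptors`, the toy topic row, and the toy vocabulary below are hypothetical and not taken from the corpus.

```python
import heapq

def top_descriptors(topic_weights, vocabulary, t):
    """Return the t tokens with the largest weights, highest first."""
    # pick the t indices whose weights are largest, without sorting everything
    best = heapq.nlargest(t, range(len(topic_weights)), key=topic_weights.__getitem__)
    return [vocabulary[i] for i in best]

# Hypothetical toy topic row and vocabulary.
toy_vocabulary = ["club", "goal", "minister", "player", "tax"]
toy_topic = [0.7, 0.9, 0.0, 0.8, 0.1]
print(top_descriptors(toy_topic, toy_vocabulary, 3))  # ['goal', 'player', 'club']
```

With the notebook's variables, the call would be `top_descriptors(topic, token_list, t)` inside the loop over `nmf_model.components_`. On tied weights this version keeps the lower index first, whereas the reverse-sorted `zip` keeps the higher one.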


In [36]:
w

array([[0.36705923, 0.01384133],
       [0.26643491, 0.70987538],
       [0.2245102 , 0.        ],
       ...,
       [0.22599703, 0.01654774],
       [0.09393832, 0.13258903],
       [0.03249447, 0.1660655 ]])

In [56]:
for i in range(k):
    # os.path.basename strips the folder part on any operating system
    print("Topic", i, ":",[os.path.basename(files[x[1]]) for x in sorted(zip(w[:,i],range(len(w[:,i]))), reverse = True)[:t]])

Topic 0 : ['politics_290.txt', 'tech_399.txt', 'business245.txt', 'tech_164.txt', 'business277.txt', 'tech_032.txt', 'tech_335.txt', 'business287.txt', 'tech_199.txt', 'business146.txt']
Topic 1 : ['entertainment_253.txt', 'football_207.txt', 'entertainment_256.txt', 'football_010.txt', 'football_152.txt', 'football_045.txt', 'football_253.txt', 'football_246.txt', 'football_007.txt', 'football_174.txt']
