# Classifying web pages using non-negative matrix factorization

## 15 URLs that need to be automatically classified

The file `brexit_trump_urls.txt` contains a set of 15 links to news articles. Some of them are about Donald Trump and some of them are about Brexit. A human could easily work out which are which by reading them, but here we demonstrate how to do this automatically using non-negative matrix factorization (NMF) via the NAG Library for Python's **real_nmf** function.

In [1]:
# Display the URLs in the file brexit_trump_urls.txt
with open('brexit_trump_urls.txt') as f:
    read_data = f.read()
print(read_data)

https://www.bbc.co.uk/news/uk-politics-47031312
https://www.bbc.co.uk/news/world-us-canada-47477727
https://www.bbc.co.uk/news/uk-scotland-north-east-orkney-shetland-47642076?intlink_from_url=&link_location=live-reporting-story
https://www.bbc.co.uk/news/uk-wales-politics-47651013?intlink_from_url=https://www.bbc.co.uk/news/topics/cwlw3xz0lvvt/brexit&link_location=live-reporting-story
https://www.bbc.co.uk/news/av/world-us-canada-47646183/president-trump-shows-map-of-is-defeat?intlink_from_url=&link_location=live-reporting-map
https://www.bbc.co.uk/news/uk-politics-46393399
https://www.bbc.co.uk/news/business-47644268?intlink_from_url=&link_location=live-reporting-story
https://www.bbc.co.uk/news/world-us-canada-47633940
https://www.bbc.co.uk/news/uk-politics-47627744
https://www.bbc.co.uk/news/uk-politics-parliaments-47653160
https://www.bbc.co.uk/news/world-us-canada-47642335
https://www.bbc.co.uk/news/uk-politics-47660019
https://www.bbc.co.uk/news/world-middle-east-47657843
https:/

# Parsing the websites 

The first step is to download all of the articles and parse each one into a collection of word counts.

In [2]:
from naginterfaces.library.matop import real_nmf
from collections import Counter
import string
import urllib.request
import re
from scipy.linalg import norm
import numpy as np

print("Reading urls")
with open('brexit_trump_urls.txt', 'r') as f:
    links = f.readlines()

n_links = len(links)

dicts = []
titles = []
words = set()
trans = str.maketrans(string.punctuation, ' '*len(string.punctuation))

print("Parsing webpages")
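for link in links:
    # Fetch each page once (dropping the trailing newline left by readlines)
    # and reuse the HTML for both the paragraphs and the title
    with urllib.request.urlopen(link.strip()) as response:
        page = response.read().decode()
    pagewords = []
    paras = re.findall(r'<p>(.*?)</p>', page.lower())
    title = re.findall(r'<title data-rh="true">(.*?)</title>', page)
    print(title)
    titles.append(title)
    # Replace punctuation with spaces and split each paragraph into words
    for para in paras:
        pagewords += para.translate(trans).split()

    # Count the words on this page and add them to the global vocabulary
    c = Counter(pagewords)
    dicts.append(c)
    words = words | set(c)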

Reading urls
Parsing webpages
['Brexit delay: How is Article 50 extended? - BBC News']
['Kirstjen Nielsen: Walking a tightrope working for Trump - BBC News']
['Trump homes plan at Menie being recommended for approval - BBC News']
['Theresa May at her worst during Brexit speech - Mark Drakeford - BBC News']
['President Trump shows map of &#x27;IS defeat&#x27; - BBC News']
['Brexit: What happens now? - BBC News']
['A tale of two Trumps: Jair Bolsonaro goes to Washington - BBC News']
['Brexit: Theresa May to formally ask for delay - BBC News']
['Corbyn calls for compromise to avoid no-deal Brexit - BBC News']
['Trump: I didn&#x27;t get a thank you for McCain funeral - BBC News']
['Brexit: EU leaders agree Article 50 delay plan - BBC News']
['Trump: Time to recognise Golan Heights as Israeli territory - BBC News']
['Brexit: MPs urged not to travel home alone as tensions rise - BBC News']
['&#x27;Cancel Brexit&#x27; petition passes 2m signatures on Parliament site - BBC News']
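The titles printed above still contain HTML character references such as `&#x27;`. A minimal sketch of how they could be decoded with the standard library's `html.unescape` is shown here; the variable names `raw_title` and `clean_title` are illustrative and are not part of the original notebook.

```python
import html

# One of the titles exactly as it was extracted from the page source
raw_title = "President Trump shows map of &#x27;IS defeat&#x27; - BBC News"

# html.unescape converts character references such as &#x27; back to text
clean_title = html.unescape(raw_title)
print(clean_title)
```

Applying the same call to each entry of `titles` before printing would tidy the classification report at the end of the notebook.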


The variable `dicts` contains 15 dictionaries, one per URL, each mapping the words on that page to their frequencies.  
Here are the 20 most common words for the first URL:

In [3]:
dicts[0].most_common(20)

[('the', 37),
 ('a', 37),
 ('to', 21),
 ('uk', 17),
 ('of', 15),
 ('that', 10),
 ('in', 10),
 ('class', 9),
 ('ssrcss', 9),
 ('eu', 9),
 ('it', 9),
 ('on', 9),
 ('bbc', 9),
 ('href', 8),
 ('would', 8),
 ('an', 7),
 ('extension', 7),
 ('and', 7),
 ('www', 7),
 ('co', 7)]
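Entries such as 'class', 'ssrcss', 'href' and 'www' appear in the list above because the paragraph regex captures inline markup along with the text. One possible way to reduce this, sketched below under the assumption that a simple tag-stripping regex is good enough for these pages, is to remove anything between angle brackets before tokenizing. The helper `paragraph_words` and the sample string are hypothetical.

```python
import re
import string
from collections import Counter

# Matches any HTML tag, e.g. <a class="..." href="...">
TAG_RE = re.compile(r'<[^>]+>')
trans = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def paragraph_words(para):
    """Strip inline tags, replace punctuation with spaces, split into words."""
    text = TAG_RE.sub(' ', para)
    return text.translate(trans).split()

# Hypothetical paragraph containing an inline link
sample = 'the <a class="ssrcss-abc" href="https://www.bbc.co.uk/news">eu</a> agreed an extension'
counts = Counter(paragraph_words(sample))
print(counts)
```

A full HTML parser would be more robust than a regex, but this keeps the sketch close to the approach used in the notebook.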

The variable `words` contains the set of all distinct words seen across all of the URLs. As a Python set, it has no particular order.

In [4]:
len(words)

2268

# Cleaning up the word list

This list of 2268 words contains many common words such as 'the', 'that' and 'with' which we want to ignore.  
These unwanted words are commonly referred to as [stopwords](https://en.wikipedia.org/wiki/Stop_words). 
The explicit list of stopwords used in this analysis is defined in the next cell.

In [5]:
stopwords = ['then', 'that', 'have', 'with', 'from', 'they', 'here', 'there', 'their', 'would', 'what', 'which', 'about', 'know',
        'just', 'time', 'like', 'make', 'your', 'year', 'some', 'good', 'into', 'people', 'them', 'other', 'than', 'look', 
        'only', 'over', 'think', 'also', 'back', 'after', 'work', 'first', 'well', 'even', 'want', 'because', 'these', 
        'most', 'leave', 'seem', 'come', 'little', 'last', 'long', 'great', 'high', 'small', 'large', 'next', 'early',
        'same', 'this', 'away', 'down', 'look', 'make', 'three', 'came', 'four', 'please', 'pretty', 'soon', 'under', 
        'went', 'white', 'black', 'give', 'giving', 'given', 'gave', 'knowing', 'knew', 'once', 'round', 'stop', 'take', 
        'taken', 'took', 'thank', 'thanks', 'walk', 'walked', 'always', 'around', 'been', 'before', 'best', 'worst', 'find',
        'found', 'goes', 'pull', 'read', 'right', 'wrong', 'tell', 'telling', 'told', 'upon', 'wish', 'write', 'written', 
        'better', 'carry', 'carried', 'full', 'hold', 'keep', 'kept', 'longer', 'longest', 'shall', 'will', 'begin',
        'beginning', 'together', 'today', 'yesterday', 'children', 'ground', 'hold', 'holding', 'morning', 'afternoon',
        'never', 'myself', 'table', 'water', 'wind', 'window', 'ring', 'rung', 'except', 'where', 'while', 'woman', 
        'whilst', 'were', 'until', 'thing', 'things', 'theirs', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 
        'saturday', 'sunday', 'should', 'shown', 'shut', 'another', 'being', 'does', 'doing', 'make', 'makes', 'more', 
        'name', 'names', 'named', 'through', 'years', 'used', 'said', 'saying', 'says', 'seen', 'sees', 'seeing', 
        'using', 'uses', 'known', 'left', 'send', 'sent', 'choose', 'choice', 'choosing', 'going', 'gets', 'below', 
        'less', 'least', 'might', 'href', 'link', 'https', 'a939', 'abbe', 'ac30f6e24b3f', 'ababa']

Let's remove them from our word list

In [6]:
words = [word for word in words if word not in stopwords]

We also remove any words with fewer than 4 characters

In [7]:
words = [word for word in words if len(word) >= 4]

In [8]:
len(words)

1873

We now remove any words starting with a digit or a pound sign

In [9]:
# str.startswith accepts a tuple of prefixes, so one pass covers '£' and every digit
words = [word for word in words if not word.startswith(tuple('£0123456789'))]

In [10]:
print("{0} distinct words will be used to form the data matrix".format(len(words)))

1825 distinct words will be used to form the data matrix
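The three filtering steps above (stopwords, short words, and words starting with a digit or pound sign) could also be combined into a single pass. The following is a sketch only; the function name `clean_words`, its `min_len` parameter, and the sample vocabulary are illustrative assumptions rather than part of the original notebook. It also sorts the result, which the original cells do not do.

```python
def clean_words(words, stopwords, min_len=4):
    """Return a sorted list of words that pass all three filters."""
    stop = set(stopwords)               # set membership tests are fast
    bad_starts = tuple('£0123456789')   # prefixes to reject
    return sorted(
        w for w in words
        if w not in stop
        and len(w) >= min_len
        and not w.startswith(bad_starts)
    )

# Hypothetical vocabulary for illustration
sample = {'the', 'that', 'brexit', '2019', '£39bn', 'trump', 'deal'}
cleaned = clean_words(sample, stopwords=['that'])
print(cleaned)
```

Sorting gives the rows of the data matrix a reproducible order, which is convenient when comparing runs.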


# Computing the data matrix

We now want to form a data matrix from these words, where each row $i$ refers to a word and each column $j$ refers to a URL. The $(i,j)$th entry of this matrix contains the frequency of occurrence of word $i$ in URL $j$.

In [11]:
# Now create a dict with the words as keys and, as values, lists of the word counts in each webpage
masterdict = {}

print("Creating data matrix")
for word in words:
    # Create list containing the word counts
    freqs = [dic.get(word, 0) for dic in dicts]
    masterdict[word] = freqs

a = np.asfortranarray(np.array(list(masterdict.values())), dtype=np.float64)

print("Final data matrix has size:{0}".format(a.shape))

Creating data matrix
Final data matrix has size:(1825, 15)
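To make the layout of the data matrix concrete, here is a small self-contained sketch that builds the same kind of word-by-document matrix from three made-up "documents". The toy documents are assumptions for illustration only; the construction mirrors the cell above but skips the intermediate dictionary.

```python
from collections import Counter
import numpy as np

# Three hypothetical documents standing in for the downloaded pages
docs = ["brexit deal delay brexit", "trump border security", "brexit parliament deal"]
dicts = [Counter(d.split()) for d in docs]

# Vocabulary: sorted union of all words seen in any document
words = sorted(set().union(*dicts))

# Rows are words, columns are documents; Fortran order as in the notebook
a = np.array([[dic.get(word, 0) for dic in dicts] for word in words],
             dtype=np.float64, order='F')

print(words)
print(a)
```

Each row can be read as "how often does this word occur in each document", and each column as the word-count profile of one document.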


In [12]:
a

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

A data matrix is a set of observations of a number of variables. In our matrix of word counts above, the observations were individual web pages and the variables were the frequencies of the different words on those pages, but there are numerous other examples. For instance, the observations might be pixels in an image, with spectrometry data as the variables. Or observations might correspond to different people, with various test results as the variables. In this analysis we are going to assume that each column of our data matrix corresponds to an observation and each row corresponds to a variable.

Data matrices often have the following properties.

* Their entries are non-negative. This isn't always the case but it is true for many important applications.
* They can be very large and may be sparse. We recently encountered a data matrix from seismic tomography which had size 87000×67000, but with only 0.23% of nonzero entries.
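The sparsity mentioned above is simply the proportion of entries that are nonzero. A minimal sketch of how it could be measured with NumPy follows; the helper name `nonzero_fraction` and the toy matrix are illustrative.

```python
import numpy as np

def nonzero_fraction(matrix):
    """Fraction of entries in the matrix that are nonzero."""
    return np.count_nonzero(matrix) / matrix.size

# Toy matrix with 2 nonzero entries out of 6
toy = np.array([[0.0, 0.0, 1.0],
                [2.0, 0.0, 0.0]])
frac = nonzero_fraction(toy)
print(f"{100 * frac:.2f}% of entries are nonzero")
```

Calling `nonzero_fraction(a)` on the word-count matrix would show how sparse it is.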

Large matrices are cumbersome to deal with. So it is natural to ask whether we can encapsulate the data using a smaller matrix, especially if our data matrix contains many zeros. Various techniques exist to do this, for example principal components analysis, or linear discriminant analysis. The drawback of these methods is that they do not preserve the non-negativity of the original data matrix, making the results potentially difficult to interpret. This is where non-negative matrix factorization comes in.

Non-negative matrix factorization (NMF) takes a non-negative data matrix S and attempts to factor it into the product of a tall, skinny matrix **W (known as the features matrix)** and a **short, wide matrix H (the coefficients matrix)**. 

Both W and H are non-negative. This is shown in the graphic below. **Note the presence of ≈ rather than =** since an exact NMF may not exist. In practice NMF algorithms iterate towards acceptable solutions, rather than obtaining the optimal solution. 

![Non negative Matrix Factorisation](nmf.png)
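To illustrate the idea of iterating towards a factorization S ≈ WH, the sketch below uses the classical multiplicative update rules of Lee and Seung in plain NumPy. This is a stand-in chosen for brevity; it is not claimed to be the algorithm that **real_nmf** uses, and the function name `nmf_multiplicative`, its parameters, and the toy matrix are all assumptions.

```python
import numpy as np

def nmf_multiplicative(s, k, n_iter=200, seed=0, eps=1e-12):
    """Approximate s (m x n, non-negative) by w @ h with w (m x k), h (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = s.shape
    # Random positive starting factors
    w = rng.random((m, k)) + eps
    h = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        h *= (w.T @ s) / (w.T @ w @ h + eps)
        w *= (s @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Toy non-negative matrix with two obvious "topics"
s = np.array([[2.0, 1.0, 0.0],
              [4.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 6.0]])
w, h = nmf_multiplicative(s, k=2)
rel_err = np.linalg.norm(s - w @ h) / np.linalg.norm(s)
print(w.shape, h.shape, rel_err)
```

Because each update only multiplies by non-negative ratios, W and H never become negative, which is the property that makes the factors interpretable.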

We now output the data matrix to the file `wordcounts.txt` for future reference

In [13]:
print("Printing data matrix to file")
# The with-statement closes the file even if an error occurs
with open('wordcounts.txt', 'w') as f2:
    # Header row: one column label per link
    st = ' '*14
    for i in range(n_links):
        tmp = 'link ' + str(i+1)
        st += tmp.ljust(8, ' ')
    print(st, file=f2)

    # One row per word, followed by its count on each page
    for i, v in enumerate(words):
        st = v.ljust(14,' ')
        for j in range(n_links):
            st += str(int(a[i,j])).center(8, ' ')
        print(st, file=f2)

Printing data matrix to file


# Computing the non-negative matrix factorization

We now use non-negative matrix factorization to compute the matrices $w$ and $h$ such that 

$$a \approx w@h$$

where $@$ is the standard Python operator for matrix-matrix multiplication.

As of Mark 27, the NAG Library has two routines for computing a non-negative matrix factorization: **real_nmf** and **real_nmf_rcomm**. Here we use **real_nmf** with k=2.

In [14]:
m = a.shape[0]
n = a.shape[1]
k = 2
errtol = 1e-6
maxit = 500
seed = 5842

w, h = real_nmf(a, k=k, seed=seed, errtol=errtol, maxit=maxit)

** The function has failed to converge after 500 iterations.
** The factorization given by w and h may still be a good enough approximation
** to be useful. Alternatively an improved factorization may be obtained by
** increasing maxit or using different initial choices of w and h.
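The warning suggests trying different initial choices. One way this could be organised, sketched here under the assumption that the factorizer can be wrapped as a function of the matrix and a seed, is to run several seeds and keep the factorization with the smallest relative residual. The names `best_factorization` and `toy_factorize` are hypothetical, and `toy_factorize` is only a random stand-in so that the sketch runs without the NAG Library.

```python
import numpy as np

def best_factorization(a, factorize, seeds):
    """Run factorize(a, seed) for each seed; return (residual, seed, w, h) of the best."""
    best = None
    for seed in seeds:
        w, h = factorize(a, seed)
        res = np.linalg.norm(a - w @ h) / np.linalg.norm(a)
        if best is None or res < best[0]:
            best = (res, seed, w, h)
    return best

def toy_factorize(a, seed, k=2):
    """Stand-in factorizer: random non-negative factors of the right shape."""
    rng = np.random.default_rng(seed)
    return rng.random((a.shape[0], k)), rng.random((k, a.shape[1]))

a_toy = np.arange(1.0, 13.0).reshape(4, 3)
res, seed, w, h = best_factorization(a_toy, toy_factorize, seeds=[1, 2, 3])
print(seed, res)
```

In the notebook, the stand-in could be replaced by a wrapper such as `lambda a, seed: real_nmf(a, k=k, seed=seed, errtol=errtol, maxit=maxit)`, reusing the call shown above.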
  


In [15]:
w.shape

(1825, 2)

In [16]:
h.shape

(2, 15)

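The factorization is not exact but is hopefully close enough to be useful. To quantify this, we compute the residual $a - w@h$ in the Frobenius norm, relative to the norm of $a$.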

In [17]:
resnorm = norm(a-w@h)/norm(a)
print("norm of residual:")
print(resnorm)

norm of residual:
0.4996361512168467
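For intuition about this number: the relative residual is 0 for an exact factorization and 1 when the product contributes nothing. The sketch below checks both extremes on a toy rank-one matrix; the function name `relative_residual` and the toy data are illustrative assumptions.

```python
import numpy as np

def relative_residual(a, w, h):
    """Frobenius norm of a - w @ h, relative to the Frobenius norm of a."""
    return np.linalg.norm(a - w @ h) / np.linalg.norm(a)

# A rank-one matrix that factorizes exactly
a_toy = np.array([[1.0, 2.0],
                  [2.0, 4.0]])
w_exact = np.array([[1.0], [2.0]])
h_exact = np.array([[1.0, 2.0]])

exact = relative_residual(a_toy, w_exact, h_exact)
useless = relative_residual(a_toy, w_exact, np.zeros((1, 2)))
print(exact, useless)
```

A value of roughly 0.5, as obtained above, therefore sits between these two extremes, which is not surprising when 1825 word counts per page are summarised by only two features.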


The strength of NMF is that the preservation of non-negativity makes it easier to interpret the factors W and H. In general W tells us how the different variables can be grouped together into k features that in some way represent the data. The matrix H tells us how our original observations are built from these features, with the non-negativity ensuring this is done in a purely additive manner.

The best way of understanding this is to go back to our original example. Recall that we had a 1825×15 matrix of word frequencies for our 15 web pages. We used the NAG Library routine **real_nmf** to factorize it, choosing k=2. This resulted in a 1825×2 features matrix W and a 2×15 coefficients matrix H. Let's discuss them in turn.

Each column of W corresponds to a particular weighted grouping of the 1825 distinct words from the articles. The larger the entries in the column, the more important the corresponding word is deemed to be. Rather than displaying W in its entirety, we can look at the 10 largest entries in each column to see what the most important words were. The results are shown below.

In [18]:
for i in range(k):
    tmp = sorted(words, key=lambda x, ind=i: w[words.index(x),ind], reverse=True)
    st = "\nThe most important words in column " + str(i+1) + " of w are:"
    print(st)
    print(tmp[:10])


The most important words in column 1 of w are:
['quot', 'deal', 'brexit', 'ssrcss', 'class', 'delay', 'parliament', 'minister', 'prime', 'e1no5rhv0']

The most important words in column 2 of w are:
['quot', 'trump', 'president', 'women', 'mccain', 'class', 'ssrcss', 'nielsen', 'border', 'security']
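The ranking above calls `words.index` inside the sort key, which rescans the word list for every word. An alternative sketch using `np.argsort` is shown below on toy data; the helper name `top_words` and the small matrix are assumptions. Note that the order of exactly tied entries may differ from the `sorted` version.

```python
import numpy as np

def top_words(w, words, column, n=10):
    """Return the n words with the largest entries in the given column of w."""
    order = np.argsort(w[:, column])[::-1][:n]   # indices, largest first
    return [words[i] for i in order]

# Toy features matrix: 4 words, 2 features
toy_words = ['deal', 'trump', 'brexit', 'border']
toy_w = np.array([[3.0, 0.1],
                  [0.0, 5.0],
                  [4.0, 0.2],
                  [0.5, 2.0]])

first = top_words(toy_w, toy_words, column=0, n=2)
second = top_words(toy_w, toy_words, column=1, n=2)
print(first, second)
```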


Looking at these lists, you'll hopefully agree that the first column corresponds to Brexit and the second column to Trump. It seems that our non-negative matrix factorization has successfully detected the two categories of web page. Let's denote them by the numbers 0 (Brexit) and 1 (Trump). Can we now use the NMF to accurately categorise the individual pages? To do this we need to look at the coefficients matrix H.

We convert h to a pandas dataframe for display purposes

In [19]:
# Convert h to pandas dataframe for display purposes
import pandas as pd
display_h = pd.DataFrame(
    h,
    columns=[
        "link1","link2","link3","link4","link5","link6","link7","link8","link9",
        "link10","link11","link12","link13","link14","link15",
    ]
)
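# Use the fully qualified option name; the bare 'precision' is ambiguous in recent pandas versions
pd.set_option('display.precision', 3)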
display_h

|   | link1 | link2 | link3 | link4 | link5 | link6 | link7 | link8 | link9 | link10 | link11 | link12 | link13 | link14 | link15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.26 | 1.195e-15 | 5.253 | 50.795 | 1.353 | 6.003 | 5.638 | 10.739 | 76.032 | 19.13 | 1.195e-15 | 34.929 | 1.195e-15 | 4.433 | 55.44 |
| 1 | 9.842e-16 | 53.73 | 5.094 | 3.985 | 3.68 | 9.842e-16 | 7.298 | 24.121 | 0.117 | 9.842e-16 | 43.88 | 0.353 | 25.89 | 1.619 | 9.842e-16 |


This coefficients matrix is of size 2×15. The entries in each column show how strongly that particular web page is associated with each of our two categories. We assign each page to a category by simply selecting the largest entry in its column. The results are below. The number next to each link shows how it was categorised by the NMF. We will let you judge for yourself whether the categorisations are correct!

In [20]:
for i, link in enumerate(links):
    category = 0 if h[0,i] > h[1,i] else 1
    title = '"' + (titles[i][0]) + '"'
    st = 'Article ' + title.ljust(78,' ') + ' is in category ' + str(category) + [" (Brexit)"," (Trump)"][category]
    print(st)

Article "Brexit delay: How is Article 50 extended? - BBC News"                         is in category 0 (Brexit)
Article "Kirstjen Nielsen: Walking a tightrope working for Trump - BBC News"           is in category 1 (Trump)
Article "Trump homes plan at Menie being recommended for approval - BBC News"          is in category 0 (Brexit)
Article "Theresa May at her worst during Brexit speech - Mark Drakeford - BBC News"    is in category 0 (Brexit)
Article "President Trump shows map of &#x27;IS defeat&#x27; - BBC News"                is in category 1 (Trump)
Article "Brexit: What happens now? - BBC News"                                         is in category 0 (Brexit)
Article "A tale of two Trumps: Jair Bolsonaro goes to Washington - BBC News"           is in category 1 (Trump)
Article "Brexit: Theresa May to formally ask for delay - BBC News"                     is in category 0 (Brexit)
Article "Corbyn calls for compromise to avoid no-deal Brexit - BBC News"               is in catego
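Selecting the largest entry in each column can also be written with `argmax`, and normalising each column gives a rough indication of how clear-cut each assignment is. The sketch below uses the first four columns of `h` copied from the table above; the names `h_sample`, `shares` and `confidence` are illustrative, and the "share" is only a heuristic, not a probability. On exact ties `argmax` returns 0, whereas the comparison `h[0,i] > h[1,i]` used above returns 1.

```python
import numpy as np

# First four columns of h, copied from the table above
h_sample = np.array([
    [12.26, 1.195e-15, 5.253, 50.795],
    [9.842e-16, 53.73, 5.094, 3.985],
])

labels = h_sample.argmax(axis=0)            # row index of the largest entry per column
shares = h_sample / h_sample.sum(axis=0)    # each column rescaled to sum to 1
confidence = shares.max(axis=0)             # share held by the winning category

for i, (label, conf) in enumerate(zip(labels, confidence), start=1):
    print(f"link{i}: category {label}, share {conf:.2f}")
```

The third link is assigned to category 0 with a share only just above one half, which flags it as a borderline case worth checking by hand.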

# Further Reading

* [Presentation on Non-Negative Matrix Factorization](https://www.nag.com/market/non-negative-matrix-factorization.pdf) by the author of the NAG routines **real_nmf** and **real_nmf_rcomm**
* [NAG Blog post on which this notebook is based](https://www.nag.com/content/classifying-web-pages-using-non-negative-matrix-factorization)