# Parse a website and say what it is about

## Introduction

A web page that is text-centric will give an idea about its content through frequent usage of certain words and phrases. 
By analysing the content of a page for word and phrase frequencies, we can deduce the purpose of the page.

### Method

The web page content is retrieved in a way similar to 'curl' or 'wget', and only the human-readable content is analysed. The content is tokenized and cleaned of common words such as 'a', 'and' and 'the'. The remaining words are trimmed of endings (e.g. plurals) to form stems. The frequency of each stem is calculated, and the stems are stored in an array in descending order of frequency. The first 2 to 5 words will usually say what the page is about.
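
The steps above can be sketched end to end on a short string. This is a minimal sketch rather than the notebook's own code: the small `STOPWORDS` set and the naive `stem` helper are illustrative stand-ins (a real run would use a full stopword list and a proper stemmer), and `top_terms` is a hypothetical name.

```python
import re
from collections import Counter

# Illustrative stand-in for a full stopword list (assumption)
STOPWORDS = {"a", "an", "and", "the", "of", "is", "are", "to", "in", "on", "from"}

def stem(word):
    # Naive stand-in for a real stemmer: strip a trailing plural 's'
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def top_terms(text, n=3):
    # Tokenize into lowercase alphabetic words
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords, then reduce the remaining words to stems
    stems = [stem(t) for t in tokens if t not in STOPWORDS]
    # Count the stems and return the n most frequent, highest first
    return Counter(stems).most_common(n)

sample = ("Machine learning algorithms learn from data. "
          "The algorithm is trained on data, and the data is labelled.")
print(top_terms(sample, 2))
```

On this sample the two leading stems are 'data' and 'algorithm', which is a fair summary of the two sentences.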

## Import the Python module

This example requires the NLTK module and its 'stopwords' package.

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\avs29\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Get the web page content

Given a URL, the raw page content is retrieved. The response includes HTML tags such as 'title' and the 'description' meta tag, but these cannot be fully trusted. The page content itself must say what the page is about; the title and description can then be used to corroborate it.

In [2]:
import urllib.request

# Page to analyse; swap in one of the alternatives to try another site
url = "https://en.wikipedia.org/wiki/Machine_learning"
#url = "https://en.wikipedia.org/wiki/SpaceX"
#url = "https://www.webgenie.com"

# Fetch the raw page; 'html' holds the undecoded bytes of the response
response = urllib.request.urlopen(url)
html = response.read()
#print(html)

## Get the page title and keep it

The page title generally represents the page content, but this cannot be assumed, as some pages may be trying to cheat the search engines. Comparing the title with the page content gives us more confidence.

In [3]:
from bs4 import BeautifulSoup

# Parse the page, then extract its text, title and meta tags
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text(strip=True)
title = soup.title.string
metas = soup.find_all('meta')
#print(metas)
#print([meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description'])
print(title)

Machine learning - Wikipedia
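
One way to compare the title with the page content is to measure how many of the title's words also appear among the page's most frequent words. The sketch below is illustrative only: `title_overlap` is a hypothetical helper, and the word list passed to it is a hand-written example rather than output computed from the page.

```python
import re

def title_overlap(title, top_words):
    # Lowercase alphabetic words of the title, as a set
    title_words = set(re.findall(r"[a-z]+", title.lower()))
    top = {w.lower() for w in top_words}
    if not title_words:
        return 0.0
    # Fraction of the title's words that also occur among the top words
    return len(title_words & top) / len(title_words)

score = title_overlap("Machine learning - Wikipedia",
                      ["learning", "machine", "data", "training"])
print(round(score, 2))
```

A score close to 1 suggests the title agrees with the content; a score near 0 would be a reason to distrust the title.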


## Split the text into tokens

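The text is split at whitespace. The resulting tokens are not always accurate: some are two words fused together (e.g. 'Insupervised', 'thetraining', 'datafor'). This most likely happens because 'get_text(strip=True)' joins the text of neighbouring tags with no separator, so no whitespace is left to split on; passing a separator such as ' ' to 'get_text' should avoid it. The frequency of such tokens is low, so they can be ignored here.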

In [4]:
tokens = text.split()
#print(tokens)
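
One likely source of fused tokens is that the text pieces from neighbouring tags are joined with no space between them before the split. The effect can be reproduced with the standard library alone. This is a minimal sketch that uses `html.parser` as a stand-in for the BeautifulSoup call in the notebook; `TextCollector` is a hypothetical class and the HTML snippet is invented.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    # Collects the stripped text of every text node in the document
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        piece = data.strip()
        if piece:
            self.chunks.append(piece)

collector = TextCollector()
collector.feed("<p>In <a href='#'>supervised</a> learning</p>")
collector.close()

# Joining with no separator fuses neighbouring pieces into one token
print("".join(collector.chunks).split())
# Joining with a space keeps them apart
print(" ".join(collector.chunks).split())
```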

## Remove common words

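stopwords.words('english') returns a list of common English words such as 'i', 'me', 'my', 'myself' and 'we'. Tokens matching these are removed.

'nltk.FreqDist(clean_tokens)' counts the occurrences of each remaining word. It has been observed that these frequencies are not fully accurate. One visible cause is that the matching is case-sensitive and punctuation stays attached to tokens, so 'Machine' and 'machine' are counted separately and a capitalised 'The' is not removed. However, the disparities are not big enough to be a concern.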

In [5]:
from nltk.corpus import stopwords

# Load the stopword list once; a set makes each membership test fast
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stopwords (the comparison is case-sensitive)
clean_tokens = [token for token in tokens if token not in stop_words]
freq = nltk.FreqDist(clean_tokens)
#freq.plot(20, cumulative=False)
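
If more accurate counts are wanted, tokens can be lowercased and stripped of surrounding punctuation before filtering. The following is a minimal sketch and not the notebook's method: `STOP_WORDS` is a small stand-in for the NLTK list, `normalise` is a hypothetical helper, and the sample sentence is invented.

```python
from collections import Counter

# Illustrative stand-in for stopwords.words('english') (assumption)
STOP_WORDS = {"the", "a", "and", "is", "of"}

def normalise(token):
    # Lowercase and strip surrounding punctuation
    return token.lower().strip(".,;:!?()[]\"'")

sample_tokens = "The machine learns. Machine learning is the study of data".split()
words = [normalise(t) for t in sample_tokens]

# Drop empty strings and stopwords, then count what is left
clean = [w for w in words if w and w not in STOP_WORDS]
counts = Counter(clean)
print(counts.most_common(2))
```

With this normalisation 'Machine' and 'machine' fall into one count, and 'The' is removed along with 'the'.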

## Make a list of words and their frequencies

By placing the frequency before the word, the list can be sorted numerically. The top few entries can then be chosen for analysis.

In [6]:
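wordlist = []
for key, val in freq.items():
    # Store each entry as 'count:word'
    item = str(val) + ':' + str(key)
    wordlist.append(item)
# Sort by the numeric count in front of the colon, highest first
wordlist.sort(key=lambda entry: int(entry.split(':')[0]), reverse=True)
wordlist[:10]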

['127:learning',
 '80:machine',
 '46:data',
 '40:training',
 '32:algorithms',
 '30:Machine',
 '26:set',
 '25:used',
 '22:model',
 '20:The']
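
Because `FreqDist` is a subclass of `collections.Counter`, the same ranking is also available from `most_common`, without building 'count:word' strings first. A minimal sketch, using a plain `Counter` with a few hand-entered counts in place of the notebook's `freq`:

```python
from collections import Counter

# Hand-entered counts standing in for the notebook's 'freq' (assumption)
counts = Counter({"learning": 127, "machine": 80, "data": 46, "Machine": 30})

# (word, count) pairs, highest count first
top = counts.most_common(3)
print(top)

# The same 'count:word' strings, if that format is still needed
print([f"{count}:{word}" for word, count in top])
```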

## Make anagrams

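The top 10 words are used to make anagram keys. Each key is built from two consecutive words: their letters are combined, duplicates are removed, and the result is sorted, so strictly it is a sorted set of letters rather than a true anagram. The choice of two words is arbitrary, and 3 or more may be needed. The principle is to find, in a hash array, the keys that contain these words in any order.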

In [31]:
anagrams = []
for i in range(9):
    # Take the current and next entry from 'wordlist' ('count:word')
    word1 = wordlist[i].split(":")
    word2 = wordlist[i+1].split(":")

    # Join the two words, lowercase them, then remove duplicate letters.
    # Lowercasing first ensures 'M' and 'm' count as the same letter.
    letters = set((word1[1] + word2[1]).lower())

    # Sort the distinct letters to form the key
    anagram = ''.join(sorted(letters))

    # Add to the list 'anagrams'
    anagrams.append(anagram)

print(anagrams)

['aceghilmnr', 'acdehimnt', 'adginrt', 'aghilmnorst', 'aceghilmnorst', 'acehimnst', 'destu', 'delmosu', 'dehlmot']
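
Since the number of words per key is arbitrary, it can be made a parameter. The sketch below is one possible generalisation and not the notebook's code: `letter_key` and `window_keys` are hypothetical helpers, and the word list is hand-entered.

```python
def letter_key(words):
    # Sorted distinct lowercase letters of all the words together
    return "".join(sorted(set("".join(words).lower())))

def window_keys(words, size=2):
    # One key per run of 'size' consecutive words
    return [letter_key(words[i:i + size])
            for i in range(len(words) - size + 1)]

top_words = ["learning", "machine", "data", "training"]
print(window_keys(top_words, 2))
print(window_keys(top_words, 3))
```

The key does not depend on word order, so `letter_key(["data", "machine"])` and `letter_key(["machine", "data"])` are the same string.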


## Check a word in anagrams 

This cell checks whether a word exists in any anagram in the list, i.e. whether every letter of the word occurs in the anagram. It is not required in the final case, where the check is that two (or more) words together are in the hash array; for that, the 'anagram' computed in the previous cell is enough.

In [33]:
def checkAnagram(word):
    # Return the first key in 'anagrams' that contains every letter of 'word'
    if len(word) < 4:
        print("Word must be at least 4 chars")
        return None
    for key in anagrams:
        # The word matches when each of its letters occurs in the key
        if all(ch in key for ch in word):
            print(word, key)
            return key
    print("Not Found:", word)
    return None

ok = checkAnagram("machine")
#print(ok)

machine aceghilmnr
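
For the final case, where two words together are looked up in a hash array, a dictionary keyed by the letter key gives a lookup that does not depend on word order. This is a minimal sketch under assumptions: the `topics` table and its labels are invented for illustration, and `pair_key` is a hypothetical helper that builds a key as a sorted set of lowercase letters.

```python
def pair_key(word1, word2):
    # Sorted distinct lowercase letters of the two words together
    return "".join(sorted(set((word1 + word2).lower())))

# Hypothetical lookup table: letter key -> topic label (invented data)
topics = {
    pair_key("machine", "learning"): "machine learning",
    pair_key("data", "training"): "training data",
}

# The lookup works for either word order
print(topics.get(pair_key("learning", "machine")))
# A pair with no entry falls back to a default
print(topics.get(pair_key("model", "set"), "unknown"))
```

Because the key keeps only distinct letters, unrelated word pairs that share the same letter set map to the same key, so a match is a hint rather than proof.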
