# Parse a website and say what it is about

## Introduction

A text-centric web page signals its subject through the words and phrases it uses most often. 
By analysing the page content for word and phrase frequencies, we can deduce the purpose of the page. 

### Method

The web page content is retrieved in a way similar to 'curl' or 'wget', and only the human-readable content is analysed. The content is tokenized and cleaned of common words such as 'a', 'and' and 'the'. The remaining words are trimmed of endings (e.g. plurals) to form stems. The frequency of each stem is calculated, and the stems are stored in a list in descending order of frequency. The first 2 to 5 words will usually say what the page is about.
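
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the stop-word list and the suffix-stripping rule are deliberately tiny stand-ins, and the names `STOP_WORDS`, `crude_stem` and `top_stems` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); the cells below use NLTK's list.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on"}


def crude_stem(word):
    """Strip a few common endings. A stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_stems(text, n=5):
    """Tokenize, drop stop words, stem, and return the n most frequent stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems).most_common(n)


sample = "Rockets launch. The rocket launches are launching rockets to orbit."
print(top_stems(sample, 2))
```

On the sample sentence, 'rocket' and 'launch' come out on top with three occurrences each, because the plural and '-ing' forms collapse onto the same stem.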

By training on a number of pages with the same theme (e.g. Machine Learning), the method should correctly classify a new page on that theme.

#### Author: Arapaut V. Sivaprasad

#### Dates

    Created: 26 Oct, 2019.  
    Last Modified: 30 Oct, 2019.

## Import the Python module

The NLTK module and its data packages, 'stopwords' and 'words', are required in this example.

In [36]:
import nltk
#nltk.download()
nltk.download('stopwords')
nltk.download('words')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\avs29\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\avs29\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## Get the web page content

Given a URL, the raw page content is retrieved. The HTML includes elements such as the 'title' tag and the 'description' meta tag, but these cannot be fully trusted. The page content itself must say what the page is about; the title and description can then be used to corroborate it.
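
A slightly more defensive fetch might look like the sketch below. The helper names `fetch_html` and `decode_body`, the 10-second timeout and the User-Agent string are all assumptions made for illustration; the cell that follows uses a plain `urlopen` call.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 with replacement characters."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Fetch a page and return its decoded HTML (hypothetical helper)."""
    # Some sites reject urllib's default User-Agent, so one is set explicitly.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Example (requires network access):
# html = fetch_html("https://en.wikipedia.org/wiki/SpaceX")
```

The timeout keeps the notebook from hanging on an unresponsive server, and decoding with the charset announced by the server avoids guessing the encoding.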

In [37]:
import urllib.request
#url = "https://en.wikipedia.org/wiki/Machine_learning"
url = "https://en.wikipedia.org/wiki/SpaceX"
#url = "https://www.webgenie.com"
#response =  urllib.request.urlopen('https://en.wikipedia.org/wiki/SpaceX')
response =  urllib.request.urlopen(url)
html = response.read()
#print(html)

## Get the page title and keep it.

Generally the page title represents the page content, but this cannot be assumed, as some pages may try to cheat the search engines. Comparing the title with the page content may give us more confidence.

In [38]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'html.parser')
text = soup.get_text(strip = True)

# Get the document title
title = soup.title.string
metas = soup.find_all('meta')
#print (metas)

# To get the 'description' meta tag. Do not remove.
#print ([ meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description' ])
print(title)

SpaceX - Wikipedia
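
One simple way to compare the title with the page content is to check which of the most frequent content words also occur in the title. The sketch below assumes a hypothetical helper, `title_overlap`, and an illustrative list of top words; it is not part of the notebook.

```python
import re


def title_overlap(title, top_words):
    """Return the top content words that also occur in the page title."""
    title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return [w for w in top_words if w.lower() in title_words]


# Hypothetical top words, for illustration only.
print(title_overlap("SpaceX - Wikipedia", ["launch", "SpaceX", "rocket"]))
```

An empty result would suggest that the title and the content disagree, which is the case the text above warns about.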


## Split the text into tokens

The text is split at whitespace. The result is not always accurate: some tokens are two words run together (e.g. 'Insupervised', 'thetraining', 'datafor'), most likely because adjacent pieces of text are joined without a space when the text is extracted from the HTML. The frequency of this is low and can be ignored. 

In [39]:
tokens = text.split()
#print(tokens)
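
The fused tokens can be reproduced with the standard library. The sketch below collects the text nodes of a small HTML fragment and joins them with and without a separator; `TextCollector` and `page_text` are hypothetical names. BeautifulSoup's `get_text` accepts a `separator` argument for the same purpose, so `soup.get_text(separator=' ', strip=True)` is one possible remedy.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def page_text(html, separator=""):
    """Join the stripped text nodes of 'html' with 'separator'."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


fragment = "<p>In <a href='#'>supervised</a> learning</p>"
print(page_text(fragment).split())       # words fused across tag boundaries
print(page_text(fragment, " ").split())  # one token per word
```

Without a separator the three text nodes collapse into a single token; with a space between nodes, splitting at whitespace recovers each word.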

## Remove common words

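stopwords.words('english') returns a list of common English words such as 'i', 'me', 'my', 'myself', 'we', and so on. Tokens matching these are removed.

'nltk.FreqDist(clean_tokens)' counts the occurrences of the remaining words. It has been observed that these frequencies are not fully accurate. One visible cause is that tokens are compared case-sensitively, so a capitalised stop word such as 'The' is kept, and 'Space' and 'space' are counted separately. However, the disparities are not large enough to be a concern.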

In [40]:
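from nltk.corpus import stopwords

# Load the stop-word list once; a set makes each membership test fast.
stop_words = set(stopwords.words('english'))

# Keep the tokens that are not stop words. The comparison is case-sensitive.
clean_tokens = [token for token in tokens if token not in stop_words]
freq = nltk.FreqDist(clean_tokens)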
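
The effect of case on the counts can be seen in a small standalone sketch. The token list and the four-word stop list here are made up for illustration; NLTK's English stop-word list is likewise all lowercase.

```python
from collections import Counter

# Small illustrative stop-word subset and token list (assumptions).
stop_words = {"the", "a", "and", "of"}
tokens = ["The", "rocket", "and", "the", "Space", "space", "rocket"]

# Case-sensitive filter: "The" is kept, "Space" and "space" are counted apart.
case_sensitive = Counter(t for t in tokens if t not in stop_words)

# Case-insensitive filter: compare and count in lowercase.
case_insensitive = Counter(t.lower() for t in tokens if t.lower() not in stop_words)

print(case_sensitive)
print(case_insensitive)
```

Lowercasing before filtering merges the variants of a word into one count, at the cost of losing the distinction between, say, 'May' the month and 'may' the verb.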

## Make a list of words and their frequencies

By prepending the frequency to each word (as 'count:word'), it is possible to sort the list numerically. Then, the top x entries can be chosen for analysis.

In [44]:
# Get a list of words of 4+ chars to compare with tokens from the web page
from nltk.corpus import words
english_words = words.words()
# A dict gives fast membership tests; the value 1 is only a placeholder
english_dict = {word: 1 for word in english_words if len(word) > 3}


In [35]:
# Count all words on the page and sort. Only the words in english_dict are taken.
wordlist = []
for key, val in freq.items():
    if str(key) in english_dict:
        wordlist.append(str(val) + ':' + str(key))
# Sort numerically on the count that precedes the colon, highest first
wordlist.sort(key=lambda item: int(item.split(':')[0]), reverse=True)
wordlist[:10]

['101:launch',
 '88:first',
 '51:rocket',
 '46:Space',
 '38:test',
 '38:The',
 '31:company',
 '29:May',
 '29:space',
 '28:March']
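
An alternative to encoding the count inside a string is to keep (word, count) pairs. `nltk.FreqDist` is a subclass of `collections.Counter`, so `most_common()` is available on it as well. The counts and the small `vocabulary` set below are illustrative assumptions standing in for `freq` and the English word list.

```python
from collections import Counter

# Illustrative counts and vocabulary (assumptions, not real page data).
freq = Counter({"launch": 101, "first": 88, "rocket": 51, "test": 38})
vocabulary = {"launch", "rocket", "test"}

# most_common() already returns pairs sorted by descending count.
top = [(word, count) for word, count in freq.most_common() if word in vocabulary]
print(top[:2])
```

Keeping the count as an integer avoids the split-and-convert step when sorting and when the words are read back later.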

## Make anagrams

The top 10 words are used to make anagrams. Here an 'anagram' is a key built from two consecutive words in the list: the letters of both words are combined, duplicates are removed, and the letters are sorted. Pairing two words is arbitrary, and we may have to use 3 or more. The anagrams are used as keys in a hash array, so that a pair of words can be found in any order.

Another way is to create two hash strings with the words in both orders. The drawback is that there will be one hash for each word order, instead of one anagram for 2 words, and that it will not be possible to check one word alone in the hash string. Checking a single word is important to verify that the anagram detected actually contains both words. There is a possibility that a word may be in an anagram belonging to another phrase, but the chance of both words being in that anagram is lower (though not impossible).

In [9]:
def MakeAnagram(w1, w2):

    # Append the words together and remove duplicate letters.
    # Note: this happens before lowercasing, so 'S' and 's' both survive.
    word = ''.join(set(w1 + w2))

    # Convert all letters to lowercase and sort them to make the anagram
    anagram = ''.join(sorted(word.lower()))
    return anagram
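
Because `MakeAnagram` removes duplicates before lowercasing, mixed-case input can leave a repeated letter in the key ('Space' and 'Musk' give 'acekmpssu', with two 's'). A variant that lowercases first is sketched below; `pair_key` is a hypothetical name, and switching to it would change the keys already stored in 'anagrams.txt'.

```python
def pair_key(w1, w2):
    """Sorted, de-duplicated, lowercase letters of both words."""
    return "".join(sorted(set((w1 + w2).lower())))


print(pair_key("machine", "learning"))
print(pair_key("learning", "machine"))
print(pair_key("Space", "Musk"))
```

The first two calls give the same key, 'aceghilmnr', which is what makes the lookup independent of word order. The third gives 'acekmpsu', with a single 's'.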

In [10]:
anagrams = []
anagrams_txt = './anagrams.txt'
# Append mode: pairs written by earlier runs are kept in the file
with open(anagrams_txt, 'a+') as f:
    # Stop one short of the end so that wordlist[i+1] always exists
    for i in range(min(9, len(wordlist) - 1)):
        # Take the current and next word from the 'wordlist'
        word1 = wordlist[i].split(":")[1]
        word2 = wordlist[i+1].split(":")[1]

        # Skip the pair if either word is shorter than 4 chars
        if len(word1) < 4 or len(word2) < 4:
            continue
        anagram = MakeAnagram(word1, word2)

        # Add to a list, 'anagrams'
        anagrams.append(anagram)
        f.write(anagram + ":" + word1 + "," + word2 + "\n")

print(anagrams)

['acdehimprrstvx', 'acdehilmnrrtuv', 'acfhilnrstu', 'acffilnorst', 'acekoprst', 'acekmpssu', 'ekmstu']


In [11]:
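def ReadDict():
    # Load the 'anagram:word1,word2' lines of the file into the global dict 'd'
    anagrams_txt = './anagrams.txt'
    with open(anagrams_txt) as f:
        for line in f:
            try:
                (key, val) = line.split(":")
                d[key] = val  # val keeps its trailing newline
            except ValueError:
                # Skip lines that do not split into exactly a key and a value
                pass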

In [12]:
d = {}
ReadDict()
d

{'aceghilmnr': 'learning,machine\n',
 'acdehimnt': 'machine,data\n',
 'adginrt': 'data,training\n',
 'aghilmnorst': 'training,algorithms\n',
 'aceghilmmnorst': 'algorithms,Machine\n',
 'delmosu': 'used,model\n',
 'acdehimprrstvx': 'SpaceX,RetrievedMarch\n',
 'acdehilmnrrtuv': 'RetrievedMarch,launch\n',
 'acfhilnrstu': 'launch,first\n',
 'acffilnorst': 'first,Falcon\n',
 'acekoprst': 'rocket,Space\n',
 'acekmpssu': 'Space,Musk\n',
 'ekmstu': 'Musk,test\n'}

In [13]:
# See if a word matches with the dict
def checkDict(key):
#    print (key, ":", d[key])
    return key


## Check a word in anagrams 

This checks whether a word exists in any anagram in the list. It is not required in the final version, which needs to check that two (or more) words together are in the hash array. For that, the calculation of 'anagram' in the earlier cells is enough.

In [14]:
# This cell is only to check that a word is in any anagram. Not part of the core program.
def checkAnagram(word):

    # Change the search word to lowercase
    word = word.lower()
    if len(word) < 4:
        print("Word must be at least 4 chars")
        return

    # Return the first anagram that contains all letters of the word, in any order
    for anagram in anagrams:
        if all(letter in anagram for letter in word):
            return checkDict(anagram)


In [15]:
# Two words are sent to this func to see if their combined anagram exists in 'd' as a key
# If found, take the value of the key
# Split the value at comma and check whether the first two match the words in any order.
def CheckWords(w1, w2):

    # Both words must be at least 4 chars long.
    if len(w1) < 4 or len(w2) < 4:
        print("Word must be at least 4 chars")
        return 0

    # Create the combined anagram. This must be present in 'd' as a key
    anagram = MakeAnagram(w1, w2)
    try:
        val = d[anagram].split(",")

        # Check that the anagram contains w1 and w2. If one of them is an anagram of another word, it will mismatch.
        # This test ensures that the requested two words are really the ones in the hash.
        # Note: values loaded by ReadDict() keep their trailing newline, so the last field compares as e.g. 'machine\n'.
        if (val[0] == w1 and val[1] == w2) or (val[1] == w1 and val[0] == w2):
            # Note: this expects a third comma-separated field in the stored value
            return val[2]
        else:
            print("Mis-match")
            return 0
    except (KeyError, IndexError):
        print("Key Error")
        return 0


In [16]:
#ReadDict()
res = CheckWords('machine', 'learning')
print(res)


Mis-match
0
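
The 'Mis-match' above is consistent with how the values are stored: each value read from 'anagrams.txt' still ends in a newline, so 'machine\n' is compared with 'machine' and the test fails. In addition, the stored values hold only two words, so `val[2]` would not exist even after a successful match. A corrected lookup could be sketched as follows; `check_words` is a hypothetical name, and returning the stored pair on success is an assumption about the intended result.

```python
def check_words(store, w1, w2):
    """Return the stored word pair if the key for (w1, w2) maps to exactly these words."""
    # Same key as MakeAnagram produces for lowercase input
    key = "".join(sorted(set((w1 + w2).lower())))
    value = store.get(key)
    if value is None:
        return None
    stored = value.strip().split(",")  # strip() drops the trailing newline
    if set(stored[:2]) == {w1, w2}:
        return stored
    return None


d = {"aceghilmnr": "learning,machine\n"}
print(check_words(d, "machine", "learning"))
print(check_words(d, "machine", "algorithm"))
```

The first call finds the pair regardless of word order; the second returns None because no key matches the combined letters of 'machine' and 'algorithm'.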
