# Parse a website and say what it is about

## INTRODUCTION

A web page that is text-centric will give an idea about its content through frequent usage of certain words and phrases. 
By analying the content of a page for word and phrase frequencies we can deduce the purpose of the page. 

### Method

The web page content is retrieved in a way similar to 'curl' or 'wget'. Only the human readable content is to be analysed. The content is then tokenized and cleaned of common words such as 'a', 'and', 'the', etc. The rest are trimmed to remove endings (e.g. plurals) to form the stems. The frequency of these are calculated and added to an array in descending order. The first 2 to 5 words will usually say what the page is about.

By training the dataset using a number of pages of the same theme (e.g. Machine Leraning) it should correctly classify a new page in the same theme.

### Algorithm

- Scan the page content for all words.
    - These can include the 'meta-description' but is not included here.
- Prepare a word count and calculate the frequency (percentage) of each word among total.
    - Sort in descending order
- When training, write out the data in a TSV file.
    - Only the top 10 words are written.
    - This can be editd to add more text.
- When testing a new page, do the same to get the word frequencies.
- Take the first 10 (arbitrary number) of words.
- Read each line in the text file and take the words to compare with the test page words.
    - Add up the freqeuncy values in the training data set.
    - If the frequency is less than 20 (arbitrary) ignore it.
    - Take the line which gives the highest frequency.
    - Take its catagory name and report.

#### Author: Arapaut V. Sivaprasad

#### Dates

    Created: 26 Oct, 2019.  
    Last Modified: 01 Nov, 2019.

## Import the Python module

The NLTK module and the package, 'stopwords', are required in this example.

In [12]:
import nltk
#nltk.download()
nltk.download('stopwords')
nltk.download('words')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\avs29\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\avs29\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## Get the web page content

By specifying a URL its raw page content is retrieved. Though it returns the html tags like 'title' and 'description', they cannot be trusted fully. The page content must say what the page is about and it can be corroborated with title and description.

In [13]:
import urllib.request
url = "https://en.wikipedia.org/wiki/Machine_learning"
#url = "https://en.wikipedia.org/wiki/SpaceX"
#url = "https://www.webgenie.com"

In [14]:
writeout = 1 # If the URL is already proccessed, then dont write it again in catalog.txt
with open('catalog.txt', 'r') as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
f.close()
content = [x.strip() for x in content]
lc = len(content)
for i in range(lc):
    s = content[i].split('\t')
    print (s[0])
    if (s[0] == url):
        print("URL already scanned.")
        writeout = 0
        break
writeout

https://en.wikipedia.org/wiki/Machine_learning
URL already scanned.


0

In [15]:
response =  urllib.request.urlopen(url)
html = response.read()
#print(html)

## Get the page title and keep it.

Generally the page title will represent the page content, but it cannot be assumed as some may be trying to cheat the search engines. Comparing the title with the page content may give us more confidence.

In [22]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen(url).read()
text = text_from_html(html)
title = soup.title.string
title

'Machine learning - Wikipedia'

## Split the text into tokens

The text is split at white-spaces. This splitting is not always accurate and sometimes can lead to concatenating two words (e.g. 'Insupervised', 'thetraining', 'datafor'). The frequency of this is low and can be ignored. 

In [17]:
import re
tokens = []
t1 = []
t2 = []
p = re.compile('[-_\", {};:=?\[\]\(\)\'.]')
t = p.split(text)
#print(t)
lt = len(t)
for i in range(lt):
    w = t[i]
    lw = len(w)
    if(lw > 3):
        t1.append(w)
lt = len(t1)
q = re.compile('[a-z][A-Z]')
for i in range (lt):
    s = t1[i]
#    print(s)
    p = re.compile('[a-z][A-Z]')
    try:
        res = p.search(s).group(0)
    except:
        continue
#    print(res)
    n = s.find(res) + 1
    b = s[:n]
    e = s[n:]
    t1[i] = ''
    t2.append(b)
    t2.append(e)
for i in range(lt):
    w = t[i]
    lw = len(w)
    if(lw > 3):
        tokens.append(w)
tokens += t2
#tokens

In [18]:
#tokens = [t for t in text.split()]
lt = len(tokens)
for i in range(lt):
    res = p.split(tokens[i])
#    tokens[i] = tokens[i].replace('[\"]', ' ')
#print(tokens[i])

#print(tokens)

## Remove common words

stopwords.words('english') is a file containing a list of words like 'i, me, my, myself, we, ...'. Tokens matching these are removed.

'nltk.FreqDist(clean_tokens)' determines the frequency of occurrence of the remaining words. It has been observed that these frequencies are not accurate. However, the disparities are not big to be a concern.

In [19]:
from nltk.corpus import stopwords
sr= stopwords.words('english')

clean_tokens = tokens[:]
for token in tokens:
    if token in stopwords.words('english'):
        token = token.lower()
        clean_tokens.remove(token)
freq = nltk.FreqDist(clean_tokens)
#freq.items()

## Make a list of words and their frequencies

By appending the frequency number before the word, it will be possible to sort the list numerically. Then, the top x numbers can be chosen for analysis.

In [20]:
# Get a list of words of 4+ chars to compare with tokens from the web page
from nltk.corpus import words
english_words = words.words()
english_dict = {}
for word in (english_words):
    n = len(word)
    if (n > 3):
        english_dict[word] = 1


In [21]:
# Count all words on the page and sort. Only the words in english_dict are taken.
wordlist = []
for key,val in freq.items():
    item = str(val) + ':' + str(key.lower())
    try:
#        print(str(key))
        if(english_dict[str(key)]):
            wordlist.append(item)
    except:
        pass
wordlist.sort(key=lambda fname: int(fname.split(':')[0]), reverse=True)
wordlist[:10]

['178:learning',
 '65:data',
 '52:machine',
 '32:training',
 '26:model',
 '22:edit',
 '20:artificial',
 '19:used',
 '19:also',
 '16:classification']

In [187]:
# Combine the duplicates into a single word and add their counts together
combined_words = []
lw = len(wordlist)
for i in range(lw):
    if(wordlist[i] == ""):
        continue
    fields = wordlist[i].split(':')
    c1 = int(fields[0])
    w1  = fields[1]
    item0 = str(c1) + ":" + w1
    for j in range(i+1, lw):
        if(wordlist[j] == ""):
            continue
        fields1 = wordlist[j].split(':')
        c2 = int(fields1[0])
        w2  = fields1[1]
        if (w1 == w2):
            wordlist[j] = ""
            w1 = w2
            c1 = int(c1)+int(c2)
            item = str(c1) + ":" + w1
            item0 = item
    combined_words.append(item0)
#    wordlist = combined_words   
#print(combined_words)
wordlist = combined_words   
#print(wordlist)

In [186]:
if(writeout):
    print("OK to write")
    # Write out this web page's data into a TSV file 
    wordfreq = ""
    lw = len(wordlist) - 1
    tot = 0
    with open('catalog.txt', 'a') as file:

        # Count the total number of words
        for i in range (0,lw):
            fields = wordlist[i].split(":")
            tot += int(fields[0])
        for i in range (0,10):
            fields = wordlist[i].split(":")
            c = int(fields[0])
            w = fields[1]
    #        f = str(int(c/tot*100+0.5))
            f = str(round(float(c/tot*100),2))
    #        print(c,f)
            fw = f + "-" + w
            wordfreq += fw + " "
            category = "Space Exploration"
            line = url + "\t" + wordfreq + "\t" + title + "\t" + category + "\n"
        file.write(line)
    file.close()
else:
    print("Already written. Not writing out again")

tot

OK to write


2389