# "Natural Language Processing: Website Categorization"
> Classify website URLs into categories using a pretrained BART model

- toc: true 
- badges: true
- comments: true
- author: Amine EL FAIZ
- categories: [Website Categorization, Bart, Zero-shot Learning, web2vec, text encoding]

# Context

The goal of this project is website categorization: classifying websites into categories based on their content and purpose. For example, the Amazon website belongs to the e-commerce category.

Website categorization, whether done manually or with machine learning algorithms, is useful in many areas:
- Content filtering: blocking access to certain websites for certain users, depending on the websites' categories.
- Contextual marketing: letting businesses display ads on pages that are similar or relevant to the product or service they offer.
- Brand protection: finding copycat websites that resemble yours and harm your brand.
- Text data encoding: transforming website text into meaningful vectors (website embeddings), so that websites with a similar purpose have close vectors.

I came across this problem when trying to encode website text into vectors in order to train a machine learning model on them. One way to do this is to classify each website into a set of categories, using either the predicted probabilities or the predicted classes as the vector.
Besides the vectorization itself, this method has two advantages:
- Multiple vectors can be summed and the result is still meaningful, for example to encode the sequence of websites a user has visited.
- If a machine learning model is trained on these categories, its results can be interpreted through feature importance.
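
As a minimal sketch of the first advantage, the snippet below sums the category vectors of the websites a user visited. The site names, the three-label subset and the scores are illustrative assumptions, not results from this notebook.

```python
# Hypothetical category scores for three websites (illustrative values only).
labels = ["e-commerce", "social network", "entertainment"]
site_vectors = {
    "shop.example": [0.60, 0.00, 0.15],
    "social.example": [0.00, 0.82, 0.06],
    "video.example": [0.00, 0.00, 0.30],
}

def encode_session(visited_sites, vectors):
    """Sum the category vectors of every visited site, element by element."""
    total = [0.0] * len(labels)
    for site in visited_sites:
        total = [t + s for t, s in zip(total, vectors[site])]
    return total

# A user who visited one shop and the same social network twice.
session = ["shop.example", "social.example", "social.example"]
session_vector = encode_session(session, site_vectors)
profile = dict(zip(labels, [round(v, 2) for v in session_vector]))
print(profile)
```

Each entry of the summed vector still reads as "how much of this category the user consumed", which is what keeps the encoding interpretable.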

To achieve this, the transformation proceeds as follows:
- Scrape text data from the website URLs.
- Define the categories that describe the content of each website.
- Predict the probability that the scraped text belongs to each of the defined categories using a pretrained BART model.

# Load Libraries

In [11]:
# Standard library
import itertools
import re
import warnings
from collections import Counter

# Third-party packages
import cloudscraper
import numpy as np
import pandas as pd
import transformers
from bs4 import BeautifulSoup
from googletrans import Translator
from tldextract import extract
from transformers import AutoModel, AutoTokenizer, pipeline

warnings.filterwarnings('ignore')

# Websites Scraping

In [12]:
#collapse-show
# Function that scrapes the text of a given website
def scrap_website(scraper, headers, website_name, text_size):
    try:
        r = scraper.get(website_name, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        title = soup.find('title').text
        description = soup.find('meta', attrs={'name': 'description'})
        # Keep the description only if the tag exists and has a content attribute
        if description is not None and description.get('content'):
            description = description.get('content')
        else:
            description = ''
        # Join the text of every tag of each kind, separated by ". "
        h1_all, h2_all, h3_all, paragraphs_all = (
            '. '.join(tag.text for tag in soup.find_all(name))
            for name in ('h1', 'h2', 'h3', 'p')
        )
        allthecontent = ' '.join([title, description, h1_all, h2_all, h3_all, paragraphs_all])
        # Truncate to the requested number of characters
        return allthecontent[:text_size]
    except Exception:
        # Any network or parsing failure yields an empty string
        return ''

def translate_sentence(translator, sentence, text_size):
    # translate() returns a Translated object; its .text attribute holds the translated string
    translation = translator.translate(sentence).text
    return translation[:text_size]

In [13]:
#
# Scrape websites and translate the text of non-English websites
scraper = cloudscraper.create_scraper() 
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
translator = Translator()

# Example: Amazon
website_name = 'https://www.amazon.com'
website_content = scrap_website(scraper, headers, website_name, text_size=700)
website_content_en = translate_sentence(translator, website_content, text_size=700)
# Keep only the lowercase alphabetic tokens of the translated text
website_content_en = ' '.join(re.findall('[a-z]+', website_content_en.lower()))
print('The Scraped Text after cleaning and english translation of:', website_name, 'is:',website_content)

# Example: YouTube
website_name = 'https://www.youtube.com'
website_content = scrap_website(scraper, headers, website_name, text_size=700)
website_content_en = translate_sentence(translator, website_content, text_size=700)
# Keep only the lowercase alphabetic tokens of the translated text
website_content_en = ' '.join(re.findall('[a-z]+', website_content_en.lower()))
print('The Scraped Text after cleaning and english translation of:', website_name, 'is:',website_content)

The Scraped Text after cleaning and english translation of: https://www.amazon.com is: Amazon.com. Spend less. Smile more. Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, automotive, pets, baby, books, video games, musical instruments, office supplies, and more.  Sign in for the best experience. Explore Departments  
The Scraped Text after cleaning and english translation of: https://www.youtube.com is: YouTube Share your videos with friends, family, and the world    
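
The cleaning step used in these cells can be isolated as a small helper. This is a sketch of the same regular expression; the name `clean_text` is hypothetical and not part of the notebook.

```python
import re

def clean_text(text):
    """Lowercase the text and keep only runs of the letters a-z, joined by spaces."""
    return " ".join(re.findall("[a-z]+", text.lower()))

sample = "Amazon.com. Spend less. Smile more."
print(clean_text(sample))  # amazon com spend less smile more

# Digits and accented letters are dropped, so non-English text loses characters.
print(clean_text("Café 24h"))  # caf h
```

Because accented letters are discarded, it seems safer to translate a page to English before cleaning it rather than after.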


# Input and Output Definition

We define the websites to classify and the candidate categories that the BART model chooses from:

In [14]:
#
# Websites to classify
websites = ['amazon.com','instagram.com','wikipedia.org','netflix.com','facebook.com','google.com','yahoo.com']
# Candidate categories the BART model chooses from to label each website based on its front-page content
candidate_labels = ['health','e-commerce','advertising', 'job','computer','education','entertainment',
                    'home and family','industry','Information Technology','search engine',
                    'social network','science','news and media','read', 'buisness']
# DataFrame that will hold one category vector per website
web2vec = pd.DataFrame(websites, columns=['website_url'])
web2vec[candidate_labels] = np.nan

# Categorization using BART classifier
To learn more about Facebook's BART classification model, see its [model card](https://huggingface.co/facebook/bart-large-mnli).
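
As a rough sketch of what the zero-shot pipeline does with the candidate labels: each label is turned into a hypothesis sentence (the pipeline's default template is `"This example is {}."`), the model scores how strongly the text entails each hypothesis, and in the default single-label mode a softmax over those entailment scores gives one probability per label. The logits below are made-up numbers for illustration, not real model outputs, and the helper names are hypothetical.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn every candidate label into a natural-language hypothesis."""
    return [template.format(label) for label in labels]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["e-commerce", "entertainment", "health"]
# Hypothetical entailment logits, one per hypothesis.
entailment_logits = [2.0, 0.5, -1.0]

hypotheses = build_hypotheses(labels)
scores = softmax(entailment_logits)
# Rank labels from most to least likely, as the pipeline output does.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalised across the candidate labels, adding or removing a label changes every other label's probability.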

In [15]:
#collapse-output
# Initialise the BART model that turns website text into category scores
# Requires an internet connection the first time, to download the weights
model_name = "facebook/bart-large-mnli" 
# Download the PyTorch model weights
model = AutoModel.from_pretrained(model_name)
# Load the BART zero-shot classification pipeline
classifier = pipeline("zero-shot-classification",
                      model=model_name)

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['classification_head.out_proj.weight', 'classification_head.dense.bias', 'classification_head.out_proj.bias', 'classification_head.dense.weight']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
# 
# Predict the category of the Amazon website
website_name = 'https://www.amazon.com'
website_content = scrap_website(scraper, headers, website_name, text_size=700)
website_content_en = translate_sentence(translator, website_content, text_size=700)
# Keep only the lowercase alphabetic tokens of the translated text
website_content_en = ' '.join(re.findall('[a-z]+', website_content_en.lower()))
output = classifier(website_content_en, candidate_labels)
print(output)
# The class with the highest probability is e-commerce, which is correct.

{'sequence': 'amazon com spend less smile more free shipping on millions of items get the best of shopping and entertainment with prime enjoy low prices and great deals on the largest selection of everyday essentials and other products including fashion home beauty electronics alexa devices sporting goods toys automotive pets baby books video games musical instruments office supplies and more sign in for the best experience explore departments', 'labels': ['e-commerce', 'entertainment', 'read', 'buisness', 'home and family', 'industry', 'search engine', 'advertising', 'job', 'computer', 'health', 'social network', 'science', 'Information Technology', 'education', 'news and media'], 'scores': [0.601190447807312, 0.1486208140850067, 0.09820207953453064, 0.037607591599226, 0.033550821244716644, 0.02008289285004139, 0.0157319363206625, 0.014724561013281345, 0.008524124510586262, 0.006259399000555277, 0.0036483018193393946, 0.0029156305827200413, 0.002454340225085616, 0.002257096115499735, 
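
The next cell calls `encode_websites`, whose definition does not appear in this post. The sketch below is one possible reconstruction from the steps shown so far, not the original implementation. It assumes the function returns `0` on failure (the loop checks `output != 0`), that bare domains such as `amazon.com` need a scheme before they can be requested, and that `candidate_labels` and `scrap_website` are the notebook's globals unless overridden through the hypothetical `labels` and `scrape` parameters.

```python
import re

def encode_websites(classifier, scraper, headers, translator, website_name, text_size,
                    labels=None, scrape=None):
    """Scrape, translate, clean and classify one website; return 0 on failure."""
    # Fall back to the notebook's globals when no override is passed (assumption).
    labels = candidate_labels if labels is None else labels
    scrape = scrap_website if scrape is None else scrape
    # The websites list holds bare domains, so add a scheme before requesting.
    url = website_name if website_name.startswith("http") else "https://" + website_name
    content = scrape(scraper, headers, url, text_size)
    if not content:
        return 0
    # Translate first, then keep only lowercase alphabetic tokens.
    translated = translator.translate(content).text[:text_size]
    cleaned = " ".join(re.findall("[a-z]+", translated.lower()))
    if not cleaned:
        return 0
    return classifier(cleaned, labels)
```

Passing `labels` and `scrape` explicitly makes the function easy to try with stand-in objects before running it against real websites.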

In [28]:
#
# The encode_websites function groups all the steps defined earlier to classify one website
# Apply the pipeline to the websites defined earlier
webList = web2vec.website_url.to_list()
for i in range(web2vec.shape[0]):
    print('Website url', webList[i], 'is being classified')
    try:
        output = encode_websites(classifier, scraper, headers, translator, website_name=webList[i], text_size=500)
        # Skip the website when encode_websites returns 0 instead of a result;
        # otherwise store the three highest scores under their labels
        if output != 0:
            web2vec.loc[i, output['labels'][:3]] = output['scores'][:3]
    except Exception:
        # Ignore websites that cannot be scraped or classified
        pass

Website url amazon.com is being classified
Website url instagram.com is being classified
Website url wikipedia.org is being classified
Website url netflix.com is being classified
Website url facebook.com is being classified
Website url google.com is being classified
Website url yahoo.com is being classified


In [27]:
web2vec.fillna(0, inplace=True)
web2vec[candidate_labels] = round(web2vec[candidate_labels], 3)
web2vec['Category_Prediction'] = web2vec[candidate_labels].idxmax(axis=1)
web2vec

| | website_url | health | e-commerce | advertising | job | computer | education | entertainment | home and family | industry | Information Technology | search engine | social network | science | news and media | read | buisness | Category_Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | amazon.com | 0.0 | 0.601 | 0.0 | 0.0 | 0.0 | 0.0 | 0.149 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.098 | 0.0 | e-commerce |
| 1 | instagram.com | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.061 | 0.0 | 0.0 | 0.038 | 0.0 | 0.82 | 0.0 | 0.0 | 0.0 | 0.0 | social network |
| 2 | wikipedia.org | 0.0 | 0.0 | 0.0 | 0.0 | 0.141 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.304 | 0.102 | read |
| 3 | netflix.com | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.296 | 0.0 | 0.0 | 0.169 | 0.0 | 0.0 | 0.0 | 0.0 | 0.18 | 0.0 | entertainment |
| 4 | facebook.com | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.07 | 0.0 | 0.767 | 0.0 | 0.0 | 0.071 | 0.0 | social network |
| 5 | google.com | 0.0 | 0.0 | 0.0 | 0.0 | 0.036 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.797 | 0.0 | 0.0 | 0.0 | 0.047 | 0.0 | search engine |
| 6 | yahoo.com | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.253 | 0.0 | 0.0 | 0.288 | 0.116 | 0.0 | news and media |


- The classifier outputs one probability per category. For each website we keep the three highest probabilities and their categories; the category with the highest probability is stored in the `Category_Prediction` column of the `web2vec` DataFrame.
- The classification scores depend heavily on how the categories are defined. Making the categories more specific can improve the model's performance.
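
To make the top-3 selection concrete, here is a small sketch in plain Python. The helper name `top_k_vector` is hypothetical, the label list is a reduced subset of `candidate_labels`, and the scores are the first four values of the Amazon output shown earlier. It relies on the classifier returning labels sorted from highest to lowest score.

```python
def top_k_vector(output, all_labels, k=3):
    """Return one score per label, keeping the k best and zeroing the rest."""
    kept = dict(zip(output["labels"][:k], output["scores"][:k]))
    return [round(kept.get(label, 0.0), 3) for label in all_labels]

# Reduced label set for illustration (the notebook uses 16 labels).
all_labels = ["health", "e-commerce", "entertainment", "read", "buisness"]
# First four labels and scores of the Amazon classification output.
output = {
    "labels": ["e-commerce", "entertainment", "read", "buisness"],
    "scores": [0.601190447807312, 0.1486208140850067,
               0.09820207953453064, 0.037607591599226],
}

vector = top_k_vector(output, all_labels)
prediction = all_labels[vector.index(max(vector))]
print(vector)      # [0.0, 0.601, 0.149, 0.098, 0.0]
print(prediction)  # e-commerce
```

The fourth score (`buisness`) is dropped because only the three best are kept, which matches the zeros in the table.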

# Conclusion

This notebook tackles the problem of encoding website URLs as numeric data, which is useful in many machine learning problems.
Another approach keeps the scraping step but applies word embeddings to the text sequence instead of a BART model. The word embeddings can be summed to map each website to a vector. The drawback is that pre-trained word embeddings usually have many dimensions (50 or more), which makes the resulting vectors harder to use with a classical machine learning model such as a random forest; they can still be useful for training deep learning models such as LSTMs.
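
A minimal sketch of this alternative, assuming a lookup table of pre-trained word vectors is available. The three-dimensional vectors below are invented for illustration; real pre-trained embeddings are much larger.

```python
# Hypothetical 3-dimensional word vectors (made-up values).
toy_embeddings = {
    "free": [0.1, 0.3, 0.0],
    "shipping": [0.4, 0.1, 0.2],
    "videos": [0.0, 0.2, 0.9],
}

def embed_text(text, embeddings, dim=3):
    """Sum the vectors of the known words; unknown words are skipped."""
    total = [0.0] * dim
    for word in text.split():
        vector = embeddings.get(word)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total

site_vector = embed_text("free shipping on millions of items", toy_embeddings)
print([round(v, 2) for v in site_vector])  # [0.5, 0.4, 0.2]
```

Unlike the category scores, the individual dimensions of such a vector have no direct meaning, so feature importance is harder to read.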
