### Classify Webpages using SMCFL
This notebook implements the method proposed in the following paper: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14582

### Dataset

The WebKB dataset contains 1051 webpages from two classes (230 pages in the course class and 821 pages in the non-course class). Each webpage is characterized by the page view and the link view. We use a preprocessed version of this dataset, where 3000-dimensional and 1840-dimensional original features are extracted from the page view and link view of a webpage, respectively.

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/

### Background
Webpage data is often multi-view and high-dimensional, and webpage classification is usually a semi-supervised task.
Because of these characteristics, semi-supervised multi-view feature learning (SMFL) techniques for webpage classification have recently received much attention.

Webpage classification has three characteristics: 

1. A webpage is multi-view data: it usually contains two or more types of data, e.g., text, hyperlinks and images, and each type can be regarded as a view. All views describe the same webpage. Multi-view learning is concerned with learning from data represented by multiple distinct feature sets. For example, a webpage can be described both by its own document text and by the anchor text attached to hyperlinks pointing to it.
2. Webpage classification is a semi-supervised application, since labeled pages are harder to collect than unlabeled pages in practice.
3. Webpage data is high-dimensional, since webpages usually contain a large amount of information.

Considering these three characteristics, it is crucial to design effective semi-supervised multi-view feature learning (SMFL) methods.

In [1]:
import os
import random
import re
from random import shuffle
from urllib.request import urlopen

import numpy as np
import sklearn.preprocessing
from nltk import ne_chunk, pos_tag, sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import RegexpTokenizer
from nltk.tree import Tree
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# These files will be used to generate features and labels
fulltext_test_file="tfidf_matrix_fulltext_test_large.txt"
inlink_test_file="tfidf_matrix_inlinks_test_large.txt"
fulltext_train_file="tfidf_matrix_fulltext_train_large.txt"
inlink_train_file="tfidf_matrix_inlinks_train_large.txt"

In [13]:
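# The functions below generate the train and test files, which contain TF-IDF matrices
from collections import Counter

def get_tf(documents, terms):
    # Build a (documents x terms) matrix of whole-token counts.
    # Each document is a one-element list holding a space-separated token string.
    # str.count would count substrings (e.g. "art" inside "start"), so count tokens instead.
    doc_matrix = []
    for itr in documents:
        counts = Counter(itr[0].split())
        doc_terms = [counts[t] for t in terms]
        doc_matrix.append(doc_terms)

    return np.array(doc_matrix)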

# Convert Text to TfIdf Vector
def get_vector(location1, view_location1, location2, view_location2, principal_components, train_file_name, test_file_name):
    # Create train and test tfidf matrices. Then generate files for the same.
    
    # Declare variables for storage purpose 
    # For webpage
    text = []
    # For number of unique words in all samples
    uniques = []
    location = location1
    # Using PorterStemmer for stemming purpose
    pStemmer = PorterStemmer()
    # To manage Regular Expressions
    tokenizer = RegexpTokenizer(r'\w+')
    labels = []
    
    # Data Preprocessing
    for file in os.listdir(location):
        file = view_location1 + file
        sock = urlopen(file) 
        htmlStr = sock.read() 
        htmlStr = htmlStr.decode("windows-1252")
        sock.close()
        
        labels.append(0)
        # Strip HTML tags, keeping only the text between them
        clean_reg = re.compile('<.*?>')
        htmlStr = re.sub(clean_reg, '', htmlStr)
        tokens = tokenizer.tokenize(htmlStr.lower())

        # Stem every non-empty token with the Porter stemmer
        words = [pStemmer.stem(line) for line in tokens if line]
        
        # Remove the stop words
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in words if w not in stop_words]
        
        # Get unique words
        uniques += list(set(tokens))
        uniques = list(set(uniques))

        # text holds one space-joined token string per webpage
        text.append([" ".join(tokens)])

    count_class1 = len(labels)  # stores count of class one samples
    location = location2
    # read files
    for filename in os.listdir(location):
        filename = view_location2 + filename
        sock = urlopen(filename) 
        htmlStr = sock.read()
        htmlStr = htmlStr.decode("windows-1252")                           
        sock.close()    
        labels.append(1)

        # Strip HTML tags, keeping only the text between them
        cleanr = re.compile('<.*?>')
        htmlStr = re.sub(cleanr, '', htmlStr)

        # Preprocess exactly as for the first class: tokenize, stem, remove stop words.
        # The previous filter (`if not w in words`) discarded every token.
        tokens = tokenizer.tokenize(htmlStr.lower())
        words = [pStemmer.stem(line) for line in tokens if line]
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in words if w not in stop_words]
        uniques += list(set(tokens))
        uniques = list(set(uniques))

        # text holds one space-joined token string per webpage
        text.append([" ".join(tokens)])

    count_class2 = len(labels) - count_class1
    labels = np.asarray(labels)
    labels = labels.reshape(labels.shape[0], 1)

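    tf_matrix = get_tf(text, uniques)    
    tfIdf = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
    # toarray() yields a plain ndarray; todense() returns np.matrix, which recent scikit-learn versions may reject
    tf_idf_matrix = tfIdf.fit_transform(tf_matrix).toarray()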
    
    # Dimensionality reduction using truncated SVD
    # https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    svd = TruncatedSVD(n_components=principal_components, random_state=42)
    tf_idf_matrix_SVD = svd.fit_transform(tf_idf_matrix)

    tfidf_with_labels = np.concatenate((tf_idf_matrix_SVD, labels), axis=1)
    # Extract 30% for testing
    class1_test_data = int(count_class1  * 0.3)
    tfidf_test_matrix = tfidf_with_labels[0:class1_test_data,:]
    class2_test_data = int(count_class2  * 0.3)
    temp_tfidf_test_matrix = tfidf_with_labels[count_class1:(count_class1+class2_test_data),:]
    tfidf_test_matrix = np.concatenate((tfidf_test_matrix,temp_tfidf_test_matrix),axis = 0)
    temp_matrix1 = tfidf_with_labels[class1_test_data:count_class1,:]
    temp_matrix2 = tfidf_with_labels[(count_class1+class2_test_data):,:]
    tfidf_train_matrix = np.concatenate((temp_matrix1, temp_matrix2), axis=0)

    # Write the training and test matrices to disk, one sample per line
    # (space-separated features followed by the label in the last column)
    np.savetxt(train_file_name, tfidf_train_matrix, delimiter=" ")
    np.savetxt(test_file_name, tfidf_test_matrix, delimiter=" ")

# Define the path to the course and non-course file location

# 100 dimensions
location1 = '/Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/fulltext/course/'
view_location1 = 'file:///Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/fulltext/course/'
location2 = '/Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/fulltext/non-course/'
view_location2 = 'file:///Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/fulltext/non-course/'

get_vector(location1,view_location1,location2,view_location2, 100, 'tfidf_matrix_fulltext_train_large.txt', 'tfidf_matrix_fulltext_test_large.txt')

location1 = '/Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/inlinks/course/'
view_location1 = 'file:///Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/inlinks/course/'
location2 = '/Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/inlinks/non-course/'
view_location2 = 'file:///Users/ankursrivastava/Desktop/Machine Learning/IIIT/Assignments/Project/dataset/course-cotrain-data/inlinks/non-course/'

get_vector(location1,view_location1,location2,view_location2, 100, 'tfidf_matrix_inlinks_train_large.txt', 'tfidf_matrix_inlinks_test_large.txt')
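
The term-frequency step in `get_tf` is meant to count whole tokens. `str.count` counts substrings, so a term such as `art` is also counted inside `start`. The toy example below is a minimal sketch of token-level counting; `term_frequency_matrix` and the sample documents are illustrative names and data, not part of the original notebook.

```python
from collections import Counter

import numpy as np


def term_frequency_matrix(documents, terms):
    """Return an (n_docs, n_terms) matrix of whole-token counts."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())  # count whitespace-separated tokens
        rows.append([counts[t] for t in terms])
    return np.array(rows)


docs = ["start the art cours", "art art homework"]
terms = ["art", "cours", "homework"]
tf = term_frequency_matrix(docs, terms)
print(tf)

# Substring counting also picks up the "art" inside "start"
print(docs[0].count("art"))
```

Token-level counts keep short stems from inflating the frequency of unrelated words, which matters here because the vocabulary is built from stemmed tokens.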

In [2]:
# Extract the labels and features from the test file
def get_test_data(fulltext_file, inlink_file):
    fulltext_data=np.genfromtxt(fulltext_file,dtype=None,delimiter=" ")
    inlink_data=np.genfromtxt(inlink_file,dtype=None,delimiter=" ")
    print(fulltext_data.shape)
    print(fulltext_data)
    
    #extract features and label
    text_data=fulltext_data[:,:-1]
    text_label=fulltext_data[:,-1]
    link_data=inlink_data[:,:-1]
    link_label=inlink_data[:,-1]
    print(text_data)
    
    #convert labels from float to int
    text_label=text_label.astype(int)
    link_label=link_label.astype(int)
    print(text_label)
    print(text_label.shape)

    text_data=sklearn.preprocessing.scale(text_data)
    link_data=sklearn.preprocessing.scale(link_data)
    number_of_samples=text_label.shape[0]

    text_label=text_label.reshape(text_label.shape[0], 1)
    link_label=link_label.reshape(link_label.shape[0], 1)

    data_full_with_label = np.concatenate((text_data, text_label), axis=1)
    data_inlink_with_label = np.concatenate((link_data, link_label), axis=1)

    class0_text=data_full_with_label[data_full_with_label[:,-1]==0][:,:-1]
    class0_inlink=data_inlink_with_label[data_inlink_with_label[:,-1]==0][:,:-1]
    class1_text=data_full_with_label[data_full_with_label[:,-1]==1][:,:-1]
    class1_inlink=data_inlink_with_label[data_inlink_with_label[:,-1]==1][:,:-1]

    class0 = np.stack((class0_text, class0_inlink), axis=0)
    class1 = np.stack((class1_text, class1_inlink), axis=0)
    print(class0.shape, class1.shape)
    
    #Add to View List
    view_list=[class0,class1]
    print(view_list[0].shape,view_list[1].shape)
    return view_list
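
Each per-class array built above has shape `(n_views, n_docs, n_features)`. `np.stack` requires both views to have the same number of features, which the notebook obtains by reducing each view to 100 SVD components. The sketch below shows the same layout on toy data with assumed sizes (4 documents, 3 features per view); `split_by_class` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np


def split_by_class(text_view, link_view, labels, classes=(0, 1)):
    """Stack the two views of every class into (n_views, n_docs, n_features)."""
    return [
        np.stack((text_view[labels == c], link_view[labels == c]), axis=0)
        for c in classes
    ]


rng = np.random.default_rng(0)
text_view = rng.normal(size=(4, 3))  # page-view features, one row per document
link_view = rng.normal(size=(4, 3))  # link-view features for the same documents
labels = np.array([0, 1, 1, 0])

class0, class1 = split_by_class(text_view, link_view, labels)
print(class0.shape, class1.shape)
```

Indexing `class0[0]` then gives the page view and `class0[1]` the link view of all class-0 documents, in the same row order.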

In [1]:
# Extract the labels and features from the train file
def get_train_data(fulltext_file, inlink_file) :
    fulltext_data=np.genfromtxt(fulltext_file,dtype=None,delimiter=" ")
    inlink_data=np.genfromtxt(inlink_file,dtype=None,delimiter=" ")

    # get the features and labels from fulltext
    text_features=fulltext_data[:,:-1]
    text_labels=fulltext_data[:,-1]
    
    # get the features and labels from inlinks
    inlink_features=inlink_data[:,:-1]
    inlink_labels=inlink_data[:,-1]
    
    # convert to integer
    text_labels=text_labels.astype(int)
    inlink_labels=inlink_labels.astype(int)

    text_features=sklearn.preprocessing.scale(text_features)
    inlink_features=sklearn.preprocessing.scale(inlink_features)
    count_data_samples=text_features.shape[0]

    # Simulate the semi-supervised setting: hide roughly 70% of the labels
    # by setting them to -1 (each sample is hidden with probability 0.7)
    for k in range(count_data_samples):
        # draw a random number between 0 and 1
        p=random.uniform(0,1)
        if p<0.7:
            text_labels[k]=-1
            inlink_labels[k]=-1

    print("-------------------------------------")

    text_labels=text_labels.reshape(text_labels.shape[0], 1)
    inlink_labels=inlink_labels.reshape(inlink_labels.shape[0], 1)

    text_with_label = np.concatenate((text_features, text_labels), axis=1)
    inlink_with_label = np.concatenate((inlink_features, inlink_labels), axis=1)

    class0_text=text_with_label[text_with_label[:,-1]==0][:,:-1]
    class0_inlink=inlink_with_label[inlink_with_label[:,-1]==0][:,:-1]
    class1_text=text_with_label[text_with_label[:,-1]==1][:,:-1]
    class1_inlink=inlink_with_label[inlink_with_label[:,-1]==1][:,:-1]
    
    unlabelled_text=text_with_label[text_with_label[:,-1]==-1][:,:-1]
    unlabelled_inlink=inlink_with_label[inlink_with_label[:,-1]==-1][:,:-1]

    class0 = np.stack((class0_text, class0_inlink), axis=0)
    class1 = np.stack((class1_text, class1_inlink), axis=0)
    unlabelled = np.stack((unlabelled_text, unlabelled_inlink), axis=0)

    print(class0.shape, class1.shape, unlabelled.shape)

    view=[class0,class1,unlabelled]
    return view
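
Hiding labels with an independent coin flip per sample gives a different labelled fraction on every run. If a fixed fraction and reproducible runs are wanted, one option is to draw the hidden indices from a seeded generator. This is a sketch of an alternative, not the notebook's method; `mask_labels`, `hidden_fraction` and `seed` are assumed names.

```python
import numpy as np


def mask_labels(labels, hidden_fraction=0.7, seed=0):
    """Return a copy of `labels` with a fixed fraction set to -1 (unlabelled)."""
    rng = np.random.default_rng(seed)
    masked = np.array(labels, dtype=int)  # copy, so the input stays intact
    n_hidden = int(round(hidden_fraction * masked.shape[0]))
    hidden_idx = rng.choice(masked.shape[0], size=n_hidden, replace=False)
    masked[hidden_idx] = -1
    return masked


labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
masked = mask_labels(labels)
print((masked == -1).sum())  # number of hidden labels
```

Because the same index set can be applied to both views, the page-view and link-view labels stay aligned, as they do in the loop above.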

In [None]:
# The objective function is R = max f(W) = S_W - r1*S_B - r2*S_T, where
# S_W = within-class correlation
# S_B = between-class correlation
# S_T = total correlation
def calculate_R(class_view):
    # class_view is the list [class0, class1, unlabelled] built by get_train_data;
    # every entry has shape (n_views, n_docs, n_features)
    n_class=len(class_view)
    number_of_features=class_view[0].shape[2]
    S_W=np.zeros((number_of_features,number_of_features))

    # Calculate S_W over the labelled classes only (the last entry is unlabelled)
    for i in range(n_class-1) :
        ar_class=class_view[i]
        S_W_temp_class=np.zeros((number_of_features,number_of_features))
        n_views,n_docs,n_features=ar_class.shape

        for s in range(n_views) :
            # s(th) view in class i
            ar_view_s=ar_class[s]
            for t in range(n_views) :
                # t(th) view in class i
                ar_view_t=ar_class[t]
                doc_len_1=ar_view_s.shape[0]
                # for each document in s(th) view
                for p in range(doc_len_1) :
                    doc_len_2=ar_view_t.shape[0]
                    # pair it with every document in t(th) view
                    for q in range(doc_len_2) :
                        doc_1=ar_view_s[p]
                        doc_2=ar_view_t[q]
                        # The feature vectors are 1-D arrays of length n_features;
                        # reshape them to column vectors of shape (n_features, 1)
                        doc_1=doc_1.reshape(ar_view_s.shape[1],1)
                        doc_2=doc_2.reshape(ar_view_t.shape[1],1)
                        # outer product, shape (n_features, n_features)
                        temp_prod=np.dot(doc_1,doc_2.T)
                        S_W_temp_class=S_W_temp_class+temp_prod
        # l(i): number of documents in class i, counted across all views
        l_i=n_views*n_docs
        S_W_temp_class=S_W_temp_class/(l_i*l_i)
        S_W=S_W+S_W_temp_class

    # Divide S_W by the number of labelled classes
    S_W=S_W/(n_class-1)

    #-----------------------------------------------------------------------------------------
    #-----------------------Calculation of S_B part of equation for review--------------------
    #-----------------------------------------------------------------------------------------

    S_B=np.zeros((number_of_features,number_of_features))
    for i in range(n_class-1) :
        #Extracting all the view_doc records for a class i
        ar_class_i=class_view[i]
        n_views_i,n_docs_i,n_features_i=ar_class_i.shape
        for j in range(n_class-1) :
            # Pairs with i == j are skipped below; only distinct classes contribute
            S_B_temp_class=np.zeros((number_of_features,number_of_features))
            if i != j :
                #Since interclass Scatter proceed with different Classes
                ar_class_j=class_view[j]
                n_views_j,n_docs_j,n_features_j=ar_class_j.shape
                #For every view in class one
                for s in range(n_views_i) :
                    # S(th) view in class i
                    ar_view_s=ar_class_i[s]
                    #Inner Loop for views
                    for t in range(n_views_j) :
                        #t(th) view in class j
                        ar_view_t=ar_class_j[t]
                        #Number of Documents in View1
                        doc_len_1=ar_view_s.shape[0]
                        for p in range(doc_len_1) :
                            doc_len_2=ar_view_t.shape[0]
                            # for each document in s(th) view take a document in t(th) view
                            for q in range(doc_len_2) :
                                doc_1=ar_view_s[p]
                                doc_2=ar_view_t[q]
                                # The feature vectors are 1-D arrays of length n_features;
                                # reshape them to column vectors of shape (n_features, 1)
                                doc_1=doc_1.reshape(ar_view_s.shape[1],1)
                                doc_2=doc_2.reshape(ar_view_t.shape[1],1)
                                temp_prod=np.dot(doc_1,doc_2.T)
                                S_B_temp_class=S_B_temp_class+temp_prod
                #l(i) which represents the number of documents in class i
                l_i=n_views_i*n_docs_i
                #l(j) which represents the number of documents in class j
                l_j=n_views_j*n_docs_j
                S_B_temp_class=S_B_temp_class/(l_i*l_j)
                #print (S_W_temp_class.shape)
                S_B=S_B+S_B_temp_class
    S_B=S_B/((n_class-1)*(n_class-2))
    #print("final S_B")
    #print(S_B)

    #-----------------------------------------------------------------------------------------
    #-----------------------Calculation of S_T part of equation for review--------------------
    #-----------------------------------------------------------------------------------------

    S_T=np.zeros((number_of_features,number_of_features))#-----Full S_T Sum-------------------
    S_T_W=np.zeros((number_of_features,number_of_features))#-----S_W part of S_T--------------
    for i in range(n_class) :

        ar_class=class_view[i]
        # print(i)
        # print(ar_class)
        S_T_W_temp_class=np.zeros((number_of_features,number_of_features))
        n_views,n_docs,n_features=ar_class.shape
        print(n_views,n_docs,n_features)

        # Outer Loop for Each view
        for s in range(n_views) :
            # S(th) view in class i
            ar_view_s=ar_class[s]
            #Inner Loop For Views
            for t in range(n_views) :
                #t(th) view in class i
                ar_view_t=ar_class[t]
                doc_len_1=ar_view_s.shape[0]
                # for each document in s(th) view
                for p in range(doc_len_1) :
                    doc_len_2=ar_view_t.shape[0]
                    # for each document in s(th) view take a document in t(th) view
                    for q in range(doc_len_2) :
                        doc_1=ar_view_s[p]
                        doc_2=ar_view_t[q]
                        # The feature vectors are 1-D arrays of length n_features;
                        # reshape them to column vectors of shape (n_features, 1)
                        doc_1=doc_1.reshape(ar_view_s.shape[1],1)
                        doc_2=doc_2.reshape(ar_view_t.shape[1],1)
                        temp_prod=np.dot(doc_1,doc_2.T)
                        #print (temp_prod)
                        #print(temp_prod.shape)
                        #Keeping Summation in S_W
                        S_T_W_temp_class=S_T_W_temp_class+temp_prod
        #l(i) which represents the number of documents in class i
        l_i=n_views*n_docs
        S_T_W_temp_class=S_T_W_temp_class/(l_i*l_i)
        #print (S_W_temp_class.shape)
        S_T_W=S_T_W+S_T_W_temp_class


    #--------------Calculation of the S_B part ---------------
    #---------------------------------------------------------
    S_T_B=np.zeros((number_of_features,number_of_features))
    for i in range(n_class) :
        #Extracting all the view_doc records for a class i
        ar_class_i=class_view[i]
        n_views_i,n_docs_i,n_features_i=ar_class_i.shape
        for j in range(n_class) :
            # Pairs with i == j are skipped below; only distinct classes contribute
            S_T_B_temp_class=np.zeros((number_of_features,number_of_features))
            if i != j :
                #Since interclass Scatter proceed with different Classes
                ar_class_j=class_view[j]
                n_views_j,n_docs_j,n_features_j=ar_class_j.shape
                #For every view in class one
                for s in range(n_views_i) :
                    # S(th) view in class i
                    ar_view_s=ar_class_i[s]
                    #Inner Loop for views
                    for t in range(n_views_j) :
                        #t(th) view in class j
                        ar_view_t=ar_class_j[t]
                        #Number of Documents in View1
                        doc_len_1=ar_view_s.shape[0]
                        for p in range(doc_len_1) :
                            doc_len_2=ar_view_t.shape[0]
                            # for each document in s(th) view take a document in t(th) view
                            for q in range(doc_len_2) :
                                doc_1=ar_view_s[p]
                                doc_2=ar_view_t[q]
                                # The feature vectors are 1-D arrays of length n_features;
                                # reshape them to column vectors of shape (n_features, 1)
                                doc_1=doc_1.reshape(ar_view_s.shape[1],1)
                                doc_2=doc_2.reshape(ar_view_t.shape[1],1)
                                temp_prod=np.dot(doc_1,doc_2.T)
                                S_T_B_temp_class=S_T_B_temp_class+temp_prod
                #l(i) which represents the number of documents in class i
                l_i=n_views_i*n_docs_i
                #l(j) which represents the number of documents in class j
                l_j=n_views_j*n_docs_j
                S_T_B_temp_class=S_T_B_temp_class/(l_i*l_j)
                #print (S_W_temp_class.shape)
                S_T_B=S_T_B+S_T_B_temp_class

    #Final S_T calculation
    S_T=S_T_B+((n_class-1)*S_T_W)
    S_T=S_T/(2*(n_class)*(n_class-1))

    r1=10
    r2=10
    #Calculating R
    R=S_W-(r1*S_B)-(r2*S_T)

    print("the R is")
    print (R)
    return R
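
Each quadruple loop in `calculate_R` adds up the outer products of every document pair from two views. That double sum equals the outer product of the two summed feature vectors: sum_p sum_q x_p y_q^T = (sum_p x_p)(sum_q y_q)^T. The sketch below checks this identity on random data; `pairwise_outer_sum_loop` and `pairwise_outer_sum_fast` are illustrative names, and the array sizes are arbitrary.

```python
import numpy as np


def pairwise_outer_sum_loop(view_s, view_t):
    """Sum of outer products over all document pairs (explicit loops)."""
    total = np.zeros((view_s.shape[1], view_t.shape[1]))
    for x in view_s:
        for y in view_t:
            total += np.outer(x, y)
    return total


def pairwise_outer_sum_fast(view_s, view_t):
    """Same quantity via the outer product of the summed feature vectors."""
    return np.outer(view_s.sum(axis=0), view_t.sum(axis=0))


rng = np.random.default_rng(0)
view_s = rng.normal(size=(5, 4))  # 5 documents, 4 features
view_t = rng.normal(size=(6, 4))  # 6 documents, 4 features

same = np.allclose(pairwise_outer_sum_loop(view_s, view_t),
                   pairwise_outer_sum_fast(view_s, view_t))
print(same)
```

For one pair of views with n_s and n_t documents and d features, this replaces O(n_s * n_t * d^2) work with O((n_s + n_t) * d + d^2), which can make a large difference on the full training set.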