Name: Ujwal Kavalipati

# Task 1: Recommender System Challenge

In this task, we use implicit feedback to recommend a list of items to each user. We have 2174 items and 2239 users. The user interaction data contains '1' if the user interacted with the item and '0' if the user did not. **Implicit feedback** is inferred from user actions. The underlying assumption is that **“if a user clicked, viewed, or spent time on an item often, it is an indication of their preference for that item”**. Implicit feedback comes in many forms, such as **browsing, clicking, and typing**, whereas explicit feedback comes solely from some sort of rating scale.

### 1.Import the necessary libraries

In [320]:
#!pip install implicit
#!pip install scikit-learn
import implicit                                     #we will be using this package; it has various models for implicit feedback
from sklearn import metrics
import scipy.sparse as sparse
import pandas as pd
import numpy as np

### 2.Load the data into the notebook

In [117]:
train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')
valid_df=pd.read_csv('validation_data.csv')

In [396]:
#Let's look at the train data
train_df.head()


Unnamed: 0,user_id,item_id,rating
0,0,0,1
1,0,1,1
2,0,2,1
3,0,3,1
4,0,4,1


In [119]:
#Let's look at the validation data
valid_df.head()

Unnamed: 0,user_id,item_id,rating
0,0,43,1
1,0,1102,0
2,0,815,0
3,0,739,0
4,0,1637,0


### 3.Data Pre-processing

- As we can observe, the ratings in train_df are '1' and there are no '0' interactions, while valid_df contains both '1' and '0'. We merge train_df with valid_df and then build our model.

In [401]:
#merge the train and valid dataframes
grouped_df = pd.concat([train_df,valid_df])


In [402]:
#print the shape of grouped_df
grouped_df.shape

(252349, 3)

In [122]:
grouped_df.head()


Unnamed: 0,user_id,item_id,rating
0,0,0,1
1,0,1,1
2,0,2,1
3,0,3,1
4,0,4,1


### 4.Model Building

Now we create a sparse item-user (content-person) matrix, which is used for fitting the model. Later we use the sparse user-item matrix, sparse_person_content, to get the top recommendations for each user.



In [403]:
#convert to sparse matrix
sparse_content_person = sparse.csr_matrix((grouped_df['rating'], (grouped_df['item_id'], grouped_df['user_id'])))
sparse_person_content = sparse.csr_matrix((grouped_df['rating'], (grouped_df['user_id'], grouped_df['item_id'])))

In [404]:
#Let's print our sparse matrix
sparse_content_person

<2174x2239 sparse matrix of type '<class 'numpy.int64'>'
	with 247424 stored elements in Compressed Sparse Row format>

<p>We can see that the sparse_content_person matrix has 2174 rows (the items) and 2239 columns (the users); sparse_person_content is its transpose, with users as rows and items as columns.</p>
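The construction above can be tried on a toy example. This is a minimal sketch with made-up interaction triples (the ids and ratings below are assumptions, not the competition data); it shows how the (value, (row, column)) form of `csr_matrix` yields the item-by-user matrix and how its transpose gives the user-by-item view.

```python
import numpy as np
import scipy.sparse as sparse

# Toy interaction triples (hypothetical values, not the competition data).
user_ids = np.array([0, 0, 1, 2])
item_ids = np.array([0, 2, 1, 2])
ratings = np.array([1, 1, 1, 1])

# Rows are items, columns are users, as in sparse_content_person.
item_user = sparse.csr_matrix((ratings, (item_ids, user_ids)), shape=(3, 3))

# The transpose gives users as rows and items as columns.
user_item = item_user.T.tocsr()

print(item_user.toarray())
print(user_item.toarray())
```

Passing `shape` explicitly is optional; without it SciPy infers the shape from the largest row and column index.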

### 5.Model Fitting

In [406]:
alpha = 30 #the rate at which our confidence in a preference increases with more interactions
data = (sparse_content_person * alpha)

In [407]:
data[1,:].todense()

matrix([[30,  0,  0, ...,  0,  0,  0]], dtype=int64)

### Implement ALS
Alternating least squares (ALS) is a form of matrix factorization that approximates the user-item matrix with a much smaller number of dimensions, called latent or hidden features, and it does so in a computationally efficient way. More details are provided in the report.
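To make the alternating updates concrete, here is a minimal dense NumPy sketch of implicit-feedback ALS in the style of the Hu, Koren and Volinsky paper listed in the references. It is not the implicit package's implementation; the function name `implicit_als`, the toy matrix and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def implicit_als(R, factors=2, alpha=30.0, reg=0.1, iterations=10, seed=0):
    """Tiny dense ALS for implicit feedback (illustration only)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    P = (R > 0).astype(float)   # binary preference p_ui
    C = 1.0 + alpha * R         # confidence c_ui = 1 + alpha * r_ui
    ridge = reg * np.eye(factors)

    def solve(fixed, conf, pref):
        # Closed-form regularised least-squares solution for every row,
        # holding the other side's factors fixed.
        out = np.empty((conf.shape[0], fixed.shape[1]))
        for r in range(conf.shape[0]):
            weighted = conf[r][:, None] * fixed   # C^r applied to the fixed factors
            A = fixed.T @ weighted + ridge
            b = weighted.T @ pref[r]
            out[r] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve(Y, C, P)       # update users with items fixed
        Y = solve(X, C.T, P.T)   # update items with users fixed
    return X, Y

# Toy user-by-item interactions with two obvious groups (hypothetical data).
R = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)
X, Y = implicit_als(R)
scores = X @ Y.T
print(np.round(scores, 2))
```

Each half-step is an ordinary ridge regression, which is why the method scales well: every user (or item) row can be solved independently.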

In [408]:
#fit the model with als
model1 = implicit.als.AlternatingLeastSquares(factors=8, regularization=0.1, iterations=2500)
model1.fit(data)





In [409]:
#let us see how many items are allocated for every user in the test dataframe
test_df[test_df['user_id']==0]

Unnamed: 0,user_id,item_id
0,0,2158
1,0,2113
2,0,2070
3,0,2026
4,0,1948
...,...,...
95,0,135
96,0,123
97,0,117
98,0,81


Our requirement is to get the top 10 items for every user from the list of candidate items given for that user in the test data. The code below does that: we get the full ranking of recommendations for every user using the model.recommend() function from the implicit package, and select the first 10 items that appear under item_id for that user.

In [411]:
#create empty list to store items
its=[]

#create empty list to store users
usr=[]

#create initial start and end indexes for iterating the test dataframe to check whether the recommended items are in user's item list
start=0
end=100

#iterate through all user-ids
for i in range(len(test_df.user_id.unique())):
    #print(i)
    
    #get the recommendations for every user
    recommendations=model1.recommend(i,sparse_person_content,len(grouped_df.item_id.unique()),filter_already_liked_items=False)
    res_list=recommendations
    
    #select the item and put it in a list
    item_user = [x[0] for x in res_list]
    
    #initialise a counter to take only 10 items
    counter=0
    
    #iterate through the items
    for k in item_user:
        
        #check if the item is present in the item-list of that particular user
        if k in test_df.item_id.tolist()[start:end]:      
            
            #append the item-id and user-id to its and usr respectively
            its.append(k)
            usr.append(i)
            
            #increment the counter
            counter+=1
            #stopping condition
            if counter==10:
                break
    #update start and end indexes
    start=end
    end=end+100
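The loop above relies on every user having exactly 100 consecutive rows in test_df and rebuilds the item list on each membership check. The selection step itself can be sketched on toy data as follows; the helper name `top_k_from_candidates` and the sample lists are hypothetical, and a set is used so that each membership check is cheap.

```python
def top_k_from_candidates(ranked_items, candidates, k=10):
    """Keep the first k items of a ranked list that also appear in candidates."""
    allowed = set(candidates)  # set membership is O(1) on average
    picked = []
    for item in ranked_items:
        if item in allowed:
            picked.append(item)
            if len(picked) == k:
                break
    return picked

ranked = [7, 3, 9, 1, 4, 8, 2]   # hypothetical model ranking, best first
candidates = [1, 2, 3, 4]        # hypothetical candidate list for one user
print(top_k_from_candidates(ranked, candidates, k=3))  # [3, 1, 4]
```

With the real data, the candidate list for each user could be obtained by grouping test_df on user_id instead of slicing by fixed offsets, which would not depend on the 100-rows-per-user assumption.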
    
    
    

In [412]:
#create a dataframe with user and item list which we generated in the previous step
combina=list(zip(usr,its))
combina=pd.DataFrame(combina,columns=['user_id','item_id'])
    

In [413]:
#save it to a csv file
combina.to_csv('30309832.csv',index=False)

We achieved a Normalized Discounted Cumulative Gain (NDCG) score of 0.21418 on the Kaggle submission page.
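For reference, NDCG rewards placing relevant items near the top of the list. The sketch below uses the common definition with binary relevance and a log2 discount; the exact variant used by the Kaggle leaderboard is not stated in this notebook, so treat the function names and the choice of ideal ordering as assumptions.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# One relevant item placed second instead of first.
print(ndcg_at_k([0, 1, 0, 0]))  # about 0.63
```

A perfect ranking gives 1.0, and a list with no relevant items gives 0.0 under this convention.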

### Implement LogisticMF

Logistic matrix factorization (LMF) is a collaborative filtering model that learns the probability that a user likes an item. It is a probabilistic model for matrix factorization with implicit feedback. The model is simple to implement, highly parallelizable, and has the added benefit that it can model the probability that a user will prefer a specific item.
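The probability mentioned above can be written down directly: in the logistic MF formulation, the preference probability is the logistic function of the user-item dot product plus user and item biases. The sketch below only evaluates that formula; the function name `lmf_probability` and the factor values are hypothetical, and no training is performed.

```python
import math

def lmf_probability(x_u, y_i, b_u=0.0, b_i=0.0):
    """P(user prefers item) = sigmoid(x_u . y_i + b_u + b_i)."""
    logit = sum(a * b for a, b in zip(x_u, y_i)) + b_u + b_i
    return 1.0 / (1.0 + math.exp(-logit))

x_u = [0.5, -0.2]       # hypothetical user factors
y_liked = [1.0, 0.1]    # hypothetical item aligned with the user
y_other = [-0.8, 0.6]   # hypothetical item pointing away from the user

p_liked = lmf_probability(x_u, y_liked)
p_other = lmf_probability(x_u, y_other)
print(round(p_liked, 3), round(p_other, 3))
```

Training adjusts the factors and biases so that observed interactions receive high probabilities, with observed entries weighted more heavily through alpha.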

In [429]:
alpha = 15 #the rate at which our confidence in a preference increases with more interactions
data = (sparse_content_person * alpha)

In [430]:
data[1,:].todense()

matrix([[15,  0,  0, ...,  0,  0,  0]], dtype=int64)

In [431]:
model2 = implicit.lmf.LogisticMatrixFactorization(factors=5, regularization=0.1, iterations=1500)
model2.fit(data)





As with ALS, we need the top 10 items for every user from that user's candidate list in the test data. The code below gets the full ranking for every user using the model.recommend() function from the implicit package and keeps the first 10 items that appear under item_id for that user.

In [432]:
#create empty list to store items
items=[]

#create empty list to store users
users=[]

#create initial start and end indexes for iterating the test dataframe to check whether the recommended items are in user's item list
start=0
end=100

#iterate through all user-ids
for i in range(len(test_df.user_id.unique())):
    #print(i)
    
    #get the recommendations for every user
    recommendations=model2.recommend(i,sparse_person_content,len(grouped_df.item_id.unique()),filter_already_liked_items=False)
    res_list=recommendations
    
    #select the item and put it in a list
    item_user = [x[0] for x in res_list]
    
    #initialise a counter to take only 10 items
    counter=0
    
    #iterate through the items
    for k in item_user:
        
        #check if the item is present in the item-list of that particular user
        if k in test_df.item_id.tolist()[start:end]:      
            
            #append the item-id and user-id to items and users respectively
            items.append(k)
            users.append(i)
            
            #increment the counter
            counter+=1
            #stopping condition
            if counter==10:
                break
    #update start and end indexes
    start=end
    end=end+100
    
    
    
    

In [433]:
#Create a dataframe and zip the users,items which we created in previous step
combined=list(zip(users,items))
combined=pd.DataFrame(combined,columns=['user_id','item_id'])
    

In [434]:
#save it to csv for checking the NDCG score
combined.to_csv('lmfpred.csv',index=False)

We achieved a Normalized Discounted Cumulative Gain (NDCG) score of 0.16228 on the Kaggle submission page.

### Implement BPR

Bayesian Personalized Ranking (BPR) is a recommender model that learns a matrix factorization embedding by minimizing a pairwise ranking loss. It uses a generic optimization criterion, BPR-Opt, for personalized ranking, which is the maximum posterior estimator derived from a Bayesian analysis of the problem.
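The pairwise loss can be illustrated for a single (user, liked item, other item) triple. This is a minimal sketch of one stochastic gradient step on the BPR objective, not the implicit package's implementation; the function names, the factor values, the learning rate and the regularization strength are all assumptions.

```python
import numpy as np

def bpr_loss(u, yi, yj, reg=0.01):
    """-log sigmoid(u . (yi - yj)) plus L2 regularisation."""
    x_uij = u @ (yi - yj)
    return np.log1p(np.exp(-x_uij)) + 0.5 * reg * (u @ u + yi @ yi + yj @ yj)

def bpr_sgd_step(u, yi, yj, lr=0.1, reg=0.01):
    """One gradient-descent step that pushes item i above item j for user u."""
    x_uij = u @ (yi - yj)
    g = 1.0 / (1.0 + np.exp(x_uij))   # equals 1 - sigmoid(x_uij)
    new_u = u + lr * (g * (yi - yj) - reg * u)
    new_yi = yi + lr * (g * u - reg * yi)
    new_yj = yj + lr * (-g * u - reg * yj)
    return new_u, new_yi, new_yj

u = np.array([0.1, 0.2])     # hypothetical user factors
yi = np.array([0.3, -0.1])   # item the user interacted with
yj = np.array([-0.2, 0.4])   # item the user did not interact with

loss_before = bpr_loss(u, yi, yj)
u2, yi2, yj2 = bpr_sgd_step(u, yi, yj)
loss_after = bpr_loss(u2, yi2, yj2)
print(loss_before, loss_after)
```

Because the loss depends only on the difference between the two item scores, BPR optimizes the relative order of items rather than their absolute scores.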

In [435]:
alpha = 15 #the rate at which our confidence in a preference increases with more interactions
data = (sparse_content_person * alpha)

In [436]:
model3 = implicit.bpr.BayesianPersonalizedRanking(factors=16, regularization=0.1, iterations=1500)
model3.fit(data)





As in the previous steps, we need the top 10 items for every user from that user's candidate list in the test data. The code below gets the full ranking for every user using the model.recommend() function from the implicit package and keeps the first 10 items that appear under item_id for that user.

In [132]:
#create empty list to store items
its1=[]

#create empty list to store users
usr1=[]

#create initial start and end indexes for iterating the test dataframe to check whether the recommended items are in user's item list
start=0
end=100

#iterate through all user-ids
for i in range(len(test_df.user_id.unique())):
    #print(i)
    
    #get the recommendations for every user
    recommendations=model3.recommend(i,sparse_person_content,len(grouped_df.item_id.unique()),filter_already_liked_items=False)
    res_list=recommendations
    
    #select the item and put it in a list
    item_user = [x[0] for x in res_list]
    
    #initialise a counter to take only 10 items
    counter=0
    
    #iterate through the items
    for k in item_user:
        
        
        #check if the item is present in the item-list of that particular user
        if k in test_df.item_id.tolist()[start:end]:
            
            
            #append the item-id and user-id to its1 and usr1 respectively
            its1.append(k)
            usr1.append(i)
            
            #increment the counter
            counter+=1
            #stopping condition
            if counter==10:
                break
    #update start and end indexes
    start=end
    end=end+100
    
    

In [135]:
#Create a dataframe and zip the users,items which we created in previous step
comb=list(zip(usr1,its1))
comb=pd.DataFrame(comb,columns=['user_id','item_id'])

In [137]:
#save it to a csv file to check our NDCG score
#comb.to_csv('bpr-pred10.csv',index=False)

We achieved a Normalized Discounted Cumulative Gain (NDCG) score of 0.05214 on the Kaggle submission page.

### Ensembling ( ALS + LMF)

Let us now try ensembling the outputs of ALS and LMF and check whether it improves our NDCG. We create two models, LMF and ALS, then take the mean of the item scores returned by the model.recommend() function from the implicit package. We then select the top 10 items by mean score and check the NDCG score.

In [437]:
alpha = 30 #the rate at which our confidence in a preference increases with more interactions
data = (sparse_content_person * alpha)

In [438]:
#convert into dense matrix
data[1,:].todense()

matrix([[30,  0,  0, ...,  0,  0,  0]], dtype=int64)

In [439]:
#fit the model
model_ensmb1 = implicit.lmf.LogisticMatrixFactorization(factors=8, regularization=0.1, iterations=1500)
model_ensmb1.fit(data)





In [440]:
#fit the model with als
model_ensmb2 = implicit.als.AlternatingLeastSquares(factors=8, regularization=0.1, iterations=2500)
model_ensmb2.fit(data)





In [341]:
#create 2 lists for storing the per-user recommendation dictionaries of each model
dictlist1=[]
dictlist2=[]

#iterate through all the user-ids
for i in range(len(test_df.user_id.unique())):
    print(i)
    
    #get the recommendations as a dictionary and append to dictlist
    recommendations1=dict(model_ensmb1.recommend(i,sparse_person_content,len(grouped_df.item_id.unique()),filter_already_liked_items=False))
    recommendations2=dict(model_ensmb2.recommend(i,sparse_person_content,len(grouped_df.item_id.unique()),filter_already_liked_items=False))
    
    #append the recommendations
    dictlist1.append(recommendations1)
    dictlist2.append(recommendations2)



<p>Now we have 2 lists containing dictionaries. We have to iterate through them and find the top 10 items for each user. For example:</p>
<ul>
<li>user-id 0 has 2174 items returned from model_ensmb1</li>
<li>user-id 0 has 2174 items returned from model_ensmb2</li>
</ul>
<p>Then we take the average of the scores of each item, sort them in descending order and take the top 10 items. The code below does this for all users.</p>

In [395]:
#counter to select only top 10 items
counter=0

#create empty user and item lists
user=[]
item=[]

#create initial start and end indexes for iterating the test dataframe to check whether the recommended items are in user's item list
start=0
end=100

#iterating through the dictionary
for k in range(len(dictlist1)):
    
    #create a empty dictionary to perform operations and store new results.
    rec={}
    print(k)
    
    #finding sum of scores based on common keys i.e item_ids
    for key in dictlist1[k].keys():
        if key in dictlist2[k].keys():
            rec[key]=dictlist1[k][key]+dictlist2[k][key]
            
    #finding the mean of score        
    for key in rec:
         rec[key] /=  2
            
    #sorting the result dictionary in descending order
    result={n: v for n, v in sorted(rec.items(), key=lambda item: item[1],reverse=True)}
    
    #double checking if the returned result dictionary has all items 
    if len(result)<2174:
        break
        
    #iterating through all keys
    for i in result.keys():
        
        #check if the item is present in the item-list of that particular user
        if i in test_df.item_id.tolist()[start:end]:
            
            ##append the item-id and user-id to item and user respectively
            item.append(i)
            user.append(k)
            
            #increment the counter
            counter+=1
            
            #stopping condition
            if counter==10:
                break
    #updating the start and end indexes
    start=end
    end=end+100
    counter=0
    
    



In [399]:
#create a dataframe and zip the user and item
mix=list(zip(user,item))
mix=pd.DataFrame(mix,columns=['user_id','item_id'])
    

In [400]:
#save it to csv file
#mix.to_csv('mixmodel-10june.csv',index=False)

We achieved a Normalized Discounted Cumulative Gain (NDCG) score of 0.15196 on the Kaggle submission page.

### Comparison of different algorithms

| Model | NDCG |
|---|---|
| ALS | 0.21418 |
| LMF | 0.16228 |
| BPR | 0.05214 |
| ALS + LMF | 0.15196 |


    
We can see that ALS performed better than all the other models. I expected ensembling to give a better NDCG score, but it went down to 0.15196, below both ALS and LMF on their own. LMF performed better than BPR. BPR has the lowest score, which suggests that BPR, at least with these hyperparameters, is not well suited to this data.
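One factor worth checking, which was not verified in this notebook, is that the scores returned by two different models may be on different scales, in which case a plain mean is dominated by the model with the larger range. The sketch below compares a plain mean with a mean of min-max normalized scores on made-up numbers; the helper names and the score values are hypothetical.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0   # avoid division by zero when all scores are equal
    return {k: (v - lo) / span for k, v in scores.items()}

def average_scores(a, b, normalise=False):
    """Mean score per item over the items present in both dictionaries."""
    if normalise:
        a, b = min_max(a), min_max(b)
    return {k: (a[k] + b[k]) / 2 for k in a.keys() & b.keys()}

als_scores = {1: 0.9, 2: 0.2, 3: 0.8}   # hypothetical, small range
lmf_scores = {1: 2.0, 2: 6.0, 3: 5.0}   # hypothetical, large range

plain = average_scores(als_scores, lmf_scores)
scaled = average_scores(als_scores, lmf_scores, normalise=True)
best_plain = max(plain, key=plain.get)
best_scaled = max(scaled, key=scaled.get)
print(best_plain, best_scaled)  # 2 3
```

In this toy case the plain mean simply follows the model with the larger score range, while the normalized mean picks the item that both models rank highly. Whether this explains the result above would need to be tested on the real scores.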

# Task 2: Node Classification in Graphs

In [443]:
#import the necessary libraries
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB, GaussianNB,BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import sklearn.metrics as metrics
import sklearn.linear_model as linear_model
from nltk.corpus import stopwords
from nltk import word_tokenize    
from nltk.tokenize import RegexpTokenizer
from nltk.probability import *
import re
from nltk.tokenize import wordpunct_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from scipy.sparse import hstack
import sklearn.svm as svm
import networkx as nx
import re
import seaborn as sns
%matplotlib inline

In [444]:
#setting the random seeds for reproducibility
import numpy as np
np.random.seed(0)
import torch
torch.manual_seed(0)

<torch._C.Generator at 0x290d8907630>

docs.txt contains the document ids (node ids) and titles, and adjedges.txt contains the adjacency list for our graph.
<p>We will follow the steps below to perform the node classification:</p>
<ol>
<li>Read the graph from the adjacency list into networkx.</li>
<li>Train node2vec and get the vector for each node. Now we have a matrix X (where each row is a node vector).</li>
<li>Get the labels for each node. Now we have a vector Y.</li>
<li>Split (X, Y) into train and test sets: (Xtrain, Ytrain) and (Xtest, Ytest).</li>
<li>Build a classifier on (Xtrain, Ytrain).</li>
<li>Test on (Xtest, Ytest) and get the accuracy.</li>
</ol>

In [445]:
#read the file docs.txt to get the node ids and titles
#declare empty lists to store nodes and labels (here "labels" holds the titles)
nodes=[]
labels=[]
with open('docs.txt', encoding="utf8") as f1:
    for node in f1:
        txt=node

        #take the node id
        n=re.match(r'\d+',txt)

        #remove the node id and take the description (title)
        txt=re.sub(r'^\d+','',txt)

        #append node id and title to respective lists
        labels.append(txt.strip())
        nodes.append(n.group(0))
    
    

In [446]:
#read the adjacency file
with open('adjedges.txt','r') as file:
    adjlist=[line.strip() for line in file]

#build the networkx graph with the parse_adjlist function
g = nx.parse_adjlist(adjlist, nodetype = int)


In [447]:
#print number of edges
len(g.edges())

54183

In [448]:
#print number of nodes
len(g.nodes())

36928

### Node2Vec


<p>node2vec is an algorithm for generating vector representations of the nodes of a graph. The node2vec framework learns low-dimensional representations for nodes through random walks that start at a target node. These representations are useful for a variety of machine learning applications: besides reducing the feature-engineering effort, they can lead to greater predictive power. node2vec follows the intuition that random walks through a graph can be treated like sentences in a corpus: each node is treated like an individual word, and a random walk is treated as a sentence. By feeding these "sentences" into a skip-gram or continuous bag-of-words model, techniques developed for text documents can be used to learn the node embeddings.</p>
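The "walks as sentences" idea can be sketched without any library. The example below generates uniform random walks on a toy graph, which corresponds to node2vec with its return and in-out parameters both set to 1; the real algorithm biases the walks with those parameters. The adjacency dictionary and the function name `random_walk` are hypothetical.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Uniform random walk of at most walk_length nodes starting at start."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:   # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk

# Toy undirected graph as an adjacency dictionary (hypothetical).
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)

# Two walks per node, five nodes per walk.
walks = [random_walk(adjacency, node, 5, rng) for node in adjacency for _ in range(2)]

# Word2vec-style models expect tokens, so node ids become strings.
sentences = [[str(n) for n in walk] for walk in walks]
print(sentences[0])
```

These sentences are what a skip-gram model would be trained on, which is also why the node keys in the fitted model are strings.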

In [449]:
#run Node2Vec on our graph with below parameters
#graph: The first positional argument has to be a networkx graph. Node names must be all integers or all strings. On the output model they will always be strings.
#dimensions: Embedding dimensions (default: 128)
#walk_length: Number of nodes in each walk (default: 80)
#num_walks: Number of walks per node (default: 10)
#workers: Number of workers for parallel execution (default: 1)

from node2vec import Node2Vec

#pre-compute the probabilities and generate walks :
node2vec = Node2Vec(g, dimensions=64, walk_length=30, num_walks=100, workers=1)
#embed the nodes
model_nd = node2vec.fit(window=10, min_count=1, batch_words=4)


Computing transition probabilities: 100%|██████████████████████████████████████| 36928/36928 [00:06<00:00, 5856.13it/s]
Generating walks (CPU: 1): 100%|███████████████████████████████████████████████████| 100/100 [1:22:34<00:00, 49.55s/it]


In [450]:
#get the vectors for every node and append it to X
X=[]
for n in nodes:
    vec=model_nd.wv.get_vector(n)
    X.append(vec)
    

In [451]:
#get labels from labels.txt and append it to Y
Y=[]

with open('labels.txt','r') as fh:
    for line in fh:
        node=line

        #extracting the label
        node=re.sub(r'^\d+','',node)
        Y.append(node.strip())
    

In [452]:
#Do train-test split,training size is 20%,testing size is 80%
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.80, random_state=42)

In [453]:
#build the models and compute the performance metrics
models = [
    LogisticRegression(),
    BernoulliNB(),
    LinearSVC(penalty='l2', loss='squared_hinge',max_iter = 2500),
    RandomForestClassifier()
]

#iterating through the models
for clf in models:
    
    model_name = clf.__class__.__name__
    
    #fit the model
    clf.fit(X_train, y_train)
    print(model_name)
    
    # Do the prediction
    y_predict=clf.predict(X_test)
    
    #print the performance metrics
    print(confusion_matrix(y_test,y_predict))
    recall=recall_score(y_test,y_predict,average='macro')
    precision=precision_score(y_test,y_predict,average='macro')
    f1score=f1_score(y_test,y_predict,average='macro')
    accuracy=accuracy_score(y_test,y_predict)
    matthews = matthews_corrcoef(y_test,y_predict) 
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))

LogisticRegression
[[4039  180  115  351   21]
 [ 224 2470  258  196   32]
 [2630  281  367  211   10]
 [ 472  240  214 1238   29]
 [1109   65   37  160   27]]
Accuracy: 0.5436030982905983
Macro Precision: 0.4822715354173833
Macro Recall: 0.4647439004467066
Macro F1 score:0.43024913942123844
MCC:0.4126976983091574
BernoulliNB
[[1993  742 1007  959    5]
 [1116 1164  559  341    0]
 [1395  594  942  562    6]
 [ 406   76  169 1542    0]
 [ 597  212  292  294    3]]
Accuracy: 0.3768696581196581
Macro Precision: 0.34559060621397625
Macro Recall: 0.3528103445912386
Macro F1 score:0.3198814396149257
MCC:0.18554460314961677




LinearSVC
[[3995  255   43  412    1]
 [ 171 2628  150  231    0]
 [2647  366  242  244    0]
 [ 418  308   69 1398    0]
 [1120   82   20  175    1]]
Accuracy: 0.5518162393162394
Macro Precision: 0.5461373976293504
Macro Recall: 0.4765384394085162
Macro F1 score:0.42107184898535754
MCC:0.4296626615325506
RandomForestClassifier
[[3323  254  633  482   14]
 [  21 3023   15  120    1]
 [2082  389  768  260    0]
 [ 125  353   26 1687    2]
 [ 872  114  176  182   54]]
Accuracy: 0.5912793803418803
Macro Precision: 0.6203470687603116
Macro Recall: 0.5368265032721012
Macro F1 score:0.4965671978089629
MCC:0.4719005548212587


#### Observations from above result
As we can see, the accuracy scores are not very high:

- Logistic Regression: 54.36%
- Bernoulli Naive Bayes: 37.69%
- Linear SVC: 55.18%
- Random Forest: 59.13%
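The accuracy and macro recall reported above can be recomputed from the printed confusion matrix, which also shows where the errors come from. The sketch below uses the Logistic Regression matrix from the output above, with rows as true classes and columns as predicted classes (the scikit-learn convention); the variable names are illustrative.

```python
# Confusion matrix printed for LogisticRegression (rows: true, columns: predicted).
cm = [
    [4039, 180, 115, 351, 21],
    [224, 2470, 258, 196, 32],
    [2630, 281, 367, 211, 10],
    [472, 240, 214, 1238, 29],
    [1109, 65, 37, 160, 27],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall of a class = correctly predicted samples / all true samples of that class.
recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
macro_recall = sum(recalls) / len(recalls)

print(round(accuracy, 4), round(macro_recall, 4))
print([round(r, 3) for r in recalls])
```

The per-class recalls show that the third and fifth classes are mostly predicted as the first class, which is why the macro scores are well below the accuracy.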

<p>Let us use the information given to us about the nodes, i.e. the title of every node, and build a feature vector from it so that it can be used to build the model and improve the results.</p>

In [454]:
#print the labels
labels

['Assessing Local Institutional Capacity, Data Availability, and Outcomes by',
 'THE PROSPECTS FOR INTERNET TELEPHONY IN EUROPE AND LATIN AMERICA TPP 127 Telecom Modeling and Policy Analysis',
 'Economic Shocks, Safety Nets, and Fiscal Constraints: Social Protection for the Poor in Latin America',
 'Reform, Growth, and Poverty in Vietnam',
 'Households and Economic Growth in Latin America and the Caribbean',
 'Conference of European Statisticians',
 'grau de Mestre em Lógica Computacional supervision:',
 'World Economic Forum EDITORS',
 'The Mis-Marketing of Arabs in US Media Panel on Race, Markets, Economics and the Media',
 'Reflections on Poverty and Inequality in South Africa: Policy Considerations in an Emerging Democracy',
 'A Theory of Coalitions and Clientelism: Coalition Politics in Iceland 1945-2000',
 'Back NEW APPROACH ON TEACHING GEOTECHNOLOGY',
 'HISTORY OF COMPUTERS, ELECTRONIC COMMERCE, AND AGILE METHODS',
 'Back to the Future: The Potential in Infrastructure Privatizat

In [455]:
#Let's do some preprocessing on the labels so it is ready for conversion into features

import copy 

#create the stoplist which consists of all stopwords
stoplist = set(stopwords.words("english"))

#make a copy of the labels into titles
title=copy.deepcopy(labels)

#define a tokenizer
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?", gaps=False)


for i in range(len(title)):
    
    #convert the string to lowercase
    title[i]=title[i].lower()
    
    #tokenize the string
    title[i]=tokenizer.tokenize(title[i])
    nst=[]
    
    #remove the stopwords and append it to title
    for j in title[i]:
        if j not in stoplist:
            nst.append(j)
    title[i]=nst
    
    

<p>We remove numbers from our corpus, as they are unlikely to add useful weight to our feature vector, and we remove single-character tokens, as they carry no meaning.</p>

In [456]:
# Remove numbers, but not words that contain numbers.
title = [[token for token in doc if not token.isnumeric()] for doc in title]

# Remove words that are only one character.
title = [[token for token in doc if len(token) > 1] for doc in title]

In [458]:
#define a tokenizer to be used for lemmatisation

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

ML algorithms do not accept raw text as input, so we need to convert the titles into a numerical representation. We use the TF-IDF (Term Frequency-Inverse Document Frequency) representation. More details are provided in the report.
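As a rough illustration of what TF-IDF computes, the sketch below weights each term by its count in the document times a smoothed inverse document frequency, then scales each document vector to unit length. This follows the smoothed formula that scikit-learn documents as its default, but it is a simplified pure-Python version, not the library code; the function name `tfidf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy tokenised titles (hypothetical, loosely based on the titles shown above).
docs = [
    ["poverty", "growth", "vietnam"],
    ["growth", "latin", "america"],
    ["poverty", "latin", "america"],
]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "vietnam" appears in no other title and therefore gets a larger weight than "growth" or "poverty", which each appear in two titles.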

In [459]:
#Let's define a tf-idf vectorizer with analyser as word and set up the LemmaTokenizer()
vectorizer=TfidfVectorizer(analyzer='word',input='content',
                          tokenizer=LemmaTokenizer()
                           )

In [460]:
#change the format of title suitable for tf-idf vectorizer
inp = [" ".join(x) for x in title]

#apply fit_transform on the input corpus
lvecs=vectorizer.fit_transform(inp)


In [461]:
#convert X matrix which is list of arrays into a single array
X1=np.array(X)

In [462]:
#use hstack and stack the feature vectors of nodes and titles

Xfull=hstack((lvecs,X1),format='csr')

In [463]:
#now do the train-test split on new feature.

X_train, X_test, y_train, y_test = train_test_split(Xfull, Y, test_size=0.80, random_state=42)

In [464]:
##build the models and compute the performance metrics
models = [
    LogisticRegression(),
    BernoulliNB(),
    #LinearSVC(),
    LinearSVC(penalty='l2', loss='squared_hinge',max_iter = 2500),
    RandomForestClassifier()
]

#iterate through the models
for clf in models:
    model_name = clf.__class__.__name__
    
    #Fit the models
    clf.fit(X_train, y_train)
    print(model_name)
    
    # Do the prediction and print the performance metrics
    y_predict=clf.predict(X_test)
    print(confusion_matrix(y_test,y_predict))
    recall=recall_score(y_test,y_predict,average='macro')
    precision=precision_score(y_test,y_predict,average='macro')
    f1score=f1_score(y_test,y_predict,average='macro')
    accuracy=accuracy_score(y_test,y_predict)
    matthews = matthews_corrcoef(y_test,y_predict) 
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  _warn_prf(average, modifier, msg_start, len(result))


LogisticRegression
[[3930  112  407  194   63]
 [  96 2710  207  150   17]
 [ 547  222 2556  158   16]
 [ 177  181  177 1647   11]
 [ 586   54  166   99  493]]
Accuracy: 0.7569444444444444
Macro Precision: 0.7689758812445893
Macro Recall: 0.7042944874168283
Macro F1 score:0.7172212122632284
MCC:0.6828073950360547
BernoulliNB
[[4640    5   61    0    0]
 [1237 1555  388    0    0]
 [1492   79 1926    2    0]
 [1752   23  362   56    0]
 [1366    5   27    0    0]]
Accuracy: 0.5460069444444444
Macro Precision: 0.6075198894796496
Macro Recall: 0.4101895681462648
Macro F1 score:0.38344759772824794
MCC:0.4326858519660948




LinearSVC
[[3886   83  412  115  210]
 [  81 2810  154  117   18]
 [ 492  218 2628  116   45]
 [ 115  131  127 1807   13]
 [ 458   49  134   50  707]]
Accuracy: 0.7904647435897436
Macro Precision: 0.783644629491414
Macro Recall: 0.7580363516878839
Macro F1 score:0.7671092358823883
MCC:0.7271160595070962
RandomForestClassifier
[[3776  316  273  326   15]
 [   5 3089    2   84    0]
 [1143  472 1654  229    1]
 [  26  426    8 1733    0]
 [ 729  147   95  160  267]]
Accuracy: 0.7023904914529915
Macro Precision: 0.7601883101277032
Macro Recall: 0.6455397756796402
Macro F1 score:0.637239562398442
MCC:0.6195537427957958


Thus, we can see that the accuracies of the models have improved compared to the previous setting, as we have included the feature vectors built from the titles given for every node in docs.txt. We achieved the following accuracies:
- Logistic Regression: 75.69%
- Bernoulli Naive Bayes: 54.60%
- Random Forest: 70.24%
- Linear SVC: 79.05%

Thus, we can see that Linear SVC performed best, classifying nodes with about 79% accuracy. Therefore, if we add the titles as features instead of using only the node vectors, our classification accuracy improves.

### References
- https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
- https://implicit.readthedocs.io/en/latest/als.html
- http://yifanhu.net/PUB/cf.pdf
- https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf
- https://en.wikipedia.org/wiki/Node2vec