## Part 1:  Recommendation Systems

## 1.1 Introduction

Recommendation systems aim to ease decision making for users by producing a short list of items that fit each user’s interests. “Items” in this context may be movies on the Netflix website, products in an online shop or a supermarket, and so on.

Recommendation systems have reshaped how information is filtered, whether across a whole website or across all the items in a store: the selection is tailored to each user, which is called “personalized information” or “personalized recommendation”. This helps users make appropriate decisions, saves them time and enhances their shopping experience.

The main approaches to building recommender systems are:
1. Content-based
2. Collaborative filtering
3. Latent-factor based (matrix factorization, deep approaches)

Some books treat collaborative filtering and latent-factor models as the same family, while others list them separately.

**content-based**:

The assumption here is that the more similar an item’s description is to the user’s interests, the more likely the user is to find that item interesting. This approach works well for users with unique tastes, because it does not depend on the tastes of other users. However, it cannot recommend anything to a user with no history, and it is over-specialized: it never recommends items outside the user’s own history, because it ignores other users and their interests.

**collaborative-filtering**:

Collaborative filtering (CF) is the process of selecting information or patterns using techniques that involve collaboration among multiple agents, users, viewpoints, data sources, etc. Its main advantage is that we do not need additional information about the users or the content of the items: the users’ ratings or purchase history is the only information needed. There are two types of CF: (i) memory-based and (ii) model-based. In “memory-based” CF, recommendations are based on the previous ratings stored in a matrix that describes the user-item relations. Memory-based methods are either “user-based” or “item-based”. In user-based CF, users who rated items similarly in the past are likely to rate future items similarly. In item-based CF, items that have received similar ratings from users are likely to receive similar ratings from future users. In “model-based” CF, we assume that an underlying model governs how users rate items, and we learn that model from the data.
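
As a rough illustration of the user-based idea (not the method used later in this notebook), the sketch below scores the items one user has not interacted with from the interactions of similar users. The toy matrix and the helper names `cosine` and `user_based_scores` are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def user_based_scores(ratings, user):
    """Score every item the user has not interacted with.

    ratings: list of per-user lists of 0/1 interactions (toy data).
    Returns {item_index: similarity-weighted sum of other users' ratings}.
    """
    scores = {}
    for other, row in enumerate(ratings):
        if other == user:
            continue
        sim = cosine(ratings[user], row)
        for item, r in enumerate(row):
            if ratings[user][item] == 0:
                scores[item] = scores.get(item, 0.0) + sim * r
    return scores

# toy user-item matrix: rows are users, columns are items
toy_ratings = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
toy_scores = user_based_scores(toy_ratings, user=0)
best_item = max(toy_scores, key=toy_scores.get)
```

In this toy data, user 0 overlaps only with user 1, so the item that user 1 has and user 0 lacks gets the highest score.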

**Selecting Method**:-

We need to recommend items to users even when a user has no history with those items. Specifically, for each user we must recommend the top 10 item-ids out of 100 given item-ids, and the features of those items need not match the interests in the user’s profile. This is where a content-based model fails. So we use collaborative filtering, with the user-item history stored in matrix form. Because we recommend items to each user, we use the similarity between users to recommend items that the user has no history with; that is, we use a user-based collaborative filtering model.

**How to implement collaborative filtering method**:-

1. Packages such as the “surprise” library, which provides algorithms like KNN and Singular Value Decomposition, can be used to find the similarity between users/items and to implement user/item-based collaborative filtering.

2. The “implicit” package can be used for user/item-based collaborative filtering.

3. “Matrix Factorization” with neural networks.


### Usage of implicit feedback:-

I used the “implicit” package for this collaborative filtering task because only two ratings are present, 0 and 1: if a user has interacted with a particular item-id the rating is 1, otherwise it is 0. This matches “implicit feedback”, which is implied by users’ actions rather than stated by them. The underlying assumption is that “if a user often clicked on, viewed or spent time on an item, that indicates the user prefers the item”. Every such interaction is recorded as 1 and the absence of an interaction as 0. Our data cannot be treated as explicit feedback, because explicit ratings are usually given on a scale such as 0-5.

## 1.2 Libraries Used

- **implicit**:
Used for collaborative filtering on implicit feedback.

- **sklearn.metrics**:
sklearn.metrics is used to calculate evaluation metrics for a fitted model. In this notebook accuracy_score is imported from it.

- **numpy**:
NumPy is a Python package for scientific computing, used here for array operations.

In addition to the libraries mentioned above, other machine learning libraries are used to run the machine learning algorithms.

## 1.3 Importing Libraries

In [None]:
!pip install implicit
import implicit
from sklearn import metrics
import scipy.sparse as sparse
import pandas as pd
import numpy as np
import networkx as nx
!pip install node2vec
from node2vec import Node2Vec
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
import nltk.data
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from gensim.models import Word2Vec
from sklearn import model_selection, naive_bayes, svm
from statistics import mean
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
import scipy

### Reading data

In this task we are given three csv files: “train_data”, “validation_data” and “test_data”. “train_data” and “validation_data” contain interactions, in the form of ratings, between each user and some item-ids. Using the training data, we must recommend the top 10 item-ids out of the 100 item-ids given for each user in the test data. We combine “train_data” and “validation_data” into a single train_data because the original train_data contains only ratings of 1, whereas the validation data contains both 1 and 0, and the test candidates also include items the user has not interacted with. A model that never sees a rating of 0 fails to generalize to such items, so training on the combined data exposes it to both ratings. The data was split into training and validation sets only at the early stage of model development.

In [0]:
#reading train and validation data as train data  
train_dataa = pd.read_csv("train_data.csv")
val_dataa = pd.read_csv("validation_data.csv")
train_data = pd.concat([train_dataa, val_dataa])

In [0]:
#convert to sparse matrix by grouping item-id and user-id
sparse_item_user = sparse.csr_matrix((train_data['rating'].astype(float), (train_data['item_id'], train_data['user_id'])))
sparse_user_item = sparse.csr_matrix((train_data['rating'].astype(float), (train_data['user_id'], train_data['item_id'])))

In [0]:
sparse_item_user

<2174x2239 sparse matrix of type '<class 'numpy.float64'>'
	with 247424 stored elements in Compressed Sparse Row format>

In [0]:
sparse_user_item

<2239x2174 sparse matrix of type '<class 'numpy.float64'>'
	with 247424 stored elements in Compressed Sparse Row format>

In [0]:
#alpha scales the interactions into confidence values: the larger alpha is, the more weight an observed interaction gets
alpha = 20
data = (sparse_item_user * alpha).astype('double')

In [0]:
#viewing one row of the scaled sparse matrix in dense form
data[1,:].todense()

matrix([[20.,  0.,  0., ...,  0.,  0.,  0.]])

## Model 1:- Alternating Least Squares

ALS is an iterative optimization process: in every iteration we arrive at a closer factorized representation of the original data, so that we can recommend items to a particular user more accurately.

Parameters used in the als model:-

1. The implicit.als model takes three main parameters: factors, regularization and iterations.
2. factors is the number of latent factors.
3. regularization penalizes large factor values in the loss function, which reduces over-fitting.
4. iterations is the number of "als" iterations used to fit the data.
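
For reference, implicit-feedback ALS is usually described as minimizing the objective below, following the formulation by Hu, Koren and Volinsky. Whether the "implicit" package matches it term for term is an assumption here, and the symbols $x_u$, $y_i$, $p_{ui}$, $c_{ui}$ and $r_{ui}$ are notation introduced only for this note.

$$
\min_{x_*,\,y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2 + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr), \qquad c_{ui} = 1 + \alpha\, r_{ui}
$$

Here $r_{ui}$ is the recorded rating, $p_{ui}$ is 1 when $r_{ui} > 0$ and 0 otherwise, $x_u$ and $y_i$ are the user and item vectors whose length is factors, $\lambda$ is the regularization parameter and $\alpha$ is the confidence scale. Each iteration solves for all $x_u$ with the $y_i$ held fixed, then for all $y_i$ with the $x_u$ held fixed.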

In [0]:
#instantiating the model with the best hyper parameters obtained
model1 = implicit.als.AlternatingLeastSquares(factors = 8, regularization = 0.20, iterations = 2500) 



In [0]:
#fit the model
model1.fit(data)

100%|████████████████████████████████████████████████████████████████████████████| 2500.0/2500 [04:36<00:00,  9.76it/s]


In [0]:
#reading the test data
test_data = pd.read_csv("test_data.csv")

In [0]:
test_data.head()

Unnamed: 0,user_id,item_id
0,0,2158
1,0,2113
2,0,2070
3,0,2026
4,0,1948


In [0]:
#obtaining all the unique user-ids from test-data
test_user_ids = list(test_data['user_id'].unique())

Steps required to obtain recommended item-ids for each user:-

1. Create an empty dictionary.
2. For each unique user-id in the test data, get the recommendations over all item-ids for that user from the fitted model using the "model.recommend()" function.
3. model.recommend() is passed three arguments: (i) valuee, (ii) sparse_user_item, (iii) N. "valuee" is the user-id from the test data, "sparse_user_item" is the sparse matrix of users against all items, and "N" is the number of recommendations to return, ordered from the highest recommendation score to the lowest.
4. Keep only the recommended item-ids that appear among that user's item-ids in the test data, and from these take the top 10.
5. Store those item-ids in the dictionary under the user-id.
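
Steps 4 and 5 can be sketched on toy data without a fitted model. The helper name `top_n_from_candidates` and the toy values are assumptions made for illustration; the input is assumed to be (item-id, score) pairs already ordered by descending score.

```python
def top_n_from_candidates(ranked, candidates, n=10):
    """Keep the first n recommended item-ids that are in the candidate set.

    ranked: (item_id, score) pairs, already sorted by descending score.
    candidates: item-ids listed for the user in the test data.
    """
    allowed = set(candidates)  # set membership is O(1) per lookup
    return [item for item, _score in ranked if item in allowed][:n]

# toy ranked output for one user and that user's candidate item-ids
toy_ranked = [(7, 0.9), (3, 0.8), (5, 0.4), (1, 0.1)]
toy_top = top_n_from_candidates(toy_ranked, candidates=[1, 3, 5], n=2)
```

Item 7 has the highest score but is not a candidate, so it is skipped and the two best candidates are returned in score order.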

In [0]:
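dictt = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    #recommend() returns (item-id, score) pairs ordered by descending score
    itemss = [ll for ll in model1.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]
    #keep the item-ids of the first 10 matches
    dictt[valuee] = [i[0] for i in itemss[0:10]]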

In [0]:
#printing the obtained dictionary where key is the user-id and value is the 10 recommended item-ids
dictt

{0: [555, 558, 239, 150, 1862, 156, 1766, 695, 1363, 1128],
 1: [523, 305, 388, 984, 541, 361, 1253, 2031, 1523, 1446],
 2: [834, 1134, 1875, 1654, 95, 494, 476, 406, 2143, 1538],
 3: [1517, 1744, 149, 1930, 439, 1157, 2129, 1165, 2034, 1975],
 4: [291, 925, 514, 452, 16, 694, 819, 2002, 812, 1438],
 5: [674, 193, 902, 148, 471, 1379, 1633, 1106, 293, 1921],
 6: [1279, 1775, 164, 1951, 1685, 821, 1733, 566, 179, 572],
 7: [196, 749, 506, 1434, 988, 753, 1685, 384, 498, 342],
 8: [374, 172, 1294, 1761, 1698, 184, 2027, 509, 203, 891],
 9: [844, 1232, 1323, 538, 2169, 1111, 2108, 1357, 1948, 476],
 10: [790, 1678, 400, 1231, 172, 79, 846, 1844, 37, 2089],
 11: [1561, 1608, 1562, 1457, 1883, 1592, 725, 758, 1581, 1534],
 12: [1529, 1098, 2005, 778, 87, 1292, 1769, 465, 1210, 250],
 13: [48, 674, 1793, 959, 132, 689, 5, 881, 1669, 1095],
 14: [1622, 1317, 33, 780, 1762, 595, 1861, 73, 1599, 1520],
 15: [756, 1217, 92, 496, 556, 1703, 324, 332, 1530, 753],
 16: [971, 1544, 610, 1543, 1691, 

In [0]:
#storing the obtained dictionary values in a data frame
#rows are collected first because DataFrame.append was removed in pandas 2.0
rows = [{'user_id': key, 'item_id': i} for key, l in dictt.items() for i in l]
dfObj1 = pd.DataFrame(rows, columns=['user_id', 'item_id'])

In [0]:
dfObj1.head()

Unnamed: 0,user_id,item_id
0,0,555
1,0,558
2,0,239
3,0,150
4,0,1862


In [0]:
#finally storing the dataframe into a csv file
dfObj1.to_csv("30343275.csv", index = False)

Only the dataframe obtained above is stored in a csv file, because this model gave the best accuracy score when its recommended item-ids for the test-data users were submitted to "kaggle".

After tuning, the best hyper-parameters obtained are alpha = 20, factors = 8, regularization = 0.20, iterations = 2500.

Why do we tune our hyper-parameters? Hyper-parameters must be tuned so that the model fits the data properly and performs well without suffering from under-fitting or over-fitting.

- **Advantages**:-

1. Works for any kind of item.

2. Can recommend an item to a user even if the user has no previous history with that item.

3. Through the iterative steps, good factorized features can be obtained for both users and items.

- **Disadvantages**:-

1. If no ratings for a particular item have been recorded in the matrix, that item cannot be recommended to any user.

2. Cannot recommend items well to someone with unique tastes.

3. The pair-wise ranking loss can be high, because the model does not optimize the order of the recommended item-ids for each user, which matters in our task.


## Model 2 - Bayesian Personalized Ranking 

“It is a recommender model that learns a matrix factorization embedding based on minimizing the pairwise ranking loss”. Its main task is to provide each user with a ranked list of items. The ranking is inferred from the users’ implicit behavior, represented in matrix factorization (MF) form, and an optimization method is then used to optimize the ranking of the items for each user.

### How does the bpr model differ from the als model?

The als model does not optimize the ranking of items for each user, whereas bpr does. bpr is therefore expected to work better at improving NDCG (Normalized Discounted Cumulative Gain).

Parameters used in bpr model:-

1. The implicit.bpr model takes three main parameters: factors, regularization and iterations.
2. factors is the number of latent factors.
3. regularization penalizes large factor values in the loss function, which reduces over-fitting.
4. iterations is the number of training iterations used to fit the data.
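
To make the pairwise ranking loss concrete, the sketch below computes the BPR loss for a single (user, preferred item, other item) triple from latent vectors. The function name and the toy vectors are assumptions, the regularization term is left out, and this is not the code the "implicit" package runs internally.

```python
import math

def bpr_pair_loss(user_vec, pos_item_vec, neg_item_vec):
    """Negative log-sigmoid of the score difference for one (user, i, j) triple."""
    score_pos = sum(u * v for u, v in zip(user_vec, pos_item_vec))
    score_neg = sum(u * v for u, v in zip(user_vec, neg_item_vec))
    diff = score_pos - score_neg
    return math.log(1.0 + math.exp(-diff))  # equals -log(sigmoid(diff))

# the loss is small when the interacted item already scores higher ...
loss_good_order = bpr_pair_loss([1.0, 0.5], [1.0, 1.0], [0.0, 0.0])
# ... and large when the two items are ranked the wrong way round
loss_bad_order = bpr_pair_loss([1.0, 0.5], [0.0, 0.0], [1.0, 1.0])
```

Minimizing this quantity over many sampled triples pushes items the user interacted with above items the user did not interact with, which is why the model targets ranking quality directly.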

In [0]:
model2 = implicit.bpr.BayesianPersonalizedRanking(factors = 8, regularization = 0.20, iterations = 2500) #instantiating bpr model
model2.fit(data) #fit the model

100%|███████████████████████████████████████████████| 2500/2500 [09:40<00:00,  3.62it/s, correct=50.13%, skipped=5.22%]


The recommended item-ids for each user are obtained with the same five steps as for Model 1, this time calling model2.recommend() and storing the result in dictt1.

In [0]:
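dictt1 = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    #recommend() returns (item-id, score) pairs ordered by descending score
    itemss = [ll for ll in model2.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]
    #keep the item-ids of the first 10 matches
    dictt1[valuee] = [i[0] for i in itemss[0:10]]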

In [0]:
#storing the obtained dictionary values in a data frame
#rows are collected first because DataFrame.append was removed in pandas 2.0
rows = [{'user_id': key, 'item_id': i} for key, l in dictt1.items() for i in l]
dfObj2 = pd.DataFrame(rows, columns=['user_id', 'item_id'])

- **Advantages**:-

1. Improves NDCG.

2. BPR makes it possible to use gradient descent to optimize a smooth surrogate of the non-differentiable AUC.

3. BPR combined with other implicit models can give better ranking results.


- **Disadvantages**:-

1. On its own, BPR did not give the best results on our data.

2. If no ratings for a particular item have been recorded in the matrix, that item cannot be recommended to any user.

3. Cannot recommend items well to someone with unique tastes.

## Model 3 - LogisticMatrixFactorization

“It is a collaborative filtering recommender model that learns a probabilistic distribution of whether a user likes an item or not”. A user who likes an item has interacted with it and the rating is 1; otherwise the rating is 0. In our task only these two ratings are present for each user and item-id, so this kind of implicit recommendation model should suit the task well.

Parameters used in lmf model:-

1. The implicit.lmf model takes three main parameters: factors, regularization and iterations.
2. factors is the number of latent factors.
3. regularization penalizes large factor values in the loss function, which reduces over-fitting.
4. iterations is the number of training iterations used to fit the data.
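
The probabilistic view can be sketched as follows: logistic matrix factorization is usually described as passing the dot product of the user and item vectors, plus bias terms, through a logistic function. The function name, the bias arguments and the toy vectors below are assumptions made for illustration.

```python
import math

def interaction_probability(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Logistic function of the dot product plus the bias terms."""
    logit = sum(u * v for u, v in zip(user_vec, item_vec)) + user_bias + item_bias
    return 1.0 / (1.0 + math.exp(-logit))

# a zero user vector carries no information, so the probability is 0.5
p_neutral = interaction_probability([0.0, 0.0], [1.0, 1.0])
# aligned user and item vectors give a probability close to 1
p_likely = interaction_probability([1.0, 2.0], [1.0, 1.0])
```

Because the output lies between 0 and 1, it can be compared directly with the 0/1 ratings in our data.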

In [0]:
model3 = implicit.lmf.LogisticMatrixFactorization(factors = 8, regularization = 0.20, iterations = 2500) #instantiating lmf model
model3.fit(data) #fit the model


The recommended item-ids for each user are obtained with the same five steps as for Model 1, this time calling model3.recommend() and storing the result in dictt2.

In [0]:
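dictt2 = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    #recommend() returns (item-id, score) pairs ordered by descending score
    itemss = [ll for ll in model3.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]
    #keep the item-ids of the first 10 matches
    dictt2[valuee] = [i[0] for i in itemss[0:10]]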

In [0]:
#storing the obtained dictionary values to a dataframe
#rows are collected first because DataFrame.append was removed in pandas 2.0
rows = [{'user_id': key, 'item_id': i} for key, l in dictt2.items() for i in l]
dfObj3 = pd.DataFrame(rows, columns=['user_id', 'item_id'])

- **Advantages**:-

1. It models the probability of an interaction with a logistic function, so it suits binary (0/1) data such as ours.

- **Disadvantages**:-

1. It is a poor fit when the feedback is not binary.

2. If no ratings for a particular item have been recorded in the matrix, that item cannot be recommended to any user.

3. Cannot recommend items well to someone with unique tastes.


## Model 4 - als + lmf

After trying the individual methods, I combined two different models, als and lmf. Why? It is difficult to say in advance that one particular collaborative filtering model works best for our data. One model may recommend the right item-ids for some users, while another model does so for other users. Combining the two models by averaging their scores can therefore improve performance, although this is not guaranteed.

**Implementation:**

The first model, model4_als, is implicit.als.AlternatingLeastSquares(factors = x, regularization = x, iterations = x), which is fitted to our data. We then use the fitted model to score each item for each user and store the result in one dictionary. The second model, model4_lmf, is implicit.lmf.LogisticMatrixFactorization(factors = x, regularization = x, iterations = x), which is fitted to the same data, and its scores are stored in another dictionary. In both dictionaries the key is the user-id and the value is a list of tuples, where the first element of each tuple is the item-id and the second is the recommendation score. For each user we then average the two models’ scores for each item-id, take the 10 item-ids with the highest average, and recommend them to the user.

In [0]:
model4_als = implicit.als.AlternatingLeastSquares(factors = 20, regularization = 0.01, iterations = 50) #instantiating the model1 of combined model



In [0]:
model4_als.fit(data) #fitting the model1 of combined model


In [0]:
#obtaining for each user, the recommendation score of each candidate item-id
dictt3 = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    dictt3[valuee] = [ll for ll in model4_als.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]

In [0]:
model4_lmf = implicit.lmf.LogisticMatrixFactorization(factors = 20, regularization = 0.01, iterations = 50) #instantiating the model2 of combined model

In [0]:
model4_lmf.fit(data) #fitting the model2 of combined model


In [0]:
#obtaining for each user, the recommendation score of each candidate item-id
dictt4 = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    dictt4[valuee] = [ll for ll in model4_lmf.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]

In [0]:
#averaging the scores of the two models for each item-id of each user and keeping the top 10 item-ids
dictt5 = {}
for key in dictt3:
    dict1 = dict(dictt3[key]) #item-id -> als score
    dict2 = dict(dictt4[key]) #item-id -> lmf score
    #average the scores of the item-ids scored by both models
    l3 = [(k, (dict1[k] + dict2[k])/2) for k in dict1 if k in dict2]
    #sort by the averaged score, highest first (list.sort() works in place and returns None)
    l3.sort(key=lambda pair: pair[1], reverse=True)
    dictt5[key] = [k for k, score in l3[0:10]]

In [0]:
#storing the finally obtained dictionary into a dataframe
#rows are collected first because DataFrame.append was removed in pandas 2.0
rows = [{'user_id': key, 'item_id': i} for key, l in dictt5.items() for i in l]
dfObj4 = pd.DataFrame(rows, columns=['user_id', 'item_id'])

- **Advantages**:-

1. The main advantage of combining two collaborative filtering models is that it can boost the performance of the built model and improve accuracy on our data.

- **Disadvantages**:-

1. It is computationally expensive and takes a lot of time.

2. It does not always give the best results.

## Model 5 - als + bpr

**Implementation:** 

The first model, model5_als, is implicit.als.AlternatingLeastSquares(factors = x, regularization = x, iterations = x), which is fitted to our data. We then use the fitted model to score each item for each user and store the result in one dictionary. The second model, model5_bpr, is implicit.bpr.BayesianPersonalizedRanking(factors = x, regularization = x, iterations = x), which is fitted to the same data, and its scores are stored in another dictionary. In both dictionaries the key is the user-id and the value is a list of tuples, where the first element of each tuple is the item-id and the second is the recommendation score. For each user we then average the two models’ scores for each item-id, take the 10 item-ids with the highest average, and recommend them to the user.

In [0]:
model5_als = implicit.als.AlternatingLeastSquares(factors = 20, regularization = 0.01, iterations = 50) #instantiating the model

In [0]:
model5_als.fit(data) #fitting the model


In [0]:
#obtaining for each user, the recommendation score of each candidate item-id
dictt6 = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    dictt6[valuee] = [ll for ll in model5_als.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]

In [0]:
model5_bpr = implicit.bpr.BayesianPersonalizedRanking(factors = 20, regularization = 0.01, iterations = 50) #instantiating the model

In [0]:
model5_bpr.fit(data) #fitting the model

In [0]:
#obtaining for each user, the recommendation score of each candidate item-id
dictt7 = {}
for valuee in test_user_ids:
    #item-ids listed for this user in the test data, looked up once per user
    candidates = set(test_data[test_data['user_id'] == valuee]['item_id'].values)
    dictt7[valuee] = [ll for ll in model5_bpr.recommend(valuee, sparse_user_item, 2174) if ll[0] in candidates]

In [0]:
#averaging the scores of the two models for each item-id of each user and keeping the top 10 item-ids
dictt8 = {}
for key in dictt6:
    dict1 = dict(dictt6[key]) #item-id -> als score
    dict2 = dict(dictt7[key]) #item-id -> bpr score
    #average the scores of the item-ids scored by both models
    l3 = [(k, (dict1[k] + dict2[k])/2) for k in dict1 if k in dict2]
    #sort by the averaged score, highest first, so the slice below keeps the best 10
    l3.sort(key=lambda pair: pair[1], reverse=True)
    dictt8[key] = [k for k, score in l3[0:10]]

In [None]:
#storing the finally obtained dictionary into a dataframe
#rows are collected first because DataFrame.append was removed in pandas 2.0
rows = [{'user_id': key, 'item_id': i} for key, l in dictt8.items() for i in l]
dfObj5 = pd.DataFrame(rows, columns=['user_id', 'item_id'])

- **Advantages**:-

1. The main advantage of combining two collaborative filtering models is that it can boost the performance of the built model and improve accuracy on our data.

- **Disadvantages**:-

1. It is computationally expensive and takes a lot of time.

2. It does not always give the best results.

### Analysis of different models in combination:-

1. The “als” method gave the best accuracy among all the models tried on our data.

2. “bpr” did not give the best results. It only minimizes the ranking loss over each user’s item-ids through its optimization process, rather than repeatedly refining the features that capture user interests, so it works better when combined with other models.

3. “lmf” should in principle work well, because every rating is 0 or 1 and the model is built on logistic regression. Proper hyper-parameter tuning, as was done for als, might give better results.

4. Combining two models often improves accuracy, although not always; here the combination still gave a good accuracy of 20%.

Finally, among all the methods I tried with different tuning, “als” gave good accuracy.


### Why does changing the parameters affect the performance of the model?

- **Alpha**:-
The higher the value of alpha, the more confidence we place in the observed interactions. Too high a value can make the model over-fit, so that test-data accuracy decreases. Alpha should be chosen so that the model neither under-fits nor over-fits, that is, it has neither high bias nor high variance.

- **Factors**:-
Increasing or decreasing the number of latent factors can either raise or lower accuracy: too few factors can under-fit and too many can over-fit. The number of factors should be chosen so that the model neither under-fits nor over-fits.

- **Regularization**:-
Regularization is introduced to reduce over-fitting. A lower value lets the model fit the training data more closely, which can raise accuracy but also increases variance, while a value that is too high causes under-fitting. The regularization value should be chosen so that the model neither under-fits nor over-fits.

- **Iterations**:-
Increasing the number of iterations generally increases accuracy, because the model learns the data better, but too many iterations can lead to over-fitting. The number of iterations should be chosen so that the model neither under-fits nor over-fits.
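
One common way to pick such values is an exhaustive grid search. The sketch below is a generic version under stated assumptions: `evaluate` is a hypothetical callable that would fit a model with the given hyper-parameters and return a validation score, and `toy_evaluate` is a stand-in so that the example runs without data. It is not the tuning procedure used in this notebook.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_params, best_score) over all combinations in grid.

    evaluate: hypothetical callable taking keyword hyper-parameters and
              returning a validation score (higher is better).
    grid: dict mapping parameter name -> list of values to try.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# stand-in scoring function so the sketch runs without data;
# in practice it would fit a model and score it on held-out interactions
def toy_evaluate(factors, regularization):
    return -abs(factors - 8) - abs(regularization - 0.2)

toy_best, toy_score = grid_search(
    toy_evaluate, {"factors": [4, 8, 16], "regularization": [0.01, 0.2, 1.0]}
)
```

Scoring on interactions that were held out of training is what reveals over-fitting: a setting that only improves the fit to the training data will not improve the held-out score.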


## Task 2 - Node Classification

For this task we are given three “txt” files: “docs.txt”, “adjedges.txt” and “labels.txt”. Each line of “docs.txt” holds a node_id followed by the title of that node. Each line of “adjedges.txt” holds a node_id followed by the list of node_ids that are its neighbours. Each line of “labels.txt” holds a node_id followed by the label of that node. Our task is to build a graph from these node_ids and then classify the nodes. Several approaches could be applied, for example node2vec, DeepWalk, spectral clustering, word2vec plus node2vec, or TF-IDF plus node2vec. In this task, I implemented node2vec, word2vec, node2vec plus word2vec, and TF-IDF plus node2vec.

## Model 1 - Node2vec Algorithm

Node2vec is one of the graph embedding techniques we can use in our task. As with text in NLP, a graph cannot be fed directly to a classifier, so each node is first transformed into a feature vector, which acts as the input to classification algorithms such as SVC, logistic regression or random forest. Node2vec works similarly to “word2vec”: it samples random walks over the graph with a biased sampling strategy and treats each walk like a sentence of node-ids. In this first model, only the node-ids and their edges are used to build the embeddings.

**Implementation**:- 

Before running node2vec, we build the graph with “networkx” (a Python package for creating and manipulating graphs) from the node-ids and edges in the adjedges.txt file. This graph is then passed to Node2Vec, which converts each node into its vector form.
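
To illustrate what building the graph from the adjacency file involves, here is a small pure-Python sketch that does not use networkx. The two sample lines are hypothetical and only mimic the "node_id neighbour neighbour ..." format described above. It also shows one plausible reason why the graph can end up with more nodes than the file has lines: a neighbour becomes a node even when it has no line of its own.

```python
# Two hypothetical lines in the adjedges.txt format.
sample_lines = [
    "a b c",
    "b a",
]

adjacency = {}
for line in sample_lines:
    node, *neighbours = line.split()
    adjacency[node] = neighbours

# An undirected edge is the same in both directions, so store it as a frozenset.
edges = {frozenset((node, nb)) for node, nbs in adjacency.items() for nb in nbs}

# Every endpoint becomes a node, even without its own line ("c" here).
nodes = set(adjacency) | {nb for nbs in adjacency.values() for nb in nbs}

print(sorted(nodes))
print(len(edges))
```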

First, we pre-compute the transition probabilities and generate the walks using, for example, node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4); the run recorded below uses num_walks=75 and workers=1. We then fit the embedding model on these walks to obtain a vector representation of each node, and pass these vectors to our machine learning classification models, such as SVM, random forest and Gaussian naive Bayes.


In [0]:
#reading adjedges.txt and storing each node_id with its list of neighbours in a dictionary
dictt = {}
with open('adjedges.txt', mode='r') as data_file:
    lines = [data_file.readline().strip() for _ in range(18720)]  #first 18720 lines
for line in lines:
    parts = line.split(" ")
    dictt[parts[0]] = parts[1:]

In [0]:
#creating an undirected graph with the respective nodes and edges
G = nx.Graph()  #nx.Graph is already undirected
for key, value in dictt.items():
    G.add_node(key)
    for neighbour in value:
        G.add_edge(key, neighbour)

In [12]:
print(G.nodes()) #printing node_ids of the graph

['12828558', '66779408', '38902949', '38998399', '33450563', '26547200', '57470294', '20968604', '12016981', '59453341', '54791317', '9564967', '14589786', '38124547', '44569330', '4289706', '48119811', '34733903', '67315895', '47074959', '36354718', '38323242', '40935482', '77107975', '38994139', '48871207', '75049959', '68339148', '62098573', '66337421', '78106693', '47899676', '58164272', '76914035', '71555172', '6909690', '56473434', '47982561', '26890304', '64372150', '16864792', '9684893', '47949816', '47887794', '34534740', '16482894', '27917678', '44783549', '24245729', '33207680', '66903997', '42441879', '20060161', '54709047', '9633322', '32043684', '45159388', '54592233', '17112646', '17381885', '33338932', '41499729', '39772251', '71431551', '34646799', '72286508', '39750483', '30054787', '63059055', '62862974', '73820982', '73329207', '3895304', '62788245', '52334001', '33043298', '28330077', '63065670', '19535575', '76772526', '41924527', '27845256', '14960319', '20894906

In [13]:
print(G.edges()) #printing the edges of the graph, i.e. source and destination

[('38902949', '38998399'), ('38998399', '23801630'), ('38998399', '63525655'), ('38998399', '13157756'), ('38998399', '14987799'), ('33450563', '26547200'), ('26547200', '55648908'), ('26547200', '58388309'), ('26547200', '58235014'), ('26547200', '33526181'), ('26547200', '23902004'), ('26547200', '24273779'), ('26547200', '23704113'), ('26547200', '23725565'), ('26547200', '71380922'), ('26547200', '57274569'), ('26547200', '58386106'), ('26547200', '38850445'), ('26547200', '71526650'), ('26547200', '21109490'), ('26547200', '62522021'), ('57470294', '20968604'), ('54791317', '9564967'), ('54791317', '14589786'), ('14589786', '38735655'), ('14589786', '66662911'), ('14589786', '78260037'), ('4289706', '48119811'), ('48119811', '34534740'), ('48119811', '37254616'), ('48119811', '8193855'), ('67315895', '47074959'), ('67315895', '36354718'), ('67315895', '38323242'), ('67315895', '40935482'), ('67315895', '77107975'), ('67315895', '38994139'), ('67315895', '48871207'), ('67315895', '

Node2vec’s sampling strategy is controlled here by four arguments:

(i)   Number of walks:- The number of random walks generated from each node in the graph.

(ii)  Walk length:- The number of nodes in each random walk.

(iii) Dimensions:- The size of the embedding vector learned for each node-id.

(iv)  Workers:- The number of parallel workers used to generate the walks. A higher number of workers makes the algorithm run faster on a multi-core machine.
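
The walk generation that these arguments control can be sketched in pure Python. This is an illustrative sketch and not the implementation inside the node2vec package: `biased_walk`, `generate_walks` and the toy graph are hypothetical names, and the return parameter `p` and in-out parameter `q` come from the node2vec algorithm but are not set in this notebook. With p = q = 1, as here, every neighbour is equally likely.

```python
import random


def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=random):
    """One node2vec-style second-order random walk over an adjacency dict."""
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbours = adj.get(current, [])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # the first step is uniform
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in adj.get(previous, []):
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, num_walks, walk_length, seed=0):
    """Start num_walks walks of up to walk_length nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in adj:
            walks.append(biased_walk(adj, node, walk_length, rng=rng))
    return walks


toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = generate_walks(toy_graph, num_walks=2, walk_length=5)
print(len(walks))  # 2 walks for each of the 4 nodes
print(walks[0])
```

The resulting walks play the role of sentences: a word2vec-style model is trained on them so that nodes appearing in similar walks receive similar vectors.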

In [16]:
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=75, workers=1) #pre-computing the probabilities
model = node2vec.fit(window=10, min_count=1, batch_words=4) #embedding the nodes into the vector

Computing transition probabilities: 100%|██████████| 36928/36928 [00:08<00:00, 4215.99it/s]
Generating walks (CPU: 1): 100%|██████████| 75/75 [44:43<00:00, 35.78s/it]


In [17]:
#as a check, print the vector of the first node-id
model.wv.get_vector("12828558")

array([ 4.5960611e-03,  5.7906436e-04,  1.5437832e-03,  3.4803490e-04,
        5.2387519e-03, -2.7420567e-03, -8.7501860e-04,  3.1773681e-03,
       -1.1597424e-03,  3.8761730e-04, -5.3570964e-03,  7.2477749e-03,
       -7.0840516e-03,  7.6645892e-03, -5.0894096e-03,  4.3122969e-03,
       -5.2266233e-03,  5.0649475e-03,  5.7480740e-03,  8.6982349e-05,
       -2.1578781e-03,  7.0055351e-03, -5.0420598e-03, -6.7678974e-03,
        2.5218097e-03,  3.3482136e-03,  6.2392252e-03, -5.8576367e-03,
       -4.4650133e-03, -3.8355414e-03,  2.6552084e-06, -9.4005588e-04,
       -3.9095853e-06,  7.1006073e-03, -1.0933195e-03,  7.0058345e-03,
        7.6029953e-03, -6.5379497e-03, -2.0815993e-03, -1.8234686e-03,
        5.3840596e-03,  4.6385829e-03,  3.5011037e-03,  6.4560273e-03,
        3.4903050e-03,  4.1964059e-03, -1.3564621e-03,  2.4941489e-03,
       -4.6178121e-03, -2.9105351e-03, -7.8487105e-04, -7.1005814e-04,
        1.9253698e-03,  6.1125946e-03,  7.3391949e-03, -2.5289569e-03,
      

In [0]:
#reading labels.txt and storing each node_id with its label in a dictionary
dictt1 = {}
with open('labels.txt', mode='r') as data_file:
    lines = [data_file.readline().strip() for _ in range(18720)]  #first 18720 lines
for line in lines:
    parts = line.split(" ")
    dictt1[parts[0]] = parts[1]

In [0]:
'''
Traverse the node-ids of the graph and keep only those that have a label,
because splitting the data into train and test requires the label of every node.
'''
l1 = [valuee for valuee in G.nodes() if valuee in dictt1]

In [0]:
#for each labeled node-id, get the vector representation learned by node2vec
l2 = []
for valuee in l1:
    l2.append(list(model.wv.get_vector(valuee)))

In [0]:
#obtaining the labels of the labeled node-ids (every node-id in l1 has a label)
labels_l1 = [dictt1[valuee] for valuee in l1]

In [0]:
#pairing each label with the node2vec vector of its node-id; the [label, vector] pairs are stored in a list
nodes_labels_vectors = [list(x) for x in zip(labels_l1, l2)]

In [0]:
#putting these values into a dataframe
dataframee_nodes_labels_vectors = pd.DataFrame(nodes_labels_vectors, columns=['Labels','Vectors'])

In [24]:
dataframee_nodes_labels_vectors.head()

| | Labels | Vectors |
|---|---|---|
| 0 | 0 | [0.004596061, 0.00057906436, 0.0015437832, 0.0... |
| 1 | 0 | [-0.002353252, -0.0069559636, -0.0063320836, 0... |
| 2 | 0 | [-0.9411332, -1.0707828, 0.64033556, 0.4311686... |
| 3 | 0 | [-0.91952175, -0.70541173, 0.04743082, -0.6581... |
| 4 | 0 | [-2.8163447, -1.0207331, -1.1489363, -1.013297... |


In [0]:
#setting the random seed and splitting the data as specified, i.e. about 20 percent into train and 80 percent into test
np.random.seed(3)
msk = np.random.rand(len(dataframee_nodes_labels_vectors)) < 0.8
X_train = dataframee_nodes_labels_vectors[~msk].copy()
X_test = dataframee_nodes_labels_vectors[msk].copy()

### Classification models built on the Model 1 (node2vec) embeddings

In [38]:
SVM = svm.SVC(C = 0.1, kernel = 'linear', degree = 3, gamma = 0.1)
SVM.fit(list(X_train['Vectors']), list(X_train['Labels']))
predictions_svm = SVM.predict(list(X_test['Vectors']))
accuracy_scores = accuracy_score(list(X_test['Labels']), predictions_svm)
print("The accuracy score of SVM classifier using node2vec is", (accuracy_scores * 100))
rf_classifier = RandomForestClassifier(n_estimators=20, random_state=0)
rf_classifier.fit(list(X_train['Vectors']), list(X_train['Labels']))
y_pred = rf_classifier.predict(list(X_test['Vectors']))
accuracy_scores = accuracy_score(list(X_test['Labels']), y_pred)
print("The accuracy score of RandomForest classifier using node2vec is", (accuracy_scores * 100))
gnb = GaussianNB()
y_pred = gnb.fit(list(X_train['Vectors']), list(X_train['Labels'])).predict(list(X_test['Vectors']))
accuracy_scores = accuracy_score(list(X_test['Labels']), y_pred)
print("The accuracy score of Gaussian naive bayes classifier using node2vec is", (accuracy_scores * 100))

The accuracy score of SVM classifier using node2vec is 55.18205418358517
The accuracy score of RandomForest classifier using node2vec is 55.6812886906743
The accuracy score of Gaussian naive bayes classifier using node2vec is 47.40065233308926


-**Advantages/Disadvantages**:- 

1.Advantage: The node-ids and edges alone are enough to convert each node into a vector. 

2.Disadvantage: The embedding uses only the graph structure. A node-id is just a number and carries no information about the content of the node, so the classification of nodes gives a lower accuracy.

3.Disadvantage: Computationally expensive; generating the walks alone took about 45 minutes in the run above. 

4.Advantage: Few pre-processing steps are needed before the model can be built.

## Model 2 - Node2vec plus Word2vec Algorithm

Word2vec is an embedding approach that can be trained with two methods, skip-gram and continuous bag of words (CBOW), both of which use a shallow neural network. It works similarly to node2vec, except that it embeds words into vectors instead of nodes. In our task, we apply word2vec to the titles of the node-ids, which are text. Word2vec is then combined with node2vec so that each node is described by both its “title” and its position in the graph, which may improve the performance of the classification models.

-**Implementation**:-

Before training word2vec, the titles of all node-ids are pre-processed, and the pre-processed tokens are passed as input to word2vec. Pre-processing is needed because tokens that carry little meaning only add noise. The steps are: extracting unigrams, generating bigrams, removing stop-words and lemmatization. Word2vec is called using model2 = Word2Vec(new_output, min_count=1, size = 50, workers=1, window =1, sg = 1), where sg = 1 selects the skip-gram method.

Each pre-processed token is converted into its word2vec vector. Different node-ids have different numbers of tokens and therefore different numbers of vectors, while the classifiers expect an input of one fixed size for every node. To obtain a fixed size, the code below averages all components of all token vectors of a node-id, which reduces each title to a single value. This value is first used on its own as the input to the classifiers, and is then appended to the node2vec vector of the same node-id and passed to our machine learning algorithms.
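
A common alternative for turning a variable number of word vectors into a fixed-size input is element-wise mean pooling, which keeps one value per embedding dimension. The sketch below is not what the notebook's cells do; `mean_pool` and the 3-dimensional vectors are hypothetical and only contrast the two ways of averaging.

```python
def mean_pool(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 3-dimensional word vectors for a two-token title.
title_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

pooled = mean_pool(title_vectors, dim=3)
print(pooled)  # [2.0, 3.0, 4.0]

# Averaging every component together instead yields a single number.
flat = [x for vec in title_vectors for x in vec]
print(sum(flat) / len(flat))  # 3.0
```

With 50-dimensional word2vec vectors, element-wise pooling would give each title 50 features instead of one, at the cost of a wider input matrix.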


In [0]:
#reading docs.txt and storing each node_id with its title in a dictionary
dictt_docs = {}
with open('docs.txt', mode='r') as data_file1:
    lines = [data_file1.readline().strip() for _ in range(18720)]  #first 18720 lines
for line in lines:
    parts = line.split(" ", 1)  #split only once: node_id, then the whole title
    dictt_docs[parts[0]] = parts[1]

### Pre-processing steps required for the word2vec algorithm

### Why were these pre-processing steps chosen?

These are the basic pre-processing steps commonly used in text classification before the text is converted into vectors.

### Case normalization and tokenization

In [0]:
final_words = []
ids = []
#tokens have at least two characters, start with a letter and may contain one internal hyphen, apostrophe or question mark
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
for key, value in dictt_docs.items():
    ids.append(key)
    titlee = value.lower()  #case normalization
    unigram_tokens = tokenizer.tokenize(titlee)
    final_words.append(unigram_tokens)

### Generating bigrams and appending those bigrams to unigrams

In [0]:
unigrams_bigrams = []
bigram_measures = nltk.collocations.BigramAssocMeasures()
for word in final_words:
    #the finder is built per title, so the frequency filter only keeps
    #bigrams that occur at least 10 times within that single title
    bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(word)
    bigram_finder.apply_freq_filter(10)
    bigram_finder.apply_word_filter(lambda w: len(w) < 3)
    total_bigrams = bigram_finder.nbest(bigram_measures.pmi, len(word)) 
    bigrams_generated = []
    if len(total_bigrams) > 1:  #a title with exactly one surviving bigram adds none
        for i in total_bigrams:
            wordd = i[0] + "_" + i[1]
            bigrams_generated.append(wordd)
    uni_big = word + bigrams_generated
    unigrams_bigrams.append(uni_big)
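
Because the cell above builds one bigram finder per title, a frequency threshold of 10 is rarely reached inside a single short title, which is consistent with the sample output further down containing no bigram tokens. One alternative, which is not what the notebook does, is to count bigrams over the whole corpus first. The sketch below uses only the standard library; the titles and the threshold are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized titles.
titles = [
    ["neural", "network", "training"],
    ["deep", "neural", "network"],
    ["neural", "network", "pruning"],
]

# Count adjacent pairs over the whole corpus, not within a single title.
bigram_counts = Counter(
    pair for tokens in titles for pair in zip(tokens, tokens[1:])
)

min_count = 3  # illustrative threshold
frequent = {pair for pair, n in bigram_counts.items() if n >= min_count}

# Append the frequent bigrams found in each title to its unigrams.
augmented = [
    tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:]) if pair in frequent]
    for tokens in titles
]
print(augmented[0])  # ['neural', 'network', 'training', 'neural_network']
```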

In [31]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Removing stopwords from obtained tokens

In [0]:
nostop_uni_big = []
stop_words = set(stopwords.words('english'))  #build the stop-word set once instead of once per token
for word in unigrams_bigrams:
    filtered_words = [wordd for wordd in word if wordd not in stop_words]
    nostop_uni_big.append(filtered_words)

### Removing hyphens and apostrophes from the obtained tokens

In [0]:
new_final_tokens = []
for listt in nostop_uni_big:
    #strip apostrophes and hyphens from every token
    i1 = [wordd.replace("'", "").replace("-", "") for wordd in listt]
    new_final_tokens.append(i1)

In [34]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### Lemmatization

In [0]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()
new_output = []
for item in new_final_tokens:
    lemmatized_output = [lemmatizer.lemmatize(w) for w in item]
    new_output.append(lemmatized_output)

In [36]:
new_output[0:2]

[['assessing',
  'local',
  'institutional',
  'capacity',
  'data',
  'availability',
  'outcome'],
 ['prospect',
  'internet',
  'telephony',
  'europe',
  'latin',
  'america',
  'tpp',
  'telecom',
  'modeling',
  'policy',
  'analysis']]

In [0]:
#word2vec model building 
model2 = Word2Vec(new_output, min_count=1, size = 50, workers=1, window =1, sg = 1)

In [50]:
model2.wv['assessing']  #word vectors are accessed through model2.wv

array([ 0.09125346, -0.25033194, -0.10103494,  0.1122421 , -0.21059445,
       -0.3060286 , -0.02655339,  0.01526459, -0.189685  , -0.09226803,
        0.08177674,  0.3702603 , -0.19404438, -0.03046685, -0.07523925,
        0.10755336, -0.02506292,  0.10935697, -0.0197453 , -0.04730418,
        0.23650824, -0.05730036,  0.13065772,  0.03795892,  0.04857434,
        0.00507091,  0.10039346,  0.02314824, -0.00762692,  0.08600807,
       -0.11779602,  0.03532889, -0.01188836, -0.06196891,  0.14551485,
       -0.22746846, -0.18106456,  0.37999603, -0.13821314, -0.09750101,
        0.12731971, -0.24495271,  0.04502794,  0.05941778,  0.07227387,
       -0.00378423, -0.11785785,  0.22955096,  0.06591793, -0.0491706 ],
      dtype=float32)

In [51]:
#collecting the vector representation of every word of each title into one final list
output_vectors = []
for i in new_output:
    word_vectors = []
    for wordd in i:
        word_vectors.append(list(model2.wv[wordd]))
    output_vectors.append(word_vectors)


In [0]:
#obtaining labels for each node-id
ids_labels = []
for i in ids:
    if i in dictt1.keys():
        ids_labels.append(dictt1[i])

In [0]:
#copying the word vectors of each node-id into a new list (one list of vectors per node-id)
output_vectors_new = []
for i in output_vectors:
    i1 = []
    for j in i:
        i1.append(j)
    output_vectors_new.append(i1)

In [0]:
#splitting the node into train and test
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(output_vectors_new, ids_labels , test_size = 0.8)

In [0]:
#titles have different numbers of word vectors, so each title in the train data is reduced to the mean over all components of its vectors (0 if it has no tokens)
single_trainx = []
for i in Train_X:
    i1 = []
    for j in i:
        i1.extend(np.float32(j))
    if len(i1) > 0:
        single_trainx.append(mean(i1))
    else:
        single_trainx.append(float(0))

In [0]:
#titles have different numbers of word vectors, so each title in the test data is reduced to the mean over all components of its vectors (0 if it has no tokens)
single_testx = []
for i in Test_X:
    i1 = []
    for j in i:
        i1.extend(np.float32(j))
    if len(i1) > 0:
        single_testx.append(mean(i1))
    else:
        single_testx.append(float(0))

In [0]:
#converting all train labels into float32 to have one single representation for all
trainn_labels = []
for i in Train_Y:
    trainn_labels.append(np.float32(i))

In [0]:
#converting all test labels into float32 to have one single representation for all
test_labels = []
for i in Test_Y:
    test_labels.append(np.float32(i))

### Classification models built on the word2vec features

In [61]:
SVM = svm.SVC(C = 1, kernel='linear', degree = 3, gamma = 0.1  )
SVM.fit(np.array(single_trainx).reshape(-1,1) , trainn_labels)
predictions_svm = SVM.predict(np.array(single_testx).reshape(-1,1))
accuracy_scores = accuracy_score(test_labels, predictions_svm)
print("The accuracy score of SVM classifier using word2vec  is", (accuracy_scores * 100))
rf_classifier = RandomForestClassifier(n_estimators=20, random_state=0)
rf_classifier.fit(np.array(single_trainx).reshape(-1,1) , trainn_labels)
y_pred = rf_classifier.predict(np.array(single_testx).reshape(-1,1))
accuracy_scores = accuracy_score(test_labels, y_pred)
print("The accuracy score of RandomForest classifier using word2vec  is", (accuracy_scores * 100))
gnb = GaussianNB()
y_pred = gnb.fit(np.array(single_trainx).reshape(-1,1) , trainn_labels).predict(np.array(single_testx).reshape(-1,1))
accuracy_scores = accuracy_score(test_labels, y_pred)
print("The accuracy score of Gaussian naive bayes classifier using word2vec  is", (accuracy_scores * 100))

The accuracy score of SVM classifier using word2vec  is 31.497061965811966
The accuracy score of RandomForest classifier using word2vec  is 29.82772435897436
The accuracy score of Gaussian naive bayes classifier using word2vec  is 36.25133547008547


### word2vec plus node2vec

In [0]:
#getting the mean of all word2vecs for each node-id
word2vec_mean = []
for i in output_vectors_new:
    i1 = []
    for j in i:
        i1.extend(np.float32(j))
    if len(i1) > 0:
        word2vec_mean.append(mean(i1))
    else:
        word2vec_mean.append(float(0))

In [0]:
#converting the labels into float32 form to have one single representation for all
word2vec_labels = []
for i in ids_labels:
    word2vec_labels.append(np.float32(i))

In [0]:
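#appending the word2vec mean of each node-id to its node2vec vector; the two lists are paired by position, which assumes they follow the same node-id order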
node2vec_train_data = list(dataframee_nodes_labels_vectors['Vectors'])
node2vec_word2vec_data = []
for i in range(0, len(node2vec_train_data)):
    node2vec_word2vec_data.append(node2vec_train_data[i] + [word2vec_mean[i]])

In [65]:
len(node2vec_word2vec_data)

18720
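
The join above pairs the node2vec vectors and the word2vec means by list position, which is only correct if both lists follow the same node-id order. A safer pattern is to key every feature by node-id and join on that key. The sketch below illustrates the idea on hypothetical data; the dictionaries and their contents are not from the notebook.

```python
# Hypothetical per-node features keyed by node-id.
node_vectors = {"n1": [0.1, 0.2], "n2": [0.3, 0.4], "n3": [0.5, 0.6]}
text_features = {"n2": [9.0], "n1": [7.0]}  # different order, "n3" missing
labels = {"n1": "0", "n2": "1"}

rows, targets = [], []
for node_id in sorted(labels):
    # Keep only nodes that have both kinds of features and a label.
    if node_id in node_vectors and node_id in text_features:
        rows.append(node_vectors[node_id] + text_features[node_id])
        targets.append(labels[node_id])

print(rows)     # [[0.1, 0.2, 7.0], [0.3, 0.4, 9.0]]
print(targets)  # ['0', '1']
```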

In [0]:
#splitting the data into train and test
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(node2vec_word2vec_data, word2vec_labels, test_size = 0.8)

In [67]:
clf = BaggingClassifier(base_estimator=SVC(C = 1.0, kernel = 'linear', degree = 3, gamma = 1), n_estimators = 20, random_state = 3)
clf.fit(Train_X, Train_Y)
predicted_values = clf.predict(Test_X)
acc_score = accuracy_score(Test_Y, predicted_values)
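#fixed: this line originally printed accuracy_scores, a stale value left over from the earlier word2vec cell,
#so the SVM figure in the recorded output below is not the accuracy of this model
print("The accuracy score of SVM classifier using word2vec plus node2vec is", (acc_score * 100))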
rf_classifier = RandomForestClassifier(n_estimators=20, random_state=0)
rf_classifier.fit(Train_X, Train_Y)
y_pred = rf_classifier.predict(Test_X)
accuracy_scores = accuracy_score(Test_Y, y_pred)
print("The accuracy score of RandomForest classifier using word2vec plus node2vec  is", (accuracy_scores * 100))
gnb = GaussianNB()
y_pred = gnb.fit(Train_X, Train_Y).predict(Test_X)
accuracy_scores = accuracy_score(Test_Y, y_pred)
print("The accuracy score of Gaussian naive bayes classifier using word2vec plus node2vec  is", (accuracy_scores * 100))

The accuracy score of SVM classifier using word2vec plus node2vec is 36.25133547008547
The accuracy score of RandomForest classifier using word2vec plus node2vec  is 56.95779914529915
The accuracy score of Gaussian naive bayes classifier using word2vec plus node2vec  is 47.849893162393165


-**Advantages/Disadvantages**:-  

1.Advantage: The semantic meaning of words can be captured. If proper text information exists for each node of a graph, it helps to classify the nodes.

2.Disadvantage: It is more difficult to implement and needs extra memory for pre-processing.

3.Disadvantage: If the text data is noisy or sparse, it does not give the expected results. 

4.Advantage: It needs no labels for training. It transforms the unlabeled raw corpus into training examples by mapping each target word to its context words, and learns the representation of words from this prediction task.

## Model 3 - TF-IDF plus node2vec

TF-IDF stands for “Term Frequency – Inverse Document Frequency”. It assigns a weight to each word that reflects how important the word is to a document within the corpus, and each document is then represented by the vector of its word weights. Term frequency (TF) measures how often a word occurs in a document. Document frequency (DF) counts the number of documents in the corpus that contain the word. IDF is the inverse of the document frequency and measures the informativeness of term t: a term that occurs in few documents receives a high weight.

As with word2vec, we first obtain the pre-processed tokens of each title. We then initialize the TfidfVectorizer, fit it on all the titles in the corpus, and transform each title into its vector representation. The TF-IDF vector of each title is then combined with the node2vec vector of the same node-id and passed as input to our machine learning classification algorithms. No size mismatch arises here, because the transform method produces one vector of the same length (the vocabulary size) for every title, whereas word2vec produces one vector per token. Unlike word2vec, which learns the semantic meaning of a token from its neighbouring tokens, TF-IDF relies on term counts only.
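
The weighting described above can be worked through by hand with the textbook formula tf-idf(t, d) = tf(t, d) * log(N / df(t)). The sketch below is a pure-Python illustration on a hypothetical three-document corpus. It is not the scikit-learn implementation: TfidfVectorizer by default uses a smoothed IDF and L2-normalises each row, so its numbers differ.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["graph", "embedding", "graph"],
    ["word", "embedding"],
    ["graph", "classification"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
vocabulary = sorted(df)


def tfidf_vector(doc):
    """One weight per vocabulary term, so every document has the same length."""
    counts = Counter(doc)
    return [
        (counts[term] / len(doc)) * math.log(n_docs / df[term])
        for term in vocabulary
    ]


vectors = [tfidf_vector(doc) for doc in docs]
print(vocabulary)
print([round(x, 3) for x in vectors[0]])
```

Every document is mapped onto the same vocabulary, which is why all TF-IDF vectors have the same length regardless of how many tokens a title contains.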

In [0]:
#the above pre-processed tokens are taken here
Train_X1 = [' '.join(ele) for ele in new_output]

In [0]:
#using the tf-idf vectorizer to convert the titles into vectors of unigrams, bigrams and trigrams
Tfidf_vect = TfidfVectorizer(ngram_range = (1,3))  #the data is passed to fit/transform, not to the constructor
Tfidf_vect.fit(Train_X1)
Train_X1_Tfidf = Tfidf_vect.transform(Train_X1)

In [0]:
#using hstack to combine the sparse tf-idf matrix with the 64-dimensional node2vec vectors; rows are paired by position
node2vec_train_data = list(dataframee_nodes_labels_vectors['Vectors'])
xfull = scipy.sparse.hstack(( Train_X1_Tfidf, np.asarray(node2vec_train_data)))

In [0]:
#splitting the data into train and test data
np.random.seed(124)
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(xfull, labels_l1, test_size = 0.8)

### Classification models built on the TF-IDF plus node2vec features

In [60]:
clf = BaggingClassifier(base_estimator=SVC(C = 1.0, kernel = 'linear', degree = 3, gamma = 1), n_estimators = 36, random_state = 7)
clf.fit(Train_X, Train_Y)
predicted_values = clf.predict(Test_X)
acc_score = accuracy_score(Test_Y, predicted_values)
print("The accuracy score of SVM classifier using TF-IDF is", (acc_score * 100))
rf_classifier = RandomForestClassifier(n_estimators=20, random_state=0)
rf_classifier.fit(Train_X, Train_Y)
y_pred = rf_classifier.predict(Test_X)
accuracy_scores = accuracy_score(Test_Y, y_pred)
print("The accuracy score of RandomForest classifier using TF-IDF  is", (accuracy_scores * 100))

The accuracy score of SVM classifier using TF-IDF is 71.36752136752136
The accuracy score of RandomForest classifier using TF-IDF  is 63.46153846153846


-**Advantages/Disadvantages**:- 

1.Advantage: It is a simple algorithm based on counting words. 

2.Advantage: It is efficient, and it makes the similarity between documents easy to compute.

3.Disadvantage: It cannot grasp the semantic meaning of words, their context or their co-occurrences in the document. 

4.Disadvantage: It is based on the assumption that the counts of different words provide independent evidence of similarity.

## 3. Conclusion:-

Recommendation systems play a very important role in overcoming the difficulty of decision making when many options are available to users, some of which the users may not even know about. A recommender system tries to present the most appropriate options to each user. In our task, we used user-content based collaborative filtering with the “implicit” package, as our data consists of implicit ratings and an item has to be recommended to a user even if the user has no previous history with that particular item-id. Node classification plays a very important role once a graph has been built, for example to identify similar nodes or communities; in our task, it groups similar documents into one class. Here we used the node-ids and titles to get more information about each node, built embeddings on those nodes and finally classified them using machine learning algorithms, where the SVM on TF-IDF plus node2vec gave the best result (71.37% accuracy).
