## Task 1: Recommender System Challenge

### Introduction

The goal of the task 1 is to bulit a recommender system to recommend item to the user through what the user has purchased. In this task, I established two recommendation systems. I used the ALS algorithm for the first recommendation system, and I used Content Based Recommender for the second recommendation system.

### Recommender system base on ALS algorithms

#### data

On the data, I used a combination of train dataset and validation dataset as a training dataset, and under the recommendation system of als algorithm, I used the item_fea dataset. 

The item_fea data set is a data set of the characteristics of all items, and similar items can be found more accurately through the characteristics of the items.

This algorithm does not use the user_fea data set. When user_fea is used, it will increase the error. Find similar users and then find items with high similarity from the purchase list of similar users and directly find items with high similarity from my purchase list. There is a big error compared to this, because similar users are not exactly the same as the users themselves, so things purchased by similar users may not be needed for users.

#### Alternating Least Squares algorithms (ALS)

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sparse
import implicit
from sklearn.preprocessing import MinMaxScaler
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from node2vec import Node2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC 
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
import networkx as nx

##### read the data into data frame

In [2]:
user_feature = pd.read_csv('user_fea.csv')
train_1 = pd.read_csv('train_data.csv')
train_2 = pd.read_csv('validation_data.csv')
item_feature = pd.read_csv('item_fea.csv')
test_data = pd.read_csv('test_data.csv')

##### process the data

In [3]:
# concat the train_data and validation_data
train_data = pd.concat([train_1,train_2],axis = 0)
train_data.reset_index(inplace = True, drop = True)
# change the column name
item_feature.rename( columns={'Unnamed: 0':'item_id'}, inplace=True )
item_feature['item_feature'] = item_feature.iloc[:,1:].values.tolist()
# put the item_feature into the train data
train_data = train_data.merge(item_feature, on='item_id')

In [4]:
# get the column we need
train_data = train_data[['user_id', 'item_id', 'item_feature', 'rating']]

In [5]:
# create two sparse matrixs
sparse_item_user = sparse.csr_matrix((train_data['rating'].astype(float), (train_data['item_id'], train_data['user_id'])))
sparse_user_item = sparse.csr_matrix((train_data['rating'].astype(float), (train_data['user_id'], train_data['item_id'])))

##### create the model

In [6]:
alpha = 15
data = (sparse_item_user * alpha).astype('double')

In [7]:
model = implicit.als.AlternatingLeastSquares(factors=10, regularization=0.1, iterations=25)
model.fit(data)



HBox(children=(FloatProgress(value=0.0, max=25.0), HTML(value='')))




In [8]:
def recommend(user_id, sparse_user_item, user_vecs, item_vecs, num_items=10):
    result_df = pd.DataFrame(columns=['user_id', 'item_id'])
    test_item_list = test_data.loc[(test_data.user_id == user_id)]['item_id'].tolist()
    # Get the interactions scores from the sparse user item matrix
    user_interactions = sparse_user_item[user_id,:].toarray()
    # Add 1 to everything, so that articles with no interaction yet become equal to 1
    user_interactions = user_interactions.reshape(-1) + 1
    # Make articles already interacted zero
    user_interactions[user_interactions > 1] = 0
    # Get dot product of user vector and all item vectors
    rec_vector = user_vecs[user_id,:].dot(item_vecs.T).toarray()
    # Scale this recommendation vector between 0 and 1
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    # Item already interacted have their recommendation multiplied by zero
    recommend_vector = user_interactions * rec_vector_scaled
    # comine the item id and the recommend score
    recommend_vector = list(zip(test_item_list,recommend_vector[test_item_list]))
    # sort the list
    recommend_vector_sort = sorted(recommend_vector, key = lambda x: x[1], reverse=True)
    # get the top 10 and put them into a dataframe
    for i in range(0,10):
        result_df = result_df.append({'user_id':int(user_id),'item_id':int(recommend_vector_sort[i][0])},ignore_index=True)
    result_df.user_id = result_df.user_id.astype('int64')
    result_df.item_id = result_df.item_id.astype('int64')
    return(result_df)

In [9]:
user_vecs = sparse.csr_matrix(model.user_factors)
item_vecs = sparse.csr_matrix(model.item_factors)
result = pd.DataFrame(columns=['user_id', 'item_id'])
for user_id in user_feature['Unnamed: 0'].tolist():
    recommend_df = recommend(user_id, sparse_user_item, user_vecs, item_vecs, num_items=10)
    result = pd.concat([result,recommend_df],axis = 0)

In [10]:
result.to_csv('ALS_result.csv',index = False)

the result of this recommender system is 0.20756

###  Content based recommender

The realization process of this algorithm is to calculate the average similarity between the target item and the recommended user's like and dislike items, and calculate the average similarity between the target item and the recommended user's like and dislike items. Through these four similarities Calculate the score of the recommended items, and select the top 10 items from the 100 recommended items.

Cosine Similarity is used in this recommendation algorithm. Its definition is：
$$cosine(x,y) = \frac{x. y^\intercal}{||x||.||y||} $$

Through cosine similarity, we can get the average similarity between the recommended items and the items the user likes, the draw similarity with the item that the user does not like, the average similarity with the item that the similar user likes, and the similarity of the draw with the item that the similar user does not like. Set these four variables to user_like, user_dislike, sim_user_like, sim_user_dislke.

Add user_like and sim_user_dislike and subtract the sum of user_dislike and sim_user_dislike, and use this value as the score of this item. When the score is large, it means that the item is closer to the user or similar user's favorite items. When the value is small, it means This item is closer to the item that the user or similar users do not like. Through this numerical sorting, the top 10 digits with the largest numerical value are selected as the result.

I use item feature and user feature as the result of the content converted by TFIDF, and calculate the similarity matrix of all item and user through these two data sets.

In [11]:
# read the data
user_feature = pd.read_csv('user_fea.csv')
train_1 = pd.read_csv('train_data.csv')
train_2 = pd.read_csv('validation_data.csv')
item_feature = pd.read_csv('item_fea.csv')
test_data = pd.read_csv('test_data.csv')

In [12]:
# Transfer the dataframe into a list
user_feature_list=[]
for i in user_feature.index.values:
    user_feature_list.append(user_feature.iloc[i,1:].tolist())

In [13]:
# Transfer the data into a matrix
user_feature_matrix = np.mat(user_feature_list)

In [14]:
# Use the comsine_similarity to calculate the similarity
cosine_sim_user = cosine_similarity(user_feature_matrix, user_feature_matrix)

In [15]:
# define a function to get the most similar user
def get_sim_user(user):
    sim_scores = list(enumerate(cosine_sim_user[user]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:6]
    user_indices = [i[0] for i in sim_scores]
    return user_indices[0]

In [16]:
# def a function to get the mean of the similiarity
def get_mean_sim_item(sim_rate):
    sim_data_list = sim_rate.tolist()
    return np.mean(sim_data_list)

In [17]:
# transfer the item dataframe
item_feature_list = []
for i in item_feature.index.values:
    item_feature_list.append(item_feature.iloc[i,1:].tolist())

In [18]:
item_feature_matrix = np.mat(item_feature_list)

In [19]:
cosine_sim_item = cosine_similarity(item_feature_matrix, item_feature_matrix)

In [20]:
# combine train and vaildation data set
train_data = pd.concat([train_1,train_2],axis = 0)
train_data.reset_index(inplace = True, drop = True)

In [21]:
# define a function to calculate the grade for each item
def get_sim (user_id,target_item_id):
    # get the user like data and user dislike data
    user_like = train_data[(train_data.user_id == user_id)&(train_data.rating == 1)]
    user_dislike = train_data[(train_data.user_id == user_id)&(train_data.rating == 0)]
    # transfer them into a list
    like_item_list = user_like.loc[:,'item_id'].tolist()
    dislike_item_list = user_dislike.loc[:,'item_id'].tolist()
    # use the function to get the most similar user
    sim_user_id = get_sim_user(user_id)
    # get the similar user like and dislike
    sim_user_like = train_data[(train_data.user_id == sim_user_id)&(train_data.rating == 1)]
    sim_user_dislike = train_data[(train_data.user_id == sim_user_id)&(train_data.rating == 0)]
    # transfer them into a list
    sim_like_item_list = sim_user_like.loc[:,'item_id'].tolist()
    sim_dislike_item_list = user_dislike.loc[:,'item_id'].tolist()
    # get the similiarity
    user_simrate = cosine_sim_item[target_item_id, like_item_list]
    # calculate the mean of the similiarity
    user_like_simrate = get_mean_sim_item(user_simrate)
    # get the similiarity
    user_simrate_dis = cosine_sim_item[target_item_id, dislike_item_list]
    # calculate the mean of the similiarity
    user_dislike_simrate = get_mean_sim_item(user_simrate_dis)
    # get the similiarity for similer user
    sim_user_simrate = cosine_sim_item[target_item_id, sim_like_item_list]
    # calculate the mean of the similiarity
    sim_user_like_simrate = get_mean_sim_item(sim_user_simrate)
    # get the similiarity for similer user
    sim_user_simrate_dis = cosine_sim_item[target_item_id, sim_dislike_item_list]
    #calculate the mean of the similiarity
    sim_user_dislike_simrate = get_mean_sim_item(sim_user_simrate_dis)
    # calculate the grade for the target item
    result = user_like_simrate + sim_user_like_simrate - user_dislike_simrate - sim_user_dislike_simrate
    return result

In [22]:
# for each item in test data, calculate the grade
test_data['grade'] = 0
grade_list = []
for i in test_data.index.values:
    user_id = test_data.loc[i,'user_id']
    target_id = test_data.loc[i,'item_id']
    grade_list.append(get_sim(user_id, target_id))

In [23]:
test_data['grade'] = grade_list

In [24]:
user_id_list = set(test_data.loc[:,'user_id'])

In [25]:
result_dataframe = pd.DataFrame()
for i in user_id_list:
    id_dataframe = test_data[(test_data.user_id == i)]
    id_dataframe = id_dataframe.sort_values(by="grade" , ascending=False)
    recom = id_dataframe.head(10)
    result_dataframe = pd.concat([result_dataframe, recom], axis = 0)

In [26]:
result_dataframe.drop('grade', inplace = True, axis = 1)
result_dataframe.to_csv('cos_result.csv',index = False)

the kaggle result for this algorithms is 0.10842

## Task 2: Node Classification in Graphs

### Discription

In this task, we will build a graph through the networkx library, and then use node2vec vectornize all points, and then establish SVM and Logistic models to make predictions.

### create the gtaph

In [27]:
# read the nodes data
nodes=open('docs.txt','r')

In [28]:
# create the graph
g = nx.Graph()

In [29]:
# process the data and create the node
for i in nodes:
    i = i.replace('\n', '')
    node_list = i.split(' ',1)
    node_id = node_list[0]
    node_content = node_list[1]
    g.add_node(node_id, node_content=str(node_content))

In [30]:
# read the link data
links=open('adjedges.txt', 'r')

In [31]:
# process the data and create the link
for l in links:
    l = l.replace('\n', '')
    link_list = l.split(' ')
    add_link = []
    for i in range(1,len(link_list)):
        if link_list[i]!='':
            add_link.append((link_list[0], link_list[i]))
    if len(add_link) != 0:
        g.add_edges_from(add_link)

In [32]:
# use the node2vec to vectornize the node
node2vec = Node2Vec(g, dimensions=64, walk_length=10, num_walks=10, workers=1)

Computing transition probabilities: 100%|██████████| 36928/36928 [00:06<00:00, 5822.92it/s] 
Generating walks (CPU: 1): 100%|██████████| 10/10 [01:13<00:00,  7.37s/it]


In [33]:
model = node2vec.fit(window=10, min_count=1, batch_words=4)

In [34]:
# transfer the label into data frame and add the feature
data_frame = pd.DataFrame(columns=['node_id', 'label', 'feature'])
data=open('labels.txt','r')
for k in data:
    k = k.replace('\n', '')
    node_id = k.split(' ')[0]
    label = k.split(' ')[1]
    feature = model.wv.get_vector(node_id)
    data_frame = data_frame.append([{'node_id':node_id, 'label':label, 'feature': feature}], ignore_index=True)

In [35]:
train, test = train_test_split(data_frame, test_size = 0.8, random_state = 1234)

use the SVM model

In [36]:
# transfer the data type
feature_train_list = train['feature'].tolist()
feature_train = np.array(feature_train_list)
label_train_list = train['label'].tolist()
label_train = np.array(label_train_list)

feature_test_list = test['feature'].tolist()
feature_test = np.array(feature_test_list)
label_test_list = test['label'].tolist()
label_test = np.array(label_test_list)
# use the svm model
svm_model = LinearSVC().fit(feature_train,label_train)



In [37]:
# get the accuracy
svm_model.score(feature_test,label_test)

0.5414663461538461

use the logistic model

In [38]:
train_label = train['label'].values.reshape(-1,1)
test_label = test['label'].values.reshape(-1,1)

In [39]:
# use the logistic model
logistic_model = LogisticRegression().fit(feature_train, train_label)

  y = column_or_1d(y, warn=True)


In [40]:
# get the accuracy
logistic_model.score(feature_test,test_label)

0.5299145299145299