<h1 style="color:rgb(36,115,172);text-align:center;"> Data Challenge </h1>

<h2 style="text-align:center;"> Project of Data Mining course - SJTU ParisTech </h2>

<h3 style="text-align:center;"> Prediction of link between papers </h3>

<i>Florian Blanchet (Télécom Bretagne, France) <br/> florian.blanchet@telecom-bretagne.eu</i>


---------------------------------------------------------------------------------
=========================

### Summary : 
1 - Introduction <br/>
2 - Import libraries<br/>
3 - Import files<br/>
4 - Feature creation<br/>
5 - Feature saling<br/>
6 - Evaluation<br/>
7 - Training script<br/>
8 - Testing script<br/>

# 1 - Introduction 
<p>The task of this data challenge is to identify whether a citation exists for a given pair of research papers. We will find that there is resemblance of this task with that of link prediction. We will be given a training dataset with the ground truth of whether a link is "true" or "false" among pairs of papers and a test dataset for which we will have to classify the pair of papers.  </p>
<p>The report is linearly organized, the functions presented will be at the end launched in the same order in a 'Main' block. It appears that I have modified some of them that is why they are sometimes with a number in the name.  </p>

# 2 - Import libraries

At first we have to import libraries that we will be needed after. 

In [58]:
# Tools :
import random
import numpy as np
# To create features :
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn import preprocessing
import nltk
# To import files : 
import csv

In [57]:
# Classifiers :
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
# To evaluate performances :
from sklearn.cross_validation import KFold
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc

We now download tools from nltk library to clean our documents by removing for instances stopwords or to just keep the root of a word. It allow us to have better results.

In [7]:
nltk.download('punkt') # for tokenization
nltk.download('stopwords')
stpwds = set(nltk.corpus.stopwords.words("english"))
stemmer = nltk.stem.PorterStemmer()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/florianblanchet/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/florianblanchet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


---------------------------------------------------------------------------------
=========================

# 3 - Import files

In this part I describe the files that will be used after. I import them and save data in several list or dictionnaries. The function download_files() allow to download all of them. 

### testing_set.txt :
There are 32,648 node pairs in testing_set. One pair per row, as: source node ID, target node ID. These pairs are the ones we have to predict the label. It will be used at the end to create the submission file. <br/>
Data saved in the 'testing_set' list.

### training_set.txt :  
This file is composed of 615,512 labeled node pairs (1 if there is an edge between the two nodes, 0 else). One pair and label per row, as: source node ID, target node ID, and 1 or 0. The IDs match the papers in the node_information.csv file (see below). <br/>
Data saved in the 'training_set' list.

### node_information.csv :  
For each paper out of 27,770, contains the paper :<br/>
(1) paper unique ID (integer)<br/>
(2) publication year (between 1993 and 2003) (integer)<br/> 
(3) paper title (string)<br/> 
(4) authors (strings separated by ,)<br/> 
(5) name of journal (not available for all papers) (string)<br/> 
(6) abstract (string). Abstracts are already in lowercase, common English stopwords have been removed, and punctuation marks have been removed except for intra-word dashes. <br/> 
Data saved in the 'node_info' list.

### download_files() :  
<p>This function import files presented above.</p>

In [59]:
def download_files():
    print "------------------------------------ "
    print "#########  DOWNLOAD FILES  ######### "
    with open("testing_set.txt", "r") as f:
        reader = csv.reader(f)
        testing_set  = list(reader)
        testing_set = [element[0].split(" ") for element in testing_set]
    with open("training_set.txt", "r") as f:
        reader = csv.reader(f)
        training_set  = list(reader)
        training_set = [element[0].split(" ") for element in training_set]
    with open("node_information.csv", "r") as f:
        reader = csv.reader(f)
        node_info  = list(reader)
    return testing_set,training_set,node_info

--------------------------------------------------------------------------------

### Reduction of training_set & actualization of node_info

Training_set contains too much samples and we can't train on all the dataset. So that we reduce the dataset to 5% for instance to make our tests before training on all the dataset at the end to predict.

#### Actualization of training_Set and creation of valid_ids

Valid_ids is a list of IDs we keep in the new training_set reduced.

In [73]:
#to test code we select sample
def actualisation_set(training_set,percentage):
    print "---------------------------------------"
    print "#####  REDUCTION OF TRAINING_SET ###### "
    to_keep = random.sample(range(len(training_set)), k=int(round(len(training_set)*percentage))) # ID à garder de training_set, 30,776 IDs
    training_set = [training_set[i] for i in to_keep]
    valid_ids=set()   # Type : set  ??
    for element in training_set:
        valid_ids.add(element[0]) # Ajoute ID source
        valid_ids.add(element[1]) # Ajoute ID target
    # Valid_ids = tous les noeuds de notre graph qu'on garde, pas de redondance d'ID
    print "We keep ",len(to_keep)," elements of training_set"
    print "Valid_ids' length : ",len(valid_ids)
    return training_set, valid_ids

#### Actualization of node_info
We just keep information about IDs selected before.

In [74]:
# Sort node_info to keep information of valid_ids' IDs
def actualisation_node_info(node_info):
    print "-----------------------------------------"
    print "######  ACTUALISATION OF NODE_INFO ###### "
    tmp=[element for element in node_info if element[0] in valid_ids ]
    node_info=tmp
    del tmp
    print "Number of nodes in node_info : ",len(node_info)
    return node_info

--------------------------------------------------------------------------------

### Creation of ID_pos, data, cite and dates : 
This dictionnary helps us to find the index of a given ID in node_info. Moreiver we use this function to initialize a dictionnary 'cite', to create a list of dates and to create a list 'data' containing the abstract (1), abstract+title(2), abstract+title+journal(3) or abstract+title+author+journal(4).

In [69]:
def create_ID_pos(node_info):
    print "-----------------------------------"
    print "#########  CREATE ID_POS, data, cite and dates ###### "
    IDs = []
    ID_pos={}
    data = []
    dates = []
    cite={}
    for element in node_info:
        ID_pos[element[0]]=len(IDs) # Donne la position de element[0] dans node_info
        IDs.append(element[0]) # liste des ID dans node_info
        data.append(element[5])
        dates.append(int(element[1]))
        cite[element[0]]=0 # initialise le dictionnaire des citations
    print "Nombre de nodes dans ID_pos : ",len(IDs)
    return ID_pos, data, dates, cite

In [70]:
def create_ID_pos2(node_info):
    print "-----------------------------------"
    print "#########  CREATE ID_POS, data, cite and dates ###### "
    IDs = []
    ID_pos={}
    data = []
    dates = []
    cite={}
    for element in node_info:
        ID_pos[element[0]]=len(IDs) # Donne la position de element[0] dans node_info
        IDs.append(element[0]) # liste des ID dans node_info
        data.append(' '.join([element[5],element[3]]))
        dates.append(int(element[1]))
        cite[element[0]]=0 # initialise le dictionnaire des citations
    #print(IDs)
    #print(node_info[ID_pos[ID]])
    print "Nombre de nodes dans ID_pos : ",len(IDs)
    return ID_pos, data, dates, cite

In [71]:
def create_ID_pos3(node_info):
    print "-----------------------------------"
    print "#########  CREATE ID_POS, data, cite and dates ###### "
    IDs = []
    ID_pos={}
    data = []
    dates = []
    cite={}
    for element in node_info:
        ID_pos[element[0]]=len(IDs) # Donne la position de element[0] dans node_info
        IDs.append(element[0]) # liste des ID dans node_info
        data.append(' '.join([element[5],element[2],element[3]]))
        dates.append(int(element[1]))
        cite[element[0]]=0 # initialise le dictionnaire des citations
    #print(IDs)
    #print(node_info[ID_pos[ID]])
    print "Nombre de nodes dans ID_pos : ",len(IDs)
    return ID_pos, data, dates, cite

In [72]:
def create_ID_pos4(node_info):
    print "-----------------------------------"
    print "#########  CREATE ID_POS, data, cite and dates ###### "
    IDs = []
    ID_pos={}
    data = []
    dates = []
    cite={}
    for element in node_info:
        ID_pos[element[0]]=len(IDs) # Donne la position de element[0] dans node_info
        IDs.append(element[0]) # liste des ID dans node_info
        data.append(' '.join([element[5],element[2],element[3],element[3]]))
        dates.append(int(element[1]))
        cite[element[0]]=0 # initialise le dictionnaire des citations
    #print(IDs)
    #print(node_info[ID_pos[ID]])
    print "Nombre de nodes dans ID_pos : ",len(IDs)
    return ID_pos, data, dates, cite

---------------------------------------------------------------------------------
=========================

# 4 - Feature creation

### Similarity using Tfidf : 

My_preprocessor clean a text to remove stopwords (such as 'of', 'a', 'an'..) and keep the root of a word (such as 'creation' -> 'creat').

In [78]:
def my_preprocessor(text):
    words=text.split(" ") # Already in lower, we convert text(string) into a list of words in unicode
    without_stpwords = [token for token in words if token not in stpwds]  # Remove stopwords
    L=[]
    for token in without_stpwords:  # Handle exception if appears when keeping roots (ex : 'aed')
        try :
            L.append(stemmer.stem(token))
        except : 
            print "Didn't arrive to find root of : ", token
            L.append(token)
            pass
    words2 = L
    doc=' '.join(words2)
    return doc

We now create the tfidf matrix with data list created before. It contains a list of texts containing abstract or abstract+title or abstract+title+journal ...

In [17]:
def create_tfidf(data,my_preprocessor):
    print "-----------------------------------"
    print "#########  CREATE TFIDF MATRIX  ###### "
    m = TfidfVectorizer(preprocessor=my_preprocessor)
    tfidf_matrix = m.fit_transform(data)
    tfidf_matrix = tfidf_matrix.toarray()
    print "Taille de tfidf_matrix : ",tfidf_matrix.shape
    return tfidf_matrix

This function compute the cosine simalirity between to papers. It takes into arguments two elements of data list.

In [18]:
def cosine_similarity(vector1, vector2):
    top=vector1.dot(vector2)
    bottom=np.linalg.norm(vector1)*np.linalg.norm(vector2)
    if bottom==0: 
        return 0.0
    return top/bottom

--------------------------------------------------------------------------------

### Number of times paper cite or cited or both

To define is it cites or is cited I took in training_set element [IDA,IDB,Label] <-> IDA cites IDB if label is '1'. 
<br/>The first function gives us a dictionnary to know the number of times an ID was cited or cite in training_set. 
<br/> The second one tells us the number of times an ID cite and is cited.


In [63]:
# Number of times where an ID is cited or cites another (citer) 
def cite_cited1(training_set,cite):
    print "-----------------------------------"
    print "#########  CREATE DICTIONNARY CITE ###### "
    compteur=0
    for i in range(len(training_set)):
        if (training_set[i][2] == '1'):
            compteur+=1
            cite[training_set[i][1]]+=1
            cite[training_set[i][0]]+=1
    print "Number of '1' labels in training_set : ",compteur, " of ",len(training_set)
    return cite

In [64]:
# Number of times where an ID is cited and an ID cites another (citer) 
def cite_cited2(training_set,cite0):
    print "-----------------------------------"
    print "#########  CHECK PAPERS CITED AND CITER ###### "
    cited = cite0
    citer = cite0
    for i in range(len(training_set)):
        if (training_set[i][2] == '1'):
            cited[training_set[i][1]]+=1
            citer[training_set[i][0]]+=1  # Si il a deja cité quelqu'un
    return cited, citer

--------------------------------------------------------------------------------

### Similarité entre les auteurs, les titres et les abstracts

These functions allow to create features depending on what we have put on 'data' to compute the cosine simalirity and if we want to use 'cite' or 'cited and citer' in argument. They return features as lists.

##### Citer/cited in argument and cosine similarity of abstracts.
Compute number of common words between titles and authors too.

In [83]:
# similarité des titres et auteurs 
def extract_autor_title_abstract_similarity2(training_set,citer,cited,ID_pos,node_info,tfidf_matrix):
    print "-----------------------------------"
    print "#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### "
    print "Number of features to handle : ", len(training_set)

    overlap_title = []  # number of overlapping words in title
    comm_auth = []      # number of common authors
    citer_source = []
    cited_target = []
    similarity_abstract = []

    for i in range(len(training_set)):
        source = training_set[i][0]
        target = training_set[i][1]
        citer_source.append(citer[source])
        cited_target.append(cited[target])

        source_info = node_info[ID_pos[source]]
        target_info = node_info[ID_pos[target]]

        # convert to lowercase and tokenize
        source_title = source_info[2].lower().split(" ") # renvoie liste des mots du titre en minuscule : dbrane boundstate wavefunctions -> ['dbrane', 'boundstate', 'wavefunctions']
        target_title = target_info[2].lower().split(" ")
        # remove stopwords
        source_title = [token for token in source_title if token not in stpwds]
        target_title = [token for token in target_title if token not in stpwds]
        source_title = [stemmer.stem(token) for token in source_title] # Garde la racine des mots
        target_title = [stemmer.stem(token) for token in target_title]

        # Prend auteur de source et target
        source_auth = source_info[3].split(",") # met sous forme de liste
        target_auth = target_info[3].split(",")
        
        overlap_title.append(len(set(source_title).intersection(set(target_title)))) # Nombre de mots en commun dans titre
        comm_auth.append(len(set(source_auth).intersection(set(target_auth)))) # Nombre de mots en commun dans auteur

        
        similarity  = cosine_similarity(tfidf_matrix[ID_pos[source]],tfidf_matrix[ID_pos[target]])
        similarity_abstract.append(similarity)
        if i % 1000 == 0:
            print i, " examples processsed"

    print "-----------------------------------"
    print "nombre de similitudes sur titres : ",sum(i for i in overlap_title)
    print "nombre de similitudes sur auteurs : ",sum(i for i in comm_auth)
    return overlap_title,comm_auth,citer_source,cited_target,similarity_abstract

##### Cite in argument and cosine similarity of abstracts.
Compute number of common words between titles and authors too.

In [84]:
# similarité des titres et auteurs 
def extract_autor_title_abstract_similarity3(training_set,cite,ID_pos,node_info,tfidf_matrix):
    print "-----------------------------------"
    print "#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### "
    print "Number of features to handle : ", len(training_set)

    overlap_title = []  # number of overlapping words in title
    comm_auth = []      # number of common authors
    cite_source = []
    cite_target = []
    similarity_abstract = []

    for i in range(len(training_set)):
        source = training_set[i][0]
        target = training_set[i][1]
        cite_source.append(cite[source])
        cite_target.append(cite[target])

        source_info = node_info[ID_pos[source]]
        target_info = node_info[ID_pos[target]]

        # convert to lowercase and tokenize
        source_title = source_info[2].lower().split(" ") # renvoie liste des mots du titre en minuscule : dbrane boundstate wavefunctions -> ['dbrane', 'boundstate', 'wavefunctions']
        target_title = target_info[2].lower().split(" ")
        # remove stopwords
        source_title = [token for token in source_title if token not in stpwds]
        target_title = [token for token in target_title if token not in stpwds]
        source_title = [stemmer.stem(token) for token in source_title] # Garde la racine des mots
        target_title = [stemmer.stem(token) for token in target_title]

        # Prend auteur de source et target
        source_auth = source_info[3].split(",") # met sous forme de liste
        target_auth = target_info[3].split(",")
        
        overlap_title.append(len(set(source_title).intersection(set(target_title)))) # Nombre de mots en commun dans titre
        comm_auth.append(len(set(source_auth).intersection(set(target_auth)))) # Nombre de mots en commun dans auteur

        
        similarity  = cosine_similarity(tfidf_matrix[ID_pos[source]],tfidf_matrix[ID_pos[target]])
        similarity_abstract.append(similarity)
        if i % 1000 == 0:
            print i, " examples processsed"

    print "-----------------------------------"
    print "nombre de similitudes sur titres : ",sum(i for i in overlap_title)
    print "nombre de similitudes sur auteurs : ",sum(i for i in comm_auth)
    return overlap_title,comm_auth,cite_source,cite_target,similarity_abstract

##### Citer/cited in argument and cosine similarity of abstracts+titles.
Compute number of common words between authors.

In [85]:
# similarité des titres et auteurs 
def extract_autor_title_abstract_similarity4(training_set,citer, cited,ID_pos,node_info,tfidf_matrix):
    print "-----------------------------------"
    print "#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### "
    print "Number of features to handle : ", len(training_set)

    comm_auth = []      # number of common authors
    cite_source = []
    cite_target = []
    similarity_ = []

    for i in range(len(training_set)):
        source = training_set[i][0]
        target = training_set[i][1]
        cite_source.append(citer[source])
        cite_target.append(cited[target])

        source_info = node_info[ID_pos[source]]
        target_info = node_info[ID_pos[target]]

        # Prend auteur de source et target
        source_auth = source_info[3].split(",") # met sous forme de liste
        target_auth = target_info[3].split(",")
        
        comm_auth.append(len(set(source_auth).intersection(set(target_auth)))) # Nombre de mots en commun dans auteur

        similarity  = cosine_similarity(tfidf_matrix[ID_pos[source]],tfidf_matrix[ID_pos[target]])
        similarity_.append(similarity)
        if i % 1000 == 0:
            print i, " examples processsed"

    print "-----------------------------------"
    print "nombre de similitudes sur auteurs : ",sum(i for i in comm_auth)
    return comm_auth,cite_source,cite_target,similarity_

##### Cite in argument and cosine similarity of abstracts+titles.
Compute number of common words between authors.

In [86]:
# similarité des titres et auteurs 
def extract_autor_title_abstract_similarity5(training_set,cite,ID_pos,node_info,tfidf_matrix):
    print "-----------------------------------"
    print "#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### "
    print "Number of features to handle : ", len(training_set)

    comm_auth = []      # number of common authors
    cite_source = []
    cite_target = []
    similarity_ = []

    for i in range(len(training_set)):
        source = training_set[i][0]
        target = training_set[i][1]
        cite_source.append(cite[source])
        cite_target.append(cite[target])

        source_info = node_info[ID_pos[source]]
        target_info = node_info[ID_pos[target]]

        # Prend auteur de source et target
        source_auth = source_info[3].split(",") # met sous forme de liste
        target_auth = target_info[3].split(",")
        
        comm_auth.append(len(set(source_auth).intersection(set(target_auth)))) # Nombre de mots en commun dans auteur

        
        similarity  = cosine_similarity(tfidf_matrix[ID_pos[source]],tfidf_matrix[ID_pos[target]])
        similarity_.append(similarity)
        if i % 1000 == 0:
            print i, " examples processsed"

    print "-----------------------------------"
    print "nombre de similitudes sur auteurs : ",sum(i for i in comm_auth)
    return comm_auth,cite_source,cite_target,similarity_

##### Citer/cited in argument and cosine similarity of data.

In [87]:
# similarité des titres et auteurs 
def extract_autor_title_abstract_similarity6(training_set,citer, cited,ID_pos,node_info,tfidf_matrix):
    print "-----------------------------------"
    print "#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### "
    print "Number of features to handle : ", len(training_set)

    cite_source = []
    cite_target = []
    similarity_ = []

    for i in range(len(training_set)):
        source = training_set[i][0]
        target = training_set[i][1]
        cite_source.append(citer[source])
        cite_target.append(cited[target])
        
        source_info = node_info[ID_pos[source]]
        target_info = node_info[ID_pos[target]]

        # Prend auteur de source et target
        source_auth = source_info[5].split(",") # met sous forme de liste
        target_auth = target_info[5].split(",")
   
        similarity  = cosine_similarity(tfidf_matrix[ID_pos[source]],tfidf_matrix[ID_pos[target]])
        similarity_.append(similarity)
        if i % 1000 == 0:
            print i, " examples processsed"
    return cite_source,cite_target,similarity_

##### Citer/cited in argument and cosine similarity of data.

In [88]:
# similarité des titres et auteurs 
def extract_autor_title_abstract_similarity7(training_set,cite,ID_pos,node_info,tfidf_matrix):
    print "-----------------------------------"
    print "#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### "
    print "Number of features to handle : ", len(training_set)

    cite_source = []
    cite_target = []
    similarity_ = []

    for i in range(len(training_set)):
        source = training_set[i][0]
        target = training_set[i][1]
        cite_source.append(cite[source])
        cite_target.append(cite[target])
        
        similarity  = cosine_similarity(tfidf_matrix[ID_pos[source]],tfidf_matrix[ID_pos[target]])
        similarity_.append(similarity)
        if i % 1000 == 0:
            print i, " examples processsed"
    return cite_source,cite_target,similarity_

##### Create dates features :

In [89]:
# similarité des dates
def extract_date(sett,dates,ID_pos):
    print "-----------------------------------"
    print "#########  EXTRACTION OF DATE FEATURE ###### "

    date_source = []
    date_target = []
    diff = []
    for i in range(len(sett)):
        source = sett[i][0]
        target = sett[i][1]
        date_source.append(dates[ID_pos[source]])
        date_target.append(dates[ID_pos[target]])
        diff.append((abs(int(dates[ID_pos[source]])-int(dates[ID_pos[target]]))))
        if i % 1000 == 0:
            print i, " examples processsed"
    return date_source, date_target,diff

# 5 - Feature scaling before training or testing

#### List of avaibles features : <br/>
overlap_abstract <br/>
overlap_title<br/>
comm_auth<br/>
similarity_abstract<br/>
similarity<br/>
cite_source<br/>
cite_target<br/>
date_source<br/>
date_target<br/>
diff<br/>

In [36]:
#features = [overlap_abstract,overlap_title,comm_auth,cite_source,cite_target,diff]
def scaling(features):
    print "-----------------------------------"
    print "#########  SCALING FEATURES ###### "
    featuress = np.array(features).T # transposée 
    featuress = preprocessing.scale(featuress)
    return featuress

We put labels in a separated list before training a classifier.

In [37]:
# convert labels into integers then into column array
def convert_labels(training_set):
    print "-----------------------------------"
    print "#########  CONVERT LABELS ###### "
    labels = [int(element[2]) for element in training_set]
    labels = list(labels)
    labels_array = np.array(labels)
    return labels_array

---------------------------------------------------------------------------------
=========================

# 6 - Evaluation

I decided to test with 3 classifiers. It appears that the MLP was the best so that I kept it. You can see results with other classifiers beelow.

In [39]:
def classification(training_set,labels_array,training_features):
    print "-----------------------------------"
    print "#########  CLASSIFICATION ###### "
    kf = KFold(len(training_set), n_folds=10) # Cross validation sur 10 sous ensembles de training_features
    sumf1=0
    for train_index, test_index in kf: # train_index, test_indexs = liste d'index sur interval
        X_train, X_test = training_features[train_index], training_features[test_index]
        y_train, y_test = labels_array[train_index], labels_array[test_index]
        # initialize basic SVM :
        #classifier = svm.LinearSVC()
        #classifier = RandomForestClassifier(n_estimators=200, min_samples_split=2)
        classifier =  MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10,5), random_state=1)

        # train
        classifier.fit(X_train, y_train)
        labels_predicted = classifier.predict(X_test)
        #y_score = classifier.predict_proba(X_test)
        sumf1+=f1_score(labels_predicted,y_test)

        # Evaluation of the prediction
        #print classification_report(y_test, labels_predicted)
        #print "The accuracy score is {:.2%}".format(accuracy_score(y_test, labels_predicted))
        # Compute ROC curve and area under the curve
        #fpr, tpr, thresholds = roc_curve(y_test, y_score[:, 1])
        #print thresholds
        #roc_auc = auc(fpr, tpr)
        #print "Area under the ROC curve : %f" % roc_auc
        #print "---------------------------------------------"

    print "Resultat final : "
    print sumf1/10.0 # moyenne sur les 10 sous ensembles de training_set
    return classifier

#### Evaluation metric : <br/>
𝐹1 = 2 (𝑝∗𝑟)/(𝑝+𝑟) <br/><br/>
Where : 𝑝=𝑡𝑝/(𝑡𝑝+𝑓𝑝) : precision <br/> and r=𝑡𝑝/(𝑡𝑝+𝑓𝑛) : recall

# 7 - Training Script

There are 3 'main' block depending on what we want in the tfidf matrix to compute cosine similarity. <br/>
We just have too run the block to compute the training of a classifier. Take care to select a correct percentage of the dataset to limit the computation time. <br/>
You have to comment/uncomment the line of 'extract_features' if you want to use cite or citer/cited in feature.

##### DATA = ABSTRACT FOR TFIDF_MATRIX

In [94]:
# DATA = ABSTRACT FOR TFIDF_MATRIX
# LOAD FILES / TREATMENT
testing_set,training_set,node_info = download_files()
ID_pos, data,dates,cite0 = create_ID_pos(node_info) # init dictionaire des citations
cited,citer = cite_cited2(training_set,cite0)
cite = cite_cited1(training_set,cite0)

training_set, valid_ids = actualisation_set(training_set,0.05) #5%
#node_info = actualisation_node_info(node_info) # peu de difference entre set entier (27,000) et subset 22,000

# CREATE FEATURES
tfidf_matrix = create_tfidf(data,my_preprocessor)

overlap_title,comm_auth,citer_source,cited_target,similarity_abstract = extract_autor_title_abstract_similarity2(training_set,citer,cited,ID_pos,node_info,tfidf_matrix)
#overlap_title,comm_auth,citer_source,cited_target,similarity_abstract = extract_autor_title_abstract_similarity3(training_set,cite,ID_pos,node_info,tfidf_matrix)

date_source, date_target,diff = extract_date(training_set,dates,ID_pos)

# FEATURE SELECTION :
features = [similarity_abstract,overlap_title,comm_auth,citer_source,cited_target,date_source, date_target]
training_features = scaling(features)
labels_array = convert_labels(training_set)

# CLASSIFICATION : 
classifier = classification(training_set,labels_array,training_features)

------------------------------------ 
#########  DOWNLOAD FILES  ######### 
-----------------------------------
#########  CREATE ID_POS, data, cite and dates ###### 
Nombre de nodes dans ID_pos :  27770
-----------------------------------
#########  CHECK PAPERS CITED AND CITER ###### 
-----------------------------------
#########  CREATE DICTIONNARY CITE ###### 
Number of '1' labels in training_set :  335130  of  615512
---------------------------------------
#####  REDUCTION OF TRAINING_SET ###### 
We keep  30776  elements of training_set
Valid_ids' length :  22865
-----------------------------------
#########  SCALING FEATURES ###### 
-----------------------------------
#########  CONVERT LABELS ###### 
-----------------------------------
#########  CLASSIFICATION ###### 
Resultat final : 
0.683393942064


In [93]:
print len(similarity_abstract),len(overlap_title),len(comm_auth),len(cite_source),len(cite_target),len(date_source),len(date_target)

30776 30776 30776 615512 615512 30776 30776


##### DATA = ABSTRACT + AUTHOR FOR TFIDF_MATRIX

In [245]:
# DATA_ABS_TIT = ABSTRACT + AUTHOR FOR TFIDF_MATRIX
# LOAD FILES / TREATMENT
testing_set,training_set,node_info = download_files()
ID_pos, data,dates,cite0 = create_ID_pos(node_info) # init dictionaire des citations
cited,citer = cite_cited2(training_set,cite0) # Sur 80,000 echant de training_set
cite = cite_cited1(training_set,cite0)

training_set, valid_ids = actualisation_set(training_set,0.05) #5%
#node_info = actualisation_node_info(node_info) # peu de difference entre set entier (27,000) et subset 22,000
ID_pos, data_abs_tit,dates,cite0 = create_ID_pos2(node_info) # init dictionaire des citations


# CREATE FEATURES
#tfidf_matrix_abs_tit = create_tfidf(data_abs_tit,my_preprocessor)

comm_auth,cite_source,cite_target,similarity_ = extract_autor_title_abstract_similarity4(training_set,citer,cited,ID_pos,node_info,tfidf_matrix_abs_tit)
#comm_auth,cite_source,cite_target,similarity_abstract = extract_autor_title_abstract_similarity5(training_set,cite,ID_pos,node_info,tfidf_matrix_abs_tit)

date_source, date_target,diff = extract_date(training_set,dates,ID_pos)

# FEATURE SELECTION :
features = [similarity_,comm_auth,cite_source,cite_target,diff]
features = [similarity_,comm_auth,cite_source,cite_target,date_source, date_target]
training_features = scaling(features)
labels_array = convert_labels(training_set)

# CLASSIFICATION : 
classifier = classification(training_set,labels_array,training_features)

-----------------------------------
#########  DOWNLOAD FILES ###### 
-----------------------------------
#########  CREATE ID_POS ###### 
Nombre de nodes dans ID_pos :  27770
-----------------------------------
#########  CHECK PAPERS CITED AND CITER ###### 
-----------------------------------
#########  REDUCTION OF TRAINING_SET ###### 
On garde 30776 elements de training_set
Taille de valid_ids :  22857
-----------------------------------
#########  CREATE ID_POS ###### 
Nombre de nodes dans ID_pos :  27770
-----------------------------------
#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### 
nombre d'echantillons à traiter :  30776
0  examples processsed
1000  examples processsed
2000  examples processsed
3000  examples processsed
4000  examples processsed
5000  examples processsed
6000  examples processsed
7000  examples processsed
8000  examples processsed
9000  examples processsed
10000  examples processsed
11000  examples processsed
12000  examples processs

##### DATA = ABSTRACT + TITLE + AUTHOR (+journal) FOR TFIDF_MATRIX
<p style='color:red'>Best combinaison of features : </p>

In [54]:
# DATA_ABS_TIT_AU = ABSTRACT + TITLE + AUTHOR (+journal) FOR TFIDF_MATRIX
# LOAD FILES / TREATMENT
testing_set,training_set,node_info = download_files()
ID_pos, data,dates,cite0 = create_ID_pos(node_info) # init dictionaire des citations
cited,citer = cite_cited2(training_set,cite0) # Sur 80,000 echant de training_set
cite = cite_cited1(training_set,cite0)

training_set, valid_ids = actualisation_set(training_set,1) #100%
ID_pos, data_abs_tit_au,dates,cite0 = create_ID_pos4(node_info) # init dictionaire des citations

# CREATE FEATURES
tfidf_matrix_abs_tit_au = create_tfidf(data_abs_tit_au,my_preprocessor)

cite_source,cite_target,similarity_= extract_autor_title_abstract_similarity6(training_set,citer,cited,ID_pos,node_info,tfidf_matrix_abs_tit_au)
#cite_source,cite_target,similarity_abstract = extract_autor_title_abstract_similarity7(training_set,cite,ID_pos,node_info,tfidf_matrix_abs_tit_au)

date_source, date_target,diff = extract_date(training_set,dates,ID_pos)

# FEATURE SELECTION :
#features = [similarity_,cite_source,cite_target,diff]
features = [similarity_,cite_source,cite_target,date_source, date_target]

training_features = scaling(features)
labels_array = convert_labels(training_set)

# CLASSIFICATION : 
classifier = classification(training_set,labels_array,training_features)

-----------------------------------
#########  DOWNLOAD FILES ###### 
-----------------------------------
#########  CREATE ID_POS ###### 
Nombre de nodes dans ID_pos :  27770
-----------------------------------
#########  CHECK PAPERS CITED AND CITER ###### 
-----------------------------------
#########  REDUCTION OF TRAINING_SET ###### 
On garde 615512 elements de training_set
Taille de valid_ids :  27770
-----------------------------------
#########  CREATE ID_POS ###### 
Nombre de nodes dans ID_pos :  27770
-----------------------------------
#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### 
nombre d'echantillons à traiter :  615512
0  examples processsed
1000  examples processsed
2000  examples processsed
3000  examples processsed
4000  examples processsed
5000  examples processsed
6000  examples processsed
7000  examples processsed
8000  examples processsed
9000  examples processsed
10000  examples processsed
11000  examples processsed
12000  examples proces

In [173]:
# correct 'setting an array with a sequence' error which appears often when sizes are different
print len(similarity_abstract)
print len(overlap_title)
print len(comm_auth)
print len(citer_source)
print len(cited_target)
print len(diff)

30776
30776
30776
30776
30776
30776


--------------------------------------------------------------------------------

## Evolution of results depending on classifier and features :

### MLP : 

#### With TFIDF_matrix composed of abstracts : 

- 88,15% on 5% of training_set ; features citer/cited/date_source/date_target : 83,31% on leaderboard

#### With TFIDF_matrix composed of abstracts and authors : 

- 94,81% on 5% of training_set ; features citer/cited/date_source/date_target : 86,23% on leaderboard

- 94,89% sur 5% citer/cited/date_source/date_target : 86,52% sur leaderboard

- <p>92,51% sur 5% citer/cited/date_source/date_target : 92,38% sur leaderboard</p> with CITER/CITED on 600,000 samples

- <p >89,03% sur 5% citer/cited/diff : 88,94% sur leaderboard</p> AVEC CITER/CITED SUR Training_set réduit

- <p>89,93% sur 5% citer/cited/diff : 89,76% sur leaderboard</p> AVEC CITER/CITED SUR 80,000 echant

- <p >90,83% sur 5% citer/cited/diff : 90,49% sur leaderboard</p> AVEC CITER/CITED SUR 600,000 echant

- 88,4% sur 5% cite/date_source/date_target : 82,32% sur leaderboard

#### With TFIDF_matrix composed of abstracts, authors and journal : 

- <p>92,32% sur 5% citer/cited/date_source/date_target : 92,51% sur leaderboard</p> AVEC CITER/CITED SUR 600,000 echant

- <p>90,51% sur 5% citer/cited/diff : - de 92,51% sur leaderboard</p> AVEC CITER/CITED SUR 600,000 echant

#### With TFIDF_matrix composed of abstracts, authors and title : 

- <p> 92,58% sur 5% citer/cited/date_source/date_target : 92,52 Marche plus sur leaderboard</p> AVEC CITER/CITED SUR 600,000 echant predict_test2.csv

#### With TFIDF_matrix composed of abstracts, authors, title and journal : 

- <p> 93,13% sur 5% citer/cited/date_source/date_target : 92,97 sur leaderboard</p> AVEC CITER/CITED SUR 600,000 echant predict_test3.csv

- <p> 93,33% sur 50% citer/cited/date_source/date_target : 93,26 sur leaderboard </p> AVEC CITER/CITED SUR 600,000 echant ; temps calcul training : 8min env ; classif prend le plus de temps

- <p style=' color:red;'> 93,37% on 100% of training_set ; fetaures: citer/cited/date_source/date_target : 93,37% on leaderboard </p> with CITER/CITED on 600,000 samples ; time of training : 13min ; classification take long time

### Random Forest : 

#### With TFIDF_matrix composed of abstracts : 

- 86% sur 5% citer/cited/date_source/date_target : 79% sur leaderboard
- 88% sur 5% cite/date_source/date_target : 81% sur leaderboard


#### With TFIDF_matrix composed of abstracts and authors : 

- 93% sur 5% citer/cited/date_source/date_target : 79% sur leaderboard

- 87,8% sur 5% citer/cited/date_source/date_target : 80,74% sur leaderboard

### SVM : 

#### With TFIDF_matrix composed of abstracts and authors : 

- 88,80% sur 5% citer/cited/date_source/date_target : 88,84% leaderboard AVEC CITER/CITED SUR 600,00

- 90% avec citer/cited/date_source/date_target sur 5%
- 88% avec cite/date_source/date_target sur 5%

#### With TFIDF_matrix composed of abstracts : 

- 84,09% sur 5% citer/cited/date_source/date_target : 72,17% leaderboard AVEC CITER/CITED SUR 600,00

- 83,87% sur 5% citer/cited/diff : 72,35% leaderboard AVEC CITER/CITED SUR 600,00

- citer/cited/date_source/date_target : 88.59% leaderboard 
- citer/cited/diff : 88.29% leaderboard 
- 92.59% sur 5% ; 94.2% sur 2% cite/date_source/date_target/diff : (?)% sur leaderboard 
- 90,23% sur 5% ; 92.2% sur 2% cite/date_source/date_target : 87% sur leaderboard
- (?)90,23% sur 5% cite/date_source/date_target :  83.09% sur leaderboard<br/>

Summary : <br/><br/>
I started with a SVM classifier, I tested several combinaisons of features beetween cite or citer/cited and date_source/date_target or diff. I also tested differences between abstract+author or abstract+author+journal or abstract+author+title+journal in tfidf_matrix to compute similarity. I concluded that the more information I put in the better prediction is. And I concluded that the better features were citer/cited/date_source/date_target. <br/><br/>
Then I tried RandomForest classifier but, even with different features, were worst than with the SVM one.<br/> <br/>
So I started using MLP classifier. Results were at the moment much better.<br/>

I also noticed that I limited the computation of citer/cited on the reduced training_set. In that case citer/cited contained not much information so I expended it to the whole dataset. The computation didn't take much time. Results were much better.

# 8 - Testing Script

## Submission file

For each node pair in the testing set, your model should predict whether there is an edge between the two nodes (1) or not (0). <br/>
The testing set contains 50% of true edges (the ones that have been removed from the original network) and 50% of synthetic, wrong edges (pairs of randomly selected nodes between which there was no edge).

This script is used to save prediction of labels from a trained classifier. It save labels in a prediction_test.csv file. Itake into account that variables such as cite/Cited/citer/dates/ID_pos/tfidf_matrix were computed before in the training script

In [55]:
# LOAD FILES, CREATE TFIDF_MATRIX, CITE, CITED, CITER, DATES MADE DURING TRAINING BLOCK

#overlap_title_test,comm_auth_test,citer_source_test,cited_target_test,similarity_abstract_test = extract_autor_title_abstract_similarity2(testing_set,citer,cited,ID_pos,node_info,tfidf_matrix)
#overlap_title_test,comm_auth_test,citer_source_test,cited_target_test,similarity_abstract_test = extract_autor_title_abstract_similarity3(testing_set,cite,ID_pos,node_info,tfidf_matrix)

#comm_auth_test,citer_source_test,cited_target_test,similarity_abstract_test = extract_autor_title_abstract_similarity5(testing_set,cite,ID_pos,node_info,tfidf_matrix_abs_tit)
#comm_auth_test,cite_source_test,cite_target_test,similarity_abstract_test = extract_autor_title_abstract_similarity4(testing_set,citer,cited,ID_pos,node_info,tfidf_matrix_abs_tit)

citer_source_test,cited_target_test,similarity_abstract_test= extract_autor_title_abstract_similarity6(testing_set,citer,cited,ID_pos,node_info,tfidf_matrix_abs_tit_au)
#citer_source_test,cited_target_test,similarity_abstract_test = extract_autor_title_abstract_similarity7(testing_set,cite,ID_pos,node_info,tfidf_matrix_abs_tit_au)


date_source_test, date_target_test,diff_test = extract_date(testing_set,dates,ID_pos)

# FEATURE SELECTION :
#features_test = [overlap_title_test,similarity_abstract_test,comm_auth_test,cite_source_test,cite_target_test,date_source_test, date_target_test]
#features_test = [similarity_abstract_test,comm_auth_test,cite_source_test,cite_target_test,date_source_test, date_target_test]
features_test = [similarity_abstract_test,citer_source_test,cited_target_test,date_source_test, date_target_test]

testing_features = scaling(features_test)

# CLASSIFICATION
labels_predicted = classifier.predict(testing_features)
print "##### PREDICTION TERMINATED ########"
print "taille de testing_set : ",len(testing_set)," taille de labels_predicted : ",len(labels_predicted)
# SAUVEGARDE DE LABELS_PREDICTED
with open('predictions_test.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile)
    spamwriter.writerow(['Id','prediction'])
    for i in range(len(labels_predicted)):
        spamwriter.writerow([i,int(labels_predicted[i])])
print "##### PREDICTION SAVED ########"

-----------------------------------
#########  EXTRACTION OF AUTOR/TITLE/ABSTRACT/CITATION FEATURES ###### 
nombre d'echantillons à traiter :  32648
0  examples processsed
1000  examples processsed
2000  examples processsed
3000  examples processsed
4000  examples processsed
5000  examples processsed
6000  examples processsed
7000  examples processsed
8000  examples processsed
9000  examples processsed
10000  examples processsed
11000  examples processsed
12000  examples processsed
13000  examples processsed
14000  examples processsed
15000  examples processsed
16000  examples processsed
17000  examples processsed
18000  examples processsed
19000  examples processsed
20000  examples processsed
21000  examples processsed
22000  examples processsed
23000  examples processsed
24000  examples processsed
25000  examples processsed
26000  examples processsed
27000  examples processsed
28000  examples processsed
29000  examples processsed
30000  examples processsed
31000  examples processsed


In [222]:
# correct 'setting an array with a sequence' error
print len(similarity_abstract_test)
#print len(overlap_title)
print len(comm_auth_test)
print len(citer_source_test)
print len(cited_target_test)
print len(diff)

32648
32648
32648
32648
30776
