<img src="logo.jpg" width="700" />

# PANACEA, Authorship verification Classifier

The purpose of this code is to build a supervised learning classifier from the Enron dataset, which contains emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.

Code main elements:

1. All the code was implemented in Python 2.7 https://www.python.org/
2. The whole Enron dataset can be downloaded from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz
3. The Python packages required to run the programs are the following:
    * Jupyter notebook (Python interactive prompt) http://jupyter.org/index.html
    * Matplotlib (visualization) https://matplotlib.org/
    * Numpy (mathematical functions) http://www.numpy.org/
    * Scipy (mathematical functions) https://www.scipy.org/
    * NLTK (Natural Language Processing) https://www.nltk.org/
    * sklearn (Machine Learning) http://scikit-learn.org/stable/
    

This classifier follows the __impostors approach__, which has been used successfully to tackle the __Authorship Verification problem__. For more information see:

1. Authorship verification (problem identified in the Enron dataset): https://pan.webis.de/clef15/pan15-web/author-identification.html
2. Impostors approach: https://pdfs.semanticscholar.org/5c2b/6876df693e096c6c150a5b0d2a2c05043003.pdf
3. Improved impostors approach: http://www.icsd.aegean.gr/lecturers/stamatatos/papers/CLEF-Potha-2017.pdf
    

## Obtain Enron dataset

1. Obtain notebook current path.

In [1]:
import os
noteBookPath= os.getcwd()
os.chdir(noteBookPath)
print noteBookPath

C:\Users\EstebanCj\Desktop\behavior-profile-classifier\Extra stuff


2\. Download the dataset.

In [2]:
import urllib
import os

url="https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz"
dataSetFileName="enron_mail_20150507.tar.gz"
dataSetFilePath=os.path.join(noteBookPath,dataSetFileName)
try:
    if  not (os.path.isfile(dataSetFilePath)):
        print("downloading Enron Dataset")
        urllib.urlretrieve(url, filename=dataSetFilePath)
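        print("finished downloading Dataset")
    else:   
        print("Dataset already downloaded")
except IOError:
    # urllib.urlretrieve signals connection failures with IOError; HTTPError was never imported in this cell
    print "Could not download"        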

Dataset already downloaded
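
The cell above targets Python 2.7, where `urlretrieve` lives at the top level of `urllib`. As a minimal sketch, assuming Python 3 instead, the same download-once logic could look like the following. The function name `download_if_missing` and its `retrieve` parameter are hypothetical, introduced only so the logic can be exercised without network access.

```python
import os
from urllib.error import URLError
from urllib.request import urlretrieve


def download_if_missing(url, path, retrieve=urlretrieve):
    """Download url to path unless the file already exists.

    Returns True if a download happened, False if the file was already
    present or the download failed.
    """
    if os.path.isfile(path):
        print("Dataset already downloaded")
        return False
    try:
        print("Downloading Enron dataset")
        retrieve(url, filename=path)
        print("Finished downloading dataset")
        return True
    except (URLError, OSError) as error:
        # HTTPError derives from URLError, which derives from OSError
        print("Could not download: %s" % error)
        return False
```

Calling `download_if_missing(url, dataSetFilePath)` with the names from the cell above would reproduce its behaviour, with the difference that failures are reported with their cause.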


3\. Unzip the dataset.

In [3]:
import tarfile
import os

try:
    if  not (os.path.isdir(os.path.join(noteBookPath,"maildir"))):
        os.chdir(noteBookPath)
        # Open the archive with gzip decompression
        print("Extract Enron Dataset")
        file = tarfile.open(dataSetFileName, "r:gz") 
        # Extract all files and folders
        file.extractall()
        file.close()
        print("finished extracting Dataset")
    else:   
        print("Dataset already extracted")     
except tarfile.ReadError:
    print "Could not extract folder"      

Dataset already extracted
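
A minimal sketch of the same extract-once step, assuming Python 3. The helper name `extract_if_missing` and its `marker` parameter are hypothetical; the `with` statement is used so the archive is closed even when extraction fails.

```python
import os
import tarfile


def extract_if_missing(archive_path, target_dir, marker="maildir"):
    """Extract a .tar.gz archive into target_dir unless marker already exists there."""
    if os.path.isdir(os.path.join(target_dir, marker)):
        print("Dataset already extracted")
        return False
    try:
        # The context manager closes the archive even if extraction fails
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(path=target_dir)
        print("Finished extracting dataset")
        return True
    except tarfile.ReadError:
        print("Could not extract archive")
        return False
```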



## Pre-process dataset texts (all_documents folders)

1. Get the current Enron dataset path

In [4]:
import os
datasetPath= os.path.join(noteBookPath,"maildir")
print datasetPath

C:\Users\EstebanCj\Desktop\behavior-profile-classifier\Extra stuff\maildir


2\. For each user in the dataset, obtain the "all_documents" folder path.

In [5]:
import os
folder="all_documents"
datasetUsersPath =[(x,os.path.join(datasetPath,x)) for x in os.listdir(datasetPath) ]
datasetDocumentsPath=[(y[0],os.path.join(y[1],folder)) for y in datasetUsersPath if os.path.isdir(os.path.join(y[1],folder))]
users=[x[0] for x in datasetDocumentsPath]
print "Example of users that have an all_documents folder"
print users[:2]
datasetDocumentsPath=[x[1] for x in datasetDocumentsPath]
print datasetDocumentsPath[:2]

Example of users that have an all_documents folder
['allen-p', 'arnold-j']
['C:\\Users\\EstebanCj\\Desktop\\behavior-profile-classifier\\Extra stuff\\maildir\\allen-p\\all_documents', 'C:\\Users\\EstebanCj\\Desktop\\behavior-profile-classifier\\Extra stuff\\maildir\\arnold-j\\all_documents']


3\. Save each user's texts from the dataset in a single file (one file per user) in the __userDocuments__ folder.

In [6]:
import codecs
import errno
import re
import os

dirName="userDocuments"
try:
    os.mkdir(dirName)
    print "Directory _"+dirName+"_ Created " 
except OSError as e:
    if e.errno == errno.EEXIST:
        print "Directory _"+ dirName+"_ already exists"
        pass
    else:
        raise 
except:
    print "Unexpected error"
    pass        
        
usersDocumentsPath=os.path.join(noteBookPath,dirName)        
for user, documentsPath in zip(users,datasetDocumentsPath):
    texts=[]
    userFilesPath=os.path.join(usersDocumentsPath,user+".txt")
    try:
        
        if  not (os.path.isfile(userFilesPath)):
            print "User: "+user+" documents created"
            for directory, subdirectory, filenames in os.walk(documentsPath):
                for userfile in filenames:
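                    # Join with the directory os.walk is visiting so files in subfolders resolve correctly
                    with codecs.open(os.path.join(directory, userfile), "r") as file:
                         text = file.read()
                         text= text[text.index(".nsf")+4:]   
                         # Replace newlines with spaces so words at line ends are not glued together
                         text = text.replace('\n',' ')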
                         #Text separator: |||   
                         texts.append("File: "+userfile+"|||"+text)
            with codecs.open(userFilesPath, "w" ,"UTF-8") as file:  
                [file.write(x+"\n") for x in texts]
        else:
            print "User: "+user+" documents already exist"
            
    except IOError:
        print "User: "+user+" could not read file"
    except Exception as e:
        # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
        print "User: "+user+" unexpected error: "+str(e)

Directory _userDocuments_ already exists
User: allen-p documents already exist
User: arnold-j documents already exist
User: arora-h documents already exist
User: badeer-r documents already exist
User: bailey-s documents already exist
User: bass-e documents already exist
User: baughman-d documents already exist
User: beck-s documents already exist
User: benson-r documents already exist
User: brawner-s documents already exist
User: buy-r documents already exist
User: campbell-l documents already exist
User: carson-m documents already exist
User: cash-m documents already exist
User: corman-s documents already exist
User: cuilla-m documents already exist
User: dasovich-j documents already exist
User: davis-d documents already exist
User: dean-c documents already exist
User: delainey-d documents already exist
User: derrick-j documents already exist
User: dickson-s documents already exist
User: donohoe-t documents already exist
User: dorland-c documents already exist
User: ermis-f documents 
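
The cell above drops the message headers by slicing everything after the first `.nsf` occurrence. As an alternative, and not the method used in this notebook, the standard-library `email` package can separate headers from the body. The sketch below assumes Python 3; the helper name `extract_body` and the sample message are made up for illustration.

```python
import email


def extract_body(raw_message):
    """Return the body of a raw email message as a single line of text."""
    message = email.message_from_string(raw_message)
    body = message.get_payload()
    # Multipart messages return a list of parts; this sketch only handles plain strings
    if not isinstance(body, str):
        return ""
    # Collapse all whitespace runs (including newlines) into single spaces
    return " ".join(body.split())


# Made-up message in the general shape of an Enron mail file
sample = (
    "Message-ID: <1.JavaMail.evans@thyme>\n"
    "From: someone@example.com\n"
    "X-FileName: example.nsf\n"
    "\n"
    "Please review\nthe attached forecast.\n"
)
print(extract_body(sample))  # Please review the attached forecast.
```

Parsing the headers this way avoids a `ValueError` on files that do not contain `.nsf`, at the cost of treating multipart messages as empty in this simplified form.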

4\. Obtain PoS tags and PoS-tag trigrams (3-gram windows) for each user's texts. Store them in the __userDocumentsPoSTags__ and __userDocumentsTrigrams__ folders.

In [7]:
import codecs
import errno
import nltk
import string
import re
import os
from nltk import ngrams

#Create appropriate folders
dirName2="userDocumentsPoSTags"
dirName3="userDocumentsTrigrams"
try:
    os.mkdir(dirName2)
    print "Directory _"+dirName2+"_ Created " 
    os.mkdir(dirName3)
    print "Directory _"+dirName3+"_ Created " 
except OSError as e:
    if e.errno == errno.EEXIST:
        print "One/all of the directories already exists"
        pass
    else:
        raise 
except:
    print "Unexpected error"
    pass 

#Obtain PoS tags and nGrams
PoSTextsPath=os.path.join(noteBookPath,dirName2) 
nGramTextsPath=os.path.join(noteBookPath,dirName3)
users=[os.path.splitext(x)[0] for x in os.listdir(usersDocumentsPath)] 

for user in users:
    PoSFilesPath=os.path.join(PoSTextsPath,user+".txt")
    NgramFilesPath=os.path.join(nGramTextsPath,user+".txt")
    try:
        
        if (not (os.path.isfile(PoSFilesPath))) and (not (os.path.isfile(NgramFilesPath))) :
            
            print "User: "+user+" documents created" 
            PoSTagList= []
            nGramList = []
            filesList = []
            
            with codecs.open(os.path.join(usersDocumentsPath,user+".txt"), "r") as file:
                for line in file:
                    text=line.split("|||")    
                    filesList.append(text[0])
                    #Obtain PoS tags
                    #Tokenize words, keep punctuation symbols for obtaining better PoS tags
                    tokens=nltk.word_tokenize(text[1].replace('\n',''))
                    PoStags=nltk.pos_tag(tokens)
                    PoSText=" ".join([tag[1] for tag in PoStags])
                    PoSTagList.append(PoSText) 
                    #Obtain trigrams
                    #Once obtained PoS tags, eliminate punctuation symbols
                    PoSText= PoSText.translate(None, string.punctuation)
                    trigrams=ngrams(PoSText.split(), 3)
                    # trigram separator: @@ 
                    nGramList.append("@@".join([" ".join(x) for x in trigrams]))
             
            #Save users PoS tags to file 
            with codecs.open(PoSFilesPath, "w" ,"UTF-8") as file:  
                for x, y in zip(filesList,PoSTagList):
                    file.write(x+"|||"+y+"\n") 
                    
            #Save users trigrams to file         
            with codecs.open(NgramFilesPath, "w" ,"UTF-8") as file:  
                for x, y in zip(filesList,nGramList):
                    file.write(x+"|||"+y+"\n")
                    
        else:
            print "User: "+user+" documents already exist" 
                           
    except IOError:
        print "User: "+user+" could not read file"
    except Exception as e:
        # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
        print "User: "+user+" unexpected error: "+str(e)

One/all of the directories already exists
User: allen-p documents already exist
User: arnold-j documents already exist
User: arora-h documents already exist
User: badeer-r documents already exist
User: bailey-s documents already exist
User: bass-e documents already exist
User: baughman-d documents already exist
User: beck-s documents already exist
User: benson-r documents already exist
User: brawner-s documents already exist
User: buy-r documents already exist
User: campbell-l documents already exist
User: carson-m documents already exist
User: cash-m documents already exist
User: corman-s documents already exist
User: cuilla-m documents already exist
User: dasovich-j documents already exist
User: davis-d documents already exist
User: dean-c documents already exist
User: delainey-d documents already exist
User: derrick-j documents already exist
User: dickson-s documents already exist
User: donohoe-t documents already exist
User: dorland-c documents already exist
User: ermis-f documents
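
To make the trigram step concrete, here is a minimal sketch in plain Python 3 of what a 3-gram window over a PoS tag sequence produces, and how it maps to the `@@`-separated layout written by the cell above. The function names and the tag sequence are hypothetical; the notebook itself relies on `nltk.pos_tag` and `nltk.ngrams`.

```python
def pos_trigrams(tags):
    """Return every window of three consecutive PoS tags."""
    return list(zip(tags, tags[1:], tags[2:]))


def serialize_trigrams(trigrams, separator="@@"):
    """Join trigrams as 'TAG TAG TAG@@TAG TAG TAG', the layout used in the trigram files."""
    return separator.join(" ".join(trigram) for trigram in trigrams)


# Hypothetical tag sequence for a six-word sentence
tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]
print(serialize_trigrams(pos_trigrams(tags)))
# DT NN VBD@@NN VBD IN@@VBD IN DT@@IN DT NN
```

A sequence of n tags yields n - 2 trigrams, so documents with fewer than three tags produce an empty line.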

## Feature selection (PoS-tag trigrams)

Count how often each trigram occurs across all users' documents and sort the trigrams by frequency.

In [8]:
import operator
import codecs
import errno
import os

#Create appropriate folders
dirName4="userDocumentsFeatures"

try:
    os.mkdir(dirName4)
    print "Directory _"+dirName4+"_ Created " 
except OSError as e:
    if e.errno == errno.EEXIST:
        print "Directory _"+ dirName4+"_ already exists"
        pass
    else:
        raise 
except:
    print "Unexpected error"
    pass 

featuresTextsPath=os.path.join(noteBookPath,dirName4)
featureSet={}
try:
    featuresTextsPathTrigrams=os.path.join(featuresTextsPath,"trigrams.txt")
    if not (os.path.isfile(featuresTextsPathTrigrams)):
        for user in os.listdir(nGramTextsPath):
            trigramNumber=0
            with codecs.open(os.path.join(nGramTextsPath,user), "r", "UTF-8") as file:
                for line in file:
                     fileElements=line.split("|||")
                     trigrams=(fileElements[1].replace('\n','')).split("@@")   
                     trigramNumber+=len(trigrams)        
                     for trigram in trigrams:
                          if trigram in featureSet:
                             featureSet[trigram]+=1
                          else:
                             featureSet[trigram]=1
                print "User: "+user+" Number of trigrams: "+str(trigramNumber) 
                                             
        #sort trigram frequencies by value                   
        sortedFeatureSet = sorted(featureSet.items(), key=operator.itemgetter(1), reverse=True)
        print "Number of different trigrams in the dataset: "+str(len(sortedFeatureSet))
        #Save users trigrams to file
        
        with codecs.open(featuresTextsPathTrigrams, "w" ,"UTF-8") as file:  
               [file.write(x[0]+"|||"+str(x[1])+"\n") for x in sortedFeatureSet] 
                
        print "Feature document created" 
        
    else:
        print "Feature document already exist" 
            
except IOError:
    print "Could not read or write a trigram file"
except Exception as e:
    # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
    print "Unexpected error: "+str(e)

Directory _userDocumentsFeatures_ already exists
Feature document already exist
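
The counting loop above can also be expressed with `collections.Counter`. The following is a minimal sketch assuming Python 3 and the `File: name|||trigram@@trigram` line layout produced earlier. The helper name `count_trigrams` and the sample lines are hypothetical, and unlike the cell above it skips the empty entry left by documents with fewer than three tags.

```python
from collections import Counter


def count_trigrams(lines, field_sep="|||", trigram_sep="@@"):
    """Count trigram occurrences across lines shaped like 'File: name|||t1@@t2'."""
    counts = Counter()
    for line in lines:
        _, _, payload = line.rstrip("\n").partition(field_sep)
        # Skip empty entries produced by documents with fewer than three tags
        counts.update(t for t in payload.split(trigram_sep) if t)
    return counts


# Made-up lines in the trigram file layout
lines = [
    "File: 1.|||DT NN VBD@@NN VBD IN\n",
    "File: 2.|||DT NN VBD\n",
    "File: 3.|||\n",
]
counts = count_trigrams(lines)
print(counts.most_common(1))  # [('DT NN VBD', 2)]
```

`Counter.most_common(n)` returns the n most frequent trigrams already sorted, which matches the role of the sorted feature file.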


## Vector representation

For each document associated with a user, create a vector representation using a __frequency approach__: each vector holds the counts, in that document, of the 50 most frequent trigrams in the dataset. These vectors are the positive samples of the author's writing style. Negative samples for the same user are drawn at random from documents written by other users. The number of negative samples equals the number of positive ones, so that class imbalance does not bias the classifier.

In [9]:
import operator
import codecs
import errno
import os
import random

#Create appropriate folders
dirName5="userDocumentsVectors"
dirName6="trigramsFrequency"
try:
    os.mkdir(dirName5)
    print "Directory _"+dirName5+"_ Created "
    usersVectorsPath= os.path.join(noteBookPath,dirName5)
    trigramsVectorsPath= os.path.join(usersVectorsPath,dirName6)
    os.mkdir(trigramsVectorsPath)
    print "Directory _"+dirName6+"_ Created "
except OSError as e:
    if e.errno == errno.EEXIST:
        print "One/all of the directories already exists"
        usersVectorsPath= os.path.join(noteBookPath,dirName5)
        trigramsVectorsPath= os.path.join(usersVectorsPath,dirName6)
        pass
    else:
        raise 
except:
    print "Unexpected error"
    pass 

numberFeatures=50
featureSet=[]
vectorSet=[]
try:
    with codecs.open(featuresTextsPathTrigrams,"r","UTF-8") as file:
            for line in file:
                elementsList=line.split("|||")
                featureSet.append(elementsList[0])
    featureSet=featureSet[:numberFeatures] 
    
    for user in os.listdir(nGramTextsPath):
        nGramTextsPathFiles=os.path.join(nGramTextsPath,user)
        with codecs.open(nGramTextsPathFiles, "r", "UTF-8") as file:
                for line in file:
                    vector=[]
                    fileElements=line.split("|||")
                    trigrams= (fileElements[1].replace('\n','')).split("@@")
                    for feature in featureSet:
                        vector.append(str(trigrams.count(feature)))
                    vectorSet.append((user,vector))
                
    random.shuffle(vectorSet)            
    for user in os.listdir(nGramTextsPath):
        trigramsVectorsPathFiles=os.path.join(trigramsVectorsPath,user)
        if not (os.path.isfile(trigramsVectorsPathFiles)):
            print "User: "+user+" vectors created"
            with codecs.open(trigramsVectorsPathFiles, "w", "UTF-8") as file:
                userVectorsPositive=[file.write(",".join(x[1])+",true"+"\n") for x in vectorSet if x[0]==user] 
                userVectorsNegative=[x for x in vectorSet if x[0]!=user] 
                [file.write(",".join(x[1])+",false"+"\n") for x in userVectorsNegative[:len(userVectorsPositive)]]
        else:
            print "User: "+user+" vectors already exist"
                   
except IOError:
    print "User: "+user+" could not read file"
except Exception as e:
    # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
    print "User: "+user+" unexpected error: "+str(e)

One/all of the directories already exists
User: allen-p.txt vectors already exist
User: arnold-j.txt vectors already exist
User: arora-h.txt vectors already exist
User: badeer-r.txt vectors already exist
User: bailey-s.txt vectors already exist
User: bass-e.txt vectors already exist
User: baughman-d.txt vectors already exist
User: beck-s.txt vectors already exist
User: benson-r.txt vectors already exist
User: brawner-s.txt vectors already exist
User: buy-r.txt vectors already exist
User: campbell-l.txt vectors already exist
User: carson-m.txt vectors already exist
User: cash-m.txt vectors already exist
User: corman-s.txt vectors already exist
User: cuilla-m.txt vectors already exist
User: dasovich-j.txt vectors already exist
User: davis-d.txt vectors already exist
User: dean-c.txt vectors already exist
User: delainey-d.txt vectors already exist
User: derrick-j.txt vectors already exist
User: dickson-s.txt vectors already exist
User: donohoe-t.txt vectors already exist
User: dorland-c.t
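
The vector construction and the balanced sampling described in this section can be sketched as follows, assuming Python 3. The function names, the `seed` parameter and the dictionary layout (`user -> list of vectors`) are hypothetical and chosen for illustration; the notebook cell shuffles one global list instead.

```python
import random


def frequency_vector(trigrams, features):
    """Count how often each selected feature trigram occurs in one document."""
    return [trigrams.count(feature) for feature in features]


def balanced_samples(vectors_by_user, target_user, seed=0):
    """Return (vector, label) pairs for target_user.

    Positives are all vectors of target_user; negatives are drawn at random
    from the other users, as many as there are positives (or fewer if the
    other users do not have enough vectors).
    """
    positives = [(vector, "true") for vector in vectors_by_user[target_user]]
    others = [
        vector
        for user, vectors in vectors_by_user.items()
        if user != target_user
        for vector in vectors
    ]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    k = min(len(positives), len(others))
    negatives = [(vector, "false") for vector in rng.sample(others, k)]
    return positives + negatives
```

For example, `frequency_vector(["DT NN VBD", "DT NN VBD", "IN DT NN"], ["DT NN VBD", "NN VBD IN"])` returns `[2, 0]`.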

## Text classification

Build a classification model for each user and evaluate it with K-fold cross-validation.

1. Obtain the vectors associated to each user in the dataset. 

In [10]:
#Load the vectors previously created to create a classification model
import codecs
import errno
import os
try:
    usersVectors={}
    for user in os.listdir(trigramsVectorsPath): 
        print "Load User: "+user+" vectors"
        usersVectors[user]={}
        
        usersVectors[user]["vectors"]=[]
        usersVectors[user]["labels"]=[]
        
        with codecs.open(os.path.join(trigramsVectorsPath,user),"r","UTF-8") as file:
            for line in file:
                elementsList=(line.replace('\n','')).split(",")
                usersVectors[user]["vectors"].append([int(x) for x in elementsList[:-1]])
                usersVectors[user]["labels"].append(elementsList[-1])
                
except IOError:
    print "User: "+user+" could not read file"
except Exception as e:
    # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
    print "User: "+user+" unexpected error: "+str(e)

Load User: allen-p.txt vectors
Load User: arnold-j.txt vectors
Load User: arora-h.txt vectors
Load User: badeer-r.txt vectors
Load User: bailey-s.txt vectors
Load User: bass-e.txt vectors
Load User: baughman-d.txt vectors
Load User: beck-s.txt vectors
Load User: benson-r.txt vectors
Load User: brawner-s.txt vectors
Load User: buy-r.txt vectors
Load User: campbell-l.txt vectors
Load User: carson-m.txt vectors
Load User: cash-m.txt vectors
Load User: corman-s.txt vectors
Load User: cuilla-m.txt vectors
Load User: dasovich-j.txt vectors
Load User: davis-d.txt vectors
Load User: dean-c.txt vectors
Load User: delainey-d.txt vectors
Load User: derrick-j.txt vectors
Load User: dickson-s.txt vectors
Load User: donohoe-t.txt vectors
Load User: dorland-c.txt vectors
Load User: ermis-f.txt vectors
Load User: farmer-d.txt vectors
Load User: fischer-m.txt vectors
Load User: fossum-d.txt vectors
Load User: gay-r.txt vectors
Load User: germany-c.txt vectors
Load User: gilbertsmith-d.txt vectors
Load 

2\. Set up a K-fold cross-validation object for each user using the vectors previously created (K = 5 partitions).

In [11]:
#Import the KFold class from scikit-learn, a machine learning package for Python
#http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
from sklearn.model_selection import KFold
kFoldUsers=[]
foldNumber=5
try:
    for user in os.listdir(trigramsVectorsPath): 
        NumberVectors=len(usersVectors[user]["vectors"])
        if NumberVectors < foldNumber:
            # Define the split - into N folds using a random parameter to get different partitions
            kf = KFold(n_splits=NumberVectors,shuffle=True)
            print "User: "+user+" "+str(NumberVectors)+"-fold created"        
        else:    
            kf = KFold(n_splits=foldNumber,shuffle=True)
            print "User: "+user+" "+str(foldNumber)+"-fold created"

        # Returns the number of splitting iterations in the cross-validation process
        kf.get_n_splits(usersVectors[user]["vectors"])
        kFoldUsers.append((user, kf))
        print kf
except IOError:
    print "User: "+user+" could not read file"
except Exception as e:
    # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
    print "User: "+user+" unexpected error: "+str(e)

User: allen-p.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: arnold-j.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: arora-h.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: badeer-r.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: bailey-s.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: bass-e.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: baughman-d.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: beck-s.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: benson-r.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: brawner-s.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: buy-r.txt 5-fold created
KFold(n_splits=5, random_state=None, shuffle=True)
User: campbell-l.txt 5-fold created
KFold(n_splits=5, random_state=None
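
To show what a shuffled K-fold split does with the sample indices, here is a simplified pure-Python stand-in, assuming Python 3. It is not scikit-learn's implementation; the function name `k_fold_indices` and its `seed` parameter are hypothetical. Like `KFold`, it needs at least two splits, so a user with a single vector cannot be cross-validated.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=None):
    """Yield (train_indices, test_indices); each sample is in exactly one test fold."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = sorted(indices[start:start + size])
        train = sorted(indices[:start] + indices[start + size:])
        yield train, test
        start += size


for train, test in k_fold_indices(12, n_splits=5, seed=1):
    print("train=%s test=%s" % (train, test))
```

Every index appears in exactly one test fold, and the training indices of a fold are all the remaining ones.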

3\. For each user, slice the vectors and their labels into multiple training and test subsets, following the K-fold cross-validation technique.

In [12]:
#Import the KFold class from scikit-learn, a machine learning package for Python
#http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
from sklearn.model_selection import KFold
try:
    for x in kFoldUsers: 
        num=1
        print "User: "+x[0]+" folds examples:"
        for train_indices, test_indices in x[1].split(usersVectors[x[0]]["vectors"]):
            #Print the first ten indices associated to the features of each user fold
            print("Fold "+str(num)+"---> Train: %s | test: %s" % (train_indices[:10], test_indices[:10]))
            num+=1
except IOError:
    # The loop variable in this cell is x, so the user name is x[0]
    print "User: "+x[0]+" could not read file"
except Exception as e:
    print "User: "+x[0]+" unexpected error: "+str(e)

User: allen-p.txt folds examples:
Fold 1---> Train: [ 0  1  4  5  6  7  8 10 11 13] | test: [ 2  3  9 12 19 29 34 39 46 51]
Fold 2---> Train: [ 0  1  2  3  5  7  9 11 12 13] | test: [ 4  6  8 10 16 18 22 23 36 40]
Fold 3---> Train: [0 1 2 3 4 5 6 7 8 9] | test: [11 14 17 33 35 41 44 57 61 63]
Fold 4---> Train: [ 0  1  2  3  4  6  7  8  9 10] | test: [ 5 20 25 26 28 32 37 42 43 55]
Fold 5---> Train: [ 2  3  4  5  6  8  9 10 11 12] | test: [ 0  1  7 13 15 21 24 27 30 31]
User: arnold-j.txt folds examples:
Fold 1---> Train: [ 0  1  3  4  5  6  7  9 14 15] | test: [ 2  8 10 11 12 13 21 31 33 43]
Fold 2---> Train: [ 0  1  2  3  4  6  7  8  9 10] | test: [ 5 17 20 24 32 36 42 44 46 47]
Fold 3---> Train: [ 0  1  2  3  5  6  7  8 10 11] | test: [ 4  9 14 22 25 28 34 41 66 70]
Fold 4---> Train: [ 0  2  4  5  7  8  9 10 11 12] | test: [ 1  3  6 15 18 23 29 37 40 45]
Fold 5---> Train: [ 1  2  3  4  5  6  8  9 10 11] | test: [ 0  7 16 19 26 27 30 35 38 39]
User: arora-h.txt folds examples:
Fold 1-

Fold 1---> Train: [ 1  2  3  5  6  7  8  9 10 11] | test: [ 0  4 40 42 49 52 55 58 60 61]
Fold 2---> Train: [ 0  1  2  3  4  5  6  8  9 10] | test: [ 7 15 21 24 25 32 33 35 36 43]
Fold 3---> Train: [ 0  2  4  6  7  8 11 12 13 15] | test: [ 1  3  5  9 10 14 17 19 26 31]
Fold 4---> Train: [ 0  1  3  4  5  6  7  9 10 11] | test: [ 2  8 13 16 22 27 29 30 34 38]
Fold 5---> Train: [ 0  1  2  3  4  5  7  8  9 10] | test: [ 6 11 12 18 20 23 28 37 41 44]
User: may-l.txt folds examples:
Fold 1---> Train: [ 1  2  3  4  5  6  7  8 10 11] | test: [ 0  9 14 24 33 44 45 46 61 63]
Fold 2---> Train: [ 0  3  4  5  6  7  8  9 11 12] | test: [ 1  2 10 13 17 20 22 31 32 40]
Fold 3---> Train: [ 0  1  2  5  6  7  8  9 10 11] | test: [ 3  4 12 15 25 26 34 37 39 47]
Fold 4---> Train: [ 0  1  2  3  4  5  6  7  9 10] | test: [ 8 11 18 21 28 29 30 35 36 41]
Fold 5---> Train: [ 0  1  2  3  4  8  9 10 11 12] | test: [ 5  6  7 16 19 23 27 38 43 53]
User: mcconnell-m.txt folds examples:
Fold 1---> Train: [ 1  2  5  6

4\. Set up an SVM classifier to predict whether or not a document belongs to a specific user/author.

In [13]:
#Import a Support Vector Machine (SVM) classifier from the scikit-learn package
#http://scikit-learn.org/stable/modules/svm.html
from sklearn.svm import LinearSVC
clf= LinearSVC()
#Show classifier attributes
print clf

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)


5\. Fit the classifier on each training fold, score it on the matching test fold, and calculate the mean accuracy for each user.

In [15]:
import numpy as np
from sklearn.metrics import classification_report

for x in kFoldUsers: 
    try:
        print "User: "+x[0]+" classification results:"
        num=1
        results=[]
        for trainIndex, testIndex in x[1].split(usersVectors[x[0]]["vectors"]): 
            trainVectors=[usersVectors[x[0]]["vectors"][index] for index in trainIndex]
            trainLabels=[usersVectors[x[0]]["labels"][index] for index in trainIndex]
            testVectors=[usersVectors[x[0]]["vectors"][index] for index in testIndex]
            testLabels=[usersVectors[x[0]]["labels"][index] for index in testIndex]
            #http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html
            #The fit function takes the feature vectors and labels of each training fold
            #The score function takes the feature vectors and labels of each test fold and returns the mean accuracy
            result=clf.fit(trainVectors,trainLabels).score(testVectors,testLabels)
            print  "Fold: "+str(num)+" accuracy: "+str(result)
            results.append(result)
            num+=1
        #Obtain the average accuracy across the folds for this user
        #https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html    
        print "User average accuracy: "+str(np.average(results)) 
        
    except ValueError: 
        print "User: "+x[0]+" can't predict/fit a model: internal error or few vector samples to fit"
    except IOError:
        print "User: "+x[0]+" could not read file"
    except Exception as e:
        # Exception (not BaseException) lets KeyboardInterrupt and SystemExit propagate
        print "User: "+x[0]+" unexpected error: "+str(e)

User: allen-p.txt classification results:
Fold: 1 accuracy: 0.6706349206349206
Fold: 2 accuracy: 0.6374501992031872
Fold: 3 accuracy: 0.6414342629482072
Fold: 4 accuracy: 0.6334661354581673
Fold: 5 accuracy: 0.6613545816733067
User average accuracy: 0.6488680199835578
User: arnold-j.txt classification results:
Fold: 1 accuracy: 0.639618138424821
Fold: 2 accuracy: 0.6109785202863962
Fold: 3 accuracy: 0.6634844868735084
Fold: 4 accuracy: 0.60381861575179
Fold: 5 accuracy: 0.645933014354067
User average accuracy: 0.6327665551381165
User: arora-h.txt classification results:
Fold: 1 accuracy: 0.5384615384615384
Fold: 2 accuracy: 0.6153846153846154
Fold: 3 accuracy: 0.38461538461538464
Fold: 4 accuracy: 0.6153846153846154
Fold: 5 accuracy: 0.5769230769230769
User average accuracy: 0.5461538461538462
User: badeer-r.txt classification results:
Fold: 1 accuracy: 0.6
Fold: 2 accuracy: 0.7166666666666667
Fold: 3 accuracy: 0.6583333333333333
Fold: 4 accuracy: 0.6722689075630253
Fold: 5 accuracy: 0

Fold: 1 accuracy: 0.7134502923976608
Fold: 2 accuracy: 0.6656891495601173
Fold: 3 accuracy: 0.6715542521994134
Fold: 4 accuracy: 0.6627565982404692
Fold: 5 accuracy: 0.718475073313783
User average accuracy: 0.6863850731422888
User: griffith-j.txt classification results:
Fold: 1 accuracy: 0.6194690265486725
Fold: 2 accuracy: 0.5132743362831859
Fold: 3 accuracy: 0.588495575221239
Fold: 4 accuracy: 0.6017699115044248
Fold: 5 accuracy: 0.6017699115044248
User average accuracy: 0.5849557522123894
User: grigsby-m.txt classification results:
Fold: 1 accuracy: 0.6277777777777778
Fold: 2 accuracy: 0.6111111111111112
Fold: 3 accuracy: 0.6111111111111112
Fold: 4 accuracy: 0.5166666666666667
Fold: 5 accuracy: 0.6333333333333333
User average accuracy: 0.6
User: guzman-m.txt classification results:
Fold: 1 accuracy: 0.7049576783555018
Fold: 2 accuracy: 0.6783555018137848
Fold: 3 accuracy: 0.6396614268440145
Fold: 4 accuracy: 0.6590084643288996
Fold: 5 accuracy: 0.6476997578692494
User average accura

Fold: 2 accuracy: 0.5520833333333334
Fold: 3 accuracy: 0.5416666666666666
Fold: 4 accuracy: 0.6076388888888888
Fold: 5 accuracy: 0.6354166666666666
User average accuracy: 0.575
User: merriss-s.txt classification results:
Fold: 1 accuracy: 0.7727272727272727
Fold: 2 accuracy: 0.7534246575342466
Fold: 3 accuracy: 0.7534246575342466
Fold: 4 accuracy: 0.7579908675799086
Fold: 5 accuracy: 0.8036529680365296
User average accuracy: 0.7682440846824409
User: mims-thurston-p.txt classification results:
Fold: 1 accuracy: 0.627906976744186
Fold: 2 accuracy: 0.7325581395348837
Fold: 3 accuracy: 0.7209302325581395
Fold: 4 accuracy: 0.7441860465116279
Fold: 5 accuracy: 0.7558139534883721
User average accuracy: 0.7162790697674419
User: neal-s.txt classification results:
Fold: 1 accuracy: 0.6465863453815262
Fold: 2 accuracy: 0.570281124497992
Fold: 3 accuracy: 0.6465863453815262
Fold: 4 accuracy: 0.5461847389558233
Fold: 5 accuracy: 0.6169354838709677
User average accuracy: 0.6053148076175671
User: nem

Fold: 4 accuracy: 0.6537267080745341
Fold: 5 accuracy: 0.6661490683229814
User average accuracy: 0.6456102520611577
User: taylor-m.txt classification results:
Fold: 1 accuracy: 0.5712237093690249
Fold: 2 accuracy: 0.5822179732313576
Fold: 3 accuracy: 0.5826959847036329
Fold: 4 accuracy: 0.5566714490674318
Fold: 5 accuracy: 0.5805834528933524
User average accuracy: 0.57467851385296
User: tholt-j.txt classification results:
Fold: 1 accuracy: 0.56
Fold: 2 accuracy: 0.5369127516778524
Fold: 3 accuracy: 0.6442953020134228
Fold: 4 accuracy: 0.610738255033557
Fold: 5 accuracy: 0.5771812080536913
User average accuracy: 0.5858255033557047
User: townsend-j.txt classification results:
Fold: 1 accuracy: 0.5
Fold: 2 accuracy: 0.4838709677419355
Fold: 3 accuracy: 0.5806451612903226
Fold: 4 accuracy: 0.6774193548387096
Fold: 5 accuracy: 0.6451612903225806
User average accuracy: 0.5774193548387097
User: tycholiz-b.txt classification results:
Fold: 1 accuracy: 0.5
Fold: 2 accuracy: 0.45454545454545453


## Future work

Future research avenues include the following:

1. Extract and use other linguistic features, such as stylistic aspects of texts (punctuation, word frequency, etc.) or other N-gram window sizes.
2. Optimize the classifier to maximize prediction results using scikit-learn functionalities.
3. Use different supervised learning algorithms, such as regression-based models, to improve accuracy.
4. Create vectors using distinct representation methods like __one hot encoding__ or __tf-idf__.
5. Use other representation models like __graphs__ to obtain meaningful patterns from texts that vectors cannot uncover.
6. Use other methodologies to tackle the authorship identification problem.
7. Improve error handling in the Python code implemented.
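
As an illustration of item 4, the following is a minimal sketch of a tf-idf weighting over trigram lists, assuming Python 3. It uses the raw count as term frequency and the unsmoothed `log(N / df)` as inverse document frequency, which is one common variant; libraries such as scikit-learn use smoothed or normalized forms, so the numbers would differ. The function name `tf_idf_vectors` is hypothetical and not part of the notebook.

```python
import math
from collections import Counter


def tf_idf_vectors(documents, features):
    """Return one tf-idf vector per document.

    documents: list of trigram lists, one list per document.
    features:  list of trigrams that define the vector positions.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for document in documents:
        doc_freq.update(set(document))  # count each trigram once per document
    vectors = []
    for document in documents:
        counts = Counter(document)
        vector = []
        for feature in features:
            if doc_freq[feature] == 0:
                vector.append(0.0)  # feature never seen in any document
            else:
                idf = math.log(n_docs / doc_freq[feature])
                vector.append(counts[feature] * idf)
        vectors.append(vector)
    return vectors
```

With this weighting, a trigram that occurs in every document gets weight 0, so the vectors emphasize trigrams that distinguish documents from one another.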