# Modelling

 * [Dev variables](#Dev-variables)
 * [Load data](#Load-data)
 * [Prepare data](#Prepare-data)
 * [Train model](#Train-model)
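   * [Dummy classifier](#Dummy-classifier)
   * [Decision Tree classifier](#Decision-Tree-classifier)
   * [Auto-sklearn classifier](#Auto-sklearn-classifier)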


## Dev variables

In [1]:
# data_root_folder='H:/AI_for_Selection/'
# data_root_folder='/media/AIDrive/'
data_root_folder='/home/jeremie/Documents/drive_TNA/ML/AI for selection/prototype/'


## Load data

In [2]:
import os
import sys
import numpy as np
import pandas as pd
import gensim
import spacy
import nltk

import logging
# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import summarize

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# None removes the column-width limit (-1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

In [3]:

folder_path_a = data_root_folder + 'a/'
# data_root_folder already ends with '/', so no leading slash is needed here
folder_path_websites = data_root_folder + 'Websites/'
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 

In [4]:
# Metadata file supplied to us by MA
metadata_file = data_root_folder + 'objective_files_with_acronyms LATEST.xlsx'
# Pass the path directly so pandas opens and closes the file itself
df = pd.read_excel(metadata_file, sheet_name='objective_files_with_acronyms', nrows=118677)

In [5]:
len(df)

118677

## Prepare data

In [6]:
df.columns

Index(['Unnamed: 0', 'documentid', 'objectivefileid', 'fileextension',
       'versionnumber', 'disposal_schedule', 'repository', 'datelastmodified',
       'parent11', 'parent10', 'parent9', 'parent8', 'parent7', 'parent6',
       'parent5', 'parent4', 'parent3', 'parent2', 'parent1', 'objective3',
       'objective2', 'objective1', 'originalname', 'documentname',
       'copyflatlines',
       'To the left there is the full data from the EDRMS.\nTo the right is the data with the acronyms',
       'trim_11', 'trim_10', 'trim_9', 'trim_8', 'trim_7', 'trim_6', 'trim_5',
       'trim_4', 'trim_3', 'trim_2', 'trim_1'],
      dtype='object')

In [7]:
df = df.drop(['Unnamed: 0','copyflatlines',
       'To the left there is the full data from the EDRMS.\nTo the right is the data with the acronyms'], axis = 1)

In [8]:
df.head(5)

| | documentid | objectivefileid | fileextension | versionnumber | disposal_schedule | repository | datelastmodified | parent11 | parent10 | parent9 | ... | trim_10 | trim_9 | trim_8 | trim_7 | trim_6 | trim_5 | trim_4 | trim_3 | trim_2 | trim_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A3109716 | qA35736 | xls | 2 | 24 Projects - Full Projects (Close file when Project ends) | Strategic Projects | 00:42:31 | Procurement, Project Delivery and Contract Management | Project Delivery | Projects - Closed | ... | PD | PC_1 | 20YRaCS | GE_1 | 2012RTR | RTRS2012DS | no name | no name | no name | no name |
| 1 | A3133123 | qA35736 | xls | 1 | 24 Projects - Full Projects (Close file when Project ends) | Strategic Projects | 00:55:08 | Procurement, Project Delivery and Contract Management | Project Delivery | Projects - Closed | ... | PD | PC_1 | 20YRaCS | GE_1 | 2012RTR | RTRS2012DS | no name | no name | no name | no name |
| 2 | A3097046 | qA35736 | xls | 4 | 24 Projects - Full Projects (Close file when Project ends) | Strategic Projects | 00:54:09 | Procurement, Project Delivery and Contract Management | Project Delivery | Projects - Closed | ... | PD | PC_1 | 20YRaCS | GE_1 | 2012RTR | RTRS2012DS | no name | no name | no name | no name |
| 3 | A3113792 | qA35736 | xls | 2 | 24 Projects - Full Projects (Close file when Project ends) | Strategic Projects | 00:52:54 | Procurement, Project Delivery and Contract Management | Project Delivery | Projects - Closed | ... | PD | PC_1 | 20YRaCS | GE_1 | 2012RTR | RTRS2012DS | no name | no name | no name | no name |
| 4 | A3115138 | qA35736 | xls | 2 | 24 Projects - Full Projects (Close file when Project ends) | Strategic Projects | 00:40:05 | Procurement, Project Delivery and Contract Management | Project Delivery | Projects - Closed | ... | PD | PC_1 | 20YRaCS | GE_1 | 2012RTR | RTRS2012DS | no name | no name | no name | no name |


In [9]:
print("check there are as many documents as unique documentid: %r, \n unique document ids: %i" % ((len(df.documentid.unique()) == len(df)),len(df.documentid.unique())))

check there are as many documents as unique documentid: True, 
 unique document ids: 118677
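
The uniqueness check above can also be written with `Series.is_unique`, and `duplicated` can list the offending rows if the check ever fails. A minimal sketch on a toy frame (the ids below are made up for illustration and are not from the metadata file):

```python
import pandas as pd

# Toy frame with hypothetical ids; "A2" is repeated on purpose.
toy = pd.DataFrame({"documentid": ["A1", "A2", "A3", "A2"]})

all_unique = toy["documentid"].is_unique
# keep=False flags every row that shares its id with another row.
duplicates = toy[toy["documentid"].duplicated(keep=False)]

print("all ids unique:", all_unique)
print(duplicates)
```

On the real `df` the first line would be `df["documentid"].is_unique`, which should agree with the `True` printed above.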


In [10]:
df['ret_schedule']= df['disposal_schedule'].apply(lambda x: x.split()[0] )

In [11]:
df['selected'] = df['ret_schedule'].isin(['04', '06', '15b', '17', '21', '33', '34', '35', '36'])
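
To make the two derived columns concrete, here is a small sketch on toy data. The first `disposal_schedule` value is taken from the preview above; the second is a made-up label used only to show a selected code. The vectorised `.str` accessor is an alternative to the `lambda`, with the difference that it passes missing values through as `NaN` instead of raising.

```python
import pandas as pd

# Codes copied from the cell above.
SELECTED_CODES = ["04", "06", "15b", "17", "21", "33", "34", "35", "36"]

toy = pd.DataFrame({
    "disposal_schedule": [
        "24 Projects - Full Projects (Close file when Project ends)",
        "04 Hypothetical schedule name",  # made-up label, for illustration only
    ]
})

# The first whitespace-separated token is the retention schedule code.
toy["ret_schedule"] = toy["disposal_schedule"].str.split().str[0]
# A document counts as selected when its code is in the list above.
toy["selected"] = toy["ret_schedule"].isin(SELECTED_CODES)

print(toy[["ret_schedule", "selected"]])
```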

In [12]:
df['fileextension_cleaned']=df.fileextension.str.lower()

In [13]:
most_frequent_fileextensions=(df['fileextension_cleaned'].value_counts()>400)
most_frequent_fileextensions=most_frequent_fileextensions[most_frequent_fileextensions].index.values
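
The two lines above build a boolean mask from the value counts and then keep the index entries where the mask is true. The sketch below shows the same filter on toy data with a lower threshold. The final step, which buckets rare extensions under `"other"`, is a hypothetical follow-up and not something the notebook does yet.

```python
import pandas as pd

toy = pd.Series(["pdf", "pdf", "pdf", "doc", "doc", "xls"],
                name="fileextension_cleaned")
threshold = 1  # the notebook uses 400; lowered here for the toy data

counts = toy.value_counts()
# Keep only the extensions that occur more often than the threshold.
frequent = counts[counts > threshold].index.values

# Hypothetical follow-up: replace every rare extension with "other".
grouped = toy.where(toy.isin(frequent), "other")

print(frequent)
print(grouped.tolist())
```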

## Train model

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df[['parent11','parent10','fileextension_cleaned']],df[['ret_schedule']],random_state=0,test_size=0.2,stratify=df[['ret_schedule']])
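
Passing `stratify` keeps the class proportions of `ret_schedule` the same in the train and test sets. A minimal sketch on made-up data with an 80/20 class balance illustrates the effect. Note that stratification requires every class to have at least two members, so very rare schedules might need handling first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 rows of class "a" and 20 rows of class "b" (made-up values).
toy = pd.DataFrame({
    "feature": range(100),
    "ret_schedule": ["a"] * 80 + ["b"] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]], toy["ret_schedule"],
    test_size=0.2, random_state=0, stratify=toy["ret_schedule"],
)

# Both splits keep the 80/20 balance of the full data.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```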

### Dummy classifier

In [16]:
from sklearn.dummy import DummyClassifier

In [47]:
from sklearn.metrics import accuracy_score

# Baseline: always predict the most frequent retention schedule
clf = DummyClassifier(strategy="most_frequent")
clf.fit(X_train, y_train)
print("Accuracy on train set", accuracy_score(y_train, clf.predict(X_train)))
print("Accuracy on test set", accuracy_score(y_test, clf.predict(X_test)))

Accuracy on train set 0.2215691850728347
Accuracy on test set 0.22156218402426694


### Decision Tree classifier

TODO: all categorical labels need to be encoded before the decision tree can be trained.

In [42]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
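
One possible way to handle the encoding is to one-hot encode the categorical feature columns inside a scikit-learn `Pipeline`, so the same encoding is applied at fit and predict time. This is a sketch under assumptions, not the notebook's final approach: the rows below are made up, and the real data may need missing values filled before encoding.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that mimic the three feature columns used in the split above.
toy = pd.DataFrame({
    "parent11": ["Finance", "Finance", "Projects", "Projects"],
    "parent10": ["Invoices", "Invoices", "Delivery", "Delivery"],
    "fileextension_cleaned": ["pdf", "xls", "doc", "pdf"],
    "ret_schedule": ["04", "04", "24", "24"],
})
features = ["parent11", "parent10", "fileextension_cleaned"]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), features)]
)
model = Pipeline([
    ("encode", encoder),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

model.fit(toy[features], toy["ret_schedule"])
predictions = model.predict(toy[features])
print(predictions)
```

The string targets do not need a `LabelEncoder` here, since scikit-learn classifiers accept string class labels directly.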

### Auto-sklearn classifier

In [27]:
# see https://automl.github.io/auto-sklearn/master/installation.html if error on pip install
# %pip install auto-sklearn