# Text ming, feature creation and classification of federal government (USA)

The USA has a wealth of documents released by the various department through white houe. These documents are stored in the  database of the "Federal Registry". The documents are available from Bill Clinton tenure till date. This study solely focuses on the "Final ruling" document issued by the president. The time of interest is from 20-01-2001 to 20-01-2017 when George Bush and Barak Obama were in power.

The aim of the study is to predict whether the 'final ruling' was issued during George Bush or Barak Obama tenure.This study will also have following activities:
    #Download the data from Federal Registry API
    #Parse the data
    #Extract important features
    #Apply NLP techniques to develop features from the abstract
    #Develop a classification model that can predict whether the 'final ruling document' was issued by arak Obama or George Bush. 


# 1. Information retrieval . 

Maximum of 1000 documents can be downloaded from each API call. Therefore, 1000 most relevent documents were downloaded 
for each year
8000 documents were downloaded for each president, Barak Obama and George Bush. 
It is expected that the dataset will have 16000 observations. The features will be created using LDA and TF-IDF

In [1]:
import pandas as pd
import gensim
import json
import urllib.request  as urllib2
import json as JSON
import os
os.chdir("C://specdata/federal") 



## 1.1 Functions used in the feature creation and parsing

In [2]:
#Develop function for cleaning data.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

# List of words that is likely to appear frequently in the document
words_to_remove=['Act', 'rule', 'document', 'regulation', 'new','changes','Commission','final','regulation',
                'rules','order','federal']

# Remove stopwords, tokenize and stemming
def clean_my_data(letters):
    clean_data=[]
    for line in letters:
        tokens = nltk.word_tokenize(str(line))
        tagged = nltk.pos_tag(tokens)
        container=[]
        for words in tagged:
            if (words[1][0] == 'N' or words[1][0]=='J') and (words[1][0] not in words_to_remove):
                container.append(words[0])
        clean_data.append(container)
    return clean_data

# LDA function (topic modelling)

from gensim import corpora, models
def topic_modelling(list_of_list):
    dictionary = corpora.Dictionary(list_of_list)
    corpus = [dictionary.doc2bow(text) for text in list_of_list]
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2000, id2word = dictionary, passes=20)
    topics = ldamodel.print_topics(num_topics=2000, num_words=3)
    return topics

# Abstract/title length calculator

def length(list_of_content):
    len_list=[]
    for rows in list_of_content:
        leng = len(rows) 
        len_list.append(leng)
    return len_list


#Function to extract agnecy name form the downloaded data. GAenct is used as a feature in classification model

def extract_data_agency(agency_list):
    raw_name_1=[]
    
    for dept in range(len(agency_list)):
        raw_name_1.append(agency_list[dept][0]['raw_name'])
    return raw_name_1




## 1.2 Download data from the API

The documents were downloaded from the API using urllib. For each call the API releases only 1000 documents. 16 different calls were made with 8000 for BO and 8000 for GB. The data was obtained in the form of json file, which was saved in the local disk. In the next step, the downloded json data is imported into this notebook for further processing.

In [3]:
#This code imports the downloaded data from the local drive and creates two different list. One for Barak Obama (BO)
# and second for George Bush (GB)

#Import and append data for BO

BO_data_list=[]
with open('BO_1.txt') as data_file_BO_1:    
    BO_data_1 = json.load(data_file_BO_1)
    BO_data_list.append(BO_data_1)

with open('BO_2.txt') as data_file_BO_2:    
    BO_data_2 = json.load(data_file_BO_2)
    BO_data_list.append(BO_data_2)
    
with open('BO_3.txt') as data_file_BO_3:    
    BO_data_3 = json.load(data_file_BO_3)
    BO_data_list.append(BO_data_3)
    
with open('BO_4.txt') as data_file_BO_4:    
    BO_data_4 = json.load(data_file_BO_4)
    BO_data_list.append(BO_data_4)
    
with open('BO_5.txt') as data_file_BO_5:    
    BO_data_5 = json.load(data_file_BO_5)
    BO_data_list.append(BO_data_5)
    
with open('BO_6.txt') as data_file_BO_6:    
    BO_data_6 = json.load(data_file_BO_6)
    BO_data_list.append(BO_data_6)
    
with open('BO_7.txt') as data_file_BO_7:    
    BO_data_7 = json.load(data_file_BO_7)
    BO_data_list.append(BO_data_7)
    
with open('BO_8.txt') as data_file_BO_8:    
    BO_data_8 = json.load(data_file_BO_8)
    BO_data_list .append(BO_data_8)
    
print(len(BO_data_list))

#Import and append data for GB
    
GB_data_list=[]
with open('GB_1.txt') as data_file_GB_1:    
    GB_data_1 = json.load(data_file_GB_1)
    GB_data_list.append(GB_data_1)
    
with open('GB_2.txt') as data_file_GB_2:    
    GB_data_2 = json.load(data_file_GB_2)
    GB_data_list.append(GB_data_2)
    
with open('GB_3.txt') as data_file_GB_3:    
    GB_data_3 = json.load(data_file_GB_3)
    GB_data_list.append(GB_data_3)
    
with open('GB_4.txt') as data_file_GB_4:    
    GB_data_4 = json.load(data_file_GB_4)
    GB_data_list.append(GB_data_4)
    
with open('GB_5.txt') as data_file_GB_5:    
    GB_data_5 = json.load(data_file_GB_5)
    GB_data_list.append(GB_data_5)
    
with open('GB_6.txt') as data_file_GB_6:    
    GB_data_6 = json.load(data_file_GB_6)
    GB_data_list.append(GB_data_6)
    
with open('GB_7.txt') as data_file_GB_7:    
    GB_data_7 = json.load(data_file_GB_7)
    GB_data_list.append(GB_data_7)
    
with open('GB_8.txt') as data_file_GB_8:    
    GB_data_8 = json.load(data_file_GB_8)
    GB_data_list.append(GB_data_8)

print(len(GB_data_list))


8
8


# 2. Generate Features

## 2.1 LDA (Topic Modelling)

 As the first step, list of abstract is created for BO and GB. Subsquently, LDA is applied for each of this set separately. The topic and its probabilities are parsed and 6 feature sets are created with 3 topics and its probabilities. The two lists (Bo and GB) are merged to create a dataframe of features. 

### 2.1.1 LDA on the document abstract released during the term of Barak Obama

In [4]:
#Collect all the abstract for BO in the list 'abstract_list_1'
abstract_list_1=[]
for line1 in BO_data_list:
    for line2 in line1['results']:
        abstract_list_1.append(line2['abstract'])
print(len(abstract_list_1))

#This code will create a list of 8000 abstract for BO

8000


In [5]:
# Clean the data, remove stopwords and punctuations use the function 'clean_my_data' , which was created earlier

import time
start_time = time.time()
BO_abs_clean_data = clean_my_data(abstract_list_1)
print("--- %s seconds ---" % (time.time() - start_time))
print(len(BO_abs_clean_data))

# The output of this cell is the list of abstract without stop words. 
# This data is fed into LDA algorithm for topic modelling

--- 153.04029750823975 seconds ---
8000


In [6]:
# Out of the 8000 abstract (available in the abstract_list_1) 2000 abstracts are fed into the LDA model. 
# LDA modelling for the 2000 abstract takes around an hour. 
# Therefore, topic modelling of 16000 abstract will take 8 hours

# LDA for first 2000 abstract of BO

start_time = time.time()
LDA_BO_1 = topic_modelling(BO_abs_clean_data[0:2000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_BO_1))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2781.64235496521 seconds ---
2000


In [8]:
# LDA from 2000 to 4000 abstract of BO

start_time = time.time()
LDA_BO_2 = topic_modelling(BO_abs_clean_data[2000:4000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_BO_1))
print(len(LDA_BO_2))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2693.0852675437927 seconds ---
2000
2000


In [9]:
# LDA from 4000 to 6000 abstract of BO
start_time = time.time()
LDA_BO_3 = topic_modelling(BO_abs_clean_data[4000:6000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_BO_1))
print(len(LDA_BO_2))
print(len(LDA_BO_3))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2650.5992753505707 seconds ---
2000
2000
2000


In [10]:
# LDA from 6000 to 8000 abstract of BO
start_time = time.time()
LDA_BO_4 = topic_modelling(BO_abs_clean_data[6000:8000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_BO_1))
print(len(LDA_BO_2))
print(len(LDA_BO_3))
print(len(LDA_BO_4))

#This completes the LDA modelling for 8000 BO abstract
# Follow the same steps for topic modelling of abstract


  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2680.826761484146 seconds ---
2000
2000
2000
2000


In [11]:
# Combine all the LDA list of BO

LDA_BO=LDA_BO_1+LDA_BO_2+LDA_BO_3+LDA_BO_4
print(len(LDA_BO))

8000


In [17]:
#This list is saved in the local drive for future use

import csv

with open('LDABO.csv', 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(LDA_BO)

In [22]:
print(LDA_BO[0])

(0, '0.000*"Free" + 0.000*"Class" + 0.000*"more"')


### 2.1.2 LDA on the document abstract released during the term of George Bush

In [6]:
#Collect all the abstract for BO in the list 'abstract_list_1'

abstract_list_2=[]
for line3 in GB_data_list:
    for line4 in line3['results']:
        abstract_list_2.append(line4['abstract'])
print(len(abstract_list_2))

#This code will create a list of 8000 abstract for BO

8000


In [7]:
# Clean the data, remove stopwords and punctuations use the function 'clean_my_data' , which was created earlier

import time
start_time = time.time()
GB_abs_clean_data = clean_my_data(abstract_list_2)
print("--- %s seconds ---" % (time.time() - start_time))
print(len(GB_abs_clean_data))

# The output of this cell is the list of abstract without stop words. 
# This data is fed into LDA algorithm for topic modelling

--- 155.92902660369873 seconds ---
8000


In [20]:
# Out of the 8000 abstract (available in the abstract_list_2) 2000 abstracts are fed into the LDA model. 
# LDA modelling for the 2000 abstract takes around an hour. 
# Therefore, topic modelling of 16000 abstract will take 8 hours

# LDA for first 2000 abstract of GB

start_time = time.time()
LDA_GB_1 = topic_modelling(GB_abs_clean_data[0:2000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_BO_1))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2884.5209336280823 seconds ---
2000


In [23]:
# LDA from 2000 to 4000 abstract of GB

start_time = time.time()
LDA_GB_2 = topic_modelling(GB_abs_clean_data[2000:4000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_GB_2))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2694.336597919464 seconds ---
2000


In [24]:
# LDA from 4000 to 6000 abstract of GB

start_time = time.time()
LDA_GB_3 = topic_modelling(GB_abs_clean_data[4000:6000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_GB_3))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2629.2344963550568 seconds ---
2000


In [25]:
# LDA from 6000 to 8000 abstract of GB

start_time = time.time()
LDA_GB_4 = topic_modelling(GB_abs_clean_data[6000:8000])
print("--- %s seconds ---" % (time.time() - start_time))
print(len(LDA_GB_4))

  (perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words))
  diff = np.log(self.expElogbeta)


--- 2791.9896953105927 seconds ---
2000


In [26]:
# Combine all the LDA list of GB
LDA_GB= LDA_GB_1+ LDA_GB_2+ LDA_GB_3+ LDA_GB_4

with open('LDAGB.csv', 'w') as myfile2:
    wr = csv.writer(myfile2, quoting=csv.QUOTE_ALL)
    wr.writerow(LDA_GB)   


In [27]:
#Combine the LDA list of GB and BO

LDA= LDA_BO+LDA_GB


with open('LDA.csv', 'w') as myfile3:
    wr = csv.writer(myfile3, quoting=csv.QUOTE_ALL)
    wr.writerow(LDA) 


LDA function generate 3 probable topics with its probability. This data is stored in the list 'LDA' in the form of tuple. The data from this tuple is parsed to generate 6 features consisting of three most probable topics and its three probabilities. The code below is used to generate these 6 features 

In [8]:
# Import LDA list from the drive

import csv
with open ('LDA.csv','r') as csvlist:
    LDA_list= csv.reader(csvlist)
    LDA =[]
    for row in LDA_list:
        LDA  = LDA +[row]
csvlist.close()

In [9]:
LDA= LDA[0]

In [10]:
# Parsing data to create features from topic and its probabilities

import re
import nltk

prob_1=[]
prob_2=[]
prob_3=[]
topic_1=[]
topic_2=[]
topic_3=[]

for prob in range(len(LDA)):
    
    probability= re.findall(r'\d+', str(LDA[prob]))
    pro_1 = float(probability[2])
    prob_1.append(pro_1/1000)
    pro_2= float(probability[4])
    prob_2.append(pro_2/1000)
    pro_3= float(probability[6])
    prob_3.append(pro_3/1000)
    
    words = " ".join(re.findall("[a-zA-Z]+", str(LDA[prob])))
    word= nltk.word_tokenize(str(words))
    try: 
        topic_1.append(word[0])
    except IndexError:
        topic_1.append('NA')
    try: 
        topic_2.append(word[1])
    except IndexError:
        topic_2.append('NA')
    try: 
        topic_3.append(word[2])
    except IndexError:
        topic_3.append('NA')
    


In [11]:
# Create labels

BO= [0]*8000
GB= [1]*8000
presi= BO+GB

In [12]:
# Create dataframe

DF= pd.DataFrame()
#DF_GB= pd.DataFrame()
DF['prob_1']= prob_1
DF['prob_2']= prob_2
DF['prob_3']= prob_3

DF['abstract_topic_1']= topic_1
DF['abstract_topic_2']= topic_2
DF['abstract_topic_3']= topic_3

DF['PRESIDENT']= presi

In [13]:
#DF.head()

Create two dataframe to put togehter the features created till under BO and GB

Merge the two dataframe

## 2.2 Create other features (agency, significance and page length)

Other features are created directly by parsing the downloaded data 

In [14]:
# Extract agency data for BO

agency_list=[]
for objects in BO_data_list:
    for element in objects['results']:
        agency_BO = element['agencies']
        agency_list.append(agency_BO)

# Extract agency data for GB

for objects2 in GB_data_list:
    for element2 in objects2['results']:
        agency_GB = element2['agencies']
        agency_list.append(agency_GB)
        
#Agency list has 16000 entries for both BO and GB

In [15]:
# Agency list has several missing data which throws error message. 
# The following code is writtien to handle error exception and fill the missing values with 'NA'

raw_name=[]
for dept in range(len(agency_list)):
    try:
        raw_name.append(agency_list[dept][0]['raw_name'])
    except IndexError:
        raw_name.append('NA')

In [16]:
# Append the Agency feature to the dataframe DF

DF['Agency']=  raw_name

Please note: 3 topic is not included in this data set

Create a feature for document significance. This data is created by parsing the downloaded data

In [17]:
# Feature: significance
# Create an empty list to hold the 'Significance value'
# For BO

sign_list=[]
for each in BO_data_list:
    for sign in each['results']:
        sign_BO = sign['significant']
        sign_list.append(sign_BO)
        
        
# For GB

for lgb in GB_data_list:
    for lenggb in lgb['results']:
        pglen_GB = lenggb['page_length']
        sign_list.append(pglen_GB)

#Append the list to a new column in the dataframe

DF['Significance']= sign_list


In [18]:
# Feature: page_length
# Create an empty list to hold the 'Page Length'
# For BO
leng_list=[]
for l in BO_data_list:
    for leng in l['results']:
        pglen_BO = leng['page_length']
        leng_list.append(pglen_BO)
        
#For GB
for lgb in GB_data_list:
    for lenggb in lgb['results']:
        pglen_GB = lenggb['page_length']
        leng_list.append(pglen_GB)
        
#Append the list to a new column in the dataframe        

DF['Page_length']= leng_list

In [19]:
# Export the dataframe to local drive
DF.to_csv('dataframe_1.csv')

## 2.3 Create text features using TF-IDF

The text features were created using CountVectorizer and TFIDF. Both the method created same number of features. In this study, TFIDF is used.

In [20]:
#TFIDF features are created using the clean data, which was created in section 2.1.1 an 2.1.2 for BO abd GB together
#BO clean data and GB clean data are merged together

clean_data= BO_abs_clean_data+GB_abs_clean_data

#TFIDF do not accept tokenized data. Therefore, clean_data is detokenized

from nltk.tokenize.moses import MosesDetokenizer
detokenizer = MosesDetokenizer()
clean_data_detok=[]
for it in range(len(clean_data)):
    clean_data_detok.append(detokenizer.detokenize(clean_data[it], return_str=True))
    
# Empty values in the list are filled with 'This is empty list'

for item in range(len(clean_data_detok)):
    if not clean_data_detok[item]:
        clean_data_detok[item]= 'This is empty list'


Create features using Scikit Learn TFIDF

In [21]:
#TFIDF

from sklearn.feature_extraction.text import TfidfVectorizer
cv= TfidfVectorizer()
vector_cv= cv.fit_transform(clean_data_detok)

#Convert the numpy array into Dataframe
count_stem_df= pd.DataFrame(vector_cv.toarray())

#Add column Name
count_stem_df.columns= cv.get_feature_names()




In [22]:
#The current shape of the dataframe is 16000 X 24210. There are too many columns in thi dataframe. 
# In next few steps, the dimension of the dataframe is reduced to a managable level.

count_stem_df.shape

(16000, 24210)

In [23]:
# Firs 1377 columns are numbers. Therefore, those are eliminated

count_stem_df= count_stem_df.iloc[:, 1377: len(count_stem_df.columns)] 
count_stem_df.shape

(16000, 22833)

In [24]:
# 'cols' isa empty list where column name are stored which are later dropped from the dataframe
# In the first step, all the columns with just two alphabet are separated

cols=[]
for name in range(len(count_stem_df.columns)):
    if len(count_stem_df.columns[name])<=2:
        cols.append(count_stem_df.columns[name])
len(cols)

414

In [25]:
# All alphanumeric columns are separated

for names in range(len(count_stem_df.columns)):
    if count_stem_df.columns[names].isalpha()==False:
        cols.append(count_stem_df.columns[names])
len(cols)

1135

In [26]:
#The column accumulated in 'cols' are dropped from dataframe. 
#After the dimension reduction, the shape of the dataframe is 16000 X 21753

count_stem_df= count_stem_df.drop(cols, 1)
count_stem_df.shape

(16000, 21753)

In [27]:
# Similar column names wre observed, such as abbreviate and abbreviation.
# Therefore, column name in consecutive cell with more that 6 similar characters are removed

same=[]
for item in range(len(count_stem_df.columns)):
    
    if list(count_stem_df.columns[item][0:6])==list(count_stem_df.columns[item-1][0:6]):
        same.append(count_stem_df.columns[item])
        #print(A.columns[item],A.columns[item-1] )
print(len(same))

count_stem_df= count_stem_df.drop(same, 1)
print(count_stem_df.shape)

5375
(16000, 16378)


In this section the dimension of the dataframe_2 is reduced from (16000, 24210) to (16000, 16378)

In [28]:
# Export dataframe to local drive
count_stem_df.to_csv("dataframe_2.csv")

Two dataframes are created: (a) dataframe_1 with LDA features and (b) dataframe_2 with countvectorizer. Both the dataframes are stored in the local drive. In this next section, these two dataframes are imported and merged together. Various classification algorithms are then tested on this dataset. 

# 3. Classification model

The section 2 has successfully created a dataframe consisting of features and label. In this section various classification model are applied to predict the president.
Two different dataframes are imported from the local drive and merged to create a master data frame. The features and labels are then separated to create X and Y.

In [2]:
#Import dataframes 

data_1= pd.read_csv('dataframe_1.csv')

In [3]:
# Import second dataframe

data_2=pd.read_csv('dataframe_2.csv')

In [4]:
# Merge the two dataframe

data = pd.concat([data_1, data_2], axis=1)

In [5]:
#Shuffle the dataset

data= data.sample(frac=1)

In [6]:
#Reset index

data= data.reset_index(drop=True)

In [7]:
#Drop missing values
data= data.dropna()
data.shape

(15512, 16390)

In [8]:
#Encode categorical variables 
from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
data['Agency'] = lb_make.fit_transform(data["Agency"])
data['Significance'] = lb_make.fit_transform(data["Significance"])
data['abstract_topic_1'] = lb_make.fit_transform(data["abstract_topic_1"])
data['abstract_topic_2'] = lb_make.fit_transform(data["abstract_topic_2"])
data['abstract_topic_3'] = lb_make.fit_transform(data["abstract_topic_3"])
data['fitter']=data['fit']

#Drop redundant columns. Column 'fit' has to be changed from 'fit' to 'fitter'. 'fit' throws an error.
data = data.drop(['Unnamed: 0','fit'], 1)

In [9]:
#The 'data' dataframe is too big and therefore it is divided into two equal dataframe
# Each dataframe is used for prediction with 10 fold CV

# create two sets
df_1= data.iloc[:7756, :]
df_2= data.iloc[7756:, :].reset_index()

In [11]:
#drop redundant columns

df_2= df_2.drop(['level_0'], axis=1)


Unnamed: 0,prob_1,prob_2,prob_3,abstract_topic_1,abstract_topic_2,abstract_topic_3,PRESIDENT,Agency,Significance,Page_length,...,zones,zoological,zoonotic,zte,zuchem,zuernii,zumwalt,zymed,zzz,fitter
0,0.143,0.079,0.039,3323,11,1975,0,33,136,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.069,0.035,0.035,1217,2567,3931,0,21,136,15,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.048,0.044,0.04,1659,1152,3978,1,34,55,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.097,0.049,0.049,1388,401,639,1,34,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1650,3485,2670,1,24,36,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# Create X and Y for the set 1

df_1_X= df_1.drop(['PRESIDENT'], axis=1)
df_1_Y= df_1['PRESIDENT']

In [24]:
# Create X and Y for the set 2

df_2_X= df_2.drop(['PRESIDENT'], axis=1)
df_2_Y= df_2['PRESIDENT']

In [14]:
#The performance metrices of various algorithm are stored in adataframe

accuracy= pd.DataFrame()
Algorith=[]
Accuracy=[]
Precision=[]
Recall=[]

In [15]:
from sklearn.metrics import log_loss, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier




### Naive Bayes 

In [22]:
# For the first set
import time
start_time = time.time()

#Import packages
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


#fit the variables and calculate scores

clf_nb = GaussianNB()
clf_nb.fit(df_1_X, df_1_Y)
score_nb=cross_val_score(clf_nb,df_1_X,df_1_Y, cv=10)
scores_nb_precision = cross_val_score(clf_nb, df_1_X,df_1_Y, cv=10, scoring='precision')
scores_nb_recall = cross_val_score(clf_nb, df_1_X,df_1_Y, cv=10, scoring='recall')


print(score_nb, score_nb.mean())
print(scores_nb_precision, scores_nb_precision.mean()) 
print(scores_nb_recall, scores_nb_recall.mean())

Algorith.append('Naive Bayes')
Accuracy.append(score_nb.mean())
Precision.append(scores_nb_precision.mean())
Recall.append(scores_nb_recall.mean())


print("--- %s seconds ---" % (time.time() - start_time))

[ 0.97683398  0.96267696  0.96525097  0.95741935  0.96129032  0.95612903
  0.96129032  0.97032258  0.96645161  0.96645161] 0.964411674347
[ 0.95609756  0.93111639  0.93556086  0.92216981  0.92874109  0.92
  0.92874109  0.94444444  0.93764988  0.93764988] 0.934217101094
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
--- 188.31600642204285 seconds ---


In [25]:
# For the second set

start_time = time.time()

#Import packages
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


##fit the variables and calculate scores

clf_nb_2 = GaussianNB()
clf_nb_2.fit(df_2_X, df_2_Y)
score_nb_2=cross_val_score(clf_nb_2,df_2_X,df_2_Y, cv=10)
scores_nb_precision_2 = cross_val_score(clf_nb_2, df_2_X, df_2_Y, cv=10, scoring='precision')
scores_nb_recall_2 = cross_val_score(clf_nb_2, df_2_X, df_2_Y, cv=10, scoring='recall')


print(score_nb_2, score_nb_2.mean())
print(scores_nb_precision_2, scores_nb_precision_2.mean()) 
print(scores_nb_recall_2, scores_nb_recall_2.mean())

Algorith.append('Naive Bayes')
Accuracy.append(score_nb_2.mean())
Precision.append(scores_nb_precision_2.mean())
Recall.append(scores_nb_recall_2.mean())


print("--- %s seconds ---" % (time.time() - start_time))

[ 0.96778351  0.96907216  0.96778351  0.97551546  0.97680412  0.97680412
  0.96903226  0.97806452  0.97290323  0.97806452] 0.973182740273
[ 0.93990385  0.94216867  0.94202899  0.95365854  0.95599022  0.95599022
  0.94202899  0.95823096  0.94890511  0.95823096] 0.94971364945
[ 1.          1.          0.99744246  1.          1.          1.          1.
  1.          1.          1.        ] 0.999744245524
--- 178.52654719352722 seconds ---


### Decision Tree

In [26]:
#Decision Tree for the first set

start_time = time.time()

#fit the variables and calculate scores
dc_1= DecisionTreeClassifier(max_features=50, max_depth=50, min_samples_split= 6)
dc_1.fit(df_1_X, df_1_Y)

scores_dc_1 = cross_val_score(dc_1, df_1_X, df_1_Y, cv=10)
scores_dc_precision_1 = cross_val_score(dc_1, df_1_X, df_1_Y, cv=10, scoring='precision')
scores_dc_recall_1 = cross_val_score(dc_1,df_1_X, df_1_Y, cv=10, scoring='recall')

print(scores_dc_1, scores_dc_1.mean())
print(scores_dc_precision_1, scores_dc_precision_1.mean()) 
print(scores_dc_recall_1, scores_dc_recall_1.mean())

Algorith.append('Decision Tree')
Accuracy.append(scores_dc_1.mean())
Precision.append(scores_dc_precision_1.mean())
Recall.append(scores_dc_recall_1.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 0.56628057  0.68854569  0.55212355  0.53032258  0.88258065  0.55354839
  0.59096774  0.60129032  0.57935484  0.58064516] 0.612565948437
[ 0.67015707  0.58851675  0.87350835  0.55500821  0.53313253  0.54036244
  0.70192308  0.83333333  0.85925926  0.52564103] 0.668084204137
[ 0.71938776  0.92091837  0.30102041  0.86956522  0.96675192  0.65473146
  0.69053708  0.69820972  0.71355499  0.26342711] 0.679810402422


In [27]:
#Decision Tree for second set

start_time = time.time()

#fit the variables and calculate scores
dc_2= DecisionTreeClassifier(max_features=50, max_depth=50, min_samples_split= 6)
dc_2.fit(df_2_X, df_2_Y)

scores_dc_2 = cross_val_score(dc_2, df_2_X, df_2_Y, cv=10)
scores_dc_precision_2 = cross_val_score(dc_2, df_2_X, df_2_Y, cv=10, scoring='precision')
scores_dc_recall_2 = cross_val_score(dc_2,df_2_X, df_2_Y, cv=10, scoring='recall')

print(scores_dc_2, scores_dc_2.mean())
print(scores_dc_precision_2, scores_dc_precision_2.mean()) 
print(scores_dc_recall_2, scores_dc_recall_2.mean())

Algorith.append('Decision Tree')
Accuracy.append(scores_dc_2.mean())
Precision.append(scores_dc_precision_2.mean())
Recall.append(scores_dc_recall_2.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 0.52963918  0.58247423  0.53608247  0.54510309  0.9935567   0.53092784
  0.74322581  0.55870968  0.54709677  0.57806452] 0.614488027935
[ 0.69607843  0.96954315  0.74358974  0.52723312  0.6377551   0.5643739
  0.54731861  0.55368421  0.99222798  0.54299363] 0.677479786975
[ 0.93350384  0.86700767  0.35805627  0.90025575  0.62404092  0.22250639
  0.58205128  0.28205128  0.78717949  0.22564103] 0.578229392091
--- 46.83039903640747 seconds ---


### Random Forest

In [28]:
#Random Forest Classifier for the first set

start_time = time.time()

#fit the variables and calculate scores

RF_1 = RandomForestClassifier(n_estimators=100,criterion='entropy', max_features= 50, max_depth= 50, min_samples_leaf= 6 )
RF_1.fit( df_1_X, df_1_Y)

scores_RF_1 = cross_val_score(RF_1, df_1_X, df_1_Y, cv=10)
scores_RF_precision_1 = cross_val_score(RF_1, df_1_X, df_1_Y, cv=10, scoring='precision')
scores_RF_recall_1 = cross_val_score(RF_1,  df_1_X, df_1_Y, cv=10, scoring='recall')

print(scores_RF_1, scores_RF_1.mean())
print(scores_RF_precision_1, scores_RF_precision_1.mean()) 
print(scores_RF_recall_1, scores_RF_recall_1.mean())

Algorith.append('Random Forest')
Accuracy.append(scores_RF_1.mean())
Precision.append(scores_RF_precision_1.mean())
Recall.append(scores_RF_recall_1.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 0.93178893  0.94079794  0.94980695  0.85290323  0.98064516  0.9083871
  0.95483871  0.96258065  0.97677419  0.99612903] 0.945465188691
[ 0.9389313   0.99230769  0.95384615  0.93417722  0.96969697  0.96428571
  0.97721519  0.95607235  0.95454545  0.97953964] 0.962061768082
[ 0.94897959  0.93112245  0.95153061  0.89258312  0.96675192  0.97186701
  0.96163683  0.91815857  0.97442455  0.95140665] 0.946846129756
--- 166.6035225391388 seconds ---


In [30]:
#Random Forest Classifier for the second set

start_time = time.time()

#fit the variables and calculate scores
RF_2 = RandomForestClassifier(n_estimators=100,criterion='entropy', max_features= 50, max_depth= 50, min_samples_leaf= 6 )
RF_2.fit( df_2_X, df_2_Y)

scores_RF_2 = cross_val_score(RF_2, df_2_X, df_2_Y, cv=10)
scores_RF_precision_2 = cross_val_score(RF_2, df_2_X, df_2_Y, cv=10, scoring='precision')
scores_RF_recall_2 = cross_val_score(RF_2, df_2_X, df_2_Y, cv=10, scoring='recall')

print(scores_RF_2, scores_RF_2.mean())
print(scores_RF_precision_2, scores_RF_precision_2.mean()) 
print(scores_RF_recall_2, scores_RF_recall_2.mean())

Algorith.append('Random Forest')
Accuracy.append(scores_RF_2.mean())
Precision.append(scores_RF_precision_2.mean())
Recall.append(scores_RF_recall_2.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 0.93685567  0.99226804  0.96907216  0.97680412  0.97293814  0.96134021
  0.97806452  0.91483871  0.96        0.97677419] 0.963895576987
[ 0.98982188  0.9278607   0.89252336  0.96725441  0.93638677  0.98457584
  0.98477157  0.97974684  0.96401028  1.        ] 0.962695164776
[ 0.88491049  0.98209719  0.98465473  0.98721228  0.96675192  0.98209719
  0.97435897  0.98974359  0.92051282  0.94358974] 0.961592891337
--- 169.62343311309814 seconds ---


### Gradient Boosting

In [31]:
# Gradient Boosting for the first set

start_time = time.time()

#fit the variables and calculate scores

gb_1= GradientBoostingClassifier(n_estimators=100, min_samples_split=7,max_features= 50)
gb_1.fit(df_1_X, df_1_Y)

scores_gb_1 = cross_val_score(gb_1, df_1_X, df_1_Y, cv=10)
scores_gb_precision_1 = cross_val_score(gb_1, df_1_X, df_1_Y, cv=10, scoring='precision')
scores_gb_recall_1 = cross_val_score(gb_1, df_1_X, df_1_Y, cv=10, scoring='recall')

print(scores_gb_1, scores_gb_1.mean())
print(scores_gb_precision_1, scores_gb_precision_1.mean()) 
print(scores_gb_recall_1, scores_gb_recall_1.mean())

Algorith.append('Gradient Boost')
Accuracy.append(scores_gb_1.mean())
Precision.append(scores_gb_precision_1.mean())
Recall.append(scores_gb_recall_1.mean())

print("--- %s seconds ---" % (time.time() - start_time))


[ 0.998713    0.93564994  0.62805663  0.9883871   0.66451613  0.92774194
  0.98709677  0.92903226  0.59225806  0.61419355] 0.826564536887
[ 0.60046189  0.60628019  0.6047619   0.6185567   0.94458438  0.89928058
  0.92821782  0.58956916  0.92118227  0.64646465] 0.735935954646
[ 0.9872449   0.98469388  0.71683673  0.7084399   0.94629156  0.62659847
  0.65217391  0.7314578   0.94373402  0.9488491 ] 0.824632026724
--- 265.147762298584 seconds ---


In [32]:
# Gradient Boosting for the second set

start_time = time.time()

#fit the variables and calculate scores

gb_2= GradientBoostingClassifier(n_estimators=100, min_samples_split=7,max_features= 50)
gb_2.fit(df_2_X, df_2_Y)

scores_gb_2 = cross_val_score(gb_2, df_2_X, df_2_Y, cv=10)
scores_gb_precision_2 = cross_val_score(gb_2, df_2_X, df_2_Y, cv=10, scoring='precision')
scores_gb_recall_2 = cross_val_score(gb_2, df_2_X, df_2_Y, cv=10, scoring='recall')

print(scores_gb_2, scores_gb_2.mean())
print(scores_gb_precision_2, scores_gb_precision_2.mean()) 
print(scores_gb_recall_2, scores_gb_recall_2.mean())

Algorith.append('Gradient Boost')
Accuracy.append(scores_gb_2.mean())
Precision.append(scores_gb_precision_2.mean())
Recall.append(scores_gb_recall_2.mean())

print("--- %s seconds ---" % (time.time() - start_time))


[ 0.91494845  0.95876289  0.63530928  0.95103093  0.60824742  1.
  0.95870968  0.96        0.97935484  0.60774194] 0.857410542069
[ 0.6027088   0.58872651  0.60580913  0.94513716  0.93939394  0.59832636
  0.63466667  0.58314351  0.97727273  0.63043478] 0.710561958667
[ 0.69565217  0.96930946  0.73401535  0.64705882  0.65217391  0.95396419
  0.97435897  0.94615385  0.57179487  0.98717949] 0.813166109253
--- 264.18249917030334 seconds ---


### Ada Boost

In [33]:
# AdaBoost for first set

start_time = time.time()

#fit the variables and calculate scores

ab_1= AdaBoostClassifier(algorithm= 'SAMME', n_estimators=100, learning_rate=1)
ab_1.fit(df_1_X, df_1_Y)

scores_ab_1 = cross_val_score(ab_1, df_1_X, df_1_Y, cv=10)
scores_ab_precision_1 = cross_val_score(ab_1, df_1_X, df_1_Y, cv=10, scoring='precision')
scores_ab_recall_1 = cross_val_score(ab_1, df_1_X, df_1_Y, cv=10, scoring='recall')

print(scores_ab_1, scores_ab_1.mean())
print(scores_ab_precision_1, scores_ab_precision_1.mean()) 
print(scores_ab_recall_1, scores_ab_recall_1.mean())

Algorith.append('AdaBoost')
Accuracy.append(scores_ab_1.mean())
Precision.append(scores_ab_precision_1.mean())
Recall.append(scores_ab_recall_1.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
--- 61.5029878616333 seconds ---


In [34]:
# AdaBoost for the second set

start_time = time.time()

#fit the variables and calculate scores

ab_2= AdaBoostClassifier(algorithm= 'SAMME', n_estimators=100, learning_rate=1)
ab_2.fit(df_2_X, df_2_Y)

scores_ab_2 = cross_val_score(ab_2,df_2_X, df_2_Y, cv=10)
scores_ab_precision_2 = cross_val_score(ab_2, df_2_X, df_2_Y, cv=10, scoring='precision')
scores_ab_recall_2 = cross_val_score(ab_2, df_2_X, df_2_Y, cv=10, scoring='recall')

print(scores_ab_2, scores_ab_2.mean())
print(scores_ab_precision_2, scores_ab_precision_2.mean()) 
print(scores_ab_recall_2, scores_ab_recall_2.mean())

Algorith.append('AdaBoost')
Accuracy.append(scores_ab_2.mean())
Precision.append(scores_ab_precision_2.mean())
Recall.append(scores_ab_recall_2.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 1.          1.          0.99871134  1.          1.          1.          1.
  1.          1.          1.        ] 0.999871134021
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
[ 1.          1.          0.99744246  1.          1.          1.          1.
  1.          1.          1.        ] 0.999744245524
--- 61.530038595199585 seconds ---


### XGBoost

In [35]:
# XGboost for the first set

start_time = time.time()

#fit the variables and calculate scores

xg_1= XGBClassifier(base_score=0.5,learning_rate=0.1,max_depth=50, n_estimators=100)
xg_1.fit(df_1_X, df_1_Y)

scores_xg_1 = cross_val_score(xg_1, df_1_X, df_1_Y, cv=10)
scores_xg_precision_1 = cross_val_score(xg_1,df_1_X, df_1_Y, cv=10, scoring='precision')
scores_xg_recall_1 = cross_val_score(xg_1, df_1_X, df_1_Y, cv=10, scoring='recall')

print(scores_xg_1, scores_xg_1.mean())
print(scores_xg_precision_1, scores_xg_precision_1.mean()) 
print(scores_xg_recall_1, scores_xg_recall_1.mean())

Algorith.append('XGBoost')
Accuracy.append(scores_xg_1.mean())
Precision.append(scores_xg_precision_1.mean())
Recall.append(scores_xg_recall_1.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
--- 2446.1845536231995 seconds ---


In [36]:
# XGboosting for the second set

start_time = time.time()

xg_2= XGBClassifier(base_score=0.5,learning_rate=0.1,max_depth=50, n_estimators=100)
xg_2.fit(df_2_X, df_2_Y)
scores_xg_2 = cross_val_score(xg_2, df_2_X, df_2_Y, cv=10)
scores_xg_precision_2 = cross_val_score(xg_2, df_2_X, df_2_Y, cv=10, scoring='precision')
scores_xg_recall_2 = cross_val_score(xg_2, df_2_X, df_2_Y, cv=10, scoring='recall')
print(scores_xg_2, scores_xg_2.mean())
print(scores_xg_precision_2, scores_xg_precision_2.mean()) 
print(scores_xg_recall_2, scores_xg_recall_2.mean())
Algorith.append('XGBoost')
Accuracy.append(scores_xg_2.mean())
Precision.append(scores_xg_precision_2.mean())
Recall.append(scores_xg_recall_2.mean())

print("--- %s seconds ---" % (time.time() - start_time))

[ 1.          1.          0.99871134  1.          1.          1.          1.
  1.          1.          1.        ] 0.999871134021
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 1.0
[ 1.          1.          0.99744246  1.          1.          1.          1.
  1.          1.          1.        ] 0.999744245524
--- 2818.3103935718536 seconds ---


In [37]:
# Dataframe showing the performance metrices of the algorithms
accuracy['Algorith']= Algorith
accuracy['Accuracy']= Accuracy
accuracy['Precision']= Precision
accuracy['Recall']= Recall
print(accuracy)

          Algorith  Accuracy  Precision    Recall
0      Naive Bayes  0.964412   0.934217  1.000000
1      Naive Bayes  0.973183   0.949714  0.999744
2    Decision Tree  0.612566   0.668084  0.679810
3    Decision Tree  0.614488   0.677480  0.578229
4    Random Forest  0.945465   0.962062  0.946846
5    Random Forest  0.963896   0.962695  0.961593
6   Gradient Boost  0.826565   0.735936  0.824632
7   Gradient Boost  0.857411   0.710562  0.813166
8         AdaBoost  1.000000   1.000000  1.000000
9         AdaBoost  0.999871   1.000000  0.999744
10         XGBoost  1.000000   1.000000  1.000000
11         XGBoost  0.999871   1.000000  0.999744


### The boosting algorithm has shown a promising performance to accurately classifiy the observations. The AdaBoost and XG Boost, both have 100% acuracy, precision and recall score. Random Forest and Naive Bayes have also shown good results, however the medal goes to boosting algorith. 

### This exercise has also demonstrated that the LDA and TFIDF can create good feature set for classifiying or clustering the documents.

### Chellanges faced:

### The maximum of 1000 documents could be downloaded for each year,  which made the code long
### LDA is a time consuming process. It takes an hour to model topic of 2000 documents. Perhaps, different packages must be tested for speed
### TFIDF and CountVectorize produced same number of features. Generally it is expected that TFIDF will produce less features than CV because it eliminates unnecessary words
### The final data set was too big to run. It was divided into two dataframe and tested separately