## Homework 2, part II. Decision Trees

We will use a dataset of 1298 “fake news” headlines (which mostly include
headlines of articles classified as biased, etc.) and 1968 “real” news headlines,
where the “fake news” headlines are from https://www.kaggle.com/mrisdal/fake-news/data and the “real news” headlines are from https://www.kaggle.com/therohk/million-headlines. The data were cleaned by removing words from fake news
titles that are not a part of the headline, removing special characters from the
headlines, and restricting real news headlines to those after October 2016
containing the word “trump”. The cleaned-up data are available as clean_real.txt
and clean_fake.txt in the Google Colab file.

Each headline appears as a single line in the data file. You will build a decision
tree to classify real vs. fake news headlines. Instead of coding the decision trees
yourself, you will do what we normally do in practice: use an existing
implementation. You should use the DecisionTreeClassifier included in sklearn.
Note that figuring out how to use this implementation is a part of the
assignment.



In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

import numpy as np

from sklearn import tree
import graphviz

from matplotlib import pyplot as plt

In [None]:
#These two lines allow you to import the necessary data for this tutorial
#You can open the links to see the content of those files if you are curious
! wget https://raw.githubusercontent.com/carrasqu/datah2/master/data/clean_fake.txt 
! wget https://raw.githubusercontent.com/carrasqu/datah2/master/data/clean_real.txt

# This function performs a data split (training, validation, test sets)
def split_data(X, y, train_size=0.7, val_size=0.15):
    total_data = X.shape[0] # The size of the first axis of X is the total number of data points
    train_size = int(train_size * total_data)
    val_size = int(val_size * total_data)
    test_size = total_data - train_size - val_size

    all_indices = np.random.permutation(np.arange(total_data)) # Shuffle the indices of X and y before splitting into train, validation and test sets
    train_indices = all_indices[:train_size]
    val_indices = all_indices[train_size:train_size + val_size]
    test_indices = all_indices[train_size+val_size:]

    train_X, train_y = X[train_indices], y[train_indices]
    val_X, val_y = X[val_indices], y[val_indices]
    test_X, test_y = X[test_indices], y[test_indices]

    # The output of this function is a Python dictionary. For instance, if the output is stored in a variable called "data", the training data are accessed with data['train']
    # More details about Python dictionaries can be found at this link: https://realpython.com/python-dicts/
    return {
        'train': (train_X, train_y),
        'val':  (val_X, val_y),
        'test': (test_X, test_y)
    }

# This function loads the downloaded files, vectorizes the headlines with CountVectorizer
# (read and understand the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html),
# and splits the data
def load_data(paths):
    vec = CountVectorizer(input='content')
    lines = []
    counts = []
    for p in paths: # This loop reads the file at each path. More details about reading files can be found here: https://www.w3schools.com/python/python_file_open.asp
        with open(p) as f:
            file_lines = f.readlines()
        counts.append(len(file_lines))
        lines.extend([l.strip() for l in file_lines])

    vec.fit(lines) #more details about "fit" are provided here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit
    data_matrix = vec.transform(lines).toarray() #more details about "transform" can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.transform
    y = np.concatenate((np.zeros(counts[0]), np.ones(counts[1]))) # Label 0 marks real headlines and label 1 marks fake headlines. np.concatenate merges the labels into one array before the data are split in the next line.
    return split_data(data_matrix, y), vec.get_feature_names_out()

data, feature_names = load_data(['/content/clean_real.txt', '/content/clean_fake.txt'])    

!rm clean* # this is to delete the data from the google colab after we downloaded them
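
Because split_data shuffles with np.random.permutation and no fixed seed, the three sets change from run to run. The sketch below repeats the same index logic on a hypothetical dataset of 100 points with a fixed seed (the seed value 0 and the size 100 are assumptions chosen for illustration) to show the resulting set sizes and that the sets do not overlap.

```python
import numpy as np

np.random.seed(0)  # assumed seed; any fixed integer gives a repeatable shuffle

n = 100  # hypothetical dataset size for illustration
train_size = int(0.7 * n)
val_size = int(0.15 * n)

# Shuffle the indices once, then cut the shuffled list into three pieces.
perm = np.random.permutation(n)
train_idx = perm[:train_size]
val_idx = perm[train_size:train_size + val_size]
test_idx = perm[train_size + val_size:]  # whatever remains after train and val

print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

Setting the seed before calling split_data would make accuracy numbers comparable across reruns of the notebook.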


## 2.A

In [None]:
# Extract the dimensionality of the feature vectors and the number of datapoints in the training, validation and test sets.


## 2.B

In [None]:
# Complete a function to compute the accuracy of a given model on input data X
# and label t
# You can get some inspiration from the code of Homework 1
def get_acc(model, X, t):
    '''
     Complete the code here
    '''
    return acc


In [None]:

# Complete a function that defines and trains decision trees of different depths with a given split criterion on the data
# Store the model, the training accuracy and the validation accuracy in the dict out
# You can take a look at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
# You can also read the user-guide https://scikit-learn.org/stable/modules/tree.html#tree
def select_model(depths, data, criterion ):
    out = {}
    for d in depths:
        print('Evaluating on depth {}'.format(d))
        out[d] = {}
        '''
        Your definition of the decision tree model goes here:
        tree = ....
        fit your decision tree here (you can use tree.fit, see documentation for more details)
         
        '''
        out[d]['val'] = get_acc(tree, *data['val'])
        out[d]['train'] = get_acc(tree, *data['train'])
        out[d]['model'] = tree
    return out


## 2.C

In [None]:
# Code to train the models on multiple depths and two split criteria 

# train the models with the information gain criterion

depths = [] # the depths you want to explore go in the depths list 

res_entropy = select_model(depths,data, "entropy") # training models with different depths using information gain

# looping over the different models and accuracies to find the optimal model according to its validation accuracy
best_d_entropy = None
best_acc_entropy = 0

for d in res_entropy:
    val_acc = res_entropy[d]['val']
    print("Depth: {}   Train: {}    Val: {}".format(d, res_entropy[d]['train'], val_acc))
    if val_acc  > best_acc_entropy:
        best_d_entropy = d
        best_acc_entropy = val_acc

# train the models with the gini impurity criterion 

res_gini = select_model(depths,data,"gini") # training models with different depths using gini impurity 

# looping over the different models and accuracies to find the optimal model according to its validation accuracy
best_d_gini = None
best_acc_gini = 0

for d in res_gini:
    val_acc = res_gini[d]['val']
    print("Depth: {}   Train: {}    Val: {}".format(d, res_gini[d]['train'], val_acc))
    if val_acc  > best_acc_gini:
        best_d_gini = d
        best_acc_gini = val_acc
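
The two criteria compared above measure how mixed the labels at a node are. The sketch below computes both impurity measures by hand for small label arrays; the helper names entropy and gini are hypothetical, and the base-2 logarithm is an assumption made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity: one minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

balanced = np.array([0, 0, 1, 1])  # half real, half fake: maximally mixed
pure = np.array([1, 1, 1, 1])      # a single class: no impurity

print(entropy(balanced), gini(balanced))  # 1.0 0.5
print(entropy(pure), gini(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split is considered good when it reduces the chosen measure in the child nodes.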


In [None]:
# Compute and report the test accuracy of the best model here 


## 2.D

In [None]:
# Visualize the first two layers of the tree here if doing it by code
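
One possible way to do this by code is sketched below on toy data, assuming sklearn's export_graphviz is used to produce a DOT description limited to the top layers. The data and the feature names word_a, word_b and word_c are placeholders invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical toy data: 6 samples with 3 binary word-count features.
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
names = ["word_a", "word_b", "word_c"]  # placeholder feature names

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string;
# max_depth=2 truncates the drawing to the top layers of the tree.
dot = export_graphviz(
    clf,
    out_file=None,
    max_depth=2,
    feature_names=names,
    class_names=["real", "fake"],
    filled=True,
)
print(dot[:40])
# In a notebook with the graphviz package installed,
# graphviz.Source(dot) would render the string as a diagram.
```

With the real model, the fitted tree and the feature_names array returned by load_data would take the place of the toy objects.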