Look at the function "load_train_test_imdb_data" and try to understand what it does.

In [None]:
import os
import numpy as np
import pandas as pd


def load_train_test_imdb_data(data_dir):
    """Loads the IMDB train/test datasets from a folder path.
    Input:
    data_dir: path to the "aclImdb" folder.
    
    Returns:
    train/test datasets as pandas dataframes.
    """

    data = {}
    for split in ["train", "test"]:
        data[split] = []
        for sentiment in ["neg", "pos"]:
            score = 1 if sentiment == "pos" else 0

            path = os.path.join(data_dir, split, sentiment)
            file_names = os.listdir(path)
            for f_name in file_names:
                with open(os.path.join(path, f_name), "r", encoding="utf-8") as f:
                    review = f.read()
                    data[split].append([review, score])

    np.random.shuffle(data["train"])        
    data["train"] = pd.DataFrame(data["train"],
                                 columns=['text', 'sentiment'])

    np.random.shuffle(data["test"])
    data["test"] = pd.DataFrame(data["test"],
                                columns=['text', 'sentiment'])

    return data["train"], data["test"]
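
To see which folder layout the function expects and what it returns, here is a minimal sketch that builds a tiny stand-in for the "aclImdb" folder in a temporary directory and reads it back. The helper names "build_toy_imdb" and "load_split", and the toy reviews, are made up for this illustration; "load_split" only mirrors the reading loop above for a single split and does not shuffle.

```python
import os
import tempfile

import pandas as pd


def build_toy_imdb(root):
    """Create a tiny folder tree with the same layout as aclImdb (toy reviews)."""
    reviews = {
        ("train", "pos"): ["A wonderful film."],
        ("train", "neg"): ["A dull, boring film."],
        ("test", "pos"): ["Great acting."],
        ("test", "neg"): ["Terrible plot."],
    }
    for (split, sentiment), texts in reviews.items():
        folder = os.path.join(root, split, sentiment)
        os.makedirs(folder, exist_ok=True)
        for i, review in enumerate(texts):
            with open(os.path.join(folder, f"{i}.txt"), "w", encoding="utf-8") as f:
                f.write(review)


def load_split(root, split):
    """Hypothetical helper: read one split into a dataframe (no shuffling)."""
    rows = []
    for sentiment in ["neg", "pos"]:
        score = 1 if sentiment == "pos" else 0  # 1 = positive, 0 = negative
        folder = os.path.join(root, split, sentiment)
        for f_name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, f_name), "r", encoding="utf-8") as f:
                rows.append([f.read(), score])
    return pd.DataFrame(rows, columns=["text", "sentiment"])


with tempfile.TemporaryDirectory() as tmp_dir:
    build_toy_imdb(tmp_dir)
    toy_train = load_split(tmp_dir, "train")
    toy_test = load_split(tmp_dir, "test")

print(toy_train)
print(toy_test)
```

Each review becomes one row: the "text" column holds the file content and the "sentiment" column holds 1 for files under "pos" and 0 for files under "neg".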




Use the function "load_train_test_imdb_data" to load the data, and store the results in the variables "train_data" and "test_data".
To load the data, you have to pass the path to the data folder. If you are using the Jupyter server, use the path "/mnt/nvs3/nlp-public/aclImdb/". If you are running Jupyter Notebook on your own computer, download the data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz, extract it (on a Unix system you can use the command "tar -zxvf aclImdb_v1.tar.gz"), and set "path" to the extracted "aclImdb" folder.

In [None]:
#save the train and test data in train_data and test_data
path = "/mnt/nvs3/nlp-public/aclImdb/"

Print the first 5 rows of the train data.

In [None]:
#print first 5 rows of train data


Print the first 5 rows of the test data.

In [None]:
#print first 5 elements of test data


Print the information about the train_data dataframe. You can use the function "info()".

In [None]:
#print info for dataframe
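
As a quick reminder of how these inspection methods behave, here is a small sketch on a made-up dataframe with the same two columns. The name "toy_data" and its rows are assumptions for the illustration, not part of the IMDB data.

```python
import pandas as pd

# A made-up dataframe with the same columns as train_data / test_data
toy_data = pd.DataFrame({
    "text": ["good", "bad", "fine", "awful", "superb", "weak"],
    "sentiment": [1, 0, 1, 0, 1, 0],
})

# head(5) returns the first 5 rows as a new dataframe
first_rows = toy_data.head(5)
print(first_rows)

# info() prints the column names, non-null counts, dtypes and memory usage
toy_data.info()
```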


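Transform each text into a vector of word counts. Use the class "CountVectorizer" with the parameter stop_words="english". To create the training features, use the method "fit_transform()" on the training texts. To create the test features, use the method "transform()" on the test texts, so that the test data is encoded with the vocabulary learned from the training data.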

In [None]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Transform each text into a vector of word counts
vectorizer = CountVectorizer(stop_words="english")

training_features =    
test_features = 
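
The difference between "fit_transform()" and "transform()" can be sketched on a toy corpus. The sentences and the "toy_" variable names below are made up for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_texts = ["the acting was brilliant", "the story was boring"]
toy_test_texts = ["a brilliant and surprising story"]

toy_vectorizer = CountVectorizer(stop_words="english")

# fit_transform() learns the vocabulary from the training texts and encodes them
toy_train_features = toy_vectorizer.fit_transform(toy_train_texts)

# transform() reuses that vocabulary; words not seen during fitting are ignored
toy_test_features = toy_vectorizer.transform(toy_test_texts)

print(toy_vectorizer.vocabulary_)
print(toy_train_features.shape, toy_test_features.shape)
```

Both matrices have the same number of columns, one per vocabulary word, which is what allows a model trained on one to make predictions on the other.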

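Print the features' names. To do that, use the method "get_feature_names_out()" from "CountVectorizer". In scikit-learn versions older than 1.0 the method is called "get_feature_names()"; that older name was removed in version 1.2.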

In [None]:
#print the features' names
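
Because the method name depends on the installed scikit-learn version, a version-tolerant sketch may be useful. The toy corpus and the "toy_" names are assumptions for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer(stop_words="english")
toy_vectorizer.fit(["the acting was brilliant", "the story was boring"])

# Newer scikit-learn versions provide get_feature_names_out();
# fall back to the older get_feature_names() if it is not available.
if hasattr(toy_vectorizer, "get_feature_names_out"):
    feature_names = list(toy_vectorizer.get_feature_names_out())
else:
    feature_names = list(toy_vectorizer.get_feature_names())

print(feature_names)
```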


Print the dimensions of the training and the test data features. Use the property "shape".

In [None]:
#print training data dimensions


In [None]:
#print test data dimensions


Train an SVM model using the class "LinearSVC". Use the "fit()" method to train the model. It takes as parameters the training_features and the labels from the training data (the "sentiment" column). Make predictions with the trained model using the "predict()" method, which takes the test_features as input.

In [None]:
from sklearn.svm import LinearSVC

# Training with linear SVM
model = #define the model
        #fit the model
y_pred = #predict on new data
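
The define, fit, predict pattern can be sketched on a toy corpus. The texts, labels and "toy_" names below are made up, and the data is far too small to say anything about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

toy_texts = [
    "brilliant acting and a brilliant story",
    "boring plot and awful acting",
    "a brilliant and touching film",
    "an awful and boring film",
]
toy_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

toy_vectorizer = CountVectorizer(stop_words="english")
toy_train_features = toy_vectorizer.fit_transform(toy_texts)

# Define the model, then fit it on the features and their labels
toy_model = LinearSVC()
toy_model.fit(toy_train_features, toy_labels)

# Encode unseen texts with the same vectorizer before predicting
toy_new_features = toy_vectorizer.transform(["brilliant film", "awful plot"])
toy_pred = toy_model.predict(toy_new_features)
print(toy_pred)
```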

Calculate the performance of the model: the confusion matrix, using the function "confusion_matrix()"; accuracy, using "accuracy_score()"; precision, using "precision_score()"; recall, using "recall_score()"; and F1 score, using "f1_score()". Use the function "classification_report()" to get precision, recall, F1 score and support in a single report. All of these functions take the true labels first and the predicted labels second. Print the results.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


#Calculate performance
conf =  #confusion matrix

    #print confusion matrix
    
acc =  #accuracy

    #print accuracy

prec =  #precision

    #print precision

rec =  #recall

    #print recall

f1 =  #f1

    #print f1 score

rep =  #generates a report for precision, recall, f1-score and support

    #print report
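
To check your understanding of the metrics, here is a sketch on hand-made labels where the counts are easy to verify: 3 true positives, 3 true negatives, 1 false positive and 1 false negative. The label lists and "toy_" names are made up for the illustration.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

toy_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up true labels
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
toy_conf = confusion_matrix(toy_true, toy_pred)
print(toy_conf)

toy_acc = accuracy_score(toy_true, toy_pred)    # (TP + TN) / all = 6 / 8
toy_prec = precision_score(toy_true, toy_pred)  # TP / (TP + FP) = 3 / 4
toy_rec = recall_score(toy_true, toy_pred)      # TP / (TP + FN) = 3 / 4
toy_f1 = f1_score(toy_true, toy_pred)           # harmonic mean of precision and recall
print(toy_acc, toy_prec, toy_rec, toy_f1)

# One report with precision, recall, f1-score and support per class
toy_rep = classification_report(toy_true, toy_pred)
print(toy_rep)
```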


Train a decision tree model using the class "DecisionTreeClassifier". Use the "fit()" method to train the model. It takes as parameters the training_features and the labels from the training data (the "sentiment" column). Make predictions with the trained model using the "predict()" method, which takes the test_features as input. Print the maximum depth of the tree. To do that, use the property "tree_.max_depth" of the fitted model.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Training with decision tree
model = #define the model
    #fit the model
y_predDT = #predict on new data

    #print max depth
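
The same pattern for a decision tree, sketched on a toy corpus. The texts, labels and "toy_" names are made up; "random_state=0" is only there to make the sketch reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

toy_texts = [
    "brilliant acting and a brilliant story",
    "boring plot and awful acting",
    "a brilliant and touching film",
    "an awful and boring film",
]
toy_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

toy_vectorizer = CountVectorizer(stop_words="english")
toy_train_features = toy_vectorizer.fit_transform(toy_texts)

# Define and fit the tree on the word-count features
toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(toy_train_features, toy_labels)

toy_pred_dt = toy_tree.predict(toy_vectorizer.transform(["brilliant film"]))
print(toy_pred_dt)

# tree_.max_depth is available only after fitting
print(toy_tree.tree_.max_depth)
```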

Similar to the SVM, print the performance metrics for the DT.

In [None]:
#Calculate performance

conf =  #confusion matrix

    #print confusion matrix
    
acc =  #accuracy

    #print accuracy

prec =  #precision

    #print precision

rec =  #recall

    #print recall

f1 =  #f1

    #print f1 score

rep =  #generates a report for precision, recall, f1-score and support

    #print report


Retrain the decision tree model, limiting the maximum tree depth to 10. You can do that by passing the parameter "max_depth" to "DecisionTreeClassifier".

In [None]:
# Training with decision tree with max depth
model = #define the model
    #fit the model
y_predDT = #predict on new data
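
The effect of "max_depth" can be sketched on synthetic data. The dataset below comes from "make_classification" with arbitrary settings chosen for the illustration, so the depths it prints say nothing about the IMDB data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y randomly reassigns a share of the labels)
X_toy, y_toy = make_classification(n_samples=500, n_features=20,
                                   flip_y=0.2, random_state=0)

# Without a limit the tree grows until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# With max_depth=10 the tree stops growing at depth 10 at the latest
limited_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_toy, y_toy)

print("unrestricted depth:", full_tree.tree_.max_depth)
print("restricted depth:", limited_tree.tree_.max_depth)
```

A shallower tree usually fits the training data less closely, which can reduce overfitting; compare the metrics of both models to see whether that helps here.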

Once again, calculate the performance metrics and print them.

In [None]:
#Calculate performance
conf =  #confusion matrix

    #print confusion matrix
    
acc =  #accuracy

    #print accuracy

prec =  #precision

    #print precision

rec =  #recall

    #print recall

f1 =  #f1

    #print f1 score

rep =  #generates a report for precision, recall, f1-score and support

    #print report


In [None]:
from IPython.display import SVG
from graphviz import Source
from sklearn import tree
from IPython.display import display

# Display the decision tree
graph = Source(tree.export_graphviz(model, out_file=None,
                                    class_names=['0', '1'],
                                    filled=True))

display(SVG(graph.pipe(format='svg')))
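
If Graphviz is not installed on your machine, a plain-text view of the tree is a possible alternative. The sketch below uses "export_text" from "sklearn.tree" on a small synthetic model; the data, the "toy_" names and the feature names are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic dataset and a shallow tree, only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# export_text returns the decision rules as an indented string
toy_names = [f"feature_{i}" for i in range(X_toy.shape[1])]
toy_rules = export_text(toy_tree, feature_names=toy_names)
print(toy_rules)
```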