# NLP Project: Vector Space Semantics for Similarity between Eastenders Characters

In this project, I create a vector representation of a document containing the lines spoken by a character in the Eastenders script data (i.e. from the file `training.csv`), then improve that representation so that each character vector is maximally distinguished from the other character documents. This distinction is measured by how well a simple information retrieval classification method can assign documents from the validation and test data to the correct class (i.e. deciding which character spoke the lines by measuring the similarity of those document vectors to the ones built in training).

As the lines are not evenly distributed across characters, this project stipulates that at most the first **400 lines** of each character in the training data may be used to create the training documents, and at most the first **40 lines** in the test data (from `test.csv`). This makes the task more challenging, as the number of lines spoken by a character cannot be used, directly or indirectly, as a feature.

I will improve the character vector representations through pre-processing, feature extraction and transformation techniques.

In [1]:
import string
import re
import numpy as np
from numpy.linalg import norm
import pandas as pd
from collections import Counter, OrderedDict

import seaborn as sns
import matplotlib.pyplot as plt

import nltk

from sklearn.feature_extraction import DictVectorizer

%matplotlib inline
pd.options.display.max_colwidth=500

In [2]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)
test_path ='test.csv'
test_data = pd.read_csv(test_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE
...,...,...,...,...,...,...
15314,1399,55,SQUARE EXT DAY LIGHT,OTHER,"Dad? Okay ... alright, just one drink alright. But that's all. It doesn't mean anything. It's just a drink.",MALE
15315,1399,55,SQUARE EXT DAY LIGHT,MAX,Thanks Bradley. Thanks mate... It means the world to me...,MALE
15316,1399,55,SQUARE EXT DAY LIGHT,OTHER,You alright...,MALE
15317,1399,55,SQUARE EXT DAY LIGHT,MAX,"Yeah, yeah, yeah. I'm fine.",MALE


In [4]:
#Inspect
test_data

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"You'll be alright Kevin. They're gonna cut you out. It's going to be alright. Come on, just talk to me. Just, just talk to me. Call me names- let me know you're still here with me.",FEMALE
1,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,...I've got pins and needles in my fingers.,MALE
2,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,That's the cold.,FEMALE
3,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,Yeah. It's all shutting down.,MALE
4,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,It's just the cold.,FEMALE
...,...,...,...,...,...,...
1762,1398,32,GEORGE STREET EXT NIGHT,JANE,I know. Now Ian come on. Everything's going to be alright. It is. It's going to be alright.,FEMALE
1763,1399,27,[NAME] & [NAME]'S BEDROOM INT NIGHT,TANYA,Night night baby. I love you.,FEMALE
1764,1399,27,[NAME] & [NAME]'S BEDROOM INT NIGHT,OTHER,Love you too.,FEMALE
1765,1399,27,[NAME] & [NAME]'S BEDROOM INT NIGHT,TANYA,"Let's get that under your ankle shall we. Careful, alright? Night night darling. Love you.",FEMALE


In [5]:
# Split the training data 9:1 into training and held-out validation sets,
# sampling whole scenes at random so that no scene is shared between the two sets
from random import shuffle, seed

seed(0) # set a seed for reproducibility so same split is used each time

episode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = episode_scene_column
episode_scenes = sorted(set(episode_scene_column.values)) # set iteration order is arbitrary, so sort before shuffling

shuffle(episode_scenes)

print(len(episode_scenes))
episode_split = int(0.9*len(episode_scenes))
training_ep_scenes = episode_scenes[:episode_split]
test_ep_scenes = episode_scenes[episode_split:]
print(len(training_ep_scenes), len(test_ep_scenes))

def train_or_heldout_eps(val):
    if val in training_ep_scenes:
        return "training"
    return "heldout"

all_train_data['train_heldout'] = all_train_data['episode_scene'].apply(train_or_heldout_eps)

1394
1254 140
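
The scene-level split above can be illustrated on toy data. This is a minimal sketch, assuming hypothetical scene ids in place of the real script data; the `Random(0)` generator and the set-based membership test are my own choices for the illustration, not the notebook's list-based lookup.

```python
from random import Random

# Hypothetical (episode-scene, line) rows standing in for the real dataframe
rows = [(f"1350-{scene}", f"line {i}") for scene in range(1, 11) for i in range(3)]

rng = Random(0)                      # local generator, so the global seed is untouched
scenes = sorted({scene for scene, _ in rows})
rng.shuffle(scenes)

split = int(0.9 * len(scenes))       # 9:1 split over scenes, not over lines
train_scenes = set(scenes[:split])   # set membership is a constant-time lookup
heldout_scenes = set(scenes[split:])

train_rows = [r for r in rows if r[0] in train_scenes]
heldout_rows = [r for r in rows if r[0] in heldout_scenes]

print(len(train_rows), len(heldout_rows))   # 27 3
```

Because whole scenes are assigned to one side, every line of a given scene ends up in the same set, which is what the `episode_scene` column achieves in the cell above.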


In [6]:
print('Raw Data: ',np.shape(all_train_data))
train_data = all_train_data[all_train_data['train_heldout']=='training']
val_data = all_train_data[all_train_data['train_heldout']=='heldout']
print('Train set: ',np.shape(train_data))
print('Validation set: ',np.shape(val_data))

Raw Data:  (15319, 8)
Train set:  (13638, 8)
Validation set:  (1681, 8)


In [7]:
train_data

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene,train_heldout
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE,1350-1,training
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE,1350-1,training
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE,1350-2,training
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE,1350-2,training
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE,1350-2,training
...,...,...,...,...,...,...,...,...
15314,1399,55,SQUARE EXT DAY LIGHT,OTHER,"Dad? Okay ... alright, just one drink alright. But that's all. It doesn't mean anything. It's just a drink.",MALE,1399-55,training
15315,1399,55,SQUARE EXT DAY LIGHT,MAX,Thanks Bradley. Thanks mate... It means the world to me...,MALE,1399-55,training
15316,1399,55,SQUARE EXT DAY LIGHT,OTHER,You alright...,MALE,1399-55,training
15317,1399,55,SQUARE EXT DAY LIGHT,MAX,"Yeah, yeah, yeah. I'm fine.",MALE,1399-55,training


In [8]:
val_data

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene,train_heldout
56,1350,18,BRANNINGS' HOUSE INT NIGHT,TANYA,Who's that?,FEMALE,1350-18,heldout
57,1350,18,BRANNINGS' HOUSE INT NIGHT,MAX,Tanya! Tanya it's me. I wanna give Oscar a New Year kiss. Tan?,MALE,1350-18,heldout
58,1350,18,BRANNINGS' HOUSE INT NIGHT,SEAN,Right. I'll have a word.,MALE,1350-18,heldout
59,1350,18,BRANNINGS' HOUSE INT NIGHT,TANYA,"No no no leave it, leave it. He ain't got a key. He'll go away. Do you want some crisps?",FEMALE,1350-18,heldout
60,1350,18,BRANNINGS' HOUSE INT NIGHT,MAX,"Who, who you got in there, Tan? Tan, I've got a right to know who my kids are being exposed to. I bet you ain't got anyone have ya? You're just sitting there with a bottle of wine all on your Jack, Tan?",MALE,1350-18,heldout
...,...,...,...,...,...,...,...,...
15268,1399,29,[NAME] & [NAME]'S HOUSE INT NIGHT,STACEY,"I asked him to leave you. I asked him again and again. I begged him. And if he had. I'd have been with him. I'd have been with him now. So it might suit you to think that it was all Max but it wasn't, not really, not in the end.",FEMALE,1399-29,heldout
15269,1399,29,[NAME] & [NAME]'S HOUSE INT NIGHT,TANYA,Why you telling me all this eh?,FEMALE,1399-29,heldout
15270,1399,29,[NAME] & [NAME]'S HOUSE INT NIGHT,STACEY,"Because I don't want you using Sean to punish him, I don't want you wrecking your family cause of me. I just, I just want it all over.",FEMALE,1399-29,heldout
15271,1399,29,[NAME] & [NAME]'S HOUSE INT NIGHT,TANYA,It is over. It's all over. Will you please just go home now and leave me alone.,FEMALE,1399-29,heldout


In [8]:
if False:
    # A simpler alternative to the above, but where a high split means some characters
    # may be left out
    print('Raw Data: ',np.shape(all_train_data))
    idx = int(0.9 * np.shape(all_train_data)[0])
    train_data = all_train_data[:idx]
    val_data = all_train_data[idx:]
    print('Train set: ',np.shape(train_data))
    print('Validation set: ',np.shape(val_data))

In [9]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key and
    their lines joined together as a single string as value, with an
    end of line _EOL_ marker after each line.

    Also prints the name-to-gender mapping and the number of lines kept per character.

    ::df:: a dataframe with Line, Character_name and Gender columns
    ::max_line_count:: the maximum number of lines to be added per character
    """
    character_docs = {}
    character_line_count = {}
    character_gender = {}
    for line, name in zip(df.Line, df.Character_name):
        if name not in character_docs:
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name] == max_line_count:
            continue
        character_docs[name] += str(line) + " _EOL_ "
        character_line_count[name] += 1

    # obtain the name-to-gender mapping for Q2
    for name, gender in zip(df.Character_name, df.Gender):
        if name != "OTHER": # OTHER has more than one gender, so do not extract it
            character_gender[name] = str(gender)

    print("name-gender mapping: ", character_gender)
    print("\nlines per character", character_line_count)
    return character_docs
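
To make the line cap concrete, here is a small sketch of the same idea on hand-written rows. The helper name `build_character_docs` and the toy rows are hypothetical; it takes plain `(name, line)` pairs instead of a dataframe so that it runs without pandas.

```python
def build_character_docs(rows, max_line_count):
    """rows: iterable of (character_name, line) pairs in script order."""
    docs, line_counts = {}, {}
    for name, line in rows:
        if line_counts.get(name, 0) == max_line_count:
            continue                      # cap reached: ignore later lines
        docs[name] = docs.get(name, "") + str(line) + " _EOL_ "
        line_counts[name] = line_counts.get(name, 0) + 1
    return docs, line_counts

rows = [("MAX", "Hello."), ("TANYA", "Who's that?"), ("MAX", "It's me."), ("MAX", "Tan?")]
docs, line_counts = build_character_docs(rows, max_line_count=2)
print(repr(docs["MAX"]))   # 'Hello. _EOL_ It\'s me. _EOL_ '
print(line_counts)         # {'MAX': 2, 'TANYA': 1}
```

The third MAX line is dropped because the cap of two lines was already reached, mirroring how characters with many lines are truncated to the first `max_line_count` lines.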

In [10]:
# print out the number of words each character has in the training set
# only use the first 360 lines of each character
train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=360)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

name-gender mapping:  {'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'SHIRLEY': 360, 'OTHER': 360, 'JACK': 360, 'RONNIE': 360, 'TANYA': 360, 'SEAN': 360, 'ROXY': 360, 'HEATHER': 360, 'MAX': 360, 'IAN': 360, 'JANE': 360, 'STACEY': 360, 'PHIL': 360, 'MINTY': 360, 'CHRISTIAN': 342, 'CLARE': 352}
Num. Characters:  16 

SHIRLEY Number of Words:  3848
OTHER Number of Words:  3244
JACK Number of Words:  4435
RONNIE Number of Words:  3442
TANYA Number of Words:  3786
SEAN Number of Words:  3637
ROXY Number of Words:  3838
HEATHER Number of Words:  4098
MAX Number of Words:  4363
IAN Number of Words:  4332
JANE Number of Words:  3648
STACEY Number of Words:  3913
PHIL Number of Words:  3635
MINTY Number of Words:  4005
CHRISTIAN Number of Words:  3738

In [12]:
train_character_docs

{'SHIRLEY': 'Look at ya, not a mark on ya. And you think you\'re an unlucky man. _EOL_ I\'m gonna get help. Oh where\'s my phone? Oh Kevin. Kevin you smashed it, didn\'t ya? Kevin, Kevin, where\'s your phone? _EOL_ No you\'re not, ssh, shut up. _EOL_ Fire brigade and ambulance. There\'s been an accident. On an industrial estate in Walford. ...Um, the Marsh Lane industrial estate. Please come quick. My husband- he\'s not my husband- my friend. He\'s trapped in the car. Please come quick... Shirley Carter. 82 82B George Street, Walford, E20. Please hurry, please come quick. _EOL_ Kevin. Kevin! _EOL_ Kevin I\'m gonna go to the main road - _EOL_ To make sure they know where to go. _EOL_ Kevin I\'ll be five minutes. _EOL_ You\'ll be fine. You\'re talking. I\'ll be five minutes. _EOL_ It\'s alright, it\'s alright, it\'s alright, it\'s alright. _EOL_ Go away. _EOL_ I don\'t know what to say. It\'s a nightmare. _EOL_ He asked me to go with him. _EOL_ Yeah. _EOL_ Between us? Of course not. _EOL

# Task 1. Improve pre-processing


### Answer: To avoid confusion, first run all the code above, from the beginning of the notebook up to the 'pre_process()' section, then run the rest of the code from here:

 - The main changes for Q1 are in the pre_process1() function below. Instead of simply tokenizing the text on white space, I experimented with several techniques:
- 1) white space split token 
- 2) change to lower case
- 3) remove 'eol' characters
- 4) remove punctuation
- 5) remove stop words
- 6) implement stemming
   
   The combination that gave the best performance in terms of mean rank and accuracy is 1) to 6) except 3). 
   (Note that this function also generates POS tags for later tasks; they are not used in this task.) 

- At the end of the code for this question, the final solution and the results of the different experiments are shown. The final configuration achieves a mean rank of 3.3125 and an accuracy of 0.625, a clear improvement over the original mean rank of 5.12 and accuracy of 0.3125.
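
The effect of steps 2), 4) and 5) can be sketched without NLTK. This is an illustration only: the regular-expression tokenizer and the tiny `STOP_WORDS` set are stand-ins I have assumed here (the notebook uses `word_tokenize` and NLTK's English stop list), and stemming is left out because it needs the Porter stemmer.

```python
import re
import string

# Hypothetical miniature stop list; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "an", "and", "the", "you", "at", "on", "not"}
PUNCTUATION = set(string.punctuation)

def simple_pre_process(text):
    """Lower-case, tokenize, then drop single punctuation tokens and stop words."""
    text = text.lower()
    # a word (optionally with an apostrophe suffix) or one non-space symbol
    tokens = re.findall(r"[a-z0-9_]+(?:'[a-z]+)?|[^\sa-z0-9_]", text)
    return [t for t in tokens if t not in PUNCTUATION and t not in STOP_WORDS]

result = simple_pre_process("Look at ya, not a mark on ya. _EOL_")
print(result)   # ['look', 'ya', 'mark', 'ya', '_eol_']
```

Note that the `_eol_` marker survives, which corresponds to leaving out technique 3).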

In [13]:
# Task 1: updated tokenizer that also generates POS tags (POS tags to be used for Q2)
# What it does to the text: generates POS tags, lower-cases,
# removes punctuation, removes stop words, stems words

# import and download the libraries needed for tokenizing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
ps = PorterStemmer()
def pre_process1(character_text):
    """Pre-processes one character document.

    Returns a pair (stemmed tokens, POS tags), aligned index by index.
    """
    # change to lower case
    character_text = character_text.lower()

    # remove '_eol_' (disabled: technique 3 is left out of the final solution)
    #character_text = character_text.replace("_eol_ ","")

    # tokenize and generate POS tags
    tokens = word_tokenize(character_text)
    wordpos = nltk.pos_tag(tokens)

    # drop tokens that are a single punctuation symbol, and stop words
    symbols = set("!#$%&()*+-.,/:;<=>?@[\\]^`{|}~\n")
    stops = set(stopwords.words('english'))
    wordpos = [(word, pos) for word, pos in wordpos
               if word not in symbols and word not in stops]

    # stem the remaining words; the tags stay aligned with them
    wordlist = [ps.stem(word) for word, pos in wordpos]
    poslist = [pos for word, pos in wordpos]

    return wordlist, poslist

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [14]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [15]:
# create list of pairs of (character name, pre-processed pos tags)
training_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(train_character_docs.items())]

In [16]:
#modified v1.5 - for q1 no new features added yet
import nltk
nltk.download('averaged_perceptron_tagger')
def to_feature_vector_dictionary1(character_doc, tag):
    """Converts a list of pre-processed tokens
    to a dictionary of feature counts.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::tag:: the POS tags aligned with character_doc (not used in Q1)
    """
    

    #poslist = nltk.pos_tag(character_doc)
    #print(tag)

    counts = Counter(character_doc)  # for now a simple count
    counts = dict(counts)
    #pos = Counter(tag)
    #pos = dict(pos)
    #counts.update(pos)

    return counts

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [17]:
#updated version3
corpusVectorizer = DictVectorizer()   # corpusVectorizer just produces sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus1(corpus,tag, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::tag:: a list of (class_label, POS tags) pairs in the same order as corpus.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary1(a[1],b[1]) for a, b in zip(corpus,tag)])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary1(a[1],b[1]) for a, b in zip(corpus,tag)])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus1(training_corpus, training_POS, fitting=True)

In [18]:
training_feature_matrix

<16x3284 sparse matrix of type '<class 'numpy.float64'>'
	with 10378 stored elements in Compressed Sparse Row format>

In [19]:
# get the validation data- only 40 lines used for each character
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split()))
    total_words += len(val_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]
# create list of pairs of (character name, POS tags) 
val_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(val_character_docs.items())]

name-gender mapping:  {'TANYA': 'FEMALE', 'MAX': 'MALE', 'SEAN': 'MALE', 'SHIRLEY': 'FEMALE', 'STACEY': 'FEMALE', 'RONNIE': 'FEMALE', 'JACK': 'MALE', 'PHIL': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'TANYA': 40, 'MAX': 40, 'SEAN': 35, 'SHIRLEY': 40, 'OTHER': 40, 'STACEY': 40, 'RONNIE': 40, 'JACK': 40, 'PHIL': 40, 'IAN': 40, 'JANE': 40, 'ROXY': 40, 'HEATHER': 40, 'MINTY': 40, 'CHRISTIAN': 40, 'CLARE': 40}
Num. Characters:  16 

TANYA Num of Words:  438
MAX Num of Words:  737
SEAN Num of Words:  366
SHIRLEY Num of Words:  329
OTHER Num of Words:  357
STACEY Num of Words:  412
RONNIE Num of Words:  464
JACK Num of Words:  351
PHIL Num of Words:  475
IAN Num of Words:  508
JANE Num of Words:  458
ROXY Num of Words:  392
HEATHER Num of Words:  411
MINTY Num of Words:  470
CHRISTIAN Num of Words:  489
CLARE Num of Words:  405
total words 7062


In [20]:
# Just transform the val_feature_matrix, don't fit
val_feature_matrix = create_document_matrix_from_corpus1(val_corpus,val_POS, fitting=False)

In [21]:
val_feature_matrix

<16x3284 sparse matrix of type '<class 'numpy.float64'>'
	with 2071 stored elements in Compressed Sparse Row format>
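
The validation matrix has the same 3284 columns as the training matrix because the vectorizer is only fitted on training data. The sketch below mimics that fit/transform split in plain Python; `fit_vocabulary` and `transform` are hypothetical helpers written for this illustration, not the scikit-learn API, and they only approximate what `DictVectorizer` does as I understand its documented behaviour (sorted feature names, unseen features ignored at transform time).

```python
def fit_vocabulary(feature_dicts):
    """Map every feature name seen in training to a column index."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: col for col, name in enumerate(names)}

def transform(feature_dicts, vocabulary):
    """Turn feature dicts into dense rows; features outside the vocabulary are dropped."""
    rows = []
    for d in feature_dicts:
        row = [0.0] * len(vocabulary)
        for name, value in d.items():
            if name in vocabulary:
                row[vocabulary[name]] = float(value)
        rows.append(row)
    return rows

train_dicts = [{"kevin": 2, "phone": 1}, {"bar": 1, "mate": 3}]
val_dicts = [{"mate": 1, "crisp": 4}]   # "crisp" was never seen in training

vocab = fit_vocabulary(train_dicts)
val_rows = transform(val_dicts, vocab)
print(vocab)      # {'bar': 0, 'kevin': 1, 'mate': 2, 'phone': 3}
print(val_rows)   # [[0.0, 0.0, 1.0, 0.0]]
```

Words that only occur in the validation documents therefore contribute nothing to the similarity scores.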

In [22]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity
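
As a worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths, so vectors pointing in the same direction score 1 regardless of magnitude. The sketch below is a plain-Python variant (no NumPy); the guard for zero-length vectors is an assumption of mine and is not part of the function above, which would divide by zero for an all-zero vector.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0   # assumption: an empty document is similar to nothing
    return dot / (n1 * n2)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])   # approximately 1.0
orthogonal = cosine_similarity([1, 0], [0, 1])             # 0.0
empty_doc = cosine_similarity([0, 0], [1, 1])              # 0.0 via the guard
print(same_direction, orthogonal, empty_doc)
```

Since raw counts are never negative, the scores for these document vectors always fall between 0 and 1.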

In [23]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes information retrieval evaluation scores from a training data feature matrix and a test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a sparse matrix of shape N x M where N = number of characters M = number of features
    ::test_feature_matrix::  a sparse matrix of shape N x M where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare each test document against every training document
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        # print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dataframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)
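
The ranking step can be followed on a tiny example. This is a sketch with made-up similarity scores for two validation documents; `rank_of_target` is a hypothetical helper that isolates the sorting and ranking logic of the function above.

```python
def rank_of_target(similarities, target):
    """similarities: dict of training label -> cosine similarity to one test document."""
    ordered = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    ranking = {label: rank for rank, (label, _) in enumerate(ordered, 1)}
    return ranking[target]

# Hypothetical similarity scores, not taken from the real data
sims_for = {
    "MAX":   {"MAX": 0.91, "TANYA": 0.88, "JACK": 0.80},
    "TANYA": {"MAX": 0.93, "TANYA": 0.90, "JACK": 0.95},
}

ranks = [rank_of_target(sims, target) for target, sims in sims_for.items()]
mean_rank = sum(ranks) / len(ranks)
accuracy = sum(r == 1 for r in ranks) / len(ranks)
print(ranks, mean_rank, accuracy)   # [1, 3] 2.0 0.5
```

A document counts as correctly classified only when its own character is ranked first, so accuracy can stay flat while the mean rank still improves.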

In [24]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

### The best final performance for Task 1 is shown below. All the pre_process() experiments I tried and their results are also listed below as pasted text.

In [25]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

mean rank 3.3125
mean cosine similarity 0.936508444550048
10 correct out of 16 / accuracy: 0.625


The techniques I tried are:
1) white space split token 
2) change to lower case
3) remove 'eol' characters
4) remove punctuation
5) remove stop words
6) implement stemming

The experiments on different combinations of techniques are recorded here (each technique is referred to by its number):

Experiment 1, techniques 1) and 2):
mean rank 3.25
mean cosine similarity 0.9499481211709341
7 correct out of 16 / accuracy: 0.4375

Experiment 2, techniques 1) to 3):
mean rank 2.4375
mean cosine similarity 0.9375237501739655
6 correct out of 16 / accuracy: 0.375

Experiment 3, techniques 1) to 4):
mean rank 3.25
mean cosine similarity 0.8773615723233157
5 correct out of 16 / accuracy: 0.3125

Experiment 4, techniques 1) to 5):
mean rank 3.3125
mean cosine similarity 0.7560947346501452
8 correct out of 16 / accuracy: 0.5

Experiment 5, techniques 1) to 6):
mean rank 3.3125
mean cosine similarity 0.7560947346501452
8 correct out of 16 / accuracy: 0.5

Experiment 6, techniques 1) to 6) except 5):
mean rank 3.0625
mean cosine similarity 0.8782763997283665
6 correct out of 16 / accuracy: 0.375

Experiment 7, techniques 1) to 6) except 3):
mean rank 3.3125
mean cosine similarity 0.936508444550048
10 correct out of 16 / accuracy: 0.625

According to the above experiments, although experiment 2 returned the best mean rank, I think excluding all the other techniques would not generalise well, whereas experiment 7 returns a fairly good mean rank and the highest accuracy. I therefore used experiment 7's techniques as my final solution for pre_process().

# Task 2. Improve linguistic feature extraction
Use the feature extraction techniques to improve the `to_feature_vector_dictionary` function. 

### Answer: To avoid confusion, first run all the code above, from the beginning of the notebook up to the 'pre_process()' section, then run the rest of the code from here:

The main changes for Task 2 are in the to_feature_vector_dictionary2() function. Initially, it was just a simple count of the tokens. I experimented with many features, including:
- POS tags 
- previous word followed by current word
- current word followed by the next word, and many other combinations of words and their POS tags up to a length of 3
- gender feature
- sentiment score for each word

The feature that gave the best mean rank and accuracy is the current word combined with its tag, which improved the mean rank to 2.4375. 

In addition, I used SelectKBest to select the best features; with k=5100 the mean rank improved further to 1.6875. This is my final result for this question.

At the end of the code for this task, the result of each experiment is shown in detail.

In [26]:
# Task 2 version 3
# import and download the libraries needed for tokenizing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
ps = PorterStemmer()



def pre_process1(character_text):
    """Pre-processes one character document.

    Returns a pair (stemmed tokens, POS tags), aligned index by index.
    """
    # change to lower case
    character_text = character_text.lower()

    # remove '_eol_' (disabled: technique 3 is left out of the final solution)
    #character_text = character_text.replace("_eol_ ","")

    # tokenize and generate POS tags
    tokens = word_tokenize(character_text)
    wordpos = nltk.pos_tag(tokens)

    # drop tokens that are a single punctuation symbol, and stop words
    symbols = set("!#$%&()*@+-.,/:;<=>?[\\]^`{|}~\n")
    stops = set(stopwords.words('english'))
    wordpos = [(word, pos) for word, pos in wordpos
               if word not in symbols and word not in stops]

    # stem the remaining words; the tags stay aligned with them
    wordlist = [ps.stem(word) for word, pos in wordpos]
    poslist = [pos for word, pos in wordpos]

    return wordlist, poslist

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [27]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [28]:
# create list of pairs of (character name, pre-processed pos tags)
training_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(train_character_docs.items())]

In [28]:
#training_corpus

The main changes for Task 2 are in the to_feature_vector_dictionary2() function below. Initially, it was just a simple count of the tokens. After experimenting with POS tags, previous word followed by current word, current word followed by the next word, many other word combinations up to a length of 3, a gender feature and a sentiment score feature, the feature that achieved the best mean rank and accuracy is the current word combined with its tag. Detailed experiment results are shown later in the code.
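
The word-and-tag feature can be sketched as follows. The helper `word_tag_features`, its `binary` flag and the toy tags are hypothetical and exist only for this illustration. The flag is there because the experiment code in the next cell sets each feature to 0 and then increments it once, so every feature ends up with the value 1 (presence) rather than its frequency; the sketch shows both variants side by side.

```python
from collections import Counter

def word_tag_features(tokens, tags, binary=False):
    """Build 'word@TAG' features from aligned token and tag lists."""
    combined = [f"{word}@{tag}" for word, tag in zip(tokens, tags)]
    counts = Counter(combined)
    if binary:
        return {feature: 1 for feature in counts}   # presence only
    return dict(counts)                              # frequency

tokens = ["kevin", "phone", "kevin"]
tags = ["NN", "NN", "NN"]   # hypothetical tags, not real tagger output

frequency_features = word_tag_features(tokens, tags)
presence_features = word_tag_features(tokens, tags, binary=True)
print(frequency_features)   # {'kevin@NN': 2, 'phone@NN': 1}
print(presence_features)    # {'kevin@NN': 1, 'phone@NN': 1}
```

Joining the word and its tag with `@` keeps the same word under different tags as separate features.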

In [29]:
#modified v3

import nltk
nltk.download('averaged_perceptron_tagger')
!pip3 install afinn
from afinn import Afinn

def to_feature_vector_dictionary2(character_doc, tag, name):

    wordlist1 = [] #list store previous bigram

    senti_list = [] #list store sentiment scores for each token


    #counts = Counter(character_doc)  # for now a simple count
    #counts = dict(counts)
    counts = {}
    
    #add POS tag features
    #pos = Counter(tag)
    #pos = dict(pos)
    #counts.update(pos)
    
    #sentiment score analysis for each word
    afinn = Afinn()
    for j in character_doc:
        word_senti =  str(afinn.score(j))
        senti_list.append(word_senti)
    
    #experiment1 target word and its tag
    '''for i in range(len(character_doc)):
        word = character_doc[i] +'@'+ tag[i]
        wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment2 target word and one prev word&tag
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'@'+ tag[i-1]+'_'+character_doc[i]+'@'+tag[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment 3 target word and one after word&tag
    '''for i in range(len(character_doc)):        
            if len(character_doc)-i>1:
                word = character_doc[i] +'@'+tag[i]+'_' + character_doc[i+1]+'@'+tag[i+1]    
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment4 target and one after word
    '''for i in range(len(character_doc)):        
            if len(character_doc)-i>1:
                word = character_doc[i] +'_' + character_doc[i+1]    
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment 5 target and one prev word
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'_'+character_doc[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment 6 target and one prev word and sentiment score
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'@'+ senti_list[i-1]+'_'+character_doc[i]+'@'+senti_list[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment 7 target word, tag and separated counter for gender i.e add another key and value - gender:1 for male
    for i in range(len(character_doc)):
        word = character_doc[i] +'@'+ tag[i]
        wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        # binary presence feature: every word@tag seen is set to 1, whatever its frequency
        counts[a] = 1

    
    #add gender feature, male=1, female=0
    #gender = 0
    #genders ={'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

    #for item in genders:
        #if name == item:
            #gender = genders[name]
            #if gender == "MALE":
                #gender =1
            #else: gender =0
    #counts.update({"Gender": gender})
    
    return counts

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!




A few small changes have been made here:


In [30]:
#updated version3
# changes made from original version: tag and name are passed to
# to_feature_vector_dictionary2()

corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts

def create_document_matrix_from_corpus2(corpus,tag, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary2(a[1],b[1],a[0]) for a, b in zip(corpus,tag)])

    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary2(a[1],b[1],a[0]) for a, b, in zip(corpus,tag)])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus2(training_corpus, training_POS, fitting=True)

In [31]:
training_feature_matrix

<16x5426 sparse matrix of type '<class 'numpy.float64'>'
	with 13893 stored elements in Compressed Sparse Row format>

In [32]:
# get the validation data- only 40 lines used for each character
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split()))
    total_words += len(val_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]
# create list of pairs of (character name, POS tags) 
val_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(val_character_docs.items())]
# create list of pairs of (character name, sentiment score) 
#val_senti = [(name, pre_process1(doc)[2]) for name, doc in sorted(val_character_docs.items())]

name-gender mapping:  {'TANYA': 'FEMALE', 'MAX': 'MALE', 'SEAN': 'MALE', 'SHIRLEY': 'FEMALE', 'STACEY': 'FEMALE', 'RONNIE': 'FEMALE', 'JACK': 'MALE', 'PHIL': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'TANYA': 40, 'MAX': 40, 'SEAN': 35, 'SHIRLEY': 40, 'OTHER': 40, 'STACEY': 40, 'RONNIE': 40, 'JACK': 40, 'PHIL': 40, 'IAN': 40, 'JANE': 40, 'ROXY': 40, 'HEATHER': 40, 'MINTY': 40, 'CHRISTIAN': 40, 'CLARE': 40}
Num. Characters:  16 

TANYA Num of Words:  438
MAX Num of Words:  737
SEAN Num of Words:  366
SHIRLEY Num of Words:  329
OTHER Num of Words:  357
STACEY Num of Words:  412
RONNIE Num of Words:  464
JACK Num of Words:  351
PHIL Num of Words:  475
IAN Num of Words:  508
JANE Num of Words:  458
ROXY Num of Words:  392
HEATHER Num of Words:  411
MINTY Num of Words:  470
CHRISTIAN Num of Words:  489
CLARE Num of Words:  405
total words 7062


In [33]:
# Just transform the val_feature_matrix, don't fit
val_feature_matrix = create_document_matrix_from_corpus2(val_corpus,val_POS, fitting=False)

In [34]:
val_feature_matrix

<16x5426 sparse matrix of type '<class 'numpy.float64'>'
	with 2300 stored elements in Compressed Sparse Row format>

In [35]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity
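
The function above divides by the product of the two norms, so an all-zero vector produces `nan` (NumPy evaluates 0/0 and emits a warning). This can happen when feature selection leaves a document with no active features. The following is a minimal plain-Python sketch of a guarded variant; the name `safe_cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions, not something the notebook defines.

```python
import math

def safe_cosine_similarity(v1, v2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # assumption: treat an empty document as sharing nothing with any other
        return 0.0
    return dot / (norm1 * norm2)

print(safe_cosine_similarity([0, 0.5, 0.5], [0, 0.5, 0.5]))  # identical vectors
print(safe_cosine_similarity([0, 0, 0], [0, 0.5, 0.5]))      # zero vector
```

Returning 0.0 keeps the mean cosine similarity finite, at the cost of hiding the fact that a document ended up empty, so logging such cases may still be worthwhile.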

In [36]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes an information retrieval based on training data feature matrix and test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a numpy matrix N x M shape where N = number of characters M = number of features
    ::test_feature_matrix::  a numpy matrix N x M shape where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare the left out character against the mean
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        # print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dataframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)
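
To make the two headline metrics concrete, here is a small plain-Python sketch of how the rank of the target document and the accuracy follow from a table of similarity scores. The helper names `rank_of_target` and `mean_rank_and_accuracy` and the toy scores are assumptions for illustration only.

```python
def rank_of_target(similarities, target):
    """similarities: dict of train label -> score. Returns the 1-based rank of target."""
    ordered = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    for rank, (label, _) in enumerate(ordered, 1):
        if label == target:
            return rank
    raise KeyError(target)

def mean_rank_and_accuracy(all_similarities):
    """all_similarities: dict of test label -> {train label: score}."""
    ranks = [rank_of_target(sims, target) for target, sims in all_similarities.items()]
    mean_rank = sum(ranks) / len(ranks)
    # a test document counts as correct only if its own character is ranked first
    accuracy = sum(1 for r in ranks if r == 1) / len(ranks)
    return mean_rank, accuracy

# toy scores (made up): A is ranked 1st, B 2nd and C 3rd for their own documents
toy = {
    "A": {"A": 0.9, "B": 0.2, "C": 0.1},
    "B": {"A": 0.6, "B": 0.5, "C": 0.1},
    "C": {"A": 0.3, "B": 0.2, "C": 0.1},
}
print(mean_rank_and_accuracy(toy))
```

A mean rank of 1 would mean every test document is most similar to the training document of the same character.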

In [23]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

### Evaluation step 1: of the experiments tried, the best result comes from Experiment 1, which uses word&tag only. The mean rank improved from 3.3125 in Task 1 to 2.4375 here. Results for all the other experiments are pasted as text below.

In [38]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

mean rank 2.4375
mean cosine similarity 0.28423874228754586
10 correct out of 16 / accuracy: 0.625


Experiment 1: target word and tag
mean rank 2.4375
mean cosine similarity 0.28423874228754586
10 correct out of 16 / accuracy: 0.625

Experiment 2: target word and one previous word, with tags
mean rank 3.125
mean cosine similarity 0.09022084210078041
9 correct out of 16 / accuracy: 0.5625

Experiment 3: target word and one following word, with tags
mean rank 3.125
mean cosine similarity 0.09022084210078041
9 correct out of 16 / accuracy: 0.5625

Experiment 4: target word and one following word
mean rank 2.6875
mean cosine similarity 0.13292936839498484
9 correct out of 16 / accuracy: 0.5625

Experiment 5: target word and one previous word
mean rank 2.6875
mean cosine similarity 0.13292936839498484
9 correct out of 16 / accuracy: 0.5625

Experiment 6: target word and one previous word, with sentiment scores
mean rank 2.6875
mean cosine similarity 0.13292936839498484
9 correct out of 16 / accuracy: 0.5625

Experiment 7: target word and tag, plus a separate counter for gender, i.e. an additional key and value (gender: 1 for male)
mean rank 2.4375
mean cosine similarity 0.28423874228754586
10 correct out of 16 / accuracy: 0.625

### Evaluation step 2: experimenting with the SelectKBest function (chi-squared scoring) to find the best features for the model, I found the best value to be k=5100, with a mean rank of 1.6875.
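
One way to avoid reading the best k off the printed output is to track it during the sweep. The sketch below assumes a hypothetical `evaluate(k)` callable that returns the mean rank for a given k (in this notebook it would wrap the `SelectKBest` fit/transform and `compute_IR_evaluation_scores` calls). The names `best_k_from_sweep` and `toy_evaluate`, and the scores in the demo, are made up for illustration.

```python
def best_k_from_sweep(evaluate, k_values):
    """Return (best_k, best_mean_rank); a lower mean rank is better."""
    best_k, best_rank = None, float("inf")
    for k in k_values:
        rank = evaluate(k)
        # strict '<' keeps the smallest k when several k values tie
        if rank < best_rank:
            best_k, best_rank = k, rank
    return best_k, best_rank

def toy_evaluate(k):
    # hypothetical mean ranks, standing in for a real evaluation run
    return {50: 8.5, 100: 3.0, 150: 1.75, 200: 2.5}[k]

print(best_k_from_sweep(toy_evaluate, [50, 100, 150, 200]))
```

Because k is chosen on the validation data, the selected value should be treated as tuned to that split rather than as a general optimum.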

In [37]:
#for loop to iterate different k to find the best features 
from sklearn.feature_selection import SelectKBest, chi2
for i in range(0,training_feature_matrix.shape[1],50):
    if i>0:
        selector =  SelectKBest(chi2,k=i)
        training_feature_matrix2 = selector.fit(training_feature_matrix, train_labels)
        features_train_transformed = selector.transform(training_feature_matrix)
    
        #val_feature_matrix2 = selector.fit(val_feature_matrix, val_labels)
        features_val_transformed = selector.transform(val_feature_matrix)
        print('k= ',i, 'mean rank: ', compute_IR_evaluation_scores(features_train_transformed, features_val_transformed, train_labels, val_labels)[0] )
       

        

  manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))

mean rank 8.5
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  50 mean rank:  8.5
mean rank 8.5
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  100 mean rank:  8.5
mean rank 8.5
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  150 mean rank:  8.5
mean rank 8.625
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  200 mean rank:  8.625
mean rank 8.6875
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  250 mean rank:  8.6875
mean rank 8.6875
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  300 mean rank:  8.6875
mean rank 8.6875
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  350 mean rank:  8.6875
mean rank 8.6875
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  400 mean rank:  8.6875
mean rank 8.4375
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  450 mean rank:  8.4375
mean rank 8.625
mean cosine sim

In [39]:
#the best solution for Q2
from sklearn.feature_selection import SelectKBest, chi2
selector =  SelectKBest(chi2,k=5100)
training_feature_matrix2 = selector.fit(training_feature_matrix, train_labels)
features_train_transformed = selector.transform(training_feature_matrix)
features_val_transformed = selector.transform(val_feature_matrix)
mean_rank, mean_cosine_simliarity, acc, df0 = compute_IR_evaluation_scores(features_train_transformed, features_val_transformed, train_labels, val_labels)

mean rank 1.6875
mean cosine similarity 0.1271974409024287
13 correct out of 16 / accuracy: 0.8125


# Task 3. Add dialogue context data and features
Adjust `create_character_document_from_dataframe` and the other functions so the data incorporates the context of the line spoken by the characters in terms of the lines spoken by other characters in the same scene (immediately before and after). 

### Answer: to avoid confusion, please run all the code from this point onwards, treating it as the start of Q3:

For this question, the main changes are in the function `create_character_document_from_dataframe2()`, which now includes one line before and one line after each spoken line when they fall within the same scene. The `character_line_count` dictionary is incremented by 1, rather than 3, for each <before line + current line + after line> group. I believe this makes more sense, as the surrounding lines are treated as extra content/features for the current line being analyzed.

However, the result is not an improvement. It achieved a mean rank of 2.0625, which is worse than the best Task 2 result (mean rank 1.6875).
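
The context-gathering rule described above can be sketched on plain Python data. This is a simplified illustration, not the notebook's implementation: it uses a list of dicts instead of a dataframe, the names `lines_with_context` and `build_documents` are hypothetical, the toy rows are made up, and unlike the notebook it always keeps the following line, even when the line cap has just been reached.

```python
def lines_with_context(rows):
    """rows: list of dicts with 'scene', 'name' and 'line' keys, in script order.
    Yields (name, before, current, after); before/after are None across scene boundaries."""
    for i, row in enumerate(rows):
        same_prev = i > 0 and rows[i - 1]["scene"] == row["scene"]
        same_next = i + 1 < len(rows) and rows[i + 1]["scene"] == row["scene"]
        before = rows[i - 1]["line"] if same_prev else None
        after = rows[i + 1]["line"] if same_next else None
        yield row["name"], before, row["line"], after

def build_documents(rows, max_line_count):
    docs, line_counts = {}, {}
    for name, before, current, after in lines_with_context(rows):
        if line_counts.get(name, 0) >= max_line_count:
            continue
        parts = []
        if before is not None:
            parts.append(before + " _BFEOL_")
        parts.append(current + " _EOL_")
        if after is not None:
            parts.append(after + " _AFTEOL_")
        docs[name] = (docs.get(name, "") + " " + " ".join(parts)).strip()
        # count one per spoken line, not one per context line
        line_counts[name] = line_counts.get(name, 0) + 1
    return docs, line_counts

# toy rows (hypothetical, not from the dataset)
rows = [
    {"scene": "1-1", "name": "A", "line": "hello"},
    {"scene": "1-1", "name": "B", "line": "hi there"},
    {"scene": "1-2", "name": "A", "line": "new scene"},
]
docs, counts = build_documents(rows, max_line_count=40)
print(docs)
print(counts)
```

Keeping distinct end-of-line markers for the before, current, and after lines leaves open the option of weighting or filtering context tokens separately later.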

In [40]:
import string
import re
import numpy as np
from numpy.linalg import norm
import pandas as pd
from collections import Counter, OrderedDict

import seaborn as sns
import matplotlib.pyplot as plt

import nltk

from sklearn.feature_extraction import DictVectorizer

%matplotlib inline
pd.options.display.max_colwidth=500

In [41]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)
test_path ='test.csv'
test_data = pd.read_csv(test_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE


In [42]:
#Inspect
test_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"You'll be alright Kevin. They're gonna cut you out. It's going to be alright. Come on, just talk to me. Just, just talk to me. Call me names- let me know you're still here with me.",FEMALE
1,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,...I've got pins and needles in my fingers.,MALE
2,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,That's the cold.,FEMALE
3,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,Yeah. It's all shutting down.,MALE
4,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,It's just the cold.,FEMALE


In [43]:
# Split into training and test data for heldout validation with random samples of 9:1 train/heldout split
from random import shuffle, seed

seed(0) # set a seed for reproducibility so same split is used each time

epsiode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = epsiode_scene_column
episode_scenes = sorted(list(set([x for x in epsiode_scene_column.values]))) # set function is random, need to sort!

shuffle(episode_scenes)

print(len(episode_scenes))
episode_split = int(0.9*len(episode_scenes))
training_ep_scenes = episode_scenes[:episode_split]
test_ep_scenes = episode_scenes[episode_split:]
print(len(training_ep_scenes), len(test_ep_scenes))

def train_or_heldout_eps(val):
    if val in training_ep_scenes:
        return "training"
    return "heldout"

all_train_data['train_heldout'] = all_train_data['episode_scene'].apply(train_or_heldout_eps)

1394
1254 140


In [44]:
print('Raw Data: ',np.shape(all_train_data))
train_data = all_train_data[all_train_data['train_heldout']=='training']
val_data = all_train_data[all_train_data['train_heldout']=='heldout']
print('Train set: ',np.shape(train_data))
print('Validation set: ',np.shape(val_data))

Raw Data:  (15319, 8)
Train set:  (13638, 8)
Validation set:  (1681, 8)


In [45]:
train_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene,train_heldout
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE,1350-1,training
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE,1350-1,training
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE,1350-2,training
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE,1350-2,training
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE,1350-2,training


In [46]:
val_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene,train_heldout
56,1350,18,BRANNINGS' HOUSE INT NIGHT,TANYA,Who's that?,FEMALE,1350-18,heldout
57,1350,18,BRANNINGS' HOUSE INT NIGHT,MAX,Tanya! Tanya it's me. I wanna give Oscar a New Year kiss. Tan?,MALE,1350-18,heldout
58,1350,18,BRANNINGS' HOUSE INT NIGHT,SEAN,Right. I'll have a word.,MALE,1350-18,heldout
59,1350,18,BRANNINGS' HOUSE INT NIGHT,TANYA,"No no no leave it, leave it. He ain't got a key. He'll go away. Do you want some crisps?",FEMALE,1350-18,heldout
60,1350,18,BRANNINGS' HOUSE INT NIGHT,MAX,"Who, who you got in there, Tan? Tan, I've got a right to know who my kids are being exposed to. I bet you ain't got anyone have ya? You're just sitting there with a bottle of wine all on your Jack, Tan?",MALE,1350-18,heldout


### The main changes for Q3 are in the `create_character_document_from_dataframe2()` function below, which includes one line before and one line after each spoken line within the same scene. The `character_line_count` dictionary is incremented by 1, rather than 3, for each <before line + current line + after line> group.

In [47]:
#updated version for Task3
#code modified to add one line before and after for all the documents if they are within the same episode-scene
# Create one document per character
def create_character_document_from_dataframe2(df, max_line_count):

    character_docs = {}
    character_line_count = {}
    character_gender = {}
    for line, name, gender, i in zip(df.Line, df.Character_name, df.Gender, range(len(df))):
       
        if not name in character_docs.keys():
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name]==max_line_count:
            continue
        
        # add the line before that if they are within the same scene
        if i>0:
            if df.iloc[i,6] == df.iloc[i-1,6]: #one line before
                character_docs[name] += str(df.iloc[i-1,4]) + "_BFEOL_ "
                #character_line_count[name]+=1
                #print(character_docs)
                #print(character_line_count)
                
        if character_line_count[name]==max_line_count:
            continue
        # add the current line
        character_docs[name] += str(df.iloc[i,4])   + " _EOL_ "
        character_line_count[name]+=1
        #print(character_docs)
        #print(character_line_count)    
        if character_line_count[name]==max_line_count:
            continue
        
        # add the line after that if they are within the same scene
        if len(df)-i-1>0:
            if df.iloc[i,6] == df.iloc[i+1,6]: #one line after
                character_docs[name] += str(df.iloc[i+1,4]) + "_AFTEOL_ "
                #character_line_count[name]+=1
                #print(character_docs)
                #print(character_line_count)

        
    #obtain name gender mapping info for Q2
    for name, gender in zip(df.Character_name, df.Gender):
        if name != "OTHER": #as other has diff gender, do not extract its gender
            character_gender[name] =str(gender)
                
    print("name-gender mapping: ", character_gender)
    print("\nlines per character", character_line_count)
    return character_docs

In [48]:
# print out the number of words each character has in the training set
# only use the first 360 lines of each character
train_character_docs = create_character_document_from_dataframe2(train_data, max_line_count=360)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

name-gender mapping:  {'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'SHIRLEY': 360, 'OTHER': 360, 'JACK': 360, 'RONNIE': 360, 'TANYA': 360, 'SEAN': 360, 'ROXY': 360, 'HEATHER': 360, 'MAX': 360, 'IAN': 360, 'JANE': 360, 'STACEY': 360, 'PHIL': 360, 'MINTY': 360, 'CHRISTIAN': 342, 'CLARE': 352}
Num. Characters:  16 

SHIRLEY Number of Words:  9938
OTHER Number of Words:  8215
JACK Number of Words:  10135
RONNIE Number of Words:  10301
TANYA Number of Words:  9660
SEAN Number of Words:  9624
ROXY Number of Words:  10139
HEATHER Number of Words:  10798
MAX Number of Words:  10225
IAN Number of Words:  10647
JANE Number of Words:  10083
STACEY Number of Words:  10200
PHIL Number of Words:  10969
MINTY Number of Words:  11019
CHRISTIAN Number of Wo

In [49]:
!pip3 install afinn
from afinn import Afinn



In [50]:
#Q2 version3 updated tokenize with POS tags generated
#what it does to the text: lower case, pos tag generated,
#remove punctuation, remove stop words, word stemming

#note: the sentiment score mentioned for this version is not produced by this function;
#it is computed with Afinn inside to_feature_vector_dictionary2()

# import and download the libraries needed for tokenizing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

ps = PorterStemmer()



def pre_process1(character_text):


    
    wordlist = []
    poslist= []

    #change to lower case
    character_text = character_text.lower()
        
    #remove '_eol_'
    #character_text = character_text.replace("_eol_ ","")
    
    #generate pos tags
    tokens = word_tokenize(character_text)
    wordpos = nltk.pos_tag(tokens)
    
    #remove punctuation
    # the backslash is escaped explicitly; the set of symbols is unchanged
    symbols = "!#$%&()*@+-.,/:;<=>?[\\]^`{|}~\n"
    j=0
    for i in range(len(wordpos)):
        for symbol in symbols:
            if wordpos[i-j][0] == symbol:    
                del wordpos[i-j]
                j += 1
              
        
    #remove stop words
    stops = stopwords.words('english')
    k=0
    for i in range(len(wordpos)):
        for stop in stops:
            if wordpos[i-k][0] == stop:    
                del wordpos[i-k]
                k += 1

    for i in range(len(wordpos)):
        word = wordpos[i][0]
        pos = wordpos[i][1]
        wordlist.append(word)
        poslist.append(pos)
 

    #stemming
    wordlist2 = []
    for w in wordlist:
        wordlist2.append(ps.stem(w))
    

    return wordlist2, poslist

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
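
The punctuation and stop-word loops in `pre_process1` delete items in place while adjusting indices by hand. As a hedged alternative, the same kind of filtering can be expressed as a single pass over the (token, tag) pairs. The function name `filter_tagged_tokens`, the tiny stop-word set, and the toy pairs below are assumptions for illustration. Note that this sketch drops any token made up entirely of `string.punctuation` characters, which is slightly broader than the single-symbol check in the notebook, so results could differ.

```python
import string

def filter_tagged_tokens(tagged_tokens, stop_words):
    """Drop punctuation-only tokens and stop words from (token, tag) pairs."""
    punctuation = set(string.punctuation)
    kept = [
        (token, tag)
        for token, tag in tagged_tokens
        if token not in stop_words and not all(ch in punctuation for ch in token)
    ]
    # split the surviving pairs back into parallel word and tag lists
    words = [token for token, _ in kept]
    tags = [tag for _, tag in kept]
    return words, tags

# toy tagged tokens and stop words (hypothetical)
tagged = [("look", "VB"), ("at", "IN"), ("ya", "PRP"), (",", ","), ("not", "RB"), ("...", ":")]
print(filter_tagged_tokens(tagged, stop_words={"at", "not"}))
```

Filtering the pairs together keeps the word list and the tag list aligned, which the word&tag features rely on.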


In [51]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [52]:
# create list of pairs of (character name, pre-processed pos tags)
training_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(train_character_docs.items())]

In [40]:
#training_corpus

In [53]:
#modified v3

import nltk
nltk.download('averaged_perceptron_tagger')
!pip3 install afinn
from afinn import Afinn

def to_feature_vector_dictionary2(character_doc, tag, name):

    wordlist1 = [] #list store previous bigram

    senti_list = [] #list store sentiment scores for each token


    #counts = Counter(character_doc)  # for now a simple count
    #counts = dict(counts)
    counts = {}
    
    #add POS tag features
    #pos = Counter(tag)
    #pos = dict(pos)
    #counts.update(pos)
    
    #sentiment score analysis for each word
    afinn = Afinn()
    for j in character_doc:
        word_senti =  str(afinn.score(j))
        senti_list.append(word_senti)
    
    #experiment1 target word and its tag
    '''for i in range(len(character_doc)):
        word = character_doc[i] +'@'+ tag[i]
        wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment2 target word and one prev word&tag
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'@'+ tag[i-1]+'_'+character_doc[i]+'@'+tag[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment 3 target word and one after word&tag
    '''for i in range(len(character_doc)):        
            if len(character_doc)-i>1:
                word = character_doc[i] +'@'+tag[i]+'_' + character_doc[i+1]+'@'+tag[i+1]    
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment4 target and one after word
    '''for i in range(len(character_doc)):        
            if len(character_doc)-i>1:
                word = character_doc[i] +'_' + character_doc[i+1]    
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment 5 target and one prev word
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'_'+character_doc[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment 6 target and one prev word and sentiment score
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'@'+ senti_list[i-1]+'_'+character_doc[i]+'@'+senti_list[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
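    #experiment 7: target word with its tag, plus a separate gender feature
    #(i.e. add another key and value, Gender: 1 for male; see the commented-out block below)
    for i in range(len(character_doc)):
        word = character_doc[i] +'@'+ tag[i]
        wordlist1.append(word)

    #each distinct word@tag feature is stored with value 1 (binary presence);
    #the frequencies in word_counts are not used
    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1  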

    
    #add gender feature, male=1, female=0
    #gender = 0
    #genders ={'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

    #for item in genders:
        #if name == item:
            #gender = genders[name]
            #if gender == "MALE":
                #gender =1
            #else: gender =0
    #counts.update({"Gender": gender})
    
    return counts

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
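
The experiments above differ only in how each feature key is built from the token list and its parallel tag list. As a minimal sketch (the helper names `word_tag_features` and `prev_word_features` are hypothetical and not part of the notebook, and the tokens and tags are made up), two of the variants can be written compactly as binary features, which is what the visible part of the function appears to produce:

```python
def word_tag_features(tokens, tags):
    """Binary features of the form word@TAG (as in experiments 1 and 7)."""
    return {f"{word}@{tag}": 1 for word, tag in zip(tokens, tags)}


def prev_word_features(tokens):
    """Binary features of the form previous_current (as in experiment 5)."""
    return {f"{prev}_{cur}": 1 for prev, cur in zip(tokens, tokens[1:])}


# made-up tokens and POS tags, for illustration only
tokens = ["look", "ya", "mark", "ya"]
tags = ["VB", "PRP", "NN", "PRP"]

print(word_tag_features(tokens, tags))
print(prev_word_features(tokens))
```

Pairing `tokens` with `tokens[1:]` avoids the explicit index checks (`if i>0`, `if len(character_doc)-i>1`) used in the loops above.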




In [54]:
#updated version3
# changes made from original version: tag and name are passed to
# to_feature_vector_dictionary2()

corpusVectorizer = DictVectorizer()   # corpusVectorizer which will just produce sparse vectors from feature dicts

def create_document_matrix_from_corpus2(corpus,tag, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::tag:: a list of (class_label, POS tags) pairs, in the same order as corpus.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary2(a[1],b[1],a[0]) for a, b in zip(corpus,tag)])

    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary2(a[1],b[1],a[0]) for a, b, in zip(corpus,tag)])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus2(training_corpus, training_POS, fitting=True)

In [55]:
training_feature_matrix

<16x10504 sparse matrix of type '<class 'numpy.float64'>'
	with 28239 stored elements in Compressed Sparse Row format>

In [56]:
# get the validation data- only 40 lines used for each character
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split()))
    total_words += len(val_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]
# create list of pairs of (character name, POS tags) 
val_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(val_character_docs.items())]
# create list of pairs of (character name, sentiment score) 
#val_senti = [(name, pre_process1(doc)[2]) for name, doc in sorted(val_character_docs.items())]

name-gender mapping:  {'TANYA': 'FEMALE', 'MAX': 'MALE', 'SEAN': 'MALE', 'SHIRLEY': 'FEMALE', 'STACEY': 'FEMALE', 'RONNIE': 'FEMALE', 'JACK': 'MALE', 'PHIL': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'TANYA': 40, 'MAX': 40, 'SEAN': 35, 'SHIRLEY': 40, 'OTHER': 40, 'STACEY': 40, 'RONNIE': 40, 'JACK': 40, 'PHIL': 40, 'IAN': 40, 'JANE': 40, 'ROXY': 40, 'HEATHER': 40, 'MINTY': 40, 'CHRISTIAN': 40, 'CLARE': 40}
Num. Characters:  16 

TANYA Num of Words:  438
MAX Num of Words:  737
SEAN Num of Words:  366
SHIRLEY Num of Words:  329
OTHER Num of Words:  357
STACEY Num of Words:  412
RONNIE Num of Words:  464
JACK Num of Words:  351
PHIL Num of Words:  475
IAN Num of Words:  508
JANE Num of Words:  458
ROXY Num of Words:  392
HEATHER Num of Words:  411
MINTY Num of Words:  470
CHRISTIAN Num of Words:  489
CLARE Num of Words:  405
total words 7062


In [57]:
# Just transform the val_feature_matrix, don't fit
val_feature_matrix = create_document_matrix_from_corpus2(val_corpus,val_POS, fitting=False)

In [58]:
val_feature_matrix

<16x10504 sparse matrix of type '<class 'numpy.float64'>'
	with 2354 stored elements in Compressed Sparse Row format>

In [59]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity
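
Note that the division above is undefined when either vector contains only zeros (for example, when none of a document's features survive a feature-selection step); NumPy then returns `nan` and emits a runtime warning. A minimal sketch of a guarded variant follows. The name `safe_cosine_similarity` is hypothetical, returning 0.0 for an empty vector is an assumption rather than the notebook's behaviour, and the sketch uses only the standard library so that it runs on its own:

```python
import math


def safe_cosine_similarity(v1, v2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # assumption: treat an all-zero document as dissimilar to everything
        return 0.0
    return dot / (norm1 * norm2)


print(safe_cosine_similarity([0, 0.5, 0.5], [0, 0.5, 0.5]))
print(safe_cosine_similarity([0, 0, 0], [0, 0.5, 0.5]))
```

With such a guard, the mean cosine similarity stays a number even when some documents end up with no active features.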

In [48]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes an information retrieval based on training data feature matrix and test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a numpy matrix N x M shape where N = number of characters M = number of features
    ::test_feature_matrix::  a numpy matrix N x M shape where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare this test document against every training document
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        # print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dataframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)
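
To make the evaluation metrics concrete, here is a small sketch of how mean rank and accuracy follow from per-query similarity scores. The helper `rank_of_target` and the similarity values are hypothetical and chosen only for illustration:

```python
def rank_of_target(similarities, target):
    """Return the 1-based rank of `target` when labels are sorted by similarity, highest first."""
    ordered = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    ranking = {label: rank for rank, (label, _) in enumerate(ordered, 1)}
    return ranking[target]


# made-up similarity scores of two validation documents against three training documents
sims_by_query = {
    "MAX": {"MAX": 0.30, "TANYA": 0.25, "SEAN": 0.10},
    "SEAN": {"MAX": 0.22, "TANYA": 0.20, "SEAN": 0.15},
}

ranks = [rank_of_target(sims, query) for query, sims in sims_by_query.items()]
mean_rank = sum(ranks) / len(ranks)
accuracy = sum(rank == 1 for rank in ranks) / len(ranks)
print(ranks, mean_rank, accuracy)  # prints: [1, 3] 2.0 0.5
```

A mean rank of 1 would mean that every validation document is most similar to the training document of the same character.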

In [49]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

In [60]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

mean rank 2.6875
mean cosine similarity 0.22313262553576174
6 correct out of 16 / accuracy: 0.375


In [61]:
#loop over different values of k to find the best number of features
from sklearn.feature_selection import SelectKBest, chi2
for i in range(100,training_feature_matrix.shape[1],100):
    selector =  SelectKBest(chi2,k=i)
    #fit the selector on the training data only, then apply it to both sets
    selector.fit(training_feature_matrix, train_labels)
    features_train_transformed = selector.transform(training_feature_matrix)
    features_val_transformed = selector.transform(val_feature_matrix)
    print('k= ',i, 'mean rank: ', compute_IR_evaluation_scores(features_train_transformed, features_val_transformed, train_labels, val_labels)[0] )
       

        

  manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))


mean rank 8.5
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  100 mean rank:  8.5
mean rank 8.5625
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  200 mean rank:  8.5625
mean rank 8.625
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  300 mean rank:  8.625
mean rank 8.625
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  400 mean rank:  8.625
mean rank 8.625
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  500 mean rank:  8.625
mean rank 8.625
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  600 mean rank:  8.625
mean rank 8.6875
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  700 mean rank:  8.6875
mean rank 8.6875
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  800 mean rank:  8.6875
mean rank 8.75
mean cosine similarity nan
0 correct out of 16 / accuracy: 0.0
k=  900 mean rank:  8.75
mean rank 8.75
mean cosine simi



mean rank 8.9375
mean cosine similarity nan
0 correct out of 16 / accuracy: 0.0
k=  1200 mean rank:  8.9375
mean rank 9.0625
mean cosine similarity nan
0 correct out of 16 / accuracy: 0.0
k=  1300 mean rank:  9.0625
mean rank 9.1875
mean cosine similarity nan
0 correct out of 16 / accuracy: 0.0
k=  1400 mean rank:  9.1875
mean rank 8.6875
mean cosine similarity nan
0 correct out of 16 / accuracy: 0.0
k=  1500 mean rank:  8.6875
mean rank 8.75
mean cosine similarity 0.006450775759770815
0 correct out of 16 / accuracy: 0.0
k=  1600 mean rank:  8.75
mean rank 8.625
mean cosine similarity 0.008453770511845603
0 correct out of 16 / accuracy: 0.0
k=  1700 mean rank:  8.625
mean rank 8.6875
mean cosine similarity 0.008175839777696845
0 correct out of 16 / accuracy: 0.0
k=  1800 mean rank:  8.6875
mean rank 8.125
mean cosine similarity 0.011048154868699884
0 correct out of 16 / accuracy: 0.0
k=  1900 mean rank:  8.125
mean rank 8.125
mean cosine similarity 0.010788512470371714
0 correct out of

### Final best result for Task 3: the result is not as good as the best one from Task 2 (Task 2 best mean rank: 1.6875)

In [62]:
#the best solution for Task2
from sklearn.feature_selection import SelectKBest, chi2
selector =  SelectKBest(chi2,k=10200)
training_feature_matrix2 = selector.fit(training_feature_matrix, train_labels)
features_train_transformed = selector.transform(training_feature_matrix)
features_val_transformed = selector.transform(val_feature_matrix)
mean_rank, mean_cosine_simliarity, acc, df0 = compute_IR_evaluation_scores(features_train_transformed, features_val_transformed, train_labels, val_labels)

mean rank 2.0625
mean cosine similarity 0.11447674882565274
8 correct out of 16 / accuracy: 0.5


# Task 4. Improve the vectorization method
Use a matrix transformation technique like TF-IDF (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to improve the `create_document_matrix_from_corpus` function, which currently only uses a dictionary vectorizer (`DictVectorizer`) that maps straightforwardly from the feature dictionaries produced for each character document to a sparse matrix.

### Answer: to avoid confusion, please run all the code from this point onwards as a fresh starting point.
In this question, the main changes are in two functions, `to_feature_vector_dictionary3()` and `create_document_matrix_from_corpus3()`, which together implement the TF-IDF matrix. I tried both raw and pre-processed data and found that the latter performs better.

The final evaluation mean rank is 1.1875, the best result among Tasks 1-4.

In [63]:
import string
import re
import numpy as np
from numpy.linalg import norm
import pandas as pd
from collections import Counter, OrderedDict

import seaborn as sns
import matplotlib.pyplot as plt

import nltk

from sklearn.feature_extraction import DictVectorizer

%matplotlib inline
pd.options.display.max_colwidth=500

In [64]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)
test_path ='test.csv'
test_data = pd.read_csv(test_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE


In [65]:
#Inspect
test_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"You'll be alright Kevin. They're gonna cut you out. It's going to be alright. Come on, just talk to me. Just, just talk to me. Call me names- let me know you're still here with me.",FEMALE
1,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,...I've got pins and needles in my fingers.,MALE
2,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,That's the cold.,FEMALE
3,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,Yeah. It's all shutting down.,MALE
4,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,It's just the cold.,FEMALE


In [66]:
# Split into training and test data for heldout validation with random samples of 9:1 train/heldout split
from random import shuffle, seed

seed(0) # set a seed for reproducibility so same split is used each time

epsiode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = epsiode_scene_column
episode_scenes = sorted(list(set([x for x in epsiode_scene_column.values]))) # set function is random, need to sort!

shuffle(episode_scenes)

print(len(episode_scenes))
episode_split = int(0.9*len(episode_scenes))
training_ep_scenes = episode_scenes[:episode_split]
test_ep_scenes = episode_scenes[episode_split:]
print(len(training_ep_scenes), len(test_ep_scenes))

def train_or_heldout_eps(val):
    if val in training_ep_scenes:
        return "training"
    return "heldout"

all_train_data['train_heldout'] = all_train_data['episode_scene'].apply(train_or_heldout_eps)

1394
1254 140
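
The split above assigns whole scenes, rather than individual lines, to either the training or the heldout set, so lines from one scene never appear on both sides. A minimal sketch of the same idea on toy data is given below. The function `split_by_group` and the scene ids are hypothetical; it uses a local `Random` instance and a set for membership tests, so it will not reproduce the exact split produced by the cell above:

```python
from random import Random


def split_by_group(group_ids, train_fraction=0.9, seed=0):
    """Assign whole groups (e.g. episode-scene ids) to 'training' or 'heldout'."""
    unique_ids = sorted(set(group_ids))   # sort first: set order is not stable
    Random(seed).shuffle(unique_ids)      # local RNG, so global random state is untouched
    cut = int(train_fraction * len(unique_ids))
    training_ids = set(unique_ids[:cut])  # set membership is O(1) per row
    return ["training" if g in training_ids else "heldout" for g in group_ids]


# made-up episode-scene ids, one per line of dialogue
rows = ["1350-1", "1350-1", "1350-2", "1350-2", "1350-3", "1350-4", "1350-4", "1350-5"]
assignment = split_by_group(rows, train_fraction=0.8)
print(list(zip(rows, assignment)))
```

Looking ids up in a set instead of a list keeps the per-row check cheap when there are many scenes.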


In [67]:
print('Raw Data: ',np.shape(all_train_data))
train_data = all_train_data[all_train_data['train_heldout']=='training']
val_data = all_train_data[all_train_data['train_heldout']=='heldout']
print('Train set: ',np.shape(train_data))
print('Validation set: ',np.shape(val_data))

Raw Data:  (15319, 8)
Train set:  (13638, 8)
Validation set:  (1681, 8)


In [68]:
train_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene,train_heldout
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE,1350-1,training
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE,1350-1,training
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE,1350-2,training
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE,1350-2,training
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE,1350-2,training


In [69]:
val_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene,train_heldout
56,1350,18,BRANNINGS' HOUSE INT NIGHT,TANYA,Who's that?,FEMALE,1350-18,heldout
57,1350,18,BRANNINGS' HOUSE INT NIGHT,MAX,Tanya! Tanya it's me. I wanna give Oscar a New Year kiss. Tan?,MALE,1350-18,heldout
58,1350,18,BRANNINGS' HOUSE INT NIGHT,SEAN,Right. I'll have a word.,MALE,1350-18,heldout
59,1350,18,BRANNINGS' HOUSE INT NIGHT,TANYA,"No no no leave it, leave it. He ain't got a key. He'll go away. Do you want some crisps?",FEMALE,1350-18,heldout
60,1350,18,BRANNINGS' HOUSE INT NIGHT,MAX,"Who, who you got in there, Tan? Tan, I've got a right to know who my kids are being exposed to. I bet you ain't got anyone have ya? You're just sitting there with a bottle of wine all on your Jack, Tan?",MALE,1350-18,heldout


In [70]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
    their lines joined together as a single string, with end of line _EOL_
    markers between them.
    
    ::max_line_count:: the maximum number of lines to be added per character
    """
    character_docs = {}
    character_line_count = {}
    character_gender = {}
    for line, name, gender in zip(df.Line, df.Character_name, df.Gender):
        
        if not name in character_docs.keys():
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name]==max_line_count:
            continue
        character_docs[name] += str(line)   + " _EOL_ "
        
        character_line_count[name]+=1
        
    #obtain name gender mapping info for Q2
    for name, gender in zip(df.Character_name, df.Gender):
        if name != "OTHER": #as other has diff gender, do not extract its gender
            character_gender[name] =str(gender)
                
    print("name-gender mapping: ", character_gender)
    print("\nlines per character", character_line_count)
    return character_docs

In [71]:
# print out the number of words each character has in the training set
# only use the first 360 lines of each character
train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=360)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

name-gender mapping:  {'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'SHIRLEY': 360, 'OTHER': 360, 'JACK': 360, 'RONNIE': 360, 'TANYA': 360, 'SEAN': 360, 'ROXY': 360, 'HEATHER': 360, 'MAX': 360, 'IAN': 360, 'JANE': 360, 'STACEY': 360, 'PHIL': 360, 'MINTY': 360, 'CHRISTIAN': 342, 'CLARE': 352}
Num. Characters:  16 

SHIRLEY Number of Words:  3848
OTHER Number of Words:  3244
JACK Number of Words:  4435
RONNIE Number of Words:  3442
TANYA Number of Words:  3786
SEAN Number of Words:  3637
ROXY Number of Words:  3838
HEATHER Number of Words:  4098
MAX Number of Words:  4363
IAN Number of Words:  4332
JANE Number of Words:  3648
STACEY Number of Words:  3913
PHIL Number of Words:  3635
MINTY Number of Words:  4005
CHRISTIAN Number of Words:  3738

In [72]:
#Task2 version3: updated tokenizer that also returns POS tags
#what it does to the text: lower case, POS tag generation,
#punctuation removal, stop word removal, word stemming

#note: no sentiment score is generated inside this version of the function

# import and download the libraries needed for tokenizing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
ps = PorterStemmer()



def pre_process1(character_text):


    
    wordlist = []
    poslist= []

    #change to lower case
    character_text = character_text.lower()
        
    #remove '_eol_'
    #character_text = character_text.replace("_eol_ ","")
    
    #generate pos tags
    tokens = word_tokenize(character_text)
    wordpos = nltk.pos_tag(tokens)
    
    #remove punctuation (only tokens that are a single punctuation character)
    symbols = set("!#$%&()*@+-.,/:;<=>?[\\]^`{|}~\n")
    wordpos = [(word, pos) for word, pos in wordpos if word not in symbols]

    #remove stop words
    stops = set(stopwords.words('english'))
    wordpos = [(word, pos) for word, pos in wordpos if word not in stops]

    for i in range(len(wordpos)):
        word = wordpos[i][0]
        pos = wordpos[i][1]
        wordlist.append(word)
        poslist.append(pos)
 

    #stemming
    wordlist2 = []
    for w in wordlist:
        wordlist2.append(ps.stem(w))
    

    return wordlist2, poslist

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [74]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [75]:
# create list of pairs of (character name, pre-processed pos tags)
training_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(train_character_docs.items())]

In [77]:
#training_corpus

For Task 4, the function `to_feature_vector_dictionary3()` below has been changed. Because `TfidfVectorizer()`, which is used later, tokenizes and counts terms itself, `to_feature_vector_dictionary3()` now simply joins all the pre-processed tokens with whitespace.

In [76]:
#modified v4 for Task 4 TF-IDF
# join all the tokens into one string; no counting is done here
import nltk
nltk.download('averaged_perceptron_tagger')

def to_feature_vector_dictionary3(character_doc):

    #method1: simply join all the words
    new_doc = " ".join(character_doc)
    
    #method2: change word to let it stick with the next word in form word_word+1
    """pbigramlist = []
    for i in range(len(character_doc)):
        
            if len(character_doc)-i>1:
                pbiword = character_doc[i] + '_' + character_doc[i+1]
                pbigramlist.append(pbiword)
    new_doc = " ".join(pbigramlist)"""

    return new_doc

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The function `create_document_matrix_from_corpus3()` below has been changed to apply the TF-IDF transformation.

In [77]:
#updated version4 with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
#corpusVectorizer = DictVectorizer()   # corpusVectorizer which will just produce sparse vectors from feature dicts
tdidfVectorizer = TfidfVectorizer(sublinear_tf=True) # TF-IDF vectorizer is initialized here; sublinear_tf=True replaces tf with 1 + log(tf)

def create_document_matrix_from_corpus3(corpus,fitting=False):
    """Joins each document's tokens into a string and returns its TF-IDF matrix.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit the vectorizer (should be true on training data only)
    """
    list_doc = [] # the documents as plain strings, the input format TfidfVectorizer expects
    for name, doc in corpus:
        new_doc = to_feature_vector_dictionary3(doc)
        list_doc.append(new_doc)
    if fitting:
        doc_feature_matrix = tdidfVectorizer.fit_transform(list_doc)
    else:
        doc_feature_matrix = tdidfVectorizer.transform(list_doc)
    
    return doc_feature_matrix    
    
training_feature_matrix = create_document_matrix_from_corpus3(training_corpus,fitting=True) 
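
To show what the TF-IDF weighting does to the counts, here is a standard-library sketch of the calculation. It follows the formulas that, to my understanding, scikit-learn applies with `sublinear_tf=True` and its default smoothing and L2 normalisation; the function `tfidf_vectors` and the toy documents are hypothetical, and the sketch is illustrative rather than the library's implementation (it also skips the vectorizer's own tokenization):

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sublinear, smoothed, L2-normalised TF-IDF for a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # sublinear tf: 1 + ln(count) for terms that occur, 0 otherwise
        weights = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        length = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / length if length else 0.0 for w in weights])
    return vocab, vectors


# made-up pre-processed documents
docs = [["ya", "ya", "mate"], ["ya", "love"], ["love", "love", "mate"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

Terms that occur in fewer documents get a larger idf, so words that are characteristic of one character carry more weight than words everybody uses, while the logarithm dampens the effect of a word being repeated many times by the same character.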
    

In [78]:
training_feature_matrix

<16x3172 sparse matrix of type '<class 'numpy.float64'>'
	with 10215 stored elements in Compressed Sparse Row format>

In [79]:
# get the validation data- only 40 lines used for each character
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split()))
    total_words += len(val_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]
# create list of pairs of (character name, POS tags) 
val_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(val_character_docs.items())]
# create list of pairs of (character name, sentiment score) 
#val_senti = [(name, pre_process1(doc)[2]) for name, doc in sorted(val_character_docs.items())]

name-gender mapping:  {'TANYA': 'FEMALE', 'MAX': 'MALE', 'SEAN': 'MALE', 'SHIRLEY': 'FEMALE', 'STACEY': 'FEMALE', 'RONNIE': 'FEMALE', 'JACK': 'MALE', 'PHIL': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'TANYA': 40, 'MAX': 40, 'SEAN': 35, 'SHIRLEY': 40, 'OTHER': 40, 'STACEY': 40, 'RONNIE': 40, 'JACK': 40, 'PHIL': 40, 'IAN': 40, 'JANE': 40, 'ROXY': 40, 'HEATHER': 40, 'MINTY': 40, 'CHRISTIAN': 40, 'CLARE': 40}
Num. Characters:  16 

TANYA Num of Words:  438
MAX Num of Words:  737
SEAN Num of Words:  366
SHIRLEY Num of Words:  329
OTHER Num of Words:  357
STACEY Num of Words:  412
RONNIE Num of Words:  464
JACK Num of Words:  351
PHIL Num of Words:  475
IAN Num of Words:  508
JANE Num of Words:  458
ROXY Num of Words:  392
HEATHER Num of Words:  411
MINTY Num of Words:  470
CHRISTIAN Num of Words:  489
CLARE Num of Words:  405
total words 7062


In [80]:
# Just transform the val_feature_matrix, don't fit
val_feature_matrix = create_document_matrix_from_corpus3(val_corpus,fitting=False)

In [81]:
val_feature_matrix

<16x3172 sparse matrix of type '<class 'numpy.float64'>'
	with 1991 stored elements in Compressed Sparse Row format>

In [82]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity

In [83]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes an information retrieval based on training data feature matrix and test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a numpy matrix N x M shape where N = number of characters M = number of features
    ::test_feature_matrix::  a numpy matrix N x M shape where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare this test document against every training document
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        # print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dataframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)

In [84]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

### Evaluation for Task 4

Experiment 1: TfidfVectorizer with all default parameters gives a mean rank of 1.75. (Please note: this is a historical experiment, so re-running the code will not produce the same result.)

In [88]:
#please do not re-run this code, as this is a historical experiment
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

mean rank 1.75
mean cosine similarity 0.8715455591225224
12 correct out of 16 / accuracy: 0.75


Experiment 2: TfidfVectorizer with sublinear_tf=True gives a mean rank of 1.1875:

In [82]:
# sublinear_tf=True: term frequency is scaled logarithmically, i.e. tf is replaced by 1 + log(tf)
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)


mean rank 1.1875
mean cosine similarity 0.32108534404951544
14 correct out of 16 / accuracy: 0.875
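
As a rough illustration of why `sublinear_tf=True` changes the scores, the sketch below applies the same 1 + log(tf) scaling to raw counts by hand. The helper name `sublinear_tf` and the toy token list are made up for illustration; they are not part of the notebook's pipeline.

```python
import math
from collections import Counter

def sublinear_tf(tokens):
    """Map each token to 1 + ln(count), the scaling applied to term frequency."""
    counts = Counter(tokens)
    return {term: 1.0 + math.log(count) for term, count in counts.items()}

# Hypothetical document: one very frequent word and two rare ones
toy_tokens = ["yeah"] * 20 + ["mate"] * 2 + ["pub"]
raw_counts = Counter(toy_tokens)
scaled = sublinear_tf(toy_tokens)

for term in raw_counts:
    print(term, raw_counts[term], round(scaled[term], 3))
```

A word repeated 20 times ends up with a weight of about 4 rather than 20, so very frequent words dominate the cosine similarity less.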


# Task 5. Select and test the best vector representation method

### Answer: To avoid confusion, please run the code from the beginning of this section:
In Task 5, I used the best combination of techniques from Tasks 1-3, training on all of the training data and testing on the whole test file. The final score is a mean rank of 1.4375.

In [83]:
import string
import re
import numpy as np
from numpy.linalg import norm
import pandas as pd
from collections import Counter, OrderedDict

import seaborn as sns
import matplotlib.pyplot as plt

import nltk

from sklearn.feature_extraction import DictVectorizer

%matplotlib inline
pd.options.display.max_colwidth=500

In [84]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)
test_path ='test.csv'
test_data = pd.read_csv(test_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE


In [85]:
test_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"You'll be alright Kevin. They're gonna cut you out. It's going to be alright. Come on, just talk to me. Just, just talk to me. Call me names- let me know you're still here with me.",FEMALE
1,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,...I've got pins and needles in my fingers.,MALE
2,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,That's the cold.,FEMALE
3,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,Yeah. It's all shutting down.,MALE
4,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,It's just the cold.,FEMALE


In [90]:
#add episode_scene column for train data
epsiode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = epsiode_scene_column
episode_scenes = sorted(set(epsiode_scene_column.values)) # a set is unordered, so sort for a deterministic order
all_train_data

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE,1350-1
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE,1350-1
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE,1350-2
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE,1350-2
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE,1350-2
...,...,...,...,...,...,...,...
15314,1399,55,SQUARE EXT DAY LIGHT,OTHER,"Dad? Okay ... alright, just one drink alright. But that's all. It doesn't mean anything. It's just a drink.",MALE,1399-55
15315,1399,55,SQUARE EXT DAY LIGHT,MAX,Thanks Bradley. Thanks mate... It means the world to me...,MALE,1399-55
15316,1399,55,SQUARE EXT DAY LIGHT,OTHER,You alright...,MALE,1399-55
15317,1399,55,SQUARE EXT DAY LIGHT,MAX,"Yeah, yeah, yeah. I'm fine.",MALE,1399-55


In [87]:
#add episode_scene column for test data
epsiode_scene_column = test_data.Episode.astype(str) + "-" + test_data.Scene.astype(str)
test_data['episode_scene'] = epsiode_scene_column
episode_scenes = sorted(set(epsiode_scene_column.values)) # a set is unordered, so sort for a deterministic order
test_data.head()

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender,episode_scene
0,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"You'll be alright Kevin. They're gonna cut you out. It's going to be alright. Come on, just talk to me. Just, just talk to me. Call me names- let me know you're still here with me.",FEMALE,1350-7
1,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,...I've got pins and needles in my fingers.,MALE,1350-7
2,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,That's the cold.,FEMALE,1350-7
3,1350,7,DESERTED CAR PARK EXT NIGHT,OTHER,Yeah. It's all shutting down.,MALE,1350-7
4,1350,7,DESERTED CAR PARK EXT NIGHT,SHIRLEY,It's just the cold.,FEMALE,1350-7


In [91]:
print('Train set: ',np.shape(all_train_data))
print('Test set: ',np.shape(test_data))

Train set:  (15319, 7)
Test set:  (1767, 7)


In [92]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
    their lines joined together as a single string, with end of line _EOL_
    markers between them.
    
    ::max_line_count:: the maximum number of lines to be added per character
    """
    character_docs = {}
    character_line_count = {}
    character_gender = {}
    for line, name in zip(df.Line, df.Character_name):
        
        if name not in character_docs:
            character_docs[name] = ""
            character_line_count[name] = 0
        # skip any further lines once the character has reached the cap
        if character_line_count[name]==max_line_count:
            continue
        character_docs[name] += str(line)   + " _EOL_ "
        
        character_line_count[name]+=1
        
    #obtain name gender mapping info for Q2
    for name, gender in zip(df.Character_name, df.Gender):
        if name != "OTHER": # OTHER covers speakers of different genders, so do not extract its gender
            character_gender[name] =str(gender)
                
    print("name-gender mapping: ", character_gender)
    print("\nlines per character", character_line_count)
    return character_docs

In [93]:
# print out the number of words each character has in the training set
# only use the first 400 lines of each character
train_character_docs = create_character_document_from_dataframe(all_train_data, max_line_count=400)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

name-gender mapping:  {'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'SHIRLEY': 400, 'OTHER': 400, 'JACK': 400, 'RONNIE': 400, 'TANYA': 400, 'SEAN': 400, 'ROXY': 400, 'HEATHER': 400, 'MAX': 400, 'IAN': 400, 'JANE': 400, 'STACEY': 400, 'PHIL': 400, 'MINTY': 400, 'CHRISTIAN': 385, 'CLARE': 400}
Num. Characters:  16 

SHIRLEY Number of Words:  4233
OTHER Number of Words:  3606
JACK Number of Words:  4786
RONNIE Number of Words:  3872
TANYA Number of Words:  4214
SEAN Number of Words:  4026
ROXY Number of Words:  4200
HEATHER Number of Words:  4504
MAX Number of Words:  5107
IAN Number of Words:  4863
JANE Number of Words:  4117
STACEY Number of Words:  4325
PHIL Number of Words:  4103
MINTY Number of Words:  4391
CHRISTIAN Number of Words:  4250

In [94]:
#Q2 version3 updated tokenize with POS tags and sentiment score generated
#what it does to the text: sentiment score generated, pos tag generated, lower case, 
#remove punctuation, remove stop words, word stemming

#difference made from Q1: sentiment score added

# import and download the libraries needed for tokenizing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

ps = PorterStemmer()



def pre_process1(character_text):


    
    wordlist = []
    poslist= []

    #change to lower case
    character_text = character_text.lower()
        
    #remove '_eol_'
    #character_text = character_text.replace("_eol_ ","")
    
    #generate pos tags
    tokens = word_tokenize(character_text)
    wordpos = nltk.pos_tag(tokens)
    
    #remove punctuation tokens
    symbols = set("!#$%&()*@+-.,/:;<=>?[\\]^`{|}~\n")
    wordpos = [(word, pos) for word, pos in wordpos if word not in symbols]

    #remove stop words
    stops = set(stopwords.words('english'))
    wordpos = [(word, pos) for word, pos in wordpos if word not in stops]

    #split the (word, tag) pairs into two parallel lists
    wordlist = [word for word, pos in wordpos]
    poslist = [pos for word, pos in wordpos]

    #stemming
    wordlist2 = [ps.stem(w) for w in wordlist]

    return wordlist2, poslist

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [95]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [96]:
# create list of pairs of (character name, pre-processed pos tags)
training_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(train_character_docs.items())]

In [None]:
#training_corpus

In [97]:
#modified v3

import nltk
nltk.download('averaged_perceptron_tagger')
!pip3 install afinn
from afinn import Afinn

def to_feature_vector_dictionary2(character_doc, tag, name):

    wordlist1 = [] #list store previous bigram

    senti_list = [] #list store sentiment scores for each token


    #counts = Counter(character_doc)  # for now a simple count
    #counts = dict(counts)
    counts = {}
    
    #add POS tag features
    #pos = Counter(tag)
    #pos = dict(pos)
    #counts.update(pos)
    
    #sentiment score analysis for each word
    afinn = Afinn()
    for j in character_doc:
        word_senti =  str(afinn.score(j))
        senti_list.append(word_senti)
    
    #experiment1 target word and its tag
    '''for i in range(len(character_doc)):
        word = character_doc[i] +'@'+ tag[i]
        wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment2 target word and one prev word&tag
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'@'+ tag[i-1]+'_'+character_doc[i]+'@'+tag[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment 3 target word and one after word&tag
    '''for i in range(len(character_doc)):        
            if len(character_doc)-i>1:
                word = character_doc[i] +'@'+tag[i]+'_' + character_doc[i+1]+'@'+tag[i+1]    
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment4 target and one after word
    '''for i in range(len(character_doc)):        
            if len(character_doc)-i>1:
                word = character_doc[i] +'_' + character_doc[i+1]    
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment 5 target and one prev word
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'_'+character_doc[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
    
    #experiment 6 target and one prev word and sentiment score
    '''for i in range(len(character_doc)):
        if i>0:
            word = character_doc[i-1] +'@'+ senti_list[i-1]+'_'+character_doc[i]+'@'+senti_list[i]
            wordlist1.append(word)

    word_counts = Counter(wordlist1)
    word_counts = dict(word_counts)
    for a in word_counts:
        counts[a] = 0
        counts[a] +=1'''
        
    #experiment 7 target word, tag and separated counter for gender i.e add another key and value - gender:1 for male
    #(the gender feature itself is commented out below)
    for word, word_tag in zip(character_doc, tag):
        wordlist1.append(word + '@' + word_tag)

    # each distinct word@tag feature is recorded as present (value 1),
    # regardless of how often it occurs in the document
    for a in Counter(wordlist1):
        counts[a] = 1

    
    #add gender feature, male=1, female=0
    #gender = 0
    #genders ={'SHIRLEY': 'FEMALE', 'JACK': 'MALE', 'RONNIE': 'FEMALE', 'TANYA': 'FEMALE', 'SEAN': 'MALE', 'ROXY': 'FEMALE', 'HEATHER': 'FEMALE', 'MAX': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'PHIL': 'MALE', 'MINTY': 'MALE', 'CHRISTIAN': 'MALE', 'CLARE': 'FEMALE'}

    #for item in genders:
        #if name == item:
            #gender = genders[name]
            #if gender == "MALE":
                #gender =1
            #else: gender =0
    #counts.update({"Gender": gender})
    
    return counts

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dandi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
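
To make the active feature scheme (experiment 7) concrete, here is a minimal standalone sketch of how word@tag presence features are built from parallel token and tag lists. The function name `word_tag_features` and the example tokens and tags are hypothetical and chosen only for illustration.

```python
def word_tag_features(tokens, tags):
    """Return a dict with one binary presence feature per distinct word@tag pair."""
    return {f"{word}@{tag}": 1 for word, tag in zip(tokens, tags)}

# Made-up stems and POS tags; "talk" appears once as a verb and once as a noun
tokens = ["talk", "talk", "cold"]
tags = ["VB", "NN", "JJ"]

features = word_tag_features(tokens, tags)
print(features)
```

Because the tag is part of the feature name, the same stem used with two different tags yields two separate features.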




In [98]:
#updated version3
# changes made from original version: tag and name are passed to
#to_feature_vector_dictionary2()

corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts

def create_document_matrix_from_corpus2(corpus,tag, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::tag:: a list of (class_label, POS tags) pairs, in the same order as corpus.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary2(a[1],b[1],a[0]) for a, b in zip(corpus,tag)])

    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary2(a[1],b[1],a[0]) for a, b, in zip(corpus,tag)])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus2(training_corpus, training_POS, fitting=True)

In [99]:
training_feature_matrix

<16x5795 sparse matrix of type '<class 'numpy.float64'>'
	with 15005 stored elements in Compressed Sparse Row format>

In [100]:
# get the test data- only 40 lines used for each character
test_character_docs = create_character_document_from_dataframe(test_data, max_line_count=40)
print('Num. Characters: ',len(test_character_docs.keys()),"\n")
total_words = 0
for name in test_character_docs.keys():
    print(name, 'Num of Words: ',len(test_character_docs[name].split()))
    total_words += len(test_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
test_corpus = [(name, pre_process1(doc)[0]) for name, doc in sorted(test_character_docs.items())]
test_labels = [name for name, doc in test_corpus]
# create list of pairs of (character name, POS tags) 
test_POS = [(name, pre_process1(doc)[1]) for name, doc in sorted(test_character_docs.items())]


name-gender mapping:  {'SHIRLEY': 'FEMALE', 'HEATHER': 'FEMALE', 'PHIL': 'MALE', 'SEAN': 'MALE', 'TANYA': 'FEMALE', 'MAX': 'MALE', 'JACK': 'MALE', 'IAN': 'MALE', 'JANE': 'FEMALE', 'STACEY': 'FEMALE', 'ROXY': 'FEMALE', 'RONNIE': 'FEMALE', 'CHRISTIAN': 'MALE', 'MINTY': 'MALE', 'CLARE': 'FEMALE'}

lines per character {'SHIRLEY': 40, 'OTHER': 40, 'HEATHER': 40, 'PHIL': 40, 'SEAN': 40, 'TANYA': 40, 'MAX': 40, 'JACK': 40, 'IAN': 40, 'JANE': 40, 'STACEY': 40, 'ROXY': 40, 'RONNIE': 40, 'CHRISTIAN': 40, 'MINTY': 40, 'CLARE': 40}
Num. Characters:  16 

SHIRLEY Num of Words:  373
OTHER Num of Words:  453
HEATHER Num of Words:  451
PHIL Num of Words:  406
SEAN Num of Words:  466
TANYA Num of Words:  465
MAX Num of Words:  494
JACK Num of Words:  412
IAN Num of Words:  509
JANE Num of Words:  414
STACEY Num of Words:  634
ROXY Num of Words:  392
RONNIE Num of Words:  390
CHRISTIAN Num of Words:  629
MINTY Num of Words:  428
CLARE Num of Words:  368
total words 7284


In [102]:
# Just transform the test_feature_matrix, don't fit
test_feature_matrix = create_document_matrix_from_corpus2(test_corpus,test_POS,fitting=False)

In [103]:
test_feature_matrix

<16x5795 sparse matrix of type '<class 'numpy.float64'>'
	with 2343 stored elements in Compressed Sparse Row format>

In [104]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity
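
One caveat with the function above: if either vector is all zeros (for example, a test document left with no non-zero features), the denominator is zero and the result is `nan`. A minimal sketch of a guarded variant follows; the name `safe_cosine_similarity` and the choice to return 0.0 in that case are assumptions, not part of the original notebook.

```python
import numpy as np
from numpy.linalg import norm

def safe_cosine_similarity(v1, v2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    denominator = norm(v1) * norm(v2)
    if denominator == 0:
        # an all-zero vector has no direction, so treat it as dissimilar to everything
        return 0.0
    return float(np.dot(v1, v2) / denominator)

same = safe_cosine_similarity(np.array([0.0, 0.5, 0.5]), np.array([0.0, 0.5, 0.5]))
empty = safe_cosine_similarity(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.5]))
print(same, empty)
```

Returning 0.0 keeps the mean cosine similarity finite, at the cost of treating an empty document as equally dissimilar to every character.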

In [104]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes an information retrieval based on training data feature matrix and test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a numpy matrix N x M shape where N = number of characters M = number of features
    ::test_feature_matrix::  a numpy matrix N x M shape where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare this test document against every training document
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        # print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dataframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)

In [105]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

In [105]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, test_feature_matrix, train_labels, test_labels)

mean rank 2.25
mean cosine similarity 0.28408187907385607
9 correct out of 16 / accuracy: 0.5625
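
To illustrate how the rank-based score is derived, the toy sketch below ranks three made-up training documents by similarity to one query and reads off the rank of the true character. The similarity values and the helper `rank_of_target` are invented for illustration only.

```python
def rank_of_target(similarities, target):
    """Rank (1 = most similar) of `target` among documents sorted by descending similarity."""
    ordered = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    ranking = {name: rank for rank, (name, _) in enumerate(ordered, 1)}
    return ranking[target]

# Hypothetical similarities of one test document to three training documents
toy_similarities = {"JACK": 0.31, "MAX": 0.42, "TANYA": 0.18}

ranks = [rank_of_target(toy_similarities, name) for name in ("MAX", "JACK", "TANYA")]
print(ranks)
```

Averaging these ranks over all test documents gives the mean rank, and accuracy is the fraction of test documents whose rank is exactly 1.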


In [106]:
#for loop to iterate over different k to find the best number of features
from sklearn.feature_selection import SelectKBest, chi2
for i in range(50, training_feature_matrix.shape[1], 50):
    selector = SelectKBest(chi2, k=i)
    # fit the selector on the training data only, then apply it to both matrices
    selector.fit(training_feature_matrix, train_labels)
    features_train_transformed = selector.transform(training_feature_matrix)
    features_test_transformed = selector.transform(test_feature_matrix)
    print('k= ',i, 'mean rank: ', compute_IR_evaluation_scores(features_train_transformed, features_test_transformed, train_labels, test_labels)[0] )
       

        


mean rank 8.5
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  50 mean rank:  8.5
mean rank 8.5
mean cosine similarity nan
1 correct out of 16 / accuracy: 0.0625
k=  100 mean rank:  8.5
mean rank 8.5
mean cosine similarity nan
2 correct out of 16 / accuracy: 0.125
k=  150 mean rank:  8.5
mean rank 8.625
mean cosine similarity nan
2 correct out of 16 / accuracy: 0.125
k=  200 mean rank:  8.625
mean rank 8.6875
mean cosine similarity nan
2 correct out of 16 / accuracy: 0.125
k=  250 mean rank:  8.6875
mean rank 8.375
mean cosine similarity nan
3 correct out of 16 / accuracy: 0.1875
k=  300 mean rank:  8.375
mean rank 8.25
mean cosine similarity nan
3 correct out of 16 / accuracy: 0.1875
k=  350 mean rank:  8.25
mean rank 8.375
mean cosine similarity nan
2 correct out of 16 / accuracy: 0.125
k=  400 mean rank:  8.375
mean rank 7.5625
mean cosine similarity nan
2 correct out of 16 / accuracy: 0.125
k=  450 mean rank:  7.5625
mean rank 7.5625
mean cosine similarity nan




mean rank 5.875
mean cosine similarity 0.04514847783991911
6 correct out of 16 / accuracy: 0.375
k=  1400 mean rank:  5.875
mean rank 5.875
mean cosine similarity 0.04763525813617664
7 correct out of 16 / accuracy: 0.4375
k=  1450 mean rank:  5.875
mean rank 5.9375
mean cosine similarity 0.0467600818712719
6 correct out of 16 / accuracy: 0.375
k=  1500 mean rank:  5.9375
mean rank 5.9375
mean cosine similarity 0.046026775745138
6 correct out of 16 / accuracy: 0.375
k=  1550 mean rank:  5.9375
mean rank 5.9375
mean cosine similarity 0.04696019649708574
6 correct out of 16 / accuracy: 0.375
k=  1600 mean rank:  5.9375
mean rank 5.3125
mean cosine similarity 0.051014347538185845
7 correct out of 16 / accuracy: 0.4375
k=  1650 mean rank:  5.3125
mean rank 5.1875
mean cosine similarity 0.052231439140724395
7 correct out of 16 / accuracy: 0.4375
k=  1700 mean rank:  5.1875
mean rank 5.125
mean cosine similarity 0.05334230377037211
8 correct out of 16 / accuracy: 0.5
k=  1750 mean rank:  5.12

# Final evaluation: mean rank 1.4375

In [107]:
#the best solution
from sklearn.feature_selection import SelectKBest, chi2
selector =  SelectKBest(chi2,k=5250)
selector.fit(training_feature_matrix, train_labels)  # fit on the training data only
features_train_transformed = selector.transform(training_feature_matrix)
features_test_transformed = selector.transform(test_feature_matrix)
mean_rank, mean_cosine_simliarity, acc, df0 = compute_IR_evaluation_scores(features_train_transformed, features_test_transformed, train_labels, test_labels)

mean rank 1.4375
mean cosine similarity 0.09966820928178358
11 correct out of 16 / accuracy: 0.6875
