# Assignment 5: Grounded, Lexical Semantics

## Natural Language Processing - Boise State University

### Instructions

* Attached to the corresponding Trello card for this assignment are the files `features.txt` and `segmented-labeled.txt` which have data for a reference resolution task. I have already done a lot of the data munging for you. The rdg_munging.html / rdg_munging.ipynb notebook shows how I did that. At the very end I saved two data frames as two pickles named `scenedata.pkl` and `refexpdata.pkl`. You will use these two files. 
* You are to use the `scenedata.pkl` and `refexpdata.pkl` files to train logistic regression classifiers that take low-level object ('visual') data as features and produce a probability that an object matches a word's classifier. 

**scenedata**: scenes are separated by `episodeid`. For each `episodeid`, there are 8 images, each with an `imageid`. Each image contains between 1 and 7 pieces, each with a `pieceid`, depending on the scene type. However, for this assignment we only care about cases where there is exactly 1 object in each image.

Below is an example Scene where each image has two pieces (see http://www.sigdial.org/workshops/conference17/proceedings/pdf/SIGDIAL30.pdf for more information):

![title](rdg_scene_example.png) 

Using these kinds of scenes, the task was for the *Director*, who knew which object needed to be selected, to tell the *Matcher* which object that was. The *Director*'s game screen showed the same images, but usually in a different order, forcing the *Director* to describe the objects in an image rather than the image's placement on the grid (e.g., a *Director* couldn't just say something like "first row, second column" to indicate an image).

The goal of this assignment is to use the data to train logistic regression classifiers for each word in the corpus and evaluate how well they can be used for resolving references to visual objects. **Note** that the goal is to resolve references to individual objects, not individual images (i.e., images can have more than one object in them). 
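
One compact way to write down what the classifiers are used for (the notation here is my own shorthand, not something defined in the data files): each word $w$ gets a logistic regression classifier with weights $\theta_w$ and bias $b_w$, which maps the feature vector $x_o$ of an object $o$ to a probability. Following the testing procedure described under Procedure and Hints, a referring expression $w_1, \dots, w_n$ is resolved to the object in the scene $O$ with the highest summed probability:

$$
p_w(x_o) = \sigma(\theta_w^\top x_o + b_w) = \frac{1}{1 + e^{-(\theta_w^\top x_o + b_w)}}
$$

$$
o^* = \arg\max_{o \in O} \sum_{i=1}^{n} p_{w_i}(x_o)
$$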

First, load the data and get an idea what it is:

The dataframe `scenes` is like a database that has the features of each object in each image for each episode. 

The dataframe `refs` has the referring expressions, where each `id` represents an individual referring expression (i.e., grouping by `id` groups all the words in a referring expression). The `episodeid`, `imageid`, and `targetid` denote the episode, the image within the episode, and the target object in the image that the expression refers to. Note that for all rows grouped by an `id`, the `id`, `episodeid`, `imageid`, and `targetid` are the same; the only thing that differs is the word in the `word` column. The words are ordered by row. (See the `refs[refs.id == 4]` example below.)

Note that the `targetid` (stored in the `target` column of `refs`) is the `pieceid` of the referred object in a particular `episodeid`/`imageid`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
scenes = pd.read_pickle('scenedata.pkl')
refs = pd.read_pickle('refexpdata.pkl')

refs['type'] = refs.episodeid.map(lambda x: x.split('/')[0])
refs = refs[refs.type == 'Set0'] # we only use images where there is only one object in the image

In [10]:
scenes[:10]

Unnamed: 0,pieceid,imageid,episodeid,r,g,b,h,s,v,orientation,num_edges,pos_x,pos_y,h_skew_left-skewed,h_skew_right-skewed,h_skew_symmetric,v_skew_bottom-skewed,v_skew_symmetric,v_skew_top-skewed,c_diff
0,0,1,Set0/1,86.480225,57.164215,46.304261,8.293657,127.795376,86.661635,5.742743,8,199,164,0,1,0,1,0,0,257.870122
1,0,2,Set0/1,79.55544,74.452909,59.535351,22.51474,74.233586,79.337073,41.51936,10,222,159,1,0,0,0,0,1,273.065926
2,0,3,Set0/1,130.428545,111.25028,86.211567,17.137593,94.26875,131.00056,-7.716261,12,203,161,0,0,1,1,0,0,259.094577
3,0,4,Set0/1,69.591751,55.848775,83.48426,135.273859,92.572226,83.479976,-21.40881,8,222,151,0,0,1,0,0,1,268.486499
4,0,5,Set0/1,36.108723,79.887808,112.033928,102.723919,177.755478,112.230646,42.677817,6,220,169,1,0,0,0,0,1,277.418456
5,0,6,Set0/1,144.449219,90.613706,23.64974,16.616909,217.554651,144.855265,-3.779378,8,213,139,0,0,1,1,0,0,254.342289
6,0,7,Set0/1,74.257687,41.812776,36.635007,19.366604,144.902781,74.258458,14.165111,6,238,154,0,0,1,0,0,1,283.478394
7,0,8,Set0/1,80.111553,87.422078,93.454739,103.250921,38.994379,93.49244,-32.055374,10,218,154,0,0,1,0,0,1,266.908224
8,0,10,Set0/2,45.385632,67.779668,101.704448,108.096946,148.533716,101.900799,-0.229313,4,194,148,0,0,1,0,1,0,244.008197
9,0,11,Set0/2,49.125512,34.05651,31.479114,16.03172,108.200859,48.973843,9.211177,6,216,174,1,0,0,0,0,1,277.366184


In [5]:
refs.columns

Index(['id', 'episodeid', 'imageid', 'target', 'word', 'type'], dtype='object')

In [6]:
refs[refs.id == 4] # show the referring expression for id=4

Unnamed: 0,id,episodeid,imageid,target,word,type
3,4,Set0/1,8,0,like,Set0
3,4,Set0/1,8,0,off,Set0
3,4,Set0/1,8,0,to,Set0
3,4,Set0/1,8,0,the,Set0
3,4,Set0/1,8,0,left,Set0
3,4,Set0/1,8,0,like,Set0
3,4,Set0/1,8,0,a,Set0
3,4,Set0/1,8,0,reverse,Set0
3,4,Set0/1,8,0,l,Set0
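
Since every row of a referring expression shares the same `id`, the full expression can be put back together by grouping. The sketch below uses a small hand-made frame rather than the real `refs`; the rows for `id=5` are made up for illustration, and `toy_refs` and `expressions` are names I introduce here.

```python
import pandas as pd

# Toy stand-in for `refs`: three words from the id=4 example above,
# plus a hypothetical second expression (id=5) for illustration.
toy_refs = pd.DataFrame({
    "id":        [4, 4, 4, 5, 5],
    "episodeid": ["Set0/1"] * 5,
    "imageid":   ["8", "8", "8", "3", "3"],
    "target":    [0, 0, 0, 0, 0],
    "word":      ["a", "reverse", "l", "blue", "line"],
})

# One row per referring expression, with its words joined in row order.
expressions = (
    toy_refs.groupby(["id", "episodeid", "imageid", "target"], sort=False)["word"]
    .agg(" ".join)
    .reset_index(name="expression")
)
print(expressions)
```

Grouping on all four columns is safe because they are constant within an `id`; grouping on `id` alone would give the same groups.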


### Procedure and Hints

* This was made easier for me using pandasql / pysqldf, but anything that can be done using pandasql/pysqldf can be done using pandas merge functions. 
* I split the data for you into train/test
* Training is tricky. You need to do the following for each word in the vocabulary:
   * Get all of the features for the objects where that word was used. These are your positive training examples. 
   * Randomly choose features for objects where that word was *not* used. These are your negative training examples. 
   * You should have the same number of negative and positive training examples
   * Use `0` to label the negative training examples and `1` to label the positive training examples. 
   * Train the logistic regression classifier using the labeled positive and negative examples (penalty='l2' helps here). 
   * I recommend using a dictionary where key=word, value=classifier
* Testing is also tricky. You need to make sure you are conducting a realistic test, which means representing your data as if you were looking at a scene: for a referring expression, you want the 8 corresponding images and all of the objects in those images. You then take the words in the referring expression, get their respective classifiers, and apply them to each of the objects in each of the images. For each object, sum the probabilities returned by the classifiers. The object with the highest score (i.e., the highest sum of probabilities) is the guessed referent. To calculate accuracy, check whether that object's `pieceid` (within the target `imageid`) matches the `targetid`; if it does, the guess counts as correct. 
    * I was able to do testing using a query that joined the test and scene data into a dataframe such that all words and all objects were represented in individual rows. 
    * I then made a new column in that dataframe that was the probability of applying the word in a row to the object features in the same row. 
    * I then used a query to sum the results over the objects (accomplished by grouping by certain columns).
    * I then used a query to find the max-scored object and compared that with the target. 
* For this assignment, your accuracy needs to be above 50%. That seems low, but even in the easiest case, with one object in each of the 8 images, the random baseline is 1/8 (12.5%).
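
The four testing steps above (join, per-row probability, sum per object, pick the max) can be sketched with plain pandas. This is only an illustration under loud assumptions: `toy_wac` holds hand-written scoring functions standing in for trained classifiers, and the toy frames carry two features with rounded values. None of these names come from the data files.

```python
import pandas as pd

# Hypothetical stand-ins for trained word classifiers: each maps a row of
# object features to a probability. Real ones would come from training.
toy_wac = {
    "blue": lambda row: 0.9 if row["b"] > 100 else 0.1,
    "line": lambda row: 0.8 if row["num_edges"] <= 4 else 0.3,
}

toy_scene = pd.DataFrame({
    "episodeid": ["Set0/5"] * 3,
    "imageid":   ["1", "10", "11"],
    "pieceid":   [0, 0, 0],
    "b":         [46.3, 101.6, 31.5],
    "num_edges": [8, 4, 6],
})
toy_ref = pd.DataFrame({
    "id":        [2048, 2048],
    "episodeid": ["Set0/5"] * 2,
    "imageid":   ["10", "10"],  # image that holds the target object
    "target":    [0, 0],
    "word":      ["blue", "line"],
})

# 1. Pair every word of the expression with every object in its episode.
pairs = toy_ref[["id", "episodeid", "word"]].merge(toy_scene, on="episodeid")

# 2. Probability of the row's word applied to the row's object features.
pairs["p"] = pairs.apply(lambda row: toy_wac[row["word"]](row), axis=1)

# 3. Sum the probabilities over words for each candidate object.
scores = pairs.groupby(["id", "imageid", "pieceid"], as_index=False)["p"].sum()

# 4. Keep the highest-scoring object per expression and compare to the target.
best = scores.loc[scores.groupby("id")["p"].idxmax()]
guess = best.iloc[0]
correct = (guess["imageid"] == toy_ref["imageid"].iloc[0]
           and guess["pieceid"] == toy_ref["target"].iloc[0])
print(guess["imageid"], bool(correct))
```

With real classifiers, step 2 would call `predict_proba` on the feature columns instead of the toy functions; the join and grouping stay the same.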

In [11]:
res = refs
data = scenes

from pandasql import sqldf

# run SQL against the dataframes defined as globals in this notebook (res, data)
pysqldf = lambda q: sqldf(q, globals())

In [12]:
#
# merge the data and res dataframes so we can get the targets' features
#
query = '''
SELECT res.word, res.id, target, res.episodeid, res.imageid, data.* 
FROM data 
INNER JOIN res
ON data.episodeid = res.episodeid
AND data.pieceid = res.target 
AND data.imageid = res.imageid
ORDER BY id, data.episodeid

'''

positive = pysqldf(query)

positive[:5]

Unnamed: 0,word,id,target,episodeid,imageid,pieceid,imageid.1,episodeid.1,r,g,...,num_edges,pos_x,pos_y,h_skew_left-skewed,h_skew_right-skewed,h_skew_symmetric,v_skew_bottom-skewed,v_skew_symmetric,v_skew_top-skewed,c_diff
0,a,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
1,l,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
2,left,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
3,like,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
4,like,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
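
For anyone who prefers not to use pandasql, the same inner join can be written with `DataFrame.merge`. The sketch below runs on two tiny hand-made frames (values rounded from the tables above; `toy_scenes`, `toy_refs`, and `positive_toy` are illustrative names), so treat it as a pattern rather than a drop-in replacement.

```python
import pandas as pd

toy_scenes = pd.DataFrame({
    "pieceid":   [0, 0],
    "imageid":   ["7", "8"],
    "episodeid": ["Set0/1", "Set0/1"],
    "r":         [74.26, 80.11],  # rounded from the scenes table
    "num_edges": [6, 10],
})
toy_refs = pd.DataFrame({
    "id":        [4, 4],
    "episodeid": ["Set0/1", "Set0/1"],
    "imageid":   ["8", "8"],
    "target":    [0, 0],
    "word":      ["reverse", "l"],
})

# Inner join on episode, image, and target == pieceid, mirroring the ON clause.
positive_toy = toy_refs.merge(
    toy_scenes,
    how="inner",
    left_on=["episodeid", "imageid", "target"],
    right_on=["episodeid", "imageid", "pieceid"],
)
print(positive_toy)
```

One difference from the SQL version: `merge` keeps a single `episodeid` and `imageid` column, while `data.*` in the query repeats them.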


In [16]:
words = list(set(positive.word)) # vocabulary

len(words)

201

## Train

In [17]:
# split into train/test

import random

num_eval = 100

eids = set(positive.id)  # ids of the referring expressions
# random.sample needs a sequence (sets are rejected as of Python 3.11)
test_eids = random.sample(sorted(eids), num_eval)
train_eids = list(eids - set(test_eids))

positive_train = positive[positive.id.isin(train_eids)]
test = positive[positive.id.isin(test_eids)]

test[:5]

Unnamed: 0,word,id,target,episodeid,imageid,pieceid,imageid.1,episodeid.1,r,g,...,num_edges,pos_x,pos_y,h_skew_left-skewed,h_skew_right-skewed,h_skew_symmetric,v_skew_bottom-skewed,v_skew_symmetric,v_skew_top-skewed,c_diff
0,a,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
1,l,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
2,left,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
3,like,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
4,like,4,0,Set0/1,8,0,8,Set0/1,80.111553,87.422078,...,10,218,154,0,0,1,0,0,1,266.908224
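
Because the split is random, accuracy will vary from run to run. If a repeatable split is wanted, one option is to sample from a sorted list of ids with a seeded generator. `split_ids` below is a hypothetical helper, not part of the assignment code, and the ids are made up.

```python
import random

def split_ids(ids, num_eval, seed=0):
    """Split expression ids into disjoint train/test lists (hypothetical helper)."""
    ordered = sorted(ids)        # sets have no fixed order; sort before sampling
    rng = random.Random(seed)    # local generator, leaves the global state alone
    test_ids = set(rng.sample(ordered, num_eval))
    train_ids = [i for i in ordered if i not in test_ids]
    return train_ids, sorted(test_ids)

train_ids, test_ids = split_ids({4, 7, 9, 12, 15, 21}, num_eval=2, seed=0)
print(train_ids, test_ids)
```

Splitting on expression `id` rather than on rows keeps all words of one referring expression on the same side of the split.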


In [36]:
from sklearn.linear_model import LogisticRegression

wac = {}  # key=word, value=classifier

# non-feature columns (the join repeats imageid and episodeid; drop removes every copy)
todrop = ['word', 'id', 'target', 'episodeid', 'imageid', 'pieceid']

for word in words:
    # positive examples: features of objects this word was used for (training split only)
    is_pos = positive_train.word == word
    sub = positive_train[is_pos].drop(todrop, axis=1)

    # negative examples: objects from expressions that never use this word
    used_ids = positive_train.id[is_pos]
    neg = positive_train[~positive_train.id.isin(used_ids)].drop(todrop, axis=1)
    if len(sub) == 0 or len(neg) == 0:
        continue  # nothing to train on for this word

    # same number of negatives as positives (sample with replacement only if needed)
    neg = neg.sample(n=len(sub), replace=len(neg) < len(sub))

    X = np.concatenate((sub.values, neg.values))
    y = [1] * len(sub) + [0] * len(neg)

    clfr = LogisticRegression(penalty='l2')
    clfr.fit(X, y)

    wac[word] = clfr



## Test

In [37]:
current_ref = refs[refs.id == test_eids[0]] # one of the test instances, note that the episodeid is the same for all rows

current_ref

Unnamed: 0,id,episodeid,imageid,target,word,type
2047,2048,Set0/5,10,0,a,Set0
2047,2048,Set0/5,10,0,blue,Set0
2047,2048,Set0/5,10,0,line,Set0


In [38]:
current_scene = scenes[scenes.episodeid == current_ref.episodeid.values[0]] # the scene associated with the referring expression for test_eids[0]

current_scene

Unnamed: 0,pieceid,imageid,episodeid,r,g,b,h,s,v,orientation,num_edges,pos_x,pos_y,h_skew_left-skewed,h_skew_right-skewed,h_skew_symmetric,v_skew_bottom-skewed,v_skew_symmetric,v_skew_top-skewed,c_diff
32,0,1,Set0/5,86.461362,57.165823,46.312255,8.261725,127.720795,86.649155,-5.767752,8,164,199,0,1,0,1,0,0,257.870122
33,0,10,Set0/5,45.399836,67.800287,101.553366,108.064142,148.334975,101.766112,0.233121,4,148,194,0,0,1,0,1,0,244.008197
34,0,11,Set0/5,49.11639,34.042341,31.458537,16.00078,108.169659,48.973854,-9.19819,6,174,216,1,0,0,0,0,1,277.366184
35,0,3,Set0/5,130.379127,111.251166,86.236057,17.130106,94.129453,130.944134,8.091026,12,161,203,0,1,0,0,1,0,259.094577
36,0,4,Set0/5,69.628902,55.864865,83.443403,135.30757,92.423856,83.460656,21.376735,8,151,222,1,0,0,0,1,0,268.486499
37,0,6,Set0/5,142.549847,89.422573,23.574949,16.836343,216.467236,142.876377,3.218008,8,142,213,0,1,0,0,1,0,255.994141
38,0,7,Set0/5,74.22591,41.852241,36.687255,18.405462,144.535784,74.184104,-14.171412,6,154,238,1,0,0,0,1,0,283.478394
39,0,8,Set0/5,80.003099,87.487751,93.362157,102.853781,39.101191,93.393338,32.005782,10,154,218,1,0,0,0,1,0,266.908224


In [41]:
ref_words = current_ref.word.values

ref_words

array(['a', 'blue', 'line'], dtype=object)
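
Scoring an object with a word classifier comes down to calling `predict_proba` and reading off the probability of the positive label. The toy example below shows which column that is. It uses a single made-up feature and synthetic data, so the numbers say nothing about the real corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical one-feature data: objects called "blue" get a high
# blue-channel value, the sampled negatives a low one.
pos = rng.normal(loc=110.0, scale=5.0, size=(20, 1))
neg = rng.normal(loc=40.0, scale=5.0, size=(20, 1))  # same count as positives

X = np.concatenate((pos, neg))
y = [1] * len(pos) + [0] * len(neg)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Columns of predict_proba follow clf.classes_, so column 1 is P(label == 1).
probs = clf.predict_proba(np.array([[105.0], [45.0]]))
p_blue_like, p_not_blue_like = probs[0][1], probs[1][1]
print(clf.classes_, p_blue_like > p_not_blue_like)
```

Each row of `predict_proba` sums to 1, which is why only the second column needs to be kept when summing scores across words.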

In [58]:
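from collections import defaultdict

todrop = ['pieceid', 'imageid', 'episodeid']

# summed classifier probabilities, keyed by imageid (each Set0 image holds a single object)
current_prob = defaultdict(float)

for word in ref_words:
    if word not in wac:
        continue  # no classifier was trained for this word
    for index, image in current_scene.iterrows():
        features = image.drop(todrop).values.astype(float)
        imageid = image['imageid']
        # column 1 of predict_proba is the probability of the positive label (1)
        p = wac[word].predict_proba(features.reshape(1, -1))[0][1]
        current_prob[imageid] += p

current_prob, current_ref.imageid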

(defaultdict(float,
             {'1': 0.7571089606055978,
              '10': 2.371988125507083,
              '11': 0.5253920593095787,
              '3': 0.8750894481626966,
              '4': 0.509981143265047,
              '6': 0.5700887757421564,
              '7': 0.5351268527625416,
              '8': 0.35432505790174745}),
 2047    10
 2047    10
 2047    10
 Name: imageid, dtype: object)
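
The remaining step is to turn these sums into a guess and, over all test expressions, into an accuracy. A minimal sketch, using the scores printed above (rounded) and a hypothetical `accuracy` helper:

```python
# Summed probabilities from the output above, rounded for brevity.
current_prob = {
    "1": 0.757, "10": 2.372, "11": 0.525, "3": 0.875,
    "4": 0.510, "6": 0.570, "7": 0.535, "8": 0.354,
}
target_imageid = "10"  # from current_ref.imageid above

# The guessed referent is the image (object) with the highest summed score.
guess = max(current_prob, key=current_prob.get)
is_correct = guess == target_imageid

def accuracy(outcomes):
    """Fraction of correct guesses; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

chance = 1 / 8  # random baseline with one object in each of 8 images
print(guess, is_correct, chance)
```

Repeating this for every id in `test_eids` and passing the list of outcomes to `accuracy` gives the number to compare against the 50% requirement.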