# From spaces to features

Today:
- **We will be working in groups of two.**
- **Choose a new partner that you haven't worked with before.**

Given similarity data, NMDS can recover psychological representations that are dimensional/spatial (i.e., points in space).

However, consider mental concepts such as "is_a_multiple_of_two". Whether or not a number is a multiple of two is a discrete **feature** of a number, and may not be well captured by a continuous psychological dimension. If some mental objects are represented via sets of discrete features, how can we infer those kinds of representations?

This also opens up the broader question of what other representational structures the mind might employ (spaces, features, others?), and how we might infer each. Today we will focus on **binary** feature representations.

In [164]:
# Run this cell first
import numpy as np
import pandas as pd
from tools import *

Below we load a representation of eight animals that consists of six binary features. Recall that psychological representations of any kind are not directly observable, but the example below will help us think about how we might infer them later.

In [165]:
df_animal_feats = load_animal_features()
df_animal_feats

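| | Mammal | Pet | Lives in Water | Has Fur | Carnivore | Domesticated |
|---|---|---|---|---|---|---|
| Dog | 1 | 1 | 0 | 1 | 1 | 1 |
| Cat | 1 | 1 | 0 | 1 | 1 | 1 |
| Horse | 1 | 0 | 0 | 1 | 0 | 1 |
| Salamander | 0 | 0 | 1 | 0 | 1 | 0 |
| Snake | 0 | 0 | 0 | 0 | 1 | 0 |
| Eagle | 0 | 0 | 0 | 0 | 1 | 0 |
| Goldfish | 0 | 1 | 1 | 0 | 0 | 1 |
| Dolphin | 1 | 0 | 1 | 0 | 1 | 0 |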


Recall that representations are related to similarity (i.e., human similarity judgments). In the case of NMDS and spatial representations, similarity values $s_{ij}$ give us hints about what representations produce them by telling us about representational distance.

In the case of discrete features, similarity values $s_{ij}$ give us hints about two things: (1) the features of objects and (2) their "importance" or "salience".

Shepard & Arabie (1979) proposed that the similarity between objects defined by features is a weighted sum of their shared features. Put simply, objects are similar to the extent that they share important features.

In particular, given $m$ features (such as in `df_animal_feats` where $m=6$), estimated similarity $\hat{s}_{ij}$ is defined as:

$\hat{s}_{ij} = \sum_{k=1}^{m} w_k f_{ik} f_{jk}$,

where:
- $f_{ik}$ is the $k^{\text{th}}$ feature for object $i$ (e.g., the binary value for "Lives in Water"),
- $f_{jk}$ is the $k^{\text{th}}$ feature for object $j$, and
- $w_k$ is a non-negative weight corresponding to the $k^{\text{th}}$ feature.

If objects $i$ and $j$ share feature $k$ (i.e., $f_{ik} = f_{jk} = 1$), then the product $f_{ik} f_{jk} = 1 \times 1 = 1$. Otherwise, $f_{ik} f_{jk} = 0$.

Note that if objects $i$ and $j$ share feature $k$, then $w_k f_{ik} f_{jk} = w_k \times 1 = w_k$.

Thus, the weighted sum is a sum of the weights for only the shared features.

Because weights $w_k$ are non-negative, the fact that a feature is shared will never decrease similarity, which makes intuitive sense.
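
As a small worked example of the weighted sum, the sketch below compares Dog and Goldfish using their feature rows from the table above. The weights are hypothetical values chosen only for illustration; they are not the weights inferred later.

```python
import numpy as np

# Feature order: Mammal, Pet, Lives in Water, Has Fur, Carnivore, Domesticated
dog = np.array([1, 1, 0, 1, 1, 1])
goldfish = np.array([0, 1, 1, 0, 0, 1])

# Hypothetical weights, chosen only for illustration
w = np.array([0.5, 0.4, 0.3, 0.2, 0.1, 0.1])

shared = dog * goldfish      # 1 only where both animals have the feature
s_hat = np.sum(w * shared)   # adds up the weights of the shared features

print(shared)  # [0 1 0 0 0 1]
print(s_hat)   # approximately 0.5 (Pet 0.4 + Domesticated 0.1)
```

Only "Pet" and "Domesticated" are shared, so only those two weights contribute to the estimate.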

**Exercise 1:**

To understand how shared features tell us something about similarity, create a function called `compute_counts` that takes in a dataframe like `df_animal_feats` and returns a similarity matrix as a dataframe where each item $(i, j)$ in the matrix is the count:

$\text{count}_{ij} = \sum_{k=1}^{m} f_{ik} f_{jk}$.

The indices and columns of the output dataframe should match the index of the input (e.g., `df_animal_feats.index`).

Set all items of the matrix where $i \le j$ to `np.nan` such that only the lower triangle is filled with similarity values.

In [166]:
# Your code here

def compute_counts(df):
    sim_matrix = np.zeros((len(df), len(df)))
    for i in range(len(df)):
        for j in range(len(df)):
            if i <= j:
                sim_matrix[i, j] = np.nan
            else:
                sim_matrix[i, j] = np.sum(df.iloc[i] * df.iloc[j])
            
    return pd.DataFrame(sim_matrix, index=df.index, columns=df.index)
    
    
    
# do not change
df_counts = compute_counts(df_animal_feats)
df_counts

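| | Dog | Cat | Horse | Salamander | Snake | Eagle | Goldfish | Dolphin |
|---|---|---|---|---|---|---|---|---|
| Dog | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Cat | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Horse | 3.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| Salamander | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| Snake | 1.0 | 1.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN |
| Eagle | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | NaN | NaN | NaN |
| Goldfish | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | NaN |
| Dolphin | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | NaN |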
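
The same counts can also be computed without explicit loops, because the count for a pair is the dot product of the two feature rows. The sketch below is one possible vectorized variant, shown on just the first three animals (Dog, Cat, Horse) with their rows copied from the feature table; the name `df_small` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of the animal feature table, copied by hand
df_small = pd.DataFrame(
    [[1, 1, 0, 1, 1, 1],
     [1, 1, 0, 1, 1, 1],
     [1, 0, 0, 1, 0, 1]],
    index=["Dog", "Cat", "Horse"],
    columns=["Mammal", "Pet", "Lives in Water", "Has Fur", "Carnivore", "Domesticated"],
)

F = df_small.to_numpy()
counts = (F @ F.T).astype(float)                # entry (i, j) = number of shared features
counts[np.triu_indices_from(counts)] = np.nan   # blank out the diagonal and upper triangle

df_small_counts = pd.DataFrame(counts, index=df_small.index, columns=df_small.index)
print(df_small_counts)
```

The lower triangle should agree with the corresponding entries of `df_counts` (5 for Cat and Dog, 3 for Horse with either).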


In [167]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isnan(df_counts.loc['Dog', 'Dog']) and df_counts.loc['Cat', 'Dog'] == 5.0:
    print('Test passed')
else:
    print('Test failed')

Test passed


In the above, dog and cat are the most similar in the sense that they share the most features. 

However, the above count-based pattern of similarity can change depending on how "important" each shared feature is (i.e., how big each $w_k$ value is).

Let's now incorporate those weights.

**Exercise 2:**

Create a function called `compute_similarity` that takes in a dataframe like `df_animal_feats` and an array of weights (one per feature), and returns a similarity matrix as a dataframe where each item $(i, j)$ in the matrix is the estimated similarity:

$\hat{s}_{ij} = \sum_{k=1}^{m} w_k f_{ik} f_{jk}$.

The indices and columns of the output dataframe should match the index of the input (e.g., `df_animal_feats.index`).

Set all items of the matrix where $i \le j$ to `np.nan` such that only the lower triangle is filled with similarity values.

In [178]:
# Your code here

def compute_similarity(df, weight):
    weight = np.array(weight)
    
    F = df.values
    F_weighted = F * weight
    
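    # Entry (i, j) is the sum of w_k over the features shared by objects i and j
    S = np.dot(F_weighted, F.T).astype(float)
    
    # Blank out the diagonal and upper triangle (i <= j) on the array itself,
    # rather than writing through sim_df.values, which may not be writable
    S[np.triu_indices_from(S)] = np.nan
    
    return pd.DataFrame(S, index=df.index, columns=df.index)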


In [179]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
_ = compute_similarity(df_animal_feats, np.ones(df_animal_feats.shape[1])*0.5)
if np.isnan(_.loc['Dog', 'Dog']) and _.loc['Cat', 'Dog'] == 5.0*0.5:
    print('Test passed')
else:
    print('Test failed')

Test passed
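
One way to convince yourself that a matrix product implements the weighted sum is to compare it against the definition written out as explicit loops. The sketch below does this on randomly generated, purely hypothetical features and weights.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.integers(0, 2, size=(5, 4))   # hypothetical: 5 objects x 4 binary features
w = rng.random(4)                     # hypothetical non-negative weights

# Definition: s_hat[i, j] = sum_k w[k] * F[i, k] * F[j, k]
n, m = F.shape
S_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for k in range(m):
            S_loop[i, j] += w[k] * F[i, k] * F[j, k]

# Matrix form: scale each feature column by its weight, then multiply by F transposed
S_matrix = (F * w) @ F.T

print(np.allclose(S_loop, S_matrix))  # True
```

Broadcasting in `F * w` multiplies column $k$ of the feature matrix by $w_k$, which is why the product reproduces the sum term by term.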


**Exercise 2.1:** Compute animal similarity using a numpy array of weights all set to 0.25. Store the result in a dataframe called `df_sim_estimated`.

In [180]:
# Your code here

df_sim_estimated = compute_similarity(df_animal_feats, np.ones(df_animal_feats.shape[1])*0.25)

In [181]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isnan(df_sim_estimated.loc['Dog', 'Dog']) and df_sim_estimated.loc['Cat', 'Dog'] == 1.25:
    print('Test passed')
else:
    print('Test failed')

Test passed


Importance weights $w_k$ are another part of the mental representation that we can't directly observe.

However, given observable human similarity data, we can infer the weights.

Below we load corresponding animal similarity data (average judgments).

In [182]:
df_animal_sim = load_sim_data()
df_animal_sim

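| | Dog | Cat | Horse | Salamander | Snake | Eagle | Goldfish | Dolphin |
|---|---|---|---|---|---|---|---|---|
| Dog | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Cat | 0.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Horse | 0.45 | 0.45 | NaN | NaN | NaN | NaN | NaN | NaN |
| Salamander | 0.05 | 0.05 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| Snake | 0.05 | 0.05 | 0.0 | 0.05 | NaN | NaN | NaN | NaN |
| Eagle | 0.05 | 0.05 | 0.0 | 0.05 | 0.05 | NaN | NaN | NaN |
| Goldfish | 0.25 | 0.25 | 0.05 | 0.15 | 0.0 | 0.0 | NaN | NaN |
| Dolphin | 0.35 | 0.35 | 0.3 | 0.2 | 0.05 | 0.05 | 0.15 | NaN |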


Our goal is to find the set of weights such that our similarity estimates $\hat{s}_{ij}$ are as close as possible to the actual similarities $s_{ij}$.

Similar to our definition of stress, we can define a measure of total squared error:

$SE = \sum_{i>j} (s_{ij} - \hat{s}_{ij})^2$.

We want to find a set of weights that make this error as small as possible.
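
To make the error measure concrete, here is a minimal sketch on two small lower-triangular matrices. The numbers are hypothetical and chosen so the result is easy to check by hand; `np.nansum` skips the `NaN` entries, so only the pairs with $i > j$ contribute.

```python
import numpy as np

nan = np.nan

# Hypothetical "real" and "estimated" similarities for three objects
s_real = np.array([[nan, nan, nan],
                   [0.7, nan, nan],
                   [0.4, 0.1, nan]])
s_est = np.array([[nan, nan, nan],
                  [0.5, nan, nan],
                  [0.5, 0.1, nan]])

diff = s_real - s_est          # NaN wherever i <= j
se = np.nansum(diff ** 2)      # 0.2**2 + (-0.1)**2 + 0**2

print(se)  # approximately 0.05
```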

**Exercise 3:**

Create a function called `compute_error` that takes in a dataframe like `df_animal_sim` of real similarity values and a dataframe like `df_sim_estimated` of estimated similarity values, and returns the summed squared error as a single float value.

In [204]:
# Your code here

def compute_error(real_df, estimated_df):
    return np.nansum((real_df - estimated_df).values**2)

In [206]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
_1 = compute_error(df_animal_sim, df_animal_sim)
_2 = compute_error(
    df_animal_sim, 
    compute_similarity(
        df_animal_feats, 
        np.ones(df_animal_feats.shape[1])*0.5
    )
)
if np.isclose(_1, 0) and np.isclose(_2, 10.769999999999998):
    print('Test passed')
else:
    print('Test failed')

Test passed


**Exercise 3.1:** Call `compute_error` with arguments that will demonstrate the best possible error attainable with respect to the animal data. Store the result in `best_possible_error`.

In [207]:
# Your code here

best_possible_error = compute_error(df_animal_sim, df_animal_sim)
best_possible_error

np.float64(0.0)

In [208]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isclose(best_possible_error, 0):
    print('Test passed')
else:
    print('Test failed')

Test passed


Recall that $\hat{s}_{ij} = \sum_{k=1}^{m} w_k f_{ik} f_{jk}$, and thus we can write the error as:

$SE = \sum_{i>j} (s_{ij} - \hat{s}_{ij})^2 = \sum_{i>j} (s_{ij} - \sum_{k=1}^{m} w_k f_{ik} f_{jk})^2$.

Notice that all terms above are currently given except for each weight $w_k$. We want to find the weights that make the above sum as small as possible.

**Exercise 3.2:** Compute error when weights are all set to 1.0, and store the result in `error_when_weights_are_1`.

In [209]:
# Your code here
error_when_weights_are_1 = compute_error(df_animal_sim, compute_similarity(df_animal_feats, np.ones(df_animal_feats.shape[1])))


In [210]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isclose(error_when_weights_are_1, float('30000000000070.95'[::-1])):
    print('Test passed')
else:
    print('Test failed')

Test passed


**Exercise 3.3:** Compute error when weights are all set to 0.5, and store the result in `error_when_weights_are_smaller`.

In [212]:
# Your code here

error_when_weights_are_smaller = compute_error(df_animal_sim, compute_similarity(df_animal_feats, np.ones(df_animal_feats.shape[1])*0.5))

In [213]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isclose(error_when_weights_are_smaller, float('899999999999967.01'[::-1])):
    print('Test passed')
else:
    print('Test failed')

Test passed


In the above, the weights are equal only as an example. Ultimately, we want weights that are free to differ across features (reflecting differences in feature importance) and that minimize the error.

**Exercise 4:** Define a function called `error_given_weights` that takes a numpy array of weights as input and returns summed squared error as a single float value.

In [214]:
# Your code here

def error_given_weights(weights):
    return compute_error(df_animal_sim, compute_similarity(df_animal_feats, weights))

In [215]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isclose(error_given_weights(np.ones(df_animal_feats.shape[1])), 59.07000000000003):
    print('Test passed')
else:
    print('Test failed')

Test passed


We want to find the input weights that minimize the output of the `error_given_weights` function above.

We can do this using the `minimize` function from `scipy`:

In [216]:
from scipy.optimize import minimize

`minimize` takes as input the function to minimize and an initial guess for the input that will minimize it. 

It returns an object with an attribute called `x` (e.g., `my_output.x`) containing the input values at which the optimizer found the function to be smallest.

In our case, the function to minimize is `error_given_weights`, and an initial guess for the weights can be all ones: `np.ones(df_animal_feats.shape[1])`.
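
Before applying `minimize` to the animal data, it may help to see it on a function whose minimum is known. The sketch below uses a hypothetical two-input function, `toy_objective`, that is smallest at `[1, -2]`.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(x):
    # Smallest when x = [1, -2], where the value is 0
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

# Second argument is the initial guess for the inputs
result = minimize(toy_objective, np.zeros(2))

print(result.x)        # inputs found by the optimizer, close to [1, -2]
print(result.fun)      # value of the function at those inputs, close to 0
print(result.success)  # whether the optimizer reports convergence
```

The result is approximate: the optimizer stops once further improvement falls below its tolerance, so comparisons against expected values are best done with `np.isclose`.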

**Exercise 5:** Use the `minimize` function to find the importance weights that best explain human similarity judgments. Store the resulting weights in a numpy array called `best_animal_weights`.

In [217]:
# Your code here

best_animal_weights = minimize(error_given_weights, np.ones(df_animal_feats.shape[1])).x

In [218]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.all(np.isclose(best_animal_weights, [0.3, 0.2, 0.15, 0.1, 0.05, 0.05], atol=1e-5)):
    print('Test passed')
else:
    print('Test failed')

Test passed
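
Note that a plain call to `minimize` does not by itself keep the weights non-negative, although the model assumes $w_k \ge 0$. One option, sketched below on a hypothetical toy function rather than the animal data, is to pass `bounds` so that each input is restricted to be at least 0.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(x):
    # Unconstrained minimum is at [0.5, -0.3], which has a negative entry
    return (x[0] - 0.5) ** 2 + (x[1] + 0.3) ** 2

unbounded = minimize(toy_objective, np.ones(2))

# One (lower, upper) pair per input; None means no upper limit
bounded = minimize(toy_objective, np.ones(2), bounds=[(0, None), (0, None)])

print(unbounded.x)  # close to [0.5, -0.3]
print(bounded.x)    # close to [0.5, 0.0]
```

For the animal data the distinction does not matter in practice, since the weights found without bounds are already positive.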


The best weights found should indeed be unequal:

In [219]:
best_animal_weights

array([0.30000156, 0.19999464, 0.14999923, 0.09999255, 0.04999998,
       0.0500062 ])
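
As a cross-check on these values: because $\hat{s}_{ij}$ is linear in the weights, minimizing the squared error can also be treated as an ordinary least-squares problem. The sketch below is an alternative to `minimize`, not the method used in this lab. It copies the feature matrix and the lower-triangle similarities by hand from the tables above, and the names `F`, `s`, `X`, and `w_ls` are introduced only for illustration. Note that `np.linalg.lstsq` does not enforce non-negative weights.

```python
import numpy as np

# Rows: Dog, Cat, Horse, Salamander, Snake, Eagle, Goldfish, Dolphin
# Columns: Mammal, Pet, Lives in Water, Has Fur, Carnivore, Domesticated
F = np.array([
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
])

# Lower-triangle similarity judgments, copied row by row
s = np.array([
    0.7,
    0.45, 0.45,
    0.05, 0.05, 0.0,
    0.05, 0.05, 0.0, 0.05,
    0.05, 0.05, 0.0, 0.05, 0.05,
    0.25, 0.25, 0.05, 0.15, 0.0, 0.0,
    0.35, 0.35, 0.3, 0.2, 0.05, 0.05, 0.15,
])

rows, cols = np.tril_indices(len(F), k=-1)  # all pairs with i > j, row by row
X = F[rows] * F[cols]                       # X[p, k] = f_ik * f_jk for pair p

# Solve min_w ||X w - s||^2
w_ls, *_ = np.linalg.lstsq(X, s, rcond=None)
print(np.round(w_ls, 3))
```

The recovered weights should agree with `best_animal_weights` up to the optimizer's tolerance.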

The error in reconstructing similarity given these best weights should be much smaller than the error from our earlier guesses (e.g., equal weights of 1), and hopefully close to 0.

**Exercise 5.1:** Compute the error given the best weights and store the result in `error_given_best_weights`.

In [220]:
# Your code here

error_given_best_weights = error_given_weights(best_animal_weights)

In [221]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if np.isclose(error_given_best_weights, 0.0):
    print('Test passed')
else:
    print('Test failed')

Test passed


**Exercise 5.2:** Based on the weights we found, which feature does the mind think is most important?

In [223]:
answer1 = "Mammal"
# answer1 = "Pet"
# answer1 = "Lives in Water"
# answer1 = "Has Fur"
# answer1 = "Carnivore"
# answer1 = "Domesticated"

In [224]:
# TEST YOUR SOLUTION

# DON'T CHANGE THIS CELL
if 'lam'[::-1] in answer1:
    print('Test passed')
else:
    print('Test failed')

Test passed


Unfortunately, what we've accomplished so far is the easy part. 

In the previous example, the feature matrix was given for the sake of example, but recall that such features are not actually observable and must be inferred.

As we saw, given a feature matrix, inferring importance weights is relatively straightforward. However, inferring the feature matrix in the first place is much more difficult.

One shortcut researchers sometimes take is to simply predefine a large set of possible features and ask people questions such as "Does this animal have the feature is_dangerous?". Data from such experiments are called **feature norms**. However, these experimental designs have several issues, most notably that the researcher is unlikely to choose the correct superset of features to ask about.

Instead, we would like to infer the features from the similarity data. Unfortunately, this poses a combinatorial optimization problem that remains open in cognitive psychology. We will explore the search for such features in a future lab.
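
To get a rough sense of why the search is combinatorial, the sketch below counts the binary feature matrices of the same size as the animal example. It assumes the number of features is fixed in advance and counts reorderings of the same features as distinct, so it is only an order-of-magnitude illustration.

```python
n_objects = 8
n_features = 6

# Each of the n_objects * n_features cells can independently be 0 or 1
n_candidate_matrices = 2 ** (n_objects * n_features)

print(n_candidate_matrices)  # 281474976710656
```

Even for this small example, trying every candidate matrix and fitting weights to each one would be impractical.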