## Understanding the McGurk effect

### Background
From the lecture, I know that a cognitive model can simulate the mental steps that people take when they perform tasks. The McGurk effect describes how visual perception influences auditory perception. We can simulate this process with a cognitive model that predicts whether a person identifies the correct word when what they see does not match what they hear.

### Implement the model
I plan to build a model that predicts how many trials a person needs to identify the correct word under visual influence.
Below, I describe the potential contribution of each of the four processes (state estimation, observation, control, and environment) to modeling the McGurk effect.

```python
import math
import copy
import random
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import jupyter_module_loader
jupyter_module_loader.register()
```

#### State Estimation
I assume that people learn to recognize the visual influence from previous failed trials, so their sensitivity to the McGurk effect decreases over time. A person who is more sensitive to the McGurk effect is more easily influenced by what they see and more likely to make a mistake.

To support the model, I encode each input as a value between 0 and 1 that indicates how accurately it conveys the word. For hearing, the received word stays unchanged (1), which is also the correct result. For vision, the value indicates how well the mouth movement matches the sound: 1 means the mouth movement matches the sound, and 0 means it is completely different from the sound.
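
As a minimal sketch of this encoding, the hypothetical helper `percept` below applies the same combination rule that `perceptual_target` uses later (`audio - sensitivity * (1 - visual)`) to a few visual-match values. It is an illustration only, not part of the model code.

```python
def percept(audio_target, visual_target, sensitivity):
    """Combine audio and visual evidence on the 0-1 scale.

    Same rule as perceptual_target: the visual mismatch (1 - visual_target),
    scaled by the sensitivity, is subtracted from the audio value.
    """
    return audio_target - sensitivity * (1 - visual_target)


# Audio is always the correct word (1); vary how well the lips match it.
for visual in (0.0, 0.5, 1.0):
    print(visual, round(percept(1, visual, 0.85), 3))
```

With the initial sensitivity of 0.85, a complete mismatch (visual 0) gives a percept of about 0.15, while a perfect match gives 1.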

```python
def init_prior():
    return {'identify_target': False,  # whether the correct word has been identified
            'aim_visual': 0,  # the visual match the person expects to receive
            'sensitivity': 0.85}  # the sensitivity to the McGurk effect
```

```python
def init_state():
    return {'audio_target': 1,  # the correct result
            'visual_target': 0,  # how well the mouth movement matches the sound
            'sensitivity': 0.85}  # the sensitivity to the McGurk effect
```

```python
def state_estimate(belief, obs):
    # Update the belief with the new observation.
    # Each trial lowers the sensitivity by a random amount in [0, 1), clamped at 0.
    random_offset = random.random()
    sensitivity = max(0, belief['sensitivity'] - random_offset)
    aim_visual = estimate_target(sensitivity)
    return {'sensitivity': sensitivity,
            'identify_target': obs['identify_target'],
            'aim_visual': aim_visual}
```

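```python
def init_bounds():
    return {'audio_noise': 0.01,  # std of the Gaussian noise on hearing
            'visual_noise': 0.01,  # std of the Gaussian noise on vision
            'perceptual_noise': 0.01}  # not used yet (see Observation)
```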

#### Observation
The observation function gathers information from the state. Given the visual and auditory inputs, it predicts what a person perceives; I assume that this percept depends on the person's sensitivity to the McGurk effect.
There could also be some perceptual noise, but I ignore it for now to keep the model simple.

```python
def observation(state, bounds):
    # bounds is not used yet: perceptual noise is ignored for now
    p_target = perceptual_target(state)
    estimate = estimate_target(state['sensitivity'])
    obs = {'identify_target': p_target > 0.9,
           'sensitivity': state['sensitivity'],
           'aim_visual': estimate}

    return obs

def perceptual_target(state):
    sensitivity = state['sensitivity']
    audio_target = state['audio_target']
    visual_target = state['visual_target']

    # one possible formula for the perceived result: the visual mismatch
    # (1 - visual_target), scaled by the sensitivity, is subtracted from the audio
    return audio_target - sensitivity * (1 - visual_target)

def estimate_target(sensitivity):
    if sensitivity == 0:
        return 1
    # cap the estimate at 1
    return min(1, 0.9 / sensitivity)
```

```python
observation(init_state(), init_bounds())
```

```
{'identify_target': False, 'sensitivity': 0.85, 'aim_visual': 1}
```
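
To see how much the lips would have to agree with the sound for the percept to pass the 0.9 cut-off, the condition `audio - s * (1 - v) > 0.9` can be solved for the visual match `v`. A minimal sketch, assuming the same formula and threshold as `perceptual_target` and `observation`; the function name `min_visual_match` is hypothetical.

```python
AUDIO_TARGET = 1.0
THRESHOLD = 0.9  # same cut-off as observation() and terminate()


def min_visual_match(sensitivity, audio_target=AUDIO_TARGET, threshold=THRESHOLD):
    """Lower bound on the visual match v needed for the percept to pass.

    Rearranging audio - s * (1 - v) > threshold gives
    v > 1 - (audio - threshold) / s. The bound is clamped at 0 because
    the visual match cannot be negative.
    """
    if sensitivity == 0:
        return 0.0  # no visual influence: the audio alone decides
    return max(0.0, 1 - (audio_target - threshold) / sensitivity)


for s in (0.85, 0.5, 0.1, 0.05):
    print(s, round(min_visual_match(s), 3))
```

With the initial sensitivity of 0.85, the mouth movement must match the sound by more than about 0.88. The initial state has a visual match of 0, giving a percept of about 0.15, which is why `identify_target` is `False` in the output above.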

#### Controller
Having made an observation, the agent can use that information to select an action. In this model, if people fail to identify the correct word, they keep hearing and watching until they succeed. The action also carries the current `sensitivity` and `aim_visual` from the `belief`, so that the environment can use them to update the `state` that the observation function reads.

```python
def controller(belief):
    if not belief['identify_target']:
        action = {'name': 'hear_watch',
                  'sensitivity': belief['sensitivity'],
                  'aim_visual': belief['aim_visual']}
    else:
        action = {'name': 'stop'}
    return action
```

#### Environment
Next, the environment function determines what happens when the controller selects an action.
When people hear the pronunciation, they hear the correct word. When they watch the mouth movement, they may find that it differs from the pronunciation. I assume that audio noise and visual noise can affect what the ears and eyes receive.

```python
def environment(state, action, bounds):
    if action['name'] == 'hear_watch':
        random_audio_noise = np.random.normal(0, bounds['audio_noise'], 1)[0]
        state['audio_target'] = min(1, 1 + random_audio_noise)
        visual_target = action['aim_visual']
        random_visual_noise = np.random.normal(0, bounds['visual_noise'], 1)[0]
        state['visual_target'] = max(0, visual_target + random_visual_noise)
        state['sensitivity'] = action['sensitivity']
    return state
```
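
In `environment`, the audio value is only capped at 1 and the visual value is only floored at 0, so noise can push `visual_target` above 1 (the run below records about 1.008). A minimal sketch of a variant that keeps both targets on the 0 to 1 scale follows; `environment_clipped` and `clip_unit` are hypothetical names, and the sketch uses NumPy's `default_rng` generator instead of the global `np.random` functions.

```python
import numpy as np


def clip_unit(x):
    """Clamp a value to the 0-1 scale used for all targets."""
    return float(np.clip(x, 0.0, 1.0))


def environment_clipped(state, action, bounds, rng=None):
    """Variant of environment() that keeps both targets inside [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    if action['name'] == 'hear_watch':
        audio_noise = rng.normal(0, bounds['audio_noise'])
        visual_noise = rng.normal(0, bounds['visual_noise'])
        state['audio_target'] = clip_unit(1 + audio_noise)
        state['visual_target'] = clip_unit(action['aim_visual'] + visual_noise)
        state['sensitivity'] = action['sensitivity']
    return state


rng = np.random.default_rng(0)
bounds = {'audio_noise': 0.01, 'visual_noise': 0.01, 'perceptual_noise': 0.01}
state = {'audio_target': 1, 'visual_target': 0, 'sensitivity': 0.85}
action = {'name': 'hear_watch', 'sensitivity': 0.85, 'aim_visual': 1}
print(environment_clipped(state, action, bounds, rng))
```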

```python
def agent(belief, state, bounds, stop):
    num_trials = 0
    done = False
    while not done:
        action = controller(belief)
        print('action', action)
        state = environment(state, action, bounds)
        print('state', state)
        obs = observation(state, bounds)
        print('obs', obs)
        belief = state_estimate(belief, obs)
        print('belief', belief)

        num_trials += 1
        done = stop(state)
    return num_trials
```

```python
def terminate(state):
    return perceptual_target(state) > 0.9

agent(init_prior(), init_state(), init_bounds(), terminate)
```

```
action {'name': 'hear_watch', 'sensitivity': 0.85, 'aim_visual': 0}
state {'audio_target': 1, 'visual_target': 0.009548647568407878, 'sensitivity': 0.85}
obs {'identify_target': False, 'sensitivity': 0.85, 'aim_visual': 1}
belief {'sensitivity': 0.003916331479663748, 'identify_target': False, 'aim_visual': 1}
action {'name': 'hear_watch', 'sensitivity': 0.003916331479663748, 'aim_visual': 1}
state {'audio_target': 1, 'visual_target': 1.0081396959977496, 'sensitivity': 0.003916331479663748}
obs {'identify_target': True, 'sensitivity': 0.003916331479663748, 'aim_visual': 1}
belief {'sensitivity': 0, 'identify_target': True, 'aim_visual': 1}

2
```
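
Since the goal is to predict how many trials people need, a single run is only one sample. The sketch below repeats the run many times and counts the trials. It restates the model functions without the `print` calls so that it is self-contained; `run_once` and its `max_trials` safety cap are hypothetical additions, and the parameters are the ones from `init_prior`, `init_state`, and `init_bounds`.

```python
import random

import numpy as np

THRESHOLD = 0.9


def perceptual_target(state):
    return state['audio_target'] - state['sensitivity'] * (1 - state['visual_target'])


def estimate_target(sensitivity):
    return 1 if sensitivity == 0 else min(1, 0.9 / sensitivity)


def run_once(max_trials=100):
    """One silent run of the agent loop; returns the number of trials."""
    belief = {'identify_target': False, 'aim_visual': 0, 'sensitivity': 0.85}
    state = {'audio_target': 1, 'visual_target': 0, 'sensitivity': 0.85}
    bounds = {'audio_noise': 0.01, 'visual_noise': 0.01}
    for trial in range(1, max_trials + 1):
        # controller + environment: hear and watch using the current belief
        state['audio_target'] = min(1, 1 + np.random.normal(0, bounds['audio_noise']))
        state['visual_target'] = max(0, belief['aim_visual'] + np.random.normal(0, bounds['visual_noise']))
        state['sensitivity'] = belief['sensitivity']
        # observation and terminate(): stop once the percept passes the threshold
        if perceptual_target(state) > THRESHOLD:
            return trial
        # state estimation: random decrease of the sensitivity
        sensitivity = max(0, belief['sensitivity'] - random.random())
        belief = {'identify_target': False,
                  'sensitivity': sensitivity,
                  'aim_visual': estimate_target(sensitivity)}
    return max_trials  # hypothetical safety cap, not in the original agent


random.seed(0)
np.random.seed(0)
counts = [run_once() for _ in range(1000)]
print('mean trials:', np.mean(counts))
print('runs per trial count:', np.bincount(counts))
```

With these parameters, one would expect nearly every run to stop after two trials: `estimate_target` returns 1 for any sensitivity up to 0.9, so the second trial already sees an almost perfect lip match. Counting trials this way makes it easy to check how strongly the prediction depends on the update rule in `state_estimate`.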

### Improvement

To improve the model, the State Estimation process could use a better method to decide how much the sensitivity decreases after each trial, or include other factors that may change the visual influence.
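
The current `state_estimate` subtracts a uniform random amount, so the sensitivity can drop from 0.85 to 0 after a single failure. One alternative, sketched below under the assumption of a fixed multiplicative decay `rate` (a hypothetical parameter, not taken from data), lowers the sensitivity gradually and makes the number of failures predictable. The sketch counts failures until even a complete lip/sound mismatch no longer pulls the percept down to the 0.9 threshold.

```python
def decay_sensitivity(sensitivity, rate=0.7):
    """Hypothetical deterministic rule: keep a fixed fraction after each failure."""
    return sensitivity * rate


def failures_until_robust(s0=0.85, rate=0.7, audio_target=1.0, threshold=0.9):
    """Count failures until a complete mismatch (visual match 0), whose
    percept is audio - s * (1 - 0), stays above the threshold."""
    if not 0 < rate < 1:
        raise ValueError('rate must be between 0 and 1')
    s, failures = s0, 0
    while audio_target - s <= threshold:
        s = decay_sensitivity(s, rate)
        failures += 1
    return failures


for rate in (0.5, 0.7, 0.9):
    print(rate, failures_until_robust(rate=rate))
```

A slower decay (a rate closer to 1) means more failed trials before the listener stops being fooled, so the rate would be the parameter to fit against experimental trial counts.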

In the Control process, people could take other actions besides hearing and watching at the same time. For example, they could close their eyes so that their perception depends only on the sound.
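
A minimal sketch of how a `close_eyes` action might fit into the existing controller. Everything here is hypothetical: the `failures` counter, the `max_failures` cut-off, and the assumption that closing the eyes removes the visual term from the percept entirely.

```python
def controller_with_eyes(belief, failures, max_failures=3):
    """Hypothetical policy: hear and watch, but close the eyes after
    max_failures failed trials."""
    if belief['identify_target']:
        return {'name': 'stop'}
    if failures >= max_failures:
        return {'name': 'close_eyes'}
    return {'name': 'hear_watch',
            'sensitivity': belief['sensitivity'],
            'aim_visual': belief['aim_visual']}


def percept_for_action(action, audio_target, visual_target, sensitivity):
    """With closed eyes the visual channel drops out, so only the audio counts."""
    if action['name'] == 'close_eyes':
        return audio_target
    return audio_target - sensitivity * (1 - visual_target)


belief = {'identify_target': False, 'aim_visual': 0, 'sensitivity': 0.85}
print(controller_with_eyes(belief, failures=0)['name'])
print(controller_with_eyes(belief, failures=3)['name'])
print(percept_for_action({'name': 'close_eyes'}, 1, 0, 0.85))
```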

In the Environment process, the distribution of the noise should be specified. It may depend on distance.
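
A minimal sketch of distance-dependent noise, assuming "distance" means the distance between the listener and the speaker. The linear growth law and the `growth` factor are hypothetical placeholders; the resulting standard deviation could stand in for `bounds['audio_noise']` or `bounds['visual_noise']` in `environment`.

```python
import numpy as np


def noise_std(base_std, distance, growth=0.05):
    """Hypothetical: noise standard deviation grows linearly with the
    listener-speaker distance (growth is a made-up per-metre factor)."""
    return base_std * (1 + growth * distance)


rng = np.random.default_rng(0)
for distance in (1, 5, 20):
    std = noise_std(0.01, distance)
    samples = rng.normal(0, std, size=10_000)  # noise that would be added to a 0-1 target
    print(distance, round(std, 4), round(float(samples.std()), 4))
```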

In the Observation process, how to combine hearing and vision into the final perceived result is an open question; I think it should be measured experimentally. The model should also include perceptual noise, which affects how the brain receives the result.
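
A minimal sketch of adding perceptual noise to the observation, assuming it is Gaussian with standard deviation `bounds['perceptual_noise']` (a bound that `init_bounds` already defines but `observation` does not use). The function name `noisy_observation` is hypothetical.

```python
import numpy as np


def estimate_target(sensitivity):
    return 1 if sensitivity == 0 else min(1, 0.9 / sensitivity)


def noisy_observation(state, bounds, rng, threshold=0.9):
    """Variant of observation() that adds Gaussian perceptual noise
    (standard deviation bounds['perceptual_noise']) to the percept."""
    percept = state['audio_target'] - state['sensitivity'] * (1 - state['visual_target'])
    percept += rng.normal(0, bounds['perceptual_noise'])
    return {'identify_target': bool(percept > threshold),
            'sensitivity': state['sensitivity'],
            'aim_visual': estimate_target(state['sensitivity'])}


rng = np.random.default_rng(0)
bounds = {'audio_noise': 0.01, 'visual_noise': 0.01, 'perceptual_noise': 0.01}
# A stimulus just below the threshold: 1 - 0.85 * (1 - 0.88) = 0.898
state = {'audio_target': 1, 'visual_target': 0.88, 'sensitivity': 0.85}
hits = sum(noisy_observation(state, bounds, rng)['identify_target'] for _ in range(10_000))
print('identified in', hits, 'of 10000 trials')
```

Near the threshold, the same stimulus is identified on some trials and not on others, which is the kind of variability an experiment could use to estimate the perceptual noise.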

### Implications in HCI
Since the McGurk effect suggests that our senses make optimal use of visual perception to reduce ambiguity in auditory perception, some interfaces can combine vision and sound to improve the user experience. For example, some websites let users solve verification codes (CAPTCHAs) by both seeing and hearing them, which may increase accuracy.

### Kalman filter for vision dominance
From the lecture, we know that the filter takes two estimates (z1 and z2) of a variable z, together with estimates of the noise in each (sigma1 and sigma2). It returns a new estimate z3 and an estimate of its noise, sigma3. The new estimate weights each input by its reliability, giving more weight to the less noisy one, so it is a better prediction than either input alone.
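
For two independent estimates with Gaussian noise, taking sigma1 and sigma2 as standard deviations, the standard fusion rule (a single scalar Kalman update) is:

$$
z_3 = \frac{\sigma_2^2 \, z_1 + \sigma_1^2 \, z_2}{\sigma_1^2 + \sigma_2^2},
\qquad
\frac{1}{\sigma_3^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}
$$

Because the precisions add, sigma3 is smaller than both sigma1 and sigma2, and z3 lies closer to whichever input is less noisy.
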
According to the McGurk effect, the visual estimate can correct the earlier auditory estimate, producing a new estimate with less uncertainty than either of the two. When what we hear is uncertain, we can use visual perception to reduce the ambiguity. If what we see differs from what we hear, we are easily led to believe what we see and doubt what we hear. When the visual estimate is less noisy than the auditory one, the filter weights it more heavily, so the perceived sound is pulled toward what we see: vision dominates sound.
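
A small numeric sketch of vision dominance using the two-estimate fusion described above (inverse-variance weighting, with sigma as a standard deviation). The values are hypothetical and only illustrate the direction of the effect: the sound says 1 (the correct word) with large noise, while the lips say 0 (a complete mismatch) with small noise.

```python
def fuse(z1, sigma1, z2, sigma2):
    """Combine two independent Gaussian estimates (sigma = standard deviation).

    z3 weights each input by the other's variance; the fused variance is
    var1 * var2 / (var1 + var2), i.e. the precisions add.
    """
    var1, var2 = sigma1 ** 2, sigma2 ** 2
    z3 = (var2 * z1 + var1 * z2) / (var1 + var2)
    sigma3 = (var1 * var2 / (var1 + var2)) ** 0.5
    return z3, sigma3


# Hypothetical numbers on the 0-1 scale: noisy hearing, clearer vision.
z_audio, sigma_audio = 1.0, 0.3
z_visual, sigma_visual = 0.0, 0.1
z3, sigma3 = fuse(z_audio, sigma_audio, z_visual, sigma_visual)
print(round(z3, 3), round(sigma3, 3))
```

The fused estimate lands near 0.1, much closer to what the lips show, and its standard deviation (about 0.095) is below both inputs, consistent with vision dominating when it is the more reliable channel.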