# CS786 Assignment 4 by Anmol Pabla (190154)

---



## Q1 Implementing a GCM Model:

Implementing the model based on the following assumptions:

*   The subject is polite and is far more likely to call someone big "average" than "large".
*   The subject is more likely to use weight than height to make category judgments about size.
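
The two assumptions above map onto the parameters of the GCM choice rule. As a sketch of what the cells below compute (the symbols are chosen here for illustration and mirror the argument names `alpha`, `beta` and `gamma` in the code):

$$
s(x, y) = \exp\bigl(-\beta\, d_\alpha(x, y)\bigr), \qquad
P(c \mid y) = \frac{\gamma_c \sum_{x \in c} s(x, y)}{\sum_{c'} \gamma_{c'} \sum_{x \in c'} s(x, y)}
$$

Here $d_\alpha$ is the attention-weighted distance returned by `get_dist`. A larger $\alpha$ on weight than on height encodes the second assumption, and a smaller $\gamma$ for "Large" encodes the first.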

In [383]:
# Importing the necessary Libraries 
import pandas as pd
import numpy as np

# Importing data provided in CSVs and converting it into np arrays
test_df = pd.read_csv('Y.csv',header=None)
train_df = pd.read_csv('X.csv',header=None)

train = train_df.to_numpy()
test = test_df.to_numpy()

In [384]:
# First 5 rows of the training data
train_df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 48 | 58 | 1 |
| 1 | 54 | 62 | 1 |
| 2 | 48 | 56 | 1 |
| 3 | 46 | 62 | 1 |
| 4 | 47 | 59 | 1 |


In [385]:
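# Function to get the distance between a training exemplar and the stimulus.
# Note: this takes the absolute value of the weighted sum of signed differences,
# so differences of opposite sign on the two dimensions can cancel.
def get_dist(x, y, alpha):
  dist = 0
  for i in range(len(alpha)):
    dist += alpha[i]*(x[i]-y[i])
  return abs(dist)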

# Function to get the similarity between a training exemplar and the stimulus
def get_similarity(x, y, alpha, beta):
  dist = get_dist(x,y,alpha)
  return np.exp(-1*beta*dist)
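
The distance in `get_dist` is the absolute value of a weighted sum of signed differences. The GCM is usually stated with a weighted city-block distance instead, which sums the absolute difference on each dimension. The sketch below contrasts the two on a made-up exemplar. The function names and the example values are hypothetical and are not part of the assignment code.

```python
import numpy as np

def signed_sum_distance(x, y, alpha):
    """Distance as computed by get_dist: |sum_i alpha_i * (x_i - y_i)|."""
    x, y, alpha = (np.asarray(v, dtype=float) for v in (x, y, alpha))
    n = len(alpha)  # only the first len(alpha) entries are features
    return abs(np.sum(alpha * (x[:n] - y[:n])))

def city_block_distance(x, y, alpha):
    """Weighted city-block distance: sum_i alpha_i * |x_i - y_i|."""
    x, y, alpha = (np.asarray(v, dtype=float) for v in (x, y, alpha))
    n = len(alpha)
    return np.sum(alpha * np.abs(x[:n] - y[:n]))

# Hypothetical values: the exemplar is lighter but taller than the stimulus
exemplar = [60, 70, 2]   # (weight, height, label)
stimulus = [70, 50]      # (weight, height)
alpha = [0.2, 0.1]

print(signed_sum_distance(exemplar, stimulus, alpha))  # 0.0, the differences cancel
print(city_block_distance(exemplar, stimulus, alpha))  # 4.0
```

With the signed sum, an exemplar that differs from the stimulus in opposite directions on the two dimensions can look identical to it. Swapping in the city-block form would change the similarities, so the categories printed further down would need to be regenerated.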

In [386]:
# Function to determine the category the stimulus y belongs to
def categorize(y, train, alpha, beta, gamma):
  sums = np.zeros(3)
  prob = np.zeros(3)

  # Getting the summations needed
  for i in range(len(train)):
    cat = train[i][2]-1
    sums[cat] += gamma[cat]*get_similarity(train[i],y,alpha,beta)
  
  # Calculating Probabilities
  prob[0] = sums[0]/(sums[0]+sums[1]+sums[2])
  prob[1] = sums[1]/(sums[0]+sums[1]+sums[2])
  prob[2] = sums[2]/(sums[0]+sums[1]+sums[2])

  # returning the argument with max probability as selected category
  return np.argmax(prob) 

In [387]:
# Function to implement the model using the utility functions defined above
def GCM(train, stimulus):
  Label = ['Small', 'Average', 'Large']
  # Giving more weightage to weight than height
  alpha = [0.2,0.1]
  beta = 1
  # Biasing against Large as the Subject is polite
  gamma = [1,1,0.7]

  cat = categorize(stimulus, train, alpha, beta, gamma)
  return Label[cat] 

In [388]:
# Calling the GCM function over all test data points
for c in test:
  category = GCM(train,c)
  print('For Weight:',c[0],'Kg and Height',c[1],'inches, The Category assigned is ', category)

For Weight: 74 Kg and Height 67 inches, The Category assigned is  Average
For Weight: 69 Kg and Height 63 inches, The Category assigned is  Average
For Weight: 92 Kg and Height 81 inches, The Category assigned is  Large
For Weight: 64 Kg and Height 61 inches, The Category assigned is  Average
For Weight: 66 Kg and Height 84 inches, The Category assigned is  Average
For Weight: 76 Kg and Height 68 inches, The Category assigned is  Large
For Weight: 61 Kg and Height 58 inches, The Category assigned is  Average
For Weight: 64 Kg and Height 76 inches, The Category assigned is  Average
For Weight: 68 Kg and Height 66 inches, The Category assigned is  Average
For Weight: 34 Kg and Height 61 inches, The Category assigned is  Small


The categories assigned by the GCM to the test stimuli are listed above.
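
To see how strongly the politeness bias shifts a decision, it can help to look at the category probabilities rather than only the argmax. The following is a minimal vectorized sketch of the same choice rule, assuming the same distance as `get_dist` and labels coded 1 to 3 in the last column. The function name `gcm_probabilities` and the toy exemplars are hypothetical.

```python
import numpy as np

def gcm_probabilities(train, y, alpha, beta, gamma):
    """Return P(category | y) for each category under the choice rule above."""
    train = np.asarray(train, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    n = len(alpha)
    # Same distance as get_dist: absolute value of the weighted signed sum
    dist = np.abs((train[:, :n] - y[:n]) @ alpha)
    sim = np.exp(-beta * dist)
    labels = train[:, n].astype(int) - 1  # labels 1..3 -> indices 0..2
    # Sum similarities per category, then apply the response bias
    sums = gamma * np.bincount(labels, weights=sim, minlength=len(gamma))
    return sums / sums.sum()

# Hypothetical toy exemplars, one per category: (weight, height, label)
toy_train = [[40, 55, 1], [60, 65, 2], [80, 75, 3]]
toy_stimulus = [70, 70]  # equally far from the Average and Large exemplars

unbiased = gcm_probabilities(toy_train, toy_stimulus, [0.2, 0.1], 1, [1, 1, 1])
biased = gcm_probabilities(toy_train, toy_stimulus, [0.2, 0.1], 1, [1, 1, 0.7])
print(unbiased)  # Average and Large receive equal probability
print(biased)    # the bias tips the decision towards Average
```

In this toy case the stimulus is ambiguous between "Average" and "Large" without bias, and setting the third entry of `gamma` to 0.7 is enough to resolve the tie in favour of "Average".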

---
## Q2 Implementing the RMC Model:


The original code provided has been modified so that the clustering done by the RMC model is limited to three categories, and so that the partitions are formed correctly, which improves the model's performance.
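
As a sketch of the quantities the class below computes, written in notation chosen here to mirror the code: $c$ is the coupling parameter `self.c`, $n_k$ is the number of items in partition $k$, $N$ is the total number of items, $c_{ik}$ is the number of items in partition $k$ whose feature $i$ equals the stimulus value, and $\alpha_0 = \sum_i \alpha_i$ as set in `__init__`.

$$
P(k) = \frac{c\, n_k}{(1 - c) + cN} \quad (n_k > 0), \qquad
P(k) = \frac{1 - c}{(1 - c) + cN} \quad (n_k = 0)
$$

$$
P(F \mid k) = \prod_i \frac{c_{ik} + \alpha_i}{n_k + \alpha_0}, \qquad
P(k \mid F) \propto P(k)\, P(F \mid k)
$$

The first line corresponds to `posterior`, the product to `condclusterprob`.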

In [389]:

# Implementation of Anderson's venerable "rational" model of categorization.
# Rather than compute the full Bayesian posterior, the original views items
# sequentially and assigns each to the maximum a posteriori cluster.
# In this modified version, each training item is instead placed in the
# partition given by its label (see stimulate).
#
# At the end it is presented with a stimulus with one item missing, and
# predicts the probability that its value is 1, 2 or 3
#
# Modified from the Python implementation by John McDonnell
#
# References: Anderson (1990) and Anderson (1991)



#Utility functions:

class RMC_Class:
    """
    See Anderson (1990, 1991)
    'Categories' renamed 'clusters' to avoid confusion.
    Discrete version.
    
    Stimulus format is a list of integers from 0 to n-1 where n is the number
    of possible features (e.g. [1,0,1])
    
    args: c, alphas
    """
    
    def __init__(self, args):
        self.partition = [[],[],[]]
        self.c, self.alpha = args
        self.alpha0 = sum(self.alpha.T)
        self.N = 0
    
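    def probClustVal(self, k, i, val):
        # Find P(j|k)
        cj = len([x for x in self.partition[k] if x[i]==val])
        # Note: nk here is the number of partitions (3), whereas
        # condclusterprob uses the size of cluster k
        nk = len(self.partition)
        return (cj + self.alpha[i])/(nk + self.alpha0)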
    
    def condclusterprob(self, stim, k):
        # Find P(F|k)
        pjks = []
        for i in range(len(stim)):
            cj = len([x for x in self.partition[k] if x[i]==stim[i]])
            nk = len(self.partition[k])
            pjks.append( (cj + self.alpha[i])/(nk + self.alpha0) )
        return np.prod( pjks )
        
    
    def posterior(self, stim):
        # Find P(k|F) for each cluster
        pk = np.zeros( len(self.partition) )
        pFk = np.zeros( len(self.partition) )
        
        # existing clusters:
        for k in range(len(self.partition)):
            pk[k] = self.c * len(self.partition[k])/ ((1-self.c) + self.c * self.N)
            if len(self.partition[k])==0: # case of new cluster
                pk[k] = (1-self.c) / (( 1-self.c ) + self.c * self.N)
            pFk[k] = self.condclusterprob( stim, k)
        
        # put it together
        pkF = (pk*pFk) 
        
        return pkF
    
    def stimulate(self, stim):
        category = stim[2]-1       
        # Adding stimulus to the right category partition
        # Here we make use of the fact we need only three categories and training
        # data is available for the categories
        self.partition[category].append(stim)
        
        self.N += 1
    
    def query(self, stimulus):
        # Queried value should be -1
        qdim = -1
        for i in range(len(stimulus)):
            if stimulus[i] < 0:
                if qdim != -1:
                    raise Exception("ERROR: Multiple dimensions queried.")
                qdim = i
        
        self.N = sum([len(x) for x in self.partition])
        
        # Calculating probabilities using Partition
        pkF = self.posterior(stimulus)

        pjF = np.array( [sum( [ pkF[k] * self.probClustVal(k, qdim, j+1) 
                          for k in range(len(self.partition))] ) 
                          for j in range(3)] )
        
        Final = pjF / sum(pjF)
        return Final


In [390]:
# RMC wrapper that returns the predicted category with the help of the
# utility class defined above
def RMC(train, stimulus):
  Label = ['Small', 'Average', 'Large']
  model = RMC_Class([0.8, np.array([0.2,0.1,0.3])])
  for s in train:
      model.stimulate(s)

  cat = model.query( stimulus + [-1] )
  return Label[np.argmax(cat)]
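
One detail of `RMC` is worth checking by hand. The rows of `test` are NumPy arrays, so `stimulus + [-1]` does not append a query slot as it would for a Python list. It broadcasts and subtracts 1 from every feature. `query` then finds no negative entry and keeps `qdim = -1`, which indexes the last (label) column. The short sketch below illustrates the difference, using the first test row as an example value.

```python
import numpy as np

row = np.array([74, 67])        # a test row: (weight, height)

shifted = row + [-1]            # broadcasting: subtracts 1 from every feature
appended = np.append(row, -1)   # appends a third slot holding -1

print(shifted)    # [73 66]
print(appended)   # [74 67 -1]

# With a plain Python list, + concatenates instead
as_list = [74, 67] + [-1]
print(as_list)    # [74, 67, -1]
```

Passing `np.append(stimulus, -1)` would hand `query` the unshifted features plus an explicit query slot. That changes the stimulus the model sees, so the categories printed below may differ and would need to be regenerated.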

In [391]:
# Calling the RMC function defined above for all test data points
for c in test: 
  category = RMC(train, c)
  print('For Weight:',c[0],'Kg and Height',c[1],'inches, The Category assigned is ', category)

For Weight: 74 Kg and Height 67 inches, The Category assigned is  Average
For Weight: 69 Kg and Height 63 inches, The Category assigned is  Average
For Weight: 92 Kg and Height 81 inches, The Category assigned is  Average
For Weight: 64 Kg and Height 61 inches, The Category assigned is  Average
For Weight: 66 Kg and Height 84 inches, The Category assigned is  Average
For Weight: 76 Kg and Height 68 inches, The Category assigned is  Average
For Weight: 61 Kg and Height 58 inches, The Category assigned is  Small
For Weight: 64 Kg and Height 76 inches, The Category assigned is  Average
For Weight: 68 Kg and Height 66 inches, The Category assigned is  Average
For Weight: 34 Kg and Height 61 inches, The Category assigned is  Small


The categories assigned by the RMC model to the test stimuli are listed above.

---
## Q3 Empirically verifying that the above models assume exchangeability of data



To empirically verify that the above models assume exchangeability of the data, we pass shuffled versions of the training set through them.

If the models do treat the data as exchangeable, we should obtain the same predictions in every case.
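
The check can also be phrased as a small reusable helper. The sketch below is an illustration and is not the code used for the results in this section. The helper `is_order_invariant` and the two toy predictors are hypothetical names. It works on shuffled copies and leaves the original array untouched, and it always includes the reversed order so that an order-dependent predictor is caught deterministically.

```python
import numpy as np

def is_order_invariant(predict, train, test, n_shuffles=7, seed=0):
    """Return True if predict(train, stim) gives the same labels for
    every reordering of the training rows that is tried."""
    rng = np.random.default_rng(seed)
    train = np.asarray(train)
    baseline = [predict(train, stim) for stim in test]
    # Reversed order first, then random permutations of the rows (copies)
    orders = [train[::-1]] + [rng.permutation(train) for _ in range(n_shuffles)]
    for shuffled in orders:
        if [predict(shuffled, stim) for stim in test] != baseline:
            return False
    return True

# Hypothetical predictors, for illustration only
def nearest_exemplar(train, stim):
    """Order-invariant here: label of the closest training row (no ties in the toy data)."""
    d = np.abs(train[:, :2] - np.asarray(stim)).sum(axis=1)
    return int(train[np.argmin(d), 2])

def first_row_label(train, stim):
    """Order-dependent: always returns the label of the first training row."""
    return int(train[0, 2])

toy_train = np.array([[40, 55, 1], [60, 65, 2], [80, 75, 3]])
toy_test = [[42, 56], [78, 74]]

print(is_order_invariant(nearest_exemplar, toy_train, toy_test))  # True
print(is_order_invariant(first_row_label, toy_train, toy_test))   # False
```

The same helper could be called with `GCM` or `RMC` as `predict`, since both take `(train, stimulus)` and return a label.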

---



GCM Model:

In [392]:
# List to store predictions for each run
categorizations_GCM = []

# We will be analysing the data for 7 different shuffles
for j in range(7):
  cat_iteration = []
  # Shuffling the training data in place
  np.random.shuffle(train)

  # Getting the categorizations
  for i in range(len(test)):
    cat_iteration.append(GCM(train,test[i]))

  # Storing the categorizations of each run in categorizations_GCM
  categorizations_GCM.append(cat_iteration)

# Transposing categorizations_GCM so that each row holds the categorization
# results for one test data point across the 7 shuffling iterations
categorizations_GCM = np.transpose(categorizations_GCM)

for i in range(len(test)):
  print('For the Test Data point', i+1,' we obtain these results:')
  print('   ',categorizations_GCM[i],'\n')

For the Test Data point 1  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 2  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 3  we obtain these results:
    ['Large' 'Large' 'Large' 'Large' 'Large' 'Large' 'Large'] 

For the Test Data point 4  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 5  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 6  we obtain these results:
    ['Large' 'Large' 'Large' 'Large' 'Large' 'Large' 'Large'] 

For the Test Data point 7  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 8  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' '

RMC Model:

In [393]:
# List to store predictions for each run
categorizations_RMC = []

# We will be analysing the data for 7 different shuffles
for j in range(7):
  cat_iteration = []
  # Shuffling the training data in place
  np.random.shuffle(train)

  # Getting the categorizations
  for i in range(len(test)):
    cat_iteration.append(RMC(train,test[i]))

  # Storing the categorizations of each run in categorizations_RMC
  categorizations_RMC.append(cat_iteration)

# Transposing categorizations_RMC so that each row holds the categorization
# results for one test data point across the 7 shuffling iterations
categorizations_RMC = np.transpose(categorizations_RMC)

for i in range(len(test)):
  print('For the Test Data point', i+1,' we obtain these results:')
  print('   ',categorizations_RMC[i],'\n')

For the Test Data point 1  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 2  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 3  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 4  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 5  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 6  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Average' 'Average' 'Average'] 

For the Test Data point 7  we obtain these results:
    ['Small' 'Small' 'Small' 'Small' 'Small' 'Small' 'Small'] 

For the Test Data point 8  we obtain these results:
    ['Average' 'Average' 'Average' 'Average' 'Averag

---
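We find that for both models the categorization results stay the same after reshuffling the training data. This is consistent with both the GCM and the RMC model, as implemented here, treating the training data as exchangeable. For the GCM, the summed similarities do not depend on the order of the exemplars. For the RMC, this follows from the modification in Q2: `stimulate` places each training item in the partition given by its label, so the partitions do not depend on presentation order.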

