# Citation Learning
---

## Overview

This notebook demonstrates the learning process on synthetic citation data. In real citation data, the words are the observations and the labels are the hidden states. In this synthetic dataset, both hidden states and observations are integers, and each observation is correlated with the hidden state that generated it.

### Definitions

- **Hidden States**: Integer values ranging from \(0\) to \(n-1\).
- **Observed States**: Integer values ranging from \(0\) to \(n-1\).

In this notebook, \(n = 5\), and a hidden state \(i\) generates the observation \(i\) with probability \(0.8\). Both values can be changed.
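As a minimal sketch of what these defaults imply, the emission matrix can be built by hand. The helper name `diagonal_emission_matrix` is hypothetical and is not part of the `citation` module. It assumes the remaining probability of \(0.2\) is spread uniformly over the other \(n-1\) observations, which is consistent with the emission probabilities printed later in this notebook.

```python
import numpy as np

def diagonal_emission_matrix(n=5, p_match=0.8):
    """Hypothetical helper: row i holds P(observation = j | hidden state = i)."""
    # Assumption: the leftover probability is shared equally by the other n - 1 values.
    off_diagonal = (1.0 - p_match) / (n - 1)
    emission = np.full((n, n), off_diagonal)
    np.fill_diagonal(emission, p_match)
    return emission

emission = diagonal_emission_matrix()
print(emission)
```

Each row sums to 1, so every hidden state defines a valid distribution over observations.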

### Constraints

The hidden states appear in non-repeating blocks: once the sequence leaves a state, it never returns to it. In the context of citations, this reflects the fact that sequences such as "author, author, title, title, author" are not feasible. In our synthetic data, this constraint rules out sequences of hidden states like "2, 2, 2, 1, 1, 3, 3, 3, 3, 2", where state 2 reappears after its block has ended.
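The constraint can be checked with a few lines of code. The function below is an illustrative sketch, not the check used inside the `citation` module; the name `is_feasible` is hypothetical.

```python
def is_feasible(hidden):
    """Return True if no state reappears after its block has ended."""
    finished_or_current = set()  # states whose block has started
    previous = None
    for state in hidden:
        if state != previous:
            # A new block starts here; it must use a state not seen before.
            if state in finished_or_current:
                return False
            finished_or_current.add(state)
            previous = state
    return True

print(is_feasible([2, 2, 2, 1, 1, 3, 3, 3, 3]))     # feasible
print(is_feasible([2, 2, 2, 1, 1, 3, 3, 3, 3, 2]))  # infeasible: 2 reappears
```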

### Learning Process

The learning process consists of the following steps:

1. **Generate Observations**: Create a specified number of observations for the learning phase.
2. **Perturb Model Parameters**: Adjust the model parameters to ensure that the exact solution is unknown during the learning process.
3. **Apply Generative Learning Algorithm**: This algorithm generates feasible sequences of hidden states based on the observations, updates the model parameters, and iterates this process until convergence is achieved.

In [1]:
# Import necessary libraries
from citation import Citation
import numpy as np

In [2]:
# Set the random seed for reproducibility (None for random results)
seed=None

The model definition is stored in `citation.yaml`. You can modify the following parameters there:
- num_categories: The number of possible hidden and observed states
- transition_prob: The probability of transitioning away from the hidden state you are in
- noise_prob: The probability of observing a value different from the hidden state 
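To illustrate how `transition_prob` could map to a transition matrix, here is a hedged sketch. The helper `uniform_transition_matrix` is hypothetical, and the value `transition_prob=0.1` is inferred from the diagonal of `0.9` in the transition matrix printed below rather than read from `citation.yaml`. The sketch assumes the probability of leaving a state is split evenly among the other states.

```python
import numpy as np

def uniform_transition_matrix(num_categories=5, transition_prob=0.1):
    """Hypothetical helper: row i holds P(next state = j | current state = i)."""
    stay = 1.0 - transition_prob
    # Assumption: the leaving probability is shared equally by the other states.
    move = transition_prob / (num_categories - 1)
    trans = np.full((num_categories, num_categories), move)
    np.fill_diagonal(trans, stay)
    return trans

trans = uniform_transition_matrix()
print(trans)
```

Note that this matrix alone does not enforce the non-repeating-blocks constraint; that restriction has to be applied when hidden sequences are generated.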

In [3]:
# Load model parameters from the configuration file citation.yaml
model = Citation()
model.load_process(filename="citation.yaml")

In [4]:
# Display model parameters
print("Model parameters: ")
print("Start Probabilities:\n", np.array(model.start_probs))
print("Emission Probabilities:\n", np.array(model.emission_probs))
print("Transition Matrix:\n", np.array(model.trans_mat))

Model parameters: 
Start Probabilities:
 [0.2 0.2 0.2 0.2 0.2]
Emission Probabilities:
 [[0.8  0.05 0.05 0.05 0.05]
 [0.05 0.8  0.05 0.05 0.05]
 [0.05 0.05 0.8  0.05 0.05]
 [0.05 0.05 0.05 0.8  0.05]
 [0.05 0.05 0.05 0.05 0.8 ]]
Transition Matrix:
 [[0.9   0.025 0.025 0.025 0.025]
 [0.025 0.9   0.025 0.025 0.025]
 [0.025 0.025 0.9   0.025 0.025]
 [0.025 0.025 0.025 0.9   0.025]
 [0.025 0.025 0.025 0.025 0.9  ]]


In [5]:
# Generate and display a sequence of hidden states and corresponding observations
T = 25  # Length of the examples
H = model.generate_hidden(t_max=T)
observations = model.generate_observations_from_hidden(H, return_obs=True)

print(f"{'Example of Sequence of Feasible Hidden States:':<50} {H}")
print(f"{'Example of Corresponding Observations:':<50} {observations}")

Example of Sequence of Feasible Hidden States:     [[2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4]]
Example of Corresponding Observations:             [[1 4 2 2 2 2 2 2 2 2 2 2 4 4 4 4 2 4 4 4 4 2 4 2 4]]
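As a quick sanity check, the fraction of positions where the observation equals the hidden state should be close to \(0.8\). The sketch below copies the two sequences from the output above into plain NumPy arrays; it does not call the `citation` module.

```python
import numpy as np

# Sequences copied from the example output above.
hidden = np.array([2] * 13 + [4] * 12)
obs = np.array([1, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4,
                4, 4, 4, 2, 4, 4, 4, 4, 2, 4, 2, 4])

# Fraction of time steps where the observation matches the hidden state.
match_rate = float(np.mean(hidden == obs))
print(match_rate)  # 0.76
```

With only 25 time steps, a match rate of 0.76 (19 of 25) is in line with the nominal emission probability of 0.8.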


In [6]:
# Generate observations for learning
num_hidden = 100  # Number of sequences to generate
t_max = 25  # Length of each sequence
H = model.generate_hidden(num_hidden=num_hidden, t_max=t_max)
observations_for_learning = model.generate_observations_from_hidden(H, return_obs=True)

In [7]:
# Perturb model parameters for learning
perturb_parameter = 2  # A value >= 1; larger values mean more perturbation
model.perturb_parameters(perturb_parameter)

# Display perturbed model parameters
print("Perturbed Model Parameters:")
print("----------------------------")
print("Start Probabilities:\n", np.array(model.start_probs))
print("Emission Probabilities:\n", np.array(model.emission_probs))
print("Transition Matrix:\n", np.array(model.trans_mat))

Perturbed Model Parameters:
----------------------------
Start Probabilities:
 [0.18587216 0.22103907 0.1990714  0.25752293 0.13649444]
Emission Probabilities:
 [[0.81426933 0.05408484 0.03762405 0.02604742 0.06797436]
 [0.03509836 0.88682303 0.041434   0.01882756 0.01781705]
 [0.04610341 0.05816821 0.8042537  0.04932225 0.04215243]
 [0.02437915 0.05641431 0.02206489 0.8802222  0.01691946]
 [0.05052728 0.08210432 0.04387774 0.10824458 0.71524609]]
Transition Matrix:
 [[0.8943776  0.02043164 0.05357614 0.01573422 0.0158804 ]
 [0.0183664  0.89284065 0.0198323  0.04192194 0.0270387 ]
 [0.02486446 0.04942769 0.81785926 0.06616465 0.04168393]
 [0.02506976 0.014454   0.01000229 0.93290942 0.01756453]
 [0.03426309 0.01314621 0.012509   0.04258366 0.89749804]]


In [8]:
# Learning
num_hidden = 1  # Number of random hidden-state sequences generated at each step of the learning process
learn_param = -2/3  # A value in [-1, -1/2); more negative values weight previously generated hidden states less heavily
convergence_factor = 1E-3  # The algorithm terminates when the parameters change by less than this in one iteration
num_it = 0  # Starting value of the iteration counter; a larger value means you trust the initial parameter guess more
noisy = True  # If True, print the change in the model parameters at every iteration; learning stops once it drops below convergence_factor

model.learn(observations_for_learning, learn_param=learn_param, convergence_factor=convergence_factor, num_it=num_it, num_hidden=num_hidden, noisy=noisy)

0.23629541045776026
0.1626420664536743
0.08591644085748945
0.10720303661231212
0.05771747318917525
0.030657679432340157
0.03577494582510293
0.022064305122493133
0.02979264914506985
0.021945212096353417
0.022008476492827
0.02247211567052132
0.035117586886544316
0.01678396012871497
0.01202953036525689
0.01757494759798356
0.008976000700801234
0.017666807361595205
0.016819878150220763
0.020690809496078433
0.016943022975739186
0.02087854045057752
0.012673313007399434
0.01492304576994602
0.020424100756109537
0.012719304876709636
0.01074369052855606
0.010627439448586212
0.011468520345419564
0.00852636239905763
0.009909562573112596
0.007743178511755738
0.012102694202672015
0.010166452338771781
0.011422176811993257
0.006894299573081769
0.009698372328112825
0.011146808362423627
0.005292152457126276
0.005151893962506565
0.006785406715817899
0.007573369576330313
0.009150163667333566
0.005969973134923118
0.007266293105622573
0.006521580200078453
0.010342339570415304
0.006401056713252695
0.005940112

0.0012813946340513305
0.0020793557834589693
0.0023533860565136955
0.001254223454251524
0.0017018322103005135
0.0025567367622622017
0.0017175067683457557
0.0010424553095250977
0.0017838177275163458
0.0025959463108384817
0.002167922532797628
0.001826425158672207
0.0020864155473202806
0.0024962471710029688
0.0013096013765430126
0.0023037572053002386
0.0021885796871707053
0.0011339154906077897
0.0012071010166610774
0.0016211779560709183
0.0015315106592326266
0.001971899009390099
0.0023110094257408342
0.002135959480060401
0.002114409045761878
0.0016349131192611999
0.0019129434470832318
0.0012436698080255382
0.0012547792337893882
0.0016259593081786317
0.0012486676582125489
0.002151426040229387
0.0009442800569504639


In [9]:
# Display learned model parameters
print("Learned model parameters: ")
print("Start Probabilities:\n", np.array(model.start_probs))
print("Emission Probabilities:\n", np.array(model.emission_probs))
print("Transition Matrix:\n", np.array(model.trans_mat))

Learned model parameters: 
Start Probabilities:
 [0.16464261 0.35480067 0.13944123 0.15661071 0.18450478]
Emission Probabilities:
 [[0.21409422 0.20816521 0.1594483  0.21631239 0.20197988]
 [0.15342449 0.21828027 0.15778804 0.24011392 0.23039327]
 [0.2100881  0.23246903 0.15379718 0.21319736 0.19044833]
 [0.20882711 0.22358344 0.14613637 0.22765499 0.19379809]
 [0.21421919 0.22130624 0.15082587 0.22684951 0.18679918]]
Transition Matrix:
 [[9.88334284e-01 3.11550390e-05 1.16344672e-02 8.67729589e-10
  9.25850769e-08]
 [3.47325911e-03 8.83199584e-01 3.17874401e-03 4.05713999e-02
  6.95770125e-02]
 [5.78614392e-05 6.62205633e-04 9.95117441e-01 3.44463198e-04
  3.81802832e-03]
 [3.71261814e-02 3.19043830e-11 1.00645260e-03 9.61867246e-01
  1.19867362e-07]
 [8.84830202e-03 8.14087473e-12 9.11975025e-04 2.01229915e-02
  9.70116731e-01]]
