# Data Transformation

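py-irt reads its input as JSON Lines: one JSON object per subject, with a `subject_id` and a `responses` map from item ID to a binary response. Our data needs to look something like this: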

```json
{"subject_id": "pedro",    "responses": {"q1": 1, "q2": 0, "q3": 1, "q4": 0}}
{"subject_id": "pinguino", "responses": {"q1": 1, "q2": 1, "q3": 0, "q4": 0}}
{"subject_id": "ken",      "responses": {"q1": 1, "q2": 1, "q3": 1, "q4": 1}}
{"subject_id": "burt",     "responses": {"q1": 0, "q2": 0, "q3": 0, "q4": 0}}
```
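As a quick sanity check on this layout, the following minimal sketch parses JSON Lines text and verifies that each record carries the two keys shown above. The helper name `parse_jsonlines` and the shortened sample records are hypothetical and exist only for illustration; they are not part of py-irt.

```python
import json

# Two shortened sample lines in the layout shown above.
sample = """{"subject_id": "pedro", "responses": {"q1": 1, "q2": 0}}
{"subject_id": "ken", "responses": {"q1": 1, "q2": 1}}"""


def parse_jsonlines(text):
    """Parse JSON Lines text and check each record has the expected keys.

    Hypothetical helper for illustration only.
    """
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = {"subject_id", "responses"} - record.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing {sorted(missing)}")
        records.append(record)
    return records


records = parse_jsonlines(sample)
print(len(records), records[0]["responses"])
```

Each line is parsed independently, so a malformed record can be reported by its line number without loading the whole file as a single JSON document.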


## Setup

```
conda create -n ENVNAME python=3.10 pandas
conda activate ENVNAME
```

```
conda install -c conda-forge jupyterlab
pip install pandas pyro-ppl py-irt
```

In [14]:
!pip install pandas pyro-ppl py-irt


Defaulting to user installation because normal site-packages is not writeable


In [15]:
import pandas as pd
import json

In [16]:
# cola.rp is tab-separated with no header row, so we supply the column names.
D = pd.read_csv("cola.rp", sep="\t", names=["modelID", "itemID", "response"])

D.head()

Unnamed: 0,modelID,itemID,response
0,m-0.34,0,1
1,m-0.34,1,1
2,m-0.34,2,1
3,m-0.34,3,1
4,m-0.34,4,1


In [17]:
# A negative n returns every row except the last |n|, here all but the last 10.
D.head(-10)

Unnamed: 0,modelID,itemID,response
0,m-0.34,0,1
1,m-0.34,1,1
2,m-0.34,2,1
3,m-0.34,3,1
4,m-0.34,4,1
...,...,...,...
427535,m-0.04,8536,1
427536,m-0.04,8537,1
427537,m-0.04,8538,1
427538,m-0.04,8539,0


In [18]:
# Number of distinct models (the "subjects" in IRT terms).
D["modelID"].nunique()

50

In [19]:
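# Map each model ID to a dict of {"q<itemID>": response}.
response_patterns = {}

for idx, row in D.iterrows():
    # Create the inner dict the first time we see a model.
    if row["modelID"] not in response_patterns:
        response_patterns[row["modelID"]] = {}
    response_patterns[row["modelID"]][f"q{row['itemID']}"] = row["response"]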
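
The row-by-row loop can be slow on a frame with several hundred thousand rows. As a hedged alternative, the sketch below builds the same kind of nested dict with `groupby`, run here on a small toy frame rather than the real data. The frame `toy` and the helper `to_response_patterns` are hypothetical names introduced for illustration.

```python
import json

import pandas as pd

# Toy long-format data with the same three columns as D (values are made up).
toy = pd.DataFrame(
    {
        "modelID": ["m-a", "m-a", "m-b", "m-b"],
        "itemID": [0, 1, 0, 1],
        "response": [1, 0, 1, 1],
    }
)


def to_response_patterns(df):
    """Group rows by model and build {"q<itemID>": response} for each one."""
    patterns = {}
    for model_id, group in df.groupby("modelID", sort=False):
        patterns[model_id] = {
            # int() converts NumPy integers so json.dumps can serialize them.
            f"q{item}": int(resp)
            for item, resp in zip(group["itemID"], group["response"])
        }
    return patterns


patterns = to_response_patterns(toy)
lines = [
    json.dumps({"subject_id": key, "responses": val})
    for key, val in patterns.items()
]
print("\n".join(lines))
```

Passing `sort=False` keeps the models in their order of first appearance, which matches the insertion order produced by the loop.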

In [20]:
# Write one JSON object per line (JSON Lines), one line per model.
with open("cola_pyirt.jsonlines", "w") as outfile:
    for key, val in response_patterns.items():
        outrow = {"subject_id": key, "responses": val}
        outfile.write(json.dumps(outrow) + "\n")

In [21]:
!py-irt train 1pl cola_pyirt.jsonlines output/cola1pl

[08:09:50] config: model_type='1pl' epochs=2000 priors=None           cli.py:109
           initializers=[] dims=None lr=0.1 lr_decay=0.9999
           dropout=0.5 hidden=100 vocab_size=None log_every=100
           seed=None deterministic=False
           data_path: cola_pyirt.jsonlines

To print the fitted parameters from the notebook, use `cat` on Linux or macOS:

```
!cat output/cola1pl/best_parameters.json
```

or `type` on Windows:

```
!type output\cola1pl\best_parameters.json
```

In [32]:
with open("output/cola1pl/best_parameters.json", "r") as file:
    data = json.load(file)
data.keys()

dict_keys(['ability', 'diff', 'irt_model', 'item_ids', 'subject_ids'])

In [33]:
len(data["diff"])

8551

In [39]:
# Item difficulties, indexed by item ID.
items = pd.DataFrame(
    index=data["item_ids"],
    data=data["diff"],
    columns=["diff"],
)
items.head()

Unnamed: 0,diff
0,-16.024603
1,-14.949307
2,-20.603096
3,-22.10672
4,-17.044914


In [41]:
# Subject (model) abilities, indexed by subject ID.
subjects = pd.DataFrame(
    index=data["subject_ids"],
    data=data["ability"],
    columns=["ability"],
)
subjects.head()

Unnamed: 0,ability
0,9.435297
1,10.093401
2,8.901505
3,9.171259
4,8.865317
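
To read these two tables together, the sketch below assumes the fitted 1PL model follows the standard Rasch parameterization, where the probability of a correct response is the logistic function of `ability - diff`. That parameterization is an assumption about the trained model, not something the output above confirms. The helper `p_correct` is a hypothetical name, and the values are copied from the first rows of the tables above.

```python
import math

import pandas as pd

# First two rows of each table above, re-entered by hand for illustration.
subjects = pd.DataFrame({"ability": [9.435297, 10.093401]}, index=[0, 1])
items = pd.DataFrame({"diff": [-16.024603, -14.949307]}, index=[0, 1])


def p_correct(ability, diff):
    """Rasch (1PL) probability of a correct response: sigmoid(ability - diff).

    Assumes the standard Rasch form; hypothetical helper for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(ability - diff)))


# Probability that subject 0 answers item 0 correctly under this assumption.
p = p_correct(subjects.loc[0, "ability"], items.loc[0, "diff"])
print(f"{p:.6f}")
```

Under this assumption, an ability far above an item's difficulty gives a probability close to 1, and equal ability and difficulty give exactly 0.5.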
