# Am I Better Off With Julia?
Is Julia better for the kind of work I do? To find out, I explore how hard it is to write Julia code that runs at least as quickly as my Numba-compiled Python code that produces...
- A Serial Position Curve
- A Lag-CRP
- A Fitted CMR parameter configuration

## Data Preparation

### PyCall
First let's work out how cleanly and quickly PyCall can retrieve the output of my `prepare_murdock1970_data` function, which will form the basis of these tests.

At first I copied the matching code cell from my Python-based notebook and tried to call it with PyCall, but the library could not find my `compmemlearn` package. PyCall had installed, and was using, its own separate copy of Python in my `.julia` folder, so packages have to be installed into that interpreter instead.

Doing the editable install through my terminal this way worked ok enough, but installing an online package caused trouble because "the SSL module is not available". When I open PyCall's `python.exe`, `import ssl` results in a corresponding ModuleNotFoundError.

In the end, I set PyCall's environment to use my main Python instance instead of the workspace it created. I ran this sample based on [a solution shared by angelv](https://discourse.julialang.org/t/import-package-from-python/47144/12) as a code cell and restarted my kernel:

```julia
ENV["PYTHON"]="c:/programdata/miniconda3/python.exe"
using Pkg
pkg"build PyCall"
using PyCall
```

In [None]:
using PyCall

In [None]:
py"""
from compmemlearn.datasets import prepare_murdock1970_data

trials, events, list_length = prepare_murdock1970_data('../../data/mo1970.txt')
events.head()
"""

println(py"list_length" + py"list_length")
println(py"trials[0]")
py"events".head()

40
[15, 16, 17, 18, 20, 11, 0, 0, 0, 0, 0, 0, 0]


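|   | subject | list | item | input | output | study | recall | repeat | intrusion |
|---|---------|------|------|-------|--------|-------|--------|--------|-----------|
| 0 | 1 | 1 | 1 | 1 |  | True | False | 0 | False |
| 1 | 1 | 1 | 2 | 2 |  | True | False | 0 | False |
| 2 | 1 | 1 | 3 | 3 |  | True | False | 0 | False |
| 3 | 1 | 1 | 4 | 4 |  | True | False | 0 | False |
| 4 | 1 | 1 | 5 | 5 |  | True | False | 0 | False |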


This executes pretty quickly (no apparent compilation after the first `using`) and these variables are surprisingly easy to play with. However, if we check their types, we see we're not totally finished pulling the DataFrame out of Python: `events` is still a `PyObject`, though the other variables are converted cleanly.

In [None]:
println(typeof(py"list_length"))
println(typeof(py"events"))
println(typeof(py"trials"))

Int64
PyObject
Matrix{Int64}


A `pd_to_df` function [proposed by lungben](https://discourse.julialang.org/t/converting-pandas-dataframe-returned-from-pycall-to-julia-dataframe/43001/2) seems to finish the conversion quickly.

In [None]:
using DataFrames

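# Convert a pandas DataFrame (held as a PyObject) into a Julia DataFrame, one column at a time.
function pd_to_df(df_pd)
    df = DataFrame()
    for col in df_pd.columns
        # `.values` is the column's NumPy array, which PyCall converts to a Julia array
        df[!, col] = getproperty(df_pd, col).values
    end
    df
end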

trials = py"trials"
list_length = py"list_length"
events = pd_to_df(py"events")
first(events, 5)

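|   | subject | list | item | input | output | study | recall | repeat | intrusion |
|---|---------|------|------|-------|--------|-------|--------|--------|-----------|
|   | Int64 | Int64 | Int64 | Int64 | Float64 | Bool | Bool | Int32 | Bool |
| 1 | 1 | 1 | 1 | 1 |  | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 2 | 2 |  | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 3 | 3 |  | 1 | 0 | 0 | 0 |
| 4 | 1 | 1 | 4 | 4 |  | 1 | 0 | 0 | 0 |
| 5 | 1 | 1 | 5 | 5 |  | 1 | 0 | 0 | 0 |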


I gotta wonder if PythonCall handles this transfer more cleanly or not.

## Serial Position Effect

My first attempt uses StatsBase and runs in the global scope.

In [None]:
using StatsBase

fast_spc(trials, item_count) = counts(trials, item_count) / size(trials, 1)

fast_spc(trials, 20);
@time fast_spc(trials, 20)

  0.000034 seconds (2 allocations: 448 bytes)


20-element Vector{Float64}:
 0.44305555555555554
 0.29097222222222224
 0.2222222222222222
 0.18958333333333333
 0.1388888888888889
 0.15694444444444444
 0.15486111111111112
 0.14097222222222222
 0.16041666666666668
 0.18958333333333333
 0.15347222222222223
 0.1875
 0.21875
 0.2534722222222222
 0.27847222222222223
 0.3125
 0.3972222222222222
 0.5875
 0.6881944444444444
 0.7875

## Lag-CRP

In [None]:
function fast_crp(trials, item_count)
    
    lag_range = item_count
    total_actual_lags = zeros(lag_range * 2 + 1)
    total_possible_lags = zeros(lag_range * 2 + 1)
    terminus = sum(trials .!= 0, dims=2) .- 1
    
    # compute actual serial lag b/t recalls
    actual_lags = trials[:, 2:end] - trials[:, begin:end-1]
    actual_lags = actual_lags .+ lag_range
    
    # tabulate bin totals for actual and possible lags
    for i in 1:size(trials, 1)
        possible_items = 1:(item_count + 1)
        previous_item = 0
        
        for recall_index in 1:terminus[i]
            
            # track possible and actual lags
            if recall_index > 1
                total_actual_lags[actual_lags[i, recall_index-1]] += 1
                                
                # exploit equivalence b/t item index and study position to track possible lags
                possible_lags = possible_items .- previous_item 
                possible_lags .+= lag_range
                total_possible_lags[possible_lags] .+= 1
                
            end
            
            # update pool of possible items to exclude recalled item
            previous_item = trials[i, recall_index]
            possible_items = possible_items[possible_items .!= previous_item]
            
        end
        
        # small correction to avoid nans
        total_possible_lags[total_actual_lags.==0] .+= 1
    end
    
    return total_actual_lags/total_possible_lags
end

fast_crp (generic function with 1 method)

In [None]:
fast_crp(trials, list_length)

41×41 Matrix{Float64}:
 0.000252908  0.000486841  0.000695565  …  0.000479522  0.00039034
 0.000143477  0.000276189  0.000394599     0.000272037  0.000221443
 0.000184817  0.000355768  0.000508297     0.00035042   0.000285249
 0.000131318  0.000252783  0.000361159     0.000248983  0.000202677
 0.00014834   0.000285551  0.000407975     0.000281258  0.00022895
 0.00017509   0.000337044  0.000481545  …  0.000331977  0.000270236
 0.000204272  0.000393218  0.000561802     0.000387306  0.000315275
 0.000213999  0.000411942  0.000588555     0.00040575   0.000330288
 0.000177522  0.000341725  0.000488233     0.000336588  0.000273989
 0.000233454  0.000449392  0.00064206      0.000442636  0.000360314
 0.000196976  0.000379174  0.000541738  …  0.000373474  0.000304015
 0.000226158  0.000435348  0.000621995     0.000428804  0.000349054
 0.000303976  0.000585146  0.000836015     0.000576349  0.000469159
 ⋮                                      ⋱               ⋮
 0.000102136  0.000196609  0.00028090
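The 41×41 matrix is a sign that something went wrong: a lag-CRP should be a single vector with one value per lag. The likely cause is the final line of `fast_crp`, where `/` between two vectors performs matrix right division rather than the elementwise `./`. Below is a sketch of a corrected version under several assumptions of mine that the cell above does not confirm: lags run from `-(item_count - 1)` to `item_count - 1` with lag 0 in the center bin, zeros are padding at the end of each row, every recorded transition should be counted, and the small NaN correction is applied once after all trials. `fast_crp_sketch` is a hypothetical name.

```julia
# Hypothetical corrected lag-CRP sketch (not the version that produced the output above).
function fast_crp_sketch(trials::AbstractMatrix{<:Integer}, item_count::Integer)
    lag_range = item_count - 1
    center = lag_range + 1                       # 1-based index of lag 0
    actual = zeros(Float64, 2 * lag_range + 1)
    possible = zeros(Float64, 2 * lag_range + 1)

    for i in axes(trials, 1)
        available = trues(item_count)            # items not yet recalled this trial
        previous_item = 0
        for j in axes(trials, 2)
            item = trials[i, j]
            item == 0 && break                   # assumed: zeros only pad the row's end
            if previous_item != 0
                # tally the lag actually taken
                actual[item - previous_item + center] += 1
                # tally every lag that was still possible from the previous item
                for candidate in 1:item_count
                    if available[candidate]
                        possible[candidate - previous_item + center] += 1
                    end
                end
            end
            available[item] = false
            previous_item = item
        end
    end

    # small correction to avoid 0/0, applied once after all trials
    possible[actual .== 0] .+= 1
    return actual ./ possible                    # elementwise division gives a vector
end

println(fast_crp_sketch([1 2 3; 3 2 0], 3))      # [0.0, 1.0, 0.0, 1.0, 0.0]
```

With `list_length = 20` this returns a 39-element vector, one entry per lag from -19 to +19.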

### Trying Out PythonCall

In [None]:
ENV["JULIA_PYTHONCALL_EXE"]="c:/programdata/miniconda3/python.exe"                                                                                              
using Pkg
pkg"build PythonCall"

In [None]:
using PythonCall
@py from compmemlearn.analyses import prepare_murdock1970_data

LoadError: LoadError: MethodError: no method matching var"@py"(::LineNumberNode, ::Module, ::Symbol, ::Expr, ::Expr)
Closest candidates are:
  var"@py"(::LineNumberNode, ::Module, ::Any) at C:\Users\gunnj\.julia\packages\PythonCall\7klbm\src\py_macro.jl:799
in expression starting at In[8]:2

In [None]:
@py 1+2

Python int: 3

In [None]:
@py import numpy as np

In [None]:
@py from compmemlearn.analyses import prepare_murdock1970_data

LoadError: LoadError: MethodError: no method matching var"@py"(::LineNumberNode, ::Module, ::Symbol, ::Expr, ::Expr)
Closest candidates are:
  var"@py"(::LineNumberNode, ::Module, ::Any) at C:\Users\gunnj\.julia\packages\PythonCall\7klbm\src\py_macro.jl:799
in expression starting at In[17]:1

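I can't get `@py` to eval imports that use `from`, I guess? The MethodError says the macro was handed three arguments (a `Symbol` and two `Expr`s) where it only accepts a single expression, so Julia seems to split `from ... import ...` into pieces before PythonCall ever sees it.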
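The argument count in that error can be checked without PythonCall installed, since `Meta.parse` only parses the line and never expands the macro. This is a small sketch of that check; `macro_args` is a hypothetical helper that drops the macro name and the line-number node from the parsed call.

```julia
# Parse the two lines as Julia sees them, without expanding `@py`.
from_call = Meta.parse("@py from compmemlearn.analyses import prepare_murdock1970_data")
plain_call = Meta.parse("@py import numpy as np")

# A :macrocall expression stores the macro name, a LineNumberNode, then the arguments.
macro_args(ex) = ex.args[3:end]

println(typeof.(macro_args(from_call)))    # three pieces: Symbol, Expr, Expr
println(length(macro_args(plain_call)))    # 1
```

So `from` arrives as a bare symbol followed by two separate expressions, while `import numpy as np` is a single Julia `import` expression that the macro can translate. A likely workaround, untested here, is to import the module itself and read the function off it as an attribute.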

## General Notes
- Installing new packages like PyCall and DataFrames takes a long time compared to pip!
- There isn't just PyCall. PythonCall and JuliaCall seem to be newer and have some nice features to make cross-use more seamless.
- `using` pulls in every exported name, and there seems to be no tradition of selective function imports that make clear which functions are available in the global namespace. I kinda don't like that. Ah, I can use `import` instead. Thank goodness.
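For reference, the selective forms look roughly like this. `Statistics` and `LinearAlgebra` are standard-library stand-ins chosen only so the sketch runs without installing anything; they are not packages used elsewhere in these notes.

```julia
using Statistics: mean     # only `mean` enters the namespace
import LinearAlgebra       # only the module name enters; functions need the prefix

println(mean([1, 2, 3]))                     # 2.0
println(LinearAlgebra.dot([1, 2], [3, 4]))   # 11
```

Either form makes it visible at the top of a cell where each function comes from, much like `from module import name` in Python.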

## Language Learning Notes
PyCall uses a string macro to handle my short Python scripts. That's why `py"..."` doesn't look like a normal `function(arg1, arg2)` call.
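A string macro is just a macro whose name ends in `_str`; writing `name"text"` calls `@name_str` with the raw text. Here is a toy sketch of the mechanism. `shout` is a hypothetical macro made up for illustration and has nothing to do with PyCall's actual implementation.

```julia
# `shout"..."` is rewritten into a call to `@shout_str` with the literal text.
macro shout_str(text)
    # this runs at macro-expansion time; the returned string replaces the literal
    return uppercase(text)
end

greeting = shout"hello from a string macro"
println(greeting)   # HELLO FROM A STRING MACRO
```

PyCall's `py"..."` follows the same pattern, except that its macro hands the text to the Python interpreter instead of transforming it in Julia.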

Check item types with `typeof`. Use `isa` to test for types -- ex. `1 isa Number`.

Print with `println`.

`show` seems to be the closest thing to `head` here. Actually, `first(events, 5)` might be closer.

I wonder what the exclamation point in my `pd_to_df` function is doing. 

> ! in indexing is specific to DataFrames, and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. 

> Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.

Since the indexing with ! does not involve any data copy, it will generally be more efficient. How does a character like ! become package-exclusive like that? Another macro?

Functions can modify (mutate) the contents of the objects their arguments refer to. (The names of functions which do this are conventionally suffixed with '!'.) That helps explain why `!` is used in this context since the ! denotes mutation rather than copying. But we're using it as an index. How is this cool? `!` is a generic function itself so maybe it just interprets that.
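That last guess is on the right track. `!` is an ordinary function (logical not), so a package can define indexing methods that dispatch on its type, `typeof(!)`, and DataFrames does this. No macro is needed. Here is a toy sketch of the idea; `ToyTable` is a hypothetical type, far simpler than a real DataFrame.

```julia
# Hypothetical miniature table: named columns stored in a dictionary.
struct ToyTable
    columns::Dict{Symbol, Vector{Int}}
end

# `t[!, :col]` calls getindex with the function `!` as the first index,
# so a method can dispatch on `typeof(!)`.
Base.getindex(t::ToyTable, ::typeof(!), col::Symbol) = t.columns[col]        # no copy
# `t[:, :col]` passes a Colon instead; here it returns a copy.
Base.getindex(t::ToyTable, ::Colon, col::Symbol) = copy(t.columns[col])

t = ToyTable(Dict(:a => [1, 2, 3]))
t[!, :a][1] = 99     # mutates the stored column
t[:, :a][2] = -1     # mutates a throwaway copy only
println(t[!, :a])    # [99, 2, 3]
```

The choice of `!` for the no-copy form echoes the naming convention for mutating functions, but mechanically it is plain multiple dispatch.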