# Bayesian Inference with Julia - Gen

This is an enthusiastic beginner's introduction to probabilistic programming and Bayesian inference using Julia and Gen.

## Load data

In [1]:
using DataFrames, CSV

input_path = "../Data/Intermediate_Files/"
output_path = "../Data/Processed_Data/"

# Load pacmap output data
df = CSV.read(output_path*"pacmap_output/pacmap_5d_output_acute_leukemia_cleaned.csv", DataFrame)

# Define X and y
X = Matrix(df[:, ["PaCMAP 1", "PaCMAP 2", "PaCMAP 3", "PaCMAP 4", "PaCMAP 5"]])  # shape (n_samples, n_features=5)
y = df[:, "ELN AML 2022 Diagnosis"]  # shape (n_samples,) with 11 string classes

X_train = X[df[!, "Train Test"] .== "Discovery (train) Samples", :]
y_train = y[df[!, "Train Test"] .== "Discovery (train) Samples"]
X_test = X[df[!, "Train Test"] .== "Validation (test) Samples", :]
y_test = y[df[!, "Train Test"] .== "Validation (test) Samples"]

# Mapping from string labels to integer labels
using MLBase
label_to_int = labelmap(y_train)

# Convert y_train and y_test to integer labels
y_train = labelencode(label_to_int, y_train)
y_test = labelencode(label_to_int, y_test)

# Prepare your data
num_classes = length(unique(y_train)) # 11 classes
num_features = size(X_train, 2) # 5 features

# print size of each set
println("X_train: ", size(X_train)) # X_train: (1399, 5)
println("y_train: ", size(y_train)) # y_train: (1399,)
println("X_test: ", size(X_test)) # X_test: (110, 5)
println("y_test: ", size(y_test)) # y_test: (110,)


X_train: (1399, 5)
y_train: (1399,)
X_test: (110, 5)
y_test: (110,)


## Define the multinomial logistic regression model

In [3]:
# Load necessary packages
using Gen, Distributions

# Define our logistic model
@gen function logistic_regression_model(X::Array{Float64}, y::Array{Int64})
    # Define the priors for our weights. Here we are assuming a Gaussian prior for simplicity.
    # The mean is 0 and the standard deviation is 1.

    weights = @trace(mvnormal(zeros(num_features), Matrix{Float64}(I, num_features, num_features)), :weights)

    # Compute the log-odds
    for i in 1:size(X, 1)
        # Dot product of the features and the weights gives the log-odds
        log_odds = X[i, :] * weights

        # We then convert this to a probability using the softmax function
        probabilities = softmax(log_odds)

        # Our observation is then a categorical distribution with the calculated probabilities
        @trace(categorical(probabilities), (:y, i))
    end
end

DynamicDSLFunction{Any}(Dict{Symbol, Any}(), Dict{Symbol, Any}(), Type[Array{Float64}, Array{Int64}], false, Union{Nothing, Some{Any}}[nothing, nothing], var"##logistic_regression_model#292", Bool[0, 0], false)
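
The softmax step described in the model's comments can be made concrete with a small sketch in plain Julia. It assumes a weight matrix `W` with one column per class (features × classes), which is the shape a multinomial model would need to produce one score per class. The names `softmax`, `W`, and `x` are illustrative and defined here; they are not Gen or Distributions functions.

```julia
# Numerically stable softmax: subtract the maximum score before exponentiating
function softmax(scores::AbstractVector{<:Real})
    shifted = exp.(scores .- maximum(scores))
    return shifted ./ sum(shifted)
end

# Hypothetical example: 5 features, 3 classes
x = [0.5, -1.0, 2.0, 0.0, 1.5]   # features of one sample
W = zeros(5, 3)                  # weights, one column per class
W[:, 2] .= 0.1                   # give class 2 a small positive weight on every feature

scores = W' * x                  # one score per class (length 3)
probabilities = softmax(scores)  # non-negative entries that sum to 1
```

Because the scores are shifted by their maximum before exponentiating, the result is unchanged mathematically but less prone to overflow for large scores.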

## Run Bayesian inference

The goal is to estimate the posterior distribution of the weights given the observed data.
Here we attempt inference with Metropolis-Hastings, a Markov chain Monte Carlo (MCMC) method.
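
To give a feel for what Metropolis-Hastings does, here is a minimal random-walk sketch in plain Julia, independent of Gen. It targets a hypothetical one-dimensional density (an unnormalized standard normal); `random_walk_mh`, `log_target`, and the step size are assumptions chosen for illustration, not part of this notebook's model.

```julia
using Random

# Random-walk Metropolis-Hastings for a 1D target given by its log-density
function random_walk_mh(log_density, x0::Float64; n_steps::Int=5000, step::Float64=1.0,
                        rng::AbstractRNG=Random.default_rng())
    samples = Vector{Float64}(undef, n_steps)
    x = x0
    n_accepted = 0
    for t in 1:n_steps
        x_new = x + step * randn(rng)                    # symmetric Gaussian proposal
        log_ratio = log_density(x_new) - log_density(x)  # log acceptance ratio
        if log(rand(rng)) < log_ratio                    # accept with probability min(1, ratio)
            x = x_new
            n_accepted += 1
        end
        samples[t] = x                                   # a rejected step repeats the current state
    end
    return samples, n_accepted / n_steps
end

# Hypothetical target: unnormalized standard normal
log_target(x) = -0.5 * x^2
samples, acceptance_rate = random_walk_mh(log_target, 3.0; rng=MersenneTwister(1))
```

Only the ratio of densities enters the acceptance test, which is why the target can stay unnormalized. The same property is what makes the method usable for posteriors.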

In [5]:
using LinearAlgebra

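# Specify the data to be used in the model
data = (X_train, y_train)

# Condition the model on the observed labels; with an empty choice map
# the labels would be sampled from the prior instead of being observed
observations = choicemap()
for (i, label) in enumerate(y_train)
    observations[(:y, i)] = label
end

# Define our proposal distribution, which randomly walks in the space of weights.
# Gen passes the current trace as the first argument of a custom proposal.
@gen function proposal(trace)
    @trace(mvnormal(trace[:weights], 0.1*Matrix{Float64}(I, num_features, num_features)), :weights)
end

# Perform inference using Metropolis-Hastings
traces = []
trace, = Gen.generate(logistic_regression_model, data, observations)
for i in 1:2000
    global trace
    trace, = Gen.mh(trace, proposal, ())
    push!(traces, trace)
end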


LoadError: MethodError: no method matching *(::Vector{Float64}, ::Vector{Float64})

Closest candidates are:
  *(::Any, ::Any, ::Any, ::Any...)
   @ Base operators.jl:578
  *(::StridedMatrix{T}, ::StridedVector{S}) where {T<:Union{Float32, Float64, ComplexF32, ComplexF64}, S<:Real}
   @ LinearAlgebra /opt/julia-1.9.2/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:49
  *(::StridedVecOrMat, ::LinearAlgebra.LQPackedQ)
   @ LinearAlgebra /opt/julia-1.9.2/share/julia/stdlib/v1.9/LinearAlgebra/src/lq.jl:293
  ...


### Troubleshooting required here :(
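
The `MethodError` comes from `X[i, :] * weights` in the model: both operands are vectors, and Julia defines no `*` for two vectors. Switching to `dot` would give a single scalar, but a softmax over several classes needs one score per class, so the weights likely need to be a features × classes matrix. The following is one possible way forward, sketched under assumptions rather than a verified fix for this dataset: it uses small synthetic data in place of the PaCMAP features, defines its own `softmax` helper (to my knowledge neither Gen nor Distributions exports one), and the names `softmax_regression`, `random_walk_proposal`, and `run_mh` are hypothetical.

```julia
using Gen, LinearAlgebra, Random

# Numerically stable softmax (hypothetical helper defined here)
softmax(scores) = (e = exp.(scores .- maximum(scores)); e ./ sum(e))

@gen function softmax_regression(X::Matrix{Float64}, num_classes::Int)
    num_features = size(X, 2)
    prior_cov = Matrix{Float64}(I, num_features, num_features)
    # One standard-normal weight vector per class, each at its own address
    W = zeros(num_features, num_classes)
    for k in 1:num_classes
        W[:, k] = @trace(mvnormal(zeros(num_features), prior_cov), (:weights, k))
    end
    for i in 1:size(X, 1)
        scores = W' * X[i, :]   # one score per class
        @trace(categorical(softmax(scores)), (:y, i))
    end
end

# Random-walk proposal for the weights of class k; Gen passes the current trace first
@gen function random_walk_proposal(trace, k::Int, step_cov::Matrix{Float64})
    @trace(mvnormal(trace[(:weights, k)], step_cov), (:weights, k))
end

function run_mh(X, observations, num_classes; n_iters=200)
    # Start from a trace that is consistent with the observed labels
    trace, _ = generate(softmax_regression, (X, num_classes), observations)
    step_cov = 0.1 * Matrix{Float64}(I, size(X, 2), size(X, 2))
    traces = Any[]
    for _ in 1:n_iters
        for k in 1:num_classes   # update one class's weights at a time
            trace, _ = mh(trace, random_walk_proposal, (k, step_cov))
        end
        push!(traces, trace)
    end
    return traces
end

# Synthetic stand-in data (assumption): 30 samples, 5 features, 3 classes
Random.seed!(1)
X_demo = randn(30, 5)
y_demo = rand(1:3, 30)

# Constrain the label addresses to the observed values
observations = choicemap()
for (i, label) in enumerate(y_demo)
    observations[(:y, i)] = label
end

traces = run_mh(X_demo, observations, 3)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and `3` by `num_classes`. The number of iterations and the proposal step size here are placeholders that would need tuning.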

## System Info

In [7]:
using InteractiveUtils

println("Julia Version: ", VERSION)
println("System: ", Sys.KERNEL)
println("Architecture: ", Sys.ARCH)
println("CPU cores: ", Sys.CPU_THREADS)
println("Byte Order: ", ENDIAN_BOM == 0x04030201 ? "Little Endian" : "Big Endian")

using Pkg

Pkg.status("DataFrames")
Pkg.status("CSV")
Pkg.status("MLBase")
Pkg.status("Gen")
Pkg.status("Distributions")
Pkg.status("LinearAlgebra")

Julia Version: 1.9.2
System: Linux
Architecture: x86_64
CPU cores: 20
Byte Order: Little Endian
Status `~/.julia/environments/v1.9/Project.toml`
  [a93c6f00] DataFrames v1.6.0
Status `~/.julia/environments/v1.9/Project.toml`
  [336ed68f] CSV v0.10.11
Status `~/.julia/environments/v1.9/Project.toml`
  [f0e99cf1] MLBase v0.9.1
Status `~/.julia/environments/v1.9/Project.toml`
  [ea4f424c] Gen v0.4.5
Status `~/.julia/environments/v1.9/Project.toml`
  [31c24e10] Distributions v0.25.98
No Matches in `~/.julia/environments/v1.9/Project.toml`


## SPPL as a Pythonic alternative

- __SPPL__: Sum-Product Probabilistic Language

- __GitHub__: [https://github.com/probsys/sppl](https://github.com/probsys/sppl)

- __Paper__: [SPPL: Probabilistic Programming with Fast Exact Symbolic Inference](https://arxiv.org/abs/2010.03485)

- __Intro on SPNs__: [Visualizing and understanding Sum-Product Networks](https://link.springer.com/article/10.1007/s10994-018-5760-y)