# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook Template_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.8.2" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

In [None]:
versioninfo()

In [None]:
using Pkg

In [None]:
Pkg.add("Revise")
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("BenchmarkTools")
Pkg.add("FastaIO")
Pkg.add(url="https://github.com/bwbioinfo/KEGGAPI.jl")

In [None]:
using Revise
using DataFrames
using CSV
using FastaIO
using BenchmarkTools
using KEGGAPI

# Case 1: From Swiss-Prot ID to KEGG information

### 1. Convert an external database ID to a KEGG ID and vice versa



| Database        | DB identifier      |
|:----------------|:-------------------|
| UniProt ID      | "uniprot"          |
| NCBI Gene ID    | "ncbi-geneid"      |
| NCBI Protein ID | "ncbi-proteinid"   |
| KEGG ID         | "genes"            |
    

### 1.1 External identifiers used directly as input

To determine whether a protein or gene is in the KEGG database, the `conv` function takes the target database identifier (here "genes") and the identifier of interest prefixed with the DB identifier of its source database (for example "uniprot:A0A072UR65").

Only external identifiers with a hit in the KEGG database are returned.

In [None]:
@time kegg_conv_uniprot = KEGGAPI.conv("genes", "uniprot:A0A072UR65")
DataFrame(
    kegg_conv_uniprot.data,
    kegg_conv_uniprot.colnames
    )
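The `DataFrame(data, colnames)` call above suggests that `.data` holds one vector per column. As a minimal sketch of that shape, assuming the underlying response is tab-separated text with one mapping per line, the following Base-only code turns such text into column vectors. The function name `tsv_columns` and the sample identifiers are hypothetical placeholders, not KEGGAPI internals or real KEGG output.

```julia
# Hypothetical two-column, tab-separated text shaped like a conversion result.
# The identifiers are placeholders, not real KEGG output.
sample = "up:P00001\torg:1001\nup:P00002\torg:1002\n"

# Split the text into one vector per column, skipping empty lines.
function tsv_columns(text::AbstractString)
    rows = [split(line, '\t') for line in split(text, '\n') if !isempty(line)]
    ncols = isempty(rows) ? 0 : length(first(rows))
    return [[String(row[j]) for row in rows] for j in 1:ncols]
end

columns = tsv_columns(sample)
```

A vector of column vectors like `columns` can be combined with a vector of column names in the same way the cell above combines `.data` and `.colnames`.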

### 1.2 External database identifiers from a file as input

Several identifiers from the same database can be converted at once, either read from a file or passed as a single string of identifiers joined by the "+" sign.

Only external identifiers with a hit in the KEGG database are returned.

The sample dataset is a subset of the reviewed UniProt proteins (Swiss-Prot) dataset. Users can download the data and upload it to their session: https://www.kaggle.com/datasets/andreylovyagin/uniprot-proteins-reviewed-swissprot?select=data.csv

In [None]:
df = DataFrame(CSV.File("subset_data.csv", header=1, delim=","));

The `Entry` identifiers in the sample file belong to the UniProt database.

In [None]:
db = "uniprot:"
dbentry = string.(db, df.Entry)
entry = join(dbentry, "+")

kegg_conv_uniprot = KEGGAPI.conv("genes", entry)
DataFrame(
  kegg_conv_uniprot.data,
  kegg_conv_uniprot.colnames
)
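When the file holds many identifiers, a single "+"-joined string can become very long. The sketch below shows one way to build several shorter query strings with Base Julia only. The helper name `build_queries`, the sample accessions, and the batch size are all assumptions for illustration; the size the service actually accepts should be checked separately.

```julia
# Hypothetical accessions standing in for `df.Entry`.
entries = ["P00001", "P00002", "P00003", "P00004", "P00005"]

# Prefix each accession with its database name, then join each batch with "+".
# `batch_size` is an assumed value, not a documented limit.
function build_queries(ids; db = "uniprot", batch_size = 2)
    prefixed = string.(db, ":", ids)
    return [join(batch, "+") for batch in Iterators.partition(prefixed, batch_size)]
end

queries = build_queries(entries)
```

Each element of `queries` has the same form as `entry` in the cell above, so it could be passed to `KEGGAPI.conv("genes", query)` one batch at a time.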

### 1.3 Convert KEGG identifiers to an external database

To obtain the external database identifier of a KEGG gene, the `conv` function takes the DB identifier of the desired database and the KEGG gene identifier.

Several identifiers from the same database can be converted at once.

Only identifiers with a hit in the target database are returned.

In [None]:
@time ncbi_conv_kegg = KEGGAPI.conv("ncbi-proteinid", "mtr:25493984")
DataFrame(
  ncbi_conv_kegg.data,
  ncbi_conv_kegg.colnames
)

### 2. Gene information

To obtain gene information from the KEGG database, the `find` function takes the string "genes" and the KEGG gene identifier.

In [None]:
@time kegg_find_genes = KEGGAPI.find("genes", "mtr:25493984")
DataFrame(
  kegg_find_genes.data,
  kegg_find_genes.colnames
)

### 3. Get enzyme sequences (nucleotide and amino acid)

#### 3.1 Get nucleotide sequence and save to fasta file

With the `kegg_get` function, the user can get the nucleotide sequence of one or more genes by passing an array of KEGG gene identifiers and the string "ntseq".

The output of the function can be saved to a file using `FastaWriter` from the FastaIO package.

In [None]:
# Nucleotide sequence
@time kegg_ntseq = KEGGAPI.kegg_get(["mtr:25493984", "shz:shn_30305"], "ntseq")

@time FastaWriter("ntseq.fasta") do fw
    for ch in kegg_ntseq[2]
        write(fw, ch)
    end
end
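To check what ended up in the saved file, the FASTA text can be read back into (description, sequence) pairs. The following is a minimal Base-only sketch, not FastaIO's own reader. The function name `parse_fasta` and the sample record are hypothetical.

```julia
# Parse FASTA text into (description, sequence) pairs using only Base Julia.
function parse_fasta(text::AbstractString)
    records = Tuple{String,String}[]
    desc = nothing
    seq = IOBuffer()
    for raw in eachline(IOBuffer(String(text)))
        line = strip(raw)
        isempty(line) && continue
        if startswith(line, ">")
            # A new header closes the previous record, if there was one.
            desc === nothing || push!(records, (desc, String(take!(seq))))
            desc = String(line[2:end])
        elseif desc !== nothing
            # Sequence lines are concatenated without line breaks.
            print(seq, line)
        end
    end
    desc === nothing || push!(records, (desc, String(take!(seq))))
    return records
end

# Hypothetical record, not real KEGG output.
sample = ">org:1001 example gene\nATGGCC\nTTAA\n"
records = parse_fasta(sample)
```

For a file written by the cell above, the same function could be applied to `read("ntseq.fasta", String)`, assuming the file contains standard FASTA text.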

#### 3.2 Get amino acid sequence and save to fasta file

With the `kegg_get` function, the user can get the amino acid sequence of one or more genes by passing an array of KEGG gene identifiers and the string "aaseq".

The output of the function can be saved to a file using `FastaWriter` from the FastaIO package.

In [None]:
@time kegg_aaseq = KEGGAPI.kegg_get(["mtr:25493984", "shz:shn_30305"], "aaseq")

@time FastaWriter("aaseq.fasta") do fw
    for ch in kegg_aaseq[2]
        write(fw, ch)
    end
end

### 4. Ortholog group

To identify the ortholog group of the enzyme of interest, the `link` function takes the string "ko" and the KEGG gene identifier.

In [None]:
@time kegg_ko = KEGGAPI.link("ko", "mtr:25493984")
DataFrame(
  kegg_ko.data,
  kegg_ko.colnames
)

### 5. Reaction(s) catalyzed by gene of interest

To obtain the reactions associated with a gene through its KEGG ortholog group, the `link` function takes the string "reaction" and the KEGG ortholog (KO) number in the form "KXXXXX".

In [None]:
@time kegg_reaction = KEGGAPI.link("reaction", "K01183")
DataFrame(
  kegg_reaction.data,
  kegg_reaction.colnames
)

### 6. Reaction information

To obtain information about a reaction, the `kegg_get` function takes an array of KEGG reaction identifiers in the form "rn:RXXXXX".

In [None]:
@time kegg_reaction_info = KEGGAPI.kegg_get([kegg_reaction.data[2][1]])
split(kegg_reaction_info[2][1], "\n")
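Splitting on "\n" gives raw lines; grouping them by field name makes the entry easier to query. The sketch below assumes a flat-file layout in which field names occupy the first 12 columns, continuation lines start with spaces, and "///" ends the entry. That layout, the function name `parse_flat_entry`, and the sample entry are assumptions for illustration, not real KEGG output.

```julia
# Group the lines of a flat-file style entry by field name.
# Assumes field names sit in columns 1-12 and values start at column 13.
function parse_flat_entry(text::AbstractString)
    fields = Dict{String,Vector{String}}()
    current = ""
    for line in split(text, '\n')
        (isempty(strip(line)) || startswith(line, "///")) && continue
        key = strip(first(line, 12))
        value = String(strip(chop(line, head = 12, tail = 0)))
        # A non-empty key starts a new field; otherwise the line continues the last one.
        isempty(key) || (current = String(key))
        isempty(current) && continue
        push!(get!(fields, current, String[]), value)
    end
    return fields
end

# Hypothetical entry built with explicit 12-column padding.
sample = join([
    rpad("ENTRY", 12) * "R00000 Reaction",
    rpad("NAME", 12) * "example reaction",
    rpad("DEFINITION", 12) * "A + B",
    rpad("", 12) * "<=> C",
    "///",
], "\n")

fields = parse_flat_entry(sample)
```

With a real entry, `fields["DEFINITION"]` would then hold every line of that field, provided the entry follows the assumed layout.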

### 7. Pathway(s) that include the gene of interest

To obtain the pathways associated with a gene, the `link` function takes the string "pathway" and the KEGG gene identifier.

In [None]:
@time kegg_pathways = KEGGAPI.link("pathway", "mtr:25493984")
DataFrame(
  kegg_pathways.data,
  kegg_pathways.colnames
)
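If the identifiers returned here are organism-specific (for example of the form "path:mtr00520"), they may need to be rewritten to the reference-map form "path:mapXXXXX" before being used in the following sections. The helper below is a sketch under that assumption. The name `to_reference_map` is hypothetical, and strings that do not match the assumed pattern are returned unchanged.

```julia
# Rewrite "path:" + lowercase prefix + five digits to "path:map" + the same digits.
# Anything that does not match this assumed pattern is returned unchanged.
to_reference_map(id::AbstractString) =
    replace(id, r"^path:[a-z]+(\d{5})$" => s"path:map\1")

# Illustrative identifiers only.
ids = ["path:mtr00520", "path:map00520", "K01183"]
reference_ids = to_reference_map.(ids)
```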

### 8. Obtain pathway information

To collect information about a pathway, the `find` function requires as input the string "pathway" and the KEGG pathway identifier in the form "path:mapXXXXX".

In [None]:
@time kegg_pathway_find = KEGGAPI.find("pathway", "path:map00520")
DataFrame(
  kegg_pathway_find.data,
  kegg_pathway_find.colnames
)

### 9. Download the pathway of interest

The `get_image` function downloads a pathway image. Its input is the pathway identifier in the form "path:mapXXXXX".

The `save_image` function saves the image to a PNG file. Its inputs are the downloaded image and a string with the file name, including the ".png" extension.

In [None]:
@time kegg_image = KEGGAPI.get_image("path:map00520")
@time KEGGAPI.save_image(kegg_image, "aminoacid.png")

### 10. Visualize saved pathway

In [None]:
Pkg.add("TestImages")
Pkg.add("Images")
Pkg.add("FileIO")
Pkg.add("Colors")

In [None]:
using Images, TestImages, Colors
img = load("aminoacid.png")

### 11. Ortholog genes

Identify all genes related to the KEGG ortholog group using the link function. The input is the string "genes" and the KEGG ortholog group as "KXXXXX".

In [None]:
@time kegg_ko_genes = KEGGAPI.link("genes", "K01183")
DataFrame(
  kegg_ko_genes.data,
  kegg_ko_genes.colnames
)

### 12. Save ortholog gene sequences to file for downstream analysis

#### 12.1 Get nucleotide sequence and save to fasta file

With the `kegg_get` function, the user can get the nucleotide sequence of one or more genes by passing an array of KEGG gene identifiers and the string "ntseq".

The output of the function can be saved to a file using `FastaWriter` from the FastaIO package.

In [None]:
@time kegg_ntseq = KEGGAPI.kegg_get(kegg_ko_genes.data[2][1:50], "ntseq")

@time FastaWriter("MSA_ntseq.fasta") do fw
    for ch in kegg_ntseq[2]
        write(fw, ch)
    end
end
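The slice `[1:50]` in the cell above assumes the ortholog group has at least 50 genes. For a smaller group it would throw a `BoundsError`. A minimal sketch of a safer selection uses `first(collection, n)` from Base (available on Julia 1.6 and later); the gene identifiers below are hypothetical placeholders.

```julia
# Hypothetical gene identifiers standing in for `kegg_ko_genes.data[2]`.
genes = ["org:1001", "org:1002", "org:1003"]

# genes[1:50] would throw a BoundsError here, because only 3 entries exist.
# first(genes, 50) returns at most 50 entries and accepts shorter input.
subset = first(genes, 50)
```

Passing `first(kegg_ko_genes.data[2], 50)` to `kegg_get` would then work for ortholog groups of any size, assuming `.data[2]` is an ordinary vector.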

#### 12.2 Get amino acid sequence and save to fasta file

With the `kegg_get` function, the user can get the amino acid sequence of one or more genes by passing an array of KEGG gene identifiers and the string "aaseq".

The output of the function can be saved to a file using `FastaWriter` from the FastaIO package.

In [None]:
@time kegg_aaseq = KEGGAPI.kegg_get(kegg_ko_genes.data[2][1:50], "aaseq")

@time FastaWriter("MSA_aaseq.fasta") do fw
    for ch in kegg_aaseq[2]
        write(fw, ch)
    end
end