# Serratus Data Migration -- v200623
```
Lead     : ababaian
Issue    : #83
start    : 2020 06 23
complete : 2020 06 xx
files    : ~/serratus/notebook/200623_ab/
s3 files : s3://serratus-public/lovelywater2/
s3 files : s3://lovelywater2/
```

# Release updates

- `200612_qc`: Second batch of viral-metagenomes
- `200613_mu`: Remaining murine samples
- `200613_inv`: Invertebrate samples, batch 1
- `200620_inv`: Bat + Invertebrate samples, batch 2


In [1]:
WORKDIR='serratus/notebook/200623_ab'
mkdir -p $WORKDIR; cd $WORKDIR




## `s3://lovelywater2/README.md`

See: [Data Release Wiki](https://github.com/ababaian/serratus/wiki/Access-Data-Release)

# Migrate .SraRunInfo files


## `s3://lovelywater2/sra/README.md`

See: [SRA Queries Wiki](https://github.com/ababaian/serratus/wiki/SRA-queries)

In [None]:
# Performed on EC2
# ec2-3-235-55-90.compute-1.amazonaws.com
# login as lovelywater2 IAM
aws configure
aws configure set default.s3.max_concurrent_requests 100

# Example 
aws s3 sync --quiet --acl "public-read" \
  s3://serratus-public/lovelywater2/ \
  s3://lovelywater2/


In [None]:
# SRA RunInfo Files
# - Invertebrates
aws s3 cp \
  s3://serratus-public/notebook/200613_ab//invert_SraRunInfo.csv \
  ./inv_SraRunInfo.csv

# - all bat
aws s3 cp \
  s3://serratus-public/notebook/200620_ab/bat_SraRunInfo.csv \
  ./bat_SraRunInfo.csv

# - virome 2
aws s3 cp \
  s3://serratus-public/out/200612_qc/viro2_SraRunInfo.csv \
  ./viro_SraRunInfo.csv
  
# - scRNA (control)
aws s3 cp \
  s3://serratus-public/tmp/scRNA_SraRunInfo.csv \
  ./scRNA_SraRunInfo.csv

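# Download the previous md5sum manifest, append the new checksums, re-upload
aws s3 cp s3://lovelywater2/sra/sra.md5sum ./
md5sum *.csv >> sra.md5sum
aws s3 cp sra.md5sum s3://lovelywater2/sra/sra.md5sum

# Count lines, then compress (CSV files only, so sra.md5sum stays uncompressed)
wc -l *.csv
gzip *.csv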

```
      2823 bat_SraRunInfo.csv
   2193741 inv_SraRunInfo.csv
   1096932 scRNA_SraRunInfo.csv
     22252 viro_SraRunInfo.csv
   3315748 total

2d2998b585f6b5035b051b0960692c96  hu_SraRunInfo.csv
8224e6cea6afe2d4da73c23d5804ddd4  hu_meta_SraRunInfo.csv
499fa3d5a1fa8cf86efce1925c7e27fd  mamm_SraRunInfo.csv
a9e14f6043f70e485ebebeb81ace8da7  mu_SraRunInfo.csv
e39b50b78465f7e12676ef18d179de5f  vert_SraRunInfo.csv
1108e9cda3e07b55b19ece9ee8ac4dca  bat_SraRunInfo.csv
ccd2bc301495cddf11a95e63e746ce8f  inv_SraRunInfo.csv
d54a86323896e1a0f97c7403b2c85e69  scRNA_SraRunInfo.csv
e9222b54cee8a65bc3781589f5cbf642  viro_SraRunInfo.csv
```
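
The checksums above were computed on the uncompressed `.csv` files, while the uploaded copies are gzipped. A minimal sketch of how a downloaded copy could be checked against `sra.md5sum` follows. It assumes GNU coreutils `md5sum` (for `--ignore-missing`); `verify_runinfo` is a hypothetical helper name and not part of this notebook.

```shell
# Sketch: check gzipped RunInfo files against an md5sum manifest.
# Assumes GNU coreutils `md5sum` and a manifest listing the uncompressed
# *.csv names (as produced by `md5sum *.csv`).
# verify_runinfo DIR MANIFEST  (hypothetical helper)
verify_runinfo() {
  local dir manifest scratch gz status=0
  dir="$(cd "$1" && pwd)"
  manifest="$(cd "$(dirname "$2")" && pwd)/$(basename "$2")"
  scratch="$(mktemp -d)"

  # Decompress into a scratch directory so the originals stay untouched
  for gz in "$dir"/*.csv.gz; do
    [ -e "$gz" ] || continue
    gzip -dc "$gz" > "$scratch/$(basename "${gz%.gz}")"
  done

  # --ignore-missing: skip manifest entries for files that are not present
  (cd "$scratch" && md5sum --ignore-missing --quiet -c "$manifest") || status=1

  rm -rf "$scratch"
  return "$status"
}
```

For example, `verify_runinfo . ./sra.md5sum` returns non-zero if any decompressed file does not match its recorded checksum.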

In [None]:
aws s3 sync --quiet --acl "public-read" \
  ./ \
  s3://lovelywater2/sra/

# Migrate data files


In [None]:
aws configure set default.s3.max_concurrent_requests 100

# perform these in 4x `screen` to maximize CPU usage
# Virome 2
aws s3 sync --quiet --acl "public-read" \
  s3://serratus-public/out/200612_qc/bam/ \
  s3://lovelywater2/bam/ &
aws s3 sync --quiet --acl "public-read" \
  s3://serratus-public/out/200612_qc/summary/ \
  s3://lovelywater2/summary/

# Murine
aws s3 sync --quiet  --acl "public-read" \
  s3://serratus-public/out/200613_mu/bam/ \
  s3://lovelywater2/bam/ &
aws s3 sync --quiet  --acl "public-read" \
  s3://serratus-public/out/200613_mu/summary/ \
  s3://lovelywater2/summary/

# Invertebrates 1
aws s3 sync --quiet  --acl "public-read" \
  s3://serratus-public/out/200613_inv/bam/ \
  s3://lovelywater2/bam/ &
aws s3 sync --quiet  --acl "public-read" \
  s3://serratus-public/out/200613_inv/summary/ \
  s3://lovelywater2/summary/
  
# Invertebrates 2
aws s3 sync --quiet  --acl "public-read" \
  s3://serratus-public/out/200620_inv/bam/ \
  s3://lovelywater2/bam/ &
aws s3 sync --quiet  --acl "public-read" \
  s3://serratus-public/out/200620_inv/summary/ \
  s3://lovelywater2/summary/
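
The four batches above follow the same pattern, so the commands could also be generated from a list of batch names. The following is a rough sketch only: `sync_cmds` is a hypothetical helper, it assumes every batch is copied with the `public-read` ACL, and it prints the commands rather than running them so they can be reviewed first.

```shell
# Sketch: print the per-batch sync commands instead of writing them by hand.
# sync_cmds BATCH...  (hypothetical helper; prints commands, runs nothing)
sync_cmds() {
  local src="s3://serratus-public/out" dst="s3://lovelywater2"
  local batch kind
  for batch in "$@"; do
    for kind in bam summary; do
      printf 'aws s3 sync --quiet --acl "public-read" %s/%s/%s/ %s/%s/\n' \
        "$src" "$batch" "$kind" "$dst" "$kind"
    done
  done
}

# Review the generated commands for the batches in this release
sync_cmds 200612_qc 200613_mu 200613_inv 200620_inv
```

Once the printed commands look right, the output can be piped to `bash` to run them one after another.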
  

# README + index.tsv

In [None]:
# Index
# Download a list of all summary files as index
aws s3 ls s3://lovelywater2/summary/ > index.tsv

aws s3 cp --quiet --acl "public-read" \
  index.tsv s3://lovelywater2/index.tsv
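
Note that `index.tsv` holds the raw `aws s3 ls` listing (date, time, size, key separated by spaces), not tab-separated columns. A small sketch of how the list of processed runs could be pulled out of it is shown here. It assumes the summary objects are named `<accession>.summary`; `list_accessions` is a hypothetical helper name.

```shell
# Sketch: extract run accessions from an `aws s3 ls` listing.
# Assumes keys of the form <accession>.summary (an assumption, see above).
# list_accessions INDEX  (hypothetical helper)
list_accessions() {
  # Object rows have 4 fields (date, time, size, key);
  # "PRE prefix/" rows have only 2 and are skipped.
  awk 'NF >= 4 { print $4 }' "$1" | sed 's/\.summary$//' | sort -u
}
```

For example, `list_accessions index.tsv > accessions.txt` would write one accession per line.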

In [None]:
# README
# README.md and sra/README.md copied from wiki
sudo yum install -y git
git clone https://github.com/ababaian/serratus.wiki.git

# Copy from the local wiki clone to S3
aws s3 cp --acl "public-read" \
  serratus.wiki/Access-Data-Release.md \
  s3://serratus-public/lovelywater2/README.md
  
aws s3 cp --acl "public-read" \
  serratus.wiki/SRA-queries.md \
  s3://serratus-public/lovelywater2/sra/README.md
  

## CC0 - Data Licensing

The CC0 license text was taken from the GitHub template, saved as `LICENSE.md`, and will be included in the `s3://lovelywater2` bucket. Providing an explicit license is part of adhering to the FAIR principles.



In [None]:
aws s3 cp --acl "public-read" \
  ./LICENSE.md \
  s3://lovelywater2/LICENSE.md