## Change cell names in S3 and move to new S3 bucket

A notebook for renaming the files for a bunch of cells in S3 and, optionally, moving them to a new bucket. Run it in an environment with the `utilities` repo installed.

In [13]:
import utilities.s3_util as s3u # importing this module of functions for working with s3

import os # importing this to work with filenames

from collections import Counter # for counting things

First we need a list of all the files we want to rename. Depending on where they are, that might be easy or hard.

The function `s3u.get_files` takes a bucket name (e.g. `czb-seqbot`) and a **prefix** (e.g. `fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley`) and returns a generator of all the file names under that path. You can further filter them by matching against the extension or something else:

In [3]:
fastq_files = [
    fn for fn in s3u.get_files(
        bucket='czb-seqbot', 
        prefix='fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley'
    )
    if fn.endswith('.fastq.gz')
]
len(fastq_files) # how many files?

4216

In [4]:
fastq_files[::1000] # look at every 1000th file

['fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/A10_B009246_S10_R1_001.fastq.gz',
 'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/D5_B009315_S245_R1_001.fastq.gz',
 'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/H21_B009320_S225_R1_001.fastq.gz',
 'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/L18_B009469_S102_R1_001.fastq.gz',
 'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/P14_B009247_S158_R1_001.fastq.gz']

In [8]:
# get just the basename (i.e. no folders)
base_fns = [os.path.basename(fn) for fn in fastq_files]

base_fns[::1000]

['A10_B009246_S10_R1_001.fastq.gz',
 'D5_B009315_S245_R1_001.fastq.gz',
 'H21_B009320_S225_R1_001.fastq.gz',
 'L18_B009469_S102_R1_001.fastq.gz',
 'P14_B009247_S158_R1_001.fastq.gz']

In [12]:
# everything has the same number of underscores?
# it's good to check this instead of assuming, because
# we use this to grab the plate name
Counter(fn.count('_') for fn in base_fns)

Counter({4: 4216})

In [14]:
# count the plate names, since we know it's always after the first underscore
Counter(fn.split('_')[1] for fn in base_fns)

Counter({'B009246': 702,
         'B009247': 704,
         'B009315': 702,
         'B009319': 702,
         'B009320': 702,
         'B009469': 704})
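
Since every basename has exactly four underscores, the fields can also be pulled apart by name rather than by index. The following is only a sketch: `parse_fastq_name` and the field names (`well`, `plate`, `sample`, `read`, `chunk`) are hypothetical labels chosen for illustration. The notebook itself only relies on the plate name being the second field.

```python
from typing import NamedTuple


class FastqName(NamedTuple):
    # Field names are assumptions for illustration, not part of `utilities`.
    well: str    # e.g. 'A10'
    plate: str   # e.g. 'B009246'
    sample: str  # e.g. 'S10'
    read: str    # e.g. 'R1'
    chunk: str   # e.g. '001'


def parse_fastq_name(base_fn: str) -> FastqName:
    """Split a basename like 'A10_B009246_S10_R1_001.fastq.gz' into its fields."""
    suffix = '.fastq.gz'
    if not base_fn.endswith(suffix):
        raise ValueError(f'not a .fastq.gz file: {base_fn!r}')
    fields = base_fn[:-len(suffix)].split('_')
    # Fail loudly instead of silently grabbing the wrong field.
    if len(fields) != 5:
        raise ValueError(f'expected 5 underscore-separated fields: {base_fn!r}')
    return FastqName(*fields)


parsed = parse_fastq_name('A10_B009246_S10_R1_001.fastq.gz')
print(parsed.plate)  # B009246
```

Raising on an unexpected shape plays the same role as the underscore count above: a malformed name stops the run before anything is renamed.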

In [17]:
# all the files with a given plate name
old_names = [fn for fn in fastq_files if '_B009246_' in fn]

# just using str.replace to rename the plates
# ('_B??????_' is a placeholder: put the new plate name here)
new_names = [fn.replace('_B009246_', '_B??????_') for fn in old_names]

print(len(old_names), len(new_names))
old_names[::100], new_names[::100]

702 702


(['fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/A10_B009246_S10_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/C16_B009246_S64_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/E21_B009246_S117_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/G6_B009246_S150_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/J12_B009246_S228_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/L19_B009246_S283_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/N3_B009246_S15_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/P9_B009246_S69_R1_001.fastq.gz'],
 ['fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/A10_B??????_S10_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/C16_B??????_S64_R1_001.fastq.gz',
  'fastqs/181120_A00111_0230_BHGLJGDMXX/rawdata/Ashley/E21_B??????_S117_R1_001.fastq.gz',
  'fastqs/181120_A

The function `s3u.copy_files` takes five arguments:
  
  - `src_list` is the list of the files you want to copy, including the full path
  - `dest_list` is their new names, in the same order as the source
  - `b` is the bucket for the original files
  - `nb` is the bucket for the new copies (you can just use the same bucket)
  - `n_proc` tells the function how many processes to run. The copying doesn't really use your CPU much, so you can set this to 2-4 times the number of CPUs on your machine to speed things up.
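
Because the source and destination lists are matched up by position, it may be worth checking them before copying anything. The helper below is a minimal sketch, not part of `utilities`: `check_copy_pairs` is a hypothetical name, and it only inspects the two lists without touching S3.

```python
def check_copy_pairs(src_list, dest_list):
    """Return a list of problems found in a source/destination pairing."""
    problems = []
    if len(src_list) != len(dest_list):
        problems.append(
            f'length mismatch: {len(src_list)} sources, {len(dest_list)} destinations'
        )
    # Two sources mapped to one destination would overwrite each other.
    if len(set(dest_list)) != len(dest_list):
        problems.append('duplicate destination names')
    # Only a problem when copying within the same bucket: the rename did nothing.
    unchanged = [s for s, d in zip(src_list, dest_list) if s == d]
    if unchanged:
        problems.append(f'{len(unchanged)} files have identical source and destination names')
    return problems


# '_PLATE1_' is a made-up plate name for the example.
src = ['raw/A10_B009246_S10_R1_001.fastq.gz', 'raw/A10_B009246_S10_R2_001.fastq.gz']
dest = [fn.replace('_B009246_', '_PLATE1_') for fn in src]
print(check_copy_pairs(src, dest))  # []
```

An empty list means none of these three checks fired. It says nothing about whether the destination keys already exist in the bucket.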

In [None]:
s3u.copy_files(old_names, new_names, b='czb-seqbot', nb='czb-seqbot', n_proc=4)

In [None]:
# you could take this opportunity to move the files into your own bucket, if you like
# build destination paths for the new bucket, then pass `nb='darmanis-group'` to `s3u.copy_files`
new_names2 = [
    os.path.join('some/path/inside/darmanis-group/', os.path.basename(fn))
    for fn in new_names
]
# e.g. s3u.copy_files(old_names, new_names2, b='czb-seqbot', nb='darmanis-group', n_proc=4)

The function `s3u.remove_files` deletes things from S3, so use it with care. Depending on where in S3 you are trying to delete from, you might not be allowed to do this. It takes four arguments:

 - `file_list` is the list of files to delete (including path but not bucket)
 - `b` is the bucket to delete from
 - `really` is a safety flag to make sure you're thinking about this; it must be `True` to delete
 - `n_proc` is similar to above: you can use more processes to speed this up
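
Since deleted files cannot be recovered this way, one cautious option is to delete only the originals whose renamed copy is confirmed to exist. The sketch below assumes you have a fresh listing of the destination (for example from `s3u.get_files` on the destination bucket and prefix). `files_safe_to_delete` is a hypothetical helper, and it checks only that the key is present, not its size or contents.

```python
def files_safe_to_delete(old_names, new_names, existing_keys):
    """Return the old names whose renamed copy appears in existing_keys."""
    existing = set(existing_keys)
    return [old for old, new in zip(old_names, new_names) if new in existing]


# Made-up names for the example; pretend only the first copy finished.
old = ['raw/a_B1_S1_R1_001.fastq.gz', 'raw/b_B1_S2_R1_001.fastq.gz']
new = ['raw/a_P1_S1_R1_001.fastq.gz', 'raw/b_P1_S2_R1_001.fastq.gz']
listing = ['raw/a_P1_S1_R1_001.fastq.gz']
print(files_safe_to_delete(old, new, listing))  # ['raw/a_B1_S1_R1_001.fastq.gz']
```

The returned list can then be passed as `file_list`, so any file whose copy is missing stays in place.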

In [None]:
# remove the old versions of these files
s3u.remove_files(old_names, b='czb-seqbot', really=True, n_proc=4)

You could rerun this a few times to rename different sets of plates, or you could rewrite it into a loop to do it all in one go.
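
One way to write that loop is to keep a mapping from old plate names to new ones and build both lists in a single pass. This is only a sketch: `build_rename_lists` is a hypothetical helper, and the new plate names in the example are invented.

```python
def build_rename_lists(files, plate_map):
    """Pair each file that mentions a plate in plate_map with its renamed path."""
    old_names, new_names = [], []
    for fn in files:
        for old_plate, new_plate in plate_map.items():
            old_tag = f'_{old_plate}_'
            if old_tag in fn:
                old_names.append(fn)
                new_names.append(fn.replace(old_tag, f'_{new_plate}_'))
                break  # each file belongs to one plate
    return old_names, new_names


# Invented new plate names; files for plates not in the map are left alone.
plate_map = {'B009246': 'PLATE1', 'B009247': 'PLATE2'}
files = [
    'raw/A10_B009246_S10_R1_001.fastq.gz',
    'raw/A10_B009247_S11_R1_001.fastq.gz',
    'raw/A10_B009315_S12_R1_001.fastq.gz',
]
old_all, new_all = build_rename_lists(files, plate_map)
print(len(old_all), len(new_all))  # 2 2
```

The resulting lists can be passed to `s3u.copy_files` and, once the copies are confirmed, to `s3u.remove_files` exactly as in the cells above.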