Aspen Coyle

afcoyle@uw.edu

Roberts Lab, UW-SAFS

2021-12-06

# 1_1_downloading_jensen_data.ipynb

Upon her retirement from NOAA, Pam Jensen sent over a large number of samples. Nearly all are in sealed 96-well plates.

A fairly brief examination of the plate labels revealed the following:

**Collection dates:** From ~2003 to 2019, with most collected in the 2010s

**Species:** Almost all are presumed _C. bairdi_ or _C. opilio_, but a few are confirmed to be the following:
- Hybrid _Chionoecetes_ (_C. bairdi_ x _C. opilio_)
- _C. tanneri_ (grooved Tanner crab)
- _Paralithodes camtschaticus_ (red king crab)
- _Paralithodes platypus_ (blue king crab)
- _Hyas_ sp. (lyre crabs)
- _Lithodes couesi_ (scarlet king crab)

**Location:** Samples were generally collected on surveys. Many are from the eastern Bering Sea (EBS), with a good number from southeast Alaska (SEAK) and several from the Gulf of Alaska (GOA) and northern Bering Sea (NBS). There are also several from Newfoundland, the Sea of Okhotsk, and the Chukchi Sea. For many plates, location cannot be determined from the plate alone, as it is marked only with the project number.

**Preservation method:** 100% EtOH, with minimal exceptions

**Sample contents:** Hemolymph samples with a few tissue samples



Along with the physical samples, Pam sent a flash drive containing a large amount of information, including the collection details for these samples. **In this script, we will download all data from that flash drive.**

In [1]:
# Paths will originate relative to the scripts/ directory
!pwd

/mnt/c/Users/acoyl/Documents/GitHub/historical_hemat/scripts


In [3]:
!wget --no-check-certificate --no-parent --recursive --reject "index.html*" -P ../data/jensen_data/ https://gannet.fish.washington.edu/hematodinium/

--2021-12-06 20:17:05--  https://gannet.fish.washington.edu/hematodinium/
Resolving gannet.fish.washington.edu (gannet.fish.washington.edu)... 128.95.149.52, 140.142.5.5, 173.250.227.69, ...
Connecting to gannet.fish.washington.edu (gannet.fish.washington.edu)|128.95.149.52|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 2530 (2.5K) [text/html]
Saving to: ‘../data/jensen_data/gannet.fish.washington.edu/hematodinium/index.html.tmp’


2021-12-06 20:17:05 (232 MB/s) - ‘../data/jensen_data/gannet.fish.washington.edu/hematodinium/index.html.tmp’ saved [2530/2530]

Loading robots.txt; please ignore errors.
--2021-12-06 20:17:05--  https://gannet.fish.washington.edu/robots.txt
Reusing existing connection to gannet.fish.washington.edu:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-06 20:17:05 ERROR 404: Not Found.

Removing ../data/jensen_data/gannet.fish.washington.edu/hematodinium/index.html.tmp 

This downloaded a series of files into a directory named jensen_data. However, it put the actual files several levels deeper, in jensen_data/gannet.fish.washington.edu/hematodinium/[data_is_here]. We want to move everything in that hematodinium folder up into our jensen_data directory.



In [4]:
!mv ../data/jensen_data/gannet.fish.washington.edu/hematodinium/* ../data/jensen_data/
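The same flattening step can be sketched in Python for anyone who prefers to avoid shell globbing. This is a minimal sketch, not part of the original workflow: `flatten_download` is a hypothetical helper name, and it assumes the nested folder layout described above. Unlike `mv` with `*`, it also moves hidden entries, and it refuses to overwrite anything already present in the target.

```python
import shutil
from pathlib import Path


def flatten_download(nested: Path, target: Path) -> list:
    """Move every entry in `nested` up into `target` and return the new paths.

    Hypothetical helper: mirrors the `mv nested/* target/` cell above.
    """
    moved = []
    for entry in sorted(nested.iterdir()):
        destination = target / entry.name
        if destination.exists():
            # Refuse to overwrite rather than silently clobber existing data
            raise FileExistsError(f"{destination} already exists")
        shutil.move(str(entry), str(destination))
        moved.append(destination)
    return moved
```

For this notebook, the call would be along the lines of `flatten_download(Path("../data/jensen_data/gannet.fish.washington.edu/hematodinium"), Path("../data/jensen_data"))`.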

We'll now remove irrelevant folders for tidiness.

In [6]:
# Now-empty directory we moved all files from
!rm -r ../data/jensen_data/gannet.fish.washington.edu/

# Empty folder - on Gannet, links to base directory
!rmdir '../data/jensen_data/@eaDir'

In [7]:
!ls ../data/jensen_data/

'Hemato protocols'	       'Roberts sample trans July 20.docx'
'Hemato samples'	       'Samples transferred to Steven 072820.xlsx'
'Lauth etal 2019 EBS rpt.pdf'  'Special Projects'
'NPRB transcriptome'


A lot of these files and folders have spaces in the names. We want to replace them with underscores to make them easier to reference. This is recursive, and will rename all sub-directory files and folders as well.

In [8]:
!find ../data/jensen_data/ -depth -name '* *' -execdir bash -c 'for i; do mv "$i" "${i// /_}"; done' _ {} +
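The `-depth` flag matters here: it makes `find` handle the contents of a folder before the folder itself, so a parent is never renamed while paths beneath it are still pending. A rough Python equivalent is sketched below for illustration. `replace_spaces` is a hypothetical helper name, and the sketch assumes no underscore-named twin already exists for any file (it raises rather than overwriting if one does).

```python
import os
from pathlib import Path


def replace_spaces(root: Path) -> int:
    """Rename every file and folder under `root` whose name contains a space.

    Hypothetical helper: returns the number of entries renamed.
    `root` itself is never renamed.
    """
    renamed = 0
    # topdown=False visits the deepest directories first, so a parent folder
    # is renamed only after everything inside it has been handled
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if " " not in name:
                continue
            old = Path(dirpath) / name
            new = Path(dirpath) / name.replace(" ", "_")
            if new.exists():
                raise FileExistsError(f"{new} already exists")
            old.rename(new)
            renamed += 1
    return renamed
```

Calling `replace_spaces(Path("../data/jensen_data"))` would be the counterpart of the `find` cell above.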

## File Renaming and Removal

For the sake of cleanliness, we will note all file renaming and removal here.

In [None]:
## Remove a duplicated Excel file - also present in jensen_data/NPRB_transcriptome with the same name
!rm ../data/jensen_data/Hemato_samples/Collection_Plate_Information_110217.xlsx
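Before deleting a file as a duplicate, it can be worth confirming that the two copies really are byte-for-byte identical rather than just sharing a name. The sketch below shows one way to do that with the standard library. `remove_if_duplicate` is a hypothetical helper, not something the notebook originally ran.

```python
import filecmp
from pathlib import Path


def remove_if_duplicate(candidate: Path, reference: Path) -> bool:
    """Delete `candidate` only if its contents match `reference` exactly.

    Hypothetical helper: returns True if the file was deleted.
    """
    if candidate.resolve() == reference.resolve():
        # Guard against deleting the only copy
        raise ValueError("candidate and reference are the same file")
    # shallow=False forces a content comparison instead of trusting os.stat()
    if filecmp.cmp(candidate, reference, shallow=False):
        candidate.unlink()
        return True
    return False
```

Here `candidate` would be the copy in Hemato_samples and `reference` the copy of the same name in NPRB_transcriptome.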

**Special_Projects folder:**
Removing irrelevant folders and items

In [None]:
# 2006: Contains many non-BCS (bitter crab syndrome) projects, plus some irrelevant files within the BCS project folder.
!rm -r ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/CephParasites/
!rm -r ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/FRPparasite/
!rm -r ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/Ichthyophonus/
!rm -r ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/TissueLibrary/
!rm "../data/jensen_data/Special_Projects/Special_Project_Requests_2006/BCS/100%_EtOH_Field_SOP.doc"
!rm "../data/jensen_data/Special_Projects/Special_Project_Requests_2006/BCS/MSDS100%ETOH.pdf"
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/BCS/BCS_datasheet.tif
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/BCS/2006_BCS_Special_Project_Appl2.doc
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2006/BCS/Thumbs.db

In [None]:
# 2007: Contains some non-BCS projects. Delete every file except the BCS project request
!find ../data/jensen_data/Special_Projects/Special_Project_Requests_2007/* -type f ! -wholename "../data/jensen_data/Special_Projects/Special_Project_Requests_2007/2007_FRP_BCS_Special_Project.doc" -delete
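The `find` call above keeps a single file and deletes every other regular file beneath the 2007 folder, leaving any now-empty subdirectories in place. The same "keep one, delete the rest" logic is sketched below in Python. `delete_all_files_except` is a hypothetical helper name, and returning the deleted paths is an addition so the result can be reviewed.

```python
from pathlib import Path


def delete_all_files_except(folder: Path, keep: Path) -> list:
    """Delete every regular file under `folder` except `keep`.

    Hypothetical helper: directories are left in place, and symlinks are
    skipped, matching `find -type f`.
    """
    keep = keep.resolve()
    deleted = []
    # sorted() collects the full listing before anything is deleted
    for path in sorted(folder.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        if path.resolve() != keep:
            path.unlink()
            deleted.append(path)
    return deleted
```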

In [22]:
# 2008: Only potentially BCS-related projects. Still, a few files are irrelevant
# and others are duplicated as both Word and PDF documents
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2008/2008_FRP_*.doc
!rm "../data/jensen_data/Special_Projects/Special_Project_Requests_2008/SpecialProjectApp&Instructions_08-2-1.doc"

In [23]:
# 2009: Remove only the project application instructions
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2009/SpecialProjectApp\&Instructions_09.doc

In [24]:
# 2010: none
# 2011: Ichthyophonus project instructions and project application instructions
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2011/2011_EBS_Ichthyophonus_Special_Project.doc
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2011/Special_Project_App_2011.doc

In [26]:
# 2012: No removals. However, both PDFs misspell "Chukchi". Rename them for easier searching
!mv ../data/jensen_data/Special_Projects/Special_Project_Requests_2012/2012_Chuckchi_Sea_BCS_Special_Project.pdf ../data/jensen_data/Special_Projects/Special_Project_Requests_2012/2012_Chukchi_Sea_BCS_Special_Project.pdf
!mv ../data/jensen_data/Special_Projects/Special_Project_Requests_2012/2012_Chuckshi_Sea_Pathology_Special_Project.pdf ../data/jensen_data/Special_Projects/Special_Project_Requests_2012/2012_Chukchi_Sea_Pathology_Special_Project.pdf
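Since the two files use different misspellings ("Chuckchi" and "Chuckshi"), a small pattern-based rename is one way to catch both without typing each full path twice. This is an illustrative sketch only: `fix_chukchi_spelling` is a hypothetical helper, and the pattern covers just the two misspellings seen in these file names.

```python
import re
from pathlib import Path

# Matches the two misspellings seen in the 2012 file names:
# "Chuckchi" and "Chuckshi"
MISSPELLING = re.compile(r"Chuck[cs]hi")


def fix_chukchi_spelling(folder: Path) -> list:
    """Rename entries in `folder` so misspellings of "Chukchi" are corrected.

    Hypothetical helper: returns the corrected paths.
    """
    renamed = []
    for path in sorted(folder.iterdir()):
        corrected = MISSPELLING.sub("Chukchi", path.name)
        if corrected == path.name:
            continue
        target = path.with_name(corrected)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append(target)
    return renamed
```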

In [None]:
# 2013-2016: none
# 2017-2018: One Thumbs.db (Windows thumbnail cache) file each
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2017/Thumbs.db
!rm ../data/jensen_data/Special_Projects/Special_Project_Requests_2018/Thumbs.db
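Thumbs.db files turn up in several of the folders above, so a recursive sweep is an alternative to removing them one at a time. The sketch below is a hypothetical helper (`remove_thumbnail_caches`), not what the notebook originally ran. It matches the exact name "Thumbs.db" and returns what it deleted.

```python
from pathlib import Path


def remove_thumbnail_caches(root: Path) -> list:
    """Delete every file named Thumbs.db anywhere under `root`.

    Hypothetical helper: returns the paths that were removed.
    """
    removed = []
    # sorted() collects all matches before any file is deleted
    for path in sorted(root.rglob("Thumbs.db")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Pointing it at `Path("../data/jensen_data/Special_Projects")` would cover the 2006, 2017, and 2018 folders in one pass.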

# Done!

We have successfully downloaded all data from Pam Jensen at 8:25pm on 2021-12-06.

Two other files are in the jensen_data directory: an inventory of samples and an inventory of mixed boxes. These are manual inventories of the plates the Roberts Lab has, and they were used to create the summary at the top of this script. Since they were created manually, there is no script detailing their creation.