## 2_1_organize_ADFG_SE_AK_files.ipynb

This notebook renames and reorganizes the files obtained from Andrew Olson at the Alaska Department of Fish and Game.

These files contain data on Tanner crabs caught during the annual Tanner crab and red king crab surveys, which have been conducted every year since 1978. They were transferred via Google Drive, though at the moment I'm looking for a more permanent place to host them. As a result, ignore any folder below labeled "copy_for_gannet". It is just an untouched copy of all these files that I'm hoping to transfer to Gannet later, so that I don't need to re-download them.

#### Look at all filenames in our main directory
(Note: when I actually ran this notebook, I ran the code that replaces spaces with underscores before running this cell.)

In [14]:
!ls ../data/ADFG_SE_AK_pot_surveys/

CTD
CTD_Stations.pdf
Pot_Set_Data_for_Tanner_and_RKC_surveys.csv
README.md
ROP.CF.1J.2019.02.pdf
ROP.CF.1J.2019.12.pdf
Specimen_data_for_Tanner_and_RKC_surveys.csv
Specimen_data_for_Tanner_and_RKC_surveys.xlsx
TC_survey_specimen_data_1978-1984.csv
TC_survey_specimen_data_1985-1994.csv
TC_survey_specimen_data_1995-1999.csv
TC_survey_specimen_data_2000-2004.csv
TC_survey_specimen_data_2005-2009.csv
TC_survey_specimen_data_2010-2013.csv
TC_survey_specimen_data_2014-2016.csv
TC_survey_specimen_data_2017-2020.csv
Tidbits
copy_for_gannet


#### Remove irrelevant files

The Specimen_data_for_Tanner_and_RKC_surveys files are left over from a previous attempt to transfer the data, made before it was discovered that the maximum was 65,000 lines. Both can be removed.

In [33]:
!rm ../data/ADFG_SE_AK_pot_surveys/Specimen_data_for_Tanner_and_RKC_surveys.*

#### Replace all spaces with underscores

We'll do this first, as it'll make it much easier to rename other things.

In [4]:
# This will replace all spaces with underscores in the main directory and all subdirectories
!find ../data/ADFG_SE_AK_pot_surveys/ -depth -name '* *' -execdir bash -c 'for i; do mv "$i" "${i// /_}"; done' _ {} +
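
For anyone who would rather do this step from Python than through `find`, here is a minimal sketch of the same idea. The function name `replace_spaces` is hypothetical (it is not part of the original notebook), and the sketch assumes no two names in the same folder collapse to the same underscored name. Like `-depth`, it renames the contents of a folder before the folder itself, so paths stay valid while walking. It does not rename the top-level folder.

```python
import os


def replace_spaces(root):
    """Rename every file and folder under root so spaces become underscores."""
    renamed = []
    # topdown=False visits a folder's contents before the folder itself,
    # which is the same ordering that -depth gives find.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if " " in name:
                old = os.path.join(dirpath, name)
                new = os.path.join(dirpath, name.replace(" ", "_"))
                os.rename(old, new)
                renamed.append((old, new))
    return renamed
```

Calling `replace_spaces("../data/ADFG_SE_AK_pot_surveys/")` would return a list of (old, new) path pairs, which is handy for checking what changed.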

#### Rename maps and regional operation plans (ROPs)

In [17]:
# Rename map of CTD stations to CTD_station_map.pdf
!mv ../data/ADFG_SE_AK_pot_surveys/CTD_Stations.pdf ../data/ADFG_SE_AK_pot_surveys/CTD_station_map.pdf

# Rename red king crab ROP
!mv ../data/ADFG_SE_AK_pot_surveys/ROP.CF.1J.2019.02.pdf ../data/ADFG_SE_AK_pot_surveys/ROP_RKC_2019_survey.pdf

# Rename Tanner crab ROP
!mv ../data/ADFG_SE_AK_pot_surveys/ROP.CF.1J.2019.12.pdf ../data/ADFG_SE_AK_pot_surveys/ROP_Tanner_2019_survey.pdf

mv: cannot stat '../data/ADFG_SE_AK_pot_surveys/CTD_Stations.pdf': No such file or directory
mv: cannot stat '../data/ADFG_SE_AK_pot_surveys/ROP.CF.1J.2019.02.pdf': No such file or directory
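
`mv` prints `cannot stat` whenever the source path does not exist, which is what you would see if a rename cell is run again after the file has already been renamed. A rough sketch of a rename that can be re-run safely is below. The helper name `rename_if_present` is made up for illustration, and refusing to overwrite an existing destination is my own assumption about what is safest here.

```python
from pathlib import Path


def rename_if_present(src, dst):
    """Rename src to dst. Return False (rather than failing) if src is absent."""
    src, dst = Path(src), Path(dst)
    if not src.exists():
        # Nothing to do, e.g. the rename already happened on an earlier run
        return False
    if dst.exists():
        # Assumption: never silently overwrite an existing file
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)
    return True
```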


#### Check if all our survey specimen data files have the same header

They should - they were downloaded from the same database simultaneously, and are only split into separate files due to size limitations. Still, better safe than sorry!

In [21]:
# Look at the first line of a file chosen arbitrarily
!head -n 1 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2017-2020.csv

﻿Year,Project,Trip No,Location,Pot No,Specimen No,Species,Number Of Specimens,Sex,Length Millimeters,Width Millimeters,Weight Grams,Width Spines Millimeters,Chela Height Millimeters,Recruit Status,Specimen Comments,Shell Condition,Egg Condition,Egg Development,Leg Condition,Legal Size,Leatherback,Parasite,Egg Percent,Blackmat,Tag No,Tag Event Code


In [29]:
# First line: find all the specimen data csv files
# Second line: check whether the first line of each file contains the header printed above
# (index() tolerates a leading byte-order mark or a trailing carriage return)
# Third and fourth lines: print whether each file is a match or not
!find ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data* -type f -name '*.csv' -exec \
awk 'NR==1 && index($0, "Year,Project,Trip No,Location,Pot No,Specimen No,Species,Number Of Specimens,Sex,Length Millimeters,Width Millimeters,Weight Grams,Width Spines Millimeters,Chela Height Millimeters,Recruit Status,Specimen Comments,Shell Condition,Egg Condition,Egg Development,Leg Condition,Legal Size,Leatherback,Parasite,Egg Percent,Blackmat,Tag No,Tag Event Code") \
{ print "Match in file " FILENAME; exit } \
{ print "No match in file " FILENAME; exit }' \
{} \;

Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_1978-1984.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_1985-1994.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_1995-1999.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2000-2004.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2005-2009.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2010-2013.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2014-2016.csv
Match in file ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2017-2020.csv
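
As a cross-check that doesn't need the header string pasted in, the headers can also be compared against each other. This is only a sketch: the function names are hypothetical, and it assumes the files are UTF-8 (the `utf-8-sig` codec drops a byte-order mark if one is present).

```python
import glob


def read_header(path):
    """Return the first line of a file, without BOM or line ending."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return handle.readline().rstrip("\r\n")


def headers_by_file(pattern):
    """Map each file matching the glob pattern to its header line."""
    return {path: read_header(path) for path in sorted(glob.glob(pattern))}


def all_headers_match(pattern):
    """True only if at least one file matches and all headers are identical."""
    return len(set(headers_by_file(pattern).values())) == 1
```

For this dataset the pattern would be `"../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data*-*.csv"`. If `all_headers_match` returns False, `headers_by_file` shows which file is the odd one out.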


All looks good! Now let's merge, removing the first (header) line from every file except the first.

In [32]:
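# FNR is the line number within the current file; NR is the line number across all files.
# The first line of the first file is kept; the first line of every other file is skipped.
# The *-* glob matches only the year-range files, so the merged file is never read as input if this cell is re-run.
!awk '(NR == 1) || (FNR > 1)' ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data*-*.csv > ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_all_years.csv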
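
The same merge logic (keep the header of the first file, skip it in the rest) can be sketched in Python. `merge_csvs` is a hypothetical name, and UTF-8 input is an assumption. Unlike the awk one-liner, this version drops any byte-order mark, and it explicitly leaves the output file out of the inputs in case the pattern happens to match it.

```python
import glob
import os


def merge_csvs(pattern, out_path):
    """Concatenate files matching pattern, keeping only the first header.

    Returns the number of lines written to out_path.
    """
    out_abs = os.path.abspath(out_path)
    # Resolve the inputs before creating the output, and never read the output
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]
    lines_written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for file_index, path in enumerate(paths):
            with open(path, encoding="utf-8-sig", newline="") as src:
                for line_index, line in enumerate(src):
                    if line_index == 0 and file_index > 0:
                        continue  # repeated header
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    out.write(line)
                    lines_written += 1
    return lines_written
```

The returned line count can be compared directly against the `wc -l` totals.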

In [34]:
# Check that the merge worked by comparing line counts.
# The *-* glob matches only filenames with a dash after _data, i.e. our original year-range files; wc prints each count plus their total
!wc -l ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data*-*.csv

   21809 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_1978-1984.csv
   50216 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_1985-1994.csv
   23892 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_1995-1999.csv
   39538 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2000-2004.csv
   53967 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2005-2009.csv
   53584 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2010-2013.csv
   47954 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2014-2016.csv
   53497 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_2017-2020.csv
  344457 total


In [35]:
# Now let's get the total line count of our new merged file
!wc -l ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_all_years.csv
# Looks like we have 7 fewer lines, which exactly matches the 7 headers removed from our 8 files!

344450 ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_all_years.csv
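
The arithmetic behind that check is simple enough to write down: with n files, the merged file should have the summed line count minus n - 1 headers. A small sketch using the `wc -l` numbers reported above (the function name is just for illustration):

```python
def expected_merged_lines(per_file_counts):
    """Lines expected after merging when every file but one loses its header."""
    return sum(per_file_counts) - (len(per_file_counts) - 1)


# Per-file line counts from the wc -l output above
counts = [21809, 50216, 23892, 39538, 53967, 53584, 47954, 53497]
merged = 344450  # wc -l of the merged all_years file

print(sum(counts))                              # 344457
print(expected_merged_lines(counts) == merged)  # True
```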


In [38]:
# With that confirmed, we can remove all our original data files
# We'll be a bit more careful with this command, running it twice with narrow globs (one per century) so that only the year-range files are matched
!rm ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_19??-19??.csv
!rm ../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_20??-20??.csv

rm: cannot remove '../data/ADFG_SE_AK_pot_surveys/TC_survey_specimen_data_19??-19??.csv': No such file or directory

#### Move ROPs and maps into their own directory


In [39]:
# Make new directory for ROPs and maps
!mkdir ../data/ADFG_SE_AK_pot_surveys/survey_information

In [40]:
# Move ROPs and maps to new folder
!mv ../data/ADFG_SE_AK_pot_surveys/ROP_*_2019_survey.pdf ../data/ADFG_SE_AK_pot_surveys/survey_information/
!mv ../data/ADFG_SE_AK_pot_surveys/CTD_station_map.pdf ../data/ADFG_SE_AK_pot_surveys/survey_information/