# Mystery File!

The purpose of this exercise is to give you a little taste of what we will be learning about this semester in Introduction to Digital Curation. Over the semester we will learn about how data is structured and how to interact with it from the Python programming language.

Please access the file from Google Drive and answer any or all of the following if you can. Do not worry if you can't; this is stuff we will be learning about over the next few months.

* What is the format of the file?
* What does the file contain?
* How would you use the file?
* Where did the file come from?
* Who created the information in the file?
* Does it have a URL?

## Get the File

Colab lets you mount your Google Drive. I will share a folder of data with you so you can easily access files we will be working with in Colab. If you want you can mount your own Google Drive folders as well.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


Now we can use the Python pathlib module to point at the file on the shared drive.

In [2]:
import pathlib
f = pathlib.Path('/content/drive/Shared drives/INST341/module-01/file.tar')

Does the file exist?

In [3]:
f.is_file()

True
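
Besides checking that a file exists, a `Path` object can tell us a few other basic facts before we open anything. Here is a minimal sketch; the `describe` helper and the throwaway `example.tar` file are made up for illustration and stand in for the real file on the shared drive.

```python
import pathlib
import tempfile


def describe(path):
    """Return a few basic facts about a path without reading its contents."""
    p = pathlib.Path(path)
    exists = p.is_file()
    return {
        "exists": exists,
        "name": p.name,
        "suffix": p.suffix,
        # stat() only works on files that exist, so guard it.
        "size_bytes": p.stat().st_size if exists else None,
    }


# Demo on a temporary file (hypothetical stand-in for file.tar).
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / "example.tar"
    demo.write_bytes(b"\0" * 1024)
    facts = describe(demo)
    print(facts)

missing = describe("/no/such/file.tar")
print(missing)
```

Note that the suffix is only a hint: a file named `.tar` is not guaranteed to be a tar file, which is why the next section inspects the contents instead.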

## File Type

We can use the python-magic module to determine the type of the file. But first we need to install it, since it is not part of core Python. It also depends on a system library called libmagic which we can install.

In [4]:
! pip3 install python-magic
! sudo apt-get install libmagic1

Collecting python-magic
  Downloading https://files.pythonhosted.org/packages/59/77/c76dc35249df428ce2c38a3196e2b2e8f9d2f847a8ca1d4d7a3973c28601/python_magic-0.4.18-py2.py3-none-any.whl
Installing collected packages: python-magic
Successfully installed python-magic-0.4.18
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libmagic-mgc
Suggested packages:
  file
The following NEW packages will be installed:
  libmagic-mgc libmagic1
0 upgraded, 2 newly installed, 0 to remove and 39 not upgraded.
Need to get 252 kB of archives.
After this operation, 5,214 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.4 [184 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-up

Now we can import the [python-magic](https://pypi.org/project/python-magic/) module. 

In [5]:
import magic

And we can use it to identify the type of file.

In [6]:
magic.from_file(f.as_posix())

'POSIX tar archive'
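
How can a tool identify a file without trusting its name? Roughly speaking, it looks for signature bytes at known positions. For POSIX tar archives the string `ustar` sits at byte offset 257 of the header. The sketch below checks for that signature by hand using only the standard library; `looks_like_tar` is a made-up helper, and the tiny in-memory archive is a stand-in for file.tar. It is a simplified illustration, not how python-magic is actually implemented.

```python
import io
import tarfile

USTAR_OFFSET = 257  # position of the "ustar" signature in a POSIX tar header


def looks_like_tar(data: bytes) -> bool:
    """Return True if the bytes carry the POSIX tar signature."""
    return data[USTAR_OFFSET:USTAR_OFFSET + 5] == b"ustar"


# Build a tiny tar archive in memory so the example runs anywhere.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as demo_tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    demo_tar.addfile(info, io.BytesIO(payload))

header = buf.getvalue()[:512]  # a tar header block is 512 bytes
is_tar = looks_like_tar(header)
is_not_tar = looks_like_tar(b"not a tar file" * 40)
print(is_tar, is_not_tar)
```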

Now that we know a little more about the file we can look it up. Wikipedia is surprisingly good for information about types of files. Here is the start of the article about [TAR files](https://en.wikipedia.org/wiki/Tar_(computing)):

> In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. The command-line utility was first introduced in the Version 7 Unix in January 1979, replacing the tp program.[2] The file structure to store this information was standardized in POSIX.1-1988[3] and later POSIX.1-2001,[4] and became a format supported by most modern file archiving systems. 

## TAR Contents

So file.tar is a *tape archive file*. That means it is a file that contains other files, much like a ZIP file. Let's use Python's [tarfile](https://docs.python.org/3/library/tarfile.html) module to read it.

In [7]:
import tarfile
tar = tarfile.open(f)

Now that we have a variable `tar` that represents the tar file, we can use a loop to list its contents:

In [8]:
for info in tar:
  print(info)

<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_assembly_report.txt' at 0x7fdbc563b110>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_assembly_stats.txt' at 0x7fdbc563b2a0>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_cds_from_genomic.fna.gz' at 0x7fdbc563b430>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_feature_count.txt.gz' at 0x7fdbc563b4f8>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_feature_table.txt.gz' at 0x7fdbc563b5c0>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz' at 0x7fdbc563b688>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gbff.gz' at 0x7fdbc563b750>
<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_geno
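
Each item in that loop is a `TarInfo` object, which carries the metadata the Wikipedia article mentions (name, size, timestamps and so on). Printing just the name and size is often easier to read than the default output. Here is a small sketch of that idea; it builds a two-file archive in memory with made-up names (`demo/README.txt`, `demo/data.txt`) so it runs without the shared drive.

```python
import io
import tarfile


def make_demo_tar() -> bytes:
    """Build a small tar archive in memory (hypothetical stand-in for file.tar)."""
    buf = io.BytesIO()
    members = [("demo/README.txt", b"read me\n"), ("demo/data.txt", b"ACGT\n")]
    with tarfile.open(fileobj=buf, mode="w") as t:
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            t.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Iterate over the members and keep only the fields we care about.
with tarfile.open(fileobj=io.BytesIO(make_demo_tar())) as demo_tar:
    listing = [(member.name, member.size, member.isfile()) for member in demo_tar]

for name, size, is_file in listing:
    print(f"{size:>8}  {name}")
```

With the real archive you would loop over `tar` in the same way.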

Interesting! There is a lot of stuff in here. Let's extract all the files into our current working directory so we can look at them.

In [9]:
tar.extractall()
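
Extracting everything into the working directory is fine for a quick look, but there are two gentler options worth knowing about: reading a single member straight out of the archive, and extracting into a directory you choose. The sketch below shows both on a tiny in-memory archive; the member name `demo/README.txt` is made up for the example. As a general habit, only extract archives you trust, since a tar file can contain unexpected paths.

```python
import io
import pathlib
import tarfile
import tempfile

# Build a one-file archive in memory (hypothetical stand-in for file.tar).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as t:
    payload = b"hello from inside the archive\n"
    info = tarfile.TarInfo(name="demo/README.txt")
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))
buf.seek(0)

with tarfile.open(fileobj=buf) as demo_tar:
    # Option 1: read one member without writing anything to disk.
    readme_text = demo_tar.extractfile("demo/README.txt").read().decode("utf-8")

    # Option 2: extract into a directory of our choosing instead of the
    # current working directory.
    with tempfile.TemporaryDirectory() as tmp:
        demo_tar.extractall(path=tmp)
        extracted = sorted(
            p.relative_to(tmp).as_posix()
            for p in pathlib.Path(tmp).rglob("*")
            if p.is_file()
        )

print(readme_text)
print(extracted)
```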

The README.txt listed above looked interesting. Let's read that in and print it out.

In [17]:
text = open('ncbi-genomes-2020-08-27/README.txt').read()
print(text)

################################################################################
README for ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/
           ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
           ftp://ftp.ncbi.nlm.nih.gov/genomes/all/

Last updated: November 01, 2019
################################################################################

Background
Sequence data is provided for all single organism genome assemblies that are 
included in NCBI's Assembly resource (www.ncbi.nlm.nih.gov/assembly/).  This 
includes submissions to databases of the International Nucleotide Sequence 
Database Collaboration, which are available in NCBI's GenBank database, as well 
as the subset of those submissions that are included in NCBI's RefSeq Genomes 
project. 

Available by anonymous FTP at:
     ftp://ftp.ncbi.nlm.nih.gov/genomes/

Please refer to README files and the FTP FAQ for additional information:
     https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

Subscribe to the genomes-annou

That's a lot to read. But if you scroll to the top you'll see that some of this data is from [GenBank](https://en.wikipedia.org/wiki/GenBank).

This file describes the contents of the tar file! Most of this is way over my head, since I'm not a geneticist. But maybe it would be interesting to look at one of the files it mentions?

In [19]:
text = open('ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_assembly_report.txt').read()
print(text)

# Assembly name:  ASM985889v3
# Organism name:  Severe acute respiratory syndrome coronavirus 2 (viruses)
# Isolate:  Wuhan-Hu-1
# Taxid:          2697049
# BioProject:     PRJNA485481
# Submitter:      na
# Date:           2020-01-13
# Assembly type:  na
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# Assembly method: Megahit v. V1.1.3
# Sequencing technology: Illumina
# Relation to type material: ICTV additional isolate
# RefSeq category: Reference Genome
# GenBank assembly accession: GCA_009858895.3
# RefSeq assembly accession: GCF_009858895.2
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession	RefSeq Unit Accession	Assembly-Unit name
## GCA_009858905.3	GCF_009858905.2	Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-N
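
The top of this report has a regular shape: lines that start with `# ` followed by a key, a colon and a value. That regularity means we can turn it into a Python dictionary. The sketch below is one way to do that, assuming the `# Key: value` pattern seen above holds; `parse_report_header` is a made-up helper, and the sample text is copied from the output above rather than read from the file.

```python
def parse_report_header(text):
    """Collect '# Key: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        # Skip '##' section lines, bare '#' separators and comment lines
        # that have no colon in them.
        if not line.startswith("# ") or ":" not in line:
            continue
        key, _, value = line[2:].partition(":")
        fields[key.strip()] = value.strip()
    return fields


sample = """\
# Assembly name:  ASM985889v3
# Organism name:  Severe acute respiratory syndrome coronavirus 2 (viruses)
# Taxid:          2697049
# Date:           2020-01-13
# GenBank assembly accession: GCA_009858895.3
#
## Assembly-Units:
# Unplaced scaffolds are listed at the end.
"""

fields = parse_report_header(sample)
for key, value in fields.items():
    print(f"{key} = {value}")
```

With the real file you could pass in the `text` variable from the cell above.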

Oh wow, so this is genetic information about the Coronavirus! Let's take a look at one of the gzipped files using the Python [gzip](https://docs.python.org/3/library/gzip.html) module.

In [27]:
import gzip

text = gzip.open('ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz', 'rt').read()
print(text)

>MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA
AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGG
ACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
CGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGC
CTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACAT
CTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAA
ACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAG
AACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGA
TCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACG
GAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTA
GCACGTGCTGGTA

This looks like the genetic sequence for the Coronavirus!
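
The output has a simple structure: a header line that starts with `>` followed by lines of sequence letters. Here is a minimal sketch that splits the two apart and counts the letters. The `read_fasta` helper is made up for illustration and only handles this simple layout; the sample is the first two sequence lines from the output above, gzipped in memory so the example runs without the extracted files.

```python
import gzip
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


sample = (
    ">MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome\n"
    "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA\n"
    "AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGG\n"
)
compressed = gzip.compress(sample.encode("ascii"))

# gzip.open accepts a file object as well as a filename.
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    records = list(read_fasta(handle))

header, sequence = records[0]
counts = Counter(sequence)
print(header)
print(len(sequence), dict(counts))
```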

## Answers

We may now have more questions than answers, but here is one way of answering the questions posed at the beginning. If all of this seemed very hard, don't worry: it was supposed to be difficult. By the end of the semester you should feel more comfortable using Python this way. For the moment, just get a sense of the flow of what happened.

**1. What is the format of the file?**

The file we started with was a tar file. But it contained other files such as text files and gzipped files.

**2. What does the file contain?**

It appears to contain information about the Coronavirus.

**3. How would you use the file?**

Geneticists could use this information to identify the virus in their labs.

**4. Where did the file come from?**

The file came from GenBank, which is a project run out of the National Institutes of Health nearby in Bethesda.

**5. Who created the information in the file?**

Examining the metadata a little more closely (e.g. GCA_009858895.3_ASM985889v3_protein.gpff) shows that the genetic data was uploaded by Chinese scientists on January 5, 2020, when they were publishing their findings in Nature.

**6. Does it have a URL?**

Sometimes you can use identifiers in data to locate more information about them on the web. In this case we can search Google for ASM985889v3, which brings us to:

https://www.ncbi.nlm.nih.gov/assembly/GCF_009858895.2/

That looks like a start at least.
