modules/module-01/notebook.ipynb

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Mystery Fille",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1dxguN98xxQT",
        "colab_type": "text"
      },
      "source": [
        "# Mystery File!\n",
        "\n",
        "The purpose of this exercise is to give you a little taste of what we will be learning about this semester in Introduction to Digital Curation. Over the semester we will learn about how data is structured and how to interact with it from the Python programming language.\n",
        "\n",
        "Please access the file from Google Drive and answer any or all of the following if you can. Do not worry if you can't, this is stuff we will be learning about over the next few months.\n",
        "\n",
        "* What is the format of the file?\n",
        "* What does the file contain?\n",
        "* How would you use the file?\n",
        "* Where did the file come from?\n",
        "* Who created the information in the file?\n",
        "* Does it have a URL?\n",
        "\n",
        "## Get the File\n",
        "\n",
        "Colab lets you mount your Google Drive. I will share a folder of data with you so you can easily access files we will be working with in Colab. If you want you can mount your own Google Drive folders as well.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RU5p7F2xs4lK",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 124
        },
        "outputId": "8c73b96b-f7bd-45e4-d309-8f9c34baf608"
      },
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/drive')"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code\n",
            "\n",
            "Enter your authorization code:\n",
            "··········\n",
            "Mounted at /content/drive\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3Lhaqp1NyPlt",
        "colab_type": "text"
      },
      "source": [
        "Now we can use the Python pathlib module to read the file."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-G8AnH2qtESD",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pathlib\n",
        "f = pathlib.Path('/content/drive/Shared drives/INST341/module-01/file.tar')"
      ],
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q0QzQm3g0MJN",
        "colab_type": "text"
      },
      "source": [
        "Does the file exist?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aZTS5mBRvJjV",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "d7381259-8905-4e30-c422-f815c86e7295"
      },
      "source": [
        "f.is_file()"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yC12RBi60uum",
        "colab_type": "text"
      },
      "source": [
        "## File Type\n",
        "\n",
        "We can use the python-magic module to determine the type of the file. But first we need to install it, since it is not part of core Python. It also depends on a system library called libmagic which we can install."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Kjlnuxdh1I15",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 766
        },
        "outputId": "417d0ec0-1555-49f8-b0ac-84b1bdb4f6c6"
      },
      "source": [
        "! pip3 install python-magic\n",
        "! sudo apt-get install libmagic1"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting python-magic\n",
            "  Downloading https://files.pythonhosted.org/packages/59/77/c76dc35249df428ce2c38a3196e2b2e8f9d2f847a8ca1d4d7a3973c28601/python_magic-0.4.18-py2.py3-none-any.whl\n",
            "Installing collected packages: python-magic\n",
            "Successfully installed python-magic-0.4.18\n",
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "The following package was automatically installed and is no longer required:\n",
            "  libnvidia-common-440\n",
            "Use 'sudo apt autoremove' to remove it.\n",
            "The following additional packages will be installed:\n",
            "  libmagic-mgc\n",
            "Suggested packages:\n",
            "  file\n",
            "The following NEW packages will be installed:\n",
            "  libmagic-mgc libmagic1\n",
            "0 upgraded, 2 newly installed, 0 to remove and 39 not upgraded.\n",
            "Need to get 252 kB of archives.\n",
            "After this operation, 5,214 kB of additional disk space will be used.\n",
            "Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.4 [184 kB]\n",
            "Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic1 amd64 1:5.32-2ubuntu0.4 [68.6 kB]\n",
            "Fetched 252 kB in 1s (387 kB/s)\n",
            "debconf: unable to initialize frontend: Dialog\n",
            "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)\n",
            "debconf: falling back to frontend: Readline\n",
            "debconf: unable to initialize frontend: Readline\n",
            "debconf: (This frontend requires a controlling tty.)\n",
            "debconf: falling back to frontend: Teletype\n",
            "dpkg-preconfigure: unable to re-open stdin: \n",
            "Selecting previously unselected package libmagic-mgc.\n",
            "(Reading database ... 144579 files and directories currently installed.)\n",
            "Preparing to unpack .../libmagic-mgc_1%3a5.32-2ubuntu0.4_amd64.deb ...\n",
            "Unpacking libmagic-mgc (1:5.32-2ubuntu0.4) ...\n",
            "Selecting previously unselected package libmagic1:amd64.\n",
            "Preparing to unpack .../libmagic1_1%3a5.32-2ubuntu0.4_amd64.deb ...\n",
            "Unpacking libmagic1:amd64 (1:5.32-2ubuntu0.4) ...\n",
            "Setting up libmagic-mgc (1:5.32-2ubuntu0.4) ...\n",
            "Setting up libmagic1:amd64 (1:5.32-2ubuntu0.4) ...\n",
            "Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\n",
            "Processing triggers for libc-bin (2.27-3ubuntu1) ...\n",
            "/sbin/ldconfig.real: /usr/local/lib/python3.6/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h0sHCzXH1pCK",
        "colab_type": "text"
      },
      "source": [
        "Now we can import the [python-magic](https://pypi.org/project/python-magic/) module. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jFbb3WIvvL_f",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import magic"
      ],
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S36xy3qb1FoW",
        "colab_type": "text"
      },
      "source": [
        "And we can use it to identify the type of file."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NN2yyfQd1HGq",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "ba2588e6-07bb-4aa4-fdb0-69899acc5bd7"
      },
      "source": [
        "magic.from_file(f.as_posix())"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'POSIX tar archive'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tVk_tYzV2GsW",
        "colab_type": "text"
      },
      "source": [
        "Now that we know a little more about the file we can look it up. Wikipedia is surprisingly good for information about types of files. Here is the article about [TAR files](https://en.wikipedia.org/wiki/Tar_(computing))\n",
        "\n",
        "> In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from \"tape archive\", as it was originally developed to write data to sequential I/O devices with no file system of their own. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. The command-line utility was first introduced in the Version 7 Unix in January 1979, replacing the tp program.[2] The file structure to store this information was standardized in POSIX.1-1988[3] and later POSIX.1-2001,[4] and became a format supported by most modern file archiving systems. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "97JlQ40Z3RSw",
        "colab_type": "text"
      },
      "source": [
        "## TAR Contents\n",
        "\n",
        "So file.tar is a *tape archive file*. That means it is a file that contains other files much like a ZIP file. Lets use Python's [tarfile](https://docs.python.org/3/library/tarfile.html) module to read it."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Oo5a1upa3xM3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import tarfile\n",
        "tar = tarfile.open(f)"
      ],
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uN1YZ_cZ4elu",
        "colab_type": "text"
      },
      "source": [
        "Now that we have a our variable tar that represents the tar file we can use a loop to list its contents:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "f2VMKa4t37a5",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 332
        },
        "outputId": "1f4e0374-d953-459d-e3a7-15e9b3428b90"
      },
      "source": [
        "for info in tar:\n",
        "  print(info)"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_assembly_report.txt' at 0x7fdbc563b110>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_assembly_stats.txt' at 0x7fdbc563b2a0>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_cds_from_genomic.fna.gz' at 0x7fdbc563b430>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_feature_count.txt.gz' at 0x7fdbc563b4f8>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_feature_table.txt.gz' at 0x7fdbc563b5c0>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz' at 0x7fdbc563b688>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gbff.gz' at 0x7fdbc563b750>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gff.gz' at 0x7fdbc563b818>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gtf.gz' at 0x7fdbc563b8e0>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_protein.faa.gz' at 0x7fdbc563ba70>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_protein.gpff.gz' at 0x7fdbc563bb38>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_translated_cds.faa.gz' at 0x7fdbc563bcc8>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/assembly_status.txt' at 0x7fdbc563bd90>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/annotation_hashes.txt' at 0x7fdbc563b9a8>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/md5checksums.txt' at 0x7fdbc563be58>\n",
            "<TarInfo 'ncbi-genomes-2020-08-27/README.txt' at 0x7fdbc563bf20>\n",
            "<TarInfo 'report.txt' at 0x7fdbaa8ef048>\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CHfhVRw24Z9V",
        "colab_type": "text"
      },
      "source": [
        "Interesting! There is a lot of stuff in here. Lets extract all the files into our current working directory so we can look at them."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3Zn_KfEf4zkL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tar.extractall()"
      ],
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hKev2b7k5lLQ",
        "colab_type": "text"
      },
      "source": [
        "The README.txt listed above looked interesting. Lets read that in and print it out."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "HAB1vFTT5p9I",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "3a9b1257-51b0-4566-cf52-48bd9b7f44db"
      },
      "source": [
        "text = open('ncbi-genomes-2020-08-27/README.txt').read()\n",
        "print(text)"
      ],
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "################################################################################\n",
            "README for ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/\n",
            "           ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/\n",
            "           ftp://ftp.ncbi.nlm.nih.gov/genomes/all/\n",
            "\n",
            "Last updated: November 01, 2019\n",
            "################################################################################\n",
            "\n",
            "==========\n",
            "Background\n",
            "==========\n",
            "Sequence data is provided for all single organism genome assemblies that are \n",
            "included in NCBI's Assembly resource (www.ncbi.nlm.nih.gov/assembly/).  This \n",
            "includes submissions to databases of the International Nucleotide Sequence \n",
            "Database Collaboration, which are available in NCBI's GenBank database, as well \n",
            "as the subset of those submissions that are included in NCBI's RefSeq Genomes \n",
            "project. \n",
            "\n",
            "Available by anonymous FTP at:\n",
            "     ftp://ftp.ncbi.nlm.nih.gov/genomes/\n",
            "\n",
            "Please refer to README files and the FTP FAQ for additional information:\n",
            "     https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/\n",
            "\n",
            "Subscribe to the genomes-announce mail list to be informed of changes to the\n",
            "NCBI genomes FTP site:\n",
            "     https://www.ncbi.nlm.nih.gov/mailman/listinfo/genomes-announce\n",
            "\n",
            "\n",
            "=====================================================================\n",
            "Genome sequence and annotation data is provided in three directories:\n",
            "=====================================================================\n",
            "1) all:     content is the union of GenBank and RefSeq assemblies. \n",
            "            Two directories under \"all\" are named for the accession prefix (GCA\n",
            "            or GCF) and these directories contain another three levels of \n",
            "            directories named for digits 1-3, 4-6 & 7-9 of the assembly \n",
            "            accession. The next level is the data directories for individual \n",
            "            assembly versions. Only data directories for \"latest\" assemblies\n",
            "            are refreshed when annotation is updated or when software updates\n",
            "            are released, so new file formats or improvements to existing \n",
            "            formats are not available for non-latest assemblies.\n",
            "            A third directory, named \"annotation_releases\" contains the products\n",
            "            of the NCBI Eukaryotic Genome Annotation Pipeline (see below). The \n",
            "            data are organized first by taxonomy ID and then by annotation \n",
            "            release ID. It is expected that many users will prefer to access the\n",
            "            annotation release data using the paths under the \"refseq\" directory\n",
            "            that use the organism name (see below). \n",
            "2) genbank: content includes primary submissions of assembled genome sequence \n",
            "            and associated annotation data, if any, as exchanged among members \n",
            "            of the International Nucleotide Sequence Database Collaboration, \n",
            "            of which NCBI's GenBank database is a member. The GenBank directory \n",
            "            area includes genome sequence data for a larger number of organisms \n",
            "            than the RefSeq directory area; however, some assemblies are \n",
            "            unannotated. The sub-directory structure includes:\n",
            "            a. archaea\n",
            "            b. bacteria\n",
            "            c. fungi\n",
            "            d. invertebrate\n",
            "            e. metagenomes\n",
            "            f. other -  this directory includes synthetic genomes\n",
            "            g. plant\n",
            "            h. protozoa\n",
            "            i. vertebrate_mammalian\n",
            "            j. vertebrate_other\n",
            "            k. viral\n",
            "3) refseq:  content includes assembled genome sequence and RefSeq annotation \n",
            "            data. All prokaryotic and eukaryotic RefSeq genomes have annotation. \n",
            "            RefSeq annotation data may be calculated by NCBI annotation  \n",
            "            pipelines or propagated from the GenBank submission. The RefSeq \n",
            "            directory area includes fewer organisms than the GenBank directory\n",
            "            area because not all genome assemblies are selected for the RefSeq\n",
            "            project.\n",
            "            Sub-directories include:\n",
            "            a. archaea\n",
            "            b. bacteria\n",
            "            c. fungi\n",
            "            d. invertebrate\n",
            "            e. plant\n",
            "            f. protozoa\n",
            "            g. vertebrate_mammalian\n",
            "            h. vertebrate_other \n",
            "            i. viral\n",
            "            j. mitochondrion [Content of the mitochondrion, plasmid and plastid\n",
            "            k. plasmid     directories is from the RefSeq release FTP site. See \n",
            "            l. plastid     ftp://ftp.ncbi.nlm.nih.gov/refseq/release/README]\n",
            "\n",
            "Data are further organized within each of the above directories as a series of \n",
            "directories named as the species binomial. For example:\n",
            "   ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/\n",
            "           - or - \n",
            "   ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/\n",
            "\n",
            "The next hierarchy provides access to all assemblies for the species, latest \n",
            "assemblies, and selected reference or representative assemblies for the species \n",
            "(if any). Within these groupings, sequence and annotation (and other) data is \n",
            "provided per assembly in a series of directories that are named using the rule:\n",
            "\n",
            "   [Assembly accession.version]_[assembly name]\n",
            "\n",
            "For example, the directory hierarchy for the GenBank Bacillus thuringiensis \n",
            "strain 97-27 genome, which has the assembly accession GCA_000008505.1 and \n",
            "default assembly name ASM850v1 looks like this:  \n",
            "   /genomes/genbank/bacteria/Bacillus_thuringiensis/all_assembly_versions/GCA_000008505.1_ASM850v1\n",
            "\n",
            "The directory hierarchy for the RefSeq annotated human reference genome which \n",
            "has the assembly accession GCF_000001405.39 and assembly name GRCh38.p13 looks \n",
            "like this:\n",
            "   /genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.39_GRCh38.p13\n",
            "\n",
            "Species that have been annotated by the NCBI Eukaryotic Genome Annotation \n",
            "Pipeline will also have a directory named \"annotation_releases\" described below.\n",
            "\n",
            "Genome assemblies of interest can be identified using the NCBI Assembly resource\n",
            "(www.ncbi.nlm.nih.gov/assembly), or by using the assembly summary report files \n",
            "that are provided for both all genbank and all refseq assemblies:\n",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt\n",
            "or ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt\n",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt\n",
            "or ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt\n",
            "\n",
            "Assembly summary report files containing information on assemblies for a \n",
            "particular taxonomic group or species are provided in the group and \n",
            "Genus_species directories under the \"genbank\" and \"refseq\" directory trees. e.g.\n",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt\n",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt\n",
            "\n",
            "Search the meta-data fields, or filter the files, to find assemblies of \n",
            "interest.\n",
            "\n",
            "\n",
            "===========================\n",
            "Data provided per assembly:\n",
            "===========================\n",
            "Sequence and other data files provided per assembly are named according to the \n",
            "rule:\n",
            "[assembly accession.version]_[assembly name]_[content type].[optional format]\n",
            "\n",
            "File formats and content:\n",
            "\n",
            "   assembly_status.txt\n",
            "       A text file reporting the current status of the version of the assembly\n",
            "       for which data is provided. Any assembly anomalies are also reported.\n",
            "   *_assembly_report.txt file\n",
            "       Tab-delimited text file reporting the name, role and sequence \n",
            "       accession.version for objects in the assembly. The file header contains \n",
            "       meta-data for the assembly including: assembly name, assembly \n",
            "       accession.version, scientific name of the organism and its taxonomy ID, \n",
            "       assembly submitter, and sequence release date.\n",
            "   *_assembly_stats.txt file\n",
            "       Tab-delimited text file reporting statistics for the assembly including: \n",
            "       total length, ungapped length, contig & scaffold counts, contig-N50, \n",
            "       scaffold-L50, scaffold-N50, scaffold-N75, and scaffold-N90\n",
            "   *_assembly_regions.txt\n",
            "       Provided for assemblies that include alternate or patch assembly units. \n",
            "       Tab-delimited text file reporting the location of genomic regions and \n",
            "       listing the alt/patch scaffolds placed within those regions.\n",
            "   *_assembly_structure directory\n",
            "       This directory will only be present if the assembly has internal \n",
            "       structure. When present, it will contain AGP files that define how \n",
            "       component sequences are organized into scaffolds and/or chromosomes. \n",
            "       Other files define how scaffolds and chromosomes are organized into \n",
            "       non-nuclear and other assembly-units, and how any alternate or patch \n",
            "       scaffolds are placed relative to the chromosomes. Refer to the README.txt\n",
            "       file in the assembly_structure directory for additional information.\n",
            "   *_cds_from_genomic.fna.gz\n",
            "       FASTA format of the nucleotide sequences corresponding to all CDS \n",
            "       features annotated on the assembly, based on the genome sequence. See \n",
            "       the \"Description of files\" section below for details of the file format.\n",
            "   *_feature_count.txt.gz\n",
            "       Tab-delimited text file reporting counts of gene, RNA, CDS, and similar\n",
            "       features, based on data reported in the *_feature_table.txt.gz file.\n",
            "       See the \"Description of files\" section below for details of the file \n",
            "       format.\n",
            "   *_feature_table.txt.gz\n",
            "       Tab-delimited text file reporting locations and attributes for a subset \n",
            "       of annotated features. Included feature types are: gene, CDS, RNA (all \n",
            "       types), operon, C/V/N/S_region, and V/D/J_segment. Replaces the .ptt & \n",
            "       .rnt format files that were provided in the old genomes FTP directories.\n",
            "       See the \"Description of files\" section below for details of the file \n",
            "       format.\n",
            "   *_genomic.fna.gz file\n",
            "       FASTA format of the genomic sequence(s) in the assembly. Repetitive \n",
            "       sequences in eukaryotes are masked to lower-case (see below).\n",
            "       The FASTA title is formatted as sequence accession.version plus \n",
            "       description. The genomic.fna.gz file includes all top-level sequences in\n",
            "       the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds,\n",
            "       unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds\n",
            "       that are part of the chromosomes are not included because they are\n",
            "       redundant with the chromosome sequences; sequences for these placed \n",
            "       scaffolds are provided under the assembly_structure directory.\n",
            "   *_genomic.gbff.gz file\n",
            "       GenBank flat file format of the genomic sequence(s) in the assembly. This\n",
            "       file includes both the genomic sequence and the CONTIG description (for \n",
            "       CON records), hence, it replaces both the .gbk & .gbs format files that \n",
            "       were provided in the old genomes FTP directories.\n",
            "   *_genomic.gff.gz file\n",
            "       Annotation of the genomic sequence(s) in Generic Feature Format Version 3\n",
            "       (GFF3). Sequence identifiers are provided as accession.version.\n",
            "       Additional information about NCBI's GFF files is available at \n",
            "       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.\n",
            "   *_genomic.gtf.gz file\n",
            "       Annotation of the genomic sequence(s) in Gene Transfer Format Version 2.2\n",
            "       (GTF2.2). Sequence identifiers are provided as accession.version.\n",
            "   *_genomic_gaps.txt.gz\n",
            "       Tab-delimited text file reporting the coordinates of all gaps in the \n",
            "       top-level genomic sequences. The gaps reported include gaps specified in\n",
            "       the AGP files, gaps annotated on the component sequences, and any other \n",
            "       run of 10 or more Ns in the sequences. See the \"Description of files\" \n",
            "       section below for details of the file format.\n",
            "   *_protein.faa.gz file\n",
            "       FASTA format sequences of the accessioned protein products annotated on\n",
            "       the genome assembly. The FASTA title is formatted as sequence \n",
            "       accession.version plus description.\n",
            "   *_protein.gpff.gz file\n",
            "       GenPept format of the accessioned protein products annotated on the \n",
            "       genome assembly\n",
            "   *_rm.out.gz file\n",
            "       RepeatMasker output; \n",
            "       Provided for Eukaryotes \n",
            "   *_rm.run file\n",
            "       Documentation of the RepeatMasker version, parameters, and library; \n",
            "       Provided for Eukaryotes \n",
            "   *_rna.fna.gz file\n",
            "       FASTA format of accessioned RNA products annotated on the genome \n",
            "       assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA \n",
            "       products are not instantiated as a separate accessioned record in GenBank\n",
            "       but are provided for some RefSeq genomes, most notably the eukaryotes.)\n",
            "       The FASTA title is provided as sequence accession.version plus \n",
            "       description.\n",
            "   *_rna.gbff.gz file\n",
            "       GenBank flat file format of RNA products annotated on the genome \n",
            "       assembly; Provided for RefSeq assemblies as relevant\n",
            "   *_rna_from_genomic.fna.gz\n",
            "       FASTA format of the nucleotide sequences corresponding to all RNA \n",
            "       features annotated on the assembly, based on the genome sequence. See \n",
            "       the \"Description of files\" section below for details of the file format.\n",
            "   *_translated_cds.faa.gz\n",
            "       FASTA sequences of individual CDS features annotated on the genomic \n",
            "       records, conceptually translated into protein sequence. The sequence \n",
            "       corresponds to the translation of the nucleotide sequence provided in the\n",
            "       *_cds_from_genomic.fna.gz file. \n",
            "   *_wgsmaster.gbff.gz\n",
            "       GenBank flat file format of the WGS master for the assembly (present only\n",
            "       if a WGS master record exists for the sequences in the assembly).\n",
            "   annotation_hashes.txt\n",
            "       Tab-delimited text file reporting hash values for different aspects\n",
            "       of the annotation data. See the \"Description of files\" section below \n",
            "       for details of the file format.\n",
            "   md5checksums.txt file\n",
            "       file checksums are provided for all data files in the directory\n",
            "\n",
            "Additional directories and files provided for organisms annotated by the NCBI \n",
            "Eukaryotic Genome Annotation Pipeline:\n",
            "\n",
            "   *_pseudo_without_product.fna.gz\n",
            "       FASTA format of the genomic sequence corresponding to pseudogene and \n",
            "       other gene regions which do not have any associated transcribed RNA \n",
            "       products or translated protein products. It includes annotated gene \n",
            "       regions that require rearrangement to provide the final product, e.g.\n",
            "       immunoglobulin segments. These sequences are not assigned accession \n",
            "       numbers, and are derived directly from the assembled genomic sequences.\n",
            "       The FASTA title has a local sequence identifier, the Gene ID and gene \n",
            "       name.\n",
            "\n",
            "Evidence_alignments directory\n",
            "   *_cross_species_tx_alns.gff.gz\n",
            "       Alignments of cDNAs, ESTs and TSAs from other species to the genomic\n",
            "       sequence(s) in Generic Feature Format Version 3 (GFF3) [not all \n",
            "       annotation releases have cross-species alignments]. These alignments may\n",
            "       have been used as evidence for gene prediction by the annotation \n",
            "       pipeline. Sequence identifiers are provided as accession.version. \n",
            "       Additional information about NCBI's GFF files is available at \n",
            "       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.\n",
            "\n",
            "   *_same_species_tx_alns.gff.gz\n",
            "       Alignments of same-species cDNAs, ESTs and TSAs to the genomic \n",
            "       sequence(s) in Generic Feature Format Version 3 (GFF3). These alignments\n",
            "       were used as evidence for gene prediction by the annotation pipeline. \n",
            "       Sequence identifiers are provided as accession.version. Additional \n",
            "       information about NCBI's GFF files is available at \n",
            "       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.\n",
            "\n",
            "Gnomon_models directory \n",
            "   *_gnomon_model.gff.gz\n",
            "       Gnomon annotation of the genomic sequence(s) in Generic Feature Format\n",
            "       Version 3 (GFF3). Sequence identifiers are provided as accession.version\n",
            "       for the genomic sequences and Gnomon identifiers for the Gnomon models:\n",
            "       gene.XXX for genes, GNOMON.XXX.m for transcripts and GNOMON.XXX.p for \n",
            "       proteins. These identifiers are NOT universally unique. They are unique\n",
            "       per annotation release only. Additional information about NCBI's GFF \n",
            "       files is available at \n",
            "       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.   \n",
            "   *_gnomon_protein.faa.gz\n",
            "       FASTA format sequences of Gnomon protein models annotated on the genome\n",
            "       assembly. The FASTA title is the Gnomon identifier for the protein model\n",
            "       (>gnl|GNOMON|XXX.p)   \n",
            "   *_gnomon_rna.fna.gz\n",
            "       FASTA format sequences of Gnomon transcript models annotated on the \n",
            "       genome assembly. The FASTA title is the Gnomon identifier for the \n",
            "       transcript (>gnl|GNOMON|XXX.m)\n",
            "\n",
            "RefSeq_transcripts_alignments directory\n",
            "   *_knownrefseq_alns.bam\n",
            "       Alignments of the annotated Known RefSeq transcripts (identified with \n",
            "       accessions prefixed with NM_ and NR_) to the genome in BAM format [not \n",
            "       all annotation releases have Known RefSeq transcripts]. For more \n",
            "       information about the BAM format see: \n",
            "       https://samtools.github.io/hts-specs/SAMv1.pdf\n",
            "   *_knownrefseq_alns.bam.bai\n",
            "       Index of the BAM alignments of the annotated Known RefSeq transcripts \n",
            "       to the genome. [not all annotation releases have Known RefSeq \n",
            "       transcripts]\n",
            "   *_modelrefseq_alns.bam\n",
            "       Alignments of the annotated Model RefSeq transcripts (identified with \n",
            "       accessions prefixed with XM_ and XR_) to the genome in BAM format. For \n",
            "       more information about the BAM format see:\n",
            "       https://samtools.github.io/hts-specs/SAMv1.pdf\n",
            "   *_modelrefseq_alns.bam.bai\n",
            "       Index of the BAM alignments of the annotated Model RefSeq transcripts to\n",
            "       the genome.\n",
            "\n",
            "Annotation_comparison directory\n",
            "       This directory is only provided for re-annotations of the same species.\n",
            "   *_compare_prev.txt.gz\n",
            "       Matching genes and transcripts in the current and previous annotation \n",
            "       releases binned by type of difference (column 1 for genes and column 14 \n",
            "       for transcripts), in tabular format.\n",
            "   *_compare_prev.gbp.gz\n",
            "       Genome Workbench project file for visualization and search of differences\n",
            "       between the current and previous annotation releases.\n",
            "       See how to download and use the 64-bit version of Genome Workbench:\n",
            "       https://www.ncbi.nlm.nih.gov/tools/gbench/\n",
            "\n",
            "\n",
            "=====================================\n",
            "Data provided per annotation release:\n",
            "=====================================\n",
            "The annotation_releases directory offers data grouped by organism and specific \n",
            "annotation release (100, 101, etc.) for organisms that have been annotated \n",
            "by the NCBI Eukaryotic Genome Annotation Pipeline. \n",
            "\n",
            "Each annotation release corresponds to an annotation run. The annotation \n",
            "release identifiers (AR) are numbered sequentially starting at 100,\n",
            "independently of the assembly used. An assembly may have been annotated multiple\n",
            "times, and be featured in different annotation release directories. For example,\n",
            "Apis mellifera AR 103 was executed on the same assembly as A. mellifera AR 102,\n",
            "Amel_4.5, using experimental evidence not available at the time AR 102 was \n",
            "produced. A. mellifera AR 104 was executed in 2018 on a newer assembly, \n",
            "Amel_HAv3.1.\n",
            "\n",
            "The 'current' directory contains the data for the most recent annotation.\n",
            "For many organisms, only the most recent annotation may be available. Previous \n",
            "annotations are available at \n",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/<organism>\n",
            "\n",
            "For a small set of organisms including human (taxid 9606), we provide annotation\n",
            "updates named <AR>.<date> that incorporate improvements made to genes and \n",
            "transcripts by RefSeq curation experts. See more details in:\n",
            "https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/26/human-genome-annotation-bimonthly-update/\n",
            "\n",
            "Each annotation release directory contains:\n",
            "\n",
            "README_[organism_name]_annotation_release_[annotation_release_id]\n",
            "   This file provides information specific to the annotation release,\n",
            "   including data freeze dates, release date and release number, and the \n",
            "   annotated assemblies.\n",
            "\n",
            "[organism name]_ARXXX_annotation_report.xml\n",
            "   This file is the XML version of the HTML report for the organism:\n",
            "   https://www.ncbi.nlm.nih.gov/genome/annotation_euk/[org_name]/[annotation_release_id]/\n",
            "   e.g. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/108/\n",
            "   It contains information on the annotation release, including:\n",
            "      Important dates associated with the annotation\n",
            "      Assemblies\n",
            "      Gene and feature statistics\n",
            "      Masking results\n",
            "      Transcript and protein alignments used for the annotation\n",
            "      Assembly-assembly alignments used to track genes from the previous \n",
            "      assembly to the current, or from the reference to an alternate assembly\n",
            "      if relevant\n",
            "\n",
            "One directory for each genome assembly that was annotated in the release. \n",
            "   Named as [assembly accession.version]_[assembly name]. \n",
            "   This directory contains the files provided for all genome assemblies plus \n",
            "   those files described above under \"Additional directories and files provided\n",
            "   for organisms annotated by the NCBI Eukaryotic Genome Annotation Pipeline\".\n",
            "\n",
            "\n",
            "=====================\n",
            "Description of files:\n",
            "=====================\n",
            "\n",
            "Masking of fasta sequences in genomic.fna.gz files\n",
            "--------------------------------------------------\n",
            "Repetitive sequences in eukaryotic genome assembly sequence files, as \n",
            "identified by WindowMasker (Morgulis A, Gertz EM, Schaffer AA, Agarwala R. \n",
            "2006. Bioinformatics 22:134-41), have been masked to lower-case.\n",
            "\n",
            "Alignment programs typically have parameters that control whether the program \n",
            "will ignore lower-case masking, treat it as soft-masking (i.e. only for finding \n",
            "initial matches) or treat it as hard-masking. By default NCBI BLAST will ignore \n",
            "lower-case masking but this can be changed by adding options to the blastn \n",
            "command-line.\n",
            "To have blastn treat lower-case masking in the query sequence as soft-masking \n",
            "add:\n",
            "     -lcase_masking\n",
            "To have blastn treat lower-case masking in the query sequence as hard-masking \n",
            "add:\n",
            "     -lcase_masking -soft_masking false\n",
            "\n",
            "Alternatively, commands such as the following can be used to generate either \n",
            "unmasked sequence or sequence masked with Ns.\n",
            "\n",
            "Example commands to remove lower-case masking:\n",
            "perl -pe '/^[^>]/ and $_=uc' genomic.fna > genomic.unmasked.fna\n",
            "  -or-\n",
            "awk '{if(/^[^>]/)$0=toupper($0);print $0}' genomic.fna > genomic.unmasked.fna\n",
            "\n",
            "Example commands to convert lower-case masking to masking with Ns (hard-masked):\n",
            "perl -pe '/^[^>]/ and $_=~ s/[a-z]/N/g' genomic.fna > genomic.N-masked.fna\n",
            "  -or-\n",
            "awk '{if(/^[^>]/)gsub(/[a-z]/,\"N\");print $0}' genomic.fna > genomic.N-masked.fna\n",
            "\n",
            "\n",
            "*_cds_from_genomic.fna.gz & *_rna_from_genomic.fna.gz\n",
            "-----------------------------------------------------\n",
            "FASTA sequences of individual features annotated on the genomic records. The \n",
            "sequences are based solely on the genome sequence and annotated feature at a\n",
            "particular location. They may differ from the product sequences found in the \n",
            "*_rna.fna.gz and *_protein.faa.gz files which may be based on transcript or \n",
            "other data sources and include mismatches, indels, or additional sequence not \n",
            "found at a particular genomic location.\n",
            "\n",
            "Seq-ids are constructed based on the following rule to ensure uniqueness:\n",
            "lcl|<genomic accession.version>_<feature_type>_<product accession.version>_<counter>\n",
            "Note the seq-id is not intended to be stable if the annotation is updated; in \n",
            "particular, addition or removal of feature(s) will cause the counter to change \n",
            "on following features.\n",
            "\n",
            "The remainder of the FASTA definition line is composed of a series of qualifiers\n",
            "bounded by brackets, as described at:\n",
            "  https://www.ncbi.nlm.nih.gov/Sequin/modifiers.html\n",
            "  The qualifiers that may appear in these files are:\n",
            "      gene\n",
            "      locus_tag\n",
            "      db_xref\n",
            "      protein\n",
            "      product\n",
            "      ncRNA_class\n",
            "      pseudo\n",
            "      pseudogene\n",
            "      frame\n",
            "      partial\n",
            "      transl_except\n",
            "      exception\n",
            "      protein_id\n",
            "      location\n",
            "\n",
            "Note that some qualifier values such as product names may themselves contain \n",
            "un-escaped brackets, which should be allowed for if parsing the files.\n",
            "   \n",
            "For CDS features that begin in frame 2 or 3, the first 1 or 2 bp of sequence\n",
            "are trimmed from the CDS FASTA so that it always begins with the first complete\n",
            "codon. The location and frame qualifiers are left unaltered; consequently, the \n",
            "length of the ranges in the location string may be 1-2 bp longer than the FASTA \n",
            "sequence.\n",
            "\n",
            "For RefSeq assemblies annotated by NCBI's Eukaryotic Genome Annotation \n",
            "Pipeline, a gene may have a frameshifting indel(s) in the genome that is \n",
            "thought to result from a genome sequencing error; in these cases, the gene is \n",
            "still considered to be protein-coding and annotated with mRNA and CDS features, \n",
            "but the genome sequence won't translate correctly downstream from the \n",
            "frameshift. To compensate, the FASTA sequence of the genomic CDS and RNA \n",
            "features is modified with 1-2 bp gaps (aka \"micro-introns\") in order to \n",
            "restore the predicted reading frame. This modification is reflected by 1-2 bp \n",
            "micro-introns in the location qualifier. An equivalent modification is also\n",
            "made in the *_genomic.gff.gz file. A protein-coding gene may also be annotated\n",
            "with a CDS feature containing an in-frame stop codon that is translated as a\n",
            "selenocysteine, subject to stop-codon readthrough, or thought to result from a\n",
            "genome sequencing error; in these cases, a transl_except qualifier is provided\n",
            "indicating the genomic location of the stop codon and its proposed translation.\n",
            "For more details, see the section on \"Annotation accommodations for putative \n",
            "assembly errors\" in:\n",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt\n",
            "\n",
            "Pseudogenes annotated with CDS features may be included in the \n",
            "*_cds_from_genomic.fna.gz file, and have FASTAs that are disrupted by \n",
            "frameshifting indels or in-frame stop codons. Pseudogene features can be\n",
            "identified and screened out based on the presence of a [pseudo=true] qualifier\n",
            "in the defline.\n",
            "\n",
            "\n",
            "*_feature_count.txt.gz\n",
            "----------------------\n",
            "Tab-delimited text file reporting counts of gene, RNA, CDS, and similar \n",
            "features, based on data reported in the *_feature_table.txt.gz file (see below).\n",
            "Separate counts are provided for different sets of sequences in the assembly \n",
            "corresponding to the primary assembly, non-nuclear assembly, all alt-loci \n",
            "sequences, and all patch scaffolds.\n",
            "\n",
            "The file is tab delimited (including a #header) with the following columns:\n",
            "col 1: Feature: INSDC feature type\n",
            "col 2: Class: Gene features are subdivided into classes according to the gene \n",
            "       biotype. ncRNA features are subdivided according to the ncRNA_class. CDS \n",
            "       features are subdivided into with_protein and without_protein, depending \n",
            "       on whether the CDS feature has a protein accession assigned or not. CDS \n",
            "       features marked as without_protein include CDS features for C regions and \n",
            "       V/D/J segments of immunoglobulin and similar genes that undergo genomic \n",
            "       rearrangement, and pseudogenes.\n",
            "col 3: Full Assembly: assembly accession.version for the full assembly\n",
            "col 4: Assembly-unit accession: assembly accession.version for the assembly \n",
            "       unit.\n",
            "col 5: Assembly-unit name: name of the assembly unit or set of sequences. For \n",
            "       assemblies with alt-loci or patch scaffolds, such as GRCh38.p11, all \n",
            "       sequences from all alt-loci or patches are combined together.\n",
            "col 6: Unique Ids: counts of unique identifiers. For gene features, this is the\n",
            "       count of unique GeneID db_xrefs, or locus_tags, such that genes that are\n",
            "       annotated at more than one location on the assembly unit (e.g. on both \n",
            "       chrX and chrY in the PAR region) are counted once. For RNA and CDS \n",
            "       features, this is the count of unique product accessions. If no product \n",
            "       accession is assigned, such as for RNA features in GenBank genomes or CDS\n",
            "       features classified as without_protein, then \"na\" is reported\n",
            "col 7: Placements: count of all features of that type on the indicated assembly \n",
            "       unit or set of sequences.\n",
            "\n",
            "Stats of common interest are:\n",
            "- the count of protein-coding genes in the nuclear genome, which corresponds to\n",
            "  \"gene\" in column 1, \"protein_coding\" in column 2, \"Primary Assembly\" in \n",
            "  column 5, and the count of Unique Ids as reported in column 6\n",
            "- the count of distinct protein sequences annotated in the nuclear genome, which\n",
            "  corresponds to \"CDS\" in column 1, \"with_protein\" in column 2, and the count of\n",
            "  Unique Ids as reported in column 6\n",
            "- the count of total CDS features with proteins annotated in the primary \n",
            "  assembly, regardless of whether two CDSes encode exactly the same protein and\n",
            "  use the same RefSeq WP_ protein accession, which corresponds to \"CDS\" in \n",
            "  column 1, \"with_protein\" in column 2, and the count of Placements as reported\n",
            "  in column 7\n",
            "\n",
            "\n",
            "*_feature_table.txt.gz\n",
            "----------------------\n",
            "Tab-delimited text file reporting locations and attributes for a subset of \n",
            "annotated features. Included feature types are: gene, CDS, RNA (all types), \n",
            "operon, C/V/N/S_region, and V/D/J_segment. \n",
            "\n",
            "The file is tab delimited (including a #header) with the following columns:\n",
            "col 1: feature: INSDC feature type\n",
            "col 2: class: Gene features are subdivided into classes according to the gene \n",
            "       biotype computed based on the set of child features for that gene. See \n",
            "       the description of the gene_biotype attribute in the GFF3 documentation\n",
            "       for more details: ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt\n",
            "       ncRNA features are subdivided according to the ncRNA_class. CDS features\n",
            "       are subdivided into with_protein and without_protein, depending on \n",
            "       whether the CDS feature has a protein accession assigned or not. CDS \n",
            "       features marked as without_protein include CDS features for C regions and \n",
            "       V/D/J segments of immunoglobulin and similar genes that undergo genomic \n",
            "       rearrangement, and pseudogenes.\n",
            "col 3: assembly: assembly accession.version\n",
            "col 4: assembly_unit: name of the assembly unit, such as \"Primary Assembly\", \n",
            "       \"ALT_REF_LOCI_1\", or \"non-nuclear\"\n",
            "col 5: seq_type: sequence type, computed from the \"Sequence-Role\" and \n",
            "       \"Assigned-Molecule-Location/Type\" in the *_assembly_report.txt file. The\n",
            "       value is computed as:\n",
            "       if an assembled-molecule, then report the location/type value. e.g. \n",
            "       chromosome, mitochondrion, or plasmid\n",
            "       if an unlocalized-scaffold, then report \"unlocalized scaffold on <type>\".\n",
            "       e.g. unlocalized scaffold on chromosome\n",
            "       else the role, e.g. alternate scaffold, fix patch, or novel patch\n",
            "col 6: chromosome\n",
            "col 7: genomic_accession\n",
            "col 8: start: feature start coordinate (base-1). start is always less than end\n",
            "col 9: end: feature end coordinate (base-1)\n",
            "col10: strand\n",
            "col11: product_accession: accession.version of the product referenced by this \n",
            "       feature, if one exists\n",
            "col12: non-redundant_refseq: for bacteria and archaea assemblies, the \n",
            "       non-redundant WP_ protein accession corresponding to the CDS feature. May\n",
            "       be the same as column 11, for RefSeq genomes annotated directly with WP_\n",
            "       RefSeq proteins, or may be different, for genomes annotated with \n",
            "       genome-specific protein accessions (e.g. NP_ or YP_ RefSeq proteins) that\n",
            "       reference a WP_ RefSeq accession.\n",
            "col13: related_accession: for eukaryotic RefSeq annotations, the RefSeq protein\n",
            "       accession corresponding to the transcript feature, or the RefSeq \n",
            "       transcript accession corresponding to the protein feature.\n",
            "col14: name: For genes, this is the gene description or full name. For RNA, CDS,\n",
            "       and some other features, this is the product name.\n",
            "col15: symbol: gene symbol\n",
            "col16: GeneID: NCBI GeneID, for those RefSeq genomes included in NCBI's Gene \n",
            "       resource\n",
            "col17: locus_tag\n",
            "col18: feature_interval_length: sum of the lengths of all intervals for the \n",
            "       feature (i.e. the length without introns for a joined feature)\n",
            "col19: product_length: length of the product corresponding to the \n",
            "       accession.version in column 11. Protein product lengths are in amino acid\n",
            "       units, and do not include the stop codon which is included in column 18.\n",
            "       Additionally, product_length may differ from feature_interval_length if \n",
            "       the product contains sequence differences vs. the genome, as found for \n",
            "       some RefSeq transcript and protein products based on mRNA sequences and \n",
            "       also for INSDC proteins that are submitted to correct genome \n",
            "       discrepancies.\n",
            "col20: attributes: semi-colon delimited list of a controlled set of qualifiers.\n",
            "       The list currently includes:\n",
            "       partial, pseudo, pseudogene, ribosomal_slippage, trans_splicing, \n",
            "       anticodon=NNN (for tRNAs), old_locus_tag=XXX \n",
            "\n",
            "\n",
            "*_genomic_gaps.txt.gz\n",
            "---------------------\n",
            "Tab-delimited text file reporting the coordinates of all gaps in the top-level \n",
            "genomic sequences. The gaps reported include gaps specified in the AGP files, \n",
            "gaps annotated on the component sequences, and any other run of 10 or more Ns \n",
            "in the sequences. Gap types are reported using the International Nucleotide \n",
            "Sequence Database Collaboration feature table terms with spaces replaced by \n",
            "underscores, see: http://www.insdc.org/files/feature_table.html\n",
            "\n",
            "The file is tab delimited (including a #header) with the following columns:\n",
            "col 1: sequence accession.version\n",
            "col 2: gap start position (1-based)\n",
            "col 3: gap stop position (1-based)\n",
            "col 4: gap_length\n",
            "col 5: gap_type. \n",
            "       One of: centromere, heterochromatin, short_arm, telomere,\n",
            "       between_scaffolds, within_scaffold, repeat_between_scaffolds, \n",
            "       repeat_within_scaffold, contamination, unknown\n",
            "col 6: linkage_evidence. \n",
            "       Gaps of type within_scaffold or repeat_within_scaffold have one or more \n",
            "       of the following types of linkage evidence: paired-ends, pcr, \n",
            "       proximity_ligation, align_genus, align_xgenus, align_trnscpt, \n",
            "       within_clone, clone_contig, map, strobe, unspecified, \n",
            "       inferred_from_sequence. \n",
            "       Multiple lines of linkage evidence are separated by a ';' delimiter. \n",
            "       Gaps of type contamination have unspecified as the linkage evidence.\n",
            "       All other gap types have 'na' in the linkage evidence column. \n",
            "       See: http://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/ \n",
            "\n",
            "\n",
            "annotation_hashes.txt\n",
            "---------------------\n",
            "Tab-delimited text file reporting hash values and change dates for specific \n",
            "details of the annotation. Hashes are computed based on the underlying data in \n",
            "ASN.1 format, and thus aren't affected by changes in file formats. In contrast,\n",
            "the checksums reported in the md5checksums.txt file will change with any change\n",
            "to the files, including file formats and differences in gzip compression. The\n",
            "hashes are useful to monitor for when annotation has changed in a way that is \n",
            "significant for a particular use case and warrants downloading the updated \n",
            "records.\n",
            "\n",
            "The file is tab delimited (including a #header) with the following columns:\n",
            "col 1: Assembly accession: accession.version\n",
            "col 2: Descriptors hash: hash of all descriptors on top-level sequence records,\n",
            "       including BioSource, molinfo, user objects, publications, and dates\n",
            "col 3: Descriptors last changed: date and time of the last change to any \n",
            "       descriptors\n",
            "col 4: Features hash: hash of all features annotated on the assembly, including\n",
            "       both locations and qualifiers stored directly on the genome records. For\n",
            "       RefSeq genomes annotated with WP proteins and some other cases, protein\n",
            "       product names aren't stored on the genome records and thus changes in \n",
            "       protein names do not alter the features hash.\n",
            "col 5: Features last changed: date and time of the last change to any features\n",
            "col 6: Locations hash: hash of just the locations of all features annotated on\n",
            "       the assembly.\n",
            "col 7: Locations last changed: date and time of the last change to any feature\n",
            "       locations\n",
            "col 8: Protein names hash: hash of the protein names for all CDS features \n",
            "       annotated on the assembly.\n",
            "col 9: Protein names last changed: date and time of the last change to any \n",
            "       protein names.\n",
            "\n",
            "Example use cases:\n",
            "  A change in the Locations hash indicates that at least one feature has been \n",
            "     added, removed, or had its location altered.\n",
            "  A change in the Features hash but not the Locations hash implies that only\n",
            "     feature qualifiers have changed, such as names or db_xrefs.\n",
            "  A change in the Protein names hash indicates that at least one protein name\n",
            "     has changed compared to the previous files provided on the genomes FTP \n",
            "     site. Note for RefSeq prokaryotic genomes, protein names are updated \n",
            "     continuously but files on the FTP site are only refreshed intermittently\n",
            "     to minimize churn.\n",
            "  A change in the Descriptors hash but not the Features hash implies that only\n",
            "     record metadata has been touched, such as the addition of a publication.\n",
            "\n",
            "NOTE: currently the descriptors hash values are not stable due to a bug.\n",
            "\n",
            "\n",
            "assembly_status.txt\n",
            "------------------\n",
            "A text file reporting the current status of the version of the assembly for \n",
            "which data is provided. Any assembly anomalies are also reported. Lines have the\n",
            "format tag=value.\n",
            "\n",
            "First line: status=<value> \n",
            "  where <value> is one of latest, replaced or suppressed\n",
            "Second line (if any): assembly anomaly=<value>\n",
            "  where value is a comma separated list of assembly anomalies as described in\n",
            "  the \"Anomalous assemblies\" section of this web page:\n",
            "  https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/\n",
            "\n",
            "\n",
            "*_compare_prev.txt.gz\n",
            "--------------------- \n",
            "The annotation produced for this release was compared to the annotation in the \n",
            "previous release. Scores for pairs of best-mapping current and previous gene and\n",
            "transcript features were calculated based on overlap in exon sequence and \n",
            "matches in exon boundaries. Pairs of current and previous features were \n",
            "categorized based on these scores, whether they are reciprocal best matches, and\n",
            "changes in attributes (gene biotype, completeness, etc.). If the assembly was \n",
            "updated between the two releases, alignments between the current and the \n",
            "previous assembly were used to match the current and previous gene and \n",
            "transcript features in aligning regions.\n",
            "\n",
            "col 1: gene category: categorization of the difference between the gene in the \n",
            "       current annotation and the gene it maps best to in the previous \n",
            "       annotation: \n",
            "       * Changed locus ID: new feature identifier\n",
            "       * Merged: the previous feature matches only part of the current feature\n",
            "       * Split: the current feature matches only part of the previous feature\n",
            "       * Changed locus type: difference in the gene biotype (coding vs. \n",
            "         non-coding, pseudogene vs. coding, etc...) - applies to genes only\n",
            "       * Current-novel: Current feature has no matching previous feature\n",
            "       * Previous-novel: previous feature has no matching current feature  \n",
            "       * Current-other: current feature can't be matched unambiguously to \n",
            "         previous feature\n",
            "       * Previous-other: previous feature can't be matched unambiguously to \n",
            "         current feature\n",
            "       * Current-unmapped: current feature location can't be mapped to the \n",
            "         previous assembly\n",
            "       * Previous-unmapped: previous feature location can't be mapped to the \n",
            "         current assembly\n",
            "       * Moved: feature ID found on both current and previous but not placed \n",
            "         on regions aligned to each other by assembly-assembly alignment\n",
            "       * Identical: identical exon boundaries in current and previous\n",
            "       * Variant: alternative variant in current not in previous - applies \n",
            "         to transcripts only\n",
            "       * Change in exception: Exceptions are added to RNA features when the \n",
            "         RefSeq transcript sequence doesn't match the conceptual sequence from\n",
            "         the genome due to the presence of mismatches, indels, or additional \n",
            "         sequence, or in some other cases of unusual biology like ribosomal \n",
            "         slippage. This category reports when the current or previous RefSeq \n",
            "         transcript sequence was annotated with an exception and the matched \n",
            "         transcript does not. - applies to transcripts only\n",
            "       * Similar: highly similar features, with support scores of 0.66 or more\n",
            "         (on a scale of 0 to 1) on both sides of the comparison. The support \n",
            "         score is derived from a combination of matching exon boundaries and \n",
            "         sequence overlap. - applies to genes and non-coding transcripts only\n",
            "       * Similar, change in CDS: support scores of 0.66 or more on both sides \n",
            "         AND the change affects the CDS - applies to coding transcripts only\n",
            "       * Similar, change in UTR only: support scores of 0.66 or more on both \n",
            "         sides AND the change affects UTRs only (not the CDS) - applies to \n",
            "         coding transcripts only\n",
            "       * Changed feature type: difference in the feature type - applies to \n",
            "         coding transcripts only\n",
            "       * Changed completeness: feature is partial in current and complete in \n",
            "         the previous or vice versa - applies to genes only\n",
            "       * Changed substantially: low similarity feature with support scores \n",
            "         below 0.66 on one or both sides. \n",
            "       * NA: No transcript feature associated with the current or previous gene \n",
            "         (pseudogene)\n",
            "       * Previous-variant: alternative variant in previous not in current \n",
            "         - applies to transcripts only\n",
            "       * Other: complex cases not fitting in categories above\n",
            "\n",
            "       Notes:\n",
            "       1. A gene may be categorized as not 'Identical' if the boundaries of a \n",
            "       gene are unchanged in the two annotation releases, but the children \n",
            "       exons have changed. This situation is frequent when a different set of \n",
            "       alternative variants are predicted for a gene. \n",
            "       2. Since a transcript feature on one side of the comparison may overlap\n",
            "       multiple transcripts in the other side, a transcript may appear on \n",
            "       multiple lines in the report.\n",
            "col 2: current GeneID: Gene database identifier for the current gene \n",
            "col 3: current gene biotype: attribute computed on gene features based on the \n",
            "       set of child features to indicate the overall biotype for the gene \n",
            "       annotation at this location. See list of possible biotypes in \n",
            "       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt\n",
            "col 4: current assembly unit: Assembly unit on which the current gene is \n",
            "       annotated   \n",
            "col 5: current genomic accession: RefSeq accession.version of the sequence on \n",
            "       which the current gene is annotated\n",
            "col 6: current gene range: start and stop coordinates of the current gene on the\n",
            "       genomic sequence. \n",
            "col 7: orientation of the current gene on the genomic sequence\n",
            "col 8-13: same description as for col 2-7 but for the previous gene that maps \n",
            "       best to the current gene.\n",
            "col14: transcript category: categorization of the mapping between the current \n",
            "       and the previous annotated transcript. See description for column 1\n",
            "col15: current transcript accession: RefSeq accession.version of the transcript\n",
            "       annotated on the current assembly.\n",
            "col16: current protein accession: RefSeq accession.version of the protein\n",
            "       annotated on the current assembly.\n",
            "col17: current transcript range: start and stop coordinates of the current \n",
            "       transcript on the genomic sequence.\n",
            "col18-20: same description as for col 15-17 but for the previous transcript\n",
            "       that maps best to the current transcript.\n",
            " \n",
            "*_compare_prev.gbp.gz\n",
            "---------------------\n",
            "See the description for _compare_prev.txt.gz for a description of the content.\n",
            "To visualize the differences in annotation in Genome WorkBench \n",
            "(https://www.ncbi.nlm.nih.gov/tools/gbench/):\n",
            "Unzip *_compare_previous.gbp.gz\n",
            "Load *_compare_previous.gbp to GenomeWorkBench \n",
            "Open the comparison report *_.compare_prev.table.asn as a Generic Table View\n",
            "Use search and filter on terms in the table to find a specific gene, sequence \n",
            "or category\n",
            "Right-click or double-click on a row and open the 'Graphical Sequence View' for\n",
            "a graphical view of the gene on the current or the previous scaffold\n",
            "Make sure 'Project features for aligned sequences' is enabled in the Alignments\n",
            "track if the comparison is across two different versions of an assembly.\n",
            "\n",
            "________________________________________________________________________________\n",
            "National Center for Biotechnology Information (NCBI)\n",
            "National Library of Medicine\n",
            "National Institutes of Health\n",
            "8600 Rockville Pike\n",
            "Bethesda, MD 20894, USA\n",
            "tel: (301) 496-2475\n",
            "fax: (301) 480-9241\n",
            "e-mail: info@ncbi.nlm.nih.gov\n",
            "________________________________________________________________________________\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w7HbbWZG8SSO",
        "colab_type": "text"
      },
      "source": [
        "That's a lot to read. But if you scroll to the top, you'll see that some of this data comes from [GenBank](https://en.wikipedia.org/wiki/GenBank).\n",
        "\n",
        "This file describes the contents of the tarfile! Most of it is way over my head, since I'm not a geneticist. But maybe it would be interesting to look at one of the files it mentions?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "SM9Pnnbo5tJh",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 575
        },
        "outputId": "d34c12bc-804d-49cc-944a-9578360c01ce"
      },
      "source": [
        "# Read the assembly report; the with block closes the file when done.\n",
        "with open('ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_assembly_report.txt') as f:\n",
        "    text = f.read()\n",
        "print(text)"
      ],
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "# Assembly name:  ASM985889v3\n",
            "# Organism name:  Severe acute respiratory syndrome coronavirus 2 (viruses)\n",
            "# Isolate:  Wuhan-Hu-1\n",
            "# Taxid:          2697049\n",
            "# BioProject:     PRJNA485481\n",
            "# Submitter:      na\n",
            "# Date:           2020-01-13\n",
            "# Assembly type:  na\n",
            "# Release type:   major\n",
            "# Assembly level: Complete Genome\n",
            "# Genome representation: full\n",
            "# Assembly method: Megahit v. V1.1.3\n",
            "# Sequencing technology: Illumina\n",
            "# Relation to type material: ICTV additional isolate\n",
            "# RefSeq category: Reference Genome\n",
            "# GenBank assembly accession: GCA_009858895.3\n",
            "# RefSeq assembly accession: GCF_009858895.2\n",
            "# RefSeq assembly and GenBank assemblies identical: yes\n",
            "#\n",
            "## Assembly-Units:\n",
            "## GenBank Unit Accession\tRefSeq Unit Accession\tAssembly-Unit name\n",
            "## GCA_009858905.3\tGCF_009858905.2\tPrimary Assembly\n",
            "#\n",
            "# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by\n",
            "# unlocalized scaffolds.\n",
            "# Unplaced scaffolds are listed at the end.\n",
            "# RefSeq is equal or derived from GenBank object.\n",
            "#\n",
            "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tAssigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\tRefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n",
            "NC_045512.2\tassembled-molecule\tna\tSegment\tMN908947.3\t=\tNC_045512.2\tPrimary Assembly\t29903\tna\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UIUFVRf88-uc",
        "colab_type": "text"
      },
      "source": [
        "Oh wow, so this is genetic information about the coronavirus! Let's take a look at one of the gzipped files using the Python [gzip](https://docs.python.org/3/library/gzip.html) module."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cvk9Yxk_9KVP",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "91fc7c03-51b6-4163-9bf1-7b37bb058207"
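      },
      "source": [
        "import gzip\n",
        "\n",
        "# Open in text mode ('rt') so the FASTA data is decoded to str rather than bytes.\n",
        "with gzip.open('ncbi-genomes-2020-08-27/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz', 'rt') as f:\n",
        "    text = f.read()\n",
        "print(text)"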
      ],
      "execution_count": 27,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            ">MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome\n",
            "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA\n",
            "AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGG\n",
            "ACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT\n",
            "CGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGC\n",
            "CTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACAT\n",
            "CTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAA\n",
            "ACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC\n",
            "GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAG\n",
            "AACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGA\n",
            "TCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACG\n",
            "GAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTA\n",
            "GCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCG\n",
            "TGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAAT\n",
            "TGGCAAAGAAATTTGACACCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAA\n",
            "CCAAGGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCACCAAATGAATG\n",
            "CAACCAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCAGACGGGCGATTTTGTTAAAG\n",
            "CCACTTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTT\n",
            "GTTAAAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGG\n",
            "CTTGAAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACA\n",
            "AGTGTGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCATACAGGTGTTGTTGGAGAAGGTTCCGAAGGT\n",
            "CTTAATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAATATTGTTGGTGACTTTAAACTTAATGAAGA\n",
            "GATCGCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGGAAACTGTGAAAGGTTTGGATTATAAAGCAT\n",
            "TCAAACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAAGCTAAAAAAGGTGCCTGGAATATTGGTGAA\n",
            "CAGAAATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCGTGTTGTACGATCAATTTTCTCCCGCACTCT\n",
            "TGAAACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAATACTAGATGGAATTTCACAGTATTCACTGA\n",
            "GACTCATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTAGTTGTAATGGCCTACATTACAGGTGGTGTT\n",
            "GTTCAGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGAAAAACTCAAACCCGTCCTTGATTGGCTTGA\n",
            "AGAGAAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAA\n",
            "TTGTCGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTT\n",
            "TTGGCTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCA\n",
            "CTCAAAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAA\n",
            "TTATCTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCA\n",
            "TTAGAACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACCAGTTTGTATTAACGGGCTTATGTTGCTCGA\n",
            "AATCAAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAACAAACAATACCTTCACACTCAAAGGCGGTG\n",
            "CACCAACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGTTACAAGAGTGTGAATATCACTTTTGAACTT\n",
            "GATGAAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGTTGAACTCGGTACAGAAGTAAATGAGTTCGC\n",
            "CTGTGTTGTGGCAGATGCTGTCATAAAAACTTTGCAACCAGTATCTGAATTACTTACACCACTGGGCATTGATTTAGATG\n",
            "AGTGGAGTATGGCTACATACTACTTATTTGATGAGTCTGGTGAGTTTAAATTGGCTTCACATATGTATTGTTCTTTCTAC\n",
            "CCTCCAGATGAGGATGAAGAAGAAGGTGATTGTGAAGAAGAAGAGTTTGAGCCATCAACTCAATATGAGTATGGTACTGA\n",
            "AGATGATTACCAAGGTAAACCTTTGGAATTTGGTGCCACTTCTGCTGCTCTTCAACCTGAAGAAGAGCAAGAAGAAGATT\n",
            "GGTTAGATGATGATAGTCAACAAACTGTTGGTCAACAAGACGGCAGTGAGGACAATCAGACAACTACTATTCAAACAATT\n",
            "GTTGAGGTTCAACCTCAATTAGAGATGGAACTTACACCAGTTGTTCAGACTATTGAAGTGAATAGTTTTAGTGGTTATTT\n",
            "AAAACTTACTGACAATGTATACATTAAAAATGCAGACATTGTGGAAGAAGCTAAAAAGGTAAAACCAACAGTGGTTGTTA\n",
            "ATGCAGCCAATGTTTACCTTAAACATGGAGGAGGTGTTGCAGGAGCCTTAAATAAGGCTACTAACAATGCCATGCAAGTT\n",
            "GAATCTGATGATTACATAGCTACTAATGGACCACTTAAAGTGGGTGGTAGTTGTGTTTTAAGCGGACACAATCTTGCTAA\n",
            "ACACTGTCTTCATGTTGTCGGCCCAAATGTTAACAAAGGTGAAGACATTCAACTTCTTAAGAGTGCTTATGAAAATTTTA\n",
            "ATCAGCACGAAGTTCTACTTGCACCATTATTATCAGCTGGTATTTTTGGTGCTGACCCTATACATTCTTTAAGAGTTTGT\n",
            "GTAGATACTGTTCGCACAAATGTCTACTTAGCTGTCTTTGATAAAAATCTCTATGACAAACTTGTTTCAAGCTTTTTGGA\n",
            "AATGAAGAGTGAAAAGCAAGTTGAACAAAAGATCGCTGAGATTCCTAAAGAGGAAGTTAAGCCATTTATAACTGAAAGTA\n",
            "AACCTTCAGTTGAACAGAGAAAACAAGATGATAAGAAAATCAAAGCTTGTGTTGAAGAAGTTACAACAACTCTGGAAGAA\n",
            "ACTAAGTTCCTCACAGAAAACTTGTTACTTTATATTGACATTAATGGCAATCTTCATCCAGATTCTGCCACTCTTGTTAG\n",
            "TGACATTGACATCACTTTCTTAAAGAAAGATGCTCCATATATAGTGGGTGATGTTGTTCAAGAGGGTGTTTTAACTGCTG\n",
            "TGGTTATACCTACTAAAAAGGCTGGTGGCACTACTGAAATGCTAGCGAAAGCTTTGAGAAAAGTGCCAACAGACAATTAT\n",
            "ATAACCACTTACCCGGGTCAGGGTTTAAATGGTTACACTGTAGAGGAGGCAAAGACAGTGCTTAAAAAGTGTAAAAGTGC\n",
            "CTTTTACATTCTACCATCTATTATCTCTAATGAGAAGCAAGAAATTCTTGGAACTGTTTCTTGGAATTTGCGAGAAATGC\n",
            "TTGCACATGCAGAAGAAACACGCAAATTAATGCCTGTCTGTGTGGAAACTAAAGCCATAGTTTCAACTATACAGCGTAAA\n",
            "TATAAGGGTATTAAAATACAAGAGGGTGTGGTTGATTATGGTGCTAGATTTTACTTTTACACCAGTAAAACAACTGTAGC\n",
            "GTCACTTATCAACACACTTAACGATCTAAATGAAACTCTTGTTACAATGCCACTTGGCTATGTAACACATGGCTTAAATT\n",
            "TGGAAGAAGCTGCTCGGTATATGAGATCTCTCAAAGTGCCAGCTACAGTTTCTGTTTCTTCACCTGATGCTGTTACAGCG\n",
            "TATAATGGTTATCTTACTTCTTCTTCTAAAACACCTGAAGAACATTTTATTGAAACCATCTCACTTGCTGGTTCCTATAA\n",
            "AGATTGGTCCTATTCTGGACAATCTACACAACTAGGTATAGAATTTCTTAAGAGAGGTGATAAAAGTGTATATTACACTA\n",
            "GTAATCCTACCACATTCCACCTAGATGGTGAAGTTATCACCTTTGACAATCTTAAGACACTTCTTTCTTTGAGAGAAGTG\n",
            "AGGACTATTAAGGTGTTTACAACAGTAGACAACATTAACCTCCACACGCAAGTTGTGGACATGTCAATGACATATGGACA\n",
            "ACAGTTTGGTCCAACTTATTTGGATGGAGCTGATGTTACTAAAATAAAACCTCATAATTCACATGAAGGTAAAACATTTT\n",
            "ATGTTTTACCTAATGATGACACTCTACGTGTTGAGGCTTTTGAGTACTACCACACAACTGATCCTAGTTTTCTGGGTAGG\n",
            "TACATGTCAGCATTAAATCACACTAAAAAGTGGAAATACCCACAAGTTAATGGTTTAACTTCTATTAAATGGGCAGATAA\n",
            "CAACTGTTATCTTGCCACTGCATTGTTAACACTCCAACAAATAGAGTTGAAGTTTAATCCACCTGCTCTACAAGATGCTT\n",
            "ATTACAGAGCAAGGGCTGGTGAAGCTGCTAACTTTTGTGCACTTATCTTAGCCTACTGTAATAAGACAGTAGGTGAGTTA\n",
            "GGTGATGTTAGAGAAACAATGAGTTACTTGTTTCAACATGCCAATTTAGATTCTTGCAAAAGAGTCTTGAACGTGGTGTG\n",
            "TAAAACTTGTGGACAACAGCAGACAACCCTTAAGGGTGTAGAAGCTGTTATGTACATGGGCACACTTTCTTATGAACAAT\n",
            "TTAAGAAAGGTGTTCAGATACCTTGTACGTGTGGTAAACAAGCTACAAAATATCTAGTACAACAGGAGTCACCTTTTGTT\n",
            "ATGATGTCAGCACCACCTGCTCAGTATGAACTTAAGCATGGTACATTTACTTGTGCTAGTGAGTACACTGGTAATTACCA\n",
            "GTGTGGTCACTATAAACATATAACTTCTAAAGAAACTTTGTATTGCATAGACGGTGCTTTACTTACAAAGTCCTCAGAAT\n",
            "ACAAAGGTCCTATTACGGATGTTTTCTACAAAGAAAACAGTTACACAACAACCATAAAACCAGTTACTTATAAATTGGAT\n",
            "GGTGTTGTTTGTACAGAAATTGACCCTAAGTTGGACAATTATTATAAGAAAGACAATTCTTATTTCACAGAGCAACCAAT\n",
            "TGATCTTGTACCAAACCAACCATATCCAAACGCAAGCTTCGATAATTTTAAGTTTGTATGTGATAATATCAAATTTGCTG\n",
            "ATGATTTAAACCAGTTAACTGGTTATAAGAAACCTGCTTCAAGAGAGCTTAAAGTTACATTTTTCCCTGACTTAAATGGT\n",
            "GATGTGGTGGCTATTGATTATAAACACTACACACCCTCTTTTAAGAAAGGAGCTAAATTGTTACATAAACCTATTGTTTG\n",
            "GCATGTTAACAATGCAACTAATAAAGCCACGTATAAACCAAATACCTGGTGTATACGTTGTCTTTGGAGCACAAAACCAG\n",
            "TTGAAACATCAAATTCGTTTGATGTACTGAAGTCAGAGGACGCGCAGGGAATGGATAATCTTGCCTGCGAAGATCTAAAA\n",
            "CCAGTCTCTGAAGAAGTAGTGGAAAATCCTACCATACAGAAAGACGTTCTTGAGTGTAATGTGAAAACTACCGAAGTTGT\n",
            "AGGAGACATTATACTTAAACCAGCAAATAATAGTTTAAAAATTACAGAAGAGGTTGGCCACACAGATCTAATGGCTGCTT\n",
            "ATGTAGACAATTCTAGTCTTACTATTAAGAAACCTAATGAATTATCTAGAGTATTAGGTTTGAAAACCCTTGCTACTCAT\n",
            "GGTTTAGCTGCTGTTAATAGTGTCCCTTGGGATACTATAGCTAATTATGCTAAGCCTTTTCTTAACAAAGTTGTTAGTAC\n",
            "AACTACTAACATAGTTACACGGTGTTTAAACCGTGTTTGTACTAATTATATGCCTTATTTCTTTACTTTATTGCTACAAT\n",
            "TGTGTACTTTTACTAGAAGTACAAATTCTAGAATTAAAGCATCTATGCCGACTACTATAGCAAAGAATACTGTTAAGAGT\n",
            "GTCGGTAAATTTTGTCTAGAGGCTTCATTTAATTATTTGAAGTCACCTAATTTTTCTAAACTGATAAATATTATAATTTG\n",
            "GTTTTTACTATTAAGTGTTTGCCTAGGTTCTTTAATCTACTCAACCGCTGCTTTAGGTGTTTTAATGTCTAATTTAGGCA\n",
            "TGCCTTCTTACTGTACTGGTTACAGAGAAGGCTATTTGAACTCTACTAATGTCACTATTGCAACCTACTGTACTGGTTCT\n",
            "ATACCTTGTAGTGTTTGTCTTAGTGGTTTAGATTCTTTAGACACCTATCCTTCTTTAGAAACTATACAAATTACCATTTC\n",
            "ATCTTTTAAATGGGATTTAACTGCTTTTGGCTTAGTTGCAGAGTGGTTTTTGGCATATATTCTTTTCACTAGGTTTTTCT\n",
            "ATGTACTTGGATTGGCTGCAATCATGCAATTGTTTTTCAGCTATTTTGCAGTACATTTTATTAGTAATTCTTGGCTTATG\n",
            "TGGTTAATAATTAATCTTGTACAAATGGCCCCGATTTCAGCTATGGTTAGAATGTACATCTTCTTTGCATCATTTTATTA\n",
            "TGTATGGAAAAGTTATGTGCATGTTGTAGACGGTTGTAATTCATCAACTTGTATGATGTGTTACAAACGTAATAGAGCAA\n",
            "CAAGAGTCGAATGTACAACTATTGTTAATGGTGTTAGAAGGTCCTTTTATGTCTATGCTAATGGAGGTAAAGGCTTTTGC\n",
            "AAACTACACAATTGGAATTGTGTTAATTGTGATACATTCTGTGCTGGTAGTACATTTATTAGTGATGAAGTTGCGAGAGA\n",
            "CTTGTCACTACAGTTTAAAAGACCAATAAATCCTACTGACCAGTCTTCTTACATCGTTGATAGTGTTACAGTGAAGAATG\n",
            "GTTCCATCCATCTTTACTTTGATAAAGCTGGTCAAAAGACTTATGAAAGACATTCTCTCTCTCATTTTGTTAACTTAGAC\n",
            "AACCTGAGAGCTAATAACACTAAAGGTTCATTGCCTATTAATGTTATAGTTTTTGATGGTAAATCAAAATGTGAAGAATC\n",
            "ATCTGCAAAATCAGCGTCTGTTTACTACAGTCAGCTTATGTGTCAACCTATACTGTTACTAGATCAGGCATTAGTGTCTG\n",
            "ATGTTGGTGATAGTGCGGAAGTTGCAGTTAAAATGTTTGATGCTTACGTTAATACGTTTTCATCAACTTTTAACGTACCA\n",
            "ATGGAAAAACTCAAAACACTAGTTGCAACTGCAGAAGCTGAACTTGCAAAGAATGTGTCCTTAGACAATGTCTTATCTAC\n",
            "TTTTATTTCAGCAGCTCGGCAAGGGTTTGTTGATTCAGATGTAGAAACTAAAGATGTTGTTGAATGTCTTAAATTGTCAC\n",
            "ATCAATCTGACATAGAAGTTACTGGCGATAGTTGTAATAACTATATGCTCACCTATAACAAAGTTGAAAACATGACACCC\n",
            "CGTGACCTTGGTGCTTGTATTGACTGTAGTGCGCGTCATATTAATGCGCAGGTAGCAAAAAGTCACAACATTGCTTTGAT\n",
            "ATGGAACGTTAAAGATTTCATGTCATTGTCTGAACAACTACGAAAACAAATACGTAGTGCTGCTAAAAAGAATAACTTAC\n",
            "CTTTTAAGTTGACATGTGCAACTACTAGACAAGTTGTTAATGTTGTAACAACAAAGATAGCACTTAAGGGTGGTAAAATT\n",
            "GTTAATAATTGGTTGAAGCAGTTAATTAAAGTTACACTTGTGTTCCTTTTTGTTGCTGCTATTTTCTATTTAATAACACC\n",
            "TGTTCATGTCATGTCTAAACATACTGACTTTTCAAGTGAAATCATAGGATACAAGGCTATTGATGGTGGTGTCACTCGTG\n",
            "ACATAGCATCTACAGATACTTGTTTTGCTAACAAACATGCTGATTTTGACACATGGTTTAGCCAGCGTGGTGGTAGTTAT\n",
            "ACTAATGACAAAGCTTGCCCATTGATTGCTGCAGTCATAACAAGAGAAGTGGGTTTTGTCGTGCCTGGTTTGCCTGGCAC\n",
            "GATATTACGCACAACTAATGGTGACTTTTTGCATTTCTTACCTAGAGTTTTTAGTGCAGTTGGTAACATCTGTTACACAC\n",
            "CATCAAAACTTATAGAGTACACTGACTTTGCAACATCAGCTTGTGTTTTGGCTGCTGAATGTACAATTTTTAAAGATGCT\n",
            "TCTGGTAAGCCAGTACCATATTGTTATGATACCAATGTACTAGAAGGTTCTGTTGCTTATGAAAGTTTACGCCCTGACAC\n",
            "ACGTTATGTGCTCATGGATGGCTCTATTATTCAATTTCCTAACACCTACCTTGAAGGTTCTGTTAGAGTGGTAACAACTT\n",
            "TTGATTCTGAGTACTGTAGGCACGGCACTTGTGAAAGATCAGAAGCTGGTGTTTGTGTATCTACTAGTGGTAGATGGGTA\n",
            "CTTAACAATGATTATTACAGATCTTTACCAGGAGTTTTCTGTGGTGTAGATGCTGTAAATTTACTTACTAATATGTTTAC\n",
            "ACCACTAATTCAACCTATTGGTGCTTTGGACATATCAGCATCTATAGTAGCTGGTGGTATTGTAGCTATCGTAGTAACAT\n",
            "GCCTTGCCTACTATTTTATGAGGTTTAGAAGAGCTTTTGGTGAATACAGTCATGTAGTTGCCTTTAATACTTTACTATTC\n",
            "CTTATGTCATTCACTGTACTCTGTTTAACACCAGTTTACTCATTCTTACCTGGTGTTTATTCTGTTATTTACTTGTACTT\n",
            "GACATTTTATCTTACTAATGATGTTTCTTTTTTAGCACATATTCAGTGGATGGTTATGTTCACACCTTTAGTACCTTTCT\n",
            "GGATAACAATTGCTTATATCATTTGTATTTCCACAAAGCATTTCTATTGGTTCTTTAGTAATTACCTAAAGAGACGTGTA\n",
            "GTCTTTAATGGTGTTTCCTTTAGTACTTTTGAAGAAGCTGCGCTGTGCACCTTTTTGTTAAATAAAGAAATGTATCTAAA\n",
            "GTTGCGTAGTGATGTGCTATTACCTCTTACGCAATATAATAGATACTTAGCTCTTTATAATAAGTACAAGTATTTTAGTG\n",
            "GAGCAATGGATACAACTAGCTACAGAGAAGCTGCTTGTTGTCATCTCGCAAAGGCTCTCAATGACTTCAGTAACTCAGGT\n",
            "TCTGATGTTCTTTACCAACCACCACAAACCTCTATCACCTCAGCTGTTTTGCAGAGTGGTTTTAGAAAAATGGCATTCCC\n",
            "ATCTGGTAAAGTTGAGGGTTGTATGGTACAAGTAACTTGTGGTACAACTACACTTAACGGTCTTTGGCTTGATGACGTAG\n",
            "TTTACTGTCCAAGACATGTGATCTGCACCTCTGAAGACATGCTTAACCCTAATTATGAAGATTTACTCATTCGTAAGTCT\n",
            "AATCATAATTTCTTGGTACAGGCTGGTAATGTTCAACTCAGGGTTATTGGACATTCTATGCAAAATTGTGTACTTAAGCT\n",
            "TAAGGTTGATACAGCCAATCCTAAGACACCTAAGTATAAGTTTGTTCGCATTCAACCAGGACAGACTTTTTCAGTGTTAG\n",
            "CTTGTTACAATGGTTCACCATCTGGTGTTTACCAATGTGCTATGAGGCCCAATTTCACTATTAAGGGTTCATTCCTTAAT\n",
            "GGTTCATGTGGTAGTGTTGGTTTTAACATAGATTATGACTGTGTCTCTTTTTGTTACATGCACCATATGGAATTACCAAC\n",
            "TGGAGTTCATGCTGGCACAGACTTAGAAGGTAACTTTTATGGACCTTTTGTTGACAGGCAAACAGCACAAGCAGCTGGTA\n",
            "CGGACACAACTATTACAGTTAATGTTTTAGCTTGGTTGTACGCTGCTGTTATAAATGGAGACAGGTGGTTTCTCAATCGA\n",
            "TTTACCACAACTCTTAATGACTTTAACCTTGTGGCTATGAAGTACAATTATGAACCTCTAACACAAGACCATGTTGACAT\n",
            "ACTAGGACCTCTTTCTGCTCAAACTGGAATTGCCGTTTTAGATATGTGTGCTTCATTAAAAGAATTACTGCAAAATGGTA\n",
            "TGAATGGACGTACCATATTGGGTAGTGCTTTATTAGAAGATGAATTTACACCTTTTGATGTTGTTAGACAATGCTCAGGT\n",
            "GTTACTTTCCAAAGTGCAGTGAAAAGAACAATCAAGGGTACACACCACTGGTTGTTACTCACAATTTTGACTTCACTTTT\n",
            "AGTTTTAGTCCAGAGTACTCAATGGTCTTTGTTCTTTTTTTTGTATGAAAATGCCTTTTTACCTTTTGCTATGGGTATTA\n",
            "TTGCTATGTCTGCTTTTGCAATGATGTTTGTCAAACATAAGCATGCATTTCTCTGTTTGTTTTTGTTACCTTCTCTTGCC\n",
            "ACTGTAGCTTATTTTAATATGGTCTATATGCCTGCTAGTTGGGTGATGCGTATTATGACATGGTTGGATATGGTTGATAC\n",
            "TAGTTTGTCTGGTTTTAAGCTAAAAGACTGTGTTATGTATGCATCAGCTGTAGTGTTACTAATCCTTATGACAGCAAGAA\n",
            "CTGTGTATGATGATGGTGCTAGGAGAGTGTGGACACTTATGAATGTCTTGACACTCGTTTATAAAGTTTATTATGGTAAT\n",
            "GCTTTAGATCAAGCCATTTCCATGTGGGCTCTTATAATCTCTGTTACTTCTAACTACTCAGGTGTAGTTACAACTGTCAT\n",
            "GTTTTTGGCCAGAGGTATTGTTTTTATGTGTGTTGAGTATTGCCCTATTTTCTTCATAACTGGTAATACACTTCAGTGTA\n",
            "TAATGCTAGTTTATTGTTTCTTAGGCTATTTTTGTACTTGTTACTTTGGCCTCTTTTGTTTACTCAACCGCTACTTTAGA\n",
            "CTGACTCTTGGTGTTTATGATTACTTAGTTTCTACACAGGAGTTTAGATATATGAATTCACAGGGACTACTCCCACCCAA\n",
            "GAATAGCATAGATGCCTTCAAACTCAACATTAAATTGTTGGGTGTTGGTGGCAAACCTTGTATCAAAGTAGCCACTGTAC\n",
            "AGTCTAAAATGTCAGATGTAAAGTGCACATCAGTAGTCTTACTCTCAGTTTTGCAACAACTCAGAGTAGAATCATCATCT\n",
            "AAATTGTGGGCTCAATGTGTCCAGTTACACAATGACATTCTCTTAGCTAAAGATACTACTGAAGCCTTTGAAAAAATGGT\n",
            "TTCACTACTTTCTGTTTTGCTTTCCATGCAGGGTGCTGTAGACATAAACAAGCTTTGTGAAGAAATGCTGGACAACAGGG\n",
            "CAACCTTACAAGCTATAGCCTCAGAGTTTAGTTCCCTTCCATCATATGCAGCTTTTGCTACTGCTCAAGAAGCTTATGAG\n",
            "CAGGCTGTTGCTAATGGTGATTCTGAAGTTGTTCTTAAAAAGTTGAAGAAGTCTTTGAATGTGGCTAAATCTGAATTTGA\n",
            "CCGTGATGCAGCCATGCAACGTAAGTTGGAAAAGATGGCTGATCAAGCTATGACCCAAATGTATAAACAGGCTAGATCTG\n",
            "AGGACAAGAGGGCAAAAGTTACTAGTGCTATGCAGACAATGCTTTTCACTATGCTTAGAAAGTTGGATAATGATGCACTC\n",
            "AACAACATTATCAACAATGCAAGAGATGGTTGTGTTCCCTTGAACATAATACCTCTTACAACAGCAGCCAAACTAATGGT\n",
            "TGTCATACCAGACTATAACACATATAAAAATACGTGTGATGGTACAACATTTACTTATGCATCAGCATTGTGGGAAATCC\n",
            "AACAGGTTGTAGATGCAGATAGTAAAATTGTTCAACTTAGTGAAATTAGTATGGACAATTCACCTAATTTAGCATGGCCT\n",
            "CTTATTGTAACAGCTTTAAGGGCCAATTCTGCTGTCAAATTACAGAATAATGAGCTTAGTCCTGTTGCACTACGACAGAT\n",
            "GTCTTGTGCTGCCGGTACTACACAAACTGCTTGCACTGATGACAATGCGTTAGCTTACTACAACACAACAAAGGGAGGTA\n",
            "GGTTTGTACTTGCACTGTTATCCGATTTACAGGATTTGAAATGGGCTAGATTCCCTAAGAGTGATGGAACTGGTACTATC\n",
            "TATACAGAACTGGAACCACCTTGTAGGTTTGTTACAGACACACCTAAAGGTCCTAAAGTGAAGTATTTATACTTTATTAA\n",
            "AGGATTAAACAACCTAAATAGAGGTATGGTACTTGGTAGTTTAGCTGCCACAGTACGTCTACAAGCTGGTAATGCAACAG\n",
            "AAGTGCCTGCCAATTCAACTGTATTATCTTTCTGTGCTTTTGCTGTAGATGCTGCTAAAGCTTACAAAGATTATCTAGCT\n",
            "AGTGGGGGACAACCAATCACTAATTGTGTTAAGATGTTGTGTACACACACTGGTACTGGTCAGGCAATAACAGTTACACC\n",
            "GGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATC\n",
            "CTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACACTT\n",
            "AAAAACACAGTCTGTACCGTCTGCGGTATGTGGAAAGGTTATGGCTGTAGTTGTGATCAACTCCGCGAACCCATGCTTCA\n",
            "GTCAGCTGATGCACAATCGTTTTTAAACGGGTTTGCGGTGTAAGTGCAGCCCGTCTTACACCGTGCGGCACAGGCACTAG\n",
            "TACTGATGTCGTATACAGGGCTTTTGACATCTACAATGATAAAGTAGCTGGTTTTGCTAAATTCCTAAAAACTAATTGTT\n",
            "GTCGCTTCCAAGAAAAGGACGAAGATGACAATTTAATTGATTCTTACTTTGTAGTTAAGAGACACACTTTCTCTAACTAC\n",
            "CAACATGAAGAAACAATTTATAATTTACTTAAGGATTGTCCAGCTGTTGCTAAACATGACTTCTTTAAGTTTAGAATAGA\n",
            "CGGTGACATGGTACCACATATATCACGTCAACGTCTTACTAAATACACAATGGCAGACCTCGTCTATGCTTTAAGGCATT\n",
            "TTGATGAAGGTAATTGTGACACATTAAAAGAAATACTTGTCACATACAATTGTTGTGATGATGATTATTTCAATAAAAAG\n",
            "GACTGGTATGATTTTGTAGAAAACCCAGATATATTACGCGTATACGCCAACTTAGGTGAACGTGTACGCCAAGCTTTGTT\n",
            "AAAAACAGTACAATTCTGTGATGCCATGCGAAATGCTGGTATTGTTGGTGTACTGACATTAGATAATCAAGATCTCAATG\n",
            "GTAACTGGTATGATTTCGGTGATTTCATACAAACCACGCCAGGTAGTGGAGTTCCTGTTGTAGATTCTTATTATTCATTG\n",
            "TTAATGCCTATATTAACCTTGACCAGGGCTTTAACTGCAGAGTCACATGTTGACACTGACTTAACAAAGCCTTACATTAA\n",
            "GTGGGATTTGTTAAAATATGACTTCACGGAAGAGAGGTTAAAACTCTTTGACCGTTATTTTAAATATTGGGATCAGACAT\n",
            "ACCACCCAAATTGTGTTAACTGTTTGGATGACAGATGCATTCTGCATTGTGCAAACTTTAATGTTTTATTCTCTACAGTG\n",
            "TTCCCACCTACAAGTTTTGGACCACTAGTGAGAAAAATATTTGTTGATGGTGTTCCATTTGTAGTTTCAACTGGATACCA\n",
            "CTTCAGAGAGCTAGGTGTTGTACATAATCAGGATGTAAACTTACATAGCTCTAGACTTAGTTTTAAGGAATTACTTGTGT\n",
            "ATGCTGCTGACCCTGCTATGCACGCTGCTTCTGGTAATCTATTACTAGATAAACGCACTACGTGCTTTTCAGTAGCTGCA\n",
            "CTTACTAACAATGTTGCTTTTCAAACTGTCAAACCCGGTAATTTTAACAAAGACTTCTATGACTTTGCTGTGTCTAAGGG\n",
            "TTTCTTTAAGGAAGGAAGTTCTGTTGAATTAAAACACTTCTTCTTTGCTCAGGATGGTAATGCTGCTATCAGCGATTATG\n",
            "ACTACTATCGTTATAATCTACCAACAATGTGTGATATCAGACAACTACTATTTGTAGTTGAAGTTGTTGATAAGTACTTT\n",
            "GATTGTTACGATGGTGGCTGTATTAATGCTAACCAAGTCATCGTCAACAACCTAGACAAATCAGCTGGTTTTCCATTTAA\n",
            "TAAATGGGGTAAGGCTAGACTTTATTATGATTCAATGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTA\n",
            "ATGTCATCCCTACTATAACTCAAATGAATCTTAAGTATGCCATTAGTGCAAAGAATAGAGCTCGCACCGTAGCTGGTGTC\n",
            "TCTATCTGTAGTACTATGACCAATAGACAGTTTCATCAAAAATTATTGAAATCAATAGCCGCCACTAGAGGAGCTACTGT\n",
            "AGTAATTGGAACAAGCAAATTCTATGGTGGTTGGCACAACATGTTAAAAACTGTTTATAGTGATGTAGAAAACCCTCACC\n",
            "TTATGGGTTGGGATTATCCTAAATGTGATAGAGCCATGCCTAACATGCTTAGAATTATGGCCTCACTTGTTCTTGCTCGC\n",
            "AAACATACAACGTGTTGTAGCTTGTCACACCGTTTCTATAGATTAGCTAATGAGTGTGCTCAAGTATTGAGTGAAATGGT\n",
            "CATGTGTGGCGGTTCACTATATGTTAAACCAGGTGGAACCTCATCAGGAGATGCCACAACTGCTTATGCTAATAGTGTTT\n",
            "TTAACATTTGTCAAGCTGTCACGGCCAATGTTAATGCACTTTTATCTACTGATGGTAACAAAATTGCCGATAAGTATGTC\n",
            "CGCAATTTACAACACAGACTTTATGAGTGTCTCTATAGAAATAGAGATGTTGACACAGACTTTGTGAATGAGTTTTACGC\n",
            "ATATTTGCGTAAACATTTCTCAATGATGATACTCTCTGACGATGCTGTTGTGTGTTTCAATAGCACTTATGCATCTCAAG\n",
            "GTCTAGTGGCTAGCATAAAGAACTTTAAGTCAGTTCTTTATTATCAAAACAATGTTTTTATGTCTGAAGCAAAATGTTGG\n",
            "ACTGAGACTGACCTTACTAAAGGACCTCATGAATTTTGCTCTCAACATACAATGCTAGTTAAACAGGGTGATGATTATGT\n",
            "GTACCTTCCTTACCCAGATCCATCAAGAATCCTAGGGGCCGGCTGTTTTGTAGATGATATCGTAAAAACAGATGGTACAC\n",
            "TTATGATTGAACGGTTCGTGTCTTTAGCTATAGATGCTTACCCACTTACTAAACATCCTAATCAGGAGTATGCTGATGTC\n",
            "TTTCATTTGTACTTACAATACATAAGAAAGCTACATGATGAGTTAACAGGACACATGTTAGACATGTATTCTGTTATGCT\n",
            "TACTAATGATAACACTTCAAGGTATTGGGAACCTGAGTTTTATGAGGCTATGTACACACCGCATACAGTCTTACAGGCTG\n",
            "TTGGGGCTTGTGTTCTTTGCAATTCACAGACTTCATTAAGATGTGGTGCTTGCATACGTAGACCATTCTTATGTTGTAAA\n",
            "TGCTGTTACGACCATGTCATATCAACATCACATAAATTAGTCTTGTCTGTTAATCCGTATGTTTGCAATGCTCCAGGTTG\n",
            "TGATGTCACAGATGTGACTCAACTTTACTTAGGAGGTATGAGCTATTATTGTAAATCACATAAACCACCCATTAGTTTTC\n",
            "CATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCA\n",
            "ATTGCAACATGTGACTGGACAAATGCTGGTGATTACATTTTAGCTAACACCTGTACTGAAAGACTCAAGCTTTTTGCAGC\n",
            "AGAAACGCTCAAAGCTACTGAGGAGACATTTAAACTGTCTTATGGTATTGCTACTGTACGTGAAGTGCTGTCTGACAGAG\n",
            "AATTACATCTTTCATGGGAAGTTGGTAAACCTAGACCACCACTTAACCGAAATTATGTCTTTACTGGTTATCGTGTAACT\n",
            "AAAAACAGTAAAGTACAAATAGGAGAGTACACCTTTGAAAAAGGTGACTATGGTGATGCTGTTGTTTACCGAGGTACAAC\n",
            "AACTTACAAATTAAATGTTGGTGATTATTTTGTGCTGACATCACATACAGTAATGCCATTAAGTGCACCTACACTAGTGC\n",
            "CACAAGAGCACTATGTTAGAATTACTGGCTTATACCCAACACTCAATATCTCAGATGAGTTTTCTAGCAATGTTGCAAAT\n",
            "TATCAAAAGGTTGGTATGCAAAAGTATTCTACACTCCAGGGACCACCTGGTACTGGTAAGAGTCATTTTGCTATTGGCCT\n",
            "AGCTCTCTACTACCCTTCTGCTCGCATAGTGTATACAGCTTGCTCTCATGCCGCTGTTGATGCACTATGTGAGAAGGCAT\n",
            "TAAAATATTTGCCTATAGATAAATGTAGTAGAATTATACCTGCACGTGCTCGTGTAGAGTGTTTTGATAAATTCAAAGTG\n",
            "AATTCAACATTAGAACAGTATGTCTTTTGTACTGTAAATGCATTGCCTGAGACGACAGCAGATATAGTTGTCTTTGATGA\n",
            "AATTTCAATGGCCACAAATTATGATTTGAGTGTTGTCAATGCCAGATTACGTGCTAAGCACTATGTGTACATTGGCGACC\n",
            "CTGCTCAATTACCTGCACCACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTT\n",
            "ATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTGCTGAAATTGTTGACACTGTGAGTGCTTT\n",
            "GGTTTATGATAATAAGCTTAAAGCACATAAAGACAAATCAGCTCAATGCTTTAAAATGTTTTATAAGGGTGTTATCACGC\n",
            "ATGATGTTTCATCTGCAATTAACAGGCCACAAATAGGCGTGGTAAGAGAATTCCTTACACGTAACCCTGCTTGGAGAAAA\n",
            "GCTGTCTTTATTTCACCTTATAATTCACAGAATGCTGTAGCCTCAAAGATTTTGGGACTACCAACTCAAACTGTTGATTC\n",
            "ATCACAGGGCTCAGAATATGACTATGTCATATTCACTCAAACCACTGAAACAGCTCACTCTTGTAATGTAAACAGATTTA\n",
            "ATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAATTTACA\n",
            "AGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTACAAGCTGAAAATGTAACAGGACTCTTTAAAGATTGTAGTAAGGT\n",
            "AATCACTGGGTTACATCCTACACAGGCACCTACACACCTCAGTGTTGACACTAAATTCAAAACTGAAGGTTTATGTGTTG\n",
            "ACATACCTGGCATACCTAAGGACATGACCTATAGAAGACTCATCTCTATGATGGGTTTTAAAATGAATTATCAAGTTAAT\n",
            "GGTTACCCTAACATGTTTATCACCCGCGAAGAAGCTATAAGACATGTACGTGCATGGATTGGCTTCGATGTCGAGGGGTG\n",
            "TCATGCTACTAGAGAAGCTGTTGGTACCAATTTACCTTTACAGCTAGGTTTTTCTACAGGTGTTAACCTAGTTGCTGTAC\n",
            "CTACAGGTTATGTTGATACACCTAATAATACAGATTTTTCCAGAGTTAGTGCTAAACCACCGCCTGGAGATCAATTTAAA\n",
            "CACCTCATACCACTTATGTACAAAGGACTTCCTTGGAATGTAGTGCGTATAAAGATTGTACAAATGTTAAGTGACACACT\n",
            "TAAAAATCTCTCTGACAGAGTCGTATTTGTCTTATGGGCACATGGCTTTGAGTTGACATCTATGAAGTATTTTGTGAAAA\n",
            "TAGGACCTGAGCGCACCTGTTGTCTATGTGATAGACGTGCCACATGCTTTTCCACTGCTTCAGACACTTATGCCTGTTGG\n",
            "CATCATTCTATTGGATTTGATTACGTCTATAATCCGTTTATGATTGATGTTCAACAATGGGGTTTTACAGGTAACCTACA\n",
            "AAGCAACCATGATCTGTATTGTCAAGTCCATGGTAATGCACATGTAGCTAGTTGTGATGCAATCATGACTAGGTGTCTAG\n",
            "CTGTCCACGAGTGCTTTGTTAAGCGTGTTGACTGGACTATTGAATATCCTATAATTGGTGATGAACTGAAGATTAATGCG\n",
            "GCTTGTAGAAAGGTTCAACACATGGTTGTTAAAGCTGCATTATTAGCAGACAAATTCCCAGTTCTTCACGACATTGGTAA\n",
            "CCCTAAAGCTATTAAGTGTGTACCTCAAGCTGATGTAGAATGGAAGTTCTATGATGCACAGCCTTGTAGTGACAAAGCTT\n",
            "ATAAAATAGAAGAATTATTCTATTCTTATGCCACACATTCTGACAAATTCACAGATGGTGTATGCCTATTTTGGAATTGC\n",
            "AATGTCGATAGATATCCTGCTAATTCCATTGTTTGTAGATTTGACACTAGAGTGCTATCTAACCTTAACTTGCCTGGTTG\n",
            "TGATGGTGGCAGTTTGTATGTAAATAAACATGCATTCCACACACCAGCTTTTGATAAAAGTGCTTTTGTTAATTTAAAAC\n",
            "AATTACCATTTTTCTATTACTCTGACAGTCCATGTGAGTCTCATGGAAAACAAGTAGTGTCAGATATAGATTATGTACCA\n",
            "CTAAAGTCTGCTACGTGTATAACACGTTGCAATTTAGGTGGTGCTGTCTGTAGACATCATGCTAATGAGTACAGATTGTA\n",
            "TCTCGATGCTTATAACATGATGATCTCAGCTGGCTTTAGCTTGTGGGTTTACAAACAATTTGATACTTATAACCTCTGGA\n",
            "ACACTTTTACAAGACTTCAGAGTTTAGAAAATGTGGCTTTTAATGTTGTAAATAAGGGACACTTTGATGGACAACAGGGT\n",
            "GAAGTACCAGTTTCTATCATTAATAACACTGTTTACACAAAAGTTGATGGTGTTGATGTAGAATTGTTTGAAAATAAAAC\n",
            "AACATTACCTGTTAATGTAGCATTTGAGCTTTGGGCTAAGCGCAACATTAAACCAGTACCAGAGGTGAAAATACTCAATA\n",
            "ATTTGGGTGTGGACATTGCTGCTAATACTGTGATCTGGGACTACAAAAGAGATGCTCCAGCACATATATCTACTATTGGT\n",
            "GTTTGTTCTATGACTGACATAGCCAAGAAACCAACTGAAACGATTTGTGCACCACTCACTGTCTTTTTTGATGGTAGAGT\n",
            "TGATGGTCAAGTAGACTTATTTAGAAATGCCCGTAATGGTGTTCTTATTACAGAAGGTAGTGTTAAAGGTTTACAACCAT\n",
            "CTGTAGGTCCCAAACAAGCTAGTCTTAATGGAGTCACATTAATTGGAGAAGCCGTAAAAACACAGTTCAATTATTATAAG\n",
            "AAAGTTGATGGTGTTGTCCAACAATTACCTGAAACTTACTTTACTCAGAGTAGAAATTTACAAGAATTTAAACCCAGGAG\n",
            "TCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAATTAGAAGGCTATGCCTTCGAAC\n",
            "ATATCGTTTATGGAGATTTTAGTCATAGTCAGTTAGGTGGTTTACATCTACTGATTGGACTAGCTAAACGTTTTAAGGAA\n",
            "TCACCTTTTGAATTAGAAGATTTTATTCCTATGGACAGTACAGTTAAAAACTATTTCATAACAGATGCGCAAACAGGTTC\n",
            "ATCTAAGTGTGTGTGTTCTGTTATTGATTTATTACTTGATGATTTTGTTGAAATAATAAAATCCCAAGATTTATCTGTAG\n",
            "TTTCTAAGGTTGTCAAAGTGACTATTGACTATACAGAAATTTCATTTATGCTTTGGTGTAAAGATGGCCATGTAGAAACA\n",
            "TTTTACCCAAAATTACAATCTAGTCAAGCGTGGCAACCGGGTGTTGCTATGCCTAATCTTTACAAAATGCAAAGAATGCT\n",
            "ATTAGAAAAGTGTGACCTTCAAAATTATGGTGATAGTGCAACATTACCTAAAGGCATAATGATGAATGTCGCAAAATATA\n",
            "CTCAACTGTGTCAATATTTAAACACATTAACATTAGCTGTACCCTATAATATGAGAGTTATACATTTTGGTGCTGGTTCT\n",
            "GATAAAGGAGTTGCACCAGGTACAGCTGTTTTAAGACAGTGGTTGCCTACGGGTACGCTGCTTGTCGATTCAGATCTTAA\n",
            "TGACTTTGTCTCTGATGCAGATTCAACTTTGATTGGTGATTGTGCAACTGTACATACAGCTAATAAATGGGATCTCATTA\n",
            "TTAGTGATATGTACGACCCTAAGACTAAAAATGTTACAAAAGAAAATGACTCTAAAGAGGGTTTTTTCACTTACATTTGT\n",
            "GGGTTTATACAACAAAAGCTAGCTCTTGGAGGTTCCGTGGCTATAAAGATAACAGAACATTCTTGGAATGCTGATCTTTA\n",
            "TAAGCTCATGGGACACTTCGCATGGTGGACAGCCTTTGTTACTAATGTGAATGCGTCATCATCTGAAGCATTTTTAATTG\n",
            "GATGTAATTATCTTGGCAAACCACGCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACA\n",
            "AATCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTC\n",
            "TTTAAAAGAAGGTCAAATCAATGATATGATTTTATCTCTTCTTAGTAAAGGTAGACTTATAATTAGAGAAAACAACAGAG\n",
            "TTGTTATTTCTAGTGATGTTCTTGTTAACAACTAAACGAACAATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAG\n",
            "TCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTG\n",
            "ACAAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCAT\n",
            "GCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGC\n",
            "TTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTCGAAGACCCAGTCCCTACTTATTG\n",
            "TTAATAACGCTACTAATGTTGTTATTAAAGTCTGTGAATTTCAATTTTGTAATGATCCATTTTTGGGTGTTTATTACCAC\n",
            "AAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGCGAATAATTGCACTTTTGAATATGTCTCTCA\n",
            "GCCTTTTCTTATGGACCTTGAAGGAAAACAGGGTAATTTCAAAAATCTTAGGGAATTTGTGTTTAAGAATATTGATGGTT\n",
            "ATTTTAAAATATATTCTAAGCACACGCCTATTAATTTAGTGCGTGATCTCCCTCAGGGTTTTTCGGCTTTAGAACCATTG\n",
            "GTAGATTTGCCAATAGGTATTAACATCACTAGGTTTCAAACTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGA\n",
            "TTCTTCTTCAGGTTGGACAGCTGGTGCTGCAGCTTATTATGTGGGTTATCTTCAACCTAGGACTTTTCTATTAAAATATA\n",
            "ATGAAAATGGAACCATTACAGATGCTGTAGACTGTGCACTTGACCCTCTCTCAGAAACAAAGTGTACGTTGAAATCCTTC\n",
            "ACTGTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTAC\n",
            "AAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCA\n",
            "ACTGTGTTGCTGATTATTCTGTCCTATATAATTCCGCATCATTTTCCACTTTTAAGTGTTATGGAGTGTCTCCTACTAAA\n",
            "TTAAATGATCTCTGCTTTACTAATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGG\n",
            "GCAAACTGGAAAGATTGCTGATTATAATTATAAATTACCAGATGATTTTACAGGCTGCGTTATAGCTTGGAATTCTAACA\n",
            "ATCTTGATTCTAAGGTTGGTGGTAATTATAATTACCTGTATAGATTGTTTAGGAAGTCTAATCTCAAACCTTTTGAGAGA\n",
            "GATATTTCAACTGAAATCTATCAGGCCGGTAGCACACCTTGTAATGGTGTTGAAGGTTTTAATTGTTACTTTCCTTTACA\n",
            "ATCATATGGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATG\n",
            "CACCAGCAACTGTTTGTGGACCTAAAAAGTCTACTAATTTGGTTAAAAACAAATGTGTCAATTTCAACTTCAATGGTTTA\n",
            "ACAGGCACAGGTGTTCTTACTGAGTCTAACAAAAAGTTTCTGCCTTTCCAACAATTTGGCAGAGACATTGCTGACACTAC\n",
            "TGATGCTGTCCGTGATCCACAGACACTTGAGATTCTTGACATTACACCATGTTCTTTTGGTGGTGTCAGTGTTATAACAC\n",
            "CAGGAACAAATACTTCTAACCAGGTTGCTGTTCTTTATCAGGATGTTAACTGCACAGAAGTCCCTGTTGCTATTCATGCA\n",
            "GATCAACTTACTCCTACTTGGCGTGTTTATTCTACAGGTTCTAATGTTTTTCAAACACGTGCAGGCTGTTTAATAGGGGC\n",
            "TGAACATGTCAACAACTCATATGAGTGTGACATACCCATTGGTGCAGGTATATGCGCTAGTTATCAGACTCAGACTAATT\n",
            "CTCCTCGGCGGGCACGTAGTGTAGCTAGTCAATCCATCATTGCCTACACTATGTCACTTGGTGCAGAAAATTCAGTTGCT\n",
            "TACTCTAATAACTCTATTGCCATACCCACAAATTTTACTATTAGTGTTACCACAGAAATTCTACCAGTGTCTATGACCAA\n",
            "GACATCAGTAGATTGTACAATGTACATTTGTGGTGATTCAACTGAATGCAGCAATCTTTTGTTGCAATATGGCAGTTTTT\n",
            "GTACACAATTAAACCGTGCTTTAACTGGAATAGCTGTTGAACAAGACAAAAACACCCAAGAAGTTTTTGCACAAGTCAAA\n",
            "CAAATTTACAAAACACCACCAATTAAAGATTTTGGTGGTTTTAATTTTTCACAAATATTACCAGATCCATCAAAACCAAG\n",
            "CAAGAGGTCATTTATTGAAGATCTACTTTTCAACAAAGTGACACTTGCAGATGCTGGCTTCATCAAACAATATGGTGATT\n",
            "GCCTTGGTGATATTGCTGCTAGAGACCTCATTTGTGCACAAAAGTTTAACGGCCTTACTGTTTTGCCACCTTTGCTCACA\n",
            "GATGAAATGATTGCTCAATACACTTCTGCACTGTTAGCGGGTACAATCACTTCTGGTTGGACCTTTGGTGCAGGTGCTGC\n",
            "ATTACAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGTATTGGAGTTACACAGAATGTTCTCTATGAGAACC\n",
            "AAAAATTGATTGCCAACCAATTTAATAGTGCTATTGGCAAAATTCAAGACTCACTTTCTTCCACAGCAAGTGCACTTGGA\n",
            "AAACTTCAAGATGTGGTCAACCAAAATGCACAAGCTTTAAACACGCTTGTTAAACAACTTAGCTCCAATTTTGGTGCAAT\n",
            "TTCAAGTGTTTTAAATGATATCCTTTCACGTCTTGACAAAGTTGAGGCTGAAGTGCAAATTGATAGGTTGATCACAGGCA\n",
            "GACTTCAAAGTTTGCAGACATATGTGACTCAACAATTAATTAGAGCTGCAGAAATCAGAGCTTCTGCTAATCTTGCTGCT\n",
            "ACTAAAATGTCAGAGTGTGTACTTGGACAATCAAAAAGAGTTGATTTTTGTGGAAAGGGCTATCATCTTATGTCCTTCCC\n",
            "TCAGTCAGCACCTCATGGTGTAGTCTTCTTGCATGTGACTTATGTCCCTGCACAAGAAAAGAACTTCACAACTGCTCCTG\n",
            "CCATTTGTCATGATGGAAAAGCACACTTTCCTCGTGAAGGTGTCTTTGTTTCAAATGGCACACACTGGTTTGTAACACAA\n",
            "AGGAATTTTTATGAACCACAAATCATTACTACAGACAACACATTTGTGTCTGGTAACTGTGATGTTGTAATAGGAATTGT\n",
            "CAACAACACAGTTTATGATCCTTTGCAACCTGAATTAGACTCATTCAAGGAGGAGTTAGATAAATATTTTAAGAATCATA\n",
            "CATCACCAGATGTTGATTTAGGTGACATCTCTGGCATTAATGCTTCAGTTGTAAACATTCAAAAAGAAATTGACCGCCTC\n",
            "AATGAGGTTGCCAAGAATTTAAATGAATCTCTCATCGATCTCCAAGAACTTGGAAAGTATGAGCAGTATATAAAATGGCC\n",
            "ATGGTACATTTGGCTAGGTTTTATAGCTGGCTTGATTGCCATAGTAATGGTGACAATTATGCTTTGCTGTATGACCAGTT\n",
            "GCTGTAGTTGTCTCAAGGGCTGTTGTTCTTGTGGATCCTGCTGCAAATTTGATGAAGACGACTCTGAGCCAGTGCTCAAA\n",
            "GGAGTCAAATTACATTACACATAAACGAACTTATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAG\n",
            "CAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTT\n",
            "CGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTGTTTTTCAGAGCGCTTCCAAAATCATAACCCTCAAAAAGAGATGGC\n",
            "AACTAGCACTCTCCAAGGGTGTTCACTTTGTTTGCAACTTGCTGTTGTTGTTTGTAACAGTTTACTCACACCTTTTGCTC\n",
            "GTTGCTGCTGGCCTTGAAGCCCCTTTTCTCTATCTTTATGCTTTAGTCTACTTCTTGCAGAGTATAAACTTTGTAAGAAT\n",
            "AATAATGAGGCTTTGGCTTTGCTGGAAATGCCGTTCCAAAAACCCATTACTTTATGATGCCAACTATTTTCTTTGCTGGC\n",
            "ATACTAATTGTTACGACTATTGTATACCTTACAATAGTGTAACTTCTTCAATTGTCATTACTTCAGGTGATGGCACAACA\n",
            "AGTCCTATTTCTGAACATGACTACCAGATTGGTGGTTATACTGAAAAATGGGAATCTGGAGTAAAAGACTGTGTTGTATT\n",
            "ACACAGTTACTTCACTTCAGACTATTACCAGCTGTACTCAACTCAATTGAGTACAGACACTGGTGTTGAACATGTTACCT\n",
            "TCTTCATCTACAATAAAATTGTTGATGAGCCTGAAGAACATGTCCAAATTCACACAATCGACGGTTCATCCGGAGTTGTT\n",
            "AATCCAGTAATGGAACCAATTTATGATGAACCGACGACGACTACTAGCGTGCCTTTGTAAGCACAAGCTGATGAGTACGA\n",
            "ACTTATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTTCTTGCTTTCGTGGTAT\n",
            "TCTTGCTAGTTACACTAGCCATCCTTACTGCGCTTCGATTGTGTGCGTACTGCTGCAATATTGTTAACGTGAGTCTTGTA\n",
            "AAACCTTCTTTTTACGTTTACTCTCGTGTTAAAAATCTGAATTCTTCTAGAGTTCCTGATCTTCTGGTCTAAACGAACTA\n",
            "AATATTATATTAGTTTTTCTGTTTGGAACTTTAATTTTAGCCATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCT\n",
            "TAAAAAGCTCCTTGAACAATGGAACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATG\n",
            "CCAACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGTTATGGCCAGTAACTTTAGCTTGTTTTGTG\n",
            "CTTGCTGCTGTTTACAGAATAAATTGGATCACCGGTGGAATTGCTATCGCAATGGCTTGTCTTGTAGGCTTGATGTGGCT\n",
            "CAGCTACTTCATTGCTTCTTTCAGACTGTTTGCGCGTACGCGTTCCATGTGGTCATTCAATCCAGAAACTAACATTCTTC\n",
            "TCAACGTGCCACTCCATGGCACTATTCTGACCAGACCGCTTCTAGAAAGTGAACTCGTAATCGGAGCTGTGATCCTTCGT\n",
            "GGACATCTTCGTATTGCTGGACACCATCTAGGACGCTGTGACATCAAGGACCTGCCTAAAGAAATCACTGTTGCTACATC\n",
            "ACGAACGCTTTCTTATTACAAATTGGGAGCTTCGCAGCGTGTAGCAGGTGACTCAGGTTTTGCTGCATACAGTCGCTACA\n",
            "GGATTGGCAACTATAAATTAAACACAGACCATTCCAGTAGCAGTGACAATATTGCTTTGCTTGTACAGTAAGTGACAACA\n",
            "GATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAGTTTCCATTT\n",
            "GGAATCTTGATTACATCATAAACCTCATAATTAAAAATTTATCTAAGTCACTAACTGAGAATAAATATTCTCAATTAGAT\n",
            "GAAGAGCAACCAATGGAGATTGATTAAACGAACATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGA\n",
            "GCTTTATCACTACCAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCA\n",
            "ATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGCACTCAATTTGCTTTTGCTTGTCCTGAC\n",
            "GGCGTAAAACACGTCTATCAGTTACGTGCCAGATCAGTTTCACCTAAACTGTTCATCAGACAAGAGGAAGTTCAAGAACT\n",
            "TTACTCTCCAATTTTTCTTATTGTTGCGGCAATAGTGTTTATAACACTTTGCTTCACACTCAAAAGAAAGACAGAATGAT\n",
            "TGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTT\n",
            "GGTTCTCACTTGAACTGCAAGATCATAATGAAACTTGTCACGCCTAAACGAACATGAAATTTCTTGTTTTCTTAGGAATC\n",
            "ATCACAACTGTAGCTGCATTTCACCAAGAATGTAGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCC\n",
            "GTGTCCTATTCACTTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATCAGCACCTTTAATTGAATTGTGCGTGG\n",
            "ATGAGGCTGGTTCTAAATCACCCATTCAGTACATCGATATCGGTAATTATACAGTTTCCTGTTTACCTTTTACAATTAAT\n",
            "TGCCAGGAACCTAAATTGGGTAGTCTTGTAGTGCGTTGTTCGTTCTATGAAGACTTTTTAGAGTATCATGACGTTCGTGT\n",
            "TGTTTTAGATTTCATCTAAACGAACAAACTAAAATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTAC\n",
            "GTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCC\n",
            "AAGGTTTACCCAATAATACTGCGTCTTGGTTCACCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGA\n",
            "CAAGGCGTTCCAATTAACACCAATAGCAGTCCAGATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAATTCGTGG\n",
            "TGGTGACGGTAAAATGAAAGATCTCAGTCCAAGATGGTATTTCTACTACCTAGGAACTGGGCCAGAAGCTGGACTTCCCT\n",
            "ATGGTGCTAACAAAGACGGCATCATATGGGTTGCAACTGAGGGAGCCTTGAATACACCAAAAGATCACATTGGCACCCGC\n",
            "AATCCTGCTAACAATGCTGCAATCGTGCTACAACTTCCTCAAGGAACAACATTGCCAAAAGGCTTCTACGCAGAAGGGAG\n",
            "CAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCTCATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTA\n",
            "GGGGAACTTCTCCTGCTAGAATGGCTGGCAATGGCGGTGATGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAACCAG\n",
            "CTTGAGAGCAAAATGTCTGGTAAAGGCCAACAACAACAAGGCCAAACTGTCACTAAGAAATCTGCTGCTGAGGCTTCTAA\n",
            "GAAGCCTCGGCAAAAACGTACTGCCACTAAAGCATACAATGTAACACAAGCTTTCGGCAGACGTGGTCCAGAACAAACCC\n",
            "AAGGAAATTTTGGGGACCAGGAACTAATCAGACAAGGAACTGATTACAAACATTGGCCGCAAATTGCACAATTTGCCCCC\n",
            "AGCGCTTCAGCGTTCTTCGGAATGTCGCGCATTGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGC\n",
            "CATCAAATTGGATGACAAAGATCCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACAT\n",
            "TCCCACCAACAGAGCCTAAAAAGGACAAAAAGAAGAAGGCTGATGAAACTCAAGCCTTACCGCAGAGACAGAAGAAACAG\n",
            "CAAACTGTGACTCTTCTTCCTGCTGCAGATTTGGATGATTTCTCCAAACAATTGCAACAATCCATGAGCAGTGCTGACTC\n",
            "AACTCAGGCCTAAACTCATGCAGACCACACAAGGCAGATGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATA\n",
            "GTCTACTCTTGTGCAGAATGAATTCTCGTAACTACATAGCACAAGTAGATGTAGTTAACTTTAATCTCACATAGCAATCT\n",
            "TTAATCAGTGTGTAACATTAGGGAGGACTTGAAAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTACGATCGAGTGT\n",
            "ACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCAT\n",
            "GTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V8016Y9u93Nx",
        "colab_type": "text"
      },
      "source": [
        "This looks like the genetic sequence for the Coronavirus!\n",
        "\n",
        "## Answers\n",
        "\n",
        "Maybe we now have more questions than answers but here is one way of answering the initial questions posed at the beginning. If all of this seemed very hard. Don't worry, it was supposed to be difficult. By the end of the semester you should feel more comfortable using Python this way. But for the moment just get a sense of the flow of what happened.\n",
        "\n",
        "**1. What is the format of the file?**\n",
        "\n",
        "The file we started with was a tar file. But it contained other files such as text files and gzipped files.\n",
        "\n",
        "**2. What does the file contain?**\n",
        "\n",
        "It appears to contain information about the Coronavirus.\n",
        "\n",
        "**3. How would you use the file?**\n",
        "\n",
        "Geneticists could use this information to identify the virus in their labs.\n",
        "\n",
        "**4. Where did the file come from?**\n",
        "\n",
        "The file came from GenBank, which is a project run out of the National Institues of Health nearby in Bethesda.\n",
        "\n",
        "**5. Who created the information in the file?**\n",
        "\n",
        "Examining the metadata a little more closely (e.g. GCA_009858895.3_ASM985889v3_protein.gpff) shows that the genetic data was uploaded by Chinese scientists January 5, 2020 when they were publishing their findings in Nature.\n",
        "\n",
        "**6. Does it have a URL?**\n",
        "\n",
        "Sometimes you can use identifiers in data to try to locate more information about them on the web. In this case we can try to Google for ASM985889v3 which brings us to:\n",
        "\n",
        "https://www.ncbi.nlm.nih.gov/assembly/GCF_009858895.2/\n",
        "\n",
        "That looks like a start at least.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VHbAy_bHA5iz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}