---
title: "Tutorial: How big is a genome? Exploring size and scope of the human genome in Python"
format: html
---




## Introduction 

This tutorial uses basic Python libraries to explore how big the human genome is and how much data is generated when we submit a sample to a genetic testing service like 23andMe.

## Preliminaries

### Libraries

We'll use standard Python data science libraries in this tutorial: pandas, NumPy, seaborn, and matplotlib.


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

### Data

The main dataset in this tutorial is a table of chromosome lengths for the GRCh37 human genome assembly, taken from the [NCBI](https://www.ncbi.nlm.nih.gov/grc/human/data?asm=GRCh37).

This is a small dataset, so we can quickly build a pandas dataframe from the data:


In [None]:
#| echo: true
# raw data lists
chromo_numeric = [1,2,3,4,5,6,7,8,9,10,
                  11,12,13,14,15,16,17,
                  18,19,20,21,22,23,24,25]
chromo_str = ["1","2","3","4","5","6","7","8","9","10",
                  "11","12","13","14","15","16","17",
                  "18","19","20","21","22","X","Y","Mito"]
chromo_len = [249250621,243199373,198022430,191154276,180915260,
                171115067,159138663,146364022,141213431,135534747,135006516,
                133851895,115169878,107349540,102531392,90354753,81195210,78077248,59128983,
                63025520,48129895,51304566,155270560,59373566,16569]

# convert to series
chromo_numeric_ser = pd.Series(chromo_numeric)
chromo_str_ser = pd.Series(chromo_str)
chromo_len_ser    = pd.Series(chromo_len)

## additional info
# rough estimate: assume about 4% of the positions on each chromosome are SNPs
snps_approx = chromo_len_ser*0.04
#chromo_type = ["Autosome","Sex","Organelle"]
#chromo_type = np.repeat(chromo_type, [22,2,1], axis=0)
#chromo_type_ser = pd.Series(chromo_type)

## build pandas dataframe
chromo_info = pd.DataFrame({"chrom_numeric": chromo_numeric_ser,
                           "chromo_str": chromo_str_ser,
                           #"chrom_type": chromo_type_ser,
                           "chrom_len":   chromo_len_ser,
                           "snps_approx": snps_approx})

The assembled data looks like this:


In [None]:
#| echo: true
chromo_info
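
As a quick sanity check on the table, the lengths can be summed to get the overall size of the reference. The sketch below is plain Python and repeats the length list so that it runs on its own; the names `CHROMO_LEN`, `total_bp`, `total_gb`, and `share_of_largest` are illustrative and are not part of the tutorial's dataframe.

```python
# Chromosome lengths in base pairs, copied from the table above
# (chromosomes 1-22, then X, Y, and the mitochondrial genome).
CHROMO_LEN = [249250621, 243199373, 198022430, 191154276, 180915260,
              171115067, 159138663, 146364022, 141213431, 135534747,
              135006516, 133851895, 115169878, 107349540, 102531392,
              90354753, 81195210, 78077248, 59128983, 63025520,
              48129895, 51304566, 155270560, 59373566, 16569]

total_bp = sum(CHROMO_LEN)        # total base pairs in the table
total_gb = total_bp / 1e9         # the same figure in gigabases

# share of the total taken up by the longest entry (chromosome 1), in percent
share_of_largest = max(CHROMO_LEN) / total_bp * 100

print(f"Total length: {total_bp:,} bp (about {total_gb:.2f} Gb)")
print(f"Longest entry is {share_of_largest:.1f}% of the total")
```

The total comes to roughly 3.1 billion base pairs, which is the denominator used in the percentage calculations later in the tutorial.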

## Data visualization

### Chromosome size


In [None]:
# horizontal bar chart: one bar per chromosome, length in base pairs
sns.barplot(data=chromo_info,
            x="chrom_len",
            y="chrom_numeric",
            orient="h",
            # hue="chrom_type"
            );

### How much of the human genome is examined by 23andMe?


Calculations


In [None]:
total = chromo_len_ser.sum()  # total base pairs across all chromosomes in the table
snps_23_and_me = 929045 # from 1117.23andme.txt
other = total - snps_23_and_me  # positions not included in the 23andMe file
snps_percent = snps_23_and_me/total*100
print("Approximately", round(snps_percent,3), "percent of our genome is represented in data from 23andMe")

Pie graph


In [None]:
plt.close()
plt.pie([snps_23_and_me,other],
 labels=["23andme\nPositions","Rest of\ngenome"]) ;

### How much of our genome is examined in research-grade datasets?


In [None]:
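genomes1k = 125484020 # Byrska-Bishop et al 2022
other = total - genomes1k  # positions not included in the 1000 Genomes Project data
snps_1kgpercent = genomes1k/total*100
print("Approximately", round(snps_1kgpercent,3), "percent of our genome is represented in data from the 1000 Genomes Project")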

Pie graph


In [None]:
plt.close()
plt.pie([genomes1k,other],
 labels=["1000 Genomes\nProject","Rest of\ngenome"]) ;
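
To put the two datasets side by side, the percentage calculation can be wrapped in a small helper. This is a minimal sketch: `percent_of_genome` and `GENOME_LEN` are hypothetical names, and `GENOME_LEN` is hard-coded to the sum of the chromosome lengths in the table above rather than read from `chromo_len_ser`.

```python
# Assumption: total base pairs, i.e. the sum of the chromosome lengths in this tutorial
GENOME_LEN = 3_095_693_981

def percent_of_genome(n_positions, genome_len=GENOME_LEN):
    """Return n_positions as a percentage of genome_len."""
    return n_positions / genome_len * 100

pct_23andme = percent_of_genome(929_045)       # positions in the 23andMe file
pct_1kg = percent_of_genome(125_484_020)       # positions from Byrska-Bishop et al 2022
fold_difference = 125_484_020 / 929_045        # how many times more positions

print(f"23andMe: {pct_23andme:.2f}%")
print(f"1000 Genomes Project: {pct_1kg:.2f}%")
print(f"Fold difference: {fold_difference:.0f}x")
```

With these numbers the research-grade dataset covers roughly 4% of positions, which is close to the 0.04 factor used for `snps_approx`, while the 23andMe file covers about 0.03%.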

### Chromosome size versus amount surveyed


In [None]:
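plt.close()
# full chromosome length (dark bars)
bar1 = sns.barplot(data=chromo_info,
                   x="chrom_len",
                   y="chrom_numeric",
                   color="darkblue",
                   orient="h")

# approximate number of SNPs, drawn over the same axes (light bars)
bar2 = sns.barplot(data=chromo_info,
                   x="snps_approx",
                   y="chrom_numeric",
                   color="lightblue",
                   orient="h")

# legend handles matching the two bar colors
top_bar = mpatches.Patch(color="darkblue", label="Chromosome length")
bottom_bar = mpatches.Patch(color="lightblue", label="Approximate SNPs")
plt.legend(handles=[top_bar, bottom_bar])

plt.show()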

## Size of consumer genomics industry

TODO: data source


In [None]:
mil = 1000000
plt.close()
N = [25000000/mil,14000000/mil,8000000/mil,1628438/mil,300000/mil]
company = ["Ancestry.com","23andMe","MyHeritage","Family Tree DNA\nFamily Finder","Living DNA"]

df = pd.DataFrame({"DNA tests (millions)": N,
                    "Company": company})

sns.barplot(data = df,y = "DNA tests (millions)", x = "Company")