<a href="https://colab.research.google.com/github/bicks1/hughesintern/blob/main/gff_chrom_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**File Name**:

```
gff_chrom.v2.ipynb
```

**Description**:

```
This program is part of a series of programs for information extraction and mining of gene annotations in GFF3 files.

Feature/Type (column 3) definitions: http://www.sequenceontology.org/browser/obob.cgi

Biotype (attribute in column 9) definitions: https://www.gencodegenes.org/pages/biotypes.html

```

**Authors**:

```
Sophia Bick, Chun Liang
```


###[Step 1]: Install Python modules and mount the Google Drive folder that contains the GFF3 files

In [None]:
!pip install pandas
!pip install gffpandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gffpandas
  Downloading gffpandas-1.2.0.tar.gz (178 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 178.8/178.8 kB 3.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gffpandas
  Building wheel for gffpandas (setup.py) ... done
  Created wheel for gffpandas: filename=gffpandas-1.2.0-py2.py3-none-any.whl size=6248 sha256=b510686f55a52bba46f0574f38f049fab9121bd6873fa603e669b439dadfe3c5
  Stored in directory: /root/.cache/pip/wheels/57/87/f1/1d0c74fbc5151562ba7953dc110a7d8c63c6c3229d025bc8cd
Successfully built gffpandas
Installing collected packages: gffpandas
Successfully installed gffpandas-1.2.0


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!ls "/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern"

Mounted at /content/drive
 chrom_and_gene_Qs.ipynb	    Homo_sapiens.GRCh38.109.chromosome.20.gff3
 chrom_and_gene_Qs_lc.ipynb	    Homo_sapiens.GRCh38.dna.chromosome.20.fa
 gff_chromosome.csv		    module_explore.ipynb
 gff_chromosome_exclude.csv	    module_explore_lc.ipynb
 gff_chrom.v0.ipynb		   'Project Qs.gdoc'
 gff_chrom.v1.ipynb		    ResearchPlan.gdoc
 gff_chrom.v2.ipynb		    researchplan_Qs.ipynb
 gff_gene.v0.ipynb		    researchplan_Qs_lc.ipynb
'GFF module notes.gdoc'		    transcription_Qs.ipynb
 gff_transcript.v0.ipynb	    transcription_Qs_lc.ipynb
 Homo_sapiens.GRCh38.109.chr.gff3


In [None]:
chrom20gff = "/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/Homo_sapiens.GRCh38.109.chromosome.20.gff3"
# The FASTA file in the Drive listing above is named "...GRCh38.dna.chromosome.20.fa"
chrom20fa = "/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/Homo_sapiens.GRCh38.dna.chromosome.20.fa"

In [None]:
hg38gff = "/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/Homo_sapiens.GRCh38.109.chr.gff3"

In [None]:
import gffpandas.gffpandas as gffpd
import pandas as pd

###[Step 2]: Get a quick overview of a given GFF3 file and extract different types and counts for the third column (types)

In [None]:
annotation = gffpd.read_gff3(hg38gff)
# gffpd stats dict with feature type counts
stats = annotation.stats_dic()
#print(stats)
feature_dict = stats["Counted_feature_types"]
print(feature_dict)


{'exon': 1648074, 'CDS': 886460, 'three_prime_UTR': 207390, 'biological_region': 180084, 'five_prime_UTR': 173461, 'mRNA': 110869, 'lnc_RNA': 91534, 'transcript': 26490, 'ncRNA_gene': 25925, 'gene': 21507, 'pseudogenic_transcript': 15226, 'pseudogene': 15224, 'ncRNA': 2211, 'snRNA': 1906, 'miRNA': 1877, 'unconfirmed_transcript': 1143, 'snoRNA': 942, 'V_gene_segment': 252, 'J_gene_segment': 97, 'scRNA': 50, 'rRNA': 49, 'D_gene_segment': 41, 'C_gene_segment': 29, 'chromosome': 25, 'tRNA': 22}
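
For readers without gffpandas or the Drive files at hand, the tally above can be approximated with the standard library alone. This is only a sketch of the idea (count the values in column 3), not a description of how `stats_dic()` is implemented. The `SAMPLE_GFF3` text and the `count_feature_types` name are illustrative assumptions, and the four feature rows are made up.

```python
from collections import Counter

# Hypothetical GFF3 snippet (tab-separated, 9 columns); not real annotation data.
SAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "20\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene:G1",
    "20\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1",
    "20\tdemo\texon\t100\t400\t.\t+\t.\tParent=transcript:T1",
    "20\tdemo\texon\t600\t900\t.\t+\t.\tParent=transcript:T1",
])

def count_feature_types(gff3_text):
    """Count the values in column 3 (feature type), skipping comments and blank lines."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9:          # ignore malformed rows
            counts[columns[2]] += 1
    return dict(counts)

print(count_feature_types(SAMPLE_GFF3))
```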


In [None]:
# counts of each feature type
# Mark out all types with "gene" embedded
for key, val in feature_dict.items():
  if "gene" in key:
    print("=====> gene-related feature type: {0} | count: {1} <=====".format(key, val))
  else:
    print("                                  {0} | count: {1}".format(key,val))

                                  exon | count: 1648074
                                  CDS | count: 886460
                                  three_prime_UTR | count: 207390
                                  biological_region | count: 180084
                                  five_prime_UTR | count: 173461
                                  mRNA | count: 110869
                                  lnc_RNA | count: 91534
                                  transcript | count: 26490
=====> gene-related feature type: ncRNA_gene | count: 25925 <=====
=====> gene-related feature type: gene | count: 21507 <=====
                                  pseudogenic_transcript | count: 15226
=====> gene-related feature type: pseudogene | count: 15224 <=====
                                  ncRNA | count: 2211
                                  snRNA | count: 1906
                                  miRNA | count: 1877
                                  unconfirmed_transcript | count: 1143
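
Note that the substring test `"gene" in key` also matches types such as `V_gene_segment`, not only `gene`, `ncRNA_gene` and `pseudogene`. The sketch below contrasts the substring test with an exact match against an explicit set. The counts are copied from the dictionary printed in Step 2; the `GENE_LEVEL_TYPES` set is an illustrative choice, not something defined by the notebook.

```python
# Counts copied from the feature dictionary printed in Step 2 (only the types needed here).
feature_counts = {
    "exon": 1648074, "mRNA": 110869,
    "ncRNA_gene": 25925, "gene": 21507, "pseudogene": 15224,
    "V_gene_segment": 252, "J_gene_segment": 97,
    "D_gene_segment": 41, "C_gene_segment": 29,
}

# Substring match, as in the loop above.
substring_hits = {k: v for k, v in feature_counts.items() if "gene" in k}

# Exact match against an explicit set (illustrative assumption).
GENE_LEVEL_TYPES = {"gene", "ncRNA_gene", "pseudogene"}
exact_hits = {k: v for k, v in feature_counts.items() if k in GENE_LEVEL_TYPES}

print(sorted(substring_hits))
print(sum(substring_hits.values()), sum(exact_hits.values()))
```

Which of the two is appropriate depends on whether the gene segment types should count as gene-level features in the downstream analysis.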
                   

In [None]:
feature_df=pd.DataFrame.from_dict(feature_dict, orient='index', columns=["Count"])
print(feature_df)

                          Count
exon                    1648074
CDS                      886460
three_prime_UTR          207390
biological_region        180084
five_prime_UTR           173461
mRNA                     110869
lnc_RNA                   91534
transcript                26490
ncRNA_gene                25925
gene                      21507
pseudogenic_transcript    15226
pseudogene                15224
ncRNA                      2211
snRNA                      1906
miRNA                      1877
unconfirmed_transcript     1143
snoRNA                      942
V_gene_segment              252
J_gene_segment               97
scRNA                        50
rRNA                         49
D_gene_segment               41
C_gene_segment               29
chromosome                   25
tRNA                         22


###[Step 3]: Get Chromosome Information, particularly the column 9 attributes

####[Step 3.1]: Get a survey of all the attributes in column 9

In [None]:
chrom = annotation.filter_feature_of_type(["chromosome"])

attr_to_columns = chrom.attributes_to_columns()
print(attr_to_columns)

        seq_id  source        type    start        end score strand phase  \
0            1  GRCh38  chromosome        1  248956422     .      .     .   
316318      10  GRCh38  chromosome        1  133797422     .      .     .   
452913      11  GRCh38  chromosome        1  135086622     .      .     .   
651742      12  GRCh38  chromosome        1  133275309     .      .     .   
837294      13  GRCh38  chromosome        1  114364328     .      .     .   
895171      14  GRCh38  chromosome        1  107043718     .      .     .   
1016283     15  GRCh38  chromosome        1  101991189     .      .     .   
1136632     16  GRCh38  chromosome        1   90338345     .      .     .   
1286973     17  GRCh38  chromosome        1   83257441     .      .     .   
1489331     18  GRCh38  chromosome        1   80373285     .      .     .   
1547846     19  GRCh38  chromosome        1   58617616     .      .     .   
1735032      2  GRCh38  chromosome        1  242193529     .      .     .   
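
To make the column 9 expansion concrete: in GFF3, the attributes field is a list of `tag=value` pairs separated by `;`, a tag may carry several values separated by `,`, and reserved characters are percent-encoded. The following is a minimal standard-library sketch of that parsing, not the gffpandas implementation. The `parse_attributes` name and the example string (with placeholder alias values) are assumptions made for illustration.

```python
from urllib.parse import unquote

def parse_attributes(attr_field):
    """Split a GFF3 column-9 string into a dict; multi-valued tags become lists."""
    parsed = {}
    for pair in attr_field.strip().split(";"):
        if not pair:
            continue                                   # tolerate a trailing ";"
        tag, _, value = pair.partition("=")
        values = [unquote(v) for v in value.split(",")]
        parsed[unquote(tag)] = values if len(values) > 1 else values[0]
    return parsed

# Illustrative attribute string in the style of a chromosome row; the alias values are placeholders.
example = "ID=chromosome:20;Alias=ALIAS_A,ALIAS_B,ALIAS_C"
print(parse_attributes(example))
```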

####[Step 3.2]: Define two functions used for chromosome data extraction and analysis

In [None]:
def feature_len(start, end):
    """
    :param start: feature's start
    :param end: feature's end
    :return length: feature's overall length
    """
    # GFF3 coordinates are 1-based and inclusive, so add 1 after taking the absolute difference
    length = abs(int(end) - int(start)) + 1
    return length
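
As a worked example of the `+ 1`: GFF3 coordinates are 1-based and inclusive, so a feature spanning positions 1 to 248956422 (the chromosome 1 row in Step 3.1) covers 248956422 bases, not 248956421. The `stats_dic()` output recorded in Step 3.3 reports `Maximal_bp_length` as 248956421, which is consistent with a plain `end - start`; that reading is an inference from the printed numbers rather than a statement about gffpandas internals. The `inclusive_length` name below is illustrative.

```python
def inclusive_length(start, end):
    """Length of a 1-based, closed interval [start, end]."""
    return int(end) - int(start) + 1

chr1_start, chr1_end = 1, 248956422        # chromosome 1 row from Step 3.1

print(inclusive_length(chr1_start, chr1_end))   # inclusive length
print(chr1_end - chr1_start)                    # plain difference, one base shorter
```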

In [None]:
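def chromosome_stats(gff_fn, exclude_list=()):
  """
  Summarize the chromosome rows of a GFF3 file.

  :param gff_fn: path to the GFF3 file
  :param exclude_list: seq_id values to leave out of the length statistics
  :return: (gffpandas stats dict for all chromosome rows, chromosome DataFrame after exclusion)
  """
  print("Inside foo()")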
  annotation = gffpd.read_gff3(gff_fn)
  gff_chrom = annotation.filter_feature_of_type(["chromosome"])
  gff_chrom_stats = gff_chrom.stats_dic()

  ######################################################################################################################################################################
  # From [Step 3.1], we know that the "Alias" attribute holds several values separated by commas ","
  # Accordingly, we can parse them into different columns (this assumes every Alias entry has exactly three parts)
  ######################################################################################################################################################################
  gff_chrom_df = gff_chrom.attributes_to_columns()                                                              # split tags separated by ";" into their own columns
  gff_chrom_df[["Alias", "abbreviation", "accession"]] = gff_chrom_df["Alias"].str.split(',', expand=True)      # split the Alias entry on "," into named columns

  seq_len = gff_chrom_df.apply(lambda x: feature_len(x["start"], x["end"]), axis=1)  # return Series of chrom len
  gff_chrom_df = gff_chrom_df.assign(length=seq_len)  # add chromosome len to df
  # Rename the "ID" and "Alias" columns to lowercase so that all column names are lowercase
  gff_chrom_df.rename(columns={"ID":"id", "Alias":"alias"}, inplace=True)
  # print(gff_chrom_df)

  # number of chromosomes
  chrom_num = gff_chrom_stats["Counted_feature_types"]["chromosome"]

  # Remove the dataframe rows whose "seq_id" values are in the exclude_list.
  # Compare as strings so that 1 and "1" both match, and recount from the rows actually kept.
  if len(exclude_list)>0:
      print("The following chromosomes is/are not used for stats:", exclude_list)
      excluded = [str(c) for c in exclude_list]
      gff_chrom_df = gff_chrom_df[~gff_chrom_df["seq_id"].astype(str).isin(excluded)]
      chrom_num = len(gff_chrom_df)

  print("The total number of chromosomes for stats is:", chrom_num)

  # use seq_id as the index; the optional remapping of X, Y, MT to 23, 24, 25 (commented out below) is currently disabled
  gff_chrom_df.set_index("seq_id", inplace=True) # change index col to seq_id
  gff_chrom_df["seq_id"] = gff_chrom_df.index  # copy index col back to new seq_id col
  # gff_chrom_df.rename(index={"X":23, "Y":24, "MT":25}, inplace=True)  # change indices of X, Y, MT
  gff_chrom_df.rename_axis(None, inplace=True)  # remove index col name

  #print(gff_chrom_df)

  short_label = gff_chrom_df.loc[gff_chrom_df["length"] == min(gff_chrom_df["length"]), "seq_id"].iloc[0]
  short = gff_chrom_df.loc[gff_chrom_df["seq_id"]==short_label, "length"].iloc[0]
  long_label = gff_chrom_df.loc[gff_chrom_df["length"] == max(gff_chrom_df["length"]), "seq_id"].iloc[0]
  long = gff_chrom_df.loc[gff_chrom_df["seq_id"]==long_label, "length"].iloc[0]
  chrom_sum = gff_chrom_df["length"].sum()
  average = chrom_sum / chrom_num

  print("The shortest chromosome length is", short, "for chromosome:", short_label)
  print("The longest chromosome length is", long, "for chromosome:", long_label)
  print("The average chromosome length is", average)

  return (gff_chrom_stats, gff_chrom_df)
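
The core of the statistics step (drop excluded labels, then take the shortest, longest and mean length) can be sketched without pandas. This is a simplified illustration under stated assumptions, not a replacement for `chromosome_stats`: the `summarize` name is hypothetical, only four of the chromosome lengths from Step 3.1 are used, and labels are compared as strings so that `1` and `"1"` are treated alike.

```python
# Lengths taken from four chromosome rows printed in Step 3.1 (start = 1, so length = end).
chrom_lengths = {"1": 248956422, "10": 133797422, "19": 58617616, "2": 242193529}

def summarize(lengths, exclude=()):
    """Return (shortest_label, longest_label, mean_length) after dropping excluded labels."""
    excluded = {str(label) for label in exclude}
    kept = {label: n for label, n in lengths.items() if label not in excluded}
    shortest = min(kept, key=kept.get)
    longest = max(kept, key=kept.get)
    # Divide by the number of rows actually kept, not by (total - len(exclude)).
    return shortest, longest, sum(kept.values()) / len(kept)

print(summarize(chrom_lengths))
print(summarize(chrom_lengths, exclude=[1]))
```

Dividing by the number of rows that remain keeps the mean correct even if the exclude list names a label that is not present in the file.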

####[Step 3.3]: Get the chromosome data by calling the two aforementioned functions

In [None]:
# A function call with an exclude list
stats_d1, chrom_df1 = chromosome_stats(hg38gff, [1, "MT"])

Inside foo()



The following chromosomes is/are not used for stats: [1, 'MT']
The total number of chromosomes for stats is: 23
The shortest chromosome length is 46709983 for chromosome: 21
The longest chromosome length is 242193529 for chromosome: 2
The average chromosome length is 123312713.82608695


In [None]:
print(stats_d1)

{'Maximal_bp_length': 248956421, 'Minimal_bp_length': 16568, 'Counted_strands': {'.': 25}, 'Counted_feature_types': {'chromosome': 25}}


In [None]:
print(chrom_df1)

    source        type    start        end score strand phase  \
10  GRCh38  chromosome        1  133797422     .      .     .   
11  GRCh38  chromosome        1  135086622     .      .     .   
12  GRCh38  chromosome        1  133275309     .      .     .   
13  GRCh38  chromosome        1  114364328     .      .     .   
14  GRCh38  chromosome        1  107043718     .      .     .   
15  GRCh38  chromosome        1  101991189     .      .     .   
16  GRCh38  chromosome        1   90338345     .      .     .   
17  GRCh38  chromosome        1   83257441     .      .     .   
18  GRCh38  chromosome        1   80373285     .      .     .   
19  GRCh38  chromosome        1   58617616     .      .     .   
2   GRCh38  chromosome        1  242193529     .      .     .   
20  GRCh38  chromosome        1   64444167     .      .     .   
21  GRCh38  chromosome        1   46709983     .      .     .   
22  GRCh38  chromosome        1   50818468     .      .     .   
3   GRCh38  chromosome   

In [None]:
# A function call without an exclude list
stats_d2, chrom_df2 = chromosome_stats(hg38gff)

Inside foo()



The total number of chromosomes for stats is: 25
The shortest chromosome length is 16569 for chromosome: MT
The longest chromosome length is 248956422 for chromosome: 1
The average chromosome length is 123406616.36
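
The two recorded runs can be cross-checked against each other with a little arithmetic: the mean over 23 chromosomes should equal the total over all 25 minus the lengths of chromosome 1 and MT, divided by 23. The sketch below uses only numbers printed in this notebook; the variable names are illustrative.

```python
# Values copied from the two recorded runs in this notebook.
mean_all = 123406616.36      # average over all 25 chromosome rows
len_chr1 = 248956422         # longest chromosome (1)
len_mt = 16569               # shortest chromosome (MT)

total_all = round(mean_all * 25)              # total length over 25 rows
total_kept = total_all - len_chr1 - len_mt    # drop the two excluded rows
mean_kept = total_kept / 23                   # mean over the remaining 23 rows

print(total_all, total_kept, mean_kept)
```

The result agrees with the average of 123312713.82608695 reported by the run that excluded chromosome 1 and MT.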


In [None]:
###### NOTE: this is for all chromosomes in GFF3
# built in method from gffpandas; should it still be returned?

# stats for the chromosome only df
print(stats_d2)

{'Maximal_bp_length': 248956421, 'Minimal_bp_length': 16568, 'Counted_strands': {'.': 25}, 'Counted_feature_types': {'chromosome': 25}}


In [None]:
# printout of chromosome only df
print(chrom_df2)

    source        type    start        end score strand phase  \
1   GRCh38  chromosome        1  248956422     .      .     .   
10  GRCh38  chromosome        1  133797422     .      .     .   
11  GRCh38  chromosome        1  135086622     .      .     .   
12  GRCh38  chromosome        1  133275309     .      .     .   
13  GRCh38  chromosome        1  114364328     .      .     .   
14  GRCh38  chromosome        1  107043718     .      .     .   
15  GRCh38  chromosome        1  101991189     .      .     .   
16  GRCh38  chromosome        1   90338345     .      .     .   
17  GRCh38  chromosome        1   83257441     .      .     .   
18  GRCh38  chromosome        1   80373285     .      .     .   
19  GRCh38  chromosome        1   58617616     .      .     .   
2   GRCh38  chromosome        1  242193529     .      .     .   
20  GRCh38  chromosome        1   64444167     .      .     .   
21  GRCh38  chromosome        1   46709983     .      .     .   
22  GRCh38  chromosome   
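
The rows above appear in lexicographic order (1, 10, 11, ..., 2, 20, ...) because the labels are compared as text. If a karyotype-style order is preferred, one option is a sort key that maps numeric labels to their values and X, Y, MT to 23, 24, 25, following the mapping mentioned in the commented-out `rename` line of `chromosome_stats`. The `chrom_sort_key` function is a hypothetical helper, shown here on a plain list rather than on the DataFrame.

```python
def chrom_sort_key(label):
    """Sort numeric labels by value, then X, Y, MT; any other label goes last, alphabetically."""
    special = {"X": 23, "Y": 24, "MT": 25}
    text = str(label)
    if text.isdigit():
        return (0, int(text), "")
    if text in special:
        return (0, special[text], "")
    return (1, 0, text)

# Mixed str/int labels, similar to what a mixed-type seq_id column can contain.
labels = ["1", "10", "11", "2", "20", "MT", "X", "Y", 3]
print(sorted(labels, key=chrom_sort_key))
```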

####[Step 3.4]: Save data in CSV or Excel

In [None]:
chrom_df1.to_csv("gff_chromosome_exclude.csv")
chrom_df2.to_csv("gff_chromosome.csv")

In [None]:
!ls -al

total 28
drwxr-xr-x 1 root root 4096 Jun  2 17:24 .
drwxr-xr-x 1 root root 4096 Jun  2 14:55 ..
drwxr-xr-x 4 root root 4096 May 31 13:45 .config
drwx------ 6 root root 4096 Jun  2 15:03 drive
-rw-r--r-- 1 root root 3778 Jun  2 17:24 gff_chromosome.csv
-rw-r--r-- 1 root root 3498 Jun  2 17:24 gff_chromosome_exclude.csv
drwxr-xr-x 1 root root 4096 May 31 13:46 sample_data


In [None]:
chrom_df1.to_csv("/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/gff_chromosome_exclude.csv")
chrom_df2.to_csv("/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/gff_chromosome.csv")
