<a href="https://colab.research.google.com/github/bicks1/hughesintern/blob/main/gff_gene_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**File Name**:

```
gff_gene.ipynb
```

**Description**:

```
This program is part of a series of programs for information extraction and mining of gene annotations in GFF3 files.

Feature/Type (column 3) definitions: http://www.sequenceontology.org/browser/obob.cgi

Biotype (attribute in column 9) definitions: https://www.gencodegenes.org/pages/biotypes.html

```
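To make the references to column 3 (type) and column 9 (attributes) concrete, here is a minimal sketch that splits one tab-delimited GFF3 record into its nine named columns. The function name `parse_gff3_line` and the sample record are hypothetical and are not taken from the Ensembl file; the column names match those shown in the gffpandas output later in this notebook.

```python
# Column names as they appear in the gffpandas DataFrame printed below.
GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one tab-delimited GFF3 record into a dict keyed by column name.

    Hypothetical helper for illustration only; gffpandas does this for us.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got {0}".format(len(fields)))
    record = dict(zip(GFF3_COLUMNS, fields))
    # start and end are 1-based integer coordinates
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Invented record, not copied from the Ensembl GFF3 file
sample = "1\thavana\tgene\t100\t250\t.\t+\t.\tID=gene:EXAMPLE1;biotype=protein_coding"
rec = parse_gff3_line(sample)
print(rec["type"])        # column 3
print(rec["attributes"])  # column 9
```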

**Authors**:

```
Sophia Bick, Chun Liang
```


[Step 1]: Install Python modules and mount the Google Drive folder that contains the GFF3 files

In [None]:
!pip install pandas
!pip install gffpandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gffpandas
  Downloading gffpandas-1.2.0.tar.gz (178 kB)
     178.8/178.8 kB 4.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gffpandas
  Building wheel for gffpandas (setup.py) ... done
  Created wheel for gffpandas: filename=gffpandas-1.2.0-py2.py3-none-any.whl size=6248 sha256=f602a932d384fa264ea84bec6b83a85bf3284563b09c0b061244610cb144de7a
  Stored in directory: /root/.cache/pip/wheels/57/87/f1/1d0c74fbc5151562ba7953dc110a7d8c63c6c3229d025bc8cd
Successfully built gffpandas
Installing collected packages: gffpandas
Successfully installed gffpandas-1.2.0


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!ls "/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern"

Mounted at /content/drive
 chrom_and_gene_Qs.ipynb        gff_transcript.v0.ipynb
 chrom_and_gene_Qs_lc.ipynb     gff_transcript.v1.ipynb
 gff_chromosome.csv	        Homo_sapiens.GRCh38.109.chr.gff3
 gff_chromosome_exclude.csv     Homo_sapiens.GRCh38.109.chromosome.20.gff3
 gff_chrom.v0.ipynb	        Homo_sapiens.GRCh38.dna.chromosome.20.fa
 gff_chrom.v1.ipynb	        module_explore.ipynb
 gff_chrom.v2.ipynb	        module_explore_lc.ipynb
 gff_gene.v0.ipynb	       'Project Qs.gdoc'
 gff_gene.v1.ipynb	        ResearchPlan.gdoc
 gff_individual_gene.v0.ipynb   researchplan_Qs.ipynb
'GFF module notes.gdoc'         researchplan_Qs_lc.ipynb
 gff_transcript_all.csv         transcription_Qs.ipynb
 gff_transcript_mRNA.csv        transcription_Qs_lc.ipynb
 gff_transcript_segment


In [None]:
hg38gff = "/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/Homo_sapiens.GRCh38.109.chr.gff3"

In [None]:
import gffpandas.gffpandas as gffpd
import pandas as pd

[Step 2]: Get a quick overview of a given GFF3 file and count the feature types found in the third column (type)

In [None]:
annotation = gffpd.read_gff3(hg38gff)
# gffpd stats dict with feature type counts
stats = annotation.stats_dic()
#print(stats)
feature_dict = stats["Counted_feature_types"]
print(feature_dict)

  self.df = pd.read_table(self._gff_file, comment='#',


{'exon': 1648074, 'CDS': 886460, 'three_prime_UTR': 207390, 'biological_region': 180084, 'five_prime_UTR': 173461, 'mRNA': 110869, 'lnc_RNA': 91534, 'transcript': 26490, 'ncRNA_gene': 25925, 'gene': 21507, 'pseudogenic_transcript': 15226, 'pseudogene': 15224, 'ncRNA': 2211, 'snRNA': 1906, 'miRNA': 1877, 'unconfirmed_transcript': 1143, 'snoRNA': 942, 'V_gene_segment': 252, 'J_gene_segment': 97, 'scRNA': 50, 'rRNA': 49, 'D_gene_segment': 41, 'C_gene_segment': 29, 'chromosome': 25, 'tRNA': 22}
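The counts in `stats["Counted_feature_types"]` are a tally of the "type" column. As a rough sketch of the same idea in plain pandas, the example below counts types and strands on a small invented DataFrame. It assumes that a similar call on the real data would go through the underlying DataFrame (`annotation.df`), which is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the annotation table; all rows are invented.
toy = pd.DataFrame({
    "type": ["gene", "mRNA", "exon", "exon", "ncRNA_gene", "exon"],
    "strand": ["+", "+", "+", "+", "-", "-"],
})

# Tally each feature type and each strand, similar in spirit to
# the "Counted_feature_types" and "Counted_strands" entries.
type_counts = toy["type"].value_counts().to_dict()
strand_counts = toy["strand"].value_counts().to_dict()

print(type_counts)
print(strand_counts)
```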


In [None]:
# counts of each feature type

for key, val in feature_dict.items():
  if "gene" in key and "segment" not in key:
    print("=====> gene-related feature type: {0} | count: {1} <=====".format(key, val))
  else:
    print("                                  {0} | count: {1}".format(key,val))

                                  exon | count: 1648074
                                  CDS | count: 886460
                                  three_prime_UTR | count: 207390
                                  biological_region | count: 180084
                                  five_prime_UTR | count: 173461
                                  mRNA | count: 110869
                                  lnc_RNA | count: 91534
                                  transcript | count: 26490
=====> gene-related feature type: ncRNA_gene | count: 25925 <=====
=====> gene-related feature type: gene | count: 21507 <=====
                                  pseudogenic_transcript | count: 15226
=====> gene-related feature type: pseudogene | count: 15224 <=====
                                  ncRNA | count: 2211
                                  snRNA | count: 1906
                                  miRNA | count: 1877
                                  unconfirmed_transcript | count: 1143
                   

In [None]:
feature_df=pd.DataFrame.from_dict(feature_dict, orient='index', columns=["Count"])
print(feature_df)

                          Count
exon                    1648074
CDS                      886460
three_prime_UTR          207390
biological_region        180084
five_prime_UTR           173461
mRNA                     110869
lnc_RNA                   91534
transcript                26490
ncRNA_gene                25925
gene                      21507
pseudogenic_transcript    15226
pseudogene                15224
ncRNA                      2211
snRNA                      1906
miRNA                      1877
unconfirmed_transcript     1143
snoRNA                      942
V_gene_segment              252
J_gene_segment               97
scRNA                        50
rRNA                         49
D_gene_segment               41
C_gene_segment               29
chromosome                   25
tRNA                         22


[Step 3]: Get Gene Information, particularly the column 9 attributes

[Step 3.1]: Get survey of all the attributes in the column 9
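Column 9 stores `key=value` pairs separated by semicolons, and `attributes_to_columns()` turns each key into its own DataFrame column. The sketch below shows the basic splitting on a single invented attribute string. The helper `parse_attributes` is hypothetical, handles only the simple case (no URL-encoded characters or multi-valued tags), and is not how gffpandas is implemented.

```python
def parse_attributes(attr_string):
    """Turn 'key1=value1;key2=value2' into a dict (simple cases only)."""
    attributes = {}
    for pair in attr_string.strip().split(";"):
        if not pair:
            continue  # skip empty pieces, e.g. after a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes


# Invented attribute string for illustration
example = "ID=gene:EXAMPLE1;Name=DEMO;biotype=protein_coding;gene_id=EXAMPLE1"
parsed = parse_attributes(example)
print(parsed)
```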

In [None]:
# Case 1: only "gene" (protein-coding gene features)
gene_dir = annotation.filter_feature_of_type(["gene"])

attr_to_columns = gene_dir.attributes_to_columns()
print(attr_to_columns)

        seq_id          source  type     start       end score strand phase  \
84           1  ensembl_havana  gene     65419     71585     .      +     .   
370          1  ensembl_havana  gene    450740    451678     .      -     .   
502          1  ensembl_havana  gene    685716    686654     .      -     .   
1097         1  ensembl_havana  gene    923923    944575     .      +     .   
1459         1  ensembl_havana  gene    944203    959309     .      -     .   
...        ...             ...   ...       ...       ...   ...    ...   ...   
3409854      Y  ensembl_havana  gene  24607560  24639207     .      +     .   
3409920      Y  ensembl_havana  gene  24763069  24813492     .      -     .   
3410046      Y  ensembl_havana  gene  24833843  24907040     .      +     .   
3410444      Y  ensembl_havana  gene  25030901  25062548     .      -     .   
3410636      Y  ensembl_havana  gene  25622117  25624902     .      +     .   

                                                att

In [None]:
# Case 2: only "ncRNA_gene"
gene_dir = annotation.filter_feature_of_type(["ncRNA_gene"])

attr_to_columns = gene_dir.attributes_to_columns()
print(attr_to_columns)

        seq_id   source        type     start       end score strand phase  \
16           1   havana  ncRNA_gene     11869     14409     .      +     .   
43           1  mirbase  ncRNA_gene     17369     17436     .      -     .   
52           1   havana  ncRNA_gene     29554     31109     .      +     .   
61           1  mirbase  ncRNA_gene     30366     30503     .      +     .   
64           1   havana  ncRNA_gene     34554     36081     .      -     .   
...        ...      ...         ...       ...       ...   ...    ...   ...   
3410679      Y  ensembl  ncRNA_gene  25723342  25723495     .      +     .   
3410682      Y   havana  ncRNA_gene  25728490  25733388     .      +     .   
3410731      Y  ensembl  ncRNA_gene  25928979  25929142     .      +     .   
3410798      Y  ensembl  ncRNA_gene  26247384  26247521     .      +     .   
3410816      Y  ensembl  ncRNA_gene  26360989  26361092     .      +     .   

                                                attributes  \
1

In [None]:
# Case 3: only "pseudogene"
gene_dir = annotation.filter_feature_of_type(["pseudogene"])

attr_to_columns = gene_dir.attributes_to_columns()
print(attr_to_columns)

        seq_id  source        type     start       end score strand phase  \
21           1  havana  pseudogene     12010     13670     .      +     .   
29           1  havana  pseudogene     14404     29570     .      -     .   
73           1  havana  pseudogene     52473     53312     .      +     .   
81           1  havana  pseudogene     62949     63887     .      +     .   
121          1  havana  pseudogene    131025    134836     .      +     .   
...        ...     ...         ...       ...       ...   ...    ...   ...   
3410857      Y  havana  pseudogene  26549425  26549743     .      +     .   
3410860      Y  havana  pseudogene  26586642  26591601     .      -     .   
3410865      Y  havana  pseudogene  26594851  26634652     .      -     .   
3410880      Y  havana  pseudogene  26626520  26627159     .      -     .   
3410885      Y  havana  pseudogene  56855244  56855488     .      +     .   

                                                attributes  \
21       ID=g

The only gene-related types in this file are "gene", "ncRNA_gene", and "pseudogene".
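The same rule used in Step 2 (the name contains "gene" but not "segment") can pick these three types out of the feature dictionary. The sketch below applies it to a subset of the counts printed earlier; the function name `gene_related_types` is hypothetical.

```python
def gene_related_types(feature_counts):
    """Return feature type names that contain 'gene' but not 'segment'."""
    return [name for name in feature_counts
            if "gene" in name and "segment" not in name]


# Subset of the counts printed in Step 2
feature_counts = {
    "exon": 1648074,
    "ncRNA_gene": 25925,
    "gene": 21507,
    "pseudogenic_transcript": 15226,
    "pseudogene": 15224,
    "V_gene_segment": 252,
}

selected = gene_related_types(feature_counts)
print(selected)
```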

In [None]:
# Case 4: all gene-related features ("gene", "ncRNA_gene", and "pseudogene")
gene_rel = annotation.filter_feature_of_type(["gene", "ncRNA_gene", "pseudogene"])

attr_to_columns = gene_rel.attributes_to_columns()
print(attr_to_columns)

        seq_id   source        type     start       end score strand phase  \
16           1   havana  ncRNA_gene     11869     14409     .      +     .   
21           1   havana  pseudogene     12010     13670     .      +     .   
29           1   havana  pseudogene     14404     29570     .      -     .   
43           1  mirbase  ncRNA_gene     17369     17436     .      -     .   
52           1   havana  ncRNA_gene     29554     31109     .      +     .   
...        ...      ...         ...       ...       ...   ...    ...   ...   
3410857      Y   havana  pseudogene  26549425  26549743     .      +     .   
3410860      Y   havana  pseudogene  26586642  26591601     .      -     .   
3410865      Y   havana  pseudogene  26594851  26634652     .      -     .   
3410880      Y   havana  pseudogene  26626520  26627159     .      -     .   
3410885      Y   havana  pseudogene  56855244  56855488     .      +     .   

                                                attributes  \
1

[Step 3.2]: Define two functions used for gene data extraction and analysis

In [None]:
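def feature_len(start, end):
    """
    Length of a feature given 1-based, inclusive GFF3 coordinates.
    :param start: feature's start
    :param end: feature's end
    :return length: feature's overall length
    """
    # Take abs() of the difference before adding 1 so that swapped
    # coordinates still give the inclusive length
    length = abs(int(end) - int(start)) + 1
    return length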
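GFF3 coordinates are 1-based and inclusive, so a feature's length is `end - start + 1`. As a small worked example, the sketch below checks this on the first two "gene" rows printed in Step 3.1 and shows a vectorized alternative to a row-by-row `apply`. The name `inclusive_length` is hypothetical and is used only so this illustration does not shadow `feature_len()`.

```python
import pandas as pd


def inclusive_length(start, end):
    """Inclusive length for 1-based coordinates (illustrative helper)."""
    return abs(int(end) - int(start)) + 1


# First "gene" row printed in Step 3.1: start 65419, end 71585
print(inclusive_length(65419, 71585))  # 6167

# Vectorized form on a DataFrame holding the first two "gene" rows
coords = pd.DataFrame({"start": [65419, 450740], "end": [71585, 451678]})
coords["length"] = coords["end"] - coords["start"] + 1
print(coords["length"].tolist())  # [6167, 939]
```

The vectorized column arithmetic gives the same values as the per-row function when start is never greater than end, and it avoids a Python-level loop over tens of thousands of rows.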

In [None]:
def gene_stats(gff_fn, type_list=None):

  ##############################################################################################
  # This function takes only two arguments: (gff_fn, type_list=None)
  # If type_list is empty or None, it computes gene stats for all three gene types.
  # Otherwise, it computes gene stats for the listed types ("gene", "ncRNA_gene", "pseudogene").
  # Each value in type_list must come from the "type" column of the GFF3 file.
  # The function also prints the count (tally) of each gene type.
  ##############################################################################################

  """
  Find count, shortest, longest, and average length
  of gene features in a GFF3 file
  Uses: gffpandas, pandas, feature_len()
  :param gff_fn: file name of desired GFF3 file
  :param type_list: list of gene-related feature types to include; if empty or None, all three types are used
  :return gff_ft_stats: gffpandas stats dict (max and min bp length, strand counts, feature type counts) for the selected features
  :return gff_ft_df: Pandas df of the selected gene features; col correspond to GFF3 col with attributes' col tags as own columns, plus "length"
  """

  print("Inside foo()")
  annotation = gffpd.read_gff3(gff_fn)

  # Fall back to all three gene-related types when none are given
  if not type_list:
    type_list = ["gene", "ncRNA_gene", "pseudogene"]

  gff_ft = annotation.filter_feature_of_type(type_list)
  gff_ft_stats = gff_ft.stats_dic()

  gff_ft_df = gff_ft.attributes_to_columns()  # split tags into own df col

  seq_len = gff_ft_df.apply(lambda x: feature_len(x["start"], x["end"]), axis=1)  # return Series of gene len
  gff_ft_df = gff_ft_df.assign(length=seq_len)  # add gene len to df
  # Rename the "ID" and "Name" columns to lowercase
  gff_ft_df.rename(columns={"ID":"id", "Name":"name"}, inplace=True)

  # Include dataframe rows whose "type" columns are in the type_list
  gff_ft_df = gff_ft_df.loc[gff_ft_df["type"].isin(type_list)]
  print("The following types is/are used for stats:", type_list)

  # number of gene features
  print("Counts of each feature type are:")
  ft_num = gff_ft_df["type"].value_counts().to_string()
  print(ft_num)

  # set index col
  gff_ft_df.reset_index(inplace=True)
  gff_ft_df.index += 1
  gff_ft_df.pop("index")

  # shortest and longest gene features
  short_label = gff_ft_df.loc[gff_ft_df["length"] == min(gff_ft_df["length"]), "gene_id"].iloc[0]
  short = gff_ft_df.loc[gff_ft_df["gene_id"]==short_label, "length"].iloc[0]
  short_type = gff_ft_df.loc[gff_ft_df["gene_id"]==short_label, "type"].iloc[0]
  long_label = gff_ft_df.loc[gff_ft_df["length"] == max(gff_ft_df["length"]), "gene_id"].iloc[0]
  long = gff_ft_df.loc[gff_ft_df["gene_id"]==long_label, "length"].iloc[0]
  long_type = gff_ft_df.loc[gff_ft_df["gene_id"]==long_label, "type"].iloc[0]

  ft_sum = gff_ft_df["length"].sum()
  total_ft = len(gff_ft_df)
  #print("Number of gene features kept", total_ft)
  average = ft_sum / total_ft

  print("The shortest gene feature length is: [{0}] with the type: [{1}] and the ID: [{2}]".format( short, short_type, short_label))
  print("The longest gene feature length is: [{0}] with the type: [{1}] and the ID: [{2}]".format( long, long_type, long_label))
  print("The average length for gene feature is: [{0}]".format(average))

  return gff_ft_stats, gff_ft_df
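The shortest/longest lookup in `gene_stats()` can also be expressed with `idxmin()` and `idxmax()`, which return the index label of the first minimum or maximum. The sketch below is an alternative written on an invented three-row DataFrame, not the code used above; the column names mirror those in `gff_ft_df`.

```python
import pandas as pd

# Toy table with the same column names as gff_ft_df; values are invented.
toy = pd.DataFrame({
    "gene_id": ["G1", "G2", "G3"],
    "type": ["gene", "ncRNA_gene", "pseudogene"],
    "start": [100, 500, 900],
    "end": [107, 2500, 1200],
})
toy["length"] = toy["end"] - toy["start"] + 1  # inclusive length

# Rows holding the shortest and longest features (first match on ties)
shortest = toy.loc[toy["length"].idxmin()]
longest = toy.loc[toy["length"].idxmax()]
mean_length = toy["length"].mean()

print(shortest["gene_id"], shortest["type"], shortest["length"])
print(longest["gene_id"], longest["type"], longest["length"])
print(mean_length)
```

Selecting the whole row once avoids looking the same gene up three times by its ID.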

[Step 3.3]: Get the gene data by calling gene_stats(), which uses feature_len() internally

In [None]:
# Case 1: all gene types
stats_all, df_all = gene_stats(hg38gff, [])

Inside foo()


  self.df = pd.read_table(self._gff_file, comment='#',


The following types is/are used for stats: ['gene', 'ncRNA_gene', 'pseudogene']
Counts of each feature type are:
ncRNA_gene    25925
gene          21507
pseudogene    15224
The shortest gene feature length is: [8] with the type: [gene] and the ID: [ENSG00000223997]
The longest gene feature length is: [2473539] with the type: [gene] and the ID: [ENSG00000078328]
The average length for gene feature is: [32378.224064734422]


In [None]:
# Case 1: all gene types stats
print(stats_all)

{'Maximal_bp_length': 2473538, 'Minimal_bp_length': 7, 'Counted_strands': {'+': 31781, '-': 30875}, 'Counted_feature_types': {'ncRNA_gene': 25925, 'gene': 21507, 'pseudogene': 15224}}


In [None]:
# Case 1: printout of all gene types df
print(df_all)

      seq_id   source        type     start       end score strand phase  \
1          1   havana  ncRNA_gene     11869     14409     .      +     .   
2          1   havana  pseudogene     12010     13670     .      +     .   
3          1   havana  pseudogene     14404     29570     .      -     .   
4          1  mirbase  ncRNA_gene     17369     17436     .      -     .   
5          1   havana  ncRNA_gene     29554     31109     .      +     .   
...      ...      ...         ...       ...       ...   ...    ...   ...   
62652      Y   havana  pseudogene  26549425  26549743     .      +     .   
62653      Y   havana  pseudogene  26586642  26591601     .      -     .   
62654      Y   havana  pseudogene  26594851  26634652     .      -     .   
62655      Y   havana  pseudogene  26626520  26627159     .      -     .   
62656      Y   havana  pseudogene  56855244  56855488     .      +     .   

                                              attributes  \
1      ID=gene:ENSG00000290

In [None]:
# Case 2: "gene" only
stats_gene, df_gene = gene_stats(hg38gff, ["gene"])

Inside foo()


  self.df = pd.read_table(self._gff_file, comment='#',


The following types is/are used for stats: ['gene']
Counts of each feature type are:
gene    21507
The shortest gene feature length is: [8] with the type: [gene] and the ID: [ENSG00000223997]
The longest gene feature length is: [2473539] with the type: [gene] and the ID: [ENSG00000078328]
The average length for gene feature is: [64213.41911935649]


In [None]:
# Case 2: "gene" only
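# Note: the lengths in this dictionary are off by one. It reports 2473538 and 7,
# one less than the inclusive lengths (2473539 and 8) computed by feature_len().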
print(stats_gene)

{'Maximal_bp_length': 2473538, 'Minimal_bp_length': 7, 'Counted_strands': {'+': 10995, '-': 10512}, 'Counted_feature_types': {'gene': 21507}}


In [None]:
# Case 2: "gene" only
print(df_gene)

      seq_id          source  type     start       end score strand phase  \
1          1  ensembl_havana  gene     65419     71585     .      +     .   
2          1  ensembl_havana  gene    450740    451678     .      -     .   
3          1  ensembl_havana  gene    685716    686654     .      -     .   
4          1  ensembl_havana  gene    923923    944575     .      +     .   
5          1  ensembl_havana  gene    944203    959309     .      -     .   
...      ...             ...   ...       ...       ...   ...    ...   ...   
21503      Y  ensembl_havana  gene  24607560  24639207     .      +     .   
21504      Y  ensembl_havana  gene  24763069  24813492     .      -     .   
21505      Y  ensembl_havana  gene  24833843  24907040     .      +     .   
21506      Y  ensembl_havana  gene  25030901  25062548     .      -     .   
21507      Y  ensembl_havana  gene  25622117  25624902     .      +     .   

                                              attributes  \
1      ID=gene:

In [None]:
# Case 3: "ncRNA_gene" only
stats_ncRNA_gene, df_ncRNA_gene = gene_stats(hg38gff, ["ncRNA_gene"])

Inside foo()


  self.df = pd.read_table(self._gff_file, comment='#',


The following types is/are used for stats: ['ncRNA_gene']
Counts of each feature type are:
ncRNA_gene    25925
The shortest gene feature length is: [41] with the type: [ncRNA_gene] and the ID: [ENSG00000263526]
The longest gene feature length is: [1375317] with the type: [ncRNA_gene] and the ID: [ENSG00000231918]
The average length for gene feature is: [22978.791398264224]


In [None]:
# Case 4: "pseudogene" only
stats_pseudogene, df_pseudogene = gene_stats(hg38gff, ["pseudogene"])

Inside foo()


  self.df = pd.read_table(self._gff_file, comment='#',


The following types is/are used for stats: ['pseudogene']
Counts of each feature type are:
pseudogene    15224
The shortest gene feature length is: [23] with the type: [pseudogene] and the ID: [ENSG00000271544]
The longest gene feature length is: [909387] with the type: [pseudogene] and the ID: [ENSG00000286215]
The average length for gene feature is: [3410.8535864424593]
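The four cases above each re-read the GFF3 file. As a sketch of an alternative, per-type counts and length statistics can be obtained in one pass with `groupby` on a DataFrame that already has a "length" column (such as `df_all`). The example below uses an invented toy table, so the numbers are not the notebook's results.

```python
import pandas as pd

# Toy table; all rows are invented for illustration.
toy = pd.DataFrame({
    "type": ["gene", "gene", "ncRNA_gene", "pseudogene", "pseudogene"],
    "length": [8, 1000, 41, 23, 300],
})

# One row per feature type with count, shortest, longest, and mean length
summary = toy.groupby("type")["length"].agg(["count", "min", "max", "mean"])
print(summary)
```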


The shortest gene is 8 bp long (ENSG00000223997). This was validated through Ensembl and NCBI: the gene is T cell receptor delta diversity 1 (TRDD1).


[Step 3.4]: Save data in CSV or Excel

In [None]:
df_all.to_csv("gff_genef_all.csv")
df_gene.to_csv("gff_genef_gene.csv")
df_ncRNA_gene.to_csv("gff_genef_ncRNA.csv")
df_pseudogene.to_csv("gff_genef_pseudo.csv")


In [None]:
!ls -al

total 104280
drwxr-xr-x 1 root root     4096 Jun  8 22:25 .
drwxr-xr-x 1 root root     4096 Jun  8 21:55 ..
drwxr-xr-x 4 root root     4096 Jun  7 17:43 .config
drwx------ 6 root root     4096 Jun  8 22:05 drive
-rw-r--r-- 1 root root 24399122 Jun  8 22:25 gff_fgene_all.csv
-rw-r--r-- 1 root root  9195156 Jun  8 22:25 gff_fgene_gene.csv
-rw-r--r-- 1 root root  8791298 Jun  8 22:25 gff_fgene_ncRNA.csv
-rw-r--r-- 1 root root  6390700 Jun  8 22:25 gff_fgene_pseudo.csv
-rw-r--r-- 1 root root 24399122 Jun  8 22:23 gff_gene_all.csv
-rw-r--r-- 1 root root  9195156 Jun  8 22:24 gff_gene_gene.csv
-rw-r--r-- 1 root root  8791298 Jun  8 22:24 gff_gene_ncRNA.csv
-rw-r--r-- 1 root root  9195156 Jun  8 22:23 gff_gene_only.csv
-rw-r--r-- 1 root root  6390700 Jun  8 22:24 gff_gene_pseudo.csv
drwxr-xr-x 1 root root     4096 Jun  7 17:44 sample_data


In [None]:
df_all.to_csv("/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/gff_genef_all.csv")
df_gene.to_csv("/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/gff_genef_gene.csv")
df_ncRNA_gene.to_csv("/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/gff_genef_ncRNA.csv")
df_pseudogene.to_csv("/content/drive/My Drive/Lab_share/Lab_member/SophiaBick/HughesIntern/gff_genef_pseudo.csv")
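When these CSV files are read back later, the 1-based row index written by `to_csv()` becomes an unnamed first column unless it is labeled. The sketch below shows one way to round-trip a small invented table with a named index. The file name `toy_genes.csv`, the temporary directory, and the label "row" are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy table with a 1-based index, mirroring how gff_ft_df is indexed.
toy = pd.DataFrame(
    {"gene_id": ["G1", "G2"], "type": ["gene", "pseudogene"], "length": [8, 301]},
    index=[1, 2],
)

# Write to a temporary location (hypothetical path, not the Drive folder)
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "toy_genes.csv")
toy.to_csv(out_path, index_label="row")  # name the index column

# Read it back, restoring the saved index
restored = pd.read_csv(out_path, index_col="row")
print(restored)
```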