1. Extract Relevant Information from the .gff Files 



The .gff files already contain detailed SNP data, including the chromosome (Chr), the start position, and possibly the reference and alternate alleles in the description column. Extract these fields, since they are needed to match each SNP against population frequency databases such as gnomAD.

Parse the .gff file to extract:
Chromosome (Chr)
Position (Start position or End position if necessary)
Reference allele and alternate allele (if available in the description column).
Tools: Python with pandas, re for regex parsing, or pybedtools.

In [None]:
import pandas as pd
import re

# Load the GFF file
def load_gff(file_path):
    # Read the GFF file, assuming no header and tab-separated values
    gff = pd.read_csv(file_path, sep="\t", header=None, comment="#")
    gff.columns = [
        "Chr", "SNP_caller", "Feature_Type", "Start", "End",
        "Quality", "Strand", "Codon_phase", "Description"
    ]
    return gff

# Extract relevant fields from the 'Description' column
def extract_snp_info(gff_df):
    # Extract Ref allele from the 'Description' field
    gff_df['Ref'] = gff_df['Description'].str.extract(r'Ref=([A-Za-z]+)')
    # Extract Alt allele from the 'Description' field
    gff_df['Alt'] = gff_df['Description'].str.extract(r'Alt=([A-Za-z]+)')
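    # Extract the dbSNP ID if present: the literal prefix "rs" followed by digits
    # (a character class such as [rs\d]+ would also accept strings like "ss12" or "123")
    gff_df['dbSNP_ID'] = gff_df['Description'].str.extract(r'dbSNP=(rs\d+)')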
    return gff_df

# Save the processed data to a new file
def save_processed_data(gff_df, output_path):
    # Keep only necessary columns
    processed_df = gff_df[["Chr", "Start", "End", "Ref", "Alt", "dbSNP_ID"]]
    processed_df.to_csv(output_path, sep="\t", index=False)
    print(f"Processed data saved to {output_path}")

# Main function
def main():
    input_file = "example.gff"  # Path to your GFF file
    output_file = "processed_snp_data.tsv"  # Output file for processed SNP data
    
    print("Loading GFF file...")
    gff_data = load_gff(input_file)
    
    print("Extracting SNP information...")
    processed_data = extract_snp_info(gff_data)
    
    print("Saving processed data...")
    save_processed_data(processed_data, output_file)
    print("Processing complete.")

# Run the script
if __name__ == "__main__":
    main()
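
Before running the script on a full file, it can help to check the regular expressions against a few attribute strings. The sketch below is only an illustration: the `Description` values are made up, and the `Ref=...;Alt=...;dbSNP=...` layout is the assumption this section already makes, so a different SNP caller may need different patterns.

```python
import pandas as pd

# Hypothetical Description values; real GFF attribute layouts vary by SNP caller.
sample = pd.DataFrame({
    "Description": [
        "Ref=A;Alt=G;dbSNP=rs12345",
        "Ref=C;Alt=T",              # no dbSNP entry
        "Note=no allele info",      # neither allele present
    ]
})

# Patterns modelled on extract_snp_info; expand=False returns one Series per pattern.
sample["Ref"] = sample["Description"].str.extract(r"Ref=([A-Za-z]+)", expand=False)
sample["Alt"] = sample["Description"].str.extract(r"Alt=([A-Za-z]+)", expand=False)
sample["dbSNP_ID"] = sample["Description"].str.extract(r"dbSNP=(rs\d+)", expand=False)

# Rows without a match come back as NaN, so they are easy to count or drop.
missing_alleles = sample[["Ref", "Alt"]].isna().any(axis=1)
print(sample[["Ref", "Alt", "dbSNP_ID"]])
print(f"{missing_alleles.sum()} of {len(sample)} rows lack Ref or Alt")
```

Counting the rows with missing alleles gives a quick estimate of how many SNPs can only be matched on chromosome and position later on.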


2. Obtain Population Frequency Data
Download population frequency data from a source such as gnomAD or dbSNP. The data should include:
Chromosome
Position
Reference allele
Alternate allele
Population frequencies (e.g., global, specific populations).

gnomAD:
Download VCF files from gnomad.broadinstitute.org, choosing a release on the same genome build as the .gff coordinates.

dbSNP:
Download the relevant build for GRCh37/hg19 from NCBI's dbSNP.


3. Convert Population Frequency Data to a Queryable Format
VCF files are large and need preprocessing:

Use tools like bcftools, tabix, or Python libraries (pysam, cyvcf2) to extract necessary columns and filter the dataset to match your .gff data.
Alternatively, convert the VCF to a more efficient format (e.g., TSV or SQLite).
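
The SQLite option could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the table name `allele_freq`, the helper `lookup_af`, and the three example rows are all hypothetical, and an in-memory database stands in for a real file.

```python
import sqlite3
import pandas as pd

# Tiny made-up table standing in for the real frequency extract.
freq = pd.DataFrame({
    "Chr": ["1", "1", "2"],
    "Position": [12345, 67890, 55555],
    "Ref": ["A", "C", "G"],
    "Alt": ["G", "T", "A"],
    "AF": [0.012, 0.35, 0.0004],
})

# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
freq.to_sql("allele_freq", conn, index=False, if_exists="replace")

# An index on the lookup key speeds up per-variant queries on large tables.
conn.execute("CREATE INDEX idx_variant ON allele_freq (Chr, Position, Ref, Alt)")

def lookup_af(conn, chrom, pos, ref, alt):
    """Return the allele frequency for one variant, or None if it is absent."""
    row = conn.execute(
        "SELECT AF FROM allele_freq WHERE Chr=? AND Position=? AND Ref=? AND Alt=?",
        (chrom, pos, ref, alt),
    ).fetchone()
    return row[0] if row else None

print(lookup_af(conn, "1", 12345, "A", "G"))
print(lookup_af(conn, "3", 1, "A", "T"))
```

For a table too large to hold in memory, one option is to read the TSV with the `chunksize` argument of `pd.read_csv` and append each chunk with `to_sql(..., if_exists="append")`.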

In [None]:

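# Shell step: run in a terminal, or keep the leading "!" to run it from a notebook cell.
# INFO tags are written as %INFO/AF (or %AF) in a bcftools query format string.
!bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\n' gnomad.vcf.gz > gnomad_af.tsv

# Load population frequency data.
# Chr is read as a string so that "1" and "X" share one dtype; "." marks a missing AF.
gnomad = pd.read_csv(
    "gnomad_af.tsv", sep="\t",
    names=["Chr", "Position", "Ref", "Alt", "AF"],
    dtype={"Chr": str}, na_values=["."],
)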
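
If the VCF contains multi-allelic records, the extracted table will hold comma-separated values in both the Alt and AF columns, and those rows will never match a single-allele SNP from the .gff file. The sketch below shows one way to split them in pandas. The rows are made up, `split_multiallelic` is a hypothetical helper, and exploding two columns at once assumes pandas 1.3 or newer. Splitting the records upstream with `bcftools norm` is another option.

```python
import pandas as pd

# Made-up rows in the layout of gnomad_af.tsv; the second row is multi-allelic.
raw = pd.DataFrame({
    "Chr": ["1", "1"],
    "Position": [12345, 67890],
    "Ref": ["A", "C"],
    "Alt": ["G", "T,G"],
    "AF": ["0.012", "0.35,0.001"],
})

def split_multiallelic(df):
    """Return one row per alternate allele, with AF as a float."""
    out = df.copy()
    out["Alt"] = out["Alt"].str.split(",")
    out["AF"] = out["AF"].astype(str).str.split(",")
    # Exploding both columns together keeps each Alt paired with its own AF.
    out = out.explode(["Alt", "AF"], ignore_index=True)
    # bcftools writes "." for missing values; coerce those to NaN.
    out["AF"] = pd.to_numeric(out["AF"], errors="coerce")
    return out

tidy = split_multiallelic(raw)
print(tidy)
```

The helper assumes every row lists as many AF values as Alt alleles; a row where that does not hold makes `explode` raise an error, which is a useful signal that the extract needs a closer look.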



4. Match .gff Data with Population Frequency Data
Use Python or another scripting language to annotate the .gff file with frequency data.

Approach:

Load both .gff and population frequency data into memory (e.g., as pandas dataframes).
Merge them on:
Chromosome
Position
Reference and alternate alleles (if available).
Insert frequency data into the .gff file.

In [None]:
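# Rebuild the parsed GFF table (gff_data is local to main() in the first cell)
gff = extract_snp_info(load_gff("example.gff"))

# Merge on Chromosome, Position, Ref, and Alt.
# Chr must have the same dtype and naming ("1" vs "chr1") in both tables.
gff["Chr"] = gff["Chr"].astype(str)
gnomad["Chr"] = gnomad["Chr"].astype(str)
annotated_gff = gff.merge(gnomad, left_on=["Chr", "Start", "Ref", "Alt"], 
                          right_on=["Chr", "Position", "Ref", "Alt"], how="left")

# Save the annotated table (tab-separated with a header row, so not strict GFF)
annotated_gff.to_csv("annotated_file.gff", sep="\t", index=False)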
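
A left merge fails silently when the keys do not line up, for example when one table writes chromosomes as "chr1" and the other as "1". The sketch below illustrates one way to harmonize the names and check how many SNPs actually received a frequency. The two small tables are made up, and `normalize_chr` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Made-up stand-ins for the parsed GFF table and the frequency table.
gff_demo = pd.DataFrame({
    "Chr": ["chr1", "chr1", "chrX"],
    "Start": [12345, 67890, 500],
    "Ref": ["A", "C", "G"],
    "Alt": ["G", "T", "A"],
})
freq_demo = pd.DataFrame({
    "Chr": ["1", "1"],
    "Position": [12345, 67890],
    "Ref": ["A", "C"],
    "Alt": ["G", "G"],   # second row has a different Alt on purpose
    "AF": [0.012, 0.35],
})

def normalize_chr(series):
    """Strip a leading 'chr' so that 'chr1' and '1' compare equal."""
    return series.astype(str).str.replace(r"^chr", "", regex=True)

gff_demo["Chr"] = normalize_chr(gff_demo["Chr"])
freq_demo["Chr"] = normalize_chr(freq_demo["Chr"])

merged = gff_demo.merge(
    freq_demo,
    left_on=["Chr", "Start", "Ref", "Alt"],
    right_on=["Chr", "Position", "Ref", "Alt"],
    how="left",
    indicator=True,   # adds a _merge column: 'both' or 'left_only'
)

matched = (merged["_merge"] == "both").sum()
print(f"{matched} of {len(merged)} SNPs matched a frequency record")
```

A match rate near zero usually points to a naming, dtype, or genome-build mismatch rather than to genuinely rare variants, so it is worth checking before interpreting the AF column.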