# Data Loading and Preprocessing 

- loads the GTEx gene expression data (v8) and the corresponding sample metadata
    - gene expression data should be a .gct.gz file
    - metadata should be a .txt file
        - both files should be saved in the same working directory as the analysis notebooks

- creates a single dataframe containing:
    - the 5,000 most variable genes, standardized
    - only samples from the 10 most common tissue types


author: @emilyekstrm <br>
11/29/25

In [1]:
#load in modules 
import gzip
import pandas as pd
import io
from sklearn.preprocessing import StandardScaler

### Load in GTEx Analysis v8 Data

In [2]:
# set gene expression file path
gct_filepath = 'GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz'

In [3]:
# read the gzipped gct file directly; pandas decompresses it on the fly
gct_df = pd.read_csv(gct_filepath, sep='\t', skiprows=2, compression='gzip') # skip the two gct header lines
#print(gct_df.head())
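
The `skiprows=2` argument reflects the GCT layout: a version line, then a line with the matrix dimensions, then a tab-separated table whose first two columns are `Name` and `Description`. A minimal sketch on a tiny in-memory GCT, assuming that layout; the gene and sample IDs are made up for illustration:

```python
import io
import pandas as pd

# Toy GCT text (hypothetical IDs): version line, dimensions line, then the table.
toy_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "G1\tgeneA\t1.0\t2.0\t3.0\n"
    "G2\tgeneB\t0.0\t0.5\t0.0\n"
)

# The second header line holds the number of genes and the number of samples.
n_genes, n_samples = (int(x) for x in toy_gct.splitlines()[1].split("\t"))

# Skip the two header lines so the third line becomes the column header.
toy_df = pd.read_csv(io.StringIO(toy_gct), sep="\t", skiprows=2)

# Rows are genes; columns are Name, Description, then one column per sample.
shape_matches_header = toy_df.shape == (n_genes, n_samples + 2)
```

Comparing the parsed shape against the dimensions line is a cheap way to catch a truncated download.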

In [4]:
# get the top 5,000 most variable genes
numeric_gct_df = gct_df.select_dtypes(include=['number'])  # select only numeric columns
gene_variances = numeric_gct_df.var(axis=1) # calculate variance for each gene
top_5000_genes = gene_variances.nlargest(5000).index # get indices of top 5000 variable genes
top_5000_gct_df = gct_df.loc[top_5000_genes] # subset original dataframe to top 5000 variant genes
#print(top_5000_gct_df.head())
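
The selection above can be checked on a small example. The sketch below uses a made-up four-gene table and keeps the two rows with the largest per-row variance; `toy_expr` and `k` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Toy expression table: one row per gene, one numeric column per sample.
toy_expr = pd.DataFrame(
    {
        "Name": ["G1", "G2", "G3", "G4"],
        "Description": ["a", "b", "c", "d"],
        "S1": [1.0, 5.0, 0.0, 2.0],
        "S2": [1.0, 0.0, 0.0, 4.0],
        "S3": [1.0, 10.0, 0.1, 6.0],
    }
)

k = 2  # number of genes to keep (5,000 in the notebook)

# Variance across samples for each gene; non-numeric columns are ignored.
row_var = toy_expr.select_dtypes(include="number").var(axis=1)

# nlargest returns the row labels sorted from highest to lowest variance.
top_rows = row_var.nlargest(k).index
top_expr = toy_expr.loc[top_rows]
```

Here `G2` and `G4` are kept, in that order, because `nlargest` sorts by decreasing variance. Note that `DataFrame.var` uses the sample variance (`ddof=1`) by default.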

### Load in Metadata

In [5]:
# set metadata file path
metadata_filepath = 'GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt'

In [6]:
# open and read metadata file
with open(metadata_filepath, 'r') as f:
    metadata = f.read()
    #print(metadata)

In [7]:
# convert metadata to pd dataframe
metadata_df = pd.read_csv(metadata_filepath, sep='\t')
#print(metadata_df.head())

In [13]:
# get the 10 tissues with the most samples in the metadata
tissue_counts = metadata_df['SMTS'].value_counts()
top_10_tissues = tissue_counts.head(10).index.tolist()
print(f"Top 10 tissues: {top_10_tissues}")
print("Sample counts for top 10 tissues:")
print(tissue_counts.head(10))

Top 10 tissues: ['Blood', 'Brain', 'Skin', 'Esophagus', 'Blood Vessel', 'Adipose Tissue', 'Heart', 'Muscle', 'Lung', 'Colon']
Sample counts for top 10 tissues:
SMTS
Blood             3480
Brain             3326
Skin              2014
Esophagus         1582
Blood Vessel      1473
Adipose Tissue    1327
Heart             1141
Muscle            1132
Lung               867
Colon              821
Name: count, dtype: int64
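
These counts come from the full annotation table, which can list samples that have no column in the expression matrix: Blood has 3,480 annotated samples here but only 929 in the final tissue distribution further down. A hedged sketch of counting tissues only over samples that do have expression data, using a made-up annotation table:

```python
import pandas as pd

# Toy annotation table (hypothetical sample IDs); S5 and S6 have no expression data.
toy_meta = pd.DataFrame(
    {
        "SAMPID": ["S1", "S2", "S3", "S4", "S5", "S6"],
        "SMTS": ["Blood", "Brain", "Brain", "Lung", "Blood", "Blood"],
    }
)
toy_expr_samples = ["S1", "S2", "S3", "S4"]  # sample columns of the expression matrix

# Counts over every annotated sample.
all_counts = toy_meta["SMTS"].value_counts()

# Counts restricted to samples that appear in the expression matrix.
has_expr = toy_meta["SAMPID"].isin(toy_expr_samples)
expr_counts = toy_meta.loc[has_expr, "SMTS"].value_counts()

top_all = all_counts.index[0]
top_with_expr = expr_counts.index[0]
```

In the toy data the most common tissue changes from Blood to Brain once the filter is applied. In this notebook the analogous mask would be `metadata_df['SAMPID'].isin(gct_df.columns)`; whether it changes the top-10 list for GTEx v8 is not shown by the outputs here.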


### Unified dataset with inputs & targets

In [14]:
# set inputs (genes) and targets (top 10 tissues only)
inputs = top_5000_gct_df.drop(columns=['Description']).set_index('Name').T  # transpose so samples are rows
targets = metadata_df.set_index('SAMPID').loc[inputs.index, 'SMTS']  # match targets with inputs

# Filter to only include samples from top 10 tissues
top_10_mask = targets.isin(top_10_tissues)
filtered_inputs = inputs[top_10_mask]
filtered_targets = targets[top_10_mask]

unified_df = filtered_inputs.copy() # make a copy of filtered inputs
unified_df['Tissue'] = filtered_targets # add tissue labels to inputs

print(f"Original dataset size: {inputs.shape[0]} samples")
print(f"Filtered dataset size: {unified_df.shape[0]} samples")
print("Final tissue distribution:")
print(unified_df['Tissue'].value_counts())

Original dataset size: 17382 samples
Filtered dataset size: 12385 samples
Final tissue distribution:
Tissue
Brain             2642
Skin              1809
Esophagus         1445
Blood Vessel      1335
Adipose Tissue    1204
Blood              929
Heart              861
Muscle             803
Colon              779
Lung               578
Name: count, dtype: int64
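
The `.loc[inputs.index, 'SMTS']` lookup raises a `KeyError` if any expression sample is missing from the metadata. That did not happen in this run, but a more defensive variant is to align on the shared sample IDs first. A minimal sketch with made-up data; all names below are illustrative:

```python
import pandas as pd

# Toy inputs: samples as rows, genes as columns.
toy_inputs = pd.DataFrame(
    {"G1": [1.0, 2.0, 3.0], "G2": [0.5, 0.1, 0.9]},
    index=["S1", "S2", "S3"],
)

# Toy labels: S2 is missing, and S9 has no expression data.
toy_labels = pd.Series({"S1": "Brain", "S3": "Lung", "S9": "Skin"}, name="SMTS")

# Keep only the sample IDs present in both objects.
shared = toy_inputs.index.intersection(toy_labels.index)
aligned_inputs = toy_inputs.loc[shared]
aligned_targets = toy_labels.loc[shared]

# Filter to the tissues of interest, then attach the labels as a column.
keep = aligned_targets.isin(["Brain"])
toy_unified = aligned_inputs.loc[keep].assign(Tissue=aligned_targets[keep])
```

Samples dropped by the intersection disappear silently, so it is worth printing `len(shared)` next to the original sample count.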


### Standardize input data

In [15]:
# standardize input gene expression data
scaler = StandardScaler()
input_features = unified_df.drop(columns=['Tissue']) # drop tissue labels for standardization
standardized_features = scaler.fit_transform(input_features) # fit and transform
standardized_df = pd.DataFrame(standardized_features, index=input_features.index, columns=input_features.columns) # convert back to dataframe
standardized_df['Tissue'] = unified_df['Tissue'].values # add tissue labels back

print(f"Final standardized dataset shape: {standardized_df.shape}")
print(f"Number of genes (features): {standardized_df.shape[1] - 1}")
print(f"Number of samples: {standardized_df.shape[0]}")

Final standardized dataset shape: (12385, 5001)
Number of genes (features): 5000
Number of samples: 12385
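
`StandardScaler` centers each gene (column) to mean 0 and scales it to unit variance. The cell above fits the scaler on all samples. If the data is later split into training and test sets, a common alternative is to fit on the training rows only and reuse those statistics for the test rows. A sketch on random toy data; the 80/20 split by position is an assumption for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy samples x genes matrix

# Hypothetical 80/20 split by row position.
X_train, X_test = X[:80], X[80:]

toy_scaler = StandardScaler()
X_train_std = toy_scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_std = toy_scaler.transform(X_test)        # reuse the training means and scales

# Training columns now have mean ~0 and (population) standard deviation ~1.
train_means = X_train_std.mean(axis=0)
train_stds = X_train_std.std(axis=0)
```

The test columns will generally not have mean exactly 0, which is expected: they are scaled with the training statistics.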


### Save dataframe as csv

In [16]:
# save standardized dataframe as csv
standardized_df.to_csv('standardized_gtex_data.csv', index_label='SampleID')
print("Saved standardized dataset to 'standardized_gtex_data.csv'")

Saved standardized dataset to 'standardized_gtex_data.csv'
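
Because the file is written with `index_label='SampleID'`, a downstream notebook can restore the sample index by passing that column name to `index_col`. A small round-trip sketch, using an in-memory buffer and made-up values in place of the real file:

```python
import io
import pandas as pd

# Toy stand-in for standardized_df (hypothetical values).
toy_std = pd.DataFrame(
    {"G1": [0.5, -0.5], "G2": [-1.0, 1.0], "Tissue": ["Brain", "Lung"]},
    index=["S1", "S2"],
)

# Write to a buffer exactly as the notebook writes to disk.
buffer = io.StringIO()
toy_std.to_csv(buffer, index_label="SampleID")
buffer.seek(0)

# Read it back, restoring the sample IDs as the index.
reloaded = pd.read_csv(buffer, index_col="SampleID")

# Split back into features and labels.
X_reloaded = reloaded.drop(columns=["Tissue"])
y_reloaded = reloaded["Tissue"]
```

For the real file, replace `buffer` with `'standardized_gtex_data.csv'`.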
