<a href="https://colab.research.google.com/github/christophergaughan/Dark_Transcripts/blob/main/RNA_BERT_Cancer_model_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Building a Multi-Cancer RNA Dataset for a BERT Model**
## **Objective**
Our goal is to construct a large and robust dataset consisting of RNA sequences from both cancerous and non-cancerous genes across multiple cancer subtypes. The dataset will be used to fine-tune a BERT model for classifying RNA sequences as either *Cancerous* or *Non-Cancerous*.

---

## **Key Steps**
1. **Cancer Subtype Selection**:
   - Focus on multiple cancer subtypes (e.g., breast cancer, lung cancer, colon cancer, prostate cancer) to ensure diversity in the dataset.

2. **Data Collection**:
   - **Cancerous Samples**:
     - Collect RNA-Seq data for cancerous tissues from *The Cancer Genome Atlas (TCGA)*.
   - **Healthy Samples**:
     - Collect RNA-Seq data for healthy tissues from the *Genotype-Tissue Expression (GTEx)* project.

3. **Data Preprocessing**:
   - Normalize raw RNA-Seq count data for both cancerous and healthy samples.
   - Perform differential expression analysis to identify cancer-specific genes.

4. **Gene Mapping and Validation**:
   - Map cancer-specific genes to biological pathways using the KEGG database.
   - Retrieve RNA sequences for both cancerous and healthy counterparts.

5. **Dataset Creation**:
   - Label RNA sequences from cancerous tissues as **"Cancerous"** and sequences from healthy tissues as **"Non-Cancerous"**.
   - Tokenize RNA sequences into k-mers for compatibility with the BERT model.

6. **Model Fine-Tuning**:
   - Fine-tune a pre-trained BERT model for RNA classification using the labeled dataset.
   - Evaluate model performance on an unseen test set.
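
The k-mer tokenization mentioned in step 5 can be sketched as follows. This is a minimal illustration rather than the tokenizer of any particular pre-trained model: the function name `kmer_tokenize`, the default `k=6`, and the `stride` parameter are assumptions made for the example.

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokenize("AUGGCUACG", k=3)
print(" ".join(tokens))
```

The space-joined string is the form a word-level BERT tokenizer would usually consume, with each k-mer acting as one "word".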

---

## **Outcome**
- A labeled, multi-cancer RNA dataset ready for machine learning.
- A fine-tuned BERT model capable of accurately classifying RNA sequences as cancerous or non-cancerous.

---

## **Applications**
- Enhance our understanding of cancer-specific RNA sequence patterns.
- Assist in predicting the cancerous nature of novel RNA sequences.


# **Anticipating Problems and Solutions**

| **Potential Problems**                              | **Possible Solutions**                                                                 |
|-----------------------------------------------------|---------------------------------------------------------------------------------------|
| **1. Limited availability of RNA-Seq data for some cancer subtypes** | Use publicly available datasets (e.g., TCGA, GEO, GTEx) and focus on well-studied subtypes initially. Expand to other subtypes as data becomes available. |
| **2. Imbalanced dataset (fewer cancerous genes compared to non-cancerous)** | Use techniques like oversampling cancerous data, undersampling non-cancerous data, or generating synthetic data with mutations. |
| **3. No direct healthy counterpart for some cancerous genes** | Use genes with similar biological functions, from the same pathway, or from paralogous families as counterparts. Validate biological relevance using KEGG or Reactome. |
| **4. Noisy or incomplete RNA-Seq data**             | Preprocess data carefully by filtering out low-quality sequences, normalizing counts, and ensuring uniform length for sequences. |
| **5. High dimensionality of RNA sequences**         | Use k-mer tokenization to reduce dimensionality and represent sequences in a format compatible with the BERT model. |
| **6. Difficulty in identifying biologically relevant genes** | Perform differential expression analysis with stringent criteria (e.g., p-value and fold change thresholds) and cross-reference with cancer gene databases (e.g., COSMIC, OncoKB). |
| **7. Computational challenges with large datasets** | Use cloud computing resources (e.g., Google Colab, AWS) and optimize code to handle large-scale processing efficiently. |
| **8. Overfitting on training data**                 | Split data into training, validation, and test sets. Use data augmentation and regularization techniques (e.g., dropout) during model training. |
| **9. Tokenization challenges for very long RNA sequences** | Truncate or split long sequences into manageable chunks. Train the BERT model with a max sequence length suitable for the majority of the data. |
| **10. Lack of robust evaluation metrics**           | Use a combination of metrics like accuracy, precision, recall, F1-score, and ROC-AUC to evaluate model performance comprehensively. |
| **11. Ethical and data privacy concerns**           | Ensure that all datasets used are open-access and comply with data usage guidelines. Avoid sharing sensitive patient information. |
| **12. Difficulties in reproducing results**         | Create a clear and modular pipeline using reproducible tools (e.g., Jupyter/Colab notebooks, version control with Git). Document all steps and dependencies. |
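
As an illustration of the oversampling idea in row 2, here is a small sketch using only the standard library. The row layout (dictionaries with a `label` key) and the function name `oversample_minority` are assumptions for the example, not part of the project code.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key="label", seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to make up the shortfall (zero extra rows for the largest class)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


rows = [{"sequence": "AUGC", "label": 1}] + [{"sequence": "GGCA", "label": 0}] * 3
balanced = oversample_minority(rows)
print(Counter(row["label"] for row in balanced))
```

Oversampling is normally applied to the training split only, after the train/validation/test split, so that duplicated rows cannot leak into the evaluation data.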

---

## **Summary**
By anticipating these challenges and preparing solutions, we can ensure a smoother execution of the project and minimize potential setbacks.


# Steps to Build gdc-client in Google Colab

In [None]:
!git clone https://github.com/NCI-GDC/gdc-client.git


In [None]:
%cd gdc-client


In [None]:
!pip install -r requirements.txt


In [None]:
%cd bin


In [None]:
!chmod +x package


In [None]:
!./package


In [None]:
%cd gdc-client/bin



In [None]:
!chmod +x package


In [None]:
!./package


In [None]:
%cd ..

In [None]:
!ls

In [None]:
!pip install virtualenv


In [None]:
%cd bin


In [None]:
!chmod +x package


In [None]:
!./package


In [None]:
!ls /content/gdc-client/bin


In [None]:
!/content/gdc-client/bin/gdc-client download -m /path/to/new_manifest.csv -d /path/to/destination



In [None]:
!ls /content/gdc-client/bin


In [None]:
!unzip /content/gdc-client/bin/gdc-client_2.3_Ubuntu_x64.zip -d /content/gdc-client/bin/


In [None]:
!ls /content/gdc-client/bin/


In [None]:
!chmod +x /content/gdc-client/bin/gdc-client


In [None]:
!/content/gdc-client/bin/gdc-client download -m /content/new_manifest.csv -d /content/downloaded_files/


In [None]:
%cd /content/gdc-client/bin


## Data Formatting
To train a BERT-based RNA model, the dataset should be formatted as follows:

**Structure of the Dataset**
Each row should have:

1. RNA Sequence: The RNA sequence (cancerous or non-cancerous).
2. Label: A binary label indicating whether the sequence is cancerous (`1`) or non-cancerous (`0`).
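
A minimal sketch of one such row is shown below. The `LABELS` mapping and the `make_row` helper are hypothetical, and the two sample-type strings are assumed values for the `sample_type` field; adjust them to whatever the metadata actually contains.

```python
# Hypothetical mapping from sample-type strings to binary labels
LABELS = {"Primary Tumor": 1, "Solid Tissue Normal": 0}


def make_row(sequence: str, sample_type: str) -> dict:
    """Return one dataset row: an RNA sequence and its binary label."""
    if sample_type not in LABELS:
        raise ValueError(f"Unrecognised sample type: {sample_type!r}")
    return {"sequence": sequence.upper(), "label": LABELS[sample_type]}


rows = [
    make_row("auggcuacg", "Primary Tumor"),
    make_row("AUGGCAACG", "Solid Tissue Normal"),
]
for row in rows:
    print(row)
```

Raising on an unknown sample type, rather than defaulting to `0`, keeps unlabeled or ambiguous samples from silently entering the non-cancerous class.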

**Transforming the DataFrame**

Assuming we have RNA sequence data files (`file_id`, `file_name`) and labels (`sample_type`), we can parse the RNA sequences from the files and structure the data:

In [None]:
import pandas as pd

# Paths to the uploaded files
gdc_manifest_path = '/content/drive/MyDrive/RNA_sequence_files/gdc_manifest.csv'
new_manifest_path = '/content/drive/MyDrive/RNA_sequence_files/new_manifest.csv'

# Read the contents of the files
try:
    gdc_manifest = pd.read_csv(gdc_manifest_path)
    new_manifest = pd.read_csv(new_manifest_path)

    # Display the first few rows of each manifest
    print("GDC Manifest (first few rows):")
    print(gdc_manifest.head(), "\n")

    print("New Manifest (first few rows):")
    print(new_manifest.head())

except Exception as e:
    print(f"Error reading files: {e}")


In [None]:
!/content/gdc-client/bin/gdc-client download -m /content/drive/MyDrive/RNA_sequence_files/gdc_manifest.csv -d /content/drive/MyDrive/RNA_sequence_files/downloaded_files


In [None]:
!ls

In [None]:
!pwd

In [None]:
!mv /content/drive/MyDrive/RNA_sequence_files/gdc_manifest.csv /content/gdc-client/bin/


In [None]:
!ls

In [None]:
!mv /content/drive/MyDrive/RNA_sequence_files/new_manifest.csv /content/gdc-client/bin/


In [None]:
!ls

In [None]:
!./gdc-client download -m gdc_manifest.csv -d /content/drive/MyDrive/RNA_sequence_files/downloaded_files


In [None]:
import pandas as pd

# Convert gdc_manifest.csv to gdc_manifest.txt
gdc_manifest_path = "/content/gdc-client/bin/gdc_manifest.csv"
gdc_manifest_txt_path = "/content/gdc-client/bin/gdc_manifest.txt"
gdc_manifest_df = pd.read_csv(gdc_manifest_path, header=None)  # No headers in the CSV
gdc_manifest_df.to_csv(gdc_manifest_txt_path, sep='\t', index=False, header=['id'])

# Convert new_manifest.csv to new_manifest.txt
new_manifest_path = "/content/gdc-client/bin/new_manifest.csv"
new_manifest_txt_path = "/content/gdc-client/bin/new_manifest.txt"
new_manifest_df = pd.read_csv(new_manifest_path, header=None)  # No headers in the CSV
new_manifest_df.to_csv(new_manifest_txt_path, sep='\t', index=False, header=['id', 'filename'])

print("Manifests converted to .txt format.")


In [None]:
!ls

The earlier download command was looking in the wrong directory. The manifest files have since been moved to `/content/gdc-client/bin`, and running the client from that directory appears to have been the key to making the download work.

In [None]:
!./gdc-client download -m gdc_manifest.txt -d /content/drive/MyDrive/RNA_sequence_files/downloaded_files


In [None]:
!./gdc-client download -m gdc_manifest.txt


In [None]:
!pwd

In [None]:
!ls

In [None]:
!file <filename>


In [None]:
!head -n 5 <filename>  # For text-based files like .tsv


In [None]:
!file gdc_manifest.txt


In [None]:
!head -n 5 gdc_manifest.txt


In [None]:
!ls | grep 1d4c26b3-b9cd-4c63-9bed-84906d01ed21


In [None]:
import pandas as pd

manifest = pd.read_csv('gdc_manifest.txt', sep='\t')  # Adjust separator if needed
print(manifest.head())


The `gdc_manifest.txt` file parsed successfully, and its contents align with the downloaded files in the working directory. The file contains a single column labeled `id`, listing the unique identifiers for the files.

**Next Steps**
* Validate All IDs: Ensure every ID in the manifest has a corresponding file in the directory:


In [None]:
import os

manifest_ids = set(manifest['id'])
downloaded_files = set(os.listdir('.'))  # List all files in the current directory

missing_ids = manifest_ids - downloaded_files
if missing_ids:
    print(f"Missing files for IDs: {missing_ids}")
else:
    print("All files from the manifest are present.")


#### Map Metadata
If additional metadata files (e.g., `new_manifest.csv`) link these IDs to cancerous/non-cancerous labels or other attributes, merge them with the current manifest for further analysis:


In [None]:
metadata = pd.read_csv('new_manifest.csv')  # Adjust path and format if needed
merged = manifest.merge(metadata, left_on='id', right_on='file_id', how='left')  # Use the appropriate column
print(merged.head())


#### Check the Column Names in `new_manifest.csv`
Run the following code to inspect the columns in `new_manifest.csv`:


In [None]:
metadata = pd.read_csv('new_manifest.csv')
print(metadata.columns)


#### Adjust the Merge Based on the Correct Column Names
Once the correct column name for the IDs in `new_manifest.csv` is known, update the `merge` statement. For example, if that column is named `id`, the merge becomes:


In [None]:
merged = manifest.merge(metadata, on='id', how='left')


Since the CSV files lack headers, we need to explicitly set headers or handle the lack of headers during reading and merging. Here's how to proceed:

#### Read the Files Without Headers
We can specify `header=None` while reading the files to treat all rows as data


In [None]:
# Read the manifest file without headers
manifest = pd.read_csv('gdc_manifest.txt', sep='\t', header=None, names=['id'])

# Read the metadata file without headers
metadata = pd.read_csv('new_manifest.csv', header=None, names=['file_id', 'file_name'])

# Inspect the data
print(manifest.head())
print(metadata.head())


In [None]:
# Drop the header-like row from the manifest DataFrame
manifest = manifest[manifest['id'] != 'id']

# Merge on the correct columns
merged = manifest.merge(metadata, left_on='id', right_on='file_id', how='left')

# Inspect the merged DataFrame
print(merged.head())


The `merged` DataFrame successfully aligns the IDs from the `manifest` with the corresponding files from the `metadata`. Here's what the result indicates:

#### Structure of the `merged` DataFrame:
* Columns:
    * `id`: The IDs from the `manifest`.
    * `file_id`: Corresponding IDs from the `metadata`, verifying the match.
    * `file_name`: The RNA sequence file names associated with the IDs.
### Next Steps:
1. Validate Data Completeness:

    * Check whether any IDs in the manifest were not matched in the metadata.
    * Use `merged` to identify rows where `file_name` is `NaN`.

In [None]:
missing_files = merged[merged['file_name'].isnull()]
print(f"Number of missing files: {missing_files.shape[0]}")
print(missing_files)


2. Filter by Desired Criteria:

    * Extract specific subsets of data, e.g., cancerous vs. non-cancerous samples, if such labels exist in metadata.
    * If no labels are present, we may need to cross-reference another dataset for annotations.
3. Prepare for Sequence Processing:

    * Confirm the physical existence of the files listed in `file_name`.
    * Load these files to inspect their contents and validate RNA sequence formats.
4. Save the Processed Data:

    * Save the `merged` DataFrame as a CSV file for future reference:

In [None]:
merged.to_csv('merged_manifest.csv', index=False)
print("Merged manifest saved to 'merged_manifest.csv'")
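
For step 3, loading a downloaded file to inspect its contents might look like the sketch below. It assumes the `.tsv` files are tab-separated tables that may begin with `#` comment lines; the helper name `read_tsv_rows` and the column names in the sample are placeholders, not the real file layout.

```python
import csv
import io
from itertools import islice


def read_tsv_rows(handle, limit=5):
    """Return up to `limit` data rows from a tab-separated file, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return list(islice(reader, limit))


# Stand-in for an open file handle; the column names are placeholders
sample = io.StringIO("# comment line\ngene_id\tvalue\nGENE_A\t12\nGENE_B\t0\n")
for row in read_tsv_rows(sample):
    print(row)
```

Printing the first few rows this way shows which columns a file actually contains before any sequence-processing code is written against it.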


In [None]:
!ls -la

1. Verify Physical Files:

    * Ensure all files listed in file_name are present in the directory where the files were downloaded.

In [None]:
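import os

# Directory where files are downloaded
download_dir = '/content/gdc-client/bin'

# List of expected files (ignore IDs that had no match in the metadata)
expected_files = set(merged['file_name'].dropna())

# Files present under the directory. gdc-client typically saves each file in a
# subdirectory named after its UUID, so walk the tree instead of listing only the top level.
downloaded_files = {
    name
    for _, _, names in os.walk(download_dir)
    for name in names
}

# Missing files
missing_files = expected_files - downloaded_files
if missing_files:
    print(f"Missing files: {len(missing_files)}")
    print(missing_files)
else:
    print("All files are present.")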


In [None]:
import pandas as pd

merged_manifest = pd.read_csv('merged_manifest.csv')
print(merged_manifest.head())


In [None]:
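import os

download_dir = '/content/gdc-client/bin'  # Adjust if needed
# Walk subdirectories too: gdc-client typically saves each file under a folder named after its UUID
downloaded_files = [
    name
    for _, _, names in os.walk(download_dir)
    for name in names
    if name.endswith('.tsv')
]
print(f"Downloaded files count: {len(downloaded_files)}")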


In [None]:
missing_manifest = merged_manifest[~merged_manifest['file_name'].isin(downloaded_files)]
missing_manifest[['id']].to_csv('missing_manifest.txt', index=False, header=False)


In [None]:
downloaded_files = set(f.rsplit('.', 1)[0] for f in os.listdir(download_dir))


In [None]:
with open('missing_manifest.txt', 'r') as f:
    print(f.readlines()[:10])  # Print first 10 lines for verification


In [None]:
!ls -l missing_manifest.txt


In [None]:
!ls -ld /content/gdc-client/bin


In [None]:
!./gdc-client download -m missing_manifest.txt -d /content/gdc-client/bin


In [None]:
!pwd

In [None]:
!head -n 10 missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!file missing_manifest.txt


In [None]:
!sed -i 's/[[:space:]]*$//' missing_manifest.txt


In [None]:
!dos2unix missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!sed -i 's/\r$//' missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
with open('missing_manifest.txt', 'r') as file:
    lines = file.readlines()

# Strip surrounding whitespace (including any carriage returns) and end each line with a single newline
lines = [line.strip() + '\n' for line in lines]

with open('missing_manifest.txt', 'w') as file:
    file.writelines(lines)


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!xxd missing_manifest.txt | head -n 10


In [None]:
!head -n 10 missing_manifest.txt


In [None]:
!./gdc-client download -m missing_manifest.txt -d /content/gdc-client/bin


The `ERROR: Invalid manifest` issue persists, suggesting there is still a problem with the format or structure of `missing_manifest.txt`. Let's take a more systematic approach to resolve this:

#### Troubleshooting Steps:
1. Manifest File Format

    * Check the header row. The `gdc_manifest.txt` file that appears to have downloaded successfully was written with an `id` header line, whereas `missing_manifest.txt` was written with `header=False`, so a missing header is a likely cause of the error.
    * Double-check that every line after the header contains a valid UUID (no extra spaces, symbols, or characters).
2. File Encoding

    * Verify that the file is plain `ASCII` or `UTF-8` without a `BOM` (Byte Order Mark). This can be done using the `file` command:



In [None]:
!file missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10
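
The same encoding checks can be made from Python, which is sometimes easier to read than `cat -A` output. This is a small sketch; the function name `inspect_manifest_bytes` and the set of checks are choices made for the example.

```python
def inspect_manifest_bytes(data: bytes) -> dict:
    """Report byte-level features that commonly upset strict text parsers."""
    return {
        "has_utf8_bom": data.startswith(b"\xef\xbb\xbf"),
        "has_carriage_returns": b"\r" in data,
        "is_ascii": all(byte < 128 for byte in data),
        "ends_with_newline": data.endswith(b"\n"),
    }


# In the notebook the bytes would come from: open("missing_manifest.txt", "rb").read()
example = b"\xef\xbb\xbf1d4c26b3-b9cd-4c63-9bed-84906d01ed21\r\n"
print(inspect_manifest_bytes(example))
```

Reading the file in binary mode matters here: text mode would silently translate `\r\n` line endings and hide the very characters being looked for.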


In [None]:
!diff missing_manifest.txt sample_manifest.txt


In [None]:
!ls -la

In [None]:
!tr -d '\r' < missing_manifest.txt > sanitized_manifest.txt


In [None]:
!mv sanitized_manifest.txt missing_manifest.txt


In [None]:
!file missing_manifest.txt


In [None]:
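import re

with open("missing_manifest.txt", "r") as f:
    # Ignore blank lines and an optional `id` header line
    lines = [line.strip() for line in f if line.strip() and line.strip() != "id"]

# Canonical 8-4-4-4-12 lowercase hexadecimal UUID layout
uuid_pattern = re.compile(r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')
valid = all(uuid_pattern.fullmatch(line) for line in lines)

print("All UUIDs valid:", valid)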
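
One difference between the two manifests in this notebook is the header row: `gdc_manifest.txt` was written with an `id` header, while `missing_manifest.txt` was written with `header=False`. The sketch below rewrites a list of IDs in the same one-column layout as `gdc_manifest.txt`. It assumes the client needs that `id` header, which is worth confirming against the GDC documentation; `write_manifest` is a hypothetical helper.

```python
import io


def write_manifest(ids, handle):
    """Write file IDs as a one-column manifest with an `id` header and Unix newlines."""
    handle.write("id\n")
    for file_id in ids:
        # Strip stray whitespace or carriage returns from each ID
        handle.write(f"{file_id.strip()}\n")


# Stand-in for: open("missing_manifest.txt", "w", newline="\n")
buffer = io.StringIO()
write_manifest(["1d4c26b3-b9cd-4c63-9bed-84906d01ed21 "], buffer)
print(repr(buffer.getvalue()))
```

Writing the file directly, instead of round-tripping through `to_csv`, makes the header and line endings explicit and easy to verify.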


In [None]:
!chmod +x ./gdc-client


In [None]:
!./gdc-client download -m missing_manifest.txt -d /content/gdc-client/bin


In [None]:
!zip -r session_backup.zip /content/
