<a href="https://colab.research.google.com/github/christophergaughan/Dark_Transcripts/blob/main/RNA_BERT_Cancer_model_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis
Neural networks, particularly BERT-like models, can effectively identify patterns in RNA sequences that differentiate cancerous and non-cancerous samples. By leveraging the sequential nature of RNA and the contextual learning capabilities of BERT, we hypothesize that a model trained on labeled RNA sequences can generalize to classify new sequences as "cancerous" or "non-cancerous," potentially uncovering biologically relevant patterns. Here we begin exploring this hypothesis using The Cancer Genome Atlas (TCGA) as the source of cancerous and non-cancerous samples.

---

## Strengths of the Approach

1. **Neural Networks for Pattern Detection**:
   - BERT models excel at understanding sequential data and finding patterns within text-like data.
   - Applying this to RNA sequences leverages the same strengths since RNA can be represented as strings of nucleotides (A, T/U, G, C).

2. **Structured Supervised Learning**:
   - Training on labeled data ("cancerous" vs. "non-cancerous") provides a clear learning objective.
   - A properly trained model can generalize to unseen sequences, identifying cancerous patterns based on learned data.

3. **Potential to Uncover Novel Patterns**:
   - BERT may identify subtle or unknown motifs associated with cancerous transformations, aiding biological understanding.

4. **Interpretability Possibilities**:
   - Techniques like attention visualizations can help interpret which sequence regions are most predictive of cancer.

---

## Critiques and Challenges

1. **Data Quality and Quantity**:
   - RNA sequences are complex, and cancerous vs. non-cancerous differences might be subtle.
   - A large, diverse, and well-labeled dataset is crucial to avoid overfitting.

2. **Biological Variability**:
   - Cancerous states may result from various factors (mutations, epigenetics, etc.) that are not always directly observable in RNA sequences.
   - Non-cancerous samples (e.g., "Solid Tissue Normal") may vary widely, introducing noise.

3. **Feature Representation**:
   - RNA sequences are inherently long. Feeding raw sequences into a BERT model could result in computational inefficiency.
   - Consider preprocessing to break sequences into biologically meaningful units (e.g., codons, motifs) or truncate/normalize sequence length.

4. **Model Choice**:
   - While BERT is powerful, DNA/RNA-specific architectures (e.g., DNABERT) may yield better results.
   - Pretraining on a large corpus of RNA sequences before fine-tuning might improve performance.

5. **Interpretability Trade-offs**:
   - Neural networks, particularly transformers, are often seen as black boxes.
   - For biological applications, interpretability is critical. Combining BERT with tools for explanation is important.
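
The feature-representation point above can be made concrete with overlapping k-mer tokens, the scheme used by models such as DNABERT. This is a minimal sketch, not part of the notebook's pipeline: the function name `kmer_tokenize` and the parameters `k` and `stride` are assumptions chosen for illustration.

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a nucleotide string into overlapping k-mers."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokenize("AUGGCUACG", k=3)
print(tokens)
```

A stride of 1 keeps every reading frame at the cost of longer token lists; a stride equal to `k` gives non-overlapping, codon-like units.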

---

## Suggestions for Refinement

### 1. **Preprocessing**:
   - Ensure uniformity in RNA sequences (e.g., consistent capitalization, removal of ambiguous nucleotides like "N").
   - Normalize sequences to manage varying lengths.
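
A minimal sketch of these preprocessing steps follows. The helper `clean_sequence` and its `max_len` parameter are hypothetical, and mapping `T` to `U` is an assumption about how the input is encoded. Note that dropping ambiguous bases shifts the positions of everything after them.

```python
VALID_BASES = set("ACGU")


def clean_sequence(sequence, max_len=None):
    """Uppercase, map T to U, drop anything outside A/C/G/U, optionally truncate."""
    seq = sequence.upper().replace("T", "U")
    # Ambiguous symbols such as N are removed, not replaced.
    seq = "".join(base for base in seq if base in VALID_BASES)
    return seq if max_len is None else seq[:max_len]


print(clean_sequence("augNNgcTa", max_len=5))
```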

### 2. **Experiment with Architectures**:
   - Compare BERT models with alternatives like CNNs, RNNs, or DNA/RNA-specific models (e.g., DNABERT). ProtBERT, by contrast, is trained on protein sequences, so it would apply only to translated sequences.
   - Explore pre-trained models for DNA/RNA sequences.

### 3. **Data Augmentation**:
   - Introduce small variations to non-cancerous sequences to simulate natural variability and improve robustness.
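
One simple form of such variation is a random point substitution. The sketch below uses a hypothetical helper, `mutate_sequence`, and assumes that a few substitutions leave the label unchanged, which may not hold biologically and should be checked before use.

```python
import random


def mutate_sequence(sequence, n_mutations=1, alphabet="ACGU", seed=None):
    """Return a copy of `sequence` with up to `n_mutations` point substitutions."""
    rng = random.Random(seed)
    bases = list(sequence)
    n = min(n_mutations, len(bases))
    # Sample distinct positions so each mutation changes a different base.
    for pos in rng.sample(range(len(bases)), n):
        choices = [b for b in alphabet if b != bases[pos]]
        bases[pos] = rng.choice(choices)
    return "".join(bases)


original = "AUGGCUACGUAG"
augmented = mutate_sequence(original, n_mutations=2, seed=0)
print(original, augmented)
```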

### 4. **Benchmark Against Baseline Models**:
   - Train simpler models (e.g., logistic regression, random forests) on extracted features to establish a baseline.
   - Validating with baseline models ensures the data has predictive signals.

### 5. **Plan for Evaluation**:
   - Use cross-validation to ensure the model generalizes well.
   - Consider additional metrics (e.g., precision, recall, ROC-AUC) for imbalanced datasets.
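
For reference, precision, recall, and F1 can be computed directly from label lists. This is a plain-Python sketch with a hypothetical helper name, `binary_metrics`; in practice a library implementation would normally be used.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```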

### 6. **Interpretability Framework**:
   - Use attention weights or tools like SHAP (SHapley Additive exPlanations) to interpret predictions.
   - Highlight predictive sequence regions and validate findings with biological experts.

---

## Final Thoughts
This approach is scientifically sound and forward-looking. If executed carefully, it has the potential to reveal both computational and biological insights. The challenges primarily lie in data preprocessing, architecture selection, and model interpretability, all of which can be managed with thoughtful experimentation and analysis.




# **Anticipating Problems and Solutions**

| **Potential Problems**                              | **Possible Solutions**                                                                 |
|-----------------------------------------------------|---------------------------------------------------------------------------------------|
| **1. Limited availability of RNA-Seq data for some cancer subtypes** | Use publicly available datasets (e.g., TCGA, GEO, GTEx) and focus on well-studied subtypes initially. Expand to other subtypes as data becomes available. |
| **2. Imbalanced dataset (unequal numbers of cancerous and non-cancerous samples)** | Oversample the minority class, undersample the majority class, or generate synthetic minority-class data with mutations. |
| **3. No direct healthy counterpart for some cancerous genes** | Use genes with similar biological functions, from the same pathway, or from paralogous families as counterparts. Validate biological relevance using KEGG or Reactome. |
| **4. Noisy or incomplete RNA-Seq data**             | Preprocess data carefully by filtering out low-quality sequences, normalizing counts, and ensuring uniform length for sequences. |
| **5. High dimensionality of RNA sequences**         | Use k-mer tokenization to reduce dimensionality and represent sequences in a format compatible with the BERT model. |
| **6. Difficulty in identifying biologically relevant genes** | Perform differential expression analysis with stringent criteria (e.g., p-value and fold change thresholds) and cross-reference with cancer gene databases (e.g., COSMIC, OncoKB). |
| **7. Computational challenges with large datasets** | Use cloud computing resources (e.g., Google Colab, AWS) and optimize code to handle large-scale processing efficiently. |
| **8. Overfitting on training data**                 | Split data into training, validation, and test sets. Use data augmentation and regularization techniques (e.g., dropout) during model training. |
| **9. Tokenization challenges for very long RNA sequences** | Truncate or split long sequences into manageable chunks. Train the BERT model with a max sequence length suitable for the majority of the data. |
| **10. Lack of robust evaluation metrics**           | Use a combination of metrics like accuracy, precision, recall, F1-score, and ROC-AUC to evaluate model performance comprehensively. |
| **11. Ethical and data privacy concerns**           | Ensure that all datasets used are open-access and comply with data usage guidelines. Avoid sharing sensitive patient information. |
| **12. Difficulties in reproducing results**         | Create a clear and modular pipeline using reproducible tools (e.g., Jupyter/Colab notebooks, version control with Git). Document all steps and dependencies. |
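
For problem 9, one option is to split long sequences into overlapping windows. The sketch below is an illustration under stated assumptions: `chunk_sequence`, `max_len`, and `overlap` are hypothetical names, and the default sizes are placeholders to be tuned to the data.

```python
def chunk_sequence(sequence, max_len=512, overlap=64):
    """Split a sequence into windows of at most `max_len` sharing `overlap` characters."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + max_len])
        # Stop once a window reaches the end of the sequence.
        if start + max_len >= len(sequence):
            break
    return chunks


print(chunk_sequence("AUGGCUACGU", max_len=4, overlap=2))
```

Each chunk would inherit the label of its parent sequence, so chunks from one sample should stay in the same train/validation/test split to avoid leakage.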

---

## **Summary**
By anticipating these challenges and preparing solutions, we can ensure a smoother execution of the project and minimize potential setbacks.


# Steps to Build gdc-client in Google Colab

In [None]:
!git clone https://github.com/NCI-GDC/gdc-client.git


In [None]:
%cd gdc-client


In [None]:
!pip install -r requirements.txt


In [None]:
%cd bin


In [None]:
!chmod +x package


In [None]:
!./package


In [None]:
%cd ..

In [None]:
!ls

In [None]:
!pip install virtualenv


In [None]:
%cd bin


In [None]:
!chmod +x package


In [None]:
!./package


In [None]:
!ls /content/gdc-client/bin


In [None]:
!/content/gdc-client/bin/gdc-client download -m /path/to/new_manifest.csv -d /path/to/destination



In [None]:
!ls /content/gdc-client/bin


In [None]:
!unzip /content/gdc-client/bin/gdc-client_2.3_Ubuntu_x64.zip -d /content/gdc-client/bin/


In [None]:
!ls /content/gdc-client/bin/


In [None]:
!chmod +x /content/gdc-client/bin/gdc-client


In [None]:
!/content/gdc-client/bin/gdc-client download -m /content/new_manifest.csv -d /content/downloaded_files/


In [None]:
%cd /content/gdc-client/bin


## Data Formatting
To train a BERT-based RNA model, the dataset should be formatted as follows:

**Structure of the Dataset**
Each row should have:

1. RNA Sequence: The RNA sequence (cancerous or non-cancerous).
2. Label: A binary label indicating whether the sequence is cancerous (`1`) or non-cancerous (`0`).
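
A minimal sketch of this row structure follows. The grouping of sample types mirrors the categories used in the API cells later in this notebook; the helper name `label_from_sample_type` is hypothetical and the sequences are made-up placeholders.

```python
CANCEROUS_TYPES = {"Primary Tumor", "Metastatic"}
NON_CANCEROUS_TYPES = {"Solid Tissue Normal", "Blood Derived Normal"}


def label_from_sample_type(sample_type):
    """Return 1 for tumor sample types, 0 for normal ones, None otherwise."""
    if sample_type in CANCEROUS_TYPES:
        return 1
    if sample_type in NON_CANCEROUS_TYPES:
        return 0
    return None  # unknown types are left unlabeled, not guessed


# Placeholder sequences, for illustration only.
records = [
    ("AUGGCUACG", "Primary Tumor"),
    ("AUGGCAACG", "Solid Tissue Normal"),
]
rows = [
    {"sequence": seq, "label": label_from_sample_type(sample_type)}
    for seq, sample_type in records
]
print(rows)
```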

**Transforming the DataFrame**

Assuming we have RNA sequence data files (`file_id`, `file_name`) and labels (`sample_type`), we can parse the RNA sequences from the files and structure the data:

In [None]:
import pandas as pd

# Paths to the uploaded files
gdc_manifest_path = '/content/drive/MyDrive/RNA_sequence_files/gdc_manifest.csv'
new_manifest_path = '/content/drive/MyDrive/RNA_sequence_files/new_manifest.csv'

# Read the contents of the files
try:
    gdc_manifest = pd.read_csv(gdc_manifest_path)
    new_manifest = pd.read_csv(new_manifest_path)

    # Display the first few rows of each manifest
    print("GDC Manifest (first few rows):")
    print(gdc_manifest.head(), "\n")

    print("New Manifest (first few rows):")
    print(new_manifest.head())

except Exception as e:
    print(f"Error reading files: {e}")


In [None]:
!/content/gdc-client/bin/gdc-client download -m /content/drive/MyDrive/RNA_sequence_files/gdc_manifest.csv -d /content/drive/MyDrive/RNA_sequence_files/downloaded_files


In [None]:
!ls

In [None]:
!pwd

In [None]:
!mv /content/drive/MyDrive/RNA_sequence_files/gdc_manifest.csv /content/gdc-client/bin/


In [None]:
!ls

In [None]:
!mv /content/drive/MyDrive/RNA_sequence_files/new_manifest.csv /content/gdc-client/bin/


In [None]:
!ls

In [None]:
!./gdc-client download -m gdc_manifest.csv -d /content/drive/MyDrive/RNA_sequence_files/downloaded_files


In [None]:
import pandas as pd

# Convert gdc_manifest.csv to gdc_manifest.txt
gdc_manifest_path = "/content/gdc-client/bin/gdc_manifest.csv"
gdc_manifest_txt_path = "/content/gdc-client/bin/gdc_manifest.txt"
gdc_manifest_df = pd.read_csv(gdc_manifest_path, header=None)  # No headers in the CSV
gdc_manifest_df.to_csv(gdc_manifest_txt_path, sep='\t', index=False, header=['id'])

# Convert new_manifest.csv to new_manifest.txt
new_manifest_path = "/content/gdc-client/bin/new_manifest.csv"
new_manifest_txt_path = "/content/gdc-client/bin/new_manifest.txt"
new_manifest_df = pd.read_csv(new_manifest_path, header=None)  # No headers in the CSV
new_manifest_df.to_csv(new_manifest_txt_path, sep='\t', index=False, header=['id', 'filename'])

print("Manifests converted to .txt format.")


In [None]:
!ls

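The earlier download command failed because it looked for the manifest in the wrong directory. The manifest files have since been moved to `/content/gdc-client/bin`, and running the client from that directory appears to have been the key to making the download work.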

In [None]:
!./gdc-client download -m gdc_manifest.txt -d /content/drive/MyDrive/RNA_sequence_files/downloaded_files


In [None]:
!./gdc-client download -m gdc_manifest.txt


In [None]:
!pwd

In [None]:
!ls

In [None]:
!file <filename>


In [None]:
!head -n 5 <filename>  # For text-based files like .tsv


In [None]:
!file gdc_manifest.txt


In [None]:
!head -n 5 gdc_manifest.txt


In [None]:
!ls | grep 1d4c26b3-b9cd-4c63-9bed-84906d01ed21


In [None]:
import pandas as pd

manifest = pd.read_csv('gdc_manifest.txt', sep='\t')  # Adjust separator if needed
print(manifest.head())


The `gdc_manifest.txt` file parsed successfully, and its contents match the downloaded files in the working directory. The file contains a single column, `id`, listing the unique identifier of each file.

**Next Steps**
* Validate All IDs: Ensure every ID in the manifest has a corresponding file in the directory:


In [None]:
import os

manifest_ids = set(manifest['id'])
downloaded_files = set(os.listdir('.'))  # List all files in the current directory

missing_ids = manifest_ids - downloaded_files
if missing_ids:
    print(f"Missing files for IDs: {missing_ids}")
else:
    print("All files from the manifest are present.")


#### Map Metadata
If additional metadata files (e.g., `new_manifest.csv`) link these IDs to cancerous/non-cancerous labels or other attributes, merge them with the current manifest for further analysis:


In [None]:
metadata = pd.read_csv('new_manifest.csv')  # Adjust path and format if needed
merged = manifest.merge(metadata, left_on='id', right_on='file_id', how='left')  # Use the appropriate column
print(merged.head())


#### Check the Column Names in `new_manifest.csv`
Run the following code to inspect the columns in `new_manifest.csv`:


In [None]:
metadata = pd.read_csv('new_manifest.csv')
print(metadata.columns)


#### Adjust the Merge Based on the Correct Column Names
Once you know the correct column name for the IDs in `new_manifest.csv`, update the `merge` statement. For example, if the ID column in `new_manifest.csv` is named `id`:


In [None]:
merged = manifest.merge(metadata, on='id', how='left')


Since the CSV files lack headers, we need to explicitly set headers or handle the lack of headers during reading and merging. Here's how to proceed:

#### Read the Files Without Headers
We can pass `header=None` (together with explicit `names`) when reading the files so that every row is treated as data.


In [None]:
# Read the manifest file without headers
manifest = pd.read_csv('gdc_manifest.txt', sep='\t', header=None, names=['id'])

# Read the metadata file without headers
metadata = pd.read_csv('new_manifest.csv', header=None, names=['file_id', 'file_name'])

# Inspect the data
print(manifest.head())
print(metadata.head())


In [None]:
# Drop the header-like row from the manifest DataFrame
manifest = manifest[manifest['id'] != 'id']

# Merge on the correct columns
merged = manifest.merge(metadata, left_on='id', right_on='file_id', how='left')

# Inspect the merged DataFrame
print(merged.head())


The `merged` DataFrame successfully aligns the IDs from the `manifest` with the corresponding files from the `metadata`. Here's what the result indicates:

#### Structure of the `merged` DataFrame:
* Columns:
    * `id`: The IDs from the `manifest`.
    * `file_id`: Corresponding IDs from the `metadata`, verifying the match.
    * `file_name`: The RNA sequence file names associated with the IDs.

### Next Steps:
1. Validate Data Completeness:

    * Check whether any IDs in the manifest were not matched in the metadata.
    * Use `merged` to identify rows where `file_name` is `NaN`.

In [None]:
missing_files = merged[merged['file_name'].isnull()]
print(f"Number of missing files: {missing_files.shape[0]}")
print(missing_files)


2. Filter by Desired Criteria:

    * Extract specific subsets of data, e.g., cancerous vs. non-cancerous samples, if such labels exist in metadata.
    * If no labels are present, we may need to cross-reference another dataset for annotations.
3. Prepare for Sequence Processing:

    * Confirm the physical existence of the files listed in file_name.
    * Load these files to inspect their contents and validate RNA sequence formats.
4. Save the Processed Data:

    * Save the merged DataFrame as a CSV file for future reference

In [None]:
merged.to_csv('merged_manifest.csv', index=False)
print("Merged manifest saved to 'merged_manifest.csv'")


In [None]:
!ls -la

1. Verify Physical Files:

    * Ensure all files listed in file_name are present in the directory where the files were downloaded.

In [None]:
import os

# Directory where files are downloaded
download_dir = '/content/gdc-client/bin'

# List of expected files
expected_files = set(merged['file_name'])

# Files present in the directory
downloaded_files = set(os.listdir(download_dir))

# Missing files
missing_files = expected_files - downloaded_files
if missing_files:
    print(f"Missing files: {len(missing_files)}")
    print(missing_files)
else:
    print("All files are present.")
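
`gdc-client` typically writes each download into a subdirectory named after the file's UUID, so a flat `os.listdir` on the download directory may return UUID folders and not the data files themselves. The sketch below walks the whole tree instead. `find_downloaded_files` is a hypothetical helper, and the directory layout is an assumption worth confirming with `!ls`.

```python
import os


def find_downloaded_files(download_dir, suffix=".tsv"):
    """Map file name -> full path for every file under `download_dir` ending in `suffix`."""
    found = {}
    for root, _dirs, files in os.walk(download_dir):
        for name in files:
            if name.endswith(suffix):
                found[name] = os.path.join(root, name)
    return found


# Example with the path used in this notebook (left commented out):
# found = find_downloaded_files('/content/gdc-client/bin')
# missing = set(merged['file_name']) - set(found)
print(len(find_downloaded_files(".")))
```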


In [None]:
import pandas as pd

merged_manifest = pd.read_csv('merged_manifest.csv')
print(merged_manifest.head())


In [None]:
import os

download_dir = '/content/gdc-client/bin'  # Adjust if needed
downloaded_files = [f for f in os.listdir(download_dir) if f.endswith('.tsv')]
print(f"Downloaded files count: {len(downloaded_files)}")


In [None]:
missing_manifest = merged_manifest[~merged_manifest['file_name'].isin(downloaded_files)]
missing_manifest[['id']].to_csv('missing_manifest.txt', index=False, header=False)


In [None]:
downloaded_files = set(f.rsplit('.', 1)[0] for f in os.listdir(download_dir))


In [None]:
with open('missing_manifest.txt', 'r') as f:
    print(f.readlines()[:10])  # Print first 10 lines for verification


In [None]:
!ls -l missing_manifest.txt


In [None]:
!ls -ld /content/gdc-client/bin


In [None]:
!./gdc-client download -m missing_manifest.txt -d /content/gdc-client/bin


In [None]:
!pwd

In [None]:
!head -n 10 missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!file missing_manifest.txt


In [None]:
!sed -i 's/[[:space:]]*$//' missing_manifest.txt


In [None]:
!dos2unix missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!sed -i 's/\r$//' missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
with open('missing_manifest.txt', 'r') as file:
    lines = file.readlines()

# Remove any carriage return characters
lines = [line.strip() + '\n' for line in lines]

with open('missing_manifest.txt', 'w') as file:
    file.writelines(lines)


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!xxd missing_manifest.txt | head -n 10


In [None]:
!head -n 10 missing_manifest.txt


In [None]:
!./gdc-client download -m missing_manifest.txt -d /content/gdc-client/bin


The `ERROR: Invalid manifest` issue persists, suggesting there is still some problem with the missing_manifest.txt file format or structure. Let's take another systematic approach to resolve this:

#### Troubleshooting Steps:
1. Manifest File Format

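    * Compare the file with `gdc_manifest.txt`, which downloaded successfully. That file is tab-separated and begins with an `id` header row, whereas `missing_manifest.txt` was written with `header=False`, so the missing header row is a likely cause of the error.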
    * Double-check that each line contains a valid UUID (no extra spaces, symbols, or characters).
2. File Encoding

    * Verify the encoding of the file to ensure it is in plain `ASCII` or `UTF-8` without a `BOM` (Byte Order Mark). This can be done using the file command:



In [None]:
!file missing_manifest.txt


In [None]:
!cat -A missing_manifest.txt | head -n 10


In [None]:
!diff missing_manifest.txt sample_manifest.txt


In [None]:
!ls -la

In [None]:
!tr -d '\r' < missing_manifest.txt > sanitized_manifest.txt


In [None]:
!mv sanitized_manifest.txt missing_manifest.txt


In [None]:
!file missing_manifest.txt


In [None]:
import re

with open("missing_manifest.txt", "r") as f:
    lines = f.readlines()

# Canonical UUID layout: 8-4-4-4-12 lowercase hex digits
uuid_pattern = re.compile(r'^[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12}$')
valid = all(uuid_pattern.match(line.strip()) for line in lines)

print("All UUIDs valid:", valid)


In [None]:
!chmod +x ./gdc-client


In [None]:
!./gdc-client download -m missing_manifest.txt -d /content/gdc-client/bin
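
If `Invalid manifest` persists, it may help to mirror the layout of `gdc_manifest.txt`, which was written earlier as a tab-separated file with an `id` header row. This is a minimal sketch, assuming the client only needs the `id` column. `write_id_manifest` and the output file name are hypothetical.

```python
import csv


def write_id_manifest(ids, path):
    """Write a tab-separated manifest with a single `id` header column."""
    with open(path, "w", newline="") as handle:
        # Unix line endings avoid the stray carriage returns chased above.
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id"])
        for file_id in ids:
            writer.writerow([file_id.strip()])


write_id_manifest(
    ["183f73e2-9ddd-4fee-aab3-1b850855a0fd"],
    "missing_manifest_with_header.txt",
)
```

The resulting file could then be passed to `./gdc-client download -m missing_manifest_with_header.txt`.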


In [None]:
!zip -r session_backup.zip /content/


In [None]:
!curl 'https://api.gdc.cancer.gov/files/183f73e2-9ddd-4fee-aab3-1b850855a0fd?pretty=true'


In [None]:
!curl 'https://api.gdc.cancer.gov/files/00252d7f-e222-462f-badc-b97e8dce2021?pretty=true'


In [None]:
!curl 'https://api.gdc.cancer.gov/files/_mapping?pretty=true'


In [None]:
!curl 'https://api.gdc.cancer.gov/cases/_mapping?pretty=true'


In [None]:
!curl 'https://api.gdc.cancer.gov/files/183f73e2-9ddd-4fee-aab3-1b850855a0fd?fields=cases.samples.sample_type,cases.samples.submitter_id&pretty=true'


In [None]:
!curl --get 'https://api.gdc.cancer.gov/files' --data-urlencode 'filters={"op":"in","content":{"field":"files.file_id","value":["183f73e2-9ddd-4fee-aab3-1b850855a0fd","<another_uuid>"]}}' --data-urlencode 'fields=cases.samples.sample_type,cases.samples.submitter_id' --data-urlencode 'pretty=true'


In [None]:
!curl --get 'https://api.gdc.cancer.gov/files' --data-urlencode 'filters={"op":"and","content":[{"op":"in","content":{"field":"cases.samples.sample_type","value":["Primary Tumor","Metastatic"]}}]}' --data-urlencode 'fields=cases.samples.sample_type,cases.samples.submitter_id' --data-urlencode 'pretty=true'
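
When a GET query like the ones above is built from Python, the filter JSON needs percent-encoding, because braces, brackets, quotes, and spaces are not safe in a raw query string. A minimal sketch using only the standard library follows. `build_files_url` is a hypothetical helper, and no request is sent.

```python
import json
from urllib.parse import urlencode


def build_files_url(filters, fields, base_url="https://api.gdc.cancer.gov/files"):
    """Return a GET URL with the filter JSON and field list percent-encoded."""
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "pretty": "true",
    }
    return f"{base_url}?{urlencode(params)}"


filters = {
    "op": "in",
    "content": {
        "field": "cases.samples.sample_type",
        "value": ["Primary Tumor", "Metastatic"],
    },
}
url = build_files_url(
    filters,
    ["cases.samples.sample_type", "cases.samples.submitter_id"],
)
print(url)
```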


In [None]:
import requests
import json

# GDC API endpoint
BASE_URL = "https://api.gdc.cancer.gov/files"

def query_sample_type(uuids):
    """
    Query the GDC API for sample types for a given list of UUIDs.
    """
    # Prepare filters JSON
    filters = {
        "op": "in",
        "content": {
            "field": "files.file_id",
            "value": uuids
        }
    }

    # API request payload
    payload = {
        "filters": filters,
        "fields": "cases.samples.sample_type,cases.samples.submitter_id",
        "format": "JSON",
        "size": len(uuids)
    }

    # Make POST request
    response = requests.post(BASE_URL, json=payload)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None

def categorize_samples(api_response):
    """
    Categorize samples into Cancerous and Non-Cancerous.
    """
    cancerous = []
    non_cancerous = []

    for file in api_response.get('data', {}).get('hits', []):
        for sample in file.get('cases', [{}])[0].get('samples', []):
            sample_type = sample.get('sample_type')
            submitter_id = sample.get('submitter_id')

            if sample_type in ["Primary Tumor", "Metastatic"]:
                cancerous.append({"submitter_id": submitter_id, "type": sample_type})
            elif sample_type in ["Solid Tissue Normal", "Blood Derived Normal"]:
                non_cancerous.append({"submitter_id": submitter_id, "type": sample_type})

    return cancerous, non_cancerous

# Example usage
uuid_list = ["183f73e2-9ddd-4fee-aab3-1b850855a0fd", "another_uuid_here"]
response = query_sample_type(uuid_list)

if response:
    cancerous, non_cancerous = categorize_samples(response)
    print("Cancerous Samples:", cancerous)
    print("Non-Cancerous Samples:", non_cancerous)


In [None]:
import requests
import json

# File containing UUIDs
UUID_FILE = "missing_manifest.txt"

# GDC API endpoint
BASE_URL = "https://api.gdc.cancer.gov/files"

def load_uuids(file_path):
    """
    Load UUIDs from a file into a list.
    """
    with open(file_path, "r") as file:
        return [line.strip() for line in file.readlines() if line.strip()]

def query_sample_type(uuids):
    """
    Query the GDC API for sample types for a given list of UUIDs.
    """
    # Prepare filters JSON
    filters = {
        "op": "in",
        "content": {
            "field": "files.file_id",
            "value": uuids
        }
    }

    # API request payload
    payload = {
        "filters": filters,
        "fields": "cases.samples.sample_type,cases.samples.submitter_id",
        "format": "JSON",
        "size": len(uuids)
    }

    # Make POST request
    response = requests.post(BASE_URL, json=payload)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None

def categorize_samples(api_response):
    """
    Categorize samples into Cancerous and Non-Cancerous.
    """
    cancerous = []
    non_cancerous = []

    for file in api_response.get('data', {}).get('hits', []):
        for sample in file.get('cases', [{}])[0].get('samples', []):
            sample_type = sample.get('sample_type')
            submitter_id = sample.get('submitter_id')

            if sample_type in ["Primary Tumor", "Metastatic"]:
                cancerous.append({"submitter_id": submitter_id, "type": sample_type})
            elif sample_type in ["Solid Tissue Normal", "Blood Derived Normal"]:
                non_cancerous.append({"submitter_id": submitter_id, "type": sample_type})

    return cancerous, non_cancerous

def batch_process_uuids(uuid_list, batch_size=100):
    """
    Process UUIDs in batches to query the API.
    """
    cancerous_all = []
    non_cancerous_all = []

    for i in range(0, len(uuid_list), batch_size):
        batch = uuid_list[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1}...")
        response = query_sample_type(batch)
        if response:
            cancerous, non_cancerous = categorize_samples(response)
            cancerous_all.extend(cancerous)
            non_cancerous_all.extend(non_cancerous)

    return cancerous_all, non_cancerous_all

# Main logic
if __name__ == "__main__":
    uuids = load_uuids(UUID_FILE)
    cancerous, non_cancerous = batch_process_uuids(uuids, batch_size=50)

    print("Cancerous Samples:", cancerous)
    print("Non-Cancerous Samples:", non_cancerous)


In [None]:
import csv

def save_to_csv(file_name, data, fieldnames):
    """
    Save data to a CSV file.
    """
    with open(file_name, mode="w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

# Save cancerous samples
save_to_csv("cancerous_samples.csv", cancerous, ["submitter_id", "type"])

# Save non-cancerous samples
save_to_csv("non_cancerous_samples.csv", non_cancerous, ["submitter_id", "type"])

print("Files saved: cancerous_samples.csv, non_cancerous_samples.csv")


In [None]:
!curl 'https://api.gdc.cancer.gov/history/00252d7f-e222-462f-badc-b97e8dce2021'


In [None]:
!curl --get 'https://api.gdc.cancer.gov/files' --data-urlencode 'filters={"op":"=","content":{"field":"files.data_type","value":"Gene Expression Quantification"}}' --data-urlencode 'fields=file_id,analysis.workflow_type' --data-urlencode 'pretty=true'


In [None]:
%%bash
# Replace the submitter IDs below with your own sample IDs.
curl --request POST --header "Content-Type: application/json" --data '
{
  "filters": {
    "op": "and",
    "content": [
      {
        "op": "in",
        "content": {
          "field": "cases.submitter_id",
          "value": ["SC108221_merged", "SC080780_merged"]
        }
      },
      {
        "op": "=",
        "content": {
          "field": "files.data_type",
          "value": "Gene Expression Quantification"
        }
      }
    ]
  },
  "fields": "file_id,file_name,cases.submitter_id,data_category,data_type,platform,experimental_strategy",
  "format": "JSON",
  "size": "1000"
}' 'https://api.gdc.cancer.gov/files'


In [None]:
%%bash
curl --request POST --header "Content-Type: application/json" --data '
{
  "filters": {
    "op": "and",
    "content": [
      {
        "op": "=",
        "content": {
          "field": "files.data_type",
          "value": "Gene Expression Quantification"
        }
      },
      {
        "op": "=",
        "content": {
          "field": "files.experimental_strategy",
          "value": "RNA-Seq"
        }
      }
    ]
  },
  "fields": "file_id,file_name,cases.submitter_id,data_category,data_type,platform,experimental_strategy",
  "format": "JSON",
  "size": "1000"
}' 'https://api.gdc.cancer.gov/files'


In [None]:
!ls

In [None]:
import os
import requests
import json
import pandas as pd

# List all files in the directory
directory_path = '.'  # Adjust path as needed
all_files = os.listdir(directory_path)

# Filter UUIDs (assumes UUIDs are filenames without extensions)
uuids = [f.split('.')[0] for f in all_files if len(f.split('.')[0]) == 36]

print(f"Total UUIDs found: {len(uuids)}")


In [None]:
# GDC API URL
api_url = "https://api.gdc.cancer.gov/files"
headers = {"Content-Type": "application/json"}

# Function to query API in batches
def query_gdc(uuids, batch_size=100):
    results = []
    for i in range(0, len(uuids), batch_size):
        batch = uuids[i:i + batch_size]
        payload = {
            "filters": {
                "op": "in",
                "content": {
                    "field": "files.file_id",
                    "value": batch
                }
            },
            "fields": "file_id,file_name,cases.submitter_id,data_category,data_type,platform",
            "format": "JSON",
            "size": batch_size
        }

        response = requests.post(api_url, headers=headers, data=json.dumps(payload))
        if response.status_code == 200:
            results.extend(response.json().get('data', {}).get('hits', []))
        else:
            print(f"Error: {response.status_code}, {response.text}")
    return results

# Query the API
rna_files = query_gdc(uuids)

# Convert to DataFrame for easy manipulation
rna_files_df = pd.DataFrame(rna_files)
rna_files_df.to_csv("rna_sequence_files.csv", index=False)
print("RNA sequence files saved to rna_sequence_files.csv")


In [None]:
# Load manifest UUIDs
manifest = pd.read_csv("gdc_manifest.csv")
uuids = manifest['id'].tolist()  # Assuming 'id' column contains UUIDs


In [None]:
# Load manifest and inspect columns
manifest = pd.read_csv("gdc_manifest.csv")
print(manifest.columns)  # Check available columns

# Assuming the column with UUIDs is named 'uuid' (adjust based on output)
uuids = manifest['uuid'].tolist()  # Replace 'uuid' with the actual column name
print(f"Total UUIDs found in manifest: {len(uuids)}")


In [None]:
manifest = pd.read_csv("gdc_manifest.csv")
print(manifest.head())  # Display the first few rows of the manifest


In [None]:
print("Columns:", manifest.columns)
print("Shape:", manifest.shape)


In [None]:
# Reload manifest with no header
manifest = pd.read_csv("gdc_manifest.csv", header=None)

# Assign a default column name
manifest.columns = ['uuid']

# Extract UUIDs as a list
uuids = manifest['uuid'].tolist()

print(f"Total UUIDs found in manifest: {len(uuids)}")


In [None]:
# Query the GDC API in batches
def query_gdc(uuids, batch_size=100):
    results = []
    for i in range(0, len(uuids), batch_size):
        batch = uuids[i:i + batch_size]
        payload = {
            "filters": {
                "op": "in",
                "content": {
                    "field": "files.file_id",
                    "value": batch
                }
            },
            "fields": "file_id,file_name,cases.submitter_id,data_category,data_type,platform",
            "format": "JSON",
            "size": batch_size
        }

        response = requests.post("https://api.gdc.cancer.gov/files", headers={"Content-Type": "application/json"}, data=json.dumps(payload))
        if response.status_code == 200:
            results.extend(response.json().get('data', {}).get('hits', []))
        else:
            print(f"Error: {response.status_code}, {response.text}")
    return results

# Call the function and save the results
rna_files = query_gdc(uuids)
rna_files_df = pd.DataFrame(rna_files)
rna_files_df.to_csv("rna_sequence_files.csv", index=False)
print("RNA sequence files saved to rna_sequence_files.csv")


In [None]:
import pandas as pd

# Load the saved RNA sequence files metadata
rna_sequence_files_path = "rna_sequence_files.csv"
rna_sequence_files_df = pd.read_csv(rna_sequence_files_path)

# Display the first few rows of the dataframe to verify contents
rna_sequence_files_df.head()


In [None]:
# Load cancerous and non-cancerous datasets
cancerous_samples = pd.read_csv("cancerous_samples.csv")
non_cancerous_samples = pd.read_csv("non_cancerous_samples.csv")

# Add a classification column to each dataset
cancerous_samples['classification'] = 'cancerous'
non_cancerous_samples['classification'] = 'non-cancerous'

# Combine the two datasets
classification_data = pd.concat([cancerous_samples, non_cancerous_samples], ignore_index=True)

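# Merge with RNA sequence metadata using 'submitter_id'
import ast  # literal_eval parses the stringified list safely, unlike eval

# After the CSV round trip, 'cases' holds a stringified list of dicts; extract submitter_id from it
rna_sequence_files_df['submitter_id'] = rna_sequence_files_df['cases'].apply(lambda x: ast.literal_eval(x)[0]['submitter_id'])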
linked_data = rna_sequence_files_df.merge(classification_data, on='submitter_id', how='left')

# Save and display the linked dataset
linked_data.to_csv("linked_rna_sequence_files.csv", index=False)
print("Linked data saved to linked_rna_sequence_files.csv")
linked_data.head()
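
Because `to_csv` stores the nested `cases` field as text, it has to be parsed back into Python objects after `read_csv`. The following is a minimal sketch of a defensive parser, assuming each value looks like a stringified list of dictionaries. The helper name `first_submitter_id` and the sample values are hypothetical and only serve the illustration.

```python
import ast

import pandas as pd


def first_submitter_id(cases_str):
    """Return the first submitter_id from a stringified list of case dicts, or None."""
    if not isinstance(cases_str, str):
        return None  # NaN or otherwise missing value
    try:
        cases = ast.literal_eval(cases_str)  # accepts Python literals only
    except (ValueError, SyntaxError):
        return None  # malformed text
    if isinstance(cases, list) and cases and isinstance(cases[0], dict):
        return cases[0].get("submitter_id")
    return None


# Hypothetical rows: one well-formed, one malformed, one missing
demo = pd.DataFrame({"cases": ["[{'submitter_id': 'TCGA-AA-0001'}]", "[broken", None]})
demo["submitter_id"] = demo["cases"].apply(first_submitter_id)
print(demo)
```

Returning `None` for malformed rows keeps the merge from failing midway; those rows can then be counted and inspected separately.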


In [None]:
# Unique submitter IDs in each dataset
rna_submitter_ids = set(rna_sequence_files_df['submitter_id'])
classification_submitter_ids = set(classification_data['submitter_id'])

# Find unmatched IDs
unmatched_ids = rna_submitter_ids - classification_submitter_ids
print(f"Unmatched submitter IDs: {unmatched_ids}")
print(f"Total unmatched IDs: {len(unmatched_ids)}")
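
The set difference above lists the unmatched IDs. A complementary check is to let the merge itself report which rows found a partner. This is a small sketch on made-up data using the `indicator` option of `DataFrame.merge`; the frame names `rna` and `labels` and the IDs are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the file metadata and the label table
rna = pd.DataFrame({"submitter_id": ["TCGA-AA-0001", "TCGA-AA-0002", "tcga-aa-0003"]})
labels = pd.DataFrame({
    "submitter_id": ["TCGA-AA-0001", "TCGA-AA-0003"],
    "classification": ["cancerous", "non-cancerous"],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
audit = rna.merge(labels, on="submitter_id", how="left", indicator=True)

unmatched = audit.loc[audit["_merge"] == "left_only", "submitter_id"].tolist()
print(audit["_merge"].value_counts())
print(unmatched)
```

In this toy data the lowercase ID fails to match, which is the kind of mismatch that case normalization is meant to address.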


In [None]:
# Normalize case for submitter IDs
rna_sequence_files_df['submitter_id'] = rna_sequence_files_df['submitter_id'].str.upper()
classification_data['submitter_id'] = classification_data['submitter_id'].str.upper()

# Retry merge
linked_data = rna_sequence_files_df.merge(classification_data, on='submitter_id', how='left')

# Save and inspect the results
linked_data.to_csv("linked_rna_sequence_files_updated.csv", index=False)
print("Updated linked data saved to linked_rna_sequence_files_updated.csv")
linked_data.head()


In [None]:
# Combine cancerous and non-cancerous datasets
classification_data = pd.DataFrame(cancerous + non_cancerous)

# Merge the RNA sequence metadata with classification data
linked_data = rna_sequence_files_df.merge(classification_data, on="submitter_id", how="left")

# Save the enriched dataset
linked_data.to_csv("linked_rna_sequence_classification.csv", index=False)
print("Enriched dataset saved to linked_rna_sequence_classification.csv")

# Display the first few rows for verification
print(linked_data.head())


In [None]:
unmatched_ids = linked_data[linked_data['type'].isna()]['submitter_id'].unique()
print(f"Unmatched Submitter IDs: {unmatched_ids}")
print(f"Total Unmatched Submitter IDs: {len(unmatched_ids)}")


In [None]:
linked_data['submitter_id_normalized'] = linked_data['submitter_id'].str.split('-').str[:3].str.join('-')
classification_data['submitter_id_normalized'] = classification_data['submitter_id'].str.split('-').str[:3].str.join('-')

# Retry merge with normalized IDs
linked_data = linked_data.merge(
    classification_data[['submitter_id_normalized', 'type', 'classification']],
    on='submitter_id_normalized',
    how='left'
)
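
Truncating IDs to their first three fields can make several label rows share one key, and a left merge then multiplies the matching file rows. The sketch below, on made-up data, collapses the label table to one row per normalized ID and asks pandas to enforce that with `validate="m:1"`. The helper name `patient_barcode` and all IDs are hypothetical.

```python
import pandas as pd


def patient_barcode(submitter_id):
    """Keep the first three hyphen-separated fields of an ID (hypothetical helper)."""
    return "-".join(str(submitter_id).split("-")[:3])


files = pd.DataFrame({"submitter_id": ["TCGA-AA-0001-01A", "TCGA-AA-0002-11A"]})
labels = pd.DataFrame({
    "submitter_id": ["TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002"],
    "classification": ["cancerous", "cancerous", "non-cancerous"],
})

files["submitter_id_normalized"] = files["submitter_id"].map(patient_barcode)
labels["submitter_id_normalized"] = labels["submitter_id"].map(patient_barcode)

# One row per normalized ID; drop_duplicates keeps the first label it sees
labels_unique = labels.drop_duplicates(subset="submitter_id_normalized")[
    ["submitter_id_normalized", "classification"]
]

# validate="m:1" raises a MergeError if the right-hand keys are not unique
merged = files.merge(labels_unique, on="submitter_id_normalized", how="left", validate="m:1")
print(merged)
```

Note that `drop_duplicates` silently picks the first label when a case has conflicting labels, so conflicts should be resolved explicitly before relying on this step.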


In [None]:
print(classification_data.columns)


In [None]:
if 'classification' not in classification_data.columns:
    classification_data['classification'] = classification_data['type'].apply(
        lambda x: 'cancerous' if x in ['Primary Tumor', 'Metastatic'] else 'non-cancerous'
    )


In [None]:
linked_data['submitter_id_normalized'] = linked_data['submitter_id'].str.split('-').str[:3].str.join('-')
classification_data['submitter_id_normalized'] = classification_data['submitter_id'].str.split('-').str[:3].str.join('-')


In [None]:
linked_data = linked_data.merge(
    classification_data[['submitter_id_normalized', 'type', 'classification']],
    on='submitter_id_normalized',
    how='left'
)


In [None]:
print(linked_data[['submitter_id', 'type', 'classification']].head())
linked_data.to_csv("linked_rna_sequence_classification_updated.csv", index=False)


In [None]:
print(linked_data.columns)


In [None]:
# Add the 'classification' column if missing in classification_data
if 'classification' not in classification_data.columns:
    classification_data['classification'] = classification_data['type'].apply(
        lambda x: 'cancerous' if x in ['Primary Tumor', 'Metastatic'] else 'non-cancerous'
    )

# Retry merging with the normalized IDs
linked_data = linked_data.merge(
    classification_data[['submitter_id_normalized', 'type', 'classification']],
    on='submitter_id_normalized',
    how='left'
)


In [None]:
print(linked_data[['submitter_id', 'type', 'classification']].head())


In [None]:
linked_data.to_csv("linked_rna_sequence_classification_updated.csv", index=False)
print("Updated data saved successfully!")


In [None]:
# Drop redundant columns before merge to avoid conflicts
columns_to_drop = ['type_x', 'type_y', 'classification_x', 'classification_y']
linked_data = linked_data.drop(columns=[col for col in columns_to_drop if col in linked_data.columns])

# Add 'classification' if missing in classification_data
if 'classification' not in classification_data.columns:
    classification_data['classification'] = classification_data['type'].apply(
        lambda x: 'cancerous' if x in ['Primary Tumor', 'Metastatic'] else 'non-cancerous'
    )

# Retry merging with normalized IDs
linked_data = linked_data.merge(
    classification_data[['submitter_id_normalized', 'type', 'classification']],
    on='submitter_id_normalized',
    how='left',
    suffixes=('', '_new')  # Avoid conflicts by using unique suffixes
)

# Display the desired columns
print(linked_data[['submitter_id', 'type', 'classification']].head())

# Save the updated dataset
linked_data.to_csv("linked_rna_sequence_classification_updated.csv", index=False)
print("Updated data saved successfully!")


In [None]:
# Remove exact duplicates
linked_data = linked_data.drop_duplicates()

# Resolve conflicting classifications by prioritizing 'cancerous'
def resolve_classification(group):
    if 'cancerous' in group['classification'].values:
        return 'cancerous'
    return 'non-cancerous'

# Aggregate classifications
resolved_classifications = (
    linked_data.groupby('submitter_id')['classification']
    .apply(resolve_classification)
    .reset_index()
)

# Merge resolved classifications back into the dataset
linked_data = linked_data.drop(columns=['classification']).merge(
    resolved_classifications, on='submitter_id', how='left'
)

# Drop any remaining duplicates
linked_data = linked_data.drop_duplicates()

# Display the cleaned data
print(linked_data[['submitter_id', 'type_x', 'classification']].head())

# Save the updated dataset
linked_data.to_csv("linked_rna_sequence_classification_cleaned.csv", index=False)
print("Cleaned data saved successfully!")


In [None]:
# Inspect columns in linked_data
print(f"Columns in linked_data: {linked_data.columns}")

# Verify the available type columns and classification column
type_column = 'type_new' if 'type_new' in linked_data.columns else 'type'
classification_column = 'classification' if 'classification' in linked_data.columns else None

print(f"Using type column: {type_column}")
print(f"Using classification column: {classification_column}")

# Check and clean classification data
if classification_column:
    # Remove duplicates
    linked_data = linked_data.drop_duplicates()

    # Resolve conflicting classifications
    def resolve_classification(group):
        if 'cancerous' in group.dropna().values:
            return 'cancerous'
        return 'non-cancerous'

    resolved_classifications = (
        linked_data.groupby('submitter_id')[classification_column]
        .apply(resolve_classification)
        .reset_index()
    )

    # Merge resolved classifications back
    linked_data = linked_data.drop(columns=[classification_column], errors='ignore').merge(
        resolved_classifications, on='submitter_id', how='left'
    )

    # Drop remaining duplicates
    linked_data = linked_data.drop_duplicates()

# Display the cleaned data
print(linked_data[['submitter_id', type_column, 'classification']].head())

# Save cleaned data
linked_data.to_csv("linked_rna_sequence_classification_cleaned.csv", index=False)
print("Cleaned data saved successfully!")


In [None]:
def resolve_classification(group):
    if 'cancerous' in group['classification'].values:
        return 'cancerous'
    return 'non-cancerous'


In [None]:
# Resolve classification explicitly by type_new
def resolve_classification(group):
    if 'Primary Tumor' in group['type_new'].values or 'Metastatic' in group['type_new'].values:
        return 'cancerous'
    return 'non-cancerous'

# Group by submitter_id and apply the resolve logic
linked_data['classification'] = linked_data.groupby('submitter_id').apply(
    lambda group: resolve_classification(group)
).reset_index(level=0, drop=True)


In [None]:
# Resolve classification explicitly by type_new
def resolve_classification(group):
    if 'Primary Tumor' in group['type_new'].values or 'Metastatic' in group['type_new'].values:
        return 'cancerous'
    return 'non-cancerous'

# Group by submitter_id and apply the resolve logic
linked_data['classification'] = linked_data.groupby('submitter_id', group_keys=False).apply(
    lambda group: resolve_classification(group)
)


In [None]:
# Resolve classification explicitly by type_new
def resolve_classification(group):
    if 'Primary Tumor' in group['type_new'].values or 'Metastatic' in group['type_new'].values:
        return 'cancerous'
    return 'non-cancerous'

# Group by submitter_id and resolve classifications
linked_data['classification'] = linked_data.groupby('submitter_id', group_keys=False).apply(
    lambda group: pd.Series({'classification': resolve_classification(group)})
).reset_index(drop=True)['classification']


In [None]:
# Resolve classification explicitly by type_new
def resolve_classification(group):
    if 'Primary Tumor' in group['type_new'].values or 'Metastatic' in group['type_new'].values:
        return 'cancerous'
    return 'non-cancerous'

# Resolve one label per submitter_id, then map it back onto every row of that case.
# (Assigning the groupby result directly would align on submitter_id, not on the row index.)
per_case = linked_data.groupby('submitter_id').apply(resolve_classification)
linked_data['classification'] = linked_data['submitter_id'].map(per_case)

# Reset the index so later positional operations line up
linked_data.reset_index(drop=True, inplace=True)

print(linked_data[['submitter_id', 'type_new', 'classification']].head())
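
A per-group label has to be broadcast back to the rows of each group before it can be stored as a column. One way to do this, sketched here on made-up data, is `groupby(...).transform`, which returns a result aligned with the original index. The toy frame `df` and its IDs are hypothetical; the column names follow the notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "submitter_id": ["A", "A", "B", "C"],
    "type_new": ["Solid Tissue Normal", "Primary Tumor", "Solid Tissue Normal", "Metastatic"],
})

# Row-level flag: does this row describe a tumor sample?
is_tumor = df["type_new"].isin(["Primary Tumor", "Metastatic"])

# transform("any") yields one value per original row, so the index already matches df
case_has_tumor = is_tumor.groupby(df["submitter_id"]).transform("any")

df["classification"] = np.where(case_has_tumor, "cancerous", "non-cancerous")
print(df)
```

Here case `A` is labeled cancerous on both rows because at least one of its samples is a tumor, which mirrors the "prioritize cancerous" rule used earlier.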


In [None]:
# Verify the columns
print(linked_data.columns)

# Use 'type_new' as the classification column
classification_column = 'type_new'  # Update this based on the correct column name

# Ensure the classification column exists
if classification_column not in linked_data.columns:
    raise ValueError(f"Column '{classification_column}' not found in the DataFrame.")

# Function to resolve classification
def resolve_classification(group):
    if group[classification_column].str.contains("Primary Tumor|Metastatic", na=False).any():
        return "cancerous"
    elif group[classification_column].str.contains("Normal", na=False).any():
        return "non-cancerous"
    return "unknown"

# Apply classification resolution
linked_data['classification'] = linked_data.groupby('submitter_id', group_keys=False).apply(
    lambda group: pd.Series(resolve_classification(group), index=group.index)
)

# Save and inspect the cleaned data
linked_data.to_csv("linked_rna_sequence_classification_cleaned.csv", index=False)
print(linked_data[['submitter_id', classification_column, 'classification']].drop_duplicates().head())
print("Cleaned data saved successfully!")


In [None]:
# Check the columns in the DataFrame
print(linked_data.columns)

# Ensure the 'type_new' column exists
classification_column = 'type_new'
if classification_column not in linked_data.columns:
    raise ValueError(f"Column '{classification_column}' not found in the DataFrame.")

# Define classification logic based on 'type_new', handling NaN values
def classify(row):
    type_value = row[classification_column]
    if isinstance(type_value, str) and "Normal" in type_value:
        return "non-cancerous"
    elif isinstance(type_value, str):
        return "cancerous"
    return "unknown"  # Default classification for NaN or other unexpected values

# Apply classification logic to each row
linked_data['classification'] = linked_data.apply(classify, axis=1)

# Save and inspect the cleaned data
linked_data.to_csv("linked_rna_sequence_classification_cleaned.csv", index=False)
print(linked_data[['submitter_id', classification_column, 'classification']].drop_duplicates().head())
print("Cleaned data saved successfully!")


### Validation Code

In [None]:
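# Check if the classification rule is correctly applied across the DataFrame.
# Rows with a missing 'type_new' are expected to be "unknown", so they are checked separately.
has_type = linked_data['type_new'].notna()
is_normal = linked_data['type_new'].str.contains("Normal", na=False)
incorrect_classifications = linked_data[
    (is_normal & (linked_data['classification'] != "non-cancerous")) |
    (has_type & ~is_normal & (linked_data['classification'] != "cancerous")) |
    (~has_type & (linked_data['classification'] != "unknown"))
]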

# Display rows with incorrect classifications
if not incorrect_classifications.empty:
    print("Incorrect classifications found:")
    print(incorrect_classifications)
else:
    print("All rows are correctly classified based on the rule.")

# Display a sample of the cleaned DataFrame for visual inspection.
# (ace_tools is specific to the ChatGPT code sandbox, so the notebook's own display is used.)
from IPython.display import display
display(linked_data.head())

# Optional: Save the DataFrame again for reference
linked_data.to_csv("final_linked_rna_sequence_classification.csv", index=False)
print("Final cleaned and validated data saved successfully!")
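
A cross-tabulation gives a compact view of how each sample type was labeled and makes unexpected combinations easy to spot. The sketch below uses a made-up frame; the `<missing>` placeholder is an arbitrary choice for rows without a type.

```python
import pandas as pd

df = pd.DataFrame({
    "type_new": ["Primary Tumor", "Solid Tissue Normal", None, "Metastatic"],
    "classification": ["cancerous", "non-cancerous", "unknown", "cancerous"],
})

# Missing types get their own row instead of being dropped from the table
summary = pd.crosstab(df["type_new"].fillna("<missing>"), df["classification"])
print(summary)
```

Each row of the table should have counts in exactly one column if the rule was applied consistently.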


In [None]:
!pip install ace-tools

In [None]:
# Filter rows with missing type_new values
missing_type_new = linked_data[linked_data['type_new'].isna()]
print(f"Rows with missing 'type_new': {len(missing_type_new)}")

# Optionally inspect the rows with missing 'type_new'
print(missing_type_new[['submitter_id', 'type_new', 'classification']].head())

# Apply classification only to rows with valid 'type_new'
def classify(row):
    if pd.notna(row['type_new']) and "Normal" in row['type_new']:
        return "non-cancerous"
    elif pd.notna(row['type_new']):
        return "cancerous"
    return "unknown"  # Handle cases where type_new is NaN

linked_data['classification'] = linked_data.apply(classify, axis=1)

# Check for any remaining unknown classifications
unknown_classifications = linked_data[linked_data['classification'] == "unknown"]
print(f"Rows with 'unknown' classification after applying logic: {len(unknown_classifications)}")

# Save the final DataFrame
linked_data.to_csv("final_linked_rna_sequence_classification_cleaned.csv", index=False)
print("Final cleaned and validated data saved successfully!")

# Display the final DataFrame.
# (ace_tools is specific to the ChatGPT code sandbox, so the notebook's own display is used.)
from IPython.display import display
display(linked_data.head())


In [None]:
import pandas as pd

# Assuming the DataFrame `linked_data` is already loaded or created earlier

# Filter rows with missing type_new values
missing_type_new = linked_data[linked_data['type_new'].isna()]
print(f"Rows with missing 'type_new': {len(missing_type_new)}")

# Apply classification only to rows where type_new is present
def classify(row):
    if pd.notna(row['type_new']) and "Normal" in row['type_new']:
        return "non-cancerous"
    elif pd.notna(row['type_new']):
        return "cancerous"
    return "unknown"  # Assign "unknown" for rows with NaN in 'type_new'

linked_data['classification'] = linked_data.apply(classify, axis=1)

# Check for any remaining unknown classifications
unknown_classifications = linked_data[linked_data['classification'] == "unknown"]
print(f"Rows with 'unknown' classification after applying logic: {len(unknown_classifications)}")

# Save the final cleaned data
linked_data.to_csv("final_linked_rna_sequence_classification_cleaned.csv", index=False)
print("Final cleaned and validated data saved successfully!")

# Display the DataFrame for review.
# (ace_tools is specific to the ChatGPT code sandbox, so the notebook's own display is used.)
from IPython.display import display
display(linked_data.head())


In [None]:
import pandas as pd

# Save the cleaned DataFrame for download
file_path = "Linked_RNA_Sequence_Classification_Cleaned.csv"
linked_data.to_csv(file_path, index=False)

# Notify the user the file is ready for download
file_path


In [None]:
# Save the cleaned DataFrame locally in a robust way
local_file_path = 'Linked_RNA_Sequence_Classification_Cleaned.csv'
linked_data.to_csv(local_file_path, index=False)

# Provide the path to the user
local_file_path


In [None]:
!pwd

In [None]:
!ls -ltr

Check for duplicates

In [None]:
import pandas as pd

# Load the provided CSV file
file_path = 'linked_rna_sequence_classification_cleaned.csv'
data = pd.read_csv(file_path)

# Check for duplicates in the dataset
duplicates = data[data.duplicated()]

# Display information about duplicates
duplicates_count = duplicates.shape[0]

duplicates_count


#### Reading the `file_id` column from the CSV file and fetching the corresponding data through the GDC API avoids hard-coding identifiers. The fetched content is written to a new `sequence` column in the same DataFrame, and the updated CSV is saved.

#### Steps:
1. Load the CSV file: Read the CSV file to extract file_ids dynamically.

2. Query the GDC API: Use the GDC API to fetch RNA sequence data for each file_id.

3. Add Sequences to the DataFrame: Store the RNA sequences in a new column sequence.

4. Save the Updated Data: Save the DataFrame with the new column to a new CSV file.

In [None]:
import pandas as pd
import requests

# Path to the cleaned CSV file
csv_file_path = "/content/gdc-client/bin/linked_rna_sequence_classification_cleaned_cg_edited.csv"

# Load the CSV file
data = pd.read_csv(csv_file_path)

# GDC API URL
BASE_URL = "https://api.gdc.cancer.gov/data/"

# Function to fetch a short preview of the data for a given file_id
def fetch_rna_sequence(file_id):
    try:
        url = f"{BASE_URL}{file_id}"
        # stream=True defers the body download, so only the preview bytes are read
        with requests.get(url, stream=True, timeout=60) as response:
            if response.status_code == 200:
                # Assumes the response is plain text; a compressed or binary file
                # will not yield a readable sequence here
                preview = next(response.iter_content(chunk_size=100), b"")
                return preview.decode("utf-8", errors="replace")  # First ~100 bytes (adjust as needed)
            print(f"Failed to fetch data for file_id {file_id}: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error fetching data for file_id {file_id}: {e}")
        return None

# Add a new column to store RNA sequences
data['sequence'] = data['file_id'].apply(fetch_rna_sequence)

# Save the updated DataFrame to a new CSV file
output_csv_path = "/content/linked_rna_sequences_with_data.csv"
data.to_csv(output_csv_path, index=False)

print(f"Updated CSV with RNA sequences saved to: {output_csv_path}")


#### The script below downloads the FASTQ file for each file ID in the cleaned dataset and writes the local file paths to a CSV file.

#### This script assumes:

1. The GDC API is queried for the FASTQ files.
2. The cleaned dataset is already prepared and available in CSV format.
3. Each FASTQ file is downloaded to local storage, and its path is added to the dataset in a new `fastq_path` column.



In [None]:
import os
import requests
import pandas as pd

# Input and output file paths
input_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_cleaned_cg_edited.csv"
output_csv = "/content/linked_rna_sequences_with_fastq.csv"
gdc_api_url = "https://api.gdc.cancer.gov/data"

# Function to download FASTQ sequences for a given file ID
def download_fastq(file_id):
    try:
        response = requests.get(f"{gdc_api_url}/{file_id}", stream=True)
        response.raise_for_status()

        # Save the FASTQ file locally for verification (optional)
        file_path = f"/content/fastq_files/{file_id}.fastq"
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)

        return file_path  # Return the saved file path
    except requests.exceptions.RequestException as e:
        print(f"Error downloading FASTQ for file ID {file_id}: {e}")
        return None

# Load the cleaned dataset
df = pd.read_csv(input_csv)

# Ensure a directory exists for storing FASTQ files
os.makedirs("/content/fastq_files", exist_ok=True)

# Initialize a new column for FASTQ file paths
df["fastq_path"] = None

# Iterate through the dataset and retrieve FASTQ sequences
for index, row in df.iterrows():
    file_id = row["file_id"]
    print(f"Processing file ID: {file_id}")

    # Download the FASTQ file and store the path
    fastq_path = download_fastq(file_id)
    if fastq_path:
        df.at[index, "fastq_path"] = fastq_path

# Save the updated dataset with FASTQ file paths
df.to_csv(output_csv, index=False)

print(f"FASTQ sequences extraction complete. Results saved to: {output_csv}")
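
The download step saves whatever bytes the endpoint returns under a `.fastq` name, so the content is not guaranteed to be plain-text FASTQ. A quick look at the leading bytes can tell a gzip stream from text before any parsing is attempted. This is a minimal sketch; the helper names `sniff_format` and `sniff_file` are hypothetical, and only two signatures are checked.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_format(first_bytes):
    """Guess a file's format from its leading bytes (hypothetical helper)."""
    if first_bytes[:2] == GZIP_MAGIC:
        return "gzip"
    if first_bytes[:1] == b"@":
        return "fastq-like"  # FASTQ headers start with '@'
    return "unknown"


def sniff_file(path):
    """Read only the first two bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(2))


plain = b"@read1\nACGT\n+\nIIII\n"
print(sniff_format(plain))
print(sniff_format(gzip.compress(plain)))
print(sniff_format(b"<html>"))
```

Files reported as `gzip` would need to be opened with `gzip.open` rather than `open`, and `unknown` files are worth inspecting by hand before they are passed to a parser.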


In [None]:
!pip install biopython

In [None]:
import pandas as pd
from Bio import SeqIO
import os

# Input CSV file path
input_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_cleaned_cg_edited.csv"

# Output CSV file path
output_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_with_sequences.csv"

# Read the CSV file
data = pd.read_csv(input_csv)

# Ensure there is a column for FASTQ sequences
data['fastq_seq'] = ""

# Iterate over each row to process the FASTQ files
for index, row in data.iterrows():
    # Assuming 'file_id' column exists and FASTQ files are named using the file_id
    fastq_path = f"/content/fastq_files/{row['file_id']}.fastq"

    if os.path.exists(fastq_path):
        sequences = []
        # Parse the FASTQ file and extract sequences
        with open(fastq_path, "r") as handle:
            for record in SeqIO.parse(handle, "fastq"):
                sequences.append(str(record.seq))  # Extract sequence

        # Combine sequences into a single string
        data.at[index, 'fastq_seq'] = ";".join(sequences)  # Combine all sequences
    else:
        # Mark the sequence as missing if the file doesn't exist
        data.at[index, 'fastq_seq'] = "MISSING"

# Save the updated DataFrame back to a CSV file
data.to_csv(output_csv, index=False)

print(f"Updated CSV with sequences saved to: {output_csv}")
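
Joining every read of a FASTQ file into one CSV cell can produce very large values. As an alternative, a fixed number of reads per file can be kept. The sketch below uses a small pure-Python reader so that it runs without extra packages; it assumes simple four-line records, and the names `iter_fastq_sequences` and `first_sequences` are hypothetical.

```python
import io
from itertools import islice


def iter_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        if header.startswith("@"):
            yield seq


def first_sequences(handle, max_reads=3):
    """Collect at most max_reads sequences without reading the rest of the file."""
    return list(islice(iter_fastq_sequences(handle), max_reads))


demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
)
print(first_sequences(demo, max_reads=2))
```

Because the generator is consumed lazily, only the records that are actually kept are read from disk.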


In [None]:
print(data.head())  # Inspect the first few rows to ensure the DataFrame is as expected
print(data.columns)  # Verify column names


In [None]:
data.to_csv(output_csv, index=False)
print(f"CSV file successfully saved to {output_csv}")


In [None]:
import os
print(os.path.exists(output_csv))  # Should return True if the file was saved


In [None]:
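def is_valid_fastq(file_path):
    """Return True if at least one FASTQ record can be parsed from the file."""
    try:
        with open(file_path, "r") as handle:
            for record in SeqIO.parse(handle, "fastq"):
                return True  # At least one record parsed successfully
        return False  # No records found (for example, an empty file)
    except Exception:
        return False  # Parsing failed, so the file is not valid FASTQ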


In [None]:
print(data.columns)


In [None]:
import os

# Define the base directory where FASTQ files are stored
fastq_base_dir = "/path/to/fastq_files"  # Replace with the actual path to your FASTQ files

# Add a new column for the constructed FASTQ file paths
data['fastq_path'] = data['file_id'].apply(lambda x: os.path.join(fastq_base_dir, f"{x}.fastq"))

# Debugging: Verify the new column
print(data[['file_id', 'fastq_path']].head())


In [None]:
fastq_path = "/content/fastq_files"  # Replace with the correct directory /content/fastq_files


In [None]:
import os

# Define the actual directory of FASTQ files
fastq_base_dir = "/content/fastq_files"  # Update with the correct path

# Create the `fastq_path` column
data['fastq_path'] = data['file_id'].apply(lambda x: os.path.join(fastq_base_dir, f"{x}.fastq"))

# Check for missing files
missing_files = data[~data['fastq_path'].apply(os.path.exists)]
if not missing_files.empty:
    print("Missing files detected:")
    print(missing_files[['file_id', 'fastq_path']])
else:
    print("All files found.")


In [None]:
# Read every FASTQ file listed in the 'fastq_path' column
sequences_column = []  # One entry per row, in DataFrame order
for index, row in data.iterrows():
    fastq_path = row['fastq_path']
    try:
        sequences = []
        with open(fastq_path, "r") as handle:
            for record in SeqIO.parse(handle, "fastq"):
                sequences.append(str(record.seq))
        sequences_column.append(";".join(sequences))
    except Exception as e:
        print(f"Error processing {fastq_path}: {e}")
        sequences_column.append("ERROR")


In [None]:
print(data.columns)  # Lists all column names
print(data.head())   # Displays the first few rows


In [None]:
import pandas as pd
from Bio import SeqIO
import os

# Input CSV file path
input_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_cleaned_cg_edited.csv"  # Update this with your actual file path

# Output CSV file path
output_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_with_sequences_we_hope.csv"  # Update this with your desired output path

# Read the CSV file
data = pd.read_csv(input_csv)

# Initialize a list to hold the sequences
sequences_column = []

# Function to validate FASTQ file format
def is_valid_fastq(filepath):
    try:
        with open(filepath, "r") as handle:
            for _ in SeqIO.parse(handle, "fastq"):
                return True  # If we can parse at least one record, it's valid
    except Exception as e:
        print(f"Error validating FASTQ file {filepath}: {e}")
    return False

# Iterate over each row in the DataFrame
for index, row in data.iterrows():
    fastq_path = row['fastq_path']  # Ensure this column contains the FASTQ file paths

    # Check if the file exists
    if not os.path.exists(fastq_path):
        print(f"Missing file: {fastq_path}")
        sequences_column.append("MISSING")
        continue

    # Check if the file is a valid FASTQ
    if not is_valid_fastq(fastq_path):
        print(f"Invalid FASTQ file: {fastq_path}")
        sequences_column.append("INVALID")
        continue

    # Extract sequences from the FASTQ file
    sequences = []
    try:
        with open(fastq_path, "r") as handle:
            for record in SeqIO.parse(handle, "fastq"):
                sequences.append(str(record.seq))  # Extract sequence
        sequences_column.append(";".join(sequences))  # Combine all sequences
    except Exception as e:
        print(f"Error processing file {fastq_path}: {e}")
        sequences_column.append("ERROR")

# Add the sequences as a new column
data['fastq_seq'] = sequences_column

# Save the updated DataFrame back to a CSV file
data.to_csv(output_csv, index=False)

print(f"Updated CSV with sequences saved to: {output_csv}")


In [None]:
print([col for col in data.columns])


In [None]:
data = pd.read_csv(input_csv, dtype=str)  # Read all columns as strings to avoid unexpected behavior
print(data.columns)  # Confirm the column names


In [None]:
data['fastq_path'] = "/path/to/fastq_files/" + data['file_id'] + ".fastq"


In [None]:
# Define the directory containing FASTQ files
fastq_dir = "/content/fastq_files/"  # Update to your actual directory path

# Generate the full paths to the FASTQ files
data['fastq_path'] = fastq_dir + data['file_id'] + ".fastq"


In [None]:
import pandas as pd
from Bio import SeqIO
import os

# Input CSV file path
input_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_cleaned_cg_edited.csv"

# Output CSV file path
output_csv = "/content/drive/MyDrive/Colab Notebooks/RNA_GDC_Labellled/linked_rna_sequence_classification_with_sequences.csv"

# FASTQ file directory (update this to your actual directory)
fastq_dir = "/content/fastq_files/"

# Read the CSV file
data = pd.read_csv(input_csv)

# Add the fastq_path column if not already present
if 'fastq_path' not in data.columns:
    data['fastq_path'] = fastq_dir + data['file_id'] + ".fastq"

# Initialize a list to hold the sequences
sequences_column = []

# Function to validate FASTQ file format
def is_valid_fastq(filepath):
    try:
        with open(filepath, "r") as handle:
            for _ in SeqIO.parse(handle, "fastq"):
                return True  # If we can parse at least one record, it's valid
    except Exception as e:
        print(f"Error validating FASTQ file {filepath}: {e}")
    return False

# Iterate over each row in the DataFrame
for index, row in data.iterrows():
    fastq_path = row['fastq_path']  # Column containing the FASTQ file paths

    # Check if the file exists
    if not os.path.exists(fastq_path):
        print(f"Missing file: {fastq_path}")
        sequences_column.append("MISSING")
        continue

    # Check if the file is a valid FASTQ
    if not is_valid_fastq(fastq_path):
        print(f"Invalid FASTQ file: {fastq_path}")
        sequences_column.append("INVALID")
        continue

    # Extract sequences from the FASTQ file
    sequences = []
    try:
        with open(fastq_path, "r") as handle:
            for record in SeqIO.parse(handle, "fastq"):
                sequences.append(str(record.seq))  # Extract sequence
        sequences_column.append(";".join(sequences))  # Combine all sequences
    except Exception as e:
        print(f"Error processing file {fastq_path}: {e}")
        sequences_column.append("ERROR")

# Add the sequences as a new column
data['fastq_seq'] = sequences_column

# Save the updated DataFrame back to a CSV file
data.to_csv(output_csv, index=False)

print(f"Updated CSV with sequences saved to: {output_csv}")


In [None]:
data.to_csv("/content/updated_file_with_fastq_paths.csv", index=False)


In [None]:
import os

def validate_fastq(file_path):
    """
    Validates the FASTQ file to ensure each record starts with '@'.
    """
    with open(file_path, 'r') as file:
        lines = file.readlines()
        for i in range(0, len(lines), 4):  # FASTQ format: 4 lines per record
            if not lines[i].startswith('@'):
                return False
    return True

def fix_fastq(file_path, output_path):
    """
    Adds '@' to the start of records missing the '@' character.
    """
    with open(file_path, 'r') as file:
        lines = file.readlines()

    fixed_lines = []
    for i in range(0, len(lines), 4):
        if not lines[i].startswith('@'):
            lines[i] = '@' + lines[i].strip() + '\n'
        fixed_lines.extend(lines[i:i+4])

    with open(output_path, 'w') as output_file:
        output_file.writelines(fixed_lines)

# Path to the directory containing FASTQ files
directory_path = "/content/fastq_files"
output_directory = "/content/validated_fastq_files"

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)
    output_path = os.path.join(output_directory, filename)

    if validate_fastq(file_path):
        print(f"{filename} is valid.")
    else:
        print(f"{filename} is invalid. Attempting to fix...")
        fix_fastq(file_path, output_path)
        if validate_fastq(output_path):
            print(f"{filename} was successfully fixed and saved to {output_path}.")
        else:
            print(f"Failed to fix {filename}. Please inspect the file manually.")
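
Checking only the `@` prefix on every fourth line does not catch truncated records or mismatched quality strings. A somewhat stricter structural check, sketched below, reads four lines at a time and verifies the header, the `+` separator, and that sequence and quality have equal length. It assumes simple four-line records with no trailing blank lines, and the name `check_fastq_structure` is hypothetical.

```python
import io


def check_fastq_structure(handle):
    """Return (ok, message) after checking every four-line record in the handle."""
    record_no = 0
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            break  # clean end of file
        record_no += 1
        if any(line == "" for line in lines[1:]):
            return False, f"record {record_no}: truncated"
        header, seq, sep, qual = (line.rstrip("\n") for line in lines)
        if not header.startswith("@"):
            return False, f"record {record_no}: header does not start with '@'"
        if not sep.startswith("+"):
            return False, f"record {record_no}: separator does not start with '+'"
        if len(seq) != len(qual):
            return False, f"record {record_no}: sequence and quality lengths differ"
    if record_no == 0:
        return False, "no records"
    return True, f"{record_no} records"


good = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
bad = io.StringIO("@r1\nACGT\n+\nIII\n")
print(check_fastq_structure(good))
print(check_fastq_structure(bad))
```

Reading line by line also avoids loading a whole file into memory, which matters for large sequencing files.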


In [None]:
import pandas as pd
import requests

# File path to the CSV
csv_file_path = '/content/drive/MyDrive/RNA_sequence_files/linked_rna_sequence_classification_cleaned_cg_edited.csv'

# Step 1: Read the CSV file and extract 'file_id'
df = pd.read_csv(csv_file_path)
# Assuming 'file_id' is the column name; empty cells are dropped so no query is sent for NaN
file_ids = df['file_id'].dropna().astype(str).tolist()

# Step 2: GraphQL API endpoint and headers
graphql_url = "https://api.example.com/graphql"  # Replace with actual endpoint
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",  # Replace with your token
    "Content-Type": "application/json"
}

# Step 3: Construct GraphQL query dynamically for each file_id
for file_id in file_ids:
    # GraphQL query with dynamic file_id
    query = f"""
    {{
      submitted_unaligned_reads (file_id: "{file_id}") {{
        file_name
        id
        project_id
        submitter_id
      }}
    }}
    """

    # Step 4: Send the POST request to GraphQL endpoint (timeout avoids hanging forever)
    response = requests.post(graphql_url, json={"query": query}, headers=headers, timeout=30)

    # Step 5: Process the response
    if response.status_code == 200:
        # Named 'result' so the 'data' DataFrame from earlier cells is not overwritten
        result = response.json()
        # A GraphQL server can answer HTTP 200 with a null 'data' field, so guard against None
        records = (result.get('data') or {}).get('submitted_unaligned_reads')
        if records:
            # Extract and display the results
            for item in records:
                print(f"File ID: {file_id}, File Name: {item['file_name']}, UUID: {item['id']}, Project ID: {item['project_id']}, Submitter ID: {item['submitter_id']}")
        else:
            print(f"No data found for File ID: {file_id}")
    else:
        print(f"Error querying File ID {file_id}: {response.status_code}, {response.text}")
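
The loop above pastes each `file_id` straight into the query text, which breaks if an ID ever contains a quote character. GraphQL also allows values to be sent separately as variables. The sketch below shows how the request body could be built that way. It assumes the endpoint accepts variables and that `file_id` is a `String` argument; neither is confirmed by the notebook, and `build_payload` is a hypothetical helper.

```python
import json

# Same fields as the query above. The "$file_id: String" declaration is an
# assumption about the endpoint's schema, not something the notebook confirms.
QUERY = """
query ($file_id: String) {
  submitted_unaligned_reads (file_id: $file_id) {
    file_name
    id
    project_id
    submitter_id
  }
}
"""

def build_payload(file_id):
    """Return the JSON body for one lookup, with file_id passed as a variable."""
    return {"query": QUERY, "variables": {"file_id": str(file_id)}}

# The ID travels as data, so quotes in it cannot change the query text
payload = build_payload('example"id')
print(json.dumps(payload["variables"]))
```

The resulting dictionary can be passed as `json=payload` to `requests.post` in place of `{"query": query}`.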


In [None]:
# Placeholder URL: replace with the real GraphQL endpoint before re-running the query cell
graphql_url = "https://actual-api-endpoint.com/graphql"


In [None]:
!curl https://actual-api-endpoint.com/graphql


In [None]:
import os
import csv
import subprocess

# Define the CSV file path and download directory
csv_file_path = "/content/drive/MyDrive/RNA_sequence_files/linked_rna_sequence_classification_cleaned_cg_edited.csv"
download_dir = "/content/drive/MyDrive/RNA_sequence_files"  # Replace with your desired download directory

# Ensure the download directory exists
os.makedirs(download_dir, exist_ok=True)

# Function to read file IDs from the CSV file
def get_file_ids_from_csv(csv_path):
    file_ids = []
    try:
        # newline="" is how the csv module expects files to be opened
        with open(csv_path, mode="r", newline="") as csv_file:
            reader = csv.DictReader(csv_file)
            # Assuming the column name for file_id is "file_id"
            for row in reader:
                # row.get() yields None for a missing column or a short row
                file_id = (row.get("file_id") or "").strip()
                if file_id:
                    file_ids.append(file_id)
        print(f"Extracted {len(file_ids)} file IDs from the CSV.")
    except Exception as e:
        print(f"Error reading CSV file: {e}")
    return file_ids

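# Function to download a single file without a token.
# Returns True on success so the caller can count failures.
def download_file(file_id):
    # gdc-client takes file UUIDs as positional arguments after the options
    command = [
        "gdc-client", "download",
        "-d", download_dir,
        file_id
    ]
    try:
        subprocess.run(command, check=True)
        print(f"Successfully downloaded file with ID: {file_id}")
        return True
    except FileNotFoundError:
        # Raised when the gdc-client executable is not on PATH
        print("gdc-client was not found. Install it and make sure it is on PATH.")
        return False
    except subprocess.CalledProcessError as e:
        print(f"Failed to download file with ID: {file_id}")
        print(f"Error: {e}")
        return False

# Main execution
file_ids = get_file_ids_from_csv(csv_file_path)

if file_ids:
    failed = [file_id for file_id in file_ids if not download_file(file_id)]
    print(f"Downloads finished: {len(file_ids) - len(failed)} succeeded, {len(failed)} failed.")
else:
    print("No file IDs found. Please check the CSV file.")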
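
Re-running the download cell requests every file again. A small filter can skip IDs that already have data on disk. This sketch assumes that `gdc-client` stores each download in a subfolder of the download directory named after the file ID; `pending_file_ids` is a hypothetical helper, and a non-empty folder does not prove the download finished.

```python
import os
import tempfile

def pending_file_ids(file_ids, download_dir):
    """Return the IDs that do not yet have a non-empty folder inside download_dir."""
    pending = []
    for file_id in file_ids:
        target = os.path.join(download_dir, file_id)
        # Assumption: one subfolder per file ID, as described in the lead-in
        if os.path.isdir(target) and os.listdir(target):
            continue
        pending.append(file_id)
    return pending

# Demo with a throwaway directory and made-up IDs
with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, "id-done"))
    with open(os.path.join(tmp_dir, "id-done", "reads.fastq"), "w") as handle:
        handle.write("@r1\nACGT\n+\nIIII\n")
    os.makedirs(os.path.join(tmp_dir, "id-empty"))
    print(pending_file_ids(["id-done", "id-empty", "id-new"], tmp_dir))
```

In the cell above, the loop could then iterate over `pending_file_ids(file_ids, download_dir)` instead of `file_ids`.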
