In [7]:
import glob  # To find image files in folders
import pandas as pd  # To handle tabular data
import os  # To handle file paths
from config import *  # Import variables from a configuration file

if __name__ == '__main__':
    # 📌 Step 1: Load the spatial transcriptomics dataset (cm_norm.tsv)
    cm = pd.read_csv(os.path.join(DATASET_PATH, 'cm_norm.tsv'), header=0, sep='\t', index_col=0)
    
   
    # 📌 Step 2: Filter dataset based on a condition (optional)
    if CONDITION_COLUMN:  
        cm_ = cm.loc[cm[CONDITION_COLUMN] == CONDITION]  # Selects rows where the condition matches
        cm = cm_  # Update the main dataset
    
    
    # 📌 Step 3: Find all histology images (.jpeg) from a folder
    img_files = glob.glob(TILE_PATH+'/*/*.jpeg')  # Get all image file paths inside subdirectories
    print(f"Found {len(img_files)} image files!")

    
    # 📌 Step 4: Match image filenames with the corresponding gene expression spots
    col_cm = list(cm.index)  # Get the list of row indices (gene expression spot IDs)
    sorted_img = []  # To store matched images
    sorted_cm = []  # To store matched gene expression spots
    
    for img in img_files:
        id_img = os.path.splitext(os.path.basename(img))[0].replace("-", "_")  # Extract the image ID
        for c in col_cm:
            id_c = c.replace("x", "_")  # Modify gene expression IDs for consistency
            if id_img == id_c:  # If they match, store them
                sorted_img.append(img)
                sorted_cm.append(c)

    print(f"Matched {len(sorted_img)} images with gene expression data!")

    # Reorder the gene expression matrix based on the matched images
    cm = cm.reindex(sorted_cm)  

    
    # 📌 Step 5: Create a dataset (dataset.tsv) linking images with gene expression data and labels
    df = pd.DataFrame(data={'img': sorted_img, 'cm': sorted_cm, 'label': cm[LABEL_COLUMN]})
    df.to_csv(os.path.join(DATASET_PATH, 'dataset.tsv'), sep='\t')  # Save dataset as TSV file
    print("Dataset file saved as dataset.tsv!")
    
    # 📌 Step 6: Save the reordered gene expression matrix (cm_final.tsv)
    cm.to_csv(os.path.join(DATASET_PATH, "cm_final.tsv"), sep='\t')  # Save the filtered matrix
    print("Process completed successfully!")


Found 2229 image files!
Matched 2114 images with gene expression data!
Dataset file saved as dataset.tsv!
Process completed successfully!


In [5]:
import glob
import pandas as pd
import os
from config import *

if __name__ == '__main__':
    # 📌 Step 1: Load datasets
    cm = pd.read_csv(os.path.join(DATASET_PATH, 'cm_norm.tsv'), header=0, sep='\t', index_col=0)
    spatial_coords = pd.read_csv(os.path.join(DATASET_PATH, 'spatial_coords.tsv'), sep='\t', index_col=0)  # Load spatial coordinates

    # 📌 Step 2: Filter dataset (optional)
    if CONDITION_COLUMN:
        cm = cm.loc[cm[CONDITION_COLUMN] == CONDITION]

    # 📌 Step 3: Find histology images
    img_files = glob.glob(TILE_PATH+'/*/*.jpeg')
    print(f"Found {len(img_files)} image files!")

    # 📌 Step 4: Match images with gene expression spots
    col_cm = list(cm.index)
    sorted_img, sorted_cm = [], []
    
    for img in img_files:
        id_img = os.path.splitext(os.path.basename(img))[0].replace("-", "_")
        for c in col_cm:
            id_c = c.replace("x", "_")
            if id_img == id_c:
                sorted_img.append(img)
                sorted_cm.append(c)

    print(f"Matched {len(sorted_img)} images with gene expression data!")

    # 📌 Step 5: Reorder gene matrix and spatial coordinates
    cm = cm.reindex(sorted_cm)  # Reorder gene matrix
    spatial_coords = spatial_coords.reindex(sorted_cm)  # Reorder spatial coordinates to match

    # 📌 Step 6: Create dataset.tsv (without spatial coordinates)
    df = pd.DataFrame(data={'img': sorted_img, 'cm': sorted_cm, 'label': cm[LABEL_COLUMN]})
    df.to_csv(os.path.join(DATASET_PATH, 'dataset.tsv'), sep='\t')

    # 📌 Step 7: Save reordered spatial coordinates separately
    spatial_coords.to_csv(os.path.join(DATASET_PATH, "spatial_coords_final.tsv"), sep='\t')
    
    cm.to_csv(os.path.join(DATASET_PATH, "cm_final.tsv"), sep='\t')
    print("Process completed! Spatial coordinates reordered and saved separately.")

Found 2229 image files!
Matched 2114 images with gene expression data!
Process completed! Spatial coordinates reordered and saved separately.


Got it! Let's go step by step and break down the script into smaller, easier-to-understand parts.

---

# **🔍 Understanding the Code Step by Step**

### **🌟 Goal of the Script**
The script aims to **match histology images with spatial transcriptomics data** (gene expression matrix) and create a dataset file (`dataset.tsv`) that can be used for training a model. It also filters and saves the gene expression data (`cm_final.tsv`).

---

## **🛠 Step 1: Import Necessary Libraries**
```python
import glob  # To find image files in folders
import pandas as pd  # To handle the tabular gene expression data
import os  # To handle file paths
from config import *  # Imports variables from a configuration file
```
- `glob` helps to **find all image files** in a given directory.  
- `pandas` helps to **load and process the gene expression data**.  
- `os` is used to **handle file paths** in a way that works across different operating systems.  
- `from config import *` imports variables (e.g., `DATASET_PATH`, `TILE_PATH`, `LABEL_COLUMN`, etc.), which are assumed to be defined in a `config.py` file.
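
For reference, a `config.py` compatible with this script could look like the sketch below. The variable names come from the script; every value is a placeholder assumption, not taken from the original project.

```python
# Hypothetical config.py: all values are placeholders chosen for illustration.
DATASET_PATH = "data/dataset"   # folder holding cm_norm.tsv and receiving the output files
TILE_PATH = "data/tiles"        # root folder; tiles are expected in TILE_PATH/<subfolder>/*.jpeg
CONDITION_COLUMN = None         # set to a column name (e.g. "tissue_type") to enable filtering
CONDITION = None                # value to keep in CONDITION_COLUMN; unused while the filter is off
LABEL_COLUMN = "label"          # column of cm_norm.tsv copied into dataset.tsv as 'label'
```

Leaving `CONDITION_COLUMN` as `None` (or an empty string) makes the script skip the optional filtering step.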

---

## **📂 Step 2: Load the Gene Expression Data**
```python
cm = pd.read_csv(os.path.join(DATASET_PATH, 'cm_norm.tsv'), header=0, sep='\t', index_col=0)
```
- Reads the **gene expression matrix** (`cm_norm.tsv`) into a DataFrame called `cm`.
- `sep='\t'` ensures it reads a **tab-separated** file (TSV format).
- `index_col=0` makes the first column (probably **spot IDs**) the index of the DataFrame.
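
To see what `index_col=0` does, here is a small sketch that reads an in-memory TSV instead of the real file. The spot IDs, gene names, and values are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for cm_norm.tsv: the first column holds spot IDs,
# the remaining columns hold per-spot values (all invented here).
toy_tsv = (
    "\tGeneA\tGeneB\tlabel\n"
    "12x34\t0.5\t1.2\ttumor\n"
    "12x35\t0.0\t0.7\tnormal\n"
)

cm_toy = pd.read_csv(io.StringIO(toy_tsv), header=0, sep="\t", index_col=0)

print(cm_toy.index.tolist())    # the spot IDs, now used as the row index
print(cm_toy.columns.tolist())  # the remaining column names from the header row
```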

---

## **🎯 Step 3: Filter Data Based on a Condition (Optional)**
```python
if CONDITION_COLUMN:
    cm_ = cm.loc[cm[CONDITION_COLUMN] == CONDITION]  # Selects rows where the condition matches
    cm = cm_
```
- If `CONDITION_COLUMN` is set (e.g., to select only **cancerous tissue spots**), the script keeps only the rows whose value in that column equals `CONDITION`. If it is empty or `None`, this step is skipped.
- Example: If `CONDITION_COLUMN = "tissue_type"` and `CONDITION = "tumor"`, it selects only tumor spots.
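
The filter can be tried on a toy table as sketched below. The column name `tissue_type` and the values are the hypothetical ones from the example above, not the real dataset.

```python
import pandas as pd

# Toy expression table with one metadata column (invented values).
cm_toy = pd.DataFrame(
    {"GeneA": [0.5, 0.0, 2.1], "tissue_type": ["tumor", "normal", "tumor"]},
    index=["12x34", "12x35", "13x34"],
)

CONDITION_COLUMN = "tissue_type"  # hypothetical column name
CONDITION = "tumor"               # hypothetical value to keep

shape_before = cm_toy.shape
mask = cm_toy[CONDITION_COLUMN] == CONDITION  # boolean Series, one entry per row
filtered = cm_toy.loc[mask]                   # keep only rows where the mask is True

print(shape_before, "->", filtered.shape)
```

Printing the shape before and after makes it easy to notice a filter that removes every row.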

---

## **🖼 Step 4: Load Histology Image Files**
```python
img_files = glob.glob(TILE_PATH+'/*/*.jpeg')  # Find all .jpeg files in subdirectories
```
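- Uses `glob.glob()` to **find all histology image files** in subdirectories inside `TILE_PATH`.  
- `TILE_PATH/*/*.jpeg` means:
  - `*` = Any folder directly inside `TILE_PATH` (exactly one level deep; deeper folders are not searched)
  - `*.jpeg` = Any file ending in `.jpeg` (files named `.jpg` are not matched)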

✅ **Example:**  
If `TILE_PATH = "/data/images"`, it will find files like:
```
/data/images/sample1/image1.jpeg
/data/images/sample2/image2.jpeg
```
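
A quick way to check which files the pattern picks up is to run it on a throwaway folder. The sketch below builds a small temporary tree with invented file names and applies the same `*/*.jpeg` pattern.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tile_path:
    # Build a small hypothetical tile tree.
    os.makedirs(os.path.join(tile_path, "sample1", "nested"))
    for rel in [
        "sample1/12-34.jpeg",       # one level deep, .jpeg -> matched
        "sample1/12-35.jpg",        # wrong extension -> ignored
        "top.jpeg",                 # directly in the root -> ignored
        "sample1/nested/1-1.jpeg",  # two levels deep -> ignored
    ]:
        open(os.path.join(tile_path, *rel.split("/")), "w").close()

    pattern = os.path.join(tile_path, "*", "*.jpeg")
    found = sorted(os.path.relpath(p, tile_path) for p in glob.glob(pattern))

print(found)
```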

---

## **🔗 Step 5: Match Images with Expression Data**
```python
col_cm = list(cm.index)  # Get the list of gene expression row indices
sorted_img = []  # To store matched images
sorted_cm = []  # To store matched gene expression spots

for img in img_files:
    id_img = os.path.splitext(os.path.basename(img))[0].replace("-", "_")  # Extract ID from image filename
    for c in col_cm:
        id_c = c.replace("x", "_")  # Modify gene expression IDs for consistency
        if id_img == id_c:  # If they match, store them
            sorted_img.append(img)
            sorted_cm.append(c)
```

### **What This Does:**
- **Extracts the unique identifier** from the image filename.
  - `os.path.basename(img)` gets just the filename (e.g., `image1.jpeg`).
  - `os.path.splitext(...)[0]` removes the `.jpeg` extension, leaving just `image1`.
  - `.replace("-", "_")` ensures the format matches the expression data.
- **Checks if the image ID matches any gene expression spot ID**.
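  - The script assumes image filenames separate the ID parts with `-`, while the spot IDs in `cm` use `x` (e.g., a hypothetical tile `12-34.jpeg` and spot `12x34`). Replacing both separators with `_` gives the two sides a common form (`12_34`). Note that `replace("x", "_")` rewrites every `x` in the spot ID, not only the separator.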
- If a match is found, it saves the corresponding **image file path** and **gene expression row ID**.

✅ **Example of Matching (made-up IDs):**  
| Image Filename | Normalized Image ID | Expression Data ID | Normalized Spot ID | Match? |
|---------------|---------------------|--------------------|--------------------|--------|
| `12-34.jpeg` | `12_34` | `12x34` | `12_34` | ✅ Yes |
| `sample-123.jpeg` | `sample_123` | `sample_123` | `sample_123` | ✅ Yes |
| `control-5.jpeg` | `control_5` | `control_x5` | `control__5` | ❌ No |
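
The nested loop compares every image with every spot, which gets slow as the number of tiles grows. One possible alternative, sketched below, builds a dictionary from normalized spot ID to original spot ID and looks each image up once. The function name `match_tiles_to_spots` is hypothetical. One behavioral difference to be aware of: if two spot IDs normalize to the same string, the loop above records both, while this version keeps only the last one.

```python
import os

def match_tiles_to_spots(img_files, spot_ids):
    """Pair each image path with the spot ID whose normalized form equals the image's."""
    # Normalized spot ID -> original spot ID (same normalization as the loop above).
    lookup = {spot.replace("x", "_"): spot for spot in spot_ids}

    matched_img, matched_spots = [], []
    for img in img_files:
        # Same image ID extraction as in the original script.
        id_img = os.path.splitext(os.path.basename(img))[0].replace("-", "_")
        spot = lookup.get(id_img)
        if spot is not None:
            matched_img.append(img)
            matched_spots.append(spot)
    return matched_img, matched_spots

# Usage with invented paths and spot IDs.
imgs = ["tiles/s1/12-34.jpeg", "tiles/s1/99-99.jpeg", "tiles/s1/12-35.jpeg"]
spots = ["12x35", "12x34", "40x40"]
sorted_img, sorted_cm = match_tiles_to_spots(imgs, spots)
print(sorted_img, sorted_cm)
```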

---

## **📋 Step 6: Reorder the Gene Expression Data**
```python
cm = cm.reindex(sorted_cm)  # Reorder cm to match sorted_cm (filtered IDs)
```
- `reindex(sorted_cm)` keeps only the matched spots and puts them in the same order as `sorted_img`, so row *i* of `cm` corresponds to image *i*. Rows for spots without a matching image are dropped.
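
The effect of `reindex` can be seen on a toy table, as in the sketch below (IDs and values are invented). It also shows that asking for an ID that is not in the index produces a row of `NaN` rather than an error, which cannot happen in the script because `sorted_cm` is built from `cm.index`.

```python
import pandas as pd

cm_toy = pd.DataFrame(
    {"GeneA": [0.5, 0.0, 2.1]},
    index=["12x34", "12x35", "13x34"],
)

# Matched spot IDs in image order (hypothetical); "12x35" had no image.
sorted_cm = ["13x34", "12x34"]
reordered = cm_toy.reindex(sorted_cm)  # rows follow sorted_cm; "12x35" is dropped
print(reordered)

# An unknown ID yields a NaN row instead of raising.
with_missing = cm_toy.reindex(["12x34", "99x99"])
print(with_missing)
```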

---

## **📄 Step 7: Create a Dataset File (`dataset.tsv`)**
```python
df = pd.DataFrame(data={'img': sorted_img, 'cm': sorted_cm, 'label': cm[LABEL_COLUMN]})
df.to_csv(os.path.join(DATASET_PATH, 'dataset.tsv'), sep='\t')
```
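- Creates a **DataFrame (`df`)** with three columns:
  - **"img"** → The path to the histology image.
  - **"cm"** → The corresponding gene expression spot ID.
  - **"label"** → The value of the `LABEL_COLUMN` column in `cm` for that spot (e.g., a disease stage or category).
- Because `cm[LABEL_COLUMN]` is a pandas Series, `df` takes its index (the spot IDs) from it, and the two lists must have the same length as that Series.
- Saves this DataFrame as a **TSV file** (`dataset.tsv`) in `DATASET_PATH`. The index is written too, so the first column of the file repeats the spot ID.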

✅ **Example of `dataset.tsv` (made-up rows):**
| (index) | img                                | cm       | label  |
|---------|------------------------------------|----------|--------|
| `12x34` | `/data/images/sample1/12-34.jpeg`  | `12x34`  | Tumor  |
| `12x35` | `/data/images/sample1/12-35.jpeg`  | `12x35`  | Normal |
| `13x34` | `/data/images/sample2/13-34.jpeg`  | `13x34`  | Tumor  |
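
The sketch below reproduces this step on toy data to show how pandas combines two plain lists with a Series: the Series supplies the index, and the lists are placed row by row in their given order. All paths, IDs, and labels are invented.

```python
import pandas as pd

sorted_img = ["tiles/s1/13-34.jpeg", "tiles/s1/12-34.jpeg"]
sorted_cm = ["13x34", "12x34"]

# Stand-in for cm[LABEL_COLUMN] after reindexing: a Series indexed by spot ID.
labels = pd.Series(["tumor", "normal"], index=sorted_cm)

df = pd.DataFrame(data={"img": sorted_img, "cm": sorted_cm, "label": labels})

# With no path given, to_csv returns the text it would write.
tsv_text = df.to_csv(sep="\t")
print(tsv_text)
```

The header line starts with an empty field because the index has no name; passing `index=False` to `to_csv` would omit that column if the duplicate spot ID is not wanted.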

---

## **📂 Step 8: Save the Final Gene Expression Data (`cm_final.tsv`)**
```python
cm.to_csv(os.path.join(DATASET_PATH, "cm_final.tsv"), sep='\t')
```
- Saves the **filtered and reordered** gene expression matrix as `cm_final.tsv`.

---

# **🚀 Summary of What This Script Does**
1. **Loads the spatial transcriptomics dataset (`cm_norm.tsv`).**
2. **Filters it based on a condition (optional).**
3. **Finds all histology images (`.jpeg`) from a folder.**
4. **Matches image filenames with the corresponding gene expression spots.**
5. **Creates a dataset (`dataset.tsv`) linking images with gene expression data and labels.**
6. **Saves the reordered gene expression matrix (`cm_final.tsv`).**

---

# **🔍 Potential Issues & Fixes**
| Issue | Fix |
|-------|-----|
| No matching images found | Check if filename formats match correctly. Print a few normalized IDs from each side (e.g., the first five `id_img` and `id_c` values) and compare them. |
| `LABEL_COLUMN` missing | Ensure `LABEL_COLUMN` exists in `cm`. Add `if LABEL_COLUMN not in cm.columns: raise KeyError(...)`. |
| `dataset.tsv` is empty | The filtering step might have removed all data. Print `cm.shape` before and after filtering. |
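
The checks in the table can be bundled into a small helper that runs before the matching step. The sketch below is one possible version; the function name `check_inputs` and the toy data are assumptions made for illustration.

```python
import os
import pandas as pd

def check_inputs(cm, img_files, label_column):
    """Fail early on a missing label column and return the images with no matching spot."""
    if label_column not in cm.columns:
        raise KeyError(f"{label_column!r} not found in cm columns")

    # Same normalization as the matching loop in the script.
    normalized_spots = {str(spot).replace("x", "_") for spot in cm.index}
    unmatched = [
        img for img in img_files
        if os.path.splitext(os.path.basename(img))[0].replace("-", "_") not in normalized_spots
    ]
    return unmatched

# Usage with invented data.
cm_toy = pd.DataFrame({"label": ["tumor", "normal"]}, index=["12x34", "12x35"])
imgs = ["tiles/s1/12-34.jpeg", "tiles/s1/99-99.jpeg"]
unmatched = check_inputs(cm_toy, imgs, "label")
print(f"{len(unmatched)} image(s) without a matching spot:", unmatched)
```

In the run shown at the top, 2229 images were found and 2114 matched, so a helper like this would list the 115 tiles that have no corresponding spot.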

---

## **📢 Next Steps**
Would you like me to:
1. **Optimize the code further** to improve speed?
2. **Modify it for a different dataset structure**?
3. **Help with debugging if something isn’t working?**

Let me know what you need! 😊🚀