# **(Data Collection)**

## Objectives

* Prepare and preprocess brain MRI scan images for tumor detection modeling.
* Merge tumor classes into a single "tumor" category.
* Resize all images to 224x224 pixels and normalize pixel values.
* Analyze and address class imbalance using oversampling and augmentation.
* Save processed data and metadata for downstream modeling and visualization.

## Inputs

* Raw brain MRI scan images (tumor and no-tumor) in their respective folders.
* Any available metadata (e.g., image labels, file paths).

## Outputs

* Preprocessed images (224x224, normalized) saved to a structured directory.
* A CSV or DataFrame containing image paths and labels.
* Augmented images for minority class to address imbalance.
* Summary statistics and visualizations of class distribution.

## Additional Comments

* All data must be anonymized and checked for quality before processing.
* Oversampling and augmentation will be applied only to the minority class (no-tumor).
* The notebook should be run top-down, with each step building on the previous.
* Outputs will be used in subsequent modeling and visualization notebooks.

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running the notebook in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/brain-tumor-classification/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/brain-tumor-classification'
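
Because the cells above always move to the parent of whatever `os.getcwd()` returns, re-running cells 1 and 2 together moves up another level and leaves the project root. A minimal sketch of a guarded version follows, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the output above). The helper name `project_root_from` is hypothetical.

```python
import os

NOTEBOOK_FOLDER = "jupyter_notebooks"  # assumption: taken from the path shown above


def project_root_from(current_dir, notebook_folder=NOTEBOOK_FOLDER):
    """Return the parent directory only when current_dir is the notebook folder."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    return normalized


# Safe to re-run: once outside the notebook folder, the directory no longer changes.
os.chdir(project_root_from(os.getcwd()))
print("Current directory:", os.getcwd())
```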

# Data preparation

Install the required libraries and download the data from Kaggle.

In [9]:
!pip install kagglehub
!pip install opencv-python
!pip install pandas numpy
!pip install scikit-learn
!pip install matplotlib seaborn
!pip install tensorflow
!pip install keras
!pip install imbalanced-learn
!pip install tqdm

import kagglehub
path = kagglehub.dataset_download("masoudnickparvar/brain-tumor-mri-dataset")
print("Path to dataset files:", path)

import shutil
import os

# Use the path returned by kagglehub rather than a hard-coded, user-specific cache path
src = path
# The working directory was set to the project root in the cells above
dst = os.path.join(os.getcwd(), "data")

os.makedirs(dst, exist_ok=True)

# Copy each dataset folder into the project's data folder
for folder in os.listdir(src):
    src_folder = os.path.join(src, folder)
    if os.path.isdir(src_folder):
        shutil.copytree(src_folder, os.path.join(dst, folder), dirs_exist_ok=True)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgr

**Data Organization**

* The dataset is downloaded and stored in the **data** folder at the project root.
* Each tumor type has its own subfolder inside **data**.
* Images without a tumor are stored in the **notumor** subfolder of **data**.

**Image Quality Check:**

Review images for quality, removing corrupted, low-resolution, or irrelevant scans.

In [16]:
# using (Pillow)
from PIL import Image
import os

image_dir = "./data"
min_width, min_height = 100, 100

corrupted_images = []
low_res_images = []

for root, dirs, files in os.walk(image_dir):
    for file in files:
        if file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
            img_path = os.path.join(root, file)
            try:
                with Image.open(img_path) as img:
                    w, h = img.size
                    if h < min_height or w < min_width:
                        low_res_images.append(img_path)
            except Exception as e:
                corrupted_images.append(img_path)

print("Corrupted images:", corrupted_images)
print("Low-resolution images:", low_res_images)

# remove these files
for img_path in corrupted_images + low_res_images:
    try:
        os.remove(img_path)
    except Exception as e:
        print(f"Could not remove {img_path}: {e}")

Corrupted images: []
Low-resolution images: []
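
`Image.open()` reads only enough of a file to identify it, so a truncated image can pass the check above. A hedged sketch of a stricter test that forces a full decode with `img.load()` is shown below. The helper name `is_readable` is hypothetical and not part of the original notebook.

```python
from PIL import Image


def is_readable(img_path):
    """Return True if Pillow can fully decode the image at img_path."""
    try:
        with Image.open(img_path) as img:
            img.load()  # decode the pixel data; raises on truncated or damaged files
        return True
    except Exception:
        return False
```

Inside the loop, `is_readable(img_path)` could decide whether a path is appended to `corrupted_images` before the size check runs.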


**Label Verification:**

Ensure that each image is correctly labeled (tumor or no-tumor). The cell below:

* Takes each image's label from the folder it is stored in.
* Creates a table of image paths and their labels.
* Makes it possible to spot mislabeled or misplaced files by inspecting the DataFrame or CSV.

In [20]:
import os
import pandas as pd

image_dir = "./data"
data = []

for label in os.listdir(image_dir):
    label_path = os.path.join(image_dir, label)
    if os.path.isdir(label_path):
        for file in os.listdir(label_path):
            if file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
                img_path = os.path.join(label_path, file)
                data.append({"image_path": img_path, "label": label})

df = pd.DataFrame(data)
print(df.head())
print("Label counts:\n", df['label'].value_counts())

# Optionally, save to CSV for further use
df.to_csv("image_labels.csv", index=False)

                        image_path      label
0  ./data/pituitary/Tr-pi_0837.jpg  pituitary
1  ./data/pituitary/Te-pi_0263.jpg  pituitary
2  ./data/pituitary/Tr-pi_1277.jpg  pituitary
3  ./data/pituitary/Te-pi_0156.jpg  pituitary
4  ./data/pituitary/Tr-pi_1414.jpg  pituitary
Label counts:
 label
notumor       2000
pituitary     1757
meningioma    1645
glioma        1621
Name: count, dtype: int64
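
With the binary goal stated in the objectives, the three tumor folders collapse into one class. The sketch below shows what that does to the class balance, using the label counts printed above. The variable names are illustrative, and the figures reflect this point in the notebook only, before any later processing step.

```python
from collections import Counter

# Label counts copied from the output above.
label_counts = {"notumor": 2000, "pituitary": 1757, "meningioma": 1645, "glioma": 1621}
tumor_labels = {"glioma", "meningioma", "pituitary"}

binary_counts = Counter()
for label, count in label_counts.items():
    binary_counts["tumor" if label in tumor_labels else "notumor"] += count

minority, minority_count = min(binary_counts.items(), key=lambda kv: kv[1])
majority_count = max(binary_counts.values())

print(dict(binary_counts))                                           # {'notumor': 2000, 'tumor': 5023}
print("Minority class:", minority)                                   # notumor
print("Images needed to balance:", majority_count - minority_count)  # 3023
```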


**Image Preprocessing:**

Resizing: Standardize to a consistent size (e.g., 224x224 pixels) to match model input requirements.

Normalization: Scale pixel values (e.g., to a 0-1 range) to improve model training stability.

Format Conversion: Convert images to a consistent format (e.g., PNG or JPEG) if necessary. 

Class Merging: If there are multiple tumor types, combine them into a single "tumor" class if the modeling goal is binary classification.

In [22]:
from PIL import Image
import os

input_dir = "./data"
resized_dir = "./resized_data"
target_size = (224, 224)

if not os.path.exists(resized_dir):
    os.makedirs(resized_dir)

for root, dirs, files in os.walk(input_dir):
    for file in files:
        if file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
            img_path = os.path.join(root, file)
            rel_dir = os.path.relpath(root, input_dir)
            out_dir = os.path.join(resized_dir, rel_dir)
            if not os.path.exists(out_dir):
                os.makedirs(out_dir)
            out_path = os.path.join(out_dir, file)
            try:
                with Image.open(img_path) as img:
                    img = img.resize(target_size)
                    img.save(out_path)
            except Exception as e:
                print(f"Error resizing {img_path}: {e}")
print("Resizing complete. Images saved to:", resized_dir)

Error resizing ./data/notumor/Tr-no_1012.jpg: cannot write mode P as JPEG
Error resizing ./data/notumor/Tr-no_1019.jpg: cannot write mode RGBA as JPEG
Error resizing ./data/notumor/Tr-no_1020.jpg: cannot write mode RGBA as JPEG
Error resizing ./data/notumor/Tr-no_1011.jpg: cannot write mode RGBA as JPEG
Resizing complete. Images saved to: ./resized_data
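
The four errors above arise because JPEG cannot store palette (`P`) or alpha-channel (`RGBA`) images, so those files were never written to `resized_data`. One possible fix is to convert to RGB before saving, sketched here under the assumption that dropping transparency is acceptable for these scans. `resize_for_jpeg` is a hypothetical helper.

```python
from PIL import Image


def resize_for_jpeg(img, target_size=(224, 224)):
    """Resize an image and return it in a mode that JPEG can store."""
    if img.mode not in ("RGB", "L"):
        img = img.convert("RGB")  # drops palette and alpha information
    return img.resize(target_size)
```

Inside the resize loop this would stand in for `img.resize(target_size)`, for example `resize_for_jpeg(img).save(out_path)`.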


In [24]:
import numpy as np

normalized_dir = "./normalized_data"

if not os.path.exists(normalized_dir):
    os.makedirs(normalized_dir)

for root, dirs, files in os.walk(resized_dir):
    for file in files:
        if file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
            img_path = os.path.join(root, file)
            rel_dir = os.path.relpath(root, resized_dir)
            out_dir = os.path.join(normalized_dir, rel_dir)
            if not os.path.exists(out_dir):
                os.makedirs(out_dir)
            out_path = os.path.join(out_dir, os.path.splitext(file)[0] + ".png")  # Convert to PNG
            try:
                with Image.open(img_path) as img:
                    img = img.convert("RGB")
                    arr = np.array(img) / 255.0  # Normalize to 0-1
                    img_norm = Image.fromarray((arr * 255).astype(np.uint8))
                    img_norm.save(out_path, "PNG")
            except Exception as e:
                print(f"Error normalizing {img_path}: {e}")
print("Normalization and format conversion complete. Images saved to:", normalized_dir)

Normalization and format conversion complete. Images saved to: ./normalized_data
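
Note that the cell above scales pixels to 0-1 and then back to 0-255 before saving, because an 8-bit PNG cannot hold floating-point values. The files in `normalized_data` therefore still contain integer pixel values. A common alternative, shown as a sketch rather than as the notebook's method, is to apply the 0-1 scaling when the images are loaded for training. The function name `to_unit_range` is an assumption.

```python
import numpy as np


def to_unit_range(pixels):
    """Scale 8-bit pixel values to float32 values in the range [0, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 255.0


# Dummy 224x224 RGB image; a real array would come from np.array(Image.open(path)).
dummy = np.full((224, 224, 3), 255, dtype=np.uint8)
scaled = to_unit_range(dummy)
print(scaled.dtype, scaled.min(), scaled.max())
```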


In [26]:
merged_dir = "./merged_data"
tumor_folders = ["glioma", "meningioma", "pituitary"]
notumor_folder = "notumor"

if not os.path.exists(merged_dir):
    os.makedirs(merged_dir)

for label in os.listdir(normalized_dir):
    label_path = os.path.join(normalized_dir, label)
    if not os.path.isdir(label_path):
        continue

    # Merge tumor classes
    if label.lower() in tumor_folders:
        out_label = "tumor"
    elif label.lower() == notumor_folder:
        out_label = "notumor"
    else:
        continue

    out_label_dir = os.path.join(merged_dir, out_label)
    if not os.path.exists(out_label_dir):
        os.makedirs(out_label_dir)

    for file in os.listdir(label_path):
        src_file = os.path.join(label_path, file)
        dst_file = os.path.join(out_label_dir, file)
        try:
            os.rename(src_file, dst_file)
        except Exception as e:
            print(f"Error moving {src_file}: {e}")
print("Class merging complete. Images saved to:", merged_dir)

Class merging complete. Images saved to: ./merged_data
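
`os.rename` moves the files, so the class folders in `normalized_data` are left empty and re-running only this cell finds nothing left to move. A sketch of a copy-based variant that leaves the source intact follows, assuming the same folder names as above. `merge_classes` is a hypothetical helper.

```python
import os
import shutil

TUMOR_FOLDERS = {"glioma", "meningioma", "pituitary"}
NOTUMOR_FOLDER = "notumor"


def merge_classes(src_root, dst_root):
    """Copy images into dst_root/tumor or dst_root/notumor; return counts per class."""
    counts = {"tumor": 0, "notumor": 0}
    for label in sorted(os.listdir(src_root)):
        label_path = os.path.join(src_root, label)
        if not os.path.isdir(label_path):
            continue
        if label.lower() in TUMOR_FOLDERS:
            out_label = "tumor"
        elif label.lower() == NOTUMOR_FOLDER:
            out_label = "notumor"
        else:
            continue  # ignore folders that are not known classes
        out_dir = os.path.join(dst_root, out_label)
        os.makedirs(out_dir, exist_ok=True)
        for name in os.listdir(label_path):
            shutil.copy2(os.path.join(label_path, name), os.path.join(out_dir, name))
            counts[out_label] += 1
    return counts
```

A call such as `merge_classes("./normalized_data", "./merged_data")` returns the number of files copied per class, which also gives a quick check on the class balance.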


---


NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down; do not create a workflow in which you must go back to an earlier cell to execute a task, such as refreshing a variable's content.

---

# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [27]:
import os
try:
    # create here your folder
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder name is provided
except Exception as e:
    print(e)