Building a dataset of X-ray images for a project on detecting changes in X-ray scans.

In [None]:
!pip install -q kaggle

## 1. Upload the Kaggle API credentials

This step uploads your Kaggle API credentials to Google Colab. The credentials are stored in a file named `kaggle.json`, which you can download from your Kaggle account settings. The Kaggle API requires this file.

In [None]:
from google.colab import files
files.upload()  # Select the kaggle.json file from your local disk
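Before moving on, it can help to confirm that the uploaded file looks like a Kaggle credentials file. The sketch below is a hypothetical helper (the name `check_kaggle_credentials` is not part of the Kaggle API). It assumes the file is JSON with non-empty `username` and `key` fields, which is the layout of the `kaggle.json` that Kaggle generates.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path="kaggle.json"):
    """Return True if `path` is a JSON file with non-empty 'username' and 'key' fields.

    Hypothetical helper for illustration; it only inspects the file layout
    and does not contact Kaggle to validate the key itself.
    """
    cred_file = Path(path)
    if not cred_file.is_file():
        return False
    try:
        creds = json.loads(cred_file.read_text())
    except json.JSONDecodeError:
        return False
    # Both fields must be present and non-empty
    return isinstance(creds, dict) and all(creds.get(field) for field in ("username", "key"))
```

Calling `check_kaggle_credentials("kaggle.json")` right after the upload should return `True` if the expected file was selected.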

## 2. Move the `kaggle.json` file to the `.kaggle` directory

Next, we move the `kaggle.json` file to the `.kaggle` directory, which is the default location where the Kaggle API looks for these credentials. We also set the file permissions so that only the user can read and write it, which is a requirement from Kaggle for security purposes.

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
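The same three shell steps can also be sketched in Python, which makes it easy to verify the resulting permissions. The function names `install_credentials` and `is_private` are hypothetical helpers for illustration. The only assumption about Kaggle is the default `~/.kaggle/kaggle.json` location described above.

```python
import shutil
import stat
from pathlib import Path


def install_credentials(src="kaggle.json", config_dir=None):
    """Copy the credentials file into the config directory and restrict it to the owner."""
    # Assumed default location: ~/.kaggle
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    dest = config_dir / "kaggle.json"
    shutil.copy(src, dest)                         # like `cp`
    dest.chmod(0o600)                              # like `chmod 600`
    return dest


def is_private(path):
    """Return True if neither group nor others have any permission on `path`."""
    mode = stat.S_IMODE(Path(path).stat().st_mode)
    return mode & 0o077 == 0
```

After `dest = install_credentials()`, `is_private(dest)` should be `True` on a POSIX system such as the Colab runtime.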


In [None]:
!kaggle datasets download -d nih-chest-xrays/data



Downloading data.zip to /content
100% 42.0G/42.0G [31:59<00:00, 28.4MB/s]
100% 42.0G/42.0G [31:59<00:00, 23.5MB/s]


In [None]:
import zipfile
import os

# Define the path to the .zip file and to the directory where it should be unzipped
zip_file_path = '/content/data.zip'
destination_directory = '/content/data'

# Create the destination directory if it doesn't exist
os.makedirs(destination_directory, exist_ok=True)

# Open the .zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Unzip the file to the destination directory
    zip_ref.extractall(destination_directory)

print("Files unzipped successfully.")



Files unzipped successfully.


## 3. Load and process the dataset

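We load the dataset metadata from the CSV file and sample 1000 rows for each unique value of `Finding Labels`, storing them in a new DataFrame. A value can be a single label or a combination of labels separated by `|`, and the sampling is done with replacement, so the same image can be selected more than once. This reduces the size of the dataset and makes it more manageable for our purposes.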

In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('/content/data/Data_Entry_2017.csv')

# Create an empty DataFrame to which you will add the selected images
selected_images_df = pd.DataFrame()

# Iterate over all unique classes in df
for class_name in df['Finding Labels'].unique():
    # Sample 1000 rows for this label string (with replacement, so duplicates are possible)
    selected_images = df[df['Finding Labels'] == class_name].sample(1000, replace=True)
    # Add them to selected_images_df
    selected_images_df = pd.concat([selected_images_df, selected_images])

# Now selected_images_df contains 1000 sampled rows per unique label string (duplicates included)
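If duplicate rows are not wanted, one possible alternative is to sample without replacement and cap the sample at the number of rows available for each label string. This is a minimal sketch, not the approach used in the cell above. The function name `sample_per_label` and the `seed` parameter are assumptions introduced for illustration.

```python
import pandas as pd


def sample_per_label(df, label_col="Finding Labels", n=1000, seed=0):
    """Sample up to `n` distinct rows for each unique value in `label_col`."""
    parts = [
        # min() keeps small groups intact instead of raising an error
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)
```

With this variant, `sample_per_label(df)` returns each image at most once, so rare label strings simply contribute all of their rows.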


## 4. Create directories for each class and copy the images

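Finally, we create a directory for each class in the dataset and copy the selected images into their respective directories. Because the rows were sampled with replacement and a repeated image overwrites the same file, each directory ends up with at most 1000 unique images, and often fewer. Rows with several labels are mostly skipped: such an image is copied, under its first label, only if that label has not been seen yet. This step prepares our data for further use, such as for training a machine learning model.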

In [27]:
import shutil

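folder_path = '/content/selected_images'
# Remove any previous output folder; ignore_errors avoids a failure when it does not exist yet
shutil.rmtree(folder_path, ignore_errors=True)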

In [28]:
import os
import shutil
import glob
import time
import IPython.display as display

# Define the path to the main folder where all subfolders with images are stored
main_images_folder = '/content/data'

# Create a main folder for the selected images
selected_images_path = '/content/selected_images'
os.makedirs(selected_images_path, exist_ok=True)

# To hold the progress messages printed so far
messages = []

# Counters for images per class
class_counters = {}

# Total images copied
total_images_copied = 0

# Iterate over all selected images
for index, row in selected_images_df.iterrows():
    # Get the class names and image name for this row
    class_names = row['Finding Labels'].split('|')
    image_name = row['Image Index']

    # If the image has only one label or the first label hasn't been used yet, copy the image
    if len(class_names) == 1 or class_names[0] not in class_counters:
        class_name = class_names[0]

        # Create a folder for this class, if it doesn't exist yet
        class_folder = os.path.join(selected_images_path, class_name)
        os.makedirs(class_folder, exist_ok=True)

        # Initialize the counter for this class if it doesn't exist yet
        if class_name not in class_counters:
            class_counters[class_name] = 0

        # Find the image in the subfolders and copy it into the corresponding class folder
        for folder in glob.glob(main_images_folder + '/*'):
            source_path = os.path.join(folder, 'images', image_name)
            if os.path.exists(source_path):
                destination_path = os.path.join(class_folder, image_name)
                shutil.copy(source_path, destination_path)

                # Increment the counter for this class
                class_counters[class_name] += 1

                # Increment total images copied
                total_images_copied += 1

                # Print a message and add it to the messages list if total_images_copied is a multiple of 100
                if total_images_copied % 100 == 0:
                    message = f"Copied image {image_name} to {class_folder}. Total images in {class_name}: {class_counters[class_name]}"
                    print(message)
                    messages.append(message)

                break  # Stop searching once the image is found and copied


Copied image 00002256_016.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 100
Copied image 00029665_000.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 200
Copied image 00015202_000.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 300
Copied image 00007326_000.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 400
Copied image 00030764_000.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 500
Copied image 00006875_027.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 600
Copied image 00008898_000.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 700
Copied image 00015563_011.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 800
Copied image 00025684_000.png to /content/selected_images/Cardiomegaly. Total images in Cardiomegaly: 900
Copied image 00010095_000.png to /content/sele
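The copy loop above checks every subfolder for each image, which repeats the same file-system lookups thousands of times. One possible alternative is to walk the extracted data once and build a lookup table from file name to full path. This is only a sketch: `build_image_index` is a hypothetical helper, and it assumes that image file names are unique across subfolders (the loop above relies on the same assumption, since it stops at the first match).

```python
import os


def build_image_index(root):
    """Map each PNG file name found under `root` to its full path.

    If a name appears more than once, the first path encountered is kept.
    """
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".png"):
                index.setdefault(name, os.path.join(dirpath, name))
    return index
```

With `image_index = build_image_index(main_images_folder)` computed before the loop, `image_index.get(image_name)` returns the source path directly, or `None` if the image is missing.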

In [29]:
for class_name in os.listdir(selected_images_path):
    class_folder = os.path.join(selected_images_path, class_name)
    num_files = len(os.listdir(class_folder))
    print(f"{class_name}: {num_files} files")


Nodule: 845 files
Fibrosis: 548 files
Cardiomegaly: 642 files
No Finding: 988 files
Hernia: 110 files
Pneumonia: 313 files
Emphysema: 605 files
Pneumothorax: 814 files
Edema: 502 files
Atelectasis: 889 files
Mass: 807 files
Consolidation: 720 files
Infiltration: 944 files
Effusion: 899 files
Pleural_Thickening: 674 files
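Counts below 1000 are what sampling with replacement would be expected to produce: drawing k rows uniformly from N available images gives about N(1 - (1 - 1/N)^k) distinct images on average, and never more than N. The sketch below compares that formula with a small simulation. The function names are hypothetical, and the sketch ignores the handling of multi-label rows, so it illustrates the effect rather than reproducing the exact counts above.

```python
import random


def expected_unique(population_size, draws):
    """Expected number of distinct items after `draws` uniform draws with replacement."""
    return population_size * (1 - (1 - 1 / population_size) ** draws)


def simulated_unique(population_size, draws, seed=0):
    """Count the distinct items in one simulated run of draws with replacement."""
    rng = random.Random(seed)
    return len({rng.randrange(population_size) for _ in range(draws)})


if __name__ == "__main__":
    # Hypothetical population sizes, chosen only to show the trend
    for size in (110, 1000, 5000, 50000):
        print(size, round(expected_unique(size, 1000), 1), simulated_unique(size, 1000))
```

A label string with only about a hundred images can never yield more files than it has images, however many rows are drawn, while a very large one loses only a few files to repeats.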


In [33]:
import shutil

# Define the name of the zip file
zip_name = 'selected_images'

# Compress the folder
shutil.make_archive(zip_name, 'zip', selected_images_path)


'/content/selected_images.zip'

In [32]:
from google.colab import drive
drive.mount('/content/drive')

# copy the zip file into Google Drive
shutil.copy('selected_images.zip', '/content/drive/MyDrive/')

Mounted at /content/drive


'/content/drive/MyDrive/selected_images.zip'