## **Data Understanding**

The data for this project is scraped from the [DermNet NZ Image Library](https://dermnetnz.org/image-library) and the [ISIC 2019 Challenge](https://challenge.isic-archive.com/data/#2019) websites.

## 1. Data Scraping

In [None]:
# install required libraries
!pip install selenium
!pip install webdriver_manager

In [1]:
# import required libraries
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import re
import time
from IPython.core.display import HTML
import webbrowser
import requests as rq
import os
import pathlib

### Identify all the diseases data available for scraping

In [3]:
# Creating an instance of the Chrome web browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# URL to the dermnet website
url = "https://dermnetnz.org/image-library"

# Navigating to the specified url in chrome
driver.get(url)

# Finding all the skin_disorders listed on the main page
skin_disorder_tag_main_page = driver.find_elements("xpath", '//a[@class="imageList__group__item"]')

# For loop to extract the skin disorder names and the link to the skin_disorders
disorder_names = []
link_list= []
for tag in skin_disorder_tag_main_page:
    disorder_names.append(tag.text)
    link_list.append(tag.get_attribute("href"))

# Previewing the lists created:
print(f'The first 10 skin condition names: \n {disorder_names[:10]}\n')
print(f'The first 10 links to skin condition image links:\n{link_list[:10]}')

The first 10 skin condition names: 
 ['Acne affecting the back images', 'Acne affecting the face images', 'Acne and other follicular disorder images', 'Acquired dermal macular hyperpigmentation images', 'Acral lentiginous melanoma images', 'Actinic keratosis affecting the face images', 'Actinic keratosis affecting the hand images', 'Actinic keratosis affecting the legs and feet images', 'Actinic keratosis affecting the scalp images', 'Actinic keratosis dermoscopy images']

The first 10 links to skin condition image links:
['https://dermnetnz.org/topics/acne-affecting-the-back-images', 'https://dermnetnz.org/topics/acne-face-images', 'https://dermnetnz.org/image-catalogue/acne-and-other-follicular-disorder-images', 'https://dermnetnz.org/topics/acquired-dermal-macular-hyperpigmentation-images', 'https://dermnetnz.org/images/acral-lentiginous-melanoma-images', 'https://dermnetnz.org/topics/actinic-keratosis-face-images', 'https://dermnetnz.org/topics/actinic-keratosis-affecting-the-hand-

In [4]:
# The number of skin disorders listed on the website
print(f'There are {len(disorder_names)} skin conditions listed on the DermNet website.')

There are 294 skin conditions listed on the DermNet website.


### Create a dataframe with disease name and data download link

In [5]:
# Creating a dataframe with two columns, the skin_disorder names and the links to the images of the skin disorders
name_link_df = pd.DataFrame({'skin_disorder_name': disorder_names, 'link': link_list})

# Saving the dataframe as a csv file
name_link_df.to_csv('Data/name_link.csv', index=False)

# Previewing the first five rows of the dataframe
name_link_df.head()

| | skin_disorder_name | link |
|---|---|---|
| 0 | Acne affecting the back images | https://dermnetnz.org/topics/acne-affecting-th... |
| 1 | Acne affecting the face images | https://dermnetnz.org/topics/acne-face-images |
| 2 | Acne and other follicular disorder images | https://dermnetnz.org/image-catalogue/acne-and... |
| 3 | Acquired dermal macular hyperpigmentation images | https://dermnetnz.org/topics/acquired-dermal-m... |
| 4 | Acral lentiginous melanoma images | https://dermnetnz.org/images/acral-lentiginous... |


In [14]:
name_link_df["skin_disorder_name"][100:140]

100                       Erythema nodosum images
101                             Erythrasma images
102                           Erythroderma images
103          Extragenital lichen sclerosus images
104                           Eyebrow loss images
105                           Eyelash loss images
106                      Eyelid dermatitis images
107                          Fabry disease images
108                            Facial acne images
109                       Facial psoriasis images
110    Fibrofolliculomas in Birt–Hogg–Dubé images
111                          Fifth disease images
112                     Fishtank granuloma images
113                     Flexural psoriasis images
114                           Folliculitis images
115                Folliculitis keloidalis images
116                  Fordyce angiokeratoma images
117                  Fungal skin infection images
118         Generalised pustular psoriasis images
119                  Genital Crohn disease images


### Selection of diseases from the source for scraping

In [43]:
index_list = [0,1,2,108,64,109,113,118,122,129,222,223]
disorder_names_nw= [disorder_names[i] for i in index_list]
print(disorder_names_nw)

['Acne affecting the back images', 'Acne affecting the face images', 'Acne and other follicular disorder images', 'Facial acne images', 'Chronic plaque psoriasis images', 'Facial psoriasis images', 'Flexural psoriasis images', 'Generalised pustular psoriasis images', 'Genital psoriasis images', 'Guttate psoriasis images', 'Psoriasis affecting the face images', 'Psoriasis of the scalp images']
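The hard-coded indices above depend on the order in which the website lists its categories. As a hedged alternative, the sketch below selects categories by keyword and keeps each name paired with its own link. The function `select_disorders` and the sample data are hypothetical stand-ins for `disorder_names` and `link_list`, and a keyword match may return more categories than the hand-picked indices.

```python
def select_disorders(names, links, keywords):
    """Return (name, link) pairs whose name contains any keyword (case-insensitive)."""
    if len(names) != len(links):
        raise ValueError("names and links must have the same length")
    lowered = [keyword.lower() for keyword in keywords]
    return [
        (name, link)
        for name, link in zip(names, links)
        if any(keyword in name.lower() for keyword in lowered)
    ]


# Hypothetical sample standing in for disorder_names and link_list
sample_names = [
    "Acne affecting the back images",
    "Actinic keratosis dermoscopy images",
    "Guttate psoriasis images",
]
sample_links = [
    "https://example.org/acne-back",
    "https://example.org/actinic-keratosis",
    "https://example.org/guttate-psoriasis",
]

selected = select_disorders(sample_names, sample_links, ["acne", "psoriasis"])
for name, link in selected:
    print(name, "->", link)
```

Because names and links are filtered together, a selected name can never be paired with the link of a different category.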


In [44]:
# initialize the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# create an empty list to store the dataframes for each link/disease pair
dfs = []
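# loop through each link/disease pair, taking the links at the same selected
# indices so that each link matches its disease name
for link, disease_name in zip([link_list[i] for i in index_list], disorder_names_nw):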
    # create an empty list to store the image URLs
    image_disease = []

    try:
        # navigate to the link
        driver.get(link)

        # maximize the window to ensure all elements are visible
        driver.maximize_window()

        # find all the elements on the page with the class "imageLinkBlock__item__image"
        skin_image_disorder = driver.find_elements("xpath", '//div[@class="imageLinkBlock__item__image"]')

        # loop through each element and find all the images within it
        for image in skin_image_disorder:
            list_ = image.find_elements("tag name", "img")
            for lists in list_:
                # add each image URL to the image_disease list
                image_disease.append(lists.get_attribute("src"))

        # create a list of dictionaries, where each dictionary represents a row in the DataFrame
        # each dictionary contains an image URL and the disease name
        rows = [{'skin_disorder_name': disease_name, 'images': url} for url in image_disease]

        # create the DataFrame using the list of dictionaries
        df = pd.DataFrame(rows, columns=['skin_disorder_name', 'images'])

        # add the dataframe to the list of dataframes
        dfs.append(df)

    except Exception as e:
        # if an error occurs, print the error message and move to the next link/disease pair
        print(f"Error occurred for {disease_name}: {str(e)}")
        continue

# close the browser now that scraping is complete
driver.quit()

# concatenate all the dataframes into a single dataframe
result_df = pd.concat(dfs, ignore_index=True)

# Turning all the skin_disorder_names to lower case:
result_df['skin_disorder_name'] = result_df['skin_disorder_name'].str.lower()

# save the result dataframe to a CSV file named "data1-294.csv"
result_df.to_csv('Data/data1-294.csv', index=False)

In [22]:
# DataFrame with all the images:
image_df = pd.read_csv('Data/data1-294.csv')

# Previewing the first five rows of the dataframe
image_df.head()

| | skin_disorder_name | images |
|---|---|---|
| 0 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 1 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 2 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 3 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 4 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |


### View all the images to be downloaded as an html list

In [45]:
# Function takes in the image url and returns an html <img> tag that displays the image
def to_img_tag(path):
    return '<img src="'+ path + '" width="50" >'

# Save the HTML table to a file
with open('Data/image_table.html', 'w') as f:
    f.write(image_df.to_html(escape=False,formatters=dict(images=to_img_tag)))
    
# Open the HTML file in the default web browser
webbrowser.open('Data/image_table.html')

True
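The `to_img_tag` helper concatenates the URL straight into the HTML attribute. A minimal sketch of a more defensive variant follows, assuming only the standard library; the name `to_img_tag_escaped` and the sample URL are hypothetical. It escapes characters such as `&` and `"` so that an unusual URL cannot break the generated table.

```python
import html


def to_img_tag_escaped(path, width=50):
    """Build an <img> tag with the URL escaped for use inside an HTML attribute."""
    safe_src = html.escape(str(path), quote=True)
    return f'<img src="{safe_src}" width="{int(width)}">'


# Hypothetical URL containing characters that need escaping
tag = to_img_tag_escaped('https://example.org/a.jpg?x=1&y="2"')
print(tag)
```

The function can be passed to `to_html` through `formatters` in the same way as `to_img_tag`.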

### Downloading and Saving the images into a folder

In [46]:
def save_image(folder: str, name: str, url: str, index: int):
    # Get the data from the url; skip the image on an HTTP error so that
    # an error page is never written to disk as an image
    image_source = rq.get(url, timeout=30)
    if not image_source.ok:
        print(f'Skipped {url}: HTTP {image_source.status_code}')
        return

    # If there's a suffix, we will grab that
    suffix = pathlib.Path(url).suffix

    # Default to .png unless the suffix is a known image extension
    if suffix not in ['.jpg', '.jpeg', '.png', '.gif']:
        suffix = '.png'
    output = name + str(index) + suffix

    # Create the folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)

    # Create our output in the specified folder (wb = write bytes)
    with open(os.path.join(folder, output), 'wb') as file:
        file.write(image_source.content)
    print(f'Successfully downloaded: {output}')


if __name__ == '__main__':
    # Load the dataframe with image urls and disease names
    df = pd.read_csv('Data/data1-294.csv')

    # Loop through the dataframe
    for index, row in df.iterrows():
        # Get the image url and disease name
        image_url = row['images']
        disease_name = row['skin_disorder_name']

        # Save the image
        save_image('Images/', disease_name, image_url, index)

Successfully downloaded: acne affecting the back images0.jpg
Successfully downloaded: acne affecting the back images1.jpg
Successfully downloaded: acne affecting the back images2.jpg
Successfully downloaded: acne affecting the back images3.jpg
Successfully downloaded: acne affecting the back images4.jpg
Successfully downloaded: acne affecting the back images5.jpg
Successfully downloaded: acne affecting the back images6.jpg
Successfully downloaded: acne affecting the back images7.jpg
Successfully downloaded: acne affecting the back images8.jpg
Successfully downloaded: acne affecting the back images9.jpg
Successfully downloaded: acne affecting the back images10.jpg
Successfully downloaded: acne affecting the back images11.jpg
Successfully downloaded: acne affecting the back images12.jpg
Successfully downloaded: acne affecting the back images13.jpg
Successfully downloaded: acne affecting the back images14.jpg
Successfully downloaded: acne affecting the back images15.jpg
Successfully downl

Successfully downloaded: acne affecting the face images133.jpg
Successfully downloaded: acne affecting the face images134.jpg
Successfully downloaded: acne affecting the face images135.jpg
Successfully downloaded: acne affecting the face images136.jpg
Successfully downloaded: acne affecting the face images137.jpg
Successfully downloaded: acne affecting the face images138.jpg
Successfully downloaded: acne affecting the face images139.jpg
Successfully downloaded: acne affecting the face images140.jpg
Successfully downloaded: acne affecting the face images141.jpg
Successfully downloaded: acne affecting the face images142.jpg
Successfully downloaded: acne affecting the face images143.jpg
Successfully downloaded: acne affecting the face images144.jpg
Successfully downloaded: acne affecting the face images145.jpg
Successfully downloaded: acne affecting the face images146.jpg
Successfully downloaded: acne affecting the face images147.jpg
Successfully downloaded: acne affecting the face images

Successfully downloaded: acne affecting the face images264.jpg
Successfully downloaded: acne affecting the face images265.jpg
Successfully downloaded: acne affecting the face images266.jpg
Successfully downloaded: acne affecting the face images267.jpg
Successfully downloaded: acne affecting the face images268.jpg
Successfully downloaded: acne affecting the face images269.jpg
Successfully downloaded: acne affecting the face images270.jpg
Successfully downloaded: acne affecting the face images271.jpg
Successfully downloaded: acne affecting the face images272.jpg
Successfully downloaded: acne affecting the face images273.jpg
Successfully downloaded: acne affecting the face images274.jpg
Successfully downloaded: acne affecting the face images275.jpg
Successfully downloaded: acne affecting the face images276.jpg
Successfully downloaded: acne affecting the face images277.jpg
Successfully downloaded: acne affecting the face images278.jpg
Successfully downloaded: acne affecting the face images

Successfully downloaded: acne and other follicular disorder images380.jpg
Successfully downloaded: acne and other follicular disorder images381.jpg
Successfully downloaded: acne and other follicular disorder images382.jpg
Successfully downloaded: acne and other follicular disorder images383.jpg
Successfully downloaded: acne and other follicular disorder images384.jpg
Successfully downloaded: acne and other follicular disorder images385.jpg
Successfully downloaded: acne and other follicular disorder images386.jpg
Successfully downloaded: acne and other follicular disorder images387.jpg
Successfully downloaded: acne and other follicular disorder images388.jpg
Successfully downloaded: acne and other follicular disorder images389.jpg
Successfully downloaded: acne and other follicular disorder images390.jpg
Successfully downloaded: acne and other follicular disorder images391.jpg
Successfully downloaded: acne and other follicular disorder images392.jpg
Successfully downloaded: acne and othe

Successfully downloaded: chronic plaque psoriasis images503.jpg
Successfully downloaded: chronic plaque psoriasis images504.jpg
Successfully downloaded: chronic plaque psoriasis images505.jpg
Successfully downloaded: chronic plaque psoriasis images506.jpg
Successfully downloaded: chronic plaque psoriasis images507.jpg
Successfully downloaded: chronic plaque psoriasis images508.jpg
Successfully downloaded: chronic plaque psoriasis images509.jpg
Successfully downloaded: chronic plaque psoriasis images510.jpg
Successfully downloaded: facial psoriasis images511.jpg
Successfully downloaded: facial psoriasis images512.jpg
Successfully downloaded: facial psoriasis images513.jpg
Successfully downloaded: facial psoriasis images514.jpg
Successfully downloaded: facial psoriasis images515.jpg
Successfully downloaded: facial psoriasis images516.jpg
Successfully downloaded: facial psoriasis images517.jpg
Successfully downloaded: facial psoriasis images518.jpg
Successfully downloaded: facial psoriasi

Successfully downloaded: genital psoriasis images642.jpg
Successfully downloaded: genital psoriasis images643.jpg
Successfully downloaded: genital psoriasis images644.jpg
Successfully downloaded: genital psoriasis images645.jpg
Successfully downloaded: genital psoriasis images646.jpg
Successfully downloaded: genital psoriasis images647.jpg
Successfully downloaded: genital psoriasis images648.jpg
Successfully downloaded: genital psoriasis images649.jpg
Successfully downloaded: genital psoriasis images650.jpg
Successfully downloaded: genital psoriasis images651.jpg
Successfully downloaded: genital psoriasis images652.jpg
Successfully downloaded: genital psoriasis images653.jpg
Successfully downloaded: genital psoriasis images654.jpg
Successfully downloaded: genital psoriasis images655.jpg
Successfully downloaded: genital psoriasis images656.jpg
Successfully downloaded: genital psoriasis images657.jpg
Successfully downloaded: guttate psoriasis images658.jpg
Successfully downloaded: guttat
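`save_image` reads the file extension from the whole URL string, so a URL with a query string or an upper-case extension falls back to `.png`. The sketch below shows one possible way to derive the output name from the URL path only. The helper `build_output_name` and the example URLs are hypothetical, not part of the scraped data.

```python
import pathlib
from urllib.parse import urlparse

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def build_output_name(name, index, url, default_suffix=".png"):
    """Return '<name><index><suffix>', taking the suffix from the URL path."""
    # Use only the path component, ignoring any ?query or #fragment
    suffix = pathlib.PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        suffix = default_suffix
    return f"{name}{index}{suffix}"


# Hypothetical URLs for illustration
with_query = build_output_name("facial acne images", 5, "https://example.org/a/b.JPG?v=2")
without_suffix = build_output_name("facial acne images", 6, "https://example.org/img/photo")
print(with_query)
print(without_suffix)
```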

## 2. Data Cleaning

In [26]:
# import required libraries
import shutil
import os
from PIL import Image
import imagehash
import random

In [27]:
# load and preview dataset
image_df = pd.read_csv('Data/data1-294.csv')
print(image_df.shape)
image_df.head()

(615, 2)


| | skin_disorder_name | images |
|---|---|---|
| 0 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 1 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 2 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 3 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |
| 4 | facial acne images | https://dermnetnz.org/assets/Uploads/acne/acne... |


In [28]:
# Function to randomly select up to 1000 images
def reduce_images(folder_path):
    # Get the list of image file names
    file_names = os.listdir(folder_path)

    # Shuffle the file names
    random.shuffle(file_names)

    # Select the first 1000 file names
    selected_file_names = file_names[:1000]

    # Create a new folder to store the selected images
    # (rstrip avoids a path such as 'Images/_images' when folder_path ends with a slash)
    selected_folder_path = f"{folder_path.rstrip('/')}_images"
    os.makedirs(selected_folder_path, exist_ok=True)

    # Copy the selected images to the new folder
    for file_name in selected_file_names:
        file_path = os.path.join(folder_path, file_name)
        selected_file_path = os.path.join(selected_folder_path, file_name)
        shutil.copy(file_path, selected_file_path)
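Shuffling without a seed gives a different subset on every run. The sketch below is one possible reproducible variant, run here on a throwaway directory of empty placeholder files. The function `sample_images`, its `seed` parameter, and the demo file names are assumptions made for illustration.

```python
import os
import random
import shutil
import tempfile


def sample_images(folder_path, n=1000, seed=None):
    """Copy up to n randomly chosen files into a sibling '<folder>_images' directory."""
    folder_path = os.path.normpath(folder_path)  # drops any trailing slash
    # Sort first so that the same seed always yields the same selection
    file_names = sorted(
        f for f in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, f))
    )
    rng = random.Random(seed)
    chosen = rng.sample(file_names, min(n, len(file_names)))

    target = f"{folder_path}_images"
    os.makedirs(target, exist_ok=True)
    for file_name in chosen:
        shutil.copy(os.path.join(folder_path, file_name),
                    os.path.join(target, file_name))
    return target, chosen


# Demonstration on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "demo")
    os.makedirs(source)
    for i in range(5):
        open(os.path.join(source, f"img{i}.jpg"), "wb").close()
    target, chosen = sample_images(source + os.sep, n=3, seed=0)
    copied = sorted(os.listdir(target))
    print(len(chosen), "files copied")
```

Using a dedicated `random.Random(seed)` instance keeps the selection reproducible without changing the global random state.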

### **Cleaning Non-Leprosy Images**

**i. Moving acne images in the Images folder to their own folder**

In [47]:
# Labels representing acne in DermNet's scraped data
acne_labels = list(image_df[image_df['skin_disorder_name'].str.contains('acne')]['skin_disorder_name'].unique())

# removing acne labels whose images will not be used because they are not clear
#acne_labels.remove('infantile acne images')
#acne_labels.remove('steroid acne images')

acne_labels

['facial acne images']

In [48]:
# Getting the acne images file names
acne_name_patterns = (
    'acne affecting the back images',
    'acne affecting the face images',
    'acne and other follicular disorder images',
    'facial acne images',
)
original_acne_img = [image_name for image_name in os.listdir('Images/')
                     if any(pattern in image_name for pattern in acne_name_patterns)]

# Confirming the number of acne images before any cleaning
print('There are', len(original_acne_img),'acne images')
original_acne_img[:5]

There are 514 acne images


['acne affecting the face images194.jpg',
 'acne and other follicular disorder images327.jpg',
 'acne and other follicular disorder images415.jpg',
 'acne and other follicular disorder images311.jpg',
 'facial acne images5.jpg']
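The file names combine the label, the dataframe index, and the extension. A minimal sketch for splitting them apart and counting images per label is shown below, using the five file names previewed above. The helpers `split_file_name` and `count_per_label` are hypothetical names, and the pattern assumes the `<label><index>.<ext>` naming produced by `save_image`.

```python
import re
from collections import Counter

# Label, then a numeric index, then the extension, e.g. "facial acne images5.jpg"
FILE_NAME_PATTERN = re.compile(r"^(?P<label>.+?)(?P<index>\d+)\.(?P<ext>\w+)$")


def split_file_name(file_name):
    """Return (label, index) for a scraped file name, or None if it does not match."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return match.group("label"), int(match.group("index"))


def count_per_label(file_names):
    """Count how many file names belong to each label."""
    parsed = (split_file_name(name) for name in file_names)
    return Counter(label for label, _ in filter(None, parsed))


sample_files = [
    "acne affecting the face images194.jpg",
    "acne and other follicular disorder images327.jpg",
    "acne and other follicular disorder images415.jpg",
    "acne and other follicular disorder images311.jpg",
    "facial acne images5.jpg",
]
label_counts = count_per_label(sample_files)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```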

In [49]:
# Creating a new folder with just acne images to make cleaning easier
folder_name = 'cleaned_images/acne_images/'

# Checking if the folder exists and deleting it if it exists
if os.path.exists(folder_name):
    # deleting the folder and its contents
    shutil.rmtree(folder_name)

# create the new folder (and the parent cleaned_images folder if it is missing)
os.makedirs(folder_name)

# Copying the images into that folder
for img in original_acne_img:
    origin = os.path.join('Images/', img)
    destination = os.path.join(folder_name, img)
    shutil.copy(origin, destination)

# Confirming that the number of acne images is unchanged after copying them to a separate folder
acne_img = [image_name for image_name in os.listdir('cleaned_images/acne_images/')]
print('There are', len(acne_img),'acne images.')

There are 514 acne images.


In [50]:
# extra acne images
extra_acne = [image_name for image_name in os.listdir('extra_images/extra_acne_images/')]
extra_acne[:5]

[]

**ii. Combining the images into one folder**

In [51]:
# moving the extra images into the acne folder
for img in extra_acne:
    origin = os.path.join('extra_images/extra_acne_images/', img)
    destination = os.path.join('cleaned_images/acne_images/', img)
    shutil.copy(origin, destination)

# Confirming the total number of acne images before any cleaning
acne_img = [image_name for image_name in os.listdir('cleaned_images/acne_images/')]
print('There are a total of', len(acne_img),'acne images.')

There are a total of 514 acne images.


**iii. Removing duplicate images from the folder**

In [52]:
# Function for removing duplicated images.
def drop_duplicated_images(folder):

    # Threshold for image similarity (not used below: only images with
    # identical hash values are treated as duplicates)
    threshold = 8

    # Define a dictionary to store the hash values and file paths of the images
    image_hashes = {}
    duplicated_images = []

    # Loop through all the image files in a directory
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)

        # Load the image file and compute its hash value using the average hash
        # algorithm; the with-block closes the file so it can be deleted afterwards
        with Image.open(file_path) as image:
            hash_value = imagehash.average_hash(image)

        # Check if the hash value is already in the dictionary
        if hash_value in image_hashes:
            # If the same hash value already exists, delete the duplicate image
            duplicated_images.append(filename)
            os.remove(file_path)
        else:
            # Otherwise, add the hash value and file path to the dictionary
            image_hashes[hash_value] = file_path

    return duplicated_images
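The duplicate check above matches hash values exactly. If near-duplicates should also be caught, one common approach is to compare the Hamming distance between hashes against a threshold. The sketch below illustrates the idea with plain 64-bit integers standing in for average hashes, so it runs without any image files. The function names and sample values are hypothetical, and the threshold of 8 is used purely for illustration. With the `imagehash` library, subtracting two hash objects should give the same distance, but treat that as an assumption to verify against its documentation.

```python
def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def find_near_duplicates(hashes, threshold=8):
    """Given {file name: integer hash}, return the names that are within
    `threshold` bits of an earlier, kept file."""
    kept = {}
    duplicates = []
    for name, value in hashes.items():
        if any(hamming_distance(value, kept_value) <= threshold
               for kept_value in kept.values()):
            duplicates.append(name)
        else:
            kept[name] = value
    return duplicates


# Hypothetical 64-bit hash values
sample_hashes = {
    "a.jpg": 0xFFFF0000FFFF0000,
    "b.jpg": 0xFFFF0000FFFF0003,  # 2 bits differ from a.jpg
    "c.jpg": 0x0000FFFF0000FFFF,  # all 64 bits differ from a.jpg
}
near_duplicates = find_near_duplicates(sample_hashes)
print(near_duplicates)
```

Each file is compared against every kept file, so the cost grows quadratically with the number of images; for folders of a few hundred images this is usually acceptable.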

In [53]:
# Dropping duplicates
duplicated_images = drop_duplicated_images('cleaned_images/acne_images/')

# number of acne images after removing duplicated images
acne_img = [image_name for image_name in os.listdir('cleaned_images/acne_images/')]
print('There are', len(acne_img),'acne images after removing duplicated images')

There are 470 acne images after removing duplicated images


In [54]:
# dropping selected images from the acne_images folder, identified by the index in their file names
indexes_to_drop = [295, 296, 297, 298, 300, 303, 304, 307, 308, 309, 310, 311, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 325, 326, 328, 329, 330, 333, 337, 338, 339, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 354, 355, 359, 361, 362, 363, 364, 366, 367, 368, 371, 372, 373, 374, 375, 376, 378, 380, 381, 382, 384, 385, 387, 388, 389, 390, 392, 393, 395, 396, 397, 398, 402, 403, 405, 408, 409, 411, 413, 415, 416, 417, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 431, 432, 433, 434, 436, 437, 438, 441, 443, 444, 445, 446, 447]

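for filename in os.listdir('cleaned_images/acne_images/'):
    for index in indexes_to_drop:
        # match the full index (up to the dot) so that e.g. 295 does not also match 2950
        if f"images{index}." in filename.lower():
            os.remove(os.path.join('cleaned_images/acne_images/', filename))
            # the file is gone, so stop checking the remaining indexes
            break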

print("Number of acne images left:", len(os.listdir('cleaned_images/acne_images/')))

Number of acne images left: 361
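Matching an index as a substring of the file name can pick up unintended files when one index is a prefix of another. A hedged sketch of an exact match follows: it parses the number that precedes the extension and checks membership in a set. The helper `files_to_drop` and the sample file names are hypothetical.

```python
import re

# Captures the number between "images" and the file extension
INDEX_PATTERN = re.compile(r"images(\d+)\.\w+$")


def files_to_drop(file_names, indexes):
    """Return the file names whose trailing index is exactly one of `indexes`."""
    index_set = set(indexes)
    selected = []
    for file_name in file_names:
        match = INDEX_PATTERN.search(file_name.lower())
        if match and int(match.group(1)) in index_set:
            selected.append(file_name)
    return selected


# Hypothetical file names: only the middle one has index 295
sample_files = [
    "facial acne images29.jpg",
    "facial acne images295.jpg",
    "facial acne images2950.jpg",
]
to_drop = files_to_drop(sample_files, [295])
print(to_drop)
```

Returning the list first, instead of deleting inside the loop, makes it possible to review the selection before calling `os.remove` on each entry.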


### **Cleaning Psoriasis Images**

**i. Moving psoriasis images in the Images folder to their own folder**

In [55]:
# Labels representing psoriasis in DermNet's scraped data
psoriasis_labels = list(image_df[image_df['skin_disorder_name'].str.contains('psoriasis')]['skin_disorder_name'].unique())
print(psoriasis_labels)

# Count of labels representing psoriasis
len(psoriasis_labels)

['chronic plaque psoriasis images', 'facial psoriasis images', 'flexural psoriasis images', 'generalised pustular psoriasis images', 'genital psoriasis images', 'guttate psoriasis images']


6

In [56]:
# Getting the psoriasis images file names
psoriasis_img = [image_name for image_name in os.listdir('Images/') if 'psoriasis' in image_name]

# Checking if psoriasis folder exists and deleting it if it exists
if os.path.exists('cleaned_images/psoriasis/'):
    # deleting the folder and its contents
    shutil.rmtree('cleaned_images/psoriasis/')

# Creating a new folder with just psoriasis images to make cleaning easier
os.mkdir('cleaned_images/psoriasis/')
for img in psoriasis_img:
    origin = os.path.join('Images/', img)
    destination = os.path.join('cleaned_images/psoriasis/', img)
    shutil.copy(origin, destination)

for filename in os.listdir('extra_images/extra_psoriasis_images/'):
    src_path = os.path.join('extra_images/extra_psoriasis_images/', filename)
    dst_path = os.path.join('cleaned_images/psoriasis/', filename)
    shutil.copy(src_path, dst_path)

# Number of psoriasis images after moving them to a separate folder
psoriasis_img = [image_name for image_name in os.listdir('cleaned_images/psoriasis/')]
print('There are', len(psoriasis_img),'psoriasis images')

There are 940 psoriasis images


**ii. Removing duplicated images from the folder**

In [57]:
# Dropping duplicates
duplicated_images = drop_duplicated_images('cleaned_images/psoriasis/')

psoriasis_img = [image_name for image_name in os.listdir('cleaned_images/psoriasis/')]
print('There are', len(psoriasis_img),'psoriasis images after removing duplicate images')

There are 732 psoriasis images after removing duplicate images
