## MNIST Skin Cancer 

This project applies machine learning techniques to the MNIST Skin Cancer dataset to detect and predict the presence of malignant cancer in images of patient lesions. Both the medical community and patients have a strong interest in a robust cancer detection tool. Most diagnoses in this dataset do not indicate a malignancy; however, the medical community is seeking an additional diagnostic tool in the hope of avoiding unnecessary surgeries and the complications they may entail. Researchers have therefore compiled this dataset of lesion images and basic patient information to support that effort.

## Exploratory Data Analysis

### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

np.random.seed(42)  # fix the random seed for reproducibility

%matplotlib inline

### Reading in the MetaData 

In [2]:
skin = pd.read_csv('./../datasets/HAM10000_metadata.csv')

In [3]:
skin_8by8 = pd.read_csv('./../datasets/hmnist_8_8_RGB.csv')  # low res images

In [4]:
skin_28by28 = pd.read_csv('./../datasets/hmnist_28_28_RGB.csv')  # medium res images

### Exploring the dataset

This shows the first five rows of the metadata file, which describes each full-resolution image.

In [5]:
skin.head()

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization
0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp
1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp
2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp
3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp
4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear


Here are the first five rows of the low-resolution pixel CSV file.

In [None]:
skin_8by8.head()

Here are the first five rows of the medium-resolution pixel CSV file.

In [None]:
skin_28by28.head()

In [None]:
skin.shape

In [None]:
skin_8by8.shape

In [None]:
skin_28by28.shape

This dataset has seven columns: a lesion ID, an image ID, the diagnosis, the diagnosis method, patient age, patient sex, and lesion location. There are 10,015 rows in this set. Some of the lesions have more than one image. In image processing, images are often rotated to teach the model to recognize the target in different orientations.
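As a rough illustration of the rotation idea, the sketch below rebuilds one image from a flat row of pixel values and rotates it in 90-degree steps. The pixel values are random stand-ins, and the row-major height x width x channel layout is an assumption about the 28x28 RGB file, not something confirmed here.

```python
import numpy as np

# Hypothetical stand-in for one row of the 28x28 RGB pixel file:
# 28 * 28 * 3 = 2352 pixel values (a label column, if present, is excluded).
rng = np.random.default_rng(42)
pixels = rng.integers(0, 256, size=28 * 28 * 3, dtype=np.uint8)

# Assumed layout: row-major height x width x channel.
image = pixels.reshape(28, 28, 3)

# Rotate by 0, 90, 180 and 270 degrees in the image plane.
rotations = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]

for k, rotated in enumerate(rotations):
    print(k * 90, rotated.shape)
```

Each rotated copy keeps the same label as the original image, which is what lets rotation act as a simple form of augmentation.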

In [None]:
skin['image_id'].nunique()

There are 10,015 images in the dataset, but some lesions have more than one image, so the count of unique lesions below is smaller.

In [None]:
skin['lesion_id'].nunique()
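To see how the images are spread across lesions, one option is to count images per lesion. This is a minimal sketch on a made-up miniature of the metadata table; the IDs are hypothetical, and only the column names `lesion_id` and `image_id` come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the metadata table; IDs are made up for illustration.
meta = pd.DataFrame({
    "lesion_id": ["L1", "L1", "L2", "L3", "L3", "L3"],
    "image_id": ["I1", "I2", "I3", "I4", "I5", "I6"],
})

# Number of distinct images recorded for each lesion.
images_per_lesion = meta.groupby("lesion_id")["image_id"].nunique()

# Lesions that appear with more than one image.
multi_image_lesions = images_per_lesion[images_per_lesion > 1]

print(images_per_lesion.value_counts().sort_index())
print(len(multi_image_lesions), "lesions have more than one image")
```

Knowing which lesions have several images may matter later, since images of the same lesion landing in both the training and test sets could inflate test scores.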

### Checking for missing information

In [None]:
skin.isnull().sum()

There are some missing values in the age column, and some categories include a choice called unknown. Since the model will work only with the image files, the missing data will not affect model performance.
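Because `isnull()` does not count the unknown placeholder, the two kinds of missing information can be tallied side by side. The sketch below uses a small hypothetical frame; the values are invented, and treating the literal string `unknown` as the placeholder is an assumption based on the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the metadata table.
meta = pd.DataFrame({
    "age": [80.0, np.nan, 45.0, 60.0],
    "sex": ["male", "female", "unknown", "male"],
    "localization": ["scalp", "unknown", "back", "unknown"],
})

# True missing values (NaN) per column.
missing = meta.isnull().sum()

# Placeholder values in the categorical columns (assumed spelling: 'unknown').
categorical_cols = ["sex", "localization"]
unknown = (meta[categorical_cols] == "unknown").sum()

# Align both counts on the column names; columns not checked get 0.
summary = pd.DataFrame({"missing": missing, "unknown": unknown}).fillna(0).astype(int)
print(summary)
```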

### Checking the Data Types

In [None]:
skin.dtypes

### Diagnoses Column

dx is the column containing all the diagnoses. There are seven categories of diagnoses. The abbreviations are interpreted as follows:
1. nv: Melanocytic nevi (birthmarks and moles; can resemble melanoma)
2. mel: Melanoma (the most dangerous form of skin cancer)
3. bkl: Benign keratosis-like lesions
4. bcc: Basal cell carcinoma (rarely metastasizes but does spread)
5. akiec: Actinic keratoses (scaly patches due to years of sun exposure)
6. vasc: Vascular lesions (birthmarks; can be flat or raised, benign or malignant)
7. df: Dermatofibroma (superficial benign fibrous histiocytoma)

Below is a breakdown of the diagnoses in the dataset. The distribution is not even, leading to concerns about unbalanced classes.

In [None]:
skin.dx.value_counts(normalize=True, ascending=False)

In [None]:
skin['dx'].value_counts().plot.barh()

This chart shows how many images of each diagnosis are in the dataset. The distribution is far from even, with one diagnosis (nv) accounting for the majority of the images.
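One way to put a number on the imbalance is to compute class proportions, the ratio between the largest and smallest class, and per-class weights. The sketch below uses invented labels and the common "balanced" weighting heuristic, n_samples / (n_classes * class_count); it is an illustration, not necessarily the weighting used later in this project.

```python
import pandas as pd

# Hypothetical labels; the real dx column has seven classes.
dx = pd.Series(["nv"] * 6 + ["mel"] * 2 + ["bkl"] * 2, name="dx")

counts = dx.value_counts()
proportions = dx.value_counts(normalize=True)

# Ratio between the most and least frequent class.
imbalance_ratio = counts.max() / counts.min()

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = len(dx) / (counts.size * counts)

print(proportions)
print("imbalance ratio:", imbalance_ratio)
print(class_weight)
```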

### Diagnosis Method Column 

The dx_type column describes how each diagnosis was reached. There are four categories; one can imagine that a computer model may eventually become a fifth. The four current categories are:
1. histo: histopathological examination of physically removed cells
2. follow_up: no change in three or more visits or after 1.5 years
3. consensus: same conclusion by two unrelated dermatologists
4. confocal: close visual examination using microscopy

In [None]:
skin.dx_type.value_counts()


In [None]:
skin['dx_type'].value_counts().plot.barh()


This chart shows how the diagnoses were reached; histo and follow_up account for most of them.

### Localization column

This column describes where on the body the lesion is located.
1. back
2. lower extremity
3. trunk
4. upper extremity
5. abdomen
6. face
7. chest
8. foot
9. unknown
10. neck
11. scalp
12. hand
13. ear
14. genital
15. acral (fingers, toes)

In [None]:
skin['localization'].value_counts()

In [None]:
skin['localization'].value_counts().plot.barh()

This chart shows the distribution of the lesion locations. Logically, most occur on the parts of the body with the greatest surface area and the parts most exposed to the sun.

### Age column

The summary statistics and histogram below describe the distribution of patients by age.

In [None]:
skin.age.describe()

These are the statistics for the age column.

In [None]:
skin['age'].plot.hist(bins=50)

This chart shows the age distribution of the patients.

### Sex Column 

In [None]:
skin['sex'].value_counts()

In [None]:
skin['sex'].value_counts().plot.bar()

### Making numbered columns to look for correlations in a heatmap.

In [None]:
# Integer codes are arbitrary labels; they do not imply any ordering
skin['dx_num'] = skin['dx'].map({'nv': 0, 'mel': 1, 'bcc': 2, 'bkl': 3,
                                 'vasc': 4, 'df': 5, 'akiec': 6})

In [None]:
skin['dx_type_num'] = skin['dx_type'].map({'consensus': 0, 'histo': 1,
                                           'follow_up': 2, 'confocal': 3})

In [None]:
skin['localization_num'] = skin['localization'].map({
    'scalp': 0, 'ear': 1, 'face': 2, 'back': 3, 'trunk': 4, 'chest': 5,
    'upper extremity': 6, 'abdomen': 7, 'unknown': 8, 'lower extremity': 9,
    'genital': 10, 'neck': 11, 'hand': 12, 'foot': 13, 'acral': 14})

In [None]:
skin['sex_num'] = skin['sex'].map({'male': 1, 'female': 0})

In [None]:
feature_num=['sex_num','dx_num','dx_type_num',
             'age','localization_num']

In [None]:
skin_num=skin[feature_num]

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(skin_num.corr(), annot=True, cmap='Spectral');

The heatmap does not show strong relationships between the numbered columns. The strongest is between age and diagnosis, at 0.37.
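One caveat when reading these values: the integer codes assigned to the categories are arbitrary, and a Pearson correlation depends on which code each label receives. The toy example below, with invented ages and labels, encodes the same three diagnoses in two different ways and compares the resulting correlation with age.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real metadata.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 70],
    "dx": ["nv", "nv", "bkl", "bkl", "mel", "mel"],
})

# Two equally arbitrary integer codings of the same three labels.
coding_a = {"nv": 0, "bkl": 1, "mel": 2}
coding_b = {"nv": 0, "mel": 1, "bkl": 2}

# Pearson correlation between age and each coded version of dx.
corr_a = toy["age"].corr(toy["dx"].map(coding_a))
corr_b = toy["age"].corr(toy["dx"].map(coding_b))

print(f"coding A: {corr_a:.2f}, coding B: {corr_b:.2f}")
```

The two codings give different correlations for identical data, so the heatmap values for the coded categorical columns are best treated as a rough screen rather than a measure of association.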

In [None]:
plt.figure(figsize = (12,6))
# Pass x and y as keywords; recent seaborn versions reject positional data vectors
sns.scatterplot(x=skin['age'], y=skin['dx'],
                s=30,
                color='r')
plt.xlabel('Age',fontsize=20)
plt.ylabel('Diagnosis',fontsize = 20)
plt.title('Diagnosis And Age',fontsize=25)
plt.show();


The scatterplot above shows that patients over 30 have all seven diagnoses, while younger patients have only a few.

In [None]:
plt.figure(figsize = (18,6))
# Localization is on the x-axis and diagnosis on the y-axis
sns.scatterplot(x=skin['localization'], y=skin['dx'],
                s=30,
                color='r')
plt.xlabel('Localization', fontsize=20)
plt.ylabel('Diagnosis', fontsize=20)
plt.title('Diagnosis And Localization',fontsize=25)
plt.show();


The chart above shows that certain conditions occur across the whole body surface, while others occur only in specific areas.
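A scatterplot of two categorical columns shows which combinations exist but not how often they occur. A cross-tabulation gives the counts directly. This is a minimal sketch on invented rows; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the metadata table.
meta = pd.DataFrame({
    "localization": ["back", "back", "face", "face", "scalp", "back"],
    "dx": ["nv", "mel", "bkl", "bkl", "bkl", "nv"],
})

# Rows: body site, columns: diagnosis, cells: number of images.
table = pd.crosstab(meta["localization"], meta["dx"])

# Number of distinct body sites on which each diagnosis appears.
sites_per_dx = (table > 0).sum(axis=0)

print(table)
print(sites_per_dx)
```

The same call with `dx_type` in place of `localization` would give the equivalent counts for diagnosis method.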

In [None]:
plt.figure(figsize = (12,6))
# Pass x and y as keywords; recent seaborn versions reject positional data vectors
sns.scatterplot(x=skin['dx_type'], y=skin['dx'],
                s=30,
                color='r')
plt.xlabel('Method',fontsize=20)
plt.ylabel('Diagnosis',fontsize = 20)
plt.title('Diagnosis And Method',fontsize=25)
plt.show();


Of the four diagnostic methods, histopathological examination (of removed tissue) is used for all seven diagnoses. The use of the other three methods is much more limited.

In [None]:
sns.countplot(x='dx', hue='age', data=skin)
plt.legend(loc='right')

This analysis has helped shape the approach to modeling the data. Rather than a multiclass classification in which the model would try to identify each type of lesion, it makes more sense to divide the dataset into malignant and benign subgroups and treat this as a binary classification problem. Melanoma and carcinoma are in the malignant subgroup, while the remaining diagnoses make up the benign subgroup.
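The binary target described above could be derived from the dx column along these lines. This is a sketch under one assumption: "carcinoma" is taken to mean the bcc category, so the malignant set is `mel` and `bcc`. The column name `malignant` is hypothetical.

```python
import pandas as pd

# Assumption: "carcinoma" refers to bcc (basal cell carcinoma).
MALIGNANT = {"mel", "bcc"}

# One hypothetical row per diagnosis category.
meta = pd.DataFrame({"dx": ["nv", "mel", "bkl", "bcc", "akiec", "vasc", "df"]})

# 1 = malignant, 0 = benign (all remaining diagnoses).
meta["malignant"] = meta["dx"].isin(MALIGNANT).astype(int)

print(meta)
print(meta["malignant"].value_counts())
```

Checking `value_counts()` on the new column is a quick way to see how unbalanced the two subgroups are before any model is trained.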