# Overview

This notebook contains all code and discussion for the __intermediate iteration__ of the research question on __lesion diagnosis/type and malignancy__. Specifically, it asks whether malignancy rates differ to a statistically significant degree between lesion diagnosis types. It is a more advanced version of [this novice module](../novice/Q3.ipynb), which covers the same features.

# Table of Contents

I. [Setup]

II. [Data Retrieval]

1. [File Access]
2. [Loading & Processing]

III. [Analysis]

1. 
2. 

IV. [Discussion]

# Setup

Refer to [this module](../novice/Data_Download.ipynb) to replicate the data-download process using the ISIC Python tool. The command is repeated below; file and directory locations may need adjustment.

`
python download_archive.py \
--num-images=50 \
--images-dir ../../sample_imgs \
--descs-dir ../../sample_dscs -s \
--seg-dir ../../sample_segs --seg-skill expert
`

The following are necessary imports for this module.

In [1]:
# data retrieval
import glob
import json

# data manipulation, analysis, and visualization
import pandas as pd

# Data Retrieval

## File Access

Image manipulation is not needed for this module, which shortens data retrieval. The same `glob.glob` call as before can be used to get a list of paths, this time for the description files only.

In [2]:
dsc_filepaths = glob.glob('../../sample_dscs/*')
print('Descriptions: ', len(dsc_filepaths))

Descriptions:  50
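
As a minimal sketch of the same step, the lookup can be wrapped in a small helper that skips subdirectories and sorts the result, so that `dsc_filepaths[0]` refers to the same file on every run. The helper name `list_description_files` is hypothetical and not part of the ISIC tool.

```python
import glob
import os


def list_description_files(dsc_dir):
    """Return a sorted list of regular files found directly in dsc_dir."""
    paths = glob.glob(os.path.join(dsc_dir, '*'))
    # keep regular files only, and sort so the order is the same on every run
    return sorted(p for p in paths if os.path.isfile(p))


# same directory as in the cell above; adjust if the data lives elsewhere
dsc_filepaths = list_description_files('../../sample_dscs')
print('Descriptions: ', len(dsc_filepaths))
```

Sorting matters because `glob.glob` does not guarantee any particular order of results.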


## Loading & Processing

Descriptions are stored in JSON format, as before. Opening a single file helps identify the attributes of interest. Here, the attributes `benign_malignant` and `diagnosis` (both nested in the `clinical` attribute under the `meta` key) directly represent the two features in question.

In [3]:
json.load(open(dsc_filepaths[0], 'r'))

{'_id': '5436e3abbae478396759f0cf',
 '_modelType': 'image',
 'created': '2014-10-09T19:36:11.989000+00:00',
 'creator': {'_id': '5450e996bae47865794e4d0d', 'name': 'User 6VSN'},
 'dataset': {'_accessLevel': 0,
  '_id': '5a2ecc5e1165975c945942a2',
  'description': 'Moles and melanomas.\nBiopsy-confirmed melanocytic lesions. Both malignant and benign lesions are included.',
  'license': 'CC-0',
  'name': 'UDA-1',
  'updated': '2014-11-10T02:39:56.492000+00:00'},
 'meta': {'acquisition': {'image_type': 'dermoscopic',
   'pixelsX': 1022,
   'pixelsY': 767},
  'clinical': {'age_approx': 55,
   'anatom_site_general': 'anterior torso',
   'benign_malignant': 'benign',
   'diagnosis': 'nevus',
   'diagnosis_confirm_type': None,
   'melanocytic': True,
   'sex': 'female'}},
 'name': 'ISIC_0000000',
 'notes': {'reviewed': {'accepted': True,
   'time': '2014-11-10T02:39:56.492000+00:00',
   'userId': '5436c6e7bae4780a676c8f93'},
  'tags': ['Challenge 2018: Task 1-2: Training',
   'Challenge 2019:

Each data point can therefore be treated as a (malignancy, diagnosis) pair. We can iterate over all file paths, extract the two values, and store them as tuples, as follows.

In [9]:
# first load descriptions, and extract clinical variables
dscs = [json.load(open(x, 'r'))['meta']['clinical'] for x in dsc_filepaths]

# make a list of data points, with relevant variables only
data = [(x['benign_malignant'], x['diagnosis']) for x in dscs]

# sample output
for i in range(3):
    print('Data point #%d: %s' % (i, data[i]))

Data point #0: ('benign', 'nevus')
Data point #1: ('benign', 'nevus')
Data point #2: ('malignant', 'melanoma')
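
The extraction above assumes that every description contains both fields. A more defensive variant is sketched here, under the assumption (not confirmed by this sample) that some records may lack `benign_malignant` or `diagnosis`. The function name `load_pairs` is hypothetical.

```python
import json


def load_pairs(paths):
    """Return (benign_malignant, diagnosis) tuples, one per description file."""
    pairs = []
    for path in paths:
        # the context manager closes each file as soon as it has been read
        with open(path, 'r') as f:
            dsc = json.load(f)
        # fall back to an empty dict if 'meta' or 'clinical' is absent
        clinical = dsc.get('meta', {}).get('clinical', {})
        # .get returns None for a missing field instead of raising KeyError
        pairs.append((clinical.get('benign_malignant'),
                      clinical.get('diagnosis')))
    return pairs
```

Calling `load_pairs(dsc_filepaths)` would give the same kind of list as `data`, with `None` standing in for any missing value so that incomplete records can be counted or filtered later.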


For analysis, the data is best stored as a `pandas.DataFrame`, which can be built directly from the list of tuples.

In [12]:
df = pd.DataFrame(data, columns=['malignancy', 'diagnosis'])
df.head()

|   | malignancy | diagnosis |
|---|------------|-----------|
| 0 | benign     | nevus     |
| 1 | benign     | nevus     |
| 2 | malignant  | melanoma  |
| 3 | benign     | nevus     |
| 4 | malignant  | melanoma  |
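
Since the research question concerns malignancy rates per diagnosis, a natural next step is a contingency table. The sketch below uses `pd.crosstab` on a small hypothetical table (the rows are made up for illustration and are not taken from the archive); the same two calls should work on `df`, since it has the same column names.

```python
import pandas as pd

# hypothetical toy data, for illustration only (not taken from the archive)
toy = pd.DataFrame(
    [('benign', 'nevus'), ('benign', 'nevus'), ('benign', 'nevus'),
     ('malignant', 'melanoma'), ('malignant', 'melanoma'),
     ('benign', 'seborrheic keratosis')],
    columns=['malignancy', 'diagnosis'])

# rows are diagnoses, columns are malignancy labels, cells are counts
counts = pd.crosstab(toy['diagnosis'], toy['malignancy'])

# normalize='index' divides each row by its total, giving per-diagnosis rates
rates = pd.crosstab(toy['diagnosis'], toy['malignancy'], normalize='index')

print(counts)
print(rates)
```

The count table shows how many lesions fall into each group, which is worth checking before any test is run; the rate table is what the comparison between diagnosis types is ultimately about.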


# Analysis

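## Initial

The analysis approach is not yet decided. Open questions are noted in the cell below.

In [None]:
# Is it acceptable to have this many diagnosis groups?
# Is some type of dropout needed?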
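
One common option for the question posed in the Overview is Pearson's chi-square test of independence between malignancy and diagnosis. This is only a sketch of that option, not necessarily the approach this notebook will settle on. The function name `chi_square_table` and the toy pairs are hypothetical. The function computes the statistic, the degrees of freedom, and the expected counts in pure Python; a p-value could then be obtained with `scipy.stats.chi2_contingency`, which this notebook does not currently import.

```python
from collections import Counter


def chi_square_table(pairs):
    """Return (statistic, dof, expected) for (malignancy, diagnosis) pairs."""
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(m for m, _ in pairs)  # totals per malignancy label
    col_totals = Counter(d for _, d in pairs)  # totals per diagnosis
    statistic = 0.0
    expected = {}
    for m in row_totals:
        for d in col_totals:
            # expected count if malignancy and diagnosis were independent
            e = row_totals[m] * col_totals[d] / n
            expected[(m, d)] = e
            statistic += (observed[(m, d)] - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# hypothetical toy pairs, for illustration only
toy_pairs = ([('benign', 'nevus')] * 4 + [('malignant', 'nevus')] * 1 +
             [('benign', 'melanoma')] * 1 + [('malignant', 'melanoma')] * 4)
statistic, dof, expected = chi_square_table(toy_pairs)
print(statistic, dof, min(expected.values()))
```

The expected counts are returned because a widely used rule of thumb treats the chi-square approximation as unreliable when many of them fall below about 5. With 50 descriptions spread over several diagnosis types, this is likely to be a concern, so rare diagnoses may need to be merged or excluded before testing.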