# Homework 1: Python for basic data analysis


Name: Santos, Carlos

Department: ECE




This homework aims to help you practice basic Python programming skills using the Breast Cancer Wisconsin dataset.

![breast image](breastimg.png)

| *Fig. 1. Cell nuclei in a breast histopathology image* | 
|---|
|Fine Needle Aspiration (FNA) biopsy: https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-breast.html|
|H&E stain: https://en.wikipedia.org/wiki/H%26E_stain|


Tasks:

    [Task 1](#section1)

    [Task 2](#section2)

    [Task 3](#section3)

    [Task 4](#section4)

    [Task 5](#section5)

## Dataset

    - Number of data samples: 569
    
    - Each data sample has 30 numeric features/attributes. The first 10 features are the mean values of the measurements over all nuclei in an image
    
    - Class labels
        : 212 Malignant (0)
        : 357 Benign (1)
        
    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [1]:
import sklearn.datasets as ds
import numpy as np

In [2]:
breast_ds = ds.load_breast_cancer()
print('Data fields in breast_ds: \n', dir(breast_ds))

print('\n Dataset description:\n', breast_ds['DESCR'])

Data fields in breast_ds: 
 ['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']

 Dataset description:
 .. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst

In [3]:
# we are going to use the first 10 features in this assignment.
ftrs = breast_ds['data'][:, :10].tolist() # np array to list
tgt = (breast_ds['target']).tolist()

print('size of ftrs: ', len(ftrs), len(ftrs[0]))
print('size of tgt: ', len(tgt))

size of ftrs:  569 10
size of tgt:  569


## Task 1: Count and print out the number of malignant samples. 10 points <a id = "section1"/>

In [4]:
# Let's begin by defining what a "malignant" sample is!
# FNA breast biopsy: Fine needle Aspiration
# A doctor called a pathologist will look at the biopsy tissue or fluid to find out if there are cancer cells in it.

# 1. If the fluid is brown, green, or tan, the lump is most likely a cyst, and not cancer.
# 2. Bloody or clear fluid can mean either a cyst that’s not cancer or, very rarely, cancer.
# 3. If the lump is solid, the doctor will look at small groups of cells from the biopsy to determine what it is.

#print(breast_ds.target_names)
#print(breast_ds.keys())
feature_names = breast_ds.feature_names
data = breast_ds.data
target = breast_ds.target
benign = []
malignant = []


#print(feature_names)
#print(data)
#print(target)

# - Class labels
#    : 212 Malignant (0)
#    : 357 Benign (1)

benign_count = 0
malignant_count = 0

# Count the benign and malignant samples and
# Split the benign and malignant samples

for index,item in enumerate(target):
    if item == 1:
        benign_count = benign_count + 1
        benign.append(data[index])
    else:
        malignant_count = malignant_count + 1
        malignant.append(data[index])

print("benign_count:", benign_count, "Size of Benign list:", len(benign))
print("malignant_count:", malignant_count, "Size of malignant list:", len(malignant))
print("Sample Total:", benign_count + malignant_count)

benign_count: 357 Size of Benign list: 357
malignant_count: 212 Size of malignant list: 212
Sample Total: 569
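
As a cross-check on the loop above, the same tally can be sketched with `collections.Counter`. The `labels` list below is a small made-up stand-in for the real label list (0 = malignant, 1 = benign), so the numbers it prints are not the dataset's counts.

```python
from collections import Counter

# Hypothetical stand-in for the label list: 0 = malignant, 1 = benign.
labels = [0, 1, 1, 0, 1, 1, 1]

counts = Counter(labels)   # maps each label to how many times it occurs
malignant_n = counts[0]
benign_n = counts[1]

print("malignant:", malignant_n)
print("benign:", benign_n)
```

Passing the `tgt` list built earlier in place of `labels` should reproduce the 212 and 357 counts printed above.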


## Task 2: data search. 20 points.  <a id = "section2"/>

Let the user input a sample index (1 to 569); your code will output the sample's features and the corresponding class label.

Extra 5 points for dealing with abnormal input.


In [5]:
while True:
    user_input = input("Input Sample Number (1-569) or exit: ")
    # Check for the exit command before attempting any numeric conversion
    if user_input == "exit":
        break

    # int() raises ValueError when the input is not a whole number;
    # an out-of-range number is treated the same way
    try:
        sample_number = int(user_input)
        if not 1 <= sample_number <= len(data):
            raise ValueError
    except ValueError:
        print("An invalid number was entered! Please try again!")
        continue

    # The input is valid here; convert the 1-based number to a 0-based index
    print("Your Sample is being loaded: ", sample_number)
    if target[sample_number - 1] == 0:
        sample_label = "malignant"
    else:
        sample_label = "benign"
    print("\n\nSample Type: ", sample_label)
    for name, value in zip(feature_names, data[sample_number - 1]):
        print("Sample: ", name, value)
        



KeyboardInterrupt: Interrupted by user
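
Because the interactive loop above is hard to test without typing at the prompt, the validation step can be sketched as a separate function. The name `parse_sample_index` and the convention of returning `None` for bad input are assumptions made for this illustration, not part of the assignment.

```python
def parse_sample_index(text, n_samples=569):
    """Return a 0-based index for a 1-based sample number, or None if invalid.

    Hypothetical helper: the name and the None-on-failure convention are
    assumptions made for this sketch.
    """
    try:
        number = int(text)
    except ValueError:
        return None          # not a whole number, e.g. "abc", "3.5" or ""
    if 1 <= number <= n_samples:
        return number - 1    # convert to a 0-based list index
    return None              # a whole number, but outside 1..n_samples


for raw in ["1", " 569 ", "0", "570", "abc", "3.5", ""]:
    print(repr(raw), "->", parse_sample_index(raw))
```

Since a valid result can be `0`, a caller would need to compare against `None` explicitly (`if idx is None:`) rather than rely on truthiness.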

## Task 3. 30 points  <a id = "section3"/>

Task 3.1: Calculate and print out the mean, min and max values of the feature 'concave points (7)' for all benign samples.
Tip: use a for loop.


In [None]:
# Find mean, min, max for concave points in benign samples
from random import randint
benign_mean = 0.0
# Start min and max from the first sample's value; any value from the list is a valid starting point
benign_min = benign[0][7]
benign_max = benign[0][7]
for sample in benign:
    # We know that 7 is the concave points
    # To find mean we add all the values up first
    benign_mean = sample[7] + benign_mean
    
    # Is it max? or min
    if sample[7] >= benign_max:
        benign_max = sample[7]
    
    if sample[7] <= benign_min:
        benign_min = sample[7]

benign_mean = benign_mean/len(benign)
print("Benign Max:", benign_max)
print("Benign Min:", benign_min)
print("Benign Mean:", benign_mean)

Task 3.2: Calculate and print out the mean, min and max values of the feature 'concave points' for all malignant samples.


In [None]:
# Find mean, min, max for concave points in malignant samples
malignant_mean = 0.0
# Start min and max from the first sample's value; any value from the list is a valid starting point
malignant_min = malignant[0][7]
malignant_max = malignant[0][7]
for sample in malignant:
    # We know that 7 is the concave points
    # To find mean we add all the values up first
    malignant_mean = sample[7] + malignant_mean
    
    # Is it max? or min
    if sample[7] >= malignant_max:
        malignant_max = sample[7]
    
    if sample[7] <= malignant_min:
        malignant_min = sample[7]

malignant_mean = malignant_mean/len(malignant)
print("Malignant Max:", malignant_max)
print("Malignant Min:", malignant_min)
print("Malignant Mean:", malignant_mean)
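
The loop-based results in Tasks 3.1 and 3.2 can be sanity-checked against Python's built-ins. This is only a sketch: the `values` list is made up and stands in for something like `[sample[7] for sample in benign]`.

```python
from statistics import mean

# Hypothetical 'concave points' values, standing in for
# [sample[7] for sample in benign].
values = [0.02, 0.05, 0.03, 0.08, 0.01]

print("Max:", max(values))
print("Min:", min(values))
print("Mean:", mean(values))
```

If the hand-written loop and the built-ins disagree on the same input, the loop's initial values or comparisons are the first place to look.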

## Task 4: count the number of benign samples that have 'concave points' values less than 0.17. 20 points  <a id = "section4"/>



In [None]:
sample_count = 0
for sample in benign:
    if sample[7] < 0.17:
        sample_count = sample_count + 1

print("Benign Samples with concave points less than 0.17:", sample_count)
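
A more compact way to express the same count is a generator expression inside `sum()`. The sketch below uses made-up values rather than the real benign samples, and includes one value exactly at the threshold to show that the comparison is strict.

```python
# Hypothetical 'concave points' values for a handful of benign samples.
values = [0.02, 0.17, 0.05, 0.20, 0.169]
threshold = 0.17

# Emit a 1 for each value strictly below the threshold and add them up.
below = sum(1 for v in values if v < threshold)
print("Values below", threshold, ":", below)
```

The value `0.17` itself is not counted, which matches the "less than 0.17" wording of the task.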

## Task 5. 20 points <a id = "section5"/>

Define a function that calculates the Euclidean distance between any two given data samples.
 

In [9]:
import math


In [17]:
def euclidean_distance(sample1_list, sample2_list):
    ''' Calculate the Euclidean distance between two given samples.

        distance = sqrt(sum over i of (x_i - y_i)^2)

        Returns -1 if the two samples have different lengths.
    '''
    total_distance = 0
    # The two samples are passed in as lists and must be the same size
    if len(sample1_list) == len(sample2_list):
        for x, y in zip(sample1_list, sample2_list):
            # Add the squared difference for this feature
            total_distance = total_distance + math.pow(x - y, 2)

        # Take the square root of the summed squared differences
        return math.sqrt(total_distance)
    else:
        return -1

# Distance using only the first ten features
print("10 points:", euclidean_distance(ftrs[1], ftrs[43]))

# Distance using all 30 features
print("30 points:", euclidean_distance(data[1], data[43]))
            
                
                
                
    

10 points: 782.1672620284146
30 points: 1309.8714219163683
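
One way to gain confidence in a hand-written distance function is to compare it with `math.dist`, which the standard library provides from Python 3.8 onward. The two vectors below are made up so that the answer is easy to verify by hand (differences of 3, 4 and 0).

```python
import math

# Two small hypothetical feature vectors.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

# Manual formula: square root of the sum of squared differences.
manual = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Built-in equivalent (available in Python 3.8 and later).
builtin = math.dist(a, b)

print(manual, builtin)
```

Note that `math.dist` raises a `ValueError` when the two inputs differ in length, whereas the function above signals that case by returning -1.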
