# Assignment 1: Data Science and Big Data Analysis (COSC 5340)

#                           Student Name


"Mammographic Mass Data Set"

This data set can be used to predict the severity (benign or malignant)
of a mammographic mass lesion from BI-RADS attributes and the patient's age.
It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes
together with the ground truth (the severity field) for 516 benign and
445 malignant masses that have been identified on full field digital mammograms
collected at the Institute of Radiology of the
University Erlangen-Nuremberg between 2003 and 2006.
Each instance has an associated BI-RADS assessment ranging from 1 (definitely benign)
to 5 (highly suggestive of malignancy) assigned in a double-review process by
physicians. Assuming that all cases with BI-RADS assessments greater than or equal to
a given value (varying from 1 to 5) are malignant and the other cases benign,
sensitivities and associated specificities can be calculated. These can be an
indication of how well a CAD system performs compared to the radiologists.
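
The thresholding idea described above can be sketched as follows. This is a minimal illustration, not part of the original data set documentation: the helper `sens_spec` and the toy values are hypothetical and are not taken from the data.

```python
import numpy as np

def sens_spec(birads, severity, threshold):
    """Treat every case with BI-RADS >= threshold as predicted malignant."""
    birads = np.asarray(birads)
    severity = np.asarray(severity)
    predicted = birads >= threshold
    tp = np.sum(predicted & (severity == 1))   # malignant, flagged
    fn = np.sum(~predicted & (severity == 1))  # malignant, missed
    tn = np.sum(~predicted & (severity == 0))  # benign, not flagged
    fp = np.sum(predicted & (severity == 0))   # benign, flagged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy values, not taken from the data set
toy_birads = [5, 4, 5, 4, 3, 2]
toy_severity = [1, 1, 1, 0, 0, 0]

for t in range(1, 6):
    sensitivity, specificity = sens_spec(toy_birads, toy_severity, t)
    print(t, sensitivity, specificity)
```

Raising the threshold trades sensitivity for specificity, which is why the pairs are reported for every threshold rather than a single one.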

Class Distribution: benign: 516; malignant: 445

 Attribute Information:
   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)

Missing Attribute Values: Yes
    - BI-RADS assessment:    2
    - Age:                   5
    - Shape:                31
    - Margin:               48
    - Density:              76
    - Severity:              0

In [1]:
#Importing required libraries
import numpy as np
import pandas as pd

In [2]:
#Load 'mammographic_masses' dataset into a pandas dataframe object 'data'

data = pd.read_csv('mammographic_masses.data',index_col = False, header=None)
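
As an alternative sketch, `pd.read_csv` can mark the `'?'` placeholders as missing while loading, through its `na_values` parameter. The three-row sample and the names `sample`, `cols` and `sample_df` below are illustrative assumptions so the snippet runs without the data file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as the data file
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n")
cols = ['BI-RADS assessment', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# na_values='?' turns every '?' field into NaN at load time
sample_df = pd.read_csv(sample, header=None, names=cols, na_values='?')
print(sample_df.isnull().sum())
```

With this approach the columns are parsed as numbers straight away, so a later `isnull()` check reports the missing entries directly.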

In [3]:
#Checking the first five rows
data.head()

Unnamed: 0,0,1,2,3,4,5
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1


In [4]:
#Setting the column names from the data set description

data.columns = ['BI-RADS assessment','Age','Shape','Margin','Density','Severity']

In [5]:
#Inspecting the renamed columns
data

Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1
...,...,...,...,...,...,...
956,4,47,2,1,3,0
957,4,56,4,5,3,1
958,4,64,4,5,3,0
959,5,66,4,5,3,1


In [6]:
#Getting information of our dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
BI-RADS assessment    961 non-null object
Age                   961 non-null object
Shape                 961 non-null object
Margin                961 non-null object
Density               961 non-null object
Severity              961 non-null int64
dtypes: int64(1), object(5)
memory usage: 45.2+ KB


In [7]:
#Checking for null values
#Result shows no null values because missing entries are stored as the string '?', which isnull() does not detect
data.isnull().any()

BI-RADS assessment    False
Age                   False
Shape                 False
Margin                False
Density               False
Severity              False
dtype: bool

In [8]:
#Finding number of unique value in 'Age' column

data['Age'].nunique()

74

In [9]:
#Finding the unique values of each column, one column at a time

data['BI-RADS assessment'].unique()

array(['5', '4', '3', '?', '2', '55', '0', '6'], dtype=object)
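
The output above contains codes outside the documented 1 to 5 range. A small sketch for listing such codes is shown below; the series simply mirrors the unique values printed above, and the names `birads`, `allowed` and `unexpected` are illustrative.

```python
import pandas as pd

# Mirrors the unique values printed above (still strings at this point)
birads = pd.Series(['5', '4', '3', '?', '2', '55', '0', '6'])

# Documented codes plus the '?' placeholder for missing values
allowed = ['1', '2', '3', '4', '5', '?']
unexpected = birads[~birads.isin(allowed)]
print(unexpected.tolist())
```

How to treat these codes (for example as data-entry errors) is a modelling decision that this notebook does not make; the sketch only surfaces them.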

In [10]:
data['Age'].unique()

array(['67', '43', '58', '28', '74', '65', '70', '42', '57', '60', '76',
       '64', '36', '54', '52', '59', '40', '66', '56', '75', '63', '45',
       '55', '46', '39', '81', '77', '48', '78', '50', '61', '62', '44',
       '23', '80', '53', '49', '51', '25', '72', '73', '68', '33', '47',
       '29', '34', '71', '84', '24', '86', '41', '87', '21', '19', '35',
       '37', '79', '85', '69', '38', '32', '27', '83', '88', '26', '31',
       '?', '18', '82', '93', '30', '22', '96', '20'], dtype=object)

In [11]:
data['Shape'].unique()

array(['3', '1', '4', '?', '2'], dtype=object)

In [12]:
data['Margin'].unique()

array(['5', '1', '?', '4', '3', '2'], dtype=object)

In [13]:
data['Density'].unique()

array(['3', '?', '1', '2', '4'], dtype=object)

In [14]:
data['Severity'].unique()

array([1, 0], dtype=int64)

In [15]:
#Counting the unique values in each column
data.nunique()

BI-RADS assessment     8
Age                   74
Shape                  5
Margin                 6
Density                5
Severity               2
dtype: int64

In [16]:
#Trying to convert the 'Age' column to integer; this fails because of the '?' placeholder values
data['Age'].astype(int)

ValueError: invalid literal for int() with base 10: '?'
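
One way to avoid this error, sketched under the assumption that every non-numeric entry should count as missing, is `pd.to_numeric` with `errors='coerce'`. The short `ages` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample of the raw 'Age' strings
ages = pd.Series(['67', '43', '?', '28'])

# errors='coerce' converts anything unparseable (here '?') to NaN
converted = pd.to_numeric(ages, errors='coerce')
print(converted.isna().sum())
print(converted.dtype)
```

The result is a float column because NaN has no integer representation in a plain NumPy integer dtype.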

# Handling Missing Values

In [17]:
#Defining a function that replaces '?' with a null value
#np.nan is used because the np.NaN alias was removed in NumPy 2.0

def check(x):
    if x == '?':
        return np.nan
    return x

In [18]:
#Applying that function to every attribute column

data['Age'] = data['Age'].apply(check)
data['Shape'] = data['Shape'].apply(check)
data['BI-RADS assessment'] = data['BI-RADS assessment'].apply(check)
data['Margin'] = data['Margin'].apply(check)
data['Density'] = data['Density'].apply(check)

In [19]:
#Checking the total number of null values in each column

data.isnull().sum()

BI-RADS assessment     2
Age                    5
Shape                 31
Margin                48
Density               76
Severity               0
dtype: int64

In [20]:
#Finding the rounded average of 'BI-RADS assessment' after dropping null values and converting the remaining objects to integer
avg_col1 = round(data['BI-RADS assessment'].dropna().astype(int).mean())

In [21]:
avg_col1

4

In [22]:
#Handling missing values

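#Finding the rounded average of each attribute column and replacing that column's null values with it
#Note: for the nominal codes (Shape, Margin) the average is only a stand-in code, not a meaningful category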
avg_col = []
for i in range(len(data.columns)-1):
    avg_col.append(round(data[data.columns[i]].dropna().astype(int).mean()))
    data[data.columns[i]] = data[data.columns[i]].fillna(avg_col[i])
    print(avg_col[i])

4
55
3
3
3
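
Because Shape and Margin are nominal, a commonly used alternative is to fill them with the most frequent code (the mode) and to fill Age with the median. The sketch below shows that variant on a hypothetical toy frame; it is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing value
toy = pd.DataFrame({
    'Age': [67, 43, np.nan, 28, 74],
    'Shape': [3, 1, 4, np.nan, 4],
})

# Numeric attribute: fill with the median
toy['Age'] = toy['Age'].fillna(toy['Age'].median())
# Nominal attribute: fill with the most frequent code
toy['Shape'] = toy['Shape'].fillna(toy['Shape'].mode().iloc[0])
print(toy)
```

`mode()` returns a Series because ties are possible, so `.iloc[0]` picks the first of the most frequent codes.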


In [23]:
#Converting all the attribute columns to integer type

data['Density'] = data['Density'].astype(int)
data['Age'] = data['Age'].astype(int)
data['Shape'] = data['Shape'].astype(int)
data['BI-RADS assessment'] = data['BI-RADS assessment'].astype(int)
data['Margin'] = data['Margin'].astype(int)

In [24]:
#Checking the data again

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
BI-RADS assessment    961 non-null int32
Age                   961 non-null int32
Shape                 961 non-null int32
Margin                961 non-null int32
Density               961 non-null int32
Severity              961 non-null int64
dtypes: int32(5), int64(1)
memory usage: 26.4 KB


In [25]:
# Check for any number of missing value
data.isnull().sum()

BI-RADS assessment    0
Age                   0
Shape                 0
Margin                0
Density               0
Severity              0
dtype: int64

# Calculate Similarity for each attribute

![](1.png)

In [26]:
#Creating one square NumPy array per attribute column (rows x rows)
#to store the pairwise similarity between data objects

array1 = np.zeros((len(data),len(data)))
array2 = np.zeros((len(data),len(data)))
array3 = np.zeros((len(data),len(data)))
array4 = np.zeros((len(data),len(data)))
array5 = np.zeros((len(data),len(data)))

In [27]:
array1

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [28]:
#Similarity for the attribute 'BI-RADS assessment' (ORDINAL ATTRIBUTE)

#Using d = |p-q| / (n-1); s = 1 - d

# For BI-RADS assessment, the documented order ranges from 1 to 5, so n=5
# Note: the raw column also holds 0, 6 and 55 (see the unique values in cell 9), which push s outside [0, 1]

s = data['BI-RADS assessment']

for i in range(len(data)):
    for j in range(i, len(data)):
        array1[i, j] = array1[j, i] = 1 - abs(s[i] - s[j]) / (5 - 1)

print(array1)

[[1.   0.75 1.   ... 0.75 1.   0.75]
 [0.75 1.   0.75 ... 1.   0.75 1.  ]
 [1.   0.75 1.   ... 0.75 1.   0.75]
 ...
 [0.75 1.   0.75 ... 1.   0.75 1.  ]
 [1.   0.75 1.   ... 0.75 1.   0.75]
 [0.75 1.   0.75 ... 1.   0.75 1.  ]]
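
The nested loop can also be written with NumPy broadcasting. The following is a sketch of the same formula, s = 1 - |p-q| / (n-1); the function name `ordinal_similarity` and the four sample values are hypothetical.

```python
import numpy as np

def ordinal_similarity(values, n_levels):
    """s = 1 - |p - q| / (n - 1) for every pair, via broadcasting."""
    v = np.asarray(values, dtype=float)
    # v[:, None] is a column, v[None, :] is a row; their difference is the full pair grid
    return 1.0 - np.abs(v[:, None] - v[None, :]) / (n_levels - 1)

# Hypothetical BI-RADS values, all inside the documented 1..5 range
sim = ordinal_similarity([5, 4, 5, 1], n_levels=5)
print(sim)
```

The result is symmetric with ones on the diagonal by construction, so no explicit mirroring of `[i, j]` into `[j, i]` is needed.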


In [29]:
#Similarity for the attribute 'Age' (RATIO ATTRIBUTE)

#Using d = |p-q|; s = 1/(1+d)

s = data['Age']

for i in range(len(data)):
    for j in range(i, len(data)):
        array2[i, j] = array2[j, i] = 1 / (1 + abs(s[i] - s[j]))

print(array2)

[[1.         0.04       0.1        ... 0.25       0.5        0.16666667]
 [0.04       1.         0.0625     ... 0.04545455 0.04166667 0.05      ]
 [0.1        0.0625     1.         ... 0.14285714 0.11111111 0.2       ]
 ...
 [0.25       0.04545455 0.14285714 ... 1.         0.33333333 0.33333333]
 [0.5        0.04166667 0.11111111 ... 0.33333333 1.         0.2       ]
 [0.16666667 0.05       0.2        ... 0.33333333 0.2        1.        ]]
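
As a quick hand check of the formula s = 1/(1+d), the first entries of the matrix can be reproduced from the ages of the first three patients shown earlier (67, 43 and 58). The helper name `age_similarity` is illustrative.

```python
def age_similarity(p, q):
    """s = 1 / (1 + |p - q|)"""
    return 1 / (1 + abs(p - q))

# Ages of data objects 0, 1 and 2 as listed in the data preview
print(age_similarity(67, 43))  # objects 0 and 1: d = 24
print(age_similarity(67, 58))  # objects 0 and 2: d = 9
```

These correspond to the entries 0.04 and 0.1 in the first row of the printed matrix.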


In [30]:
#Similarity for the attribute 'Shape' (NOMINAL ATTRIBUTE)

#Using s = 1 if p=q; s=0 if p!=q

s = data['Shape']

for i in range(len(data)):
    for j in range(i, len(data)):
        array3[i, j] = array3[j, i] = 1 if s[i] == s[j] else 0

print(array3)

[[1. 0. 0. ... 0. 0. 1.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 1. 1. 0.]
 ...
 [0. 0. 1. ... 1. 1. 0.]
 [0. 0. 1. ... 1. 1. 0.]
 [1. 0. 0. ... 0. 0. 1.]]


In [31]:
#Similarity for the attribute 'Margin' (NOMINAL ATTRIBUTE)

#Using s = 1 if p=q; s=0 if p!=q

s = data['Margin']

for i in range(len(data)):
    for j in range(i, len(data)):
        array4[i, j] = array4[j, i] = 1 if s[i] == s[j] else 0

print(array4)

[[1. 0. 1. ... 1. 1. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [1. 0. 1. ... 1. 1. 0.]
 ...
 [1. 0. 1. ... 1. 1. 0.]
 [1. 0. 1. ... 1. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


In [32]:
#Similarity for the attribute 'Density' (ORDINAL ATTRIBUTE)

#Using d = |p-q| / (n-1); s = 1 - d

# For Density, the order ranges from 1 to 4, so n=4

s = data['Density']

for i in range(len(data)):
    for j in range(i, len(data)):
        array5[i, j] = array5[j, i] = 1 - abs(s[i] - s[j]) / (4 - 1)

print(array5)

[[1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]


# Combining Similarity

![](2.png)

In [33]:
#Creating DataFrames to store the pairwise similarity values, one per attribute

mat1 = pd.DataFrame(array1)
mat2 = pd.DataFrame(array2)
mat3 = pd.DataFrame(array3)
mat4 = pd.DataFrame(array4)
mat5 = pd.DataFrame(array5)
z = np.zeros((len(data),len(data)))
final_matrix = pd.DataFrame(z)

In [34]:
# Using the combined similarity formula to merge the per-attribute similarities
# In the formula, del(k) equals 1 for every attribute because no values are missing after imputation

d = 1
for i in range(len(data)):
    for j in range(i,len(data)):
        z[i,j] = z[j,i] = (d*mat1.loc[i,j]+d*mat2.loc[i,j]+d*mat3.loc[i,j]+d*mat4.loc[i,j]+d*mat5.loc[i,j])/5

In [35]:
#Checking combined similarity value
z

array([[1.        , 0.358     , 0.62      , ..., 0.6       , 0.7       ,
        0.58333333],
       [0.358     , 1.        , 0.3625    , ..., 0.40909091, 0.35833333,
        0.41      ],
       [0.62      , 0.3625    , 1.        , ..., 0.77857143, 0.82222222,
        0.39      ],
       ...,
       [0.6       , 0.40909091, 0.77857143, ..., 1.        , 0.81666667,
        0.46666667],
       [0.7       , 0.35833333, 0.82222222, ..., 0.81666667, 1.        ,
        0.39      ],
       [0.58333333, 0.41      , 0.39      , ..., 0.46666667, 0.39      ,
        1.        ]])
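
The entry for objects 0 and 1 can be checked by hand from the five per-attribute matrices printed earlier (0.75, 0.04, 0, 0 and 1). The sketch below also shows how the result would change if del(k) were set to 0 for an attribute, as it would be for a genuinely missing value; the helper `combine` is a hypothetical name, not code from this notebook.

```python
import numpy as np

def combine(similarities, deltas):
    """Weighted average: sum(delta_k * s_k) / sum(delta_k)."""
    s = np.asarray(similarities, dtype=float)
    d = np.asarray(deltas, dtype=float)
    return (d * s).sum(axis=0) / d.sum(axis=0)

# Objects 0 and 1: BI-RADS, Age, Shape, Margin, Density similarities
s_01 = [0.75, 0.04, 0.0, 0.0, 1.0]

print(combine(s_01, [1, 1, 1, 1, 1]))  # every attribute counted, as in this notebook
print(combine(s_01, [1, 1, 1, 1, 0]))  # Density excluded from the average
```

With all five attributes counted the value agrees with the 0.358 shown in the matrix. Object 1 originally had a missing Density, so the second call illustrates what skipping that attribute instead of imputing it would give.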

In [36]:
#Creating Dataframe for the final values
final_matrix = pd.DataFrame(z)

In [37]:
final_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,951,952,953,954,955,956,957,958,959,960
0,1.000000,0.358000,0.620000,0.355000,0.625000,0.416667,0.600000,0.407692,0.618182,0.691667,...,0.800000,0.450000,0.366667,0.356061,0.362500,0.359524,0.566667,0.600000,0.700000,0.583333
1,0.358000,1.000000,0.362500,0.812500,0.556250,0.608696,0.407143,0.650000,0.563333,0.227778,...,0.358000,0.407692,0.414286,0.622222,0.420000,0.640000,0.414286,0.409091,0.358333,0.410000
2,0.620000,0.362500,1.000000,0.356452,0.611765,0.375000,0.365385,0.411765,0.700000,0.533333,...,0.820000,0.568182,0.416667,0.358333,0.578571,0.366667,0.816667,0.778571,0.822222,0.390000
3,0.355000,0.812500,0.356452,1.000000,0.554255,0.605263,0.404651,0.563333,0.556667,0.222727,...,0.355000,0.404878,0.406897,0.625000,0.408000,0.610000,0.406897,0.405405,0.355128,0.405714
4,0.625000,0.556250,0.611765,0.554255,1.000000,0.570000,0.390000,0.606061,0.811111,0.480000,...,0.625000,0.378571,0.360526,0.355000,0.358696,0.357143,0.560526,0.568182,0.622222,0.365385
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
956,0.359524,0.640000,0.366667,0.610000,0.357143,0.410526,0.408333,0.383333,0.368182,0.230952,...,0.359524,0.409091,0.620000,0.815385,0.433333,1.000000,0.420000,0.411111,0.360000,0.412500
957,0.566667,0.414286,0.816667,0.406897,0.560526,0.420000,0.413333,0.363333,0.650000,0.456667,...,0.766667,0.615385,0.600000,0.409091,0.640000,0.420000,1.000000,0.822222,0.768182,0.428571
958,0.600000,0.409091,0.778571,0.405405,0.568182,0.500000,0.428571,0.358696,0.575000,0.456667,...,0.800000,0.640000,0.422222,0.406667,0.615385,0.411111,0.822222,1.000000,0.816667,0.466667
959,0.700000,0.358333,0.822222,0.355128,0.622222,0.450000,0.390000,0.408000,0.620000,0.495238,...,0.900000,0.616667,0.368182,0.356250,0.563333,0.360000,0.768182,0.816667,1.000000,0.390000


In [38]:
#Writing the combined similarity to output file
final_matrix.to_csv('Similarity Matrix.csv')
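
A small round-trip sketch for reading such a file back is given below. It uses a hypothetical 2 x 2 matrix and a temporary path so that it does not touch the real output file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical 2 x 2 similarity matrix
toy_matrix = pd.DataFrame([[1.0, 0.358], [0.358, 1.0]])

path = os.path.join(tempfile.mkdtemp(), 'toy_similarity.csv')
toy_matrix.to_csv(path)  # the row index is written as the first column

# index_col=0 restores the row index instead of reading it as data
loaded = pd.read_csv(path, index_col=0)
print(loaded.columns.tolist())  # column labels come back as strings
print(np.allclose(loaded.to_numpy(), toy_matrix.to_numpy()))
```

Without `index_col=0` the saved row index would reappear as an extra `Unnamed: 0` column.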