# Breast Tumor Data Exploration
  
  

Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound and biopsy are commonly used to diagnose breast cancer.

The [Breast Cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains **569 samples of malignant and benign tumor cells**.
* The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively.
* The columns 3-32 contain 30 real-value features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.

#### Getting Started: Load libraries and set options

In [None]:
#load libraries
import numpy as np         # linear algebra
import pandas as pd        # data processing, CSV file I/O (e.g. pd.read_csv)

#### Load Dataset

First, load the supplied CSV file with the pandas `read_csv` function.

In [None]:
# For Google colab users
url = 'https://raw.githubusercontent.com/acolajanni/Python_DoubleCursus/main/data/Breast_cancer.csv'

data = pd.read_csv(url)
data

In [None]:
data.columns

#### Inspecting the data
The first step is to visually inspect the new data set. There are multiple ways to achieve this:
* The easiest is to request the first few records using the DataFrame `data.head()` method. By default, `data.head()` returns the first 5 rows of the DataFrame `data` (excluding the header row).
* Alternatively, one can use `data.tail()` to return the last five rows of the data frame.
* For both the head and tail methods, you can specify the number of records by passing the required number between the parentheses when calling either method.
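
As a small illustration of the optional row-count argument, the sketch below uses a made-up ten-row frame (the `sample` column is hypothetical and not part of the breast cancer data):

```python
import pandas as pd

# Toy frame with ten rows; the values are invented for illustration only
toy = pd.DataFrame({"sample": range(1, 11)})

first_three = toy.head(3)   # first 3 rows instead of the default 5
last_two = toy.tail(2)      # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```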

In [None]:
data.head(5)

In [None]:
data.tail(5)

In [None]:
# Id column is redundant and not useful, we want to drop it
data.drop('id', axis = 1, inplace=True)
#data.drop('Unnamed: 0', axis=1, inplace=True)
data.head(5)

You can check the number of cases, as well as the number of fields, using the `shape` attribute, as shown below.

In [None]:
data.shape

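In the result displayed, the first number is the number of records (569) and the second is the number of columns, which is one fewer than in the raw file because the `id` column was dropped.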

The **`info()`** method provides a concise summary of the data: the type of data in each column, the number of non-null values in each column, and how much memory the data frame is using.

The method **dtypes.value_counts()** will return the number of columns of each type in a DataFrame:

In [None]:
# Review data types with "info()".
data.info()
data.dtypes.value_counts()

From the above results, of the 31 variables, only the first column, `diagnosis`, is of type object (569 non-null values); the rest are floats. More on [python variables](https://www.tutorialspoint.com/python/python_variable_types.htm)

In [None]:
# check for missing values
data.isnull().any()
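
`any()` only reports whether a column contains at least one missing value. A possible complement, sketched here on an invented two-column frame (the values are not taken from the dataset), is to count the missing entries per column with `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; numbers are made up for illustration
toy = pd.DataFrame({
    "radius_mean": [14.2, np.nan, 20.1],
    "texture_mean": [19.1, 17.5, 21.3],
})

# isnull() gives a boolean frame; summing it counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```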

In [None]:
data.diagnosis.unique()

In [None]:
data.describe()

#### Counting the different values of diagnosis:

1. Using df.groupby()

In [None]:
# Group by diagnosis and review the output.
diag_gr = data.groupby('diagnosis')  # grouping by rows is the default; the axis argument is deprecated in recent pandas
pd.DataFrame(diag_gr.size(), columns=['# of observations'])

2. Using Counter() from the collections library

In [None]:
from collections import Counter

diag_gr = Counter(data.diagnosis)
diag_gr = pd.DataFrame.from_dict(diag_gr, orient='index',  columns=['# of observations'])
diag_gr
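
A third option, not used in this notebook, is the pandas `value_counts()` method. The sketch below runs on a short made-up list of labels rather than on the real `diagnosis` column:

```python
import pandas as pd

# Invented labels standing in for the diagnosis column
labels = pd.Series(["M", "B", "B", "B", "M"])

counts = labels.value_counts()                 # absolute number of rows per label
shares = labels.value_counts(normalize=True)   # same counts expressed as proportions

print(counts)
print(shares)
```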

From the results above, diagnosis is a categorical variable, because it takes a fixed number of possible values (i.e., Malignant or Benign). Machine learning algorithms expect numbers, not strings, as their inputs, so we need some method of encoding to convert them.
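
One simple way to perform this encoding is to pass a dictionary to `Series.map`. The mapping below (M to 1, B to 0) is an assumption chosen for illustration; the notebook itself does not fix a particular coding:

```python
import pandas as pd

# Invented labels standing in for the diagnosis column
labels = pd.Series(["M", "B", "B", "M"])

# Hypothetical coding: malignant -> 1, benign -> 0
encoded = labels.map({"M": 1, "B": 0})
print(encoded.tolist())
```

Any label missing from the dictionary would become `NaN`, which makes unexpected categories easy to spot.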



In [None]:
import matplotlib.pyplot as plt

plt.bar(height=diag_gr['# of observations'], x=diag_gr.index)
plt.title("Barplot showing the number of benign and malignant tumors")

##  Visualise distribution of data via scatterplots

Scatterplots are commonly used to visualize two numerical variables.

A scatterplot is a good way to visually check whether a relationship exists between two variables.


In [None]:
plt.scatter(data.values[:,1],data.values[:,9])
plt.xlabel(data.columns[1])
plt.ylabel(data.columns[9])

**Using scatterplots, it is possible to outline the different groups**

In [None]:
M = data[data.diagnosis=="M"]
B = data[data.diagnosis=="B"]

plt.scatter(M.values[:,1],M.values[:,9],  label='Malignant', alpha=0.9)
plt.scatter(B.values[:,1],B.values[:,9],  label='Benign',  alpha=0.5)

plt.xlabel(M.columns[1])
plt.ylabel(M.columns[9])

plt.legend()

In [None]:
plt.scatter(M.concavity_mean.values ,M.area_mean.values,  label='Malignant',  alpha=0.9)
plt.scatter(B.concavity_mean.values ,B.area_mean.values,  label='Benign' ,  alpha=0.5)

plt.xlabel("concavity")
plt.ylabel("area")

plt.legend()

##  Visualise distribution of data via histograms
Histograms are commonly used to visualize numerical variables. A histogram is similar to a bar graph after the values of the variable are grouped (binned) into a finite number of intervals (bins).

Histograms group data into bins and provide you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.
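
To make the binning step concrete, here is a minimal sketch using `np.histogram` on ten invented values (not taken from the dataset). It returns the count in each bin and the bin edges, which is the information a histogram plot draws:

```python
import numpy as np

# Made-up values with one large observation at the end
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [min, max] into 4 equal-width bins and count observations per bin
counts, edges = np.histogram(values, bins=4)

print(counts)
print(edges)
```

Most observations fall in the first two bins, while the isolated value at the far right ends up alone in the last bin, which is how a possible outlier shows up in a histogram.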

In [None]:
data_mean=data.iloc[:,1:]
hist_mean=data_mean.hist( bins=12, figsize=(15, 10),grid=True, color = 'salmon', layout=(4,4) )

In [None]:
data.columns

### __Observation__

>We can see that the variable **concavity_mean** may have an exponential distribution. We can also see that the **symmetry_mean** and **fractal_dimension_mean** attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.


### Do the same for the other variables:

In [None]:
# For data.iloc[:,5:9]

In [None]:
# For data.iloc[:,9:]

**An alternative to the histogram is the density plot:**

In [None]:
# Store the result in `axes` rather than `plt`, so the matplotlib.pyplot alias is not overwritten
axes = data.iloc[:,1:].plot(kind='density', subplots=True, layout=(5,3), sharex=False,
                            sharey=False, fontsize=12, figsize=(15,10))


## Comparison between groups:
Is the mean symmetry across multiple measures significantly different between malignant and benign tumors?

#### H0: There is no significant difference in symmetry_mean between the two types of tumors, at a significance level of $\alpha$ = 0.05
> $\mu_B = \mu_M$

#### H1: There is a significant difference in symmetry_mean between the two types of tumors, at a significance level of $\alpha$ = 0.05
> $\mu_B \neq \mu_M$

### What test should be used? Under what conditions?

We want to compare the means of two groups. We can use a Student's t-test if the symmetry_mean in the overall population follows a normal distribution, and if the two groups have similar variances.

In [None]:
data.symmetry_mean.hist()

In [None]:
B_symmetry = B.symmetry_mean.values
M_symmetry = M.symmetry_mean.values


print("Variance of mean symmetry for Benign tumors: {:.6f}".format( B_symmetry.var()) )
print("Variance of mean symmetry for Malignant tumors: {:.6f}".format( M_symmetry.var()) )


**The variable seems to satisfy both conditions for using a Student's t-test:**
- An almost Gaussian distribution
- Similar variances between groups

In [None]:
import scipy.stats as stats

t_stat, p_value = stats.ttest_ind(B_symmetry, M_symmetry, alternative='two-sided')

print("T-statistic value: ", t_stat)
print("P-Value: ", p_value)

$pvalue < 0.05$; we can reject H0 at the risk level $\alpha$ = 5%

**There is a significant difference between the symmetry of a benign tumor and that of a malignant one.**
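
For readers who want to see what the test statistic measures, the sketch below recomputes a pooled-variance (Student) t-statistic by hand with NumPy. The helper name `pooled_t_statistic` and the two small samples are invented for illustration; with its default `equal_var=True`, `stats.ttest_ind` relies on the same pooled-variance formula:

```python
import numpy as np

def pooled_t_statistic(a, b):
    """Student t-statistic for two independent samples with a pooled variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Unbiased sample variances (ddof=1)
    var_a = a.var(ddof=1)
    var_b = b.var(ddof=1)
    # Pooled variance: weighted average of both variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    # Difference in means divided by its standard error
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Made-up samples, not taken from the dataset
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_manual = pooled_t_statistic(group_a, group_b)
print(round(t_manual, 4))
```

The sign of the statistic only reflects the order in which the two groups are passed to the function.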

#### We can graphically verify - Using histograms

In [None]:
import matplotlib.pyplot as plt


# plotting first histogram (benign group)
plot = plt.hist(x=B_symmetry, label='Benign', alpha=.8, edgecolor='blue')

# plotting second histogram (malignant group)
plot = plt.hist(x=M_symmetry, label='Malignant', alpha=.8, edgecolor='grey')

# Define limit of the plot
min_ylim, max_ylim = plt.ylim()

print(max_ylim)
# Adding annotation for the mean
plt.axvline(B_symmetry.mean(), color='blue', linestyle='dashed', linewidth=1)
plt.text(B_symmetry.mean()*.9, max_ylim*0.9, 'Mean: {:.3f}'.format(B_symmetry.mean()))

plt.axvline(M_symmetry.mean(), color='red', linestyle='dashed', linewidth=1)
plt.text(M_symmetry.mean()*1.05, max_ylim*0.9, 'Mean: {:.3f}'.format(M_symmetry.mean()))

plt.legend()
plt.xlabel("Symmetry")

# Showing the plot using plt.show()
plt.show()

#### We can graphically verify - Using boxplot:

In [None]:
my_dict = {'Benign': B_symmetry, 'Malignant': M_symmetry}

fig, ax = plt.subplots()
ax.boxplot(my_dict.values())
ax.set_xticklabels(my_dict.keys())
min_xlim, max_xlim = plt.xlim()

plt.axhline(data.symmetry_mean.mean(), color='red', linestyle='dashed', linewidth=1)
plt.text(max_xlim*.5, data.symmetry_mean.mean()*1.05, 'Population mean: \n {:.2f}'.format(data.symmetry_mean.mean() ) )


plt.title("Boxplot of symmetry_mean for the benign and malignant tumors")
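
A boxplot summarises each group by its quartiles. As a rough sketch of the numbers behind the drawing, the snippet below computes them for six invented values, together with the 1.5 × IQR fences that matplotlib uses by default to decide which points are drawn as outliers:

```python
import numpy as np

# Made-up values with one extreme observation
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])

# Box edges (Q1, Q3) and the line inside the box (median)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are shown as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```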

### __Observation__

>We can see that the variable **symmetry_mean** might be a useful predictor of the tumor state: in this dataset, malignant tumors tend to have higher symmetry values than benign ones.

## Check the correlation between variables:

In [None]:
data.iloc[:,2:].corr()

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

plt.style.use('fivethirtyeight')
sns.set_style("white")

# Compute the correlation matrix
corr = data.iloc[:,2:].corr()

# Generate a boolean mask with the same shape as the correlation matrix
mask = np.zeros_like(corr, dtype=bool)

# Masking upper triangle
mask[np.triu_indices_from(mask)] = True
# Showing the diagonal
mask[np.diag_indices_from(mask)] = False

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(8, 8))
plt.title('Breast Cancer Feature Correlation')

# Generate a custom diverging colormap
cmap = sns.diverging_palette(260, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, vmax=1.2, square=True, cmap=cmap, mask=mask,
            ax=ax,annot=True, fmt='.2g',linewidths=2,
            annot_kws={"fontsize": 8, "fontweight": "bold", "fontfamily": "serif"},)
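
The two masking steps in the cell above can be checked on a small example. This sketch builds the same kind of mask for a hypothetical 3 × 3 matrix: `True` marks the cells that the heatmap hides, so only the diagonal and the lower triangle stay visible:

```python
import numpy as np

# Boolean mask for a hypothetical 3x3 correlation matrix
mask = np.zeros((3, 3), dtype=bool)

# Hide the upper triangle (this also selects the diagonal)
mask[np.triu_indices_from(mask)] = True
# Un-hide the diagonal again
mask[np.diag_indices_from(mask)] = False

print(mask)
```

The same result could be obtained in one step with `np.triu(np.ones((3, 3), dtype=bool), k=1)`.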

### Observation:
We can see that strong positive relationships (r between 0.75 and 1) exist among several of the mean-value parameters.
* The mean area of the cell nucleus has a strong positive correlation with the mean values of radius and perimeter;
* Some parameters are moderately positively correlated (r between 0.5 and 0.75), such as concavity and area, or concavity and perimeter;
* Likewise, we see some strong negative correlation between fractal_dimension and the mean values of radius, texture and perimeter.
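
The r values discussed above are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. As a minimal sketch of the underlying formula, the hypothetical helper `pearson_r` below is applied to invented data and compared with `np.corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()   # centre both variables
    y_c = y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# Made-up data: one perfectly increasing and one perfectly decreasing relationship
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])
r_neg = pearson_r(x, [5, 4, 3, 2, 1])
r_numpy = np.corrcoef(x, [2, 4, 6, 8, 10])[0, 1]

print(r_pos, r_neg, r_numpy)
```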
    

### We can also have an overview of the relation between variables with the seaborn package:

In [None]:
plt.style.use('fivethirtyeight')
sns.set_style("white")

columns_to_show = data.columns[2:10].tolist() + ['diagnosis','radius_mean']


g = sns.PairGrid(data[columns_to_show] , hue = 'diagnosis' )

g = g.map_diag(sns.histplot)
g = g.map_offdiag(sns.scatterplot, s = 5)
g.add_legend()

### Summary

* Mean values of cell radius, perimeter, area, compactness, concavity
    and concave points can be used in classification of the cancer. Larger
    values of these parameters tend to be associated with malignant
    tumors.
* Mean values of texture, smoothness, symmetry or fractal dimension
    do not show a particular preference for one diagnosis over the other.
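
One way to back such statements with numbers is to compare the per-group means of each feature. The sketch below does this on a tiny invented frame (the four rows are made up and only mimic the column names of the dataset); running the same `groupby` on the real `data` frame is left to the reader:

```python
import pandas as pd

# Invented rows that only mimic the structure of the dataset
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "radius_mean": [18.0, 20.0, 11.0, 13.0],
    "area_mean": [1000.0, 1200.0, 400.0, 500.0],
})

# Mean of every numeric feature within each diagnosis group
group_means = toy.groupby("diagnosis").mean()
print(group_means)
```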