# Breast Cancer

**Data Set Information:**

There are two classes (binary classification), “no-recurrence-events” and “recurrence-events”, which describe whether the patient’s cancer reappeared after treatment. The other 9 attributes contain general information about the patients as well as more specific information about their individual cancer diagnoses. The goal is to use these attributes to classify whether a patient’s breast cancer will recur.


**Attribute Information:**

**- Class:** Describes whether a patient had recurrent tumors;<br>
**- age:** Age, listed in intervals of 10 years;<br>
**- menopause:** Nominal. Short text description of menopausal status;<br>
**- tumor-size:** Interval in which the diameter of the tumor falls;<br>
**- inv-nodes:** Interval in which the number of lymph nodes in close proximity to the tumor falls;<br>
**- node-caps:** Nominal. Describes whether there are metastases or not;<br>
**- deg-malig:** Numerical. Describes how severe the cancer is (degree of malignancy);<br>
**- breast:** Nominal. Describes the affected breast;<br>
**- breast-quad:** Nominal. Text representing the location of the tumor in the breast;<br>
**- irradiat:** Nominal (yes/no). Indicates whether the patient underwent radiation therapy.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Get the data

In [None]:
data = pd.read_csv('breast-cancer.csv')

In [None]:
data

*Remove quotation marks*

In [None]:
for column in data:
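    # Strip surrounding single quotes; leave non-string values (numbers, NaN) untouched
    data[column] = data[column].map(lambda x: x.strip("'") if isinstance(x, str) else x)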

In [None]:
data.head()

### Basic Data Information

In [None]:
data.info()

In [None]:
data.describe()

### Check missing values

In [None]:
# Show every row that contains a '?' in at least one column
data.loc[(data == '?').any(axis=1)]

*Convert missing data (indicated by a ?) into NaN*

In [None]:
data.replace("?", np.nan, inplace = True)

In [None]:
print(data.isnull().sum())
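
The counts above are easier to judge as a share of rows per column. The following is a minimal sketch, assuming a small hypothetical frame (`toy`, with made-up rows) in place of the real CSV, so the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real data; '?' marks a missing value.
toy = pd.DataFrame({
    'node-caps': ['yes', '?', 'no', 'no'],
    'breast-quad': ['left_up', 'left_low', '?', '?'],
    'Class': ['recurrence-events', 'no-recurrence-events',
              'no-recurrence-events', 'recurrence-events'],
})

# Same conversion as in the notebook: '?' becomes NaN.
toy = toy.replace('?', np.nan)

# Fraction of missing entries per column (0.0 means the column is complete).
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to the notebook's frame, `data.isnull().mean()` would give the same kind of per-column fraction.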

# Exploratory Data Analysis

In [None]:
c_palette = ['tab:red','tab:green']

*Countplot of the Target* 

In [None]:
sns.set_style('darkgrid')
ax = sns.countplot(x = data['Class'], palette=c_palette)

total = len(data['Class'])

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:.1f}%'.format(100 * height/total),
            ha="center")
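
The percentages annotated on the bars can be cross-checked numerically. This is a minimal sketch using a hypothetical label series (`labels`, with made-up values) in place of `data['Class']`:

```python
import pandas as pd

# Hypothetical labels standing in for data['Class'].
labels = pd.Series(['no-recurrence-events'] * 7 + ['recurrence-events'] * 3,
                   name='Class')

# normalize=True returns fractions; multiply by 100 to match the plot labels.
class_share = labels.value_counts(normalize=True).mul(100).round(1)
print(class_share)
```

If the values printed for the real column differ from the bar labels, the annotation loop and the plotted data are out of sync.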

**Class on Age Interval**

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(y="age", hue="Class", data=data, palette=c_palette)
plt.show()

**Class on Menopause**

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(y="menopause", hue="Class", data=data, palette=c_palette)
plt.show()

**Class on Breast**

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(y="breast", hue="Class", data=data, palette=c_palette)
plt.show()

**Class on Breast-Quad**

*Split by left/right breast*

In [None]:
right_b = data.loc[data['breast'] == 'right']
left_b = data.loc[data['breast'] == 'left']

In [None]:
fig = plt.figure(figsize = (15,10))
ax1 = fig.add_subplot(2,1,1)
# Plot from the left-breast subset so y and hue come from the same rows
sns.countplot(y='breast-quad', hue='Class', data=left_b, ax=ax1, palette=c_palette)
ax1.set(ylabel='Left Breast')

ax2 = fig.add_subplot(2,1,2)
# Same plot for the right-breast subset
sns.countplot(y='breast-quad', hue='Class', data=right_b, ax=ax2, palette=c_palette)
ax2.set(ylabel='Right Breast')
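
The same breakdown can be read as a table instead of two plots. The sketch below assumes a hypothetical toy frame (`toy`, made-up rows) and uses `pd.crosstab` with a two-level row index:

```python
import pandas as pd

# Hypothetical toy rows; the notebook would pass `data` instead.
toy = pd.DataFrame({
    'breast': ['left', 'left', 'left', 'right', 'right'],
    'breast-quad': ['left_low', 'left_low', 'left_up', 'left_up', 'central'],
    'Class': ['recurrence-events', 'no-recurrence-events',
              'no-recurrence-events', 'recurrence-events',
              'no-recurrence-events'],
})

# Rows: (breast, quadrant) pairs; columns: class labels; cells: patient counts.
quad_counts = pd.crosstab([toy['breast'], toy['breast-quad']], toy['Class'])
print(quad_counts)
```

A table like this makes small cells visible that are hard to compare across two separate axes.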

**Class on Degree of Malignancy**

In [None]:
fig = plt.figure(figsize = (15,10))
ax1 = fig.add_subplot(2,2,1)
# Cast deg-malig to float once so both plots share the same numeric y-axis
plot_data = data.assign(**{'deg-malig': data['deg-malig'].astype(float)})
sns.violinplot(data=plot_data, x='Class', y='deg-malig', ax=ax1, palette=c_palette)
sns.swarmplot(data=plot_data, x='Class', y='deg-malig', color='k', alpha=0.6, ax=ax1)

**Class on Lymph-Nodes**

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(y="inv-nodes", hue="Class", data=data, palette=c_palette)
plt.show()

**Class on Metastases**

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(y="node-caps", hue="Class", data=data, palette=c_palette)
plt.show()

**Class on Irradiate**

In [None]:
plt.figure(figsize=(12, 5))
sns.countplot(y="irradiat", hue="Class", data=data, palette=c_palette)
plt.show()