# Lab0 - Exploratory Data Analysis

Imagine we are interested in predicting whether a breast tumor is benign or malignant. First we want to do some
data exploration to get a feel for the data.

Source: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

## 0. Import libraries

In [None]:
# 0. import libraries
# your source code below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import warnings library
import warnings
warnings.filterwarnings('ignore')

## 1. Load Data 

In [None]:
# 1.1 read the data into dataframe from the csv file breast_cancer.csv
# your source code below
df = pd.read_csv('breast_cancer.csv')  # load the CSV file into a DataFrame named df


In [None]:
# 1.2 Show a few samples of the data  hint: use head().
# your source code below
print(df.head()) #print the first 5 rows of the dataframe

In [None]:
# 1.3 Display the information data such as each column's type, non-null information hint: use info().
# your source code below
df.info()  # info() prints each column's dtype and non-null count itself and returns None, so wrapping it in print() would only add a stray "None"

In [None]:
# 1.4 Display the summary information of data such mean and standard deviation. hint: use describe().
# your source code below
print(df.describe())  # summary statistics (count, mean, std, min, quartiles, max) for each numeric column
# Note: the 'Unnamed: 32' column appears to hold no values, so its statistics come out as NaN; it is dropped in 2.1 together with 'id'
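
A column named like `Unnamed: 32` often appears when every line of a CSV ends with a trailing comma, so pandas sees one extra, empty field. The sketch below reproduces that effect on a tiny made-up CSV. The values and the three-column layout are illustrative and not taken from breast_cancer.csv, and whether the real file has trailing commas is an assumption.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: note the trailing comma on every line
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"

toy_df = pd.read_csv(io.StringIO(csv_text))

# pandas names the empty header field after its position
print(toy_df.columns.tolist())

# the extra column holds only missing values
print(toy_df["Unnamed: 3"].isna().all())

# dropna(axis=1, how="all") is another way to remove columns that are entirely empty
cleaned = toy_df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping by name, as done in 2.1, is more explicit; `dropna(axis=1, how="all")` can be handy when the name of the empty column is not known in advance.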


In [None]:
# 1.5 print out the label count for each classification. hint: use value_counts().
# your source code below
# Count each label in the diagnosis column to see how many malignant and benign cases we have
print(df['diagnosis'].value_counts())  # value_counts() returns the number of rows for each distinct label
# diagnosis is the column we are trying to predict, so it will serve as the label when we train a model
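
When checking class balance, proportions are often easier to read than raw counts. A minimal sketch, using a made-up label Series rather than the real diagnosis column, shows `value_counts()` next to `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up labels for illustration (not the real class counts)
labels = pd.Series(["B", "M", "B", "B", "M"], name="diagnosis")

counts = labels.value_counts()                # absolute number of rows per label
shares = labels.value_counts(normalize=True)  # fraction of rows per label

print(counts)
print(shares)
```

On the lab data the same call would be `df['diagnosis'].value_counts(normalize=True)`.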


## 2. Process Data 

In [None]:
# 2.1 drop the columns of ['Unnamed: 32','id'] from the dataframe hint: use drop()
# your source code below
df = df.drop(columns=['Unnamed: 32', 'id'], errors='ignore')  # errors='ignore' skips any listed column that is missing, so this cell can be re-run without raising an error
# now that the columns are dropped, print the summary again to see the change
# the summary now shows useful statistics only, without the NaN-filled 'Unnamed: 32' column
print(df.describe())  # summary statistics (mean, std, min, max, etc.) of the cleaned dataframe
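
The effect of `errors='ignore'` can be checked on a small made-up frame. This is only a sketch: the frame below has no `Unnamed: 32` column at all, which stands in for the situation where the drop cell is run a second time.

```python
import pandas as pd

# Made-up frame for illustration
toy_df = pd.DataFrame({"id": [1, 2], "diagnosis": ["M", "B"], "radius_mean": [17.99, 13.54]})

# 'Unnamed: 32' does not exist here and is silently skipped; 'id' is dropped
once = toy_df.drop(columns=["Unnamed: 32", "id"], errors="ignore")
# running the same drop again finds nothing to remove and still does not fail
twice = once.drop(columns=["Unnamed: 32", "id"], errors="ignore")

print(once.columns.tolist())
print(twice.columns.tolist())

# Without errors='ignore', asking to drop a missing column raises KeyError
try:
    once.drop(columns=["id"])
    raised = False
except KeyError:
    raised = True
print(raised)
```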

In [None]:
# 2.2 store the feature names as a list, and print it out
# your source code below
# Store the column names as a list (note: at this point the list still includes the 'diagnosis' label column)
feature_names = df.columns.tolist()  # tolist() converts the column Index into a plain Python list

# Print it out
print(feature_names)




In [None]:
# 2.3 construct X and y, so that:
# - X holds all data information except label column (i.e., diagnosis)
# - y hold the data label column (i.e., diagnosis)
#  !!! you still keep your original dataframe for later use !!!
# print out the shapes of the dataframe, X and y
# your source code below

# Constructing X and y
# X: every column except the label (the 30 features)
X = df.drop(columns=['diagnosis'])  # drop() returns a new dataframe, so df itself still contains the diagnosis column
# y: the label column only
y = df['diagnosis']  # a Series holding the diagnosis of each sample

# Printing out the shapes of the dataframe, X, and y
# f-strings make the output easier to read
print(f"Shape of dataframe: {df.shape}") # print the shape of the dataframe
print(f"Shape of X: {X.shape}") # print the shape of X
print(f"Shape of y: {y.shape}") # print the shape of y
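
The requirement to keep the original dataframe is satisfied because `drop()` returns a new object by default instead of modifying `df`. A minimal sketch on a made-up three-row frame (the values are illustrative):

```python
import pandas as pd

# Made-up frame for illustration
toy_df = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "texture_mean": [10.38, 14.36, 15.70],
})

X_toy = toy_df.drop(columns=["diagnosis"])  # features only; toy_df is left untouched
y_toy = toy_df["diagnosis"]                 # label column as a Series

print(toy_df.shape, X_toy.shape, y_toy.shape)
print("diagnosis" in toy_df.columns)  # the original frame still has the label
```

X has one column fewer than the dataframe, and y is one-dimensional with one entry per row.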


## 3. Visualize Data

In [None]:
#3.1: Plot histograms for X (all 30 features) hint: use subplots and histograms
# make sure your subplots should be well spaced
# your source code below

# Find out how many features we have
n_features = X.shape[1]  # shape returns a tuple (n_rows, n_columns), in this case (569, 30)
# For readability, put 5 histograms per row
# (30 features would divide just as evenly into rows of 6, 10, 15, etc.)
#n_cols = 6
n_cols = 5

# Calculate how many rows of subplots are needed
# n_rows = (n_features // n_cols) + 1  # would add an extra empty row whenever n_features is a multiple of n_cols
n_rows = (n_features + n_cols - 1) // n_cols  # ceiling division: the smallest number of rows that fits every feature

# Start with a 15 x 15 inch figure; it can be made bigger later if necessary
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(15, 15))  # here axes is a 2D array of subplots with shape (n_rows, n_cols)

# Add space between the subplots so titles and labels do not overlap
fig.subplots_adjust(hspace=1.0, wspace=1.0)  # hspace/wspace set the vertical/horizontal padding as a fraction of the average subplot size

# For each individual feature, draw a histogram
for i, column in enumerate(X.columns):
    # Figure out where each histogram should go
    row = i // n_cols  # integer division gives the grid row
    col = i % n_cols  # the remainder gives the grid column

    # Draw the histogram for this feature
    axes[row, col].hist(X[column], bins=20, color='blue', alpha=0.5)  # 20 bins, blue, 50% transparency

    # Title each subplot with its feature name so we know which feature goes with which plot
    axes[row, col].set_title(column)
    # Label the x and y axes
    axes[row, col].set_xlabel('Value')
    axes[row, col].set_ylabel('Frequency')

# If the grid has more cells than features (e.g. 27 features in a grid of 30 cells),
# hide the unused subplots
for j in range(i + 1, n_rows * n_cols):  # flat indices of the cells left over after the last feature
    axes[j // n_cols, j % n_cols].axis('off')  # turn off the axes of an empty cell

# badabing badaboom, now I draw the plot and presto
plt.show()
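
The grid arithmetic used above (ceiling division for the row count, then `//` and `%` to place each plot) can be checked without drawing anything. The helper names `grid_shape` and `cell_of` below are hypothetical and exist only for this sketch:

```python
def grid_shape(n_items, n_cols):
    """Return (n_rows, n_cols) for the smallest grid with n_cols columns that holds n_items."""
    n_rows = (n_items + n_cols - 1) // n_cols  # ceiling division
    return n_rows, n_cols


def cell_of(index, n_cols):
    """Map a flat index to its (row, col) position, filling the grid row by row."""
    return divmod(index, n_cols)  # same as (index // n_cols, index % n_cols)


print(grid_shape(30, 5))  # 30 features fill the grid exactly
print(grid_shape(27, 5))  # 27 features need the same number of rows
print(cell_of(7, 5))      # position of the 8th plot

# cells left empty when only 27 of the 30 positions are used
n_rows, n_cols = grid_shape(27, 5)
unused = [cell_of(j, n_cols) for j in range(27, n_rows * n_cols)]
print(unused)
```

With 30 features and 5 columns there are no unused cells, so the hiding loop in 3.1 simply does nothing.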

In [None]:
# 3.2: do pairplot for the first five columns in data matrix
# hint: using seaborn.pairplot, and use hue ='diagnosis'
# make sure you only extract the first five columns

# your source code below

# Make a subset of the dataframe that only has the first 5 columns
# df.iloc[:, :5] selects all rows and the columns at positions 0 through 4
# (the slice stops before position 5)
# hue='diagnosis' below requires that the diagnosis column is among these five
subset_df = df.iloc[:, :5]

#check to make sure it worked
#print(subset_df.head())

# sns.pairplot draws every pairwise plot for the columns in subset_df
# hue='diagnosis' colors the points based on the diagnosis
# diag_kind='hist' makes the diagonal plots histograms
# (the default density curve would also work, but the histogram is more informative here)
# palette sets the colors; 'husl', 'viridis', 'magma', etc. are all valid choices
# kind='hist' would turn the off-diagonal scatter plots into 2D histograms, but the scatter plot is kept here
sns.pairplot(subset_df, hue='diagnosis', palette='magma', diag_kind='hist')

# badabing badaboom show the plot
plt.show()
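
Because `hue='diagnosis'` needs the label column to be present in the data passed to `sns.pairplot`, a positional slice only works if diagnosis happens to sit within the first five columns. A hedged alternative is to pick the feature columns by name and append the label explicitly. The frame below is made up, with the label deliberately placed last, and `n_first` is a hypothetical parameter for this sketch:

```python
import pandas as pd

# Made-up frame for illustration; the label is deliberately the last column
toy_df = pd.DataFrame({
    "radius_mean": [17.99, 13.54],
    "texture_mean": [10.38, 14.36],
    "perimeter_mean": [122.8, 87.46],
    "diagnosis": ["M", "B"],
})

n_first = 2  # how many feature columns to keep

# take the first n_first non-label columns, then add the label back
feature_cols = [c for c in toy_df.columns if c != "diagnosis"][:n_first]
subset = toy_df[feature_cols + ["diagnosis"]]

print(subset.columns.tolist())
# subset could then be passed to sns.pairplot(subset, hue="diagnosis")
```

This keeps the selection independent of where the label column sits in the file.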
