# Exploratory Data Analysis in Python
# What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) means examining a dataset and summarizing its main characteristics. It is one of the first steps in a data science project and helps us understand the data better. EDA often includes data visualization as well. Once we have this basic understanding, we can move on to the predictive and prescriptive parts of the project.

Checklist for EDA:
1. Checking the different features present in the dataset & its shape
2. Checking the data type of each column
3. Encoding the labels for classification problems
4. Checking for missing values
5. Descriptive summary of the dataset
6. Checking the distribution of the target variable
7. Grouping the data based on target variable
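
The checklist above can be sketched as a single helper that gathers the basic checks in one call. This is only a minimal sketch: the function name `eda_overview` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def eda_overview(df: pd.DataFrame, target: str) -> dict:
    """Collect the basic EDA checks from the checklist in one dictionary."""
    return {
        "shape": df.shape,                                     # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),             # data type per column
        "missing": df.isnull().sum().to_dict(),                # missing values per column
        "summary": df.describe().T,                            # descriptive statistics
        "target_counts": df[target].value_counts().to_dict(),  # class balance
        "group_means": df.groupby(target).mean(numeric_only=True),  # per-class means
    }


# Tiny made-up frame, only to show the call; this is not the real dataset.
toy = pd.DataFrame({
    "radius": [10.0, 12.5, 20.1, None],
    "texture": [15.2, 17.9, 22.3, 19.0],
    "label": ["B", "B", "M", "M"],
})

overview = eda_overview(toy, target="label")
print(overview["shape"])
print(overview["missing"])
print(overview["group_means"])
```

Label encoding (step 3) is left out of the helper on purpose, because it changes the data rather than inspecting it.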

# Understanding EDA with an interesting use case in Python:
Dataset: To understand EDA, we will work with the Breast Cancer Wisconsin (Diagnostic) Data Set. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. The dataset is available on Kaggle and in the UCI ML Repository.

Using this dataset, we can build a classification system that predicts whether a tumor is Benign or Malignant. Malignant tumors are cancerous. In the EDA part, we will try to understand the characteristics of the data and its descriptive measures.

To start, let’s import the dependencies and load the data.

In [20]:
# Show the result of every expression in a cell, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset from a local CSV file.
breast_cancer_data = pd.read_csv('./data/cancer.csv')


# Checking the different features present in the dataset:
For this, we can use the head() method in pandas to preview the first rows, and the shape attribute to get the number of rows and columns.

In [21]:
breast_cancer_data.head()
breast_cancer_data.shape

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


(569, 33)

As we can see here, the dataset contains 569 rows (data points) and 33 columns: an id, the diagnosis label, 30 numeric features, and an empty trailing column named “Unnamed: 32”.
The second column is “diagnosis”, where “M” stands for Malignant and “B” for Benign. This is our target column.
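
The preview above ends with a column named “Unnamed: 32” that holds no values. One way to detect and drop such fully empty columns is sketched below on a made-up two-row stand-in for the real file; the variable names `toy`, `empty_cols` and `cleaned` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real file: two rows plus an all-empty trailing column.
toy = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "M"],
    "radius_mean": [17.99, 20.57],
    "Unnamed: 32": [np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)

# Drop columns (axis=1) only when all of their values are missing.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.shape)
```

Using `how="all"` keeps columns that are only partly missing, so real features are not dropped by accident.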

# Checking the data type of each column and the non-null count:


In [22]:
breast_cancer_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

As we can see here, ‘id’ is an integer column and ‘diagnosis’ has the ‘object’ dtype, which here means it holds text labels, so it is a categorical variable. The remaining columns are continuous numerical variables.
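
The split between categorical and numerical columns can also be made programmatically instead of reading it off the info() output. The sketch below uses a made-up three-row frame with the same kinds of columns; it is an illustration, not the notebook's own code.

```python
import pandas as pd

# Made-up frame with one integer, one text, and one float column.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 11.42, 12.30],
})

# Integer and float columns count as "number"; everything else is non-numeric here.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(categorical_cols)

# For categorical columns, the number of distinct values is a useful first check.
print(toy[categorical_cols].nunique())
```

Note that an identifier such as ‘id’ is numeric by dtype but is not a measurement, so a dtype check alone does not decide which columns are useful features.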

# Encoding the labels for classification problems:
Now let’s encode the “diagnosis” column, so that all the columns are in the numerical format. We will encode “B” as 0 and “M” as 1.




In [23]:
# LabelEncoder assigns integers in sorted label order: "B" -> 0, "M" -> 1.
label_encode = LabelEncoder()
labels = label_encode.fit_transform(breast_cancer_data['diagnosis'])
labels[15:25]
breast_cancer_data['target'] = labels
breast_cancer_data['diagnosis'][15:25]
# Drop the identifier and the original text label; 'target' replaces the label.
breast_cancer_data.drop(columns=['id', 'diagnosis'], inplace=True)

breast_cancer_data['target'][15:25]

array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1])

15    M
16    M
17    M
18    M
19    B
20    B
21    B
22    M
23    M
24    M
Name: diagnosis, dtype: object

15    1
16    1
17    1
18    1
19    0
20    0
21    0
22    1
23    1
24    1
Name: target, dtype: int32

Here, we encode the “diagnosis” column, store the result in a new column called “target”, and remove the original “diagnosis” column. We also remove the “id” column because it is only an identifier and is not needed for the analysis.
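
LabelEncoder numbers the classes in sorted order, which is why “B” becomes 0 and “M” becomes 1. When the mapping should be visible in the code itself, an explicit dictionary is an alternative. The sketch below swaps in pandas `map` for LabelEncoder and runs on a made-up Series.

```python
import pandas as pd

# Made-up labels standing in for the "diagnosis" column.
diagnosis = pd.Series(["M", "B", "B", "M"], name="diagnosis")

# The mapping is written out, so the meaning of 0 and 1 is explicit.
mapping = {"B": 0, "M": 1}
target = diagnosis.map(mapping)

# Any label missing from the mapping would become NaN, so verify none did.
assert target.notna().all()
print(target.tolist())
```

The explicit mapping also fails loudly (through the check above) if an unexpected label appears in the data.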

# Checking for missing values:
Now, let’s check whether there are any missing values in the dataset.

In [25]:
breast_cancer_data.isnull().sum()

radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed: 32                569
target                       0
dtype: i

The line of code above reports how many missing values each column contains.
As we can see here, none of the feature columns or the target has missing values. The only exception is “Unnamed: 32”, where all 569 values are missing; it is an empty column and can be dropped. If a dataset does have missing values in its features, we handle them in the “Feature Engineering” part.
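
With many columns, the per-column counts are easier to read as a short report that lists only the columns with missing values, together with the share of rows affected. A minimal sketch on a made-up three-row frame (the names `toy` and `report` are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up frame: one complete column, one partly missing, one fully missing.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, np.nan, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

report = pd.DataFrame({
    "missing": toy.isnull().sum(),   # number of missing values per column
    "share": toy.isnull().mean(),    # fraction of rows that are missing
})

# Keep only columns that have at least one missing value, worst first.
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```

A share of 1.0 marks a column that is entirely empty, while a small share points to a column that may be worth imputing.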

# Descriptive summary of the dataset:
The next step is to compute some statistical measures of the dataset. This is called “Descriptive Statistics”, a summary of the data. For this, we can use the describe() method in pandas.

In [27]:
breast_cancer_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
radius_mean,569.0,14.127292,3.524049,6.981,11.7,13.37,15.78,28.11
texture_mean,569.0,19.289649,4.301036,9.71,16.17,18.84,21.8,39.28
perimeter_mean,569.0,91.969033,24.298981,43.79,75.17,86.24,104.1,188.5
area_mean,569.0,654.889104,351.914129,143.5,420.3,551.1,782.7,2501.0
smoothness_mean,569.0,0.09636,0.014064,0.05263,0.08637,0.09587,0.1053,0.1634
compactness_mean,569.0,0.104341,0.052813,0.01938,0.06492,0.09263,0.1304,0.3454
concavity_mean,569.0,0.088799,0.07972,0.0,0.02956,0.06154,0.1307,0.4268
concave points_mean,569.0,0.048919,0.038803,0.0,0.02031,0.0335,0.074,0.2012
symmetry_mean,569.0,0.181162,0.027414,0.106,0.1619,0.1792,0.1957,0.304
fractal_dimension_mean,569.0,0.062798,0.00706,0.04996,0.0577,0.06154,0.06612,0.09744


The main inference here is that, for most columns, the mean is larger than the median (the 50th percentile, shown as 50%). This indicates that those features are right-skewed. The skew becomes visible when we plot the distribution of individual features in the Data Visualization part.
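
The comparison of mean and median can be turned into a number. The sketch below copies four rows from the summary table above and computes Pearson's second skewness coefficient, 3 × (mean − median) / std, as a rough indicator; it is an illustration of the idea and not part of the original notebook.

```python
import pandas as pd

# Mean, standard deviation and median (50%) copied from the summary table above.
summary = pd.DataFrame(
    {
        "mean": [14.127292, 654.889104, 0.096360, 0.088799],
        "std": [3.524049, 351.914129, 0.014064, 0.079720],
        "50%": [13.37, 551.1, 0.09587, 0.06154],
    },
    index=["radius_mean", "area_mean", "smoothness_mean", "concavity_mean"],
)

# Positive values mean the mean lies above the median, i.e. a right skew.
pearson_skew = 3 * (summary["mean"] - summary["50%"]) / summary["std"]
print(pearson_skew.sort_values(ascending=False))
```

On the full DataFrame, `breast_cancer_data.skew()` would give the sample skewness of each column directly from the raw values.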

# Checking the distribution of the target variable:
The next step is to check how the data points are distributed across the values of the target variable, to see whether the classes are imbalanced. This step is specific to classification problems.

In [29]:
breast_cancer_data['target'].value_counts()


0    357
1    212
Name: target, dtype: int64

As we can see, there is a slight imbalance in the dataset: Benign (0) cases outnumber Malignant (1) cases, 357 to 212. The imbalance is not large enough to worry about in this case.
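
Class shares are often easier to judge than raw counts. The sketch below rebuilds a target column with the same counts as the output above (357 zeros and 212 ones) and uses `value_counts(normalize=True)`; the rebuilt Series is a stand-in, not the real column.

```python
import pandas as pd

# Stand-in target with the same class counts as the output above.
target = pd.Series([0] * 357 + [1] * 212, name="target")

# Fraction of rows in each class instead of raw counts.
shares = target.value_counts(normalize=True)
print(shares.round(3))

# Ratio of the largest class to the smallest class.
counts = target.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

A ratio close to 1 means balanced classes; the larger it gets, the more attention the imbalance needs during modeling.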

# Grouping the data based on target variable:
This step is also specific to classification problems. We group the dataset by the target variable, with 0 and 1 representing Benign and Malignant respectively, and compute the mean of every column within each group.

In [31]:
breast_cancer_data.groupby('target').mean().T


target,0,1
radius_mean,12.146524,17.46283
texture_mean,17.914762,21.604906
perimeter_mean,78.075406,115.365377
area_mean,462.790196,978.376415
smoothness_mean,0.092478,0.102898
compactness_mean,0.080085,0.145188
concavity_mean,0.046058,0.160775
concave points_mean,0.025717,0.08799
symmetry_mean,0.174186,0.192909
fractal_dimension_mean,0.062867,0.06268


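This tells us that, for most features, the mean is greater for Malignant cases than for Benign cases; fractal_dimension_mean is an exception, with nearly identical means in both groups. This inference is important because features whose means differ between the two groups are likely to help separate them.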
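
To see which features differ most between the two groups, the group means can be compared as a ratio. The sketch below copies four rows from the table above; the ratio of Malignant to Benign means is one simple way to rank features, chosen here for illustration.

```python
import pandas as pd

# Group means copied from the table above (columns: target 0 = Benign, 1 = Malignant).
group_means = pd.DataFrame(
    {
        0: [12.146524, 462.790196, 0.046058, 0.062867],
        1: [17.462830, 978.376415, 0.160775, 0.062680],
    },
    index=["radius_mean", "area_mean", "concavity_mean", "fractal_dimension_mean"],
)
group_means.columns.name = "target"

# Ratio above 1: the feature is larger on average for Malignant cases.
ratio = (group_means[1] / group_means[0]).sort_values(ascending=False)
print(ratio.round(2))
```

A ratio near 1, as for fractal_dimension_mean, suggests that the feature's mean alone does little to distinguish the two classes.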