# Explore and visualize with pandas + seaborn

Welcome! In this lesson, you'll get a crash course in bringing your tabular research data into Python. 

We will introduce **pandas**, a key tool in the Python data ecosystem, along with **NumPy**, the numerical library that underlies pandas, and **matplotlib**, a visualization library that pandas integrates with. (We'll also use the **seaborn** visualization library briefly.) Throughout this notebook, we will focus on how to use Python tools to think critically about your data.

At the end, you'll have written a data science recipe you can bring to future datasets and research questions.

In [None]:
from IPython.display import Image

In [None]:
Image('https://d33wubrfki0l68.cloudfront.net/795c039ba2520455d833b4034befc8cf360a70ba/558a5/diagrams/data-science-explore.png')

*The exploratory data analysis cycle (Grolemund and Wickham, 2017)*

## A refresher: Breast Cancer Diagnostic Dataset

*(If you're proceeding directly from the Import and Tidy Data pandas notebook, feel free to skip ahead to **Load in cleaned data from .csv**)*

What does it take to produce good, reliable diagnoses? Clinicians need to use information strategically to determine the best plan of care. Researchers may wish to improve clinician training, invent automated systems that support patient-facing staff, or identify pathways for future investigations into the etiology of disease.

When exploring a new dataset, it's important to understand and clearly articulate research questions. 

Let's suppose we are researchers at Big University Oncology Institute working with fine needle aspirate (FNA) images of breast masses. Our long-term goal is to make it easier to distinguish malignant from non-malignant masses in FNA digital images. As one part of that research, we are interested in whether patterns exist in imaging data that would improve the ability of clinicians to correctly distinguish malignant and non-malignant tumors.

In [None]:
Image('https://upload.wikimedia.org/wikipedia/commons/8/8b/Breast_fibroadenoma_by_fine_needle_aspiration_%282%29_PAP_stain.jpg', width = 600)

*Fine needle aspiration of fibroadenoma, a type of benign breast tumor. (Source: Wikimedia)*

Today, we'll be using the **Wisconsin Diagnostic Breast Cancer (WDBC) dataset**. This data was initially captured in eight groups from January 1989 to November 1991 by Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison. Dr. Wolberg subsequently donated the dataset to the University of California-Irvine's Machine Learning Repository, where it has been used in dozens of subsequent articles and further analyzed and transformed.

The [original repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+\(Diagnostic\)) contains several versions of the breast cancer data. For today we're interested in Dr. Wolberg's original encodings of the FNA image data, which use a ranking of 1 to 10 to describe a variety of attributes such as uniformity of cell shape and size, mitoses, normal nucleoli, marginal adhesion, clump thickness, and so on.

Given our research questions, how might we wish to explore a dataset where observations correspond to fine needle aspiration (FNA) images of breast growths? What's important to us?

(**Edit this box to include a question you would like to investigate in this data.**)



## Load in cleaned data from .csv

In the previous notebook, we took the raw data from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset and cleaned and tidied it. This lesson continues with that cleaned data.

First, let's import the pandas and NumPy libraries, then load the results of the previous notebook into a pandas DataFrame:

In [None]:
import pandas as pd
import numpy as np

In [None]:
diagnostic_data = pd.read_csv('https://raw.githubusercontent.com/arcus/education-materials/master/explore-pandas-seaborn/wisconsin_data_clean.csv')
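If you'd like to see what `pd.read_csv` does without relying on a network connection, here is a minimal sketch that reads a few rows from an in-memory string instead of a URL. The rows and values below are made up for illustration and are not taken from the real dataset; only the column names echo the ones used in this lesson.

```python
import io
import pandas as pd

# A tiny, made-up CSV (hypothetical values, for illustration only)
csv_text = """sample_code,clump_thickness,malignant
101,5,2
102,3,2
103,8,4
104,1,2
"""

# read_csv accepts a file path, a URL, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)         # number of rows and columns
print(toy.dtypes)        # the data type pandas inferred for each column
print(toy.isna().sum())  # count of missing values per column
```

Running the same three checks on `diagnostic_data` is a quick way to confirm that the file loaded the way you expected.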

## The core data exploration cycle: transforming and visualizing

Earlier, we had described our research question as follows:

> ...we are interested in whether patterns exist in imaging data that would improve the ability of clinicians to correctly distinguish malignant and non-malignant tumors.

Let's break this goal into some specific tasks. First, we want to understand:
1. The overall size, shape, and scope of our FNA imaging data.
2. Summary statistics and distributions for each image attribute.
3. Possible correlations between image attributes.

And, in the future, to pursue:
4. Avenues for subsequent analysis (e.g. designing a machine learning classification task)

Now that we've brought our data into the DataFrame structure, we have access to an array of methods and tools to accomplish these tasks. Let's explore each below.

### 1. Overall size, shape, and scope of our FNA imaging data.

When getting oriented to a new dataset, I like to see a few examples of observations, and also make sure the data is in a format I expect.

We've already seen `.head()` and `.info()`. Let's also add `.tail()`, which shows the last 5 rows by default. We can also pass in a number `n` to see the last `n` rows instead.

In [None]:
diagnostic_data.head()

In [None]:
diagnostic_data.tail(10)

In [None]:
diagnostic_data.info()

If you want to quickly view the overall dimensions of your DataFrame, the most concise way is to use `.shape`. Unlike `.head()`, which is a **method** (a function that belongs to our DataFrame object), `.shape` is an **attribute**, meaning that it holds some data about the object.

The reason this matters is that `.head()` always takes parentheses at the end, and `.shape` never takes parentheses.

Run the code below to see this in action.

In [None]:
diagnostic_data.shape

In contrast, writing `.shape` with parentheses will produce a TypeError:

In [None]:
diagnostic_data.shape()
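To make the attribute-versus-method distinction concrete, here is a small sketch on a made-up DataFrame (the values are hypothetical). It uses Python's built-in `callable()` to check what can and cannot be followed by parentheses, and catches the TypeError so the cell finishes cleanly.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration
toy = pd.DataFrame({"clump_thickness": [5, 3, 8], "mitoses": [1, 1, 2]})

# .shape is an attribute: stored data, here a (rows, columns) tuple
print(toy.shape)            # (3, 2)
print(callable(toy.shape))  # False: a tuple cannot be called

# .head is a method: it is callable and needs parentheses to run
print(callable(toy.head))   # True

# Calling the attribute raises a TypeError, which we catch here
try:
    toy.shape()
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```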

That's it for now!

### 2. Summary statistics and distributions for each image attribute.

We now have a big-picture view of how our dataframe is organized, but we don't yet know much about the individual columns. What is the distribution of values within a given column? How do they compare to one another?

Fortunately, column-level analysis is where the NumPy-powered architecture of pandas really shines. Let's take some time exploring columns individually and comparing them to one another.

First, access the `.columns` attribute of our DataFrame to remind us of all available columns:

In [None]:
# replace with your code
diagnostic_data.columns

Now, let's get a visual sense of the distribution of values within each column. Pandas supports visualization using the matplotlib library natively, which is very helpful for our purposes. To make this work within a notebook, we simply need to import matplotlib, and then specify that we want to see inline visualizations with a "magic command" (the magic command is specific to Jupyter Notebooks):

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

We can generate side-by-side histograms of all our numeric columns by calling the `.hist()` method on our DataFrame. It's not necessary to specify parameters, but for readability, it's helpful to set the `figsize` parameter to a `(width, height)` tuple, in inches:

In [None]:
our_plot = diagnostic_data.hist(figsize=(12,12))
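Because the image attributes are whole-number ranks, it can help to look at the exact counts behind a histogram. The sketch below does this with `value_counts()` on a made-up column (hypothetical values), and assumes the ranks run from 1 to 10 as described in the dataset's readme.

```python
import pandas as pd

# Made-up ranks for illustration only
toy = pd.DataFrame({"clump_thickness": [1, 1, 3, 5, 5, 5, 10]})

# Count how often each rank occurs, ordered by rank rather than by frequency
counts = toy["clump_thickness"].value_counts().sort_index()
print(counts)

# Ranks that never occur are missing from the result;
# reindex over 1..10 to show them explicitly as zero
counts_full = counts.reindex(range(1, 11), fill_value=0)
print(counts_full)
```

The second table lines up with what a ten-bar histogram would draw, one bar per rank.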

We can pair this code with the column select syntax `DataFrame[["Column A", "Column B"...]]` to compare just a subset of columns. Unlike the dot format of referencing a column (which is my favorite for most situations), the bracket format allows you to pass in a list of columns, which is helpful for subsetting. If you write code this way, note that you need two sets of brackets (the outer pair selects from the DataFrame, the inner pair builds the list), and the column names must now be passed in as strings (and thus must be surrounded by quotation marks).

Run the example below and please experiment with your own code afterwards:


In [None]:
comparison_plot = diagnostic_data[["marginal_adhesion","uniformity_cell_size"]].hist(figsize=(12,6))

In [None]:
## Add your own code to compare the distribution of two other variables here!
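If the single-versus-double bracket rule feels arbitrary, this sketch may help. On a made-up DataFrame (hypothetical values), one set of brackets with a string returns a single column as a Series, while a list inside the brackets returns a smaller DataFrame.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration
toy = pd.DataFrame({
    "marginal_adhesion": [1, 5, 1],
    "uniformity_cell_size": [1, 4, 1],
    "mitoses": [1, 1, 2],
})

# One string in brackets: a single column, returned as a Series
one_column = toy["mitoses"]

# A list in brackets: a subset of columns, returned as a DataFrame
subset = toy[["marginal_adhesion", "uniformity_cell_size"]]

print(type(one_column).__name__)  # Series
print(type(subset).__name__)      # DataFrame
print(subset.shape)               # (3, 2)
```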

Now let's use the `.describe()` method on a single column to get summary statistics. For instance, here are the summary statistics for the `clump_thickness` column:

In [None]:
diagnostic_data['clump_thickness'].describe()

You can also write this method call using dot notation for referencing a column, as in `diagnostic_data.clump_thickness.describe()`:

In [None]:
# add your own code to view a column's summary statistics, or to compare multiple columns of interest
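One way to compare several columns at once is to call `.describe()` on a subset of columns rather than on a single one. Here is a minimal sketch on made-up values (hypothetical data, not the real dataset); the result is itself a DataFrame, so individual statistics can be picked out with `.loc`.

```python
import pandas as pd

# Made-up ranks for illustration only
toy = pd.DataFrame({
    "clump_thickness": [1, 3, 5, 7],
    "mitoses": [1, 1, 1, 2],
})

# describe() on several columns returns one column of statistics per input column
summary = toy[["clump_thickness", "mitoses"]].describe()
print(summary)

# Pull out just the rows of the summary we care about
print(summary.loc[["mean", "50%", "max"]])
```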

### 3. Possible correlations between image attributes.

Let's visualize correlations!

In [None]:
diagnostic_data.corr()

In [None]:
plt.matshow(diagnostic_data.corr())
plt.show()

In [None]:
import seaborn as sns

In [None]:
corr = diagnostic_data.corr()
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
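A heatmap is great for spotting patterns, but it can be hard to read exact rankings from colors. As a complement, here is a sketch that lists each pair of columns once, sorted by correlation. It runs on made-up values (hypothetical data), relies on `.corr()` computing Pearson correlation by default, and uses a NumPy mask to keep only the upper triangle of the matrix so that each pair appears once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd

# Made-up ranks for illustration only
toy = pd.DataFrame({
    "uniformity_cell_size":  [1, 2, 4, 8, 10],
    "uniformity_cell_shape": [1, 3, 4, 7, 10],
    "mitoses":               [1, 1, 2, 1, 1],
})

corr = toy.corr()  # Pearson correlation by default

# True above the diagonal, False on and below it
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Blank out everything except the upper triangle, then flatten to
# one row per column pair and sort from strongest positive downward
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

The same three lines after `corr = ...` should work on `diagnostic_data.corr()` as well.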

Let's run the analysis again after dropping our primary key (`sample_code`) and outcome class variable (`malignant`). You can drop a column using the `.drop()` method, which takes two arguments:

* the name of the column to drop (passed positionally, so no keyword is needed), or a list of names to drop several columns at once
* an axis value (`axis=1` for columns)

Like so: `our_new_data_frame = our_old_data_frame.drop(["primary_key_column", "outcome_column"], axis=1)`
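Here is that pattern as a runnable sketch on a made-up DataFrame (hypothetical values). It also shows the equivalent `columns=` keyword, and confirms that `.drop()` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration
toy = pd.DataFrame({
    "sample_code": [101, 102],
    "clump_thickness": [5, 3],
    "malignant": [2, 4],
})

# Drop two columns by name; axis=1 means "columns"
predictors = toy.drop(["sample_code", "malignant"], axis=1)

# Equivalent spelling using the columns= keyword
same_result = toy.drop(columns=["sample_code", "malignant"])

print(list(predictors.columns))  # ['clump_thickness']
print(list(toy.columns))         # the original still has all three columns
```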


Drop the outcome class column `malignant` from our DataFrame and assign the output to a new DataFrame called `diagnostic_predictors`:

In [None]:
# add your code for a new diagnostic_predictors DataFrame

In [None]:
# generate a new correlation heatmap based on the new dataframe

In [None]:
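# Treat the 2/4 class codes as categories rather than as numbers
diagnostic_data.malignant = diagnostic_data.malignant.astype('category')
# One-hot encode the class: one indicator column per class code
diagnostic_data = pd.get_dummies(diagnostic_data, prefix = "malignant", columns = ['malignant'])
# Give the indicator columns readable names (2 = benign, 4 = malignant)
diagnostic_data = diagnostic_data.rename(columns={"malignant_2": "is_benign", "malignant_4": "is_malignant"})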
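The cell above packs a lot into three lines, so here is the same idea on a made-up DataFrame (hypothetical values) where the result is easy to inspect. It assumes, as the readme states, that the class is coded 2 for benign and 4 for malignant.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration
toy = pd.DataFrame({"clump_thickness": [5, 3, 8], "malignant": [2, 4, 2]})

# get_dummies replaces the 'malignant' column with one indicator column
# per distinct value, named <prefix>_<value>
encoded = pd.get_dummies(toy, prefix="malignant", columns=["malignant"])

# Rename the indicator columns to something more readable
encoded = encoded.rename(
    columns={"malignant_2": "is_benign", "malignant_4": "is_malignant"}
)

print(encoded)
```

Each row has exactly one of `is_benign` and `is_malignant` switched on, which is why the two columns show up as perfectly negatively correlated in the heatmaps that follow.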

In [None]:
corr_cleaned = diagnostic_data.drop(["sample_code"], axis=1)

In [None]:
corr_cleaned.corr()

In [None]:
corr = corr_cleaned.corr()
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)

In [None]:
predictors_only = corr_cleaned.drop(['is_benign', 'is_malignant'], axis=1).corr()
sns.heatmap(predictors_only, 
        xticklabels=predictors_only.columns,
        yticklabels=predictors_only.columns)

## Where to go from here

We've tidied up our dataset, including dealing with headers and nulls; glimpsed the design, shape, and typical observations within our dataset; explored summary statistics and distributions of values within columns; and evaluated correlations between columns with correlation matrices and heat maps.

Each of these tasks is valuable in its own right, no matter what analytical method we wish to pursue next.

### 4. Avenues for subsequent analysis (e.g. designing a machine learning classification task)

The data work you've done in pandas is *particularly* relevant if we plan to approach this research question

> we are interested in whether patterns exist in imaging data that would improve the ability of clinicians to correctly distinguish malignant and non-malignant tumors

as the classification task

> Given the output class of "malignant" or "benign" in our observations, can we generate an accurate and reliable prediction from observations about Fine Needle Aspiration images (also contained within our observations)? If so, what variables are significant in predicting an outcome class effectively? Among those important predictors, how would a change in measurement value influence the likely outcome, and how confident are we in that influence?

Now that we have our data in a tidy DataFrame, those observations include an outcome class variable as well as several strong candidates for predictor variables. This is an excellent task to pursue via the supervised machine learning method of [binary classification](https://www.sciencedirect.com/topics/computer-science/binary-classification).

In the Python ecosystem, most if not all popular machine learning packages will work natively with pandas DataFrames. Look out for a future lesson using random forest classifiers with scikit-learn on this very dataset you have prepared!
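As a small preview of how a classification task is usually set up, here is a sketch that separates predictor columns from the outcome column using only pandas. The data is made up (hypothetical values), and the names `X` and `y` are just a common convention, not something this lesson requires. No model is trained here.

```python
import pandas as pd

# A tiny, made-up DataFrame shaped like our encoded data
toy = pd.DataFrame({
    "sample_code": [101, 102, 103, 104],
    "clump_thickness": [5, 3, 8, 1],
    "is_benign": [1, 1, 0, 1],
    "is_malignant": [0, 0, 1, 0],
})

# Predictors: everything except the identifier and the outcome indicators
X = toy.drop(columns=["sample_code", "is_benign", "is_malignant"])

# Outcome: a single indicator column is enough for a two-class problem
y = toy["is_malignant"]

print(X.shape, y.shape)

# Checking class balance is a useful habit before any modeling
print(y.value_counts(normalize=True))
```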

Let's save our work.

In [None]:
# index=False keeps the row index from being written as an extra unnamed column
diagnostic_data.to_csv("wisconsin_data_clean.csv", index=False)
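One detail worth knowing when saving: by default, `to_csv` writes the DataFrame's row index as the first column, and reading the file back then produces an extra column named `Unnamed: 0`. The sketch below shows the difference on a made-up DataFrame (hypothetical values), writing to in-memory buffers instead of files so nothing is left on disk.

```python
import io
import pandas as pd

# A tiny, made-up DataFrame used only for illustration
toy = pd.DataFrame({"clump_thickness": [5, 3], "is_malignant": [0, 1]})

# Default behavior: the row index is written as the first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False, only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # includes 'Unnamed: 0'
print(list(reloaded_without_index.columns))  # just the original columns
```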

## Works cited & further reading

## Appendix: Full readme file

Full readme file accompanying data:

```
Citation Request:
   This breast cancer databases was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg.  If you publish results
   when using this database, then please include this information in your
   acknowledgements.  Also, please cite one or more of:

   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear 
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

   2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of 
      pattern separation for medical diagnosis applied to breast cytology", 
      Proceedings of the National Academy of Sciences, U.S.A., Volume 87, 
      December 1990, pp 9193-9196.

   3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition 
      via linear programming: Theory and application to medical diagnosis", 
      in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
      Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

   4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming 
      discrimination of two linearly inseparable sets", Optimization Methods
      and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

2. Sources:
   -- Dr. WIlliam H. Wolberg (physician)
      University of Wisconsin Hospitals
      Madison, Wisconsin
      USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
      Received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992

3. Past Usage:

   Attributes 2 through 10 have been used to represent instances.
   Each instance has one of 2 possible classes: benign or malignant.

   1. Wolberg,~W.~H., \& Mangasarian,~O.~L. (1990). Multisurface method of 
      pattern separation for medical diagnosis applied to breast cytology. In
      {\it Proceedings of the National Academy of Sciences}, {\it 87},
      9193--9196.
      -- Size of data set: only 369 instances (at that point in time)
      -- Collected classification results: 1 trial only
      -- Two pairs of parallel hyperplanes were found to be consistent with
         50% of the data
         -- Accuracy on remaining 50% of dataset: 93.5%
      -- Three pairs of parallel hyperplanes were found to be consistent with
         67% of data
         -- Accuracy on remaining 33% of dataset: 95.9%

   2. Zhang,~J. (1992). Selecting typical instances in instance-based
      learning.  In {\it Proceedings of the Ninth International Machine
      Learning Conference} (pp. 470--479).  Aberdeen, Scotland: Morgan
      Kaufmann.
      -- Size of data set: only 369 instances (at that point in time)
      -- Applied 4 instance-based learning algorithms 
      -- Collected classification results averaged over 10 trials
      -- Best accuracy result: 
         -- 1-nearest neighbor: 93.7%
         -- trained on 200 instances, tested on the other 169
      -- Also of interest:
         -- Using only typical instances: 92.2% (storing only 23.1 instances)
         -- trained on 200 instances, tested on the other 169

4. Relevant Information:

   Samples arrive periodically as Dr. Wolberg reports his clinical cases.
   The database therefore reflects this chronological grouping of the data.
   This grouping information appears immediately below, having been removed
   from the data itself:

     Group 1: 367 instances (January 1989)
     Group 2:  70 instances (October 1989)
     Group 3:  31 instances (February 1990)
     Group 4:  17 instances (April 1990)
     Group 5:  48 instances (August 1990)
     Group 6:  49 instances (Updated January 1991)
     Group 7:  31 instances (June 1991)
     Group 8:  86 instances (November 1991)
     -----------------------------------------
     Total:   699 points (as of the donated datbase on 15 July 1992)

   Note that the results summarized above in Past Usage refer to a dataset
   of size 369, while Group 1 has only 367 instances.  This is because it
   originally contained 369 instances; 2 were removed.  The following
   statements summarizes changes to the original Group 1's set of data:

   #####  Group 1 : 367 points: 200B 167M (January 1989)
   #####  Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   #####  Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
   #####                  : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
   #####                  : Changed 0 to 1 in field 6 of sample 1219406
   #####                  : Changed 0 to 1 in field 8 of following sample:
   #####                  : 1182404,2,3,1,1,1,2,0,1,1,1

5. Number of Instances: 699 (as of 15 July 1992)

6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

8. Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing 
   (i.e., unavailable) attribute value, now denoted by "?".  

9. Class distribution:
 
   Benign: 458 (65.5%)
   Malignant: 241 (34.5%)

```