# More about data exploration

Now that we have our data imported and cleaned, we can move on to summarizing and otherwise examining our data. Again, **do not try to memorize everything: you have access to a cheat sheet!**

<font color = '#ed865c' size = 4>**Make sure to press the play button to run the cell below: this will re-load the datasets and functions that we worked on during the last section.**</font>

In [None]:
#@title
# we hid the contents of this cell because there's a lot going on in here
# you can re-hide this cell by clicking on View -> Show/hide code

!git clone https://github.com/ccbskillssem/pythonbootcamp.git
import numpy as np

### load and clean datasets ###
animals2 = np.genfromtxt('/content/pythonbootcamp/day_3/Animals2.csv',
              delimiter=',')
animals2 = animals2[:, ~np.isnan(animals2).all(axis = 0)][~np.isnan(animals2).all(axis = 1), :]
airquality = np.genfromtxt('/content/pythonbootcamp/day_3/airquality.csv',
              delimiter=',')
airquality = airquality[:, ~np.isnan(airquality).all(axis = 0)][~np.isnan(airquality).all(axis = 1), :]
nan_inds = np.where(np.isnan(airquality))
airquality[nan_inds] = np.take(np.nanmean(airquality, axis = 0), nan_inds[1])

### load in sample solution for clean_data() ###
def clean_data(data_array):
  nan_map = np.isnan(data_array)

  data_array = data_array[:, ~nan_map.all(axis = 0)]
  data_array = data_array[~nan_map.all(axis = 1), :]
  return data_array

## Examining attributes

<font color = '#ed865c' size = 4>**Make sure to run the cell above before you begin: this will re-load the datasets and functions that we worked on during the last section.**</font>

As we discussed yesterday, arrays have *attributes* that describe key characteristics of the array at hand. Let's take a look at the attributes of the `animals2` array, just to make sure that we have the right number of columns and rows.

In [None]:
print(animals2.shape)
print(animals2.size)
print(animals2.ndim)

Everything looks to be in order: based on the background information about `animals2`, we know that there should be 65 rows (species) and two columns.

## Heads and tails

One good place to start is by simply examining the content of the data. As we saw earlier, calling `animals2` directly results in the entire dataset being shown in the output. However, it can be visually difficult to parse all 65 rows at once: in most cases, slicing a small number of rows will suffice for examining the data.

In [None]:
# let's slice the first ten rows of animals2
animals2[:10,:]

Good! This is a simple way for us to inspect a small, ordered section of the data: one row after another, for ten rows. This is described as viewing the **head** of our dataset.

Similarly, taking a small number of rows from the end of your dataset is described as viewing the **tail** of the data.

In [None]:
# try it out:
# slice the *last* 10 rows of animals2


## Summarizing values

Finally, we can generate some simple summary values. We've already taught you the following functions/methods that describe summary values of an array:
* `np.min()` / `.min()`
* `np.max()` / `.max()`
* `np.mean()` / `.mean()`
* `np.std()` / `.std()`
* `np.var()` / `.var()`
* `np.median()` (no method equivalent, for reasons unknown to us...)

It's important to keep in mind that these summary functions/methods take an `axis` parameter that guides their operation: a row-wise mean will be different from a column-wise mean. If you leave `axis` out, the summary is calculated over every value in the array at once.

* `0` indicates a column-wise operation.
* `1` indicates a row-wise operation.
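
To make the difference concrete, here is a minimal sketch on a small made-up array (the `toy` array and the variable names are assumptions for illustration, not part of the bootcamp datasets):

```python
import numpy as np

# a small hypothetical array: 3 rows, 2 columns
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

col_means = toy.mean(axis=0)   # column-wise: one value per column
row_means = toy.mean(axis=1)   # row-wise: one value per row
overall_mean = toy.mean()      # no axis: one value for the whole array

print(col_means)      # two values, one for each column
print(row_means)      # three values, one for each row
print(overall_mean)   # a single number
```

Note that the column-wise result has as many entries as there are columns, and the row-wise result has as many entries as there are rows.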

In [None]:
# try it out:
# calculate the column-wise means of animals2


These last four functions are slightly more advanced in their scope. We've summarized their main inputs and functions, but they do offer additional options that may be useful to you in the future. You can view these options in the docs.
* `np.unique()`: Takes an array and returns a sorted copy of unique values. [[docs]](https://numpy.org/doc/stable/reference/generated/numpy.unique.html)
* `np.histogram()`: Takes an array of values and returns a tuple of two arrays: the first array gives the count in each bin, and the second array gives the bin edges (one more edge than there are bins, because the final right-hand edge is included). [[docs]](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html)
  * You can explicitly specify desired bin boundaries using the `bins` input.
* `np.percentile()`: Takes in an array and a percentile value (float, `0.0`-`100.0`), returning the value of the array that demarcates the given percentile value. [[docs]](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html)
* `np.quantile()`: Takes in an array and a quantile value (float, `0.0`-`1.0`), returning the value of the array that demarcates the given quantile value. [[docs]](https://numpy.org/doc/stable/reference/generated/numpy.quantile.html)
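
Before moving to real data, the four functions above can be sketched on a tiny made-up array. The `values` array and the bin boundaries below are assumptions chosen so the results are easy to check by eye:

```python
import numpy as np

# hypothetical values, not taken from any bootcamp dataset
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# sorted unique values in the array
unique_values = np.unique(values)

# bin counts and bin edges, using explicitly chosen bin boundaries
counts, edges = np.histogram(values, bins=[1, 2, 3, 4, 5])

# the same cut point, expressed on a 0-100 scale and on a 0-1 scale
p50 = np.percentile(values, 50)
q50 = np.quantile(values, 0.5)

print(unique_values)
print(counts, edges)
print(p50, q50)
```

Here `edges` has one more entry than `counts`, and `np.percentile(values, 50)` matches `np.quantile(values, 0.5)` because both describe the median.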

We'll show you how these work using the `airquality` dataset. We've provided the columns again for your convenience:

| Index  | Description                   |
|--------|-------------------------------|
| 0      | Ozone (ppb)                   |
| 1      | Solar radiation (Langleys)    |
| 2      | Wind (mph)                    |
| 3      | Temperature (degrees F)       |
| 4      | Month (`1`-`12`)              |
| 5      | Day of month (`1`-`31`)       |

In [None]:
# using np.histogram() to generate bin count and intervals for
# temperature values in airquality

np.histogram(airquality[:,3])

In [None]:
# creating bins from the minimum temp to the maximum temp in 5 degree intervals
temp_bins = np.arange(airquality[:,3].min(), airquality[:,3].max() + 5, 5)
temp_bins

In [None]:
# specifying temp_bins in np.histogram()
np.histogram(airquality[:,3], bins = temp_bins)

## Very, very simple plots

Plots can be a useful tool in data exploration: sometimes it's easier to identify patterns in data once it's been visualized.

Today, we'll introduce you to some very simple plotting functions that you can use to explore your data. We'll use a popular package called `matplotlib`, which contains a collection of useful MATLAB-like plotting functions (`pyplot`).

In [None]:
import matplotlib.pyplot as plt # the alias is plt

For today's purposes, we'll only introduce you to three simple `pyplot` functions and their key functionality.
* `plt.hist()`: Takes in an array of values to generate a basic histogram plot.
  * Can be used to visualize the distribution of values in a single column.
  * You can explicitly specify desired bin boundaries using the `bins` input.
* `plt.scatter()`: Takes in two arrays of values and generates a scatter plot.
  * Can be used to visualize the relationship between values in two columns.
* `plt.violinplot()`: Takes in a 2D array of values and plots a series of "violins".
  * Can be used to visualize the distribution of values in *multiple* columns.

In [None]:
# creating a histogram of temperatures

plt.hist(airquality[:,3])

In [None]:
# creating a histogram of temperatures using temp_bins

plt.hist(airquality[:,3], bins = temp_bins)

In [None]:
# visualizing temperature ranges by month

plt.scatter(airquality[:,4], airquality[:,3])

In [None]:
# visualize the distribution of values in all the columns.

plt.violinplot(airquality)

As you can see, these plots are quite rudimentary, but they get the job done for data exploration. We'll explore greater `pyplot` functionality on Friday.

# Mini-project: Exploring health data

That's the last of the `numpy` techniques that we'll cover in this bootcamp!

To cap things off, we're going to embark on a mini-project that will require you to use the skills you've learned so far. We've provided two project options:

1. `badhealth`, a dataset describing healthcare habits of individuals and their self-reported health status. (Recommended)
2. `hepatocellular`, a dataset from a journal article exploring gene expression biomarkers in cancer prognosis.

We recommend that you start with `badhealth` and work on `hepatocellular` if you have extra time.


## `badhealth` dataset

Our first dataset is the `badhealth` dataset. This dataset originates from a 1998 German health study of 1,127 individuals and their healthcare habits. [[source]](https://vincentarelbundock.github.io/Rdatasets/doc/COUNT/badhealth.html)

Here are the descriptions and indices of columns in `badhealth`.

| Index  | Description                   |
|--------|-------------------------------|
| 0      | Recorded # of doctor visits   |
| 1      | Reported health status:<br>`0`: not bad health, `1`: bad health   |
| 2      | Age of patient (years)        |

Using the following file path, import the `badhealth` dataset to a variable (also called `badhealth`). Slice and print the first 10 rows of the dataset.

```
'/content/pythonbootcamp/day_3/badhealth.csv'
```

In [None]:
### use this cell to import and view the head of the data ###


Next, use `clean_data()` to clean `badhealth`. There shouldn't be any interspersed `nan` values, but make sure to check anyway to practice your `nan`-detection routine!

In [None]:
### write your code below ###


Let's consider some of the key summary values that we might want to identify in our dataset, such as mean and median values.

Write a function called `summary_values()` that takes an array and returns a dictionary containing:
* The number of elements in the array, with the key value `'count'`.
* Column-wise minimum values, with the key value `'mins'`
* Column-wise maximum values, with the key value `'maxs'`
* Column-wise means, with the key value `'means'`
* Column-wise medians, with the key value `'medians'`
* Column-wise standard deviations, with the key value `'stdevs'`

In [None]:
### write your code below ###


Next, using `summary_values()`, find the specified summaries of all columns in `badhealth`.

In [None]:
### write your code below ###


Good! Let's try out more of the techniques we learned about earlier, such as percentiles and histograms.

Below, use `np.histogram()` to generate a histogram of individual ages in `badhealth` using 5-year intervals from the min to max age (inclusive of max age). Assign the intervals to a variable called `age_bins`.

In [None]:
### write your code below ###


When you're done, try using `plt.hist()` to plot the histogram of ages, defined by `age_bins`: they should yield the same values as `np.histogram()`.

In [None]:
### write your code below ###


Repeat the above process to generate the histogram and histogram plot of individuals' doctor visits. Save the 5-visit interval bins to a new variable called `visit_bins`.

In [None]:
### write your code below ###


What percentile range corresponds to 30 or more visits? Try out different values with `np.percentile()` to find out.

In [None]:
### write your code below ###


Is there a difference in the age distribution of individuals who *never* visit the doctor, versus individuals who *frequently* visit the doctor? To find out, create two subsets of `badhealth`:

1.   `never_visit` for never-visiting individuals, or individuals with zero visits.
2.   `frequent_visit` for frequently-visiting individuals, or individuals who visit more often than 95% of other people (95th percentile and above).
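
Subsets like these are typically built with boolean masks. As a reminder of the general pattern, here is a minimal sketch on a made-up array. The `toy` array, its column layout, and the 80th-percentile cutoff are assumptions for illustration only, not the answer to the exercise:

```python
import numpy as np

# hypothetical data: column 0 = number of visits, column 1 = age
toy = np.array([[0, 25],
                [2, 40],
                [0, 31],
                [9, 58],
                [1, 36]])

visits = toy[:, 0]

# keep only the rows where the visit count is exactly zero
zero_rows = toy[visits == 0]

# keep only the rows at or above a chosen percentile of visits
# (80th percentile here, purely for illustration)
cutoff = np.percentile(visits, 80)
top_rows = toy[visits >= cutoff]

print(zero_rows)
print(cutoff)
print(top_rows)
```

The mask (`visits == 0` or `visits >= cutoff`) is computed from one column, but it selects whole rows, so the other columns stay attached to the matching individuals.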



In [None]:
### write your code below ###


Use `summary_values()` to examine the ages of each group. How many individuals are in each group? Does there seem to be a difference in the average/median age?

In [None]:
### write your code below ###


Finally, plot age histograms for both groups using `age_bins`.

In [None]:
# age histogram for never_visit
### write your code below ###


In [None]:
# age histogram for frequent_visit
### write your code below ###


Are there other explorations you can think of, or better ways to plot the relationship between variables? If time permits, share your own useful explorations or visualizations!

## `hepatocellular` dataset

The `hepatocellular` dataset describes clinical metrics and biomarkers sampled in patients with hepatocellular carcinoma.

> This dataset was obtained from this [source](https://vincentarelbundock.github.io/Rdatasets/doc/asaur/hepatoCellular.html), and originates from [Li et al. 2014: CXCL17 Expression Predicts Poor Prognosis and Correlates with Adverse Immune Infiltration in Hepatocellular Carcinoma](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0110064).

`hepatocellular` contains significantly more columns than `badhealth`, spanning 48 variables for 227 individuals.

Columns are listed in the following order: we've provided the 0-index of all columns for your convenience. All column values are encoded numerically.

| Index   |      Description     | Index   |        Description       |  Index  |      Description      |  Index   |      Description    |
| ------- | ---------------------| ------- | -------------------------| ------- | ----------------------| -------- | --------------------|
| 0       | Patient ID number    | 12      | Capsulation              | 24      | CD8T                  | 36       | CD20NR              |
| 1       | Age                  | 13      | TNM                      | 25      | CD8N                  | 37       | CD57NR              |
| 2       | Sex                  | 14      | BCLC                     | 26      | CD20T                 | 38       | CD15NR              |
| 3       | HBsAg                | 15      | Overall survival         | 27      | CD20N                 | 39       | CD68NR              |
| 4       | Cirrhosis status     | 16      | Death                    | 28      | CD57T                 | 40       | CD4TR               |
| 5       | ALT                  | 17      | Recurrence-free survival | 29      | CD57N                 | 41       | CD8TR               |
| 6       | AST                  | 18      | Recurrence               | 30      | CD15T                 | 42       | CD20TR              |
| 7       | AFP                  | 19      | CXCL17T                  | 31      | CD15N                 | 43       | CD57TR              |
| 8       | Tumor size           | 20      | CXCL17P                  | 32      | CD68T                 | 44       | CD15TR              |
| 9       | Tumor differentiation| 21      | CXCL17N                  | 33      | CD68N                 | 45       | CD68TR              |
| 10      | Vascular invasion    | 22      | CD4T                     | 34      | CD4NR                 | 46       | Ki67                |
| 11      | Tumor multiplicity   | 23      | CD4N                     | 35      | CD8NR                 | 47       | CD34                |

That's a lot of columns, but have no fear – we're only going to have you take a look at a couple of them, and you don't need to know what the biological function of each biomarker is (but we won't stop you from looking them up!)

Just as before, start by importing the data from the following file path, then cleaning all-`nan` rows/columns using `clean_data()`.

`'/content/pythonbootcamp/day_3/hepatoCellular.csv'`

Assign this to a variable called `hepatocellular`.

In [None]:
### use this cell to import and clean the data ###


Check for remaining `nan` values by column. Which columns still have `nan` values?

In [None]:
### write your code below ###


For our purposes, we'll only be examining columns without `nan` values. The `hepatocellular` dataset has data for three measurements of `CXCL17` gene expression:
* `CXCL17T`, Index 19: Intratumoral gene expression (measured within tumoral tissue).
* `CXCL17P`, Index 20: Peritumoral gene expression (measured adjacent to tumoral tissue).
* `CXCL17N`, Index 21: Non-tumoral gene expression.

Recall the simple `pyplot` functions we taught you about earlier: which one is suitable for visualizing all three measurements of *CXCL17* expression? Use this function to visualize gene expression across all three measurements for all patients in `hepatocellular`.

In [None]:
# visualize the measurements of CXCL17 expression
### write your code below ###


Take a moment to examine the range and distribution of each measurement: what does the spread of the values look like?

In the paper discussion, the authors posit that high peritumoral and intratumoral *CXCL17* expression is predictive of carcinoma prognosis. Let's see if we can see the same trend, using very basic data exploration.

* Find the median values for peritumoral and intratumoral *CXCL17* expression.
* Create two new subsets: `low_exp`, for patients with below-median expression for both peri- and intratumoral expression, and `high_exp`, for patients with above-median expression for both.

In [None]:
### write your code below ###


Next, re-examine the distribution of all three *CXCL17* measurements: this time, create one plot each for `low_exp` and `high_exp`. Do you notice any trends or associations?

In [None]:
# plot distributions of CXCL17 for low_exp
### write your code below ###


In [None]:
# plot distributions of CXCL17 for high_exp
### write your code below ###


Next, let's look at patient prognosis. There are several measures of patient outcomes in the dataset, but we'll focus on just two: patient death (Index 16) and recurrence-free survival (Index 17).

As we mentioned earlier, all of the columns in `hepatocellular` are numeric. Patient death is encoded as either a `0` or a `1`.
* `0` indicates that the patient did not die.
* `1` indicates patient death.

Below, calculate the relative proportion of patients that died in `low_exp` and `high_exp`.

In [None]:
### write your code below ###


Next, let's examine recurrence-free survival. The clinical definition of recurrence-free survival is the number of days between the dates of sample biopsy and cancer recurrence, or the date from sample biopsy to last follow-up if no recurrence was observed.

Below, calculate the summary values and plot the distribution of recurrence-free survival intervals for patients in both groups *on the same histogram plot*.

In [None]:
# calculate summary values here
### write your code below ###


In [None]:
# plot the distribution here
### write your code below ###

# hint: you can plot *two* histogram distributions on the same plot
# if you provide a list of arrays


That's it for `hepatocellular`! If you've finished early, you can review the paper manuscript and work on executing some more explorations according to the authors' findings. If time permits, we'll ask people to share their results.



# Introduction to Pandas

That's about it for today! We hope that you enjoyed putting your skills to the test.

Tomorrow, we'll embark on our last full day of lecture content that will discuss data exploration and preliminary analysis. We'll be working with a package called `pandas`, which is a package that extends the array infrastructure provided by `numpy`.

> *Another package already?!* 😱<br>
Have no fear: all of the conceptual material we covered with `numpy` arrays is transferable to `pandas`. In fact, `pandas` takes care of many of the techniques that were fairly clunky for us to perform with `numpy`, and most people find `pandas` a little more intuitive to work with. We're only going to preview our forthcoming work with `pandas` today.

Let's think all the way back to when we first introduced `numpy`: recall that arrays *do* have their limitations. For one, arrays can't contain multiple types. This can be troublesome for scientific data that can't be stored in solely numeric form. Thus, we turn to `pandas`.

Like `numpy`, `pandas` is imported with a shorthand **alias** (`pd`), which is used as the prefix for any functions imported via `pandas`.



In [None]:
# try it out: import pandas with its alias


## DataFrames versus arrays

`pandas` is a great tool for data manipulation and analysis in Python because of the infrastructure it provides for complex tabular data. Just as `numpy` introduced the array, `pandas` introduces the **DataFrame** (a term familiar to those of you coming from R).

DataFrames and arrays are quite similar in many aspects. For one, they can both be used to store rectangular data (rows and columns), and they're both useful for performing vectorized operations. However, DataFrames have several key functions that can make it easier to work with *labeled* data.

1. **DataFrames allow mixed types.** Unlike with arrays, you can store strings, integers, numerics, etc. in the same DataFrame.
2. **DataFrames support row and column names**. You can index with row and column names! This can be handy if you know your sample/variable names by heart.
3. **DataFrames support easy database-like operations**. Merging, joining, grouping, sorting on a column's values – all possible with `pandas` DataFrames!
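
As a small preview of these three points, here is a minimal sketch using a made-up table. The `toy_df` DataFrame and its column names are assumptions for illustration; the operations themselves will be covered properly in tomorrow's lecture:

```python
import pandas as pd

# a small hypothetical table with mixed column types
toy_df = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'dog'],   # strings
    'weight_kg': [4.0, 20.0, 5.0, 30.0],       # floats
    'age_years': [3, 5, 7, 2],                 # integers
})

# 1. mixed types: each column keeps its own type
print(toy_df.dtypes)

# 2. label-based indexing: select a column by its name
weights = toy_df['weight_kg']

# 3. database-like operations: sort by a column, or group and summarize
sorted_df = toy_df.sort_values('age_years')
mean_weight = toy_df.groupby('species')['weight_kg'].mean()

print(sorted_df)
print(mean_weight)
```

Doing the same grouping with a plain `numpy` array would require building a separate mask for each species by hand, which is part of why `pandas` tends to feel less clunky for labeled data.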

We'll teach you about the essential `pandas` functions and operations, but if you'd like to know *more* about what you can do with `pandas`, you can read the introduction on the `pandas` documentation [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html).

Let's take a quick look at some data using `pandas`. All of the operations we're showing you will be covered in tomorrow's lecture, so there's no need to commit all of it to memory right now.

In [None]:
# importing data

airquality_df = pd.read_csv('/content/pythonbootcamp/day_3/airquality.csv').drop('Unnamed: 0', axis = 1)
airquality_df

This is the `airquality` dataset we worked with earlier today, but imported into a `pandas` DataFrame. The first thing you'll notice is that the display of the data is quite different, compared to an array.

Next, you'll probably notice that the columns in this table have names, and the rows have numbers: DataFrames operate using these **labels** as indices, rather than the strict numerical indices we've been using so far. This is *especially* useful for columns!

In [None]:
hepatocellular_df = pd.read_csv('/content/pythonbootcamp/day_3/hepatoCellular.csv').drop('Unnamed: 0', axis = 1)
hepatocellular_df

Look at all those columns! With DataFrames, you won't need to worry about constantly referring to index lists for datasets like `hepatocellular` that contain dozens of columns.

In [None]:
# before, if we wanted CXCL17T expression
hepatocellular[:,19]

In [None]:
# if we want to examine CXCL17T expression in our DataFrame
hepatocellular_df['CXCL17T']

Hopefully this is starting to show you some of the motivation behind `pandas` and DataFrames! By the end of the week, you'll be well-prepared to use `numpy`, `pandas`, and `pyplot` to analyze *your* particular data. We'll have TAs on staff during the Friday afternoon session to discuss future directions for your Python analyses.

# [Optional] Saving your arrays to external files

`numpy` also provides a utility function called `np.savetxt()` that saves arrays to external files. We won't be using it in the remainder of the bootcamp (`pandas` has superior file input/output functions for mixed-type data), but it may be useful if you work with solely numerical data.

`np.savetxt()` takes three key inputs:
1. A file name.
2. The array of interest.
3. A `delimiter` string.
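
One way to convince yourself that the three inputs fit together is to save a small array and read it straight back. This is a minimal sketch under stated assumptions: the `toy` array and the `toy.csv` file name are made up, and a temporary folder is used so that nothing is left behind afterwards:

```python
import os
import tempfile

import numpy as np

# a small hypothetical array to write out
toy = np.array([[1.5, 2.0],
                [3.0, 4.25]])

# write into a temporary folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy.csv')   # example file name
    # the three key inputs: file name, array, delimiter
    np.savetxt(file_path, toy, delimiter=',')
    # read the file back in with the same delimiter
    reloaded = np.genfromtxt(file_path, delimiter=',')

# check that the values survived the round trip
print(np.allclose(toy, reloaded))
```

The same `delimiter` is used for writing and reading: if the two differ, the file will not be split into the columns you expect.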

In [None]:
# let's sample 5 random rows from airquality

rng = np.random.default_rng(2023) # this time, using a seed value for reproducibility
random_airquality = rng.choice(airquality, 5)
random_airquality

In [None]:
# now let's save random_airquality as a comma-separated value file
np.savetxt('random_airquality.csv', random_airquality, delimiter = ',')

Go to the left-hand panel of the Colab notebook and click on the folder icon at the bottom of the panel. This will bring you to Colab's `Files` menu. You should see a file called `random_airquality.csv`: if you hover over it and click the three-dot menu, you can obtain the file path (if you want to import it again) or download the file.
___
**CAUTION**: Files that you save while using Colab are not retained after you close the notebook, as they only exist in Colab's temporary **session storage**. If you generate files and wish to keep them, make sure to download your files (with the same three dots menu) before you exit Colab.
___

# [Optional] Cloning files from GitHub

> This section was first introduced in the morning session on `numpy`. We've copy-pasted it here for convenience.

[GitHub](https://github.com/) is a website that hosts code and files for software development projects. It serves two major functions: backing up **codebases** (files with data and code that work together) and enabling collaboration between programmers/developers.

We (your staff team) use GitHub as a **repository** for files that are used during PyCamp. We do this so that we have a stable copy of these files that stays out of "I spilled coffee on my laptop the night before PyCamp", or "my laptop was ransomed for cryptocurrency" territory. Moreover, if we accidentally delete a file from the repository, GitHub's **version control**  allows us to roll back the repository to a working version. Neat, right?

The below command allows us to **clone** these files from the GitHub repository to our local runtime's session storage. This allows for us to skip the messy steps of trying to get everyone to download and re-upload the right data.

```
!git clone https://github.com/ccbskillssem/pythonbootcamp.git
```

The `!` operator is used to indicate *special commands* that would normally be run at a computer's **command line**, rather than in Python. This is akin to communicating with a computer (or in Colab, our runtime) directly to tell it that we want to download files using the given file path.

The GitHub URL that you see above ends in `.git` and points to the repository as a whole, rather than to any single data file. In this manner, we never have to worry about giving the individual file path for each file we want: we just pull all the files in the repository by giving its `.git` URL.

# [Optional] More methods for external data

> This section was first introduced in yesterday's session on `numpy`. We've copy-pasted it here for convenience.

This section describes the bare essentials of file uploads/downloads with Colab. For a more in-depth exploration, you can visit the official Google Colab notebook on data I/O [here](https://colab.research.google.com/notebooks/io.ipynb).

## Loading data from your computer
You can use Colab's `Files` menu to upload data from your own computer to Colab's temporary **session storage**. Session storage is reset each time the notebook runtime ends or is otherwise reset.

Go to the left-hand panel of the Colab notebook and click on the folder icon at the bottom of the panel. This will bring you to Colab's `Files` menu.

Click on the leftmost icon underneath the `'Files'` title of the panel: it should appear as a piece of paper with an up arrow on it. Follow the prompts to upload your data of choice. Once your file is uploaded, you can access the file path by hovering over the file name, clicking on the three-dot menu, then selecting `Copy path`.

___

**CAUTION**: Files that you upload are NOT retained in the `Files` panel after you close the notebook or reset the runtime. If you would prefer to avoid the upload process, consider the next section on loading data from Google Drive.

___

## Loading data from Google Drive
Google Drive is an excellent cloud storage solution for data you wish to work with in Colab. Colab provides a simple way to access files from Google Drive: all you have to do is open the `Files` menu by clicking the folder icon on the left-hand panel of the Colab notebook.

Once you're in the `Files` menu, click on the third icon below the `'Files'` title: it should appear as a filled-in white folder with the Google Drive icon. Click this button to connect Google Drive to Colab: a pop-up should appear asking you to confirm that you wish to do this, and you may need to wait a few minutes while Google Drive loads.

Once your Drive is mounted, you should see a new folder called `drive` in the `Files` menu. You can access the file path by hovering over the file name, clicking on the three-dot menu, then selecting `Copy path`.