# Good morning!

Welcome back to Day 4 of PyCamp!

Yesterday, we finished up our tour of `numpy` by learning how to import, clean, and explore external data using the techniques we learned on Tuesday. That's a lot of material that we covered in only three days!

Today's material will focus on `pandas`, a powerful Python data science package that provides infrastructure for working with complex tabular data. After we finish today's content, you'll largely be prepared to work with your own data using the techniques we've covered in PyCamp. Without further ado, let's get started!

In [None]:
# make sure to run this cell to import the external files we need for today
# and load in the appropriate packages
!git clone https://github.com/ccbskillssem/pythonbootcamp.git

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## DataFrames versus arrays

Yesterday, we *very briefly* touched upon the motivation behind using `pandas` in complement with `numpy`. We discussed that the main attraction of `pandas` is the data type it introduces: the **DataFrame**.

1. **DataFrames allow mixed types.** Unlike with arrays, you can store strings, integers, floats, etc. in the same DataFrame.
2. **DataFrames support row and column names**. You can index with row and column names! This can be handy if you know your sample/variable names by heart.
3. **DataFrames support easy database-like operations**. Merging, joining, grouping, sorting on a column's values – all possible with `pandas` DataFrames!
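
To make the first two points concrete, here's a minimal sketch using a tiny made-up table. The values, the `sample_a`/`sample_b` labels, and the use of the `pd.DataFrame()` constructor are all just for illustration: we'll normally load DataFrames from files instead.

```python
import numpy as np
import pandas as pd

# A numpy array holds a single dtype, so mixing numbers and text
# coerces every element to a string
arr = np.array([[1, 'rs001'], [2, 'rs002']])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column
# (toy values, invented for this example)
toy = pd.DataFrame({'position': [101, 202], 'ID': ['rs001', 'rs002']},
                   index=['sample_a', 'sample_b'])
print(toy.dtypes)

# Row and column names can be used directly for lookups
print(toy.loc['sample_b', 'ID'])
```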

Today's morning session will cover the essentials of working with DataFrames. The operations you'll learn about are quite similar to operations you've performed with arrays, so hopefully they feel intuitive to you!

## Importing external data

> Most of the time, you'll want to use `pandas` for working with external data: thus, we won't discuss how to create DataFrames from scratch. If you'd like to learn how to do that, you can review the documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe).

Let's start by playing around with some sample data. `pandas`, like `numpy`, uses a function to read in external data with a file path. You can choose from one of the following functions, depending on your file type:

* `pd.read_csv()` is used for importing `.csv` (comma-separated value) files.
* `pd.read_table()` is used for importing `.tsv` (tab-separated value) files.
* `pd.read_excel()` is used for importing `.xls` or `.xlsx` (Excel) files.
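
If you'd like to see how the first two functions relate before touching a real file, here's a small sketch that reads made-up text from memory. `io.StringIO` (from Python's standard library) stands in for a file path here, and the contents are invented for illustration.

```python
import io
import pandas as pd

# Toy file contents (made up for illustration)
csv_text = "ID,position\nrs001,101\nrs002,202\n"
tsv_text = "ID\tposition\nrs001\t101\nrs002\t202\n"

# io.StringIO lets us treat a string like an open file
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))

# read_table assumes tabs by default, so this call is equivalent
from_tsv_alt = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# All three approaches produce the same table
print(from_csv.equals(from_tsv))
print(from_tsv.equals(from_tsv_alt))
```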

For more data import functions, you can read the quick tutorial on importing data [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/02_read_write.html), or review the full file input documentation for `pandas` [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).<br><br>
___
<br>

Let's start by testing out one of these import functions with a new dataset, `alleles`.

`alleles` is a dataset that describes **alleles** (genetic variants/mutations) associated with metabolic diseases. It has a number of columns that denote properties, such as the genomic location of the allele, the identity of the allele compared to the human *reference genome*, and the frequencies of each allele across multiple global populations.

> This dataset was curated from publicly available data by a former CCB PhD student and PyCamp instructor, [Andrew Sharo](https://www.andrewsharo.com/). (Thanks, Andrew!)


Here's the file path:

```
'/content/pythonbootcamp/day_4/alleles.tsv'
```

Because `alleles` is stored in a `.tsv` file, we should use `pd.read_table()` to import it. This works just like `np.genfromtxt()`, except we don't need to provide the delimiter: `pd.read_table()` assumes a tab delimiter by default.

In [None]:
alleles = pd.read_table('/content/pythonbootcamp/day_4/alleles.tsv')
print(type(alleles)) # note that it's a new type

In [None]:
# run this cell to inspect the table
alleles

Colab has built-in functionality for displaying DataFrames in a more structured, human-readable manner, compared to `numpy` arrays. Let's go ahead and inspect the table.

## Inspecting DataFrame displays

First, notice that DataFrame row indices/names are displayed in **bold face**, and column names are displayed with a `different font`.
Notice that `pd.read_table()` automatically took our column names from the first row of the dataset, which contains strings: if we imported this data with `np.genfromtxt()`, this row of column names would have come through as all `nan` values.



⏸ **Exercise 1**: Use ```np.genfromtxt()``` and observe the output. What do you notice? Compare it to the pandas DataFrame from above. What are the differences? Can you think of a situation where a ```numpy``` array would be a better choice?

In [None]:
### write your code below ###


Back to the pandas DataFrame. Take a look at the bottom left of the table: the shape attribute is automatically displayed each time that a table is returned to the output. This is convenient for making sure that we've imported all the rows and columns that we expect!

> *Note*: There are literally dozens of optional parameters that you can use to specify how your data is imported! We won't cover them today for the sake of brevity, but we encourage you to review the options [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) before you import your own data. There's an optional parameter for almost every quirk that you'll encounter in your data!

# Parsing DataFrames

We now turn to the standard parsing operations we need to know to work with DataFrames, such as indexing.

Although most of these closely mirror the `numpy` operations we already learned about, **we highly encourage you to follow along with your `pandas` cheat sheet**. Remember, syntax will come naturally with practice, so don't burn out trying to memorize all of the methods and functions!

### Accessing attributes

We can access attributes of `pandas` DataFrames in *exactly* the same manner as `numpy` arrays. They even share the same attribute names:

* `.shape`: Returns a tuple with the number of rows and columns in the DataFrame.
* `.size`: Returns an integer with the number of elements in the DataFrame.
* `.ndim`: Returns an integer with the number of dimensions in the DataFrame.
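
As a quick sanity check, here's what these attributes look like on a tiny made-up DataFrame (the values are invented for illustration, not taken from `alleles`):

```python
import pandas as pd

# Toy table: 3 rows, 2 columns
toy = pd.DataFrame({'chromosome': [1, 1, 2],
                    'AF': [0.10, 0.25, 0.40]})

print(toy.shape)  # a tuple: (number of rows, number of columns)
print(toy.size)   # a single integer: rows * columns
print(toy.ndim)   # a single integer: 2 for any DataFrame
```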

In [None]:
# try it out:
# print these attributes of alleles


### Row/column labels

Row and column **labels** (a generic term that encompasses both default numerical indices and assigned row/column names) can be accessed through the DataFrame's attributes.

Column labels can be accessed through the `.columns` attribute.

In [None]:
# inspecting the column labels
alleles.columns

Row labels can be accessed through the `.index` attribute. (We wish it was `.rows` too, and we don't know why it isn't.)

In [None]:
# inspecting the row labels
alleles.index

You'll notice that both row and column labels are stored in a special data structure called an **Index**: thus, each DataFrame will have two associated Index structures, one for the rows and one for the columns.

Indexes can come in multiple flavors depending on the labels of the DataFrame, such as the above `RangeIndex` (an index of sequential integers, like our conventional zero-index): nevertheless, they're all considered to be Index structures.

There are a number of advanced operations you can perform with Index methods, but they're beyond the scope of this bootcamp, and you'll probably only use them if you become an advanced `pandas` user. For today's purposes, an Index is just a fancy iterable that contains row/column labels.
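
Here's a small sketch of the "fancy iterable" idea, using two made-up tables: one with default row labels and one with assigned row names. The `rs001`/`rs002` names are invented for illustration.

```python
import pandas as pd

# Default row labels: a RangeIndex of sequential integers
toy = pd.DataFrame({'AF': [0.10, 0.25]})
print(type(toy.index))

# Assigned row names: still an Index, just a different flavor
named = pd.DataFrame({'AF': [0.10, 0.25]}, index=['rs001', 'rs002'])
print(type(named.index))

# Either way, we can loop over the labels or convert them to a list
for label in named.index:
    print(label)

print(list(named.columns))
```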

> If you *do* want to know more about Index structures, you can read about all their properties [here](https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html).

## Indexing columns

Columns of DataFrames are much more intuitive to index than columns of arrays. You can slice the values of a single column by indexing with the column label.

```
dataframe[column_label]
```

In [None]:
alleles['ID']

⏸ **Exercise 2** Take a few minutes to look at the columns in ```alleles```. What are the different ```dtypes``` in this DataFrame? If you have time (and want a challenge), play around with Index structures and the optional parameters of ```pd.read_table()```.

Each column of a DataFrame is stored in another special data type called a **Series**. For our purposes, a Series is essentially a one-dimensional DataFrame that only has one associated Index (for the row labels).

In [None]:
# try it out:
# 1. print the type of the ID column
# 2. print the row labels of the ID column


Notice that the Series shares the same row labels as the original DataFrame. In this manner, we always preserve essential context with our data.

<img src ='https://github.com/ccbskillssem/pythonbootcamp/raw/main/day_4/ColumnIndex.png'>

We can slice multiple columns by providing a list of column names: the selected columns will display as a DataFrame.
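
The difference between one label and a list of labels is easy to miss, so here's a minimal sketch on a made-up table (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'chromosome': [1, 1, 2],
                    'position': [101, 202, 303],
                    'AF': [0.10, 0.25, 0.40]})

# A single label returns a Series
one_col = toy['AF']
print(type(one_col))

# A list of labels (note the double brackets) returns a DataFrame
two_cols = toy[['position', 'AF']]
print(type(two_cols))

# Even a one-element list returns a DataFrame
also_df = toy[['AF']]
print(type(also_df))
```

In every case, the row labels of the result match the row labels of the original table.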

In [None]:
# try it out:
# index the chromosome, position, and ID columns


## Indexing rows and cells

DataFrame rows are indexed similarly to array rows, with a minor syntax difference. With DataFrames, we use `.loc[]` instead of `[]` to access rows.

```
dataframe.loc[row_label]
```

In [None]:
# try it out:
# use .loc[] to get the first row of alleles


Just as with columns, rows of DataFrames are also stored in Series structures: the only difference is that the column labels have been transposed to row labels, as a Series only has one Index for row labels.

<img src='https://raw.githubusercontent.com/ccbskillssem/pythonbootcamp/main/day_4/RowIndex.png'>

It can help to think of a DataFrame as a Lego-like structure composed of Series. When we slice the DataFrame, we're snapping off some of the Legos from the main structure.

In [None]:
# try it out:
# 1. print the type of the first row
# 2. print the index of the first row


We can use `.loc[]` to slice rows, just as we did with arrays.
> Interestingly, the authors of `pandas` opted to make `.loc[]` slices *inclusive* of the right label. This may or may not feel more intuitive to you, but just keep in mind that this is the exception rather than the rule for ranges in Python.
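
To see the inclusive behavior in action, here's a sketch on a made-up five-row table. For comparison it also uses `.iloc[]`, a position-based indexer that we don't otherwise cover today: unlike `.loc[]`, it follows the usual Python convention of excluding the right end.

```python
import pandas as pd

toy = pd.DataFrame({'AF': [0.1, 0.2, 0.3, 0.4, 0.5]})

# .loc slices by label and includes the right label: rows 0, 1, and 2
by_label = toy.loc[0:2]
print(len(by_label))

# .iloc slices by position and excludes the right end: rows 0 and 1
by_position = toy.iloc[0:2]
print(len(by_position))
```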

In [None]:
# try it out:
# use .loc[] to slice the 0 to 10-indexed rows of alleles
# notice that the right index is now *inclusive*


We can also use `.loc[]` to access individual cells in an array-like manner.

```
dataframe.loc[row_label, column_label]
```

In [None]:
# try it out:
# access the ID for the 5-indexed row


This holds for accessing multiple rows and multiple columns.

In [None]:
# accessing the ID for the 0 to 10-indexed rows
alleles.loc[0:10, 'ID']

In [None]:
# try it out:
# access both the position and ID for the 0 to 10-indexed rows


⏸ **Exercise 3** Subset the allele frequencies for the different geographic superpopulations from ```alleles``` (`AFR_AF`, `EUR_AF`, etc) as well as the ```AF``` column.

In [None]:
### write your code below ###


Can you determine what proportion of the dataset is represented by ```EAS```? How about ```SAS```?

*Hint 1*: Look at the third and fifth rows. You can use `.loc[]` to access the desired rows.

*Hint 2*:
If $X$ is number of people with the mutation

$$AF= \frac{X}{total}$$

$$AF_{EAS}=\frac{X_{EAS}}{total_{EAS}}$$

if $X=X_{EAS}$
$$proportion_{EAS}=\frac{AF}{AF_{EAS}}$$

In [None]:
### write your code below ###


⏸ **Exercise 4** Can you access the position and allele frequency for the even numbered indexes in the first 20 rows?

Hint: You can use ``np.arange()`` to generate an array of even integers. Refer to documentation.

In [None]:
### write your code below ###


## Summary

Let's recap all of this before we head into our exercises. You'll find this info on your cheat sheet as well, of course, but make sure that it *makes sense* too!

* Attributes
  * You can access the core attributes of a DataFrame using the same attribute names that we learned for `numpy` arrays: `.shape`, `.size`, and `.ndim`.
  * You can obtain row and column labels by using the `.index` and `.columns` attributes, respectively.

* Indexing
  * You can index a single column using:
  ```
  dataframe[column_name]
  ```
  * You can index multiple columns using:
  ```
  dataframe[[column_name_1, column_name_2...]]
  ```
  * You can index a single row using:
    ```
    dataframe.loc[row_name]
    ```
  * You can index multiple (numeric-index) rows using:
    ```
    dataframe.loc[firstrow:lastrow]
    ```
    This slice will be *inclusive* of `lastrow`.
  * You can index a single element (cell) using:
    ```
    dataframe.loc[row_name, column_name]
    ```

# Data exploration with DataFrames

## Simple methods

Now that we're up to speed on how to parse DataFrames, we can learn about the utility that `pandas` has for data exploration. We'll cover many of the same topics that we did with `numpy`, and the conceptual basis of why we want to perform these operations remains the same between the two packages.

We've already mentioned that DataFrames share a great number of similarities with `numpy` arrays. The first (and most important) similarity is that DataFrames *also* allow us to perform vectorized operations.

> *Wait, then why did we even learn `numpy`?*<br>
  Because there's no such thing as a free lunch. Remember how we mentioned that `numpy` is efficient because it only permits data of a single type? Well, `pandas` is [slower](https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7) than `numpy`, even for vectorized operations, because it has to take non-numeric types into account. This is why we taught you `numpy` first: if you're working with strictly numeric values, then `numpy` arrays will be a more efficient solution.

DataFrames and Series share many of the same methods: this makes sense in light of our Lego analogy. Moreover, the majority of methods we learned with `numpy` arrays are identical in name and function to methods for DataFrames/Series.
* `.sum()`
* `.min()`
* `.max()`
* `.mean()`
* `.std()`
* `.median()` (Rejoice at the availability of a median method!)

Just like with arrays, these methods take an `axis` parameter to specify column-wise or row-wise operation. You can provide either an integer or a string value, whichever is more intuitive for you.
* `0` or `'index'` works down the rows, returning one value per column (column-wise operations). This is the default.
* `1` or `'columns'` works across the columns, returning one value per row (row-wise operations).
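
A quick way to check which direction an `axis` value works in is to look at the labels on the result. Here's a minimal sketch on a made-up table (the values and the `rs001`/`rs002` row names are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'AFR_AF': [0.1, 0.3],
                    'EUR_AF': [0.2, 0.6]},
                   index=['rs001', 'rs002'])

# axis=0 (same as axis='index'): one mean per column
col_means = toy.mean(axis=0)
print(col_means)

# axis='columns' (same as axis=1): one mean per row
row_means = toy.mean(axis='columns')
print(row_means)
```

The result of the first call is labeled with the column names, and the result of the second is labeled with the row names.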

In [None]:
# try it out:
# get the average frequency for all alleles in the table


In [None]:
# try it out:
# find the average allele frequencies of the AFR_AF and EAS_AF columns


`pandas` also offers some additional convenient methods for DataFrames/Series that weren't available (or weren't particularly useful) for arrays. These methods can be useful for quick data exploration, which we'll try out in the exercises.

* `.head()` and `.tail()` return the first or last five rows of the dataset. This can be useful for quickly viewing the result of an operation, or testing out an operation on a small subset of the DataFrame.
  * You can specify a specific number of rows by passing an integer to the method input: for example, `.head(15)` would yield the first 15 rows.
* `.value_counts()` returns a new DataFrame or Series that shows the count of each unique value.
* `.describe()` returns a new DataFrame or Series with conventional summary statistics (`count`, `min`, `max`, `mean`, `std`), as well as 25, 50, and 75 percentile values.
* `.unique()` returns an array containing the unique values of a Series, in order of first appearance (the values are not sorted).
  * Yes, an array, not a Series or DataFrame! This is an example of how `pandas` really just works as an expansion pack for `numpy`.
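
Here's a quick sketch of these methods on a tiny made-up table, so you can see the shape of each result before trying them on `alleles`. The values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'chromosome': [2, 1, 2, 3, 1, 2],
                    'AF': [0.10, 0.25, 0.40, 0.05, 0.30, 0.15]})

# preview the first two rows
print(toy.head(2))

# count how many times each chromosome appears
counts = toy['chromosome'].value_counts()
print(counts)

# unique values come back as a numpy array, in order of first appearance
uniques = toy['chromosome'].unique()
print(uniques)

# summary statistics for a numeric column
summary = toy['AF'].describe()
print(summary)
```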

In [None]:
# let's pop open a quick preview of alleles
alleles.head()

In [None]:
# how many alleles passed the quality filter?
alleles['filter'].value_counts()

In [None]:
# let's look at the summary of quality scores
alleles['quality'].describe()

In [None]:
# check which chromosomes are covered in this dataset
alleles['chromosome'].unique()

⏸ **Exercise 5a** How many unique values of allele frequency do you observe in `AF`?

In [None]:
### write your code below ###


⏸ **Exercise 5b** What is the median allele frequency across all the superpopulations (`AFR_AF`, `EUR_AF`, etc)?

In [None]:
### write your code below ###


`pandas` really shines in its offerings for routine data exploration activities like cleaning, querying, and visualizing data from DataFrames/Series. Here are three great reasons to use DataFrames for exploring your data:

* **Labels are preserved across operations**. You can easily associate calculated values with their corresponding label.
* **Almost all operations on DataFrames/Series are offered in method form**. You can chain methods together for convenience. (This is analogous to piping the input of one function to another in R.)
* **Methods don't save in place by default.** You can test out your chained operations to see if they yield the result you expect.
  * If you *do* want to use methods in place, you can specify the parameter `inplace` set to `True`. This can save you variable updates if you're already confident of the result that you'll obtain from your method operations.

There's a ton that you can do if you dig deep into the material, so consider this a simple tour of routines you can use with `pandas`. Just like with `numpy`, we must emphasize that you **should not attempt to memorize the syntax and quirks of the methods you see, because you will always have the cheat sheet for reference!**


Here's an example of a method that sorts the DataFrame by a column:

In [None]:
alleles.sort_values(by="AF",ascending=False).head()

If we look at the DataFrame again, specifically at the `AF` column, we'll see that `alleles` hasn't been updated.

In [None]:
alleles.head()

If we want to modify the original DataFrame, we need to use the optional parameter `inplace = True`.

In [None]:
alleles.sort_values(by="AF",ascending=False,inplace=True)

When we use that option, the result is not displayed, but the DataFrame is modified.

In [None]:
alleles.head()

`inplace = True` is an optional parameter for many different methods.

In [None]:
# try it out:
# reset the index in place using .reset_index()
# then see if it worked


In [None]:
# run this cell to restore alleles to its original index
alleles.set_index('index', drop = True, inplace = True)
alleles.index.name = None
alleles.head()

## Cleaning

As we discussed yesterday, data cleaning is an essential pre-processing step for working with real data. Let's briefly revisit the `airquality` dataset, which we know has `nan` values.

In [None]:
# reading in airquality

airquality = pd.read_csv('/content/pythonbootcamp/day_4/airquality.csv')
airquality.head()

The first thing to recognize is that `airquality` now displays a row of column names, which was previously imported as all `nan` values by `numpy`. Next, you'll also notice the presence of a column called `Unnamed: 0`. This is an artifactual column that you'll frequently see when importing files that were exported from programs like Excel and Google Sheets, which use one-indexed row names that export in string format (hence the all `nan` column from yesterday).

The `.drop()` method takes an `axis` parameter and either a single label or list of labels to drop from the DataFrame, returning a new cleaned DataFrame.

In [None]:
# let's drop the unnamed column and update the variable
airquality = airquality.drop(['Unnamed: 0'], axis = 1) # remove unnamed for all rows
airquality.head()

Next, let's clean up the `nan` values. Recall that with `numpy`, we used the `np.isnan()` function and `.sum()` method to obtain a count of `nan` values.

```
np.isnan(array).sum()
```

`pandas` uses the same concept for DataFrames, but operates solely with methods: `.isna()` is the DataFrame equivalent of `np.isnan()` for arrays.

```
dataframe.isna().sum()
```

In [None]:
# try it out:
# count the nan values in airquality


Notice that for DataFrames, `pandas` automatically categorizes the `nan` counts by columns. This is a useful shortcut for examining exactly how `nan` values are distributed across columns.

Next, let's consider what we should do with our `nan` values. Previously, we *had* to resolve `nan` values because they interfered with our calculations. `pandas` handles `nan` values a little differently than arrays by simply **masking** (bypassing) them by default.

In [None]:
# try it out:
# use .mean() to take the column-wise mean of airquality


Above, `pandas` evaluates the mean of each column by simply taking the mean of the non-`nan` values. Functionally, this provides the same outcome as the *mean imputation* that we coded ourselves in yesterday's exercises.

In the case that `nan` values of any kind are entirely unacceptable, `pandas` does provide simple but powerful methods for `nan` resolution.

* `.dropna()`, which takes two parameters: an `axis` parameter, and the `how` parameter (`'any'` or `'all'`). [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
* `.fillna()`, which takes two parameters: an `axis` parameter, and `value`, the desired fill value. [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
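
Here's a minimal sketch of both methods on a made-up table with a couple of missing values. The column names and numbers are invented for illustration and aren't taken from the real `airquality` file.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Ozone': [41.0, np.nan, 12.0],
                    'Wind': [7.4, 8.0, np.nan]})

# count the nan values per column
print(toy.isna().sum())

# drop every row that contains at least one nan (only row 0 survives)
dropped = toy.dropna(axis=0, how='any')
print(dropped)

# fill every nan with a single value
filled = toy.fillna(value=0)
print(filled)

# fill each column's nan values with that column's mean
mean_filled = toy.fillna(toy.mean())
print(mean_filled)
```

Note that none of these calls change `toy` itself: each one returns a new DataFrame.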

> For more ways to resolve missing data, refer to the `pandas` documentation on working with missing data [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

In [None]:
# try it out:
# use .dropna to remove rows with any nan values in airquality


## Filtering

One of the most common data operations is querying or filtering data: in other words, obtaining values based on a set of requirements. Previously, we accomplished this using logical operators, followed by Boolean masking and indexing.

`pandas` lets us skip the separate masking step: we can write the logic check directly inside the indexing brackets, and we get back only the rows that pass the check. In this way, you can quickly obtain subsets of your data based on specific logic checks.

In [None]:
# we want to find alleles in chromosome 1

alleles[alleles.chromosome == 1]

Notice that the number of rows is now reduced, indicating that we've successfully filtered our rows to those belonging to chromosome 1.

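This holds for multiple logic checks as well, using the same syntax as we used with arrays. It may be more convenient to refer to columns with dot notation (`alleles.chromosome`, as in the cell above) when writing multiple logic checks, just to avoid dealing with multiple sets of square brackets. Dot notation only works when the column name doesn't clash with an existing DataFrame attribute or method: `alleles.filter` refers to the `.filter()` method, so the `filter` column still needs `alleles['filter']`.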
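
As a reminder of the syntax, here's a sketch that combines two checks on a made-up table (values invented for illustration). Each check goes in its own parentheses, joined by `&` (and) or `|` (or).

```python
import pandas as pd

toy = pd.DataFrame({'chromosome': [1, 1, 2, 1],
                    'AF': [0.10, 0.25, 0.40, 0.30]})

# rows on chromosome 1 AND with allele frequency above 0.2
both = toy[(toy['chromosome'] == 1) & (toy['AF'] > 0.2)]
print(both)

# rows on chromosome 2 OR with allele frequency below 0.2
either = toy[(toy['chromosome'] == 2) | (toy['AF'] < 0.2)]
print(either)
```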

In [None]:
# try it out:
# find alleles in chromosome 1 that pass the filter (filter column)


In [None]:
# this generates a Boolean map of rows
# True = for this row, the chromosome value is in 1, 2, or 3
# False = not 1, 2, or 3

alleles['chromosome'].isin([1, 2, 3])

In [None]:
# now we can use this map to select rows in our DataFrame

alleles[alleles['chromosome'].isin([1, 2, 3])]

And with that, we're able to cleanly perform a filter operation and obtain relevant labels with one line of code. Nice, right?

In [None]:
# try it out:
# what if we wanted to filter rows by the following conditions
# 1) chromosome in 1, 5, and 7
# 2) pass the filter (filter column)
# 3) have an allele frequency of 0.2 or greater


## Merging

We've discussed at length the benefit of labels in DataFrames: one of the major benefits of `pandas` is the column and row labels. We can take advantage of these labels to perform column-based operations.

**Merging** refers to the operation of combining DataFrames using a "key" column of unique values (akin to dictionary keys!). This is an *extremely* powerful method for relating datasets.

Let's examine a dataset called `clinvar`. `clinvar` contains information about the clinical importance of alleles, some of which are *also* described in `alleles`.

> `clinvar` is a subset of the National Institutes of Health's Clinical Variation ([ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/)) database. This subset was curated by [Andrew Sharo](https://www.andrewsharo.com/), a former CCB PhD student and PyCamp instructor. Thanks, Andrew!

In [None]:
# load in clinvar, which is a .tsv
# it's a big dataset!

clinvar = pd.read_table('/content/pythonbootcamp/day_4/clinvar.tsv')
clinvar.head()

Let's say that we want to combine `alleles` and `clinvar`, such that we obtain a super-DataFrame with columns from both DataFrames. We know that the `ID` is a column with unique values, so we can use the `ID` column as our "key" in both tables.

`pandas` provides a single powerful function called `pd.merge()`, which takes the following key parameters:
* `left`: The target DataFrame.
* `right`: The DataFrame/Series that we wish to merge with the target.
* `how`: The merging method. This defaults to an "inner" merge, which returns a DataFrame that only contains rows with shared keys in both DataFrames. (For `alleles` and `clinvar`, this would be shared `ID`s.)
* `on`: The label of our "key" column.
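
Before merging the real tables, it can help to see these parameters at work on two tiny made-up tables. Everything here is invented for illustration, including the IDs and the `'left'` value of `how`, which keeps every row of the left table.

```python
import pandas as pd

left = pd.DataFrame({'ID': ['rs001', 'rs002', 'rs003'],
                     'chromosome': [1, 1, 2],
                     'AF': [0.10, 0.25, 0.40]})
right = pd.DataFrame({'ID': ['rs002', 'rs003', 'rs004'],
                      'chromosome': [1, 2, 3],
                      'clinical_significance': ['Benign', 'Pathogenic', 'Benign']})

# inner merge: only IDs found in both tables (rs002 and rs003)
inner = pd.merge(left, right, how='inner', on='ID')
print(inner)

# left merge: every row of left, with nan where right has no match
left_join = pd.merge(left, right, how='left', on='ID')
print(left_join)
```

Because both toy tables have a `chromosome` column that isn't the key, the merged result carries two copies, named `chromosome_x` (from the left table) and `chromosome_y` (from the right table).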

In [None]:
merged = pd.merge(alleles, clinvar, how = 'inner', on = 'ID')
merged

Notice the shape of the resultant super-DataFrame: it appears that only 3,749 alleles are shared between `alleles` and `clinvar`. Additionally, columns that appear in both `alleles` and `clinvar` now have the suffix `_x` or `_y` to distinguish their origin: `_x` refers to the "left" DataFrame, and `_y` refers to the "right".

## Selecting rows by substrings

We're going to introduce a technique that may be useful for inspecting and analyzing subsets of your data.

Earlier this morning, we showed you how to use `.isin()` to filter rows by column values (plural). This allowed us to select rows based on multiple desired values in a given column. However, this method requires us to know exactly which values we want to identify for selecting rows.

Take a look at the `clinical_significance` values above. Notice how many of them contain the words 'pathogenic' and 'benign', but in various combinations with other indicators of clinical significance. Although there aren't too many combinations in this particular table, it would still be cumbersome to manually curate a list of values containing 'pathogenic', for example.

The `.str.contains()` method works for columns that contain string values, returning a Boolean map of values in the column that contain the desired substring.
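
One detail worth knowing is that the match is case-sensitive by default. Here's a sketch on a made-up column (these values are invented for illustration and may not match the exact spellings in `clinvar`); the optional `case` parameter switches to a case-insensitive match.

```python
import pandas as pd

toy = pd.DataFrame({'ID': ['rs001', 'rs002', 'rs003', 'rs004'],
                    'clinical_significance': ['Benign',
                                              'Pathogenic',
                                              'Likely pathogenic',
                                              'Pathogenic/Likely pathogenic']})

# case-sensitive by default: 'Pathogenic' alone does not match 'pathogenic'
strict = toy['clinical_significance'].str.contains('pathogenic')
print(strict)

# case = False ignores capitalization
loose = toy['clinical_significance'].str.contains('pathogenic', case=False)
print(loose)

# the Boolean map can be used to select rows, just like any other filter
print(toy[loose])
```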

In [None]:
# try it out:
# let's examine our rows: which ones have 'pathogenic' in their clinical significance?


## Effective filtering with `.query()`

One minor annoyance with filtering is the sheer number of `df[column_name] == value` checks that you need to write if you want to filter based on multiple column values.

In [None]:
# what if we wanted all alleles in merged that are:
# 1) on chromosome 2, based on the chromosome_x column
# 2) benign
# 3) synonymous variants

merged[(merged['chromosome_x'] == 2) & (merged['clinical_significance'] == 'Benign') & (merged['molecular_consequence'] == 'synonymous_variant')]

That's a lot of code just to get those few rows! The `.query()` method provides a slightly less wordy way for us to create this very specific filter. Instead of writing all of those logic checks, we provide a single string as an input to `.query()`. Unlike with our traditional filtering method of specifying `dataframe[column]`, we can use column names directly in the input to `.query()`, which makes writing our query much more syntactically natural in English.

In [None]:
# same as above, but with .query()

merged.query("chromosome_x == 2 & clinical_significance == 'Benign' & molecular_consequence == 'synonymous_variant'")
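
The query string can also refer to regular Python variables by prefixing them with `@`, which is handy when a threshold lives in a variable. Here's a sketch on a made-up table; the values and the `min_af` name are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'ID': ['rs001', 'rs002', 'rs003', 'rs004'],
                    'AF': [0.10, 0.25, 0.40, 0.30],
                    'clinical_significance': ['Benign', 'Benign',
                                              'Pathogenic', 'Benign']})

# a threshold stored in an ordinary variable (hypothetical name)
min_af = 0.2

# @min_af pulls in the variable's value; 'and' works the same as '&' here
result = toy.query("AF > @min_af and clinical_significance == 'Benign'")
print(result)
```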

In [None]:
# try it out:
# using .query(), find out how many alleles are:
# 1) above 0.2 allele frequency (all populations)
# 2) benign
# 3) missense variants?


## Very, very simple plots

Yesterday, we showed you a few simple `pyplot` plotting functions for data visualization with `numpy`. We can use the very same functions to create basic visualizations with our data.

* `plt.hist()`: Takes in an array of values to generate a basic histogram plot.
  * Can be used to visualize the distribution of values in a single column.
  * You can explicitly specify desired bin boundaries using the `bins` input.
* `plt.scatter()`: Takes in two arrays of values and generates a scatter plot.
  * Can be used to visualize the relationship between values in two columns.
* `plt.violinplot()`: Takes in a 2D array of values and plots a series of "violins".
  * Can be used to visualize the distribution of values in *multiple* columns.

In [None]:
# try it out:
# plot the quality score distribution of alleles that pass the filter


In [None]:
# try it out:
# create a histogram that displays information about the following columns:
# AF (allele frequency, all populations)
# AFR_AF (allele frequency, African population)

# tips:
# 1) you can plot two Series in the same histogram if you put them in a list
# 2) you can adjust the y scale to log scale using plt.yscale('log')


## Replacing

Previously, we relied heavily on Boolean masking and assignment to update or replace slices/elements in arrays. For example, `nan` resolution is simply a specific case of value replacement that focuses on `nan` values.

With DataFrames/Series, we can use the `.replace()` method to quickly replace single or multiple values in one fell swoop. Although there are many ways that you can provide input to `.replace()`, we'll just teach you the simplest technique, which is providing target and replacement values with a dictionary.

Let's return to `alleles`. Currently, the `filter` column has two values: `'PASS'` and `'LowQ'`.

In [None]:
# just to prove it, let's check:
alleles['filter'].unique()

For example, let's say that we want to change the values of `filter`, such that `'PASS'` is instead `True`, and `'LowQ'` becomes `False`.

Using a dictionary, we can specify the *target* value and the *substitute* value.

In [None]:
print('Before replacement:\n', alleles['filter'].value_counts())

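# assign the result back to the column: calling .replace() with inplace = True
# on a selected column may not update alleles in newer versions of pandas
alleles['filter'] = alleles['filter'].replace({'PASS': True,
                                               'LowQ': False})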
print('After replacement:\n', alleles['filter'].value_counts())

# << Exercises >>


⏸ **Exercise 6** Using column indexing, create a subset of `alleles` called `subset`, which only contains the `AFR_AF`, `AMR_AF`, `EAS_AF`, `EUR_AF`, and `SAS_AF` columns. Which group has the highest mean allele frequency?

In [None]:
### write your code below ###


⏸ **Exercise 7a** Use the appropriate DataFrame method to summarize/describe the values in each column of `subset`.

In [None]:
### write your code below ###


⏸ **Exercise 7b** Print the ```min``` and ```max``` values for the mean and standard deviation for the subset.

In [None]:
### write your code below ###


# [Optional] Concatenating rows
This section is optional, just to avoid content overload. Material may be covered at the lecturer's discretion.

____


Let's step back for a moment to think about how DataFrames and Series work together. Each time that we index a row or a column, we obtain a Series: this goes back to the Lego analogy that we used, in which a DataFrame (Lego structure) is simply a composite of Series (individual Legos).

We can easily combine Series and/or DataFrames together using the `pd.concat()` function, which **concatenates** input structures together along a specified `axis`. This method is commonly used to add rows to existing DataFrames, or even whole DataFrames that share the same columns.

In [None]:
# creating some_alleles, a random sample of 5 rows
# and creating another sample with a different seed value

some_alleles = alleles.sample(n = 5, random_state = 2023)
more_alleles = alleles.sample(n = 5, random_state = 615)

In [None]:
# examine our rows

some_alleles

In [None]:
# examine the other rows

more_alleles

In [None]:
# let's combine these two together

pd.concat([some_alleles, more_alleles], axis = 0)

# [Optional] Cloning files from GitHub

> This section was first introduced in yesterday's session on `numpy`. We've copy-pasted it here for convenience.

[GitHub](https://github.com/) is a website that hosts code and files for software development projects. It serves two major functions: backing up **codebases** (files with data and code that work together) and enabling collaboration between programmers/developers.

We (your staff team) use GitHub as a **repository** for files that are used during PyCamp. We do this so that we have a stable copy of these files that stays out of "I spilled coffee on my laptop the night before PyCamp", or "my laptop was ransomed for cryptocurrency" territory. Moreover, if we accidentally delete a file from the repository, GitHub's **version control**  allows us to roll back the repository to a working version. Neat, right?

The below command allows us to **clone** these files from the GitHub repository to our local runtime's session storage. This allows for us to skip the messy steps of trying to get everyone to download and re-upload the right data.

```
!git clone https://github.com/ccbskillssem/pythonbootcamp.git
```

The `!` operator is used to indicate *special commands* that would normally be run at a computer's **command line**, rather than in Python. This is akin to communicating with a computer (or in Colab, our runtime) directly to tell it that we want to download files using the given file path.

The GitHub address that you see above ends in `.git`. It doesn't point to a single data file: rather, it identifies the GitHub repository of interest, and therefore all the files it contains. In this manner, we never have to worry about giving the file path to each file we want: we just pull all the files in the repository by giving its `.git` address.

# [Optional] More methods for external data

> This section was first introduced in yesterday's session on `numpy`. We've copy-pasted it here for convenience.

This section describes the bare essentials of file uploads/downloads with Colab. For a more in-depth exploration, you can visit the official Google Colab notebook on data I/O [here](https://colab.research.google.com/notebooks/io.ipynb).

## Loading data from your computer
You can use Colab's `Files` menu to upload data from your own computer to Colab's temporary **session storage**. Session storage is reset each time the notebook runtime ends or is otherwise reset.

Go to the left hand panel of the Colab notebook and click on the folder icon at the bottom of the panel. This will bring you to Colab's `Files` menu.

Click on the leftmost icon underneath the `'Files'` title of the panel: it should appear as a piece of paper with an up arrow on it. Follow the prompts to upload your data of choice. Once your file is uploaded, you can access the file path by hovering over the file name, clicking on the three-dot menu, then selecting `Copy path`.

___

**CAUTION**: Files that you upload are NOT retained in the `Files` panel after you close the notebook or reset the runtime. If you would prefer to avoid the upload process, consider the next section on loading data from Google Drive.

___

## Loading data from Google Drive
Google Drive is an excellent cloud storage solution for data you wish to work with in Colab. Colab provides a simple solution for allowing you to access files from Google Drive in Colab: all you have to do is access the `Files` menu by clicking the folder icon on the left hand panel of the Colab notebook.

Once you're in the `Files` menu, click on the third icon below the `'Files'` title: it should appear as a filled-in white folder with the Google Drive icon. Click this button to connect Google Drive to Colab: a pop-up should appear asking you to confirm that you wish to do this, and you may need to wait a few minutes while Google Drive loads.

Once your Drive is mounted, you should see a new folder called `drive` in the `Files` menu. You can access the file path by hovering over the file name, clicking on the three-dot menu, then selecting `Copy path`.