In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import numpy as np

## What is a DataFrame?
A DataFrame is a data structure that can store a variety of data types (e.g. integers, floats, strings, and generic objects) in a 2D structure of rows and columns. It is similar to a spreadsheet, an SQL table, or the `data.frame` in R. A DataFrame always has an index, which labels each row; by default the index is the 0-based position of the row.
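
As a minimal sketch of these ideas, the cell below builds a small DataFrame by hand. The name `toy_df`, its column names, and its values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy data: three amino acids described by columns of different types
toy_df = pd.DataFrame({
    "aa3": ["Ala", "Gly", "Trp"],   # strings (stored as objects)
    "n_codons": [4, 4, 1],          # integers
    "fraction": [0.25, 0.5, 0.25],  # floats (arbitrary example values)
})

print(toy_df.dtypes)  # one data type per column
print(toy_df.index)   # default index: 0, 1, 2
print(toy_df.shape)   # (number of rows, number of columns)
```

Each column keeps its own data type, while the index labels the rows.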

# Load in pandas data

First, let's start off by looking at something we are familiar with: codon tables!

In [None]:
codon_df = pd.read_csv('data/codon-tables/std-code.tab', sep=' ', index_col='codon')

In [None]:
codon_df

There are many ways to summarize and access the data stored in DataFrames: pandas provides a variety of associated `attributes` and `methods`.

An `attribute` is a value or characteristic associated with the data; think of it as a variable stored within an object (a DataFrame in this case). A `method` is a function associated with an object.

To access an `attribute`, use the DataFrame object name followed by a dot and the attribute name. For example, `codon_df.columns` returns an index of all the column names in the DataFrame `codon_df`, and `codon_df.index` returns its row labels.

In [None]:
codon_df.index

In [None]:
codon_df.columns

Pandas DataFrames are much like data frames in `R`: information can be selected from them based on index or column labels.

We can then use the labels stored in these `attributes` to retrieve data from the DataFrame using `.loc`.
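
The difference between selecting by label and selecting by position can be sketched on a tiny hand-made table. `toy_codons` is a hypothetical three-row stand-in for `codon_df`, assuming the same `aa1` and `aa3` columns.

```python
import pandas as pd

# Hypothetical three-row excerpt of a codon table, indexed by codon
toy_codons = pd.DataFrame(
    {"aa1": ["N", "G", "K"], "aa3": ["Asn", "Gly", "Lys"]},
    index=pd.Index(["AAT", "GGG", "AAA"], name="codon"),
)

by_label = toy_codons.loc["GGG", "aa3"]         # look up by row and column label
by_position = toy_codons.iloc[1, 1]             # look up by integer position
subset = toy_codons.loc[["AAT", "AAA"], "aa1"]  # several rows, one column

print(by_label, by_position, list(subset))
```

`.loc` always uses labels, while `.iloc` always uses 0-based positions, so the two calls above reach the same cell in different ways.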

In [None]:
codon_df.loc['AAT']

In [None]:
codon_df.loc['AAT', 'aa1']

In [None]:
codon_df.loc['AAT', 'aa3']

In [None]:
codon_df.loc[:, 'aa3']

In [None]:
codon_df.loc[['AAT','GGG'], 'aa3']



Methods are called in a similar fashion using the syntax `df_object.method()`, but the parentheses are required. As an example, `codon_df.head()` gets the first few rows of the DataFrame `codon_df` using the `head()` method. With a method, we can supply extra information inside the parentheses to control its behavior. For example, pass `n=100` to `head()` to get more rows.

In [None]:
help(codon_df.head) # pass the method itself (no parentheses) to see its documentation

In [None]:
codon_df.head(n=100)

There are many `methods` in pandas. Some examples:

In [None]:
codon_df.reset_index() #resets the index


In [None]:
codon_df.reset_index().set_index('aa1') #index by a different value

In [None]:
grp = codon_df.groupby('aa3') #group by a value
print(grp)
grp.get_group('Ala') #retrieve one value

In [None]:
for name, group in codon_df.groupby('aa3'): # a single column name gives plain string group names
    print(name, list(group.index), len(group.index))
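
The same per-group summary can often be written without an explicit loop. The sketch below uses a hypothetical five-codon table (`toy_codons`) rather than the full codon table.

```python
import pandas as pd

# Hypothetical five-codon excerpt, indexed by codon
toy_codons = pd.DataFrame(
    {"aa3": ["Ala", "Ala", "Trp", "Asn", "Asn"]},
    index=pd.Index(["GCT", "GCC", "TGG", "AAT", "AAC"], name="codon"),
)

# Number of codons per amino acid
codons_per_aa = toy_codons.groupby("aa3").size()

# List of codons per amino acid (move the index into a column first)
codon_lists = toy_codons.reset_index().groupby("aa3")["codon"].apply(list)

print(codons_per_aa)
print(codon_lists)
```

`size()` counts the rows in each group, and `apply(list)` collects the values of one column within each group.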

Now, let's load a new, more complex data set! Groundhog day observations! 

![image.png](https://www.almanac.com/sites/default/files/styles/primary_image_in_article/public/image_nodes/groundhog-day.jpg?itok=hzIHRUoK)

In [None]:
ghog_df = pd.read_csv('data/groundhog.csv')

In [None]:
ghog_df.head(n=10) 

There are a lot of `NaN` (missing) values. Pandas has many built-in methods to deal with them.

In [None]:
ghog_df.dropna() #drops nans

In [None]:
ghog_df.fillna(0) #fills na values with a different value. 
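
Dropping every incomplete row or filling with zero are not the only options. The sketch below shows a few alternatives on a made-up table; `toy` and its column names `Feb` and `Mar` are hypothetical and only loosely mimic the groundhog data.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures with one missing value in each month
toy = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004],
    "Feb": [30.0, np.nan, 34.0, 32.0],
    "Mar": [40.0, 42.0, np.nan, 44.0],
})

n_missing = toy.isna().sum()           # count missing values per column
feb_only = toy.dropna(subset=["Feb"])  # drop rows only where Feb is missing
filled = toy.fillna(toy.mean())        # fill each column with its own mean

print(n_missing)
print(feb_only)
print(filled)
```

Which strategy is appropriate depends on the analysis: filling with zero, for instance, would badly distort an average temperature.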

In [None]:
ghog_df_clean = ghog_df.dropna().copy() # copy so that new columns can be added without a SettingWithCopyWarning

In [None]:
ghog_df_clean['difference'] = ghog_df_clean['March Average Temperature']-ghog_df_clean['February Average Temperature']

In [None]:
ghog_df_clean

## Basic statistics! 

In [None]:
ghog_df_clean.mean(numeric_only=True) # skip text columns such as 'Punxsutawney Phil'

In [None]:
ghog_df_clean.median(numeric_only=True) # skip text columns such as 'Punxsutawney Phil'

In [None]:
ghog_df_clean.describe()

In [None]:
ghog_grps = ghog_df_clean.groupby('Punxsutawney Phil')
full = ghog_grps.get_group('Full Shadow')
nosha = ghog_grps.get_group('No Shadow')

In [None]:
stats={}
for name, ggrp in ghog_grps:
    stats[name] = ggrp.describe().head()

In [None]:
stats['Full Shadow']

In [None]:
stats['No Shadow']
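
Per-group statistics like these can also be computed in a single call with `agg`. This is a sketch on made-up numbers; the column names `shadow` and `difference` in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical observations: prediction and temperature difference
toy = pd.DataFrame({
    "shadow": ["Full Shadow", "No Shadow", "Full Shadow", "No Shadow"],
    "difference": [8.0, 12.0, 10.0, 14.0],
})

# One row per group, one column per statistic
summary = toy.groupby("shadow")["difference"].agg(["mean", "median", "count"])
print(summary)
```

The result is itself a DataFrame, so individual values can be retrieved with `.loc`, for example `summary.loc['Full Shadow', 'mean']`.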

# Plots in Python

### Plotting libraries in Python
Almost all plotting in Python builds on Matplotlib, a plotting library loosely modeled on MATLAB. Pandas and seaborn both wrap this library and are useful for making nicer-looking plots with fewer lines of code. Within `matplotlib`, `pyplot` is the most commonly used module.
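
As a minimal sketch of how the pieces fit together, the cell below creates one figure containing one set of axes and draws on it. The data are arbitrary, and the `matplotlib.use("Agg")` line is only an assumption for running this as a script without a display; it can be left out inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)

fig, ax = plt.subplots()  # one Figure (the canvas) holding one Axes (the plot area)
ax.plot(x, x ** 2, label="x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A minimal matplotlib plot")
ax.legend()

n_lines = len(ax.lines)  # number of line objects drawn on these axes
```

The figure is the container, and the axes object is what the plotting and labeling methods are called on.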

![image.png](https://files.realpython.com/media/fig_map.bc8c7cabd823.png)

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots(1) # initialize an empty figure and axes handle


In [None]:
fig, ax = plt.subplots(1) # initialize an empty figure and axes handle
ax.plot(ghog_df['Year'], ghog_df['February Average Temperature'])


In [None]:
fig, ax = plt.subplots(1) # initialize an empty figure and axes handle
ax.plot(ghog_df['Year'], ghog_df['February Average Temperature'])
ax.plot(ghog_df['Year'], ghog_df['March Average Temperature'], color = 'orange')
ax.set_xticks(np.arange(1890,2020,20))

In [None]:
fig, axs = plt.subplots(2)

In [None]:
fig, axs = plt.subplots(2) # initialize a figure with two stacked axes
fig.set_size_inches(8,8)
axs[0].plot(ghog_df['Year'], ghog_df['February Average Temperature'])
axs[0].plot(ghog_df['Year'], ghog_df['March Average Temperature'], color = 'orange')
axs[1].hist(ghog_df['February Average Temperature'],alpha=0.6)
axs[1].hist(ghog_df['March Average Temperature'], color='orange',alpha=0.6)
axs[0].set_xticks(np.arange(1890,2020,20))


### Built in plotting with pandas

In [None]:
ghog_df_clean.plot(kind='scatter', x='February Average Temperature', y='March Average Temperature', c= 'Year')

In [None]:
ghog_grp = ghog_df_clean.groupby('Punxsutawney Phil')
ghog_grp.get_group('No Shadow').plot(x='Year', y='difference', kind='scatter', color='black')
ghog_grp.get_group('Full Shadow').plot(x='Year', y='difference', kind='scatter', color='black')

In [None]:
fig, ax = plt.subplots(1)
ghog_grp = ghog_df_clean.groupby('Punxsutawney Phil')
ghog_grp.get_group('No Shadow').plot(x='Year', y='difference', kind='scatter', color='orange', ax = ax)
ghog_grp.get_group('Full Shadow').plot(x='Year', y='difference', kind='scatter', color='black', ax = ax)

In [None]:
sns.swarmplot(data = ghog_df_clean, x= 'Punxsutawney Phil', y='difference')

In [None]:
sns.relplot(data = ghog_df_clean, x= 'February Average Temperature', y='March Average Temperature', 
           hue = 'Punxsutawney Phil', size = 'difference', alpha=.5, palette="bright", sizes=(40, 400))

## OTU Tables

Data from: Suter et al. 2017. Free-living chemoautotrophic and particle-attached heterotrophic prokaryotes dominate microbial assemblages along a pelagic redox gradient. Environmental Microbiology.

### A key for the sample coding:
- AB2: number of the Niskin cast (the same for all samples)
- a or b: biological replicate a or b
- 143: depth in meters
- A or B: particle-associated (A: 250-2.7 µm) or free-living (B: 2.7-0.2 µm)
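
The key above can be turned into a small parser. This is only a sketch: it assumes the four parts are concatenated without separators (for example `AB2a143A`), which may not match the actual column names in the file, and `SAMPLE_PATTERN` and `parse_sample` are hypothetical names.

```python
import re

# Assumed layout: cast (AB + digits), replicate (a/b), depth in m, size fraction (A/B)
SAMPLE_PATTERN = re.compile(
    r"^(?P<cast>AB\d+)(?P<replicate>[ab])(?P<depth_m>\d+)(?P<fraction>[AB])$"
)

def parse_sample(name):
    """Split a sample code into its parts, following the key above."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized sample name: {name!r}")
    info = match.groupdict()
    info["depth_m"] = int(info["depth_m"])
    info["fraction"] = (
        "particle-associated" if info["fraction"] == "A" else "free-living"
    )
    return info

parsed = parse_sample("AB2a143A")
print(parsed)
```

Parsed fields like these could later be used to group or color samples by depth or size fraction.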

In [None]:
otu_df = pd.read_csv('data/Cariaco_OTU.csv')

In [None]:
otu_df = pd.read_csv('data/Cariaco_OTU.csv', index_col=0)

In [None]:
otu_df.columns

In [None]:
otu_df.index

In [None]:
otu_df.head()

In [None]:
ax = otu_df.head().plot(kind='bar', stacked=True, legend=True)
ax.get_legend().set_bbox_to_anchor((1,1))

In [None]:
ax = otu_df.T.head().plot(kind='bar', stacked=True, legend=True)
ax.get_legend().set_bbox_to_anchor((1,1))

# Parsing taxonomy! 

In [None]:
otu_df.taxonomy.str.split(';').str[1]

In [None]:
otu_df['class']=otu_df.taxonomy.str.split(';').str[1]
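
To pull out every level of the taxonomy at once rather than a single one, `str.split` can expand the pieces into separate columns. The taxonomy strings below are hypothetical examples, and the generic `rank_0`, `rank_1`, `rank_2` names are placeholders because the level each position represents depends on the file.

```python
import pandas as pd

# Hypothetical semicolon-delimited taxonomy strings, one per OTU
toy_tax = pd.Series(
    [
        "Bacteria;Proteobacteria;Gammaproteobacteria",
        "Archaea;Thaumarchaeota",
    ],
    index=["otu_1", "otu_2"],
)

# expand=True returns one column per piece; missing levels are left empty
ranks = toy_tax.str.split(";", expand=True)
ranks.columns = ["rank_0", "rank_1", "rank_2"]
print(ranks)
```

OTUs with a shorter taxonomy string simply get missing values in the deeper levels.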

In [None]:
otu_df.groupby('class').sum(numeric_only=True) # sum the counts only, not the taxonomy strings

In [None]:
ax = otu_df.groupby('class').sum(numeric_only=True).T.plot(kind='bar', stacked=True)
ax.get_legend().set_bbox_to_anchor((1,1))

In [None]:
# Calculate Bray Curtis distance between samples

In [None]:
df = otu_df.drop(['taxonomy', 'class'], axis=1) # keep only the count columns

In [None]:
df_norm = df / df.sum()

In [None]:
from scipy.spatial.distance import squareform, pdist, braycurtis

In [None]:
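# the metric is an argument of pdist (not squareform); pdist compares rows, so transpose to compare samples
bc_distance = pd.DataFrame(squareform(pdist(df_norm.T, metric='braycurtis')), index=df.columns, columns=df.columns)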
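
For two samples with abundances u and v, SciPy defines the Bray-Curtis distance as sum(|u_i - v_i|) / sum(|u_i + v_i|). The sketch below checks this by hand on a made-up table of three OTUs and three samples; `toy_counts` and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical counts: rows are OTUs, columns are samples
toy_counts = pd.DataFrame(
    {"s1": [10, 0, 30], "s2": [5, 5, 30], "s3": [0, 40, 0]},
    index=["otu_1", "otu_2", "otu_3"],
)
rel = toy_counts / toy_counts.sum()  # relative abundance within each sample

# pdist compares rows, so transpose to put samples in rows
dist = pd.DataFrame(
    squareform(pdist(rel.T, metric="braycurtis")),
    index=rel.columns,
    columns=rel.columns,
)

# The same distance for s1 vs s2, computed directly from the formula
u, v = rel["s1"], rel["s2"]
by_hand = np.abs(u - v).sum() / np.abs(u + v).sum()

print(dist)
print(by_hand)
```

Samples that share no OTUs, like `s1` and `s3` here, are at the maximum distance of 1, while identical samples are at distance 0.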


In [None]:
sns.clustermap(bc_distance)