# Session IV: System Calls and Plotting

---
### Session III Challenge

In [None]:
#!/usr/bin/env python
""" Compares two ages

Finds the difference in years between the ages of the navigator and the driver

Usage: python basic_template.py 9001 12

Args:
    (str): age of the navigator
    (str): age of the driver

Returns:
    (int): difference of age in years

"""
import sys

        
def main(navigator_age, driver_age):
    """ Finds the difference between two ages in years
    
    Args:
        navigator_age (int): age of the navigator
        driver_age (int): age of the driver
    
    Returns:
        Difference in age
    """
    diff = abs(navigator_age - driver_age)
    print(f'Difference of {diff} years')
    return diff
            
    
if __name__ == '__main__':
    n_age = int(sys.argv[1])
    d_age = int(sys.argv[2])
    main(n_age, d_age)

---

## System Calls

As stated before, one of the biggest strengths of Python is its ability to 'glue' together many programs. This can be done through different APIs and libraries. However, some programs don't have an easy solution like that. For those, we use **system calls**.

Just as we have run Python from the command line, it is possible to run the command line from Python. Here are some ways to do it.

In [14]:
from subprocess import run, PIPE

The whole API for system calls can be found [here](https://docs.python.org/3/library/subprocess.html) or:

In [None]:
?run

**Fair Warning**: it is long!

The general idea of using `run()` is to give it all the information it needs to send to the command line, and whether or not to capture the output.

In [None]:
# When would you not need to save the output?
run('mkdir ./tickle', shell=True) # That is fun, but I don't want that directory anymore
run('rm -rf ./tickle', shell=True)

In [27]:
# When would be a time you would want to store the output?
job = run('ls | grep CODE', shell=True, encoding='utf-8', stdout=PIPE)
job.stdout

'CODE_OF_CONDUCT.md\n'

What is `shell`, and why is it set to `True`?

It is as easy as that. 
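
One way to see what `shell=True` changes is to run the same kind of command both ways. The sketch below is only an illustration: it uses `sys.executable` (the Python interpreter that is currently running) as a stand-in for any external program, and the variable names are made up for this example. With a list of arguments, `run()` starts the program directly. With `shell=True`, the whole string is handed to the system shell, which is what makes shell features such as the pipe in `ls | grep CODE` available.

```python
import sys
from subprocess import run, PIPE

# Without a shell: pass the program and its arguments as a list.
direct = run([sys.executable, '-c', 'print("no shell")'],
             stdout=PIPE, encoding='utf-8')

# With shell=True: pass a single string and let the shell parse it.
# The interpreter path is quoted in case it contains spaces.
command = f'"{sys.executable}" -c "print(1 + 1)"'
via_shell = run(command, shell=True, stdout=PIPE, encoding='utf-8')

print(direct.stdout.strip())
print(via_shell.stdout.strip())
```

Because the shell interprets the entire string, `shell=True` should not be combined with text that comes from an untrusted source.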

Just like command line arguments, the shell returns what ***kind*** of object?

In [None]:
type(job.stdout)
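
For comparison, here is a small sketch of what `stdout` looks like with and without the `encoding` argument. It again uses `sys.executable` as a stand-in for an external program, so the command itself is an assumption made for illustration.

```python
import sys
from subprocess import run, PIPE

cmd = [sys.executable, '-c', 'print("42")']

raw = run(cmd, stdout=PIPE)                          # no encoding: raw bytes
decoded = run(cmd, stdout=PIPE, encoding='utf-8')    # encoding given: decoded text

print(type(raw.stdout), type(decoded.stdout))

# Either way, the output is not a number until you convert it yourself
answer = int(decoded.stdout)
```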

---

## Tabular data analysis with `pandas`

`pandas` is Python's answer to R's `data.frame`. It exposes the `DataFrame` data structure, which is described as a "two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)".

`pandas` is a ***whole lot more*** than `DataFrame`, but for anybody that has used R, this is one of the most important parts of `pandas`.

In [None]:
# standard pandas importing convention
import pandas as pd

Don't worry about those warnings, they are just an artifact of the environment we are working on.

Remember all the file handling stuff we did earlier, where you had to run through each line? Watch this...

In [None]:
pokemon = pd.read_csv('./datasets/pokemon.csv', index_col=0)

In [None]:
pokemon.head()

In [None]:
# What pokemon have the highest base attack rating?
pokemon.sort_values('Attack', ascending=False)[:10]
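
To make the comparison with manual file handling concrete, here is a minimal sketch that answers the same "highest attack" question both ways. The table is made-up data held in memory (it is not the contents of `pokemon.csv`), and the variable names are only for illustration.

```python
import csv
import io
import pandas as pd

# Made-up data standing in for a file such as pokemon.csv
text = """Name,Attack,Defense
alpha,50,40
beta,120,60
gamma,85,90
"""

# The manual way: loop over every line and convert the types yourself
rows = []
for record in csv.DictReader(io.StringIO(text)):
    record['Attack'] = int(record['Attack'])
    record['Defense'] = int(record['Defense'])
    rows.append(record)
strongest_manual = max(rows, key=lambda r: r['Attack'])['Name']

# The pandas way: one call parses the table and infers the column types
table = pd.read_csv(io.StringIO(text), index_col=0)
strongest_pandas = table.sort_values('Attack', ascending=False).index[0]

print(strongest_manual, strongest_pandas)
```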

What about tab-separated values?

In [None]:
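# Keep only the columns that have at least 800 non-missing values
# (recent versions of pandas do not allow `how` and `thresh` together)
weather = pd.read_table('./datasets/weather.tsv', delimiter='\t').dropna(axis=1, thresh=800)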

In [None]:
weather.head()
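
The `thresh` argument of `dropna` does most of the work when cleaning a table like this: with `axis=1`, it is the minimum number of non-missing values a column must have in order to be kept. A toy sketch with an invented three-column table:

```python
import numpy as np
import pandas as pd

# Invented table: one complete column, one mostly empty, one entirely empty
toy = pd.DataFrame({
    'full':   [1.0, 2.0, 3.0, 4.0],
    'sparse': [1.0, np.nan, np.nan, np.nan],
    'empty':  [np.nan] * 4,
})

# Keep only the columns with at least 3 non-missing values
kept = toy.dropna(axis=1, thresh=3)
print(list(kept.columns))
```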

In [None]:
# Exploring some statistics
weather.describe()


`pandas` even plays well with Excel:

In [None]:
ramen = pd.read_excel('./datasets/ramen-ratings.xlsx')

In [None]:
ramen.head()

### An extended version of my `pandas` tutorial can be found [here](https://github.com/betteridiot/b575w18/blob/master/Pandas.ipynb)

---

## Data Visualization

Data is great, but unless we can determine trends in it, it is useless. One of the most efficient ways to provide evidence of these trends is data visualization. This is just a graphic representation of the data.

In [None]:
%matplotlib inline
# This first line is special magic just for notebooks that let us see the plots as we make them

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Also, we are going to want to generate some data
import numpy as np
import numpy.random as nr  # assumed to be the `nr` alias used below (nr.choice, nr.randint)
from sklearn.datasets import make_classification
from utils.session4 import *

## Simulating data

In [None]:
random_data = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)[0]
df = pd.DataFrame(random_data, columns=gen_lab(3, 10))
df.head()

## Add a categorical value column

In [None]:
feature_types = ['gene', 'CDS', 'mRNA', 'exon', 'five_prime_UTR',
                'three_prime_UTR', 'rRNA', 'tRNA', 'ncRNA', 'tmRNA',
                'transcript', 'mobile_genetic_element', 'origin_of_replication',
                'promoter', 'repeat_region']
feature_column = pd.Series(nr.choice(feature_types, 1000), name='feature_type')
df = df.join(feature_column)

In [None]:
df.head()

## `matplotlib` from scratch

In [None]:
plt.plot(df.iloc[:10,1], color='#FF8016', marker='o', linestyle=':')
plt.plot(df.iloc[:10,2], color='#2353C0', marker='^', linestyle='')
plt.plot(df.iloc[:10,3], color='#FFD716', linestyle='--')
plt.show()

## Multiple Plots

matplotlib allows users to define the regions of their plotting canvas. If the user intends to create a canvas with multiple plots, they would use the `subplot()` function. The `subplot` function sets the number of rows and columns the canvas will have **AND** sets the current index of where the next subplot will be rendered.

In [None]:
plt.figure(1)
# two rows, two columns, first index (top-left)
plt.subplot(221)
plt.plot(df.loc[:,['se', 'yy']], alpha=0.5)

plt.subplot(222)
plt.plot(df.loc[:,['va','xg']], alpha=0.5)

plt.subplot(223)
plt.plot(df.iloc[:,8:10], alpha=0.5)

plt.subplot(224)
plt.plot(df.iloc[:,:2], alpha=0.5)

plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
plt.show()
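
The three-digit form `plt.subplot(221)` is shorthand for `plt.subplot(2, 2, 1)`: rows, columns, then the index of the panel to draw in next. A minimal sketch of the same 2x2 layout written as a loop, using the non-interactive `Agg` backend (an assumption made so it runs without a display; leave that line out in a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # no display needed; omit this line inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure()
for index in range(1, 5):
    # Same as plt.subplot(221), plt.subplot(222), ... plt.subplot(224)
    ax = plt.subplot(2, 2, index)
    ax.plot([0, 1], [0, index])
    ax.set_title(f'panel {index}')

# The figure keeps its panels in the order they were created
titles = [ax.get_title() for ax in fig.axes]
plt.close(fig)
```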

In [None]:
n, bins, patches = plt.hist(df.yy, facecolor='#5A0BB0', alpha=0.8, rwidth=.8, align='mid')
plt.title("Hello")
plt.ylabel('counts')

The biggest issue with `matplotlib` isn't a lack of power; it is that it offers too much. With great power comes great responsibility. When you are quickly exploring data, you don't want to have to fiddle around with axis limits, colors, figure sizes, etc. Yes, you *can* make good figures with `matplotlib`, but you probably won't.

## Using pandas `.plot()`

Pandas abstracts away some of those initial issues with data visualization. However, it is still `matplotlib`-esque.

Pandas is built on `numpy` for its calculations, but its plotting is built on `matplotlib`. Therefore, just like any data you get from `pandas` can be used within `numpy`, every plot that is returned from `pandas` is a `matplotlib` plot...and subject to `matplotlib` modification.

In [None]:
ax = df.feature_type.value_counts(sort=True).plot.bar()
ax.set_ylabel('count')
ax.set_xlabel('Feature Type')
plt.show()
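
A quick way to convince yourself that a `pandas` plot really is a `matplotlib` object is to check its type. This sketch uses a small invented `Series` of counts and the non-interactive `Agg` backend (an assumption made so it runs without a display; leave that line out in a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # no display needed; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes

# Invented counts, standing in for df.feature_type.value_counts()
counts = pd.Series({'gene': 5, 'exon': 3, 'tRNA': 1})

ax = counts.plot.bar()
is_axes = isinstance(ax, Axes)   # pandas hands back a matplotlib Axes

# ...so the usual matplotlib methods work on it
ax.set_ylabel('count')
ylabel = ax.get_ylabel()
plt.close('all')
```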

In [None]:
ax = pd.plotting.scatter_matrix(df, alpha=0.05, figsize=(10,10),
                                diagonal='kde')

# Seaborn

Seaborn is a library that specializes in making *prettier* `matplotlib` plots of statistical data. There was a brief introduction to seaborn in the last class, which we will re-create here.

In [None]:
import seaborn as sns

In [None]:
sns.set(style='whitegrid')

## Violin plot

A violin plot is a fancier box plot: it draws the estimated distribution of the data itself, so there is no need to 'jitter' individual points to see it.

In [None]:
fig, axes = plt.subplots(figsize=(10, 10))
sns.violinplot(data=df.iloc[:,:-1], ax=axes)
axes.set_ylabel('number')
axes.set_xlabel('columns')
plt.show()

## Distplot

In [None]:
sns.set(palette='muted')

f, axes = plt.subplots(2,2, figsize=(10,10), sharex=True)
sns.despine(left=True)

# Note: distplot is deprecated as of seaborn 0.11; histplot, kdeplot, and displot are its replacements
sns.distplot(df.iloc[:, nr.randint(0,len(df.columns)-1)], ax=axes[0,0])
sns.distplot(df.iloc[:, nr.randint(0,len(df.columns)-1)], kde=False, ax=axes[0,1], color='orange')
sns.distplot(df.iloc[:, nr.randint(0,len(df.columns)-1)], hist=False, kde_kws={'shade':True}, ax=axes[1,0], color='purple')
sns.distplot(df.iloc[:, nr.randint(0,len(df.columns)-1)], hist=False, rug=True, ax=axes[1,1], color='green')

## Hexbin with marginal distributions

In [None]:
sns.set(style='ticks')

In [None]:
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.jointplot(x=df.iloc[:,nr.randint(0, len(df.columns)-1)],
              y=df.iloc[:,nr.randint(0, len(df.columns)-1)],
              kind='hex', color='#246068')

In [None]:
sns.set()

g = sns.FacetGrid(df.loc[:,['hj','feature_type', 'tv']], col='feature_type', hue='feature_type', col_wrap=5)
g.map(plt.scatter, 'hj', 'tv')