<!-- https://github.com/kmahelona/ipython_notebook_goodies -->
<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

# Setup Software

## Git, Github

- **Git**. You may already have git installed. Try going to the terminal and type `git --version`. If not, then install from <https://git-scm.com/downloads>.

- **Github**. Create an account at <http://github.com>, if you don't already have one. For username, I recommend all lower-case letters, short as you can. I recommned using your \*ucsb.edu email, since you can request free private repositories via [GitHub Education](https://education.github.com/) discount. You're encouraged to upload a picture since it will get included in the students listing as part of this course repository.


## Python - Anaconda distribution

- Download and install the Python Anaconda distribution from <https://www.continuum.io/downloads>

# Data Science

## What is Data Science?


Consensus from [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by 
by Jake VanderPlas on [Drew Conway's Venn diagram of Data Science](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram):

![](http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=200w)

Interdisciplinary aspects:

1. **Hacker** reads and cleans data, develops workflows and (web-based) visualizations
1. **Statistician** models and summarizes data
1. **Expert** advances knowledge for a given domain

## Data Science Workflow

I also like Hadley Wickham's [R for Data Science](http://r4ds.had.co.nz/) iterative recipe for exploratory data analysis:

![](http://r4ds.had.co.nz/diagrams/data-science-explore.png)

- **Import**: read in simple text files, eg comma-seperated values (*.csv) [`csv`, `numpy`, `pandas`]
- **Tidy**: normalize data into easily queryable formats (eg "wide" to "long") [`pandas`]
- **Transform**: manipulate data based on subsetting by rows or columns, sorting and joining [`pandas`]
- **Model**: develop statistical or algorithmic model to test scientific hypothesis [`scikit-learn`, `networkX`]
- **Visualise**: static [`matplotlib`] or interactive [`bokeh`] plots
- **Communicate**: online [Github, Jupyter Notebooks, markdown]

# Git & Github

- **Git** is a version control system that lets you track changes to files over time. These files can be any kind of file (eg doc, pdf, xls), but free text differences are most easily visible (eg txt, csv, md). You can rollback changes made by you, or others. This facilitates a playground for collaboration, without fear of experimentation (you can always rollback changes).

- **Github** is a website for storing your git versioned files remotely. It has many nice features to be able visualize differences between [images](https://help.github.com/articles/rendering-and-diffing-images/), [rendering](https://help.github.com/articles/mapping-geojson-files-on-github/) & [diffing](https://github.com/blog/1772-diffable-more-customizable-maps) map data files, [render text data files](https://help.github.com/articles/rendering-csv-and-tsv-data/), and [track changes in text](https://help.github.com/articles/rendering-differences-in-prose-documents/).   It's mostly designed to facilitate a technical conversation between commits, pull requests, issues and particular lines of code. It's also great for project management, with the ability to link issues, milestones and commits (see [Mastering Issues](https://guides.github.com/features/issues/)).

## Setup Github & Git

1. Create **Github** account at <http://github.com>, if you don't already have one. For username, I recommend all lower-case letters, short as you can. I recommned using your \*ucsb.edu email, since you can request free private repositories via [GitHub Education](https://education.github.com/) discount. You're encouraged to upload a picture since it will get included in the students listing as part of this course repository.

1. Configure **git** with global commands. Open up the Bash version of Git and type the following:

        # display your version of git
        git --version
        
        # replace USER with your Github user account
        git config –-global user.name USER
        
        # replace USER@UMAIL.UCSB.EDU with the email you used to register with Github
        git config –-global user.email USER@UMAIL.UCSB.EDU
        
        # list your config to confirm user.* variables set
        git config --list

## Github Workflows

The two most common workflow models for working Github repositories are based on your permissions:

1. **writable**: Push & Pull (simplest)

1. **read only**: Fork & Pull Request (extra steps)

### Push & Pull

repo location | `USER` permission | initialize <i class="fa fa-arrow-down"></i> | edit <i class="fa fa-arrow-up"></i> | update <i class="fa fa-arrow-down"></i>
-----------|:-----------:|:-----------:|:-----------:|:-----------:
<i class="fa fa-cloud"></i> `github.com/OWNER/REPO` | read + write | [**create**](https://help.github.com/articles/create-a-repo/) <span class="octicon octicon-plus"></span> |   |
<i class="fa fa-desktop"></i> `~/github/REPO`      | read + write | [**clone**](https://help.github.com/articles/fetching-a-remote) <span class="octicon octicon-desktop-download"></span> | [**commit**](http://git-scm.com/docs/git-commit) <span class="octicon octicon-git-commit"></span>,  [**push**](https://help.github.com/articles/pushing-to-a-remote/) <span class="octicon octicon-cloud-upload"></span> | [**pull**](https://help.github.com/articles/fetching-a-remote/#pull) <span class="octicon octicon-cloud-download"></span>

Note that OWNER could be either an individual USER or group ORGANIZATION, which has member USERs.

### Fork & Pull Request

repo location | `USER` permission | initialize <i class="fa fa-arrow-down"></i> | edit <i class="fa fa-arrow-up"></i> | update <i class="fa fa-arrow-down"></i>
-----------|:-----------:|:-----------:|:-----------:|:-----------:
<i class="fa fa-cloud"></i> `github.com/OWNER/REPO` | read only |  | [**merge**](https://help.github.com/articles/merging-a-pull-request) <span class="octicon octicon-git-merge"></span>  | 
<i class="fa fa-cloud"></i> `github.com/OWNER/REPO`    | read + write | [**fork**](https://help.github.com/articles/fork-a-repo) <span class="octicon octicon-repo-forked"></span> | [**pull request**](https://help.github.com/articles/creating-a-pull-request/) <span class="octicon octicon-git-pull-request"></span> | [**pull request**](https://help.github.com/articles/creating-a-pull-request/) <span class="octicon octicon-git-pull-request"></span>, [**merge**](https://help.github.com/articles/merging-a-pull-request) <span class="octicon octicon-git-merge"></span>
<i class="fa fa-desktop"></i> `~/github/REPO` | read + write | [**clone**](https://help.github.com/articles/fetching-a-remote) <span class="octicon octicon-desktop-download"></span> | [**commit**](http://git-scm.com/docs/git-commit) <span class="octicon octicon-git-commit"></span>,  [**push**](https://help.github.com/articles/pushing-to-a-remote/) <span class="octicon octicon-cloud-upload"></span> | [**pull**](https://help.github.com/articles/fetching-a-remote/#pull) <span class="octicon octicon-cloud-download"></span>

## Fork & Pull Request Your People Entry

As an exercise for you to try out this fork & pull request model, you will add yourself to the [**<i class="fa fa-users"></i> Github Directory** for UCSB Network Data Science Boot Camp (2016)](http://bbest.github.io/ucsb-network-data-science-2016/) directory for this workshop which initially looks like this:

  ![](img/directory_bbest-only.png)

Please join me! Because you cannot directly write to this course repository, [fork](https://help.github.com/articles/fork-a-repo/) it into your own USER space. You can further [clone](https://help.github.com/articles/cloning-a-repository/) it onto your machine to edit locally, or simply create a New file through the web browser. Introduce yourself by adding a tiny file per your **Github** `USERNAME.json` under the `_data/2016` directory. Here's an example for my Github username `bbest`, so in a file named `bbest.json`:

```javascript
{
	"department": "MSI, NCEAS (consultant)",
	"interests": "marine biodiversity, ocean health",
	"project": "",
	"project_url": ""
}
```

If you cloned to your machine, be sure to git [commit](http://git-scm.com/docs/git-commit) and [push](https://help.github.com/articles/pushing-to-a-remote/) the changes, and [**create a pull request**](https://help.github.com/articles/creating-a-pull-request/) to the original repository `bbest/ucsb-network-data-science-2016`.

The details of how this works (using [Jekyll data files](https://jekyllrb.com/docs/datafiles/)) are beyond the scope of this boot camp, but provides a simple satisfying example for applying the fork & pull request model to a repository for which you do not have write permissions and want to contribute towards.

## Create Repository `my-project`

Now you will create a Github repository for a project to which you will have direct write access, ie use the push & pull model.

1. [Create a repository](https://help.github.com/articles/create-a-repo/) called `my-project`.

    ![](img/github_repo-create.png)
    
    Please be sure to tick the box to **Initialize this repository with a README**. Otherwise defaults are fine.
    
    ![](img/github_create-my-project.png)

1. [Create a branch](https://help.github.com/articles/creating-and-deleting-branches-within-your-repository/) called `gh-pages`.

    ![](img/github_create-branch_gh-pages.png)
    
    Per [pages.github.com](https://pages.github.com), since this will be a project site only web files in the `gh-pages` branch will show up at `http://USER.github.io/REPO`. For a user (or organization) site, the REPO must be named `USER.github.io` (or `ORG.github.io`) and then the default `master` branch will contain the web files for the website `http://USER.github.io` (or `http://ORG.github.io`). See also [User, Organization, and Project Pages - Github Help](https://help.github.com/articles/user-organization-and-project-pages/).
    
1. [Set the default branch](https://help.github.com/articles/setting-the-default-branch/) to `gh-pages`, NOT the default `master`.

    ![](img/github_default-branch_gh-pages.png)
    
1. [Delete the branch](https://help.github.com/articles/viewing-branches-in-your-repository/#deleting-branches) `master`, which will not be used.

## Edit `README.md` in Markdown

[Commit your first change](https://help.github.com/articles/create-a-repo/#commit-your-first-change) by editing the `README.md` which is in **markdown**, simple syntax for conversion to HTML. Now update the contents of the `README.md` with the following, having a link and a numbered list:
  
```
# my-project

Playing with [Software Carpentry at UCSB](http://remi-daigle.github.io/2016-04-15-UCSB).

## Introduction

This repository demonstrates **software** and _formats_:

1. **Git**
1. **Github**
1. _Markdown_
1. _Rmarkdown_

## Conclusion

![](https://octodex.github.com/images/labtocat.png)
```
    
Now click on the <span class="octicon octicon-eye"></span> Preview changes to see the markdown rendered as HTML:
    
![](img/github_preview_README-md.png)
    
Notice the syntax for:

- **numbered list** gets automatically sequenced: `1.`, `1.`
- **headers** get rendered at multiple levels: `#`, `##`
- **link**: `[](http://...)`
- **image**: `![](http://...)`
- _italics_: `_word_` 
- **bold**: `**word**`

See [Mastering Markdown · GitHub Guides](https://guides.github.com/features/mastering-markdown/) and add some more personalized content to the README of your own, like a bulleted list or blockquote.

## Create `index.html`

By default `index.html` is served up. Go ahead and create a new file named `index.html` with the following [basic HTML](http://www.w3schools.com/html/html_basic.asp):

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>
```

## Clone Repository

[Clone the repository](https://help.github.com/articles/fetching-a-remote) onto your local machine. The easiest way to do this is simply clicking the button <span class="octicon octicon-desktop-download"></span> to open up the Github Desktop App.

![](img/github_clone-desktop.png)
    
You'll be prompted to clone this repository into a folder on your local machine. I recommend creating a folder `github` under your user folder.
  
See [GitHub Desktop User Guides](https://help.github.com/desktop/guides/) for more. You could also do this from the Bash Shell for Git with the command `git clone https://github.com/USER/REPO.git`, replacing USER with your Github username and REPO with my_project. Or you can use the Github Desktop App menu File -> Clone Repository...

# Python using Jupyter Notebook

The **Python** programming language was created by Guido Von Rossum, the "benevolent dictator" who likes Monty Python comedies.

**Jupyter** started off as iPython (interactive Python) Notebook, then was generalized to include other language kernels to include Julia (Ju), Python (pyt) and R (r), hence the rename. With Jupyter Notebook you can weave cells of descriptive Markdown content with cells of Code into a single notebook document.


## Launch Jupyter Notebook

```bash
# change directory to where you git cloned the course repo
cd ~/github/ucsb-network-data-science-2016
jupyter notebook
```

Play around with creating a new notebook and populating cells of Markdown and Code.

## Python Basics

```python
# basic calculations
1 + 1

# assign variables from integer, string data types
a = 2
b = 5
c = 'a string'
a/b
a/c

# import the operating system module
import(os)

# everything is an object
dir()
dir(os)

# get the current working directory
os.getcwd()

# assign a list
x = [1,3,4]

# grab specific elements in a list. zero indexed!
x[0]
x[0:1]      # only 1 element, so goes up to last slice and excludes
x[0:2]      # misses last element
len(x)      # length of 3
x[0:len(x)] # gets all 3 elements

# inspect x as an object
type(x)
dir(x)

# sort x
print(x)
x.sort()
print(x)

# loop through the list. indentation matters!
cum_sum = 0
for i in x:
    cum_sum += i
    print(cum_sum-i,'+',i,'=',cum_sum)

# range, conditionals, string formatting
cum_sum = 0
for i in range(10):
    cum_sum += i
    if cum_sum > 5:
        print('{0}, {1}, {2}'.format(cum_sum-i, i, cum_sum))
    else:
        print('cum_sum < 5')
```

For more, see: 

- [Python_Data_Science_Handbook.pdf](https://drive.google.com/open?id=0B7zzHLs-PTOvRHRvSkR5aVFfODA) chapter 1: A Whirlwind Tour of the Python Language.
- [The Python Tutorial — Python 3.5.2 documentation](https://docs.python.org/3/tutorial/index.html)


TODO:
- data types: 
  - strings: real vs unicode
  - lists. list comprehension

# Graphs with Python

For more on dictionaries, see [5. Data Structures — Python 3.5.2 documentation](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)


The following cells of code are directly pulled from [Python Advanced: Graph Theory and Graphs in Python](http://www.python-course.eu/graphs_python.php) where the following graph is used as an example to build up in Python:


![](http://www.python-course.eu/images/simple_graph_isolated.png)

## Define Graph as a Dictionary 

In [63]:
# define a graph as a dictionary
graph = { "a" : ["c"],
          "b" : ["c", "e"],
          "c" : ["a", "b", "d", "e"],
          "d" : ["c"],
          "e" : ["c", "b"],
          "f" : []
        }
        
# define function to output list of all edges
def generate_edges(graph):
    edges = []
    for node in graph:
        for neighbour in graph[node]:
            edges.append((node, neighbour))

    return edges

print(generate_edges(graph))

[('e', 'c'), ('e', 'b'), ('c', 'a'), ('c', 'b'), ('c', 'd'), ('c', 'e'), ('d', 'c'), ('b', 'c'), ('b', 'e'), ('a', 'c')]


## Find Isolated Nodes

In [64]:
def find_isolated_nodes(graph):
    """ returns a list of isolated nodes. """
    isolated = []
    for node in graph:
        if not graph[node]:
            isolated += node
    return isolated
    
    
print(find_isolated_nodes(graph))

['f']


## Create a Graph Class

Now use object oriented programming in Python to create a special Graph class.


In [65]:
""" A Python Class
A simple Python graph class, demonstrating the essential 
facts and functionalities of graphs.
"""
class Graph(object):

    def __init__(self, graph_dict=None):
        """ initializes a graph object 
            If no dictionary or None is given, 
            an empty dictionary will be used
        """
        if graph_dict == None:
            graph_dict = {}
        self.__graph_dict = graph_dict

    def vertices(self):
        """ returns the vertices of a graph """
        return list(self.__graph_dict.keys())

    def edges(self):
        """ returns the edges of a graph """
        return self.__generate_edges()

    def add_vertex(self, vertex):
        """ If the vertex "vertex" is not in 
            self.__graph_dict, a key "vertex" with an empty
            list as a value is added to the dictionary. 
            Otherwise nothing has to be done. 
        """
        if vertex not in self.__graph_dict:
            self.__graph_dict[vertex] = []

    def add_edge(self, edge):
        """ assumes that edge is of type set, tuple or list; 
            between two vertices can be multiple edges! 
        """
        edge = set(edge)
        (vertex1, vertex2) = tuple(edge)
        if vertex1 in self.__graph_dict:
            self.__graph_dict[vertex1].append(vertex2)
        else:
            self.__graph_dict[vertex1] = [vertex2]

    def __generate_edges(self):
        """ A static method generating the edges of the 
            graph "graph". Edges are represented as sets 
            with one (a loop back to the vertex) or two 
            vertices 
        """
        edges = []
        for vertex in self.__graph_dict:
            for neighbour in self.__graph_dict[vertex]:
                if {neighbour, vertex} not in edges:
                    edges.append({vertex, neighbour})
        return edges

    def __str__(self):
        res = "vertices: "
        for k in self.__graph_dict:
            res += str(k) + " "
        res += "\nedges: "
        for edge in self.__generate_edges():
            res += str(edge) + " "
        return res

graph = Graph(g)

print("Vertices of graph:")
print(graph.vertices())

print("Edges of graph:")
print(graph.edges())

print("Add vertex:")
graph.add_vertex("z")

print("Vertices of graph:")
print(graph.vertices())

print("Add an edge:")
graph.add_edge({"a","z"})

print("Vertices of graph:")
print(graph.vertices())

print("Edges of graph:")
print(graph.edges())

print('Adding an edge {"x","y"} with new vertices:')
graph.add_edge({"x","y"})
print("Vertices of graph:")
print(graph.vertices())
print("Edges of graph:")
print(graph.edges())

Vertices of graph:
['e', 'x', 'c', 'f', 'd', 'b', 'z', 'a']
Edges of graph:
[{'e', 'c'}, {'x', 'y'}, {'b', 'c'}, {'c'}, {'c', 'd'}, {'a', 'd'}, {'a', 'z'}]
Add vertex:
Vertices of graph:
['e', 'x', 'c', 'f', 'd', 'b', 'z', 'a']
Add an edge:
Vertices of graph:
['e', 'x', 'c', 'f', 'd', 'b', 'z', 'a']
Edges of graph:
[{'e', 'c'}, {'x', 'y'}, {'b', 'c'}, {'c'}, {'c', 'd'}, {'a', 'd'}, {'a', 'z'}]
Adding an edge {"x","y"} with new vertices:
Vertices of graph:
['e', 'x', 'c', 'f', 'd', 'b', 'z', 'a']
Edges of graph:
[{'e', 'c'}, {'x', 'y'}, {'b', 'c'}, {'c'}, {'c', 'd'}, {'a', 'd'}, {'a', 'z'}]


## Add `find_path()` to Graph class

Now let's add a function to find paths `find_path()`

In [66]:
class Graph(object):

    def __init__(self, graph_dict=None):
        """ initializes a graph object 
            If no dictionary or None is given, 
            an empty dictionary will be used
        """
        if graph_dict == None:
            graph_dict = {}
        self.__graph_dict = graph_dict

    def vertices(self):
        """ returns the vertices of a graph """
        return list(self.__graph_dict.keys())

    def edges(self):
        """ returns the edges of a graph """
        return self.__generate_edges()

    def add_vertex(self, vertex):
        """ If the vertex "vertex" is not in 
            self.__graph_dict, a key "vertex" with an empty
            list as a value is added to the dictionary. 
            Otherwise nothing has to be done. 
        """
        if vertex not in self.__graph_dict:
            self.__graph_dict[vertex] = []

    def add_edge(self, edge):
        """ assumes that edge is of type set, tuple or list; 
            between two vertices can be multiple edges! 
        """
        edge = set(edge)
        (vertex1, vertex2) = tuple(edge)
        if vertex1 in self.__graph_dict:
            self.__graph_dict[vertex1].append(vertex2)
        else:
            self.__graph_dict[vertex1] = [vertex2]

    def __generate_edges(self):
        """ A static method generating the edges of the 
            graph "graph". Edges are represented as sets 
            with one (a loop back to the vertex) or two 
            vertices 
        """
        edges = []
        for vertex in self.__graph_dict:
            for neighbour in self.__graph_dict[vertex]:
                if {neighbour, vertex} not in edges:
                    edges.append({vertex, neighbour})
        return edges

    def __str__(self):
        res = "vertices: "
        for k in self.__graph_dict:
            res += str(k) + " "
        res += "\nedges: "
        for edge in self.__generate_edges():
            res += str(edge) + " "
        return res
        
    def find_path(self, start_vertex, end_vertex, path=None):
        """ find a path from start_vertex to end_vertex 
            in graph """
        if path == None:
            path = []
        graph = self.__graph_dict
        path.append(start_vertex)
        if start_vertex == end_vertex:
            return path
        if start_vertex not in graph:
            return None
        for vertex in graph[start_vertex]:
            if vertex not in path:
                extended_path = self.find_path(vertex, 
                                               end_vertex, 
                                               path)
                if extended_path: 
                    return extended_path
        return None

graph = Graph(g)

print("Vertices of graph:")
print(graph.vertices())

print("Edges of graph:")
print(graph.edges())


print('The path from vertex "a" to vertex "b":')
path = graph.find_path("a", "b")
print(path)

print('The path from vertex "a" to vertex "f":')
path = graph.find_path("a", "f")
print(path)

print('The path from vertex "c" to vertex "c":')
path = graph.find_path("c", "c")
print(path)

Vertices of graph:
['e', 'x', 'c', 'f', 'd', 'b', 'z', 'a']
Edges of graph:
[{'e', 'c'}, {'x', 'y'}, {'b', 'c'}, {'c'}, {'c', 'd'}, {'a', 'd'}, {'a', 'z'}]
The path from vertex "a" to vertex "b":
['a', 'd', 'c', 'b']
The path from vertex "a" to vertex "f":
None
The path from vertex "c" to vertex "c":
['c']


## NetworkX

The most commonly used networking package in Python is [NetworkX](https://networkx.github.io/), which is included in the Anaconda distribution.

Here's the [Quick Example — NetworkX](https://networkx.github.io/examples.html):

In [67]:
import networkx as nx

G=nx.Graph()
G.add_node("spam")
G.add_edge(1,2)

print(list(G.nodes()))
print(list(G.edges()))

[1, 2, 'spam']
[(1, 2)]


To compare the NetworkX Graph class with earlier, see the code in [networkx/graph.py](https://github.com/networkx/networkx/blob/master/networkx/classes/graph.py#L181-L185), and in particular the extra information it is capable of storing: 

> The Graph class uses a dict-of-dict-of-dict data structure.
The outer dict (node_dict) holds adjacency information keyed by node.
The next dict (adjlist_dict) represents the adjacency information and holds
edge data keyed by neighbor.  The inner dict (edge_attr_dict) represents
the edge data and holds edge attribute values keyed by attribute names.


# Reading Tabular Data with Pandas

[Pandas](http://pandas.pydata.org/pandas-docs/stable/) is well suited for "tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet."

- [Package overview — pandas 0.18.1 documentation](http://pandas.pydata.org/pandas-docs/stable/overview.html)
- [10 Minutes to pandas — pandas 0.18.1 documentation](http://pandas.pydata.org/pandas-docs/stable/10min.html)
- [12 Useful Pandas Techniques in Python for Data Manipulation](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/)

Other:
- numpy: `loadtxt`, `numpygenfromtxt`
- [csv — CSV File Reading and Writing — examples](https://docs.python.org/3/library/csv.html#examples)
- represent each row with own key,val vs each col has key: list of values

```python
import csv

d = {}
rdr = csv.reader(open('filename.csv', 'r'))
d.keys = rdr.next()
for row in rdr:
   k, v = row
   d[d.keys()] = v
```

- read csv (vs dict representation)

```python
dic = pd.Series.from_csv(filename, names=cols, header=None).to_dict()
```

In [68]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>