# Lab 7: git and GitHub
 

## Acknowledgements

Much of the material for this lesson was borrowed from or inspired by Matt Jones' [NCEAS Reproducible Research Techniques for Synthesis workshop](https://learning.nceas.ucsb.edu/2020-02-RRCourse/).


## Learning Objectives

In this lab, you will learn:

- What computational reproducibility is and why it is useful
- How version control can increase computational reproducibility
- How to set up git on your computer
- How to use git and GitHub to track changes to your work over time


## Reproducible Research

Reproducibility is the hallmark of science, which is based on empirical observations 
coupled with explanatory models.  While reproducibility encompasses 
the full science lifecycle, and includes issues such as methodological consistency and
treatment of bias, in this course we will focus on **computational reproducibility**: 
the ability to document data, analyses, and models sufficiently for other researchers 
to be able to understand and ideally re-execute the computations that led to 
scientific results and conclusions.

### What is needed for computational reproducibility?

The first step towards addressing these issues is to be able to evaluate the data,
analyses, and models on which conclusions are drawn.  Under current practice, 
this can be difficult because data are typically unavailable, the method sections
of papers do not detail the computational approaches used, and analyses and models
are often conducted in graphical programs, or, when scripted analyses are employed,
the code is not available.

And yet, this is easily remedied.  Researchers can achieve computational 
reproducibility through open science approaches, including straightforward steps 
for archiving data and code openly along with the scientific workflows describing 
the provenance of scientific results (e.g., @hampton_tao_2015, @munafo_manifesto_2017).

### Conceptualizing workflows

Scientific workflows encapsulate all of the steps from data acquisition, cleaning,
transformation, integration, analysis, and visualization.  

![](images/workflow.png)

Workflows can range in detail from simple flowcharts 
to fully executable scripts. R scripts and Python scripts are a textual form 
of a workflow, and when researchers publish specific versions of the scripts and 
data used in an analysis, it becomes far easier to repeat their computations and 
understand the provenance of their conclusions.

### The problem with filenames

Every file in the scientific process changes.  Manuscripts are edited.
Figures get revised.  Code gets fixed when problems are discovered.  Data files
get combined together, then errors are fixed, and then they are split and 
combined again. In the course of a single analysis, one can expect thousands of
changes to files.  And yet, all we use to track this are simplistic *filenames*.  
You might think there is a better way, and you'd be right: __version control__.

Version control systems help you track all of the changes to your files, without
the spaghetti mess that ensues from simple file renaming.  In version control systems
like `git`, the system tracks not just the name of the file, but also its contents,
so that when contents change, it can tell you which pieces went where.  It tracks
which version of a file a new version came from.  So it's easy to draw a graph
showing all of the versions of a file, like this one:

![](images/version-graph.png)

Version control systems assign an identifier to every version of every file, and 
track their relationships. They also allow branches in those versions, and merging
those branches back into the main line of work.  They also support having 
*multiple copies* on multiple computers for backup, and for collaboration.
And finally, they let you tag particular versions, such that it is easy to return 
to a set of files exactly as they were when you tagged them.  For example, the 
exact versions of data, code, and narrative that were used when a manuscript was originally 
submitted might be `eco-ms-1` in the graph above, and then when it was revised and resubmitted,
it was done with tag `eco-ms-2`.  A different paper was started and submitted with tag `dens-ms-1`, showing that you can be working on multiple manuscripts with closely related but not identical sets of code and data being used for each, and keep track of it all.
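
As a rough sketch of how tagging works in practice, the commands below build a throwaway repository, tag two versions of a file, and then show the file as it was at the first tag. The tag names come from the example above; the file name `analysis.txt`, its contents, and the author identity are placeholders made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"        # hypothetical identity
git config user.email "author@example.com"

# First version of the analysis, tagged when the manuscript is submitted.
echo "analysis version 1" > analysis.txt
git add analysis.txt
git commit -q -m "Added first version of analysis"
git tag eco-ms-1

# A later revision, tagged when the manuscript is resubmitted.
echo "analysis version 2" > analysis.txt
git add analysis.txt
git commit -q -m "Revised analysis for resubmission"
git tag eco-ms-2

# List the tags, then print the file exactly as it was at submission.
git tag
git show eco-ms-1:analysis.txt
```

The last command reads the old version straight from the history, so the working copy of `analysis.txt` stays at the newest version.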


## Version control and Collaboration using Git and GitHub

First, just what are `git` and GitHub?

- __git__: version control software used to track files in a folder (a repository)
    - git creates the versioned history of a repository
- __GitHub__: web site that allows users to store their git repositories and share them with others


### Getting started on GitHub

Go to https://github.com/ and sign up for an account. This is a good opportunity to create a professional presence in the bioinformatics and data science world.

### Creating a personal access token 

Starting this fall, GitHub is requiring personal access tokens, which will serve as your password later. To generate one, follow https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

### Cloning a repository

You can clone any public GitHub repository; sharing code openly is the whole purpose. To do this you must have git installed on your computer.  Fortunately for us, git is already installed on RStudio Cloud and Unity.


Let's start by cloning the course repository (https://github.com/jeffreyblanchard/EvoGeno2021Py). To do so, locate the green "CODE" button and copy the URL. Now go to RStudio Cloud or Unity, open a terminal, and type


In [None]:
git clone https://github.com/jeffreyblanchard/EvoGeno2021Py.git

If you type `ls` you will now see a new directory "EvoGeno2021Py" with all of the class materials. To update the directory, use the `git pull` command when you are inside the directory.

In [None]:
cd EvoGeno2021Py
git pull

**Note:** If you change the contents of the directory in any way, it will be out of sync with the course GitHub repo and you may get an error when you try to pull. More on this in a bit.
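
One way to check whether you have changed anything before pulling is `git status`. The sketch below uses a throwaway repository rather than the course repo; the file name `lab.txt` and the identity settings are placeholders.

```shell
# Build a small throwaway repository with one committed file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"       # hypothetical identity
git config user.email "student@example.com"
echo "original text" > lab.txt
git add lab.txt
git commit -q -m "Added lab file"

# No output here means the directory matches the last commit.
git status --short

# After an edit, the file is listed with an M (modified) marker.
echo "my own notes" >> lab.txt
git status --short
```

If a course file shows up as modified, keeping your own notes in separate files is one simple way to avoid the problem.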

### Create your own repository on GitHub 

Next create a repository on GitHub, then we'll edit some files.

- Log into [GitHub](https://github.com)
- Click on the Repositories tab
- Click the New Repository button
- Name it `genomics-course` or something similar
- Create a README.md
- Set the LICENSE to Apache 2.0

You've now created your first repository! It has a couple of files that GitHub created
for you, like the README.md file and the LICENSE file.

For simple changes to text files, you can make edits right in the GitHub web interface.  For example,
navigate to the `README.md` file in the file listing, and edit it by clicking on the *pencil* icon.
This is a regular Markdown file, so you can just add text, and when done, add a commit message, and 
hit the `Commit changes` button.  

Congratulations, you've now authored your first versioned commit.  If you navigate back to the GitHub page for the repository, you'll see your commit listed there, as well as the
rendered README.md file.

Now locate the green `CODE` button and copy the URL. Then go to RStudio Cloud or Unity and open a terminal. Type `git clone` followed by the URL of your repository, as you did above.

### Pushing and Pulling changes

Make a simple change to your README file directly in your GitHub repo.  To bring that change into your copy on RStudio Cloud or Unity, move into your repository directory there and type

In [None]:
git pull

**Note:** Once you have created a Git repository on RStudio Cloud, Unity or your laptop, I recommend NOT making changes on the GitHub site, as it is easy to get the repositories out of sync.

Now create a new Jupyter notebook file in your repo/directory on RStudio Cloud or Unity and put some simple text and code in the notebook. Save it as `test.ipynb`.

To make the notebook file, or any changes to it, visible on GitHub we have to "Push" the changes. This involves a series of steps: indicating which specific changes to the local working files should be staged for versioning (using the `git add` command), and then recording those changes as a version in the local repository (using the `git commit` command).

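The remaining concepts are involved in synchronizing the changes in your local repository with changes in a remote repository.  The `git push` command is used to send local changes up to the remote repository on GitHub, and the `git pull` command is used to bring changes from the remote repository down into your local repository.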

![](images/git-flowchart.png)


In your terminal type

In [None]:
git add test.ipynb

then

In [None]:
git commit

When you use the `git commit` command in RStudio Cloud or Unity, it will open the Unix text editor `vim` or `nano`, respectively. You will need to learn a few important commands:
    

- `vim` starts in command mode. To insert text, type `i` and add your commit message (e.g. Adding test file). To exit insert mode, hit `esc`. To save and exit `vim`, type `ZZ` (hold shift down for both Zs). For more info on `vim`, see https://www.vim.org/

- In `nano` you can start typing your commit message right away (e.g. Adding test file). Then use `ctrl x` to exit; when prompted, press `y` and then Enter to save the changes to the file. For more info on `nano`, see https://www.nano-editor.org/docs.php

Now you can `Push` the changes to your GitHub repo.

In [None]:
git push

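The first time you do this, git may ask you to add the email you used for GitHub and your username, following the suggested command line syntax. When you push, enter your GitHub username and use your personal access token as the password.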
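
The identity settings can be sketched as follows. This example stores them only in a throwaway repository so it is safe to run anywhere; the form git suggests adds `--global` after `git config`, which stores the values for every repository on that machine. The name and email shown are placeholders.

```shell
# Use a scratch repository so no real settings are changed.
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# Placeholder values; replace with your own name and GitHub email.
git config user.name "Your Name"
git config user.email "you@example.com"

# Print the stored values to confirm they were saved.
git config user.name
git config user.email
```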

### Deleting files in git

If you delete a file from your RStudio Cloud or Unity git repository, you still need to tell GitHub about it. The process is the same as before, shown here with a file named `example.txt`:

In [None]:
git add example.txt
git commit
git push
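
As an alternative, `git rm` deletes a file and stages the deletion in a single step. The sketch below demonstrates this in a throwaway repository with no remote, so the final `git push` is left out; in your own cloned repository you would push afterwards as usual. The identity settings are placeholders.

```shell
# Build a throwaway repository containing example.txt.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"       # hypothetical identity
git config user.email "student@example.com"
echo "temporary content" > example.txt
git add example.txt
git commit -q -m "Added example file"

# Delete the file and stage the deletion in one step, then commit.
git rm -q example.txt
git commit -q -m "Removed example file"

# The file is gone from the directory, but both commits remain in the history.
ls
git log --oneline
```

Because the first commit is still in the history, the deleted file can be recovered later if needed.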

### On good commit messages

Clearly, good documentation of what you've done is critical to making the version history of your repository meaningful and helpful.  It's tempting to skip the commit message altogether, or to add some stock blurb like 'Updates'.  It's better to use messages that will be helpful to your future self in deducing not just what you did, but why you did it.  Also, commit messages are best understood if they follow the active verb convention.  For example, my commit messages all start with a past tense verb, and then explain what was changed.

While some of the changes we illustrated here were simple and so easily explained in a short phrase, for more complex changes, it's best to provide a more complete message.  The convention, however, is to always have a short, terse first sentence, followed by a more verbose explanation of the details and rationale for the change. This keeps the high level details readable in the version log.  I can't count the number of times I've looked at the commit log from 2, 3, or 10 years prior and been so grateful for the diligence of my past self and collaborators.
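
A short first line followed by a longer explanation can be written without opening an editor by passing `-m` twice; git stores each `-m` as a separate paragraph. This is a minimal sketch in a throwaway repository, and the file name, message text, and identity are invented for illustration.

```shell
# Build a throwaway repository with one change to commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"       # hypothetical identity
git config user.email "student@example.com"
echo "window = 100" > settings.txt
git add settings.txt

# First -m is the terse summary; second -m is the detailed rationale.
git commit -q \
  -m "Added window size setting" \
  -m "The window size was hard coded in several places. Keeping it in one file makes it easier to rerun the analysis with a different value."

# The one-line log shows only the summary line.
git log --oneline

# The full message, including the explanation, is still available.
git log -1 --format=%B
```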

### GitHub web pages

You can enable GitHub Pages to create a web presence for your project. 

- Go to the repository you just created
- Click on Settings
- Scroll down to GitHub Pages
- Select the Master branch
- Click on Save (do not choose a theme today; you have the option of choosing a theme later)
- It will create a GitHub page (e.g. https://jeffreyblanchard.github.io/jeffblanchard/)
- Copy the link to your GitHub page
- Go to the main page for your repository (e.g. jeffblanchard)
- In the `About` section, add the URL of your GitHub page

Under the Settings tab, enable GitHub Pages.  It takes about 10 min for the web site to appear. The default web page is the README.md file, but if you create and upload an index.html page, this will be your new default. This provides a way to see HTML files in your browser as you intended them to appear (not just the HTML code).  

* Note: It is critical that you use a small `i` in `index.html` and not a capital `I`


### GitHub project management

You can keep track of ideas, todos and fixes by creating a wiki or using the Projects tab.

![Managing projects on GitHub](images/Project_Acidos.png)



### Collaboration and conflict free workflows (we will talk more about this later in the class)

Up to now, we have been focused on using Git and GitHub for yourself, which is a great use. But equally powerful is to share a GitHub repository with other researchers so that you can work on code, analyses, and models together.  When working together, you will need to pay careful attention to the state of the remote repository to avoid and handle merge conflicts.  A *merge conflict* occurs when two collaborators make two separate commits that change the same lines of the same file.  When this happens, git can't merge the changes together automatically, and will give you back an error asking you to resolve the conflict. Don't be afraid of merge conflicts; they are pretty easy to handle, and there are some 
[great](https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/) [guides](https://stackoverflow.com/questions/161813/how-to-resolve-merge-conflicts-in-git).

That said, it's truly painless if you can avoid merge conflicts in the first place. You can minimize conflicts by:

- Pulling down changes just before you commit
  + This ensures that you have the most recent changes
  + But you may have to fix your code if a conflict would have occurred
- Coordinating with your collaborators on who is touching which files
  + You still need to communicate to collaborate
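
To see what a merge conflict looks like without involving a real collaborator, the sketch below simulates two people on one machine. A local bare repository stands in for GitHub; the names Alice and Bob, the file `notes.txt`, and its contents are all invented for illustration. The `--no-rebase` flag simply asks `git pull` to merge, so the behavior does not depend on local settings.

```shell
# A local bare repository stands in for the shared GitHub repository.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"

# Collaborator A makes the first commit and pushes it.
git clone -q "$work/remote.git" "$work/alice" 2>/dev/null
cd "$work/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
echo "sample size: 10" > notes.txt
git add notes.txt
git commit -q -m "Added notes file"
git push -q -u origin HEAD

# Collaborator B clones the repository at this point.
git clone -q "$work/remote.git" "$work/bob"

# A changes the line and pushes.
echo "sample size: 20" > notes.txt
git commit -q -a -m "Increased sample size to 20"
git push -q

# B changes the same line without pulling first, commits, then pulls.
cd "$work/bob"
git config user.name "Bob"
git config user.email "bob@example.com"
echo "sample size: 30" > notes.txt
git commit -q -a -m "Increased sample size to 30"
git pull --no-rebase || echo "Pull stopped: resolve the conflict in notes.txt"

# The file now holds both versions between conflict markers.
cat notes.txt
```

To resolve the conflict, B would edit `notes.txt` to keep the intended line, delete the marker lines, and then run `git add notes.txt` followed by `git commit`.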

### More with git

There's a lot we haven't covered in this brief tutorial.  There are some good longer tutorials that cover additional topics:

- Git Guides - https://github.com/git-guides/git-push
- Git cheatsheet - https://github.com/git-guides/git-push
- Git Learning Lab - https://lab.github.com/
- [Happy Git and Github for the useR](https://happygitwithr.com/)
- [Try Git](https://try.github.io) a great interactive tutorial
- Software Carpentry [Version Control with Git](http://swcarpentry.github.io/git-novice/)


### Example Github repositories and pages

- Nick Reich - https://github.com/nickreich
- Noelle Beckman - http://seedscape.github.io/BeckmanLab/Beckman.html
- Women In Soil Ecology - https://womeninsoilecology.github.io/
- Frank Aylward - https://github.com/faylward

## Exercises

You only need to turn in the link to your GitHub repository. In it should be your new course repo and a test file.