# git & GitHub: Introduction for Scientists

# git, GitHub, etc... Why do we care?

**Replication** and **reproducibility** are two cornerstones of the scientific method. With respect to data analysis (and scientific computing in general!), these concepts have the following practical implications:

* **Replication**: An author of a scientific paper that involves some data analysis should be able to rerun the analysis code and replicate the results upon request. Other scientists should be able to perform the same analysis and obtain the same results, given the information about the methods used in a publication.

* **Reproducibility**: The results obtained by analyzing the data should be reproducible with an independent implementation of the method, or using a different method altogether.


In summary: A sound scientific result should be reproducible, and a sound scientific study should be replicable.

To achieve these goals, we need to:

* Record *exactly* which source code, and which version of it, was used to produce the data and figures in published papers.

* Record which versions of external software were used. Keep access to the environment that was used.

* Make sure that old codes and notes are backed up and kept for future reference.

* Ideally, codes should be published online, to make it easier for other interested scientists to access them.

# What is git?

* A tool that efficiently __saves snapshots__ of a set of files (what their contents were at some point in time).
* A tool that __tracks relationships__ between those snapshots (which snapshot preceded which, etc.).
* A tool that can efficiently __merge__ different snapshots, and that flags conflicting changes so you can resolve them.
* A tool that makes it possible to __share__ these snapshots with others, enabling __collaboration__ (and adding a degree of reproducibility).

# What is GitHub?

* It’s a company that will __host copies of your git repositories__.
* It’s a [website](https://github.com) that provides a nice way to __browse, view, and sometimes edit__ the contents of those repositories.
* It’s a company that will let you __share__ them with others, and make them __discoverable__.
* It’s a website that makes __collaboration__ on software (incl. research, analysis, etc.) projects __easy__. E.g.:
  * Report and discuss issues.
  * Review, discuss, accept proposed changes.


Note: Git and GitHub are two different things -- one is a tool, the other is a service built around it (and there are alternatives). You can use git without GitHub; it's just that GitHub makes some things easier.

# git+GitHub: introduction by example

# GitHub Student Developer Pack

<center>[https://education.github.com/pack](https://education.github.com/pack)</center>
<img src="images/gh-pack.jpg" width="600">

## Note
	
`git` repositories are also excellent for version controlling manuscripts, figures, thesis files, data files, lab logs, etc. Basically for any digital content that must be preserved and is frequently updated.

They are also excellent collaboration tools!

## This is how we will practice using git and GitHub

### Forking and working on someone else's repository

Let's fork the completely useless repository:

https://github.com/connolly/A302_W19_a_useless_repo

```
click on "fork" at the top right of the page
```

This creates a copy of the repository under your own GitHub account.

Now we want to "clone" this repository to our own machine:

```git clone https://github.com/YOURNAME/A302_W19_a_useless_repo.git```

We will work on this repository using GitHub and git.

### What if Andy changes the repository

##### Collaboration with GitHub

`git` has a very different collaboration philosophy compared to (e.g.) Google
Docs.  In Google Docs, you edit in real time, seeing your and everyone
else's changes as they happen.  With `git`, you make a copy of the files
(a `clone` of the repository), make the edits on the copy, and use `git` 
(and/or GitHub) to merge them with the original. To signal that a set
of changes is ready to be merged into the original, you open a [Pull Request](https://help.github.com/articles/about-pull-requests/).

I will add a name to the file "names.md"

```
emacs names.md
```

We can check the status of the files in the directory:
```
git status
```

We can also check on what has happened to the directory over time

```
git log
```

Now we want to commit a change to our local repository and push that change to the repository owned by Connolly:

```
git commit names.md -m "add a name"  # -m supplies the commit message (a message is required)
git pull  # make sure our version is up to date
git push 
```

The Connolly GitHub repository now shows an updated "names.md" file. Your fork should show that your version is one "commit" behind.

We can update our GitHub version by issuing a "pull request":

```
connolly:master is up to date with all commits from connolly2:master. Try switching the base for your comparison
```
This means we haven't added anything to the repository, so there is nothing to push to the Connolly repository. If we
"switch the base", we see that Connolly is ahead, and we can:

```
Create a pull request
Merge a pull request 
Confirm a pull request
```
Now we are all up to date
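
The same update can also be done from the command line, by fetching from the original repository and then pushing the result to your fork. The sketch below is only an illustration under stated assumptions: it uses local directories as stand-ins for the GitHub repositories (`original` for Connolly's repository, `myfork.git` for your fork on GitHub, `local` for your clone), and the remote name `upstream` is a common convention rather than something GitHub requires.

```shell
# Local stand-ins for the GitHub repositories (all directory names are hypothetical).
work=$(mktemp -d) && cd "$work"

# "original" plays the role of Connolly's repository.
git init -q original
git -C original symbolic-ref HEAD refs/heads/master   # name the first branch "master"
git -C original config user.name "Repo Owner"
git -C original config user.email "owner@example.com"
echo "Arthur" > original/names.md
git -C original add names.md
git -C original commit -q -m "start the list"

# "myfork.git" plays the role of your fork on GitHub, "local" of your clone of it.
git clone -q --bare original myfork.git
git clone -q myfork.git local

# The owner adds a commit after you forked.
echo "Simon" >> original/names.md
git -C original commit -q -a -m "add a name"

# Bring your clone, and then your fork, up to date.
cd local
git remote add upstream "$work/original"
git fetch -q upstream
git log --oneline master..upstream/master   # the commit(s) you are missing
git merge -q upstream/master                # fast-forward your local master
git push -q origin master                   # update the fork
```

With real repositories, the path given to `git remote add upstream` would be the URL of the repository you forked.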


### What if I change the repository


You should add a name to the file "names.md" (put it in the middle of the list)

```
emacs names.md
```

We can check the status of the files in the directory:
```
git status
```

We can also check on what has happened to the directory over time

```
git log
```

Now we want to commit a change to our local repository, push that change to our remote repository, and issue a pull request to merge it with the one owned by Connolly:

```
git commit names.md -m "add my name"  # -m supplies the commit message (a message is required)
git pull  # make sure our version is up to date
git push 
```

Our GitHub repository now shows an updated "names.md" file.

We can update the Connolly GitHub version by issuing a "pull request", including a comment if we like.

Oh no, there is a merge conflict!

```
<<<<<<< master
Arthur
=======
Simon
>>>>>>> master
```

`<<<<<<<` to `=======` is the change we have locally

`=======` to `>>>>>>>` are the lines on the remote repository

We need to edit the file and "resolve" the conflict.

We can resolve easy ones in the GitHub client, but let's do it with git.
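
To see conflict markers appear without involving GitHub, the following sketch provokes a conflict in a scratch repository and then resolves it. It is an illustration only: the directory name, the branch name `theirs`, and the names "Ada" and "Grace" in the list are placeholders, and the exact marker labels depend on the branch names involved.

```shell
# Scratch repository in a temporary directory (names are placeholders).
work=$(mktemp -d) && cd "$work"
git init -q conflict-demo && cd conflict-demo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master"
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'Ada\nGrace\n' > names.md
git add names.md
git commit -q -m "start the list"

# One branch adds Simon in the middle of the list...
git checkout -q -b theirs
printf 'Ada\nSimon\nGrace\n' > names.md
git commit -q -a -m "add Simon"

# ...while master adds Arthur in the same place.
git checkout -q master
printf 'Ada\nArthur\nGrace\n' > names.md
git commit -q -a -m "add Arthur"

# The merge stops and writes conflict markers into names.md.
git merge theirs || echo "merge stopped: conflict in names.md"
cat names.md

# Resolve by writing the version we want (here: keep both names), then commit.
printf 'Ada\nArthur\nSimon\nGrace\n' > names.md
git add names.md
git commit -q -m "merge: keep both Arthur and Simon"
```

Resolving a conflict always comes down to the same three steps: edit the file until the markers are gone, `git add` it, and commit.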


First we need to update our local repo to what the remote repo looks like

```
git pull https://github.com/connolly/A302_W19_a_useless_repo master
```

This pulls from the Connolly repo into my `master` branch.

``` 
error: Your local changes to the following files would be overwritten by merge:
	names.md
```

```
# edit names.md to resolve the conflict, then commit the result:

git commit -a -m "merge"
```

Once you push this commit, your pull request is updated and can be merged.

## Let's add your own directory and issue a pull request to 
https://github.com/connolly/A302_2019_Homework

## Let's take a look at homework 2
https://github.com/connolly/A302_2019/blob/master/homework/Homework%202.ipynb

## Follow the instructions to create your own repo on the command line

### Creating my own repo

Let's create a directory and two files in it:

```
mkdir "astr302-$USER"
cd "astr302-$USER"
echo "# My first ASTR 302 git repository" > README.md
echo "A file I do not wish to track" > AnotherFile.txt
```

I'll now create a git _repository_ where git will store snapshots of these files, using the `git init` command:

```
git init
```

Now git will be able to "track" (store snapshots of) files in your current directory. You have to run `git init` only once for that directory.

Running `git init` created a directory named `.git`, where the snapshots will be stored:

```
ls -l .git
```

If you want to send someone all of your snapshots, it's as easy as copying this directory (but there's a better way! more later!)

Some more bookkeeping before we begin. We need to tell git what our name and preferred e-mail are:

```
git config --global user.email "your@email"
git config --global user.name "your name"
```

Now we can begin! Git doesn't automatically snapshot the contents of everything in the directory -- you have to tell it which files you want it to snapshot. We do this using the `git add` command. For example:

```
git add README.md
```
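
One way to check what `git add` did is `git status --short`, which prints one line per file with a two-character status code. The sketch below repeats the steps so far in a scratch directory (a temporary directory stands in for `astr302-$USER`).

```shell
# Repeat the setup in a temporary directory (a stand-in for astr302-$USER).
work=$(mktemp -d) && cd "$work"
git init -q
echo "# My first ASTR 302 git repository" > README.md
echo "A file I do not wish to track" > AnotherFile.txt
git add README.md

# "A " marks a file staged for the next commit; "??" marks an untracked file.
git status --short
```

Here `README.md` is listed with `A` because it will be part of the next snapshot, while `AnotherFile.txt` is listed with `??` because git has not been told about it.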

Note that at this point, we haven’t taken a snapshot yet. To do that, we use the `git commit` command:

```
git commit
```

We've just created a snapshot, which is referred to as a _commit_ in git. Running `git commit` opens an editor in which you can write your _commit message_. This is meant to be a short description of the set of files (or changes to files; we’ll show that in a minute). It should remind you (or tell others with whom you share this code) what this snapshot is about.

A convention for commit messages is to start with a short (\<70 characters) one-line description of the reason for the commit, followed by a blank line, followed by a longer paragraph (if needed). See [here](https://github.com/erlang/otp/wiki/writing-good-commit-messages) for an example of how to write useful commit messages.
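
If you prefer not to open an editor, the same convention can be followed with the `-m` option, which may be given more than once; git joins the pieces as separate paragraphs. The sketch below runs in a scratch repository, and the message text and identity are made-up examples.

```shell
# Scratch repository in a temporary directory; the identity is a placeholder.
work=$(mktemp -d) && cd "$work"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# My first ASTR 302 git repository" > README.md
git add README.md

# Each -m adds one paragraph: the first is the summary line, the second the body.
git commit -q -m "Add README describing the repository" \
              -m "The README states what this repository is for, so that anyone who clones it knows where to start."

git log -1 --format=%s   # prints only the summary line
git log -1               # prints the full message
```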

At this point, git has stored a single commit:

```
git log
```

We can also check the status of the files in the directory:
```
git status
```

Let’s now edit the README.md and check the status again:
```
echo "">> README.md
echo "We’re practicing git here." >> README.md

git status
```

Committing all these changes:
```
git commit -a

git log
```
The `-a` argument to `git commit` tells git to commit all tracked files that have changed. Alternatively, you could have run `git add` for every file individually, before running `git commit`. Note that `git add` serves two somewhat different purposes: it tells git which files to track, but it also tells git which files to include in the next commit (as an aside: these can be viewed as being one and the same...).

To see the differences from the previous version:
```
git diff HEAD^
```

Git keeps a history of all commits. The most recent commit on the branch you have checked out is called `HEAD`. Appending a `^` means “the commit one before”, so `HEAD^` is the parent of `HEAD`.
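
The notation extends further back: `HEAD~2` means "two commits before `HEAD`", and `HEAD~1` is the same commit as `HEAD^`. The sketch below builds a scratch repository with three small commits (file contents and messages are made up) and compares different points in its history.

```shell
# Scratch repository with three commits; the identity is a placeholder.
work=$(mktemp -d) && cd "$work"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
for n in one two three; do
    echo "line $n" >> README.md
    git add README.md
    git commit -q -m "add line $n"
done

git log --oneline       # three commits, newest first
git diff HEAD^ HEAD     # what the latest commit changed: adds "line three"
git diff HEAD~2 HEAD    # what the last two commits changed together
git show --stat HEAD^   # summary of the commit just before HEAD
```

Giving `git diff` a single commit, as in `git diff HEAD^`, compares that commit with the files currently in your directory instead of with another commit.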

Let’s make more changes:
```
git add AnotherFile.txt
git commit
```

And a prettier log may be shown with:

```
git log --oneline
```

### GitHub

How do we effectively share our repositories? This is where GitHub comes in.

First, let’s make sure you all have accounts on GitHub.

Then, let us create a new repository on GitHub (named `astr302-USER`, where `USER` is
your _local_ username) and then:

```
git remote add origin git@github.com:your-github-account/astr302-USER.git
git push --all -u
```

Replace the URL (the web address) in the first command with the one for your repository. Note that GitHub will give you an address starting with `https://` by default; unless you've set up [SSH keys](https://help.github.com/articles/connecting-to-github-with-ssh/), use that one.
The second command _copies_ all the commits stored on your local hard drive repository (the `.git` directory) to GitHub. Now refresh the page on GitHub and explore what GitHub has to offer!

You can consider the copy on GitHub as your 'backup copy'. Even if you lose your entire computer, the commits of your files are safe and sound on GitHub.

As you make new changes and new commits, you should occasionally push them _upstream_:
```
git push
```
This will push the commits from the current branch to the GitHub copy. Make it a habit to do this at least once a day.
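
To check whether you have commits that are not yet on GitHub, you can compare your branch with its remote counterpart. The sketch below is an illustration under stated assumptions: a local bare repository (`hub.git`, a hypothetical name) stands in for the GitHub copy, and the directory name and identity are placeholders.

```shell
# "hub.git" is a local stand-in for the copy hosted on GitHub.
work=$(mktemp -d) && cd "$work"
git init -q --bare hub.git
git init -q astr302-demo && cd astr302-demo
git symbolic-ref HEAD refs/heads/master   # name the first branch "master"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "first commit"
git remote add origin "$work/hub.git"
git push -q -u origin master

# Make a new commit locally, then see what has not been pushed yet.
echo "more" >> README.md
git commit -q -a -m "second commit"
git status --short --branch             # reports that master is ahead of origin/master by 1
git log --oneline origin/master..HEAD   # lists the unpushed commit

git push -q
git log --oneline origin/master..HEAD   # prints nothing: everything is pushed
```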



### Branches

Imagine you have your program/analysis working reasonably well. Now you want to make some changes, but are worried you’ll mess up your code if they don’t work out. One way to do it is to just copy the whole directory into a backup folder.

But git provides a facility — branches — to do it much more effectively. Creating a branch is like creating a parallel universe: what happens there won’t affect what’s in the “main” universe (called the _master branch_). You're free to experiment with your changes on a different branch, and then merge it back into _master_ once you're happy with the results.

Let's create and check out a branch named `new-feature`.

```
git branch

git branch new-feature
git branch

git checkout new-feature
git branch

# A shorthand for create-and-checkout-a-branch is:
# git checkout -b new-feature
```

Let’s make some edits:
```
echo "New edits" >> README.md
git status
git commit -a
```

See the history:
```
git log --oneline
git log --oneline --decorate
```
Note how individual commits are identified by their _commit id_, a 40-character unique checksum such as 9e130f7b1ff3bfdcf23702ee18da5b7cdc8b6ef4 (a so-called SHA1 hash).
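
The short ids printed by `git log --oneline` are abbreviations of the full commit id, and git accepts either form as long as the abbreviation is unambiguous. A small sketch in a scratch repository (file contents and identity are placeholders):

```shell
# Scratch repository with a single commit; the identity is a placeholder.
work=$(mktemp -d) && cd "$work"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "first commit"

full=$(git rev-parse HEAD)            # full commit id
short=$(git rev-parse --short HEAD)   # abbreviated id, as in `git log --oneline`
echo "full:  $full"
echo "short: $short"

# Either form identifies the same commit.
git show --stat "$short"
```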

We can switch back to the `master` branch
```
git checkout master
cat README.md
```

Note how changes made on the `new-feature` branch are not visible here (as expected).

We can go ahead and also make some changes on master:
```
echo "Another tracked file" > AnotherFile.txt
git commit -a
```

Let's look at what all this looks like:
```
git log --oneline --decorate
git log --oneline --decorate --graph --all
```

Now we can bring the two sets of changes together: “merge” the two branches.

```
git merge new-feature
git log --oneline --decorate --graph --all
```

This “branch & merge” technique is a powerful way to incrementally develop analyses or codes.
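
For reference, the whole branch-and-merge cycle can be run end to end in a scratch repository, as sketched below. The file contents, commit messages, and identity are placeholders; the last two commands (checking which branches are merged and deleting the finished branch) are an optional clean-up step that goes slightly beyond the walkthrough above.

```shell
# Scratch repository in a temporary directory; the identity is a placeholder.
work=$(mktemp -d) && cd "$work"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch "master"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# demo" > README.md
echo "notes" > AnotherFile.txt
git add README.md AnotherFile.txt
git commit -q -m "first commit"

# Experiment on a branch...
git checkout -q -b new-feature
echo "New edits" >> README.md
git commit -q -a -m "edit README on new-feature"

# ...while master changes a different file.
git checkout -q master
echo "Another tracked file" > AnotherFile.txt
git commit -q -a -m "edit AnotherFile on master"

# --no-edit accepts the default merge message instead of opening an editor.
git merge -q --no-edit new-feature
git log --oneline --decorate --graph --all

git branch --merged         # lists new-feature: it is fully contained in master
git branch -d new-feature   # removes the branch name; its commits stay in the history
```

Because the two branches changed different files, git can combine them without any conflict.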

### Finding out more

 * [Google](http://google.com)
 * [YouTube](http://youtube.com)
   * [Scott Chacon on git](https://www.youtube.com/watch?v=ZDR433b0HJY)
   
   
 * [gitref.org](http://gitref.org/index.html)
 * [LSST's page on git](https://confluence.lsstcorp.org/display/LDMDG/Using+Git+for+LSST+Development)


 * [git website](http://git-scm.com)
 * [github.com](http://github.com)


 * [Reproducible Research in Computational Science](http://dx.doi.org/10.1126/science.1213847), Roger D. Peng, Science 334, 1226 (2011).
 * [Shining Light into Black Boxes](http://dx.doi.org/10.1126/science.1218263), A. Morin et al., Science 336, 159-160 (2012).
 * [The case for open computer programs](http://dx.doi.org/10.1038/nature10836), D.C. Ince, Nature 482, 485 (2012).