# DSCI 521 - Computing Platforms for Data Science


## Lecture 2 - Getting groovy with Git and GitHub


### 2018-09-09

# Lecture learning goals

## By the end of the lecture, students should be able to:


1. Create a new repository on GitHub
2. View your Git history and check out an older version of a file
3. Deal with merge conflicts at the command line
4. Use `nbdime` to deal with merge conflicts involving Jupyter notebooks

# 1. Creating a Repository

We can create a repository in one of two ways:

1. Start on GitHub and **then** clone the repository to your local computer using Git.
2. Start in a folder on your local computer, use Git to initialize it as a Git repository (by typing `git init` inside the directory). **Then** create a new repository on GitHub without adding any files. **Then** use Git to tell your local computer where the remote is (i.e., the location of the empty GitHub repository on GitHub).

As you can see, the second method requires more work, so I usually use the first method.
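
For reference, method 2 can be sketched at the command line as follows. This is a minimal sketch, not a required workflow: the folder location and the remote URL (including `YOUR_USERNAME`) are placeholders, and the empty repository is assumed to already exist on GitHub.

```shell
# Sketch of method 2: start locally, then point Git at an empty GitHub repo.
# The folder and URL below are placeholders (assumptions), not real locations.
set -eu

project="$(mktemp -d)/exampleDataProject"
mkdir -p "$project"
cd "$project"

# Turn the folder into a Git repository
git init -q

# Tell Git where the remote is (the empty repository created on GitHub)
git remote add origin https://github.ubc.ca/YOUR_USERNAME/exampleDataProject.git

# Confirm the remote was recorded
git remote -v

# After your first commit you would upload it with `git push -u origin BRANCH`
# (not run here because it needs network access and a real repository).
```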

## Creating a repository from GitHub (method 1 above)

- Let's work through a demo together where we create and edit a repository for a fictional Data Science project we are going to create.

- Pair up in partners (to help each other out and for the collaboration exercise coming later in the lecture...).


### Steps to follow:

1. Go to https://github.ubc.ca and make sure you are logged in.

2. Click the green “New repository” button. Or, if you are on your own profile page, click on “Repositories”, then click the green “New” button.

3. Choose/set:
- Repository name: exampleDataProject (or whatever you wish)
- Public
- YES Initialize this repository with a README

4. Click the big green “Create repository” button.

5. That's it! You now have a new repository on GitHub!

## Adding & committing changes to version controlled files

There are two ways to make changes to your files:

1. Edit files directly on GitHub.
2. Make changes to files you have cloned locally to your computer, and then "push" the changes back to GitHub.
 
We have had some practice with method 2, so we will try out method 1 now.

Let's edit the `README.md` file (created when we initialized the repository) so that it contains some information about the fictional Data Science project we are going to create.

### Steps to follow: 

1. Click on the `README.md` file link

2. Click on the pen tool (right-hand side of document)

3. Add your name as the author to the document (e.g., "author: Tiffany Timbers")

4. Click on the big green button "Commit changes" to save your work (this is essentially `git add` + `git commit`)

5. Get this repo on your local computer by cloning it (`git clone` + remote URL)
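
For comparison, here is roughly what the local equivalent of steps 3 and 4 looks like. It is a minimal sketch that uses a throwaway repository in a temporary folder instead of a real clone; the author name, email, and commit messages are placeholders.

```shell
# Sketch: the command-line equivalent of editing a file and clicking
# "Commit changes". Uses a throwaway repository; all names are placeholders.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# Stand-in for the README that GitHub creates for you
echo "# exampleDataProject" > README.md
git add README.md
git commit -q -m "Initial commit"

# Edit the file, then stage and commit the change
echo "author: Tiffany Timbers" >> README.md
git status --short                  # README.md shows up as modified
git add README.md                   # stage the change
git commit -q -m "Add author to README"

git log --oneline                   # two commits so far
# In a real clone, `git push` would now send the commit to GitHub.
```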

### Class discussion point:

Why and when would you use the two different ways of adding & committing changes to version controlled files?

# 2. Viewing your Git history and checking out an older version of the file

Now we have a project, but only 2 commits (the initial creation and adding ourselves as the author). Let's edit our `README.md` over several more commits to generate a history that we can view and experiment with.

### Steps to follow:

1. Use the pen tool to change the README header from the default (repo name) to a more proper English title for the project. Click on the big green button "Commit changes" to save your work.

2. Use the pen tool to add today's date as the project start date to the README. Click on the big green button "Commit changes" to save your work.

3. Use the pen tool to add a list of dependencies (software that will be required to run the project, e.g., Python) to the README. Click on the big green button "Commit changes" to save your work.

4. "Accidentally" delete the list of dependencies you just created. Click on the big green button "Commit changes" to save your work.

5. Bring all these changes down to your local computer by typing `git pull` from inside the cloned repo on your laptop using Git Bash/terminal (hint: open the file locally to check that it looks as expected).

## Viewing the history of a project

There are two ways you can view the Git history of a project:

1. On GitHub through the repo's code commit view
2. On your local machine using `git log`

Arguably, the best and easiest place to view the Git history of a project is on GitHub, so let's start there. We'll explore both, however, because the history on your local machine can sometimes differ from the one on GitHub, and that is when you need to look at both.

Let's view the Git history of our project on GitHub and locally.

### Steps to follow: 

1. On GitHub, on the repo's landing page click "$N$ commit" link (where $N$ is the number of commits made on the repo, yours should be around 6). 
2. On your laptop, from inside your Git repository type `git log` at the command line
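
Plain `git log` can be verbose, so a few common variations are sketched below. The example builds a throwaway repository with three placeholder commits so it can be run anywhere; in your own clone you would only need the `git log` lines.

```shell
# Sketch: common ways to view history with `git log`.
# The repository, identity, and commit messages are placeholders.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# Create a small history to look at
for msg in "Add title" "Add start date" "Add dependencies"; do
  echo "$msg" >> README.md
  git add README.md
  git commit -q -m "$msg"
done

git log --oneline                 # one line per commit: Short SHA-1 + message
git log -n 1                      # full details of the most recent commit only
git log --oneline -- README.md    # history restricted to a single file
```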

### Class discussion point:
How similar are these two views? Do you get the same information from both? Which seems easier to read/navigate?

## Comparing files from different commits

In my humble opinion (IMHO), [GitHub's Compare view](https://help.github.com/articles/comparing-commits-across-time/) is the best and easiest way to compare files from different commits.

To use it, you need to:

1. Get the [Short SHA-1](https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection) of the commits you want to compare (from either method of viewing the history).

2. Go to https://github.ubc.ca/YOUR_USERNAME/YOUR_REPO_NAME/compare and enter the Short SHA-1s in the base and compare drop-down menus.

Note: another way to do this is using `git diff`, which you can learn about [here](https://www.atlassian.com/git/tutorials/saving-changes/git-diff).
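
As a rough illustration of the `git diff` route, the sketch below compares one file between two commits in a throwaway repository. The file contents and commit messages are placeholders; `old` and `new` are just shell variables holding the two Short SHA-1s.

```shell
# Sketch: compare a file between two commits with `git diff`.
# Repository contents and identity are placeholders.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "# Example Data Project" > README.md
git add README.md
git commit -q -m "Add title"
old=$(git rev-parse --short HEAD)   # Short SHA-1 of the earlier commit

echo "Dependencies: Python" >> README.md
git commit -q -am "Add dependencies"
new=$(git rev-parse --short HEAD)   # Short SHA-1 of the later commit

# Lines starting with + were added between the two commits; - means removed
git diff "$old" "$new" -- README.md
```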

## Checking out an older version of the file

Oh no! We realize by viewing the history that we made a mistake! We didn't mean to delete our list of dependencies. Worry not! We can take advantage of the fact that we have been tracking this file under version control by using `git checkout` to retrieve an older version of the file to replace the broken version we now have.

Let's check out the version of the file from BEFORE we deleted the software dependency list.

### Steps to follow: 

1. Look at the history to see which version of the file we want to go back to and get its Short SHA-1 (we'll need this to retrieve the file).

2. Then we use `git checkout` in the command line to grab it: `git checkout SHORT_SHA-1 FILENAME`

3. After recovering the file, check `git status`. You will see that the restored version is already staged (running `git add` again does no harm), so you need to `git commit` to log this file reversion.

4. Don't forget `git push` to get the file backed up on GitHub!
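
The recovery steps above can be rehearsed end to end in a throwaway repository. This is a minimal sketch: the file contents, identity, and commit messages are placeholders, and the final `git push` is left out because there is no remote here.

```shell
# Sketch: recover an older version of a file with `git checkout SHA FILE`.
# Everything here lives in a temporary folder; names are placeholders.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

printf '# Example Data Project\nDependencies: Python\n' > README.md
git add README.md
git commit -q -m "Add dependencies"
good=$(git rev-parse --short HEAD)   # the version we will want back

# "Accidentally" delete the dependency list and commit the mistake
echo "# Example Data Project" > README.md
git commit -q -am "Accidentally delete dependencies"

# Retrieve the file as it was in the good commit
git checkout "$good" -- README.md
git status --short                   # restored file is listed as staged

# Record the reversion as a new commit
git commit -q -m "Restore dependency list"
cat README.md
```

Note that the mistaken commit stays in the history; the fix is recorded as a new commit on top of it.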

# 3. Deal with merge conflicts at the command line

When working with version control, changes usually happen in more than one place (e.g., on your laptop and on GitHub), so the same document will sometimes be changed in different places. There are two types of changes you need to know about (and how Git deals with each):

1. Changes to a document where different lines are modified (Git can automatically merge these).

2. Changes to a document where the same line(s) are modified (Git CANNOT automatically merge these).


In case \#2, you (or some other human) have to deal with the conflict. Git kindly points you to where the problem is, and then will do no further work for you until you deal with the conflict.



## How do you know you have a merge conflict?

If you do `git push` and you see something like:

```
To https://github.com/vlad/planets.git
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'https://github.com/vlad/planets.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Merge the remote changes (e.g. 'git pull')
hint: before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
```
**and** then you do a `git pull` and see something like this:

```
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 3 (delta 1)
Unpacking objects: 100% (3/3), done.
From https://github.com/vlad/planets
 * branch            master     -> FETCH_HEAD
Auto-merging mars.txt
CONFLICT (content): Merge conflict in mars.txt
Automatic merge failed; fix conflicts and then commit the result.

```

You have a merge conflict.


## What do you do to fix a merge conflict?

1. Pull the changes from GitHub
2. Open the file that has a conflict (the output of `git pull` will tell you which files) in a plain text editor (e.g., Atom)
3. Look for the conflict (hint: search for `<<<<<<< HEAD`)
4. Fix the conflict and save the file
5. `git add` and `git commit` your changes, and then `git push` them up to GitHub

## How do you find the conflicts in a file?

Here's an example of a text file with a conflict:

```
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
<<<<<<< HEAD
We added this line in our last commit
=======
This line was added somewhere else
>>>>>>> dabb4c8c450e8475aee9b14b4383acc99f42af1d
```

- `<<<<<<< HEAD` precedes the change you made (that you couldn't push)
- `=======` is a separator between the conflicting changes 
- `>>>>>>> dabb4c8c450e8475aee9b14b4383acc99f42af1d` flags the end of the conflicting change you pulled from GitHub

## How do you fix the conflicts in a file?

- Edit this file to remove these markers and reconcile the changes
- We can do anything we want: 
    - keep the change made in the local repository, 
    - keep the change made in the remote repository, 
    - write something new to replace both, 
    - or get rid of the change entirely. 
    


If we chose to write something new to replace both, it would look like this:

```
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
We removed the conflict on this line
```

You then need to save, `git add`, `git commit` and `git push` the file to have these changes reflected on GitHub.
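
The whole cycle (rejected push, conflicting pull, manual fix, commit, push) can be rehearsed on one machine. The sketch below uses a local bare repository as a stand-in for GitHub and two clones as stand-ins for two collaborators; all paths, names, and messages are placeholders. The `--no-rebase` flag is an assumption added so that recent Git versions merge on `git pull` instead of asking how to reconcile the branches.

```shell
# Sketch: stage and resolve a merge conflict entirely on one machine.
# A bare repository plays the role of GitHub; names are placeholders.
set -eu

work=$(mktemp -d)
git init -q --bare "$work/remote.git"

# First collaborator: create the file and push it
git clone -q "$work/remote.git" "$work/owner"
cd "$work/owner"
git config user.name "Owner"
git config user.email "owner@example.com"
echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars"
git push -q origin HEAD
branch=$(git rev-parse --abbrev-ref HEAD)

# Second collaborator: add a line and push it
git clone -q "$work/remote.git" "$work/collab"
cd "$work/collab"
git config user.name "Collaborator"
git config user.email "collab@example.com"
echo "This line was added somewhere else" >> mars.txt
git commit -q -am "Add a line from the collaborator"
git push -q origin HEAD

# First collaborator edits the same place without pulling first
cd "$work/owner"
echo "We added this line in our last commit" >> mars.txt
git commit -q -am "Add a line from the owner"
if ! git push -q origin HEAD; then
  echo "Push rejected: the remote has commits we do not have yet"
fi

# Pulling now reports the conflict (non-zero exit status, hence `|| true`)
git pull --no-rebase origin "$branch" || true
cat mars.txt                         # shows the <<<<<<< ======= >>>>>>> markers

# Resolve by writing the reconciled content, then add, commit, and push
printf '%s\n' "Cold and dry, but everything is my favorite color" \
              "We removed the conflict on this line" > mars.txt
git add mars.txt
git commit -q -m "Resolve merge conflict in mars.txt"
git push -q origin HEAD
```

In a real project you would edit the file by hand in a text editor rather than overwrite it with `printf`; that shortcut is only there to keep the sketch non-interactive.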

## What about merge conflicts in Jupyter notebooks?

## First - a bit about what a Jupyter notebook is made up of

- `.ipynb` files are "plain" text files, and we can view them in a plain text editor and make some sense of them
- The contents of the notebook are encoded in JSON
- When we run the notebook via `jupyter notebook`, the kernel is the part that can interpret and run the code


For example, this notebook of 2 cells:

![alt tag](imgs/sample_notebook.png)

is encoded by the following JSON:
```
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A Markdown header cell\n",
    "\n",
    "Below is a simple example of some code in Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "15\n"
     ]
    }
   ],
   "source": [
    "x = 5\n",
    "y = 10\n",
    "\n",
    "print(x + y)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
```

## Back to version control and Jupyter notebook

Because the notebooks are stored as plain text, we can track them with version control, but this is not without issues, which include:

- `git diff` looks horrendous because of the JSON
- manually fixing conflicts is arduous because of the JSON
    
Strategies to help you not end up in conflict hell with Jupyter:

1. Always `git pull` before you start doing ANY work!
2. Clear output of your notebooks before you `git push` (although we need your output for MDS homework... so it depends here).

But there is hope that things are better (or will get better)! [`nbdime`](https://nbdime.readthedocs.io/en/stable/) is a project that helps solve these problems. You can try test-driving it; however, I have yet to have success with it.

## Practicing conflict resolution in pairs:

Designate one partner as the Data Science repository "Owner" and one partner as the repository "Collaborator". The repository "Owner" needs to grant the Collaborator access.

##### Owner: 
* On GitHub, click the settings button on the right. 
* Select Collaborators (top left), and enter your Collaborator's username.

##### Collaborator: 
* Go to your email to retrieve the `URL` to connect to the Owner's repository.
* Clone your partner's repo:

~~~ 
$ git clone URL_from_email
~~~

`git clone` creates a fresh local copy of a remote repository.

### Stage a conflict:

Both partners modify the same line of the same file and try to send the changes to GitHub. One of you should get a conflict. Work together to follow what was learned in lecture to resolve it.

## Attribution 
1. [Happy Git and GitHub for the useR by Jenny Bryan and the STAT 545 TAs](http://happygitwithr.com/)
2. [Software Carpentry](https://software-carpentry.org/), specifically the Git lessons