# Introduction to git for Data Scientists

**Part I:**
* What problem does version control software try to solve?
* How to create a git repository (init, status)
* The four main stages of a file in a repository: untracked, tracked, staged and committed (ignore, add, commit, rm)

**Part II:**
* How to track changes of a file in a repository (diff, log)
* How to maintain different versions of a file without going crazy or using N suffixes (branch, checkout)
* Time-travel with git (revert, checkout, reset)

**Part III:**
* How to work with friends on the same file... and remain friends (merge)
* Working with remote repositories (clone, fetch, pull, push, pull requests)
* Contributing flows (forks, feature branches)

## Part I

* How to create a git repository (init, status)
* The four main stages of a file in a repository: untracked, tracked, staged and committed (ignore, add, commit, rm)

**Warning**: This tutorial is meant to be executed ONCE and in the given order. Each step has side effects, so running a cell will likely change the output of cells you ran earlier. If you need to re-run earlier cells, re-run all cells starting from the very first one below (cf. Creating a new repository).

## Creating a new repository
We will start by creating a git repository in a new subfolder of the folder where this notebook lives.

In [None]:
%%bash 
rm -Rf my_first_repo
mkdir my_first_repo
git init my_first_repo

From now on, we have all the necessary components for git to start tracking changes in this repository. But what has the command exactly done? Let's check out the .git folder directory structure:

In [None]:
%%bash 
ls -lai my_first_repo/.git

As you can see the `git init` command has created a bunch of files and directories inside the `.git` folder. These are the files that store all the information necessary to track and navigate through changes of the files in your repository. If you are curious (or even better when you understand the basics well), you can read more about each component's role in the [Pro git book (available for free online)](https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain#ch10-git-internals).
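To get a feel for what lives inside `.git` without touching `my_first_repo`, here is a minimal sketch that initializes a throwaway repository in a temporary folder and peeks at three of its components. The `demo` and `head_ref` names are made up for this illustration, and the sketch assumes git's default on-disk layout.

```shell
set -eu
demo=$(mktemp -d)          # throwaway folder, unrelated to my_first_repo
git init -q "$demo"

# HEAD is a small text file naming the branch we are currently on
head_ref=$(cat "$demo/.git/HEAD")
echo "HEAD contains: $head_ref"

# objects/ stores file contents and commits; refs/ stores branch and tag pointers
ls -d "$demo/.git/objects" "$demo/.git/refs"
```

Nothing here is specific to this tutorial: every repository created with `git init` starts out with the same skeleton.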

We will now start introducing changes (that is, files) into our repository. Do we have anything in the repository yet? 

In [None]:
%%bash 
cd my_first_repo
git status

The short answer is no: there's nothing in the repository yet. Our new folder is still empty and, in any case, `git` only does what we tell it to do: until we ask it to track a file, it will merely list that file as "untracked".

As a side note, git is telling us that we are working on our "master" branch, which is the default name for the default branch of a new repository (some git setups are configured to call it `main` instead). We'll talk about branches later; for now, just remember that `git status` is the command that tells us which files are "untracked".

## The first 2 out of 4 stages of a file in a repository

As we saw in the previous section, when we create a new repository, git will start noticing files that exist in the same folder but are not yet tracked by
git. This suggests that there are at least two stages in the lifetime of a file living in the same folder as a git repository:
* untracked
* tracked

For example, suppose that we start working on a new file `test.py` containing code we want to start tracking under git:

In [None]:
%%bash
echo "print('Complicated Python pipeline')" > my_first_repo/test.py

This new file (`test.py`) exists now in the repository but is not yet being tracked by git:

In [None]:
%%bash
cd my_first_repo
git status

Git tells us that our new file `test.py` is **untracked**.

If we want to start tracking this new file within our repository, 
we just need to `add` it for the first time:

In [None]:
%%bash
cd my_first_repo
git add test.py
git status

The `test.py` file we just created is now tracked by the repository, and git status tells us as much. 
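If the full `git status` output feels verbose, the transition from untracked to tracked can also be observed with the compact `--short` format. The sketch below runs in a throwaway repository (the `demo`, `before` and `after` names are ours, purely for illustration) so that it doesn't interfere with `my_first_repo`.

```shell
set -eu
demo=$(mktemp -d)              # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .

echo "print('hello')" > test.py
before=$(git status --short)   # "??" marks an untracked file
git add test.py
after=$(git status --short)    # "A " marks a newly added (staged) file

echo "before add: $before"
echo "after add:  $after"
```

The two-character code in front of the file name is a handy shorthand once you know the stages by heart.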

Note that git will complain if we ask it to track files that are in folders outside the one where the repository was created:

In [None]:
%%bash
cd my_first_repo
touch ../another_test.txt
# git add ../another_test.txt
# This will fail with an error:
# `fatal: ../another_test.txt: '../another_test.txt' is outside repository`

### IMPORTANT: Things git should not track...

There are some files that will be in the same folder as the repo but that we will never want to track. Examples of these are:
* data files,
* artifact files (serialized models, binaries),
* **credentials**,
* generally anything that is not source code.

Suppose that we have a `data.csv` file that we don't want to include in the repository:

In [None]:
%%bash
touch my_first_repo/data.csv

This file will now show up as an untracked file every time we run `git status`, which will eventually get a bit tiresome:

In [None]:
%%bash
cd my_first_repo
git status

Since we don't want to track it, what we will do is tell `git` to **ignore** it. As a matter of fact, we will tell `git` to ignore all files ending with the
`.csv` extension. To do that, we just need to create a file named `.gitignore` in the repo, containing the line `*.csv`:

In [None]:
%%bash
cd my_first_repo
echo "*.csv" > .gitignore
git add .gitignore
git status

Note that we also started tracking the `.gitignore` file. Magically, git status now doesn't show `data.csv` anymore: `git` knows files like that one can be safely ignored.
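When a file unexpectedly disappears from (or stubbornly stays in) the `git status` output, `git check-ignore` can tell us which ignore rule, if any, applies to it. The following sketch uses a throwaway repository; `notes.txt` and the variable names are hypothetical and only serve the illustration.

```shell
set -eu
demo=$(mktemp -d)              # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
echo "*.csv" > .gitignore
touch data.csv notes.txt

# -v prints the ignore file, line number and pattern responsible for ignoring the path
rule=$(git check-ignore -v data.csv)
echo "$rule"

# check-ignore exits with a non-zero status for paths that are NOT ignored
if git check-ignore -q notes.txt; then ignored=yes; else ignored=no; fi
echo "notes.txt ignored? $ignored"
```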

**Exercise 0.0**: Why do you think we need to track (`git add`) the `.gitignore` file?

**Exercise 0.1**: Can you think of (at least) two reasons why we typically don't want to track binary files in a git repository? 

## The last two of the four stages of a file in a repository

We have seen that there are at least two stages in the lifetime of a file living in the same folder as a git repository:
* untracked: when a file resides in the same folder where we created the repository, but it is not under source control
* tracked: when we add a previously _untracked_ file to the repository.

There are in fact two additional stages for files _tracked_ by a repository:
* staged
* committed

As you saw, we started tracking a file `test.py` by git-adding it into the repository. As a matter of fact, the `git add` command
actually did two things:
* it told git to start tracking the contents of that file
* it _staged_ the changes in the file (i.e. its creation) to prepare them to be _committed_

You can think of _staging_ a file as the last step prior to making the file (or the changes introduced to the file) permanent in the repository. While a file is _staged_, you can continue to modify it:

In [None]:
%%bash
cd my_first_repo
# we add one more line to our `test.py` script
echo "print('Now it is even more complicated!')" >> test.py
git status

And the file remains in its _staged_ state, but now `git status`
informs us that our latest changes have not yet been staged.

If we were to commit the _staged_ `test.py` changes now, we would not be "saving" the changes 
we just performed, because those changes haven't yet been staged. So let's stage them 
so that we can commit everything in one batch!

In [None]:
%%bash
cd my_first_repo
git add test.py
git status
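To see exactly which version of a file is sitting in the staging area, we can ask git directly. The sketch below (run in a throwaway repository, with variable names we invented for the occasion) stages a one-line file, modifies it again without staging, and then compares the three views: what is staged, what is not, and the content held in the index.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .

echo "line 1" > test.py
git add test.py                           # stage the file with a single line
echo "line 2" >> test.py                  # modify it again WITHOUT staging

staged=$(git diff --cached --name-only)   # staged changes (index vs. last commit)
unstaged=$(git diff --name-only)          # unstaged changes (working tree vs. index)
snapshot=$(git show :test.py)             # content currently held in the index

echo "staged:   $staged"
echo "unstaged: $unstaged"
echo "index holds: $snapshot"
```

The same file shows up in both lists: the staged snapshot only contains the first line, and the second line would be left out of a commit made at this point.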

### Committing the staged changes

In order to make the new files (or the changes in existing files) permanently stored in git, we need to _commit_ them.

Here we need to explain what we mean by "permanently storing". Permanently storing changes in git (or most version control systems) doesn't mean that we won't be able to (or that we shouldn't) change those files ever again. On the contrary, it means that we will store the state of the file in the repository in a way that will allow us to build on top of that state, coming back to it at any point in the future. You read that right: *any* point in the future! Git is like a time-travel machine!

So let's _commit_ both of the staged changes (the ones in `.gitignore` and the ones in `test.py`):

In [None]:
%%bash
cd my_first_repo
git commit -m 'My first commit and my first commit message!'

Now when we check the status of our repository, we notice that we no longer have staged
changes:

In [None]:
%%bash
cd my_first_repo
git status

### The opposite of `add`ing is `rm`oving!

Sometimes we would like to stop tracking files we previously added to a repository, perhaps because they are no longer useful or because we added them mistakenly. 

There's a useful git command to do just that, which is `git rm`. To illustrate its behavior, we will first create, stage and commit a dummy file that we would later like to remove:

In [None]:
%%bash
cd my_first_repo
touch dummy.py
git add dummy.py
git commit -m 'commiting a dummy file just to remove it later!'

Let's remove this dummy! 

In [None]:
%%bash
cd my_first_repo
git rm dummy.py
git status

Surprise! Removing is dual to `add` in one more way: applying `git rm` only _stages_ the delete operation but doesn't yet make it permanent. We need to _commit_ it for the deletion to take place:

In [None]:
%%bash
cd my_first_repo
git commit -m 'Away with the dummy!'
git status
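Note that `git rm` also deletes the file from disk. When a file was added by mistake but we still want to keep it locally, the `--cached` flag removes it from the index only. Here is a minimal sketch in a throwaway repository; the file name `secrets.txt` and the demo identity are hypothetical.

```shell
set -eu
demo=$(mktemp -d)                    # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"     # local identity so that commits work anywhere
git config user.email "demo@example.com"

touch secrets.txt
git add secrets.txt
git commit -q -m "add a file by mistake"

git rm -q --cached secrets.txt       # remove from the index only, keep the file on disk
git commit -q -m "stop tracking secrets.txt"

tracked=$(git ls-files)              # files git still tracks (none in this demo)
echo "tracked files: '$tracked'"
ls secrets.txt                       # the file is still there
```

Keep in mind that the earlier commit still contains the file, so this is not enough to scrub something sensitive (like credentials) from a repository's history.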

**Exercise 0.3**: Let's practice a bit more with the concepts of tracking, staging and committing. To this end, please try to follow these steps in order:
1. Create two empty new files, start tracking and commit them (`A.txt` and `B.txt`). 
2. Change a few lines in `A.txt` and `B.txt` and stage the changes.
3. *Unstage* `B.txt`. Hint: Google `git reset HEAD`. 
4. Now commit the changes in `A.txt`
5. Stage `B.txt` and commit it (in a different commit!)

## Takeaways
And now we're truly done with this first part where we have learned:

0. The difference between _untracked_ and _tracked_ files in our repository
1. How to start tracking a file into a repository, or _stage_ changes we have done to an already tracked file
2. How to commit the _staged_ changes to make them permanent
3. How to _remove_ a previously tracked file from the repository

## Part II

* How to track changes of a file in a repository (log, show, diff)
* How to maintain different versions of a file without going crazy (branch, checkout)
* Time-travel with git (revert, checkout, reset)

**Warning**: This tutorial assumes you have already completed Part I of this tutorial, and have run all the cells in that notebook *exactly once*.

## How to track changes of a file in a repository (log, show, diff)

In the first part of this tutorial we created a repository and added a few commits to it. Sometimes we will want git to remind us of the latest changes we have performed in our repository; this is where `git log` comes in handy: 

In [None]:
%%bash
cd my_first_repo
git log --graph

**Exercise 1.0**: Show the commit history only for the file (aka. path) `test.py`. 

**Exercise 1.1**: Show the commit history only for the last two commits.

**Exercise 1.2**: Show the commit history only for the first two commits.

*Hint*: Check out [git log's documentation](https://git-scm.com/docs/git-log)

Perhaps we also want to know what exactly we did in a given commit. `git show` is here to help us:

In [None]:
%%bash
cd my_first_repo
# commit SHAs differ in every repository: use HEAD (the last commit)
# or a SHA taken from your own `git log` output
git show HEAD

### Diffing is the name of the game

Let's spice it up a little. We will now experiment with the `git diff` tool which allows us to understand what exactly changed between two commits, the staged changes and HEAD or other variations thereof.

Before doing that, let's add some meat to our skinny repository! For the next few steps we will work with [an example of Machine Learning code in Python from Scikit-learn's](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py) library of examples. This means that you will **need to install scikit-learn v0.20.3 or compatible if you don't have it yet.**

In [None]:
%%bash
rm -f my_first_repo/plot_compare_reduction.py*
# This script (plot_compare_reduction.py) has been obtained via: 
# wget https://scikit-learn.org/stable/_downloads/plot_compare_reduction.py -P resources
# A local copy is held in case sklearn's version changes
cp resources/plot_compare_reduction.py my_first_repo/
cd my_first_repo
git add plot_compare_reduction.py
git commit -m 'checking scikit learn example into our repo'

Our first step will be to clean up the example to remove the part corresponding to "Caching transformers within a `Pipeline`":

In [None]:
%%bash
cd my_first_repo
# We will remove all code after line 91
# If you're wondering why redirect and then mv, cf: https://unix.stackexchange.com/questions/15826/io-redirection-and-the-head-command
head -n 91 plot_compare_reduction.py > tmp.py
mv tmp.py plot_compare_reduction.py
git add plot_compare_reduction.py
git commit -m 'removing part corresponding to Caching transformers within a Pipeline'

Did we do what we wanted? `git diff` is your ally because it will allow you to see what exactly changed relative to a past commit:

In [None]:
%%bash
cd my_first_repo
git diff `git log --pretty=format:'%h' -n 1`~1

The `diff` output shows that we managed to do what we wanted.

Note that here we used a trick:
```
`git log --pretty=format:'%h' -n 1`~1
```
to retrieve the previous commit's SHA without having to know it in advance. Indeed:

In [None]:
%%bash
cd my_first_repo
git log --pretty=format:'%h' -n 1

Returns the short sha of last commit (the one we just did), and by adding the `~1` suffix, we get a reference to the one previous to the last. 

In fact, in this particular case we overcomplicated the syntax since we could have simply used `git diff HEAD~1` to get the information we wanted.
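We can convince ourselves that both spellings point to the same commit with `git rev-parse`, which translates any reference into a full SHA. The sketch below builds a tiny two-commit history in a throwaway repository; file and variable names are made up for the illustration.

```shell
set -eu
demo=$(mktemp -d)                    # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"     # local identity so that commits work anywhere
git config user.email "demo@example.com"

echo "one" > f.txt
git add f.txt
git commit -q -m "first"
echo "two" >> f.txt
git commit -q -a -m "second"

last=$(git log --pretty=format:'%h' -n 1)   # short SHA of the last commit
via_sha=$(git rev-parse "$last~1")          # parent, reached through the SHA
via_head=$(git rev-parse HEAD~1)            # parent, reached through HEAD

echo "via sha:  $via_sha"
echo "via HEAD: $via_head"
git diff --stat HEAD~1                      # summary of what the last commit changed
```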

**Exercise 1.4**: You will now modify the `my_first_repo/plot_compare_reduction.py` file so that:
1. The values in the variable `C_OPTIONS` are `[1, 10, 100, 500]`
2. Instead of plotting a figure, it saves the chart in a PNG file (hint: [plt.savefig](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html))

It's important that you stage and commit **each of the changes separately**. Make sure that you accomplished what you wanted by:
1. Checking each diff relative to each previous commit.
2. Running the resulting code

## How to maintain different versions of a file without going crazy (branch, checkout)

As you know, the typical data science flow involves a lot more changes than the ones we performed in Exercise 1.4. In particular, the development flow is not linear: one wants to be able to go back and forth between different experiments, and try different variations of a given change before consolidating it.

The VCS-free approach to dealing with this variability is to append suffixes to our code files and artifacts (e.g. charts), which typically ends up with a plethora of files with cryptic names, each corresponding to a different variation:
* plot_compare_reduction_v2.py
* plot_compare_reduction_v2_include_LR.py
* plot_compare_reduction_iris_data_v2.py...

You get the point. The situation becomes even more daunting when there are several people working on the same project.

Luckily, Git gives us a number of tools to support this type of work in a cleaner and more powerful way, leveraging the power of `branches`. As we saw during the lecture, one can think of a branch as a diversion from the master branch in such a way that we can always go back to any other branch (including the master one) with just a few keystrokes.

Let's get to it. Suppose that we want to experiment with some serious changes to the `my_first_repo/plot_compare_reduction.py` script, changes that we are not sure will remain permanent. The first step is to create a new branch into which we will commit these changes, so that the stuff we have in our master branch remains unaffected:

In [None]:
%%bash
cd my_first_repo/
# I usually name my branches following the convention {user}/{short description}
# (plain `git branch <name>` creates a branch; the `-c` flag is meant for copying an existing one)
git branch "$USER/experiment"

I now have two branches in my repository, as `git branch` cordially explains:

In [None]:
%%bash
cd my_first_repo/
git branch

The * indicates to me that I'm currently working on my `master` branch. Since I want to perform a few naughty changes, I need first to switch to the newly created branch, and for that I will use `git checkout`:

In [None]:
%%bash
cd my_first_repo/
git checkout $USER/experiment
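A common source of confusion is that creating a branch does not move us onto it. The sketch below illustrates this in a throwaway repository by asking git for the current branch name before and after the checkout; the branch name `experiment` and the variable names are ours.

```shell
set -eu
demo=$(mktemp -d)                    # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"     # local identity so that commits work anywhere
git config user.email "demo@example.com"

echo "x" > f.txt
git add f.txt
git commit -q -m "first commit"      # a branch needs at least one commit to start from

git branch experiment                            # create the branch; we stay where we were
on_before=$(git rev-parse --abbrev-ref HEAD)     # name of the current branch
git checkout -q experiment                       # now we actually switch
on_after=$(git rev-parse --abbrev-ref HEAD)

echo "before checkout: $on_before"
echo "after checkout:  $on_after"
```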

**Exercise 1.5**: Can you visualize the differences between those two branches? (the `$USER/experiment` branch and the `master` branch) (hint: you can use git diff)

To make things more interesting, we will create yet another branch to work on a couple of changes that we would like to perform in order to show confidence intervals in the bar chart of our comparison plot:

In [None]:
%%bash
cd my_first_repo/
# `git checkout -b` performs a `git branch` + `git checkout` in one step
git checkout -b $USER/add_cis_to_plot master

**Exercise 1.6**: Let's practice moving in and out of branches... it's a bit of work but we'll use what you do here later to work on `merge` and `rebase`. Please do the following:
1. Make a few changes in the experiment code (for example: try another classifier or dimensionality reduction method) while working on the `$USER/experiment` branch. 
2. Introduce confidence intervals to the bar charts while working on the `$USER/add_cis_to_plot` branch

**Important**: make sure you develop each change in the right branch, and that you stage and commit your work incrementally. Incremental commits are in general preferable over big long ones.

*Questions*:
1. After you've made the changes, what are the differences between the `$USER/experiment` and the `$USER/add_cis_to_plot` branches? 
2. Are there any conflicting changes?

## Time-travel with git (checkout, revert, reset)

During the course of the past exercises you have probably encountered a few situations where you made a mistake:
* Perhaps you mistakenly staged files that you don't really want to track
* Or you committed a change that you now regret
* Or simply you want to go back to a state where you know things were stable...

Don't you worry! The ability to go back and forth in your development timeline is one of the most powerful features git has to offer. 

### Unstaging changes
One of the most common situations is when you accidentally stage changes on a tracked file, or start tracking a file you didn't want to track. Let's simulate this situation in our playground repo:

In [None]:
%%bash
cd my_first_repo/
# we will switch to a new branch for convenience
git checkout -b $USER/time_travel master
# accidental removal of "a few" lines in `plot_compare_reduction.py`
head -n 50 plot_compare_reduction.py > plot_compare_reduction.py.tmp
mv plot_compare_reduction.py.tmp plot_compare_reduction.py
# change we actually want to make:
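# printf writes one line per argument (plain `echo` in bash would write the "\n" literally)
printf '%s\n' "import sys" "python_major_version=sys.version_info[0]" "assert python_major_version==3" >> test.py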
# we carelessly add the accidental changes and the one we want to make
git add --all
git status

The first problem is that we have added a `.png` file, which is an artifact of our computation in `plot_compare_reduction.py`; as we discussed earlier, we generally don't want to check artifacts (binaries) into git. So we will follow `git`'s suggestion and *unstage* it (recent versions of git suggest `git restore --staged <file>`, which unstages the file just like `git reset HEAD <file>` does), by applying:

In [None]:
%%bash
cd my_first_repo/
git reset HEAD plot_compare_reduction.png
git status

We have successfully solved the first problem: the addition of an unwanted new file. But how did this happen? 

A proper answer to this question would require a level of understanding of git internals and commands that is a bit beyond the scope of this course, but I will attempt to simplify the explanation here. When we use `git reset` on a given path (file), we tell git to bring the state of that file **in the staging area** (the index) to that of the commit `HEAD` is pointing to (the last commit of the branch). Since that commit points to a snapshot of the repository where the file `plot_compare_reduction.png` doesn't exist, the `add` operation we just performed on it is undone, and the file goes back to the pile of untracked files, as the `git status` command states.

More information on `git reset`:

* https://stackoverflow.com/questions/3639342/whats-the-difference-between-git-reset-and-git-checkout

* https://git-scm.com/book/en/v2/Git-Tools-Reset-Demystified

We still have one more problem to solve, which is the accidental deletion of content from `plot_compare_reduction.py`. Since we know that the state of this path in HEAD is the correct one, we can apply the same operation:

In [None]:
%%bash
cd my_first_repo/
git reset HEAD plot_compare_reduction.py
git status

The accidental change is no longer staged for commit (it no longer exists in the index/staging area), but if we look at the file, it still has the changes we performed in the working tree (our filesystem). In order to undo the changes we did in the file in our working tree, we need to perform an additional operation of restoring the state of the file to the one we had stored in HEAD:

In [None]:
%%bash
cd my_first_repo/
git checkout HEAD plot_compare_reduction.py
git status

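Note that we could also have used the command `git checkout -- plot_compare_reduction.py`. Here `--` simply separates options from file paths; when no commit is given, `checkout` restores the path from the staging area (the index). Since we have just reset the index entry for this file to its state in HEAD, the result is the same.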
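The two-step dance (first the index, then the working tree) is easier to follow on a tiny example. The sketch below, in a throwaway repository with made-up file and variable names, records the content of the file in each area after each step.

```shell
set -eu
demo=$(mktemp -d)                    # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"     # local identity so that commits work anywhere
git config user.email "demo@example.com"

echo "good" > f.txt
git add f.txt
git commit -q -m "stable state"

echo "oops" > f.txt
git add f.txt                        # accidental staging

git reset -q HEAD f.txt              # step 1: index <- HEAD, working tree untouched
in_index=$(git show :f.txt)          # content in the staging area
on_disk_1=$(cat f.txt)               # content in the working tree

git checkout -- f.txt                # step 2: working tree <- index
on_disk_2=$(cat f.txt)

echo "index after reset:     $in_index"
echo "disk after reset:      $on_disk_1"
echo "disk after checkout:   $on_disk_2"
```

After the reset the index is clean again but the file on disk still holds the unwanted content; only the checkout brings the working tree back in line.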

We have managed to successfully unstage the accidental changes, so now we can merrily commit our desired changes and move on:

In [None]:
%%bash
cd my_first_repo/
git commit -m 'adding python version check'
git status

### Reverting previous work

Another common situation occurs when we introduce a change in our repository that we later regret. Contrary to the previous situation, in this case we have already committed the change.

If the unwanted change is the last one we committed, we could use `git reset` to start working from the last stable state. Unfortunately, this would mean that we're "re-writing history" which can cause problems if the changes have already been shared with other collaborators, because each one of us will have a different version of history.

So a safer way is to use `git revert`, which allows us to undo the changes in a specific commit by adding a new commit that applies the opposite changes. Let's first look at our recent history:

In [None]:
%%bash
cd my_first_repo/
git log -n 5

Suppose we want to undo the changes we performed earlier in the commit "Updating C_OPTIONS to desired values":

In [None]:
%%bash
cd my_first_repo/
# update this commit SHA to the one you see in your commit log
git show XXXX # <---

Reverting to the state previous to these changes is a piece of cake:

In [None]:
%%bash
cd my_first_repo/
git revert XXXX # update this commit SHA to the one you see in your commit log

And you're done: you can verify by yourself that the state of `plot_compare_reduction.py` is the one previous to the changes we performed.
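The key property of `git revert` is that history only grows: nothing is deleted, a new commit is appended. The following sketch shows this on a throwaway repository; the file `params.py`, its contents and the variable names are hypothetical stand-ins for the real example.

```shell
set -eu
demo=$(mktemp -d)                        # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"         # local identity so that commits work anywhere
git config user.email "demo@example.com"

echo "C = [1, 10]" > params.py
git add params.py
git commit -q -m "initial values"
echo "C = [1, 10, 100]" > params.py
git commit -q -a -m "update values"

git revert --no-edit HEAD > /dev/null    # adds a NEW commit that undoes the last one

n_commits=$(git rev-list --count HEAD)   # number of commits in the history
content=$(cat params.py)
echo "commits in history: $n_commits"
echo "file content:       $content"
git log --oneline
```

The file is back to its initial content, yet the log still shows the commit we regretted, followed by the one that reverts it.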

**Warning**: Note that revert gives you a quick way of undoing any type of changes, not just simple changes affecting a single file. This doesn't mean reverting is without risks. Consider for example the setup of the following exercise.

**Exercise 1.7**: We will now create a situation where reverting can create a mess. Please follow the next steps of havoc-making:
0. Create and check out a new branch off master called `revert_havoc`
1. Create a new python module `dependency.py` with a function named `my_dependency()`. Stage and commit the changes.
2. Now make a call to that function somewhere in `plot_compare_reduction.py` (Don't forget to import the module first). Stage and commit those changes.
3. Now revert the commit created in Step 1. What changes does it induce? Will your code work after the revert? 

### Summary of time-travel operations

In this last section we have learned how to time-travel: go back in time to a state where we are more comfy. As you probably suspected, time-travel is not without risk so it has to be used carefully and judiciously. 

A nice and concise summary of the time-travel operations is the following:

| Command | Scope | Common use cases |
|---|---|---|
|git reset|Commit-level|Discard commits in a private branch or throw away uncommitted changes|
|git reset|File-level|Unstage a file|
|git checkout|Commit-level|Switch between branches or inspect old snapshots|
|git checkout|File-level|Discard changes in the working directory|
|git revert|Commit-level|Undo commits in a public branch|
|git revert|File-level|(N/A)|

(Adapted from https://www.atlassian.com/git/tutorials/resetting-checking-out-and-reverting)

## Part III

* How to work with friends on the same file... and remain friends (merge)
* Working with remote repositories (clone, fetch, pull, push, pull requests)
* Contributing flows (forks, feature branches)

**Warning**: This tutorial assumes you have already completed Parts I and II of this tutorial, and have run all the cells in those notebooks *exactly once*. **It also assumes you have accomplished Exercise 1.7** (see `../solutions/` if you haven't, run the corresponding cells and come back :))

## How to work with friends on the same file... and remain friends (merge)

Working with other people is always a challenge. We have different brains, perspectives in life, ideologies... there's no need to make it even harder by working on the same file without the right tools!

Luckily, git gives us several tools to do that without generating more conflicts than absolutely necessary. One set of tools is `git branch`es; the other is the ability to bring work from one branch to another without going bananas.

So let's get our hands dirty. We will work off the two branches that we created in **Exercise 1.6** (`$USER/experiment` and `$USER/add_cis_to_plot`). If you succeeded in finishing the exercise, you should have two branches with different (and conflicting) changes on the same file (`my_first_repo/plot_compare_reduction.py`). (If you did not succeed, please use the solution provided in the Solutions section.)

Note that in this case the changes were performed by the same person, but you can imagine a situation where you were working on one branch and someone else on the other.

In [None]:
%%bash
cd my_first_repo/
# we have two branches:
git branch

And conflicting changes:

In [None]:
%%bash
cd my_first_repo/
# the differences between the two branches:
git diff $USER/experiment $USER/add_cis_to_plot

What we will do now is to try to *merge* the changes introduced in these "feature" branches, 
that is, consolidate the progress that we (or the team) made on two independent branches into
one single branch. To this end, we will first create a branch (`$USER/merge_work`) off one of our feature branches (say the one we called `$USER/experiment`), and try to merge the content of the other branch (`$USER/add_cis_to_plot`) into this new branch:

In [None]:
%%bash
cd my_first_repo/
git checkout -b $USER/merge_work $USER/experiment
git merge $USER/add_cis_to_plot

If no conflicts arose, git was able to merge the two branches and has created a special merge commit with two ancestors: the last commit in one branch and the last commit in the other:

In [None]:
%%bash
cd my_first_repo/
git log --graph -n 10

Amazing! If your branch work had no conflicts, you now have a branch `$USER/merge_work` with all the work you've been doing, which you can now test before merging it into the `master` branch: (**If your merge had conflicts, please continue to the next section**)

In [None]:
%%bash
cd my_first_repo/
# let's test the changes we introduced:
python plot_compare_reduction.py

If you have used the solution provided by us in the companion solution notebook `../solutions/1 - Introduction to git for Data Scientists [solutions].ipynb`, you will have run into an error. If you introduced your own changes, perhaps your code ran perfectly and you are good to go.

If you did get an error, it turns out we are not done just yet... Which is good news because we're now learning an important lesson: having no merge conflicts doesn't mean that you haven't introduced errors in your code. It just means that git was able to mix and match your contributions without needing your aid because the changes were made in different lines of the same file.
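The "merge commit with two ancestors" mentioned earlier can be reproduced in isolation. In the sketch below (throwaway repository; the branch names `feature_a`, `feature_b`, `merge_work` and the file names are invented for the illustration) two branches touch different files, so git merges them without asking for help, and we then count the parents of the resulting commit.

```shell
set -eu
demo=$(mktemp -d)                        # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"         # local identity so that commits work anywhere
git config user.email "demo@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "base"
base=$(git rev-parse --abbrev-ref HEAD)  # whatever the default branch is called here

git checkout -q -b feature_a
echo "a" > a.txt
git add a.txt
git commit -q -m "work on a"

git checkout -q -b feature_b "$base"
echo "b" > b.txt
git add b.txt
git commit -q -m "work on b"

git checkout -q -b merge_work feature_a
git merge -q --no-edit feature_b         # no overlapping changes: merges cleanly

# rev-list --parents prints the commit followed by each of its parents
parents=$(git rev-list --parents -n 1 HEAD)
n_parents=$(( $(echo "$parents" | wc -w) - 1 ))
echo "the merge commit has $n_parents parents"
git log --graph --oneline
```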

Let's now play with the case where conflicts arise...

### Conflict resolution & peacekeeping

Beneath the apparent simplicity, merging branches is a complicated task. Git can use different algorithms to decide how to perform the merge depending on the type of divergences that the two branches have. Explaining the merge strategies is out of the scope of this course (and a very complicated topic, see for instance [here](http://blog.plasticscm.com/2011/09/merge-recursive-strategy.html) and [here](http://raulavila.com/2017/03/como-funciona-git-3/)).

For now we will content ourselves with creating a situation where we have conflicts and learning how to resolve them:

In [None]:
%%bash
# we will emulate the situation where someone else has started working on a feature off
# the master branch, and she changed a line that we also changed. In this case
# the change is a bit silly (a variable name change), though perhaps not a completely
# unfamiliar situation :)
cd my_first_repo/
git checkout -b super_feature master
# no `-i` here: sed writes the result to stdout, which we redirect to a temporary file
sed -e 's/C_OPTIONS/SVM_C_OPTIONS/g' plot_compare_reduction.py > plot_compare_reduction.py.tmp
mv plot_compare_reduction.py.tmp plot_compare_reduction.py
git commit -a -m 'super contribution to super feature'
git log --graph -n 5

Now if we try to merge this `super_feature` branch with our `$USER/merge_work`... 

In [None]:
%%bash
cd my_first_repo/
git merge $USER/merge_work

Uh oh! Conflicts happened! No need to freak out though because git is nice enough to tell us what to do next:

```
CONFLICT (content): Merge conflict in plot_compare_reduction.py
Automatic merge failed; fix conflicts and then commit the result.
```

So let's do it: we will open the `plot_compare_reduction.py` file and see where `git` ran into trouble:

In [None]:
%%bash
cd my_first_repo/
cat plot_compare_reduction.py
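If the conflict markers get lost in a long file, it may help to look at them first on a one-line example. The sketch below recreates the same kind of clash (a rename on one branch, an edit of the same line on the other) in a throwaway repository. The file `params.py`, the branch name `rename_variable` and the variable names are hypothetical.

```shell
set -eu
demo=$(mktemp -d)                        # throwaway folder, unrelated to my_first_repo
cd "$demo"
git init -q .
git config user.name "Demo User"         # local identity so that commits work anywhere
git config user.email "demo@example.com"

echo "C_OPTIONS = [1, 10]" > params.py
git add params.py
git commit -q -m "base"
base=$(git rev-parse --abbrev-ref HEAD)  # whatever the default branch is called here

git checkout -q -b rename_variable
echo "SVM_C_OPTIONS = [1, 10]" > params.py
git commit -q -a -m "rename variable"

git checkout -q "$base"
echo "C_OPTIONS = [1, 10, 100]" > params.py
git commit -q -a -m "add a value"

# both branches changed the same line, so this merge is expected to stop with a conflict
git merge rename_variable > /dev/null 2>&1 || true

conflicted=$(git diff --name-only --diff-filter=U)   # files with unresolved conflicts
echo "conflicted files: $conflicted"
cat params.py                            # shows the <<<<<<<, ======= and >>>>>>> markers
```

Everything between `<<<<<<<` and `=======` comes from the branch we are on, and everything between `=======` and `>>>>>>>` comes from the branch being merged in.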

**Exercise 2.0:** Fix the merge conflicts and commit the changes. Once that is done, the merge is complete and we can move on to the next section!

## Remote repositories & teleportation

As you know, a lot of the work we Data Scientists do is not limited to being developed and run in our local machine. In addition, we also like our work to be safe and secure somewhere outside the reach of a potential coffee spill...

Remote repositories are versions of our local repository that are stored... remotely (hence the teleportation pun). The most well-known hosting service is `github.com`, but you should know there are a few other well-known ones like `bitbucket`, `sourceforge` or `gitlab`. 

In this section we will see how to perform the most common operations with remote repositories:
* Creating a remote repository and setting up a remote for your local repository
* Listing and manipulating the remote repositories for an existing project (remote)
* Pushing and pulling from remote repositories (fetch, pull, push)
* Issuing Pull Requests 

Reference: https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes

### Creating a remote repository and setting up a remote for your local repository

If we try to see what remotes are currently set up for the repository we created in earlier parts of this tutorial, we see that there are none:

In [None]:
%%bash
cd my_first_repo/
git remote -v

So since we want to be able to save and share all our hard work with other people, our first step will be to add a remote to `my_first_repo`.

Start by creating a new public repository on [github.com](https://github.com/). For that you will need a GitHub account. Please name your repository `my_first_repo`. After that, you can add this remote to your repo as follows:

In [None]:
%%bash
cd my_first_repo/
git remote -v
# replace `atibaup` with your own GitHub username
git remote add origin git@github.com:atibaup/my_first_repo.git

So now if you try again, you should see your newly added remote:

In [None]:
%%bash
cd my_first_repo/
git remote -v

**Exercise 2.1**: Pair up with another student and add their repository as one of your remotes.

If you want to see the branches of a remote, it suffices to run:

In [None]:
%%bash
cd my_first_repo/
git remote show origin 
# In this case you won't see much information here because the remote repository you created only has one branch

Now that your repository has a brand new remote, you can merrily pull and push data from and towards it. Since the remote repository is brand new, there is nothing for us to fetch yet:

In [None]:
%%bash
cd my_first_repo/
git fetch origin

Instead, we can push our local `master` branch to the remote repository:

In [None]:
%%bash
cd my_first_repo/
git push origin master

And voilà, you and everyone else (with access) can play with your world-saving contributions!
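
If you would like to try pushing without touching GitHub, a local bare repository can play the role of the server. This is only a sketch under that assumption: the `push_demo` directory, the `Demo User` identity and the file contents are made up for the example, and the bare repository stands in for the repository you created on github.com.

```shell
set -e
rm -rf push_demo
mkdir push_demo

# A bare repository (no working tree) plays the role of the server
git init -q --bare push_demo/server.git

# A local repository with a single commit on master
git init -q push_demo/local
git -C push_demo/local symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git -C push_demo/local config user.name "Demo User"          # placeholder identity
git -C push_demo/local config user.email "demo@example.com"
echo "hello" > push_demo/local/README.md
git -C push_demo/local add README.md
git -C push_demo/local commit -q -m "Add README"

# Register the stand-in remote and push master to it
git -C push_demo/local remote add origin "$PWD/push_demo/server.git"
git -C push_demo/local push -q origin master

# List the references that now exist on the remote
git -C push_demo/local ls-remote origin
```

After the push, `git ls-remote` should list `refs/heads/master` on the remote, pointing at the same commit as the local `master`.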

**Exercise 2.2**: Fetch from your colleague's remote, which you added in Exercise 2.1.

### Fetching is not pulling

When you fetch a branch from a remote, `git` only saves a copy of the remote's snapshot in your local repository, under a special type of branch called a "remote-tracking branch", whose name explicitly ties it to the remote you fetched from (for example `origin/master`).

This means that in order to incorporate the remote branch changes into your local branch, you need to `merge` the recently fetched branch into your local branch. So you can either fetch and merge explicitly:

In [None]:
%%bash
cd my_first_repo/
# assuming you're already in your master branch
git fetch origin
# likely will do nothing because both your remote and local master branch are already in sync,
# unless you have added changes to either on your own
git merge origin/master

Or use the `git pull` syntactic sugar, which fetches and then merges the remote-tracking branch into your current branch:

In [None]:
%%bash
cd my_first_repo/
# assuming you're already in your master branch
git pull origin master
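
To see the difference between fetching and pulling in action, here is a small self-contained sketch with two collaborators. It assumes a local bare repository as a stand-in for the remote server; the names `fetch_demo`, `alice`, `bob` and `notes.txt` are hypothetical and only serve the example.

```shell
set -e
rm -rf fetch_demo && mkdir fetch_demo

# Stand-in for the remote server; HEAD points at master so that clones check it out
git init -q --bare fetch_demo/server.git
git -C fetch_demo/server.git symbolic-ref HEAD refs/heads/master

# alice creates the first commit and pushes it
git init -q fetch_demo/alice
git -C fetch_demo/alice symbolic-ref HEAD refs/heads/master
git -C fetch_demo/alice config user.name "Alice"             # placeholder identity
git -C fetch_demo/alice config user.email "alice@example.com"
git -C fetch_demo/alice remote add origin "$PWD/fetch_demo/server.git"
echo "v1" > fetch_demo/alice/notes.txt
git -C fetch_demo/alice add notes.txt
git -C fetch_demo/alice commit -q -m "First version"
git -C fetch_demo/alice push -q origin master

# bob clones the repository at this point
git clone -q "$PWD/fetch_demo/server.git" fetch_demo/bob

# alice publishes a second commit
echo "v2" >> fetch_demo/alice/notes.txt
git -C fetch_demo/alice commit -q -am "Second version"
git -C fetch_demo/alice push -q origin master

# bob fetches: origin/master moves forward, his local master does not
git -C fetch_demo/bob fetch -q origin
git -C fetch_demo/bob log --oneline master
git -C fetch_demo/bob log --oneline origin/master

# Commits that were fetched but are not yet merged into master
git -C fetch_demo/bob log --oneline master..origin/master
```

At the end of this script bob's `master` still has a single commit and his `notes.txt` still reads `v1`, while `origin/master` already contains both commits. Running `git merge origin/master` (or having used `git pull` in the first place) would bring his branch up to date.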

### Tracking branches

Tracking branches are local branches that are directly related to a remote branch. They are convenient because they save us some typing when we want to pull from or push to a remote branch, so that instead of having to write:
```
git pull my_remote my_remote_branch
```
we can simply do:
```
git pull
```
To create a tracking branch directly from an existing remote branch, git offers the following shortcut:
```
git checkout --track origin/serverfix
```
instead of the more verbose `git checkout -b [branch] [remotename]/[branch]`.

On the other hand, if you want to quickly see which remote branches, if any, your local branches are tracking, you can run:

In [None]:
%%bash
cd my_first_repo/
git branch -vv

If you want an existing local branch to start tracking a specific remote branch,
it suffices to check out that branch and run `git branch -u origin/serverfix`.
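
Since `my_first_repo` has no `serverfix` branch, the two ways of setting up a tracking branch can be tried out in a sandbox. The sketch below assumes a local bare repository standing in for the remote; `tracking_demo`, `seed`, `work` and the `hotfix` branch are hypothetical names introduced for the example, while `serverfix` is the branch name used in the text above.

```shell
set -e
rm -rf tracking_demo && mkdir tracking_demo

# Stand-in for the remote server
git init -q --bare tracking_demo/server.git
git -C tracking_demo/server.git symbolic-ref HEAD refs/heads/master

# Seed the stand-in remote with a master and a serverfix branch
git init -q tracking_demo/seed
git -C tracking_demo/seed symbolic-ref HEAD refs/heads/master
git -C tracking_demo/seed config user.name "Demo User"       # placeholder identity
git -C tracking_demo/seed config user.email "demo@example.com"
echo "hello" > tracking_demo/seed/README.md
git -C tracking_demo/seed add README.md
git -C tracking_demo/seed commit -q -m "Initial commit"
git -C tracking_demo/seed branch serverfix
git -C tracking_demo/seed push -q "$PWD/tracking_demo/server.git" master serverfix

# A fresh clone only has a local master; serverfix exists as origin/serverfix
git clone -q "$PWD/tracking_demo/server.git" tracking_demo/work
git -C tracking_demo/work branch -a

# Option 1: create a local tracking branch straight from the remote branch
git -C tracking_demo/work checkout -q --track origin/serverfix

# Option 2: make an existing local branch track a remote branch
git -C tracking_demo/work branch hotfix master
git -C tracking_demo/work branch -u origin/serverfix hotfix

# Show each local branch together with its upstream
git -C tracking_demo/work branch -vv
```

In the final listing, both `serverfix` and `hotfix` should show `[origin/serverfix]` next to their latest commit, and `master` should show `[origin/master]`, which `git clone` set up automatically.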

### Issuing pull requests

When working in a collaborative environment, you don't usually want to push your contributions directly to a repository's master branch. One of the reasons is that the code you modified is probably not just your own, which means you may want to get it double-checked by a more knowledgeable pair of eyes. Another reason is that it is usually good practice to let someone else be aware of the changes you introduced, in case you decide to take that long-awaited tropical beach vacation before everything breaks down...
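
On the git side, preparing a pull request usually boils down to committing on a separate branch and pushing that branch instead of `master`. The following is a rough sketch of that flow, assuming once more a local bare repository in place of GitHub; the `pr_demo` directory, the `fix-plot-labels` branch name and the file contents are invented for the example.

```shell
set -e
rm -rf pr_demo && mkdir pr_demo

# Stand-in for the shared remote repository
git init -q --bare pr_demo/server.git
git -C pr_demo/server.git symbolic-ref HEAD refs/heads/master

# Local repository with one commit on master, already pushed
git init -q pr_demo/local
git -C pr_demo/local symbolic-ref HEAD refs/heads/master
git -C pr_demo/local config user.name "Demo User"            # placeholder identity
git -C pr_demo/local config user.email "demo@example.com"
git -C pr_demo/local remote add origin "$PWD/pr_demo/server.git"
echo "plot without labels" > pr_demo/local/plot.txt
git -C pr_demo/local add plot.txt
git -C pr_demo/local commit -q -m "Add plot"
git -C pr_demo/local push -q origin master

# Do the new work on a feature branch instead of master
git -C pr_demo/local checkout -q -b fix-plot-labels
echo "plot with labels" > pr_demo/local/plot.txt
git -C pr_demo/local commit -q -am "Add axis labels to plot"

# Publish the feature branch; -u also sets it up as a tracking branch
git -C pr_demo/local push -q -u origin fix-plot-labels

# Commits the pull request would propose on top of master
git -C pr_demo/local log --oneline origin/master..fix-plot-labels
```

With a real GitHub remote, the pull request itself is then opened from the website, with `master` as the base branch and the feature branch as the one to compare. The remote `master` stays untouched until the pull request is reviewed and merged.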

There are some good practices that help make the most of code reviews during a pull request.

For the contributor:
1. Make your code reproducible and provide reproducibility instructions: it's easier to understand things if one can run them!
2. Describe the purpose of the code changes in the PR description
3. Give instructions on what you would like to have reviewed (ex: a specific function, a line of code that you're not sure about)
4. Make small pull requests! (ex: [The art of the small pull request](https://medium.com/letgo/the-art-of-the-small-pull-request-303f7ef63901))
5. Always assign your PR to one or two reviewers if possible, not more.


For the reviewer:
1. Review the PR promptly, or else let the contributor know if you can't.
2. Be nice! Everyone writes bad code, and most of the time, it's ok!
3. Categorize comments as "minor" (not necessary for PR approval) or "required".
4. Try to learn from other people's brains.


If you are curious, these are a couple of interesting additional references regarding collaboration workflows in git:
* https://reflectoring.io/github-fork-and-pull/
* https://www.atlassian.com/git/tutorials/comparing-workflows
