## Lab 1: Version Control with Git & GitHub

This notebook uses a Bash kernel, different from the Python kernel we will use for the remainder of the class.

To install **bash_kernel**, use one of the following commands:

`conda install -c conda-forge bash_kernel` or `pip install bash_kernel`.

Note that notebooks in Bash are more likely to change settings or important state. Read each cell before running it, and avoid running a cell twice unless you need to.

### Setting up Git

In [None]:
# Change settings
git config --global user.name "YOUR NAME HERE"
git config --global user.email "your@email.com"
git config --global color.ui auto

Line endings (the invisible characters inserted when you press Enter) are an important setting to configure correctly. Without proper handling, Git may treat a switch from a Mac to a Windows machine as a change to every line of a file. This is because the operating systems use different characters for line endings: Windows ends each line with a carriage return followed by a line feed (CRLF), while Linux and macOS use a line feed (LF) alone.
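To see the difference concretely, here is a small sketch that writes the same two lines with each style of ending and prints the raw bytes. The scratch directory and the file names `unix.txt` and `windows.txt` are made up for illustration.

```bash
# Scratch directory (hypothetical) so nothing in your working folder is touched
demo=$(mktemp -d)

# Same text, once with Linux/Mac endings (LF) and once with Windows endings (CRLF)
printf 'first\nsecond\n' > "$demo/unix.txt"
printf 'first\r\nsecond\r\n' > "$demo/windows.txt"

# od -c prints every byte; the Windows file shows \r before each \n
od -c "$demo/unix.txt"
od -c "$demo/windows.txt"

# The Windows file is one byte longer per line
unix_bytes=$(wc -c < "$demo/unix.txt")
windows_bytes=$(wc -c < "$demo/windows.txt")
echo "LF: $((unix_bytes)) bytes, CRLF: $((windows_bytes)) bytes"
```

The `core.autocrlf` setting tells Git how to convert between these two forms so that such byte-level differences are not recorded as changes.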

For those whose file editors are in Linux/Mac formats:

In [None]:
git config --global core.autocrlf input

For those whose file editors are in Windows formats:

In [None]:
git config --global core.autocrlf true

Git opens a text editor in several situations, most commonly for writing commit messages and for interactive rebasing. Below are suggested settings depending on the editor you use. We will mostly use `nano` in this class.

|Editor | Configuration command |
| :------------- | :----------: |
| **nano** | `git config --global core.editor "nano -w"` |
| Atom | `git config --global core.editor "atom --wait"` |
| BBEdit (Mac, with command line tools) | `git config --global core.editor "bbedit -w"` |
| Sublime Text (Mac) | `git config --global core.editor "/Applications/Sublime\ Text.app/Contents/SharedSuppo/bin/subl -n -w"` |
| Sublime Text (Win, 32-bit install) | `git config --global core.editor "'c:/program files (x86)/sublime text 3/sublime_text.exe' -w"` |
| Sublime Text (Win, 64-bit install) | `git config --global core.editor "'c:/program files/sublime text 3/sublime_text.exe' -w"` |
| Notepad++ (Win, 32-bit install) | `git config --global core.editor "'c:/program files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"` |
| Notepad++ (Win, 64-bit install) | `git config --global core.editor "'c:/program files/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"` |
| Kate (Linux) | `git config --global core.editor "kate"` |
| Gedit (Linux) | `git config --global core.editor "gedit --wait --new-window"` |
| Scratch (Linux) | `git config --global core.editor "scratch-text-editor"` |
| Emacs | `git config --global core.editor "emacs"` |
| Vim | `git config --global core.editor "vim"` |
| VS Code | `git config --global core.editor "code --wait"` |

You can check your config settings using `git config --list`.

In [None]:
git config --list

### Creating a repository locally and making a commit

Public repositories are less commonly started this way, but creating a repository locally first and adding remote repositories afterwards helps you understand what is happening under the hood.

Let's start by making a new, empty directory. If needed, first change to the directory where you want it to live.

In [None]:
pwd

In [None]:
#cd path/to/desired/directory

In [None]:
mkdir bridge
cd bridge

We'll then make this folder a git repository that allows the program to track versions of our files.

It is important that the folder we initialize is not a subdirectory of another repository. So, for example, you should not make your home directory a repository.

##### What is a way we can check to see if the parent directories are git repositories?

In [None]:
# Initialize git repository
git init

You can see the `.git` folder that is created when you include hidden folders in your `ls` view. Files and folders whose names start with `.` are hidden unless you pass the `-a` flag.

In [None]:
ls -a

The folder contains a number of interesting folders and files, but it is extremely rare that you should need to view or edit them.

Do not change anything here unless you're advanced!

In [None]:
ls .git

We can check the status of our git repository using `git status`. I use this often to see which files are staged, the current branch, and more.

In [None]:
# View status
git status

Let's add a file as our first commit using the `touch` program, which just makes empty files.

A README file (often `.txt` or `.md`) is a good first file to add.

In [None]:
# Create new file named 'README.md'
touch README.md

In [None]:
ls

In [None]:
git status

You should see `README.md` under "Untracked files". Git does not track a file in the repository until it has been staged for the first time.

Once we do the next step (staging), we will see it listed under "Changes to be committed" instead.

In [None]:
# Stage changes
git add README.md

In [None]:
git status
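The same progression can be watched in a throwaway repository using `git status --short`, which prints a two-character code per file (`??` for untracked, `A` for newly staged). This is only a sketch: the scratch directory and the `Demo User` identity are placeholders, stored inside that repository and not in your global settings.

```bash
# Throwaway repository (hypothetical) so the 'bridge' repo is left alone
repo=$(mktemp -d)
git init -q "$repo"
# Placeholder identity, saved only in this scratch repository
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

# 1. A new file starts out untracked
touch "$repo/README.md"
untracked=$(git -C "$repo" status --short)
echo "before add:   '$untracked'"

# 2. Staging moves it into the set of changes to be committed
git -C "$repo" add README.md
staged=$(git -C "$repo" status --short)
echo "after add:    '$staged'"

# 3. After committing, there is nothing left to report
git -C "$repo" commit -q -m "Add empty README file"
committed=$(git -C "$repo" status --short)
echo "after commit: '$committed'"
```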

##### What might happen if we made additional changes to the README.md file? (You're welcome to do this if you wish.)

We'll use the `-m` flag for making commits. I use it 99% of the time for speed. In addition, `git commit` without the flag opens a text editor, which is incompatible with the Jupyter notebook environment.

In [None]:
# Create commit from staged changes
git commit -m "Add empty README file"

Notice that after committing, our status has no changes to view.

We can see the **commits** in reverse chronological order when viewing the log.

In [None]:
# View log
git log

##### What is the long sequence of alphanumeric characters for?

##### Why might it be helpful to write useful commit messages?

##### Why might it be helpful to keep track of author information and date?

### Adding a remote from GitHub

Go to GitHub and create a new empty repository. Do not include a README, License, or .gitignore file on GitHub, as we are importing an existing repository from our local machine.

##### Why might including the above files cause an issue?

Copy the URL from GitHub and enter the following on your local machine:

In [None]:
# Add GitHub repo as remote named 'origin'
git remote add origin [URL]

You'll also see two more new commands.

These involve branches. For years, the default branch name in Git was `master` (master/slave imagery has been used in programming for many years). In 2020, GitHub decided to replace its default branch name with `main`. We will do the same here.

In [None]:
# Rename current branch to 'main', then push and set up tracking
git branch -M main
git push -u origin main

So what do push and pull do? They allow us to **push** commits from our local repository to the remote, or to **pull** (see also `git fetch`) commit info and data from the remote into our local repository.

The `-u` flag records `origin main` as the upstream of our local branch, so that later `git push` and `git pull` commands can leave out `origin main`. However, I suggest that you write those out while learning.
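To check what `-u` recorded, you can ask Git for a branch's upstream. The sketch below avoids GitHub entirely by using a bare repository in a scratch directory as a stand-in remote; the paths and the `Demo User` identity are placeholders for illustration only.

```bash
# Scratch area holding a stand-in "remote" and a local clone-to-be (both hypothetical)
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
# Placeholder identity, saved only in this scratch repository
git -C "$work/local" config user.name "Demo User"
git -C "$work/local" config user.email "demo@example.com"

# One commit so there is something to push
touch "$work/local/README.md"
git -C "$work/local" add README.md
git -C "$work/local" commit -q -m "Add empty README file"

# Same three steps as above, pointed at the local stand-in remote
git -C "$work/local" remote add origin "$work/remote.git"
git -C "$work/local" branch -M main
git -C "$work/local" push -q -u origin main

# Ask which remote branch 'main' now tracks
upstream=$(git -C "$work/local" rev-parse --abbrev-ref 'main@{upstream}')
echo "main tracks: $upstream"
```

Running `git status` in a repository with an upstream set also reports whether your branch is ahead of or behind the remote branch.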

### Activity: Add several more commits

We now want to make several changes:

1. Create a Python module, `analysis.py`, that has a function `download_data` to download the Fremont Bridge crossing data (https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD) and return it as a pandas DataFrame.

> Hint: Use Pandas' `read_csv` function.

##### What do we need to do to get access to functions from the Pandas library?

##### What is the benefit to enclosing this code in a function?

2. Add the `if __name__ == "__main__"` script at the bottom so that `analysis.py` can be run from the command line. It should call `download_data` and print the results.

> Hint: Search StackOverflow about this particular line of code. 

3. If you have time, add a README file.

In [None]:
# Run Python code
python analysis.py

##### What happens if you do not add the script from #2 and run the file anyway?

### Branches

Let's change our functionality to handle multiple bridge counting datasets from Seattle Open Data.

But this more advanced feature development may be messy, so we wish to do this work in a separate branch. Let's create one now:

In [None]:
# Create new branch named 'multi-dataset'
git branch multi-dataset

In [None]:
# View branches (* indicates current)
git branch

To switch to that branch, use `git checkout` (the same command is also used for switching to past commits).

In [None]:
# Checkout created branch
git checkout multi-dataset
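As an aside, `git checkout -b` creates a branch and switches to it in one step. The sketch below tries this in a throwaway repository rather than in `bridge`; the scratch directory and the `Demo User` identity are placeholders.

```bash
# Throwaway repository (hypothetical) with a single commit on 'main'
repo=$(mktemp -d)
git init -q "$repo"
# Placeholder identity, saved only in this scratch repository
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"
touch "$repo/README.md"
git -C "$repo" add README.md
git -C "$repo" commit -q -m "Add empty README file"
git -C "$repo" branch -M main

# Create 'multi-dataset' and switch to it in a single command
git -C "$repo" checkout -q -b multi-dataset

# Confirm which branch is checked out, then list all branches
current=$(git -C "$repo" rev-parse --abbrev-ref HEAD)
echo "current branch: $current"
git -C "$repo" branch
```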

### Activity: Replace data download logic

Let's improve our script. Instead of only working for the Fremont Bridge data, let's add a parameter to `download_data` named `dataset` that takes one of \[`fremont`, `spokane_street`\]. The data for Spokane Street bridge can be found here: https://data.seattle.gov/Transportation/Spokane-St-Bridge-Bicycle-Counter/upms-nr8w

##### How do we specify parameters for functions we create?

##### How might we change the `__main__` script to allow us to test the `spokane_street` dataset?

Stage and commit these changes.

In [None]:
# Stage and commit changes here


In [None]:
# Check the log for history


In [None]:
# View log
git log

In [None]:
# Run Python code
python analysis.py

Let's now make it easier to set parameters from the command line. We'll replace the *\__name\__* script in our Python module (.py) with this code:

```python
import sys

if __name__ == "__main__":
    if len(sys.argv) >= 2:
        print(download_data(sys.argv[1]))
    else:
        print(download_data())
```

Stage and commit these changes.

In [None]:
# Stage and commit changes here


In [None]:
# Check the log for history


In [None]:
# Run Python code with new dataset
python analysis.py spokane_street

In [None]:
# Run Python code with old dataset
python analysis.py fremont

### Checking out

Let's now return to development on our *main* branch.

We want to change our call to *pandas.read_csv* so that it automatically converts the date to a preferred format.

To check this, we'll first have our *\__name\__* script print out the data types by adding this code inside of the if-statement.

```python
    if ...:
        ...
        # df is the DataFrame returned by download_data
        print(df.dtypes)
```

Stage and commit these changes.

In [None]:
# Return to the main branch


In [None]:
# Stage and commit changes here


In [None]:
# Check the log for history


In [None]:
# Run Python code on main branch
python analysis.py

You should see the following results before making changes:
```
Date                             object
Fremont Bridge Total            float64
Fremont Bridge East Sidewalk    float64
Fremont Bridge West Sidewalk    float64
dtype: object
```

You should see `Date` change to a datetime type after the following change in *analysis.py* (in pandas 2.0 and later, `infer_datetime_format` is deprecated and can be omitted):

```python
pd.read_csv(..., parse_dates=["Date"], infer_datetime_format=True)
```

##### How could you learn more about *parse_dates* or *infer_datetime_format*?

##### What is another way to learn more?

Stage and commit these changes.

In [None]:
# TODO: Stage and commit these changes


### Merging and Merge Conflicts

Now let's try to merge the changes from the `multi-dataset` branch back into the main branch. We've set up a situation where further commits have been made on both branches.

First, let's view those changes using some special flags.

In [None]:
git log --graph --all

##### Can someone interpret the log graph here?

As the log grows, it is easier to view by adding the `--oneline` flag as well.

In [None]:
# Display log with one line, graph,
# and all branches


Now let's merge the `multi-dataset` branch back into `main`.

In [None]:
git merge multi-dataset

You should see the following:

```
CONFLICT (add/add): Merge conflict in analysis.py
Automatic merge failed; fix conflicts and then commit the result.
```

Let's view the status for more information.

In [None]:
git status

The status shows that *analysis.py* was changed on both branches. We must resolve these conflicts manually.

### Demonstration: How to interpret and fix a merge conflict with annotations from Git

You will likely encounter merge conflicts in a number of situations beyond merging two branches.

1. When your changes conflict with a teammate's changes. Mitigation strategies: using branches, working on different files, and/or pulling often.

2. When you've made changes on multiple machines, including the GitHub website. Mitigation strategies: committing your work before switching computers, not making file changes on the GitHub website.
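For reference outside of class, here is a minimal sketch that reproduces a conflict in a throwaway repository and then resolves it. The file `notes.txt`, the branch `feature`, and the `Demo User` identity are made up for illustration; your own conflict in *analysis.py* will contain different lines, but the markers work the same way.

```bash
# Throwaway repository (hypothetical) with one file on 'main'
repo=$(mktemp -d)
git init -q "$repo"
# Placeholder identity, saved only in this scratch repository
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"
printf 'greeting = "hello"\n' > "$repo/notes.txt"
git -C "$repo" add notes.txt
git -C "$repo" commit -q -m "Add notes"
git -C "$repo" branch -M main

# Change the same line differently on each branch
git -C "$repo" checkout -q -b feature
printf 'greeting = "hello from feature"\n' > "$repo/notes.txt"
git -C "$repo" commit -q -am "Change greeting on feature"
git -C "$repo" checkout -q main
printf 'greeting = "hello from main"\n' > "$repo/notes.txt"
git -C "$repo" commit -q -am "Change greeting on main"

# The merge stops with a conflict; '|| true' keeps the cell from failing
git -C "$repo" merge feature || true

# Git annotates the file: the lines between '<<<<<<< HEAD' and '=======' are
# from the current branch, those between '=======' and '>>>>>>> feature' are
# from the branch being merged
conflicted=$(cat "$repo/notes.txt")
echo "$conflicted"

# Resolve by writing the content you want (markers removed), then add and commit
printf 'greeting = "hello from main and feature"\n' > "$repo/notes.txt"
git -C "$repo" add notes.txt
git -C "$repo" commit -q -m "Merge feature into main"
git -C "$repo" log --oneline --graph --all
```

In practice you would edit the file in your editor, keep or combine the lines you want, and delete all three marker lines before staging.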

Add and commit the changes to create the merge commit.

But let's not push to our public repository until we've confirmed our code works.

In [None]:
# Add and commit changes


In [None]:
# Confirm code works with first dataset
python analysis.py fremont

In [None]:
# Confirm code works with second dataset
python analysis.py spokane_street

In [None]:
# Confirm code works with no parameter
python analysis.py

This last one doesn't work!
(Unless you were really clever, congratulations if so.)

### Dealing with backward compatibility

When working on a public codebase (or even privately with your teammates), you'll make it much harder for everyone if you change functionality as we did here.

##### Why is backwards compatibility important?

Let's confirm this code used to work by checking out previous code in the **main** branch before we merged in the multiple dataset commits.

In [None]:
git log

In [None]:
git checkout [SHA-for-earlier commit]
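Checking out a SHA leaves you in a "detached HEAD" state, meaning you are no longer on any branch. The sketch below shows the round trip in a throwaway repository, including how to get back with `git checkout main`. The file `notes.txt` and the `Demo User` identity are placeholders for illustration.

```bash
# Throwaway repository (hypothetical) with two commits on 'main'
repo=$(mktemp -d)
git init -q "$repo"
# Placeholder identity, saved only in this scratch repository
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"
printf 'version 1\n' > "$repo/notes.txt"
git -C "$repo" add notes.txt
git -C "$repo" commit -q -m "First version"
git -C "$repo" branch -M main
printf 'version 2\n' > "$repo/notes.txt"
git -C "$repo" commit -q -am "Second version"

# Look up the SHA of the commit before the latest one and check it out
earlier=$(git -C "$repo" rev-parse HEAD~1)
git -C "$repo" checkout -q "$earlier"

# The working files now match the earlier commit, and HEAD is not on a branch
old_contents=$(cat "$repo/notes.txt")
detached=$(git -C "$repo" rev-parse --abbrev-ref HEAD)
echo "contents: $old_contents (on: $detached)"

# Return to the tip of 'main' when you are done looking around
git -C "$repo" checkout -q main
new_contents=$(cat "$repo/notes.txt")
echo "contents: $new_contents (on: $(git -C "$repo" rev-parse --abbrev-ref HEAD))"
```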

### Activity: Update `download_data` to allow for either zero or one parameter

##### How can we make a parameter be optional?

In [None]:
# Confirm fix works for zero parameters
python analysis.py

In [None]:
# Confirm code works for one parameter
python analysis.py spokane_street

### Pulling / Pushing changes with remote

When you're happy with your changes, you should make sure you've pulled the latest changes from your remote, making sure to handle any conflicts that arise.

Then, push your changes.

In [None]:
# Pull from remote
git pull origin main

In [None]:
# Push to remote
git push origin main