# Introduction to Git
  
This course introduces learners to version control using Git. You will discover the importance of version control when working on data science projects and explore how you can use Git to track files, compare differences, modify and save files, undo changes, and enable collaborative development through the use of branches. You will be introduced to the structure of a repository, learn how to create new repositories and clone existing ones, and see how Git stores data. By working through typical data science tasks, you will gain the skills to handle conflicting files.
  
In the first chapter, you’ll learn what version control is and why it is essential for data projects. Then, you’ll discover what Git is and how to use it for a version control workflow.

## Introduction to version control with Git
  
In this course we'll learn how to use Git for version control.
  
**What is a version?**
  
Before we discuss Git and version control, let's define a version, which is the contents of a file at a given point in time. It also includes metadata, or information associated with the file, such as the author, where it is located, the file type, and when it was last saved.
  
**What is version control?**
  
Version control is a group of systems and processes to manage changes made to documents, programs, and directories. Version control isn't just for software. Anything that changes over time or needs to be shared can benefit from using version control.
  
<center><img src='../_images/introduction-to-version-control-with-git.png' alt='img' width='740'></center>
  
**What is version control?**
  
Version control allows us to track files in different states and lets multiple people work on the same files simultaneously, a concept known as continuous development. It also allows us to combine different versions, identify a particular version of a file, and revert changes.
  
<center><img src='../_images/introduction-to-version-control-with-git1.png' alt='img' width='740'></center>
  
**Why is version control important?**
  
To illustrate why version control is essential when working with data, consider a common scenario of modifying a dataset. We save separate copies at various time intervals, using fairly similar names. We then produce new analyses as the dataset changes, but matching these outputs to the correct data can be difficult as time goes by!
  
<center><img src='../_images/introduction-to-version-control-with-git2.png' alt='img' width='740'></center>
  
**Why is version control important?**
  
Put another way, a data project without version control is like cooking without a recipe - it'll be difficult to remember how to produce the same results again.
  
**Git**
  
One popular program for version control is called Git. Git is open source and scalable to easily track everything from small solo projects to complex collaborative efforts with large teams! Note that Git is not the same as GitHub, which is a cloud-based Git repository hosting platform. However, it's common to use Git with GitHub!
  
<center><img src='../_images/introduction-to-version-control-with-git3.png' alt='img' width='740'></center>
  
**Benefits of Git**
  
A key benefit of Git is that it stores every version we commit, so committed work is very hard to lose. Also, Git automatically notifies us when our work conflicts with someone else's, so it's harder to accidentally overwrite content. Additionally, Git can synchronize work done by different people on different machines.
  
<center><img src='../_images/introduction-to-version-control-with-git4.png' alt='img' width='740'></center>
  
**Using Git**
  
A common method to use Git is via the shell, also known as the terminal. The shell is a program for executing commands. Before we use Git, we'll run through some shell commands that will often be used in our version control workflow, such as previewing, modifying, and inspecting files or directories. Note that a directory is often referred to on a computer as a folder.
  
<center><img src='../_images/introduction-to-version-control-with-git5.png' alt='img' width='740'></center>
  
**Useful shell commands**
  
To see our location, we can execute `pwd`, which stands for print working directory. Here, we are in the Documents directory of a user called Repl. To see what is in our current directory we can use the `ls` command. This returns a list of all files and directories. There is a directory called archive and three csv files.
  
<center><img src='../_images/introduction-to-version-control-with-git6.png' alt='img' width='740'></center>
  
**Changing directory**
  
If we need to change directory, we can execute `cd` followed by the directory we want to move into. Here, we navigate to the archive directory. Rechecking our location confirms we have successfully moved.
  
<center><img src='../_images/introduction-to-version-control-with-git7.png' alt='img' width='740'></center>
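The navigation commands above can be tried safely in a throwaway directory. This is a minimal sketch: the layout (an `archive` directory and three CSV files) mimics the one described, but the file names are made up for illustration.

```shell
# Build a scratch directory so nothing in your real files is touched.
# The CSV file names below are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"
mkdir archive
touch data_2021.csv data_2022.csv data_2023.csv

pwd          # print the working directory (the scratch directory)
ls           # list its contents: archive plus the three CSV files

cd archive   # move into the archive directory
pwd          # the path now ends in /archive
cd ..        # move back up one level
```

Running `pwd` before and after `cd` is a quick way to confirm where a later command will act.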
  
**Editing a file**
  
We can even use the shell to preview and modify files. In the documents directory, we use the `nano` command followed by the filename, which opens a text editor. We can delete, add, or change contents of a file, and save using control and o, then exit the editor using control and x to return to the shell.
  
<center><img src='../_images/introduction-to-version-control-with-git8.png' alt='img' width='740'></center>
  
**Editing a file**
  
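We can also use `echo` to create or edit a file. Here, we create a file called todo.txt with a reminder to review our data for duplicates, using a single arrow (`>`) to send the text to the file. Note that a single arrow overwrites the file if it already exists, so to append content to an existing file we use two arrows (`>>`) instead of one.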
  
<center><img src='../_images/introduction-to-version-control-with-git9.png' alt='img' width='740'></center>
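As a small sketch of the difference between one arrow and two, the commands below create todo.txt and then append to it. The wording of both reminders is an assumption made for this example, and everything runs in a scratch directory.

```shell
# Work in a scratch directory so no existing todo.txt is overwritten.
cd "$(mktemp -d)"

# One arrow (>) creates the file, or overwrites it if it already exists.
echo "Review for duplicate records" > todo.txt

# Two arrows (>>) append a new line to the end of the existing file.
echo "Check for missing values" >> todo.txt

# Print the file: it now holds both reminders, one per line.
cat todo.txt
```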
  
**Checking Git version**
  
Different versions of software have different functionality, and Git is no exception. We can check which version of Git we have installed by typing `git --version` in the shell.
  
<center><img src='../_images/introduction-to-version-control-with-git10.png' alt='img' width='740'></center>
  
**Let's practice!**
  
Now let's check our understanding of version control!

### Using the shell
  
Git commands are typically performed using the shell.
  
Understanding some common shell commands allows you to perform more of your Git workflow in the shell without having to spend time navigating different programs.
  
In this exercise, you will need to perform shell commands to identify how many files and directories are in the data directory.
  
---
  
Possible answers
  
- [ ] Two
- [x] One
- [ ] Three
  
Solution
  
```sh
$ ls                                                           
data  report.md
$ cd data
$ ls
mental_health_survey.csv
```
  
Super shell skills! Being able to navigate directories and interact with files is extremely beneficial as you begin to work with Git.

### Checking the version of Git
  
Just like you need to know what version of a file you are working with, it's important to understand which version of Git is installed on your computer so you're aware of what functionality it offers.
  
---
  
1. Using the terminal, enter the command to find out what version of Git is installed.

```sh
$ git --version
git version 2.37.1 (Apple Git-137.1)
```


Great work! The `--version` flag allows you to check what version of Git is installed locally, which is useful if you upgrade, or need to compare versions among colleagues.

## Saving files
  
Now let's explore how Git stores information, along with looking at a workflow to update files and check their status.
  
**A repository**
  
To start, let's discuss the components of a Git project. We'll be working with a project about mental health in tech throughout the course, which is shown here.
  
<center><img src='../_images/saving-files-in-git.png' alt='img' width='740'></center>
  
**A repository**
  
There are two parts - the first is the files and directories that we create and edit, in this case a funding document, a markdown file containing a report, and a directory called data.
  
<center><img src='../_images/saving-files-in-git1.png' alt='img' width='740'></center>
  
**A repository**
  
The second part is the extra information that Git records about the project's history. The combination of these two things is called a repository, often referred to as a repo. Git stores all of its extra information in a directory called `.git`, located in the main directory of the repo. Git expects this information to be laid out in a particular way, so we should not edit or delete `.git`.
  
<center><img src='../_images/saving-files-in-git2.png' alt='img' width='740'></center>
  
**Staging and committing**
  
Now let's discuss how to make changes in a repo. We save a draft by placing it in a staging area. We save files, and update the repo in the process, through a commit.
  
<center><img src='../_images/saving-files-in-git3.png' alt='img' width='740'></center>
  
**Staging and committing**
  
Putting files in the staging area is like placing a letter in an envelope, while making a commit is like putting the envelope in a mailbox. We can add more things to the envelope or take things out as often as we want, but once we put it in the mailbox we can't make further changes.
  
<center><img src='../_images/saving-files-in-git4.png' alt='img' width='740'></center>
  
**Accessing the `.git` directory**
  
Although we shouldn't edit the `.git` directory, it may be helpful to see what's inside. It won't display when using the shell `ls` command, as it's a hidden directory. A hidden directory is a directory not displayed to users, typically because it stores information to enable programs to run. But if we add the `-a` flag to `ls` it shows up along with some hidden files!
  
<center><img src='../_images/saving-files-in-git5.png' alt='img' width='740'></center>
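To see the hidden directory for yourself, the sketch below creates an empty repo in a scratch location and lists it both ways. It assumes `git init` is used to create the repo (creating repos is covered later in the course), and the file name is only an example.

```shell
# Create a scratch repo; git init creates the hidden .git directory.
repo=$(mktemp -d)
cd "$repo"
git init -q
touch report.md

ls       # shows only report.md
ls -a    # also shows ., .., and the hidden .git directory
```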
  
**Making changes to files**
  
Let's visualize the Git storage workflow. Here, we modify a markdown file called report.md, and store five draft updates to the staging area as we progress. We commit, or save, the second and fifth versions of the file in the staging area, and with each commit our `.git` directory is modified to reflect the state of the repo.
  
<center><img src='../_images/saving-files-in-git6.png' alt='img' width='740'></center>
  
**Git workflow**
  
So, our Git workflow is to modify a file, save the draft to the staging area, commit the updated file to our repo, and repeat!
  
**Modifying a file**
  
To execute this workflow we can use `nano` to open a text editor for the report file. We add three lines of text and save it using control-O and control-X.
  
<center><img src='../_images/saving-files-in-git7.png' alt='img' width='740'></center>
  
**Saving a file**
  
To add the updated file to the staging area we use the command `git add` followed by the filename. Alternatively, we can add all modified files in the current directory using `git add .`, as a dot represents all files and directories in our current location.
  
<center><img src='../_images/saving-files-in-git8.png' alt='img' width='740'></center>
  
**Making a commit**
  
We then commit our drafts using `git commit`. We add the `-m` flag to allow us to include a log message about our commit, placing it in quotes. The log message is important as we can refer to it later. Best practice is to keep it short and concise.
  
<center><img src='../_images/saving-files-in-git9.png' alt='img' width='740'></center>
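Put together, the modify, stage, and commit steps might look like the sketch below. It runs in a scratch repo, uses `echo` in place of `nano` so it can run without an editor, and sets a placeholder name and email (both assumptions) so that the commit succeeds on a machine where Git has not been configured.

```shell
# Scratch repo with a placeholder identity (hypothetical values).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# 1. Modify a file (echo stands in for editing with nano).
echo "# Mental Health in Tech Survey" > report.md

# 2. Save the draft to the staging area.
git add report.md

# 3. Commit the staged file with a short log message.
git commit -q -m "Add report title"

# Show a one-line summary of the commit that was just made.
git log --oneline
```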
  
**Check the status of files**
  
If we are making lots of changes then it's useful to know the status of our repo. We can use the `git status` command, which tells us which files are in the staging area, and which files have changes that aren't in the staging area yet. In this case, we see report.md has been modified and is in the staging area, so we make a commit.
  
<center><img src='../_images/saving-files-in-git10.png' alt='img' width='740'></center>
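The following sketch shows how `git status` separates staged changes from unstaged ones. The repo, file names, and contents are invented for the example. The `--short` flag prints a compact form in which a leading `M` marks a change that is staged, and an `M` in the second column marks a change that is not yet staged.

```shell
# Scratch repo with a placeholder identity (hypothetical values).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Commit two files so there is a baseline to compare against.
echo "# Mental Health in Tech Survey" > report.md
echo "age,gender" > survey.csv
git add .
git commit -q -m "Add report and survey data"

# Change both files, but stage only the report.
echo "TODO: add summary statistics." >> report.md
echo "49,M" >> survey.csv
git add report.md

git status           # full description of staged and unstaged changes
git status --short   # compact form: "M  report.md" and " M survey.csv"
```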
  
**Let's practice!**
  
Let's modify some files!

### Where does Git store information?
  
Your home directory `/home/repl` contains a repo called `mh_survey`, which has a directory called `data`.
  
Where is information about the history of the files in `/home/repl/mh_survey/data` stored?
  
---
  
Possible answers
  
- [ ] `/home/repl/.git`
- [x] `/home/repl/mh_survey/.git`
- [ ] `/home/repl/mh_survey/data/.git`
  
Yes, all of the information about a repo is stored under its main directory, `mh_survey`.
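One way to check this from inside a subdirectory is to ask Git where the repo's main directory is. The sketch below rebuilds a similar layout in a scratch location (an assumption, so that it does not depend on `/home/repl` existing) and uses `git rev-parse --show-toplevel`.

```shell
# Recreate the layout in a scratch location: mh_survey with a data directory.
base=$(mktemp -d)
mkdir -p "$base/mh_survey/data"
cd "$base/mh_survey"
git init -q

# Move into the subdirectory and ask Git for the repo's main directory.
cd data
git rev-parse --show-toplevel   # prints the path ending in mh_survey

ls -a ..   # the parent directory (mh_survey) contains .git
ls -a .    # the data directory has no .git of its own
```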

### Adding a file
  
For the remainder of the course, you will be working on a Git project analyzing the mental health of employees working at tech companies.
  
In this exercise, you will complete a version control workflow to modify the `mental_health_survey.csv` file.
  
You are located in `mh_survey/data`, which contains the csv file you need to edit.
  
---
  
1. Add a new row of data at the end of `mental_health_survey.csv` containing: `"49,M,No,Yes,Never,Yes,Yes,No"`
2. Place the updated file in the staging area.
3. Commit the modified file with the log message `"Adding one new participant's data"`

```sh
echo "49,M,No,Yes,Never,Yes,Yes,No" >> mental_health_survey.csv
git add mental_health_survey.csv
git commit -m "Adding one new participant's data"
```

Great Git skills! In three commands you've added a new row of data to a file, stored it as a draft, and made a commit to update the repo!

### Adding multiple files
  
You've added one more task to `report.md` and an extra row of participant data to `mental_health_survey.csv`:
  
```shell
report.md: "TODO: Add data visualizations."
  
mental_health_survey.csv: "49,M,No,Yes,Never,Yes,Yes,No"
```
  
You need to figure out which files are in the repo, and which are in the staging area, so you can update everything.
  
---
  
1. Check which files are in the staging area but not yet committed.
2. Add all files in your current directory and all subdirectories into the staging area.
3. Commit all files in the staging area with the log message `"Added 3 participants and a new section in report"`

```sh
# Check the status of the repo
git status
# Add all files in the current directory and its subdirectories to the staging area
git add .
# Commit all staged files with the log message
git commit -m "Added 3 participants and a new section in report"
```

`git status` is a great way to see where you are in the version control workflow, allowing you to take steps to ensure everything is up-to-date. The commit output shows that two files were changed!

## Comparing files
  
We've seen the workflow for drafting and saving updates, but if we are making lots of changes we need a way to compare versions as we make modifications!
  
**Why compare files?**
  
Perhaps we've made changes to a machine learning model, but we're getting poorer performance as a result. We want to revert our changes, but can't remember what the code for the model looked like previously.
  
**Comparing a single file**
  
Luckily, Git provides commands for checking the current state of our files versus at other times. Suppose we want to edit the report.md file. Inside the text editor, we add two lines representing tasks to complete.
  
```sh
nano report.md
```
  
<center><img src='../_animations/nano_report_md.gif' alt='img' width='740'></center>
  
**Updating the file**
  
We've only edited one file, so we can use `git add .` to add the report to the staging area. We then commit our changes.
  
<center><img src='../_images/comparing-files-in-git.png' alt='img' width='740'></center>
  
**Updating the file again**
  
A while later we need to update the file again, this time removing the executive summary task and adding a reminder to cite the funding source. We can compare the last committed version of a file with the unstaged version by using the `git diff` command followed by the filename.
  
<center><img src='../_images/comparing-files-in-git1.png' alt='img' width='740'></center>
  
**Comparing an unstaged file with the last commit**
  
The output shows two versions of report.md, where a is the first version, or the last one to be committed, and b is the second, or the one we have edited but not yet added to the staging area.
  
<center><img src='../_images/comparing-files-in-git2.png' alt='img' width='740'></center>
  
**Comparing an unstaged file with the last commit**
  
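The line with the two at symbols tells us the location of the changes, where each pair of numbers gives a start line and a number of lines. The minus one and five means this section of the diff covers five lines starting at line one in version a, and the plus one and five means it covers five lines starting at line one in version b. The lines that were actually removed or added are marked individually below this line.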
  
<center><img src='../_images/comparing-files-in-git3.png' alt='img' width='740'></center>
  
**Comparing an unstaged file with the last commit**
  
Lines starting with a minus symbol written in red have been removed, the executive summary line in this case,
  
<center><img src='../_images/comparing-files-in-git4.png' alt='img' width='740'></center>
  
**Comparing an unstaged file with the last commit**
  
and lines starting with a plus symbol and written in green have been added - which is the last one.
  
<center><img src='../_images/comparing-files-in-git5.png' alt='img' width='740'></center>
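To reproduce a diff like this one, the sketch below commits a five-line report, then removes the executive summary task and adds a funding reminder at the end. The exact wording of the lines is an assumption, as is the placeholder identity. With these contents the location line should read `@@ -1,5 +1,5 @@`, because the changed section spans five lines starting at line one in both versions.

```shell
# Scratch repo with a placeholder identity (hypothetical values).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First version of the report: a title and four tasks.
printf '%s\n' \
  "# Mental Health in Tech Survey" \
  "TODO: write executive summary." \
  "TODO: include link to raw data." \
  "TODO: add references." \
  "TODO: add summary statistics." > report.md
git add report.md
git commit -q -m "Add tasks to report"

# Second version: the executive summary task is removed and a
# funding reminder is added at the end. It is not staged.
printf '%s\n' \
  "# Mental Health in Tech Survey" \
  "TODO: include link to raw data." \
  "TODO: add references." \
  "TODO: add summary statistics." \
  "TODO: cite funding sources." > report.md

# Compare the unstaged file (b) with the last commit (a).
git diff report.md
```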
  
**Comparing a staged file with the last commit**
  
What if we had already added the file to the staging area? We can use the `git diff` command again, but this time we add the `-r` flag to indicate we want to look at a particular revision of the file. Adding `HEAD`, which is a shortcut for the most recent commit, allows us to see the difference between the report file, including the changes we have staged, and the version in the last commit. Note that the `-r` flag won't work if we don't put `HEAD` afterwards. Git also provides `git diff --staged`, which compares only what is in the staging area with the last commit.
  
<center><img src='../_images/comparing-files-in-git6.png' alt='img' width='740'></center>
  
**Comparing a staged file with the last commit**
  
As we can see, the output gives us the same information as using `git diff` for a file that hasn't been added to the staging area!
  
<center><img src='../_images/comparing-files-in-git7.png' alt='img' width='740'></center>
  
**Comparing multiple staged files with the last commit**
  
What if we have more than one file in the staging area? Here we use cd to switch into the data directory, use `nano` to modify mh-tech-survey.csv to add an extra participant's survey responses, then add the file to the staging area.
  
<center><img src='../_images/comparing-files-in-git8.png' alt='img' width='740'></center>
  
**Comparing multiple staged files with the last commit**
  
In this case, we can use `git diff -r HEAD` without a filename to show the differences for all files in the staging area compared with the last commit. In the output, we can see two files have been modified. One new line was added to the end of the mh-tech-survey.csv file, shown in green, along with the two changes to the report.
  
<center><img src='../_images/comparing-files-in-git9.png' alt='img' width='740'></center>
  
**Recap**
  
To recap, if we want to compare an unstaged file with the last commit we use `git diff` followed by the filename. To see a staged file versus the last commit we run `git diff -r HEAD` followed by the filename. For comparing all staged files with the last commit it's `git diff -r HEAD` on its own.
  
<center><img src='../_images/comparing-files-in-git10.png' alt='img' width='740'></center>
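The sketch below runs these comparisons on a file that has already been staged. The repo, file contents, and placeholder identity are assumptions made for the example. It also runs `git diff --staged`, a standard Git option that compares the staging area with the last commit. Because nothing is left unstaged here, it prints the same diff as `git diff -r HEAD`.

```shell
# Scratch repo with a placeholder identity (hypothetical values).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "# Mental Health in Tech Survey" > report.md
git add report.md
git commit -q -m "Add report title"

# Edit the report and add it to the staging area.
echo "TODO: cite funding sources." >> report.md
git add report.md

git diff report.md            # prints nothing: no unstaged changes remain
git diff -r HEAD report.md    # shows the staged change against the last commit
git diff --staged report.md   # same diff here, limited to the staging area
git diff -r HEAD              # the same comparison for every changed file
```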
  
**Let's practice!**
  
Now let's use Git to compare the difference between files at different times.