# Introduction to Git
  
This course introduces learners to version control using Git. You will discover the importance of version control when working on data science projects and explore how you can use Git to track files, compare differences, modify and save files, undo changes, and collaborate through branches. You will be introduced to the structure of a repository, learn how to create new repositories and clone existing ones, and see how Git stores data. By working through typical data science tasks, you will gain the skills to handle conflicting files.
  
In the first chapter, you’ll learn what version control is and why it is essential for data projects. Then, you’ll discover what Git is and how to use it for a version control workflow.

## Introduction to version control with Git
  
In this course we'll learn how to use Git for version control.
  
**What is a version?**
  
Before we discuss Git and version control, let's define a version, which is the contents of a file at a given point in time. It also includes metadata, or information associated with the file, such as the author, where it is located, the file type, and when it was last saved.
  
**What is version control?**
  
Version control is a group of systems and processes to manage changes made to documents, programs, and directories. Version control isn't just for software. Anything that changes over time or needs to be shared can benefit from using version control.
  
<center><img src='../_images/introduction-to-version-control-with-git.png' alt='img' width='740'></center>
  
Version control allows us to track files in different states and lets multiple people work on the same files simultaneously, a concept known as continuous development. It also allows us to combine different versions, identify a particular version of a file, and revert changes.
  
<center><img src='../_images/introduction-to-version-control-with-git1.png' alt='img' width='740'></center>
  
**Why is version control important?**
  
To illustrate why version control is essential when working with data, consider a common scenario of modifying a dataset. We save separate copies at various time intervals, using fairly similar names. We then produce new analyses as the dataset changes, but matching these outputs to the correct data can be difficult as time goes by!
  
<center><img src='../_images/introduction-to-version-control-with-git2.png' alt='img' width='740'></center>
  
Put another way, a data project without version control is like cooking without a recipe - it'll be difficult to remember how to produce the same results again.
  
**Git**
  
One popular program for version control is called Git. Git is open source and scales from small solo projects to complex collaborative efforts with large teams. Note that Git is not the same as GitHub, which is a cloud-based platform for hosting Git repositories. However, it's common to use Git with GitHub!
  
<center><img src='../_images/introduction-to-version-control-with-git3.png' alt='img' width='740'></center>
  
**Benefits of Git**
  
A key benefit of Git is that it keeps every version we commit, so committed work is very hard to lose. Also, Git automatically notifies us when our work conflicts with someone else's, so it's harder to accidentally overwrite content. Additionally, Git can synchronize work done by different people on different machines.
  
<center><img src='../_images/introduction-to-version-control-with-git4.png' alt='img' width='740'></center>
  
**Using Git**
  
A common method to use Git is via the shell, also known as the terminal. The shell is a program for executing commands. Before we use Git, we'll run through some shell commands that will often be used in our version control workflow, such as previewing, modifying, and inspecting files or directories. Note that a directory is often referred to on a computer as a folder.
  
<center><img src='../_images/introduction-to-version-control-with-git5.png' alt='img' width='740'></center>
  
**Useful shell commands**
  
To see our location, we can execute `pwd`, which stands for print working directory. Here, we are in the `Documents` directory of a user called Repl. To see what is in our current directory, we can use the `ls` command, which returns a list of all files and directories. There is a directory called `archive` and three CSV files.
  
<center><img src='../_images/introduction-to-version-control-with-git6.png' alt='img' width='740'></center>
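
As a minimal sketch of these two commands, the following builds a throwaway directory with a similar layout (an `archive` directory plus three CSV files) and then inspects it. The CSV file names and the use of `mktemp` are assumptions for illustration, not the files shown in the course.

```sh
# Build a scratch directory so the example does not touch real files.
# The CSV file names below are placeholders, not the ones from the course.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/archive"
touch "$demo_dir/finance.csv" "$demo_dir/health.csv" "$demo_dir/survey.csv"
cd "$demo_dir"

# Print the absolute path of the directory we are currently in.
pwd

# List the files and directories in the current directory.
ls
```

The path printed by `pwd` will differ on every machine, because `mktemp` picks a new temporary location each time.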
  
**Changing directory**
  
If we need to change directory, we can execute `cd` followed by the directory we want to move into. Here, we navigate to the archive directory. Rechecking our location confirms we have successfully moved.
  
<center><img src='../_images/introduction-to-version-control-with-git7.png' alt='img' width='740'></center>
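
A short sketch of the same navigation, assuming a scratch directory created with `mktemp` so that it can be run anywhere:

```sh
# Scratch setup: a working directory that contains an archive directory.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/archive"
cd "$demo_dir"

# Move into the archive directory.
cd archive

# Recheck our location; the printed path should now end in /archive.
pwd
```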
  
**Editing a file**
  
We can even use the shell to preview and modify files. In the `Documents` directory, we use the `nano` command followed by the filename, which opens a text editor. We can delete, add, or change the contents of the file, save with Ctrl+O, and then exit the editor with Ctrl+X to return to the shell.
  
<center><img src='../_images/introduction-to-version-control-with-git8.png' alt='img' width='740'></center>
  
We can also use `echo` to create or edit a file. Here, we create a file called `todo.txt` with a reminder to review our data for duplicates. A single arrow (`>`) creates the file, or overwrites it if it already exists. To append content to an existing file instead, we use two arrows (`>>`).
  
<center><img src='../_images/introduction-to-version-control-with-git9.png' alt='img' width='740'></center>
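
The difference between the two arrows can be sketched as follows. The reminder texts are made up for illustration, and `cat` is used only to display the result.

```sh
# Work in a scratch directory so no real file is overwritten.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# A single arrow (>) creates todo.txt, or overwrites it if it already exists.
echo "Review for duplicate records" > todo.txt

# A double arrow (>>) appends a new line to the end of the existing file.
echo "Check for missing values" >> todo.txt

# Display the file to confirm that it now holds both lines.
cat todo.txt
```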
  
**Checking Git version**
  
Different versions of software have different functionality, and Git is no exception. We can check which version of Git we have installed by typing `git --version` in the shell.
  
<center><img src='../_images/introduction-to-version-control-with-git10.png' alt='img' width='740'></center>
  
**Let's practice!**
  
Now let's check our understanding of version control!

### Using the shell
  
Git commands are typically performed using the shell.
  
Understanding some common shell commands allows you to perform more of your Git workflow in the shell without having to spend time navigating different programs.
  
In this exercise, you will need to perform shell commands to identify how many files and directories are in the data directory.
  
---
  
Possible answers
  
- [ ] Two
- [x] One
- [ ] Three
  
Solution
  
```sh
$ ls
data  report.md
$ cd data
$ ls
mental_health_survey.csv
```
  
Super shell skills! Being able to navigate directories and interact with files is extremely beneficial as you begin to work with Git.

### Checking the version of Git
  
Just like you need to know what version of a file you are working with, it's important to understand which version of Git is installed on your computer so you're aware of what functionality it offers.
  
---
  
1. Using the terminal, enter the command to find out what version of Git is installed.

```sh
$ git --version
git version 2.37.1 (Apple Git-137.1)
```


Great work! The `--version` flag allows you to check what version of Git is installed locally, which is useful if you upgrade, or need to compare versions among colleagues.

## Saving files
  
Now let's explore how Git stores information, along with looking at a workflow to update files and check their status.
  
**A repository**
  
To start, let's discuss the components of a Git project. We'll be working with a project about mental health in tech throughout the course, which is shown here.
  
There are two parts - the first is the files and directories that we create and edit, in this case a funding document, a markdown file containing a report, and a directory called data.
  
The second part is the extra information that Git records about the project's history. The combination of these two things is called a repository, often referred to as a repo. Git stores all of its extra information in a directory called .git, located in the main directory of the repo. Git expects this information to be laid out in a particular way, so we should not edit or delete .git.
  
**Staging and committing**
  
Now let's discuss how to make changes in a repo. We save a draft by placing it in a staging area. We save files, and update the repo in the process, through a commit.
  
Putting files in the staging area is like placing a letter in an envelope, while making a commit is like putting the envelope in a mailbox. We can add more things to the envelope or take things out as often as we want, but once we put it in the mailbox we can't make further changes.
  
**Accessing the .git directory**
  
Although we shouldn't edit the `.git` directory, it may be helpful to see what's inside. It won't display when using the shell `ls` command, as it's a hidden directory. A hidden directory is a directory not displayed to users, typically because it stores information to enable programs to run. But if we add the `-a` flag, it shows up along with some hidden files!
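
To see this in practice, here is a minimal sketch that assumes a brand-new, empty repository. It is created with `git init` in a scratch directory purely so that there is a `.git` directory to look at.

```sh
# Create an empty repository in a scratch directory.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# A plain ls prints nothing here: the new repo has no visible files,
# and .git is hidden.
ls

# With the -a flag, hidden entries such as .git are listed as well.
ls -a
```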
  
**Making changes to files**
  
Let's visualize the Git storage workflow. Here, we modify a markdown file called report.md, and store five draft updates to the staging area as we progress. We commit, or save, the second and fifth versions of the file in the staging area, and with each commit our .git directory is modified to reflect the state of the repo.
  
**Git workflow**
  
So, our Git workflow is to modify a file, save the draft to the staging area, commit the updated file to our repo, and repeat!
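
The loop above can be sketched end to end as follows. This is an illustration under a few assumptions: it runs in a scratch repository created with `git init`, the name and email are placeholder values so that the commit works on an unconfigured machine, and `echo` stands in for editing the file by hand.

```sh
# Scratch repository with a placeholder identity for the commits.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# 1. Modify a file (echo stands in for editing report.md in nano).
echo "# Mental health in tech" > report.md

# 2. Save the draft to the staging area.
git add report.md

# 3. Commit the staged file to the repo.
git commit --quiet -m "Add report title"

# Repeat: modify, stage, and commit again.
echo "Draft of the first section" >> report.md
git add report.md
git commit --quiet -m "Add first section draft"
```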
  
**Modifying a file**
  
To execute this workflow, we can use `nano` to open a text editor for the report file. We add three lines of text, save with Ctrl+O, and exit with Ctrl+X.
  
**Saving a file**
  
To add the updated file to the staging area, we use the command `git add` followed by the filename. Alternatively, we can add all modified files in the current directory using `git add .`, as a dot represents all files and directories in our current location.
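
Both forms of `git add` might look like this in a scratch repository. The file name `funding.txt` is a hypothetical stand-in for the funding document, and `git status --short` is used only to show what ended up in the staging area.

```sh
# Scratch repository with two new files.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
echo "Report draft" > report.md
echo "Funding notes" > funding.txt

# Stage a single file by name.
git add report.md

# Stage everything else that is new or modified in the current directory.
git add .

# Show a compact summary; staged new files are marked with an A.
git status --short
```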
  
**Making a commit**
  
We then commit our drafts using `git commit`. We add the `-m` flag to include a log message about our commit, placing the message in quotes. The log message is important because we can refer to it later. Best practice is to keep it short and concise.
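
A minimal sketch of a commit with a log message, again in a scratch repository with a placeholder identity. The message text is made up for illustration, and `git log --oneline` is included only to show how the message can be looked up afterwards.

```sh
# Scratch repository with one staged file.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"
echo "Report draft" > report.md
git add report.md

# Commit the staged file; the text in quotes after -m is the log message.
git commit --quiet -m "Add first draft of report"

# Show the history, one line per commit, to see the message again.
git log --oneline
```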
  
**Check the status of files**
  
If we are making lots of changes, then it's useful to know the status of our repo. We can use the `git status` command, which tells us which files are in the staging area, and which files have changes that aren't in the staging area yet. In this case, we see that `report.md` has been modified and is in the staging area, so we make a commit.
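
To illustrate both categories that `git status` reports, the sketch below stages one change to `report.md` and then makes a further change that is left unstaged. The scratch repository, the placeholder identity, and the file contents are assumptions for the example.

```sh
# Scratch repository with one committed file.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"
echo "Report draft" > report.md
git add report.md
git commit --quiet -m "Add first draft of report"

# Make one change and stage it.
echo "Staged line" >> report.md
git add report.md

# Make a second change and leave it out of the staging area.
echo "Unstaged line" >> report.md

# report.md is now listed twice: once as staged, once as not staged.
git status
```

Running `git add report.md` again would move the second change into the staging area as well, after which a commit would save both.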
  
**Let's practice!**
  
Let's modify some files!