# Introduction to Version Control with Git
This course introduces learners to version control using Git. You will discover the importance of version control when working on data science projects and explore how you can use Git to track files, compare differences, modify and save files, undo changes, and allow collaborative development through the use of branches. You will be introduced to the structure of a repository, learn how to create new repositories and clone existing ones, and see how Git stores data. By working through typical data science tasks, you will gain the skills to handle conflicting files.

# 1. Introduction to Git
In the first chapter, you’ll learn what version control is and why it is essential for data projects. Then, you’ll discover what Git is and how to use it for a version control workflow.

#### What is a version?
1. Contents of a file at a given point in time
2. Metadata (information associated with the file):
    - The author of the file
    - Where it is located
    - The file type
    - When it was last saved
    
    
- Version control is a group of systems and processes
    - to manage changes made to documents, programs, and directories
- Version control isn't just for software. It is useful for anything that:
    - Changes over time, or
    - Needs to be shared
    
    
- Track files in different states
- Simultaneous file development (Continuous Development)
- Combine different versions of files
- Identify a particular version
- Revert changes

#### Some benefits of Git
- Git stores everything, so nothing is lost
- Git notifies us when there is conflicting content in files
- Git synchronizes across different people and computers

### Editing a file
#### nano
- Use **`nano`** to view, delete, add, or change the contents of a file
- **`nano finance.csv`**: opens `finance.csv` in the nano text editor
- **Save** changes: **`Ctrl+O`**, then `Enter` to confirm the file name
- **Exit** the text editor: **`Ctrl+X`**

#### echo
- **`echo`**: to create or edit a file
- **create a new file** `todo.txt`:
    - `echo "Review for duplicate records" > todo.txt`
    - uses 1 arrow `>`, which replaces the contents of the file if it already exists
- **add content to existing file** `todo.txt`:
    - `echo "Review for duplicate records" >> todo.txt`
    - uses 2 arrows `>>`
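The difference between the two arrows can be seen in a short sketch. It runs in a throwaway directory, and the second and third messages are made-up examples rather than text from the course.

```shell
# Work in a temporary directory so no real files are touched.
cd "$(mktemp -d)"

# One arrow: create todo.txt (or replace its contents if it exists).
echo "Review for duplicate records" > todo.txt

# Two arrows: append a second line to the existing file.
echo "Check column types" >> todo.txt
wc -l < todo.txt    # the file now has 2 lines

# One arrow again: the earlier lines are replaced by this single line.
echo "Start over" > todo.txt
cat todo.txt
```

Because `>` replaces existing content, `>>` is the safer choice when the file may already hold something worth keeping.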

#### Checking Git version
- `git --version`

### Saving files
- The components of a Git project:
    - A repository:
        - **1) Files and directories that we create and edit:**
            - files:
                - `funding.doc`
                - `report.md`
            - directory:
                - `data`
        - **2) The extra information that Git stores about the project history:**
            - `.git`
            
<img src='img/1.png' width="600" height="300" align="center"/>

- The combination of 1) the files and directories we create and 2) the extra information that Git stores about the project history is called a **repository**, often referred to as a **repo**.
- **Git stores all of its extra information in a directory called `.git`, located in the main directory of the repo.**
- Git expects this information to be laid out in a particular way, so do not edit or delete **`.git`**.

### Accessing the .git directory
- Although we shouldn't edit the `.git` directory, it may be helpful to see what's inside
- It won't display when using the shell `ls` command, as it's a hidden directory
- A **hidden directory** is a directory not displayed to users, typically because it stores information to enable programs to run.
- But if we add the **`-a`** flag (`ls -a`), it shows up, along with any other hidden files!
- **Note:** All the information about a repo is stored in its **main** or **parent** directory!


<img src='img/2.png' width="800" height="400" align="center"/>
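A minimal sketch of the same behaviour, assuming Git is installed. The repo name `demo-repo` is hypothetical, and `git init` is used here only to create an empty repository to look at.

```shell
# Create an empty repository in a temporary location.
cd "$(mktemp -d)"
git init -q demo-repo
cd demo-repo

# Plain ls skips hidden entries, so nothing is listed yet.
ls

# With -a the hidden .git directory appears, next to . and ..
ls -a
```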

#### Git workflow
- Modify a file
- Save the draft (add the file to the staging area)
- Commit the updated file
- Repeat
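The workflow above might look like this on the command line. This is a sketch under assumptions: the notes do not name the commands at this point, so staging is shown with `git add` and committing with `git commit -m`, and the file contents, commit messages, and author identity are made up for illustration.

```shell
# Throwaway repository with a placeholder identity (hypothetical values).
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# 1. Modify a file.
echo "# Quarterly report" > report.md
# 2. Save the draft by adding the file to the staging area.
git add report.md
# 3. Commit the updated file with a short log message.
git commit -q -m "Add report heading"

# 4. Repeat for the next change.
echo "Revenue rose in Q2." >> report.md
git add report.md
git commit -q -m "Add revenue note"

# One line per commit, to confirm both were recorded.
git log --oneline
```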

## Comparing files

### Comparing an unstaged file with the last commit
- If we are making lots of changes, we need a way to compare versions as we make modifications.
- Luckily, Git provides commands for checking the current state of our files in comparison with their previous states.
- **We can compare the last committed version of a file with the unstaged version by using the `git diff` command followed by the filename.**
    - `git diff report.md`
    
***
- In the screenshot below, `a` is the first version, that is, the last committed version, and `b` is the second version, the one we have not yet added to the staging area
- The line that starts and ends with `@@` tells us the location of the changes. It holds two pairs of numbers, and each pair gives a start line and a number of lines.
- The pair prefixed with `-` describes version `a`, and the pair prefixed with `+` describes version `b`
- `@@ -1,5 +1,5 @@` means the section shown starts at line 1 and spans 5 lines in version `a`, and also starts at line 1 and spans 5 lines in version `b`
- lines in red (prefixed with `-`) have been removed
- lines in green (prefixed with `+`) have been added

<img src='img/3.png' width="600" height="300" align="center"/>
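To make the `@@` line concrete, here is a small sketch with a three-line file in which only the second line changes. The file contents and the author identity are made up for the example.

```shell
# Throwaway repository with a placeholder identity (hypothetical values).
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

# Commit a three-line file.
printf 'line 1\nline 2\nline 3\n' > report.md
git add report.md
git commit -q -m "Add report"

# Change the second line without staging it.
printf 'line 1\nline two\nline 3\n' > report.md

# The header reads "@@ -1,3 +1,3 @@": the section starts at line 1 and
# spans 3 lines in both versions. "-line 2" is removed, "+line two" is added.
git diff report.md
```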

### Comparing a staged file with the last commit
- Again, we can use the `git diff` command, but this time we name the revision we want to compare against (written here with the **`-r` flag** followed by the revision)
- `HEAD` is a shortcut for the most recent commit
- **`git diff -r HEAD report.md`**: allows us to see the difference between the version of `report.md` in the last commit and the current file, including the changes already in the staging area
- To limit the comparison to what is in the staging area, Git also provides `git diff --staged report.md`

### Comparing multiple staged files with the last commit
- We can use `git diff -r HEAD` without a filename to show the differences for every changed file compared with the last commit:

<img src='img/4.png' width="800" height="400" align="center"/>

### Recap:
#### Compare an unstaged file with the last committed version:
`git diff filename`
#### Compare a staged file with the last committed version:
`git diff -r HEAD filename`
#### Compare all staged files with the last committed versions:
`git diff -r HEAD`
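The commands in the recap can be tried on a file that has one staged edit and one unstaged edit, which shows exactly what each comparison covers. This is a sketch with made-up file contents and author identity. The `--staged` form is an addition that does not appear in the recap.

```shell
# Throwaway repository with a placeholder identity (hypothetical values).
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "draft" > report.md
git add report.md
git commit -q -m "Add report"

# One edit goes to the staging area, a second one stays unstaged.
echo "staged edit" >> report.md
git add report.md
echo "unstaged edit" >> report.md

# Working file vs. staging area: only "+unstaged edit".
git diff report.md

# Working file vs. last commit: both added lines.
git diff -r HEAD report.md

# Staging area vs. last commit: only "+staged edit".
git diff --staged report.md
```

When nothing is staged for the file, the first two commands report the same changes.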

# 2. Making changes
Next, you’ll examine how Git stores data, learn essential commands to compare files and repositories at different times, and understand the process for restoring earlier versions of files in your data projects.

## Storing data with Git
### The commit structure
#### Git commits have 3 parts:
- **Commit**
    - contains the metadata
- **Tree**
    - tracks the names and locations of the files in the repo
- **Blob**
    - **B**inary **L**arge **Ob**ject
    - May contain data of any kind
    - Contains a compressed snapshot of the contents of the file when the commit happened

<img src='img/5.png' width="800" height="400" align="center"/>
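The three parts can be inspected directly with `git cat-file`, a command that is not covered in these notes and is shown here only as an illustration. The file name, contents, and author identity below are made up.

```shell
# Throwaway repository with a placeholder identity (hypothetical values).
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

mkdir data
echo "id,amount" > data/finance.csv
git add data/finance.csv
git commit -q -m "Add finance data"

# Commit: metadata such as the tree hash, author, and log message.
git cat-file -p HEAD

# Tree: the names and locations recorded for this commit.
git cat-file -p 'HEAD^{tree}'

# Blob: the snapshot of the file contents.
git cat-file -p HEAD:data/finance.csv
```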

### Git log
- We can view commit information using the **`git log`** command
- This will **display all commits made to the repo in reverse chronological order, starting with the most recent.**

<img src='img/6.png' width="400" height="200" align="center"/>

<img src='img/7.png' width="600" height="300" align="center"/>

- If the output doesn't fit in the terminal window, there will be a colon at the end of output, indicating there are more commits
- We can **move through the history by pressing the `space` bar.**
- When we want to **exit the log, press `q` to return to the terminal.**
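A short sketch of the order in which `git log` lists commits, using two made-up commits. The `--format=%s` option prints only the log message of each commit, and `--no-pager` prints the output directly instead of opening the scrollable view.

```shell
# Throwaway repository with a placeholder identity (hypothetical values).
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "first" > notes.txt
git add notes.txt
git commit -q -m "First commit"

echo "second" >> notes.txt
git add notes.txt
git commit -q -m "Second commit"

# The most recent commit is listed first.
git --no-pager log --format=%s
```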

#### Git hash
- The Git hash is a 40-character string of numbers and letters
- It is called a hash because Git produces it by running the content through a **hash function**, which always returns the same output for the same input.
- Hashes allow efficient data sharing between repos
    - If two files are the same, their hashes will be the same 
    - Therefore, Git only needs to compare hashes rather than entire files
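The point about identical files can be checked with `git hash-object`, which prints the hash Git would assign to a file's contents. The command is not part of these notes, and the file names and the text in `c.txt` are made up for the sketch.

```shell
# Throwaway repository so the example does not touch real files.
cd "$(mktemp -d)"
git init -q

# Two files with the same contents and one with different contents.
echo "Review for duplicate records" > a.txt
cp a.txt b.txt
echo "Something else" > c.txt

hash_a=$(git hash-object a.txt)
hash_b=$(git hash-object b.txt)
hash_c=$(git hash-object c.txt)

echo "a.txt: $hash_a"
echo "b.txt: $hash_b"
echo "c.txt: $hash_c"

# Comparing the two hashes is enough to tell whether the contents match.
if [ "$hash_a" = "$hash_b" ]; then
    echo "a.txt and b.txt have identical contents"
fi
```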
    
#### Finding a particular commit 
- First **`git log`**
- Find the particular commit you want to look at
- We only need the first 6-8 characters of the `hash`, as long as they identify a single commit
- **`git show c27fa856`**

<img src='img/8.png' width="600" height="300" align="center"/>

<img src='img/9.png' width="600" height="300" align="center"/>

- The output of `git show` shows the log entry for that commit, followed by a diff showing the changes that commit made compared with the commit before it
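A sketch of looking up a commit by a shortened hash. Instead of copying the characters from `git log` by hand, it asks Git for an 8-character abbreviation with `git rev-parse --short=8`, which is a convenience for the example and not a step from the course. File contents, messages, and identity are made up.

```shell
# Throwaway repository with a placeholder identity (hypothetical values).
cd "$(mktemp -d)"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "draft" > report.md
git add report.md
git commit -q -m "Add report"

echo "second line" >> report.md
git add report.md
git commit -q -m "Extend report"

# Abbreviated hash of the most recent commit.
short_hash=$(git rev-parse --short=8 HEAD)
echo "$short_hash"

# Log entry for that commit, followed by the diff it introduced.
git show "$short_hash"
```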


<img src='img/x.png' width="600" height="300" align="center"/>