# Lecture 16: Version control and Jupyter


## Version control


When you're editing a file and make a mistake, what do you do?


[Changes over time (history)](https://speakerdeck.com/aidanfeldman/git-graphically?slide=2)


Undo/redo are useful


How would you try out a change that touches multiple files?


[Branching](https://speakerdeck.com/aidanfeldman/git-graphically?slide=7)


### Git

1. In sidebar, click [Source Control](https://code.visualstudio.com/docs/sourcecontrol/overview)
1. [Install Git.](https://git-scm.com/downloads)
   - Many ways to install
   - In [Git for Windows](https://gitforwindows.org/), there are a _lot_ of options in the installer - stick to the defaults.
1. [Initialize repo](https://code.visualstudio.com/docs/sourcecontrol/overview#_initialize-a-repository)
1. Go to [Explorer](https://code.visualstudio.com/docs/getstarted/userinterface#_explorer-view)
1. Open [integrated terminal](https://code.visualstudio.com/docs/terminal/getting-started)
   - Windows:
     1. [Open](https://code.visualstudio.com/docs/terminal/getting-started#_run-commands-in-another-shell) [Git BASH](https://gitforwindows.org/)
     1. [`Select Default Profile`](https://code.visualstudio.com/docs/terminal/profiles) to be `Git BASH`
1. Set global [name](https://docs.github.com/en/get-started/getting-started-with-git/setting-your-username-in-git) and [email](https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/setting-your-commit-email-address#setting-your-commit-email-address-in-git) in Git
   - Your name and email can be set to anything you like
1. [Commit](https://code.visualstudio.com/docs/sourcecontrol/overview#_commit)
   - "There are no staged changes" -> "Always"


## Spreadsheets vs. programming languages


What do you like about spreadsheets?


### Why spreadsheets

- The easy stuff is easy
- Lots of people know how to use them
- Mostly just have to point, click, and scroll
- Data and logic live together as one


### Why programming languages

- Data and logic _don't_ live together
  - Why might this matter?


- More powerful, flexible, and expressive than spreadsheet formulas; you don't have to cram everything into a single line

  ```
  =SUM(INDEX(C3:E9,MATCH(B13,C3:C9,0),MATCH(B14,C3:E3,0)))
  ```

- Better at working with large data
  - [Google Sheets](https://support.google.com/drive/answer/37603) and [Excel](https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3) have hard limits at 1-5 million rows, but get slow long before that
- Reusable code (packages)
- Automation
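
As a rough illustration of the "single line" point, the `INDEX`/`MATCH` lookup above can be spread over several named steps in Python. This is a minimal sketch using plain dictionaries; the table contents and the names `sales_by_region`, `region`, and `quarter` are made up for illustration and are not from the original spreadsheet.

```python
# Hypothetical table: each row label maps column headers to values
sales_by_region = {
    "North": {"Q1": 100, "Q2": 150},
    "South": {"Q1": 80, "Q2": 95},
}

region = "South"  # plays the role of cell B13 (the row to match)
quarter = "Q2"    # plays the role of cell B14 (the column to match)

# Step 1: find the matching row
region_sales = sales_by_region[region]

# Step 2: find the matching column within that row
value = region_sales[quarter]

print(value)  # 95
```

Each step gets its own line and its own name, so an intermediate result like `region_sales` can be inspected on its own.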


### Side-by-side<sup>1</sup>

|                       Task |      Spreadsheets      | Programming Languages |
| -------------------------: | :--------------------: | :-------------------: |
|           **Loading data** |          Easy          |        Medium         |
|           **Viewing data** |          Easy          |        Medium         |
|         **Filtering data** |          Easy          |        Medium         |
|      **Manipulating data** |         Medium         |        Medium         |
|           **Joining data** |          Hard          |        Medium         |
| **Complicated transforms** | Impossible<sup>2</sup> |        Medium         |
|             **Automation** | Impossible<sup>2</sup> |        Medium         |
|        **Making reusable** |  Limited<sup>3</sup>   |        Medium         |
|         **Large datasets** |       Impossible       |         Hard          |

1. These ratings are obviously subjective
1. Not counting scripting, such as [Excel's new Python+pandas support](https://support.microsoft.com/en-us/office/introduction-to-python-in-excel-55643c2e-ff56-4168-b1ce-9428c8308545)
1. [Google Sheets supports named functions](https://support.google.com/docs/answer/12504534)


## Python vs. other languages

- Good for general-purpose _and_ data stuff
- Widely used in both industry and academia
- Relatively easy to learn
- Open source

![Python logo](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/110px-Python-logo-notext.svg.png)


## Where to Python

Python can be run in:

- A text file, using the `python` command
- [The interactive Python interpreter / command prompt / shell](https://www.python.org/shell/)
- An [integrated development environment (IDE)](https://runestone.academy/ns/books/published/thinkcspy/Appendices/usingIDE.html) like [Spyder](https://www.spyder-ide.org/) or [PyCharm](https://www.jetbrains.com/pycharm/)
- A [Jupyter notebook](https://docs.jupyter.org/en/latest/#what-is-a-notebook)
  - [Various other tools](https://python-public-policy.afeld.me/en/columbia/resources.html#jupyter-outside-this-course) are built around them
  - What we'll be using for this class

Each can be on your computer ("local"), or in the cloud somewhere. All call `python` under the hood, more or less.
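
Whichever of these you use, one way to see which Python is actually running your code is to ask Python itself. A minimal sketch using only the standard library; the variable name `python_version` is just for illustration, and the output will differ from machine to machine.

```python
import sys

# Path to the Python program that is running this code
print(sys.executable)

# Version of that Python, e.g. "3.12.1"
python_version = ".".join(str(part) for part in sys.version_info[:3])
print(python_version)
```

Running this in a script, the interactive interpreter, and a notebook on the same machine is a quick way to check whether they share the same installation.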


## Pandas

- A Python package (bundled up code that you can reuse)
- Very common for data science in Python
- [A lot like R](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html)
  - Both organize around "data frames"
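
To give a sense of what a data frame looks like, here is a minimal sketch. It assumes pandas is installed, and the data and the names `complaints` and `busy_boroughs` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a data frame is a table with named columns
complaints = pd.DataFrame(
    {
        "borough": ["Bronx", "Brooklyn", "Queens"],
        "complaint_count": [120, 340, 210],
    }
)

# Select a column by name and add up its values
total_complaints = complaints["complaint_count"].sum()
print(total_complaints)

# Keep only the rows that meet a condition
busy_boroughs = complaints[complaints["complaint_count"] > 200]
print(busy_boroughs)
```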


## Jupyter

- Web-based programming environment
- Supports Python by default, and other languages with added [kernels](https://docs.jupyter.org/en/stable/projects/kernels.html)
- Nicely displays output of your code so you can check and share the results
- Avoids using the command line


### Command line vs. Jupyter

![Command line vs. Jupyter output](img/cli_vs_jupyter.png)


### Create notebook

1. Create file `project_3.ipynb`
1. [Select kernel](https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_create-or-open-a-jupyter-notebook)
   1. Click `Install suggested extensions`, if it asks.
   1. Click `Python Environments…`
   1. Select the `.venv` one.
1. Add a Markdown cell at the top, giving the notebook a title as a [heading (level 1)](https://www.markdownguide.org/basic-syntax/#headings).

   ```md
   # Project 3
   ```

1. Add a couple of code cells, doing some simple math.
1. Commit
1. Your [Source Control Graph](https://code.visualstudio.com/docs/sourcecontrol/overview#_source-control-graph) (a.k.a. your Git history) should then look something like this:

   ![Git history](img/git_history.png)
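
For the code cells in the steps above, something along these lines is enough. This is only a suggestion; the numbers and variable names are arbitrary.

```python
# Cell 1: plain arithmetic
2 + 3 * 4

# Cell 2: store values in variables, then use them
hours_per_week = 40
weeks_per_year = 52
hours_per_week * weeks_per_year
```

In a notebook, each comment marks where a new cell would start, and running a cell shows the value of its final line.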

Hold onto this repository; you'll use it through the end of the course.


We'll cover these topics in more detail in [Advanced Computing for Policy](https://github.com/advanced-computing/course-materials/blob/main/README.md).


### Jupyter basics

A "cell" can be either code or [Markdown](https://www.markdownguide.org/getting-started/) (text). Raw Markdown looks like this:

```
## A heading

Plain text

[A link](https://somewhere.com)
```


#### Running

- You "run" a cell by either:
  - Pressing the ▶️ button
  - Pressing `Control`+`Enter` on your keyboard
- Cells only run when you tell them to, and in whatever order you run them
  - Generally, you want to run them from the top every time you open a notebook

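
Why the order matters can be sketched in plain Python. The snippet below imitates a fresh session in which the second cell is run before the first; the cell labels and the variable names are hypothetical.

```python
# "Cell 2" run first in a fresh session: the variable doesn't exist yet
try:
    print(monthly_rent * 12)
except NameError as error:
    print(f"NameError: {error}")

# "Cell 1": defines the variable
monthly_rent = 1500

# "Cell 2" run again, after "Cell 1": now it works
annual_rent = monthly_rent * 12
print(annual_rent)  # 18000
```

Running from the top avoids this, because every variable is defined before the cells that use it.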

#### Output

- The last thing in a code cell is what gets displayed when it's run
- The output gets saved as part of the notebook
- Existing output under a cell doesn't mean that cell has been run during this session

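
A small sketch of the display rule, written as the contents of a single code cell. The values are made up, and the comments describe what Jupyter would show; as a plain script, only the `print()` line produces output.

```python
tickets_sold = 120
ticket_price = 15

# An expression in the middle of a cell is evaluated but not displayed
tickets_sold * 2

# print() displays a value from anywhere in the cell
print(tickets_sold)  # 120

# The last expression in the cell is what Jupyter displays: 1800
tickets_sold * ticket_price
```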

### Hosted notebooks


Jupyter can run on your computer ("local"), or in the cloud somewhere. [Many options.](https://python-public-policy.afeld.me/en/columbia/resources.html#jupyter-outside-this-course)


### Some pandas/Jupyter best practices

- Make variable names descriptive
  - Ignore that all examples use `requests`
- Only do one thing per line
  - Makes troubleshooting easier
- Make notebooks [idempotent](https://en.wikipedia.org/wiki/Idempotence)
  - Makes your work reproducible
  - Use `Restart and run all` (⏩ button in toolbar)
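
One way to think about idempotence in a notebook: re-running a cell should give the same result as running it once. The sketch below uses two functions as stand-ins for cells; the function names and data are hypothetical.

```python
# Hypothetical data, made up for illustration
prices = [4.50, 3.25, 6.00]


def run_non_idempotent_cell():
    """Stands in for a cell that modifies `prices` in place."""
    prices.append(2.75)
    return len(prices)


def run_idempotent_cell():
    """Stands in for a cell that builds a new list from fixed inputs."""
    original_prices = [4.50, 3.25, 6.00]
    prices_with_extra = original_prices + [2.75]
    return len(prices_with_extra)


# Re-running the first "cell" gives a different result each time
print(run_non_idempotent_cell())  # 4
print(run_non_idempotent_cell())  # 5

# Re-running the second "cell" always gives the same result
print(run_idempotent_cell())  # 4
print(run_idempotent_cell())  # 4
```

Cells written like the second one are easier to trust, since the result doesn't depend on how many times they have been run.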
