# Lecture 16: Python and Jupyter


## Spreadsheets vs. programming languages


What do you like about spreadsheets?


### Why spreadsheets

- The easy stuff is easy
- Lots of people know how to use them
- Mostly just have to point, click, and scroll
- Data and logic live together as one


### Why programming languages

- Data and logic _don't_ live together
  - Why might this matter?


- More powerful, flexible, and expressive than spreadsheet formulas; you don't have to cram everything into a single line

  ```
  =SUM(INDEX(C3:E9,MATCH(B13,C3:C9,0),MATCH(B14,C3:E3,0)))
  ```

- Better at working with large data
  - [Google Sheets](https://support.google.com/drive/answer/37603) and [Excel](https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3) have hard limits (10 million cells and about 1 million rows, respectively), but get slow long before that
- Reusable code (packages)
- Automation
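
As a rough illustration, here is a sketch of the two-way lookup from the formula above, written in Python. The data and the names `sales` and `lookup` are made up for this example; only the idea (find the row, find the column, return the cell) comes from the formula.

```python
# Hypothetical data standing in for a spreadsheet range:
# the outer keys play the role of the first column, the inner keys the header row.
sales = {
    "Apples": {"January": 120, "February": 95},
    "Pears": {"January": 80, "February": 110},
}


def lookup(table, row_label, column_label):
    """Return the cell at the intersection of a row and a column."""
    row = table[row_label]  # find the row, like the first MATCH()
    value = row[column_label]  # find the column, like the second MATCH()
    return value


print(lookup(sales, "Pears", "February"))  # 110
```

Because the logic (`lookup`) is separate from the data (`sales`), the same function can be reused on a different table without copying anything.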


### Side-by-side<sup>1</sup>

|                       Task |      Spreadsheets      | Programming Languages |
| -------------------------: | :--------------------: | :-------------------: |
|           **Loading data** |          Easy          |        Medium         |
|           **Viewing data** |          Easy          |        Medium         |
|         **Filtering data** |          Easy          |        Medium         |
|      **Manipulating data** |         Medium         |        Medium         |
|           **Joining data** |          Hard          |        Medium         |
| **Complicated transforms** | Impossible<sup>2</sup> |        Medium         |
|             **Automation** | Impossible<sup>2</sup> |        Medium         |
|        **Making reusable** |  Limited<sup>3</sup>   |        Medium         |
|         **Large datasets** |       Impossible       |         Hard          |

1. These ratings are obviously subjective
1. Not counting scripting, which includes [Excel's new Python+pandas support](https://support.microsoft.com/en-us/office/introduction-to-python-in-excel-55643c2e-ff56-4168-b1ce-9428c8308545)
1. [Google Sheets supports named functions](https://support.google.com/docs/answer/12504534)


## Python vs. other languages

- Good for general-purpose _and_ data stuff
- Widely used in both industry and academia
- Relatively easy to learn
- Open source

![Python logo](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/110px-Python-logo-notext.svg.png)


## Where to Python

Python can be run in:

- A text file, using the `python` command
- [The interactive Python interpreter / command prompt / shell](https://www.python.org/shell/)
- An [integrated development environment (IDE)](https://runestone.academy/ns/books/published/thinkcspy/Appendices/usingIDE.html), such as:
  - VSCode
  - [PyCharm](https://www.jetbrains.com/pycharm/)
  - [Spyder](https://www.spyder-ide.org/)
- A [Jupyter notebook](https://docs.jupyter.org/en/latest/#what-is-a-notebook)
  - [Various other tools](https://python-public-policy.afeld.me/en/columbia/resources.html#jupyter-outside-this-course) are built around them
  - What we'll be using for this class

Each can be on your computer ("local"), or in the cloud somewhere. All call `python` under the hood, more or less.
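
A minimal sketch of the first option: save the following in a text file (the name `hello.py` is only an example) and run it from a terminal with `python hello.py`. The variable names are made up for illustration.

```python
# hello.py (hypothetical file name)
name = "world"
message = f"Hello, {name}!"

# print() writes the text to the terminal.
print(message)
```

The same lines would also work typed into the interactive interpreter or pasted into a notebook cell.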


## Pandas

- A Python package (bundled up code that you can reuse)
- Very common for data science in Python
- [A lot like R](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html)
  - Both organize around "data frames"
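
A small sketch of what a data frame looks like in pandas, assuming the `pandas` package is installed. The column names and numbers are made up for this example.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column.
df = pd.DataFrame(
    {
        "borough": ["Bronx", "Brooklyn", "Queens"],
        "complaints": [12, 30, 21],
    }
)

# Filter to the rows with more than 15 complaints.
busy = df[df["complaints"] > 15]

# Add up a column.
total = df["complaints"].sum()

print(busy)
print(total)
```

A data frame is roughly one sheet of a spreadsheet: named columns, with one row per record.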


## Jupyter

- Alternative programming environment
- Supports Python by default, and other languages with added [kernels](https://docs.jupyter.org/en/stable/projects/kernels.html)
- Nicely displays output of your code so you can check and share the results
- Avoids using the command line


### Command line vs. Jupyter

![Command line vs. Jupyter output](img/cli_vs_jupyter.png)


## [Lab 8](lab_8.ipynb) prep

[Sign up for GitHub.](https://github.com/signup)

- If you have an account already, [sign in](https://github.com/login).
- A [Free plan](https://github.com/pricing) is sufficient.


### Repository setup

Create a folder (on your computer) called `[username].github.io`.

- Example: `afeld.github.io`
- Do so _outside_ of your `computing-in-context/` (or equivalent) folder.
  - Therefore, you'll have two parent folders for class:
    - `computing-in-context/` (or equivalent)
    - `[username].github.io`
- This is preparing us to make a ["user site" on GitHub Pages](https://docs.github.com/en/pages/getting-started-with-github-pages/what-is-github-pages#types-of-github-pages-sites).
  - If you already have one, use a `computing-in-context/` folder (new or existing) to make a ["project site"](https://docs.github.com/en/pages/getting-started-with-github-pages/what-is-github-pages#types-of-github-pages-sites).


1. [Open your `[username].github.io` folder in VSCode.](https://code.visualstudio.com/docs/getstarted/getting-started#_open-a-folder-in-vs-code)
1. [Open a terminal.](https://code.visualstudio.com/docs/terminal/getting-started)
1. In the terminal, [create a virtual environment](https://docs.python.org/3/library/venv.html#creating-virtual-environments).

   ```sh
   python -m venv .venv
   ```


[Install the following packages.](notebooks.md#installing-packages)

```
ipykernel
pandas
plotly

# needed for plotly
nbformat
statsmodels
```


## Jupyter, continued

1. [Create a `lecture_16_example.ipynb` file.](https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_create-or-open-a-jupyter-notebook)
1. _Walk through the Jupyter interface._
1. Copy in [this example](https://plotly.com/python/linear-fits/#Linear-fit-trendlines-with-Plotly-Express).
1. [Run it.](https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_running-cells)
   1. [Select the kernel.](https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_create-or-open-a-jupyter-notebook)
   1. Click `Install suggested extensions`, if it asks.
   1. Click `Python Environments…`.
   1. Select the `python` under the virtual environment.
      - Mac: `.venv/bin/python`
      - Windows: `.venv\Scripts\python.exe`


### Common issues

We've run into issues with being unable to [select the kernel](https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_create-or-open-a-jupyter-notebook). Try each of the following:

- Ensure you're on the latest versions of:
  - [VSCode](https://code.visualstudio.com/docs/setup/setup-overview#_update-cadence)
  - [The extensions](https://code.visualstudio.com/docs/configure/extensions/extension-marketplace#_extensions-view-filter-and-commands)
- Restart VSCode (quitting the full app, not just the window).
- Mac: Make sure VSCode isn't in your `Downloads` folder — drag it to `Applications`.
- Confirm that [the packages were installed in the virtual environment](#repository-setup), not [globally](https://packaging.python.org/en/latest/tutorials/installing-packages/#creating-virtual-environments).
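
One way to check that last point, sketched with the standard library only: run the following in a notebook cell (or with the `python` from your virtual environment). The helper name `is_installed` is made up for this example.

```python
import importlib.util
import sys

# Path of the Python interpreter running this code.
# When the virtual environment is selected, it should be inside the `.venv` folder.
print(sys.executable)

# Inside a virtual environment created with `venv`, these two paths differ.
in_virtual_environment = sys.prefix != sys.base_prefix
print("In a virtual environment:", in_virtual_environment)


def is_installed(package):
    """Return True if this interpreter can find the given top-level package."""
    return importlib.util.find_spec(package) is not None


# Report on each package from the install list.
for package in ["ipykernel", "pandas", "plotly", "nbformat", "statsmodels"]:
    print(package, "installed" if is_installed(package) else "MISSING")
```

If `sys.executable` points somewhere outside `.venv`, the notebook is likely using a different Python than the one the packages were installed into.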


### Using multiple cells

_Show each step in its own cell:_

1. _Data loading_
1. _Displaying the `df`_
1. _Creating the chart_


FYI `px.data.tips()` loads one of [Plotly's sample datasets](https://plotly.com/python-api-reference/generated/plotly.express.data.html). That's not needed when plotting other datasets.


We'll learn more about pandas and plotly soon.


### Jupyter basics

A "cell" can be either code or [Markdown](https://www.markdownguide.org/getting-started/) (text). Raw Markdown looks like this:

```
## A heading

Plain text

[A link](https://somewhere.com)
```


#### Running

- You "run" a cell by either:
  - Pressing the ▶️ button
  - Pressing `Control`+`Enter` on your keyboard
- Cells only run when you tell them to, and in the order you run them (not necessarily top to bottom)
  - Generally, you want to do so from the top every time you open a notebook


#### Output

- The value of the last expression in a code cell is what gets displayed when it's run
- The output gets saved as part of the notebook
- Existing output from a cell doesn't mean that cell has been run during this session
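
A short sketch of what does and doesn't get displayed; the variable `total` is made up for this example.

```python
total = 2 + 3

# print() shows its output no matter where it appears in the cell.
print("The total is", total)

# In a notebook, the value of the last expression is displayed below the cell (here, 10).
# In a plain `python` script, this line displays nothing.
total * 2
```

If the last line of a cell is an assignment (such as `total = 2 + 3`), nothing is displayed, since an assignment has no value to show.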


### Hosted notebooks


Jupyter can run on your computer (“local”), or in the cloud somewhere. [Many options.](https://python-public-policy.afeld.me/en/columbia/resources.html#jupyter-outside-this-course)


### Some best practices

- Make variable names descriptive
- Only do one thing per line
  - Makes troubleshooting easier
- Make notebooks [idempotent](https://en.wikipedia.org/wiki/Idempotence)
  - Makes your work reproducible
  - Use the `Restart` and then `Run All` buttons in the toolbar to check
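
A minimal sketch of the idempotence point, using made-up prices and a hypothetical `TAX_RATE`. The first approach overwrites its own input, so an accidental second run changes the result; the second always starts from the untouched `raw_prices`, so re-running it is harmless.

```python
TAX_RATE = 0.08  # hypothetical value
raw_prices = [10.0, 20.0, 30.0]  # made-up data

# Earlier cell: make a working copy.
prices = list(raw_prices)

# Not idempotent: this step overwrites the variable it reads from,
# so every run adds tax again.
prices = [price * (1 + TAX_RATE) for price in prices]
prices = [price * (1 + TAX_RATE) for price in prices]  # simulates running the cell again

# Idempotent: reads from `raw_prices` and writes to a new, descriptive name.
prices_with_tax = [price * (1 + TAX_RATE) for price in raw_prices]
prices_with_tax = [price * (1 + TAX_RATE) for price in raw_prices]  # running again changes nothing

print(prices)
print(prices_with_tax)
```

After both runs, `prices` has had tax applied twice, while `prices_with_tax` is the same as after a single run.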
