<a href="https://colab.research.google.com/github/afeld/python-public-policy/blob/main/pandas_crash_course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis in Python

_pandas, specifically._

Aidan Feldman

Pulled from [a Python for Public Policy class](https://github.com/afeld/nyu-python-public-policy/blob/master/syllabus.md#readme) I teach

## Terminology

Thing | What is it?
--- | ---
Python | programming language
package | add-on/plugin for Python
pandas | package for data analysis
plotly | package for data visualization
Jupyter | programming environment; supports Markdown + code + HTML output; can run Python and [tons of other languages](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)
notebook | an individual Jupyter file; think Google Doc with executable code in it
Google Colaboratory ("Colab") | cloud-based Jupyter
Markdown | markup language; think "simple HTML"

- Jupyter notebooks are magical. This presentation is a notebook.
- You can use pandas anywhere you can run Python, but Jupyter makes things easier.

### Command line vs. Jupyter

![Command line vs. Jupyter output](img/cli_vs_jupyter.png)

## Spreadsheets vs. pandas

### Why spreadsheets

- The easy stuff is easy
- Lots of people know how to use them
- Mostly just have to point, click, and scroll
- Data and logic live together as one

### Why pandas

- Data and logic _don't_ live together
- More powerful, flexible, and expressive than spreadsheet formulas

  - Don't have to cram into a single line

    ```
    =SUM(INDEX(C3:E9,MATCH(B13,C3:C9,0),MATCH(B14,C3:E3,0)))
    ```

  - Can have more descriptive data references than `Sheet1!A:A`

- Better at working with large data
  - Excel caps a worksheet at 1,048,576 rows and Google Sheets caps a spreadsheet at 10 million cells, but both get slow long before that
- Reusable code (packages)
- Automation

### Side-by-side\*

|                       Task |  Spreadsheets  | pandas |
| -------------------------: | :------------: | :----: |
|           **Loading data** |      Easy      | Medium |
|           **Viewing data** |      Easy      | Medium |
|         **Filtering data** |      Easy      | Medium |
|      **Manipulating data** |     Medium     | Medium |
|           **Joining data** |      Hard      | Medium |
| **Complicated transforms** | Impossible\*\* | Medium |
|             **Automation** | Impossible\*\* | Medium |
|        **Making reusable** | Impossible\*\* | Medium |
|         **Large datasets** |   Impossible   |  Hard  |

_\*Ratings are obviously somewhat subjective._

_\*\*Not including scripting._
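
To give a feel for a few of the rows above, here is a minimal sketch of loading, viewing, filtering, and joining in pandas. The CSV contents and column names (`complaint_id`, `borough_code`, and so on) are invented for illustration; in practice you would pass a file path or URL to `read_csv()`.

```python
import io

import pandas as pd

# Hypothetical CSV data, inlined so the example is self-contained.
complaints_csv = io.StringIO(
    "complaint_id,borough_code,type\n"
    "1,BK,Noise\n"
    "2,MN,Heat\n"
    "3,BK,Heat\n"
)
boroughs_csv = io.StringIO(
    "borough_code,borough_name\n"
    "BK,Brooklyn\n"
    "MN,Manhattan\n"
)

# Loading data
complaints = pd.read_csv(complaints_csv)
boroughs = pd.read_csv(boroughs_csv)

# Viewing data: show the first few rows
print(complaints.head())

# Filtering data: keep only the heat complaints
heat = complaints[complaints["type"] == "Heat"]

# Joining data: look up each borough's name by its code
heat_named = heat.merge(boroughs, on="borough_code", how="left")
print(heat_named)
```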

### Try it!

1. Create a Colab notebook
   1. Go to [colab.research.google.com](https://colab.research.google.com)
   1. Click `NEW NOTEBOOK`
1. Paste in [the following example](https://plotly.com/python/linear-fits/#linear-fit-trendlines-with-plotly-express):

    ```python
    import plotly.express as px

    df = px.data.tips()
    fig = px.scatter(df, x="total_bill", y="tip", trendline="ols")
    fig.show()
    ```

1. Press the ▶️ button

### Jupyter basics

- You "run" a cell by either:
    - Pressing the ▶️ button
    - Pressing `⌘`+`Enter` (Mac) or `Control`+`Enter` (Windows) on your keyboard
- Cells only run when you tell them to, and they run in the order you run them, not the order they appear in the notebook
    - Generally, you want to run all of them from the top every time you open a notebook
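
A small sketch of why run order matters. The `# Cell` comments stand in for two separate notebook cells, and the variable name `count` is made up for this example.

```python
# Cell 1
count = 0

# Cell 2
count = count + 1
print(count)
```

Run top to bottom once, this prints `1`. In a notebook, running Cell 2 again without rerunning Cell 1 would keep adding to `count`, so the result depends on what you ran and in what order.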

#### Output

- The value of the last expression in a code cell is what gets displayed when it's run
- The output gets saved as part of the notebook
- Existing output under a cell doesn't mean that cell has been run during this session
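
The first point can be sketched as a single cell. The DataFrame here is made-up data for illustration.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "count": [1, 2, 3]})

df.head()       # not the last expression in the cell, so nothing is displayed
print(len(df))  # print() produces output wherever it appears

total = df["count"].sum()  # an assignment displays nothing on its own
total           # last expression in the cell: Jupyter displays its value
```

To display something from the middle of a cell, wrap it in `print()` or move it to the last line.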