## Notebook 1: Introduction to Jupyter Notebooks and Python

### **Jupyter Notebooks**

Welcome to a Jupyter Notebook! **Notebooks** are documents that support interactive computing in which code is interwoven with text, visualizations, and more.

The way notebooks are formatted encourages exploration, allowing users to iteratively update code and document the results. In use cases such as data exploration and communication, notebooks excel. Science (and computational work in general) has become quite sophisticated: models are built upon experiments that are conducted on large swaths of data, methods and results are abstracted away into symbols, and papers are full of technical jargon. A static document like a paper might not be sufficient to both effectively communicate a new discovery and allow someone else to discover it for themselves.

### Learning Outcomes
Working through this notebook, you will learn about:
- The history behind Jupyter notebooks and why they are used in computing
- How a Jupyter notebook is structured and how to use one
- Python fundamentals and working with tabular data


-----


### A Brief History
The Jupyter Notebook is an interactive computational environment that supports over 40 different programming languages, but was first released as a web-based interface for IPython in 2011. Fernando Perez, a professor in the statistics department here at UC Berkeley, created IPython as a graduate student in 2001 and co-founded Project Jupyter in 2014. 

Though the Jupyter Notebook interface has been around only about a decade, the first notebook interface, Mathematica, was released over 30 years ago in 1988. Other notebook interfaces have been released since then, but none have gained as much traction as Jupyter Notebooks have. Open-source scientific tools have grown steadily more popular since the early 2000s, and with its widespread adoption, Jupyter is well on its way to becoming a standard for sharing research methodology and results.

<img src="assets/mathematica.png" alt="Early Mathematica Interface" style="width: 350px;"/>
<center>The early Mathematica interface.</center>

### Why (Jupyter) Notebooks?
Notebooks are used for *literate programming*, a programming paradigm introduced by Donald Knuth in 1984 in which code in a programming language is accompanied by documentation written in a natural language. In other words, the computer program comes with an explanation that people can read. This approach effectively treats software as a work of literature ([Knuth](http://www.literateprogramming.com/knuthweb.pdf), "Literate Programming"). It helps people build a strong conceptual map of what is happening in the code and follow the flow and logic of the program, which is helpful for both the writer and the reader.

Jupyter leverages this idea and enables users to create and share documents that combine code, visualizations, narrative text, equations, and rich media. Notebooks are multipurpose and can be used in any discipline. The notebook is like a laboratory notebook, but for computing. Researchers can write code to work with their data while supplementing their methods with explanations, analysis, or hypotheses. Notebooks are also used in education because they enable students to engage with content presented in different forms, try computation with no prior experience, and practice programming in a scaffolded way.

<div class="alert alert-success">
In our class, we'll be using Jupyter Notebooks to introduce you to how data scientists work with data, to learn about issues of justice using real-world data sets, and to also learn how to reason about the human choices embedded in the practice of data science and their significance.
</div>

------

### Notebook Structure

A notebook is composed of rectangular sections called **cells**. There are 2 kinds of cells: markdown and code. A **markdown cell**, such as this one, contains text. A **code cell** contains code in Python, a programming language that we will be using with all of our data science modules in this class. You can select any cell by clicking it once.


To "run" a code cell (i.e. tell the computer to perform the programmed instructions in the cell), select it and then,
- Press `Shift` + `Enter`, or
- Click the Run button in the toolbar at the top of the screen. 

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell.

Let's try it! **Run the cell below to see the output.** Feel free to play around with the code -- try changing 'World' to your name.

In [1]:
# Run the cell by using one of the methods we mentioned above!

print("Hello World!")

Hello World!


You'll notice that many code cells contain lines of blue text that start with a `#`. These are *comments*. Comments often contain helpful information about what the code does or what you are supposed to do in the cell. The leading `#` tells the computer to ignore whatever text follows it.

### Editing

You can change the text in a markdown cell by clicking it twice. Text in markdown cells is written in [**Markdown**](https://daringfireball.net/projects/markdown/), a formatting language for plain text, so you may see some funky symbols should you try and edit a markdown cell we've already written. Once you've made changes to a markdown cell, you can exit editing mode by running the cell the same way you'd run a code cell.

**Try double-clicking on this text to see what some Markdown formatting looks like.**

#### Adding and Deleting Cells

Another feature of Jupyter Notebooks is the ability to add and delete cells, whether that be code or markdown. You can add cells by pressing the plus sign icon in the menu bar. This will add (by default) a code cell immediately below your current highlighted cell.

To convert a cell to markdown, you can press 'Cell' in the menu bar, select 'Cell Type', and finally pick the desired option. This works the other way around too!

To delete a cell, simply press the scissors icon in the menu bar. A common fear is deleting a cell that you needed -- but don't worry! This can be undone using 'Edit' > 'Undo Delete Cells'! If you accidentally delete content in a cell, you can use `Ctrl` + `Z` to undo.

#### Saving and Loading

Your notebook will automatically save your text and code edits, as well as any graphs you generate or any calculations you make. However, you can also manually save the notebook in its current state by using `Ctrl` + `S`, clicking the floppy disk icon in the toolbar at the top of the page, or by going to the 'File' menu and selecting 'Save and Checkpoint'.

Next time you open your notebook, it will look the same as when you last saved it!

**Note:** When you load a notebook you will see all the outputs from your last saved session (such as graphs, computations, etc.) but you won't be able to use any of the variables you assigned in your code without running it again.

An easy way to "catch up" to the last work you did is to highlight the cell you left off on and click "Run all above" under the Cell tab in the menu at the top of the screen.

### Getting Started: Setting Up an Environment

Now that we've covered our bases with regard to the platform we'll be working on for this assignment, let's load some **libraries** we need to explore the data we are working with. Python **libraries** are extra packages we can load to use tools that are not otherwise available. These can include visualization libraries such as `matplotlib` or numerical tools like `numpy`. You can see how we load these libraries below:

In [2]:
from datascience import * # This loads tools from the datascience library
import numpy as np # Loads numerical methods
import math
import random
# This is so we can get a clean PDF export to turn in
import otter
generator = otter.Notebook()

Now that we've loaded some relevant libraries, let's go over some Python basics.

### **Python Basics** <a id='subsectionpy'></a>

**Python** is a programming language -- a way for us to communicate with the computer and give it instructions.

Just like any language, Python has a set vocabulary made up of words it can understand, and a syntax which provides the rules for how to structure our commands and give instructions.

#### Errors
Errors in programming are common and totally okay! Don't be afraid when you see an error, because more likely than not the solution lies in the error message itself! Let's see what an error looks like. **Run the cell below to see the output.**

In [3]:
print('This line is missing something.'

SyntaxError: unexpected EOF while parsing (Temp/ipykernel_5412/2588290324.py, line 1)


The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. (Of course, if you're frustrated, you can usually find out what it means by searching for the error message online.)
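
As a rough sketch of how to read an error, the snippet below feeds the same unfinished line to Python's built-in `compile` function and catches the resulting `SyntaxError` so the rest of the code keeps running. The `try`/`except` wrapper and the names `source`, `error_name`, and `error_line` are illustrative assumptions, not something this assignment requires. The exact wording of the message depends on your Python version, so only the error type and line number are shown.

```python
# The same unfinished line as above, stored as text so we can inspect the
# error instead of stopping at it.
source = "print('This line is missing something.'"

error_name = None
error_line = None

try:
    compile(source, "<example>", "exec")   # ask Python to check the syntax
except SyntaxError as err:
    error_name = type(err).__name__        # the kind of error
    error_line = err.lineno                # the line number Python reports

print(error_name, "reported on line", error_line)
```

The error type tells you which rule was broken, and the line number tells you where to start looking.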

#### Variables

As we mentioned before, in this Jupyter Notebook you will be assigning data or figures to **variables**. You can even assign graph output or functions to variables, but that is out of scope for this assignment, so don't worry about it! Variables are stored in a computer's memory, and you can use them over and over again in future calculations!

Sometimes, instead of trying to work with raw information all the time in a long calculation like `4 - 2 * (1 + 6 / 3)` you will want to store it as a **variable** for easy access in future calculations. **Check out how we can use variables to our advantage below!**

In [None]:
# Instead of performing this calculation over and over again ...
4 - 2 * (1 + 6 / 3)

In [None]:
# Try assigning it to a variable for future use!
y = 4 - 2 * (1 + 6 / 3)

An assignment statement, such as `y = 4 - 2 * (1 + 6 / 3)` has three parts: on the left is the variable **name** (`y`), on the right is the variable's **value** (`4 - 2 * (1 + 6 / 3)`), and the equals sign in the middle tells the computer to assign the value to the name.

You might have noticed that running that second cell did not output anything. However, the value is now stored, and we can access it again and again in the future.

In [None]:
# We can print the value as follows
y

In [None]:
# We can also use it in other calculations now!
y * 2

#### Lists
Variable values may be more sophisticated. We can store multiple numbers under a single name if we make the value a list. The following cell stores 3 numbers in a list:

In [None]:
y = [4,9,16]
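
Once a list is stored under a name, you can ask how long it is or pull out individual items. The sketch below uses only built-in Python. The names `count`, `first`, and `last` are made up for illustration.

```python
y = [4, 9, 16]

count = len(y)   # how many items the list holds
first = y[0]     # positions start counting at 0, so this is the first item
last = y[-1]     # a negative position counts back from the end

print(count, first, last)  # 3 4 16
```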

### Functions <a id='subsection 1c'></a>
We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. Python has some functions built into it.

In [None]:
# a built-in function 
round

Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The `round` function returns the number it was given, rounded to the nearest whole number.

In [None]:
# a call expression using round
round(1988.74699)

A function may also be called on more than one value (called *arguments*). For instance, the `min` function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas.

In [None]:
min(9, -34, 0, 99)
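
Some functions change their behavior when given an extra argument. As a small illustration, `round` accepts an optional second argument that says how many digits to keep after the decimal point. The names `whole`, `cents`, and `smallest` are made up for this sketch.

```python
whole = round(1988.74699)       # one argument: nearest whole number
cents = round(1988.74699, 2)    # two arguments: keep 2 digits after the decimal point
smallest = min(9, -34, 0, 99)   # many arguments, separated by commas

print(whole, cents, smallest)   # 1989 1988.75 -34
```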

<div class="alert alert-warning">
<b>Practice</b>
<ul>
    <li>The <code>abs</code> function takes one argument (just like <code>round</code>)</li>
    <li>The <code>max</code> function takes one or more arguments (just like <code>min</code>)</li>
</ul>


Try calling `abs` and `max` in the cell below. What does each function do?

Also try calling each function *incorrectly*, such as with the wrong number of arguments. What kinds of error messages do you see?
</div>

In [None]:
# replace the ... with calls to abs and max
...

#### Dot Notation
Python has a lot of [built-in functions](https://docs.python.org/3/library/functions.html) (that is, functions that are already named and defined in Python), but even more functions are stored in collections called *modules*. Earlier, we imported the `math` module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a `.`.

In [None]:
# a call expression with the factorial function from the math module
math.factorial(5)

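Many math operations can be applied to every item in a list at once, but `math.sqrt` is not one of them: it accepts a single number and raises a `TypeError` if you give it a list. The `numpy` library we imported as `np` handles this case. Try calling `np.sqrt` on `y`, which you saved as the list `[4, 9, 16]`.

Operations like `np.sqrt` output an array of the same length as the input. Others reduce a list to a single number: try calling `sum()` on `[4, 9, 16]`.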
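
Here is a small sketch of the two kinds of operations, using only the standard library. The names `roots` and `total` are made up for illustration. Because `math.sqrt` accepts one number at a time, the square root is applied to each item in turn.

```python
import math

y = [4, 9, 16]

# Elementwise: one output per input, so the result has the same length as y.
roots = [math.sqrt(value) for value in y]

# Reduction: the whole list collapses to a single number.
total = sum(y)

print(roots)  # [2.0, 3.0, 4.0]
print(total)  # 29
```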

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Question 1:</b>
<code>math</code> also has a function called <code>sqrt</code> that takes one argument and returns the square root. Call <code>sqrt</code> on 16 in the next cell.
</div>

<!--
BEGIN QUESTION
name: q1
points: 1
manual: true
-->

In [None]:
# Replace the ... with a call to math.sqrt() to get the square root of 16
...

**Answer here** *Double click to edit this markdown cell with your answer*

<!-- END QUESTION -->

### Random numbers and sampling
Random sampling plays a key role in data science. The `random` module implements functions for random sampling and random number generation. For example, the expression below generates a random integer between 1 and 50, with both endpoints included:

`random.randint(1,50)`

Note that any whole number between 1 and 50 has an equal probability of being selected --- the sampling probabilities are uniform.
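
To see this in action, the sketch below draws many numbers and checks that none falls outside the range. The seed value and the names `draws` and `pick` are assumptions chosen for illustration. Setting a seed only makes repeated runs produce the same draws.

```python
import random

random.seed(0)  # illustrative seed so the example is repeatable

# Draw 1000 integers; each draw can be any whole number from 1 to 50 inclusive.
draws = [random.randint(1, 50) for _ in range(1000)]

print(min(draws), max(draws))  # both values stay within 1 and 50

# random.choice picks one item from a list, each with equal probability.
pick = random.choice(["red", "green", "blue"])
print(pick)
```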



## Tables

In most cases, when interacting with data you will be working with **tables**. In this section, we will cover how to examine and manipulate data using Python. 

**Tables** are the fundamental way we organize and display data. 
**Run the cell below to load a dataset.** We'll be working with this data in a future notebook.

In [None]:
# Below we see an assignment statement.
# We are telling the computer to create a Table and read in some data.

prisons = Table.read_table("./data/monthly_cdcr.csv")

# This next command will display the top 5 entries. You can change the number
# to view a different number of entries at a time.
prisons.show(5)

This table is organized into **columns**, one for each category of information collected. You can also think about the table in terms of its rows, where each row represents all the information collected about a particular instance, in this case, different state prisons. By default only the first 10 rows are shown, but as you can see in the code we ran, we changed it to 5.
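
To make the row and column idea concrete without the `datascience` library, here is a minimal sketch in plain Python. The data is hypothetical and is not the contents of `monthly_cdcr.csv`. The column names `name`, `year`, and `count` are invented for illustration.

```python
import csv
import io

# Hypothetical data for illustration only.
raw = """name,year,count
A,2020,10
B,2020,12
C,2021,7
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)           # each row holds everything recorded about one instance
columns = reader.fieldnames   # each column is one category of information

print(len(rows), "rows and", len(columns), "columns")

# Reading one column means collecting the same field from every row.
year_column = [row["year"] for row in rows]
print(year_column)
```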

**Table Attributes**

Every table has **attributes** that give information about the table, such as the number of rows and the number of columns. Attributes you'll use frequently include `num_rows` and `num_columns`, which give the number of rows and columns in the table, respectively. Like the functions in a module, attributes are accessed using **dot notation**, but because an attribute is a stored value rather than a function, we don't put parentheses after it the way we did with `print` in our Hello World! example.
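
The difference between reading an attribute and calling a function can be sketched with the standard library. `Fraction` stands in for a table here only because it also has attributes. The names `half`, `top`, `bottom`, and `result` are made up for illustration.

```python
from fractions import Fraction
import math

half = Fraction(1, 2)

# Attribute: a stored value, read with a dot and no parentheses.
top = half.numerator
bottom = half.denominator

# Function from a module: also reached with a dot, but called with parentheses.
result = math.factorial(5)

print(top, bottom, result)  # 1 2 120
```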

In [None]:
# Get the number of columns
prisons.num_columns

In [None]:
# Get the number of rows
prisons.num_rows

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Question 2:</b>
Observe the output of the cell above. How many state prisons are included in our data set?
</div>

<!--
BEGIN QUESTION
name: Q2
points: 1
manual: true
-->

**Answer here** *Double click to edit this markdown cell with your answer*

<!-- END QUESTION -->

In other situations, we will want to sort, filter, or group our data. To manipulate data stored in a table, we will use various table functions. These will be explained as we go through them so as not to overwhelm you!

#### Note on Tables

In this notebook, we used the [datascience](http://data8.org/datascience/) library to work with tables (it is also used in Data-8). In future courses (such as Data-100) and in industry, you'll most likely use [pandas](https://pandas.pydata.org/) to manipulate tabular data. The datascience library has friendlier syntax, but pandas is more powerful overall. In this course, we'll use both.

-----

Now that you have a basic grasp of Python and the kinds of information we'll be working with, we can move on to where our data came from and how to interact with it.

Congrats on finishing the Jupyter Notebook and Python overview!

------

## Notebooks in Practice

With proprietary software like Mathematica, users are supposed to trust the results returned and are unable to check the code. In contrast, Jupyter fosters transparency and hence encourages reproducibility, which refers to the ability to reproduce the results of a scientific study. Not only is the code behind the software available for anyone to inspect or tinker with, but code in the notebooks can also be examined or re-run to reproduce the results. Theodore Gray, the co-founder of Wolfram Research who was also involved in creating the Mathematica interface, said about Jupyter, "I think what they have is acceptance from the scientific community as a tool that is considered to be universal." In other words, Jupyter Notebooks support the computational work of researchers from different fields, from astronomy to psychology to literature, and therefore enable new ways for researchers in very different domains to share research tools, methods, and learn from one another.

The versatility of the notebook also has important consequences for data science and for the workflows involved when working with data in settings other than research, such as education and community science projects. The process of working with data can be messy and nonlinear, which a Jupyter notebook can handle well because of its flexibility (though this messiness is often reflected in the resulting notebook!). The power of the notebook lies in its ability to include a variety of media with the computation as a means to maintain accountability, integrity, and transparency for both the author of the notebook and the audiences the work is shared with. In a world in which algorithms and data analysis inform many aspects of life and where computation is getting more and more abstract, the ability to understand and reason about computational work is more important than ever.

<!-- BEGIN QUESTION -->

<div class="alert alert-block alert-info">
<b>Question 3:</b> As we mentioned above, notebooks are used to make programming easier to read by interleaving code with text and other media types. Just as literature allows for creativity and supports multiple interpretations, we can treat notebooks as a medium that lets us tell complex stories that incorporate programming. What does that imply about notebooks as a medium and us as readers? Can you think of ways to incorporate them into justice projects?
</div>


<!--
BEGIN QUESTION
name: q3
points: 1
manual: true
-->


**Answer here** *Double click to edit this markdown cell with your answer*

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-block alert-info">
<b>Labor Question:</b> How much time did you spend completing this module? Did you find outside resources that helped you? If so, what were they?
</div>


<!--
BEGIN QUESTION
name: q4
points: 1
manual: true
-->


**Answer here** *Double click to edit this markdown cell with your answer*

<!-- END QUESTION -->

#### Feedback Survey
Please consider filling out this [survey](https://docs.google.com/forms/d/e/1FAIpQLScd1q8VqvOMuVvLfhbVswckYKg1HFVwVu_bTF5NWbVZr4qWhw/viewform?usp=sf_link) to help us improve this module.



In [None]:
# Save your notebook first, then run this cell to export your submission.
# Download the zip file, which contains a copy of your notebook and your written responses.
generator.export("notebook1.ipynb")