# Berkeley Unboxing Data Science: Introduction to Data Science in Python
<img src="https://data.berkeley.edu/sites/default/files/styles/openberkeley_brand_widgets_rectangle/public/john_and_ani_community_photo_dsc_7214.jpg?itok=6Hjv1irR" style="width: 500px; height: 350px;"/>

<center> Professor Ani Adhikari and Professor John DeNero pictured above. Here are two of the three professors that started Data Science at UC Berkeley. We hope to have Professor Adhikari speak to us later in the summer.</center>

### Table of Contents
<a href='#section 0'>Welcome to Jupyter Notebooks!</a>

1.  <a href='#section 1'>The Python Programming Language</a>

    a. <a href='#subsection 1a'>Expressions</a> and <a href='#subsection error'>Errors</a>

    b. <a href='#subsection 1b'>Names</a>

    c. <a href='#subsection 1c'>Functions</a>

    d. <a href='#subsection 1d'>Sequences</a>
<br><br>

## The Jupyter Notebook <a id='section 0'></a>

Welcome to the Jupyter Notebook! **Notebooks** are documents that can contain text, code, visualizations, and more. 

A notebook is composed of rectangular sections called **cells**. There are 2 kinds of cells: markdown and code. A **markdown cell**, such as this one, contains text. A **code cell** contains code in Python, a programming language that we will be using for the remainder of this module. You can select any cell by clicking it once. After a cell is selected, you can navigate the notebook using the up and down arrow keys.

<div class="alert alert-info">
To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, you can either press the  <code><b>▶|</b> Run </code> button above or press <b><code>Shift + Return</code></b> or <b><code>Shift + Enter</code></b>. This will run the current cell and select the next one.
</div>

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell.

In [1]:
# run this cell
print("Hello World!")

Hello World!


You'll notice that many code cells contain lines of blue text that start with a `#`. These are *comments*. Comments often contain helpful information about what the code does or what you are supposed to do in the cell. The leading `#` tells the computer to ignore them.
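
As a small illustration (the name `greeting` is our own choice, not part of the notebook's exercises), a comment can take up a whole line or follow code on the same line. In both cases shown here, the computer skips the comment text and runs only the code.

```python
# This entire line is a comment, so the computer skips it.
greeting = "Hello World!"  # a comment can also follow code on the same line
print(greeting)
```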

#### Editing

You can edit a Markdown cell by clicking it twice. Text in Markdown cells is written in [**Markdown**](https://daringfireball.net/projects/markdown/), a formatting syntax for plain text, so you may see some funky symbols when you edit a text cell. 

Once you've made your changes, you can exit text editing mode by running the cell. Edit the next cell to fix the misspelling.

Go Baers!

Code cells can be edited any time after they are highlighted. Try editing the next code cell to print your name.

In [2]:
# edit the code to print your name
print("Hello: my name is NAME")

Hello: my name is NAME


#### Adding Cells
You can add a cell by clicking <b><code>Insert > Insert Cell Below</code></b> and then choosing the cell type in the drop-down menu. Try adding a cell below here and printing your birthday (format: mm/dd/yyyy). Do not forget the quotation marks!

#### Deleting Cells
You can delete a cell by clicking the <b><code>scissors</code></b> at the top or <b><code>Edit > Cut Cells</code></b>. Delete the next cell below here.

In [3]:
## DELETE THIS CELL
print("delete me")

delete me


Once the cell above has been deleted, do not delete any of the following cells.

#### Saving and Loading

Your notebook can record all of your text and code edits, as well as any graphs you generate or calculations you make. You can save the notebook in its current state by pressing Control-S, clicking the floppy disc icon in the toolbar at the top of the page, or going to the File menu and selecting "Save and Checkpoint".

The next time you open the notebook, it will look the same as when you last saved it.

**Note:** After loading a notebook, you will see all the outputs (graphs, computations, etc.) from your last session, but you won't be able to use any variables you assigned or functions you defined. You can get the functions and variables back by re-running the cells where they were defined. The easiest way is to highlight the cell where you left off work, then go to the Cell menu at the top of the screen and click "Run all above". You can also use this menu to run all cells in the notebook by clicking "Run all".

#### Downloading as PDF

You can download this notebook as a PDF by clicking <b><code>File > Download as > PDF via LaTeX</code></b>. Try doing this now. You will need to do this again at the end of the notebook to turn the PDF in to bCourses.

#### Completing the Notebooks

As you navigate the notebooks, you'll see cells with bold, all-capitalized headings that need to be filled in to complete the notebook. There are two types:

<div class="alert alert-warning">
<b>PRACTICE</b> cells provide spaces to try out new coding skills at your own pace, unrelated to the case study. Since each coding skill taught in these notebooks is necessary for analyzing the cases, practice cells are a good way to get comfortable before applying those skills to real data.
</div>

Before we begin, we'll need a few extra tools to conduct our analysis. Run the next cell to load some code packages that we'll use later. 

Note: this cell MUST be run in order for most of the rest of the notebook to work.

In [3]:
# dependencies: THIS CELL MUST BE RUN
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import ipywidgets as widgets
%matplotlib inline

# 1. Python <a id='section 1'></a>

**Python** is a programming language: a way for us to communicate with the computer and give it instructions.

Just like any language, Python has a *vocabulary* made up of words it can understand, and a *syntax* giving the rules for how to structure communication.


#### Errors <a id="subsection error"></a>

Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

When you write code, you will often accidentally break some of these rules. When you run a code cell that doesn't follow every rule exactly, Python will produce an **error message**.

Errors are *normal*; experienced programmers make many errors every day. Errors are also *not dangerous*; you will not break your computer by making an error (in fact, errors are a big part of how you learn a coding language). An error is nothing more than a message from the computer saying it doesn't understand you and asking you to rewrite your command.

We have made an error in the next cell.  Run it and see what happens.

In [5]:
print("This line is missing something."

SyntaxError: unexpected EOF while parsing (<ipython-input-5-c7b7223ecd08>, line 1)

You should see something like this (minus our annotations):

<img src="images/error-image.jpg"/>

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell. Depending on your Python version, the wording may differ; recent versions report `'(' was never closed` for this same mistake.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, you can usually find out what it means by searching for the error message online or posting on Piazza.)



### 1a. Data <a id='subsection 1a'></a>
**Data** is information: the "stuff" we manipulate to make and test hypotheses.

Almost all data you will work with broadly falls into two types: numbers and text. *Numerical data* shows up green in code cells and can be positive, negative, or include a decimal.

In [None]:
# Numerical data

4

87623000983

-667

3.14159

Text data (also called *strings*) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings.
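
One way to check whether a value is numerical or text is Python's built-in `type` function. The short sketch below is an illustration rather than part of the original exercises, and the names `number` and `text` are our own (naming values is covered in section 1b).

```python
number = 3.14159    # numerical data: no quotation marks
text = "3.14159"    # a string: the quotation marks make it text

print(type(number))  # <class 'float'>
print(type(text))    # <class 'str'>
```

The two values look alike on the page, but Python treats the first as a number and the second as text.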

In [None]:
# Strings
"a"

"Hi there!"

"We hold these truths to be self-evident that all men are created equal."

# this is a string, NOT numerical data
"3.14159"

### 1a. Expressions <a id='subsection 1a'></a>

A bit of communication in Python is called an **expression**. It tells the computer what to do with the data we give it.

Here's an example of an expression. In general, when you write an expression, put a space between each number and operator. In a call expression (these use parentheses), do not put a space before the opening parenthesis.

In [None]:
# an expression
14 + 20


When you run the cell, the computer **evaluates** the expression and prints the result. Note that only the last line in a code cell will be printed, unless you explicitly tell the computer you want to print the result.

In [None]:
# more expressions. what gets printed and what doesn't?
100 / 10

print(4.3 + 10.98)

33 - 9 * (40000 + 1)

884
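
If you want to see more than the last line, wrap each value in `print`. The sketch below uses made-up numbers and names of our own (naming values is covered in section 1b), so it does not give away the answer to the cell above.

```python
first = 7 + 5
second = 20 / 4

print(first)    # 12
print(second)   # 5.0
first * second  # in a notebook, only this last line is displayed automatically: 60.0
```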

<img src="images/operators.jpg"/>

Many basic arithmetic operations are built in to Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html).  

The computer evaluates arithmetic according to the PEMDAS order of operations (just like you probably learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction.

In [None]:
# before you run this cell, can you say what it should print?
4 - 2 * (1 + 6 / 3)
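
To see the order of operations at work on a different expression, the sketch below breaks one calculation into steps. The expression and the names are our own illustration; `**` is Python's exponent operator.

```python
# Evaluate 2 + 3 * 4 ** 2 one step at a time, following PEMDAS
power = 4 ** 2          # exponents first: 16
product = 3 * power     # then multiplication: 48
result = 2 + product    # addition last: 50

print(result)               # 50
print(2 + 3 * 4 ** 2)       # 50, the same value in a single expression
print((2 + 3) * 4 ** 2)     # 80, parentheses force the addition to happen first
```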

<div class="alert alert-warning">
<b>PRACTICE:</b> If you're new to Python and coding, one of the best ways to get comfortable is to practice. Try writing and running different expressions in the cell below using numbers and the arithmetic operators <code>*</code> (multiplication), <code>+</code> (addition), <code>-</code> (subtraction), and <code>/</code> (division). See if you can generate different error messages and figure out what they mean.
    </div>

In [4]:
# Optional: try out different arithmetic operations


### 1b. Names <a id='subsection 1b'></a>
Sometimes, the values you work with can get cumbersome: maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a **name**.

We can name values using what's called an *assignment* statement.

In [None]:
# assigns 442 to x
x = 442

The assignment statement has three parts. On the left is the *name* (`x`). On the right is the *value* (442). The *equals sign* in the middle tells the computer to assign the value to the name.

You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `x` again in the future, it will have the value we assigned it.

In [None]:
# print the value of x
x

You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation.

In [None]:
y = 50 * 2 + 1
y

We can then use these names as if they were numbers.

In [None]:
x - 42

In [None]:
x + y
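
Descriptive names make a calculation easier to read than single letters do. The sketch below is a made-up example; the names and values are our own and are not used elsewhere in the notebook.

```python
hours_per_day = 24
days_per_week = 7

# The name on the left is assigned the result of the expression on the right
hours_per_week = hours_per_day * days_per_week
print(hours_per_week)  # 168
```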

<div class="alert alert-warning">
<b>PRACTICE:</b> Try rewriting the problem below, so the names make more sense. Do we expect an error below?
</div>

In [11]:
# Experiment with assigning names and doing arithmetic operations with named variables

one = 10
two = 300

sixty = one * two
sixty

3000

### 1c. Functions <a id='subsection 1c'></a>
We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. Python has some functions built into it.

In [None]:
# a built-in function 
round

Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The `round` function returns the number it was given, rounded to the nearest whole number.

In [None]:
# a call expression using round
round(1988.74699)

A function may also be called on more than one value (called *arguments*). For instance, the `min` function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas.

In [None]:
min(9, -34, 0, 99)
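
Call expressions can also be combined: the value returned by one function can be passed as an argument to another. The sketch below is our own illustration using only `round` and `min`; the numbers and names are made up.

```python
# Round three measurements, then find the smallest rounded value
smallest_rounded = min(round(2.6), round(7.1), round(4.4))
print(smallest_rounded)  # 3

# The same computation, one step at a time
a = round(2.6)   # 3
b = round(7.1)   # 7
c = round(4.4)   # 4
print(min(a, b, c))  # 3
```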

<div class="alert alert-warning">
<b>PRACTICE</b>
<ul>
    <li>The <code>abs</code> function takes one argument (just like <code>round</code>)</li>
    <li>The <code>max</code> function takes one or more arguments (just like <code>min</code>)</li>
</ul>


Try calling `abs` and `max` in the cell below. What does each function do?

Also try calling each function *incorrectly*, such as with the wrong number of arguments. What kinds of error messages do you see?
</div>

In [None]:
# replace the ... with calls to abs and max
...

#### Dot Notation
Python has a lot of [built-in functions](https://docs.python.org/3/library/functions.html) (that is, functions that are already named and defined in Python), but even more functions are stored in collections called *modules*. Earlier, we imported the `math` module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a `.`.

In [None]:
# a call expression with the factorial function from the math module
math.factorial(5)
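
Dot notation works the same way for everything a module provides. As a short sketch, here are two more items from the standard `math` module, `floor` and `pi`; the name `rounded_down` is our own. The block repeats the import so that it can run on its own.

```python
import math

rounded_down = math.floor(3.7)   # floor rounds down to a whole number

print(math.factorial(5))  # 120
print(rounded_down)       # 3
print(math.pi)            # a module can also hold named values, not only functions
```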

<div class="alert alert-warning">
<b>PRACTICE:</b> <code>math</code> also has a function called <code>sqrt</code> that takes one argument and returns the square root. Call <code>sqrt</code> on 16 in the next cell.
</div>

In [None]:
# use math.sqrt to get the square root of 16
...

#### Downloading as PDF

Download this notebook as a PDF by clicking <b><code>File > Download as > PDF via LaTeX</code></b>. Turn the PDF in to bCourses under the first assignment.

#### References

- Sections of "Intro to Jupyter", "Table Transformation" adapted from materials by Kelly Chen and Ashley Chien in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources)
- "A Note on Errors" subsection and "error" image adapted from materials by Chris Hench and Mariah Rogers for the Medieval Studies 250: Text Analysis for Graduate Medievalists [data science module](https://github.com/ds-modules/MEDST-250).
- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series

Authored by Keeley Takimoto, adapted by the BUDS team.