# [ESPM-163ac]: Lab1 - Introduction to Jupyter Notebook

*Estimated Time: ~45 minutes*

Welcome to your first lab! We will take you step-by-step through some data analysis tools you'll need to analyze the CalEnviroScreen data we discussed in lecture this week. You will learn some coding skills, how to import and manipulate a table, and how to plot some cool graphs to turn numbers into visualizations. Don't worry about memorizing everything contained in this notebook -- we provide a "cheat sheet" you can refer to toward the end! These are skills we will build on next lecture and lab to analyze relationships between race, environmental factors and health outcomes -- have this big picture in mind as you go through the lab today. Have fun!

## The Jupyter Notebook

First of all, note that this page is divided into what are called *cells*. You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.

### Text cells

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but know the difference between Text Cells and Code Cells.

### Code cells
Other cells contain code in the Python 3 language. Don't worry -- we'll show you everything you need to know to succeed in this part of the class. 

The fundamental building block of Python code is an **expression**. Cells can contain multiple lines with multiple expressions.  We'll explain what exactly we mean by "expressions" in just a moment: first, let's learn how to "run" cells.

### Running cells

"Running a cell" is equivalent to pressing "Enter" on a calculator once you've typed in the expression you want to evaluate: it produces an **output**. When you run a text cell, it outputs clean, organized writing. When you run a code cell, it **computes** all of the expressions you want to evaluate, and can **output** the result of the computation.

<p></p>

<div class="alert alert-info">
To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, you can either press the <code><b>▶|</b> Run </code> button above or press <b><code>Shift + Return</code></b> or <b><code>Shift + Enter</code></b>. This will run the current cell and select the next one.
</div>

Text cells are useful for taking notes and keeping your notebook organized, but your data analysis will be done in code cells. We will focus on code cells for the rest of the class.




### Expressions

An expression is a combination of numbers, variables, operators, and/or other Python elements that the language interprets and acts upon. Expressions act as a set of **instructions** to be followed, with the goal of generating specific outcomes.

You can start by thinking of code cells as really smart calculators that compute these expressions. For instance, code cells can evaluate simple arithmetic:

In [None]:
#Run me!
#This is an expression
10 + 10

In [None]:
#Run me too!
#This is another expression
(10 + 10) / 5

### Variables!

But the point of coding is that you can save these outputs to use later without computing them again. So, we set **variables** to these values! Just like in your standard algebra class, you can set the letter `x` to be 10. You can set the letter `y` to be 5. You can add variables `x` and `y` to get 15.

In [None]:
#this won't output anything: you're just telling the cell to set x to 10
x = 10

In [None]:
y = 5

In [None]:
#This will output the answer to the addition: you're asking it to compute the number
x + y

You can then **redefine** variables you've used before to hold **new** values; in other words, the new values replace the old ones. If you run the following cell, `x` and `y` will now hold different values:

In [None]:
#You can put all of the different expressions above into one code cell.
#When you run this code cell, everything will be evaluated in order, from top to bottom.
x = 3
y = 8
x+y

In algebra class, you were limited to using the letters of the alphabet as variable names. Here, you can use any combination of words **as long as there are no spaces in the names:**

In [None]:
test = 4
test

In [None]:
another_test = 234
another_test

Variables are assigned to some value. This means everything to the right side of the equals sign is **first evaluated** and **then saved** as the variable:

In [None]:
#Notice y is still the same value as before
# test + y is evaluated to 12
# 12 is then saved as the variable answer_to_above

answer_to_above = test+y
answer_to_above
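
To make the "first evaluated, then saved" rule concrete, here is a small sketch. The variable names (`apples`, `oranges`, `total_fruit`) are made up for this illustration and are not part of the lab's data.

```python
# Hypothetical variable names, used only for this illustration.
apples = 4
oranges = 8

# The right side (apples + oranges) is evaluated first, giving 12,
# and only then is 12 saved under the name total_fruit.
total_fruit = apples + oranges

# Changing apples afterwards does not change total_fruit:
# total_fruit holds the number 12, not the formula "apples + oranges".
apples = 100
print(total_fruit)

# The same rule lets a variable be updated using its own old value.
total_fruit = total_fruit + 1
print(total_fruit)
```

The variable stores the result of the calculation, so later changes to `apples` leave `total_fruit` untouched until you assign to it again.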

**You Try:** What should be the answer to "2 times `test` plus `answer_to_above`"? 

*Hint:* the asterisk symbol \* is used for the multiplication sign

In [None]:
#Calculate your answer in this code block


Does your answer make sense?

### Variables vs. Strings
We have to be careful when working with words in a code cell. Words carry different meanings and uses when they are **in quotes** and **not in quotes**.

When words are in quotes, they often signify some sort of **identification**. In this lab (and future notebooks), you'll use identification to specify **column names of tables**. In quotes, words have meaning: Python understands these as the name of something. If a table has a column named "Gender", we can ask the notebook to find the column named "Gender" and the data contained in this column. Therefore, these quotations are **values**, just like numbers. In Python, this value type is called a **string**. 

In [None]:
#I output a number
123

In [None]:
#I output a string
"Woohoo"

What if we try to type a random word in a code cell **without** putting it in quotes?

In [None]:
#This will Error!
Woohoo

It throws an error! Why? Because any word **not** in quotes is treated as the name of a **variable** that stores information or means something. In this notebook, we haven't told it what `Woohoo` means -- no value has ever been assigned to that name, so it complains and says "I don't know what `Woohoo` is supposed to be." It's like taking an algebra exam and telling your teacher: "The answer is `x`". Your teacher would write back, "...and what, exactly, is `x`?" That's essentially what the notebook is saying.
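
The sketch below shows the same complaint without stopping the cell. It uses `try` / `except`, which is beyond what you need for this lab and appears here only so the error message can be printed. `Yippee` is a made-up name chosen because nothing in this notebook defines it.

```python
# Yippee is a made-up name that has not been defined anywhere.
try:
    Yippee                      # no quotes: Python looks for a variable called Yippee
except NameError as error:
    message = str(error)
    print("Python complained:", message)

# In quotes, the same word is a string value, so there is nothing to look up.
greeting = "Yippee"
print(greeting)
```

The kind of error here is a `NameError`: Python found a name it has no value for.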

## Defining Functions

Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are already provided for you, either by Python itself or by the libraries we import, such as the table functions you will see later in this lab. You can also **make your own functions**. You won't have to define functions yourself, but it's important to know how they work and how to understand what each does.

Functions generally take a set of __parameters__ (also called inputs), which define the objects they will use when they are run. For example, take a look at the function defined below. The green <span style="color:green">def</span> begins the definition of a function. The blue <span style="color:blue">add_two</span> is the name of the function which you can use later to repeat the operation, and the `n` in parentheses is the parameter/input. The red text under the <span style="color:green">def</span> statement tells us **how** to use the function and is called a __doc-string__.

In [None]:
# See the doc-string written in red for the function description and the parameters.
def add_two(n):
    """Adds 2 to the input.
    
    Parameters
    ----------
    n: int
        The integer 2 will be added to.
        
    Returns
    -------
    int
        An integer which is 2 greater than the original input n.
        
    Example
    -------
    >>> add_two(4)
    6
    """
    return n + 2

In [None]:
add_two(3)

In [None]:
add_two(-1)
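
You can also read a function's doc-string from code instead of scrolling back to its definition. The sketch below repeats `add_two` with a shortened doc-string so that it runs on its own; the shortening is only for this illustration.

```python
def add_two(n):
    """Adds 2 to the input."""
    return n + 2

# Every function keeps its doc-string in an attribute called __doc__.
# In a notebook, help(add_two) displays the same text in a friendlier layout.
doc_text = add_two.__doc__
print(doc_text)

# The function still works exactly as before.
print(add_two(3))
```

This is handy for functions you did not write: the doc-string tells you what to pass in and what comes back.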

Now let's look at a function that takes two parameters. The `array_adder` function below takes as parameters an array `m` and an array `n`, adds them together position by position, and returns the result.

In [None]:
#A function that takes in two arrays of the same size, m and n, and returns a new array with their elements added together.
import numpy as np
def array_adder(m, n):
    """Adds two arrays of the same size together
    
    Parameters
    ----------
    m, n: array
        The two arrays that will be added at each position.
    
    Returns
    -------
    array
        One array the same size as m and n with the integers added together.
        
    Example
    -------
    >>> array_adder(np.array([2, 4, 6]), np.array([1, 3, 5]))
    array([ 3,  7, 11])
    """
    
    return m + n

In [None]:
m = np.array([1, 2, 3, 4])
n = np.array([5, 6, 7, 8])
array_adder(m, n)
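
The doc-string says the two arrays must be the same size. The sketch below shows what happens when they are not. It repeats `array_adder` with a shortened doc-string so it runs on its own, and uses `try` / `except` (not needed elsewhere in this lab) only to print the error instead of stopping.

```python
import numpy as np

def array_adder(m, n):
    """Adds two arrays of the same size together."""
    return m + n

# Same size: elements are paired up position by position.
same_size = array_adder(np.array([1, 2, 3]), np.array([10, 20, 30]))
print(same_size)

# Arrays with 3 and 2 elements cannot be paired up position by position.
try:
    array_adder(np.array([1, 2, 3]), np.array([1, 2]))
except ValueError as error:
    mismatch_message = str(error)
    print("NumPy complained:", mismatch_message)
```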

### Understanding Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is **not** smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will usually produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
"This line is missing something

Fix the error below:

In [None]:
#Your Answer Here


# Tables!


Now run this cell to import some tools we'll use today. Don't worry about anything printing out -- simply run the cell.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from datascience import *
%matplotlib inline 
plt.style.use("fivethirtyeight")

### Importing

In data analysis, there is almost always a file holding your data that already exists. There are thousands of databases online that contain information on topics from all domains. In general, to import data from a file, we write something like:

```python
Table.read_table("...file_location/file_name")
```

Most often, these file names end in `.csv` to show the data format. `.csv` format is popular for spreadsheets and can be imported/exported from programs such as Microsoft Excel, OpenOffice Calc, or Google spreadsheets. 
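
If you have never opened a `.csv` file as plain text, the sketch below shows what one looks like inside. It uses Python's built-in `csv` module rather than `Table.read_table`, and the rows are made up for illustration (they are not from the lab's data file).

```python
import csv
import io

# Made-up rows, typed here as text so the example needs no file on disk.
# Each line is one row, and commas separate the columns.
csv_text = """City,Population,Unemployment
Alpha,1200,5.5
Beta,830,7.1
"""

reader = csv.reader(io.StringIO(csv_text))
rows = list(reader)

header = rows[0]        # first line holds the column names
data_rows = rows[1:]    # every later line is one row of the table

print(header)
print(len(data_rows), "rows of data")
```

`Table.read_table` does this splitting for you and arranges the result into a table with labeled columns.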
 
An example is shown below using the CalEnviroScreen data we discussed in lecture.

In [None]:
Table.read_table("../data/ces_data.csv")

That's a lot of information. As you can see from the labels on top, this table shows location, environmental and health factors, as well as population and demographic information.

## Using Tables

We can make criteria to cut down and manipulate tables. Accessing only the rows, columns, or values specific to our purpose makes information easier to understand. Analysis and conclusions can be made when data is more digestible. 

We need to access the [CalEnviroScreen data](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-30) above and name it for further use. We assign the table to a **variable** we call `ces_data` so that we can reference it later!

In [None]:
ces_data = Table.read_table("../data/ces_data.csv")
ces_data

This notebook can calculate how large this table is with two built-in table properties: `num_rows` and `num_columns`. The general form is `table.num_rows` and `table.num_columns`.

Let's use these on the table above. 

In [None]:
ces_data.num_rows

In [None]:
ces_data.num_columns

### `select`ing Columns to Keep

That's an 8035 x 65 table! We don't need all the information contained in this massive table: we want to cut it down by keeping only a handful of columns. Let's only include the following columns: "Asthma", "African American (%)", "Total Population", "Unemployment", "Poverty", "Hispanic (%)", and "White (%)".


There are multiple ways to make a table that includes only certain columns, but we'll learn just one today: the `select` function.

- `select` can create a new table with only the specified column names (listed as parameters)

As you can see below, each column name must be written as a string, that is, a quoted word or name.

In [None]:
select_ces_data = ces_data.select("Asthma", "African American (%)", "Total Population", "Unemployment", "Poverty", "Hispanic (%)", "White (%)")
select_ces_data

### Extracting Data of One Column

Tables are useful because they store information. But to use them, we have to **choose and extract** this stored data. In our lecture and Lab2, we'll be extracting columns to calculate and plot relationships between them. To do this, you need the `column` function: this extracts any column from the table and turns it into an **array** of values.

In [None]:
#Run Me!
select_ces_data.column("Unemployment")

Now you can run calculations on this data (e.g. take the average, find the sum, etc)! Next week, we'll use two columns at a time to calculate a metric that signifies what kind of relationship the two attributes have in the dataset: we will be accessing our data this way a lot.
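
As a rough sketch of such calculations, the example below uses a small made-up array in place of the real column, so it runs without the data file. `np.mean` and `np.sum` are NumPy functions; the variable names are chosen only for this illustration.

```python
import numpy as np

# Made-up numbers standing in for the array that .column("Unemployment") returns.
unemployment = np.array([4.0, 6.0, 8.0, 10.0])

average = np.mean(unemployment)   # add everything up, then divide by the count
total = np.sum(unemployment)      # add everything up

print(average, total)
```

With the real data, you would replace the made-up array with `select_ces_data.column("Unemployment")`.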

---

### Table Functions!

Here is a list of the table functions we just covered. We'll be using these operations next week in lecture and lab, so make sure you understand what each does!

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|

---

## Visualization: Scatter Plot! 

Let's start **visualizing** our data! Because the columns in the table above are numerical, we'll be using **scatter plots**.

To create a scatter plot, we need to use the `scatter()` function. The general form is:

```python
table.scatter("column name for x axis", "column name for y axis")
```

An example is shown below:

In [None]:
select_ces_data.scatter("Unemployment", "Asthma") 

Hmm... It appears there are so many data points overlapped on this graph that we can't see what's happening inside the big dark blob. Thankfully, the `scatter()` function takes in a few more parameters to help with this problem. Run the code below and compare the plot with what you see above:

In [None]:
select_ces_data.scatter("Unemployment", "Asthma", s = 8, alpha = .12) 

Does this look a little better? What changed?

We used two additional parameters: 
- `s`: changes the **size** of each point on the graph (default is 20)
- `alpha`: changes the **transparency** of each point

Why do we do this? It reduces the clutter of large datasets. If we don't adjust the transparency of the data points, we won't know where the data is actually concentrated, because it's hidden by a blob of thousands of solid circles. Often times, adjusting the transparency allows us to see different trends in the data.

**Discussion:** What can you say about the relationship between unemployment and asthma in census tracts across California?

*Your Answer Here*

**Your Turn!**

Now try plotting a scatter plot in the next cell using any two columns from the table below! Play around with the extra parameters (`s`, `alpha`) we saw above to see how they affect your graph.

In [None]:
# run this cell to see the content of the table
select_ces_data

In [None]:
# Put your code here!
your_plot = ...
your_plot

---

## SUMMARY 

### You've learned a lot in this module! Let's look back on the key parts. 

- Jupyter Notebook fundamentals

- Python language: Expressions, Variables, and Strings

- Defining new functions / reading docstrings to learn how to use a function

- Understanding and catching errors: how to read and deal with error messages

- Import data from a .csv/.txt file with `Table.read_table("..file_location/file_name")`.

- Count number of rows with `table_name.num_rows`.

- Count number of columns with `table_name.num_columns`.

- Create a new table with only the columns indicated in the parameters with `table_name.select("COLUMN NAME", ...)`. 

- Create and adjust a scatter plot with `table.scatter("column for x axis", "column for y axis", s = size of point, alpha = transparency)`.

---

With just some simple code, we were able to do an incredible amount of data analysis! Play around with the examples until you feel comfortable with the content of this notebook. We will be using notebooks to analyze your own data sets in the future! Please ask if you have questions!

**Congratulations!** You have completed your first lab and introduction to Jupyter Notebook! In the next lecture and lab, we will use these new skills to explore statistical concepts like correlation and prediction. Stay tuned!


## Peer Consulting Office Hours
If you had trouble with any content in this notebook, Data Peer Consultants are here to help! You can check for availability of Peer Consultants on the **third floor of Moffitt library** (right across from the entrance) with this detailed [Office Hours schedule](https://data.berkeley.edu/education/peer-consulting). Peer Consultants are there to answer all data-related questions, whether it be about the content of this notebook, applications of data science in the world or other data science courses offered at Berkeley -- make sure to take advantage of this wonderful resource!



---

**Bibliography**

Content adapted from Psych167AC module: https://github.com/ds-modules/PSYCH-167AC/blob/master/01-Intro-to-Importing-Data-Tables-Graphs.ipynb

*Notebook Developed by: Alleanna Clark*