In [None]:
from datascience import *
import numpy as np
import math
from IPython.display import Image
%matplotlib inline

# Welcome to GWS-131's Data Science Module!

# Introduction to Python and Jupyter Notebooks:

## 1. Cells - Text and Code
In a notebook, each rectangle containing text or code is called a *cell*.

Cells (like this one) can be edited by double-clicking on them. This cell is a text cell, written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to worry about Markdown today, but it's a pretty fun and easy tool to learn.

After you edit a cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. Alternatively, you can press `return` or `enter` while holding down the `shift` key. (Try not to delete the instructions.)

Other cells contain code in the Python 3 programming language.  Running a code cell will execute all of the code it contains.

Try running this cell:

In [None]:
print("Hello, World!")

## 2. Numbers

Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, expressions can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [None]:
3.2500

And this one:

In [None]:
print(3)
4
5

Notice that we don't necessarily need to `print`. When you run a notebook cell, if the last line has a value, then Jupyter helpfully prints out that value for you. However, it won't print out prior lines automatically.

## 3. Arithmetic
Many basic arithmetic operations are built into Python.  The Data 8 textbook section on [Expressions](http://www.inferentialthinking.com/chapters/03/1/expressions.html) describes all the arithmetic operators used in the course.  The common operator that differs from typical math notation is `**`, which raises one number to the power of the other. So, `2**3` stands for $2^3$ and evaluates to 8.

The order of operations is what you learned in elementary school, and Python also has parentheses.  For example, run the cell below and notice how the parentheses group the calculation. Use parentheses for a happy new year!

In [None]:
1+(6*5-(6*3))**2*((2**3)/4*7)
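As a rough sketch of how Python works through the expression above, the pieces can be evaluated one at a time, innermost parentheses first. The last two lines are an extra illustration (not from the lab) of how parentheses change a result.

```python
# Work through 1+(6*5-(6*3))**2*((2**3)/4*7) piece by piece.
print(6*5 - (6*3))         # 30 - 18 = 12
print((6*5 - (6*3))**2)    # 12 squared = 144
print((2**3) / 4 * 7)      # 8 / 4 * 7 = 14.0
print(1 + 144 * 14.0)      # multiplication before addition: 2017.0

# Parentheses override the usual order of operations.
print(2 + 3 * 4)           # 14
print((2 + 3) * 4)         # 20
```

Exponentiation happens before multiplication and division, which in turn happen before addition and subtraction, unless parentheses say otherwise.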

## 4. Names
In natural language, we have terminology that lets us quickly reference very complicated concepts.  We don't say, "That's a large mammal with brown fur and sharp teeth!"  Instead, we just say, "Bear!"

Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document.

In Python, we do this with *assignment statements*. An assignment statement has a name on the left side of an `=` sign and an expression to be evaluated on the right.

In [None]:
ten = (3 * 11 + 5) / 2 - 9

When you run that cell, Python first evaluates the right-hand side of the `=` sign.  It computes the value of the expression `(3 * 11 + 5) / 2 - 9`, which is the number 10 (displayed as `10.0`, because `/` always produces a decimal number).  Then it gives that value the name `ten`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `ten`:

In [None]:
ten

## 5. Functions

    
One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are

| Function | Description                                                   |
|----------|---------------------------------------------------------------|
| `abs`      | Returns the absolute value of its argument                    |
| `max`      | Returns the maximum of all its arguments                      |
| `min`      | Returns the minimum of all its arguments                      |
| `pow`      | Raises its first argument to the power of its second argument |
| `round`    | Rounds its argument to the nearest integer                    |

Here are two call expressions that both evaluate to 3:

    abs(2 - 5)
    max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))

Both of these are **compound expressions**, meaning that they are actually combinations of several smaller expressions.  For example, `2 - 5` combines the expressions `2` and `5` by subtraction.  In this case, `2` and `5` are called **subexpressions** because they're expressions that are part of a larger expression.

A **statement** is a whole line of code.  Some statements are just expressions.  The expressions listed above are examples.

Other statements *make something happen* rather than *having a value*.  After they are run, something in the world has changed.  For example, an assignment statement assigns a value to a name. 

A good way to think about this is that we're **evaluating the right-hand side** of the equals sign and **assigning it to the left-hand side**. Here are some assignment statements:
    
    height = 1.3
    the_number_five = abs(-5)
    absolute_height_difference = abs(height - 1.688)

A key idea in programming is that large, interesting things can be built by combining many simple, uninteresting things.  The key to understanding a complicated piece of code is breaking it down into its simple components.
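To illustrate breaking code into simple components, here is a sketch that rewrites the second call expression from earlier in this section as a sequence of assignment statements. The intermediate names (`rounded`, `big`, `small`, `smaller_of_two`, `result`) are made up for this example.

```python
# Original compound expression:
#   max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))

rounded = round(2.8)                    # 3
big = pow(2, 10)                        # 1024
small = -1 * pow(2, 10)                 # -1024
smaller_of_two = min(big, small)        # -1024
result = max(rounded, smaller_of_two)   # 3

print(result)
```

Each line is a simple statement on its own, and reading them in order shows how the final value of 3 is built up.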

### 5.1. Calling functions

The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.

For example, the `abs` function takes a single number as its argument and returns the absolute value of that number.  The absolute value of a number is its distance from 0 on the number line, so `abs(5)` is 5 and `abs(-5)` is also 5.

In [None]:
abs(5)

In [None]:
abs(-5)

Functions can be called as above, with the arguments in parentheses after the function name, or with "dot notation", where the function (called a *method*) is written after the value it operates on, as in the cell immediately below.

In [None]:
nums = make_array(1, 2, 3) # an array of numbers, explained in more detail soon
nums.max()

In [None]:
max(nums)

## 6. Strings

A `string` is a type of data made up of characters, such as letters, digits, and punctuation. A string is always enclosed in single or double quotation marks. A string can contain the digits of a number, but you can't do normal numerical operations on it.

### 6.1. Converting to and from Strings

Strings and numbers (numbers being integers and decimals) are different *types* of values, even when a string contains the digits of a number. For example, the following cell adds two integers and works as expected.

In [None]:
1 + 1

Now try running this:

In [None]:
1 + "1"

This gives a TypeError since we are trying to add an integer and a string.

However, there are built-in functions to convert numbers to strings and strings to numbers. 

    int:   Converts a string of digits to an integer ("int") value
    float: Converts a string of digits, perhaps with a decimal point, to a decimal ("float") value
    str:   Converts any value to a string
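As a small sketch of these three conversion functions in action (the particular values are chosen only for illustration):

```python
# Convert strings of digits to numbers before doing arithmetic.
print(int("12") + 1)        # 13
print(float("3.5") * 2)     # 7.0

# Convert a number to a string before joining it to another string.
print(str(1) + "1")         # 11 (the string "11", not the number eleven)

# Converting the string first makes the earlier addition work.
print(1 + int("1"))         # 2
```

Note that `+` between two strings joins them end to end rather than adding them as numbers.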

### 6.2. Strings as function arguments

String values, like numbers, can be arguments to functions and can be returned by functions.  The function `len` takes a single string as its argument and returns the number of characters in the string: its **len**-gth.  

Note that it doesn't count *words*. 

**Question**  Use `len` to find out the number of characters in the string.  

In [None]:
welcome = "Welcome to this class!"
sentence_length = len(welcome)
sentence_length
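A few more illustrative calls (with example strings that are not from the lab) show that `len` counts every character, including spaces and punctuation:

```python
print(len("data science"))             # 12: 4 letters + 1 space + 7 letters
print(len("Welcome to this class!"))   # 22: spaces and "!" are counted too
print(len(""))                         # 0: the empty string has no characters
print(len(str(2017)))                  # 4: convert the number to a string first
```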

## 7. Importing code


Most programming involves work that is very similar to work that has been done before.  Since writing code is time-consuming, it's good to rely on others' published code when you can.  Rather than copy-pasting that code, we can **import** it in Python, creating a **module** that contains all of the names created by that code.

Python includes many useful modules that are just an `import` away.  We'll look at the `math` module as a first example. The `math` module is extremely useful in computing mathematical expressions in Python. 

Suppose we want to very accurately compute the area of a circle with radius 5 meters.  For that, we need the constant $\pi$, which is roughly 3.14.  Conveniently, the `math` module has `pi` defined for us:

In [None]:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle

`pi` is defined inside `math`, and the way that we access names that are inside modules is by writing the module's name, then a dot, then the name of the thing we want:

    <module name>.<name>
    
In order to use a module at all, we must first write the statement `import <module name>`.  That statement creates a module object with things like `pi` in it and then assigns the name `math` to that module.  Above we have done that for `math`.

Note that different functions take different numbers of arguments. Often, the module's documentation states how many arguments each function requires.

In [None]:
# Calculating factorials.
math.factorial(5)

In [None]:
# Calculating square roots.
math.sqrt(5)
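As a brief sketch of functions in `math` that take different numbers of arguments (the choice of `math.pow` and `math.gcd` here is just for illustration):

```python
import math

# One argument: the square root of 16.
print(math.sqrt(16))       # 4.0

# Two arguments: 2 raised to the power 10, returned as a decimal number.
print(math.pow(2, 10))     # 1024.0

# Two arguments: the greatest common divisor of 12 and 18.
print(math.gcd(12, 18))    # 6
```

Passing the wrong number of arguments, such as `math.sqrt(16, 2)`, produces a `TypeError`.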

## 8. Arrays

Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.


**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .10 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .10 (10%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

### 8.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. Execute the following cell so that all the names from the `datascience` module are available to you.

In [None]:
from datascience import *

Now, to create an array, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [None]:
my_array = make_array(0.125, 4.75, -1.3)
my_array

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

Now try adding 1 to every single element in this array.

In [None]:
my_array = my_array + 1
my_array

In [None]:
my_array.item(0)

The cell above uses `.item` to get a single element out of the array. Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element. This is a weird convention in computer science. 0 is called the *index* of the first item. It's the number of elements that appear before that item. So 3 is the index of the 4th item.
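The sketch below puts indexing and array arithmetic together. It assumes NumPy's `np.array` as a stand-in for `make_array` (the arrays in this lab are NumPy arrays), and the values are made up for illustration.

```python
import numpy as np

# Assumption: np.array([...]) plays the role of make_array(...) here.
scores = np.array([10, 20, 30, 40])

first = scores.item(0)     # index 0 is the 1st item: 10
fourth = scores.item(3)    # index 3 is the 4th item: 40

# Arithmetic applies to every element at once and gives a new array.
shifted = scores + 1       # 11, 21, 31, 41

print(first, fourth)
print(shifted)
```

The original array `scores` is unchanged by `scores + 1`; the result only replaces it if you assign it back to the same name, as the lab does with `my_array = my_array + 1`.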

## 9. Tables 

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

When we import data later in this lab, it will import into a table format.

### 9.1. Analyzing datasets
With just a few table methods, we can answer some interesting questions about datasets.

We can extract single columns, which are arrays themselves, and do math on them (averaging, max, min, etc), which we'll do on real data soon. We can also rearrange the order of rows in a table by the values in any column, add more rows or columns, filter tables to select only rows that meet certain criteria, and much more!

## Tables Essentials!

For your reference, here's a table of all the functions and methods we saw in this lab.

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`with_columns`|`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`|Create a copy of a table with more columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("2*N")`|Create a copy of a table without some of the columns|
|`take`|`tbl.take(np.arange(0, 6, 2))`|Create a copy of the table with only the rows whose indices are in the given array|


## Silicon Valley

These data are compiled from EEO-1 reports from Apple, Twitter, Salesforce, Facebook, Microsoft, and Intel. The EEO-1 is a document required by the federal government that provides the raw numbers of employees in each of the categories below. We summed the most recent data (all from 2014-16) for these companies to get the table below.

In [None]:
# load data
tech_data = Table.read_table('data/Total-compiled_EEO-1.csv')
tech_data.show()

## UC Berkeley

UCB data is from Fall 2015.

Source: UC Corporate Personnel System

Note: STEM includes engineering and computer science, life sciences, math, medicine, other health sciences and physical sciences.

In [None]:
# LRE = ladder-rank and equivalent faculty
UCB_LRE_female = Table.read_table('data/UCB-percent-female-LRE.csv')
UCB_LRE_female.show()

In [None]:
# remove the Gender column (column index 1), since its value is always Female
UCB_LRE_female = UCB_LRE_female.drop(1)
UCB_LRE_female

Plot the data on a bar graph, so that we can visually compare disciplines over time.

In [None]:
UCB_LRE_female.barh('Discipline')

What do you notice about the proportions of female LRE faculty across disciplines? Discuss with the people around you.

In [None]:
# should get percent faculty overall in each of these fields, then compare with LRE
# then try to find typical income for LRE vs. other faculty?

In [None]:
Image('data/gender_subject.png') #from accountability.universityofcalifornia.edu

The increase in the share of ladder-rank and equivalent (LRE) faculty who are underrepresented minorities has largely been due to an increase in the Hispanic/Latino(a) group. Representation by American Indian and African American faculty remains a challenge.

In [None]:
Image('data/subject_line_graph.png') #from accountability.universityofcalifornia.edu

Female LRE faculty have grown in share over time, fueled by increased diversity in hiring. Their proportion differs significantly depending on discipline.