In [None]:
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
from IPython.display import Image, display
import ipywidgets as widgets
from scipy import stats
%matplotlib inline

# Welcome to GWS-131's Data Science Module!

# Introduction to Python and Jupyter Notebooks:

## 1. Cells - Text and Code
In a notebook, each rectangle containing text or code is called a *cell*.

Cells (like this one) can be edited by double-clicking on them. This cell is a text cell, written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to worry about Markdown today, but it's a pretty fun+easy tool to learn.

After you edit a cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions.) You can also press `SHIFT-ENTER` to run any cell or progress from one cell to the next.

Other cells contain code in the Python3 programming language.  Running a code cell will execute all of the code it contains.

Try running this cell:

In [None]:
print("Hello, World!")

## 2. Numbers

Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, expressions can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [None]:
3.2500

And this one:

In [None]:
print(3)
4
5

Notice that we don't necessarily need to `print`. When you run a notebook cell, if the last line has a value, then Jupyter helpfully prints out that value for you. However, it won't print out prior lines automatically. If you wish to print out multiple lines, then the `print` function is helpful!

## 3. Arithmetic
Many basic arithmetic operations are built in to Python.  The Data 8 textbook section on [Expressions](http://www.inferentialthinking.com/chapters/03/1/expressions.html) describes all the arithmetic operators used in the course.  The common operator that differs from typical math notation is `**`, which raises one number to the power of the other. So, `2**3` stands for $2^3$ and evaluates to 8. 

The order of operations is what you learned in elementary school, and Python also has parentheses.  For example, compare the outputs of the cells below. Use parentheses for a happy new year!
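
To see why parentheses matter, here is a small sketch with made-up numbers (they do not come from the course materials). The same digits and operators give very different results depending on grouping:

```python
# Exponentiation binds tighter than multiplication, which binds tighter than addition.
without_parens = 2 + 3 * 4 ** 2     # 4**2 = 16, then 3*16 = 48, then 2+48
with_parens = ((2 + 3) * 4) ** 2    # 2+3 = 5, then 5*4 = 20, then 20**2

print(without_parens)  # 50
print(with_parens)     # 400
```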

In [None]:
1+(6*5-(6*3))**2*((2**3)/4*7)

## 4. Names
In natural language, we have terminology that lets us quickly reference very complicated concepts.  We don't say, "That's a large mammal with brown fur and sharp teeth!"  Instead, we just say, "Bear!"

Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document.

In Python, we do this with *assignment statements*. An assignment statement has a name on the left side of an `=` sign and an expression to be evaluated on the right.

In [None]:
ten = (3 * 11 + 5) / 2 - 9

When you run that cell, Python first evaluates the expression on the right: it computes the value of `(3 * 11 + 5) / 2 - 9`, which is the number 10 (displayed as `10.0`, because `/` always produces a decimal number).  Then it assigns that value the name `ten`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `ten`:

In [None]:
ten

## 5. Functions

    
One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are

| Function | Description                                                   |
|----------|---------------------------------------------------------------|
| `abs`      | Returns the absolute value of its argument                    |
| `max`      | Returns the maximum of all its arguments                      |
| `min`      | Returns the minimum of all its arguments                      |
| `pow`      | Raises its first argument to the power of its second argument |
| `round`    | Round its argument to the nearest integer                     |

Here are two call expressions that both evaluate to 3

    abs(2 - 5)
    max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))

All these expressions are **compound expressions**, meaning that they are actually combinations of several smaller expressions.  `2 + 3` combines the expressions `2` and `3` by addition.  In this case, `2` and `3` are called **subexpressions** because they're expressions that are part of a larger expression.

A **statement** is a whole line of code.  Some statements are just expressions, like the examples above, and can be broken down into subexpressions, which get evaluated individually before the statement is evaluated as a whole.
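
As a rough illustration, the nested call expression shown earlier can be unpacked one subexpression at a time. The intermediate names (`big`, `small`, `inner_min`, `rounded`) are invented for this sketch; Python does not create them when it evaluates the one-line version.

```python
# Evaluate the nested call expression from the inside out, naming each subexpression.
big = pow(2, 10)                  # 1024
small = -1 * pow(2, 10)           # -1024
inner_min = min(big, small)       # -1024
rounded = round(2.8)              # 3
result = max(rounded, inner_min)  # the larger of 3 and -1024

print(result)  # 3
```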


### 5.1. Calling functions

The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.

For example, the `abs` function takes a single number as its argument and returns the absolute value of that number.  The absolute value of a number is its distance from 0 on the number line, so `abs(5)` is 5 and `abs(-5)` is also 5.

In [None]:
abs(5)

In [None]:
abs(-5)

Functions can be called as above, with the argument in parentheses after the function name, or with "dot notation", where the value comes first, followed by a dot and the function name, as in the cell immediately below.
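
The two calling styles can also be seen with an ordinary string. This is only a sketch; `word` is a made-up example value.

```python
word = "data"          # hypothetical example value

print(len(word))       # ordinary call: function name first, argument in parentheses
print(word.upper())    # dot notation: value first, then a dot, then the method call
```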

In [None]:
nums = make_array(1,2,5) # an array of numbers, will be explained in more detail soon
nums.max()

In [None]:
max(nums)

## 6. Strings

A `string` is a type of data, usually composed of alphabetical characters. A string is always enclosed in single or double quotation marks.

You can create variables that hold strings, and you can create strings to be any sequence of letters, numbers, and special characters that you want:

In [None]:
"I'm a string!"

There's nothing stopping a string from containing digits, but you can't do normal numerical operations on it. For example, multiplying a string by 4 repeats it four times instead of doing arithmetic:

In [None]:
sentence = "I'm a string"
sentence*4

There are, however, some convenient functions that allow you to convert strings to numbers, and numbers to strings: 
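
A short sketch of converting in both directions; `year_text` and `year_number` are names made up for this example.

```python
year_text = "2015"              # a string that happens to look like a number
year_number = int(year_text)    # now a number, so arithmetic works

print(year_number + 1)          # 2016
print(str(year_number) + "!")   # 2015!
print(year_text * 2)            # 20152015 (repetition, not multiplication)
```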

In [None]:
zero = '0'
int(zero)

In [None]:
str(zero)

**Strings as function arguments**

String values, like numbers, can be arguments to functions and can be returned by functions.  The function `len` takes a single string as its argument and returns the number of characters in the string: its **len**-gth.  

Note that it doesn't count *words*. 
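
For instance, with a made-up string (not the one used in the challenge), the character count and the word count differ. Splitting on whitespace with `split` is just one simple way to count words.

```python
phrase = "data science"        # hypothetical example string

print(len(phrase))             # 12 characters, counting the space
print(len(phrase.split()))     # 2 words
```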

**Challenge**

Use `len` to find out the number of characters in the string `welcome` below:  

In [None]:
welcome = "Welcome to this class!"

# your code below


## 7. Importing code


Most programming involves work that is very similar to work that has been done before.  Since writing code is time-consuming, it's good to rely on others' published code when you can.  Rather than copy-pasting, Python allows us to **import** other code, creating a **module** that contains all of the names created by that code.

Python includes many useful modules that are just an `import` away.  We'll look at the `math` module as a first example. The `math` module is extremely useful in computing mathematical expressions in Python. 

Suppose we want to very accurately compute the area of a circle with radius 5 meters.  For that, we need the constant $\pi$, which is roughly 3.14.  Conveniently, the `math` module has `pi` defined for us:

In [None]:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle

`pi` is defined inside `math`, and the way that we access names that are inside modules is by writing the module's name, then a dot, then the name of the thing we want:

    <module name>.<name>
    
In order to use a module at all, we must first write the statement `import <module name>`.  That statement creates a module object with things like `pi` in it and then assigns the name `math` to that module.  Above we have done that for `math`. If you wish to use a module in more than one cell, however, you only need to import it once — the first time you want to use it.
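
As a small aside, Python also offers a `from ... import ...` form that binds a single name from a module. The sketch below checks that both routes lead to the same constant.

```python
import math
from math import pi        # alternative form: binds only the name `pi`

print(math.pi == pi)       # True: both refer to the same constant
print(round(math.pi, 2))   # 3.14
```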

In [None]:
# Calculating factorials.
math.factorial(5)

In [None]:
# Calculating square roots.
math.sqrt(5)

## 8. Arrays

Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many things at once*.


**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

```python
.10 * billions_of_numbers
```

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .10 (10%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

### 8.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. Execute the following cell so that all the names from the `datascience` module are available to you.

In [None]:
from datascience import *

Now, to create an array, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [None]:
my_array = make_array(0.125, 4.75, -1.3)
my_array

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

In [None]:
print(my_array)

In [None]:
my_array.item(0)

Notice that we wrote .item(0), not .item(1), to get the first element. This is a weird convention in programming and computer science: indices start at 0 instead of 1. So the first thing in an array is item 0, the second is item 1, and so on. It is also described as the number of elements that appear before that item. So 3 is the index of the 4th item.
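
The same counting rule can be sketched with a plain Python list standing in for the array (an assumption made so that the example needs no extra libraries; lists use square brackets where arrays use `.item`).

```python
values = [0.125, 4.75, -1.3]   # a plain Python list standing in for the array

for index, element in enumerate(values):
    # `index` equals the number of elements that come before `element`
    print(index, element)

first = values[0]
last = values[len(values) - 1]  # the last index is one less than the length
```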

## 9. Creating Tables

If we don't have a spreadsheet file and are starting with nothing, first we need to make arrays. In the case of a table, we'll consider an array as either a row or a column. Let's make two arrays below that will become our columns, one for some of our U.S. presidents and one for the year they were born.

In [None]:
president_names = make_array("Jefferson", "Garfield", "Eisenhower", "Obama")
president_birth = make_array(1743, 1831, 1890, 1961)

Now, to make a table using these arrays, we use the general form:

```python
Table( ).with_columns("Column Name", array_name, . . .)
```

We assign the created table to a variable (just like the arrays from above), and then type that variable name to display the table. 

In [None]:
pres_table = Table().with_columns("President", president_names,
                                  "Birth Year", president_birth)
pres_table

Note that `with_columns` can also be used to add additional columns to existing tables, replacing `Table()` with the name of the table you want to add columns to.

In [None]:
pres_with_pets = pres_table.with_columns("Pet Name", make_array("Buzzy", "Veto", "Heidi", "Bo"))
pres_with_pets

### 9.1 Importing

It's more likely that a file holding your data already exists. In general, to import data from a file, we write:

```python
Table.read_table("file_name")
```

Most often, these file names end in `.csv` to show the data format. `.csv` format is popular for spreadsheets and can be imported/exported from programs such as Microsoft Excel, OpenOffice Calc, or Google spreadsheets. 
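
To get a feel for what is inside a `.csv` file, here is a minimal sketch that parses a few lines of made-up CSV text with Python's standard `csv` module. It only illustrates the format; it is not how `Table.read_table` is implemented.

```python
import csv
import io

# Made-up CSV text: the first line is the header, each later line is one row.
csv_text = "President,Birth Year\nJefferson,1743\nObama,1961\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["President"])   # Jefferson
print(len(rows))              # 2
```

Every value comes back as a string here, so `rows[0]["Birth Year"]` is `"1743"` until it is converted with `int`.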
 
An example is shown below using [U.S. Census data](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.csv). 

In [None]:
census_data = Table.read_table("http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.csv")
census_data

That's a lot of information. As you can see from the labels on top, this table shows Biological Sex (0=total, 1=male, 2=female), Age, 2010 Census Information, and estimates of the U.S. population for each year from 2010 through 2015.

### 9.2 Using Tables

We can use criteria to cut down tables. Accessing only the rows, columns, or values specific to our purpose makes the information easier to understand. Analysis and conclusions come more easily when data is more digestible.

This notebook can report how large this table is with two table attributes: `num_rows` and `num_columns`. The general form is `<table>.num_rows` and `<table>.num_columns`.

Let's use these on the table above. 

In [None]:
census_data.num_rows

In [None]:
census_data.num_columns

That's a 306 x 10 table! We can first start to cut down this table using only some columns. Let's only include biological sex, age and the estimated base for 2010 census data. 

There are two methods to make a table with select columns included, `select` or `drop`:

- `select` can create a new table with only the columns indicated in the parameters 
- `drop` can create a new table with columns NOT indicated in the parameters
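
The relationship between the two methods can be pictured in plain Python. The sketch below is an analogy only: the tiny `table` dictionary, the `POP` column, and the stand-alone `select` and `drop` functions are made up for illustration and are not the `datascience` implementation.

```python
# A tiny made-up table stored as a dict that maps column labels to lists.
table = {"SEX": [0, 1, 2], "AGE": [0, 0, 0], "POP": [10, 5, 5]}

def select(tbl, *labels):
    """Keep only the named columns."""
    return {label: tbl[label] for label in labels}

def drop(tbl, *labels):
    """Keep every column except the named ones."""
    return {label: col for label, col in tbl.items() if label not in labels}

# Naming the columns to keep or the columns to discard gives the same result.
print(select(table, "SEX", "AGE") == drop(table, "POP"))  # True
```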


Here's an example of two equivalent calls (keep in mind that we assign each new table to a new variable, to make organization easier):

In [None]:
select_census_data = census_data.select("SEX", "AGE", "ESTIMATESBASE2010")
select_census_data

In [None]:
drop_census_data = census_data.drop("CENSUS2010POP","POPESTIMATE2010","POPESTIMATE2011","POPESTIMATE2012","POPESTIMATE2013","POPESTIMATE2014","POPESTIMATE2015")
drop_census_data

These two cells give us the same resulting table, but `select` was easier to use since we only cared about a few columns. The resulting table is still pretty large (306 x 3), so our next step is to keep only the rows that are not specific to one sex, that is, rows where `SEX` is 0 (the total for males and females combined).

To do this, we need to use a new function `where`. The general form of this function is:

```python
table.where(column_name, predicate)
```

To cut our table down to the rows where `SEX` is 0, we can use the predicate `are.equal_to(0)`. Note that we are assigning the new table to a new variable.

In [None]:
new_census_data = select_census_data.where("SEX", are.equal_to(0))
new_census_data

The display still omits 92 rows! Let's take every 10th entry to cut this table down a little more.

To do this we need to use the `take` function. The `take` function creates a new table with rows from the original table whose indices (row numbers) are given. Remember, in Python, indices start at 0! 
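
The two filtering steps in this section, `where` followed by `take`, can be sketched in plain Python. The rows and the stand-alone `where` and `take` functions below are made up for illustration; they mimic the idea, not the `datascience` internals.

```python
# Made-up rows; each dict is one row of a small table.
rows = [
    {"SEX": 0, "AGE": 0}, {"SEX": 0, "AGE": 1}, {"SEX": 0, "AGE": 2},
    {"SEX": 1, "AGE": 0}, {"SEX": 1, "AGE": 1},
]

def where(rows, label, predicate):
    """Keep the rows whose value in `label` satisfies `predicate`."""
    return [row for row in rows if predicate(row[label])]

def take(rows, indices):
    """Keep the rows at the given positions (counting from 0)."""
    return [rows[i] for i in indices]

totals = where(rows, "SEX", lambda value: value == 0)
every_other = take(totals, [0, 2])
print(every_other)  # [{'SEX': 0, 'AGE': 0}, {'SEX': 0, 'AGE': 2}]
```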

In [None]:
census_10_year = new_census_data.take([0,10,20,30,40,50,60,70,80,90])
census_10_year

Now that sex is all the same, we can drop that column. 

In [None]:
final_census_table = census_10_year.drop("SEX")
final_census_table

---

#### Tables Essentials!

For your reference, here's a table of all the useful `Table` functions we've used so far:

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`with_columns`|`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`|Create a copy of a table with more columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("2*N")`|Create a copy of a table without some of the columns|
|`take`|`tbl.take(np.arange(0, 6, 2))`|Create a copy of the table with only the rows whose indices are in the given array|

### 9.3 Merging Tables

Merging two tables allows us to consolidate information that may be spread across multiple data sources.

#### Adding Columns

We discussed this briefly in an earlier section; you can add new columns to a table by using the `with_columns` method that we learned about when creating new tables. 

Let's pretend that we suddenly have access to the favorite foods of each president from our earlier table. We can merge in a new column to the initial table.

In [None]:
pres_table_with_food = pres_table.with_columns("Favorite food", make_array("Pizza", "Snickers", "Grapes", "Escargot"))
pres_table_with_food

#### Adding Rows

Now let's assume we have a new president's information to add to our table, stored on its own in another table.

In [None]:
new_pres = Table().with_columns("President", make_array("Ford"),
                               "Birth Year", make_array(1913),
                               "Favorite food", make_array("Mac & Cheese"))
new_pres

We can add our new president's information to our original table with the `append` function!

In [None]:
new_pres_combined = pres_table_with_food.copy()
new_pres_combined.append(new_pres)

#### Joining Tables on Columns

Let's say we have another table with some additional information on our presidents that we'd like to combine with our original table, but the rows aren't in the same order as our original table.

In [None]:
places = make_array("Hawaii", "Ohio", "Virginia", "Texas")
new_pres_info = Table().with_columns("President", ["Obama", "Garfield", "Jefferson", "Eisenhower"],
                                      "Birth Place", places)
new_pres_info

We can use a method called `join`, which combines two tables based on one column of information that they share, in this case, the column of names ("President"). A call to `join` takes up to three arguments: the column name of table1 on which you wish to join, table2 itself, and (optionally) the name of the column in table2 on which you wish to join, in case that name differs.
```python
table1.join(column1_name, table2, (column2_name))
```
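
The matching idea behind a join can be sketched in plain Python. This is an analogy under simplifying assumptions: the stand-alone `join` function and the two short lists are made up, and the row order of a real `Table.join` result may differ from this sketch.

```python
# Made-up tables as lists of row dicts; the rows are in different orders.
births = [{"President": "Jefferson", "Birth Year": 1743},
          {"President": "Obama", "Birth Year": 1961}]
places = [{"President": "Obama", "Birth Place": "Hawaii"},
          {"President": "Jefferson", "Birth Place": "Virginia"}]

def join(left, label, right):
    """Pair each left row with the right rows that share its value in `label`."""
    joined = []
    for l_row in left:
        for r_row in right:
            if l_row[label] == r_row[label]:
                joined.append({**l_row, **r_row})  # merge the two rows
    return joined

joined_rows = join(births, "President", places)
for row in joined_rows:
    print(row)
```

Because rows are matched by the value in the shared column, it does not matter that the two tables list the presidents in different orders.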

In [None]:
pres_table_with_food.join("President", new_pres_info)

### 9.4 Visualizations

Now that we have a manageable table we can start making visualizations! Below is a handy chart with all of the visualization functions we'll be working with and how you call them.

|Plot type|Call structure|
|-|-|
|Scatter|`table.scatter("x column", "y column")`|
|Line|`table.plot("x column", "y column")`|
|Bar|`table.bar("x column", "y column")`|
|Horiz. Bar|`table.barh("x column", "y column")`|
|Histogram|`table.hist("x axis", bins=bins, unit=unit)` (`bins` and `unit` are optional)|


We'll quickly go through an example of each of these, so you get an idea for how the plots should look. We'll go back to our census data that we were working with before. If you don't recall, we stored our final manipulations of the census data into `final_census_table`.

In [None]:
final_census_table.scatter("AGE", "ESTIMATESBASE2010") 

In [None]:
final_census_table.plot("AGE", "ESTIMATESBASE2010") 

In [None]:
final_census_table.bar("AGE", "ESTIMATESBASE2010") 

You can also create horizontal bar charts using the method `barh`, which creates a bar chart with the same information as `bar`, aligned horizontally:

In [None]:
final_census_table.barh("AGE", "ESTIMATESBASE2010")

One last visualization technique you may use is the histogram. A histogram is a type of visualization where data is grouped into ranges; those ranges are then plotted as bars. This one is a little trickier than the ones above.

Bins are contiguous intervals that together span the data, so that the values in a dataset can be grouped together. Each bin is inclusive on the left end and exclusive on the right end, or in math notation: `[a, b)`.
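
The `[a, b)` rule can be sketched as a small function. The name `bin_index` and the bin edges are made up for this example. Note one simplification: libraries such as NumPy treat the last bin as closed on the right, while this sketch does not.

```python
def bin_index(value, edges):
    """Return the index of the bin [edges[i], edges[i+1]) that holds `value`, or None."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None

edges = [0, 10, 20, 30]          # made-up bin edges: [0,10), [10,20), [20,30)
print(bin_index(10, edges))      # 1: 10 is the left end of the second bin
print(bin_index(9.99, edges))    # 0
print(bin_index(30, edges))      # None: this sketch leaves the final right edge out
```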

Adjustments can be made on bins to include high or low outliers. We will see this using the original unmodified census data table, `census_data`.

In [None]:
census_data.hist("AGE", unit= "year")

The height of each bar is the percent of rows per unit on the horizontal axis (here, percent per year). The percent of rows that fall into a bin corresponds to the area of its bar, which is found by multiplying the bar's height by the bar's base (in this case, roughly `100` years per bin).

This chart may appear confusing at first. What we are seeing is that the vast majority of rows in our census data have ages between 0 and 100, which makes a lot of sense.  A few rows fall in the 100 to 200 range, which is also possible. The bar in the 900 to 1000 range does not represent a real age group: this dataset uses the code `999` in the `AGE` column to mark totals across all ages.
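
The link between a bar's area and its height can be worked through with made-up counts. In this sketch the two bins have different widths, so the taller bar is not simply the one with more values.

```python
# Made-up counts for two bins of different widths.
edges = [0, 10, 30]          # bins [0, 10) and [10, 30)
counts = [30, 70]            # 100 values in total

total = sum(counts)
heights = []
for i, count in enumerate(counts):
    width = edges[i + 1] - edges[i]
    percent = 100 * count / total      # area of the bar
    heights.append(percent / width)    # height = percent per unit

print(heights)  # [3.0, 3.5]
```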

# Tables 

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

When we import data later in this lab, it will import into a table format.

### Analyzing datasets
With just a few table methods, we can answer some interesting questions about datasets.

We can extract single columns, which are arrays themselves, and do math on them (averaging, max, min, etc), which we'll do on real data soon. We can also rearrange the order of rows in a table by the values in any column, add more rows or columns, filter tables to select only rows that meet certain criteria, and much more!



---

# UC Berkeley

We're going to start right here at UCB! These data are from Fall 2015.

*Source: UC Corporate Personnel System*

**Note**: STEM includes engineering and computer science, life sciences, math, medicine, other health sciences and physical sciences.

Let's read in a CSV file. CSV is a common storage format for spreadsheet data, and CSV files are easily manipulated and exported with programs like Excel.

These data give the share of ladder-rank equivalent (LRE) faculty, meaning tenured and tenure-track faculty at Berkeley, who are female in each division.

In [None]:
UCB_LRE_female = Table.read_table('data/UCB-percent-female-LRE.csv').drop(1)
UCB_LRE_female.show()

We can quickly plot these data on a bar graph using the `barh` function, so that we can best visually compare between disciplines over time.

In [None]:
UCB_LRE_female.barh('Discipline')

What do you notice about the proportions of female LRE faculty across disciplines? Discuss with the people around you, and write a few sentences about it below. 

> Type your response here.

---

We see in other published materials that over the ten year period there has been an increase in underrepresented minorities at UCB:

<img src="data/gender_subject.png" alt="gender_subject" style="width: 400px;"/>

The increase in the share of ladder-rank and equivalent (LRE) faculty who are underrepresented minorities has largely been due to an increase in the Hispanic/Latino(a) group. Representation by American Indian and African American faculty remains a challenge.

Female LRE faculty have grown in share over time, fueled by increased diversity in hiring. Their proportion differs significantly depending on discipline.

<img src="data/subject_line_graph.png" alt="subject_line_graph" style="width: 400px;"/>



---

## UCOP Payroll Dataset

Let's look at another dataset that has the payroll for all UC employees:

In [None]:
UCB_data = Table.read_table('data/UCOP.csv')
UCB_data.show(5)

Let's look only at professors:

In [None]:
rd = UCB_data.select(2,3,4,5).sort(3, descending=True)
professors = rd.where("Title", are.equal_to("PROF-AY"))
professors.show(5)

Big money!

We can visualize the distribution of pay with a histogram, but the histogram (counting frequencies of a specific pay level) will change depending upon the "bin size" of these pay levels. We can make an interactive slider to see this:

In [None]:
def hist_bins(bin_size=1):
    professors.select(3).hist(bins=np.arange(0,500000,bin_size*2000))

slider = widgets.IntSlider(min=1,max=10,step=1,value=5)
display(widgets.interactive(hist_bins, bin_size=slider))

What are the drawbacks and advantages of different bin sizes?

### Salary for males vs. females on the UC payroll

While we don't have gender data in this dataset, we can use a pre-trained machine learning model to predict gender based on first name (we will forget for a moment that creating binary categories of male and female is problematic to begin with). This is ***certainly not 100% accurate*** (it is closer to 80%), but we can use it to get a better idea of salaries for different genders.

Here is an example of what the classification model can do:

In [None]:
from scripts.gender import classify_gender

classify_gender("Daniel"), classify_gender("Katherine")

We can add a new column to our professors table with the gender classification output by the model for each professor's name.

In [None]:
professors.append_column("Gender", [classify_gender(name) for name in professors['First Name']])
professors.show(5)

**Challenge**

Now let's investigate the average salary amounts for female vs. male professors. Try to use what you've learned so far to get the `mean` of `Gross Pay` by gender from the `professors` table, and then make a `bar` plot:

What can we conclude from this analysis?

---

## Silicon Valley

These data are compiled from EEO-1 reports from Apple, Twitter, Salesforce, Facebook, Microsoft, and Intel. The EEO-1 is a document required by the federal government that provides the raw numbers of employees in each of the categories below. We summed the most recent data (all from 2014-16) for these companies to get the table below.

In [None]:
tech_data = Table.read_table('data/eeo-aggregate.csv')
tech_data.show()

We can look at a basic bar chart of all males and females by job category:

In [None]:
tech_data.select(['Job Categories', 'All Male', 'All Female']).barh('Job Categories')

We can also break down each gender by race:

In [None]:
females = [c for c in tech_data.to_df().columns if "Female" in c]
tech_data.select(['Job Categories'] + females).barh('Job Categories')

In [None]:
males = [c for c in tech_data.to_df().columns if "Male" in c]
tech_data.select(['Job Categories'] + males).barh('Job Categories')

What do you see in this data? (**Note**: It might be a little hard to see since the bars are so small - zoom in!)

---

## Bay Area Census data

Let's read in a CSV file with Bay Area data:

In [None]:
bay_area = Table.read_table('data/bay_area_data.csv')
bay_area.show(5)

### Job code subset

As you can see above, this table has a lot of information. The variables are in the columns and each row represents an individual. First, we will subset this table to only include the occupations we want to analyze. Job codes are listed in the column `OCC2010`. We're going to focus on management and STEM jobs.

In [None]:
job_codes = [10, 20, 30, 100, 110, 120, 130, 140, 150, 160, 220, 300, 310, 330, 350, 360, 410, 420,
             620, 700, 710, 720, 730, 800, 820, 940, 950, 1000, 1010, 1020, 1050, 1060, 1100, 1200, 1220,
             1230, 1240, 1350, 1360, 1400, 1410, 1420, 1430, 1450, 1460, 1540, 1550, 1720, 1910, 1920,
             1980, 2840, 2900, 4000, 4010, 4030, 4050, 4060, 4110, 4120, 4130, 4140, 4150, 4200, 4210,
             4220, 4230, 4250, 4720, 5000, 7720, 7730, 7900, 8000, 8010, 8030, 8060, 8800, 8830, 7700,
             9620, 9630, 9640]

df = bay_area.to_df()
bay_area_cut = Table.from_df(df.loc[df['OCC2010'].isin(job_codes)])
bay_area_cut.show(5)

Although still large, the table has shrunk from 13110 rows to 2550 by keeping only the rows that match our `job_codes` list. Let's subset this further by picking out the specific variables we want to look at:

In [None]:
cut_bay_area= bay_area_cut.drop("CPSID","ASECFLAG","HWTSUPP", "HFLAG", "MONTH", "PERNUM", "CPSIDP","WTSUPP")
cut_bay_area.show(5)

The column of job codes in `OCC2010` still does not paint a picture of who is doing which jobs. To solve this, we can add a job sector classification. The code below defines a dictionary, `job_categories`, that maps each sector to its job codes, then inverts it so that each job code maps to its sector. Looking up every value of the job code column in that dictionary produces the list `sectors`, with `"UNKNOWN"` for any code that is not listed.
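
The inversion step is easier to follow on a miniature example. The sketch below uses a cut-down, made-up mapping with the hypothetical names `codes_by_sector` and `sector_by_code`, and it uses `dict.get` with a default as one possible way to handle unlisted codes.

```python
# Made-up miniature version of a sector-to-codes mapping.
codes_by_sector = {"STEM": [700, 1000], "FINANCIAL": [120, 800]}

# Invert it so that each job code points to its sector.
sector_by_code = {code: sector
                  for sector, codes in codes_by_sector.items()
                  for code in codes}

jobs = [700, 120, 9999]   # 9999 is not listed in any sector
sectors = [sector_by_code.get(job, "UNKNOWN") for job in jobs]
print(sectors)  # ['STEM', 'FINANCIAL', 'UNKNOWN']
```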

In [None]:
job_categories = {"STEM": [700, 1000, 1010, 1020, 1050, 1220, 1230, 1240, 1350, 1360, 1400, 1410, 1420, 1430, 1450,1460, 1540, 1550, 1720, 1910, 1920, 1980,2840, 2900,7720, 7730, 7900, 8000, 8010,8030, 8060, 8800, 8830],
                  "SERVICE": [7700, 9620, 9630, 9640, 4000, 4010, 4030, 4050, 4060, 4110, 4120, 4130, 4140, 4150, 4720],
                  "FINANCIAL": [120, 800, 820, 940, 950],
                  "CUSTODIAL": [4200, 4210, 4220, 4230, 4250],
                  "MANAGEMENT": [130, 150, 160, 220, 30, 100, 410, 420],
                  "STEM_MANAGER": [140,300,330, 350, 360, 1060, 1100],
                  "ADMINISTRATOR": [10,20]}

job_categories = dict((v,k) for k in job_categories for v in job_categories[k])

sectors = []
for job in cut_bay_area.column("OCC2010"):
    try:
        sectors.append(job_categories[job])
    except KeyError:  # job code not listed in any sector
        sectors.append("UNKNOWN")

Now we can add the sector of each individual's job into a column by using the `with_column` function as seen below. 

In [None]:
with_sector = cut_bay_area.with_column('SECTOR', sectors)
with_sector

You might have noticed this earlier, but race in this table is listed as a number. To make analysis more intuitive, let's change the race codes into what they mean.

In [None]:
race_dict = {'White': list(range(100,200)),
             'Black': list(range(200,300)),
             'Indigenous': list(range(300,400)),
             'Asian': list(range(400,500)),
             'Pacific Islander': list(range(500,600)),
             'Other': list(range(600,700)),
             'NA': list(range(700,900))}

race_dict = dict((v,k) for k in race_dict for v in race_dict[k])

with_race = Table.from_df(with_sector.to_df().replace({"RACE": race_dict, "SEX": {1: "MALE", 2: "FEMALE"}}))
with_race.show(5)

As the chart below shows, "White" is a pretty big group. This may be because "White" encompasses a lot according to the 2010 U.S. Census: its definition of White includes Middle Easterners, North Africans, and the majority of Hispanic people in the United States.

In [None]:
with_race.to_df()['RACE'].value_counts().plot.bar()

In [None]:
with_race.to_df().groupby(['SECTOR', 'SEX'])['SEX'].count().unstack().plot.bar(stacked=True)

In [None]:
with_race.to_df().groupby(['SECTOR', 'RACE'])['RACE'].count().unstack().plot.bar(stacked=True)

In [None]:
for s in set(sectors):
    df = with_race.to_df()
    df[df['SECTOR'] == s].groupby(['RACE', 'SEX'])['SEX'].count().unstack().plot.bar(stacked=True, title=s)

By looking at the mean income of each part of the sample, we can see some differences.

## Income by race and gender

**Challenge**

Using our `with_race` table, group by `SEX`, take the mean of `INCTOT`, and plot the result:

Do the same for `RACE`:

---

This type of comparison isn't very reliable on its own, because a difference between sample means can arise by chance. We will run a hypothesis test to determine whether the difference in income across race or sex is statistically significant. To do this, we first bootstrap our sample to build a 95% confidence interval for the population median income.

In [None]:
def bootstrap_median(original_sample, label, replications):
    '''
    Returns an array of bootstrapped sample medians:
    original_sample: table containing the original sample
    label: label of column containing the variable
    replications: number of bootstrap samples
    '''
    just_one_column = original_sample.select(label)
    medians = make_array()
    for i in np.arange(replications):
        bootstrap_sample = just_one_column.sample()
        resampled_median = percentile(50, bootstrap_sample.column(0))
        medians = np.append(medians, resampled_median)

    return medians
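
To see the bootstrap percentile method in isolation, here is a minimal NumPy-only sketch on made-up incomes (the array `toy_incomes` is invented for illustration). Note that `np.median` and `np.percentile` interpolate between values, so their results can differ slightly from the `percentile` function used above:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Made-up incomes standing in for one column of the real table.
toy_incomes = np.array([20_000, 35_000, 40_000, 52_000, 61_000, 75_000, 90_000, 150_000])

# Resample with replacement, same size as the original, and record each median.
toy_medians = np.array([
    np.median(rng.choice(toy_incomes, size=len(toy_incomes), replace=True))
    for _ in range(1000)
])

# The middle 95% of the bootstrapped medians forms the confidence interval.
left, right = np.percentile(toy_medians, [2.5, 97.5])
print(left, right)
```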

Let's look at `MALE` vs. `FEMALE`:

In [None]:
df = with_race.to_df()  # pandas copy of the relabeled table
median_dict = {}
for i, s in enumerate(['MALE', 'FEMALE']):
    subset = Table.from_df(df[df['SEX'] == s])
    medians= bootstrap_median(subset, "INCTOT", 1000)
    median_dict[s] = medians
    left = percentile(2.5, medians)
    right = percentile(97.5, medians)
    CI = make_array(left, right)
    print("The median 95% Confidence Interval for " + s + " is", CI)
    plt.plot(CI, make_array(i, i), lw=10, label=s)
plt.legend(bbox_to_anchor=(1.01, 1), loc='upper left', ncol=1)
plt.title('Bootstrap Median Income Confidence Intervals')

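We can calculate a p-value from the bootstrapped medians to check whether the difference is significant. Keep in mind that this treats the bootstrap replicates as if they were independent observations, so the p-value shrinks as the number of replications grows; read it alongside the confidence intervals above: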

In [None]:
stats.ttest_ind(median_dict['MALE'], median_dict['FEMALE'])
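
As an alternative check that works on raw incomes rather than on bootstrap replicates, here is a minimal permutation-test sketch. The two groups are made-up numbers, not census data, and the choice of 5,000 shuffles is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up incomes for two groups (not census data).
group_a = np.array([52_000, 61_000, 75_000, 90_000, 150_000, 48_000])
group_b = np.array([20_000, 35_000, 40_000, 58_000, 66_000, 45_000])

observed = np.median(group_a) - np.median(group_b)

# Under the null hypothesis the group labels are interchangeable, so shuffle them.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
simulated = np.empty(5000)
for i in range(5000):
    shuffled = rng.permutation(pooled)
    simulated[i] = np.median(shuffled[:n_a]) - np.median(shuffled[n_a:])

# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
p_value = np.mean(np.abs(simulated) >= abs(observed))
print(observed, p_value)
```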

We can also look by `RACE`:

In [None]:
median_dict = {}
for i, r in enumerate(set(df['RACE'])):
    subset = Table.from_df(df[df['RACE'] == r])
    medians= bootstrap_median(subset, "INCTOT", 1000)
    median_dict[r] = medians
    left = percentile(2.5, medians)
    right = percentile(97.5, medians)
    CI = make_array(left, right)
    print("The median 95% Confidence Interval for " + r + " is", CI)
    plt.plot(CI, make_array(i, i), lw=10, label=r)
plt.legend(bbox_to_anchor=(1.01, 1), loc='upper left', ncol=1)
plt.title('Bootstrap Median Income Confidence Intervals')

We can run a one-way ANOVA F-test to see whether the bootstrapped medians differ significantly across the groups:

In [None]:
# Pass every group's bootstrapped medians, so no race category is left out.
stats.f_oneway(*median_dict.values())

We can use `itertools.product` to generate every combination of `RACE` and `SEX`, and then get the confidence intervals for those:

In [None]:
import itertools

combos = list(itertools.product(set(df['RACE']), ['MALE', 'FEMALE']))
combos
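
For reference, `itertools.product` pairs every item of its first argument with every item of its second. A tiny sketch with two short made-up lists:

```python
import itertools

toy_races = ["Asian", "White"]
toy_sexes = ["MALE", "FEMALE"]

# Every (race, sex) pair, in the order the inputs were given.
toy_combos = list(itertools.product(toy_races, toy_sexes))
print(toy_combos)
# [('Asian', 'MALE'), ('Asian', 'FEMALE'), ('White', 'MALE'), ('White', 'FEMALE')]
```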

In [None]:
for i, c in enumerate(combos):
    subset = df[df['RACE'] == c[0]]
    subset = Table.from_df(subset[subset['SEX'] == c[1]])
    medians= bootstrap_median(subset, "INCTOT", 1000)
    left = percentile(2.5, medians)
    right = percentile(97.5, medians)
    CI = make_array(left, right)
    print("The median 95% Confidence Interval for " + c[0] + ' ' + c[1] + " is", CI)
    plt.plot(CI, make_array(i, i), lw=10, label=c[0] + ' ' + c[1])
plt.legend(bbox_to_anchor=(1.01, 1), loc='upper left', ncol=1)
plt.title('Bootstrap Median Income Confidence Intervals')

Discuss this plot with the people around you and write a response below.

---

## Compared to entire Bay Area census sample

So how does our tech-biased subset compare to the entire census sample of the Bay Area? First we'll do some quick processing to remove non-responses and relabel the Bay Area sample:

In [None]:
bay_area2 = bay_area.to_df().replace({"RACE": race_dict, "SEX": {1: "MALE", 2: "FEMALE"}})
bay_area2 = bay_area2[bay_area2["INCTOT"] != 0]
bay_area2 = bay_area2[bay_area2["INCTOT"] != 99999999]
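
The two filters above can also be written as a single boolean mask. A minimal sketch on a made-up column (the frame `toy` is invented; `0` and `99999999` are simply the values excluded above):

```python
import pandas as pd

# Made-up rows; 0 and 99999999 stand in for the values filtered out above.
toy = pd.DataFrame({"INCTOT": [0, 42_000, 99999999, 58_000]})

# Combine both conditions with `&`; each comparison needs its own parentheses.
keep = (toy["INCTOT"] != 0) & (toy["INCTOT"] != 99999999)
toy_clean = toy[keep]
print(toy_clean)
```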

Let's check to see what our sample contains in terms of `SEX` and `RACE`:

In [None]:
bay_area2['SEX'].value_counts()

In [None]:
bay_area2['SEX'].value_counts().plot.bar()

In [None]:
bay_area2['RACE'].value_counts()

In [None]:
bay_area2['RACE'].value_counts().plot.bar()

We can then look at income by `SEX` and `RACE`. Let's start with `SEX`.

### `SEX`

In [None]:
bay_area_mean_sex = bay_area2.groupby(['SEX'])['INCTOT'].mean()
bay_area_mean_sex.plot.bar()
bay_area_mean_sex

We'll recall that our biased subset had much higher means, but similar disparity:

In [None]:
# this plot is the same as the one in the previous section with job codes - copied here for your reference
with_race_mean_sex = with_race.to_df().groupby(['SEX'])['INCTOT'].mean()
with_race_mean_sex.plot.bar()
with_race_mean_sex

UCB is doing much better than both of these subsets:

In [None]:
professors_mean_sex = professors.to_df().groupby(['Gender'])['Gross Pay'].mean()
professors_mean_sex.plot.bar()
professors_mean_sex

We can put each of these bar charts together to get a better idea of how the numbers in each of these subsets relate to each other:

In [None]:
import pandas as pd  # needed for pd.DataFrame below

b = pd.DataFrame([bay_area_mean_sex, with_race_mean_sex, professors_mean_sex], ["Entire Bay Area", "Biased subset", "Professors"])
b.plot.bar()

Again, we can combine the charts in a different way, this time grouping the numbers for each gender together:

In [None]:
b2 = pd.DataFrame(list(zip(bay_area_mean_sex, with_race_mean_sex, professors_mean_sex)), bay_area_mean_sex.keys())
ax = b2.plot.bar()
ax.legend(["Entire Bay Area", "Biased subset", "Professors"], loc=4)

We can also calculate the difference between the average wages of males and females in each of our population subsets. The table below shows the average wage for each gender in each group; these are the same numbers used in the charts above.

In [None]:
b

To calculate the difference, we subtract the average female wage from the average male wage in each group. The resulting numbers show how much more males earn than females, on average, in each group.

In [None]:
b['MALE'] - b['FEMALE']

How about `FEMALE` as a percentage of `MALE`?

In [None]:
b['FEMALE'] / b['MALE']
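
The difference and the ratio can tell different stories. In this sketch with made-up means (the table `toy_b` is invented and only mimics the layout of `b`), both groups have the same ratio but very different dollar gaps:

```python
import pandas as pd

# Made-up mean incomes, laid out like the table `b` above (one row per group).
toy_b = pd.DataFrame(
    {"FEMALE": [40_000.0, 80_000.0], "MALE": [50_000.0, 100_000.0]},
    index=["Group 1", "Group 2"],
)

gap = toy_b["MALE"] - toy_b["FEMALE"]    # absolute gap in dollars
ratio = toy_b["FEMALE"] / toy_b["MALE"]  # female mean as a share of male mean
print(gap)
print(ratio)
```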

### `RACE`
We can also look at `RACE` in the larger Bay Area subset:

In [None]:
bay_area_mean_race = bay_area2.groupby(['RACE'])['INCTOT'].mean()
bay_area_mean_race

In [None]:
bay_area_mean_race.plot.bar()

Here are the same means for our biased subset:

In [None]:
with_race_mean_race = with_race.to_df().groupby(['RACE'])['INCTOT'].mean()
with_race_mean_race

In [None]:
with_race_mean_race.plot.bar()

Again, we'll put these charts together so we can better compare how these different subsets' numbers relate to each other:

In [None]:
# Build the frame from a dict of Series so rows are matched by race label, not by position.
b = pd.DataFrame({"Entire Bay Area": bay_area_mean_race, "Biased subset": with_race_mean_race})
ax = b.plot.bar(stacked=False)
ax.legend(["Entire Bay Area", "Biased subset"])
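
When two grouped results do not contain exactly the same groups, combining them by position can pair up the wrong values. A minimal sketch with made-up means (the Series `toy_all` and `toy_subset` are invented) shows how building the frame from a dictionary of Series aligns rows by label and leaves a gap where a group is missing:

```python
import pandas as pd

# Made-up means; the second Series is missing the "Black" group on purpose.
toy_all = pd.Series({"Asian": 60_000.0, "Black": 45_000.0, "White": 65_000.0})
toy_subset = pd.Series({"Asian": 110_000.0, "White": 120_000.0})

# A dict of Series is aligned on the index labels; missing groups become NaN.
aligned = pd.DataFrame({"Entire Bay Area": toy_all, "Biased subset": toy_subset})
print(aligned)
```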

As we noted when investigating the data on workers in the tech industry, it is more helpful to consider race and gender together:

In [None]:
bay_area2.groupby(['RACE', 'SEX'])['INCTOT'].mean().unstack().plot.bar()

In [None]:
with_race.to_df().groupby(['RACE', 'SEX'])['INCTOT'].mean().unstack().plot.bar()

Where are the disparities the worst?

---

***Please fill out our [modules feedback survey](https://goo.gl/forms/QCgq3B5uA5npe5ja2)!***