# Legal Studies 190 - Data, Prediction, and Law

Welcome to our class! This introductory notebook reviews concepts that you may already be familiar with from Data 8 or similar courses. The basic strategies and tools for data analysis covered here will be the foundation of this class. The notebook gives an overview of our software and some programming concepts.

## Table of Contents

1 - [Computing Environment](#computing environment)

2 - [Coding Concepts](#programming concepts)
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 - [Python Basics](#python basics)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2 - [Pandas](#tables)

## Our Computing Environment, Jupyter notebooks  <a id='computing environment'></a>
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results. 

### Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)

**Understanding Check 1** This paragraph is in its own text cell.  Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button.  This sentence, for example, should be deleted.  So should this one.

### Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [1]:
print("First this line is printed,")
print("and then this one.")
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

First this line is printed,
and then this one.
👋, 🌏!


Don't be scared if you see a "Kernel Restarting" message! Your data and work will still be saved. Once you see "Kernel Ready" in a light blue box on the top right of the notebook, you'll be ready to work again. You should rerun any cells with imports, variables, and loaded data.

<img src="images/kernel.png">

### Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

### Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [2]:
print("This line is missing something."

SyntaxError: unexpected EOF while parsing (<ipython-input-2-c7b7223ecd08>, line 1)

You should see something like this (minus our annotations):

<img src="images/error.jpg"/>

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, feel free to ask a friend or post on the class Piazza.)

**Understanding Check 2** Try to fix the code above so that you can run the cell and see the intended message instead of an error.

## Programming Concepts <a id='programming concepts'></a>

Now that you are comfortable with our computing environment, we are going to be moving into more of the fundamentals of Python, but first, run the cell below to ensure all the libraries needed for this notebook are installed.

### Part 1: Python basics <a id='python basics'></a>
Before getting into the more advanced analysis techniques that will be required in this course, we need to cover a few of the foundational elements of programming in Python.
#### A. Expressions
The departure point for all programming is the concept of the __expression__. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions.

In [3]:
# Examples of expressions:

#addition
print(2 + 2)

#string concatenation 
print('me' + ' and I')

#you can print a number with a string if you cast it 
print("me" + str(2))

#exponents
print(12 ** 2)


4
me and I
me2
144


In a notebook, only the value of the last expression in a cell is displayed automatically. To see the values of earlier expressions, you need to call `print` on them, as the cell above does. Try removing `print` from some of the above expressions to see which values still display.

#### Data Types

In Python, all things have a type. In the above example, you saw *integers* (positive and negative whole numbers) and *strings* (sequences of characters, often thought of as words or sentences). We denote strings by surrounding the desired value with quotes. For example, "Data Science" and "2017" are strings, while `bears` and `2020` (both without quotes) are not strings (`bears` without quotes would be interpreted as a variable). You'll also be using decimal numbers in Python, which are called *floats*.

You'll also often run into *booleans*. They can take on one of two values: `True` or `False`. Booleans are often used to check conditions; for example, we might have a list of dogs, and we want to sort them into small dogs and large dogs. One way we could accomplish this is to say either `True` or `False` for each dog after seeing if the dog weighs more than 15 pounds. 
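
The dog example above can be sketched in a few lines of code. The weights below are made up for illustration; only the 15-pound cutoff comes from the text.

```python
# A hypothetical dog's weight in pounds (made-up value).
weight = 24

# A comparison evaluates to a boolean: True or False.
is_large = weight > 15

print(is_large)        # True
print(type(is_large))  # <class 'bool'>

# A dog of exactly 15 pounds does not weigh *more than* 15 pounds.
print(9 > 15)          # False
print(15 > 15)         # False
```

Notice that the comparison itself produces the boolean; we never have to type `True` or `False` by hand.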

We'll soon be going over additional data types. Below is a table that summarizes the information in this section:

|Variable Type|Definition|Examples|
|-|-|-|
|Integer|Positive and negative whole numbers|`42`, `-10`, `0`|
|Float|Positive and negative decimal numbers|`73.9`, `2.4`, `0.0`|
|String|Sequence of characters|`"Go Bears!"`, `"variables"`|
|Boolean|True or false value|`True`, `False`|
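
If you are ever unsure what type a value has, Python's built-in `type` function will tell you. The short sketch below checks a few of the values from the table; note that Python abbreviates the type names as `int`, `float`, `str`, and `bool`.

```python
# type() reports the data type of any value.
print(type(42))           # <class 'int'>
print(type(73.9))         # <class 'float'>
print(type("Go Bears!"))  # <class 'str'>
print(type(True))         # <class 'bool'>

# Quotes make the difference between a number and a string.
print(type(2020))         # <class 'int'>
print(type("2020"))       # <class 'str'>

# Dividing with / always produces a float, even when the result is whole.
print(10 / 5)             # 2.0
```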


#### B. Variables
In the example below, `a` and `b` are Python objects known as __variables__. We are giving an object (in this case, an `integer` and a `float`, two Python data types) a name that we can store for later use. To use that value, we can simply type the name that we stored the value as. Variables are stored within the notebook's environment, meaning stored variable values carry over from cell to cell.

In [4]:
a = 4
b = 10/5

# Notice that 'a' retains its value.
print(a)
a + b

4


6.0

#### Question 1: Variables
See if you can write a series of expressions that creates two new variables called __x__ and __y__ and assigns them values of __10.5__ and __7.2__. Then assign their product to the variable __combo__ and print it.

In [5]:
# Fill in the missing lines to complete the expressions.
x = ...
...
...
print(...)

Ellipsis


Check to see if the value you get for **combo** is what you expect it to be.

#### C. Lists
The next topic is particularly useful in the kind of data manipulation that you will see throughout this class. The following few cells will introduce the concept of __lists__ (and their counterpart, `numpy arrays`). Read through the following cell to understand the basic structure of a list. 

A list is an ordered collection of objects. They allow us to store and access groups of variables and other objects for easy access and analysis. Check out this [documentation](https://www.tutorialspoint.com/python/python_lists.htm) for an in-depth look at the capabilities of lists.

To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list. 

In [6]:
# an empty list
lst = []
print(lst)

# reassigning our empty list to a new list
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]
print(lst)

#lists in python are zero-indexed so the indices for lst are 0,1,2,3,4,5 and 6
example = lst[2]
print(example)

#list slicing: This line will store the elements at indices 1 (inclusive) through 4 (exclusive) of lst as a new list 
#called lst_2:
lst_2 = lst[1:4]
lst_2

[]
[1, 3, 6, 'lists', 'are', 'fun', 4]
6


[3, 6, 'lists']

It is important to note that when you store a list to a variable, you are actually storing the **pointer** to the list. That means if you assign your list to another variable, and you change the elements in your other variable, then you are changing the same data as in the original list. 

In [7]:
a = [1,2,3] #original list
b = a #b now points to list a 
b[0] = 4 
print(a[0]) #prints 4 since we modified the first element of the list pointed to by both a and b 

4
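
If you want an independent list rather than a second name for the same list, one option is the list's built-in `copy` method. Here is a minimal sketch; the variable names are just for illustration.

```python
original = [1, 2, 3]

alias = original              # a second name for the same list
duplicate = original.copy()   # a new list holding the same elements

alias[0] = 4        # changes the list shared by original and alias
duplicate[1] = 99   # changes only the copy

print(original)     # [4, 2, 3]
print(duplicate)    # [1, 99, 3]
```

Changes made through `alias` show up in `original`, while changes to `duplicate` do not.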


#### Question 2: Lists
Build a list of length 10 containing whatever elements you'd like. Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to the given variable and print it.

In [8]:
### Fill in the ellipses to complete the question.
my_list = ...

my_list_sliced = my_list[...]

last_of_sliced = ...

print(...)

TypeError: 'ellipsis' object is not subscriptable

Lists can also be operated on with a few built-in analysis functions. These include `min` and `max`, among others. Lists can also be concatenated together. Find some examples below.

In [9]:
# A list containing six integers.
a_list = [1, 6, 4, 8, 13, 2]

# Another list containing six integers.
b_list = [4, 5, 2, 14, 9, 11]

print('Max of a_list:', max(a_list))
print('Min of b_list:', min(b_list))

# Concatenate a_list and b_list:
c_list = a_list + b_list
print('Concatenated:', c_list)

Max of a_list: 13
Min of b_list: 2
Concatenated: [1, 6, 4, 8, 13, 2, 4, 5, 2, 14, 9, 11]


#### D. Numpy Arrays
Closely related to the concept of a list is the array, a sequence of elements that looks much like a list. Arrays, however, can be operated on arithmetically with much more versatility than regular lists. For the purpose of later data manipulation, we'll access arrays through [Numpy](https://docs.scipy.org/doc/numpy/reference/routines.html), which will require an import statement.

Now run the next cell to import the numpy library into your notebook, and examine how numpy arrays can be used.

In [10]:
import numpy as np

In [11]:
# Initialize an array of integers 0 through 9.
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# This can also be accomplished using np.arange
example_array_2 = np.arange(10)
print('Undoubled Array:', example_array_2)

# Double the values in example_array and print the new array.
double_array = example_array*2
print('Doubled Array:', double_array)

Undoubled Array: [0 1 2 3 4 5 6 7 8 9]
Doubled Array: [ 0  2  4  6  8 10 12 14 16 18]


This behavior differs from that of a list. See below what happens if you multiply a list.

In [12]:
example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_list * 2

[1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. This is the reason that we will sometimes use Numpy over lists. Other mathematical operations have interesting behaviors with lists that you should explore on your own. 
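
As a starting point for that exploration, here is a small sketch comparing `+` on lists and on arrays. The variable names are made up for this example.

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array([1, 2, 3])

# + joins two lists end to end ...
print(small_list + small_list)     # [1, 2, 3, 1, 2, 3]

# ... but adds two arrays element by element.
print(small_array + small_array)   # [2 4 6]

# Adding a number to an array adds it to every element.
print(small_array + 10)            # [11 12 13]
```

Adding a number to a list (for example `small_list + 10`) is not allowed at all and raises a `TypeError`.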

#### E. Looping
[Loops](https://www.tutorialspoint.com/python/python_loops.htm) are often useful in manipulating, iterating over, or transforming large lists and arrays. The first type we will discuss is the __for loop__. For loops are helpful in traversing a list and performing an action at each element. For example, the following code moves through every element in `example_array`, adds 5 to it, and appends the result to a new list. The second loop does the same arithmetic by index, changing `example_array` in place.

It's important to note that `element` is an arbitrary variable name used to represent whichever value the loop is currently operating on. We can change the variable name to whatever we want and achieve the same result, as long as we stay consistent.

In [13]:
new_list = []

for element in example_array:
    new_element = element + 5
    new_list.append(new_element)

print(new_list)

#iterate using list indices rather than elements themselves
for i in range(len(example_array)):
    example_array[i] = example_array[i] + 5

example_array

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

#### Other types of loops
The __while loop__ repeatedly performs operations until a conditional is no longer satisfied. A conditional is a [boolean expression](https://en.wikipedia.org/wiki/Boolean_expression), that is an expression that evaluates to `True` or `False`. 

In the below example, an array of integers 0 to 9 is generated. When the program enters the while loop on the subsequent line, it notices that the maximum value of the array is less than 50. Because of this, it adds 1 to the fifth element, as instructed. Once the instructions embedded in the loop are complete, the program refers back to the conditional. Again, the maximum value is less than 50. This process repeats until the fifth element, now the maximum value of the array, is equal to 50, at which point the conditional is no longer true and the loop ends.

In [14]:
while_array = np.arange(10)        # Generate our array of values

print('Before:', while_array)

while(max(while_array) < 50):      # Set our conditional
    while_array[4] += 1            # Add 1 to the fifth element if the conditional is satisfied 
    
print('After:', while_array)

Before: [0 1 2 3 4 5 6 7 8 9]
After: [ 0  1  2  3 50  5  6  7  8  9]


#### Question 3: Loops
In the following cell, partial steps to manipulate an array are included. You must fill in the blanks to accomplish the following: <br>
1. Iterate over the entire array, checking if each element is a multiple of 5
2. If an element is not a multiple of 5, add 1 to it repeatedly until it is
3. Iterate back over the list and print each element.

> Hint: To check if an integer `x` is a multiple of `y`, use the modulus operator `%`. Typing `x % y` will return the remainder when `x` is divided by `y`. Therefore, (`x % y != 0`) will return `True` when `y` __does not divide__ `x`, and `False` when it does.
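
To get a feel for the modulus operator before using it in the question, here are a few values worked out on small numbers (chosen only for illustration):

```python
print(12 % 5)        # 2, because 12 = 2*5 + 2
print(50 % 5)        # 0, because 5 divides 50 evenly

print(12 % 5 != 0)   # True: 12 is not a multiple of 5
print(50 % 5 != 0)   # False: 50 is a multiple of 5
print(0 % 5 != 0)    # False: 0 counts as a multiple of 5
```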

In [15]:
# Make use of iterators, range, length, while loops, and indices to complete this question.
question_3 = np.array([12, 31, 50, 0, 22, 28, 19, 105, 44, 12, 77])

for i in range(len(...)):
    while(...):
        question_3[i] = ...
        
for element in question_3:
    print(...)

TypeError: object of type 'ellipsis' has no len()

The following cell should return an array containing only `True` values if your code is correct.

In [16]:
answer = np.array([15, 35, 50, 0, 25, 30, 20, 105, 45, 15, 80])
question_3 == answer

array([False, False,  True,  True, False, False, False,  True, False,
       False, False])

#### F. Functions!
Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are built into Python already; for example, you've already made use of `len()` to retrieve the number of elements in a list. You can also write your own functions, and at this point you already have the skills to do so.


Functions generally take a set of __parameters__ (also called inputs), which define the objects they will use when they are run. For example, the `len()` function takes a list or array as its parameter, and returns the length of that list.

Let's look at a function that takes two parameters, compares them somehow, and then returns a boolean value (`True` or `False`) depending on the comparison. The `is_multiple` function below takes as parameters an integer `m` and an integer `n`, checks if `m` is a multiple of `n`, and returns `True` if it is. Otherwise, it returns `False`. 

`if` statements, just like `while` loops, are dependent on boolean expressions. If the conditional is `True`, then the following indented code block will be executed. If the conditional evaluates to `False`, then the code block will be skipped over. Read more about `if` statements [here](https://www.tutorialspoint.com/python/python_if_else.htm).

In [17]:
def is_multiple(m, n):
    if (m % n == 0):
        return True
    else:
        return False

In [18]:
is_multiple(12, 4)

True

In [19]:
is_multiple(12, 7)

False

**Sidenote:** Another way to write `is_multiple` is below. Think about why it works.

    def is_multiple(m, n):
        return m % n == 0
        
Since functions are so easily replicable, we can include them in loops if we want. For instance, our `is_multiple` function can be used to check if a number is prime! See for yourself by testing some possible prime numbers in the cell below.

In [20]:
# Change possible_prime to any integer to test its primality
# NOTE: If you happen to stumble across a large (> 8 digits) prime number, the cell could take a very, very long time
# to run and will likely crash your kernel. Just click kernel>interrupt if it looks like it's caught.

possible_prime = 9999991

for i in range(2, possible_prime):
    if (is_multiple(possible_prime, i)):
        print(possible_prime, 'is not prime')   
        break
    if (i >= possible_prime/2):
        print(possible_prime, 'is prime')
        break

9999991 is prime
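
The loop above can itself be wrapped in a function so it is easy to reuse. The sketch below is a variation, not the exact method in the cell: it only checks divisors up to the square root of the number, since any larger factor would have to pair with a smaller one. The name `is_prime` is our own, and `math.isqrt` assumes Python 3.8 or later.

```python
import math

def is_multiple(m, n):
    return m % n == 0

def is_prime(number):
    """Return True if number is prime, False otherwise."""
    if number < 2:
        return False
    # Only divisors up to the integer square root need to be checked.
    for i in range(2, math.isqrt(number) + 1):
        if is_multiple(number, i):
            return False
    return True

for candidate in [2, 9, 97, 100]:
    print(candidate, is_prime(candidate))
```

Because far fewer divisors are checked, this version should stay fast even for numbers large enough to make the original loop slow.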


### Part 2: Pandas <a id='tables'></a>

We will be using Pandas tables for much of this class to organize and sort through tabular data. [Pandas](http://pandas.pydata.org/pandas-docs/stable/api.html) is a library that is used for manipulating tabular data. It has a user-friendly API, and can be used to answer difficult questions in relatively few commands. Like we did with `numpy`, we will have to import `pandas`.

In [22]:
import pandas
from pandas import *

#### Creating DataFrames

When dealing with a collection of things with multiple attributes, it can be useful to put the data in a _DataFrame_.  DataFrames are a nice way of organizing data in a 2-dimensional data set. The `head(n)` method outputs the first `n` rows (by default, the first 5). For example, take a look at the table below.

In [41]:
pandas.read_csv('../data/anes/ANES_legalst123.csv').head(5)

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
0,0.887,0.927,3,5,7,2,3,4,7,4,...,4,3,4,4,4,-4,3,4,4,4
1,1.16,1.084,3,3,6,1,1,1,7,6,...,3,3,1,4,1,-4,3,4,1,2
2,0.416,0.398,1,3,3,2,1,7,7,4,...,2,3,1,2,1,-4,2,4,4,3
3,0.385,0.418,4,3,5,1,3,4,5,6,...,3,3,2,3,2,-4,4,4,2,2
4,0.693,0.726,3,3,3,1,3,6,7,7,...,2,4,3,2,3,-4,2,5,4,3


This table is drawn from the ANES survey data. See page 31 of the codebook (on bCourses) for a description of the survey. To create this table, we read the data from the path `../data/anes/`, where it is stored in a file called `ANES_legalst123.csv`. In general, to import data from a `.csv` file, we write **`pandas.read_csv("file_name")`**, or equivalently **`pandas.read_table("file_name", delimiter=",")`**. Values in a `.csv` file are separated by commas, and this is the format most often used with the `pandas` package. 
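
To see how the delimiter matters, here is a small sketch that reads the same comma-separated text in two ways. The data is made up and held in memory with `io.StringIO`, so no file on disk is needed.

```python
import io
import pandas

# A tiny made-up CSV held in memory instead of a file.
csv_text = "Fruit,Price\nApple,1.0\nOrange,0.75\n"

# read_csv splits fields on commas by default.
from_csv = pandas.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default, so the comma delimiter must be given.
from_table = pandas.read_table(io.StringIO(csv_text), delimiter=",")

print(from_csv)
print(from_csv.equals(from_table))  # True
```

With a real file, you would pass the file's path as the first argument in place of the `io.StringIO` object.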

We can also create our own DataFrames from scratch without having to import data from another file. Let's say we have two lists, one with the names of fruits, and another with their prices at the Berkeley Student Food Collective. Then, we can create a new `DataFrame` with each of these lists as a column by passing a dictionary to `pandas.DataFrame`:

In [34]:
fruit_names = ['Apple', 'Orange', 'Banana']
fruit_prices = [1, 0.75, 0.5]
fruit_table = pandas.DataFrame(data = {
    "Fruit": fruit_names,
     "Price ($)": fruit_prices
})
fruit_table

Unnamed: 0,Fruit,Price ($)
0,Apple,1.0
1,Orange,0.75
2,Banana,0.5


**`pandas.DataFrame`** takes in a dictionary that maps each column label to a list or array, and creates a new DataFrame with each list or array as a column. Finally, to create a new, empty DataFrame (with no columns or rows), we simply write

In [35]:
empty_table = DataFrame()
empty_table

We typically start off with empty tables when we need to add rows inside for loops, which we'll see later.
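
As a preview, here is one way to grow a table inside a `for` loop. This sketch is a variation on starting from an empty DataFrame, and not necessarily the approach used later in the course: it collects each row as a dictionary in a list, then builds the DataFrame once at the end. The fruit data is reused from above.

```python
import pandas

fruit_names = ['Apple', 'Orange', 'Banana']
fruit_prices = [1, 0.75, 0.5]

# Collect one dictionary per row.
rows = []
for i in range(len(fruit_names)):
    rows.append({"Fruit": fruit_names[i], "Price ($)": fruit_prices[i]})

# Build the DataFrame once, after the loop has finished.
fruit_table = pandas.DataFrame(rows)
print(fruit_table)
```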

#### Accessing Values

Often, it is useful to access only the rows, columns, or values related to our analysis. We'll look at several ways to cut down our table into smaller, more digestible parts.

Let's go back to our ANES table.

**Exercise 1**

Below, assign a variable named `anes` to the data from the `ANES_legalst123.csv` file with the path `../data/anes/`, then display the table. (Hint: use the `read_csv` function from the previous section, or `read_table` with the `delimiter` parameter.) We will take a closer look at the ANES data in Lab 3.

In [43]:
# YOUR CODE HERE

anes = ...

#the head function selects the first n rows or default value of 5 rows
anes.head(5)


Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
0,0.887,0.927,3,5,7,2,3,4,7,4,...,4,3,4,4,4,-4,3,4,4,4
1,1.16,1.084,3,3,6,1,1,1,7,6,...,3,3,1,4,1,-4,3,4,1,2
2,0.416,0.398,1,3,3,2,1,7,7,4,...,2,3,1,2,1,-4,2,4,4,3
3,0.385,0.418,4,3,5,1,3,4,5,6,...,3,3,2,3,2,-4,4,4,2,2
4,0.693,0.726,3,3,3,1,3,6,7,7,...,2,4,3,2,3,-4,2,5,4,3


Notice that not all of the rows are displayed. In fact, there are over 4,000 rows in the DataFrame! Because we called `head(5)`, we are shown only the first 5 rows.

However, let's say we wanted to grab only certain rows of this DataFrame. We can do this by using the **`iloc`** indexer; it takes in a list or range of integer positions, and creates a new DataFrame with the rows from the original DataFrame at those positions. Remember that in Python, indices start at 0! Below are a few examples:

In [45]:
anes.iloc[[1, 3, 5]] # Takes rows with indices 1, 3, and 5 (the 2nd, 4th, and 6th rows)

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
1,1.16,1.084,3,3,6,1,1,1,7,6,...,3,3,1,4,1,-4,3,4,1,2
3,0.385,0.418,4,3,5,1,3,4,5,6,...,3,3,2,3,2,-4,4,4,2,2
5,0.758,0.724,3,5,5,2,3,2,5,7,...,-9,-9,-9,-9,-9,-4,-9,-9,1,1


In [46]:
anes.iloc[[7]] # Takes the row with index 7 (8th row)

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
7,1.032,1.041,3,4,4,2,3,1,7,5,...,4,3,1,3,2,-4,2,4,1,1


In [47]:
anes.iloc[np.arange(7)] # Takes the rows with indices 0, 1, ... 6

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
0,0.887,0.927,3,5,7,2,3,4,7,4,...,4,3,4,4,4,-4,3,4,4,4
1,1.16,1.084,3,3,6,1,1,1,7,6,...,3,3,1,4,1,-4,3,4,1,2
2,0.416,0.398,1,3,3,2,1,7,7,4,...,2,3,1,2,1,-4,2,4,4,3
3,0.385,0.418,4,3,5,1,3,4,5,6,...,3,3,2,3,2,-4,4,4,2,2
4,0.693,0.726,3,3,3,1,3,6,7,7,...,2,4,3,2,3,-4,2,5,4,3
5,0.758,0.724,3,5,5,2,3,2,5,7,...,-9,-9,-9,-9,-9,-4,-9,-9,1,1
6,4.251,4.79,3,4,1,1,3,4,1,7,...,4,5,3,5,2,-4,8,3,3,3
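
The cells above pick rows by position with `iloc`. Its sibling `loc` picks rows by label instead. In a table like `anes`, where the row labels are simply 0, 1, 2, and so on, the two usually agree, so the sketch below uses a small made-up table with different labels to show the distinction.

```python
import pandas

# Made-up table whose row labels are deliberately not 0, 1, 2.
scores = pandas.DataFrame({"score": [90, 75, 60]}, index=[10, 20, 30])

# iloc counts positions: position 0 is the first row.
print(scores.iloc[[0]])

# loc looks up labels: the label 10 also belongs to the first row here.
print(scores.loc[[10]])

# The label 0 does not exist, so scores.loc[[0]] would raise a KeyError.
```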


Similarly, we can also choose to display certain columns of the DataFrame. There are two ways to accomplish this:
- Pass a list of column labels to `loc` (or a list of column positions to `iloc`), which creates a new DataFrame with only those columns.
- The **`drop`** method creates a new DataFrame with all columns _except_ those indicated by the parameters (i.e. the parameters are dropped).

Some examples:

In [49]:
anes.loc[:, ["V161188", "V161204x"]].head() # Selects only "V161188" and "V161204x" columns

Unnamed: 0,V161188,V161204x
0,2,4
1,1,6
2,2,4
3,1,6
4,1,7


In [50]:
anes.drop(anes.columns[[0, 1]], axis=1).head() # Drops the columns with indices 0 and 1

Unnamed: 0,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,V161208,V161209,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
0,3,5,7,2,3,4,7,4,1,2,...,4,3,4,4,4,-4,3,4,4,4
1,3,3,6,1,1,1,7,6,3,2,...,3,3,1,4,1,-4,3,4,1,2
2,1,3,3,2,1,7,7,4,1,2,...,2,3,1,2,1,-4,2,4,4,3
3,4,3,5,1,3,4,5,6,1,3,...,3,3,2,3,2,-4,4,4,2,2
4,3,3,3,1,3,6,7,7,1,2,...,2,4,3,2,3,-4,2,5,4,3


In [51]:
anes.iloc[[1,2,3,5], [1,68]] # Select only the rows with indices 1, 2, 3, 5,
                             # and only the columns with indices 1 and 68

Unnamed: 0,V160102f,V162012
1,1.084,2
2,0.398,2
3,0.418,2
5,0.724,2


**Exercise 2**

To make sure you understand `loc`, `iloc`, and `drop`, try selecting the columns `V161188` through `V161198` with only the first 3 rows:

In [52]:
# YOUR CODE HERE

Finally, `loc` can also select rows by a condition instead of by explicit labels. A condition has two parts:
- A column label
- A condition that each row should match

In other words, we select rows like so: `DataFrame_name.loc[DataFrame_name["column_name"] filter]`, where `filter` is a comparison such as `== 1`.
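
It can help to see the two steps of such a selection separately. The sketch below uses a small made-up fruit table rather than the survey data.

```python
import pandas

# Made-up data for illustration.
fruit_table = pandas.DataFrame({
    "Fruit": ["Apple", "Orange", "Banana"],
    "Price ($)": [1.0, 0.75, 0.5],
})

# Step 1: the comparison yields one True/False value per row.
is_cheap = fruit_table["Price ($)"] < 1
print(is_cheap)

# Step 2: loc keeps only the rows where that value is True.
cheap_fruit = fruit_table.loc[is_cheap]
print(cheap_fruit)
```

Writing the comparison directly inside the brackets, as in the pattern above, does both steps at once.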


Here are some examples of selection:

The variable `V162365` records a score for discrimination against Christians, and 1 is one of the possible scores. The query below finds all rows in the survey where that score is 1.

In [55]:
anes.loc[anes["V162365"] == 1]

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
118,1.132,1.104,1,1,1,4,3,7,4,6,...,3,5,1,1,1,-4,2,5,4,3
159,0.782,0.748,3,4,5,1,2,1,6,7,...,4,4,3,1,2,-4,3,5,1,1
234,0.529,0.556,4,3,7,2,1,2,7,7,...,4,4,3,1,2,-4,2,5,2,2
239,0.909,0.970,3,4,7,1,1,6,7,7,...,4,5,3,1,4,-4,2,5,1,1
379,1.117,1.177,3,4,5,2,2,7,6,7,...,4,4,4,1,4,-4,4,4,1,1
382,0.664,0.682,3,4,1,1,3,6,4,4,...,3,4,2,1,1,-4,2,5,4,3
388,0.286,0.284,3,3,7,1,3,1,-8,4,...,2,5,5,1,3,-4,4,4,1,1
397,0.815,0.859,3,3,7,3,3,7,7,7,...,4,3,3,1,4,-4,3,4,1,1
436,1.323,1.376,3,5,5,1,3,4,2,4,...,2,3,1,1,1,-4,1,4,4,3
448,0.511,0.531,3,2,5,3,3,4,1,-8,...,4,-9,-9,1,4,-4,3,5,4,5


The variable `V161233x` corresponds to the death penalty and takes values between 1 and 10. With the following selection, we'll find the rows where the death penalty score is one of the values in `np.arange(1, 10)`, that is, 1 through 9 (`np.arange` excludes its upper bound).

In [57]:
anes.loc[anes["V161233x"].isin(np.arange(1, 10))]

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
0,0.887,0.927,3,5,7,2,3,4,7,4,...,4,3,4,4,4,-4,3,4,4,4
1,1.160,1.084,3,3,6,1,1,1,7,6,...,3,3,1,4,1,-4,3,4,1,2
2,0.416,0.398,1,3,3,2,1,7,7,4,...,2,3,1,2,1,-4,2,4,4,3
3,0.385,0.418,4,3,5,1,3,4,5,6,...,3,3,2,3,2,-4,4,4,2,2
4,0.693,0.726,3,3,3,1,3,6,7,7,...,2,4,3,2,3,-4,2,5,4,3
5,0.758,0.724,3,5,5,2,3,2,5,7,...,-9,-9,-9,-9,-9,-4,-9,-9,1,1
6,4.251,4.790,3,4,1,1,3,4,1,7,...,4,5,3,5,2,-4,8,3,3,3
7,1.032,1.041,3,4,4,2,3,1,7,5,...,4,3,1,3,2,-4,2,4,1,1
8,1.048,1.073,1,4,5,1,3,7,7,7,...,3,5,1,4,2,-4,2,5,1,1
9,0.664,0.637,3,3,3,1,3,4,4,4,...,3,3,3,3,3,-4,1,3,1,2


#### Attributes

Using the methods that we have learned, we can now dive into calculating statistics from data in tables. Two useful _attributes_ (variables, not methods!) of DataFrames are **`index`** and **`columns`**. They store the row labels and the column labels of a given table, respectively. For example:

In [59]:
num_variables = len(anes.index)
print("Number of rows: ", num_variables)
num_attributes = len(anes.columns)
print("Numbers of columns: ", num_attributes)

Number of rows:  4271
Numbers of columns:  197


Notice that we do _not_ put `()` after `index` and `columns`, as we did for methods like `head()`.

#### Sorting

It can be very useful to sort our DataFrames according to some column. The `sort_values` method does exactly that; it takes the column (or list of columns) that you want to sort by. By default, `sort_values` sorts the table in _ascending_ order of the data in the column indicated; however, you can change this by setting the optional parameter `ascending=False`.

Below is an example using the variable `V161212`. We first keep only the rows whose value is in `np.arange(1, 100)` (1 through 99), then sort them.

In [61]:
monetary_loss = anes.loc[anes["V161212"].isin(np.arange(1, 100))]
monetary_loss.sort_values(by=['V161212']) # Sort table by V161212 in ascending order

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
0,0.887,0.927,3,5,7,2,3,4,7,4,...,4,3,4,4,4,-4,3,4,4,4
1961,0.000,0.000,3,3,2,2,3,4,2,4,...,5,5,2,5,2,5,2,5,-1,-1
1962,0.000,0.000,3,4,1,2,3,2,4,7,...,4,4,1,4,1,4,1,4,-1,-1
1963,0.000,0.000,3,3,4,3,3,1,5,6,...,-6,-6,-6,-6,-6,-6,-6,-6,-6,-6
1964,0.000,0.000,3,3,2,3,3,7,3,4,...,-6,-6,-6,-6,-6,-6,-6,-6,-6,-6
3689,0.000,0.000,3,4,2,3,2,4,4,4,...,3,3,3,3,3,5,2,5,-1,-1
1967,0.000,0.000,3,2,6,3,3,2,99,6,...,4,4,1,4,1,5,1,5,-1,-1
1968,0.000,0.000,3,1,1,1,3,1,5,1,...,1,5,1,1,1,2,3,2,-1,-1
1960,0.000,0.000,3,2,1,3,3,7,4,5,...,3,5,2,5,1,5,2,5,-1,-1
1969,0.000,0.000,3,1,2,2,3,6,6,6,...,1,5,1,3,1,5,2,5,-1,-1


The above code will sort the DataFrame by the column `V161212` from least to greatest. Below, we'll sort it from greatest to least.

In [64]:
monetary_loss.sort_values(by='V161212', ascending=False) # Sort table by V161212 in descending order (highest at top)

Unnamed: 0,V160101f,V160102f,V161024x,V161140x,V161158x,V161188,V161192,V161194x,V161198,V161204x,...,V162362,V162363,V162364,V162365,V162366,V162367,V162368,V162369,V168112,V168113
2140,0.000,0.000,3,2,1,3,3,1,1,1,...,2,5,4,4,1,1,10,1,-1,-1
2237,0.000,0.000,3,4,4,5,2,4,5,4,...,4,5,2,4,2,5,2,5,-1,-1
2222,0.000,0.000,3,2,6,4,3,4,7,7,...,3,4,2,3,2,4,1,4,-1,-1
2225,0.000,0.000,3,5,4,2,2,1,6,7,...,4,4,3,4,3,4,2,4,-1,-1
2227,0.000,0.000,3,4,4,4,2,6,6,4,...,5,5,5,5,5,3,1,5,-1,-1
2230,0.000,0.000,3,1,1,3,3,4,99,4,...,4,4,4,4,3,5,1,5,-1,-1
2232,0.000,0.000,3,2,1,4,1,4,7,4,...,5,4,3,4,4,5,2,5,-1,-1
2234,0.000,0.000,3,3,5,3,3,4,5,7,...,4,4,3,3,3,4,3,4,-1,-1
2238,0.000,0.000,3,5,7,3,1,4,7,4,...,3,5,3,4,3,3,4,3,-1,-1
2738,0.000,0.000,3,4,7,3,1,4,99,4,...,5,5,1,5,4,4,2,5,-1,-1
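
Sorting is easier to follow on a table small enough to check by eye. Here is a sketch using a made-up fruit table.

```python
import pandas

# Made-up data for illustration.
fruit_table = pandas.DataFrame({
    "Fruit": ["Apple", "Orange", "Banana"],
    "Price ($)": [1.0, 0.75, 0.5],
})

cheapest_first = fruit_table.sort_values(by="Price ($)")
priciest_first = fruit_table.sort_values(by="Price ($)", ascending=False)

print(cheapest_first)
print(priciest_first)

# sort_values returns a new DataFrame; fruit_table itself keeps its order.
print(fruit_table)
```

Each row keeps its original row label when it moves, which is why the labels in the sorted survey output above are no longer in order.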


## Summary

As a summary, here are the functions we learned about during this notebook:
    
|Name|Example|Purpose|
|-|-|-|
|`DataFrame`|`DataFrame()`|Create an empty DataFrame, usually to extend with data|
|`pandas.read_csv`|`pandas.read_csv("my_data.csv")`|Create a DataFrame from a data file|
|`pandas.DataFrame`|`tbl = pandas.DataFrame({"N": np.arange(5), "2*N": np.arange(0, 10, 2)})`|Create a DataFrame from a dictionary of column labels and values|
|column selection|`tbl["N"]`|Select the values of a single column|
|`sort_values`|`tbl.sort_values(by=["N"])`|Create a copy of a DataFrame sorted by the values in a column|
|`loc`|`tbl.loc[tbl["N"] > 10]`|Create a copy of a DataFrame with only the rows that match some *predicate*|
|`index`|`len(tbl.index)`|Compute the number of rows in a DataFrame|
|`columns`|`len(tbl.columns)`|Compute the number of columns in a DataFrame|
|`loc`|`tbl.loc[:, ["N"]]`|Create a copy of a DataFrame with only some of the columns|
|`drop`|`tbl.drop(columns=["2*N"])`|Create a copy of a DataFrame without some of the columns|
|`iloc`|`tbl.iloc[np.arange(0, 6, 2)]`|Create a copy of the DataFrame with only the rows whose indices are in the given array|

Some materials in this notebook were taken from [Data 8](http://data8.org/), [CS 61A](http://cs61a.org/), and [DS Modules](http://data.berkeley.edu/education/modules) lessons.