# Section 1: Introduction to Python

## Python basics

This section will provide you with a foundational understanding of Python syntax and data types.

## Comments

Many of the examples in this notebook include comments. Comments in Python start with the hash character (`#`) and extend to the end of the physical line. A comment may appear at the start of a line or following white space or code, but not within a string literal. A hash character within a string literal is just a hash character. Because comments are there to clarify code and are not interpreted by Python, they may be omitted when typing in examples. For example:

In [3]:
# this is the first comment
spam = 1  # and this is the second comment
          # ... and now a third!
text = "# This is not a comment because it's inside quotes."
print(text)

# This is not a comment because it's inside quotes.


### Arithmetic and numeric types

> **Learning goal:** By the end of this subsection, you should be comfortable with using numeric types in Python arithmetic.

Python is an interpreted language, which means that you can interactively use the interpreter to get immediate results. You can see this by using the Python interpreter as a simple calculator: type an expression, and you can see the output immediately.

How can you see the results? The Python interpreter runs inside this notebook. To run the code inside a cell, either click the **Run Cell** button at the top of the window or press **Ctrl**+**Enter**. Try running the contents of the cell below. (Don't worry, we'll cover what the syntax of the Python code means later on in this section.)

In [2]:
print("Hello, world.")

Hello, world.


#### Python numeric operators

Expression syntax is straightforward: the operators `+`, `-`, `*`, and `/` work just like in most other programming languages (such as Java or C). For example:

In [3]:
2 + 3

5

The order of operations also works as in other programming languages (and in math class):

In [4]:
30 - 4*5

10

Note what happens when you use division:

In [5]:
7 / 5

1.4

Division (`/`) always returns a floating-point number, which brings up a good point. Python (like other programming languages) has different numeric types. Integer numbers (such as `1`, `3`, and `20`) have type [`int`](https://docs.python.org/3.6/library/functions.html#int). Numbers with a fractional component (such as `3.0` or `1.6`) have type [`float`](https://docs.python.org/3.5/library/functions.html#float).

You can mix numeric types in calculations:

In [6]:
3 * 3.5

10.5

In [7]:
7.0 / 5

1.4

You can perform a type of division that returns an integer: [floor division](https://docs.python.org/3.6/glossary.html#term-floor-division). Floor division uses the `//` operator, rounds the quotient down to the nearest whole number, and returns an `int` when both operands are integers.

In [8]:
7 // 5

1

To calculate the remainder, you can use the modulo operator, `%`:

In [9]:
7 % 5

2
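Floor division and modulo are designed to work together. As a minimal sketch (the variable names are illustrative and not part of the original notebook), the quotient and remainder recombine into the original number, and `//` rounds toward negative infinity rather than toward zero:

```python
# Illustrative values; any nonzero divisor would do.
dividend, divisor = 7, 5

quotient = dividend // divisor    # floor division
remainder = dividend % divisor    # modulo

# The two results recombine into the original number.
recombined = quotient * divisor + remainder
print(recombined == dividend)

# With a negative operand, // rounds down (toward negative infinity).
neg_quotient = -7 // 5
neg_remainder = -7 % 5
print(neg_quotient, neg_remainder)

# If either operand is a float, the result is a float.
float_quotient = 7.0 // 5
print(float_quotient)
```

Note that `-7 // 5` gives `-2`, not `-1`, which is why "discarding the remainder" is only an accurate description for positive numbers.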

For exponents, use the `**` operator. For example, you can write $5^2$ as:

In [10]:
5 ** 2

25

Conversely, $2^5$ would be:

In [11]:
2 ** 5

32

Note that `**` has higher precedence in the order of operations than the negative sign, `-`. This means that $-5^2$ is actually the same thing as $-\left(5^2\right)$:

In [12]:
-5**2

-25

In order to assert the order of precedence that you want, use parentheses, `()`:

In [13]:
(-5)**2

25

Parentheses can supersede the order of operations in any calculation you need to run:

In [14]:
(30 - 4)*5

130

### Variables

As in other programming languages, it is often essential to save values for later using variables in Python. Python assigns values to variables using the equals sign (`=`):

In [15]:
length = 15
width = 3 * 5
length * width

225

If you have a background in another programming language (such as Java), you might have noticed that we never specified the variable type when we declared our variables `length` and `width`. Python does not require this, and you can change the type of value a variable holds as you wish:

In [16]:
length = 15
length

15

In [17]:
length = 15.0
length

15.0

In [18]:
length = 'fifteen'
length

'fifteen'

Note that, for all the flexibility of variables in Python, you do have to define them. If you try to use an undefined variable, it will produce an error:

In [19]:
n

NameError: name 'n' is not defined

In Python's interactive mode and in Jupyter Notebooks, you can use the built-in variable `_`, which automatically takes the value of the last printed expression. For example:

In [20]:
tax = 11.3 / 100
price = 19.95
price * tax

2.25435

In [21]:
price + _

22.204349999999998

Note that you should always treat the `_` variable as read-only. Explicitly assigning a value to it will create an independent local variable with the same name and will mask the built-in variable (and its behavior).

The previous output is messy, however; prices are generally shown with no more than two decimal places. To clean this up, we can use the built-in function `round()`.

In [22]:
round(_, 2)

22.2

We will cover some of the other functions built into Python and user-defined functions later in this section.
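Because `_` exists only in interactive sessions and notebooks, a script would need to store intermediate results under a name instead. The sketch below repeats the same calculation that way; the names `total` and `rounded_total` are introduced here purely for illustration:

```python
tax = 11.3 / 100
price = 19.95

# Store the intermediate result under a name instead of relying on `_`.
total = price + price * tax

# Round to two decimal places for display as a price.
rounded_total = round(total, 2)
print(rounded_total)
```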

You do not have to define variables one at a time. You can define multiple variables on a single line, like so:

In [2]:
a, b, c = 3.2, 1, 6
a, b, c

(3.2, 1, 6)

You can also augment variable assignments. This will be particularly useful when we tackle loops later in this section.

In [1]:
x = 5
x = x + 1  # Un-pythonic variable augmentation
x += 1  # Pythonic variable augmentation
x

7

Pythonic means code that doesn't just get the syntax right but that follows the conventions of the Python community and uses the language in the way it is intended to be used.

Note that augmented assignment doesn't have to be by 1 or even just addition. Beyond `+=`, augmented assignment statements in Python include `-=`, `*=`, `/=`, `%=`, and `**=`. Try playing around with different augmented assignments until this concept makes sense.
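As a quick sketch of how these operators behave, the cell below applies several of them in turn to one variable. The starting value is arbitrary, and the comments track the value after each step:

```python
x = 10
x -= 3    # x is now 7
x *= 2    # x is now 14
x /= 4    # x is now 3.5 (true division always gives a float)
x **= 2   # x is now 12.25
x %= 5    # x is now 2.25
print(x)
```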

Python supports other types of numbers beyond `int` and `float`, such as [`Decimal`](https://docs.python.org/3.6/library/decimal.html#decimal.Decimal) and [`Fraction`](https://docs.python.org/3.6/library/fractions.html#fractions.Fraction). Python also has built-in support for [complex numbers](https://docs.python.org/3.6/library/stdtypes.html#typesnumeric), which are all beyond the scope of this course.

### Expressions

As with other programming languages, expressions are critical for decision making and for controlling the logical flow of Python programs. The most fundamental way of doing this in Python is with a comparison operator, such as `<`:

In [2]:
2 < 5

True

Python supplies several comparison operators:

<center>**Python Comparison Operators**</center>

| Operator |      Description      | Sample Input | Sample Output |
|:--------:|:---------------------:|:------------:|:-------------:|
| `<`      | Less than             | `2 < 5`      | `True`        |
| `>`      | Greater than          | `2 > 5`      | `False`       |
| `<=`     | Less than or equal    | `2 <= 5`     | `True`        |
|          |                       | `2 <= 2`     | `True`        |
| `>=`     | Greater than or equal | `2 >= 5`     | `False`       |
| `==`     | Equality              | `2 == 2`     | `True`        |
|          |                       | `2 == 5`     | `False`       |
| `!=`     | Inequality            | `2 != 5`     | `True`        |
|          |                       | `2 != 2`     | `False`       |

Python does not restrict you to comparing just two operands at a time. For example:

In [1]:
a, b, c = 1, 2, 3
a < b < c

True

This entire expression is `True` because `1 < 2` is `True` and `2 < 3` is `True`.

You can also use built-in functions in Python for comparing data. For example:

In [2]:
min(3, 2.4, 5)

2.4

In [3]:
max(3, 2.4, 5)

5

You can also combine comparison operators into compound expressions. For example:

In [4]:
1 < 2 and 2 < 3

True

This compound expression returned `True` because **both** `1 < 2` is true and `2 < 3` is true. (Note that this is equivalent to `1 < 2 < 3`.)
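To make that equivalence concrete, here is a small sketch that evaluates a chained comparison next to its spelled-out `and` form. The variable names are illustrative:

```python
a, b, c = 1, 2, 3

chained = a < b < c
spelled_out = (a < b) and (b < c)
print(chained, spelled_out)

# A chain is False as soon as any one link fails.
broken_chain = a < c < b   # 1 < 3 is True, but 3 < 2 is False
print(broken_chain)
```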

> **Exercise:**

In [5]:
# Now flip around one of the simple expressions and see if the output matches your expectations:
1 < 2 and 3 < 2

False

Python also provides the `or` Boolean operator, which requires only that at least one simple expression in a compound expression be true in order to return `True`. For example:

In [6]:
1 < 2 or 1 > 2

True

Finally, `not` inverts the truth evaluation of an expression, such as in:

In [7]:
not (2 < 3)

False

> **Exercise:**

In [8]:
# Play around with compound expressions.
# Set i to different values to see what results this complex compound expression returns:
i = 7
(i == 2) or not (i % 2 != 0 and 1 < i < 5)

True

> **Takeaway:** Arithmetic operations on numeric data form the foundation of data science work in Python. Even sophisticated numeric operations are predicated on these basics, so mastering them is essential to doing data science.

## Strings

> **Learning goal:** By the end of this subsection, you should be comfortable working with strings at a basic level in Python.

Besides numbers, Python can also manipulate strings. Strings can be enclosed in single quotes (`'...'`) or double quotes (`"..."`) with the same result. Use `\` to escape quotes; that is, use `\` in order to use quotation marks within the string itself:

In [1]:
'spam eggs'  # Single quotes.

'spam eggs'

In [2]:
'doesn\'t'  # Use \' to escape the single quote...

"doesn't"

In [3]:
"doesn't"  # ...or use double quotes instead.

"doesn't"

In the interactive interpreter and Jupyter Notebooks, the output string is enclosed in quotes and special characters are escaped with backslashes. Although this output sometimes looks different from the input (the enclosing quotes could change), the two strings are equivalent. The string is enclosed in double quotes if the string contains a single quote and no double quotes; otherwise, it’s enclosed in single quotes. The [`print()`](https://docs.python.org/3.6/library/functions.html#print) function produces a more readable output by omitting the enclosing quotes and by printing escaped and special characters:

In [4]:
'"Isn\'t," she said.'

'"Isn\'t," she said.'

In [5]:
print('"Isn\'t," she said.')

"Isn't," she said.


If you don't want escaped characters (prefaced by `\`) to be interpreted as special characters, use *raw strings* by adding an `r` before the first quote:

In [6]:
print('C:\some\name')  # Here \n means newline!

C:\some
ame


In [7]:
print(r'C:\some\name')  # Note the r before the quote.

C:\some\name
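One way to see the difference between a normal string and a raw string is to compare their lengths. This is a minimal sketch with made-up strings: in the normal string `\n` counts as a single newline character, while in the raw string the backslash and the `n` are two separate characters.

```python
escaped = 'a\nb'    # \n is one newline character
raw = r'a\nb'       # backslash and n are two separate characters

escaped_length = len(escaped)
raw_length = len(raw)
print(escaped_length, raw_length)

# Doubling the backslash in a normal string gives the same text as the raw string.
same_text = raw == 'a\\nb'
print(same_text)
```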


#### String literals

String literals can span multiple lines and are delineated by triple-quotes: `"""..."""` or `'''...'''`.

Because Python doesn't provide a means for creating multi-line comments, developers often just use triple quotes for this purpose. In a Jupyter notebook, however, such quotes define a string literal that appears as the output of a code cell:

In [8]:
"""
Everything between the first three quotes, including new lines,
is part of the multi-line comment. Technically, the Python interpreter
simply sees the comment as a string, and because it's not otherwise
used in code, the string is ignored. Convenient, eh?
"""

"\nEverything between the first three quotes, including new lines,\nis part of the multi-line comment. Technically, the Python interpreter\nsimply sees the comment as a string, and because it's not otherwise\nused in code, the string is ignored. Convenient, eh?\n"

For this reason, it's best in notebooks to use the `#` comment character at the beginning of each line, or better still, just use a Markdown cell outside of a code cell in a Jupyter notebook!

Strings can be *concatenated* (glued together) with the `+` operator, and repeated with `*`:

In [9]:
# 3 times 'un', followed by 'ium'
3 * 'un' + 'ium'

'unununium'

The order of operations applies to operators when they are used with strings as well as numeric types. Try experimenting with different combinations and orders of operators and strings to see what happens.
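As one such experiment, the sketch below shows how parentheses change the result. Repetition (`*`) binds more tightly than concatenation (`+`), just as multiplication binds more tightly than addition:

```python
no_parens = 3 * 'un' + 'ium'        # repeat 'un' first, then append 'ium'
with_parens = 3 * ('un' + 'ium')    # build 'unium' first, then repeat it

print(no_parens)
print(with_parens)
```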

### Concatenating strings

Two or more *string literals* placed next to each other are automatically concatenated:

In [10]:
'Py' 'thon'

'Python'

However, to concatenate variables or a variable and a literal, use `+`:

In [11]:
prefix = 'Py'
prefix + 'thon'

'Python'

### String indexes

Strings can be *indexed* (subscripted), with the first character having index 0. There is no separate character type; a character is simply a string of size one:

In [12]:
word = 'Python'
word[0]  # Character in position 0.

'P'

In [13]:
word[5]  # Character in position 5.

'n'

Indices may also be negative numbers, which means to start counting from the end of the string. Note that because -0 is the same as 0, negative indices start from -1:

In [14]:
word[-1]  # Last character.

'n'

In [15]:
word[-2]  # Second-last character.

'o'

In [16]:
word[-6]

'P'
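A negative index `-k` refers to the same character as the non-negative index `len(word) - k`. The following sketch checks that relationship for the first and last characters; it uses the built-in `len()` function, which is introduced later in this section:

```python
word = 'Python'

# The last character, reached two different ways.
last_by_negative = word[-1]
last_by_length = word[len(word) - 1]
print(last_by_negative, last_by_length)

# The first character, reached by counting back the full length.
first_by_negative = word[-len(word)]
print(first_by_negative, word[0])
```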

### Slicing strings

In addition to indexing, which extracts individual characters, Python also supports *slicing*, which extracts a substring. To slice, you indicate a *range* in the format `start:end`, where the start position is included but the end position is excluded:

In [17]:
word[0:2]  # Characters from position 0 (included) to 2 (excluded).

'Py'

In [18]:
word[2:5]  # Characters from position 2 (included) to 5 (excluded).

'tho'

If you omit either position, the default start position is 0 and the default end is the length of the string:

In [19]:
word[:2]   # Character from the beginning to position 2 (excluded).

'Py'

In [20]:
word[4:]  # Characters from position 4 (included) to the end.

'on'

In [21]:
word[-2:] # Characters from the second-last (included) to the end.

'on'

This characteristic means that `s[:i] + s[i:]` is always equal to `s`:

In [22]:
word[:2] + word[2:]

'Python'

In [23]:
word[:4] + word[4:]

'Python'

One way to remember how slices work is to think of the indices as pointing between characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of *n* characters has index *n*. For example:

     +---+---+---+---+---+---+
     | P | y | t | h | o | n |
     +---+---+---+---+---+---+
     0   1   2   3   4   5   6
    -6  -5  -4  -3  -2  -1

The first row of numbers gives the position of the indices 0–6 in the string; the second row gives the corresponding negative indices. The slice from *i* to *j* consists of all characters between the edges labeled *i* and *j*, respectively.

For non-negative indices, the length of a slice is the difference of the indices, if both are within bounds. For example, the length of `word[1:3]` is 2.

Attempting to use an index that is too large results in an error:

In [24]:
word[42]  # The word only has 6 characters.

IndexError: string index out of range

However, when used in a range, an index that's too large defaults to the size of the string and does not give an error. This characteristic is useful when you always want to slice at a particular index regardless of the length of a string:

In [25]:
word[4:42]

'on'

In [26]:
word[42:]

''
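This forgiving behavior makes it straightforward to truncate strings to a maximum length without checking how long they are first. A minimal sketch, in which the `limit` value and the two titles are made up for illustration:

```python
limit = 8  # illustrative maximum length

short_title = 'Python'
long_title = 'Supercalifragilistic'

# The same slice works whether or not the string reaches the limit.
short_result = short_title[:limit]
long_result = long_title[:limit]
print(short_result)
print(long_result)
```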

Python strings are [immutable](https://docs.python.org/3.6/glossary.html#term-immutable), which means they cannot be changed. Therefore, assigning a value to an indexed position in a string results in an error:

In [27]:
word[0] = 'J'

TypeError: 'str' object does not support item assignment

The following cell also produces an error:

In [28]:
word[2:] = 'py'

TypeError: 'str' object does not support item assignment

A slice is itself a value that you can concatenate with other values using `+`:

In [29]:
'J' + word[1:]

'Jython'

In [30]:
word[:2] + 'Py'

'PyPy'

A slice, however, is not a string literal, and it cannot be used with automatic concatenation. The following code produces an error:

In [31]:
word[:2] 'Py'    # Slice is not a literal; produces an error

SyntaxError: invalid syntax (<ipython-input-31-60be1c701626>, line 1)

Oftentimes, while working with strings, it can be useful to evaluate the length of a string. The built-in function [`len()`](https://docs.python.org/3.5/library/functions.html#len) returns the length of a string:

In [32]:
s = 'supercalifragilisticexpialidocious'
len(s)

34

Another useful built-in function for working with strings is [`str()`](https://docs.python.org/3.6/library/stdtypes.html#str). This function takes any object and returns a printable string version of that object. For example:

In [33]:
str(2)

'2'

In [34]:
str(2.5)

'2.5'
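One place `str()` tends to be useful is when joining a number to text with `+`, because Python does not convert the number automatically. In this sketch the `price` and `label` names are illustrative, and `try`/`except` (not covered in this section) is used only to show the error without stopping the cell:

```python
price = 22.2

# Convert the number explicitly before concatenating.
label = 'Total: ' + str(price)
print(label)

# Without str(), mixing a string and a float with + raises a TypeError.
try:
    'Total: ' + price
    raised = False
except TypeError as error:
    raised = True
    print('TypeError:', error)
```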

> **Takeaway:** Operations on string data form the other fundamental task you will do in data science in Python. Becoming comfortable with strings now will pay large dividends to you later as you work with increasingly complex data.

## Other data types

> **Learning goal:** By the end of this subsection, you should have a basic understanding of the remaining fundamental data types in Python and an idea of how and when to use them.

The string and numeric data types that we have looked at so far are common to many programming languages. The other data types that we will now look at--lists, tuples, and dictionaries--set Python apart from C++ or Java by providing powerful and easy-to-use built-in data structures.

### Lists

Python knows a number of compound data types, which are used to group together other values. The most versatile is the [*list*](https://docs.python.org/3.5/library/stdtypes.html#typesseq-list), which can be written as a sequence of comma-separated values (items) between square brackets. Lists might contain items of different types, but usually the items all have the same type.

In [35]:
squares = [1, 4, 9, 16, 25]
squares

[1, 4, 9, 16, 25]

Like strings (and all other built-in [sequence](https://docs.python.org/3.5/glossary.html#term-sequence) types), lists can be indexed and sliced:

In [36]:
squares[0]  # Indexing returns the item.

1

In [37]:
squares[-1]

25

In [38]:
squares[-3:]  # Slicing returns a new list.

[9, 16, 25]

All slice operations return a new list containing the requested elements. This means that the following slice returns a new (shallow) copy of the list:

In [39]:
squares[:]

[1, 4, 9, 16, 25]
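To illustrate what "shallow copy" means, the sketch below changes a copy and then checks the original. It relies on item assignment and nested lists, both of which are covered a little later in this section, and the `grid` example is made up for illustration:

```python
squares = [1, 4, 9, 16, 25]
squares_copy = squares[:]

# Changing the copy leaves the original list untouched.
squares_copy[0] = 100
print(squares)
print(squares_copy)

# "Shallow" means that nested lists are shared, not duplicated.
grid = [[1, 2], [3, 4]]
grid_copy = grid[:]
grid_copy[0][0] = 'changed'
print(grid[0][0])   # the inner list is the same object in both
```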

Lists also support concatenation with the `+` operator:

In [42]:
squares + [36, 49, 64, 81, 100]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Unlike strings, which are [immutable](https://docs.python.org/3.5/glossary.html#term-immutable), lists are a [mutable](https://docs.python.org/3.5/glossary.html#term-mutable) type, which means you can change any value in the list:

In [43]:
cubes = [1, 8, 27, 65, 125]  # Something's wrong here ...
4 ** 3  # the cube of 4 is 64, not 65!

64

In [44]:
cubes[3] = 64  # Replace the wrong value.
cubes

[1, 8, 27, 64, 125]

You can assign to slices, which can change the size of the list or clear it entirely:

In [2]:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
letters

['a', 'b', 'c', 'd', 'e', 'f', 'g']

In [3]:
# Replace some values.
letters[2:5] = ['C', 'D', 'E']
letters

['a', 'b', 'C', 'D', 'E', 'f', 'g']

In [4]:
# Now remove them.
letters[2:5] = []
letters

['a', 'b', 'f', 'g']

In [5]:
# Clear the list by replacing all the elements with an empty list.
letters[:] = []
letters

[]

The built-in [`len()`](https://docs.python.org/3.6/library/functions.html#len) function also applies to lists for getting their lengths:

In [6]:
letters = ['a', 'b', 'c', 'd']
len(letters)

4

You can nest lists, which means to create lists that contain other lists. For example:

In [7]:
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]
x

[['a', 'b', 'c'], [1, 2, 3]]

`x` is a list of lists, and you can access its constituent lists through the same indexing you use with simpler lists:

In [8]:
x[0]

['a', 'b', 'c']

And by using additional index numbers, you can directly access elements within those sub-lists:

In [9]:
x[0][0]

'a'

> **Exercise:**

In [10]:
# Nested lists come up a lot in programming, so it pays to practice.
# Which indices would you include after x to get 'c'?
# How about to get 3?


### List object methods

Python includes a number of handy methods that are available on all lists.

For example, [`append()`](https://docs.python.org/3.6/tutorial/datastructures.html) and [`extend()`](https://docs.python.org/3.6/tutorial/datastructures.html) enable you to add to the end of a list, much like the `+=` operator:

In [11]:
beatles = ['John', 'Paul']
beatles.append('George')
beatles

['John', 'Paul', 'George']

Notice that you did not actually pass a list to `append()`; passing a list to `append()` results in this behavior:

In [12]:
beatles2 = ['John', 'Paul', 'George']
beatles2.append(['Stuart', 'Pete'])
beatles2

['John', 'Paul', 'George', ['Stuart', 'Pete']]

To tack a list on the end of an existing list, use `extend()` instead:

In [13]:
beatles.extend(['Stuart', 'Pete'])
beatles

['John', 'Paul', 'George', 'Stuart', 'Pete']
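For comparison, here is a short sketch showing that `+=` with a list on the right-hand side behaves like `extend()`. It uses a separate, illustrative list named `band` so that the `beatles` list above is left unchanged:

```python
band = ['John', 'Paul']

band += ['George']        # adds each item of the right-hand list
band.extend(['Ringo'])    # does the same thing as a method call
print(band)
```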

[`index()`](https://docs.python.org/3.6/tutorial/datastructures.html) returns the index of the first matching item in a list (if present):

In [14]:
beatles.index('George')

2

The [`count()`](https://docs.python.org/3.6/tutorial/datastructures.html) method returns the number of items in a list that match objects you pass in:

In [15]:
beatles.count('John')

1

There are two methods for removing items from a list. The first is [`remove()`](https://docs.python.org/3.6/tutorial/datastructures.html), which locates the first occurrence of an item in the list and removes it (if present):

In [16]:
beatles.remove('Stuart')
beatles

['John', 'Paul', 'George', 'Pete']

The other method for removing items from lists is the [`pop()`](https://docs.python.org/3.6/tutorial/datastructures.html) method. If you supply `pop()` with an index number, it will remove the item from that location in the list and return it; otherwise, `pop()` removes the last item in a list and returns that:

In [17]:
beatles.pop()

'Pete'
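The cell above shows `pop()` without an argument. As a sketch of the indexed form, the following uses a separate, made-up list named `queue` so that `beatles` is not affected:

```python
queue = ['first', 'second', 'third', 'fourth']

removed_first = queue.pop(0)   # remove and return the item at index 0
removed_last = queue.pop()     # no index: remove and return the last item

print(removed_first, removed_last)
print(queue)
```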

The [`insert()`](https://docs.python.org/3.6/tutorial/datastructures.html) method enables you to add an item to a specific location in a list:

In [18]:
beatles.insert(1, 'Ringo')
beatles

['John', 'Ringo', 'Paul', 'George']

Unsurprisingly, the [`reverse()`](https://docs.python.org/3.6/tutorial/datastructures.html) method reverses the order of items in a list:

In [19]:
beatles.reverse()
beatles

['George', 'Paul', 'Ringo', 'John']

Finally, the [`sort()`](https://docs.python.org/3.6/tutorial/datastructures.html) method orders the items in a list:

In [20]:
beatles.sort()
beatles

['George', 'John', 'Paul', 'Ringo']

> **Exercise:**

In [21]:
# What happens if you run beatles.extend(beatles)?
# How about beatles.append(beatles)?


Note that you can pass your own function, such as a *lambda function*, as the `key` argument to `sort()` to control how the items in a list are ordered. We will cover lambda functions later in this section.
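As a preview, here is a minimal sketch of sorting with a lambda function passed through the `key` parameter of `sort()`. The list is a separate copy named `names`, introduced only for this illustration:

```python
names = ['George', 'John', 'Paul', 'Ringo']

# Order by length; items of equal length keep their original relative order.
names.sort(key=lambda name: len(name))
by_length = names[:]
print(by_length)

# Order by the last letter of each name.
names.sort(key=lambda name: name[-1])
by_last_letter = names[:]
print(by_last_letter)
```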

### Tuples

Tuples are another immutable data type in Python. It can be useful at times to create a data structure that won't be altered later in a program, such as to protect constant data from being overwritten by accident or to improve performance for iterating over data. This is where tuples come in. You create tuples much as you do lists, only using parentheses instead of brackets.

In [22]:
t = (1, 2, 3)
t

(1, 2, 3)

Because tuples are immutable, you cannot change elements within them:

In [23]:
t[1] = 2.0

TypeError: 'tuple' object does not support item assignment

However, you can refer to elements within them:

In [24]:
t[1]

2

You can also slice tuples:

In [25]:
t[:2]

(1, 2)

You can also create tuples from lists:

In [26]:
l = ['baked', 'beans', 'spam']
l = tuple(l)
l

('baked', 'beans', 'spam')

Or you can create lists from tuples:

In [27]:
l = list(l)
l

['baked', 'beans', 'spam']

### Membership testing

As your Python programming grows more complex, you will want to test lists and tuples for the membership of specific data. The `in` operator enables you to do that.

In [1]:
tup = ('a', 'b', 'c')
'b' in tup

True

You can also test to see if something is not in a list or tuple using `not in`:

In [2]:
lis = ['a', 'b', 'c']
'a' not in lis

False

> **Exercise:**

In [3]:
# What happens if you run lis in lis?
# Is that the behavior you expected?
# If not, think back to the nested lists we’ve already encountered.


### Dictionaries

Dictionaries in Python provide a means of mapping information between unique keys and values. You create dictionaries by listing zero or more key-value pairs inside of braces, like this:

In [4]:
capitals = {'France': ('Paris', 2140526)}

The most common dictionary keys are strings, numbers, and tuples (that contain only strings, numbers, or other tuples). The important thing is that dictionary keys be immutable (more precisely, hashable), so lists, for example, cannot be used as keys in dictionaries.
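The sketch below shows a tuple working as a key and a list being rejected. The `grid` dictionary and its contents are made up for illustration, and `try`/`except` (not covered in this section) is used only so the error does not stop the cell:

```python
# Tuples of numbers can serve as keys (illustrative data).
grid = {(0, 0): 'origin', (1, 0): 'east'}
print(grid[(1, 0)])

# A list cannot be used as a key; Python raises a TypeError.
try:
    bad = {[0, 0]: 'origin'}
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```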

You add to dictionaries like this:

In [5]:
capitals['Nigeria'] = ('Lagos', 6048430)
capitals

{'France': ('Paris', 2140526), 'Nigeria': ('Lagos', 6048430)}

> **Exercise:**

In [6]:
# Now try adding another country (or something else) to the capitals dictionary

You reference entries much as you would index a string, list, or tuple, but with a key instead of an index number:

In [7]:
capitals['France']

('Paris', 2140526)

You can also update entries in the dictionary:

In [8]:
capitals['Nigeria'] = ('Abuja', 1235880)
capitals

{'France': ('Paris', 2140526), 'Nigeria': ('Abuja', 1235880)}

When used on a dictionary, the built-in `len()` function returns the number of keys in the dictionary:

In [9]:
len(capitals)

2

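Similar to the `pop()` method for lists, the `popitem()` method removes a key from the dictionary, along with its associated value, and returns them as a pair. In Python 3.7 and later, it removes the most recently added pair; earlier versions removed an arbitrary one: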

In [10]:
capitals.popitem()

('Nigeria', ('Abuja', 1235880))

In [11]:
capitals

{'France': ('Paris', 2140526)}
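As a sketch that assumes Python 3.7 or later, where dictionaries preserve insertion order, `popitem()` removes the pair that was added most recently. The `letters` dictionary is made up for illustration:

```python
letters = {}
letters['a'] = 1
letters['b'] = 2
letters['c'] = 3

# On Python 3.7+, this removes the most recently inserted pair.
last_pair = letters.popitem()
print(last_pair)
print(letters)
```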

> **Takeaway:** Regardless of how complex and voluminous the data you will work with, these basic data structures will repeatedly be your means for handling and manipulating it. Comfort with these basic data structures is essential to being able to understand and use Python code written by others.

### Control flow in Python

> **Learning goal:** By the end of this subsection, you should be comfortable using basic control flows in Python.

Now that you have a working understanding of the fundamental data types and structures in Python, we can move on to actual programming using Python.

#### If-statements

`if` statements in Python are similar to those in other programming languages like Java, and they form the backbone of the logical flow of most programs.

In [5]:
y = 6
if y % 2 == 0:
    print('Even')

Even


> **Exercise:**

In [2]:
# What behavior do you experience if you change y to be odd?

Did you notice the indentation of `print` under the `if` statement? That indentation is important because it is how Python marks the scope of a control flow (what is contingently run or looped over), as opposed to the braces (`{}`) used in other languages.
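To see how indentation decides what belongs to the `if` statement, consider this small sketch. The `messages` list is introduced only so that the result is easy to inspect:

```python
y = 7
messages = []

if y % 2 == 0:
    messages.append('Even')             # indented: runs only when the test is true
    messages.append('Divisible by 2')   # still inside the if block

messages.append('Done')                 # not indented: always runs
print(messages)
```

Because 7 is odd, both indented lines are skipped and only the unindented line runs.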

To cover more contingencies without having to construct a follow-on `if` statement, you can add an `else` statement:

In [6]:
y = 7
if y % 2 == 0:
    print('Even')
else:
    print('Odd')

Odd


`elif` enables you to insert an additional logical test to an `if` statement:

In [7]:
y = 1
if y % 2 == 0:
    print('Even')
elif y == 1:
    print('One')
else:
    print('Odd')

One


Notice that, in the previous example, the `if` statement exited after finding the *first* logical test that was `True`. Although 1 is indeed odd, when `y` is 1 the statement runs the `elif` branch and then exits, rather than continuing on to the `else` branch.

> **Exercise:**

In [8]:
# Try changing the value of y in the snippet above.
# Do you get the output that you expect?


#### For-loops

It is often necessary in programs to iterate over some set of items. This is where `for` loops prove useful. For example, they can provide a useful way to iterate over the items of a list:

In [9]:
colors = ['red', 'yellow', 'blue']
for color in colors:
    print(color)

red
yellow
blue


Sometimes, you will want to iterate over a list using the list index rather than items from that list (say, when you want to access items from another list at the same time). In this case, you can combine the built-in `len()` and `range()` functions with a `for` loop:

In [10]:
comp_colors = ['green', 'purple', 'orange']
for i in range(len(comp_colors)):
    print(colors[i], comp_colors[i])

red green
yellow purple
blue orange


We've met `len()` before, but [`range()`](https://docs.python.org/3/library/functions.html#func-range) is new to us. That function produces a sequence of integers from 0 to 1 less than the number passed into it. Hence:

In [11]:
for j in range(5):
    print(j)

0
1
2
3
4


In addition to `range(stop)`, the range function can take up to three arguments: `range(start, stop[, step])`. The square brackets in this odd-looking notation just mean that `step` is optional. If you pass a single argument to `range()`, it is taken to be the stop value; two arguments are the start and stop values; and three arguments are `start`, `stop`, and `step`.
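The following sketch shows each form in turn, wrapping `range()` in `list()` so that the whole sequence is visible at once. The particular numbers are arbitrary:

```python
stop_only = list(range(4))             # stop only
start_stop = list(range(2, 6))         # start and stop
with_step = list(range(0, 10, 4))      # start, stop, and step
counting_down = list(range(5, 0, -2))  # a negative step counts down

print(stop_only)
print(start_stop)
print(with_step)
print(counting_down)
```

In every case the stop value itself is excluded from the result.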

> **Exercise:**

In [12]:
# How would you use range and a for loop to print the sequence of numbers
# from 10 to 20? How about counting by threes from 17 to 41?


It can also be important to break out of a loop. Python uses the `break` statement borrowed from C to do this. To see this in action, consider two nested for loops:

In [13]:
for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, 'equals', x, '*', n//x)
            break
    else:
        print(n, 'is a prime number')

2 is a prime number
3 is a prime number
4 equals 2 * 2
5 is a prime number
6 equals 2 * 3
7 is a prime number
8 equals 2 * 4
9 equals 3 * 3


Note that, in the example above, the `else` clause belongs to the `for` loop, not to the `if` statement. It runs only when the loop finishes without encountering a `break`.

> **Exercise:**

In [14]:
# Try running the code snippet above after you remove the break statement.
# What output does it now produce?


As part of the control flow of your program, you might want to continue to the next iteration of your loop. The `continue` statement (also borrowed from C) can help with that:

In [15]:
for num in range(2, 10):
    if num % 2 == 0:
        print("Found an even number:", num)
        continue
    print("Found an odd number:", num)

Found an even number: 2
Found an odd number: 3
Found an even number: 4
Found an odd number: 5
Found an even number: 6
Found an odd number: 7
Found an even number: 8
Found an odd number: 9


> **Exercise:**

In [16]:
# What happens when you replace the continue statement above with a break?


#### While-loops

If we cross the functionality of the `if` statement with that of the `for` loop, we would get the `while` loop, a loop that iterates while some logical condition remains true. Consider this snippet of code to compute the initial sub-sequence of the Fibonacci sequence:

In [17]:
# In the Fibonacci series, the sum of two elements defines the next.
a, b = 0, 1

while b < 100:    
    print(b, end=', ')
    a, b = b, a+b

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 

Go ahead and play with the number of iterations for the while loop. Notice that this snippet also uses multiple assignment for variables.

> **Takeaway:** Control flows are what make programs programs, as opposed to a single sequence of operations. Mastering the logical flow of information in Python will enable you to automate tasks that would be impossibly complex or time-consuming to do manually.

### Functions

> **Learning goal:** By the end of this subsection, you should understand how to pass and receive data from functions.

As in other programming languages, it is often essential in Python to break down your program into reusable chunks. A primary means of doing that is through functions.

For example, we could rewrite the `while` loop code snippet above as a formal function:

In [18]:
def fib(n):
    """Print a Fibonacci series up to n."""
    a, b = 0, 1
    while a < n:
        print(a, end=', ')
        a, b = b, a+b

Now we can call this function and compute the Fibonacci series up to some arbitrary point:

In [19]:
fib(2000)

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 
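The `fib()` function above prints its results but does not hand anything back to the caller. As a sketch of how a function can return data instead, the hypothetical variant below (named `fib_list` here for illustration) collects the numbers in a list and uses a `return` statement:

```python
def fib_list(n):
    """Return a list of the Fibonacci numbers that are less than n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result

# The returned list can be stored and reused like any other value.
small_fibs = fib_list(100)
print(small_fibs)
print(len(small_fibs))
```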

Python can also define new functions on the fly. These anonymous functions are called *lambda functions* because you define them with the `lambda` keyword. Lambda functions can contain any number of arguments but only one expression.

In [20]:
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
list(filter(lambda x: x % 2 != 0, nums))

[1, 3, 5, 7, 9]

> **Takeaway:** You will constantly be using functions of all kinds to perform data science in Python, so understanding how functions accept, work on, and return data is critical to further progress.

### List comprehensions

> **Learning goal:** By the end of this subsection, you should understand how to create lists concisely and programmatically.

Sometimes, it makes more sense to generate a list algorithmically. Consider the last example. We really wanted just a list of numbers from 1 to 10. Rather than type those out, we can use a *list comprehension* to generate it:

In [21]:
numbers = [x for x in range(1,11)] # Remember to create a range 1 more than the number you actually want.
numbers

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

We can also perform computation on the items generated for the list:

In [22]:
squares = [x*x for x in range(1,11)]
squares

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

We can even perform logical tests on list items in the comprehension:

In [23]:
odd_squares = [x*x for x in range(1,11) if x % 2 != 0]
odd_squares

[1, 9, 25, 49, 81]
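
It may help to see that a comprehension is shorthand for a loop that appends to a list. Here is a sketch of the equivalent loop for the odd squares, using the illustrative name `odd_squares_loop`:

```python
odd_squares_loop = []
for x in range(1, 11):
    if x % 2 != 0:                      # same test as the comprehension's `if` clause
        odd_squares_loop.append(x * x)  # same expression as the comprehension's output

print(odd_squares_loop)
```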

> **Exercise:**

In [24]:
# Now use a list comprehension to generate a list of odd cubes
# from 1 to 2,197


> **Takeaway:** List comprehensions are a popular tool in Python because they enable the rapid, programmatic generation of lists. The economy and ease of use therefore make them an essential tool for you (in addition to a necessary topic to understand as you try to understand Python code written by others).

### Classes and instance objects

> **Learning goal:** By the end of this subsection, you should have a basic understanding of classes in Python, particularly their methods.

Python is an object-oriented programming language; nearly everything in Python is an object with attributes: data members (variables that belong to that object) and methods (functions built into an object that operate on that object's data).

A [class](https://docs.python.org/3/tutorial/classes.html) is like an object constructor, a blueprint for creating objects.

Let's take a look at what that looks like in Python by creating a class representing a simple bank account.

In [1]:
class BankAccount:
    """Where does your money go?"""
    
    account_count = 0
    
    # Above is an example of a class variable
    # Below are the methods of the class

    
    def __init__(self, balance=0):
        self.balance = balance
        BankAccount.account_count += 1
        
    def deposit(self, amount):
        self.balance += amount
        self.display_balance()
        
    def withdrawal(self, amount):
        self.balance -= amount
        self.display_balance()
        
    def display_balance(self):
        print('New balance: ${:.2f}'.format(self.balance))

Unless we specify a figure, new `BankAccount` objects are created with a balance of zero. Let's create a new account with \$50.00 in it:

In [2]:
my_account = BankAccount(50)
my_account.balance

50

Our class also has three methods, two of which are designed to be accessed from outside the object.

In [3]:
my_account.deposit(100)

New balance: $150.00


In [4]:
my_account.withdrawal(125)

New balance: $25.00


We even have a documentation string in the class:

In [5]:
my_account.__doc__

'Where does your money go?'

Note the `account_count` class variable. It is defined outside any method of the class, which means that every instance of this class shares it. In practice, that means the counter increments every time a new account is created, and every instance sees the updated value. Here's how that looks in action:

In [6]:
my_account.account_count

1

In [7]:
your_account = BankAccount()
print(my_account.account_count, your_account.account_count)

2 2
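
One subtlety is worth a quick sketch: a class variable stays shared only as long as you change it through the class. Assigning to the same name through an instance creates a separate instance attribute instead. The small `Counter` class below is hypothetical and exists only to illustrate this.

```python
class Counter:
    """Hypothetical class used only to illustrate class variables."""
    total = 0  # class variable, shared by all instances

    def __init__(self):
        Counter.total += 1  # update through the class so every instance sees it

first = Counter()
second = Counter()
print(first.total, second.total, Counter.total)

first.total = 10  # creates an instance attribute on `first` only
print(first.total, second.total, Counter.total)
```

This is why `BankAccount.__init__` above updates `BankAccount.account_count` rather than `self.account_count`.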


> **Takeaway:** Because nearly everything in Python is an object, it is essential to understand—even at a basic level—what that means and how to use object attributes like methods.

### Importing modules

> **Learning goal:** By the end of this subsection, you should be comfortable importing modules in Python.

If you quit from the Python interpreter and enter it again, the definitions you have made (your functions and variables) will be lost. Similarly, you might also want to use a handy function that you’ve written in several programs without copying its definition into each program.

To support this, Python has a way to put definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is called a [*module*](https://docs.python.org/3/tutorial/modules.html). Definitions from a module can be imported into other programs or modules.

For example, the `factorial()` function is not one of the standard functions built into Python. It is part of the Python [`math`](https://docs.python.org/3/library/math.html) module. So, when we run `factorial()` before importing `math`, we get an error:

In [8]:
factorial(5)

NameError: name 'factorial' is not defined

However, the situation changes after we import the `math` module:

In [9]:
import math
math.factorial(5)

120

Notice that we still have to prefix the `factorial()` function with `math.` when we call it. We can use a different form of import to bring in just that function from the `math` module and use it as if it were defined in our program:

In [10]:
from math import factorial
factorial(5)

120
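
Two more import forms come up often: importing several names at once, and importing a module under a shorter alias with `as`. Here is a brief sketch using the standard `math` module (the alias `m` is just an illustrative choice):

```python
from math import sqrt, pi   # import several names in one statement
import math as m            # import the module under an alias

print(sqrt(16))
print(m.factorial(5))
print(m.pi == pi)           # both names refer to the same value
```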

You can add more cells to your notebook by clicking the **insert cell below (+)** button at the top of the window. The Python [`math`](https://docs.python.org/3/library/math.html) module has many functions in it. Try importing some of the other math functions and playing around with them.

> **Takeaway:** There are several Python modules that you will regularly use in conducting data science in Python, so understanding how to import them will be essential (especially in this training).

# Section 2: Introduction to NumPy

NumPy is one of the two most important libraries in Python for data science, along with pandas (which we will cover in the next section). NumPy is a crucial library for effectively loading, storing, and manipulating in-memory data in Python, all of which will be at the heart of what you do with data science in Python.

Datasets come from a huge range of sources and in a wide range of formats, such as text documents, images, sound clips, numerical measurements, and nearly anything else. Despite this variety, however, the start of data science is to think of all data fundamentally as arrays of numbers.

For example, the words in documents can be represented as the numbers that encode letters in computers or even the frequency of particular words in a collection of documents. Digital images can be thought of as two-dimensional arrays of numbers representing pixel brightness or color. Sound files can be represented as one-dimensional arrays of intensity versus time. However, no matter what form our data takes, in order to analyze it, our first step will be to transform it into arrays of numbers, which is where NumPy comes in (and pandas down the road).

NumPy is short for *Numerical Python*, and it provides an efficient means of storing and operating on dense data buffers in Python. Array-oriented computing in Python goes back to 1995 with the Numeric library. Scientific programming in Python took off over the next 10 years, but the collections of libraries splintered. The NumPy project began in 2005 as a means of bringing the Numeric and NumArray projects together around a single array-based framework.

Let's get started exploring NumPy! Our first step will be to import NumPy using `np` as an alias:

In [2]:
import numpy as np

Get used to this alias: it's a common convention in Python, and it's the way we will use and refer to NumPy throughout the rest of this course.

## Built-In Help

There's a lot to learn about NumPy, and it can be tough to remember it all the first time through. Don't worry! IPython, the underlying program that enables notebooks like this one to interact with Python, has you covered.

First off, IPython gives you the ability to quickly explore the contents of a package like NumPy by using the tab-completion feature. So, if you want to see all of the functions available with NumPy, type this:

```ipython
In [ ]: np.<TAB>
```
When you do so, a drop-down menu will appear next to the `np.`

> **Exercise**

In [4]:
# Place your cursor after the period and press <TAB>:
np.

SyntaxError: invalid syntax (<ipython-input-4-f0faa50f3d63>, line 2)

From the drop-down menu, you can select any function to run. Better still, you can select any function and view the built-in help documentation for it. For example, to view the documentation for the NumPy `add()` function, you can run this code:

```ipython
In [ ]: np.add?
```
Try this with a few different functions. Remember, these functions are just like the ones you wrote in the previous section; the documentation will help explain what parameters you can (or should) provide to the function, in addition to what output you can expect.

> **Exercise**

In [3]:
# Replace 'add' below with a few different NumPy function names and look over the documentation:
np.add?

For more detailed documentation (along with additional tutorials and other resources), visit [www.numpy.org](http://www.numpy.org).
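
The `?` suffix is an IPython feature, so it is not available in a plain Python script or in the standard interpreter. There, the built-in `help()` function displays an object's documentation, and the underlying documentation string is stored in the object's `__doc__` attribute. Here is a small sketch that uses `math.factorial` as the example:

```python
import math

# help(math.factorial) shows the documentation interactively;
# the documentation string itself is available as plain text:
doc = math.factorial.__doc__
print(doc)
```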

Now that you know how to quickly get help while you are working on your own, let's return to storing data in arrays.

## NumPy arrays: a specialized data structure for analysis

> **Learning goal:** By the end of this subsection, you should have a basic understanding of what NumPy arrays are and how they differ from the other Python data structures you have studied thus far.

We started the discussion in this section by noting that data science starts by representing data as arrays of numbers.

"Wait!" you might be thinking. "Can't we just use Python lists for that?"

Depending on the data, yes, you could (and you will use lists as a part of working with data in Python). But to see why we might want a specialized data structure, let's look a little more closely at lists.

### Lists in Python

A Python list can hold objects of a single type. Let's create a list of just integers:

In [5]:
myList = list(range(10))
myList

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Remember list comprehension? We can use it to probe the data types of items in a list:

In [6]:
[type(item) for item in myList]

[int, int, int, int, int, int, int, int, int, int]

Of course, a really handy feature of Python lists is that they can hold heterogeneous types of data in a single list object:

In [7]:
myList2 = [True, "2", 3.0, 4]
[type(item) for item in myList2]

[bool, str, float, int]

However, this flexibility comes at a price. Each item in a list is really a separate Python object (the list is an object itself, true, but mostly it is an object that serves as a container for the memory pointers to the constituent objects). That means that each item in a list must contain its own type info, reference count, and other information. All of this information can become expensive in terms of memory and performance if we are dealing with hundreds of thousands or millions of items in a list. Moreover, for many uses in data science, our arrays just store a single type of data (such as integers or floats), which means that all of the object-related information for items in such an array would be redundant. It can be much more efficient to store data in a fixed-type array.
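
To get a rough feel for this overhead, the sketch below compares the memory used by a list of 100,000 integers with a NumPy array holding the same values. It relies on `sys.getsizeof`, which reports sizes for the CPython implementation; exact numbers vary by platform and Python version, so treat the result as an estimate rather than a benchmark.

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores pointers; each integer is a separate object with its own header.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)
# The array stores raw 8-byte integers in one contiguous buffer.
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)
```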

<img align="left" style="padding-right:10px;" src="Graphics/array_vs_list.png">

### Fixed-type arrays in Python

At the level of implementation by the computer, the `ndarray` that is part of the NumPy package contains a single pointer to one contiguous block of data. This is efficient memory-wise and computationally. Better still, NumPy provides efficient *operations* on data stored in `ndarray` objects.

(Note that we will pretty much use “array,” “NumPy array,” and “ndarray” interchangeably throughout this section to refer to the ndarray object.)

#### Creating NumPy arrays method 1: using Python lists

There are multiple ways to create arrays in NumPy. Let's start by using our good old familiar Python lists. We will use the `np.array()` function to do this (remember, we imported NumPy as '`np`'):

In [8]:
# Create an integer array:
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

Remember that, unlike Python lists, NumPy constrains arrays to contain a single type. So, if data types fed into a NumPy array do not match, NumPy will attempt to *upcast* them if possible. To see what we mean, here NumPy upcasts integers to floats:

In [9]:
np.array([3.14, 4, 2, 3])

array([3.14, 4.  , 2.  , 3.  ])

> **Exercise**

In [10]:
# What happens if you construct an array using a list that contains a combination of integers, floats, and strings?


If you want to explicitly set the data type of your array when you create it, you can use the `dtype` keyword:

In [11]:
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

> **Exercise**

In [12]:
# Try this using a different dtype.
# Remember that you can always refer to the documentation with the command np.array?


Most usefully for a lot of applications in data science, NumPy arrays can explicitly be multidimensional (like matrices or tensors). Here's one way of creating a multidimensional array using a list of lists:

In [13]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

The inner lists in a list of lists are treated as rows of the two-dimensional array you created.

#### Creating NumPy arrays method 2: building from scratch

In practice, it is often more efficient to create arrays from scratch using functions built into NumPy, particularly for larger arrays. Here are a few examples; these examples will help introduce you to several useful NumPy functions.

In [14]:
# Create an integer array of length 10 filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [15]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [16]:
# Create a 3x5 array filled with 3.14
# The first number in the tuple gives the number of rows
# The second number in the tuple sets the number of columns
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [17]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in Python range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [18]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [19]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.00703713, 0.52877796, 0.75275509],
       [0.89635637, 0.25126021, 0.44915591],
       [0.14923241, 0.06215598, 0.80735779]])

In [20]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 0.7714448 ,  0.7676462 , -0.21410694],
       [ 0.31127101, -1.90044403,  0.88254434],
       [ 0.76114444,  0.5224406 ,  1.7439717 ]])

In [21]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[9, 3, 8],
       [3, 4, 5],
       [6, 7, 6]])

In [22]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [5]:
# Create an uninitialized array of three floats (np.empty defaults to float64)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

array([0.00000000e+000, 1.93538748e-309, 4.68110620e-310])

Now take a couple of minutes to go back and play with these code snippets, changing the parameters. These functions are the bread-and-butter of creating NumPy arrays and you will want to become comfortable with them.

Below is a table listing out several of the array-creation functions in NumPy.

| Function      | Description |
|:--------------|:------------|
| `array`       | Converts input data (list, tuple, array, or other sequence type) to an ndarray, either by inferring a dtype or by using an explicitly specified dtype. Copies the input data by default. |
| `asarray`     | Converts input to ndarray, but does not copy if the input is already an ndarray. |
| `arange`      | Similar to the built-in `range()` function but returns an ndarray instead of a `range` object. |
| `ones`, `ones_like` | `ones` produces an array of all 1s with the given shape and dtype. `ones_like` takes another array and produces a ones-array of the same shape and dtype. |
| `zeros`, `zeros_like` | Similar to `ones` and `ones_like` but producing arrays of 0s instead. |
| `empty`, `empty_like` | Creates new arrays by allocating new memory, but does not populate them with any values the way `ones` and `zeros` do. |
| `full`, `full_like` | `full` produces an array of the given shape and dtype with all values set to the indicated “fill value.” `full_like` takes another array and produces a filled array of the same shape and dtype. |
| `eye`, `identity` | Creates a square $N \times N$ identity matrix (1s on the diagonal and 0s elsewhere). |
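
Here is a short sketch of a few functions from the table, showing the copy behavior of `array` versus `asarray` and how the `_like` variants borrow shape and dtype from an existing array (the variable names are illustrative):

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

copied = np.array(base)      # copies the data by default
same = np.asarray(base)      # input is already an ndarray, so no copy is made
print(copied is base, same is base)

zeros = np.zeros_like(base)  # same shape and dtype as base, filled with 0
sevens = np.full_like(base, 7)
print(zeros.shape, zeros.dtype)
print(sevens)
```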

### NumPy data types

The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string:

```python
np.zeros(8, dtype='int16')
```

Or they can be specified directly using the NumPy object:

```python
np.zeros(8, dtype=np.int16)
```

| Data type	    | Description |
|:--------------|:------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64``.| 
| ``float16``   | Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single-precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double-precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128``.| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 

If these data types seem a lot like those in C, that's because NumPy is built in C.
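
If you want to check the limits in the table yourself, NumPy exposes them for integer types through `np.iinfo`, and every array reports the number of bytes per element in its `itemsize` attribute. Here is a minimal sketch:

```python
import numpy as np

info = np.iinfo(np.int16)
print(info.min, info.max)

# itemsize reports the number of bytes used by each element
print(np.zeros(8, dtype='int16').itemsize)
print(np.zeros(8, dtype=np.float64).itemsize)
```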

> **Takeaway:** NumPy arrays are a data structure similar to Python lists that provide high performance when storing and working on large amounts of homogeneous data—precisely the kind of data that you will encounter frequently in doing data science. NumPy arrays support many data types beyond those discussed in this course. With all of that said, however, don’t worry about memorizing all of the NumPy dtypes. **It’s often just necessary to care about the general kind of data you’re dealing with: floating point, integer, Boolean, string, or general Python object.**

## Working with NumPy arrays: the basics

> **Learning goal:** By the end of this subsection, you should be comfortable working with NumPy arrays in basic ways.

Now that you know how to create arrays in NumPy, you need to get comfortable manipulating them for two reasons. First, you will work with NumPy arrays as part of your exploration of data science. Second, our other important Python data-science tool, pandas, is actually built around NumPy. Getting good at working with NumPy arrays will pay dividends in the next section and beyond: NumPy arrays are the building blocks for the `Series` and `DataFrame` data structures in the Python pandas library and you will use them *a lot* in data science. To get comfortable with array manipulation, we will cover five specifics:
- **Array attributes**: Assessing the size, shape, and data types of arrays
- **Indexing arrays**: Getting and setting the value of individual array elements
- **Slicing arrays**: Getting and setting smaller subarrays within a larger array
- **Reshaping arrays**: Changing the shape of a given array
- **Joining and splitting arrays**: Combining multiple arrays into one and splitting one array into multiple arrays

### Array attributes
First, let's look at some array attributes. We'll start by defining three arrays filled with random numbers: one one-dimensional, another two-dimensional, and the last three-dimensional. Because we will be using NumPy's random number generator, we will set a *seed* value in order to ensure that you get the same random arrays each time you run this code:

In [33]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

a1 = np.random.randint(10, size=6)  # One-dimensional array
a2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
a3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

Each array has attributes ``ndim`` (the number of dimensions of an array), ``shape`` (the size of each dimension of an array), and ``size`` (the total number of elements in an array).

> **Exercise:**

In [34]:
# Change the values in this code snippet to look at the attributes for a1, a2, and a3:
print("a3 ndim: ", a3.ndim)
print("a3 shape:", a3.shape)
print("a3 size: ", a3.size)

a3 ndim:  3
a3 shape: (3, 4, 5)
a3 size:  60


Another useful array attribute is the `dtype`, which we already encountered earlier in this section as a means of determining the type of data in an array:

In [35]:
print("dtype:", a3.dtype)

dtype: int64


> **Exercise:**

In [36]:
# Explore the dtype for the other arrays.
# What dtypes do you predict them to have?
print("dtype:", a3.dtype)

dtype: int64


### Indexing arrays

Indexing in NumPy is pretty similar to indexing lists in standard Python. In fact, indices in one-dimensional arrays work exactly as they do with Python lists:

In [37]:
a1

array([5, 0, 3, 3, 7, 9])

In [38]:
a1[0]

5

In [39]:
a1[4]

7

As with regular Python lists, in order to index from the end of the array, you can use negative indices:

In [40]:
a1[-1]

9

In [41]:
a1[-2]

7

> **Exercise:**

In [42]:
# Do multidimensional NumPy arrays work like Python lists of lists?
# Try a few combinations like a2[1][1] or a3[0][2][1] and see what comes back


You might have noticed that we can treat multidimensional arrays like lists of lists. But a more common means of accessing items in multidimensional arrays is to use a comma-separated tuple of indices.

(Yes, we realize that these comma-separated tuples use square brackets rather than the parentheses the name might suggest, but they are nevertheless referred to as tuples.)

In [43]:
a2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [44]:
a2[0, 0]

3

In [45]:
a2[2, 0]

1

In [46]:
a2[2, -1]

7

You can also modify values by use of this same comma-separated index notation:

In [47]:
a2[0, 0] = 12
a2

array([[12,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])

Remember, once defined, NumPy arrays have a fixed data type. So, if you attempt to insert a float into an integer array, the value will be silently truncated.

In [48]:
a1[0] = 3.14159
a1

array([3, 0, 3, 3, 7, 9])
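
If you do need to keep the fractional part, one option is to convert the array to a floating-point dtype first with the `astype` method, which returns a new array. Here is a minimal sketch (the array values are typed in by hand so that the example is self-contained):

```python
import numpy as np

ints = np.array([5, 0, 3, 3, 7, 9])
ints[0] = 3.14159            # truncated to 3 because the array holds integers
print(ints)

floats = ints.astype(float)  # new array with a floating-point dtype
floats[0] = 3.14159          # now the full value is stored
print(floats[0])
```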

> **Exercise:**

In [49]:
# What happens if you try to insert a string into a1?
# Hint: try both a string like '3' and one like 'three'


### Slicing arrays
Similar to how you can use square brackets to access individual array elements, you can also use them to access subarrays. You do this with the *slice* notation, marked by the colon (`:`) character. NumPy slicing syntax follows that of the standard Python list; so, to access a slice of an array `a`, use this notation:
``` python
a[start:stop:step]
```
If any of these are unspecified, they default to ``start=0``, ``stop=``*size of dimension*, and ``step=1``.
Let's take a look at accessing sub-arrays in one dimension and in multiple dimensions.

#### One-dimensional slices

In [50]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [51]:
a[:5]  # first five elements

array([0, 1, 2, 3, 4])

In [52]:
a[5:]  # elements from index 5 onward

array([5, 6, 7, 8, 9])

In [53]:
a[4:7]  # middle sub-array

array([4, 5, 6])

In [54]:
a[::2]  # every other element

array([0, 2, 4, 6, 8])

In [55]:
a[1::2]  # every other element, starting at index 1

array([1, 3, 5, 7, 9])

> **Exercise:**

In [56]:
# How would you access the *last* five elements of array a?
# How about every other element of the last five elements of a?
# Hint: Think back to list indexing in Python


Be careful when using negative values for ``step``. When ``step`` has a negative value, the defaults for ``start`` and ``stop`` are swapped and you can use this functionality to reverse an array:

In [57]:
a[::-1]  # all elements, reversed

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [58]:
a[5::-2]  # reversed every other from index 5

array([5, 3, 1])
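
One way to see which defaults Python fills in is the built-in `slice` object, whose `indices` method returns the concrete `(start, stop, step)` for a sequence of a given length. The sketch below applies it to a sequence of length 10, like `a`:

```python
# slice(start, stop, step) is the object that a[start:stop:step] creates
forward = slice(None, None, 1).indices(10)     # a[::1]
backward = slice(None, None, -1).indices(10)   # a[::-1]
from_five = slice(5, None, -2).indices(10)     # a[5::-2]

print(forward)
print(backward)
print(from_five)
```

In the last two results, a stop of `-1` means "one step before index 0", not the last element.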

> **Exercise:**

In [59]:
# How can you create a slice that contains every third element of a
# descending from the second-to-last element to the second element of a?


#### Multidimensional slices

Multidimensional slices use the same slice notation as one-dimensional subarrays, combined with the comma-separated notation of multidimensional arrays. Some examples will help illustrate this.

In [60]:
a2

array([[12,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])

In [61]:
a2[:2, :3]  # two rows, three columns

array([[12,  5,  2],
       [ 7,  6,  8]])

In [62]:
a2[:3, ::2]  # all rows, every other column

array([[12,  2],
       [ 7,  8],
       [ 1,  7]])

Finally, subarray dimensions can even be reversed together:

In [63]:
a2[::-1, ::-1]

array([[ 7,  7,  6,  1],
       [ 8,  8,  6,  7],
       [ 4,  2,  5, 12]])

> **Exercise:**

In [64]:
# Now try to show 2 rows and 4 columns with every other element.


#### Accessing array rows and columns
One thing you will often need to do in manipulating data is accessing a single row or column in an array. You can do this through a combination of indexing and slicing; specifically by using an empty slice marked by a single colon (``:``). Again, some examples will help illustrate this.

In [65]:
print(a2[:, 0])  # first column of a2

[12  7  1]


In [66]:
print(a2[0, :])  # first row of a2

[12  5  2  4]


In the case of row access, the empty slice can be omitted for a more compact syntax:

In [67]:
print(a2[0])  # equivalent to a2[0, :]

[12  5  2  4]


> **Exercise:**

In [68]:
# How would you access the third column of a3?
# How about the third row of a3?


#### Slices are no-copy views
It's important to know that slicing produces *views* of array data, not *copies*. This is a **huge** difference between NumPy array slicing and Python list slicing. With Python lists, slices are shallow copies; if you modify the slice, the parent list is not affected. When you modify a NumPy subarray, you modify the original array. Be careful: this can have ramifications when you are trying to just work with a small part of a large dataset and you don’t want to change the whole thing. Let's look more closely.

In [69]:
print(a2)

[[12  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


Extract a $2 \times 2$ subarray from `a2`:

In [70]:
a2_sub = a2[:2, :2]
print(a2_sub)

[[12  5]
 [ 7  6]]


Now modify this subarray:

In [71]:
a2_sub[0, 0] = 99
print(a2_sub)

[[99  5]
 [ 7  6]]


`a2` is now modified as well:

In [72]:
print(a2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


> **Exercise:**

In [73]:
# Now try reversing the column and row order of a2_sub
# Does a2 look the way you expected it would after that manipulation?


The fact that slicing produces views rather than copies is useful for data-science work. As you work with large datasets, you will often find that it is easier to access and manipulate pieces of those datasets rather than copying them entirely.

#### Copying arrays
Instead of just creating views, sometimes it is necessary to copy the data in one array to another. When you need to do this, use the `copy()` method:

In [74]:
a2_sub_copy = a2[:2, :2].copy()
print(a2_sub_copy)

[[99  5]
 [ 7  6]]


If we now modify this subarray, the original array is not touched:

In [75]:
a2_sub_copy[0, 0] = 42
print(a2_sub_copy)

[[42  5]
 [ 7  6]]


In [76]:
print(a2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]
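
When you are unsure whether two arrays share data, `np.shares_memory` can tell you. The following sketch contrasts a list slice (a copy), an array slice (a view), and an explicit array copy; the variable names are illustrative.

```python
import numpy as np

py_list = [0, 1, 2, 3, 4]
list_slice = py_list[:2]       # list slicing makes a new list
list_slice[0] = 99
print(py_list)                 # the parent list is unchanged

arr = np.arange(5)
view = arr[:2]                 # array slicing makes a view
independent = arr[:2].copy()   # explicit, independent copy
print(np.shares_memory(arr, view), np.shares_memory(arr, independent))

view[0] = 99                   # writes through to arr
print(arr)
```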


### Reshaping arrays
Another way in which you will need to manipulate arrays is by reshaping them. This involves changing the number and size of dimensions of an array. This kind of manipulation can be important in getting your data to meet the expectations of machine learning programs or APIs.

The most flexible way of doing this kind of manipulation is with the `reshape` method. For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [77]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
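
Two details of `reshape` are worth a quick sketch. The new shape must account for exactly as many elements as the original array, and you can pass `-1` for one dimension to let NumPy work out its size.

```python
import numpy as np

values = np.arange(12)

print(values.reshape((3, 4)).shape)
print(values.reshape((2, -1)).shape)   # -1 lets NumPy infer the missing dimension

try:
    values.reshape((5, 3))             # 15 slots cannot hold 12 elements
except ValueError as err:
    print("reshape failed:", err)
```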


Another common manipulation you will do in data science is converting one-dimensional arrays into two-dimensional row or column matrices. This can be a common necessity when doing linear algebra for machine learning. While you can do this by means of the `reshape` method, an easier way is to use `np.newaxis` in a slice operation:

In [78]:
a = np.array([1, 2, 3])

# row vector via reshape
a.reshape((1, 3))

array([[1, 2, 3]])

In [79]:
# row vector via newaxis
a[np.newaxis, :]

array([[1, 2, 3]])

In [80]:
# column vector via reshape
a.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [81]:
# column vector via newaxis
a[:, np.newaxis]

array([[1],
       [2],
       [3]])

You will see this type of transformation a lot in the remainder of this course.

### Joining and splitting arrays

Another common data-manipulation need in data science is combining multiple datasets; learning first how to do this with NumPy arrays will help you in the next section when we do this with more complex data structures. You will also often need to split a single array into multiple arrays.

#### Joining arrays
To join arrays in NumPy, you will most often use `np.concatenate`, which is the function we will cover here. If you find yourself in the future needing to join arrays of mixed dimensions (a rarer case), read the documentation on `np.vstack`, `np.hstack`, and `np.dstack`.
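
As a brief sketch of that rarer case, `np.vstack` can stack a one-dimensional array on top of a two-dimensional one, and `np.hstack` can attach a column to the side. The arrays here are illustrative.

```python
import numpy as np

row = np.array([1, 2, 3])
block = np.array([[9, 8, 7],
                  [6, 5, 4]])
column = np.array([[99],
                   [99]])

stacked_rows = np.vstack([row, block])     # stack vertically: result has 3 rows
stacked_cols = np.hstack([block, column])  # stack horizontally: result has 4 columns

print(stacked_rows)
print(stacked_cols)
```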

##### `np.concatenate()`

`np.concatenate` takes a tuple or list of arrays as its first argument:

In [82]:
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
np.concatenate([a, b])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [83]:
c = [99, 99, 99]
print(np.concatenate([a, b, c]))

[ 1  2  3  3  2  1 99 99 99]


`np.concatenate` can also be used for two-dimensional arrays:

In [84]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

In [85]:
# concatenate along the first axis, which is the default
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

> **Exercise:**

In [86]:
# Recall that axes are zero-indexed in NumPy.
# What do you predict np.concatenate([grid, grid], axis=1) will produce?


#### Splitting arrays
In order to split arrays into multiple smaller arrays, you can use the functions ``np.split``, ``np.hsplit``, ``np.vsplit``, and ``np.dsplit``.  As above, we will only cover the most commonly used function (`np.split`) in this course.

##### `np.split()`
Let's first examine the case of a one-dimensional array:

In [None]:
a = [1, 2, 3, 99, 99, 3, 2, 1]
a1, a2, a3 = np.split(a, [3, 5])
print(a1, a2, a3)

Notice that *N* split points produce *N + 1* subarrays. In this case, `np.split` formed the subarray `a2` from `a[3]` and `a[4]`: it starts at index 3 (the first split point) and stops just before index 5 (the second split point; remember that Python slices exclude the end index). `a1` and `a3` pick up the leftover portions of the original array `a`.
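
One way to convince yourself of this is to compare each piece against an ordinary slice. The following is a small sketch; the names `data`, `first`, `middle`, and `last` are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 99, 99, 3, 2, 1])
first, middle, last = np.split(data, [3, 5])

# each piece matches a plain slice between consecutive split points
print(np.array_equal(first, data[:3]))
print(np.array_equal(middle, data[3:5]))
print(np.array_equal(last, data[5:]))
```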

> **Exercise:**

In [None]:
grid = np.arange(16).reshape((4, 4))
grid

In [None]:
# What does np.split(grid, [1, 2]) produce?
# What about np.split(grid, [1, 2], axis=1)?


> **Takeaway:** Manipulating datasets is a fundamental part of preparing data for analysis. The skills you learned and practiced here will form building blocks for the more sophisticated data manipulation you will learn in later sections of this course.

## Fancy indexing

So far, we have explored how to access and modify portions of arrays using simple indices (like `arr[0]`) and slices (like `arr[:5]`). Now it is time for fancy indexing, in which we pass an array of indices to an array in order to access or modify multiple array elements at the same time.

Let's try it out:

In [None]:
rand = np.random.RandomState(42)

arr = rand.randint(100, size=10)
print(arr)

Suppose you need to access three different elements. Using the tools you currently have, your code might look something like this:

In [None]:
[arr[3], arr[7], arr[4]]

With fancy indexing, you can pass a single list or array of indices to do the same thing:

In [None]:
ind = [3, 7, 4]
arr[ind]

Another useful aspect of fancy indexing is that the shape of the output array reflects the shape of the *index arrays* you supply, rather than the shape of the array you are accessing. This is handy because there will be many times in a data scientist's life when they want to grab data from an array in a particular manner, such as to pass it to a machine learning API. Let's examine this property with an example:

In [None]:
ind = np.array([[3, 7],
                [4, 5]])
arr[ind]

`arr` is a one-dimensional array, but `ind`, your index array, is a $2 \times 2$ array, and that is the shape the result comes back in.

> **Exercise:**

In [None]:
# What happens when your index array is bigger than the target array?
# Hint: you could use a large one-dimensional array or something fancier like ind = np.arange(0, 12).reshape((6, 2))


Fancy indexing also works in multiple dimensions:

In [None]:
arr2 = np.arange(12).reshape((3, 4))
arr2

As with standard indexing, the first index refers to the row and the second to the column:

In [None]:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
arr2[row, col]

What did you actually get as your final result here? The first value in the result array is `arr2[0, 2]`, the second one is `arr2[1, 1]`, and the third one is `arr2[2, 3]`.

The pairing of indices in fancy indexing follows all the same broadcasting rules we covered earlier. Thus, if you combine a column vector and a row vector within the indices, you get a two-dimensional result:

In [None]:
arr2[row[:, np.newaxis], col]

Here, each row value is matched with each column value, exactly as we saw in broadcasting of arithmetic operations.

> **Exercise:**

In [None]:
# Now try broadcasting this on your own.
# What do you get with row[:, np.newaxis] * col?
# Or row[:, np.newaxis] * row?
# What about col[:, np.newaxis] * row?
# Hint: think back to the broadcast rules


**The big takeaway:** Always remember that the result of fancy indexing reflects the *broadcasted shape of the indices*, not the shape of the array being indexed.

### Combined indexing

You can also combine fancy indexing with the other indexing schemes you have learned. Consider `arr2` again:

In [None]:
print(arr2)

Now combine fancy and simple indices:

In [None]:
arr2[2, [2, 0, 1]]

What did you get back? The elements at positions 2, 0, and 1 of row 2 (the third row).

You can also combine fancy indexing with slicing:

In [None]:
arr2[1:, [2, 0, 1]]

Again, consider what you got back as output: the elements at positions 2, 0, and 1 of each row after the first one (so the second and third rows).

Of course, you can also combine fancy indexing with masking:

In [None]:
mask = np.array([1, 0, 1, 0], dtype=bool)
arr2[row[:, np.newaxis], mask]
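
To see what the mask contributes, the sketch below compares it with explicit integer column indices. It rebuilds the same kind of $3 \times 4$ array under the illustrative names `demo`, `rows`, and `col_mask` so that it runs on its own.

```python
import numpy as np

demo = np.arange(12).reshape((3, 4))
rows = np.array([0, 1, 2])
col_mask = np.array([True, False, True, False])

masked = demo[rows[:, np.newaxis], col_mask]
# the mask keeps columns 0 and 2, so this matches explicit integer indices
explicit = demo[rows[:, np.newaxis], [0, 2]]

print(masked)
print(np.array_equal(masked, explicit))
```

In other words, a Boolean mask in an index position behaves like the list of positions at which the mask is `True`.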

### Modifying values using fancy indexing

Fancy indexing is, of course, not just for accessing parts of an array, but also for modifying parts of an array:

In [None]:
ind = np.arange(10)
arr = np.array([2, 1, 8, 4])
ind[arr] = 99
print(ind)

You can also use a ufunc here and subtract 10 from each of the selected elements:

In [None]:
ind[arr] -= 10
print(ind)

Be cautious when using repeated indices with operations like these. They might not always produce the results you expect. For example:

In [None]:
ind = np.zeros(10)
ind[[0, 0]] = [4, 6]
print(ind)

Where did the 4 go? The result of this operation is to first assign `ind[0] = 4`, followed by `ind[0] = 6`. So the result is that `ind[0]` contains the value 6.

But not every operation repeats the way you might think it should:

In [None]:
arr = [2, 3, 3, 4, 4, 4]
ind[arr] += 1
ind

We might have expected that `ind[3]` would contain the value 2 and `ind[4]` would contain the value 3. After all, that is how many times each index is repeated. So what happened?

This happened because `ind[arr] += 1` is really shorthand for `ind[arr] = ind[arr] + 1`. `ind[arr] + 1` is evaluated, and then the result is assigned to the indices in `ind`. So, similar to the previous example, this is not augmentation that happens multiple times, but an assignment, which can lead to potentially counterintuitive results.

But what if you want an operation to repeat? To do this, use the `at()` method of ufuncs:

In [None]:
ind = np.zeros(10)
np.add.at(ind, arr, 1)
print(ind)
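
The difference between the two behaviors is easiest to see side by side. This is a minimal sketch using fresh arrays; `idx`, `buffered`, and `accumulated` are illustrative names.

```python
import numpy as np

idx = [2, 3, 3, 4, 4, 4]

buffered = np.zeros(6)
buffered[idx] += 1               # repeated indices are assigned, not accumulated

accumulated = np.zeros(6)
np.add.at(accumulated, idx, 1)   # repeated indices accumulate

print(buffered)      # [0. 0. 1. 1. 1. 0.]
print(accumulated)   # [0. 0. 1. 2. 3. 0.]
```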

> **Exercise:**

In [None]:
# What does np.subtract.at(ind, arr, 1) give you?
# Play around with some of the other ufuncs we have seen.


> **Takeaway:** Fancy indexing enables you to select and manipulate several array members at once. This type of programmatic data manipulation is common in data science: often what you want to do with your data you want to do on several data points at once.

## Sorting arrays

So far we have concentrated on accessing and modifying NumPy arrays. Another common task for a data scientist is sorting array data. Sorting is often an important means of teasing out the structure in data (such as outlying data points).

Although you could use Python's built-in `sorted` function or the list `sort` method, they will not work nearly as efficiently on arrays as NumPy's `np.sort` function.

`np.sort` returns a sorted version of an array without modifying the input:

In [None]:
a = np.array([2, 1, 4, 3, 5])
np.sort(a)

In order to sort the array in-place, use the `sort` method directly on arrays:

In [None]:
a.sort()
print(a)

A related function is `argsort`, which returns the *indices* of the sorted elements rather than the elements themselves:

In [None]:
a = np.array([2, 1, 4, 3, 5])
b = np.argsort(a)
print(b)

The first element of this result gives the index of the smallest element, the second value gives the index of the second smallest, and so on. These indices can then be used (via fancy indexing) to reconstruct the sorted array:

In [None]:
a[b]
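
A common use of `argsort` is reordering a second, related array by the sort order of the first. The sketch below uses hypothetical `scores` and `names` arrays invented for illustration.

```python
import numpy as np

scores = np.array([72, 95, 58, 81])
names = np.array(["Ana", "Ben", "Cy", "Dee"])

order = np.argsort(scores)    # indices that sort scores in ascending order
print(names[order])           # names from lowest to highest score
print(names[order[::-1]])     # reversed: highest score first
```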

### Sorting along rows or columns

A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the `axis` argument. For example:

In [None]:
rand = np.random.RandomState(42)
table = rand.randint(0, 10, (4, 6))
print(table)

In [None]:
# Sort each column of the table
np.sort(table, axis=0)

In [None]:
# Sort each row of the table
np.sort(table, axis=1)

Bear in mind that this treats each row or column as an independent array; any relationships between the row or column values will be lost doing this kind of sorting.
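
If you need to keep each row intact, one option is to reorder whole rows with `argsort` instead. This is a sketch with a small made-up `records` array, not part of the original example.

```python
import numpy as np

records = np.array([[3, 10],
                    [1, 30],
                    [2, 20]])

# np.sort with axis=0 sorts each column separately and breaks up the rows
print(np.sort(records, axis=0))

# to keep rows together, order them by the first column with argsort
by_first_col = records[np.argsort(records[:, 0])]
print(by_first_col)
```

In the first result, the value 1 ends up next to 10 even though it started next to 30; the second result preserves the original pairs.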

## Partial sorting: partitioning

Sometimes you don't need to sort an entire array; you just need to find the *k* smallest values in the array (often when looking at the distance of data points from one another). NumPy supplies this functionality through the `np.partition` function. `np.partition` takes an array and a number *k*; the result is a new array with the smallest *k* values to the left of the partition, and the remaining values to the right (in arbitrary order):

In [None]:
arr = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(arr, 3)

Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.

Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:

In [None]:
np.partition(table, 2, axis=1)

The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.

Finally, just as there is an `np.argsort` that computes indices of the sort, there is an `np.argpartition` that computes indices of the partition. We'll see this in action in the following section when we discuss pandas.
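
In the meantime, here is a brief sketch of `np.argpartition` finding the positions of the *k* smallest values. The `distances` array and the names `k` and `nearest` are illustrative.

```python
import numpy as np

distances = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

# indices of the k smallest values (in no particular order)
nearest = np.argpartition(distances, k)[:k]
print(np.sort(distances[nearest]))   # [1 2 3]
```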

> **Takeaway:** Sorting your data is a fundamental means of exploring it and answering questions about it. The sorting algorithms in NumPy provide you with a fast, computationally efficient way of doing this on large amounts of data and with fine-grain control.

## Efficient computation on NumPy arrays: Universal functions

> **Learning goal:** By the end of this subsection, you should have a basic understanding of what NumPy universal functions are and how (and why) to use them.

Some of the properties that make Python great to work with for data science (its dynamic, interpreted nature, for example) can also make it slow. This is particularly true with looping. These small performance hits can add up to minutes (or longer) when dealing with truly huge datasets.

When we first examined loops in the previous section, you probably didn't notice any delay: the loops were short enough that Python’s relatively slow looping wasn’t an issue. Consider this function, which calculates the reciprocal for an array of numbers:

In [None]:
import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

Running this loop, it was probably difficult to even discern that execution wasn't instantaneous.

But let’s try it on a much larger array. To measure the difference empirically, we'll time the function with IPython's `%timeit` magic command.

In [None]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

You certainly noticed that delay. The slowness of this looping becomes noticeable when we repeat many small operations many times.

The performance bottleneck is not the operations themselves, but the type-checking and function dispatches that Python performs on each cycle of the loop. In the case of the `compute_reciprocals` function above, each time Python computes the reciprocal, it first examines the object's type and does a dynamic lookup of the correct function to use for that type. Such is life with interpreted code. However, were we working with compiled code instead (such as in C), the object-type specification would be known before the code executes, and the result could be computed much more efficiently. This is where NumPy universal functions come into play.

### Ufuncs

Universal functions in NumPy (often shortened to *ufuncs*) provide a statically typed, compiled function for many of the operations that we will need to run in the course of manipulating and analyzing data.

Let's examine what this means in practice by finding the reciprocals of `big_array` again, this time using a built-in NumPy division ufunc on the array:

In [None]:
%timeit (1.0 / big_array)

That’s orders of magnitude better.

Ufuncs can be used between a scalar and an array and between arrays of arbitrary dimensions.

Computations vectorized by ufuncs are almost always more efficient than doing the same computation using Python loops. This is especially true on large arrays. When possible, try to use ufuncs when operating on NumPy arrays, rather than using ordinary Python loops.
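
To confirm that the loop and the vectorized expression compute the same values, here is a minimal sketch. The function `reciprocals_loop` re-creates the looping approach under a new name so the block runs on its own; `sample` and `result` are illustrative names.

```python
import numpy as np

def reciprocals_loop(values):
    # element-by-element approach, one Python-level division per item
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

sample = np.arange(1, 6)
print(np.allclose(reciprocals_loop(sample), 1.0 / sample))

# ufuncs also accept an `out` argument to write into an existing array
result = np.empty(5)
np.divide(1.0, sample, out=result)
print(result)
```

Writing into a preallocated array with `out` can avoid creating temporary arrays in larger computations.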

Ufuncs come in two flavors: *unary ufuncs*, which use a single input, and *binary ufuncs*, which operate on two inputs. The common ufuncs we'll look at here encompass both kinds.

#### Array arithmetic

Many NumPy ufuncs use Python's native arithmetic operators, so you can use the standard addition, subtraction, multiplication, and division operators that we covered in the first section:

In [None]:
a = np.arange(4)
print("a     =", a)
print("a + 5 =", a + 5)
print("a - 5 =", a - 5)
print("a * 2 =", a * 2)
print("a / 2 =", a / 2)
print("a // 2 =", a // 2)  # floor division

There are also ufuncs for negation, exponentiation, and the modulo operation:

In [None]:
print("-a     = ", -a)
print("a ** 2 = ", a ** 2)
print("a % 2  = ", a % 2)

You can also combine these ufuncs using the standard order of operations:

In [None]:
-(0.5*a + 1) ** 2

The Python operators are not actually the ufuncs, but are rather wrappers around functions built into NumPy. So the `+` operator is actually a wrapper for the `add` function:

In [None]:
np.add(a, 2)

Here is a cheat sheet for the equivalencies between Python operators and NumPy ufuncs:

| Operator      | Equivalent ufunc    | Description                           |
|:--------------|:--------------------|:--------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|

Python Boolean operators also work; we will explore those later in this section.

#### Absolute value

NumPy also understands Python's built-in absolute value function:

In [None]:
a = np.array([-2, -1, 0, 1, 2])
abs(a)

This corresponds to the NumPy ufunc `np.absolute` (which is also available under the alias `np.abs`):

In [None]:
np.absolute(a)

In [None]:
np.abs(a)

#### Exponents and logarithms

You will need to use exponents and logarithms a lot in data science; these are some of the most common data transformations for machine learning and statistical work.

In [None]:
a = [1, 2, 3]
print("a     =", a)
print("e^a   =", np.exp(a))
print("2^a   =", np.exp2(a))
print("3^a   =", np.power(3, a))

The basic `np.log` gives the natural logarithm; if you need to compute base-2 or base-10 logarithms, NumPy also provides those:

In [None]:
a = [1, 2, 4, 10]
print("a        =", a)
print("ln(a)    =", np.log(a))
print("log2(a)  =", np.log2(a))
print("log10(a) =", np.log10(a))

There are also some specialized versions of these ufuncs to help maintain precision when dealing with very small inputs:

In [None]:
a = [0, 0.001, 0.01, 0.1]
print("exp(a) - 1 =", np.expm1(a))
print("log(1 + a) =", np.log1p(a))

These functions give more precise values than if you were to use the raw `np.log` or `np.exp` on very small values of `a`.
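
The effect is easy to measure for a very small input. In this sketch, `tiny` is an arbitrary small value and `exact` is a reference computed from the first two terms of the Taylor series of $\ln(1 + x)$, which is accurate at this scale; both are assumptions made for illustration.

```python
import numpy as np

tiny = 1e-10
exact = tiny - tiny**2 / 2      # two-term series for ln(1 + x)

naive = np.log(1 + tiny)        # 1 + tiny is rounded before the log is taken
stable = np.log1p(tiny)

print(abs(naive - exact) / exact)    # relative error of the naive form
print(abs(stable - exact) / exact)   # relative error of log1p
```

The naive form loses digits because `1 + tiny` cannot be represented exactly in floating point, whereas `np.log1p` never forms that sum.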

#### Specialized ufuncs

NumPy has many other ufuncs. Another source for specialized and obscure ufuncs is the submodule `scipy.special`. If you need to compute some specialized mathematical or statistical function on your data, chances are it is implemented in `scipy.special`.

In [None]:
from scipy import special

In [None]:
# Gamma functions (generalized factorials) and related functions
a = [1, 5, 10]
print("gamma(a)     =", special.gamma(a))
print("ln|gamma(a)| =", special.gammaln(a))
print("beta(a, 2)   =", special.beta(a, 2))

> **Takeaway:** Universal functions in NumPy provide you with computational functions that are faster than regular Python functions, particularly when working on large datasets that are common in data science. This speed is important because it can make you more efficient as a data scientist and it makes a broader range of inquiries into your data tractable in terms of time and computational resources.

## Aggregations

> **Learning goal:** By the end of this subsection, you should be comfortable aggregating data in NumPy.

One of the first things you will find yourself doing with most datasets is computing the summary statistics for the data in order to get a general overview of your data before exploring it further. These summary statistics include the mean and standard deviation, in addition to other aggregates, such as the sum, product, median, minimum and maximum, or quantiles of the data.

NumPy has fast built-in aggregation functions for working on arrays; they are the subject of this subsection.

### Summing the values of an array

You can use the built-in Python `sum` function to sum up the values in an array.

In [None]:
import numpy as np

In [None]:
myList = np.random.random(100)
sum(myList)

If you guessed that there is also a built-in NumPy function for this, you guessed correctly:

In [None]:
np.sum(myList)

And if you guessed that the NumPy version is faster, you are doubly correct:

In [None]:
large_array = np.random.rand(1000000)
%timeit sum(large_array)
%timeit np.sum(large_array)

For all their similarity, bear in mind that `sum` and `np.sum` are not identical; their optional arguments have different meanings, and `np.sum` is aware of multiple array dimensions.
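
A small two-dimensional example makes the difference concrete. This is a sketch; `table_2d` is an illustrative name.

```python
import numpy as np

table_2d = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

# built-in sum iterates over the first axis, adding the rows together
print(sum(table_2d))          # [3 5 7]
# np.sum adds every element by default
print(np.sum(table_2d))       # 15

# the second positional argument means different things
print(sum(table_2d, 1))       # start value of 1: [4 6 8]
print(np.sum(table_2d, 1))    # axis=1: [ 3 12]
```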

### Minimum and maximum

Just as Python has built-in `min` and `max` functions, NumPy has similar, vectorized versions:

In [None]:
np.min(large_array), np.max(large_array)

You can also use `min`, `max`, and `sum` (and several other NumPy aggregates) as methods of the array object itself:

In [None]:
print(large_array.min(), large_array.max(), large_array.sum())

### Multidimensional aggregates

Because you will often treat the rows and columns of two-dimensional arrays differently (treating columns as variables and rows as observations of those variables, for example), it can often be desirable to aggregate array data along a row or column. Let's consider a two-dimensional array:

In [None]:
md = np.random.random((3, 4))
print(md)

Unless you specify otherwise, each NumPy aggregation function will compute the aggregate for the entire array. Hence:

In [None]:
md.sum()

Aggregation functions take an additional argument specifying the *axis* along which to compute the aggregation. For example, we can find the minimum value within each column by specifying `axis=0`:

In [None]:
md.min(axis=0)

> **Exercise:**

In [None]:
# What do you get when you try md.max(axis=1)?


Remember that the `axis` keyword specifies the *dimension of the array that is to be collapsed*, not the dimension that will be returned. Thus specifying `axis=0` means that the first axis will be the one collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
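
Checking the shape of the result is a quick way to see which axis was collapsed. The sketch below uses an illustrative array named `sample_2d`; the `keepdims` argument is a standard option of NumPy aggregates that is not otherwise covered here.

```python
import numpy as np

sample_2d = np.arange(12).reshape((3, 4))

print(sample_2d.sum(axis=0).shape)   # (4,): the 3 rows are collapsed
print(sample_2d.sum(axis=1).shape)   # (3,): the 4 columns are collapsed

# keepdims=True retains the collapsed axis with length 1
print(sample_2d.sum(axis=0, keepdims=True).shape)   # (1, 4)
```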

### Other aggregation functions

The table below lists other aggregation functions in NumPy. Most NumPy aggregates have a '`NaN`-safe' version, which computes the result while ignoring missing values marked by the `NaN` value.

|Function Name      |   NaN-safe Version  | Description                                   |
|:------------------|:--------------------|:----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |

We will see these aggregates often throughout the rest of the course.
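
As a brief illustration of the `NaN`-safe variants, the sketch below compares `np.mean` and `np.nanmean` on a made-up `readings` array that contains one missing value.

```python
import numpy as np

readings = np.array([1.0, 2.0, np.nan, 4.0])

print(np.mean(readings))      # nan: a single NaN propagates to the result
print(np.nanmean(readings))   # mean of 1, 2, and 4 only
```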

> **Takeaway:** Aggregation is the primary means you will use to explore your data, not just when using NumPy, but particularly in conjunction with pandas, the Python library you will learn about in the next section, which builds on NumPy and thus on everything you have learned so far.

## Computation on arrays with broadcasting

> **Learning goal:** By the end of this subsection, you should have a basic understanding of how broadcasting works in NumPy (and why NumPy uses it).

Another means of vectorizing operations is to use NumPy's *broadcasting* functionality: a set of rules for applying binary ufuncs like addition, subtraction, or multiplication on arrays of different sizes.

Before, when we performed binary operations on arrays of the same size, those operations were performed on an element-by-element basis.

In [None]:
first_array = np.array([3, 6, 8, 1])
second_array = np.array([4, 5, 7, 2])
first_array + second_array

Broadcasting enables you to perform these types of binary operations on arrays of different sizes. Thus, you could just as easily add a scalar (which is really just a zero-dimensional array) to an array:

In [None]:
first_array + 5

Similarly, you can add a one-dimensional array to a two-dimensional array:

In [None]:
one_dim_array = np.ones((1))
one_dim_array

In [None]:
two_dim_array = np.ones((2, 2))
two_dim_array

In [None]:
one_dim_array + two_dim_array

So far, so easy. But you can use broadcasting on arrays in more complicated ways. Consider this example:

In [None]:
horizontal_array = np.arange(3)
vertical_array = np.arange(3)[:, np.newaxis]

print(horizontal_array)
print(vertical_array)

In [None]:
horizontal_array + vertical_array

### Rules of broadcasting
Broadcasting follows a set of rules to determine the interaction between the two arrays:
- **Rule 1**: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is *padded* with ones on its leading (left) side.
- **Rule 2**: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- **Rule 3**: If, in any dimension, the sizes disagree and neither is equal to 1, NumPy raises an error.
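
You can ask NumPy to apply these rules to shapes alone, without building any arrays. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; the shapes are arbitrary examples.

```python
import numpy as np

# Rule 1 pads (4,) to (1, 4); Rule 2 stretches it to (5, 4)
print(np.broadcast_shapes((5, 4), (4,)))        # (5, 4)
# both shapes are stretched
print(np.broadcast_shapes((4, 1), (1, 6)))      # (4, 6)
# padding and stretching combined
print(np.broadcast_shapes((8, 1, 5), (3, 1)))   # (8, 3, 5)

# Rule 3: after padding, sizes 4 and 5 disagree and neither is 1
try:
    np.broadcast_shapes((5, 4), (5,))
except ValueError as err:
    print("incompatible:", err)
```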

Let's see these rules in action to better understand them.

#### Broadcasting example 1

Let's look at adding a two-dimensional array to a one-dimensional array:

In [None]:
two_dim_array = np.ones((2, 3))
one_dim_array = np.arange(3)

Let's consider an operation on these two arrays. The shapes of the arrays are:

- `two_dim_array.shape = (2, 3)`
- `one_dim_array.shape = (3,)`

We see by rule 1 that the array `one_dim_array` has fewer dimensions, so we pad it on the left with ones:

- `two_dim_array.shape -> (2, 3)`
- `one_dim_array.shape -> (1, 3)`

By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:

- `two_dim_array.shape -> (2, 3)`
- `one_dim_array.shape -> (2, 3)`

The shapes match, and we see that the final shape will be `(2, 3)`:

In [None]:
two_dim_array + one_dim_array

> **Exercise:**

In [None]:
# Flip this around. Try adding these with two_dim_array = np.ones((3, 2)) 
# and one_dim_array = np.arange(3)[:, np.newaxis].
# What do you get?


#### Broadcasting example 2

Let's examine what happens when both arrays need to be broadcast:

In [None]:
vertical_array = np.arange(3).reshape((3, 1))
horizontal_array = np.arange(3)

Again, we'll start by writing out the shape of the arrays:

- `vertical_array.shape = (3, 1)`
- `horizontal_array.shape = (3,)`

Rule 1 says we must pad the shape of `horizontal_array` with ones:

- `vertical_array.shape -> (3, 1)`
- `horizontal_array.shape -> (1, 3)`

And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:

- `vertical_array.shape -> (3, 3)`
- `horizontal_array.shape -> (3, 3)`

Because the resulting shapes match, the arrays are compatible. We can see this here:

In [None]:
vertical_array + horizontal_array

#### Broadcasting example 3

Here's what happens with incompatible arrays:

In [None]:
M = np.ones((3, 2))
i = np.arange(3)

This situation differs only slightly from the first example: the matrix ``M`` is transposed.
How does this affect the calculation? The shapes of the arrays are:

- ``M.shape = (3, 2)``
- ``i.shape = (3,)``

Again, rule 1 tells us that we must pad the shape of ``i`` with ones:

- ``M.shape -> (3, 2)``
- ``i.shape -> (1, 3)``

By rule 2, the first dimension of ``i`` is stretched to match that of ``M``:

- ``M.shape -> (3, 2)``
- ``i.shape -> (3, 3)``

Now we hit Rule 3: the final shapes do not match and the two arrays are incompatible:

In [None]:
M + i
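
One way to make these arrays compatible is to give the one-dimensional array an explicit second axis with `np.newaxis`. The following sketch rebuilds the two arrays under the illustrative names `matrix` and `vector` so that it runs on its own.

```python
import numpy as np

matrix = np.ones((3, 2))
vector = np.arange(3)

try:
    matrix + vector     # shapes (3, 2) and (3,) are incompatible
except ValueError as err:
    print("broadcast failed:", err)

# giving vector the shape (3, 1) lets Rule 2 stretch it across the columns
fixed = matrix + vector[:, np.newaxis]
print(fixed.shape)      # (3, 2)
print(fixed)
```

Padding on the right rather than the left is a deliberate choice here: Rule 1 only ever pads on the left, so any other alignment has to be spelled out by reshaping.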

### Broadcasting in practice
Ufuncs enable you to avoid using slow Python loops; broadcasting builds on that.

A common data practice is to *center* an array of data. For example, if we have an array of 10 observations, each of which consists of three values (called features in this context), we might want to center that data so that we have the differences from the mean rather than the raw data itself. Doing this can help us better compare the different values.

We'll store this in a $10 \times 3$ array:

In [None]:
T = np.random.random((10, 3))
T

Now compute the mean of each feature using the ``mean`` aggregate across the first dimension:

In [None]:
Tmean = T.mean(0)
Tmean

Finally, center ``T`` by subtracting the mean. (This is a broadcasting operation.)

In [None]:
T_centered = T - Tmean
T_centered

This is not just faster, but easier than writing a loop to do this.
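
A quick sanity check is to confirm that the centered data has a mean of approximately zero in every feature. This sketch generates its own data with a seeded `RandomState`; the names `rng`, `obs`, and `centered` are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
obs = rng.random_sample((10, 3))

centered = obs - obs.mean(axis=0)   # (10, 3) minus (3,) broadcasts across rows

# each feature's mean should now be zero, up to floating-point round-off
print(np.allclose(centered.mean(axis=0), 0))
```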

> **Takeaway:** The data you will work with in data science invariably comes in different shapes and sizes (at least in terms of the arrays in which you work with that data). The broadcasting functionality in NumPy enables you to use binary functions on irregularly fitting data in a predictable way.

## Comparisons, masks, and Boolean logic in NumPy

> **Learning goal:** By the end of this subsection, you should be comfortable with and understand how to use Boolean masking in NumPy in order to answer basic questions about your data.

*Masking* is when you want to manipulate or count or extract values in an array based on a criterion. For example, counting all the values in an array greater than a certain value is an example of masking. Boolean masking is often the most efficient way to accomplish these types of tasks in NumPy and it plays a large part in cleaning and otherwise preparing data for analysis (which we will cover in a later section).

### Example: Counting Rainy Days

Let's see masking in practice by examining the monthly rainfall statistics for Seattle. The data is in a CSV file from data.gov. To load the data, we will use pandas, which we will formally introduce later.

In [None]:
import numpy as np
import pandas as pd

# Use pandas to extract rainfall as a NumPy array
rainfall_2003 = pd.read_csv('Data/Observed_Monthly_Rain_Gauge_Accumulations_-_Oct_2002_to_May_2017.csv')['RG01'][2:14].values
rainfall_2003

Let’s break down what we just did in the code cell above. The rainfall data contains monthly rainfall totals from several rain gauges around the city of Seattle; we selected the first one. From that gauge, we then selected the relevant months for the first full calendar year in the dataset, 2003. That range of months started at the third row of the CSV file (index 2; remember, Python zero-indexes!) and ran through the fourteenth row (index 13; the end of a slice is exclusive), hence `[2:14]`.

You now have an array containing 12 values, each of which records the monthly rainfall in inches from January to December 2003.

Commonly in data science, you will want to take a quick first exploratory look at the data. In this case, a bar chart is a good way to do this. To generate this bar chart, we will use Matplotlib, another important data-science tool that we will introduce formally later in the course. (This also brings up another widely used Python convention you should adopt: `import matplotlib.pyplot as plt`.)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.bar(np.arange(1, len(rainfall_2003) + 1), rainfall_2003)

To briefly interpret the code snippet above, we passed two parameters to the `bar` function in pyplot: the first defining the index for the x-axis and the second defining the data to use for the bars (the y-axis). To create the index, we use the NumPy function `arange` to create a sequence of numbers (this is the same `arange` we encountered earlier in this section). We know that the length of our array is 12, but it is a good habit to pass the length of an array programmatically in case it changes or you don’t know it precisely. We also added 1 to both the start and the end of the `arange` to account for Python zero-indexing (because there is no “month zero” in the calendar).

Looking at the chart above (and as residents can attest), Seattle can have lovely, sunny summers. However, this is only a first glimpse of the data. There are still several questions we would like to answer, such as in how many months did it rain, or what was the average precipitation in those months? We would use masking to answer those questions. (We will also return to this example dataset to demonstrate concepts throughout the rest of this section.) Before we dive deeper in explaining what masking is, we should briefly touch on comparison operators in NumPy.

### Comparison operators as ufuncs

In addition to the computational operators as ufuncs that we have already encountered, NumPy also implements comparison operators such as `<` (less than) and `>` (greater than) as element-wise ufuncs. All of the standard Python comparison operations are available:

In [None]:
simple_array = np.array([1, 2, 3, 4, 5])

In [None]:
simple_array < 2  # less than

In [None]:
simple_array >= 4  # greater than or equal

In [None]:
simple_array == 2  # equal

It is also possible to do an element-wise comparison of two arrays, and to include compound expressions:

In [None]:
(2 * simple_array) == (simple_array ** 2)

As with the arithmetic operators, these comparison operators are wrappers for the NumPy ufuncs: when you write ``x < 3``, NumPy actually uses ``np.less(x, 3)``. Here is a summary of the comparison operators and their equivalent ufuncs:

| Operator	    | Equivalent ufunc    || Operator	   | Equivalent ufunc    |
|:--------------|:--------------------||:--------------|:--------------------|
|``==``         |``np.equal``         ||``!=``         |``np.not_equal``     |
|``<``          |``np.less``          ||``<=``         |``np.less_equal``    |
|``>``          |``np.greater``       ||``>=``         |``np.greater_equal`` |
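
As a quick check of the operator-to-ufunc correspondence in the table, the sketch below compares each operator with its ufunc on a small array. The array `x` and the threshold `3` are illustrative values chosen for this sketch, not data from this section.

```python
import numpy as np

# Illustrative array (not part of the rainfall dataset)
x = np.array([1, 2, 3, 4, 5])

# Pair each comparison operator with the ufunc listed in the table
pairs = {
    "==": (x == 3, np.equal(x, 3)),
    "!=": (x != 3, np.not_equal(x, 3)),
    "<": (x < 3, np.less(x, 3)),
    "<=": (x <= 3, np.less_equal(x, 3)),
    ">": (x > 3, np.greater(x, 3)),
    ">=": (x >= 3, np.greater_equal(x, 3)),
}

# Report whether the operator form and the ufunc form agree element by element
for symbol, (from_operator, from_ufunc) in pairs.items():
    print(symbol, np.array_equal(from_operator, from_ufunc))
```

In everyday code the operator form is usually the more readable choice; the ufunc names are mainly useful when you need to pass a comparison around as a function.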

Just like the arithmetic ufuncs, the comparison ufuncs work on arrays of any size and shape.

In [None]:
rand = np.random.RandomState(0)
two_dim_array = rand.randint(10, size=(3, 4))
two_dim_array

In [None]:
two_dim_array < 6

The result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.

## Working with Boolean arrays

Given a Boolean array, there are a host of useful operations you can do.
We'll work with `two_dim_array`, the two-dimensional array we created earlier.

In [None]:
print(two_dim_array)

### Counting entries

To count the number of ``True`` entries in a Boolean array, ``np.count_nonzero`` is useful:

In [None]:
# how many values less than 6?
np.count_nonzero(two_dim_array < 6)

We see that there are eight array entries that are less than 6.
Another way to get at this information is to use ``np.sum``; in this case, ``False`` is interpreted as ``0``, and ``True`` is interpreted as ``1``:

In [None]:
np.sum(two_dim_array < 6)

The benefit of `np.sum()` is that, as with other NumPy aggregation functions, the summation can also be done along rows or columns:

In [None]:
# how many values less than 5 in each row?
np.sum(two_dim_array < 5, axis=1)

This counts the number of values less than 5 in each row of the matrix.

If we're interested in quickly checking whether any or all the values are true, we can use (you guessed it) ``np.any`` or ``np.all``:

In [None]:
# Are there any values less than zero?
np.any(two_dim_array < 0)

> **Exercise:**

In [None]:
# Now check whether all values are less than 10.
# Hint: use np.all()


``np.all`` and ``np.any`` can be used along particular axes as well. For example:

In [None]:
# are all values in each row less than 7?
np.all(two_dim_array < 7, axis=1)

Here, all the elements in the first row are less than 7, while the second and third rows each contain at least one value that is 7 or greater.

**A reminder:** Python has built-in `sum()`, `any()`, and `all()` functions. These have a different syntax than the NumPy versions, and, in particular, will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using `np.sum()`, `np.any()`, and `np.all()` for these examples.
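
To make the reminder concrete, here is a minimal sketch of how the built-in functions and the NumPy functions differ on a two-dimensional array. The array `grid` is an illustrative stand-in, not one of the arrays used elsewhere in this section.

```python
import numpy as np

# Illustrative 2x3 array (values chosen for this sketch)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Built-in sum() iterates over the first axis and adds the rows together,
# so it returns an array of column totals rather than a single number
builtin_total = sum(grid)

# np.sum() adds up every element by default
numpy_total = np.sum(grid)

print(builtin_total)
print(numpy_total)

# Built-in any() fails on a 2-D Boolean array because the truth value of
# each row (itself an array) is ambiguous
try:
    any(grid > 5)
except ValueError as err:
    print("built-in any() raised:", err)
```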

### Boolean operators

We've already seen how we might count, say, all months with rain less than four inches, or all months with more than two inches of rain. But what if we want to know about all months with rain greater than half an inch and less than one inch? This is accomplished through Python's *bitwise logic operators*, `&`, `|`, `^`, and `~`. As with the standard arithmetic operators, NumPy overloads these as ufuncs that work element-wise on (usually Boolean) arrays.

For example, we can address this sort of compound question as follows:

In [None]:
np.sum((rainfall_2003 > 0.5) & (rainfall_2003 < 1))

So we see that there are two months with rainfall between 0.5 and 1.0 inches.
Note that the parentheses here are important—because of operator-precedence rules, with parentheses removed, this expression would be evaluated as follows, which results in an error:

In [None]:
rainfall_2003 > (0.5 & rainfall_2003) < 1

Using the equivalence of *A AND B* and *NOT (NOT A OR NOT B)* (which you might remember if you've taken an introductory logic course), we can compute the same result in a different manner:

In [None]:
np.sum(~((rainfall_2003 <= 0.5) | (rainfall_2003 >= 1)))

Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations.

The following table summarizes the bitwise Boolean operators and their equivalent ufuncs:

| Operator	    | Equivalent ufunc    || Operator	   | Equivalent ufunc    |
|:--------------|:--------------------||:--------------|:--------------------|
|``&``          |``np.bitwise_and``   ||&#124;         |``np.bitwise_or``    |
|``^``          |``np.bitwise_xor``   ||``~``          |``np.bitwise_not``   |
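
The following sketch checks that the operator form, the ufunc form, and the De Morgan form of a compound condition all agree. The `rainfall` values are made up for illustration and are not the Seattle data used in this section.

```python
import numpy as np

# Made-up monthly values in inches (NOT the Seattle dataset)
rainfall = np.array([0.0, 0.3, 0.7, 0.9, 1.5, 2.0])

# Operator form and ufunc form of the same compound condition
with_operator = (rainfall > 0.5) & (rainfall < 1)
with_ufunc = np.bitwise_and(rainfall > 0.5, rainfall < 1)

# De Morgan form: NOT (NOT A OR NOT B)
de_morgan = ~((rainfall <= 0.5) | (rainfall >= 1))

print(with_operator)
print(np.array_equal(with_operator, with_ufunc))
print(np.array_equal(with_operator, de_morgan))

# The keyword `and` asks for the truth value of a whole array, which is
# ambiguous for multi-element arrays, so it raises an error here
try:
    (rainfall > 0.5) and (rainfall < 1)
except ValueError as err:
    print("`and` raised:", err)
```

The last few lines are a side note: the keywords `and` and `or` are not interchangeable with `&` and `|` when working with arrays.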

Using these tools, you can start to answer the types of questions we listed above about the Seattle rainfall data. Here are some examples of results we can compute when combining masking with aggregations:

In [None]:
print("Number of months without rain:", np.sum(rainfall_2003 == 0))
print("Number of months with rain:   ", np.sum(rainfall_2003 != 0))
print("Months with more than 1 inch: ", np.sum(rainfall_2003 > 1))
print("Rainy months with < 1 inch:   ", np.sum((rainfall_2003 > 0) &
                                              (rainfall_2003 < 1)))

## Boolean arrays as masks

In the prior section, we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to `two_dim_array` from before, suppose we want an array of all values in it that are less than 5:

In [None]:
two_dim_array

You can obtain a Boolean array for this condition easily:

In [None]:
two_dim_array < 5

Now, to *select* these values from the array, you can simply index on this Boolean array. This is the *masking* operation:

In [None]:
two_dim_array[two_dim_array < 5]

What is returned is a one-dimensional array filled with all the values that meet your condition. Put another way, these are all the values in positions at which the mask array is ``True``.

You can use masking as a way to compute some relevant statistics on the Seattle rain data:

In [None]:
# Construct a mask of all rainy months
rainy = (rainfall_2003 > 0)

# Construct a mask of all summer months (June through September)
months = np.arange(1, 13)
summer = (months > 5) & (months < 10)

print("Median precip in rainy months in 2003 (inches):   ", 
      np.median(rainfall_2003[rainy]))
print("Median precip in summer months in 2003 (inches):  ", 
      np.median(rainfall_2003[summer]))
print("Maximum precip in summer months in 2003 (inches): ", 
      np.max(rainfall_2003[summer]))
print("Median precip in non-summer rainy months (inches):", 
      np.median(rainfall_2003[rainy & ~summer]))

> **Takeaway:** By combining Boolean operations, masking operations, and aggregates, you can quickly answer questions about any dataset similar to those we posed about the Seattle rainfall data. Operations like these will form the basis for the data exploration and preparation for analysis that will be our primary concerns in the following sections.

# Section 4: Pandas

Having explored NumPy in a previous section, it is time to get to know the other workhorse of data science in Python: pandas. The pandas library in Python really does a lot to make working with data--and importing, cleaning, and organizing it--so much easier that it is hard to imagine doing data science in Python without it.

But it was not always this way. Wes McKinney developed the library out of necessity in 2008 while at AQR Capital Management in order to have a better tool for dealing with data analysis. The library has since taken off as an open-source software project that has become a mature and integral part of the data science ecosystem. (In fact, some examples in this section will be drawn from McKinney's book, *Python for Data Analysis*.)

The name 'pandas' actually has nothing to do with Chinese bears but rather comes from the term *panel data*, a form of multi-dimensional data involving measurements over time that comes out of the econometrics and statistics community. Ironically, the dedicated panel data structure is not generally used today (recent versions of pandas have removed it), and we will not examine it in this course. Instead, we will focus on the two most widely used data structures in pandas: `Series` and `DataFrame`.

## Reminders about importing and documentation

Just as you imported NumPy under the alias ``np``, we will import pandas under the alias ``pd``:

In [1]:
import pandas as pd

As with the NumPy convention, `pd` is an important and widely used convention in the data science world; we will use it here and we advise you to use it in your own coding.

As we progress through this section, don't forget that IPython provides a tab-completion feature and function documentation with the ``?`` character. If you don't understand something about a function you see in this section, take a moment and read the documentation; it can help a great deal. As a reminder, to display the built-in pandas documentation, use this code:

```ipython
In [4]: pd?
```

Because it can be useful to think of `Series` and `DataFrame`s in pandas as extensions of `ndarray`s in NumPy, go ahead and import NumPy as well; you will want it for some of the examples later on:

In [2]:
import numpy as np

Now, on to pandas!

## Fundamental pandas data structures

Both `Series` and `DataFrame`s are a lot like the `ndarray`s you encountered in the last section. They provide clean, efficient data storage and handling at the scales necessary for data science. What both of them provide that `ndarray`s lack, however, are essential data-science features like flexibility when dealing with missing data and the ability to label data. These capabilities (along with others) help make `Series` and `DataFrame`s essential to the "data munging" that makes up so much of data science.

### `Series` objects in pandas

A pandas `Series` is a lot like an `ndarray` in NumPy: a one-dimensional array of indexed data.
You can create a simple Series from an array of data like this:

In [3]:
series_example = pd.Series([-0.5, 0.75, 1.0, -2])
series_example

0   -0.50
1    0.75
2    1.00
3   -2.00
dtype: float64

Similar to an `ndarray`, a `Series` upcasts entries to be of the same type of data (that `-2` integer in the original array became a `-2.00` float in the `Series`).

What is different from an `ndarray` is that the ``Series`` automatically wraps both a sequence of values and a sequence of indices. These are two separate objects within the `Series` object that you can access with the ``values`` and ``index`` attributes.

Try accessing the ``values`` first; they are just a familiar NumPy array:

In [5]:
series_example.values

array([-0.5 ,  0.75,  1.  , -2.  ])

The ``index`` is also an array-like object:

In [6]:
series_example.index

RangeIndex(start=0, stop=4, step=1)

Just as with `ndarray`s, you can access specific data elements in a `Series` via the familiar Python square-bracket index notation and slicing:

In [7]:
series_example[1]

0.75

In [8]:
series_example[1:3]

1    0.75
2    1.00
dtype: float64

Despite a lot of similarities, pandas `Series` have an important distinction from NumPy `ndarray`s: whereas `ndarray`s have *implicitly defined* integer indices (as do Python lists), pandas `Series` have *explicitly defined* indices. The best part is that you can set the index:

In [9]:
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2

a   -0.50
b    0.75
c    1.00
d   -2.00
dtype: float64

These explicit indices work exactly the way you would expect them to:

In [11]:
series_example2['b']

0.75

In [None]:
# Do explicit Series indices work *exactly* the way you might expect?
# Try slicing series_example2 using its explicit index and find out.


With explicit indices in the mix, a `Series` is basically a fixed-length, ordered dictionary in that it maps arbitrarily typed index values to arbitrarily typed data values. But as with `ndarray`s, these data are all of the same type, which is important. Just as the type-specific compiled code behind `ndarray`s makes them more efficient than Python lists for certain operations, the type information of pandas ``Series`` makes them much more efficient than Python dictionaries for certain operations.

But the connection between `Series` and dictionaries is nevertheless very real: you can construct a ``Series`` object directly from a Python dictionary:

In [18]:
population_dict = {'France': 65429495,
                   'Germany': 82408706,
                   'Russia': 143910127,
                   'Japan': 126922333}
population = pd.Series(population_dict)
population

France      65429495
Germany     82408706
Japan      126922333
Russia     143910127
dtype: int64

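Did you see what happened there? The keys `Russia` and `Japan` switched places between the order in which they were entered in `population_dict` and how they ended up in the `population` `Series` object: in the output above, pandas sorted the keys. (Recent versions of Python and pandas keep dictionary keys in insertion order instead, so your output may list `Russia` before `Japan`.) Either way, the keys of a `Series` have a definite order.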
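
If you want the keys in sorted order regardless of how pandas stored them, one option is to sort explicitly. The sketch below reuses the same dictionary and calls `sort_index()`; the name `population_sorted` is introduced here only for illustration.

```python
import pandas as pd

# Same dictionary as above, in its original key order
population_dict = {'France': 65429495,
                   'Germany': 82408706,
                   'Russia': 143910127,
                   'Japan': 126922333}

population = pd.Series(population_dict)

# Key order as pandas stored it (this can differ between pandas versions)
print(list(population.index))

# sort_index() returns a new Series ordered by its index labels
population_sorted = population.sort_index()
print(list(population_sorted.index))
```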

So, at one level, you can interact with `Series` as you would with dictionaries:

In [14]:
population['Russia']

143910127

But you can also do powerful array-like operations with `Series` like slicing:

In [None]:
# Try slicing on the population Series on your own.
# Would slicing be possible if Series keys were not ordered?


You can also add elements to a `Series` the way that you would to a dictionary, by assigning a value to a new key. Try it in the code cell below:

In [None]:
# Try running population['Albania'] = 2937590 (or another country of your choice)
# What order do the keys appear in when you run population? Is it what you expected?


Another useful `Series` feature (and definitely a difference from dictionaries) is that `Series` automatically aligns differently indexed data in arithmetic operations:

In [20]:
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})
population + pop2

Albania     2988122.0
France     65531816.0
Germany           NaN
Japan             NaN
Russia            NaN
Spain             NaN
dtype: float64

Notice that for Germany, Japan, Russia, and Spain (and Albania, depending on what you did in the previous exercise), the addition produced `NaN` (not a number) values. pandas does not treat missing values as `0` but as `NaN`, and any arithmetic operation involving `NaN` also yields `NaN` (that is, `NaN + x = NaN`).
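
If you would rather treat a label that is missing from one side as `0`, the `add()` method accepts a `fill_value` argument. The sketch below uses two small, made-up `Series` (`first` and `second`) rather than the population data.

```python
import pandas as pd

# Two small Series with partly overlapping labels (illustrative values)
first = pd.Series({'France': 10, 'Germany': 20})
second = pd.Series({'France': 1, 'Spain': 5})

# The + operator aligns on labels; labels present on only one side become NaN
plain_sum = first + second
print(plain_sum)

# add() with fill_value=0 treats a label missing from one side as 0
filled = first.add(second, fill_value=0)
print(filled)
```

Whether filling with `0` is appropriate depends on the data; for populations, a missing country is usually not the same thing as a population of zero.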

### `DataFrame` objects in pandas

The other crucial data structure in pandas to get to know for data science is the `DataFrame`.
Like the ``Series`` object, ``DataFrame``s can be thought of either as generalizations of `ndarray`s or as specializations of Python dictionaries.

Just as a ``Series`` is like a one-dimensional array with flexible indices, a ``DataFrame`` is like a two-dimensional array with both flexible row indices and flexible column names. Essentially, a `DataFrame` represents a rectangular table of data and contains an ordered collection of labeled columns, each of which can be a different value type (`string`, `int`, `float`, etc.).
The DataFrame has both a row and column index; in this way you can think of it as a dictionary of `Series`, all of which share the same index.

Let's take a look at how this works in practice. We will start by creating a `Series` called `area`:

In [22]:
area_dict = {'Albania': 28748,
             'France': 643801,
             'Germany': 357386,
             'Japan': 377972,
             'Russia': 17125200}
area = pd.Series(area_dict)
area

Albania       28748
France       643801
Germany      357386
Japan        377972
Russia     17125200
dtype: int64

Now you can combine this with the `population` `Series` you created earlier by using a dictionary to construct a single two-dimensional table containing data from both `Series`:

In [57]:
countries = pd.DataFrame({'Population': population, 'Area': area})
countries

Unnamed: 0,Area,Population
Albania,28748,2937590
France,643801,65429495
Germany,357386,82408706
Japan,377972,126922333
Russia,17125200,143910127


As with `Series`, note that `DataFrame`s also automatically order indices (in this case, the column indices `Area` and `Population`).

So far we have combined dictionaries together to compose a `DataFrame` (which has given our `DataFrame` a row-centric feel), but you can also create `DataFrame`s in a column-wise fashion. Consider adding a `Capital` column using our reliable old array-analog, a list:

In [58]:
countries['Capital'] = ['Tirana', 'Paris', 'Berlin', 'Tokyo', 'Moscow']
countries

Unnamed: 0,Area,Population,Capital
Albania,28748,2937590,Tirana
France,643801,65429495,Paris
Germany,357386,82408706,Berlin
Japan,377972,126922333,Tokyo
Russia,17125200,143910127,Moscow


As with `Series`, even though initial indices are ordered in `DataFrame`s, subsequent additions to a `DataFrame` stay in the order added. However, you can explicitly change the order of `DataFrame` column indices this way:

In [59]:
countries = countries[['Capital', 'Area', 'Population']]
countries

Unnamed: 0,Capital,Area,Population
Albania,Tirana,28748,2937590
France,Paris,643801,65429495
Germany,Berlin,357386,82408706
Japan,Tokyo,377972,126922333
Russia,Moscow,17125200,143910127


Commonly in a data science context, it is necessary to generate new columns of data from existing data sets. Because `DataFrame` columns behave like `Series`, you can do this by performing operations on them as you would with `Series`:

In [60]:
countries['Population Density'] = countries['Population'] / countries['Area']
countries

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Capital,Area,Population,Population Density
Albania,Tirana,28748,2937590,102.184152
France,Paris,643801,65429495,101.629999
Germany,Berlin,357386,82408706,230.587393
Japan,Tokyo,377972,126922333,335.798242
Russia,Moscow,17125200,143910127,8.403413


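Note: don't worry if you see a warning here. It is a `SettingWithCopyWarning` from pandas (not from IPython), and in this case it is being a little too cautious: because `countries` was rebuilt by selecting columns in the previous cell, pandas cannot tell whether you meant to modify a copy. The new column you created is an actual part of the `DataFrame` and not a copy of a slice.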
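
One common way to avoid this kind of warning is to make the column selection an explicit copy before adding to it. The sketch below assumes that is acceptable for your data and uses a small stand-in table; `table` and `reordered` are hypothetical names, not objects from this section.

```python
import pandas as pd

# Small stand-in table (illustrative values, not the countries data)
table = pd.DataFrame({'Area': [10, 20], 'Population': [100, 400]},
                     index=['A', 'B'])

# Selecting columns and calling .copy() gives an independent DataFrame,
# so later column assignments are unambiguous
reordered = table[['Population', 'Area']].copy()
reordered['Population Density'] = reordered['Population'] / reordered['Area']

print(reordered)
```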

We have stated before that `DataFrame`s are like dictionaries, and it's true. You can retrieve the contents of a column just as you would the value for a specific key in an ordinary dictionary:

In [40]:
countries['Area']

Albania       28748
France       643801
Germany      357386
Japan        377972
Russia     17125200
Name: Area, dtype: int64

What about using the row indices?

In [None]:
# Now try accessing row data with a command like countries['Japan']


This returns an error: `DataFrame`s are dictionaries of `Series`, which are the columns. `DataFrame` rows often have heterogeneous data types, so different methods are necessary to access row data. For that, we use the `.loc` indexer:

In [41]:
countries.loc['Japan']

Capital                   Tokyo
Area                     377972
Population            126922333
Population Density      335.798
Name: Japan, dtype: object

Note that what `.loc` returns is an indexed object in its own right and you can access elements within it using familiar index syntax:

In [42]:
countries.loc['Japan']['Area']

377972

In [None]:
# Can you think of a way to return the area of Japan without using .loc?
# Hint: Try putting the column index first.
# Can you slice along these indices as well?


Sometimes it is helpful in data science projects to add a column to a `DataFrame` without assigning values to it:

In [61]:
countries['Debt-to-GDP Ratio'] = np.nan
countries

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Capital,Area,Population,Population Density,Debt-to-GDP Ratio
Albania,Tirana,28748,2937590,102.184152,
France,Paris,643801,65429495,101.629999,
Germany,Berlin,357386,82408706,230.587393,
Japan,Tokyo,377972,126922333,335.798242,
Russia,Moscow,17125200,143910127,8.403413,


Again, you can disregard the warning (if it triggers) about adding the column this way.

You can also add columns to a `DataFrame` that do not have the same number of rows as the `DataFrame`:

In [62]:
debt = pd.Series([0.19, 2.36], index=['Russia', 'Japan'])
countries['Debt-to-GDP Ratio'] = debt
countries

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,Capital,Area,Population,Population Density,Debt-to-GDP Ratio
Albania,Tirana,28748,2937590,102.184152,
France,Paris,643801,65429495,101.629999,
Germany,Berlin,357386,82408706,230.587393,
Japan,Tokyo,377972,126922333,335.798242,2.36
Russia,Moscow,17125200,143910127,8.403413,0.19


You can use the `del` command to delete a column from a `DataFrame`:

In [63]:
del countries['Capital']
countries

Unnamed: 0,Area,Population,Population Density,Debt-to-GDP Ratio
Albania,28748,2937590,102.184152,
France,643801,65429495,101.629999,
Germany,357386,82408706,230.587393,
Japan,377972,126922333,335.798242,2.36
Russia,17125200,143910127,8.403413,0.19


In addition to their dictionary-like behavior, `DataFrame`s also behave like two-dimensional arrays. For example, it can be useful at times when working with a `DataFrame` to transpose it:

In [64]:
countries.T

Unnamed: 0,Albania,France,Germany,Japan,Russia
Area,28748.0,643801.0,357386.0,377972.0,17125200.0
Population,2937590.0,65429500.0,82408710.0,126922300.0,143910100.0
Population Density,102.1842,101.63,230.5874,335.7982,8.403413
Debt-to-GDP Ratio,,,,2.36,0.19


Again, note that `DataFrame` columns are `Series` and thus the data types within each column must be consistent, hence the upcasting to floating-point numbers. **If there had been strings in this `DataFrame`, every column would have been upcast to the generic `object` type.** Use caution when transposing `DataFrame`s.
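
To see the effect of transposing mixed data, the sketch below builds a small frame with one text column and one numeric column and compares the column types before and after. The frame `mixed` is illustrative and only borrows two rows from the countries example.

```python
import pandas as pd

# Illustrative frame with one text column and one numeric column
mixed = pd.DataFrame({'Capital': ['Tokyo', 'Paris'],
                      'Area': [377972, 643801]},
                     index=['Japan', 'France'])

print(mixed.dtypes)    # one dtype per column before transposing
print(mixed.T.dtypes)  # dtypes after transposing

# The numbers are still there, but each transposed column now mixes text
# and numbers, so it can no longer use a numeric dtype
print(mixed.T.loc['Area', 'Japan'])
```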

#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [27]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.865257,0.213169
b,0.442759,0.108267
c,0.04711,0.905718
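
For comparison, here is a minimal sketch of what happens when the column and index names are left out. It uses a small array of known values (`np.arange(6)`) instead of random numbers so the result is easy to inspect; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative 3x2 array of known values
data = np.arange(6).reshape(3, 2)

# No index or columns given: pandas falls back to integer labels
default_labels = pd.DataFrame(data)
print(default_labels)

# Same data with explicit labels
labeled = pd.DataFrame(data, columns=['foo', 'bar'], index=['a', 'b', 'c'])
print(labeled)
```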


## Manipulating data in pandas

A huge part of data science is manipulating data in order to analyze it. (One rule of thumb is that 80% of any data science project will be concerned with cleaning and organizing the data for the project.) So it makes sense to learn the tools that pandas provides for handling data in `Series` and especially `DataFrame`s. Because both of those data structures are ordered, let's first start by taking a closer look at what gives them their structure: the `Index`.

### Index objects in pandas

Both ``Series`` and ``DataFrame``s in pandas have explicit indices that enable you to reference and modify data in them. These indices are actually objects themselves. The ``Index`` object can be thought of either as an immutable array or as a fixed-size set.

It's worth the time to get to know the properties of the `Index` object. Let's return to an example from earlier in the section to examine these properties.

In [66]:
series_example = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
ind = series_example.index
ind

Index(['a', 'b', 'c', 'd'], dtype='object')

The ``Index`` works a lot like an array. We have already seen how to use standard Python indexing notation to retrieve values or slices:

In [67]:
ind[1]

'b'

In [68]:
ind[::2]

Index(['a', 'c'], dtype='object')

But ``Index`` objects are immutable; they cannot be modified via the normal means:

In [69]:
ind[1] = 0

TypeError: Index does not support mutable operations

This immutability is a good thing: it makes it safer to share indices between multiple ``Series`` or ``DataFrame``s without the potential for problems arising from inadvertent index modification.

In addition to being array-like, an `Index` also behaves like a fixed-size set and follows many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way. Let's play around with this to see it in action.

In [35]:
ind_odd = pd.Index([1, 3, 5, 7, 9])
ind_prime = pd.Index([2, 3, 5, 7, 11])

In the code cell below, try out the intersection (`ind_odd.intersection(ind_prime)`), union (`ind_odd.union(ind_prime)`), and symmetric difference (`ind_odd.symmetric_difference(ind_prime)`) of `ind_odd` and `ind_prime`. (Older versions of pandas also accepted the operators `&`, `|`, and `^` for these set operations, but recent versions treat those operators as element-wise logical operations on `Index` objects, so the methods are the safer choice.)

Below is a table listing some useful `Index` methods and properties.

| **Method**     | **Description**                                                                           |
|:---------------|:------------------------------------------------------------------------------------------|
| [`append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.append.html)       | Concatenate with additional `Index` objects, producing a new `Index`                      |
| [`difference`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.difference.html)   | Compute set difference as an `Index`                                                      |
| [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.drop.html)         | Compute new `Index` by deleting passed values                                             |
| [`insert`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.insert.html)       | Compute new `Index` by inserting element at index `i`                                     |
| [`is_monotonic_increasing`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_increasing.html) | Returns `True` if each element is greater than or equal to the previous element           |
| [`is_unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_unique.html)    | Returns `True` if the `Index` has no duplicate values                                     |
| [`isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.isin.html)         | Compute Boolean array indicating whether each value is contained in the passed collection |
| [`unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.unique.html)       | Compute the unique values in order of appearance                                          |
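
As a short sketch of the set-like behavior described above, the following code applies the method forms of the set operations to the two `Index` objects defined earlier. The result names (`common`, `either`, `only_one`, `odd_not_prime`) are introduced here for illustration.

```python
import pandas as pd

ind_odd = pd.Index([1, 3, 5, 7, 9])
ind_prime = pd.Index([2, 3, 5, 7, 11])

# Set-style combinations expressed as Index methods
common = ind_odd.intersection(ind_prime)             # in both
either = ind_odd.union(ind_prime)                    # in at least one
only_one = ind_odd.symmetric_difference(ind_prime)   # in exactly one
odd_not_prime = ind_odd.difference(ind_prime)        # in ind_odd only

for name, result in [("intersection", common), ("union", either),
                     ("symmetric difference", only_one),
                     ("difference", odd_not_prime)]:
    print(name, list(result))
```

Each call returns a new `Index`, which is consistent with `Index` objects being immutable.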

### Data Selection in Series

As a refresher, a ``Series`` object acts in many ways like both a one-dimensional `ndarray` and a standard Python dictionary.

Like a dictionary, the ``Series`` object provides a mapping from a collection of arbitrary keys to a collection of arbitrary values. Back to an old example:

In [72]:
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2

a   -0.50
b    0.75
c    1.00
d   -2.00
dtype: float64

In [74]:
series_example2['b']

0.75

You can also examine the keys/indices and values using dictionary-like Python tools:

In [75]:
'a' in series_example2

True

In [76]:
series_example2.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [77]:
list(series_example2.items())

[('a', -0.5), ('b', 0.75), ('c', 1.0), ('d', -2.0)]

Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [78]:
series_example2['e'] = 1.25
series_example2

a   -0.50
b    0.75
c    1.00
d   -2.00
e    1.25
dtype: float64

#### Series as one-dimensional array

Because ``Series`` also provide array-style functionality, you can use the NumPy techniques we looked at in a previous section, like slices, masking, and fancy indexing:

In [79]:
# Slicing using the explicit index
series_example2['a':'c']

a   -0.50
b    0.75
c    1.00
dtype: float64

In [80]:
# Slicing using the implicit integer index
series_example2[0:2]

a   -0.50
b    0.75
dtype: float64

In [82]:
# Masking
series_example2[(series_example2 > -1) & (series_example2 < 0.8)]

a   -0.50
b    0.75
dtype: float64

In [84]:
# Fancy indexing
series_example2[['a', 'e']]

a   -0.50
e    1.25
dtype: float64

One note to avoid confusion: when slicing with an explicit index (e.g., ``series_example2['a':'c']``), the final index is **included** in the slice; when slicing with an implicit index (e.g., ``series_example2[0:2]``), the final index is **excluded** from the slice.

#### Indexers: `loc` and `iloc`

A great thing about pandas is that you can use a lot of different things for your explicit indices. A potentially confusing thing about pandas is that you can use a lot of different things for your explicit indices, including integers. To avoid confusion between integer indices that you might supply and those implicit integer indices that pandas generates, pandas provides special *indexer* attributes that explicitly expose certain indexing schemes.

(A technical note: These are not functional methods; they are attributes that expose a particular slicing interface to the data in the ``Series``.)

The ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [86]:
series_example2.loc['a']

-0.5

In [88]:
series_example2.loc['a':'c']

a   -0.50
b    0.75
c    1.00
dtype: float64

The ``iloc`` attribute enables indexing and slicing using the implicit, Python-style index:

In [90]:
series_example2.iloc[0]

-0.5

In [91]:
series_example2.iloc[0:2]

a   -0.50
b    0.75
dtype: float64
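
The difference between the two indexers matters most when the explicit index itself consists of integers. The sketch below uses a small, made-up `Series` named `data` whose labels are 1, 3, and 5, so that labels and positions never coincide.

```python
import pandas as pd

# Illustrative Series whose explicit labels are the integers 1, 3, 5
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit labels
print(data.loc[1])     # the element labeled 1
print(data.loc[1:3])   # labels 1 through 3, end label included

# iloc always uses the implicit positions 0, 1, 2
print(data.iloc[1])    # the element at position 1
print(data.iloc[1:3])  # positions 1 and 2, end position excluded
```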

A guiding principle of the Python language is the idea that "explicit is better than implicit." Professional code will generally use explicit indexing with ``loc`` and ``iloc`` and you should as well in order to make your code clean and readable.

### Data selection in DataFrames

``DataFrame``s also exhibit dual behavior, acting both like a two-dimensional `ndarray` and like a dictionary of ``Series`` sharing the same index.

#### DataFrame as dictionary of Series

Let's return to our earlier example of countries' areas and populations in order to examine `DataFrame`s as a dictionary of `Series`.

In [92]:
area = pd.Series({'Albania': 28748,
                  'France': 643801,
                  'Germany': 357386,
                  'Japan': 377972,
                  'Russia': 17125200})
population = pd.Series({'Albania': 2937590,
                        'France': 65429495,
                        'Germany': 82408706,
                        'Russia': 143910127,
                        'Japan': 126922333})
countries = pd.DataFrame({'Area': area, 'Population': population})
countries

Unnamed: 0,Area,Population
Albania,28748,2937590
France,643801,65429495
Germany,357386,82408706
Japan,377972,126922333
Russia,17125200,143910127


You can access the individual ``Series`` that make up the columns of a ``DataFrame`` via dictionary-style indexing of the column name:

In [94]:
countries['Area']

Albania       28748
France       643801
Germany      357386
Japan        377972
Russia     17125200
Name: Area, dtype: int64

Dictionary-style syntax can also be used to modify `DataFrame`s, such as by adding a new column:

In [95]:
countries['Population Density'] = countries['Population'] / countries['Area']
countries

Unnamed: 0,Area,Population,Population Density
Albania,28748,2937590,102.184152
France,643801,65429495,101.629999
Germany,357386,82408706,230.587393
Japan,377972,126922333,335.798242
Russia,17125200,143910127,8.403413


#### DataFrame as two-dimensional array

You can also think of ``DataFrame``s as two-dimensional arrays. You can examine the raw data in the `DataFrame`/data array using the ``values`` attribute:

In [96]:
countries.values

array([[  2.87480000e+04,   2.93759000e+06,   1.02184152e+02],
       [  6.43801000e+05,   6.54294950e+07,   1.01629999e+02],
       [  3.57386000e+05,   8.24087060e+07,   2.30587393e+02],
       [  3.77972000e+05,   1.26922333e+08,   3.35798242e+02],
       [  1.71252000e+07,   1.43910127e+08,   8.40341292e+00]])

Viewed this way, it makes sense that we can transpose the rows and columns of a `DataFrame` the same way we would an array:

In [97]:
countries.T

Unnamed: 0,Albania,France,Germany,Japan,Russia
Area,28748.0,643801.0,357386.0,377972.0,17125200.0
Population,2937590.0,65429500.0,82408710.0,126922300.0,143910100.0
Population Density,102.1842,101.63,230.5874,335.7982,8.403413


`DataFrame`s also use the ``loc`` and ``iloc`` indexers. With ``iloc``, you can index the underlying array as if it were an `ndarray` but with the ``DataFrame`` index and column labels maintained in the result:

In [98]:
countries.iloc[:3, :2]

Unnamed: 0,Area,Population
Albania,28748,2937590
France,643801,65429495
Germany,357386,82408706


``loc`` also permits array-like slicing but using the explicit index and column names:

In [100]:
countries.loc[:'Germany', :'Population']

Unnamed: 0,Area,Population
Albania,28748,2937590
France,643801,65429495
Germany,357386,82408706


You can also use array-like techniques such as masking and fancy indexing with `loc`.

In [101]:
# Can you think of how to combine masking and fancy indexing in one line?
# Your masking could be something like countries['Population Density'] > 200
# Your fancy indexing could be something like ['Population', 'Population Density']
# Be sure to put the masking and fancy indexing inside the square brackets: countries.loc[]


#### Indexing conventions

In practice in the world of data science (and pandas more generally), *indexing* refers to columns while *slicing* refers to rows:

In [102]:
countries['France':'Japan']

Unnamed: 0,Area,Population,Population Density
France,643801,65429495,101.629999
Germany,357386,82408706,230.587393
Japan,377972,126922333,335.798242


Such slices can also refer to rows by number rather than by index:

In [103]:
countries[1:3]

Unnamed: 0,Area,Population,Population Density
France,643801,65429495,101.629999
Germany,357386,82408706,230.587393


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [104]:
countries[countries['Population Density'] > 200]

Unnamed: 0,Area,Population,Population Density
Germany,357386,82408706,230.587393
Japan,377972,126922333,335.798242


These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.
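
The row-wise conventions above can also be written with the explicit indexers. The following is a minimal sketch, assuming a cut-down copy of the `countries` data rebuilt here so the block runs on its own; the names `by_label`, `by_position`, and `large` are only illustrative:

```python
import pandas as pd

# Small stand-in for the `countries` DataFrame used above
countries = pd.DataFrame(
    {'Area': [28748, 643801, 357386, 377972, 17125200],
     'Population': [2937590, 65429495, 82408706, 126922333, 143910127]},
    index=['Albania', 'France', 'Germany', 'Japan', 'Russia'])

# Label-based row slice: the end label is included
by_label = countries.loc['France':'Japan']

# Position-based row slice: the end position is excluded
by_position = countries.iloc[1:3]

# Boolean mask applied to the rows through loc
large = countries.loc[countries['Area'] > 400000]

print(by_label.index.tolist())     # ['France', 'Germany', 'Japan']
print(by_position.index.tolist())  # ['France', 'Germany']
print(large.index.tolist())        # ['France', 'Russia']
```

Spelling out `loc` or `iloc` makes it clear to a reader whether a slice is meant by label or by position.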

# Operating on Data in Pandas

As you begin to work in data science, operating on data is imperative. It is the very heart of data science. Another aspect of pandas that makes it a compelling tool for many data scientists is its capability to perform efficient element-wise operations on data. pandas builds on ufuncs from NumPy to supply these capabilities and then extends them to provide additional power for data manipulation:
 - For unary operations (such as negation and trigonometric functions), ufuncs in pandas **preserve index and column labels** in the output.
 - For binary operations (such as addition and multiplication), pandas automatically **aligns indices** when passing objects to  ufuncs.

These critical features of ufuncs in pandas mean that data retains its context when operated on and, more importantly still, they drastically reduce errors when you combine data from multiple sources.

## Index Preservation

pandas is explicitly designed to work with NumPy. As a result, all NumPy ufuncs will work on pandas ``Series`` and ``DataFrame`` objects.

We can see this more clearly if we create a simple ``Series`` and ``DataFrame`` of random numbers on which to operate. 

In [8]:
rng = np.random.RandomState(42)
ser_example = pd.Series(rng.randint(0, 10, 4))
ser_example

0    6
1    3
2    7
3    4
dtype: int32

Did you notice the NumPy function we used with the variable `rng`? By specifying a seed for the random-number generator, you get the same result each time. This can be a useful trick when you need to produce pseudo-random output that also needs to be reproducible by others. (Go ahead and re-run the code cell above a couple of times to convince yourself that it produces the same output each time.)
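
If you would rather check the effect of the seed in code than by re-running the cell, a small sketch along these lines should do it. The names `rng_a` and `rng_b` are only illustrative:

```python
import numpy as np

# Two generators seeded with the same value (42, as in the cell above)
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)

draw_a = rng_a.randint(0, 10, 4)
draw_b = rng_b.randint(0, 10, 4)

# Same seed, same sequence of draws
print(np.array_equal(draw_a, draw_b))  # True
```

Note that a single generator keeps advancing through its sequence, so calling `rng_a.randint(0, 10, 4)` a second time returns the next four values rather than repeating the first four.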

In [9]:
df_example = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df_example

Unnamed: 0,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


Let's apply a ufunc to our example `Series`:

In [11]:
np.exp(ser_example)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

The same thing happens with a slightly more complex operation on our example `DataFrame`:

In [12]:
np.cos(df_example * np.pi / 4)

Unnamed: 0,A,B,C,D
0,-1.83697e-16,0.7071068,6.123234000000001e-17,-1.83697e-16
1,0.7071068,-1.0,-0.7071068,0.7071068
2,0.7071068,6.123234000000001e-17,-0.7071068,-1.0


Note that you can use all of the ufuncs we discussed in a previous section the same way.

## Index alignment

As mentioned above, when you perform a binary operation on two ``Series`` or ``DataFrame`` objects, pandas will align indices in the process of performing the operation. This is essential when working with incomplete data (and data is usually incomplete), but it is helpful to see this in action to better understand it.

### Index alignment with Series

For our first example, suppose we are combining two different data sources and find only the top five countries by *area* and the top five countries by *population*:

In [15]:
area = pd.Series({'Russia': 17075400, 'Canada':  9984670,
                  'USA': 9826675, 'China': 9598094, 
                  'Brazil': 8514877}, name='area')
population = pd.Series({'China': 1409517397, 'India': 1339180127,
                        'USA': 324459463, 'Indonesia': 322179605, 
                        'Brazil': 207652865}, name='population')

In [None]:
# Now divide these to compute the population density


Your resulting array contains the **union** of indices of the two input arrays: seven countries in total. All of the countries in the array without an entry (because they lacked either area data or population data) are marked with the now familiar ``NaN``, or "Not a Number," designation.

Index matching works the same way for any of Python's built-in arithmetic expressions, and missing values are filled in with `NaN`s. You can see this clearly by adding two `Series` that are slightly misaligned in their indices:

In [18]:
series1 = pd.Series([2, 4, 6], index=[0, 1, 2])
series2 = pd.Series([3, 5, 7], index=[1, 2, 3])
series1 + series2

0     NaN
1     7.0
2    11.0
3     NaN
dtype: float64

`NaN` values are not always convenient to work with; `NaN` combined with any other value results in `NaN`, which can be a pain, particularly if you are combining multiple data sources with missing values. To help with this, pandas allows you to specify a default value to use for missing values in the operation. For example, calling `series1.add(series2)` is equivalent to calling `series1 + series2`, but you can supply the fill value:

In [19]:
series1.add(series2, fill_value=0)

0     2.0
1     7.0
2    11.0
3     7.0
dtype: float64

Much better!

### Index alignment with DataFrames

The same kind of alignment takes place in both dimensions (columns and indices) when you perform operations on ``DataFrame``s.

In [20]:
df1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                   columns=list('AB'))
df1

Unnamed: 0,A,B
0,1,11
1,5,1


In [22]:
df2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                   columns=list('BAC'))
df2

Unnamed: 0,B,A,C
0,3,8,2
1,4,2,6
2,4,8,6


In [None]:
# Add df1 and df2. Is the output what you expected?

Even though we passed the columns in a different order in `df2` than in `df1`, the indices were aligned correctly and sorted in the resulting union of columns.

You can also use fill values for missing values with `DataFrame`s. In this example, let's fill the missing values with the mean of all values in `df1` (computed by first stacking the rows of `df1`):

In [24]:
fill = df1.stack().mean()
df1.add(df2, fill_value=fill)

Unnamed: 0,A,B,C
0,9.0,14.0,6.5
1,7.0,5.0,10.5
2,12.5,8.5,10.5


This table lists Python operators and their equivalent pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
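
To make the correspondence in the table concrete, here is a minimal sketch using subtraction. The two `Series` and their values are made up for illustration:

```python
import pandas as pd

# Two small Series with partially overlapping indices (hypothetical values)
left = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
right = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# The operator and the method give the same result ...
via_operator = left - right
via_method = left.sub(right)
print(via_operator.equals(via_method))  # True

# ... but only the method accepts fill_value for labels missing on one side
filled = left.sub(right, fill_value=0)
print(filled)
```

The method form is mainly worth reaching for when you need an extra argument such as `fill_value` or `axis`; otherwise the operator reads more naturally.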


## Operations between DataFrames and Series

Index and column alignment gets maintained in operations between a `DataFrame` and a `Series` as well. To see this, consider a common operation in data science, wherein we find the difference of a `DataFrame` and one of its rows. Because pandas inherits ufuncs from NumPy, pandas will compute the difference row-wise by default:

In [25]:
df3 = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('WXYZ'))
df3

Unnamed: 0,W,X,Y,Z
0,1,3,8,1
1,9,8,9,4
2,1,3,6,7


In [26]:
df3 - df3.iloc[0]

Unnamed: 0,W,X,Y,Z
0,0,0,0,0
1,8,5,1,3
2,0,0,-2,6


But what if you need to operate column-wise? You can do this by using object methods and specifying the ``axis`` keyword.

In [28]:
df3.subtract(df3['X'], axis=0)

Unnamed: 0,W,X,Y,Z
0,-2,0,5,-2
1,1,0,1,-4
2,-2,0,3,4


And when you perform operations between `DataFrame`s and `Series`, you still get automatic index alignment:

In [29]:
halfrow = df3.iloc[0, ::2]
halfrow

W    1
Y    8
Name: 0, dtype: int32

Note that `halfrow` is a `Series` whose index holds the column labels `W` and `Y`. Because of this, pandas can align it with the columns of the `DataFrame` when we subtract it:

In [30]:
df3 - halfrow

Unnamed: 0,W,X,Y,Z
0,0.0,,0.0,
1,8.0,,1.0,
2,0.0,,-2.0,


Remember, pandas preserves and aligns indices and columns to preserve data context. This will be of huge help to you in our next section when we look at data cleaning and preparation.

# Section 5: Manipulating and Cleaning Data

This section marks a subtle change. Up until now, we have been introducing ideas and techniques in order to prepare you with a toolbox of techniques to deal with real-world situations. We are now going to start using some of those tools while also giving you some ideas about how and when to use them in your own work with data.

Real-world data is messy. You will likely need to combine several data sources to get the data you actually want. The data from those sources will be incomplete. And it will likely not be formatted in exactly the way you want in order to perform your analysis. It's for these reasons that most data scientists will tell you that about 80 percent of any project is spent just getting the data into a form ready for analysis.

## Exploring `DataFrame` information

> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.

Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. However, if the data set in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame` in addition to the first few and last few rows.

In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's *Iris* data set used in his 1936 paper "The use of multiple measurements in taxonomic problems":

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

### `DataFrame.info`
Let's take a look at this dataset to see what we have:

In [None]:
iris_df.info()

From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers.
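
Alongside `info`, a few attributes and methods give the same kind of overview in a form you can use in code. This is a minimal sketch on a tiny made-up `DataFrame` (the `measurements` name and its columns are hypothetical, chosen so the block runs without scikit-learn):

```python
import pandas as pd

# A tiny hypothetical DataFrame standing in for a real dataset
measurements = pd.DataFrame({
    'length_cm': [5.1, 4.9, 4.7],
    'width_cm': [3.5, 3.0, 3.2],
    'label': ['a', 'b', 'a'],
})

# (number of rows, number of columns)
print(measurements.shape)  # (3, 3)

# Data type of each column
print(measurements.dtypes)

# Summary statistics for the numeric columns
summary = measurements.describe()
print(summary)
```

The same calls should work unchanged on `iris_df`.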

### `DataFrame.head`
Next, let's see what the first few rows of our `DataFrame` look like:

In [None]:
iris_df.head()

> **Exercise:**

By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?

In [None]:
# Hint: Consult the documentation by using iris_df.head?


### `DataFrame.tail`
The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:

In [None]:
iris_df.tail()

In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.

> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.

## Dealing with missing data

> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.

Most of the time the datasets you want to use (or have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.

Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification, and it is only used to indicate missing floating-point values.

For missing values apart from floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both `None` and `NaN` carry restrictions that you need to be mindful of with regards to how they can be used.

### `None`: non-float missing data
Because `None` comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type `'object'`. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When `None` is in the array, it means you are working with Python objects.

To see this in action, consider the following example array (note the `dtype` for it):

In [None]:
import numpy as np

example1 = np.array([2, None, 6, 8])
example1

The reality of upcast data types carries two side effects with it. First, operations will be carried out at the level of interpreted Python code rather than compiled NumPy code. Essentially, this means that any operations involving `Series` or `DataFrame`s with `None` in them will be slower. While you would probably not notice this performance hit on small datasets, for large datasets it might become an issue.

The second side effect stems from the first. Because `None` essentially drags `Series` or `DataFrame`s back into the world of vanilla Python, using NumPy/pandas aggregations like `sum()` or `min()` on arrays that contain a ``None`` value will generally produce an error:

In [None]:
example1.sum()

**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.
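
To see both side effects in one place, here is a small sketch that inspects the `dtype` and catches the error instead of letting it stop the notebook. The variable names are only illustrative:

```python
import numpy as np

example = np.array([2, None, 6, 8])

# None forces the array to hold generic Python objects
print(example.dtype)  # object

sum_failed = False
try:
    example.sum()
except TypeError as err:
    # int + None is not defined in Python, so the aggregation fails
    sum_failed = True
    print('TypeError:', err)
```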

### `NaN`: missing float values

In contrast to `None`, NumPy (and therefore pandas) supports `NaN` for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on `NaN` always results in `NaN`. For example:

In [None]:
np.nan + 1

In [None]:
np.nan * 0

The good news: aggregations run on arrays with `NaN` in them don't pop errors. The bad news: the results are not uniformly useful:

In [None]:
example2 = np.array([2, np.nan, 6, 8]) 
example2.sum(), example2.min(), example2.max()
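
When you want an aggregate that skips the missing entries instead, NumPy provides NaN-aware counterparts such as `np.nansum`, `np.nanmin`, and `np.nanmax`. A minimal sketch, using a fresh array named `values` for illustration:

```python
import numpy as np

values = np.array([2, np.nan, 6, 8])

# Ordinary aggregations propagate the NaN
print(values.sum(), values.min(), values.max())

# NaN-aware counterparts skip the missing entry
print(np.nansum(values), np.nanmin(values), np.nanmax(values))  # 16.0 2.0 8.0
```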

> **Exercise:**

In [None]:
# What happens if you add np.nan and None together?


Remember: `NaN` is just for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans.

### `NaN` and `None`: null values in pandas

Even though `NaN` and `None` can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a `Series` of integers:

In [None]:
int_series = pd.Series([1, 2, 3], dtype=int)
int_series

> **Exercise:**

In [None]:
# Now set an element of int_series equal to None.
# How does that element show up in the Series?
# What is the dtype of the Series?


In the process of upcasting data types to establish data homogeneity in `Series` and `DataFrame`s, pandas will willingly switch missing values between `None` and `NaN`. Because of this design feature, it can be helpful to think of `None` and `NaN` as two different flavors of "null" in pandas. Indeed, some of the core methods you will use to deal with missing values in pandas reflect this idea in their names:

- `isnull()`: Generates a Boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Returns a filtered version of the data
- `fillna()`: Returns a copy of the data with missing values filled or imputed

These are important methods to master and get comfortable with, so let's go over them each in some depth.

### Detecting null values
Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data.

In [None]:
example3 = pd.Series([0, np.nan, '', None])

In [None]:
example3.isnull()

Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in the first section to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.

Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks  directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values.

> **Exercise:**

In [None]:
# Try running example3[example3.notnull()].
# Before you do so, what do you expect to see?


**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
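
One common use of these masks on a `DataFrame` is counting how many values are missing in each column. A minimal sketch, assuming a small made-up table named `readings`:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing entries
readings = pd.DataFrame({
    'temp': [21.5, np.nan, 19.0],
    'humidity': [np.nan, np.nan, 0.4],
})

# Boolean mask with the same shape, index, and columns as the data
mask = readings.isnull()
print(mask)

# True counts as 1, so summing the mask counts the nulls in each column
null_counts = mask.sum()
print(null_counts)
```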

### Dropping null values

Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:

In [None]:
example3 = example3.dropna()
example3

Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.

Because `DataFrame`s have two dimensions, they afford more options for dropping data.

In [None]:
example4 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example4

(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)

You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science, columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:

In [None]:
example4.dropna()

If necessary, you can drop NA values from columns. Use `axis=1` (or, equivalently, `axis='columns'`) to do so:

In [None]:
example4.dropna(axis='columns')

Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those settings in `dropna` with the `how` and `thresh` parameters.

By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action.

In [None]:
example4[3] = np.nan
example4

> **Exercise:**

In [None]:
# How might you go about dropping just column 3?
# Hint: remember that you will need to supply both the axis parameter and the how parameter.


The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:

In [None]:
example4.dropna(axis='rows', thresh=3)

Here, the first and last row have been dropped, because they contain only two non-null values.

### Filling null values

Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice.

In [None]:
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example5

You can fill all of the null entries with a single value, such as `0`:

In [None]:
example5.fillna(0)

> **Exercise:**

In [None]:
# What happens if you try to fill null values with a string, like ''?


You can **forward-fill** null values, which is to use the last valid value to fill a null:

In [None]:
example5.fillna(method='ffill')

You can also **back-fill** to propagate the next valid value backward to fill a null:

In [None]:
example5.fillna(method='bfill')
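
Recent pandas releases deprecate the `method` argument of `fillna` in favor of the dedicated `ffill()` and `bfill()` methods. If your installed version emits a warning for the cells above, the following sketch shows the equivalent calls on a fresh copy of the same data (named `example` here for illustration):

```python
import numpy as np
import pandas as pd

example = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Forward-fill: each null takes the last valid value before it
forward = example.ffill()

# Back-fill: each null takes the next valid value after it
backward = example.bfill()

print(forward.tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0]
print(backward.tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0]
```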

As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values:

In [None]:
example4

In [None]:
example4.fillna(method='ffill', axis=1)

Notice that when a previous value is not available for forward-filling, the null value remains.

> **Exercise:**

In [None]:
# What output does example4.fillna(method='bfill', axis=1) produce?
# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?
# Can you think of a longer code snippet to write that can fill all of the null values in example4?


You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values in each column with the average of the values in that column:

In [None]:
example4.fillna(example4.mean())

Notice that column 3 is still valueless: every value in it is null, so its mean is `NaN` and there is nothing to fill with.
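
There are two different "mean" fills you might want here: one mean per column, or a single mean over the whole table. The sketch below contrasts them, assuming a fresh copy of the original three-column data (named `frame` for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 7],
                      [2,      5,      8],
                      [np.nan, 6,      9]])

# frame.mean() is a Series of per-column means, so each column
# is filled with its own mean
by_column = frame.fillna(frame.mean())

# A single scalar fills every null with the same value; here it is
# the mean of all non-null values, found by stacking the columns first
overall_mean = frame.stack().mean()
by_overall = frame.fillna(overall_mean)

print(by_column)
print(by_overall)
```

Which of the two is appropriate depends on whether the columns measure comparable quantities; a per-column mean is usually the safer default when they do not.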

> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.

## Removing duplicate data

> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.

In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.

### Identifying duplicates: `duplicated`

You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.

In [None]:
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example6

In [None]:
example6.duplicated()

### Dropping duplicates: `drop_duplicates`
`drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:

In [None]:
example6.drop_duplicates()

Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your `DataFrame`:

In [None]:
example6.drop_duplicates(['letters'])
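
Both methods also take a `keep` parameter that controls which occurrence counts as the original. A minimal sketch, rebuilding the same data under the illustrative name `records`:

```python
import pandas as pd

records = pd.DataFrame({'letters': ['A', 'B', 'A', 'B', 'B'],
                        'numbers': [1, 2, 1, 3, 3]})

# Mark duplicates based on the 'letters' column only
letter_dupes = records.duplicated(subset=['letters'])
print(letter_dupes.tolist())  # [False, False, True, True, True]

# keep='last' retains the final occurrence of each letter instead of the first
last_kept = records.drop_duplicates(subset=['letters'], keep='last')
print(last_kept)
```

With `keep='last'`, the surviving rows are the ones at index 2 and 4 rather than 0 and 1.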

> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you spurious results!

## Combining datasets: merge and join

> **Learning goal:** By the end of this subsection, you should have a general knowledge of the various ways to combine `DataFrame`s.

Your most interesting analyses will often come from data melded together from more than one source. Because of this, pandas provides several methods of merging and joining datasets to make this necessary job easier:
 - **`pandas.merge`** connects rows in `DataFrame`s based on one or more keys.
 - **`pandas.concat`** concatenates or “stacks” together objects along an axis.
 - The **`combine_first`** instance method enables you to splice together overlapping data to fill in missing values in one object with values from another.
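
Before turning to `merge`, here is a brief sketch of the other two tools in the list. All of the data and names (`q1`, `q2`, `primary`, `backup`) are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two quarterly tables with the same columns
q1 = pd.DataFrame({'employee': ['Gary', 'Stu'], 'sales': [10, 12]})
q2 = pd.DataFrame({'employee': ['Mary', 'Sue'], 'sales': [9, 15]})

# concat stacks the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([q1, q2], ignore_index=True)
print(stacked)

# combine_first fills the nulls in one object with values from another,
# matching on the index
primary = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
backup = pd.Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd'])
patched = primary.combine_first(backup)
print(patched)
```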

Let's examine merging data first, because it will be the most familiar to course attendees who are already familiar with SQL or other relational databases.

### Categories of joins

`merge` carries out several types of joins: *one-to-one*, *many-to-one*, and *many-to-many*. You use the same basic function call to implement all of them and we will examine all three (because you will need all three at some point in your data delving depending on the data). We will start with one-to-one joins because they are generally the simplest example.

#### One-to-one joins

Consider combining two `DataFrame`s that contain different information on the same employees in a company:

In [None]:
df1 = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
df1

In [None]:
df2 = pd.DataFrame({'employee': ['Mary', 'Stu', 'Gary', 'Sue'],
                    'hire_date': [2008, 2012, 2017, 2018]})
df2

Combine this information into a single `DataFrame` using the `merge` function:

In [None]:
df3 = pd.merge(df1, df2)
df3

Pandas joined on the `employee` column because it was the only column common to both `df1` and `df2`. (Note also that the original indices of `df1` and `df2` were discarded by `merge`; this is generally the case with merges unless you conduct them by index, which we will discuss later on.)

#### Many-to-one joins

A many-to-one join is like a one-to-one join except that one of the two key columns contains duplicate entries. The `DataFrame` resulting from such a join will preserve those duplicate entries as appropriate:

In [None]:
df4 = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
                    'supervisor': ['Carlos', 'Giada', 'Stephanie']})
df4

In [None]:
pd.merge(df3, df4)

The resulting `DataFrame` has an additional column for `supervisor`; that column has an extra occurrence of 'Giada' that did not occur in `df4` because more than one employee in the merged `DataFrame` works in the 'Marketing' group.

Note that we didn’t specify which column to join on. When you don't specify that information, `merge` uses the overlapping column names as the keys. However, that can be ambiguous; several columns might meet that condition. For that reason, it is a good practice to explicitly specify on which key to join. You can do this with the `on` parameter:

In [None]:
pd.merge(df3, df4, on='group')

#### Many-to-many joins
What happens if the key columns in both of the `DataFrame`s you are joining contain duplicates? That gives you a many-to-many join:

In [None]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Marketing', 'Marketing', 'HR', 'HR'],
                    'core_skills': ['math', 'spreadsheets', 'writing', 'communication',
                               'spreadsheets', 'organization']})
df5

In [None]:
pd.merge(df1, df5, on='group')

Again, in order to avoid ambiguity as to which column to join on, it is a good idea to explicitly tell `merge` which one to use with the `on` parameter.
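
If you want pandas to check that a join really is of the category you expect, `merge` accepts a `validate` argument. The sketch below rebuilds the employee and supervisor tables under the illustrative names `employees` and `supervisors` so it runs on its own:

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
                          'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
                            'supervisor': ['Carlos', 'Giada', 'Stephanie']})

# Many employees per group, one supervisor per group: this check passes
merged = pd.merge(employees, supervisors, on='group', validate='many_to_one')
print(merged)

# Claiming the join is one-to-one is wrong, because 'Marketing'
# appears twice on the left, so pandas raises MergeError
try:
    pd.merge(employees, supervisors, on='group', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print('MergeError:', err)
```

A check like this can catch unexpected duplicate keys before they silently multiply rows in the result.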

#### `left_on` and `right_on` keywords
What if you need to merge two datasets with no shared column names? For example, what if you are using a dataset in which the employee name is labeled as 'name' rather than 'employee'? In such cases, you will need to use the `left_on` and `right_on` keywords in order to specify the column names on which to join:

In [None]:
df6 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
df6

In [None]:
pd.merge(df1, df6, left_on="employee", right_on="name")

> **Exercise:**

In [None]:
# Using the documentation, can you figure out how to use .drop() to get rid of the 'name' column?
# Hint: You will need to supply two parameters to .drop()


#### `left_index` and `right_index` keywords

Sometimes it can be more advantageous to merge on an index rather than on a column. The `left_index` and `right_index` keywords make it possible to join by index. Let's revisit some of our earlier example `DataFrame`s to see what this looks like in action.

In [None]:
df1a = df1.set_index('employee')
df1a

In [None]:
df2a = df2.set_index('employee')
df2a

To merge on the index, specify the `left_index` and `right_index` parameters in `merge`:

In [None]:
pd.merge(df1a, df2a, left_index=True, right_index=True)

> **Exercise:**

In [None]:
# What happens if you specify only left_index or right_index?


You can also use the `join` method for `DataFrame`s, which produces the same effect but merges on indices by default:

In [None]:
df1a.join(df2a)

You can also mix and match `left_index`/`right_index` with `right_on`/`left_on`:

In [None]:
pd.merge(df1a, df6, left_index=True, right_on='name')

#### Set arithmetic for joins

Let's return to many-to-many joins for a moment. A consideration that is unique to them is the *arithmetic* of the join, specifically the set arithmetic we use for the join. To illustrate what we mean by this, let's restructure an old example `DataFrame`:

In [None]:
df5 = pd.DataFrame({'group': ['Engineering', 'Marketing', 'Sales'],
                    'core_skills': ['math', 'writing', 'communication']})
df5

In [None]:
pd.merge(df1, df5, on='group')

Notice that after we have restructured `df5` and then re-run the merge with `df1`, we have only two entries in the result. This is because we merged on `group` and 'Marketing' was the only entry that appeared in the `group` column of both `DataFrame`s.

In effect, what we have gotten is the *intersection* of both `DataFrame`s. This is known as the inner join in the database world and it is the default setting for `merge`, although we can certainly specify it:

In [None]:
pd.merge(df1, df5, on='group', how='inner')

The complement of the inner join is the outer join, which returns the *union* of the two `DataFrame`s.

> **Exercise:**

In [None]:
# The keyword for performing an outer join is how='outer'. How would you perform it?
# What do you expect the output of an outer join of df1 and df5 to be?


Notice in your resulting `DataFrame` that not every row in `df1` and `df5` had a value that corresponds to the union of the key values (the 'group' column). Pandas fills in these missing values with `NaN`s.

Inner and outer joins are not your only options. A *left join* returns all of the rows in the first (left-side) `DataFrame` supplied to `merge`, along with rows from the other `DataFrame` that match the left-side key values (left-side rows with no match get `NaN` in the right-side columns):

In [None]:
pd.merge(df1, df5, how='left')

> **Exercise:**

In [None]:
# Now run the right merge between df1 and df5.
# What do you expect to see?


#### `suffixes` keyword: dealing with conflicting column names
Because you can join datasets, you will eventually join two with conflicting column names. Let's look at another example to see what we mean:

In [None]:
df7 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df7

In [None]:
df8 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'rank': [3, 1, 4, 2]})
df8

In [None]:
pd.merge(df7, df8, on='name')

Column names in a `DataFrame` should be unique so that each column can be selected unambiguously. In cases where two joined `DataFrame`s share column names (aside from the column serving as the key), the `merge` function automatically appends the suffix `_x` or `_y` to the conflicting column names in order to make them unique. In cases where it is best to control your column names, you can specify custom suffixes for `merge` to append through the `suffixes` keyword:

In [None]:
pd.merge(df7, df8, on='name', suffixes=['_left', '_right'])

Note that these suffixes are also applied when there are multiple conflicting columns.
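
As a minimal sketch of the multiple-column case, assume two made-up `DataFrame`s (`left` and `right`, not part of the earlier examples) that share the key `name` and two other column names:

```python
import pandas as pd

# Hypothetical frames: 'name' is the key; 'rank' and 'score' conflict.
left = pd.DataFrame({'name': ['Gary', 'Stu'],
                     'rank': [1, 2],
                     'score': [90, 85]})
right = pd.DataFrame({'name': ['Gary', 'Stu'],
                      'rank': [2, 1],
                      'score': [70, 95]})

# Every conflicting non-key column receives the matching suffix.
merged = pd.merge(left, right, on='name', suffixes=('_left', '_right'))
print(merged.columns.tolist())
```

The key column keeps its original name because it is shared rather than duplicated.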

### Concatenation in NumPy
Concatenation in pandas is built off of the concatenation functionality for NumPy arrays. Here is what NumPy concatenation looks like:
 - For one-dimensional arrays:

In [None]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

 - For two-dimensional arrays:

In [None]:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)

Notice that the `axis=1` parameter makes the concatenation occur along columns rather than rows. Concatenation in pandas looks similar to this.
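
To make the effect of `axis` concrete, here is a small sketch that compares the shapes produced by both settings (the variable names `rows` and `cols` are just for illustration):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

rows = np.concatenate([x, x], axis=0)  # stack one array on top of the other (the default)
cols = np.concatenate([x, x], axis=1)  # place the arrays side by side

print(rows.shape)
print(cols.shape)
```

With `axis=0` the result gains rows; with `axis=1` it gains columns.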

### Concatenation in pandas

Pandas has a function, `pd.concat()`, that can be used for simple concatenation of `Series` or `DataFrame` objects in a similar manner to `np.concatenate()` with ndarrays.

In [None]:
ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
pd.concat([ser1, ser2])

It also concatenates higher-dimensional objects, such as ``DataFrame``s:

In [None]:
df9 = pd.DataFrame({'A': ['a', 'c'],
                    'B': ['b', 'd']})
df9

In [None]:
pd.concat([df9, df9])

Notice that `pd.concat` has preserved the indexing even though that means that it has been duplicated. You can have the results re-indexed (and avoid potential confusion down the road) like so:

In [None]:
pd.concat([df9, df9], ignore_index=True)

By default, `pd.concat` concatenates row-wise within the `DataFrame` (that is, `axis=0` by default). You can specify the axis along which to concatenate:

In [None]:
pd.concat([df9, df9], axis=1)

Note that the resulting `DataFrame` has duplicate column names. Pandas allows this, but selecting a column by name (such as `'A'`) will then return every column with that name, so it is usually best to keep column names in a `DataFrame` unique.

### Concatenation with joins
Just as you did with merge above, you can use inner and outer joins when concatenating `DataFrame`s with different sets of column names.

In [None]:
df10 = pd.DataFrame({'A': ['a', 'd'],
                     'B': ['b', 'e'],
                     'C': ['c', 'f']})
df10

In [None]:
df11 = pd.DataFrame({'B': ['u', 'x'],
                     'C': ['v', 'y'],
                     'D': ['w', 'z']})
df11

In [None]:
pd.concat([df10, df11])

As we saw earlier, the default join for this is an outer join and entries for which no data is available are filled with `NaN` values. You can also do an inner join:

In [None]:
pd.concat([df10, df11], join='inner')

Another option is to directly specify the remaining columns. Older versions of pandas offered a `join_axes` argument to `pd.concat` for this, but it was removed in pandas 1.0; the equivalent is to reindex the result. Here, we will specify that the returned columns should be the same as those of the first input (`df10`):

In [None]:
pd.concat([df10, df11]).reindex(columns=df10.columns)

#### `append()`

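Because direct array concatenation is so common, older versions of pandas gave ``Series`` and ``DataFrame`` objects an ``append`` method that accomplishes the same thing in fewer keystrokes. For example, rather than calling ``pd.concat([df9, df9])``, you could simply call ``df9.append(df9)``. Note that ``append`` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions you should use ``pd.concat`` instead:

In [None]:
# Runs on pandas versions before 2.0 only; on 2.0 and later use pd.concat([df9, df9]).
df9.append(df9)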

**Important point**: Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas does not modify the original object. It instead creates a new object with the combined data.
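
The difference can be checked with a short sketch. It uses `pd.concat`, which likewise returns a new object, on a made-up `Series` (`ser`), so it does not depend on which pandas version you have installed:

```python
import pandas as pd

items = [1, 2]
items.append(3)          # list.append changes the list in place
print(items)

ser = pd.Series(['a', 'b'])
combined = pd.concat([ser, ser], ignore_index=True)  # returns a new Series

# The original Series is untouched; only the new object holds the combined data.
print(len(ser), len(combined))
```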

> **Takeaway:** A large part of the value you can provide as a data scientist comes from connecting multiple, often disparate datasets to find new insights. Learning how to join and merge data is thus an essential part of your skill set.

## Exploratory statistics and visualization

> **Learning goal:** By the end of this subsection, you should be familiar with some of the ways to visually explore the data stored in `DataFrame`s.

Often when probing a new data set, it is invaluable to get high-level information about what the dataset holds. Earlier in this section we discussed using methods such as `DataFrame.info`, `DataFrame.head`, and `DataFrame.tail` to examine some aspects of a `DataFrame`. While these methods are critical, they are on their own often insufficient to get enough information to know how to approach a new dataset. This is where exploratory statistics and visualizations for datasets come in.

To see what we mean in terms of gaining exploratory insight (both visually and numerically), let's dig into one of the datasets that come with the scikit-learn library, the Boston Housing Dataset (though you will load it from a CSV file):

In [None]:
df = pd.read_csv('Data/housing_dataset.csv')
df.head()

This dataset contains information collected from the U.S. Census Bureau concerning housing in the area of Boston, Massachusetts and was first published in 1978. The dataset has 13 columns:
 - **CRIM**:     Per-capita crime rate by town
 - **ZN**:       Proportion of residential land zoned for lots over 25,000 square feet
 - **INDUS**:    Proportion of non-retail business acres per town
 - **CHAS**:     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
 - **NOX**:      Nitric oxides concentration (parts per 10 million)
 - **RM**:       Average number of rooms per dwelling
 - **AGE**:      Proportion of owner-occupied units built prior to 1940
 - **DIS**:      Weighted distances to five Boston employment centres
 - **RAD**:      Index of accessibility to radial highways
 - **TAX**:      Full-value property-tax rate per \$10,000
 - **PTRATIO**:  Pupil-teacher ratio by town
 - **LSTAT**:    Percent of lower-status portion of the population
 - **MEDV**:     Median value of owner-occupied homes in \$1,000s

One of the first methods we can use to better understand this dataset is `DataFrame.shape`:

In [None]:
df.shape

The dataset has 506 rows and 13 columns.

To get a better idea of the contents of each column we can use `DataFrame.describe`, which returns the count, mean, standard deviation, minimum value, and maximum value of the numeric values in each column, in addition to the quartiles for each column:

In [None]:
df.describe()

Because datasets can have so many columns, it is often useful to transpose the results of `DataFrame.describe` (with the `.T` attribute) to make them easier to read.
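
As a minimal sketch of that transposition, assume a small made-up `DataFrame` (`demo`) standing in for the housing data:

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame with numeric columns.
demo = pd.DataFrame({'rooms': [5.0, 6.0, 7.0],
                     'price': [20.0, 25.0, 33.0]})

# After transposing, each original column is a row and each statistic is a column.
summary = demo.describe().T
print(summary)
```

The same call on the housing data would be `df.describe().T`.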

Note that you can also examine specific descriptive statistics for columns without having to invoke `DataFrame.describe`:

In [None]:
df['MEDV'].mean()

In [None]:
df['MEDV'].max()

In [None]:
df['AGE'].median()

> **Exercise:**

In [None]:
# Now find the maximum value in df['AGE'].


Other information that you will often want to see is the relationship between different columns. You do this with the `DataFrame.groupby` method. For example, you could examine the average MEDV (median value of owner-occupied homes) for each value of AGE (proportion of owner-occupied units built prior to 1940):

In [None]:
df.groupby(['AGE'])['MEDV'].mean()

> **Exercise:**

In [None]:
# Now try to find the median value for AGE for each value of MEDV.


You can also apply a lambda function to each element of a `DataFrame` column by using the `apply` method. For example, say you wanted to create a new column that flagged a row if more than 50 percent of owner-occupied homes were built before 1940:

In [None]:
df['AGE_50'] = df['AGE'].apply(lambda x: x>50)

Once applied, you can also see how many values returned true and how many returned false by using the `value_counts` method:

In [None]:
df['AGE_50'].value_counts()
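
For a simple comparison like this one, a vectorized expression gives the same flags without a lambda. The sketch below uses a made-up `Series` of AGE-like values (`ages`) to compare the two approaches:

```python
import pandas as pd

ages = pd.Series([12.5, 50.0, 65.2, 98.1])  # hypothetical AGE values

flag_apply = ages.apply(lambda x: x > 50)   # evaluated element by element
flag_vector = ages > 50                     # one vectorized comparison

print((flag_apply == flag_vector).all())
print(flag_vector.value_counts())
```

`apply` remains the more general tool when the per-element logic cannot be written as a single vectorized expression.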

You can also examine figures from the groupby statement you created earlier:

In [None]:
df.groupby(['AGE_50'])['MEDV'].mean()

You can also group by more than one variable, such as AGE_50 (the one you just created), CHAS (whether a town is on the Charles River), and RAD (an index measuring access to the Boston-area radial highways), and then evaluate each group for the average median home price in that group:

In [None]:
groupby_twovar=df.groupby(['AGE_50','RAD','CHAS'])['MEDV'].mean()

You can then see what values are in this stacked group of variables:

In [None]:
groupby_twovar

Let's take a moment to analyze these results in a little depth. The first row reports that communities with less than half of their houses built before 1940, with a highway-access index of 1, and that are not situated on the Charles River have a mean house price of \$24,667 (1970s dollars); the next row shows that communities similar to those in the first row, except for being located on the Charles River, have a mean house price of \$50,000.

One insight that pops out from continuing down this is that, all else being equal, being located next to the Charles River can significantly increase the value of newer housing stock. The story is more ambiguous for communities dominated by older houses: proximity to the Charles significantly increases home prices in one community (and that one presumably farther away from the city); for all others, being situated on the river either provided a modest increase in value or actually decreased mean home prices.

While groupings like this can be a great way to begin to interrogate your data, you might not care for the "tall" format it comes in. In that case, you can unstack the data into a "wide" format:

In [None]:
groupby_twovar.unstack()
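
To see what `unstack` does on a result small enough to read at a glance, here is a sketch with a made-up `DataFrame` (`demo`) and two grouping columns:

```python
import pandas as pd

demo = pd.DataFrame({'old': [False, False, True, True],
                     'river': [0, 1, 0, 1],
                     'value': [20.0, 30.0, 15.0, 25.0]})

tall = demo.groupby(['old', 'river'])['value'].mean()  # Series with a two-level index
wide = tall.unstack()                                  # innermost level ('river') becomes columns

print(tall)
print(wide)
```

By default `unstack` pivots the innermost index level, which is why the last variable in the `groupby` list ends up as the column headers.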

> **Exercise:**

In [None]:
# How could you use groupby to get a sense of the proportion 
# of residential land zoned for lots over 25,000 sq.ft., 
# the proportion of non-retail business acres per town, 
# and the distance of towns from employment centers in Boston?


It is also often valuable to know how many unique values a column has, which you can find with the `nunique` method:

In [None]:
df['CHAS'].nunique()

Complementary to that, you will also likely want to know what those unique values are, which is where the `unique` method helps:

In [None]:
df['CHAS'].unique()

You can use the `value_counts` method to see how many of each unique value there are in a column:

In [None]:
df['CHAS'].value_counts()

Or you can easily plot a bar graph to visually see the breakdown:

In [None]:
%matplotlib inline
df['CHAS'].value_counts().plot(kind='bar')

Note that the IPython magic command `%matplotlib inline` enables you to view the chart inline.

Let's pull back to the dataset as a whole for a moment. Two major things that you will look for in almost any dataset are trends and relationships. A typical relationship between variables to explore is the Pearson correlation, or the extent to which two variables are linearly related. The `corr` method will show this in table format for all of the columns in a `DataFrame`:

In [None]:
df.corr(method='pearson')
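
For reference, the Pearson correlation of two columns is their covariance divided by the product of their standard deviations. The sketch below computes it by hand on made-up data (`demo`) and compares the result with pandas:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                     'y': [2.1, 3.9, 6.2, 8.1, 9.8]})

# Deviations of each value from its column mean.
dx = demo['x'] - demo['x'].mean()
dy = demo['y'] - demo['y'].mean()

# Pearson r: sum of co-deviations over the square root of the product of squared deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = demo['x'].corr(demo['y'])   # Pearson is the default method
print(r_manual, r_pandas)
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.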

Suppose you just wanted to look at the correlations between all of the columns and just one variable? Let's examine the correlation between all other variables and the percentage of owner-occupied houses built before 1940 (AGE). We will do this by selecting the AGE column of the correlation table and sorting it:

In [None]:
corr = df.corr(method='pearson')
corr_with_age = corr['AGE']
corr_with_age.sort_values(ascending=False)

With the correlations arranged in descending order, it's easy to start to see some patterns. Correlating AGE with a variable we created from AGE is a trivial correlation. However, it is interesting to note that the percentage of older housing stock in communities strongly correlates with air pollution (NOX) and the proportion of non-retail business acres per town (INDUS); at least in 1978 metro Boston, older towns are more industrial.

Graphically, we can see the correlations using a heatmap from the Seaborn library:

In [None]:
import seaborn as sns
sns.heatmap(df.corr(),cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))

Histograms are another valuable tool for investigating your data. For example, what is the overall distribution of prices of owner-occupied houses in the Boston area?

In [None]:
import matplotlib.pyplot as plt
plt.hist(df['MEDV'])

The default number of bins for the matplotlib histogram is small (10 bins, so each bar covers a wide range of values), which might mask smaller details. To get a finer-grained view of the MEDV column, you can manually increase the number of bins in the histogram:

In [None]:
plt.hist(df['MEDV'],bins=50)
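
To see how the bin count changes the bin width without drawing anything, here is a sketch that uses `np.histogram` on synthetic values (not the housing data) to perform the same kind of binning:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=22.0, scale=9.0, size=500)  # synthetic, not the housing data

counts_10, edges_10 = np.histogram(values, bins=10)
counts_50, edges_50 = np.histogram(values, bins=50)

print(edges_10[1] - edges_10[0])  # width of each of the 10 bins
print(edges_50[1] - edges_50[0])  # width of each of the 50 bins
```

Both calls cover the same range of values, so five times as many bins means each bin is one-fifth as wide.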

Seaborn has a somewhat more attractive version of the standard matplotlib histogram: the distribution plot. This is a combination histogram and kernel density estimate (KDE) plot (essentially a smoothed histogram):

In [None]:
sns.distplot(df['MEDV'])

Another commonly used plot is the Seaborn jointplot, which combines histograms for two columns along with a scatterplot:

In [None]:
sns.jointplot(x=df['RM'], y=df['MEDV'], kind='scatter')

Unfortunately, many of the dots print over each other. You can help address this by adding some alpha blending, a figure that sets the transparency for the dots so that concentrations of them drawing over one another will be apparent:

In [None]:
sns.jointplot(x=df['RM'], y=df['MEDV'], kind='scatter', alpha=0.3)

Another way to see patterns in your data is with a two-dimensional KDE plot. Darker colors here represent a higher concentration of data points:

In [None]:
# Seaborn 0.11 and later; older versions used sns.kdeplot(df['RM'], df['MEDV'], shade=True).
sns.kdeplot(x=df['RM'], y=df['MEDV'], fill=True)

Note that while the KDE plot is very good at showing concentrations of data points, finer structures like linear relationships (such as the clear relationship between the number of rooms in homes and the house price) are lost in the KDE plot.

Finally, the pairplot in Seaborn allows you to see scatterplots and histograms for several columns in one table. Here we have played with some of the keywords to produce a more sophisticated and easier to read pairplot that incorporates both alpha blending and linear regression lines for the scatterplots.

In [None]:
sns.pairplot(df[['RM', 'AGE', 'LSTAT', 'DIS', 'MEDV']], kind="reg", plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})

Visualization is the start of the really cool, fun part of data science. So play around with these visualization tools and see what you can learn from the data!

> **Takeaway:** An old joke goes: “What does a data scientist see when they look at a dataset? A bunch of numbers.” There is more than a little truth in that joke. Visualization is often the key to finding patterns and correlations in your data. While visualization often cannot deliver precise results, it can point you in the right direction to ask better questions and efficiently find value in the data.

# Section 6: Introduction to machine learning models 

You have now made it to the section on machine learning (ML). ML and the branch of computer science in which it resides, artificial intelligence (AI), are so central to data science that ML/AI and data science are synonymous in the minds of many people. However, the preceding sections have hopefully demonstrated that there are a lot of other facets to the discipline of data science apart from the prediction and classification tasks that supply so much value to the world. (Remember, at least 80 percent of the effort in most data-science projects will be composed of cleaning and manipulating the data to prepare it for analysis.)

That said, ML is fun! In this section, and the next one on data science in the cloud, you will get to play around with some of the “magic” of data science and start to put into practice the tools you have spent the last five sections learning. Let's get started!

## A quick aside: types of ML

As you get deeper into data science, it might seem like there is a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most ML algorithms fall into three broad categories:
 - **Predictive algorithms**: These analyze current and historical facts to make predictions about unknown events, such as the future or customers’ choices.
 - **Classification algorithms**: These teach a program from a body of data, and the program then uses that learning to classify new observations.
 - **Time-series forecasting algorithms**: While it can be argued that these algorithms are a part of predictive algorithms, their techniques are specialized enough that they in many ways function like a separate category. Time-series forecasting is beyond the scope of this course, but we have more than enough work with focusing here on prediction and classification.

## Prediction: linear regression

> **Learning goal:** By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.

Arguably the simplest form of machine learning is to draw a line connecting two points and make predictions about where that trend might lead.

But what if you have more than two points—and those points don't line up neatly? What if you have points in more than two dimensions? This is where linear regression comes in.

Formally, linear regression is used to predict a quantitative *response* (the values on a Y axis) that is dependent on one or more *predictors* (values on one or more axes that are orthogonal to Y, commonly just thought of collectively as X). The working assumption is that the relationship between predictors and response is more or less linear. The goal of linear regression is to fit a straight line in the best possible way to minimize the deviation between our observed responses in the dataset and the responses predicted by our line, the linear approximation. (The most common means of assessing this error is called the **least squares method**; it consists of minimizing the number you get when you square the difference between your predicted value and the actual value and add up all of those squared differences for your entire dataset.)

<img align="left" style="padding-right:10px;" src="Graphics/Sec6_linear_regression.png">

Statistically, we can represent this relationship between response and predictors as:

$Y = B_0 + B_1X + E$

Remember high school geometry? $B_0$ is the intercept of our line and $B_1$ is its slope. We commonly refer to $B_0$ and $B_1$ as coefficients and to $E$ as the *error term*, which represents the margin of error in the model.
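
For a single predictor, the least squares coefficients have a closed form: $B_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $B_0 = \bar{y} - B_1\bar{x}$. The sketch below applies these formulas to made-up points and checks them against NumPy's `polyfit`; it is an illustration only, not how scikit-learn is implemented internally:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.1, 7.9, 10.2])   # made-up values close to y = 2x

# Closed-form least squares estimates for one predictor.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The quantity being minimized: the sum of squared differences.
residuals = y - (b0 + b1 * x)
print(b0, b1, np.sum(residuals ** 2))

# np.polyfit returns coefficients from the highest degree down: slope first.
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)
```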

Let's try this in practice with actual data. (Note: no graph paper will be harmed in the course of these predictions.)

### Data exploration

We'll begin by importing our usual libraries and using our %matplotlib inline magic command:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

And now for our data. In this case, we’ll use a newer housing dataset than the Boston Housing Dataset we used in the last section (with this one storing data on individual houses across the United States).

In [None]:
df = pd.read_csv('Data/Housing_Dataset_Sample.csv')
df.head()

> **Exercise:**

In [None]:
# Do you remember the DataFrame method for looking at overall information
# about a DataFrame, such as number of columns and rows? Try it here.


Let's also use the `describe` method to look at some of the vital statistics about the columns. Note that in cases like this, in which some of the column names are long, it can be helpful to view the transposition of the summary, like so:

In [None]:
df.describe().T

Let's look at the data in the **Price** column. (You can disregard the deprecation warning if it appears.)

In [None]:
sns.distplot(df['Price'])

As we would hope with this much data, our prices form a nice bell-shaped, normally distributed curve.

Now, let's look at a simple relationship like that between house prices and the average income in a geographic area:

In [None]:
sns.jointplot(x=df['Avg. Area Income'], y=df['Price'])

As we would expect, there is an intuitive, linear relationship between them. Also good: the jointplot shows that the data in both columns is normally distributed, so we don't have to worry about somehow transforming the data for meaningful analysis.

Let's take a quick look at all of the columns:

In [None]:
sns.pairplot(df)

Some observations:
1. Not all of the combinations of columns provide strong linear relationships; some just look like blobs. That's nothing to worry about for our analysis.
2. See the visualizations that look like lanes rather than organic groups? That is the result of the average number of bedrooms in houses being measured in discrete values rather than continuous ones (as no one has 0.3 bedrooms in their house). The number of bedrooms is also the one column whose data is not really normally distributed, though some of this might be distortion caused by the default bin size of the pairplot histogram functionality.

It is now time to make a prediction. 

### Fitting the model

Let's feed everything into a linear model (average area income, average area house age, average area number of rooms, average area number of bedrooms, and area population) and see how well knowing those factors can help us predict the price of a home.

To do this, we will make our first five columns the X (our predictors) and the **Price** column the Y (our response):

In [None]:
X = df.iloc[:,:5]
y = df['Price']

Now, we could use all of our data to create our model. However, all that would get us is a model that is good at predicting itself. Not only would that leave us with no objective way to measure how good the model is, it would also likely lead to a model that was less accurate when used on new data. Such a model is termed *overfitted*.

To avoid this, data scientists divide their datasets for ML into *training* data (the data used to fit the model) and *test* data (data used to evaluate how accurate the model is). Fortunately, scikit-learn provides a function that enables us to easily divide up our data between training and test sets: `train_test_split`. In this case, we will use 70 percent of our data for training and reserve 30 percent of it for testing. (Note that you will also supply a fourth parameter to the function: `random_state`; `train_test_split` randomly divides up our data between test and training, so this number provides an explicit seed for the random-number generator so that you will get the same result each time you run this code snippet.)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)
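
To see the split sizes and the effect of `random_state` on something small, here is a sketch with made-up arrays (`X_demo` and `y_demo` are hypothetical, not the housing data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical rows, 2 predictors
y_demo = np.arange(10)

# Two calls with the same seed should produce the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=54)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=54)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(split_a[1], split_b[1]))
```

With 10 rows and `test_size=0.3`, 3 rows are held out for testing and 7 remain for training.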

All that is left now is to import our linear regression algorithm and fit our model based on our training data:

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

In [None]:
reg.fit(X_train,y_train)

### Evaluating the model

Now, a moment of truth: let's see how our model does making predictions based on the test data:

In [None]:
predictions = reg.predict(X_test)

In [None]:
predictions

Our predictions are just an array of numbers: these are the house prices predicted by our model, one for every row in our test dataset.

Remember how we mentioned that linear models have the mathematical form of $Y = B_0 + B_1X + E$? Let’s look at the actual equation:

In [None]:
print(reg.intercept_,reg.coef_)

In algebraic terms, here is our model:

$Y=-2,646,401+0.21587X_1+0.00002X_2+0.00001X_3+0.00279X_4+0.00002X_5$

where:
 - $Y=$ Price
 - $X_1=$ Average area income
 - $X_2=$ Average area house age
 - $X_3=$ Average area number of rooms
 - $X_4=$ Average area number of bedrooms
 - $X_5=$ Area population

So, just how good is our model? There are many ways to measure the accuracy of ML models. Linear models have a good one: the $R^2$ score (also known as the coefficient of determination). A high $R^2$, close to 1, indicates better prediction with less error.

In [None]:
#Explained variation. A high R2 close to 1 indicates better prediction with less error.
from sklearn.metrics import r2_score

r2_score(y_test,predictions)

The $R^2$ score also indicates how much explanatory power a linear model has. In the case of our model, the five predictors we used explain a little more than 92 percent of the variation in house prices in the test data.
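
For reference, the $R^2$ score is conventionally defined as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the actual values from their mean. A minimal sketch on made-up numbers (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean

r2 = 1 - ss_res / ss_tot
print(r2)
```

A model that always predicted the mean of the actual values would score 0, and a perfect model would score 1.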

We can also plot our errors to get a visual sense of how wrong our predictions were:

In [None]:
#plot errors
sns.distplot(y_test - predictions)

Do you notice the numbers on the left axis? Whereas a histogram shows the number of things that fall into discrete numeric buckets, a kernel density estimation (KDE, and the histogram that accompanies it in the Seaborn distplot) normalizes those numbers to show a density instead. These values are very small here because the area under the KDE has to add up to 1 and the errors are spread over a wide range of dollar values.

Maybe more gratifying, we can plot the predictions from our model:

In [None]:
# Plot outputs
plt.scatter(y_test,predictions, color='blue')

The linear nature of our predicted prices is clear enough, but there are so many of them that it is hard to tell where dots are concentrated. Can you think of a way to refine this visualization to make it clearer, particularly if you were explaining the results to someone?

> **Exercise:**

In [None]:
# Hint: Remember to try the plt.scatter parameter alpha=.
# It takes values between 0 and 1.


> **Takeaway:** In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model’s performance.

## Classification: logistic regression

> **Learning goal:** By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.

We'll now pivot to discussing classification. If our simple analogy of predictive analytics was drawing a line through points and extrapolating from that, then classification can be described in its simplest form as drawing lines around groups of points.

While linear regression is used to predict quantitative responses, *logistic* regression is used for classification problems. Formally, logistic regression predicts the categorical response (Y) based on predictors (Xs). Logistic regression goes by several names, and it is also known in the scholarly literature as logit regression, maximum-entropy classification (MaxEnt), and the log-linear classifier. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a sigmoid (S-curve) function. Sigmoid functions take any value and transform it to be between 0 and 1, which can be used as a probability for a class to be predicted, with the goal of predictors mapping to 1 when something belongs in the class and 0 when it does not.
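
The standard sigmoid (logistic) function is $\sigma(z) = \frac{1}{1 + e^{-z}}$. As a minimal sketch, the code below applies it to a few made-up scores and turns the resulting probabilities into class labels; the 0.5 cutoff is a common default that is assumed here, not something the dataset dictates:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, 0.0, 6.0])        # made-up scores
probabilities = sigmoid(z)

# Assumed 0.5 threshold: probabilities at or above it map to class 1.
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

Large negative scores land near 0, large positive scores land near 1, and a score of 0 maps to exactly 0.5.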

<img align="left" style="padding-right:10px;" src="Graphics/Sec6_logistic_regression.png">

To show this in action, let's do something a little different and try a historical dataset: the fates of the passengers of the RMS Titanic, which is a popular dataset for classification problems in machine learning. In this case, the class we want to predict is whether a passenger survived the doomed liner's sinking.

The dataset has 12 variables:

 - **PassengerId**
 - **Survived:** 0 = No, 1 = Yes
 - **Pclass:** Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
 - **Name**
 - **Sex**
 - **Age**
 - **SibSp:** Number of siblings or spouses aboard the *Titanic*
 - **Parch:** Number of parents or children aboard the *Titanic*
 - **Ticket:** Passenger ticket number
 - **Fare:** Passenger fare
 - **Cabin:** Cabin number
 - **Embarked:** Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

In [None]:
df = pd.read_csv('Data/train_data_titanic.csv')
df.head()

In [None]:
df.info()

One reason that the Titanic data set is a popular classification set is that it provides opportunities to prepare data for analysis. To prepare this dataset for analysis, we need to perform a number of tasks:
 - Remove extraneous variables
 - Check for multicollinearity 
 - Handle missing values

We will touch on each of these steps in turn.

### Remove extraneous variables

The names of individual passengers and their ticket numbers will clearly do nothing to help our model, so we can drop those columns to simplify matters.

In [None]:
df.drop(['Name','Ticket'],axis=1,inplace=True)

There are additional variables that will not add classifying power to our model, but to find them we will need to look for correlation between variables.

### Check for multicollinearity

If one or more of our predictors can themselves be predicted from other predictors, it can produce a state of *multicollinearity* in our model. Multicollinearity is a challenge because it can skew the results of regression models (both linear and logistic) and reduce the predictive or classifying power of a model.

To help combat this problem, we can start to look for some initial patterns. For example, do any correlations between **Survived** and **Fare** jump out?

In [None]:
sns.pairplot(df[['Survived','Fare']], dropna=True)

> **Exercise:**

In [None]:
# Try running sns.pairplot twice more on some other combinations of columns
# and see if any patterns emerge.


We can also use `groupby` to look for patterns. Consider the mean values for the various variables when we group by **Survived**:

In [None]:
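# numeric_only=True skips text columns such as Sex, which newer pandas versions refuse to average.
df.groupby('Survived').mean(numeric_only=True)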

Survivors appear to be slightly younger on average with higher-cost fare.

In [None]:
df.head()

Value counts can also help us get a sense of the data before us, such as numbers for siblings and spouses on the *Titanic*, in addition to the sex split of passengers:

In [None]:
df['SibSp'].value_counts()

In [None]:
df['Parch'].value_counts()

In [None]:
df['Sex'].value_counts()

### Handle missing values

We now need to address missing values. First, let’s look to see which columns have more than half of their values missing:

In [None]:
#missing
df.isnull().sum()>(len(df)/2)

Let's break down the code in the call above just a bit. `df.isnull().sum()` tells pandas to take the sum of all of the missing values for each column. `len(df)/2` is just another way of expressing half the number of rows in the `DataFrame`. Taken together with the `>`, this line of code is looking for any columns with more than half of their entries missing, and there is one: **Cabin**.
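
The same steps can be traced on a made-up `DataFrame` (`demo`) that is small enough to check by eye:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Age': [22.0, np.nan, 30.0, 41.0],
                     'Cabin': [np.nan, 'C85', np.nan, np.nan]})

missing_counts = demo.isnull().sum()              # missing entries per column
mostly_missing = missing_counts > len(demo) / 2   # True where over half are missing

print(missing_counts)
print(mostly_missing)
```

Here `Age` is missing one of four values and `Cabin` is missing three of four, so only `Cabin` is flagged.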

We could try to do something about those missing values. However, if any pattern does emerge in the data that involves **Cabin**, it will be highly cross-correlated with both **Pclass** and **Fare** (as higher-fare, better-class accommodations were grouped together on the *Titanic*). Given that too much cross-correlation can be detrimental to a model, it is probably just better for us to drop **Cabin** from our `DataFrame`:

In [None]:
df.drop('Cabin',axis=1,inplace=True)

Let's now run `info` to see if there are columns with just a few null values.

In [None]:
df.info()

One note on the data: given that 1,503 died in the *Titanic* tragedy (and that we know that some survived), this data set clearly does not include every passenger on the ship (and none of the crew). Also remember that **Survived** is a variable that includes both those who survived and those who perished.

Back to missing values. **Age** is missing several values, as is **Embarked**. Let's see how many values are missing from **Age**:

In [None]:
df['Age'].isnull().value_counts()

As we saw above, **Age** isn't really correlated with **Fare**, so it is a variable that we want to eventually use in our model. That means that we need to do something with those missing values. But before we decide on a strategy, we should check to see if our median age is the same for both sexes.

In [None]:
df.groupby('Sex')['Age'].median().plot(kind='bar')

The median ages are different for men and women sailing on the *Titanic*, which means that we should handle the missing values accordingly. A sound strategy is to replace each missing age with the median age *for that passenger's sex*.

In [None]:
# transform keeps the original row index, so the result lines up with df
df['Age'] = df.groupby('Sex')['Age'].transform(lambda x: x.fillna(x.median()))
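
To check what a per-group median fill does, here is a small sketch on six invented rows. It uses `transform('median')`, which returns one value per original row, so the result can be passed straight to `fillna`; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented data: one missing age per sex
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Median age of each row's own group, aligned with the original index
group_median = toy.groupby('Sex')['Age'].transform('median')
toy['Age'] = toy['Age'].fillna(group_median)

print(toy)
```

The missing male age becomes 30 (the median of 20 and 40) and the missing female age becomes 40 (the median of 30 and 50).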

Any other missing values?

In [None]:
df.isnull().sum()

We are missing two values for **Embarked**. Check to see how that variable breaks down:

In [None]:
df['Embarked'].value_counts()

The vast majority of passengers embarked on the *Titanic* from Southampton, so we will just fill in those two missing values with the most statistically likely value (the mode): Southampton.

In [None]:
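# Assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().idxmax())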
df['Embarked'].value_counts()

In [None]:
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)
df.head()
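
The cell above converts the text columns into 0/1 indicator columns. With `drop_first=True`, pandas drops the first level of each column, so a column with k categories yields k-1 indicators; this avoids adding an indicator that is fully determined by the others. A minimal sketch on three invented rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Invented data covering two sexes and three ports
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

encoded = pd.get_dummies(data=toy, columns=['Sex', 'Embarked'], drop_first=True)

# 'female' and 'C' are the first levels alphabetically, so they are the ones dropped
print(encoded.columns.tolist())
```

A row with zeros in both `Embarked_Q` and `Embarked_S` therefore stands for the dropped level, `C`.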

Let's take a final look at the correlation matrix to see if there is anything else we should remove.

In [None]:
df.corr()

**Pclass** and **Fare** have some amount of correlation, so we can probably get rid of one of them. In addition, we need to remove **Survived** from our predictor `DataFrame`, X, because it will be our response variable, y:

In [None]:
X = df.drop(['Survived','Pclass'],axis=1)
y = df['Survived']

We now need to split the data into training and test sets, which you will do as an exercise:

> **Exercise:**

In [None]:
from sklearn.model_selection import train_test_split
# Refer back to the portion above on linear regression and use train_test_split here.
# Set test_size = 0.3 and random_state = 67 to get the same results as below when
# you run through the rest of the code example.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=67)
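
As a quick illustration of what `test_size` and `random_state` control, here is a sketch on ten made-up rows (the `rows` and `labels` lists are stand-ins, not the *Titanic* data):

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))   # stand-in for ten observations
labels = [0, 1] * 5      # stand-in for their classes

# Same random_state, same shuffle: the two calls return identical splits
first = train_test_split(rows, labels, test_size=0.3, random_state=67)
second = train_test_split(rows, labels, test_size=0.3, random_state=67)

rows_train, rows_test, labels_train, labels_test = first
print(len(rows_train), len(rows_test))
print(first == second)
```

Fixing `random_state` is what makes results reproducible from one run of the notebook to the next.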


Now you will import and fit the logistic regression model:

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [None]:
lr.fit(X_train,y_train)

In [None]:
predictions = lr.predict(X_test)

### Evaluate the model

In contrast to linear regression, logistic regression does not produce an $R^2$ score by which we can assess the accuracy of our model. In order to evaluate that, we will use a classification report, a confusion matrix, and the accuracy score.

#### Classification report

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

The classification report summarizes how well the model classified both survivors and non-survivors with four scores:
 - **Precision:** The number of true positives divided by the sum of true positives and false positives; closer to 1 is better.
 - **Recall:** The true-positive rate, the number of true positives divided by the sum of the true positives and the false negatives.
 - **F1 score:** The harmonic mean (the average for rates) of precision and recall.
 - **Support:** The number of true instances for each label.
 
Why so many ways of measuring accuracy for a model? Well, success means different things in different contexts. Imagine that we had a model to diagnose infectious disease. In such a case we might want to tune our model to maximize recall (and thus minimize our false-negative rate): even high precision might miss a lot of infected people. On the other hand, a weather-forecasting model might be interested in maximizing precision because the cost of false negatives is so low. For other uses, striking a balance between precision and recall by maximizing the F1 score might be the best choice. Run the classification report:

In [None]:
print(classification_report(y_test,predictions))
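
The scores in the report can also be worked out by hand from the definitions given earlier. The sketch below does so for a single class, using hypothetical counts chosen only for illustration (they are not the counts from our model):

```python
# Hypothetical counts for the positive class, chosen only for illustration
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # true positives / everything predicted positive
recall = tp / (tp + fn)      # true positives / everything actually positive
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
support = tp + fn            # number of true instances of the class

print(round(precision, 3), round(recall, 3), round(f1, 3), support)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F1 score here (about 0.727) sits closer to the recall (about 0.667) than a simple average would.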

#### Confusion matrix

The confusion matrix is another way to present this same information, this time with raw counts. In scikit-learn's output, the rows show the true condition and the columns show the predicted condition, with the negative class (0, did not survive) first and the positive class (1, survived) second. So, the matrix below shows that our model correctly predicted 146 non-survivors (true negatives) and incorrectly predicted that another 16 non-survivors had survived (false positives). On the other hand, our model incorrectly predicted that 30 survivors had perished (false negatives) and correctly predicted 76 survivors (true positives).

In [None]:
print(confusion_matrix(y_test,predictions))

Let's dress up the confusion matrix a bit to make it a little easier to read:

In [None]:
pd.DataFrame(confusion_matrix(y_test, predictions), index=['True Not Survived', 'True Survived'], columns=['Predicted Not Survived', 'Predicted Survived'])
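
scikit-learn documents `confusion_matrix` as placing true labels on the rows and predicted labels on the columns, with labels in sorted order. A small sketch on made-up labels is an easy way to confirm the layout (the `y_true` and `y_pred` lists are invented):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = did not survive, 1 = survived
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() unpacks the cells row by row
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

Here the first row holds the three true 0s (two predicted correctly, one not) and the second row holds the four true 1s.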

#### Accuracy score

Finally, our accuracy score tells us the fraction of correctly classified samples; in this case (146 + 76) / (146 + 76 + 30 + 16).

In [None]:
print(accuracy_score(y_test,predictions))
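
The same figure can be checked by hand from the four counts quoted in the text, assuming they match the matrix your run produced:

```python
# Counts quoted in the text above
correct = 146 + 76            # predictions that matched the true label
total = 146 + 76 + 30 + 16    # all test-set passengers

accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.3f}")
```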

Not bad for an off-the-shelf model with no tuning!

> **Takeaway:** In this subsection, you performed classification using logistic regression by removing extraneous variables, checking for multicollinearity, handling missing values, and fitting and evaluating your model.

## Classification: decision trees

> **Learning goal:** By the end of this subsection, you should be comfortable fitting decision-tree models and have some understanding of what they output.

If logistic regression uses observations about variables to swing a metaphorical needle between 0 and 1, classification based on decision trees programmatically builds a series of yes/no decisions to classify items.

<img align="left" style="padding-right:10px;" src="Graphics/Sec6_decision_tree.png">
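
Conceptually, a fitted decision tree behaves like a set of nested `if` statements. The sketch below writes one out by hand; the questions and thresholds are invented purely for illustration and are not learned from the *Titanic* data, and `classify` is a hypothetical helper.

```python
def classify(passenger):
    """Return 1 (survived) or 0 (did not survive) from made-up yes/no rules."""
    if passenger['Sex_male'] == 1:     # Question 1: is the passenger male?
        if passenger['Age'] < 10:      # Question 2 (male branch): younger than 10?
            return 1
        return 0
    if passenger['Fare'] > 20:         # Question 2 (female branch): fare above 20?
        return 1
    return 0

# Each passenger follows exactly one path from the first question to a leaf
print(classify({'Sex_male': 1, 'Age': 5, 'Fare': 10}))
print(classify({'Sex_male': 0, 'Age': 30, 'Fare': 50}))
```

A decision-tree algorithm chooses such questions and thresholds automatically from the training data rather than having them written by hand.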

Let's look at this in practice with the same *Titanic* dataset we used with logistic regression.

In [None]:
from sklearn import tree

In [None]:
tr = tree.DecisionTreeClassifier()

> **Exercise:**

In [None]:
# Using the same split data as with the logistic regression,
# can you fit the decision tree model?
# Hint: Refer to the code snippet for fitting the logistic regression above.


In [None]:
tr.fit(X_train, y_train)

Once fitted, we get our predictions just like we did in the logistic regression example above:

In [None]:
tr_predictions = tr.predict(X_test)

In [None]:
# scikit-learn puts true labels on the rows and predicted labels on the columns
pd.DataFrame(confusion_matrix(y_test, tr_predictions),
             index=['True Not Survived', 'True Survived'],
             columns=['Predicted Not Survived', 'Predicted Survived'])

In [None]:
print(accuracy_score(y_test,tr_predictions))

One of the great attractions of decision trees is that the models are readable by humans. Let's visualize the fitted tree to see this in action. (Note: the generated graphic can be quite large, so scroll to the right if it just looks blank at first.)

In [None]:
import graphviz 

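# class_names takes one name per class in ascending order (0, then 1);
# a bare string would be indexed character by character.
dot_file = tree.export_graphviz(tr, out_file=None,
                                feature_names=list(X.columns),
                                class_names=['Not Survived', 'Survived'],
                                filled=True, rounded=True)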
graph = graphviz.Source(dot_file)  
graph
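
If Graphviz is not available, scikit-learn's `tree.export_text` prints the same kind of rules as indented plain text. The sketch below fits a tiny tree on six invented rows so that it runs on its own; `X_toy`, `y_toy`, and `clf` are hypothetical names, and the data is arranged so that the label depends only on `Sex_male`.

```python
from sklearn import tree

# Invented data: columns are [Sex_male, Fare]; label is 1 exactly when Sex_male is 0
X_toy = [[0, 7.0], [0, 80.0], [1, 8.0], [1, 60.0], [0, 15.0], [1, 12.0]]
y_toy = [1, 1, 0, 0, 1, 0]

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per node: a split condition or the class predicted at a leaf
rules = tree.export_text(clf, feature_names=['Sex_male', 'Fare'])
print(rules)
```

On the notebook's own model, the equivalent call would be `tree.export_text(tr, feature_names=list(X.columns))`.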

There are, of course, myriad other ML models that we could explore. However, you now know some of the most commonly encountered ones, which is great preparation to understand what automated, cloud-based ML and AI services are doing and how to intelligently apply them to data-science problems, the subject of the next section.

> **Takeaway:** In this subsection, you performed classification on previously cleaned data by fitting and evaluating a decision tree.