<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Python and Pandas and Plots

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
sns.set_context(context="notebook", font_scale=1.7)

In [None]:
shakespeare_file = Path("data") / "shakespeare.txt"

if not shakespeare_file.exists():
    !curl https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt -o data/shakespeare.txt

with open(shakespeare_file) as f:
    shakespeare = f.read()

In [None]:
atten_df = pd.read_csv("data/attention.csv", index_col=1).drop("Unnamed: 0", axis=1)

In [None]:
names = ["Bai Yun", "Xiao Liwu", "Ya Ya", "Le Le", "Mei Lan", "Lun Lun"]
zoos = ["San Diego Zoo","San Diego Zoo", "Memphis Zoo", "Memphis Zoo", "Zoo Atlanta", "Zoo Atlanta"]
sexes = ["F", "M", "M", "F", "M", "F"]
species = ["panda", "panda", "panda", "panda", "panda", "panda"]
ages = [27, 7, 10, 21, 13, 22]

panda_df = pd.DataFrame({"zoo": zoos, "sex": sexes, "species": species, "age": ages}, index=names)

## Data Types

### Numbers: `int` and `float`

`int`s are `int`egers, or whole numbers, positive or negative.

`float`s are [floating point numbers](https://en.wikipedia.org/wiki/Floating-point_arithmetic),
a way of representing numbers with a fractional part, like `1.5`.

In [None]:
type(1), type(1.0)

In [None]:
type(float(1)), type(int(1.0))

#### Operators on Numbers

In [None]:
1 + 2, 1 - 2, 1 * 2, 1 / 2

In [None]:
type(1 / 2), type(2 / 2)

Watch out for type conversion!
Using `/` always results in a `float`, not an `int`, even when the result is a whole number.

Some operations, like indexing into a list, only accept `int`s.
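
As a small illustration of why this matters, the sketch below shows that a `float` produced by `/` cannot be used as a list index, while floor division, `//`, keeps the result an `int`. The list `letters` is made up for this example.

```python
letters = ["a", "b", "c", "d"]

middle = len(letters) / 2        # true division: always a float
print(type(middle))

try:
    letters[middle]              # floats cannot be used as list indices
except TypeError as error:
    print("TypeError:", error)

middle_int = len(letters) // 2   # floor division of two ints gives an int
print(type(middle_int), letters[middle_int])
```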

In [None]:
1 < 2, 1 == 2, 1 != 2, 1 > 2, 1 <= 2, 1 >= 2

### Booleans

In [None]:
type(True), type(False)

#### Operators on Booleans

In [None]:
True & True, True & False, False & False

In [None]:
True | True, True | False, False | False

In [None]:
not True, not False

In [None]:
bool(1), bool(0)

In [None]:
bool(2), bool(-1)

Again, type conversion can be non-intuitive.
Any non-zero number becomes `True`.

### Strings

In [None]:
(type("a"), type("ayy"))

In [None]:
"ayy" + "lmao"

In [None]:
"ayy" * 10

In [None]:
str(10)

## Callables

Callables are what we use to act on data, take in inputs, produce outputs, and generally do almost anything in Python.

Even operators are just callables in disguise!
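
One way to see this is with the standard library's `operator` module, which provides function versions of the built-in operators. The sketch below is only meant to show the correspondence; you won't need `operator` for the homework.

```python
import operator

# the + operator and the function operator.add do the same thing
print(1 + 2, operator.add(1, 2))

# the same goes for multiplication
print(3 * 4, operator.mul(3, 4))

# and for comparisons
print(1 < 2, operator.lt(1, 2))
```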

### Functions

In [None]:
len("ayy")

In [None]:
print("ayy")

In [None]:
"ayy"

In [None]:
print("ayy")
"ayy"

In [None]:
def ayy():
    return "ayy"

In [None]:
ayy()

#### Function Syntax

```python
def function(arg1, arg2, kwarg1=default1, kwarg2=default2):
    do stuff to get output
    return output
```

In [None]:
def print_n(strng, n=1):
    print(strng * n)
    
print_n("hello there")
print_n("hello there", 1)
print_n("hello there", n=1)

### Methods

In [None]:
"hello there".upper(), "hello there".title(), "hello there".split(" ")

```python
obj.
```

In [None]:
dir("ab")

## Containers

"Containers" is a loose, non-technical term for data types in Python
that can "hold" either basic data types or more complicated objects, including containers.

### Lists

In [None]:
type([]), type([1]), type(["a"])

In [None]:
friends = ["rachel", "joey", "phoebe", "monica", "chandler", "satan"]
numbers = [1, 2, 3, 4, 5, 666]

#### Operators on Lists

In [None]:
friends_and_numbers = friends + numbers
friends_and_numbers

In [None]:
"chandler" in friends, "new friend" in friends

The `in` keyword works like an operator to check whether an object is in an iterable, like a `list`.
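
Lists are not the only objects that `in` works with. A short sketch with a few other built-in types (the values are arbitrary):

```python
# strings: checks whether the left string appears inside the right one
in_string = "ay" in "ayy"

# tuples: checks whether any element is equal to the object
in_tuple = 3 in (1, 2, 3)

# ranges: range(5) contains 0, 1, 2, 3, 4, so 10 is not in it
in_range = 10 in range(5)

print(in_string, in_tuple, in_range)
```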

#### Indexing Lists

Indexing is performed with square brackets, `[]`,
rather than parentheses, `()`, which are used for calling.

Some Python objects can do both, so be careful!

In [None]:
friends[0], friends[-1]

In [None]:
friends[:2], friends[2:], friends[::2]

In [None]:
friends[1:-1:2]

### Tuples

In [None]:
type(("a", 1)), type(())

In [None]:
letter_a, number_1 = ("a", 1)

print(letter_a, number_1)

Note that `tuple`s can be indexed and checked with `in` just like `list`s.

The biggest difference between a tuple and a list is that a tuple, once made, cannot be changed.
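
A minimal sketch of that difference, using two made-up containers with the same contents: assigning to an index works for the `list` but raises a `TypeError` for the `tuple`.

```python
friends_list = ["rachel", "joey"]
friends_tuple = ("rachel", "joey")

friends_list[0] = "ross"        # lists can be changed in place
print(friends_list)

try:
    friends_tuple[0] = "ross"   # tuples cannot
except TypeError as error:
    print("TypeError:", error)

print(friends_tuple)            # unchanged
```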

### Dictionaries

Dictionaries relate pieces of information to one another. A `"key"` is mapped, or associated, or related, to a `"value"`, just as in a dictionary like Webster's, a word is associated with its definition.

In other languages, they are sometimes called "maps" or "associative arrays".

In [None]:
best_poems = {"Mary Oliver": "Snow Geese",
              "Rupi Kaur": "milk and honey",
              "William Blake": "The Tyger",
              "Homer": "The Odyssey"}

best_poems

In [None]:
best_poems["Mary Oliver"]

In [None]:
"William Blake" in best_poems.keys()

`.keys` is a method that returns an iterable of the dictionary's _keys_.

In [None]:
"The Odyssey" in best_poems.values()

`.values` is a method that returns an iterable of the dictionary's _values_.

In [None]:
("Homer", "The Odyssey") in best_poems.items()

`.items` is a method that returns an iterable of tuples of the dictionary's _keys and values_.

Sometimes a dictionary tells us how two pieces of data relate;
other times it's more like a function masquerading as data.

In [None]:
plus_one = {0: 1, 1: 2, 2: 3, 3: 4}

In [None]:
plus_one[1]
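
To make the comparison with a function concrete, here is a sketch that puts the dictionary next to a real function. The names `plus_one_dict` and `plus_one_function` are made up for this example. The two agree on the keys in the dictionary, but the dictionary only "knows" the inputs it was given.

```python
plus_one_dict = {0: 1, 1: 2, 2: 3, 3: 4}

def plus_one_function(number):
    return number + 1

# for keys that are in the dictionary, the two agree
print(plus_one_dict[1], plus_one_function(1))

# the function works on any number...
print(plus_one_function(10))

# ...but 10 is not a key, so plus_one_dict[10] would raise a KeyError
print(10 in plus_one_dict)
```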

## Flow Control

Flow control constructs are used to control which code is run, and when.

`for` runs the same code multiple times,
while `if`/`elif`/`else` switch which code is run.

### `for`

```python
for elem in iterable:
    do stuff
```

In [None]:
for friend in friends:
    print("i like my friend " + friend)

In [None]:
for ii in range(len(friends)):
    friend = friends[ii]
    number = numbers[ii]
    print(friend + "'s favorite number is " + str(number))

Not every iterable is a list.

In [None]:
range(len(friends))

In [None]:
list(range(len(friends)))

`enumerate` makes another useful iterable:
it takes in an iterable, like a list,
and at each step of the `for` loop makes a `tuple` with the "step count" followed by the next element of the original iterable.

In [None]:
list(enumerate(friends))

In [None]:
for ii, friend in enumerate(friends):
    number = numbers[ii]
    print(friend + "'s favorite number is " + str(number))

In [None]:
list(zip(friends, numbers))

In [None]:
for friend, number in zip(friends, numbers):
    print(friend + "'s favorite number is " + str(number))

`zip` makes another useful iterable:
it takes in a collection of iterables and,
at each iteration, spits out the next element of each iterable, as a `tuple`.

In [None]:
friends_with_numbers = []

for ii in range(len(friends)):
    friends_with_numbers.append([friends[ii], numbers[ii]])
    
friends_with_numbers

In this week's homework, you'll need to make some lists out of other lists.
The code above shows one strategy for doing this, with the `.append` method.
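
The same strategy also works when looping with `zip` instead of over indices. A sketch, with the two lists repeated so that it runs on its own:

```python
friends = ["rachel", "joey", "phoebe", "monica", "chandler", "satan"]
numbers = [1, 2, 3, 4, 5, 666]

friends_with_numbers = []   # start with an empty list

for friend, number in zip(friends, numbers):
    # add one [friend, number] pair per step of the loop
    friends_with_numbers.append([friend, number])

print(friends_with_numbers)
```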

### `if`/`else`

In [None]:
for friend in friends:
    if friend != "satan":
        print("i like my friend " + friend)

In [None]:
for friend in friends:
    if friend != "satan":
        print("i like my friend " + friend)
    else:
        print("i don't like " + friend)
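
When there are more than two cases, `elif` (short for "else if") adds further checks between the `if` and the `else`. A sketch, with the list repeated so that it runs on its own; the extra case for `"joey"` is made up for illustration.

```python
friends = ["rachel", "joey", "phoebe", "monica", "chandler", "satan"]

messages = []
for friend in friends:
    if friend == "satan":
        messages.append("i don't like " + friend)
    elif friend == "joey":
        # only checked if the first condition was False
        messages.append("how you doin', " + friend + "?")
    else:
        # runs only if none of the conditions above were True
        messages.append("i like my friend " + friend)

for message in messages:
    print(message)
```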

## Example: An Interesting Word

This section gives more examples of how to build a list using `.append`, useful for the homework.

In [None]:
interesting_word = "honorificabilitudinitatibus"

len(interesting_word)

The word
[honorificabilitudinitatibus](https://en.wikipedia.org/wiki/Honorificabilitudinitatibus)
has been noted since at least the time of Charlemagne (800 AD)
and is joked about in Shakespeare.

In [None]:
interesting_word in shakespeare

In addition to being long, this word, which means "able to receive honorifics",
has the curious property that every other letter is a vowel.

Let's check this for ourselves below.

In [None]:
even_numbers = []
max_number = 26

for ii in range(max_number + 1):
    divided_by_two = ii / 2
    divided_by_two_and_rounded = int(divided_by_two)
    if divided_by_two_and_rounded == divided_by_two:
        even_numbers.append(ii)

even_numbers
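
Another common way to check for evenness uses the remainder operator, `%`, which is not otherwise covered in this section. The sketch below builds the same list by keeping the numbers whose remainder after division by two is zero.

```python
even_numbers = []
max_number = 26

for ii in range(max_number + 1):
    if ii % 2 == 0:   # the remainder after dividing by 2 is 0 for even numbers
        even_numbers.append(ii)

print(even_numbers)
```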

In [None]:
vowels = "aeiou"
consonants = "bcdfghjklmnpqrstvwxyz"

In [None]:
def is_consonant(letter):
    return letter in consonants

In [None]:
even_letter_is_consonant = []
for number in even_numbers:
    even_letter_is_consonant.append(is_consonant(interesting_word[number]))

In [None]:
odd_letter_is_consonant = []
for number in even_numbers[:-1]:
    odd_letter_is_consonant.append(is_consonant(interesting_word[number + 1]))

In [None]:
all(even_letter_is_consonant), any(odd_letter_is_consonant)
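
The same check can be sketched with the slicing syntax from the section on indexing lists, which also works on strings. Here the even- and odd-position letters are pulled out with a step of two, and each group is then checked with a loop and `.append`, as before. The variable names are chosen for this example.

```python
interesting_word = "honorificabilitudinitatibus"
vowels = "aeiou"

even_letters = interesting_word[::2]    # letters at positions 0, 2, 4, ...
odd_letters = interesting_word[1::2]    # letters at positions 1, 3, 5, ...

even_are_consonants = []
for letter in even_letters:
    even_are_consonants.append(letter not in vowels)

odd_are_vowels = []
for letter in odd_letters:
    odd_are_vowels.append(letter in vowels)

print(even_letters, odd_letters)
print(all(even_are_consonants), all(odd_are_vowels))
```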

## Pandas

Pandas is the primary library in Python for working with tabular data.
Its name comes from `pan`el `da`ta, a term from econometrics.

Tabular data in pandas is represented with `DataFrame`s.

### `DataFrame`s

In [None]:
panda_df

A `DataFrame` is like a dictionary
whose keys are strings or numbers and whose values are something like a list.

Viewed from one angle, it is a dictionary of columns of data.

In [None]:
panda_df.columns

In [None]:
panda_df

Columns are accessed directly, with brackets, like indexing a dictionary or list.

In [None]:
panda_df["zoo"]

Viewed from the other angle, a `DataFrame` is a dictionary of rows of data.

In [None]:
panda_df.index

Rows are accessed by the `.loc` attribute, also with brackets.

In [None]:
panda_df.loc["Bai Yun"]
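
`.loc` can also take a row name and a column name together, to pick out a single entry. A sketch, using a small stand-in `DataFrame` called `df` (the first three rows and two of the columns of `panda_df`) so that it runs on its own:

```python
import pandas as pd

# stand-in for panda_df: its first three rows and two of its columns
df = pd.DataFrame(
    {"zoo": ["San Diego Zoo", "San Diego Zoo", "Memphis Zoo"], "age": [27, 7, 10]},
    index=["Bai Yun", "Xiao Liwu", "Ya Ya"],
)

print(df.loc["Ya Ya"])          # one row, by name
print(df.loc["Ya Ya", "age"])   # one entry: row name, then column name
print(df["age"]["Ya Ya"])       # the same entry: column first, then row
```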

### Series

That "something like a list" is called a `Series`.

Rows and columns are both `Series`.

In [None]:
type(panda_df["zoo"]), type(panda_df.loc["Bai Yun"])

There are lots of operators on `Series` and `Series` methods.

For this first week, we need only one particular set of operators.

### Selection

In [None]:
panda_df["sex"] == "M"

In [None]:
type(panda_df["sex"] == "M")

In [None]:
male_panda_df = panda_df[panda_df["sex"] == "M"]
male_panda_df

Boolean operators work on Boolean `Series`.

In [None]:
panda_df[(panda_df["age"] >= 10) & ( panda_df["sex"] == "M")]
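
Besides `&` for "and", Boolean `Series` can be combined with `|` for "or" and flipped with `~` for "not". The parentheses around each comparison matter, because `&` and `|` are applied before `==` and `>=`. A sketch, using a small stand-in `DataFrame` called `df` (the first three rows of `panda_df`) so that it runs on its own:

```python
import pandas as pd

# stand-in for panda_df: its first three rows and two of its columns
df = pd.DataFrame(
    {"sex": ["F", "M", "M"], "age": [27, 7, 10]},
    index=["Bai Yun", "Xiao Liwu", "Ya Ya"],
)

is_male = df["sex"] == "M"
is_at_least_ten = df["age"] >= 10

print(df[is_male & is_at_least_ten])   # "and": both conditions hold
print(df[is_male | is_at_least_ten])   # "or": at least one condition holds
print(df[~is_male])                    # "not": flips True and False
```

Giving each Boolean `Series` its own name, as here, is one way to avoid having to think about the parentheses.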

There are also lots of `DataFrame` methods.

For this first week, we need only one: `sort_values`.

In [None]:
panda_df.sort_values(by="age")
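
By default, `sort_values` sorts from smallest to largest and returns a new `DataFrame`, leaving the original alone. The sketch below uses the `ascending` keyword argument to sort from largest to smallest instead, on a small stand-in `DataFrame` called `df` (the first three rows of `panda_df`).

```python
import pandas as pd

# stand-in for panda_df: its first three rows and one of its columns
df = pd.DataFrame(
    {"age": [27, 7, 10]},
    index=["Bai Yun", "Xiao Liwu", "Ya Ya"],
)

oldest_first = df.sort_values(by="age", ascending=False)

print(oldest_first)   # sorted copy
print(df)             # the original is unchanged
```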

### Using Functions and Making Columns with `.apply`

In [None]:
def remove_zoo(string):
    words = string.split(" ")
    words_without_zoo = []
    for word in words:
        if word == "Zoo":
            pass
        else:
            words_without_zoo.append(word)
            
    return " ".join(words_without_zoo)

In [None]:
cities_series = panda_df["zoo"].apply(remove_zoo)
cities_series

In [None]:
panda_df["city"] = panda_df["zoo"].apply(remove_zoo)
panda_df

## Plots

We will use two main plotting libraries in this course:

- `seaborn`, aka [`sns`](https://en.wikipedia.org/wiki/Sam_Seaborn), for high-level plotting
- `matplotlib`, through the `pyplot` interface, for low-level plotting

For this week, you'll only need seaborn, but there will be some matplotlib code provided.

Check out
[this link](https://medium.com/@ishan.cdixit/lets-learn-matplotlib-812ac490d0d6) for more on using matplotlib.

## Seaborn: Standard plots made easy

The goal of seaborn is to make it as easy as possible to make certain standard plots,
starting from a dataframe.

In [None]:
atten_df.sample(10)

```python
sns.stripplot(x=?, y=?, data=?);
```

In [None]:
sns.stripplot(x="solutions", y="score", data=atten_df, jitter=True);

Seaborn also allows you to programmatically set the color of datapoints, with `hue`.

```python
f, ax = plt.subplots(figsize=(12, 6))
sns.stripplot(x=?, y=?, hue="attention", data=?, jitter=True);
```

The snippet above is an example of what might appear in the homework/lab,
along with an instruction like
> make a `stripplot` with `"solutions"` on the `x` axis and `"score"` on the `y` axis

for which the correct answer is below:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.stripplot(x="solutions", y="score", hue="attention", data=atten_df, jitter=True);

The other line of code, `plt.subplots`, uses `matplotlib`'s `pyplot` interface,
imported under the alias `plt`,
to set up the figure and axes for the plot.

Whenever you want precise control over how a plot looks,
you'll need to use `matplotlib` in some capacity.

In the cell where you create a figure, the figure will be plotted automatically,
using the same output stream as the `print` function.

If you assign a figure to a variable name, as above, you can also produce it from the `Out` stream.

In [None]:
f