Like the mighty Python, before we walk we must crawl

I learned python [The Hard Way](https://learning.oreilly.com/library/view/learn-python-the/9780133124316/ch01.html).
The idea behind this approach is "doing is learning".
There was lots of focus on baby steps and repeating those steps until they became unconscious.
It was tedious, but definitely useful.
I mention it, because if you find yourself struggling with Python syntax, it's a great reference, or way to learn outright.

My philosophy for this course is in line with this, but with a specific task, or project in mind.
Today's task will be extracting simple information from a plain text file.

**Outline**
  1. [Launch Python (Notebooks)](#Launch-Python)
  1. [Syntax basics](#Syntax-basics)
     1. [Numbers](#Numbers)
     1. [Strings](#Strings)
     1. [Variables](#Variables)
     1. [Tuples](#Tuples)
     1. [Lists](#Lists)
     1. [Dictionaries](#Dictionaries)
  1. [Program Control Logic](#Program-Control-Logic)
     1. [If-then-else](#If-then-else)
     1. [For loops](#For-loops)
  1. [Functions](#Functions)
  1. [Reading files](#Reading-files)
  1. [Viewing data](#Viewing-data)
  1. [Regular Expressions](#Regular-Expressions)
  1. [Extracting data](#Extracting-data) <-- this is the objective, jump here if you know python

# Launch Python

We start by launching python.
Open your terminal, navigate to where you downloaded this course (manually or with git), and launch jupyter:

```bash
$ jupyter notebook ./notebooks/
```

There's also the graphical option: Anaconda installs a nice little launcher, which you can find in your start menu.
But that will usually default to loading `~/`, so you would have to navigate in the browser to this notebook.
Use whichever works better for you.

**NOTE:** Notebooks are made up of cells, which you may have encountered in the Installation step.
Cells are executable bits of python, which you run by holding Shift (or Ctrl, or Alt) and pressing Enter.
Running a cell executes its code, shows anything the code prints below the cell, and then also displays the value of the cell's last line as output.
We'll get a lot of practice with that below.

# Syntax basics

Python documentation: [https://www.python.org/doc/](https://www.python.org/doc/)

Python is the type of language where everything is possible.
Do you want to change the `print` function to always add an emoji to the start of your string? 

In [None]:
# You can ignore the code below, you'll understand it later.
# The point is that Python is flexible.
if "_print" not in dir():
    """Let's keep this around for safety reasons"""
    _print = print

In [None]:
def print(*args, **kwargs):
    """Add some cheer to our printing"""
    _print(b'\xf0\x9f\x8d\xbb'.decode(), *args, **kwargs)
print("Wait but I thought `print` was a special python function?")

All of this to say that almost anything is possible, so the joy of python is just figuring out how to convert your brainy ideas *into* the computer.

But I digress.
Python has some basic components, starting with numbers, strings, and variables.

## Numbers

Documentation: [https://docs.python.org/3.7/library/stdtypes.html#numeric-types-int-float-complex](https://docs.python.org/3.7/library/stdtypes.html#numeric-types-int-float-complex)

Numbers are pretty basic in Python, they are just things typed on your number-keypad:

In [None]:
5

Notice that the 5 was printed next to the Out[##]: text?
That is Notebooks printing whatever is the last thing in the cell.
It's very convenient, and I use it constantly to look at data.

Math also works as you would expect:

In [None]:
# * is multiplication
5 * 5

In [None]:
# ** is the exponent operator, i.e. "to the power of"
5 ** 5

These so far have been integers.
But what if we want the Reals?
Those are called floats (for floating point number), and they are just as easy:

In [None]:
3.14159 * 5 ** 2

Mixing integers and floats is fine, as we saw above.
Python is smart, and converts the integers to floats behind the scenes.

There are two convenient operators for dealing with floats and integers though, modulus (`%`) and floor division (`//`):

In [None]:
# Mod can be thought of as "the remainder after dividing 3 by 2"
3 % 2

What if we want multiple of those at a time? For exposition? Just throw a comma in there!

In [None]:
7 % 3,        8 % 3,         9 % 3,         10 % 3

Notice that python doesn't really care about spaces.
That's only true within the line. 
Python VERY MUCH cares about spaces at the beginning of lines, as we'll see later.

Floor division is the complement to Modulus (or mod), it's the division part:

In [None]:
3 // 2

In [None]:
3 / 2

In [None]:
3 // 2   +   (3 % 2) / 2
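As a quick check of how the two operators fit together, here is a small sketch using Python's built-in `divmod`, which hands back the floor-division result and the remainder in one go. The variable names are just for illustration.

```python
# divmod(a, b) returns the pair (a // b, a % b)
quotient, remainder = divmod(7, 3)
print(quotient, remainder)

# Floor division and modulus always recombine into the original number
rebuilt = (7 // 3) * 3 + (7 % 3)
print(rebuilt)
```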

And there are many more!
Explore them all here: [en.wikibooks.org/wiki/Python_Programming/Operators](https://en.wikibooks.org/wiki/Python_Programming/Operators#Basics)

## Strings

Documentation: [https://docs.python.org/3.7/library/stdtypes.html#text-sequence-type-str](https://docs.python.org/3.7/library/stdtypes.html#text-sequence-type-str)

Strings are just text:

In [None]:
"This is a string. It is defined by surrounding whatever you want with quotes."

In [None]:
'Single quotes are fine too! So are numbers: 5 + 5'

In [None]:
"""
Triple quoted strings can span multiple lines.
They are great for long strings, specifically Docstrings, which are great for commenting your code!
"""

Woah, that last one was ugly!
What's going on there?

The triple quoted string can contain newlines (they all can, but go with it for simplicity's sake).
Newlines in python strings are represented by `\n`, which is the **escape character** `\` and an `n`.
The escape character `\` says "whatever comes next is special".

Some convenient escape characters are the Tab `\t`, or emoji like we saw above, denoted by `\x` or `\U`.
To include the character `\` in a string, type `\\`.

To include a newline in a single-quoted string, just type `\n`. But if we do that, it will be ugly, like above:

In [None]:
"First line\nSecond line"

How do we get python to print those two on two lines?
We use the print function, of course:

In [None]:
print("First line\nSecond line\n\tIndented third line")

This happens because Notebooks wasn't printing the strings we typed above. It was just saying "Here's the thing from the last line in this cell", and just because that thing is a string doesn't mean Notebooks knows what to do with it (i.e. it doesn't assume the string should be passed into `print()`).
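To make the difference concrete, here's a minimal sketch. It assumes the Notebook's echo of a string is essentially what the built-in `repr()` gives you, which is the "quoted" form with the escape characters left visible. The variable names are made up for the example.

```python
two_lines = "First line\nSecond line"

# repr() gives the quoted form, with the \n still visible as two characters
echoed = repr(two_lines)
print(echoed)

# print() on the string itself interprets the escape characters instead
print(two_lines)
```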

So what can we do with strings?
Add them of course!

In [None]:
"I'm a lumberjack, " + "and I'm okay!"

We can also multiply them by numbers, which repeats them:

In [None]:
"Na"*8 + " Batman!"

That's about it.
We can't subtract, divide, or exponentiate them, sadly.


So now that we have the basics of strings (and I encourage you to play around with them, make a new cell by hitting the `Escape` key and pressing `a`), let's get into what we can really do with strings, and that's search through them:

In [None]:
'To be, or not to be, that is the question.'.find('be')

That told us that the string 'be' was found at the index 3.
What's an index?
Strings are kind of like text entry fields on a government form.
Here's a diagram of how it works:

<img src="img/2_string_indexing.png" width="500px" alt="String indices" />

Indexing is achieved using square brackets:

In [None]:
"This is a regular old string that starts with T"[0]

That grabbed the `0`th item, which in that diagram above is denoted by the little blue arrows, so that means the first one.

What if we wanted to grab the whole first word?

In [None]:
"First word is 5 letters long, meaning from 0 - 5"[0:5]

In [None]:
"First word is 5 letters long, and this time we're going to omit the 0"[:5]

In [None]:
"This sentence ends with First"[-5:]

In [None]:
"This sentence has 3 characters after First!!!"[-(5+3):-(0+3)]

***TODO:*** 

  * [ ] Play around with the string slicing below until you are comfortable with the indexing syntax.

In [None]:
"String slicing playground"[:]

Indexing strings (or slicing, as the \[:\] syntax is called) is convenient for simple things, but what we'll see later when we talk about regular expressions, is that this *idea* of cutting up strings, or more importantly extracting just sub-sections of the string, is really powerful.


That's what our task is, extracting specific information from strings.
So what we eventually want to achieve is this: given a string, figure out the location of the data we want, so we can use string indexing to get it.
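Here's a minimal sketch of that idea on a made-up sentence: use `.find()` to locate a marker, then slice from there. The sentence, the marker, and the assumption that the value is exactly 5 characters long are all just for illustration.

```python
sentence = "The meeting starts at 10:30 in room B."

# Find where the marker text begins, then step past it to the data we want
marker = "starts at "
start = sentence.find(marker) + len(marker)

# We assume the time is 5 characters long, so slice that many characters
meeting_time = sentence[start:start + 5]
print(meeting_time)
```

Hard-coding the length is fragile, which is exactly the problem regular expressions solve later on.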

In [None]:
"Our CEO Brian Cadman is paid $1,000,000 per day."[?????]

There are a few things to note about strings in Python.
First, strings are *immutable*, which means they can't be changed.

Hopefully, you're now thinking "But wait, I added two strings up above."
Yes, you did!
But Python, in its infinite wisdom, made a new string which was the result of concatenating the two strings.
This only matters to you when you're thinking about efficiency of code.
For example, we can use the Notebook magic `%%timeit` to show us this (ignore the code for now, just trust me it does what I claim):

In [None]:
%%timeit
# This approach adds strings over and over

string_to_add_to = ""
for number in range(1000):
    string_to_add_to = string_to_add_to + str(number)

In [None]:
%%timeit
# This approach collects all the strings we wish to add, then puts them all together in one step

thing_we_will_turn_into_string = []
for number in range(1000):
    thing_we_will_turn_into_string.append(str(number))
string_with_all_the_numbers = ''.join(thing_we_will_turn_into_string)
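If you're not in a Notebook, the standard library's `timeit` module can do a similar comparison. This is only a sketch: the function names are made up, the second function uses a generator inside `join` instead of the list from the cell above, and the timings depend on your machine and Python version.

```python
import timeit

def add_repeatedly():
    # Builds a new string on every pass through the loop
    result = ""
    for number in range(1000):
        result = result + str(number)
    return result

def join_once():
    # Collects the pieces and joins them in a single step
    return "".join(str(number) for number in range(1000))

# Both approaches build exactly the same string
print(add_repeatedly() == join_once())

# Time 200 runs of each
print(timeit.timeit(add_repeatedly, number=200))
print(timeit.timeit(join_once, number=200))
```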

***TODO:*** 
  * [ ] Browse what functions you can call on a string: [https://docs.python.org/3.7/library/stdtypes.html#text-sequence-type-str](https://docs.python.org/3.7/library/stdtypes.html#text-sequence-type-str)
  * [ ] Browse the Python documents to read a bit about the string format mini-language: [https://docs.python.org/3.7/library/string.html](https://docs.python.org/3.7/library/string.html#formatspec)
  * [ ] Get comfortable with both, so you understand the following:

In [None]:
"Start:{number_to_display:{pad_character}{align}{num_width}f}"\
    .format(number_to_display=5.12345, num_width=15.2, pad_character='_', align=">", unused="gets ignored")\
    .replace('5', '4')

Note: long lines of python code can be continued on the next line by adding a `\` to the end of the line

New to python 3.6 are f-strings, which are cool shortcuts for the above:

In [None]:
f"Start:{5.12345:_>15.2f}"

## Variables

Variables in python are pointers to things.
Those things might be numbers to be added, strings to be searched, or lists of numbers, strings, other lists, etc.
Almost everything in python can be thought of as a reference to some underlying data or function.

We've already seen some variables above, but this is what they look like:

In [None]:
variable_name = 5
variable_name ** 2

In [None]:
CAPITAL_is_OKAY_too = 'but no spaces'
len(CAPITAL_is_OKAY_too)

Variables are just references to the data they are pointing at, so we can use them just like the raw data above:

In [None]:
five = 5
two = 2
five ** two

In [None]:
(CAPITAL_is_OKAY_too + " | ") * two

Python is like a loyal puppy, it tries to run your code even if it doesn't make sense:

In [None]:
CAPITAL_is_OKAY_too + five

When you do ask it to do a trick it doesn't know how to, it will throw an Error.
The above error is called a **TypeError**, and it comes with an error message:

    can only concatenate str (not "int") to str
    
This is Python trying to tell us what the error might be.
These error outputs are often invaluable for debugging, because they will typically point to exactly the line of code which is causing the problem.
Also, when we start getting into functions, we will see the whole chain of calls, which will make more sense later.

Sometimes (often) we might expect, or even want an error. 
This is because we don't have to let Errors ruin our day, or our code execution.
The solution to errors is the `try`/`except` code block:

In [None]:
try:
    CAPITAL_is_OKAY_too + five
except TypeError:
    print("Don't add strings and numbers. To do that, cast the number to string, silly!")
    print(CAPITAL_is_OKAY_too + " | " + str(five))

You'll see many, many errors (which Python raises as exceptions) in your programming career, so try/except might become your best friend.
This is especially useful if you're doing a huge parsing job, and don't want it to crash after running for 8 hours.
If you wrap the body of each loop in a try/except, you can just continue on to the next document if something fails for one of them.
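A rough sketch of that pattern is below. It uses a `for` loop, which we cover properly later, and the "documents" are made-up stand-ins: one of them is `None`, so calling `.split()` on it fails.

```python
# Hypothetical "documents": the middle one is not a string and will fail
documents = ["first report", None, "third report"]

word_counts = []
failures = []
for position, document in enumerate(documents):
    try:
        word_counts.append(len(document.split()))
    except AttributeError as error:
        # None has no .split(), so note the failure and keep going
        failures.append((position, str(error)))
        continue

print(word_counts)
print(failures)
```

Keeping a list of failures means you can go back and look at the problem documents after the long job finishes.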

***TODO***
 * [ ] Play around with variables, and try throwing a few more errors.

<div style="font-size:2em;padding-left:20pt">Other 'types'</div>

So now we've seen strings and numbers, what else is there? 
Well, [lots](https://docs.python.org/3.7/library/stdtypes.html)!
Just look at the table of contents on the left side of that page for a full list.

The ones we care about for now are Boolean, None, and some super helpful iterables below: [tuples](#Tuples), [lists](#Lists), and [dictionaries](#Dictionaries).

## Boolean

Documentation: [https://docs.python.org/3.7/library/stdtypes.html#truth-value-testing](https://docs.python.org/3.7/library/stdtypes.html#truth-value-testing)

Booleans (or `bool`), are True/False values.
They are used for testing things usually, like "If something is true, do this, if it is false, do that."

Many things are Boolean, like the answer to tests:

In [None]:
5 > 6

or checking if something is a string:

In [None]:
isinstance("Is a string a `str`?", str)

or even if a string contains another string:

In [None]:
"NOFX" in "I listen to NOFX sometimes"

***TODO***: 
  * [ ] Look up the different boolean logic functions: [https://docs.python.org/3.7/library/stdtypes.html#comparisons](https://docs.python.org/3.7/library/stdtypes.html#comparisons)

Booleans pop up almost everywhere!
Some things have a 'boolean' value, which is the equivalent of calling `bool` on them:

In [None]:
bool(True), bool(False), "is redundant, but works"

In [None]:
bool(0), bool(1), bool(1337)

In [None]:
bool("Strings are true!"), bool(""), bool("Unless they are empty.")

That last one is worth a moment to consider: what is a `False` string?
Python says that an empty string is `False`, just like 0 is `False`, and `None` is `False` too.
But just because these things are all False, doesn't mean they are equal:

In [None]:
bool(None)

In [None]:
None == ""

In [None]:
bool(None) == bool("")

Because Python is really neat, instead of just `&` for and, and `|` for or, you can use `and` and `or` like you would expect:

In [None]:
True and True

In [None]:
False or False

NOTE: One thing about `and` and `or`, is that Python only executes code if it needs to.
This means if you have some expression `and` some other expressions, if the first one is `False`, the second one will never be run or checked if it's `True`, because it doesn't matter (any False makes `and` False).
This might be confusing now, but when we get to functions below it'll make more sense.
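If you'd like to see it now anyway, here's a small sketch. It uses a function (covered below) that records every time it gets called, so we can check afterwards which right-hand sides actually ran. The names are made up for the example.

```python
calls = []

def noisy_true(label):
    # Record that we were called, then return True
    calls.append(label)
    return True

# The right-hand side never runs, because False already decides the `and`
result_and = False and noisy_true("and")

# Same idea for `or`: True already decides it
result_or = True or noisy_true("or")

# Here the right-hand side is needed, so it does run
result_needed = True and noisy_true("needed")

print(calls)
```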

## None

`None` is just that, a nothing-value.
It's useful for lots of reasons, we'll see it used quite a bit.
For now, just know it exists, and is called `None`

In [None]:
None

In [None]:
None is None

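`is` is an operator that checks identity (whether two names refer to the very same object), which is [not quite the same](https://www.mysamplecode.com/2012/11/python-difference-between-is-and-equals.html) as the equality check `==`. When comparing against `None` the two give the same answer, though `is None` is the usual idiom: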

In [None]:
5 == None
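To see where `is` and `==` part ways, here's a small sketch. It borrows lists, which show up a little further down, and the variable names are made up.

```python
first = [1, 2, 3]
second = [1, 2, 3]
alias = first

print(first == second)  # compares contents
print(first is second)  # asks whether they are the very same object
print(first is alias)   # two names for one object

value = None
print(value is None)    # the usual way to test for None
```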

## Tuples

Documentation: [https://docs.python.org/3.7/library/stdtypes.html#tuples](https://docs.python.org/3.7/library/stdtypes.html#tuples)

Now that we know about the basic types, we move on to iterables, which you can think of as 'collections' of things.
They can be collections of numbers:

In [None]:
(1, 2, 3)

or collections of mixed types:

In [None]:
(False, 1, "two", None)

or, and this might get ugly, collections of collections!

In [None]:
(1, 2, (False, True, ("String 1", "String 2")))

Two important things about tuples (the same goes for lists below; dictionaries have a length too, but cannot be sliced):

 1. They have length, which you can test using `len`

In [None]:
a = 1, 2, 3, 4, 5
len(a)

 2. They can be sliced, just like strings:

In [None]:
(0, 1, 2, 3, 4, 5)[0]

In [None]:
(0, 1, 2, 3, 4, 5)[1:4]

Just like strings, the slicing indexes work according to the diagram above.

## Lists

Documentation: [https://docs.python.org/3.7/library/stdtypes.html#lists](https://docs.python.org/3.7/library/stdtypes.html#lists)

Now that we know about tuples, lists will be simple because they are almost exactly the same, just with square brackets:

In [None]:
a = [0, 1, 2, 3]
len(a)

The biggest difference is that lists are 'mutable', and tuples are not.

WAT? 
That just means that lists can be edited, and tuples, like strings above, can not (without creating a new copy).

The simplest thing to think about is adding new elements to your list:

In [None]:
print(a) # from above

In [None]:
a.append(4)
print(a)

What if we tried to do that with tuples?

In [None]:
t = (0, 1, 2, 3)
print(t)
t.append(4)
print(t)

Don't you love exceptions? 
They tell us what went wrong, sometimes in ways only helpful if you know Python already.
But that's why we're here, right?

Converting from tuples to list and vice-versa is easy:

In [None]:
tuple_var = (1, 2, 3, 4)
list_var = list(tuple_var)
new_tuple_var = tuple(list_var)
type(tuple_var), type(list_var), type(new_tuple_var) # the type function is neato!

And even if the elements match, they still aren't equal:

In [None]:
tuple_var == list_var

***TODO***

 * [ ] Check out the documentation for what we can do with lists [operations](https://docs.python.org/3.7/library/stdtypes.html#common-sequence-operations) and [manipulation](https://docs.python.org/3.7/library/stdtypes.html#mutable-sequence-types).

## Dictionaries

Documentation: [https://docs.python.org/3.7/library/stdtypes.html#mapping-types-dict](https://docs.python.org/3.7/library/stdtypes.html#mapping-types-dict)

Dictionaries are one of the coolest things about Python, hands down.
If you care to look under the hood, you should be amazed.
But do that later, for now, let's just play with them.

Dictionaries are basically lists with names, instead of indexes. 
Remember when we were indexing lists:

In [None]:
[0, 1, 2, 3][0]

The logic here is "get the 0th index of the list".
Well what if we don't want to use numbers to index?
What if we want to use names?

Answer: Dictionaries.

In [None]:
a = {'key': 'value'}
a['key']

What this does is 'search' the dictionary a for the key 'key', and return whatever it finds.

<br />
<div style="font-size:90%">Note for the curious: the search is actually not a search, but a <a href="https://stackoverflow.com/a/114831" target=_blank>hash lookup</a>, which means it's super speedy. Isn't that awesome? Ignore this if it didn't make immediate sense. Just remember it's fast.</div>

We can add to dictionaries by just assigning values to the key:

In [None]:
print("Before:", a)
a['new key'] = "new value"
print("After: ", a)

What can be used as a key, I hear you ask?
Anything ['hashable'](https://docs.python.org/3.7/glossary.html#term-hashable), so strings, numbers, simple tuples (with just strings or numbers), but not lists or other dicts:

In [None]:
a[0] = "Zero value"
print(a)
a[0]

In [None]:
a[(0, 1, 'string')] = "Because why not?"
a

Okay, that last one is annoying, let's get rid of it:

In [None]:
del a[(0, 1, 'string')]
a

And of course, we can nest dictionaries:

In [None]:
a['dict key'] = {'sub-key': 100}
a

And then we can access it with two indexes:

In [None]:
a['dict key']['sub-key']

Now here's where dictionaries get tricky: variables pointing to dictionaries.

Remember up in the variables section I called variables 'pointers'?
No? Scroll up, I'll wait.

What this means for dictionaries, is that if two variables point to the SAME dictionary, changing one changes the other.
This is because both point to the same memory: change that memory through one variable, and the change is seen through the other:

In [None]:
b = a
print(b)

In [None]:
b['key'] = 'changed original value'
print(b)
print(a)

In [None]:
b['new b key'] = [0, 1, 2, 3]
a

This also means we can point to sub-dictionaries:

In [None]:
c = a['dict key']
c['new sub-dict key'] = True
a
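If you want a separate dictionary that you can change without touching the original, one option is the standard library's `copy.deepcopy`. A minimal sketch, with made-up variable names:

```python
import copy

original = {'key': 'value', 'nested': {'sub-key': 100}}

alias = original                       # second name, same dictionary
independent = copy.deepcopy(original)  # brand new dictionary, nested ones too

alias['key'] = 'changed'
independent['nested']['sub-key'] = -1

print(original)
print(independent)
```

The change made through `alias` shows up in `original`, while the change made to `independent` does not.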

Like lists, dictionaries also have a 'length', which is the number of keys in them.
Note: This does not count sub-dictionaries:

In [None]:
len(a)

Lastly (not actually, dictionaries are EVERYWHERE, so we'll see them a lot more below), the components of dictionaries are twofold:

  1. keys: the keys in the dictionary. Note: does not descend into sub-dictionaries:

In [None]:
list(a.keys())

  2. values: the values associated with those keys:

In [None]:
list(a.values())

And we can get both at once with `.items()`:

In [None]:
list(a.items())

***TODO***:
 * [ ] Look at that last result, what does .items() return?
 * [ ] What happens if the key doesn't exist, and you try to access it? Try a['no key'].
 * [ ] Check out the dictionary's [.get()](https://docs.python.org/3.7/library/stdtypes.html#dict.get) function. How could that help the point above?
 * [ ] Play around with dictionaries below, and get comfortable with them. We will use them a lot to collect data.

In [None]:
a['no key']
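For comparison, here's a small sketch of the two ways of handling a missing key, using a made-up `prices` dictionary rather than `a` from above.

```python
prices = {'apple': 1.25}  # made-up data for illustration

# Indexing a missing key raises KeyError
try:
    prices['pear']
except KeyError as error:
    print("Missing key:", error)

# .get() returns None, or a default you choose, instead of raising
print(prices.get('pear'))
print(prices.get('pear', 0.0))
print(prices.get('apple', 0.0))
```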

# Program Control Logic

Documentation: [https://docs.python.org/3/reference/index.html](https://docs.python.org/3/reference/index.html)

Now we know the basics of data in Python, it's time to move to the logic of programming.
I don't expect to teach you that here, this is just the syntax behind the typical logical components you would expect.

The first thing to know is about whitespace.
Python defines 'blocks' by whitespace (tabs or spaces).
This means where C would say:

```c
if (x > 5) {
    printf("String");
}
printf("Always prints");
```

Python's version would read:

```python
if x > 5:
    print("String")
print("Always prints")
```

Note two things:

  1. Python doesn't use {} to denote what belongs to the `if` statement (i.e. what will run if True), it uses the indent. To 'end' the block, just don't indent the code.
  2. Python doesn't use ; to end a command. Python assumes one command per line, which you can extend with `()` or `\` (like we saw above in strings):

In [None]:
"string we want to count t's in"\
.count('t')

In [None]:
("string we want to count t's in" # note, no \, because python knows we're in a () expression,
 .count('t'))                     # so it will keep searching until it finds the closing ).

## If-then-else

Documentation: [https://docs.python.org/3/reference/compound_stmts.html#the-if-statement](https://docs.python.org/3/reference/compound_stmts.html#the-if-statement)

If-then-else controls what blocks of code to run given certain (boolean) conditions:

In [None]:
a = 5
if a > 5:
    # This is the then part of the first if, it's indented 4 spaces.
    print("A is greater than 5")
elif a >= 3:
    # Note: this will not run if a is 6, so effectively this is also
    # a <= 5. That's because if-else runs in order.
    print("A is greater or equal to 3")
else:
    # This runs if none of the above conditions are True.
    print("If we get here, a must be less than 3.")

***TODO***
  * [ ] Play around with the above logic, changing values of a.
  * [ ] Check out '[ternary](https://docs.python.org/3/reference/expressions.html#conditional-expressions)' operators here, so you understand the following:

In [None]:
a = 5
b = "a must be < 3" if a < 3 else "a must be >= 3"
print(f"a-->{a}, b-->{b}")

## For loops

Documentation: [https://docs.python.org/3/reference/compound_stmts.html#the-for-statement](https://docs.python.org/3/reference/compound_stmts.html#the-for-statement)

For loops 'loop' over an iterable.
Remember that's like a list, from above:

In [None]:
for i in [0, 1, 2]:
    print(i, i + 1)

If you remember from above, dictionary items return a list of (key, value), so we can loop over those:

In [None]:
a = {'key1': 'value1', 'key2':'value2'}
for _key, _value in a.items():
    print(f"[{_key}] = {_value}")

There are also magical things called iterators (closely related to the 'iterables' we talked about with tuples and lists; the difference is unimportant at this point).
Iterators are specifically designed to be 'looped' over.

One example is `range(start, stop, skip)`, which just returns the numbers from `start` to `stop` (not inclusive), incrementing by `skip`:

In [None]:
for i in range(5, 25, 10):
    print(i)
# prints 5 and 15, but won't print 25, because stop is not included

But a more common use is to omit the start and skip, and just enter the stop (this is rare, but for `range` you don't have to specify that you are skipping the first argument):

In [None]:
for i in range(2):
    print(i)

These iterators are scattered throughout almost every python script or library, so it's probably best to be very comfortable with them.
(For the pedantic, yes range isn't an iterator, but it gets the point across.)

One thing about iterators is that they are NOT lists.
To see what this means in action:

In [None]:
range(2)

Notice that it doesn't print out the list of 0 and 1?
But it will if we ask it for its whole list:

In [None]:
list(range(2))

***TODO***:
  * [ ] Try loops with `break` and `continue` keywords in them ([docs](https://docs.python.org/3/reference/simple_stmts.html#break)). We use those a lot!
  * [ ] Check out the else statement that for loops have. Play around with it to learn what it does, or check the [docs](https://docs.python.org/3/reference/compound_stmts.html#the-for-statement).

In [None]:
for i in range(3):
    if i > 1: # try changing this 1 to 5
        break
    print(i)
else:
    print("What have we here?")

## List-comprehension

Documentation: [https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)

List comprehension is one of the coolest things in Python.
You can think of them like one-line for-loops:

In [None]:
[i * 5 for i in range(5)]

They are really convenient for doing things like filtering lists by using the if statement:

In [None]:
str_list = "the quick brown fox jumped over the lazy dog".split() # split with no arguments splits string by whitespace.
[s for s in str_list if s != 'the']

Or for changing some values in the list without a big for loop:

In [None]:
[s if s=='the' else s.upper() for s in str_list]

The equivalent, without a list comprehension, might look like:

In [None]:
new_str_list = str_list.copy()
for i,s in enumerate(new_str_list): # Google python enumerate to learn what it does.
    if s != 'the':
        new_str_list[i] = s.upper()
new_str_list

***TODO***
  * [ ] Play with the following examples, to get comfortable with list comprehension.
  * [ ] Glance over the data structures documentation [here](https://docs.python.org/3/tutorial/datastructures.html). It's a great reference.

In [None]:
[f"{' '*i}{s}" for i,s in enumerate(str_list)]

In [None]:
[f"{' '*(10 - len(s))}{s}" for i,s in enumerate(str_list)]

In [None]:
# Dictionary comprehension!!! But where'd the 0 go?
{s: i for i,s in enumerate(str_list)}

In [None]:
# comprehensions with () are generator expressions, which return iterators. Try putting list out front.
(s.replace('t', 'T') for s in str_list)

In [None]:
max([len(s) for s in str_list])

In [None]:
# replace dict with list to see what zip does. This isn't a comprehension, sorry.
dict(zip(str_list, 'abcdefg'))

# Functions

Documentation: [https://docs.python.org/3/reference/compound_stmts.html#function-definitions](https://docs.python.org/3/reference/compound_stmts.html#function-definitions)

Functions are our bread and butter.
You will thank yourself if you write good functions, because they will be reusable.

Functions basically wrap some code in a name which allows you to reuse that code anywhere:

In [None]:
def FUNction(): # def tells Python you're making a function, and the name before the () is the name of the function
    print("Functions are my happy place.")
FUNction()

Functions are really useful when you start passing in  parameters.
Think of parameters like $x$ in the math function $f(x)=x^2$:

In [None]:
def f(x):
    return x**2
f(5)

Notice the return statement, that exits the function and returns whatever you write after return.
Our first function didn't have a return statement, so it returned None:

In [None]:
a = FUNction()
print(f'a = {a}')

There are two ways to pass in arguments to functions, position and keyword.
Position arguments have to come before keyword, and are required, meaning when calling the function you must supply values for all positional arguments.
Keyword arguments have an = sign, and must have a default value, because they are not required.

Here's all that in action:

In [None]:
def f(x, y, math_func="power", noisy=False):
    if math_func == 'power':
        return_val = x ** y
    elif math_func.lower().startswith('mul'):
        return_val = x * y
    elif math_func.lower().startswith('div'):
        return_val = x / y
    elif math_func.lower().startswith('add'):
        return_val = x + y
    elif math_func.lower().startswith('sub'):
        return_val = x - y
    else:
        # We can throw our own exceptions to tell the user they messed up
        raise ValueError("Sorry, only power, multiply, divide, add, subtract allowed. " # we can split strings like this
                         "You passed: {}".format(math_func))
        
    if noisy:
        print(return_val)
    
    return return_val

In [None]:
f() # this error is actually helpful!

In [None]:
f(2, 3) # Why is the answer 8?

In [None]:
f(2, 3, math_func='multiply')

We don't need to name our keyword arguments, but we have to know the right order if we don't:

In [None]:
f(2, 5, 'subtract', noisy=True)

In [None]:
f(2, 5, noisy=False, math_func='subtract')

In [None]:
f(2, 5, 'error') # try out our error, can you see why it might be useful to write descriptive errors?

One thing to note, and this might not be useful now, is that we can pass dictionaries and lists into our functions to act as the arguments.
The magic beans to make this work is `*` (sometimes called 'splat') for lists, and `**` for dictionaries:

In [None]:
pos_args = [2, 5]
key_args = {'math_func': 'subtract'}
f(*pos_args, **key_args)

That might be confusing, so don't worry about it.
Just file it away in your brain, and one day it will become useful and you'll scream "EUREKA" and love your life.

The flip side of that same splat nonsense is actually in defining a function (like print, actually):

In [None]:
def crazy(*args, **kwargs): # args-kwargs is very google-able, that's almost always what these are named
    print("args is:", type(args))
    print("kwargs is:", type(kwargs))
    
    for i, one_arg in enumerate(args):
        print(f"Argument {i}: {one_arg}")
    
    for arg_key, arg_val in kwargs.items():
        print(f"Argument {arg_key}: {arg_val}")

In [None]:
crazy(1, 2, 3, 'and to the', 4, this="that", up="down")

# Reading files

Documentation: [https://docs.python.org/3.7/library/functions.html?highlight=open#open](https://docs.python.org/3.7/library/functions.html?highlight=open#open)

Reading files in Python is pretty hard, try and follow along:

In [None]:
with open("data/greatgatsby.txt") as fh:
    text = fh.read()

That's it.

Now we can view what we've read in:

In [None]:
print(text[:1000].strip().split('\n')[0])

There are many more complexities, like file encoding, error handling, etc.
But for now, that's all we need to know.
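For the curious, here's a rough sketch of what those two complexities can look like. It writes its own throwaway file so it doesn't depend on the course data, and the file names and text are made up.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("Café society\nSecond line\n")

# Being explicit about the encoding avoids platform-dependent surprises
with open(path, encoding="utf-8") as fh:
    sample = fh.read()
print(sample.split('\n')[0])

# A missing file raises FileNotFoundError, which we can handle
missing = False
try:
    with open(os.path.join(folder, "missing.txt"), encoding="utf-8") as fh:
        fh.read()
except FileNotFoundError:
    missing = True
    print("No such file, moving on.")
```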

In later lessons, we'll use something else to read/write files, because given that most of our data is tabulated, parsing all that by hand is wasted effort.

***TODO***:
  * [ ] Sit for a moment and reflect on how wonderful Python is.
  * [ ] Read this web comic: [https://xkcd.com/353/](https://xkcd.com/353/)
  * [ ] Try it yourself. 
  * [ ] Really, run that `import` command.
  * [ ] Then try `import this`

# Viewing data

Documentation: [https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html)

This is Jupyter Notebook specific, but there are some great commands for viewing data.

We'll start with things like text, which is pretty straightforward:

In [None]:
try:
    # having emoji in our prints will drive us crazy. Let's undo that.
    print = _print
except NameError:
    pass

In [None]:
print(text[:5000])

5000 characters of text is pretty long, notice how Jupyter condensed it into a little scrollable box?
You can click the left side of that box to expand it.

To get rid of long boxes, I enter command mode (press Escape), and type `my`

But what if we want to display a webpage?
For example, an 8-K?

In [None]:
with open("data/8-K.html") as fh:
    html = fh.read()

In [None]:
len(html)

In [None]:
print(html[:1000])

Eww. Let's fix that.

In [None]:
# Imports tell python we want to use another library's functionality
from IPython.display import display_html
display_html(html, raw=True)

When we get into parsing HTML documents, specifically sub-sections of them, this will be IMMENSELY helpful.

There are lots more display methods in Notebooks, you can read about them in the Documentation above.

# Regular Expressions

Documentation: [https://docs.python.org/3.6/library/re.html](https://docs.python.org/3.6/library/re.html)

Bookmark that, seriously.
It's my most visited doc page.

Regular expressions are special 'programs' to match strings that are 'regular'.
What this means in general is that if you want to find some string, but there are some minor variations in the format, regular expressions allow you to define what you're looking for in general.
For example, if we want to find "earnings announcement", but also "earnings-announcements" or "Earnings Announcement", we would write:

In [None]:
# load the regular expression library
import re

In [None]:
re_earn_annc = re.compile(r'earnings[\s-]+announcements?', re.I)

In [None]:
search_string = '''
For example, if we want to find "earnings announcement",
but also "earnings-announcements" or "Earnings 
Announcement", we would write:'''

In [None]:
re_earn_annc.findall(search_string)
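Regular expressions can also pull out just part of what they match, using parentheses to "capture" a group. Here's a minimal sketch on a made-up sentence; the pattern is deliberately simple and would need work for real filings (decimals, "million", and so on).

```python
import re

# \$ matches a literal dollar sign; the parentheses capture the digits and commas
re_dollars = re.compile(r'\$([\d,]+)')

sentence = "Revenue rose to $1,250,000 from $980,000 last year."
amounts = re_dollars.findall(sentence)
print(amounts)

# Strip the commas to turn the captured text into integers
as_numbers = [int(amount.replace(',', '')) for amount in amounts]
print(as_numbers)

# search() returns a match object with the position of the first hit
first = re_dollars.search(sentence)
print(first.start(), first.group(1))
```

The position from `first.start()` is the same kind of index we used for string slicing above.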

Learning regular expressions is a life-time pursuit, because as soon as you think you've figured it out, you come across a string that proves you wrong.

Maybe this python tutorial might help: [https://docs.python.org/3/howto/regex.html](https://docs.python.org/3/howto/regex.html)

# Extracting data

So now you're a Python expert, let's get to some useful data-extraction.

***TODO***
  1. Load the Great Gatsby text from above.
  2. Look at the first 5000 characters of the file. Notice it has a header? Let's get rid of that.
     1. Remove the header from the text, and start your new text variable at "Title:      The Great Gatsby"
  1. If each tweet is 280 characters, how many tweets would it take to put the whole Great Gatsby on twitter?
  1. How many words are in the text?
  1. How many times does Gatsby show up?
  1. How many capitalized words are there?
  1. What's the most common word in the file?
  1. Are there any numbers in the text? Which ones?
  1. When are some significant dates in the book?
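If you're not sure where to start, here is a rough sketch of the kinds of tools that might help, run on a made-up sentence rather than the real text. `Counter` and `math.ceil` come from the standard library and haven't been covered above, so treat them as suggestions, not the one true approach.

```python
import math
import re
from collections import Counter

sample = "The cat saw the dog. The dog saw 2 birds in 1922."

# Splitting on whitespace is a rough but serviceable word count
words = sample.split()
print(len(words))

# Round up: a partly filled tweet is still a tweet
tweets = math.ceil(len(sample) / 280)
print(tweets)

# Lowercase and keep only letters, so "The" and "the" count as one word
cleaned = re.findall(r"[a-z']+", sample.lower())
most_common = Counter(cleaned).most_common(1)
print(most_common)

# Runs of digits
numbers = re.findall(r'\d+', sample)
print(numbers)
```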