<h1>Advanced Python</h1>

Hi there! Welcome to Advanced Python! In this course we are going to cover some of the more advanced features of the language. We will finish off by looking at the Pandas library in depth and then try our hand at some plotting using Matplotlib and Seaborn.

This course makes the following assumptions:
* You are familiar with the Python programming language and the basic data structures
* You have an understanding of the term "Object-Oriented Programming" and you know about class structures
* Your primary interest is to use Python for data science and scientific programming

If the above doesn't quite match your experience or needs, please don't panic. You can find an introductory Python course here: https://colab.research.google.com/drive/1kDcuG1C0fI-wzoKvae2VIPHpPum2J1DT

If you have other plans for Python that are outside of the Data Science world, come speak to me after the class and I can point you in the right direction. But I promise regardless of your needs, this course will help your Python understanding.

<h2>Primer: Welcome to Jupyter Notebooks</h2>

Welcome to the world of Notebooks! Now, there are many tools you can use to write Python code: simple text editors (e.g. Vim, Notepad++, Atom) or integrated development environments, IDEs (e.g. PyCharm, Visual Studio Code). The reason I am introducing you to Jupyter Notebooks is that they allow for something called "literate programming". Code is presented in snippets, contained in cells and executed in real time, allowing for 'interactive programming'. Either side of this code, the programmer can include written English, contained in 'markdown' cells.

Jupyter Notebooks provide you with a kind of 'programmer's lab book'. You can document your code as you go along but also document your thought process, all whilst working in an environment that allows you to iterate over your work and develop efficient and effective programs.

We are currently working on Google Colab, an online notebook very similar to Jupyter Notebook. To install Jupyter Notebooks locally, I would suggest you install <a href="https://www.anaconda.com/download/">Anaconda</a>. This is a free distribution of both Python and R all in one big package, with all the data science libraries that you need along with Jupyter Notebook and R Studio, among other software.

For now, we will be working in Colab only. If you haven't done so already, please open this notebook in 'Playground Mode': go to `file` -> `open in playground mode`.

You can add and control cells using the commands on the toolbar at the top of the page. The `code` button adds a code cell; this is what you type your Python code into. The `text` button creates a markdown cell. These cells take written English but can also accept HTML formatting.

<h2>Learning Objectives</h2>

1. Using Python locally and the Python package manager `pip`
2. Strings, template strings and f-strings 
3. List comprehension
4. Python's built-in utility functions
5. Introduction to the functional paradigm
6. Advanced Pandas
7. Plotting with Matplotlib and Seaborn

<h2>Using Python locally and the Python package manager pip</h2>

I highly recommend that, if you're not currently using Python locally on your machine, you install Anaconda. Anaconda is a distribution of the Python and R programming languages, bundled for scientific programming. Most of the popular libraries ship with it, as well as tools such as Jupyter Notebook... but I've already mentioned most of this.

<img src="https://cdn-images-1.medium.com/max/2000/1*8VwF5RUh4vEf4FfrKMw7qg.png">

Along with Anaconda comes an environment and package manager named `conda`. A package manager allows you to install packages (another name for libraries) so you can use them locally.

Python actually has its own package manager called `pip`. We can use `pip` to install the packages we need. For practice, we are going to run `pip` in our notebook. Because we are in a Colab notebook, we have to tell the notebook that we are running a shell command (a command you would normally type at the command line) rather than Python code. We do this by prefixing the command with an exclamation mark.

Later on we are going to use a library called `seaborn`, and more specifically a certain version of `seaborn`. We install a package by typing `pip`, followed by the command `install`, then the package name and, for a specific version, `==VERSION_NUMBER`. Run the code cells below to install `pandas` and `seaborn`:

In [0]:
!pip install pandas



In [0]:
!pip install seaborn==0.9
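
Once a package is installed, it can be handy to confirm which version you actually ended up with. The sketch below assumes Python 3.8 or later, where the standard library provides `importlib.metadata`; the helper name `installed_version` is just something I made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if seaborn is installed, otherwise "seaborn: None"
print(f"seaborn: {installed_version('seaborn')}")
```

If the version printed is not the one you asked for, restarting the notebook runtime after installing is usually worth a try.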



There is a lot more to programming than just writing code, and much of the frustration experienced by new programmers has nothing to do with coding itself, but rather with managing the environment they code within. This is why I highly suggest you take the time to cover the following concepts using the resources below:

* <a href="https://www.learnenough.com/command-line-tutorial/basics">Learn how to use the command line!</a> It is one of the most powerful tools in your arsenal and building this essential skill will make life a lot easier going forward.
* <a href="https://www.learnenough.com/git-tutorial/getting_started"> Learn how to version control your code!</a> There is going to come a point where you mess up and break everything; it is just bound to happen. Version control software like Git tracks the changes made to your code and lets you travel back in time and restore functioning versions.
* Learn how to use environment managers! Conflicts in libraries and different dependencies between projects mean that your programming world can quickly become an absolute mess. Environment managers allow you to create a "sandbox" for each project, where the programming language and all the libraries you need are kept neat and tidy and away from everything else. There are two main contenders here: you can use <a href="https://towardsdatascience.com/getting-started-with-python-environments-using-conda-32e9f2779307">Conda</a>, the environment manager that comes with Anaconda, or you can use <a href="https://realpython.com/python-virtual-environments-a-primer/#using-different-versions-of-python">VirtualEnv</a>. Both are great, just make sure you use one!


![alt text](https://i.redd.it/imbmky0k2zh21.jpg)

<h2>1. Strings, template strings and f-strings</h2>

We know that Python, like other programming languages, stores alphanumeric 'text-like' data in a data structure known as a `string`. In Python, a `string` is an array-like data structure, meaning its elements can be accessed using square bracket notation:

In [0]:
some_string = "some words"
some_string[0:6]

'some w'
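
The square brackets can do a little more than grab a range from the start. As a quick sketch (the variable names here are only for illustration), this shows single indexes, negative indexes, open-ended slices and the optional step value:

```python
some_string = "some words"

first_char = some_string[0]          # 's' (indexing starts at zero)
last_char = some_string[-1]          # 's' (negative indexes count from the end)
last_word = some_string[5:]          # 'words' (omit the end to run to the end)
every_other = some_string[::2]       # 'sm od' (a third value sets the step)
reversed_string = some_string[::-1]  # 'sdrow emos' (a step of -1 walks backwards)

print(first_char, last_char, last_word, every_other, reversed_string)
```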

When we create a string in Python it automatically has all the methods of the string class (`str`). We will run over some of the most useful methods here, but for a full list of methods and how to use them see this link: <a href="https://www.w3schools.com/python/python_ref_string.asp">W3schools/python/strings</a>

<h3>The Join and Split Method</h3>

The `join` method takes a single argument: a list (or any other iterable) of strings. We call `join` on the string that we want to act as the "glue" joining all the elements of the given list together.

In [0]:
",".join(["words", "we","want", "to", "join"])

'words,we,want,to,join'

The above makes intuitive sense, in that we commonly separate words with a comma. But just to clarify, the string used to join the elements of the given list can be any string object:

In [0]:
" join ".join(["WORD_ONE", "WORD_TWO", "WORD_THREE"])

'WORD_ONE join WORD_TWO join WORD_THREE'

Whilst we look at the `join` method, it is worth mentioning one other method that is commonly used with it: the `split` method. This method takes one argument, the value on which to split the string, and returns a list of the resulting values. See below for an example:

In [0]:
"words and words and words".split(" ")

['words', 'and', 'words', 'and', 'words']

Notice how the string is now split into a list of strings wherever there was a space character. We can split a string on any value:

In [0]:
"bcdefgabcedfgabcedfg".split("a")

['bcdefg', 'bcedfg', 'bcedfg']

In [0]:
s = "The cat jumped, over, the hill"
s_list = s.split(",")
new_s = " ".join(s_list)
print(s_list)
print(new_s)

['The cat jumped', ' over', ' the hill']
The cat jumped  over  the hill
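
Notice the double spaces in that last output: each piece after a comma keeps its leading space, and then `join` adds another one. Here is a small sketch of a few ways around that, using the same sentence. The other strings and variable names are just made up for the example.

```python
s = "The cat jumped, over, the hill"

# Splitting on comma-plus-space removes the comma and the space together
s_list = s.split(", ")
new_s = " ".join(s_list)

# With no argument, split() breaks on any run of whitespace
words = "  lots   of   spaces  ".split()

# The optional second argument limits the number of splits performed
first, rest = s.split(", ", 1)

print(s_list)
print(new_s)
print(words)
print(first, "|", rest)
```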


Now it is your turn to try:

We have two recipe lists below, but each list is stored as a string with the elements separated by a comma. Use the `split` method to split the two strings below into their constituent items. Save the results into new variables. These new variables will be lists, and we can join lists using the `+` operator. Join the lists together and then join the elements of the combined list with a comma using the `join` method. In the end you should have a string containing all the elements from each recipe; call this `shopping_list`:

In [0]:
recipe_one = "onions,garlic,carrots,mince beef,tomatoes,mushrooms"
recipe_two = "blueberries,vanilla yogurt,banana,honey"
split_recipe_one = recipe_one.split(",")
split_recipe_two = recipe_two.split(",")


shopping_list = split_recipe_one + split_recipe_two
shopping_list = ",".join(shopping_list)

In [0]:
shopping_list

'onions,garlic,carrots,mince beef,tomatoes,mushrooms,blueberries,vanilla yogurt,banana,honey'

<h3>Format Strings</h3>

We often use string variables to store information that we want our program to output. We might have a standard message whose output varies slightly based on the actions of our program. Python provides us with a handy method and a "micro-language" to make this much easier. 

The `format` method interprets a string and inputs values based on the arguments given. We can specify where to input the arguments into the string using curly brackets:

In [0]:
x = "Word"
"Insert a word here: {}".format(x)

'Insert a word here: Word'

The values within the curly brackets are where the arguments are inserted but we can use special characters to format our text in certain ways. Firstly, we can insert multiple values using the index of the arguments:

In [0]:
"The {0} jumped over the {1} to get to the {2}".format("cat", "fence", "chickens")

'The cat jumped over the fence to get to the chickens'

In [0]:
"Python {1} {0}".format("awesome", "is")

'Python is awesome'

Or we can use named arguments:

In [0]:
"Arguments can be {one} by {two}".format(one="specified", two="name")

'Arguments can be specified by name'

We can also use bracket notation to access the elements of a list within the formatted string:

In [0]:
"Elements of the list: {0[0]},{0[1]}".format(["one", "two"])

'Elements of the list: one,two'

Or we can use the `*` asterisk symbol to **"unpack" the elements of a list.** This is handy syntax in Python that we will use a lot:

In [0]:
"Elements of the list: {0},{1}".format(*["one", "two"])

'Elements of the list: one,two'

What is actually happening here is that the asterisk tells the `format` function "there will be some arguments coming in, but I don't know how many, just treat each element of the list as one argument". This is commonly referred to as `*args`.

**Note: `*args` and `**kwargs` are Python syntax; they are not specific to the `format` method. In fact, we can unpack lists and dictionaries like this for any function.**

You will also see `**kwargs`, which stands for "keyword arguments". This tells Python that we are specifying an unknown number of keyword arguments. We can use this to "unpack" the keys and values of a dictionary, for example:

In [0]:
fruit = {"name": "Banana",
        "colour": "yellow"}
"Hi, I'm a {name} and I'm {colour}".format(**fruit)

"Hi, I'm a Banana and I'm yellow"

In [0]:
def print_fruit(name, colour):
  print("I am a {0}".format(name))
  print("I am {0}".format(colour))
  
print_fruit(**fruit)

I am a Banana
I am yellow
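
The same two symbols also work on the receiving end, when we define a function. As a rough sketch (the function `describe` is hypothetical, invented purely for this example), a function can collect any number of positional arguments into a tuple and any number of keyword arguments into a dictionary:

```python
def describe(*args, **kwargs):
    """Collect positional arguments in a tuple and keyword arguments in a dict."""
    positional = ", ".join(str(a) for a in args)
    keywords = ", ".join("{0}={1}".format(k, v) for k, v in kwargs.items())
    return "args: [{0}] kwargs: [{1}]".format(positional, keywords)


fruit = {"name": "Banana", "colour": "yellow"}

# Passing arguments directly
print(describe(1, 2, 3))

# Unpacking a list and a dictionary at the call site
print(describe(*["one", "two"], **fruit))
```

The names `args` and `kwargs` are only a convention; it is the `*` and `**` that do the work.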


There is a sort of "micro-language" that can be used within the curly brackets to format the output in different ways. You can read more about this <a href="https://docs.python.org/3.4/library/string.html#formatspec">here</a> but let's cover some of the most commonly used options:

* Express percentages with the `%` symbol. This will multiply the value by 100 and print it with a trailing '%' symbol
* Specify the number of decimal places by using a colon, followed by ".#f" where '#' is the number of decimal places to show
* Use a comma to separate thousands

In [0]:
"Two decimal places: {0:.2f}".format(2.45333)

'Two decimal places: 2.45'

In [0]:
"Percentage: {0:%}".format(0.1232)

'Percentage: 12.320000%'

In [0]:
"Percentage two decimal place: {0:.2%}".format(0.1232)

'Percentage two decimal place: 12.32%'

In [0]:
"Big number: {0:,}".format(8237498237492740140214)

'Big number: 8,237,498,237,492,740,140,214'
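
These options can also be combined inside a single pair of curly brackets. A small sketch, with numbers I have made up for the example:

```python
# Thousands separator and two decimal places together
big_price = "Total: £{0:,.2f}".format(1234567.891)

# Percentage with one decimal place
share = "Share: {0:.1%}".format(0.1232)

# Several formatted values in one string, referenced by name
summary = "{item} costs £{price:.2f} ({discount:.0%} off)".format(
    item="honey", price=1.85, discount=0.3
)

print(big_price)
print(share)
print(summary)
```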

<h3>F-strings</h3>

New in Python 3.6 is something called 'f-strings'. These work just like format strings, except you can now put variables and expressions directly into the string and they will be evaluated on the fly; see the example below:

In [0]:
# We specify an f-string by using a preceding 'f'
s = "f-string"
f"Yo, yo, yo, I'm an {s}"

"Yo, yo, yo, I'm an f-string"

Notice how we pass in the variable `s` explicitly. This is great because we can now put complex logic into a string and it is interpreted upon execution:

In [0]:
recipe = ["blueberries", "vanilla yogurt", "banana", "honey"]
recipe_str = ','.join(["blueberries", "vanilla yogurt", "banana", "honey"])
cost = {"blueberries": 1.55,
       "vanilla yogurt": 2.2,
       "banana": 0.35,
       "honey": 1.85}

output = f"Your ingredients: {recipe_str} will cost £{sum(cost.values()):.2f}"

In [0]:
output

'Your ingredients: blueberries,vanilla yogurt,banana,honey will cost £5.95'
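
To stress the point that the curly brackets accept more than plain variable names, here is a short sketch with a method call, a dictionary lookup and a couple of built-in functions, each evaluated when the string is created. The `cost` dictionary is a cut-down copy of the one above.

```python
cost = {"blueberries": 1.55, "vanilla yogurt": 2.2, "banana": 0.35, "honey": 1.85}

# A method call inside the braces
shouting = f"I need {'honey'.upper()}!"

# A dictionary lookup combined with a format specification
lookup = f"A banana costs £{cost['banana']:.2f}"

# Built-in functions work too
count = f"There are {len(cost)} ingredients, the priciest costs £{max(cost.values()):.2f}"

print(shouting)
print(lookup)
print(count)
```

Note that the quotes inside the braces differ from the quotes around the whole string; that keeps the example working on Python versions before 3.12.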

What if we only want to know the cost of one particular item in this list of ingredients? I want you to create a function below that takes two arguments: an ingredient and a dictionary of ingredient costs. The function should check if the ingredient is in the dictionary keys and if it is, the value of the ingredient should be printed to the screen. Remember to print the value to 2 decimal places!

First try with the `format` method, then try with an f-string.

**Note:** Check out the text at the start of the function. It is contained within three quotation marks. This is what is called a **doc-string**. GOOD PROGRAMMERS USE DOC-STRINGS. Doc-strings let you add documentation to your functions so that later on you can answer the question "What the hell does that function do again?!"

In [0]:
def ingredient_price(ingredient, prices):
  """This is a function for printing the price of a given ingredient
  Args:-
  - ingredient (string): ingredient to query
  - prices (dictionary): dictionary of prices"""
  if ingredient in prices.keys():
    # :.2f prints the price to two decimal places
    print(f"The {ingredient} costs £{prices[ingredient]:.2f}")
  else:
    print("This ingredient does not have a listed price!")

In [0]:
cost.keys()

dict_keys(['blueberries', 'vanilla yogurt', 'banana', 'honey'])

In [0]:
ingredient_price("banana", cost)

The banana costs £0.35


In [0]:
help(ingredient_price)

Help on function ingredient_price in module __main__:

ingredient_price(ingredient, prices)
    This is a function for printing the price of a given ingredient
    Args:-
    - ingredient (string): ingredient to query
    - prices (dictionary): dictionary of prices



![alt text](https://i.redd.it/88bk6ymnj3s21.jpg)

<h2>2. List Comprehension</h2>

Notice how before, when we executed `sum` on `cost.values()`, we automatically assumed that all the items in the dictionary were relevant to our recipe. What if there were some other items in the dictionary that are completely irrelevant? We would end up with an incorrect price.

We could get around this by using loops and conditionals, excluding the items not relevant to us, or we can use something called "list comprehension" to generate a list of relevant items.

List comprehension allows us to generate a list of values using logic housed within square brackets. The expression is evaluated when the line runs. See the example below. I have added some irrelevant items to our `cost` dictionary and generated a list of the dictionary keys that also appear in our recipe.

In [0]:
cost = {"blueberries": 1.55,
       "vanilla yogurt": 2.2,
       "banana": 0.35,
       "honey": 1.85,
       "apple": 0.25,
       "grapes": 2,
       "TV": 125}

recipe = ["blueberries", "vanilla yogurt", "banana", "honey", "chocolate chips"]

recipe_with_cost_data = [ingredient for ingredient in cost.keys() if ingredient in recipe]

recipe_with_cost_data

['blueberries', 'vanilla yogurt', 'banana', 'honey']

The second-to-last line is our list comprehension. Within the square brackets is a `for` loop and an `if` statement. What happens here is: we loop through all the keys in `cost` (accessed using the `keys()` method) and assign each one in turn to `ingredient`. We then say, if `ingredient` is `in recipe` (if its value can be found within the recipe list) then we use `ingredient` as a value in our new list, otherwise we do nothing and move on to the next item.

But we want to know the total cost. What we need to do is loop over cost's `items` and, if the `key` is in `recipe`, keep that key's `value`.

In [0]:
recipe_cost_data = [price for ingredient, price in cost.items() if ingredient in recipe]

In [0]:
sum(recipe_cost_data)

5.949999999999999

So what happened here? `cost.items()` returns the key-value pairs as a sequence of tuples. We loop over these tuples and, for each one, we assign the first element (the key) to `ingredient` and the second element (the value) to `price`. We then say "if that ingredient is found in our recipe, append the value to our new list, otherwise do nothing". Notice how we specify `price` as the value we want to append to our list by placing it at the start of our list comprehension statement. The total is not exactly 5.95 because of floating point rounding, which is why we format prices to two decimal places when we print them.
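
If the one-liner still feels a bit magic, it may help to see it next to the loop it replaces. This is only a sketch for comparison, reusing the same `cost` and `recipe` data; the names `loop_result` and `comprehension_result` are mine.

```python
cost = {"blueberries": 1.55, "vanilla yogurt": 2.2, "banana": 0.35,
        "honey": 1.85, "apple": 0.25, "grapes": 2, "TV": 125}
recipe = ["blueberries", "vanilla yogurt", "banana", "honey", "chocolate chips"]

# The long way: a loop, a conditional and an append
loop_result = []
for ingredient, price in cost.items():
    if ingredient in recipe:
        loop_result.append(price)

# The list comprehension packs the same three steps into one line
comprehension_result = [price for ingredient, price in cost.items() if ingredient in recipe]

print(loop_result == comprehension_result)
```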

Let's create a new function that takes as its arguments: a list of ingredients and a dictionary of costs. This function will first generate a list of recipe items without costs and print them to the screen with an error message. Then it will generate a total cost for all items it does have price data for:

In [0]:
cost = {"blueberries": 1.55,
       "vanilla yogurt": 2.2,
       "banana": 0.35,
       "honey": 1.85,
       "apple": 0.25,
       "grapes": 2,
       "TV": 125}

recipe = ["blueberries", "vanilla yogurt", "banana", "honey", "chocolate chips"]

def recipe_cost(ingredients, prices):
    no_data = [i for i in ingredients if i not in prices.keys()]
    print("The following items have no price data: {0}".format(",".join(no_data)))
    
    total_price = sum([p for i, p in prices.items() if i in ingredients])
    print(f"The total cost for known items: £{total_price:.2f}")

In [0]:
recipe_cost(recipe, cost)

The following items have no price data: chocolate chips
The total cost for known items: £5.95


Another nifty little thing list comprehension allows us to do is execute a function on our list elements as we create the list. So for example, below we have a list of lists. Think of each list within the list as a "column"... what I am defining here is actually what you might call a "matrix". There is a package called **Numpy** specially designed for defining and handling matrices. It can do lots of awesome things like linear algebra! If you're interested, check more out <a href="https://www.programiz.com/python-programming/matrix">here</a> or <a href="http://cs231n.github.io/python-numpy-tutorial/#numpy">here.</a>

In [0]:
m = [[3.3,4.1,6.5,2.3,3.6,5.4],
    [3.6, 92.4, 32.4, 46.4, 22.4, 23.4],
    [72.4, 23.4,75.4,21.3,44.5,99.2]]

What we can do now is use list comprehension to perform a task on each list and generate a list of the results. So for example, if we want to know the mean of each column of this matrix, we can create a function called `mean` and call it within a list comprehension.

In [0]:
def mean(*args):
    # *args arrives as a tuple, so sum() and len() work on it directly
    return sum(args) / len(args)

In [0]:
[mean(*c) for c in m]

[4.2, 36.76666666666667, 56.03333333333334]

Numpy actually has a function called **mean** that we can use to do the above. It takes a list of values as its argument.

In [0]:
import numpy as np
[np.mean(c) for c in m]

[4.2, 36.76666666666667, 56.03333333333334]

Comprehensions extend beyond lists and can be used with other data structures too. There isn't time to cover them all, but as an example for you to take away and learn more from, this is how you would create a dictionary using dictionary comprehension:

Let's say we have a list of ingredients in a recipe and we want a dictionary of prices, but the price is discounted by 30%. We could do the following:

In [0]:
cost = {"blueberries": 1.55,
       "vanilla yogurt": 2.2,
       "banana": 0.35,
       "honey": 1.85,
       "apple": 0.25,
       "grapes": 2,
       "TV": 125}

recipe = ["blueberries", "vanilla yogurt", "banana", "honey", "chocolate chips"]

{ingredient: round(price-(price*0.3), 2) for ingredient, price in cost.items() if ingredient in recipe}

{'banana': 0.24, 'blueberries': 1.08, 'honey': 1.29, 'vanilla yogurt': 1.54}

This is a tad messy, however, and the code is not reusable. Let's write a function to calculate discounts, and this time apply a 50% discount:

In [0]:
def discount(value, discount_to_apply):
  new_cost = value - (value * discount_to_apply)
  return(round(new_cost, 2))

{ingredient:discount(price, 0.5) for ingredient, price in cost.items() if ingredient in recipe}

{'banana': 0.17, 'blueberries': 0.78, 'honey': 0.93, 'vanilla yogurt': 1.1}
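
For completeness, here is a quick sketch of two other members of the comprehension family: the set comprehension and the generator expression. The recipe list has a deliberate duplicate so you can see the set dropping it; the variable names are just for illustration.

```python
recipe = ["blueberries", "vanilla yogurt", "banana", "honey", "banana"]

# Set comprehension: curly brackets without a colon, duplicates are dropped
first_letters = {ingredient[0] for ingredient in recipe}

# Generator expression: round brackets, values are produced one at a time
# and fed straight into sum() without building a list first
total_characters = sum(len(ingredient) for ingredient in recipe)

print(first_letters)
print(total_characters)
```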

![alt text](https://i.redd.it/azu39vxdwjg21.png)

<h2>3. Python's built-in utility functions</h2>

In the last example we just saw one of Python's many built-in utility functions: `round`. There are lots of these functions that you can use. The best advice I can give is, once you are comfortable with the Python syntax and can navigate the terminology, get stuck into a project and Google/learn about these functions as and when you need them. Stack Overflow is your best friend.

I'm going to quickly introduce a few that I find incredibly useful and I use in my day-to-day data science work:

* any
* all
* zip
* enumerate
* open

<h3>Any</h3>

The `any` function takes a list of boolean values and reduces it to a single boolean value based on whether any single value is true. It is commonly used with list comprehension, like so:

In [0]:
any([True, True, False, False])

True

In [0]:
any([False, False, False, False])

False

In [0]:
x = [2,6,2,45,2,3,4]
[y > 100 for y in x]

[False, False, False, False, False, False, False]

In [0]:
x = [2,6,2,45,2,3,4]
any([y > 100 for y in x])

False

In [0]:
any([y > 10 for y in x])

True

<h3>All</h3>

The `all` function is very similar to `any`; however, to return a value of `True`, all of the values within the given list must be `True`:

In [0]:
all([True, True, False, False])

False

In [0]:
all([True, True, True, True])

True

In [0]:
x = [2,6,2,45,2,3,4]
all([y > 100 for y in x])

False

In [0]:
all([y > 10 for y in x])

False

In [0]:
all([y > 1 for y in x])

True
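
Both `any` and `all` accept any iterable, not just a list, so the square brackets can be dropped. The sketch below shows that form, plus an edge case that catches people out: what the two functions return for an empty list.

```python
x = [2, 6, 2, 45, 2, 3, 4]

# Without the square brackets we pass a generator expression instead of a list;
# any() can then stop as soon as it meets the first True value
has_big_value = any(y > 10 for y in x)
all_positive = all(y > 0 for y in x)

# Edge case worth knowing: on an empty input, any() is False and all() is True
empty_any = any([])
empty_all = all([])

print(has_big_value, all_positive, empty_any, empty_all)
```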

<h3>Zip</h3>

The `zip` function takes in two or more `iterables` (that's any object whose values can be looped over one at a time, e.g. a list) and combines their values in a pairwise fashion. The zip function returns an iterator.

In [0]:
zip([1,2,3], [3,2,1])

<zip at 0x7f3b8da93508>

**GENERATORS!!**
  
What just happened above is really really **really** important. The zip function behaves like what is known as a <a href="https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/">generator</a>: instead of a list, it hands back an object that produces its values on demand. Most of the time when we create a function, we use the `return` keyword to stop that function and essentially "return control to where the function was executed". The downside to this methodology is that functions only get one chance to return their result and must return all their results at once. But there is another way!
  
 ![alt text](https://memestatic1.fjcdn.com/comments/Another+way+to+mutation+there+is+with+your+parent+_523d63ce2fc398fcb3f2a7cfbd6deb7a.jpg)
 
Generators are a special type of function that, rather than returning their results and essentially dying, `yield` a result and then maintain their state in the background, waiting to be called again, at which point they carry on from where they left off.

The advantages of this methodology are:
1. We can save multiple "states"
2. It is more memory efficient.

`zip` works in the same way. It yields each pairwise combination as and when it is needed, in sequential order, as opposed to generating a massive list of tuples that has to be stored in memory and looped over.

If this doesn't make sense, look at the example of a generator below. I can use the `next` function to get the next result that the function yields. Alternatively, I can loop over the generator and iterate through each result.

The zip object we created above works the same way: after the generator example we call `next` on it to fetch its first value. Notice how it combines 1 (the first item of the first list) and 3 (the first item of the second list).

In [0]:
def simple_generator():
  yield 1
  yield 2
  yield 3

In [0]:
our_generator = simple_generator()
next(our_generator)

1

In [0]:
next(our_generator)

2

In [0]:
next(our_generator)

3

In [0]:
for x in simple_generator():
  print(x)

1
2
3
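
To see the "maintains its state" part in action, here is a sketch of a generator that remembers a running total between calls. The function `running_total` is a hypothetical example of mine, not something built into Python.

```python
def running_total(values):
    """Yield the cumulative sum after each value."""
    total = 0
    for value in values:
        total += value   # the total survives between calls to next()
        yield total


totals = running_total([1, 2, 3, 4])
first = next(totals)       # runs until the first yield
second = next(totals)      # carries on from where it left off
remaining = list(totals)   # consumes whatever is left

print(first, second, remaining)
```

Once a generator has run out of values it is exhausted; calling `next` on it again raises a `StopIteration` error.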


In [0]:
next(zip([1,2,3], [3,2,1]))

(1, 3)

We could loop over the values or, alternatively, we can unpack them directly into a list using the asterisk symbol.

In [0]:
[*zip([1,2,3], [3,2,1])]

[(1, 3), (2, 2), (3, 1)]

In [0]:
[a for a in zip([1,2,3], [3,2,1])]

[(1, 3), (2, 2), (3, 1)]
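
A few more things worth knowing about `zip`, sketched with some made-up data: it stops at the shortest input, its pairs convert straight into a dictionary, and combined with `*` it can undo itself.

```python
names = ["blueberries", "banana", "honey"]
prices = [1.55, 0.35, 1.85, 99.0]  # one extra value on purpose

# zip stops at the shortest input, so the stray 99.0 is silently dropped
pairs = list(zip(names, prices))

# A sequence of pairs converts straight into a dictionary
price_lookup = dict(zip(names, prices))

# zip with the * operator reverses the process ("unzipping")
unzipped_names, unzipped_prices = zip(*pairs)

print(pairs)
print(price_lookup)
print(unzipped_names, unzipped_prices)
```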

<h3>Enumerate</h3>

Sometimes when we loop over an iterable we want to know the index of the value we are currently looking at. This is fine for things like dictionaries, where we can unpack the keys into a variable that we loop through, but with lists it is a bit more complicated. We could define a new variable that we increment with each loop, but that seems a bit messy. The `enumerate` function returns an iterator of tuples. For each tuple, the first element is the index of the value in the original list and the second element is the value itself.

In [0]:
l = [24,6,3,213,455,3,2,34,5]
enumerate(l)

<enumerate at 0x7f3b8da89120>

In [0]:
[*enumerate(l)]

[(0, 24), (1, 6), (2, 3), (3, 213), (4, 455), (5, 3), (6, 2), (7, 34), (8, 5)]

If we had a list and a dictionary of equal length and wanted to multiply the values of a list by the values of the dictionary, create a new dictionary with the result, and retain the keywords, we could do this like so:

In [0]:
cost = {"blueberries": 1.55,
       "vanilla yogurt": 2.2,
       "banana": 0.35,
       "honey": 1.85,
       "apple": 0.25,
       "grapes": 2,
       "TV": 125}
multiplier = [0.2, 0, 0.12, 0.4, 0.11, 0.02, 0.15]

new_cost = {}
for index, key in enumerate(cost.keys()):
    new_cost[key] = cost[key] * multiplier[index]

In [0]:
new_cost

{'TV': 18.75,
 'apple': 0.0275,
 'banana': 0.041999999999999996,
 'blueberries': 0.31000000000000005,
 'grapes': 0.04,
 'honey': 0.7400000000000001,
 'vanilla yogurt': 0.0}
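
Two variations on the above, as a rough sketch with a shortened `cost` dictionary: `enumerate` accepts a `start` value when you want numbering to begin at something other than zero, and the multiplication itself can be written without any index at all by zipping the dictionary items with the multipliers.

```python
cost = {"blueberries": 1.55, "vanilla yogurt": 2.2, "banana": 0.35}
multiplier = [0.2, 0, 0.12]

# start=1 gives human-friendly numbering
for position, name in enumerate(cost, start=1):
    print(f"{position}. {name}")

# zip pairs each (key, value) with its multiplier, so no index is needed
new_cost = {key: value * m for (key, value), m in zip(cost.items(), multiplier)}

print(new_cost)
```

This relies on dictionaries keeping their insertion order, which Python guarantees from version 3.7 onwards.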

<h3>Open</h3>

The last function worth mentioning is the `open` function. I will not demonstrate this here because we are working in the Google cloud environment but try this at home on your local machine.

`open`, which is detailed <a href="https://docs.python.org/3/library/functions.html#open">here</a>, takes a file path as its first argument, followed by `mode` which is specified as "w" for write or "r" for read. This utility function is commonly used for reading and writing to text files.
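
For when you do try this at home, here is a minimal sketch of writing a file and reading it back. The file name `shopping_list.txt` is made up, and I write into a temporary folder only so the example does not clutter your working directory. The `with` statement is the usual way to make sure the file is closed again afterwards.

```python
import os
import tempfile

# Write into a temporary folder so we do not clutter the working directory
file_path = os.path.join(tempfile.mkdtemp(), "shopping_list.txt")

# "w" creates the file (or overwrites it); the with-block closes it for us
with open(file_path, "w") as f:
    f.write("onions\ngarlic\ncarrots\n")

# "r" opens it for reading; iterating over the file yields one line at a time
with open(file_path, "r") as f:
    items = [line.strip() for line in f]

print(items)
```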

![alt text](https://i.imgur.com/BxBM1Oh.jpg)

<h2>4. Introduction to the Functional Paradigm</h2>

First off, what the hell is a 'paradigm'? This is just a fancy word to describe "how" you do something. Two of the main paradigms in programming are Object-Oriented Programming (OOP) and Functional Programming. Think of it like Muay Thai and Taekwondo: they are both trying to achieve the same thing but going about it in different ways. Now some people will argue that one is better than the other, and people in each camp are always bad-mouthing those in the other camp. But like with martial arts, the person who knows both and can use both interchangeably (the mixed martial artist) tends to win.

Python allows for both programming paradigms. But what is the difference?

* OOP - this is probably the paradigm you're most familiar with at the moment, especially if you attended my intro course. It is based around the concept of creating objects which inherit properties and methods from classes. OOP is heavily based around abstraction and inheritance.
* Functional - this paradigm focuses on treating a problem as a series of mathematical functions that can be evaluated in any order and are not dependent upon some outside state. This idea of `state` is really important. In a functional paradigm you treat everything as immutable.

We've actually been using a functional paradigm already, I just didn't tell you. List comprehension is functional and borrowed from Haskell, a purely functional language.

The real advantages of a functional paradigm are: avoiding mutable states (which can be dangerous), cleaner code, and parallel programming (functions can be executed together in parallel).

That being said, Python is not a functional programming language; it merely borrows some of the better concepts from functional programming. You should not avoid using mutable objects altogether but instead learn from functional programming to create cleaner, more efficient code. The easiest way to do this is, when doing something in Python, to ask **do I really want to change X or do I want a new X?**
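
A small sketch of that question in practice, using sorting as the example (the price list is made up). `sorted` gives you a new X, whereas the list method `sort` changes X itself:

```python
prices = [2.2, 0.35, 1.85]

# "I want a new X": sorted() returns a new list and leaves the original alone
sorted_prices = sorted(prices)

# "I want to change X": list.sort() reorders the list in place and returns None
mutated = list(prices)   # work on a copy so the contrast stays visible
result = mutated.sort()

print(prices)         # still in the original order
print(sorted_prices)
print(mutated, result)
```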

The operations that we discuss next become really useful for scripting: programming mostly conducted in the field of data science where operations are often applied to large amounts of data that flow in one direction in a pipeline to generate some sort of output.

Let's get started with some of the functional operators provided by Python, we'll cover them in this order:

* lambda
* map
* filter

<h3>Lambda - creating anonymous functions</h3>

Lambda allows us to create "unnamed" or "anonymous" functions. Rather than creating a function with the `def` keyword and referring to it by name, we can use `lambda` to create a sort of "one off" function.

We create a function by using the `lambda` keyword, followed by the arguments, a colon (marking the start of the function's internal workings), then the expression evaluated by the function. The result of that expression is returned automatically.

In [0]:
f = lambda x: x*12
f(2)

24

In [0]:
f = lambda x, y: (x*12) + y
f(2, 4)

28
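
One place you will bump into `lambda` all the time is the `key` argument of built-in functions such as `sorted` and `max`. As a sketch, using a list of (name, range) pairs that I have put together for the example:

```python
aircraft_ranges = [("Boeing 777", 8555), ("Airbus A320", 6500), ("Boeing 737", 3235)]

# key= expects a function; a lambda saves us defining one with def
by_range = sorted(aircraft_ranges, key=lambda pair: pair[1])
longest = max(aircraft_ranges, key=lambda pair: pair[1])

print(by_range)
print(longest)
```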

<h3>Map</h3>

The `lambda` operator doesn't really come into its own until we combine it with other functions. One such function is `map`. This function takes two or more arguments: a function and one or more iterables (e.g. a list; nine times out of ten we use lists). The `map` function applies the given function to each element of the iterable. It returns an iterator of results, which we can pass into `list()` to get a list of the results:

In [0]:
list(map(lambda x: x+3, [1,2,3]))

[4, 5, 6]

In [0]:
list(map(lambda x: (x*10)/6, [64,384,278,45]))

[106.66666666666667, 640.0, 463.3333333333333, 75.0]

So here we have used `lambda` to create an anonymous function that we use just once. If we had an operation we were going to execute on many lists, then we could create a function and use that instead:

In [0]:
def add_three(x):
    return(x+3)

list(map(add_three, [1,2,3,4,5]))

[4, 5, 6, 7, 8]

If we have multiple lists, we can use map to iterate over the values in each list in a pairwise fashion and pass them as arguments to a function:

In [0]:
list(map(lambda x,y: x*y, [1,2,3], [4,5,6]))

[4, 10, 18]
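
Two things are worth seeing side by side here, sketched below: a `map` with a `lambda` does the same job as a list comprehension, and because `map` returns an iterator, its results can only be consumed once.

```python
numbers = [1, 2, 3]

# The same transformation written two ways
mapped = list(map(lambda x: x + 3, numbers))
comprehended = [x + 3 for x in numbers]

# map is lazy: the map object hands out each result once and is then empty
lazy = map(lambda x: x * 2, numbers)
first_pass = list(lazy)
second_pass = list(lazy)

print(mapped == comprehended)
print(first_pass, second_pass)
```

Which form you use is largely a matter of taste; I find `map` reads well when the function already exists and has a name.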

Maps can be powerful in scripting situations to apply functions to multiple objects at once. In the next exercise we will look at mapping over dictionary objects. In the next code cell I have defined a function that generates dictionary objects for plane specifications. I have also defined three planes.

In [0]:
def define_aircraft(name, unit_cost_millions, capacity, max_range):
    return({"name":name, "unit_cost_millions": unit_cost_millions, 
            "capacity":capacity, "max_range": max_range})

triple_seven = define_aircraft("Boeing 777", 306, 394, 8555)
airbus = define_aircraft("Airbus A320", 77.4, 236, 6500)
seven_three_seven = define_aircraft("Boeing 737", 89.1, 210, 3235)

We can use `map` to perform tasks on a list of these dictionary objects such that we access them one at a time:

In [0]:
dict(map(lambda x: (x["name"], "This is a {0} and it costs ${1} million".format(x["name"], x["unit_cost_millions"])), [triple_seven, airbus, seven_three_seven]))

{'Airbus A320': 'This is a Airbus A320 and it costs $77.4 million',
 'Boeing 737': 'This is a Boeing 737 and it costs $89.1 million',
 'Boeing 777': 'This is a Boeing 777 and it costs $306 million'}
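The trick in the cell above is that the function returns a `(key, value)` tuple, and `dict()` accepts any sequence of such pairs. As a rough sketch, using a cut-down copy of two of the aircraft and a hypothetical helper called `describe`, the same result can be reached with a dictionary comprehension:

```python
# Cut-down, illustrative copies of two of the aircraft dictionaries.
planes = [
    {"name": "Boeing 777", "unit_cost_millions": 306},
    {"name": "Airbus A320", "unit_cost_millions": 77.4},
]

def describe(plane):
    # Return a (key, value) pair; dict() turns a sequence of pairs into a dictionary.
    text = "This is a {0} and it costs ${1} million".format(
        plane["name"], plane["unit_cost_millions"])
    return (plane["name"], text)

# Version 1: map the function over the list and hand the pairs to dict().
from_map = dict(map(describe, planes))

# Version 2: build the same dictionary with a comprehension.
from_comprehension = {plane["name"]: describe(plane)[1] for plane in planes}

print(from_map == from_comprehension)  # True
```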

Now let's take a *really* simplified approach to air travel and say each ticket costs \$300. We will also say that each full-range trip costs \$200 per 100 nautical miles to run. Can we generate a dictionary where the key is the name of the plane and the value is the number of trips at full capacity that would need to be flown to make our money back on the purchase of that aircraft?

In the cell below, complete the function that calculates the number of trips required, then use `map` to apply the function to each aircraft, storing the result as a dictionary.

HINT: we want to multiply passenger capacity by 300 to get the revenue generated by one flight, and multiply the max range, divided by 100, by 200 to get the cost per flight.

In [0]:
def trips_required(aircraft):
    cost = (aircraft["max_range"]/100)*200
    revenue = aircraft["capacity"] * 300
    profit = revenue - cost
    trips = (aircraft["unit_cost_millions"]*1000000)/profit
    return((aircraft["name"], trips))

In [0]:
dict(map(trips_required, [triple_seven, seven_three_seven, airbus]))

{'Airbus A320': 1339.1003460207612,
 'Boeing 737': 1576.1542543782064,
 'Boeing 777': 3027.005638539915}

<h3>Filter</h3>

The `filter` function is like a special kind of `map` function. Its objective is to take an iterable and reduce its contents based on a condition. It's great for....well, for filtering!

Let's look at an example. Only the values in the list that yield a true value when passed into the given function will be kept:

In [0]:
list(filter(lambda x: x > 100, [12, 344, 5, 233, 4, 34, 56, 192, 100]))
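For reference, here is a small sketch of what this call produces, alongside the equivalent list comprehension. The variable names are illustrative only. It also shows that passing `None` instead of a function keeps every "truthy" element.

```python
values = [12, 344, 5, 233, 4, 34, 56, 192, 100]

# Keep only the values for which the lambda returns True.
kept = list(filter(lambda x: x > 100, values))
print(kept)  # [344, 233, 192]  (100 itself is not greater than 100)

# The same selection written as a list comprehension.
same = [x for x in values if x > 100]

# Passing None as the function keeps every element that is truthy.
truthy = list(filter(None, [0, 1, "", "a", None, 2]))
print(truthy)  # [1, 'a', 2]
```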

Let's try using this with our aircraft example. I've defined a few more aircraft below. Modify the `trips_required` function so that it returns just the number of trips required, rather than a tuple. Then use `filter` to select only the aircraft with fewer than 2000 trips required.

In [0]:
bombardier = define_aircraft("Bombardier Dash 8", 14.3, 68, 1100)
embraer = define_aircraft("Embraer E-175", 53.5, 124, 2300)

In [0]:
def trips_required(aircraft):
    cost = (_______/___)*___
    revenue = _________ * ___
    profit = ______ - _____
    trips = (_____________*1000000)/profit
    return(_____)

In [0]:
list(filter(____ _: _________ < 2000, [triple_seven, airbus, seven_three_seven, bombardier, embraer]))

There are more functions like this provided by Python; they are referred to as "higher-order functions". Give that term a Google to learn more, or explore the `functools` library.
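As a taster, here is a minimal sketch of two tools from `functools`: `reduce`, which folds a sequence down to a single value, and `partial`, which pre-fills some of a function's arguments. The helper `multiply` and the other names are made up for this example.

```python
from functools import partial, reduce

# reduce applies the function cumulatively: ((1 + 2) + 3) + 4
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])
print(total)  # 10

def multiply(x, factor):
    return x * factor

# partial returns a new function with factor already fixed at 12.
times_twelve = partial(multiply, factor=12)

scaled = list(map(times_twelve, [1, 2, 3]))
print(scaled)  # [12, 24, 36]
```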

![alt text](https://preview.redd.it/3m7q3po48ho21.png?width=960&crop=smart&auto=webp&s=45a28d0371410ce8f1c96c1c44679ce713319434)

<h2>5. Advanced Pandas</h2>

![alt text](https://bear-joneskilmartingr.netdna-ssl.com/news/wp-content/uploads/panda.jpg)

The Pandas library should be well known to any data scientist. It provides a backbone for handling large datasets in the Python programming environment. It is heavily influenced by R, and those familiar with the R programming language will notice a lot of similarities, especially an emphasis on "tidy data". For those not familiar with Pandas, I recommend checking out my Introduction to Python course here: https://goo.gl/HB4psR or alternatively search for "Python Pandas" on YouTube or LinkedIn Learning.

Just a quick recap for those that don't know: Pandas has two main data structures, `pandas.Series` and `pandas.DataFrame`. The `Series` class can be thought of as a single column which contains an array of data. The `DataFrame`, on the other hand, is like a massive table, where each column is a `Series` object.
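A tiny sketch of that relationship, using a few made-up rows rather than the real datasets: we build a `Series`, place it in a `DataFrame`, and confirm that selecting a single column hands back a `Series`.

```python
import pandas as pd

# A Series: one named column of values.
prices = pd.Series([26.14, 25.04, 25.36], name="price")

# A DataFrame: a table whose columns are Series sharing one index.
table = pd.DataFrame({"symbol": ["MSFT", "MSFT", "MSFT"], "price": prices})

# Selecting a single column with bracket notation returns a Series.
column = table["price"]
print(type(column).__name__)  # Series
print(table.shape)            # (3, 2)
```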

I'm going to borrow some data from the `vega_datasets` library so we can play with some of Pandas' functionality. I think the best way to learn data science is by working on a project. So here we will load in two datasets: one which details US stock prices for large technology companies (the big 5) and one which details US employment. We will use Pandas and then Matplotlib to see if we can identify any correlation between the two. Whilst doing so we will learn some of the core principles of both libraries.

I'm going to go over Matplotlib in detail as it is the most widely used plotting library in Python and a lot of other libraries build upon its core. I will also show a couple of other libraries in brief, which you can investigate on your own.

In terms of this project, we are going to do the following:

* Take a broad look at the data and consider data types and discuss how to handle missing data
* Merge datasets together
* We will see how to create new columns and whilst doing so we will introduce the `apply` method
* Learn how to group data and do stuff with the groups

In [0]:
import pandas as pd
from vega_datasets import local_data

In [0]:
stocks = local_data.stocks()
employment = local_data.us_employment()

In [0]:
#pprint is a library called "Pretty Print". It makes printing large amounts of
#text to screen easier.
from pprint import pprint
pprint(local_data.us_employment.description)

('In the mid 2000s the global economy was hit by a crippling recession. One '
 'result: Massive job losses across the United States. The downturn in '
 'employment, and the slow recovery in hiring that followed, was tracked each '
 'month by the Current Employment Statistics [1]_ program at the U.S. Bureau '
 'of Labor Statistics. This file contains the monthly employment total in a '
 'variety of job categories from January 2006 through December 2015. The '
 'numbers are seasonally adjusted and reported in thousands. The data were '
 'downloaded on Nov. 11, 2018, and reformatted for use in this library. '
 'Because it was initially published by the U.S. government, it is in the '
 "public domain. Totals are included for the 22 'supersectors' [2]_ tracked by "
 "the BLS. The 'nonfarm' total is the category typically used by economists "
 "and journalists as a stand-in for the country's employment total. A "
 "calculated 'nonfarm_change' column has been appended with the month-to-month 

In [0]:
pprint(local_data.stocks.description)

('Daily closing stock prices for AAPL, AMZN, GOOG, IBM, and MSFT between 2000 '
 'and 2010.')


The first thing we will want to do is take a look at our data. Pandas will load it in and do its best to interpret the data types, but we should check and make sure it's done it right. We do this using the `info` method.

In [0]:
stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560 entries, 0 to 559
Data columns (total 3 columns):
symbol    560 non-null object
date      560 non-null datetime64[ns]
price     560 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 13.2+ KB


In [0]:
employment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 24 columns):
month                                 120 non-null object
nonfarm                               120 non-null int64
private                               120 non-null int64
goods_producing                       120 non-null int64
service_providing                     120 non-null int64
private_service_providing             120 non-null int64
mining_and_logging                    120 non-null int64
construction                          120 non-null int64
manufacturing                         120 non-null int64
durable_goods                         120 non-null int64
nondurable_goods                      120 non-null int64
trade_transportation_utilties         120 non-null int64
wholesale_trade                       120 non-null float64
retail_trade                          120 non-null float64
transportation_and_warehousing        120 non-null float64
utilities                        

Most of it looks okay. The second column gives the number of non-null values in each column; every column shown reports 120 non-null values out of 120 entries, but we will still check for missing values explicitly later on. There are some other questions we should ask as well:

* The `month` column has been brought in as an `object`
  * The third column in the table above shows the data type. This will typically be one of the following: int32/64, float32/64, datetime, or object. Object means that Pandas has stored the values as generic Python objects, usually strings, either because the column is text or because it couldn't work out how to handle the data
* Most other columns have been loaded in as integers, except `wholesale_trade`, `retail_trade`, `transportation_and_warehousing`, and `utilities`, which are floats
* We know from the dataset description that the column `nonfarm` is the total number of jobs for the given date whereas `nonfarm_change` is the total change in jobs from the last period. Let's rename these columns so they are a bit more sensible sounding.
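If you only want the data types, without the rest of the `info` summary, the `dtypes` property is a lighter alternative. The sketch below uses a hypothetical three-column frame called `sample`, not the real employment data, and checks the types with helpers from `pd.api.types`.

```python
import pandas as pd

# A made-up miniature of the employment table.
sample = pd.DataFrame({
    "month": ["2006-01-01", "2006-02-01"],  # dates stored as plain text
    "nonfarm": [135450, 135762],            # whole numbers
    "utilities": [549.8, 550.1],            # decimals
})

# dtypes returns a Series mapping each column name to its data type.
print(sample.dtypes)

# Helper functions for testing a column's type in code.
is_int = pd.api.types.is_integer_dtype(sample["nonfarm"])
is_float = pd.api.types.is_float_dtype(sample["utilities"])
is_number = pd.api.types.is_numeric_dtype(sample["month"])
print(is_int, is_float, is_number)  # True True False
```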

Let's deal with the first problem. We can use bracket notation to select the `month` column and take a closer look:

In [0]:
employment["month"].head()

0    2006-01-01
1    2006-02-01
2    2006-03-01
3    2006-04-01
4    2006-05-01
Name: month, dtype: object

The month column is in the wrong data format and it has a silly name, so let's change this. We can change data types using some of Pandas' built-in functions. They follow the naming pattern `pandas.to_FORMAT`. We will use `pandas.to_datetime`. The details for this can be found in the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html"> documentation</a>. Pandas has some of the best documentation I have ever seen for a programming library; I highly recommend you go to this first when you have a problem.

Before we change the `month` column's datatype, let's have a quick look at the formatting of the `date` column in `stocks`, because we will eventually merge these two tables.

In [0]:
stocks["date"].head()

0   2000-01-01
1   2000-02-01
2   2000-03-01
3   2000-04-01
4   2000-05-01
Name: date, dtype: datetime64[ns]

Awesome! The date formatting appears to match in both tables. If we wanted to specify the format of our datetime, we would pass the argument `format` in our call to `pd.to_datetime`, e.g. `pd.to_datetime(COLUMN, format="%d%m%Y")`.
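To show what the `format` argument does, here is a minimal sketch on an invented column called `raw`, where dates are written as day, month, year with no separators. Our real data does not need this, since it is already in the standard year-month-day layout.

```python
import pandas as pd

# Made-up dates written as DDMMYYYY, e.g. "01022006" is 1 February 2006.
raw = pd.Series(["01012006", "01022006", "01032006"])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%d%m%Y")
print(parsed)
```

Without the `format` argument, Pandas would have to guess how to read strings like these, so being explicit is the safer choice for unusual layouts.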

In [0]:
employment["month"] = pd.to_datetime(employment["month"])

In [0]:
employment["month"].head(4)

0   2006-01-01
1   2006-02-01
2   2006-03-01
3   2006-04-01
Name: month, dtype: datetime64[ns]

Let's also give this column a more sensible name, like "date". We do this using the `rename` method. Unlike the `to_datetime` function, `rename` is a method that belongs to the `DataFrame` class. By default it returns a new DataFrame, which we need to assign to a variable (as we do below). Alternatively, we can set the argument `inplace` to `True`, which modifies the existing DataFrame instead.

In [0]:
#First argument is a dictionary specifying the column we want to change (key)
#and the name we want to change it to (value)
#axis=1 tells the method we want to change column names as opposed to row
#names (this would be axis=0)
employment = employment.rename({"month": "date"}, axis=1)
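As a side note, `rename` also accepts a `columns` keyword, which some people find more readable than `axis=1`. The sketch below uses a made-up two-column frame called `sample` and also shows that the original DataFrame is left untouched unless we reassign.

```python
import pandas as pd

sample = pd.DataFrame({"month": ["2006-01-01"], "nonfarm": [135450]})

# columns= is an alternative to passing the mapping together with axis=1.
renamed = sample.rename(columns={"month": "date"})

print(list(renamed.columns))  # ['date', 'nonfarm']
print(list(sample.columns))   # ['month', 'nonfarm'] (the original is unchanged)
```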

In [0]:
employment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 24 columns):
date                                  120 non-null datetime64[ns]
nonfarm                               120 non-null int64
private                               120 non-null int64
goods_producing                       120 non-null int64
service_providing                     120 non-null int64
private_service_providing             120 non-null int64
mining_and_logging                    120 non-null int64
construction                          120 non-null int64
manufacturing                         120 non-null int64
durable_goods                         120 non-null int64
nondurable_goods                      120 non-null int64
trade_transportation_utilties         120 non-null int64
wholesale_trade                       120 non-null float64
retail_trade                          120 non-null float64
transportation_and_warehousing        120 non-null float64
utilities                

I now want you to change some more column names using the rename method like we did above:

* In the employment DataFrame, change `nonfarm` to `total_jobs` and change `nonfarm_change` to `total_change`
* The `symbol` column name in `stocks` refers to the symbol used for each tech company on the stock market. Change the name `symbol` to `tech_company`

In [0]:
employment = employment.rename({"nonfarm": "total_jobs", "nonfarm_change": "total_change"}, axis=1)
stocks = stocks.rename({"symbol": "tech_company"}, axis=1)

In [0]:
employment.columns

Index(['date', 'total_jobs', 'private', 'goods_producing', 'service_providing',
       'private_service_providing', 'mining_and_logging', 'construction',
       'manufacturing', 'durable_goods', 'nondurable_goods',
       'trade_transportation_utilties', 'wholesale_trade', 'retail_trade',
       'transportation_and_warehousing', 'utilities', 'information',
       'financial_activities', 'professional_and_business_services',
       'education_and_health_services', 'leisure_and_hospitality',
       'other_services', 'government', 'total_change'],
      dtype='object')

In [0]:
stocks.columns

Index(['tech_company', 'date', 'price'], dtype='object')

Now let's look at why some of the columns in `employment` are `float64` whilst the rest are integers:

In [0]:
employment[["wholesale_trade", "retail_trade", "transportation_and_warehousing","utilities"]].head()

Unnamed: 0,wholesale_trade,retail_trade,transportation_and_warehousing,utilities
0,5840.4,15351.5,4420.0,549.8
1,5854.8,15361.3,4429.4,550.1
2,5873.3,15388.0,4429.7,547.5
3,5886.9,15348.5,4445.4,548.9
4,5897.2,15318.1,4459.4,548.3


In [0]:
employment[["information", "financial_activities", "professional_and_business_services", "education_and_health_services"]].head()

Unnamed: 0,information,financial_activities,professional_and_business_services,education_and_health_services
0,3052,8307,17299,17946
1,3052,8332,17365,17998
2,3055,8348,17438,18045
3,3046,8369,17462,18070
4,3039,8376,17512,18100


So it is not clear why there is a difference here and the documentation doesn't shed light on it either. All we can assume is that the sectors whose job totals are reported to 1 decimal place have higher-precision employment statistics. If this was a real data science project you would go away and research why this difference has occurred.

We are going to assume that the difference is due to precision. To be consistent, let's change the data type of all the other columns (currently integers) to floats.

In [0]:
#We can get the column names as a list by accessing the columns property and using the tolist() method
cols = employment.columns.tolist()
#Then use list comprehension to get a list of the columns we want to change
#We do this by excluding the columns we don't want to change
dont_change = ["wholesale_trade", "retail_trade", "transportation_and_warehousing","utilities", "date"]
to_change = [x for x in cols if x not in dont_change]

In [0]:
to_change

['total_jobs',
 'private',
 'goods_producing',
 'service_providing',
 'private_service_providing',
 'mining_and_logging',
 'construction',
 'manufacturing',
 'durable_goods',
 'nondurable_goods',
 'trade_transportation_utilties',
 'information',
 'financial_activities',
 'professional_and_business_services',
 'education_and_health_services',
 'leisure_and_hospitality',
 'other_services',
 'government',
 'total_change']

I want you to take the columns in `to_change` and I want you to convert them to a `float` data type. Now this is a tricky one, I've done this on purpose. Try searching Google, looking at the documentation and Stack Overflow, and find a way to do this. You will be turning to Google a lot, trust me! It is normal!

In [0]:
employment[to_change] = employment[to_change].astype(float)
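There is more than one way to do this. As a sketch on a small invented frame (`sample`, with column names borrowed from the employment table), `astype` can be applied to a selection of columns as above, or given a dictionary mapping each column to its new type:

```python
import pandas as pd

sample = pd.DataFrame({
    "information": [3052, 3052],
    "utilities": [549.8, 550.1],
    "government": [21847, 21878],
})

# Option 1: select the columns, convert them, and assign them back.
to_float = ["information", "government"]
by_selection = sample.copy()
by_selection[to_float] = by_selection[to_float].astype(float)

# Option 2: pass a {column: type} dictionary; other columns are left alone.
by_mapping = sample.astype({"information": float, "government": float})

print(by_mapping.dtypes)
```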

In [0]:
employment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 24 columns):
date                                  120 non-null datetime64[ns]
total_jobs                            120 non-null float64
private                               120 non-null float64
goods_producing                       120 non-null float64
service_providing                     120 non-null float64
private_service_providing             120 non-null float64
mining_and_logging                    120 non-null float64
construction                          120 non-null float64
manufacturing                         120 non-null float64
durable_goods                         120 non-null float64
nondurable_goods                      120 non-null float64
trade_transportation_utilties         120 non-null float64
wholesale_trade                       120 non-null float64
retail_trade                          120 non-null float64
transportation_and_warehousing        120 non-null float64
uti

**Missing data**

Let's talk about missing data quickly. We can see if data is missing by using the `isnull` method. This will return a `mask`: a table of boolean values that specify whether each cell contains missing data. This is not very useful on its own, but we can chain the `sum` method to count the missing values in each column:

In [0]:
stocks.isnull().head()

Unnamed: 0,tech_company,date,price
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False


In [0]:
stocks.isnull().sum()

tech_company    0
date            0
price           0
dtype: int64

In [0]:
employment.isnull().sum()

date                                  0
total_jobs                            0
private                               0
goods_producing                       0
service_providing                     0
private_service_providing             0
mining_and_logging                    0
construction                          0
manufacturing                         0
durable_goods                         0
nondurable_goods                      0
trade_transportation_utilties         0
wholesale_trade                       0
retail_trade                          0
transportation_and_warehousing        0
utilities                             0
information                           0
financial_activities                  0
professional_and_business_services    0
education_and_health_services         0
leisure_and_hospitality               0
other_services                        0
government                            0
total_change                          0
dtype: int64

Wow! Aren't we lucky! FYI this **never** happens in real life! If we had missing data how would we deal with it? I'm going to make a fake dataset to demonstrate:

In [0]:
fake = pd.DataFrame({"A":[9808,42893,None,342083, 4233], "B":[2832,235,5435,13443,4243], "C": [39823, None, 3123, None, 323]})

In [0]:
fake

Unnamed: 0,A,B,C
0,9808.0,2832,39823.0
1,42893.0,235,
2,,5435,3123.0
3,342083.0,13443,
4,4233.0,4243,323.0


`NaN` ("Not a Number") is the placeholder Pandas uses for missing numeric data.

---



In [0]:
fake.isnull().sum()

A    1
B    0
C    2
dtype: int64

Pandas DataFrames have a method called `fillna` that will....well fill null values! We just simply have to specify the value to fill it with. A common approach is to fill missing values with the mean from the column:

In [0]:
print("The mean value is {}".format(fake["A"].mean()))
fake["A"].fillna(fake["A"].mean())

The mean value is 99754.25


0      9808.00
1     42893.00
2     99754.25
3    342083.00
4      4233.00
Name: A, dtype: float64

Or we could just specify a value like 0

In [0]:
fake["C"].fillna(0)

0    39823.0
1        0.0
2     3123.0
3        0.0
4      323.0
Name: C, dtype: float64
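Two further options are worth knowing about, sketched here on the same fake data: filling every column with its own mean in one go, and simply dropping any row that has a missing value with `dropna`. Which approach is appropriate depends entirely on your data, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

fake = pd.DataFrame({"A": [9808, 42893, None, 342083, 4233],
                     "B": [2832, 235, 5435, 13443, 4243],
                     "C": [39823, None, 3123, None, 323]})

# fake.mean() is a Series of column means; fillna matches it to columns by name.
filled = fake.fillna(fake.mean())
print(filled.isnull().sum().sum())  # 0

# dropna removes every row that contains at least one missing value.
dropped = fake.dropna()
print(dropped.index.tolist())  # [0, 4]
```

Note that neither call changes `fake` itself; both return new DataFrames.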

<h3>Merging and pivoting DataFrames</h3>

We now have two datasets that we would like to merge together: `stocks` and `employment`. We want to merge them based on the `date` column. First of all, let's check that the date columns in both tables have the same range and interval:

In [0]:
employment["date"].min()

Timestamp('2006-01-01 00:00:00')

In [0]:
employment["date"].max()

Timestamp('2015-12-01 00:00:00')

In [0]:
stocks["date"].min()

Timestamp('2000-01-01 00:00:00')

In [0]:
stocks["date"].max()

Timestamp('2010-03-01 00:00:00')

Oh dear....looks like they cover different timeframes. They do overlap though! We can subset each DataFrame using conditionals within square brackets so that both cover the same timeframe. We will start by using Python's `datetime` library to create `date` objects for our start and finish dates, wrapping each one in `pd.Timestamp` so that it can be compared directly with the `datetime64` values in our columns.

In [0]:
from datetime import date
#Convert each date to a pd.Timestamp; comparing a datetime64 column with a
#plain datetime.date is deprecated in pandas
start = pd.Timestamp(date(year=2006, month=1, day=1))
end = pd.Timestamp(date(year=2010, month=3, day=1))

In [0]:
employment = employment[(employment["date"] >= start) & (employment["date"] <= end)]


In [0]:
employment["date"].min()

Timestamp('2006-01-01 00:00:00')

In [0]:
employment["date"].max()

Timestamp('2010-03-01 00:00:00')

Excellent! Now in the cell below do the same for the `stocks` dataframe

In [0]:
stocks = stocks[(stocks["date"] >= start) & (stocks["date"] <= end)]

We should now be able to compare the `date` columns of the two DataFrames and see if they are equal. We can do this with list comprehensions:

In [0]:
stock_dates = stocks["date"].values.tolist()
employment_dates = employment["date"].values.tolist()

[x for x in stock_dates if x not in employment_dates]

[]

In [0]:
[x for x in employment_dates if x not in stock_dates]

[]
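The same check can be sketched with Pandas' own `isin` method or with Python sets. The two short Series below (`left` and `right`) are invented so that each has one date the other lacks; on our real data both results would be empty.

```python
import pandas as pd

left = pd.Series(pd.to_datetime(["2006-01-01", "2006-02-01", "2006-03-01"]))
right = pd.Series(pd.to_datetime(["2006-02-01", "2006-03-01", "2006-04-01"]))

# isin builds a boolean mask; ~ inverts it, leaving dates missing from `right`.
only_left = left[~left.isin(right)]
print(only_left.tolist())

# Set difference gives the dates that appear in `right` but not in `left`.
only_right = set(right) - set(left)
print(only_right)
```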

That's great, there are no dates in either DataFrame that don't exist in the other. Before we move on and merge the DataFrames, we need to check their shapes. We can do this by using the `shape` property, which returns a tuple indicating the number of rows and columns in our DataFrame. The first value is the number of rows and the second is the number of columns.

In [0]:
employment.shape

(51, 24)

Do the same for `stocks`, call the shape property and see how many rows and columns it has.

In [0]:
stocks.shape

(255, 3)

You will notice that the shapes do not match up. Let's call the `head` method and see why this is:

In [0]:
stocks.head()

Unnamed: 0,tech_company,date,price
72,MSFT,2006-01-01,26.14
73,MSFT,2006-02-01,25.04
74,MSFT,2006-03-01,25.36
75,MSFT,2006-04-01,22.5
76,MSFT,2006-05-01,21.19


In [0]:
employment.head()

Unnamed: 0,date,total_jobs,private,goods_producing,service_providing,private_service_providing,mining_and_logging,construction,manufacturing,durable_goods,...,transportation_and_warehousing,utilities,information,financial_activities,professional_and_business_services,education_and_health_services,leisure_and_hospitality,other_services,government,total_change
0,2006-01-01,135450.0,113603.0,22467.0,112983.0,91136.0,656.0,7601.0,14210.0,8982.0,...,4420.0,549.8,3052.0,8307.0,17299.0,17946.0,12945.0,5425.0,21847.0,282.0
1,2006-02-01,135762.0,113884.0,22535.0,113227.0,91349.0,662.0,7664.0,14209.0,8986.0,...,4429.4,550.1,3052.0,8332.0,17365.0,17998.0,12980.0,5426.0,21878.0,312.0
2,2006-03-01,136059.0,114156.0,22572.0,113487.0,91584.0,669.0,7689.0,14214.0,9000.0,...,4429.7,547.5,3055.0,8348.0,17438.0,18045.0,13034.0,5425.0,21903.0,297.0
3,2006-04-01,136227.0,114308.0,22631.0,113596.0,91677.0,679.0,7726.0,14226.0,9020.0,...,4445.4,548.9,3046.0,8369.0,17462.0,18070.0,13074.0,5426.0,21919.0,168.0
4,2006-05-01,136258.0,114332.0,22597.0,113661.0,91735.0,681.0,7713.0,14203.0,9017.0,...,4459.4,548.3,3039.0,8376.0,17512.0,18100.0,13052.0,5433.0,21926.0,31.0


We can see that the first 5 rows of the `stocks` DataFrame contain just one tech company. I suspect the reason this DataFrame is longer than the other is because there are duplicate dates, one for each tech company. Let's test this theory:

In [0]:
stocks[stocks["date"] == pd.Timestamp(year=2006, month=1, day=1)]

Unnamed: 0,tech_company,date,price
72,MSFT,2006-01-01,26.14
195,AMZN,2006-01-01,44.82
318,IBM,2006-01-01,75.89
386,GOOG,2006-01-01,432.66
509,AAPL,2006-01-01,75.51


Awesome! So before we merge these DataFrames we want to make them the same length. We can do this by making each tech company its own individual column which specifies its price on each date. We do this using the `pivot` method. Pandas DataFrames have a few methods for reshaping data, `pivot` being one of them:

<img src="https://i.imgur.com/9JWkc6G.png">

<sm>Source: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf</sm>

In [0]:
stocks_reshaped = stocks.pivot(index="date", columns="tech_company", values="price")

In [0]:
stocks_reshaped.head()

tech_company,AAPL,AMZN,GOOG,IBM,MSFT
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2006-01-01,75.51,44.82,432.66,75.89,26.14
2006-02-01,68.49,37.44,362.62,75.09,25.04
2006-03-01,62.72,36.53,390.0,77.17,25.36
2006-04-01,70.39,35.21,417.94,77.05,22.5
2006-05-01,59.77,34.61,371.82,75.04,21.19
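If the reshaping is hard to picture, here is the same operation sketched on a four-row table (`long`) holding two companies and two dates, with prices copied from the output above. Each unique value in `tech_company` becomes a column, and each unique `date` becomes a row.

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["2006-01-01", "2006-01-01", "2006-02-01", "2006-02-01"],
    "tech_company": ["MSFT", "AAPL", "MSFT", "AAPL"],
    "price": [26.14, 75.51, 25.04, 68.49],
})

# index -> row labels, columns -> new column headers, values -> cell contents
wide = long.pivot(index="date", columns="tech_company", values="price")
print(wide)
print(wide.shape)  # (2, 2): two dates by two companies
```

Note that `pivot` expects each combination of `index` and `columns` to appear only once; if there are duplicates it raises an error.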


That's great, but the date has now been shifted to the index. We don't want this; we want the date to be its own separate column. We can shift it back by using the `reset_index` method, which moves the index back into a regular column and replaces it with the default integer index.

In [0]:
stocks_reshaped.reset_index(inplace=True)

In [0]:
stocks_reshaped.head()

tech_company,date,AAPL,AMZN,GOOG,IBM,MSFT
0,2006-01-01,75.51,44.82,432.66,75.89,26.14
1,2006-02-01,68.49,37.44,362.62,75.09,25.04
2,2006-03-01,62.72,36.53,390.0,77.17,25.36
3,2006-04-01,70.39,35.21,417.94,77.05,22.5
4,2006-05-01,59.77,34.61,371.82,75.04,21.19


In [0]:
#I'm just going to remove the column index name by setting it to None
#(this also works on versions of pandas where `del` is not supported here)
stocks_reshaped.columns.name = None
stocks_reshaped.head()

Unnamed: 0,date,AAPL,AMZN,GOOG,IBM,MSFT
0,2006-01-01,75.51,44.82,432.66,75.89,26.14
1,2006-02-01,68.49,37.44,362.62,75.09,25.04
2,2006-03-01,62.72,36.53,390.0,77.17,25.36
3,2006-04-01,70.39,35.21,417.94,77.05,22.5
4,2006-05-01,59.77,34.61,371.82,75.04,21.19


Let's just double check and make sure the dataframes are now the correct length.

In [0]:
stocks_reshaped.shape

(51, 6)

In [0]:
employment.shape

(51, 24)

Great! Both dataframes have the same number of rows. Let's get merging!

There are three functions for combining DataFrames:
* **join:** the DataFrames are joined on their **index** values
* **concat:** use this when you want to append one (or more) DataFrames to one another, either by rows or by columns
* **merge:** combine two DataFrames based on shared values in common columns
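Here is a rough side-by-side sketch of the three, using two tiny invented tables (`jobs` and `prices`) that share a `date` column. The values are illustrative only.

```python
import pandas as pd

jobs = pd.DataFrame({"date": ["2006-01", "2006-02"], "total_jobs": [135450, 135762]})
prices = pd.DataFrame({"date": ["2006-01", "2006-02"], "MSFT": [26.14, 25.04]})

# merge: match rows using the values in a shared column.
merged = pd.merge(jobs, prices, on="date")

# join: match rows using the index, so we move `date` into the index first.
joined = jobs.set_index("date").join(prices.set_index("date"))

# concat: stack tables on top of one another (or side by side with axis=1).
stacked = pd.concat([jobs, jobs], ignore_index=True)

print(merged.shape, joined.shape, stacked.shape)  # (2, 3) (2, 2) (4, 2)
```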

For our task we want to use `merge`. Let's cover `merge` in detail first though, as it is the most complicated of the three but extremely powerful. It takes two DataFrames and merges them using a specified methodology. The documentation for this is fantastic: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

The principles are stolen from "relational databases". We specify how to merge in the `how` argument. 

Here is a great little cheatsheet for merging:

<img src="https://i.imgur.com/H4iLYQD.png">

We want to combine our dataframes using the `date` column. We also want to retain rows from both sets. We know our data has the same number of rows and we expect the date columns to be practically identical.

We can perform an `outer` join on the `date` column and retain all rows.

In [0]:
pd.merge(employment, stocks_reshaped, how="outer", on="date").shape

(51, 29)

In [0]:
pd.merge(employment, stocks_reshaped, how="outer", on="date").head()

Unnamed: 0,date,total_jobs,private,goods_producing,service_providing,private_service_providing,mining_and_logging,construction,manufacturing,durable_goods,...,education_and_health_services,leisure_and_hospitality,other_services,government,total_change,AAPL,AMZN,GOOG,IBM,MSFT
0,2006-01-01,135450.0,113603.0,22467.0,112983.0,91136.0,656.0,7601.0,14210.0,8982.0,...,17946.0,12945.0,5425.0,21847.0,282.0,75.51,44.82,432.66,75.89,26.14
1,2006-02-01,135762.0,113884.0,22535.0,113227.0,91349.0,662.0,7664.0,14209.0,8986.0,...,17998.0,12980.0,5426.0,21878.0,312.0,68.49,37.44,362.62,75.09,25.04
2,2006-03-01,136059.0,114156.0,22572.0,113487.0,91584.0,669.0,7689.0,14214.0,9000.0,...,18045.0,13034.0,5425.0,21903.0,297.0,62.72,36.53,390.0,77.17,25.36
3,2006-04-01,136227.0,114308.0,22631.0,113596.0,91677.0,679.0,7726.0,14226.0,9020.0,...,18070.0,13074.0,5426.0,21919.0,168.0,70.39,35.21,417.94,77.05,22.5
4,2006-05-01,136258.0,114332.0,22597.0,113661.0,91735.0,681.0,7713.0,14203.0,9017.0,...,18100.0,13052.0,5433.0,21926.0,31.0,59.77,34.61,371.82,75.04,21.19
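Because our two tables contain exactly the same dates, every value of `how` would give the same 51 rows here. To see the difference, the sketch below uses two invented tables (`left` and `right`) whose keys only partly overlap, and counts the rows each join type returns. The `indicator=True` option adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"date": ["jan", "feb", "mar"], "jobs": [1, 2, 3]})
right = pd.DataFrame({"date": ["feb", "mar", "apr"], "price": [10.0, 11.0, 12.0]})

# Count the rows produced by each kind of join.
row_counts = {
    how: len(pd.merge(left, right, how=how, on="date"))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True labels each row as left_only, right_only or both.
outer = pd.merge(left, right, how="outer", on="date", indicator=True)
print(outer[["date", "_merge"]])
```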


<h3>TIDY DATA!</h3>

You might think "Great! we have the merged data, let's move on". I'm afraid not. I've done the above as an exercise to demonstrate to you how pivoting and merging works, but the result is not very useful. It is not very useful because it is not **tidy**.

Tidy data looks like so:

<img src="http://garrettgman.github.io/images/tidy-4.png">

Tidy data should have an individual column for each variable, and their data "types" should not be mixed. We have created a "wide" table with mixed datatypes.

You might think this seems a bit pedantic but it is really important when we start looking at plotting data in the final section of this tutorial. Let's say for example you want to plot all the job sectors against one another on a line plot, with number of jobs on the y-axis and date on the x-axis. Each element of this plot will be defined as an object and each object will relate back to the underlying data. The job sector is a variable - it relates to each line on the plot and is a property of each data point - we would therefore need to specify a column for this variable.....but that column doesn't exist.

What we are going to do, is not merge the datasets. They contain different "types" of data - one is the number of jobs the other is price. They can be related on their `date` column, now that they are equal.

The employment table needs changing though. It needs to be converted to a tidy format, with a column for `employment_sector` and a column for the values which we will call `number_of_jobs`.  We will use the old `stocks` DataFrame (the one we have not reshaped) moving forward. 

When we reshaped `stocks` we used the `pivot` method. This is used to spread rows out into columns. What we want to do now is **gather** the columns into rows. We want to make a new column called `job_sector`which contains what is currently in the column names.

To do this we will use the `melt` method. Again the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html">documentation</a> for this is amazing. We want to specify three arguments:
* id_vars: the column we want to keep as an identifier for each row, which for us is the date
* var_name: the name of our new `variable` column
* value_name: the name of our column containing the `values` of the `variables`

I'm going to demonstrate on a quick example dataset and then I want you to perform the `melt` method on the employment dataset.

In [0]:
fake = pd.DataFrame({"shelter": ["roath", "city centre", "adamsdown", "cathays"],
                    "dogs": [3,5,1,3],
                    "cats": [7,2,4,2],
                    "rabbits": [0,2,0,3]})
fake

Unnamed: 0,shelter,dogs,cats,rabbits
0,roath,3,7,0
1,city centre,5,2,2
2,adamsdown,1,4,0
3,cathays,3,2,3


In [0]:
fake = fake.melt(id_vars="shelter", var_name="animal", value_name="count")

In [0]:
fake

Unnamed: 0,shelter,animal,count
0,roath,dogs,3
1,city centre,dogs,5
2,adamsdown,dogs,1
3,cathays,dogs,3
4,roath,cats,7
5,city centre,cats,2
6,adamsdown,cats,4
7,cathays,cats,2
8,roath,rabbits,0
9,city centre,rabbits,2
10,adamsdown,rabbits,0
11,cathays,rabbits,3
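`melt` and `pivot` undo one another. As a rough sketch (the `shelters` name is only used here so that we don't overwrite `fake`), we can melt the shelter table and then pivot it back to the wide layout:

```python
import pandas as pd

shelters = pd.DataFrame({"shelter": ["roath", "city centre", "adamsdown", "cathays"],
                         "dogs": [3, 5, 1, 3],
                         "cats": [7, 2, 4, 2],
                         "rabbits": [0, 2, 0, 3]})

# Gather the animal columns into rows
long_form = shelters.melt(id_vars="shelter", var_name="animal", value_name="count")

# Spread the rows back out into one column per animal
wide_again = long_form.pivot(index="shelter", columns="animal", values="count")

print(long_form.shape)
print(wide_again)
```

Be careful with cells like `fake = fake.melt(...)`: because they overwrite the original table, running them twice melts the already melted data.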


In [0]:
employment = employment.melt(_____________________________________________)

In [0]:
employment.head()

In [0]:
stocks.head()

Unnamed: 0,tech_company,date,price
72,MSFT,2006-01-01,26.14
73,MSFT,2006-02-01,25.04
74,MSFT,2006-03-01,25.36
75,MSFT,2006-04-01,22.5
76,MSFT,2006-05-01,21.19


In [0]:
columns = stocks.columns
first_column = columns[0]
stocks[first_column]

72     MSFT
73     MSFT
74     MSFT
75     MSFT
76     MSFT
77     MSFT
78     MSFT
79     MSFT
80     MSFT
81     MSFT
82     MSFT
83     MSFT
84     MSFT
85     MSFT
86     MSFT
87     MSFT
88     MSFT
89     MSFT
90     MSFT
91     MSFT
92     MSFT
93     MSFT
94     MSFT
95     MSFT
96     MSFT
97     MSFT
98     MSFT
99     MSFT
100    MSFT
101    MSFT
       ... 
530    AAPL
531    AAPL
532    AAPL
533    AAPL
534    AAPL
535    AAPL
536    AAPL
537    AAPL
538    AAPL
539    AAPL
540    AAPL
541    AAPL
542    AAPL
543    AAPL
544    AAPL
545    AAPL
546    AAPL
547    AAPL
548    AAPL
549    AAPL
550    AAPL
551    AAPL
552    AAPL
553    AAPL
554    AAPL
555    AAPL
556    AAPL
557    AAPL
558    AAPL
559    AAPL
Name: tech_company, Length: 255, dtype: object

<h3>The apply method</h3>

We now have our two dataframes `stocks` and `employment`:

In [0]:
stocks.head()

In [0]:
employment.head()

Wouldn't it be great if we had a way of changing the values in the `tech_company` column so that they were a bit more relatable, e.g. the actual company names? We can see the current names by using the `unique` method on this column:

In [0]:
stocks["tech_company"].unique()

There are two ways we can do this. The first is the `apply` method, which I will show you first because it has a broader application.

`apply` uses a functional paradigm that **maps** (remember that word) a function across all the values in a column. This makes for clean code and is preferred over looping through each value and changing them one at a time... gross.

We can call the `apply` method on either the whole dataframe or just one column. If we call it on a single column, `apply` takes a function as its first argument and then applies that function to every value in that column. If we call it on the whole dataframe, it passes the data to the function one row or one column at a time (specify `axis=1` for row by row and `axis=0`, the default, for column by column).

I'm going to demonstrate on the fake DataFrame I made in the last exercise. I'll capitalise the text in the `animal` column. It's a silly example but it explains the logic:

In [0]:
def capitalise(x):
  return(x.upper())

fake["animal"].apply(capitalise)

Above we use `apply` on a single column but we can apply a function to each row individually like so:

In [0]:
def print_info(x):
  return("The shelter in {0} has {1} {2}".format(x["shelter"], 
                                                x["count"], 
                                                x["animal"]))

fake.apply(print_info, axis=1)
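The `axis` argument is easy to get the wrong way round, so here is a small sketch on a made-up numeric table (the `counts` table and the `value_range` helper are invented for this example): `axis=0` hands each column to the function, `axis=1` hands it each row.

```python
import pandas as pd

counts = pd.DataFrame({"dogs": [3, 5, 1], "cats": [7, 2, 4]},
                      index=["roath", "city centre", "adamsdown"])

def value_range(values):
    # values is a whole column (axis=0) or a whole row (axis=1)
    return values.max() - values.min()

# axis=0 (the default): one result per column
per_animal = counts.apply(value_range, axis=0)

# axis=1: one result per row
per_shelter = counts.apply(value_range, axis=1)

print(per_animal)
print(per_shelter)
```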

Now I want you to complete the function below and then use the apply method to change all the tech company symbols into names:

In [0]:
def get_name(x):
  mappings = {'MSFT': 'Microsoft',
            'AMZN': 'Amazon', 
            'IBM': 'IBM', 
            'GOOG': 'Google', 
            'AAPL': 'Apple'}
  return(mappings[x])

In [0]:
def get_name(_):
  mappings = {'MSFT': 'Microsoft',
            'AMZN': 'Amazon', 
            'IBM': 'IBM', 
            'GOOG': 'Google', 
            'AAPL': 'Apple'}
  return(_______)

In [0]:
stocks["tech_company"] = _____________.apply(___________)

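There is actually a built-in method in Pandas for changing values using a dictionary, and it's the word I asked you to remember: `map`. Don't confuse it with `rename`, which we have used before: `rename` changes index and column labels rather than the values themselves. Note that `map` turns any value that is missing from the dictionary into `NaN`, so run it on the original symbols.

In [0]:
mappings = {'MSFT': 'Microsoft',
            'AMZN': 'Amazon', 
            'IBM': 'IBM', 
            'GOOG': 'Google', 
            'AAPL': 'Apple'}
stocks["tech_company"].map(mappings).head()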
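As a hedged aside, here is a sketch on a small made-up Series showing how three dictionary-based methods behave: `map` and `replace` change the values, while `rename` changes the index labels. The `symbols` Series is invented for this example.

```python
import pandas as pd

mappings = {"MSFT": "Microsoft", "AAPL": "Apple"}
symbols = pd.Series(["MSFT", "AAPL", "IBM"], name="tech_company")

# map: looks every value up in the dictionary; missing keys become NaN
mapped = symbols.map(mappings)

# replace: swaps matching values and leaves everything else untouched
replaced = symbols.replace(mappings)

# rename: acts on the index labels (0, 1, 2 here), so the values are unchanged
renamed = symbols.rename(mappings)

print(mapped.tolist())
print(replaced.tolist())
print(renamed.tolist())
```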

<h3>Grouping</h3>

Before moving on to plotting data, I'm going to cover grouping. It is not relevant to this dataset so we are going to very briefly look at a different dataset before returning to our original task.

I'm going to bring in the `cars` dataset. See below.

In [0]:
cars = local_data.cars()

In [0]:
cars.head()

What would we do if we wanted to group our car data by `Origin` and calculate statistics based on this or apply functions to individual groups?

The `pandas.DataFrame.groupby` method groups data based on one or more given columns.

In [0]:
cars.groupby("Origin")

This generates a `GroupBy` object. We can iterate over this object, and it treats each group as its own dataframe. We can use methods to generate descriptive statistics for these groups:

In [0]:
car_origins = cars.groupby("Origin")
car_origins["Horsepower"].mean()

In [0]:
car_origins["Acceleration"].median()
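Here is a short sketch of both ideas on a made-up table (the column names mirror `cars` but the numbers are invented): looping over the groups, and computing several statistics at once with `agg`.

```python
import pandas as pd

toy_cars = pd.DataFrame({"Origin": ["USA", "USA", "Japan", "Japan", "Europe"],
                         "Horsepower": [130, 150, 90, 100, 110]})

grouped = toy_cars.groupby("Origin")

# Iterating yields (group name, sub-DataFrame) pairs
for origin, group in grouped:
    print(origin, len(group))

# agg computes several statistics per group in one call
summary = grouped["Horsepower"].agg(["mean", "max"])
print(summary)
```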

We don't have time to cover groups in full detail in this tutorial, but I recommend referring to the user guide for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

<h2>6. Creating plots with Matplotlib and Seaborn</h2>

Matplotlib is the quintessential plotting library in Python and many other libraries rely upon its classes and functions. So we shall start by establishing a core understanding of its workings.

We communicate with Matplotlib using the `pyplot` module, which we import as follows:

In [0]:
import matplotlib.pyplot as plt

At the highest level we have the `figure` object. This houses all the contents of our plot, and it is onto this that we create `axes`.

In [0]:
#Create a figure using the figure function
fig = plt.figure()
#Add a title
fig.suptitle('Blank Plot')
#Add an axes using the add_axes method
fig.add_axes([0,0,1,1])
#We display all current plots using the plt.show function
plt.show()
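The list passed to `add_axes` is `[left, bottom, width, height]`, given as fractions of the figure size. As a small sketch (the positions are arbitrary), we can place a second, smaller axes inside a first one:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
# Main axes: leave a margin around the edge of the figure
main_ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
# Inset axes: starts 55% across and 55% up, 30% of the figure wide and tall
inset_ax = fig.add_axes([0.55, 0.55, 0.3, 0.3])
main_ax.set_title("Main axes")
inset_ax.set_title("Inset axes")
plt.show()
```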

Calling `plt.figure()` and then `add_axes` is not actually the preferred method for creating plots. The preferred method is to use `plt.subplots()`. In this function call we specify the grid layout we want, i.e. how many axes we want on the plot.

The function returns two objects, a figure object and an axes object. If there are multiple axes then it returns an array of axes objects, which can be accessed like a list or matrix.

In [0]:
fig, ax = plt.subplots()
fig.suptitle("This figure has one axes")
plt.show()

In [0]:
fig, axes = plt.subplots(2)
fig.subplots_adjust(hspace=0.5)
fig.suptitle("This figure has two axes")
axes[0].set_title("Axis 1")
axes[1].set_title("Axis 2")
plt.show()

In [0]:
axes

In [0]:
fig, axes = plt.subplots(2,2)
#Specify the spacing around plots
fig.subplots_adjust(hspace=0.5)
fig.subplots_adjust(wspace=0.5)
fig.suptitle("This figure has four axes")
axes[0][0].set_title("Axis 1")
axes[0][1].set_title("Axis 2")
axes[1][0].set_title("Axis 3")
axes[1][1].set_title("Axis 4")
plt.show()
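When there are many axes, indexing each one by hand gets tedious. One option, sketched below, is to loop over `axes.flat`, which walks through the grid row by row:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)
fig.subplots_adjust(hspace=0.5, wspace=0.5)

# axes is a 2x2 NumPy array; .flat iterates over it row by row
for number, ax in enumerate(axes.flat, start=1):
    ax.set_title("Axes {0}".format(number))

plt.show()
```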

If we want to plot something on our axes we can use various plotting methods. Some examples are shown below:

In [0]:
import numpy as np
t = np.arange(0.01, 10.0, 0.01)
s = np.exp(t)

fig, ax = plt.subplots()
#Plot will draw a simple line plot
ax.plot(t, s)
plt.show()

There are various methods which allow us to change aesthetic features:


In [0]:
fig, ax = plt.subplots()
ax.plot(t, s)
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Axes Title")
plt.show()

![alt text](https://i.redd.it/zhscjhjr3nb21.jpg)

<h3>Seaborn</h3>

These basic constructs of Matplotlib are important to understand when using other libraries that build upon it, such as `Seaborn`.

`Seaborn` is a plotting library with heavy integration with Pandas. It inherits its infrastructure from Matplotlib. In fact, as we will see below, the `seaborn` functions return a Matplotlib axes object that we can then treat the same as the Matplotlib objects above.

In [0]:
import seaborn as sns
import numpy as np

In [0]:
t = np.arange(0.01, 10.0, 0.01)
s = np.exp(t)
d = pd.DataFrame({"x": t, "y":s})
ax = sns.lineplot(x="x", y="y", data=d)
plt.show()

In [0]:
type(ax)

So I can alter the aesthetics of this plot the same as I did before:

In [0]:
d = pd.DataFrame({"x": t, "y":s})
ax = sns.lineplot(x="x", y="y", data=d)
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")
ax.set_title("This is the title")
plt.show()

What seaborn offers us that is of extra value is an integration with the Pandas library.

Let's go back to our `stocks` and `employment` DataFrames and start to address some of our original questions by visualising the data. Let's start by plotting the stock prices of the technology companies over the time period we have.

In [0]:
#Hue refers to the different coloured lines
sns.lineplot(data=stocks, x="date", y="price", hue="tech_company")

Ooooooo that is UGLY. Let's clean this up a little.

In [0]:
ax = sns.lineplot(data=stocks, x="date", y="price", hue="tech_company")
#Change axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Price")
#The legend and x-axis tick labels are easily edited by calling functions
#of the `plt` module, which act on the current axes
plt.legend(title="Tech Companies", loc="upper left", bbox_to_anchor=(1, 0.5))
plt.xticks(rotation=60)
plt.show()

The seaborn library comes with a multitude of functions for all sorts of different plots, all of which can be found here:

https://seaborn.pydata.org/examples/index.html

https://seaborn.pydata.org/tutorial.html

The preferred function for looking at relationships is `relplot`. The difference with this function is that it returns a `FacetGrid` object rather than a Matplotlib axes object. The advantage is that we can plot a grid of subplots divided up by a column in our dataframe. See below for an example:


In [0]:
cars = local_data.cars()
cars.head()

In [0]:
sns.relplot(data=cars, x="Horsepower", y="Miles_per_Gallon",
           hue="Cylinders", col="Origin")

Let's use the `relplot` function for our stock data and show how to alter the aesthetics of a `FacetGrid`.

In [0]:
ax = sns.relplot(data=stocks, x="date", y="price", hue="tech_company", kind="line")

So `ax` is now a `FacetGrid` object:

In [0]:
ax

...and we access the axes objects through the `axes` property of that object, which is just an array of Matplotlib subplot objects!

In [0]:
ax.axes

In [0]:
ax = sns.relplot(data=stocks, x="date", y="price", hue="tech_company", kind="line")
ax.axes[0][0].set_xlabel("Date")
ax.axes[0][0].set_ylabel("Price")
ax.axes[0][0].set_title("Price of tech company stocks over time")
legend = ax._legend
legend.texts[0].set_text("Tech Companies")
plt.xticks(rotation=60)
plt.show()

Now I want you to create a plot, the same way we did above, but this time use the `employment` dataset and plot the jobs in each employment sector over time.

In [0]:
ax = sns.relplot(data=_______, x=______, y=_______, hue=_________, kind="line")
ax.axes[0][0].set_xlabel("Date")
ax.axes[0][0].set_ylabel(________)
ax.axes[0][0].set_title(__________________________)
legend = ax._legend
legend.texts[0].set_text(____________)
plt.xticks(rotation=60)
plt.show()

You might have noticed that a lot of the sectors are squished to the bottom. We can fix this by performing a log transformation on our jobs data, thereby putting everything on a log scale. Use the `apply` method to create a new column called `log_jobs` and then create a new plot using this new column.

In [0]:
def log_transform(x):
  return(np.log10(x))

In [0]:
employment["log_jobs"] = employment["jobs"].apply(log_transform)
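As an aside, NumPy functions such as `np.log10` work on a whole column at once, so the same result can be had without writing a helper function. A sketch on made-up numbers (the `toy_jobs` table is invented for this example):

```python
import numpy as np
import pandas as pd

toy_jobs = pd.DataFrame({"jobs": [10, 1000, 100000]})

# Using apply, as in the exercise above
via_apply = toy_jobs["jobs"].apply(np.log10)

# Passing the whole column to NumPy in one call (vectorised)
vectorised = np.log10(toy_jobs["jobs"])

print(vectorised)
```

Either way, keep in mind that the logarithm is only defined for positive numbers, so zero or negative values will not give a usable result.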

For one last little exercise to wrap up, I'm going to show how to use seaborn to plot both datasets on the same axis.

In [0]:
fig, ax = plt.subplots(figsize=(10,5))
#lineplot is an axes-level function, so it accepts the ax argument
#(relplot creates its own figure and cannot draw onto an existing axes)
sns.lineplot(data=employment[employment["job_sector"] == "total_change"], 
             x="date", y="jobs", ax=ax, hue="job_sector")
ax.lines[0].set_linestyle("--")
#twinx creates a second y-axis that shares the same x-axis
ax2 = ax.twinx()
sns.lineplot(data=stocks, x="date", y="price", hue="tech_company", ax=ax2)
plt.show()