# Week 1: Coding with data in Python

We start out with the basics. The exercises in this session cover:

* Writing Python code and Markdown in Jupyter notebooks
* Introductory Python
* Getting some data from Reddit

## Exercises

### Part 1: Know thy notebook

This document is what we call a *Jupyter notebook*. We will be using these extensively throughout the course so **READ THIS CLOSELY**. If you understand how notebooks work, you will save yourself lots of time and frustration throughout this course!

There are two basic things you need to know about Jupyter notebooks:

1. A notebook is nothing but a list of cells. A cell can either be a **code cell** or a **Markdown cell**. Code cells are for writing executable code, and Markdown cells (like this one) are for explaining things in text and making your notebook more readable. A typical workflow, which you will soon get used to, is to solve a problem with some code in a *code cell* and then explain your reasoning or the results you obtained in a *Markdown cell*. You can toggle cell type when you are in *command mode* by pressing <kbd>y</kbd> for code and <kbd>m</kbd> for Markdown. **Try to do that**. Change this *Markdown* cell to a *code* cell, and change it back again. What happens if you execute (<kbd>shift</kbd>+<kbd>enter</kbd>) when this cell is a code cell, compared to when it is a Markdown cell?

2. The notebook has two *modes*: **edit mode** and **command mode**. You enter command mode by pressing <kbd>esc</kbd> or clicking outside a cell, and edit mode by clicking a cell and pressing <kbd>enter</kbd> or by double-clicking a cell. In edit mode, the left border of the current cell turns green (except in `jupyter lab`, where the bar is always blue) and whatever you type goes into that cell, whether it is a code or Markdown cell. [Here](http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html)'s a nice rundown of the different commands you can use. **Beware of <kbd>x</kbd> and <kbd>dd</kbd>**. Read the full list of hotkeys by pressing <kbd>h</kbd> in command mode to figure out why.

>*Heads up:* Because we'll be using Jupyter notebooks so much in this course, I strongly recommend investing 5 more minutes playing around with cell types, modes and hotkeys. It will save you heaps of time down the road. Above all, make sure you have read and understood these ^ two points!

When you run a code cell by pressing <kbd>shift</kbd> + <kbd>enter</kbd>, the code gets evaluated by the Python interpreter installed on your computer. If the last line of the cell is an expression, its value is displayed below the cell, unless you store it in a variable or the value is `None`. Anything you `print` also appears below the cell. In general, you will use code cells for doing analysis and working with data.

*Markdown* is a simple markup language for formatting text (like *HTML* or $\LaTeX$). You will typically use it for writing explanations about how you solve the exercises and the results you get, and styling your notebook with sections and subsections. It can do **bold**, *italics* and $\LaTeX$ formatting (for equations), and much much more. You can read about the Markdown language [here](http://daringfireball.net/projects/markdown/).

Below is your first exercise. The exercises are numbered by the convention `[session]`.`[section]`.`[problem]`.`[subproblem]`. For example, exercise 4.2.3.1 is in week 4, section 2, problem 3, and subproblem 1.

>**Ex. 1.0.1**: In the Markdown cell below, write a short text that shows that you can:
>* Create sections
>* Write words in bold and italics
>* Write an equation in LaTeX formatting
>* Create bullet lists
>* Create [hyperlinks](https://en.wikipedia.org/wiki/Hyperlink)

>*Hint: Remember to execute the cell (<kbd>shift</kbd>+<kbd>enter</kbd>) so the Markdown gets rendered.*

*this is italics*

**this is bold**

#### This is a section

Here is some latex: $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

Here is a bulleted list:
* here is the first item
* and the second item
* and the third item

and here is a hyperlink: [stack overflow](https://stackoverflow.com)


### Part 2: Essential Python ([DSFS](https://www.oreilly.com/library/view/data-science-from/9781492041122/) Chapter 2)

These exercises take you through some very basic Python. Use them to calibrate your expectations: If you find them hard, you must spend some more time getting up to speed (see the preparation goals for today's session on Canvas).

>**Ex. 1.1.1**: Create a list `a` that contains the numbers from $0$ to $1110$ (including $0$ and $1110$), incremented by one, using the `range` function.

In [6]:
# range excludes its stop value, so stop at 1111 to include 1110
a = list(range(1111))

>**Ex. 1.1.2**: Show that you understand [slicing](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation) in Python by extracting a list `b` with the numbers from $760$ to $769$ (including both) from the list created above.

In [108]:
b = a[760:770]
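
As a quick reminder of how slice notation behaves, here is a small sketch on a made-up list (the name `nums` is illustrative and not part of the exercise). The start index is included and the stop index is excluded, which is why the solution above stops at `770`.

```python
# Illustrative list; `nums` is not part of the exercise.
nums = list(range(10))     # [0, 1, ..., 9]

first_three = nums[:3]     # start defaults to 0; index 3 itself is excluded
middle = nums[3:6]         # indices 3, 4 and 5
last_two = nums[-2:]       # negative indices count from the end
every_other = nums[::2]    # the third slot is the step size

print(first_three, middle, last_two, every_other)
```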

>**Ex. 1.1.3**: Define a function that takes as input a number $x$ and outputs the number multiplied by itself plus three: $f(x) = x(x+3)$.

In [9]:
def my_func(x):
    return x * (x + 3)

>**Ex. 1.1.4**: Apply this function to every element of the list `b` using a `for` loop and append the results to a new list `c`. Print `c`.

In [12]:
c = []
for ele in b:
    c.append(my_func(ele))
print(c)

[579880, 581404, 582930, 584458, 585988, 587520, 589054, 590590, 592128, 593668]


>**Ex. 1.1.5**: Do the exact same thing using a *list comprehension*.

In [13]:
c = [my_func(ele) for ele in b]
print(c)

[579880, 581404, 582930, 584458, 585988, 587520, 589054, 590590, 592128, 593668]


>**Ex. 1.1.6**: Write the numbers in `c` to a text file with one number per line.

In [19]:
# the with block closes the file automatically, even if a write fails
with open('exercises_week1_output.txt', 'w') as f:
    for ele in c:
        f.write(str(ele) + '\n')
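
One way to check that a file was written as intended is to read it back. The sketch below does a write-then-read round trip inside a temporary directory so it does not touch your working folder; the file name `numbers.txt` and the three values are made up for illustration.

```python
import os
import tempfile

numbers = [579880, 581404, 582930]  # illustrative values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "numbers.txt")  # hypothetical file name

    # Write one number per line; the file is closed when the block exits.
    with open(path, "w") as f:
        for n in numbers:
            f.write(f"{n}\n")

    # Read the lines back and convert them to integers again.
    with open(path) as f:
        read_back = [int(line) for line in f]

print(read_back == numbers)
```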

>There are three ways to format strings in Python.
> 1) The oldest is %-formatting, which has more or less gone out of style.
> 2) The next is `str.format()`, a more modern approach. Read more [here](https://realpython.com/python-f-strings/#option-2-strformat).
> 3) Finally, formatting with f-strings is the newest and, in most cases, the best method. Read more [here](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).
> 
>**Ex. 1.1.7**: Show that you understand how strings work in Python. You should:
>
>1. Add a comment above each line of code that explains it.
>2. Find all the lines where **a string** is put into a string. How many are there?
>3. Rewrite the last example with owners and rabbits so that it uses f-strings instead.

In [26]:
# This is an example of a comment

# Examples using f-strings
# make a new f-string that evaluates to "There are 10 types of people"
x = f"There are {10} types of people."
# instantiate a new variable called binary that points to a string containing "binary"
binary = "binary" 
# instantiate a new variable called do_not that points to a  string containing "don't"
do_not = "don't" 
# instantiate a new f-string that evaluates to "Those who know binary and those who don't"
y = f"Those who know {binary} and those who {do_not}."                         # 1

#print x and y
print(x)
print(y)

# Examples using str.format()
# make a new string speaker pointing to "I"
speaker = "I"
# print "I said: There are 10 types of people"
print("{} said: {}.".format(speaker, x))                                        # 2
# print "I also said: Those who know binary and those who don't"
print("{speaker} also said: '{str_to_insert}'.".format(speaker=speaker, str_to_insert=y))

# make new boolean holding false
hilarious = False
# make a new unformatted string
joke_evaluation = "Isn't that joke so funny?! {joke_is_funny}"

# substitute the value of hilarious for the curly-bracketed joke_is_funny string variable (and then print).
print(joke_evaluation.format(joke_is_funny=hilarious))

# make some more strings
w = "This is the left side of..."
e = "a string with a right side."

# print the concatenation of w and e
print(w + e)

# make a new dictionary showing how many rabbits Alice, Bob, and Cinderella have.
owner_to_rabbit_count = {'Alice': 5, 'Bob': 10, 'Cinderella': 2}
# make an unformatted string that will be used to display how many rabbits each person owns.
rabbit_stats_template = '{name} has {rabbit_count} rabbits'

# print out how many rabbits everyone owns.
for owner_name, rabbits in owner_to_rabbit_count.items():
    print(rabbit_stats_template.format(name=owner_name, rabbit_count=rabbits))                        #3

There are 10 types of people.
Those who know binary and those who don't.
I said: There are 10 types of people..
I also said: 'Those who know binary and those who don't.'.
Isn't that joke so funny?! False
This is the left side of...a string with a right side.
Alice has 5 rabbits
Bob has 10 rabbits
Cinderella has 2 rabbits


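There were 4 lines in the previous code block where strings were inserted into strings: the three lines marked with a trailing number comment, plus the second `str.format()` call, which inserts `speaker` and `y`. The lines that insert `10` and `hilarious` do not count, since those values are an integer and a boolean.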

In [23]:
# print out how many rabbits everyone owns.
for owner_name, rabbits in owner_to_rabbit_count.items():
    print(f'{owner_name} has {rabbits} rabbits')

Alice has 5 rabbits
Bob has 10 rabbits
Cinderella has 2 rabbits
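
To compare the three formatting styles listed in the exercise text side by side, here is a minimal sketch that builds the same sentence with each of them. The values are illustrative.

```python
name, rabbits = "Alice", 5  # illustrative values

percent_style = "%s has %d rabbits" % (name, rabbits)      # 1) %-formatting
format_style = "{} has {} rabbits".format(name, rabbits)   # 2) str.format()
f_string_style = f"{name} has {rabbits} rabbits"           # 3) f-string

# All three produce the same string.
print(percent_style == format_style == f_string_style)
```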


>**Ex. 1.1.8**: Why does `5 // 2 == 2` in Python 3? What does `5 / 2` give?

`5 // 2 == 2` because `//` is floor division: it divides and rounds the result down to the nearest whole number. On the other hand, `5 / 2 == 2.5` because `/` is true division, which always returns a float in Python 3.
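
The difference between the two operators is easiest to see with a negative number, where rounding down is not the same as dropping the decimals. A short sketch:

```python
# Floor division rounds towards negative infinity, not towards zero.
print(5 // 2)     # 2
print(-5 // 2)    # -3
print(5 / 2)      # 2.5
print(4 / 2)      # 2.0, a float even though the result is a whole number

# divmod returns the floor quotient and the remainder in one call.
quotient, remainder = divmod(5, 2)
print(quotient, remainder)   # 2 1
```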

>**Ex. 1.1.9**: Explain the point of using `try` and `except` statements. Write some code that shows how to use these.
>
> *Hint: You will do a lot of Googling in this course. If you don't already know how to use `try` and `except`, start Googling now.*

In [25]:
def safe_function(a, b):
    try:
        # this line raises ZeroDivisionError when b is 0, so we wrap it in a try block
        return a / b
    except ZeroDivisionError:
        # catch only the error we expect and return a NaN instead. A bare `except:` would also
        # hide unrelated bugs (e.g. a TypeError from passing a string).
        return float('NaN')
    
print(safe_function(1, 0))
print(safe_function(100, 5))

nan
20.0
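
A `try` statement can also carry an `else` clause, which runs only when no exception was raised. The sketch below uses a hypothetical helper, `parse_int`, to show that pattern; it catches only `ValueError`, so other problems still surface as errors.

```python
def parse_int(text):
    """Return `text` as an int, or None if it is not a valid integer string."""
    try:
        value = int(text)
    except ValueError:
        # Raised for strings such as "abc" that cannot be parsed.
        return None
    else:
        # Runs only if the try block raised no exception.
        return value

print(parse_int("42"))
print(parse_int("abc"))
```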


>**Ex. 1.1.10**: `dict`s and `defaultdict`s.
>1. What is a `defaultdict`? How would you say it is different from a normal Python `dict`?
>2. Write some code that takes a list of tuples:
>
>        l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]
>
>     And produces a `defaultdict` object
>
>        defaultdict(<class 'list'>, {'a': [1, None, None], 'c': [False], 'b': [3, True]})
>
>*Hint: you can import `defaultdict` from `collections`. Your code should be a for loop that loops over the tuples in `l` and updates an initially empty defaultdict, iteration after iteration.*

In [34]:
# a defaultdict is like a normal dict, but when a missing key is accessed it calls a factory
# function (here `list`) to create and store a default value instead of raising a KeyError.
from collections import defaultdict
l = [("a", 1), ("b", 3), ("a", None), ("c", False), ("b", True), ("a", None)]
my_dict = defaultdict(list)
# unpack each tuple into its key and value
for key, value in l:
    my_dict[key].append(value)
print(my_dict)

defaultdict(<class 'list'>, {'a': [1, None, None], 'b': [3, True], 'c': [False]})
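
To make the difference from a normal `dict` concrete, the sketch below groups the same kind of data both ways and then looks up a key that does not exist. The shortened input list `pairs` is illustrative.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 3), ("a", None)]  # shortened, illustrative input

plain = {}
for key, value in pairs:
    # setdefault inserts an empty list the first time a key is seen
    plain.setdefault(key, []).append(value)

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)   # a missing key is created as [] automatically

print(plain == dict(grouped))    # both hold the same groups

# The two types differ when a missing key is looked up:
try:
    plain["z"]
except KeyError:
    print("plain dict raises KeyError")
print(grouped["z"])              # returns [] and stores "z" as a new key
```

Note the side effect in the last line: merely reading a missing key from a `defaultdict` adds it.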


>**Ex. 1.1.11**: Take a list `a = list("justreadtheinstructions")` and
>1. count the number of times each element occurs using `Counter`,
>2. report the two most common elements.
>
>*Hint: you can import `Counter` from `collections`. `Counter` has a method called `most_common` that you can use.*

In [41]:
from collections import Counter
a = list("justreadtheinstructions")
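# counts maps each character to the number of times it occurs
counts = Counter(a)
# most_common(2) returns only the two most frequent elements
print(counts.most_common(2))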

[('t', 4), ('s', 3)]


>**Ex. 1.1.12**: Take another list `b = list("ofcourseistillloveyou")` and
>1. get the `set` of characters that exist in both `a` and `b` (intersection),
>2. get the `set` of characters that exist in either `a` or `b` (union), and
>3. compute the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) between the distinct elements in `a` and `b`.
>
>*Hint: use the `set` function to get a `set`-type object of distinct elements from a list. Sets support a [number of different operations](https://snakify.org/en/lessons/sets/#section_4).*

In [44]:
b = list("ofcourseistillloveyou")
intersection = set(a) & set(b)
union = set(a) | set(b)
jaccard_similarity = len(intersection) / len(union)
print(intersection)
print(union)
print(jaccard_similarity)

{'o', 'u', 'i', 't', 'c', 's', 'r', 'e'}
{'o', 'n', 'h', 'i', 'v', 't', 'f', 'j', 'c', 's', 'e', 'a', 'd', 'y', 'u', 'l', 'r'}
0.47058823529411764
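
If you need the Jaccard similarity more than once, it can be wrapped in a small function. This is a minimal sketch; the function name `jaccard` and the choice to return `1.0` for two empty inputs are assumptions made here, not part of the exercise.

```python
def jaccard(x, y):
    """Jaccard similarity of the distinct elements in two iterables."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        # Both inputs are empty; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return len(set_x & set_y) / len(union)

print(jaccard("justreadtheinstructions", "ofcourseistillloveyou"))
```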


### Part 3: A little bit of real data

>**Ex. 1.2.1**: Learn about JSON by reading the **[wikipedia page](https://en.wikipedia.org/wiki/JSON)**. Then answer the following questions in the cell below. 
>
>1. What do the letters stand for?
>2. What is JSON?
>3. Why is JSON superior to XML? (... or why not?)

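1. JavaScript Object Notation.
2. A human-readable, text-based data format built from nested key-value pairs (objects) and ordered lists (arrays), with strings, numbers, booleans and `null` as values.
3. JSON encodings tend to be more concise (they use fewer characters) and map directly onto Python `dict`s and `list`s. I have always used JSON whenever I have needed to encode data to text, because it is extremely simple. XML is more verbose, but it supports features JSON lacks, such as attributes, namespaces and comments, so which one is better depends on the use case.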
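
To see how JSON maps onto Python objects, here is a small round trip with the standard `json` module. The `record` dictionary is made-up data.

```python
import json

# Made-up record for illustration.
record = {"name": "Missy", "awesome": True, "toys": ["ball", "mouse"], "owner": None}

text = json.dumps(record)      # Python object -> JSON string
restored = json.loads(text)    # JSON string -> Python object

print(text)
print(type(text), type(restored))
print(restored == record)
```

In the JSON string, Python's `True` becomes `true` and `None` becomes `null`; `json.loads` converts them back.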

>**Ex. 1.2.2**: Working with JSON files
>1. Use [`requests`](https://www.google.dk/search?q=python+requests+get+json&gws_rd=cr&ei=M5OdWaewD8Ti6AS54J24Bg), or another Python module, to store **[this data](https://www.reddit.com/r/gameofthrones/.json)** in a new variable `data`. You may have to pass a User-agent argument in the header of your request to avoid HTTP 429. In the requests module, this can be done by including <code>headers = {'User-agent': 'whatever-you-like'}</code> as a keyword argument in the function call.
>2. Show that `data` is a `dict` type object.

In [59]:
import requests
import json

data = requests.get('https://www.reddit.com/r/gameofthrones/.json', headers={'User-agent': 'alex the data scientist'}).json()

In [109]:
print("data type:", type(data))

data type: <class 'dict'>


>**Ex. 1.2.3**: Let's try to inspect the data you retrieved. 
>
>1. Use the `json` module to print your data variable as a string with `indent=4`.
>2. Print the keys of `data`.
>
>*Hint: 1. Use the `json` function `dumps`. 2. Call `.keys()` on the variable.*

In [66]:
import json
print(json.dumps(data, indent=4))

{
    "kind": "Listing",
    "data": {
        "after": "t3_10ndrgp",
        "dist": 25,
        "modhash": "",
        "geo_filter": null,
        "children": [
            {
                "kind": "t3",
                "data": {
                    "approved_at_utc": null,
                    "subreddit": "gameofthrones",
                    "selftext": "",
                    "author_fullname": "t2_3dtkvfq4",
                    "saved": false,
                    "mod_reason_title": null,
                    "gilded": 0,
                    "clicked": false,
                    "title": "Gotta love when Jaime gets put in his place",
                    "link_flair_richtext": [],
                    "subreddit_name_prefixed": "r/gameofthrones",
                    "hidden": false,
                    "pwls": 6,
                    "link_flair_css_class": null,
                    "downs": 0,
                    "thumbnail_height": 140,
                    "top_awarded_type": null,

In [67]:
print(data.keys())

dict_keys(['kind', 'data'])


>**Ex. 1.2.4**: The URL reveals that the data is from reddit/r/gameofthrones, but can you recover that information from the data? Give your answer by 'keying' into the dictionary using square brackets.
>
>*Hint: 'Keying' is a word I just made up. By it, I mean the following. Consider a nested dictionary like:*
>
>        my_json_obj = {
>            'cats': {
>                'awesome': ['Missy'],
>                'useless': ['Kim', 'Frank', 'Sandy']
>            },
>            'dogs': {
>                'awesome': ['Finn', 'Dolores', 'Fido', 'Casper'],
>                'useless': []
>            }
>        }
>
>*I can get the list of useless cats by keying into `my_json_obj` like such:*
>
>        >>> my_json_obj['cats']['useless']
>        Out [ ]: ['Kim', 'Frank', 'Sandy']
>
>*`my_json_obj['cats']` returns the dictionary `{'awesome': ['Missy'], 'useless': ['Kim', 'Frank', 'Sandy']}` and getting '`useless`' from that gives us `['Kim', 'Frank', 'Sandy']`. If any of those list items were a list or a dictionary themselves, we could have kept keying deeper into the structure.*

In [93]:
data['data']['children'][0]['data']['subreddit_name_prefixed']

'r/gameofthrones'
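
Keying with square brackets raises an error as soon as one level is missing. When the structure is not guaranteed, a small helper can follow the keys and fall back to a default instead. The function `dig` below is a hypothetical helper written for this sketch, shown on a shortened copy of the example from the hint.

```python
def dig(obj, *keys, default=None):
    """Follow `keys` into nested dicts and lists; return `default` if a step is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# Shortened copy of the nested dictionary from the hint.
my_json_obj = {"cats": {"awesome": ["Missy"], "useless": ["Kim", "Frank", "Sandy"]}}

print(dig(my_json_obj, "cats", "useless", 0))   # an existing path
print(dig(my_json_obj, "dogs", "awesome"))      # 'dogs' is missing here, so the default is returned
```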

>**Ex. 1.2.5**: Write two `for` loops (or list comprehensions) which:
>1. Count the number of spoilers.
>2. Print only the headlines that aren't spoilers.

In [102]:
spoiler_fields = [child['data']['spoiler'] for child in data['data']['children']]
Counter(spoiler_fields).most_common()

[(False, 22), (True, 3)]
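
An alternative to `Counter` is to sum the boolean flags directly, since `True` counts as 1 and `False` as 0. The sketch below runs on a made-up list, `children`, with the same shape as `data['data']['children']`, so it works without a network request.

```python
# Made-up listing with the same shape as data['data']['children'].
children = [
    {"data": {"title": "Post one", "spoiler": False}},
    {"data": {"title": "Post two", "spoiler": True}},
    {"data": {"title": "Post three", "spoiler": False}},
]

# Summing the boolean flags counts the spoilers.
n_spoilers = sum(child["data"]["spoiler"] for child in children)
print(n_spoilers)

# Print only the headlines that are not marked as spoilers.
for child in children:
    if not child["data"]["spoiler"]:
        print(child["data"]["title"])
```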

In [106]:
[child['data']['title'] for child in data['data']['children'] if not child['data']['spoiler']]

['Gotta love when Jaime gets put in his place',
 'From Game Of Thrones To Last Of Us',
 "So what's the explanation for this? The White Walkers just like making swirly patterns with dismembered horse carcasses?",
 'Fire &amp; Blood. Artwork I made yesterday.',
 'Roleplaying Game from 2005',
 'This is amazing! Haha',
 "Gwendoline Christie Reminisces on 'Breakthrough' Role as Game of Thrones' Brienne",
 'What if Nymeria had killed that little brat right off the top?',
 'Lord Commander Jeor Mormont - There was no one better than this man. No one!',
 'Which culture is the better archers ? Greyjoy or dothraki?',
 'Is there a smash cut of only Tyrion scenes in games of thrones?',
 'Which GoT character do you identify with.',
 "What happened to Dany's Khalasar?",
 'I present: the two smartest players in the show',
 "Game Of Thrones as an 80's Dark Fantasy Movie",
 "Why did they portray Dany and Khal's sex as Rape scene?",
 'Been asking for some Ramsay meat.. he knew what I meant',
 'Why did th