# Lab Assignment 6 - Dictionaries, Sets, and Hashes

Many systems store and retrieve data using JavaScript Object Notation (JSON), which can be translated into a Python dictionary. Let's work with the Wikipedia API since it's free and easy to access.

## Problem 1: Revision data from the Wikipedia API

In [None]:
import requests
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

We'll need to make a `request` dictionary for the requests library to pass to the Wikipedia API. Details about this API endpoint are available in the [official documentation](https://www.mediawiki.org/wiki/API:Revisions).

In [None]:
request = {}
request['titles'] = "" # Replace with your article title. Avoid long articles like Obama, Trump, etc. if possible
request['action'] = 'query'
request['format'] = 'json'
request['prop'] = 'revisions'
request['rvlimit'] = 100
request['rvdir'] = 'older'
request['formatversion'] = 2
request['rvprop'] = '|'.join(['user','timestamp','ids','size','sha1'])
request

Make the request and store the data in `result`.

In [None]:
result = requests.get('https://en.wikipedia.org/w/api.php', params=request).json()

What are the keys in the object?

Navigate to the location in the dictionary where the list of revisions lives and confirm there are no more than 100 revisions.
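
As a rough guide to the shape of the response, the sketch below uses a small hand-built dictionary. The name `sample_result` and its users and sizes are made up; it only mimics what the API is assumed to return when `formatversion` is 2, where `pages` is a list. Your real `result` may contain additional keys.

```python
# Hypothetical, hand-built stand-in for the API response (not real data)
sample_result = {
    'batchcomplete': True,
    'query': {
        'pages': [
            {
                'pageid': 1,
                'title': 'Example',
                'revisions': [
                    {'revid': 3, 'user': 'Alice', 'size': 120},
                    {'revid': 2, 'user': 'Bob', 'size': 100},
                ],
            }
        ]
    },
}

# Top-level keys of the response
top_level_keys = list(sample_result.keys())

# Drill down one level at a time to reach the list of revisions
sample_revisions = sample_result['query']['pages'][0]['revisions']

print(top_level_keys)
print(len(sample_revisions) <= 100)
```

Inspecting `.keys()` at each level before indexing further is usually the quickest way to find your way around an unfamiliar JSON response.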

Which users have made the most revisions over this timespan? Make an empty dictionary `edit_counter` and loop through the revisions and store the number of changes each user made to the article.

In [None]:
edit_counter = {}

# Your for loop
        
{k:v for k,v in edit_counter.items() if v > 1}
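
One common way to count with a plain dictionary is `dict.get` with a default of 0. The sketch below runs on a made-up list (`sample_revisions` and the user names are hypothetical) so the pattern is visible without touching your real data.

```python
# Hypothetical revisions; only the 'user' key matters for counting
sample_revisions = [
    {'user': 'Alice'},
    {'user': 'Bob'},
    {'user': 'Alice'},
    {'user': 'Carol'},
    {'user': 'Alice'},
]

sample_counter = {}
for rev in sample_revisions:
    user = rev['user']
    # dict.get returns 0 the first time a user is seen
    sample_counter[user] = sample_counter.get(user, 0) + 1

# Keep only users with more than one edit, as in the cell above
frequent_editors = {k: v for k, v in sample_counter.items() if v > 1}
print(sample_counter)
print(frequent_editors)
```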

Using the `enumerate` function to give you index positions, loop over this list of revisions and print out the index position of each revision in the list and the "size" of each revision.
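
As a reminder of how `enumerate` behaves, here is a minimal sketch on a made-up list (`sample_revisions` is hypothetical). Each pass through the loop yields a position and the item at that position.

```python
# Hypothetical revisions; only the 'size' key is used
sample_revisions = [{'size': 120}, {'size': 100}, {'size': 95}]

indexed_sizes = []
for i, rev in enumerate(sample_revisions):
    # i is the position in the list, rev is the revision dictionary
    print(i, rev['size'])
    indexed_sizes.append((i, rev['size']))
```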

## Problem 2: Analyzing revert behavior in revision logs
Write a loop that goes through all the revisions, checks if the content hash ("sha1") is in a set of `unique_revs`, and if the content of a revision is the same as a previous revision, prints out:

* Position in the revision list (use the enumerate function to catch positions in a loop!)
* Revision ID
* Content hash
* Timestamp

In [None]:
unique_revs = set()

# Your loop here
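
The membership test is what makes a set useful here: checking `x in some_set` does not require scanning a list. The sketch below shows one possible shape for the loop on made-up data (`sample_revisions`, its hashes, and its IDs are hypothetical).

```python
# Hypothetical revisions with shortened, made-up hashes
sample_revisions = [
    {'revid': 104, 'sha1': 'aaa', 'timestamp': '2017-03-04T00:00:00Z'},
    {'revid': 103, 'sha1': 'bbb', 'timestamp': '2017-03-03T00:00:00Z'},
    {'revid': 102, 'sha1': 'aaa', 'timestamp': '2017-03-02T00:00:00Z'},
]

seen_hashes = set()
repeats = []
for i, rev in enumerate(sample_revisions):
    if rev['sha1'] in seen_hashes:
        # Same content as a revision already visited in this loop
        repeats.append((i, rev['revid'], rev['sha1'], rev['timestamp']))
    else:
        seen_hashes.add(rev['sha1'])

print(repeats)
```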

Which pairs of editors were most involved in reverting behavior? Put another way: if an article's content hash value at time $t$ reverts back to a previous version at time $t+1$, record the users who authored the article versions at time $t$ and $t+1$ as a tuple and store it in a list `warring_users`. It probably makes sense to follow the direction of time and work from the back of the list forward, so consider reversing the list of revisions.

In [None]:
warring_users = []
unique_revs = set()

# Your loop here
        
warring_users
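
One possible reading of the revert rule is sketched below on made-up data. The names `sample_revisions`, `chronological`, and `sample_warring` are hypothetical, and the tuple order (reverted user, reverting user) is an assumption about how you might choose to record each pair.

```python
# Hypothetical revisions, newest first (the order returned with rvdir='older')
sample_revisions = [
    {'user': 'Alice', 'sha1': 'aaa'},  # Alice restores the 'aaa' content
    {'user': 'Bob', 'sha1': 'bbb'},    # Bob changes the article
    {'user': 'Alice', 'sha1': 'aaa'},  # Alice's original version
]

# Follow the direction of time: oldest revision first
chronological = list(reversed(sample_revisions))

seen_hashes = set()
sample_warring = []
for i, rev in enumerate(chronological):
    if rev['sha1'] in seen_hashes:
        # The content at time t+1 matches an earlier version, so the
        # author at time t (the previous revision) was reverted
        reverted_user = chronological[i - 1]['user']
        sample_warring.append((reverted_user, rev['user']))
    seen_hashes.add(rev['sha1'])

print(sample_warring)
```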

Are there any patterns in terms of the user doing the reverting and the users being reverted?

## Problem 3: Examining changes over time
The [strptime](http://strftime.org/) method of Python's `datetime` class converts formatted date strings into datetime objects, while `strftime` converts datetime objects back into strings.

In [None]:
date_string = '2017-03-01'
date_datetime = datetime.strptime(date_string,'%Y-%m-%d')
date_datetime
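
Going the other direction, `strftime` takes the same kind of format codes. A minimal round trip, with an arbitrarily chosen output format, might look like this:

```python
from datetime import datetime

# Parse a string into a datetime object...
parsed = datetime.strptime('2017-03-01', '%Y-%m-%d')

# ...and format it back into a string with a different layout
formatted = parsed.strftime('%d/%m/%Y')
print(formatted)
```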

A datetime object has attributes like `year`, `month`, `day`, `hour`, `minute`, and `second`. Get the year out of a datetime object.

In [None]:
date_datetime.year

What does a Wikipedia timestamp look like? Store an example as `example_timestamp`.

In [None]:
example_timestamp = # Your accessor here
example_timestamp

Use the strptime function and documentation to convert this string to a datetime object.

In [None]:
parsed_example_timestamp = datetime.strptime(example_timestamp,'%Y-%m-%dT%H:%M:%SZ')
parsed_example_timestamp

Count how many edits occur by hour of the day by starting with an empty `hour_counter` dictionary. Write a loop to go through the revisions, retrieve each revision's timestamp, convert it to a datetime object, and then increment the `hour_counter` dictionary. What time of day appears to be the most popular for this article to be edited?

In [None]:
hour_counter = {}

# Your loop here
        
hour_counter
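
The sketch below shows the parse-then-count pattern on three made-up timestamps (`sample_timestamps` is hypothetical). Note that timestamps from the Wikipedia API are in UTC, so the hours reflect UTC rather than any editor's local time.

```python
from datetime import datetime

# Hypothetical timestamps in the same format the API uses
sample_timestamps = [
    '2017-03-01T14:05:00Z',
    '2017-03-01T14:45:10Z',
    '2017-03-02T09:00:00Z',
]

sample_hour_counter = {}
for ts in sample_timestamps:
    # Parse the string, then pull out the hour attribute
    hour = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ').hour
    sample_hour_counter[hour] = sample_hour_counter.get(hour, 0) + 1

print(sample_hour_counter)
```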

Similarly, create an empty `weekday_counter` dictionary, write a loop to go through the revisions, retrieve each revision's timestamp, convert it to a datetime object, use the `.weekday()` method to get the day of the week (0=Monday, 6=Sunday), and then increment the `weekday_counter`. What day of the week appears to be the most popular for this article to be edited?

In [None]:
weekday_counter = {}

# Your loop here
        
weekday_counter
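
To interpret the keys of `weekday_counter`, it helps to check the numbering on a date whose weekday you already know. The list `day_names` below is a hypothetical helper for turning the integers into labels.

```python
from datetime import datetime

# weekday() numbers the days starting from Monday
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

wednesday = datetime(2017, 3, 1)  # 1 March 2017 fell on a Wednesday
sunday = datetime(2017, 3, 5)

print(wednesday.weekday(), day_names[wednesday.weekday()])
print(sunday.weekday(), day_names[sunday.weekday()])
```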

The `.date()` method can be applied to datetime objects to keep the date and throw away the hours, minutes, and seconds.

In [None]:
parsed_example_timestamp_date = # Get date out
parsed_example_timestamp_date

Because these date objects are immutable, Python can hash them, which also means they can be used as keys in a dictionary.

In [None]:
# Check that this datetime object hashes
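
A quick way to see what "hashable" means is to compare a date with a mutable object such as a list. This sketch uses a made-up timestamp, and the names `example_date` and `per_day` are hypothetical.

```python
from datetime import datetime, date

example_date = datetime.strptime('2017-03-01T12:30:00Z',
                                 '%Y-%m-%dT%H:%M:%SZ').date()

# hash() succeeds on a date and returns an integer
hash_value = hash(example_date)

# ...so a date can be used as a dictionary key
per_day = {example_date: 1}

# A list is mutable, so hashing it raises TypeError
try:
    hash([2017, 3, 1])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(isinstance(hash_value, int), list_is_hashable)
```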

Write a loop through your revisions that populates a dictionary `date_counter` with the number of revisions made on each day.

In [None]:
date_counter = {}

# Your loop here
        
date_counter

Use matplotlib to create a bar chart of the number of edits per day.

In [None]:
f,ax = plt.subplots(1,1,figsize=(12,3))
ax.bar(list(date_counter.keys()),list(date_counter.values()))
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')

## Problem 4: Best way to secure your account from brute force password hacking?

You've probably had to create a password for an account where they impose demands that there must be at least 1 number, capitalized character, special character, *etc*. But is making more complicated passwords a better strategy for defending against brute-force password cracking attempts than simply requiring longer passwords?

In [None]:
from itertools import product
import string
import numpy as np

Here is a string of all the ASCII letters, lower- and upper-case:

In [None]:
string.ascii_letters

Here's a string containing all of the digits.

In [None]:
string.digits

Here's a string containing lots of punctuation characters:

In [None]:
string.punctuation

Let's naively assume that everyone uses completely random passwords. The loop below generates one random letters-only password for each length from 2 through 9.

In [None]:
simple_pass_dict = {}

for num in range(2,10):
    _array = np.random.choice(list(string.ascii_letters),num)
    _str = ''.join(_array)
    simple_pass_dict[num] = _str
    
simple_pass_dict

Compare this to a moderate password scheme that combines letters and digits.

In [None]:
moderate_pass_dict = {}

letters_and_nums = string.ascii_letters + string.digits

for num in range(2,10):    
    _array = np.random.choice(list(letters_and_nums),num)
    _str = ''.join(_array)
    moderate_pass_dict[num] = _str
    
moderate_pass_dict

Finally, compare to a complex password scheme that combines letters, digits, and punctuation.

In [None]:
complex_pass_dict = {}

letters_nums_punct = string.ascii_letters + string.digits + string.punctuation

for num in range(2,10):
    _array = np.random.choice(list(letters_nums_punct),num)
    _str = ''.join(_array)
    complex_pass_dict[num] = _str
    
complex_pass_dict

The password you chose is just one of many possible permutations of characters.

* The simple scheme has 26 lowercase letters + 26 uppercase letters = 52 total possibilities per position
* The moderate scheme has 52 letters + 10 digits = 62 total possibilities per position
* The complex scheme has 62 letters and digits + 32 punctuation characters = 94 total possibilities per position

We can use the `product` function from itertools to iterate through all the 2-letter combinations of letters. Here are the first 5.

In [None]:
list(product(string.ascii_letters,string.ascii_letters))[:5]

The length of this list of permutations is 2704, which is the same as 52\*52, or $N_{possibilities}^{N_{letters}} = 52^2$.

In [None]:
len(list(product(string.ascii_letters,string.ascii_letters))), 52*52, 52**2
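
The same formula can be checked for a slightly longer password. This sketch uses the `repeat` argument of `product` and counts the results with a generator so the full list is never held in memory. The length 3 is an arbitrary choice that keeps the check fast.

```python
from itertools import product
import string

# Count every 3-letter sequence without building the whole list
three_letter_count = sum(1 for _ in product(string.ascii_letters, repeat=3))

# Compare against N_possibilities ** N_letters
print(three_letter_count, 52**3)
```

Each extra character multiplies the count by 52, which is why enumerating the sequences directly stops being practical after a handful of characters.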

With this math in mind, we don't need to use `product` to generate the lists: the formula gives us worst-case estimates for the number of brute-force guesses needed to crack a password.

In [None]:
number_of_letters = np.arange(2,21)
number_of_letters

Use the power of NumPy to compute the worst-case number of guesses you'd need to make to crack a random password for each of the simple, moderate, and complex schemes.

In [None]:
simple_scheme = 52.0**number_of_letters 
moderate_scheme = 62.0**number_of_letters 
complex_scheme = 94.0**number_of_letters

Plot these out.

In [None]:
f,ax = plt.subplots(1,1)

# Divide the number of guesses by 100 gigaflops (roughly how many calculations a laptop can do per second)
# times the number of seconds in a day to get the number of days a computer would need to run in the worst case

ax.plot(number_of_letters,simple_scheme / (86400*100*1e9),lw=3,color='b',label='Simple')
ax.plot(number_of_letters,moderate_scheme / (86400*100*1e9),lw=3,color='r',label='Moderate')
ax.plot(number_of_letters,complex_scheme / (86400*100*1e9),lw=3,color='g',label='Complex')
ax.grid(True)

ax.set_yscale('log')
ax.set_xlabel('Password length')
ax.set_ylabel('Number of days to guess')

# Draw dotted cyan horizontal lines for different time thresholds
ax.axhline(1,c='c',ls='--') # A day
ax.axhline(365,c='c',ls='--') # A year
ax.axhline(3650,c='c',ls='--') # A decade
ax.axhline(36500,c='c',ls='--') # A century
ax.axhline(365000,c='c',ls='--') # A millennium

# Don't forget the legend
ax.legend(loc='center left',bbox_to_anchor=(1,.5))
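
To make the conversion used in the plot concrete, here is the same arithmetic for a single password length. It relies on the plot's assumption of 100 billion guesses per second, treating one guess as one calculation. The names `alphabet_sizes` and `days_to_exhaust` are hypothetical.

```python
SECONDS_PER_DAY = 86400
GUESSES_PER_SECOND = 100 * 1e9  # same assumed rate as in the plot

alphabet_sizes = {'simple': 52, 'moderate': 62, 'complex': 94}
password_length = 8

# Worst case: every possible password of this length has to be tried
days_to_exhaust = {
    name: size ** password_length / (GUESSES_PER_SECOND * SECONDS_PER_DAY)
    for name, size in alphabet_sizes.items()
}

for name, days in days_to_exhaust.items():
    print(f"{name}: {days:.4f} days")
```

Under this assumed rate, even the complex scheme is exhausted in under a day at 8 characters, which is why the plot extends to much longer passwords.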

Write a loop that prints out how many letters a simple or moderate password needs in order to match the security of a complex password. Do this by finding the first length at which the less powerful scheme with more characters has more permutations than the more powerful scheme with fewer characters. You don't need dictionaries or sets, just a loop, `if` statements, `print`, and `break`.

In [None]:
# Write a loop that includes the following print statements somewhere inside it
# The idx_moderate, etc. are indexes returned from an enumerate function
print("A {0}-letter moderate password overpowers a {1}-letter complex password".format(idx_moderate+2, idx_complex+2))    

print("A {0}-letter simple password overpowers a {1}-letter complex password\n".format(idx_simple+2, idx_complex+2))
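
As an independent cross-check on whatever your loop prints, the break-even lengths can also be estimated with logarithms: $62^m > 94^n$ exactly when $m > n \cdot \ln 94 / \ln 62$. The sketch below is not the loop the exercise asks for, and the variable names and the choice of an 8-character complex password are assumptions made for illustration.

```python
import math

# Length multiplier: how many times longer the weaker password must be
ratio_moderate = math.log(94) / math.log(62)
ratio_simple = math.log(94) / math.log(52)

# Smallest whole length that strictly exceeds the break-even point
complex_length = 8
moderate_length = math.floor(complex_length * ratio_moderate) + 1
simple_length = math.floor(complex_length * ratio_simple) + 1

print(round(ratio_moderate, 3), round(ratio_simple, 3))
print(moderate_length, simple_length)
```

Both ratios are only slightly above 1, which suggests that a small increase in length buys about as much as adding punctuation to the alphabet.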


## Extra Credit: Getting more data out of the API
How does the API let us know if there are more revisions available than the 100 most recent we requested? Hint: look at the documentation about [how to continue queries](https://www.mediawiki.org/wiki/API:Query#Continuing_queries).

Update the `request` dictionary to include any new keys to access the next 100 most recent revisions from the API.

Confirm that these revisions' revids don't overlap with your previous call.

Combine the two sets of revisions.

Write a function `get_revisions` that will keep requesting data from the API and cumulatively combining each API call's revisions with previous revisions until there are no more revisions to get.

In [None]:
def get_revisions(page_title):
    
    # Your code here
    
    return revisions
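
The continuation pattern described in the linked documentation can be sketched without touching the network. In the sketch below, `collect_all_revisions`, `fake_fetch`, and the token value are all hypothetical. The stand-in `fake_fetch` imitates an API that returns a `continue` dictionary while more results remain and omits it on the last batch. With the real API you might pass something like `lambda req: requests.get(url, params=req).json()` as `fetch`.

```python
def collect_all_revisions(base_request, fetch):
    """Call fetch repeatedly until the response has no 'continue' key."""
    revisions = []
    request = dict(base_request)  # copy so the caller's dict is untouched
    while True:
        response = fetch(request)
        for page in response['query']['pages']:
            revisions.extend(page.get('revisions', []))
        if 'continue' not in response:
            break
        # Copy every continuation value into the next request
        request.update(response['continue'])
    return revisions


def fake_fetch(request):
    """Hypothetical two-batch stand-in for the API (made-up data)."""
    if 'rvcontinue' not in request:
        return {
            'continue': {'rvcontinue': 'token-1', 'continue': '||'},
            'query': {'pages': [{'revisions': [{'revid': 3}, {'revid': 2}]}]},
        }
    return {
        'batchcomplete': True,
        'query': {'pages': [{'revisions': [{'revid': 1}]}]},
    }


all_revisions = collect_all_revisions({'titles': 'Example'}, fake_fetch)
print([rev['revid'] for rev in all_revisions])
```

Passing the fetching step in as a function keeps the looping logic easy to test on fake data before pointing it at the live API.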