# Testing

## Why?

- Prove that your code works.
- Make modifying it easy and safe.
- Document expected behaviour.
- It's easy to do.
- There's a whole [software development methodology][tdd] behind it.

[tdd]: http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd

In [None]:
from datetime import datetime
def ymd(date_string):
    """Convert date string to datetime object.
    
    Example:
    
    >>> ymd('2016 01 30')
    datetime.datetime(2016, 1, 30, 0, 0)
    """
    format_string = '%Y %m %d'
    return datetime.strptime(date_string, format_string)

In [None]:
ymd('2016 01 30') # expected behavior

# Unittest

Python's [unittest module][utest] makes it easy to set the expected behavior in stone.

[utest]: https://docs.python.org/2/library/unittest.html

In [None]:
import unittest

In [None]:
# make a TestCase with a test for expected behavior
class YMDTest(unittest.TestCase):
    def test_with_known_values(self):
        """Test ymd with known values."""
        ymd_result = ymd('2016 01 30')
        known = datetime(2016, 1, 30, 0, 0)
        self.assertEqual(ymd_result, known)
# a collection of these is great documentation for the code

# Running

Tests can be run from the command line like so:

    python -m unittest YMDTest
    python -m unittest my_module.YMDTest
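
Besides the command line, tests can also be collected and run from Python itself. The following is a minimal sketch using the standard `TestLoader` and `TextTestRunner`; `ExampleTest` is a hypothetical stand-in for a test case such as `YMDTest`.

```python
import unittest


class ExampleTest(unittest.TestCase):
    # Hypothetical stand-in for a real test case such as YMDTest.
    def test_addition(self):
        self.assertEqual(1 + 1, 2)


# Collect every test_* method of the class into a suite.
suite = unittest.TestLoader().loadTestsFromTestCase(ExampleTest)

# Run the suite; the report is written to stderr by default.
outcome = unittest.TextTestRunner(verbosity=2).run(suite)

print(outcome.testsRun)
print(outcome.wasSuccessful())
```

The returned result object carries the same information as the `TestResult` used in the cells below, so either approach should work in a notebook.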

In [None]:
# This is an easy way to run tests in a notebook
result = unittest.TestResult()
test = YMDTest('test_with_known_values')
test.run(result)
result

In [None]:
# Let's write a function to save typing:
def run_test(cls, method_name):
    result = unittest.TestResult()
    test = cls(method_name)
    test.run(result)
    return result

In [None]:
run_test(YMDTest, 'test_with_known_values')

## What about exceptions?

In [None]:
class YMDError(YMDTest):
    def test_throws_on_invalid_string(self):
        with self.assertRaises(ValueError):
            ymd('')
        with self.assertRaises(ValueError):
            # day out of range
            ymd('2016 02 30')
# again, this nicely documents the behavior of ymd!

In [None]:
run_test(YMDError, 'test_throws_on_invalid_string')

# What you can test

- Pipelines
- Data prep & transformation
- Data ingestion
- ...

# What you can't test

- Prediction accuracy (within limits)
    - Disaster prevention is possible
- 'Statistical stuff'
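
Prediction accuracy cannot be pinned to an exact value, but a test can still guard against disasters by asserting a lower bound. A minimal sketch, assuming a hypothetical `predict` function and a tiny hand-labelled sample; the floor of 0.75 is an arbitrary choice for illustration.

```python
import unittest


def predict(x):
    # Hypothetical stand-in for a trained model: label numbers >= 5 as 1.
    return 1 if x >= 5 else 0


# Tiny hand-labelled sample (made up for illustration).
SAMPLE = [(1, 0), (2, 0), (4, 0), (5, 1), (6, 0), (7, 1), (8, 1), (9, 1)]


def accuracy(pairs):
    # Fraction of pairs where the prediction matches the label.
    hits = sum(1 for x, label in pairs if predict(x) == label)
    return hits / float(len(pairs))


class AccuracyGuard(unittest.TestCase):
    def test_accuracy_above_floor(self):
        # Not an exact value, only a floor that catches a broken model.
        self.assertGreaterEqual(accuracy(SAMPLE), 0.75)


guard_result = unittest.TestResult()
AccuracyGuard('test_accuracy_above_floor').run(guard_result)
print(guard_result.wasSuccessful())
```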

In [None]:
# TestCase has a lot of assertXXX methods ...
# ... let's demo some of those!
class MyTests(unittest.TestCase):
    def test_stuff(self):
        self.assertAlmostEqual(0.00000000001, 0)
        self.assertEqual(1, 1)
        self.assertEqual([1,2], [1,2])
        self.assertEqual({1: 1}, {1: 1})
        
        self.assertTrue(1 == 1) # avoid
        self.assertTrue(1 in range(2)) # avoid
        self.assertIn(1, range(2)) # better
        
        self.assertTrue(1 < 2) # avoid
        self.assertLess(1, 2) # better

In [None]:
run_test(MyTests, 'test_stuff')

In [None]:
# neat way for listing all assert functions
for i in dir(unittest.TestCase):
    if i.startswith('assert'):
        print(i)

## What about docstrings?

Some docstrings contain examples. The [doctest module][dt] finds and runs them automatically, making sure the output is replicated.

[dt]: https://docs.python.org/2/library/doctest.html
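
To see what "finding and running" means in practice, here is a small sketch that parses the examples out of a docstring and runs them with the standard `doctest` classes. The function `double` is hypothetical and only serves as an illustration.

```python
import doctest


def double(x):
    """Return twice the argument (hypothetical example function).

    >>> double(2)
    4
    >>> double('ab')
    'abab'
    """
    return 2 * x


# Extract the examples from the docstring into a DocTest object.
parser = doctest.DocTestParser()
test = parser.get_doctest(double.__doc__, {'double': double},
                          'double', None, 0)

# Run the examples and compare actual with expected output.
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.failed, results.attempted)
```

A failing example would be reported on standard output and counted in `results.failed`.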

In [None]:
print(ymd.__doc__)

In [None]:
import doctest

In [None]:
doctest.run_docstring_examples(ymd, globals(), True)

# Twitter

The [Twitter API][tapi] is a gold mine for data. People talk about all kinds of topics, all the time. It's a great place to find opinions on almost anything. Make sure you understand the terms and conditions if you plan bigger projects.

We will interact with it in a very straightforward way using the [requests][req] library.

[tapi]: https://dev.twitter.com/overview/api
[req]: http://docs.python-requests.org/en/master/

In [None]:
import requests
from requests_oauthlib import OAuth1

In [None]:
# use your own here
# you get them at apps.twitter.com
from donthackme import CONSUMER_KEY, CONSUMER_SECRET, TOKEN, TOKEN_SECRET

In [None]:
# api url
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'

In [None]:
# this is needed to authenticate our requests with Twitter
auth = OAuth1(CONSUMER_KEY, CONSUMER_SECRET, TOKEN, TOKEN_SECRET)

In [None]:
# let's verify that we're authenticated
verify_response = requests.get(url, auth=auth)

In [None]:
# 200 is great!
verify_response

In [None]:
# this tells us that we got a JSON response
# which is not surprising given the URL
verify_response.headers

In [None]:
# let's look at the JSON
# there's a lot of information
verify_response.json().keys()

In [None]:
verify_response.json()['name']

In [None]:
verify_response.json()['screen_name']

## What data is available?

The statuses (aka Tweets) in a response from the Twitter API contain *a lot* of information. You can explore it through Twitter's [documentation][tdoc].

[tdoc]: https://dev.twitter.com/overview/api/tweets

# Search

Use the Twitter search function to construct your queries. Just type something in and see what the URL that gets opened looks like. Also check the [documentation](https://dev.twitter.com/rest/public/search).
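
Query parameters end up in the URL as an encoded query string. The sketch below uses only the standard library (not requests itself) to show roughly what the final search URL looks like; the parameter values are just examples.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

search_url = 'https://api.twitter.com/1.1/search/tweets.json'

# A list of pairs keeps the parameter order fixed.
params = [('q', 'data science'), ('result_type', 'recent')]

# Spaces become '+', pairs are joined with '&'.
full_url = search_url + '?' + urlencode(params)
print(full_url)
```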

In [None]:
search_url = 'https://api.twitter.com/1.1/search/tweets.json'

In [None]:
params = {'q': 'data science',
          'result_type': 'recent'} # popular also possible

In [None]:
# same procedure as before
search_response = requests.get(search_url, params=params, auth=auth)

In [None]:
# 200 again, that's good
search_response

In [None]:
re_json = search_response.json()
# this contains a lot of information
# I'd encourage you to play around with it!

In [None]:
first_status = re_json.get('statuses', [{}])[0]
print(first_status['text'])

In [None]:
sorted(first_status.keys())

In [None]:
first_status['user'].keys()

In [None]:
# the first status
re_json['statuses'][0]

## Streaming API

While the search API gives you a static batch of statuses, the *streaming* API will send you statuses until interrupted.
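
The pattern for consuming such a stream can be tried out without a network connection. In this sketch, `fake_stream` is a hypothetical stand-in for `iter_lines()` on a streaming response: it yields made-up JSON payloads forever, with empty lines playing the role of keep-alive signals.

```python
import json
from itertools import islice


def fake_stream():
    # Hypothetical stand-in for r.iter_lines(): endless made-up payloads,
    # with every third line empty, like a keep-alive signal.
    n = 0
    while True:
        n += 1
        if n % 3 == 0:
            yield b''
        else:
            payload = {'text': 'status number %d' % n}
            yield json.dumps(payload).encode('utf-8')


def texts(lines):
    # Skip keep-alives, decode the rest and pull out the text field.
    for line in lines:
        if line:
            yield json.loads(line.decode('utf-8'))['text']


# islice stops the otherwise endless stream after three statuses.
first_three = list(islice(texts(fake_stream()), 3))
print(first_three)
```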

In [None]:
from itertools import islice

In [None]:
# for this to work, we need POST instead of GET
r = requests.post('https://stream.twitter.com/1.1/statuses/filter.json',
                 params = {'track': '#data'},
                 auth=auth,
                 stream=True) # important

In [None]:
tweets = r.iter_lines()

In [None]:
import json

In [None]:
# this would work without islice as well, but it
# would go on forever (for tweet in tweets: ...)
for tweet in islice(tweets, 20):
    if tweet:
        # .get: not every message carries a 'text' field
        print(json.loads(tweet).get('text', '')[:20])
    else:
        # empty lines are keep-alive signals
        print('Keep-alive.')

In [None]:
r.close() # always do this

In [None]:
# student question: What about the u in u'string'?
print(u'Hi, Håvard!')

In [None]:
u'Hi, Håvard!' # good

In [None]:
'Hi, Håvard!' # less good!

In [None]:
type(u'')
# unicode can deal with almost any character

In [None]:
type('')

# Storing things

## Files (e.g. .csv)

- Don't scale (could use HDFS)
- What about JSON?
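
Plain JSON files are one possible answer, with a caveat: the `json` module cannot serialise `datetime` objects by itself. A rough sketch, assuming one JSON document per line and ISO-formatted dates; the file name and the record are made up.

```python
import json
import os
import tempfile
from datetime import datetime

# Made-up record for illustration.
records = [{'name': 'example', 'seen': datetime(2016, 1, 30)}]


def encode_date(obj):
    # Called by json for objects it cannot serialise by itself.
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError('Cannot serialise %r' % (obj,))


path = os.path.join(tempfile.mkdtemp(), 'records.jsonl')

# Write one JSON document per line.
with open(path, 'w') as f:
    for record in records:
        f.write(json.dumps(record, default=encode_date) + '\n')

# Read the documents back.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded)
```

Note that the date comes back as a plain string, so the type information is lost on the round trip.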

## Databases

- SQL
- NoSQL

# MongoDB

- Windows: Download 'Community edition'
- macOS:

        brew update
        brew install mongodb
        brew services start mongodb

- Linux:

        sudo apt install mongodb
    
Plus:

    pip install pymongo
    
You can get great documentation at the [MongoDB website][mgs].

[mgs]: https://docs.mongodb.com/getting-started/python/

MongoDB stores BSON, a binary version of JSON.

In [None]:
from pymongo import MongoClient # our window into MongoDB

In [None]:
MongoClient('localhost', 27017)

In [None]:
MongoClient()

In [None]:
# stkinf - Database
# music - Collection
c = MongoClient().stkinf.music

In [None]:
c

In [None]:
# Now the neat thing with MongoDB is that we can store
# basic python objects like dicts, lists, basic types,
# dates, etc. The top-level object needs to be a dictionary.
jackson = {'name': {'first': 'Michael',
                    'last': 'Jackson',
                    'middle': 'Joseph'},
           'born': datetime(1958, 8, 29),
           'died': datetime(2009, 6, 25),
           'albums': [{'name': 'Thriller',
                       'released': 1982},
                      {'name': 'Bad',
                       'released': 1987}]}

In [None]:
# let's insert this into our MongoDB collection
result = c.insert_one(jackson)

In [None]:
result

In [None]:
result.acknowledged

In [None]:
result.inserted_id

In [None]:
# let's get it back
c.find_one(result.inserted_id)

In [None]:
# getting a single object
c.find_one() # the first one found, not a random one

In [None]:
# getting many objects
list(c.find())

In [None]:
for i in c.find():
    print(i)

In [None]:
# getting max. 10 objects
for i in c.find().limit(10):
    print(i)

In [None]:
# search by field
c.find_one({'born': datetime(1958, 8, 29)})

In [None]:
# search by nested field
c.find_one({'name.middle': 'Joseph'})

In [None]:
# another example of nested field search
c.find_one({'albums.released': 1982})

In [None]:
# comparison operator
# we want albums.released > 1980
c.find_one({'albums.released': {'$gt': 1980}})

In [None]:
# AND: all conditions in the dict must match
# (returns None unless someone is called Michael Bowie)
c.find_one({'name.first': 'Michael',
             'name.last': 'Bowie'})

In [None]:
# OR
c.find_one({'$or': [{'name.first': 'Michael'},
                     {'name.last': 'Bowie'}]})

In [None]:
# Let's make another object
# It doesn't need to have the same fields.
bowie = {'name': {'first': 'David',
                  'last': 'Bowie',
                  'middle': 'Robert'},
         'born': datetime(1947, 1, 8)}

In [None]:
# insert it
c.insert_one(bowie)

In [None]:
# let's return only first name and birth date
list(c.find({}, {'name.first': 1,
                 'born': 1}))

In [None]:
# suppress the id
list(c.find({}, {'name.first': 1,
                 'born': 1,
                 '_id': 0}))

In [None]:
# ditto
list(c.find({}, {'_id': 0}))

In [None]:
# another example for AND: 1900 < born < 2000
# both operators go into one dict; repeating the 'born' key
# would silently drop the first condition
c.find_one({'born': {'$lt': datetime(2000, 1, 1), # AND
                     '$gt': datetime(1900, 1, 1)}})

# Modifying things

In [None]:
# add albums field
result = c.update_one({'name.last': 'Bowie'},
                      {'$set': {'albums': []}})

In [None]:
result.acknowledged

In [None]:
# worked
c.find_one({'name.last': 'Bowie'})

In [None]:
# append to a list
c.update_one({'name.last': 'Bowie'},
             {'$push': {'albums': {'name': "Let's Dance",
                                   'released': 1983}}})

In [None]:
c.find_one({'name.last': 'Bowie'})

In [None]:
# increment a field
c.update_one({'name.last': 'Bowie'},
             {'$inc': {'albums.0.released': 1}})

In [None]:
c.find_one({'name.last': 'Bowie'})

In [None]:
# decrement a field
r = c.update_one({'name.last': 'Bowie'},
                 {'$inc': {'albums.0.released': -1}})

## Aggregation

We'll talk more about this next session.

In [None]:
agg = c.aggregate([{'$group': {'_id': '$born',
                               'people_count': {'$sum': 1}}}])

In [None]:
list(agg)
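
What the `$group` stage computes can be mimicked in plain Python. This is only a sketch of the idea, not of how MongoDB implements it; the documents are made up, and the sorting is there just to make the output stable (MongoDB itself does not promise any order for the groups).

```python
from collections import Counter
from datetime import datetime

# Made-up documents with the same shape as those in the collection.
people = [{'name': 'a', 'born': datetime(1958, 8, 29)},
          {'name': 'b', 'born': datetime(1947, 1, 8)},
          {'name': 'c', 'born': datetime(1958, 8, 29)}]

# Group by the value of 'born' and count the members of each group.
counts = Counter(person['born'] for person in people)

# Same result shape as the aggregation: one dict per group.
grouped = [{'_id': born, 'people_count': n}
           for born, n in sorted(counts.items())]
print(grouped)
```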