# Testing

## Why?

- Prove that your code works.
- Make modifying it easy and safe.
- Document expected behaviour.
- It's easy to do.
- There's a whole [software development methodology][tdd] behind it.

[tdd]: http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd

In [4]:
from datetime import datetime
def ymd(date_string):
    """Convert date string to datetime object.
    
    Example:
    
    >>> ymd('2016 01 30')
    datetime.datetime(2016, 1, 30, 0, 0)
    """
    format_string = '%Y %m %d'
    return datetime.strptime(date_string, format_string)

In [4]:
ymd('2016 01 30') # expected behavior

datetime.datetime(2016, 1, 30, 0, 0)

# Unittest

The Python [unittest module][utest] makes it really easy to set the expected behavior in stone.

[utest]: https://docs.python.org/2/library/unittest.html

In [5]:
import unittest

In [24]:
# make a TestCase with a test for expected behavior
class YMDTest(unittest.TestCase):
    def test_with_known_values(self):
        """Test ymd with known values."""
        ymd_result = ymd('2016 01 30')
        known = datetime(2016, 1, 30, 0, 0)
        self.assertEqual(ymd_result, known)
# a collection of these makes great documentation for the code

# Running

Tests can be run from the command line like so:

    python -m unittest YMDTest
    python -m unittest my_module.YMDTest

In [23]:
# This is an easy way to run tests in a notebook
result = unittest.TestResult()
test = YMDTest('test_with_known_values')
test.run(result)
result

<unittest.result.TestResult run=1 errors=0 failures=0>

In [14]:
# Let's write a function to save typing:
def run_test(cls, method_name):
    result = unittest.TestResult()
    test = cls(method_name)
    test.run(result)
    return result

In [15]:
run_test(YMDTest, 'test_with_known_values')

<unittest.result.TestResult run=1 errors=0 failures=0>
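
`run_test` runs one method at a time. To run every test method of a class in one go, the standard library's `TestLoader` and `TextTestRunner` can be combined. This is only a sketch: `ExampleTest` and `run_all_tests` are hypothetical names, with `ExampleTest` standing in for `YMDTest` so the block is self-contained.

```python
import unittest


class ExampleTest(unittest.TestCase):
    """Hypothetical stand-in for YMDTest so the sketch is self-contained."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)

    def test_membership(self):
        self.assertIn(1, [1, 2])


def run_all_tests(cls):
    """Run every test_* method of a TestCase class and return the result."""
    suite = unittest.TestLoader().loadTestsFromTestCase(cls)
    # The runner writes a short report to stderr and returns a result object.
    runner = unittest.TextTestRunner(verbosity=0)
    return runner.run(suite)


all_result = run_all_tests(ExampleTest)
print('ran %d tests, success: %s' % (all_result.testsRun,
                                     all_result.wasSuccessful()))
```

Calling `run_all_tests(YMDTest)` in the notebook should work the same way.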

## What about exceptions?

In [20]:
class YMDError(YMDTest):
    def test_throws_on_invalid_string(self):
        with self.assertRaises(ValueError):
            ymd('')
        with self.assertRaises(ValueError):
            # day out of range
            ymd_result = ymd('2016 02 30')
# again, this nicely documents the behavior of ymd!

In [21]:
run_test(YMDError, 'test_throws_on_invalid_string')

<unittest.result.TestResult run=1 errors=0 failures=0>
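
`assertRaises` can also hand back the exception itself, which is handy when the error message matters. A minimal sketch, assuming the message produced by `strptime` repeats the offending input (`YMDMessageTest` is a hypothetical name, and `ymd` is repeated so the block runs on its own):

```python
import unittest
from datetime import datetime


def ymd(date_string):
    """Same function as above, repeated so the sketch is self-contained."""
    return datetime.strptime(date_string, '%Y %m %d')


class YMDMessageTest(unittest.TestCase):
    def test_error_message_mentions_input(self):
        with self.assertRaises(ValueError) as context:
            ymd('not a date')
        # context.exception is the ValueError instance that was raised
        self.assertIn('not a date', str(context.exception))


message_result = unittest.TestResult()
YMDMessageTest('test_error_message_mentions_input').run(message_result)
print(message_result.wasSuccessful())
```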

# What you can test

- Pipelines
- Data prep & transformation
- Data ingestion
- ...

# What you can't test

- Prediction accuracy (within limits)
    - Disaster prevention is possible
- 'Statistical stuff'
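
Prediction accuracy cannot be pinned to an exact value, but a test can still prevent disasters by asserting a loose lower bound. The following is a rough sketch only: the labels, the predictions, the 0.7 floor, and the names `accuracy` and `AccuracyGuardTest` are all made up for illustration.

```python
import unittest


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / float(len(actual))


class AccuracyGuardTest(unittest.TestCase):
    def test_accuracy_above_floor(self):
        actual = [1, 0, 1, 1, 0, 1, 1, 0]      # made-up labels
        predicted = [1, 0, 1, 0, 0, 1, 1, 1]   # made-up model output
        # Loose bound: catches a broken model, not small fluctuations.
        self.assertGreaterEqual(accuracy(predicted, actual), 0.7)


guard_result = unittest.TestResult()
AccuracyGuardTest('test_accuracy_above_floor').run(guard_result)
print(guard_result.wasSuccessful())
```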

In [40]:
# TestCase has a lot of assertXXX methods ...
# ... let's demo some of those!
class MyTests(unittest.TestCase):
    def test_stuff(self):
        self.assertAlmostEqual(0.00000000001, 0)
        self.assertEqual(1, 1)
        self.assertEqual([1,2], [1,2])
        self.assertEqual({1: 1}, {1: 1})
        
        self.assertTrue(1 == 1) # avoid
        self.assertTrue(1 in range(2)) # avoid
        self.assertIn(1, range(2)) # better
        
        self.assertTrue(1 < 2) # avoid
        self.assertLess(1, 2) # better

In [41]:
run_test(MyTests, 'test_stuff')

<unittest.result.TestResult run=1 errors=0 failures=0>
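
`assertAlmostEqual` deserves a closer look, since floating point results are rarely exactly equal. The sketch below shows its `places` and `delta` arguments; the class name `ToleranceTest` and the numbers are chosen only for illustration.

```python
import unittest


class ToleranceTest(unittest.TestCase):
    def test_float_tolerances(self):
        # 0.1 + 0.2 is not exactly 0.3 in binary floating point ...
        self.assertNotEqual(0.1 + 0.2, 0.3)
        # ... but the difference rounds to zero at 7 places (the default).
        self.assertAlmostEqual(0.1 + 0.2, 0.3)
        # places= sets the number of decimal places explicitly.
        self.assertAlmostEqual(3.14159, 3.1416, places=4)
        # delta= sets an absolute tolerance instead.
        self.assertAlmostEqual(100.0, 101.0, delta=1.5)


tolerance_result = unittest.TestResult()
ToleranceTest('test_float_tolerances').run(tolerance_result)
print(tolerance_result.wasSuccessful())
```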

In [25]:
# neat way to list all assert methods
for i in dir(unittest.TestCase):
    if i.startswith('assert'):
        print i

assertAlmostEqual
assertAlmostEquals
assertDictContainsSubset
assertDictEqual
assertEqual
assertEquals
assertFalse
assertGreater
assertGreaterEqual
assertIn
assertIs
assertIsInstance
assertIsNone
assertIsNot
assertIsNotNone
assertItemsEqual
assertLess
assertLessEqual
assertListEqual
assertMultiLineEqual
assertNotAlmostEqual
assertNotAlmostEquals
assertNotEqual
assertNotEquals
assertNotIn
assertNotIsInstance
assertNotRegexpMatches
assertRaises
assertRaisesRegexp
assertRegexpMatches
assertSequenceEqual
assertSetEqual
assertTrue
assertTupleEqual
assert_


## What about docstrings?

Some docstrings contain examples. The [doctest module][dt] finds and runs them automatically, making sure the output is replicated.

[dt]: https://docs.python.org/2/library/doctest.html

In [5]:
print ymd.__doc__

Convert date string to datetime object.
    
    Example:
    
    >>> ymd('2016 01 30')
    datetime.datetime(2016, 1, 30, 0, 0)
    


In [43]:
import doctest

In [196]:
doctest.run_docstring_examples(ymd, globals(), True)

Finding tests in NoName
Trying:
    ymd('2016 01 30')
Expecting:
    datetime.datetime(2016, 1, 30, 0, 0)
ok
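
`run_docstring_examples` only prints its findings. When the counts are needed in code, the lower-level `DocTestParser` and `DocTestRunner` classes can be used instead. A minimal sketch, where `run_doctests` is a hypothetical helper and `ymd` is repeated so the block runs on its own:

```python
import doctest
from datetime import datetime


def ymd(date_string):
    """Convert date string to datetime object.

    >>> ymd('2016 01 30')
    datetime.datetime(2016, 1, 30, 0, 0)
    """
    return datetime.strptime(date_string, '%Y %m %d')


def run_doctests(func, globs):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Arguments: docstring, globals for the examples, name, filename, line number.
    test = parser.get_doctest(func.__doc__, dict(globs), func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


failed, attempted = run_doctests(ymd, {'ymd': ymd})
print('failed: %d, attempted: %d' % (failed, attempted))
```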


# Twitter

The [Twitter API][tapi] is a gold mine for data. People talk about all kinds of topics, all the time. It's a great place to find opinions on almost anything. Make sure you understand the terms and conditions if you plan bigger projects.

We will interact with it in a very straightforward way using the [requests][req] library.

[tapi]: https://dev.twitter.com/overview/api
[req]: http://docs.python-requests.org/en/master/

In [48]:
import requests
from requests_oauthlib import OAuth1

In [49]:
# use your own here
# you get them at apps.twitter.com
from donthackme import CONSUMER_KEY, CONSUMER_SECRET, TOKEN, TOKEN_SECRET

In [54]:
# api url
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'

In [55]:
# this is needed to authenticate our requests with Twitter
auth = OAuth1(CONSUMER_KEY, CONSUMER_SECRET, TOKEN, TOKEN_SECRET)

In [56]:
# let's verify that we're authenticated
verify_response = requests.get(url, auth=auth)

In [57]:
# 200 is great!
verify_response

<Response [200]>

In [58]:
# this tells us that we got a JSON response
# which is not surprising given the URL
verify_response.headers

{'content-length': '1091', 'x-rate-limit-reset': '1485773301', 'x-rate-limit-remaining': '74', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'x-connection-hash': '998c867bafed9d100ec8aaa1f50e026e', 'x-twitter-response-tags': 'BouncerExempt, BouncerCompliant', 'cache-control': 'no-cache, no-store, must-revalidate, pre-check=0, post-check=0', 'status': '200 OK', 'content-disposition': 'attachment; filename=json.json', 'set-cookie': 'lang=en; Path=/, guest_id=v1%3A148577240112998617; Domain=.twitter.com; Path=/; Expires=Wed, 30-Jan-2019 10:33:21 UTC', 'expires': 'Tue, 31 Mar 1981 05:00:00 GMT', 'x-access-level': 'read-write', 'last-modified': 'Mon, 30 Jan 2017 10:33:21 GMT', 'pragma': 'no-cache', 'date': 'Mon, 30 Jan 2017 10:33:21 GMT', 'x-rate-limit-limit': '75', 'x-response-time': '137', 'x-transaction': '000081590003dddc', 'content-encoding': 'gzip', 'strict-transport-security': 'max-age=631138519', 'server': 'tsa_f', 'x-frame-options': 'SAMEORIGIN', 'conten
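
The headers also carry rate limit information. The sketch below reads the three `x-rate-limit-*` values copied from the output above; it assumes the reset value is a Unix timestamp in seconds, and `rate_limit_status` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Values copied from the response headers shown above.
headers = {'x-rate-limit-limit': '75',
           'x-rate-limit-remaining': '74',
           'x-rate-limit-reset': '1485773301'}


def rate_limit_status(headers):
    """Return (remaining calls, UTC time at which the window resets)."""
    # Header values are strings, so convert them first.
    remaining = int(headers['x-rate-limit-remaining'])
    reset_epoch = int(headers['x-rate-limit-reset'])
    # Assumption: the reset value counts seconds since 1970-01-01 UTC.
    reset_time = datetime(1970, 1, 1) + timedelta(seconds=reset_epoch)
    return remaining, reset_time


remaining, reset_time = rate_limit_status(headers)
print('%d calls left until %s UTC' % (remaining, reset_time))
```

In the notebook, `rate_limit_status(verify_response.headers)` should work as well, since `requests` exposes headers through a dict-like object.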

In [60]:
# let's look at the JSON
# there's a lot of information
verify_response.json().keys()

[u'follow_request_sent',
 u'has_extended_profile',
 u'profile_use_background_image',
 u'default_profile_image',
 u'id',
 u'profile_background_image_url_https',
 u'verified',
 u'translator_type',
 u'profile_text_color',
 u'profile_image_url_https',
 u'profile_sidebar_fill_color',
 u'entities',
 u'followers_count',
 u'profile_sidebar_border_color',
 u'id_str',
 u'profile_background_color',
 u'listed_count',
 u'status',
 u'is_translation_enabled',
 u'utc_offset',
 u'statuses_count',
 u'description',
 u'friends_count',
 u'location',
 u'profile_link_color',
 u'profile_image_url',
 u'following',
 u'geo_enabled',
 u'profile_banner_url',
 u'profile_background_image_url',
 u'screen_name',
 u'lang',
 u'profile_background_tile',
 u'favourites_count',
 u'name',
 u'notifications',
 u'url',
 u'created_at',
 u'contributors_enabled',
 u'time_zone',
 u'protected',
 u'default_profile',
 u'is_translator']

In [61]:
verify_response.json()['name']

u'Dirk Hesse'

In [62]:
verify_response.json()['screen_name']

u'NotDirkHesse'

## What data is available?

The statuses (aka Tweets) in a response from the Twitter API contain *a lot* of information. You can explore it through Twitter's [documentation][tdoc].

[tdoc]: https://dev.twitter.com/overview/api/tweets

# Search

Use the Twitter search function to construct the queries. Just type in something and see what the URL that's being opened looks like. Also check the [documentation](https://dev.twitter.com/rest/public/search).

In [65]:
search_url = 'https://api.twitter.com/1.1/search/tweets.json'

In [66]:
params = {'q': 'data science',
          'result_type': 'recent'} # popular also possible
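
To see what such a query looks like as a URL, the parameters can be encoded by hand with the standard library. This is only an illustration of the idea; the URL that `requests` actually sends may differ in detail.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

search_url = 'https://api.twitter.com/1.1/search/tweets.json'
# A list of pairs keeps the parameter order fixed.
query_pairs = [('q', 'data science'), ('result_type', 'recent')]

full_url = search_url + '?' + urlencode(query_pairs)
print(full_url)
```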

In [72]:
# same procedure as before
search_response = requests.get(search_url, params=params, auth=auth)

In [73]:
# 200 again, that's good
search_response

<Response [200]>

In [75]:
re_json = search_response.json()
# this contains a lot of information
# I'd encourage you to play around with it!

In [80]:
first_status = re_json.get('statuses', [{}])[0]
print first_status['text']

#bigdata
#BigDataAnalytics 
Watch the video to see how  data science is  to influence voter behaviour. Is this how… https://t.co/F5X3C6fejg


In [82]:
sorted(first_status.keys())

[u'contributors',
 u'coordinates',
 u'created_at',
 u'entities',
 u'favorite_count',
 u'favorited',
 u'geo',
 u'id',
 u'id_str',
 u'in_reply_to_screen_name',
 u'in_reply_to_status_id',
 u'in_reply_to_status_id_str',
 u'in_reply_to_user_id',
 u'in_reply_to_user_id_str',
 u'is_quote_status',
 u'lang',
 u'metadata',
 u'place',
 u'possibly_sensitive',
 u'quoted_status',
 u'quoted_status_id',
 u'quoted_status_id_str',
 u'retweet_count',
 u'retweeted',
 u'source',
 u'text',
 u'truncated',
 u'user']

In [83]:
first_status['user'].keys()

[u'follow_request_sent',
 u'has_extended_profile',
 u'profile_use_background_image',
 u'default_profile_image',
 u'id',
 u'profile_background_image_url_https',
 u'verified',
 u'translator_type',
 u'profile_text_color',
 u'profile_image_url_https',
 u'profile_sidebar_fill_color',
 u'entities',
 u'followers_count',
 u'profile_sidebar_border_color',
 u'id_str',
 u'profile_background_color',
 u'listed_count',
 u'is_translation_enabled',
 u'utc_offset',
 u'statuses_count',
 u'description',
 u'friends_count',
 u'location',
 u'profile_link_color',
 u'profile_image_url',
 u'following',
 u'geo_enabled',
 u'profile_banner_url',
 u'profile_background_image_url',
 u'screen_name',
 u'lang',
 u'profile_background_tile',
 u'favourites_count',
 u'name',
 u'notifications',
 u'url',
 u'created_at',
 u'contributors_enabled',
 u'time_zone',
 u'protected',
 u'default_profile',
 u'is_translator']
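
With this many keys, it helps to pull out only the fields you care about. A minimal sketch, where `summarize_status` is a hypothetical helper and `sample_status` contains made-up values shaped like the keys listed above:

```python
def summarize_status(status):
    """Pick a few fields out of a status dict, tolerating missing keys."""
    user = status.get('user', {})
    return {'text': status.get('text', ''),
            'screen_name': user.get('screen_name'),
            'retweet_count': status.get('retweet_count', 0),
            'created_at': status.get('created_at')}


# Hypothetical status with made-up values, not real API output.
sample_status = {'text': 'An example tweet',
                 'retweet_count': 3,
                 'user': {'screen_name': 'example_user'}}

summary = summarize_status(sample_status)
print(summary)
```

In the notebook, `summarize_status(first_status)` would apply the same idea to a real status.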

In [None]:
# the first status
re_json['statuses'][0]

## Streaming API

While the search API gives you a static batch of statuses, the *streaming* API will send you statuses until interrupted.

In [84]:
from itertools import islice

In [105]:
# for this to work, we need POST instead of GET
r = requests.post('https://stream.twitter.com/1.1/statuses/filter.json',
                 params = {'track': '#data'},
                 auth=auth,
                 stream=True) # important

In [106]:
tweets = r.iter_lines()

In [93]:
import json

In [107]:
# this would work without islice as well, but it 
# would go on forever (for tweet in tweets: ...)
for tweet in islice(tweets, 20):
    if tweet:  # empty keep-alive lines are skipped
        print json.loads(tweet)['text'][:20]
    else:
        print 'Timeout.'

Join us on 2nd Febru
#Secure your #inform
Be Remarkable The Fo
RT @MediaNetwerk: [#
Timeout.
RT @BPTeC_Kenya: An 
Timeout.
[H-NYC] RT ygaudry: 
How To Build A Great
8 #big #data predict
Timeout.
Timeout.
RT BPTeC_Kenya: An i
Timeout.
RT @DeepLearn007: Ma
RT @Seb_Marchipont: 
Timeout.
@DeadMasterMusic Ala
Timeout.
RT @Seb_Marchipont: 


In [108]:
r.close() # always do this
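
The parsing step can be separated from the network code, which also makes it testable. The sketch below assumes that keep-alive signals arrive as empty lines; `parse_stream` is a hypothetical helper and `fake_lines` is made-up data standing in for `r.iter_lines()`.

```python
import json


def parse_stream(lines):
    """Yield decoded statuses from an iterable of raw lines, skipping blanks."""
    for line in lines:
        if not line:
            continue  # empty keep-alive line, nothing to decode
        if isinstance(line, bytes):
            line = line.decode('utf-8')
        yield json.loads(line)


# Hypothetical raw lines standing in for r.iter_lines(); values are made up.
fake_lines = [b'{"text": "first tweet"}', b'', b'{"text": "second tweet"}']
for status in parse_stream(fake_lines):
    print(status['text'])
```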

In [116]:
# student question: What about the u in u'string'?
print u'Hi, Håvard!'

Hi, Håvard!


In [117]:
u'Hi, Håvard!' # good

u'Hi, H\xe5vard!'

In [119]:
'Hi, Håvard!' # less good!

'Hi, H\xc3\xa5vard!'

In [197]:
type(u'')
# unicode can deal with almost any character

unicode

In [198]:
type('')

str
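
The difference between the two types becomes clearer when converting explicitly: text is a sequence of characters, while bytes are how that text is stored or sent. A short sketch that runs on both Python 2 and Python 3:

```python
text = u'Hi, H\xe5vard!'           # text: a sequence of characters
raw = text.encode('utf-8')         # bytes: how the text is stored or sent

print(len(text))                     # number of characters
print(len(raw))                      # number of bytes; the a-ring takes two in UTF-8
print(raw.decode('utf-8') == text)   # decoding restores the original text
```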

# Storing things

## Files (e.g. .csv)

- Don't scale (could use HDFS)
- What about JSON?
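
One simple way to use JSON with files is to write one document per line. A minimal sketch with made-up records; the file name `statuses.jsonl` is an arbitrary choice:

```python
import json
import os
import tempfile

# Made-up records for illustration.
records = [{'id': 1, 'text': 'first'}, {'id': 2, 'text': 'second'}]

path = os.path.join(tempfile.mkdtemp(), 'statuses.jsonl')

# Write one JSON document per line.
with open(path, 'w') as handle:
    for record in records:
        handle.write(json.dumps(record) + '\n')

# Read the file back line by line.
with open(path) as handle:
    loaded = [json.loads(line) for line in handle]

print(loaded == records)
```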

## Databases

- SQL
- NoSQL

# MongoDB

- Windows: Download the 'Community edition'
- macOS:

        brew update
        brew install mongodb
        brew services start mongodb

- Linux:

        sudo apt install mongodb

Plus:

    pip install pymongo
    
You can get great documentation at the [MongoDB website][mgs].

[mgs]: https://docs.mongodb.com/getting-started/python/

MongoDB stores BSON, a binary version of JSON.
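
BSON also supports some types that plain JSON lacks, dates among them. The sketch below shows the limitation on the JSON side using only the standard library; the document is made up, and storing dates as ISO 8601 strings is just one possible workaround.

```python
import json
from datetime import datetime

document = {'name': 'Michael', 'born': datetime(1958, 8, 29)}

try:
    json.dumps(document)
    json_ok = True
except TypeError:
    # Plain JSON has no date type, so the json module refuses datetimes.
    json_ok = False

# One possible workaround for plain JSON: store the date as an ISO 8601 string.
as_json = json.dumps({'name': 'Michael',
                      'born': document['born'].isoformat()},
                     sort_keys=True)
print(json_ok)
print(as_json)
```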

In [120]:
from pymongo import MongoClient # our window into MongoDB

In [121]:
MongoClient('localhost', 27017)

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

In [123]:
MongoClient()

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

In [129]:
# stkinf - Database
# music - Collection
c = MongoClient().stkinf.music

In [130]:
c

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'stkinf'), u'music')

In [131]:
# Now the neat thing with MongoDB is that we can store
# basic Python objects like dicts, lists, basic types,
# dates, etc. The main object needs to be a dictionary.
jackson = {'name': {'first': 'Michael',
                    'last': 'Jackson',
                    'middle': 'Joseph'},
           'born': datetime(1958, 8, 29),
           'died': datetime(2009, 6, 25),
           'albums': [{'name': 'Thriller',
                       'released': 1982},
                      {'name': 'Bad',
                       'released': 1987}]}

In [132]:
# let's insert this into our MongoDB collection
result = c.insert_one(jackson)

In [133]:
result

<pymongo.results.InsertOneResult at 0x1065c05a0>

In [134]:
result.acknowledged

True

In [135]:
result.inserted_id

ObjectId('588f24817d12fa00c0dc8f1e')

In [137]:
# let's get it back
c.find_one(result.inserted_id)

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [139]:
# getting a single object (whichever one MongoDB returns first)
c.find_one() # any object

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [141]:
# getting many objects
list(c.find())

[{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
  u'albums': [{u'name': u'Thriller', u'released': 1982},
   {u'name': u'Bad', u'released': 1987}],
  u'born': datetime.datetime(1958, 8, 29, 0, 0),
  u'died': datetime.datetime(2009, 6, 25, 0, 0),
  u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}]

In [142]:
for i in c.find():
    print i

{u'born': datetime.datetime(1958, 8, 29, 0, 0), u'albums': [{u'released': 1982, u'name': u'Thriller'}, {u'released': 1987, u'name': u'Bad'}], u'_id': ObjectId('588f24817d12fa00c0dc8f1e'), u'name': {u'middle': u'Joseph', u'last': u'Jackson', u'first': u'Michael'}, u'died': datetime.datetime(2009, 6, 25, 0, 0)}


In [143]:
# getting max. 10 objects
for i in c.find().limit(10):
    print i

{u'born': datetime.datetime(1958, 8, 29, 0, 0), u'albums': [{u'released': 1982, u'name': u'Thriller'}, {u'released': 1987, u'name': u'Bad'}], u'_id': ObjectId('588f24817d12fa00c0dc8f1e'), u'name': {u'middle': u'Joseph', u'last': u'Jackson', u'first': u'Michael'}, u'died': datetime.datetime(2009, 6, 25, 0, 0)}


In [144]:
# search by field
c.find_one({'born': datetime(1958, 8, 29)})

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [145]:
# search by nested field
c.find_one({'name.middle': 'Joseph'})

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [146]:
# another example of nested field search
c.find_one({'albums.released': 1982})

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [149]:
# comparison operator
# we want albums.released > 1980
c.find_one({'albums.released': {'$gt': 1980}})

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [150]:
# AND (no document matches both conditions, so this returns None)
c.find_one({'name.first': 'Michael',
             'name.last': 'Bowie'})

In [152]:
# OR
c.find_one({'$or': [{'name.first': 'Michael'},
                     {'name.last': 'Bowie'}]})

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}

In [153]:
# Let's make another object
# It doesn't need to have the same fields.
bowie = {'name': {'first': 'David',
                  'last': 'Bowie',
                  'middle': 'Robert'},
         'born': datetime(1049, 1, 8)}

In [154]:
# insert it
c.insert_one(bowie)

<pymongo.results.InsertOneResult at 0x1065a94b0>

In [157]:
# let's return only first name and birth date
list(c.find({}, {'name.first': 1,
                 'born': 1}))

[{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
  u'born': datetime.datetime(1958, 8, 29, 0, 0),
  u'name': {u'first': u'Michael'}},
 {u'_id': ObjectId('588f269e7d12fa00c0dc8f1f'),
  u'born': datetime.datetime(1049, 1, 8, 0, 0),
  u'name': {u'first': u'David'}}]

In [161]:
# suppress the id
list(c.find({}, {'name.first': 1,
                 'born': 1,
                 '_id': 0}))

[{u'born': datetime.datetime(1958, 8, 29, 0, 0),
  u'name': {u'first': u'Michael'}},
 {u'born': datetime.datetime(1049, 1, 8, 0, 0), u'name': {u'first': u'David'}}]

In [163]:
# ditto
list(c.find({}, {'_id': 0}))

[{u'albums': [{u'name': u'Thriller', u'released': 1982},
   {u'name': u'Bad', u'released': 1987}],
  u'born': datetime.datetime(1958, 8, 29, 0, 0),
  u'died': datetime.datetime(2009, 6, 25, 0, 0),
  u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}},
 {u'born': datetime.datetime(1049, 1, 8, 0, 0),
  u'name': {u'first': u'David', u'last': u'Bowie', u'middle': u'Robert'}}]

In [167]:
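# another example for AND: 1900 < born < 2000
# both operators go into one inner dict; repeating the 'born' key
# in a dict literal would silently keep only the last condition
c.find_one({'born': {'$lt': datetime(2000, 1, 1), # AND
                     '$gt': datetime(1900, 1, 1)}})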

{u'_id': ObjectId('588f24817d12fa00c0dc8f1e'),
 u'albums': [{u'name': u'Thriller', u'released': 1982},
  {u'name': u'Bad', u'released': 1987}],
 u'born': datetime.datetime(1958, 8, 29, 0, 0),
 u'died': datetime.datetime(2009, 6, 25, 0, 0),
 u'name': {u'first': u'Michael', u'last': u'Jackson', u'middle': u'Joseph'}}
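
Since queries are plain Python dicts, Python's dict rules apply to them. In particular, a key that appears twice in a dict literal keeps only its last value, so two conditions on the same field belong in one inner dict. A small sketch (`born_between` is a hypothetical helper; no database is needed to see the effect):

```python
from datetime import datetime


def born_between(start, end):
    """Build a filter for start < born < end (a hypothetical helper)."""
    # Both operators go into ONE inner dict under the single key 'born'.
    return {'born': {'$gt': start, '$lt': end}}


query = born_between(datetime(1900, 1, 1), datetime(2000, 1, 1))

# Pitfall: repeating a key in a dict literal silently keeps only the last value.
collapsed = {'born': {'$lt': datetime(2000, 1, 1)},
             'born': {'$gt': datetime(1900, 1, 1)}}

print(sorted(query['born'].keys()))
print(list(collapsed['born'].keys()))
```

The resulting `query` can be passed to `c.find_one(query)` or `c.find(query)` like any other filter.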

# Modifying things

In [168]:
# add albums field
result = c.update_one({'name.last': 'Bowie'},
                      {'$set': {'albums': []}})

In [170]:
result.acknowledged

True

In [174]:
# worked
c.find_one({'name.last': 'Bowie'})

{u'_id': ObjectId('588f269e7d12fa00c0dc8f1f'),
 u'albums': [],
 u'born': datetime.datetime(1049, 1, 8, 0, 0),
 u'name': {u'first': u'David', u'last': u'Bowie', u'middle': u'Robert'}}

In [175]:
# append to a list
c.update_one({'name.last': 'Bowie'},
             {'$push': {'albums': {'name': "Let's Dance",
                                   'released': 1983}}})

<pymongo.results.UpdateResult at 0x1065a9960>

In [176]:
c.find_one({'name.last': 'Bowie'})

{u'_id': ObjectId('588f269e7d12fa00c0dc8f1f'),
 u'albums': [{u'name': u"Let's Dance", u'released': 1983}],
 u'born': datetime.datetime(1049, 1, 8, 0, 0),
 u'name': {u'first': u'David', u'last': u'Bowie', u'middle': u'Robert'}}

In [178]:
# increment a field
c.update_one({'name.last': 'Bowie'},
             {'$inc': {'albums.0.released': 1}})

<pymongo.results.UpdateResult at 0x1065a9dc0>

In [179]:
c.find_one({'name.last': 'Bowie'})

{u'_id': ObjectId('588f269e7d12fa00c0dc8f1f'),
 u'albums': [{u'name': u"Let's Dance", u'released': 1984}],
 u'born': datetime.datetime(1049, 1, 8, 0, 0),
 u'name': {u'first': u'David', u'last': u'Bowie', u'middle': u'Robert'}}

In [186]:
# decrement a field
r = c.update_one({'name.last': 'Bowie'},
                 {'$inc': {'albums.0.released': -1}})

## Aggregation

We'll talk more about this next session.

In [185]:
agg = c.aggregate([{'$group': {'_id': '$born',
                               'people_count': {'$sum': 1}}}])

In [184]:
list(agg)

[{u'_id': datetime.datetime(1049, 1, 8, 0, 0), u'people_count': 1},
 {u'_id': datetime.datetime(1958, 8, 29, 0, 0), u'people_count': 1}]
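
To get a feeling for what the `$group` stage computes, here is a plain-Python analogue using `collections.Counter`. It is not how MongoDB implements aggregation, and the `people` list is made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Made-up documents, reduced to the field that is grouped on.
people = [{'born': datetime(1958, 8, 29)},
          {'born': datetime(1958, 8, 29)},
          {'born': datetime(1947, 1, 8)}]

# Count how many documents share each 'born' value, similar to
# {'$group': {'_id': '$born', 'people_count': {'$sum': 1}}}.
counts = Counter(person['born'] for person in people)
grouped = [{'_id': born, 'people_count': n}
           for born, n in sorted(counts.items())]
print(grouped)
```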