In [None]:
%%html
<style>
table {float:left}
</style>

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<h1 id="tocheading">Overview</h1>
<div id="toc"></div>

## Learning Goals

* Learn and understand MongoDB and how to use it from Python
* Learn and understand Twitter REST and streaming APIs and how to use them from Python
* Understand what Sentiment Analysis is
* Build a Twitter sentiment mining application
* Learn how to deploy a data science product as an API

## The Road Ahead

* Mining Twitter Sentiment
* NBC for Spam and Ham
* Kaggle Titanic Competition
* SQL in the Wild!

## Evaluation Scheme

* 10% for submission of Assignment 1 (bdap2015/NoSQL)
* 40% for submission of Assignment 2
* 50% for final exam

## Why Twitter Sentiment Analysis

* Twitter - learn to use the REST and streaming APIs
* Twitter - tweets are a natural fit for MongoDB's document model, which makes them a good dataset for learning MongoDB
* Sentiment Analysis - an introduction to machine learning and natural language processing
* Sentiment Analysis - we'll keep revisiting it as we learn more sophisticated techniques

# Introduction to MongoDB

## CRUD

### Create/Insert

    mongo twitter
    
    db.users.insert({
      name: "Shakuni Mama",
      email: "shakuni.mama@mahabharata.com",
      age:42
    })

    show collections

    db.users.find()

**Notes**
* Databases and Collections are lazily created - created when we need them, not when they are defined.
* With greater flexibility comes greater responsibility - beware of typos

**Note: What is special about _id?**
* Auto-generated
* Auto-generated vs Auto-incremented
* Horizontal Sharding

### Read

    db.users.find({ "_id" : ObjectId("566a247ddae35821b3a0c523") })

(select fields)

    db.users.find({ _id : ObjectId("566a247ddae35821b3a0c523") }, { name : 1 })
    db.users.find({ _id : ObjectId("566a247ddae35821b3a0c523") }, { name : 0 }) // omit only name

(more sophisticated queries)

    db.users.find(
        { name : /^P/, age : { $lt : 40 } },
        { name : 1, age : 1 }
    )

(an even more complicated example)

    var age_range = {}
    age_range['$lt'] = 1000000
    age_range['$gt'] = 10000
    
    db.users.find(
        { name : /^P/, age : age_range },
        { name: 1 }
    )

### Update

    db.users.update(
        { _id : ObjectId("4d0ada87bb30773266f39fe5") },
        { $set : { "name" : "Something Else" } }
    );

### Delete

    var bad_bacon = {
        'exports.foods' : {
            $elemMatch : {
                name : 'bacon',
                tasty : false
            }
        }
    }

    db.countries.find( bad_bacon )

    db.countries.remove( bad_bacon )
    db.countries.count()

## JSON

<img src="mongodb_record_as_json_diag.png">

## A quick comparison

<img src="sql_vs_mongodb_schema_arrangement.png">

| Concept | SQL | MongoDB |
|:---|---|---|
| One User                         | One Row                    | One Document |
| All Users                        | Users Table                | Users Collection |
| One Username Per User (1-to-1)   | Username Column            | Username Property |
| Many Emails Per User (1-to-many) | SQL JOIN with Emails Table | Embed relevant email doc in User Document |
| Many Items Owned by Many Users (many-to-many) | SQL JOIN with Items Table | Programmatically Join with Items Collection |
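
The one-to-many row of the table can be sketched as a plain Python dictionary. This is only an illustration: the `emails` field name, its sub-fields, and the second address are assumptions made for the example, not a schema used elsewhere in this notebook.

```python
# Hypothetical user document: the "emails" field and its layout are
# illustrative assumptions, not a schema taken from this notebook.
user_doc = {
    "username": "janedoe",
    "emails": [
        {"address": "janedoe74@example.com", "primary": True},
        {"address": "jane.doe@example.org", "primary": False},
    ],
}

# No JOIN is needed to read the emails: they travel with the user document.
primary_emails = [e["address"] for e in user_doc["emails"] if e["primary"]]
print(primary_emails)
```

In SQL the same lookup would typically join a users table with a separate emails table.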


## MongoDB from Python

### Connect to database

In [None]:
import sys
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

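def main():
    """ Connect to MongoDB """
    try:
        # Connect to the database
        client = MongoClient(host="localhost", port=27017)
        # MongoClient connects lazily (PyMongo 3+), so ping the server to force a connection attempt
        client.admin.command("ping")
        print("Connected successfully")

    except ConnectionFailure as e:
        sys.stderr.write("Could not connect to MongoDB: %s" % e)
        sys.exit(1)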

if __name__ == "__main__":
    main()



In [None]:
client = MongoClient('localhost', 27017)

# The URI format
client = MongoClient('mongodb://localhost:27017/')

In [None]:
import sys
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

def main():
    """ Connect to MongoDB """
    try:
        # Connect to the database
        client = MongoClient(host="localhost", port=27017)
        # MongoClient connects lazily (PyMongo 3+), so ping the server to force a connection attempt
        client.admin.command("ping")
        print("Connected successfully")
        
        # Get a database handle to a database named "twitterdb"
        dbh = client["twitterdb"]
        print("Successfully set up a database handle")
        
    except ConnectionFailure as e:
        sys.stderr.write("Could not connect to MongoDB: %s" % e)
        sys.exit(1)

if __name__ == "__main__":
    main()


In [None]:
client["twitterdb"]

In [None]:
client.twitterdb

### Create/Insert

In [None]:
import sys
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from datetime import datetime

def main():
    """ Connect to MongoDB """
    try:
        # Connect to the database
        client = MongoClient(host="localhost", port=27017)
        # MongoClient connects lazily (PyMongo 3+), so ping the server to force a connection attempt
        client.admin.command("ping")
        print("Connected successfully")
        
        # Get a database handle to a database named "twitterdb"
        dbh = client["twitterdb"]
        #assert dbh.connection == c
        print("Successfully set up a database handle")
        
    except ConnectionFailure as e:
        sys.stderr.write("Could not connect to MongoDB: %s" % e)
        sys.exit(1)
        
    user_doc = {
        "username" : "janedoe",
        "firstname" : "Jane",
        "surname" : "Doe",
        "dateofbirth" : datetime(1974, 4, 12),
        "email" : "janedoe74@example.com",
        "score" : 0
        }
    dbh.users.insert_one(user_doc)
    print("Successfully inserted document: %s" % user_doc)

if __name__ == "__main__":
    main()


**Notes**
* The PyMongo driver supports Python datetime objects: it translates between MongoDB dates and Python datetime objects, so we don't have to convert between the two ourselves.
* As noted before, we don't have to create the "users" collection before we insert documents into it.

In [None]:
result = client.twitterdb.users.insert_one({
    "username" : "Pavitra",
    "firstname" : "Pavitra",
    "surname" : "Pravakar",
    "dateofbirth" : datetime(1986, 4, 12),
    "email" : "spiderman@marvelheroes.com",
    "score" : 0
})
result.inserted_id

### Read

In [None]:
user_doc = client.twitterdb.users.find_one({"username" : "janedoe"})
if not user_doc:
    print("no document found for username janedoe")

In [None]:
users = client.twitterdb.users.find({"username":"janedoe"})
for user in users:
    print(user.get("email"))
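
The shell queries in the MongoDB Read section combine a regular expression, range operators, and a field projection. A minimal Python sketch of the same idea follows. The helper name `build_user_query` is hypothetical, and the choice of the `username` and `score` fields is an assumption based on the documents inserted above. It only builds the dictionaries that `find()` accepts, so it runs without a database.

```python
import re

def build_user_query(name_prefix, min_score, max_score):
    """Return a (filter, projection) pair suitable for users.find()."""
    query_filter = {
        # $regex with a ^ anchor matches usernames that start with the prefix
        "username": {"$regex": "^" + re.escape(name_prefix)},
        # $gt / $lt bound the score on both sides (both exclusive)
        "score": {"$gt": min_score, "$lt": max_score},
    }
    # 1 means "include this field"; _id is also returned unless set to 0
    projection = {"username": 1, "score": 1}
    return query_filter, projection

query_filter, projection = build_user_query("j", -1, 10)
print(query_filter)
print(projection)
# With a live connection: client.twitterdb.users.find(query_filter, projection)
```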

### Update

In [None]:
user_doc = {
    "username" : "janedoe",
    "firstname" : "Jane",
    "surname" : "Doe",
    "dateofbirth" : datetime(1974, 4, 12),
    "email" : "janedoe74@example.com",
    "score" : 0
}

In [None]:
# first query to get a copy of the current document
import copy
old_user_doc = client.twitterdb.users.find_one({"username":"janedoe"})
new_user_doc = copy.deepcopy(old_user_doc)

# modify the copy to change the email address
new_user_doc["email"] = "janedoe74@example2.com"

# run the update query
# replace the matched document with the contents of new_user_doc
client.twitterdb.users.replace_one({"username":"janedoe"}, new_user_doc)

Building the whole replacement document can be cumbersome, and worse, can introduce race conditions. Imagine you want to increment the score property of the “janedoe” user. In order to achieve this with the replacement approach, you would have to first fetch the document, modify it with the incremented score, then write it back to the database. With that approach, you could easily lose other score changes if something else were to update the score in between you reading and writing it.

In order to solve this problem, the update document supports an additional set of MongoDB operators called “update modifiers”. These update modifiers include operators such as atomic increment/decrement, atomic list push/pop and so on. It is very helpful to be aware of which update modifiers are available and what they can do when designing your application.
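
As a sketch of the score example described above, the `$inc` update modifier lets the server perform the increment in a single step, so no read-modify-write cycle happens on the client. The helper name `increment_score` is hypothetical. It accepts any object that exposes PyMongo's `update_one(filter, update)` method, which is an assumption made so the snippet can be read without a running server.

```python
def increment_score(collection, username, amount=1):
    """Atomically add `amount` to the user's score on the server.

    `collection` is any object with PyMongo's update_one(filter, update)
    method, for example client.twitterdb.users.
    """
    return collection.update_one(
        {"username": username},        # which document to update
        {"$inc": {"score": amount}},   # server-side atomic increment
    )

# Usage with a live connection (not run here):
# result = increment_score(client.twitterdb.users, "janedoe")
# result.modified_count
```

Two clients calling this at the same time each contribute their increment, whereas two concurrent `replace_one` calls could overwrite one another.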

In [None]:
client.twitterdb.users.update_one({"username":"janedoe"},
                {"$set":{"email":"janedoe74@example2.com"}})

In [None]:
client.twitterdb.users.update_one({"username":"janedoe"},
                 {"$set":{"email":"janedoe74@example2.com", "score":1}})

In [None]:
result = client.twitterdb.users.update_one({"username":"janedoe"},
                 {"$set":{"email":"janedoe74@exple2.com", "score":1}})
result.modified_count

### Delete

In [None]:
client.twitterdb.users.delete_one({"score":1})

# The Twitter API in Action

## Organization of Twitter Data

<img src="teamindia_tweet.png">

A Tweet contains:
* date and time
* links
* user mentions (@)
* hash tags (#)
* retweets count
* locale language
* favorites count
* geocode


## Accessing Twitter Data

### REST API

* [Twitter REST API Documentation](https://dev.twitter.com/rest/public)

### Streaming API

* [Twitter Streaming API Documentation](https://dev.twitter.com/streaming/overview)

### OAuth

* [Twitter OAuth Documentation](https://dev.twitter.com/oauth)
* Instructions for getting access:
    - Create a Twitter account
    - Go to https://apps.twitter.com/
    - Create New App (button on top right corner(-ish))
    - Fill out details in the next page. Value of *Website* doesn't matter right now (use http://google.com). Create your Twitter application.
    - In the next screen, select the *Keys and Access Tokens* tab.
    - Note down the following credentials:
        * Consumer Key (API Key)
        * Consumer Secret (API Secret)
    - Click on *Create my access token*. After tokens are generated, note down the following credentials:
        * Access Token
        * Access Token Secret
    - Add the credentials to *.profile*
        * .profile vs .bashrc vs .Renviron
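
Once the credentials are exported from *.profile*, a small check can confirm that all four are visible to Python before any API call is made. This is a sketch only: the helper name `load_twitter_credentials` is hypothetical, and the variable names are the ones this notebook reads with `os.environ` in the search example.

```python
import os

REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise naming the missing ones."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling `load_twitter_credentials()` with no argument reads from the real environment; passing a dictionary is convenient for testing.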

## Introduction to Twython

    pip install twython

* [Official Twython Documentation](https://twython.readthedocs.org/en/latest/)
* Supports both REST and Streaming APIs
* For more wrappers, see https://dev.twitter.com/overview/api/twitter-libraries

### Searching by Topic

In [1]:
import os
TWITTER_CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
TWITTER_CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
TWITTER_ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
TWITTER_ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

In [2]:
from twython import Twython
twitter = Twython(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)

In [6]:
result = twitter.search(q="Salman Khan")

**Note**
* If Twython fails to authenticate, `result` will have the following JSON as its value:
        {"errors":[{"message":"Bad Authentication data", "code":215}]}
* If successful, Twython will convert the JSON it receives to a native Python object.

In [7]:
for status in result["statuses"]:
    print(status)

{u'contributors': None, u'truncated': False, u'text': u'RT @taran_adarsh: No, #Sultan is NOT postponed to Diwali 2016. Confirmed for Eid 2016. Which means, #Sultan [Salman Khan] versus #Raees [SR\u2026', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 679621794853187584, u'favorite_count': 0, u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [{u'id': 99642673, u'indices': [3, 16], u'id_str': u'99642673', u'screen_name': u'taran_adarsh', u'name': u'taran adarsh'}], u'hashtags': [{u'indices': [22, 29], u'text': u'Sultan'}, {u'indices': [100, 107], u'text': u'Sultan'}, {u'indices': [129, 135], u'text': u'Raees'}], u'urls': []}, u'in_reply_to_screen_name': None, u'in_reply_to_user_id': None, u'retweet_count': 286, u'id_str': u'679621794853187584', u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'

In [9]:
for status in result["statuses"]:
    print("user: {0} text: {1}".format(status["user"]["name"], 
                                       status["text"]))

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 139: ordinal not in range(128)

In [10]:
result = twitter.search(q="data science")
for status in result["statuses"]:
    print("user: {0} \n text: {1} \n".format(status["user"]["name"].encode("utf-8"), 
                                             status["text"].encode("utf-8")))

user: Frankfurt School 
 text: RT @xlth: How to unleash #datascience with an MBA? Studied @FrankfurtSchool and @CEIBS after doing research @CERN https://t.co/hX5rZXZcAw #… 

user: Localstars 
 text: Trends for Mobile Advertising in 2016 via @adage https://t.co/p7VjtzRqen 

user: StartUpHire 
 text: Intern - Modeling and Data Science at Quantcast in San Francisco https://t.co/TF0g45lndY #jobs #StartUpHire 

user: StartUpHire 
 text: New Graduate - Modeling and Data Science at Quantcast in San Francisco https://t.co/dVpwqLgBSF #jobs #StartUpHire 

user: The DJ List Data 
 text: Event: @totalscience [TOTAL SCIENCE] is playing The Buttermarket Shrewsbury on March 24 https://t.co/QKUjgV0G1J 

user: Mohamed Magdy Gharib 
 text: RT @tech_clarity: 200 billion 'things' by 2020 = very #bigdata. Great @intel infographic on Data Science Central site: https://t.co/EyXW58V… 

user: Eric Bellamy 
 text: University of Washington Adds Latest Data Science Degree Program - Information Management https://

More documentation at https://dev.twitter.com/rest/reference/get/search/tweets

### Retrieving Timeline (your own)

In [None]:
timeline = twitter.get_home_timeline()

In [None]:
for tweet in timeline:
    print(" User: {0} \n Created: {1} \n Text: {2} \n".format(tweet["user"]["name"].encode("utf-8"), 
                                                            tweet["created_at"].encode("utf-8"), 
                                                            tweet["text"].encode("utf-8")))

### Retrieving Timeline (other users)

In [None]:
tl = twitter.get_user_timeline(screen_name = "iamsrk", count = 5)
for tweet in tl:
    print(" User: {0} \n Created: {1} \n Text: {2} \n".format(tweet["user"]["name"].encode("utf-8"),
                                                            tweet["created_at"].encode("utf-8"),
                                                            tweet["text"].encode("utf-8")))

* [Official Documentation for home timeline](https://dev.twitter.com/rest/reference/get/statuses/home_timeline)
* [Official Documentation for (other) user timeline](https://dev.twitter.com/rest/reference/get/statuses/user_timeline)

### Get a list of followers

In [None]:
followers = twitter.get_followers_list(screen_name="dataBiryani")

In [None]:
for follower in followers["users"]:
    print(" {0} \n ".format(follower))

In [None]:
for follower in followers["users"]:
    print(" user: {0} \n name: {1} \n Number of tweets: {2} \n".format(follower["screen_name"],
                                                                       follower["name"],
                                                                       follower["statuses_count"]))

# Sentiment Classification

## What is Sentiment Classification?

Sentiment classification is a special case of text classification whose objective is to classify a text according to the polarity of the opinions it contains: favorable or unfavorable, positive or negative.

<img src="sentiment_classification_process.png">

## Dataset

### Affective Norms for English Words

*  The ANEW provides a set of normative emotional ratings as a text corpus for a large number of words in the English language.
* These sets of verbal materials have been rated in terms of pleasure, arousal, and dominance, in order to create a standard for use in studies of emotion and attention.

### Sentiment140

* http://help.sentiment140.com/for-students
* Download link
    - http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
    - (mirror) https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit
* The data has been processed so that the emoticons are stripped off.
* CSV format
* Data file format has 6 fields:
    - 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    - 1 - the id of the tweet (2087)
    - 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    - 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    - 4 - the user that tweeted (robotickilldozr)
    - 5 - the text of the tweet (Lyx is cool)
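
The six-field layout above can be read with Python's `csv` module. The sketch below assumes the fields are comma-separated and quoted. The helper name `load_tweets` is hypothetical, and the second sample row is made up for illustration. It yields `(text, sentiment)` pairs, the shape that the training cells below iterate over.

```python
import csv
import io

# Polarity codes as listed in the field description above
LABELS = {"0": "negative", "2": "neutral", "4": "positive"}

def load_tweets(lines, wanted=("positive", "negative")):
    """Yield (text, sentiment) pairs from Sentiment140-formatted CSV lines."""
    for row in csv.reader(lines):
        if len(row) != 6:
            continue  # skip rows that do not have the six expected fields
        sentiment = LABELS.get(row[0])
        if sentiment in wanted:
            yield (row[5], sentiment)

# Two in-memory rows; the second one is invented for this example
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
    '"0","2088","Sat May 16 23:59:01 UTC 2009","NO_QUERY","someuser","Lyx is slow"\n'
)
tweets = list(load_tweets(sample))
print(tweets)
```

For the real dataset, `lines` would be a file object opened on the extracted training CSV, and the result could be split by label into the positive and negative lists.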

## NLTK

In [None]:
import nltk
nltk.download() # Models > punkt

In [None]:
nltk.word_tokenize("Busy day ahead of me. Also just remembered that I left peah slices in the fridge at work on Friday. ")

In [None]:
def bagOfWords(tweets):
    wordsList = []
    for (words, sentiment) in tweets:
        wordsList.extend(words)
    return wordsList

In [None]:
def wordFeatures(wordList):
    # FreqDist counts occurrences; its keys are the distinct words
    freqDist = nltk.FreqDist(wordList)
    return list(freqDist.keys())

In [None]:
def getFeatures(doc):
    docWords = set(doc)
    feat = {}
    for word in wordFeatures:
        feat['contains(%s)' % word] = (word in docWords)
    return feat

In [None]:
# Fill these up with values from Sentiment140 dataset
positiveTweets = ???
negativeTweets = ???

In [None]:
corpusOfTweets = []
for (words, sentiment) in positiveTweets + negativeTweets:
    wordsFiltered = [e.lower() for e in nltk.word_tokenize(words) if len(e) >= 3]
    corpusOfTweets.append((wordsFiltered, sentiment))

In [None]:
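# Note: this rebinds the name wordFeatures from the function to the word list that getFeatures reads
wordFeatures = wordFeatures(bagOfWords(corpusOfTweets))
training = nltk.classify.apply_features(getFeatures, corpusOfTweets)
classifier = nltk.NaiveBayesClassifier.train(training)
# show_most_informative_features prints its table itself and returns None
classifier.show_most_informative_features(32)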

**Predicting Sentiment of new Tweets**

In [None]:
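from twython import Twython

# Credentials are the environment-backed variables defined in "Searching by Topic"
twitter = Twython(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET,
                  TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)

result = twitter.search(q="python")

for status in result["statuses"]:
    # Tokenize and filter the same way as the training tweets before extracting features
    words = [e.lower() for e in nltk.word_tokenize(status["text"]) if len(e) >= 3]
    print("Tweet: {0} \n Sentiment: {1}".format(status["text"].encode("utf-8"),
                                                classifier.classify(getFeatures(words))))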

# What Next?

## The Assignment

1. Write a blog post on how to use the **OR** operator (`$or`) in MongoDB find queries.
2. Feed negative and positive tweets to the classification function for training. (using the Sentiment140 dataset)
3. Crawl all followers of ***naveen_odisha***, Odisha CM (note: you'll have to pay attention to rate limiting)
4. Crawl all followers of SRK. How can you calculate if this is feasible or not? (show the math)
5. Predict the sentiment of tweets by followers of ***naveen_odisha*** 