[![AnalyticsDojo](https://s3.amazonaws.com/analyticsdojo/logo/final-logo.png)](http://rpi.analyticsdojo.com)
<center><h1>Introduction to APIs with Python</h1></center>
<center><h3><a href = 'http://rpi.analyticsdojo.com'>rpi.analyticsdojo.com</a></h3></center>



This is adapted from [Mining the Social Web, 2nd Edition](http://bit.ly/16kGNyb)
Copyright (c) 2013, Matthew A. Russell
All rights reserved.

This work is licensed under the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt).

### Before you Begin #1
If you are working locally, this exercise requires the `twitter` package, which, unlike previous packages, must be installed with pip.  
`!pip install twitter`

If you get an error that pip is not available, you might have to install it.  See 
https://conda.io/docs/user-guide/tasks/manage-pkgs.html.

The package should already be available online at lab.analyticsdojo.com (nothing to install).


In [2]:
#see if it worked by importing the twitter package & some other things we will use.  
from  twitter import *
import datetime, traceback 
import json
import time
import sys

### Before you Begin #2
In the twitter directory you will see a configuration file called `configsample.yaml`. We are going to store our Twitter keys in there. These keys are used instead of a password to allow programmatic access to your Twitter account.

1. Copy `configsample.yaml` to `config.yaml`.
2. Create an ID on Twitter or log in if you already have a Twitter account. 
3. Create a Twitter app. Go to [apps.twitter.com](https://apps.twitter.com) and click on create a new app. You will then be able to access the fields needed for the `config.yaml` file. 
4. Update `config.yaml` to include the appropriate Twitter keys.
5. Update `screen_names.csv` to include the ids of interest.
6. Save a copy of your config file somewhere outside this directory, because it might be lost when you update the repository. 


## Step 1. Loading Authorization Data
- Here we are going to store the authorization data in a .YAML file rather than directly in the notebook.  
- We have also added `config.yaml` to the `.gitignore` file so we won't accidentally commit our sensitive data to the repository.
- You should generally keep sensitive data out of all git repositories (public or private), and especially out of public ones. 
- If you ever accidentally commit sensitive data to a public repository, you must consider it compromised.
- A .yaml file is a common way to store configuration data, but it is plain text and not really secure. 

In [3]:
#This will import some required libraries.
from ruamel.yaml import YAML  # parser for .yaml files
#This is your configuration file. 
twitter_yaml='./twitter/config.yaml'
yaml = YAML()  # the default loader is the round-trip loader
yaml.preserve_quotes = True
with open(twitter_yaml, 'r') as yaml_t:
    cf_t = yaml.load(yaml_t)


#You can check your config was loaded by printing, but you should not commit this.
#print(cf_t)
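
Printing the whole configuration puts the keys into the notebook output. A minimal sketch of a safer check, assuming the configuration behaves like a plain dictionary: report which expected entries are missing and show only a masked view of each value. The helper names `missing_keys` and `masked` and the sample values are hypothetical; the key names are the ones this notebook reads from `cf_t`.

```python
# Key names taken from how cf_t is used in this notebook.
REQUIRED_KEYS = ["access_token", "access_token_secret",
                 "consumer_key", "consumer_secret"]

def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in cfg."""
    return [k for k in required if not cfg.get(k)]

def masked(cfg, visible=4):
    """Return a copy of cfg with each value cut to its first few characters."""
    return {k: str(v)[:visible] + "..." for k, v in cfg.items()}

# Hypothetical stand-in for cf_t; real values come from config.yaml.
sample_cfg = {"access_token": "1234567890-abcdef",
              "access_token_secret": "s3cr3tvalue",
              "consumer_key": "consumerkey123"}

print(missing_keys(sample_cfg))  # keys that still need to be filled in
print(masked(sample_cfg))        # safe to display, secrets are truncated
```

With the real configuration, the same calls would be `missing_keys(cf_t)` and `masked(cf_t)`.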


## Create Some Relevant Functions
- We will first create a Twitter object we can use to make authorized requests.
- Then we will get profiles.
- Finally we will 

In [4]:
def create_twitter_auth(cf_t):
    """Function to create a twitter object
       Args: cf_t is configuration dictionary. 
       Returns: Twitter object, or None if it could not be created.
    """
    # When using twitter stream you must authorize.
    # these tokens are necessary for user authentication
    auth = OAuth(cf_t['access_token'], cf_t['access_token_secret'],
                 cf_t['consumer_key'], cf_t['consumer_secret'])

    twitter = None  # stays None if the object cannot be created below
    try:
        # create twitter API object
        twitter = Twitter(auth=auth)
    except TwitterHTTPError:
        traceback.print_exc()
        time.sleep(cf_t['sleep_interval'])
    return twitter

In [5]:
def get_profiles(twitter, names, cf_t):
    """Function to write profiles to a file with the form *YYYY-MM-DD-user-profiles.json*
       Args: twitter is an authorized Twitter object
             names is a list of screen names
             cf_t is the twitter configuration dictionary
       Returns: the name of the file that was written
    """
    # file name for daily tracking
    dt = datetime.datetime.now()
    fn = cf_t['data']+'/profiles/'+dt.strftime('%Y-%m-%d-user-profiles.json')
    with open(fn, 'w') as f:
        for name in names:
            print("Searching twitter for User profile: ", name)
            try:
                # create a subquery, looking up information about these users
                # twitter API docs: https://dev.twitter.com/docs/api/1/get/users/lookup
                profiles = twitter.users.lookup(screen_name = name)
                sub_start_time = time.time()
                for profile in profiles:
                    print("User found. Total tweets:", profile['statuses_count'])
                    # now save user info, one JSON object per line
                    f.write(json.dumps(profile))
                    f.write("\n")
                # pause so that requests are spaced at least sleep_interval apart
                sub_elapsed_time = time.time() - sub_start_time
                if sub_elapsed_time < cf_t['sleep_interval']:
                    time.sleep(cf_t['sleep_interval'] + 1 - sub_elapsed_time)
            except TwitterHTTPError:
                traceback.print_exc()
                time.sleep(cf_t['sleep_interval'])
                continue
    # the with block has already closed the file
    return fn
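
The pause logic inside `get_profiles` can be read in isolation. A small sketch, using the hypothetical helper name `seconds_to_wait`, that mirrors the same rule: if a lookup finished faster than `sleep_interval`, wait for the remainder plus one second; otherwise do not wait.

```python
def seconds_to_wait(elapsed, sleep_interval):
    """Mirror the pacing rule used in get_profiles."""
    if elapsed < sleep_interval:
        return sleep_interval + 1 - elapsed
    return 0

# A fast lookup waits out the rest of the interval; a slow one does not wait.
print(seconds_to_wait(0.5, 5))
print(seconds_to_wait(7, 5))
```

Keeping the rule in a function that only computes a number makes it easy to check without actually sleeping.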

## Load Twitter Handle From CSV
- This is a .csv that has individuals we want to collect data on. 
- Go ahead and follow [AnalyticsDojo](https://twitter.com/AnalyticsDojo).  

In [6]:
import pandas as pd
df=pd.read_csv(cf_t['config']+"/"+cf_t['file'], index_col=0)
df

| index | screen_name |
|---|---|
| 1 | tensorflow |
| 2 | DeepLearningHub |
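
For comparison, the same kind of file can be read with only the standard library. This sketch assumes the CSV has an `index` column and a `screen_name` column, as the output above suggests; the inline text stands in for the real file.

```python
import csv
import io

# Stand-in for the contents of the screen-name file (assumed layout).
csv_text = "index,screen_name\n1,tensorflow\n2,DeepLearningHub\n"

# DictReader uses the header row as the keys of each row dictionary.
reader = csv.DictReader(io.StringIO(csv_text))
screen_names = [row["screen_name"] for row in reader]
print(screen_names)
```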


## Create Twitter Object

In [7]:
#Create Twitter Object
twitter= create_twitter_auth(cf_t)

In [8]:
#This will get general profile data
profiles_fn=get_profiles(twitter, df['screen_name'], cf_t)

Searching twitter for User profile:  tensorflow
User found. Total tweets: 144
Searching twitter for User profile:  DeepLearningHub
User found. Total tweets: 2404


Running the cells above creates a `twitter` object for making API calls and writes the retrieved profiles to a JSON file. 
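
`get_profiles` writes one JSON object per line, so the file can be read back line by line. A minimal sketch, using a temporary file and two trimmed-down records in place of the real output file (the `statuses_count` values are the ones printed above):

```python
import json
import os
import tempfile

# Trimmed-down records standing in for full profile data.
sample_profiles = [{"screen_name": "tensorflow", "statuses_count": 144},
                   {"screen_name": "DeepLearningHub", "statuses_count": 2404}]

# Write one JSON object per line, as get_profiles does.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for profile in sample_profiles:
        f.write(json.dumps(profile))
        f.write("\n")
    path = f.name

# Read the file back, decoding each non-empty line separately.
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

os.remove(path)
print([p["screen_name"] for p in loaded])
```

With the real data, `path` would be the file name returned by `get_profiles`.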

## Step 2. Getting Help

In [11]:
# We can get some help on how to use the twitter api with the following. 
help(twitter)

Help on Twitter in module twitter.api object:

class Twitter(TwitterCall)
 |  The minimalist yet fully featured Twitter API class.
 |  
 |  Get RESTful data by accessing members of this class. The result
 |  is decoded python objects (lists and dicts).
 |  
 |  The Twitter API is documented at:
 |  
 |    http://dev.twitter.com/doc
 |  
 |  
 |  Examples::
 |  
 |      from twitter import *
 |  
 |      t = Twitter(
 |          auth=OAuth(token, token_key, con_secret, con_secret_key))
 |  
 |      # Get your "home" timeline
 |      t.statuses.home_timeline()
 |  
 |      # Get a particular friend's timeline
 |      t.statuses.user_timeline(screen_name="billybob")
 |  
 |      # to pass in GET/POST parameters, such as `count`
 |      t.statuses.home_timeline(count=5)
 |  
 |      # to pass in the GET/POST parameter `id` you need to use `_id`
 |      t.statuses.oembed(_id=1234567890)
 |  
 |      # Update your status
 |      t.statuses.update(
 |          status="Using @sixohsix's sweet 


Go ahead and take a look at the [twitter docs](https://dev.twitter.com/rest/public).



In [14]:
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/

WORLD_WOE_ID = 1
US_WOE_ID = 23424977

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter.trends.place(_id=US_WOE_ID)

print (world_trends)
print (us_trends)

[{'as_of': '2017-10-02T23:15:33Z', 'locations': [{'name': 'Worldwide', 'woeid': 1}], 'trends': [{'url': 'http://twitter.com/search?q=%22Tom+Petty%22', 'name': 'Tom Petty', 'query': '%22Tom+Petty%22', 'tweet_volume': 595670, 'promoted_content': None}, {'url': 'http://twitter.com/search?q=%23GFvip', 'name': '#GFvip', 'query': '%23GFvip', 'tweet_volume': 190550, 'promoted_content': None}, {'url': 'http://twitter.com/search?q=%23ahoraqueARV', 'name': '#ahoraqueARV', 'query': '%23ahoraqueARV', 'tweet_volume': 22822, 'promoted_content': None}, {'url': 'http://twitter.com/search?q=%23%D9%88%D9%84%D9%8A_%D8%A7%D9%84%D8%B9%D9%87%D8%AF_%D8%A7%D9%84%D9%85%D8%A8%D8%A7%D8%B1%D9%8A%D8%A7%D8%AA_%D9%85%D8%AC%D8%A7%D9%86%D8%A7', 'name': '#ولي_العهد_المباريات_مجانا', 'query': '%23%D9%88%D9%84%D9%8A_%D8%A7%D9%84%D8%B9%D9%87%D8%AF_%D8%A7%D9%84%D9%85%D8%A8%D8%A7%D8%B1%D9%8A%D8%A7%D8%AA_%D9%85%D8%AC%D8%A7%D9%86%D8%A7', 'tweet_volume': 48718, 'promoted_content': None}, {'url': 'http://twitter.com/search?q=%2
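
The response is a list with one dictionary whose `trends` entry holds the individual topics. A sketch of pulling out the names and ordering them by `tweet_volume`, using a shortened copy of the output above in place of a live API call:

```python
# Shortened copy of the response printed above (three trends only).
sample_trends = [{
    "as_of": "2017-10-02T23:15:33Z",
    "locations": [{"name": "Worldwide", "woeid": 1}],
    "trends": [
        {"name": "#ahoraqueARV", "tweet_volume": 22822},
        {"name": "Tom Petty", "tweet_volume": 595670},
        {"name": "#GFvip", "tweet_volume": 190550},
    ],
}]

trends = sample_trends[0]["trends"]
# Treat a missing volume as 0 (an assumption, in case tweet_volume is null).
by_volume = sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)
names = [t["name"] for t in by_volume]
print(names)
```

The same lines should work on `world_trends` or `us_trends` in place of `sample_trends`.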

## Step 3. Displaying API responses as pretty-printed JSON

In [15]:
import json

print (json.dumps(world_trends, indent=1))
print (json.dumps(us_trends, indent=1))

[
 {
  "as_of": "2017-10-02T23:15:33Z",
  "locations": [
   {
    "name": "Worldwide",
    "woeid": 1
   }
  ],
  "trends": [
   {
    "url": "http://twitter.com/search?q=%22Tom+Petty%22",
    "name": "Tom Petty",
    "query": "%22Tom+Petty%22",
    "tweet_volume": 595670,
    "promoted_content": null
   },
   {
    "url": "http://twitter.com/search?q=%23GFvip",
    "name": "#GFvip",
    "query": "%23GFvip",
    "tweet_volume": 190550,
    "promoted_content": null
   },
   {
    "url": "http://twitter.com/search?q=%23ahoraqueARV",
    "name": "#ahoraqueARV",
    "query": "%23ahoraqueARV",
    "tweet_volume": 22822,
    "promoted_content": null
   },
   {
    "url": "http://twitter.com/search?q=%23%D9%88%D9%84%D9%8A_%D8%A7%D9%84%D8%B9%D9%87%D8%AF_%D8%A7%D9%84%D9%85%D8%A8%D8%A7%D8%B1%D9%8A%D8%A7%D8%AA_%D9%85%D8%AC%D8%A7%D9%86%D8%A7",
    "name": "#\u0648\u0644\u064a_\u0627\u0644\u0639\u0647\u062f_\u0627\u0644\u0645\u0628\u0627\u0631\u064a\u0627\u062a_\u0645\u062c\u0627\u0646\u0627",
    

Take a look at the [api docs](https://dev.twitter.com/rest/reference/get/trends/place) for the /trends/place call made above. 
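
In the pretty-printed output above, the non-English hashtag appears as `\u...` escapes because `json.dumps` escapes non-ASCII characters by default. A short sketch of the `ensure_ascii=False` option, which keeps such text readable; the one-entry dictionary is a cut-down stand-in for a single trend.

```python
import json

# Cut-down stand-in for one trend from the response above.
trend = {"name": "#ولي_العهد_المباريات_مجانا"}

escaped = json.dumps(trend, indent=1)                       # default: \uXXXX escapes
readable = json.dumps(trend, indent=1, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)
```

Both strings decode to the same data; only the way they are displayed differs.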

## Step 4. Collecting search results for a targeted hashtag.

In [16]:
# Import unquote to prevent url encoding errors in next_results
from urllib.parse import unquote

#This can be any trending topic, but let's focus on a hashtag that is relevant to the class. 
q = '#analytics' 

count = 100

# See https://dev.twitter.com/rest/reference/get/search/tweets
search_results = twitter.search.tweets(q=q, count=count)

#This selects out the list of tweets (statuses) from the response.
statuses = search_results['statuses']


# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
        print ("next_results", next_results)
    except KeyError: # No more results when next_results doesn't exist
        break
        
    # Create a dictionary from next_results, which has the following form:
    # ?max_id=313519052523986943&q=NCAA&include_entities=1
    # unquote decodes values such as %23analytics so they are not encoded twice.
    kwargs = dict([ kv.split('=') for kv in unquote(next_results[1:]).split("&") ])
    print (kwargs)
    search_results = twitter.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list...
print (json.dumps(statuses[0], indent=1))

Length of statuses 89
{
 "favorite_count": 0,
 "in_reply_to_screen_name": null,
 "source": "<a href=\"https://twitter.com/alevergara78\" rel=\"nofollow\">realtimeApp_reactjs</a>",
 "truncated": false,
 "place": null,
 "retweeted_status": {
  "in_reply_to_screen_name": null,
  "source": "<a href=\"http://app.sendblur.com\" rel=\"nofollow\">Social Media Publisher App </a>",
  "truncated": true,
  "place": null,
  "in_reply_to_user_id": null,
  "in_reply_to_status_id": null,
  "coordinates": null,
  "text": "AI 100: The Artificial Intelligence Startups Redefining Industries | #Analytics #BusinessIntelligence #RT\u2026 https://t.co/c48lDwRVOg",
  "retweeted": false,
  "in_reply_to_status_id_str": null,
  "lang": "en",
  "is_quote_status": false,
  "favorite_count": 77,
  "created_at": "Sun Oct 01 23:03:42 +0000 2017",
  "contributors": null,
  "geo": null,
  "id": 914626908066926594,
  "favorited": false,
  "possibly_sensitive": false,
  "entities": {
   "user_mentions": [],
   "urls": [
 

In [17]:
#Print several
print (json.dumps(statuses[0:5], indent=1))

[
 {
  "favorite_count": 0,
  "in_reply_to_screen_name": null,
  "source": "<a href=\"https://twitter.com/alevergara78\" rel=\"nofollow\">realtimeApp_reactjs</a>",
  "truncated": false,
  "place": null,
  "retweeted_status": {
   "in_reply_to_screen_name": null,
   "source": "<a href=\"http://app.sendblur.com\" rel=\"nofollow\">Social Media Publisher App </a>",
   "truncated": true,
   "place": null,
   "in_reply_to_user_id": null,
   "in_reply_to_status_id": null,
   "coordinates": null,
   "text": "AI 100: The Artificial Intelligence Startups Redefining Industries | #Analytics #BusinessIntelligence #RT\u2026 https://t.co/c48lDwRVOg",
   "retweeted": false,
   "in_reply_to_status_id_str": null,
   "lang": "en",
   "is_quote_status": false,
   "favorite_count": 77,
   "created_at": "Sun Oct 01 23:03:42 +0000 2017",
   "contributors": null,
   "geo": null,
   "id": 914626908066926594,
   "favorited": false,
   "possibly_sensitive": false,
   "entities": {
    "user_mentions": [],
    "u
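
The `next_results` value followed in the search loop above is an ordinary URL query string. As an alternative to splitting it by hand, the standard library's `urllib.parse.parse_qsl` splits and percent-decodes it in one step. A sketch using the example string from the code comment plus a made-up encoded hashtag query; the helper name `next_results_to_kwargs` is hypothetical.

```python
from urllib.parse import parse_qsl

def next_results_to_kwargs(next_results):
    """Turn a '?key=value&...' string into a dict of decoded parameters."""
    return dict(parse_qsl(next_results.lstrip("?")))

print(next_results_to_kwargs("?max_id=313519052523986943&q=NCAA&include_entities=1"))
# %23 is the percent-encoded form of '#'
print(next_results_to_kwargs("?max_id=1&q=%23analytics"))
```

The resulting dictionary could then be passed on as keyword arguments in the same way as `kwargs` in the loop.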

## Step 5. Extracting text, screen names, and hashtags from tweets

In [18]:
#We can access an individual tweet like so:
statuses[1]['text']





'RT @Ronald_vanLoon: AI 100: The Artificial Intelligence Startups Redefining Industries | #Analytics #BusinessIntelligence #RT https://t.co/…'

In [19]:
statuses[1]['entities']

{'hashtags': [{'indices': [89, 99], 'text': 'Analytics'},
  {'indices': [100, 121], 'text': 'BusinessIntelligence'},
  {'indices': [122, 125], 'text': 'RT'}],
 'symbols': [],
 'urls': [],
 'user_mentions': [{'id': 555031989,
   'id_str': '555031989',
   'indices': [3, 18],
   'name': 'Ronald van Loon',
   'screen_name': 'Ronald_vanLoon'}]}

In [20]:
#Notice the nested structure. We have to follow it level by level to reach the data we want.
statuses[1]['entities']['hashtags']

[{'indices': [89, 99], 'text': 'Analytics'},
 {'indices': [100, 121], 'text': 'BusinessIntelligence'},
 {'indices': [122, 125], 'text': 'RT'}]
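
A chained lookup such as `status['entities']['hashtags']` raises `KeyError` if any level is missing. A small sketch of a more forgiving lookup with `dict.get`; the helper name `hashtag_texts` is hypothetical, and the sample is a trimmed copy of the entities shown above.

```python
def hashtag_texts(status):
    """Return hashtag strings for one status, or [] if any level is missing."""
    entities = status.get("entities", {})
    return [tag["text"] for tag in entities.get("hashtags", [])]

# Trimmed copy of the entities output above.
sample_status = {"entities": {"hashtags": [
    {"indices": [89, 99], "text": "Analytics"},
    {"indices": [100, 121], "text": "BusinessIntelligence"},
    {"indices": [122, 125], "text": "RT"}]}}

print(hashtag_texts(sample_status))
print(hashtag_texts({}))  # no entities at all: an empty list, not an error
```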

In [21]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

urls = [ url['url'] 
             for status in statuses
                 for url in status['entities']['urls'] ]



# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

# Explore the first 5 items for each...

print (json.dumps(status_texts[0:5], indent=1))
print (json.dumps(screen_names[0:5], indent=1)) 
print (json.dumps(hashtags[0:5], indent=1))
print (json.dumps(words[0:5], indent=1))

[
 "RT @Ronald_vanLoon: AI 100: The Artificial Intelligence Startups Redefining Industries | #Analytics #BusinessIntelligence #RT https://t.co/\u2026",
 "RT @Ronald_vanLoon: AI 100: The Artificial Intelligence Startups Redefining Industries | #Analytics #BusinessIntelligence #RT https://t.co/\u2026",
 "#Batman: Why the boom is already over - by @steveranger | #Analytics #IT https://t.co/QHeo07UZKJ",
 "RT tamaradull: \"One of the least understood is the distinction between DBMS and #analytics application functions\" via\u2026 \u2026",
 "RT @AnsonMccadeAus: Here are four data #analytics #careers that aren't a data scientist:\n\n#bigdata #datascience #blockchain #success #defst\u2026"
]
[
 "Ronald_vanLoon",
 "Ronald_vanLoon",
 "steveranger",
 "AnsonMccadeAus",
 "AnsonMccadeAus"
]
[
 "Analytics",
 "BusinessIntelligence",
 "RT",
 "Analytics",
 "BusinessIntelligence"
]
[
 "RT",
 "@Ronald_vanLoon:",
 "AI",
 "100:",
 "The"
]
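
The nested list comprehensions above read left to right like nested `for` loops. A sketch showing the two forms side by side on two hand-made statuses, trimmed to the fields used here:

```python
# Hand-made statuses with only the fields needed for this example.
sample_statuses = [
    {"entities": {"hashtags": [{"text": "Analytics"}, {"text": "RT"}]}},
    {"entities": {"hashtags": [{"text": "IT"}]}},
]

# Comprehension form, as used above.
hashtags = [hashtag["text"]
            for status in sample_statuses
            for hashtag in status["entities"]["hashtags"]]

# Equivalent explicit loops: the outer loop comes first, the inner loop second.
hashtags_loop = []
for status in sample_statuses:
    for hashtag in status["entities"]["hashtags"]:
        hashtags_loop.append(hashtag["text"])

print(hashtags)
print(hashtags == hashtags_loop)
```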


## Step 6. Creating a basic frequency distribution from the words in tweets

In [22]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print (c.most_common(10)) # top 10
    

[('RT', 43), ('#Analytics', 42), ('#analytics', 41), ('to', 28), ('and', 21), ('the', 21), ('#BigData', 20), ('The', 18), ('in', 17), ('#MachineLearning', 15)]
[('Ronald_vanLoon', 5), ('Forbes', 4), ('steveranger', 3), ('techpearce2', 3), ('humanwareonline', 2), ('msarsar', 2), ('mobileftp', 2), ('AnsonMccadeAus', 2), ('AmiiThinks', 2), ('jlmico', 2)]
[('analytics', 43), ('Analytics', 43), ('BigData', 22), ('MachineLearning', 15), ('AI', 12), ('IoT', 8), ('DataManagement', 7), ('IT', 6), ('bigdata', 6), ('MDM', 6)]
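
The counts above treat `analytics` and `Analytics` as different hashtags. If case should not matter, one option is to lowercase each item before counting. A sketch on a small made-up list:

```python
from collections import Counter

# Made-up hashtags with mixed case.
sample_hashtags = ["Analytics", "analytics", "BigData", "bigdata", "AI", "Analytics"]

case_sensitive = Counter(sample_hashtags)
case_insensitive = Counter(tag.lower() for tag in sample_hashtags)

print(case_sensitive.most_common(2))
print(case_insensitive.most_common(2))
```

The same idea applies to `words`, where `The` and `the` are also counted separately.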
