# Table of Contents
  - [ Using TextRazor for Named Entities (and topics)](#_using textrazor for named entities (and topics)) 
- [ Twitter API](#_twitter api) 
  - [ Streaming Data (Watching Incoming Live Tweets) [Search is Smarter, See below]](#_streaming data (watching incoming live tweets) [search is smarter, see below]) 
  - [  Search for a term and control hit results (Preferred!)](#_ search for a term and control hit results (preferred!)) 
  - [ Making a DataFrame from JSON or Dict Results](#_making a dataframe from json or dict results) 


## Using TextRazor for Named Entities (and topics)<a name="_using textrazor for named entities (and topics)"></a>

Here is the site: https://www.textrazor.com/

You need to run `pip install textrazor` at the command line.

We are allowed to use my key for a limited number of queries per day. Don't abuse it.

Documentation: https://www.textrazor.com/tutorials

In [2]:
KEY = "8a75129de373331abf16cff513ec77d1b0cdc51b4d7d6957d9788d68"

In [1]:
import textrazor

import nlp_utilities as mytools

In [3]:
textrazor.api_key = KEY
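
Since the key is shared and rate limited, it may be safer to keep it out of the notebook itself. Here is a minimal sketch, assuming the key is stored in an environment variable; the variable name `TEXTRAZOR_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the textrazor package.

```python
import os


def load_api_key(env_var="TEXTRAZOR_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# In the notebook this would replace the hard-coded assignment:
# textrazor.api_key = load_api_key()
```

With this in place the key never appears in a saved or shared copy of the notebook.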

In [4]:
client = textrazor.TextRazor(extractors=["entities", "topics"])
client.set_classifiers(["textrazor_newscodes"])

In [6]:
files = mytools.get_filenames("data/mydata")

In [7]:
texts = mytools.load_texts_as_string(files)

In [8]:
texts[files[0]]

"I really am a fan of this place. It has good food and good beer, and also is associated with some good memories here. This place is a great place to come to and watch a game or two. Or in our case, a great place to come to after watching out team win or lose the game. The parking here can be difficult since there is no lot or anything designated for costumer parking. The beer is really good here. I like the lighter beers, such as their regular peach beer, but their other beers and seasonal are awesome too. The food is delish; I always get a hankering for their beer battered fries. Eat it with their jalapeño ranch sauce and it's perfect. That and some loaded chicken nachos are easily split between friends; their portion sizes are fairly huge. It's really an awesome spot to get together with friend, have a pint, enjoy a game and some good food."

In [9]:
response = client.analyze(texts[files[0]])

In [10]:
entities = list(response.entities())
entities.sort(key=lambda x: x.relevance_score, reverse=True)
seen = set()
for entity in entities:
    if entity.id not in seen:
        print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
        seen.add(entity.id)

Pint 0.2001 1.875 ['/measurement_unit/volume_unit', '/food/culinary_measure', '/type/unit', '/time_series/unit']
Nachos 0.1544 1.437 ['/food/food', '/food/dish', '/people/person', '/law/inventor', '/people/deceased_person']
French fries 0.1403 2.211 ['/food/dish', '/food/food']
Jalapeño 0.1167 5.103 ['/food/food', '/biology/organism_classification', '/food/ingredient']
Batter (cooking) 0.1053 2.585 ['/food/ingredient']
Beer 0.07647 1.291 ['/visual_art/art_subject', '/book/book_subject', '/food/ingredient', '/food/beverage_type', '/film/film_subject', '/internet/website_category', '/food/food']
Sauce 0.06616 2.623 ['/food/type_of_dish']
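
The loop above can be wrapped into a reusable helper. This is only a sketch: the function name `top_unique_entities` and the `min_relevance` cutoff are my own additions, and the stand-in objects replace a real API response so the example runs offline.

```python
from types import SimpleNamespace


def top_unique_entities(entities, min_relevance=0.0):
    """Return one entry per entity id, highest relevance first.

    Works on any objects exposing `id` and `relevance_score` attributes,
    which is how the loop above reads the TextRazor entities.
    """
    best = {}
    for entity in entities:
        current = best.get(entity.id)
        if current is None or entity.relevance_score > current.relevance_score:
            best[entity.id] = entity
    kept = [e for e in best.values() if e.relevance_score >= min_relevance]
    return sorted(kept, key=lambda e: e.relevance_score, reverse=True)


# Stand-in objects. The first, second and fourth scores come from the output
# above; the duplicate "Beer" entry (0.05) is invented to exercise the dedupe.
sample = [
    SimpleNamespace(id="Beer", relevance_score=0.07647),
    SimpleNamespace(id="Pint", relevance_score=0.2001),
    SimpleNamespace(id="Beer", relevance_score=0.05),
    SimpleNamespace(id="Nachos", relevance_score=0.1544),
]
names = [e.id for e in top_unique_entities(sample, min_relevance=0.1)]
print(names)
```

In the notebook you would call it as `top_unique_entities(response.entities(), min_relevance=0.1)`.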


In [11]:
for topic in response.topics():
    if topic.score > 0.3:
        print(topic.label, topic.score)

Foods 0.9698
Cuisine 0.9098
Cooking 0.8811
Food and drink preparation 0.8625
Western cuisine 0.8443
Food and drink 0.7952
North American cuisine 0.6749
Convenience foods 0.5715
American cuisine 0.5553
Cuisine of the Americas 0.5367
Fast food 0.4665
Fried foods 0.4185
Eating behaviors of humans 0.4168
Television series 0.413
Deep fried foods 0.3876
Alcoholic drinks 0.377
European cuisine 0.3751
Food industry 0.3501
Pint 0.3168
American television seasons 0.3137
The Biggest Loser 0.3087
Prepared foods 0.3057
Hobbies 0.3046
Fitness reality television series 0.3014


In [12]:
for category in response.categories():
    print(category.category_id, category.label, category.score)

10011000 lifestyle and leisure>public holiday 1
04007003 economy, business and finance>consumer goods>food 0.7415
04007008 economy, business and finance>consumer goods>beverage 0.5971
01021000 arts, culture and entertainment>entertainment (general) 0.5589
04014004 economy, business and finance>tourism and leisure>restaurant and catering 0.4112
10000000 lifestyle and leisure 0.3949
04013002 economy, business and finance>process industry>food 0.3643
10003000 lifestyle and leisure>gastronomy 0.3595
01016000 arts, culture and entertainment>television 0.3301
04007000 economy, business and finance>consumer goods 0.3285
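
The category labels printed above appear to encode a hierarchy, with `>` separating the levels. Assuming that reading is right, a small sketch like the following splits the labels and keeps the best score per top-level group. The helper name `split_category` is hypothetical.

```python
from collections import defaultdict


def split_category(label):
    """Split a label such as 'a>b>c' into its hierarchy levels."""
    return [part.strip() for part in label.split(">")]


# (label, score) pairs copied from the output above.
rows = [
    ("lifestyle and leisure>public holiday", 1.0),
    ("economy, business and finance>consumer goods>food", 0.7415),
    ("economy, business and finance>consumer goods>beverage", 0.5971),
    ("arts, culture and entertainment>entertainment (general)", 0.5589),
]

# Keep the best score seen for each top-level group.
best_by_group = defaultdict(float)
for label, score in rows:
    top_level = split_category(label)[0]
    best_by_group[top_level] = max(best_by_group[top_level], score)

for group, score in best_by_group.items():
    print(group, score)
```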


# Twitter API<a name="_twitter api"></a>

If you want to use Twitter's API to collect your own tweets about a subject, here is a good tutorial:

http://socialmedia-class.org/twittertutorial.html

It requires you to create your own API keys with your own Twitter account.

Note: You can run `pip install twitter` instead of following the setup.py instructions shown there.

In [1]:
import json
import twitter
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

# Variables that contain the user credentials for the Twitter API.
# Register for these at apps.twitter.com.
ACCESS_TOKEN = 'yours here'
ACCESS_SECRET = 'yours here'
CONSUMER_KEY = 'yours here'
CONSUMER_SECRET = 'yours here'

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)


## Streaming Data (Watching Incoming Live Tweets) [Search is Smarter, See below]<a name="_streaming data (watching incoming live tweets) [search is smarter, see below]"></a>

In [14]:
# Initiate the connection to Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)
iterator = twitter_stream.statuses.filter(track="#bigdata", language="en")

In [15]:
# set a limit because you will be stopped at a certain number
tweet_count = 2  # low because it's just to show the data
mydata = []
for tweet in iterator:
    tweet_count -= 1
    # Twitter Python Tool wraps the data returned by Twitter 
    # as a TwitterDictResponse object.
    mydata.append(tweet)
    if tweet_count <= 0:
        break 
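
The countdown loop can also be written with `itertools.islice`. The sketch below additionally skips items that do not look like tweets; the assumption here is that a real tweet is a dict with a `'text'` key, while other stream messages are not. The function name `take_tweets` and the stand-in stream are illustrative only.

```python
from itertools import islice


def take_tweets(stream, limit):
    """Return up to `limit` items from `stream` that look like tweets.

    Assumption: a tweet is a dict-like object with a 'text' key;
    anything else (for example a delete notice) is skipped.
    """
    tweets_only = (item for item in stream
                   if isinstance(item, dict) and "text" in item)
    return list(islice(tweets_only, limit))


# Stand-in stream so the sketch runs offline.
fake_stream = [
    {"text": "first #bigdata tweet"},
    {"delete": {"status": {"id": 1}}},
    {"text": "second #bigdata tweet"},
    {"text": "third #bigdata tweet"},
]
collected = take_tweets(fake_stream, limit=2)
print(len(collected))
```

With the live connection, the call would be `mydata = take_tweets(iterator, limit=2)`.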

In [16]:
mydata

[{'contributors': None,
  'coordinates': None,
  'created_at': 'Tue Mar 14 08:47:48 +0000 2017',
  'entities': {'hashtags': [{'indices': [77, 85], 'text': 'bigdata'},
    {'indices': [86, 98], 'text': 'datascience'},
    {'indices': [99, 115], 'text': 'machinelearning'},
    {'indices': [116, 128], 'text': 'advertising'}],
   'symbols': [],
   'urls': [{'expanded_url': None, 'indices': [130, 130], 'url': ''}],
   'user_mentions': [{'id': 799221973922496517,
     'id_str': '799221973922496517',
     'indices': [3, 15],
     'name': 'Andy Driver',
     'screen_name': 'Andy_RedCat'}]},
  'favorite_count': 0,
  'favorited': False,
  'filter_level': 'low',
  'geo': None,
  'id': 841571554702045184,
  'id_str': '841571554702045184',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'retweet_count': 0,
  'retwee

What's in this data: https://dev.twitter.com/overview/api/tweets

Incredibly cool: https://dev.twitter.com/overview/api/entities-in-twitter-objects

In [17]:
mydata[0].keys()

dict_keys(['id', 'in_reply_to_user_id_str', 'in_reply_to_user_id', 'created_at', 'source', 'favorite_count', 'coordinates', 'geo', 'retweeted', 'timestamp_ms', 'place', 'in_reply_to_status_id', 'is_quote_status', 'truncated', 'in_reply_to_screen_name', 'retweeted_status', 'user', 'text', 'filter_level', 'entities', 'id_str', 'lang', 'favorited', 'retweet_count', 'in_reply_to_status_id_str', 'contributors'])

In [18]:
mydata[0]['entities']

{'hashtags': [{'indices': [77, 85], 'text': 'bigdata'},
  {'indices': [86, 98], 'text': 'datascience'},
  {'indices': [99, 115], 'text': 'machinelearning'},
  {'indices': [116, 128], 'text': 'advertising'}],
 'symbols': [],
 'urls': [{'expanded_url': None, 'indices': [130, 130], 'url': ''}],
 'user_mentions': [{'id': 799221973922496517,
   'id_str': '799221973922496517',
   'indices': [3, 15],
   'name': 'Andy Driver',
   'screen_name': 'Andy_RedCat'}]}

In [22]:
mydata[0]['entities']['hashtags']

[{'indices': [77, 85], 'text': 'bigdata'},
 {'indices': [86, 98], 'text': 'datascience'},
 {'indices': [99, 115], 'text': 'machinelearning'},
 {'indices': [116, 128], 'text': 'advertising'}]

In [23]:
mydata[0]['text']

'RT @Andy_RedCat: The future of online advertising is big data and algorithms #bigdata #datascience #machinelearning #advertising… '
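
The `indices` stored with each entity are start and end offsets into the tweet text, so the matching substring can be recovered by slicing. A small sketch using the values printed above (the helper name `entity_spans` is hypothetical):

```python
# Tweet text and entity offsets copied from the output above.
text = ("RT @Andy_RedCat: The future of online advertising is big data "
        "and algorithms #bigdata #datascience #machinelearning #advertising… ")
entities = {
    "hashtags": [{"indices": [77, 85], "text": "bigdata"},
                 {"indices": [86, 98], "text": "datascience"}],
    "user_mentions": [{"indices": [3, 15], "screen_name": "Andy_RedCat"}],
}


def entity_spans(text, entity_list):
    """Return the substring of `text` covered by each entity's indices."""
    return [text[start:end]
            for start, end in (entity["indices"] for entity in entity_list)]


print(entity_spans(text, entities["hashtags"]))
print(entity_spans(text, entities["user_mentions"]))
```

Note that the slices include the leading `#` or `@`, while the `text` and `screen_name` fields do not.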

In [25]:
import pandas as pd
pd.DataFrame(mydata)

| | contributors | coordinates | created_at | entities | favorite_count | favorited | filter_level | geo | id | id_str | ... | place | possibly_sensitive | retweet_count | retweeted | retweeted_status | source | text | timestamp_ms | truncated | user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | Tue Mar 14 08:47:48 +0000 2017 | `{'urls': [{'indices': [130, 130], 'expanded_ur...` | 0 | False | low | | 841571554702045184 | 841571554702045184 | ... | | | 0 | False | `{'id': 841570860335013888, 'in_reply_to_user_i...` | `<a href="https://twitter.com/alevergara78" rel...` | RT @Andy_RedCat: The future of online advertis... | 1489481268941 | False | `{'default_profile_image': False, 'id': 1045204...` |
| 1 | | | Tue Mar 14 08:48:01 +0000 2017 | `{'urls': [{'display_url': 'drumup.io/s/bHFvQv'...` | 0 | False | low | | 841571605742407680 | 841571605742407680 | ... | | False | 0 | False | | `<a href="https://drumup.io" rel="nofollow">dru...` | Using machine learning to secure #identity and... | 1489481281110 | False | `{'default_profile_image': False, 'id': 8321238...` |
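
The `created_at` column holds plain strings and `timestamp_ms` holds milliseconds since the epoch, so both need converting before any time-based analysis. A minimal sketch using only the standard library; the function names are my own, and the format string assumes an English locale for the day and month names.

```python
from datetime import datetime, timezone

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a 'created_at' string such as 'Tue Mar 14 08:47:48 +0000 2017'."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def parse_timestamp_ms(value):
    """Convert 'timestamp_ms' (milliseconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


# Values taken from the first streamed tweet above.
created = parse_created_at("Tue Mar 14 08:47:48 +0000 2017")
stamped = parse_timestamp_ms("1489481268941")
print(created.isoformat())
print(stamped.replace(microsecond=0) == created)
```

The two fields describe the same moment; `timestamp_ms` just carries extra millisecond precision.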


## Search for a term and control hit results (Preferred!)<a name="_ search for a term and control hit results (preferred!)"></a>

In [26]:
# Use a new name so the client does not shadow the imported twitter module
twitter_client = Twitter(auth=oauth)
nlptweets = twitter_client.search.tweets(q='#nlp', result_type='recent', lang='en', count=10)

In [27]:
# This query returns a dictionary; the tweets themselves are under the 'statuses' key.
nlptweets.keys()

dict_keys(['search_metadata', 'statuses'])

In [28]:
for tweet in nlptweets['statuses']:
    print(tweet['text'])

RT @duggystoneshow: #listen to the #confidence #Coach @KeithB_N https://t.co/MjNAtCHnYX #podcast #podfamily #NLP #magic #chatshow https://t…
Remembering that actions generally come from a #goodintention really does have benefits! 😊 https://t.co/wIEjeHAaXH #NLP #TopTipTuesday
Transhuman consulting is India’s largest #NLP #Training provider consultant in #Mumbai. https://t.co/jQdE8mZDC9 https://t.co/smVfBxrFL0
RT @DanielleSerpico: Its here! The next #NLP Training is up! https://t.co/TLS7s1TkAU Early Bird on now! #skills #career #communication
RT @DanielleSerpico: I'm so excited! The next #NLP Training is up! https://t.co/TLS7s1TkAU #earlybird #training #skills #Dublin #life https…
RT @andi_staub: 10'000 #AI #startups need to learn these lessons

#NLP #deeplearning #NeuralNetworks 
#insurtech #fintech

https://t.co/bAe…
GREAT OPPORTUNITY TO LEARN FROM AN INTERNATIONAL ENSEMBLE OF HYPNOSIS EXPERTISE https://t.co/piI7JGc2wq #HYPNOSIS #HYPNOTHERAPY #NLP
How is your mind running your business
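
Many of the hits above are retweets, which can skew counts in an analysis. One possible way to drop them is sketched below, assuming a retweet either carries a `retweeted_status` entry (one of the keys listed earlier) or has text starting with `RT @`. The helper names are hypothetical and the sample statuses are shortened stand-ins.

```python
def is_retweet(tweet):
    """Heuristic: a status is a retweet if it carries a 'retweeted_status'
    payload or its text starts with 'RT @'."""
    has_payload = bool(tweet.get("retweeted_status"))
    return has_payload or tweet.get("text", "").startswith("RT @")


def original_tweets(statuses):
    """Keep only the statuses that are not retweets."""
    return [tweet for tweet in statuses if not is_retweet(tweet)]


# Shortened stand-ins for entries of nlptweets['statuses'].
statuses = [
    {"text": "RT @DanielleSerpico: Its here! The next #NLP Training is up!",
     "retweeted_status": {"id": 841364339927969794}},
    {"text": "Remembering that actions generally come from a #goodintention"},
]
for tweet in original_tweets(statuses):
    print(tweet["text"])
```

In the notebook the call would be `original_tweets(nlptweets['statuses'])`.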

## Making a DataFrame from JSON or Dict Results<a name="_making a dataframe from json or dict results"></a>

This is very easy. You can also search Google for how to handle JSON, dicts, etc. We have seen all of them in class so far.

In [29]:
import pandas as pd

In [32]:
df = pd.DataFrame(nlptweets['statuses'])
df.head()

| | contributors | coordinates | created_at | entities | extended_entities | favorite_count | favorited | geo | id | id_str | ... | metadata | place | possibly_sensitive | retweet_count | retweeted | retweeted_status | source | text | truncated | user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | Tue Mar 14 08:46:27 +0000 2017 | `{'urls': [{'display_url': 'apple.co/2lioqoH', ...` | | 0 | False | | 841571211310178304 | 841571211310178304 | ... | `{'result_type': 'recent', 'iso_language_code':...` | | False | 1 | False | `{'id': 841544517413134336, 'in_reply_to_user_i...` | `<a href="http://twitter.com/download/iphone" r...` | RT @duggystoneshow: #listen to the #confidence... | False | `{'default_profile_image': False, 'id': 3325976...` |
| 1 | | | Tue Mar 14 08:44:28 +0000 2017 | `{'urls': [{'display_url': 'goo.gl/nu0zvW', 'ex...` | | 0 | False | | 841570713869901824 | 841570713869901824 | ... | `{'result_type': 'recent', 'iso_language_code':...` | | False | 0 | False | | `<a href="https://about.twitter.com/products/tw...` | Remembering that actions generally come from a... | False | `{'default_profile_image': False, 'id': 1225396...` |
| 2 | | | Tue Mar 14 08:43:32 +0000 2017 | `{'urls': [{'display_url': 'transhumanconsultin...` | `{'media': [{'id_str': '841570357299503104', 'i...` | 0 | False | | 841570477537611778 | 841570477537611778 | ... | `{'result_type': 'recent', 'iso_language_code':...` | | False | 0 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | Transhuman consulting is India’s largest #NLP ... | False | `{'default_profile_image': False, 'id': 7966100...` |
| 3 | | | Tue Mar 14 08:43:23 +0000 2017 | `{'urls': [{'display_url': 'eventbrite.ie/e/the...` | | 0 | False | | 841570442410373120 | 841570442410373120 | ... | `{'result_type': 'recent', 'iso_language_code':...` | | False | 0 | False | `{'id': 841364339927969794, 'in_reply_to_user_i...` | `<a href="http://twitter.com/download/iphone" r...` | RT @DanielleSerpico: Its here! The next #NLP T... | False | `{'default_profile_image': False, 'id': 8853768...` |
| 4 | | | Tue Mar 14 08:43:17 +0000 2017 | `{'urls': [{'display_url': 'eventbrite.ie/e/the...` | | 0 | False | | 841570416510541824 | 841570416510541824 | ... | `{'result_type': 'recent', 'iso_language_code':...` | | False | 1 | False | `{'id': 841364903151730688, 'in_reply_to_user_i...` | `<a href="http://twitter.com/download/iphone" r...` | RT @DanielleSerpico: I'm so excited! The next ... | False | `{'default_profile_image': False, 'id': 8853768...` |


Ideally, in a good analysis, you would extract items from `entities` too. Think about functions you could apply to make new columns, or a new DataFrame that uses the tweet id as its index and the entities as columns.
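
As one possible starting point, the sketch below flattens each tweet's entities into a plain record keyed by tweet id. The function name `entity_record` and the choice of output fields are assumptions made for illustration; the sample tweet is a cut-down copy of the streamed tweet shown earlier.

```python
def entity_record(tweet):
    """Flatten the parts of tweet['entities'] that are easy to tabulate."""
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id"],
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "mentions": [user["screen_name"]
                     for user in entities.get("user_mentions", [])],
        "n_urls": len(entities.get("urls", [])),
    }


# A cut-down copy of the streamed tweet shown earlier.
tweet = {
    "id": 841571554702045184,
    "entities": {
        "hashtags": [{"indices": [77, 85], "text": "bigdata"},
                     {"indices": [86, 98], "text": "datascience"}],
        "urls": [{"expanded_url": None, "indices": [130, 130], "url": ""}],
        "user_mentions": [{"screen_name": "Andy_RedCat", "indices": [3, 15]}],
    },
}
records = [entity_record(tweet)]
print(records[0]["hashtags"])
```

Building the list with `[entity_record(t) for t in nlptweets['statuses']]` and passing it to `pd.DataFrame(...)`, followed by `set_index('id')`, should give a frame indexed by tweet id with one column per entity type.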