<br><br>
## <center>Learning to use APIs, starting with Twitter</center>##
<br><br>

In the realm of data science, an API (_application programming interface_) is a means of connecting to a database maintained by another party. That definition may sound both too simple and too abstract to mean anything, so let's use Facebook as an example to explain it.
<br>
<br>
If you're logged into your account via the Facebook app or website, any data you create or consume will be routed through your connection to the company's servers and into or out of their oceans of data. Many would love to be able to glimpse the data flowing over that connection, but Facebook is making billions off of those data and doesn't want to share them all, especially because users wouldn't like that. But Facebook also knows that third-party apps add a lot of value to the platform, so it needs a way to allow third parties to create new data and see some of a user's existing data on behalf of the user. The Facebook API is the means by which it does that. An API defines protocols and methods that let pretty much any machine with an internet connection connect to and make requests of a web server and the data storage system behind it.
<br>
The term API actually covers more than what I described above. For example, if you've used `matplotlib` for data visualization in Python, you shouldn't be using the classes defined by the main package directly. You should be using `pyplot` (via a statement like `import matplotlib.pyplot as plt`), which is an API whose functions manage a lot of the hardest work for you. But in the world of data science and big data, the term generally refers to a means of connecting to a server and database.
<br>
<br>
As data-hungry scientists, we might wish every data storage system had an API we could plug into, but (un?)fortunately they don't. In fact, it is more likely that a system doesn't have an API than that it does. Places like the US Census Bureau have APIs as a public service, but for a company to invest in making and maintaining an API, and risk losing users' trust by making personal information available, the company needs to see a significant financial upside to it. Most companies don't, and the APIs that do exist usually don't make all data available and might in fact be trending toward making less available. This has happened with the Facebook API, and for good reason. A lot of personal information was made available to third-party apps with user permissions, and it's clear that users aren't comfortable with that (see Cambridge Analytica).
<br><br>
## But then there was Twitter ##
There is a technical reason why you've seen and heard about so much research based on Twitter data. Because the platform aims to be a place for public discussion and dissemination of information, Twitter makes it easy to algorithmically collect its data, including metadata and other things not rendered in feeds. The comparative ease with which anyone can get data makes Twitter a go-to source for trying to answer some types of questions. The jury is still out on how generalizable the findings are, but President Trump's affinity for the platform has at least kept it relevant and interesting.
<br><br>
To introduce you to APIs, we are going to explore Twitter data and produce some descriptive findings. To make this notebook work, I've created a module that can help gather, archive, search for and display tweets interactively. When there is a clear relationship between what the module's functions are doing and what the actual Twitter API is doing, I've included the actual API call (what the command looks like).
<br><br>
I've chosen this approach over dealing directly with the Twitter API because (1) the raw data you get back from the API can be overwhelming and possibly discouraging; (2) I'm trying to give you a general feel for how APIs work, not bore you with the details of the Twitter API; and (3) it's a lot of fun to play around with Twitter data and I want to make sure we get to that, which we couldn't do without sidelining some of the specific details.
<br><br>
### A couple more important details about using the Twitter API ###
<br>
First, Twitter does not provide access to all tweets with the standard connection, but instead allows users to find particular tweets or sample from realtime or historical data. The sample is random in principle, but the sampling schemes aren't public and the highly dynamic nature of the platform means there are probably a number of schemes that could be defended as being properly random. This shouldn't amount to much, but it's worth knowing.
<br><br>
The API Twitter provides uses the [REST protocol](https://spring.io/understanding/REST), which means that if you know the right way to format the requests, you can send them and get responses using just your web browser. It turns out this is clunky and inconvenient, so people create libraries that map commands written in your favorite programming language to the proper REST requests. We'll be making use of one of the Python packages, `python-twitter`, below, so you need to have it on your computer. We'll also need the package `whoosh` for creating a searchable index. Use the Anaconda Navigator or the command line to install the packages. In the Navigator's Environments tab, you can search for the packages by name and install them. For command line installation, use either `easy_install python-twitter` or `pip install python-twitter`.
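
To make the idea of a REST request concrete, here is a minimal sketch that only builds the web address for a user lookup and does not send anything. The base address and the `users/show` resource are the version 1.1 endpoint as I understand it; treat both as assumptions, since Twitter requires authentication for it and may change or retire it. The helper name `build_request_url` is made up for this illustration.

```python
from urllib.parse import urlencode

# Assumed base address for version 1.1 of the Twitter REST API.
BASE_URL = "https://api.twitter.com/1.1"

def build_request_url(resource, **params):
    """Combine a resource path and query parameters into one request URL."""
    query = urlencode(params)
    return f"{BASE_URL}/{resource}.json?{query}"

# Hypothetical example: the address a user lookup would be sent to.
url = build_request_url("users/show", screen_name="KingJames")
print(url)
```

A library does this formatting for you, adds your credentials, sends the request, and converts the reply into Python objects.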
<br>

<br>
### Topic 1: the JSON format 
<br><br>
Before we get to the API, we need to segue to the data format that the Twitter API, and many others, use to give data to us. A common format for API data is the *JSON* (JavaScript Object Notation) format. A *JSON* formatted file is human-readable and ends with *.json*. At this point, the format's only meaningful connection to its namesake JavaScript is that it is built around the concept of *objects*, which is at the core of the object-oriented programming paradigm that Java helped popularize. This matters to us only in that thinking in terms of objects is helpful for a lot of big data research.
<br><br>
Consider a tweet:
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I had to fire General Flynn because he lied to the Vice President and the FBI. He has pled guilty to those lies. It is a shame because his actions during the transition were lawful. There was nothing to hide!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/937007006526959618?ref_src=twsrc%5Etfw">December 2, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


<br>
In our daily lives, we view a tweet as a (very) short piece of text but there is a lot more to a tweet. Below is the JSON representation for the same tweet. 
<br><br>

<code> {'created_at': 'Sat Dec 02 17:14:13 +0000 2017',
        'id': 937007006526959618,
        'id_str': '937007006526959618',
         'text': 'I had to fire General Flynn because he lied to the Vice President and the FBI. He has pled guilty to those lies. It… https://t.co/8bMCLGuAhf',
         'truncated': True, 
         'entities': 
             {'hashtags': [],
             'symbols': [],
             'user_mentions': [],
             'urls': [
                {'url': 'https://t.co/8bMCLGuAhf',
                'expanded_url': 'https://twitter.com/i/web/status/937007006526959618',
                'display_url': 'twitter.com/i/web/status/9…',
                'indices': [117, 140]}
              ]
              }, 
         'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
         'in_reply_to_status_id': None,
         'in_reply_to_status_id_str': None,
         'in_reply_to_user_id': None,
         'in_reply_to_user_id_str': None,
         'in_reply_to_screen_name': None, 
         'user': 
             {'id': 25073877, 
             'id_str': '25073877',
             'name': 'Donald J. Trump',
             'screen_name': 'realDonaldTrump',
             'location': 'Washington, DC',
             'description': '45th President of the United States of America🇺🇸',
             'url': 'https://t.co/OMxB0x7xC5',
             'entities':
                 {'url': 
                     {'urls': [
                         {'url': 'https://t.co/OMxB0x7xC5',
                         'expanded_url': 'http://www.Instagram.com/realDonaldTrump',
                         'display_url': 'Instagram.com/realDonaldTrump',
                         'indices': [0, 23]}
                         ]
                      },
                  'description': 
                      {'urls': []}
                   },
              'protected': False,
              'followers_count': 46576912,
              'friends_count': 45,
              'listed_count': 83700,
              'created_at': 'Wed Mar 18 13:46:38 +0000 2009',
              'favourites_count': 24,
              'utc_offset': -18000,
              'time_zone': 'Eastern Time (US & Canada)',
              'geo_enabled': True,
              'verified': True,
              'statuses_count': 36761,
              'lang': 'en',
              'contributors_enabled': False,
              'is_translator': False,
              'is_translation_enabled': True,
              'profile_background_color': '6D5C18',
              'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
              'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/530021613/trump_scotland__43_of_70_cc.jpg',
              'profile_background_tile': True, 
              'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
              'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
              'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1515478614',
              'profile_link_color': '1B95E0',
              'profile_sidebar_border_color': 'BDDCAD',
              'profile_sidebar_fill_color': 'C5CEC0',
              'profile_text_color': '333333',
              'profile_use_background_image': True,
              'has_extended_profile': False,
              'default_profile': False,
              'default_profile_image': False,
              'following': False, 'follow_request_sent': False,
              'notifications': False,
              'translator_type': 'regular'},
          'geo': None, 
          'coordinates': None,
          'place': None,
          'contributors': None,
          'is_quote_status': False,
          'retweet_count': 31599,
          'favorite_count': 116375,
          'favorited': False,
          'retweeted': False, 
          'lang': 'en'}</code>
<br><br>

As you can see, whatever a tweet is, it is a complex thing. (Can you even find the text of the tweet that actually shows up in your tweetbox?) What is clear is that it exists in a set of relationships. And if there are relationships, there must be things being related. So let's just call those things *objects* to stay as generic as possible.

In object-oriented programming (OO), objects are a collection of *fields* (or attributes) and *methods*, which are procedures to modify the fields. The essence of an *OO* program is the design of the objects and the manipulation of them via methods. (OK, and a bunch of stuff we won't get into.) To do something with objects, we need to know a lot about them, but if we are just moving them around, as we are when collecting data, we just need the names of the fields and their contents. So we need a file format that can keep track of the attributes of objects.
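
As a small illustration of fields and methods, here is a sketch of a made-up `Tweet` class. It is not how Twitter or the module in this notebook defines a tweet; the class, its two fields, and its one method are invented purely to show the vocabulary.

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    # Fields: the data the object carries around.
    text: str
    retweet_count: int = 0

    # Method: a procedure that modifies one of the fields.
    def add_retweet(self):
        self.retweet_count += 1

tweet = Tweet(text="A (very) short piece of text")
tweet.add_retweet()
print(tweet.text, tweet.retweet_count)
```

When we only collect data, we never call methods like `add_retweet`; we just read off field names and their contents.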

Maybe just a standard spreadsheet (.csv, .xls) of the kind you're probably intimately familiar with could work? A row for each observation can work fine for keeping track of an observation and its attributes, but what if an attribute is a reference to something other than another observation of the same type? That is, to another type of object? There is a good chance we'll want to know more about those other objects, so we want their attributes too. Unfortunately, a spreadsheet isn't very convenient for transporting that type of information. The JSON format can help us address this problem, and that's why lots of data-rich APIs use it.
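
To see what squeezing nested data into a spreadsheet row involves, here is a minimal sketch that flattens a trimmed copy of the tweet above into a single row. The function name `flatten` and the dotted column names are my own choices for illustration.

```python
def flatten(record, prefix=""):
    """Turn a nested dictionary into a single flat row of columns."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # A nested object becomes several columns sharing a prefix.
            row.update(flatten(value, prefix=column + "."))
        else:
            # Anything else (numbers, text, lists) is kept as a single cell.
            row[column] = value
    return row

# A few values copied from the tweet shown earlier.
tweet = {
    "id": 937007006526959618,
    "retweet_count": 31599,
    "user": {"id": 25073877, "screen_name": "realDonaldTrump"},
}
row = flatten(tweet)
print(row)
```

It works, but every tweet by the same user would repeat all of the `user.` columns, and lists such as hashtags still have no natural column of their own.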

Going back to the tweet above, you can see that this record of a single tweet contains references to lots of other objects, each with its own attributes. They are encased in curly brackets {} and indented here for visual clarity. Examples are the posting user, "entities", and urls.
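
Here is a short sketch of reaching those nested objects with Python's built-in `json` module. The text in `raw` is a heavily trimmed copy of the tweet above, rewritten as strict JSON; the variable names are only for illustration.

```python
import json

# Strict JSON uses double quotes and true/false/null,
# where the Python printout above shows True/False/None.
raw = """
{
  "id_str": "937007006526959618",
  "truncated": true,
  "place": null,
  "user": {"screen_name": "realDonaldTrump", "followers_count": 46576912},
  "entities": {"hashtags": [], "urls": [{"indices": [117, 140]}]}
}
"""

tweet = json.loads(raw)

# Each pair of curly brackets became a Python dictionary,
# so nested objects are reached one key at a time.
screen_name = tweet["user"]["screen_name"]
first_url_span = tweet["entities"]["urls"][0]["indices"]
print(screen_name, first_url_span)

# Re-print the whole record with indentation for easier reading.
print(json.dumps(tweet, indent=2))
```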

It's having this type of data that makes big data interesting\* to social scientists in my opinion. These data offer a chance to understand the rich, relational context of the observations. We can track them through time better too. The existence of this rich data can make managing all of it challenging (we have more than a spreadsheet can contain, but we also don't have a nice clean spreadsheet!). The benefits far outpace the problems!
<br>
<br>

### Some package imports to get started ###
<br>
Next we need to import the core module with the `from tools import twttr` statement below. We'll also import a function for displaying HTML within this notebook using the `from IPython.display import HTML` statement.
<br>
<br>
If you haven't already installed the Twitter API package `python-twitter` and the search tool `whoosh` via Anaconda or pip/easy_install, you'll get an error until you do the install.
<br>

In [1]:
from tools import twttr
from IPython.display import HTML

<br>

Before you can get started with collecting and analyzing, you'll need to authenticate to the Twitter API. The very first time you run the code in the cell below, you'll need to authenticate using your own Twitter account, so if you don't have one you'll want to register now. 

When you hit shift+enter to run the code the first time, you'll be taken to the Twitter website to sign in. It will then provide you with a PIN to enter into the newly visible blue highlighted space in the cell above. Once you've done that, it will save your login data. You'll still need to run the login code every time you start this notebook, but you won't need to authenticate via Twitter's webpage again.


In [2]:
my_connection = twttr.TWTTR()
my_connection.login()

<br>
### Getting Data Through the API ###

Now we're all authenticated to the Twitter API and can start getting data. You probably don't want to just start getting the realtime feed, so let's start with something just to whet our appetite. Let's just find a user to start.
The command `my_connection.viewUser("user_handle")` does that and reports some of the user's metadata. It also returns a string that allows you to render the results as HTML using the aptly named `HTML()` function.

In [3]:
a = my_connection.viewUser("KingJames")

In [4]:
HTML(a)

The commands `viewUser()` and `HTML()` are not all that important in themselves, but you can see why we'll use them by running the following line of code, which uses the actual Twitter API call `GetUser(screen_name="")`.

In [5]:
print(my_connection.TWIT_PIPE.GetUser(screen_name="KingJames"))

{"created_at": "Fri Mar 06 16:25:53 +0000 2009", "description": "EST. AKRON - ST.V/M Class of '03  http://t.co/IneJylUd1m #IPROMISE", "favourites_count": 122, "followers_count": 41090237, "following": true, "friends_count": 190, "id": 23083404, "id_str": "23083404", "lang": "en", "listed_count": 43761, "location": "Amongst La Familia!", "name": "LeBron James", "profile_background_color": "000000", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_banner_url": "https://pbs.twimg.com/profile_banners/23083404/1529843462", "profile_image_url": "http://pbs.twimg.com/profile_images/1010862750401253377/Rof4XuYC_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1010862750401253377/Rof4XuYC_normal.jpg", "profile_link_color": "FBAF41", "profile_sidebar_border_color": "000000", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color"

That's Twitter's default JSON output for when you ask for a user. It's easy for a computer to pull that apart, but what a visual mess! And this is some very basic information; responses can get much worse! Let's try the following.
<br><br>
## getUsersTweets( ) ##
`getUsersTweets()` can get up to 200 of a user's most recent tweets. Below is an example of its use to get 25 tweets from ESPN's NBA account.<br><br>

In [6]:
the_tweets = my_connection.getUsersTweets("ESPNNBA", count=25)

______________________________
TweetID:		 1022692569057320965

User Name:	 NBA on ESPN
screen name:	 ESPNNBA
Posted:		 2018-07-27 03:58:03
Text:		 .@S10Bird  letting us know who she got 👀 https://t.co/LpYKZnURfI
Media?:	 photo
Mentions:	 S10Bird
Times RT'd:	 185
Times Fav'd:	 1127
Is Retweet:	 False
______________________________
TweetID:		 1022691079005433857

User Name:	 NBA on ESPN
screen name:	 ESPNNBA
Posted:		 2018-07-27 03:52:08
Text:		 .@KingJames and @DwyaneWade, just your ordinary basketball dads 😂 https://t.co/ud6mUE6j0g
Media?:	 video
Mentions:	 espn,KingJames,DwyaneWade
Times RT'd:	 0
Times Fav'd:	 0
Is Retweet:	 True
	Original poster:	 espn
	Times original RT'd:	 1284
	Times original Fav'd:	 9518
______________________________
TweetID:		 1022691047325888513

User Name:	 NBA on ESPN
screen name:	 ESPNNBA
Posted:		 2018-07-27 03:52:01
Text:		 Jayson is learning Mamba Mentality from Kobe himself 😎(via @jaytatum0) https://t.co/s1LuqJeDIa
Media?:	 photo
Mentions:	 espn,jaytatu

The results of this request contained a lot more than you see above. The tweets have been cleaned and reformatted to make it easier and more enjoyable to quickly see what might be relevant to your research. The cell below has the direct Twitter API call if you want to see the mess that comes back.

In [7]:
print(my_connection.TWIT_PIPE.GetUserTimeline(screen_name="ESPNNBA",count=25))

[Status(ID=1022692569057320965, ScreenName=ESPNNBA, Created=Fri Jul 27 03:58:03 +0000 2018, Text='.@S10Bird  letting us know who she got 👀 https://t.co/LpYKZnURfI'), Status(ID=1022691079005433857, ScreenName=ESPNNBA, Created=Fri Jul 27 03:52:08 +0000 2018, Text='RT @espn: .@KingJames and @DwyaneWade, just your ordinary basketball dads 😂 https://t.co/ud6mUE6j0g'), Status(ID=1022691047325888513, ScreenName=ESPNNBA, Created=Fri Jul 27 03:52:01 +0000 2018, Text='RT @espn: Jayson is learning Mamba Mentality from Kobe himself 😎(via @jaytatum0) https://t.co/s1LuqJeDIa'), Status(ID=1022686775536496641, ScreenName=ESPNNBA, Created=Fri Jul 27 03:35:02 +0000 2018, Text=".@KingJames daps up former coach Tyronn Lue at his son's basketball game. (via @overtime) https://t.co/zp7cqMJscD"), Status(ID=1022681202803716096, ScreenName=ESPNNBA, Created=Fri Jul 27 03:12:53 +0000 2018, Text='RT @SportsCenter: CP3 and Melo getting some work in 👀 (via @travellegaines) https://t.co/JxmdlHC8VR'), Status(ID=10225

<br>
## index_Tweets()##

After you've run the `getUsersTweets()` command, all of the cleaned Tweet objects are sitting in the list `the_tweets`. You can preview them in the output window beneath the command after you've run it, but if you want to save them for future analysis, you can index them using the `index_Tweets( )` command. This command creates a searchable database using the Python module `whoosh`. We'll cover how to do searches later in this lab, but first we need to get the tweets into the database so that we have something to search through!
#### Important Note ####
The searchable index is saved in the folder named `twitter_index`, which is in the same folder as this notebook. Saving it allows the index to be maintained if you close this notebook or session. Once you start indexing the first time, all new indexing will add to the same index. If you'd like to start over, you'll need to clear and reset the index. The command for that is below. All entries will be lost, so be careful!<br>
`my_connection.clear_index()`<br><br>

In [8]:
my_connection.index_Tweets(the_tweets)

<br>
This command does a lot of things behind the scenes, including running the text through the *Linguistic Inquiry and Word Count* package to create *sentiment* scores. We'll cover that topic later.<br>

An important thing to know is that this command accepts a `batch` argument to expedite retrieving the collection of tweets you just added to the index. The index pools everything you put into it, so the batch label is important if you want to track the different results from your queries. To use this functionality, simply provide a unique batch number whenever you run the command. You can then provide this number as an index search term later. Otherwise the default batch number is 1 and does not automatically increment, meaning all tweets will be in batch 1. I highly recommend noting the contents of the batches somewhere, such as in one of the cells in this notebook. <br><br>
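
One simple way to keep such notes is a small dictionary in a notebook cell. This is only a sketch of personal bookkeeping; `batch_notes` and `next_batch_number` are hypothetical names and are not part of the `twttr` module or the index.

```python
# Hypothetical notes kept by hand, separate from the index itself.
batch_notes = {
    1: "ESPNNBA timeline, 25 most recent tweets",
}

def next_batch_number(notes):
    """Return the first batch number that has not been used yet."""
    return max(notes, default=0) + 1

new_batch = next_batch_number(batch_notes)
batch_notes[new_batch] = "describe the next collection here"
print(new_batch, batch_notes)
```

The number in `new_batch` could then be passed as the `batch` argument the next time you index a collection.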


In [None]:
my_connection.index_Tweets(the_tweets, batch=2)

## tweetSearch(   ) ##
Searching by users can be a great way to get data, but you'll probably also want to search by terms, hashtags, or even URLs. To do that, you use the `tweetSearch( )` method shown here. Twitter caps the number of tweets it will return at 100, which, as you might expect, is a very small sample of all the possible tweets matching your search term(s). Twitter doesn't tell us much about how they determine the sample, but they do allow us to specify whether we want the most recent tweets, the most popular, or a mixture of both. You can specify this with the second argument below; the options are `popular`, `recent`, or `mixed`, and the default if you omit this argument is `recent`.

In [9]:
hidden_curriculum_Tweets = my_connection.tweetSearch("#HiddenCurriculum", "recent")

100 tweets found
______________________________
TweetID:		 1022845214808047616

User Name:	 Kasey J. Eickmeyer
screen name:	 kj_eickmeyer
Posted:		 2018-07-27 14:04:37
Text:		 Working on my job market materials is EMOTIONAL. Tackling #firstgen status, working-class farming upbringing, social mobility, mental health, and the #HiddenCurriculum all before noon? I need a nap.
Media?:	 None
Mentions:	 
Times RT'd:	 0
Times Fav'd:	 3
Is Retweet:	 False
______________________________
TweetID:		 1022839192487972870

User Name:	 Tom de Greef
screen name:	 tfadgreef
Posted:		 2018-07-27 13:40:41
Text:		 I didn't think my 'start-up' document for my grad students (the resources I wish *I* had when I started a PhD) would prove to be so popular, so I figured I would share them here. So, here's what I've compiled so far. #hiddencurriculum 1/12
Media?:	 None
Mentions:	 thehauer
Times RT'd:	 0
Times Fav'd:	 0
Is Retweet:	 True
	Original poster:	 thehauer
	Times original RT'd:	 2243
	Times original Fav'

	Times original Fav'd:	 122
______________________________
TweetID:		 1022573309270142978

User Name:	 Fanny Macé
screen name:	 fmace26
Posted:		 2018-07-26 20:04:10
Text:		 I didn't think my 'start-up' document for my grad students (the resources I wish *I* had when I started a PhD) would prove to be so popular, so I figured I would share them here. So, here's what I've compiled so far. #hiddencurriculum 1/12
Media?:	 None
Mentions:	 thehauer
Times RT'd:	 0
Times Fav'd:	 0
Is Retweet:	 True
	Original poster:	 thehauer
	Times original RT'd:	 2243
	Times original Fav'd:	 5940
______________________________
TweetID:		 1022572267912486918

User Name:	 Shaoming Wang
screen name:	 Shaoming_Wang
Posted:		 2018-07-26 20:00:01
Text:		 I didn't think my 'start-up' document for my grad students (the resources I wish *I* had when I started a PhD) would prove to be so popular, so I figured I would share them here. So, here's what I've compiled so far. #hiddencurriculum 1/12
Media?:	 None
Mentions:

<br>
<br>
You might notice a lot of repeated text in there. Those are all retweets, which of course count as tweets themselves.
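
If you only have the raw text that the direct API call prints, retweets can be spotted roughly by their `RT @` prefix. The sketch below is a heuristic on shortened copies of texts from this notebook; the cleaned tweets from the module already carry an explicit "Is Retweet" flag, which is the better thing to rely on.

```python
from collections import Counter

# Shortened raw texts in the style of the direct API output.
texts = [
    "Working on my job market materials is EMOTIONAL.",
    "RT @thehauer: I didn't think my",
    "RT @thehauer: I didn't think my",
]

def is_retweet(text):
    """Guess retweet status from the 'RT @' prefix of the raw text."""
    return text.startswith("RT @")

retweets = [t for t in texts if is_retweet(t)]
originals = [t for t in texts if not is_retweet(t)]

# Count how many times each retweeted text repeats.
repeat_counts = Counter(retweets)
print(len(originals), len(retweets), repeat_counts.most_common(1))
```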
<br><br>
Again, to ensure that you can look at and analyze these tweets later, you need to index them using the `index_Tweets( )` command. Below the indexing command is the Twitter API call.

In [None]:
my_connection.index_Tweets(hidden_curriculum_Tweets, batch=2)

In [10]:
print(my_connection.TWIT_PIPE.GetSearch(term="#HiddenCurriculum",result_type="recent", count=100))

[Status(ID=1022845214808047616, ScreenName=kj_eickmeyer, Created=Fri Jul 27 14:04:37 +0000 2018, Text='Working on my job market materials is EMOTIONAL. Tackling #firstgen status, working-class farming upbringing, social mobility, mental health, and the #HiddenCurriculum all before noon? I need a nap.'), Status(ID=1022839192487972870, ScreenName=tfadgreef, Created=Fri Jul 27 13:40:41 +0000 2018, Text='RT @thehauer: I didn\'t think my "start-up" document for my grad students (the resources I wish *I* had when I started a PhD) would prove to…'), Status(ID=1022832768307396609, ScreenName=janice39577229, Created=Fri Jul 27 13:15:09 +0000 2018, Text='RT @thehauer: I didn\'t think my "start-up" document for my grad students (the resources I wish *I* had when I started a PhD) would prove to…'), Status(ID=1022825673440808960, ScreenName=craigfleisher, Created=Fri Jul 27 12:46:58 +0000 2018, Text='RT @AM_Headley: The #hiddencurriculum discussion hit on so many of my own memories during these pas

<br>
## Search the index ##
<br>
The `tweetSearch( )` command searches Twitter for tweets related to the term you supplied, but there will be unexplored context in those results. To explore that context, we index the tweets (metadata and all) and then search within our database. You use the `search( )` command to do this. `search( )` accepts arguments for any combination of fields stored in the index. For this particular index, those terms are as follows:

  + `tweet_id`        - The unique integer that identifies this tweet
  + `owner_name`      - The displayed name of the original poster (e.g. Chris Rock)
  + `owner_sn`        - The screen name of the original poster (e.g. chrisrock)
  + `owner_id`        - The Twitter ID of the original poster (e.g. 238319766)
  + `batch`           - The batch number that you've been specifying. (default=1)
  + `text`            - The content of the tweet itself
  + `posted`          - The time the tweet was posted (GMT +0:00)
  + `isRT`            - (True/False) Whether the tweet was a retweet
  + `timesRT`         - The number of times the tweet was retweeted
  + `timesFav`        - The number of times the tweet was favorited
  + `hashtags`        - All the hashtags appearing in the tweet
  + `mentions`        - All the mentions in the tweet
  + `media`           - The included media type, if applicable
  
These arguments are entered all together in a single string, a series of characters within double quotation marks. We'll go over the details of the syntax shortly.
  
In addition to these index arguments, you can search by a period of the day using the regular `time_slice` argument. The argument looks like this: `["09:00","17:00"]`. The period starts at the first time and ends at the second, and it is important that there are two digits on each side of the colon. If the starting time is later than the ending time, it searches past midnight and into the next day.
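
The past-midnight behavior can be pictured with a few lines of plain Python. This is a sketch of the general wrap-around logic, not the module's actual implementation; the function name `in_time_slice` and the choice to include both endpoints are assumptions.

```python
from datetime import time

def in_time_slice(moment, start, end):
    """Return True if `moment` falls inside the daily window from start to end."""
    if start <= end:
        # Ordinary window within one day, e.g. 09:00 to 17:00.
        return start <= moment <= end
    # Window that wraps past midnight, e.g. 22:00 to 03:00.
    return moment >= start or moment <= end

late_night = (time(22, 0), time(3, 0))
print(in_time_slice(time(23, 30), *late_night))
print(in_time_slice(time(12, 0), *late_night))
```

The first check falls inside the late-night window and the second does not, even though noon lies numerically between 03:00 and 22:00.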

You can also limit the number of tweets returned using the regular `limit` argument. The default is 100.

Finally, you can save your search results for viewing as an HTML page later using the `saveAs` argument. The default name is `search`.
<br>
Below are some examples of proper syntax. All of these variants can be combined into a single command but are in separate commands here for visual clarity.

<br><br><br>
Here is an example looking at the hashtags and posting times. We'll address the data the command returns later in the lab.

In [11]:
results, data, display, tweet_list = my_connection.search("hashtags:HiddenCurriculum posted:'July 10 to July 28'")

search terms (hashtags:HiddenCurriculum AND posted:[63666777600000000 TO 63668419199999999])
32 search results
------------------------------
Jess Calarco  :  If you're following the #HiddenCurriculum discussion, or if you're a grad student or thinking about grad school, check out my new blog post on the challenges of figuring out stuff you're never taught but still supposed to know.
https://t.co/EI4rl8G59o

(and thanks @asociologist!)

May Helena Plumb  :  If you're following the #HiddenCurriculum discussion, or if you're a grad student or thinking about grad school, check out my new blog post on the challenges of figuring out stuff you're never taught but still supposed to know.
https://t.co/EI4rl8G59o

(and thanks @asociologist!)

Jess Calarco  :  If you're following the #HiddenCurriculum discussion, or if you're a grad student or thinking about grad school, check out my new blog post on the challenges of figuring out stuff you're never taught but still supposed to know.
https://t.c

Note the use of multiple quotation marks. **Any argument that appears in the bulleted list above must appear within the double quotation marks**. There are also single quotation marks for the values that follow the key terms. Also note that there is no comma between key terms.
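
The two long numbers in the printed search terms for the `posted` range look opaque. One reading that fits them, and it is my assumption about how the index stores dates rather than something documented here, is a count of microseconds since the start of year 1. The sketch below converts them back; `from_index_number` is a made-up helper name.

```python
from datetime import datetime, timedelta

def from_index_number(microseconds):
    """Convert a microsecond count since 0001-01-01 into a datetime."""
    return datetime(1, 1, 1) + timedelta(microseconds=microseconds)

# The two bounds printed by the search for 'July 10 to July 28'.
start = from_index_number(63666777600000000)
end = from_index_number(63668419199999999)
print(start)
print(end)
```

Under that assumption the range runs from the first instant of July 10, 2018 to the last microsecond of July 28, 2018, which matches the query.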

In [12]:
results, data, display, tweet_list = my_connection.search("'academic' isRT:'True'", saveAs="HiddenCurric")

search terms (content:academic AND isRT:True)
2 search results
------------------------------
Dr. Michelle Rodrigues  :  Important advice! #CCGSgrad was just discussing how there is a #HiddenCurriculum for academic faculty too. This helps to address that! https://t.co/cExWm853IC

Dr Fish Philosopher🐟  :  Important advice! #CCGSgrad was just discussing how there is a #HiddenCurriculum for academic faculty too. This helps to address that! https://t.co/cExWm853IC



The *timesRT* and *timesFav* key terms allow you to find the most popular tweets (or the least, or in between).  Whether there is an important difference between favoriting and retweeting from the perspective of users is not clear to me, but you have both at your disposal. <br><br>
Here are additional examples of search queries.


In [None]:
results, data, display, tweet_list = my_connection.search("content:'hidden' mentions:'JessicaCalarco'", limit=5)

This next one gets up to 100 tweets made between the hours of 10pm and 3am and saves the results in a file named LateNight_tweets.html in the folder this notebook is in.

In [None]:
results, data, display, tweet_list = my_connection.search("batch:'2' posted:'Jul 1 to Jul 29'", time_slice=["22:00","03:00"], saveAs="LateNight_tweets")

The query below searches the text of the tweets for the variations of a phrase.

In [None]:
results, data, display, tweet_list = my_connection.search("'Hidden' OR 'HiddenCurriculum' OR '#HiddenCurriculum'",saveAs="Curriculum")

This next one searches for tweets that are either tagged with a search term or mention a user's official Twitter account.

In [None]:
results, data, display, tweet_list = my_connection.search("hashtags:'HiddenCurriculum' OR mentions:'JessicaCalarco'")

#### Remember #### 
All of these search terms can be used simultaneously. Once you get the hang of it, you can do some very powerful sorting. The syntax can be finicky so if you get 0 results and you're expecting some, make sure there aren't extra commas or spaces and that the quotation marks are always in pairs.
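
Because stray commas and unpaired quotation marks are the usual culprits, it can help to let a small function assemble the query string. This is a sketch; `build_query` is a hypothetical helper of my own and is not part of the `twttr` module.

```python
def build_query(*terms, joiner=" ", **fields):
    """Assemble a search string such as "hashtags:'X' isRT:'True'"."""
    # Bare terms are wrapped in single quotation marks.
    parts = [f"'{term}'" for term in terms]
    # Key terms are written as key:'value', again with single quotation marks.
    parts += [f"{key}:'{value}'" for key, value in fields.items()]
    return joiner.join(parts)

query = build_query("academic", isRT="True")
either = build_query(joiner=" OR ", hashtags="HiddenCurriculum", mentions="JessicaCalarco")
print(query)
print(either)
```

Either string could then be handed to `my_connection.search(...)` in place of a hand-typed one.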

#### What search( ) does####
Running the search command does a lot of things behind the scenes. It packages information up for you in the terms that appear on the left-hand side of the equals sign. 
The first term is named `results`, but you can name it whatever you'd like by replacing `results` with your term before running the command. (You can do this for all of the terms, by the way.) `results` is an object that can be used to display the averages of the individual LIWC values for the whole group of tweets. 
`data` is the raw information that can be used to find differences between LIWC averages from different search results, a topic covered below. 
`display` can be used to view the individual tweets in the search results in an HTML table. 
Finally, `tweet_list` is a list of the Tweet IDs for the tweets in the search results.
<br>
<br>
The other thing the command does is produce an HTML file in which each Tweet is represented as a circle sized in proportion to the number of times it was retweeted. An example can be found [here.](files/tools/tweetpack.html) Your search results show up in the same folder as this notebook, under the name you specified with the `saveAs` argument.

<br><br>
## Inspecting Retweets##
What you've been collecting and looking at up until now are just a handful of tweets, but you might really want to look at what happens to a tweet after the original poster submitted it. Because of Twitter's rate-limiting policies, this can be a time-consuming task, and accordingly the indexing command doesn't collect those data immediately. Rather, you need to tell the index which ones to collect using the `inspectRetweets( )` command. The argument is just a list of tweet IDs, most likely obtained from the `search( )` command above.

In [None]:
ind = min(5,len(tweet_list))

my_connection.inspectRetweets(tweet_list[:ind])

This command can take a long time to run, and the amount of time depends on two things. The first is the number of tweets; the more you have in the list argument, the longer it will take. The second is whether or not the tweets are from the same person. The code looks up the poster's friends and followers to determine the relationship the person has to the retweeters and saves the data. This process is slow (in the world of computers, that is), so if you are looking at posts from a bunch of people, it can take a long time. So if you have a stack of tweets from Oprah, it might take a couple of minutes. But if you have the most popular Tweets about the latest Star Wars movie, you could get dinner and it might not be finished.
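Since the running time grows with the length of the list, one way to keep things manageable is to inspect the tweets a few at a time. The sketch below shows a small generic helper for that. `batches` is a made-up name, the IDs are placeholders, and the call to `inspectRetweets( )` is left as a comment because it needs a live connection.

```python
def batches(items, size):
    """Yield consecutive slices of items, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical tweet IDs standing in for a real tweet_list from search( ).
example_ids = [101, 102, 103, 104, 105, 106, 107]

for batch in batches(example_ids, 3):
    print(batch)
    # In the notebook, this is where you would call:
    # my_connection.inspectRetweets(batch)
```

Working in small batches also means that if you interrupt the notebook partway through, you know which tweets have already been handled.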
<br><br>
### What Inspecting Retweets Does###
Inspecting the retweets looks at who retweeted the original, when they did and the nature of their relationship to the original poster. It also summarizes these data and visualizes the network of the retweets in a webpage reachable by clicking on the tweet in the search interface. The circle representing the tweet will turn blue if you've inspected it.  You can then click on it and be taken to the page with more details. 

<br>
This page includes a visualization of the network in which the length of the ties is proportional to the number of hours after the original tweet that the retweet was posted. The nodes are also colored according to the retweeter's relationship with the original poster. An important thing to note about these data is that Twitter limits the information to a random sampling of 100 of the retweets. So while the tweet might have 2000 retweets, you'll see at most 101 nodes in this network. For the purpose of this lab, you may consider this a representative sample and conduct your analysis using these data.
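To make the tie lengths concrete, here is a minimal sketch of how the number of hours between an original tweet and each retweet could be computed. The timestamps and user names are invented, and this is not the code the lab's tools actually run. It only illustrates the arithmetic behind the picture.

```python
from datetime import datetime

# Hypothetical data: when the original was posted and when three users retweeted it.
original_time = datetime(2018, 7, 1, 9, 0)
retweets = {
    "user_a": datetime(2018, 7, 1, 9, 30),
    "user_b": datetime(2018, 7, 1, 15, 0),
    "user_c": datetime(2018, 7, 2, 9, 0),
}

def hours_after(original, later):
    """Return the number of hours between two datetimes."""
    return (later - original).total_seconds() / 3600

# One tie per retweet, running from the original tweet to the retweeter.
edges = [("original", user, hours_after(original_time, when))
         for user, when in retweets.items()]

for source, target, hours in edges:
    print(f"{source} -> {target}: {hours:.1f} hours")

# At most 100 sampled retweets plus the original tweet gives at most 101 nodes.
MAX_SAMPLED_RETWEETS = 100
max_nodes = MAX_SAMPLED_RETWEETS + 1
```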

## Backtracking: Viewing Results ##

When you run a search, you can see the titles of the posts that the query found, but it is actually spitting back much more information. We can look at that information using the data objects `results`, `data`, and `display`.
<br>
While it appears last, let's start with `display`. It contains all of the raw information from the posts returned by the query. To make it easier to look at, it is packaged up as HTML that we can view by running the command below.

In [13]:
HTML(display)

| Category | Percent | Category | Percent | Category | Percent | Category | Percent | Category | Percent | Category | Percent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WC | 25.0 | WPS | 6.25 | Sixltr | 0.28 | Dic | 0.72 | funct | 0.44 | pronoun | 0.08 |
| ipron | 0.08 | article | 0.04 | verb | 0.12 | auxverb | 0.08 | past | 0.04 | present | 0.08 |
| adverb | 0.16 | preps | 0.08 | conj | 0.04 | social | 0.12 | affect | 0.08 | posemo | 0.08 |



A couple of things to notice about this:

1) That is a lot of information in our poor little Notebook. If you'd rather not look at it here, don't run the command and just go find the file named 'search_results.html' in your CL folder. You can even just keep that file open in your browser and refresh it after new searches. Doing so will make your notebook easier to use, and the HTML also renders better there.

2) There is a big table at the bottom of each post. It contains the textual analysis of the post using the LIWC dictionary, including basic statistics like word counts (WC) and the number of pronouns but also the frequency of words related to positive emotions (posemo) and tentativeness (tentat). The full names and categories for the terms are available on the LIWC site [here](http://www.liwc.net/descriptiontable1.php).

3) The tables aren't the same for each post. That's because categories that scored zero for a post have been omitted in order to condense the information. 

4) These tables are really good for exploring patterns worth looking into more, but chances are you aren't going to want to find averages for these tables by hand. That's what the `results` object does for you.<br>

## Viewing Group Statistics##
The *results* object contains useful information about that group of search results, namely the average, the standard deviation, the maximum value and the minimum value for each of the LIWC categories. You can display it as HTML using the command below or via the file *search_averages.html*.

In [None]:
HTML(results)
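If you're curious what goes into a summary like that, the sketch below computes the same four statistics for a couple of LIWC categories using Python's built-in `statistics` module. The scores are invented, and the details are assumptions. In particular, whether the lab's tools use the sample or the population standard deviation isn't stated, so `stdev` (the sample version) is used here only for illustration.

```python
from statistics import mean, stdev

# Hypothetical per-tweet LIWC scores, one list per category.
scores = {
    "posemo": [0.08, 0.00, 0.12, 0.04],
    "social": [0.12, 0.10, 0.00, 0.06],
}

def summarize(values):
    """Return the average, sample standard deviation, maximum and minimum of a list."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "max": max(values),
        "min": min(values),
    }

summary = {category: summarize(values) for category, values in scores.items()}

for category, stats in summary.items():
    print(category, {name: round(value, 3) for name, value in stats.items()})
```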

Finally, the `data` object contains the raw numbers of the results. These can be used to compare different search groups and create smaller tables of more relevant data. To do that, you need to run two different searches and give the outputs different names, as shown in the next two lines.

In [None]:
results_1, data_1, display_1, tweet_list_1 = my_connection.search("batch:2 content:'women'", saveAs="morning")

In [None]:
results_2, data_2, display_2, tweet_list_2 = my_connection.search("batch:2 content:'men'", saveAs="afternoon")

You then put the two data objects, `data_1` and `data_2`, into the `LIWC_differences( )` command to get a table back. You can then display it by passing it to the `HTML( )` command. The two lines below do it for the searches above.

In [None]:
diffs = my_connection.LIWC_differences(data_1, data_2)

In [None]:
HTML(diffs) 

Finally, if you'd like to trim this big table to just the relevant information, run the `LIWC_differences_subset( )` command with a list of the terms you want. This is a handy thing for producing actual results to be included somewhere. 

In [None]:
subset = my_connection.LIWC_differences_subset(data_1, data_2, ['WC', 'posemo', 'negemo', 'swear', 'anger', 'affect'])

In [None]:
HTML(subset)
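The internals of `LIWC_differences( )` and `LIWC_differences_subset( )` aren't shown in this notebook. One plausible reading is that they compare the category averages of the two groups, and the sketch below illustrates that idea with invented numbers. The function `mean_differences` and both data dictionaries are hypothetical stand-ins, not the lab's actual code or data format.

```python
from statistics import mean

# Hypothetical per-tweet LIWC scores for two search groups.
group_1 = {"WC": [25, 31, 18], "posemo": [0.08, 0.12, 0.04], "anger": [0.00, 0.04, 0.02]}
group_2 = {"WC": [22, 20, 27], "posemo": [0.02, 0.06, 0.04], "anger": [0.06, 0.02, 0.04]}

def mean_differences(a, b, categories=None):
    """Return mean(a) - mean(b) per shared category, optionally limited to `categories`."""
    shared = [c for c in a if c in b]
    if categories is not None:
        shared = [c for c in shared if c in categories]
    return {c: mean(a[c]) - mean(b[c]) for c in shared}

# Full comparison across every category both groups have.
print(mean_differences(group_1, group_2))

# Trimmed comparison, analogous to passing a list of terms to the subset command.
print(mean_differences(group_1, group_2, categories=["posemo", "anger"]))
```

A positive number means the first group scored higher on average in that category, and a negative number means the second group did.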