### Construction time, again ###

<img src=http://data.gdeltproject.org/blog/gkg2-feb-july-2015-global-map/GDELT-Feb-July2015-Final.png width=700>

**Text as data**

Today we are going to work in groups to architect and partially implement a system that uses a fair bit of what we've been looking at over the last two weeks. With data collection that ranges from simple downloads of CSVs to APIs to web scraping, we have covered a fair bit of data collection strategies. (From here we are going to transition to "analysis" that goes deeper than simple graphics -- we will have a couple lectures on machine learning, as a topic to report on as well as with.)

Our starting point today will be projects that draw from a continual feed of data and search for signals. Our data feed will be close to home -- the news. A few of your projects are looking at activities over time. Working with news stories as a data source can be difficult because, well, they are stories. They are made up of words and words are not like the kind of data we have seen so far. They are not directly a number of something, or a measure of something. To make things a little easier, news sources often come with metadata like keywords that distill the topic of the article, the main people or places involved. 

Beyond simple names, sometimes particular patterns of language can be used to identify important events. Phrases like "police involved shootings" helped projects like [Killed by Police](https://killedbypolice.net/) or [Fatal Encounters](https://www.fatalencounters.org/) find incidents in which people were killed by police officers. As background, the number of such deaths was not well known, because the official counts relied on voluntary reporting to either the FBI or the CDC; news crawls produced numbers that were typically twice as high as the official statistics. (These projects were consolidated and extended in The Guardian's [The Counted](https://www.theguardian.com/us-news/series/counted-us-police-killings).)

<img src=http://4.bp.blogspot.com/-No2377MRFYU/VW9pszU2b0I/AAAAAAAAg5Q/hhWKOKg8DCw/s420/guardian-the_counted-1018.jpg width=500>

Digging deeper, news sources can be used for a variety of meta-reporting. The (more obscure) [Mass Mobilization in Autocracies Database](https://mmadatabase.org/)

>... contains sub-national data on mass mobilization events in autocracies worldwide. It includes both instances of anti- and pro-regime protest at the level of cities with daily resolution. The data is coded based on news reports obtained from AP, the AFP and BBC Monitoring. The main database contains information at the level of reports, which means that every mention of a political protest in a news report constitutes a new entry in the database. 

They use a "machine learning" approach to figure out the details of a protest.

>For example, the sentence  “A group of 300 people rallied in Bishek on December 4 to protest the ban of the ruling party” contains information about the date, location, number of protesters, and the issue of the protest, all of which should eventually be recorded in the protest database.

The better your machine learning, the more ambitious your tasks. For example, Eric Horvitz from Microsoft Research published a piece that attempted to predict events given collections of news stories in the past. So after a hurricane, what conditions do you start to see that might be predictive of a cholera outbreak? Patterns of events in the past tell you something about what might happen in a new incident, allowing you to alert emergency workers. His paper is pretty accessible; [give it a skim](http://erichorvitz.com/future_news_wsdm.pdf).

<img src="http://gigaom2.files.wordpress.com/2013/02/screen-shot-2013-02-01-at-9-53-21-am.png?w=400">

So now you see the complexities and the potential of working with news stories as data. In some sense the granddaddy of all of these efforts is [GDELT, the Global Database of Events, Language, and Tone](https://www.gdeltproject.org/). It is a massive project that has always slightly mystified me. It says that it

>... monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

The map at the top of the notebook indicates all the places GDELT has seen since its inception. OK, let's try this system out. They've started looking specifically at coverage of the 2020 candidates. They have an overview of their plans [here](https://blog.gdeltproject.org/campaign-2020-getting-started-with-gdelt-for-tracking-the-us-presidential-race/). See their online news and television coverage, in particular. The former skims the online sites they are monitoring and produces the number of mentions, the "tone", and their prominence on the site. They have a new API to pull this information and they give examples [here](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/). In particular, you can see [the tone used in stories about Trump](https://api.gdeltproject.org/api/v2/doc/doc?query=%22donald%20trump%22&mode=tonechart) as a chart or [as a json string](https://api.gdeltproject.org/api/v2/doc/doc?query=%22donald%20trump%22&mode=tonechart&format=json).

Now, using the `requests` library, pull the JSON string from the tone around Donald Trump and print out the headlines. Replace the query with `buttigieg` and see what you get.
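
As a starting point for the exercise, here is a minimal sketch of building the request URL instead of typing the percent-encoding by hand. The helper name `gdelt_tone_url` is hypothetical, and the `format=json` parameter is an assumption about the DOC 2.0 API that you should check against the GDELT post linked above. Handing the URL to `requests` and digging the headlines out of the result is still up to you.

```python
from urllib.parse import urlencode, quote

# base address taken from the links above
GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def gdelt_tone_url(query, fmt="json"):
    """Build a tonechart URL for a search phrase (hypothetical helper)."""
    # "format" is an assumed parameter name; confirm it in the API post
    params = {"query": query, "mode": "tonechart", "format": fmt}
    # quote_via=quote encodes spaces as %20, matching the links above
    return GDELT_DOC_API + "?" + urlencode(params, quote_via=quote)

print(gdelt_tone_url('"donald trump"'))
print(gdelt_tone_url("buttigieg"))
```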

In [None]:
# your work here



**All the news that's**

How is such a system created? What do we need to make this cook? First, we need a source of news. Or maybe a source of news sources. We have seen RSS already for individual sources and we know people who have built aggregators like Digg, ahem. There is also a lovely service called [News API](https://newsapi.org/). It provides simple searching access to a large number of news services. OK we could go bigger and try for more, but let's start here. 

Below we have the `requests` command that invokes their API. Sign up for your own API Key! The API has several "endpoints" -- from breaking news to "everything," a full search that you can limit in time. We will use the latter and look for articles on `buttigieg`. The different parameters for the API are spelled out [here](https://newsapi.org/docs). We want articles from the first of April, we want them sorted from most recent to oldest and we want 100 per page (that's the most we can ask for according to the documentation).

Here's the request. 

In [None]:
from requests import get

url = ('https://newsapi.org/v2/everything?'
       'q=buttigieg&'
       'from=2019-04-01&'
       'pageSize=100&'
       'page=1&'
       'sortBy=publishedAt&'
       'apiKey=9b49997ed4274749871f14355ec9cd3f')

response = get(url)

And here's what it looks like.

In [None]:
from pprint import pprint
pprint(response.json())

Use your awesome JSON skills to create a list called `articles` that consists of the different articles returned by this API call. Then, iterate through each and print out the `source` of each and its publication date/time. 
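
If the nesting is hard to see in the full printout, here is a sketch on a tiny hand-made dictionary shaped like the response. The two articles are invented for illustration, and the key names (`articles`, `source`, `name`, `publishedAt`) are what we would expect from the printout above; confirm them against your own `pprint` output before relying on them.

```python
# a hand-made stand-in for response.json(); both articles are invented
toy_response = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"source": {"id": None, "name": "Example Times"},
         "publishedAt": "2019-04-08T12:00:00Z",
         "description": "A made-up description."},
        {"source": {"id": None, "name": "Sample Post"},
         "publishedAt": "2019-04-07T09:30:00Z",
         "description": "Another made-up description."},
    ],
}

# the list of articles sits under one key of the outer dictionary
toy_articles = toy_response["articles"]

# each article is itself a dictionary, and its source is a nested dictionary
for article in toy_articles:
    print(article["source"]["name"], article["publishedAt"])
```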

In [None]:
# your code here



Notice that the response object has a key called `totalResults` that you can use to sift back if your query returns more than 100 results. By setting the parameter `page` to 2, 3, 4... you can work your way backwards. 
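
A small sketch of the arithmetic behind paging, assuming 100 results per page as in our request. The function names are hypothetical, and any cap your API plan places on how far back you can page is not modeled here.

```python
from math import ceil

def pages_needed(total_results, page_size=100):
    """How many pages it takes to see every result."""
    return ceil(total_results / page_size)

def page_numbers(total_results, page_size=100):
    """The values of the `page` parameter to request, in order."""
    return list(range(1, pages_needed(total_results, page_size) + 1))

# a query reporting 250 results needs pages 1, 2 and 3
print(page_numbers(250))
```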

**Tone (deaf)**

Now, let's consider how we might estimate the tone of the content for each of these articles. GDELT loves loves loves tone. Without getting too in the weeds, you might also see it called sentiment analysis -- "for" or "against", "positive" or "negative"? How might we do this? First thing, we need to make the contents of each article a little more actionable. For this we have an object we saw before briefly called [TextBlob](https://textblob.readthedocs.io/en/dev/) that is a simplified version of the Natural Language Toolkit in Python. (Sometimes tools become really powerful for practitioners and leave non-experts behind. That's what has happened, to some extent, with the NLTK. It's a little hard to just "jump in". And so TextBlob is like computational training wheels.) [Allison Parrish's Natural Language Basics with TextBlob](http://rwet.decontextualize.com/book/textblob/) is a great place to read about what TextBlob is good for. 

First, we need to install the package. Off to PIP!

In [None]:
%%sh
pip install TextBlob

In [None]:
from nltk import download
download('brown')
download('punkt')
download('maxent_ne_chunker')
download('words')
download('conll2000')
download('maxent_treebank_pos_tagger')
download('averaged_perceptron_tagger')

We now take the `description` from the first article from our search and use it to create a `TextBlob` object. Remember, our articles are in a list called `articles`, each element being a dictionary with keys describing the article.

In [None]:
from textblob import TextBlob

tb = TextBlob(articles[0]["description"])
type(tb)

In [None]:
tb

The TextBlob object has a number of attributes that hold processed versions of the text. The simplest are lists of words and sentences. Here we pull just the words.

In [None]:
tb.words

This is obviously a better approach than the one we took when we just split a string on spaces -- a technique that didn't handle punctuation like commas and periods well. OK that's a good trick but there are better ones! For example, TextBlob's language processing lets it estimate which words are part of noun phrases.

There are various techniques for doing this and none of them are perfect. To be fair, a short description, like a headline, is often a text fragment and not a full sentence. The language processing tools are usually trained on full sentences of text. Still, it's not bad.

In [None]:
tb.noun_phrases

Noun phrases are obtained by extracting information from a "tagged" version of the text. Here the tags represent parts of speech. You can see [a complete list of the tags here.](https://cs.nyu.edu/grishman/jet/guide/PennPOS.html) The parts of speech are stored as a list of word-tag pairs.

In [None]:
tb.tags

In [None]:
type(tb.tags[0])

The `.tags` attribute is a list. (See the square brackets?) The list elements are a new data type called a "tuple" which is like a list, for our purposes. So you can take, say, the first element of the tags list and look at the first and second elements of the tuple (the word and its estimated part of speech).
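
Here is the same idea on a hand-made list, so you can see tuples without TextBlob in the way. The word-tag pairs are typed in by hand for illustration (they are not TextBlob output), and the sketch adds one trick the cells below don't show: unpacking a tuple into two names at once.

```python
# a hand-made list of (word, tag) tuples, shaped like tb.tags
toy_tags = [("Pete", "NNP"), ("runs", "VBZ"), ("fast", "RB")]

# index into the list, then into the tuple
first_pair = toy_tags[0]
print(first_pair[0], first_pair[1])

# or unpack both pieces of the tuple in one step
word, tag = first_pair

# unpacking also works inside a loop or a comprehension;
# here we keep the words whose tag starts with NN (the noun tags)
nouns = [w for w, t in toy_tags if t.startswith("NN")]
print(nouns)
```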

In [None]:
tb.tags[0]

In [None]:
tb.tags[0][0]

In [None]:
tb.tags[0][1]

Now, tone. TextBlob also provides an estimate of the sentiment of the statement. That is, is the text expressing a positive or negative sentiment? I'll leave you to consult the Parrish blog post or the TextBlob documentation for the details of this lovely feature.

In [None]:
tb.sentiment

The sentiment here was computed from the `description` sentence. In general, these scores indicate the following.  

1. polarity: negative vs. positive    (-1.0 => +1.0)
2. subjectivity: objective vs. subjective (+0.0 => +1.0)

How might this be done? Well, TextBlob is a little quiet on the subject, but you can Google about and find [explanations like this](https://planspace.org/20150607-textblob_sentiment/). It is using a simple scoring method that assigns values to words. It's smart enough to flip the polarity when a "not" comes before a word, as in "not amazing". You can see [the scores used by TextBlob here.](https://github.com/sloria/TextBlob/blob/eb08c120d364e908646731d60b4e4c6c1712ff63/textblob/en/en-sentiment.xml)
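
To make the word-scoring idea concrete, here is a toy sketch. It is not TextBlob's implementation: the three scores in `TOY_POLARITY` are made up, the function name is hypothetical, and the negation rule is the simplest one imaginable (flip the sign when the previous word is "not"). The real lexicon and rules differ in their details.

```python
# invented scores for illustration; TextBlob's lexicon is far larger
TOY_POLARITY = {"amazing": 0.6, "good": 0.7, "terrible": -1.0}

def toy_polarity(text):
    """Average the scores of known words, flipping the sign after 'not'."""
    words = text.lower().split()
    scores = []
    for i, w in enumerate(words):
        if w in TOY_POLARITY:
            s = TOY_POLARITY[w]
            # simplistic negation: look one word back
            if i > 0 and words[i - 1] == "not":
                s = -s
            scores.append(s)
    # no known words means no opinion either way
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("amazing"))
print(toy_polarity("not amazing"))
```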

One last thing. There are various methods to "parse" text -- different algorithms for tagging words in a sentence, for extracting noun phrases and for estimating sentiment. You can replace the default when you call TextBlob. The documentation describes other noun phrase extractors. Here's how you would use the ConllExtractor, based on a data set compiled for the Conference on Computational Natural Language Learning (CoNLL-2000).

In [None]:
from textblob.np_extractors import ConllExtractor
extractor = ConllExtractor()

# reuse the description of the first article as our example text
tb = TextBlob(articles[0]["description"], np_extractor=extractor)
tb.noun_phrases

**A tone of one's own**

Let's now try a tone of our own. In a previous drill we worked with data from a group at the University of Vermont who wanted to judge the "happiness" of social media over time by looking at how "happy" each tweet or Facebook post was. They scored the text using a happiness dictionary where each of 10,000 words was assigned a numerical score (0 meant depressed, 10 meant giggly). The paper was called [The Geography of Happiness](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) and you should have a look.

While they went on to score entire sentences or blog posts, we're going to just try single words. The data were published by the authors as a CSV. The data are on our GitHub page and you can either [download them here](https://github.com/computationaljournalism/columbia2019/raw/master/data/happy.csv) and place them in the folder with this notebook, or you can read the CSV directly from GitHub.

In [None]:
from pandas import read_csv

happy = read_csv("https://github.com/computationaljournalism/columbia2019/raw/master/data/happy.csv")
happy.head()

Now, we will use their same scoring procedure. We will run through the description of the first article and give it a sentiment represented by the average of the happiness scores in `happy`. To do this efficiently, you want to "look up" the score associated with a word... hmm, seems like a dictionary not a dataframe, but you decide.

So given the `description` of the first article, create a sentiment score. Then, do it for each of the 100 articles you have in `articles`.
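
If you go the dictionary route, here is one possible shape for the lookup, sketched on a four-row CSV typed in by hand. The column names `word` and `happiness` and all four scores are assumptions made for illustration; check `happy.head()` for the real names. Splitting on spaces keeps the sketch short, though you already know a better way to get words.

```python
from csv import DictReader
from io import StringIO

# a tiny in-memory CSV standing in for happy.csv (invented rows)
toy_csv = StringIO("word,happiness\nlaughter,8.5\nlove,8.4\nthe,5.0\nwar,1.8\n")

# build the lookup: word -> score
toy_scores = {row["word"]: float(row["happiness"]) for row in DictReader(toy_csv)}

def average_happiness(text, scores):
    """Average the scores of the words we recognize; None if there are none."""
    words = text.lower().split()
    found = [scores[w] for w in words if w in scores]
    if not found:
        return None
    return sum(found) / len(found)

print(average_happiness("Love the laughter", toy_scores))
```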

In [None]:
# Your work here



These researchers have made daily timeseries of average happiness scores across a random sample of tweets available in an [interactive display](http://hedonometer.org/index.html) and in a time series. You can grab the data [here](http://hedonometer.org/data/word-vectors/vacc/sumhapps.csv). They also have [an API that you can dust off.](http://hedonometer.org/legacyapi.html)

Tell me about the daily fluctuations. What do you see? 

In [None]:
# Your code here



**Spooky ending**

Last year, we had Michal Kosinski  join us to talk about Cambridge Analytica. His work is at the center of that story. He is now a faculty member at Stanford. We are using the Cambridge Analytica story as an opportunity to think critically about algorithms. The happiness score was one thing but... Anyway, this means understanding something of how they work, the metaphors they draw on to create knowledge, the roles they play in society, and how we might interview them. 

While machine learning or AI or statistical modeling are becoming more important to how companies, communities, and countries operate, the area that Kosinski is involved in is particularly sensitive. We are going to end with a return to social media data.

Take a tweet. 

In [None]:
%%HTML
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Looks like Bob Mueller’s team of 13 Trump Haters &amp; Angry Democrats are illegally leaking information to the press while the Fake News Media make up their own stories with or without sources - sources no longer matter to our corrupt &amp; dishonest Mainstream Media, they are a Joke!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1114888062884954114?ref_src=twsrc%5Etfw">April 7, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

As we have seen, one way to analyze this text is to break it up into words...

> Looks like Bob Mueller’s team of 13 Trump Haters & Angry Democrats are illegally leaking information to the press

... and use those words as symbols. The presence or absence of a symbol might suggest something about the author or speaker. In 2014, Kosinski and his colleagues published a paper using this kind of lexical analysis to predict the age and gender of a person interacting with social media. Here is their paper and its abstract. It appeared in the *2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)* and is titled [Developing Age and Gender Predictive Lexica over Social Media](http://www.aclweb.org/anthology/D14-1121).

>Demographic lexica have potential for
widespread use in social science, economic,
and business applications. We derive predictive
lexica (words and weights) for age and
gender using regression and classification
models from word usage in Facebook, blog,
and Twitter data with associated demographic
labels. The lexica, made publicly available,
achieved state-of-the-art accuracy in language
based age and gender prediction over Facebook
and Twitter, and were evaluated for
generalization across social media genres as
well as in limited message situations.

You can download the data associated with the paper [from an associated web site](http://mypersonality.org/wiki/doku.php?id=download_databases). Look for "Datasets Available Without Registration" and then "Language-based predictions." What you will get is a gzipped file that unpacks into a folder with a couple of CSVs. Move them to the same folder as this notebook and open them with your favorite spreadsheet and have a look. Or use your new UNIX prowess!
<pre>
____|\
`-/    \
 (\_/)  \
 /_  _   |
 \/_/||) /
    '---' jg
</pre>

In [None]:
%%sh
head emnlp14age.csv

You have two columns, one for `term`'s and one for `weight`'s. The paper describes how to use these weights. The data are better described [at this web site](http://www.wwbp.org/lexica.html) and at the bottom of the page you can "expand" to see how to apply the technique. Let's not wonder for the moment how these numbers were calculated, but instead see what kind of story they might tell.

First off, let's focus on `emnlp14age.csv`. Tell me, how would you use these word-weight pairs?

Now let's try it out. We are not going to use Pandas here but give you a little reminder about how to work with dictionaries and such. So let's open up the file and use a CSV `reader` object. We skim off the header...

In [None]:
from csv import reader

data = reader(open("emnlp14age.csv"))
d = next(data)
print(d)

... and then read off the `_intercept`. It is the start of the process. Given a sentence or document, the estimate of someone's age is formed by simple additions to this number.

In [None]:
d = next(data)
print(d)

In [None]:
intercept = float(d[1])

The easiest data structure for this exercise is a dictionary. We want numbers that are associated with words. So looking things up by word makes the most sense. We could torture Pandas into it, but this is much cleaner.

We will create a dictionary called `age` and then for each row in the data set, store the `weight` under the key `term`. 

In [None]:
age = {}
for d in data:
    age[d[0]] = float(d[1])

In [None]:
len(age)

In [None]:
age.keys()

And let's have a look at some of these. 

In [None]:
age["angry"]

In [None]:
age["fake"]

Ideally we'd like a function that takes in some text and produces the estimate of the person's age. How should we do this? Take a second and write out the steps.

.

.

.

At the heart of any procedure will be simply checking to see if a word in the text is contained in the dictionary `age` or not. Here we check for "haters."

In [None]:
word = "haters"

if word in age:
    print(age[word])

In [None]:
words = ["everybody","is","now","acknowledging","that","right","from","the",
         "time","i","announced","my","run","for","president","i","was","100","correct",
         "on","the","border","remember","the","heat","i","took","democrats","should","now",
         "get","rid","of","the","loopholes","the","border","is","being","fixed",
         "mexico","will","not","let","people","through"] 

# initialize
score = 0
n = 0

# run through the words
for word in words:

    if word in age:
        score += age[word]
        n += 1

# if no word matched, n is 0, so fall back to the intercept alone
age_est = intercept + (score/n if n else 0)
        
print("Final age estimate:",age_est)

Here we create a function that will take the text and return the age score. We use TextBlob to parse out the words, but have to make an adjustment -- TextBlob makes "don't" into "don" and "'t" which is not how Kosinski and his colleagues coded things. So instead we will remove the single quotes, as "dont" and "cant" are there. To be completely above board, we'd have to deal with this, but it involves a little more work than it's worth right now.

Here's the function.

In [None]:
def age_predict(text):
    
    # drop out single quotes
    text = text.replace("'","")
    
    # initialize 
    score = 0
    n = 0

    # create a list of lowercase words to match the terms in our "age" dictionary
    words = [w.lower() for w in TextBlob(text).words]
    print(words)

    # run through the words and create their contribution
    for word in words:
    
        if word in age:
            score += age[word]
            n += 1

    # if no word matched, n is 0, so fall back to the intercept alone
    return intercept + (score/n if n else 0)
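
In symbols, when at least one word of the text is found in the dictionary, the function above computes

$$\widehat{\text{age}} = \text{intercept} + \frac{1}{n}\sum_{i=1}^{n} \text{weight}(w_i)$$

where $w_1, \ldots, w_n$ are the words of the text that appear in `age` (repeats counted each time) and $n$ is how many of them there are. Note that this averages over matched words only; that is a description of our code, not necessarily of the paper's procedure, so consult the lexicon's own documentation for the exact normalization the authors use.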

Let's try this with Trump's tweet. Again, this algorithm gets better the more data it has to work with -- the more text of yours it sees. So one tweet won't be very accurate.

In [None]:
from textblob import TextBlob

sentence = "Everybody is now acknowledging that, right from the time I announced my run for President, I was 100% correct on the Border. Remember the heat I took? Democrats should now get rid of the loopholes. The Border is being fixed. Mexico will not let people through!"

age_predict(sentence)

How about 2 tweets, then?

In [None]:
sentence = "Everybody is now acknowledging that, right from the time I announced my run for President, I was 100% correct on the Border. Remember the heat I took? Democrats should now get rid of the loopholes. The Border is being fixed. Mexico will not let people through! Looks like Bob Mueller’s team of 13 Trump Haters &amp; Angry Democrats are illegally leaking information to the press while the Fake News Media make up their own stories with or without sources - sources no longer matter to our corrupt &amp; dishonest Mainstream Media, they are a Joke!"

age_predict(sentence)

Here we can pull the last 20 tweets from the President.

In [None]:
from tweepy import OAuthHandler, API
# setup the authentication

CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# create an object we will use to communicate with the Twitter API
api = API(auth)

tweets = api.user_timeline(screen_name="realdonaldtrump")

sentences = ""

for tweet in tweets:
    sentences = sentences + " "+tweet.text
    
print(sentences)

We should drop out things we know aren't in the `age` dictionary but are instead specific to Twitter -- basically drop the RT's and all the URLs.

In [None]:
from re import sub

# \b keeps us from deleting "RT" inside longer words; \S+ grabs the rest of each URL
sentences = sub(r"\bRT\b","",sentences)
sentences = sub(r"http\S+","",sentences)
print(sentences)

And what do we get?

In [None]:
age_predict(sentences)

We could try more data, but the authors acknowledge that social media skews young, much younger than our president. It might be hard for this algorithm to reach his age in part because the training data doesn't really include folks like Trump. Tweets are pretty ragged also, typically, so this might not be the best test. 

That said, notice the way we are working here. The presence or absence of a word is used as a weight in a model. Positive weights tend to lift the estimate higher, negative weights push the estimate of age down. We could break down and bring the CSV into pandas and see what the ranges look like...

In [None]:
from pandas import read_csv

agedf = read_csv("emnlp14age.csv")
agedf.head()

Let's have a look! Ah, plotly.

In [None]:
from plotly.plotly import iplot, sign_in
import plotly.graph_objs as go

sign_in("cocteautt","8YLww0QuMPVQ46meAMaq")

myplot_parts = [go.Histogram(x=agedf["weight"])]
mylayout = go.Layout(autosize=False, width=800,height=400,margin=go.layout.Margin(l=150,r=50,b=100,t=100,pad=4))
myfigure = go.Figure(data = myplot_parts, layout = mylayout)
iplot(myfigure,filename="age")

Which leads us to wonder about the really high and really low values... or even the values in between. Let's sort the data and have a look. Keep in mind that some words just "go together" and this kind of singleton analysis will miss those correlations. But still...
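
The same look at the extremes works on a plain dictionary, if you'd rather stay out of pandas. This is a sketch on four invented word-weight pairs; the function name `extremes` is hypothetical, and you could hand it the real `age` dictionary instead.

```python
# invented weights for illustration; swap in the real `age` dictionary
toy_weights = {"homework": -2.0, "daughter": 1.5, "the": 0.1, "lol": -1.2}

def extremes(weights, k=2):
    """Return the k lowest and k highest (word, weight) pairs."""
    # sort the (word, weight) pairs by the weight, smallest first
    ranked = sorted(weights.items(), key=lambda pair: pair[1])
    return ranked[:k], ranked[-k:]

lowest, highest = extremes(toy_weights)
print(lowest)
print(highest)
```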

In [None]:
agedf.sort_values("weight",inplace=True)
agedf.head(25)

In [None]:
agedf.tail(25)

Does that make sense? What I'm after here is the narrative content in the algorithm. Having access to its internals, we can demystify and start to talk about it more sensibly -- even when it is as easy as this lexical weighting.