# Twitter Organizations

Imagine you have a list of organization names and you would like to know what their Twitter accounts are. This notebook explores how to look each organization up on Wikipedia, find its home page, and then look on that home page for a link to its Twitter account.


## Get the Wikipedia Article

First we will create a small function to get the Wikipedia article for a given organization name. To do this we will install [requests_html](https://pypi.org/project/requests-html/), which is a nice Python library for making HTTP requests and for parsing the JSON and HTML they return.

In [1]:
! pip install requests_html

Collecting requests_html
  Downloading https://files.pythonhosted.org/packages/24/bc/a4380f09bab3a776182578ce6b2771e57259d0d4dbce178205779abdc347/requests_html-0.10.0-py3-none-any.whl
Collecting parse
  Downloading https://files.pythonhosted.org/packages/23/ea/ba7f9a5d728f7c45b298ae007b06908a5909eb5693a559ee5d70e7f84f92/parse-1.17.0.tar.gz
Collecting w3lib
  Downloading https://files.pythonhosted.org/packages/a3/59/b6b14521090e7f42669cafdb84b0ab89301a42f1f1a82fcf5856661ea3a7/w3lib-1.22.0-py2.py3-none-any.whl
Collecting pyppeteer>=0.0.14
  Downloading https://files.pythonhosted.org/packages/5d/4b/3c2aabdd1b91fa52aa9de6cde33b488b0592b4d48efb0ad9efbf71c49f5b/pyppeteer-0.2.2-py3-none-any.whl (145kB)
Collecting pyquery
  Downloading https://files.pythonhosted.org/packages/78/43/95d42e386c61cb639d1a0b94f0c0b9f0b7d6b981ad3c043a836c8b5bc68b/pyquery-1.4.1-py2.py3-none-any.whl
Collecting fake-useragent
  Downloading https://files.py

Now let's create a function that takes the organization name and queries Wikipedia's [Open Search API](https://www.mediawiki.org/wiki/API:Opensearch) to get a list of matching articles. It returns the URL of the first match, or `None` if there are no matches.

In [46]:
import requests_html

http = requests_html.HTMLSession()

def get_wikipedia_article(name):
  url = 'https://en.wikipedia.org/w/api.php'
  params = {
      "action": "opensearch",
      "limit": 1,
      "namespace": 0,
      "format": "json",
      "search": name
  }
  results = http.get(url, params=params).json()
  if len(results) == 4 and len(results[1]) > 0:
    return results[3][0]
  else:
    return None  

OK, let's test it out on *Exxon-Mobil*:

In [47]:
get_wikipedia_article('Exxon-Mobil')

'https://en.wikipedia.org/wiki/ExxonMobil'

Not bad! Let's try it with something that won't match, just to make sure it doesn't throw an error.

In [48]:
get_wikipedia_article('This is a fictitious organization name!')
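
The cell above shows no output because the function returned `None`. The `len(results) == 4` check reflects the shape of an OpenSearch response: a four-element list holding the search term, the matching titles, their descriptions, and their URLs. Here is a minimal offline sketch of that logic. The helper name `first_article_url` and the two sample responses are made up for illustration and were not captured from the live API.

```python
def first_article_url(results):
    """Return the first article URL from an OpenSearch-style response, or None."""
    # Expected shape: [search_term, [titles], [descriptions], [urls]]
    if len(results) == 4 and len(results[1]) > 0:
        return results[3][0]
    return None

# Illustrative responses only (hand-written, not real API output)
sample_hit = [
    "Exxon-Mobil",
    ["ExxonMobil"],
    [""],
    ["https://en.wikipedia.org/wiki/ExxonMobil"],
]
sample_miss = ["This is a fictitious organization name!", [], [], []]

print(first_article_url(sample_hit))   # the single URL in sample_hit
print(first_article_url(sample_miss))  # None, because the titles list is empty
```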

## Get the Official Website

Maybe there's a way to get the official website from the Wikipedia or Wikidata API. But it's pretty easy to fetch the HTML for the article and look for the link there, since it appears in a standard way.

In [131]:
import re

def get_official_website(article_url):
  doc = http.get(article_url)
  link = doc.html.find('.official-website .url a', first=True)
  if link:
    return link.attrs['href']
  
  # fall back to looking for the first "Official website" link
  for a in doc.html.find('a'):
    if re.match('official website', a.text, re.IGNORECASE):
      return a.attrs['href']

  return None

In [129]:
get_official_website('https://en.wikipedia.org/wiki/ExxonMobil')

'http://www.exxonmobil.com'

Nice!
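
For comparison, here is a rough sketch of the Wikidata route mentioned at the top of this section. It assumes the article's Wikidata item can be found through the `pageprops` query on the Wikipedia API, and that the item carries a `P856` ("official website") claim. The function names are hypothetical, and `session` is assumed to be anything with a requests-style `get(url, params=...)` method, such as the `http` session created earlier. This has not been run against the organizations in this notebook.

```python
from urllib.parse import unquote

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def parse_wikidata_id(pageprops_response):
    """Pull the Wikidata item ID (e.g. 'Q123') out of a pageprops query response."""
    pages = pageprops_response.get("query", {}).get("pages", {})
    for page in pages.values():
        item_id = page.get("pageprops", {}).get("wikibase_item")
        if item_id:
            return item_id
    return None

def parse_official_website(claims_response):
    """Return the first P856 (official website) value from a wbgetclaims response."""
    for claim in claims_response.get("claims", {}).get("P856", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None

def get_official_website_from_wikidata(article_url, session):
    # Assumption: the article title is the part of the URL after /wiki/
    title = unquote(article_url.rsplit("/wiki/", 1)[-1])
    props = session.get(WIKIPEDIA_API, params={
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "titles": title,
        "format": "json",
    }).json()
    item_id = parse_wikidata_id(props)
    if not item_id:
        return None
    claims = session.get(WIKIDATA_API, params={
        "action": "wbgetclaims",
        "entity": item_id,
        "property": "P856",
        "format": "json",
    }).json()
    return parse_official_website(claims)
```

Calling `get_official_website_from_wikidata(article_url, http)` would cost two API requests per organization, but it would not depend on the article's HTML layout.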

## Twitter Account

Now that we know the homepage for the organization we need a function that can look for a link to their Twitter account.

This is a bit more complicated because we don't know where on the page the link will be, or even whether there are multiple links. We do know that Twitter account URLs will look something like:

    https://twitter.com/{account-name}

Here *account-name* can be any sequence of letters, numbers, and underscores. Twitter used to allow you to link with a hashbang URL, which is worth looking for too even though it is deprecated:

    https://twitter.com/#!/{account-name}

So we can look at all the links in the page that match that pattern and return the account that has the most links.

In [113]:
import re
from collections import Counter

def get_twitter(url):

  # count all the accounts
  accounts = Counter()

  doc = http.get(url)
  for a in doc.html.find('a[href]'):
    # escape the dot so it only matches a literal "." in "twitter.com"
    m = re.match(r'.*twitter\.com/(#!/)?([a-z0-9_]+).*', a.attrs['href'], re.IGNORECASE)
    if m:
      accounts[m.group(2)] += 1

  if len(accounts) > 0:
    return '@' + accounts.most_common()[0][0]

In [116]:
get_twitter('http://www.exxonmobil.com')

'@exxonmobil'
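
One caveat with this pattern is that it also counts links such as `twitter.com/share` or `twitter.com/intent/tweet`, which are not accounts. Below is a sketch of a stricter matcher that works on plain `href` strings, so it can be tried without fetching a page. The names `extract_handle` and `most_linked_handle` are hypothetical. The list of reserved paths is an assumption and surely incomplete, as is the 15-character limit on handle length. Handles are lowercased before counting on the assumption that they are case-insensitive.

```python
import re
from collections import Counter

# Assumption: path segments that are Twitter features rather than accounts
RESERVED_PATHS = {"intent", "share", "home", "search", "hashtag", "i"}

# Host must be twitter.com or a subdomain; the handle must end the path segment
HANDLE_PATTERN = re.compile(
    r"(?:^|[/.])twitter\.com/(?:#!/)?@?([a-z0-9_]{1,15})(?=$|[/?#])",
    re.IGNORECASE,
)

def extract_handle(href):
    """Return the lowercased account name in a Twitter URL, or None."""
    m = HANDLE_PATTERN.search(href)
    if not m:
        return None
    handle = m.group(1).lower()
    if handle in RESERVED_PATHS:
        return None
    return handle

def most_linked_handle(hrefs):
    """Return the most frequently linked handle, prefixed with '@', or None."""
    counts = Counter(h for h in map(extract_handle, hrefs) if h)
    if not counts:
        return None
    return "@" + counts.most_common(1)[0][0]

# Made-up links for illustration
links = [
    "https://twitter.com/exxonmobil",
    "https://twitter.com/#!/ExxonMobil",
    "https://twitter.com/intent/tweet?text=hello",
    "https://twitter.com/share",
    "https://example.com/about",
]
print(most_linked_handle(links))
```

Inside `get_twitter`, the same idea would mean passing each `a.attrs['href']` through `extract_handle` instead of calling `re.match` directly.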

## Putting It Together

Now for the fun part: we can combine these three functions into one that takes an organization name and returns its Twitter handle.

Note: we do need to take care that each step in the process returns a result in order to keep going.

In [117]:
def get_info(org_name):

  article = get_wikipedia_article(org_name)

  if article:
    homepage = get_official_website(article)
  else:
    homepage = None

  if homepage:
    twitter = get_twitter(homepage)
  else:
    twitter = None

  return {
      "article": article,
      "homepage": homepage,
      "twitter": twitter
  }


Let's try it out! 

In [118]:
get_info('Exxon Mobil')

{'article': 'https://en.wikipedia.org/wiki/ExxonMobil',
 'homepage': 'http://www.exxonmobil.com',
 'twitter': '@exxonmobil'}

🎉 Let's try it with some other organizations now. Maybe it only works for Exxon-Mobil...

In [119]:
get_info('Sears Roebuck')

{'article': 'https://en.wikipedia.org/wiki/Sears_Roebuck',
 'homepage': 'http://www.sears.com',
 'twitter': '@searsdeals'}

In [120]:
get_info('US Navy')

{'article': 'https://en.wikipedia.org/wiki/US_Navy',
 'homepage': 'http://www.navy.mil/',
 'twitter': '@usnavy'}

In [132]:
get_info("McDonalds")

{'article': 'https://en.wikipedia.org/wiki/McDonald%27s',
 'homepage': 'https://www.mcdonalds.com',
 'twitter': '@McDonalds'}

In [133]:
get_info('University of Maryland')

{'article': 'https://en.wikipedia.org/wiki/University_of_Maryland,_College_Park',
 'homepage': 'https://www.umd.edu/',
 'twitter': '@UofMaryland'}

In [135]:
get_info('Chicago Police Department')

{'article': 'https://en.wikipedia.org/wiki/Chicago_Police_Department',
 'homepage': 'https://home.chicagopolice.org/',
 'twitter': '@Chicago_Police'}
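
The calls above look up one organization at a time. Since the starting point was a whole list of names, here is a sketch of how the lookups could be run in a batch and written out as CSV. The helper names `lookup_all` and `to_csv` are hypothetical, as is the `delay` parameter, which is there to pause between requests. The lookup function is passed in as an argument so that `get_info` can be plugged in directly. A failed lookup is recorded as an empty row rather than stopping the run, which is a design choice and not something the notebook prescribes.

```python
import csv
import io
import time

FIELDS = ["name", "article", "homepage", "twitter"]

def lookup_all(names, lookup, delay=0.0):
    """Run lookup(name) for every name and collect one row per name."""
    rows = []
    for name in names:
        try:
            info = lookup(name)
        except Exception:
            # Broad catch on purpose: network and parsing errors should not
            # abort the whole batch. The row is kept with empty values.
            info = {"article": None, "homepage": None, "twitter": None}
        rows.append({"name": name, **info})
        time.sleep(delay)
    return rows

def to_csv(rows):
    """Serialize the rows to a CSV string with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

In this notebook, that could look like `print(to_csv(lookup_all(['Exxon Mobil', 'US Navy'], get_info, delay=1)))`.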