### Set PATH
<pre>
$ PATH=$PATH:&lt;pwd&gt;
</pre>

### Get the environment
<pre>
$ conda env create -f environment.yml
$ source activate tweet_bigly_env
</pre>

### Open Jupyter Notebook
    $ jupyter notebook

### Get the data
The NYTimes hosts [this article](https://www.nytimes.com/interactive/2016/01/28/upshot/donald-trump-twitter-insults.html) with Tweet content that they have identified as insults. Our goal is to obtain a well-formatted list of records, each containing the following fields:
<pre>
{
    "group": (string - defined category),
    "date": (string - Trump Tweet date),
    "link": (string - link to Tweet),
    "body": (string - insult),
    "name": (string - name of insultee),
    "title": (string - title of insultee) 
}
</pre>
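
As a rough check on that target shape, each record can be tested for missing fields before it goes into the list. This is only a sketch: the helper name `missing_fields` and the sample record are hypothetical and made up for illustration; only the field names come from the schema above.

```python
# Field names taken from the target schema above.
REQUIRED_FIELDS = ("group", "date", "link", "body", "name", "title")


def missing_fields(record):
    """Return the required fields absent from the record (hypothetical helper)."""
    return [field for field in REQUIRED_FIELDS if field not in record]


# Made-up sample record, not real scraped data.
sample = {
    "group": "News",
    "date": "Jan 1, 2016",
    "link": "https://twitter.com/realDonaldTrump/status/0",
    "body": "example insult",
    "name": "Example Name",
}
print(missing_fields(sample))  # ['title']
```

The check only looks at whether a key exists, so a record whose `title` is present but empty still passes.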

In [None]:
#import configparser
#config = configparser.ConfigParser()
#config.read('config.cfg')
import requests
#import time
import pickle
#from collections import Counter
#from dateutil.parser import parse as dateutil_parse
#import dateutil
#import pandas as pd
#from IPython.display import display
import ujson as json
from datetime import datetime
#import json

### Review site
Look at the HTML in [this article](https://www.nytimes.com/interactive/2016/01/28/upshot/donald-trump-twitter-insults.html) and try to understand the structure.

In [None]:
from lxml import html
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pickle
import sys

Fetch the page with `requests` and print every link it contains. Why are there no Tweet links? What happened?

In [None]:
url = 'https://www.nytimes.com/interactive/2016/01/28/upshot/donald-trump-twitter-insults.html'
today = datetime.now().strftime("%Y-%m-%d")
file_name = 'data/'+today+'_test_page.html'
page = requests.get(url)
#print(page.text)
#headers = {'Accept-Encoding': 'identity'}
#r = requests.get(url, headers=headers)
#print(r)
#tree = html.fromstring(page.content)
data = page.text

with open(file_name,'w') as f:
    f.write(data)
    
soup = BeautifulSoup(data, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))


In [None]:
url = 'https://www.nytimes.com/interactive/2016/01/28/upshot/donald-trump-twitter-insults.html'
today = datetime.now().strftime("%Y-%m-%d")
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source #print this and compare the difference to the page text from above
browser.quit()

In [None]:
soup = BeautifulSoup(html, "lxml")
tweets = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href is None:
        continue
    elif href.startswith('https://twitter.com/realDonaldTrump/status/'):
        tweets.append(href)

In [None]:
tweets[:5]
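
The link filter above can be sketched without a browser or a parser, using only the standard library. The helper name `tweet_status_id` and the sample `hrefs` list are hypothetical; the URL prefix is the one used in the cell above. The sketch also drops duplicate links, which is an addition and not something the notebook does.

```python
from urllib.parse import urlparse

TWEET_PREFIX = "https://twitter.com/realDonaldTrump/status/"


def tweet_status_id(href):
    """Return the status ID of a Tweet link, or None for any other href (hypothetical helper)."""
    if not href or not href.startswith(TWEET_PREFIX):
        return None
    # Path looks like /realDonaldTrump/status/<id>; the ID is the third segment.
    path_parts = urlparse(href).path.strip("/").split("/")
    return path_parts[2] if len(path_parts) > 2 else None


# Made-up hrefs standing in for link.get('href') results.
hrefs = [
    None,
    "https://www.nytimes.com/",
    "https://twitter.com/realDonaldTrump/status/123",
    "https://twitter.com/realDonaldTrump/status/123",
]

# dict.fromkeys keeps first-seen order while removing duplicates.
unique_ids = list(dict.fromkeys(i for i in map(tweet_status_id, hrefs) if i))
print(unique_ids)
```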

Now that we have the Tweet links, how can we search for the other information?

In [None]:
soup = BeautifulSoup(html,"lxml")
tweets = []
for a in soup.select('.g-insult-links-c a[href^="https://twitter.com/realDonaldTrump/status/"]'):
    name = a.parent.parent.parent.select('.g-entity-name')[0].string
    title = a.parent.parent.parent.select('.g-entity-title')[0].string
    link = a.attrs['href']
    text = a.string[1:-1] #removing added quotes
    date = a.next_sibling.string
    tweets.append({"name":name
                   ,"title":title
                   ,"link":link
                   ,"body":text
                   ,"date":date})

In [None]:
tweets[-3:-1]
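
The `[1:-1]` slice above assumes every link text starts and ends with a quote character. A slightly more defensive sketch strips a quote only when one is actually there. The helper name `strip_added_quotes` and the set of quote characters are assumptions; check which characters the page really uses before relying on it.

```python
# Assumed quote characters: straight quotes plus curly double quotes.
QUOTE_CHARS = "\"'\u201c\u201d"


def strip_added_quotes(text):
    """Remove at most one leading and one trailing quote character (hypothetical helper)."""
    if text and text[0] in QUOTE_CHARS:
        text = text[1:]
    if text and text[-1] in QUOTE_CHARS:
        text = text[:-1]
    return text


print(strip_added_quotes("\u201cexample insult\u201d"))
print(strip_added_quotes("no quotes here"))
```

Passing `None` returns `None`, which covers the case where `a.string` is empty because the link contains nested tags.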

### Manual Curation
Identify the clusters by hand and save them to a file.

In [None]:
clusters = ["US Business - financially focused individuals and companies"
            , "US Political - senator, governor, mayor, democrats, republicans, related rant"
            , "Foreign Interest - person, country, related topic"
            , "Random - inanimate objects, golf courses, sporting events, books"
            , "Famous - people, broadway shows, tv shows, popular songs"
            , "News - person, org, allegations, association"
            ]

The color groups file was created to allow manual addition of group labels.

In [None]:
groups = set()
for d in tweets:
    title = d['title'] or ''  # some entries have no title
    groups.add(d["name"] + '|' + title)
    
with open('data/{}_color_groups.json'.format(today),'w') as f:
    for item in groups:
        rec = item.split("|")
        d = {"DELETE":item
             ,"name":rec[0]
             ,"group":""
             }
        f.write(json.dumps(d)+"\n")
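
The file written above holds one JSON object per line, which is what makes it easy to edit by hand and read back line by line. A minimal round-trip sketch follows. It uses the standard-library `json` module in place of `ujson` and an in-memory buffer in place of the real file; the sample keys are made up.

```python
import io
import json  # standard library; the notebook itself imports ujson as json

# Made-up "name|title" keys standing in for the scraped set.
groups = {"Example Person|Example title", "Example Show|"}

buffer = io.StringIO()  # in-memory stand-in for the color_groups file
for item in sorted(groups):
    name = item.split("|")[0]
    buffer.write(json.dumps({"DELETE": item, "name": name, "group": ""}) + "\n")

# Read the records back, one JSON object per non-empty line.
buffer.seek(0)
records = [json.loads(line) for line in buffer if line.strip()]
print([rec["name"] for rec in records])
```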

After the color_groups file was edited, the next step was to insert this new information into the tweet list.

In [None]:
# create group dict keyed on name
name_group_dict ={}
with open("data/2017-01-27_color_groups.json",'r') as f:
    for item in f:
        rec = json.loads(item)
        name_group_dict[rec["name"]]=rec["group"]

In [None]:
for item in tweets:
    item['group'] = name_group_dict[item['name']]

In [None]:
tweets[:2]
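
The direct lookup above raises a `KeyError` if a scraped name is missing from the edited labels file, for example when the page has gained new entries since the file was created. One way to handle that, sketched here with made-up data, is to fall back to an empty group and collect the unlabeled names for review.

```python
# Made-up stand-ins for the scraped tweets and the hand-edited labels.
tweets = [
    {"name": "Example Person", "body": "example insult"},
    {"name": "Unlabeled Person", "body": "another example"},
]
name_group_dict = {"Example Person": "News"}

unlabeled = set()
for item in tweets:
    # .get avoids a KeyError when a name is absent from the labels file.
    group = name_group_dict.get(item["name"], "")
    if not group:
        unlabeled.add(item["name"])
    item["group"] = group

print(sorted(unlabeled))  # ['Unlabeled Person']
```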

In [None]:
sys.setrecursionlimit(90000)  # for a potential recursion depth error
with open("data/{}_full_insult_list.json".format(today),'w') as f, open("data/{}_full_insult_list.json.pkl".format(today),'wb') as pkl:
    f.write(json.dumps(tweets))
    pickle.dump(tweets,pkl, pickle.HIGHEST_PROTOCOL)
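
To confirm that both files can be read back, a round trip such as the following may help. It is a sketch under stated assumptions: it writes to a temporary directory instead of `data/`, uses the standard-library `json` module in place of `ujson`, and saves a single made-up record.

```python
import json  # standard-library stand-in for ujson
import os
import pickle
import tempfile

# Made-up record standing in for the full tweet list.
tweets = [{"name": "Example Person", "group": "News", "body": "example insult"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "full_insult_list.json")
    pkl_path = json_path + ".pkl"

    # Write both formats, as in the cell above.
    with open(json_path, "w") as f, open(pkl_path, "wb") as pkl:
        f.write(json.dumps(tweets))
        pickle.dump(tweets, pkl, pickle.HIGHEST_PROTOCOL)

    # Read both files back and compare with the original list.
    with open(json_path) as f, open(pkl_path, "rb") as pkl:
        from_json = json.load(f)
        from_pickle = pickle.load(pkl)

print(from_json == tweets and from_pickle == tweets)
```

If the recursion depth error mentioned above comes from pickling BeautifulSoup string objects, which keep references to the parse tree, converting each field to a plain `str` before saving may avoid it. That is an assumption about the cause, not something the notebook confirms.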

### TO D3 
If you want to take a gander: [d3js.org](https://d3js.org/).

In [None]:
#...end