# TweetNMeet Bot

Hackathon 2017
Christine Chung & Anne Lai

## Set-up

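- Install `nltk`, `gensim`, and `twython`, then download the NLTK data the notebook uses: `python -m nltk.downloader stopwords wordnet`
- Get OAuth credentials from Twitter
- `mkdir ~/twitter-files`
- Create a `credentials.txt` file in `twitter-files/`
- Add your OAuth info in this format: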
```
app_key=YOUR CONSUMER KEY  
app_secret=YOUR CONSUMER SECRET  
oauth_token=YOUR ACCESS TOKEN  
oauth_token_secret=YOUR ACCESS TOKEN SECRET
```
- Export to environment: `export TWITTER="/path/to/your/twitter-files"`
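
For reference, the `credentials.txt` format above is plain `key=value` lines. The sketch below parses such a file into a dict that could be passed to `Twython(**creds)`. Both helpers are hypothetical and written only for illustration; they are not NLTK's `credsfromfile` implementation.

```python
import os

def read_credentials(path):
    """Parse key=value lines into a dict, ignoring blank lines."""
    creds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first '=' only, in case a secret contains '='.
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()
    return creds

def default_credentials_path():
    """Build the path the set-up steps describe: $TWITTER/credentials.txt."""
    return os.path.join(os.environ["TWITTER"], "credentials.txt")
```

Calling `read_credentials(default_credentials_path())` should return the four keys listed above, provided the `TWITTER` variable is exported.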

In [131]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
from gensim import corpora
from twython import Twython
from nltk.twitter import Twitter, Query, Streamer, credsfromfile
import json, re, pickle, string

## Getting Data

We primarily used Twython to interact with the Twitter API, though we also used NLTK's built-in Twitter tools for initial data exploration.

In [24]:
# tw = Twitter()
# #Writing meetup tweets out
# tw.tweets(keywords='enjoy', to_screen=False, stream=False, limit=20000) #sample from the public stream

In [2]:
# We chose 10 from Meetup's 36 broad categories
categories = {"Technology": None,
              "Outdoors": None,
              "Arts": None,
              "Books": None,
              "Business": None,
              "Language": None,
              "Sports and Fitness": None,
              "Food": None,
              "LGBTQ": None,
              "Fashion": None}

In [24]:
oauth = credsfromfile()
client = Twython(**oauth)

In [60]:
userIDs = {}
for cat in categories:
    users = client.search_users(q=cat)
    userIDs[cat] = [user['id'] for user in users]

In [84]:
# for category in categories:
#     if categories[category] is not None:
#         continue
#     print(category)
#     IDs = userIDs[category]
#     tweets = [client.get_user_timeline(user_id=thisID, count=200) for thisID in IDs]
#     categories[category] = [item['text'] for sublist in tweets for item in sublist]
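
The commented-out cell above fetches up to 200 recent tweets per user and flattens them into one list per category. A minimal sketch of the same step as a reusable helper follows; `collect_texts` is a hypothetical name, and `client` is assumed to be any object exposing `get_user_timeline(user_id=..., count=...)`, as the Twython client used above does.

```python
def collect_texts(client, user_ids, count=200):
    """Return the text of up to `count` recent tweets for each user id."""
    texts = []
    for user_id in user_ids:
        timeline = client.get_user_timeline(user_id=user_id, count=count)
        # Each timeline entry is a dict; only the tweet text is kept.
        texts.extend(tweet["text"] for tweet in timeline)
    return texts
```

With this helper, the loop body would reduce to something like `categories[category] = collect_texts(client, userIDs[category])`. It makes no attempt to handle rate limits or protected accounts.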

In [85]:
for category in categories:
    if categories[category] is None:
        print(category)
    else:
        print("# Tweets in {}: {}".format(category, len(categories[category])))

# Tweets in Technology: 3999
# Tweets in Outdoors: 4000
# Tweets in Arts: 4000
# Tweets in Books: 4000
# Tweets in Business: 4000
# Tweets in Sports and Fitness: 3985
# Tweets in Food: 4000
# Tweets in LGBTQ: 3966
# Tweets in Fashion: 3998
# Tweets in Language: 3995


## Cleaning Data

In [159]:
stop = set(stopwords.words('english') + ['amp', 'rt', '—'])
lemma = WordNetLemmatizer()
translator = str.maketrans(dict.fromkeys(string.punctuation))

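def clean(doc):
    # Strip ASCII punctuation; this also removes '@' and '#' from mentions and hashtags.
    punc_free = doc.translate(translator)
    # Lowercase and drop stopwords.
    stop_free = " ".join([i for i in punc_free.lower().split() if i not in stop])
    # Links are now single tokens such as "httpstcoabc"; remove them.
    url_free = re.sub(r"http\S+", '', stop_free)
    # Lemmatize each remaining word (WordNet noun lemmas by default).
    normalized = " ".join(lemma.lemmatize(word) for word in url_free.split())
    return normalized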
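
To see what the cleaning steps do to a single tweet, here is a standard-library-only sketch. It is not the notebook's `clean`: the stopword set is a tiny hypothetical stand-in for NLTK's list, lemmatization is left out, and URLs are removed before punctuation rather than after.

```python
import re
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English list plus 'amp' and 'rt'.
STOP = {"rt", "amp", "the", "a", "to", "at", "us", "for"}
TRANSLATOR = str.maketrans(dict.fromkeys(string.punctuation))

def clean_sketch(doc):
    url_free = re.sub(r"http\S+", "", doc)      # drop links first
    punc_free = url_free.translate(TRANSLATOR)  # '&amp;' becomes the token 'amp'
    return " ".join(w for w in punc_free.lower().split() if w not in STOP)

print(clean_sketch("RT @meetup: Join us for tacos &amp; code tonight! https://t.co/abc123"))
# meetup join tacos code tonight
```

The `&amp;` entity in raw tweet text is the reason `'amp'` is treated as a stopword.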

In [86]:
allTweets = [tweet for cat in categories for tweet in categories[cat]]

In [161]:
with open('tweetCorpus.txt', mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(allTweets))
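
One caveat with the newline-joined file: tweets can themselves contain line breaks, so splitting the file on `\n` will not always recover the original list. A minimal sketch that round-trips the list exactly with the standard `json` module is below; the helper names are hypothetical.

```python
import json

def save_tweets(tweets, path):
    """Write the list of tweet strings as a single JSON array."""
    with open(path, mode="wt", encoding="utf-8") as f:
        json.dump(tweets, f, ensure_ascii=False)

def load_tweets(path):
    """Read back the list written by save_tweets."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```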

In [127]:
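# Clean a list of tweets and convert each one to a bag of words.
# Pass an existing dictionary to reuse its token ids; otherwise one is built from docs.
def getDocBoW(docs, dictionary=None):
    doc_clean = [clean(doc).split() for doc in docs]
    if dictionary is None:
        dictionary = corpora.Dictionary(doc_clean)
    return [dictionary.doc2bow(doc) for doc in doc_clean]

In [162]:
# Keep the dictionary in scope: the LDA model needs it as id2word.
dictionary = corpora.Dictionary(clean(doc).split() for doc in allTweets)
doc_term_matrix = getDocBoW(allTweets, dictionary)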
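
For readers new to the bag-of-words format: each tweet becomes a list of `(token id, count)` pairs. The standard-library sketch below mimics that shape; the id assignment is an arbitrary choice made for illustration and is not how gensim numbers tokens.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id (alphabetical order, for stability)."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["taco", "code", "taco"], ["code", "book"]]
vocab = build_vocab(docs)                   # {'book': 0, 'code': 1, 'taco': 2}
print(doc_to_bow(docs[0], vocab))           # [(1, 1), (2, 2)]
print(doc_to_bow(["taco", "yoga"], vocab))  # [(2, 1)]; 'yoga' is unknown and dropped
```

The ids only mean something relative to the vocabulary that produced them, so new tweets scored against a trained model need to be encoded with the same dictionary the model was trained on.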

## Training LDA Model

We model topics with latent Dirichlet allocation (LDA); see [Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for background.

In [119]:
lda = gensim.models.ldamodel.LdaModel

In [None]:
ldamodel = lda(doc_term_matrix, num_topics=20, id2word = dictionary, passes=150)

In [134]:
# Use a context manager so the file is flushed and closed after writing.
with open("ldamodel.pickle", "wb") as f:
    pickle.dump(ldamodel, f)

In [122]:
ldamodel.print_topics()

[(0,
  '0.012*"rt" + 0.011*"recipe" + 0.010*"good" + 0.008*"latin" + 0.005*"full" + 0.005*"man" + 0.005*"music" + 0.004*"yes" + 0.004*"health" + 0.004*"world"'),
 (1,
  '0.018*"rt" + 0.013*"please" + 0.013*"lgbtq" + 0.012*"transgender" + 0.012*"spanish" + 0.011*"german" + 0.010*"irish" + 0.010*"italian" + 0.007*"trans" + 0.006*"—"'),
 (2,
  '0.056*"word" + 0.054*"day" + 0.015*"rt" + 0.014*"thank" + 0.008*"japanese" + 0.005*"danielnewman" + 0.005*"walkrstalkrcon" + 0.005*"celebrate" + 0.004*"sorry" + 0.004*"game"'),
 (3,
  '0.043*"audio" + 0.024*"rt" + 0.014*"amp" + 0.011*"u" + 0.008*"korean" + 0.008*"join" + 0.007*"learn" + 0.007*"tonight" + 0.007*"story" + 0.007*"place"'),
 (4,
  '0.014*"fashion" + 0.010*"book" + 0.010*"rt" + 0.008*"lgbt" + 0.007*"community" + 0.007*"amp" + 0.006*"come" + 0.006*"new" + 0.005*"perfect" + 0.004*"time"'),
 (5,
  '0.020*"food" + 0.019*"rt" + 0.011*"amp" + 0.009*"english" + 0.007*"eat" + 0.007*"art" + 0.007*"scoop" + 0.006*"4" + 0.005*"free" + 0.005*"child

### Trial and Error

The first time, we trained the model on two corpora: tweets containing the word "meetup" and tweets containing the word "enjoy". These turned out not to be very helpful, but they did surface some pretty great topics, including one dominated by the words "miserable", "island", "channel4news", and "brexit". Other topics related to Darren Criss (performing on the Today show?), grilling, and happy women.

We later sampled tweets based on 10 Meetup categories that we handpicked. The original results are kept here for your enjoyment.

```
[(0,
  '0.248*"enjoy" + 0.011*"yesterday" + 0.011*"love" + 0.011*"created" + 0.010*"michael" + 0.009*"summer" + 0.009*"u" + 0.009*"everyone" + 0.007*"5" + 0.007*"im"'),
 (1,
  '0.014*"miserable" + 0.013*"island" + 0.013*"moore" + 0.013*"channel4news" + 0.011*"via" + 0.011*"https…" + 0.010*"new" + 0.009*"here" + 0.008*"brexit" + 0.008*"job"'),
 (2,
  '0.133*"enjoy" + 0.032*"happy" + 0.023*"amp" + 0.022*"it" + 0.021*"dont" + 0.016*"long" + 0.016*"woman" + 0.015*"doesnt" + 0.015*"together" + 0.013*"🤷🏽\u200d♀️"'),
 (3,
  '0.029*"time" + 0.027*"hope" + 0.021*"i" + 0.020*"day" + 0.018*"dream" + 0.016*"fantastic" + 0.015*"performing" + 0.015*"todayshow" + 0.015*"darrencriss" + 0.015*"dreamed"'),
 (4,
  '0.016*"always" + 0.016*"new" + 0.014*"man" + 0.013*"watching" + 0.013*"show" + 0.012*"time" + 0.009*"every" + 0.009*"start" + 0.009*"live" + 0.008*"much"'),
 (5,
  '0.018*"best" + 0.016*"people" + 0.012*"meetup" + 0.012*"join" + 0.010*"movie" + 0.009*"know" + 0.008*"u" + 0.007*"amp" + 0.006*"thats" + 0.006*"awesome"'),
 (6,
  '0.059*"great" + 0.054*"weekend" + 0.053*"…" + 0.046*"light" + 0.044*"grill" + 0.044*"promotion" + 0.044*"perduechicken" + 0.044*"perduecrew" + 0.044*"kebab" + 0.008*"follow"'),
 (7,
  '0.092*"enjoy" + 0.012*"go" + 0.011*"know" + 0.008*"get" + 0.007*"amp" + 0.007*"stop" + 0.007*"week" + 0.007*"favorite" + 0.007*"cant" + 0.007*"u"'),
 (8,
  '0.038*"drive" + 0.030*"black" + 0.011*"think" + 0.010*"game" + 0.010*"want" + 0.010*"year" + 0.009*"made" + 0.009*"much" + 0.009*"thanks" + 0.008*"amp"'),
 (9,
  '0.093*"life" + 0.078*"u" + 0.068*"gonna" + 0.043*"come" + 0.039*"like" + 0.036*"girl" + 0.036*"me" + 0.033*"forget" + 0.032*"either" + 0.031*"gon"')]
  ```
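
Each topic in the output is a string of `weight*"word"` terms joined by ` + `. If you want to work with the terms programmatically, a small regex-based parser is one option. This is a sketch that assumes the string format shown in the output; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# Matches terms such as 0.248*"enjoy"
TERM = re.compile(r'(\d*\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.248*"enjoy" + 0.011*"love"' into [('enjoy', 0.248), ('love', 0.011)]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

print(parse_topic('0.248*"enjoy" + 0.011*"yesterday" + 0.011*"love"'))
# [('enjoy', 0.248), ('yesterday', 0.011), ('love', 0.011)]
```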

## Predicting a Topic for a User

In [137]:
myTweets = [tweet['text'] for tweet in client.get_user_timeline(
    screen_name="curmudgeon",
    include_rts=True,
    count=200)]

In [148]:
def flatten(list_of_lists):
    return [val for sublist in list_of_lists for val in sublist]

In [149]:
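# Encode with the model's own dictionary so token ids match the ones it was trained on.
myDocTermMatrix = ldamodel.id2word.doc2bow(flatten(clean(tweet).split() for tweet in myTweets))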

In [151]:
myLDA = ldamodel[myDocTermMatrix]

In [158]:
myLDA.sort(key=lambda x: x[1], reverse=True)
myLDA

[(0, 0.1385202554208125),
 (1, 0.12840139598610675),
 (7, 0.12785929686654426),
 (6, 0.099817503944872488),
 (5, 0.092628977457324416),
 (3, 0.088213090228229304),
 (8, 0.083397223169100226),
 (2, 0.082450493258694371),
 (9, 0.082357441773621767),
 (4, 0.076354321894694002)]
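
To turn the sorted list into a single prediction, one simple option is to take the highest-probability topic and compare it with what a uniform distribution would give. The sketch below uses the scores printed above; `top_topic` is a hypothetical helper, and the "lift" ratio is only an illustration, not a calibrated confidence measure.

```python
def top_topic(topic_probs, num_topics):
    """Return (topic_id, probability, lift over a uniform distribution)."""
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    uniform = 1.0 / num_topics
    return topic_id, prob, prob / uniform

# Scores copied from the output above (10 topics).
scores = [(0, 0.1385202554208125), (1, 0.12840139598610675),
          (7, 0.12785929686654426), (6, 0.099817503944872488),
          (5, 0.092628977457324416), (3, 0.088213090228229304),
          (8, 0.083397223169100226), (2, 0.082450493258694371),
          (9, 0.082357441773621767), (4, 0.076354321894694002)]

best_id, best_prob, lift = top_topic(scores, num_topics=10)
print(best_id, round(best_prob, 3), round(lift, 2))
```

Here the top topic is only about 1.4 times more likely than chance, so the distribution is nearly flat and the prediction for this user is weak.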

## Sandbox

In [62]:
tw = Twitter()

In [63]:
#Writing meetup tweets out
tw.tweets(keywords='meetup', to_screen=False, stream=False, limit=20000) #sample from the public stream

Writing to /Users/christine/twitter-files/tweets.20170803-160530.json
No more Tweets available through rest api
Written 6432 Tweets


In [62]:
client.get_user_suggestions()

[{'name': 'Sports', 'size': 31, 'slug': 'sports'},
 {'name': 'Entertainment', 'size': 14, 'slug': 'entertainment'},
 {'name': 'Music', 'size': 15, 'slug': 'music'},
 {'name': 'Digital Creators', 'size': 15, 'slug': 'digital-creators'},
 {'name': 'News', 'size': 15, 'slug': 'news'},
 {'name': 'Gaming', 'size': 15, 'slug': 'gaming'},
 {'name': 'Government', 'size': 15, 'slug': 'government'},
 {'name': 'Television', 'size': 13, 'slug': 'television'},
 {'name': 'Funny', 'size': 14, 'slug': 'funny'},
 {'name': 'Fashion', 'size': 14, 'slug': 'fashion'},
 {'name': 'Food & Drink', 'size': 15, 'slug': 'food-drink'},
 {'name': 'Family', 'size': 9, 'slug': 'family'},
 {'name': 'Business', 'size': 9, 'slug': 'business'},
 {'name': 'Books', 'size': 9, 'slug': 'books'},
 {'name': 'Leaders', 'size': 12, 'slug': 'leaders'},
 {'name': 'Influencers', 'size': 12, 'slug': 'influencers'}]

### References

- [Twitter with NLTK](http://www.nltk.org/howto/twitter.html)
- [LDA with Gensim & Python Tutorial](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)
- [Gensim LDA Docs](https://radimrehurek.com/gensim/models/ldamodel.html)