# Intro
Welcome, everyone! This notebook is a demo of the tweetnlp
module.

# What is TweetNLP?
I wrote tweetnlp because I wanted to try out Keras and TensorFlow,
and I'm interested in natural language processing. I enjoy reading
Twitter, and I've always enjoyed how each person on Twitter can have
such a distinctive voice, even with so few words. So I wondered
if you could use TensorFlow to build models for writing tweets in
the style of an existing Twitter user.

The tweetnlp module does this (sort of...). It has an API for
downloading tweets using Twitter app credentials (you need
to apply to be a Twitter developer to make an app). It then
processes the tweets into text and builds a machine learning
model based on those tweets. You can then either generate tweets
individually, or predict tweets word by word, potentially inserting
your own topics that way.

# Demo
This code is pretty easy to use. The downloading and model 
building functions take several configuration options which 
may or may not make your predictions better.  

Let's make a model with my favorite Twitter account, @dog_rates:

In [1]:
import os
import tweetnlp
import shutil

data_dir = './temp'
# get rid of old data
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)

os.mkdir(data_dir)

raw_file = os.path.join(data_dir, 'raw')
# download the tweets to ./temp/raw
tweetnlp.download_tweets('dog_rates', raw_file, exclude_replies=True)

print(os.listdir(data_dir))

['raw']


We've now downloaded the tweets to a file. Each line of this file
is raw JSON. Here's an example:
```json
{"created_at": "Mon Sep 16 16:21:43 +0000 2019", "favorite_count": 111102, "full_text": "This is Apollo. He likes to point at those with beautiful smiles who are going to have a great day today. 13/10 https://t.co/oCdx6Rn3PQ", "hashtags": [], "id": 1173633070630498309, "id_str": "1173633070630498309", "lang": "en", "media": [{"display_url": "pic.twitter.com/oCdx6Rn3PQ", "expanded_url": "https://twitter.com/dog_rates/status/1173633070630498309/photo/1", "id": 1173633065635074048, "media_url": "http://pbs.twimg.com/media/EEmVDhXU0AA5XOV.jpg", "media_url_https": "https://pbs.twimg.com/media/EEmVDhXU0AA5XOV.jpg", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 682, "h": 776, "resize": "fit"}, "small": {"w": 598, "h": 680, "resize": "fit"}, "large": {"w": 682, "h": 776, "resize": "fit"}}, "type": "photo", "url": "https://t.co/oCdx6Rn3PQ"}, {"display_url": "pic.twitter.com/oCdx6Rn3PQ", "expanded_url": "https://twitter.com/dog_rates/status/1173633070630498309/photo/1", "id": 1173633065643470848, "media_url": "http://pbs.twimg.com/media/EEmVDhZU8AAWHxp.jpg", "media_url_https": "https://pbs.twimg.com/media/EEmVDhZU8AAWHxp.jpg", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 759, "h": 924, "resize": "fit"}, "large": {"w": 759, "h": 924, "resize": "fit"}, "small": {"w": 559, "h": 680, "resize": "fit"}}, "type": "photo", "url": "https://t.co/oCdx6Rn3PQ"}], "retweet_count": 14172, "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "urls": [], "user": {"id": 4196983835, "id_str": "4196983835"}, "user_mentions": []}
```
From there, we generate a separate file that's just the text, with
URLs and mentions removed.

In [3]:
tweets_file = os.path.join(data_dir, 'tweets')
tweetnlp.generate_tweets_text(raw_file, tweets_file)

print(os.listdir(data_dir))

['tweets', 'raw']


This file now has just the sanitized tweet text, like so:
```
This is Apollo. He likes to point at those with beautiful smiles who are going to have a great day today. 13/10
```
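
The cleaning step can be pictured with a minimal sketch. This is not tweetnlp's actual implementation: the `clean_tweet_line` helper and both regular expressions are assumptions made for illustration, and only the `full_text` field name comes from the JSON shown earlier.

```python
import json
import re

# Assumed patterns for illustration; tweetnlp's own cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet_line(json_line):
    """Return the tweet text from one raw JSON line, minus URLs and mentions."""
    tweet = json.loads(json_line)
    text = tweet["full_text"]
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Collapse any whitespace left behind by the removals.
    return " ".join(text.split())


# A shortened stand-in for one line of the raw file.
raw_line = json.dumps({
    "full_text": "This is Apollo. 13/10 https://t.co/oCdx6Rn3PQ",
    "user_mentions": [],
})
print(clean_tweet_line(raw_line))
```

On the Apollo tweet above, this approach would keep the sentence and the 13/10 rating and drop the trailing t.co link.
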
Finally, we build the model itself. Here we're using the defaults except
for the embedding_dim. I've found 100 to work well, though the cell below
uses 50.

In [4]:
import tweetnlp.model
model_file = os.path.join(data_dir, 'model')
tokenizer_file = os.path.join(data_dir, 'tokenizer')
# this function will write to both the model file and the tokenizer file
# this process can take a long time. 
tweetnlp.model.build_tweet_model(tweets_file, model_file, tokenizer_file, embedding_dim=50)

print(os.listdir(data_dir))

Vocabulary Size: 3112
Total Sequences: 15341



Using TensorFlow backend.




Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2, 50)             155600    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_1 (Dense)              (None, 3112)              158712    
Total params: 334,512
Trainable params: 334,512
Non-trainable params: 0
_________________________________________________________________
None


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Epoch 1/50
 - 4s - loss: 6.5863 - acc: 0.0694
Epoch 2/50
 - 3s - loss: 5.8446 - acc: 0.0894
Epoch 3/50
 - 3s - loss: 5.6147 - acc: 0.1206
Epoch 4/50
 - 4s - loss: 5.4494 - acc: 0.1564
Epoch 5/50
 - 4s - loss: 5.2881 - acc: 0.1728
Epoch 6/50
 - 4s - loss: 5.0972 - acc: 0.186
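
The parameter counts in the model summary above can be checked by hand. The sketch below uses the standard formulas for Keras Embedding, LSTM, and Dense layers (with biases enabled, which is the Keras default) and plugs in the vocabulary size of 3112 and the 50-unit layers reported in the summary. The helper function names are made up for this example.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # A weight matrix plus one bias per output unit.
    return input_dim * units + units


vocab_size = 3112    # "Vocabulary Size" from the build output
embedding_dim = 50   # embedding_dim passed to build_tweet_model
lstm_units = 50      # read off the LSTM output shape (None, 50)

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "lstm": lstm_params(embedding_dim, lstm_units),
    "dense": dense_params(lstm_units, vocab_size),
}
total = sum(counts.values())
print(counts, total)
```

The three counts come to 155600, 20200, and 158712, matching the summary, for a total of 334,512. Most of the parameters sit in the embedding and output layers, since both scale with the vocabulary size.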

That was OK! Now let's generate a tweet!

In [5]:
print(tweetnlp.model.tweet_from_model(model_file, tokenizer_file))

this is archie it’s his first time today had you to know he’s a smiley boy hopes you like it 12 10


You can do word-by-word predictions like so:
```
python -m tweetnlp predict temp/model temp/tokenizer
```
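
To show the idea behind word-by-word prediction, here is a toy sketch. It is not how tweetnlp works internally: a simple frequency table stands in for the trained LSTM, and `build_table`, `predict_next`, `generate`, and the `overrides` argument are all hypothetical names. The two-word context is an assumption based on the (None, 2, 50) embedding output shape in the model summary.

```python
from collections import Counter, defaultdict

CONTEXT_SIZE = 2  # assumed: two words of context per prediction


def build_table(tweets):
    """Count which word follows each two-word context."""
    table = defaultdict(Counter)
    for tweet in tweets:
        words = tweet.lower().split()
        for i in range(len(words) - CONTEXT_SIZE):
            context = tuple(words[i:i + CONTEXT_SIZE])
            table[context][words[i + CONTEXT_SIZE]] += 1
    return table


def predict_next(table, words):
    """Most frequent follower of the last two words, or None if unseen."""
    followers = table.get(tuple(words[-CONTEXT_SIZE:]))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(table, seed, overrides=None, max_words=10):
    """Extend seed word by word; overrides maps a position to a forced word."""
    overrides = overrides or {}
    words = list(seed)
    while len(words) < max_words:
        word = overrides.get(len(words)) or predict_next(table, words)
        if word is None:
            break
        words.append(word)
    return " ".join(words)


# Made-up training text for the toy table.
tweets = [
    "this is apollo he likes to point at smiles",
    "this is archie he loves to nap",
    "hello this is apollo",
]
table = build_table(tweets)
print(generate(table, ["this", "is"]))
print(generate(table, ["this", "is"], overrides={2: "archie"}))
```

Forcing a single word at position 2 steers the rest of the sentence, which is the sense in which word-by-word prediction lets you insert your own topics.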