# Vice Articles Topic Model

#### Supervised Text classification model

In [79]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras, numpy as np
from keras.layers import Embedding, Dense, LSTM, GRU
from keras.models import Sequential
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

_A small sample dataset to train and test the model_

In [129]:
data_loc = "./data/articles_sample.csv"
data = pd.read_csv(data_loc, sep='|', engine='python', names=['article_id','body', 'topic'])
# data.columns = ['article_id', 'url_fragment', 'first_published', 'body', 'topic']
data = data[~data.body.isnull()]

In [130]:
nb_words = 200000
max_seq_len = 1000
data.columns

Index(['article_id', 'body', 'topic'], dtype='object')

In [105]:
# train_size = int(np.floor(data.shape[0] * .8))


# train_x = data["body"][0:train_size]
# train_y = data["topic"][0:train_size]

# test_x = data["body"][train_size:]
# test_y = data["topic"][train_size:]


In [132]:
# train_x.shape, train_y.shape, test_x.shape, test_y.shape

((1946,), (1946,), (387,), (387,))

In [144]:
X = data["body"]
y = data["topic"]

In [145]:
topics = list(y.unique())
y_encoded = [topics.index(topic) for topic in y] 

n_classes = len(topics)
n_classes, sum(train_y.value_counts() <= 10)

(607, 581)

Preparing the data for the model
* Tokenizing the text - Identifying unique words, creating a dictionary and counting their frequency in the list of documents (texts) in the training data.
* One-hot encoding the labels (topics)
* Splitting the data into train and test(validation) sets

In [147]:
tokenizer = Tokenizer(num_words=nb_words)
tokenizer.fit_on_texts(X)
sequences = Tokenizer.texts_to_sequences(tokenizer, X)
word_index = tokenizer.word_index

ydata = keras.utils.to_categorical(y_encoded)
input_data = pad_sequences(sequences, maxlen=max_seq_len)

Xtrain, Xvalid, ytrain, yvalid = train_test_split(input_data, ydata, test_size=0.2)

_Model definition and training_

In [150]:
embedding_vector_length = 128
model = Sequential()
model.add(Embedding(len(word_index)+1, embedding_vector_length, input_length=max_seq_len, embeddings_initializer='glorot_normal', 
                    embeddings_regularizer=keras.regularizers.l2(0.01)))
model.add(LSTM(200))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 1000, 128)         8911872   
_________________________________________________________________
lstm_9 (LSTM)                (None, 200)               263200    
_________________________________________________________________
dense_9 (Dense)              (None, 607)               122007    
Total params: 9,297,079
Trainable params: 9,297,079
Non-trainable params: 0
_________________________________________________________________
None


In [151]:
model.fit(Xtrain, ytrain, validation_data=(Xvalid, yvalid), nb_epoch=3, batch_size=128)



Train on 1556 samples, validate on 390 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x115914470>

In [116]:
example = ("""Why Some Rappers Just Aren't Good Live True fact: lots of successful rappers don't need a great live show to get where they are. Let them live! Danny Brown, who is a good live rapper, performs live.It doesn't take long for the average person to develop a healthy skepticism of rap shows. Aside from the usual live venue problems (guest list confusion, aggro bouncers, expensive drinks), rap has its own special set of bullshit. There are too many opening acts, a rotating list of five local struggle rappers who never graduate to headliner status. There are too many people and it's always mostly dudes who often feel a way about you breathing their air. The shows run hours behind schedule. If you make it that far, the headliner is often not very good. If they even know all their lyrics, they're spitting them on top of their own recorded vocals.It's that last sin, rappers not really even rapping, that seems to offend the most. And it's not just dillettante hip-hop fans who complain; it's something I hear all the time as a reason why my friends don't want to go see a rapper whose music they love perform their songs in real life. It's less of a concert and more of a spectacle.So why does this happen? Why do some rappers seem to phone in their live shows? It's not like live hip-hop is dead. After all, active showmen like Action Bronson and Danny Brown have clearly benefitted from the work they put into their stage presence. Why don't all rappers do the same thing? The short answer is you can become a famous rapper without doing very many shows, per se.The assumed path to success for musicians today involves recording music, booking to play your music and building a fan base that will buy your music and see your shows. Increasingly, convincing brand managers that your music will sell products, but that's part of the same performance-heavy equation. And this is proven method across genres: rock, country, disco and, yeah, even rap.But rappers often take an alternate path, one that relies way less on stagework. Instead of recording an album and hustling a live show, they pick one song and push it to as many DJ's as they can. If the record is good, it gets added to the rotation on the local radio station and in the clubs around the way. And if the record is really good (and if they have good people behind you), it will get added to more stations and played at more clubs, further and further from the local market.As the single gets momentum and their name gets out there, promoters far and wide will book them to perform at clubs. But this isn't a full show, it's a quick set at a club based around a hit record. If the song is really popular, the artist's presence is just gravy. Everybody will know all the words and the club would have gone off whether or not they were there. Nobody's worried about the rapper nailing the hook.A rapper with a big single can make an insane amount of money doing club sets. Rappers are fitting into a nightlife culture that relies on paying a diverse range of celebrities to show up as a way to entice people to come out; they can earn their money just by being in the building. And everything resembling a metropolitan area has a nightlife destination where a rapper might get booked if their record is getting spins. There are way fewer ideal places for full-fledged rap shows, and that number gets smaller after factoring in many venues' inherent biases against hip-hop events. An artist that can get a stack for a club appearance in, say, Albany, GA is not stressed about their stage presence and breath control.All this said, should we even expect old school showmanship and rappin-ass rappin from these dudes? If the music makes a mob of excited people jump up and down and yell lyrics at each other, that's not a bad thing (nor is it a new concept).But that's the reason why some successful emcees don't have great live shows: they never needed one to get where they are.Skinny Friedman has unmatchable stage presence when he blogs. He's on Twitter - @skinny412""", 'rap')

ex2 = ("""‘Black-ish’ Addressed Postpartum Depression Better Than My Doctor It’s the PSA of my goddamn dreams. At the height of my postpartum depression and anxiety, I had been to the emergency room twice for palpitations I perceived to be a heart attack. I also had an MRI for migraines I was convinced were the result of a tumor. I often lay awake watching my daughter's monitor for hands that would snatch her from her crib. I called my dentist after hours and pushed for her to assure me my tooth pain was not an infection that would spread to my brain. Most nights before turning out the lights I asked my husband Dan to assure me I wouldn't die in my sleep. On one particular evening, I was lying in bed with my legs against our headboard, trying to take the pressure off of an ache in my calf that I was pretty sure was a blood clot slowly making its way to my lungs. Dan slid into bed and I turned my head to him, panicked tears already rolling down my face. 'Do you think…?' Before I finished he said, 'Yes, I will see you in the morning. I promise.' He caught my eye and we both started laughing. It wasn't exactly funny, but even in the midst of my very real distress, I knew my thought patterns weren't typical. Sometimes finding humor in it made it feel less overwhelming. Tracee Ellis Ross reinforced that in last night's episode of  Black-ish when her character Rainbow ('Bow') and her husband, Dre (played by Anthony Johnson), take a quiz about postpartum depression. Dre reads the first question: 'Do you feel sad, hopeless, overwhelmed, empty?' Through tears, Bow responds: 'Ah…well, I feel sad and I feel hopeless and I feel overwhelmed, but I don't feel empty, so I guess it's a no for me.' It's a moment of comic relief in the scene—while Rainbow is clearly depressed, this moment pokes fun at her just falling short of being the total package of PPD misery. 'I think that comedy can help us shine a light on important mental health issues, when it is done responsibly,' says Mike Fraser, a psychologist chief of staff at Behavioral Associates in New York. 'Comedy that aims for the easy laugh by poking fun at people struggling with real mental health issues obviously doesn't help. But if it can bring exposure to issues that millions of people battle—often in isolation—comedy can open the door for people to get an important conversation going and possibly even reach out for help.'  Watch this from Tonic:As Dan and I watched the episode, we were both especially struck by a scene in which Dre and Bow visit a doctor about her perceived condition. They discuss Bow's symptoms—anxiety, insomnia, crying and constant insecurity—and the doctor assures her they are not only normal, but treatable. Bow represents a huge number of women when she tells the doctor, 'I can get through this, I don't want medication.' Her doctor replies, 'postpartum depression is a mood disorder—it's not just something you can power through—and it's not something you should be ashamed of.'  It's the postpartum PSA of my goddamn dreams—one I wish they would play in place of the over simplified, condescending crap most women hear before exiting the hospital with a new baby. Dan points out that the scene represents a support system some women will never see in real life. And I knew he was right: My own experience with a doctor who listened to my symptoms and then put her hands on my knees and told me to go for a brisk walk was proof enough. Even though I had dealt with depression and anxiety for most of my adult life, her patronization made me second guess what I knew to be facts. Lucky for me, by the time I made it to the car my support system (hi mom) was on other end of the phone while I screamed, 'Fuck that lady!' I knew that exercise couldn't 'cure' what science has shown to be chemical. But I was pissed for the people who would accept her words as truth. Hours of therapy and one daily dose of Lexapro later, I still struggle with bouts of depression and many of my impulses from those early days of motherhood have hung around. But each night when Dan locks the front door and I walk over to touch it—a habit rooted in obsessive compulsive tendencies—we both laugh. 'I know, I know,' I say, smiling as I reach out my hand. It's a necessary ritual for me to feel at ease, but the lightness we sometimes inject into it makes it feel less clinical. And that's what  Black-ish offers its viewers when Dre cozies up next to Bow on the couch and rattles off questions from a women's magazine quiz, giving her props on her near perfect score (and while that equates to depressed as hell, Bow is never one to bomb a quiz of any type). There were some scenes that were a little too neatly tied up, particularly one where Bow's mother-in-law Ruby apologizes for her vast misconceptions about PPD (which included comments like, 'I didn't go to some quack doctor because I was mentally ill with some made up disease'). The turnaround between her blatant ignorance and acceptance is a bit quick and, sadly, unrealistic, but it sticks with the show's intent of bringing a serious issue to an accessible 30-minute comedic platform. After all, I realized, most people don't discuss mental health on the regular as we do in our household. Shows like  Black-ish might be the open door they've been needing to walk through.Read This Next: Chrissy Teigen, Postpartum Depression, and Trump""", 'depression')

In [148]:
import pickle

with open('model/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('model/tokenizer.pickle', 'rb') as handle:
    tok = pickle.load(handle)

In [159]:
# tokenizer.fit_on_texts([example[0]])
sequences = Tokenizer.texts_to_sequences(tok, [ex2[0]])
inp = pad_sequences(sequences, maxlen=max_seq_len)

In [174]:
preds = model.predict(inp)

In [175]:
preds

array([[2.99151358e-03, 1.22784439e-03, 6.71361398e-04, 7.51021900e-04,
        3.64817469e-03, 1.29765668e-03, 6.43520756e-03, 4.26253974e-02,
        6.49474503e-04, 8.24139174e-03, 4.44732327e-03, 2.44179601e-03,
        6.78680558e-03, 1.96080375e-03, 1.21848006e-02, 6.70902571e-03,
        1.18469041e-04, 1.40894833e-03, 5.10199089e-03, 1.65965185e-02,
        1.66625157e-03, 7.07344338e-03, 1.07228020e-02, 1.27576559e-03,
        2.46457197e-02, 4.17325133e-03, 5.43171866e-03, 6.65167463e-04,
        7.00661563e-04, 1.37496961e-03, 7.40222726e-03, 4.82650893e-03,
        7.73572770e-04, 4.59399726e-03, 1.42653298e-03, 2.77824095e-03,
        7.94040097e-04, 1.86302897e-03, 2.69145984e-03, 1.49809371e-03,
        3.88143235e-03, 6.39630156e-03, 2.62805424e-03, 6.10506628e-04,
        6.24002889e-04, 1.36781146e-03, 1.57610408e-03, 6.54631946e-03,
        5.13481733e-04, 1.68345275e-03, 7.93699000e-04, 1.35860080e-03,
        2.98741460e-03, 6.84177410e-03, 1.27164200e-02, 5.121910

trans    1
Name: topic, dtype: int64

In [128]:
topics[181]

'ukraine'

In [76]:
model.save('./model/topic_model.h5')  # creates a HDF5 file 'my_model.h5'
del model

In [115]:
'depression' in topics

True