# Politics vs. Sport (vs. Other) classification of Tweets

In [1]:
require(tm)
require(e1071)

Loading required package: tm
Loading required package: NLP
Loading required package: e1071


Load data: a few tweets for each class.

In [2]:
tp=c("Mainstream (FAKE) media refuses to state our long list of achievements, including 28 legislative signings, strong borders &amp; great optimism!","Looking forward to RALLY in the Great State of Pennsylvania tonight at 7:30. Big crowd, big energy!",".@LouDobbs just stated that President Trump's successes are unmatched in recent presidential history  Thank you Lou!","North Korea disrespected the wishes of China &amp; its highly respected President when it launched, though unsuccessfully, a missile today. Bad!")
ts=c("The day after the night before ... Jose Mourinho's Manchester United handed tough #ChampionsLeague draw", "Super-sub @XS_11official sinks @ManUtd in #LIVMUN derby to leave @LFC top of the #EPL https://cnn.it/2Er3xU5", "Horror crash for @MarcGisin at #valgardena downhill won by Aleksander Aamodt Kilde #fisalpine https://cnn.it/2QvrAYY ", "He won 59 England caps and 10 major trophies with Man Utd, but Phil Neville says winning the Women’s World Cup with England would be his “greatest achievement” ", "⚡ @FIAFormulaE is back! ⚡ @NickiShields caught up with @MassaFelipe19 who's making his debut in the electric championship this weekend. https://cnn.it/2S0vENP ")
to=c("A Very Terry Christmas needs happy little trees.", "Matt Damon's favorite day of the year? Secret. Santa. Day.", "As if the two-night #VoiceFinale event wasn't going to be L-I-T already... we went and did THIS. 🔥", "Don’t miss my #GameofGames Holiday Spectacular, tonight! @NBC")

Build a dataframe with labels.

In [3]:
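# build the data frame directly: text stays character, label is a factor (needed for classification)
d = data.frame(text=c(tp, ts, to),
               label=factor(c(rep("politics",length(tp)), rep("sport",length(ts)), rep("other",length(to)))),
               stringsAsFactors=FALSE)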
dim(d)

Define preprocessing function:
* performs: lowercasing, removal of punctuation and digits, bag-of-words counting
* input is a vector of documents
* optional input is a vector of words: if non-empty, the columns of the output data frame are exactly those words (words absent from the documents get all-zero columns)

In [20]:
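preprocess = function(tweets, words=c()) {
    corpus = Corpus(VectorSource(tweets))
    # tm_map returns a new corpus, so its result must be assigned back
    corpus = tm_map(corpus, content_transformer(tolower))
    # keep letters and spaces only; removing spaces would merge all words into one
    removeNumPunct <- function(x) gsub("[^a-z ]", "", x)
    corpus = tm_map(corpus, content_transformer(removeNumPunct))
    # one row per document, one column per term
    bow = as.data.frame(t(as.matrix(TermDocumentMatrix(corpus))))
    if (length(words)>0) {
        # add requested words that do not occur in these documents as all-zero columns
        for (name in setdiff(words, names(bow))) {
            bow[,name] = 0
        }
        # keep only the requested words, in the requested order
        bow = bow[, words, drop=FALSE]
    }
    bow
}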
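To make the role of the optional words argument concrete, here is a minimal base-R sketch of the same idea. The helper `toy_bow` is hypothetical and not part of the notebook; it does not use tm and does not reproduce tm's tokenization details. It only illustrates that documents processed with a fixed vocabulary get the same columns, in the same order, as the training data.

```r
# Hypothetical helper (not from the notebook): bag of words in base R.
toy_bow = function(docs, words=c()) {
    # lowercase, then keep letters and spaces only
    cleaned = gsub("[^a-z ]", "", tolower(docs))
    tokens = lapply(strsplit(cleaned, " +"), function(tok) tok[tok != ""])
    # use the given vocabulary if any, otherwise build it from the documents
    vocab = if (length(words) > 0) words else sort(unique(unlist(tokens)))
    # count each vocabulary word per document; out-of-vocabulary words are ignored
    counts = vapply(tokens,
                    function(tok) as.integer(table(factor(tok, levels=vocab))),
                    integer(length(vocab)))
    bow = as.data.frame(t(matrix(counts, nrow=length(vocab))))
    names(bow) = vocab
    bow
}

train = toy_bow(c("Big crowd, big energy!", "Great win tonight"))
test = toy_bow("A big WIN, at last", names(train))
print(train)
print(test)
```

Words seen only at prediction time ("a", "at", "last") are dropped, and training words missing from the new document appear as zero counts, so the classifier always receives features in the layout it was trained on.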

Define learning function:
* takes a labeled dataset with the label in the `label` column and the text in the `text` column
* uses a linear-kernel SVM
* returns a list with the learned SVM (classifier) and the vector of words used as features (words)

In [5]:
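learn = function(labeledTweets) {
    bow = preprocess(labeledTweets$text)
    # svm() only does classification when the response is a factor
    classifier = svm(bow, as.factor(labeledTweets$label), kernel="linear")
    # keep the vocabulary so new tweets can be mapped to the same columns
    list(classifier=classifier, words=names(bow))
}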

Define prediction function:
* takes a vector of documents to classify and a model (classifier and words)
* returns the predicted labels

In [23]:
predictForTweets = function(tweets, model) {
    bow = preprocess(tweets, model$words)
    labels = predict(model$classifier, bow)
    labels
}

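Try learning and prediction: hold out three tweets (one per class), train on the rest, and tabulate the predicted labels (rows) against the true labels (columns).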

In [24]:
learn.d = d[-c(1,5,10),]
test.d =  d[c(1,5,10),]
model = learn(learn.d)
predicted = predictForTweets(test.d$text, model)
table(predicted, test.d$label)

          
predicted  other politics sport
  other        1        0     0
  politics     0        1     1
  sport        0        0     0
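
The table above is a confusion matrix rather than an accuracy figure. As a small sketch, the accuracy can be derived from it as the share of counts on the diagonal. The matrix `cm` below is copied by hand from the output above (in the notebook one would use the result of the `table(...)` call instead); with only three test tweets the number is indicative at best.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: true label)
classes = c("other", "politics", "sport")
cm = matrix(c(1, 0, 0,
              0, 1, 1,
              0, 0, 0),
            nrow=3, byrow=TRUE,
            dimnames=list(predicted=classes, actual=classes))

# accuracy: correctly classified tweets (diagonal) over all test tweets
accuracy = sum(diag(cm)) / sum(cm)
# per-class recall: correct predictions over the number of true members of each class
recall = diag(cm) / colSums(cm)

print(accuracy)
print(recall)
```

Here two of the three held-out tweets are classified correctly; the sport tweet is predicted as politics.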