Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
249 lines (197 sloc) 7.31 KB
---
title: "Logistic regression in R"
output:
pdf_document:
toc: yes
toc_depth: 2
html_document:
theme: readable
toc: yes
toc_depth: 3
date: "May 09, 2016"
---
```{r setup, include = FALSE}
# my set-up
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, fig.align = 'center')
require(dplyr)
require(ggplot2)
require(GGally)
require(caret)
setwd('~/projects/BI-TECH-CP303/projects/project 2')
twitter = read.delim('./data/bot_or_not.tsv',
sep = '\t',
header = TRUE)
```
# Introduction
For the second project we'll explore user data from Twitter to identify accounts
likely belonging to bots. The data set has variables about profile configuration
(`default_profile`, `default_profile_image`), connectivity (`friends_count`,
`followers_count`), and some information about the nature of their tweets
(`diversity`, `mean_mins_between_tweets`). Additionally, there's an outcome
variable called `bot` that denotes whether the account belongs to a bot
(`bot == 1`) or to a human (`bot == 0`).
```{r read-data, eval = FALSE}
library(dplyr)
library(ggplot2)
library(GGally)
library(caret)
twitter = read.delim('bot_or_not.tsv',
sep = '\t',
header = TRUE)
```
# Exploratory data analysis
We've got a brand new data set, so let's familiarize ourselves by
conducting an exploratory data analysis. Let's start by summarizing the whole
data set to see what the variable values are.
```{r}
summary(twitter)
```
From the summary, we can see that there are a couple `factor` variables in the
data set, `bot`, `default_profile`, `default_profile_image` and `geo_enabled`.
Before exploring further, let's first tell R that those columns represent
categorical variables.
```{r}
twitter$bot = factor(twitter$bot)
twitter$default_profile = factor(twitter$default_profile)
twitter$default_profile_image = factor(twitter$default_profile_image)
twitter$geo_enabled = factor(twitter$geo_enabled)
summary(twitter)
```
Like before, we can evaluate many relationships simultaneously with `ggpairs.`
```{r ggpairs, fig.align = 'center', fig.width = 10, fig.height = 10}
# inspect many trends with ggpairs
ggpairs(twitter[ , c('followers_count', 'friends_count', 'account_age_hours',
'diversity', 'statuses_count', 'bot')])
```
Once we have some initial hypotheses we can make more specific plots.
```{r eda}
ggplot(twitter, aes(x = followers_count, fill = bot)) +
geom_histogram()
# Some people have a lot of followers, but most don't. we need to lob off
# the long tail so we can see the distribution better
ggplot(filter(twitter, followers_count < 100),
aes(x = followers_count, fill = bot)) +
geom_histogram()
ggplot(filter(twitter, followers_count < 100),
aes(x = followers_count, fill = bot)) +
geom_histogram() +
facet_grid(bot ~.)
# how about the number of people they follow?
ggplot(twitter, aes(x = friends_count, fill = bot)) +
geom_histogram() +
facet_grid(bot ~.)
# it's a little hard to see
ggplot(filter(twitter, friends_count < 2500),
aes(x = friends_count, fill = bot)) +
geom_histogram() +
facet_grid(bot ~.)
# what about account age?
ggplot(twitter, aes(x = account_age_hours, fill = bot)) +
geom_histogram() +
facet_grid(bot ~.)
# lexical diversity
ggplot(twitter, aes(x = diversity, fill = bot)) +
geom_histogram() +
facet_grid(bot ~.)
# what are the average values?
avg_diversity =
twitter %>%
group_by(bot) %>%
summarize(avg_diversity = mean(diversity, na.rm = TRUE))
# add it to the plot
ggplot(twitter, aes(x = diversity, fill = bot)) +
geom_histogram() +
geom_vline(data = avg_diversity, aes(xintercept = avg_diversity)) +
facet_grid(bot ~.)
```
## Feature engineering
Feature engineering is the process of creating predictor variables using domain
knowledge. We can test hypotheses about the importance of various relationships
by creating new predictors that help interrogate those relationships. For example,
you might hypothesize a relationship between the number of tweets made
and the lexical diversity that is relevant to model. To test that, make a new
categorical variable indicating whether an account holder is a 'heavy tweeter', 'medium tweeter'
or 'light tweeter':
```{r feature-engineering}
# the number of tweets per account has a long tail
ggplot(twitter, aes(x = statuses_count)) +
geom_histogram()
# break into three categories by quantile
quantile(twitter$statuses_count)
# low tweeters will be the bottom 25%,
twitter$tweet_volume = NA
twitter$tweet_volume = ifelse(twitter$statuses_count <= 188,
'Light Tweeter',
twitter$tweet_volume)
twitter$tweet_volume = ifelse((twitter$statuses_count > 188 & twitter$statuses_count <= 2646),
'Medium Tweeter',
twitter$tweet_volume)
twitter$tweet_volume = ifelse(twitter$statuses_count > 2646,
'Heavy Tweeter',
twitter$tweet_volume)
twitter$tweet_volume = factor(twitter$tweet_volume, levels = c('Light Tweeter', 'Medium Tweeter', 'Heavy Tweeter'))
# plot it!
ggplot(twitter, aes(x = statuses_count)) +
geom_histogram(aes(fill = tweet_volume), bins = 100)
# update the figure
avg_diversity =
twitter %>%
group_by(bot, tweet_volume) %>%
summarize(avg_diversity = mean(diversity, na.rm = TRUE))
ggplot(twitter, aes(x = diversity, fill = bot)) +
geom_histogram() +
geom_vline(data = avg_diversity, aes(xintercept = avg_diversity)) +
facet_grid(bot ~ tweet_volume)
```
# Logisitic Regression
## Training and testing sets
```{r train-test, warning = FALSE, message = FALSE}
set.seed(243)
twitter = na.omit(twitter)
# select the training observations
in_train = createDataPartition(y = twitter$bot,
p = 0.75, # 75% in train, 25% in test
list = FALSE)
train = twitter[in_train, ]
test = twitter[-in_train, ]
```
## Training logisitic regressions
Check out [this page](http://topepo.github.io/caret/Logistic_Regression.html)
for more types of logistic regression to try out.
```{r logistic-model, warning = FALSE}
logistic_model = train(bot ~ .,
data = train,
method = 'glm',
family = binomial,
preProcess = c('center', 'scale'))
summary(logistic_model)
plot(varImp(logistic_model))
# test predictions
logistic_predictions = predict(logistic_model, newdata = test)
confusionMatrix(logistic_predictions, test$bot)
```
There are subset selection methods for logistic regression as well. Try out
`method = 'glmStepAIC'`:
```{r step-model, message = FALSE, results = 'hide'}
# stepwise logisitic regression
step_model = train(bot ~ .,
data = train,
method = 'glmStepAIC',
family = binomial,
preProcess = c('center', 'scale'))
```
```{r step-results}
summary(step_model)
step_predictions = predict(step_model, newdata = test)
confusionMatrix(step_predictions, test$bot)
```
How do the models compare?
```{r compare-models}
# compare
results = resamples(list(logistic_model = logistic_model,
step_model = step_model))
# compare accuracy and kappa
summary(results)
# plot results
dotplot(results)
```
You can’t perform that action at this time.