---
title: "Logistic regression in R"
output:
  pdf_document:
    toc: yes
    toc_depth: 2
  html_document:
    theme: readable
    toc: yes
    toc_depth: 3
date: "May 09, 2016"
---

```{r setup, include = FALSE}
# my set-up
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, fig.align = 'center')

require(dplyr)
require(ggplot2)
require(GGally)
require(caret)

setwd('~/projects/BI-TECH-CP303/projects/project 2')

twitter = read.delim('./data/bot_or_not.tsv',
                     sep = '\t',
                     header = TRUE)
```

# Introduction

For the second project we'll explore user data from Twitter to identify accounts
likely belonging to bots. The data set has variables about profile configuration
(`default_profile`, `default_profile_image`), connectivity (`friends_count`,
`followers_count`), and some information about the nature of the accounts' tweets
(`diversity`, `mean_mins_between_tweets`). Additionally, there's an outcome
variable called `bot` that denotes whether the account belongs to a bot
(`bot == 1`) or to a human (`bot == 0`).

```{r read-data, eval = FALSE}
library(dplyr)
library(ggplot2)
library(GGally)
library(caret)

twitter = read.delim('bot_or_not.tsv',
                     sep = '\t',
                     header = TRUE)
```

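Since `bot` is the outcome we'll be predicting, it's worth a quick look at how
balanced the two classes are before modeling anything. This is just a minimal
sanity check; the counts will depend on your copy of `bot_or_not.tsv`.

```{r class-balance}
# how many bots vs. humans are in the data, and in what proportions?
table(twitter$bot)
prop.table(table(twitter$bot))
```
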
# Exploratory data analysis

We've got a brand new data set, so let's familiarize ourselves with it by
conducting an exploratory data analysis. Let's start by summarizing the whole
data set to see what the variable values are.

```{r}
summary(twitter)
```

From the summary, we can see that several columns are really categorical
variables: `bot`, `default_profile`, `default_profile_image`, and `geo_enabled`.
Before exploring further, let's tell R that those columns represent
categorical variables by converting them to factors.

```{r}
twitter$bot = factor(twitter$bot)
twitter$default_profile = factor(twitter$default_profile)
twitter$default_profile_image = factor(twitter$default_profile_image)
twitter$geo_enabled = factor(twitter$geo_enabled)

summary(twitter)
```

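If you ever have more than a handful of columns to convert, the same thing can
be done in one step. This is just an equivalent base R sketch of the four
`factor()` calls above, not a required change, so it's left unevaluated.

```{r factors-alt, eval = FALSE}
# convert several categorical columns to factors at once
# (equivalent to the individual factor() calls above)
categorical_columns = c('bot', 'default_profile', 'default_profile_image', 'geo_enabled')
twitter[categorical_columns] = lapply(twitter[categorical_columns], factor)
```
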
Like before, we can evaluate many relationships simultaneously with `ggpairs()`.

```{r ggpairs, fig.align = 'center', fig.width = 10, fig.height = 10}
# inspect many trends with ggpairs
ggpairs(twitter[ , c('followers_count', 'friends_count', 'account_age_hours',
                     'diversity', 'statuses_count', 'bot')])
```

Once we have some initial hypotheses we can make more specific plots.

```{r eda}
ggplot(twitter, aes(x = followers_count, fill = bot)) +
  geom_histogram()

# some people have a lot of followers, but most don't. We need to lop off
# the long tail so we can see the distribution better
ggplot(filter(twitter, followers_count < 100),
       aes(x = followers_count, fill = bot)) +
  geom_histogram()

ggplot(filter(twitter, followers_count < 100),
       aes(x = followers_count, fill = bot)) +
  geom_histogram() +
  facet_grid(bot ~ .)

# how about the number of people they follow?
ggplot(twitter, aes(x = friends_count, fill = bot)) +
  geom_histogram() +
  facet_grid(bot ~ .)

# it's a little hard to see, so trim the long tail again
ggplot(filter(twitter, friends_count < 2500),
       aes(x = friends_count, fill = bot)) +
  geom_histogram() +
  facet_grid(bot ~ .)

# what about account age?
ggplot(twitter, aes(x = account_age_hours, fill = bot)) +
  geom_histogram() +
  facet_grid(bot ~ .)

# lexical diversity
ggplot(twitter, aes(x = diversity, fill = bot)) +
  geom_histogram() +
  facet_grid(bot ~ .)

# what are the average diversity values for bots and humans?
avg_diversity =
  twitter %>%
  group_by(bot) %>%
  summarize(avg_diversity = mean(diversity, na.rm = TRUE))

# add the group means to the plot
ggplot(twitter, aes(x = diversity, fill = bot)) +
  geom_histogram() +
  geom_vline(data = avg_diversity, aes(xintercept = avg_diversity)) +
  facet_grid(bot ~ .)
```

## Feature engineering

Feature engineering is the process of creating predictor variables using domain
knowledge. We can test hypotheses about the importance of various relationships
by creating new predictors that help interrogate those relationships. For example,
you might hypothesize that the relationship between the number of tweets an account
makes and its lexical diversity is relevant to the model. To test that, make a new
categorical variable indicating whether an account holder is a 'heavy tweeter',
'medium tweeter', or 'light tweeter':

```{r feature-engineering}
# the number of tweets per account has a long tail
ggplot(twitter, aes(x = statuses_count)) +
  geom_histogram()

# break the accounts into three categories using the quartiles
quantile(twitter$statuses_count)

# light tweeters are the bottom 25%, medium tweeters the middle 50%,
# and heavy tweeters the top 25% (cut points taken from the quantiles above)
twitter$tweet_volume = NA
twitter$tweet_volume = ifelse(twitter$statuses_count <= 188,
                              'Light Tweeter',
                              twitter$tweet_volume)
twitter$tweet_volume = ifelse((twitter$statuses_count > 188 & twitter$statuses_count <= 2646),
                              'Medium Tweeter',
                              twitter$tweet_volume)
twitter$tweet_volume = ifelse(twitter$statuses_count > 2646,
                              'Heavy Tweeter',
                              twitter$tweet_volume)
twitter$tweet_volume = factor(twitter$tweet_volume,
                              levels = c('Light Tweeter', 'Medium Tweeter', 'Heavy Tweeter'))

# plot it!
ggplot(twitter, aes(x = statuses_count)) +
  geom_histogram(aes(fill = tweet_volume), bins = 100)

# update the diversity figure, now split by tweet volume as well
avg_diversity =
  twitter %>%
  group_by(bot, tweet_volume) %>%
  summarize(avg_diversity = mean(diversity, na.rm = TRUE))

ggplot(twitter, aes(x = diversity, fill = bot)) +
  geom_histogram() +
  geom_vline(data = avg_diversity, aes(xintercept = avg_diversity)) +
  facet_grid(bot ~ tweet_volume)
```

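As a side note, the cut points above are hard-coded from the `quantile()`
output. If you'd rather have the boundaries computed from the data, here's a
roughly equivalent sketch using `cut()`. It's left unevaluated so it doesn't
overwrite the `tweet_volume` variable created above.

```{r tweet-volume-cut, eval = FALSE}
# same idea, but with the quartile cut points computed from the data
cut_points = quantile(twitter$statuses_count, probs = c(0, 0.25, 0.75, 1), na.rm = TRUE)

twitter$tweet_volume = cut(twitter$statuses_count,
                           breaks = cut_points,
                           labels = c('Light Tweeter', 'Medium Tweeter', 'Heavy Tweeter'),
                           include.lowest = TRUE)
```
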
# Logistic Regression

## Training and testing sets

```{r train-test, warning = FALSE, message = FALSE}
set.seed(243)

# drop rows with missing values so the models are fit on complete cases
twitter = na.omit(twitter)

# select the training observations
in_train = createDataPartition(y = twitter$bot,
                               p = 0.75, # 75% in train, 25% in test
                               list = FALSE)

train = twitter[in_train, ]
test = twitter[-in_train, ]
```

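`createDataPartition` samples within each level of `bot`, so the bot/human
proportions should be roughly the same in both sets. A quick sanity check
(optional, just to confirm the split behaved as expected):

```{r check-split}
# proportions of bots and humans in the training and testing sets
prop.table(table(train$bot))
prop.table(table(test$bot))
```
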
## Training logistic regressions

Check out [this page](http://topepo.github.io/caret/Logistic_Regression.html)
for more types of logistic regression to try out.

```{r logistic-model, warning = FALSE}
logistic_model = train(bot ~ .,
                       data = train,
                       method = 'glm',
                       family = binomial,
                       preProcess = c('center', 'scale'))

summary(logistic_model)
plot(varImp(logistic_model))

# test predictions
logistic_predictions = predict(logistic_model, newdata = test)
confusionMatrix(logistic_predictions, test$bot)
```

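The coefficients in the `summary()` output are on the log-odds scale. If you'd
rather interpret them as odds ratios, you can exponentiate the coefficients of
the underlying `glm` fit. A small sketch; keep in mind that because we centered
and scaled the predictors, these are odds ratios per one standard deviation
change in each predictor:

```{r odds-ratios}
# odds ratios for the fitted logistic regression; values above 1 increase the
# odds that an account is a bot (per 1 SD change, since predictors were scaled)
exp(coef(logistic_model$finalModel))
```
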
There are subset selection methods for logistic regression as well. Try out
`method = 'glmStepAIC'`:

```{r step-model, message = FALSE, results = 'hide'}
# stepwise logistic regression
step_model = train(bot ~ .,
                   data = train,
                   method = 'glmStepAIC',
                   family = binomial,
                   preProcess = c('center', 'scale'))
```

```{r step-results}
summary(step_model)

step_predictions = predict(step_model, newdata = test)
confusionMatrix(step_predictions, test$bot)
```

How do the models compare?

```{r compare-models}
# compare
results = resamples(list(logistic_model = logistic_model,
                         step_model = step_model))

# compare accuracy and kappa
summary(results)

# plot results
dotplot(results)
```

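Accuracy and kappa aren't the only ways to compare classifiers. Since logistic
regression produces class probabilities, you could also look at an ROC curve on
the test set. This is a rough sketch, not part of the required analysis: it
assumes the `pROC` package is installed (it isn't loaded anywhere above) and
that the probability of the bot class is in the `'1'` column of the prediction
data frame, so it's left unevaluated.

```{r roc-sketch, eval = FALSE}
library(pROC)

# predicted probabilities of being a bot for the held-out accounts
logistic_probs = predict(logistic_model, newdata = test, type = 'prob')

# ROC curve and area under the curve on the test set
roc_curve = roc(response = test$bot, predictor = logistic_probs[, '1'])
plot(roc_curve)
auc(roc_curve)
```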