# Twitter Data Retrieval and Pre-processing

The general procedure is:

- [Retrieve the data from the Twitter API](#Retrieving-data-from-Twitter-API)
- [Look for previously downloaded data and load it to a temp data frame](#Look-for-previously-data,-load-it-to-a-temp-data-frame)
- [Create a tweet search function](#Create-a-tweet-search-function)
- [Perform some checks on the combined data](#Perform-some-checks-on-the-combined-data)
- [Write the data to CSV and Rda files](#Write-the-data-to-CSV-and-Rda-files)


## Retrieving data from Twitter API

In [1]:
# Clear all variables and close any open graphics devices
# (graphics.off() does not fail when no device is open)
rm(list=ls())
graphics.off()

#### Descriptions of packages to be used

  twitteR: Twitter Web API client that provides an interface to Twitter data  
  ROAuth:  Allows users to authenticate to the server via OAuth  
  RCurl:   General network (HTTP/FTP/..) client interface for R  
  httr:    Tools for working with URLs and HTTP  
  lubridate: Make dealing with dates easier  

In [2]:
#  Load the required packages (if packages are not available, 
#  then install them first)
for (package in c('twitteR', 'ROAuth', 'RCurl', 
                  'httr', 'plyr', 'dplyr', 'stringr', 'lubridate')) {
  if (!require(package, character.only=T, quietly=T)) {
    invisible(install.packages(package))
    library(package, character.only=T)
  }
}

invisible(install.packages("base64enc", dependencies=T))
library(base64enc)

# Confirm that all packages got loaded
length(search()) # 20 items


Attaching package: ‘plyr’

The following object is masked from ‘package:twitteR’:

    id


Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:twitteR’:

    id, location

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘lubridate’

The following object is masked from ‘package:plyr’:

    here

The following object is masked from ‘package:base’:

    date

Installing package into ‘/Users/cynthiacorrea/Library/R/3.3/library’
(as ‘lib’ is unspecified)



The downloaded binary packages are in
	/var/folders/yb/zv0jpymj5yn6bqf2s2x4yjx80000gn/T//Rtmp25wNNN/downloaded_packages


In [3]:
#  Declare Local Variables for previously downloaded Twitter data
file.path <- "/Users/cynthiacorrea/TwitterProject/Jupyter/" 

To access its data, Twitter requires you to create an application. Creating the application generates private keys that give secure access to the data. 

In [4]:
#  OAuth settings for twitter:
#  library(httr)
oauth_endpoints("twitter")

<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token

In [5]:
#  Access keys from Twitter application (API) at https://apps.twitter.com/
#  THESE KEYS SHOULD BE KEPT PRIVATE, so they are read from environment
#  variables instead of being written into the notebook.
#  The variable names below are assumptions; set them to your own keys,
#  for example in ~/.Renviron.

api_key <- Sys.getenv("TWITTER_API_KEY")
api_secret <- Sys.getenv("TWITTER_API_SECRET")
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_token_secret <- Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET")

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

[1] "Using direct authentication"


After connecting to the API, you submit a request for data based on a search string and other parameters, such as the maximum number of tweets, the date range, the geographic location (latitude, longitude, and radius), and the language.


Sample Twitter Search Command

      searchTwitter('Donald Trump', n=10000, 
                    since=format(Sys.Date()-1), until=format(Sys.Date()), 
                    geocode='39.8,-95.583068847656,2500km',
                    retryOnRateLimit=25, lang="en")

In [6]:
tweet.db.start.date <- "2016-11-07"  # First day of tweets database creation (November 7th, 2016)

orig.run.flag <- FALSE  # flag to indicate initial search for new tweets
re.run.flag <- TRUE     # flag to indicate re-run for tweets search

run.date <- Sys.Date()-1   # default run date set to yesterday
re.check.date <- run.date  # holds first date within the dataset
end.date <- Sys.Date()     # default end date set to today

print(c('end.date', end.date))    # c() coerces the Date to its day count since 1970-01-01

no.of.tweets <- 10000   # number of tweets set to 10,000
search.str <- c('Donald Trump', 'Hillary Clinton')

temp.combined.tweets <- NULL

# The following warning messages may be displayed after twitter search is
# executed any time. These can be ignored.
#
# Warning messages:
#   1: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
#                        10000 tweets were requested but the API can only return 0
#   2: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
#                                           10000 tweets were requested but the API can only return 0
#   3: In rbind_all(x) : Unequal factor levels: coercing to character

[1] "end.date" "17113"   


## Look for previously data, load it to a temp data frame 

If previously downloaded data exists in a file named "DownloadedTweets.Rda", it is loaded into a temp data frame named **"temp.combined.tweets"**. I retrieve the first and last dates in that data and update the date variables so that a new batch of data can be downloaded. I then create a data frame of search parameters with one row per day and candidate, holding the search dates, the number of tweets to request, and the highest tweet ID already stored for that candidate and day. This is all stored in **"search.parms.df"**. 

Otherwise, if there is no previously downloaded data, the search parameters data frame is created for every day from the database start date (**"tweet.db.start.date"**) up to the current day, without tweet IDs.
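
As a rough illustration of how such a parameter table comes together, the sketch below builds one row per day and search string for a fixed date range. The dates and the names `first.day`, `last.day`, and `parms` are made up for the example, and only base R is used.

```r
# Hypothetical date range, chosen only for illustration
first.day <- as.Date("2016-11-05")
last.day  <- as.Date("2016-11-08")   # plays the role of end.date

# One 'since' date per day in the range, excluding the end date itself
m <- as.numeric(last.day - first.day)     # number of days to cover
date.since <- format(last.day - (m:1))    # oldest day first

# All combinations of day, tweet count, and search string
parms <- expand.grid(date.since   = date.since,
                     no.of.tweets = 10000,
                     search.str   = c("Donald Trump", "Hillary Clinton"))

# Each search window is one day long: 'until' is the day after 'since'
parms$date.until <- format(as.Date(as.character(parms$date.since)) + 1)

nrow(parms)   # days x search strings
parms
```

Every row then describes a single one-day search window for one search string, which is the shape the later cells loop over.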

In [7]:
# Search for the tweets data in the local directory referenced
# by the variable file.path set earlier.
#
if (file.exists(paste0(file.path, "DownloadedTweets.Rda"))) {
  
  print("Found previously saved data file")
  
  load(file=paste0(file.path, "DownloadedTweets.Rda"))
  print("Marker -5")
  
  # Save the loaded data frame to a temporary data frame
  temp.combined.tweets <- combined.tweets
    
  print('Dimensions of temp.combined.tweets')    ###!
  dim(temp.combined.tweets)
  
  # Since data exists for previous day(s), re-check will be required
  re.run.flag <- TRUE
  print("Marker -4")
  
  # Get the last date in the downloaded dataset
  created.max <- as.Date(max(combined.tweets$created))
  
  # Get the first date in the downloaded dataset
  re.check.date <- as.Date(min(combined.tweets$created))
  print(c("Marker -3, created.max", created.max))
  
  # run date is set to day after the last date in the dataset
  run.date <- created.max + 1
  
  # If run date is on or after the end date, then
  # tweets for all dates have already been searched previously.
  # Set the simple search for tweets flag to FALSE; else,
  # set the simple search for tweets flag to TRUE
  if (run.date >= end.date) {
    orig.run.flag <- FALSE
  } else {
    orig.run.flag <- TRUE
  }
  print("Marker -2, end.date, re.check.date")
  print(end.date)
  print(re.check.date)
  # Create the parameters table
  
  # m is the difference between the current/end date and the first
  # date within the dataset
  m <- as.numeric(end.date - re.check.date)
  print(c("Marker -1, m=",m))
  
  # Set a sequence from one to the difference of above dates.
  # This is done to set the dates in the reverse chronological order.
  from.date <- m:1
  print('marker0')
  
  # Set the first three parameters of the parameter table
  date.from <- Sys.Date() - from.date
  from.days <- m:1   # Round 2: 10/17/2016
  date.since <- format(Sys.Date()-from.days)
  
  # Create a data frame containing all search parameters 
  # from all combinations of - date.since, no.of.tweets, and search.str
  search.parms.df <- data.frame(expand.grid(date.since = date.since, 
                                            no.of.tweets = no.of.tweets, 
                                            search.str = search.str))
  
  # Extract and set the day value of date since column
  search.parms.df$day.of.month <- mday(search.parms.df$date.since)
  print('marker1')
  
  # add 'date.until' after creating the above frame - 
  # just a day's difference is required from 'date.since' for each row
  search.parms.df$date.until <- format(Sys.Date()-(from.days-1))
  
  # For each candidate, get the maximum tweet ID for each day  
  data.aggr <- aggregate(id ~ search.str+mday(created), 
                         data=combined.tweets, max)
  names(data.aggr) <- c("search.str", "day.of.month", "since.id")
  print('marker2')
  
  # Merge the above data frame with the parameters table frame
  search.parms.df <- merge(x=search.parms.df, y=data.aggr,
                           by=c("search.str", "day.of.month"), all.x=T) 
  
} else {  # if the tweets data file does not exist
  
  print("No previously saved data file found")
  
  re.run.flag <- FALSE  # when no previous data exists, re-run not reqd
  
  # Run date is set to the database start date to search for all tweets
  # since that day
  run.date <- as.Date(tweet.db.start.date)
  
  orig.run.flag <- TRUE   # tweets for new dates need to be searched
  
  # m is the difference between the current/end date and the first
  # date within the dataset
  m <- as.numeric(end.date - run.date)
  
  from.date <- m:1
  
  # Set the first three parameters of the parameter table
  date.from <- Sys.Date() - from.date
  from.days <- m:1   # Round 2: 10/17/2016
  date.since <- format(Sys.Date()-from.days)
  
  # Create a data frame containing all search parameters 
  # from all combinations of - date.since, no.of.tweets, and search.str
  search.parms.df <- data.frame(expand.grid(date.since = date.since, 
                                            no.of.tweets = no.of.tweets, 
                                            search.str = search.str))
  
  # Extract and set the day value of date since column
  search.parms.df$day.of.month <- mday(search.parms.df$date.since)
  
  # add 'date.until' after creating the above frame - 
  # just a day's difference is required from 'date.since' for each row
  search.parms.df$date.until <- format(Sys.Date()-(from.days-1))
}

[1] "No previously saved data file found"


In [8]:
print('Dimensions of search.parms.df')    ###!
dim(search.parms.df)

[1] "Dimensions of search.parms.df"


Below, I perform some checks on the parameters data frame.

In [9]:
# Check the structure of parameters data frame
str(search.parms.df)
# 'data.frame':	72 obs. of  5 variables:
#  $ date.since  : Factor w/ 36 levels "2016-10-01","2016-10-02",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ no.of.tweets: num  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
#  $ search.str  : Factor w/ 2 levels "Donald Trump",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ day.of.month: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ date.until  : chr  "2016-10-02" "2016-10-03" "2016-10-04" "2016-10-05" ...

'data.frame':	2 obs. of  5 variables:
 $ date.since  : Factor w/ 1 level "2016-11-07": 1 1
 $ no.of.tweets: num  10000 10000
 $ search.str  : Factor w/ 2 levels "Donald Trump",..: 1 2
 $ day.of.month: int  7 7
 $ date.until  : chr  "2016-11-08" "2016-11-08"


In [10]:
# Modify all factor columns to character columns.
# This is required since the searchString input for searchTwitter function
# should be in character format.
# First, check which columns of parameters data frame are factor columns
# sapply is from family of apply functions available in the base package.
# It facilitates executing a function on each element of a 
# vector (list, matrix, or dataframe)
rowIndex <- sapply(search.parms.df, is.factor)

rowIndex
# date.since
# TRUE
# no.of.tweets
# FALSE
# search.str
# TRUE
# day.of.month
# FALSE
# date.until
# FALSE

In [11]:
print('search.parms.df')
head(search.parms.df[rowIndex])
#    date.since      search.str
# 1  2016-10-01    Donald Trump
# 2  2016-10-02    Donald Trump

[1] "search.parms.df"


date.since,search.str
2016-11-07,Donald Trump
2016-11-07,Hillary Clinton


In [12]:
# Then modify the factor columns to character columns
# lapply returns a list of the same length as X
# All columns applied for modifications using lapply are modified
search.parms.df[rowIndex] <- lapply(search.parms.df[rowIndex], 
                                    as.character)
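
To see the conversion in isolation, here is a small sketch on a toy data frame (the names `toy` and `toy2` are made up for the example). It also shows an alternative, under the assumption that the factor columns are not needed anywhere else: `expand.grid` accepts `stringsAsFactors = FALSE`, which avoids creating them in the first place.

```r
toy <- expand.grid(date.since = c("2016-11-06", "2016-11-07"),
                   search.str = c("Donald Trump", "Hillary Clinton"))

# expand.grid turns character inputs into factors by default
is.fac <- sapply(toy, is.factor)

# Convert only the factor columns to character
toy[is.fac] <- lapply(toy[is.fac], as.character)
sapply(toy, class)

# Alternative: skip the factors from the start
toy2 <- expand.grid(date.since = c("2016-11-06", "2016-11-07"),
                    search.str = c("Donald Trump", "Hillary Clinton"),
                    stringsAsFactors = FALSE)
sapply(toy2, class)
```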

In [13]:
print('dim of search.parms.df')
dim(search.parms.df)

[1] "dim of search.parms.df"


In [14]:
    
# Check the structure of parms
str(search.parms.df)

# 'data.frame':	72 obs. of  6 variables:
#  $ search.str  : chr  "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...
#  $ day.of.month: int  1 1 10 11 12 13 14 15 16 17 ...
#  $ date.since  : chr  "2016-10-01" "2016-11-01" "2016-10-10" "2016-10-11" ...
#  $ no.of.tweets: num  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...

#  rbind warning can be ignored.

'data.frame':	2 obs. of  5 variables:
 $ date.since  : chr  "2016-11-07" "2016-11-07"
 $ no.of.tweets: num  10000 10000
 $ search.str  : chr  "Donald Trump" "Hillary Clinton"
 $ day.of.month: int  7 7
 $ date.until  : chr  "2016-11-08" "2016-11-08"


In [15]:
# Display records in the parameters data frame
head(search.parms.df, 5)  # default is 6 rows

# search.str	day.of.month	date.since	no.of.tweets	date.until	since.id
# Donald Trump	1	2016-10-01	10000	2016-10-02	793603522440802304

date.since,no.of.tweets,search.str,day.of.month,date.until
2016-11-07,10000,Donald Trump,7,2016-11-08
2016-11-07,10000,Hillary Clinton,7,2016-11-08


## Create a tweet search function

Create a function to search tweets based on parameters such as the string to be searched, number of tweets, date from, date until, and since ID. If the date from, date until, and since ID values are not passed to this function, they default to NULL, which applies no date restriction and no since ID. 

The resultType parameter is not used, since the default 'mixed' option returns tweets that are both popular and recent for the given time period. The parameters geocode, lang, and retryOnRateLimit are hard coded: geocode restricts the search to a 2,500 km radius around a point in the central USA, lang restricts it to English, and retryOnRateLimit makes a rate-limited search block and retry up to 25 times.

Here, only original tweets are kept in **"ret.tweets.df"**.

In [16]:
tweets.df <- function(search.str, no.of.tweets, 
                      date.since = NULL, date.until = NULL,
                      since.id = NULL) {
  
  print (paste0("Searching ", no.of.tweets, " tweets for ..." ,
                search.str, " from ", date.since, " to ", 
                date.until, " since ", since.id, "."))
  
  ret.tweets <- searchTwitter(searchString = search.str, n = no.of.tweets, 
                              since = date.since, until = date.until,
                              sinceID = since.id,
                              geocode = '39.8,-95.583068847656,2500km', 
                              lang = "en", retryOnRateLimit = 25)
  
  # If no tweets are found, return an empty data frame
  if (length(ret.tweets) <= 0) {
    return (data.frame())
  }
  
  # Remove newer style retweets and both older style retweets (RT) and 
  # modified tweets (MT). Retain 'original' tweets only
  ret.tweets <- tryCatch(
    {
      strip_retweets(ret.tweets, strip_manual = TRUE, 
                     strip_mt = TRUE)
    },
    warning = function(w) {
      print("Warning: All Tweets returned are Retweets")
      return (NULL)
    },
    error = function(e) {
      print("Beware: All Tweets returned are Retweets")
      return (NULL)
    }
  )
  
  print (paste0("Retrieved ", length(ret.tweets), " tweets for ..." ,
                search.str, " from ", date.since, " to ", 
                date.until, " since ", since.id, "."))
  print(c('since.id', since.id))
  print('date.until')
  print(date.until)
      
  # Re-check if the set of returned tweets list is null or empty
  # (|| short-circuits, so length() is only evaluated for a non-NULL list)
  if (is.null(ret.tweets) || length(ret.tweets) <= 0) {
    return (data.frame())
  }
  # Convert the returned tweets list to a data frame
  ret.tweets.df <- twListToDF(ret.tweets)
  print('Dimensions of ret.tweets.df')    ###!
  print(dim(ret.tweets.df))   # dim() alone prints nothing inside a function
  
  # Ensure numeric values are saved in the lat and lon fields
  ret.tweets.df$longitude <- as.numeric(ret.tweets.df$longitude)
  ret.tweets.df$latitude <- as.numeric(ret.tweets.df$latitude)
  
  # The function cbind of the base package binds column(s) to a list,
  # vector or a matrix
  return(cbind(search.str, ret.tweets.df))
}
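
`strip_retweets` works on the list of status objects returned by `searchTwitter`. For intuition only, the sketch below applies a comparable filter to a plain data frame. It is a stand-in, not twitteR's implementation: the toy data and the helper name `keep.original` are hypothetical, and the pattern assumes that manual retweets and modified tweets start with "RT" or "MT".

```r
# Toy data with the two columns the filter needs
toy.tweets <- data.frame(
  text = c("Original thought about the election",
           "RT @someone: Original thought about the election",
           "MT @someone: thought about the election",
           "Another original tweet"),
  isRetweet = c(FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop API-flagged retweets and manual RT/MT tweets
keep.original <- function(df) {
  manual <- grepl("^(RT|MT)\\b", df$text, perl = TRUE)
  df[!df$isRetweet & !manual, , drop = FALSE]
}

keep.original(toy.tweets)$text
```

Only the first and last toy tweets survive the filter, since the second is flagged as a retweet and the third starts with "MT".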

The following condition occurs when tweets for a new date/day are to be searched, but there is NO existing data for other dates. Each row of the parameter table is looped through to build the parameters for, and execute, the tweet search function. The new tweets are stored in **"combined.tweets"**.

The combined data frame for all days is then checked for duplicate rows and saved in CSV and Rda formats.

In [17]:
if (orig.run.flag && !re.run.flag) {
  
  combined.tweets <- 
    bind_rows(lapply(1:nrow(search.parms.df), 
                     function(x) tweets.df(search.parms.df$search.str[x], 
                                           no.of.tweets=no.of.tweets, 
                                           date.since=search.parms.df$date.since[x], 
                                           date.until=search.parms.df$date.until[x]))) %>% 
    as.data.frame()
}


# Although the two "if" commands that test orig.run.flag together with
# re.run.flag can be combined into one by removing the check for re.run.flag,
# they are kept separate for the sake of clarity.


# Combine the temporary data frame (downloaded tweets, if applicable) along with
# the combined tweets data frame due to above searches.
combined.tweets <- rbind(temp.combined.tweets, combined.tweets)


# Get the maximum ID for each Candidate and Day if orig.run.flag is set.
# If searches for new tweets were launched, then the data aggregation step will
# have to be repeated to find the maximum tweets ID for each candidate and
# new day(s).
#
if (orig.run.flag == TRUE) {
  search.parms.df$since.id <- NULL
  data.aggr <- aggregate(id ~ search.str+mday(created), 
                         data=combined.tweets, max)
  names(data.aggr) <- c("search.str", "day.of.month", "since.id")
  search.parms.df <- merge(x=search.parms.df, y=data.aggr,
                           by=c("search.str", "day.of.month"), all.x=T) 
}


# Once again, save the existing tweets data frame to a temporary data frame
temp.combined.tweets <- combined.tweets


# Search again for more tweets by providing the since ID for each parameter
# combination. These parameter combinations (without since ID) were used for
# tweets search earlier.
#
combined.tweets <- 
  bind_rows(lapply(1:nrow(search.parms.df), 
                   function(x) tweets.df(search.parms.df$search.str[x], 
                                         no.of.tweets=no.of.tweets, 
                                         date.since=search.parms.df$date.since[x], 
                                         date.until=search.parms.df$date.until[x],
                                         since.id=search.parms.df$since.id[x]))) %>% 
  as.data.frame()
    

[1] "Searching 10000 tweets for ...Donald Trump from 2016-11-07 to 2016-11-08 since ."
[1] "Retrieved 2433 tweets for ...Donald Trump from 2016-11-07 to 2016-11-08 since ."
[1] "since.id"
[1] "date.until"
[1] "2016-11-08"
[1] "Dimensions of ret.tweets.df"
[1] "Searching 10000 tweets for ...Hillary Clinton from 2016-11-07 to 2016-11-08 since ."
[1] "Rate limited .... blocking for a minute and retrying up to 24 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 23 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 22 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 21 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 20 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 19 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 18 times ..."
[1] "Rate limited .... blocking for a minute and retrying up to 17 times ..."
[1] "Rate limited .... blockin

“Unequal factor levels: coercing to character”

[1] "Searching 10000 tweets for ...Donald Trump from 2016-11-07 to 2016-11-08 since 795777853245648901."


“10000 tweets were requested but the API can only return 2”

[1] "Beware: All Tweets returned are Retweets"
[1] "Retrieved 0 tweets for ...Donald Trump from 2016-11-07 to 2016-11-08 since 795777853245648901."
[1] "since.id"           "795777853245648901"
[1] "date.until"
[1] "2016-11-08"
[1] "Searching 10000 tweets for ...Hillary Clinton from 2016-11-07 to 2016-11-08 since 795777851626655744."


“10000 tweets were requested but the API can only return 5”

[1] "Beware: All Tweets returned are Retweets"
[1] "Retrieved 0 tweets for ...Hillary Clinton from 2016-11-07 to 2016-11-08 since 795777851626655744."
[1] "since.id"           "795777851626655744"
[1] "date.until"
[1] "2016-11-08"
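
The since ID lookup used in the cell above can be tried on toy data. In this sketch the tweets are grouped by calendar date rather than by day of the month, which is a deliberate deviation from the notebook so that days from different months stay apart. The IDs are invented and stored as character strings of equal length, so `max` can compare them as text without any loss of numeric precision.

```r
toy <- data.frame(
  search.str = c("Donald Trump", "Donald Trump", "Hillary Clinton", "Donald Trump"),
  id         = c("795000000000000001", "795000000000000009",
                 "795000000000000005", "796000000000000002"),
  created    = as.POSIXct(c("2016-11-06 10:00:00", "2016-11-06 23:00:00",
                            "2016-11-06 12:00:00", "2016-11-07 08:00:00"),
                          tz = "UTC"),
  stringsAsFactors = FALSE
)

# Calendar date of each tweet, as text, to serve as a grouping key
toy$day <- format(toy$created, "%Y-%m-%d")

# Highest ID per search string and day
since.ids <- aggregate(id ~ search.str + day, data = toy, FUN = max)
names(since.ids)[names(since.ids) == "id"] <- "since.id"
since.ids
```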


In [18]:
print('Dimensions of combined tweets')    ###!
dim(combined.tweets)                       

[1] "Dimensions of combined tweets"


In [19]:
# Combine the temporary data frame with the new tweets data frame
combined.tweets <- rbind(temp.combined.tweets, combined.tweets)


# Identify any duplicate tweets within the dataset.
# A row is flagged when every one of its column values also occurs
# more than once within its own column.
idx <- sapply(combined.tweets, function(x) !is.na(match(x, x[duplicated(x)])))
idx.val <- apply(idx, 1, all)


# Find number of duplicate rows
    
print('Number of duplicate rows:')
nrow(combined.tweets[which(idx.val == TRUE),])
# [1] 3658


# Remove duplicate rows by copying non-duplicate data set
combined.tweets <- combined.tweets[which(idx.val == FALSE),]
    
print('Dimensions of combined tweets')    ###!
dim(combined.tweets)

[1] "Number of duplicate rows:"


[1] "Dimensions of combined tweets"
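
A simpler check, offered here as an alternative sketch rather than as what the notebook does, is to treat a tweet as a duplicate when its ID has already been seen for the same search string. `duplicated` keeps the first copy of each tweet, whereas the cell above drops every row it flags. The toy data is invented.

```r
toy <- data.frame(
  search.str = c("Donald Trump", "Donald Trump", "Hillary Clinton", "Donald Trump"),
  id         = c("101", "102", "101", "101"),
  text       = c("first", "second", "first", "first"),
  stringsAsFactors = FALSE
)

# TRUE for the second and later occurrences of a (search.str, id) pair
dup <- duplicated(toy[, c("search.str", "id")])

sum(dup)                 # number of duplicate rows
deduped <- toy[!dup, ]   # first copy of every tweet is kept
deduped
```

The same tweet ID under a different search string is kept on purpose here, because the notebook stores results per candidate.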


In [20]:
# Save the downloaded tweets in Rda and CSV formatted files
save(combined.tweets, file=paste0(file.path, "DownloadedTweets.Rda"))
write.csv(combined.tweets, file=paste0(file.path, "DownloadedTweets.csv"),
          row.names = F)

The function is called with all combinations of the search parameters except since ID.

Each row of the parameter data frame that does NOT have the since ID value populated is looped through, with the help of the lapply function, to build the parameters for and execute the tweet search function. The results returned are then stacked into one large data frame with bind_rows, and the %>% (pipe) operator passes the stacked rows to as.data.frame().

The following condition occurs when tweets for a new date/day are to be searched, and there is existing data for other dates. Tweets for these old dates have to be checked again.

In [21]:
if (orig.run.flag && re.run.flag) {
  
  combined.tweets <- 
    bind_rows(lapply(which(is.na(search.parms.df$since.id)), 
                     function(x) tweets.df(search.parms.df$search.str[x], 
                                           no.of.tweets = no.of.tweets, 
                                           date.since = search.parms.df$date.since[x], 
                                           date.until = search.parms.df$date.until[x]))) %>% 
    as.data.frame()
}
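
The `which(is.na(...))` selection relies on the earlier left join: `merge` with `all.x = TRUE` leaves the since ID as NA for every parameter row that has no stored tweets yet. A toy version, with invented values and the made-up names `params`, `known`, and `todo`, might look like this:

```r
params <- data.frame(search.str   = c("Donald Trump", "Donald Trump", "Hillary Clinton"),
                     day.of.month = c(6L, 7L, 7L),
                     stringsAsFactors = FALSE)

# Only one candidate/day combination has tweets stored already
known <- data.frame(search.str   = "Donald Trump",
                    day.of.month = 6L,
                    since.id     = "795000000000000009",
                    stringsAsFactors = FALSE)

# Left join: unmatched parameter rows get since.id = NA
merged <- merge(x = params, y = known,
                by = c("search.str", "day.of.month"), all.x = TRUE)

# Rows that still need a first, unrestricted search
todo <- which(is.na(merged$since.id))
merged[todo, c("search.str", "day.of.month")]
```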

## Perform some checks on the combined data

In [22]:
# Check the dimensions of the combined tweets data frame
print("dimensions of combined.tweets")
dim(combined.tweets)
# [1] 35683    17


# Check the structure of the combined tweets data frame
str(combined.tweets)
# 'data.frame':	35683 obs. of  17 variables:
# $ search.str   : chr  "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...


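# Check which dates were loaded: list the timestamp of the first tweet
# found for each distinct day of the month; any date missing from this
# list was not loaded
print('dates that were loaded')
combined.tweets[!duplicated(day(combined.tweets$created)), "created"]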
# [1] "2016-10-02 23:59:59 UTC" "2016-10-03 23:59:58 UTC" "2016-10-04 23:59:59 UTC"
# [4] "2016-10-05 23:59:58 UTC" "2016-10-06 23:59:59 UTC" "2016-10-07 23:59:58 UTC"
# [7] "2016-10-08 23:59:59 UTC" "2016-10-09 23:59:59 UTC" "2016-10-10 23:59:59 UTC"
# [10] "2016-10-11 23:59:59 UTC"


# For ease of date/time queries and manipulation later, add new date
# and time fields to the tweets data frame
#
# library(lubridate)
#
combined.tweets$timestamp <- ymd_hms(combined.tweets$created)
combined.tweets$dateonly <- trunc(combined.tweets$timestamp, "days")
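# Seconds since midnight; the unit is fixed so the result does not depend
# on difftime's automatic unit selection
combined.tweets$timeonly <- round(as.numeric(difftime(combined.tweets$timestamp,
                                                      trunc(combined.tweets$timestamp, "days"),
                                                      units = "secs")), 0)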

[1] "dimensions of combined.tweets"


'data.frame':	4000 obs. of  17 variables:
 $ search.str   : chr  "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...
 $ text         : chr  "Dave Chappelle -- Hell No, I Don't Support Trump! Consider the Source (VIDEO) https://t.co/WQcYdZB2zy via @TMZ" "in spanish today we were talking about our turn offs about ppl &amp; this quiet girl in the back of the class said \"don… https"| __truncated__ "Not saying I'm a Clinton supporter but Donald Trump once used fake medical records to get out of military service so... @g_payn"| __truncated__ "I keep getting recommendations to follow Donald Trump accounts, what is life anymore." ...
 $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ favoriteCount: num  0 2 0 0 0 0 0 0 0 4 ...
 $ replyToSN    : chr  NA NA NA NA ...
 $ created      : POSIXct, format: "2016-11-07 23:59:59" "2016-11-07 23:59:59" ...
 $ truncated    : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ replyToSID   : chr  NA NA NA NA ...
 $ id           : 

[1] "2016-11-07 23:59:59 UTC"
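
The date and time fields can also be derived with base R alone. The sketch below is an illustration on two invented timestamps, not the notebook's lubridate-based code, and it assumes the timestamps are in UTC.

```r
created <- as.POSIXct(c("2016-11-07 23:59:59", "2016-11-07 00:00:30"), tz = "UTC")

midnight <- trunc(created, "days")          # start of each tweet's day
dateonly <- as.Date(created, tz = "UTC")    # calendar date

# Seconds elapsed since midnight, with the unit fixed explicitly
timeonly <- as.numeric(difftime(created, midnight, units = "secs"))

data.frame(created, dateonly, timeonly)
```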

## Write the data to CSV and Rda files

In [23]:
# Save the modified dataset in separate Rda and CSV files
save(combined.tweets, 
     file=paste0(file.path, "ModifiedTweets1.Rda"))
write.csv(combined.tweets, 
          file=paste0(file.path, "ModifiedTweets1.csv"),
          row.names = FALSE)
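
One detail worth keeping in mind when reading these files back: `load` restores objects under the names they were saved with, which is why the earlier cell copies **"combined.tweets"** into a temp data frame right after loading. A minimal round trip, using a temporary file and an invented data frame, might look like this:

```r
toy <- data.frame(id = c("101", "102"), stringsAsFactors = FALSE)
rda.file <- tempfile(fileext = ".Rda")   # throwaway path for the example

save(toy, file = rda.file)
rm(toy)

# load() recreates the object under its saved name and returns that name
restored.names <- load(rda.file)
restored.names
nrow(toy)

unlink(rda.file)   # clean up the temporary file
```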