# Two-Dimensional Data Worksheet

This worksheet focuses on manipulating two-dimensional data using R.

In [1]:
# install.packages("tidyverse")
library(tidyverse)
knitr::opts_chunk$set(echo=FALSE)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


Create a data frame called `twitter` from the CSV file.

_Note: if the full file is too much for your machine, there is a smaller data set in the same data folder called `twitter1-small.csv`._

In [3]:
twitter <- read_csv('../../Data/twitter1.csv')

Parsed with column specification:
cols(
  `Primary Key` = col_integer(),
  Service = col_character(),
  Term = col_character(),
  Username = col_character(),
  Name = col_character(),
  Update = col_character(),
  Location = col_character(),
  URL = col_character(),
  Friends = col_integer(),
  Followers = col_integer(),
  `Time(PDT)` = col_character(),
  City = col_character(),
  `State/Region` = col_character(),
  Country = col_character(),
  Metro = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)
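
The column specification printed above can also be passed back to `read_csv` through `col_types`, which fixes the types up front and silences the message. The following is a minimal sketch that assumes a made-up two-column inline CSV rather than the real file; only the column names `Primary Key` and `Friends` come from the specification above, and the values are invented.

```r
library(readr)

# Inline toy CSV; the values are invented for illustration.
# read_csv treats a string containing a newline as literal data.
toy_csv <- "Primary Key,Friends\n1,10\n2,20\n"

toy <- read_csv(
  toy_csv,
  col_types = cols(
    `Primary Key` = col_integer(),  # backticks are needed because of the space
    Friends = col_integer()
  )
)

# Column names with spaces also need backticks when accessed
toy$`Primary Key`
```

The same backtick quoting applies to `Time(PDT)` and `State/Region` in the real data.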


## Exercise 1

Using the `twitter` data frame, count the appearances of each `Username`, keep the 10 most active users, and output the result sorted by count.

1. Start with `dplyr::count` on `Username`.
2. Grab the top 10 with `dplyr::top_n`.
3. Sort with `dplyr::arrange`; wrap the sort variable in `desc()` to sort in descending order.

In [4]:
twitter %>% count(Username) %>% top_n(10) %>% arrange(desc(n))

Selecting by n


| Username | n |
| --- | --- |
| HoolohaTube | 155 |
| Rasu24 | 150 |
| HOOLOHASPORT | 126 |
| mahboobali3 | 119 |
| EminemsRealWife | 116 |
| byezekiel | 89 |
| MyrtleMuelr | 83 |
| LucindaFischer | 79 |
| DebraRichayd | 77 |
| JeanieNoble | 70 |
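
One thing to watch for: `top_n` keeps ties, so it can return more than 10 rows when several users share the tenth-highest count. In dplyr 1.0.0 and later, `top_n` is superseded by `slice_max`. The sketch below assumes such a dplyr version and uses a small invented data frame (the usernames and the `tweets` column name are hypothetical, not from the twitter file).

```r
library(dplyr)

# Toy stand-in for the twitter data; these usernames are invented.
toy <- data.frame(
  Username = c("ann", "bob", "ann", "cat", "bob", "ann"),
  stringsAsFactors = FALSE
)

top_users <- toy %>%
  count(Username, name = "tweets") %>%            # one row per user
  slice_max(tweets, n = 2, with_ties = FALSE) %>% # exactly 2 rows, even with ties
  arrange(desc(tweets))                           # explicit descending sort

top_users
```

Naming the count column `tweets` avoids having a column called `n` next to the `n =` argument of `slice_max`.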


## Exercise 2
Using the original twitter data set, create a second data frame called `twitterSummary` which contains the following columns:

* Username
* Friends
* Followers

Next, add a column called `ffratio` which contains the ratio of friends to followers. Show the first few lines.

In [5]:
# classic method:
# twitterSummary <- twitter[ ,c('Username', 'Friends', 'Followers')]
# twitterSummary$ffratio <- twitterSummary$Friends / twitterSummary$Followers
# head(twitterSummary)

# with tidyverse
twitterSummary <- twitter %>% 
  select(Username, Friends, Followers) %>% 
  mutate(ffratio = Friends / Followers)
# printing a tibble shows only the first rows, nicely formatted
twitterSummary

| Username | Friends | Followers | ffratio |
| --- | --- | --- | --- |
| _prettybrown | 1042 | 1538 | 0.6775033 |
| CarlyManning24 | 278 | 304 | 0.9144737 |
| madzLuvzLakers | 619 | 1039 | 0.5957652 |
| _AyyJayy | 203 | 204 | 0.9950980 |
| Akeemoneale | 165 | 27 | 6.1111111 |
| demetricel | 0 | 0 | NaN |
| MrKooman | 203 | 6001 | 0.0338277 |
| bri_quebengco | 28 | 30 | 0.9333333 |
| Lboogs82 | 212 | 253 | 0.8379447 |
| orccs | 32 | 13 | 2.4615385 |
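
Users with zero followers deserve a second look: in R, `0 / 0` gives `NaN` and any positive number divided by zero gives `Inf`. If you would rather mark those ratios as missing, one possible approach (a sketch on an invented three-row data frame, not the exercise's required answer) is to guard the division with `dplyr::if_else`.

```r
library(dplyr)

# Toy data; the usernames and counts are invented for illustration.
toy <- data.frame(
  Username  = c("a", "b", "c"),
  Friends   = c(10L, 0L, 5L),
  Followers = c(20L, 0L, 0L),
  stringsAsFactors = FALSE
)

toy_summary <- toy %>%
  # Return NA instead of NaN or Inf when there are no followers
  mutate(ffratio = if_else(Followers == 0L, NA_real_, Friends / Followers))

toy_summary
```

`if_else` is stricter than base `ifelse`: both branches must have the same type, which is why the missing value is written as `NA_real_`.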


## Exercise 3

In the `data` folder, there is a file called `studentData.csv` containing students and their test scores. Write a script that calculates each student's average test score and adds it as a column to the data frame.

The first person to tell me which student has the highest average test score, and what that score is, wins something.
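
As a starting point, here is a sketch of a row-wise average on an invented data frame. The layout of `studentData.csv` is not shown in this worksheet, so the column names `Student`, `Test1`, `Test2`, and `Test3` are assumptions; adapt them to whatever the real file contains.

```r
library(dplyr)

# Invented stand-in for studentData.csv; names and scores are hypothetical.
students <- data.frame(
  Student = c("s1", "s2", "s3"),
  Test1 = c(80, 90, 70),
  Test2 = c(90, 100, 60),
  Test3 = c(70, 95, 80),
  stringsAsFactors = FALSE
)

# Columns holding the scores (an assumption about the file layout)
score_cols <- c("Test1", "Test2", "Test3")

# rowMeans averages across the selected columns, one value per student
students$average <- rowMeans(students[score_cols], na.rm = TRUE)

# Student with the highest average
top <- students %>% arrange(desc(average)) %>% head(1)
top
```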


## Exercise 4
Using the twitter data, find all the users with Facebook accounts and create a new column called `FacebookID` which contains each user's Facebook ID. A user's Facebook ID can be found in the `URL` column, as in this example: http://www.facebook.com/profile.php?id=5141860. Extract the ID. Don't forget to remove all the invalid or empty IDs.

1. Use the function supplied below to extract the digits following `?id=` from the URL.
    + Note that the function does not check for Facebook, so you will have to prefilter the data for Facebook URLs. Try using `dplyr::filter` with `grepl` inside it.
2. Once complete, keep only the rows where `FacebookID` is not NA (`!is.na()`).

_Note: there are a lot of [ways to pull a substring from a string](http://stackoverflow.com/questions/2192316/extract-a-regular-expression-match-in-r-version-2-10)._

In [6]:
library(stringr) # do you have this installed?
getfacebook <- function(x) {
    # grepl (logical grep) returns TRUE or FALSE for each element of x
    # if the URL contains "?id=" followed by digits, extract just the digits
    # else return NA
    ifelse(grepl('\\?id=\\d+', x), 
           str_extract(str_extract(x, "\\?id=\\d+"), "\\d+"),
           NA)
}
foo <- twitter %>% 
  ## keep only Facebook URLs to reduce computations later
  ## (fixed = TRUE matches the dot literally instead of as a regex wildcard)
  filter(grepl('facebook.com', URL, fixed = TRUE)) %>% 
  ## run all the URLS through the above function
  mutate(FacebookID = getfacebook(URL))
foo %>% select(URL, FacebookID) %>% filter(!is.na(FacebookID))

| URL | FacebookID |
| --- | --- |
| http://www.facebook.com/home.php?#!/profile.php?id=100000926596946&ref=profile | 100000926596946 |
| http://www.facebook.com/profile.php?id=514186015&ref=profile#!/profile.php?id=514186015&ref=ts | 514186015 |
| http://http://www.facebook.com/?ref=home#!/profile.php?id=506234142&ref=profile | 506234142 |
| http://www.facebook.com/home.php#/profile.php?id=1707568551&ref=name | 1707568551 |
| http://www.facebook.com/home.php?#/profile.php?id=38403445&ref=name | 38403445 |
| http://www.facebook.com/profile.php?id=1060743307&ref=profile | 1060743307 |
| http://www.facebook.com/#!/profile.php?id=702012065 | 702012065 |
| http://www.facebook.com/#!/profile.php?id=1701903898&ref=profile | 1701903898 |
| http://www.facebook.com/profile.php?id=1567050070&ref=profile | 1567050070 |
| http://www.facebook.com/profile.php?id=100000382878368 | 100000382878368 |
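
As one of the alternative ways to pull out the substring, the nested `str_extract` calls could be replaced by a single capture group with `stringr::str_match`. This is only a sketch of that variant, not the worksheet's supplied function. The first URL is the example from the exercise text; the other two are invented to show the non-matching cases.

```r
library(stringr)

urls <- c(
  "http://www.facebook.com/profile.php?id=5141860",  # example from the exercise
  "http://www.facebook.com/somepage",                # invented: no id
  "http://twitter.com/someone?id=42"                 # invented: not Facebook
)

# fixed() matches "facebook.com" literally, dot included
is_facebook <- str_detect(urls, fixed("facebook.com"))

# str_match returns a matrix: column 1 is the full match,
# column 2 is the first capture group (the digits after "?id=")
ids <- ifelse(is_facebook,
              str_match(urls, "\\?id=(\\d+)")[, 2],
              NA_character_)
ids
```

Here the Facebook check and the ID extraction happen in one step, so URLs from other sites come back as `NA` even when they contain an `?id=` parameter.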
