## Editing Text Variables

### Loading data

In [49]:
# Create the data directory only if it does not exist yet
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv")
cameraData <- read.csv("./data/cameras.csv")


In [40]:
dim(cameraData)

In [41]:
names(cameraData)

In [2]:
head(cameraData)

| | address | direction | street | crossStreet | intersection | Location.1 | X2010.Census.Neighborhoods | X2010.Census.Wards.Precincts | Zip.Codes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` |
| 1 | GARRISON BLVD & WABASH AVE | E/B | Garrison | Wabash Ave | Garrison & Wabash Ave | (39.341209, -76.683117) | 252 | 63 | 27295 |
| 2 | HILLEN ST & FORREST ST | W/B | Hillen | Forrest St | Hillen & Forrest St | (39.29686, -76.605532) | 179 | 108 | 13645 |
| 3 | EDMONDSON AVE & N ATHOL AVE | E/B | Edmonson | Woodbridge Ave | Edmonson  & Woodbridge Ave | (39.293453, -76.689391) | 213 | 75 | 27950 |
| 4 | YORK RD & GITTINGS AVE | S/B | York Rd | Gitting Ave | York Rd & Gitting Ave | (39.370493, -76.609812) | 37 | 270 | 14009 |
| 5 | RUSSELL ST & W HAMBURG ST | S/B | Russell | Hamburg St | Russell  & Hamburg St | (39.279819, -76.623911) | 250 | 178 | 27953 |
| 6 | S MARTIN LUTHER KING JR BLVD & W PRATT ST | S/B | MLK Jr. Blvd | Pratt St | MLK Jr. Blvd & Pratt St | (39.286027, -76.627846) | 11 | 168 | 27953 |


In [5]:
tolower(names(cameraData))

### Fixing character vectors - strsplit()

In [12]:
names(cameraData)[6]

We can see that its format is `Location.1`, so we can use `strsplit()` to split it on the period

In [7]:
splitNames = strsplit(names(cameraData), "\\.")

In [10]:
splitNames[[5]]

In [11]:
splitNames[[6]]
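To make the return value of `strsplit()` concrete, here is a minimal sketch on a hypothetical two-element vector (`colNames` is an assumption standing in for `names(cameraData)`). The result is a list with one character vector per input string.

```r
# Hypothetical column names standing in for names(cameraData)
colNames <- c("intersection", "Location.1")

# "\\." is a literal period; an unescaped "." would match any character
parts <- strsplit(colNames, "\\.")

parts[[1]]  # "intersection" (no period, so a single piece)
parts[[2]]  # "Location" "1"
```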

### Quick aside - lists

In [22]:
mylist <- list(letters = c('a','b','c'), numbers = 1:3, matrix(1:25, ncol=5))

In [23]:
head(mylist)

0,1,2,3,4
1,6,11,16,21
2,7,12,17,22
3,8,13,18,23
4,9,14,19,24
5,10,15,20,25


In [24]:
mylist[1]

In [25]:
mylist$letters

In [26]:
mylist[[1]]

### Fixing character vectors - sapply()

- Applies a function to each element in a vector or list
- Important parameters: `X`, `FUN`

In [27]:
splitNames[[6]][1]

In [28]:
firstElement <- function(x){x[1]}
sapply(splitNames, firstElement)
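The same idea can be tried without the camera data. The sketch below uses a hypothetical list `pieces` in place of `splitNames` and shows two equivalent spellings; `vapply()` is offered only as a stricter alternative, not as something the notes above rely on.

```r
# Hypothetical split result standing in for splitNames
pieces <- list(c("Location", "1"), "address", c("Zip", "Codes"))

firstElement <- function(x) x[1]

# sapply() calls FUN on every element of X and simplifies to a vector
viaSapply <- sapply(pieces, firstElement)

# The subsetting operator itself can be passed as FUN, with 1 as its argument
viaBracket <- sapply(pieces, `[`, 1)

# vapply() additionally checks that each result is a single string
viaVapply <- vapply(pieces, firstElement, character(1))

viaSapply  # "Location" "address" "Zip"
```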

In [29]:
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solutions.csv")

In [30]:
head(reviews,2)

| | id | solution_id | reviewer_id | start | stop | time_left | accept |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1 | 3 | 27 | 1304095698 | 1304095758 | 1754 | 1 |
| 2 | 2 | 4 | 22 | 1304095188 | 1304095206 | 2306 | 1 |


In [31]:
head(solutions, 2)

| | id | problem_id | subject_id | start | stop | time_left | answer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | 1 | 156 | 29 | 1304095119 | 1304095169 | 2343 | B |
| 2 | 2 | 269 | 25 | 1304095119 | 1304095183 | 2329 | C |


In [32]:
names(reviews)

Use `sub()` to remove the underscore `_` from the text

In [33]:
sub("_", "", names(reviews))

`sub()` removes only the first underscore in each string and leaves the second, third, ... Nth in place

In [34]:
testName <- "this_is_a_test"
sub("_", "", testName)

Use `gsub()` to remove all N underscores in the text

In [35]:
gsub("_","",testName)

In [36]:
gsub("_",".",testName)
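Both functions are vectorized, so they can be applied to a whole vector of names at once. A small sketch with hypothetical column names (`cols` is made up for illustration):

```r
# Hypothetical column names; the last one contains two underscores
cols <- c("solution_id", "reviewer_id", "time_left_total")

firstOnly <- sub("_", "", cols)   # replaces the first match in each string
allGone   <- gsub("_", "", cols)  # replaces every match in each string

firstOnly  # "solutionid" "reviewerid" "timeleft_total"
allGone    # "solutionid" "reviewerid" "timelefttotal"
```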

### Finding values - grep(), grepl()

In [43]:
grep("Alameda", cameraData$intersection)

The argument `value = TRUE` returns the matching values themselves instead of their indices

In [51]:
grep("Alameda", cameraData$intersection, value = TRUE)

In [44]:
table(grepl('Alameda', cameraData$intersection))


FALSE  TRUE 
   77     3 

In [50]:
cameraData2 <- cameraData[
    !grepl("Alameda", cameraData$intersection), 
]

In [52]:
grep('JeffStreet', cameraData$intersection)

In [53]:
length(grep('JeffStreet', cameraData$intersection))
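The differences between the return types are easiest to see side by side. The sketch below assumes a hypothetical vector `streets` rather than the real `cameraData$intersection`.

```r
# Hypothetical intersections standing in for cameraData$intersection
streets <- c("The Alameda & 33rd St", "York Rd & Gittings Ave", "E 33rd & The Alameda")

idx     <- grep("Alameda", streets)                # integer positions of matches
matched <- grep("Alameda", streets, value = TRUE)  # the matching strings
flags   <- grepl("Alameda", streets)               # one logical per element

# A logical vector can be negated to keep the non-matching elements
others <- streets[!flags]

# No match gives a zero-length result, so length() works as a test
noHit <- length(grep("JeffStreet", streets))

idx    # 1 3
flags  # TRUE FALSE TRUE
```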

### More useful string functions

In [54]:
library(stringr)

`nchar()` returns the number of characters

In [57]:
nchar("Jeffrey Leek")

In [58]:
substr('Jeffrey Leek', 1, 7)

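`paste()` joins several pieces of text together, inserting a single space between them by default. The `sep` argument changes that separator, e.g. `sep = "_"`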

In [61]:
paste("Jeffrey", 'Leek')

In [60]:
paste("Jeffrey", 'Leek',sep = "_")

If you do not want any separator in between, you can use `paste0()`

In [63]:
paste("Jeffrey", "leek", sep = "")

In [64]:
paste0("Jeffrey", "leek")

`str_trim()` removes all whitespace at the start and end of the text

In [67]:
str_trim("   Jeff        ")
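If `stringr` is not available, base R's `trimws()` does the same job. This is a sketch of that alternative, not part of the original notes; `padded` is a hypothetical input built with `strrep()` so the number of spaces is explicit.

```r
# Three leading and eight trailing spaces around "Jeff"
padded <- paste0(strrep(" ", 3), "Jeff", strrep(" ", 8))

both <- trimws(padded)                  # strip both ends, like str_trim()
left <- trimws(padded, which = "left")  # strip only the leading spaces

nchar(padded)  # 15
nchar(both)    # 4
nchar(left)    # 12
```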

## Regular Expressions

- Regular expressions can be thought of as a combination of literals and metacharacters
- To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
- Regular expressions have a rich set of metacharacters

`^` represents the beginning of a line, and `$` represents the end of a line
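A quick way to try the anchors is `grepl()` on a few made-up lines (the vector `lines` below is hypothetical):

```r
# Hypothetical lines of text
lines <- c("i think we all rule", "do you think so", "good morning")

atStart <- grepl("^i think", lines)  # pattern must start the line
atEnd   <- grepl("morning$", lines)  # pattern must end the line

atStart  # TRUE FALSE FALSE
atEnd    # FALSE FALSE TRUE
```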

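`[Bb]`, `[Uu]`, `[Ss]`, `[Hh]`: a character class lists the set of characters we will accept at a given point in the match, so `[Bb][Uu][Ss][Hh]` matches "bush" in any combination of upper and lower case

`^[Ii] am` will match lines that begin with "I am" or "i am"

`^[0-9][a-zA-Z]` uses ranges of digits and letters inside the classes; notice that the order of the ranges within a class does not matter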
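Character classes and ranges can be checked the same way. The strings in `x` are hypothetical; note that the letter range is written `a-zA-Z` here.

```r
# Hypothetical test strings
x <- c("Bush", "BUSH", "bush league", "7th inning", "x7")

# One class per position accepts either case of each letter
anyCase <- grepl("[Bb][Uu][Ss][Hh]", x)

# A digit at the start of the line, followed by a letter
digitLetter <- grepl("^[0-9][a-zA-Z]", x)

anyCase      # TRUE TRUE TRUE FALSE FALSE
digitLetter  # FALSE FALSE FALSE TRUE FALSE
```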

When used at the beginning of a character class, the `^` is also a metacharacter and indicates matching characters NOT in the indicated class, e.g. `[^?.]$` will match lines whose last character is anything other than `?` or `.`
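As a small illustration of a negated class, the sketch below tests `[^?.]$` against hypothetical lines. Inside the brackets `?` and `.` are ordinary characters, so no escaping is needed.

```r
# Hypothetical lines with different final characters
x <- c("are you sure?", "it is done.", "no punctuation", "wow!")

# TRUE when the last character is neither "?" nor "."
endsOther <- grepl("[^?.]$", x)

endsOther  # FALSE FALSE TRUE TRUE
```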

`.` is used to refer to any character, so `9.11` will match lines containing a 9, then any single character, then 11, such as `9-11`, `9/11` or `9.11`
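The wildcard behaviour of `.` can be verified on a few hypothetical strings:

```r
# Hypothetical strings
x <- c("9-11", "9/11", "9.11", "911", "9:11:46")

# "." stands for exactly one character of any kind
wild <- grepl("9.11", x)

wild  # TRUE TRUE TRUE FALSE TRUE
```

`"911"` fails because there is no character between the 9 and the 11.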

`|` means "or", e.g. `flood|fire`, `flood|earthquake|hurricane|coldfire`

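`^[Gg]ood|[Bb]ad` will match lines that begin with "Good" or "good", or that contain "Bad" or "bad" anywhere, because the `^` applies only to the first alternative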

Subexpressions are often contained in parentheses to constrain the alternatives, e.g. `^([Gg]ood|[Bb]ad)`
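The effect of the parentheses shows up when the two patterns are run on the same input. The strings below are hypothetical.

```r
# Hypothetical lines
x <- c("good morning", "not bad at all", "Bad idea", "feeling good")

# Without parentheses, ^ binds only to the first alternative
loose <- grepl("^[Gg]ood|[Bb]ad", x)

# With parentheses, ^ applies to both alternatives
strict <- grepl("^([Gg]ood|[Bb]ad)", x)

loose   # TRUE TRUE TRUE FALSE
strict  # TRUE FALSE TRUE FALSE
```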

`^[Gg]eorge( [Ww]\.)? [Bb]ush` - The `?` after the parentheses makes the middle initial optional. We want to match a `.` as a literal period; to do that, we have to "escape" the metacharacter, preceding it with a backslash. In general, we have to do this for any metacharacter we want to include in our match.
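Inside an R string literal the backslash itself has to be doubled, so `\.` is typed as `"\\."`. A sketch on hypothetical names:

```r
# Hypothetical names
x <- c("George W. Bush", "george bush", "George Wx Bush", "Mr George Bush")

# "\\." in the R string is the regex \. (a literal period)
found <- grepl("^[Gg]eorge( [Ww]\\.)? [Bb]ush", x)

found  # TRUE TRUE FALSE FALSE
```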

- `*` means any number, including none, of the item
- `+` means at least one of the item
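The difference between the two quantifiers is whether zero repetitions are allowed, as this sketch on hypothetical strings shows:

```r
# Hypothetical strings
x <- c("ab", "b", "aaab", "cab")

zeroOrMore <- grepl("^a*b", x)  # "b" alone is fine: zero a's
oneOrMore  <- grepl("^a+b", x)  # needs at least one "a" before "b"

zeroOrMore  # TRUE TRUE TRUE FALSE
oneOrMore   # TRUE FALSE TRUE FALSE
```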

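`(.*)` will match any run of characters, including an empty one

`[0-9]+ (.*) [0-9]+` will match two numbers with any run of characters between them, separated from each number by a space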

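`{` and `}` are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression, e.g. `[Bb]ush( +[^ ]+ +){1,5} debate` is meant to match "Bush" followed by "debate" with one to five words in between. As written, each repetition consumes the spaces on both sides of a word and one more space is required before `debate`, so ordinary single-spaced text does not match; `[Bb]ush( +[^ ]+){1,5} +debate` expresses the intended meaning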
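The interval quantifier can be tried on hypothetical sentences. This sketch runs two patterns: the one quoted in these notes, and a variant (my own adjustment, not from the original material) in which each repetition is "spaces then a word", so that single-spaced text can match.

```r
# Hypothetical single-spaced sentences
x <- c("Bush has won the debate",
       "bush debate",
       "Bush one two three four five six debate")

# Pattern as quoted: each repetition needs spaces on both sides of the word,
# plus one more space before "debate"
asQuoted <- grepl("[Bb]ush( +[^ ]+ +){1,5} debate", x)

# Adjusted variant: one to five words between "Bush" and "debate"
adjusted <- grepl("[Bb]ush( +[^ ]+){1,5} +debate", x)

asQuoted  # FALSE FALSE FALSE
adjusted  # TRUE FALSE FALSE
```

The second string has no word in between and the third has six, so both fall outside the `{1,5}` interval.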

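`\1`, `\2`, ... refer back to the text matched by the first, second, ... parenthesized subexpression, so ` +([a-zA-Z]+) +\1 +` (note the leading space) will match a word that is immediately repeated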
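A backreference can be tested with `grepl()`; in an R string it is written `"\\1"`. The lines below are hypothetical.

```r
# Hypothetical lines
x <- c("time for bed, night night twitter!",
       "blah blah blah blah",
       "no repeats here")

# space(s), a word, space(s), the same word again, space(s)
repeated <- grepl(" +([a-zA-Z]+) +\\1 +", x)

repeated  # TRUE TRUE FALSE
```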

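`*` is greedy: it always matches the longest possible string that satisfies the regular expression, e.g. `^s(.*)s` matches from the `s` at the start of a line through the last `s` on that line

Following the `*` with a `?` makes it less greedy, e.g. `^s(.*?)s` stops at the first `s` after the leading one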
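To see exactly which text each version captures, `regmatches()` with `regexpr()` extracts the matched substring. The input string is hypothetical.

```r
# Hypothetical line that starts with "s" and contains several more
x <- "sitting at starbucks"

# Greedy: runs to the last "s" in the string
greedy <- regmatches(x, regexpr("^s(.*)s", x))

# Lazy: stops at the first "s" after the leading one
lazy <- regmatches(x, regexpr("^s(.*?)s", x))

greedy  # "sitting at starbucks"
lazy    # "sitting at s"
```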

## Working with dates

In [76]:
d1 = date()
d1

In [77]:
class(d1)

In [78]:
str(d1)

 chr "Fri Aug  7 17:57:28 2020"


In [79]:
d2 = Sys.Date()
d2

In [80]:
class(d2)

- `%d` = day as number (01-31)
- `%a` = abbreviated weekday
- `%A` = unabbreviated weekday
- `%m` = month (01-12)
- `%b` = abbreviated month
- `%B` = unabbreviated month
- `%y` = 2 digit year
- `%Y` = 4 digit year
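The codes work in both directions: `format()` turns a date into text and `as.Date()` parses text back. The sketch uses a fixed, hypothetical date and only the numeric codes, because the names produced by `%a`, `%A`, `%b` and `%B` depend on the session's locale.

```r
# A fixed date so the results do not depend on when the code runs
d <- as.Date("2020-08-07")

numericForm <- format(d, "%d/%m/%Y")  # day/month/4-digit year
shortYear   <- format(d, "%y")        # 2-digit year

# Parsing with the same format string recovers the date
parsed <- as.Date(numericForm, "%d/%m/%Y")

numericForm  # "07/08/2020"
shortYear    # "20"

# Locale dependent: weekday and month names vary between systems
format(d, "%A %d %B %Y")
```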

In [81]:
format(d2, '%a %b %d')

In [84]:
x = c("1jan1960", '2jan1960', '31mar1960', '30jul1960')
z = as.Date(x, "%d%b%Y")
z

In [85]:
d2

In [86]:
weekdays(d2)

In [87]:
months(d2)

In [88]:
julian(d2)

In [93]:
# install.packages("lubridate")
library(lubridate)
dt <- ymd("20140108")
dt

In [94]:
class(dt)

In [95]:
mdy('08/04/2018')

In [96]:
dmy('03-04-2013')

## Data Resources

## Week 4 Quiz

In [98]:
housing <- read.csv("./data/getdata_data_ss06hid.csv")
names(housing)

In [100]:
housingnamelist <- strsplit(names(housing), "wgtp")
housingnamelist[123]

In [162]:
gdp <- read.csv('data/getdata_data_GDP.csv')
gdp <- gdp[5:235,]
gdp <- select(gdp, c("X":"X.3"))
gdp <- rename(gdp, CountryCode = X, Ranking = Gross.domestic.product.2012, Country=X.2, Value = X.3)
gdp$X.1 <- NULL
gdp$Value <- as.integer(gsub(',', '', gdp$Value))
#mean(gdp$Value)

“NAs introduced by coercion”


In [119]:
library(dplyr)
#?select

In [153]:
head(gdp)

| | CountryCode | Ranking | Country | Value |
| --- | --- | --- | --- | --- |
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 5 | USA | 1 | United States | 16244600 |
| 6 | CHN | 2 | China | 8227103 |
| 7 | JPN | 3 | Japan | 5959718 |
| 8 | DEU | 4 | Germany | 3428131 |
| 9 | FRA | 5 | France | 2612878 |
| 10 | GBR | 6 | United Kingdom | 2471784 |


In [167]:
#gdp$Value <- as.numeric(gdp$Value)
#mean(as.integer(gdp$Value), na.rm = TRUE)
#gdp[, mean(as.integer(gsub(pattern = ',', replacement = '', x = gdp)))]
# mean(gdp[,'Value'])
#str(gdp)
mean(gdp$Value, na.rm = TRUE)

In [168]:
grep("^United", gdp$Country)

“input string 99 is invalid in this locale”
“input string 186 is invalid in this locale”
“input string 196 is invalid in this locale”


In [173]:
#library("data.table")
#gdp <- fread("data/getdata_data_GDP.csv", skip = 5, nrows = 190, select = c(1,2,4,5),
#     col.names=c('CountryCode', 'Rank', 'Country', 'GDP'))
gdp <- read.csv('data/getdata_data_GDP.csv')
gdp <- gdp[5:235,]
gdp <- select(gdp, c("X":"X.3"))
gdp <- rename(gdp, CountryCode = X, Ranking = Gross.domestic.product.2012, Country=X.2, Value = X.3)
gdp$X.1 <- NULL

In [186]:
edu <- read.csv('data/getdata_data_EDSTATS_Country.csv')
names(edu)

In [178]:
#merged.gdp.edu <- merge(gdp, edu, by.x = 'CountryCode', by.y='CountryCode')
merged.gdp.edu <- merge(gdp, edu, by="CountryCode")

In [179]:
head(merged.gdp.edu)

Unnamed: 0_level_0,CountryCode,Ranking,Country,Value,Long.Name,Income.Group,Region,Lending.category,Other.groups,Currency.Unit,⋯,Source.of.most.recent.Income.and.expenditure.data,Vital.registration.complete,Latest.agricultural.census,Latest.industrial.data,Latest.trade.data,Latest.water.withdrawal.data,X2.alpha.code,WB.2.code,Table.Name,Short.Name
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,ABW,161.0,Aruba,2584,Aruba,High income: nonOECD,Latin America & Caribbean,,,Aruban florin,⋯,,,,,2008,,AW,AW,Aruba,Aruba
2,ADO,,Andorra,..,Principality of Andorra,High income: nonOECD,Europe & Central Asia,,,Euro,⋯,,Yes,,,2006,,AD,AD,Andorra,Andorra
3,AFG,105.0,Afghanistan,20497,Islamic State of Afghanistan,Low income,South Asia,IDA,HIPC,Afghan afghani,⋯,,,,,2008,2000.0,AF,AF,Afghanistan,Afghanistan
4,AGO,60.0,Angola,114147,People's Republic of Angola,Lower middle income,Sub-Saharan Africa,IDA,,Angolan kwanza,⋯,"IHS, 2000",,1964-65,,1991,2000.0,AO,AO,Angola,Angola
5,ALB,125.0,Albania,12648,Republic of Albania,Upper middle income,Europe & Central Asia,IBRD,,Albanian lek,⋯,"LSMS, 2005",Yes,1998,2005.0,2008,2000.0,AL,AL,Albania,Albania
6,ARE,32.0,United Arab Emirates,348595,United Arab Emirates,High income: nonOECD,Middle East & North Africa,,,U.A.E. dirham,⋯,,,1998,,2008,2005.0,AE,AE,United Arab Emirates,United Arab Emirates


In [184]:
names(merged.gdp.edu)

In [188]:
merged.gdp.edu[, `Special.Notes`]

ERROR: Error in `[.data.frame`(merged.gdp.edu, , Special.Notes): object 'Special.Notes' not found


In [199]:
# Count the rows whose Special.Notes mention a fiscal year ending June 30
# (grepl() returns TRUE/FALSE per row, so sum() gives the number of matches)
sum(grepl(pattern = 'Fiscal year end: June 30', merged.gdp.edu$Special.Notes))

In [200]:
install.packages("quantmod")

In [201]:
library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)

In [214]:
sampleTimes <- index(amzn)
timeDT <- data.table::data.table(timeCol = sampleTimes)

In [216]:
timeDT[(timeCol >= "2012-01-01") & (timeCol < "2013-01-01"), .N]

In [219]:
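# format(x, "%u") gives the weekday as a number (1 = Monday), which avoids
# comparing against the locale-specific weekday names returned by weekdays()
timeDT[(timeCol >= "2012-01-01") & (timeCol < "2013-01-01") & (format(timeCol, "%u") == "1"), .N]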

# Peer-graded Assignment

In [None]:
#