# A.1 Working with data in R

### A.1.1 What are the other advantages of using R?
- We can be lazy and use the thousands of free libraries to easily:
    - Manipulate data (today's topic)
    - Download data directly from the internet
    - Visualize our data (graphing etc.)
    - Build models (regression, machine learning, neural networks)
    
    
- You have all used libraries before, perhaps without knowing it!
    - Using a library in R takes two steps:
        1. install.packages("package_name") downloads the package (only needed once)
        2. library(package_name) imports the package (needed in every new session)
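
The two steps can be combined so that a package is downloaded only when it is missing. This is a minimal sketch, not the only way to do it: the variable name pkg is made up for illustration, and the example uses the stats package (which ships with R) so that nothing is actually downloaded.

```r
# Hypothetical variable holding the package name; "stats" ships with R
pkg <- "stats"

# Step 1: download the package only if it is not already installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2: import the package; character.only = TRUE is needed because
# the name is stored in a variable rather than typed directly
library(pkg, character.only = TRUE)

# TRUE once the package is attached to the session
is_loaded <- paste0("package:", pkg) %in% search()
print(is_loaded)
```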

### A.1.2 What is a Data Frame? 
- Think of it as an Excel sheet with data
- In many cases:
    - Rows are observations (e.g. people, households, countries, time)
    - Columns are variables (e.g. GDP, life expectancy)
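
To make the rows-and-columns idea concrete, here is a small sketch in base R. The data frame name df is made up for illustration; the three rows are copied from the 2007 gapminder values that appear later in this tutorial.

```r
# Toy data frame: each row is one country (observation),
# each column is one variable
df <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria"),
  lifeExp   = c(43.828, 76.423, 72.301),
  gdpPercap = c(974.5803, 5937.0295, 6223.3675)
)

nrow(df)      # number of observations (rows)
ncol(df)      # number of variables (columns)
names(df)     # variable names
df$lifeExp    # one column, returned as a vector
df[2, ]       # one row: all variables for the second observation
```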
        


### A.1.3 What applications can I use to run R?
R is not a piece of software; it is a programming language. Several applications can run R code.

### A.1.3.1 RStudio Cloud
- Many students get frustrated because bugs sometimes prevent the software from running smoothly
- RStudio Cloud takes the hassle out of setting up RStudio and lets us focus on learning R!

### A.1.3.2 RStudio software for your machine [link](https://rstudio.com/products/rstudio/download/)

#### For those who want to use R in a Jupyter notebook (what this tutorial is written in)

- [Computer download: Anaconda software](https://www.anaconda.com/)
- Cloud services
    - [RStudio Cloud](https://rstudio.cloud/)
    - [Azure cloud](https://notebooks.azure.com/)
        

## A.2 RStudio basics
- A tutorial can be found at this [link](https://nbviewer.jupyter.org/github/corybaird/PLCY_610_public/blob/master/Discussion_sections/Disc1_Intro/Disc1_intro.ipynb)

## A.3 Import data and libraries

### A.3.1 Import libraries

In [3]:
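# Step 1: install the packages (only needed once; uncomment to run)

#install.packages('dplyr')
#install.packages('gapminder')

# Step 2: load the packages (needed in every new session)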

library('dplyr')
library('gapminder')

### A.3.2 Import data

In [6]:
gapminder %>% head(2)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853


# 1. dplyr review
- This is meant to be a brief review
- For longer dplyr notes, see this other [notebook](https://nbviewer.jupyter.org/github/corybaird/PLCY_610_public/blob/master/Reference_materials/Tutorials_R_Stata_Python/R/W1_DPLYR/W1_DPLYR_code.ipynb) I created

## 1.1 Select

In [8]:
gapminder %>% 
select(country, year, gdpPercap) %>% 
head(3)

country,year,gdpPercap
Afghanistan,1952,779.4453
Afghanistan,1957,820.853
Afghanistan,1962,853.1007


## 1.2 Filter

### 1.2.1 Filter by 1 condition

In [13]:
gapminder %>% 
filter(year==2007) %>% 
head(2)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,2007,43.828,31889923,974.5803
Albania,Europe,2007,76.423,3600523,5937.0295


### 1.2.2 Filter by 2 conditions

In [14]:
gapminder %>% 
filter(year>1990 & year<2007) %>% 
head(2)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1992,41.674,16317921,649.3414
Afghanistan,Asia,1997,41.763,22227415,635.3414


## 1.3 Mutate

In [16]:
gapminder %>% 
mutate(gdp_log = log(gdpPercap)) %>% 
head(3)

country,continent,year,lifeExp,pop,gdpPercap,gdp_log
Afghanistan,Asia,1952,28.801,8425333,779.4453,6.658583
Afghanistan,Asia,1957,30.332,9240934,820.853,6.710344
Afghanistan,Asia,1962,31.997,10267083,853.1007,6.748878


## 1.4 Summarise
- See list of functions under the "useful functions" header [here]

In [18]:
gapminder  %>% 
summarise(mean_pop = mean(pop),
         median_pop = median(pop))

mean_pop,median_pop
29601212,7023596


### 1.4.1 Summarise & Filter
- Chain two functions

In [22]:
gapminder  %>% 
filter(year==2007) %>% 
summarise(mean_pop = mean(pop))

mean_pop
44021220


## 1.5 Group by

In [23]:
gapminder  %>%
group_by(continent) %>% 
summarise(mean_gdp = mean(gdpPercap))

continent,mean_gdp
Africa,2193.755
Americas,7136.11
Asia,7902.15
Europe,14469.476
Oceania,18621.609


# 2. Data check

## 2.1 Data types: str(DF_NAME)

In [24]:
str(gapminder)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1704 obs. of  6 variables:
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num  779 821 853 836 740 ...


## 2.2 Summary stats: summary(DF_NAME)

In [25]:
summary(gapminder)

        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

## 2.3 Check for NA: is.na()

In [27]:
gapminder %>% 
is.na() %>% 
any()
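
The any() check gives a single TRUE or FALSE for the whole data frame. To see where the missing values are, it can help to count them per column. A minimal sketch in base R, using a made-up data frame (df_toy is hypothetical, not part of gapminder):

```r
# Toy data frame with a few missing values
df_toy <- data.frame(
  country = c("A", "B", "C"),
  pop     = c(100, NA, 300),
  gdp     = c(NA, NA, 5.5)
)

# One answer for the whole data frame
any(is.na(df_toy))

# Number of missing values in each column
na_per_column <- colSums(is.na(df_toy))
na_per_column

# TRUE for rows that have no missing value at all
complete_rows <- complete.cases(df_toy)
complete_rows
```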

## 2.4 Drop NA: na.omit()

- Add an NA row, then drop it

### 2.4.1 Add NA observations in the last row

In [42]:
# Adds NA row at the bottom of dataset
gapminder = gapminder %>% rbind(c(NA,NA, NA, NA, NA, NA))
gapminder %>% tail(2)

country,continent,year,lifeExp,pop,gdpPercap
,,,,,
,,,,,


### 2.4.2 Re-check for NA

In [46]:
gapminder %>% 
is.na() %>% 
any()

### 2.4.3 na.omit()

In [45]:
gapminder %>% 
na.omit() %>% tail(2)

country,continent,year,lifeExp,pop,gdpPercap
Zimbabwe,Africa,2002,39.989,11926563,672.0386
Zimbabwe,Africa,2007,43.487,12311143,469.7093
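
Note that na.omit() returns a new data frame and leaves the original untouched, so the result has to be assigned to a name if it is to be kept. A small sketch in base R with made-up data (df_toy, df_clean and df_x_ok are hypothetical names):

```r
# Toy data: row 2 is missing x, row 3 is missing y
df_toy <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

# na.omit() returns a new data frame; df_toy itself is unchanged
df_clean <- na.omit(df_toy)

nrow(df_toy)    # still 3
nrow(df_clean)  # 1: only the first row has no NA

# Drop rows only when one particular column is missing
df_x_ok <- df_toy[!is.na(df_toy$x), ]
nrow(df_x_ok)   # 2
```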


# 3. Data manipulation

## 3.1 Dummy variable

In [50]:
gapminder_2007 = gapminder %>% filter(year==2007)
gapminder_2007 %>% head(5)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,2007,43.828,31889923,974.5803
Albania,Europe,2007,76.423,3600523,5937.0295
Algeria,Africa,2007,72.301,33333216,6223.3675
Angola,Africa,2007,42.731,12420476,4797.2313
Argentina,Americas,2007,75.32,40301927,12779.3796


### 3.1.1 Add dummy for high-income countries

In [53]:
gapminder_2007 = gapminder_2007 %>% 
mutate(highinc_dummy = as.numeric(gdpPercap>10000))

In [54]:
gapminder_2007  %>% head(2)

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy
Afghanistan,Asia,2007,43.828,31889923,974.5803,0
Albania,Europe,2007,76.423,3600523,5937.0295,0


### 3.1.2 Dummies are useful for summary stats

In [80]:
gapminder_2007  %>% 
group_by(highinc_dummy) %>% 
summarise(gdp_mean = mean(gdpPercap),
         pop_mean = mean(pop))

highinc_dummy,gdp_mean,pop_mean
0,3397.524,54947339
1,25177.557,26215692


## 3.2 Mapping values

In [60]:
gapminder_2007  %>% 
mutate(highinc_dummy_factor = recode(highinc_dummy, '0'='Low', '1'='High')) %>% 
tail(2)

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,highinc_dummy_factor
Zambia,Africa,2007,42.384,11746035,1271.2116,0,Low
Zimbabwe,Africa,2007,43.487,12311143,469.7093,0,Low


## 3.3 Cut-off dummies

In [75]:
cutoffs = seq(40, 100, by = 10)
cutoffs

In [76]:
gapminder_2007 = gapminder_2007  %>% 
mutate(cut_variable = cut(lifeExp, cutoffs, include.lowest=TRUE))

In [77]:
gapminder_2007 %>% head(3)

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,cut_variable
Afghanistan,Asia,2007,43.828,31889923,974.5803,0,"[40,50]"
Albania,Europe,2007,76.423,3600523,5937.0295,0,"(70,80]"
Algeria,Africa,2007,72.301,33333216,6223.3675,0,"(70,80]"


In [78]:
gapminder_2007 %>% 
group_by(cut_variable) %>% 
summarise(mean_gdp = mean(gdpPercap))

“Factor `cut_variable` contains implicit NA, consider using `forcats::fct_explicit_na`”

cut_variable,mean_gdp
"[40,50]",1586.257
"(50,60]",3078.872
"(60,70]",2869.655
"(70,80]",15258.397
"(80,90]",33662.222
,4513.481
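
The unlabeled last row in the output above is the implicit NA group named in the warning. cut() returns NA for any value that falls outside the breaks, and the cut-offs here start at 40, so the likely cause is a life expectancy below 40 in 2007. A sketch in base R with made-up values (life_exp and cutoffs_wide are hypothetical names), showing the NA and one possible fix, which is to extend the breaks:

```r
life_exp <- c(39.6, 43.8, 76.4, 80.7)   # made-up values
cutoffs <- seq(40, 100, by = 10)

# 39.6 is below the lowest break, so it is assigned NA
bins <- cut(life_exp, cutoffs, include.lowest = TRUE)
bins
sum(is.na(bins))

# Extending the breaks downward covers every value
cutoffs_wide <- seq(30, 100, by = 10)
bins_wide <- cut(life_exp, cutoffs_wide, include.lowest = TRUE)
bins_wide
sum(is.na(bins_wide))
```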


# 4. Misc

## 4.1 Rename column

In [84]:
gapminder_2007  %>% rename('gdp'='gdpPercap') %>% head(2)

country,continent,year,lifeExp,pop,gdp,highinc_dummy,cut_variable
Afghanistan,Asia,2007,43.828,31889923,974.5803,0,"[40,50]"
Albania,Europe,2007,76.423,3600523,5937.0295,0,"(70,80]"


## 4.2 Unique

In [88]:
gapminder_2007 %>% 
select(continent) %>% 
unique()

continent
Asia
Europe
Africa
Americas
Oceania


## 4.3 table(row, column)

In [90]:
df_polity = read.csv('https://raw.githubusercontent.com/corybaird/PLCY_610_public/master/Discussion_sections/Disc4_PS2/demo.csv')
df_polity %>% head(2)

country,polity2,gdp,regime,wealth
US,10,18054,3,3
CANADA,10,17173,3,3


### 4.3.1 Freq table

In [94]:
freq_table = df_polity %>% select(wealth, regime) %>% table()
freq_table

      regime
wealth  1  2  3
     1 26  5  6
     2  6 15 20
     3  0  1 26

In [95]:
rownames(freq_table) = c('Wealth 1', 'Wealth 2', 'Wealth 3')
colnames(freq_table) = c('Regime 1', 'Regime 2', 'Regime 3')
freq_table 

          regime
wealth     Regime 1 Regime 2 Regime 3
  Wealth 1       26        5        6
  Wealth 2        6       15       20
  Wealth 3        0        1       26

### 4.3.2 prop.table

In [96]:
prop.table(freq_table)

          regime
wealth       Regime 1   Regime 2   Regime 3
  Wealth 1 0.24761905 0.04761905 0.05714286
  Wealth 2 0.05714286 0.14285714 0.19047619
  Wealth 3 0.00000000 0.00952381 0.24761905
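
By default prop.table() divides every cell by the grand total. Its margin argument gives row or column shares instead, which is often what a cross-tab question needs. The sketch below rebuilds the frequency table by hand from the counts shown above so that it runs without the CSV file; the names overall, by_row and by_col are made up for illustration.

```r
# Frequency table rebuilt by hand from the counts shown above
freq_table <- matrix(
  c(26,  5,  6,
     6, 15, 20,
     0,  1, 26),
  nrow = 3, byrow = TRUE,
  dimnames = list(wealth = c("Wealth 1", "Wealth 2", "Wealth 3"),
                  regime = c("Regime 1", "Regime 2", "Regime 3"))
)
freq_table <- as.table(freq_table)

# Share of the grand total (the default)
overall <- prop.table(freq_table)

# Share within each wealth group: every row sums to 1
by_row <- prop.table(freq_table, margin = 1)

# Share within each regime type: every column sums to 1
by_col <- prop.table(freq_table, margin = 2)

round(by_row, 3)
round(by_col, 3)
```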

## 4.4 Filter list

In [97]:
country_list = c('Albania', 'Italy', 'France', 'Belgium')
gapminder_2007  %>% filter(country %in% country_list)

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,cut_variable
Albania,Europe,2007,76.423,3600523,5937.03,0,"(70,80]"
Belgium,Europe,2007,79.441,10392226,33692.61,1,"(70,80]"
France,Europe,2007,80.657,61083916,30470.02,1,"(80,90]"
Italy,Europe,2007,80.546,58147733,28569.72,1,"(80,90]"


## 4.5 Case when

In [131]:
spanish_country_list = c('Spain', 'Argentina', 'Mexico','Chile')

gapminder_2007 %>% 
mutate(language = case_when(country=='Spain'~'Spanish', 
                           country=='Italy' ~ 'Italian', 
                           country=='United Kingdom'~'English')) %>% na.omit()

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,cut_variable,language
Italy,Europe,2007,80.546,58147733,28569.72,1,"(80,90]",Italian
Spain,Europe,2007,80.941,40448191,28821.06,1,"(80,90]",Spanish
United Kingdom,Europe,2007,79.425,60776238,33203.26,1,"(70,80]",English
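
A condition inside case_when() can also test membership in a list with %in%, and a final TRUE condition acts as a fallback so that unmatched rows get a label instead of NA. The sketch below assumes dplyr is installed and uses a made-up data frame (df_toy) in place of gapminder_2007; the "Other" label is an illustrative choice.

```r
library(dplyr)

# Toy data frame (hypothetical) standing in for gapminder_2007
df_toy <- data.frame(
  country = c("Spain", "Mexico", "Italy", "Japan"),
  stringsAsFactors = FALSE
)

spanish_country_list <- c("Spain", "Argentina", "Mexico", "Chile")

df_lang <- df_toy %>%
  mutate(language = case_when(
    country %in% spanish_country_list ~ "Spanish",  # any country in the list
    country == "Italy"                ~ "Italian",
    TRUE                              ~ "Other"     # fallback for all remaining rows
  ))

df_lang
```

Conditions are checked in order and the first match wins, so the fallback has to come last.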


## 4.6 Count

In [132]:
gapminder_2007  %>% count(highinc_dummy)

highinc_dummy,n
0,88
1,54


## 4.7 Export data

In [None]:
#gapminder_2007 %>% write.csv('FILENAME.csv')
# write.xlsx() is not part of base R; it needs an add-on package such as openxlsx
#gapminder_2007 %>% write.xlsx('FILENAME.xlsx')
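
A quick way to check an export is to write the file and read it back. The sketch below uses base R only, with a made-up data frame (df_toy) and a temporary file path so that nothing is left behind. The row.names = FALSE argument is a suggestion rather than part of the original call: without it, write.csv() adds the row index as an extra column.

```r
# Toy data frame (hypothetical)
df_toy <- data.frame(country = c("Albania", "Italy"),
                     lifeExp = c(76.423, 80.546))

# tempfile() gives a throwaway path; replace it with a real file name
out_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops R from writing the row index as an extra column
write.csv(df_toy, out_path, row.names = FALSE)

# Read the file back to confirm the export
df_back <- read.csv(out_path)
df_back
```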

# 5. Merge

## 5.1 Merge rows

In [103]:
df_1 = gapminder_2007[1:3, ]
df_1

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,cut_variable
Afghanistan,Asia,2007,43.828,31889923,974.5803,0,"[40,50]"
Albania,Europe,2007,76.423,3600523,5937.0295,0,"(70,80]"
Algeria,Africa,2007,72.301,33333216,6223.3675,0,"(70,80]"


In [105]:
df_2 = gapminder_2007[5:7, ]
df_2

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,cut_variable
Argentina,Americas,2007,75.32,40301927,12779.38,1,"(70,80]"
Australia,Oceania,2007,81.235,20434176,34435.37,1,"(80,90]"
Austria,Europe,2007,79.829,8199783,36126.49,1,"(70,80]"


### 5.1.1 rbind

In [106]:
rbind(df_1, df_2)

country,continent,year,lifeExp,pop,gdpPercap,highinc_dummy,cut_variable
Afghanistan,Asia,2007,43.828,31889923,974.5803,0,"[40,50]"
Albania,Europe,2007,76.423,3600523,5937.0295,0,"(70,80]"
Algeria,Africa,2007,72.301,33333216,6223.3675,0,"(70,80]"
Argentina,Americas,2007,75.32,40301927,12779.3796,1,"(70,80]"
Australia,Oceania,2007,81.235,20434176,34435.3674,1,"(80,90]"
Austria,Europe,2007,79.829,8199783,36126.4927,1,"(70,80]"


## 5.2 Merge columns

In [114]:
df_1 = gapminder_2007[1:3, c('year','lifeExp')]
df_1
df_2 = gapminder_2007[5:7, c('continent','country')]
df_2

year,lifeExp
2007,43.828
2007,76.423
2007,72.301


continent,country
Americas,Argentina
Oceania,Australia
Europe,Austria


In [110]:
cbind(df_1, df_2)

year,lifeExp,continent,country
2007,43.828,Americas,Argentina
2007,76.423,Oceania,Australia
2007,72.301,Europe,Austria
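
One caution: cbind() pairs rows purely by position. In the output above, the life expectancy values from rows 1 to 3 sit next to country names from rows 5 to 7, so the combined rows do not describe the same countries. The sketch below uses made-up data (left, right, by_position and by_key are hypothetical names) to contrast cbind() with merge(), which matches rows on a key column.

```r
# Toy data: the two pieces list the countries in different orders
left  <- data.frame(country = c("A", "B", "C"), lifeExp = c(50, 60, 70))
right <- data.frame(country = c("C", "A", "B"), pop = c(3, 1, 2))

# cbind() pastes row 1 next to row 1, whatever the rows contain,
# so country A ends up with the population of country C
by_position <- cbind(left, pop = right$pop)
by_position

# merge() lines rows up by the value of the key column instead
by_key <- merge(left, right, by = "country")
by_key
```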


## 5.3 Merge rows and columns

### 5.3.1 Case data

In [115]:
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
nyt_cases_df = read.csv(url)
nyt_cases_df  %>% head(3)

date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061,1,0
2020-01-22,Snohomish,Washington,53061,1,0
2020-01-23,Snohomish,Washington,53061,1,0


### 5.3.2 Mask data

In [117]:
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/mask-use/mask-use-by-county.csv'
nyt_mask_df = read.csv(url)
nyt_mask_df %>% head(3)

COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
1001,0.053,0.074,0.134,0.295,0.444
1003,0.083,0.059,0.098,0.323,0.436
1005,0.067,0.121,0.12,0.201,0.491


### 5.3.3 Merge

#### 5.3.3.1 Make sure the column to merge on has the same name in both data sets

In [119]:
nyt_mask_df = nyt_mask_df %>% rename('fips'='COUNTYFP')
nyt_mask_df %>% names()

In [121]:
merge(nyt_cases_df, nyt_mask_df, by='fips') %>% head(10)

fips,date,county,state,cases,deaths,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
1001,2020-11-27,Autauga,Alabama,2716,42,0.053,0.074,0.134,0.295,0.444
1001,2020-10-17,Autauga,Alabama,1983,28,0.053,0.074,0.134,0.295,0.444
1001,2020-07-02,Autauga,Alabama,561,13,0.053,0.074,0.134,0.295,0.444
1001,2020-09-19,Autauga,Alabama,1673,24,0.053,0.074,0.134,0.295,0.444
1001,2020-09-03,Autauga,Alabama,1466,24,0.053,0.074,0.134,0.295,0.444
1001,2020-12-18,Autauga,Alabama,3647,44,0.053,0.074,0.134,0.295,0.444
1001,2020-08-06,Autauga,Alabama,1096,22,0.053,0.074,0.134,0.295,0.444
1001,2020-03-24,Autauga,Alabama,1,0,0.053,0.074,0.134,0.295,0.444
1001,2020-08-19,Autauga,Alabama,1298,23,0.053,0.074,0.134,0.295,0.444
1001,2020-10-27,Autauga,Alabama,2082,31,0.053,0.074,0.134,0.295,0.444
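
Two options of merge() are worth knowing here. By default it keeps only rows whose key appears in both tables, and the by.x and by.y arguments allow the key to have a different name in each table, so renaming is optional. The sketch below uses small made-up tables (cases and masks are hypothetical stand-ins for the two NYT data sets) rather than downloading the real files.

```r
# Toy tables (hypothetical) mimicking the two NYT data sets
cases <- data.frame(fips = c(1001, 1001, 1003, 9999),
                    cases = c(10, 12, 5, 7))
masks <- data.frame(COUNTYFP = c(1001, 1003, 1005),
                    ALWAYS = c(0.444, 0.436, 0.491))

# by.x / by.y name the key in each table, so no rename is needed;
# the key column in the result takes its name from by.x
inner <- merge(cases, masks, by.x = "fips", by.y = "COUNTYFP")
nrow(inner)   # 3: fips 9999 and 1005 have no partner and are dropped

# all.x = TRUE keeps every row of the first table (a left join);
# unmatched rows get NA in the columns from the second table
left <- merge(cases, masks, by.x = "fips", by.y = "COUNTYFP", all.x = TRUE)
nrow(left)    # 4
left
```

As the output above suggests, merge() does not keep the original row order within a key, so sort by date afterwards if the order matters.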
