# Session 1: RStudio and downloading (September 10th)


### Session 1: notes

1. Module 1 covers RStudio basics
    - [Link](https://github.com/corybaird/SPP_Data_Seminar/blob/main/R/Session_1/Module_1_Rstudio.ipynb)
2. Module 2 covers downloading 
    - Current page
3. Exercise to practice material covered in Modules 1 & 2
    - [Link](https://github.com/corybaird/SPP_Data_Seminar/blob/main/R/Session_1/Session_1_Excercise.ipynb)



# Module 2: Downloading data




### Module 2 notes

There are 4 common ways to import/download data:

1. Read files from your computer
    - See R.1
        - Covered last week


2. Read files using an R library
    - See 1.1 WDI library
    
    
    
3. Read files directly from websites
    - Zip data (see 2.1)
    - xlsx data (see 2.2)
    - GitHub data (see 2.3)


4. Scrape data
    - HTML tables (See 3.1)



# Review

## R.1 Import from csv file

### R.1.1 From same working directory as code

In [1]:
df = read.csv('vote.csv')
head(df,2)

state,vote,income,education,age,sex
AR,1,9,2,73,0
AR,1,11,2,24,0


### R.1.2 From sub-directory

In [2]:
# R.1.2 Read csv file in sub-folder
df_arrests = read.csv("Sub_folder/arrests.csv")
head(df_arrests,2)

X,Murder,Assault,UrbanPop,Rape
Alabama,13.2,236,58,21.2
Alaska,10.0,263,48,44.5
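
A relative path such as "Sub_folder/arrests.csv" is resolved against the current working directory, so a missing file usually means the working directory is not where you expect. The sketch below is only an illustration: it builds a throwaway folder and a small hypothetical file (demo.csv, not part of the course data) so it runs anywhere, then checks that the file exists before reading it.

```r
# Create a throwaway folder and file so the example runs on any machine
base_dir <- tempfile("demo_")
dir.create(file.path(base_dir, "Sub_folder"), recursive = TRUE)
demo <- data.frame(state = c("AR", "AL"), vote = c(1, 0))
csv_path <- file.path(base_dir, "Sub_folder", "demo.csv")  # hypothetical file
write.csv(demo, csv_path, row.names = FALSE)

# file.path() joins folder and file names with the correct separator;
# file.exists() confirms the path before read.csv() is called
if (file.exists(csv_path)) {
  df_demo <- read.csv(csv_path)
}
head(df_demo, 2)
```

With the course files, the same check would be file.exists("Sub_folder/arrests.csv"); getwd() shows which directory the relative path starts from.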


# 1. Download data using libraries

## 1.1 WDI library



### 1.1.A Install packages and import library

- Using a package in R takes two steps: 
    1. install.packages("package_name"): downloads and installs the package (only needed once)
    2. library(package_name): loads the package (needed in every new session)

In [4]:
# Step 1: install
install.packages('WDI')

# Step 2: load the library
library(WDI)

also installing the dependency ‘RJSONIO’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
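
Because installation is only needed once, the two steps above can be combined so that the package is downloaded only when it is missing. This is a minimal sketch; ensure_package() is a hypothetical helper name, not part of base R or of WDI.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so nothing is downloaded here
ensure_package("stats")
```

For this module the call would be ensure_package("WDI"), followed by library(WDI) as before.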


### 1.1.1 Search for indicators

In [32]:
search_df = WDIsearch(string = "gender", field = "name") 
head(search_df,4)

indicator,name
2.3_GIR.GPI,Gender parity index for gross intake ratio in grade 1
2.6_PCR.GPI,Gender parity index for primary completion rate
5.51.01.07.gender,Gender equality
BI.EMP.PWRK.PB.FE.ZS,Public sector employment as a share of paid employment by gender (Female)


### 1.1.2 Download data

In [33]:
wdi_df = WDI(indicator = "5.51.01.07.gender", country= c('AF'), start=2000)
head(wdi_df,3)

iso2c,country,5.51.01.07.gender,year
AF,Afghanistan,1.0,2020
AF,Afghanistan,0.66667,2019
AF,Afghanistan,0.66667,2018
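
The indicator code becomes a column name that starts with a digit, which is awkward to type in later steps. The sketch below shows one way to rename it and sort by year. To stay runnable without an internet connection it rebuilds the three rows printed above by hand instead of calling WDI(); the name gender_equality is an assumption chosen for illustration.

```r
# Rows copied from the output above (normally this is the result of WDI())
wdi_demo <- data.frame(
  iso2c = c("AF", "AF", "AF"),
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  `5.51.01.07.gender` = c(1.0, 0.66667, 0.66667),
  year = c(2020, 2019, 2018),
  check.names = FALSE,       # keep the indicator code as the column name
  stringsAsFactors = FALSE
)

# Rename the indicator column (new name is illustrative)
names(wdi_demo)[names(wdi_demo) == "5.51.01.07.gender"] <- "gender_equality"

# Sort from the earliest to the latest year
wdi_demo <- wdi_demo[order(wdi_demo$year), ]
head(wdi_demo, 3)
```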


# 2. Download from websites

## 2.1 Zip data

### 2.1.A Download and import libraries

In [None]:
#Step 1
#install.packages('downloader')
#install.packages('foreign')
#install.packages('dplyr')

In [50]:
#Step 2
library(foreign) #Imports dta files
library(dplyr) #Data manipulation
library(downloader) #Downloads files from the internet

### 2.1.1. Download zip

- Data: MXFLS data
    - Website url: http://www.ennvih-mxfls.org/english/ennvih-1.html


In [37]:
#URL 
url = "http://www.ennvih-mxfls.org/english/assets/hh02dta_bc.zip"
#File name
file_name = "mxfls.zip"

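In [38]:
# download.file() is part of base R (utils); the downloader package offers download() as a wrapper
# mode = "wb" writes the zip as a binary file (matters on Windows)
download.file(url, file_name, mode = "wb")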

In [57]:
#Shows folders
list.dirs()
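
Re-running the notebook downloads the same archive again each time. A possible way to avoid that is to check for the file first, as in the sketch below; download_if_missing() is a hypothetical helper, not a function from the downloader package.

```r
# Hypothetical helper: download only when the file is not already on disk
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    message("Already downloaded: ", dest)
    return(invisible(FALSE))
  }
  # mode = "wb" writes the file as binary, which zip and xlsx files need on Windows
  download.file(url, dest, mode = "wb")
  invisible(TRUE)
}

# Usage with the names defined above (needs an internet connection):
# download_if_missing(url, file_name)
```

The helper returns FALSE when it skips the download and TRUE after a fresh download.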

### 2.1.2 Show files in zip folder

In [39]:
unzip("mxfls.zip", list = TRUE)

Name,Length,Date
hh02dta_bc/c_conpor.dta,2834192,2004-12-20 19:18:00
hh02dta_bc/c_cv.dta,1730138,2004-12-20 19:18:00
hh02dta_bc/c_cvo.dta,406940,2004-12-20 19:18:00
hh02dta_bc/c_eh.dta,249060,2005-07-14 10:35:00
hh02dta_bc/c_ls.dta,3428996,2004-12-20 19:18:00
hh02dta_bc/c_ne.dta,103070,2004-12-20 19:18:00
hh02dta_bc/c_portad.dta,373398,2005-07-14 10:35:00
hh02dta_bc/c_rc.dta,440998,2004-12-20 19:18:00
hh02dta_bc/c_sp.dta,1116290,2005-07-14 10:35:00
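
The listing is an ordinary data frame, so it can be filtered to find one file and extract only that file instead of the whole archive. The sketch below rebuilds three rows of the listing by hand so it runs without the zip on disk; with the real archive the listing would come from unzip("mxfls.zip", list = TRUE).

```r
# Three rows copied from the listing above
listing <- data.frame(
  Name = c("hh02dta_bc/c_conpor.dta", "hh02dta_bc/c_cv.dta", "hh02dta_bc/c_ls.dta"),
  Length = c(2834192, 1730138, 3428996),
  stringsAsFactors = FALSE
)

# Keep only the entry whose file name is c_ls.dta
wanted <- listing$Name[basename(listing$Name) == "c_ls.dta"]
wanted

# With the real archive on disk, extract just that file:
# unzip("mxfls.zip", files = wanted)
```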


### 2.1.3 Unzip folder


In [40]:
unzip("mxfls.zip")

### 2.1.4 Import data

In [41]:
df = read.dta("hh02dta_bc/c_ls.dta")

In [42]:
df %>% 
head(2)

folio,ls,secuencia,ls00,ls02_1,ls02_2,ls03_1,ls03_21,ls03_22,ls04,...,ls09,ls10,ls11,ls12,ls13_1,ls13_2,ls14,ls15_1,ls16,ls18
1000,1,1,1,1,37,,,,1,...,1,5,2,1,1.0,32000.0,3,6.0,3,
1000,2,2,2,1,35,,,,3,...,1,5,1,3,,,1,,3,


## 2.2 xlsx data directly from url

### 2.2.A Show files in directory

In [59]:
list.files()

### 2.2.1 Download

In [60]:
url = 'https://www.macrohistory.net/app/download/9834512569/JSTdatasetR5.xlsx?t=1623599312'
file_name = 'Macro.xlsx'
download.file(url, file_name, mode = "wb") # mode = "wb" writes the xlsx as a binary file (matters on Windows)

In [61]:
list.files()

In [62]:
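library(readxl) #Provides read_excel(); install once with install.packages('readxl')
df_macro = read_excel(file_name, sheet=2) #Reads second sheet; file_name avoids a Macro.xlsx/macro.xlsx case mismatch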
head(df_macro)

year,country,iso,ifs,pop,rgdpmad,rgdppc,rconpc,gdp,iy,...,eq_capgain,eq_dp,eq_capgain_interp,eq_tr_interp,eq_dp_interp,bond_rate,eq_div_rtn,capital_tr,risky_tr,safe_tr
1870,Australia,AUS,193,1775,3273.239,13.83616,21.44973,208.78,0.1092656,...,-0.07004543,0.07141703,,,,0.04911817,0.06641459,,,
1871,Australia,AUS,193,1675,3298.507,13.93686,19.9308,211.56,0.1045791,...,0.04165363,0.06546638,,,,0.04844633,0.06819329,,,
1872,Australia,AUS,193,1722,3553.426,15.04425,21.08501,227.4,0.130438,...,0.10894547,0.06299735,,,,0.0473735,0.06986062,,,
1873,Australia,AUS,193,1769,3823.629,16.21944,23.25491,266.54,0.1249862,...,0.08308647,0.06448419,,,,0.04671958,0.06984195,,,
1874,Australia,AUS,193,1822,3834.797,16.26823,23.45805,287.58,0.1419599,...,0.11938886,0.06350296,,,,0.04653317,0.07108451,,,
1875,Australia,AUS,193,1874,4138.207,17.59211,25.66951,300.74,0.160564,...,0.05705873,0.06364298,,,,0.04507325,0.06727437,,,
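
Choosing sheet=2 assumes you already know the workbook layout. One way to check is to list the sheet names first with readxl's excel_sheets(). The sketch below wraps that call in a hypothetical helper (sheet_names_or_null) that returns NULL when readxl is not installed or the file is not on disk, so it is safe to run before the download has happened.

```r
# Hypothetical helper: sheet names of a workbook, or NULL if it cannot be read
sheet_names_or_null <- function(path) {
  if (!requireNamespace("readxl", quietly = TRUE) || !file.exists(path)) {
    return(NULL)
  }
  readxl::excel_sheets(path)
}

# Returns NULL if Macro.xlsx has not been downloaded or readxl is missing
sheets <- sheet_names_or_null("Macro.xlsx")
sheets
```

Once the names are known, a sheet can also be selected by name instead of by position, for example read_excel(file_name, sheet = sheets[2]).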


## 2.3 Download from GitHub

- Open the file's GitHub page: https://github.com/nytimes/covid-19-data/blob/master/excess-deaths/deaths.csv
- Click "Raw" to view the raw file
- Copy the url of the raw file and pass it to read.csv()

In [66]:
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/excess-deaths/deaths.csv'
df_covid = read.csv(url)
head(df_covid,3)

country,placename,frequency,start_date,end_date,year,month,week,deaths,expected_deaths,excess_deaths,baseline
Austria,,weekly,2020-01-06,2020-01-12,2020,1,2,1702,1806,-104,2015-2019 historical data
Austria,,weekly,2020-01-13,2020-01-19,2020,1,3,1797,1819,-22,2015-2019 historical data
Austria,,weekly,2020-01-20,2020-01-26,2020,1,4,1779,1831,-52,2015-2019 historical data
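
The raw url differs from the page url in a predictable way: the host changes to raw.githubusercontent.com and the /blob/ segment is dropped. The sketch below does that conversion with string substitution. github_raw_url() is a hypothetical helper, and it assumes the standard github.com/user/repo/blob/branch/path layout.

```r
# Hypothetical helper: turn a GitHub "blob" page URL into its raw-file URL
github_raw_url <- function(blob_url) {
  # Swap the host, then drop the first "/blob/" path segment
  raw <- sub("https://github.com/", "https://raw.githubusercontent.com/",
             blob_url, fixed = TRUE)
  sub("/blob/", "/", raw, fixed = TRUE)
}

page_url <- "https://github.com/nytimes/covid-19-data/blob/master/excess-deaths/deaths.csv"
raw_url <- github_raw_url(page_url)
raw_url
```

The result is the same address used in the read.csv() call above.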


# 3. Scraping

## 3.1 Read html table

### 3.1.A Import library

In [78]:
# Step 1
#install.packages('rvest')
# Step 2
library(rvest)

### 3.1.1 Download webpage -- XML format

In [79]:
url =  "https://en.wikipedia.org/wiki/List_of_current_United_States_governors"
file = read_html(url)
file

{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

### 3.1.2 Extract tables

In [82]:
tables = html_nodes(file, "table")
#tables

### 3.1.3 Table -> Dataframe

In [95]:
table1 = html_table(tables[[2]], fill = TRUE) # [[2]] picks the second table, so the result is a single data frame
table1

Democratic (23) Republican (27),Democratic (23) Republican (27).1,Democratic (23) Republican (27).2,Democratic (23) Republican (27).3,Democratic (23) Republican (27).4,Democratic (23) Republican (27).5,Democratic (23) Republican (27).6,Democratic (23) Republican (27).7,Democratic (23) Republican (27).8,Democratic (23) Republican (27).9,Democratic (23) Republican (27).10
State,Portrait,Governor,Party,Party,Born,Prior public experience,Inauguration,End of term,Past governors,
Alabama,,Kay Ivey,,Republican,"(1944-10-15) October 15, 1944 (age 76)","Lieutenant Governor, Treasurer","April 10, 2017",2023,List,
Alaska,,Mike Dunleavy,,Republican,"(1961-05-05) May 5, 1961 (age 60)",Alaska Senate,"December 3, 2018",2022,List,
Arizona,,Doug Ducey,,Republican,"(1964-04-09) April 9, 1964 (age 57)",Treasurer,"January 5, 2015",2023 (term limits),List,
Arkansas,,Asa Hutchinson,,Republican,"(1950-12-03) December 3, 1950 (age 70)","Under Secretary of Homeland Security for Border & Transportation Security, Administrator of the Drug Enforcement Administration, U.S. House, U.S. Attorney","January 13, 2015",2023 (term limits),List,
California,,Gavin Newsom,,Democratic,"(1967-10-10) October 10, 1967 (age 53)","Lieutenant Governor, Mayor of San Francisco","January 7, 2019",2023,List,
Colorado,,Jared Polis,,Democratic,"(1975-05-12) May 12, 1975 (age 46)","U.S. House, Colorado State Board of Education","January 8, 2019",2023,List,
Connecticut,,Ned Lamont,,Democratic,"(1954-01-03) January 3, 1954 (age 67)",Greenwich Selectman,"January 9, 2019",2023,List,
Delaware,,John Carney,,Democratic,"(1956-05-20) May 20, 1956 (age 65)","U.S. House, Lieutenant Governor","January 17, 2017",2025 (term limits),List,
Florida,,Ron DeSantis,,Republican,"(1978-09-14) September 14, 1978 (age 42)",U.S. House,"January 8, 2019",2023,List,
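
In the output above, the real column names (State, Portrait, Governor, ...) sit in the first data row, while the header repeats the party counts. A common cleanup is to promote that first row to column names and then drop it. The sketch below shows the idea in base R on a small hand-built data frame that copies a few cells from the table, so it runs without rvest or an internet connection.

```r
# A few cells copied from the scraped table; V1..V3 stand in for the repeated header
raw <- data.frame(
  V1 = c("State", "Alabama", "Alaska"),
  V2 = c("Governor", "Kay Ivey", "Mike Dunleavy"),
  V3 = c("Party", "Republican", "Republican"),
  stringsAsFactors = FALSE
)

# Use the first row as column names, then remove it from the data
names(raw) <- unname(unlist(raw[1, ]))
clean <- raw[-1, ]
rownames(clean) <- NULL
clean
```

On the full table, duplicated names such as the two Party columns would still need to be made unique, for example with make.unique().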


# Appendix: Reading Stata files

## A.1 dta files


In [107]:
#url = 'https://github.com/corybaird/PLCY_611_Public/blob/f4b9cc3ad04f174a1de78f13fa3a2b500fbe700d/Sessions/Session_1/session1.dta?raw=true'
#df = read.dta(url)
#head(df,2)

## A.2 dta files (Stata 13 and above)

In [98]:
install.packages('readstata13')
library(readstata13)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [101]:
url = "https://github.com/corybaird/PLCY_611_Public/blob/main/Sessions/Session_1/session1.dta?raw=true"
df = read.dta13(url)
head(df,2)

“
   Factor codes of type double or float detected in variables

   gov, region, urban, sex, nationality,
   marital, reltohd, student, educatt1,
   rsnnosrch, rsnolf, fteducatt, mteducatt

   No labels have been assigned.
   Set option 'nonint.factors = TRUE' to assign labels anyway.
”

hhid,indid,gov,district,subdistrict,stratum,region,urban,sex,pn,...,crlfsr1,student,educatt1,wealth,qwealth,yrseduc,rsnnosrch,rsnolf,fteducatt,mteducatt
1027101,102710101,21,3,1,1,2,1,2,1,...,1,0,3,-1.457183,1,9,,,2,1
1027101,102710102,21,3,1,1,2,1,2,2,...,0,1,2,-1.457183,1,9,,1.0,2,3
