

slides: add 01-introduction
gastonstat committed Nov 18, 2016
1 parent 6e272e9 commit 9fa4a6f
Showing 7 changed files with 837 additions and 0 deletions.
397 changes: 397 additions & 0 deletions 01-introduction/01-introduction.Rnw
@@ -0,0 +1,397 @@

% load packages


\title[Getting data from the web with R]{\LARGE Getting Data from the Web with R}
\subtitle[Web Data in R]{\large Part 1: Introduction}
\textcolor{gray}{\textbf{G}aston \textbf{S}anchez}
\institute[]{\scriptsize \textcolor{lightgray}{April-May 2014}}
\date[CC BY-NC-SA 4.0]{
\textcolor{lightgray}{\tiny{Content licensed under
\href{}{CC BY-NC-SA 4.0}}}


<<setup, include=FALSE>>=
# smaller font size for chunks
opts_chunk$set(size = 'tiny')
# fetch the "bclear" highlighting theme and apply it
thm <- knit_theme$get("bclear")
knit_theme$set(thm)
@

%--- the titlepage frame -------------------------%



{ % all template changes are local to this group.
\setbeamertemplate{navigation symbols}{}
\begin{tikzpicture}[remember picture,overlay]
\node[at=(current] {
\node[fill=black, opacity=0, text opacity=1] at (5.5,-3.8) {\large{ \color{white} Getting Data from the Web with R}};



\begin{block}{\scriptsize License:}
\item[] Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License \\

\begin{block}{\scriptsize You are free to:}
\item[] \textcolor{darkgray}{\textbf{Share}} --- \textcolor{gray}{copy and redistribute the material}
\item[] \textcolor{darkgray}{\textbf{Adapt}} --- \textcolor{gray}{rebuild and transform the material}

\begin{block}{\scriptsize Under the following conditions:}
\item[] \textcolor{darkgray}{\textbf{Attribution}} --- \textcolor{gray}{You must give appropriate credit, provide a link to the license, and indicate if changes were made.}
\item[] \textcolor{darkgray}{\textbf{NonCommercial}} --- \textcolor{gray}{You may not use this work for commercial purposes.}
\item[] \textcolor{darkgray}{\textbf{Share Alike}} --- \textcolor{gray}{If you remix, transform, or build upon this
work, you must distribute your contributions under the same license as the original.}



\frametitle{Lectures Menu}

%--- empty space ---%
\begin{block}{Slide Decks}
\item \textbf{Introduction}
\item \textcolor{lightgray}{Reading files from the Web}
\item \textcolor{lightgray}{Basics of XML and HTML}
\item \textcolor{lightgray}{Parsing XML / HTML content}
\item \textcolor{lightgray}{Handling JSON data}
\item \textcolor{lightgray}{HTTP Basics and the RCurl Package}
\item \textcolor{lightgray}{Getting data via Web Forms}
\item \textcolor{lightgray}{Getting data via Web APIs}
%--- empty space ---%



\frametitle{About these lectures}

%--- empty space ---%
My goal is \textbf{to give you an introduction} to some of the tools in R for getting data from the Web.

I don't intend to cover everything or go very deep. I just want to give you an overview of the various Web Data scenarios you can handle with R.
%--- empty space ---%






\begin{block}{Must have:}
\item Some experience working with R
\item Some knowledge of HTML
\item An insatiable curiosity for learning new things

\begin{block}{Nice to have:}
\item Knowledge about data storage formats
\item Some programming experience
\item Knowledge of how the Web works




\begin{block}{You'll need:}
\item R \textcolor{lightgray}{(preferably the latest version)} \\ \url{}
\item RStudio \textcolor{lightgray}{(highly recommended)} \\ \url{}
\item Text Editor \\ \textcolor{lightgray}{(e.g. vim, emacs, TextWrangler, Notepad, Sublime Text)}
\item Web Browser \\ \textcolor{lightgray}{(e.g. Chrome, Safari, Firefox, Internet Explorer, Opera)}
\item and a good internet connection!



\frametitle{In my case ...}

\begin{block}{Software I used for these slides:}
\item R version 3.1.0 (2014-04-10) -- "Spring Dance"
\item Platform: x86\_64-apple-darwin10.8.0 (64-bit)
\item IDE: RStudio Version 0.98.501
\item Text Editor: TextWrangler
\item Web Browser: Google Chrome Version 34.0.1847.131
\item Operating System: OS X Version 10.8.5




\begin{block}{Some R Books}
\item XML and Web Technologies for Data Sciences with R \\
\low{by Deb Nolan and Duncan Temple Lang}
\item Introduction to Data Technologies \\
\low{by Paul Murrell}
\item Data Manipulation with R \\
\low{by Phil Spector}
\item more references in each slide deck




\begin{block}{Web Scraping with R}
\item Web scraping for the humanities and social sciences \\
\low{(by Rolf Fredheim and Aiora Zabala)} \\
{\tiny \url{}}
\item Web Scraping with R \low{(by Xiao Nan)} \\
{\tiny \url{}}
\item R-bloggers posts on \textit{Web Scraping} \\
{\tiny \url{}}



\frametitle{Some R Packages}

\begin{tabular}{l l}
Package & Description \\
\highcode{RCurl} & R interface to the \code{libcurl} library \\
& for making general HTTP requests \\
\highcode{RHTMLForms} & Tools to process Web/HTML forms \\
\highcode{XML} & Tools for parsing XML and HTML documents \\
& and working with structured data from the Web \\
\highcode{RJSONIO} & Functions for handling JSON data \\
\highcode{jsonlite} & Functions for handling JSON data \\
\highcode{rjson} & Functions for handling JSON data \\
\highcode{ROAuth} & Interface for authentication via OAuth 1.0 \\
\highcode{SSOAP} & Use SOAP protocol to retrieve data \\

CRAN Task View: \textit{Web Technologies and Services} \\
{\scriptsize \url{}}



{ % all template changes are local to this group.
\setbeamertemplate{navigation symbols}{}
\begin{tikzpicture}[remember picture,overlay]
\node[at=(current] {
\node[fill=black, opacity=0, text opacity=1] at (7.5,-2.8) {\Huge{ \color{white} The Web}};


\frametitle{VIP Questions}

%--- empty space ---%
\begin{block}{Very Important Preliminary Questions}
The Data that you want:
\item Where is it located?
\item How accessible is it?
\item What is its structure / format?
%--- empty space ---%



\frametitle{VIP Questions}

\begin{block}{Location of Data}
\item Do you know the location (URL) beforehand?
\item[] Or do you have to figure it out? \\
\item Is it in one single specific place? \\
\low{(e.g. one HTML table, one file on the Web)}
\item Is it in one website but spread across several pages? \\
\low{(e.g. several HTML tables on different pages)}
\item Is it spread across several websites? \\
\low{(e.g. multiple pieces of information on various sites)}
\item Is it in one or several databases?
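
When the data sits on one website but is spread across several pages, the page URLs often follow a regular pattern and can be generated rather than typed by hand. The lines below are a minimal sketch under that assumption; the address `http://www.example.com/results` and the `?page=` query parameter are hypothetical placeholders, not a real data source.

```r
# Hypothetical site layout: the same table split over numbered pages.
base_url <- "http://www.example.com/results"
page_numbers <- 1:3

# Build one URL per page by appending the page number to the base address.
page_urls <- paste0(base_url, "?page=", page_numbers)

# Each URL could then be read in turn, for example with
# lapply(page_urls, readLines), once the real addresses are known.
print(page_urls)
```

If the pages do not follow a pattern like this, the URLs usually have to be collected from the links on an index page instead.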



\frametitle{VIP Questions}

\begin{block}{Accessibility of Data}
\item Do you have free, direct, immediate access to the data?
\item Do you need to fill in a Web Form?
\item Do you need to use a Web API?
\item Do you need a username, password, or other authentication?
\item Do you need to use a specific transfer protocol?
\item Do you need to use a specific type/method of request?



\frametitle{VIP Questions}

\begin{block}{Format / Structure of Data}
\item Is it plain text? \\
\item Is it in tabular \low{(spreadsheet-like)} form? \\
\item Is it in HTML? \\
\item Is it in some XML dialect?
\item Is it in JSON format?
\item Other formats: binary, images, maps, etc.?
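
The format questions above can often be answered by looking at the first few characters of the content. The function below is a rough sketch of that idea in base R; `guess_format()` is a hypothetical helper written for illustration (it is not part of any package), and its rules are simple heuristics that will misclassify unusual inputs.

```r
# Hypothetical helper (not from any package): guess the format of text
# content from how it starts. Heuristic only, not a validator.
guess_format <- function(txt) {
  # collapse to a single string and drop leading whitespace
  x <- sub("^[[:space:]]+", "", paste(txt, collapse = "\n"))
  first_line <- strsplit(x, "\n", fixed = TRUE)[[1]][1]
  if (grepl("^(<!DOCTYPE html|<html)", x, ignore.case = TRUE)) {
    "html"
  } else if (grepl("^<", x)) {
    # starts with an XML declaration or some other tag
    "xml"
  } else if (substr(x, 1, 1) %in% c("{", "[")) {
    # JSON documents start with an object or an array
    "json"
  } else if (grepl("[,\t;]", first_line)) {
    # first line contains a common field delimiter
    "tabular"
  } else {
    "plain text"
  }
}

guess_format('{"name": "Ana", "age": 30}')
guess_format("name,age\nAna,30")
```

The guess only suggests which reading function or parser to try first; the content still has to be parsed with the appropriate tool to confirm the format.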




\begin{block}{Some Acronyms}
\item \textbf{WWW} World Wide Web
\item \textbf{W3C} World Wide Web Consortium
\item \textbf{URL} Uniform Resource Locator
\item \textbf{HTTP} HyperText Transfer Protocol
\item \textbf{XML} Extensible Markup Language
\item \textbf{HTML} HyperText Markup Language
\item \textbf{JSON} JavaScript Object Notation



Binary file added 01-introduction/01-introduction.pdf
Binary file not shown.
