Here we show how to scrape a web page with rvest and the Chrome
extension SelectorGadget. The technique is borrowed from the excellent
explanation in [Online Bargain Hunting in R with
rvest](https://jef.works/blog/2019/01/12/online-bargain-hunting-in-R-with-rvest/);
this is essentially a simplified version of that post.

First make sure that you install and load the rvest package:

In [None]:
if (!require("rvest")) install.packages("rvest")
library("rvest")

Next we pick a site to get data from. We want to pull some data from
<http://poshmark.com> and compare original prices with current prices,
perhaps to find excellent bargains. We use the `read_html` function,
which rvest makes available, to download and parse the page. We will
look in the category Jackets and Coats-Blazers, and use only the data
on the first page:

In [None]:
url <- 'https://poshmark.com/category/Women-Jackets_&_Coats-Blazers'
webpage <- read_html(url)

Now we use SelectorGadget to find the parts of the page we want to get.
[SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)
is a Chrome extension (go ahead and install it) that helps you find a
CSS selector for the part of a page you want to collect in your data.
The link above has some screenshots of this in action, and there are
good [videos](https://www.youtube.com/watch?v=oqNTfWrGdbk) on YouTube
showing how to use it as well.

First we get the titles using `html_nodes` and the selector
`#tiles-con .title`, which we got from SelectorGadget:

In [None]:
titleNodes <- html_nodes(webpage,'#tiles-con .title')
length(titleNodes) # double check 48 products

The length looks right, so we probably got the data we wanted. We don’t
need the HTML structure of the nodes we pulled, only the text inside the
tags, so we use `html_text` to extract it.

In [None]:
titles <- html_text(titleNodes)
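
To see what these two calls do without hitting the live site, here is a
minimal sketch that runs the same steps on a small inline HTML fragment.
The fragment, its product names, and the `sample_*` variable names are
invented for illustration; only the `#tiles-con .title` selector comes
from the walkthrough above.

```r
library(rvest)

# Hypothetical HTML fragment standing in for the real page (made-up content).
sample_html <- '
<div id="tiles-con">
  <div class="tile"><a class="title">Black Blazer</a></div>
  <div class="tile"><a class="title">Tweed Jacket</a></div>
</div>
<div id="footer"><a class="title">Not a product</a></div>'

# read_html parses a literal HTML string as well as a URL.
sample_page <- read_html(sample_html)

# Select only the .title elements nested inside #tiles-con.
sample_nodes <- html_nodes(sample_page, "#tiles-con .title")

# Strip the tags and keep the text content.
sample_titles <- html_text(sample_nodes)
print(sample_titles)
```

The `.title` element in the footer is skipped because the selector
requires an ancestor with the id `tiles-con`.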

We have to clean up the titles vector, since it sometimes contains
unusual Unicode characters (such as the money bag emoji). We want plain
ASCII text:

In [None]:
titles <- iconv(titles, to="ASCII", sub="")
titles
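
As a small illustration of this clean-up step, the sketch below applies
the same `iconv` call to two made-up titles. It assumes the input is
UTF-8, which is stated explicitly with `from = "UTF-8"`, and adds a
whitespace clean-up that is not part of the walkthrough above.

```r
# Made-up titles; the first contains a money bag emoji (U+1F4B0).
raw_titles <- c("Blazer \U0001F4B0 NWT", "Plain title")

# Drop any bytes that cannot be represented in ASCII.
ascii_titles <- iconv(raw_titles, from = "UTF-8", to = "ASCII", sub = "")

# Removing a character can leave doubled spaces, so collapse runs of
# whitespace and trim the ends (an extra step, assumed useful here).
clean_titles <- trimws(gsub("\\s+", " ", ascii_titles))
print(clean_titles)
```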

Next we look for the prices…

In [None]:
priceNodes <- html_nodes(webpage,'.price')
prices <- html_text(priceNodes)
length(prices) # double check 48 products
head(prices)

I wasn’t able to find a way to pull each price separately, given the way
they were written in the HTML, so we have to split each price string
into two strings. It’s difficult to tell, but each string above contains
two prices with a non-breaking space between them. That is how they were
coded in the HTML. A non-breaking space is written as `&nbsp;`, and you
will see it in the HTML if you look carefully. The Unicode escape for
the non-breaking space is `\u00A0`, so the code that splits each string
on that separator is:

In [None]:
prices <- strsplit(prices, split='\u00A0')
head(prices)
str(prices[[1]])
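
The split can be tried on its own with a couple of made-up price
strings. The values and the `sample_*` names below are invented;
`fixed = TRUE` is an optional addition that treats the separator as a
literal string rather than a regular expression.

```r
# Made-up price strings: current price, a non-breaking space, original price.
sample_prices <- c("$25\u00A0$60", "$40\u00A0$120")

# Split each string on the non-breaking space.
sample_split <- strsplit(sample_prices, split = "\u00A0", fixed = TRUE)

# The result is a list with one two-element character vector per listing.
str(sample_split)
```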

That looks great. Let’s put each of these prices into its own vector so
we can make a data frame.

`sapply` takes the list of prices and applies the given function to each
element. Each element is a character vector holding the two prices for
one listing. The first function, `function(l) l[1]`, takes one of these
vectors as its argument and returns its first item. The result is a
vector of the first entries from each element of the list.

Likewise, `function(l) l[2]` returns the second entry of each vector
passed to it, so it extracts the second price from each element of
`prices`:

In [None]:
current <- sapply(prices, function(l) l[1])
original <- sapply(prices, function(l) l[2])

Let’s take a look and make sure we have the right things:

In [None]:
head(current)
head(original)
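
One thing worth knowing about this extraction: if a listing has only one
price, indexing position 2 gives `NA` rather than an error. The sketch
below shows this on made-up data, using `vapply` as an alternative to
`sapply` so that the expected result type is stated explicitly. The data
and `sample_*` names are hypothetical.

```r
# Made-up split prices; the third listing has no second price.
sample_split <- list(c("$25", "$60"), c("$40", "$120"), c("$15"))

# vapply checks that every result is a single character string.
sample_current  <- vapply(sample_split, function(p) p[1], character(1))
sample_original <- vapply(sample_split, function(p) p[2], character(1))

print(sample_current)
print(sample_original)  # the third entry is NA because that price is missing
```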

Looks great now…

Finally, let’s make a data frame of our fields.

In [None]:
df <- data.frame(name = titles, original = original, current = current)
df
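
Since the goal is to compare original and current prices, one possible
next step is to convert the price strings to numbers and rank the
listings by discount. This is a sketch on made-up data with the same
three columns as `df`; the `to_number` helper and the added column names
are hypothetical, and it assumes prices look like `$60` or `$1,200`.

```r
# Made-up listings with the same three columns as df above.
sample_df <- data.frame(
  name     = c("Black Blazer", "Tweed Jacket", "Linen Coat"),
  original = c("$60", "$1,200", "$80"),
  current  = c("$25", "$300", "$72"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip "$" and thousands separators, then convert.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

sample_df$original_num <- to_number(sample_df$original)
sample_df$current_num  <- to_number(sample_df$current)

# Fraction saved relative to the original price.
sample_df$discount <- 1 - sample_df$current_num / sample_df$original_num

# Largest discounts first.
bargains <- sample_df[order(sample_df$discount, decreasing = TRUE),
                      c("name", "discount")]
print(bargains)
```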