## Introduction

The Structural Topic Model (STM) is a text mining technique, initially conceived for the analysis of political texts, which has been widely adopted in the social sciences (<a href="https://www.structuraltopicmodel.com/">here</a> you can find a list of the main publications that have adopted STM). Like other topic models, such as Latent Dirichlet Allocation, it identifies abstract "topics" that occur in a collection of documents. Compared to other models, however, it also allows document metadata to enter the analysis as covariates, which can affect either how strongly a document is associated with a topic or how strongly a word is associated with a topic. As an example, it is possible to take a set of posts published on different political blogs in the months before an election and see which topics were prevalent in the posts of blogs of a certain political leaning (in this case, the political leaning of the blog is used as a covariate for topical prevalence), or see how the words used to discuss a specific topic change depending on political affiliation (in this case, the political leaning of the blog is used as a covariate for topical content). You can refer to the <a href="https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf">R package vignette</a> for more details.
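
To make the distinction between the two kinds of covariate more concrete, here is a minimal sketch using the political blog sample that ships with the stm package (poliblog5k). The object names fit_prevalence and fit_content are my own, and the values of K and max.em.its are deliberately tiny so that the example runs quickly; they are assumptions for illustration, not sensible settings for a real analysis.

```r
library(stm)

# poliblog5k ships with stm: a sample of 2008 political blog posts, with the
# political leaning of each blog stored in the `rating` metadata column
data(poliblog5k)

# Topical prevalence: how much a topic is discussed may vary with blog leaning
fit_prevalence <- stm(poliblog5k.docs, poliblog5k.voc, K = 5,
                      prevalence = ~ rating, data = poliblog5k.meta,
                      max.em.its = 2, verbose = FALSE)

# Topical content: the words used within a topic may vary with blog leaning
fit_content <- stm(poliblog5k.docs, poliblog5k.voc, K = 5,
                   content = ~ rating, data = poliblog5k.meta,
                   max.em.its = 2, verbose = FALSE)

# Show the top words of each topic in the prevalence model
labelTopics(fit_prevalence, n = 5)
```

With so few iterations the topics themselves will not mean much; the point is only to show where the covariate enters the model in each case.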

As I will stress throughout, the sample is entirely arbitrary, so none of the results will have any statistical validity. This is simply an attempt to explore a technique (and an R package) with several potential applications, both for analysis and for information visualisation (allowing us, for example, to move beyond the used and abused word cloud).

Like many topics in data analysis, it is easier to do than to explain, and it can only really be understood once you get your hands dirty with some data, which is what prompted me to try this. In this and the following entries, I will post the results of my experiments with STM, testing the model on job offers extracted from Indeed UK. As this is a work in progress, I will post results as I get them, so I am not really sure where this will lead, but I hope to have some fun along the way.


## Part I - Web Scraping

In this first post I will not get my hands on STM yet; instead, I will illustrate how I obtained the textual data used in the rest of the work. As mentioned above, I decided to focus on job offers. There is no specific reason for this, other than that they are a good example of texts that come with metadata to incorporate in the analysis (type of job, salary, location), and that identifying their topics should be a good test for the model. The choice fell on indeed.co.uk, also for no specific reason, and I stuck with it after noticing that scraping it was relatively easy (although the quality of the metadata, as we will see later on, is not the best we could have hoped for).
Obviously, if I had access to the indeed.co.uk API this whole process would probably have been quicker, but since I don’t, scraping the site was the only feasible option.
The first step was to create three accessory functions to extract the information needed from each offer page. This was relatively easy thanks to htmlParse from the XML package:

In [36]:
#load necessary libraries
suppressWarnings(library(rvest))
suppressWarnings(library("xml2"))
suppressWarnings(library("XML"))
suppressWarnings(library("stringr"))
suppressWarnings(library(dplyr))
suppressWarnings(library(naniar))

## Scrape the info from pages:
#metadata 
getmetadataindeed<-function(url){
  meta<-read_html(url)%>%
    htmlParse( asText=TRUE)%>%xpathSApply( "//*[contains(@class,'jobsearch-JobMetadataHeader-iconLabel')]", xmlValue)
  #as not all the job descriptions contain metadata, this was introduced to avoid ending up with empty lists in the dataframe
  if (is.list(meta)) {
    meta<-NA
  }
  meta
}

#job description
getjobdescindeed<-function(url){
  read_html(url)%>%
    htmlParse( asText=TRUE)%>%xpathSApply( "//*[contains(@id,'jobDescriptionText')]", xmlValue)%>%paste(collapse=', ')%>%str_replace_all("\n", " ")
  
}

#job title 
getjobtitle<-function(url){
  tit<-read_html(url)%>%
    htmlParse( asText=TRUE)%>%xpathSApply( "//*[contains(@class, 'icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title')]", xmlValue)
  if (is.list(tit)) {
    tit<-NA
  }
  tit
}
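
To see what these XPath queries do without hitting the site, here is a small offline sketch. Note that it swaps in xml_find_all and xml_text from xml2 instead of the htmlParse / xpathSApply pipeline used above, and that the HTML snippet and the helper extract_text are invented for illustration; the snippet is a simplified stand-in, not real Indeed markup.

```r
library(xml2)

# Hypothetical, simplified stand-in for a job offer page
page <- read_html('<html><body>
  <h3 class="jobsearch-JobInfoHeader-title">Data Analyst</h3>
  <span class="jobsearch-JobMetadataHeader-iconLabel">Newcastle upon Tyne</span>
  <span class="jobsearch-JobMetadataHeader-iconLabel">Permanent</span>
  <div id="jobDescriptionText">Line one.
Line two.</div>
</body></html>')

# Return the text of every node matching an XPath, or NA when nothing matches
extract_text <- function(doc, xpath) {
  out <- xml_text(xml_find_all(doc, xpath))
  if (length(out) == 0) NA_character_ else out
}

# Metadata: one element per matching node
meta <- extract_text(page, "//*[contains(@class,'jobsearch-JobMetadataHeader-iconLabel')]")

# Description: collapse to a single string and replace line breaks with spaces
desc <- extract_text(page, "//*[contains(@id,'jobDescriptionText')]")
desc <- gsub("\n", " ", paste(desc, collapse = ", "), fixed = TRUE)

# A query that matches nothing falls back to NA instead of an empty result
absent <- extract_text(page, "//*[contains(@class,'no-such-class')]")
```

The NA fallback plays the same role as the is.list() checks in the functions above: it keeps a placeholder for offers that lack a given field.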

As these functions need to be fed with the specific URLs of the job offer webpages, and doing this by hand for hundreds of offers is not really an option, I created a second function to scrape the URLs directly from the results page of a search:

In [37]:

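getlinks<-function(url){
  #collect the href attribute of every node whose class contains 'title'
  linksb<-read_html(url)%>%htmlParse(asText = TRUE)%>%xmlRoot()%>%xpathSApply("//*[contains(@class,'title')]", xmlGetAttr, 'href')
  linksb[sapply(linksb, is.null)] <- NULL #drop the nodes without an href
  linksb<-as.character(linksb)
  linksb<-paste("https://www.indeed.co.uk", linksb,sep="") #turn relative links into full URLs
  linksb #return the vector of URLs explicitly
}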
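
The clean-up steps inside getlinks can be tried in isolation on a toy list. The href values below are made up, and only mimic the shape of what xpathSApply returns when some matching nodes have no href attribute:

```r
# Hypothetical hrefs: the NULL stands for a matching node without an href
hrefs <- list("/rc/clk?jk=abc123", NULL, "/company/x/jobs/def456")

# Assigning NULL to the selected elements of a list removes them
hrefs[sapply(hrefs, is.null)] <- NULL

# Flatten to a character vector and prepend the site root
links <- paste0("https://www.indeed.co.uk", as.character(hrefs))
links
```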

Interestingly, as some sponsored links are always present in the results page, the total number of job offer URLs scraped is slightly higher than expected, which is not really a problem for our purposes. The final step was to put everything together:

In [39]:
scrapeindeed<-function(urlres) {
  linksbb<-getlinks(urlres)
  jobtitles<-lapply(FUN=getjobtitle, linksbb)%>%plyr::ldply(rbind)%>%mutate_if(is.factor,as.character)
  jobsdesc<-lapply (FUN=getjobdescindeed, linksbb)%>%plyr::ldply(rbind)%>%mutate_if(is.factor,as.character)
  jobmeta<-lapply (FUN=getmetadataindeed, linksbb)%>%plyr::ldply(rbind)%>%mutate_if(is.factor,as.character)
  tobemoved <- grepl("£", jobmeta[,2]) #as salary can fall in the second column if location is missing, single out all the salary entries...
  jobmeta[tobemoved,3 ]<-jobmeta[tobemoved,2 ] #...to move them to the third column
  jobmeta[tobemoved,2 ] <-NA #leaving an NA in their place in the second column
  final<-cbind(jobtitles,jobsdesc,jobmeta)%>%`colnames<-`(c("Title", "Description", "Location","Type","Salary"))
  final
}

###############
#test 
##
results1<-scrapeindeed("https://www.indeed.co.uk/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&as_src=&salary=&radius=25&l=ne18&fromage=3&limit=50&sort=&psf=advsrch")
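
The salary fix inside scrapeindeed is easier to follow on a toy table. The values below are invented; the second row imitates an offer where one field is missing, so the salary ends up one column too early:

```r
# Invented metadata, in the three-column shape produced by ldply(rbind)
jobmeta <- data.frame(
  V1 = c("Newcastle upon Tyne", "Gateshead", "Durham"),
  V2 = c("Permanent", "£25,000 a year", "Part-time"),
  V3 = c("£30,000 a year", NA, NA),
  stringsAsFactors = FALSE
)

# Flag the rows where the second column holds a salary (it contains a pound sign)
tobemoved <- grepl("£", jobmeta[, 2], fixed = TRUE)

# Move those entries to the third column and leave an NA behind
jobmeta[tobemoved, 3] <- jobmeta[tobemoved, 2]
jobmeta[tobemoved, 2] <- NA

colnames(jobmeta) <- c("Location", "Type", "Salary")
jobmeta
```

Only the second row changes: its salary moves to the Salary column and its Type becomes NA.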

In [55]:
print.table(results1[1,], justify="left", width=20)

Title

This function, fed with the URL of a results page, extracts all the data needed and stores them in a dataframe of five columns. The last step was to feed the function with all the relevant results pages, which I did (in this case manually) for all the job offers published in the three days before Saturday 18th May within 25 miles of the postcode NE18 (Newcastle upon Tyne). The results were then merged together, and the data are available <a href="https://www.dropbox.com/s/fa3hhcfi5qkqfp1/totaljobs.txt?dl=0">here</a> in .txt format.
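
Feeding several results pages and merging the output could also be scripted rather than done by hand. The sketch below runs offline: scrape_stub is a hypothetical stand-in for scrapeindeed, the URLs are placeholders, and the tab-separated output is an assumption on my part, not necessarily the layout of the file linked above.

```r
# Hypothetical stand-in for scrapeindeed(), returning one row per results page
scrape_stub <- function(url) {
  data.frame(Title = paste("Job from", url),
             Description = NA_character_,
             Location = NA_character_,
             Type = NA_character_,
             Salary = NA_character_,
             stringsAsFactors = FALSE)
}

# Placeholder URLs standing in for the real results pages
result_pages <- c("https://example.org/results?page=1",
                  "https://example.org/results?page=2")

# Scrape every page, then stack the resulting data frames
all_results <- do.call(rbind, lapply(result_pages, scrape_stub))

# Save the merged table as a tab-separated text file
outfile <- tempfile(fileext = ".txt")
write.table(all_results, file = outfile, sep = "\t", row.names = FALSE)
```

With the real function, it would be sensible to pause between requests (for example with Sys.sleep) so as not to overload the site.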

As you can see, the data still need a good deal of cleaning before any analysis can start, which is what we will cover in the next section.
