<a href="https://colab.research.google.com/github/chrdrn/dbd_binder/blob/main/session_08-showcase_text_as_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background

This showcase is 






## Preparation



### Install packages

Install the additional packages needed for this showcase.

**⚠ Installing all packages and their dependencies may take a few minutes.**

In [3]:
install.packages(c(
  "rmarkdown", "kableExtra", # html output
  "sjmisc", "magrittr", "lubridate", "janitor", "sjlabelled", # data processing
  "sjPlot", "ggpubr", "ggsci", # visual analysis
  "RCurl", "XML", "rvest", # scraping
  "quanteda", "quanteda.textplots", "quanteda.textstats", # text processing
  "fastText", # language detection
  "stm", "tidytext", # topic modeling/analysis 
  "glue", # tidy text output
  "tictoc" # timer
  ))

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘SparseM’, ‘MatrixModels’, ‘estimability’, ‘numDeriv’, ‘mvtnorm’, ‘minqa’, ‘nloptr’, ‘carData’, ‘abind’, ‘pbkrtest’, ‘quantreg’, ‘coda’, ‘iterators’, ‘emmeans’, ‘lme4’, ‘corrplot’, ‘car’, ‘ISOcodes’, ‘extrafontdb’, ‘Rttf2pt1’, ‘statnet.common’, ‘foreach’, ‘shape’, ‘RcppEigen’, ‘webshot’, ‘insight’, ‘snakecase’, ‘datawizard’, ‘bayestestR’, ‘effectsize’, ‘ggeffects’, ‘parameters’, ‘performance’, ‘sjstats’, ‘ggrepel’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘bitops’, ‘fastmatch’, ‘Rcpp’, ‘RcppParallel’, ‘SnowballC’, ‘stopwords’, ‘RcppArmadillo’, ‘extrafont’, ‘sna’, ‘igraph’, ‘network’, ‘nsyllable’, ‘proxyC’, ‘glmnet’, ‘lda’, ‘matrixStats’, ‘quadprog’, ‘slam’, ‘hunspell’, ‘janeaustenr’, ‘tokenizers’




In [4]:
# Set a seed for reproducibility and turn off scientific notation in the output
set.seed(42)
options(scipen = 999)

### Recommended: Import (complete) data from [<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/github.svg" width="15" height="15">](https://github.com/chrdrn/dbd_binder) [GitHub repository](https://github.com/chrdrn/dbd_binder)

*   Google Colab offers limited processing power, so scraping the Amazon reviews and estimating the topic model can take a very long time.
*   It is therefore highly recommended to load the prepared dataset, which contains all objects created in this Colab, and to run only the chunks that perform analyses or render graphs.



In [10]:
# Open the URL connection in binary read mode ("rb") and read the .RDS file from it
amazon <- readRDS(
  url("https://github.com/chrdrn/dbd_binder/blob/main/data/08-text_as_data/reviews_tpm.RDS?raw=true", "rb"))

# Check data structure and sublists
amazon |> summary()

     Length Class  Mode
raw  5      -none- list
data 6      -none- list
temp 4      -none- list
txt  4      -none- list
tpm  5      -none- list
stm  3      -none- list
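
The object loaded above is a nested list, so individual parts are reached with `$`. The following is a minimal sketch of that pattern using a small stand-in object; only the sublist names `raw` and `data` are taken from the summary above, while `demo`, its contents, and the temporary file are assumptions made for illustration.

```r
# Hypothetical stand-in for the downloaded object: a named list of sublists
demo <- list(
  raw  = list(p01 = data.frame(review_text = "Beispiel")),
  data = list(full = data.frame(src = "p01", id = 1))
)

# Round trip through an .RDS file on disk (a temporary file, not the real dataset)
path <- tempfile(fileext = ".RDS")
saveRDS(demo, path)
restored <- readRDS(path)

# Inspect the top-level sublists and the elements inside one of them
names(restored)
names(restored$data)
str(restored, max.level = 2)
```

With the real object, the same calls (for example `names(amazon$data)`) show which elements are available before running the analysis chunks.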

## Scrape Amazon reviews

*   Based on this [Stack Overflow answer](https://stackoverflow.com/a/70993803).

### Create custom function

*   The `scrape_amazon` function takes two arguments:
    *   `page_num`: the number of the review page to scrape
    *   `review_url`: the URL of the product's review page


In [None]:
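# Assumption: rvest and the tidyverse (purrr, tibble, pipe) are installed;
# the tidyverse is not part of the install list above
library(rvest)
library(tidyverse)

scrape_amazon <- function(page_num, review_url) {
  # Build the URL of the requested review page, sorted by most recent
  url_reviews <- paste0(review_url, "&pageNumber=", page_num, "&sortBy=recent")
  doc <- read_html(url_reviews)
  # Extract one row per review container
  map_dfr(doc %>% html_elements("[id^='customer_review']"), ~ data.frame(
    review_title = .x %>% html_element(".review-title") %>% html_text2(),
    review_text = .x %>% html_element(".review-text-content") %>% html_text2(),
    review_star = .x %>% html_element(".review-rating") %>% html_text2(),
    # Drop everything up to and including "vom " (the German date prefix)
    date = .x %>% html_element(".review-date") %>% html_text2() %>% gsub(".*vom ", "", .),
    author = .x %>% html_element(".a-profile-name") %>% html_text2(),
    page = page_num
  )) %>%
    as_tibble()
}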
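
To see what the selectors in `scrape_amazon` pick up without sending a request to Amazon, the extraction step can be tried on a small inline HTML snippet. This is only a sketch: the snippet is hand-written and hypothetical, it merely mimics the class names and `id` prefix the function targets, and real review pages may be structured differently.

```r
library(rvest)  # read_html(), html_elements(), html_element(), html_text2()
library(purrr)  # map_dfr() (needs dplyr installed for row binding)

# Hypothetical markup that mimics the elements the selectors look for
html <- '
<div id="customer_review-R1">
  <span class="a-profile-name">Anna</span>
  <a class="review-title">Gut</a>
  <i class="review-rating">4,0 von 5 Sternen</i>
  <span class="review-date">Rezension aus Deutschland vom 3. Februar 2022</span>
  <span class="review-text-content">Schmeckt gut.</span>
</div>
<div id="customer_review-R2">
  <span class="a-profile-name">Ben</span>
  <a class="review-title">Naja</a>
  <i class="review-rating">2,0 von 5 Sternen</i>
  <span class="review-date">Rezension aus Deutschland vom 14. Mai 2022</span>
  <span class="review-text-content">Keine Wirkung.</span>
</div>'

doc <- read_html(html)

# Same extraction logic as in scrape_amazon(), applied to the inline document
reviews_demo <- map_dfr(html_elements(doc, "[id^='customer_review']"), ~ data.frame(
  review_title = html_text2(html_element(.x, ".review-title")),
  review_text  = html_text2(html_element(.x, ".review-text-content")),
  review_star  = html_text2(html_element(.x, ".review-rating")),
  # Strip the prefix up to and including "vom " so only the date remains
  date         = gsub(".*vom ", "", html_text2(html_element(.x, ".review-date"))),
  author       = html_text2(html_element(.x, ".a-profile-name"))
))

reviews_demo
```

Each element whose `id` starts with `customer_review` becomes one row, which is why the function returns one row per review on a page.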

### Define product urls

-   `p01` (*Lineavi*): 1,679 ratings in total, 782 with written reviews → 79 pages
-   `p02` (*DietySlim*): 1,652 ratings in total, 268 with written reviews → 28 pages
-   `p03` (*Keto Burn*): 3,341 ratings in total, 540 with written reviews → 55 pages
-   `p04` (*Yokebe*): 1,586 ratings in total, 156 with written reviews → 16 pages
-   `p05` (*Vihado*): 1,335 ratings in total, 396 with written reviews → 40 pages


In [None]:
url <- list(
  p01 = "https://www.amazon.de/LINEAVI-Eiwei%C3%9F-Shake-Kombination-Molkeneiwei%C3%9F-laktosefrei/product-reviews/B018IB02AU/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
  p02 = "https://www.amazon.de/Detoxkuren%E2%80%A2-Entw%C3%A4sserung-Entschlackung-Stoffwechsel-entschlacken/product-reviews/B072QW5ZN1/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
  p03 = "https://www.amazon.de/Saint-Nutrition%C2%AE-KETO-BURN-Appetitz%C3%BCgler/product-reviews/B08B67V8G5/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
  p04 = "https://www.amazon.de/Yokebe-vegetarisch-Mahlzeitersatz-Gewichtsabnahme-hochwertigen/product-reviews/B08GYZ8LRB/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
  p05 = "https://www.amazon.de/Vihado-Liquid-chlorophyll-drops-alfalfa/product-reviews/B093XNC8QH/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews"
)

### Scrape reviews for products
- Create a list (`amazon`) to store the output
- Scrape all review pages for all products (`p01` to `p05`)

In [None]:
# Create list for output
amazon <- list()

# Scrape reviews for each product
## p01 
for (i in 1:79) {
  df <- scrape_amazon(page_num = i, review_url = url$p01)
  amazon$raw$p01[[i]] <- df
}

## p02
for (i in 1:28) {
  df <- scrape_amazon(page_num = i, review_url = url$p02)
  amazon$raw$p02[[i]] <- df
}

## p03
for (i in 1:55) {
  df <- scrape_amazon(page_num = i, review_url = url$p03)
  amazon$raw$p03[[i]] <- df
}

## p04
for (i in 1:16) {
  df <- scrape_amazon(page_num = i, review_url = url$p04)
  amazon$raw$p04[[i]] <- df
}

## p05
for (i in 1:40) {
  df <- scrape_amazon(page_num = i, review_url = url$p05)
  amazon$raw$p05[[i]] <- df
}
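
The five loops above differ only in the product key and the number of pages, so they could also be driven by a named vector of page counts. The sketch below illustrates that idea offline: `pages` copies the page counts from the product list, while `url_demo`, `scrape_stub`, and `amazon_demo` are hypothetical stand-ins for `url`, `scrape_amazon`, and `amazon`.

```r
# Page counts per product, copied from the product list above
pages <- c(p01 = 79, p02 = 28, p03 = 55, p04 = 16, p05 = 40)

# Hypothetical placeholder URLs so that the sketch does not contact Amazon
url_demo <- as.list(setNames(paste0("https://example.com/", names(pages)), names(pages)))

# Hypothetical stand-in for scrape_amazon(): returns one row per requested page
scrape_stub <- function(page_num, review_url) {
  data.frame(page = page_num, source = review_url)
}

# Same nested layout as amazon$raw$p01, amazon$raw$p02, ...
amazon_demo <- list(raw = list())
for (p in names(pages)) {
  amazon_demo$raw[[p]] <- lapply(
    seq_len(pages[[p]]),
    function(i) scrape_stub(page_num = i, review_url = url_demo[[p]])
  )
}

# Number of scraped pages stored per product
lengths(amazon_demo$raw)
```

Replacing `scrape_stub` with `scrape_amazon` and `url_demo` with `url` would, under these assumptions, fill the list in the same shape as the loops above.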

### Bind rows

In [None]:
# Create vector with variable names of products
product <- names(url)

# Bind the scraped pages of each product and add a numeric row id
for (i in product) {
  amazon$data$raw[[i]] <- amazon$raw[[i]] %>% 
    bind_rows() %>% 
    rownames_to_column("id") %>% 
    mutate(across(id, as.numeric))
}

# Bind the rows of all products; `src` stores the product key
amazon$data$full <- amazon$data$raw %>% 
  bind_rows(.id = "src")
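
To illustrate what the two binding steps produce, here is a small sketch on made-up data. The object `raw_demo` and its contents are hypothetical; only the sequence of calls follows the chunk above.

```r
library(dplyr)   # bind_rows(), mutate(), across(), %>%
library(tibble)  # rownames_to_column()

# Hypothetical scraped pages: two pages for p01, one page for p02
raw_demo <- list(
  p01 = list(
    data.frame(review_title = c("a", "b"), page = 1),
    data.frame(review_title = "c", page = 2)
  ),
  p02 = list(
    data.frame(review_title = "d", page = 1)
  )
)

# Step 1: stack the pages of each product and number the reviews within it
per_product <- lapply(raw_demo, function(page_list) {
  page_list %>%
    bind_rows() %>%
    rownames_to_column("id") %>%
    mutate(across(id, as.numeric))
})

# Step 2: stack all products; the list names end up in the `src` column
full_demo <- bind_rows(per_product, .id = "src")

full_demo
```

Note that `id` restarts at 1 for every product, so it is only unique in combination with `src`.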

## Data processing

Several steps are necessary before the data can be used in further analysis:
- Create unique identifiers (`doc_id` & `id`)
- Create and edit `date` variables
- Detect the language of each review
