<a href="https://colab.research.google.com/github/chrdrn/dbd_binder/blob/main/session_07-showcase_tiktok.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background

This showcase illustrates different ways of analyzing <img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/tiktok.svg" width="15" height="15"> TikTok data downloaded with the [`Zeeschuimer`](https://github.com/digitalmethodsinitiative/zeeschuimer) browser extension.




## Preparation

Install the additional packages needed for this showcase.

⚠ Installing all packages and their dependencies might take a few minutes.



In [None]:
install.packages(c(
  "sjmisc",
  "sjPlot",
  "quanteda",
  "quanteda.textplots"))

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘estimability’, ‘numDeriv’, ‘mvtnorm’, ‘xtable’, ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘coda’, ‘emmeans’, ‘lme4’, ‘ISOcodes’, ‘extrafontdb’, ‘Rttf2pt1’, ‘statnet.common’, ‘insight’, ‘sjlabelled’, ‘bayestestR’, ‘datawizard’, ‘effectsize’, ‘ggeffects’, ‘parameters’, ‘performance’, ‘sjstats’, ‘fastmatch’, ‘Rcpp’, ‘RcppParallel’, ‘SnowballC’, ‘stopwords’, ‘RcppArmadillo’, ‘extrafont’, ‘ggrepel’, ‘sna’, ‘igraph’, ‘network’




## Data analysis

*   based on TikToks that are tagged with the hashtag `statistics`
*   collected via [`Zeeschuimer`](https://github.com/digitalmethodsinitiative/zeeschuimer) with .csv export via 🐈🐈 [**4CAT**](https://github.com/digitalmethodsinitiative/4cat) 🐈🐈

### Data import from [<img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/brands/github.svg" width="15" height="15">](https://github.com/chrdrn/dbd_binder) [GitHub repository](https://github.com/chrdrn/dbd_binder)

In [None]:
# load packages
library(tidyverse)
library(readr)

# get data from github
statistics <- read_csv(
  "https://raw.githubusercontent.com/chrdrn/dbd_binder/main/data/07-tiktok/tiktok-search-statistics.csv",
  col_types = cols(author_followers = col_number()))

# quick preview
statistics %>% glimpse()


### Data exploration



#### Period in which the TikToks were posted

In [None]:
# Load additional packages
library(lubridate)
library(sjPlot)
library(ggpubr)

# Display the number of TikToks per year
statistics %>% 
  mutate(date  = as.factor(year(timestamp))) %>% 
  plot_frq(date) +
  theme_pubr()

#### Descriptive statistics for the columns `likes` to `plays`

In [None]:
# Load additional packages
library(sjmisc)

# Get descriptive statistics (e.g. mean, standard deviation, median, range)
statistics %>% 
  select(likes:plays) %>% 
  descr()

#### Distribution of likes


In [None]:
statistics %>% 
  plot_frq(likes, type = "density")
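
Like counts on social media are often strongly right-skewed, in which case a density plot on the raw scale is dominated by a few very popular videos. As a rough sketch, not part of the original analysis, the same kind of distribution can be drawn on a logarithmic x-axis with `ggplot2`. The data frame `toy` and its simulated values are purely hypothetical stand-ins; with the real data, `statistics` would take its place.

```r
library(ggplot2)

set.seed(42)

# Simulated, hypothetical like counts (NOT the TikTok data)
toy <- data.frame(likes = round(rlnorm(500, meanlog = 6, sdlog = 2)))

# Density of likes on a log10 x-axis;
# adding 1 keeps videos with zero likes on the log scale
p <- ggplot(toy, aes(x = likes + 1)) +
  geom_density() +
  scale_x_log10() +
  labs(x = "likes + 1 (log scale)", y = "density")

p
```

Whether the log scale is helpful depends on how skewed the actual `likes` column is, which the output of `descr()` above can indicate.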

#### Warning messages displayed in TikToks

In [None]:
statistics %>% 
  frq(warning)

### Text analysis
The following text analysis is based on [quanteda](https://quanteda.io/index.html):

> *quanteda is an R package for managing and analyzing textual data developed by Kenneth Benoit, Kohei Watanabe, and other contributors. Its initial development was supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS.*

> *The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.*

The packages [`quanteda`](https://github.com/quanteda/quanteda) and [`quanteda.textplots`](https://github.com/quanteda/quanteda.textplots) are used.






#### Create corpus 

In [None]:
# Load text analysis packages
library(quanteda)
library(quanteda.textplots)

# Create corpus based on the variable hashtags
crp <- corpus(
  statistics, 
  docid_field = "id",
  text_field = "hashtags")

# Display
crp 

#### Tokenization


In [None]:
# Create tokens based on corpus
tkn <- crp %>% 
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE,
    remove_separators = TRUE)

# Display
tkn

#### Create Document-Feature-Matrix (DFM)


In [None]:
# Create dfm based on tokens
dfm <- tkn %>% 
  dfm()

# Display
dfm
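
To make the structure of a DFM more tangible, the following minimal sketch runs the same three steps (corpus, tokens, DFM) on three made-up documents and then lists the most frequent features with `topfeatures()`. The data frame `toy` and its hashtag strings are hypothetical, and the sketch assumes that hashtags are stored as comma-separated strings, which may differ from the actual export.

```r
library(quanteda)

# Hypothetical mini data set standing in for the TikTok export
toy <- data.frame(
  id = c("a", "b", "c"),
  hashtags = c(
    "statistics,math,fyp",
    "statistics,datascience",
    "math,statistics,learnontiktok"),
  stringsAsFactors = FALSE)

# Corpus -> tokens -> DFM, mirroring the steps above
toy_crp <- corpus(toy, docid_field = "id", text_field = "hashtags")
toy_tkn <- tokens(toy_crp, remove_punct = TRUE)  # drops the commas
toy_dfm <- dfm(toy_tkn)

# Rows are documents, columns are features (here: hashtags)
toy_dfm

# Most frequent features across all documents
topfeatures(toy_dfm, n = 3)
```

Each cell of the matrix holds the number of times a hashtag occurs in a document, so `topfeatures()` gives a numeric counterpart to a word cloud.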

#### Wordclouds

##### based on complete DFM

In [None]:
dfm %>% 
  textplot_wordcloud(
    min_size = 1,
    max_size = 8,
    max_words = 50,
    rotation = 0
  )

##### without the search term `statistics`

In [None]:
dfm %>% 
  dfm_remove(pattern = "statistics") %>% 
  textplot_wordcloud(
    min_size = 1,
    max_size = 8,
    max_words = 50,
    rotation = 0,
    color = "dodgerblue3"
  )