<a href="https://colab.research.google.com/github/andreaskuepfer/data-analysis-visualization-lecture/blob/main/Data_Analysis_Guest_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Analysis: From Data Crawling to Visualization
### An Introduction to APIs for Social Scientists and How To (not) Display Data

##### **Data Analysis Guest Lecture**
##### **National University of Kyiv-Mohyla Academy**
##### **March 15, 2023**

##### *Andreas Küpfer (Technical University of Darmstadt)*

Today, we want to work with the tweets of the President of Ukraine: Volodymyr Zelenskyy.

This hands-on notebook is divided into four sections:

1. First, we will load our dataset into R and do some data wrangling before we add a sentiment score to each tweet.
2. Second, we are going to visualize the preprocessed data in R.
3. Third, we discuss further steps that could be applied to improve our analysis.
4. Last but not least, you will find some resources for learning R so that you can analyze your own data.

## But wait... How do I actually work in such a notebook, and why do we need this?

In [1]:
# I'm a comment and I'm always ignored by R
variable <- 1 + 2
print(variable)

[1] 3


## Loading Packages (additional functionalities) in R

In [None]:
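# Run once if pacman is not installed yet:
# install.packages("pacman")

# p_load() loads the listed packages and installs any that are missing
pacman::p_load(tidyverse,
               emoji,
               tidytext,
               textdata)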

## Part 1: Loading and organizing the Tweets

In [None]:
tweets <- readRDS("tweets.Rds")

tweets_en <- tweets %>%
  select(created_at, text, lang) %>%
  filter(lang == "en")

In [None]:
tweets_en_emoji <- tweets_en %>% # start from the filtered (English-only) dataset
  dplyr::mutate(emoji = emoji::emoji_extract_all(text)) %>% # extract all emojis with emoji::emoji_extract_all()
  tidyr::unnest(emoji) %>% # unnest creates one row per emoji found (e.g. 3 rows if a tweet contains 3 emojis)
  dplyr::mutate(emojiname = purrr::map(emoji, ~names(which(emoji::emoji_name == .)))) %>% # replace each emoji with its text label (e.g. the Ukrainian flag becomes "Ukraine")
  dplyr::filter(stringr::str_detect(stringr::str_c(emojiname), ".*flag.*")) %>% # drop all rows (emojis found) that are not flags
  dplyr::mutate(emojiname = stringr::str_extract(emojiname, regex("flag_.*[^\")]")), # extract a uniform country label from the text label
                emojiname = stringr::str_remove(emojiname, "flag_")) # remove the "flag_" prefix
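
The cell above chains several steps: one row per emoji, a reverse lookup from emoji to label, and a filter on flags. The following minimal sketch replays that logic on made-up data. `emoji_lookup` is a hypothetical stand-in for `emoji::emoji_name`, and the list column is written by hand to mimic what `emoji::emoji_extract_all()` returns, so nothing here depends on the real tweets.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-in for emoji::emoji_name (names are labels, values are emojis)
emoji_lookup <- c(
  flag_Ukraine = "\U0001F1FA\U0001F1E6",
  flag_Poland  = "\U0001F1F5\U0001F1F1",
  handshake    = "\U0001F91D"
)

# Two made-up tweets; the list column holds the emojis found in each tweet
toy_tweets <- tibble(
  text  = c("tweet one", "tweet two"),
  emoji = list(
    c("\U0001F1FA\U0001F1E6", "\U0001F1F5\U0001F1F1"),
    c("\U0001F91D")
  )
)

toy_flags <- toy_tweets %>%
  unnest(emoji) %>%                                                   # one row per emoji
  mutate(emojiname = map_chr(emoji, ~ names(which(emoji_lookup == .x)))) %>% # emoji -> label
  filter(startsWith(emojiname, "flag_")) %>%                          # keep only flags
  mutate(emojiname = sub("^flag_", "", emojiname))                    # drop the "flag_" prefix

print(toy_flags)
```

The first toy tweet contributes two rows (one per flag), while the second is dropped because its only emoji is not a flag.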

In [None]:
View(tweets_en_emoji)

In [None]:
afinn_dict <- tidytext::get_sentiments(lexicon = "afinn")

afinn_dict %>%
  dplyr::group_by(value) %>%
  dplyr::slice_sample(n=1)
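
The AFINN lexicon is a table with a `word` column and an integer `value` column ranging from -5 (very negative) to 5 (very positive). The sketch below shows what the `group_by()` plus `slice_sample()` call does, using a hand-made stand-in called `toy_afinn`. The scores in it are made up for illustration and are not the real AFINN scores.

```r
library(dplyr)

# Hand-made stand-in with the same two columns as the AFINN lexicon
# (the scores are invented for this example)
toy_afinn <- tibble(
  word  = c("war", "terror", "support", "thank", "great", "win"),
  value = c(-2, -3, 2, 2, 3, 4)
)

set.seed(1) # make the random draw reproducible

# Draw one random example word for every sentiment score
toy_examples <- toy_afinn %>%
  group_by(value) %>%
  slice_sample(n = 1)

print(toy_examples)
```

The result has one row per distinct score, which makes it a quick way to get a feeling for the scale.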

In [None]:
tweets_en_emoji_sentiment <- tweets_en_emoji %>%
  dplyr::select(emojiname, text)

head(tweets_en_emoji_sentiment)

In [None]:
tweets_en_emoji_sentiment <- tweets_en_emoji %>%
  dplyr::select(emojiname, text) %>%
  tidytext::unnest_tokens(output = word, input = text)

head(tweets_en_emoji_sentiment)
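
`tidytext::unnest_tokens()` splits each text into one row per word. By default it also lowercases the words, strips punctuation, and drops the original text column. A minimal sketch on two made-up tweets (the texts are invented for illustration):

```r
library(dplyr)
library(tidytext)

toy <- tibble(
  emojiname = c("Ukraine", "Poland"),
  text      = c("Thank you for the support!", "Great talk today.")
)

# One row per word; other columns (here: emojiname) are repeated for each word
toy_words <- toy %>%
  unnest_tokens(output = word, input = text)

print(toy_words)
```

Lowercasing matters for the next step, because the sentiment lexicon stores its words in lowercase.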

In [None]:
tweets_en_emoji_sentiment <- tweets_en_emoji %>%
  dplyr::select(emojiname, text) %>%
  tidytext::unnest_tokens(output = "word", input = text) %>%
  dplyr::right_join(y = afinn_dict, by = "word") %>%
  dplyr::filter(!is.na(emojiname))

head(tweets_en_emoji_sentiment)
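
The `right_join()` keeps every word of the lexicon, including words that never occur in a tweet. Those rows have no `emojiname` and are removed by the `filter()` afterwards. Assuming `emojiname` is never missing before the join, the two steps together behave like a single `inner_join()`. A sketch with made-up words and scores:

```r
library(dplyr)

toy_words <- tibble(
  emojiname = c("Ukraine", "Ukraine", "Poland"),
  word      = c("thank", "you", "great")
)

# Invented scores, not the real AFINN values
toy_afinn <- tibble(
  word  = c("thank", "great", "war"),
  value = c(2, 3, -2)
)

# Variant used in this notebook: keep all lexicon words, then drop unmatched ones
via_right_join <- toy_words %>%
  right_join(toy_afinn, by = "word") %>%
  filter(!is.na(emojiname))

# Shorter variant: keep only words that appear in both tables
via_inner_join <- toy_words %>%
  inner_join(toy_afinn, by = "word")

print(via_right_join)
print(via_inner_join)
```

In both results, "you" disappears because it has no sentiment score, and "war" disappears because it does not occur in the toy tweets.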

In [None]:
tweets_en_emoji_sentiment <- tweets_en_emoji %>%
  dplyr::select(emojiname, text) %>%
  tidytext::unnest_tokens(output = "word", input = text) %>%
  dplyr::right_join(y = afinn_dict, by = "word") %>%
  dplyr::filter(!is.na(emojiname)) %>%
  dplyr::count(emojiname, value, name = "count")

head(tweets_en_emoji_sentiment)

In [None]:
tweets_en_emoji_sentiment <- tweets_en_emoji %>%
  dplyr::select(emojiname, text) %>%
  tidytext::unnest_tokens(output = "word", input = text) %>%
  dplyr::right_join(y = afinn_dict, by = "word") %>%
  dplyr::filter(!is.na(emojiname)) %>%
  dplyr::count(emojiname, value, name = "count") %>%
  dplyr::group_by(emojiname) %>% # group by country so that shares are computed within each country
  dplyr::mutate(n_normalized = count/sum(count) * 100)

head(tweets_en_emoji_sentiment)
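
Because of the `group_by()`, `sum(count)` is computed per country, so `n_normalized` is the share of each sentiment score within that country and the shares add up to 100 for each country. A small sketch with made-up counts:

```r
library(dplyr)

toy_counts <- tibble(
  emojiname = c("Ukraine", "Ukraine", "Ukraine", "Poland", "Poland"),
  value     = c(-2, 2, 3, 2, 3),
  count     = c(1, 6, 3, 1, 1)
)

toy_share <- toy_counts %>%
  group_by(emojiname) %>%
  mutate(n_normalized = count / sum(count) * 100) %>% # share within each country
  ungroup()

print(toy_share)

# Sanity check: the shares of each country sum to 100
toy_share %>%
  group_by(emojiname) %>%
  summarise(total = sum(n_normalized))
```

Normalizing this way makes countries comparable even if one flag appears in many more tweets than another.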

## Part 2: Visualize the results

In [None]:
country_list <- c("Ukraine", "Germany", "Poland", "European_Union", "Belarus", "Russia", "United_States", "United_Kingdom", "Mozambique", "Canada")

In [None]:
label_df <- tweets_en_emoji_sentiment %>%
  dplyr::filter(emojiname %in% country_list) %>%
  dplyr::group_by(emojiname) %>%
  dplyr::summarise(label = sum(count))

ggplot2::ggplot(tweets_en_emoji_sentiment %>%
                  filter(emojiname %in% country_list),
                mapping = ggplot2::aes(x = emojiname, y = n_normalized)) +
  ggplot2::geom_col(aes(fill = as.factor(value)), position=position_dodge(.6), width=.6) +
  ggplot2::geom_label(aes(label = label, y = -2.5), data = label_df) +  
  ggplot2::scale_fill_manual(name = "Sentiment",
                             values = c("-5" = "#8B0000",
                               "-4" = "#FF0000",
                               "-3" = "#FFA500",
                               "-2" = "#FFFF00",
                               "-1" = "#fafad4",
                               "0" = "#B0B0B0",
                               "1" = "#b6fcc3",
                               "2" = "#90EE90",
                               "3" = "#00FF00",
                               "4" = "#228B22",
                               "5" = "#006400")) +
  ggplot2::labs(x = "Country",
                y = "Proportion in %") +
  ggplot2::theme(axis.text.x = element_text(angle = 20, vjust = 0.7, face="bold"))
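
The same colour mapping is needed for every plot in this notebook. One option is to store it once in a named vector and pass that vector to `scale_fill_manual()`. In the sketch below, the name `sentiment_colors` and the toy data are assumptions made for illustration; the hex codes are the ones used in this notebook.

```r
library(ggplot2)

# Named vector: names are the sentiment scores, values are the hex colours
sentiment_colors <- c(
  "-5" = "#8B0000", "-4" = "#FF0000", "-3" = "#FFA500",
  "-2" = "#FFFF00", "-1" = "#fafad4", "0"  = "#B0B0B0",
  "1"  = "#b6fcc3", "2"  = "#90EE90", "3"  = "#00FF00",
  "4"  = "#228B22", "5"  = "#006400"
)

# Made-up shares for two countries
toy <- data.frame(
  emojiname    = c("Ukraine", "Ukraine", "Poland"),
  value        = c(-2, 2, 3),
  n_normalized = c(25, 75, 100)
)

p <- ggplot(toy, aes(x = emojiname, y = n_normalized)) +
  # Fixing the factor levels keeps the legend order from -5 to 5
  geom_col(aes(fill = factor(value, levels = names(sentiment_colors)))) +
  scale_fill_manual(name = "Sentiment",
                    values = sentiment_colors,
                    drop = FALSE) + # also show scores that do not occur in the data
  labs(x = "Country", y = "Proportion in %")

print(p)
```

With `drop = FALSE`, the legend lists all eleven scores, which keeps legends identical across plots that contain different subsets of scores.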

In [None]:
tweets_en_emoji_sentiment_before <- tweets_en_emoji %>%
  dplyr::filter(emojiname %in% country_list) %>%
  dplyr::mutate(before_war = created_at < as.Date("2022-02-24")) %>% # TRUE for tweets posted before the start of the war (2022-02-24)
  dplyr::select(before_war, emojiname, text) %>%
  tidytext::unnest_tokens(output = "word", input = text) %>%
  dplyr::right_join(y = afinn_dict, by = "word") %>%
  dplyr::count(before_war, emojiname, value, name = "count") %>%
  dplyr::group_by(emojiname, before_war) %>%
  dplyr::mutate(n_normalized = count/sum(count) * 100)
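
The `before_war` column is a simple logical flag created by comparing each tweet's date with a cut-off date. The sketch below uses made-up rows and assumes that `created_at` is stored as a `Date`. If it is stored as a date-time instead, one option is to wrap it in `as.Date()` before comparing.

```r
library(dplyr)

toy_dates <- tibble(
  created_at = as.Date(c("2022-02-23", "2022-02-24", "2022-03-01")),
  emojiname  = c("Germany", "Ukraine", "Poland")
)

toy_flagged <- toy_dates %>%
  # TRUE only for tweets strictly before the cut-off date
  mutate(before_war = created_at < as.Date("2022-02-24"))

print(toy_flagged)

# Number of toy tweets in each period
toy_flagged %>%
  count(before_war)
```

A tweet posted on the cut-off date itself counts as "not before", because the comparison uses `<` rather than `<=`.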

In [None]:
# Before the start of the war
label_df <- tweets_en_emoji_sentiment_before %>%
  dplyr::filter(before_war) %>%
  dplyr::group_by(emojiname) %>%
  dplyr::summarise(label = sum(count))

ggplot2::ggplot(tweets_en_emoji_sentiment_before %>%
                  dplyr::filter(before_war),
                mapping = ggplot2::aes(x = emojiname, y = n_normalized)) +
  ggplot2::geom_col(ggplot2::aes(fill = as.factor(value)), position=ggplot2::position_dodge(.6), width=.6) +
  ggplot2::geom_label(ggplot2::aes(label = label, y = -2.5), data = label_df) +
  ggplot2::scale_fill_manual(name = "Sentiment",
                             values = c("-5" = "#8B0000", # color gradient (hex codes) generated with ChatGPT
                               "-4" = "#FF0000",
                               "-3" = "#FFA500",
                               "-2" = "#FFFF00",
                               "-1" = "#fafad4",
                               "0" = "#B0B0B0",
                               "1" = "#b6fcc3",
                               "2" = "#90EE90",
                               "3" = "#00FF00",
                               "4" = "#228B22",
                               "5" = "#006400")) +
  ggplot2::labs(x = "Country",
                y = "Proportion in %") +
  ggplot2::theme(axis.text.x = element_text(angle = 20, vjust = 0.7, face="bold"))

In [None]:
# After the start of the war
label_df <- tweets_en_emoji_sentiment_before %>%
  dplyr::filter(!before_war) %>%
  dplyr::group_by(emojiname) %>%
  dplyr::summarise(label = sum(count))

ggplot2::ggplot(tweets_en_emoji_sentiment_before %>%
                  dplyr::filter(!before_war),
                mapping = ggplot2::aes(x = emojiname, y = n_normalized)) +
  ggplot2::geom_col(ggplot2::aes(fill = as.factor(value)), position=ggplot2::position_dodge(.6), width=.6) +
  ggplot2::geom_label(ggplot2::aes(label = label, y = -2.5), data = label_df) +
  ggplot2::scale_fill_manual(name = "Sentiment",
                             values = c("-5" = "#8B0000",
                               "-4" = "#FF0000",
                               "-3" = "#FFA500",
                               "-2" = "#FFFF00",
                               "-1" = "#fafad4",
                               "0" = "#B0B0B0",
                               "1" = "#b6fcc3",
                               "2" = "#90EE90",
                               "3" = "#00FF00",
                               "4" = "#228B22",
                               "5" = "#006400"),
                             drop = FALSE) +
  ggplot2::labs(x = "Country",
                y = "Proportion in %") +
  ggplot2::theme(axis.text.x = element_text(angle = 20, vjust = 0.7, face="bold"))

### Alternative Way to Present the Data

In [None]:
ggplot2::ggplot(data = tweets_en_emoji_sentiment_before %>%
                  dplyr::filter(before_war) %>%
                  dplyr::group_by(emojiname) %>%
                  dplyr::mutate(mean = stats::weighted.mean(value, count)) %>% # mean sentiment per country, weighted by how often each score occurs
                  dplyr::arrange(desc(mean), .by_group = TRUE),
                mapping = ggplot2::aes(x = forcats::fct_reorder(as.factor(emojiname), mean),
                                       y = n_normalized)) +
  ggplot2::geom_col(aes(fill = as.factor(value))) +
  ggplot2::scale_fill_manual(name = "Sentiment",
                           values = c("-5" = "#8B0000",
                             "-4" = "#FF0000",
                             "-3" = "#FFA500",
                             "-2" = "#FFFF00",
                             "-1" = "#fafad4",
                             "0" = "#B0B0B0",
                             "1" = "#b6fcc3",
                             "2" = "#90EE90",
                             "3" = "#00FF00",
                             "4" = "#228B22",
                             "5" = "#006400"),
                           drop = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::labs(y = "Sentiment Proportions",
                x = "Country") +
  ggplot2::theme_bw()
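
When each row is a (country, sentiment score) pair with a count, an average sentiment per country can be computed with or without weighting by the counts, and the two can differ noticeably. A minimal sketch on made-up numbers, using base R's `weighted.mean()`:

```r
library(dplyr)

toy_counts <- tibble(
  emojiname = c("Ukraine", "Ukraine", "Russia", "Russia"),
  value     = c(2, -2, -3, 1),
  count     = c(9, 1, 3, 1)
)

toy_means <- toy_counts %>%
  group_by(emojiname) %>%
  summarise(
    unweighted = mean(value),                 # every score counts once
    weighted   = weighted.mean(value, count)  # every score counts as often as it occurs
  )

print(toy_means)
```

For the toy "Ukraine" rows, the unweighted mean is 0 although nine of the ten scored words are positive; the weighted mean of 1.6 reflects that.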

What else? How could we proceed?

We could:
1. Also check the text for country names
2. Look for synonyms for the mentions of Russia
3. Apply more sophisticated models for sentiment extraction
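
As a starting point for the first idea, country names could be searched directly in the tweet text. The sketch below is only one possible approach: the tweets, the `countries` vector, and all object names are made up for illustration, and the match is a simple case-sensitive whole-word search.

```r
library(dplyr)
library(tidyr)
library(stringr)

toy_tweets <- tibble(
  text = c("Spoke with the President of Poland today.",
           "Grateful to Germany and Poland for the support.",
           "A productive meeting in Kyiv.")
)

countries <- c("Poland", "Germany", "Canada")

toy_mentions <- toy_tweets %>%
  crossing(country = countries) %>%                        # every tweet paired with every country
  filter(str_detect(text, paste0("\\b", country, "\\b")))  # keep pairs where the name occurs

# Number of toy tweets that mention each country
toy_mention_counts <- toy_mentions %>%
  count(country)

print(toy_mention_counts)
```

A search like this misses adjectives ("Polish"), abbreviations, and other spellings, which is where the second idea (synonyms) comes in.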

What are your ideas?

## Resources to learn R

Many free resources for learning R are available online. Here are some recommendations:

* DataQuest interactive tutorials: [Introduction to Data Analysis in R](https://www.dataquest.io/course/intro-to-r-rewrite/)
* [R for Data Science](https://r4ds.hadley.nz/) (2nd edition) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (2023)
* [How to learn R?](https://ozlemtuncel.github.io/files/Learning_R.pdf) by Ozlem Tuncel (2022)
