<center><img src="prog_lang.jpg" width=500></center>

How can you tell which programming languages and technologies are most widely used? Which ones are gaining or losing popularity, and how can that help you decide where to focus your efforts?

One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.

In this project, you will use data from the Stack Exchange Data Explorer to examine the relative popularity of R compared to other programming languages.

You'll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.

`stack_overflow_data.csv`
|Column|Description|
|------|-----------|
|`year`|The year the question was asked (2008-2020)|
|`tag`|A word or phrase that describes the topic of the question, such as the programming language|
|`num_questions`|The number of questions with a certain tag in that year|
|`year_total`|The total number of questions asked in that year|
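
To make the table above concrete, here is a minimal sketch of how a tag's share of a year's questions follows from `num_questions` and `year_total`. The data frame `toy` and every count in it are invented for illustration and are not taken from `stack_overflow_data.csv`.

```r
library(dplyr)

# Hypothetical rows with the same four columns as the real dataset.
# The counts are made up purely to illustrate the calculation.
toy <- data.frame(
  year = c(2019, 2019, 2020, 2020),
  tag = c("r", "python", "r", "python"),
  num_questions = c(50, 200, 60, 300),
  year_total = c(1000, 1000, 1200, 1200)
)

# Share of each year's questions carrying the tag, as a percentage
toy_share <- toy %>%
  mutate(percentage = num_questions / year_total * 100)

print(toy_share)
```

Because `year_total` is repeated on every row for a given year, the percentage can be computed row by row without grouping first.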

In [1]:
# Load necessary packages
library(readr)
library(dplyr)
library(ggplot2)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [2]:
# Load the dataset
data <- read_csv("stack_overflow_data.csv")

Rows: 420066 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): tag
dbl (3): year, num_questions, year_total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
# Question 1: What percentage of the total number of questions asked in 2020 had the R tag?

# Add a percentage column
data_percentage <- data %>%
  mutate(percentage = (num_questions / year_total) * 100)

In [5]:
# Filter for R tags
r_over_time <- data_percentage %>%
  filter(tag == "r")


In [6]:

# Bonus: create a line plot of percentage over time
# ggplot(r_over_time) +
#   geom_line(aes(x = year, y = percentage))

# Filter for the year 2020
r_2020 <- r_over_time %>%
  filter(year == 2020)  # year is numeric (dbl), so compare against a number

print(r_2020)

# A tibble: 1 × 5
   year tag   num_questions year_total percentage
  <dbl> <chr>         <dbl>      <dbl>      <dbl>
1  2020 r             52662    5452545      0.966

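The bonus line plot that is commented out in the cell above could be sketched as follows. This is a standalone illustration: `toy_trend` and its values are hypothetical, and in the notebook you would pass `r_over_time` to `ggplot()` instead.

```r
library(ggplot2)

# Hypothetical yearly shares, invented for illustration only.
# In the notebook, use r_over_time in place of toy_trend.
toy_trend <- data.frame(
  year = 2016:2020,
  percentage = c(0.70, 0.80, 0.90, 0.95, 1.00)
)

# One line showing how the tag's share of all questions changes by year
p <- ggplot(toy_trend, aes(x = year, y = percentage)) +
  geom_line() +
  labs(x = "Year", y = "Share of all questions (%)")

print(p)
```

Putting the aesthetics in the `ggplot()` call rather than in `geom_line()` lets any layer added later, such as `geom_point()`, reuse the same mapping.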

In [7]:
# Question 2: What were the five most asked-about tags between 2015 and 2020?

# Find total number of questions for each tag in the period 2015-2020
sorted_tags <- data %>%
  filter(year >= 2015) %>% 
  group_by(tag) %>% 
  summarize(tag_total = sum(num_questions)) %>% 
  arrange(desc(tag_total))

In [8]:
# Get the five largest tags
# Alternative using `$`: highest_tags <- head(sorted_tags$tag, n = 5)
# (that version returns a character vector rather than a tibble)
highest_tags <- sorted_tags %>% 
  select(tag) %>% 
  head(n = 5)
print(highest_tags)

# A tibble: 5 × 1
  tag
  <chr>
1 javascript
2 python
3 java
4 android
5 c#

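As a possible alternative to `arrange()` followed by `head()`, the same top-n selection can be sketched with dplyr's `slice_max()` (available in dplyr 1.0.0 and later). The data frame `toy` and its counts are hypothetical; the 2014 row is included only to show that the year filter changes the ranking.

```r
library(dplyr)

# Hypothetical counts, invented for illustration only
toy <- data.frame(
  year = c(2014, 2015, 2016, 2014, 2015, 2016, 2015, 2016),
  tag = c("r", "r", "r", "python", "python", "python", "java", "java"),
  num_questions = c(100, 10, 20, 5, 40, 50, 30, 35)
)

top_tags <- toy %>%
  filter(year >= 2015) %>%                   # keep only the period of interest
  group_by(tag) %>%
  summarize(tag_total = sum(num_questions)) %>%
  slice_max(tag_total, n = 2) %>%            # two largest totals, descending
  pull(tag)                                  # extract tags as a character vector

print(top_tags)
```

Note that `slice_max()` keeps ties by default (`with_ties = TRUE`), so it can return more than `n` rows when totals are equal, whereas `head()` always returns at most `n`.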