# Group Project Proposal

### Title: Ideal Times to Publish Certain News Articles

### Introduction

In a fast-paced 24-hour news cycle, it is crucial to publish articles at the right time so that they have maximum impact on the audience before being overtaken and forgotten in favour of newer articles. For online media distribution, an effective measure of success is the number of “shares” that a piece receives.
To this end, we would like to determine whether there is an optimal time period in which to publish news pieces. Our question is: Does the day of the week affect the type of news article that people are more likely to read or share? If so, what are the ideal times for different types of articles?
The dataset we have chosen to work with is the Online News Popularity dataset from 2015. This dataset has 61 attributes describing 39797 articles, including each article's genre, the day of the week it was published, and the number of shares it garnered.


### Preliminary Exploratory Data Analysis

In [11]:
library(tidyverse)
library(repr)
library(tidymodels)


“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

✔ broom     0.7.0      ✔ recipes   0.1.13
✔ dials     0.0.9      ✔ rsample   0.0.7
✔ infer     0.5.4      ✔ tune      0.1.1
✔ modeldata 0.0.2      ✔ workflows 0.2.0
✔ parsnip   0.1.3      ✔ yardstick 0.0.7

“package ‘broom’ was built under R version 4.0.2”
“package ‘dials’ was built under R version 4.0.2”
“package ‘infer’ was built under R version 4.0.3”
“package ‘modeldata’ was built under R version 4.0.1”
“package ‘parsnip’ was built under R version 4.0.2”
“package ‘recipes’ was built under R version 4.0.1”
“package ‘tune’ was built under R version 4.0.2”
“package ‘workflows’ was built under R version 4.0.2”
“package ‘yardstick’ was built u

In [None]:
# TODO: read the dataset into `data` here
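
A minimal sketch of how this cell might be filled in, assuming the data is available locally as a CSV file. The temporary file and its two rows are stand-ins invented for illustration; in practice `csv_path` (a hypothetical name) would point to the downloaded dataset. The check at the end confirms that the `shares` column exists before it is used for stratified splitting.

```r
library(tidyverse)

# Stand-in for the real file: a tiny CSV written to a temporary path.
# (Hypothetical values; replace `csv_path` with the dataset's actual location.)
csv_path <- tempfile(fileext = ".csv")
write_lines(c("url,weekday_is_monday,data_channel_is_tech,shares",
              "http://example.com/a,1,1,593",
              "http://example.com/b,0,1,711"),
            csv_path)

data <- read_csv(csv_path)

# Fail early if the column needed for stratified splitting is missing
stopifnot("shares" %in% colnames(data))

glimpse(data)
```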

In [12]:
set.seed(1234)
publishing_split <- initial_split(data, prop = 0.6, strata = shares)
publishing_train <- training(publishing_split)
publishing_test <- testing(publishing_split)

named_data <- publishing_train %>%
    rename(
        monday = weekday_is_monday,
        tuesday = weekday_is_tuesday,
        wednesday = weekday_is_wednesday,
        thursday = weekday_is_thursday,
        friday = weekday_is_friday,
        saturday = weekday_is_saturday,
        sunday = weekday_is_sunday,
        lifestyle = data_channel_is_lifestyle,
        entertainment = data_channel_is_entertainment,
        business = data_channel_is_bus,
        social_media = data_channel_is_socmed,
        technology = data_channel_is_tech,
        world = data_channel_is_world
    )
# Renaming and pivoting are kept as two separate steps because combining them in one pipe did not work for us.
tidy_data <- named_data %>%
    # Collapse the genre indicator columns into a single Genre column
    pivot_longer(cols = lifestyle:world,
                 names_to = 'Genre',
                 values_to = 'Genre_Present') %>%
    # Collapse the weekday indicator columns into a single Day_Of_Week column
    pivot_longer(cols = monday:sunday,
                 names_to = 'Day_Of_Week',
                 values_to = 'Day_Present') %>%
    # Keep only the genre/day combination that applies to each article
    filter(Genre_Present == 1) %>%
    filter(Day_Present == 1) %>%
    # Drop articles with a zero positive or negative word rate before taking the ratio
    filter(global_rate_positive_words != 0 & global_rate_negative_words != 0) %>%
    mutate(abs_pos_neg_ratio = (global_rate_positive_words * avg_positive_polarity) /
                               abs(global_rate_negative_words * avg_negative_polarity)) %>%
    select(Genre, Day_Of_Week, shares, abs_pos_neg_ratio)

head(tidy_data)

# Article counts, mean shares, and mean polarity ratio by day of the week
day_table <- tidy_data %>%
    group_by(Day_Of_Week) %>%
    summarize(total_articles = n(), avg_shares = mean(shares), abs_pos_neg_ratio = mean(abs_pos_neg_ratio))
day_table

# The same summary by genre
genre_table <- tidy_data %>%
    group_by(Genre) %>%
    summarize(total_articles = n(), avg_shares = mean(shares), abs_pos_neg_ratio = mean(abs_pos_neg_ratio))
genre_table

ERROR: Error: Can't subset columns that don't exist.
✖ Column `shares` doesn't exist.


In [None]:
publish_data <- tidy_data %>%
    group_by(Genre, Day_Of_Week) %>%
    summarize(n = n())


options(repr.plot.width = 12, repr.plot.height = 10)
publish_plot <- publish_data %>%
    ggplot(aes(x = Day_Of_Week, y = n, fill = Genre)) + 
    geom_bar(stat = 'identity', position = "dodge") +
    xlab("Day of the Week") +
    ylab("Number of Articles Published") +
    labs(fill = "Genre of Article") +
    theme(text = element_text(size = 20))

publish_plot

In [None]:
# mean(shares) is already the per-article average, so it is not divided by n
share_data_avg <- tidy_data %>%
    group_by(Day_Of_Week, Genre) %>%
    summarize(n = n(), Average_Shares_Per_Article = mean(shares))

head(share_data_avg)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 10)
share_plot <- share_data_avg %>%
    ggplot(aes(x = Day_Of_Week, y = Average_Shares_Per_Article, fill = Genre)) + 
    geom_bar(stat = 'identity', position = "dodge") +
    xlab("Day of the Week") +
    ylab("Average Shares Per Article") +
    labs(fill = "Genre of Article") +
    theme(text = element_text(size = 20))

share_plot

In [None]:
share_data_totals <- tidy_data %>%
    group_by(Day_Of_Week, Genre) %>%
    summarize(n = n(), shares = sum(shares))

head(share_data_totals)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 10)
share_plot_proportion <- share_data_totals %>%
    ggplot(aes(x = Day_Of_Week, y = shares, fill = Genre)) + 
    geom_bar(stat = 'identity', position = "fill") +
    xlab("Day of the Week") +
    ylab("Proportion of Shares") +
    labs(fill = "Genre of Article") +
    theme(text = element_text(size = 20))

share_plot_proportion

In [None]:
daily_totals <- share_data_totals %>%
    group_by(Day_Of_Week) %>%
    summarize(Total_Shares = sum(shares))
daily_totals

In [None]:
# Each genre's share count as a fraction of all shares for that day
share_daily_proportions <- share_data_totals %>%
    inner_join(daily_totals, by = "Day_Of_Week") %>%
    mutate(share_proportion = shares / Total_Shares)
head(share_daily_proportions)

In [13]:
options(repr.plot.width = 12, repr.plot.height = 10)
share_plot_proportioned <- share_daily_proportions %>%
    ggplot(aes(x = Day_Of_Week, y = share_proportion, fill = Genre)) + 
    geom_bar(stat = 'identity', position = "dodge") +
    xlab("Day of the Week") +
    ylab("Proportion of Daily Shares") +
    labs(fill = "Genre of Article") +
    theme(text = element_text(size = 20))

share_plot_proportioned

ERROR: Error in eval(lhs, parent, parent): object 'share_daily_proportions' not found


### Methods

We will begin by examining the dataset in its original form to determine the correct way to import it into a Jupyter notebook, and by creating a GitHub repository where we can share our project contributions. We will then tidy the dataset so that each column represents a variable, each row represents an observation, and each cell contains a single value. Next, we will plot ‘Day of the Week’ against each of ‘Number of Articles Published’, ‘Average Shares per Article’, ‘Proportion of Shares’, and ‘Proportion of Daily Shares’ as bar graphs coloured by genre. We will study these visualizations to draw conclusions about article publishing and sharing on each day of the week.
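
As a minimal sketch of the tidying step, the following uses a small made-up tibble to show how one-hot indicator columns can be collapsed into single `Genre` and `Day_Of_Week` columns and then summarized. The values, and the reduced set of two genre and two day indicators, are assumptions for illustration only and are not rows from the dataset.

```r
library(tidyverse)

# Toy data: hypothetical values with only two genre and two day indicators
toy <- tibble(
    shares    = c(100, 300, 500, 700, 300),
    lifestyle = c(1, 0, 1, 0, 1),
    world     = c(0, 1, 0, 1, 0),
    monday    = c(1, 1, 0, 0, 1),
    sunday    = c(0, 0, 1, 1, 0)
)

# One row per article, with genre and day each stored in a single column
toy_tidy <- toy %>%
    pivot_longer(cols = lifestyle:world,
                 names_to = "Genre",
                 values_to = "Genre_Present") %>%
    pivot_longer(cols = monday:sunday,
                 names_to = "Day_Of_Week",
                 values_to = "Day_Present") %>%
    filter(Genre_Present == 1, Day_Present == 1) %>%
    select(Genre, Day_Of_Week, shares)

# Article count and average shares for each day/genre combination
toy_summary <- toy_tidy %>%
    group_by(Day_Of_Week, Genre) %>%
    summarize(n = n(), avg_shares = mean(shares), .groups = "drop")

toy_summary
```

Each article appears exactly once in `toy_tidy`, because only the indicator pair equal to 1 survives the filter. Articles with no genre indicator set would be dropped by the same filter.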

### Expected Outcomes and Significance

We expect to determine whether the day of the week affects the type of news article that people are more likely to read or share. The findings could help publishers time the release of articles so that they are not overtaken and forgotten as newer articles are published. Future questions may include: How often should news outlets publish new articles? What is the best day of the week and time of day to publish new articles?
