## A Rough Skeleton of Working Draft
---------------------------------------------------------------------------------------------------------


### Table of Contents:
1. Synopsis
    - Research Question
    - Overview
<p></p>
2. Selection of Data 
    - Packages Used
    - Loading and Cleansing
<p></p>
3. Training, Validation, and Testing Sets
    - Compartmentalization
    - Summary Statistics
<p></p>
4. Preprocessing 
    - Identification of Class Imbalances
    - Balancing Decision
<p></p>
5. Building our Model
    - Overview 
    - Tuning
    - Accuracy Comparison
<p></p>
6. Analysis and Conclusion
---------------------------------------------------------------------------------------------------------

### 1.) Synopsis: 

##### Research Question:
How can an author increase engagement from users on Facebook and can we predict the success of a post using insights from an author's page?


##### Overview:
<p></p>
Using social media platforms such as Facebook to generate revenue for cosmetic brands is an established and widely exploited advertising strategy in the digital age (Moro et al., 2016). The goal of this project is to take a predictive analytics approach to determine which type of Facebook post (i.e., photo, video, status, or link) generates the most user engagement, measured through variables such as likes, post consumptions, and post total reach. The dataset used for this analysis was acquired through an experimental data mining technique that included scraping data from the Facebook page of an internationally renowned cosmetics company for posts made between January 1st and December 31st (Moro et al., 2016).
<p></p>
For the methodology, we will use two continuous numerical variables, total reach (Lifetime_Post_Total_Reach) and total impressions (Lifetime_Post_Total_Impressions), and the categorical variable for the type of Facebook post (Type). First, we will examine the relationship between these variables in a scatter plot, which will help us formulate our hypothesis. Then, as we are trying to predict the type of post that will be the most successful, we will use K-nearest neighbours classification. To do so, we must choose the value of K using cross-validation on the training data. Finally, we will test the accuracy of the classifier on the testing data.
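
To make the classification step concrete, here is a minimal base R sketch of a K-nearest neighbours vote. The data values, the column names `reach` and `impressions`, and the helper `knn_predict` are made up for illustration; the notebook itself relies on the `kknn` engine through tidymodels rather than on this function.

```r
# Toy data: two already-scaled predictors and a known post type (made-up values).
train <- data.frame(
  reach       = c(0.1, 0.2, 0.3, 1.0, 1.1, 1.2),
  impressions = c(0.2, 0.1, 0.3, 1.1, 0.9, 1.2),
  type        = c("Status", "Status", "Status", "Photo", "Photo", "Photo")
)

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training rows.
knn_predict <- function(train, new_point, k) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt((train$reach - new_point[["reach"]])^2 +
                (train$impressions - new_point[["impressions"]])^2)
  nearest <- order(dists)[seq_len(k)]   # indices of the k closest rows
  votes <- table(train$type[nearest])   # count labels among the neighbours
  names(votes)[which.max(votes)]        # label with the most votes
}

new_post <- c(reach = 1.05, impressions = 1.0)
predicted_type <- knn_predict(train, new_post, k = 3)
print(predicted_type)
```

The new point sits next to the three "Photo" rows, so all three neighbours vote for that class.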
<p></p>
We expect to find that posts which include media, such as photos and videos, are more likely to engage users than other posts, such as statuses and links. This is based on the assumption that the former types of posts might be more likely to be shared and thus will have more exposure.
It is beneficial for social media platforms to increase user engagement, as this is likely to increase revenue through advertising. Therefore, these findings may be used to choose which types of posts to prioritize in order to maximize user engagement.
<p></p>
These findings may lead to further exploration of how the contents of these posts impact user engagement. This may include the duration of a video, content of an image, length of a status, or details about the contents of a link. 

---------------------------------------------------------------------------------------------------------

### 2.) Selection of data

#### List of Packages Used

In [None]:
install.packages("caTools")
library(caTools)
library(tidyverse)
library(repr)
library(tidymodels)
library(kknn)
library(MASS)
library(cowplot)
library(ggplot2)
options(repr.matrix.max.rows = 8)

#### Loading Data and Fixing Column Headers

In [None]:
facebook <- read_csv2("https://raw.githubusercontent.com/calamari99/Facebook-Post-Predictor/main/data/dataset_Facebook.csv")
# Converting the specified columns to categorical factors
cols <- c("Type", "Category", "Post Month", "Paid", "Post Weekday", "Post Hour")
facebook[cols] <- lapply(facebook[cols], as.factor)

# Renaming column headers so they contain underscores instead of spaces
facebook_col_name_vec <- gsub(" ", "_", colnames(facebook))
colnames(facebook) <- facebook_col_name_vec

glimpse(facebook)
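
The same two cleaning steps can be tried on a small stand-in data frame. This is only a sketch: the `posts` frame and its values are made up, and only base R is used.

```r
# A small stand-in data frame with spaces in its column names (made-up values).
posts <- data.frame(a = 1:2, b = 3:4, c = c("Photo", "Link"))
colnames(posts) <- c("Post Month", "Lifetime Post Total Reach", "Type")

# Replace every space with an underscore.
colnames(posts) <- gsub(" ", "_", colnames(posts))
print(colnames(posts))

# Convert the columns that are really categories into factors.
factor_cols <- c("Type", "Post_Month")
posts[factor_cols] <- lapply(posts[factor_cols], as.factor)
str(posts)
```

Column names without spaces can be used directly in `select()`, `filter()`, and model formulas without backticks.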

##### Let us select only the data values relevant to our case scenario

To identify the best type of post, we explore the relationship between the metrics produced by a post and the post type. The following key performance indicators describe a post's success:
- comments
- likes
- shares
- total interactions (the sum of the three metrics above)
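
As a quick illustration of how the fourth indicator relates to the other three, the sketch below builds the total from made-up counts and then checks it row by row. The `posts` frame is hypothetical; the column names follow the cleaned dataset.

```r
# Made-up counts for three posts.
posts <- data.frame(
  comment = c(4, 0, 12),
  like    = c(79, 130, 66),
  share   = c(17, 29, 14)
)

# Total interactions is the row-wise sum of the three metrics.
posts$Total_Interactions <- posts$comment + posts$like + posts$share
print(posts)

# The same identity can be used as a consistency check on loaded data.
totals_consistent <- all(with(posts, comment + like + share == Total_Interactions))
print(totals_consistent)
```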

In [None]:
facebook_clean <- dplyr::select(facebook, Type, comment,
                                like, share, Total_Interactions,
                                Paid, Lifetime_Post_Total_Impressions, Lifetime_Post_Total_Reach) %>%
  na.omit()  # drop rows that contain any missing value

facebook_clean_unpaid <- facebook_clean %>% filter(Paid == 0)
facebook_clean_paid <- facebook_clean %>% filter(Paid == 1)

unpaid_summary <- facebook_clean_unpaid %>% group_by(Type) %>% 
  summarise(unpaid = n()) 

paid_summary <- facebook_clean_paid %>% group_by(Type) %>% 
  summarise(paid = n())

Reduce(dplyr::full_join, list(unpaid_summary, paid_summary))

From the 500 data points collected, we have removed all observations with NA values and separated the data into paid and unpaid categories, because paid promotion may add to a post's reach and interactions. This allows us to explore the relationship between post type and our defined success metric. Moving forward, this study will only evaluate posts made without paid advertising.

> Note: Social media algorithms that prioritize paid and non-paid posts differently can heavily influence the metrics we observe and should be considered in this analysis. To control for this potential source of uncertainty, we have separated our data into paid and unpaid categories. 

---


### 3.) Training, Validation, and Testing Sets 



#### Compartmentalization
We have split our data into training and testing sets so that the model is evaluated on data it has not seen during training. 

*Distribution of Training and Testing set*
<br>
Testing set will be 20% of data collected
<br>
Validation set will be 10% of data collected
<br>
Training set will be 70% of data collected

*Cross-validation technique*
<br>
Let us split our data into 10 total groups.
<br>
(~25 points tested, 100 points for training)

We chose an 80:20 split between training and testing data, where the training portion is composed of both the “validation” and “training” sets. We have also chosen a 10-fold cross-validation procedure so that our accuracy estimate does not depend on a single split of the data.
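
The arithmetic behind the 70/10/20 proportions can be checked with a short base R sketch. It assumes a round 500 rows and uses plain `sample()` on row indices, which is a simpler stand-in for the `sample.split()` call from `caTools` used in the cells below; the snake_case names are chosen for this example only.

```r
set.seed(99)                      # any seed works; 99 mirrors the notebook
n <- 500                          # assumed number of posts for illustration
partition_train <- 0.8            # share kept for training + validation
ratio_train_validation <- 7 / 8   # share of that 80% used for training

# Shuffle the row indices, then cut them into three blocks.
idx <- sample(n)
n_train_val <- round(n * partition_train)
n_train <- round(n_train_val * ratio_train_validation)

train_idx <- idx[seq_len(n_train)]
val_idx   <- idx[(n_train + 1):n_train_val]
test_idx  <- idx[(n_train_val + 1):n]

split_shares <- c(train = length(train_idx),
                  validation = length(val_idx),
                  test = length(test_idx)) / n
print(split_shares)
```

Taking 7/8 of the 80% leaves 0.8 × 7/8 = 0.7 of the data for training and 0.8 × 1/8 = 0.1 for validation.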

In [None]:
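# Defining Variables
set.seed(99)
partitionTrain = 0.8          # share of the data kept for training + validation
ratioTrainValidation = 7/8    # share of that 80% used for training (70% overall)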

##### Total Posts:

In [None]:
# 80/20 ratio TrainingSet:TestingSet
split <- sample.split(facebook_clean$like, SplitRatio = partitionTrain)
train_val_data <- subset(facebook_clean, split == TRUE)
test_set <- subset(facebook_clean, split == FALSE)

split <- sample.split(train_val_data$like, SplitRatio = ratioTrainValidation)
train_set <- subset(train_val_data, split == TRUE)
val_set <- subset(train_val_data, split == FALSE)

train_set

##### Unpaid Posts:

In [None]:
split <- sample.split(facebook_clean_unpaid$like, SplitRatio = partitionTrain)
train_val_data_unpaid <- subset(facebook_clean_unpaid, split == TRUE)
test_set_unpaid <- subset(facebook_clean_unpaid, split == FALSE)

split <- sample.split(train_val_data_unpaid$like, SplitRatio = ratioTrainValidation)
train_set_unpaid <- subset(train_val_data_unpaid, split == TRUE)
val_set_unpaid <- subset(train_val_data_unpaid, split == FALSE)

#glimpse(train_set_unpaid)
train_set_unpaid


#### Training Data Summaries

- Number of observations of each type
- Mean and Median of key metrics in each post type


##### Summary of Unpaid Posts:

In [None]:
summ_train_unpaid <- train_set_unpaid %>%
    group_by(Type) %>%
        summarise(
        count = n(),
        mean_comment = mean(comment), 
        median_comment = median(comment), 
        mean_like = mean(like),
        median_like = median(like),
        mean_Total_Interactions = mean(Total_Interactions),
        median_Total_Interactions = median(Total_Interactions),
        mean_share = mean(share),
        median_share = median(share),
    )

summ_train_unpaid

-----
### 4.) Preprocessing 

#### Identification of class imbalances
We want to identify possible class imbalances because a K-nearest neighbours classifier predicts by majority vote among nearby training points, so an over-represented class can dominate the vote. We therefore need to check whether our data set is balanced. We start by reviewing summary statistics and visualizing the distribution of observations.
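
A numeric check of class balance can complement the plots. The sketch below uses made-up labels in place of the `Type` column; the counts are illustrative and are not the counts in our data.

```r
# Made-up labels standing in for the Type column of a training set.
post_type <- c(rep("Photo", 40), rep("Status", 6), rep("Link", 3), rep("Video", 1))

class_counts <- table(post_type)                    # observations per class
class_shares <- round(prop.table(class_counts), 2)  # share of each class
print(class_counts)
print(class_shares)

# Accuracy of a classifier that always predicts the most common class.
majority_baseline <- max(class_counts) / sum(class_counts)
print(majority_baseline)
```

When one class dominates, the majority baseline is a useful reference point: a model's accuracy is only informative to the extent that it exceeds it.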

In [None]:
summary(train_set_unpaid)

In [None]:
options(repr.plot.width=15, repr.plot.height=20)
test_unpaid_hist_1 <- train_set_unpaid %>%
    ggplot(aes(fill=Type, x=Lifetime_Post_Total_Reach/100))+
    geom_histogram(binwidth=20,center = 1,boundary = NULL,alpha=0.7,position=position_stack(vjust=0, reverse=FALSE))+
    labs(x="Total reach (hundreds)", fill="Type of post")+
    scale_x_continuous(limits = c(0,600))+
    scale_fill_manual(values = c("#eb1515", "#15eb15", "#1515eb", "#eb8015"))+
    theme(text = element_text(size = 16))
test_unpaid_hist_2 <- train_set_unpaid %>%
    ggplot(aes(x=Lifetime_Post_Total_Impressions/100, fill=Type))+
    geom_histogram( stat = "bin", alpha=0.7,position=position_stack(vjust=0, reverse=FALSE))+
    labs(x="Total impression (hundreds)", fill="Type of post")+
    scale_fill_manual(values = c("#eb1515", "#15eb15", "#1515eb", "#eb8015"))+
    scale_x_continuous(limits = c(0,600))+
    theme(text = element_text(size = 20))
plot_grid(test_unpaid_hist_1,test_unpaid_hist_2, ncol=1)

#### Balancing Decision
We see that the types of posts are not equally distributed, so we should consider balancing. However, this introduces potential complications in later parts of our analysis, mainly the cross-validation step. We find that balancing our data at this stage results in overestimated accuracies for our cross-validated model later on. Additionally, we are hesitant to balance the training set because this alteration is not reflected in our testing set, which can lead to more uncertainty. Because of these factors, we chose to leave our data unbalanced. We believe this will lead to less biased results when using our training data set later in this report. 
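
One plausible mechanism for the overestimated accuracies, offered here as an assumption rather than something we verified, is that upsampling with replacement creates duplicate rows that can land in different cross-validation folds. The base R sketch below uses made-up data and a simple random fold assignment to show how such leakage can be detected.

```r
set.seed(99)
# Made-up imbalanced data: 20 majority rows and 4 minority rows, each with an id.
toy <- data.frame(id = 1:24, type = c(rep("Photo", 20), rep("Video", 4)))

# Upsample the minority class with replacement until the classes are equal.
minority <- toy[toy$type == "Video", ]
extra <- minority[sample(nrow(minority), 16, replace = TRUE), ]
balanced <- rbind(toy, extra)

# Assign rows to 5 folds at random, then count the folds each id appears in.
balanced$fold <- sample(rep(1:5, length.out = nrow(balanced)))
folds_per_id <- tapply(balanced$fold, balanced$id, function(f) length(unique(f)))

# Ids listed here have copies in more than one fold.
leaked_ids <- names(folds_per_id)[folds_per_id > 1]
print(leaked_ids)
```

Any id printed at the end has identical copies on both sides of at least one fit/assess split, which makes a nearest-neighbour prediction for that row trivially correct.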

### 5.) Building our Model

#### Overview
We feed the original (unbalanced) training data into our tuning process. After scaling and centering the predictors and following the tidymodels recipe workflow, we collect the results for various values of k. Our base value of k is set to 3.
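
Scaling matters for K-nearest neighbours because distances are otherwise dominated by whichever predictor has the largest numeric range. The sketch below standardizes two made-up columns with base R's `scale()`, which has the same effect as combining `step_scale()` and `step_center()`; the values are illustrative only.

```r
# Made-up predictor values on very different scales.
toy <- data.frame(
  reach       = c(1200, 5400, 860, 23000, 4100),
  impressions = c(2500, 9800, 1500, 61000, 7300)
)

# scale() centres each column to mean 0 and divides by its standard deviation.
toy_scaled <- as.data.frame(scale(toy))
print(round(colMeans(toy_scaled), 10))   # column means after standardizing
print(apply(toy_scaled, 2, sd))          # column standard deviations
```

After standardizing, both predictors contribute on a comparable scale to the Euclidean distance.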

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
  set_engine("kknn") %>%
  set_mode("classification")

unpaid_recipe <- recipe(Type ~ 
                        Lifetime_Post_Total_Reach + Lifetime_Post_Total_Impressions,
                        data = train_set_unpaid) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

unpaid_fit <- workflow() %>%
      add_recipe(unpaid_recipe) %>%
      add_model(knn_spec) %>%
      fit(data = train_set_unpaid)

unpaid_val_predicted <- predict(unpaid_fit, val_set_unpaid) %>%
    bind_cols(val_set_unpaid)

unpaid_prediction_accuracy <- unpaid_val_predicted %>%
    metrics(truth = Type, estimate = .pred_class)
    
unpaid_prediction_accuracy

We found that our accuracy against the validation set is roughly 72%. We will continue to tune our model in the following steps.

#### Tuning our model

1. We will perform 10-fold cross-validation to account for the randomness of any single split.

In [None]:
set.seed(99)
unpaid_vfold <- vfold_cv(train_set_unpaid, v = 10, strata = Type)

unpaid_fit_v2 <- workflow() %>%
      add_recipe(unpaid_recipe) %>%
      add_model(knn_spec) %>%
      fit_resamples(resamples = unpaid_vfold) %>% collect_metrics()
unpaid_fit_v2

We see that the cross-validated accuracy of our model is around 86%.
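
For intuition about what the folds look like, here is a minimal base R sketch of a 10-fold assignment. The row count is hypothetical, and unlike `vfold_cv()` with `strata = Type`, this version does not stratify by class.

```r
set.seed(99)
n <- 120                       # hypothetical number of training rows
v <- 10                        # number of folds

# Give every row a fold label 1..v, in shuffled order.
fold_id <- sample(rep(seq_len(v), length.out = n))

# Each fold is held out once for assessment while the other nine are used to fit.
fold_sizes <- vapply(seq_len(v), function(f) sum(fold_id == f), integer(1))
print(fold_sizes)
```

The reported accuracy is then the mean of the ten held-out accuracies, which is what `collect_metrics()` summarizes.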

2. Next we will tune the number of neighbours to select a better value for K.

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_results <- workflow() %>%
  add_recipe(unpaid_recipe) %>%
  add_model(knn_tune) %>%
  tune_grid(resamples = unpaid_vfold, grid = 10) %>% 
  collect_metrics()

accuracies <- knn_results %>% 
       filter(.metric == "accuracy")

accuracies

3. Then using our collected metrics, we can visualize our accuracies to refine our value of K.

In [None]:
options(repr.plot.width=15, repr.plot.height=5)
accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate", title = "K-NN Classification Accuracy by Neighbors") +
      scale_x_continuous(breaks = seq(0, 16, by = 2)) +  # adjusting the x-axis
      scale_y_continuous(limits = c(0.8, 1.0)) # adjusting the y-axis

accuracy_versus_k

In [None]:
most_accurate_k <- knn_results %>% filter(.metric == "accuracy") %>% arrange(desc(mean)) %>% slice(1)
most_accurate_k

The visualization suggests that K=4 gives the highest average accuracy, ~86%, across our 10 cross-validation folds. We edit our model specification to use k=4 instead of k=3 as follows. After doing so, we can compare the accuracy of each model.

In [None]:
# we use the same recipe, change spec

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) %>%
  set_engine("kknn") %>%
  set_mode("classification")

unpaid_fit_tuned <- workflow() %>%
      add_recipe(unpaid_recipe) %>%
      add_model(knn_spec) %>%
      fit(data = train_set_unpaid)

unpaid_val_predicted_tuned <- predict(unpaid_fit_tuned, val_set_unpaid) %>%
    bind_cols(val_set_unpaid)

unpaid_prediction_accuracy_tuned <- unpaid_val_predicted_tuned %>%
    metrics(truth = Type, estimate = .pred_class) %>% filter(.metric == "accuracy")

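# Compare accuracy only; metrics() also returns kappa for the untuned model
baseline_accuracy <- unpaid_prediction_accuracy %>% filter(.metric == "accuracy")
model_improvement <- unpaid_prediction_accuracy_tuned$.estimate - baseline_accuracy$.estimate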

unpaid_prediction_accuracy_tuned
print(model_improvement)

After changing our model spec from k=3 to k=4, we see that the modification has increased our validation accuracy by roughly *5.56* percentage points (our original accuracy was roughly *72.2%*), and thus we will choose the tuned model.

#### Accuracy Comparison
- We have a limited number of video observations in our testing set
- Once those few video observations have been classified, any further prediction of the video class is necessarily incorrect
- todo
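
One way to examine the effect of rare classes is a confusion matrix with per-class recall. The labels below are made up for illustration and are not our results; only base R is used.

```r
# Made-up true and predicted labels for a small test set.
lv <- c("Link", "Photo", "Status", "Video")
truth <- factor(c("Photo", "Photo", "Photo", "Photo",
                  "Status", "Status", "Link", "Video"), levels = lv)
predicted <- factor(c("Photo", "Photo", "Photo", "Status",
                      "Status", "Photo", "Photo", "Photo"), levels = lv)

# Rows are the true classes, columns are the predicted classes.
confusion <- table(truth = truth, predicted = predicted)
print(confusion)

overall_accuracy <- sum(diag(confusion)) / sum(confusion)
# Share of each true class that was predicted correctly (recall per class).
per_class_recall <- diag(confusion) / rowSums(confusion)
print(overall_accuracy)
print(per_class_recall)
```

A single overall accuracy can hide the fact that rare classes such as "Video" are never predicted correctly, which the per-class view makes visible.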


---------------------------------------------------------------------------------------------------------

### 6.) Analysis and Conclusion

---
### Additional Exploratory Analysis

In [None]:
mean_comment <- summ_train_unpaid$mean_comment
mean_like <- summ_train_unpaid$mean_like
mean_Total_Interactions <- summ_train_unpaid$mean_Total_Interactions
mean_share <- summ_train_unpaid$mean_share
type <- summ_train_unpaid$Type

test_df <- data.frame(mean_comment,mean_like,mean_Total_Interactions,mean_share,type)
test_df

fb_long <- test_df %>%
gather("Stat", "Value", -type)

fb_long


In [None]:
# test_df <- data.frame(
# mean_comment = c(4.0, 7.317073, 13.166667, 12.333333),
# mean_like = c(56.66667, 202.14634, 281.16667, 276.33333),
# mean_Total_Interactions = c(71.0000, 235.2683, 353.8333, 346.6667),
# mean_share = c(10.33333, 25.80488, 59.50000, 58.00000),
# type = c("Link", "Photo", "Status", "Video"))


# fb_long <- test_df %>%
# gather("Stat", "Value", -type)

# fb_long


In [None]:
filter_mean_like <- fb_long %>%
    filter(Stat == "mean_like")

filter_mean_like

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6) 

mean_likes_bar <- ggplot(filter_mean_like, aes(x = type, y = Value, fill=type)) +
    geom_bar(stat = "identity") +
    labs(x = "Post Type", y = "Average Number of Likes") +
    theme(text = element_text(size = 20)) 
mean_likes_bar

In [None]:
options(repr.plot.width=10, repr.plot.height=10)
mean_fb <- ggplot(fb_long, aes(x = type, y = Value, fill = Stat)) +
    geom_col(position = "dodge") +
    labs(x = "Type of Post", y = "Mean Count per Post") +
    scale_fill_discrete(name = "Stats", labels = c("Mean Comment", "Mean Like", "Mean Share", "Mean Total Interactions"))+
    theme(text = element_text(size = 20))
mean_fb

In [None]:
options(repr.plot.width = 14, repr.plot.height = 8)
unpaid_plot <- facebook_clean_unpaid %>% 
    ggplot(aes(x = Lifetime_Post_Total_Reach/100, y = Total_Interactions, shape=Type, color=Type, fill=Type))+
    geom_point(alpha=0.6, size=4)+
    labs(x="Total reach (hundreds)", y="Total interactions", group="Type")+
    scale_y_continuous(limits = c(0,900))+
    scale_x_continuous(limits = c(0,500))+
    scale_shape_manual(values = c(21,22,23,24)) +
    theme_minimal()+
    theme(text = element_text(size = 20))
unpaid_plot

### Bibliography

Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351. 
