## A Rough Skeleton of Working Draft
---------------------------------------------------------------------------------------------------------


### Table of Contents:
1. Synopsis
    - Research Question
    - Overview
<p></p>
2. Our Data Set 
    - Packages Used
    - Loading and Cleansing
    
---------------------------------------------------------------------------------------------------------

### 1.) Synopsis:

##### Research Question:
How can an author increase engagement from users on Facebook and can we predict the success of a post using insights from an author's page?


##### Overview:
<p></p>
Social media platforms such as Facebook can generate substantial revenue for cosmetic brands, and advertising on them has become an established and widely exploited strategy in the digital age (Moro et al., 2016). The goal of this project is to take a predictive analytical approach to determine which type of Facebook post (i.e., photo, video, status, or link) generates the most user engagement, measured through variables such as likes, post consumptions, and post total reach. The dataset used for this analysis was acquired through an experimental data mining technique that scraped data from the Facebook page of an internationally renowned cosmetics company, covering posts made between January 1st and December 31st (Moro et al., 2016).
<p></p>
Our data includes posts from both paid and non-paid marketing campaigns. Social media algorithms that prioritize paid and non-paid posts differently can heavily influence the metrics we observe and should be considered in this analysis. In order to explore the relationship between post type and our defined success metric, we can further separate our data into paid and unpaid categories.
<p></p>
With our acquired data, we must first split it into training and testing sets before performing exploratory data analysis. We chose an 80:20 ratio between training and testing data, meaning 80% of our total data will be labeled as the “general training set” and the remainder as our “test set”. The general training set will be further partitioned into a “validation set” and a “training set” so that the model can be tuned without touching the test set, which reduces bias in our final accuracy estimate. Now that we’ve labeled our training data, we can further explore the summary statistics within each set.
<p></p>
For the methodology, we will use two continuous numerical variables, total reach (Lifetime_Post_Total_Reach) and total impressions (Lifetime_Post_Total_Impressions), together with the categorical variable of Facebook post type (Type). First, we will look at the relationship between these variables in a scatter plot, which will help us formulate our hypothesis. Then, as we are trying to predict the type of post that will be the most successful, we will use K-nearest neighbour classification. To do so, we must choose the value of K using cross-validation on the training data. Finally, we will test the accuracy of the classifier on the testing data.
<p></p>
We expect to find that posts which include media, such as photos and videos, are more likely to engage users than other posts, such as statuses and links. This is based on the assumption that the former types of posts might be more likely to be shared and thus will have more exposure.
It is beneficial for social media platforms to increase user engagement, as this is likely to increase revenue through advertising. Therefore, these findings may be used to choose what type of posts are prioritized to maximize user engagement.
<p></p>
These findings may lead to further exploration of how the contents of these posts impact user engagement. This may include the duration of a video, content of an image, length of a status, or details about the contents of a link. 

---------------------------------------------------------------------------------------------------------

### 2.) Dataset

##### Packages Used:

In [None]:
source("packages.R")

In [None]:
source("import-data.R")

In [None]:
source("rename-data.R")

### Let us select only the data values relevant to our case scenario

We want to explain the best type of post possible and thus we should first explore the relationship between the metrics produced by a post and the individual post type. The following key performance indicators describe a post's success:
- comments
- likes
- shares
- total interactions (sum of the three metrics above)

We will consider removing data points that are missing values if enough data is present. 
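
As a rough illustration of the indicators above, the sketch below computes total interactions and drops incomplete rows in base R. The `posts` data frame, its values, and the column names `comment`, `like`, and `share` are assumptions made for this example; the actual cleaning is done in `clean-data.R`.

```r
# Hypothetical stand-in for the imported data (values and column names are assumed)
posts <- data.frame(
  Type    = c("Photo", "Status", "Link", "Video", "Photo"),
  comment = c(4, 5, 0, 19, 2),
  like    = c(79, 130, NA, 325, 66),
  share   = c(17, 29, 14, 49, NA)
)

# Total interactions = comments + likes + shares (NA if any part is missing)
posts$total_interactions <- posts$comment + posts$like + posts$share

# Count rows with at least one missing value
n_missing <- sum(!complete.cases(posts))

# Keep only complete observations
posts_clean <- posts[complete.cases(posts), ]
```

Counting the incomplete rows first makes it easy to judge whether enough data remains after removal.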

In [None]:
source("clean-data.R")

## Compartmentalization of our data into Training, Validation, and Testing Sets 



We have 500 data points collected, from which we must remove observations with NA values. We will first filter our data into paid and unpaid data frames and check whether each contains enough data to proceed. After that, we will explore the summary statistics within each group.

-----
**METHOD 1: Training and Testing set**
<br>
- Testing set will be 20% of data collected
- Validation set will be 10% of data collected
- Training set will be 70% of data collected
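
A minimal base R sketch of the 20/10/70 split in Method 1 is shown here. The `posts` data frame is a hypothetical placeholder with 500 rows, and the seed is arbitrary; the split actually used in this project is in `preliminary-steps.R`.

```r
set.seed(1234)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in for the cleaned data: 500 posts
posts <- data.frame(
  id   = 1:500,
  Type = sample(c("Photo", "Status", "Link", "Video"), 500, replace = TRUE)
)

n        <- nrow(posts)
shuffled <- sample(n)        # random permutation of the row indices

n_test  <- round(0.20 * n)   # 20% testing
n_valid <- round(0.10 * n)   # 10% validation

test_set  <- posts[shuffled[1:n_test], ]
valid_set <- posts[shuffled[(n_test + 1):(n_test + n_valid)], ]
train_set <- posts[shuffled[(n_test + n_valid + 1):n], ]  # remaining 70%
```

Shuffling the row indices once and slicing them guarantees that the three sets do not overlap. If the post types should keep similar proportions in each set, `initial_split()` from the rsample package (part of tidymodels) accepts a `strata` argument and may be a better fit.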

**METHOD 2: Cross-validation technique**
<br>
Let us split our data into 5 groups, using 1 group for testing and the other 4 for training in each round (a 1:4 ratio of testing to training data).
(~25 points tested, 100 points for training)
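
The fold assignment behind this method can be sketched in base R. The sketch assumes 125 observations and 5 folds, chosen only so that each round tests 25 points and trains on 100 as in the counts above; neither number is taken from the project scripts.

```r
set.seed(1234)  # arbitrary seed, for reproducibility only

n <- 125  # assumed number of observations (25 tested + 100 for training)
v <- 5    # assumed number of folds

# Assign every row to exactly one fold, with equal fold sizes
fold_id <- sample(rep(seq_len(v), length.out = n))

# In each round, one fold is held out for testing and the rest is used for training
for (k in seq_len(v)) {
  test_rows  <- which(fold_id == k)
  train_rows <- which(fold_id != k)
  cat("Round", k, ":", length(test_rows), "tested,",
      length(train_rows), "for training\n")
}
```

Every observation is tested exactly once, so the averaged accuracy depends less on one lucky or unlucky split.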

In [None]:
source("preliminary-steps.R")

#### Total Posts:

In [None]:
source("total-posts.R")

In [None]:
source("total-summary.R")

#### Unpaid Posts:

In [None]:
source("unpaid-posts.R")

-----
## Training Data Summaries
Each Table includes:
- Number of observations of each type
- Mean and Median of key metrics in each post type


#### Summary of Unpaid Posts:

In [None]:
source("unpaid-summary.R")

-----
## PreProcessing 

We want to identify possible class imbalances because the KNN classification model is a lazy learning algorithm that predicts by majority vote among the nearest neighbours, so an over-represented post type can dominate the predictions. Thus we need to ensure that our data set is balanced. We start by reviewing the summary statistics and quickly visualizing the distribution of observations.

In [None]:
summary(train_set_unpaid)

In [None]:
source("test-unpaid-plot.R")

#### Balancing
We see that the distribution of post types is not equal, so we should consider balancing.

In [None]:
source("balancing.R")
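
One common way to balance classes is upsampling, where rows of the rarer classes are repeated until every class is as large as the biggest one. The base R sketch below shows the idea on made-up data; the `train` data frame and its counts are hypothetical, and `balancing.R` may well use a different approach (for example `step_upsample()` from the themis package).

```r
set.seed(1234)  # arbitrary seed, for reproducibility only

# Hypothetical imbalanced training set (counts are made up)
train <- data.frame(
  Type  = c(rep("Photo", 40), rep("Status", 8), rep("Link", 5), rep("Video", 2)),
  reach = round(runif(55, 1000, 20000))
)
table(train$Type)

# Target size: the number of rows in the largest class
target <- max(table(train$Type))

# Keep all original rows and add rows drawn with replacement
# until each class reaches the target size
balanced <- do.call(rbind, lapply(split(train, train$Type), function(d) {
  extra <- sample(nrow(d), target - nrow(d), replace = TRUE)
  d[c(seq_len(nrow(d)), extra), ]
}))
table(balanced$Type)
```

Upsampling should be applied to the training data only, since repeated rows in a validation or test set would distort the accuracy estimate.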

### Pre-Process (Scale Data) for K-nn Classification
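We must do this because K-nn classifies an observation by its distance to other observations, so a variable measured on a larger scale would otherwise dominate the distance calculation.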
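
To see the effect of scaling on distances, here is a small base R sketch. The three observations and the column names `reach` and `comments` are invented for illustration only.

```r
# Hypothetical observations measured on very different scales
x <- data.frame(
  reach    = c(2000, 2500, 9000),
  comments = c(2, 40, 3)
)

# Raw Euclidean distances: reach differs by thousands and comments by tens,
# so the distances mostly reflect reach
dist(x)

# Standardize each column to mean 0 and standard deviation 1
x_scaled <- scale(x)

# After scaling, both variables contribute on a comparable scale
dist(x_scaled)
```

In the tidymodels workflow the same standardization is usually done inside a recipe with `step_scale()` and `step_center()`, so that it is learned from the training data only.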

### Building our Model

We use the balanced data from the previous step in our tuning process, since we need a balanced data set. Then, by scaling the data and following the tidymodels recipe and workflow approach, we collect the results from various values of k. Our base value of k is set to 3.

In [None]:
source("building-model.R")
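
For readers without access to `building-model.R`, the following is a minimal sketch of a K-nn model with k = 3 built with a tidymodels recipe and workflow. The `toy` data is simulated and only borrows the column names mentioned earlier; it is an assumption that the project script follows the same pattern, and the kknn package must be installed for the engine to work.

```r
library(tidymodels)

set.seed(2016)  # arbitrary seed, for reproducibility only

# Hypothetical toy data; only the column names follow the text above
toy <- tibble(
  Type = factor(rep(c("Photo", "Status"), each = 30)),
  Lifetime_Post_Total_Reach = c(rnorm(30, 5000, 800), rnorm(30, 12000, 800)),
  Lifetime_Post_Total_Impressions = c(rnorm(30, 9000, 1500), rnorm(30, 25000, 1500))
)

# Recipe: predict Type from all other columns, standardizing the predictors
knn_recipe <- recipe(Type ~ ., data = toy) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# Model specification: K-nn classification with k = 3 and unweighted votes
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Combine recipe and model in a workflow and fit it
knn_fit <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = toy)

# Predict and compute accuracy (here on the toy data itself, for brevity)
preds <- predict(knn_fit, toy) %>% bind_cols(toy)
acc   <- accuracy(preds, truth = Type, estimate = .pred_class)
```

In the real analysis the predictions would be made on the validation set rather than on the data used for fitting.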

We found that our current accuracy against our validation set is roughly 75%. We will continue to tune our model in the following steps.

### Tuning our model

1. We will perform 10-fold cross-validation to account for randomness in how the data is split.

In [None]:
source("cross-validation.R")

We see the accuracy of our model is around 86%.

2. Next we will perform a parameter selection method to select a better value for K.

In [None]:
source("accuracies.R")
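
Steps 1 and 2 can be sketched together with tidymodels as follows. The simulated `toy` data, the grid of K values from 1 to 10, and all object names are assumptions for illustration; `cross-validation.R` and `accuracies.R` are the authoritative versions, and the kknn package must be installed.

```r
library(tidymodels)

set.seed(2016)  # arbitrary seed, for reproducibility only

# Hypothetical toy data with two overlapping classes
toy <- tibble(
  Type = factor(rep(c("Photo", "Status"), each = 50)),
  Lifetime_Post_Total_Reach = c(rnorm(50, 5000, 1500), rnorm(50, 8000, 1500)),
  Lifetime_Post_Total_Impressions = c(rnorm(50, 9000, 3000), rnorm(50, 15000, 3000))
)

knn_recipe <- recipe(Type ~ ., data = toy) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# neighbors = tune() marks K as the parameter to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# 10 cross-validation folds, stratified by post type
folds  <- vfold_cv(toy, v = 10, strata = Type)
k_grid <- tibble(neighbors = 1:10)

# Fit every K on every fold and keep the mean accuracy per K
accuracies <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_tune) %>%
  tune_grid(resamples = folds, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

# K with the highest mean cross-validation accuracy
best_k <- accuracies %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  pull(neighbors)
```

The `accuracies` table has one row per candidate K, which is the input needed for an accuracy-versus-K plot.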

3. Then using our collected metrics, we can visualize our accuracies to refine our value of K.

In [None]:
source("accuracy-vs-k.R")

In [None]:
source("most-accurate-k.R")

The visualization suggests that K=4 averages the highest accuracy of ~86% from our 10 cross validation sets. We edit our model specification to take k=2 instead of k=3 as follows. After doing so, we can compare the accuracy of each model.

In [None]:
source("improved-model.R")

After changing our model spec from k=3 to k=2, we see that the modification has increased our accuracy by roughly *5.56* percentage points (our original accuracy was roughly *72.2%*), and thus we will choose the tuned model.

### EXPLAIN WHY LOW ACCURACY OCCURS
- We have 1 video observation in our testing set
- which after we have classified the only video value
- 

---------------------------------------------------------------------------------------------------------

### Additional Exploratory Analysis

In [None]:
source("fb-long.R")

In [None]:
source("fb-mean-like.R")

In [None]:
source("mean-likes-bar.R")

In [None]:
source("mean-fb.R")

In [None]:
source("unpaid-plot.R")

### Bibliography

Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.

In [None]:
facebook