<img src = 'workflow/SC_LOGO-1.png'>

# ClickInsight : Increasing clicks on Facebook news posts
## Barath Ezhilan, Insight Data Science Fellow
## Project Overview


This is a consulting project done in collaboration with [Seachange/Timeline](http://www.schange.com/solutions-products/products/timeline), a social analytics platform that helps media companies analyse social sentiment, drive content marketing and surface relevant conversation.

<img src="workflow/intro.png">

Predicting success metrics on social media can be highly challenging. The figure above shows two news posts on a local news station's Facebook page on seemingly similar topics (sensational crime stories) that nonetheless drive significantly different numbers of clicks.

Our clients are local news stations that are interested in increasing engagement on their Facebook posts. The primary goal of this project is to develop an algorithm capable of predicting whether a post will have high or low engagement/consumption based on its content. Such a predictive algorithm can help these news stations preselect articles for their Facebook feed and tune the content of their status messages to better engage their audience on Facebook and increase the number of clicks to their website.



# Questions

To accomplish this goal, we had to answer several key questions.

- **Success metrics**
    - What are the suitable success metrics for quantifying engagement and consumption on Facebook?
- **Topic clustering**
    - Can we classify Facebook posts into different topics based only on the content of their messages?
- **Predictive modeling**
    - Can we predict the degree of engagement or consumption from the content of their messages alone?
- **Actionable business insights**
    - Can we draw crucial business insights that can increase engagement / consumption / revenue for the news stations?

# Metrics

Before we can build a predictive model, it is important to be able to define a success metric for Facebook news posts. While thinking about a success metric, it is crucial to distinguish between qualitative and quantitative metrics.

- Engagement: Qualitative measure describing how users engage with content. In this work, we define it as the sum of likes, shares and comments on a post. These engagements create 'stories' about the post.

- Consumption: Quantitative measure describing how users consume posts. In this work, we define it as the number of link clicks, which directly correlates with revenue for the company.

We also consider both engagement and link clicks normalized by 'reach' (which is the total number of people the post was shared to):

- Engagement to reach ratio
- Link clicks to reach ratio
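
As a rough illustration of how the four metrics relate, here is a minimal sketch that computes them for a single post. The dictionary keys (`likes`, `shares`, `comments`, `link_clicks`, `reach`) and the example values are made up for this sketch and are not the field names of the actual Facebook Insights export.

```python
def success_metrics(post):
    """Compute the four success metrics for one post.

    `post` is a dict with hypothetical keys: likes, shares, comments,
    link_clicks and reach.
    """
    engagement = post["likes"] + post["shares"] + post["comments"]
    clicks = post["link_clicks"]
    reach = post["reach"]
    # Guard against division by zero for posts that reached nobody.
    engagement_to_reach = engagement / reach if reach else 0.0
    clicks_to_reach = clicks / reach if reach else 0.0
    return {
        "engagement": engagement,
        "link_clicks": clicks,
        "engagement_to_reach": engagement_to_reach,
        "clicks_to_reach": clicks_to_reach,
    }


# Made-up example post.
example_post = {"likes": 120, "shares": 30, "comments": 50,
                "link_clicks": 400, "reach": 10000}
metrics = success_metrics(example_post)
print(metrics)
```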

These metrics have a high correlation as can be seen in the figure below. (For a more detailed analysis, please see [Feature Analysis](Code/EDA/feature_analysis.html))

<img src="code/EDA/response_.png">

The exploratory data analysis and predictive modeling of any of the 4 response variables lead to similar insights and results.

See the individual pages dedicated to the analysis of each of these metrics in the links below:

- [Link Clicks](Code/EDA/EDA_clicks.html)
- [Link Clicks to reach](Code/EDA/EDA_clicks_to_reach.html)
- [Engagement](Code/EDA/EDA_engagement.html)
- [Engagement to reach](Code/EDA/EDA_engagement_to_reach.html)

In the discussion below, we focus on one of the metrics: the number of link clicks. The number of clicks on posts over the last year has a median value of 1297, with values ranging from 0 to 472618.

# Workflow

The overall workflow of this project is described in this flow chart.


<img src="workflow/workflow_.png">

# Data

To tackle this problem, we used the Facebook Insights data consisting of all posts (~8000) made by one New York-based news station over the last 12 months. The data contains the following features:

- Time posted
- Full text of the status message
- Lifetime engagement, reach and post consumption for each post

## Cleaning

(Also see [data_cleaning.py](Code/data_cleaning.py))

The Facebook Insights data for each post is available as a JSON file.

- Import the JSON files in Python 
- Clean and store the data in a more convenient Pandas DataFrame
- Remove paid posts and posts with reach less than 100
- Store in a CSV file
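
The steps above can be sketched roughly as follows. This is not the project's `data_cleaning.py`; the JSON field names (`message`, `created_time`, `is_paid`, `reach`, `link_clicks`) and the sample records are assumptions made for illustration, and the real export may be structured differently.

```python
import json

import pandas as pd


def clean_posts(raw_json_strings, min_reach=100):
    """Parse one JSON string per post and keep organic posts with enough reach."""
    records = [json.loads(raw) for raw in raw_json_strings]
    df = pd.DataFrame.from_records(records)
    df["created_time"] = pd.to_datetime(df["created_time"])
    # Drop paid posts and posts with reach below the threshold.
    keep = (~df["is_paid"].astype(bool)) & (df["reach"] >= min_reach)
    return df.loc[keep].reset_index(drop=True)


# Made-up records standing in for the per-post JSON files.
raw_posts = [
    '{"message": "Organic post with enough reach", "created_time": "2016-01-15T18:30:00", "is_paid": false, "reach": 5000, "link_clicks": 1200}',
    '{"message": "Paid post", "created_time": "2016-01-16T09:00:00", "is_paid": true, "reach": 90000, "link_clicks": 7000}',
    '{"message": "Tiny reach", "created_time": "2016-01-17T12:00:00", "is_paid": false, "reach": 40, "link_clicks": 1}',
]

cleaned = clean_posts(raw_posts)
# Passing a path instead, e.g. cleaned.to_csv("posts_clean.csv", index=False),
# would write the CSV to disk; here we only build the CSV text.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```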

# Feature Engineering and Exploratory Data Analysis

We performed feature engineering to identify predictors for our model. We found several features could be important. 

- Time features (see [time_feature_extraction.py](Code/time_feature_extraction.py))
   - Day of week
   - Hour of day
   - Month and Year
- Deeper text analysis: (see [NLP_text_feature_extraction.py](Code/NLP_text_feature_extraction.py))
   - Post length
   - Number of all caps words at the start of the message
   - Keywords (see [keyword_extraction.py](Code/keyword_extraction.py))
       - Number
       - Video
       - Words relating to virality ('viral', 'breaking' etc.)
       - Words relating to women, sex, politics, crime
- **Topic**
    - See next section on topic clustering

Most of the above features are straightforward to obtain from our data (after some NLP, i.e., removing stopwords, performing stemming, ...). However, there is one important feature, the topic of the post, that is not readily accessible in the data. So, we use an unsupervised learning method, the Latent Dirichlet Allocation algorithm, for topic clustering. (See next section.)
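
To make the time, text and keyword features concrete, here is a minimal sketch using only the standard library. It is not the project's extraction code: the function names, the rule for what counts as an all caps word, and the two-word virality list are assumptions for illustration, and stopword removal and stemming are omitted.

```python
import re
from datetime import datetime

# Only the two virality words named above; the real list is presumably longer.
VIRALITY_WORDS = {"viral", "breaking"}


def leading_all_caps_words(message):
    """Count consecutive all caps words at the start of the message."""
    count = 0
    for word in message.split():
        letters = re.sub(r"[^A-Za-z]", "", word)
        if len(letters) >= 2 and letters.isupper():
            count += 1
        else:
            break
    return count


def extract_features(message, posted_at):
    """Return a flat dict of time, text and keyword features for one post."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return {
        "day_of_week": posted_at.weekday(),  # Monday = 0
        "hour_of_day": posted_at.hour,
        "month": posted_at.month,
        "year": posted_at.year,
        "word_count": len(message.split()),
        "leading_caps": leading_all_caps_words(message),
        "has_number": int(bool(re.search(r"\d", message))),
        "has_video": int("video" in tokens),
        "virality_words": sum(token in VIRALITY_WORDS for token in tokens),
    }


# Made-up status message and posting time.
features = extract_features(
    "BREAKING NEWS: 4 teens arrested after attack caught on video",
    datetime(2016, 3, 5, 18, 30),
)
print(features)
```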

## Data Stories

An exploratory data analysis of the features reveals several interesting data stories. An elaborate analysis can be found [here](Code/EDA/EDA_clicks.html). Here we include a few interesting stories.

### Time

<img src = 'workflow/time.png'>

### Post Length and ALL CAPS

<img src = 'workflow/text.png'>

### Keywords

- More words relating to women, sex, or video increase clicks
- Politics-related keywords decrease clicks

<img src = 'workflow/keywords.png'>

# Interlude: Topic Clustering using the Latent Dirichlet Allocation algorithm

We performed topic clustering with the Latent Dirichlet Allocation (LDA) algorithm ([Blei et al. (2003)](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf)) using the gensim package. (Please see [LDA_topic_model.py](Code/LDA_topic_model.py) for a walkthrough of LDA in Python.)

LDA is a generative probabilistic model that posits each topic as a distribution over words in a fixed dictionary. In other words, every topic assigns a probability to every word, with these word probabilities drawn from a Dirichlet distribution.
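
The generative story can be illustrated with a short NumPy simulation. This is only a sketch of how LDA assumes a post is produced, not the fitting procedure; the vocabulary, the number of topics and the Dirichlet parameter of 0.5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dictionary and sizes, chosen only for illustration.
vocabulary = ["police", "arrest", "brooklyn", "jersey", "election", "vote"]
n_topics = 3
n_words_in_post = 8

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior.
topic_word = rng.dirichlet(alpha=np.full(len(vocabulary), 0.5), size=n_topics)

# Each post has its own mixture of topics, also drawn from a Dirichlet prior.
post_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate the post word by word: pick a topic, then a word from that topic.
post = []
for _ in range(n_words_in_post):
    topic = rng.choice(n_topics, p=post_topic)
    word = rng.choice(vocabulary, p=topic_word[topic])
    post.append(str(word))

print(post_topic)
print(post)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers plausible topic-word distributions and per-post topic mixtures.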

The topic distributions over words can also be seen from the interactive visualization below.

```python
from IPython.display import IFrame
IFrame('Code/LDAresults/topics_LDA.html', width=1200, height=750)
```

Each post is considered to be a mixture of topics. Thus, each post has a weight for each topic. By looking at the posts having the highest weights for a given topic, we can identify the topics:

- topic1 : Violence / women
- topic2 : Local news (Brooklyn / New Jersey) 
- ...

For instance, for the story about a Brooklyn girl beaten up by 4 teens, the model rightly gives higher weights to topic1 and topic2.

<img src="workflow/topic_mix.png">

## Mean Weighted click score 

Let $w_{ij}$ be the weight of topic $i$ in post $j$, and let $\mathrm{clicks}(j)$ be the number of clicks on post $j$.

We get the mean weighted click score for each topic using the formula

$\mathrm{Score}(\mathrm{topic}\ i) = \frac{1}{\mathrm{TotalPosts}}\sum_{j = 1}^{\mathrm{TotalPosts}} w_{ij} \times \mathrm{clicks}(j)$
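
As a worked instance of the formula above, the sketch below computes the score for every topic at once as a matrix-vector product. The weight matrix and click counts are made-up numbers; the function name is hypothetical.

```python
import numpy as np


def mean_weighted_click_score(topic_weights, clicks):
    """Score per topic: (1 / TotalPosts) * sum_j w_ij * clicks(j).

    topic_weights has shape (n_topics, n_posts) with entries w_ij;
    clicks has shape (n_posts,).
    """
    topic_weights = np.asarray(topic_weights, dtype=float)
    clicks = np.asarray(clicks, dtype=float)
    total_posts = clicks.shape[0]
    return topic_weights @ clicks / total_posts


# Two topics, three posts (made-up values).
weights = [[0.9, 0.2, 0.1],
           [0.1, 0.8, 0.9]]
clicks = [3000, 500, 100]

scores = mean_weighted_click_score(weights, clicks)
print(scores)
```

Here the first topic scores (0.9 × 3000 + 0.2 × 500 + 0.1 × 100) / 3 = 2810 / 3, because it dominates the most-clicked post.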

<img src="code/EDA/weighted_click_score.png">

Clearly, violence/women/local news (topic1 and topic2) have higher click scores than other stories.

# Predictive model : Random Forest Classifier

Now that we have labeled each post with a topic, we have all the ingredients necessary to classify posts as having a low or high number of clicks, based on a median split.
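
A median split can be sketched in a few lines. In the example, the values 0, 1297 and 472618 echo the minimum, median and maximum quoted earlier, while the other two click counts are made up; sending ties at the median to the low class is an assumption of this sketch.

```python
from statistics import median


def median_split(click_counts):
    """Label each post 1 (high clicks) or 0 (low clicks) relative to the median."""
    threshold = median(click_counts)
    # Posts exactly at the median are assigned to the low class here.
    labels = [1 if clicks > threshold else 0 for clicks in click_counts]
    return threshold, labels


threshold, labels = median_split([0, 350, 1297, 2400, 472618])
print(threshold, labels)
```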

We find that several significant correlations exist between topics, keywords and word_count. We will keep these correlations in mind when we get to the predictive model!

<img src="code/EDA/predictor_correlation.png">

We perform the classification using a Random Forest Classifier. For a detailed analysis of the accuracy and precision metrics, please see the [Random Forest IPython notebook](Code/RandomForest/RandomForest_clicks.html).

We build 3 different models using 3 different sets of features:
- Model 1: (almost) all features
  - time features + Keywords (video, sex, number) + text features (word count, ALLCAPS) + topics
- Model 2: All but time features
  - Keywords (video, sex, number) + text features (word count, ALLCAPS) + topics
- Model 3: Only topics
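
The comparison of the three feature sets can be sketched with scikit-learn as below. Everything here is synthetic: the feature matrices are random stand-ins, the labels are constructed to depend on one topic weight, and the hyperparameters are arbitrary, so the resulting AUC values say nothing about the real data or the results reported in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_posts = 400

# Synthetic stand-ins for the real feature groups.
time_features = rng.integers(0, 24, size=(n_posts, 2)).astype(float)
text_features = rng.integers(0, 2, size=(n_posts, 3)).astype(float)
topic_features = rng.dirichlet(np.ones(4), size=n_posts)

# Synthetic high/low labels, loosely driven by the first topic weight.
signal = topic_features[:, 0] + 0.1 * rng.normal(size=n_posts)
labels = (signal > np.median(topic_features[:, 0])).astype(int)

feature_sets = {
    "model_1_all": np.hstack([time_features, text_features, topic_features]),
    "model_2_no_time": np.hstack([text_features, topic_features]),
    "model_3_topics_only": topic_features,
}

auc_by_model = {}
for name, X in feature_sets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=0, stratify=labels)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    # Probability of the "high clicks" class, used to rank posts for the ROC curve.
    high_click_probability = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, high_click_probability)

for name, auc in auc_by_model.items():
    print(name, round(auc, 3))
```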

<img src="Code/RandomForest/ROC.png">
Above is a ROC curve comparing the performance of the 3 models. All 3 models perform significantly better than random. Interestingly, we also find that model 3, with just the topic features, has an AUC of 0.65, showing that the topic of the post is a significant predictor of post consumption.

NOTE: We also performed modeling using Naive Bayes and logistic regression, and their performance is competitive with random forests.

# Actionable Business Insights

<img src="workflow/Insights.png">

# Summary of work accomplished over the last 3 weeks

### - Engineered features that affect engagement and consumption on Facebook news posts
### - Identified 4 metrics for quantifying 'success' for these posts
### - Performed topic clustering using Latent Dirichlet Allocation
### - Built a Random Forest model to classify posts between high and low post clicks
### - Provided actionable insights that will help the news station better engage with their audience