# Facebook Metrics Modeling

So far we have found that most of the attributes in our Facebook metrics dataset are positively correlated with each other. Because correlated attributes carry overlapping information, we should apply dimensionality reduction techniques to identify the core attributes that affect the total interactions of a Facebook post.

## Goal

The goal for this project is to answer the following question:

    What Facebook metrics should an account focus on if their goal is to increase their total interactions on a post?
    
This question will be answered by building a regression model that predicts the total interactions of a post, using dimensionality reduction techniques to choose the best attributes for the model.

## Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('../data/facebook_clean.csv')

In [3]:
df.head()

Unnamed: 0,total_likes,type,category,post_month,post_weekday,post_hour,paid,lifetime_reach,lifetime_impressions,lifetime_engaged_users,lifetime_consumers,lifetime_consumptions,impressions_by_likers,reach_by_likers,likers_who_engaged,total_interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,393


In [4]:
df.shape

(495, 16)

## Data Preparation

Before we begin building our model, we must prepare our data for modeling. This means creating dummy variables for the categorical columns `type`, `category`, `post_month`, `post_weekday`, and `post_hour`.

In [5]:
df = pd.get_dummies(df, columns=['type', 'category', 'post_month', 'post_weekday', 'post_hour'], dtype='int')

In [6]:
df.head()

Unnamed: 0,total_likes,paid,lifetime_reach,lifetime_impressions,lifetime_engaged_users,lifetime_consumers,lifetime_consumptions,impressions_by_likers,reach_by_likers,likers_who_engaged,...,post_hour_13,post_hour_14,post_hour_15,post_hour_16,post_hour_17,post_hour_18,post_hour_19,post_hour_20,post_hour_22,post_hour_23
0,139441,0.0,2752,5091,178,109,159,3078,1640,119,...,0,0,0,0,0,0,0,0,0,0
1,139441,0.0,10460,19057,1457,1361,1674,11710,6112,1108,...,0,0,0,0,0,0,0,0,0,0
2,139441,0.0,2413,4373,177,113,154,2812,1503,132,...,0,0,0,0,0,0,0,0,0,0
3,139441,1.0,50128,87991,2211,790,1119,61027,32048,1386,...,0,0,0,0,0,0,0,0,0,0
4,139441,0.0,7244,13594,671,410,580,6228,3200,396,...,0,0,0,0,0,0,0,0,0,0


In [7]:
df.columns

Index(['total_likes', 'paid', 'lifetime_reach', 'lifetime_impressions',
       'lifetime_engaged_users', 'lifetime_consumers', 'lifetime_consumptions',
       'impressions_by_likers', 'reach_by_likers', 'likers_who_engaged',
       'total_interactions', 'type_Link', 'type_Photo', 'type_Status',
       'type_Video', 'category_1', 'category_2', 'category_3', 'post_month_1',
       'post_month_2', 'post_month_3', 'post_month_4', 'post_month_5',
       'post_month_6', 'post_month_7', 'post_month_8', 'post_month_9',
       'post_month_10', 'post_month_11', 'post_month_12', 'post_weekday_1',
       'post_weekday_2', 'post_weekday_3', 'post_weekday_4', 'post_weekday_5',
       'post_weekday_6', 'post_weekday_7', 'post_hour_1', 'post_hour_2',
       'post_hour_3', 'post_hour_4', 'post_hour_5', 'post_hour_6',
       'post_hour_7', 'post_hour_8', 'post_hour_9', 'post_hour_10',
       'post_hour_11', 'post_hour_12', 'post_hour_13', 'post_hour_14',
       'post_hour_15', 'post_hour_16', 'post_hour

Next, we will split our data into training, testing, and validation sets. We have 495 rows to work with, so we will aim for a 60-20-20 percent split across the training, testing, and validation sets, respectively.

In [8]:
#separate our features and target variable
X = df.drop('total_interactions', axis=1)
y = df['total_interactions']

In [9]:
#create our 60-20-20 split: hold out 20% for testing, then 25% of the remaining 80% (20% overall) for validation
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=8)

In [10]:
#check that our splits follow 60-20-20
print('Training set percent: ', X_train.shape[0] / df.shape[0])
print('Testing set percent: ', X_test.shape[0] / df.shape[0])
print('Validation set percent: ', X_val.shape[0] / df.shape[0])

Training set percent:  0.6
Testing set percent:  0.2
Validation set percent:  0.2
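
The second call to `train_test_split` uses `test_size=0.25` rather than `0.2` because it operates on the 80% of rows left after the test set is removed, and 0.25 of 0.8 is 0.2 of the whole. The arithmetic can be sketched as follows; the helper name `second_split_size` is hypothetical, and exact fractions are used only to avoid floating-point rounding in the illustration.

```python
from fractions import Fraction

def second_split_size(val_share, test_share):
    """Share of the rows remaining after the test split that must be held out
    so that validation ends up with `val_share` of all rows."""
    return val_share / (1 - test_share)

test_share = Fraction(20, 100)  # 20% of all rows for testing
val_share = Fraction(20, 100)   # 20% of all rows for validation

second = second_split_size(val_share, test_share)

n_rows = 495                         # row count reported by df.shape
n_test = n_rows * test_share         # rows held out by the first split
n_val = (n_rows - n_test) * second   # rows held out by the second split
n_train = n_rows - n_test - n_val    # rows left for training

print(float(second), n_train, n_val, n_test)
```

The same helper gives the second `test_size` for other targets, for example a 70-15-15 split would need 0.15 / 0.85 for the second call.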


In [11]:
#dictionary that will be used to store our testing results
test_results = {}

## Establishing A Baseline

Now that we have our dummy variables as well as our training, testing, and validation sets, we are ready to begin building our regression models.

Since the purpose of this project is to identify the important attributes using a linear regression model, we must first create a baseline. The baseline lets us compare future models that use fewer attributes against a model that uses all of them.

The baseline model for this project will be a simple Ordinary Least Squares (OLS) linear regression model from `sklearn` using all the available features. We will evaluate all our models using Mean Absolute Error (MAE).
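
MAE is the average of the absolute differences between the true and predicted values, so it is expressed in the same units as the target (here, interactions per post). As a minimal sketch, the function below computes it by hand with NumPy; the `mae` helper and the predictions are made up for illustration, and `sklearn.metrics.mean_absolute_error` is expected to return the same value for the same inputs.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# First three total_interactions values shown by df.head().
y_true = [100, 164, 80]
# Made-up predictions, for illustration only.
y_pred = [110, 150, 95]

# Absolute errors are 10, 14 and 15, so the mean is 39 / 3 = 13.0.
example_mae = mae(y_true, y_pred)
print(example_mae)
```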

In [12]:
#used to store test results
test_results['baseline_reg'] = {}

#creating our baseline Linear Regression Model
baseline_reg = linear_model.LinearRegression()
baseline_reg.fit(X_train, y_train)

LinearRegression()

In [13]:
#evaluating our baseline regression model on training data
test_results['baseline_reg']['train_mae'] = mean_absolute_error(y_train, baseline_reg.predict(X_train))

In [14]:
#evaluating our baseline regression model on test data
test_results['baseline_reg']['test_mae'] = mean_absolute_error(y_test, baseline_reg.predict(X_test))

In [15]:
#see our results
test_results['baseline_reg']

{'train_mae': 35.213034588514546, 'test_mae': 32.74318852274562}
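
Since later models will be compared against this baseline, it may help to record the MAE for every split in one step. The following is a minimal sketch under stated assumptions: `record_mae` is a hypothetical helper, not part of `sklearn`, and the data is a synthetic stand-in rather than the Facebook dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def record_mae(results, name, model, splits):
    """Store the MAE of a fitted `model` on each named (X, y) split under results[name]."""
    results[name] = {
        f"{split}_mae": mean_absolute_error(y, model.predict(X))
        for split, (X, y) in splits.items()
    }
    return results[name]

# Synthetic stand-in data (not the Facebook dataset): a noisy linear target.
rng = np.random.default_rng(8)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Simple 60-20-20 slicing of the synthetic rows.
X_tr, y_tr = X_demo[:60], y_demo[:60]
X_va, y_va = X_demo[60:80], y_demo[60:80]
X_te, y_te = X_demo[80:], y_demo[80:]

demo_results = {}
model = LinearRegression().fit(X_tr, y_tr)
record_mae(
    demo_results,
    "baseline_reg",
    model,
    {"train": (X_tr, y_tr), "val": (X_va, y_va), "test": (X_te, y_te)},
)
print(demo_results)
```

With the notebook's own variables, the same call could take `baseline_reg` and the `X_train`, `X_val`, and `X_test` splits, which would add a `val_mae` entry alongside the two values shown above.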