### Finding the best show parameters

Once the model has been trained, we can use the frontend manually and see what results we get for each combination of values we enter. Instead, let's rank the combinations we can try.

We have 4 fields:
- invited_noreply
- days_of_promoting
- page_posts
- visitor_posts

We will try combinations of these and see which one gives us the best result.
Before that, however, we will check the error between the actual values in our training data set and the values the model predicts for that same data.

In [1]:
import pandas as pd
import numpy as np
import pickle

In [9]:
training_data = pd.read_excel('improv_events.xlsx', index_col=0)
with open('scaler.p', 'rb') as scaler_file:
    scaler = pickle.load(scaler_file)
with open('best_model.p', 'rb') as model_file:
    model = pickle.load(model_file)
training_data

In [14]:
# Compare the model's predictions with the known targets of the training data
def test_training_predictions(test_data, model, scaler):
    feature_columns = ['invited_noreply', 'days_of_promoting', 'page_posts', 'visitor_posts']
    target = test_data[['attending_count', 'interested_count']]
    # Scale the features with the pickled scaler before predicting
    test_data = pd.DataFrame(
        scaler.transform(test_data[feature_columns]),
        columns=feature_columns)
    predictions = model.predict(test_data)
    print(len(predictions))
    # Print actual value, prediction and signed difference for every row
    for i in range(len(predictions)):
        difference = target.values[i] - predictions[i]
        print("Training data was: {}, and prediction was {}. So the difference was {}".format(str(target.values[i]), str(predictions[i]), str(difference)))

In [15]:
test_training_predictions(training_data, model, scaler)

20
Training data was: [21 54], and prediction was [23.49805457 60.84269784]. So the difference was [-2.49805457 -6.84269784]
Training data was: [ 27 114], and prediction was [ 29.86162078 128.53370952]. So the difference was [ -2.86162078 -14.53370952]
Training data was: [ 24 115], and prediction was [ 32.55040159 132.8322485 ]. So the difference was [ -8.55040159 -17.8322485 ]
Training data was: [ 37 117], and prediction was [ 28.6796071  105.35392229]. So the difference was [ 8.3203929  11.64607771]
Training data was: [ 20 134], and prediction was [ 36.51822595 170.09859117]. So the difference was [-16.51822595 -36.09859117]
Training data was: [ 32 206], and prediction was [ 45.25294595 223.1779136 ]. So the difference was [-13.25294595 -17.1779136 ]
Training data was: [ 32 170], and prediction was [ 32.58313164 173.30038837]. So the difference was [-0.58313164 -3.30038837]
Training data was: [ 33 135], and prediction was [ 32.16595118 131.96255317]. So the difference was [0.83404882

It's not a very good predictor, but it's the best we've got. Since the data we have is limited, this will do.
The absolute errors for attending range from 1 to 32, whereas for interested they range from 3 to 57 (with one exception below 1). These errors are large compared to the counts themselves, so the predictions will not give us reliable information, unfortunately.
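Rather than reading the error range off the printed lines, we could summarise it numerically. The snippet below is only a sketch: it uses the first seven actual/predicted pairs copied from the output above (not the full 20 rows), and `summarize_errors` is a helper name made up for this illustration.

```python
import numpy as np

# First seven (actual, predicted) pairs copied from the printed output above.
# Columns are [attending_count, interested_count].
actual = np.array([
    [21, 54], [27, 114], [24, 115], [37, 117],
    [20, 134], [32, 206], [32, 170],
])
predicted = np.array([
    [23.49805457, 60.84269784],
    [29.86162078, 128.53370952],
    [32.55040159, 132.8322485],
    [28.6796071, 105.35392229],
    [36.51822595, 170.09859117],
    [45.25294595, 223.1779136],
    [32.58313164, 173.30038837],
])


def summarize_errors(actual, predicted):
    """Return per-target mean, min and max absolute error (hypothetical helper)."""
    abs_err = np.abs(actual - predicted)
    return {
        'mae': abs_err.mean(axis=0),
        'min_abs_error': abs_err.min(axis=0),
        'max_abs_error': abs_err.max(axis=0),
    }


summary = summarize_errors(actual, predicted)
for name, values in summary.items():
    # First number is for attending, second for interested
    print(name, np.round(values, 2))
```

In the notebook, the same helper could be fed `target.values` and `predictions` from `test_training_predictions` to cover all 20 rows.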

However, it is a fun exercise to see what this predictor tells us is the best combination of features.
Below I will provide sets of parameters that I will combine and see which combination would give the best results.
As the improv team performs more shows (and we therefore collect more data points), our predictor should become more and more accurate.

One of the reasons the accuracy is not so great is that the pandemic also changed theater attendance, so we are looking at an imbalanced data set (more shows pre-pandemic and fewer afterwards).

In [21]:
invited_no_reply_set = [0, 100, 300, 500, 700, 1000, 1200, 1500, 2000, 2500]
days_of_promoting_set = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 23, 26, 28]
page_posts_set = [0, 1, 2, 3, 4, 5, 6, 7]
visitor_posts_set = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14, 18, 20]

predictions_dict = {'invited_noreply' : [], 'days_of_promoting': [], 'page_posts': [], 'visitor_posts': [], 'predicted_attending': [], 'predicted_interested': []}

for inv_noreply in invited_no_reply_set:
    for days_of_prom in days_of_promoting_set:
        for page_post in page_posts_set:
            for visitor_post in visitor_posts_set:
                predictions_dict['invited_noreply'].append(inv_noreply)
                predictions_dict['days_of_promoting'].append(days_of_prom)
                predictions_dict['page_posts'].append(page_post)
                predictions_dict['visitor_posts'].append(visitor_post)
                initial_data = pd.DataFrame(
                    [[inv_noreply, days_of_prom,  page_post, visitor_post]],
                    columns=['invited_noreply', 'days_of_promoting', 'page_posts', 'visitor_posts'])
                scaled_data = pd.DataFrame(
                    scaler.transform(initial_data), 
                    columns=['invited_noreply', 'days_of_promoting', 'page_posts', 'visitor_posts'])
                predicted = model.predict(scaled_data)
                predictions_dict['predicted_attending'].append(predicted[0][0])
                predictions_dict['predicted_interested'].append(predicted[0][1])
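The nested loops call the scaler and the model once per combination. As an alternative, the whole grid can be built with `itertools.product` and predicted in a single call. This is a sketch, not the code that produced the results in this notebook: `predict_grid`, `IdentityScaler` and `SumModel` are made-up names, and the two stand-in classes exist only so the snippet runs without the pickled files. It assumes the model returns one row of `[attending, interested]` per input row, as the loop above does.

```python
import itertools

import numpy as np
import pandas as pd

FEATURES = ['invited_noreply', 'days_of_promoting', 'page_posts', 'visitor_posts']


def predict_grid(value_sets, scaler, model):
    """Predict every combination of the value sets with one transform/predict call."""
    # itertools.product walks the combinations in the same order as the
    # nested loops: the last value set varies fastest.
    grid = pd.DataFrame(list(itertools.product(*value_sets)), columns=FEATURES)
    scaled = pd.DataFrame(scaler.transform(grid), columns=FEATURES)
    predicted = np.asarray(model.predict(scaled))
    grid['predicted_attending'] = predicted[:, 0]
    grid['predicted_interested'] = predicted[:, 1]
    return grid


# Hypothetical stand-ins so the sketch runs on its own; in the notebook,
# pass the unpickled `scaler` and `model` instead.
class IdentityScaler:
    def transform(self, X):
        return np.asarray(X, dtype=float)


class SumModel:
    def predict(self, X):
        total = np.asarray(X, dtype=float).sum(axis=1)
        return np.column_stack([total, 2 * total])


value_sets = [
    [0, 100, 300, 500, 700, 1000, 1200, 1500, 2000, 2500],              # invited_noreply
    [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 23, 26, 28],    # days_of_promoting
    [0, 1, 2, 3, 4, 5, 6, 7],                                           # page_posts
    [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 14, 18, 20],                        # visitor_posts
]
demo_df = predict_grid(value_sets, IdentityScaler(), SumModel())
print(demo_df.shape)
```

With the real scaler and model, the returned frame should have the same columns and row order as `predictions_df`.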

In [22]:
predictions_df = pd.DataFrame(predictions_dict)
predictions_df

Unnamed: 0,invited_noreply,days_of_promoting,page_posts,visitor_posts,predicted_attending,predicted_interested
0,0,5,0,0,30.477287,117.458077
1,0,5,0,1,39.692191,146.978322
2,0,5,0,2,49.057972,176.854460
3,0,5,0,3,57.912106,205.051547
4,0,5,0,4,65.572567,229.435676
...,...,...,...,...,...,...
17675,2500,28,7,9,-0.410204,40.514985
17676,2500,28,7,11,-0.893534,24.293487
17677,2500,28,7,14,-0.435314,9.581889
17678,2500,28,7,18,0.094616,2.130215


In [23]:
predictions_df['total_engagement'] = predictions_df['predicted_attending'] + predictions_df['predicted_interested']

In [25]:
predictions_df.sort_values(by=['predicted_attending', 'predicted_interested'], ascending=False).head(20)

Unnamed: 0,invited_noreply,days_of_promoting,page_posts,visitor_posts,predicted_attending,predicted_interested,total_engagement
5324,500,5,1,7,86.250485,291.197156,377.44764
3556,300,5,1,7,86.159518,304.380054,390.539572
5337,500,5,2,7,85.511347,310.754967,396.266314
3569,300,5,2,7,85.284368,323.937021,409.221389
13871,1500,23,3,0,85.025089,478.764532,563.789621
5325,500,5,1,9,84.734917,283.646036,368.380953
5428,500,6,1,7,84.178386,295.311431,379.489817
3660,300,6,1,7,83.599021,306.541675,390.140696
5429,500,6,1,9,83.522491,288.699828,372.222319
3555,300,5,1,6,83.438275,296.098878,379.537153


In [26]:
predictions_df.sort_values(by=['predicted_interested', 'predicted_attending'], ascending=False).head(20)

Unnamed: 0,invited_noreply,days_of_promoting,page_posts,visitor_posts,predicted_attending,predicted_interested,total_engagement
12209,1200,26,3,2,77.807089,506.704665,584.511753
12208,1200,26,3,1,79.924677,505.020667,584.945344
12210,1200,26,3,3,74.618944,498.820524,573.439468
12105,1200,23,3,2,79.343641,498.202041,577.545682
12104,1200,23,3,1,81.466838,494.57366,576.040497
12207,1200,26,3,0,80.682596,493.503002,574.185597
12106,1200,23,3,3,76.306168,493.058688,569.364856
12313,1200,28,3,2,73.843297,493.029175,566.872472
13976,1500,26,3,1,80.901409,492.962526,573.863935
12312,1200,28,3,1,75.795289,492.291195,568.086484


In [27]:
predictions_df.sort_values(by=['total_engagement', 'predicted_attending', 'predicted_interested'], ascending=False).head(10)

Unnamed: 0,invited_noreply,days_of_promoting,page_posts,visitor_posts,predicted_attending,predicted_interested,total_engagement
12208,1200,26,3,1,79.924677,505.020667,584.945344
12209,1200,26,3,2,77.807089,506.704665,584.511753
12105,1200,23,3,2,79.343641,498.202041,577.545682
12104,1200,23,3,1,81.466838,494.57366,576.040497
12207,1200,26,3,0,80.682596,493.503002,574.185597
13976,1500,26,3,1,80.901409,492.962526,573.863935
12210,1200,26,3,3,74.618944,498.820524,573.439468
13975,1500,26,3,0,83.032132,488.431643,571.463775
12106,1200,23,3,3,76.306168,493.058688,569.364856
12312,1200,28,3,1,75.795289,492.291195,568.086484


This is interesting. It looks like our model believes that to get a lot of people to click on 'going', we do not need to invite a lot of people: most of the top rows above have only 300-500 invited.

However, if we want more people to be 'interested' in the show, we will need to invite 1200-2000 people (counting only the invited ones who did not reply).

If we want more people attending, we should make more visitor posts and fewer page posts.

If we want more people interested in our events, we need more days of promotion and a total of 3 page posts about the event. We should also avoid flooding the public with visitor posts about the event.

Even if the model is flawed, it can still pick up the basic relationships in our data. We expected that more days of promoting would mean more people engaged with the event, but we did not foresee that there is a sweet spot in page_posts that attracts the most users.
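One way to look for such a sweet spot is to group the predictions by a single parameter and take the best value reached at each level. The snippet below only illustrates the mechanics: `sample` holds six rows copied from the tables above (so it is biased towards the top results), and `best_per_value` is a helper name made up for this sketch.

```python
import pandas as pd

# Six rows copied from the result tables above, for illustration only.
sample = pd.DataFrame(
    [
        (0, 5, 0, 4, 65.572567, 229.435676),
        (300, 5, 1, 7, 86.159518, 304.380054),
        (300, 5, 2, 7, 85.284368, 323.937021),
        (1200, 26, 3, 2, 77.807089, 506.704665),
        (1200, 26, 3, 1, 79.924677, 505.020667),
        (2500, 28, 7, 9, -0.410204, 40.514985),
    ],
    columns=['invited_noreply', 'days_of_promoting', 'page_posts',
             'visitor_posts', 'predicted_attending', 'predicted_interested'],
)


def best_per_value(df, feature, target):
    """Highest predicted `target` reached at each value of `feature` (hypothetical helper)."""
    return df.groupby(feature)[target].max()


by_page_posts = best_per_value(sample, 'page_posts', 'predicted_interested')
print(by_page_posts)
# Value of page_posts at which the best prediction occurs
print(by_page_posts.idxmax())
```

Running the same helper on the full `predictions_df` would show whether the peak at 3 page posts holds across the whole grid and not only in the top rows.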

That leaves us with the ideal parameters:
An Improbabilii Improv Show should be created (on Facebook) at least 3 weeks before it takes place. We should invite 1700-2000 people. We should make a total of 3 page posts. As for visitor 