## Regression analysis to determine memorable characteristics that have significant influence on ratings

#### Author: Lia Chin-Purcell
#### Date: 5/4/2022

In [1]:
# import our libraries
import io
import pandas as pd

import numpy as np 
import statsmodels.api as sm

## Visualise
%matplotlib inline
import matplotlib
from matplotlib import pyplot as plt


In [2]:
matplotlib.__version__

'3.5.1'

## Read in the chocolate bar rating dataset

We use a publicly [available chocolate bar rating dataset](http://flavorsofcacao.com/chocolate_database.html).

In [3]:
df_choc = pd.read_csv('../data/chocolate.csv')

In [4]:
df_choc.head(5)

Unnamed: 0,REF,Company (Manufacturer),Company Location,Review Date,Country of Bean Origin,Specific Bean Origin or Bar Name,Cocoa Percent,Ingredients,Most Memorable Characteristics,Rating
0,1205,Habitual,Canada,2014,Blend,one hundred,100%,,"unrefined, bitter, earthy",2.0
1,701,Haigh,Australia,2011,Blend,South America and Africa,70%,,"vanilla, chocolate milk",3.0
2,1113,Hotel Chocolat,U.K.,2013,St. Lucia,"Island Growers, 2012, 120hr c., batch 13080",100%,,"pastey, bitter, unfixable",1.75
3,296,Hotel Chocolat (Coppeneur),U.K.,2008,Uganda,Uganda,80%,,"charred, espresso",2.5
4,552,Hotel Chocolat (Coppeneur),U.K.,2010,Ecuador,Ecuador,70%,,"spicy, sour, burning",2.75


# Most memorable characteristic regression

Here, we use a regression analysis to determine which memorable characteristics have a significant influence on ratings. For example, does 'bitter' have a positive or negative relationship with rating? What about 'creamy'? Which memorable characteristics have the most statistically significant effect, and which have the largest effect?

The memorable characteristics of a bar are short terms describing anything from texture and flavor to overall opinion.

First, we need to parse the memorable characteristics string into an array:

In [5]:
# new column that will hold the list version of each string
df_choc['most_memorable'] = df_choc['Most Memorable Characteristics']

for index, row in df_choc.iterrows():
    # split the comma-separated string into a list and strip whitespace
    mem_chars = [cha.strip() for cha in row['Most Memorable Characteristics'].split(',')]
    df_choc.at[index, 'most_memorable'] = mem_chars

df_choc.head(5)


Unnamed: 0,REF,Company (Manufacturer),Company Location,Review Date,Country of Bean Origin,Specific Bean Origin or Bar Name,Cocoa Percent,Ingredients,Most Memorable Characteristics,Rating,most_memorable
0,1205,Habitual,Canada,2014,Blend,one hundred,100%,,"unrefined, bitter, earthy",2.0,"[unrefined, bitter, earthy]"
1,701,Haigh,Australia,2011,Blend,South America and Africa,70%,,"vanilla, chocolate milk",3.0,"[vanilla, chocolate milk]"
2,1113,Hotel Chocolat,U.K.,2013,St. Lucia,"Island Growers, 2012, 120hr c., batch 13080",100%,,"pastey, bitter, unfixable",1.75,"[pastey, bitter, unfixable]"
3,296,Hotel Chocolat (Coppeneur),U.K.,2008,Uganda,Uganda,80%,,"charred, espresso",2.5,"[charred, espresso]"
4,552,Hotel Chocolat (Coppeneur),U.K.,2010,Ecuador,Ecuador,70%,,"spicy, sour, burning",2.75,"[spicy, sour, burning]"
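The same parsing can also be expressed with a small helper and `apply`. The sketch below is one possible alternative, run on a made-up three-row frame (the strings are illustrative, not taken from the dataset). The helper name `parse_characteristics` is hypothetical. Unlike the cell above, it also drops empty tokens left by a trailing comma, which is an assumption about the desired behaviour.

```python
import pandas as pd

# Toy stand-in for df_choc; values are illustrative only
toy = pd.DataFrame({
    'Most Memorable Characteristics': [
        'unrefined, bitter, earthy',
        'vanilla, chocolate milk',
        'spicy, sour,',  # trailing comma would produce an empty token
    ]
})

def parse_characteristics(text):
    """Split a comma-separated string into stripped, non-empty terms."""
    return [term.strip() for term in text.split(',') if term.strip()]

toy['most_memorable'] = toy['Most Memorable Characteristics'].apply(parse_characteristics)
print(toy['most_memorable'].tolist())
```

Whether empty tokens should be dropped depends on the data; keeping them, as the original cell does, produces a dummy column with an empty name.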


In [6]:
# get dummies from array of memorable characteristics, so each feature is a dummy for that characteristic
df_tree = pd.get_dummies(df_choc['most_memorable'].explode()).groupby(level=0).sum()
df_tree.head(5)

Unnamed: 0,Unnamed: 1,Fruity,Rich cocoa,Roasty,XL nibs,accessible,acidic,alcohol,almond,almond butter,...,well developed,wet,wheat,wild berries,wild berry,wine,woodsy,woody,yellow fruit,yogurt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
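To make the `explode` / `get_dummies` / `groupby` chain easier to follow, here is a toy version with made-up lists (the terms and the variable names `long_form`, `dummies`, and `multi_hot` are illustrative). Each step is kept separate so the intermediate shapes can be inspected.

```python
import pandas as pd

# Three hypothetical bars, each with a list of characteristics
lists = pd.Series([
    ['bitter', 'earthy'],
    ['creamy'],
    ['bitter', 'creamy', 'nutty'],
])

# One row per (bar, characteristic); the bar's index label is repeated
long_form = lists.explode()

# One indicator column per distinct term
dummies = pd.get_dummies(long_form)

# Collapse back to one row per bar by summing over the repeated index
multi_hot = dummies.groupby(level=0).sum().astype(int)
print(multi_hot)
```

Every distinct string becomes its own column, so terms that differ only in spelling or number (for example `wild berries` and `wild berry` in the header above) are treated as unrelated features.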


In [7]:
# set up the regression

# outcome variable
y = df_choc['Rating']

# memorable characteristic dummies
X = df_tree

# add a constant
X = sm.add_constant(X)

# declare and fit the model
model = sm.OLS(y,X).fit()
model.summary()

0,1,2,3
Dep. Variable:,Rating,R-squared:,0.779
Model:,OLS,Adj. R-squared:,0.596
Method:,Least Squares,F-statistic:,4.261
Date:,"Thu, 12 May 2022",Prob (F-statistic):,3.12e-87
Time:,00:16:09,Log-Likelihood:,272.49
No. Observations:,1560,AIC:,867.0
Df Residuals:,854,BIC:,4646.0
Df Model:,705,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.2853,0.052,63.639,0.000,3.184,3.387
,0.0846,0.165,0.513,0.608,-0.239,0.408
Fruity,0.1074,0.140,0.768,0.442,-0.167,0.382
Rich cocoa,-0.0773,0.141,-0.548,0.584,-0.354,0.199
Roasty,0.1074,0.140,0.768,0.442,-0.167,0.382
XL nibs,-0.1515,0.300,-0.506,0.613,-0.740,0.437
accessible,0.2152,0.202,1.066,0.287,-0.181,0.611
acidic,-0.1206,0.071,-1.707,0.088,-0.259,0.018
alcohol,0.0037,0.207,0.018,0.986,-0.402,0.409

0,1,2,3
Omnibus:,194.718,Durbin-Watson:,1.785
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1174.824
Skew:,-0.406,Prob(JB):,7.77e-256
Kurtosis:,7.173,Cond. No.,1.46e+18
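
The summary reports 705 model degrees of freedom against 1560 observations and a condition number of 1.46e+18, which usually indicates that some dummy columns are linearly dependent or nearly so. One way to look into this, sketched here on toy data, is to count how many bars mention each term and to flag columns that are exact copies of each other. The `min_count` threshold and the toy values are hypothetical and are not used anywhere in this notebook.

```python
import pandas as pd

# Toy dummy matrix; 'Fruity' and 'Roasty' each occur once, on the same bar
toy_dummies = pd.DataFrame({
    'bitter': [1, 1, 0, 1, 0],
    'creamy': [0, 1, 1, 0, 1],
    'Fruity': [0, 0, 0, 0, 1],
    'Roasty': [0, 0, 0, 0, 1],
})

# How many bars mention each term
term_counts = toy_dummies.sum().sort_values(ascending=False)

# Keep only terms mentioned at least min_count times (hypothetical cutoff)
min_count = 2
frequent = toy_dummies.loc[:, term_counts[term_counts >= min_count].index]

# Flag columns that are identical to another column
duplicated_cols = toy_dummies.T.duplicated(keep=False)
print(term_counts)
print(duplicated_cols[duplicated_cols].index.tolist())
```

Terms that only ever appear together cannot be separated by the regression. The identical estimates for `Fruity` and `Roasty` in the coefficient table above would be consistent with that situation, though this has not been checked against the data here.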


### That's a lot of features! Which ones are statistically significant?

## Positive effect memorable characteristics

Let's go through our model and keep the terms with a p-value of at most 0.05 and a positive coefficient. These are the characteristics with a statistically significant, *positive* association with rating. In other words, a bar described with this characteristic is predicted to rate higher, by `coef` points, than an otherwise identical bar without it.

In [7]:
# create dataframe to be able to sort on p values with model information and parameters

ps = []
coefs = []
most_memorable = []

for var in X.columns:
    ps.append(model.pvalues[var])
    coefs.append(model.params[var])
    most_memorable.append(var)


In [8]:
# dataframe for output

df_params = pd.DataFrame()
df_params['p-value'] = ps
df_params['coef'] = coefs
df_params['name'] = most_memorable

In [9]:
df_params_sig = df_params[(df_params['p-value'] <= 0.05) & (df_params['coef'] > 0)]
# constant is not relevant
df_params_sig = df_params_sig[df_params_sig['name'] != 'const']

In [24]:
df_params_sig.head(10)

Unnamed: 0,p-value,coef,name
23,0.008718429,0.234444,balanced
38,0.002504779,0.868794,black current
43,0.0005182544,0.420081,blackberry
47,0.01547497,0.97077,blueberries
59,0.03637186,0.206921,bright fruit
70,0.01872272,0.714407,burnt brownie
97,0.02626219,0.464276,cardamom
98,0.01070548,0.71473,cardamon
120,0.01070548,0.71473,chocolate and grapes
128,0.0484659,0.365115,cinamon
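
To read these coefficients concretely: the model is additive, so a predicted rating is the intercept plus the coefficient of every characteristic listed for the bar. The short sketch below uses the intercept from the summary above (3.2853) and two coefficients from this table; the variable names are illustrative.

```python
# Values copied from the regression output above
const = 3.2853
coef_blackberry = 0.420081
coef_balanced = 0.234444

# Predicted rating when every dummy is zero
pred_none = const

# Bar described only as 'blackberry'
pred_blackberry = const + coef_blackberry

# Bar described as both 'blackberry' and 'balanced'
pred_both = const + coef_blackberry + coef_balanced

print(round(pred_blackberry, 2), round(pred_both, 2))
```

These are point predictions from the fitted model, not guarantees; the confidence intervals around individual coefficients can be wide, especially for terms that appear on only a few bars.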


### Save as a table

In [10]:
df_params_sig.to_csv('../tables/good_review_memorable_characteristics.csv')

## Negative effect memorable characteristics

Let's do the same for negative characteristics. It turns out there are more statistically significant terms with negative coefficients than with positive ones, so let's use a stricter significance threshold of 0.001.

In [11]:
df_params_sig = df_params[(df_params['p-value'] <= 0.001) & (df_params['coef'] < 0)]

In [12]:
df_params_sig.head(10)

Unnamed: 0,p-value,coef,name
35,2.0972730000000002e-27,-0.5986,bitter
79,0.0003158706,-0.611484,burnt rubber
81,0.0002250781,-1.03527,burnt up front
109,2.912281e-06,-0.463721,chemical
110,0.0002781201,-0.733152,chemical off
176,0.0003395652,-0.262095,dirty
188,1.644918e-08,-1.586933,dominate off note
198,1.567257e-05,-0.145713,earthy
233,7.896309e-05,-0.649563,fuel
259,1.813871e-05,-0.430084,harsh
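
Since the positive and negative filters differ only in the threshold and the sign, they could be wrapped in one helper. The function below is a hypothetical sketch (`significant_terms` is not part of the notebook). It also sorts by coefficient so the largest effects come first, which the original cells do not do. The toy table reuses a few values from the outputs above.

```python
import pandas as pd

def significant_terms(params, alpha, sign):
    """Rows with p-value <= alpha and a coefficient of the given sign, without 'const'."""
    if sign not in ('positive', 'negative'):
        raise ValueError("sign must be 'positive' or 'negative'")
    mask = params['p-value'] <= alpha
    if sign == 'positive':
        mask &= params['coef'] > 0
    else:
        mask &= params['coef'] < 0
    mask &= params['name'] != 'const'
    # Largest effect first: descending for positive, ascending for negative
    return params[mask].sort_values('coef', ascending=(sign == 'negative'))

# A few rows taken from the tables above
params = pd.DataFrame({
    'p-value': [0.0, 0.008718429, 0.0005182544, 2.097273e-27, 1.567257e-05, 0.088],
    'coef': [3.2853, 0.234444, 0.420081, -0.5986, -0.145713, -0.1206],
    'name': ['const', 'balanced', 'blackberry', 'bitter', 'earthy', 'acidic'],
})

print(significant_terms(params, alpha=0.05, sign='positive')['name'].tolist())
print(significant_terms(params, alpha=0.001, sign='negative')['name'].tolist())
```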


### Save as a table

In [13]:
df_params_sig.to_csv('../tables/bad_review_memorable_characteristics.csv')