# Conjoint Analysis

Conjoint analysis estimates how much each of a product's attributes contributes to the product's overall value to customers. The goal of the analysis is to get a better sense of the tradeoffs customers make between product features.

At its heart, computing product feature utility is a regression problem. In simple terms, linear regression can be fit with the dataset's dummy variables as predictors and a product rank, rating, or customer preference as the response. When the data are choices (or rankings treated as categories), logistic regression can be used instead. More advanced techniques use hierarchical Bayes.

The example below uses sample gelato preferences and is inspired by: http://keii.ue.wroc.pl/conjoint/Conjoint_R.html#ref8

In Italian, *gelato* means ice cream. American or more conventional ice cream is churned at a fairly high speed to incorporate air and increase its volume (cheaper ice creams tend to have more air whipped into them). Gelato is churned at a much slower rate, which incorporates less air and leaves it denser than ice cream. Gelato's resulting texture, which is also influenced by a lower proportion of cream, is silkier and softer.



In [5]:
import pandas as pd 
import numpy as np 
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import preprocessing
import matplotlib.pyplot as plt

### Load, examine and plot data

In [86]:
rank_data = pd.read_csv('ice_prefs_data.csv')
data = rank_data

In [87]:
data.head(5)

| | gusto | prezzo en Euros | contenitore | guarnizione | Customer_Rank |
|---|---|---|---|---|---|
| 0 | mango | 1 | cone | yes | 23 |
| 1 | fragola | 2 | cone | yes | 14 |
| 2 | stracciatella | 1 | cup | yes | 17 |
| 3 | stracciatella | 2 | cup | yes | 18 |
| 4 | mango | 3 | cup | yes | 1 |


In [88]:
data.isnull().sum() # examine for missing values in each column

gusto              0
prezzo en Euros    0
contenitore        0
guarnizione        0
Customer_Rank      0
dtype: int64

In [89]:
data.columns

Index(['gusto', 'prezzo en Euros', 'contenitore', 'guarnizione',
       'Customer_Rank'],
      dtype='object')

### "Dummy" the data

Dummy variables are useful because they let a single regression equation represent multiple categorical features, so we don't need to write out a separate model for each subgroup. Each dummy variable acts like a switch: a value of 1 turns the corresponding coefficient on in the equation, and a value of 0 turns it off.

For additional readings -> https://socialresearchmethods.net/kb/dummyvar.php

Video (but using MS Excel) -> https://www.youtube.com/watch?v=EMbiGPGlBEM
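
To make the "switch" idea concrete, here is a minimal sketch on a tiny, made-up data frame (the `toy` rows are hypothetical and are not taken from `ice_prefs_data.csv`). It passes `dtype=int` so that the switches print as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data: two categorical attributes of three gelato offers
toy = pd.DataFrame({
    "gusto": ["mango", "fragola", "mango"],
    "contenitore": ["cone", "cup", "cup"],
})

# One 0/1 "switch" column per attribute option
toy_dummies = pd.get_dummies(toy, columns=["gusto", "contenitore"], dtype=int)
print(toy_dummies)
```

Each row has exactly one 1 within the `gusto_*` columns and exactly one 1 within the `contenitore_*` columns.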

In [107]:
conjoint_data = pd.get_dummies(rank_data,columns = ['gusto', 'prezzo en Euros', 'contenitore', 'guarnizione'])

In [108]:
conjoint_data.head(5)

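| | Customer_Rank | gusto_fragola | gusto_mango | gusto_stracciatella | prezzo en Euros_1 | prezzo en Euros_2 | prezzo en Euros_3 | contenitore_cone | contenitore_cup | guarnizione_no | guarnizione_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 14 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 17 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 18 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |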


In [115]:
fullnames = {"gusto_stracciatella":"stracciatella", "gusto_fragola":"fragola", "gusto_mango":"mango", "prezzo en Euros_1":"Euros 1", "prezzo en Euros_2":"Euros 2", "prezzo en Euros_3":"Euros 3","contenitore_cone":"cono", "contenitore_cup":"tazza", "guarnizione_no":"nezzuna", "guarnizione_yes":"ha", "Customer_Rank": "Rank"}

In [116]:
fullnames

{'gusto_stracciatella': 'stracciatella',
 'gusto_fragola': 'fragola',
 'gusto_mango': 'mango',
 'prezzo en Euros_1': 'Euros 1',
 'prezzo en Euros_2': 'Euros 2',
 'prezzo en Euros_3': 'Euros 3',
 'contenitore_cone': 'cono',
 'contenitore_cup': 'tazza',
 'guarnizione_no': 'nezzuna',
 'guarnizione_yes': 'ha',
 'Customer_Rank': 'Rank'}

In [117]:
conjoint_data.rename(columns=fullnames, inplace=True)

In [118]:
conjoint_data.columns

Index(['Rank', 'fragola', 'mango', 'stracciatella', 'Euros 1', 'Euros 2',
       'Euros 3', 'cono', 'tazza', 'nezzuna', 'ha'],
      dtype='object')

In [140]:
type(conjoint_data)

pandas.core.frame.DataFrame

In [119]:
conjoint_data.head(5)

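| | Rank | fragola | mango | stracciatella | Euros 1 | Euros 2 | Euros 3 | cono | tazza | nezzuna | ha |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 14 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 17 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 18 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |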


### Estimate preferences using linear regression

The most common technique to estimate the parameters (coefficients) of a linear model is Ordinary Least Squares (OLS).

As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals.

Read more -> https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/
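
As a small illustration of "minimizing the sum of squared residuals", the sketch below fits a line to made-up data with `np.linalg.lstsq` and compares the fitted coefficients against a slightly shifted alternative. The data, the helper `ssr`, and the variable names are assumptions for illustration only; the notebook itself uses `statsmodels` for the fit.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 + 3 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix with a constant column for the intercept
X_toy = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes sum((y - X_toy @ beta) ** 2)
beta, *_ = np.linalg.lstsq(X_toy, y, rcond=None)

def ssr(b):
    """Sum of squared residuals for coefficient vector b."""
    return float(np.sum((y - X_toy @ b) ** 2))

print("fitted coefficients:", beta)
print("SSR at the fit:", ssr(beta))
print("SSR with the intercept shifted by 0.1:", ssr(beta + np.array([0.1, 0.0])))
```

Any shift away from the fitted coefficients gives a sum of squared residuals that is at least as large as the one at the fit.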


In [120]:
X = conjoint_data[[u'stracciatella', u'fragola', u'mango', u'Euros 1', u'Euros 2', u'Euros 3', u'cono', u'tazza', u'nezzuna', u'ha']]
X = sm.add_constant(X)
Y = conjoint_data.Rank
linearRegression = sm.OLS(Y, X).fit()
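
One point to keep in mind when reading the results: the model above includes a constant together with a dummy column for every option of every attribute, so the columns are linearly dependent (within each attribute the dummies add up to the constant). The summary reports `Df Model: 6` for ten dummy columns, which is consistent with this. A common alternative, sketched below on made-up data, is to drop one reference option per attribute with `drop_first=True`. The `toy` frame and the `design_matrix` helper are hypothetical and only illustrate the rank check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two categorical attributes
toy = pd.DataFrame({
    "gusto": ["mango", "fragola", "stracciatella", "mango", "fragola", "stracciatella"],
    "contenitore": ["cone", "cone", "cup", "cup", "cup", "cone"],
})

def design_matrix(drop_first):
    """Dummy-code the toy data and prepend a constant column."""
    d = pd.get_dummies(toy, columns=["gusto", "contenitore"],
                       drop_first=drop_first, dtype=float)
    d.insert(0, "const", 1.0)
    return d

full = design_matrix(drop_first=False)    # every option gets its own column
reduced = design_matrix(drop_first=True)  # one reference option dropped per attribute

# Compare the number of columns with the matrix rank
print("full:   ", full.shape[1], "columns, rank", np.linalg.matrix_rank(full.to_numpy()))
print("reduced:", reduced.shape[1], "columns, rank", np.linalg.matrix_rank(reduced.to_numpy()))
```

With a reference option dropped, each remaining coefficient reads as the difference from that reference option, and the coefficients are uniquely determined.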


#### Regression summary table 

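The R-squared value of 0.283 indicates that around 28% of the variation in product Rank can be explained by the gelato product features. Note that the adjusted R-squared is only 0.096 and the F-statistic has a p-value of 0.219, so with 30 observations the model as a whole is not statistically significant at the 0.05 level.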

In [122]:
linearRegression.summary().tables[0]

| | | | |
|---|---|---|---|
| Dep. Variable: | Rank | R-squared: | 0.283 |
| Model: | OLS | Adj. R-squared: | 0.096 |
| Method: | Least Squares | F-statistic: | 1.511 |
| Date: | Sun, 02 Jun 2019 | Prob (F-statistic): | 0.219 |
| Time: | 14:49:58 | Log-Likelihood: | -102.64 |
| No. Observations: | 30 | AIC: | 219.3 |
| Df Residuals: | 23 | BIC: | 229.1 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |
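
As a quick check on how the adjusted R-squared relates to the other entries in this table, the short sketch below recomputes it from the reported R-squared, number of observations, and model degrees of freedom, using the standard adjustment formula. The variable names are illustrative.

```python
# Values read from the regression summary table
r_squared = 0.283
n_obs = 30     # No. Observations
df_model = 6   # Df Model

# Adjusted R-squared penalizes R-squared for the number of estimated slopes
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)
print(round(adj_r_squared, 3))  # 0.096
```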


#### Regression Coefficient summary table

From our results below, we see that

- The intercept is 6.20.


- The slopes are 1.00, 2.86, 2.34, 6.86, 2.54, -3.20, 1.90, 4.30, 3.85, 2.35 for the gelato product features stracciatella, fragola, mango, Euros 1, Euros 2, Euros 3, cono, tazza, nezzuna, ha, respectively.


- Positive slopes (> 0) imply that those features have a positive effect on Rank, while negative slopes (< 0) imply a negative effect. Based on an initial reading of the summary table, fragola-flavored gelato served in a cup (tazza) with no toppings and priced at about one Euro is associated with a favorable ranking, while stracciatella-flavored cones with toppings priced at three Euros are associated with an unfavorable ranking. Perhaps customers, for this data set, don't see the value of a 3 Euro gelato and prefer fruit flavors?


- Features with a p-value below 0.05, such as Euros 1 (p = 0.010), tazza (served in a cup, p = 0.019), and no toppings (nezzuna), are statistically significant (using p < 0.05 as the rejection rule).


In [123]:
linearRegression.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.2007 | 0.681 | 9.101 | 0.000 | 4.791 | 7.610 |
| stracciatella | 1.0001 | 2.259 | 0.443 | 0.662 | -3.673 | 5.673 |
| fragola | 2.8564 | 2.277 | 1.254 | 0.222 | -1.855 | 7.568 |
| mango | 2.3442 | 2.248 | 1.043 | 0.308 | -2.306 | 6.994 |
| Euros 1 | 6.8587 | 2.459 | 2.789 | 0.010 | 1.772 | 11.946 |
| Euros 2 | 2.5432 | 2.357 | 1.079 | 0.292 | -2.332 | 7.418 |
| Euros 3 | -3.2012 | 2.081 | -1.538 | 0.138 | -7.506 | 1.103 |
| cono | 1.8989 | 1.999 | 0.950 | 0.352 | -2.236 | 6.034 |
| tazza | 4.3017 | 1.699 | 2.531 | 0.019 | 0.786 | 7.817 |


In [124]:
conjoint_attribs = [u'stracciatella', u'fragola', u'mango', u'Euros 1', u'Euros 2', u'Euros 3', u'cono', u'tazza', u'nezzuna', u'ha']

### Estimate contribution of gelato product attributes, and their options, to product overall Utility

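- Importance of an attribute = Max(coeff) - Min(coeff) across that attribute's options, i.e. how much consumer preference varies between its most and least preferred option.

- Relative importance of an attribute = Importance of the attribute / Sum(Importance of all attributes), or the contribution of an attribute to the total utility of the product.

- Utility = the total value of a product, obtained by adding up the coefficients of its chosen options. A high total suggests a "thumbs up" judgement, a low total a "thumbs down".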
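
The sketch below applies the first two definitions attribute by attribute, using the coefficients reported in the regression results above (the `nezzuna` and `ha` values are the rounded figures quoted in the text). The grouping of options into attributes and the dictionary names are assumptions for illustration. Note that this is a different computation from the cell that follows, which works on each dummy column separately.

```python
# Coefficients as reported in the regression results in this notebook,
# grouped by attribute (grouping and key names are illustrative assumptions)
part_worths = {
    "gusto": {"stracciatella": 1.0001, "fragola": 2.8564, "mango": 2.3442},
    "prezzo": {"Euros 1": 6.8587, "Euros 2": 2.5432, "Euros 3": -3.2012},
    "contenitore": {"cono": 1.8989, "tazza": 4.3017},
    "guarnizione": {"nezzuna": 3.85, "ha": 2.35},
}

# Importance of an attribute = max(coeff) - min(coeff) over its options
attr_importance = {
    attr: max(opts.values()) - min(opts.values())
    for attr, opts in part_worths.items()
}

# Relative importance = importance / sum of all importances, as a percentage
total = sum(attr_importance.values())
attr_relative = {attr: round(100 * v / total, 1) for attr, v in attr_importance.items()}
print(attr_relative)
```

Read this way, the attribute with the widest spread between its best and worst option carries the largest share of the total.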

In [138]:
option_name = []
importance_range = []
importance = []

end = 1  # position 0 in params is the intercept (const), so start at 1
for item in conjoint_attribs:
    # each dummy column only holds the values 0 and 1, so options == 2
    options = len(set(conjoint_data[item]))
    option_name.append(list(set(conjoint_data[item])))
    start = end
    end = start + options - 1
    # regression coefficient of this dummy column
    new_attrib_importance = list(linearRegression.params.iloc[start:end].round(4))
    # append its negative so that the entries sum to zero
    new_attrib_importance.append((-1) * sum(new_attrib_importance))
    # range = max - min, which here equals 2 * |coefficient|
    importance_range.append(max(new_attrib_importance) - min(new_attrib_importance))
    importance.append(new_attrib_importance)

# express each range as a percentage of the total range
total_range = sum(importance_range)
relative_importance = [round(100 * (item / total_range), 4) for item in importance_range]

In [163]:
# label the values in the same order in which the loop above produced them
relative_importance_df = pd.DataFrame([relative_importance], columns=conjoint_attribs)
relative_importance_df

| | stracciatella | fragola | mango | Euros 1 | Euros 2 | Euros 3 | cono | tazza | nezzuna | ha |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.2049 | 9.1536 | 7.5122 | 21.9794 | 8.1499 | 10.2586 | 6.0852 | 13.7852 | 12.3425 | 7.5283 |


### Conclusion

For each gelato product attribute, pick the option with the highest relative importance value. Because these values are based on coefficient magnitudes, also check the sign of the coefficient: Euros 3 has a sizeable share but a negative coefficient. Based on this extended analysis, fragola-flavored gelato served in a cup (tazza) with no toppings and priced at about one Euro yields the highest overall product utility.
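
One way to cross-check a conclusion like this is to compute the total utility of every possible product directly, as the intercept plus the coefficients of its chosen options. The sketch below does this for all 36 combinations, using the coefficients reported in the regression results above (with the rounded `nezzuna` and `ha` values quoted in the text). The dictionary layout and variable names are assumptions for illustration.

```python
from itertools import product

# Coefficients as reported in the regression results in this notebook
intercept = 6.2007
part_worths = {
    "gusto": {"stracciatella": 1.0001, "fragola": 2.8564, "mango": 2.3442},
    "prezzo": {"Euros 1": 6.8587, "Euros 2": 2.5432, "Euros 3": -3.2012},
    "contenitore": {"cono": 1.8989, "tazza": 4.3017},
    "guarnizione": {"nezzuna": 3.85, "ha": 2.35},
}

# Total utility of a profile = intercept + sum of the coefficients of its options
profiles = []
for combo in product(*(opts.items() for opts in part_worths.values())):
    names = tuple(name for name, _ in combo)
    utility = intercept + sum(coef for _, coef in combo)
    profiles.append((round(utility, 4), names))

# Highest total utility first
profiles.sort(reverse=True)
best_utility, best_profile = profiles[0]
print(best_profile, best_utility)
```

Sorting the full list also shows how close the runner-up profiles are, which is useful given how wide the confidence intervals on the individual coefficients are.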

### Discuss

- Analysis
    - Do you agree with the analysis?
    - What are the limitations or assumptions of this type of analysis?
    - If attributes are collinear with one another, how would this impact the regression estimates, or the relative importance calculations?
    
### Read More

https://ariepratama.github.io/How-to-do-conjoint-analysis-in-python/

https://github.com/stayingfoolish/Conjoint-Analysis/blob/master/Conjoint%20analysis.ipynb

https://www.youtube.com/watch?v=EMbiGPGlBEM

https://github.com/Herka/Traditional-Conjoint-Analysis-with-Python/blob/master/Traditional%20Conjoint%20Analyse.ipynb

https://www.linkedin.com/pulse/conjoint-analysis-simple-python-implementation-prajwal-sreenivas/

https://lectures.quantecon.org/py/ols.html
