<a href="https://colab.research.google.com/github/Jikhan-Jeong/Discrete-Choice-Model/blob/master/Feb_22%2C_2020_pyblp_tutorial_logit_and_nest_ipynb%EC%9D%98_%EC%82%AC%EB%B3%B8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **Feb 22, 2020 Pyblp Tutorial**
---
* Name: Jikhan Jeong
* Random Coefficients Logit Tutorial with the Automobile Data 

* This notebook checks pyblp results before an actual replication of BLP (1995). In sum, it is for learning purposes only, and all original content belongs to Prof. Chris Conlon. I modified some parts to make them easier to understand while learning. Personally, many thanks to Prof. Chris Conlon for the great work.
---
* **Part1. pyBLP Tutorial: Logit and Nested Logit Demand Estimation**
* Ref: https://notes.quantecon.org/submission/5cd236014174bb001a39a904
* **Part2. Random Coefficients Demand Estimation (Nevo 2000 data)**
* Ref: https://notes.quantecon.org/submission/5cd2364c4174bb001a39a905
* **Part3. Random Coefficients Estimation of Simultaneous Supply and Demand (BLP 1995 Data)**
* Ref: https://notes.quantecon.org/submission/5cd236974174bb001a39a906
* **Part4. Post Estimation Counterfactuals**
* Ref: https://notes.quantecon.org/submission/5cd236ee4174bb001a39a907
---

# <font color = blue> **Part1. pyBLP Tutorial: Logit and Nested Logit Demand Estimation**  </font>
---

* Compare two simple models, the plain (IIA) logit model and the nested logit (GEV) model 
* Dataset: Fake cereal dataset of Nevo (2000).
* Ref: https://notes.quantecon.org/submission/5cd236014174bb001a39a904
---


In [0]:
! pip install pyblp

Collecting pyblp
  Downloading https://files.pythonhosted.org/packages/cb/68/6f2b2f8ac1c77f27e22d83093fb451d24e9910ddf540b4c0a4749a545c8f/pyblp-0.8.1-py3-none-any.whl (1.6MB)

In [0]:
import pyblp
import numpy as np
import pandas as pd

pyblp.options.digits = 2
pyblp.options.verbose = False
pyblp.__version__

'0.8.1'

---
# <font color = blue> Part 1 Logit Estimation Process </font>
* Ref: https://notes.quantecon.org/submission/5cd236014174bb001a39a904 (all from here)
--- 
1. Load the product data which at a minimum consists of market_ids, shares, prices, and at least a single column of demand instruments, demand_instruments0.

2. Define a Formulation for the X1 (linear) demand model.

- This and all other formulas are similar to R and **patsy formulas**.
- It includes a constant by default. To exclude the constant, specify either a 0 or a -1.
- To efficiently include **fixed effects**, use the **absorb** option and specify which categorical variables you would like to absorb.
- Some model reduction may happen automatically. The constant will be excluded if you include **fixed effects**, and some precautions are taken against collinearity. However, you will have to make sure that differently-named variables are not collinear.
3. Combine the **Formulation** and **product data** to construct a **Problem**.
4. Use **Problem.solve** to estimate parameters.
----
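To make the **absorb** option a bit more concrete, here is a minimal sketch. It assumes that absorbing a one-dimensional fixed effect is equivalent to demeaning every variable within each category (the standard within transformation). The toy data and the names `group`, `x`, `y`, and `demean` are hypothetical and are not part of pyblp.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy panel: 4 groups (think of products), 50 observations each.
group = np.repeat(np.arange(4), 50)
x = rng.normal(size=group.size) + group            # regressor correlated with the group
effects = np.array([1.0, -2.0, 0.5, 3.0])[group]   # one fixed effect per group
y = 2.0 * x + effects + 0.1 * rng.normal(size=group.size)

def demean(v, g):
    """Subtract the within-group mean of v for each group label in g."""
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

# Slope from the within (demeaned) regression.
x_w, y_w = demean(x, group), demean(y, group)
slope_within = (x_w @ y_w) / (x_w @ x_w)

# Slope from the regression with one explicit dummy per group.
dummies = (group[:, None] == np.arange(4)).astype(float)
coef, *_ = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)
slope_dummies = coef[0]

print(slope_within, slope_dummies)
```

Both slopes agree up to floating-point error, which is why absorbing the product fixed effects saves estimating one dummy coefficient per product.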

# Loading the Data as a pandas DataFrame: Nevo Dataset, not BLP (1995)
* market_ids, 
* product_ids, 
* firm_ids,
* shares, 
* prices, 
* a number of other IDs 
* product characteristics 
* pre-computed excluded instruments demand_instruments0 through demand_instruments19 (20 IVs)
* product_ids will be incorporated as fixed effects.

In [0]:
product_data = pd.read_csv(pyblp.data.NEVO_PRODUCTS_LOCATION)
product_data.head(10)

Unnamed: 0,market_ids,city_ids,quarter,product_ids,firm_ids,brand_ids,shares,prices,sugar,mushy,demand_instruments0,demand_instruments1,demand_instruments2,demand_instruments3,demand_instruments4,demand_instruments5,demand_instruments6,demand_instruments7,demand_instruments8,demand_instruments9,demand_instruments10,demand_instruments11,demand_instruments12,demand_instruments13,demand_instruments14,demand_instruments15,demand_instruments16,demand_instruments17,demand_instruments18,demand_instruments19
0,C01Q1,1,1,F1B04,1,4,0.012417,0.072088,2,1,-0.215973,0.040573,-3.247948,-0.523938,-0.23246,0.006833,3.13974,-0.574786,0.20622,0.177466,2.116358,-0.154708,-0.005796,0.014538,0.126244,0.067345,0.068423,0.0348,0.126346,0.035484
1,C01Q1,1,1,F1B06,1,6,0.007809,0.114178,18,1,-0.245239,0.054742,-19.832461,-0.18052,0.014689,0.000799,0.287654,0.03294,0.105121,-0.287562,-7.374091,-0.576412,0.012991,0.076143,0.029736,0.087867,0.110501,0.087784,0.049872,0.072579
2,C01Q1,1,1,F1B07,1,7,0.012995,0.132391,4,1,-0.176459,0.046596,-2.878531,-0.284219,-0.215537,-0.031869,2.886274,-0.749765,-0.478956,0.214739,2.187872,-0.207346,0.003509,0.091781,0.163773,0.111881,0.108226,0.086439,0.122347,0.101842
3,C01Q1,1,1,F1B09,1,9,0.00577,0.130344,3,0,-0.121401,0.04876,-2.059918,-0.328412,-0.22207,-0.031474,4.45311,0.255675,-0.472967,0.356098,2.704576,0.040748,-0.003724,0.094732,0.135274,0.08809,0.101767,0.101777,0.110741,0.104332
4,C01Q1,1,1,F1B11,1,11,0.017934,0.154823,12,0,-0.132611,0.039628,-6.137598,-0.138625,-0.189365,-0.043747,-3.554651,0.138821,-0.688678,0.260273,1.261242,0.034836,-0.000568,0.102451,0.13064,0.084818,0.101075,0.125169,0.133464,0.121111
5,C01Q1,1,1,F1B13,1,13,0.026602,0.137049,14,0,-0.1535,0.042988,-8.417332,0.007829,-0.138501,-0.021058,-2.75948,0.050201,-0.273444,0.127306,0.337554,0.02351,0.000264,0.08628,0.072336,0.022251,0.105644,0.116037,0.099651,0.105727
6,C01Q1,1,1,F1B17,1,17,0.025015,0.144209,3,1,-0.164352,0.044922,-2.389348,-0.15697,-0.215145,-0.045543,4.3441,-0.85874,-0.733705,0.246524,2.617504,-0.195578,0.004489,0.09415,0.138474,0.110273,0.101192,0.106082,0.143585,0.120973
7,C01Q1,1,1,F1B30,1,30,0.005058,0.128191,4,0,-0.118166,0.049645,-2.314019,-0.317537,-0.223526,-0.029128,3.275057,0.235487,-0.429212,0.367439,2.591142,0.044275,-0.004563,0.108831,0.135491,0.128176,0.059036,0.08544,0.044623,0.097111
8,C01Q1,1,1,F1B45,1,45,0.005332,0.149611,14,0,-0.144381,0.042091,-8.164721,-0.022463,-0.152652,-0.029185,-3.326924,0.072494,-0.416244,0.15872,0.489811,0.026016,-6.6e-05,0.114297,0.116368,0.141625,0.095104,0.122102,0.131221,0.119009
9,C01Q1,1,1,F2B05,2,5,0.038068,0.108514,1,0,-0.116467,0.060361,-2.50996,-0.325799,-0.228135,-0.00708,6.473079,0.255249,-0.01839,0.37926,2.727929,0.035499,-0.007844,0.083079,0.020242,-0.020562,0.064436,0.082678,-0.007339,0.072053


In [0]:
print(type(product_data)) 
print(product_data.shape) # N = 2256, K = 30

<class 'pandas.core.frame.DataFrame'>
(2256, 30)


In [0]:
print(product_data.columns.shape)
product_data.columns 

(30,)


Index(['market_ids', 'city_ids', 'quarter', 'product_ids', 'firm_ids',
       'brand_ids', 'shares', 'prices', 'sugar', 'mushy',
       'demand_instruments0', 'demand_instruments1', 'demand_instruments2',
       'demand_instruments3', 'demand_instruments4', 'demand_instruments5',
       'demand_instruments6', 'demand_instruments7', 'demand_instruments8',
       'demand_instruments9', 'demand_instruments10', 'demand_instruments11',
       'demand_instruments12', 'demand_instruments13', 'demand_instruments14',
       'demand_instruments15', 'demand_instruments16', 'demand_instruments17',
       'demand_instruments18', 'demand_instruments19'],
      dtype='object')

---
# Logit Model: Setting Up the Problem
#### A. **Formulation** and B. **product_data** to construct a Problem.
#### A. **Formulation** : 1 price variable with fixed effect terms
* **$X_1$** is the linear component of utility for demand and depends only on prices (after the fixed effects are removed).
#### B. **product_data**
* **T** is the number of markets = 94 markets.
* **N** is the length of the dataset (the number of products across all markets) = 2256 products.
* **F** is the number of firms, which we won't use in this example = 5 firms.
* **$K_1$** is the dimension of the linear demand parameters = 1 price variable.
* **MD** is the dimension of the instruments (excluded instruments plus exogenous regressors) = 20. Here all 20 are the excluded instruments demand_instruments0 through demand_instruments19, since prices is endogenous and the product fixed effects are absorbed.
* **ED** is the number of fixed effect dimensions (one-dimensional fixed effects, two-dimensional fixed effects, etc.) = 1 fixed effect (product level).
---
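As a rough cross-check of the dimensions listed above, the counts can be read off a data frame directly. The sketch below uses a hypothetical miniature data set (`toy`) rather than the cereal data, and it counts only excluded instruments for MD. That matches this example only because no exogenous regressor enters $X_1$.

```python
import pandas as pd

# Hypothetical miniature product data: 2 markets, 3 products each, 2 firms.
toy = pd.DataFrame({
    'market_ids': ['M1', 'M1', 'M1', 'M2', 'M2', 'M2'],
    'product_ids': ['A', 'B', 'C', 'A', 'B', 'C'],
    'firm_ids': [1, 1, 2, 1, 1, 2],
    'shares': [0.2, 0.1, 0.3, 0.25, 0.05, 0.2],
    'prices': [1.0, 1.5, 0.8, 1.1, 1.4, 0.9],
    'demand_instruments0': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'demand_instruments1': [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

T = toy['market_ids'].nunique()   # number of markets
N = len(toy)                      # number of product-market rows
F = toy['firm_ids'].nunique()     # number of firms
# Number of excluded instrument columns (assumes the pyblp naming convention).
MD = sum(col.startswith('demand_instruments') for col in toy.columns)

print(T, N, F, MD)
```

Applied to `product_data`, the same expressions should give the 94, 2256, 5, and 20 reported in the problem summary.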

In [0]:
logit_formulation = pyblp.Formulation('prices', absorb='C(product_ids)')
logit_formulation

prices + Absorb[C(product_ids)]

In [0]:
problem = pyblp.Problem(logit_formulation, product_data)
problem

Dimensions:
 T    N     F    K1    MD    ED 
---  ----  ---  ----  ----  ----
94   2256   5    1     20    1  

Formulations:
     Column Indices:          0   
--------------------------  ------
X1: Linear Characteristics  prices

In [0]:
logit_results = problem.solve()
logit_results

Problem Results Summary:
GMM   Objective  Weighting Matrix
Step    Value    Condition Number
----  ---------  ----------------
 2    +8.3E-02       +5.7E+07    

Cumulative Statistics:
Computation   Objective 
   Time      Evaluations
-----------  -----------
 00:00:00         2     

Beta Estimates (Robust SEs in Parentheses):
  prices  
----------
 -3.0E+01 
(+1.0E+00)
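
The plain logit problem is linear in its parameters: the estimating equation is $\log s_{jt} - \log s_{0t} = \alpha p_{jt} + \xi_{jt}$, estimated with instruments for price. The sketch below illustrates that idea with plain two-stage least squares on synthetic data. It is not what pyblp does internally (the output above reports a two-step GMM estimate with absorbed fixed effects), and every name and number in it (`alpha_true`, `Z`, `xi`, `delta`) is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
alpha_true = -3.0  # hypothetical price coefficient

# Synthetic data (not the cereal data): two excluded instruments shift prices,
# and prices are also correlated with the unobserved quality xi.
Z = rng.normal(size=(n, 2))
xi = rng.normal(size=n)
prices = Z @ np.array([0.5, -0.3]) + 0.5 * xi + 0.1 * rng.normal(size=n)

# Stand-in for the logit left-hand side, log s_jt - log s_0t.
delta = alpha_true * prices + xi

X = prices[:, None]

# OLS ignores the correlation between prices and xi, so it is biased.
alpha_ols = np.linalg.lstsq(X, delta, rcond=None)[0][0]

# 2SLS written as one-step GMM with weighting matrix (Z'Z)^{-1}.
W = np.linalg.inv(Z.T @ Z)
XZ = X.T @ Z
alpha_iv = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ delta))[0]

print(alpha_ols, alpha_iv)
```

In this synthetic setup the OLS estimate is pulled toward zero because price is positively correlated with unobserved quality, while the IV estimate lands close to `alpha_true`.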

---
#  <font color = blue> Part 1-2: Nested Logit </font>
---
## Theory of Nested Logit

We can extend the logit model to allow for correlation within a group $h$ so that

$$U_{jti} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt} + \bar{\epsilon}_{h(j)ti} + (1 - \rho) \bar{\epsilon}_{jti}.$$

Now, we require that $\epsilon_{jti} = \bar{\epsilon}_{h(j)ti} + (1 - \rho) \bar{\epsilon}_{jti}$ is distributed IID with the Type I Extreme Value (Gumbel) distribution. As $\rho \rightarrow 1$, all consumers stay within their group. As $\rho \rightarrow 0$, this collapses to the IIA logit. Note that if we wanted, we could allow $\rho$ to differ between groups with the notation $\rho_{h(j)}$.

This gives us aggregate marketshares as the product of two logits, the within group logit and the across group logit:

$$s_{jt} = \frac{\exp[V_{jt} / (1 - \rho)]}{\exp[V_{h(j)t} / (1 - \rho)]}\cdot\frac{\exp V_{h(j)t}}{1 + \sum_h \exp V_{ht}},$$

where $V_{jt} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt}$.
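
The share formula can be checked numerically. The sketch below assumes the usual inclusive value $V_{h(j)t} = (1 - \rho) \log \sum_{k \in h(j)} \exp[V_{kt} / (1 - \rho)]$, which the text above does not spell out. The function name `nested_logit_shares` and the example utilities are hypothetical.

```python
import numpy as np

def nested_logit_shares(V, nests, rho):
    """Shares in one market given mean utilities V, nest labels, and rho in [0, 1)."""
    V = np.asarray(V, dtype=float)
    nests = np.asarray(nests)
    # Inclusive value per nest (assumed definition, see the lead-in).
    V_h = {h: (1 - rho) * np.log(np.exp(V[nests == h] / (1 - rho)).sum())
           for h in np.unique(nests)}
    denom = 1.0 + sum(np.exp(v) for v in V_h.values())
    V_hj = np.array([V_h[h] for h in nests])
    within = np.exp(V / (1 - rho)) / np.exp(V_hj / (1 - rho))  # within-group logit
    across = np.exp(V_hj) / denom                              # across-group logit
    return within * across

V = np.array([0.5, -0.2, 0.1, 0.3])   # hypothetical mean utilities
nests = np.array([0, 0, 1, 1])        # two nests with two products each

plain_logit = np.exp(V) / (1 + np.exp(V).sum())
shares_rho0 = nested_logit_shares(V, nests, 0.0)
shares_rho7 = nested_logit_shares(V, nests, 0.7)

print(np.allclose(shares_rho0, plain_logit), shares_rho7.sum())
```

With $\rho = 0$ the function reproduces the plain logit shares, in line with the claim that the model collapses to the IIA logit.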

After some work we again obtain the linear estimating equation:

$$\log s_{jt} - \log s_{0t} = \alpha p_{jt}+ x_{jt} \beta^x +\rho \log s_{j|h(j)t} + \xi_{jt},$$

where $s_{j|h(j)t} = s_{jt} / s_{h(j)t}$ and $s_{h(j)t}$ is the share of group $h$ in market $t$. See [Berry (1994)](https://pyblp.readthedocs.io/en/stable/references.html#berry-1994) or [Cardell (1997)](https://pyblp.readthedocs.io/en/stable/references.html#cardell-1997) for more information.

Again, the left hand side is data, though the $\log s_{j|h(j)t}$ is clearly endogenous which means we must instrument for it. Rather than include $\log s_{j|h(j)t}$ along with the linear components of utility, $X_1$, whenever `nesting_ids` are included in `product_data`, $\rho$ is treated as a nonlinear $X_2$ parameter. This means that the linear component is given instead by

$$\log s_{jt} - \log s_{0t} - \rho \log s_{j|h(j)t} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt}.$$

This is done for two reasons:

1. It forces the user to treat $\rho$ as an endogenous parameter.
2. It extends much more easily to the RCNL model of [Brenkers and Verboven (2006)](https://pyblp.readthedocs.io/en/stable/references.html#brenkers-and-verboven-2006).

A common choice for an additional instrument is the number of products per nest.
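
For a given value of $\rho$, the transformed left-hand side $\log s_{jt} - \log s_{0t} - \rho \log s_{j|h(j)t}$ can be computed from shares alone. Here is a minimal sketch on a hypothetical toy data frame; the value of `rho` is made up for illustration, whereas pyblp searches over it during estimation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two markets, two nests.
toy = pd.DataFrame({
    'market_ids': ['M1'] * 3 + ['M2'] * 3,
    'nesting_ids': [0, 0, 1, 0, 1, 1],
    'shares': [0.2, 0.1, 0.3, 0.25, 0.05, 0.2],
})
rho = 0.7  # hypothetical value, not an estimate

# Outside share s_0t: one minus the sum of inside shares in each market.
outside = 1 - toy.groupby('market_ids')['shares'].transform('sum')

# Group share s_h(j)t and within-group share s_j|h(j)t.
group_share = toy.groupby(['market_ids', 'nesting_ids'])['shares'].transform('sum')
toy['within_share'] = toy['shares'] / group_share

# Transformed left-hand side of the estimating equation.
toy['lhs'] = (np.log(toy['shares']) - np.log(outside)
              - rho * np.log(toy['within_share']))

print(toy)
```

A product that is alone in its nest has a within-group share of one, so the $\rho$ term drops out for that row.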

---

In [0]:
def solve_nl(df):
    groups = df.groupby(['market_ids', 'nesting_ids'])
    df['demand_instruments20'] = groups['shares'].transform(np.size)  # number of products in each market-nest pair, added as instrument 20
    nl_formulation = pyblp.Formulation('0 + prices')    
    problem = pyblp.Problem(nl_formulation, df)
    return problem.solve(rho=0.7)
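
To see what the extra instrument in `solve_nl` contains, here is a small sketch on a hypothetical toy data frame. It uses the string alias `'size'` in place of `np.size`, which should give the same per-group counts.

```python
import pandas as pd

# Hypothetical toy data: market M1 has nests 0 and 1, market M2 has only nest 0.
toy = pd.DataFrame({
    'market_ids': ['M1', 'M1', 'M1', 'M2', 'M2'],
    'nesting_ids': [0, 0, 1, 0, 0],
    'shares': [0.2, 0.1, 0.3, 0.25, 0.05],
})

# Number of products in each market-nest pair, broadcast back to every row.
counts = toy.groupby(['market_ids', 'nesting_ids'])['shares'].transform('size')

print(counts.tolist())
```

Every row receives the size of its own market-nest group, so the instrument varies across nests and markets but not within a nest.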

In [0]:
df1 = product_data.copy()
df1['nesting_ids'] = 1
nl_results1 = solve_nl(df1)
nl_results1

Problem Results Summary:
GMM   Objective    Projected    Reduced   Weighting Matrix  Covariance Matrix
Step    Value    Gradient Norm  Hessian   Condition Number  Condition Number 
----  ---------  -------------  --------  ----------------  -----------------
 2    +9.0E-02     +3.7E-13     +4.8E+00      +2.0E+09          +3.0E+04     

Cumulative Statistics:
Computation  Optimization   Objective 
   Time       Iterations   Evaluations
-----------  ------------  -----------
 00:00:03         3             8     

Rho Estimates (Robust SEs in Parentheses):
All Groups
----------
 +9.8E-01 
(+1.4E-02)

Beta Estimates (Robust SEs in Parentheses):
  prices  
----------
 -1.2E+00 
(+4.0E-01)

In [0]:
# H = 1
nl_results1.problem

Dimensions:
 T    N     F    K1    MD    H 
---  ----  ---  ----  ----  ---
94   2256   5    1     21    1 

Formulations:
     Column Indices:          0   
--------------------------  ------
X1: Linear Characteristics  prices

In [0]:
# we'll solve the problem when there are two nests for mushy and non-mushy.
df2 = product_data.copy()
df2['nesting_ids'] = df2['mushy']
nl_results2 = solve_nl(df2)
nl_results2

Problem Results Summary:
GMM   Objective    Projected    Reduced   Weighting Matrix  Covariance Matrix
Step    Value    Gradient Norm  Hessian   Condition Number  Condition Number 
----  ---------  -------------  --------  ----------------  -----------------
 2    +3.1E-01     +1.2E-11     +2.5E+00      +5.1E+08          +2.0E+04     

Cumulative Statistics:
Computation  Optimization   Objective 
   Time       Iterations   Evaluations
-----------  ------------  -----------
 00:00:03         3             8     

Rho Estimates (Robust SEs in Parentheses):
All Groups
----------
 +8.9E-01 
(+1.9E-02)

Beta Estimates (Robust SEs in Parentheses):
  prices  
----------
 -7.8E+00 
(+4.8E-01)

In [0]:
# H is 2
nl_results2.problem

Dimensions:
 T    N     F    K1    MD    H 
---  ----  ---  ----  ----  ---
94   2256   5    1     21    2 

Formulations:
     Column Indices:          0   
--------------------------  ------
X1: Linear Characteristics  prices

In [0]:
nl_results1.beta[0] / (1 - nl_results1.rho)

array([[-67.39338888]])

In [0]:
nl_results2.beta[0] / (1 - nl_results2.rho)

array([[-72.27074638]])
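
In the share formula, $V_{jt} / (1 - \rho)$ enters the within-group logit, so the ratio $\alpha / (1 - \rho)$ computed above is the price coefficient that governs substitution within a nest. As a rough arithmetic check, the sketch below redoes the division with the two-digit estimates displayed in the result tables. These are rounded inputs, so the numbers are not expected to match the exact ratios above.

```python
# Two-digit estimates as displayed in the result tables above (rounded).
beta1, rho1 = -1.2, 0.98   # single nest
beta2, rho2 = -7.8, 0.89   # mushy / non-mushy nests

ratio1 = beta1 / (1 - rho1)
ratio2 = beta2 / (1 - rho2)

print(round(ratio1, 1), round(ratio2, 1))
```

The rounded inputs give roughly -60 and -70.9, against -67.4 and -72.3 from the unrounded estimates. Because $1 - \rho$ is small, the ratio is sensitive to rounding in $\rho$, so it is safer to compute it from `beta` and `rho` directly as in the cells above.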