We want to narrow down the set of products relevant to the user by asking them multiple-choice questions. Too many questions would drive them away, so we want to keep the list short while learning as many user preferences as we can.

Each user preference could correspond to 
- **a single product attribute** 

    affinity for Macbooks => Brand == "Apple"

- **multiple attributes** 

    `gaming` laptop => (GPU mem >= 4GB) AND (Display >= "Full HD")

For simplicity, we elicit user preferences that correspond to singular attributes only.
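
As a rough sketch of the two cases above, a preference can be treated as a predicate over a product record. The helper names (`attr_equals`, `attr_at_least`, `resolution_at_least`, `all_of`) and the sample records are made up for illustration; only the attribute keys follow the laptop records used later in this notebook.

```python
def attr_equals(key, value):
    """Single-attribute preference: the attribute must equal a value."""
    return lambda item: item.get(key) == value


def attr_at_least(key, threshold):
    """Single-attribute preference: numeric attribute must reach a threshold."""
    def check(item):
        value = item.get(key)
        return value is not None and value >= threshold
    return check


def resolution_at_least(width, height):
    """Display preference: both pixel dimensions must reach the minimum."""
    def check(item):
        res = item.get("resolution")
        return res is not None and res[0] >= width and res[1] >= height
    return check


def all_of(*predicates):
    """Multi-attribute preference: every sub-preference must hold."""
    return lambda item: all(p(item) for p in predicates)


# affinity for Macbooks => Brand == "Apple"
likes_macbooks = attr_equals("make", "Apple")
# gaming laptop => (GPU mem >= 4GB) AND (Display >= Full HD)
gaming = all_of(attr_at_least("gpu_memory", 4), resolution_at_least(1920, 1080))

# hypothetical records, not taken from the dataset
sample = [
    {"make": "Apple", "gpu_memory": None, "resolution": [2560, 1600]},
    {"make": "MSI", "gpu_memory": 6, "resolution": [1920, 1080]},
    {"make": "HP", "gpu_memory": 2, "resolution": [1366, 768]},
]
macbook_matches = [likes_macbooks(item) for item in sample]
gaming_matches = [gaming(item) for item in sample]
```

Restricting ourselves to single attributes means each answer maps to one `attr_equals` or `attr_at_least` style filter, with no need for `all_of`.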

## Shortlist to a smaller set of products

In [1]:
import json
from collections import Counter

import altair as alt
import numpy as np

In [2]:
with open("data/list_laptop_attrs.jsonl", "r") as f:
    all_laptops = [json.loads(r) for r in f]

len(all_laptops)

1151

In [3]:
all_laptops[0]

{'id': 0,
 'warranty': 1,
 'battery': 15.0,
 'ram': 8,
 'sdd': 256,
 'hdd': None,
 'gpu_make': None,
 'gpu_memory': None,
 'processor_make': 'Apple chip',
 'display_size': 13.3,
 'resolution': [2560, 1600],
 'os': 'MacOS',
 'weight': 1.29,
 'make': 'Apple',
 'colors': ['Gold'],
 'price': 90900,
 'rating_value': 4.7,
 'rating_count': 11912,
 'title': 'Apple MacBook Air M1 MGND3HN/A Ultrabook (Apple M1/8 GB/256 GB SSD/macOS Big Sur)',
 'urls': {'specs': 'https://www.91mobiles.com//apple-m1-mgnd3hn-a-apple-m1-8-gb-256-gb-macos-big-sur-laptop-price-in-india-141587#specifications',
  'img_small': 'https://www.91-img.com/pictures/laptops/apple/apple-m1-mgnd3hn-a-141587-v1-small-1.jpg?tr=q-80',
  'img_large': 'https://www.91-img.com/pictures/laptops/apple/apple-m1-mgnd3hn-a-141587-v1-large-1.jpg?tr=q-80'},
 'specs': {'General Information': {'Brand': 'Apple',
   'Model': 'M1 MGND3HN/A',
   'Dimensions(WxHxD)': '304.1 x 212.4 x 10.9 \xa0mm',
   'Weight': '1.29 Kg',
   'Colors': 'Gold',
   'Oper

Filter out products that are missing a non-optional attribute (the GPU, for example, is optional), along with those that list neither an SSD nor an HDD.

In [5]:
must_attrs = (
    "warranty",
    "battery",
    "ram",
    "processor_make",
    "display_size",
    "resolution",
    "os",
    "weight",
    "make",
    "price",
)

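catalog = []
for item in all_laptops:
    # every mandatory attribute must be present
    has_must_attrs = all(item[key] is not None for key in must_attrs)
    # at least one storage device (SSD or HDD) must be listed
    has_storage = item["sdd"] is not None or item["hdd"] is not None
    if has_must_attrs and has_storage:
        catalog.append(item)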

len(catalog)

1034

That is still a lot. Maybe we could filter on the number of reviews and on whether the ratings are any good.

In [6]:
import numpy as np

counts = [item["rating_count"] for item in catalog]
np.percentile(counts, np.linspace(0, 90, 10))

array([1.0000e+00, 6.3000e+00, 2.1000e+01, 6.1900e+01, 8.7000e+01,
       1.3500e+02, 2.4660e+02, 4.4860e+02, 8.2920e+02, 2.0223e+03])

A lot of popular laptops, like Apple's, have few ratings though. We could instead filter on the rating value.

In [7]:
ratings = [item["rating_value"] for item in catalog]
np.percentile(ratings, np.linspace(0, 90, 10))

array([1. , 3.6, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.4, 4.5])

Neat, let's filter down to items with a rating above 4.

In [8]:
catalog_final = [item for item in catalog if item["rating_value"] > 4]
len(catalog_final)

693

In [9]:
Counter([item["make"] for item in catalog_final])

Counter({'Apple': 9,
         'Acer': 56,
         'HP': 196,
         'Asus': 159,
         'MSI': 38,
         'Infinix': 9,
         'Xiaomi': 8,
         'Lenovo': 128,
         'Dell': 70,
         'Nokia': 2,
         'Microsoft': 4,
         'Samsung': 3,
         'Honor': 1,
         'Avita': 6,
         'VAIO': 1,
         'LG': 2,
         'Wipro': 1})

## Picking questions

In the absence of sophisticated modeling (say, reinforcement learning to model user rewards through the full interaction), we assume that the value of each question is independent of the current user-system state. Two metrics seem obvious for picking out the useful questions.

1. **popular criteria for purchases**

    We don't want to ask for preferences that are too technical. The popular features people use to make decisions could be obtained through mining online reviews, surveys or expert judgement. 
    
    This needs external data though, so I am just gonna handwave it. 


2. **preferences that help us narrow down the list the quickest**

    We could pick attributes which, on average, shrink the consideration set the most. For instance, a user who wants only Apple laptops only needs to explore the Macbooks.

    Another metric we could use is the information gain measured as the change in `Shannon entropy`. Assuming all items were equally likely to be relevant at the outset, we could compute the average reduction in entropy if the user filters down to one of the attribute values. This is similar to how decision trees split, and more theoretically "principled". 

The intuition behind using the average decrease in entropy or count is to create choices that split the catalog into equal-sized subsets. We also need to keep in mind how familiar users are with each option, though.
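
Both metrics can be sketched in a few lines. This is a minimal version under two assumptions that go beyond the text: every item starts out equally likely to be relevant, and the user picks an attribute value with probability proportional to the number of items that carry it. Under those assumptions the expected entropy reduction equals the Shannon entropy of the attribute's value distribution. The function names and the toy records are illustrative only, and attribute values must be hashable (a list such as `resolution` would need converting to a tuple first).

```python
import math
from collections import Counter


def expected_remaining(items, key):
    """Average number of items left after the user picks one value of `key`."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    # value v is picked with probability n_v / total and leaves n_v items
    return sum(n * n for n in counts.values()) / total


def information_gain(items, key):
    """Expected entropy reduction in bits: log2(N) - sum_v p_v * log2(n_v)."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# toy records, not taken from the dataset
toy = [
    {"os": "Windows", "ram": 8},
    {"os": "Windows", "ram": 8},
    {"os": "MacOS", "ram": 8},
    {"os": "Chrome", "ram": 8},
]
os_gain = information_gain(toy, "os")        # 1.5 bits
ram_gain = information_gain(toy, "ram")      # 0 bits, every item has the same RAM
os_remaining = expected_remaining(toy, "os")  # 1.5 items on average
```

An attribute whose values split the catalog evenly scores highest on both metrics, while one dominated by a single value scores close to zero.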

#### Brand

In [10]:
Counter([r["make"] for r in catalog_final])

Counter({'Apple': 9,
         'Acer': 56,
         'HP': 196,
         'Asus': 159,
         'MSI': 38,
         'Infinix': 9,
         'Xiaomi': 8,
         'Lenovo': 128,
         'Dell': 70,
         'Nokia': 2,
         'Microsoft': 4,
         'Samsung': 3,
         'Honor': 1,
         'Avita': 6,
         'VAIO': 1,
         'LG': 2,
         'Wipro': 1})

The most frequent brands are HP, Asus and Lenovo. Though, we know empirically that Apple makes up a major fraction of all laptops sold, while Dell has a big market share too. We could frame the question as:

```
Are you looking for laptops of a specific make? 
- Apple
- Dell
- Asus
- Show me everything
```

#### OS

In [11]:
Counter([r["os"] for r in catalog_final])

Counter({'MacOS': 9, 'Windows': 665, 'Chrome': 9, 'Others': 8, 'Linux': 2})

Laptops coming with Linux as the default are few. The relevant question here would be, 

```
Do you have a preference among the following operating systems?
- Apple MacOS
- Windows laptops
- Chromebooks
- Not sure
```

For about half the purchases, the laptop make is strongly correlated with the OS (Apple and MacOS), so asking for the OS preference seems the better option.

#### Price

This is a continuous scale, so let's look at the distribution to group laptops into buckets.

In [12]:
prices = [r["price"] for r in all_laptops]
ds = alt.Data(values=[{"price": p} for p in prices])
alt.Chart(ds).transform_density("price", as_=["price", "density"]).mark_area().encode(
    x="price:Q", y="density:Q"
)

Multiple peaks would have made bucketing easier. Let's also look at the percentiles.

In [13]:
np.percentile(prices, [25, 50, 75, 90])

array([ 42990.,  58980.,  80895., 115890.])

Hmm, the relevant question could be, 

```
How much do you expect to spend? 
- less than 40k
- 50-70k
- 80-100k
- price is no bar
```
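
The options above leave some prices uncovered (45k, for instance). If contiguous buckets were wanted instead, one possible sketch is to round the quartiles to convenient cut-offs and look up the bucket with `bisect`. The edges and labels below are hypothetical choices made by rounding the percentiles by hand, not something derived in this notebook.

```python
import bisect

# hypothetical cut-offs, loosely rounded from the 25th/50th/75th percentiles
PRICE_EDGES = [40_000, 60_000, 80_000]
PRICE_LABELS = ["under 40k", "40k to 60k", "60k to 80k", "80k and above"]


def price_bucket(price):
    """Return the label of the bucket a price falls into.

    A price equal to an edge goes into the higher bucket.
    """
    return PRICE_LABELS[bisect.bisect_right(PRICE_EDGES, price)]


cheap = price_bucket(12_000)
on_edge = price_bucket(40_000)
macbook_air = price_bucket(90_900)
```

Each laptop then lands in exactly one bucket, which keeps the answer options mutually exclusive when they are used as filters.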

After this, other attributes to seek preferences on could be:
* display size and resolution
* amount of RAM/storage
* number of years of warranty
* how long the battery lasts (approximating it with the number of cells)
* dedicated GPU and its specs

#### Display size

In [14]:
Counter([r["display_size"] for r in all_laptops])

Counter({13.3: 90,
         15.6: 613,
         14.0: 360,
         17.3: 20,
         14.1: 8,
         11.6: 18,
         16.1: 8,
         16.6: 1,
         14.9: 1,
         16.0: 7,
         15.0: 7,
         16.2: 1,
         12.3: 3,
         14.2: 1,
         13.4: 4,
         13.0: 1,
         10.0: 1,
         12.5: 3,
         13.5: 2,
         15.2: 2})

```
Do you have a screen size preference?
- compact, 13 inch or lower
- around 14 inches 
- around 15 inches
- Larger than 15 inches
```

#### Display resolution

In [15]:
Counter([tuple(r["resolution"]) for r in all_laptops])

Counter({(2560, 1600): 16,
         (1920, 1080): 927,
         (2160, 1440): 3,
         (1366, 768): 137,
         (2880, 1800): 13,
         (2560, 1440): 18,
         (1440, 900): 1,
         (1920, 1200): 6,
         (3456, 2234): 1,
         (2736, 1824): 3,
         (2256, 1504): 1,
         (3000, 2000): 1,
         (1800, 1200): 1,
         (2240, 1400): 1,
         (1920, 1280): 1,
         (3840, 2160): 8,
         (2496, 1664): 1,
         (1600, 900): 6,
         (3840, 2400): 3,
         (3200, 1800): 1,
         (1280, 800): 2})

HD is `1366 x 768`, FHD is `1920 x 1080`, 4k is `3840 x 2160`.

```
Do you care about display quality?
- not really (anything works)
- somewhat (at least HD)
- yes (at least Full HD)
- very much (4k or better)
```
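
To turn an answer to this question into a filter, each laptop's resolution needs to be mapped onto one of the tiers. A minimal sketch follows, assuming a tier is reached only when both pixel dimensions meet its minimum; the function name and the "below HD" label are made up for illustration.

```python
def resolution_tier(resolution):
    """Map a (width, height) pair to the highest display tier it satisfies."""
    width, height = resolution
    if width >= 3840 and height >= 2160:
        return "4k"
    if width >= 1920 and height >= 1080:
        return "Full HD"
    if width >= 1366 and height >= 768:
        return "HD"
    return "below HD"


tiers = [
    resolution_tier(res)
    for res in [(1280, 800), (1366, 768), (1920, 1080), (2560, 1600), (3840, 2400)]
]
```

Odd aspect ratios such as `2560 x 1600` fall into the highest tier whose minimum they clear, so they count as "at least Full HD" here.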

#### RAM size

In [16]:
Counter([r["ram"] for r in catalog_final])

Counter({8: 460, 16: 175, 4: 50, 2: 2, 32: 6})

```
RAM requirements?

- Indifferent
- At least 8 GB
- At least 16 GB
```

#### Dedicated GPU

In [17]:
set(r["gpu_make"] for r in catalog_final)

{'AMD', 'Integrated', 'NVIDIA', None}

In [18]:
Counter([r["gpu_memory"] for r in catalog_final])

Counter({None: 449, 4: 154, 8: 12, 6: 42, 2: 32, 12: 1, 3: 3})

```
Do you have graphics-heavy workloads (gaming / design)?
- Don't expect to
- Sometimes (Dedicated GPU / Mac)
- Regularly (> 4GB GPU memory / Macbook Pro)
```

#### Warranty

In [19]:
Counter([r["warranty"] for r in catalog_final])

Counter({1: 678, 2: 12, 3: 3})

There is little variation among different laptops. We could skip this one. 

#### Battery life

In [20]:
np.percentile([r["battery"] for r in catalog_final], np.linspace(0, 90, 10))

array([ 3. ,  5. ,  5. ,  5. ,  5. ,  5. ,  8. , 10. , 10. , 10.8])

```
How important is battery duration to you?

- Not significant
- Reasonably (At least 5 hours)
- Very important (Should last >= 8 hours)
```

#### Hard disk size

In [22]:
Counter([r["hdd"] for r in catalog_final])

Counter({None: 564, 1024: 122, 500: 6, 320: 1})

In [23]:
Counter([r["sdd"] for r in catalog_final])

Counter({256: 180, 512: 388, None: 37, 128: 13, 1024: 67, 32: 2, 16: 4, 64: 2})

Most laptops seem to have an SSD; we could just count the total disk space. 

In [26]:
def summ(x, y):
    """Add two disk sizes, treating None as a missing disk."""
    if x is None:
        return y
    elif y is None:
        return x
    else:
        return x + y


Counter([summ(r["hdd"], r["sdd"]) for r in catalog_final])

Counter({256: 95,
         512: 386,
         1280: 85,
         1024: 97,
         128: 8,
         1536: 2,
         32: 2,
         16: 4,
         64: 2,
         1152: 5,
         500: 6,
         320: 1})

```
Do you need a lot of local storage?

- Not a factor
- Yes (>= 500 GB)
- Very important (>= 1000 GB)
```

## Order of questions

We have 8 questions in total (brand is dropped in favor of OS, and warranty is skipped).

The first 2 questions are posed to everyone. **Operating system could make Q1 and price Q2**, since asking for the price outright feels limiting. Also, the distribution of prices would be very different if, say, the user wants a Macbook versus having no preference.

Following that, we can evaluate each question by how much it helps reduce the search space, conditional on the catalog that remains at that point. We could keep going until only a few items remain or we hit 5 or 6 questions in total.
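
A greedy version of this loop could look like the sketch below. It is only an illustration under several assumptions: question value is scored by the entropy of the attribute's value distribution over the items still in play, answers are exact-match filters on a single attribute, and the simulated answers, thresholds and function names are all hypothetical. The two fixed opening questions are left out for brevity.

```python
import math
from collections import Counter


def information_gain(items, key):
    """Entropy (bits) of the value distribution of `key` over `items`."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def run_session(items, candidate_keys, answers, min_items=1, max_questions=6):
    """Greedily ask the most informative remaining question.

    `answers` maps an attribute to the value a simulated user would pick;
    attributes missing from it are treated as "no preference".
    """
    remaining, keys, asked = list(items), list(candidate_keys), []
    while keys and len(remaining) > min_items and len(asked) < max_questions:
        # re-score every unasked question on the current shortlist
        best = max(keys, key=lambda k: information_gain(remaining, k))
        if information_gain(remaining, best) <= 0:
            break  # no remaining question can split the shortlist
        keys.remove(best)
        asked.append(best)
        if best in answers:
            remaining = [it for it in remaining if it[best] == answers[best]]
    return asked, remaining


# toy records, not taken from the dataset
toy = [
    {"os": "Windows", "ram": 8, "warranty": 1},
    {"os": "Windows", "ram": 16, "warranty": 1},
    {"os": "Windows", "ram": 4, "warranty": 1},
    {"os": "MacOS", "ram": 8, "warranty": 1},
    {"os": "Windows", "ram": 8, "warranty": 1},
    {"os": "Chrome", "ram": 4, "warranty": 1},
]
asked, remaining = run_session(
    toy, ["os", "ram", "warranty"], answers={"ram": 8, "os": "Windows"}
)
```

On this toy catalog the RAM question is asked first because it splits the items more evenly than OS, and the warranty question is never asked because every item has the same value.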