### Day 21, Pt 1 - Solution

This is my "night of" solve and could use a lot of work. I mainly figured out how to solve the logic by hand, then wrote code as I went.

#### What we know:

1) Each allergen is found in exactly one ingredient. 
2) Each ingredient contains zero or one allergen. 
    - Allergens are not always marked: a line may omit an allergen that one of its ingredients contains.
    
#### Test Case Interpretation: 

```
mxmxvkd kfcds sqjhc nhms (contains dairy, fish)
trh fvjkl sbzzf mxmxvkd (contains dairy)
sqjhc fvjkl (contains soy)
sqjhc mxmxvkd sbzzf (contains fish)
```

Allergens:
- `mxmxvkd` contains dairy: it is the only ingredient present in both line 1 and line 2, the two lines that list dairy.
- `sqjhc` could contain fish or soy.
- `fvjkl` could contain soy.

No allergens:
- `kfcds` & `nhms` only appear in line 1, so no allergen (`mxmxvkd` has already taken dairy, and whatever holds fish must also appear in line 4)
- `trh` only in line 2, so no allergen (`mxmxvkd` has already taken dairy)
- `sbzzf` appears in various spots, but can't logically have an allergen:
    - It is in line 2, but can't have dairy.
    - It is in line 4, but it can't be linked to fish, since the ingredient linked to fish must also be present in line 1, and `sbzzf` is not.
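
The hand reasoning above can be reproduced with set intersections. The following is a minimal sketch, not the notebook's own code: `sample`, `candidates`, `suspects`, and `safe` are names introduced here for illustration, and the parsing assumes every line ends with a `(contains ...)` clause.

```python
# Sample menu from the puzzle statement (same four lines as above).
sample = """\
mxmxvkd kfcds sqjhc nhms (contains dairy, fish)
trh fvjkl sbzzf mxmxvkd (contains dairy)
sqjhc fvjkl (contains soy)
sqjhc mxmxvkd sbzzf (contains fish)"""

candidates = {}       # allergen -> ingredients present in every line listing it
all_ingredients = []  # every ingredient occurrence, duplicates kept

for line in sample.splitlines():
    left, right = line.rstrip(")").split(" (contains ")
    ingredients = left.split()
    all_ingredients.extend(ingredients)
    for allergen in right.split(", "):
        if allergen in candidates:
            # Narrow the candidates to ingredients also present in this line.
            candidates[allergen] &= set(ingredients)
        else:
            candidates[allergen] = set(ingredients)

# Any ingredient that is still a candidate for at least one allergen.
suspects = set().union(*candidates.values())
safe = sorted(set(all_ingredients) - suspects)
safe_count = sum(1 for ing in all_ingredients if ing not in suspects)

print(candidates)
print(safe, safe_count)
```

For the sample this leaves `kfcds`, `nhms`, `sbzzf`, and `trh` as safe, appearing 5 times in total, which agrees with the hand analysis.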

In [1]:
import re
from collections import defaultdict, Counter

# Read in test data
filepath = "day21_data.txt"
with open(filepath) as fh:
    lines = [line.strip() for line in fh.readlines()]

### How to Approach This: 

- I need to find how many times an ingredient appears alongside an allergen and how many times that allergen appears overall.
    - This should work since "each allergen is found in exactly one ingredient".
    - Need to be careful though, since an allergen does not have to be listed even if its ingredient is there.

In [2]:
def allergenExtraction(string):
    """Pass in a string & return list of allergens"""
    m = re.search(r'\((.*?)\)', string)
    allergens = m.group(1)  # separate line due to next step
    
    return allergens.replace('contains ', '').split(', ')

In [3]:
allergen_dict = defaultdict(list)
ingred_dict = defaultdict(int)
for line in lines:
    
    # get allergens for ingredients
    allergens = allergenExtraction(line)
    
    # ingredients
    ingreds = line.split(" (")[0].split()
    
    # add ingreds to each allergen key
    for allergen in allergens:
        allergen_dict[allergen].append(ingreds)
        
    # update total tracker for ingredients -> will be used at the end to determine how often ingreds showed up
    for ingred in ingreds:
        ingred_dict[ingred] += 1

In [4]:
final_dict = {}

for allergen, ingred_list in allergen_dict.items():
    
    # number of times we saw allergen
    all_cnt = len(ingred_list)
    
    # flatten list
    flat_list = [item for sublist in ingred_list for item in sublist]
    
    # build a counter which represents how many of each ingredient showed up across all instances 
    # of allergen of interest. 
    final_dict[allergen] = (all_cnt, dict(Counter(flat_list)))

### General Logic:

- If only one ingredient has a count == number of times the allergen showed up, then we know this ingredient is linked to that allergen:
    - we can remove it from all other checks
    - we can store it in an ingredient dictionary
    - we can store off the others in a counter

- If multiple ingredients have a count == number of times the allergen showed up, then we don't know how to proceed unless we can start removing ingredients that are already assigned to an allergen (maybe use a while loop?)
    - Example: `sqjhc` links to fish (I think this is correct), but `mxmxvkd` also matches fish. I guess we just skip these until it is obvious (**Note: solved at the end**)
    
- Potential TODO: 
    - My handling of the `sbzzf` example might not scale properly. Right now I am assuming my logic handles the fact that `sbzzf` showed up in 1 of the 2 fish examples, but not positive I am correct on this. (**Note: This was not a problem**)
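
To check the `sbzzf` concern on the sample, here is a small sketch of the count == total rule applied to fish. The two ingredient lists are copied from lines 1 and 4 of the test case; `fish_lists` and `fish_candidates` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Ingredient lists of the two sample lines that mention fish (lines 1 and 4).
fish_lists = [
    ["mxmxvkd", "kfcds", "sqjhc", "nhms"],
    ["sqjhc", "mxmxvkd", "sbzzf"],
]

total = len(fish_lists)  # number of times fish was listed
counts = Counter(ing for ingredients in fish_lists for ing in ingredients)

# Keep only ingredients seen in every line that lists fish.
fish_candidates = sorted(ing for ing, seen in counts.items() if seen == total)
print(fish_candidates)
```

`sbzzf` is seen in only one of the two fish lines, so it drops out, while `mxmxvkd` and `sqjhc` both remain until dairy is resolved. The rule relies on an ingredient never being repeated within a single line, which is an assumption about the input.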

In [5]:
allergen_ingred = {}
# ingredients known to be linked to allergens
# (never filled in this cell, so nothing is actually removed yet)
ingred_all = []
for allergen, (count, ingred_counts) in final_dict.items():

    possible = []
    # iterate over a copy: deleting from a dict while looping over it raises a RuntimeError
    for ingred, seen in list(ingred_counts.items()):
        # remove any ingredients that are already affiliated with one allergen
        if ingred in ingred_all:
            del ingred_counts[ingred]

        # if counter val == total times allergen showed up, then it is a candidate
        elif seen == count:
            possible.append(ingred)

    # this still won't be perfect, can have multiple mappings per allergen so will need to resolve.
    allergen_ingred[allergen] = possible

In [6]:
# build a single list of ingredients likely matched to an allergen
final_list = []
for k,v in allergen_ingred.items():
    final_list.append(v)

# set of ingredients that match with an allergen
ingreds_w_all = set([item for sublist in final_list for item in sublist])
print(ingreds_w_all)

{'vqzbq', 'dtb', 'nhjrzzj', 'zqjm', 'vdxb', 'rpj', 'gbt', 'bqmhk'}


In [7]:
tot_count = 0
for k,v in ingred_dict.items():
    if k in ingreds_w_all:
        continue
    else:
        tot_count += v
print(tot_count)

2798


### Part 2: I actually wrote this out by hand to ensure my logic worked

- Will code this up later. It will look like sorting the candidate lists by length and building a new dict while removing resolved ingredients from the remaining lists.
- General Logic: Each ingredient can have at most one allergen
    - Once we know `vqzbq` is linked to `sesame`, we can remove `vqzbq` from every other allergen's candidates. Repeating this resolves the rest.

#### Solution: 

```python
# final answer
gbt,rpj,vdxb,dtb,bqmhk,vqzbq,zqjm,nhjrzzj
```

In [8]:
for key in sorted(allergen_ingred):
    print(key, allergen_ingred[key])

dairy ['vqzbq', 'gbt']
eggs ['vqzbq', 'rpj']
fish ['zqjm', 'vdxb']
nuts ['gbt', 'bqmhk', 'dtb', 'rpj']
peanuts ['bqmhk', 'gbt']
sesame ['vqzbq']
shellfish ['vqzbq', 'zqjm']
wheat ['nhjrzzj', 'vdxb']
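
The by-hand elimination for Part 2 could be coded roughly as follows. This is a sketch under assumptions: the candidate lists are copied from the output above, `resolve` is a hypothetical helper, and it assumes every pass finds at least one allergen with a single remaining candidate (it raises an error otherwise).

```python
def resolve(candidates):
    """Map each allergen to one ingredient by repeated elimination."""
    remaining = {allergen: set(ings) for allergen, ings in candidates.items()}
    resolved = {}
    while remaining:
        # Allergens whose candidate set has shrunk to exactly one ingredient.
        singles = {a: next(iter(s)) for a, s in remaining.items() if len(s) == 1}
        if not singles:
            raise ValueError("no allergen with a single candidate; cannot resolve")
        resolved.update(singles)
        taken = set(singles.values())
        # Drop resolved allergens and strike their ingredients everywhere else.
        remaining = {a: s - taken for a, s in remaining.items() if a not in singles}
    return resolved


# Candidate lists copied from the printed output above.
candidates = {
    "dairy": ["vqzbq", "gbt"],
    "eggs": ["vqzbq", "rpj"],
    "fish": ["zqjm", "vdxb"],
    "nuts": ["gbt", "bqmhk", "dtb", "rpj"],
    "peanuts": ["bqmhk", "gbt"],
    "sesame": ["vqzbq"],
    "shellfish": ["vqzbq", "zqjm"],
    "wheat": ["nhjrzzj", "vdxb"],
}

resolved = resolve(candidates)
# Ingredients ordered alphabetically by their allergen.
answer = ",".join(resolved[a] for a in sorted(resolved))
print(answer)
```

Ordering the ingredients alphabetically by allergen and joining them reproduces the answer string given in the Solution section.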
