In [112]:
%logstop
%logstart -rtq ~/.logs/pw.py append
import seaborn as sns
sns.set()

In [21]:
from static_grader import grader

# PW Miniproject
## Introduction

The objective of this miniproject is to exercise your ability to use basic Python data structures, define functions, and control program flow. We will be using these concepts to perform some fundamental data wrangling tasks such as joining data sets together, splitting data into groups, and aggregating data into summary statistics.
**Please do not use `pandas` or `numpy` to answer these questions.**

We will be working with medical data from the British NHS on prescription drugs. Since this is real data, it contains many ambiguities that we will need to confront in our analysis. This is commonplace in data science, and is one of the lessons you will learn in this miniproject.

## Downloading the data

We first need to download the data we'll be using from Amazon S3:

In [7]:
%%bash
mkdir -p pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/201701scripts_sample.json.gz -nc -P ./pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/practices.json.gz -nc -P ./pw-data

File ‘./pw-data/201701scripts_sample.json.gz’ already there; not retrieving.

File ‘./pw-data/practices.json.gz’ already there; not retrieving.



## Loading the data

The first step of the project is to read in the data. We will discuss reading and writing various kinds of files later in the course, but the code below should get you started.

In [28]:
import gzip
import simplejson as json

In [29]:
with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:
    scripts = json.load(f)

with gzip.open('./pw-data/practices.json.gz', 'rb') as f:
    practices = json.load(f)
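
As a small self-contained illustration of the loading step, the sketch below writes a tiny gzipped JSON file to a temporary directory and reads it back. It assumes the standard-library `json` module in place of `simplejson`, and the file name and records are made up for the example.

```python
import gzip
import json
import os
import tempfile

# Hypothetical records shaped like entries of `scripts`.
sample = [{'practice': 'N81013', 'items': 2},
          {'practice': 'N81013', 'items': 1}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.json.gz')  # made-up file name

    # 'wt' and 'rt' open the compressed file in text mode.
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        json.dump(sample, f)

    with gzip.open(path, 'rt', encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded == sample)
```

The result is an ordinary list of dictionaries, which is the same shape as `scripts` and `practices`.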

This data set comes from Britain's National Health Service. The `scripts` variable is a list of prescriptions issued by NHS doctors. Each prescription is represented by a dictionary with various data fields: `'practice'`, `'bnf_code'`, `'bnf_name'`, `'quantity'`, `'items'`, `'nic'`, and `'act_cost'`. 

In [30]:
scripts[:2]

[{'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81013',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 1,
  'practice': 'N81013',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 1.95,
  'act_cost': 1.82,
  'quantity': 500}]

In [31]:
len(scripts)

382726

A [glossary of terms](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10686/Download-glossary-of-terms-for-GP-prescribing---presentation-level/pdf/PLP_Presentation_Level_Glossary_April_2015.pdf/) and [FAQ](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10048/FAQs-Practice-Level-Prescribingpdf/pdf/PLP_FAQs_April_2015.pdf/) are available from the NHS regarding the data. Below we supply a data dictionary briefly describing what these fields mean.

| Data field |Description|
|:----------:|-----------|
|`'practice'`|Code designating the medical practice issuing the prescription|
|`'bnf_code'`|British National Formulary drug code|
|`'bnf_name'`|British National Formulary drug name|
|`'quantity'`|Number of capsules/quantity of liquid/grams of powder prescribed|
| `'items'`  |Number of refills (e.g. if `'quantity'` is 30 capsules, 3 `'items'` means 3 bottles of 30 capsules)|
|  `'nic'`   |Net ingredient cost|
|`'act_cost'`|Total cost including containers, fees, and discounts|

The `practices` variable is a list of member medical practices of the NHS. Each practice is represented by a dictionary containing identifying information for the medical practice. Most of the data fields are self-explanatory. Notice the values in the `'code'` field of `practices` match the values in the `'practice'` field of `scripts`.

In [32]:
practices[:2]

[{'code': 'A81001',
  'name': 'THE DENSHAM SURGERY',
  'addr_1': 'THE HEALTH CENTRE',
  'addr_2': 'LAWSON STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 1HU'},
 {'code': 'A81002',
  'name': 'QUEENS PARK MEDICAL CENTRE',
  'addr_1': 'QUEENS PARK MEDICAL CTR',
  'addr_2': 'FARRER STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 2AW'}]

In the following questions we will ask you to explore this data set. You may need to combine pieces of the data set together in order to answer some questions. Not every element of the data set will be used in answering the questions.

## Question 1: summary_statistics

Our beneficiary data (`scripts`) contains quantitative data on the number of items dispensed (`'items'`), the total quantity of items dispensed (`'quantity'`), the net cost of the ingredients (`'nic'`), and the actual cost to the patient (`'act_cost'`). Whenever working with a new data set, it can be useful to calculate summary statistics to develop a feeling for the volume and character of the data. This makes it easier to spot trends and significant features during further stages of analysis.

Calculate the sum, mean, standard deviation, and quartile statistics for each of these quantities. Format your results for each quantity as a list: `[sum, mean, standard deviation, 1st quartile, median, 3rd quartile]`. We'll create a `tuple` with these lists for each quantity as a final result.

In [33]:
def mean(values, precision=None):
    # values is a Python list; return its mean, optionally rounded to
    # `precision` decimal places.
    avg = sum(values) / len(values)
    if precision is not None:  # precision=0 is a valid request
        return round(avg, precision)
    return avg

In [34]:
mean([10, 20 ,30])

20.0

In [35]:
from math import sqrt

def std(values):
    #values is a python list. we want to return the standard deviation of the values list
    avg = mean(values)
    return sqrt(sum([(number - avg)**2 for number in values])/len(values))
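
The `std` function divides by `len(values)`, which makes it a population standard deviation. As a rough sanity check, the sketch below compares that formula with `statistics.pstdev` from the standard library on a toy list. The name `population_std` is made up for this example.

```python
from math import isclose, sqrt
from statistics import pstdev

def population_std(values):
    # Population standard deviation: divide by n, not n - 1.
    avg = sum(values) / len(values)
    return sqrt(sum((x - avg) ** 2 for x in values) / len(values))

toy = [10, 20, 30]
ours = population_std(toy)
reference = pstdev(toy)
print(isclose(ours, reference))
```

A sample standard deviation (dividing by `n - 1`) would give a slightly larger value, so it matters which convention is expected.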
    

In [36]:
from math import ceil

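def quartile(values, q):
    # Returns the element of values at the q-th quantile (e.g. q=0.25 for Q1).
    # Note: assumes that values is already sorted.
    # q is a floating point number between 0 and 1.

    # Clamp the index so that q=1.0 does not run past the end of the list.
    index = min(ceil(q * len(values)), len(values) - 1)
    return values[index]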
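
To see what this index-based rule selects, here is a minimal sketch on a short sorted list. The name `quartile_sketch` is hypothetical, and the index is clamped so that `q = 1.0` stays in range.

```python
from math import ceil

def quartile_sketch(sorted_values, q):
    # Index-based rule: pick the element at position ceil(q * n),
    # clamped to the last valid index.
    index = min(ceil(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

toy = [1, 2, 3, 4, 5, 6, 7, 8]  # already sorted
picks = [quartile_sketch(toy, q) for q in (0.25, 0.5, 0.75, 1.0)]
print(picks)  # [3, 5, 7, 8]
```

This rule always returns an element of the list and never interpolates between neighbours, so its results may differ from other quartile conventions.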

In [37]:
[d['nic'] for d in scripts][:5]

[5.98, 1.95, 64.51, 9.21, 28.92]

In [38]:

def describe(data, key): #data is a list of dictionaries, key is a key in each dictionary in data
    
    values = sorted([d[key] for d in data]) #This gets the number for a given key in dictionary
    
    
    total = sum(values)  #Sum
    avg = mean(values) #mean
    s = std(values) #standard deviation
    q25 = quartile(values, 0.25) #1st quartile, Q1
    med = quartile(values, 0.50) #Median
    q75 = quartile(values, 0.75) #3rd quartile, Q3

    return (total, avg, s, q25, med, q75)

In [39]:
describe(scripts, 'items')

(4410054, 11.522744731217633, 33.11216633979208, 1, 3, 8)

In [40]:
describe(scripts, 'quantity')

(316356836, 826.5883059943667, 3872.1810146090156, 30, 120, 466)

In [41]:
describe(scripts, 'nic')

(29048309.790000085, 75.89844899484248, 197.57282662775233, 7.7, 22.62, 65.95)

In [42]:
describe(scripts, 'act_cost')

(27053937.60000044, 70.68748295125087, 183.267318953028, 7.25, 21.24, 61.54)

In [43]:
summary = [('items', describe(scripts, 'items')),
           ('quantity', describe(scripts, 'quantity')),
           ('nic', describe(scripts, 'nic')),
           ('act_cost', describe(scripts, 'act_cost'))]

In [44]:
summary

[('items', (4410054, 11.522744731217633, 33.11216633979208, 1, 3, 8)),
 ('quantity',
  (316356836, 826.5883059943667, 3872.1810146090156, 30, 120, 466)),
 ('nic',
  (29048309.790000085,
   75.89844899484248,
   197.57282662775233,
   7.7,
   22.62,
   65.95)),
 ('act_cost',
  (27053937.60000044,
   70.68748295125087,
   183.267318953028,
   7.25,
   21.24,
   61.54))]

In [45]:
grader.score.pw__summary_statistics(summary)

Your score: 1.000


## Question 2: most_common_item

Often we are interested not only in how the data is distributed across our entire data set, but also in how it is distributed within particular groups -- for example, how many items of each drug (i.e. `'bnf_name'`) were prescribed? Calculate the total items prescribed for each `'bnf_name'`. What is the most commonly prescribed `'bnf_name'` in our data?

To calculate this, we first need to split our data set into groups corresponding to the different values of `'bnf_name'`. Then we can sum the number of items dispensed within each group. Finally we can find the largest sum.
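
The split, sum, and maximum steps can be sketched on a few made-up records. The drug names and the variable `toy_scripts` are invented for illustration, and `defaultdict` is one possible way to accumulate the sums.

```python
from collections import defaultdict

# Made-up prescriptions; only the two fields used here are included.
toy_scripts = [
    {'bnf_name': 'Drug A', 'items': 3},
    {'bnf_name': 'Drug B', 'items': 5},
    {'bnf_name': 'Drug A', 'items': 4},
]

# Split and sum in one pass: each name accumulates its items.
totals = defaultdict(int)
for script in toy_scripts:
    totals[script['bnf_name']] += script['items']

# Find the (name, total) pair with the largest total.
top = max(totals.items(), key=lambda pair: pair[1])
print(top)  # ('Drug A', 7)
```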

We'll use `'bnf_name'` to construct our groups. You should have *5619* unique values for `'bnf_name'`.

In [46]:
bnf_names = { s['bnf_name'] for s in scripts}
assert(len(bnf_names) == 5619)

In [47]:
bnf_names = set([d['bnf_name'] for d in scripts])

In [48]:
len(set([s['bnf_name'] for s in scripts]))

5619

In [49]:
bnf_names 

{'Hospiform 10cm x 4m Conform Syn Band',
 'Warfarin Sod_Oral Susp 1mg/1ml S/F',
 'Maxitram SR_Cap 50mg',
 'Trimethoprim_Tab 200mg',
 'Fragmin_Inj 25 000u/ml 0.6ml Pfs',
 'Co-Amilofruse_Tab 5mg/40mg',
 'Trental 400_Tab 400mg',
 'CosmoCol_Oral Pdr Sach (Orng Lem & Lim)',
 'Teleflex_Rusch Leg Bag 500ml 10cm Tubing',
 'Dulaglutide_Inj 0.75mg/0.5ml Pf Dev',
 'Hydrocort But_Scalp Lot 0.1%',
 'Dalteparin Sod_Inj 25 000u/ml 0.4ml Pfs',
 'Dalivit_Dps',
 'Medikinet XL_Cap 20mg',
 'Bumetanide_Tab 1mg',
 'Diazepam_Tab 5mg',
 'Dexameth_Oral Soln 2mg/5ml S/F',
 'Aquacel Ag 5cm x 5cm Wound Dress Proteas',
 'Sod Algin/Pot Bicarb_Susp S/F',
 'Clobavate_Oint 0.05%',
 'Prempak-C_C/Pk Tab 1.25mg/150mcg',
 'Aprovel_Tab 150mg',
 'Ins Detemir_Inj 100u/ml 3ml Pf Pen',
 'Gaviscon_Liq Orig Aniseed Relief',
 'Medicareplus_Medi Derma-S Barrier Crm 90',
 'Fosinopril Sod_Tab 10mg',
 'Ispag Husk_Gran Sach 3.5g G/F',
 'FemSeven 50_Patch 50mcg/24hrs',
 'Dovobet_Gel Applic',
 'Methylpred Acet/Lidoc_Inj 80/20mg/2ml Vl',

We want to construct "groups" identified by `'bnf_name'`, where each group is a collection of prescriptions (i.e. dictionaries from `scripts`). We'll construct a dictionary called `groups`, using `bnf_names` as the keys. We'll represent a group with a `list`, since we can easily append new members to the group. To split our `scripts` into groups by `'bnf_name'`, we should iterate over `scripts`, appending prescription dictionaries to each group as we encounter them.

In [51]:
groups = {name: [] for name in bnf_names}

In [52]:
for script in scripts:
    groups[ script['bnf_name'] ].append(script)

In [53]:
len(groups)

5619

In [54]:
item_totals = []
for k in groups.keys():
    item_totals.append((k, sum([ s['items'] for s in groups[k] ])))

In [55]:
len(item_totals)

5619

In [56]:
item_totals

[('Hospiform 10cm x 4m Conform Syn Band', 10),
 ('Warfarin Sod_Oral Susp 1mg/1ml S/F', 2),
 ('Maxitram SR_Cap 50mg', 100),
 ('Trimethoprim_Tab 200mg', 10979),
 ('Fragmin_Inj 25 000u/ml 0.6ml Pfs', 6),
 ('Co-Amilofruse_Tab 5mg/40mg', 1301),
 ('Trental 400_Tab 400mg', 3),
 ('CosmoCol_Oral Pdr Sach (Orng Lem & Lim)', 215),
 ('Teleflex_Rusch Leg Bag 500ml 10cm Tubing', 2),
 ('Dulaglutide_Inj 0.75mg/0.5ml Pf Dev', 2),
 ('Hydrocort But_Scalp Lot 0.1%', 2),
 ('Dalteparin Sod_Inj 25 000u/ml 0.4ml Pfs', 15),
 ('Dalivit_Dps', 647),
 ('Medikinet XL_Cap 20mg', 98),
 ('Bumetanide_Tab 1mg', 6414),
 ('Diazepam_Tab 5mg', 8224),
 ('Dexameth_Oral Soln 2mg/5ml S/F', 21),
 ('Aquacel Ag 5cm x 5cm Wound Dress Proteas', 69),
 ('Sod Algin/Pot Bicarb_Susp S/F', 2636),
 ('Clobavate_Oint 0.05%', 10),
 ('Prempak-C_C/Pk Tab 1.25mg/150mcg', 44),
 ('Aprovel_Tab 150mg', 6),
 ('Ins Detemir_Inj 100u/ml 3ml Pf Pen', 33),
 ('Gaviscon_Liq Orig Aniseed Relief', 244),
 ('Medicareplus_Medi Derma-S Barrier Crm 90', 492),
 ...

Now that we've constructed our groups we should sum up `'items'` in each group and find the `'bnf_name'` with the largest sum. The result, `max_item`, should have the form `[(bnf_name, item total)]`, e.g. `[('Foobar', 2000)]`.

In [57]:
max(item_totals, key=lambda x: x[1])

('Omeprazole_Cap E/C 20mg', 113826)

In [58]:
max_item = sorted(item_totals, key=lambda x: x[1], reverse=True)[:1]

In [59]:
max_item

[('Omeprazole_Cap E/C 20mg', 113826)]

**TIP:** If you are getting an error from the grader below, please make sure your answer conforms to the correct format of `[(bnf_name, item total)]`.

In [60]:
grader.score.pw__most_common_item(max_item)

Your score: 1.000


**Challenge:** Write a function that constructs groups as we did above. The function should accept a list of dictionaries (e.g. `scripts` or `practices`) and a tuple of fields to `groupby` (e.g. `('bnf_name',)` or `('bnf_name', 'post_code')`) and return a dictionary of groups. The following questions will require you to aggregate data in groups, so this could be a useful function for the rest of the miniproject.

In [61]:
def group_by_field(data, fields):
    # First create the keys of the new dictionary: one tuple of field values per group.
    names = {tuple(item[field] for field in fields) for item in data}

    # Initialize a new dictionary with an empty list for each key.
    groups = {name: [] for name in names}

    # Iterate over the data, appending each item to its group.
    for item in data:
        name = tuple(item[field] for field in fields)
        groups[name].append(item)
    return groups
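
As a usage illustration, the sketch below groups a few made-up records by one field and by two fields. It uses a hypothetical single-pass variant, `group_by_fields_sketch`, built on `defaultdict`; this is an alternative formulation and not the function defined above.

```python
from collections import defaultdict

def group_by_fields_sketch(data, fields):
    # Keys are tuples of field values; missing keys start as empty lists.
    groups = defaultdict(list)
    for item in data:
        groups[tuple(item[field] for field in fields)].append(item)
    return dict(groups)

# Made-up records for illustration.
toy = [
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 2},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug B', 'items': 1},
    {'post_code': 'BB2 2BB', 'bnf_name': 'Drug A', 'items': 5},
]

by_name = group_by_fields_sketch(toy, ('bnf_name',))
by_both = group_by_fields_sketch(toy, ('post_code', 'bnf_name'))
print(sorted(by_name))  # [('Drug A',), ('Drug B',)]
print(len(by_both))     # 3
```

Note that the keys are tuples even when grouping by a single field, so the field value itself is `key[0]`.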

In [62]:
groups = group_by_field(scripts, ('bnf_name',))


In [63]:
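# Keys of `groups` are 1-tuples such as ('Omeprazole_Cap E/C 20mg',), so unpack with k[0].
test_item_totals = [(k[0], sum(s['items'] for s in group)) for k, group in groups.items()]
test_max_item = [max(test_item_totals, key=lambda x: x[1])]
assert test_max_item == max_item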

## Question 3: postal_totals

Our data set is broken up among different files. This is typical for tabular data to reduce redundancy. Each table typically contains data about a particular type of event, process, or physical object. Data on prescriptions and medical practices are in separate files in our case. If we want to find the total items prescribed in each postal code, we will have to _join_ our prescription data (`scripts`) to our clinic data (`practices`).

Find the total items prescribed in each postal code, representing the results as a list of tuples `(post code, total items prescribed)`. Sort your results ascending alphabetically by post code and take only results from the first 100 post codes. Only include post codes if there is at least one prescription from a practice in that post code.

**NOTE:** Some practices have multiple postal codes associated with them. Use the alphabetically first postal code.

We can join `scripts` and `practices` based on the fact that `'practice'` in `scripts` matches `'code'` in `practices`. However, we must first deal with the repeated values of `'code'` in `practices`. We want the alphabetically first postal codes.
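
Here is a minimal sketch of picking the alphabetically first postal code per practice, using made-up practice codes. It relies on the fact that `min()` compares strings lexicographically.

```python
# Made-up practice records; 'X00001' appears twice with different post codes.
toy_practices = [
    {'code': 'X00001', 'post_code': 'TS5 6HF'},
    {'code': 'X00002', 'post_code': 'B11 4BW'},
    {'code': 'X00001', 'post_code': 'TS18 1HU'},
]

first_post_code = {}
for practice in toy_practices:
    code, post_code = practice['code'], practice['post_code']
    # Keep whichever post code sorts first as a string.
    first_post_code[code] = min(first_post_code.get(code, post_code), post_code)

print(first_post_code)
# {'X00001': 'TS18 1HU', 'X00002': 'B11 4BW'}
```

The comparison is character by character, not numeric, which is why `'TS18 1HU'` sorts before `'TS5 6HF'`.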

In [64]:
practice_postal = {}
for practice in practices:
    if practice['code'] in practice_postal:
        practice_postal[practice['code']] = min(practice_postal[practice['code']], practice['post_code'])
    else:
        practice_postal[practice['code']] = practice['post_code']

In [65]:
practice_postal

{'A81001': 'TS18 1HU',
 'A81002': 'TS18 2AW',
 'A81003': 'TS25 1QU',
 'A81004': 'TS1 3BE',
 'A81005': 'TS14 7DJ',
 'A81006': 'TS18 2AT',
 'A81007': 'TS24 7PW',
 'A81008': 'TS6 6TD',
 'A81009': 'TS5 6HF',
 'A81011': 'TS24 7PW',
 'A81012': 'TS3 6AL',
 'A81013': 'TS12 2FF',
 'A81014': 'TS23 2LA',
 'A81015': 'TS10 1TZ',
 'A81016': 'TS1 3QY',
 'A81017': 'TS17 0EE',
 'A81018': 'TS10 4NW',
 'A81019': 'TS3 7RL',
 'A81020': 'TS4 3BU',
 'A81021': 'TS6 6TD',
 'A81022': 'TS12 2TG',
 'A81023': 'TS1 2NX',
 'A81025': 'TS18 1HU',
 'A81026': 'TS5 6HA',
 'A81027': 'TS15 9DD',
 'A81029': 'TS1 2NX',
 'A81030': 'TS1 3RY',
 'A81031': 'TS24 7PW',
 'A81032': 'TS14 7DJ',
 'A81033': 'TS3 6AL',
 'A81034': 'TS17 0EE',
 'A81035': 'TS1 3RX',
 'A81036': 'TS20 2UZ',
 'A81037': 'TS1 2NX',
 'A81038': 'TS3 6AL',
 'A81039': 'TS16 9EA',
 'A81040': 'TS23 2DG',
 'A81041': 'TS24 9DN',
 'A81042': 'TS6 9QG',
 'A81043': 'TS6 0HA',
 'A81044': 'TS25 1QU',
 'A81045': 'TS10 1SR',
 'A81046': 'TS18 1YE',
 'A81047': 'TS11 6BW',
 ...

In [66]:
len(practice_postal)

10843

**Challenge:** This is an aggregation of the practice data grouped by practice codes. Write an alternative implementation of the above cell using the `group_by_field` function you defined previously.

In [67]:
assert practice_postal['K82019'] == 'HP21 8TR'

Now we can join `practice_postal` to `scripts`.

In [68]:
joined = scripts[:]  # shallow copy: a new list that shares its dictionaries with scripts, so 'post_code' is also added to those records
for script in joined:
    script['post_code'] = practice_postal[script['practice']]
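
One detail worth knowing about slicing: `some_list[:]` makes a shallow copy, so the new list holds the same dictionary objects as the original. The sketch below shows the effect on a made-up record, along with a per-record copy as one possible alternative.

```python
original = [{'practice': 'X00001'}]  # made-up record
shallow = original[:]                # new list, same dictionaries

shallow[0]['post_code'] = 'AA1 1AA'
print('post_code' in original[0])    # True: the dictionary is shared

# Copying each dictionary keeps the original records untouched,
# at the cost of extra memory.
independent = [dict(record) for record in original]
independent[0]['extra'] = 1
print('extra' in original[0])        # False
```

For this miniproject the shared dictionaries are harmless, and avoiding a second copy of every record saves memory.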

Finally we'll group the prescription dictionaries in `joined` by `'post_code'` and sum up the items prescribed in each group, as we did in the previous question.

In [70]:
items_by_post = sorted(
    (k[0], sum(member['items'] for member in group))
    for k, group in group_by_field(joined, ('post_code',)).items()
)

In [71]:
items_by_post[:8]

[('B11 4BW', 20673),
 ('B18 7AL', 19001),
 ('B21 9RY', 29103),
 ('B23 6DJ', 24859),
 ('B70 7AW', 36531),
 ('BB11 2DL', 34356),
 ('BB2 1AX', 28254),
 ('BB3 1PY', 54514)]

In [72]:
# postal_totals = [('B11 4BW', 20673)] * 100

postal_totals = items_by_post[:100]

grader.score.pw__postal_totals(postal_totals)

Your score: 1.000


## Question 4: items_by_region

Now we'll combine the techniques we've developed to answer a more complex question. Find the most commonly dispensed item in each postal code, representing the results as a list of tuples (`post_code`, `bnf_name`, amount dispensed as proportion of total). Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.

**NOTE:** We'll continue to use the `joined` variable we created before, where we've chosen the alphabetically first postal code for each practice. Additionally, some postal codes will have multiple `'bnf_name'` with the same number of items prescribed for the maximum. In this case, we'll take the alphabetically first `'bnf_name'`.

There are several approaches to solve this problem but we will guide you through one of them. Feel free to solve it your own way if it is easier for you to understand and implement. If your kernel keeps on dying, it's probably an indication that you are running out of memory. Consider deleting objects you no longer need with the `del` statement and shutting down any other running notebooks. For example:
```Python
del some_object_not_needed
```

The first step is to calculate the total items for each `'post_code'` and `'bnf_name'`. Let's call that result `total_items_by_post_bnf`. Consider which data structure(s) would best represent `total_items_by_post_bnf`. It should have 141196 `('post_code', 'bnf_name')` groups.
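
One candidate structure is a dictionary keyed by `(post_code, bnf_name)` tuples. The sketch below builds such totals from made-up records; the variable names are hypothetical, and a list of dictionaries would work equally well.

```python
from collections import defaultdict

# Made-up joined records.
toy_joined = [
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 2},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 3},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug B', 'items': 1},
    {'post_code': 'BB2 2BB', 'bnf_name': 'Drug A', 'items': 4},
]

# Tuples are hashable, so a (post_code, bnf_name) pair can be a dictionary key.
totals_by_pair = defaultdict(int)
for record in toy_joined:
    totals_by_pair[(record['post_code'], record['bnf_name'])] += record['items']

print(dict(totals_by_pair))
# {('AA1 1AA', 'Drug A'): 5, ('AA1 1AA', 'Drug B'): 1, ('BB2 2BB', 'Drug A'): 4}
```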

In [73]:
grouped_by_post_bnf = group_by_field(joined, ('post_code', 'bnf_name',))

In [76]:
grouped_by_post_bnf[('BH18 8EE', 'Permethrin_Crm 5%')]

[{'bnf_code': '1310040Q0AAABAB',
  'items': 7,
  'practice': 'J81041',
  'bnf_name': 'Permethrin_Crm 5%',
  'nic': 96.98,
  'act_cost': 89.87,
  'quantity': 390,
  'post_code': 'BH18 8EE'},
 {'bnf_code': '1310040Q0AAABAB',
  'items': 1,
  'practice': 'J81046',
  'bnf_name': 'Permethrin_Crm 5%',
  'nic': 59.68,
  'act_cost': 55.26,
  'quantity': 240,
  'post_code': 'BH18 8EE'}]

In [85]:
total_items_by_post_bnf = []

for key, value in grouped_by_post_bnf.items():
    new_item = {'post_code' : key[0],
               'bnf_name' : key[1],
               'total': sum(v['items'] for v in value)}
    total_items_by_post_bnf.append(new_item)
    

In [87]:
total_items_by_post_bnf[:4]

[{'post_code': 'BH23 3AF', 'bnf_name': 'Paracet_Tab Solb 500mg', 'total': 49},
 {'post_code': 'TS23 2DG',
  'bnf_name': 'K-Soft 10cm x 3.5m M/Layer Compress Band',
  'total': 4},
 {'post_code': 'M30 0NU', 'bnf_name': 'Quinine Bisulf_Tab 300mg', 'total': 44},
 {'post_code': 'SR4 7XF',
  'bnf_name': 'Oilatum Jnr_Bath Additive',
  'total': 14}]

In [86]:
assert len(total_items_by_post_bnf) == 141196

Next, let's take `total_items_by_post_bnf` and group it by `'post_code'`. In other words, we want a data structure that maps a `'post_code'` to a list of all records that belong to that `'post_code'`. There should be 118 groups.

In [89]:
grouped_post_code = group_by_field(total_items_by_post_bnf, ('post_code',))

assert len(grouped_post_code) == 118

Now with `grouped_post_code`, let's iterate over each group and calculate the following fields for each `'post_code'`:
1. the sum of total items for all `'bnf_name'`
1. the most total items
1. the `'bnf_name'` that had the most total items

Once again, consider the best data structure(s) to use to represent the result. It may help to write and use a function when developing your solution.
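
The three fields can be sketched for a single made-up post code group as follows. The records are invented, and the tie between `'Drug A'` and `'Drug B'` is included on purpose to show one way of preferring the alphabetically first name.

```python
# Made-up totals for a single post code; 'Drug A' and 'Drug B' tie at 5.
toy_group = [
    {'bnf_name': 'Drug B', 'total': 5},
    {'bnf_name': 'Drug C', 'total': 2},
    {'bnf_name': 'Drug A', 'total': 5},
]

# Sum of total items over all names in the group.
group_total = sum(record['total'] for record in toy_group)

# Largest total first, then alphabetical name to break ties.
best = min(toy_group, key=lambda r: (-r['total'], r['bnf_name']))

summary = (best['bnf_name'], best['total'], best['total'] / group_total)
print(summary)
```

Negating the total inside the key lets a single `min` call handle both the maximum and the tie-break.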

In [114]:
def calc_fields(group, data):
    # group is a key of grouped_post_code, i.e. a tuple (post_code,).
    # data is the list of dictionaries stored under that key.
    # Returns (post_code, bnf_name, max_proportion).

    # Largest total first; ties go to the alphabetically first bnf_name.
    arg_max = min(data, key=lambda x: (-x['total'], x['bnf_name']))

    post_code = group[0]
    bnf_name = arg_max['bnf_name']
    bnf_name_total = arg_max['total']  # numerator of the max proportion
    total_in_post_code = sum(d['total'] for d in data)  # denominator

    return (post_code, bnf_name, bnf_name_total / total_in_post_code)

In [115]:
calc_fields(('DA11 8BZ',), grouped_post_code[('DA11 8BZ',)])

('DA11 8BZ', 'Amoxicillin_Cap 500mg', 0.021502698215026983)

In [117]:
most_item_data_by_post = [calc_fields(group, data) for group, data in grouped_post_code.items()]

In [118]:
most_item_data_by_post 

[('TW13 4GU', 'Omeprazole_Cap E/C 20mg', 0.03073937908496732),
 ('SS0 7AF', 'Omeprazole_Cap E/C 20mg', 0.02118781584681719),
 ('NE37 2PU', 'Paracet_Tab 500mg', 0.02777391304347826),
 ('LE10 1DS', 'Aspirin Disper_Tab 75mg', 0.02211411776629168),
 ('S63 9EH', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.03003995745537126),
 ('M30 0NU', 'Omeprazole_Cap E/C 20mg', 0.03824666953158573),
 ('BB9 7SR', 'Omeprazole_Cap E/C 20mg', 0.023833193804939305),
 ('TS1 2NX', 'Paracet_Tab 500mg', 0.027549713373789975),
 ('TS23 2DG', 'Paracet_Tab 500mg', 0.025532452758642483),
 ('WN3 5HL', 'Omeprazole_Cap E/C 20mg', 0.046486243539118434),
 ('W10 6DZ', 'Amoxicillin_Cap 500mg', 0.029451549434333497),
 ('CW7 1AT', 'Omeprazole_Cap E/C 20mg', 0.038342136965990176),
 ('BB2 1AX', 'Omeprazole_Cap E/C 20mg', 0.03645501521908402),
 ('DN34 4GB', 'Omeprazole_Cap E/C 20mg', 0.03894778497490263),
 ('CV1 4FS', 'Omeprazole_Cap E/C 20mg', 0.02988443966675625),
 ('LA1 1PN', 'Omeprazole_Cap E/C 20mg', 0.03593535438892997),
 ...

Now, we are ready to:
1. calculate the ratio (the amount dispensed as proportion of total)
1. [sort](https://docs.python.org/3/howto/sorting.html) alphabetically by the post code
1. format the answer as a list of tuples
1. take only the first 100 tuples
1. submit to the grader
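
The sorting and slicing steps can be sketched on a few made-up result tuples. Since tuples compare element by element, a plain `sorted` call orders them by post code first.

```python
# Made-up results in arbitrary order.
toy_results = [
    ('TS1 2NX', 'Drug A', 0.03),
    ('B11 4BW', 'Drug B', 0.04),
    ('M30 0NU', 'Drug A', 0.02),
]

# The post code is the first element, so it decides the order.
first_two = sorted(toy_results)[:2]
print([row[0] for row in first_two])  # ['B11 4BW', 'M30 0NU']
```

An explicit `key=lambda row: row[0]` would make the intent clearer if the first elements were not unique.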

In [119]:
items_by_region = sorted(most_item_data_by_post)[:100]

In [120]:
len(items_by_region)

100

In [121]:
# items_by_region = [('B11 4BW', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.0341508247)] * 100

In [122]:
grader.score.pw__items_by_region(items_by_region)

Your score: 1.000


*Copyright &copy; 2021 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*