# Importing ARCOS Data with Dask

Last week, we used dask to play with a few datasets to get a feel for how dask works. In order to help us develop code that would run quickly, however, we worked with very small, safe datasets. 

Today, we will continue to work with dask, but this time using much larger datasets. This means that (a) doing things incorrectly may lead to your computer crashing (so save all your open files before you start!), and (b) many of the commands you are being asked to run will take several minutes each. 

For familiarity, and so you can see what advantages dask can bring to your workflow, today we'll be working with the DEA ARCOS drug shipment database published by the Washington Post! However, to strike a balance between size and speed, we'll be working with a slightly thinned version that has only the last two years of data, instead of all six.

## Exercise 1

Download the thinned ARCOS data [from this link](https://www.dropbox.com/s/o7nc6yvrwog4ozi/arcos_2011_2012.tsv.zip?dl=0). It should be about 2 GB zipped, 25 GB unzipped. 

## Exercise 2

Our goal today is to find the pharmaceutical company that has shipped the most opioids (`MME_Conversion_Factor * CALC_BASE_WT_IN_GM`) in the US.

When working with large datasets, it is good practice to begin by prototyping your code with a subset of your data. So begin by using `pandas` to read in the first 100,000 lines of the ARCOS data and write pandas code to compute the shipments from each shipper (the group that reported the shipment). 

In [2]:
import pandas as pd

pd.set_option('mode.copy_on_write', True)

file = "arcos_terminal/arcos_2011_2012.tsv"

# Read only the first 100,000 rows so the prototype stays fast
data = pd.read_csv(file, sep="\t", nrows=100_000)

data.sample(10)

Unnamed: 0.1,Unnamed: 0,REPORTER_DEA_NO,REPORTER_BUS_ACT,REPORTER_NAME,REPORTER_ADDL_CO_INFO,REPORTER_ADDRESS1,REPORTER_ADDRESS2,REPORTER_CITY,REPORTER_STATE,REPORTER_ZIP,...,Product_Name,Ingredient_Name,Measure,MME_Conversion_Factor,Combined_Labeler_Name,Revised_Company_Name,Reporter_family,dos_str,date,year
547,2178,PB0020052,DISTRIBUTOR,"KPH HEALTHCARE SERVICES, INC.",KINNEY DRUGS WAREHOUSE,520 EAST MAIN ST.,,GOUVERNEUR,NY,13642,...,HYDROCODONE BITARTRATE AND ACETA 7.5,HYDROCODONE BITARTRATE HEMIPENTAHYDRATE,TAB,1.0,"Actavis Pharma, Inc.","Allergan, Inc.","KPH Healthcare Services, Inc.",7.5,2011-09-13,2011
2907,11345,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,OXYCODONE.HCL/APAP 10MG/325MG TABS,OXYCODONE HYDROCHLORIDE,TAB,1.5,"Actavis Pharma, Inc.","Allergan, Inc.",Cardinal Health,10.0,2011-09-20,2011
5397,20445,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,"OXYCODONE HCL 15MG TABLETS, 100 CT",OXYCODONE HYDROCHLORIDE,TAB,1.5,"Actavis Pharma, Inc.","Allergan, Inc.",Cardinal Health,15.0,2012-09-20,2012
5883,21357,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,OXYCODONE.HCL/APAP 7.5MG/325MG TABS,OXYCODONE HYDROCHLORIDE,TAB,1.5,"Actavis Pharma, Inc.","Allergan, Inc.",Cardinal Health,7.5,2011-05-27,2011
6874,24655,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,HYDROCODONE BIT./ACETAMINOPHEN TABS.,HYDROCODONE BITARTRATE HEMIPENTAHYDRATE,TAB,1.0,Amneal Pharmaceuticals LLC,"Amneal Pharmaceuticals, Inc.",Cardinal Health,5.0,2011-07-01,2011
2427,9883,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,HYDROCODONE BIT/ACETA 7.5MG/325MG US,HYDROCODONE BITARTRATE HEMIPENTAHYDRATE,TAB,1.0,SpecGx LLC,Mallinckrodt,Cardinal Health,7.5,2011-12-19,2011
8969,32430,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,OXYCODONE HCI 20 MG TABLETS USP,OXYCODONE HYDROCHLORIDE,TAB,1.5,"KVK-Tech, Inc.","KVK-Tech, Inc.",Cardinal Health,20.0,2012-06-06,2012
3339,12671,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,LORCET HYD.BIT10MG/ACET650MG TAB,HYDROCODONE BITARTRATE HEMIPENTAHYDRATE,TAB,1.0,"Forest Laboratories, Inc.","Forest Laboratories, Inc.",Cardinal Health,10.0,2011-09-07,2011
2189,8943,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,OXYCODONE HCL 30MG USP TABLETS,OXYCODONE HYDROCHLORIDE,TAB,1.5,Par Pharmaceutical,"Endo Pharmaceuticals, Inc.",Cardinal Health,30.0,2012-12-11,2012
5210,19801,PC0003044,DISTRIBUTOR,"CARDINAL HEALTH 110, LLC",,6012 EAST MOLLOY RD,,SYRACUSE,NY,13211,...,OXYCODONE HCL/ACETAMINOPHEN 5MG/325M,OXYCODONE HYDROCHLORIDE,TAB,1.5,SpecGx LLC,Mallinckrodt,Cardinal Health,5.0,2011-10-04,2011
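
A minimal sketch of the prototype calculation, run here on a tiny made-up table so it is self-contained. The column names come from the ARCOS header above, but treating `Reporter_family` as "the shipper" is an assumption, and the toy values are invented purely for illustration. On the real sample you would apply the same three steps to `data`.

```python
import pandas as pd

# Toy stand-in for the 100,000-row sample (values are hypothetical).
sample = pd.DataFrame(
    {
        "Reporter_family": ["Family A", "Family B", "Family A", "Family B"],
        "MME_Conversion_Factor": [1.0, 1.5, 1.5, 1.0],
        "CALC_BASE_WT_IN_GM": [2.0, 2.0, 4.0, 1.0],
    }
)

# Morphine milligram equivalents for each shipment.
sample["mme"] = sample["MME_Conversion_Factor"] * sample["CALC_BASE_WT_IN_GM"]

# Total shipments per reporter, largest first.
shipments = (
    sample.groupby("Reporter_family")["mme"].sum().sort_values(ascending=False)
)
print(shipments)
```

Grouping on `REPORTER_NAME` instead would split a parent company across its individual warehouses, so which column is "right" depends on the question you are asking.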


## Exercise 3

Now let's turn to dask. Re-write your code for dask, and calculate the total shipments by reporting company. Remember: 

- Activate a conda environment with a clean dask installation.
- Start by spinning up a distributed cluster.
- Dask can't split a compressed file into chunks to read in parallel, so you have to unzip your ARCOS data first. 
- Start your cluster in a cell all by itself since you don't want to keep re-running the "start a cluster" code. 

If you need to review dask basic code, [check here](https://nickeubank.github.io/practicaldatascience_book/notebooks/PDS_not_yet_in_coursera/30_big_data/70_dask.html).

As you run your code, make sure to click on the Dashboard link below where you created your cluster:

![dask_dashboard](images/dask_cluster.png)

Among other things, the bar across the bottom should give you a sense of how long your task will take:

![dask_progress](images/dask_progress.png)

(For context, my computer, which has 10 cores, only took a couple of seconds. My computer is fast, but most computers should be done within a couple of minutes, tops.)


In [None]:
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

client


Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 11
Total threads: 22
Total memory: 31.43 GiB
Status: running
Using processes: True
Scheduler comm: tcp://127.0.0.1:57271
Per worker (11 workers): Total threads: 2, Memory: 2.86 GiB


## Exercise 4

Now let's calculate, *for each state*, which company shipped the most pills.

Note you will quickly find that you can't sort in dask -- sorting in parallel is *really* tricky! So you'll have to work around that. Do what you need to do on the big dataset first, then compute it all so you get it as a regular pandas dataframe, then finish. 

In [10]:
shipment_df = dd.read_csv(file, sep="\t")

shipment_df

Dask DataFrame Structure (npartitions=393; only column names and inferred dtypes are shown):
Unnamed: 0,REPORTER_DEA_NO,REPORTER_BUS_ACT,REPORTER_NAME,REPORTER_ADDL_CO_INFO,REPORTER_ADDRESS1,REPORTER_ADDRESS2,REPORTER_CITY,REPORTER_STATE,REPORTER_ZIP,REPORTER_COUNTY,BUYER_DEA_NO,BUYER_BUS_ACT,BUYER_NAME,BUYER_ADDL_CO_INFO,BUYER_ADDRESS1,BUYER_ADDRESS2,BUYER_CITY,BUYER_STATE,BUYER_ZIP,BUYER_COUNTY,TRANSACTION_CODE,DRUG_CODE,NDC_NO,DRUG_NAME,QUANTITY,UNIT,ACTION_INDICATOR,ORDER_FORM_NO,CORRECTION_NO,STRENGTH,TRANSACTION_DATE,CALC_BASE_WT_IN_GM,DOSAGE_UNIT,TRANSACTION_ID,Product_Name,Ingredient_Name,Measure,MME_Conversion_Factor,Combined_Labeler_Name,Revised_Company_Name,Reporter_family,dos_str,date,year
int64,string,string,string,float64,string,float64,string,string,int64,string,string,string,string,string,string,string,string,string,int64,string,string,int64,int64,string,float64,float64,float64,float64,float64,float64,int64,float64,float64,int64,string,string,string,float64,string,string,string,float64,string,int64
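
One possible way to finish in pandas once the heavy lifting is done: suppose the dask group-by over state and company has already been computed into a small pandas DataFrame. The sketch below picks the top company per state from such a table. The table itself is invented, and using `BUYER_STATE`, `Reporter_family`, and `DOSAGE_UNIT` as the state, company, and pill-count columns is an assumption about how you framed the problem.

```python
import pandas as pd

# Hypothetical result of computing a dask group-by on
# ["BUYER_STATE", "Reporter_family"] and resetting the index.
state_company = pd.DataFrame(
    {
        "BUYER_STATE": ["NY", "NY", "NC", "NC", "NC"],
        "Reporter_family": [
            "Company A", "Company B", "Company A", "Company B", "Company C",
        ],
        "DOSAGE_UNIT": [100.0, 250.0, 300.0, 50.0, 120.0],
    }
)

# Sort so the biggest shipper comes first, then keep the first row
# seen for each state. This table is tiny, so sorting is cheap.
top_by_state = (
    state_company.sort_values("DOSAGE_UNIT", ascending=False)
    .drop_duplicates("BUYER_STATE")
    .sort_values("BUYER_STATE")
    .reset_index(drop=True)
)
print(top_by_state)
```

Because the aggregated table has at most one row per state-company pair, it fits comfortably in memory even when the raw data does not.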


Does this seem like a situation where a single company is responsible for the opioid epidemic?

## Exercise 5 

Now go ahead and try to redo the chunking you did by hand for your project (with these two years of data) -- calculate, for each year, the total morphine equivalents sent to each county in the US. 
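
As a reminder of what "chunking by hand" involves, here is a small pandas sketch: read a few rows at a time, aggregate each chunk, then combine the partial sums. It reads from an in-memory string rather than the real file, and the rows, the chunk size, and the choice of buyer columns for "county" are all assumptions made for illustration. A dask group-by on the same keys replaces the whole loop.

```python
import io

import pandas as pd

# Tiny made-up TSV standing in for the ARCOS file.
tsv = (
    "year\tBUYER_STATE\tBUYER_COUNTY\tMME_Conversion_Factor\tCALC_BASE_WT_IN_GM\n"
    "2011\tNC\tDURHAM\t1.0\t2.0\n"
    "2011\tNC\tDURHAM\t1.5\t2.0\n"
    "2011\tNC\tWAKE\t1.0\t1.0\n"
    "2012\tNC\tDURHAM\t1.5\t4.0\n"
    "2012\tNC\tDURHAM\t1.0\t1.0\n"
)

keys = ["year", "BUYER_STATE", "BUYER_COUNTY"]
partials = []

# Process two rows at a time and keep only the per-chunk sums.
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=2):
    chunk["mme"] = chunk["MME_Conversion_Factor"] * chunk["CALC_BASE_WT_IN_GM"]
    partials.append(chunk.groupby(keys)["mme"].sum())

# The same county-year can appear in several chunks, so sum again.
county_year = pd.concat(partials).groupby(level=[0, 1, 2]).sum()
print(county_year)
```

The second `groupby` is the step that is easy to forget when chunking by hand, and it is exactly the bookkeeping dask does for you.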

## Exercise 6

Now, re-write your opioid project's initial opioid import using dask. Each person on your team should create a NEW branch to try this. The person who wrote the initial chunking code can help everyone else understand what they did originally and the data, but everyone should write their own code. 

**WARNING:** You will probably run into a lot of type errors (depending on how the ARCOS data has changed since last year). With real-world messy data, one of the biggest problems with dask is that it struggles if, halfway through the dataset, it discovers that a column it *thought* was floats contains text. That's why, in the dask reading, [I specified the column type for so many columns](https://nickeubank.github.io/practicaldatascience_book/notebooks/PDS_not_yet_in_coursera/30_big_data/70_dask.html#what-can-dask-do-for-me) as `object` explicitly. Then, because there are occasionally data cleanliness issues, I had to convert some data types by hand. 
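
To make the pattern concrete, here is a small sketch in plain pandas: force the troublesome columns to `object` at read time, then convert by hand afterwards. `dd.read_csv` accepts the same `dtype` argument, so the idea should carry over. The data below is invented, including the stray text value. Which ARCOS columns actually need this treatment is something you will have to discover from your own error messages.

```python
import io

import pandas as pd

# Made-up TSV with one dirty weight and a ZIP code with a leading zero.
tsv = (
    "REPORTER_ZIP\tCALC_BASE_WT_IN_GM\n"
    "13642\t0.5\n"
    "13211\tunknown\n"
    "02139\t1.25\n"
)

# Read both columns as plain Python objects (strings) so nothing is guessed.
df = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    dtype={"REPORTER_ZIP": "object", "CALC_BASE_WT_IN_GM": "object"},
)

# Convert by hand: anything that is not a number becomes NaN.
df["CALC_BASE_WT_IN_GM"] = pd.to_numeric(df["CALC_BASE_WT_IN_GM"], errors="coerce")
print(df)
```

Reading ZIP codes as text has a side benefit: leading zeros survive, which they would not if the column were inferred as an integer.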