# `dask-mongo`

In [1]:
from dask_mongo import read_mongo, to_mongo

from distributed import Client

## Sample AirBnB Listings Dataset

For this demo we will use the [sample AirBnB listings dataset](https://docs.atlas.mongodb.com/sample-data/sample-airbnb/) provided by MongoDB, hosted on a free-tier cluster on MongoDB Atlas. For instructions on loading this dataset into your own cluster, see the [Atlas sample data documentation](https://docs.atlas.mongodb.com/sample-data/#std-label-load-sample-data).

## Read data using `dask-mongo`

- Use `read_mongo` to load the dataset into a `dask.bag`.
- Filter the data using `dask` operations.
- Convert the bag to a dask dataframe.
- Perform a dask groupby operation.


In [2]:
client = Client()
client

Connection method: Cluster object, Cluster type: LocalCluster
Dashboard: http://127.0.0.1:8787/status

Status: running, Using processes: True
Dashboard: http://127.0.0.1:8787/status, Workers: 4
Total threads: 8, Total memory: 16.00 GiB

Comm: tcp://127.0.0.1:54796, Workers: 4
Dashboard: http://127.0.0.1:8787/status, Total threads: 8
Started: Just now, Total memory: 16.00 GiB

Comm: tcp://127.0.0.1:54809, Total threads: 2
Dashboard: http://127.0.0.1:54811/status, Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:54799
Local directory: /Users/ncclementi/Documents/git/my_forks/dask-mongo/examples/dask-worker-space/worker-yoz1bi1j

Comm: tcp://127.0.0.1:54808, Total threads: 2
Dashboard: http://127.0.0.1:54810/status, Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:54801
Local directory: /Users/ncclementi/Documents/git/my_forks/dask-mongo/examples/dask-worker-space/worker-nyu3bxy1

Comm: tcp://127.0.0.1:54802, Total threads: 2
Dashboard: http://127.0.0.1:54804/status, Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:54800
Local directory: /Users/ncclementi/Documents/git/my_forks/dask-mongo/examples/dask-worker-space/worker-ivsi7xtu

Comm: tcp://127.0.0.1:54803, Total threads: 2
Dashboard: http://127.0.0.1:54806/status, Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:54798
Local directory: /Users/ncclementi/Documents/git/my_forks/dask-mongo/examples/dask-worker-space/worker-tkwec12s

In [3]:
# Replace this with your own connection URI
host_uri = "mongodb+srv://<username>:<password>@<cluster-address>/myFirstDatabase?retryWrites=true&w=majority"
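If the username or password contains characters such as `@` or `:`, they have to be percent-escaped before they go into the URI. The following is a minimal sketch of one way to build the same kind of URI; the helper name `build_host_uri`, the credentials, and the cluster address are all hypothetical, and in practice the values would typically come from somewhere like environment variables rather than being typed into the notebook.

```python
from urllib.parse import quote_plus


def build_host_uri(user, password, cluster, database="myFirstDatabase"):
    """Build a mongodb+srv URI, percent-escaping the credentials.

    Hypothetical helper for illustration; not part of dask-mongo.
    """
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{cluster}/{database}?retryWrites=true&w=majority"
    )


# Made-up credentials and cluster address, for illustration only.
example_uri = build_host_uri("demo_user", "p@ss:word", "cluster0.example.mongodb.net")
print(example_uri)
```

The resulting string can be passed as `host_uri` in the cells that follow.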

In [4]:
b = read_mongo(connection_kwargs={"host": host_uri}, 
                database="sample_airbnb", 
                collection="listingsAndReviews",
                chunksize=500)
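To get a feel for what `chunksize=500` means for the resulting bag, here is a rough sketch of the arithmetic. It assumes the collection is split into consecutive chunks of at most `chunksize` documents, which is an assumption about how `read_mongo` partitions rather than something this notebook confirms. The document count of 5555 is the sum of the `property_type` counts shown later in this notebook, and the helper name `partition_sizes` is hypothetical.

```python
def partition_sizes(n_documents, chunksize):
    """Sizes of consecutive chunks holding at most `chunksize` documents each."""
    return [
        min(chunksize, n_documents - start)
        for start in range(0, n_documents, chunksize)
    ]


sizes = partition_sizes(5555, 500)
print(len(sizes), sizes[-1])  # prints "12 55"
```

Under that assumption the bag would have 12 partitions, which agrees with the `npartitions=12` reported for the dataframe further down.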

Let's take a look at the first record of our dataset:

In [5]:
b.take(1)

({'_id': '10006546',
  'listing_url': 'https://www.airbnb.com/rooms/10006546',
  'name': 'Ribeira Charming Duplex',
  'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.',
  'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets w

In [6]:
b.pluck("property_type").frequencies().compute()

[('House', 606),
 ('Apartment', 3626),
 ('Condominium', 399),
 ('Loft', 142),
 ('Guesthouse', 50),
 ('Hostel', 34),
 ('Serviced apartment', 185),
 ('Bed and breakfast', 69),
 ('Treehouse', 1),
 ('Bungalow', 14),
 ('Guest suite', 81),
 ('Townhouse', 108),
 ('Villa', 32),
 ('Cabin', 15),
 ('Other', 18),
 ('Chalet', 2),
 ('Farm stay', 9),
 ('Boutique hotel', 53),
 ('Boat', 2),
 ('Cottage', 20),
 ('Earth house', 1),
 ('Aparthotel', 23),
 ('Resort', 11),
 ('Tiny house', 7),
 ('Nature lodge', 2),
 ('Hotel', 26),
 ('Casa particular (Cuba)', 9),
 ('Barn', 1),
 ('Hut', 1),
 ('Camper/RV', 2),
 ('Heritage hotel (India)', 1),
 ('Pension (South Korea)', 1),
 ('Campsite', 1),
 ('Houseboat', 1),
 ('Castle', 1),
 ('Train', 1)]
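For readers new to `dask.bag`, the `pluck(...).frequencies()` chain above can be mimicked on a small in-memory list with plain Python. This is only an illustration of what the two steps compute, not how dask implements them, and the four records below are made up.

```python
from collections import Counter

# Made-up records standing in for MongoDB documents.
records = [
    {"property_type": "House"},
    {"property_type": "Apartment"},
    {"property_type": "Apartment"},
    {"property_type": "Loft"},
]

# pluck("property_type") keeps a single field from every record ...
property_types = [record["property_type"] for record in records]

# ... and frequencies() counts how often each distinct value occurs.
counts = Counter(property_types)

# Sorting by count, largest first, makes the result easier to read.
sorted_counts = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(sorted_counts)  # prints [('Apartment', 2), ('House', 1), ('Loft', 1)]
```

The dask output above is unsorted; `Bag.frequencies` accepts a `sort=True` argument if a sorted result is preferred.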

### Filter and flatten the data into a dataframe-friendly shape

Our records contain plenty of nested, unstructured information. Let's pick out a few useful fields and get them into a dask dataframe. We will flatten the data so that Pandas-style operations make sense for it.

In [7]:
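def process(record):
    """Yield one flat dict per record, or nothing if a field is missing."""
    try:
        yield {
            "accomodates": record["accommodates"],
            "bedrooms": record["bedrooms"],
            # Convert the stored price to a plain float via its string form
            "price": float(str(record["price"])),
            "country": record["address"]["country"],
        }
    except KeyError:
        # Skip records that lack any of the fields above
        pass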

In [8]:
# Keep only apartments
b_flattened = b.filter(lambda record: record["property_type"] == "Apartment").map(process).flatten()
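The reason `process` is a generator followed by `.flatten()` is that a record with a missing field yields nothing and so disappears from the result, instead of producing a `None`. The sketch below reproduces the same filter, map, flatten pipeline with plain Python iterators on three made-up records. `Decimal` is used here as a stand-in for whatever non-float type the stored price has, which is an assumption.

```python
from decimal import Decimal
from itertools import chain


def process(record):
    # Same shape as the notebook's function: yield one flat dict, or nothing.
    try:
        yield {
            "accomodates": record["accommodates"],
            "bedrooms": record["bedrooms"],
            "price": float(str(record["price"])),
            "country": record["address"]["country"],
        }
    except KeyError:
        pass


# Made-up records for illustration.
records = [
    {"property_type": "Apartment", "accommodates": 4, "bedrooms": 1,
     "price": Decimal("317.00"), "address": {"country": "Brazil"}},
    # No "bedrooms" key: process() yields nothing for this record.
    {"property_type": "Apartment", "accommodates": 2,
     "price": Decimal("80.00"), "address": {"country": "Spain"}},
    # Not an apartment: removed by the filter step.
    {"property_type": "House", "accommodates": 6, "bedrooms": 3,
     "price": Decimal("250.00"), "address": {"country": "Canada"}},
]

apartments = filter(lambda r: r["property_type"] == "Apartment", records)
# chain.from_iterable plays the role of Bag.flatten here.
flattened = list(chain.from_iterable(map(process, apartments)))
print(flattened)
```

Only the first record survives: the second is dropped by the `KeyError` handler and the third by the filter.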

In [9]:
b_flattened.take(3)

({'accomodates': 4, 'bedrooms': 1, 'price': 317.0, 'country': 'Brazil'},
 {'accomodates': 1, 'bedrooms': 1, 'price': 40.0, 'country': 'United States'},
 {'accomodates': 2, 'bedrooms': 1, 'price': 701.0, 'country': 'Brazil'})

Now we can convert the bag into a dataframe using `to_dataframe` and perform some operations. 

In [10]:
ddf = b_flattened.to_dataframe()

In [11]:
ddf

,accomodates,bedrooms,price,country
npartitions=12,,,,
,int64,int64,float64,object
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


In [12]:
ddf.head()

,accomodates,bedrooms,price,country
0,4,1,317.0,Brazil
1,1,1,40.0,United States
2,2,1,701.0,Brazil
3,2,1,135.0,United States
4,4,1,119.0,Brazil


### Groupby operation

Let's group the listings by country and compute the average price per country.

In [13]:
ddf.groupby(["country"])["price"].mean().compute()

country
Australia        168.174174
Brazil           485.767033
Canada            84.860814
Hong Kong        684.622120
Portugal          66.112272
Spain             91.846442
Turkey           366.143552
United States    137.884228
China            448.300000
Name: price, dtype: float64
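As a sanity check on what the groupby computes, the same per-country mean can be worked out by hand. The sketch below uses only the five rows shown by `ddf.head()` above, so its numbers differ from the full-dataset result; it illustrates the operation and is not a replacement for the dask call.

```python
from collections import defaultdict
from statistics import mean

# The five rows shown by ddf.head(), reduced to the two columns used here.
rows = [
    {"country": "Brazil", "price": 317.0},
    {"country": "United States", "price": 40.0},
    {"country": "Brazil", "price": 701.0},
    {"country": "United States", "price": 135.0},
    {"country": "Brazil", "price": 119.0},
]

# Group the prices by country ...
prices_by_country = defaultdict(list)
for row in rows:
    prices_by_country[row["country"]].append(row["price"])

# ... then average each group.
mean_price = {country: mean(prices) for country, prices in prices_by_country.items()}
print(mean_price)  # prints {'Brazil': 379.0, 'United States': 87.5}
```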

## Write data to a MongoDB database using `dask-mongo`

- Convert the dask dataframe to a dask bag.
- Use `to_mongo` to write to the desired database.

In this example we will convert the dask dataframe we just created into a bag and write it to a new database in our MongoDB Atlas cluster.

In [14]:
import pandas as pd

In [15]:
import dask.bag as db

In [16]:
new_bag = db.from_delayed(ddf.map_partitions(lambda df: df.to_dict(orient="records")).to_delayed())

In [17]:
new_bag.take(1)

({'accomodates': 4, 'bedrooms': 1, 'price': 317.0, 'country': 'Brazil'},)
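The conversion above relies on `to_dict(orient="records")`, which turns each partition (a pandas dataframe) into a list of dictionaries, one per row. A small pandas-only sketch of that single step, using the first two rows shown earlier as sample data:

```python
import pandas as pd

# A stand-in for one partition of the dask dataframe.
partition = pd.DataFrame(
    {
        "accomodates": [4, 1],
        "bedrooms": [1, 1],
        "price": [317.0, 40.0],
        "country": ["Brazil", "United States"],
    }
)

# One dict per row, keyed by column name.
records = partition.to_dict(orient="records")
print(records[0])
```

Each dictionary has the shape of the document that ends up in the new collection. Recent versions of dask also offer `ddf.to_bag(format="dict")`, which may be a shorter route to the same bag, depending on the installed version.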

In [18]:
to_mongo(new_bag,  
         database='new_database', 
         collection='new_collection',
         connection_kwargs={"host": host_uri})