## Demo Notebook
# Coiled & MongoDB for Large-Scale NLP Analysis


This notebook walks through a basic NLP workflow to illustrate how we can use Coiled and MongoDB for large-scale NLP analyses.

1. Load a toy dataset (the Airbnb sample dataset from MongoDB)
2. Apply some NLP preprocessing using NLTK and spaCy
3. Create vectors for ML using Dask-ML
4. Train an XGBoost Classifier

## Launch Coiled Cluster

In [1]:
import coiled

In [2]:
cluster = coiled.Cluster(
    name="dask-nlp-mongodb",
    software="rrpelgrim/dask-nlp-mongo",
    n_workers=20,
    shutdown_on_close=False,
    scheduler_options={'idle_timeout':'2 hours'}
)

Found software environment build
Created FW rules: coiled-dask-rrpelgr71-70484-firewall
Created scheduler VM: coiled-dask-rrpelgr71-70484-scheduler (type: t3.medium, ip: ['3.231.215.132'])


In [4]:
from dask.distributed import Client

client = Client(cluster)
client


+---------+---------------+---------------+---------------+
| Package | client        | scheduler     | workers       |
+---------+---------------+---------------+---------------+
| numpy   | 1.20.3        | 1.21.4        | 1.21.4        |
| python  | 3.9.7.final.0 | 3.9.0.final.0 | 3.9.0.final.0 |
+---------+---------------+---------------+---------------+


Client
Connection method: Cluster object
Cluster type: coiled.Cluster
Dashboard: http://3.231.215.132:8787

Cluster info
Workers: 19
Total threads: 38
Total memory: 143.63 GiB

Scheduler
Comm: tls://10.4.3.77:8786
Dashboard: http://10.4.3.77:8787/status
Started: 3 minutes ago

Workers
19 workers, each with 2 threads and 7.56 GiB of memory (per-worker Comm, Dashboard, Nanny, and local directory details omitted)


## Read Data from MongoDB

In [5]:
from dask_mongo import read_mongo
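import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee `urllib.parse` is loaded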

In [6]:
# Replace the username, password, and cluster address with your own connection details
host_uri = "mongodb+srv://richard:" + urllib.parse.quote("Rp@976559MO") + "@cluster0.ffttf.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"
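The password is percent-encoded because characters such as `@`, `:`, and `/` have a reserved meaning inside a connection URI. A minimal sketch of that step follows; the helper name `build_mongo_uri` and the credentials and host shown are hypothetical placeholders, not values from this notebook.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, database):
    """Return a mongodb+srv URI with percent-encoded credentials (illustrative helper)."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/{database}?retryWrites=true&w=majority"


# Hypothetical credentials and host, for illustration only
example_uri = build_mongo_uri(
    "demo_user", "p@ss:word/1", "cluster0.example.mongodb.net", "myFirstDatabase"
)
print(example_uri)
```

After encoding, the only literal `@` left in the URI is the one separating the credentials from the host.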

In [7]:
bag = read_mongo(
    connection_kwargs={"host": host_uri},
    database="sample_airbnb",
    collection="listingsAndReviews",
    chunksize=500,
)

In [9]:
bag.take(1)

({'_id': '10006546',
  'listing_url': 'https://www.airbnb.com/rooms/10006546',
  'name': 'Ribeira Charming Duplex',
  'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.',
  'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets w

This is a LOT of information.

Let's boil this down to something simple for this demo. Let's say we want to use the Description text to predict the Review Rating.

Below we define a processing function that extracts only the relevant fields from each record. We then select only the listings whose property type is Apartment, flatten the data structure, and turn it into a Dask DataFrame.

In [54]:
def process(record):
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
            #"accommodates": record["accommodates"],
            #"bedrooms": record["bedrooms"],
            #"price": float(str(record["price"])),
            #"country": record["address"]["country"],
        }
    except KeyError:
        pass

In [55]:
# Filter only apartments
b_flattened = (
    bag.filter(lambda record: record["property_type"] == "Apartment")
    .map(process)
    .flatten()
)
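To see what the filter, map, and flatten chain does to individual records, here is a plain-Python sketch of the same logic on three made-up records. The records are hypothetical, and `itertools.chain.from_iterable` stands in for the bag's `.flatten()`; this illustrates the logic only and is not how Dask executes it.

```python
from itertools import chain


def process(record):
    """Yield one flat dict per record, or nothing if a required key is missing."""
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
        }
    except KeyError:
        pass


# Hypothetical toy records, shaped like the MongoDB documents
records = [
    {"property_type": "Apartment", "description": "Cozy flat",
     "review_scores": {"review_scores_rating": 95}},
    {"property_type": "House", "description": "Big house",
     "review_scores": {"review_scores_rating": 90}},
    {"property_type": "Apartment", "description": "No reviews yet",
     "review_scores": {}},
]

apartments = filter(lambda r: r["property_type"] == "Apartment", records)
generators = map(process, apartments)              # one generator per record
flattened = list(chain.from_iterable(generators))  # empty generators vanish
print(flattened)
```

Because `process` is a generator, a record with a missing key yields nothing, so flattening drops it silently instead of raising an error.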

In [56]:
b_flattened.take(3)

({'description': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway.  The neighborhood is diverse and appeals to a variety of people.',
  'review_rating': 100},
 {'description': "Murphy bed, optional second bedroom available. Wifi available, Hulu, Netflix, TV Eat-in kitchen. Bathroom with great shower/bath.  Washer/dryer in basement. New York City! Great neighborhood - many terrific restaurants, bakeries, bagelries. Within easy walking distance are restaurants with the cuisines from India, Thailand, Japan, China, Mexico, South America and Europe.  As well as the many small independent stores that line Broadway, there chain stores such as Urban Outfitters (clothing), Whole Foods (groceries), Sephora (cosmetics), Michaels (crafts), and Modell's (sporting goods). Equidistant to Central Park and Riverside Park which have walking/running/biking trails as well as tennis and racquet ball courts. 10-15 blocks from C

This works. Let's now transform this into a Dask DataFrame.

In [57]:
ddf = b_flattened.to_dataframe()

In [58]:
ddf

               description review_rating
npartitions=12
                    object         int64
                       ...           ...
...                    ...           ...
                       ...           ...
                       ...           ...


In [60]:
ddf.review_rating.value_counts().compute()

100    636
96     185
97     182
93     179
98     174
95     166
90     149
94     141
80     116
92     112
99      94
91      80
89      78
87      70
88      53
85      40
86      36
84      28
83      28
60      27
70      16
82      12
20       9
81       8
76       8
75       8
78       7
73       6
40       5
79       5
77       4
72       4
71       4
74       3
65       3
69       2
67       2
50       1
Name: review_rating, dtype: int64

Let's write this to our S3 bucket as a Parquet file.

In [61]:
# ddf.to_parquet(
#     's3://coiled-datasets/airbnb-monogo/description-and-ratings.parquet',
#     engine="pyarrow",
# )

[None]

In [62]:
ddf.head()

                                         description  review_rating
0  Here exists a very cozy room for rent in a sha...            100
1  Murphy bed, optional second bedroom available....             94
2  The Apartment has a living room, toilet, bedro...             98
3  Loft Suite Deluxe @ Henry Norman Hotel Located...             88
4  Clean, fully furnish, Spacious 1 bedroom flat ...            100


Now we're all set to turn this into an ML classification problem.

We'll create a train/test split and then vectorize the description column.

### Create train/test split

In [63]:
from dask_ml.model_selection import train_test_split

In [64]:
X = ddf['description'].to_dask_array(lengths=True)
y = ddf['review_rating'].to_dask_array(lengths=True)

In [65]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.20, 
    random_state=40
)
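As a rough mental model of what a seeded 80/20 split does, here is a plain-Python sketch. The function `simple_train_test_split` is a hypothetical stand-in written for illustration; Dask-ML's `train_test_split` operates on chunked arrays and differs in its details.

```python
import random


def simple_train_test_split(items, labels, test_size=0.2, random_state=40):
    """Shuffle indices with a fixed seed, then slice off the test fraction."""
    rng = random.Random(random_state)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [items[i] for i in train_idx]
    X_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical toy data: ten descriptions with matching ratings
toy_X = [f"description {i}" for i in range(10)]
toy_y = [90 + i for i in range(10)]
toy_split = simple_train_test_split(toy_X, toy_y)
print([len(part) for part in toy_split])
```

Fixing `random_state` makes the split reproducible, and shuffling indices rather than values keeps each description paired with its rating.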

### Vectorize

In [66]:
from dask_ml.feature_extraction.text import HashingVectorizer

HashingVectorizer has some built-in tokenization and preprocessing capabilities we could explore.

We'll just use it out-of-the-box for now.
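For intuition, the hashing trick behind this vectorizer can be sketched in a few lines of plain Python: each token is hashed to a column index in a fixed-width space, so no vocabulary has to be stored or fitted. This is a simplified illustration under stated assumptions, not the library's implementation: it uses MD5 instead of MurmurHash3, and it omits the alternating sign that scikit-learn applies by default. The function names are hypothetical.

```python
import hashlib
import math
import re

N_FEATURES = 2 ** 20  # 1,048,576 columns, the vectorizer's default width


def hash_token(token, n_features=N_FEATURES):
    """Map a token to a column index (MD5 is a stand-in hash for this sketch)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hashing_vectorize(text, n_features=N_FEATURES):
    """Return a sparse {column: value} mapping of L2-normalized token counts."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = {}
    for token in tokens:
        idx = hash_token(token, n_features)
        counts[idx] = counts.get(idx, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {i: v / norm for i, v in counts.items()} if norm else {}


example_vec = hashing_vectorize("Cozy room, cozy flat")
print(len(example_vec), "non-zero columns out of", N_FEATURES)
```

Because the mapping is stateless, every worker can vectorize its own partition independently, which is what makes this approach a good fit for distributed data.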

In [67]:
vect = HashingVectorizer()

In [68]:
X_train_vect = vect.fit_transform(X_train)

In [69]:
X_train_vect

         Array            Chunk
Shape    (nan, 1048576)   (nan, 1048576)
Count    108 Tasks        12 Chunks
Type     float64          scipy.sparse.csr.csr_matrix


Vectorizing produces an array with unknown chunk sizes (the `nan` row counts in the shape above), so we compute them explicitly.

In [70]:
X_train_vect.compute_chunk_sizes()

         Array             Chunk
Shape    (2139, 1048576)   (211, 1048576)
Count    108 Tasks         12 Chunks
Type     float64           scipy.sparse.csr.csr_matrix


In [71]:
X_train_vect.blocks[0].compute()

<176x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 15469 stored elements in Compressed Sparse Row format>

In [72]:
X_train_vect.shape

(2139, 1048576)

Each block of `X_train_vect` is a **scipy.sparse matrix**.

Next, we use this sparse array as input for the distributed XGBoost classifier.
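A quick back-of-the-envelope calculation shows why the sparse representation matters here. It uses the numbers reported for the first block (176 rows, 1,048,576 columns, 15,469 stored elements) and assumes 8-byte float values and 4-byte integer indices for the CSR arrays; the actual index width depends on the SciPy build and matrix size.

```python
# Figures taken from the first block's repr in this notebook
rows, cols, stored = 176, 1_048_576, 15_469

# Dense storage: one 8-byte float64 per cell
dense_bytes = rows * cols * 8

# CSR storage (assumed layout): values + column indices + row pointers
csr_bytes = stored * 8 + stored * 4 + (rows + 1) * 4

density = stored / (rows * cols)

print(f"dense:   {dense_bytes / 2**30:.3f} GiB")
print(f"csr:     {csr_bytes / 2**10:.1f} KiB")
print(f"density: {density:.6%}")
```

Under these assumptions, a single dense block would need about 1.375 GiB, while the CSR block needs roughly 182 KiB.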

## Train an XGBoost Classifier

In [73]:
import xgboost as xgb
from xgboost.dask import DaskXGBClassifier

In [74]:
clf = DaskXGBClassifier()

In [75]:
%%time
clf.fit(X_train_vect, y_train)

AttributeError: divisions not found

The error above is caused by a bug in the XGBoost package. An issue has been raised, and a PR that resolves it is ready to be merged:
https://github.com/dmlc/xgboost/issues/7454

In [None]:
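X_test_vect = vect.transform(X_test)  # vectorize the test set the same way as the training set
proba = clf.predict_proba(X_test_vect)  # predict with the fitted classifier, not the xgb module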