# Process Large NLP Documents in Parallel with dask-mongo
MongoDB is a popular database for storing semistructured data like JSON documents for ease of development and scaling. To manipulate this semistructured data, many data analysts will switch over to a programming language like Python to perform exploratory data analysis and/or build machine learning models. A common pain point in this workflow is that transferring data from a database to a Python session can cause significant overhead.

To remedy this problem, Coiled and MongoDB have partnered to develop dask-mongo: a fast and light Python connector that lets you pull in data from MongoDB in parallel. Dask-mongo empowers data analysts to read from and write to MongoDB collections faster using MongoDB Atlas cloud-hosted and locally hosted installations.

This tutorial will walk you through using the dask-mongo connector to run an NLP workflow that preprocesses JSON data and trains an XGBoost classifier.

Let’s jump in.

## Workflow
This notebook walks through a Natural Language Processing workflow to illustrate how we can use Coiled and MongoDB for large-scale NLP analyses.

Our workflow will look like this:
1. Load the Airbnb sample dataset from MongoDB
2. Flatten the documents into a tabular DataFrame
3. Tokenize the text using NLTK
4. Lemmatize the tokens using spaCy
5. Create vectors for ML using Dask-ML
6. Train an XGBoost classifier

## 1. Launch Cloud Computing Resources with Coiled
The dask-mongo connector speeds up workflows by reading and writing from/to MongoDB in parallel. To test this, let's spin up a Coiled cluster with 10 workers on which we'll run our parallel computations. Coiled clusters are on-demand computational resources hosted in the cloud. Get started with Coiled [here](https://docs.coiled.io/user_guide/getting_started.html).

We'll create and specify the software environment (Docker image) that will be loaded onto each worker; this will ensure we have all the dependencies we need to run our computations in the cloud. We'll also give the Coiled cluster a name so we can connect to it from multiple Python sessions simultaneously.


In [1]:
import coiled

In [2]:
# coiled.create_software_environment(
#     account="coiled-examples",
#     name="dask-mongo",
#     conda="/Users/rpelgrim/Documents/git/coiled-resources/mongodb-with-coiled/environment.yml",
# )

In [3]:
cluster = coiled.Cluster(
    name="dask-nlp",
    software="coiled-examples/dask-mongo",
    n_workers=10,
    shutdown_on_close=False,
    scheduler_options={'idle_timeout':'2 hours'}
)

Output()

Now let's connect our local Python client to the Coiled cluster.

In [4]:
from dask.distributed import Client
client = Client(cluster)

## 2. Read Data in Parallel from MongoDB
Now that our cluster is up, we're all set to read some data from MongoDB. We'll be working with the sample Airbnb dataset that MongoDB provides when you sign up for a free Atlas account.

In [5]:
import getpass
pw = getpass.getpass()

 ···········


In [6]:
import urllib.parse

# Replace the username, password, and cluster address with your own connection details
host_uri = "mongodb+srv://richard:" + urllib.parse.quote(pw) + "@cluster0.ffttf.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"
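Passwords that contain characters such as `@`, `/`, or `:` have to be percent-encoded before they go into a connection string. The sketch below shows one way to do this with the standard library. `build_mongo_uri` and the example host are hypothetical names used only for illustration. It uses `quote_plus`, which also escapes `/`, whereas plain `quote` leaves `/` unescaped by default.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, database):
    """Assemble a mongodb+srv URI with percent-encoded credentials (hypothetical helper)."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?retryWrites=true&w=majority"
    )


# Made-up password and placeholder host, for illustration only
uri = build_mongo_uri(
    "richard", "p@ss/w:rd", "cluster0.example.mongodb.net", "myFirstDatabase"
)
print(uri)
```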

Now let's import the `read_mongo` function from `dask_mongo` and use it to read in the `listingsAndReviews` collection from the `sample_airbnb` database. We'll tell Dask to read the data in chunks of 500 rows per partition.

In [7]:
from dask_mongo import read_mongo

bag = read_mongo(
    connection_kwargs={"host": host_uri},
    database="sample_airbnb",
    collection="listingsAndReviews",
    chunksize=500,
)

In [8]:
# inspect number of partitions
bag

dask.bag<read_mongo, npartitions=12>
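The partition count follows from the collection size and `chunksize`. As a rough sketch, assuming the connector splits the collection into consecutive chunks of at most `chunksize` documents, and assuming a collection of 5,555 documents (an illustrative figure, not taken from the output above), the arithmetic looks like this:

```python
import math


def expected_partitions(n_documents, chunksize):
    """Number of chunks needed to hold n_documents at chunksize documents each."""
    return math.ceil(n_documents / chunksize)


n_docs = 5_555  # assumed collection size, for illustration only
print(expected_partitions(n_docs, 500))
```

A smaller `chunksize` gives more, smaller partitions and therefore more parallel tasks; a larger one gives fewer tasks that each hold more data in memory.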

We’ve read the data into a Dask Bag, which is the Dask object class best suited to semi-structured JSON-like data. You can think of it as a Pythonic implementation of the PySpark RDD. The documents in a Dask Bag are unordered. 

We can then pull 1 record from the Dask Bag using the .take() method:

In [9]:
bag.take(1)

({'_id': '10006546',
  'listing_url': 'https://www.airbnb.com/rooms/10006546',
  'name': 'Ribeira Charming Duplex',
  'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.',
  'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets w

This outputs a very large amount of data, including the listing itself as well as all the available reviews. The data is unwieldy and not in a format we can input to a regular machine learning model. 

### Subset Data 
Let’s use the text in the Description field to predict the Review Rating and limit ourselves to listings of type “Apartment” only.

To do this, we’ll have to flatten the semi-structured JSON into a tabular form. We’ll use the processing function defined below to extract only the relevant information from each record. We'll then keep only the Apartment listings, flatten the data structure, and turn it into a Dask DataFrame.

In [10]:
def process(record):
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
            #"accomodates": record["accommodates"],
            #"bedrooms": record["bedrooms"],
            #"price": float(str(record["price"])),
            #"country": record["address"]["country"],
        }
    except KeyError:
        pass

In [11]:
# Filter only apartments
b_flattened = (
    bag.filter(lambda record: record["property_type"] == "Apartment")
    .map(process)
    .flatten()
)

In [12]:
b_flattened.take(2)

({'description': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway.  The neighborhood is diverse and appeals to a variety of people.',
  'review_rating': 100},
 {'description': "Murphy bed, optional second bedroom available. Wifi available, Hulu, Netflix, TV Eat-in kitchen. Bathroom with great shower/bath.  Washer/dryer in basement. New York City! Great neighborhood - many terrific restaurants, bakeries, bagelries. Within easy walking distance are restaurants with the cuisines from India, Thailand, Japan, China, Mexico, South America and Europe.  As well as the many small independent stores that line Broadway, there chain stores such as Urban Outfitters (clothing), Whole Foods (groceries), Sephora (cosmetics), Michaels (crafts), and Modell's (sporting goods). Equidistant to Central Park and Riverside Park which have walking/running/biking trails as well as tennis and racquet ball courts. 10-15 blocks from C
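The filter, map, and flatten chain above can be reproduced on a plain Python list to see why `process` is written as a generator: a record with a missing key yields nothing, so flattening silently drops it. This is a minimal sketch with made-up records, using `itertools.chain` as a stand-in for the Bag's `.flatten()`.

```python
from itertools import chain


def process(record):
    """Yield one flat dict per record, or nothing if a field is missing."""
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
        }
    except KeyError:
        pass


# Made-up records: the second has no rating, the third is not an apartment
records = [
    {"property_type": "Apartment", "description": "Cozy room",
     "review_scores": {"review_scores_rating": 95}},
    {"property_type": "Apartment", "description": "No reviews yet",
     "review_scores": {}},
    {"property_type": "House", "description": "Big house",
     "review_scores": {"review_scores_rating": 90}},
]

apartments = filter(lambda r: r["property_type"] == "Apartment", records)
flattened = list(chain.from_iterable(map(process, apartments)))
print(flattened)
```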

## 3. Tokenization with NLTK

The data is now in tabular format. In order to input it into a machine learning model, we will have to preprocess the text. This will include tokenization, stop word removal, lemmatization, and vectorization.

Let's tokenize the Description text and remove stop words using the NLTK library.

In [13]:
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer
from functools import partial

In [14]:
# define tokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [15]:
# define set of stopwords
stopword_set = set(stopwords.words('english'))

Let's cast our Dask Bag to a Dask DataFrame for easier manipulation:

In [16]:
ddf = b_flattened.to_dataframe()
ddf.head(2)

| | description | review_rating |
|---|---|---|
| 0 | Here exists a very cozy room for rent in a sha... | 100 |
| 1 | Murphy bed, optional second bedroom available.... | 94 |


And then convert all the description text to lowercase:

In [17]:
ddf['description'] = ddf['description'].str.lower()
ddf.head(2)

| | description | review_rating |
|---|---|---|
| 0 | here exists a very cozy room for rent in a sha... | 100 |
| 1 | murphy bed, optional second bedroom available.... | 94 |


Now let's tokenize the Description texts using the NLTK tokenizer we just created.

We'll be using the `map_partitions` method which maps a function over all the partitions in our Dask DataFrame. Each partition in a Dask DataFrame is a regular pandas DataFrame. If you need a refresher on the basic Dask DataFrame architecture, we recommend reading [this blog](https://coiled.io/blog/what-is-dask/).

In [18]:
def tokenize_partitions(df):
    df['description'] = df['description'].apply(tokenizer.tokenize)
    return df

ddf = ddf.map_partitions(tokenize_partitions)
ddf.head()

| | description | review_rating |
|---|---|---|
| 0 | [here, exists, a, very, cozy, room, for, rent,... | 100 |
| 1 | [murphy, bed, optional, second, bedroom, avail... | 94 |
| 2 | [the, apartment, has, a, living, room, toilet,... | 98 |
| 3 | [loft, suite, deluxe, henry, norman, hotel, lo... | 88 |
| 4 | [clean, fully, furnish, spacious, 1, bedroom, ... | 100 |


The Description column now contains lists of strings (tokens).
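To see what the `\w+` pattern does to a description, here is a minimal sketch that uses the standard library's `re.findall` as a stand-in for NLTK's `RegexpTokenizer`. It is assumed to give comparable tokens for this simple pattern; the sample sentence is made up.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and return every run of word characters as a token."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Here exists a very cozy room, for rent!")
print(tokens)
```

Punctuation never matches `\w+`, so commas and exclamation marks disappear without a separate cleaning step.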

## 4. Lemmatization with spaCy
Now let's proceed to lemmatize our tokens using spaCy. We'll import the library and load the small English language model:

In [19]:
import spacy

In [20]:
# # run this cell if the model is not yet installed in this env
# # this will download the small English model loaded below
# ! python -m spacy download en_core_web_sm

In [21]:
nlp = spacy.load("en_core_web_sm")

We'll now define two functions:
1. One function that lemmatizes each row
2. Another function that applies this to all rows in a Dask DataFrame partition

We'll then map the second function over all the partitions in our Dask DataFrame.

In [22]:
# lemmatize each row
def lemmatize_row(text, nlp=nlp):
    doc = nlp(" ".join(text))
    lemmatized = [token.lemma_ for token in doc]
    return lemmatized

In [23]:
# apply to all rows in partition (each partition is a pandas df)
def apply_lemmatize(df):
    df.description = df.description.apply(lemmatize_row)
    return df

In [24]:
# map lemmatizing function over all partitions
ddf = ddf.map_partitions(apply_lemmatize)

In [25]:
ddf.head()

| | description | review_rating |
|---|---|---|
| 0 | [here, exist, a, very, cozy, room, for, rent, ... | 100 |
| 1 | [murphy, bed, optional, second, bedroom, avail... | 94 |
| 2 | [the, apartment, have, a, living, room, toilet... | 98 |
| 3 | [loft, suite, deluxe, henry, norman, hotel, lo... | 88 |
| 4 | [clean, fully, furnish, spacious, 1, bedroom, ... | 100 |


## 5. Write Data Back to MongoDB

Now that we have processed our raw JSON NLP data into a neat tabular format, we'll want to store it for future use. You can use the `to_mongo` function to write data to your MongoDB database.

You'll first need to convert your Dask DataFrame to a Dask Bag:

In [26]:
import dask.bag as db

# convert dask dataframe back to Dask bag
new_bag = db.from_delayed(
    ddf.map_partitions(lambda x: x.to_dict(orient="records")).to_delayed()
)

new_bag.take(1)

({'description': ['here',
   'exist',
   'a',
   'very',
   'cozy',
   'room',
   'for',
   'rent',
   'in',
   'a',
   'share',
   '4',
   'bedroom',
   'apartment',
   'it',
   'be',
   'locate',
   'one',
   'block',
   'off',
   'of',
   'the',
   'jmz',
   'at',
   'myrtle',
   'broadway',
   'the',
   'neighborhood',
   'be',
   'diverse',
   'and',
   'appeal',
   'to',
   'a',
   'variety',
   'of',
   'people'],
  'review_rating': 100},)
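The `orient="records"` conversion turns each DataFrame row into one dictionary, which is the document shape a MongoDB collection stores. The same reshaping can be sketched in plain Python with made-up column data:

```python
# Column-oriented data, as a DataFrame would hold it (made-up values)
columns = {
    "description": [["cozy", "room"], ["murphy", "bed"]],
    "review_rating": [100, 94],
}

# Row-oriented records: one dict per row, keyed by column name
records = [dict(zip(columns, values)) for values in zip(*columns.values())]
print(records)
```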

And then use `to_mongo` to write the data to MongoDB:

In [None]:
from dask_mongo import to_mongo

to_mongo(
    new_bag,
    database="<your-database>",
    collection="<your-collection>",
    connection_kwargs={"host": host_uri},
)

*Note that because the Airbnb sample data is located on a low-resource Free Tier cluster, the read/write speeds may be suboptimal. Consider upgrading your cluster to improve your query performance.*

## Summary
- You can use the dask-mongo connector to read and write data in parallel between your Python session and a MongoDB Atlas cluster.
- You can use Dask and MongoDB to build an NLP pipeline at scale.
- You can run Dask computations in the cloud using a Coiled cluster. Running Coiled and MongoDB together can lead to performance gains on your queries and analyses.


## Get Started with dask-mongo
To get started with the dask-mongo connector, install it on your local machine using pip or conda.

`pip install dask-mongo`

`conda install dask-mongo -c conda-forge`

You can then run the code in the accompanying notebook yourself to test-drive dask-mongo. For more information, we recommend taking a look at [the Coiled documentation](https://docs.coiled.io/user_guide/examples/mongodb.html). 

Let us know how you get on by tweeting to the development team at [@CoiledHQ](https://twitter.com/CoiledHQ)!

In [28]:
# create train/test splits
from dask_ml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.20, 
    random_state=40,
)

And then pass these into the Dask-ML HashingVectorizer. 

In [29]:
# vectorize 
from dask_ml.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer()
X_train_vect = vect.fit_transform(X_train)

Note that this Vectorizer comes with its own built-in preprocessing functions. We could override these since we have already done our own customized preprocessing above.
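A hashing vectorizer needs no fitted vocabulary: each token is hashed straight to a column index, which is why the vectorized array in this notebook has 1,048,576 (2**20) columns regardless of the text. The sketch below illustrates the idea only. It swaps in MD5 and raw counts for simplicity, whereas scikit-learn's `HashingVectorizer` uses a signed MurmurHash3 and L2 normalization by default; `hashed_index` and `hash_vectorize` are hypothetical helper names.

```python
import hashlib

N_FEATURES = 2**20  # 1,048,576 columns


def hashed_index(token, n_features=N_FEATURES):
    """Map a token to a column index with a deterministic hash (MD5 here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_vectorize(tokens, n_features=N_FEATURES):
    """Return a sparse {column_index: count} mapping for one document."""
    counts = {}
    for token in tokens:
        idx = hashed_index(token, n_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


row = hash_vectorize(["cozy", "room", "cozy"])
print(row)
```

Because only the non-zero columns are stored, each row stays small even though the feature space is very wide, which is the same reason the real output uses sparse matrices.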

In [30]:
# make sure to compute the chunk sizes before training
X_train_vect.compute_chunk_sizes()

| | Array | Chunk |
|---|---|---|
| Shape | (2139, 1048576) | (211, 1048576) |
| Count | 108 Tasks | 12 Chunks |
| Type | float64 | scipy.sparse.csr.csr_matrix |


In [31]:
# compute the chunk sizes of the labels as well
y_train.compute_chunk_sizes()

| | Array | Chunk |
|---|---|---|
| Bytes | 16.71 kiB | 1.65 kiB |
| Shape | (2139,) | (211,) |
| Count | 96 Tasks | 12 Chunks |
| Type | int64 | numpy.ndarray |


In [32]:
# reshape the labels into a single column
y_train = y_train.reshape(2139, 1)

In [33]:
# inspect the reshaped labels
y_train

| | Array | Chunk |
|---|---|---|
| Bytes | 16.71 kiB | 1.65 kiB |
| Shape | (2139, 1) | (211, 1) |
| Count | 108 Tasks | 12 Chunks |
| Type | int64 | numpy.ndarray |


In [34]:
X_train_vect = X_train_vect.persist()
y_train = y_train.persist()

Vectorizing produces a Dask Array with unknown chunk sizes, so we compute them explicitly:

In [36]:
X_train_vect.compute_chunk_sizes()

| | Array | Chunk |
|---|---|---|
| Shape | (2139, 1048576) | (211, 1048576) |
| Count | 108 Tasks | 12 Chunks |
| Type | float64 | scipy.sparse._csr.csr_matrix |


In [37]:
X_train_vect.blocks[0].compute()

<176x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 15469 stored elements in Compressed Sparse Row format>

In [38]:
X_train_vect.shape

(2139, 1048576)

Each block in `X_train_vect` is a **scipy.sparse matrix**.

Now let's use these scipy.sparse blocks as input for the distributed XGBoost classifier.

In [39]:
import xgboost
from xgboost.dask import DaskXGBClassifier

In [41]:
clf = DaskXGBClassifier()

In [42]:
%%time
clf.fit(X_train_vect, y_train)

AttributeError: divisions not found

The error above is caused by a bug in the XGBoost package. An issue has been raised, and a PR that resolves it is ready to be merged:
https://github.com/dmlc/xgboost/issues/7454

In [None]:
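# the classifier is named `clf` above; the test set must be vectorized the same way as the training set
proba = clf.predict_proba(vect.transform(X_test))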

In [18]:
# define processing functions
def extract_description(element):
    return element['description'].lower()

def filter_stopword(word, stopwords):
    return word not in stopwords

def filter_stopwords(tokens, stopwords):
    return list(filter(partial(filter_stopword, stopwords=stopwords), tokens))

### Tokenization with Dask Bags

In [20]:
# get cleaned, tokenized description texts
description_text = b_flattened.map(extract_description)
description_text_tokens = description_text.map(tokenizer.tokenize)
description_text_clean = description_text_tokens.map(partial(filter_stopwords, stopwords=stopword_set))

In [21]:
# verify
description_text_clean.take(1)

(['exists',
  'cozy',
  'room',
  'rent',
  'shared',
  '4',
  'bedroom',
  'apartment',
  'located',
  'one',
  'block',
  'jmz',
  'myrtle',
  'broadway',
  'neighborhood',
  'diverse',
  'appeals',
  'variety',
  'people'],)

Stop words like “here”, “a”, “in”, etc. have been removed and the contents of the Description column have been turned into a list of strings (tokens).
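The stop-word step relies on `functools.partial` to bind the stop-word set, so the resulting one-argument function can be mapped over every record. Here is a minimal sketch with a tiny made-up stop-word set and made-up token lists, using the built-in `map` in place of the Bag's `.map()`:

```python
from functools import partial

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list
STOPWORDS = {"here", "a", "very", "for", "in"}


def filter_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stopwords]


# Bind the stop-word set so the function takes a single argument
clean = partial(filter_stopwords, stopwords=STOPWORDS)

documents = [
    ["here", "exists", "a", "very", "cozy", "room"],
    ["room", "for", "rent", "in", "brooklyn"],
]
cleaned = list(map(clean, documents))
print(cleaned)
```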

In [84]:
# define preprocessing function
## 1. lowercase
## 2. tokenize
## 3. filter stopwords

def filter_stopword(word, stopwords):
    return word not in stopwords

# def filter_stopwords(tokens, stopwords):
#    return list(filter(partial(filter_stopword, stopwords=stopwords), tokens))

def lowercase(element):
    element['description'] = element['description'].lower()
    return element

# we can access the description text with 

def tokenize_desc(element):
    element['description'] = tokenizer.tokenize(element['description'])
    return element

In [107]:
test_df = b_flattened.to_dataframe()
test_df.head()

| | description | review_rating |
|---|---|---|
| 0 | Here exists a very cozy room for rent in a sha... | 100 |
| 1 | Murphy bed, optional second bedroom available.... | 94 |
| 2 | The Apartment has a living room, toilet, bedro... | 98 |
| 3 | Loft Suite Deluxe @ Henry Norman Hotel Located... | 88 |
| 4 | Clean, fully furnish, Spacious 1 bedroom flat ... | 100 |


In [108]:
test_df['description'] = test_df['description'].str.lower()
test_df.head()

| | description | review_rating |
|---|---|---|
| 0 | here exists a very cozy room for rent in a sha... | 100 |
| 1 | murphy bed, optional second bedroom available.... | 94 |
| 2 | the apartment has a living room, toilet, bedro... | 98 |
| 3 | loft suite deluxe @ henry norman hotel located... | 88 |
| 4 | clean, fully furnish, spacious 1 bedroom flat ... | 100 |


In [114]:
def tokenize_partitions(df):
    df['description'] = df['description'].apply(tokenizer.tokenize)
    return df

test2 = test_df.map_partitions(tokenize_partitions)
test2.head()

| | description | review_rating |
|---|---|---|
| 0 | [here, exists, a, very, cozy, room, for, rent,... | 100 |
| 1 | [murphy, bed, optional, second, bedroom, avail... | 94 |
| 2 | [the, apartment, has, a, living, room, toilet,... | 98 |
| 3 | [loft, suite, deluxe, henry, norman, hotel, lo... | 88 |
| 4 | [clean, fully, furnish, spacious, 1, bedroom, ... | 100 |


The `description` column now contains a list of strings (tokens) in each row.

In [124]:
X_t = test2['description']
y_t = test2['review_rating']

In [125]:
# create train/test splits
from dask_ml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_t, 
    y_t, 
    test_size=0.20, 
    random_state=40
)



In [128]:
# vectorize
from dask_ml.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(lowercase=False, tokenizer=lambda x:x)
X_train_vect = vect.fit_transform(X_train)



In [129]:
# make sure to compute the chunk sizes before training
X_train_vect.compute_chunk_sizes()

| | Array | Chunk |
|---|---|---|
| Shape | (2151, 1048576) | (220, 1048576) |
| Count | 120 Tasks | 12 Chunks |
| Type | float64 | scipy.sparse.csr.csr_matrix |


In [None]:
%%time
# train classifier
clf.fit(X_train_vect, y_train)

In [26]:
# lemmatize_row is the row-level function defined in the spaCy section
lemmas = description_text_clean.map(lemmatize_row)

In [27]:
lemmas.take(1)

(['exist',
  'cozy',
  'room',
  'rent',
  'share',
  '4',
  'bedroom',
  'apartment',
  'locate',
  'one',
  'block',
  'jmz',
  'myrtle',
  'broadway',
  'neighborhood',
  'diverse',
  'appeal',
  'variety',
  'people'],)

Notice how a Dask Bag lets you .map() any Python function over all of the records it contains, regardless of the library you might be using.