# Mongo Notebook Draft

What this notebook contains

1. Load AirBnb Mongo sample data
2. Run preprocessing using Dask Bags (tokenization)
3. Run preprocessing using Dask Bags (lemmatization)
4. Topic Modelling
   1. Joblib
   2. partial_fit
5. Load Amazon FineFoods data

In [1]:
import dask.dataframe as dd
from distributed import Client

In [2]:
client = Client()
client

Client: connection method "Cluster object", cluster type distributed.LocalCluster, dashboard http://127.0.0.1:8787/status

LocalCluster: 4 workers, 8 total threads, 16.00 GiB total memory, status running, using processes: True

Scheduler: comm tcp://127.0.0.1:58619, 4 workers, 8 total threads, 16.00 GiB total memory, started just now

| Worker comm | Threads | Memory | Dashboard | Nanny | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:58633 | 2 | 4.00 GiB | http://127.0.0.1:58642/status | tcp://127.0.0.1:58622 | /Users/rpelgrim/Documents/git/coiled-resources/dask-for-nlp/dask-worker-space/worker-yd5eulem |
| tcp://127.0.0.1:58634 | 2 | 4.00 GiB | http://127.0.0.1:58637/status | tcp://127.0.0.1:58625 | /Users/rpelgrim/Documents/git/coiled-resources/dask-for-nlp/dask-worker-space/worker-z_c9c76o |
| tcp://127.0.0.1:58635 | 2 | 4.00 GiB | http://127.0.0.1:58638/status | tcp://127.0.0.1:58624 | /Users/rpelgrim/Documents/git/coiled-resources/dask-for-nlp/dask-worker-space/worker-ilcdhaw3 |
| tcp://127.0.0.1:58632 | 2 | 4.00 GiB | http://127.0.0.1:58636/status | tcp://127.0.0.1:58623 | /Users/rpelgrim/Documents/git/coiled-resources/dask-for-nlp/dask-worker-space/worker-5tf6ulix |


## 1. Import Mongo Data

In [3]:
from dask_mongo import read_mongo
import urllib.parse  # importing the submodule explicitly guarantees urllib.parse.quote is available

In [4]:
# Replace the username, password, and cluster address with your own connection details
host_uri = "mongodb+srv://richard:" + urllib.parse.quote("Rp@976559MO") + "@cluster0.ffttf.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"

In [5]:
bag = read_mongo(
    connection_kwargs={"host": host_uri},
    database="sample_airbnb",
    collection="listingsAndReviews",
    chunksize=500,
)

In [None]:
bag.take(1)

In [7]:
def process(record):
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
            #"accomodates": record["accommodates"],
            #"bedrooms": record["bedrooms"],
            #"price": float(str(record["price"])),
            #"country": record["address"]["country"],
        }
    except KeyError:
        pass

In [8]:
# Filter only apartments
b_flattened = (
    bag.filter(lambda record: record["property_type"] == "Apartment")
    .map(process)
    .flatten()
)

In [9]:
b_flattened.take(1)

({'description': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway.  The neighborhood is diverse and appeals to a variety of people.',
  'review_rating': 100},)
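
Because `process` is a generator that yields either one dict or nothing, `.map(process).flatten()` drops any record with a missing key instead of raising. The sketch below reproduces that pattern in plain Python, with `itertools.chain` standing in for the Bag's `flatten`. The two `records` are made up for illustration and are not taken from the dataset.

```python
from itertools import chain


def process(record):
    # Yield one flattened dict, or nothing at all if a key is missing
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
        }
    except KeyError:
        pass


# Hypothetical records: the second one has no rating
records = [
    {"description": "Cozy room", "review_scores": {"review_scores_rating": 100}},
    {"description": "No reviews yet", "review_scores": {}},
]

# chain.from_iterable plays the role of Bag.flatten here
flattened = list(chain.from_iterable(map(process, records)))
print(flattened)
```

Only the first record survives, which is why the flattened Bag can be shorter than the filtered one.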

In [10]:
df1 = b_flattened.to_dataframe()

In [11]:
df1.head()

| | description | review_rating |
|---|---|---|
| 0 | Here exists a very cozy room for rent in a sha... | 100 |
| 1 | Murphy bed, optional second bedroom available.... | 94 |
| 2 | The Apartment has a living room, toilet, bedro... | 98 |
| 3 | Loft Suite Deluxe @ Henry Norman Hotel Located... | 88 |
| 4 | Clean, fully furnish, Spacious 1 bedroom flat ... | 100 |


## 2. Tokenization

In [12]:
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer
from functools import partial

In [13]:
tokenizer = RegexpTokenizer(r'\w+')

In [14]:
# define processing functions
def extract_description(element):
    return element['description'].lower()

def extract_rating(element):
    return element['review_rating']

def filter_stopword(word, stopwords):
    return word not in stopwords

def filter_stopwords(tokens, stopwords):
    return list(filter(partial(filter_stopword, stopwords=stopwords), tokens))

In [15]:
# define set of stopwords
stopword_set = set(stopwords.words('english'))

In [25]:
# get cleaned, tokenized description texts
description_text = b_flattened.map(extract_description)
description_text_tokens = description_text.map(tokenizer.tokenize) # this outputs tokens as a list of strings which can't be cast to a dataframe properly
descript_clean = description_text_tokens.map(partial(filter_stopwords, stopwords=stopword_set))

In [26]:
descript_clean.take(1)

(['exists',
  'cozy',
  'room',
  'rent',
  'shared',
  '4',
  'bedroom',
  'apartment',
  'located',
  'one',
  'block',
  'jmz',
  'myrtle',
  'broadway',
  'neighborhood',
  'diverse',
  'appeals',
  'variety',
  'people'],)
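
To see what the three `map` steps do to a single document, here is a rough stand-alone sketch. It assumes `re.findall(r"\w+", ...)` behaves like `RegexpTokenizer(r'\w+')` for this input, and it uses a tiny hand-picked `stopword_set` rather than the NLTK English list.

```python
import re


def filter_stopwords(tokens, stopwords):
    # Keep only the tokens that are not stopwords
    return [token for token in tokens if token not in stopwords]


# Hypothetical, much smaller than stopwords.words('english')
stopword_set = {"here", "a", "very", "for", "in"}

text = "Here exists a very cozy room for rent in a shared 4-bedroom apartment."

tokens = re.findall(r"\w+", text.lower())  # lowercase, then split on non-word characters
clean = filter_stopwords(tokens, stopword_set)
print(clean)
```

Note that `4-bedroom` is split into `4` and `bedroom`, which matches the tokens in the output above.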

## 3. Lemmatization

In [17]:
import spacy

In [18]:
nlp = spacy.load("en_core_web_sm")

In [19]:
def lemmatize(text, nlp=nlp):
    doc = nlp(" ".join(text))
    lemmatized = [token.lemma_ for token in doc]
    return lemmatized

In [20]:
lemmas = descript_clean.map(lemmatize)

In [21]:
lemmas.take(1)

(['exist',
  'cozy',
  'room',
  'rent',
  'share',
  '4',
  'bedroom',
  'apartment',
  'locate',
  'one',
  'block',
  'jmz',
  'myrtle',
  'broadway',
  'neighborhood',
  'diverse',
  'appeal',
  'variety',
  'people'],)

This works, but the result can't be cast to a dataframe yet: each element is a list of strings, and every string in the list gets cast to its own column. A workaround is still needed.

## 4. Topic Modelling

To perform topic modelling, we will have to:
1. Create an array containing only the lemmatized text
2. Create a Bag of Words dictionary (filtering for extreme cases)
3. Map our documents to the BOW dictionary

OR
1. Use the `HashingVectorizer` to turn text into a matrix of token occurrences

Then we'll use sklearn instead of Gensim, with the `n_jobs` parameter and the Dask joblib backend (connected to a Coiled cluster).

Open questions:
1. Should the input to LDA be a Bag or an Array?
2. Use sklearn instead of Gensim?
3. Does `n_jobs` work with the Dask backend?

## 4.1. Use Lemmas > Array > BOW

### Turn Bag into Array

In [27]:
type(lemmas)

dask.bag.core.Bag

In [46]:
lemmas_df = lemmas.to_dataframe(meta={'lemmas': 'object'}) # this doesn't work yet because tokens are a list of strings

Casting this to a dataframe doesn't work because each element of the list of lemmas is getting cast to its own df column. We need the whole list of lemmas to end up in a single column.
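
The column-spreading behaviour can be reproduced in plain pandas, which also suggests one possible workaround: wrap each list in a dict so the whole list becomes a single cell. This is a sketch on made-up `lemma_lists`, not a tested fix for the Bag. The same wrapping could presumably be applied with `lemmas.map(lambda tokens: {"lemmas": tokens})` before calling `to_dataframe`, but that has not been verified here.

```python
import pandas as pd

# Hypothetical lemma lists standing in for the elements of the `lemmas` Bag
lemma_lists = [["exist", "cozy", "room"], ["murphy", "bed"]]

# A list of lists is read row-wise: one column per token position
wide = pd.DataFrame(lemma_lists)

# A list of dicts keeps each whole list in a single object-dtype cell
single = pd.DataFrame([{"lemmas": tokens} for tokens in lemma_lists])

print(wide.shape)
print(single.shape)
```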

### Create BOW Dictionary

### Create tf-idf mapping

## 4.2. HashingVectorizer

These `dask-ml` vectorizers have built-in tokenizers, so we avoid the problem of casting the token lists to a dataframe.

In [49]:
from dask_ml.feature_extraction.text import HashingVectorizer

In [52]:
ddf = dd.read_parquet(
    's3://coiled-datasets/airbnb-monogo/description-and-ratings.parquet',
    engine="pyarrow",
)

In [53]:
ddf.head()

| | description | review_rating |
|---|---|---|
| 0 | Here exists a very cozy room for rent in a sha... | 100 |
| 1 | Murphy bed, optional second bedroom available.... | 94 |
| 2 | The Apartment has a living room, toilet, bedro... | 98 |
| 3 | Loft Suite Deluxe @ Henry Norman Hotel Located... | 88 |
| 4 | Clean, fully furnish, Spacious 1 bedroom flat ... | 100 |


In [54]:
X = ddf['description'].to_dask_array(lengths=True)
y = ddf['review_rating'].to_dask_array(lengths=True)

In [55]:
vect = HashingVectorizer()

In [56]:
X_vect = vect.fit_transform(X)

In [59]:
X_vect.compute_chunk_sizes()

| | Array | Chunk |
|---|---|---|
| Shape | (2681, 1048576) | (264, 1048576) |
| Count | 48 Tasks | 12 Chunks |
| Type | float64 | scipy.sparse.csr.csr_matrix |


In [61]:
X_vect.blocks[0].compute()

<221x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 19632 stored elements in Compressed Sparse Row format>

Each block in `X_vect` is a `scipy.sparse` matrix.
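
The 1048576 columns (2**20) come from the hashing trick: each token is hashed straight to a column index, so no vocabulary has to be built or shared between workers. The sketch below illustrates the idea with `zlib.crc32` swapped in as the hash function. scikit-learn's `HashingVectorizer` uses a signed MurmurHash3 and L2-normalizes rows by default, so the actual indices and values will differ. `hash_vectorize` is a hypothetical helper, not part of any library.

```python
import re
import zlib
from collections import Counter

N_FEATURES = 2**20  # 1048576, the second dimension of X_vect


def hash_vectorize(text, n_features=N_FEATURES):
    """Map a document to {column_index: count} without storing a vocabulary."""
    counts = Counter()
    for token in re.findall(r"\w+", text.lower()):
        # CRC32 is an illustrative stand-in for the library's hash function
        column = zlib.crc32(token.encode("utf-8")) % n_features
        counts[column] += 1
    return dict(counts)


row = hash_vectorize("cozy room in a cozy apartment")
print(row)
```

Two different tokens can land in the same column. With 2**20 columns such collisions are rare for a vocabulary of this size, which is the trade-off the hashing approach accepts.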

## 4.3. CountVectorizer

In [27]:
from dask_ml.feature_extraction.text import CountVectorizer
import dask.bag as db

"The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents (lists of strings)."

In [28]:
description_text.take(1)

('here exists a very cozy room for rent in a shared 4-bedroom apartment. it is located one block off of the jmz at myrtle broadway.  the neighborhood is diverse and appeals to a variety of people.',)

`description_text` is a Dask Bag of documents

In [29]:
vectorizer = CountVectorizer()

In [30]:
%%time
X = vectorizer.fit_transform(description_text)

In [31]:
X_local = X.compute().toarray()

Just like `HashingVectorizer`, the `CountVectorizer` outputs a Dask Array in which each chunk is a `scipy.sparse` matrix. Calling `.compute().toarray()` turns it into a dense NumPy array.

In [40]:
type(X_local)

numpy.ndarray
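
Unlike the hashing approach, a count vectorizer first has to learn a vocabulary and only then counts tokens per document. A minimal pure-Python sketch of those two passes follows. The `docs` are made up, and the regular expression mirrors scikit-learn's default token pattern (tokens of two or more word characters), which is assumed to apply to the Dask-ML wrapper as well.

```python
import re

# Hypothetical documents
docs = ["cozy room for rent", "cozy apartment, cozy block"]

tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Pass 1: build a sorted vocabulary and a token -> column mapping
vocab = sorted({token for doc in tokenized for token in doc})
index = {token: column for column, token in enumerate(vocab)}

# Pass 2: count occurrences, one dense row per document
matrix = [[0] * len(vocab) for _ in docs]
for row, doc in enumerate(tokenized):
    for token in doc:
        matrix[row][index[token]] += 1

print(vocab)
print(matrix)
```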

## 5. LDA

Can we use Dask/Coiled for parallel topic modelling?
1. Yes, if the dataset fits in memory: sklearn with joblib
2. If the dataset doesn't fit in memory: sklearn with `partial_fit`


## 5.1. Sklearn with Joblib (if dataset fits in memory)

We can pass `X_local` (the output of `CountVectorizer` or `HashingVectorizer`, computed and converted to a dense NumPy array) to the sklearn LDA algorithm. The drawback is that the whole matrix has to be held in local memory.

This is because sklearn doesn't natively accept Dask Arrays. So this option works only if your dataset fits in memory and you want to parallelize model training with Dask.

In [32]:
from sklearn.decomposition import LatentDirichletAllocation

In [33]:
lda = LatentDirichletAllocation(
    n_components=5,
    random_state=42,
)

In [34]:
import joblib

In [42]:
%%time
with joblib.parallel_backend("dask"):
    lda.fit(X_local)

CPU times: user 1.48 s, sys: 390 ms, total: 1.87 s
Wall time: 4.71 s


## 5.2. Sklearn with `partial_fit`

In [None]:
chunksize = 2000

In [46]:
import numpy as np
import pandas as pd

In [None]:
# Unfinished placeholder: np.nditer() needs an array argument, e.g. np.nditer(X_local)

In [47]:
df_X = pd.DataFrame(data=X_local)
df_X

(Truncated DataFrame output: 2681 rows, with columns labelled 0 to 19720. Every displayed value is 0.)


In [48]:
df_X.to_csv("/Users/rpelgrim/Desktop/mongo_airbnb_test.csv")

In [49]:
len(df_X)

2681

In [51]:
lda = LatentDirichletAllocation(
    n_components=5,
    random_state=42,
)

In [52]:
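%%time
# index_col=0 keeps the row index written by to_csv out of the feature matrix
for partial_df in pd.read_csv("/Users/rpelgrim/Desktop/mongo_airbnb_test.csv", chunksize=500, iterator=True, index_col=0):
    X = partial_df.to_numpy()
    with joblib.parallel_backend("dask"):
        lda.partial_fit(X)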

This looks like it works.

Still to do:
1. Test with a dataset that is actually too large for memory
2. Test whether [.fit(learning_method="online")](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) also works, instead of iterating over chunks
3. If #2 doesn't work, figure out how to iterate over a NumPy Array and/or Parquet file
4. Interpret the output of LDA
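
For the third item, one option that avoids the CSV round trip is to slice the array directly. The sketch below defines a hypothetical `iter_chunks` helper that works on anything supporting `len` and slicing. It is demonstrated on a plain list of 2681 items, and using it with `X_local` and `partial_fit` is an untested assumption.

```python
def iter_chunks(data, chunksize):
    """Yield consecutive slices of `data`, each at most `chunksize` items long."""
    for start in range(0, len(data), chunksize):
        yield data[start:start + chunksize]


# Stand-in for the 2681 rows of X_local
rows = list(range(2681))
sizes = [len(chunk) for chunk in iter_chunks(rows, 500)]
print(sizes)

# Assumed usage with the objects from this notebook (not run here):
# for chunk in iter_chunks(X_local, 500):
#     lda.partial_fit(chunk)
```

Slicing a NumPy array along its first axis returns views rather than copies, so this approach should not duplicate the data in memory.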