<a href="https://colab.research.google.com/github/atm1504/mongodb-details/blob/master/cleansing_data_with_updates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleansing
### Things to learn
* Import bulk data
* Count data by type filter
* Modify data (Update)

In [1]:
# We're going to install this module to help us parse datetimes from the raw dataset
!pip3 install dateparser

# Install pymongo with the srv extra so that mongodb+srv:// connection URIs can be resolved
!pip3 install pymongo[srv]

Collecting dateparser
  Downloading https://files.pythonhosted.org/packages/c1/d5/5a2e51bc0058f66b54669735f739d27afc3eb453ab00520623c7ab168e22/dateparser-0.7.6-py2.py3-none-any.whl (362kB)

In [1]:
from pymongo import MongoClient, InsertOne, UpdateOne
import pprint
import dateparser
from bson.json_util import loads

In [5]:
# Replace XXXX with your connection URI from the Atlas UI
client = MongoClient("XXXX")
people_raw = client['cleansing']['peoples']

I am skipping the import step below because I am using Google Colab and my network speed is slow; I imported the data locally with a plain Python 3 script instead. There are three ways to import the data.
### Method-1 (In local system)
```
import bson.json_util
from pymongo import InsertOne, MongoClient

BATCH_SIZE = 1000  # Batch size for batch insertion

cli = MongoClient("<connection URL>")  # replace with your connection URL
people_raw = cli.cleansing['peoples']

batch_insertions = []
with open('people-raw.json') as f:
    for line in f:
        line_dict = bson.json_util.loads(line)
        batch_insertions.append(InsertOne(line_dict))
        if len(batch_insertions) == BATCH_SIZE:
            people_raw.bulk_write(batch_insertions)
            print(f'Finished inserting a batch of {BATCH_SIZE} documents')
            batch_insertions = []
if batch_insertions:
    people_raw.bulk_write(batch_insertions)
    print(f'Finished inserting a last batch of {len(batch_insertions)} '
          f'documents')

print('Finished all the insertions.')
```
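
The batching logic in the script above can also be factored into a small generator. This is only a minimal sketch: `batched` is a hypothetical helper name (not part of PyMongo), and the `lines` list is a stand-in for the lines of `people-raw.json`. In the real script, each yielded batch could be wrapped in `InsertOne` operations and passed to `bulk_write`.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the lines of people-raw.json (assumption for illustration)
lines = [f'{{"n": {i}}}' for i in range(5)]
sizes = [len(batch) for batch in batched(lines, 2)]
print(sizes)  # [2, 2, 1]
```

Because the generator yields the final partial batch itself, no separate "leftover" write is needed after the loop.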

### Method-2 (In google colab)
```
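from google.colab import files

# Opens a file picker and returns a dict mapping each file name to its raw bytes
uploaded = files.upload()
file_name = "data.json"  # must match the name of the file you uploaded
# Decode the uploaded bytes into text; the file holds one JSON document per line
dataset = uploaded[file_name].decode("utf-8")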
```

Now parse the data in the following way:
```
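batch_size = 1000
inserts = []
count = 0

# The uploaded file holds one JSON document per line
for line in uploaded[file_name].decode("utf-8").splitlines():
    inserts.append(InsertOne(loads(line)))
    count += 1

    if count == batch_size:
        people_raw.bulk_write(inserts)
        inserts = []
        count = 0

# Write whatever is left over after the last full batch
if inserts:
    people_raw.bulk_write(inserts)
    count = 0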
```
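
If you would rather iterate over the uploaded text like a file, one option is to wrap the decoded string in `io.StringIO`. The sketch below makes two assumptions so that it runs anywhere: the `uploaded` dict is a hand-written stand-in for what `files.upload()` returns, and plain `json.loads` replaces `bson.json_util.loads`. Plain `json` does not convert Extended JSON values such as `{"$date": ...}` into Python types, so the real import should keep using `loads` from `bson.json_util`.

```python
import io
import json

# Stand-in for the dict returned by files.upload(): file name -> raw bytes
uploaded = {"data.json": b'{"name": "A"}\n\n{"name": "B"}\n'}

documents = []
with io.StringIO(uploaded["data.json"].decode("utf-8")) as dataset:
    for line in dataset:
        if line.strip():  # skip blank lines instead of failing on them
            documents.append(json.loads(line))

print(len(documents))  # 2
```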

### Method-3 (Preferred way)

In [18]:
batch_size = 1000
inserts = []
count = 0

In [None]:
## Method-3 
with open("./people-raw.json") as dataset: 
    for line in dataset: 
        inserts.append(InsertOne(loads(line)))
        
        count += 1
                       
        if count == batch_size:
            people_raw.bulk_write(inserts)
            inserts = []
            count = 0
if inserts:         
    people_raw.bulk_write(inserts)
    count = 0

In [6]:
# Confirm that 50,474 documents are in your collection before moving on
people_raw.count_documents({})

50474

In [19]:
# Query the peoples collection for a cursor over only the
# documents where the birthday field is a string
people_with_string_birthdays = people_raw.find({'birthday':{'$type':'string'}})
filter={'birthday':{'$type':'string'}}
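
The same `$type` pattern works for any field and BSON type, so it can be wrapped in a tiny helper. A minimal sketch, assuming a hypothetical function name `type_filter`; the commented calls at the end show how it might be used with the `people_raw` collection and are not executed here.

```python
def type_filter(field, bson_type):
    """Build a query matching documents whose field has the given BSON type."""
    return {field: {"$type": bson_type}}


string_birthdays = type_filter("birthday", "string")
date_birthdays = type_filter("birthday", "date")
print(string_birthdays)  # {'birthday': {'$type': 'string'}}

# Hypothetical usage against the collection defined earlier:
# people_raw.count_documents(string_birthdays)
# people_raw.count_documents(date_birthdays)
```

Counting both types before and after the cleanup is one way to check that every string birthday ended up as a date.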

In [20]:
# This is the answer to verify you completed the lab.
# Cursor.count() is deprecated (and removed in PyMongo 4), so count with the filter instead.
people_raw.count_documents(filter)

1382

In [21]:
updates = []
# Again, we're updating more than a thousand documents, so this will take a little while
for person in people_with_string_birthdays:
    # PyMongo converts datetime objects into BSON Dates. The dateparser.parse function returns a
    # datetime object, so we can simply do the following to update the field properly.
    # The $set operator overwrites the string value with the parsed date.
    updates.append(UpdateOne(
        {"_id": person["_id"]},
        {"$set": {"birthday": dateparser.parse(person["birthday"])}},
    ))
    
    count += 1
                       
    if count == batch_size:
        people_raw.bulk_write(updates)
        updates = []
        count = 0
        
if updates:         
    people_raw.bulk_write(updates)
    count = 0
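
One thing to watch: `dateparser.parse` returns `None` when it cannot interpret a string, and writing that with `$set` would store a null birthday. The sketch below shows one way to guard against that. It rests on some assumptions: `parse_birthday` and `build_updates` are hypothetical helper names, the parser is a standard-library stand-in that only knows two formats, and the sample `people` list is made up for illustration.

```python
from datetime import datetime


def parse_birthday(text):
    """Stand-in parser: return a datetime, or None when the text is not understood."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def build_updates(people, parse=parse_birthday):
    """Return (filter, update) pairs plus the _ids whose birthday could not be parsed."""
    updates, skipped = [], []
    for person in people:
        parsed = parse(person["birthday"])
        if parsed is None:
            skipped.append(person["_id"])  # leave the original string untouched
            continue
        updates.append(({"_id": person["_id"]}, {"$set": {"birthday": parsed}}))
    return updates, skipped


# Made-up sample documents (assumption for illustration)
people = [
    {"_id": 1, "birthday": "1990-04-12"},
    {"_id": 2, "birthday": "03/03/1985"},
    {"_id": 3, "birthday": "not a date"},
]
updates, skipped = build_updates(people)
print(len(updates), skipped)  # 2 [3]
```

In the notebook, each pair could be turned into an operation with `UpdateOne(*pair)`, and `dateparser.parse` could be passed as the `parse` argument.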

In [22]:
# If everything went well this should be zero
people_raw.count_documents({'birthday': {'$type': 'string'}})

  


0