This is the code for a technical assessment I completed as part of the hiring process at a marketing company. It is an interesting problem because it tests several skills in a single task:

- HTML parsing
- Lists and Dictionaries (indexing, slicing)
- pandas DataFrame manipulation
- JSON file manipulation (serialization, deserialization)
- MongoDB using PyMongo

The task is to combine blog entries and their corresponding comments from two separate CSV files into a single JSON file, which is then inserted into a MongoDB collection.

In [1]:
import pandas as pd
import json
from bs4 import BeautifulSoup

In [2]:
blogs = pd.read_csv("blogs.csv", index_col='id')

comments = pd.read_csv("comments.csv")

In [3]:
blogs.head()

id,content,author,title
1,<p><strong>elasticsearch</strong> version <cod...,Shay Banon,0.19.2 Released
2,<p><strong>elasticsearch</strong> version <cod...,Shay Banon,0.19.1 Released
3,<p><strong>elasticsearch</strong> version <cod...,Shay Banon,0.19.0 Released
4,<p><strong>elasticsearch</strong> version <cod...,Shay Banon,0.19.0.RC3 Released
5,<p><strong>elasticsearch</strong> version <cod...,Shay Banon,0.19.0.RC2 Released


In [4]:
comments.head(10)

Unnamed: 0,poster,message,blog_id,id
0,Bryan Green,"Is there a plan for ""version 1.0""? \r\nOr will...",1,1
1,Kristian,Great! And I see the new ICU plugin has been r...,2,2
2,ianmayo,Here's a URL to view the issues resolved betwe...,3,3
3,Benny Sadeh,like multiple keyword search (each one with a ...,4,4
4,Bryan Green,Thanks a ton Shay! This release rocks as usual...,4,5
5,Bryan Green,I always pull several different sets of data w...,4,6
6,haarts,What would be a good use case for msearch? I c...,4,7
7,Eric,"Should you add new features to a RC version, r...",5,8
8,Seon,"Hooo-rah, Shay and team.",7,9
9,Damian Tylczyński,Thanks!,7,10


In [5]:
# Keep only the text of the first <p> element of each post, with the HTML tags stripped.
cleaned_content = []

for html in blogs.content:
    soup = BeautifulSoup(html, "html.parser")
    cleaned_content.append(soup.find("p").get_text())
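The loop above assumes every post contains at least one `<p>` element; `soup.find("p")` returns `None` otherwise, and the chained `.get_text()` call would raise an `AttributeError`. A minimal sketch of a more defensive version follows. The helper name `first_paragraph_text` and the sample HTML string are made up for illustration, since the real post content is truncated in the output above.

```python
from bs4 import BeautifulSoup


def first_paragraph_text(html):
    """Return the text of the first <p>, or all text if there is no <p>.

    Hypothetical helper, not part of the original notebook.
    """
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    if paragraph is None:
        # No <p> element: fall back to the text of the whole fragment.
        return soup.get_text()
    return paragraph.get_text()


# Made-up sample in the style of the blog posts.
sample = "<p><strong>elasticsearch</strong> version <code>0.19.2</code> is out.</p><p>Second</p>"
print(first_paragraph_text(sample))  # elasticsearch version 0.19.2 is out.
```

With such a helper, the cleaning step could be written as `blogs["content"].map(first_paragraph_text)`, assuming no post has missing content.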

In [6]:
blogs.drop(columns="content", inplace=True)

blogs["content"] = cleaned_content

In [7]:
blog_list = []

for i in blogs.index:
    # One document per blog post, with its comments nested as a list of records.
    blogEntry = {
        "id": i,
        "title": blogs.loc[i, "title"],
        "author": blogs.loc[i, "author"],
        "content": blogs.loc[i, "content"],
        "comments": comments[comments["blog_id"] == i].to_dict("records"),
    }
    blog_list.append(blogEntry)

# Write the file once, after every entry has been collected.
with open("blogs_data.json", "w", encoding="utf-8") as file:
    json.dump(blog_list, file, indent=4)
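The cell above filters the whole `comments` frame once per blog post. That is fine for a small dataset, but the same nesting can also be sketched with a single `groupby` pass. This is an alternative approach rather than what the notebook does, and the two tiny DataFrames below are invented stand-ins for `blogs.csv` and `comments.csv`.

```python
import pandas as pd

# Invented stand-in data with the same column names as the real files.
blogs = pd.DataFrame(
    {
        "title": ["First post", "Second post", "Third post"],
        "author": ["A", "B", "C"],
        "content": ["x", "y", "z"],
    },
    index=pd.Index([1, 2, 3], name="id"),
)
comments = pd.DataFrame(
    {
        "poster": ["p1", "p2", "p3"],
        "message": ["m1", "m2", "m3"],
        "blog_id": [1, 1, 2],
        "id": [1, 2, 3],
    }
)

# Group the comments once: blog_id -> list of comment records.
comments_by_blog = {
    blog_id: group.to_dict("records")
    for blog_id, group in comments.groupby("blog_id")
}

# Build one document per post; posts without comments get an empty list.
blog_list = [
    {"id": blog_id, **row.to_dict(), "comments": comments_by_blog.get(blog_id, [])}
    for blog_id, row in blogs.iterrows()
]
```

Here the third post has no comments, so its `"comments"` field is an empty list, which matches what the filtering approach produces.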

In [8]:
from pymongo import MongoClient

In [9]:
client = MongoClient("mongodb://localhost:27017")

In [10]:
# Select the "interview" database and its "blogs" collection.
db = client["interview"]
collection = db["blogs"]

with open("blogs_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

collection.insert_many(data)

<pymongo.results.InsertManyResult at 0x1387879ff48>
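The load-and-insert step can be wrapped in a small function, which makes it easier to reuse and to check how many documents were written. The sketch below is one possible shape, not the notebook's own code: the name `load_json_into` is hypothetical, and it only assumes that the object passed in offers an `insert_many` method returning a result with `inserted_ids`, as a PyMongo collection does.

```python
import json


def load_json_into(collection, path):
    """Insert the documents stored in a JSON file and return how many were written.

    Hypothetical helper; `collection` is expected to behave like a PyMongo collection.
    """
    with open(path, "r", encoding="utf-8") as f:
        documents = json.load(f)

    if not documents:
        # PyMongo's insert_many rejects an empty list, so skip the call.
        return 0

    result = collection.insert_many(documents)
    return len(result.inserted_ids)


# Usage against a local server (requires a running MongoDB instance):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# print(load_json_into(client["interview"]["blogs"], "blogs_data.json"))
```

Note that running the insert twice stores every post twice, because MongoDB assigns a fresh `_id` to each inserted document.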