# Description:
In this notebook we explore the MongoDB instance where we keep our news documents. We look at some example documents and at the time at which each batch of documents was inserted, and provide a function to remove a batch of documents given its insertion time. We then analyze the distribution of publication times, of categories, and of sources, as well as the relationship between category and source. Finally, we provide a function to remove documents with no content or description and a function to check for and remove duplicate documents based on description and content combined. The latter was used before the unique index on content and description was created; with this index in place there should not be any duplicate documents.

# TODO:
- UPDATE THE REMAINING MONGODB CELLS TO GET THEIR DATA FROM ELASTICSEARCH INSTEAD OF MONGODB

In [1]:
import os
from datetime import datetime, timedelta
from pprint import pprint
from elasticsearch import Elasticsearch
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Connecting to Elasticsearch
es = Elasticsearch(
    hosts=['odfe-node1', '0.0.0.0'],
    http_auth=('admin', 'admin'),
    scheme="https",
    verify_certs=False
)



## Indices
The document store is composed of indices. These indices in turn hold documents.

In [3]:
# List indices
es.indices.get_alias("*")



{'document': {'aliases': {}},
 'security-auditlog-2021.08.31': {'aliases': {}},
 'label': {'aliases': {}},
 '.opendistro_security': {'aliases': {}}}

In [4]:
# How many documents does the 'document' index hold?
es.indices.refresh('document')
count = es.cat.count('document', params={"format": "json"})[0]['count']
print(f"There are {count} documents in the index 'document'")



There are 334925 documents in the index 'document'




## Topic label


In [5]:
# Get counts of topic_labels
result = es.search(
    {
        "size": 0,
         "aggs": {
             "group_by_topic": {
                 "terms": {
                     "field": "topic_label"
                 }
             }
         }
    },
    index="document"
)['aggregations']['group_by_topic']['buckets']
# Map each bucket's key (the topic label) to its document count
counts = {bucket['key']: bucket['doc_count'] for bucket in result}
counts



{'-1_covid_covid 19_coronavirus_year': 332728,
 '1_nba_lakers_game_warriors': 261,
 '6_deals_black friday_black_startup': 225,
 '11_india_minister_delhi_cricket': 184,
 '13_china_chinese_ireland_iran': 164,
 '2_iphone_galaxy_iphone 12_cyberpunk': 152,
 '0_nfl_football_team_coach': 147,
 '3_trump_president_joe_donald trump': 115,
 '5_county_covid_died_covid 19': 108,
 '4_nasa_spacex_mars_rocket': 107}
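
A terms aggregation returns only the ten largest buckets unless a `size` is given, which is why exactly ten topics are listed above. Below is a minimal sketch of a request body that asks for more buckets. The helper names `topic_count_body` and `buckets_to_counts` and the limit of 100 are illustrative assumptions, not part of the original notebook.

```python
def topic_count_body(field="topic_label", max_buckets=100):
    """Build a search body that counts documents per value of `field`.

    `max_buckets` maps to the terms aggregation's "size" parameter,
    which defaults to 10 when omitted.
    """
    return {
        "size": 0,  # return no hits, only the aggregation
        "aggs": {
            "group_by_topic": {
                "terms": {"field": field, "size": max_buckets}
            }
        },
    }


def buckets_to_counts(buckets):
    """Turn terms-aggregation buckets into a {key: doc_count} dict."""
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


body = topic_count_body()
# With a live connection this could be run as (not executed here):
# buckets = es.search(body, index="document")["aggregations"]["group_by_topic"]["buckets"]
# counts = buckets_to_counts(buckets)
```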

## Get random document

In [6]:
# Print a pseudo-random document (the fixed seed is meant to make the pick reproducible)
result = es.search(
    {
        "size": 1,
        "query": {
            "function_score": {
                "functions": [
                    {
                        "random_score": {
                            "seed": "1477072619038"
                        }
                    }
                ]
            }
        }
    },
    index="document"
)['hits']['hits'][0]
print(f"Document ID: {result['_id']}", "\n", result['_source'])

Document ID: 7dc12e3f-70dc-4163-9ea0-ec37aec4d47d 
 {'text': "Watch Live: Minnesota Governor Tim Walz announces new COVID-19 restrictions in response to surge - CBS News#SEPTAG#The governor is expected to announce restrictions affecting gyms, restaurants, bars and youth sports.#SEPTAG#Minnesota Governor Tim Walz is expected to announce Wednesday that gyms will have to close starting Friday and restaurants and bars will have to go to take-out only. He's also expected to pause youth", 'embedding': [-0.24771520495414734, -0.4632345736026764, 0.2184119075536728, 0.1549517661333084, 0.011453185230493546, -1.174071192741394, 0.6115257740020752, 0.5052496790885925, -0.3569122850894928, -0.5278075933456421, -0.1934090107679367, -0.20707258582115173, -0.13271763920783997, 0.1206926703453064, 0.43512800335884094, 0.021877190098166466, 0.00505519425496459, -0.4153878688812256, 0.5692142844200134, 0.15935109555721283, 0.03257742524147034, -0.5594801902770996, 0.04462718591094017, 0.538518548011779



## Published date

In [7]:
# Get documents between two dates
# (no "size" is set, so only the default first 10 hits are returned)
result = es.search(
    {
        "query": {
            "range": {
                "publishedat": {
                    "gte": "2020-01-01",
                    "lte": "2021-01-01"
                }
            }
        }
    },
    index="document"
)['hits']['hits']

print(f"There are {len(result)} documents in the search result. An example of these documents:\n")
print(f"Document ID: {result[0]['_id']}", "\n", result[0]['_source'])

There are 10 documents in the search result. An example of these documents:

Document ID: eff22f25-f34b-4f39-99ac-7fd33672d7bb 
 {'text': '\'They are out there\': \'Gator Girl’ Christy Kroboth on massive alligator spotted at Florida golf club - CNBC#SEPTAG#"Gator Girl" Kristy Kroboth talks about the massive alligator spotted at the Valencia Golf and Country Club in Naples, Florida.#SEPTAG#"Gator Girl" Christy Kroboth told CNBC that she thinks the massive alligator at a golf course in Naples, Florida was "real.""I\'ve seen huge gators, the biggest one I ever caught was 12.5 feet, and ', 'embedding': [0.010295205749571323, -0.5026167035102844, -0.12875521183013916, -0.5158852338790894, 0.7270004749298096, -0.04015268757939339, -0.4996439218521118, 0.27543261647224426, -0.710713267326355, 0.30155619978904724, 0.0596010759472847, -0.24039795994758606, 0.5091423988342285, -0.11288513243198395, 0.11359363049268723, 0.16006572544574738, 0.039813265204429626, -0.029788551852107048, 0.282041281
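
The count of 10 printed above is the number of hits returned, not the number of documents that match the date range. The overall total is reported under `hits.total` in the search response. The sketch below assumes the Elasticsearch 7.x response shape (a dict with `value` and `relation`) and falls back to the plain integer used by older versions. `total_hits` is a hypothetical helper and the sample response is made up for illustration.

```python
def total_hits(response):
    """Return (count, is_exact) from a search response's hits.total."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # 7.x shape: {"value": 1234, "relation": "eq" | "gte"}
        return total["value"], total["relation"] == "eq"
    # Older shape: a plain integer
    return total, True


# Made-up response fragment, shaped like a 7.x search result
sample = {"hits": {"total": {"value": 10000, "relation": "gte"}, "hits": []}}
count, exact = total_hits(sample)
print(f"{'Exactly' if exact else 'At least'} {count} documents match")
```

In 7.x the total is a lower bound once it reaches 10,000 unless the request body sets `"track_total_hits": True`.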



In [8]:
# Get most recent documents
result = es.search(
    {
        "size": 10,
        "sort": {
            "publishedat": "desc"
        },
        "query": {
            "match_all": {}
        }
    },
    index="document"
)['hits']['hits']
print("The most recent document is: \n")
print(f"Document ID: {result[0]['_id']}", "\n", result[0]['_source'])

The most recent document is: 

Document ID: f90dd301-317b-4da8-a809-f290a4b2ec33 
 {'text': 'News24.com | No evidence of corruption was provided - IPP office on challenge to R218bn Karpowership deal#SEPTAG#Authorities have said a legal challenge by DNG Energy to the awarding of a power supply contract worth an estimated R218 billion to Turkey\'s Karpowership was "without merit" and "self serving".#SEPTAG#The head of SA\'s Independent Power Producer Procurement (IPP) Programme Office said a rival\'s challenge to the awarding of a R218bn contract to Karpowership was \'without merit\'DNG ha', 'embedding': [-0.11433468014001846, 0.14669691026210785, 0.017174866050481796, 0.24237686395645142, -0.6248934864997864, 0.2966727316379547, -0.5372915267944336, -0.43996506929397583, -0.07661102712154388, 0.3373984098434448, 0.1610710769891739, -0.07574402540922165, 0.17847639322280884, -0.15618790686130524, 0.10130850970745087, -0.41460397839546204, 0.21484705805778503, -0.17823313176631927, 0.2634



## Exploratory Data Analysis

### publishedAt

In [None]:
# NOTE: this cell and the ones below still use MongoDB. They assume that `db`
# (a pymongo database handle) and `collection_list` (the collection names to plot)
# were defined by a setup cell that is not part of this notebook.
pipeline = [
    {  # project publishedAtDay (day plus hour, formatted as dd-mm-yyyyThh) and publishedAtHour
        '$project': {
            'publishedAtDay': {
                '$dateToString': {
                    'format': '%d-%m-%YT%H', 
                    'date': {'$toDate': '$publishedAt'}
                }
            },
            'publishedAtHour': {
                '$hour': {
                    'date': {'$toDate': '$publishedAt'}
                }
            }
        }
    },
    {  # groups on publishedAtDay and gets number of documents per day and hour (document_count) and publishedAtHour
        '$group': {
            '_id': '$publishedAtDay',
            'document_count': {'$sum': 1},
            'publishedAtHour': {'$first': '$publishedAtHour'}
        }
    },
    {  # groups on publishedAtHour and gets average of documents per hour over days
        '$group': {
            '_id': '$publishedAtHour',
            'avg_document_count': {'$avg': '$document_count'}
        }
    },
    {  # sort results in descending order by _id
        '$sort': {'_id': -1}
    }
]

fig, axes = plt.subplots(2, 1, figsize=(13, 9))
for ax, col in zip(axes.flatten(), collection_list):
    x, y = [], []
    for i in list(db[col].aggregate(pipeline)):
        x.append(i['_id'])
        y.append(i['avg_document_count'])
    ax.plot(x, y, linestyle="-")
    ax.set_xticks(x)
    ax.set_title(f"Average number of documents per publishedAt Hour - {col} collection")
    
plt.show()
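
In line with the TODO at the top, the same hourly profile could be derived from Elasticsearch instead of MongoDB. The sketch below is an outline under stated assumptions, not the notebook's method. It assumes an hourly `date_histogram` on the `publishedat` field (the `calendar_interval` parameter requires Elasticsearch 7.2 or later) and averages over days on the client side. The helper name and the sample buckets are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Request body: one bucket per calendar hour that contains documents
hourly_body = {
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {
                "field": "publishedat",
                "calendar_interval": "hour",
                "min_doc_count": 1,  # skip empty hours, like the first $group stage
            }
        }
    },
}


def average_per_hour_of_day(buckets):
    """Average the hourly document counts over days, keyed by hour of day (UTC).

    Each bucket is expected to carry "key" (epoch milliseconds) and
    "doc_count", as returned by a date_histogram aggregation.
    """
    counts = defaultdict(list)
    for bucket in buckets:
        moment = datetime.fromtimestamp(bucket["key"] / 1000, tz=timezone.utc)
        counts[moment.hour].append(bucket["doc_count"])
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(counts.items())}


# Made-up buckets: 2021-01-01 00:00, 2021-01-01 01:00 and 2021-01-02 00:00 (UTC)
sample_buckets = [
    {"key": 1609459200000, "doc_count": 4},
    {"key": 1609462800000, "doc_count": 3},
    {"key": 1609545600000, "doc_count": 6},
]
averages = average_per_hour_of_day(sample_buckets)
```

Averaging on the client keeps the request simple; the resulting dictionary can be plotted like the `x` and `y` lists above.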

### category

In [None]:
pipeline = [
    {  # project category
        '$project': {
            '_id': 0,
            'category': 1
        }
    },
    {  # groups on category and gets number of documents for each category
        '$group': {
            '_id': '$category',
            'document_count': {'$sum': 1},
        }
    },
    {  # sort results in ascending order by _id
        '$sort': {'_id': 1}
    }
]

fig, axes = plt.subplots(2, 1, figsize=(13, 9))
for ax, col in zip(axes.flatten(), collection_list):
    x, y = [], []
    for i in list(db[col].aggregate(pipeline)):
        x.append(i['_id'])
        y.append(i['document_count'])
    ax.bar(x, y)
    ax.set_xticks(x)
    ax.set_title(f"Number of documents per category - {col} collection")
    
plt.show()

### source

In [None]:
pipeline = [
    {  # project source
        '$project': {
            '_id': 0,
            'source': 1
        }
    },
    {  # groups on source and gets number of documents for each source
        '$group': {
            '_id': '$source',
            'document_count': {'$sum': 1},
        }
    },
    {  # sort results in ascending order by _id
        '$sort': {'_id': 1}
    }
]

fig, axes = plt.subplots(2, 1, figsize=(19, 9))
for ax, col in zip(axes.flatten(), collection_list):
    x, y = [], []
    for i in list(db[col].aggregate(pipeline)):
        if i['_id'] is None:
            x.append("Null")
        else:
            x.append(i['_id'])
        y.append(i['document_count'])
    ax.bar(x, y)
    ax.set_xticks(x)
    ax.set_xticklabels(x, rotation=30, ha='right')
    ax.set_title(f"Number of documents per source - {col} collection")

plt.subplots_adjust(hspace=0.4)
plt.show()

### relationship between categories and sources

In [None]:
# use stacked bar chart
pipeline = [
    {  # project source and category
        '$project': {
            '_id': 0,
            'source': 1,
            'category': 1
        }
    },
    {  # groups on (category, source) and gets number of documents for each pair
        '$group': {
            '_id': {
                'category': '$category',
                'source': '$source'
            },
            'document_count': {'$sum': 1},
        }
    },
    {
        '$project':{
            '_id': 0,
            'document_count': 1,
            'category': '$_id.category',
            'source': '$_id.source'            
        }
    },
    {  # sort results in ascending order by category
        '$sort': {'category': 1}
    }
]

fig, axes = plt.subplots(2, 1, figsize=(19, 11))
for ax, col in zip(axes.flatten(), collection_list):
    plt_data = pd.DataFrame(list(db[col].aggregate(pipeline))).pivot(index="source", columns="category", values="document_count")
    plt_data["sum"] = plt_data.sum(axis=1)
    plt_data.sort_values("sum", ascending=False).drop("sum", axis=1).plot(kind='bar', stacked=True, rot=90, ax=ax)
    ax.set_title(f"Source frequencies by category - {col} collection")

plt.subplots_adjust(hspace=0.9)
plt.show()

## missing values

In [None]:
def remove_missing_values(db, collection=None):
    """
    Remove documents whose description and content are both missing (None or empty string)
    from the given collection of db, or from all collections when collection is None (default).
    """
    pipeline_remove = [
        {
            '$project': {
                '_id': 1,
                'description': 1,
                'content': 1
            }
        },
        {
            "$match": {
                '$or': [
                    {
                        "description" : {"$eq" : None},
                        "content" : {"$eq": None}
                    },
                    {
                        "description" : {"$eq" : ''},
                        "content" : {"$eq": ''}
                    },
                    {
                        "description" : {"$eq" : None},
                        "content" : {"$eq": ''}
                    },
                    {
                        "description" : {"$eq" : ''},
                        "content" : {"$eq": None}
                    }                    
                ]
            } 
        }, 

        {
            "$project": {
                "_id": 1  # keep only the _id, which is all the deletion needs
            }
        }
    ]
    
    if collection is None:
        collection_list = db.list_collection_names()
        for col in collection_list:
            idsList = list(map(lambda x: x['_id'], db[col].aggregate(pipeline_remove)))
            db[col].delete_many({'_id': {'$in': idsList}})
            print(f"{len(idsList)} documents with missing values were removed from {col}\n")
    else:
        idsList = list(map(lambda x: x['_id'], db[collection].aggregate(pipeline_remove)))
        db[collection].delete_many({'_id': {'$in': idsList}})
        print(f"{len(idsList)} documents with missing values were removed from {collection}\n")

remove_missing_values(db)
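
Because `delete_many` cannot be undone, it can help to check which documents the `$match` stage would select before running the removal. The following is only an in-memory illustration of that condition (description and content each absent, `None`, or empty). `is_missing_text` and the sample documents are hypothetical.

```python
def is_missing_text(doc):
    """True when both description and content are absent, None, or ''."""
    return all(doc.get(field) in (None, "") for field in ("description", "content"))


# Made-up documents for illustration
sample_docs = [
    {"_id": 1, "description": "", "content": None},
    {"_id": 2, "description": "Some text", "content": ""},
    {"_id": 3},  # both fields absent
]
to_delete = [doc["_id"] for doc in sample_docs if is_missing_text(doc)]
```

A document with only one of the two fields empty, such as `_id` 2, is kept.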

## duplicates

In [None]:
def remove_duplicates(db, collection=None):
    """
    Remove duplicate documents (same description and content), keeping one document per group,
    from the given collection of db, or from all collections when collection is None (default).
    """
    pipeline_remove = [
        {
            "$group": {
                "_id": {'description': '$description', 'content': '$content'},
                "_idsNeedsToBeDeleted": {"$push": "$$ROOT._id"} # push all `_id`'s to an array
            }
        },
        # Drop the first _id of each group so that one document per group is kept
        {
            "$project": {
                "_id": 0,
                "_idsNeedsToBeDeleted": {  
                    "$slice": [
                        "$_idsNeedsToBeDeleted", 1, {"$size": "$_idsNeedsToBeDeleted"}
                    ]
                }
            }
        },
        {
            "$unwind": "$_idsNeedsToBeDeleted" # Unwind `_idsNeedsToBeDeleted`
        },
        # Group without a condition & push all `_idsNeedsToBeDeleted` fields to an array
        {
            "$group": { "_id": "", "_idsNeedsToBeDeleted": { "$push": "$_idsNeedsToBeDeleted" } }
        },
        { 
            "$project" : { "_id" : 0 }  # Optional stage
        }
        # At the end you'll have an [{ _idsNeedsToBeDeleted: [_ids] }] or []
    ]
    
    if collection is None:
        collection_list = db.list_collection_names()
        for col in collection_list:
            try:
                idsList = list(db[col].aggregate(pipeline_remove))[0]["_idsNeedsToBeDeleted"]
                db[col].delete_many({'_id': {'$in': idsList}})
                print(f"{len(idsList)} duplicate documents were removed from {col}\n")
            except IndexError:  # empty aggregation result: no duplicates found
                print(f"0 duplicate documents in {col}\n")
    else:
        try:
            idsList = list(db[collection].aggregate(pipeline_remove))[0]["_idsNeedsToBeDeleted"]
            db[collection].delete_many({'_id': {'$in': idsList}})
            print(f"{len(idsList)} duplicate documents were removed from {collection}\n")
        except IndexError:  # empty aggregation result: no duplicates found
            print(f"0 duplicate documents in {collection}\n")

remove_duplicates(db)
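
The aggregation groups documents by their (description, content) pair, keeps one `_id` per group, and collects the remaining `_id`s for deletion. The same logic on plain Python dictionaries is sketched below with made-up documents. `ids_to_delete` is a hypothetical helper, not part of the notebook.

```python
def ids_to_delete(docs):
    """Return the _ids of all but the first document per (description, content)."""
    seen = set()
    duplicates = []
    for doc in docs:
        key = (doc.get("description"), doc.get("content"))
        if key in seen:
            duplicates.append(doc["_id"])  # a later copy of an already seen pair
        else:
            seen.add(key)  # first occurrence, kept
    return duplicates


# Made-up documents for illustration
sample_docs = [
    {"_id": "a", "description": "d1", "content": "c1"},
    {"_id": "b", "description": "d1", "content": "c1"},
    {"_id": "c", "description": "d2", "content": "c1"},
    {"_id": "d", "description": "d1", "content": "c1"},
]
duplicate_ids = ids_to_delete(sample_docs)
```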

### duplicates across collections

In [None]:
pipeline = [
    {  # project fields
        '$project': {
            '_id': 0,
            'text': {
                '$concat': [
                    {'$ifNull': ['$title', '']},
                    ' - ',
                    {'$ifNull': ['$description', '']},
                    ' - ',
                    {'$ifNull': ['$content', '']},
                ]
            }
        }
    }
]

r_everything = list(map(lambda x: x['text'], db.everything.aggregate(pipeline)))
r_top_headlines = list(map(lambda x: x['text'], db.top_headlines.aggregate(pipeline)))

In [None]:
# The union of the two sets joins their elements while removing duplicates
len(set(r_top_headlines).union(set(r_everything)))
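
The size of the union alone does not show how many texts the two collections share. A small sketch of how the overlap could be counted, using made-up lists as stand-ins for `r_everything` and `r_top_headlines`:

```python
# Made-up stand-ins for the two lists of concatenated texts
everything_texts = ["a - b - c", "d - e - f", "d - e - f"]
top_headlines_texts = ["d - e - f", "g - h - i"]

unique_everything = set(everything_texts)
unique_top_headlines = set(top_headlines_texts)

shared = unique_everything & unique_top_headlines    # texts present in both
combined = unique_everything | unique_top_headlines  # same result as .union()

print(f"{len(shared)} shared texts, {len(combined)} unique texts overall")
```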