# No-SQL, MongoDB and Python
This Jupyter notebook works through the following questions:
1. What is No-SQL, especially in relation to SQL databases?
1. Why would you use a No-SQL database over a SQL database?
1. How can you set up a (*free*) cloud-based No-SQL database?
1. Which other vendors offer No-SQL databases, and how do they differ from [MongoDB](https://www.mongodb.com/), the choice used here?
1. How can you interact with your No-SQL database? This covers:
    + Connecting to the database
    + Viewing *objects*
    + Importing data
    
    
For this session, we are accessing a database that has already been set up. Accessing this database and performing the tasks outlined above requires **READ-WRITE** access. To obtain access to the database used in this notebook, please contact [Avision Ho](https://github.com/avisionh).
  

In [None]:
# Get location of Python installation
import sys
import os
os.path.dirname(sys.executable)

## 1. Set-up
Start by establishing the base for our code later on.

Note that installing packages may fail behind your organisation's proxy, even when using `pip install pymongo --proxy <ip_address>:<port_number>`. To get round this, we will use Google's hosted [Jupyter Notebook service](https://colab.research.google.com/).

In [13]:
# Install packages
!pip install pymongo
!pip install kaggle;

Collecting pymongo


  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x0000019A17BA2DA0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/pymongo/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x0000019A17BA25F8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/pymongo/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x0000019A17BA2668>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/pymongo/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeo

Collecting kaggle


  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x000001DF6EF224E0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/kaggle/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x000001DF6EF225C0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/kaggle/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x000001DF6EF22710>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/kaggle/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutE

### Let's connect to the Kaggle API
As part of our experimentation, we will be downloading data directly from Kaggle via their API. To do so, we will need to follow their guidance [here](https://github.com/Kaggle/kaggle-api).

In particular, we will need an API token, saved in our Google Colab session so that it can access the Kaggle API. Guidance on doing this is [here](https://stackoverflow.com/questions/49310470/using-kaggle-datasets-in-google-colab).

In [None]:
# Upload kaggle.json file
from google.colab import files
files.upload()

In [None]:
# Move into folder which Kaggle API client expects
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# For security, restrict the file permissions so that
# only the owner can read or write the API key
!chmod 600 ~/.kaggle/kaggle.json

### Now we need to load packages
The next step is to load our packages so we can use their functions later in the notebook. We also check our working directory for interest, and authenticate the **Kaggle API client** so we can download data from Kaggle's servers.

In [14]:
# Load relevant packages
from pymongo import MongoClient
from pprint import pprint
import kaggle
import json
import os

ModuleNotFoundError: No module named 'pymongo'

In [None]:
# Check current working directory
os.getcwd()

In [6]:
# Upload user_credentials file for connecting to MongoDB cluster
from google.colab import files
files.upload()

'C:\\Users\\AHo\\OneDrive - Department for Education\\Documents\\Python Scripts\\nosql-mongodb'

In [8]:
# Authenticate Kaggle API to enable downloading data from there
kaggle.api.authenticate()

## 2. Theory and Concepts
Here, we will discuss what No-SQL is, why you would use it, especially in comparison with SQL, and the key terminology to pick up when talking about No-SQL.

### What is this new-fangled No-SQL then?
No-SQL is **not** SQL.

No-SQL refers to a family of non-relational databases, whereas SQL databases belong to the category of relational databases.

No-SQL databases are designed to handle large-scale, unstructured data that evolves dynamically and changes in real time.

Essentially, No-SQL databases are built to be flexible and scalable so that they can hold any type of data at large volumes and with little work in creating a pre-existing structure to hold the data.

#### Example: Amazon
As an ecommerce giant, Amazon sells a phenomenal range of products (*e.g. books, electronics, clothes and even food*), with each of these products having a number of characteristics associated with them (*e.g. price, weight, dimensions, manufacturer description, reviews*).

In theory, these products, their product categories and characteristics could all be stored in a relational (SQL) database. However, if a new characteristic such as user reviews were to be added, the schema may need to be redesigned and the existing data migrated to fit it. With a non-relational database, there is no need to rework the existing structure - **flexibility**.

Another compelling reason for choosing a non-relational database here is the sheer volume of data. Given how many orders Amazon collects and processes (not to mention that they also monitor your activity in browsing and mulling over products), they are collecting huge reams of data. There will come a point where they reach the storage limit of their machine. A relational database system is typically **vertically-scalable**: to increase capacity, they would need to buy a bigger computer with more storage space, transfer all the data over, and get rid of the old one. A non-relational database system is **horizontally-scalable**: they only need to buy an extra, smaller computer and add it to the network of existing computers, so that the *total storage space of all these computers* increases to take on the extra data. This is usually much more cost-effective - **scalability**.
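To make the **flexibility** point concrete, here is a minimal sketch in plain Python (no database needed). The product names, prices and review text are made up for illustration. Each document is simply a dictionary, and documents in the same collection do not need to share the same fields.

```python
# Hypothetical product documents: each one is a plain dictionary,
# and documents in the same collection need not share the same fields.
products = [
    {"name": "Paperback novel", "price": 7.99, "weight_kg": 0.3},
    {"name": "Headphones", "price": 59.00, "manufacturer": "ExampleCo"},
]

# Adding user reviews later only touches the documents that have them;
# the other documents are left exactly as they were.
products[1]["reviews"] = [{"rating": 5, "text": "Great sound"}]

# List the fields held by each document
for product in products:
    print(product["name"], "->", sorted(product))
```

In a relational table, the equivalent change would typically mean adding a new column (or a new table) that applies to every product, whether it has reviews or not.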

## Why should I use No-SQL over my dearly beloved SQL?

## Righty-ho, I want a No-SQL database, what are my choices?

## Let's get our language right first, eh?
As No-SQL is a tad different to SQL, the language used to describe it is different too. Fret not though, there are analogues of SQL terminology which map to No-SQL terminology!

| SQL Term | No-SQL Term |
| --- | --- | 
| Server | Cluster |
| Database | Database |
| Schema | Schema |
| Table | Collection |
| Row | Document |
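As a rough illustration of the last two rows of the table, the sketch below uses Python's built-in `sqlite3` module to create a one-row SQL table and then expresses that same row as a document. The table layout and column names are assumptions chosen for the example, not something prescribed by MongoDB.

```python
import sqlite3

# SQL side: a table with a fixed set of columns, holding one row
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute(
    "CREATE TABLE authors (id TEXT PRIMARY KEY, fname TEXT, lname TEXT, yob INTEGER)"
)
conn.execute(
    "INSERT INTO authors VALUES (?, ?, ?, ?)",
    ("AhoAV", "Alfred V.", "Aho", 1941),
)
row = conn.execute("SELECT * FROM authors").fetchone()

# No-SQL side: the same record expressed as a document (a dictionary of fields)
document = dict(row)

# A collection is a group of documents, playing the role of the table
collection = [document]

print(document)
conn.close()
```

The row and the document carry the same information; the difference is that the document brings its field names along with it, rather than relying on the table's column definitions.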

## How do I create a (free) No-SQL cluster?

## How do I connect to my No-SQL MongoDB cluster?
To connect to your newly-created MongoDB cluster, we will follow the steps below, which are partly covered in the official [MongoDB Atlas documentation](https://docs.atlas.mongodb.com/driver-connection/) and in blogs like this [one](https://code.tutsplus.com/tutorials/create-a-database-cluster-in-the-cloud-with-mongodb-atlas--cms-31840).
1. On the MongoDB Atlas dashboard, create a user group that has **READ-WRITE** access to the cluster.
1. Create a JSON file and store these credentials in there so not every Tom, Dick and Harry can do stuff on your cluster.
    + Including destroying your data (*gasps*)!
1. In Python, import this JSON file.
1. On the MongoDB Atlas dashboard, obtain the connection string for your cluster.
1. In Python, feed the user credentials from your JSON file into your connection string.
1. In Python, use this connection string to connect to your MongoDB cluster.
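Steps 2 and 3 above can be sketched as follows. This is a minimal illustration rather than a definitive approach: the key names mirror the ones used in this notebook's `user_credentials.json`, while the placeholder values, the temporary folder and the `load_credentials` helper are assumptions made for the example.

```python
import json
import os
import tempfile

# Keys we expect to find in the credentials file
REQUIRED_KEYS = {"user_group", "user_password"}


def load_credentials(path):
    """Read the credentials file and check it contains the expected keys."""
    with open(path) as file_json:
        credentials = json.load(file_json)
    missing = REQUIRED_KEYS.difference(credentials)
    if missing:
        raise KeyError(f"Missing keys in credentials file: {sorted(missing)}")
    return credentials


# Demonstration with placeholder values written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "user_credentials.json")
    with open(path, "w") as file_json:
        json.dump(
            {"user_group": "example_user", "user_password": "example_password"},
            file_json,
        )
    # Owner read/write only (this has limited effect on Windows)
    os.chmod(path, 0o600)
    credentials = load_credentials(path)

# Show which keys were loaded without revealing the password itself
print(sorted(credentials))
```

Keeping the credentials in a separate file also means the file can be excluded from version control, so the password never ends up in the notebook itself.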

In [10]:
# Import credentials for connecting to MongoDB server
with open("user_credentials.json") as file_json:
    data_credentials = json.load(file_json)
# Print only the key names so the password is not exposed in the notebook output
print(list(data_credentials))

['user_group', 'user_password']


In [11]:
# Create connection string
from urllib.parse import quote_plus

# Percent-escape the credentials in case they contain reserved characters such as '@' or ':'
connect_user = quote_plus(data_credentials["user_group"])
connect_password = quote_plus(data_credentials["user_password"])
# Obtain the full connection string from the MongoDB Atlas dashboard
connect_string = "mongodb://" + connect_user + ":" + connect_password + "@cluster-open-shard-00-00-kzzlc.mongodb.net:27017,cluster-open-shard-00-01-kzzlc.mongodb.net:27017,cluster-open-shard-00-02-kzzlc.mongodb.net:27017/test?replicaSet=cluster-open-shard-0&authSource=admin&ssl=true"

class Connect(object):
    @staticmethod
    def get_connection():
        # Requires `from pymongo import MongoClient` to have run successfully
        return MongoClient(connect_string)

# Call the class just created to connect to MongoDB
client = Connect.get_connection()

NameError: name 'MongoClient' is not defined

## Woop! Can I see what's already inside the cluster?
Let's have a look at the pre-existing **library** database as a quick check to see we are connected to the right cluster!

In [11]:
# Access the 'library' database
db = client.library

# Retrieve all documents in 'authors' collection within the 'library' database
cursor = db.authors.find({})
for author in cursor:
    pprint(author)

{'_id': 'AhoAV', 'fname': 'Alfred V.', 'lname': 'Aho', 'yob': 1941.0}
{'_id': 'HopcroftJE', 'fname': 'John E.', 'lname': 'Hopcroft', 'yob': 1939.0}
{'_id': 'WirthN', 'fname': 'Niklaus', 'lname': 'Wirth', 'yob': 1934.0}
{'_id': 'LeisersonCE',
 'fname': 'Charles E.',
 'lname': 'Leiserson',
 'yob': 1953.0}
{'_id': 'RivestRL', 'fname': 'Ronald L.', 'lname': 'Rivest', 'yob': 1947.0}
{'_id': 'SteinCL', 'fname': 'Clifford S.', 'lname': 'Stein', 'yob': 1965.0}
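Passing an empty filter `{}` to `find()` returns every document. To view only some of them, a filter and a projection can be passed instead. The sketch below is one way this might look; the helper name `authors_born_before` is hypothetical, and it assumes the collection's documents have the `fname`, `lname` and `yob` fields seen in the output above.

```python
def authors_born_before(collection, year):
    """Return authors whose year of birth is earlier than `year`.

    `collection` is expected to behave like a pymongo collection,
    i.e. to provide a `find(filter, projection)` method.
    """
    # '$lt' is MongoDB's "less than" query operator
    query = {"yob": {"$lt": year}}
    # Keep only the name and year fields, and drop the '_id' field
    projection = {"_id": 0, "fname": 1, "lname": 1, "yob": 1}
    return list(collection.find(query, projection))
```

With a live connection, this would be called as `authors_born_before(db.authors, 1940)`. Judging by the documents listed above, that should match the authors born in 1939 and 1934.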


## I want to get my hands dirty now! How can I import data?