### MongoDB Programmatic Access
In this notebook, you'll learn how to connect to a MongoDB database programmatically using [PyMongo](https://pymongo.readthedocs.io/en/stable/) and manipulate the retrieved data using [Pandas DataFrames](https://pandas.pydata.org/docs/index.html). Together, these libraries let you query documents stored in MongoDB and analyze them as tabular data, bridging the gap between a document-oriented NoSQL database and structured data analysis tools.

#### Tools and Libraries
- **PyMongo**: A Python library that provides tools for working with MongoDB. It allows you to connect to MongoDB, execute queries, and retrieve data programmatically.
- **Pandas**: A powerful data manipulation and analysis library for Python. Pandas provides data structures like Series and DataFrames that make it easy to clean, transform, and analyze data.

#### PyMongo Library
Python connects to a MongoDB database through the **PyMongo** library. Import the **MongoClient** class from the **pymongo** package. Execute the following cell:

**Note**: Command cell execution is performed by placing the cursor at the **beginning** of the command cell and then performing one of the following options:

- **Option 1**: Press the **CTRL+ENTER** (Windows) or **CMD+ENTER** (macOS) key sequence.
- **Option 2**: Click the **Play** button in the top menu bar.
- **Option 3**: Select the **Run/Run Selected Cell** option in the top menu.
    
Placing the cursor at the very **beginning** of the cell avoids triggering IntelliSense unnecessarily.

In [1]:
from pymongo import MongoClient

#### MongoDB Connectivity
Connectivity to a MongoDB database is established by using the **MongoClient** class. Connect to the **booksdb** database. Execute the following cell:

In [2]:
# Connect to the local MongoDB server
hostname = 'localhost'
port = 27017

# Create a MongoClient instance
client = MongoClient(hostname, port)

# Connect to the booksdb database
booksdb = client['booksdb']

#### Database Query
Data retrieval operations are performed by calling methods such as **find()** on a collection object, which is obtained from the database object.

For example, let's retrieve the first 10 books from the **books** collection in the **booksdb** database and print out the results. Execute the following cell:

In [3]:
# Get the books collection from the booksdb database
books = booksdb['books']

# Query the books collection, limiting the result to 10 documents
all_books = books.find().limit(10)

# Convert the cursor to a list
books_list = list(all_books)

# Print the list of books
print(books_list)

[{'_id': 3, 'title': 'Specification by Example', 'isbn': '1617290084', 'pageCount': 0, 'publishedDate': datetime.datetime(2011, 6, 3, 7, 0), 'thumbnailUrl': 'https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/adzic.jpg', 'status': 'PUBLISH', 'authors': ['Gojko Adzic'], 'categories': ['Software Engineering']}, {'_id': 4, 'title': 'Flex 3 in Action', 'isbn': '1933988746', 'pageCount': 576, 'publishedDate': datetime.datetime(2009, 2, 2, 8, 0), 'thumbnailUrl': 'https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/ahmed.jpg', 'longDescription': "New web applications require engaging user-friendly interfaces   and the cooler, the better. With Flex 3, web developers at any skill level can create high-quality, effective, and interactive Rich Internet Applications (RIAs) quickly and easily. Flex removes the complexity barrier from RIA development by offering sophisticated tools and a straightforward programming language so you can focus on what you want to do instead of 
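
Printing the whole list on one line is hard to read. As a sketch of one alternative, the standard library's `pprint` module can lay out each document field by field. The `sample_books` list below is a hypothetical, cut-down stand-in for `books_list`, so the example runs without a database.

```python
from datetime import datetime
from pprint import pformat

# Hypothetical sample shaped like one document from the books collection
sample_books = [
    {
        "_id": 3,
        "title": "Specification by Example",
        "isbn": "1617290084",
        "pageCount": 0,
        "publishedDate": datetime(2011, 6, 3, 7, 0),
        "status": "PUBLISH",
        "authors": ["Gojko Adzic"],
        "categories": ["Software Engineering"],
    },
]

# Wrap the output at 80 characters and keep the fields in document order
formatted = pformat(sample_books, width=80, sort_dicts=False)
print(formatted)
```

In the notebook, the same call could be applied to `books_list` in place of `sample_books`.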

### Pandas Data Manipulation
Pandas is a widely used open-source data manipulation and analysis library for the Python programming language. It provides the data structures and functions needed to work with structured data. The most important of these data structures is the DataFrame.

A DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a table in a database or an Excel spreadsheet. DataFrames allow for easy manipulation and analysis of data, including operations such as filtering, grouping, aggregating, and joining datasets.

Key features of DataFrames include:

- **Labeled Axes**: Each row and column can be labeled, making data manipulation and retrieval easier.
- **Arithmetic Operations**: Arithmetic can be performed across rows and columns, with values aligned by their labels.
- **Heterogeneous Data Handling**: DataFrames can store different data types (integers, floats, strings, booleans, etc.) in different columns. This flexibility allows DataFrames to handle a wide variety of datasets, making them suitable for diverse data analysis tasks.
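
The three features listed above can be sketched with a small in-memory DataFrame. The titles, values, and the `sheets` column are made up for illustration and are not taken from the books collection.

```python
import pandas as pd

# Hypothetical sample data; values are illustrative only
sample = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C"],
        "pageCount": [576, 416, 425],
        "inPrint": [True, False, True],
    },
    index=["a", "b", "c"],  # row labels
)

# Heterogeneous data: each column keeps its own dtype
print(sample.dtypes)

# Labeled axes: select a single value by row label and column label
pages_b = sample.loc["b", "pageCount"]
print(pages_b)

# Arithmetic on a column is applied element-wise, aligned on the row labels
sample["sheets"] = sample["pageCount"] / 2
print(sample)
```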

#### Pandas Library
Import the **pandas** library. Execute the following cell:

In [4]:
import pandas as pd

#### DataFrame Creation And Loading
Creating a new **DataFrame** and loading it with data from MongoDB is quick and easy.

The following example queries for all books. The matching documents are passed into a new DataFrame, which is then displayed. Execute the following cell:

In [5]:
# Get all books from the books collection as a cursor
books_cursor = booksdb.books.find()

# Convert the cursor to a list of documents, then to a DataFrame
data = list(books_cursor)
df = pd.DataFrame(data)

# Display the DataFrame
df

,_id,title,isbn,pageCount,publishedDate,thumbnailUrl,status,authors,categories,longDescription,shortDescription,lastModified
0,3,Specification by Example,1617290084,0,2011-06-03 07:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,[Gojko Adzic],[Software Engineering],,,NaT
1,4,Flex 3 in Action,1933988746,576,2009-02-02 08:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,"[Tariq Ahmed with Jon Hirschi, Faisal Abid]",[Internet],New web applications require engaging user-fri...,,NaT
2,2,"Android in Action, Second Edition",1935182722,592,2011-01-14 08:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,"[W. Frank Ableson, Robi Sen]",[Java],"When it comes to mobile apps, Android can do a...","Android in Action, Second Edition is a compreh...",NaT
3,1,Unlocking Android,1933988673,416,2009-04-01 07:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,"[W. Frank Ableson, Charlie Collins, Robi Sen]","[Open Source, Mobile]",Android is an open source mobile phone platfor...,Unlocking Android: A Developer's Guide provide...,NaT
4,6,Collective Intelligence in Action,1933988312,425,2008-10-01 07:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,[Satnam Alag],[Internet],"There's a great deal of wisdom in a crowd, but...",,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...
425,53c2ae8528d75d572c06adb8,DSLs in Action,1935182455,376,2010-12-01 08:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,[],[],"On any given day, a developer may encounter a ...",DSLs in Action introduces the concepts and def...,NaT
426,53c2ae8528d75d572c06adb9,Database Programming for Handheld Devices,1884777856,0,2000-07-01 07:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,[],[],,,NaT
427,53c2ae8528d75d572c06adba,Jakarta Commons Online Bookshelf,1932394524,402,2005-03-01 08:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,[],[],Written for developers and architects with rea...,,NaT
428,53c2ae8528d75d572c06adbb,Browsing with HttpClient,1932394524a-e,0,2005-03-01 08:00:00,https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....,PUBLISH,[],[],,Written for developers and architects with rea...,NaT
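
MongoDB documents in the same collection do not need to share the same fields, which is why some cells in the output above are empty. The sketch below uses two hypothetical documents (the description text is made up) to show how `pd.DataFrame` handles a field that is missing from one of them.

```python
import pandas as pd

# Hypothetical documents; the second one has no shortDescription field
docs = [
    {"_id": 1, "title": "Unlocking Android", "shortDescription": "Example description"},
    {"_id": 3, "title": "Specification by Example"},
]

frame = pd.DataFrame(docs)

# Columns are the union of all keys; absent fields become missing values
print(frame.columns.tolist())
print(frame["shortDescription"].isna().tolist())
```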


With the DataFrame created and loaded, we can begin to reshape it. For example, let's drop the **_id**, **thumbnailUrl**, **longDescription**, and **shortDescription** columns. The updated DataFrame is then displayed. Execute the following cell:

In [6]:
# Drop columns and update DataFrame
df = df.drop(columns=["_id", "thumbnailUrl", "longDescription", "shortDescription"])

# Display the DataFrame
df

,title,isbn,pageCount,publishedDate,status,authors,categories,lastModified
0,Specification by Example,1617290084,0,2011-06-03 07:00:00,PUBLISH,[Gojko Adzic],[Software Engineering],NaT
1,Flex 3 in Action,1933988746,576,2009-02-02 08:00:00,PUBLISH,"[Tariq Ahmed with Jon Hirschi, Faisal Abid]",[Internet],NaT
2,"Android in Action, Second Edition",1935182722,592,2011-01-14 08:00:00,PUBLISH,"[W. Frank Ableson, Robi Sen]",[Java],NaT
3,Unlocking Android,1933988673,416,2009-04-01 07:00:00,PUBLISH,"[W. Frank Ableson, Charlie Collins, Robi Sen]","[Open Source, Mobile]",NaT
4,Collective Intelligence in Action,1933988312,425,2008-10-01 07:00:00,PUBLISH,[Satnam Alag],[Internet],NaT
...,...,...,...,...,...,...,...,...
425,DSLs in Action,1935182455,376,2010-12-01 08:00:00,PUBLISH,[],[],NaT
426,Database Programming for Handheld Devices,1884777856,0,2000-07-01 07:00:00,PUBLISH,[],[],NaT
427,Jakarta Commons Online Bookshelf,1932394524,402,2005-03-01 08:00:00,PUBLISH,[],[],NaT
428,Browsing with HttpClient,1932394524a-e,0,2005-03-01 08:00:00,PUBLISH,[],[],NaT
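
Because the cell above reassigns `df`, running it a second time raises a `KeyError`: the columns are already gone. One way to make a drop tolerant of missing columns is the `errors="ignore"` option, sketched here on a hypothetical three-column sample.

```python
import pandas as pd

# Hypothetical sample data; values are illustrative only
sample = pd.DataFrame(
    {"_id": [1, 2], "title": ["Book A", "Book B"], "pageCount": [416, 592]}
)

unwanted = ["_id", "thumbnailUrl"]  # thumbnailUrl is absent from this sample

# errors="ignore" skips labels that are not present instead of raising KeyError
trimmed = sample.drop(columns=unwanted, errors="ignore")
print(trimmed.columns.tolist())
```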


DataFrames can be filtered. As an example, filter the existing DataFrame and return just the book rows where the **pageCount** is greater than or equal to **300**. Execute the following cell:

In [7]:
# Filter DataFrame on pageCount column
df = df[df["pageCount"] >= 300]

# Display the DataFrame
df

,title,isbn,pageCount,publishedDate,status,authors,categories,lastModified
1,Flex 3 in Action,1933988746,576,2009-02-02 08:00:00,PUBLISH,"[Tariq Ahmed with Jon Hirschi, Faisal Abid]",[Internet],NaT
2,"Android in Action, Second Edition",1935182722,592,2011-01-14 08:00:00,PUBLISH,"[W. Frank Ableson, Robi Sen]",[Java],NaT
3,Unlocking Android,1933988673,416,2009-04-01 07:00:00,PUBLISH,"[W. Frank Ableson, Charlie Collins, Robi Sen]","[Open Source, Mobile]",NaT
4,Collective Intelligence in Action,1933988312,425,2008-10-01 07:00:00,PUBLISH,[Satnam Alag],[Internet],NaT
5,Flex 4 in Action,1935182420,600,2010-11-15 08:00:00,PUBLISH,"[Tariq Ahmed, Dan Orlando, John C. Bland II, J...",[Internet],NaT
...,...,...,...,...,...,...,...,...
422,Sencha Touch in Action,1617290378,375,2013-07-12 07:00:00,PUBLISH,[],[],NaT
423,Programming Windows Server 2003,1930110987,328,2003-08-01 07:00:00,PUBLISH,[],[],NaT
424,Struts Recipes,1932394249,520,2004-11-01 08:00:00,PUBLISH,[],[],NaT
425,DSLs in Action,1935182455,376,2010-12-01 08:00:00,PUBLISH,[],[],NaT
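
Filters can also combine several conditions. The sketch below uses a hypothetical four-row sample and stores the result under a new name (`long_published`, an illustrative choice) so the unfiltered data stays available.

```python
import pandas as pd

# Hypothetical sample data; values are illustrative only
sample = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C", "Book D"],
        "pageCount": [0, 576, 320, 700],
        "status": ["PUBLISH", "PUBLISH", "MEAP", "MEAP"],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and ==.
mask = (sample["pageCount"] >= 300) & (sample["status"] == "PUBLISH")

# Keep the result in a new variable so the original DataFrame stays intact
long_published = sample[mask]
print(long_published)
```

Note that the notebook cell above overwrites `df`, so the aggregations that follow operate only on books with at least 300 pages.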


Aggregations can be performed on DataFrames. For example, let's return the average page count for all books in the current DataFrame. Execute the following cell:

In [8]:
# Return the average page count for books
df["pageCount"].mean()

483.0996015936255

The next example returns the maximum page count for the books stored in the current DataFrame. Execute the following cell:

In [9]:
# Return the maximum page count for books
df["pageCount"].max()

1101

The next example repeats the previous one, but this time breaks out the maximum page count for each book status (**PUBLISH**, **MEAP**). Execute the following cell:

In [10]:
# Return the maximum page count for each book status.
# Select pageCount explicitly: max('pageCount') would pass the string as numeric_only
df.groupby('status')[['pageCount']].max()

status,pageCount
MEAP,700
PUBLISH,1101
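
Several aggregates can be computed per group in one step. The following is a minimal sketch using `agg` on a hypothetical sample; the numbers are illustrative and do not come from the books collection.

```python
import pandas as pd

# Hypothetical sample data; values are illustrative only
sample = pd.DataFrame(
    {
        "status": ["PUBLISH", "PUBLISH", "MEAP", "MEAP"],
        "pageCount": [576, 416, 700, 350],
    }
)

# Select the column first, then compute several aggregates per group
summary = sample.groupby("status")["pageCount"].agg(["count", "mean", "max"])
print(summary)
```

The result is a DataFrame indexed by status with one column per aggregate.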
