
# MongoDB Atlas with Python: Using the `sample_mflix` Dataset
This notebook introduces basic querying, filtering, and aggregation with MongoDB using the **sample_mflix** dataset, which contains collections of movies, comments, theaters, and users. You'll learn how to:
- Connect to MongoDB Atlas
- Perform basic queries and filtering
- Execute advanced operations like aggregation
- Create indexes and update/delete documents

Let's get started!

In [None]:
!pip install --upgrade pymongo certifi



## 1. Setup and Connection to MongoDB Atlas


In [None]:

# Import necessary libraries
from pymongo import MongoClient
import pprint

# Replace username, password, and <yourcluster> with your MongoDB Atlas values
connection_string = "mongodb+srv://username:password@<yourcluster>.mongodb.net/test?retryWrites=true&w=majority"

# Connect to MongoDB Atlas
client = MongoClient(connection_string)

# Access the sample_mflix database and the movies collection
db = client['sample_mflix']
collection = db['movies']
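
Passwords that contain characters such as `@` or `/` must be percent-escaped before they go into the connection string. The sketch below shows one way to do that with the standard library. The function name `build_atlas_uri`, the credentials, and the host `cluster0.example.mongodb.net` are hypothetical placeholders, not values from this notebook.

```python
from urllib.parse import quote_plus

def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-escaped credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"

# Hypothetical values; substitute your own cluster host and credentials.
uri = build_atlas_uri("student", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```

The resulting string can be passed to `MongoClient` in place of `connection_string`.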


## 2. Basic MongoDB Commands
### Searching for Documents (Basic Query)


In [None]:

# Find one document from the movies collection
document = collection.find_one()
pprint.pprint(document)


{'_id': ObjectId('573a1390f29313caabcd42e8'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['A.C. Abadie',
          "Gilbert M. 'Broncho Billy' Anderson",
          'George Barnes',
          'Justus D. Barnes'],
 'countries': ['USA'],
 'directors': ['Edwin S. Porter'],
 'fullplot': 'Among the earliest existing films in American cinema - notable '
             'as the first film that presented a narrative story to tell - it '
             'depicts a group of cowboy outlaws who hold up a train and rob '
             "the passengers. They are then pursued by a Sheriff's posse. "
             'Several scenes have color included - all hand tinted.',
 'genres': ['Short', 'Western'],
 'imdb': {'id': 439, 'rating': 7.4, 'votes': 9847},
 'languages': ['English'],
 'lastupdated': '2015-08-13 00:27:59.177000000',
 'num_mflix_comments': 0,
 'plot': 'A group of bandits stage a brazen train hold-up, only to find a '
         'determined posse hot on their heels.',
 'poster'
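
Movie documents are large, so it often helps to ask for only a few fields. The sketch below passes a projection to `find_one`. The names `SUMMARY_FIELDS` and `find_one_summary` are illustrative, and `collection` is assumed to be the `movies` collection opened earlier.

```python
# Projection document: 1 keeps a field, 0 drops it.
# _id is returned by default, so it is excluded explicitly here.
SUMMARY_FIELDS = {"_id": 0, "title": 1, "year": 1, "imdb.rating": 1}

def find_one_summary(collection, query=None):
    """Return one matching movie with only the summary fields."""
    return collection.find_one(query or {}, projection=SUMMARY_FIELDS)

# Usage with the notebook's collection:
# pprint.pprint(find_one_summary(collection))
```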

### Searching with a Filter (Filtering)


In [None]:

# Find movies whose "genres" array contains "Action" (limited to the first 5)
action_movies = collection.find({"genres": "Action"}).limit(5)

# Print the results
for movie in action_movies:
    pprint.pprint(movie)


{'_id': ObjectId('573a1390f29313caabcd5293'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Pearl White', 'Crane Wilbur', 'Paul Panzer', 'Edward Josè'],
 'countries': ['USA'],
 'directors': ['Louis J. Gasnier', 'Donald MacKenzie'],
 'fullplot': 'Young Pauline is left a lot of money when her wealthy uncle '
             "dies. However, her uncle's secretary has been named as her "
             'guardian until she marries, at which time she will officially '
             'take possession of her inheritance. Meanwhile, her "guardian" '
             'and his confederates constantly come up with schemes to get rid '
             'of Pauline so that he can get his hands on the money himself.',
 'genres': ['Action'],
 'imdb': {'id': 4465, 'rating': 7.6, 'votes': 744},
 'languages': ['English'],
 'lastupdated': '2015-09-12 00:01:18.647000000',
 'num_mflix_comments': 0,
 'plot': 'Young Pauline is left a lot of money when her wealthy uncle dies. '
         "However, her 
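
The filter `{"genres": "Action"}` matches any document whose `genres` array contains `"Action"`, not only documents whose array is exactly `["Action"]`. The plain-Python sketch below emulates that rule, and the stricter `$all` operator, on a few hypothetical documents. It does not talk to MongoDB, and the helper names are made up for illustration.

```python
# Hypothetical documents shaped like sample_mflix movies.
movies = [
    {"title": "A", "genres": ["Action"]},
    {"title": "B", "genres": ["Action", "Drama", "Western"]},
    {"title": "C", "genres": ["Comedy"]},
]

def contains(doc, field, value):
    """Emulate {field: value} on an array field: membership, not equality."""
    return value in doc.get(field, [])

def contains_all(doc, field, values):
    """Emulate {field: {"$all": values}}: every listed value must be present."""
    return set(values) <= set(doc.get(field, []))

action_titles = [m["title"] for m in movies if contains(m, "genres", "Action")]
action_drama = [m["title"] for m in movies
                if contains_all(m, "genres", ["Action", "Drama"])]
print(action_titles)  # ['A', 'B']
print(action_drama)   # ['B']
```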

### Sorting Results


In [None]:

# Find and sort movies by release year in descending order
sorted_movies = collection.find().sort("year", -1).limit(5)

# Print the sorted results
for movie in sorted_movies:
    pprint.pprint(movie)


{'_id': ObjectId('573a13eaf29313caabdcfbc1'),
 'awards': {'nominations': 4,
            'text': 'Nominated for 2 Primetime Emmys. Another 1 win & 4 '
                    'nominations.',
            'wins': 3},
 'cast': ['Meryl Streep',
          'Edward Herrmann',
          'Doris Kearns Goodwin',
          'Franklin D. Roosevelt'],
 'countries': ['USA'],
 'fullplot': 'A documentary that weaves together the stories of Theodore, '
             'Franklin and Eleanor Roosevelt, three members of one of the most '
             'prominent and influential families in American politics.',
 'genres': ['Documentary'],
 'imdb': {'id': 3400010, 'rating': 8.8, 'votes': 682},
 'languages': ['English'],
 'lastupdated': '2015-08-23 00:10:24.657000000',
 'num_mflix_comments': 1,
 'plot': 'A documentary that weaves together the stories of Theodore, Franklin '
         'and Eleanor Roosevelt, three members of one of the most prominent '
         'and influential families in American politics.',
 'poster'
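
When many movies share the same year, a second sort key makes the order predictable. In PyMongo this is written as a list of `(field, direction)` pairs passed to `sort`. The sketch below emulates such a specification in plain Python on hypothetical documents. `SORT_SPEC` and `apply_sort` are illustrative names only.

```python
SORT_SPEC = [("year", -1), ("title", 1)]  # newest first, then title A to Z

# Hypothetical documents.
movies = [
    {"title": "Beta", "year": 2015},
    {"title": "Alpha", "year": 2015},
    {"title": "Gamma", "year": 1999},
]

def apply_sort(docs, spec):
    """Emulate a multi-key sort by sorting on keys from last to first.

    Python's sort is stable, so earlier keys in spec take priority.
    """
    result = list(docs)
    for field, direction in reversed(spec):
        result.sort(key=lambda d: d[field], reverse=(direction == -1))
    return result

ordered = [m["title"] for m in apply_sort(movies, SORT_SPEC)]
print(ordered)  # ['Alpha', 'Beta', 'Gamma']

# Equivalent call against the notebook's collection:
# collection.find().sort(SORT_SPEC).limit(5)
```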

### Searching with Multiple Conditions


In [None]:

# Find movies where the genre is "Action" and the rating is greater than 8
multi_condition_query = {"genres": "Action", "imdb.rating": {"$gt": 8}}

# Execute the query
results = collection.find(multi_condition_query).limit(5)

# Print the results
for result in results:
    pprint.pprint(result)


{'_id': ObjectId('573a1395f29313caabce2498'),
 'awards': {'nominations': 1, 'text': '1 win & 1 nomination.', 'wins': 1},
 'cast': ['Clint Eastwood',
          'Marianne Koch',
          'Gian Maria Volontè',
          'Wolfgang Lukschy'],
 'countries': ['Italy', 'Spain', 'West Germany'],
 'directors': ['Sergio Leone'],
 'fullplot': 'An anonymous, but deadly man rides into a town torn by war '
             "between two factions, the Baxters and the Rojo's. Instead of "
             'fleeing or dying, as most other would do, the man schemes to '
             'play the two sides off each other, getting rich in the bargain.',
 'genres': ['Action', 'Drama', 'Western'],
 'imdb': {'id': 58461, 'rating': 8.1, 'votes': 126585},
 'languages': ['Italian', 'Spanish', 'English'],
 'lastupdated': '2015-09-02 00:17:22.303000000',
 'num_mflix_comments': 0,
 'plot': 'A wandering gunfighter plays two rival families against each other '
         'in a town torn apart by greed, pride, and revenge.',
 'pos
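
Listing several fields in one filter document combines them with an implicit AND, and several comparison operators can share a single field. The helper below is a hypothetical sketch that builds such a filter for a genre and an optional rating range. Its name and parameters are not part of PyMongo.

```python
def genre_rating_filter(genre, min_rating=None, max_rating=None):
    """Build a filter: genre match AND an optional IMDb rating range."""
    query = {"genres": genre}
    rating = {}
    if min_rating is not None:
        rating["$gt"] = min_rating   # strictly greater than
    if max_rating is not None:
        rating["$lte"] = max_rating  # less than or equal to
    if rating:
        query["imdb.rating"] = rating
    return query

print(genre_rating_filter("Action", min_rating=8))
# {'genres': 'Action', 'imdb.rating': {'$gt': 8}}

# Usage with the notebook's collection:
# collection.find(genre_rating_filter("Action", 8, 9)).limit(5)
```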

## 3. Advanced MongoDB Operations
### Aggregation Example: Average IMDb Rating by Genre


In [None]:

# Aggregation pipeline to calculate average IMDb rating by genre
aggregation_pipeline = [
    {"$unwind": "$genres"},  # Separate each movie's genres into individual documents
    {"$group": {"_id": "$genres", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

# Execute the aggregation
aggregated_data = collection.aggregate(aggregation_pipeline)

# Print the results
for data in aggregated_data:
    pprint.pprint(data)
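
To see what the `$unwind` and `$group` stages compute, the plain-Python sketch below reproduces them on three hypothetical documents. It only emulates the pipeline: `$unwind` becomes a loop over each genre, and `$avg` is modelled as a mean that skips non-numeric values.

```python
from collections import defaultdict

# Hypothetical documents.
movies = [
    {"genres": ["Action", "Drama"], "imdb": {"rating": 8.0}},
    {"genres": ["Drama"], "imdb": {"rating": 6.0}},
    {"genres": ["Action"], "imdb": {"rating": ""}},  # non-numeric, skipped
]

def avg_rating_by_genre(docs):
    """Emulate $unwind on genres, then $group with $avg, then $sort descending."""
    ratings = defaultdict(list)
    for doc in docs:
        rating = doc.get("imdb", {}).get("rating")
        for genre in doc.get("genres", []):       # $unwind: one row per genre
            if isinstance(rating, (int, float)):  # $avg ignores non-numeric values
                ratings[genre].append(rating)
    # Simplification: a genre with no numeric ratings is omitted here.
    averages = {g: sum(v) / len(v) for g, v in ratings.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

result = avg_rating_by_genre(movies)
print(result)  # [('Action', 8.0), ('Drama', 7.0)]
```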


### Indexing: Creating an Index


In [None]:

# Create an index on the "year" field to improve query performance for year-related searches
collection.create_index([("year", 1)])

# Show existing indexes
indexes = collection.index_information()
pprint.pprint(indexes)
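
When no name is supplied, MongoDB names an index by joining each field and its direction with underscores, so the index above appears as `year_1` in `index_information()`. The helpers below are a small sketch built on that convention. `default_index_name` and `has_index` are hypothetical names, not PyMongo functions.

```python
def default_index_name(keys):
    """Default index name: field_direction pairs joined by underscores."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)

def has_index(collection, keys):
    """Check index_information() for an index with the default name for keys."""
    return default_index_name(keys) in collection.index_information()

print(default_index_name([("year", 1)]))  # year_1

# Usage with the notebook's collection:
# has_index(collection, [("year", 1)])
```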


### Updating Documents


In [None]:

# Set the IMDb rating of the first movie titled "The Godfather" to 9.3
collection.update_one({"title": "The Godfather"}, {"$set": {"imdb.rating": 9.3}})
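
`update_one` changes at most one document and returns a result object that is worth inspecting, because a filter that matches nothing raises no error. The wrapper below is a minimal sketch. `set_rating` is a hypothetical helper name, while `matched_count` and `modified_count` are attributes of PyMongo's update result.

```python
def set_rating(collection, title, rating):
    """Set imdb.rating on the first movie with this title and report counts."""
    result = collection.update_one(
        {"title": title},
        {"$set": {"imdb.rating": rating}},
    )
    # matched_count: documents that matched the filter (0 or 1 for update_one)
    # modified_count: documents whose stored value actually changed
    return result.matched_count, result.modified_count

# Usage with the notebook's collection:
# matched, modified = set_rating(collection, "The Godfather", 9.3)
# print(matched, modified)
```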


### Deleting Documents


In [None]:

# Delete every movie released before 1950
# (uncomment to run; this permanently removes all matching documents)
#collection.delete_many({"year": {"$lt": 1950}})
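
Because `delete_many` removes every matching document, one cautious pattern is to count the matches first with the same filter. The sketch below assumes that approach. `delete_if_expected` and its `max_expected` parameter are hypothetical and only illustrate the idea.

```python
def delete_if_expected(collection, query, max_expected):
    """Delete matching documents only when the match count is within a limit.

    Returns (deleted_count, matched_count).
    """
    count = collection.count_documents(query)
    if count > max_expected:
        # Refuse to delete more documents than the caller said to expect.
        return 0, count
    result = collection.delete_many(query)
    return result.deleted_count, count

# Usage with the notebook's collection:
# deleted, matched = delete_if_expected(collection, {"year": {"$lt": 1950}}, 100)
```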


## 4. Exercises for Hands-On Practice
### Exercise 1: Searching and Filtering
**Task**: Find all movies where the genre is 'Comedy' and the IMDb rating is greater than 7.


In [None]:

# Your task: Write a query to find comedies with an IMDb rating greater than 7
comedies = collection.find({"genres": "Comedy", "imdb.rating": {"$gt": 7}}).limit(5)

# Print the first 5 results
for comedy in comedies:
    pprint.pprint(comedy)


{'_id': ObjectId('573a1390f29313caabcd4803'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Winsor McCay'],
 'countries': ['USA'],
 'directors': ['Winsor McCay', 'J. Stuart Blackton'],
 'fullplot': 'Cartoonist Winsor McCay agrees to create a large set of drawings '
             'that will be photographed and made into a motion picture. The '
             'job requires plenty of drawing supplies, and the cartoonist must '
             'also overcome some mishaps caused by an assistant. Finally, the '
             'work is done, and everyone can see the resulting animated '
             'picture.',
 'genres': ['Animation', 'Short', 'Comedy'],
 'imdb': {'id': 1737, 'rating': 7.3, 'votes': 1034},
 'languages': ['English'],
 'lastupdated': '2015-08-29 01:09:03.030000000',
 'num_mflix_comments': 0,
 'plot': 'Cartoon figures announce, via comic strip balloons, that they will '
         'move - and move they do, in a wildly exaggerated style.',
 'poster': 'https://m.me

### Exercise 2: Aggregation Pipeline
**Task**: Write an aggregation pipeline to find the top 5 directors by the average IMDb rating of their movies.


In [None]:

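# Your task: Write an aggregation pipeline to calculate average IMDb rating by director
# Note: "directors" is an array, so grouping on it directly treats each distinct
# list of directors (for example, a co-directing pair) as one group.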
pipeline = [
    {"$group": {"_id": "$directors", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

# Execute the pipeline and print results
avg_rating_by_director = collection.aggregate(pipeline)
for data in avg_rating_by_director:
    pprint.pprint(data)


{'_id': ['Sara Hirsh Bordo'], 'avg_rating': 9.4}
{'_id': ['Kevin Derek'], 'avg_rating': 9.3}
{'_id': ['Michael Benson'], 'avg_rating': 9.0}
{'_id': ['Slobodan Sijan'], 'avg_rating': 8.95}
{'_id': ['Sundar C.'], 'avg_rating': 8.9}
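
The `_id` values above are lists because the pipeline groups on the whole `directors` array. To rank individual directors, one option is to add an `$unwind` stage first, as in the genre example earlier. The sketch below assumes that variant. `director_pipeline` and the extra `movie_count` field are illustrative additions, and no output is shown because it was not run against the dataset.

```python
def director_pipeline(limit=5):
    """Average IMDb rating per individual director, highest first."""
    return [
        {"$unwind": "$directors"},  # one document per (movie, director) pair
        {"$group": {
            "_id": "$directors",
            "avg_rating": {"$avg": "$imdb.rating"},
            "movie_count": {"$sum": 1},  # how many movies feed each average
        }},
        {"$sort": {"avg_rating": -1}},
        {"$limit": limit},
    ]

# Usage with the notebook's collection:
# for row in collection.aggregate(director_pipeline()):
#     pprint.pprint(row)
```

The `movie_count` field makes it easy to see when a top average comes from a single film.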


### Exercise 3: Create an Index and Measure Performance
**Task**: Create an index on the `imdb.rating` field. Measure performance before and after creating the index.


In [None]:

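# Task: Create an index on imdb.rating and query before and after indexing
# Note: find() returns a lazy cursor and sends nothing to the server until it is
# iterated, so the two timings below cover only cursor creation, not the query.

# Query without index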
from time import time

start_time = time()
no_index_result = collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
print("Time without index:", time() - start_time)

# Create an index
collection.create_index([("imdb.rating", 1)])

# Query with index
start_time = time()
with_index_result = collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
print("Time with index:", time() - start_time)

# Print the results
for result in with_index_result:
    pprint.pprint(result)


Time without index: 0.0001518726348876953
Time with index: 0.00015044212341308594
{'_id': ObjectId('573a1391f29313caabcd72f0'),
 'awards': {'nominations': 0, 'text': '2 wins.', 'wins': 2},
 'cast': ['Richard Barthelmess',
          'Gladys Hulette',
          'Walter P. Lewis',
          'Ernest Torrence'],
 'countries': ['USA'],
 'directors': ['Henry King'],
 'fullplot': 'When three thuggish men are responsible for the death of his '
             'father and the crippling of his brother, young David must choose '
             'between supporting his family or risking his life and exacting '
             'vengeance.',
 'genres': ['Drama'],
 'imdb': {'id': 12763, 'rating': 8.1, 'votes': 1455},
 'lastupdated': '2015-08-23 01:12:08.943000000',
 'num_mflix_comments': 0,
 'plot': 'When three thuggish men are responsible for the death of his father '
         'and the crippling of his brother, young David must choose between '
         'supporting his family or risking his life and exacting 
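
The two timings printed above are nearly identical because `find()` only builds a cursor, and the query runs when the cursor is iterated. A fairer measurement has to consume the results inside the timed region. The sketch below does that with a hypothetical helper, `time_query`, which takes a function that returns a cursor or any other iterable.

```python
from time import perf_counter

def time_query(make_cursor):
    """Time a query including result retrieval by exhausting the cursor.

    Returns (elapsed_seconds, number_of_documents).
    """
    start = perf_counter()
    docs = list(make_cursor())  # iterating forces the query to run
    return perf_counter() - start, len(docs)

# Usage with the notebook's collection:
# elapsed, n = time_query(lambda: collection.find({"imdb.rating": {"$gt": 8}}))
# print(f"{n} documents in {elapsed:.4f} s")
```

Dropping `.limit(5)` for the measurement may also help, since with a small limit the server can stop after the first few matches whether or not an index exists. Network latency to Atlas can still dominate a single run, so repeating the measurement several times gives a steadier comparison.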