<a href="https://colab.research.google.com/github/drshahizan/special-topic-data-engineering/blob/main/assignment/data-scraping/submission/part1/Gadgeteen/FlickrScraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Flickr Web Scraping</h1>

<p>
Flickr is one of the most popular online photo-sharing platforms. It has a massive collection of images shared by millions of users worldwide. These images can be publicly accessible or limited to a selected audience. In this context, web scraping is a technique used to extract data from websites automatically. It involves writing scripts that collect data from websites by crawling through its pages and extracting the required information.

Web scraping on Flickr can provide valuable insights into the metadata of images that are shared publicly. By web scraping Flickr for images of dogs, we can gather information about the camera settings used to capture these images, such as the make and model of the camera. This information can be used for various purposes such as analyzing trends in the types of cameras used to capture images of dogs or identifying the most popular settings used by photographers to capture these images. Additionally, we can use this data to create a dataset that can be used for training machine learning models in computer vision applications.</p>

This Python code imports various libraries for web scraping, image processing, and working with MongoDB.

* requests is a library that allows sending HTTP requests and handling responses in Python. It is commonly used for web scraping or interacting with web APIs.
* json is a library that provides methods for encoding and decoding JSON data in Python. It is used for working with JSON data, which is a lightweight data interchange format commonly used in web applications.
* csv is a library that provides functionality to read from and write to CSV files. It is commonly used for working with tabular data.
* cv2 is a library that provides computer vision functions for processing images and videos. It is based on the OpenCV library and is commonly used for tasks such as object detection, image segmentation, and facial recognition.
* numpy is a library that provides numerical computing functionality in Python. It is commonly used for working with arrays and matrices.
* pymongo is a Python library for working with MongoDB, a popular NoSQL database. It provides an interface for connecting to MongoDB, creating and querying collections, and performing CRUD (create, read, update, delete) operations on documents.
* The !pip install pymongo command installs the pymongo library if it is not already installed  

* The from pymongo import MongoClient statement imports the MongoClient class from the pymongo library. MongoClient is the primary interface for connecting to a MongoDB server and working with databases and collections.

Overall, this code imports several essential libraries for working with data in Python and specifically with MongoDB, which is a popular database for storing and managing data in web applications.

In [9]:
import requests
import json
import csv
import cv2
import numpy as np
!pip install pymongo
import pymongo
from pymongo import MongoClient

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


This code defines several variables that are used to construct URLs for making API requests to the Flickr API:


* api_key is a string variable that contains a valid API key for accessing the  Flickr API. API keys are used to authenticate requests and identify the application making the request.
* search_url is a string variable that contains the URL for searching photos on Flickr. It includes the api_key variable, as well as several parameters that specify the search criteria, such as the tags to search for, the number of photos to return per page, and the page number to start on. The format=json parameter specifies that the response should be returned in JSON format, and nojsoncallback=1 disables the JSONP callback function.
* info_url is a string variable that contains the URL for retrieving detailed information about a specific photo on Flickr. It includes the api_key variable and a photo_id variable, which is the ID of the photo to retrieve information about. The format=json parameter specifies that the response should be returned in JSON format, and nojsoncallback=1 disables the JSONP callback function.
* exif_url is a string variable that contains the URL for retrieving EXIF (Exchangeable Image File Format) data for a specific photo on Flickr. It includes the api_key variable and a photo_id variable, which is the ID of the photo to retrieve EXIF data for. The format=json parameter specifies that the response should be returned in JSON format, and nojsoncallback=1 disables the JSONP callback function.


Overall, this code defines the URLs that will be used to make requests to the Flickr API to search for photos, retrieve detailed information about specific photos, and retrieve EXIF data for specific photos.

In [10]:
api_key = "<Your_API_Key>"
search_url = "https://www.flickr.com/services/rest/?method=flickr.photos.search&api_key={api_key}&tags=dog&per_page=100&page=1&format=json&nojsoncallback=1"
info_url = "https://www.flickr.com/services/rest/?method=flickr.photos.getInfo&api_key={api_key}&photo_id={photo_id}&format=json&nojsoncallback=1"
exif_url = "https://www.flickr.com/services/rest/?method=flickr.photos.getExif&api_key={api_key}&photo_id={photo_id}&format=json&nojsoncallback=1"

After defining the search_url variable in the previous code block, this code uses the requests library to send a GET request to the Flickr API endpoint specified by search_url. The api_key variable is inserted into the URL using Python's string formatting syntax and the .format() method, so that the API key is included in the request.

The response from the API is then converted from JSON format to a Python dictionary using the json.loads() method, and stored in the data variable.

The total_pages variable is then assigned the value of the page attribute from the photos dictionary within the data dictionary. This represents the total number of pages of search results available for the given search query.

Overall, this code retrieves the total number of pages of search results available for the "dog" tag on Flickr, using the previously defined search_url and api_key variables.

In [11]:
response = requests.get(search_url.format(api_key=api_key))
data = json.loads(response.text)
total_pages = data["photos"]["page"]

This code block starts by creating an empty list metadata_list to store the metadata for each photo. Then, using a for loop, it iterates over each page of search results (1 to total_pages) to retrieve the metadata for each photo.

For each photo on each page of search results, the code sends a request to the Flickr API endpoint specified by info_url and exif_url to get information about the photo and its camera settings, respectively.

The metadata and camera settings are then stored in a dictionary called metadata_dict, with the title of the photo, author of the photo, and URL to the photo stored as keys. If the camera settings for the photo are not accessible, the value of exif_data is set to "Owner denied access". Otherwise, the make and model of the camera are extracted from the camera settings and added to the dictionary.

Finally, the metadata_dict dictionary is appended to the metadata_list for each photo, so that the final metadata_list contains a list of dictionaries, each containing the metadata and camera settings for a single photo.

In [12]:
metadata_list = []
for page in range(1, total_pages+1):
    response = requests.get(search_url.format(api_key=api_key, page=page))
    data = json.loads(response.text)
    for photo in data["photos"]["photo"]:
        # Get the photo info
        response = requests.get(info_url.format(api_key=api_key, photo_id=photo["id"]))
        data = json.loads(response.text)
        metadata = data["photo"]

        # Get the camera settings
        response = requests.get(exif_url.format(api_key=api_key, photo_id=photo["id"]))
        data_exif = json.loads(response.text)

        # Store the metadata and camera settings in a dictionary
        metadata_dict = {
            "Title": metadata["title"].get("_content", "Untitled"),
            "Author": metadata["owner"]["username"],
            "URL": f'https://live.staticflickr.com/{metadata["server"]}/{metadata["id"]}_{metadata["secret"]}.jpg',
        }

        if data_exif['stat'] == 'fail':
            exif_data ="Owner denied access"
        else:
            exif_data = data_exif["photo"]["exif"]

            for exif in exif_data:
                if exif["label"] in ["Make", "Model"]:
                    metadata_dict[exif["label"]] = exif["raw"]["_content"]

        # Add the metadata to the list
        metadata_list.append(metadata_dict)

This code writes the metadata information of a list of dog images obtained from Flickr to a CSV file named "flickr_scraping.csv".

It uses the csv library to write the data to a CSV file with the provided fieldnames in the first row. The metadata list is iterated through, and for each image, it downloads the image using the requests library, and decodes it using the cv2 library. Finally, it writes the metadata information to the CSV file using the writerow() function.

In [13]:
with open("flickr_scraping.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "Author", "URL", "Make", "Model"])
    writer.writeheader()

    for metadata in metadata_list:
        # Download the image
        response = requests.get(metadata["URL"])
        image = cv2.imdecode(np.frombuffer(response.content, np.uint8), cv2.IMREAD_COLOR)

        # Write the metadata to the CSV file
        writer.writerow(metadata)
        

This code reads the CSV file "flickr_scraping.csv" using the Pandas library and stores the data in a DataFrame object named "data". The DataFrame object allows for easy manipulation and analysis of tabular data. The variable "data" can now be used to perform various operations such as filtering, sorting, and calculating summary statistics on the data.

In [14]:
import pandas as pd
data = pd.read_csv("flickr_scraping.csv")
data

Unnamed: 0,Title,Author,URL,Make,Model
0,biking with dog,kasa51,https://live.staticflickr.com/65535/5283089746...,Panasonic,DC-G9
1,Ozark Mayo2023,Cangorezeus Photography,https://live.staticflickr.com/65535/5286318422...,SONY,ILCE-6100
2,DSC00591,johnjmurphyiii,https://live.staticflickr.com/65535/5286349996...,SONY,DSC-RX100M7
3,DSC00633,johnjmurphyiii,https://live.staticflickr.com/65535/5286248225...,SONY,DSC-RX100M7
4,Forza Napoli.,guidamichele91,https://live.staticflickr.com/65535/5286263939...,NIKON CORPORATION,NIKON D610
...,...,...,...,...,...
95,Don't Let This Dog Out,BUNEES,https://live.staticflickr.com/65535/5286145715...,,
96,McDonald's Wind Surfer Connect-A-Snoopy - 2004,jadedoz,https://live.staticflickr.com/65535/5286145207...,Apple,iPhone 11 Pro Max
97,Doggo Bloggo via Poop4U,PickDoggo,https://live.staticflickr.com/65535/5286101631...,,
98,Der große dünne Mann!,Günter Hentschel,https://live.staticflickr.com/65535/5286031282...,NIKON CORPORATION,NIKON D5500


This code establishes a connection to a MongoDB database using a connection string, creates a new database named <DataScraping>, and a new collection named <Dog>. Then, it inserts the metadata_list records, which contains metadata scraped from Flickr API for images tagged as 'dog'. The records are inserted into the <Dog> collection in the <DataScraping> database using the insert_many method provided by the PyMongo library.

It's important to note that the uri, <DataScraping>, and <Dog> placeholders in the code should be replaced with the actual values specific to your use case.

In [17]:
#Connection String 
uri="mongodb+srv://Kelvin2001:kelvinshuai@cluster0.cokgc4s.mongodb.net/test"
client = MongoClient(uri)

#Define the database and collection
db = client['<DataScraping>']
collection = db['<Dog>']

collection.insert_many(metadata_list)

<pymongo.results.InsertManyResult at 0x7fdc2a3b5b10>

In [21]:
from google.colab import files

files.download("flickr_scraping.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>