# Dataset description

Group number: Group 5


Team members: 
1. Sayantika Saha  - T00731231
2. Manisha K - T00727938
3. Huynh Hiep Tran (preferred name: Alex Tran) - T00728369
4. Precious - T00727498

## Introduction

- Dataset name : Emission by Country
- Authors - The Devastator (Owner)
- Source/URL - https://www.kaggle.com/datasets/thedevastator/global-fossil-co2-emissions-by-country-2002-2022?select=GCB2022v27_MtCO2_flat_metadata.json
- A brief description of what the dataset is about - This dataset reports fossil CO2 emissions at the country level, which helps show how much each country contributes to cumulative global emissions. It contains total emissions as well as emissions from coal, oil, gas, cement production, flaring, and other sources. It also gives per capita CO2 emissions for each country, which shows which countries emit the most per person and where reduction efforts could be concentrated. The dataset is useful for anyone researching emission trends or their own environmental footprint.

## General information

- Data format : JSON
- How many files/collections : 2 files
- Data size in terms of storage : 2 MB

## Import data

In [3]:
# YOUR CODE TO IMPORT THE DATA HERE
from pymongo import MongoClient  # client used to connect to MongoDB
import json  # used to load the credentials file
import urllib.parse  # used to percent-escape the credentials

# load credentials from json file
with open('credentials_mongodb.json') as f:
    login = json.load(f)

# assign credentials to variables; the username and password are
# percent-escaped so that special characters do not break the URI
username = urllib.parse.quote_plus(login['username'])
password = urllib.parse.quote_plus(login['password'])
host = login['host']
url = "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(username, password, host)
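
To illustrate why the credentials are percent-escaped before they go into the connection string, the minimal sketch below builds a URI in the same format from made-up values. The helper name `build_mongo_uri`, the sample credentials, and the host are hypothetical and not taken from the project. The sketch uses `urllib.parse.quote_plus`, which also escapes `/`.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI, percent-escaping the credentials."""
    return "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(
        quote_plus(username), quote_plus(password), host
    )


# Hypothetical credentials containing characters that are reserved in URIs.
uri = build_mongo_uri("group5", "p@ss/word:1", "cluster0.example.mongodb.net")
print(uri)
```

Without escaping, the `@`, `/`, and `:` in such a password would be read as URI delimiters and the connection string would be parsed incorrectly.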

In [5]:
client = MongoClient(url)

In [7]:
client.list_database_names()

['bookstore',
 'group_5_project',
 'sample_airbnb',
 'sample_analytics',
 'sample_geospatial',
 'sample_guides',
 'sample_mflix',
 'sample_restaurants',
 'sample_supplies',
 'sample_training',
 'sample_weatherdata',
 'school',
 'admin',
 'local']

In [20]:
# Access the 'group_5_project' database
db = client['group_5_project']

# Access the two metadata collections
collection1 = db['GCB2022v27_MtCO2_flat_metadata']
collection2 = db['GCB2022v27_percapita_flat_metadata']


# # Print all documents from each collection
# for document in collection1.find():
#     print(document)

# for document2 in collection2.find():
#     print(document2)


# Fetch the first document from each collection
sample_doc1 = collection1.find_one()
sample_doc2 = collection2.find_one()

print("Sample document from MongoDB (file 1):", sample_doc1)
print("Sample document from MongoDB (file 2):", sample_doc2)




{'_id': ObjectId('66f0f794f1049cc35ee1f58a'), 'fields': [{'name': 'Country', 'title': 'Country name', 'type': 'string'}, {'name': 'ISO 3166-1 alpha-3', 'title': 'ISO code', 'type': 'string'}, {'name': 'Year', 'title': 'Year', 'note': 'In almost all cases this is calendar year', 'type': 'number'}, {'name': 'Total', 'title': 'Total fossil CO2 emissions', 'units': 'millions of tonnes of CO2', 'source': 'Global Carbon Budget 2022', 'organisation': 'Global Carbon Project', 'version': '2022v27', 'type': 'number', 'licence': 'Licensed under Creative Commons Attribution 4.0 International', 'citation': 'Friedlingstein et al 2020 (DOI: 10.5194/essd-12-3269-2020)'}, {'name': 'Coal', 'title': 'Fossil CO2 emissions from Coal', 'units': 'millions of tonnes of CO2', 'source': 'Global Carbon Budget 2022', 'organisation': 'Global Carbon Project', 'version': '2022v27', 'type': 'number', 'licence': 'Licensed under Creative Commons Attribution 4.0 International', 'citation': 'Friedlingstein et al 2020 (DO

In [19]:
print(collection1.count_documents({}))  # Outputs the number of documents in the collection
print(collection2.count_documents({}))

1
1


- Describe how many collections/how many documents
- Describe the schema of the dataset/collection
- Print out a sample document
- List and briefly describe the most important fields/attributes in the dataset




The two collections in the database are described below.

### 1. Describe Collections and Documents
- **Collections**: There are two collections in the `group_5_project` database:
  - `GCB2022v27_MtCO2_flat_metadata`
  - `GCB2022v27_percapita_flat_metadata`
  
- **Documents**: Each collection contains **1 document**, as reported by `count_documents({})`.

### 2. Describe the Schema of the Dataset/Collection
The schema for both collections appears to be similar, based on the sample documents retrieved. Each document has:
- An `_id` field (automatically generated by MongoDB).
- A `fields` array, which contains metadata about various attributes related to fossil CO2 emissions.

Each entry in the `fields` array is an object that describes a specific attribute, including:
- **name**: The key name for the field.
- **title**: A descriptive title for the field.
- **type**: The data type of the field (e.g., string, number).
- **units** (where applicable): The units of measurement for the field.
- **source**, **organisation**, **version**, **licence**, and **citation**: Additional metadata providing context for the data.
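
The schema above can also be inspected programmatically. The sketch below summarises a `fields` array as name, type, and units. It runs on a small in-memory document that copies three entries from the sample output, so it needs no database connection. The helper name `summarise_fields` is hypothetical.

```python
def summarise_fields(document: dict) -> list[tuple[str, str, str]]:
    """Return (name, type, units) for every entry in the 'fields' array."""
    summary = []
    for field in document.get("fields", []):
        # 'units' is only present on measurement fields, so default to '-'
        summary.append((field["name"], field["type"], field.get("units", "-")))
    return summary


# Shortened copy of the metadata document shown in the sample output.
sample_doc = {
    "fields": [
        {"name": "Country", "title": "Country name", "type": "string"},
        {"name": "Year", "title": "Year", "type": "number"},
        {
            "name": "Total",
            "title": "Total fossil CO2 emissions",
            "units": "millions of tonnes of CO2",
            "type": "number",
        },
    ]
}

for name, field_type, units in summarise_fields(sample_doc):
    print(f"{name:<10} {field_type:<8} {units}")
```

In the notebook, the same helper could be called on `sample_doc1` or `sample_doc2` instead of the in-memory copy.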

### 3. List and Briefly Describe the Most Important Fields/Attributes
Here are the most important fields/attributes found in the dataset:

- **Country**: 
  - **Title**: Country name
  - **Type**: string
  - **Description**: The name of the country for which the data is recorded.

- **ISO 3166-1 alpha-3**: 
  - **Title**: ISO code
  - **Type**: string
  - **Description**: The three-letter country code defined by ISO.

- **Year**: 
  - **Title**: Year
  - **Type**: number
  - **Description**: The calendar year for which the emissions data is relevant.

- **Total**: 
  - **Title**: Total fossil CO2 emissions
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: The total amount of fossil CO2 emissions for the specified country and year.

- **Coal**: 
  - **Title**: Fossil CO2 emissions from Coal
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: The amount of CO2 emissions derived from coal.

- **Oil**: 
  - **Title**: Fossil CO2 emissions from Oil
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: The amount of CO2 emissions derived from oil.

- **Gas**: 
  - **Title**: Fossil CO2 emissions from Gas
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: The amount of CO2 emissions derived from natural gas.

- **Cement**: 
  - **Title**: Fossil CO2 emissions from Cement
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: The amount of CO2 emissions derived from cement production.

- **Flaring**: 
  - **Title**: Fossil CO2 emissions from Flaring
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: The amount of CO2 emissions resulting from flaring practices.

- **Other**: 
  - **Title**: Fossil CO2 emissions from Other sources
  - **Units**: millions of tonnes of CO2
  - **Type**: number
  - **Description**: CO2 emissions that cannot be categorized into the other specified sources.

- **Per Capita**: 
  - **Title**: Per capita fossil CO2 emissions
  - **Units**: tonnes of CO2 per capita
  - **Type**: number
  - **Description**: The average CO2 emissions per person in the specified country for the given year.



In [22]:
# YOUR CODE TO EXPLORE THE DATA HERE



--- Exploring collection: GCB2022v27_MtCO2_flat_metadata ---
Sample document:


NameError: name 'pprint' is not defined
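
The `NameError` above is what Python raises when `pprint` is called without being imported. A minimal sketch of an exploration cell with the import in place is given below. The helper `explore_collection` is a hypothetical name, and the `FakeCollection` stand-in exists only so that the sketch runs without a database connection.

```python
from pprint import pprint  # the import the failing cell was missing


def explore_collection(name, collection):
    """Print the document count and one sample document of a collection."""
    print(f"--- Exploring collection: {name} ---")
    count = collection.count_documents({})
    print("Number of documents:", count)
    print("Sample document:")
    sample = collection.find_one()
    pprint(sample)
    return count, sample


class FakeCollection:
    """Stand-in with the two PyMongo Collection methods used above."""

    def __init__(self, documents):
        self._documents = documents

    def count_documents(self, query):
        # The sketch ignores the query and counts everything.
        return len(self._documents)

    def find_one(self):
        return self._documents[0] if self._documents else None


fake = FakeCollection([{"fields": [{"name": "Country", "type": "string"}]}])
explore_collection("GCB2022v27_MtCO2_flat_metadata", fake)
```

In the notebook, the helper would be called with `collection1` and `collection2` in place of the stand-in.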

## Submission instruction
- Push the notebook to your group GitHub repository
- Upload a URL to the `data-eda.ipynb` to Moodle under week 3 assignment