# QuantumBlack Hackathon - AI For Good

A binary classifier that predicts whether an image contains methane emissions. The model is integrated into a Streamlit web app for a seamless workflow.

***
by: Clara Besnard, Ian Moon, Marina Pellet, Łukasz Pszenny, Adel Remadi, Lasse Schmidt

within: MS Data Sciences & Business Analytics

at: CentraleSupélec & ESSEC Business School
***

This notebook covers an initial analysis of the provided data: the image metadata as well as the images themselves.

In [None]:
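# if necessary, run installs
# (os is part of the standard library and cannot be installed with pip;
# folium, matplotlib and mapclassify are needed by GeoDataFrame.explore)
%pip install pandas geopandas shapely reverse_geocoder pycountry folium matplotlib mapclassify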

### 1. Import Packages

In [2]:
# data handling
import os
import pandas as pd

# geodata
import geopandas as gpd
from shapely.geometry import Point
import reverse_geocoder as rg
import pycountry

### 2. Load Data

In [3]:
metadata = pd.read_csv("images/metadata.csv")
metadata.head(2)

Unnamed: 0,date,id_coord,plume,set,lat,lon,coord_x,coord_y,path
0,20230223,id_6675,yes,train,31.52875,74.330625,24,47,images/plume/20230223_methane_mixing_ratio_id_...
1,20230103,id_2542,yes,train,35.538,112.524,42,37,images/plume/20230103_methane_mixing_ratio_id_...
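
The `date` column is read as an integer in `YYYYMMDD` form (e.g. `20230223`). A minimal sketch of converting it to a proper timestamp, assuming every row follows that format; the small `sample` frame and the `date_parsed` column name are illustrative stand-ins, not part of the original notebook:

```python
import pandas as pd

# Illustrative stand-in for the first rows of `metadata` shown above
sample = pd.DataFrame({"date": [20230223, 20230103], "plume": ["yes", "yes"]})

# Cast the integers to strings so the explicit format can be applied
sample["date_parsed"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Month of each acquisition, e.g. for a quick seasonal overview
print(sample["date_parsed"].dt.month.tolist())
```

With real timestamps, grouping by month or week becomes a one-liner via the `.dt` accessor.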


In [7]:
len(metadata)

430

Overall, we only have 430 images available. That is a small dataset for training an image classifier from scratch, so transfer learning from a pretrained model looks like a promising approach.
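
With so few images, it can also help to check how they are distributed across labels and splits. A sketch using `pd.crosstab` on the `set` and `plume` columns; the toy frame stands in for `metadata`, and the `"test"` value of `set` is an assumption made for illustration (only `"train"` appears in the rows shown above):

```python
import pandas as pd

# Toy stand-in for `metadata`; the "test" split value is an assumption
toy = pd.DataFrame({
    "set": ["train", "train", "train", "test"],
    "plume": ["yes", "no", "yes", "no"],
})

# Rows: split, columns: label, with row and column totals under "All"
balance = pd.crosstab(toy["set"], toy["plume"], margins=True)
print(balance)
```

On the real data, the same call would be `pd.crosstab(metadata["set"], metadata["plume"], margins=True)`.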

Let us now have a look at the countries that are represented in this dataset.

In [5]:
# Create a 'geometry' column of Point objects; shapely expects (x, y) = (lon, lat)
metadata["geometry"] = metadata.apply(lambda row: Point(row.lon, row.lat), axis=1)

# Convert the dataframe to a GeoDataFrame in WGS 84 (EPSG:4326)
gdf = gpd.GeoDataFrame(metadata, geometry="geometry", crs="EPSG:4326")

# Draw an interactive map colored by the 'plume' label
# (named plume_map to avoid shadowing the built-in map)
plume_map = gdf.explore("plume", tiles="CartoDB positron", cmap="RdYlGn_r")

# Save map, creating the output folder first if it does not exist
os.makedirs("maps", exist_ok=True)
plume_map.save("maps/dataset_overview.html")

# Show map
plume_map

In [8]:
# Use reverse_geocoder to get the city and country information
coords = list(zip(gdf["lat"], gdf["lon"]))
results = rg.search(coords)

# Extract city and country and add to the dataframe
gdf["city"] = [r["name"] for r in results]
gdf["country_code"] = [r["cc"] for r in results]

# Convert country codes to country names
def get_country_name(country_code):
    try:
        return pycountry.countries.get(alpha_2=country_code).name
    except AttributeError:
        return None
    
gdf["country"] = gdf["country_code"].apply(get_country_name)

# show dataframe
gdf.head(2)

Unnamed: 0,date,id_coord,plume,set,lat,lon,coord_x,coord_y,path,geometry,city,country_code,country
0,20230223,id_6675,yes,train,31.52875,74.330625,24,47,images/plume/20230223_methane_mixing_ratio_id_...,POINT (74.33062 31.52875),Lahore,PK,Pakistan
1,20230103,id_2542,yes,train,35.538,112.524,42,37,images/plume/20230103_methane_mixing_ratio_id_...,POINT (112.52400 35.53800),Fengcheng,CN,China


In [9]:
# Define a function to count the number of images with plumes
def count_with_plumes(x):
    return x[x == 'yes'].count()

# Define a function to count the number of images without plumes
def count_without_plumes(x):
    return x[x == 'no'].count()

# Group by 'country' and apply the aggregate functions
result = gdf.groupby('country').agg(
    total_images=pd.NamedAgg(column='path', aggfunc='count'),
    images_with_plumes=pd.NamedAgg(column='plume', aggfunc=count_with_plumes),
    images_without_plumes=pd.NamedAgg(column='plume', aggfunc=count_without_plumes)
)

# Sort the result in descending order by total_images
result = result.sort_values(by='total_images', ascending=False)

# Show dataframe
result.head(10)

country,total_images,images_with_plumes,images_without_plumes
India,97,97,0
Syrian Arab Republic,58,4,54
Iraq,37,0,37
Bangladesh,27,27,0
"Korea, Republic of",21,0,21
Jordan,21,0,21
Pakistan,17,17,0
Turkmenistan,16,16,0
Russian Federation,15,8,7
United States,12,12,0
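
In the table above, most countries contain images of only one class (e.g. India with 97 plume images and none without). A sketch of an alternative aggregation with `pd.crosstab` that also reports the plume share per country; the toy frame stands in for `gdf`, its counts are illustrative rather than the real data, and the `plume_share` column name is hypothetical:

```python
import pandas as pd

# Toy stand-in for `gdf`; counts are illustrative, not the real data
toy = pd.DataFrame({
    "country": ["India", "India", "Iraq",
                "Russian Federation", "Russian Federation"],
    "plume": ["yes", "yes", "no", "yes", "no"],
})

# One row per country, one column per label ("no", "yes")
counts = pd.crosstab(toy["country"], toy["plume"])

# Share of images with a plume; 0.0 or 1.0 means a single-class country
counts["plume_share"] = counts["yes"] / (counts["yes"] + counts["no"])
print(counts.sort_values("plume_share", ascending=False))
```

A share of exactly 0 or 1 flags countries where location alone would determine the label, which may be worth keeping in mind when evaluating a model.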
