# Project Title Here

---
Please fill out your details:

Name = Dias Kuatbekov

Student Id = 20709092

Course Type = Taught Masters



### Integrity statement

**If not using this Jupyter notebook please make sure this statement is included in your main.py file**

I, Dias Kuatbekov, have read and understood the School's Academic Integrity Policy, as well as guidance relating to this module, and confirm that this submission complies with the policy. The content of this file is my own original work, with any significant material copied or adapted from other sources clearly indicated and attributed.

---

**This cell can be deleted when you are ready to submit but we'd leave it as a checklist for yourself until that time**

## Project Template

This project template can be used for the submission of your project. Further details including dates and details of assessment are included in the appropriate part of the course book. You should structure the notebook to include markdown and code cells. Documentation should be included in docstrings / comments in your code cells, but markdown cells can be used to record decisions on your thought process; explanations of datasets; discussion of the functionality created and how it will help the user etc 

We would prefer submissions using this Jupyter Notebook (*.ipynb). The Jupyter Notebook should not only provide code but be structured with Markdown cells:

- Please briefly outline the scenario.
- Give a description of the aim of the project and the problem it is solving
- A brief overview in words of how the code works
- A discussion of design decisions taken in structuring the code.
- How you would develop / improve the code if given more time?

However, we are aware that some code, e.g. GUIs, does not work as well in a Jupyter notebook. If this is the case:

- You should provide a `main.py` file which runs your code
- Include any tests in a separate `tests.py` file
- Provide a pdf which covers the points above. As a guide we'd expect 300-500 words

In both formats, additional code / modules can also be included in other python files in the same folder and then imported. Make sure you use relative imports in your code.

**Please make every effort to structure your code in a logical fashion to assist us in understanding it**

In addition to the Jupyter Notebook / code you should also submit:

- a Conda environment file which enables your notebook to be run. Please ensure you test this on a Windows machine.
- any output that cannot easily be embedded in the notebook (e.g. an example data file or some visualisation, PDFs). These should be referred to explicitly.

Your program should (as a rough guide): 

- analyse real data or perform a simulation
- define at least two user functions (but typically more)
- make use of appropriate specialist modules
- comprise >~ 50 lines of actual code (excluding comments, imports and other ‘boilerplate’)
- contain no more than 1000 lines in total (if you have written more, please isolate an individual element). The additional code can be imported from a .py file but will not be marked.


---

## Imports

*Please group all your imports together in the code cell below and provide web links in this cell to either a documentation page or a github repository. The accompanying conda environment file should install the necessary modules.

Supplementary pieces of code can be included in .py files in the top level of your project. In this box you should name each .py file and state to what extent it is your own work. If it is the work of others or you have modified others work then you should state this clearly and provide a link to the source of the original code. The default assumption is that all code is your own work unless otherwise stated.* 

### Import documentation links and contribution statements




In [2]:
# Imports from external libraries
import tkinter as tk
from tkinter import ttk, messagebox
from tkinter import StringVar
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
import matplotlib.pyplot as plt
import pandas as pd
from tkcalendar import DateEntry
import datetime
import geopandas as gpd
from scipy.interpolate import griddata, Rbf
import numpy as np
import mplcursors
import matplotlib.dates as mdates
from shapely import wkt
from shapely.geometry import Point
from matplotlib.figure import Figure
from geopandas import GeoSeries

# Imports from supplementary modules / code included with the project


## Discussion of project

These headings are a guide to help you cover the key points. Whilst the first heading should remain here ahead of your code it may be more appropriate to move the other headings around to suit your project, placing them near relevant code cells. The choice is yours.

### Aim of project and the problem it is solving

### Brief overview of how code works

### Design decisions made in structuring of the code

### How you would improve / develop the code if given more time

## Main Code

I suggest breaking your code up into multiple cells. Grouping related code together and interspersing with markdown cells where appropriate to explain what you are doing.

Loading the preprocessed data

In [None]:
combined_gdf = pd.read_csv("data/sensors_chp.csv")
combined_gdf['geometry'] = combined_gdf['geometry'].apply(wkt.loads)
combined_gdf = gpd.GeoDataFrame(combined_gdf, geometry='geometry')
combined_gdf.set_crs('EPSG:4326', inplace=True)
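
The loading step above relies on the `geometry` column being stored as WKT text in the CSV. A minimal sketch of that conversion, assuming a made-up two-row table in place of `data/sensors_chp.csv` (the sensor names, readings, and coordinates are illustrative only):

```python
import io

import pandas as pd
from shapely import wkt

# Hypothetical stand-in for the real CSV: geometry is stored as WKT strings.
csv_text = io.StringIO(
    "name,date,Reading,geometry\n"
    "sensor_a,2023-01-01,0.8,POINT (76.90 43.25)\n"
    "sensor_b,2023-01-01,1.2,POINT (76.95 43.30)\n"
)

toy_df = pd.read_csv(csv_text)

# wkt.loads turns each text value into a shapely geometry object.
toy_df["geometry"] = toy_df["geometry"].apply(wkt.loads)

first_point = toy_df.loc[0, "geometry"]
print(first_point.geom_type, first_point.x, first_point.y)
```

Once the column holds shapely objects, it can be passed to `gpd.GeoDataFrame(..., geometry='geometry')` as in the cell above.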

Data manipulation makes up a large and important part of the project. As the project evolved, it became clear that a separate class responsible only for handling data operations was needed. This keeps the code modular and makes debugging much easier.
In general, the `DataManipulator` class is responsible for:
- Filtering a dataframe by date range
- Calculating statistical information
- Calculating average pollution for each district in the city of Almaty
- Interpolation
- Storing the locations of the power plants. These are used only when plotting, but I did not know of an obvious alternative, and I find that keeping them here keeps the code structured
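
The per-district calculation listed above can be sketched on toy data. This is an illustration under assumptions, not the project's data: the two square "districts" and the four sensors are invented, and only the `name` and `Reading` column names follow the notebook.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical square "districts" side by side.
districts = gpd.GeoDataFrame(
    {"name": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Four hypothetical sensors; the last one lies outside both districts.
sensors = gpd.GeoDataFrame(
    {"name": ["s1", "s2", "s3", "s4"], "Reading": [0.4, 0.6, 0.8, 1.5]},
    geometry=[Point(0.5, 0.5), Point(0.2, 0.7), Point(1.5, 0.5), Point(5.0, 5.0)],
    crs="EPSG:4326",
)

# Both frames have a "name" column, so sjoin suffixes them:
# name_left is the sensor name, name_right is the district name.
joined = gpd.sjoin(sensors, districts, how="inner", predicate="intersects")

# Average reading per district; the inner join has already dropped s4.
district_avg = joined.groupby("name_right")["Reading"].mean()
print(district_avg)
```

Because the join is an inner join, a sensor that falls outside every district does not contribute to any average.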

In [None]:
class DataManipulator:
    """
    Handles all the data manipulation needed in the project

    Attributes:
        gdf (GeoDataFrame): dataframe with sensor locations, coal burnt in power plants, and PM2.5 readings
        almaty_boundaries (GeoDataFrame): district boundaries of Almaty, loaded from a geo.json file
        power_plants_gdf (GeoDataFrame): dataframe with locations of power plants
    """
    def __init__(self, gdf, almaty_boundaries):
        self.gdf = gdf
        self.almaty_boundaries = almaty_boundaries

        if self.gdf.crs != self.almaty_boundaries.crs:
            self.gdf = self.gdf.to_crs(self.almaty_boundaries.crs)

        power_plant_locs = [
            {'location': 'Power Plant 2', 'geometry': Point(77.0061645, 43.4224249)},
            {'location': 'Power Plant 3', 'geometry': Point(76.9278271, 43.280907)}
        ]

        self.power_plants_gdf = gpd.GeoDataFrame(power_plant_locs, crs=gdf.crs)

    
    def filter_data(self, start_date: pd.Timestamp, end_date: pd.Timestamp) -> gpd.GeoDataFrame:
        """
        Filters the data by the passed date range (inclusive on both ends) and returns a gdf with the correct crs

        Args:
            start_date (pd.Timestamp): Filtering start date
            end_date (pd.Timestamp): Filtering end date

        Returns:
            GeoDataFrame: a gdf that contains only the rows within the date range
        """
        start_date_str = start_date.strftime('%Y-%m-%d')
        end_date_str = end_date.strftime('%Y-%m-%d')

        return self.gdf[(self.gdf['date'] >= start_date_str) & (self.gdf['date'] <= end_date_str)]
        

    def get_statistics(self, gdf: gpd.GeoDataFrame):
        """
        Computation of basic statistical information based on the provided dataframe

        Args:
            gdf (GeoDataFrame): gdf containing sensor readings

        Returns:
            tuple: min, max, avg, var and most polluted district
        """

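        # Nothing to summarise without readings; return one None per documented
        # value so that callers can always unpack five items
        if 'Reading' not in gdf.columns or gdf['Reading'].isnull().all():
            return None, None, None, None, None
        
        # Names chosen so that the built-in min() and max() are not shadowed
        min_reading = gdf['Reading'].min()
        max_reading = gdf['Reading'].max()
        avg_reading = gdf['Reading'].mean()
        var_reading = gdf['Reading'].var()

        # Both frames carry a 'name' column, so sjoin suffixes them:
        # name_left is the sensor name, name_right is the district name
        joined_gdf = gpd.sjoin(gdf, self.almaty_boundaries, how='inner', predicate='intersects')
        district_avg = joined_gdf.groupby('name_right')['Reading'].mean()

        most_polluted_district = district_avg.idxmax() if not district_avg.empty else None

        return min_reading, max_reading, avg_reading, var_reading, most_polluted_district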
    

    def get_power_plant_data(self):
        """
        Power plant data

        Returns:
            GeoDataFrame: gdf with power plant locations
        """
        return self.power_plants_gdf
    
    
    def calculate_pollution_by_district(self, gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
        """
        Generates a GeoDataFrame with Almaty Districts and avg pollutant matter for corresponding districts

        Args:
            gdf (gpd.GeoDataFrame): GeoDataFrame with sensor readings, locations etc. (usually after filtering, for specific dates)

        Returns:
            gpd.GeoDataFrame: gdf that carries district names, geometries, and avg_pm25
        """
        # sjoin stands for spatial join in geopandas
        # how='inner' retains only the rows where the sensor Point falls within a district boundary
        # I use an inner join since some sensors may be located outside of any district
        # predicate='intersects' returns True if and only if the Point (sensor) intersects the interior OR boundary of a district
        # More information about spatial joins: https://geopandas.org/en/stable/docs/user_guide/mergingdata.html
        # More information about predicates: https://shapely.readthedocs.io/en/latest/manual.html#binary-predicates
        joined_gdf = gpd.sjoin(gdf, self.almaty_boundaries, how='inner', predicate='intersects')

        # name_right would be the district name (since it is the right argument in the sjoin)
        # from the DataFrameGroupBy, we are interested in the mean of Readings
        # this results in district name in one column, and avg. sensor reading in another column of district_pollution
        district_pollution = joined_gdf.groupby('name_right')['Reading'].mean().reset_index()

        # For later use in the project, attach the district averages to a copy of the boundaries GeoDataFrame as 'avg_pm25'
        # join() is a left join by default, so districts without any sensor get NaN
        almaty_boundaries = self.almaty_boundaries.copy()
        almaty_boundaries = almaty_boundaries.set_index('name')
        district_pollution = district_pollution.set_index('name_right')
        almaty_boundaries = almaty_boundaries.join(district_pollution)

        almaty_boundaries.rename(columns={'Reading': 'avg_pm25'}, inplace=True)

        # Reset the index so that 'name' becomes an ordinary column again instead of the index
        almaty_boundaries.reset_index(inplace=True)
        
        return almaty_boundaries
    

    def interpolate(self, gdf: gpd.GeoDataFrame, method: str = 'No Interpolation', grid_spacing: float = 0.005):
        """
        Does interpolation necessary for plotting purposes

        Args:
            gdf (gpd.GeoDataFrame): gdf with sensor readings, locations etc.
            method (str, optional): Chosen interpolation method. Defaults to 'No Interpolation'.
            grid_spacing (float, optional): Spacing for interpolation method. Defaults to 0.005.

        Returns:
            tuple: Masked interpolated grid, longitude grid, and latitude grid
        """
        
        sensor_points = gdf[['geometry', 'Reading']].dropna(subset=['Reading'])

        if sensor_points.empty:
            tk.messagebox.showwarning("No Data", "No sensor data available for the selected date.")
            return

        lons = sensor_points.geometry.x.values
        lats = sensor_points.geometry.y.values
        readings = sensor_points['Reading'].values

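        # Use the boundaries stored on the instance rather than the module-level global
        min_lon, min_lat, max_lon, max_lat = self.almaty_boundaries.total_bounds
        # grid_spacing comes from the method argument (default 0.005); smaller values give a finer grid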
        grid_lon = np.arange(min_lon, max_lon, grid_spacing)
        grid_lat = np.arange(min_lat, max_lat, grid_spacing)
        grid_lon, grid_lat = np.meshgrid(grid_lon, grid_lat)

        # Choose interpolation method
        if method == 'Nearest neighbor':
            # Nearest neighbor interpolation using griddata
            grid_z = griddata(
                points=(lons, lats),
                values=readings,
                xi=(grid_lon, grid_lat),
                method='nearest'
            )
        elif method == 'RBF':
            # RBF interpolation
            rbf_interpolator = Rbf(lons, lats, readings, function='multiquadric', smooth=0)
            grid_z = rbf_interpolator(grid_lon, grid_lat)
            grid_z = np.clip(grid_z, 0, 2)
        else:
            # Show the data without any interpolation
            grid_z = np.full(grid_lon.shape, np.nan)

        # Mask out areas outside Almaty boundaries
        boundary_polygon = self.almaty_boundaries.unary_union

        # Create grid points as a GeoDataFrame
        grid_points = np.vstack((grid_lon.flatten(), grid_lat.flatten())).T
        grid_points_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(grid_points[:,0], grid_points[:,1]), crs=self.almaty_boundaries.crs)

        # Check which points are within the boundary polygon
        grid_points_gdf['inside'] = grid_points_gdf.within(boundary_polygon)

        # Create mask (True for points outside the boundary)
        mask = ~grid_points_gdf['inside'].values.reshape(grid_lon.shape)

        # Apply mask to the interpolated grid
        grid_z_masked = np.ma.array(grid_z, mask=mask)

        return grid_z_masked, grid_lon, grid_lat

almaty_boundaries = gpd.read_file('notebook/almaty-districts.geo.json')
dataManipulator = DataManipulator(combined_gdf, almaty_boundaries)
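
The two interpolation options and the masking step in `interpolate` can be tried in isolation. The sketch below is a simplified illustration: the four sensor positions, the coarse grid, and the rectangular boundary are all invented, and the mask is built with plain shapely rather than a GeoDataFrame.

```python
import numpy as np
from scipy.interpolate import Rbf, griddata
from shapely.geometry import Point, box

# Hypothetical sensor positions (lon, lat) and readings.
lons = np.array([0.1, 0.9, 0.1, 0.9])
lats = np.array([0.1, 0.1, 0.9, 0.9])
readings = np.array([0.2, 0.6, 1.0, 1.4])

# Regular 4 x 4 grid covering the area of interest.
axis = np.arange(0.0, 1.0, 0.25)
grid_lon, grid_lat = np.meshgrid(axis, axis)

# Nearest neighbour: each grid node copies the reading of the closest sensor.
nearest = griddata((lons, lats), readings, (grid_lon, grid_lat), method="nearest")

# RBF: a smooth surface that passes through the sensor values when smooth=0.
rbf = Rbf(lons, lats, readings, function="multiquadric", smooth=0)
smooth_grid = rbf(grid_lon, grid_lat)

# Mask grid nodes outside a hypothetical boundary polygon (True means hidden).
boundary = box(-0.1, -0.1, 0.6, 1.1)
outside = np.array([
    not boundary.contains(Point(x, y))
    for x, y in zip(grid_lon.ravel(), grid_lat.ravel())
]).reshape(grid_lon.shape)
masked = np.ma.array(smooth_grid, mask=outside)

print(nearest)
print(masked)
```

Nearest neighbour produces flat patches around each sensor, while RBF gives a continuous surface that can overshoot between sensors, which is presumably why the class clips the RBF output.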


## Tests

Discuss how you chose what to test.

In [None]:
# Tests