# Telelink Case Solution
### Team Chameleons

### The Team
- vrategov
- kali
- stan
- caseyp

Github Repo: https://github.com/datasciencesociety/vrategov

## 0. Data and working environment

We were given the following 4 datasets:

**atmosphere_profile_train.csv** - data from the University of Wyoming. It consists data for the temperature at certain height.

**construction_sites.csv** - data for all construction sites in Sofia that are relevant for our time period.

**household_heating.csv** - data from a survey by Green Sofia. It corresponds to what type of heating people use.

**industrial_pollution.csv** - Ddta on emissions from industrial installations collected from emissions permits. Data provided by UCTM Sofia.

**sofia_topo.csv** - Topological data of Sofia based on NASA SRTM digital elevation model.

**stations_data_train.csv** - Official air quality measurements (4 stations in the city) – as per EU guidelines on air quality monitoring.

**weather_lbsf_20161101-20161130_IP_train.csv** - Meteorological measurements. Data from 1 station - Sofia airport (LBSF).

We worked on the provided Amazon's virtual machines. However, to properly do our analysis we needed some additional libraries. Install the following libraries and restart the kernel to work properly.

In [None]:
# !pip install geopy --user
# !pip install folium --user

In [1]:
# import the libraries

import pandas as pd
import numpy as np
import math
import geopy.distance
import scipy
import folium
from folium import plugins

In [2]:
# read the data into memory
folder = "/workspace/vrategov/00.Data/"

atmo_profile = "atmosphere_profile_train.csv"
construction = "csvConstructData.csv"
industry = "industrial_pollution_latlon.csv"
topo = "sofia_topo.csv"
stations_train = "stations_data_train.csv"
meteo = "weather_lbsf_20161101-20161130_IP_train.csv"
stations = "stations.csv" # this is a csv file with the characteristics of the stations

df_atmo = pd.DataFrame(pd.read_csv(folder+atmo_profile))
df_const = pd.DataFrame(pd.read_csv(folder+construction))
df_ind = pd.DataFrame(pd.read_csv(folder+industry))
df_topo = pd.DataFrame(pd.read_csv(folder+topo))
df_stations_train = pd.DataFrame(pd.read_csv(folder+stations_train))
df_meteo = pd.DataFrame(pd.read_csv(folder+meteo))
df_stations = pd.DataFrame(pd.read_csv(folder+stations,sep=";")) 

## 1. Business Understanding

Nowadays, air pollution is considered one of the most serious problems in the world. It refers to the contamination of the atmosphere by harmful chemicals or biological materials. To solve the problem of air pollution, it's necessary to understand the issues and look for ways to counter it.

Air pollution could be a reason for a lot of health issues, both in short-term and long-term.

Air pollution causes damage to crops, animals, forests, and bodies of water. It also contributes to the depletion of the ozone layer, which protects the Earth from the sun's UV rays.

In Bulgaria, EEA collects air pollution statistics. It's important to study these statistics because they show how polluted the air has become in various places around the country.

Sofia has a long history with air polution problems. The norms were exceeded many times in the past few years. The record of highest measurement of air polution exceeded six times the reccomended value. Major contributors to the polution are believed to be the **households heating fuels, industrial and construction sites** contamination.

This is the main objective of the case - **to estimate the effect of these factors** on the total measured polution and to predict the **air pollution in Sofia for the next 24 hour period**.



## 2. Data Understanding

#### Atmoshere profile
The data consisted temperature measurements in Celsius for 62 different hights in meters. The dataset is useful to calculate the temperature gradient, which is important for modelling the contribution of each pollutant.

In [3]:
df_atmo.head()

Unnamed: 0,Date,HGHT(m),TEMP(C)
0,2016-11-01,595,9.6
1,2016-11-01,663,7.6
2,2016-11-01,844,5.4
3,2016-11-01,1047,3.6
4,2016-11-01,1284,1.5


In [4]:
df_atmo.describe()

Unnamed: 0,HGHT(m),TEMP(C)
count,1341.0,1341.0
mean,10361.200597,-37.704623
std,6723.793451,26.771796
min,595.0,-68.1
25%,4082.0,-60.9
50%,10221.0,-52.1
75%,15602.0,-9.6
max,26912.0,22.4


#### Construction sites
There is data for 389 construction sites and a starting date for each of them. Date ranges from 7 January 2016 to 28 October 2016.

In [5]:
df_const.head()

Unnamed: 0.1,Unnamed: 0,id,start date,type,district,locality,address,District,Address,PM10,...,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21
0,0,251,6/3/2016,infrastructure,SERDIKA,ZONA V-17,BANISHORA,,"42.709536,23.313438",1226063.0,...,,0.0,,,,,,,,)/2
1,1,284,6/27/2016,small housing,PODUYANE,LEVSKI ZONA G,block 22A - 23,,"42.710712,23.37743",382.0341,...,,,,,,,,,,
2,2,367,8/17/2016,infrastructure,SERDIKA,ZONA V - 17,block 43,"42.729704,23.34461 — 42.71331,23.312595",,1226063.0,...,,,,,,,,,,
3,3,233,5/19/2016,infrastructure,SERDIKA,FONDOVI ZHILISHTA,block 209,"42.729704,23.34461 — 42.71331,23.312595",,1226063.0,...,,,,,,,,,,
4,4,234,5/19/2016,infrastructure,SERDIKA,FONDOVI ZHILISHTA,block 214,"42.729704,23.34461 — 42.71331,23.312595",,1226063.0,...,,,,,,,,,,


In [6]:
df_const.describe()

Unnamed: 0.1,Unnamed: 0,id,PM10,lat,lon,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20
count,389.0,389.0,389.0,389.0,389.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,194.0,294.0,209427.7,42.657161,23.234017,,,0.0,,,,,,,
std,112.438872,112.438872,455960.4,0.072217,0.775887,,,,,,,,,,
min,0.0,100.0,382.0341,42.345394,21.029508,,,0.0,,,,,,,
25%,97.0,197.0,3898.081,42.635561,23.25,,,0.0,,,,,,,
50%,194.0,294.0,3898.081,42.677222,23.308889,,,0.0,,,,,,,
75%,291.0,391.0,12290.24,42.699641,23.348333,,,0.0,,,,,,,
max,388.0,488.0,1226063.0,42.84016,27.181344,,,0.0,,,,,,,


There are 4 different types of construction sites:

| Construction site type  |  N |
|---|---|
|big housing       | 144 |
|infrastructure    | 65  |
|non-residential   | 96  |
|small housing     | 84  |

All of these were spread accross 20 different districts.

#### Household heating

The main interest for us in this dataset was the data in regards to heating with solid fuels - diesel, coal, and wood. This is because for the other types of heating it is already accounted with the industrial pollution. We made the assumtion to count the number of pollutant based on the number of dwelings , rather than the number of people in the building.

Below is a map of the households data.

<img src="https://www.datasciencesociety.net/wp-content/uploads/2019/04/households_plot-600x350.png"
      />

#### Industrial polution

The dataset consisted characteristics of 71 industriall pollutants. We had the coordinates in DMS format and further transformation was needed to get the cartesian values.

In [7]:
df_ind.head()

Unnamed: 0,Lat,Lon,m,t/y
0,42.737961,23.241339,8.0,0.38
1,42.662781,23.388806,15.0,0.03
2,42.662908,23.388686,15.0,0.2
3,42.662972,23.388631,15.0,0.96
4,42.663089,23.38925,15.0,1.58


In [8]:
df_ind.describe()

Unnamed: 0,Lat,Lon,m,t/y
count,71.0,71.0,71.0,71.0
mean,42.705438,23.369704,31.556338,0.964789
std,0.036471,0.064256,28.632468,1.545424
min,42.651478,23.241339,3.5,0.01
25%,42.680017,23.338849,12.0,0.2
50%,42.705869,23.339958,25.0,0.41
75%,42.721821,23.417769,38.5,1.035
max,42.821203,23.536339,125.0,7.09


Below is a map of the industrial polluters data.

<img src="https://www.datasciencesociety.net/wp-content/uploads/2019/04/industrial_polution_plot-600x350.png"
     />

#### Sofia topo

In the Topological data there is information about the latitude, longitude and elevation for 196 points. No time component.

In [9]:
df_topo.head()

Unnamed: 0,Lat,Lon,Elev
0,42.62,23.22,1184.0
1,42.62,23.233571,1333.0
2,42.62,23.247143,1505.0
3,42.62,23.260714,1586.0
4,42.62,23.274286,1533.0


#### Official stations

We were provided with PM10 daily measurements from 4 stations in Sofia from 1 November 2016 to 16 November 2016.

In [10]:
df_stations_train.head()

Unnamed: 0,Date,STA-BG0052A,STA-BG0050A,STA-BG0073A,STA-BG0040A
0,2016-11-01,692.88,823.44,624.0,876.24
1,2016-11-02,1632.96,1756.56,1516.56,2382.288
2,2016-11-03,953.28,978.48,1086.0,680.736
3,2016-11-04,545.52,631.44,888.24,613.2
4,2016-11-05,1420.08,1664.4,1617.12,1608.48


In [11]:
df_stations_train.describe()

Unnamed: 0,STA-BG0052A,STA-BG0050A,STA-BG0073A,STA-BG0040A
count,20.0,20.0,20.0,20.0
mean,1059.0,1278.54,1257.408,1393.7544
std,591.817362,965.460779,844.498994,1074.07896
min,261.6,178.08,244.56,277.512
25%,618.24,605.94,602.52,592.176
50%,891.12,900.96,1035.24,990.948
75%,1473.3,1803.42,1763.52,1801.932
max,2285.52,3300.0,3072.24,3463.752


## 3. Data Preparation

#### Industrial pollution

In the begining in our preparation of industrial data, we had to convert the provided coordinates from DMS format to Cartesian coordinates. Then, we had to calculate the temperature gradient from the formula provided in the case description (deltaTemperature/deltaHeight). The used data set was atmosphere_profile_train. We found out that the whole sample had a stabillity of class E. We proceded with converting the variables in the required units. For example, the debit variable of the pollutants was in t/y and we converted it to g/s, wind speed was converted to m/s from km/h.

In [12]:
df_ind["PM10"] = df_ind["t/y"] * 1000000/(365*24*60*60) # convert the debit to g/s
df_meteo["wind"] = df_meteo["sfcWindAVG"] * 1000/3600 # convert wind speed in m/s

#### Household pollution

We used the following data tables to calculate household pollution: "weather_lbsf_20161101-20161130_IP_train.csv" and "household heating.csv".

We used combination of the simple approach and the heating degree days approach to estimate PM10 emissions (grams/household/day) for the examined period. We divided the PM10 emissions (grams/household/year) by 190 to calculate average daily emission since we do not have information for temperature ranges across the whole heating period. After that we have calculated total emissions for the period with available data i.e. 01-11-2016 to 20-11-2016 (20 days). This is total yearly emissions divided by 190, multiplied by 20.

We calculated HDD for each day, following the if – else structure described in the case. Then we calculated proportion of total HDD for each day and multiplied that proportion to total emissions for the period to estimate emissions per day.

From the file household heating.csv we focused on the variable number of dwellings in the building by source of heating - 4,6,7. DIESEL, COAL, WOOD, i.e. heating on solid fuel and assumed 1039 PM10 emissions (grams/household/year). Than we summed dwellings grouped by coordinates and deleted duplicates, leaving 33 409 unique rows.

Then we combined the two datasets and estimated total emissions for each of the dates per address.

#### Building sites pollution

We used the procedure, provided by  the EMEP/EEA Inventory Guidebook 2016 Tier 1 methodology. We used the parameters of the equation: the emission factor for the pollutant emission, affected area, construction duration and efficiency of emission control measures as suggested by the guidebook and assumed sand soil. We calculated PE Index using the suggested formulae and PRCPAVG and TASAVG data from the weather dataset. The outcome of that calculation is PM10 emissions by type of construction.

We have extracted address coordinates by google maps for some of the constructions and district coordinates for all of the rest observations. For those districts without addresses we measured the center of the district by the given district coordinates. Formula:
$$ x = \frac{coordinateX_1 + coordinateX_2}{2}$$
$$ y = \frac{coordinateY_1 + coordinateY_2}{2}$$

## 4. Data Modelling

The output of the model suggested in the case description can be described with the following function:

In [16]:
def concentration(y,q,u,h,sigma_y,sigma_z,z = 0):
    """Estimate the concentration of PM10 in a point in space.
    
    Keyword arguments:
    y       -- meters crosswind from the emission plume centerline, assumed to be equal to x in our model
    z       -- position in the z direction, default set to 0 to equal ground level(where are the people)
    q       -- stack emissions (g/s)
    u       -- wind speed (m/s)
    h       -- pollutant release height
    sigma_y -- horizontal standard deviation of the emission distribution, in m 
    sigma_z -- vertical standard deviation of the emission distribution, in m 
    """    
    
    c = (q/2*np.pi*u*sigma_y*sigma_z) * (np.exp((-y**2)/(2*sigma_y**2))) * (np.exp((-(z-h)**2)/(2*sigma_z**2))) * (np.exp((-(z+h)**2)/(2*sigma_z**2)))
    
    return(c)

#### Industrial Factor

We folowed the suggested model in the case description. We decided to calculate the contribution of each industrial pollutant to each point from the Sofia Topo data, essentially creating our map grid from these points.

In general, our approach is as follows:

In [17]:
topo_polution_industry = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})

for i in range(0,df_meteo.shape[0]):
    for j in range(0,df_topo.shape[0]):
        c = 0
        for k in range(0,df_ind.shape[0]):
            a = (df_topo["Lat"][j], df_topo["Lon"][j])
            b = (df_ind["Lat"][k],df_ind["Lon"][k])

            x = geopy.distance.distance(a, b).km #distance in kilometers

            if x < 1:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 22.8 * x**(0.675-1.3)
            else:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 55.4 * x**(0.305-34.0)

            c += concentration(y = x ,
                              q = df_ind["PM10"][k],
                              u = df_meteo["wind"][i],
                              h = df_ind["m"][k],
                              sigma_y = sigma_y,
                              sigma_z = sigma_z)
        topo_polution_industry = topo_polution_industry.append({'Date': i, 'Lat': df_topo["Lat"][j], 'Lon': df_topo["Lon"][j], 'Ind_P10': c}, ignore_index=True)
    print('Calculations for day {} are ready.'.format(i))

Calculations for day 0 are ready.
Calculations for day 1 are ready.
Calculations for day 2 are ready.
Calculations for day 3 are ready.
Calculations for day 4 are ready.
Calculations for day 5 are ready.
Calculations for day 6 are ready.
Calculations for day 7 are ready.
Calculations for day 8 are ready.
Calculations for day 9 are ready.
Calculations for day 10 are ready.
Calculations for day 11 are ready.
Calculations for day 12 are ready.
Calculations for day 13 are ready.
Calculations for day 14 are ready.
Calculations for day 15 are ready.
Calculations for day 16 are ready.
Calculations for day 17 are ready.
Calculations for day 18 are ready.
Calculations for day 19 are ready.


The end result is the total pollution from industrial objects at each point from the Topo data for each day in our sample.

In [18]:
topo_polution_industry.head()

Unnamed: 0,Date,Lat,Lon,Ind_P10
0,0.0,42.62,23.22,0.0
1,0.0,42.62,23.233571,0.0
2,0.0,42.62,23.247143,0.0
3,0.0,42.62,23.260714,0.0
4,0.0,42.62,23.274286,0.0


In [19]:
topo_polution_industry.describe()

Unnamed: 0,Date,Lat,Lon,Ind_P10
count,3920.0,3920.0,3920.0,3920.0
mean,9.5,42.679796,23.308214,26.388263
std,5.767017,0.037089,0.054715,101.003416
min,0.0,42.62,23.22,0.0
25%,4.75,42.647598,23.260714,0.0
50%,9.5,42.679796,23.308214,0.0
75%,14.25,42.711994,23.355714,0.0
max,19.0,42.739592,23.396429,991.497689


In order to make comparable our results with the official measurements, we had to further divide the estimated pollution by the air density. The following functions helps in doing that.

In [20]:
# functions taken from http://python.hydrology-amsterdam.nl/moduledoc/index.html

def ea_calc(airtemp= scipy.array([]),\
            rh= scipy.array([])):
    '''
    Function to calculate actual vapour pressure from relative humidity:
    
    .. math::    
        e_a = \\frac{rh \\cdot e_s}{100}
        
    where es is the saturated vapour pressure at temperature T.

    Parameters:
        - airtemp: array of measured air temperatures [Celsius].
        - rh: Relative humidity [%].

    Returns:
        - ea: array of actual vapour pressure [Pa].

    Examples
    --------
    
        >>> ea_calc(25,60)
        1900.0946514729308

    '''
    
    # Test input array/value
    #airtemp,rh = _arraytest(airtemp, rh)

    # Calculate saturation vapour pressures
    es = es_calc(airtemp)
    # Calculate actual vapour pressure
    eact = rh / 100.0 * es
    return eact # in Pa

def es_calc(airtemp= scipy.array([])):
    '''
    Function to calculate saturated vapour pressure from temperature.

    For T<0 C the saturation vapour pressure equation for ice is used
    accoring to Goff and Gratch (1946), whereas for T>=0 C that of
    Goff (1957) is used.
    
    Parameters:
        - airtemp : (data-type) measured air temperature [Celsius].
        
    Returns:
        - es : (data-type) saturated vapour pressure [Pa].

    References
    ----------
    
    - Goff, J.A.,and S. Gratch, Low-pressure properties of water from -160 \
    to 212 F. Transactions of the American society of heating and \
    ventilating engineers, p. 95-122, presented at the 52nd annual \
    meeting of the American society of \
    heating and ventilating engineers, New York, 1946.
    - Goff, J. A. Saturation pressure of water on the new Kelvin \
    temperature scale, Transactions of the American \
    society of heating and ventilating engineers, pp 347-354, \
    presented at the semi-annual meeting of the American \
    society of heating and ventilating engineers, Murray Bay, \
    Quebec. Canada, 1957.

    Examples
    --------    
        >>> es_calc(30.0)
        4242.725994656632
        >>> x = [20, 25]
        >>> es_calc(x)
        array([ 2337.08019792,  3166.82441912])
    
    '''

    # Test input array/value
    #airtemp = _arraytest(airtemp)

    # Determine length of array
    n = scipy.size(airtemp)
    # Check if we have a single (array) value or an array
    if n < 2:
        # Calculate saturated vapour pressures, distinguish between water/ice
        if airtemp < 0:
            # Calculate saturation vapour pressure for ice
            log_pi = - 9.09718 * (273.16 / (airtemp + 273.15) - 1.0) \
                     - 3.56654 * math.log10(273.16 / (airtemp + 273.15)) \
                     + 0.876793 * (1.0 - (airtemp + 273.15) / 273.16) \
                     + math.log10(6.1071)
            es = math.pow(10, log_pi)   
        else:
            # Calculate saturation vapour pressure for water
            log_pw = 10.79574 * (1.0 - 273.16 / (airtemp + 273.15)) \
                     - 5.02800 * math.log10((airtemp + 273.15) / 273.16) \
                     + 1.50475E-4 * (1 - math.pow(10, (-8.2969 * ((airtemp +\
                     273.15) / 273.16 - 1.0)))) + 0.42873E-3 * \
                     (math.pow(10, (+4.76955 * (1.0 - 273.16\
                     / (airtemp + 273.15)))) - 1) + 0.78614
            es = math.pow(10, log_pw)
    else:   # Dealing with an array     
        # Initiate the output array
        es = scipy.zeros(n)
        # Calculate saturated vapour pressures, distinguish between water/ice
        for i in range(0, n):              
            if airtemp[i] < 0:
                # Saturation vapour pressure equation for ice
                log_pi = - 9.09718 * (273.16 / (airtemp[i] + 273.15) - 1.0) \
                         - 3.56654 * math.log10(273.16 / (airtemp[i] + 273.15)) \
                         + 0.876793 * (1.0 - (airtemp[i] + 273.15) / 273.16) \
                         + math.log10(6.1071)
                es[i] = math.pow(10, log_pi)
            else:
                # Calculate saturation vapour pressure for water  
                log_pw = 10.79574 * (1.0 - 273.16 / (airtemp[i] + 273.15)) \
                         - 5.02800 * math.log10((airtemp[i] + 273.15) / 273.16) \
                         + 1.50475E-4 * (1 - math.pow(10, (-8.2969\
                         * ((airtemp[i] + 273.15) / 273.16 - 1.0)))) + 0.42873E-3\
                         * (math.pow(10, (+4.76955 * (1.0 - 273.16\
                         / (airtemp[i] + 273.15)))) - 1) + 0.78614
                es[i] = pow(10, log_pw)
    # Convert from hPa to Pa
    es = es * 100.0
    return es # in Pa
def rho_calc(airtemp= scipy.array([]),\
             rh= scipy.array([]),\
             airpress= scipy.array([])):
    '''
    Function to calculate the density of air, rho, from air
    temperatures, relative humidity and air pressure.

    .. math::    
        \\rho = 1.201 \\cdot \\frac{290.0 \\cdot (p - 0.378 \\cdot e_a)}{1000 \\cdot (T + 273.15)} / 100
        
    Parameters:
        - airtemp: (array of) air temperature data [Celsius].
        - rh: (array of) relative humidity data [%].
        - airpress: (array of) air pressure data [Pa].
        
    Returns:
        - rho: (array of) air density data [kg m-3].
        
    Examples
    --------
    
        >>> t = [10, 20, 30]
        >>> rh = [10, 20, 30]
        >>> airpress = [100000, 101000, 102000]
        >>> rho_calc(t,rh,airpress)
        array([ 1.22948419,  1.19787662,  1.16635358])
        >>> rho_calc(10,50,101300)
        1.2431927125520903
        
    '''

    # Test input array/value    
    #airtemp,rh,airpress = _arraytest(airtemp,rh,airpress)
    
    # Calculate actual vapour pressure
    eact = ea_calc(airtemp, rh)
    # Calculate density of air rho
    rho = 1.201 * (290.0 * (airpress - 0.378 * eact)) \
             / (1000.0 * (airtemp + 273.15)) / 100.0
    return rho # in kg/m3

In [21]:
air_density = pd.DataFrame(rho_calc(df_meteo["TASAVG"], df_meteo["RHAVG"], df_meteo["PSLAVG"]*100), columns=["density"])
air_density["Date"] = air_density.index

In [22]:
topo_polution_industry = topo_polution_industry.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')

In [23]:
topo_polution_industry["Ind_P10"] = topo_polution_industry["Ind_P10"]/topo_polution_industry["density"]

In [24]:
topo_polution_industry.describe()

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
count,3920.0,3920.0,3920.0,3920.0,3920.0,3920.0
mean,9.5,42.679796,23.308214,21.045768,1.260852,9.5
std,5.767017,0.037089,0.054715,80.906232,0.027713,5.767017
min,0.0,42.62,23.22,0.0,1.201294,0.0
25%,4.75,42.647598,23.260714,0.0,1.246616,4.75
50%,9.5,42.679796,23.308214,0.0,1.263779,9.5
75%,14.25,42.711994,23.355714,0.0,1.276042,14.25
max,19.0,42.739592,23.396429,804.915173,1.313515,19.0


And, of course, it is imoprtant to visualize the results which can hint if our analysis is correct. The following code is doing that.

In [25]:
def make_heatmap(df, timestamp, metric):
    """Create a Heat Map of Sofia for a given timestamp to visualize a given metric.
    For example, to visualize PM10 pollution
    
    The map is interactive.

    The map also visualizes clusters of the locations.

    Keyword arguments:
    df     -- the data frame with time, longitude, lattitude and the chosen metric
    timestamp -- the point in time for which to visualize the heat map
    metric -- the metric, used for visualization
    """
    points = df[df["Date_caller"] == int(timestamp)]
    
    folium_map = folium.Map(location=sofia_center,
                            zoom_start=11,
                            tiles='Stamen Terrain')

    marker_cluster = plugins.MarkerCluster().add_to(folium_map)
    
    for i in range(0, len(points)):
        point = points.iloc[i]

        folium.Marker(
            [point['Lat'], point['Lon']],
            popup=str(point['Ind_P10'])
        ).add_to(marker_cluster)
        
#         folium.Circle(
#             radius=10,
#             location=[point['latitude'], point['longitude']],
#             popup=str(point['P1']),
#             color='#333333',
#             fill=False
#         ).add_to(folium_map)

    plugins.MarkerCluster().add_to(folium_map)
        
    # plot heatmap
    folium_map.add_child(plugins.HeatMap(
        points[['Lat', 'Lon', metric]].as_matrix(),
        min_opacity=0.2,
        max_val=points[metric].max(),
        radius=30, blur=17,
        max_zoom=1
    ))

    # You can also save the interactive heat map as an HTML file
    # folium_map.save("output-map.html")
    
    return folium_map

In [26]:
sofia_center = [42.697708, 23.321867] # coordinates assumed to be the center of the city
make_heatmap(topo_polution_industry, '0', 'Ind_P10') 



In the heatmap above you can see where the biggest pollution from the industry was on November 1st 2016. The visualization is in fact relevent with the locations of the biggest industry factories in Sofia.

The next step in our solution was to estimate the industry pollution at the points where the official stations were located.

In [27]:
stations_polution_ind = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})

for i in range(0,df_meteo.shape[0]):
    for j in range(0,df_stations.shape[0]):
        c = 0
        for k in range(0,df_ind.shape[0]):
            a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
            b = (df_ind["Lat"][k],df_ind["Lon"][k])

            x = geopy.distance.distance(a, b).km #distance in kilometers

            if x < 1:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 22.8 * x**(0.675-1.3)
            else:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 55.4 * x**(0.305-34.0)

            c += concentration(y = x ,
                              q = df_ind["PM10"][k],
                              u = df_meteo["wind"][i],
                              h = df_ind["m"][k],
                              sigma_y = sigma_y,
                              sigma_z = sigma_z)
        stations_polution_ind = stations_polution_ind.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
    print('Calculations for day {} are ready.'.format(i))

Calculations for day 0 are ready.
Calculations for day 1 are ready.
Calculations for day 2 are ready.
Calculations for day 3 are ready.
Calculations for day 4 are ready.
Calculations for day 5 are ready.
Calculations for day 6 are ready.
Calculations for day 7 are ready.
Calculations for day 8 are ready.
Calculations for day 9 are ready.
Calculations for day 10 are ready.
Calculations for day 11 are ready.
Calculations for day 12 are ready.
Calculations for day 13 are ready.
Calculations for day 14 are ready.
Calculations for day 15 are ready.
Calculations for day 16 are ready.
Calculations for day 17 are ready.
Calculations for day 18 are ready.
Calculations for day 19 are ready.


In [28]:
stations_polution_ind = stations_polution_ind.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_ind["Ind_P10"] = stations_polution_ind["Ind_P10"]/stations_polution_ind["density"]

In [29]:
stations_polution_ind.head()

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
0,0.0,42.732292,23.310972,137.017091,1.272993,0
1,0.0,42.680558,23.296786,0.0,1.272993,0
2,0.0,42.666508,23.400164,299.437043,1.272993,0
3,0.0,42.669797,23.268403,0.0,1.272993,0
4,1.0,42.732292,23.310972,64.218538,1.267498,1


In [30]:
stations_polution_ind.describe()

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
count,80.0,80.0,80.0,80.0,80.0,80.0
mean,9.5,42.687289,23.319081,112.625657,1.260852,9.5
std,5.802662,0.026664,0.049569,147.136524,0.027885,5.802662
min,0.0,42.666508,23.268403,0.0,1.201294,0.0
25%,4.75,42.668975,23.28969,0.0,1.246616,4.75
50%,9.5,42.675178,23.303879,26.558029,1.263779,9.5
75%,14.25,42.693492,23.33327,174.511778,1.276042,14.25
max,19.0,42.732292,23.400164,557.009501,1.313515,19.0


It is interesting to see that some official stations are not affected by industrial pollution.

#### Construction Sites Factor

We followed all steps from the previous step (from the industrial factor). The only think that differs in the analysis is that we assumed arbitrary height of the pollutant (in the industry data set we were provided with the actual hight). Our values are as follows - for infrastructure type we use h equal to 5, for small housing type h is 10, and for the other types it is 15.

In [33]:
df_const["PM10_gs"] = df_const["PM10"] * 1/(365*24*60*60) # convert the debit to g/s

In [34]:
topo_polution_const = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})

for i in range(0,df_meteo.shape[0]):
    for j in range(0,df_topo.shape[0]):
        c = 0
        for k in range(0,df_const.shape[0]):
            a = (df_topo["Lat"][j], df_topo["Lon"][j])
            b = (df_const["lat"][k],df_const["lon"][k])

            x = geopy.distance.distance(a, b).km #distance in kilometers

            if x < 1:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 22.8 * x**(0.675-1.3)
            else:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 55.4 * x**(0.305-34.0)
            if df_const["type"][k] == "infrastructure":
                h = 5
            elif df_const["type"][k] == "small housing":
                h = 10
            else:
                h = 15            

            c += round(concentration(y = x ,
                              q = df_const["PM10_gs"][k],
                              u = df_meteo["wind"][i],
                              h = h,
                              sigma_y = sigma_y,
                              sigma_z = sigma_z))
        topo_polution_const = topo_polution_const.append({'Date': i, 'Lat': df_topo["Lat"][j], 'Lon': df_topo["Lon"][j], 'Ind_P10': c}, ignore_index=True)
    print('Calculations for day {} are ready.'.format(i))

Calculations for day 0 are ready.
Calculations for day 1 are ready.
Calculations for day 2 are ready.
Calculations for day 3 are ready.
Calculations for day 4 are ready.
Calculations for day 5 are ready.
Calculations for day 6 are ready.
Calculations for day 7 are ready.
Calculations for day 8 are ready.
Calculations for day 9 are ready.
Calculations for day 10 are ready.
Calculations for day 11 are ready.
Calculations for day 12 are ready.
Calculations for day 13 are ready.
Calculations for day 14 are ready.
Calculations for day 15 are ready.
Calculations for day 16 are ready.
Calculations for day 17 are ready.
Calculations for day 18 are ready.
Calculations for day 19 are ready.


In [35]:
topo_polution_const.head()

Unnamed: 0,Date,Lat,Lon,Ind_P10
0,0.0,42.62,23.22,0.0
1,0.0,42.62,23.233571,0.0
2,0.0,42.62,23.247143,0.0
3,0.0,42.62,23.260714,0.0
4,0.0,42.62,23.274286,0.0


In [36]:
topo_polution_const.describe()

Unnamed: 0,Date,Lat,Lon,Ind_P10
count,3920.0,3920.0,3920.0,3920.0
mean,9.5,42.679796,23.308214,179.715051
std,5.767017,0.037089,0.054715,432.878554
min,0.0,42.62,23.22,0.0
25%,4.75,42.647598,23.260714,0.0
50%,9.5,42.679796,23.308214,0.0
75%,14.25,42.711994,23.355714,90.0
max,19.0,42.739592,23.396429,4235.0


In [37]:
# divide again by the air density

topo_polution_const = topo_polution_const.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
topo_polution_const["Ind_P10"] = topo_polution_const["Ind_P10"]/topo_polution_const["density"]

In [38]:
topo_polution_const.describe()

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
count,3920.0,3920.0,3920.0,3920.0,3920.0,3920.0
mean,9.5,42.679796,23.308214,143.332376,1.260852,9.5
std,5.767017,0.037089,0.054715,346.894974,0.027713,5.767017
min,0.0,42.62,23.22,0.0,1.201294,0.0
25%,4.75,42.647598,23.260714,0.0,1.246616,4.75
50%,9.5,42.679796,23.308214,0.0,1.263779,9.5
75%,14.25,42.711994,23.355714,70.526596,1.276042,14.25
max,19.0,42.739592,23.396429,3438.047104,1.313515,19.0


In [39]:
# visualize the data

make_heatmap(topo_polution_const, '0', 'Ind_P10')



In [40]:
stations_polution_const = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})

for i in range(0,df_meteo.shape[0]):
    for j in range(0,df_stations.shape[0]):
        c = 0
        for k in range(0,df_const.shape[0]):
            a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
            b = (df_const["lat"][k],df_const["lon"][k])

            x = geopy.distance.distance(a, b).km #distance in kilometers

            if x < 1:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 22.8 * x**(0.675-1.3)
            else:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 55.4 * x**(0.305-34.0)
            if df_const["type"][k] == "infrastructure":
                h = 5
            elif df_const["type"][k] == "small housing":
                h = 10
            else:
                h = 15            

            c += concentration(y = x ,
                              q = df_const["PM10_gs"][k],
                              u = df_meteo["wind"][i],
                              h = h,
                              sigma_y = sigma_y,
                              sigma_z = sigma_z)
        stations_polution_const = stations_polution_const.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
    print('Calculations for day {} are ready.'.format(i))

Calculations for day 0 are ready.
Calculations for day 1 are ready.
Calculations for day 2 are ready.
Calculations for day 3 are ready.
Calculations for day 4 are ready.
Calculations for day 5 are ready.
Calculations for day 6 are ready.
Calculations for day 7 are ready.
Calculations for day 8 are ready.
Calculations for day 9 are ready.
Calculations for day 10 are ready.
Calculations for day 11 are ready.
Calculations for day 12 are ready.
Calculations for day 13 are ready.
Calculations for day 14 are ready.
Calculations for day 15 are ready.
Calculations for day 16 are ready.
Calculations for day 17 are ready.
Calculations for day 18 are ready.
Calculations for day 19 are ready.


In [41]:
stations_polution_const = stations_polution_const.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_const["Ind_P10"] = stations_polution_const["Ind_P10"]/stations_polution_const["density"]

In [42]:
stations_polution_const.describe()

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
count,80.0,80.0,80.0,80.0,80.0,80.0
mean,9.5,42.687289,23.319081,306.38352,1.260852,9.5
std,5.802662,0.026664,0.049569,358.396118,0.027885,5.802662
min,0.0,42.666508,23.268403,0.002609,1.201294,0.0
25%,4.75,42.668975,23.28969,0.723477,1.246616,4.75
50%,9.5,42.675178,23.303879,113.51572,1.263779,9.5
75%,14.25,42.693492,23.33327,611.94961,1.276042,14.25
max,19.0,42.732292,23.400164,1136.716425,1.313515,19.0


## 5. Evaluation

We were provided with test data to asses our analysis.

**stations_data_test.csv.csv** - data for the measurements by the official stations from November 21st 2016 to November 25th 2016.

**atmosphere_profile_test.csv** - data to calculate the gradient.

**weather_lbsf_20161101-20161130_IP_test.csv** - weather measurements from November 21st 2016 to November 25th 2016.

#### Industry polution

In [43]:
meteo = "weather_lbsf_20161101-20161130_IP_test.csv"

df_meteo = pd.DataFrame(pd.read_csv(folder+meteo))


In [46]:
# transform some variables in correct units

df_meteo["wind"] = df_meteo["sfcWindAVG"] * 1000/3600 # convert wind speed in m/s

In [50]:
stations_polution_ind_test = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})

for i in range(0,df_meteo.shape[0]):
    for j in range(0,df_stations.shape[0]):
        c = 0
        for k in range(0,df_ind.shape[0]):
            a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
            b = (df_ind["Lat"][k],df_ind["Lon"][k])

            x = geopy.distance.distance(a, b).km #distance in kilometers

            if x < 1:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 22.8 * x**(0.675-1.3)
            else:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 55.4 * x**(0.305-34.0)

            c += concentration(y = x ,
                              q = df_ind["PM10"][k],
                              u = df_meteo["wind"][i],
                              h = df_ind["m"][k],
                              sigma_y = sigma_y,
                              sigma_z = sigma_z)
        stations_polution_ind_test = stations_polution_ind_test.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
    print('Calculations for day {} are ready.'.format(i))

Calculations for day 0 are ready.
Calculations for day 1 are ready.
Calculations for day 2 are ready.
Calculations for day 3 are ready.
Calculations for day 4 are ready.


In [51]:
stations_polution_ind_test = stations_polution_ind_test.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_ind_test["Ind_P10"] = stations_polution_ind_test["Ind_P10"]/stations_polution_ind_test["density"]

In [52]:
stations_polution_ind_test

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
0,0.0,42.732292,23.310972,164.42051,1.272993,0
1,0.0,42.680558,23.296786,0.0,1.272993,0
2,0.0,42.666508,23.400164,359.324451,1.272993,0
3,0.0,42.669797,23.268403,0.0,1.272993,0
4,1.0,42.732292,23.310972,110.088922,1.267498,1
5,1.0,42.680558,23.296786,0.0,1.267498,1
6,1.0,42.666508,23.400164,240.588244,1.267498,1
7,1.0,42.669797,23.268403,0.0,1.267498,1
8,2.0,42.732292,23.310972,83.231891,1.257368,2
9,2.0,42.680558,23.296786,0.0,1.257368,2


#### Construction pollution

In [54]:
stations_polution_const_test = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})

for i in range(0,df_meteo.shape[0]):
    for j in range(0,df_stations.shape[0]):
        c = 0
        for k in range(0,df_const.shape[0]):
            a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
            b = (df_const["lat"][k],df_const["lon"][k])

            x = geopy.distance.distance(a, b).km #distance in kilometers

            if x < 1:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 22.8 * x**(0.675-1.3)
            else:
                sigma_y = 50.5 * x**(0.894)
                sigma_z = 55.4 * x**(0.305-34.0)
            if df_const["type"][k] == "infrastructure":
                h = 5
            elif df_const["type"][k] == "small housing":
                h = 10
            else:
                h = 15            

            c += concentration(y = x ,
                              q = df_const["PM10_gs"][k],
                              u = df_meteo["wind"][i],
                              h = h,
                              sigma_y = sigma_y,
                              sigma_z = sigma_z)
        stations_polution_const_test = stations_polution_const_test.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
    print('Calculations for day {} are ready.'.format(i))

Calculations for day 0 are ready.
Calculations for day 1 are ready.
Calculations for day 2 are ready.
Calculations for day 3 are ready.
Calculations for day 4 are ready.


In [55]:
stations_polution_const_test = stations_polution_const_test.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_const_test["Ind_P10"] = stations_polution_const_test["Ind_P10"]/stations_polution_const_test["density"]

In [56]:
stations_polution_const_test

Unnamed: 0,Date_caller,Lat,Lon,Ind_P10,density,Date_other
0,0.0,42.732292,23.310972,733.290914,1.272993,0
1,0.0,42.680558,23.296786,688.508273,1.272993,0
2,0.0,42.666508,23.400164,2.97311,1.272993,0
3,0.0,42.669797,23.268403,0.008075,1.272993,0
4,1.0,42.732292,23.310972,490.98015,1.267498,1
5,1.0,42.680558,23.296786,460.995614,1.267498,1
6,1.0,42.666508,23.400164,1.990667,1.267498,1
7,1.0,42.669797,23.268403,0.005407,1.267498,1
8,2.0,42.732292,23.310972,371.2018,1.257368,2
9,2.0,42.680558,23.296786,348.53222,1.257368,2


## 6. Deployment

Nowadays there are 6 official stations which measures the air quality in Sofia. However, one of them is in the mountain and it is probably not a great source of information about the quality of the air in the city below. The results in our analysis could be used to place new official stations appropriately. And with the introduction of new stations the measurements of the air we breath would be more close to the true ones.