## The notebook for plotting values and locations of interest on an interactive map

The queries that can be answered using an adapted version of the following code include but are not limited to:
* Plot as dots on a spatial map all *'Site'* locations which have a certain *'Origin'* and are dated between certain years;
* Show the summed *'Frequency'* for this period as the size/colour of the dot;
* Determine for each *'Year'* the *'Sites'* on which there is evidence of an *'RAAD form'* with a certain *'Origin'*;
* Scale dot size/colour with the count of *'RAAD forms'* at a *'Site'*.

### 1. Import packages
**Note**: Remember to always import the functions from the `functions.py` file.

If the packages listed in `requirements.txt` are **absent**, they can be installed with the `!pip` command.

This needs to be done only once. An example of the installation is given below; the lines are commented out with the `#` symbol.

Run the cell where the packages are imported. If an error of the type `ModuleNotFoundError: No module named ...` occurs:

1.  delete the `#` before the corresponding package name in the installation cell;
2.  run that cell, then run the import cell again.

In [3]:
# !pip install pandas
# !pip install seaborn
# !pip install matplotlib
# !pip install numpy
# !pip install regex
# !pip install geopandas
# !pip install kaleido
# !pip install plotly
# !pip install pyproj
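
Instead of waiting for an import error, one can check in advance which packages are missing. The sketch below is an optional helper and not part of the original workflow; the function name `find_missing` is hypothetical, and the list of names is taken from the import cell of this notebook.

```python
import importlib.util

# Import names of the packages used in this notebook (adjust as needed)
required = ["pandas", "seaborn", "matplotlib", "numpy", "regex",
            "geopandas", "kaleido", "plotly", "pyproj"]

def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

missing = find_missing(required)
print("Missing packages:", missing if missing else "none")
```

Every name printed as missing corresponds to one `!pip install` line that should be uncommented and run.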

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
import regex as re
import geopandas
import kaleido
import plotly
import plotly.express as px
import plotly.io as pio
import pyproj
sns.set()
import sys
sys.path.append("../src")
from functions import freq_per_year, propor_to_map_range    # module with all functions used for the task

### 2. Load data into pandas dataframe
With `usecols=[...]` one specifies which columns of the CSV file to load (optional); all other columns are skipped.

In [3]:
df = pd.read_csv('RAAD_data_restructured.csv', usecols=[
                                                  'RAAD_form','long', 'lat', 'origin', 
                                                  'origin_h1', 'origin_h2', 'raad_type_start_date', 
                                                  'raad_type_end_date', 'site_name_modern', 'frequency'
                                                  ])
df.head()

Unnamed: 0,RAAD_form,origin,frequency,origin_h1,origin_h2,raad_type_start_date,raad_type_end_date,site_name_modern,lat,long
0,augst 48,em,1,em,,,,augst,47.533512,7.71628
1,augst 49,em,1,em,,1.0,100.0,augst,47.533512,7.71628
2,augst 55 agora f6566,em,12,em,,-50.0,400.0,augst,47.533512,7.71628
3,augst 56,em,2,em,,300.0,450.0,augst,47.533512,7.71628
4,augst 57,em,1,em,,400.0,500.0,augst,47.533512,7.71628
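
The effect of `usecols` can be tried on a small in-memory file. This is only an illustration: the CSV text and the `notes` column below are made up and do not come from `RAAD_data_restructured.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the real data file
csv_text = """RAAD_form,origin,frequency,notes
augst 49,em,1,first row
augst 56,em,2,second row
"""

# Only the listed columns are parsed; 'origin' and 'notes' are never loaded
small = pd.read_csv(io.StringIO(csv_text), usecols=["RAAD_form", "frequency"])
print(small.columns.tolist())
```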


### 3. Prepare data

#### 3.1 Check in which columns numeric values are of an object type
#### 3.2 If found, convert objects into numeric values (float) 
This is essential for performing mathematical operations on these variables.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1357 entries, 0 to 1356
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   RAAD_form             1357 non-null   object 
 1   origin                1357 non-null   object 
 2   origin_h1             1357 non-null   object 
 3   origin_h2             915 non-null    object 
 4   raad_type_start_date  1157 non-null   float64
 5   raad_type_end_date    1157 non-null   float64
 6   lat                   1357 non-null   float64
 7   long                  1357 non-null   float64
dtypes: float64(4), object(4)
memory usage: 84.9+ KB


In [4]:
# Values that cannot be parsed as numbers are set to NaN
df['raad_type_end_date'] = pd.to_numeric(df['raad_type_end_date'], errors='coerce')
df['raad_type_start_date'] = pd.to_numeric(df['raad_type_start_date'], errors='coerce')
df['long'] = pd.to_numeric(df['long'], errors='coerce')
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
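
To see what `errors='coerce'` does, the following sketch converts a made-up column in which one date was entered as text. The values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical column in which one date was entered as text
dates = pd.Series(["100", "-50", "unknown", "400.0"])

# Parseable strings become floats; the unparseable entry becomes NaN
converted = pd.to_numeric(dates, errors="coerce")
print(converted)
```

Counting the NaN values before and after the conversion (`converted.isna().sum()`) is a quick way to find out how many entries could not be parsed.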

#### 3.3 Clean text data (remove punctuation and double spaces) and convert it to lowercase
This is done to avoid inconsistencies in object names and, consequently, errors while counting. For the RAAD data this was already done in `preparing_dataframe.ipynb`. See `Sonata_data/sonata_maps.ipynb` for an example of how this is done.
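
A minimal sketch of such a cleaning step is given here for orientation. It is not the code from `preparing_dataframe.ipynb`; the function name `clean_text` is hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption.

```python
import pandas as pd

def clean_text(series):
    """Lowercase, replace punctuation with spaces and collapse repeated whitespace."""
    return (series.str.lower()
                  .str.replace(r"[^\w\s]", " ", regex=True)   # punctuation -> space
                  .str.replace(r"\s+", " ", regex=True)       # collapse runs of whitespace
                  .str.strip())                               # drop leading/trailing spaces

# Three spellings that should count as two distinct names
names = pd.Series(["Dressel  24", "dressel 24.", " Kapitan-2 "])
print(clean_text(names).tolist())
```

After cleaning, the first two entries are identical, so they are counted as one type instead of two.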

### 4. Create the dataframe which includes *'Sites'* only for a certain *'Origin'*

In [5]:
em = df[df['origin_h1'] == 'em']
em.head()

Unnamed: 0,RAAD_form,origin,frequency,origin_h1,origin_h2,raad_type_start_date,raad_type_end_date,site_name_modern,lat,long
0,augst 48,em,1,em,,,,augst,47.533512,7.71628
1,augst 49,em,1,em,,1.0,100.0,augst,47.533512,7.71628
2,augst 55 agora f6566,em,12,em,,-50.0,400.0,augst,47.533512,7.71628
3,augst 56,em,2,em,,300.0,450.0,augst,47.533512,7.71628
4,augst 57,em,1,em,,400.0,500.0,augst,47.533512,7.71628
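
The `SettingWithCopyWarning` messages printed in the following steps appear because new columns are added to a filtered slice of `df`. One common way to avoid them, shown here on made-up data, is to take an explicit copy of the subset; the names `df_demo` and `em_demo` are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({"origin_h1": ["em", "af", "em"],
                        "frequency": [1, 5, 12]})

# .copy() makes the subset an independent dataframe
em_demo = df_demo[df_demo["origin_h1"] == "em"].copy()

# Adding a column now affects only the copy, not df_demo
em_demo["doubled"] = em_demo["frequency"] * 2
print(em_demo)
```

Applied to this notebook, that would mean writing `em = df[df['origin_h1'] == 'em'].copy()` in the cell above.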


### 5. Calculate *'Frequency'* per *'Year'* per *'Amphora type'*. Add the resulting values to the dataframe

In [6]:
em = freq_per_year(data = em,
                   lower_date = 'raad_type_start_date',
                   upper_date = 'raad_type_end_date',
                   freq = 'frequency')

em.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[freq_per_year] = data[freq] / (data[upper_date] - data[lower_date])


Unnamed: 0,RAAD_form,origin,frequency,origin_h1,origin_h2,raad_type_start_date,raad_type_end_date,site_name_modern,lat,long,Freq_per_year
0,augst 48,em,1,em,,,,augst,47.533512,7.71628,
1,augst 49,em,1,em,,1.0,100.0,augst,47.533512,7.71628,0.010101
2,augst 55 agora f6566,em,12,em,,-50.0,400.0,augst,47.533512,7.71628,0.026667
3,augst 56,em,2,em,,300.0,450.0,augst,47.533512,7.71628,0.013333
4,augst 57,em,1,em,,400.0,500.0,augst,47.533512,7.71628,0.01
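
The warning text above shows the core of the calculation: the frequency is divided by the length of the date range. The sketch below reproduces only that step on three rows of the output; `freq_per_year_sketch` is a hypothetical stand-in and not the function from `functions.py`, which may do more.

```python
import pandas as pd

def freq_per_year_sketch(data, lower_date, upper_date, freq):
    """Spread each frequency evenly over the years of its date range."""
    out = data.copy()   # work on a copy, leave the input untouched
    out["Freq_per_year"] = out[freq] / (out[upper_date] - out[lower_date])
    return out

demo = pd.DataFrame({"frequency": [1, 12, 2],
                     "raad_type_start_date": [1.0, -50.0, 300.0],
                     "raad_type_end_date": [100.0, 400.0, 450.0]})

demo = freq_per_year_sketch(demo, "raad_type_start_date",
                            "raad_type_end_date", "frequency")
print(demo["Freq_per_year"].round(6).tolist())
```

Rows without dates yield NaN, and rows whose start and end date coincide yield a division by zero, so such rows may need separate treatment.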


### 6. Calculate the proportion of '*Frequency*' to a given map period and add the results to the dataframe

In [7]:
em = propor_to_map_range(data = em, 
                         map_lower_date = 0, 
                         map_upper_date = 300,
                         object_lower_date = 'raad_type_start_date',          
                         object_upper_date = 'raad_type_end_date',
                         freq_per_year = 'Freq_per_year')

em.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[proportion] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[proportion].iloc[row] = data[freq_per_year].iloc[row] * date_range


Unnamed: 0,RAAD_form,origin,frequency,origin_h1,origin_h2,raad_type_start_date,raad_type_end_date,site_name_modern,lat,long,Freq_per_year,Proportion
0,augst 48,em,1,em,,,,augst,47.533512,7.71628,,
1,augst 49,em,1,em,,1.0,100.0,augst,47.533512,7.71628,0.010101,1.0
2,augst 55 agora f6566,em,12,em,,-50.0,400.0,augst,47.533512,7.71628,0.026667,8.0
3,augst 56,em,2,em,,300.0,450.0,augst,47.533512,7.71628,0.013333,0.0
4,augst 57,em,1,em,,400.0,500.0,augst,47.533512,7.71628,0.01,0.0
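
The values in the *'Proportion'* column are consistent with multiplying *'Freq_per_year'* by the number of years that the object's date range shares with the map range. The sketch below illustrates that reading on rows from the tables in this notebook; it is an assumption about what `propor_to_map_range` computes, and `overlap_years` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def overlap_years(start, end, map_lower, map_upper):
    """Number of years shared by [start, end] and [map_lower, map_upper]."""
    lo = np.maximum(start, map_lower)   # later of the two start dates
    hi = np.minimum(end, map_upper)     # earlier of the two end dates
    return (hi - lo).clip(lower=0)      # no overlap -> 0, missing dates -> NaN

demo = pd.DataFrame({"start": [1.0, -50.0, 300.0, 200.0, np.nan],
                     "end": [100.0, 400.0, 450.0, 400.0, np.nan],
                     "frequency": [1, 12, 2, 1, 1]})
demo["Freq_per_year"] = demo["frequency"] / (demo["end"] - demo["start"])

# Map range 0-300, as in the cell above
demo["Proportion"] = demo["Freq_per_year"] * overlap_years(demo["start"], demo["end"], 0, 300)
print(demo)
```

Rows with missing dates keep NaN in *'Proportion'*; the `> 0` filter in step 7 drops them together with the zero rows, because a comparison with NaN is False.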


### 7. Create the dataframe containing only *'Sites'* with *'RAAD_form'* which fall within the map range (if proportion == 0, do not include)

In [8]:
em = em[em['Proportion'] > 0]

em.head()

Unnamed: 0,RAAD_form,origin,frequency,origin_h1,origin_h2,raad_type_start_date,raad_type_end_date,site_name_modern,lat,long,Freq_per_year,Proportion
1,augst 49,em,1,em,,1.0,100.0,augst,47.533512,7.71628,0.010101,1.0
2,augst 55 agora f6566,em,12,em,,-50.0,400.0,augst,47.533512,7.71628,0.026667,8.0
6,dressel 24,em,100,em,,-100.0,200.0,augst,47.533512,7.71628,0.333333,66.666667
7,kapitan 1,em,7,em,,75.0,200.0,augst,47.533512,7.71628,0.056,7.0
8,kapitan 2,em,1,em,,200.0,400.0,augst,47.533512,7.71628,0.005,0.5


### 8. Calculate summed RAAD *'Frequency'* per *'Site'*
To that end, one needs to specify the variables on the basis of which the data will be grouped 

In the cell below, *'Proportion'*  values are grouped by *'site_name_modern'*, *'lat'* and *'long'*

Then the grouped *'Proportion'* values are summed

In [9]:
summed_frequency = em.groupby(['site_name_modern', 'lat', 'long'])['Proportion'].sum()        
summed_frequency = summed_frequency.reset_index()
summed_frequency = summed_frequency.rename(columns = {'Proportion':'Summed_freq'})
summed_frequency.head()

Unnamed: 0,site_name_modern,lat,long,Summed_freq
0,aislingen,48.506039,10.45559,0.857143
1,alphen aan den rijn,52.127658,4.668851,4.285714
2,alzey,49.743052,8.11403,1.333333
3,anreppen,51.73796,8.593171,34.08631
4,augsburg,48.371441,10.898255,4.666667


### 9. Count the number of unique *'Amphora types'* per *'Site'*
In the cell below, the rows are grouped by *'site_name_modern'* and then the number of unique *'RAAD_form'* values per site is calculated


In [10]:
RAAD_type_count = em.groupby('site_name_modern')['RAAD_form'].nunique()
RAAD_type_count = RAAD_type_count.reset_index()
RAAD_type_count = RAAD_type_count.rename(columns={'RAAD_form': 'RAAD_type_count'})
RAAD_type_count.head()

Unnamed: 0,site_name_modern,RAAD_type_count
0,aislingen,1
1,alphen aan den rijn,2
2,alzey,1
3,anreppen,4
4,augsburg,4


### 10. Make a dataframe containing the data required for plotting, namely: 
 - *'site_name_modern'* 
 - Coordinates of the site
 - Summed frequencies
 - Unique RAAD type count values

In [11]:
em_map = pd.merge(summed_frequency, RAAD_type_count, on = 'site_name_modern')
em_map.head()

Unnamed: 0,site_name_modern,lat,long,Summed_freq,RAAD_type_count
0,aislingen,48.506039,10.45559,0.857143,1
1,alphen aan den rijn,52.127658,4.668851,4.285714,2
2,alzey,49.743052,8.11403,1.333333,1
3,anreppen,51.73796,8.593171,34.08631,4
4,augsburg,48.371441,10.898255,4.666667,4
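
Steps 8 to 10 can also be written as a single grouping with named aggregation. The sketch below is an alternative on made-up rows, not the approach used above; it gives the same result only if every site has exactly one pair of coordinates, which is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    "site_name_modern": ["augst", "augst", "augst", "alzey"],
    "lat": [47.533512, 47.533512, 47.533512, 49.743052],
    "long": [7.71628, 7.71628, 7.71628, 8.11403],
    "RAAD_form": ["augst 49", "kapitan 1", "kapitan 1", "dressel 24"],
    "Proportion": [1.0, 7.0, 0.5, 2.0],
})

# One pass: sum the proportions and count the distinct forms per site
site_summary = (demo.groupby(["site_name_modern", "lat", "long"])
                    .agg(Summed_freq=("Proportion", "sum"),
                         RAAD_type_count=("RAAD_form", "nunique"))
                    .reset_index())
print(site_summary)
```

Keeping the two separate groupings and the merge, as done above, makes each intermediate table easier to inspect, while the single call avoids the merge step.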


### 11. Plot maps
**Note:** Remember to change the title of the plot

`color_continuous_scale = ['#1ed14b', '#d63638']` sets a colour scale from green to red; any other CSS colour codes can be used in the same way.

#### 11.1 **All *'Site'* locations which have *'origin_h1'* == em and are dated between 0-300** 
#### The **dot size** is scaled by the summed *'Frequency'* per group of unique *'Sites'*

In [12]:
fig = px.scatter_geo(em_map,
                     lat = em_map.lat,          # param for latitude coordinates
                     lon = em_map.long,          # param for longitude
                     height = 1200, 
                     #text = em_map['site_name_modern'],           # param to add labels to dots, better do not use
                     size = em_map.Summed_freq,            # param for dot size scaling (summed frequency per site)
                     scope = 'europe',                     # param for restricting a map to a specific continent
                     projection = 'mercator')              # param for geographic projection
                      

    
# centre map by lat and long of a country
# set up a 'projection_scale' to zoom into the country 

fig.update_geos(projection_scale = 6, center_lat = 41.8719, center_lon = 12.5674)

# title of the map, its position (title_x) and font size can be set up
fig.update_layout(title_text = 'Summed Frequency per Site for EM in Date Range 0–300', 
                  title_x = 0.5, 
                  title_font_size = 20)
   

fig.write_image('fig1.pdf') # save the plot (requires kaleido; the format follows the file extension, e.g. .pdf, .png)
fig.show()

#### The **dot colour** is scaled by the summed *'Frequency'* per group of unique *'Sites'*

In [13]:
fig = px.scatter_geo(em_map,
                     lat = em_map.lat,
                     lon = em_map.long,
                     height = 1200,
                     color = em_map.Summed_freq,
                   #  text = em_map['site_name_modern'],                        # param to add labels to dots  
                     scope = 'europe',  
                     color_continuous_scale = ['#1ed14b',  '#d63638'],  # param for colourbar palette
                     projection = 'mercator')         


fig.update_geos(projection_scale = 6, center_lat = 41.8719, center_lon = 12.5674)


fig.update_layout(title_text = 'Summed Frequency per Site for EM in Date Range 0–300', 
                  title_x = 0.43, 
                  title_font_size = 20, 
                  coloraxis_colorbar = dict(len = 0.80, y = 0.60, xanchor = 'center', xpad = 192, title = ' '))

fig.write_image('fig2.pdf')
fig.show()


#### 11.2 **Show *'Sites'* at which there is evidence of *'Amphora types'* in a given *'Provenance'***


In [14]:
fig = px.scatter_geo(em_map,
                     lat = em_map.lat,
                     lon = em_map.long,
                     height = 1200, 
                     #text = em_map['site_name_modern'],           
                     scope = 'europe',                    
                     projection = 'mercator')      
         
        
fig.update_geos(projection_scale = 6, center_lat = 41.8719, center_lon = 12.5674)

fig.update_layout(title_text = 'Sites for EM in Date Range 0–300', 
                  title_x = 0.5, 
                  title_font_size = 20)    

fig.write_image('fig3.pdf')
fig.show()


#### 11.3 **Scale the dot size/colour by (unique) *'Amphora types'* per *'Site'***

In [15]:
fig = px.scatter_geo(em_map,
                     lat = em_map.lat,
                     lon = em_map.long,
                     height = 1200, 
                     size = em_map.RAAD_type_count,      
                     #text = em_map['site_name_modern'],              
                     scope = 'europe',                           
                     projection = 'mercator')
                                       

fig.update_geos(projection_scale = 6, center_lat = 41.8719, center_lon = 12.5674)

fig.update_layout(title_text = 'RAAD Type Count per Site for EM in Date Range 0–300', 
                  title_x = 0.5,
                  title_font_size = 20)    

fig.write_image('fig4.pdf')
fig.show()

In [16]:
fig = px.scatter_geo(em_map,
                     lat = em_map.lat,
                     lon = em_map.long,
                     height = 1200,
                     color = em_map.RAAD_type_count,
                    # text = em_map['site_name_modern'],               
                     scope = 'europe',  
                     color_continuous_scale = ['#1ed14b',  '#d63638'],  
                     projection = 'mercator')         
 

fig.update_geos(projection_scale = 6, center_lat = 41.8719, center_lon = 12.5674)
fig.update_layout(title_text = 'RAAD Type Count per Site for EM in Date Range 0–300', 
                  title_x = 0.5, 
                  title_font_size = 20, 
                  coloraxis_colorbar=dict(len = 0.80, y = 0.60, xanchor = 'center', xpad = 192, title = ' '))

fig.write_image('fig5.pdf')
fig.show()