# Variable frequency comparison

Imagine now that you are interested in the relationship between several of Seshat's data variables, to see which appear to be correlated in the number of polities that have them recorded as "Present".

For example, consider the relationship between a transport infrastructure variable such as "Road" and a profession variable such as "Professional soldier". Your hypothesis could be that polities with professional soldiers need roads for them to move on, so the number of polities with soldiers and the number of polities with roads in any given year should be strongly correlated.

We can then compare this pair with other variables that we believe are less likely to be correlated.

*In this notebook I decided to contribute some of the data transformation functions I started writing here back into the package itself, for easy re-use. Other users of the API might prefer to maintain their own set of data wrangling/transformation functions in a separate repository.*

In [1]:
from seshat_api import SeshatAPI, get_frequencies, get_variable_classes
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

In [2]:
client = SeshatAPI(base_url="https://seshatdata.com/api")
# client = SeshatAPI(base_url="https://seshat-db.com/api")

In [3]:
# Set the range of years to consider
years = range(-1000, 1900)

In [None]:
# Take a look at the available variables
get_variable_classes()

In [5]:
class_names = ['Roads',
               'ProfessionalSoldiers',
               'Philosophies',
               'Coppers',
               'Elephants']

In [7]:
# The get_frequencies function returns a pandas DataFrame with the number of polities that have each variable in each year
example_frequency_df = get_frequencies(client, class_names, years)

In [None]:
# Let's take a look at 5 random years
example_frequency_df.sample(5)

## Plotting the data

Let's first plot all the variables to see the frequency of "Present" being recorded across the selected year range:

In [None]:
plt.figure(figsize=(13, 7))
plt.plot(example_frequency_df.index, example_frequency_df['professional_soldier'], label='Professional Soldiers')
plt.plot(example_frequency_df.index, example_frequency_df['road'], label='Roads')
plt.plot(example_frequency_df.index, example_frequency_df['philosophy'], label='Philosophy')
plt.plot(example_frequency_df.index, example_frequency_df['copper'], label='Copper')
plt.plot(example_frequency_df.index, example_frequency_df['elephant'], label='Elephants')
plt.xlabel('Year')
plt.ylabel('Number of Polities')
plt.title('Number of Polities with selected variables "Present" over time')
plt.legend()
plt.show()

## Soldiers and Roads

Now let's explore the relationship between two specific variables.

First, let's test our hypothesis that the number of polities with roads is correlated with the number of polities with professional soldiers.

We'll create a scatter plot where each point represents a particular year, and the number of polities having each variable (recorded as "Present") on the x and y axes:

In [None]:
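def frequency_scatter(df, x_var, y_var):
    """Scatter the yearly polity counts of two variables, coloured by year."""
    # Pearson correlation between the two yearly count series
    corr_coef, _ = pearsonr(df[x_var], df[y_var])
    plt.figure(figsize=(13, 7))
    scatter = plt.scatter(
        df[x_var],
        df[y_var],
        c=df.index,  # colour each point by its year
        cmap='viridis',
    )
    plt.xlabel(f'Number of polities with "{x_var}" present')
    plt.ylabel(f'Number of polities with "{y_var}" present')
    # Take the year range from the data so the title always matches `years` (negative years are BCE)
    plt.title(f'Polities recorded as having {x_var} vs {y_var}: years {df.index.min()} to {df.index.max()}')
    plt.colorbar(scatter, label='Year')
    plt.text(0.05, 0.95, f'Correlation Coefficient: {corr_coef:.2f}', transform=plt.gca().transAxes, fontsize=12, verticalalignment='top')
    plt.show()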

frequency_scatter(example_frequency_df, 'professional_soldier', 'road')
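
The coefficient printed on the plot is Pearson's r, which `pearsonr` computes from the two yearly count series. As a rough illustration of what that number measures, here is a minimal hand-written sketch of the same formula. The helper name `pearson_r` and the yearly counts are made up for this example and are not Seshat data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up yearly counts, for illustration only (not Seshat data)
soldiers = [2, 3, 5, 8, 9]
roads = [1, 2, 4, 7, 9]
r = pearson_r(soldiers, roads)
print(f"r = {r:.2f}")
```

A value near 1 means the two counts rise and fall together across years, a value near -1 means one rises as the other falls, and a value near 0 means there is no linear relationship.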

## Other variable combos

Now let's make the same plot for other pairs of the variables we pulled from the API:

In [None]:
frequency_scatter(example_frequency_df, 'copper', 'philosophy')

In [None]:
frequency_scatter(example_frequency_df, 'elephant', 'road')
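
Rather than plotting one pair at a time, it may also help to look at every pairwise coefficient in a single table. The sketch below uses pandas' `DataFrame.corr` on a synthetic stand-in frame, so it runs without API access. The name `stand_in_df` and all of its values are assumptions made for illustration; only the layout (years as the index, one column per variable) mirrors how `example_frequency_df` is used above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for example_frequency_df (NOT Seshat data):
# years as the index, one column of counts per variable.
rng = np.random.default_rng(0)
years = range(-1000, 1900)
trend = np.linspace(0, 50, len(years))
stand_in_df = pd.DataFrame(
    {
        # Two columns that share an upward trend, plus independent noise
        'professional_soldier': trend + rng.normal(0, 3, len(years)),
        'road': trend + rng.normal(0, 3, len(years)),
        # One column with no trend at all
        'elephant': rng.normal(10, 3, len(years)),
    },
    index=years,
)

# Pearson correlation for every pair of columns at once
corr_matrix = stand_in_df.corr(method='pearson')
print(corr_matrix.round(2))
```

With the real data, the equivalent call would presumably be `example_frequency_df.corr()`, given that `get_frequencies` returns a pandas DataFrame.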