# Variable frequency comparison

Imagine now that you are interested in the relationship between several of Seshat's data variables, and want to see which appear to be correlated in the number of polities that have them recorded as "Present".

For example, consider the relationship between a transport infrastructure variable such as "Road" and a profession variable like "Professional soldier". Your hypothesis could be that polities with professional soldiers need roads for them to move on, so there should be a strong correlation between the number of polities with soldiers and the number of polities with roads in any given year.

We could compare this to some other variables that we believe are less likely to be correlated.

In [1]:
from seshat_api import SeshatAPI, get_frequencies
import matplotlib.pyplot as plt

In [2]:
client = SeshatAPI(base_url="https://seshatdata.com/api")
# client = SeshatAPI(base_url="https://seshat-db.com/api")

In [3]:
# Set the range of years to consider
years = range(-1000, 2024)

In [4]:
# Create a list of variables to consider
variables = ['road',
             'professional_soldier',
             'philosophy',
             'copper',
             'elephant'
             ]

In [5]:
# The get_frequencies function returns a pandas DataFrame with the number of polities that have each variable in each year
example_frequency_df = get_frequencies(client, variables, years)

In [None]:
# Let's take a look at 5 random years
example_frequency_df.sample(5)

## Plotting the data

Let's first plot all the variables to see how many polities have each one recorded as "Present" across the selected year range:

In [None]:
plt.figure(figsize=(13, 7))
plt.plot(example_frequency_df.index, example_frequency_df['professional_soldier'], label='Professional Soldiers')
plt.plot(example_frequency_df.index, example_frequency_df['road'], label='Roads')
plt.plot(example_frequency_df.index, example_frequency_df['philosophy'], label='Philosophy')
plt.plot(example_frequency_df.index, example_frequency_df['copper'], label='Copper')
plt.plot(example_frequency_df.index, example_frequency_df['elephant'], label='Elephants')
plt.xlabel('Year')
plt.ylabel('Number of Polities')
plt.title('Number of Polities with selected variables "Present" over time')
plt.legend()
plt.show()

## Correlation?

Now let's explore the relationship between two specific variables.

First, let's test our hypothesis that the number of polities with roads is correlated with the number of polities with professional soldiers.

We'll create a scatter plot where each point represents a particular year, with the number of polities that have each variable recorded as "Present" on the x and y axes:

In [None]:
plt.figure(figsize=(13, 7))
scatter = plt.scatter(
    example_frequency_df['professional_soldier'], 
    example_frequency_df['road'], 
    c=example_frequency_df.index,
    cmap='viridis',
)
plt.xlabel('Number of polities with Professional Soldiers')
plt.ylabel('Number of polities with Roads')
plt.title('Polities recorded as having Professional Soldiers vs Roads: 1000 BCE - 2024 CE')
plt.colorbar(scatter, label='Year')
plt.show()
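
A scatter plot gives a visual impression, but we may also want a single number for the strength of the linear relationship. The sketch below uses the pandas `Series.corr` method, which computes the Pearson correlation coefficient by default. The helper name `pearson_between` and the small `toy_df` are hypothetical and only for illustration: the toy counts are made up and stand in for `example_frequency_df`, which we assume has one row per year and one column per variable.

```python
import pandas as pd


def pearson_between(frequency_df: pd.DataFrame, x: str, y: str) -> float:
    """Return the Pearson correlation coefficient between two columns."""
    return float(frequency_df[x].corr(frequency_df[y]))


# Toy stand-in for example_frequency_df (made-up counts, indexed by year)
toy_df = pd.DataFrame(
    {
        'professional_soldier': [1, 2, 3, 4, 5],
        'road': [2, 4, 6, 8, 10],
    },
    index=[-1000, -500, 0, 500, 1000],
)

r = pearson_between(toy_df, 'professional_soldier', 'road')
print(f"Pearson r = {r:.2f}")
```

With the real data, the call would be `pearson_between(example_frequency_df, 'professional_soldier', 'road')`. A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship.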

## Correlation?

Now let's make the same plot, but comparing two of the variables we think are unlikely to correlate:

In [None]:
plt.figure(figsize=(13, 7))
scatter = plt.scatter(
    example_frequency_df['copper'], 
    example_frequency_df['philosophy'], 
    c=example_frequency_df.index,
    cmap='viridis',
)
plt.xlabel('Number of polities with Copper')
plt.ylabel('Number of polities with Philosophy')
plt.title('Polities recorded as having Copper vs Philosophy: 1000 BCE - 2024 CE')
plt.colorbar(scatter, label='Year')
plt.show()
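
To compare every pair of variables at once, one option is the pandas `DataFrame.corr` method, which returns a matrix of pairwise Pearson coefficients. One caveat worth keeping in mind: counts that all grow over time can look correlated simply because they share an upward trend. As a rough check, the sketch below also correlates the year-on-year changes obtained with `DataFrame.diff`. The helper `correlation_tables` and the made-up `toy_df` are hypothetical stand-ins for illustration, not part of `seshat_api`.

```python
import pandas as pd


def correlation_tables(frequency_df: pd.DataFrame):
    """Return Pearson correlation matrices for raw counts and for row-to-row changes."""
    raw = frequency_df.corr()
    # diff() gives the change from one row to the next; the first row is NaN, so drop it
    changes = frequency_df.diff().dropna().corr()
    return raw, changes


# Toy stand-in for example_frequency_df (made-up counts, indexed by year)
toy_df = pd.DataFrame(
    {
        'copper': [1, 3, 4, 6, 7, 9],
        'philosophy': [2, 3, 5, 6, 8, 9],
    },
    index=range(-1000, -994),
)

raw_corr, change_corr = correlation_tables(toy_df)
print(raw_corr)
print(change_corr)
```

In this toy example both columns rise steadily, so the raw counts are strongly positively correlated, yet their year-on-year changes move in opposite directions. With the real data, `correlation_tables(example_frequency_df)` would give the same two views for all five variables.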