### MEAP DATA LAB: Collection Data from Latin America

**QUESTION 1**

*Are the geographic data points at the same scale? (i.e., do all Latin American collections identify their objects at the city level, or are some only at the country level?) A related next-step question: what consistent scale do we want so that we can visualize effectively, maybe the city or state level? Something below the country level, if it's interesting.*

- MMDH (geography in Spanish -> translated to English)
    - Place of origin: varied, down to city level, some missing
    - Subject geography: all Chile

- CCP
    - Place of origin: city level, lots of data missing
    - Subject geography: all city level in Peru

- Memoria Abierta
    - Place of origin: city level, mostly filled
    - Subject geography: Country level

- Ibicaba 
    - Place of origin: N/A (likely because all data come from Ibicaba Farm?)
    - Subject geography: Country level, all in Brazil

- Obidos
    - Place of origin: no data
    - Subject geography: Town/city level (all Obidos, Brazil)

- Centro Cultural Tallersol
    - Place of origin: N/A
    - Subject geography: Varied (Building/coordinate, city, and country level)

|  | MMDH | CCP | Memoria Abierta | Ibicaba | Obidos | Centro Cultural Tallersol |
| --- | --- | --- | --- | --- | --- | --- |
| Place of Origin | Varied, down to city level, some missing | City level, lots of data missing | City level, mostly filled | No data | No data | No data |
| Subject Geography | All Chile | All city level in Peru | Country level | Country level, all in Brazil | Town/city level (all Obidos, Brazil) | Varied (Building/coordinate, city, and country level) |
| Notes | All in Spanish | N/A | N/A | No place of origin data since all from Ibicaba Farm? | N/A | N/A |

Questions after Research/Data Cleaning: 

- Could we visualize the geographies that are present, even if there is a lot of data missing? Would it be helpful to look at subject geography across all datasets, even if some are concentrated only in one area? 

**QUESTION 2**

*Can we normalize the dates or temporal information? Are they identifying the same decades? Same years?*

Below are the corresponding column names for dates/temporal information for each dataset: 

- MMDH 
    - Date.created (e.g. 1973-1990)
    - Date.normalized (e.g. 1973/1990)

- CCP
    - Date.created (e.g. 1973-1990)
    - Date.normalized (e.g. 1973/1990)

- Memoria Abierta
    - Date.created (e.g. November - December 1992)
    - Date.normalized (e.g. 1992-11-01/1992-12-31)

- Ibicaba 
    - Date.created
    - Date.normalized

- Obidos
    - Date.created
    - Date.normalized
    - Subject temporal 

- Centro Cultural Tallersol
    - Date.created
    - Date.normalized
    - Subject temporal 

Questions after Research: 
- What is the best way to format and normalize date/time? The following cases occur:
    - One year (easy)
    - Specific date (different formats; which is easiest to feed into Python?)
    - Range or list of 2 years (e.g. 1970-1990, 1970/1990)
    - Range of dates with only months (e.g. June 1970 - Oct 1990)
    - Range of dates with months and days (see specific date)
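
One way to handle the cases above is to reduce every value to a start/end pair of ISO dates, which mirrors the `Date.normalized` examples (e.g. `1992-11-01/1992-12-31`). The sketch below is only an assumption about what such a parser could look like: `normalize_date` and `_parse_endpoint` are hypothetical helper names, only English month names are recognized, and it has not been run against the actual collections.

```python
import calendar
import re
from datetime import date

# English month names plus three-letter abbreviations (assumption: dates are in English).
_MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
                "august", "september", "october", "november", "december"]
MONTHS = {name: i + 1 for i, name in enumerate(_MONTH_NAMES)}
MONTHS.update({name[:3]: i + 1 for i, name in enumerate(_MONTH_NAMES)})


def _parse_endpoint(text, default_year=None, is_end=False):
    """Parse 'YYYY-MM-DD', 'YYYY', 'Month YYYY', or a bare 'Month' into a date."""
    text = text.strip()
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m:
        return date(int(m[1]), int(m[2]), int(m[3]))
    m = re.fullmatch(r"\d{4}", text)
    if m:
        year = int(text)
        # A bare year expands to its first or last day.
        return date(year, 12, 31) if is_end else date(year, 1, 1)
    m = re.fullmatch(r"([A-Za-z]+)\.?(?:\s+(\d{4}))?", text)
    if m and m[1].lower() in MONTHS:
        year = int(m[2]) if m[2] else default_year
        if year is None:
            raise ValueError(f"No year available for {text!r}")
        month = MONTHS[m[1].lower()]
        # A month expands to its first or last day.
        day = calendar.monthrange(year, month)[1] if is_end else 1
        return date(year, month, day)
    raise ValueError(f"Unrecognized date: {text!r}")


def normalize_date(text):
    """Return (start, end) ISO strings for a single date or a date range."""
    text = text.strip()
    if "/" in text:                                   # 1973/1990, 1992-11-01/1992-12-31
        parts = text.split("/", 1)
    elif re.fullmatch(r"\d{4}\s*-\s*\d{4}", text):    # 1973-1990
        parts = re.split(r"\s*-\s*", text)
    elif " - " in text:                               # June 1970 - Oct 1990
        parts = text.split(" - ", 1)
    else:                                             # single year or single date
        parts = [text, text]
    # Parse the end first so a start like 'November' can borrow its year.
    end = _parse_endpoint(parts[1], is_end=True)
    start = _parse_endpoint(parts[0], default_year=end.year)
    return start.isoformat(), end.isoformat()


print(normalize_date("November - December 1992"))  # ('1992-11-01', '1992-12-31')
print(normalize_date("1973-1990"))                 # ('1973-01-01', '1990-12-31')
```

Keeping start and end as separate ISO strings makes each case sortable and easy to hand to `pd.to_datetime` later. Values the sketch does not recognize raise a `ValueError`, so unusual formats surface instead of being silently guessed.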

**QUESTION 3**

*I would really like to identify key subject headings or themes that cut across these collections, and potentially create a visualization that lets users explore all of the collections through overlapping themes. So: what are the most dominant subject terms? What are the closely related terms? Can we group some themes together into some kind of node graph?*
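
A possible first step toward a node graph is to count how often each subject term appears and how often two terms are attached to the same object: terms become nodes, and co-occurring pairs become weighted edges. The sketch below assumes each object's subject headings are already available as a list of strings. `term_cooccurrence` is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def term_cooccurrence(records):
    """Count term frequency and pairwise co-occurrence.

    records: iterable of lists of subject terms, one list per object.
    Returns (term_counts, pair_counts) as Counters.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in records:
        # Normalize case/whitespace and drop duplicates within one object.
        unique_terms = sorted({t.strip().lower() for t in terms if t.strip()})
        term_counts.update(unique_terms)
        # Each unordered pair of terms on the same object is one edge hit.
        pair_counts.update(combinations(unique_terms, 2))
    return term_counts, pair_counts


# Made-up records, not taken from any of the collections.
sample_records = [
    ["Human rights", "Dictatorship"],
    ["Human rights", "Exile"],
    ["Dictatorship", "Human rights", "Exile"],
]
term_counts, pair_counts = term_cooccurrence(sample_records)

print(term_counts.most_common(1))  # [('human rights', 3)]
print(pair_counts[("dictatorship", "human rights")])  # 2
```

The pair counts can serve as edge weights in whatever graph tool ends up being used. Terms in different languages would need to be translated or mapped to a shared vocabulary first, since this sketch only matches identical strings.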

Next Steps:

- Schedule a meeting: https://calendly.com/rdeblinger/meap?month=2023-05

- Publisher location
- Subject geographic: what the collection is about
- Place of origin: where the collection was created
- What are the interesting questions concerning geography?
    - How does geography vary within a collection?
    - Are there connections between place of origin and subject geographic?




#### Back to Question 1: Looking at Geography Data

#### *MMDH Dataset*

In [43]:
import pandas as pd
import numpy as np
import plotly.express as px

In [22]:
mmdh_geo = pd.read_csv('mmdh_geo.csv')

In [20]:
mmdh_geo = mmdh_geo.dropna(how='all')

In [8]:
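# One row per distinct place-of-origin country. Note that 'unique values' is 1 in
# every row, because each group contains exactly one country.
unique_mmdh_geos = (mmdh_geo.groupby('Place of origin - Country')
                    .agg(**{'unique values': ('Place of origin - Country', 'nunique')})
                    .reset_index())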

In [21]:
fig = px.bar(mmdh_geo, x=mmdh_geo['Place of origin - Country'].value_counts(), y=mmdh_geo['Place of origin - Country'].value_counts().index, title='Place of Origin - Country Level - for MMDH Collection Objects', orientation='h')
fig.show()

The subject geographic for MMDH is all Chile, meaning that all artifacts in the MMDH collection are about Chile. This corresponds with the place of origin, as most materials also seem to be from Chile, with some variation from other countries.

See Tableau to visualize country spread in map format. 

Next questions: 
- Since most places of origin for materials are in Chile, let's zoom in to the city level if possible. The MMDH data mostly goes down to the city level, so which cities are the artifacts most commonly from?
- What about non-Chile objects?
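
Both questions come down to filtering on the country column before counting. The sketch below uses made-up rows in place of `mmdh_geo`; `'Place of origin - Country'` is the column used above, while `'Place of origin - City'` is an assumed name for the city-level column and may differ in the real file.

```python
import pandas as pd

# Made-up rows standing in for mmdh_geo (values are illustrative only).
# 'Place of origin - City' is an assumed column name.
sample = pd.DataFrame({
    'Place of origin - Country': ['Chile', 'Chile', 'Chile', 'France', None],
    'Place of origin - City': ['Santiago', 'Santiago', 'Valparaíso', 'Paris', None],
})

# True where the object's place of origin is Chile (missing values count as False).
is_chile = sample['Place of origin - Country'].eq('Chile')

# City-level counts within Chile.
chile_city_counts = sample.loc[is_chile, 'Place of origin - City'].value_counts()

# Country-level counts for objects with a known origin outside Chile.
known_country = sample['Place of origin - Country'].notna()
non_chile_counts = (sample.loc[known_country & ~is_chile, 'Place of origin - Country']
                    .value_counts())

print(chile_city_counts)
print(non_chile_counts)
```

Separating missing countries from non-Chile countries keeps objects with no place of origin from being counted as "outside Chile".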

#### *Memoria Abierta Dataset*

In [23]:
memab_geo = pd.read_csv('memoria_abierta_ geo_only.csv')

In [24]:
fig = px.bar(memab_geo, x=memab_geo['Place of origin combined'].value_counts(), y=memab_geo['Place of origin combined'].value_counts().index, title='Place of Origin - City Level - for Memoria Abierta', orientation='h')
fig.show() 

Memoria Abierta place of origin data is quite interesting considering the collection is about human rights organizations in Argentina! Now let's look at subject geographic.

In [25]:
fig = px.bar(memab_geo, x=memab_geo['Subject geographic'].value_counts(), y=memab_geo['Subject geographic'].value_counts().index, title='Subject Geographic - Country Level - for Memoria Abierta', orientation='h')
fig.show() 

Even more surprising results here. Argentina takes a back seat among the geographies the collections describe, but the geographic spread does mirror the spread of places of origin. It would be interesting to go into the dataset to see where Argentina fits in, or whether it was purposefully left out.
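
A quick way to follow up is to flag every row whose subject geographic mentions Argentina and look at the share. The sketch below uses made-up values in a `'Subject geographic'` column rather than the real `memab_geo` data, and it assumes a plain substring match is good enough to catch combined entries.

```python
import pandas as pd

# Made-up values standing in for memab_geo['Subject geographic'].
sample = pd.DataFrame({
    'Subject geographic': ['Chile', 'Argentina', 'Uruguay; Argentina', None],
})

# True where the field mentions Argentina; missing values count as False.
mentions_argentina = sample['Subject geographic'].str.contains(
    'Argentina', case=False, na=False)

# Share of rows that mention Argentina at all.
argentina_share = mentions_argentina.mean()

# Rows to inspect by hand.
argentina_rows = sample[mentions_argentina]
print(argentina_share)
print(argentina_rows)
```

If the share is near zero in the real data, that would suggest Argentina was treated as implicit and left out of the field rather than being genuinely absent from the collection.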

#### *CCP Dataset*

In [39]:
ccp_geo = pd.read_csv('CCP_geo_only.csv')

In [40]:
ccp_geo = ccp_geo.stack().reset_index().rename(columns={0:'subject geo'})

In [46]:
fig = px.bar(ccp_geo, x=ccp_geo['subject geo'].value_counts(), y=ccp_geo['subject geo'].value_counts().index, title='Subject Geographic - City and Country Level Combined - for CCP', orientation='h')
fig.show() 

Let's clean up the graph a bit. What if I looked at the country level (combining cities/towns in Peru)? 

In [44]:
ccp_geo['subject country'] = np.where(ccp_geo['subject geo'].str.contains('Peru'), 'Peru', ccp_geo['subject geo'])

In [45]:
fig = px.bar(ccp_geo, x=ccp_geo['subject country'].value_counts(), y=ccp_geo['subject country'].value_counts().index, title='Subject Geographic - Country Level - for CCP', orientation='h')
fig.show() 

Now let's exclude Peru and see what happens.

In [49]:
# Following code will subset data to exclude entries with Peru
ccp_no_peru = ccp_geo[ccp_geo['subject country'] != 'Peru']

In [51]:
fig = px.bar(ccp_no_peru, x=ccp_no_peru['subject country'].value_counts(), y=ccp_no_peru['subject country'].value_counts().index, title='Subject Geographic - Country Level - for CCP (excluding Peru)', orientation='h')
fig.show() 

Now let's look at the distribution of collections with relation to cities/towns in Peru itself. 

In [52]:
# Following code will subset data to include entries only w/ city/town level geo in Peru
ccp_cities = ccp_geo[ccp_geo['subject geo'].str.contains(', Peru')]

In [54]:
fig = px.bar(ccp_cities, x=ccp_cities['subject geo'].value_counts(), y=ccp_cities['subject geo'].value_counts().index, title='Subject Geographic - City Level in Peru - CCP', orientation='h')
fig.show() 

An overwhelming number of objects relate to Lima, Peru.

#### *Ibicaba Farm Records*

No interesting geographical data from Ibicaba Farm Records, as all objects come from and pertain to the Ibicaba Farm in Brazil.

#### *Obidos Court Records*

No interesting geographical data from Obidos Court Records either, as all objects come from and pertain to the Court of Justice in the city of Obidos, Brazil.

#### *Centro Cultural Tallersol (CCT)*

The Centro Cultural Tallersol has interesting geographical data for 'Subject Geographic' that can be broken down at the place, city, and country levels. Let's see the distribution.

In [60]:
cct_geo = pd.read_csv('cct_geo.csv')

In [61]:
fig = px.bar(cct_geo, x=cct_geo['Country'].value_counts(), y=cct_geo['Country'].value_counts().index, title='Subject Geographic - Country Level - Centro Cultural Tallersol', orientation='h')
fig.show() 

Not super interesting. As expected, most data pertains to Chile. Let's zoom in on the city and place levels.

In [62]:
fig = px.bar(cct_geo, x=cct_geo['City'].value_counts(), y=cct_geo['City'].value_counts().index, title='Subject Geographic - City Level - Centro Cultural Tallersol', orientation='h')
fig.show() 

As expected, most cities are in Chile as well. Note that Cartagena refers to the city in Chile, and San Juan refers to the city in Puerto Rico.

What about place? I expect lots of variation that will need to be cleaned.

In [64]:
fig = px.bar(cct_geo, x=cct_geo['Subject geographic - place'].value_counts(), y=cct_geo['Subject geographic - place'].value_counts().index, title='Subject Geographic - Place Level - Centro Cultural Tallersol', orientation='h')
fig.show() 

Actually, it seems like most places do just show up once, with the exception of Casona San Isidro (x2), Cafe del Cerro (x4), and Teatro Cariola (x6). 

I'm now going to move to Tableau and map the cities, since that seems to be the most interesting source of data! 