# Archaeological Data Analysis: lab module 1

### Author:  ____________

# Exploring a data set

In this notebook, you'll download a data set derived from the openly licensed content of the [Online Coins of the Roman Empire](http://numismatics.org/ocre/) (OCRE). The original data set is available from <http://nomisma.org/> in RDF XML format. We'll work with a version formatted as a delimited-text file, using `#` as the column delimiter, with a header line labelling each column.

As with any data set, our first task is to figure out what kinds of data it contains and what range of values each category of data takes. We'll examine the contents of several columns of data.




## Download delimited-text data

We'll make the standard Scala `Source` object available by `import`ing it, then use it to retrieve the content of a URL.

In [1]:
import scala.io.Source
val beazley = "https://raw.githubusercontent.com/neelsmith/clas299/master/vases-lastyear.csv"

import scala.io.Source

beazley: String = "https://raw.githubusercontent.com/neelsmith/clas299/master/vases-lastyear.csv"

We'll extract a sequence of lines from the URL source, and convert them to our favorite type of Scala collection, a `Vector`.

(The following cell downloads the data:  depending on your internet connection, this might take a moment.)

In [2]:
// getLines() yields an Iterator over the lines; toVector reads them all into memory.
val lines = Source.fromURL(beazley).getLines().toVector

lines: Vector[String] = Vector(
  "Painter,Number,Shape,Lat,Lon",
  "Achilles Painter,201,white lekythos,35.037245,32.436032",
  "Achilles Painter,202,white lekythos,37.9728191495,23.7237438519",
  "Achilles Painter,205,white lekythos,37.9728191495,23.7237438519",
  "Achilles Painter,1(1),Amphora,42.419009,11.6298975",
  "Achilles Painter,4(4),Neck-Amphora,42.419009,11.6298975",
  "Achilles Painter,5(5),Neck-Amphora,42.419009,11.6298975",
  "Achilles Painter,7(6),Neck-Amphora,40.926823,14.524545",
  "Achilles Painter,9(8),Neck-Amphora,40.926823,14.524545",
  "Achilles Painter,11(14),Neck-Amphora,40.926823,14.524545",
  "Achilles Painter,12(10),Neck-Amphora,40.926823,14.524545",
  "Achilles Painter,15(13),Neck-Amphora,40.926823,14.524545",
  "Achilles Painter,17(15),Neck-Amphora,40.926823,14.524545",
  "Achilles Pa

## Examine header line

To start with, let's see what the first line looks like, and compare it with the first data line.

In [3]:
lines.head // same as lines(0)

res2: String = "Painter,Number,Shape,Lat,Lon"

In [6]:
lines(1).split(",").toVector

res5: Vector[String] = Vector(
  "Achilles Painter",
  "201",
  "white lekythos",
  "35.037245",
  "32.436032"
)
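
To make the comparison concrete, the header labels can be paired with the values of the first record. This is a minimal sketch that hard-codes the two lines shown in the output above rather than downloading them; the names `headerLine`, `firstRecord` and `labelled` are illustrative and not part of the notebook.

```scala
// The header line and the first data line, copied from the output above.
val headerLine = "Painter,Number,Shape,Lat,Lon"
val firstRecord = "Achilles Painter,201,white lekythos,35.037245,32.436032"

// zip pairs the i-th label with the i-th value.
val labelled = headerLine.split(",").toVector.zip(firstRecord.split(",").toVector)

// Print one "label: value" line per column.
labelled.foreach { case (label, value) => println(s"$label: $value") }
```

Reading the labels next to the values makes it easier to check that each column holds the kind of data its header promises.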

## Split data strings into columns

Every line is a `String`.  If we break it up using the `split` method, we get an `Array` of `String`s, which we'll convert to a `Vector` of `String`s.  The end result will be that from a Vector of Strings, we create a Vector of Vectors of Strings.  Notice that Scala identifies the class of the new `data` expression as  `Vector[Vector[String]]`.
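
One caveat worth knowing: `split` with a single argument silently drops trailing empty strings, so a record whose last columns are empty would yield fewer columns than the header. The notebook does not check whether this data set contains such records. The sketch below uses a made-up line (`incomplete`) to show the behaviour, along with the two-argument form `split(",", -1)` that keeps empty columns.

```scala
// Hypothetical record with empty Lat and Lon columns (not taken from the data set).
val incomplete = "Achilles Painter,201,white lekythos,,"

// One-argument split discards trailing empty strings.
val dropped = incomplete.split(",").toVector
// A negative limit keeps them, so every record has the same number of columns.
val kept = incomplete.split(",", -1).toVector

println(dropped.size)
println(kept.size)
```

If the two sizes differ for any line of real data, indexing a late column such as `(4)` on the shorter Vector would throw an exception.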
 

In [4]:
val data = lines.tail.map(ln => ln.split(",").toVector)

data: Vector[Vector[String]] = Vector(
  Vector("Achilles Painter", "201", "white lekythos", "35.037245", "32.436032"),
  Vector(
    "Achilles Painter",
    "202",
    "white lekythos",
    "37.9728191495",
    "23.7237438519"
  ),
  Vector(
    "Achilles Painter",
    "205",
    "white lekythos",
    "37.9728191495",
    "23.7237438519"
  ),
  Vector("Achilles Painter", "1(1)", "Amphora", "42.419009", "11.6298975"),
  Vector("Achilles Painter", "4(4)", "Neck-Amphora", "42.419009", "11.6298975"),
  Vector("Achilles Painter", "5(5)", "Neck-Amphora", "42.419009", "11.6298975"),

In [8]:
// Columns 3 and 4 are Lat and Lon, so each key reads "lat:lon" despite the variable's name.
val lonlat = data.map( props => props(3) + ":"+  props(4))

lonlat: Vector[String] = Vector(
  "35.037245:32.436032",
  "37.9728191495:23.7237438519",
  "37.9728191495:23.7237438519",
  "42.419009:11.6298975",
  "42.419009:11.6298975",
  "42.419009:11.6298975",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.9900604421:14.3986099472",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "40.926823:14.524545",
  "44.697575:12.108186",
  "40.9900604421:14.3986099472",
  "44.4945737:11.3455467",
  "42.419009:11.6298975",
  "42.419009:11.6298975",
  "42.41

In [9]:
lonlat.size
data.size

res8_0: Int = 101
res8_1: Int = 101

In [12]:
// Pair each "lat:lon" key with the full record it was built from.
val zipped=lonlat zip data

zipped: Vector[(String, Vector[String])] = Vector(
  (
    "35.037245:32.436032",
    Vector(
      "Achilles Painter",
      "201",
      "white lekythos",
      "35.037245",
      "32.436032"
    )
  ),
  (
    "37.9728191495:23.7237438519",
    Vector(
      "Achilles Painter",
      "202",
      "white lekythos",
      "37.9728191495",
      "23.7237438519"
    )
  ),
  (
    "37.9728191495:23.7237438519",
    Vector(
      "Achilles Painter",
      "205",
      "white lekythos",
      "37.9728191495",
      "23.7237438519"
    )
  ),
  (
    "42.419009:11.6298975",
    Vector("Achilles Painter", "1(1)", "Amphora", "42.419009", "11.6298975")
  ),

In [16]:
// Group the (key, record) pairs by key: one Map entry per distinct coordinate pair.
val grouped = zipped.groupBy{case (ll, _) => ll}

grouped: Map[String, Vector[(String, Vector[String])]] = Map(
  "44.697575:12.108186" -> Vector(
    (
      "44.697575:12.108186",
      Vector("Achilles Painter", "53", "Krater", "44.697575", "12.108186")
    ),
    (
      "44.697575:12.108186",
      Vector(
        "Tyszkiewicz Painter",
        "10",
        "Column-Kraters",
        "44.697575",
        "12.108186"
      )
    ),
    (
      "44.697575:12.108186",
      Vector(
        "Tyszkiewicz Painter",
        "11",
        "Column-Kraters",
        "44.697575",
        "12.108186"
      )
    ),
    (
      "44.697575:12.108186",
      Vector(
        "Tyszkiewicz Painter",
        "

In [18]:
// Replace each group with its size, giving (key, count) pairs.
val groupWithCount = grouped.toVector.map {case (key, records) => (key, records.size)}

groupWithCount: Vector[(String, Int)] = Vector(
  ("44.697575:12.108186", 4),
  ("36.0912206:28.0882029", 1),
  ("44.4945737:11.3455467", 2),
  ("42.0047755:12.10281885", 1),
  ("40.926823:14.524545", 15),
  ("40.9900604421:14.3986099472", 2),
  ("54.968831:-1.610077", 1),
  ("42.2999961333:12.3573670667", 2),
  ("38.440912:27.14781", 1),
  ("37.7409397:23.430141", 4),
  ("42.6325682:11.6394037528", 3),
  ("37.9728191495:23.7237438519", 42),
  ("42.021338:12.390519", 1),
  ("39.5232397778:16.71231", 1),
  ("41.891775:12.486137", 1),
  ("35.037245:32.436032", 2),
  ("30.900508:30.5919275", 4),
  ("3

In [20]:
// Turn each "lat:lon" key and its count into one comma-separated line: lat,lon,count.
val csv = groupWithCount.map{ case (s,i) => s.replace(":", ",") + "," + i}

csv: Vector[String] = Vector(
  "44.697575,12.108186,4",
  "36.0912206,28.0882029,1",
  "44.4945737,11.3455467,2",
  "42.0047755,12.10281885,1",
  "40.926823,14.524545,15",
  "40.9900604421,14.3986099472,2",
  "54.968831,-1.610077,1",
  "42.2999961333,12.3573670667,2",
  "38.440912,27.14781,1",
  "37.7409397,23.430141,4",
  "42.6325682,11.6394037528,3",
  "37.9728191495,23.7237438519,42",
  "42.021338,12.390519,1",
  "39.5232397778,16.71231,1",
  "41.891775,12.486137,1",
  "35.037245,32.436032,2",
  "30.900508,30.5919275,4",
  "38.1465515,23.970146,1",
  "42.419009,11.6298975,12",
  "38.482289,22.501169,1"
)

In [21]:
println(csv.mkString("\n"))

44.697575,12.108186,4
36.0912206,28.0882029,1
44.4945737,11.3455467,2
42.0047755,12.10281885,1
40.926823,14.524545,15
40.9900604421,14.3986099472,2
54.968831,-1.610077,1
42.2999961333,12.3573670667,2
38.440912,27.14781,1
37.7409397,23.430141,4
42.6325682,11.6394037528,3
37.9728191495,23.7237438519,42
42.021338,12.390519,1
39.5232397778,16.71231,1
41.891775,12.486137,1
35.037245,32.436032,2
30.900508,30.5919275,4
38.1465515,23.970146,1
42.419009,11.6298975,12
38.482289,22.501169,1
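
The cells above zip keys with records, group the pairs, and count the size of each group. As a hedged summary, the same `lat,lon,count` lines can be produced in a single chain. The sketch below runs on four records copied from the output earlier in this notebook (`sample` and `counts` are illustrative names) and groups on a `(Lat, Lon)` tuple instead of a `"lat:lon"` string. It is one possible condensation, not the notebook's own method.

```scala
// Four records copied from the data shown earlier, already split into columns.
val sample = Vector(
  Vector("Achilles Painter", "201", "white lekythos", "35.037245", "32.436032"),
  Vector("Achilles Painter", "202", "white lekythos", "37.9728191495", "23.7237438519"),
  Vector("Achilles Painter", "205", "white lekythos", "37.9728191495", "23.7237438519"),
  Vector("Achilles Painter", "1(1)", "Amphora", "42.419009", "11.6298975")
)

// Group by the (Lat, Lon) pair, count the records in each group,
// and format each group as a "lat,lon,count" line.
val counts = sample
  .groupBy(cols => (cols(3), cols(4)))
  .toVector
  .map { case ((lat, lon), records) => s"$lat,$lon,${records.size}" }
  .sorted // Map order is not guaranteed, so sort for a stable listing

println(counts.mkString("\n"))
```

Using a tuple as the key avoids having to split the key string apart again with `replace` when writing the output.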


Mapping each Vector to the first item in the Vector is equivalent to extracting the first column from each Vector.  The header line told us that the first column should contain ID values.

In [None]:
val ids = data.map(columns => columns(0))
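
Hard-coded indices like `(0)` are easy to get wrong. A small helper can look a column up by its header label instead. This is only a sketch: `columnNamed` is a hypothetical helper, and `header` and `rows` are copied from the vase data shown earlier in this notebook so that the block runs on its own.

```scala
val header = "Painter,Number,Shape,Lat,Lon".split(",").toVector
val rows = Vector(
  Vector("Achilles Painter", "201", "white lekythos", "35.037245", "32.436032"),
  Vector("Achilles Painter", "1(1)", "Amphora", "42.419009", "11.6298975")
)

// Hypothetical helper: find the index of a label in the header,
// then extract that column from every row. Returns None for an unknown label.
def columnNamed(label: String): Option[Vector[String]] = {
  val idx = header.indexOf(label)
  if (idx < 0) None else Some(rows.map(cols => cols(idx)))
}

val shapes = columnNamed("Shape")
println(shapes)
```

Returning an `Option` makes a misspelled or missing label visible immediately, rather than silently reading the wrong column.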

We want to be sure that all ID values are unique.  We can verify that by comparing the number of items in the `ids` Vector with the number of *distinct values* in the `ids` Vector.  If they're the same, then every value is unique.

In [None]:
//println("Records: " + ids.size)
//println("Distinct IDs: " + ids.distinct.size)
if(ids.size == ids.distinct.size) {
    println("All records uniquely identified.")
} else {
    println("Duplicate identifiers in data set.")
}
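
When the check reports duplicates, the next question is which values repeat. Here is a sketch of one way to find out, using made-up identifiers (`sampleIds` is illustrative and not drawn from the data set):

```scala
// Hypothetical identifiers; "b" and "d" occur more than once.
val sampleIds = Vector("a", "b", "c", "b", "d", "d", "d")

// Group equal values together, keep only groups with more than one member,
// and report each repeated value with its number of occurrences.
val duplicates = sampleIds
  .groupBy(identity)
  .toVector
  .collect { case (id, occurrences) if occurrences.size > 1 => (id, occurrences.size) }
  .sortBy(pair => pair._1)

println(duplicates)
```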

## Distribution of denominations

Let's look at how coin denominations are described. You can see from the header line that denominations are in the third column, so we'll map each Vector to the third column. Remember that we start indexing with 0, so the third column is indexed as `(2)`.

In [None]:
val denominations = data.map(columns => columns(2))

We'll use a very handy Scala idiom to count how many times each denomination appears. If we group the elements in our Vector by their value, the result is a Map from each unique value to a Vector of the matching values.

In [None]:
val denominationsGrouped = denominations.groupBy(denom => denom)
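
To see what `groupBy` returns, it can help to try it on a handful of values. The sketch below uses a made-up sample of denomination names (`denomSample` is illustrative, not taken from the data set); each key maps to a Vector holding every occurrence of that value.

```scala
// Made-up sample: "denarius" appears three times, the others once each.
val denomSample = Vector("denarius", "aureus", "denarius", "sestertius", "denarius")

// Group values by themselves: each distinct value becomes a key.
val denomSampleGrouped = denomSample.groupBy(denom => denom)

// Print each key with its occurrences (the order of Map entries is not guaranteed).
denomSampleGrouped.foreach { case (denom, occurrences) => println(s"$denom -> $occurrences") }
```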


In [None]:
// Free puzzle: notice that the result of this groupBy should be the same size as the number of distinct values in our list:
if (denominationsGrouped.size == denominations.distinct.size) {
    println("Number of groups is same as number of distinct values.")
} else {
    println("Something is terribly wrong.  The number of groups is not the same as the number of distinct values.")
}

What we really want to know is *how many times* each denomination appears. We can find that out by transforming our mapping of `String -> Vector[String]` into a mapping of each denomination to the *size* of the Vector of its occurrences.

In [None]:
val denominationsCounts = denominationsGrouped.map{ case (d, v) => (d, v.size) }
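
On Scala 2.13 or later, the grouping and counting steps can be fused with `groupMapReduce`, which avoids building the intermediate Vectors. This is an alternative to the two-step approach above, and it assumes a 2.13+ kernel; `sampleDenoms` is made-up data.

```scala
// Made-up sample of denomination names.
val sampleDenoms = Vector("denarius", "aureus", "denarius", "sestertius", "denarius", "aureus")

// Key by the value itself, map every element to 1, and sum the 1s for each key.
val sampleCounts: Map[String, Int] = sampleDenoms.groupMapReduce(identity)(_ => 1)(_ + _)

println(sampleCounts)
```

The two-step `groupBy` then `map` version is easier to inspect while learning, since the intermediate Map can be examined in its own cell.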


Recall that `Map`s are not ordered in Scala. If we now convert the `Map` to a `Vector`, we will have a Vector of pairs, each pairing a String with an Int. We can sort the Vector by the second element of each pair (which will sort from smallest to largest), then reverse the results to get a descending list of how often each denomination occurs.

In [None]:
val denominationsHisto = denominationsCounts.toVector.sortBy(frequency => frequency._2).reverse
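
Sorting and then reversing works, but the same ordering can also be requested directly by passing a reversed `Ordering` to `sortBy`. Here is a sketch with made-up counts (`sampleCountsMap` and `sampleHisto` are illustrative names):

```scala
// Made-up counts for three denominations.
val sampleCountsMap = Map("denarius" -> 3, "aureus" -> 2, "sestertius" -> 1)

// Sort by the count (second element of each pair), largest first.
val sampleHisto = sampleCountsMap.toVector.sortBy(pair => pair._2)(Ordering[Int].reverse)

println(sampleHisto)
```

Because `sortBy` is a stable sort, this variant keeps tied entries in their original relative order, whereas calling `.reverse` afterwards flips them.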

Now we can easily see the extremes of the counts:

In [None]:
println("Most frequent denomination: " + denominationsHisto.head)

In [None]:
// Find denominations occurring fewer than some threshold number of times
val cutOff = 10 
val leastDenominations = denominationsHisto.filter(frequency => frequency._2 < cutOff)
println("Least frequent denominations: \n" + leastDenominations.mkString("\n"))
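
If both the rare and the common denominations are of interest, `partition` splits the histogram in a single pass. Here is a sketch with made-up counts and an arbitrary threshold (`histoSample`, `threshold`, `rare` and `common` are illustrative names):

```scala
// Made-up histogram, already sorted from most to least frequent.
val histoSample = Vector(("denarius", 42), ("aureus", 15), ("sestertius", 4), ("quinarius", 1))
val threshold = 10

// partition returns a pair: (elements satisfying the predicate, all the rest).
val (rare, common) = histoSample.partition { case (_, count) => count < threshold }

println("Rare: " + rare.mkString(", "))
println("Common: " + common.mkString(", "))
```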

## Assignment


Analyze how many issues are produced by each issuing authority to answer the following questions:

- How many different authorities strike coins in OCRE's data?
- Who strikes the greatest number of issues?  How many?
- What is the smallest number of issues struck by a single authority?

### Gather and organize your data

In [None]:
// First, to extract the "Authority" column from the data set, uncomment and complete the following line:
// val authorities = ??

### Question 1: how many authorities strike coins?



In [None]:
// Use the distinct method and size method to count how many distinct values you have in `authorities`
//
// authorities.??

### Group records by authority and count them

In [None]:
// Use the groupBy method to group the authorities by their value.
// This will give you a Map of Strings to a Vector of Strings
// 
// val authoritiesGrouped = ??


In [None]:
// Now convert each pairing of String -> Vector[String] to a pairing of String -> Int, counting how many elements are in the original Vector.
// The result is a Map[String, Int].
//
// val authoritiesCounts = authoritiesGrouped.map{ ?? }

In [None]:
// Next, convert your Map[String, Int] to a Vector.  The result is a Vector of pairings of (String, Int).
// We'll sort this by the second element of the pairing, namely the Int.  Since we sort from smallest to largest
// by default, you can reverse the result so that the largest counts come first.
//
// val authoritiesHistogram = authoritiesCounts.??

### Questions 2 and 3: who strikes the most issues? the fewest?

In [None]:
// With the authoritiesHistogram you created, you can use the `head` and `last` methods to see the first and last entries in the Vector.
