# Tutorial to Koalas - Metadata and adding provenance information

## Metadata

Koalas automatically records metadata for each column in a WordFrame. Each event that modifies a column is described by the operation applied, its parameters, the user who applied the change, and a timestamp. First we import koalas under the abbreviation `kl` and create a test WordFrame by passing a dictionary to the constructor:

In [1]:
import koalas as kl
wordframe = kl.WordFrame({'terms': ['red', 'green', 'blue'], 'frequency': [134, 421, 245]})
print(wordframe)

   frequency  terms
0        134    red
1        421  green
2        245   blue


Koalas has already recorded metadata, in this case the creation of the data. You can access it through the `meta` attribute.

In [2]:
print(wordframe.meta)

{'frequency': 
  - created  (beyt on 2018/04/12 09:35:24)
, 'terms': 
  - created  (beyt on 2018/04/12 09:35:24)
}
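
To make the structure of such an entry concrete, here is a minimal sketch in plain Python of a per-column event log that stores the operation, its parameters, the user and a timestamp. The names `Event` and `record` are hypothetical and not part of Koalas, and the string format only imitates the output shown above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One metadata entry: what was done, with which parameters, by whom, when."""
    operation: str
    parameters: str
    user: str
    timestamp: datetime

    def __str__(self):
        # Parameters are quoted when present; an empty value leaves a gap.
        quoted = f'"{self.parameters}"' if self.parameters else ""
        stamp = self.timestamp.strftime("%Y/%m/%d %H:%M:%S")
        return f"  - {self.operation} {quoted} ({self.user} on {stamp})"


def record(meta, column, operation, parameters="", user="unknown", timestamp=None):
    """Append an event to the history of `column`, creating the history if needed."""
    event = Event(operation, parameters, user, timestamp or datetime.now())
    meta.setdefault(column, []).append(event)
    return event


# Hypothetical stand-in for the `meta` attribute: column name -> list of events.
meta = {}
when = datetime(2018, 4, 12, 9, 35, 24)
for column in ("frequency", "terms"):
    record(meta, column, "created", user="beyt", timestamp=when)

for column, events in meta.items():
    print(column)
    for event in events:
        print(event)
```

Keeping one append-only list per column means the history of a column can be read in the order in which the operations were applied.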


Any operation performed on a column is recorded and appended to the metadata...

In [3]:
wordframe['relative frequency'] = wordframe['frequency'] / wordframe['frequency'].sum()
wordframe.str.upper(on='terms')
print(wordframe)

   frequency  terms  relative frequency
0        134    red             0.16750
1        421  green             0.52625
2        245   blue             0.30625


...and can be inspected through `meta`.

In [4]:
wordframe.meta

{'frequency': 
  - created  (beyt on 2018/04/12 09:35:24)
,
 'last modified': 'relative frequency',
 'relative frequency': 
  - created  (beyt on 2018/04/12 09:35:24)
  - divided by "800" (beyt on 2018/04/12 09:35:24)
,
 'terms': 
  - created  (beyt on 2018/04/12 09:35:24)
}
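
The `divided by "800"` entry reflects the sum of the `frequency` column (134 + 421 + 245 = 800). As a quick check that does not depend on Koalas, the same arithmetic in plain Python gives the values shown in the `relative frequency` column:

```python
frequency = [134, 421, 245]

# The divisor recorded in the metadata is the column sum.
total = sum(frequency)

# Each relative frequency is the raw count divided by that sum.
relative = [f / total for f in frequency]

print(total)     # 800
print(relative)  # [0.1675, 0.52625, 0.30625]
```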

## Adding provenance information to a WordFrame
Let's continue with the example from the first tutorial. We want to find out which of the candidate terms are brand names. Suppose we have a text file containing the names of car brands, one per line. We can simply import it as a WordFrame.

In [5]:
carbrands = kl.read_csv('data/carbrands.csv')
print(carbrands)

                                  name
0                               Abarth
1                       Abbott-Detroit
2                        AC Propulsion
3                                Acura
4                           Alfa Romeo
5                               Alpina
6                  Alpine (automobile)
7    Alvis Car and Engineering Company
8          American Motors Corporation
9                              Amilcar
10                               Amuza
11                    Apollo Automobil
12                 Ariel Motor Company
13                                 ARO
14                            Arrinera
15                         Ascari Cars
16                        Aston Martin
17                                Audi
18                    Automobile Dacia
19                Automobiles Grégoire
20             Automobiles René Bonnet
21               Automotive Industries
22                           Auverland
23                       Avions Voisin
24                 Ballot

If we plan to reuse this list frequently as a reference, it is a good idea to add metadata to it. Koalas automatically creates a metadata entry when we import a CSV file. You can inspect it at any time through the `meta` attribute.

In [6]:
carbrands.meta

{'name': 
  - from file "data/carbrands.csv" (beyt on 2018/04/12 09:35:24)
}

Unfortunately, this entry is very basic: Koalas knows nothing about the content of the file, its source, or its intended use. If you have this information, however, you can add it to the WordFrame. Pass a dictionary with metadata and provenance information to the `provenance` parameter of the `read_csv` function.

In [7]:
provenance = {'source': 'Wikipedia',
              'link': 'https://en.wikipedia.org/wiki/Category:Car_brands',
              'description': 'A non-exhaustive list of car brand names.',
              'version': 'revision from 20:37, 18 February 2018'}
carbrands_with_provenance = kl.read_csv('data/carbrands.csv', provenance=provenance)

Inspecting the `meta` attribute now shows the added metadata in addition to the user who opened the file and a timestamp.

In [8]:
print(carbrands_with_provenance.meta)

{'name': 
  - external source "{'source': 'Wikipedia', 'link': 'https://en.wikipedia.
    org/wiki/Category:Car_brands', 'description': 'A non-exhaustive list of car 
    brand names.', 'version': 'revision from 20:37, 18 February 2018'}" (beyt on 
    2018/04/12 09:35:24)
}


This additional information cannot easily be stored in a CSV file alongside the actual data. Instead, we can save the WordFrame as a JSON file, which preserves all information.

In [9]:
carbrands_with_provenance.to_json('data/carbrands.json')

Opening this file again with the `read_json` function retrieves the exact same information.

In [10]:
newWordFrame = kl.read_json('data/carbrands.json')
print(newWordFrame.meta)

{'name': 
  - external source "{'source': 'Wikipedia', 'link': 'https://en.wikipedia.
    org/wiki/Category:Car_brands', 'description': 'A non-exhaustive list of car 
    brand names.', 'version': 'revision from 20:37, 18 February 2018'}" (beyt on 
    2018/04/12 09:35:24)
}
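
A JSON file can carry this information because a single JSON document can nest the column values and the provenance dictionary side by side, whereas a CSV file only has room for rows and columns. The sketch below illustrates the idea with the standard `json` module. The layout with `data` and `meta` keys is a hypothetical example, not the file format Koalas actually writes.

```python
import json

provenance = {
    "source": "Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Category:Car_brands",
    "description": "A non-exhaustive list of car brand names.",
    "version": "revision from 20:37, 18 February 2018",
}

# Hypothetical layout: values and metadata live in one document.
document = {
    "data": {"name": ["Abarth", "Abbott-Detroit", "AC Propulsion"]},
    "meta": {"name": [{"operation": "external source", "parameters": provenance}]},
}

# Serialize to a JSON string and parse it back, as a file round trip would.
serialized = json.dumps(document, ensure_ascii=False, indent=2)
restored = json.loads(serialized)

print(restored == document)  # True
print(restored["meta"]["name"][0]["parameters"]["source"])  # Wikipedia
```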


Any operation performed on the WordFrame still leaves a trace in its metadata, as seen in the section above.

In [11]:
newWordFrame = newWordFrame.remove_qualifiers(on='name', to='normalized name').str.lower()
print(newWordFrame.meta)

{'name': 
  - external source "{'source': 'Wikipedia', 'link': 'https://en.wikipedia.
    org/wiki/Category:Car_brands', 'description': 'A non-exhaustive list of car 
    brand names.', 'version': 'revision from 20:37, 18 February 2018'}" (beyt on 
    2018/04/12 09:35:24)
, 'normalized name': 
  - external source "{'source': 'Wikipedia', 'link': 'https://en.wikipedia.
    org/wiki/Category:Car_brands', 'description': 'A non-exhaustive list of car 
    brand names.', 'version': 'revision from 20:37, 18 February 2018'}" (beyt on 
    2018/04/12 09:35:24)
  - removed qualifier  (beyt on 2018/04/12 09:35:24)
  - lower  (beyt on 2018/04/12 09:35:24)
, 'last modified': 'normalized name'}


In [12]:
print(newWordFrame.head())

                    name      normalized name
0                 Abarth               abarth
1         Abbott-Detroit       abbott-detroit
10                 Amuza                amuza
100                Mazda                mazda
101  Mazzanti Automobili  mazzanti automobili
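
The row labels in the output above appear in the order 0, 1, 10, 100, 101, which is how labels sort when they are compared as strings rather than as integers. Whether this is what happens inside Koalas after the JSON round trip is an assumption. The sketch below only demonstrates the sorting behaviour in plain Python.

```python
labels = [0, 1, 10, 100, 101, 2, 3]

# Compared as strings, "10" sorts before "2" because "1" < "2".
as_strings = sorted(str(label) for label in labels)

# Compared as integers, the labels come out in numeric order.
as_integers = sorted(labels)

print(as_strings[:5])   # ['0', '1', '10', '100', '101']
print(as_integers[:5])  # [0, 1, 2, 3, 10]
```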
