# Natural Language Processing (NLP), Named Entity Recognition (NER), Ontology and Maps

## Natural Language Processing (NLP)

* NLP means getting computers to analyze human language; in this section the focus is on its structure (morphology and syntax).

* We will look at the basics: morphology and syntax tools.

* There is no common standard for NLP libraries, either across human languages or across programming languages.

* We are in a transition from rule-based to statistical (machine learning) approaches.

## Basic NLP terminology

* Tokenization: chopping language up into its elements: sentences, words, punctuation, etc. 

* Can be complex: e.g. contractions, compound splitting, suffixes, agglutinative languages, languages without spaces.

* Stemming: rules for chopping the ends off words.  This works OK for information retrieval in languages with a simple morphology (e.g. English), when you do not care about syntax.

* Lemmatization: properly reducing a word to its dictionary form using a grammar and dictionary.  Still leaves ambiguities.

* POS tagging: Adding part-of-speech info along with the lemma.

* Tree-banking: Tagging the syntactic structure of whole sentences.
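
The first two terms above can be illustrated with a toy sketch in plain Python. The regular expression and the suffix list are assumptions made up for this example, not the rules of any real tokenizer or stemmer; production tools use far more elaborate rules or trained models.

```python
import re

# A word (keeping internal apostrophes and hyphens) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

# A deliberately tiny, hypothetical suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def crude_stem(word):
    """Lowercase the word and chop off the first matching suffix, if any."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of stem so short words survive.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


tokens = tokenize("The cats weren't chasing mice.")
print(tokens)
print([crude_stem(token) for token in tokens])
```

Note that the stemmer turns "chasing" into the non-word "chas" and leaves the irregular plural "mice" untouched; handling such cases properly is the job of lemmatization.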

## Rule-based and early statistical NLP

* Rule-based systems are fast, straightforward and are good enough for some tasks.

* Statistical systems are heavily dependent on the quality and quantity of the training data: is it representative?

* Early statistical models were based upon n-grams and hidden Markov models.

* These models cope poorly with languages that have freer word order, because they only look at a short window of neighboring words.

* Traditional solution: NLTK in Python, but it has lots of problems.
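
To make the n-gram idea concrete, here is a minimal sketch of a bigram model in plain Python. The three-sentence corpus and the function names are invented for illustration; a real model would be trained on a large corpus and would need smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers let the model learn typical first and last words.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return counts


def next_word_probability(counts, previous, current):
    """Estimate P(current | previous) from the raw counts (no smoothing)."""
    total = sum(counts[previous].values())
    return counts[previous][current] / total if total else 0.0


# A toy corpus, far too small for real use.
corpus = ["the dog bites the man", "the man bites the dog", "the dog sleeps"]
model = train_bigrams(corpus)
print(next_word_probability(model, "the", "dog"))
```

The only context such a model sees is the single preceding word, which is why it depends so heavily on fixed word order.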

## Current Deep Learning approaches

* Rule-based systems are essentially context-free; early statistical approaches had a reductive view of context.

* New DL models with a vast number of parameters can be trained on whole sentences at once.

* Requires an even larger amount of quality data.

* Can give very good accuracy for well-resourced languages, though there remain ambiguities that syntax alone cannot resolve (e.g. the pronoun references in Winograd schemas).

* Competing NLP libraries like spaCy, Flair and Stanza are springing up all the time.

## Named Entity Recognition

* Mapping the world of the text onto the real world.

* Extracting references to real-world objects from texts.

* Capital letters can help a bit, but lots of room for ambiguity.

* Really the computer needs to have an understanding of parts of speech to be able to identify noun phrases.  

* Hence NER is usually incorporated into Natural Language Processing (NLP) libraries. 

* Once you have done NER, you can generate maps and graphs displaying networks of connectivity from your texts.  
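
The point about capital letters can be shown with a naive heuristic. This is a sketch of what not to rely on, assuming English-style capitalization; the regular expression and sample text are made up for the example, and real NER in an NLP library uses part-of-speech and context information instead.

```python
import re

# One or more consecutive capitalized words, e.g. "Henry" or "King Henry".
CAPITALIZED_RUN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def candidate_entities(text):
    """Return every run of capitalized words as a candidate named entity."""
    return CAPITALIZED_RUN.findall(text)


sample = "Yesterday King Henry marched from Harfleur. Turkey was not on the menu."
print(candidate_entities(sample))
```

The heuristic glues the sentence-initial "Yesterday" onto "King Henry", and it cannot tell whether "Turkey" is a country or a dish.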

## Digital Ontology

* Not the study of being.

* Once you have found the proper names in your text, what do you do with them?

* How can you connect them with the same objects as referenced in other texts?

* How to point unambiguously at objects in the real world, and describe relationships between them.

* For our texts to be interoperable, they need to use the same systems of referring to the external world. 

* A model in which Prince Hal, King Henry V and Henry of Monmouth point to the same person.
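
A minimal sketch of such a model is a lookup table from name variants to one shared identifier. The identifier below is a hypothetical placeholder under `example.org`; a real project would use a URI from an established authority file.

```python
# Hypothetical identifier, for illustration only.
HENRY_V = "https://example.org/person/henry-v"

# Every known variant of the name points at the same identifier.
ALIASES = {
    "prince hal": HENRY_V,
    "king henry v": HENRY_V,
    "henry of monmouth": HENRY_V,
}


def resolve(name):
    """Return the shared identifier for a name, or None if it is unknown."""
    # Normalize case and whitespace before looking the name up.
    return ALIASES.get(" ".join(name.lower().split()))


for mention in ["Prince Hal", "King Henry V", "Henry of  Monmouth", "Falstaff"]:
    print(mention, "->", resolve(mention))
```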

## Semantic Web

* A grand vision of Tim Berners-Lee, to do for data what the web did for texts.

* “A computational artefact which provides a formal model of a domain of interest consisting of classes of objects, their attributes and the relations between them.”

* A vast collection of subject-predicate-object statements: [example](https://en.wikipedia.org/wiki/Semantic_Web#Example).

* These statements can be expressed in various dialects of RDF (Resource Description Framework).

* The subjects and the objects are what are most interesting right now, from a DH perspective.  These are typically URIs (in practice, usually URLs), although objects can also be literal values.

* Provides a way to mark-up texts unambiguously and to leverage the power of cross-referencing.
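
The subject-predicate-object idea can be sketched with plain Python tuples. All the URIs here are hypothetical placeholders under `example.org`, and the `match` helper is an invented name; in practice an RDF library such as `rdflib` would handle parsing and querying.

```python
# All URIs below are hypothetical placeholders.
EX = "https://example.org/"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (EX + "person/henry-v", EX + "relation/bornIn", EX + "place/monmouth"),
    (EX + "person/henry-v", EX + "relation/foughtAt", EX + "place/agincourt"),
    (EX + "person/falstaff", EX + "relation/friendOf", EX + "person/henry-v"),
]


def match(statements, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in statements
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything that is stated about Henry V as a subject.
for s, p, o in match(triples, subject=EX + "person/henry-v"):
    print(p, o)
```

Because the same URI appears as a subject in some statements and as an object in others, statements from different sources can be joined into one graph.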

## Linked Open Data

* “All conceptual things should have a name starting with HTTP” (Tim Berners-Lee).

* Covers all of the objects of humanistic study: places, people, texts, artifacts, etc.

* It is sometimes useful to operate at high resolution: Pompeii as a whole, or a particular wall of a room of a house in a block; the whole Bible, or a particular verse.

* The humanities often have quite well developed systems of reference.  The problem is digitizing them in a way that is stable and widely agreed.

* Scholars do not always recognize the importance of being as careful about digital references as they have always been about traditional references.

## Maps

* Maps are an important aspect of data visualization in the Humanities. Most historical data has a geographical aspect, and it can be interesting to extract and plot place names from other kinds of texts via Named Entity Recognition.

* There are a number of separate tasks in creating geospatial visualizations.

## Geocoding

* This is the task of taking place-names, addresses and so forth and turning them into coordinates in latitude and longitude.

* There are a number of services that provide this functionality, free and paid.  The `geopy` library provides a unified interface to them: Google Maps, Bing Maps, etc.

* We will be using a free geocoding service from OpenStreetMap called Nominatim (the `Nominatim` class in `geopy`), which does not require an API key for moderate use.
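
A minimal sketch of geocoding a list of place names with `geopy` might look like this. The function names and the user-agent string are assumptions for the example. The geocoder is passed in as a parameter so that the lookup logic can be tried out without network access, and the one-second delay reflects Nominatim's request for moderate use.

```python
def make_nominatim_geocoder(user_agent):
    """Build a rate-limited geocoding function backed by Nominatim."""
    # Third-party imports are kept local so the rest of the sketch runs without geopy.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Wait at least one second between requests to respect the free service.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_places(names, geocode):
    """Map each place name to (latitude, longitude), or None if it is not found."""
    results = {}
    for name in names:
        if name in results:
            continue  # do not query the service twice for the same name
        location = geocode(name)
        if location is None:
            results[name] = None
        else:
            results[name] = (location.latitude, location.longitude)
    return results
```

With `geopy` installed and a network connection, `geocode_places(["Pompeii", "Agincourt"], make_nominatim_geocoder("my-dh-project"))` would query the live service; the user-agent string should identify your own application.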

## Using Online Maps

* The easiest way to work with maps is to do so via an online mapping service, such as Google Maps.

* An alternative, OpenStreetMap, is a free, open-access editable map of the world.

* There is a very nice JavaScript library for displaying it called [Leaflet](https://leafletjs.com/), which in turn has a nice Python wrapper library called `folium`.

* One of the convenient aspects of these online maps is that they tend to use latitude and longitude natively, so you can plot geocoded points directly.
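
Plotting geocoded points with `folium` can be sketched as follows. The coordinates are approximate values supplied for illustration, and `center_of` and `build_map` are invented helper names. Averaging the coordinates is a naive way to pick the initial view and would misbehave for points on either side of the 180° meridian.

```python
# (name, latitude, longitude); approximate values, for illustration only.
PLACES = [
    ("Monmouth", 51.81, -2.72),
    ("Harfleur", 49.51, 0.20),
    ("Agincourt", 50.46, 2.13),
]


def center_of(points):
    """Average the coordinates to get a rough center for the initial view."""
    latitudes = [lat for _, lat, _ in points]
    longitudes = [lon for _, _, lon in points]
    return sum(latitudes) / len(points), sum(longitudes) / len(points)


def build_map(points, zoom_start=6):
    """Return a folium map with one marker per point."""
    import folium  # third-party; imported here so center_of works without it

    web_map = folium.Map(location=list(center_of(points)), zoom_start=zoom_start)
    for name, lat, lon in points:
        # Markers take latitude and longitude directly, in that order.
        folium.Marker(location=[lat, lon], popup=name).add_to(web_map)
    return web_map
```

With `folium` installed, `build_map(PLACES).save("places.html")` writes a standalone HTML page that can be opened in a browser.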

## Working with Geometry

* Most normal maps do not use latitude and longitude as their primary system of reference.  Those are spherical coordinates in 3D space, whereas most maps use flat 2D projections, all of which have different characteristics.  So you often have to convert points from one Coordinate Reference System (CRS) to another.

* The `geopandas` library can read and translate most kinds of CRS data (and re-project it in another system if necessary).  It has many other geometric methods, such as measuring area.  It can also draw maps using `matplotlib`.
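
To show what a change of CRS involves, here is a hand-written sketch of one common conversion: from latitude and longitude in degrees (EPSG:4326) to the Web Mercator projection used by most online maps (EPSG:3857). The function name is an assumption for this example; in practice you would let `geopandas` do this with `GeoDataFrame.to_crs` rather than writing the formula yourself.

```python
import math

# Radius of the sphere used by Web Mercator, in meters (the WGS 84 semi-major axis).
EARTH_RADIUS_M = 6378137.0


def to_web_mercator(lat, lon):
    """Convert latitude/longitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon)
    # The latitude is stretched more and more towards the poles.
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


print(to_web_mercator(0.0, 0.0))
print(to_web_mercator(51.81, -2.72))
```

The formula breaks down at the poles, where the projected `y` value grows without bound; this is one example of the different characteristics that each projection has.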

## Other Libraries

* There are lots of other Python libraries for mapping, such as `plotly`, `cartopy` (developed by the UK Met Office), etc.  Most of these plot locations on a generic map.

* To create your own maps and to manage detailed geographical data, you will want to use a specialized GIS (Geographic Information System) application.  See [QGIS](https://qgis.org/en/site/) for an open-source option.