# Using Jupyter Notebooks to Process Europeana Newspaper Text Resources

These notebooks have been designed to help you get started processing historical text resources
(from [Europeana Newspapers](https://www.europeana.eu/en/collections/topic/18-newspapers))
with natural language processing (NLP) tools (from [CLARIN](https://www.clarin.eu)).


## Getting started

### Get familiar with the data
* Browse the [Europeana Newspapers portal](https://www.europeana.eu/en/collections/topic/18-newspapers)
* Browse the newspaper resources in the [Virtual Language Observatory](https://vlo.clarin.eu/search?fq=collection:Europeana+newspapers+full-text)

### Get access to a Jupyter environment
If you are not already reading this inside a Jupyter environment, you will need access to one. Your institution may provide such an environment, but you can also use a public "Binder" platform for free and without registration. Go to the [GitHub repository](https://github.com/clarin-eric/europeana-newspapers-notebooks) for this collection of notebooks and click the `Launch binder` badge to get started within a few minutes.


### Retrieve the data
**Note: if you are using the notebooks in an environment that has the data available in the right location (such as [jupyter.clarin-dev.eu](https://jupyter.clarin-dev.eu)), you can skip this step!**

If you are working with these notebooks for the first time and/or in a new environment, you will not yet have the data, i.e. the newspaper full-text resources, available locally. To run most of the notebooks, you will need a copy of the data for the set(s) that you will be working with. Data can be retrieved and unpacked locally by running the full [retrieve data notebook](./util/retrieve_data.ipynb). Click on the link to go there, and follow the instructions that you will find at the start of the notebook.

### Follow the tutorial
The tutorial consists of three chapters. Start by opening the [first chapter](./tutorial/demo/Chapter1.ipynb), then read and run the cells one by one while examining their output.

Or jump directly to a chapter of interest:
- [Chapter 1: XML and CMDI introduction](./tutorial/demo/Chapter1.ipynb)
- [Chapter 2: Data selection and resource access](./tutorial/demo/Chapter2.ipynb)
- [Chapter 3: NLP processing](./tutorial/demo/Chapter3.ipynb) 
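
To give a rough idea of the kind of code the first two chapters work with, here is a minimal sketch that reads the resource links from a CMDI-like metadata record using only the Python standard library. The sample record is made up and heavily simplified, the `example.org` URLs are placeholders, and the helper `list_resources` is a hypothetical name for this illustration; the notebooks themselves may use different libraries and a different approach.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified record for illustration only. Real CMDI
# records also contain a header and a much richer component section.
SAMPLE_RECORD = """\
<CMD xmlns="http://www.clarin.eu/cmd/1">
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="r1">
        <ResourceType mimetype="application/zip">Resource</ResourceType>
        <ResourceRef>https://example.org/newspapers/sample-fulltext.zip</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="r2">
        <ResourceType mimetype="text/html">LandingPage</ResourceType>
        <ResourceRef>https://example.org/newspapers/sample</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
</CMD>
"""

# Assumption: the record uses the CMDI 1.2 envelope namespace.
NS = {"cmd": "http://www.clarin.eu/cmd/1"}


def list_resources(xml_text):
    """Return one dict per ResourceProxy element found in the record."""
    root = ET.fromstring(xml_text)
    resources = []
    for proxy in root.findall(".//cmd:ResourceProxy", NS):
        resource_type = proxy.find("cmd:ResourceType", NS)
        resource_ref = proxy.find("cmd:ResourceRef", NS)
        resources.append({
            "id": proxy.get("id"),
            "type": resource_type.text,
            "mimetype": resource_type.get("mimetype"),
            "url": resource_ref.text,
        })
    return resources


# Print the type, media type and link of each resource in the sample record.
for resource in list_resources(SAMPLE_RECORD):
    print(resource["type"], resource["mimetype"], resource["url"])
```

Because the record declares a default namespace, every element lookup needs the namespace prefix mapping; searching for plain `ResourceProxy` would find nothing.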

### Work on the exercise notebooks
Start with [Exercise 1](./tutorial/exercises/Exercise1.ipynb) and follow the pointers from there.


## Doing more
- Take a look at the scripts in the `examples` directory
- Use texts from other languages, eras and/or sources
- Scrape text from online sources with [Newspaper3k](https://newspaper.readthedocs.io/en/latest/)


## Links and further reading

* CLARIN's landing page for notebooks for research and education: [clarin.eu/notebooks](https://www.clarin.eu/notebooks).
* [This set of notebooks on GitHub](https://github.com/clarin-eric/europeana-newspapers-notebooks)
* [This set of notebooks and related training material on the SSHOC Open Marketplace](https://marketplace.sshopencloud.eu/training-material/duVII1)

### Jupyter notebooks
- [Official documentation](https://docs.jupyter.org/en/latest/)
- [Introduction to Jupyter Notebooks from the Programming Historian](https://programminghistorian.org/en/lessons/jupyter-notebooks)
- [Teaching and Learning with Jupyter](https://jupyter4edu.github.io/jupyter-edu-book/)
- [Awesome Jupyter](https://github.com/markusschanta/awesome-jupyter) - A curated list of awesome Jupyter projects, libraries and resources.

### CLARIN & NLP
- [Easy-to-Use Language Resources](https://www.clarin.eu/content/language-resources)
- [CLARIN Switchboard tool inventory](https://switchboard.clarin.eu/tools)

### Historical newspapers as a text resource
- [Europeana newspapers blog & knowledge base](http://www.europeana-newspapers.eu/)
- [Listing of full text newspaper sources from Cornell University Library](https://guides.library.cornell.edu/news_online)

### Learn more about NLP
- [Notebook-based course: Python Programming for the Humanities](https://github.com/fbkarsdorp/python-course) (includes a chapter "Building NLP applications")
- [Notebook-based Introduction to NLP](https://github.com/nlptown/nlp-notebooks)


----

These training materials have been developed by Twan Goosen and Michał Gawor (CLARIN ERIC) in the context of the [Europeana DSI-4 project](https://pro.europeana.eu/project/europeana-dsi-4).

Thanks to Dieter Van Uytvanck (CLARIN ERIC), Iulianna van der Lek-Ciudin (CLARIN ERIC) and Alba Irollo (Europeana) for their contributions.

![CC0](https://i.creativecommons.org/p/zero/1.0/88x31.png) \
The materials in this repository are released under a [CC0 1.0 licence](https://creativecommons.org/publicdomain/zero/1.0/).

