Open in colab: https://colab.research.google.com/github/brochhagen/nlpupf/blob/main/material/2023q2/session01/notebook0101.ipynb

# Natural Language Processing (Session 01.01)

## Overview 

### Presentation 

Welcome to **N**atural **L**anguage **P**rocessing! This class gives an introduction to central aspects of natural language processing. It puts its main emphasis on hands-on experience with the acquisition, manipulation, curation, and processing of linguistic data. It covers both symbolic and statistical methods, from a theoretical and practical angle.

A central goal of this class is for you to acquaint yourself with state-of-the-art techniques used in industry and academia to structure language data and extract information from it, as well as to empower you to apply this knowledge to new problems outside the scope of this class.

On the one hand, you will learn to reason about problems and tasks that arise when processing linguistic information. On the other, you will learn about common methods and approaches to address them. My suggestion: Focus on the former, but keep the latter in mind for when you need it.

***

### Associated skills

  * python
  * data acquisition, manipulation, curation, and processing
  * machine learning
  * quantitative reasoning applied to language sciences

***

### Prerequisites

Working knowledge of *python 3*. 

See the *Recommendations* section of the __[specialization in Computational Linguistics of the Master's](https://www.upf.edu/web/masterlinguistica/linguistica-computacional)__ for more information.

***

### Contents

1. **Main tasks**

  * Handling text
  * Tagging
  * Parsing
  

2. **Main models & technologies**

  * Fast, industrial-strength streamlined models, using __[spaCy](https://spacy.io/)__
  * Larger state-of-the-art systems, using __[Hugging Face's transformers](https://huggingface.co/)__

3. **Associated topics**

  * Data curation
  * Data quality
  * Fine-tuning
  * Training and evaluation
 
***

### Evaluation
  
  * 20% participation in class (5% per instance, so four instances earn the full 20%)

  * 80% exercises
    * Exercise 1: 25% (due: 13/02 at 23:59)
    * Exercise 2: 25% (due: 27/02 at 23:59)
    * Exercise 3: 30% (due: 27/03 at 23:59)

Participation in class can be either of two kinds:
  1. Present a new concept. Concepts for the next session are announced at the end of each session
  2. Present your approach/results to a problem in class

Exercises can be done individually or in groups of up to three members.

***


### Weekly structure 

Before a session:

  * Prepare reading
  * Make sure you have a working environment that fulfills the necessary dependencies
  * Submit exercise (if due)
  * Prepare your concept presentation (if you volunteered) 

The session itself:

  * Roughly half is devoted to discussing concepts from a theoretical angle (including your mini-presentations)
  * The other half is hands-on work

***

### A few recommendations

  * Jupyter notebooks look nice, but do not write "heavy duty" scripts in them. Use them for dynamic presentations (or smaller scripts and collaborations)
  * Comment your code extensively
  * Document your choices
  * Use virtual environments (e.g., [ana]conda) for your projects
  * Use the language that is most convenient to you whenever you can. For this class, python and R are admissible

***

### Short instructor bio

Thomas is a professor in computational linguistics / computational cognitive science at the UPF. He's particularly interested in the way language is structured: why it is the way it is and how it came to be that way. To answer these questions, he sometimes gets people in the lab, sometimes uses artificial agents to simulate language (use), and other times uses large-scale typological data. Besides NLP and machine learning, he also employs Bayesian models, game theory, and information theory in his daily research.

Contact: __thomas.brochhagen@upf.edu__

Webpage: __[https://brochhagen.github.io](https://brochhagen.github.io)__

Main natural languages: EN, DE, ES, CAT

Main programming languages: python, R, Stan

### Teaching assistant
Jeanne Bruneau--Bongard (__jeanne.bruneau.bongard@gmail.com__)

***


<div class="alert alert-block alert-info"> <b>Class activity.</b> Who are you? What is your background? What do you expect from this class? Is there any particular topic/problem/... you'd like to cover, or a particular skill to acquire? </div>


*** 

# Some ancillaries (Session 01.02)

## What is a Jupyter notebook?


It is an interactive programming interface, a "web application for creating and sharing computational documents." See [https://jupyter.org/](https://jupyter.org/) for full documentation.

In [1]:
# A Jupyter notebook consists of different cells. 
# The cell above is interpreted as Markdown
# This cell is interpreted as python 3

print('Testing that this is, indeed, python 3 and not 2')

Testing that this is, indeed, python 3 and not 2


In [2]:
# When running a python cell, you see the output just below
# For example:

4+4 

8

In [1]:
# What is the output of this cell?

for i in range(1,10):
    print(i * 2)

2
4
6
8
10
12
14
16
18


In this class, most of the material comes as one or more Jupyter notebooks. You will find them in __[Campus Global](https://www.upf.edu/intranet/campus-global)__.

***

## What is a(n) (ana)conda environment?

Anaconda is a python and R distribution for package and environment management. Essentially, it is a way to separate the modules/packages you need for one project from those of your other projects, which avoids conflicts and keeps things light.

See: __[https://docs.anaconda.com/anaconda/](https://docs.anaconda.com/anaconda/)__
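
If you are unsure which environment a notebook or script is running in, a quick check from within python can help. The following is a minimal sketch, not an official recipe: the function name `describe_environment` is made up for illustration, and it assumes that conda sets the `CONDA_DEFAULT_ENV` variable when an environment is activated (outside of conda, the value is simply `None`).

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter running this code."""
    return {
        # Folder of the active python installation or environment
        "prefix": sys.prefix,
        # Assumption: conda sets this variable on activation; None otherwise
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Major and minor version of the running interpreter
        "python_version": "{}.{}".format(*sys.version_info[:2]),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `prefix` points to your system-wide python rather than to the folder of your project's environment, the environment is probably not activated.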


*** 

## What is colab(oratory)?

An easy way to run Jupyter notebooks on Google's servers and share them with others. Intuitively, it's Google Docs but for notebooks. 

It is a particularly good option to consider if you want to run heavier computations.

See: __[https://colab.research.google.com](https://colab.research.google.com)__

*** 

## How do I manage and structure my code?

This is not a programming class. However, the expectation is that the code you share with others is well structured and documented. Ideally, the same is true of your code (and research documentation) in general, whether others see it or not.

If you would like to learn how to properly manage and structure projects, __[The Good Research Code Handbook](https://goodresearch.dev/index.html)__ is a good starting point.


***

## Main modules:

* `spacy`
* `transformers`
* `spacy-transformers`
* `BeautifulSoup`
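
To see whether your environment already provides these modules, you can run a small check like the one below. It is a sketch under stated assumptions: the helper `check_modules` and the `MODULES` mapping are illustrative names, and the mapping assumes that `spacy-transformers` is imported as `spacy_transformers` and that `BeautifulSoup` is installed as `beautifulsoup4` and imported as `bs4`.

```python
from importlib import metadata, util

# Assumed mapping: import name -> name used when installing the package
MODULES = {
    "spacy": "spacy",
    "transformers": "transformers",
    "spacy_transformers": "spacy-transformers",
    "bs4": "beautifulsoup4",
}


def check_modules(modules):
    """Map each import name to its installed version, or None if missing."""
    status = {}
    for import_name, dist_name in modules.items():
        # find_spec returns None when a top-level module cannot be found
        if util.find_spec(import_name) is None:
            status[import_name] = None
            continue
        try:
            status[import_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but installed under a different distribution name
            status[import_name] = "unknown version"
    return status


for name, version in check_modules(MODULES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

The check only looks the modules up; it does not import them, so it runs quickly even for large packages.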