# Intro to NLP Packages and Data

This notebook lets participants install all required packages and download the required data before the course begins.

We recommend that participants use a virtual environment to manage packages. For more information on how to do this, see:

[The Python documentation on working with venv](https://docs.python.org/3/library/venv.html)

[A guide to getting virtual environments to work with Jupyter notebooks](https://janakiev.com/blog/jupyter-virtual-envs/)
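
As a quick sanity check, the sketch below reports which interpreter the notebook kernel is using and whether it belongs to a virtual environment. The helper name `in_virtual_environment` is an illustrative assumption, not part of the course materials. It relies on the standard behaviour of `venv`, where `sys.prefix` differs from `sys.base_prefix` inside an environment; conda environments are not detected this way.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


# The interpreter path shows which Python the current kernel is really using.
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

If the interpreter path is not the one you expect, the kernel is probably not registered against your environment, which the second link above covers.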

This course uses a range of domain-specific libraries. Unfortunately, some of these are challenging to use on secure devices. Workarounds have been provided in this course where possible, but the best advice is to consult your home department's package guidance.

## Package Dependencies

Installing `requirements.txt` with `pip` gets all the packages required for the course ready to go.

**Use the command below in your Anaconda Prompt, or wherever you install packages**
Run it from the same working directory as this notebook, or give the full path to `requirements.txt`.

**WARNING** Some packages are large and may take some time to install.

```
pip install -r requirements.txt
```
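
To confirm that the install worked, a minimal sketch along the following lines can list any packages that are still missing from the current environment. The function name `find_missing` and the package list are assumptions for illustration (the real list lives in `requirements.txt`), and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata


def find_missing(package_names):
    """Return the names from package_names that are not installed in this environment."""
    missing = []
    for name in package_names:
        try:
            # metadata.version raises PackageNotFoundError for uninstalled packages
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical subset of the course packages; adjust to match requirements.txt
course_packages = ["nltk", "spacy", "stanza"]
print("Missing packages:", find_missing(course_packages))
```

An empty list means every named package is visible to the interpreter running the check. This only checks for presence, not for the pinned versions.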

In [1]:
# The code below installs the required package versions into your **current** kernel.
# Only try this if the command above does not work.
# The interpreter path is quoted because the command fails without quotes
# when the path contains spaces.
import sys
!"{sys.executable}" -m pip install -r requirements.txt


In [2]:
# Install NLTK if it is not already installed.
# The interpreter path is quoted so that paths containing spaces still work.
import sys
!"{sys.executable}" -m pip install nltk


## NLP data from packages

Some packages require specific data in order to achieve their full functionality. Note that the code below will not work on an ONS network device; if you are using one, please use the Binder version of the course.

In [3]:
# Run this block if you do not want to install the data yourself (off network)
import nltk
nltk.data.path.append('./nltk_data/')

# test it works
from nltk import word_tokenize

word_tokenize("My name is")


In [None]:
# Run this block if you would like to install the data onto your machine
import nltk

required_downloads = ['tokenizers/punkt', 'corpora/wordnet', 'corpora/stopwords',
                      'taggers/averaged_perceptron_tagger', 'chunkers/maxent_ne_chunker',
                      'corpora/words']

for asset_name in required_downloads:
    name = asset_name.split("/")[1]
    try:
        nltk.data.find(asset_name)
    except LookupError:
        nltk.download(name)

# test it works
from nltk import word_tokenize

word_tokenize("My name is")

## Install spaCy model

The spaCy model used in this course often cannot be installed with the `requirements.txt` file. Running the code below will allow it to be imported.



In [None]:
# Uncomment and run if the package is not installed (requires `import sys`)
#!"{sys.executable}" -m pip install spacy

In [None]:
import spacy

The small model has been pre-installed within this repository. If you cannot download it yourself, the path to the model will be given throughout the course.

In [None]:
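# This will not work on a closed network system.
# Using the kernel's own interpreter ensures the model is installed into the current environment.
import sys
!"{sys.executable}" -m spacy download en_core_web_sm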
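
The two ways of obtaining the model can be combined into one loader. The sketch below is one possible approach, not the course's own method: it tries the downloaded model first and falls back to the copy bundled with this repository. The function name `load_spacy_model` and its `loader` parameter are assumptions for illustration; the fallback path is the one used elsewhere in this notebook. It relies on `spacy.load` raising `OSError` when a model cannot be found.

```python
def load_spacy_model(name="en_core_web_sm",
                     local_path="./spacy/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1/",
                     loader=None):
    """Load the installed model, falling back to the copy bundled with the repository."""
    if loader is None:
        # Import lazily so that defining this helper does not require spaCy
        import spacy
        loader = spacy.load
    try:
        # Works when the model was installed with `spacy download`
        return loader(name)
    except OSError:
        # spaCy raises OSError when the named model is not installed
        return loader(local_path)


# Usage (requires spaCy and at least one copy of the model):
# nlp = load_spacy_model()
```

The `loader` argument exists only so that the fallback logic can be exercised without spaCy installed; in normal use it can be left at its default.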

# Stanza Model

**WARNING** This model, used twice in the course, is ~500MB in size.

Whilst `stanza` is a powerful and useful library, its downloads and required packages are not necessarily supported on all systems. The core concepts in this course are taught without it, but stanza models are included where appropriate. If you cannot get the model download to work, that is okay and will not significantly reduce your experience.

In [None]:
# Uncomment and run if the package is not installed (requires `import sys`)
#!"{sys.executable}" -m pip install stanza

In [None]:
import stanza

In [None]:
# The below will only work if your network allows
stanza.download('en') # download English model for neural pipelines


In [None]:
# Load the spaCy model bundled with this repository (requires the earlier `import spacy`)
nlp = spacy.load("./spacy/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1/")