# A quick intro to working with ```spaCy```

In [1]:
import spacy

Once ```spaCy``` has been imported, we need to load a language model.

NB: Models must first be downloaded from the command line. An overview of the available ```spaCy``` models can be found [here](https://spacy.io/usage/models):

```python -m spacy download en_core_web_sm```

In [2]:
nlp = spacy.load("en_core_web_sm") # by convention, the loaded pipeline is assigned to a variable called nlp

This gives us a ```spaCy``` pipeline, which we will use for all of our analysis. Essentially, we feed examples of language into the pipeline and get annotated texts out the other end.

In [3]:
sentence = "My name is Ross and I come from Scotland"

The object that comes out of the pipeline is known as a ```spaCy``` Doc, which is essentially a list of tokens.

However, rather than just being a list of strings, each token in this list has its own *attributes*, which can be accessed using dot notation.

In [4]:
doc = nlp(sentence) # this is known as a spacy doc object (see type(doc))
# a doc is just a list of tokens but with attributes
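
To make the "list of tokens" idea concrete, here is a minimal sketch of indexing and slicing a Doc. It assumes a blank English pipeline created with `spacy.blank("en")`, which only tokenizes and so needs no model download. The names `blank_nlp` and `example` are illustrative and not part of the notebook above.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
blank_nlp = spacy.blank("en")
example = blank_nlp("My name is Ross and I come from Scotland")

print(len(example))        # number of tokens in the Doc
print(example[3].text)     # a single Token, accessed by position
print(example[6:9].text)   # slicing returns a Span rather than a plain list
print([token.i for token in example])  # each token knows its own index in the Doc
```

The same indexing and slicing should work on a Doc produced by a full pipeline such as `en_core_web_sm`; the difference is that the tokens then also carry annotations like part-of-speech tags.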

In [7]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)


My 		 PRON 		 poss 		 Number=Sing|Person=1|Poss=Yes|PronType=Prs
name 		 NOUN 		 nsubj 		 Number=Sing
is 		 AUX 		 ROOT 		 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
Ross 		 PROPN 		 attr 		 Number=Sing
and 		 CCONJ 		 cc 		 ConjType=Cmp
I 		 PRON 		 nsubj 		 Case=Nom|Number=Sing|Person=1|PronType=Prs
come 		 VERB 		 conj 		 Tense=Pres|VerbForm=Fin
from 		 ADP 		 prep 		 
Scotland 		 PROPN 		 pobj 		 Number=Sing
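
The labels in this output are fairly terse. As a small sketch, `spacy.explain` can be used to look up a short description of a label in spaCy's built-in glossary. The list `labels` is just a hand-picked selection from the output above.

```python
import spacy

# A few labels copied from the output above; spacy.explain looks each one up
# in spaCy's glossary and returns a short human-readable description.
labels = ["PROPN", "AUX", "ADP", "nsubj", "pobj"]
for label in labels:
    print(label, "\t", spacy.explain(label))
```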


We can also visualise certain aspects of the linguistic structure of the sentence, such as the dependency relations between individual words:

In [8]:
spacy.displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Experimenting

- Try one of the language models available from ```spaCy``` for another language you know (Danish, Dutch, Chinese, Portuguese, or any other).
    - How does ```spaCy``` perform? 
    - Are all the same features available for all languages?

## Task

- In the shared data drive, there is a file called ```News_Category_Dataset_v2.json```. This is taken from [this Kaggle exercise](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and comprises some 200k news headlines from [HuffPost](https://www.huffpost.com/). The data is in *JSON Lines* format, with one JSON object per row. You can load this data into ```pandas``` in the following way:

```python
import pandas as pd
data = pd.read_json(filepath, lines=True)  # filepath: location of the JSON file on the data drive
```

- Select a couple of sub-categories of news data and use ```spaCy``` to find the **relative frequency per 10k words** of each of the following word classes: NOUN, VERB, ADJ (adjective), ADV (adverb)
    - Save the results as a CSV file (again using ```pandas```)
    - Are there any differences in the distributions?
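
To clarify what "relative frequency per 10k words" means, here is a minimal sketch of the arithmetic only: the count of a word class, divided by the total number of tokens, multiplied by 10,000. The function name `relative_frequencies` and the toy tag list are hypothetical. The sketch also assumes that every token counts as a "word"; you may prefer to filter out punctuation first.

```python
from collections import Counter

def relative_frequencies(pos_tags, targets=("NOUN", "VERB", "ADJ", "ADV"), per=10_000):
    """Return the number of occurrences of each target tag per `per` tokens."""
    counts = Counter(pos_tags)
    total = len(pos_tags)
    if total == 0:
        # Avoid division by zero for an empty selection of texts.
        return {tag: 0.0 for tag in targets}
    return {tag: counts[tag] / total * per for tag in targets}

# Hypothetical input: in practice these would be the token.pos_ values
# collected from every headline in one news category.
toy_tags = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "VERB", "ADP", "PROPN"]
toy_result = relative_frequencies(toy_tags)
print(toy_result)
```

Normalising by the total token count is what makes categories of different sizes comparable; raw counts would mostly reflect how many headlines each category contains.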