# Introduction

Process a text corpus using the [NLTK](https://www.nltk.org/) or [spaCy](https://spacy.io/) library (ideally both). In particular, do the following:
- Load the `harry_potter` book. You can find this text corpus in the datasets folder.
- Segment the text of the book into sentences. How many sentences does this book have?
- Compute the frequency of each token in the book. What are the most frequent tokens?
- Choose a sentence from the book and analyze it by:
    - Calculating all [n-grams](https://en.wikipedia.org/wiki/N-gram).
    - Finding [POS tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) of tokens.
    - [Stemming](https://en.wikipedia.org/wiki/Stemming) and [lemmatizing](https://en.wikipedia.org/wiki/Lemmatisation) tokens.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

### Import libraries

In [132]:
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/adolfomytr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/adolfomytr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/adolfomytr/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/adolfomytr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Load text file

In [133]:
with open('/Users/adolfomytr/Documents/Alemania/Master/GISMA/Materias/natural_language_processing/harry_potter.txt', 'r') as file:
    harry_potter = file.read()

#print(harry_potter)

### Separate into sentences and count them

In [134]:
sentences = sent_tokenize(harry_potter)
num_sentences = len(sentences)
num_sentences

6394
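
The sentence count above depends on the tokenizer: `sent_tokenize` uses NLTK's pretrained Punkt model, which is trained to recognise many abbreviations. As a rough illustration of why that matters, the sketch below uses a naive regular-expression splitter (an assumption for comparison, not what NLTK does internally) on a made-up example string, and it wrongly breaks after "Mr.".

```python
import re

# Made-up example text, not taken from the book
sample_text = "Mr. Dursley was proud. He said so."

# Naive rule: split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", sample_text)

print(naive_sentences)
print(len(naive_sentences))
```

The naive rule reports three pieces for a text with two sentences, so a count produced this way would differ from the Punkt-based one.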

### Compute the most frequent tokens in the book

In [135]:
#Extract the words from the text file
tokens = nltk.word_tokenize(harry_potter)

#Convert all words into lowercase
tokens = [word.lower() for word in tokens]

# Remove any punctuation and other non-alphabetic characters
tokens = [word for word in tokens if word.isalpha()]

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

#Create dictionary with frequency of the words
freq = {}
for t in tokens:
    if t in freq:
        freq[t] += 1
    else:
        freq[t] = 1


#Convert dictionary to dataframe to print the top words
freq = pd.DataFrame(freq, index=['word_count']).transpose()
freq = freq.sort_values('word_count', ascending=False)
freq.head(10)

| | word_count |
|---|---|
| harry | 1324 |
| said | 794 |
| ron | 429 |
| hagrid | 369 |
| could | 303 |
| hermione | 270 |
| back | 261 |
| one | 256 |
| got | 206 |
| like | 194 |
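
The manual counting loop in the cell above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on a toy token list (`sample_tokens` is a hypothetical stand-in for the cleaned `tokens` list, not data from the book).

```python
from collections import Counter

# Toy token list standing in for the cleaned `tokens` from the cell above
sample_tokens = ["harry", "said", "ron", "harry", "hagrid", "said", "harry"]

# Counter tallies each token; most_common(k) returns the k highest counts
sample_freq = Counter(sample_tokens)
top_two = sample_freq.most_common(2)

print(top_two)
```

Applied to the real `tokens` list, `Counter(tokens).most_common(10)` should give the same ranking as the sorted DataFrame without needing pandas, although the order of tokens with equal counts may differ.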


### Choose a sentence from the file and analyze it

In [136]:
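#Select a sentence (here the second sentence of the book, not a random one)
random_sentence = sentences[1]

#Calculate n-grams of order 1 up to n (unigrams and bigrams for n = 2)
n = 2

ngram_list = []
for i in range(1, n+1):
    #ngrams() returns a generator of i-grams over the whitespace-split words
    ngram_list += list(ngrams(random_sentence.split(), i))
#print(ngram_list)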

#Finding POS tags of tokens
tokens_rs = nltk.word_tokenize(random_sentence)
pos_tags = nltk.pos_tag(tokens_rs)
#print(pos_tags)

#Stemming tokens
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens_rs]
#print(stemmed_tokens)

#Lemmatizing tokens
lemmatizer = WordNetLemmatizer()
#Map the first letter of a Penn Treebank tag to a WordNet POS tag
#(adjective tags start with 'J', e.g. JJ, so they need an explicit entry)
tag_map = {'J': wordnet.ADJ, 'R': wordnet.ADV, 'N': wordnet.NOUN, 'V': wordnet.VERB}
lemmatized_tokens = []
for token, tag in pos_tags:  # reuse the tags computed in sentence context above
    wn_pos = tag_map.get(tag[0], wordnet.NOUN)  # default to noun for any other tag
    lemma = lemmatizer.lemmatize(token, pos=wn_pos)
    lemmatized_tokens.append(lemma)
#print(lemmatized_tokens)
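
The n-gram step above stops at n = 2, whereas the task asks for all n-grams of the sentence. A minimal pure-Python sketch of the idea follows; `all_ngrams` and `example_tokens` are hypothetical names used only for illustration, and NLTK's `nltk.util.everygrams` offers similar functionality.

```python
def all_ngrams(tokens, max_n=None):
    """Return every n-gram of `tokens` for n = 1 .. max_n (default: len(tokens))."""
    if max_n is None:
        max_n = len(tokens)
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[start:start + n]))
    return grams


# Made-up example tokens, not the sentence selected in the notebook
example_tokens = ["the", "boy", "who", "lived"]
example_grams = all_ngrams(example_tokens)

print(example_grams)
```

A sentence with k tokens has k(k+1)/2 n-grams in total, so the four-token example yields ten tuples. For long sentences it is usually reasonable to cap `max_n` at a small value.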