## Supervised classification

### Features for classification
- With conditional random fields, we saw how features are used in several sequence-labeling problems, training on random feature templates.
- If the domain is known, we can devise strategies to identify the features explicitly and train on those.

```python
import random

import nltk
from nltk.corpus import names

# Label each name with its gender, then shuffle so both classes are mixed.
labeled_names = ([(name, 'male') for name in names.words('male.txt')]
                 + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
featureset = [(gender_features(n), gender) for (n, gender) in labeled_names]
# Hold out the first 500 examples for testing and train on the rest.
train_set, test_set = featureset[500:], featureset[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
```

**gender_features(name)** extracts features from the name.

## GENDER_FEATURE example

What the extracted features consist of:
- A dictionary that maps each feature (key) to a value
- As a rule, the values are simple types (booleans, integers, characters, short strings, ...), while the keys can change from task to task
```python
def gender_features(word):
  return { 'last_letter': word[-1] }
```

Watch the size of the dictionary: as it grows, so does the risk of overfitting.
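To make the size concern concrete, here is a minimal sketch that contrasts a one-feature extractor with a much larger one. Both function names (`gender_features_small`, `gender_features_large`) and the choice of per-letter features are assumptions made for illustration; they are not part of the original notes.

```python
import string


def gender_features_small(word):
    # A single feature: the last letter of the name.
    return {"last_letter": word[-1].lower()}


def gender_features_large(word):
    # Hypothetical, deliberately large extractor: first and last letter,
    # plus a count and a presence flag for every letter of the alphabet.
    lowered = word.lower()
    features = {"first_letter": lowered[0], "last_letter": lowered[-1]}
    for letter in string.ascii_lowercase:
        features[f"count({letter})"] = lowered.count(letter)
        features[f"has({letter})"] = letter in lowered
    return features


print(len(gender_features_small("John")))  # 1 feature
print(len(gender_features_large("John")))  # 2 + 26 * 2 = 54 features
```

With a small training set, many of the 54 features will be observed only a handful of times, so a classifier may latch onto idiosyncrasies of the training names. Comparing both extractors on a held-out set is one way to check whether the extra features actually help.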

# Exercises

- Try defining features and implementing classifiers for the following tasks
  - Segmentation (training via sents)
  - Sentiment (corpus from nltk.corpus.nps_chat)

- Evaluate each one using accuracy, precision, recall, and f-measure
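As a reminder of what the four metrics measure, the sketch below computes them by hand from two label lists. The helper name `evaluate` and the toy labels are assumptions for illustration only; accuracy is taken over all labels, while precision, recall and f-measure are computed for one chosen label.

```python
def evaluate(gold, predicted, positive):
    """Accuracy over all labels, plus precision/recall/F1 for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hand-made labels, not produced by any classifier in these notes.
toy_gold = ["B", "I", "I", "O", "B", "I"]
toy_predicted = ["B", "I", "O", "O", "I", "I"]
scores = evaluate(toy_gold, toy_predicted, positive="B")
print(scores)
```

Here one of the two gold `B` labels is found and no `B` is predicted wrongly, so precision is 1.0 while recall is 0.5; accuracy alone (4 of 6 correct) would hide that difference.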

In [1]:
from nltk.corpus import brown
import random
from nltk import NaiveBayesClassifier, pos_tag
from itertools import chain

In [2]:
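def extract_features(word: str) -> dict:
  return {
    # The word is tagged in isolation, so pos_tag has no sentence context.
    "pos_tag": pos_tag([word])[0][1],
    "is_title": word.istitle(),
    # True for a single sentence-final or clause-final punctuation token.
    "ending_punct": word in {".", "?", "!", ";", ":"}
  }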

def train_test_split(dataset, pivot: float):
  pivot = round(min(1, max(pivot, 0))*len(dataset))
  return dataset[:pivot], dataset[pivot:]


corpus = brown
n_categories = 2
# Sample without replacement so the chosen categories are distinct.
categories = random.sample(brown.categories(), k=n_categories)

print("Using {} categories".format(n_categories))
print("Categories chosen: {}".format(categories))

tagged_words = list(chain.from_iterable(corpus.tagged_sents(tagset='universal')))

def tag(word):
  match word[1]:
    case 'DET': return 'B'
    case '.': return 'O'
    case _: return 'I'

BIO = [(word, tag((word, t))) for (word, t) in tagged_words]


feature_set = [(extract_features(word), t) for (word, t) in BIO]




Using 2 categories
Categories chosen: ['religion', 'mystery']
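The labelling scheme built by `tag` (a determiner opens a segment, a punctuation tag closes it, everything else is inside) can be checked on a hand-written sentence. In this sketch the helper name `bio_label` and the tagged pairs are assumptions for illustration and are not taken from the Brown corpus.

```python
def bio_label(universal_tag: str) -> str:
    """Map a universal POS tag to the B/I/O labels used in this notebook."""
    if universal_tag == "DET":
        return "B"  # a determiner opens a segment
    if universal_tag == ".":
        return "O"  # punctuation closes it
    return "I"      # any other tag is inside a segment


# Hand-written (word, universal tag) pairs.
tagged = [
    ("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", "."),
    ("A", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "."),
]

bio = [(word, bio_label(t)) for word, t in tagged]
print(bio)
```

Note that the universal tagset uses `.` for all punctuation, so commas and quotation marks also receive the `O` label under this scheme.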


In [3]:
train_set, test_set = train_test_split(feature_set, .5)
classifier = NaiveBayesClassifier.train(train_set)


In [4]:

classifier.classify(extract_features('.'))


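def segments(text):
  # Collect spans that start at a word labelled 'B' and run through the next 'O'.
  result = []
  i = 0
  while i < len(text):
    label = classifier.classify(extract_features(text[i]))
    if label != 'B':
      i += 1
      continue
    start = i
    i += 1
    # Consume words up to and including the closing 'O' (or the end of the text).
    while label != 'O' and i < len(text):
      label = classifier.classify(extract_features(text[i]))
      i += 1
    # i already points past the segment, so the next word is not skipped.
    result.append(text[start:i])

  return result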

text = corpus.words(categories=categories)[:1000]
segs = segments(text)

for seg in segs:
  print(seg)


['a', 'result', ',']
['this', 'distinction', ',']
['the', 'meaning', 'of', 'the', 'basic', 'terms', 'employed', '.']
['The', 'terms', 'are', 'generally', 'taken', 'for', 'granted', 'as', 'though', 'they', 'referred', 'to', 'direct', 'and', 'axiomatic', 'elements', 'in', 'the', 'common', 'experience', 'of', 'all', '.']
['the', 'contemporary', 'context', 'this', 'is', 'precisely', 'what', 'one', 'must', 'not', 'do', '.']
['the', 'modern', 'world', 'neither', '``']
['any', 'generally', 'agreed-upon', 'elements', 'of', 'experience', '.']
['a', 'transitional', 'stage', 'in', 'which', 'many', 'of', 'the', 'connotations', 'of', 'former', 'usage', 'have', 'had', 'to', 'be', 'revised', 'or', 'rejected', '.']
['the', 'words', 'are', 'used', ',']
['which', 'of', 'the', 'traditional', 'meanings', 'the', 'user', 'may', 'have', 'in', 'mind', ',']
['his', 'revisions', 'and', 'rejections', 'of', 'former', 'understandings', 'correspond', 'to', 'ours', '.']
['the', 'most', 'widespread', 'features', 'of'

In [5]:
from nltk.metrics import ConfusionMatrix

gold = [label for (_, label) in test_set]
predicted = [classifier.classify(feature) for (feature, _) in test_set]
labels = set(gold)
# Confusion matrix over the held-out examples, followed by per-label scores.
cm = ConfusionMatrix(gold, predicted)
print(cm)
{label: {func.__name__: func(label) for func in [cm.precision, cm.recall, cm.f_measure]} for label in labels}

In [None]:
classifier.most_informative_features()

[('pos_tag', ','),
 ('pos_tag', ':'),
 ('pos_tag', 'NN'),
 ('pos_tag', 'WDT'),
 ('pos_tag', 'VB'),
 ('pos_tag', 'PRP'),
 ('pos_tag', 'CC'),
 ('pos_tag', 'DT'),
 ('pos_tag', 'PRP$'),
 ('pos_tag', 'IN'),
 ('pos_tag', 'WP'),
 ('is_title', True),
 ('is_title', False),
 ('ending_punct', False),
 ('pos_tag', "''"),
 ('pos_tag', '('),
 ('pos_tag', ')'),
 ('pos_tag', '.'),
 ('pos_tag', '``'),
 ('pos_tag', 'CD'),
 ('pos_tag', 'EX'),
 ('pos_tag', 'FW'),
 ('pos_tag', 'JJ'),
 ('pos_tag', 'JJR'),
 ('pos_tag', 'JJS'),
 ('pos_tag', 'LS'),
 ('pos_tag', 'MD'),
 ('pos_tag', 'NNP'),
 ('pos_tag', 'NNPS'),
 ('pos_tag', 'NNS'),
 ('pos_tag', 'RB'),
 ('pos_tag', 'RBR'),
 ('pos_tag', 'SYM'),
 ('pos_tag', 'TO'),
 ('pos_tag', 'UH'),
 ('pos_tag', 'VBD'),
 ('pos_tag', 'VBG'),
 ('pos_tag', 'VBN'),
 ('pos_tag', 'VBP'),
 ('pos_tag', 'VBZ'),
 ('pos_tag', 'WP$'),
 ('pos_tag', 'WRB')]