# Naive Bayes Exercise: Introduction

In this exercise we will create a model that tries to label previously unseen words as either Finnish or Spanish. To do so, you need to download the spanish.txt and suomi.txt datasets from ALUD.

These datasets contain text from Wikipedia, so we will need to filter and clean some characters before we can use them in our models.

Our word classification approach will be very simple. We are going to count how many times each letter of our alphabet (the combined Spanish and Finnish alphabets "abcdefghijklmnopqrstuvwxyzäöü-") appears in each word, and we are going to train our classifier on those counts. So our examples will have 30 features, one for each character of the alphabet, and each value will be the number of occurrences of that character in the word. For example, if our word is "aabb", the features will be "[2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]".
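As a quick check of the idea, here is a minimal sketch that counts the alphabet characters in a single word. The names `ALPHABET` and `letter_counts` are made up for this illustration and are not part of the exercise; the alphabet string is the one quoted above, which has 30 characters (26 basic letters, ä, ö, ü and the hyphen).

```python
# Alphabet quoted in the text: 26 basic letters, ä, ö, ü and the hyphen.
ALPHABET = "abcdefghijklmnopqrstuvwxyzäöü-"

def letter_counts(word):
    """Return one count per alphabet character, in alphabet order."""
    return [word.count(ch) for ch in ALPHABET]

vector = letter_counts("aabb")
print(len(ALPHABET))  # number of features per word
print(vector[:4])     # counts for 'a', 'b', 'c', 'd'
```

Characters outside the alphabet simply contribute nothing to the vector, which is why the words have to be cleaned or filtered before counting.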


## Making the data available

First you need to make the datasets available in Colab. There are multiple ways to do this (https://neptune.ai/blog/google-colab-dealing-with-files); for this example, we will use the upload option in the file explorer.





# PART 1: Getting the words from the files

Write a function get_words that recovers the words from the files.

In [1]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

In [4]:

# our valid characters
dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','ä','ö','ü','-']

def get_words(filename):
  """Return every space-separated token found in the file."""
  finalwords = []
  with open(filename, encoding="UTF-8") as f:
      for line in f:
          # split on single spaces only, so tokens may still contain '\n'
          finalwords.extend(line.split(' '))
  return finalwords

#sanity check
words = get_words("../../../data/03_c_spanish.txt")
print(words)



['Los', 'pingüinos', '(Spheniscidae)', 'son', 'una', 'familia', 'de', 'aves,', 'la', 'única', 'del', 'orden', 'Sphenisciformes.', 'Son', 'aves', 'marinas,', 'no', 'voladoras,', 'que', 'se', 'distribuían', 'casi', 'exclusivamente', 'en', 'el', 'hemisferio', 'sur,', 'exceptuando', 'el', 'pingüino', 'de', 'las', 'islas', 'Galápagos', '(Spheniscus', 'mendiculus).', 'El', 'nombre', 'del', 'orden', 'proviene', 'del', 'vocablo', 'spheniscus', 'el', 'cual', 'proviene', 'del', 'griego', 'σφήν', '(sphen,', "'cuña')", 'y', 'el', 'sufijo', 'diminutivo', '-iscus,', 'literalmente', '"cuñita",', 'haciendo', 'referencia', 'a', 'su', 'forma', 'hidrodinámica', 'al', 'nadar.1\u200b', 'Se', 'reconocían', 'al', 'menos', 'dieciocho', 'especies', 'vivas', 'agrupadas', 'en', 'seis', 'géneros,', 'aunque', 'actualmente', 'se', 'encuentran', 'extintas.2\u200b\n', '\n', 'Los', 'primeros', 'europeos', 'en', 'observar', 'a', 'estas', 'aves', 'fueron', 'miembros', 'de', 'la', 'primera', 'expedición', 'de', 'Vasco', 

# PART 2: Filter invalid chars from the words

Write a function filter_valid_chars that removes invalid characters (',', '.', '(', ...) from our words.


In [6]:

def filter_valid_chars(word):
  word = word.translate({ord(c): None for c in '1234567890!"·$%&/()=?¿,.;:_[]{}|@#~€¬/*'})
  return word

#sanity check
print(filter_valid_chars("12hola@34"))

hola
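filter_valid_chars removes a fixed list of symbols, so any character missing from that list (for example the zero-width space '\u200b' visible in the get_words output) passes through unchanged. One possible alternative, sketched here under the assumption that we only ever want characters from the exercise alphabet, is to keep the allowed characters instead of listing the forbidden ones. The name `keep_alphabet_chars` is hypothetical.

```python
# Allowed characters, taken from the alphabet used throughout the exercise.
ALPHABET = set("abcdefghijklmnopqrstuvwxyzäöü-")

def keep_alphabet_chars(word):
    """Lowercase the word and drop every character outside ALPHABET."""
    return "".join(ch for ch in word.lower() if ch in ALPHABET)

print(keep_alphabet_chars("12Hola@34"))
print(keep_alphabet_chars("aves,"))
```

Note that this also drops accented letters that are not in the alphabet, so "única" would become "nica". Whether that is acceptable is a design choice for the exercise, not something this sketch decides.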


# PART 3: Preparing our features
Write a function get_features that takes a list of words as its parameter. It should return a feature matrix of shape (n, 30), where n is the number of elements in the input list. There should be one feature for each of the characters in the following alphabet: "abcdefghijklmnopqrstuvwxyzäöü-". The values should be the number of times the corresponding character appears in the word.

Example:


*   Input: 'aabb'
*   Output: [2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.  0. 0. 0. 0. 0. 0. 0.]



In [8]:
def get_features(words):
  #YOUR CODE GOES HERE
  # get the number of words
  n = len(words)
  # create a matrix of zeros with the shape of n and the number of characters in the dictionary
  feature_matrix = np.zeros((n, len(dictionary)))
  
  # iterate over the words
  for i, word in enumerate(words):
      # iterate over the characters in the word
      for char in word:
          # if the character is in the dictionary, increment the corresponding feature
          if char in dictionary:
              idx = dictionary.index(char)
              feature_matrix[i][idx] += 1
  return feature_matrix

#sanity check
final_matrix=get_features(['aaaa', 'bbbb'])
print(final_matrix)
final_matrix.shape

[[4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]
 [0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0.]]


(2, 30)

# PART 4: Getting the features and data for each language

Write a function get_features_and_labels that returns the tuple (X, y) of the feature matrix and the target vector. Use the labels 0 and 1 for Finnish and Spanish, respectively. Use the functions load_finnish() and load_spanish() to get the lists of words. Filter the lists in the following ways:

1.   Convert the Finnish words to lowercase, and then filter out those words that contain characters that don't belong to the alphabet. Do the same for the Spanish words.
2.   Use the get_features function you made earlier to form the feature matrix.







In [21]:
def load_spanish():
    #YOUR CODE GOES HERE
    X = get_words("../../../data/03_c_spanish.txt")
    print("Spanish total words: ", len(X))
    X = [word.lower() for word in X]
    filtered_words = []
    for word in X:
        valid_word = True
        for char in word:
            if char not in dictionary:
                valid_word = False
                break
        if valid_word:
            filtered_words.append(word)
    X = [filter_valid_chars(word) for word in filtered_words]
    X = [word for word in X if len(word) > 0]
    X = list(set(X))
    print("Spanish total filtered words: ", len(X))
    matrix = get_features(X)
    y = np.ones(len(X))
    return matrix, y
    
def load_finnish():
    #YOUR CODE GOES HERE
    X = get_words("../../../data/03_c_suomi.txt")
    print("Finnish total words: ", len(X))
    X = [word.lower() for word in X]
    filtered_words = []
    for word in X:
        valid_word = True
        for char in word:
            if char not in dictionary:
                valid_word = False
                break
        if valid_word:
            filtered_words.append(word)
    X = [filter_valid_chars(word) for word in filtered_words]
    X = [word for word in X if len(word) > 0]
    X = list(set(X))
    print("Finnish total filtered words: ", len(X))
    matrix = get_features(X)
    y = np.zeros(len(X))
    return matrix, y


#sanity check
print("***SPANISH***")
X, y = load_spanish()
print("Spanish Data shape: ")
print(X.shape)
print("Spanish Labels shape: ")
print(y.shape)
print("***FINNISH***")
X, y = load_finnish()
print("Finnish data shape: ")
print(X.shape)
print("Finnish labels shape: ")
print(y.shape)



***SPANISH***
Spanish total words:  14758
Spanish total filtered words:  2370
Spanish Data shape: 
(2370, 30)
Spanish Labels shape: 
(2370,)
***FINNISH***
Finnish total words:  8300
Finnish total filtered words:  3106
Finnish data shape: 
(3106, 30)
Finnish labels shape: 
(3106,)


# PART 5: Put the dataset together

Merge the Spanish and Finnish data.

In [22]:
def get_features_and_labels():
    #YOUR CODE GOES HERE
    X_spanish, y_spanish = load_spanish()
    X_finnish, y_finnish = load_finnish()
    X = np.concatenate((X_spanish, X_finnish))
    y = np.concatenate((y_spanish, y_finnish))
    return X, y

#sanity check
X, y = get_features_and_labels()
print("Data shape: ")
print(X.shape)
print("Labels shape: ")
print(y.shape)


Spanish total words:  14758
Spanish total filtered words:  2370
Finnish total words:  8300
Finnish total filtered words:  3106
Data shape: 
(5476, 30)
Labels shape: 
(5476,)


# PART 6: And now the machine learning part
We have already prepared our data; now it's time to train our model.

We have earlier seen examples where we split the data into a training part and a testing part. This way we can test whether the model can really be used to predict unseen data. However, we may have bad luck and get a split that produces very biased training and test sets. To counter this, we can perform the split several times and take the average over the different splits as the final result. This is called cross validation.
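To make the splitting idea concrete, the following is a minimal NumPy-only sketch of how k-fold cross validation partitions sample indices. The helper `k_fold_indices` is hypothetical and written only for illustration; in the exercise itself scikit-learn does this for us.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)        # random order of all indices
    folds = np.array_split(shuffled, n_splits)   # n_splits nearly equal parts
    for i, test_idx in enumerate(folds):
        # everything that is not in the current fold is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

A model would be trained and scored once per pair, and the final score would be the mean of those per-fold scores.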

Create a word_classification function that does the following:

1.   Use the function get_features_and_labels you made earlier to get the feature matrix and the labels. Use multinomial naive Bayes to do the classification. Get the accuracy scores using the sklearn.model_selection.cross_val_score function; use 5-fold cross validation. The function should return a list of five accuracy scores.
2.   The cv parameter of cross_val_score can be either an integer, which specifies the number of folds, or a cross-validation generator that generates the (train set, test set) pairs. What happens if you pass the following cross-validation generator to cross_val_score as a parameter: sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0)?

Play with the different parts of the solution (features, different naive Bayes algorithms, hyperparameters, ...) to try to improve the accuracy. Try to beat my result: Average accuracy: 0.883658160188016
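As a starting point for the second item, here is a small sketch that passes both kinds of cv argument to cross_val_score. It uses synthetic count data (`X_toy`, `y_toy`, made up for this illustration) instead of the real word features, stacked one class after the other in the same way the Spanish and Finnish data are concatenated. With an integer cv and a classifier, scikit-learn uses stratified folds without shuffling; the KFold generator below shuffles the samples but does not stratify.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic stand-in for the letter counts: two classes with different
# typical counts, stored one class after the other.
X_toy = np.vstack([rng.poisson([3, 1, 1], size=(50, 3)),
                   rng.poisson([1, 1, 3], size=(50, 3))])
y_toy = np.concatenate([np.ones(50), np.zeros(50)])

# Integer cv: five stratified folds, no shuffling.
scores_int = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=5)

# Generator cv: five shuffled, non-stratified folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores_kfold = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=cv)

print("cv=5:          ", scores_int, scores_int.mean())
print("shuffled KFold:", scores_kfold, scores_kfold.mean())
```

The same two calls can be tried on the real X and y to compare the per-fold accuracies.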

In [26]:
X, y = get_features_and_labels()

#YOUR CODE GOES HERE
def word_classification(X, y):
    accs = cross_val_score(MultinomialNB(), X, y, cv=5)
    return accs

accs = word_classification(X, y)

print("Model accuracy per fold:", accs)
print("Average accuracy:", sum(accs)/len(accs))


Spanish total words:  14758
Spanish total filtered words:  2370
Finnish total words:  8300
Finnish total filtered words:  3106
Model accuracy per fold: [0.85310219 0.89680365 0.88401826 0.87762557 0.87579909]
Average accuracy: 0.8774697530246975
