# Ex 1

Start from the function below which, given a list of tokenized sentences (obtained with the Plainstream module), returns a dictionary with the frequencies of the bigrams present in the text. Modify the function so that it accepts a second optional argument: no_stopwords. If no_stopwords == True, make sure that the resulting bigrams do not contain any stopword. You can use the stop_words_english.txt file in the data folder to get a list of the main English stopwords. Try to modularize your code as much as possible by creating smaller functions which can be called within the main function. Try to restructure the whole code in this perspective.

In [22]:
import re
def bigram_generator(inlist, no_punct = True):
    out = {}
    for sentence in inlist:
        for i,token in enumerate(sentence):
            try:
                w1 = token
                w2 = sentence[i+1]
                bigram = f"{w1}_{w2}".lower()
                # checking the value of the default argument
                if no_punct:
                    # reject punctuation
                    if re.search(r"\W", bigram):
                        pass

                    else:
                        if bigram in out:
                            out[bigram] += 1
                        else:
                            out[bigram] = 1
                # we count bigrams with punctuation if no_punct==False   
                else:
                    if bigram in out:
                        out[bigram] += 1
                    else:
                        out[bigram] = 1
                    
            except IndexError: # we take care here of the index error
                pass
    return out

# Ex 2

Let's continue working on the previous function. Add to the function an argument, threshold, which is an integer with a default value of 3. Modify the function so that the output dictionary contains only bigrams with frequencies above the threshold specified in the function parameters. 

Then write a second function that takes as input the tokenized source text and the list of bigrams whose frequency is greater than the threshold. The function's task is to replace the pairs of tokens with the corresponding bigrams passed as input. Thus, the output of the function will be the tokenized text with the bigrams whose frequency is greater than the threshold in place of the corresponding original tokens.
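To make the expected behaviour of the second function concrete, here is a minimal sketch of one possible approach. The name merge_bigrams, the greedy left-to-right matching, and the toy sentences are assumptions made for illustration only; the bigram format (lowercase tokens joined by an underscore) follows the function from Ex 1.

```python
def merge_bigrams(sentences, frequent_bigrams):
    """Replace adjacent token pairs with their bigram when it is frequent.

    Hypothetical helper: greedy, left to right, so a token is used
    in at most one merged bigram.
    """
    frequent = set(frequent_bigrams)  # set lookup is faster than list lookup
    merged_text = []
    for sentence in sentences:
        merged = []
        i = 0
        while i < len(sentence):
            if i + 1 < len(sentence):
                candidate = f"{sentence[i]}_{sentence[i + 1]}".lower()
                if candidate in frequent:
                    merged.append(candidate)
                    i += 2  # skip both tokens of the bigram
                    continue
            merged.append(sentence[i])
            i += 1
        merged_text.append(merged)
    return merged_text


# toy input, invented for the example
text = [["I", "love", "New", "York"], ["New", "York", "is", "big"]]
result = merge_bigrams(text, ["new_york"])
print(result)
```

Other designs are equally valid, for instance keeping the original casing of the merged tokens or handling overlapping bigrams differently.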

# Arrays & Matrices with NumPy

In the various courses you are attending, you may have come across the concepts of vector and matrix. For simplicity's sake, we can see a vector as a one-dimensional array, and a matrix as a collection of vectors. In computer science, the term array refers to a data structure made up of a collection of elements, each of which is identified by an index or a key. In our specific case, the elements of an array will be numbers.

So far we have used lists to create this particular data structure. However, in Python there are more efficient ways to construct and work with vectors and matrices. One of these is the NumPy module.

In [None]:
! pip install numpy
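As a small illustration of why arrays are more convenient than lists for numerical work, the sketch below doubles every value of a vector in both ways. The variable names are invented for the example, and np.array is introduced properly in the next section.

```python
import numpy as np

values = [1, 2, 3, 4]

# With a list, the * operator repeats the list instead of multiplying its values
repeated = 2 * values

# To multiply each value we need an explicit loop or comprehension
doubled_list = [2 * v for v in values]

# With a NumPy array, arithmetic is applied element by element
doubled_array = 2 * np.array(values)

print(repeated)
print(doubled_list)
print(doubled_array)
```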

### Array Creation

You can create an array from a regular Python list or tuple using the <b>array</b> function.

In [24]:
import numpy as np
a = np.array([2, 3, 4])
a

array([2, 3, 4])

In [25]:
# check the type of the elements in the array
a.dtype

dtype('int32')

The elements of an array can also be floats. Create a small list of floats, convert it into an array with np.array(), and print both the array and the type of its elements.

### Matrix Creation

As said before, a matrix is a collection of one-dimensional arrays. A list of tuples or a list of lists can easily be converted into a matrix.

In [26]:
c = np.array([(1.5, 2, 3), (4, 5, 6)])
c

array([[1.5, 2. , 3. ],
       [4. , 5. , 6. ]])

In [27]:
d = np.array([[1, 2], [3, 4]])
d

array([[1, 2],
       [3, 4]])
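The two matrices above differ in the type of their elements: a single float in the input is enough for NumPy to store every value as a float. The short check below is a sketch that rebuilds c and d and inspects their dtype; the exact integer type of d (int32 or int64) depends on the platform, so only its kind is tested.

```python
import numpy as np

c = np.array([(1.5, 2, 3), (4, 5, 6)])  # one float among the values
d = np.array([[1, 2], [3, 4]])          # integers only

# every element of c was converted to float
print(c.dtype)

# d keeps an integer type; its size depends on the platform
print(np.issubdtype(d.dtype, np.integer))
```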

### Indexing

The indexing system in numpy is very similar to the indexing system in lists.

In [28]:
# getting the second element from the a array
el1 = a[1]
el1

3

In [29]:
# getting the second element in the first row of d matrix
el2 = d[0][1]
el2

2
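Besides chained brackets, NumPy also accepts a comma-separated index, as well as slices and negative indices as in lists. The sketch below rebuilds the d matrix and shows a few of these forms; the variable names are invented for the example.

```python
import numpy as np

d = np.array([[1, 2], [3, 4]])

same_element = d[0, 1]  # row 0, column 1: equivalent to d[0][1]
first_column = d[:, 0]  # all rows, column 0
last_row = d[-1]        # negative indices count from the end, as in lists

print(same_element)
print(first_column)
print(last_row)
```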

### np.zeros() & np.ones()

You can use the np function zeros() or ones() to create an array whose values are all zeros or all ones. The argument of these functions is the array/matrix shape. The shape of a two-dimensional matrix is a tuple whose first element represents the number of rows, and the second one the number of columns.

For example:
- shape = (3,4) --> 3 rows and 4 columns;
- shape = (1,4) --> 1 row and 4 columns (vector)

In [30]:
e = np.zeros((1,4))
print(f'{e}\n{e.dtype}')

[[0. 0. 0. 0.]]
float64


In [31]:
f = np.zeros((1,4), dtype=np.int16)
f

array([[0, 0, 0, 0]], dtype=int16)

In [32]:
g = np.ones((1,), dtype=np.int16)
g

array([1], dtype=int16)

In [33]:
# matrix of ones
h = np.ones((3,4))
h

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
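The shape of an existing array can be read back from its shape attribute, and ndim gives the number of dimensions. The sketch below (with invented variable names) compares a matrix with 1 row and 4 columns to a plain one-dimensional vector of 4 elements, which is what you get when the shape is a single integer.

```python
import numpy as np

row_matrix = np.zeros((1, 4))  # two-dimensional: 1 row, 4 columns
flat_vector = np.zeros(4)      # one-dimensional: 4 elements

print(row_matrix.shape, row_matrix.ndim)
print(flat_vector.shape, flat_vector.ndim)
```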

<b>EX</b>.
- Create an array of 7 elements whose values are integers all equal to 1
- Create a matrix with 6 rows and 3 columns whose values are floats all equal to 1

### Basic Operations

In [34]:
a1 = np.array([3,5], dtype=np.int16)
b1 = np.array([2,4], dtype=np.int16)

- EXPONENTIATION: $(a_1,b_1)^2 = (a_1^2,b_1^2)$

In [35]:
exp = a1**2
exp

array([ 9, 25], dtype=int16)

- SQUARE-ROOT: $\sqrt{a} = (\sqrt{a_1}, \sqrt{a_2}, ... , \sqrt{a_n})$

In [36]:
sqr = np.sqrt(b1)
sqr

array([1.4142135, 2.       ], dtype=float32)

- ADDITION: $(a_1,b_1) + (a_2,b_2) = (a_1+a_2,b_1+b_2)$

In [37]:
c1 = a1 + b1
c1

array([5, 9], dtype=int16)

- SUBTRACTION: $(a_1,b_1) - (a_2,b_2) = (a_1-a_2,b_1-b_2)$

In [38]:
c2 = a1 - b1
c2

array([1, 1], dtype=int16)

- SCALAR MULTIPLICATION: $k \cdot (a,b) = (k \cdot a,k \cdot b)$

In [39]:
k = 3
c3 = k * a1
c3

array([ 9, 15], dtype=int16)

- DOT PRODUCT: $a \cdot b = \sum_{i=1}^{n} a_ib_i = a_1b_1 + a_2b_2 + ... + a_nb_n$

In [40]:
c4 = np.dot(a1,b1)
c4

26
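To connect the result with the formula, the dot product can also be computed step by step: multiply the two arrays element by element, then sum the products. The sketch below reuses the values of a1 and b1; the other variable names are invented for the example. The @ operator is an alternative spelling of np.dot for one-dimensional arrays.

```python
import numpy as np

a1 = np.array([3, 5], dtype=np.int16)
b1 = np.array([2, 4], dtype=np.int16)

elementwise = a1 * b1            # products of matching elements: 3*2 and 5*4
manual_dot = elementwise.sum()   # sum of the products
operator_dot = a1 @ b1           # same computation with the @ operator

print(elementwise)
print(manual_dot, operator_dot)
```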

# Ex 3: Euclidean distance and Cosine Similarity

One of the simplest ways of representing vectors is to place them in a geometric space. There are several measures for calculating the distance between two vectors. Among these, two very important measures in CL and NLP are the Euclidean Distance and the Cosine Similarity.

<img src="https://cmry.github.io/sources/eucos.png" width=500 height=300 />

d = Euclidean Distance

$\theta$ = angle between the two vectors (the Cosine Similarity is $\cos\theta$)

$EuclideanDistance(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

$CosineSimilarity(a,b) = \frac{\sum_{i=1}^{n} a_ib_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \times \sqrt{\sum_{i=1}^{n} b_i^2}}$
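
As a sanity check for your functions, here is a worked example computed by hand. The vectors $a = (3, 5)$ and $b = (2, 4)$ are the values of a1 and b1 from the Basic Operations section; the results are rounded to three decimals.

$EuclideanDistance(a,b) = \sqrt{(3-2)^2 + (5-4)^2} = \sqrt{2} \approx 1.414$

$CosineSimilarity(a,b) = \frac{3 \cdot 2 + 5 \cdot 4}{\sqrt{3^2 + 5^2} \times \sqrt{2^2 + 4^2}} = \frac{26}{\sqrt{34} \times \sqrt{20}} \approx 0.997$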

Given the two formulae, use the numpy module to write two functions that each take two vectors as input: one returns the Euclidean distance, the other the cosine similarity.