# Lecture 2: Numpy

The clever student will have recognized that all the images in the previous notebooks come from movies. Indeed, the data we are going to analyze is the so-called stardom network of the 500 most rated movies of 2018 according to IMDB.
![IMDB](l2_imdb.png)  
You can find the data in the 'data' folder of the repository.

But first, ...

## Numpy, your new best friend

[numpy](http://www.numpy.org/) is cool for a lot of reasons, but mostly because it is the Python module for working with big (or almost big) arrays, and it ships with many non-trivial mathematical functions.

In [1]:
import numpy as np

### numpy array, your new best friend

In [2]:
cacca=np.array([[1, 2, 3], [4,5,6]])

In [3]:
cacca

array([[1, 2, 3],
       [4, 5, 6]])

.size (an attribute, not a method: no parentheses)

In [4]:
cacca.size

6

.shape (an attribute, not a method: no parentheses)

In [5]:
cacca.shape

(2, 3)

##### Accessing elements

In [5]:
cacca[0]

array([1, 2, 3])

In [6]:
cacca[0,1]

2

In [7]:
cacca[:,0]

array([1, 4])

Operations act element by element!

In [8]:
cacca *2.

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

In [9]:
cacca %2

array([[1, 0, 1],
       [0, 1, 0]])

Mask

In [10]:
cacca %2==0

array([[False,  True, False],
       [ True, False,  True]])

In [11]:
cacca[cacca %2==0]

array([2, 4, 6])
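
Masks can also be combined. The following is a minimal sketch, reusing the same small array, of how two conditions can be joined with `&` and how a mask can be used to overwrite entries; the names `even_and_big` and `clipped` are only illustrative.

```python
import numpy as np

cacca = np.array([[1, 2, 3], [4, 5, 6]])

# Combine conditions with & (and) or | (or). The parentheses are required
# because & and | bind more tightly than the comparison operators.
even_and_big = (cacca % 2 == 0) & (cacca > 2)
selected = cacca[even_and_big]   # 1-d array with the entries that pass

# A mask can also appear on the left-hand side to modify entries in place.
clipped = cacca.copy()
clipped[clipped > 4] = 4

print(selected)
print(clipped)
```

Note that indexing with a mask always returns a flat, 1-d array, whatever the shape of the original array.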

##### Data types and structured data types

In [12]:
np.array([[1, 2, 3], [4,5,6]], dtype=float)

array([[1., 2., 3.],
       [4., 5., 6.]])

In [13]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=object)

array([[1.0, 'cacca', 42],
       [0, 'bad', 1.4]], dtype=object)

In [15]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=float)

ValueError: could not convert string to float: cacca

###### Structured data type

In [14]:
tmnt_list=['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
tmnt_ages=[14, 15, 13, 16]

In [19]:
# list() is needed on Python 3, where zip returns an iterator instead of a list
tmnt_np=np.array(list(zip(tmnt_ages,tmnt_list)), dtype=[('age', 'i8'), ('name','S20')])

In [20]:
tmnt_np

array([(14, 'Donatello'), (15, 'Raffaello'), (13, 'Michelangelo'),
       (16, 'Leonardo')], dtype=[('age', '<i8'), ('name', 'S20')])

In [21]:
tmnt_np['name']

array(['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo'], dtype='|S20')

In [22]:
tmnt_np[0]

(14, 'Donatello')

Searching an array with a structured data type

In [23]:
tmnt_np[tmnt_np['name']=='Leonardo']['age']

array([16])
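
Fields can also drive sorting and filtering. The sketch below rebuilds the same array, sorts it by the `age` field and selects by a numerical condition. It uses `'U20'` rather than `'S20'` for the names, which is a choice made here so that comparisons with plain strings also behave as expected on Python 3.

```python
import numpy as np

tmnt_list = ['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
tmnt_ages = [14, 15, 13, 16]
tmnt_np = np.array(list(zip(tmnt_ages, tmnt_list)),
                   dtype=[('age', 'i8'), ('name', 'U20')])

# Sort the records by one of the fields
by_age = np.sort(tmnt_np, order='age')
youngest = by_age['name'][0]

# Mask on one field, read another one
older = tmnt_np[tmnt_np['age'] > 14]['name']

print(youngest)
print(older)
```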

##### Operations among arrays

In [22]:
cacca

array([[1, 2, 3],
       [4, 5, 6]])

In [23]:
cacca.shape

(2, 3)

Transpose

In [24]:
cacca.T

array([[1, 4],
       [2, 5],
       [3, 6]])

Dot product

In [25]:
np.dot(cacca, cacca.T)

array([[14, 32],
       [32, 77]])
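
To make the difference with the element-by-element product explicit, here is a small sketch on the same array: one entry of the dot product is recomputed by hand and compared with `*`.

```python
import numpy as np

cacca = np.array([[1, 2, 3], [4, 5, 6]])

# Matrix product: shapes (2, 3) and (3, 2) give a (2, 2) result
product = np.dot(cacca, cacca.T)

# Entry (0, 1) is the sum over k of cacca[0, k] * cacca[1, k]
entry_01 = np.sum(cacca[0] * cacca[1])   # 1*4 + 2*5 + 3*6

# The * operator is element by element and keeps the shape (2, 3)
elementwise = cacca * cacca

print(product)
print(entry_01)
print(elementwise.shape)
```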

##### Reading/writing from/to file

In [24]:
adjacency_matrix=np.genfromtxt('something.txt', delimiter=',', dtype='i8')

IOError: something.txt not found.

In [27]:
np.savetxt('something_new.txt',adjacency_matrix, fmt='%u', delimiter=',')

In [28]:
np.genfromtxt('something_new.txt', delimiter=',', dtype='i8')

array([[0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 1],
       [0, 0, 1, 0]])
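
Since `something.txt` is not available here, the following sketch reproduces the write/read round trip starting from the matrix printed above. The temporary directory is an assumption made only to keep the example self-contained.

```python
import os
import tempfile
import numpy as np

# Same values as the matrix printed above
adjacency_matrix = np.array([[0, 1, 0, 1],
                             [0, 0, 1, 1],
                             [1, 1, 0, 1],
                             [0, 0, 1, 0]])

# Write to a temporary location (chosen only for this example)
path = os.path.join(tempfile.mkdtemp(), 'something_new.txt')
np.savetxt(path, adjacency_matrix, fmt='%u', delimiter=',')

# Read it back and check that nothing changed
reloaded = np.genfromtxt(path, delimiter=',', dtype='i8')
same = np.array_equal(adjacency_matrix, reloaded)
print(same)
```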

##### Interesting stuff and functions

np.zeros()

In [26]:
np.zeros(42, dtype='int')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [30]:
np.zeros(42, dtype=str)

array(['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', ''], dtype='|S1')

np.ones()

In [31]:
np.ones(42, dtype='int')

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [32]:
np.ones(42, dtype=str)

array(['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1'], dtype='|S1')

np.arange()

In [33]:
np.arange(42)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41])

In [34]:
np.arange(4,42)

array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
       21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
       38, 39, 40, 41])

np.unique()

In [35]:
cacca=np.array([1,2,4,1,2,4,12,42])

In [36]:
np.unique(cacca)

array([ 1,  2,  4, 12, 42])

In [37]:
np.unique(cacca, return_counts=True)

(array([ 1,  2,  4, 12, 42]), array([2, 2, 2, 1, 1]))

np.sum()

In [28]:
adjacency_matrix

NameError: name 'adjacency_matrix' is not defined

In [39]:
np.sum(adjacency_matrix)

8

In [40]:
np.sum(adjacency_matrix, axis=0)

array([1, 2, 2, 3])

In [41]:
np.sum(adjacency_matrix, axis=1)

array([2, 2, 3, 1])
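
The `axis` argument selects the direction that gets summed away: `axis=0` collapses the rows and returns one value per column, `axis=1` collapses the columns and returns one value per row. A sketch on the 4x4 matrix shown in the file section follows; reading rows as sources and columns as targets of the links is an assumption about the convention, not something stated in the data.

```python
import numpy as np

adjacency_matrix = np.array([[0, 1, 0, 1],
                             [0, 0, 1, 1],
                             [1, 1, 0, 1],
                             [0, 0, 1, 0]])

total_links = np.sum(adjacency_matrix)        # all the entries
in_degree = np.sum(adjacency_matrix, axis=0)  # column sums
out_degree = np.sum(adjacency_matrix, axis=1) # row sums

print(total_links, in_degree, out_degree)
```

Both degree sequences sum to the total number of links, which is a quick sanity check.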

np.where()

In [42]:
np.where(cacca==1)

(array([0, 3]),)

In [27]:
np.where(adjacency_matrix==1)

NameError: name 'adjacency_matrix' is not defined
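
On a 2-d array `np.where` returns one index array per axis. As a sketch, again on the 4x4 matrix shown in the file section, this turns an adjacency matrix back into an edge list; `np.column_stack` is used here only to put the two index arrays side by side.

```python
import numpy as np

adjacency_matrix = np.array([[0, 1, 0, 1],
                             [0, 0, 1, 1],
                             [1, 1, 0, 1],
                             [0, 0, 1, 0]])

# One array of row indices and one of column indices
rows, cols = np.where(adjacency_matrix == 1)

# Each row of edge_list is a (row, column) pair with a nonzero entry
edge_list = np.column_stack((rows, cols))
print(edge_list)
```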

## Exercise:
1. **load the file ./data/imdb_2018_films_actors.txt** It is an edge list (_what is an edge list?_) of a bipartite network (_what is a bipartite network?_), with films in the first column and actors in the second;
2. **calculate the degree sequence** for both layers (_what is a layer?_)
3. **build the biadjacency matrix** (_what is a biadjacency matrix?_)
4. **calculate the nearest neighbours degree** (_what is nn?_)

#### 1. Load the file

In [29]:
# Read the file line by line; the with block closes it afterwards
with open('./data/imdb_2018_films_actors.txt') as f:
    films = f.read().splitlines()
# Map each film to the list of its actors
dizionario_film_attori = {}
for i in films:
    film, attori = i.split('\t')
    if film not in dizionario_film_attori:
        dizionario_film_attori[film] = [attori]
    else:
        dizionario_film_attori[film].append(attori)

dizionario_film_attori

# It is not ordered, so you cannot use this method

{'12 Strong': ['Michael Shannon',
  'Navid Negahban',
  'Geoff Stults',
  'Austin H\xc3\xa9bert',
  "Ben O'Toole",
  'Kenny Sheard',
  'Rob Riggle',
  'Arshia Mandavi',
  'Marie Wagenman',
  'Samuel Kamphuis',
  'Laith Nakli',
  'Numan Acar',
  'Matthew Van Wettering',
  'Jahan Khalili',
  'Nate Boyer',
  'Vincent E. McDaniel',
  'Tommy Truex',
  'Shvan Aladdin',
  'Ali Olomi',
  'Sofia Chicorelli Serna',
  'Martin Edward Andazola',
  'Osama bin Laden',
  'Edward Butron',
  'Perla Daoud',
  'Jason Exum',
  'Dan Gruenberg',
  'Shawn Lecrone',
  'Jay Moore',
  'Martin Palmer',
  'Benjamin Poe',
  'James Joseph Pulido',
  'Dion Ronquillo Jr.',
  'Shawn Sarmeidani',
  'J. Nathan Simmons',
  'Michael E. Stogner',
  'Matthew Velez',
  'David White',
  'Chris Hemsworth',
  'Michael Pe?a',
  'Trevante Rhodes',
  'Thad Luckinbill',
  'Austin Stowell',
  'Kenneth Miller',
  'Jack Kesy',
  'William Fichtner',
  'Elsa Pataky',
  'Allison King',
  'Lauren Chavez-Myers',
  'Fahim Fazli',
  'Peter Ka

In [41]:
### Another way, the right one:
dizionario_film = np.genfromtxt('./data/imdb_2018_films_actors.txt', delimiter='\t', dtype=[('film', 'U50'), ('actor', 'U50')])
dizionario_film

array([(u'Avengers: Infinity War', u'Chris Hemsworth'),
       (u'Avengers: Infinity War', u'Chris Evans'),
       (u'Avengers: Infinity War', u'Don Cheadle'), ...,
       (u'Colette', u'Izzy Bayley-King'), (u'Colette', u'Karl Farrer'),
       (u'Colette', u'Masayoshi Haneda')],
      dtype=[('film', '<U50'), ('actor', '<U50')])

#### 2. The degree sequence

In [84]:
actors,degree_actor = np.unique(dizionario_film['actor'],return_counts=True)
films,degree_film = np.unique(dizionario_film['film'],return_counts=True)
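
The idea is that in an edge list each node appears once per link, so counting occurrences gives the degree. A minimal sketch on a hypothetical toy edge list (the names `F1`, `A1`, ... are invented and not taken from the IMDB data):

```python
import numpy as np

# Hypothetical toy edge list (film, actor)
toy_edges = np.array([('F1', 'A1'), ('F1', 'A2'), ('F2', 'A2'),
                      ('F2', 'A3'), ('F3', 'A2')],
                     dtype=[('film', 'U10'), ('actor', 'U10')])

# Distinct nodes of each layer and how many times each one appears
toy_actors, toy_degree_actor = np.unique(toy_edges['actor'], return_counts=True)
toy_films, toy_degree_film = np.unique(toy_edges['film'], return_counts=True)

print(toy_actors, toy_degree_actor)
print(toy_films, toy_degree_film)
```

On each layer the degrees sum to the number of rows of the edge list.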

#### 3. The biadjacency matrix

In [85]:
biadjacency_matrix = np.zeros((len(actors),len(films)),dtype = int)
for i in xrange(len(actors)): 
    for j in xrange(len(films)):
 

IndentationError: expected an indented block (<ipython-input-85-27e166ec0129>, line 4)

In [86]:
### The professor's method

biadjacency_matrix = np.zeros((len(films),len(actors)),dtype = int)

for i in dizionario_film:
    f_pos = np.where(films==i['film'])[0]
    a_pos = np.where(actors==i['actor'])[0]
    biadjacency_matrix[f_pos, a_pos] = 1
    
biadjacency_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
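
The loop above calls `np.where` twice per edge. A possible alternative, sketched here on a hypothetical toy edge list and not part of the original solution, is to let `np.unique` return the position of every entry with `return_inverse=True` and fill the matrix in a single assignment.

```python
import numpy as np

# Hypothetical toy edge list (film, actor)
toy_edges = np.array([('F1', 'A1'), ('F1', 'A2'), ('F2', 'A2'),
                      ('F2', 'A3'), ('F3', 'A2')],
                     dtype=[('film', 'U10'), ('actor', 'U10')])

# film_idx[e] is the row of edge e, actor_idx[e] is its column
toy_films, film_idx = np.unique(toy_edges['film'], return_inverse=True)
toy_actors, actor_idx = np.unique(toy_edges['actor'], return_inverse=True)

toy_biadjacency = np.zeros((len(toy_films), len(toy_actors)), dtype=int)
toy_biadjacency[film_idx, actor_idx] = 1   # one assignment for all edges

print(toy_biadjacency)
```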

In [87]:
np.sum(biadjacency_matrix)

12912

In [88]:
len(dizionario_film)

12914
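
The two numbers above differ by 2. One possible explanation, which is an assumption and has not been checked on the IMDB file here, is that some (film, actor) pairs appear more than once in the loaded edge list: assigning 1 twice to the same entry still leaves a 1. A sketch of how repeated pairs could be counted, on a hypothetical toy edge list:

```python
import numpy as np

# Hypothetical toy edge list with one repeated row
toy_edges = np.array([('F1', 'A1'), ('F1', 'A2'), ('F1', 'A2'), ('F2', 'A2')],
                     dtype=[('film', 'U10'), ('actor', 'U10')])

# Turn every row into a plain (film, actor) tuple
pairs = list(zip(toy_edges['film'].tolist(), toy_edges['actor'].tolist()))

n_rows = len(pairs)            # rows in the edge list
n_distinct = len(set(pairs))   # distinct links
n_repeated = n_rows - n_distinct

print(n_rows, n_distinct, n_repeated)
```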

#### 4. The Nearest Neighbour Degree

In [95]:
print(len(degree_film))
print(np.shape(biadjacency_matrix))
print(len(degree_actor))

199
(199, 11128)
11128


In [103]:
NND_first = np.dot(biadjacency_matrix.T, degree_film)

In [106]:
NND = 1. * NND_first / degree_actor
print(NND)

# There is some error

[111. 107.  56. ...  81.  75.  75.]
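
The comment in the cell above suspects an error without locating it. One candidate, offered only as a hypothesis, is that the degrees come from counting rows of the edge list while the matrix is binary, so the two can disagree if the edge list has repeated rows. The sketch below, on a hypothetical toy matrix, takes the degrees from the matrix itself so that everything is consistent.

```python
import numpy as np

# Hypothetical toy biadjacency matrix: rows are films, columns are actors
toy_biadjacency = np.array([[1, 1, 0],
                            [0, 1, 1],
                            [0, 1, 0]])

# Degrees read from the matrix instead of from the edge list
toy_degree_film = toy_biadjacency.sum(axis=1)
toy_degree_actor = toy_biadjacency.sum(axis=0)

# For each actor: sum of the degrees of the films they appear in ...
nnd_sum = np.dot(toy_biadjacency.T, toy_degree_film)
# ... divided by the number of those films
nnd_actor = nnd_sum / toy_degree_actor.astype(float)

print(nnd_actor)
```

The same computation with the roles of the two layers swapped gives the nearest neighbour degree of the films.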


### Other interesting functions of numpy

np.diag()

In [63]:
np.diag(adjacency_matrix)

array([0, 0, 0, 0])

extracts the diagonal from a square matrix or

In [64]:
np.diag([1,2,3,4])

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

builds a diagonal matrix from the array in the argument.

np.vstack()

In [65]:
np.vstack((np.arange(4), np.arange(4,8)))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

it stacks the two arrays one over the other.
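
For comparison, a short sketch of two related functions, `np.hstack` and `np.column_stack`, applied to the same two arrays; only the resulting shapes differ.

```python
import numpy as np

a = np.arange(4)
b = np.arange(4, 8)

stacked_rows = np.vstack((a, b))        # one array per row, shape (2, 4)
stacked_flat = np.hstack((a, b))        # one after the other, shape (8,)
stacked_cols = np.column_stack((a, b))  # one array per column, shape (4, 2)

print(stacked_rows.shape, stacked_flat.shape, stacked_cols.shape)
```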

np.isin()

In [66]:
cacca=np.array([0,1,3,4,5,42])

In [67]:
np.isin(cacca, np.array([0,42,3]))

array([ True, False,  True, False, False,  True])

In [68]:
cacca[np.isin(cacca, np.array([0,42,3]))]

array([ 0,  3, 42])

In [69]:
np.isin(np.array([0,42,3]),cacca)

array([ True,  True,  True])
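
`np.isin` is handy for filtering an edge list by a set of nodes. A minimal sketch on a hypothetical toy edge list (the film and actor names are invented):

```python
import numpy as np

# Hypothetical toy edge list (film, actor)
toy_edges = np.array([('F1', 'A1'), ('F1', 'A2'), ('F2', 'A2'),
                      ('F2', 'A3'), ('F3', 'A2')],
                     dtype=[('film', 'U10'), ('actor', 'U10')])

# Keep only the links of the films we are interested in
wanted_films = np.array(['F1', 'F3'])
mask = np.isin(toy_edges['film'], wanted_films)
kept = toy_edges[mask]

print(kept)
```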

### Exercise: project the network on the two layers and get the adjacency matrix (i.e. the binarized version of the weighted matrix)
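
As a hint, here is one possible sketch on a hypothetical toy biadjacency matrix, not the solution on the IMDB data. It assumes that the weight of a link in the projection is the number of shared neighbours on the other layer, which is what the product of the biadjacency matrix with its transpose counts; `project` is a helper name introduced only for this example.

```python
import numpy as np

# Hypothetical toy biadjacency matrix: rows are films, columns are actors
toy_biadjacency = np.array([[1, 1, 0],
                            [1, 1, 1],
                            [0, 0, 1]])

def project(matrix):
    """Return the (weighted, binary) projection on the rows of matrix."""
    weighted = np.dot(matrix, matrix.T)  # entry (i, j): shared neighbours
    np.fill_diagonal(weighted, 0)        # remove self-loops
    binary = (weighted > 0).astype(int)  # keep only whether a link exists
    return weighted, binary

film_weighted, film_adjacency = project(toy_biadjacency)
actor_weighted, actor_adjacency = project(toy_biadjacency.T)

print(film_weighted)
print(film_adjacency)
```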

### Exercise: calculate the clustering per node and its average value on the film projection
The clustering of a node is the number of observed triangles passing through it divided by the number of possible ones, i.e. the number of pairs of its neighbours.
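
As a hint, a sketch on a hypothetical 4-node undirected network (a triangle plus one pendant node). It relies on the fact that, for a binary symmetric matrix without self-loops, the diagonal of its third power counts twice the triangles through each node; nodes with fewer than two neighbours are given clustering 0 here, which is a convention chosen for this example.

```python
import numpy as np

# Hypothetical toy adjacency matrix: triangle 0-1-2 plus node 3 linked to 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

k = A.sum(axis=1)                          # degree of each node
closed = np.diag(np.dot(A, np.dot(A, A)))  # 2 * triangles through each node
possible = k * (k - 1)                     # 2 * pairs of neighbours

clustering = np.zeros(len(A))
ok = possible > 0                          # skip nodes with degree < 2
clustering[ok] = closed[ok] / possible[ok].astype(float)

average_clustering = clustering.mean()
print(clustering, average_clustering)
```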