# Class 28 - Analyzing movie data

The nodes in this network are movie actors derived from data published [here](http://www3.ul.ie/gd2005/dataset.html). This original graph is a [bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) containing two kinds of nodes: movies and their actors. I've simplified this graph in several ways to make it smaller and simpler to analyze:

* This is a **weighted projected graph**: two actors are linked if they appeared in the same movie. Movies do not appear explicitly in this data.
* Only actors and movies from movies produced since 2000 are included.
* Only actors who have appeared in more than 5 movies in the dataset are included.
* Actors are connected only if they have been in a movie together more than once.
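To make the projection step concrete, here is a minimal sketch on made-up data. The actor and movie names are hypothetical placeholders, and this is not necessarily how the class dataset was built; it only illustrates how a bipartite actor-movie graph can be collapsed into a weighted actor-actor graph with NetworkX's `weighted_projected_graph`.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy data: three actors and three movies (not from the dataset).
B = nx.Graph()
actors = ['actor_a', 'actor_b', 'actor_c']
movies = ['movie_1', 'movie_2', 'movie_3']
B.add_nodes_from(actors, bipartite=0)
B.add_nodes_from(movies, bipartite=1)
B.add_edges_from([
    ('actor_a', 'movie_1'), ('actor_b', 'movie_1'),
    ('actor_a', 'movie_2'), ('actor_b', 'movie_2'),
    ('actor_b', 'movie_3'), ('actor_c', 'movie_3'),
])

# Project onto the actors: each edge weight counts the shared movies.
P = bipartite.weighted_projected_graph(B, actors)

# Keep only the pairs who appeared together more than once.
strong_pairs = [(u, v) for u, v, d in P.edges(data=True) if d['weight'] > 1]
print(strong_pairs)
```

In this toy graph only `actor_a` and `actor_b` share two movies, so theirs is the only pair that survives the "more than once" filter.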

Launch Gephi and load either `actors_gt1` or `actors_gt2` so you have it open in the background.

Load [networkx](http://networkx.readthedocs.io/en/networkx-1.11/).

In [2]:
import networkx as nx

Read in one of the GEXF graph files.

In [38]:
g = nx.read_gexf('actors_gt1.gexf')

## Descriptive statistics

Calculate the number of nodes, edges, diameter, and density of the graph.

In [17]:
len(g)

5633

In [18]:
g.number_of_nodes()

5633

In [19]:
g.number_of_edges()

29774

In [20]:
nx.diameter(g)

NetworkXError: Graph not connected: infinite path length

In [21]:
for index,graph in enumerate(list(nx.components.connected_component_subgraphs(g))):
    print(index,graph.number_of_nodes())

0 3687
1 2
2 2
3 33
4 3
5 58
6 2
7 2
8 45
9 2
10 348
11 128
12 2
13 18
14 3
15 12
16 6
17 2
18 39
19 7
20 2
21 96
22 2
23 12
24 2
25 2
26 67
27 5
28 2
29 2
30 9
31 2
32 2
33 2
34 3
35 2
36 3
37 46
38 9
39 19
40 9
41 2
42 2
43 5
44 2
45 2
46 3
47 6
48 2
49 2
50 8
51 2
52 2
53 2
54 2
55 2
56 3
57 5
58 3
59 3
60 3
61 3
62 2
63 12
64 3
65 2
66 3
67 3
68 2
69 44
70 3
71 2
72 3
73 2
74 2
75 2
76 2
77 20
78 3
79 2
80 3
81 3
82 3
83 3
84 2
85 9
86 6
87 2
88 10
89 19
90 10
91 2
92 2
93 2
94 2
95 2
96 2
97 2
98 2
99 4
100 4
101 7
102 4
103 4
104 2
105 3
106 3
107 2
108 3
109 2
110 2
111 2
112 2
113 3
114 3
115 2
116 2
117 3
118 9
119 2
120 4
121 3
122 3
123 6
124 2
125 2
126 2
127 3
128 2
129 2
130 2
131 2
132 3
133 2
134 9
135 4
136 10
137 2
138 2
139 2
140 2
141 3
142 4
143 2
144 5
145 2
146 8
147 3
148 2
149 2
150 4
151 3
152 2
153 3
154 2
155 2
156 2
157 2
158 2
159 2
160 4
161 3
162 10
163 4
164 3
165 2
166 2
167 4
168 4
169 2
170 5
171 3
172 7
173 2
174 12
175 6
176 2
177 3
178 3
179 2
180

In [26]:
# Note: index 3 is a 33-node component; the largest one is index 0 (3687 nodes).
lcc = list(nx.components.connected_component_subgraphs(g))[3]
lcc.number_of_nodes()

33

In [27]:
nx.diameter(lcc)

8

In [30]:
nx.average_shortest_path_length(lcc)

3.570075757575758
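The cells above pick a component by position in the list, and index 3 is a 33-node component, while the listing shows index 0 (3687 nodes) as the largest. A sketch of selecting the largest component by size rather than by index follows, on a toy graph. It uses `nx.connected_components` and `Graph.subgraph`, which should also work in newer NetworkX releases where `connected_component_subgraphs` has been removed.

```python
import networkx as nx

# Toy graph with components of size 2, 4, and 3 (not the actor data).
toy = nx.Graph()
toy.add_edges_from([('a', 'b')])
toy.add_edges_from([('c', 'd'), ('d', 'e'), ('e', 'f')])
toy.add_edges_from([('g', 'h'), ('h', 'i')])

# connected_components yields one set of nodes per component;
# max(..., key=len) picks the set with the most nodes.
largest_nodes = max(nx.connected_components(toy), key=len)
largest = toy.subgraph(largest_nodes).copy()

print(largest.number_of_nodes())
print(nx.diameter(largest))
```

Applied to `g`, the same two lines would return the 3687-node component. Expect `nx.diameter` to take noticeably longer on a graph that size.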

In [28]:
nx.write_gexf(lcc,'lcc.gexf')

In [29]:
nx.density(g)

0.0018770022029275535
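For an undirected graph, density is the number of edges divided by the number of possible node pairs, 2m / (n(n - 1)). The sketch below checks that formula against `nx.density` on a small illustrative graph, then plugs in the node and edge counts reported above.

```python
import networkx as nx

# Small illustrative graph, not the actor network.
toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Undirected density: observed edges over possible pairs.
manual_density = 2.0 * m / (n * (n - 1))
print(manual_density, nx.density(toy))

# Same formula with the counts reported for the actor graph above.
actor_density = 2.0 * 29774 / (5633 * 5632)
print(actor_density)
```

The second value reproduces the density of about 0.001877 shown above: fewer than 0.2% of all possible actor pairs are actually linked.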

## Find the neighbors of nodes

Use the `neighbors` method on the `g` object and give a node name to find all the nodes to which it's directly connected.

In [34]:
g.neighbors('Cage, Nicolas')

['Zeta-Jones, Catherine',
 'Carrey, Jim',
 'Jackson, Samuel L.',
 'Jolie, Angelina',
 'Hayek, Salma',
 'Smith, Will (I)',
 'Stiller, Ben',
 'Leno, Jay',
 'Del Toro, Benicio',
 'Clooney, George',
 'McConaughey, Matthew',
 'Harden, Marcia Gay',
 'Berry, Halle',
 'Lane, Diane (I)',
 'Affleck, Ben',
 'Voight, Jon',
 'Hopper, Dennis',
 'Roberts, Julia',
 'Cruise, Tom',
 'Hudson, Kate (I)',
 'Kidman, Nicole',
 'Douglas, Michael (I)',
 'Cruz, Penélope',
 'Slater, Christian',
 'Theron, Charlize',
 'Diaz, Cameron',
 'Ford, Harrison (I)',
 'Zellweger, Renée',
 'Travolta, John',
 'Spacey, Kevin',
 'Cooper, Chris (I)',
 'Connery, Sean',
 'Lopez, Jennifer (I)',
 'Nicholson, Jack',
 'Hanks, Tom',
 'Reeves, Keanu',
 'Bullock, Sandra',
 'Swank, Hilary',
 'Moore, Julianne (I)']
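Because this is a weighted graph, it can be useful to see how strong each tie is, not just who the neighbors are. The sketch below uses a hypothetical toy graph with placeholder names. It assumes the edge weights are stored under the `weight` attribute, which is worth confirming for the loaded GEXF file.

```python
import networkx as nx

# Hypothetical weighted co-appearance graph (names are placeholders).
toy = nx.Graph()
toy.add_edge('actor_a', 'actor_b', weight=3)
toy.add_edge('actor_a', 'actor_c', weight=2)
toy.add_edge('actor_b', 'actor_c', weight=5)

# sorted() works whether neighbors() returns a list or an iterator.
nbrs = sorted(toy.neighbors('actor_a'))

# The adjacency lookup toy[u][v] returns the attributes of edge (u, v).
weights = {n: toy['actor_a'][n]['weight'] for n in nbrs}
print(weights)
```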

## Find the most connected node

Which actor is the most important in the network?

A few ways to measure this:

* **Degree centrality**: Number of connections
* **Betweenness centrality**: Lying along many shortest paths between nodes
* **Closeness centrality**: Having the shortest average distance to all other nodes
* **Eigenvector centrality**: Being connected to well-connected nodes
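Each of these measures has a corresponding NetworkX function that returns a dictionary mapping nodes to scores. The sketch below runs all four on a small toy graph and reports the top-scoring node for each. On the actor graph the same calls would take `g` (or a connected component of it), and betweenness in particular may be slow on several thousand nodes.

```python
import networkx as nx

# Toy graph: a triangle (a, b, c) with a tail c - d - e.
toy = nx.Graph()
toy.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'),
                    ('c', 'd'), ('d', 'e')])

# Each function returns {node: score}.
measures = {
    'degree': nx.degree_centrality(toy),
    'betweenness': nx.betweenness_centrality(toy),
    'closeness': nx.closeness_centrality(toy),
    'eigenvector': nx.eigenvector_centrality(toy, max_iter=1000),
}

# Highest-scoring node under each measure.
top = {name: max(scores, key=scores.get) for name, scores in measures.items()}
print(top)
```

In this toy graph node `c` comes out on top under all four measures. In real networks the measures often disagree, which is what makes comparing them informative.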

## Find the shortest path between nodes

What's a path? What's a shortest path? What's the longest shortest path?

In [40]:
nx.shortest_path(g,source='Daly, Tess',target='Erholtz, Doug')

['Daly, Tess',
 'Webbe, Simon',
 'Cowell, Simon',
 'Hawk, Tony',
 'Wingert, Wally',
 'Erholtz, Doug']
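A path is a sequence of nodes in which each consecutive pair is joined by an edge, and a shortest path is one with the fewest edges. The path above has six nodes, so it is five hops long. The longest shortest path over all pairs of nodes is the diameter. A minimal sketch on a toy graph ties the three ideas together:

```python
import networkx as nx

# Toy graph: a 4-cycle (1-2-3-4) with a pendant node 5 attached to 4.
toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1), (4, 5)])

# One shortest path from 2 to 5 (there are two equally short ones).
path = nx.shortest_path(toy, source=2, target=5)

# Its length in hops: number of nodes on the path minus one.
hops = nx.shortest_path_length(toy, source=2, target=5)

# The diameter is the largest shortest-path length over all node pairs.
print(path, hops, nx.diameter(toy))
```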

## Find brokering nodes

Which nodes lie along many shortest paths?
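Betweenness centrality is one common way to answer this. The sketch below uses a toy "barbell" graph (two tightly knit groups joined through a single node) to show that a broker can have very few connections yet sit on the most shortest paths. This is an illustration only, not a result from the actor data.

```python
import networkx as nx

# Two 4-node cliques (nodes 0-3 and 5-8) joined through bridge node 4.
toy = nx.barbell_graph(4, 1)

bc = nx.betweenness_centrality(toy)
broker = max(bc, key=bc.get)

# dict() works whether degree() returns a dict or a degree view.
deg = dict(toy.degree())
print(broker, deg[broker], max(deg.values()))
```

Here the bridge node has only two connections, the fewest in the graph, but every shortest path between the two groups runs through it.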