In [1]:
import pandas as pd
import re
import itertools
import networkx as nx
from networkx.algorithms import community
from operator import itemgetter

The following code reads in `film_festivals.csv`, which contains information about films selected to compete at the Cannes, Venice, and Berlin film festivals. Each row combines festival-specific information (the festival the film competed in, the year of the festival, and whether the film won) with film-specific information (the film's title and the people involved in making it).

`film_festivals.csv` was produced by joining `festivals.csv` and `films.csv`. To obtain the data for `festivals.csv`, I scraped the Wikipedia pages associated with each year of the Cannes, Venice, and Berlin festivals, using Beautiful Soup, which returned all of the films selected for competition at these festivals. To obtain the data for `films.csv`, I used the Open Movie Database (OMDb) API to request the film-specific information for each of the films listed in `festivals.csv`.

In [2]:
df = pd.read_csv('./data/film_festivals.csv', keep_default_na=False)

The following code handles missing data in the `Director`, `Writer`, `Actors`, and `Production` features, where missing values are encoded as 'N/A' or '' (and also as 'Unknown' for `Production`). It removes only the films that are missing all four features. Few films fall into this category, so dropping them costs us little.

Films with only one, two, or three of these features missing—but not all missing—are still kept within our data. We do not throw out these films because they still have some data on the people or entities involved in their making. We want to keep these people in our network, so we will also keep the remaining missing values for now and handle them a little differently at a later point in our analysis.

After handling the missing data, we also do some initial cleaning. For the `Director` and `Writer` features, we use regular expressions to remove any text enclosed in parentheses, so that only the names are left. For example, for `Director`, a possible pattern match to be removed is '(co-director)', and for `Writer`, possible pattern matches include '(screenplay)', '(dialogue)', and '(story)'.

In [3]:
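# drop films missing all four features; .copy() avoids chained-assignment warnings below
df = df[~(df['Director'].isin(['N/A', '']) & df['Writer'].isin(['N/A', '']) & df['Actors'].isin(['N/A', '']) & df['Production'].isin(['N/A', '', 'Unknown']))].copy()
# raw strings avoid invalid-escape warnings; \s* also removes the space before '('
df['Director'] = df['Director'].apply(lambda value: re.sub(r'\s*\(.*?\)', '', value))
df['Writer'] = df['Writer'].apply(lambda value: re.sub(r'\s*\(.*?\)', '', value))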
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,id,title_english,title_original,director,country,winner,festival,year_festival,link_film,link_director,...,BoxOffice,Production,Website,Response,InternetMovieDatabaseRating,RottenTomatoesRating,MetacriticRating,Error,totalSeasons,Ratings
0,0,Adieu Bonaparte,وداعا بونابرت,Youssef Chahine,Egypt,0,cannes,1985,https://en.wikipedia.org/wiki/Adieu_Bonaparte,https://en.wikipedia.org/wiki/Youssef_Chahine,...,,,,True,6.4/10,,,,,
1,1,Birdy,,Alan Parker,United States,0,cannes,1985,https://en.wikipedia.org/wiki/Birdy_(film),https://en.wikipedia.org/wiki/Alan_Parker,...,,Sony Pictures Home Entertainment,,True,7.3/10,85%,,,,
2,2,Bliss,,Ray Lawrence,Australia,0,cannes,1985,https://en.wikipedia.org/wiki/Bliss_(1985_film),https://en.wikipedia.org/wiki/Ray_Lawrence_(fi...,...,,Starmaker Entertainment,,True,6.9/10,,,,,
3,4,The Coca-Cola Kid,,Dušan Makavejev,Australia,0,cannes,1985,https://en.wikipedia.org/wiki/The_Coca-Cola_Kid,https://en.wikipedia.org/wiki/Du%C5%A1an_Makav...,...,,Cinecom Pictures,,True,6.0/10,44%,,,,
4,5,Colonel Redl,Oberst Redl,István Szabó,Hungary,0,cannes,1985,https://en.wikipedia.org/wiki/Colonel_Redl,https://en.wikipedia.org/wiki/Istv%C3%A1n_Szab...,...,,,,True,7.6/10,,,,,
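
To see what the parenthetical-stripping step does in isolation, here is a minimal sketch on made-up strings. The sample values and the helper name `strip_parentheticals` are hypothetical and not part of the dataset or the notebook; the pattern used here also consumes the whitespace before the opening parenthesis, which is an assumption of this sketch.

```python
import re

# Hypothetical sample values, shaped like the `Director` / `Writer` strings.
samples = [
    "Jane Doe (co-director), John Roe",
    "Jane Doe (screenplay), John Roe (story), Ann Poe (dialogue)",
]

def strip_parentheticals(value):
    """Remove every '(...)' group together with the whitespace before it."""
    # .*? is non-greedy, so each parenthetical is matched separately
    # instead of one match running from the first '(' to the last ')'.
    return re.sub(r"\s*\(.*?\)", "", value)

cleaned = [strip_parentheticals(s) for s in samples]
print(cleaned)
# ['Jane Doe, John Roe', 'Jane Doe, John Roe, Ann Poe']
```

The non-greedy quantifier matters here: a greedy `.*` would delete everything between the first opening and the last closing parenthesis, including the names in between.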


The following code subsets the data by filtering on festival and years of the festival. The network will be built on this subset.

In [4]:
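# bound the years on both sides so the subset matches its name (2010 through 2019 inclusive)
df_cannes_2010_2019 = df[(df['festival'] == 'cannes') & (df['year_festival'].between(2010, 2019))].reset_index(drop=True)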
df_cannes_2010_2019[['Director', 'Writer', 'Actors', 'Production']].head()

Unnamed: 0,Director,Writer,Actors,Production
0,Mike Leigh,Mike Leigh,"Jim Broadbent, Ruth Sheen, Lesley Manville, Ol...",Sony Classics
1,Alejandro G. Iñárritu,"Alejandro G. Iñárritu, Alejandro G. Iñárritu, ...","Javier Bardem, Maricel Álvarez, Hanaa Bouchaib...",Roadside Attractions
2,Nikita Mikhalkov,"Nikita Mikhalkov, Vladimir Moiseenko, Aleksand...","Nikita Mikhalkov, Oleg Menshikov, Nadezhda Mik...",
3,Abbas Kiarostami,"Abbas Kiarostami, Caroline Eliacheff","Juliette Binoche, William Shimell, Jean-Claude...",Artificial Eye Film Co. Ltd
4,Xiaoshuai Wang,Yishu Yang,"Bingbing Fan, Feier Li, Hao Qin, Xueqi Wang",


The idea behind the node list is to create a list of all of the nodes involved in the network. In the context of the networks that we will form for film festivals, we are interested in nodes for the network consisting of the people and entities involved in the making of the films: the directors, writers, actors, and production companies. The function `get_nodelist` creates this node list.

`get_nodelist` takes in strings consisting of the values for the director, writer, actor, and production company. To prepare these strings for the node list, each string is split on the separator ', ' to get a list of individual people or entities. We then remove any remaining missing values (encoded as 'N/A' for directors, writers, and actors, and as 'N/A', '', or 'Unknown' for production companies) because we do not want to include them as nodes in our network. We also attach a descriptor to each name, describing its role in the film (e.g. 'director', 'writer', 'actor', 'production company'). We combine these individual lists into one list, which we then turn into our node list dataframe. As a final step, we remove duplicate names, keeping the first occurrence; because the lists are combined in the order directors, actors, writers, production companies, a person with several roles keeps the earliest role in that order. The node list only keeps track of which nodes should exist in the network, not the number of times they appear in various films, which should instead emerge from the edge list; hence, each name needs to appear only once in our node list.

In [5]:
def get_nodelist(directors_from_df, writers_from_df, actors_from_df, production_companies_from_df):
    '''
    Given strings containing the directors, writers, actors, and production companies
    involved in the making of the films to be included in the network,
    return the node list. For each node, the node list also includes the attribute
    of the role the node played in the making of the film, e.g. 'director',
    'writer', 'actor', 'production company'.
    
    Keyword arguments:
    directors_from_df -- string containing directors for all of the films, separated by commas
    writers_from_df -- string containing writers for all of the films, separated by commas
    actors_from_df -- string containing actors for all of the films, separated by commas
    production_companies_from_df -- string containing production companies for all of the films, separated by commas
    '''
    # directors
    directors = directors_from_df.split(', ')
    directors = [director for director in directors if director != 'N/A']
    directors = [(director, 'director') for director in directors]
    # writers
    writers = writers_from_df.split(', ')
    writers = [writer for writer in writers if writer != 'N/A']
    writers = [(writer, 'writer') for writer in writers]
    # actors
    actors = actors_from_df.split(', ')
    actors = [actor for actor in actors if actor != 'N/A']
    actors = [(actor, 'actor') for actor in actors]
    # production companies
    production_companies = production_companies_from_df.split(', ')
    production_companies = [production_company for production_company in production_companies if production_company not in ['N/A', '', 'Unknown']]
    production_companies = [(production_company, 'production company') for production_company in production_companies]
    
    # people: directors, actors, writers, and production companies
    people = directors
    people.extend(actors) # the order of the extends determines the order of priority: director -> actor -> writer for people who have multiple roles
    people.extend(writers)
    people.extend(production_companies)
    
    name = [person[0] for person in people]
    role = [person[1] for person in people]
    nodelist = pd.DataFrame({'Name': name,
                             'Role': role})
    nodelist = nodelist.drop_duplicates(subset='Name').reset_index(drop=True)
    return nodelist
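
The role-priority behavior comes from keeping only the first occurrence of each name. The following sketch reproduces that idea with a plain dictionary instead of pandas; the names and the helper `first_role_per_name` are hypothetical and only meant to illustrate the keep-first logic, not to replace `get_nodelist`.

```python
# Hypothetical (name, role) pairs, listed in the same order in which the
# function concatenates them: directors, actors, writers, production companies.
people = [
    ("Jane Doe", "director"),
    ("John Roe", "actor"),
    ("Jane Doe", "actor"),
    ("Jane Doe", "writer"),
    ("Ann Poe", "writer"),
    ("Example Films", "production company"),
]

def first_role_per_name(pairs):
    """Keep the first role seen for each name, ignoring later duplicates."""
    roles = {}
    for name, role in pairs:
        # setdefault only stores the role if the name is not present yet
        roles.setdefault(name, role)
    return roles

node_roles = first_role_per_name(people)
print(node_roles["Jane Doe"])  # director
```

Jane Doe appears three times but is recorded once, as a director, because that entry comes first in the combined list.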

The idea behind the edge list is to create a list of all of the edges between pairs of nodes. An edge refers to a relation between two nodes; in the case of films, this relation exists between every pair of people involved in the making of a film. The network is undirected because the relation is symmetric: if Person A made the film with Person B, then Person B made the film with Person A. The function `get_edgelist` creates the edge list.

`get_edgelist` takes in a node list. From the node list, we know all of the people involved in the making of a film; all of these people should have an edge between them because they all worked together on the film. `get_edgelist` creates every unordered pair of distinct nodes, so no pair appears twice and no node is paired with itself. Each edge is then given a weight of 1, representing the one film the pair worked on together.

In [6]:
def get_edgelist(nodelist):
    '''
    Given a node list, returns a list of edges, both as a list and as a dataframe.
    The edge list is created by taking combinations of each node in the list
    without any repeat values. Each edge is given a weight of 1.
    
    Keyword arguments:
    nodelist -- node list returned by `get_nodelist` function
    '''
    edges = [combo + (1,) for combo in itertools.combinations(nodelist['Name'], 2)]
    edgelist = pd.DataFrame(edges, columns=['Source', 'Target', 'Weight'])
    return edges, edgelist
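
As a quick illustration of the pairing step, the sketch below applies `itertools.combinations` to a hypothetical four-node film. The names are invented; only the pairing expression matches what `get_edgelist` does.

```python
import itertools

# Hypothetical nodes for a single film.
names = ["Jane Doe", "John Roe", "Ann Poe", "Example Films"]

# Every unordered pair of distinct names, each tagged with a weight of 1.
edges = [pair + (1,) for pair in itertools.combinations(names, 2)]

for edge in edges:
    print(edge)

# n nodes produce n * (n - 1) / 2 edges, so 4 nodes give 6 edges.
print(len(edges))  # 6
```

Because the number of edges grows quadratically with the number of people on a film, films with long cast lists contribute many more edges than films with short ones.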

The `get_film_edges` function takes in strings consisting of the values for the director, writer, actor, and production company of a single film. It creates the node list and, from that node list, the corresponding edge list, all in one function, and returns the edges as a list.

In [7]:
def get_film_edges(directors_from_df, writers_from_df, actors_from_df, production_companies_from_df):
    '''
    Given strings containing the directors, writers, actors, and production companies
    involved in the making of a single film,
    creates the node list and the corresponding edge list from that node list.
    Returns the edges as a list of (source, target, weight) tuples.
    
    Keyword arguments:
    directors_from_df -- string containing the film's directors, separated by commas
    writers_from_df -- string containing the film's writers, separated by commas
    actors_from_df -- string containing the film's actors, separated by commas
    production_companies_from_df -- string containing the film's production companies, separated by commas
    '''
    nodelist_film = get_nodelist(directors_from_df,
                                 writers_from_df,
                                 actors_from_df,
                                 production_companies_from_df)
    edges_film, edgelist_film = get_edgelist(nodelist_film)
    return edges_film

The function `get_graph`, as of now, takes in a dataframe that is a subset of `film_festivals.csv`, filtered on festival and years of the festival. For each of the features `Director`, `Writer`, `Actors`, and `Production`, we create a string consisting of all of the values for director, writer, actor, and production company, respectively, separated by a comma.

Network analysis requires a node list and an edge list. From the strings, we are able to create the node list for our entire network. It includes nodes for every person and entity involved in the making of all of the films included in our subset.

Getting the edges for our network is a bit more complicated. Unlike the node list, we cannot just create the edge list for our network all in one go. Because edges only exist between people who worked together on the same films, we cannot just make an edge list using `get_edgelist` based on the overall node list containing nodes from all of the films; instead, we need to look at each film individually. Hence, for each row in our data—which corresponds to an individual film—we use `get_film_edges` to create a miniature node list just for that film and then use it to get its corresponding edge list. After doing this for each film, we can then bring each of the edge lists together to get the overall edge list for our network. For people who worked together on multiple films, duplicate entries will be included in the edge list.

Keep in mind that, at this point, each of the edges has a weight of 1—including each of the duplicate entries. We will handle these weights once we've created our graphs.

To create the graph, we use the NetworkX package in Python. We first start with a MultiGraph—an undirected graph class in NetworkX—to be able to handle the duplicate entries. In graph theory, a multigraph is a graph that can have multiple edges between the same nodes. Hence, all of the duplicate entries in our edge list manifest as multiple edges in our MultiGraph.

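We can then convert the MultiGraph into a weighted Graph, collapsing the multiple edges between two nodes into a single edge whose weight is the sum of their weights. Since every edge starts with a weight of 1, the final weight equals the number of films in our subset that the two nodes worked on together.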

In [8]:
def get_graph(df):
    '''
    Given a dataframe that is a subset filtered on the festival and
    years of festival we are interested in for our network, returns the
    graph/network.
    
    Keyword arguments:
    df -- dataframe that is a subset filtered on the festival and
    years of festival
    '''
    df_directors = df['Director'].str.cat(sep=', ')
    df_writers = df['Writer'].str.cat(sep=', ')
    df_actors = df['Actors'].str.cat(sep=', ')
    df_production_companies = df['Production'].str.cat(sep=', ')
    
    df_nodelist = get_nodelist(df_directors,
                               df_writers,
                               df_actors,
                               df_production_companies)
    df_node_names = list(df_nodelist['Name'])

    df_edges = df.apply(lambda row: get_film_edges(row['Director'], row['Writer'], row['Actors'], row['Production']), axis=1)
    df_edges = df_edges.tolist()
    df_edges = [item for sublist in df_edges for item in sublist]
    
    M = nx.MultiGraph()
    M.add_nodes_from(df_node_names)
    M.add_weighted_edges_from(df_edges)
    
    # https://stackoverflow.com/questions/15590812/networkx-convert-multigraph-into-simple-graph-with-weighted-edges
    # create weighted graph from M, summing the weights of parallel edges
    G = nx.Graph()
    G.add_nodes_from(M.nodes)  # keep nodes that have no edges (e.g. a film with a single credited entity)
    for u, v, data in M.edges(data=True):
        w = data.get('weight', 1.0)
        if G.has_edge(u, v):
            G[u][v]['weight'] += w
        else:
            G.add_edge(u, v, weight=w)
    
    return G
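
The weight-summing step can also be checked by hand without NetworkX. The sketch below is not the method used in `get_graph`; it swaps in a `collections.Counter` keyed by unordered pairs to show what the collapsed weights should look like. The per-film edge lists and the helper `collapse_edges` are hypothetical.

```python
from collections import Counter

# Hypothetical per-film edge lists, shaped like the output of `get_film_edges`.
film_1 = [
    ("Jane Doe", "John Roe", 1),
    ("Jane Doe", "Ann Poe", 1),
    ("John Roe", "Ann Poe", 1),
]
film_2 = [("John Roe", "Jane Doe", 1)]  # same pair as in film_1, opposite order

def collapse_edges(edge_lists):
    """Sum the weights of repeated edges, treating (u, v) and (v, u) as equal."""
    weights = Counter()
    for edges in edge_lists:
        for u, v, w in edges:
            # frozenset ignores order, matching an undirected graph
            weights[frozenset((u, v))] += w
    return weights

weights = collapse_edges([film_1, film_2])
print(weights[frozenset(("Jane Doe", "John Roe"))])  # 2
print(weights[frozenset(("Jane Doe", "Ann Poe"))])   # 1
```

The order-insensitive key is the important detail: the same two people can appear in a different order on different films, and an undirected graph treats both orderings as the same edge.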

In [9]:
G = get_graph(df_cannes_2010_2019)
# print(G.nodes(data=True))
print(G.edges(data=True))

[('Mike Leigh', 'Jim Broadbent', {'weight': 1}), ('Mike Leigh', 'Ruth Sheen', {'weight': 1}), ('Mike Leigh', 'Lesley Manville', {'weight': 1}), ('Mike Leigh', 'Oliver Maltman', {'weight': 1}), ('Mike Leigh', 'Sony Classics', {'weight': 1}), ('Mike Leigh', 'Timothy Spall', {'weight': 1}), ('Mike Leigh', 'Paul Jesson', {'weight': 1}), ('Mike Leigh', 'Dorothy Atkinson', {'weight': 1}), ('Mike Leigh', 'Marion Bailey', {'weight': 1}), ('Mike Leigh', 'Sony Pictures Classics', {'weight': 1}), ('Jim Broadbent', 'Ruth Sheen', {'weight': 1}), ('Jim Broadbent', 'Lesley Manville', {'weight': 1}), ('Jim Broadbent', 'Oliver Maltman', {'weight': 1}), ('Jim Broadbent', 'Sony Classics', {'weight': 1}), ('Ruth Sheen', 'Lesley Manville', {'weight': 1}), ('Ruth Sheen', 'Oliver Maltman', {'weight': 1}), ('Ruth Sheen', 'Sony Classics', {'weight': 1}), ('Lesley Manville', 'Oliver Maltman', {'weight': 1}), ('Lesley Manville', 'Sony Classics', {'weight': 1}), ('Oliver Maltman', 'Sony Classics', {'weight': 1}),