# Applied Data Analysis - Milestone P2

## Title : Hollywood's social structure

### Group Name : SHNO

### Project pipeline

- [Libraries](#Libraries)
- [CMU Data importation](#Data-Importation)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    - [Movie metadata](#Movie-metadata)
    - [Character metadata](#Character-metadata)
    - [Plot summaries](#Plot-summaries)
    - [TV Tropes](#TV-Tropes)
- [IMDb Datasets Exploratory Data Analysis](#IMDb-Datasets-Analysis)
    - [title.akas.tsv.gz](#title.akas.tsv.gz)
    - [title.basics.tsv.gz](#title.basics.tsv.gz)
    - [title.crew.tsv.gz](#title.crew.tsv.gz)
    - [title.episode.tsv.gz](#title.episode.tsv.gz)
    - [title.principals.tsv.gz](#title.principals.tsv.gz)
    - [title.ratings.tsv.gz](#title.ratings.tsv.gz)
    - [name.basics.tsv.gz](#name.basics.tsv.gz)
- [Data pre-processing, transformation and merging](#Methods)
    - [Data pre-processing and transformation](#Methods)
    - [IMDb Dataset merging](#Methods)
        - [Outlier and missing value correction using IMDb dataset values](#Outlier-and-missing-value-correction)
        - [Dataset expansion with IMDb datasets merging](#IMDb-datset-merging)
    - [Build initial co-stardom graphs](#Methods)
        - [Actor-to-actor](#Methods)
        - [Movie-to-movie](#Methods)
- [Methods](#Methods)
    - [Step 1](#Step1)
    - [Step 2](#Step2)
    - [Step 3](#Step3)
    - [Step 4](#Step4)
    - [Step 5](#Step5)
    - [Step 6](#Step6)

In [None]:
# install library used to perform sentiment analysis
# !pip install vaderSentiment
# interactive graph visualization
# !pip install pyvis

## Libraries

In [None]:
# import required libraries
import pandas as pd
from datetime import date
import matplotlib.pyplot as plt
import numpy as np
import ast

# sentiment analysis
# import vaderSentiment

# graphs handling
from pyvis.network import Network
import networkx as nx

# set pandas options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

## CMU Data Importation

In [None]:
# set data paths
DATA_FOLDER = 'data/MovieSummaries/'

CHARACTER_META = DATA_FOLDER+'character.metadata.tsv'
MOVIE_META = DATA_FOLDER+'movie.metadata.tsv'
NAME_CLUSTERS = DATA_FOLDER+'name.clusters.txt'
PLOT_SUMM = DATA_FOLDER+'plot_summaries.txt'
TV_TROPES = DATA_FOLDER+'tvtropes.clusters.txt'

In [None]:
# load movies metadata
movie_meta = pd.read_csv(MOVIE_META, sep='\t', header=None)

movie_meta.columns = ['wikipedia_movie_id', 'freebase_movie_id', 'movie_name', 'movie_release_date', 'movie_box_office_revenue',
                     'movie_runtime', 'movie_languages', 'movie_countries', 'movie_genres']

# load characters metadata
character_meta = pd.read_csv(CHARACTER_META, sep='\t', header=None)
character_meta.columns = ['wikipedia_movie_id', 'freebase_movie_id', 'movie_release_date' ,'character_name', 'actor_date_of_birth', 'actor_gender',
                    'actor_height_m', 'actor_ethnicity_id', 'actor_name', 'actor_age_at_movie_release', 'freebase_character/actor_map_id',
                    'freebase_character_id', 'freebase_actor_id']

# load character name clusters
name_clusters = pd.read_csv(NAME_CLUSTERS, sep='\t', header=None)
name_clusters.columns = ['character_name', 'freebase_character/actor_map_id']

# load plot summaries
plot_summ = pd.read_csv(PLOT_SUMM, sep='\t', header=None)
plot_summ.columns = ['wikipedia_movie_id', 'summary']

# load tv tropes
tv_tropes = pd.read_csv(TV_TROPES, sep='\t', header=None)
tv_tropes.columns = ['character_type', 'freebase_character/actor_map_id']

## Exploratory Data Analysis

### CMU Datasets

The data is split into five different files, namely **plot_summaries**, **movie.metadata**, **character.metadata**, **tvtropes.clusters** and **name.clusters**. We will explore the files individually before merging the relevant features into two different dataframes: the first will be movie-centric and indexed by movie_id, the second will be cast-centric and indexed by actor_id. This will ease the construction of movie-to-movie, actor-to-movie and actor-to-actor graphs, on which we will perform an extensive network analysis.

#### Movie metadata

The movie metadata file is extracted in a dataframe with the following attributes:  

`wikipedia_movie_id`:  Wikipedia movie id  (int)  
`freebase_movie_id`:  Freebase movie id  (str)  
`movie_name`:  movie name  (str)  
`movie_release_date`:  unformatted release date of the movie  (str)  
`movie_box_office_revenue`:  box office revenue  (float)  
`movie_runtime`:  movie runtime  (float)  
`movie_languages`:  movie languages (Freebase id to name mapping, stored as a string)  
`movie_countries`:  movie countries (Freebase id to name mapping, stored as a string)  
`movie_genres`:  movie genres (Freebase id to name mapping, stored as a string)  

In [None]:
movie_meta.head(1)

We compute the percentage of missing values per attribute to see which features are usable for our analysis. Almost 90% of the movie revenue values are missing; we will therefore drop this column, or possibly try to fill it using the IMDb datasets.

In [None]:
# display null values for every column
(movie_meta.isna().sum()/len(movie_meta))*100

The next cell verifies that wikipedia_movie_id and freebase_movie_id can both be used as an index, and that this dataframe contains no duplicate movies.

In [None]:
# compute number of movies (distinct IDs)
n_wiki_movies = movie_meta['wikipedia_movie_id'].nunique()
n_freebase_id = movie_meta['freebase_movie_id'].nunique()

assert(len(movie_meta) == n_wiki_movies)
assert(n_wiki_movies == n_freebase_id)

print(f'There are {n_wiki_movies} movies in the CMU database')

We then need to convert the release date attribute into a dedicated date object to ease future computations. 

In [None]:
# convert dates into date format
movie_meta['movie_release_date_formatted'] = pd.to_datetime(movie_meta['movie_release_date'], errors='coerce').apply(lambda x: x.date())
movie_meta.dtypes

We have to check the instances where the release date is invalid, and then manually correct them if possible. Fortunately, there is only one such instance.

In [None]:
# display instances in which movie year is invalid
movie_meta[movie_meta['movie_release_date_formatted'].isnull() & ~(movie_meta['movie_release_date'].isnull())]

In [None]:
# correct invalid release dates
movie_meta.loc[movie_meta['wikipedia_movie_id'] == 29666067, 'movie_release_date_formatted'] = date(2010, 12, 2)
assert 0 == len(movie_meta[movie_meta['movie_release_date_formatted'].isnull() & ~(movie_meta['movie_release_date'].isnull())])

In [None]:
# corrected data
movie_meta[movie_meta['wikipedia_movie_id'] == 29666067]

In [None]:
# update original column
movie_meta['movie_release_date'] = movie_meta['movie_release_date_formatted']
del movie_meta['movie_release_date_formatted']
# save year
movie_meta['movie_release_year'] = movie_meta['movie_release_date'].apply(lambda x: x.year)

In [None]:
n_bins = int(movie_meta['movie_release_year'].dropna().max() - movie_meta['movie_release_year'].dropna().min())
movie_meta['movie_release_year'].plot.hist(bins=n_bins)

plt.xlabel('Movie Release Year')
plt.ylabel('Number of movies')
plt.title('Evolution of yearly movie releases')
plt.show()

In [None]:
np.log(movie_meta['movie_box_office_revenue']).plot.hist(bins=100)

plt.xlabel('Movie Box Office Revenue (log)')
plt.ylabel('Number of movies')
plt.title('Movie Box office revenue distribution')
plt.show()

In [None]:
# Distribution of movie runtime
movie_meta['movie_runtime'].plot.box()

plt.ylabel('Movie Runtime')
plt.title('Movie runtime distribution')
plt.show()
movie_meta['movie_runtime'].head()

In [None]:
# Display outliers
movie_meta.sort_values(by='movie_runtime', ascending=False).head()

To fix outliers and missing values, we plan to look up the corresponding values in the IMDb datasets.
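
A minimal sketch of how such a correction could look, on made-up toy data. It assumes movies can be matched on title and release year, which is only an approximation, and `MAX_RUNTIME` is a hypothetical plausibility threshold. In the real IMDb files, `startYear` and `runtimeMinutes` are stored as strings with `\N` for missing values and would need to be converted first.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_meta (values are made up for illustration)
cmu = pd.DataFrame({
    'movie_name': ['Alpha', 'Beta', 'Gamma'],
    'movie_release_year': [2001, 1999, 2010],
    'movie_runtime': [95.0, np.nan, 1_000_000.0],  # one missing value, one outlier
})

# Toy stand-in for IMDb title.basics
imdb = pd.DataFrame({
    'primaryTitle': ['Alpha', 'Beta', 'Gamma'],
    'startYear': [2001, 1999, 2010],
    'runtimeMinutes': [96.0, 110.0, 102.0],
})

MAX_RUNTIME = 600  # hypothetical threshold above which a runtime is treated as an outlier

# Left join so that every CMU movie is kept, matched or not
fixed = cmu.merge(imdb, how='left',
                  left_on=['movie_name', 'movie_release_year'],
                  right_on=['primaryTitle', 'startYear'])

# Treat implausible runtimes as missing, then fall back to the IMDb value
fixed.loc[fixed['movie_runtime'] > MAX_RUNTIME, 'movie_runtime'] = np.nan
fixed['movie_runtime'] = fixed['movie_runtime'].fillna(fixed['runtimeMinutes'])

# Keep only the original CMU columns
fixed = fixed[list(cmu.columns)]
```

Valid CMU values are left untouched; only missing or implausible runtimes are replaced.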

In [None]:
# Distribution of movie runtime
# We display values between 0.05 and 0.95 quantiles as we will fix the outliers with the imdb dataset
display_range = (movie_meta['movie_runtime'].quantile(0.05), movie_meta['movie_runtime'].quantile(0.95))
movie_meta['movie_runtime'].plot.hist(bins=50, range=display_range)

plt.xlabel('Movie Runtime')
plt.ylabel('Number of movies')
plt.title('Movie runtime distribution between 0.05 and 0.95 quantiles')
plt.show()

In [None]:
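# note: this is an alias of movie_meta, not a copy, so the columns added below also appear in movie_meta
test_meta = movie_meta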

In [None]:
# parse the serialized {freebase_id: name} dictionaries and keep the names (the dictionary values)
test_meta.loc[:, 'movie_languages_ids'] = test_meta['movie_languages'].apply(lambda x: list(ast.literal_eval(x).values()))
test_meta.loc[:, 'movie_countries_ids'] = test_meta['movie_countries'].apply(lambda x: list(ast.literal_eval(x).values()))
test_meta.loc[:, 'movie_genres_ids'] = test_meta['movie_genres'].apply(lambda x: list(ast.literal_eval(x).values()))

In [None]:
language_ids = pd.Series([language for languages in test_meta['movie_languages_ids'] for language in languages])
language_counts = language_ids.value_counts()[:10]

country_ids = pd.Series([country for countries in test_meta['movie_countries_ids'] for country in countries])
country_counts = country_ids.value_counts()[:10]

genre_ids = pd.Series([genre for genres in test_meta['movie_genres_ids'] for genre in genres])
genre_counts = genre_ids.value_counts()[:10]

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(3, figsize=(20, 25))
ax1.bar(x=language_counts.index, height=language_counts.values)
ax1.set_xlabel('Language')
ax1.set_ylabel('Number of movies')
ax1.set_title('The 10 most frequent languages')

ax2.bar(x=country_counts.index, height=country_counts.values)
ax2.set_xlabel('Country')
ax2.set_ylabel('Number of movies')
ax2.set_title('The 10 most frequent countries')

ax3.bar(x=genre_counts.index, height=genre_counts.values)
ax3.set_xlabel('Genre')
ax3.set_ylabel('Number of movies')
ax3.set_title('The 10 most frequent genres')

plt.show()

#### Plot summaries

The plot summary file is extracted in a dataframe with the following attributes :

`wikipedia_movie_id`:  Wikipedia movie id  (int)    
`summary`:  Wikipedia plot summary of the movie  (str) 

In [None]:
plot_summ.head(1)

In [None]:
# Using trivial tokenizer
plot_summ['count'] = plot_summ['summary'].apply(lambda x: len(x.split()))

fig, axes = plt.subplots(nrows=1, ncols=2)
plot_summ['count'].plot(kind='hist', ax = axes[0], bins=50)
plot_summ['count'].plot(kind='hist', logy=True, bins=50, ax=axes[1])
#plt.suptitle('Plot summary word count distribution', x=0.5, y=1.05, ha='center', fontsize='xx-large')
axes[0].title.set_text('Word count distribution')
axes[1].title.set_text('Log scaled word count distribution')
fig.tight_layout()

We can merge the two previous datasets, plot_summary and movie_metadata, into a single dataframe. Unfortunately, the plot_summary dataframe only contains summaries for 42303 movies, far fewer than the 81741 movies described in the movie_metadata dataframe. The inner join of the two dataframes on the movie id yields 42204 matches, meaning that 99 summaries could not be matched with a movie. We will thus keep the merged data as a separate dataframe in case we need it.
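
One way to obtain such match counts is an outer merge with pandas' `indicator=True` option. The sketch below uses small made-up dataframes in place of `movie_meta` and `plot_summ`; the ids and values are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for movie_meta and plot_summ
movies = pd.DataFrame({'wikipedia_movie_id': [1, 2, 3, 4],
                       'movie_name': ['A', 'B', 'C', 'D']})
summaries = pd.DataFrame({'wikipedia_movie_id': [2, 3, 99],
                          'summary': ['s2', 's3', 's99']})

# The outer merge keeps every row; the '_merge' column records where each row came from
check = movies.merge(summaries, on='wikipedia_movie_id', how='outer', indicator=True)
counts = check['_merge'].value_counts()

n_matched = int(counts['both'])                       # movies that have a summary
n_orphan_summaries = int(counts['right_only'])        # summaries without a movie
n_movies_without_summary = int(counts['left_only'])   # movies without a summary
```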

In [None]:
# inner join of movie metadata and plot summaries on the Wikipedia movie id
merged = pd.merge(movie_meta, plot_summ, on='wikipedia_movie_id')

# index by movie id and sort
merged = merged.set_index('wikipedia_movie_id').sort_index()
merged.head(1)

#### Character metadata

The character metadata file is extracted in a dataframe with the following attributes :

`wikipedia_movie_id`:  Wikipedia movie id  (int)  
`freebase_movie_id`:  Freebase movie id  (str)  
`movie_release_date`:  unformatted release date of the movie  (str)  
`character_name`:  character name  (str)  
`actor_date_of_birth`:  actor date of birth  (str)  
`actor_gender`:  actor gender  (str)  
`actor_height_m`:  actor height in meters  (float)  
`actor_ethnicity_id`:  actor ethnicity, specified with a Freebase id  (str)  
`actor_name`:  actor name  (str)  
`actor_age_at_movie_release`:  actor age at movie release date  (float)  
`freebase_character/actor_map_id`:  Freebase id of the character/actor mapping  (str)  
`freebase_character_id`:  Freebase character id  (str)  
`freebase_actor_id`:  Freebase actor id  (str)

In [None]:
character_meta.head()

In [None]:
character_meta['actor_gender'].value_counts().plot.bar()
plt.xlabel('Actor gender')
plt.ylabel('Number of characters')
plt.title('Actor genders in movies')
plt.show()
character_meta['actor_ethnicity_id'].value_counts()[:10].plot.bar()
plt.xlabel('Actor ethnicity')
plt.ylabel('Number of characters')
plt.title('Actor ethnicities in movies')
plt.show()

In [None]:
character_meta['actor_age_at_movie_release'].hist(bins=101, range=(0, 100))
plt.xlabel('actor age at movie release')
plt.ylabel('Number of characters')
plt.title('Distribution of actor age at movie release')
plt.show()
character_meta['actor_height_m'].hist(bins=25, range=(1.2, 2.2))
plt.xlabel('actor height (m)')
plt.ylabel('Number of characters')
plt.title('Distribution of actor height')
plt.show()

In [None]:
name_clusters.head()

In [None]:
name_clusters.nunique()

#### TV Tropes

The tv tropes file is extracted in a dataframe with the following attributes :

`character_type`:  short description of the character type (str)    
`freebase_character/actor_map_id`:  dictionary containing character name, movie name, actor name and actor map id  (dict) 

In [None]:
tv_tropes.head()

In [None]:
tv_tropes.nunique()

As stated in the paper presenting the CMU datasets, there are 72 character/tv tropes types.  
Our next task is to combine the last three datasets into a single dataframe that will contain character/actor information : we can make use of the 'id' attribute inside the 'freebase_character/actor_map_id' column to merge this dataframe with the character dataframe.

Our ultimate goal is to create two dataframes: the first will be movie-centric, that is, indexed by the movie id (wikipedia_id/freebase_id), and will contain cast information in the form of a list/dictionary; the second will be actor-centric, that is, indexed by the actor id (freebase_id), and will contain all the actor information, the characters played and the adjacent actors (ids of the actors they collaborated with).
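
A minimal sketch of how these two views could be derived from the character metadata, assuming a toy dataframe with the same column names as `character_meta`. The ids and names are made up, and the column names `cast`, `characters` and `adjacent_actors` are illustrative choices.

```python
import pandas as pd

# Toy stand-in for character_meta
chars = pd.DataFrame({
    'freebase_movie_id': ['/m/m1', '/m/m1', '/m/m2', '/m/m2', '/m/m2'],
    'freebase_actor_id': ['/m/a1', '/m/a2', '/m/a2', '/m/a3', '/m/a1'],
    'character_name':    ['Ann', 'Bob', 'Carl', 'Dan', 'Eve'],
})

# Movie-centric view: one row per movie, holding its cast as a list
movie_centric = (chars.groupby('freebase_movie_id')['freebase_actor_id']
                      .apply(list).to_frame('cast'))

# Self-join on the movie id to enumerate pairs of actors sharing a movie
pairs = chars.merge(chars, on='freebase_movie_id', suffixes=('', '_other'))
pairs = pairs[pairs['freebase_actor_id'] != pairs['freebase_actor_id_other']]

# Actor-centric view: characters played and actors collaborated with
actor_centric = pd.DataFrame({
    'characters': chars.groupby('freebase_actor_id')['character_name'].apply(list),
    'adjacent_actors': pairs.groupby('freebase_actor_id')['freebase_actor_id_other']
                            .apply(lambda s: sorted(set(s))),
})
```

The self-join grows quadratically with cast size, so on the full dataset it may be preferable to build the adjacency directly in a graph library.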

## IMDb Datasets Exploratory Data Analysis

### Loading the datasets

In [None]:
DATA_FOLDER = 'data/imdb/'

TITLES_AKA = DATA_FOLDER+'title.akas.tsv.gz'
TITLES_BASICS = DATA_FOLDER+'title.basics.tsv.gz'
TITLES_CREW = DATA_FOLDER+'title.crew.tsv.gz'
TITLES_PRINCIPLES = DATA_FOLDER+'title.principals.tsv.gz'
NAME_BASICS = DATA_FOLDER+'name.basics.tsv.gz'
WRITERS = DATA_FOLDER+"writers_after2012.pkl.gz"
DIRECTORS = DATA_FOLDER+"directors_after2012.pkl.gz"

In [None]:
# load dataset
titles_meta = pd.read_csv(TITLES_AKA, sep='\t')
titles_basics = pd.read_csv(TITLES_BASICS, sep='\t')
titles_crew = pd.read_csv(TITLES_CREW, sep='\t')
titles_principles = pd.read_csv(TITLES_PRINCIPLES, sep='\t')
name_basics = pd.read_csv(NAME_BASICS, sep='\t')

### Filtering and pre-processing the data

#### Films

In [None]:
# we select only movies from 2012 onwards, as we already have movies before 2012
threshold_year = 2012

# From titles_aka, we only retain the titleId, ordering and title columns, for the original titles
df_titles_aka = titles_meta[['titleId', 'ordering', 'title']][(titles_meta['isOriginalTitle']==1)]

# We retained only non-adult movies
df_basics = titles_basics[(titles_basics['titleType'] == 'movie') & (titles_basics['isAdult'] == 0)]
df_basics = df_basics.drop(['titleType', 'isAdult', 'endYear'], axis=1)
df_basics = df_basics[df_basics['startYear'] != r"\N"]
df_basics['startYear'] = df_basics['startYear'].astype(int)

after_treshold = list(df_basics['tconst'][df_basics['startYear'] >= threshold_year])

#### Writers and directors

In [None]:
df_writers_per_movie = titles_crew[['tconst', 'writers']]
df_directors_per_movie = titles_crew[['tconst', 'directors']]

In [None]:
# we select only movies from 2012 onwards, as we already have movies before 2012,
# and keep only movies with writers/directors
df_writers_per_movie = df_writers_per_movie[(df_writers_per_movie['tconst'].isin(after_treshold)) &
                                           (df_writers_per_movie['writers'] != r"\N")]
df_directors_per_movie = df_directors_per_movie[(df_directors_per_movie['tconst'].isin(after_treshold)) &
                                               (df_directors_per_movie['directors'] != r"\N")]

In [None]:
# Going from movie ids with their writers/directors to writers/directors with the movies they wrote/directed

# one row per (movie, writer) pair, then the list of movies per writer
df_writers_per_movie = df_writers_per_movie.assign(writers=df_writers_per_movie['writers'].str.split(",")) \
    .explode('writers')
df_writers_per_movie = df_writers_per_movie.groupby('writers')['tconst'].apply(list).to_frame()

# one row per (movie, director) pair, then the list of movies per director
df_directors_per_movie = df_directors_per_movie.assign(directors=df_directors_per_movie['directors'].str.split(",")) \
    .explode('directors')
df_directors_per_movie = df_directors_per_movie.groupby('directors')['tconst'].apply(list).to_frame()

In [None]:
#Pickling the final dataframes
df_writers_per_movie.to_pickle(WRITERS)
df_directors_per_movie.to_pickle(DIRECTORS)  

In [None]:
df_writers = pd.read_pickle(WRITERS) 
df_directors = pd.read_pickle(DIRECTORS) 

#### Actors

In [None]:
test = titles_principles[(titles_principles['category'].str.contains('actor')) & 
                 (titles_principles['tconst'].isin(after_treshold))]

In [None]:
titles_principles[(titles_principles['category'] != 'actor') &
                 (titles_principles['category'] != 'director') &
                 (titles_principles['category'] != 'writer')]

In [None]:
# Actors with the list of movies they played in
# Issue with this because some actors are also directors or writers, and in that case
# we will not count all the movies they were in
df_actors = titles_principles[['tconst', 'nconst']][((titles_principles['category']=='actor') | 
                                                    (titles_principles['category']=='actress')) &
                                                   (titles_principles['tconst'].isin(after_treshold))]

# one row per (movie, actor) pair, then the list of movies per actor
df_actors = df_actors.assign(nconst=df_actors['nconst'].str.split(",")).explode('nconst')
df_actors = df_actors.groupby('nconst')['tconst'].apply(list).to_frame()

In [None]:
df_actors

#### Crew information

In [None]:
name_basics.index = name_basics['nconst']
df_name = name_basics.drop(['nconst', 'knownForTitles'], axis=1)

In [None]:
# Merging ids with actual crew information

df_actors_info = df_actors.merge(df_name, how='inner', right_index=True, left_index=True)
df_writers_info = df_writers.merge(df_name, how='inner', right_index=True, left_index=True)
df_directors_info = df_directors.merge(df_name, how='inner', right_index=True, left_index=True)

df_actors_info

In [None]:
# Removing crew with no birth year

df_actors_info = df_actors_info[df_actors_info['birthYear'] != r"\N"]
df_writers_info = df_writers_info[df_writers_info['birthYear'] != r"\N"]
df_directors_info = df_directors_info[df_directors_info['birthYear'] != r"\N"]

In [None]:
# Creating crew from name_basics (all the crew merged)
# Known for section doesn't contain all the movies

df_crew = name_basics[(name_basics['birthYear'] != r"\N") & (name_basics['knownForTitles'] != r"\N")]

### To do

In [None]:
# Perform an initial data analysis

# If possible:

# Process the actors: done
# Filter the movies so that they can be merged
# Filter and preprocess the crew like the writers and directors: done
# Merge writers, directors and crew with the other crew members from CMU
# Merge the CMU movies with these

### Crew initial analysis

In [None]:
df_crew

In [None]:
# Crew year of birth distribution

df_crew = df_crew.assign(birthYear=df_crew['birthYear'].astype(int))
plt.hist(df_crew['birthYear'].values, 
         bins = 100, log=True)

plt.xlabel('Year of birth')
plt.ylabel('Number of crew members')
plt.title('Distribution of the year of birth for the crew')

In [None]:
# Top 10 actors by number of movies played in

df_actors['nb of movies'] = df_actors['tconst'].apply(len)
df_top_actors = df_actors.sort_values(by='nb of movies', ascending=False).head(10)
df_top_actors = df_top_actors.merge(df_name, how='inner', right_index=True, left_index=True)

In [None]:
plt.bar(df_top_actors['primaryName'].values, df_top_actors['nb of movies'].values)

plt.xlabel('Actor name')
plt.ylabel('Number of films')
plt.title('Top 10 actors by number of movies played in')
plt.xticks(rotation=90)

In [None]:
# Top 10 directors by number of movies directed

df_directors['nb of movies'] = df_directors['tconst'].apply(len)
df_top_directors = df_directors.sort_values(by='nb of movies', ascending=False).head(10)
df_top_directors = df_top_directors.merge(df_name, how='inner', right_index=True, left_index=True)

In [None]:
plt.bar(df_top_directors['primaryName'].values, df_top_directors['nb of movies'].values)

plt.xlabel('Director name')
plt.ylabel('Number of films')
plt.title('Top 10 directors by number of movies directed')
plt.xticks(rotation=90)

In [None]:
# Distribution of number of films in which actors played

plt.hist(df_actors['nb of movies'].values, 
         bins = 150, log=True)

plt.xlabel('Number of movies')
plt.ylabel('Number of actors')
plt.title('Distribution of number of movies actors played in')

In [None]:
# Distribution of number of movies directed for directors

plt.hist(df_directors['nb of movies'].values, 
         bins = 100, log=True)

plt.xlabel('Number of movies')
plt.ylabel('Number of directors')
plt.title('Distribution of number of movies directors directed')

In [None]:
#### Writers

# Director date of birth distribution 
# Top10 directors that directed the most nb of movies
# Distribution of number of films directed for each director
# Add the same plots, but make all the related ones in subplots

## Data preprocessing

### Outlier and missing value correction using IMDb dataset values

### Data transformation

#### Movie metadata

TODO : \
Preprocessing notes: 
- Replacing the index = movie id
- formatting the dates attributes
- dropping duplicates
- treating the NaN values in movie revenue 
- treating NaN values in movie runtime
- deserialize genres/country/languages attributes
- treating empty value ("{}") in country attribute
- normalizing the numeric values (only before performing PCA/Regression/Classification)

#### Character metadata

TODO : \
Preprocessing notes: 
- joining this character dataset with the previous dataframe on wikipedia movie id
- formatting the dates
- verify that the freebase id of the actor is unique so it can be used as an index for actors
- merge actor and character dataframe
- sentiment analysis on the m neighbouring words surrounding the character name

#### Plot summaries

TODO : \
Preprocessing notes:
- indexing using the movie id
- NLP methods, using the spaCy NLP framework:
    - tokenize
    - parse
    - remove stop words
    - lemmatize
    - topic prediction using the Empath library
    - sentiment analysis using VADER
- joining this dataset with the movie dataset
- character encoding
- remove Wikipedia markup, e.g. {{hatnote}}

#### TV Tropes

TODO : 

### IMDb Datasets merging

In [None]:
# Read imdb to fix data
# imdb_title_basics = pd.read_csv(IMDB_TITLE_BASICS, sep='\t')
# imdb_title_basics.columns = ['imdb_' + cn for cn in imdb_title_basics.columns]

In [None]:
# # display null values
# imdb_title_basics.isna().sum()

In [None]:
# movie_meta_imdb_merged = movie_meta.merge(imdb_title_basics, how='left', left_on='movie_name', right_on='imdb_primaryTitle')

In [None]:
# movie_meta_imdb_merged.head()

In [None]:
# imdb_title_basics.head()

### Data Transformation : Build co-stardom graphs : actor-to-actor and movie-to-movie

## Methods

### Step 1 : Co-stardom graphs

##### 1.A : **Actor-to-actor**
We will first explore the CMU data, potentially augmented with other datasets, by building a co-stardom network. A co-stardom network is essentially a collaboration graph of film actors: the nodes represent actors, and two nodes are linked if the two actors have starred in the same movie. We can add a weight to each link, equal to the number of movies in which the two actors have performed together. This will be our first actor-to-actor graph.
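
A minimal sketch of this construction with `networkx`, on a made-up dictionary of casts (the movie and actor ids are hypothetical):

```python
from itertools import combinations

import networkx as nx

# Toy casts keyed by movie id
casts = {
    'm1': ['a1', 'a2', 'a3'],
    'm2': ['a1', 'a2'],
    'm3': ['a3', 'a4'],
}

G_actors = nx.Graph()
for movie, cast in casts.items():
    # every pair of distinct actors in the same movie shares an edge
    for u, v in combinations(sorted(set(cast)), 2):
        if G_actors.has_edge(u, v):
            G_actors[u][v]['weight'] += 1   # one more movie together
        else:
            G_actors.add_edge(u, v, weight=1)
```

Here `a1` and `a2` appear together in two movies, so their edge carries a weight of 2.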

##### 1.B : **Movie-to-movie**
We can similarly construct a movie-to-movie graph which could reveal interesting insights. In such a graph, a node would be a movie and two nodes would be linked if they share some of the cast members. If we deem it necessary, we can consider only the "important" people of the cast.
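
A minimal sketch of the movie-to-movie graph, again on made-up casts. Weighting each edge by the number of shared cast members is an assumption of this example, not something fixed by the description above.

```python
from itertools import combinations

import networkx as nx

# Toy casts keyed by movie id
casts = {
    'm1': {'a1', 'a2', 'a3'},
    'm2': {'a1', 'a2'},
    'm3': {'a3', 'a4'},
    'm4': {'a5'},
}

G_movies = nx.Graph()
G_movies.add_nodes_from(casts)  # keep movies that share no cast as isolated nodes
for left, right in combinations(casts, 2):
    shared = casts[left] & casts[right]
    if shared:
        # weight = number of cast members the two movies have in common
        G_movies.add_edge(left, right, weight=len(shared))
```

Comparing all pairs of movies is quadratic in the number of movies, so the full dataset would likely call for going through the actors instead, for example by projecting a bipartite actor-movie graph.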

### Step 2 : Simplify and filter out nodes that are not necessary

##### 2.A Filter out the graph nodes below a certain degree and the graph edges below a certain weight
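
A minimal sketch of such a filter with `networkx`, on a small made-up weighted graph. `MIN_WEIGHT` and `MIN_DEGREE` are hypothetical thresholds that would need to be tuned on the real data.

```python
import networkx as nx

# Toy weighted co-stardom graph
G = nx.Graph()
G.add_weighted_edges_from([
    ('a1', 'a2', 3), ('a1', 'a3', 1), ('a2', 'a3', 2), ('a3', 'a4', 1),
])

MIN_WEIGHT = 2  # hypothetical: drop edges lighter than this
MIN_DEGREE = 1  # hypothetical: drop nodes with fewer remaining ties than this

H = G.copy()

# 1) remove the edges below the weight threshold
weak_edges = [(u, v) for u, v, w in H.edges(data='weight') if w < MIN_WEIGHT]
H.remove_edges_from(weak_edges)

# 2) remove the nodes whose degree fell below the degree threshold (single pass)
low_degree = [n for n, d in H.degree() if d < MIN_DEGREE]
H.remove_nodes_from(low_degree)
```

The node filter is applied once. With a higher `MIN_DEGREE`, removing nodes can push their neighbours below the threshold too, so an iterative variant may be needed.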


##### 2.B Create networks for Hollywood, Bollywood, and other countries

### Step 3 : Network analysis

##### 3.A : **Community detection**  
A network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. A well-known algorithm for this problem is the Girvan-Newman algorithm: it detects communities by progressively removing the edges with the highest betweenness from the original network. The connected components of the remaining network are the communities.
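
As an illustration, `networkx` ships an implementation of this algorithm. The sketch below applies it to a made-up graph consisting of two triangles joined by a single bridge edge.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Two triangles joined by a single bridge edge (a3 - b1)
G = nx.Graph([('a1', 'a2'), ('a2', 'a3'), ('a1', 'a3'),
              ('b1', 'b2'), ('b2', 'b3'), ('b1', 'b3'),
              ('a3', 'b1')])

# girvan_newman yields successively finer partitions; take the first split
first_split = next(girvan_newman(G))
communities = sorted(sorted(c) for c in first_split)
```

The bridge carries the most shortest paths, so it is removed first and the two triangles come out as the two communities.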

##### 3.B: **Clustering based on actor attributes**  
The idea is to use an algorithm that groups actors with similar features, which we extracted and engineered in the preprocessing part.
The fact that actors have categorical features prevents us from using K-means clustering, as the Euclidean distance is not well
defined for categorical features. To address this, we might use K-modes, which relies on the Hamming distance for categorical data, or its extension K-prototypes, which combines the Hamming distance for categorical features with the Euclidean distance for numerical ones.
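
A minimal sketch of a dissimilarity that mixes the two distances, in the style used by K-prototypes clustering. The function name `mixed_dissimilarity`, the weight `gamma` and the example features are assumptions made for illustration; this is not a full clustering implementation.

```python
import numpy as np

def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Squared Euclidean distance on the numerical features plus gamma times
    the number of mismatching categorical features (Hamming distance)."""
    diff = np.asarray(num_a, dtype=float) - np.asarray(num_b, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = sum(x != y for x, y in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Hypothetical actors: (height in m, standardized age) and (gender, ethnicity id)
d_same = mixed_dissimilarity([1.80, 0.5], ['M', 'e1'], [1.80, 0.5], ['M', 'e1'])
d_diff = mixed_dissimilarity([1.80, 0.5], ['M', 'e1'], [1.80, 1.5], ['F', 'e1'])
```

The weight `gamma` balances the two parts. The numerical features should be standardized beforehand so that no single feature dominates the distance.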

### Step 4 : Build an ego network graph between actors

#### Ego network mathematical description

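Ego networks are networks centered on an individual node, the "ego", together with the nodes it is connected to and the ties among them. Every node of a graph can be treated as an ego. Ego networks allow us to describe and index the variation among individuals in the way they are embedded in "local" social structures.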

Ego networks come with many definitions and related metrics (here the ones that are of interest to us): 

- Neighborhood is the collection of ego and all nodes to whom ego has a connection at some path length, together with the ties among them (by neighborhood we usually imply a path length of one). The N-step neighborhood includes all nodes up to path length N.
- In-neighborhood includes all the actors with ties directed to ego.
- Out-neighborhood includes all the actors with ties directed from ego.

- Average geodesic distance is the mean of the shortest path lengths among all connected pairs in the ego network.
- Diameter of an ego network is the length of the longest shortest path between connected actors.
- Brokerage is the number of pairs not directly connected. Normalized brokerage is the brokerage divided by the number of pairs.
- Number of weak components, where a weak component is a maximal set of actors who are connected, not taking into account the direction of the ties.

- Two-step reach gives the percentage of all actors in the network that are within two directed steps of ego.  
- Reach efficiency is the two-step reach divided by the size of the ego network.

- Structural holes are absences of ties between two parts of a network. They help determine very important aspects of the positional advantage/disadvantage of individuals that result from how they are embedded in neighborhoods. This helps us think about how and why the ways in which an actor is connected affect its constraints and opportunities. 
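
Some of these quantities can be sketched with `networkx` on a small made-up graph. The graph is undirected, as a co-stardom graph would be, so the in- and out-neighborhoods coincide. Here brokerage is computed among ego's alters only, and two-step reach as a fraction of all other nodes; both are assumptions of this example.

```python
import networkx as nx

# Toy undirected graph around a node called 'ego'
G = nx.Graph([('ego', 'a'), ('ego', 'b'), ('ego', 'c'),
              ('a', 'b'), ('c', 'd'), ('d', 'e')])

# One-step neighborhood: ego, its direct neighbours and the ties among them
ego_net = nx.ego_graph(G, 'ego', radius=1)
alters = sorted(n for n in ego_net if n != 'ego')
size = len(alters)

# Brokerage: pairs of alters that are not directly connected
n_pairs = size * (size - 1) // 2
ties_among_alters = ego_net.subgraph(alters).number_of_edges()
brokerage = n_pairs - ties_among_alters
normalized_brokerage = brokerage / n_pairs

# Two-step reach: fraction of the other nodes within two steps of ego
within_two = nx.single_source_shortest_path_length(G, 'ego', cutoff=2)
two_step_reach = (len(within_two) - 1) / (G.number_of_nodes() - 1)
```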

#### Plan

- Describe and index the variation among actors in the way they are embedded in "local" social structures.

### Step 5 : Compute and visualize network metrics to highlight the power and influence of individuals

#### Metrics mathematical description

Graph metrics can be divided into 3 categories, the ones related to Connections, Distributions and Segmentation.

1) Connections

- Assortativity is the extent to which actors form ties with similar versus dissimilar others. This factor of similarity can be gender, race, age, status or any other characteristic.

- Multiplexity is a structural property of network ties that captures the existence of more than one type of relationship between two actors.

- Propinquity describes the tendency for actors to have more ties with other actors that are close geographically.
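
As an illustration of assortativity, `networkx` can compute an assortativity coefficient for a categorical node attribute. The sketch below uses a made-up graph in which ties only exist between actors of the same gender; the attribute name `gender` is an assumption of this example.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(['a1', 'a2'], gender='F')
G.add_nodes_from(['a3', 'a4'], gender='M')
# ties only within the same gender
G.add_edges_from([('a1', 'a2'), ('a3', 'a4')])

# +1 means ties form only between similar nodes; negative values mean ties favour dissimilar nodes
r_gender = nx.attribute_assortativity_coefficient(G, 'gender')
```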

2) Distributions

- Centrality

    - Betweenness centrality captures which nodes are important in the flow of the network. It does so by computing, for every vertex, the number of shortest paths that pass through it. It is computed as follows:
    
    $$g(v)= \sum_{s \neq v \neq t}\frac{\sigma_{st}(v)}{\sigma_{st}}$$
    
    where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.
        
    - Closeness centrality is the reciprocal of the sum of the lengths of the shortest paths between a given node and all other nodes in the graph. The more central a node is, the closer it is to all other nodes.
    
    $$C(x)= \frac{1}{\sum_y d(y,x)}$$

    where $d(y,x)$ is the distance (length of the shortest path) between vertices $x$ and $y$.
        
    - Degree centrality counts how many edges each node has, hence the most degree central actor is the one with the most ties.
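
The three centrality measures can be sketched with `networkx` on a made-up path graph. Note that `networkx` normalizes closeness and degree centrality by the number of other nodes, so the values differ from the raw formulas by that factor.

```python
import networkx as nx

# Path a - b - c - d - e: the middle node lies on the most shortest paths
G = nx.Graph([('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')])

# Unnormalized betweenness; each unordered pair (s, t) is counted once
betweenness = nx.betweenness_centrality(G, normalized=False)

# networkx returns (n - 1) / sum of distances, i.e. the raw closeness times (n - 1)
closeness = nx.closeness_centrality(G)
raw_closeness_c = closeness['c'] / (G.number_of_nodes() - 1)

# degree divided by (n - 1)
degree = nx.degree_centrality(G)

most_central = max(betweenness, key=betweenness.get)
```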

3) Segmentation

- Clustering coefficient measures the degree to which nodes in a graph tend to cluster together.

- Cohesion is the degree to which actors are connected directly to each other by cohesive bonds, where cohesive bonds are bonds that link members of a social group to one another and to the group as a whole.
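
A minimal sketch of the clustering coefficient with `networkx`, on a made-up graph made of a triangle with one pendant node:

```python
import networkx as nx

# Triangle a-b-c with a pendant node d attached to c
G = nx.Graph([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

# Local coefficient: fraction of a node's neighbour pairs that are themselves connected
local = nx.clustering(G)

# Average of the local coefficients over all nodes
average = nx.average_clustering(G)
```

Nodes `a` and `b` have all their neighbours connected. Node `c` has only one connected pair out of three, and `d` has a single neighbour, so its coefficient is zero.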

#### Plan

- Explore the existence of more than one type of relationship between two actors through computing multiplexity.
- Explore the tendency for actors to have more ties with other actors that are close geographically with propinquity.
- Determine positional advantages/disadvantages of individuals from structural holes. This will help us think about how and why the ways in which an actor is connected affect its constraints and opportunities.
- Understand which actors/directors/writers are the most important in the flow of the network, by computing betweenness/closeness/degree centrality.
- Compute to which degree actors/directors/writers tend to cluster together, computing the clustering coefficient.

### Step 6: Build website and write the data story