**Insert Genre column for each movie.** <br>

The genre data are scraped from IMDb. Each movie may have a **different number** of genre tags.<br>

**e.g:** <br>
**Terminator (1984)** : Action, Sci-fi (2 tags) <br>
**WALL·E (2008)** : Animation, Adventure, Family (3 tags)
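
Because the number of tags varies, each cell of the Genre column holds a list rather than a single value. The sketch below shows what that looks like for the two example movies; the toy frame `movies` and the column name `n_tags` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows built from the two example movies (not the real dataset)
movies = pd.DataFrame({
    "Title": ["Terminator (1984)", "WALL·E (2008)"],
    "Genre": [["Action", "Sci-fi"], ["Animation", "Adventure", "Family"]],
})

# Count the tags of each movie
movies["n_tags"] = movies["Genre"].apply(len)

# Optional: one row per (movie, tag) pair, handy for per-genre statistics
long_form = movies.explode("Genre", ignore_index=True)
```

Here `n_tags` is 2 for the first movie and 3 for the second, and `long_form` has five rows in total.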

In [1]:
import pandas as pd
import networkx as nx
import requests
from bs4 import BeautifulSoup as bs

In [2]:
df = pd.read_csv("../data/dataverse_files/network_metadata.tab", sep='\t', lineterminator='\r')

<h1>Scrape Genres from IMDb</h1>

In [3]:
imdb_ids = df['IMDB_id'].tolist()
base_url = "https://www.imdb.com/title/"
all_genres = []

for imdb_id in imdb_ids:
    url = base_url + imdb_id
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = bs(response.text, 'html.parser')
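    # Genre tags are rendered as "chip" spans on the title page
    genres = soup.find_all('span', attrs={"class": "ipc-chip__text"})
    # The 'Back to top' chip uses the same class, so filter it out by its text
    all_genres.append([e.text for e in genres if e.text != 'Back to top'])

# Insert one list of genre tags per movie as the fifth column
df.insert(4, "Genre", all_genres, True)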

<h1>Calculate Transitivity</h1>

In [4]:
import os
# List the GEXF files in ../data/gexf/ (named <number>.gexf)
files = os.listdir("../data/gexf/")

# Strip the extension and convert each file name to an integer
files = [int(os.path.splitext(file)[0]) for file in files]

# Sort numerically, so that 10 comes after 9 rather than after 1
sorted_files = sorted(files)

transitivity = []
for file in sorted_files:
    file_name = f'{file}.gexf'
    G = nx.read_gexf(f"../data/gexf/{file_name}", node_type=None, relabel=False, version='1.2draft')
    transitivity.append(nx.transitivity(G))

df.insert(4, "Transitivity",  transitivity, True)
df.to_csv("network_metadata_with_genres.csv", index=False)

In [5]:
! cp network_metadata_with_genres.csv ../data/

<h1>IMPORTANT</h1>

Open "network_metadata_with_genres.csv" in **Excel** and **replace all** occurrences of **","** with **"."** before going to the next notebook. <br>

This is necessary because the original values are all stored as strings, not numbers, and the data collector used **","** as the decimal mark instead of **"."**, which would be difficult to deal with without the preprocessing above.
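
The same replacement could also be scripted in pandas instead of done by hand. The following is a minimal sketch, not the method used in this project: `convert_decimal_commas` is a hypothetical helper, and the toy column `Density` is made up, so substitute the numeric columns that actually appear in the file.

```python
import pandas as pd

def convert_decimal_commas(frame, columns):
    """Return a copy of `frame` with comma-decimal strings in `columns` parsed as floats."""
    out = frame.copy()
    for col in columns:
        # "0,25" -> "0.25" -> 0.25; values that cannot be parsed become NaN
        out[col] = pd.to_numeric(
            out[col].astype(str).str.replace(",", ".", regex=False),
            errors="coerce",
        )
    return out

# Toy data; the column names are assumptions for illustration only
demo = pd.DataFrame({"Title": ["A", "B"], "Density": ["0,25", "1,5"]})
fixed = convert_decimal_commas(demo, ["Density"])
```

Restricting the conversion to named columns leaves the commas inside the Genre lists untouched. Passing `decimal=","` to `pd.read_csv` when loading the original file may be another option, although it has not been tried on this dataset.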
