# Star Wars Data Science
## Network Analysis, Topic Modeling, and a Wordcloud!
https://linkedin.com/in/dennisbakhuis

## Scraping and building the dataset

There is an incredible amount of information on Star Wars online. One of my favorite sources is the so-called Wookieepedia, a wiki with a crazy amount of Star Wars knowledge.

`All data is available in the GitHub repository and on Kaggle, so there is no need to scrape it yourself and generate a large amount of traffic for Wookieepedia.`

This section describes the process used to scrape this information. A wiki is a collection of pages, and each topic has its own page. To scrape this information, we need to visit each page. A clever way to scrape such websites is to use sitemaps. A sitemap is a special file that webmasters can provide to help web crawlers index the website. We can use the sitemap to get a list of all available pages.

In [None]:
import requests
import xml.etree.ElementTree as ET

url = 'https://starwars.fandom.com'

def get_elements(url : str) -> dict:
    site_map_str = "/sitemap-newsitemapxml-index.xml"
    result = requests.get(url + site_map_str)
    content = result.content
    
    root = ET.fromstring(content)
    elements = {}
    for page in root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
        result = requests.get(page.text)
        c = result.content
        new_root = ET.fromstring(c)
        for element in new_root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
            elements[element.text.split('/')[-1]] = element.text
    print('Found {} elements'.format(len(elements)))
    return elements

elements = get_elements(url)
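
To make the namespace handling in the cell above a bit more tangible, here is a minimal offline sketch. The XML fragment and the `example.org` URLs are made up for illustration, and `extract_locs` is a hypothetical helper, not part of the scrape notebook. It only shows how `<loc>` entries can be pulled out of a sitemap with a namespace map instead of the fully qualified tag name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (same URI as in the cell above).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical, hand-written sitemap fragment for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/wiki/Page_One</loc></url>
  <url><loc>https://example.org/wiki/Page_Two</loc></url>
</urlset>"""


def extract_locs(xml_text: str) -> dict:
    """Map the last URL path segment to the full URL for every <loc> entry."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.split("/")[-1]: loc.text
        for loc in root.iterfind(".//sm:loc", NS)
    }


sample_elements = extract_locs(SAMPLE_SITEMAP)
print(sample_elements)
```

Passing a prefix map to `iterfind` keeps the query readable when the same namespace is needed in several places.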

Currently (April 2021) there are 219,900 pages that could be scraped. That is a bit too much, so I decided to scrape only the pages that are considered canon. Luckily, Wookieepedia gives canon articles their own category. When we open the category, we get a paginated index of all pages that are considered canon. Walking through this index takes some additional work: we have to follow the "next page" link until there is none left.

In [None]:
from bs4 import BeautifulSoup 
import pickle

page_url = 'https://starwars.fandom.com/wiki/Category:Canon_articles'  # all canon articles
base_url = 'https://starwars.fandom.com'

pages = {}
page_num = 1
while page_url is not None:
    result = requests.get(page_url)
    content = result.content
    soup = BeautifulSoup(content, "html.parser")
    
    # extract urls
    links = soup.find_all('a', class_='category-page__member-link')
    links_before = len(pages)
    if links:
        for link in links:
            url = base_url + link.get('href')
            key = link.get('href').split('/')[-1]
            if 'Category:' not in key:
                pages[key] = url
    print(f'Page {page_num} - {len(pages) - links_before} new links ({page_url})')
    page_num += 1
                
    # get next page button
    next_urls = soup.find_all("a", class_='category-page__pagination-next')
    if next_urls:
        new_url = next_urls[0].get('href')
        if new_url == page_url:
            break
        else:
            page_url = new_url
    else:
        page_url = None  # no next-page link: stop the loop


print(f'Number of pages: {len(pages)}')

# Save to disk
with open('../Dataset/starwars_all_canon_dict.pickle', 'wb') as f:
    pickle.dump(pages, f, protocol=pickle.HIGHEST_PROTOCOL)
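
The pagination logic can also be looked at in isolation. The sketch below is an assumption-laden toy version: `FAKE_SITE`, `fetch`, and `crawl` are hypothetical names, and the in-memory dictionary stands in for the real `requests` + `BeautifulSoup` calls. It shows two small safeguards that might be worth considering: resolving relative links with `urljoin`, and remembering visited index pages so a "next" link that loops back cannot trap the crawler.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": index URL -> (member hrefs, next-page href or None).
FAKE_SITE = {
    "https://example.org/cat?page=1": (["/wiki/A", "/wiki/Category:X"], "/cat?page=2"),
    "https://example.org/cat?page=2": (["/wiki/B"], None),
}


def fetch(url):
    """Stand-in for requests.get + HTML parsing (assumption, not the real scraper)."""
    return FAKE_SITE[url]


def crawl(start_url):
    """Follow next-page links and collect all non-category member pages."""
    pages, seen, url = {}, set(), start_url
    while url is not None and url not in seen:
        seen.add(url)  # guard against a "next" link that loops back
        hrefs, next_href = fetch(url)
        for href in hrefs:
            key = href.split("/")[-1]
            if "Category:" not in key:
                pages[key] = urljoin(url, href)  # resolve relative links
        url = urljoin(url, next_href) if next_href else None
    return pages


crawled = crawl("https://example.org/cat?page=1")
print(crawled)
```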

Alright, we now have a list of about 29k canon pages that need to be scraped. These pages share a typical format: a title, a description (often with subsections), and a sidebar with properties. To reduce the amount of information, I will only scrape the first paragraph, the full sidebar, and all links that point to other canon pages. A typical Wookieepedia page is shown below:

<img src="../Assets/sw_scrape1.png" alt="Example of a typical Wookieepedia page" width="500" style="display: block; margin: 0 auto" />

Next, we scrape each page and save the results to disk in partitions:

In [None]:
import re
from tqdm import tqdm

scraped = {}
failed = {}
partition_size = 5000
folder = '../Dataset/'
!rm -rf ./data
!mkdir -p ./data

for ix, (key, page_url) in tqdm(enumerate(pages.items()), total=(len(pages))):
    try:
        # Get page
        result = requests.get(page_url)
        content = result.content
        soup = BeautifulSoup(content, "html.parser")

        # Get title
        heading = soup.find('h1', id='firstHeading')
        if heading is None: continue
        heading = heading.text

        # Extract Sidebar
        is_character = False
        side_bar = {}
        sec = soup.find_all('section', class_='pi-item')
        for s in sec:
            title = s.find('h2')
            if title is None:
                title = '<no category>'
            else:
                title = title.text
            side_bar[title] = {}
            items = s.find_all('div', class_='pi-item')
            for item in items:
                attr = item.find('h3', class_='pi-data-label')
                if attr is None:
                    attr = '<no attribute>'
                else:
                    attr = attr.text
                if attr == 'Species': is_character = True
                # Raw string avoids invalid-escape warnings; strips (...) and [...] references
                value = re.sub(r"[\(\[].*?[\)\]]", '', '], '.join(item.find('div', class_='pi-data-value').text.split(']')))
                value = value.strip()[:-1].replace(',,', ',')
                if ',' in value:
                    value = [i.strip() for i in value.split(',') if i.strip() != '']
                side_bar[title][attr] = value

        # Raw page content
        raw_content = soup.find('div', class_='mw-parser-output')
        if raw_content is not None:
            for raw_paragraph in raw_content.find_all('p', recursive=False):
                if 'aside' in str(raw_paragraph): continue
                break
            paragraph = re.sub(r"[\(\[].*?[\)\]]", '', raw_paragraph.text)

            # cross-links
            keywords = []
            for link in raw_content.find_all('a'):
                part = link.get('href')
                if part is not None:
                    part = part.split('/')[-1] 
                    if part in pages.keys() and part != key:
                        keywords.append(part)
            keywords = list(set(keywords))
        else:
            # Empty page
            keywords = []
            paragraph = ''

        # Data object
        scraped[key] = {
            'url': page_url,
            'title': heading,
            'is_character': is_character,
            'side_bar': side_bar,
            'paragraph': paragraph,
            'crosslinks': keywords,
        }

        # save partition
        if (ix + 1) % partition_size == 0:
            last_number = (ix+1) // partition_size
            fn = folder + f'starwars_all_canon_data_{last_number}.pickle'
            with open(fn, 'wb') as f:
                pickle.dump(scraped, f, protocol=pickle.HIGHEST_PROTOCOL)
            scraped = {}
    except Exception as exc:
        # Keep going, but remember which page failed and why
        print(f'Failed: {key} ({exc})')
        failed[key] = page_url
    
# Save final part to disk
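# Number of the last (partial) partition; computed from the page count so it is
# defined even when fewer than partition_size pages were scraped
fn = folder + f'starwars_all_canon_data_{len(pages) // partition_size + 1}.pickle'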
with open(fn, 'wb') as f:
    pickle.dump(scraped, f, protocol=pickle.HIGHEST_PROTOCOL)
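
The partitioning idea can be tried without scraping anything. The following is a small sketch, assuming a plain dictionary of records; `save_partitions` and `load_partitions` are hypothetical helpers and the records are made up. Unlike the loop above, which writes partitions while scraping, this version splits an already complete dictionary, which makes the round trip easy to check.

```python
import pickle
import tempfile
from pathlib import Path


def save_partitions(records: dict, folder: Path, partition_size: int, prefix: str = "part") -> list:
    """Write records to numbered pickle files with at most partition_size items each."""
    items = list(records.items())
    paths = []
    for start in range(0, len(items), partition_size):
        number = start // partition_size + 1
        path = folder / f"{prefix}_{number}.pickle"
        with open(path, "wb") as f:
            pickle.dump(dict(items[start:start + partition_size]), f,
                        protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def load_partitions(paths) -> dict:
    """Merge all partition files back into a single dictionary."""
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            merged.update(pickle.load(f))
    return merged


# Made-up records, only to exercise the helpers.
demo = {f"page_{i}": {"title": f"Page {i}"} for i in range(12)}
with tempfile.TemporaryDirectory() as tmp:
    written = save_partitions(demo, Path(tmp), partition_size=5)
    restored = load_partitions(written)

print(len(written), restored == demo)
```

Writing in partitions means a crash late in a long scrape costs at most one partition of work instead of the whole run.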


It took a little more than an hour to scrape everything. The results are stored in pickle files, each holding at most 5000 pages. This is the raw dataset, and we can always fall back to it.

Next, we split the raw data into two parts: characters and raw text sentences. Characters are identified by a property called 'Species' in the sidebar. We collect all characters in a strongly structured Pandas DataFrame, which means we need to select the properties beforehand. We have a total of 5334 characters that are marked as canon. More details on the selected properties are in the scrape notebook in the GitHub repository.
The raw text sentences are extracted from the descriptions. Each description is split into sentences, and all sentences are collected in a single list. Details of this split can also be found in the scrape notebook on GitHub.

In [None]:
from pathlib import Path
import urllib.parse  # the submodule must be imported explicitly to use urllib.parse.unquote


files = sorted(Path('../Dataset').glob('*.pickle'))
files

In [None]:
data = {}
for fn in files:
    with open(fn, 'rb') as f:
        part = pickle.load(f)
    data.update(part)

len(data)

In [None]:
def remove_url_shizzle(text):
    return urllib.parse.unquote(text).replace('"', '').replace("'", '')
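
To see what this cleaning step does, here is a tiny standalone sketch. `clean_key` mirrors the function above under a different (hypothetical) name, and the example keys are illustrative strings rather than entries taken from the dataset.

```python
from urllib.parse import unquote


def clean_key(text: str) -> str:
    """Decode %XX escapes, then drop single and double quote characters."""
    return unquote(text).replace('"', "").replace("'", "")


# Illustrative keys: a UTF-8 escape, an escaped apostrophe, and a plain key.
examples = ["Padm%C3%A9_Amidala", "Jabba%27s_Palace", "R2-D2"]
cleaned_examples = [clean_key(e) for e in examples]
print(cleaned_examples)
```

Note that removing quotes changes the key, so two different pages could in principle end up with the same cleaned key.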

In [None]:
cleaned = {}
for key, value in tqdm(data.items()):
    new_key = remove_url_shizzle(key)
    cleaned[new_key] = value
    cleaned[new_key]['crosslinks'] = [remove_url_shizzle(crosslink) for crosslink in value['crosslinks']]
data = cleaned

### Star Wars character dataset

In [None]:
def find_key(key_name, data):
    for key, value in data.items():
        if key_name == key:
            return value
        if isinstance(value, dict):
            value = find_key(key_name, value)
            if value is not None:
                return value
    return None

def get_first(key_name, data):
    result = find_key(key_name, data)
    if isinstance(result, list):
        result = result[0]
    return result
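
A quick way to get a feel for these two helpers is to run them on a small nested dictionary. The sketch below repeats the helpers so it runs on its own; the `side_bar` content is made up and only mimics the `{section: {attribute: value}}` shape produced by the scraper.

```python
def find_key(key_name, data):
    """Depth-first search for key_name in a nested dict; first match or None."""
    for key, value in data.items():
        if key == key_name:
            return value
        if isinstance(value, dict):
            found = find_key(key_name, value)
            if found is not None:
                return found
    return None


def get_first(key_name, data):
    """Like find_key, but reduce a list value to its first element."""
    result = find_key(key_name, data)
    return result[0] if isinstance(result, list) else result


# Made-up sidebar for illustration.
side_bar = {
    "Biographical information": {"Homeworld": ["Tatooine", "Naboo"]},
    "Physical description": {"Species": "Human", "Height": "1.88 meters"},
}

homeworld = get_first("Homeworld", side_bar)  # list value: first entry is used
species = get_first("Species", side_bar)      # plain string is returned as is
missing = get_first("Mass", side_bar)         # absent key gives None
print(homeworld, species, missing)
```

Because the search ignores the section name, an attribute is found no matter which sidebar section it lives in.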

In [None]:
import pandas as pd

result = []
for key, part in data.items():
    if not part['is_character']: continue
    row = {
        'name': part['title'],
        'key': key,
        'url': part['url'],
        'description': part['paragraph']
    }
    
    species  = find_key('Species', part['side_bar'])
    row['species_2nd'] = None
    row['species_3rd'] = None
    if isinstance(species, list):
        row['species'] = species[0]
        if len(species) > 1:
            row['species_2nd'] = species[1]
        if len(species) > 2:
            row['species_3rd'] = species[2]
        if len(species) > 3:
            print(species)
    else:
        row['species'] = species.strip()
    row['home_world'] = get_first('Homeworld', part['side_bar'])
    row['gender'] = get_first('Gender', part['side_bar'])

    row['height'] = get_first('Height', part['side_bar'])
    row['eye_color'] = get_first('Eye color', part['side_bar'])
    row['skin_color'] = get_first('Skin color', part['side_bar'])
    row['hair_color'] = get_first('Hair color', part['side_bar'])
    row['weight'] = get_first('Mass', part['side_bar'])

        
    
    result.append(row)
df = pd.DataFrame(result)

# fix some gender errors
gender_map = {
    'Male': 'Male',
    'Female': 'Female',
    'Mal': 'Male',
    'Femal': 'Female',
    'Non-binary': 'Non-binary',
    'male': 'Male',
    'Males': 'Male',
    'female': 'Female',
    'Femle': 'Female',
}
df.loc[:, 'gender'] = df.gender.map(gender_map)
df['gender'] = df['gender'].fillna('None')

# normalize height
translate = {None: None}
for m in df.height.unique().tolist()[1:]:
    if 'meter' in m:
        try:
            split = m.split()
            if len(split) == 2:
                if '/' in split[0]:
                    split[0] = split[0].split('/')[0]
                translate[m] = float(split[0])
            elif split[0] == 'Around' or split[0] == 'Over':
                translate[m] = float(split[1])
            elif split[0] == 'At':
                translate[m] = float(split[2])
            elif split[-1] == 'shoulder':
                translate[m] = float(split[0])
            elif split[-1] == 'meters':
                translate[m] = float(split[-2])
            elif split[1] == 'millimeters':
                translate[m] = 1.7015
            elif split[1] == 'meters':
                translate[m] = float(split[0])
            else:
                print(split)
                break
        except:
            print(m)
            break
    elif 'feet' in m or 'ft' in m:
        try:
            split = m.split()
            # 1 foot = 0.3048 m, 1 inch = 0.0254 m
            if split[0] == 'Around' or split[0] == 'Almost':
                translate[m] = 0.3048 * int(split[1])
            elif len(split) == 4:
                translate[m] = 0.3048 * int(split[0]) + 0.0254 * int(split[2])
            elif len(split) == 2:
                translate[m] = 0.3048 * int(split[0])
            else:
                print(split)
                break
        except:
            print(m)
            break     
    elif m[-1] == 'c':
        translate[m] = float(m[:-1]) / 100
    elif m == '5:1':
        translate[m] = None
    else:
        try:
            translate[m] = float(m)
        except:
            print(m)
            break     
df['height'] = df.height.map(translate)
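
The branch-per-format approach above works, but it is tightly coupled to the exact strings in this dataset. As an alternative, here is a rough regex-based sketch. It is not the method used to build the published dataset: `parse_height` is a hypothetical helper, and it assumes heights are written as a number followed by a fully spelled or abbreviated unit. Anything it does not recognize becomes `None`.

```python
import re

FEET_TO_M = 0.3048
INCH_TO_M = 0.0254
NUMBER = r"(\d+(?:\.\d+)?)"

# Feet with an optional inches part, e.g. "6 feet, 2 inches".
FEET_RE = re.compile(
    NUMBER + r"\s*(?:feet|foot|ft)(?:,?\s*" + NUMBER + r"\s*(?:inches|inch|in))?"
)
# Metric units; the word boundary keeps "m" from matching inside other words.
METRIC_RE = re.compile(NUMBER + r"\s*(millimeters?|centimeters?|meters?|mm|cm|m)\b")
SCALE = {"millimeter": 0.001, "mm": 0.001, "centimeter": 0.01, "cm": 0.01,
         "meter": 1.0, "m": 1.0}


def parse_height(text):
    """Return a height in meters, or None when no known pattern matches."""
    if text is None:
        return None
    t = text.lower()
    m = FEET_RE.search(t)
    if m:
        inches = float(m.group(2)) if m.group(2) else 0.0
        return float(m.group(1)) * FEET_TO_M + inches * INCH_TO_M
    m = METRIC_RE.search(t)
    if m:
        unit = m.group(2).rstrip("s")  # "meters" -> "meter"
        return float(m.group(1)) * SCALE[unit]
    return None


for raw in ["1.88 meters", "Around 7 feet", "6 feet, 2 inches", "180 cm", "5:1", None]:
    print(repr(raw), "->", parse_height(raw))
```

The trade-off is that unusual entries are silently mapped to `None` instead of stopping the loop, so it is worth counting the `None` values afterwards.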


In [None]:
df.to_parquet('../Dataset/StarWars_Characters.parquet', index=False)

### Raw sentences

In [None]:
fd = pd.DataFrame([
    {
        'key': key,
        'title': value['title'],
        'is_character': value['is_character'],
        'description': value['paragraph'],
    } for key, value in data.items()
])

In [None]:
fd.to_parquet('../Dataset/StarWars_Descriptions.parquet')

In [None]:
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

In [None]:
def sentence_split(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

In [None]:
%%time

sentences = []
for description in tqdm(fd.description.values):
    sentences += sentence_split(description)

In [None]:
sent = pd.DataFrame(sentences, columns=['sentence'])

In [None]:
sent.to_parquet('../Dataset/StarWars_Raw_Sentences.parquet', index=False)

### Raw sentences for characters only

In [None]:
%%time

sentences2 = []
for description in tqdm(fd.loc[fd.is_character, 'description'].values):
    sentences2 += sentence_split(description)

In [None]:
sent2 = pd.DataFrame(sentences2, columns=['sentence'])

In [None]:
sent2.to_parquet('../Dataset/StarWars_Raw_Sentences_characters.parquet', index=False)