# Star Wars Data Science
## Network Analysis, Topic Modeling, and a Wordcloud!
https://linkedin.com/in/dennisbakhuis

## 2. Wookieepedia data exploration

The first step with a new dataset is always data exploration. It is a way to get acquainted with the dataset and a first step toward understanding what information it contains.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
mpl.rcParams['font.size'] = 14.0

sw = pd.read_parquet('../Dataset/StarWars_Characters.parquet')

In [None]:
sw.info()

### Species
If you ask a non-fanboy about Star Wars, something you often hear is that it is full of weird creatures. The [famous Cantina scene](https://www.youtube.com/watch?v=Lfy5Esue_ls) in particular is packed with different species that exist in the galaxy. It is therefore interesting to take a first look at the number of species and how often each is represented in the canon dataset.

In [None]:
len(sw.species.unique())

There are a total of 530 species mentioned, too many to make a nice visualization. Therefore, we only select species that have at least 40 mentions in the dataset. All others are grouped as 'Other'.

In [None]:
n_mentions = 40

d = sw.copy()
species = d.species.value_counts()
other = species[species < n_mentions].index.tolist()
d.loc[d.species.isin(other), 'species'] = 'Other'

species =  d.species.value_counts()
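
The grouping step above can also be wrapped in a small reusable function. The following is only a sketch on toy data: the function name `group_rare` and its parameters are hypothetical and not part of the original notebook, and the toy values are not taken from the Wookieepedia dataset.

```python
import pandas as pd


def group_rare(values: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories that occur fewer than `min_count` times with `other_label`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps entries where the condition is True and replaces the rest.
    # Missing values are kept as they are, because isin() is False for them.
    return values.where(~values.isin(rare), other_label)


# Toy data for illustration only.
toy = pd.Series(["Human"] * 5 + ["Twi'lek"] * 3 + ["Hutt", "Ewok"])
grouped = group_rare(toy, min_count=3)
print(grouped.value_counts())
```

Under the same assumptions, `group_rare(sw.home_world, 8)` would apply the identical thresholding to the home worlds.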

In [None]:
fig, ax = plt.subplots(figsize=[14,14])

_ = ax.pie(
    x=species,
    autopct="%.1f%%",
    labels=species.index.tolist(),
    pctdistance=0.9,
    shadow=True,
    startangle=60,
)
_ = ax.axis('equal')

In [None]:
species

In [None]:
fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))


wedges, texts = ax.pie(species, wedgeprops=dict(width=0.5), startangle=0)

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(arrowprops=dict(arrowstyle="-"),
          bbox=bbox_props, zorder=0, va="center")

labels = species.index.tolist()
correction = [0,0] + list(np.arange(len(species) - 2) * 0.3 - 1)

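for i, p in enumerate(wedges):
    # Angle (in degrees) at the middle of the wedge.
    ang = (p.theta2 - p.theta1)/2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    # Put the label on the same side of the donut as its wedge.
    horizontalalignment = "left" if x >= 0 else "right"
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    # Apply the angled connector between the wedge and its label.
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(labels[i], xy=(x, y), xytext=(1.35*np.sign(x), correction[i]+1.4*y),
                horizontalalignment=horizontalalignment, **kw)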

Humans are by far the dominant species, accounting for more than half of the characters (2770 mentions). Twi'leks come next but make up only 2.5% of the dataset. After that, the counts drop quickly to a few tens of characters per species. The 519 remaining species are grouped under 'Other', which gives an average of fewer than four characters each. If we met a random Star Wars character, there is roughly a 50% chance that he/she would be human. If not, it could be one of the many diverse species available.
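
The figures quoted above (share of the top species, number of grouped species, and their average size) can be derived directly from the value counts. Here is a minimal sketch on toy data. The helper `grouping_summary` is a hypothetical name, and the shares are computed over characters with a known species, which may or may not match how the percentages above were obtained.

```python
import pandas as pd


def grouping_summary(values: pd.Series, min_count: int) -> dict:
    """Summarise the most common category and the categories below `min_count`."""
    counts = values.value_counts()  # sorted descending, missing values excluded
    rare = counts[counts < min_count]
    return {
        "top_category": counts.index[0],
        "top_share": counts.iloc[0] / counts.sum(),
        "n_grouped_categories": len(rare),
        "mean_size_of_grouped": rare.mean(),
    }


# Toy data for illustration only.
toy = pd.Series(["Human"] * 6 + ["Twi'lek"] * 2 + ["Hutt", "Ewok"])
summary = grouping_summary(toy, min_count=2)
print(summary)
```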

### Home world

As the species are very diverse, it might also be interesting to look at the home world of each character. Looking at the unique counts, there are 463 worlds mentioned in the dataset. Again, due to the high diversity, we only keep worlds that at least 8 characters list as their home world.

In [None]:
sw.home_world.value_counts()

In [None]:
n = 8

d = sw.copy()
hw = d.home_world.value_counts()
other = hw[hw < n].index.tolist()
hw = hw[hw >= n]
d.loc[d.home_world.isin(other), 'home_world'] = 'Other'

# hw =  d.home_world.value_counts()

In [None]:
pie, ax = plt.subplots(figsize=[15,15])
plt.rcParams['font.size'] = 18
_ = plt.pie(
    x=hw, 
    autopct="%.1f%%", 
#     explode=[0, 0.1] + [0] * (len(species) - 2), 
    labels=hw.index.tolist(),
    pctdistance=0.9,
#     shadow=True,
    startangle=60,
)
_ = plt.axis('equal')

Interestingly, the most mentioned world is Kamino, which is famous for its cloning technology. It was first mentioned in Episode II - Attack of the Clones, where the protagonists found a hidden clone army. The main reason this planet is so prominent, however, is the animated Star Wars series The Clone Wars. It had seven seasons and a total of 133 episodes and centered on many characters that were created on Kamino. The second planet is Naboo, which was under the rule of Queen Amidala and is famous throughout the series.

### Gender

In [None]:
sw.gender.unique()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.countplot(data=sw, y='gender', ax=ax)
sns.despine()
for p in ax.patches:
    ax.annotate(
        f'\n{p.get_width()}', 
        (p.get_width() + 50, p.get_y()+0.2),
        ha='left', 
        va='top', color='black', size=18)

Because Star Wars started in the late seventies, I expected a strong representation of male characters. Indeed, two-thirds of all characters are male, although I somehow expected more. Maybe Disney started to increase the number of female characters to finally bring balance to the Force.

Another thing that is pretty cool is that there are a few non-binary characters in the official Star Wars lore. The author Chuck Wendig confirmed that a humanoid character from his books called Eleodie Maracavanya was indeed non-binary and thereby the first official non-binary canon character. Now there are a total of four, including Keo Venzee.

For some species, it is not very clear whether they are male or female, and about 8% of all characters do not have a gender recorded.
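
Shares such as "two-thirds male" or "8% without a gender" can be read off with `value_counts(normalize=True, dropna=False)` and `isna().mean()`. The sketch below uses a toy frame. The gender labels and all values in it are assumptions for illustration, not the actual contents of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; labels and values are hypothetical.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", None, "Male",
               "Female", "Male", "Non-binary", "Male", "Male"],
    "height": [1.83, np.nan, np.nan, np.nan, 1.70,
               np.nan, np.nan, np.nan, 0.22, np.nan],
})

# Share of each gender value, keeping missing entries as their own row.
gender_share = toy["gender"].value_counts(normalize=True, dropna=False)

# Fraction of missing entries per column.
missing_share = toy.isna().mean()

print(gender_share)
print(missing_share)
```

The same `isna().mean()` call on `sw` would give the fraction of unknown heights and eye colors discussed in the later sections.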

### The tallest and smallest bunch

As Qui-Gon Jinn once remarked: "there is always a bigger fish", and therefore it is interesting to see what sizes are recorded in the dataset. Unfortunately, the height is known for only about 12% of the characters. Let's have a look at the distribution:

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
bins = np.arange(0, 3.1, 0.05)
# Drop missing heights; values outside the bin edges (above roughly 3 m) are not counted.
hist, bins = np.histogram(sw.height.dropna(), bins=bins, density=False)
bin_mid = (bins[0:-1] + bins[1:]) / 2
ax.fill_between(bin_mid, hist, np.zeros(len(hist)))
ax.plot(bin_mid, hist, 'k')
_, _ = ax.set_xlabel('Height (m)'), ax.set_ylabel('Count')
sns.despine()

In [None]:
sw.sort_values('height', ascending=False)[['name', 'height']].head(10)

Most characters have heights that are fairly typical for humanoids, with a large spike at 1.83 m, which is about the average height of Dutch males. Looking at the extremes, we find that Babu Frik (who appeared in Episode IX) is by far the smallest intelligent creature at only 22 cm. It is also nice to see that Grogu has to grow about 20 cm more to be equal in size to Yoda.

The largest creature is Omi, the one-eyed beast living in the trash compactor of the Death Star, which is apparently 10 meters tall. Jabba the Hutt is also pretty large at 3.90 m, but I guess they measured him from his nose to the end of his tail.
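
Both extremes can be pulled out with `nlargest` and `nsmallest`, which skip missing heights as long as enough known values exist. This is a sketch on a toy frame: the three named heights are the ones quoted in the text, while "Character A" and "Character B" are hypothetical placeholders.

```python
import pandas as pd

# Toy data: three heights quoted in the text plus two hypothetical placeholders.
toy = pd.DataFrame({
    "name": ["Babu Frik", "Jabba the Hutt", "Omi", "Character A", "Character B"],
    "height": [0.22, 3.90, 10.0, 1.83, None],
})

# Three tallest and three smallest characters with a known height.
tallest = toy.nlargest(3, "height")[["name", "height"]]
smallest = toy.nsmallest(3, "height")[["name", "height"]]

print(tallest)
print(smallest)
```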

### Eye color, skin color, and hair color

We have also logged the color of the eyes, and there are 97 colors in the dataset. I guess that there are some typos and also some descriptive colors like 'Bluish green'. Still, the eye color was registered for almost half of the characters.

In [None]:
sw.eye_color.unique()
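
If some of the 97 labels differ only in capitalisation or stray whitespace, a light normalisation before counting would merge them. Whether the real data contains such variants is an assumption, and the toy values below are made up for illustration. Actual spelling mistakes and descriptive colors like 'Bluish green' are not merged by this step.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.Series(["Blue", "blue ", "Bluish green", "BROWN", "brown", None])

# Trim surrounding whitespace and lower-case every label; missing values stay missing.
normalized = toy.str.strip().str.lower()
counts = normalized.value_counts()

print(toy.nunique(), "labels before,", normalized.nunique(), "labels after")
print(counts)
```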

In [None]:
eye_color = sw.eye_color.value_counts()
eye_color = eye_color[eye_color>20].index.tolist()

fig, ax = plt.subplots(figsize=(12, 8))
sns.countplot(data=sw.loc[sw.eye_color.isin(eye_color)], y='eye_color', ax=ax)
sns.despine()
_ = ax.set_ylabel('Eye color')

In [None]:
eye_color = sw.eye_color.value_counts()
eye_color[eye_color>20]

In [None]:
skin_color = sw.skin_color.value_counts()
skin_color = skin_color[skin_color>20].index.tolist()

fig, ax = plt.subplots(figsize=(12, 8))
sns.countplot(data=sw.loc[sw.skin_color.isin(skin_color)], y='skin_color', ax=ax)
sns.despine()
_ = ax.set_ylabel('Skin color')

In [None]:
hair_color = sw.hair_color.value_counts()
hair_color = hair_color[hair_color>20].index.tolist()

fig, ax = plt.subplots(figsize=(12, 8))
sns.countplot(data=sw.loc[sw.hair_color.isin(hair_color)], y='hair_color', ax=ax)
sns.despine()
_ = ax.set_ylabel('Hair color')