# Altair Example 2 - Star Wars Most Unfavorable Characters

This notebook demonstrates a method for creating an Altair graphic that closely resembles the theme of a published article. It is meant as an alternative visualization that complements the source article. The data used here is a subset of the [FiveThirtyEight](https://fivethirtyeight.com) data behind the article [America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/) (Hickey, 2014). The original dataset can be found at [FiveThirtyEight Star Wars Survey](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).


This notebook is an attempt to create a new visualization that shows information missing from the article.

In [1]:
# The code in this cell was written and provided by the instruction team of 
# University of Michigan - School of Information - SIADS-522 - Information Visualization
# Taught by Professor Eytan Adar (2020)

import pandas as pd
import altair as alt
import numpy as np
import math

# enable correct rendering
alt.renderers.enable('default')

# uses intermediate json files to speed things up
alt.data_transformers.enable('json')

sw = pd.read_csv('datasets/StarWars.csv', encoding='latin1')

#--------------------------------------------------------------------------------------------------

# Some formatting is needed for the survey dataframe; we rename the columns to short, readable names
sw = sw.rename(columns={'Have you seen any of the 6 films in the Star Wars franchise?':'seen_any_movie',
                        'Do you consider yourself to be a fan of the Star Wars film franchise?': 'fan',
                        'Which of the following Star Wars films have you seen? Please select all that apply.' : 'seen_EI',
                        'Unnamed: 4' : 'seen_EII',
                        'Unnamed: 5' : 'seen_EIII',
                        'Unnamed: 6' : 'seen_EIV',
                        'Unnamed: 7' : 'seen_EV',
                        'Unnamed: 8' : 'seen_EVI',
                        'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'rank_EI',
                        'Unnamed: 10' : 'rank_EII',
                        'Unnamed: 11' : 'rank_EIII',
                        'Unnamed: 12' : 'rank_EIV',
                        'Unnamed: 13' : 'rank_EV',
                        'Unnamed: 14' : 'rank_EVI',
                        'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.' : 'Han Solo',
                        'Unnamed: 16' : 'Luke Skywalker',
                        'Unnamed: 17' : 'Princess Leia Organa',
                        'Unnamed: 18' : 'Anakin Skywalker',
                        'Unnamed: 19' : 'Obi Wan Kenobi',
                        'Unnamed: 20' : 'Emperor Palpatine',
                        'Unnamed: 21' : 'Darth Vader',
                        'Unnamed: 22' : 'Lando Calrissian',
                        'Unnamed: 23' : 'Boba Fett',
                        'Unnamed: 24' : 'C-3P0',
                        'Unnamed: 25' : 'R2 D2',
                        'Unnamed: 26' : 'Jar Jar Binks',
                        'Unnamed: 27' : 'Padme Amidala',
                        'Unnamed: 28' : 'Yoda',
                       })
sw = sw.drop([0])

#--------------------------------------------------------------------------------------------------

# Sample visualization

# We're going to fix the labels a bit so will create a mapping to the full names
episodes = ['EI', 'EII', 'EIII', 'EIV', 'EV', 'EVI']
names = {
    'EI' : 'The Phantom Menace', 'EII' : 'Attack of the Clones', 'EIII' : 'Revenge of the Sith', 
    'EIV': 'A New Hope', 'EV': 'The Empire Strikes Back', 'EVI' : 'Return of the Jedi'
}

# we're also going to use this order to sort, so names_l will now have our sort order
names_l = [names[ep] for ep in episodes]

#--------------------------------------------------------------------------------------------------

# let's do some data pre-processing... sw (star wars) has everything

# We want to only use those people who have seen at least one movie, let's get the people, toss NAs
# and get the total count

# find people who have at least one of the columns (seen_*) not NaN
seen_at_least_one = sw.dropna(subset=['seen_' + ep for ep in episodes],how='all')
total = len(seen_at_least_one)

#--------------------------------------------------------------------------------------------------

# for each movie, we're going to calculate the percents and generate a new data frame
percs = []

# loop over each column and calculate the share of people who have seen the movie
# specifically, keep the people who are *not NaN* for a specific episode (e.g., seen_EII), count them
# and divide by the total
for seen_ep in ['seen_' + ep for ep in episodes]:
    perc = len(seen_at_least_one[~ pd.isna(seen_at_least_one[seen_ep])]) / total
    percs.append(perc)
    
# at this point percs is holding our percentages

# now we're going to use a trick to make tuples--pairing names with percents--using "zip" and then make a dataframe
tuples = list(zip([names[ep] for ep in episodes],percs))
seen_per_df = pd.DataFrame(tuples, columns = ['Name', 'Percentage'])
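
The percentage loop above can also be written without explicit iteration. The following is a minimal sketch, not the course team's code: it assumes a small hand-made dataframe (`toy`, hypothetical) in which any non-null `seen_*` value means the respondent saw that film, and it relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy survey: a non-null entry means the respondent saw that film
toy = pd.DataFrame({
    'seen_EI':  ['Episode I', np.nan, 'Episode I', np.nan],
    'seen_EII': [np.nan, np.nan, 'Episode II', np.nan],
})

seen_cols = ['seen_EI', 'seen_EII']

# Keep respondents who saw at least one film (mirrors dropna(..., how='all'))
seen_any = toy.dropna(subset=seen_cols, how='all')

# notna() yields booleans; their column mean is the share of viewers per film
toy_percs = seen_any[seen_cols].notna().mean()

# Reshape into the same two-column layout as seen_per_df
toy_seen_df = toy_percs.rename_axis('Name').reset_index(name='Percentage')
print(toy_seen_df)
```

In this toy table two respondents remain after the filter, so each film's share is computed out of two people rather than four.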

The visualizations below do not appear in the article. They show other ways the data could have been presented and add context to the article.

In [2]:
# This is code written by Nicholas Miller

characters = ['Han Solo', 'Luke Skywalker', 'Princess Leia Organa', 'Anakin Skywalker',
              'Obi Wan Kenobi', 'Emperor Palpatine', 'Darth Vader', 'Lando Calrissian',
              'Boba Fett', 'C-3P0', 'R2 D2', 'Jar Jar Binks', 'Padme Amidala', 'Yoda']

df = sw.copy()
df = df.dropna(subset=characters,how='all')
df = df.dropna(subset=['Gender'],how='any')

most_unfavorable = ['Jar Jar Binks', 'Darth Vader', 'Emperor Palpatine']

def convert_favorable(df):
    unfavorable = ['Very unfavorably', 'Somewhat unfavorably']
    for character in characters:
        # Flag unfavorable ratings as 1; all other ratings and NA become 0
        df[character] = df[character].isin(unfavorable).astype(int)
    
    # The mean of the 0/1 flags is the share of respondents who rated each character unfavorably
    df = pd.DataFrame(df.mean()).reset_index().rename(columns={'index': 'Name', 0: 'Percentage'})
    return df[df['Name'].isin(most_unfavorable)]

num_female = len(df[df['Gender'] == 'Female'])
num_male = len(df[df['Gender'] == 'Male'])

female_df = convert_favorable(df[df['Gender'] == 'Female'][characters].copy())
male_df = convert_favorable(df[df['Gender'] == 'Male'][characters].copy())

#===========================================================================================================
#===========================================================================================================

# Now we make the graph
def build_bar_chart2(df, title, color, show_axis=False):
    if show_axis:
        axis_text = alt.Axis(tickCount=5, title='', labelLimit=300)
        title_anchor = "middle"
    else:
        axis_text = None
        title_anchor = "middle"
        
    bars = alt.Chart(df).mark_bar(size=20, color=color).encode(
        # encode x as the percent, and hide the axis
        x=alt.X(
            'Percentage:Q',
            axis=None),
        y=alt.Y(
            # encode y using the character name, which also labels the axis when it is shown
            'Name:N',
             axis=axis_text,
             # we give the sorting order to avoid alphabetical order
             sort=most_unfavorable
             #sort='-x'
        )
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Nudges text to right so it doesn't appear on top of the bar
    ).encode(
        # we'll use the percentage as the text
        text=alt.Text('Percentage:Q',format='.0%')
    )
    return (text + bars).properties(
        width=200,
        height=100,
        title={"text": title,
               "anchor": title_anchor,
               "fontSize": 17}
    )

#===========================================================================================================

m_chart = build_bar_chart2(male_df, 'Male', '#008fd5', True)
f_chart = build_bar_chart2(female_df, 'Female', '#77ab43', False)

(m_chart | f_chart).configure(
    background='#F0F0F0',
    concat=alt.CompositionConfig(spacing=30),
    padding=15                    # Add some padding around the edge
).configure_mark(
    # default color for any mark that does not set its own color
    color='#008fd5'
).configure_view(
    # we don't want a stroke around the bars
    strokeWidth=0,
    strokeOpacity=0               # Remove the boundary box
).configure_scale(
    # add some padding
    #bandPaddingInner=0.2
).properties(
    # add a title
    title={"text": "Most Unfavorable 'Star Wars' Characters by Gender",
           "subtitle": ["By {} respondents who rated at least one character".format(len(df)),
                        "Of which {} respondents were male and {} female".format(num_male,num_female)],
           "fontSize":24,
           "subtitleFontSize":18,
           "anchor":"start",      # Make the text left justified
           "offset":15            # add some padding between title and below graph
          }
).configure_axis(
    #ticks=False,                  # Remove the ticks
    labelFontSize=15,
    domain=False,                 # Remove the axis line,
    #offset=10,                    # Moves the bars to the right
    tickSize=10,                 # This is a hack to fix the bug with offset
    tickOpacity=0
).resolve_scale(x='shared')
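
The per-gender shares produced by `convert_favorable` can be sanity-checked on a table small enough to verify by hand. The following is a minimal sketch under stated assumptions: the ratings in `toy` are made up for illustration (they are not survey data), `toy_characters` is a hypothetical two-character list, and a single `groupby` replaces the two separate per-gender calls used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, not taken from the survey
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Jar Jar Binks': ['Very unfavorably', 'Very favorably',
                      'Somewhat unfavorably', 'Very unfavorably'],
    'Darth Vader': ['Somewhat unfavorably', np.nan,
                    'Very favorably', 'Unfamiliar (N/A)'],
})

unfavorable = ['Very unfavorably', 'Somewhat unfavorably']
toy_characters = ['Jar Jar Binks', 'Darth Vader']

# 1 where the rating is unfavorable, 0 otherwise (including missing answers)
flags = toy[toy_characters].isin(unfavorable).astype(int)

# Mean of the flags within each gender = share of unfavorable ratings
shares = flags.groupby(toy['Gender']).mean()
print(shares)
```

Missing answers count as "not unfavorable" here, matching the notebook's approach, so the denominator is every respondent of that gender rather than only those who rated the character.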