# Re-Create Article Visualization (Altair Practice)

In this assignment, I explore and visualize public opinion on the Star Wars movies and characters, using survey data collected by FiveThirtyEight for the article [America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/). Using Altair, a declarative visualization library for Python, I recreate several visualizations featured in the article. The exercise is meant not only to replicate the article's findings but also to experiment with alternative and complementary visualizations, guided by principles of perception and cognition.

The project is structured around three main objectives:

1. Recreation of Visualizations: I will recreate four specific visualizations from the article, using Altair to mirror the original insights into how the Star Wars films and characters were received.

2. Alternative Visualization Proposal: I will propose an alternative visualization for one of the original article's charts. This involves not just reimagining how the data is presented but also justifying the redesign in terms of cognitive and perceptual effectiveness.

3. New Visualization Creation: I will design a new visualization that adds value to the article's narrative. It aims to offer a fresh perspective on the dataset, justified through principles of perception and cognition.


In [2]:
import pandas as pd
import altair as alt
import numpy as np
import math

In [3]:
# enable correct rendering
alt.renderers.enable('default')

RendererRegistry.enable('default')

In [4]:
# uses intermediate json files to speed things up
alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

In [5]:
def loadStarwarsData(filename='assets/StarWars.csv'):
    
    # input: filename to original dataset (default StarWars.csv)
    # output: cleaned up dataframe with columns appropriately renamed
    
    sw = pd.read_csv(filename, encoding='latin1')

    # The survey's column headers are long or unnamed, so rename them to short, usable names
    sw = sw.rename(columns={'Have you seen any of the 6 films in the Star Wars franchise?':'seen_any_movie',
                            'Do you consider yourself to be a fan of the Star Wars film franchise?': 'fan',
                            'Which of the following Star Wars films have you seen? Please select all that apply.' : 'seen_EI',
                            'Unnamed: 4' : 'seen_EII',
                            'Unnamed: 5' : 'seen_EIII',
                            'Unnamed: 6' : 'seen_EIV',
                            'Unnamed: 7' : 'seen_EV',
                            'Unnamed: 8' : 'seen_EVI',
                            'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'rank_EI',
                            'Unnamed: 10' : 'rank_EII',
                            'Unnamed: 11' : 'rank_EIII',
                            'Unnamed: 12' : 'rank_EIV',
                            'Unnamed: 13' : 'rank_EV',
                            'Unnamed: 14' : 'rank_EVI',
                            'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.' : 'Han Solo',
                            'Unnamed: 16' : 'Luke Skywalker',
                            'Unnamed: 17' : 'Princess Leia Organa',
                            'Unnamed: 18' : 'Anakin Skywalker',
                            'Unnamed: 19' : 'Obi Wan Kenobi',
                            'Unnamed: 20' : 'Emperor Palpatine',
                            'Unnamed: 21' : 'Darth Vader',
                            'Unnamed: 22' : 'Lando Calrissian',
                            'Unnamed: 23' : 'Boba Fett',
                            'Unnamed: 24' : 'C-3P0',
                            'Unnamed: 25' : 'R2 D2',
                            'Unnamed: 26' : 'Jar Jar Binks',
                            'Unnamed: 27' : 'Padme Amidala',
                            'Unnamed: 28' : 'Yoda',
                           })
    sw = sw.drop([0])
    return(sw)

sw = loadStarwarsData()
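To make the renaming step more concrete, here is a minimal sketch on a made-up, two-question CSV that mimics the layout of the survey file: a question header followed by blank headers (which pandas names `Unnamed: N`) and a second header row that has to be dropped. The file contents and the question text are hypothetical; only the `rename`/`drop` pattern mirrors `loadStarwarsData`.

```python
import io
import pandas as pd

# Hypothetical miniature survey with the same two-row header layout as StarWars.csv
raw = io.StringIO(
    "RespondentID,Which films have you seen?,,Gender\n"
    ",Episode I,Episode II,Response\n"
    "1,Episode I,,Male\n"
    "2,,Episode II,Female\n"
)
demo = pd.read_csv(raw)

# The blank header cell was auto-named 'Unnamed: 2'; give both film columns short names
demo = demo.rename(columns={
    'Which films have you seen?': 'seen_EI',
    'Unnamed: 2': 'seen_EII',
})

# Row 0 holds the second header row (answer labels), not a respondent, so drop it
demo = demo.drop([0])
print(demo)
```

Each `seen_*` column then holds the film title when the respondent ticked that film and NaN otherwise, which is what the later cells rely on.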

In [6]:
# take a peek at the data
sw.sample(5)

Unnamed: 0,RespondentID,seen_any_movie,fan,seen_EI,seen_EII,seen_EIII,seen_EIV,seen_EV,seen_EVI,rank_EI,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
591,3290158000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,2.0,...,Somewhat favorably,I don't understand this question,No,,Yes,Female,18-29,"$100,000 - $149,999",Bachelor degree,Pacific
450,3290552000.0,Yes,No,,,,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,4.0,...,Very favorably,Han,No,,No,Male,30-44,"$150,000+",Bachelor degree,Pacific
993,3288683000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Somewhat favorably,Han,Yes,No,No,Male,18-29,,Less than high school degree,South Atlantic
1075,3288564000.0,No,,,,,,,,,...,,,,,No,Female,45-60,"$50,000 - $99,999",Graduate degree,South Atlantic
876,3289490000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,...,Very favorably,Greedo,No,,Yes,Female,45-60,"$25,000 - $49,999",Bachelor degree,West North Central


# America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)

_Original article available at [FiveThirtyEight](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/)_

By [Walt Hickey](https://fivethirtyeight.com/contributors/walt-hickey/)

Filed under [Movies](https://fivethirtyeight.com/tag/movies/)

Get the data on [GitHub](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey)

This week, I caught a sneak peek [of the X-Wing fighter](http://www.wired.com/2014/07/star-wars-episode-vii-x-wing/) from the new “Star Wars” films in production. The forthcoming movies — and the middling response to the most recent trilogy — provide a perfect excuse to examine some questions I’ve long wanted answers to: How many people are “Star Wars” fans? Does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch? Which characters are most well-liked and most hated? And who shot first, Han Solo or Greedo?

We ran a poll through [SurveyMonkey Audience](https://www.surveymonkey.com/mp/audience/), surveying 1,186 respondents from June 3 to 6 (the [data](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey) is available [on GitHub](https://github.com/fivethirtyeight/data)). Seventy-nine percent of those respondents said they had watched at least one of the “Star Wars” films. This question, incidentally, had a substantial difference by gender: 85 percent of men have seen at least one “Star Wars” film compared to 72 percent of women. Of people who have seen a film, men were also more likely to consider themselves a fan of the franchise: 72 percent of men compared to 60 percent of women.

We then asked respondents which of the films they had seen. With 835 people responding, here’s the probability that someone has seen a given “Star Wars” film given that they have seen any Star Wars film:

!["Sol1"](assets/have_seen_resized.png)

In [7]:
# We're going to fix the labels a bit so will create a mapping to the full names
# and have the full sort order
def genEpisodeNamesDF():
    # create the various mapping and order lists/dictionaries
    # return: list of episodes (in order), episode->name map, list of episode names (in order)
    episodes = ['EI', 'EII', 'EIII', 'EIV', 'EV', 'EVI']
    names = {
        'EI' : 'The Phantom Menace', 'EII' : 'Attack of the Clones', 'EIII' : 'Revenge of the Sith', 
        'EIV': 'A New Hope', 'EV': 'The Empire Strikes Back', 'EVI' : 'The Return of the Jedi'
    }

    # we're also going to use this order to sort, so names_l will now have our sort order
    return episodes, names, [names[ep] for ep in episodes]

episodes, namesDict, namesList = genEpisodeNamesDF()

In [8]:
# let's inspect what we've generated. These will be useful to you below
print("abbreviated list (sorted):\n  ",episodes)
print("\nmapping between abbreviated names and full titles:\n  ",namesDict)
print("\nfull titles, sorted:\n  ",namesList)


abbreviated list (sorted):
   ['EI', 'EII', 'EIII', 'EIV', 'EV', 'EVI']

mapping between abbreviated names and full titles:
   {'EI': 'The Phantom Menace', 'EII': 'Attack of the Clones', 'EIII': 'Revenge of the Sith', 'EIV': 'A New Hope', 'EV': 'The Empire Strikes Back', 'EVI': 'The Return of the Jedi'}

full titles, sorted:
   ['The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith', 'A New Hope', 'The Empire Strikes Back', 'The Return of the Jedi']


In [9]:
# let's do some data pre-processing... recall that sw (star wars) has everything

def getSeenAtLeastOneDF(indf, eps):
    # input: indf, the dataframe as formatted above
    # input: eps, a list of episodes (movies)

    # returns: the subset of respondents who have seen at least one movie
    
    # We only want people who have seen at least one movie, so drop the rows
    # in which every seen_* column is NaN

    # find people who have at least one of the columns (seen_*) not NaN
    salo = indf.dropna(subset=['seen_' + ep for ep in eps],how='all')
    
    return(salo)
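The `how` argument of `dropna` decides which respondents survive the filter. The sketch below uses a made-up three-respondent frame (not the survey data) to contrast `how='all'`, which keeps anyone who saw at least one film, with `how='any'`, which keeps only those who saw every listed film.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents: saw both films, saw one film, saw none
toy = pd.DataFrame({
    'seen_EI':  ['Episode I', np.nan, np.nan],
    'seen_EII': ['Episode II', 'Episode II', np.nan],
})
seen_cols = ['seen_EI', 'seen_EII']

# how='all' drops a row only when every seen_* value is missing
at_least_one = toy.dropna(subset=seen_cols, how='all')

# how='any' drops a row as soon as one seen_* value is missing
seen_all = toy.dropna(subset=seen_cols, how='any')

print(len(at_least_one), len(seen_all))
```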


In [10]:
seenAtLeastOneDF = getSeenAtLeastOneDF(sw, episodes)
print("total who have seen at least one: ", len(seenAtLeastOneDF),"\nSample:")
display(seenAtLeastOneDF.sample(5))

total who have seen at least one:  835 
Sample:


Unnamed: 0,RespondentID,seen_any_movie,fan,seen_EI,seen_EII,seen_EIII,seen_EIV,seen_EV,seen_EVI,rank_EI,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
221,3290961000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Very favorably,Greedo,No,,No,Male,18-29,"$50,000 - $99,999",Bachelor degree,East North Central
732,3289848000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Very favorably,Han,Yes,Yes,No,Male,18-29,"$25,000 - $49,999",Some college or Associate degree,East North Central
334,3290753000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Somewhat favorably,Han,Yes,Yes,Yes,Male,45-60,"$0 - $24,999",Some college or Associate degree,New England
644,3290020000.0,Yes,No,Star Wars: Episode I The Phantom Menace,,,,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,2,...,Somewhat favorably,Greedo,No,,No,Male,> 60,"$25,000 - $49,999",Bachelor degree,West North Central
875,3289492000.0,Yes,No,,,,,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3,...,Very favorably,I don't understand this question,No,,No,Female,45-60,"$25,000 - $49,999",Some college or Associate degree,West South Central


In [11]:
# for each movie, we're going to calculate the percents and generate a new data frame

def genSeenPercentDf(inpf, names=namesDict, names_l=namesList):
    # input: inpf - an input frame of the form output by getSeenAtLeastOneDF()
    # input: names - a dictionary of abbreviations to names
    # input: names_l - a list of all the movies in series order
    
    total = len(inpf)
    percs = []

    # loop over each column and calculate the share of people who have seen the movie
    # specifically, count the people who are *not NaN* for a specific episode (e.g., seen_EII)
    # and divide by the total
    for seen_ep in ['seen_' + ep for ep in episodes]:
        perc = len(inpf[~ pd.isna(inpf[seen_ep])]) / total
        percs.append(perc)

    # at this point percs is holding our percentages

    # now we're going to use a trick to make tuples--pairing names with percents--using "zip" and then make a dataframe
    tuples = list(zip([names[ep] for ep in episodes],percs))
    seen_per_df = pd.DataFrame(tuples, columns = ['Name', 'Percentage'])
    return(seen_per_df)
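The loop above could also be written without explicit iteration, since the mean of a boolean "not missing" mask is exactly the share of respondents who saw a film. This is only an alternative sketch on made-up data, not the code used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical seen_* columns; the last respondent saw neither film
toy = pd.DataFrame({
    'seen_EI':  ['Episode I', np.nan, 'Episode I', np.nan],
    'seen_EII': ['Episode II', 'Episode II', 'Episode II', np.nan],
}).dropna(how='all')  # keep respondents who saw at least one film

# notna() gives True/False per cell; the column mean is the share of True values
shares = toy.notna().mean()

# Reshape into the same Name/Percentage layout that genSeenPercentDf returns
seen_per_df = shares.rename_axis('Name').reset_index(name='Percentage')
print(seen_per_df)
```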

In [12]:
seenPerMovieDF = genSeenPercentDf(seenAtLeastOneDF)

In [13]:
# let's see what's inside
seenPerMovieDF

Unnamed: 0,Name,Percentage
0,The Phantom Menace,0.805988
1,Attack of the Clones,0.683832
2,Revenge of the Sith,0.658683
3,A New Hope,0.726946
4,The Empire Strikes Back,0.907784
5,The Return of the Jedi,0.883832


In [14]:
def genPercentVis(indf, names_l=namesList):
    # input: indf, the dataframe as seen_per_df
    # input: names_l - a list of all the movies in series order (default namesList)
    # output: simple altair bar chart
    
    # ok, time to make the chart... let's make a bar chart (use mark_bar)
    bars = alt.Chart(indf).mark_bar(size=20).encode(
        # encode x as the percent, and hide the axis
        x=alt.X(
            'Percentage',
            axis=None),
        y=alt.Y(
            # encode y using the name, use the movie name to label the axis, sort using the names_l
            'Name:N',
             axis=alt.Axis(tickCount=5, title=''),
             # we give the sorting order to avoid alphabetical order
             sort=names_l
        )
    )

    # at this point we don't really have a great plot (it's missing the annotations, titles, etc.)
    return(bars)

seenPerMovieVis = genPercentVis(seenPerMovieDF)

In [15]:
# display it
seenPerMovieVis

In [16]:
def augmentPercentVis(base):
    # input: base (the base vis, i.e., bars as above)
    # we're going to overlay the text with the percentages, so let's make another visualization
    # that's just text labels


    text = base.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Nudges text to right so it doesn't appear on top of the bar
    ).encode(
        # we'll use the percentage as the text
        text=alt.Text('Percentage:Q',format='.0%')
    )

    # finally, we're going to combine the bars and the text and do some styling
    seen_movies = (text + base).configure_mark(
        # we don't love the default blue, so use a different one
        color='#008fd5'
    ).configure_view(
        # we don't want a stroke around the bars
        strokeWidth=0
    ).configure_scale(
        # add some padding
        bandPaddingInner=0.2
    ).properties(
        # set the dimensions of the visualization
        width=500,
        height=155
    ).properties(
        # add a title
        title="Which 'Star Wars' Movies Have you Seen?"
    )

    return(seen_movies)

# note that we are NOT formatting this in the Five Thirty Eight Style yet... we'll leave that to you to figure out

In [17]:
# let's see it
augmentPercentVis(seenPerMovieVis)

So we can see that “Star Wars: Episode V — The Empire Strikes Back” is the film seen by the most number of people, followed by “Star Wars: Episode VI — Return of the Jedi.” Appallingly, more people reported seeing “Star Wars: Episode I — The Phantom Menace” than the original “Star Wars” (renamed “Star Wars: Episode IV — A New Hope”).

So, which movie is the best? We asked the subset of 471 respondents who indicated they have seen every “Star Wars” film to rank them from best to worst. From that question, we calculated the share of respondents who rated each film as their favorite.

!["Sol1"](assets/best_movie_article_resized.png)

### 2.1 What's the best 'Star Wars' movie? Recreate the above image using Altair.

In [18]:
# Recreate this image using Altair
# match the "538 style" as best you can (hint: look at the altair lab at the start of the semester)

def genBestVis(inpf,eps, names=namesDict, names_l=namesList):
    # input: inpf, the star wars dataset
    # input: eps, the list of episodes
    # input: names - a dictionary of abbreviations to names (default: namesDict)
    # input: names_l - a list of all the movies in series order (default: namesList)
    # output: the Altair visualization as described above
    # YOUR CODE HERE
    
    # keep only the people who have seen all six films (no seen_* column is NaN)
    allsix = inpf.dropna(subset=['seen_' + ep for ep in eps],how='any')
    
    # loop over each rank column and calculate the share of people who ranked that movie first out of all six
    total = len(allsix)
    percs = []

    for rank_ep in ['rank_' + ep for ep in eps]:
        perc = len(allsix[allsix[rank_ep]=='1']) / total
        percs.append(perc)

    # at this point percs is holding our percentages

    # Use zip to pair each movie name with the share of respondents who picked it as their favorite
    tuples = list(zip([names[ep] for ep in eps],percs))
    rank_per_df = pd.DataFrame(tuples, columns = ['Name', 'Percentage'])
    
    # Generate the base bar chart using the two Altair helper functions defined above
    base_vis = augmentPercentVis(genPercentVis(rank_per_df))
    
    #Configure the style of the charts using altair top-level configurations
    base_vis = base_vis.properties(
        title={
            "text":"What's the Best 'Star Wars' Movie?",
            "subtitle":'of 471 respondents who have seen all six films',
            "subtitleColor": "black",
            "fontSize":25,
            "subtitleFontSize":15,
            "dy":-18,
            "anchor":"start"
        }
    ).configure_axisY(
        labelColor="gray"
    )
    return(base_vis)
    #raise NotImplementedError()

In [19]:
# test your code
genBestVis(sw,episodes)

We can also drill down and find out, generally, how people rate the films. Overall, fans broke into two camps: those who preferred the original three movies and those who preferred the three prequels. People who said “The Empire Strikes Back” was their favorite were also likely to rate “A New Hope” and “Return of the Jedi” higher as well. Those who rated “The Phantom Menace” as the best film were more likely to rate prequels higher.

This chart shows how often each film was rated in the top third (best or second-best), the middle third (third or fourth) or the bottom third (second-worst or worst). It’s a more nuanced take on the series:

!["Sol1"](assets/how_rate_resized.png)


### 2.2 How do people rate the 'Star Wars' movies? Recreate the above image using Altair

In [20]:
def gen3PickDf(inpf, names=namesDict, names_l=namesList):
    
    total = len(inpf)
    top3rd = []
    mid3rd = []
    bot3rd = []

    for rank_ep in ['rank_' + ep for ep in episodes]:
        top3 = len(inpf[(inpf[rank_ep] == "1") | (inpf[rank_ep] == "2")]) / total
        top3rd.append(top3)
        mid3 = len(inpf[(inpf[rank_ep] == "3") | (inpf[rank_ep] == "4")]) / total
        mid3rd.append(mid3)
        bot3 = len(inpf[(inpf[rank_ep] == "5") | (inpf[rank_ep] == "6")]) / total
        bot3rd.append(bot3)

    tuples = list(zip([names[ep] for ep in episodes],top3rd,mid3rd,bot3rd))
    pick_3_df = pd.DataFrame(tuples, columns = ['Name', 'Top third', 'Middle third', 'Bottom third'])
    return(pick_3_df)
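The same top/middle/bottom bucketing can be expressed as a lookup table plus a cross-tabulation. The following is a sketch under assumptions: the ranks are made up, only two films are included, and ranks are stored as strings as in the comparisons above.

```python
import pandas as pd

# Hypothetical ranks from four respondents, stored as strings like the rank_* columns
toy = pd.DataFrame({
    'rank_EI': ['1', '4', '6', '5'],
    'rank_EV': ['2', '1', '1', '3'],
})

# Map each rank to the third it falls into
thirds = {'1': 'Top third', '2': 'Top third',
          '3': 'Middle third', '4': 'Middle third',
          '5': 'Bottom third', '6': 'Bottom third'}

# One row per (film, rank) pair, then label each rank with its third
long_df = toy.melt(var_name='Film', value_name='Rank')
long_df['Third'] = long_df['Rank'].map(thirds)

# normalize='index' turns the counts in each film's row into shares that sum to 1
shares = pd.crosstab(long_df['Film'], long_df['Third'], normalize='index')
print(shares)
```

One difference from `gen3PickDf` is the denominator: the cross-tabulation divides by the number of rankings each film received, whereas `gen3PickDf` divides by the number of rows in the input frame.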

In [21]:
# Recreate this image using altair here (10 POINTS)

def genRateVis(inpf,eps, names=namesDict, names_l=namesList):
    # input: inpf, the star wars dataset
    # input: eps, the list of episodes
    # input: names - a dictionary of abbreviations to names (default: namesDict)
    # input: names_l - a list of all the movies in series order (default: namesList)
    # output: the Altair visualization as described above
    
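    # YOUR CODE HERE
    # Keep only respondents who have seen all six films, matching the 471 in the subtitle
    allsix = inpf.dropna(subset=['seen_' + ep for ep in eps], how='any')

    # Create the dataframe containing the share of top-, middle- and bottom-third rankings
    df = gen3PickDf(allsix)
    melted_df = pd.melt(df, id_vars=['Name'], var_name='3Levels', value_name='Percentage')
    
    # Create the visual that matches the article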
    
    chart = alt.Chart(melted_df).mark_bar().encode(
        x=alt.X(
            'Percentage:Q'
        ),
        y=alt.Y(
            'Name:N',
            axis=alt.Axis(tickCount=5, title=''),
            sort=namesList
            ),
        color=alt.Color(
            '3Levels:N', 
            scale=alt.Scale(range=['red', '#0096FF', '#48A860'])
        )
    ).properties(
        width=100,
        height=155
    )

    text = chart.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Offset the text from the points
    ).encode(
        x=alt.X(
            'Percentage:Q'
        ),
        y=alt.Y(
            'Name:N',
            axis=alt.Axis(tickCount=5, title=''),
            sort=namesList
            ),
        text=alt.Text('Percentage:Q',format='.0%')
    )

    layered_chart = alt.layer(
        chart, text
    ).facet(
        column=alt.Column(
            '3Levels:N',
            sort=['Top third', 'Middle third', 'Bottom third'],
            title = None
        )
    )
    
    final_chart = layered_chart.configure_axisX(
        grid=False,
        domain=False,
        labels=False,
        ticks=False,
        title="null"
    ).configure_view(
        strokeWidth=0
    ).configure_legend(
        disable=True
    ).properties(
        title={
            "text":"How People Rate the 'Star Wars' Movies",
            "subtitle":["How often each film was rated in the top, middle and bottom third", "(by 471 respondents who have seen all six films)"],
            "subtitleColor": "black",
            "fontSize":23,
            "subtitleFontSize":12,
            "dy":-18,
            "anchor":"start"
        }
    ).configure_axisY(
        labelColor="gray"
    )    
    
    return(final_chart)
    #raise NotImplementedError()

In [22]:
# let's check our solution
genRateVis(sw,episodes)

Finally, we took a boilerplate format used by political favorability polls — “Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her” — and asked respondents to rate characters in the series.

!["Sol1"](assets/char_ranking_resized.png)

### 2.3 'Star Wars' Character Favorability Ratings. Recreate the above image using Altair

In [23]:
# Reduce the Star Wars data frame to the character columns, which hold each respondent's favorability rating.
characters = sw[['Han Solo','Luke Skywalker', 'Princess Leia Organa', 'Anakin Skywalker','Obi Wan Kenobi', 'Emperor Palpatine', 'Darth Vader','Lando Calrissian', 'Boba Fett', 'C-3P0', 'R2 D2', 'Jar Jar Binks','Padme Amidala', 'Yoda']]
ch_df = characters.dropna(how='all')
char_namesl = list(characters.columns)

In [24]:
def genFavDf(inpf, names=char_namesl):
    # Summarize the character ratings into one row per character, with the share of each favorability group.
    # inpf = sliced dataframe containing only the Star Wars character columns
    total = len(inpf)
    favorable = []
    neutral = []
    unfavorable = []
    unfamiliar = []

    for name in names:
        fav = len(inpf[(inpf[name] == 'Very favorably') | (inpf[name] == 'Somewhat favorably')]) / total
        favorable.append(fav)
        neu = len(inpf[inpf[name] == 'Neither favorably nor unfavorably (neutral)']) / total
        neutral.append(neu)
        unfav = len(inpf[(inpf[name] == 'Somewhat unfavorably') | (inpf[name] == 'Very unfavorably')]) / total
        unfavorable.append(unfav)
        unfam = len(inpf[inpf[name] == 'Unfamiliar (N/A)']) / total
        unfamiliar.append(unfam)  

    tuples = list(zip([name for name in names],favorable,neutral,unfavorable,unfamiliar))
    df = pd.DataFrame(tuples, columns = ['Name', 'Favorable', 'Neutral', 'Unfavorable', 'Unfamiliar'])
    return(df)
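To show what the grouping in `genFavDf` does for a single character, here is a small sketch that collapses the six survey answers into the four chart categories with a lookup table. The five responses are made up; the answer labels are the ones compared against in the function above.

```python
import numpy as np
import pandas as pd

# Survey answer -> chart category, mirroring the comparisons in genFavDf
groups = {
    'Very favorably': 'Favorable',
    'Somewhat favorably': 'Favorable',
    'Neither favorably nor unfavorably (neutral)': 'Neutral',
    'Somewhat unfavorably': 'Unfavorable',
    'Very unfavorably': 'Unfavorable',
    'Unfamiliar (N/A)': 'Unfamiliar',
}

# Hypothetical answers for one character; the last respondent skipped the question
yoda = pd.Series(['Very favorably', 'Somewhat favorably', 'Very unfavorably',
                  'Unfamiliar (N/A)', np.nan], name='Yoda')

# value_counts() ignores the missing answer, but the denominator still includes it
counts = yoda.map(groups).value_counts()
shares = counts / len(yoda)
print(shares)
```

Because the denominator is the number of respondents rather than the number of answers, the shares for a character do not have to sum to 100% when some respondents skipped that character.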

In [25]:
# Recreate this image using altair here (10 POINTS)

def genFavorabilityVis(inpf):
    # input: inpf, the star wars dataset (as defined at top of file)
    # output: the Altair visualization as described above
    
    # YOUR CODE HERE

    # Build the per-character favorability shares from the input frame,
    # keeping respondents who rated at least one character.
    ch_df = inpf[char_namesl].dropna(how='all')
    fav_df = genFavDf(ch_df)
    melted_fav_df = pd.melt(fav_df, id_vars=['Name'], var_name='Favor_level', value_name='Percentage')
    
    #Create the order of the final chart using the descending order of the percentage for Favorable chart.
    character_order = melted_fav_df[melted_fav_df.Favor_level == "Favorable"].sort_values(by="Percentage", ascending=False)["Name"].tolist()
    
    #Creating a stacked bar chart and a text chart.
    chart = alt.Chart(melted_fav_df).mark_bar().encode(
        x=alt.X(
            'Percentage:Q'
        ),
        y=alt.Y(
            'Name:N',
            axis=alt.Axis(tickCount=14, title=''),
            sort=character_order
            ),
        color=alt.Color(
            'Favor_level', 
            scale=alt.Scale(range=[ '#48A860', '#0096FF',  'gray', 'red'])
        )
    ).properties(
        width=100,
        height=300
    )

    
    text = chart.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Offset the text from the points
    ).encode(
        x=alt.X(
            'Percentage:Q'
        ),
        y=alt.Y(
            'Name:N',
            axis=alt.Axis(tickCount=14, title=''),
            sort=character_order
            ),
        text=alt.Text('Percentage:Q',format='.0%')
    )
    
    # Combine the text chart and bar chart, then facet according to the Favor_level column
    layered_chart = alt.layer(
        chart, text
    ).facet(
        column=alt.Column(
            'Favor_level:N',
            sort=alt.EncodingSortField('Percentage', order='descending'),
            title = None
        )
    )
    
    #Configure the title and style of the facet chart 
    final_chart = layered_chart.configure_axisX(
        grid=False,
        domain=False,
        labels=False,
        ticks=False,
        title="null"
    ).configure_view(
        strokeWidth=0
    ).configure_legend(
        disable=True
    ).properties(
        title={
            "text":"'Star Wars' Character Favorability Ratings",
            "subtitle":['By 834 respondents'],
            "subtitleColor": "black",
            "fontSize":23,
            "subtitleFontSize":12,
            "dy":-18,
            "anchor":"start"
        }
    ).configure_axisY(
        labelColor="gray"
    )    
    return(final_chart)   
    #raise NotImplementedError()

In [26]:
# let's test the solution
genFavorabilityVis(sw)

Jar Jar Binks has a lower favorability rating than the actual personification of evil in the galaxy.

And for those of you who want to know the impact that [historical revisionism](http://en.wikipedia.org/wiki/Han_shot_first) can have on a society:

!["Sol1"](assets/shot_first_article_resized.png)


### 2.4 Who shot first? Recreate the above image using Altair

In [27]:
# Recreate this image using altair here (10 POINTS)

def genFirstShotVis(inpf):
    # input: inpf, the star wars dataset
    # output: the Altair visualization as described above
    
    # YOUR CODE HERE
    # Answers to "Which character shot first?", skipping respondents who left it blank
    shot_df = inpf['Which character shot first?'].dropna()

    # Sort order for the bars: reverse of the order in which the options first appear
    shot_name_list = shot_df.unique().tolist()[::-1]

    # Share of votes for each option
    total = len(shot_df)

    # rename_axis/reset_index(name=...) yields the same column names whether
    # value_counts() labels its result with the column name or with 'count'
    shot_count = shot_df.value_counts().rename_axis('Name').reset_index(name='Votes')
    shot_count['Percentage'] = shot_count['Votes'] / total
    shot_count = shot_count.drop(columns='Votes')

    bars = alt.Chart(shot_count).mark_bar(size=20).encode(
        x=alt.X(
            'Percentage',
            axis=None),
        y=alt.Y(
            'Name:N',
            axis=alt.Axis(tickCount=3, title=''),
            sort=shot_name_list
        )
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text=alt.Text('Percentage:Q',format='.0%')
    )

    shotf = (text + bars).configure_mark(
        color='#008fd5'
    ).configure_view(
        strokeWidth=0
    ).configure_scale(
        bandPaddingInner=0.2
    ).properties(
        width=500,
        height=105
    ).properties(
        title="Who Shot First?"
    )


    base_vis = shotf.properties(
        title={
            "text":"Who Shot First?",
            "subtitle":'According to 834 respondents',
            "subtitleColor": "black",
            "fontSize":25,
            "subtitleFontSize":19,
            "dy":-18,
            "anchor":"start"
        }
    ).configure_axisY(
        labelColor="gray"
    )
    return(base_vis)
    #raise NotImplementedError()

In [28]:
# test our solution
genFirstShotVis(sw)
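As a side note, the vote shares can also be obtained in a single step with `value_counts(normalize=True)`, which divides each count by the number of non-missing answers. The sketch below uses five made-up answers rather than the survey column.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; the last respondent left the question blank
answers = pd.Series(
    ['Han', 'Greedo', 'Han', "I don't understand this question", np.nan],
    name='Which character shot first?',
)

# normalize=True returns shares instead of counts, sorted from largest to smallest
shot_share = (
    answers.dropna()
    .value_counts(normalize=True)
    .rename_axis('Name')
    .reset_index(name='Percentage')
)
print(shot_share)
```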

### 2.5 Make my own 

To enrich the narrative of "America's Favorite 'Star Wars' Movies (And Least Favorite Characters)", I will create additional visuals for the article to highlight underexplored facets of the data.

In [29]:
# Draft version

# Create data frame for age, gender, and familiarity with the Expanded Universe.

clean_df = sw.dropna(subset=['Are you familiar with the Expanded Universe?'])
sub_df = clean_df[['Age','Gender','Are you familiar with the Expanded Universe?', 'Location (Census Region)']]
sub_df = sub_df.dropna()
# Count only respondents who answered "Yes", so that 'Familiarity' measures familiarity rather than group size
familiar_df = sub_df[sub_df['Are you familiar with the Expanded Universe?'] == 'Yes']
count_df = familiar_df.groupby(['Age', 'Gender', 'Location (Census Region)']).size().reset_index(name='Familiarity')

scat = alt.Chart(count_df).mark_point().encode(
    alt.Size('Location (Census Region)', title=None), 
    alt.X('Age'),
    alt.Y('Familiarity', axis=alt.Axis(grid=False)), 
    alt.Color('Gender'))


scat

In [30]:
# Better Version

# Create data frame for age, gender, and familiarity with the Expanded Universe.

clean_df = sw.dropna(subset=['Are you familiar with the Expanded Universe?'])
sub_df = clean_df[['Age','Gender','Are you familiar with the Expanded Universe?', 'Location (Census Region)']]
sub_df = sub_df.dropna()
# Count only respondents who answered "Yes", so that 'Familiarity' measures familiarity rather than group size
familiar_df = sub_df[sub_df['Are you familiar with the Expanded Universe?'] == 'Yes']
count_df = familiar_df.groupby(['Age', 'Gender', 'Location (Census Region)']).size().reset_index(name='Familiarity')

bar = alt.Chart(count_df).mark_bar().encode(
    alt.Column('Location (Census Region)', title=None), 
    alt.X('Age'),
    alt.Y('Familiarity', axis=alt.Axis(grid=False)), 
    alt.Color('Gender'))


bar.configure_mark(
    color='#008fd5'
).configure_view(
    strokeWidth=0
).properties(title={
    "text":"Familiarity with the Expanded Universe by Age, Gender and Location",
    "subtitle":'According to 818 respondents',
    "subtitleColor": "black",
    "fontSize":25,
    "subtitleFontSize":19,
    "dy":-18,
    "anchor":"start"}
            )
            

Interpretation:

My better version is a bar chart. My domain question is how familiarity with the Expanded Universe differs across age groups, genders, and regions. In abstract terms, the task is to compare one quantitative variable (familiarity) across two nominal variables (location and gender) and one ordinal variable (age). A scatter plot is best suited to relating two quantitative variables; with three categorical variables, its encoding becomes cluttered and hard to read. The bar chart works better because the ordinal and quantitative variables take the x and y position encodings, while one nominal variable (location) is encoded by splitting the chart into side-by-side panels, which acts as a third position encoding. The remaining nominal variable (gender) is encoded with color, which distinguishes nominal categories effectively as long as there are not too many of them (no more than about 10, in my opinion).
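One possible refinement, which I have not applied to the chart above, is to plot the share of familiar respondents per group instead of a raw count, so that groups with more respondents do not automatically look more familiar. The sketch below shows the aggregation on made-up rows; the `familiar`, `Respondents` and `FamiliarShare` names are my own and do not appear in the dataset.

```python
import pandas as pd

# Hypothetical respondents with the survey's Yes/No answer format
toy = pd.DataFrame({
    'Age': ['18-29', '18-29', '18-29', '30-44', '30-44'],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'Are you familiar with the Expanded Universe?': ['Yes', 'No', 'Yes', 'No', 'No'],
})

# True when the respondent answered "Yes"
toy['familiar'] = toy['Are you familiar with the Expanded Universe?'].eq('Yes')

# Per group: how many respondents answered, and what fraction of them said "Yes"
rate_df = (
    toy.groupby(['Age', 'Gender'])['familiar']
    .agg(Respondents='count', FamiliarShare='mean')
    .reset_index()
)
print(rate_df)
```

Keeping the respondent count alongside the share makes it easy to spot groups where the share rests on only a handful of answers.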