# Information Visualization I 
## School of Information, University of Michigan

## Week 3: 
- Perception / Cognition

## Assignment Overview
### This assignment's objectives include:

- Review, reflect on, and apply the concepts of the perception pipeline. Justify how different encodings impact the effectiveness of a visualization depending on the human perception process.

!["Drawing"](assets/preattentive_resized.png)

<p style="text-align: center;"> Preattentive Processing </p>

- Recreate visualizations and propose new and alternative visualizations using [Altair](https://altair-viz.github.io/) 

### The total score of this assignment will be 100 points consisting of:
- Case study reflection: America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters) (25 points)
- Altair programming exercise (75 points)

### Resources:
- Article by [FiveThirtyEight](https://fivethirtyeight.com) available  [online](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/) (Hickey, 2014)  
- Datasets from FiveThirtyEight. We have downloaded a subset of this data into the folder [./assets](assets)
    - The original dataset can be found at [FiveThirtyEight Star Wars Survey](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey)
    

### Important notes:
1) There will be a couple of places where the numbers you get when you select rows may be a little different than 538, but the percents should still work (e.g., 828 instead of 834). You'll see this in our examples. If you can somehow get the data to match exactly, that's great too.

2) A lot of this assignment will be based on building "compound" charts. Using the wrong strategy will mean that you are spending a lot of time "cleaning up" charts. We strongly suggest reading through (and understanding) the Altair documentation [on this](https://altair-viz.github.io/user_guide/compound_charts.html).

3) Grading for this assignment is entirely done by a human grader. They will be running tests on the functions we ask you to create. This means there is no autograding (submitting through the autograder will result in an error). You are expected to test and validate your own code. 

4) Keep your notebooks clean and readable. If your code is highly messy or inefficient you will get a deduction.

5) Follow the instructions for submission on Coursera. You will be providing us a generated link to a read-only version of your notebook and a PDF. When turning in your PDF, please use the File -> Print -> Save as PDF option ***from your browser***. Do ***not*** use the File->Download as->PDF option. Complete instructions for this are under Resources in the Coursera page for this class. If you're having trouble with printing, take a look at [this video](https://youtu.be/PiO-K7AoWjk).



## Part 1. Perception and Cognition (25 points)
Read the article ["America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters),"](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/) and answer the following questions:


### 1.1 List the different data types in the following visualizations and their encodings (10 points)
Look at the following visualizations. For each, list every variable, its type, and the encoding used. 

!["How rate visualization"](assets/how_rate_resized.png)
!["Have seen visualization"](assets/have_seen_resized.png)

Use the following table for each variable (indicate which vis each came from):
* variable name, variable type (e.g., ordinal, nominal, etc.), variable encoding (e.g., color, x-position, etc.)

For example, if you [look at this visualization](https://altair-viz.github.io/gallery/scatter_tooltips.html) one row in our table below would be: Horsepower, quantitative, x-position.

_1.1 Answer_

*Use as many rows as you need*

How People Rate the 'Star Wars' Movies

| Variable name| Variable Type (Ordinal/Nominal/Quantitative) | Variable encoding (color, x-position, etc.) |
| --- | --- | --- |
|  Movie Title | Nominal | Y-position |
|  Top Third | Quantitative | X-Position |
|  Middle Third | Quantitative | X-Position |
|  Bottom Third | Quantitative | X-Position |
|  Top Third | Quantitative | Color |
|  Middle Third | Quantitative | Color |
|  Bottom Third | Quantitative | Color |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |

Which 'Star Wars' Movies Have You Seen?

| Variable name| Variable Type (Ordinal/Nominal/Quantitative) | Variable encoding (color, x-position, etc.) |
| --- | --- | --- |
|  Movie Title | Nominal | Y-position |
|  Respondents Seeing Film | Quantitative | X-Position |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |
|  *your answer* | *your answer* | *your answer* |



### 1.2 A Pie Chart Version (15 points)


!["Drawing"](assets/have_seen_resized.png)

Your colleague is arguing that the chart above could be made more *effective* by using one or more pie charts. In what situations are they right? In what situations would they be wrong? If it helps make your case, feel free to create a picture (hand drawn is fine) and load it into the notebook.

_1.2 Answer_

A pie chart works well when the quantitative values are parts of a whole. Here it would not be effective: respondents could select every film they had seen, so the percentages do not add up to 100%, and slice sizes are harder to compare than bar lengths. As a result, it would be hard to see which movie has actually been seen by the most people.
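
The part-to-whole argument can be checked numerically. The snippet below is a minimal sketch that sums the per-film shares reported in the `seenPerMovieDF` output later in this notebook; the name `seen_share` is introduced here only for illustration.

```python
# Per-film shares copied from the seenPerMovieDF output shown later in this notebook.
seen_share = {
    'The Phantom Menace': 0.805988,
    'Attack of the Clones': 0.683832,
    'Revenge of the Sith': 0.658683,
    'A New Hope': 0.726946,
    'The Empire Strikes Back': 0.907784,
    'The Return of the Jedi': 0.883832,
}

total = sum(seen_share.values())

# A single pie needs slices that sum to 1.0 (100%); these do not,
# because each respondent could report seeing several films.
is_part_of_whole = abs(total - 1.0) < 1e-9

print(f"sum of shares: {total:.2f}")
print("valid for a single pie chart:", is_part_of_whole)
```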

## Part 2. Altair programming exercise (75 points)
We have provided you with some code and parts of the article [America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/). This article is based on the dataset:

1. [StarWars](data/StarWars.csv) Created by FiveThirtyEight based on a survey run through SurveyMonkey Audience, surveying 1,186 respondents from June 3 to 6, 2014. Available [online](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey)

To earn points for this assignment, you must:

- Recreate the visualizations in the article (replace the images in the article with a code cell that creates a visualization). We provide one example. Each visualization is worth 10 points (40 points total: 4 visualizations x 10 points each).

    - _Partial credit can be granted for each visualization (up to 5 points) if you provide the grammar of graphics description of the visualization without a functional Altair implementation_


- Propose two variants (a "good one" and a "bad one") for a new visualization for the article. You'll be asked to justify why one is better than the other. We are specifically interested in your explanation of why the "good one" is more *effective* based on principles of perception/cognition. (35 points/ 20 points for the plots + 15 for justification)




In [223]:
import pandas as pd
import altair as alt
import numpy as np
import math

In [224]:
# enable correct rendering
alt.renderers.enable('default')

RendererRegistry.enable('default')

In [225]:
# uses intermediate json files to speed things up
alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

In [226]:
def loadStarwarsData(filename='assets/StarWars.csv'):
    
    # input: filename to original dataset (default StarWars.csv)
    # output: cleaned up dataframe with columns appropriately renamed
    
    sw = pd.read_csv(filename, encoding='latin1')

    # Some format is needed for the survey dataframe, we provide the formatted dataset in a dataframe 
    sw = sw.rename(columns={'Have you seen any of the 6 films in the Star Wars franchise?':'seen_any_movie',
                            'Do you consider yourself to be a fan of the Star Wars film franchise?': 'fan',
                            'Which of the following Star Wars films have you seen? Please select all that apply.' : 'seen_EI',
                            'Unnamed: 4' : 'seen_EII',
                            'Unnamed: 5' : 'seen_EIII',
                            'Unnamed: 6' : 'seen_EIV',
                            'Unnamed: 7' : 'seen_EV',
                            'Unnamed: 8' : 'seen_EVI',
                            'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'rank_EI',
                            'Unnamed: 10' : 'rank_EII',
                            'Unnamed: 11' : 'rank_EIII',
                            'Unnamed: 12' : 'rank_EIV',
                            'Unnamed: 13' : 'rank_EV',
                            'Unnamed: 14' : 'rank_EVI',
                            'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.' : 'Han Solo',
                            'Unnamed: 16' : 'Luke Skywalker',
                            'Unnamed: 17' : 'Princess Leia Organa',
                            'Unnamed: 18' : 'Anakin Skywalker',
                            'Unnamed: 19' : 'Obi Wan Kenobi',
                            'Unnamed: 20' : 'Emperor Palpatine',
                            'Unnamed: 21' : 'Darth Vader',
                            'Unnamed: 22' : 'Lando Calrissian',
                            'Unnamed: 23' : 'Boba Fett',
                            'Unnamed: 24' : 'C-3P0',
                            'Unnamed: 25' : 'R2 D2',
                            'Unnamed: 26' : 'Jar Jar Binks',
                            'Unnamed: 27' : 'Padme Amidala',
                            'Unnamed: 28' : 'Yoda',
                           })
    sw = sw.drop([0])
    return(sw)

sw = loadStarwarsData()

In [227]:
# take a peek at the data
sw.sample(5)

Unnamed: 0,RespondentID,seen_any_movie,fan,seen_EI,seen_EII,seen_EIII,seen_EIV,seen_EV,seen_EVI,rank_EI,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
82,3291662000.0,Yes,No,,,,,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,6.0,...,Somewhat favorably,I don't understand this question,No,,No,Male,> 60,"$50,000 - $99,999",Some college or Associate degree,South Atlantic
881,3289473000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,1.0,...,Very favorably,Greedo,Yes,Yes,Yes,Female,18-29,"$0 - $24,999",Bachelor degree,Middle Atlantic
631,3290049000.0,Yes,No,,,,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,4.0,...,Very favorably,I don't understand this question,No,,No,Male,45-60,"$25,000 - $49,999",Graduate degree,New England
255,3290889000.0,Yes,,,,,,,,,...,,,,,,,,,,
133,3291396000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,6.0,...,Very favorably,I don't understand this question,Yes,No,No,Female,30-44,"$50,000 - $99,999",Bachelor degree,South Atlantic


# America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters)

_Original article available at [FiveThirtyEight](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/)_

By [Walt Hickey](https://fivethirtyeight.com/contributors/walt-hickey/)

Filed under [Movies](https://fivethirtyeight.com/tag/movies/)

Get the data on [GitHub](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey)

This week, I caught a sneak peek [of the X-Wing fighter](http://www.wired.com/2014/07/star-wars-episode-vii-x-wing/) from the new “Star Wars” films in production. The forthcoming movies — and the middling response to the most recent trilogy — provide a perfect excuse to examine some questions I’ve long wanted answers to: How many people are “Star Wars” fans? Does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch? Which characters are most well-liked and most hated? And who shot first, Han Solo or Greedo?

We ran a poll through [SurveyMonkey Audience](https://www.surveymonkey.com/mp/audience/), surveying 1,186 respondents from June 3 to 6 (the [data](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey) is available [on GitHub](https://github.com/fivethirtyeight/data)). Seventy-nine percent of those respondents said they had watched at least one of the “Star Wars” films. This question, incidentally, had a substantial difference by gender: 85 percent of men have seen at least one “Star Wars” film compared to 72 percent of women. Of people who have seen a film, men were also more likely to consider themselves a fan of the franchise: 72 percent of men compared to 60 percent of women.

We then asked respondents which of the films they had seen. With 835 people responding, here’s the probability that someone has seen a given “Star Wars” film given that they have seen any Star Wars film:

!["Sol1"](assets/have_seen_resized.png)
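
The quantity plotted above is a conditional share: among respondents who saw at least one film, the fraction who saw a given film. The following is a minimal sketch in plain Python using a made-up four-respondent sample; the `responses` list and its values are hypothetical and are not survey data.

```python
# Hypothetical toy data: one dict per respondent; None means "did not report seeing it".
responses = [
    {'EI': 'seen', 'EV': 'seen'},
    {'EI': None,   'EV': 'seen'},
    {'EI': None,   'EV': None},    # saw nothing: excluded from the denominator
    {'EI': 'seen', 'EV': None},
]

# Condition on "has seen any film": keep respondents with at least one non-missing answer.
seen_any = [r for r in responses if any(v is not None for v in r.values())]

# P(seen film | seen any film) = count who saw the film / count who saw any film.
share = {
    film: sum(r[film] is not None for r in seen_any) / len(seen_any)
    for film in ['EI', 'EV']
}

print(len(seen_any), share)
```

The code cells that follow do the same thing with pandas on the real survey columns.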

In [228]:
# We're going to fix the labels a bit so will create a mapping to the full names
# and have the full sort order
def genEpisodeNamesDF():
    # create the various mapping and order lists/dictionaries
    # return: list of episodes (in order), episode->name map, list of episode names (in order)
    episodes = ['EI', 'EII', 'EIII', 'EIV', 'EV', 'EVI']
    names = {
        'EI' : 'The Phantom Menace', 'EII' : 'Attack of the Clones', 'EIII' : 'Revenge of the Sith', 
        'EIV': 'A New Hope', 'EV': 'The Empire Strikes Back', 'EVI' : 'The Return of the Jedi'
    }

    # we're also going to use this order to sort, so names_l will now have our sort order
    return episodes, names, [names[ep] for ep in episodes]

episodes, namesDict, namesList = genEpisodeNamesDF()

In [229]:
# let's inspect what we've generated. These will be useful to you below
print("abbreviated list (sorted):\n  ",episodes)
print("\nmapping between abbreviated names and full titles:\n  ",namesDict)
print("\nfull titles, sorted:\n  ",namesList)


abbreviated list (sorted):
   ['EI', 'EII', 'EIII', 'EIV', 'EV', 'EVI']

mapping between abbreviated names and full titles:
   {'EI': 'The Phantom Menace', 'EII': 'Attack of the Clones', 'EIII': 'Revenge of the Sith', 'EIV': 'A New Hope', 'EV': 'The Empire Strikes Back', 'EVI': 'The Return of the Jedi'}

full titles, sorted:
   ['The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith', 'A New Hope', 'The Empire Strikes Back', 'The Return of the Jedi']


In [230]:
# let's do some data pre-processing... recall that sw (star wars) has everything

def getSeenAtLeastOneDF(indf, eps):
    # input: indf the data file as formatted above
    # input: eps a list of episodes (movies)

    # returns a subset of the dataset
    
    # We want to only use those people who have seen at least one movie, let's get the people, toss NAs
    # and get the total count

    # find people who have at least one of the columns (seen_*) not NaN
    salo = indf.dropna(subset=['seen_' + ep for ep in eps],how='all')
    
    return(salo)


In [231]:
seenAtLeastOneDF = getSeenAtLeastOneDF(sw, episodes)
print("total who have seen at least one: ", len(seenAtLeastOneDF),"\nSample:")
display(seenAtLeastOneDF.sample(5))

total who have seen at least one:  835 
Sample:


Unnamed: 0,RespondentID,seen_any_movie,fan,seen_EI,seen_EII,seen_EIII,seen_EIV,seen_EV,seen_EVI,rank_EI,...,Yoda,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
686,3289925000.0,Yes,Yes,,,,,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3,...,Very favorably,Greedo,Yes,Yes,Yes,Female,45-60,"$150,000+",Bachelor degree,West South Central
1043,3288614000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,2,...,Neither favorably nor unfavorably (neutral),I don't understand this question,No,,No,Female,18-29,"$25,000 - $49,999",Bachelor degree,Pacific
1058,3288585000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,6,...,Very favorably,Han,No,,Yes,Female,18-29,"$50,000 - $99,999",Bachelor degree,Mountain
341,3290740000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,4,...,Very favorably,I don't understand this question,No,,Yes,Female,45-60,"$150,000+",Bachelor degree,Mountain
1051,3288600000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Very favorably,Han,Yes,No,Yes,Male,> 60,"$25,000 - $49,999",Some college or Associate degree,Pacific
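
The `how='all'` argument used in `getSeenAtLeastOneDF` is easy to confuse with `how='any'`. The sketch below contrasts the two on a made-up three-row frame; `toy` and its values are hypothetical, and the second filter is shown only as one possible way to select respondents who saw every film.

```python
import pandas as pd

# Hypothetical toy frame: None means the respondent did not report seeing that film.
toy = pd.DataFrame({
    'seen_EI': ['yes', None, None],
    'seen_EV': ['yes', 'yes', None],
})
cols = ['seen_EI', 'seen_EV']

# how='all' drops a row only when every listed column is missing -> "seen at least one".
seen_at_least_one = toy.dropna(subset=cols, how='all')

# how='any' drops a row when any listed column is missing -> "seen every film".
seen_every_film = toy.dropna(subset=cols, how='any')

print(len(seen_at_least_one), len(seen_every_film))
```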


In [232]:
# for each movie, we're going to calculate the percents and generate a new data frame

def genSeenPercentDf(inpf, names=namesDict, names_l=namesList):
    # input: inpf - an input frame of the form output by getSeenAtLeastOneDF()
    # input: names - a dictionary of abbreviations to names
    # input: names_l - a list of all the movies in series order
    
    total = len(inpf)
    percs = []

    # loop over each column and calculate the number of people who have seen the movie
    # specifically, keep the people who are *not NaN* for a specific episode (e.g., seen_EII), count them
    # and divide by the total number of respondents
    for seen_ep in ['seen_' + ep for ep in episodes]:
        perc = len(inpf[~ pd.isna(inpf[seen_ep])]) / total
        percs.append(perc)

    # at this point percs is holding our percentages

    # now we're going use a trick to make tuples--pairing names with percents--using "zip" and then make a dataframe
    tuples = list(zip([names[ep] for ep in episodes],percs))
    seen_per_df = pd.DataFrame(tuples, columns = ['Name', 'Percentage'])
    return(seen_per_df)

seenPerMovieDF = genSeenPercentDf(seenAtLeastOneDF)

In [234]:
# let's see what's inside
seenPerMovieDF

| | Name | Percentage |
| --- | --- | --- |
| 0 | The Phantom Menace | 0.805988 |
| 1 | Attack of the Clones | 0.683832 |
| 2 | Revenge of the Sith | 0.658683 |
| 3 | A New Hope | 0.726946 |
| 4 | The Empire Strikes Back | 0.907784 |
| 5 | The Return of the Jedi | 0.883832 |


In [235]:
def genPercentVis(indf, names_l=namesList):
    # input: indf, the dataframe as seen_per_df
    # input: names_l - a list of all the movies in series order (default namesList)
    # output: simple altair bar chart
    
    # ok, time to make the chart... let's make a bar chart (use mark_bar)
    bars = alt.Chart(indf).mark_bar(size=20).encode(
        # encode x as the percent, and hide the axis
        x=alt.X(
            'Percentage',
            axis=None),
        y=alt.Y(
            # encode y using the name, use the movie name to label the axis, sort using the names_l
            'Name:N',
             axis=alt.Axis(tickCount=5, title=''),
             # we give the sorting order to avoid alphabetical order
             sort=names_l
        )
    )

    # at this point we don't really have a great plot (it's missing the annotations, titles, etc.)
    return(bars)

seenPerMovieVis = genPercentVis(seenPerMovieDF)

In [236]:
# display it
seenPerMovieVis

In [237]:
def augmentPercentVis(base):
    # input: base (the base vis, i.e., bars as above)
    # we're going to overlay the text with the percentages, so let's make another visualization
    # that's just text labels


    text = base.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Nudges text to right so it doesn't appear on top of the bar
    ).encode(
        # we'll use the percentage as the text
        text=alt.Text('Percentage:Q',format='.0%')
    )

    # finally, we're going to combine the bars and the text and do some styling
    seen_movies = (text + base).configure_mark(
        # we don't love the default blue, so set a different one
        color='#008fd5'
    ).configure_view(
        # we don't want a stroke around the bars
        strokeWidth=0
    ).configure_scale(
        # add some padding
        bandPaddingInner=0.2
    ).properties(
        # set the dimensions of the visualization
        width=500,
        height=180
    ).properties(
        # add a title
        title="Which 'Star Wars' Movies Have you Seen?"
    )

    return(seen_movies)

# note that we are NOT formatting this in the Five Thirty Eight Style yet... we'll leave that to you to figure out

In [238]:
# let's see it
augmentPercentVis(seenPerMovieVis)

So we can see that “Star Wars: Episode V — The Empire Strikes Back” is the film seen by the most number of people, followed by “Star Wars: Episode VI — Return of the Jedi.” Appallingly, more people reported seeing “Star Wars: Episode I — The Phantom Menace” than the original “Star Wars” (renamed “Star Wars: Episode IV — A New Hope”).

So, which movie is the best? We asked the subset of 471 respondents who indicated they have seen every “Star Wars” film to rank them from best to worst. From that question, we calculated the share of respondents who rated each film as their favorite.

!["Sol1"](assets/best_movie_article_resized.png)


**Homework note:** Click [here](assets/best_movie.png) to see a version of this plot generated in Altair.

### 2.1 What's the best 'Star Wars' movie? Recreate the above image using Altair (10 Points)

Recreate the image above using Altair. Match the "538" style as best you can (hint: look at the Altair lab at the start of the semester). We expect you to *at least* match [our version](assets/best_movie.png) of the chart that was created in Altair.

In [239]:
# Recreate this image using Altair
# match the "538 style" as best you can (hint: look at the altair lab at the start of the semester)

def genBestVis(inpf,eps, names=namesDict, names_l=namesList):
    # input: inpf, the star wars dataset
    # input: eps, the list of episodes
    # input: names - a dictionary of abbreviations to names (default: namesDict)
    # input: names_l - a list of all the movies in series order (default: namesList)
    # output: the Altair visualization as described above
    # Calculate the share of respondents who rated each film as their favorite
    seenAtLeastOneDF = getSeenAtLeastOneDF(inpf, eps)
    movies_columns = ['seen_' + ep for ep in eps]
    all_movies_seen = seenAtLeastOneDF[movies_columns].notna().all(axis=1) # Filter to have seen all of the movies.
    seenAll = seenAtLeastOneDF[all_movies_seen]
    rankings_columns = ['rank_' + ep for ep in eps]
    # ranks may be strings ('1') or numbers (1.0) depending on which cells ran earlier,
    # so coerce to numeric before comparing
    numeric_ranks = seenAll[rankings_columns].apply(pd.to_numeric, errors='coerce')
    highest_rank_counts = (numeric_ranks == 1).sum()
    highest_rank_percentages = (highest_rank_counts / len(seenAll))

    data = pd.DataFrame({
        'Episode': eps,
        'Percentage': highest_rank_percentages.values,
        'Name': [names[ep] for ep in eps]
    })

    # Creating the bars for the bar chart
    bars = alt.Chart(data).mark_bar(size=20).encode(
        x=alt.X(
            'Percentage:Q',
            axis=alt.Axis(title='Percentage')
        ),
        y=alt.Y(
            'Name:N',
            axis=alt.Axis(tickCount=5, title=''),
            sort=names_l
        )
    )

    # Creating the text labels for the bars
    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3
    ).encode(
        text=alt.Text('Percentage:Q', format='.0%')
    )

    # Combining the bars and the text and doing some styling
    chart = (bars + text).configure_mark(
        color='#008fd5'
    ).configure_view(
        strokeWidth=0
    ).configure_scale(
        bandPaddingInner=0.2
    ).properties(
        width=500,
        height=180
    ).properties(
        title="Favorite Star Wars Movie Rankings"
    )
    
    return chart




    # YOUR CODE HERE
#     raise NotImplementedError()

In [240]:
# test your code
genBestVis(sw,episodes)

## Make sure to *style* your visualization to match the original the best you can

We can also drill down and find out, generally, how people rate the films. Overall, fans broke into two camps: those who preferred the original three movies and those who preferred the three prequels. People who said “The Empire Strikes Back” was their favorite were also likely to rate “A New Hope” and “Return of the Jedi” higher as well. Those who rated “The Phantom Menace” as the best film were more likely to rate prequels higher.

This chart shows how often each film was rated in the top third (best or second-best), the middle third (third or fourth) or the bottom third (second-worst or worst). It’s a more nuanced take on the series:

!["Sol1"](assets/how_rate_resized.png)

**Homework note:** Click [here](assets/people_rate.png) to see a version of this plot generated in Altair.
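
The grouping described above (ranks 1 and 2 are the top third, 3 and 4 the middle third, 5 and 6 the bottom third) can be sketched in plain Python as follows. The function name `rank_to_third` and the sample `ranks` are hypothetical and only illustrate the bucketing arithmetic.

```python
def rank_to_third(rank):
    """Map a rank from 1 (favorite) to 6 (least favorite) onto the article's thirds."""
    if not 1 <= rank <= 6:
        raise ValueError("rank must be between 1 and 6")
    if rank <= 2:
        return 'Top third'
    if rank <= 4:
        return 'Middle third'
    return 'Bottom third'

# Hypothetical ranks given to one film by five respondents.
ranks = [1, 2, 4, 6, 5]
labels = [rank_to_third(r) for r in ranks]

# Share of respondents placing the film in each third.
shares = {
    third: labels.count(third) / len(labels)
    for third in ['Top third', 'Middle third', 'Bottom third']
}

print(shares)
```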

### 2.2 How do people rate the 'Star Wars' movies? Recreate the above image using Altair (10 Points)

In [241]:
def calculate_movie_rankings(inpf, eps):
    rankings = []
    
    for ep in eps:
        movie_rankings = inpf['rank_' + ep].dropna().astype(int)
        total_responses = len(movie_rankings)
        
        top_third = np.sum(movie_rankings <= 2) / total_responses
        middle_third = np.sum((movie_rankings >= 3) & (movie_rankings <= 4)) / total_responses
        bottom_third = np.sum(movie_rankings >= 5) / total_responses
        
        rankings.append((namesDict[ep], top_third, middle_third, bottom_third))
        
    return pd.DataFrame(rankings, columns=['Movie', 'Top third', 'Middle third', 'Bottom third'])

def genRateVis(inpf, eps, names=namesDict, names_l=namesList):
    data = calculate_movie_rankings(inpf, eps)
    data_long = data.melt('Movie', var_name='Ranking Category', value_name='Proportion')
    
    domain = ['Top third', 'Middle third', 'Bottom third']
    range_ = ['green', 'blue', 'red']

    chart = alt.Chart(data_long).mark_bar(size=15).encode(
        y=alt.Y('Movie:N', title='', sort=names_l),
        x=alt.X('Proportion:Q', title='', axis=alt.Axis(labels=False, ticks=False)),
        color=alt.Color('Ranking Category:N', scale=alt.Scale(domain=domain, range=range_), title='Ranking Category'),
        column=alt.Column('Ranking Category:N', title='', header=alt.Header(labels=True))
    ).properties(
        width=120
    )
    
    return chart.configure_view(strokeWidth=0).properties(
        title="Star Wars Movie Rankings"
    )

    # YOUR CODE HERE
#     raise NotImplementedError()

In [242]:
# let's check our solution
genRateVis(sw,episodes)

Finally, we took a boilerplate format used by political favorability polls — “Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her” — and asked respondents to rate characters in the series.

!["Sol1"](assets/char_ranking_resized.png)


**Homework note:** Here's an example solution generated in Altair:
!["Sol1"](assets/people_rate_s.png)

Be careful here in what you're dividing by to get the totals for each character.
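
To see why the denominator matters, consider the made-up answers below for a single character. The list `answers` is hypothetical. Which denominator reproduces the article's numbers is something to verify against the original chart rather than assume.

```python
from collections import Counter

# Hypothetical answers for one character; None marks a respondent who skipped the question.
answers = ['Very favorably', 'Somewhat favorably', 'Unfamiliar (N/A)', None, None]

counts = Counter(a for a in answers if a is not None)
favorable = counts['Very favorably'] + counts['Somewhat favorably']

# Denominator 1: everyone in the sample, including people who skipped the question.
share_of_all = favorable / len(answers)

# Denominator 2: only people who answered the question for this character.
share_of_answered = favorable / sum(counts.values())

print(share_of_all, share_of_answered)
```

The same favorable count yields two different percentages, so the choice of denominator should be made deliberately for each character.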

### 2.3 'Star Wars' Character Favorability Ratings. Recreate the above image using Altair (10 Points)

In [243]:
# Recreate this image using altair here (10 POINTS)
def calculate_character_favorability(inpf):
    # Names of the characters
    characters = ['Luke Skywalker', 'Han Solo', 'Princess Leia Organa', 'Obi Wan Kenobi', 'Yoda', 'R2 D2', 'C-3P0',
              'Anakin Skywalker', 'Darth Vader', 'Lando Calrissian', 'Padme Amidala', 'Boba Fett', 'Emperor Palpatine', 'Jar Jar Binks']
    
    # Calculate the favorability proportions
    favorability_data = []
    for character_name in characters:
        favorability_counts = inpf[character_name].value_counts(normalize=True)
        favorability_data.append({
            'Character': character_name,
            'Favorable': favorability_counts.get('Very favorably', 0) + favorability_counts.get('Somewhat favorably', 0),
            'Neutral': favorability_counts.get('Neither favorably nor unfavorably (neutral)', 0),
            'Unfavorable': favorability_counts.get('Somewhat unfavorably', 0) + favorability_counts.get('Very unfavorably', 0),
            'Unfamiliar': favorability_counts.get('Unfamiliar (N/A)', 0)
        })
        
    return pd.DataFrame(favorability_data)


def genFavorabilityVis(inpf):
    # input: inpf, the star wars dataset (as defined at top of file)
    # output: the Altair visualization as described above
    data = calculate_character_favorability(inpf)
    data_long = data.melt('Character', var_name='Favorability', value_name='Proportion')
    
    domain = ['Favorable', 'Neutral', 'Unfavorable', 'Unfamiliar']
    range_ = ['green', 'blue', 'red', 'gray']

    # Create the bars chart
    bars = alt.Chart(data_long).mark_bar(size=15).encode(
        y=alt.Y('Character:N', title='', sort='-x'),
        x=alt.X('Proportion:Q', title='', axis=alt.Axis(format='.0%')),
        color=alt.Color('Favorability:N', scale=alt.Scale(domain=domain, range=range_), title='Favorability'),
        column=alt.Column('Favorability:N', title='', header=alt.Header(labels=True))
    ).properties(
        width=100
    )
    
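    # Text labels for the bars. NOTE: `text` is not part of the returned chart;
    # Altair cannot layer a chart that already has a `column` (facet) encoding,
    # so adding labels would mean layering first and faceting afterwards.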
    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3,
        fontWeight='bold'
    ).encode(
        text=alt.Text('Proportion:Q', format='.0%')
    )
    
    return bars.properties(
        title="Character Favorability"
    )
    # YOUR CODE HERE
#     raise NotImplementedError()

In [244]:
# let's test the solution
genFavorabilityVis(sw)

You read that correctly. Jar Jar Binks has a lower favorability rating than the actual personification of evil in the galaxy.

And for those of you who want to know the impact that [historical revisionism](http://en.wikipedia.org/wiki/Han_shot_first) can have on a society:

!["Sol1"](assets/shot_first_article_resized.png)


**Homework note:** Click [here](assets/shot_first.png) to see a version of this plot generated in Altair. You may find that you don't get 834 rows (as 538 did) but the percents should still work.

### 2.4 Who shot first? Recreate the above image using Altair (10 Points)

In [245]:
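# Recreate this image using altair here (10 POINTS)

def calculate_who_shot_first(inpf):
    # NOTE: this helper is called below but was not defined in the notebook; this is an assumed
    # implementation that returns the 'Who Shot First' and 'Proportion' columns the chart expects
    # input: inpf, the star wars dataset
    # output: dataframe with the share of each answer among respondents who answered the question
    shares = inpf['Which character shot first?'].value_counts(normalize=True)
    return pd.DataFrame({'Who Shot First': shares.index, 'Proportion': shares.values})

def genFirstShotVis(inpf):
    # input: inpf, the star wars dataset
    # output: the Altair visualization as described above
    data = calculate_who_shot_first(inpf)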
    
    # Create the bars chart
    bars = alt.Chart(data).mark_bar(size=30).encode(
        y=alt.Y('Who Shot First:N', title=''),
        x=alt.X('Proportion:Q', title='', axis=alt.Axis(format='.0%')),
    ).properties(
        width=400,
        height=150
    )
    
    # Add text on the bars
    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3,
        fontWeight='bold'
    ).encode(
        text=alt.Text('Proportion:Q', format='.0%')
    )
    
    # Combine the bars with text labels
    final_chart = (bars + text).properties(
        title="Which character shot first?"
    )
    
    return final_chart

    # YOUR CODE HERE
#     raise NotImplementedError()

In [246]:
# test our solution
genFirstShotVis(sw)

### 2.5 Make your own (35 points total - 20 for implementation, 15 for justification)

Reading through the article, identify some aspect of the story that would benefit from an extra visualization (you can imagine extending the article if it helps). We would like for you to create two versions of this hypothetical visualization.  You should be able to convincingly argue that one visualization is better than the other.

Note: The two visualizations should use different marks and at least some of the encodings should be different. If you're struggling with coming up with two really different representations, your visualization may simply be too simple. Try to add additional columns/comparisons. 

In [247]:
# Extract and process the data
movie_columns = ['rank_EI', 'rank_EII', 'rank_EIII', 'rank_EIV', 'rank_EV', 'rank_EVI']
movie_names = ['Episode I', 'Episode II', 'Episode III', 'Episode IV', 'Episode V', 'Episode VI']

# Convert the rank columns to numeric
for col in movie_columns:
    sw[col] = pd.to_numeric(sw[col], errors='coerce')
    
# Calculate the average ranking
average_ranks = sw[movie_columns].mean()
average_ranks_df = pd.DataFrame({'Movie': movie_names, 'Average Rank': average_ranks.values})


In [248]:
worse_version = alt.Chart(average_ranks_df).mark_line(point=True).encode(
    x=alt.X('Movie:N', title='Star Wars Movie', sort=movie_names),
    y=alt.Y('Average Rank:Q', title='Average Rank (Lower is Better)')
).properties(
    width=300,
    height=200,
    title="Worse Version: Line Chart"
)

worse_version

In [249]:
# add your code for the "better version" here
better_version = alt.Chart(average_ranks_df).mark_bar().encode(
    x=alt.X('Movie:N', title='Star Wars Movie', sort=movie_names),
    y=alt.Y('Average Rank:Q', title='Average Rank (Lower is Better)')
).properties(
    width=300,
    height=200,
    title="Better Version: Bar Chart"
)

better_version

#### Justification
*Provide your justification here* to argue why your "better" version is actually better. Make sure to defend this based on expressiveness and effectiveness as well as perceptual arguments.

* YOUR ANSWER GOES HERE *

The bar chart is more effective in this scenario because each movie is a separate category, and bar length supports a direct, intuitive comparison of quantities. The line chart, usually ideal for trends over time, connects the movies as if the values changed continuously between them, so it conveys the nature of the data less effectively and might be misinterpreted as a time series.