## Altair Exercises

This notebook explores several visualizations in Altair.

______

### Part 6

The following exercise is based on the 538 article [here](https://fivethirtyeight.com/features/women-in-comic-books/). What you should know is that there are two major comic book companies: DC (Batman, Superman, Wonder Woman, etc.) and Marvel (Black Widow, Iron Man, Hulk, etc.). 

We have a dataset of characters, their sex, when they were introduced, whether their identity is secret, their eye and hair color, the number of appearances, and so on. That gives us many dimensions on which to build our visualizations.

In [1]:
import pandas as pd
import numpy as np
import altair as alt

In [2]:
# enable correct rendering
alt.renderers.enable('default')

# uses intermediate json files to speed things up
alt.data_transformers.enable('json')

# use the 538 theme
alt.themes.enable('fivethirtyeight')

ThemeRegistry.enable('fivethirtyeight')

In [3]:
# load up the two datasets, one for Marvel and one for DC
dc = pd.read_csv('../assets/dc-wikia-data.csv')
marvel = pd.read_csv('../assets/marvel-wikia-data.csv')

In [4]:
# Some pre-processing

# Add columns
dc['publisher'] = 'DC'
marvel['publisher'] = 'Marvel'

# rename some columns
marvel.rename(columns={'Year': 'YEAR'}, inplace=True)

# create the table with everything
comic = pd.concat([dc, marvel])

# drop years with na values
comic.dropna(subset=['YEAR'], inplace=True)

comic.sample(5)

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR,publisher
1493,15198,Frederick von Frankenstein (New Earth),\/wiki\/Frederick_von_Frankenstein_(New_Earth),Public Identity,Bad Characters,Red Eyes,Black Hair,Male Characters,,Living Characters,17.0,"1996, October",1996.0,DC
470,1966,David Haller (Earth-616),\/David_Haller_(Earth-616),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,87.0,Mar-85,1985.0,Marvel
3486,125009,Ray Lippert (Earth-616),\/Ray_Lippert_(Earth-616),Public Identity,Good Characters,,Brown Hair,Male Characters,,Living Characters,9.0,Dec-89,1989.0,Marvel
1342,329884,Marcy Pearson (Earth-616),\/Marcy_Pearson_(Earth-616),Public Identity,Neutral Characters,Brown Eyes,Brown Hair,Female Characters,,Living Characters,28.0,Apr-87,1987.0,Marvel
6525,23576,Frederick Devere (New Earth),\/wiki\/Frederick_Devere_(New_Earth),,Bad Characters,,,Male Characters,,Living Characters,1.0,"1941, June",1941.0,DC
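
The pre-processing above can be sketched on two small made-up frames to show why the rename matters and what `dropna` removes. The frames `dc_toy` and `marvel_toy` and their values are hypothetical, not taken from the dataset. The sketch also passes `ignore_index=True`, which the notebook does not do, so the stacked rows get unique index labels.

```python
import pandas as pd

# Hypothetical toy frames; the Marvel file names its year column 'Year'
dc_toy = pd.DataFrame({"name": ["A", "B"], "YEAR": [1990.0, None]})
marvel_toy = pd.DataFrame({"name": ["C", "D"], "Year": [1985.0, 2001.0]})

# Tag each frame with its publisher before stacking
dc_toy["publisher"] = "DC"
marvel_toy["publisher"] = "Marvel"

# Align the column names; without this, concat would keep both 'YEAR' and 'Year'
marvel_toy = marvel_toy.rename(columns={"Year": "YEAR"})

# Stack the rows; ignore_index=True renumbers them 0..n-1 (an addition in this sketch)
stacked = pd.concat([dc_toy, marvel_toy], ignore_index=True)

# Drop the rows that have no year
clean = stacked.dropna(subset=["YEAR"])
print(clean)
```

Without `ignore_index=True`, each frame keeps its own row labels, so the same label can appear once for DC and once for Marvel in the combined table.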


### Comic Books Are Still Made By Men, For Men And About Men

_Original article available at [FiveThirtyEight](https://fivethirtyeight.com/features/women-in-comic-books/)_

By [Walt Hickey](https://fivethirtyeight.com/contributors/walt-hickey/)

Get the data on [GitHub](https://github.com/fivethirtyeight/data/tree/master/comic-characters)

We are going to be revising and adding to the visualizations for this article. While they're nice, we think we can do better by adding some interactivity.

_____

#### New Comic Book Characters Introduced per Year

We'd like to build an interactive visualization that lets us compare how many characters each publisher introduced over time. The top two charts will show the total number of characters introduced each year (as bar charts). The bottom two will be line charts with separate lines for female and male characters.

In [5]:
# let's pre-process the data. For the moment we keep only Female and Male characters,
# and only characters introduced in 1940 or later
comic_ch1_df = comic[(comic['YEAR'] >= 1940) & (comic['SEX'].isin(['Female Characters', 'Male Characters']))]

In [6]:
comic_ch1_df.sample(5)

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR,publisher
10087,29668,Amina Synge (Earth-616),\/Amina_Synge_(Earth-616),Secret Identity,Neutral Characters,Brown Eyes,Black Hair,Female Characters,,Living Characters,2.0,Aug-06,2006.0,Marvel
6272,344864,Mazita (New Earth),\/wiki\/Mazita_(New_Earth),,Good Characters,,,Female Characters,,Living Characters,1.0,"1989, June",1989.0,DC
624,286426,Norah Winters (Earth-616),\/Norah_Winters_(Earth-616),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Female Characters,,Living Characters,62.0,Dec-08,2008.0,Marvel
3498,686334,Guy Cross-Wallace (Earth-616),\/Guy_Cross-Wallace_(Earth-616),Public Identity,,Brown Eyes,Black Hair,Male Characters,,Deceased Characters,9.0,Jan-91,1991.0,Marvel
1335,4802,David Hersch (New Earth),\/wiki\/David_Hersch_(New_Earth),Public Identity,Bad Characters,Blue Eyes,,Male Characters,,Living Characters,19.0,"2001, March",2001.0,DC


In [7]:
# use the filtered frame from In [5] (1940 onward, Female and Male characters only)
p1_bar_base = alt.Chart(comic_ch1_df).mark_bar(size=2.5).encode(
    alt.Y('count():Q', 
          axis=alt.Axis(values=[0, 100, 200, 300, 400, 500], 
                        title=None,
                        labelFontWeight="bold",
                        labelFontSize=15),
          scale=alt.Scale(domain=[0, 500]))).properties(
                 width=240,
                 height=300
)



# let's create the bar chart for DC. We'll take the "base" chart
bar_dc = p1_bar_base.encode(alt.X('YEAR:N',  # create the X axis based on year and fix the look of the axes
                               axis=alt.Axis(values=[1940, 1960, 1980, 2000], labels=True, ticks=False,grid=True,
                                             title="DC, New Earth continuity",
                                             titlePadding=-347, 
                                             labelAngle=360,
                                             labelFontWeight="bold",
                                             labelFontSize=15,)),
        ).transform_filter(
            # we will use Altair's filter to only keep DC for this chart
            alt.datum.publisher == 'DC'
        )

# let's do the same thing for marvel
bar_marvel = p1_bar_base.mark_bar(color='#f6573f').encode(alt.X('YEAR:N', # create the X axis based on year 
                            # fix the look of the axes
                           axis=alt.Axis(values=[1940, 1960, 1980, 2000], labels=True, ticks=False,grid=True,
                                         title="Marvel, Earth-616 continuity",
                                         titlePadding=-347,
                                         labelAngle=360,
                                         labelFontWeight="bold",
                                         labelFontSize=15)),
        ).transform_filter(
            # we will use Altair's filter to only keep Marvel for this chart
            alt.datum.publisher == 'Marvel'
        )



# let's create a new "base" chart for the two line charts. We'll take the bar chart base above
# and modify it to use a line chart
p1_line_base = p1_bar_base.mark_line().encode(
     # the X axis will be year
     alt.X('YEAR:N'),
     # the Y axis will be the count (the number of points that year)
     alt.Y('count():Q', axis=alt.Axis(grid=False, 
                                    labelFontWeight="bold",
                                    labelFontSize=15, 
                                    title=None)),
     # let's split the data and color by SEX
     alt.Color('SEX', 
              scale = alt.Scale(domain=['Female Characters', 'Male Characters'], range=['#31a354', '#ce6dbd']),
              legend=alt.Legend(orient="bottom"))
    ).properties(
                width=240, height=80
     )


line_dc = p1_line_base.encode(alt.X('YEAR:N',
                                       axis=alt.Axis(values=[1940, 1960, 1980, 2000], 
                                                              grid=True, 
                                                              labelAngle=360,
                                                              labelFontWeight="bold",
                                                              labelFontSize=15,
                                                              title = 'DC, Female and Male characters over time',
                                                              titlePadding=-130,
                                                              titleFontSize = 12
                                                             )
                                      )
            ).transform_filter(
                # this is the DC line chart, so we only want DC
                alt.datum.publisher == 'DC'
            )



line_marvel = p1_line_base.encode(alt.X('YEAR:N', 
                                    axis=alt.Axis(values=[1940, 1960, 1980, 2000], 
                                                          grid=True, 
                                                          labelAngle=360,
                                                          labelFontWeight="bold",
                                                          labelFontSize=15,
                                                          title = 'Marvel, Female and Male characters over time',
                                                          titlePadding=-130,
                                                          titleFontSize = 12
                                                         )
                                    )
            ).transform_filter(
                # this is the Marvel line chart, so we only want Marvel
                alt.datum.publisher == 'Marvel'
            )



# let's put everything together
# top piece 
top_charts = alt.hconcat(bar_dc,bar_marvel).resolve_scale(y='shared'
           ).properties(
                    title='New Comic Book Characters Introduced Per Year'
           )

# bottom piece
bottom_charts = alt.hconcat(line_dc,line_marvel).resolve_scale(y='shared')

alt.vconcat(top_charts,bottom_charts).configure_view(
    strokeWidth=0
)
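
As a cross-check on what the `count()` aggregate draws, the same numbers can be computed directly in pandas. The sketch below uses a small made-up frame (`toy` and its values are hypothetical): the bar charts correspond to one count per publisher and year, and the line charts to one count per publisher, year, and sex.

```python
import pandas as pd

# Hypothetical toy data: one row per character
toy = pd.DataFrame({
    "publisher": ["DC", "DC", "DC", "Marvel", "Marvel"],
    "YEAR": [1990, 1990, 1991, 1990, 1991],
    "SEX": ["Female Characters", "Male Characters", "Male Characters",
            "Male Characters", "Female Characters"],
})

# Bars: number of characters introduced per publisher and year
per_year = toy.groupby(["publisher", "YEAR"]).size()

# Lines: the same count, split further by sex
per_year_sex = toy.groupby(["publisher", "YEAR", "SEX"]).size()

print(per_year)
print(per_year_sex)
```

Running the same `groupby` on the real data frame is one way to confirm that a bar or line in the chart has the height you expect.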

Use the code below to create a "brush" object (a "selection" in Altair speak) that will let us select a time range. We will then create a condition for the DC chart, and add both the condition and selection to the DC chart. We will then repeat with Marvel. This will create interactivity with the chart.

In [8]:
## DC
# Create brush object
brush = alt.selection_interval(encodings=['x'])

# Create condition - DC
colorConditionDC = alt.condition(brush,alt.value("#2182bd"),alt.value("gray"))

# Add condition and selection - DC
i_bar_dc = bar_dc.encode(
    color=colorConditionDC
).add_selection(
    brush
)

# Create condition - Marvel
colorConditionMarvel = alt.condition(brush,alt.value("#f6573f"),alt.value("gray"))

# Add condition and selection - Marvel
i_bar_marvel = bar_marvel.encode(
    color=colorConditionMarvel
).add_selection(
    brush
)

# top piece 
top_charts = alt.hconcat(i_bar_dc,i_bar_marvel).resolve_scale(y='shared'
           ).properties(
                    title='New Comic Book Characters Introduced Per Year'
           )

Now we modify the two line charts so that they show only the years selected by the brush.

In [9]:
i_line_dc = line_dc.add_selection(
    brush
).transform_filter(
    brush
)

i_line_marvel = line_marvel.add_selection(
    brush
).transform_filter(
    brush
)

# bottom piece
bottom_charts = alt.hconcat(i_line_dc,i_line_marvel).resolve_scale(y='shared')

In [10]:
# let's put everything together with the new interactive charts.

# bottom piece
bottom_charts = alt.hconcat(i_line_dc,i_line_marvel).resolve_scale(y='shared')

alt.vconcat(top_charts,bottom_charts).configure_view(
    strokeWidth=0
)

______

#### Comics Aren't Gaining Many Female Characters

This chart presents a single quantity of interest: the percentage of new characters who are female in a given year. To help us evaluate claims about how this percentage is trending, we can also plot its year-over-year percent change. It is also possible that more characters are being introduced in later years, so even one or two good years in the 2000s could make up for many bad years in the past. The cumulative percentage of female characters to date lets us check this (it turns out that this is not the case, but it is a question we might ask).

In [11]:
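def generatePercentTable(publisher):
    # one-hot encode SEX and count characters per year for this publisher
    _df = comic[comic.publisher == publisher]
    _df = _df[['SEX','YEAR']]
    _df = pd.get_dummies(_df)
    _df.YEAR = _df.YEAR.astype('int')
    _df = _df.groupby(['YEAR']).sum()

    # total = sum over every SEX category present (rows with a missing SEX are not counted)
    _df['total'] = 0
    _df['total'] = _df['total'].astype('int')
    for col in list(comic[comic.publisher == publisher].SEX.unique()):
        col = str(col)
        if (col != 'nan'):
            _df['total'] = _df['total'].astype('int') + _df["SEX_"+col].astype('int')

    # share of each year's new characters that are female
    _df['% Female'] = _df['SEX_Female Characters'] / _df.total
    _df = _df.reset_index()
    _df = _df[['YEAR','% Female','SEX_Female Characters','SEX_Male Characters','total']]
    _df['publisher'] = publisher
    # keep 1979 onward; the first kept row has no previous row, so its change is NaN
    _df = _df[(_df.YEAR >= 1979)]
    # relative (not percentage-point) change from the previous row
    _df['Year-over-year change in % Female'] = _df['% Female'].pct_change()
    toret = _df[(_df.YEAR > 1980) & (_df.YEAR < 2013)].copy()
    # running totals over the kept years only; restrict to the numeric columns
    # so that the string 'publisher' column is not accumulated
    t2 = toret[['SEX_Female Characters','total']].cumsum()
    toret['% Female characters to date'] = list(t2['SEX_Female Characters'] / t2['total'])
    return toret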

changedata = pd.concat([generatePercentTable("Marvel"),generatePercentTable("DC")])

changedata = pd.melt(changedata,id_vars=['YEAR','publisher'],value_vars=['% Female',
                                                             'Year-over-year change in % Female',
                                                             '% Female characters to date'])

In [12]:
changedata.sample(5)

Unnamed: 0,YEAR,publisher,variable,value
3,1984,Marvel,% Female,0.297561
152,2005,Marvel,% Female characters to date,0.28917
30,2011,Marvel,% Female,0.295522
119,2004,DC,Year-over-year change in % Female,-0.021752
91,2008,Marvel,Year-over-year change in % Female,0.039476
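
To make the three metrics concrete, here is a small worked example on made-up counts (the frame `counts` and its numbers are hypothetical). It mirrors the arithmetic in `generatePercentTable` without the one-hot encoding step.

```python
import pandas as pd

# Hypothetical yearly counts of new characters
counts = pd.DataFrame({
    "YEAR": [2000, 2001, 2002],
    "female": [20, 30, 30],
    "total": [100, 100, 120],
})

# Share of each year's new characters that are female: 0.20, 0.30, 0.25
counts["% Female"] = counts["female"] / counts["total"]

# Relative change from the previous year: NaN, then (0.30 - 0.20) / 0.20 = 0.5, ...
counts["Year-over-year change in % Female"] = counts["% Female"].pct_change()

# Running share: cumulative female count over cumulative total: 0.20, 0.25, 0.25
counts["% Female characters to date"] = (
    counts["female"].cumsum() / counts["total"].cumsum()
)

print(counts)
```

Note that `pct_change` reports a relative change, so a move from 20% to 30% shows up as 0.5 rather than as 10 percentage points.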


The first job will be to create an interactive chart that has a drop-down box that allows us to select the variable of interest.

In [13]:
def generateLineChartP21():
    
    metricOptions = ['% Female','Year-over-year change in % Female','% Female characters to date']
    input_dropdown = alt.binding_select(options=metricOptions)
    
    dropdown_selection = alt.selection_single(fields=['variable'], bind=input_dropdown, name='Data_')
    
    line = alt.Chart(changedata).mark_line().encode(
        x=alt.X('YEAR:N'), 
        y=alt.Y("sum(value)", title='value'), 
        color='publisher'
    ).add_selection(
        dropdown_selection
    ).transform_filter(
        dropdown_selection
    ).properties(width=750, height=300)
    
    return(line)

In [14]:
generateLineChartP21()

This is pretty static, so let's add some annotations and interactivity.

In [15]:
def generateLineChartP22():
    
    metricOptions = ['% Female','Year-over-year change in % Female','% Female characters to date']
    input_dropdown = alt.binding_select(options=metricOptions)
    
    dropdown_selection = alt.selection_single(fields=['variable'], bind=input_dropdown, name='Data_')
    
    nearest = alt.selection(type='single',nearest=True, on='mouseover',
                            fields=['YEAR'],empty='none')
    
    line2 = alt.Chart(changedata).mark_line().encode(
        x=alt.X('YEAR:N'), 
        y=alt.Y("sum(value):Q", title='value'), 
        color='publisher:N'
    )
    
    selectors = alt.Chart(changedata).mark_point().encode(
        x=alt.X('YEAR:N',axis=alt.Axis(labels=True)),
        opacity=alt.value(0)
    ).add_selection(
        nearest
    )
    
    points = line2.mark_point().encode(
        opacity=alt.condition(
            nearest, 
            alt.value(1), 
            alt.value(0)
        )
    )
    
    text = line2.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(
            nearest, 
            "sum(value):Q", 
            alt.value(' '),
            format='.0%'
        )
    )
    
    rules = alt.Chart(changedata).mark_rule(color='gray').encode(
        x=alt.X('YEAR:N',),
    ).transform_filter(
        nearest
    )
    
    final = alt.layer(line2, selectors, points, rules, text).add_selection(
        dropdown_selection
    ).transform_filter(
        dropdown_selection
    ).properties(width=750,height=300)
    
    return(final)

In [16]:
generateLineChartP22()

______________________
<div style="text-align: right"><sub>Exercise adapted and modified from UMSI homework assignment for SIADS 622.</sub></div>