## Module 5: Plots for communication

In [1]:
from vega_datasets import data
import pandas as pd
import altair as alt

import warnings
warnings.filterwarnings("ignore")

In [9]:
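# .copy() gives an independent DataFrame, so the column assignment later on
# does not operate on a slice of the full dataset
movies = data.movies().head(500).copy()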
movies.head()

Unnamed: 0,Title,US_Gross,Worldwide_Gross,US_DVD_Sales,Production_Budget,Release_Date,MPAA_Rating,Running_Time_min,Distributor,Source,Major_Genre,Creative_Type,Director,Rotten_Tomatoes_Rating,IMDB_Rating,IMDB_Votes
0,The Land Girls,146083.0,146083.0,,8000000.0,Jun 12 1998,R,,Gramercy,,,,,,6.1,1071.0
1,"First Love, Last Rites",10876.0,10876.0,,300000.0,Aug 07 1998,R,,Strand,,Drama,,,,6.9,207.0
2,I Married a Strange Person,203134.0,203134.0,,250000.0,Aug 28 1998,,,Lionsgate,,Comedy,,,,6.8,865.0
3,Let's Talk About Sex,373615.0,373615.0,,300000.0,Sep 11 1998,,,Fine Line,,Comedy,,,13.0,,
4,Slam,1009819.0,1087521.0,,1000000.0,Oct 09 1998,R,,Trimark,Original Screenplay,Drama,Contemporary Fiction,,62.0,3.4,165.0


### Filtering, number formatting, and aggregating

Let's make a quick chart to see how movie revenue has been changing over time. 

We'll put `Release_Date` on the x-axis and `US_Gross` on the y-axis. We'll start with a scatterplot, but feel free to play around with different marks like line or bar to see how they look. 

In [10]:
gross = alt.Chart(movies).mark_point().encode(
    x = alt.X('Release_Date'),
    y = alt.Y('US_Gross'),
    tooltip=[alt.Tooltip('Title')]
)

gross

What's wrong with this chart?

Take a close look and think about what issues the chart has. 

Let's also take a look at the data types of each column to see if we can get a better understanding. 

In [11]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Title                   500 non-null    object 
 1   US_Gross                495 non-null    float64
 2   Worldwide_Gross         495 non-null    float64
 3   US_DVD_Sales            3 non-null      float64
 4   Production_Budget       500 non-null    float64
 5   Release_Date            500 non-null    object 
 6   MPAA_Rating             223 non-null    object 
 7   Running_Time_min        36 non-null     float64
 8   Distributor             415 non-null    object 
 9   Source                  380 non-null    object 
 10  Major_Genre             409 non-null    object 
 11  Creative_Type           342 non-null    object 
 12  Director                282 non-null    object 
 13  Rotten_Tomatoes_Rating  314 non-null    float64
 14  IMDB_Rating             464 non-null    fl

It looks like the `Release_Date` column has a dtype of `object`, which means the dates are stored as strings. Altair therefore treats the column as a nominal (categorical) field rather than a temporal one.

Let's convert the `Release_Date` column to a datetime dtype with `pd.to_datetime`.

In [12]:
movies['Release_Date'] = pd.to_datetime(movies['Release_Date'])
movies.head()

Unnamed: 0,Title,US_Gross,Worldwide_Gross,US_DVD_Sales,Production_Budget,Release_Date,MPAA_Rating,Running_Time_min,Distributor,Source,Major_Genre,Creative_Type,Director,Rotten_Tomatoes_Rating,IMDB_Rating,IMDB_Votes
0,The Land Girls,146083.0,146083.0,,8000000.0,1998-06-12,R,,Gramercy,,,,,,6.1,1071.0
1,"First Love, Last Rites",10876.0,10876.0,,300000.0,1998-08-07,R,,Strand,,Drama,,,,6.9,207.0
2,I Married a Strange Person,203134.0,203134.0,,250000.0,1998-08-28,,,Lionsgate,,Comedy,,,,6.8,865.0
3,Let's Talk About Sex,373615.0,373615.0,,300000.0,1998-09-11,,,Fine Line,,Comedy,,,13.0,,
4,Slam,1009819.0,1087521.0,,1000000.0,1998-10-09,R,,Trimark,Original Screenplay,Drama,Contemporary Fiction,,62.0,3.4,165.0


In [13]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Title                   500 non-null    object        
 1   US_Gross                495 non-null    float64       
 2   Worldwide_Gross         495 non-null    float64       
 3   US_DVD_Sales            3 non-null      float64       
 4   Production_Budget       500 non-null    float64       
 5   Release_Date            500 non-null    datetime64[ns]
 6   MPAA_Rating             223 non-null    object        
 7   Running_Time_min        36 non-null     float64       
 8   Distributor             415 non-null    object        
 9   Source                  380 non-null    object        
 10  Major_Genre             409 non-null    object        
 11  Creative_Type           342 non-null    object        
 12  Director                282 non-null    object    

Now our `Release_Date` column is in the correct format. 

Use the next cell to try an alternate method of getting the `Release_Date` column treated as datetime data.

Let's display our chart again.

In [14]:
gross

Is this better? Are there some underlying problems with the data?

Even though this dataset contains movies that were released up to 2010, there are some movies showing release dates far into the future. 

This was difficult to see in our first chart but now it is very obvious.

Let's filter the data to exclude the erroneous release dates, keeping only movies released in 2010 or earlier.

In [15]:
movies_filtered = movies[movies['Release_Date'].dt.year <= 2010]
movies_filtered.head()

Unnamed: 0,Title,US_Gross,Worldwide_Gross,US_DVD_Sales,Production_Budget,Release_Date,MPAA_Rating,Running_Time_min,Distributor,Source,Major_Genre,Creative_Type,Director,Rotten_Tomatoes_Rating,IMDB_Rating,IMDB_Votes
0,The Land Girls,146083.0,146083.0,,8000000.0,1998-06-12,R,,Gramercy,,,,,,6.1,1071.0
1,"First Love, Last Rites",10876.0,10876.0,,300000.0,1998-08-07,R,,Strand,,Drama,,,,6.9,207.0
2,I Married a Strange Person,203134.0,203134.0,,250000.0,1998-08-28,,,Lionsgate,,Comedy,,,,6.8,865.0
3,Let's Talk About Sex,373615.0,373615.0,,300000.0,1998-09-11,,,Fine Line,,Comedy,,,13.0,,
4,Slam,1009819.0,1087521.0,,1000000.0,1998-10-09,R,,Trimark,Original Screenplay,Drama,Contemporary Fiction,,62.0,3.4,165.0
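
Before discarding rows, it can help to look at what the filter removes. The sketch below uses a small made-up DataFrame (the titles and dates are hypothetical, not taken from the dataset) to show the same boolean-mask technique, keeping the rejected rows for inspection.

```python
import pandas as pd

# Hypothetical rows: two plausible release dates and one far-future date.
toy = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Release_Date': pd.to_datetime(['1998-06-12', '2009-11-20', '2046-12-31']),
})

# Boolean mask: True where the release year is plausible.
is_plausible = toy['Release_Date'].dt.year <= 2010

suspect = toy[~is_plausible]      # rows the filter will drop
kept = toy[is_plausible].copy()   # independent copy of the rows we keep

print(suspect)
print(len(kept), 'rows kept')
```

Applied to the real data, the `suspect` frame would list exactly the movies whose dates were parsed into the future.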


And let's make the chart again.

In [16]:
gross = alt.Chart(movies_filtered).mark_point().encode(
    x = alt.X('Release_Date'),
    y = alt.Y('US_Gross')
)

gross

Next let's aggregate the data so we can see trends a bit better. First let's switch to a bar chart.

In [17]:
gross = alt.Chart(movies_filtered).mark_bar().encode(
    x = alt.X('Release_Date'),
    y = alt.Y('US_Gross')
).properties(width=800)

gross

This is not ideal: it shows the same information as the point plot, but as one unaggregated bar per movie, which can be misleading. What's the best way to aggregate this data in Altair?

Let's start with the time data, grouping the release dates by year with the `year()` time unit.

In [18]:
gross_by_year = alt.Chart(movies_filtered).mark_bar().encode(
    x = alt.X('year(Release_Date)'),
    y = alt.Y('US_Gross')
).properties(width=800)

gross_by_year

The bars look better, but we have a new issue: there are lots of thin white lines inside the bars. What's causing this?

There is still one bar for each movie; the bars are just smaller and stacked on top of each other. We can verify this with a tooltip.

In [19]:
gross_with_tt = alt.Chart(movies_filtered).mark_bar().encode(
    x = alt.X('year(Release_Date)'),
    y = alt.Y('US_Gross'),
    tooltip=['US_Gross', 'Release_Date'],
).properties(width=800)

gross_with_tt

Next, let's aggregate the `US_Gross` data using sum.

In [20]:
gross_sum = alt.Chart(movies_filtered).mark_bar().encode(
    x = alt.X('year(Release_Date)'),
    y = alt.Y('sum(US_Gross)'),
    tooltip=['sum(US_Gross)', 'year(Release_Date)'],
).properties(width=800)

gross_sum
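
A quick way to sanity-check the heights of the summed bars is to compute the same totals in pandas. This is only a sketch on a made-up three-row frame; the `count` column also shows how many per-movie bars were being stacked in each year before aggregating.

```python
import pandas as pd

# Hypothetical rows standing in for movies_filtered.
toy = pd.DataFrame({
    'Release_Date': pd.to_datetime(['1998-06-12', '1998-08-07', '1999-03-05']),
    'US_Gross': [146083.0, 10876.0, 500000.0],
})

# Group by release year, then count the movies and sum their revenue.
yearly = (
    toy.groupby(toy['Release_Date'].dt.year)['US_Gross']
       .agg(['count', 'sum'])
)

print(yearly)
```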

Try playing around with different time units, such as `yearmonth` or `month`, and with different aggregations, such as `count` or `max`.

In [21]:
agg_chart = alt.Chart(movies_filtered).mark_bar().encode(
    x = alt.X('month(Release_Date)'),
    y = alt.Y('max(US_Gross)'),
).properties(width=800)

agg_chart

Other available aggregations include `min`, `max`, `stdev`, and `variance`.

See the Altair documentation for the full list: https://altair-viz.github.io/user_guide/encodings/index.html#encoding-aggregates

### Formatting numbers

In all our charts above, even though we know that movie revenues are measured in dollars, we did not have a dollar sign on the y-axis. It's best to make sure our axes are showing appropriate units.

To do this, we can use the `format` argument in `alt.Axis`, which takes a d3-format string. In this case, we want large values abbreviated with an SI prefix (for example `M` for millions) and a dollar sign in front, so we use `'$s'`.

In [22]:
gross = alt.Chart(movies_filtered).mark_point().encode(
    x = alt.X('Release_Date'),
    y = alt.Y('US_Gross', 
              axis=alt.Axis(format='$s')
             )
)

gross

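Python's own string formatting uses similar-looking codes: string (`%s`), decimal (`%d`), float (`%f`), generic number (`%g`), and integer (`%i`).

The most common are string, float, and integer: a string is text, a float is a number with decimals, and an integer is a number without decimals. Note that these Python codes are not identical to the ones Altair's `format` accepts; there, `s` stands for an SI prefix rather than a string.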

In [23]:
number = 4.5
print('%s %f %i' % (number, number, number))

#(string, float (six decimal places), int)
# Example from: https://stackoverflow.com/questions/4288973/whats-the-difference-between-s-and-d-in-string-formatting

4.5 4.500000 4
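
The same output can be produced with f-strings, which is the more common style in current Python code. The sketch below also shows two format specs that are handy for currency labels; the variable names are illustrative only.

```python
number = 4.5

# f-string equivalents of '%s %f %i'. int() is needed because the 'd'
# format code rejects floats, whereas '%i' truncates them silently.
as_fstring = f'{number!s} {number:f} {int(number):d}'
print(as_fstring)  # 4.5 4.500000 4

# Format specs that are useful for currency values.
us_gross = 1009819.0
with_commas = f'${us_gross:,.0f}'         # thousands separators, no decimals
in_millions = f'${us_gross / 1e6:.2f}M'   # scaled to millions by hand
print(with_commas, in_millions)  # $1,009,819 $1.01M
```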
