In [45]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Reading data
The book data used in this notebook have been extracted from [Goodreads](https://www.goodreads.com/) by webscraping with a modified version of the `goodreads_scraper` code by Maria Antoniak and Melanie Walsh publicly available at: https://github.com/maria-antoniak/goodreads-scraper.

The most relevant variables of the book dataframes we are going to deal with are:
* `title`: Title of the book.
* `series_name`: Name of the saga/series.
* `series_n`: Order of the book within the saga/series.
* `num_ratings`: Total number of ratings of the book in Goodreads.
* `average_rating`: Average rating of the book on Goodreads. The range is 1-5.
* `hist_rating`: Cumulative weighted average rating of the saga up to and including the book in question, where each book's average rating is weighted by its number of ratings. For example, the `hist_rating` of the 3rd book in a saga of 7 is the average rating of books 1, 2 and 3, weighted by the number of people who rated each of them. This variable is only in the `sanderson_sagas_df` dataframe.
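
The definition of `hist_rating` above can be sketched with pandas as a cumulative weighted average. This is only an illustration on invented toy data (`toy_df` and its values are hypothetical), not necessarily how the column in the CSV file was produced:

```python
import pandas as pd

# Hypothetical toy data: one three-book saga (values invented for illustration)
toy_df = pd.DataFrame({
    "series_name": ["Toy Saga"] * 3,
    "series_n": [1, 2, 3],
    "average_rating": [4.0, 3.0, 5.0],
    "num_ratings": [100, 100, 200],
})
toy_df = toy_df.sort_values(["series_name", "series_n"])

# Running sum of (rating * weight) and of the weights, within each saga
weighted = toy_df["average_rating"] * toy_df["num_ratings"]
cum_weighted = weighted.groupby(toy_df["series_name"]).cumsum()
cum_weights = toy_df.groupby("series_name")["num_ratings"].cumsum()

# Cumulative weighted average up to and including each book
toy_df["hist_rating"] = cum_weighted / cum_weights
print(toy_df[["series_n", "average_rating", "num_ratings", "hist_rating"]])
```

For the third book the result is (4.0·100 + 3.0·100 + 5.0·200) / 400 = 4.25.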


In [48]:
books_df_scatter = pd.read_csv('books_df_scatter.csv')
books_df_scatter.head()

Unnamed: 0,title,series_name,series_n,year_first_published,author_name,num_pages,num_ratings,num_reviews,average_rating,main_genre,saga
0,Harry Potter and the Sorcerer's Stone,Harry Potter,1.0,2003.0,J.K. Rowling,309.0,8491079,134158,4.48,Fantasy,Saga (first book)
1,Harry Potter and the Chamber of Secrets,Harry Potter,2.0,1999.0,J.K. Rowling,341.0,3277548,64524,4.43,Fantasy,Saga (continuation)
2,Harry Potter and the Prisoner of Azkaban,Harry Potter,3.0,2004.0,J.K. Rowling,435.0,3439727,67846,4.58,Fantasy,Saga (continuation)
3,Harry Potter and the Goblet of Fire,Harry Potter,4.0,2002.0,J.K. Rowling,734.0,3057818,56382,4.57,Fantasy,Saga (continuation)
4,Harry Potter and the Order of the Phoenix,Harry Potter,5.0,2004.0,J.K. Rowling,870.0,2927014,51572,4.5,Fantasy,Saga (continuation)


# Plots

### Example 1: Basic structure. Scatterplot + changing appearance + adding lines and annotations

In [49]:
fig_scatter_simple = px.scatter(books_df_scatter, x="num_ratings", y="average_rating", height=500, width=800)
fig_scatter_simple.show()

In [50]:
print(type(fig_scatter_simple))
print(fig_scatter_simple)


<class 'plotly.graph_objs._figure.Figure'>
Figure({
    'data': [{'hovertemplate': 'num_ratings=%{x}<br>average_rating=%{y}<extra></extra>',
              'legendgroup': '',
              'marker': {'color': '#636efa', 'symbol': 'circle'},
              'mode': 'markers',
              'name': '',
              'orientation': 'v',
              'showlegend': False,
              'type': 'scatter',
              'x': array([8491079, 3277548, 3439727, ...,   15518,  285067,   16896]),
              'xaxis': 'x',
              'y': array([4.48, 4.43, 4.58, ..., 3.64, 4.46, 3.86]),
              'yaxis': 'y'}],
    'layout': {'height': 500,
               'legend': {'tracegroupgap': 0},
               'margin': {'t': 60},
               'template': '...',
               'width': 800,
               'xaxis': {'anchor': 'y', 'domain': [0.0, 1.0], 'title': {'text': 'num_ratings'}},
               'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0], 'title': {'text': 'average_rating'}}}
})


The object type of the plot is `plotly.graph_objs._figure.Figure`. In practice it behaves much like a dictionary (`dict`) with two key-value pairs:
* `data`: the quantitative data (e.g. values on the x-axis) and related qualitative data (e.g. group/color).
* `layout`: appearance information: colors/palette, text, limits, etc.

`hovertemplate` is in data instead of layout (it could be considered a part of the appearance) because it may include other data from the dataframe that does not appear in the plot axes.

Internally, a `go.Figure` is organized as nested dictionaries:
* The value of `data` is a list of dictionaries, one per trace.
* The value of `layout` is a single dictionary.

More information: https://plotly.com/python/creating-and-updating-figures/

In [51]:
fig_scatter = px.scatter(
    books_df_scatter, x="num_ratings", y="average_rating", color="saga", 
    hover_data=["title", "series_name", "series_n"],
    height=500, width=800
    )
fig_scatter.show()

In [52]:
# len = 3: each color group is a 'trace'
print(len(fig_scatter['data']))


3


In the next cell we use the `pio` module to access the templates and register a new one. Templates are sets of predefined layout settings (e.g. background color, borders, color palette for the traces). Several built-in templates are available (e.g. `'plotly_white'`), but we can also create our own with `go.layout.Template`. The `borders` template defined below draws a border around the plot.

Templates can be combined using `'+'`. In the next plots we specify the combination `'plotly_white+borders'` as the template.

In [53]:
import plotly.io as pio
pio.templates["borders"] = go.layout.Template(
    layout = dict(
        xaxis = dict(
            mirror=True,
            ticks='outside',
            showline=True,
            linecolor='black'
        ),
        yaxis = dict(
            mirror=True,
            ticks='outside',
            showline=True,
            linecolor='black'
        )
    )
)

In the next cell we modify the appearance of `fig_scatter` with `.update_layout()`. When we make these modifications it is **not** necessary to reassign the result to the `fig_scatter` object.
For layout modifications there are several ways to access the elements we are interested in. Besides passing nested dictionaries, we can chain successive property names with `_`. That is, `xaxis={'title':'Number of ratings'}` is equivalent to `xaxis_title='Number of ratings'`.

For more information on this: https://plotly.com/python/reference/index/ 

In [54]:
fig_scatter.update_layout(
    # https://plotly.com/python/templates/ 
    template = 'plotly_white+borders', 
    # https://plotly.com/python/reference/layout/xaxis/
    xaxis = {
        'title':'Number of ratings',
        'range':[0, 4000000]
    }, 
    # https://plotly.com/python/reference/layout/yaxis/
    yaxis = {
        'title':'Average rating',
        'range':[1,5]
    },
    # https://plotly.com/python/reference/layout/#layout-legend
    legend = {
        'title': None,
        'x': 1.02,
        'y': 0.5
    }
)
fig_scatter.show()

With `customdata` we access the variables that we specified in `hover_data` when creating the chart, in the same order in which we specified them. In our case, these variables were: `['title', 'series_name', 'series_n']`. Inside `hovertemplate`, a variable is referenced by wrapping it in `%{...}`, for example `'%{x}'` or `'%{customdata[hoverdata_list_index]}'`.

In [55]:
fig_scatter.update_traces(
    # https://plotly.com/python/hover-text-and-formatting/
    hovertemplate = '<b>%{customdata[0]}</b> <br>%{customdata[1]} %{customdata[2]}'
)

In the following cells we calculate the average rating for each group/color. After that, we add this value as a horizontal line that allows us to compare the three groups.

In [56]:
books_df_scatter_means = books_df_scatter.groupby('saga')['average_rating'].mean()
books_df_scatter_means

saga
Saga (continuation)    4.193780
Saga (first book)      4.141667
Standalone             3.955000
Name: average_rating, dtype: float64

In [57]:
# Add one horizontal line per trace, at the mean rating of its group,
# using the same color as the markers of that trace
for trace in fig_scatter.data:
    fig_scatter.add_hline(
        y=books_df_scatter_means[trace.legendgroup],
        line_color=trace.marker.color
    )

fig_scatter.show()

### Example 2: Faceted barplot + adding traces

#### Example 2.1: Only one saga (only one plot)

In [58]:
sanderson_sagas_df = pd.read_csv('sanderson_sagas_df.csv')
sanderson_sagas_df.head()

Unnamed: 0,book_id_long,book_id_short,url,title,series_name,series_url,series_n,isbn,year_first_published,author_name,...,average_rating,main_genre,rating_5_starts,rating_4_starts,rating_3_starts,rating_2_starts,rating_1_start,genres,rating_distribution,hist_rating
0,68428.The_Final_Empire,68428,https://www.goodreads.com/book/show/68428.The_...,The Final Empire,The Mistborn Saga,https://www.goodreads.com/series/40910-the-mis...,1,9780765311788.0,2006.0,Brandon Sanderson,...,4.46,Fantasy,300783.0,153806.0,39647.0,8207.0,4258.0,,,4.46
1,68429.The_Well_of_Ascension,68429,https://www.goodreads.com/book/show/68429.The_...,The Well of Ascension,The Mistborn Saga,https://www.goodreads.com/series/40910-the-mis...,2,,2007.0,Brandon Sanderson,...,4.37,Fantasy,181540.0,126733.0,35412.0,5726.0,1837.0,,,4.423154
2,2767793-the-hero-of-ages,2767793,https://www.goodreads.com/book/show/2767793-th...,The Hero of Ages,The Mistborn Saga,https://www.goodreads.com/series/40910-the-mis...,3,,2008.0,Brandon Sanderson,...,4.5,Fantasy,201257.0,91677.0,24761.0,4455.0,1770.0,,,4.444215
3,10803121-the-alloy-of-law,10803121,https://www.goodreads.com/book/show/10803121-t...,The Alloy of Law,The Mistborn Saga,https://www.goodreads.com/series/40910-the-mis...,4,,2011.0,Brandon Sanderson,...,4.21,Fantasy,66578.0,70774.0,23860.0,3191.0,790.0,,,4.415493
4,16065004-shadows-of-self,16065004,https://www.goodreads.com/book/show/16065004-s...,Shadows of Self,The Mistborn Saga,https://www.goodreads.com/series/40910-the-mis...,5,,2015.0,Brandon Sanderson,...,4.29,Fantasy,46918.0,44566.0,12316.0,1461.0,471.0,,,4.40636


In [59]:
fig_stormlight = px.bar(
    sanderson_sagas_df.query('series_name == "The Stormlight Archive"'), 
    x='series_n', y='average_rating', 
    color = 'num_ratings', color_continuous_scale = px.colors.sequential.Teal,
    hover_data = ['title'],
    width = 600, height = 300
    )

fig_stormlight.show()

In [60]:
fig_stormlight.update_layout(
    title = 'The Stormlight Archive',
    coloraxis_colorbar = {'title':'Number<br>of ratings'}, 
    xaxis = {'title':'Series Volume'},
    yaxis = {'title':'Average rating', 'range':[1,5]},
    template = 'plotly_white+borders'
)

In [61]:
# Filter the saga once and reuse the result for both axes
stormlight_df = sanderson_sagas_df.query('series_name == "The Stormlight Archive"')
fig_stormlight.add_trace(
    go.Scatter(
        x=stormlight_df['series_n'],
        y=stormlight_df['hist_rating'],
        mode = 'lines+markers', showlegend=False
    )
)

In [62]:
fig_stormlight.update_layout(
    xaxis = {'type':'category'}
)

#### Example 2.2: Faceted barplot
The most direct option would be to use `facet_row` instead of `facet_col`, since it stacks the subplots vertically without needing `facet_col_wrap=1` (which limits each row to a single chart). However, with `facet_row` the title of each subplot appears at the right side of the subplot, rotated 90 degrees. Using `facet_col` together with `facet_col_wrap=1` is a trick to easily get the title on top.

In [63]:
fig_sanderson = px.bar(
    sanderson_sagas_df, 
    x='series_n', y='average_rating', 
    facet_col='series_name', facet_col_wrap=1,
    color = 'num_ratings',
    color_continuous_scale = px.colors.sequential.Teal,
    hover_data = ['title', 'num_ratings', 'hist_rating'],
    width = 600, height = 700)

fig_sanderson.show()

In [64]:
fig_sanderson_facet_row = px.bar(
    sanderson_sagas_df, 
    x='series_n', y='average_rating', 
    facet_row='series_name', 
    color = 'num_ratings',
    color_continuous_scale = px.colors.sequential.Teal,
    hover_data = ['title', 'num_ratings', 'hist_rating'],
    width = 600, height = 700)

fig_sanderson_facet_row.show()

When we create facet plots, some methods that we have not used before become useful, such as `.for_each_annotation()` (to apply a function to every annotation, which in facet plots includes the subplot titles) or `.for_each_yaxis()` (to apply a function to the Y-axis of each subplot). By default, all subplots share the same Y-axis range, so we can change it for all of them with `.update_layout(yaxis_range=[1,5])`. The axis title, however, is set individually for each subplot, so we have to use `.for_each_yaxis()`.

In [65]:
fig_sanderson.update_layout(
    coloraxis_colorbar = {'title':'Number<br>of ratings'}, 
    xaxis = {'title':'Series Volume'},
    yaxis = {'range':[1,5]},
    template = 'plotly_white+borders'
)
fig_sanderson.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig_sanderson.for_each_yaxis(lambda yaxis: yaxis.update(title=None))

fig_sanderson.add_annotation(
    text='Average rating', 
    x=-0.11, y=0.5,
    xref="paper", yref="paper",
    textangle=-90, showarrow=False
    )
    
fig_sanderson.update_traces(
    # https://plotly.com/python/hover-text-and-formatting/
    hovertemplate = '<b>%{customdata[0]}</b><br>Num. ratings: %{customdata[1]:.0f} <br>Avg. rating: %{y:.2f}<br>Saga rating: %{customdata[2]:.2f}'
)

`sanderson_sagas` is an array with the names of all the sagas that appear in the plot. We use `[::-1]` to reverse the order because in facet plots the row count starts from the bottom. That is, the positions of the subplots would be, from top to bottom: (row=5, col=1), (row=4, col=1), (row=3, col=1), (row=2, col=1), (row=1, col=1).
Writing `fig_sanderson.show()` inside the loop allows us to see how the lines are added row by row, from the bottom to the top.

In each iteration of the loop we add a new trace to the `data` list of the Figure. Each trace is created with `go.Scatter()`, the `go` class for scatter traces. Plotly has other trace classes, such as `go.Bar`. Lines, however, are drawn with `go.Scatter` by setting `mode='lines'` or `mode='lines+markers'`: `go.Line` is a deprecated legacy name, not a trace type of its own.

https://plotly.com/python/facet-plots/#adding-lines-and-rectangles-to-facet-plots

In [66]:
sanderson_sagas = sanderson_sagas_df.series_name.unique()

for i, saga in enumerate(sanderson_sagas[::-1]):
    df_i = sanderson_sagas_df.query(f'series_name == "{saga}"')
    fig_sanderson.add_trace(
        go.Scatter(
            x=df_i['series_n'], y=df_i['hist_rating'], 
            mode='lines+markers', line_color='black',
            showlegend=False, hoverinfo='none'
            ),
        row=i+1, col=1)
    # Uncomment the two lines below to see how the plot changes in each iteration of the loop
    #print(len(fig_sanderson.data))
    #fig_sanderson.show()

fig_sanderson.show()


We can also add a `rangeslider`, so that users can easily modify the X-axis range. The downside of this option is that it fixes the Y-axis, so it can no longer be zoomed. This option is often useful for charts where the X-axis represents dates. 

In [67]:
# https://plotly.com/python/range-slider/
fig_sanderson.update_layout(
    xaxis = {
        'rangeslider':{
            'autorange':True,
            'visible':True
        }
    }
)