---
layout: post
title:  "Rating Tags in Works Part III"
date:   2021-06-02
categories: visualization
tags: Python Pandas bar-chart-race
---

In Part III, we use the [bar_chart_race](https://github.com/dexplo/bar_chart_race) package to create an animated bar chart for our data set.

* Table of Contents
{:toc}

# Loading File

In Part I and Part II, we prepared the DataFrames and saved them to local CSV files. We'll load both files here.

In [1]:
# Load python libraries
import pandas as pd

In [2]:
# Load rating.csv from part I
rating = pd.read_csv("rating.csv")

In [3]:
# preview file
rating

Unnamed: 0,id,type,name,canonical,cached_count,merger_id
0,9,Rating,Not Rated,True,825385,
1,10,Rating,General Audiences,True,2115153,
2,11,Rating,Teen And Up Audiences,True,2272688,
3,12,Rating,Mature,True,1151260,
4,13,Rating,Explicit,True,1238331,
5,12766726,Rating,Teen & Up Audiences,False,333,


In [4]:
# Load rating_pivot.csv from part II
df = pd.read_csv("rating_pivot.csv")

In [5]:
# preview file
df

Unnamed: 0,creation date,9,10,11,12,13
0,2008-09-30,76,232,213,174,233
1,2008-10-31,38,111,93,43,196
2,2008-11-30,11,97,97,56,76
3,2008-12-31,2,93,47,41,56
4,2009-01-31,18,175,104,78,133
...,...,...,...,...,...,...
145,2020-10-31,14188,42416,47706,22015,28723
146,2020-11-30,13397,38003,42168,19005,21743
147,2020-12-31,15763,50443,51435,22664,26656
148,2021-01-31,16875,45592,51099,23830,27084


# Data Cleaning

There is still some data cleaning to do, namely:

- Making the "creation date" column the index;

- Trimming the "creation date" values: each one identifies the month the counts belong to, but the string also carries the last day of that month, which should be removed;

- Changing the column names from tag ids to tag names.

In [6]:
# Set index
df.set_index("creation date", inplace=True)
df

creation date,9,10,11,12,13
2008-09-30,76,232,213,174,233
2008-10-31,38,111,93,43,196
2008-11-30,11,97,97,56,76
2008-12-31,2,93,47,41,56
2009-01-31,18,175,104,78,133
...,...,...,...,...,...
2020-10-31,14188,42416,47706,22015,28723
2020-11-30,13397,38003,42168,19005,21743
2020-12-31,15763,50443,51435,22664,26656
2021-01-31,16875,45592,51099,23830,27084


In [7]:
# Remove day from date string
# Use .str to access the string on each row
df.index = df.index.str[:-3]
df

creation date,9,10,11,12,13
2008-09,76,232,213,174,233
2008-10,38,111,93,43,196
2008-11,11,97,97,56,76
2008-12,2,93,47,41,56
2009-01,18,175,104,78,133
...,...,...,...,...,...
2020-10,14188,42416,47706,22015,28723
2020-11,13397,38003,42168,19005,21743
2020-12,15763,50443,51435,22664,26656
2021-01,16875,45592,51099,23830,27084
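The string slice above relies on every date being formatted as `YYYY-MM-DD`. As a hedged alternative, the sketch below parses the strings into real dates first and then formats them back to `YYYY-MM`. The three sample dates are copied from the preview; the variable names are only illustrative.

```python
import pandas as pd

# Small stand-in for the real index (values copied from the preview above)
idx = pd.Index(["2008-09-30", "2008-10-31", "2021-01-31"], name="creation date")

# Parse the strings to datetimes, then format each one as "YYYY-MM"
monthly = pd.to_datetime(idx).strftime("%Y-%m")

print(list(monthly))
```

Parsing first means a malformed date raises an error instead of being silently truncated, at the cost of one extra step.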


In [8]:
# Change tag id to tag name
# Keep only the first five names: the sixth tag (id 12766726) is a
# non-canonical duplicate of "Teen And Up Audiences" and has no column here
df.columns = rating.name[:5]
df

creation date,Not Rated,General Audiences,Teen And Up Audiences,Mature,Explicit
2008-09,76,232,213,174,233
2008-10,38,111,93,43,196
2008-11,11,97,97,56,76
2008-12,2,93,47,41,56
2009-01,18,175,104,78,133
...,...,...,...,...,...
2020-10,14188,42416,47706,22015,28723
2020-11,13397,38003,42168,19005,21743
2020-12,15763,50443,51435,22664,26656
2021-01,16875,45592,51099,23830,27084
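Assigning `rating.name[:5]` works because the rows of `rating` happen to be in the same order as the columns of `df`. A minimal sketch of an order-independent alternative, assuming the column labels are the tag ids read as strings (which is what `read_csv` produces for header values), is to build an id-to-name lookup and rename with it. The two small frames below are stand-ins built from the previews above.

```python
import pandas as pd

# Stand-in for rating.csv (ids and names copied from the preview above)
rating = pd.DataFrame({
    "id": [9, 10, 11, 12, 13, 12766726],
    "name": ["Not Rated", "General Audiences", "Teen And Up Audiences",
             "Mature", "Explicit", "Teen & Up Audiences"],
})

# Stand-in for the pivot table: one row, columns labelled by tag id strings
df = pd.DataFrame([[76, 232, 213, 174, 233]],
                  columns=["9", "10", "11", "12", "13"],
                  index=["2008-09"])

# Map each tag id (as a string) to its name, then rename matching columns
id_to_name = dict(zip(rating["id"].astype(str), rating["name"]))
renamed = df.rename(columns=id_to_name)

print(list(renamed.columns))
```

Ids without a matching column, such as 12766726 here, are simply ignored by `rename`.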


# Monthly Posts vs. Cumulative Sum

We have two options for what to animate. The DataFrame as it stands contains the number of works posted per month under each rating category. Alternatively, we can calculate the cumulative sum of posts, whose final row should be very close to the total number of works on AO3 at the time of the data dump. It will not match exactly because, in previous posts, we dropped some works from the data set due to N/A values or duplicates while cleaning it.

In [9]:
# Cumulative sum
df_cumsum = df.cumsum()
df_cumsum

creation date,Not Rated,General Audiences,Teen And Up Audiences,Mature,Explicit
2008-09,76,232,213,174,233
2008-10,114,343,306,217,429
2008-11,125,440,403,273,505
2008-12,127,533,450,314,561
2009-01,145,708,554,392,694
...,...,...,...,...,...
2020-10,651539,1895727,2006772,1014066,1081494
2020-11,664936,1933730,2048940,1033071,1103237
2020-12,680699,1984173,2100375,1055735,1129893
2021-01,697574,2029765,2151474,1079565,1156977
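To make the relationship between the two tables concrete, here is a small sketch that uses the first three months of two rating columns, copied from the monthly table. It shows that each cumulative value is the running total of the monthly counts, and that the last cumulative row equals the column totals.

```python
import pandas as pd

# First three months of two rating columns, copied from the monthly table
monthly = pd.DataFrame(
    {"Not Rated": [76, 38, 11], "Explicit": [233, 196, 76]},
    index=["2008-09", "2008-10", "2008-11"],
)

# Running total down each column
running = monthly.cumsum()
print(running)

# The last cumulative row is the same as summing each column
assert (running.iloc[-1] == monthly.sum()).all()
```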


In [12]:
# Export to local csv file
df_cumsum.to_csv("rating-cumsum.csv")

# Animated Bar Chart with Bar-Chart-Race 

We use the [bar_chart_race](https://github.com/dexplo/bar_chart_race) package to automate the animation process. You can of course build the whole chart from scratch, as in [this Matplotlib tutorial](https://www.dunderdata.com/blog/create-a-bar-chart-race-animation-in-python-with-matplotlib). More tutorials from the package's author can be found in the [official tutorial](https://www.dexplo.org/bar_chart_race/tutorial/).

In [10]:
# Load the package
import bar_chart_race as bcr

# Load gc to manually release memory in case Jupyter Notebook crashes
import gc

In [11]:
# Clear memory
gc.collect()

31

## Customization: show total work count

In [12]:
# Function to show total work count
# From bar-chart-race tutorial
def summary(values, ranks):
    total_works = values.sum()
    s = f'Total Works - {total_works:,.0f}'
    return {'x': .99, 'y': .05, 's': s, 'ha': 'right', 'size': 8}
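To see what the function returns for one period, the sketch below calls it by hand on the final row of the cumulative table (2021-01). It assumes `values` and `ranks` are pandas Series indexed by rating name; the `ranks` Series is built here only to satisfy the signature and is not used by the function.

```python
import pandas as pd

def summary(values, ranks):
    total_works = values.sum()
    s = f'Total Works - {total_works:,.0f}'
    return {'x': .99, 'y': .05, 's': s, 'ha': 'right', 'size': 8}

# Final row of the cumulative table (2021-01)
values = pd.Series({
    "Not Rated": 697574,
    "General Audiences": 2029765,
    "Teen And Up Audiences": 2151474,
    "Mature": 1079565,
    "Explicit": 1156977,
})
ranks = values.rank(ascending=False)  # illustrative only

label = summary(values, ranks)
print(label['s'])  # Total Works - 7,115,355
```

The returned dictionary holds the position (`x`, `y`), the text (`s`) and the styling of the label drawn on each frame.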

## Animation

In [13]:
# filename=None in order to display in Jupyter Notebook cell
# period_summary_func=summary to show total work count
bcr.bar_chart_race(
    df=df_cumsum,
    filename=None,
    period_summary_func=summary,
    title='AO3 Works Rating Breakdown \n 2008-2021',
)

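With 149 monthly rows, the animation can run long. The sketch below gives a rough estimate of the running time from `period_length` (milliseconds per period, 500 by default) and shows one way the race could be written to a file instead of displayed inline. The helper names and the file name `rating-race.mp4` are hypothetical, the estimate ignores any encoding details, and saving a video is assumed to require ffmpeg to be installed.

```python
def estimate_duration_seconds(n_periods, period_length_ms=500):
    """Rough running time, treating each period as lasting period_length_ms."""
    return n_periods * period_length_ms / 1000


def save_race(df, filename="rating-race.mp4", period_length_ms=250):
    """Write the race to a video file (hypothetical helper, not from the package)."""
    # Imported here so the estimate above works without the package installed
    import bar_chart_race as bcr

    bcr.bar_chart_race(
        df=df,
        filename=filename,               # a file name instead of None saves to disk
        period_length=period_length_ms,  # milliseconds spent on each period
        steps_per_period=10,             # frames interpolated per period (the default)
    )


# 149 monthly rows at the default 500 ms per period
print(estimate_duration_seconds(149))       # 74.5
# Halving period_length halves the estimate
print(estimate_duration_seconds(149, 250))  # 37.25
```

Calling `save_race(df_cumsum)` would then produce the file, provided the package and ffmpeg are available.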
