## Data Analytics: Computational Temporal Analysis

Cara Marta Messina <br/>
Northeastern University<br/>
messina [dot] c [at] husky [dot] neu [dot] edu

This notebook takes data collected from <em>Archive of Our Own</em>, a popular fanfiction repository, and sets it up to be analyzed. The data was collected using [this AO3 python scraper](https://github.com/radiolarian/AO3Scraper). The corpus consists of <em>The Legend of Korra</em> and <em>Game of Thrones</em> fanfics, from the first one published on AO3 to 2019.

<em>This notebook is part of the Critical Fan Toolkit, Cara Marta Messina's public + digital dissertation</em>

In [2]:
#pandas for working with dataframes
import pandas as pd

#regular expression library
import re

#numpy specifically works with numbers
import numpy as np

from nltk import word_tokenize

import string
punctuations = list(string.punctuation)

#has the nice counter feature for counting tags
import collections
from collections import Counter 

#for making a string of elements separated by commas into a list
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars 

#visualizations
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

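#credentials for publishing plotly figures through Chart Studio
#(the API key is replaced with a placeholder; supply your own key here)
import chart_studio
chart_studio.tools.set_credentials_file(username='caramessina', api_key='YOUR_API_KEY')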

## Reading and Prepping the Data

In [3]:
korra = pd.read_csv('./data/group_month/allkorra_months.csv')
korra.head(3)

Unnamed: 0,month,rating,additional tags,category,relationship,body,count
0,2011-02,"not rated,","original characters - freeform,","multi,","mai/zuko (avatar), sokka/suki (avatar), aang (...","when kato listens to his father's war stories,...",1
1,2011-04,"general audiences,","family, angst, one shot,","gen,","aang (avatar)/katara,",his father shows tenzin where the flowers grow...,1
2,2011-05,"teen and up audiences,","completely au, written pre-canon, rated for la...","gen,",,\n \nthe earthbender's answer was not what s...,1


In [4]:
korra_full = pd.read_csv('./data/tlok_clean.csv')
korra_full.head(2)

Unnamed: 0.1,Unnamed: 0,work_id,title,published,rating,character,relationship,additional tags,category,body,month
0,0,6388009,"noatak (avatar), tarrlok (avatar), amon (avatar)",2016-03-28,"general audiences,","noatak (avatar), tarrlok (avatar), amon (avatar)",,"alternate universe,","gen,",he's forgotten how to be warm. the thought wou...,2016-03
1,1,13974048,"korra (avatar), lin beifong",2018-03-14,"teen and up audiences,","korra (avatar), lin beifong","lin beifong/korra,","just a quick one-shot i never posted properly,...","f/f,","""korra."" somewhere distant. someone holding h...",2018-03


In [104]:
got0 = pd.read_csv(r'./data/group_month/got_1.csv')
got1 = pd.read_csv(r'./data/group_month/got_2.csv')
got2 = pd.read_csv(r'./data/group_month/got_3.csv')
got3 = pd.read_csv(r'./data/group_month/got_4.csv')

merged_got = pd.concat([got0, got1, got2, got3])
merged_got.head(5)

Unnamed: 0,month,rating,additional tags,category,relationship,body,count
0,2006-08,"teen and up audiences,","incest, dreams,","m/m,","jon snow/robb stark,",up on the wall it is impossible to be warm and...,1
1,2007-02,"general audiences,","possible incest,",,,the last time jon had seen his half-sister san...,1
2,2007-05,"teen and up audiences,","tragedy, canonical character death, suicide, b...","f/m,","rhaegar targaryen/lyanna stark, robert barathe...","it is far too easy, to slip away. lyanna is kn...",1
3,2007-06,"mature, teen and up audiences,","alternate universe, infidelity, unrequited lov...","f/m, f/m,","cersei lannister/oberyn martell, petyr baelish...",the vase shatters beautifully against the wall...,2
4,2007-12,"teen and up audiences, general audiences,","romance, action/adventure, incest, maleslash,","f/m, m/m,","brienne/jaime lannister, jaime lannister/rhaeg...",asshai is the end of the world. the valaryians...,2
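
The cell above stacks the four monthly files with `pd.concat`. By default, `pd.concat` keeps each file's own row labels, so labels can repeat in the combined frame. The following is a minimal sketch using two small in-memory frames (toy values, not the AO3 corpus) to show that effect and how `ignore_index=True` renumbers the rows. The names `part_a`, `part_b`, `stacked`, and `renumbered` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for two of the monthly CSV files (hypothetical values)
part_a = pd.DataFrame({'month': ['2006-08', '2007-02'], 'count': [1, 1]})
part_b = pd.DataFrame({'month': ['2007-05', '2007-06'], 'count': [1, 2]})

# Default concat keeps each frame's own 0..n-1 labels, so labels repeat
stacked = pd.concat([part_a, part_b])

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```

Repeated labels are harmless for the plots in this notebook, but a fresh index avoids surprises when rows are later selected by label.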


## Creating Basic Line Charts for Publishing Rates

The next two cells plot the number of fanfics published each month as a line chart, one chart per fandom. The charts are simple, but they give context for why particular periods and publication dates matter.

In [106]:
korra_timelinefig = px.line(korra, x='month', y='count')

korra_timelinefig.update_layout(
    title='The Legend of Korra Fanfic Publishing Rates Across Time'
)

# korra_timelinefig.update_traces(marker_color='indigo')

korra_timelinefig.write_html('images/TLOK-history.html', auto_open=True)

In [107]:
got_timelinefig = px.line(merged_got, x='month', y='count')
got_timelinefig.update_layout(
    title='Game of Thrones Fanfic Publishing Rates Across Time')
# got_timelinefig.update_traces(marker_color='indigo')

got_timelinefig.write_html('images/GOT-history.html', auto_open=True)

## Data Analytics on the Fic Metadata

The functions below take a metadata column in which tags are stored as comma-separated strings (for example, relationship tags phrased as characterA/characterB, characterA/characterB, and so on), count how often each tag appears, and return the tags ordered from most to least common.

In [12]:
def column_to_list(df,columnName):
    '''
    this function takes all the information from a specific column and joins it into one string.
    input: the dataframe and the column name
    output: a single string containing every value in the column, separated by spaces (empty values become '')
    '''
    #fill empty values on a copy of the column, so the dataframe passed in is not modified
    values = df[columnName].fillna('').tolist()
    return ' '.join(values)

In [7]:
def clean_tokens(string):    
    stopwords = ['i', 'me', 'my', 'myself', "“", "”", 'we', 'our', '’', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", "would", "could", 'won', "won't", 'wouldn', "wouldn't"]
    #tokenize the string into words first (iterating over a string directly would yield single characters)
    text_lc = [word.lower() for word in word_tokenize(string)]
    text_tokens_clean = [word for word in text_lc if word not in stopwords]
    text_tokens_clean = [word for word in text_tokens_clean if word not in punctuations]
    return text_tokens_clean

In [127]:
def TagsAnalyzer(df, monthBegin, monthEnd, columnName):
    '''
    input: a dataframe indexed by month, the first and last month to include (such as '2012-04'), and the metadata column to analyze (such as 'additional tags')
    output: a list of (tag, count) tuples for that range of months, ordered from most to least common
    '''
    
    #choose the months to analyze (both endpoints are included)
    months_df = df.loc[monthBegin:monthEnd, :]
    
    #replace empty values & join the column into one string
    string = column_to_list(months_df, columnName)
    
    #the tokenizer that splits the string into tags, treating the comma as the end of each "sentence"
    class CommaPoint(PunktLanguageVars):
        sent_end_chars = (',',)
    tokenizer = PunktSentenceTokenizer(lang_vars = CommaPoint())
    
    #tokenizing the string based on the COMMA, not the white space (as seen in the CommaPoint above)
    ListOfTags = tokenizer.tokenize(string)
    
    #total number of tags; only needed for the commented-out ratio calculation below
    length = len(ListOfTags)
    
    #the "Counter" function is from the collections library
    allCounter=collections.Counter(ListOfTags)
    
    #return every tag with its count, most common first
    return allCounter.most_common()

    
#     TO MAKE IT A RATIO
#     li = []
#     #returning a dictionary in which the keys are all the tags, and the items are the counts
#     for word0,word1 in allCounter.most_common(300):
#         li.append((word0, (word1/length)*100))

# #     #return
#     return li
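
The commented-out block above divides each tag's count by the total number of tags and multiplies by 100, which appears to be how the percentage values in the category output later in the notebook were produced. Below is a minimal sketch of that conversion on a toy tag list. It uses only the standard library, and the tag values are hypothetical rather than taken from the corpus.

```python
from collections import Counter

# Toy tag list standing in for the tokenizer output (hypothetical values)
tags = ['f/m,', 'f/m,', 'gen,', 'f/f,']

counts = Counter(tags)
total = sum(counts.values())

# Convert each raw count into a percentage of all tags in the window
percentages = [(tag, count / total * 100) for tag, count in counts.most_common()]
print(percentages)
```

Percentages make windows of different lengths comparable, since a longer window will usually contain more fics and therefore higher raw counts.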

In [29]:
def TagsAnalyzer_noMonth(df, columnName, num):
    '''
    Input: the dataframe, the column you want to analyze, and the number of most common tags/phrases
    Output: A list of (tag, count) tuples for the column, ordered from most to least common
    
    Description: this separates tags by commas and counts how often each tag appears.
    Note: num is not applied inside the function, so every tag is returned; slice the result (for example, result[:num]) to keep only the top tags.
    '''
    #replace empty values & join the column into one string using the column_to_list function
    string = column_to_list(df, columnName)
    
    #the tokenizer that splits the string into tags, treating the comma as the end of each "sentence"
    class CommaPoint(PunktLanguageVars):
        sent_end_chars = (',',)
    tokenizer = PunktSentenceTokenizer(lang_vars = CommaPoint())
    
    #tokenizing the string based on the COMMA, not the white space (as seen in the CommaPoint above)
    ListOfTags = tokenizer.tokenize(string)
        
    #the "Counter" function is from the collections library
    allCounter=collections.Counter(ListOfTags)

    #returning a list of (tag, count) tuples, most common first
    return allCounter.most_common()
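
The counting idea in the two functions above can be sketched without NLTK. This version swaps the Punkt tokenizer for plain `str.split(',')`, which is an assumption made to keep the example self-contained. Unlike the notebook's output, it also strips the trailing comma from each tag. The helper name `count_tags` and the sample cells are hypothetical.

```python
from collections import Counter

# Toy 'additional tags' cells in the same shape as the CSV (hypothetical values)
cells = ['family, angst, one shot,', 'angst, fluff,', '']

def count_tags(cells):
    """Split each cell on commas, drop empty pieces, and count the tags."""
    tags = []
    for cell in cells:
        for piece in cell.split(','):
            piece = piece.strip()
            if piece:
                tags.append(piece)
    # most_common() returns (tag, count) pairs, highest count first
    return Counter(tags).most_common()

tag_counts = count_tags(cells)
print(tag_counts[:2])
```

Multi-word tags such as "one shot" stay intact because the split happens on commas, not on spaces.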

### Creating TXT Files of Additional Tags

First, I want to create .txt files of all the additional tags used so that I can upload them to Voyant if I want. In the data, tags are separated by commas and the words inside a tag are separated by spaces. I use string replacement and a regular expression to replace the spaces inside each tag with '_' while keeping a comma and a space between tags. This way, tools like Voyant, which split text on spaces, will not break a tag apart.

So now this (a, list of, tags may, look like this)
becomes this (a, list_of, tags_may, look_like_this)
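
As a small illustration of that transformation, here is a sketch that takes a split-and-join route rather than the replace-then-regex route used in the cells below. The function name `tags_to_voyant` is hypothetical, and the input is the example string from the text above.

```python
import re

def tags_to_voyant(text):
    """Join multi-word tags with underscores; keep ', ' between tags."""
    # Split on commas, trim whitespace, and drop empty pieces
    tags = [t.strip() for t in text.split(',') if t.strip()]
    # Replace runs of whitespace inside a tag with a single underscore
    return ', '.join(re.sub(r'\s+', '_', t) for t in tags)

converted = tags_to_voyant('a, list of, tags may, look like this')
print(converted)
```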

In [16]:
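lokAT = column_to_list(korra,'additional tags')
#remove the space after each comma, then turn the remaining (within-tag) spaces into underscores
lokAT = lokAT.replace(", ",",")
lokAT = lokAT.replace(" ","_")
#put a space back after each comma so the tags are separated by ', '
lokATfinal = re.sub(r'(,)(\w)', r'\g<1> \g<2>', lokAT)

#the "with" block closes the file automatically
with open("data/TLOK_additional_tags.txt","w") as outfile:
    outfile.write(lokATfinal)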

In [17]:
#reading in multiple csv files, since one large one breaks my kernels 

gotmonth0 = pd.read_csv('data/group_month/got_1.csv')
gotmonth1 = pd.read_csv('data/group_month/got_2.csv')
gotmonth2 = pd.read_csv('data/group_month/got_3.csv')
gotmonth3 = pd.read_csv('data/group_month/got_4.csv')

got_all = pd.concat([gotmonth0, gotmonth1, gotmonth2, gotmonth3]).set_index('month')
got_all.head(5)

Unnamed: 0_level_0,rating,additional tags,category,relationship,body,count
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2006-08,"teen and up audiences,","incest, dreams,","m/m,","jon snow/robb stark,",up on the wall it is impossible to be warm and...,1
2007-02,"general audiences,","possible incest,",,,the last time jon had seen his half-sister san...,1
2007-05,"teen and up audiences,","tragedy, canonical character death, suicide, b...","f/m,","rhaegar targaryen/lyanna stark, robert barathe...","it is far too easy, to slip away. lyanna is kn...",1
2007-06,"mature, teen and up audiences,","alternate universe, infidelity, unrequited lov...","f/m, f/m,","cersei lannister/oberyn martell, petyr baelish...",the vase shatters beautifully against the wall...,2
2007-12,"teen and up audiences, general audiences,","romance, action/adventure, incest, maleslash,","f/m, m/m,","brienne/jaime lannister, jaime lannister/rhaeg...",asshai is the end of the world. the valaryians...,2


In [18]:
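gotAT = column_to_list(got_all,'additional tags')
#remove the space after each comma, then turn the remaining (within-tag) spaces into underscores
gotAT = gotAT.replace(", ",",")
gotAT = gotAT.replace(" ","_")
#put a space back after each comma so the tags are separated by ', '
gotATfinale = re.sub(r'(,)(\w)', r'\g<1> \g<2>', gotAT)

#the "with" block closes the file automatically
with open("data/GOT_additional_tags.txt","w") as outfile:
    outfile.write(gotATfinale)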

### Korra Tags

In [19]:
#First, set 'month' as index

korra_all = korra.set_index('month')
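
Setting `month` as the index is what lets `TagsAnalyzer` select a window with `df.loc[monthBegin:monthEnd]`. The sketch below shows how that label slice behaves on a toy frame with hypothetical values. It assumes the index holds sorted 'YYYY-MM' strings, which sort alphabetically in chronological order.

```python
import pandas as pd

# Toy monthly table with a sorted 'YYYY-MM' string index (hypothetical values)
toy = pd.DataFrame(
    {'count': [1, 3, 5, 7]},
    index=pd.Index(['2011-02', '2011-04', '2014-05', '2014-06'], name='month'),
)

# Label slices include both endpoints; because the index is sorted,
# a start label that is absent ('2011-01') is still accepted.
window = toy.loc['2011-01':'2014-05']
print(window['count'].tolist())
```

Because the end label is included, consecutive windows should start on the month after the previous window ends, as the Korra windows in the next cells do.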

In [128]:
korra_preKArel = TagsAnalyzer(korra_all,'2011-01','2014-05','relationship')
korra_subKArel = TagsAnalyzer(korra_all,'2014-06','2014-11','relationship')
korra_postKArel = TagsAnalyzer(korra_all,'2014-12','2015-07','relationship')

print('Pre-Korrasami')
print(korra_preKArel[:10])
print('\n Korrasami Subtext')
print(korra_subKArel[:10])
print('\n Post-Korrasami')
print(korra_postKArel[:10])

Pre-Korrasami
[('korra/mako (avatar),', 305), ('bolin/korra (avatar),', 91), ('korra/asami sato,', 90), ('amon/lieutenant (avatar),', 68), ('amon/korra (avatar),', 48), ('korra/tahno (avatar),', 44), ('tahno/korra,', 40), ('pema/tenzin (avatar),', 39), ('lin bei fong/tenzin,', 39), ('korra/tarrlok (avatar),', 37)]

 Korrasami Subtext
[('korra/asami sato,', 281), ('korra/mako (avatar),', 63), ('varrick/zhu li,', 19), ('mako/prince wu,', 18), ('amon/korra (avatar),', 17), ('lin bei fong/tenzin,', 13), ('amon | noatak/korra,', 12), ('mako/asami sato,', 11), ('korrasami,', 11), ('aang/katara (avatar),', 11)]

 Post-Korrasami
[('korra/asami sato,', 1250), ('bolin/opal (avatar),', 120), ('korra/mako (avatar),', 96), ('baatar jr./kuvira (avatar),', 94), ('korra/kuvira,', 45), ('korrasami,', 44), ('jinora/kai (avatar),', 41), ('lin beifong/kya ii,', 40), ('mako/prince wu (avatar),', 33), ('lin beifong/tenzin,', 33)]




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [130]:
#SAVING AS A DATAFRAME FOR THE FUTURE
korra_preKAdf = pd.DataFrame(korra_preKArel, columns =['Pairing', 'Appearance']) 
korra_preKAdf.to_csv('data/counting_csvs/TLOK/relationships/preKA_relationships.csv')

korra_subKAdf = pd.DataFrame(korra_subKArel, columns =['Pairing', 'Appearance']) 
korra_subKAdf.to_csv('data/counting_csvs/TLOK/relationships/subKA_relationships.csv')

korra_postKA = pd.DataFrame(korra_postKArel, columns =['Pairing', 'Appearance']) 
korra_postKA.to_csv('data/counting_csvs/TLOK/relationships/postKA_relationships.csv')

In [137]:
preKAtags = TagsAnalyzer(korra_all,'2011-02','2014-05','additional tags')
subKAtags = TagsAnalyzer(korra_all,'2014-06','2014-11','additional tags')
postKAtags = TagsAnalyzer(korra_all,'2014-12','2015-07','additional tags')

print(preKAtags[:10])
print(subKAtags[:10])
print(postKAtags[:10])






[('romance,', 145), ('angst,', 109), ('friendship,', 92), ('family,', 72), ('fluff,', 67), ('alternate universe,', 63), ('established relationship,', 62), ('smut,', 46), ('au,', 46), ('hurt/comfort,', 43)]
[('fluff,', 63), ('romance,', 54), ('angst,', 43), ('alternate universe,', 33), ('alternate universe - modern setting,', 30), ('friendship,', 24), ('smut,', 22), ('hurt/comfort,', 22), ('humor,', 19), ('au,', 14)]
[('fluff,', 229), ('romance,', 213), ('angst,', 145), ('alternate universe - modern setting,', 126), ('korrasami - freeform,', 126), ('alternate universe,', 89), ('canon compliant,', 81), ('friendship,', 74), ('post-finale,', 62), ('post-canon,', 56)]


In [138]:
#SAVING AS A DATAFRAME FOR THE FUTURE 
korra_preKA_tags_df = pd.DataFrame(preKAtags, columns =['Tags', 'Appearance']) 
korra_preKA_tags_df.to_csv('data/counting_csvs/TLOK/tags/preKA_tags.csv')

korra_subKA_tags_df = pd.DataFrame(subKAtags, columns =['Tags', 'Appearance']) 
korra_subKA_tags_df.to_csv('data/counting_csvs/TLOK/tags/subKA_tags.csv')

korra_postKA_tags_df = pd.DataFrame(postKAtags, columns =['Tags', 'Appearance']) 
korra_postKA_tags_df.to_csv('data/counting_csvs/TLOK/tags/postKA_tags.csv')

In [91]:
korra_preKAcat = TagsAnalyzer(korra_all,'2011-02','2014-05','category')
korra_subKAcat = TagsAnalyzer(korra_all,'2014-06','2014-11','category')
korra_postKAcat = TagsAnalyzer(korra_all,'2014-12','2015-07','category')

print('Pre-Korrasami')
print(korra_preKAcat[:5])
print('\n Korrasami Subtext')
print(korra_subKAcat[:5])
print('\n Post-Korrasami')
print(korra_postKAcat[:5])

Pre-Korrasami
[('f/m,', 40.95904095904096), ('gen,', 30.81918081918082), ('m/m,', 12.137862137862138), ('f/f,', 9.990009990009991), ('multi,', 4.545454545454546)]

 Korrasami Subtext
[('f/f,', 43.007915567282325), ('f/m,', 23.7467018469657), ('gen,', 22.427440633245382), ('m/m,', 6.596306068601583), ('multi,', 2.638522427440633)]

 Post-Korrasami
[('f/f,', 55.596707818930035), ('f/m,', 20.16460905349794), ('gen,', 14.238683127572017), ('m/m,', 4.938271604938271), ('multi,', 3.4156378600823043)]







### Game of Thrones Metadata 

#### Relationship Tags

In [139]:
#seasons 1 and 2 – season 3 starts March 2013
got1_2relationship = TagsAnalyzer(got_all,'2006-08','2013-02','relationship')
print('\n Seasons 1 and 2')
print(got1_2relationship)

#seasons 3 and 4 – season 5 starts April 2015
got3_4relationship = TagsAnalyzer(got_all,'2013-03','2015-03','relationship')
print('\n Seasons 3 and 4')
print(got3_4relationship)

#seasons 5 and 6 – season 7 starts July 2017 (this window starts at 2015-07, so 2015-04 through 2015-06 fall outside every window)
got5_6relationship = TagsAnalyzer(got_all,'2015-07','2017-06','relationship')
print('\n Seasons 5 and 6')
print(got5_6relationship)

#season 7 – seasons 8 starts April 2019
got7relationship = TagsAnalyzer(got_all,'2017-07','2019-03','relationship')
print('\n Season 7')
print(got7relationship)

#season 8 – starts April 2019 (window ends September 2019)
got8relationship = TagsAnalyzer(got_all,'2019-04','2019-09','relationship')
print('\n Season 8')
print(got8relationship)







 Seasons 1 and 2
[('jaime lannister/brienne of tarth,', 147), ('arya stark/gendry waters,', 100), ('sandor clegane/sansa stark,', 76), ('cersei lannister/jaime lannister,', 71), ('theon greyjoy/robb stark,', 68), ('renly baratheon/loras tyrell,', 67), ('jon snow/robb stark,', 61), ('catelyn stark/ned stark,', 60), ('gendry/arya stark,', 49), ('jon snow/sansa stark,', 29), ('jorah mormont/daenerys targaryen,', 28), ('robb stark/jeyne westerling,', 25), ('sansa stark/willas tyrell,', 22), ('lyanna stark/rhaegar targaryen,', 21), ('joffrey baratheon/sansa stark,', 21), ('jon snow/daenerys targaryen,', 20), ('jon snow/ygritte,', 20), ('myrcella baratheon/robb stark,', 19), ('jaime lannister/sansa stark,', 19), ('petyr baelish/sansa stark,', 16), ('robert baratheon/cersei lannister,', 15), ('stannis baratheon/davos seaworth,', 15), ('theon greyjoy/jon snow,', 14), ('jon snow/arya stark,', 14), ('sansa stark/margaery tyrell,', 12), ("jaqen h'ghar/arya stark,", 12), ('cersei lannister/ned st


 Seasons 3 and 4
[('jaime lannister/brienne of tarth,', 541), ('arya stark/gendry waters,', 520), ('sandor clegane/sansa stark,', 433), ('catelyn stark/ned stark,', 263), ('sansa stark/margaery tyrell,', 217), ('cersei lannister/jaime lannister,', 203), ('theon greyjoy/robb stark,', 186), ('renly baratheon/loras tyrell,', 175), ('jon snow/sansa stark,', 161), ('petyr baelish/sansa stark,', 156), ('ramsay bolton/theon greyjoy,', 150), ('sandor clegane & sansa stark,', 142), ('jon snow/ygritte,', 132), ('jon snow/robb stark,', 126), ('lyanna stark/rhaegar targaryen,', 110), ('joffrey baratheon/sansa stark,', 94), ('robb stark/jeyne westerling,', 88), ('tyrion lannister/sansa stark,', 84), ('ramsay bolton/reek,', 84), ('sansa stark/willas tyrell,', 70), ('robert baratheon/cersei lannister,', 65), ('jojen reed/bran stark,', 64), ('myrcella baratheon/robb stark,', 63), ('elia martell/rhaegar targaryen,', 60), ('jon snow/daenerys targaryen,', 52), ('robb stark/sansa stark,', 52), ('stannis 


 Seasons 5 and 6
[('jon snow/sansa stark,', 1015), ('jaime lannister/brienne of tarth,', 663), ('sandor clegane/sansa stark,', 481), ('arya stark/gendry waters,', 366), ('petyr baelish/sansa stark,', 269), ('catelyn stark/ned stark,', 259), ('sansa stark/margaery tyrell,', 248), ('cersei lannister/jaime lannister,', 173), ('theon greyjoy/robb stark,', 169), ('jon snow/ygritte,', 166), ('lyanna stark/rhaegar targaryen,', 157), ('renly baratheon/loras tyrell,', 152), ('sandor clegane & sansa stark,', 147), ('jon snow/daenerys targaryen,', 144), ('ramsay bolton/theon greyjoy,', 126), ('jon snow/robb stark,', 122), ('minor or background relationship(s),', 102), ('jon snow/arya stark,', 94), ('stannis baratheon/sansa stark,', 89), ('joffrey baratheon/sansa stark,', 82), ('shireen baratheon/rickon stark,', 76), ('jojen reed/bran stark,', 73), ('tyrion lannister/sansa stark,', 72), ('jon snow & sansa stark,', 70), ('elia martell/rhaegar targaryen,', 66), ('robb stark/margaery tyrell,', 65), 


 Season 7
[('jon snow/sansa stark,', 2196), ('jon snow/daenerys targaryen,', 1288), ('jaime lannister/brienne of tarth,', 879), ('arya stark/gendry waters,', 690), ('sandor clegane/sansa stark,', 472), ('catelyn stark/ned stark,', 396), ('petyr baelish/sansa stark,', 292), ('cersei lannister/jaime lannister,', 263), ('lyanna stark/rhaegar targaryen,', 253), ('sandor clegane & sansa stark,', 219), ('theon greyjoy/robb stark,', 196), ('minor or background relationship(s),', 184), ('sansa stark/margaery tyrell,', 176), ('robb stark/margaery tyrell,', 161), ('jon snow/ygritte,', 152), ('jon snow & sansa stark,', 148), ('arya stark & sansa stark,', 143), ('jon snow & arya stark,', 142), ('tyrion lannister/sansa stark,', 132), ('jaime lannister/sansa stark,', 130), ('jon snow/arya stark,', 126), ('jon snow & daenerys targaryen,', 112), ('renly baratheon/loras tyrell,', 111), ('sansa stark/daenerys targaryen,', 100), ('theon greyjoy/sansa stark,', 99), ('ashara dayne/ned stark,', 97), ('robe


 Season 8
[('jaime lannister/brienne of tarth,', 1343), ('arya stark/gendry waters,', 1337), ('jon snow/daenerys targaryen,', 841), ('jon snow/sansa stark,', 709), ('theon greyjoy/sansa stark,', 304), ('sandor clegane/sansa stark,', 250), ('cersei lannister/jaime lannister,', 212), ('catelyn stark/ned stark,', 190), ('tormund giantsbane/jon snow,', 184), ('minor or background relationship(s),', 183), ('tyrion lannister/sansa stark,', 169), ('arya stark & sansa stark,', 160), ('arya stark & gendry waters,', 146), ('sansa stark/margaery tyrell,', 141), ('jon snow & arya stark,', 118), ('jon snow/ygritte,', 110), ('theon greyjoy/robb stark,', 108), ('grey worm/missandei,', 107), ('jorah mormont/daenerys targaryen,', 101), ('sansa stark/daenerys targaryen,', 95), ('lyanna stark/rhaegar targaryen,', 90), ('jon snow & daenerys targaryen,', 89), ('jon snow & sansa stark,', 88), ('robb stark/margaery tyrell,', 87), ('jaime lannister & brienne of tarth,', 86), ('jaime lannister/sansa stark,', 

In [135]:
#SEASONS 1–2
got1_2relationshipDF = pd.DataFrame(got1_2relationship, columns =['Pairing', 'Appearance']) 
got1_2relationshipDF.to_csv('data/counting_csvs/GOT/relationships/got1-2_relationships.csv')

#SEASONS 3–4
got3_4relationshipDF = pd.DataFrame(got3_4relationship, columns =['Pairing', 'Appearance']) 
got3_4relationshipDF.to_csv('data/counting_csvs/GOT/relationships/got3-4_relationships.csv')

#SEASONS 5–6
got5_6relationshipDF = pd.DataFrame(got5_6relationship, columns =['Pairing', 'Appearance']) 
got5_6relationshipDF.to_csv('data/counting_csvs/GOT/relationships/got5-6_relationships.csv')

#SEASON 7
got7relationshipDF = pd.DataFrame(got7relationship, columns =['Pairing', 'Appearance']) 
got7relationshipDF.to_csv('data/counting_csvs/GOT/relationships/got7_relationships.csv')

#SEASON 8
got8relationshipDF = pd.DataFrame(got8relationship, columns =['Pairing', 'Appearance']) 
got8relationshipDF.to_csv('data/counting_csvs/GOT/relationships/got8_relationships.csv')

In [140]:
got_all_AT = TagsAnalyzer_noMonth(got_all,'additional tags', 20)
got_all_AT

[('alternate universe - modern setting,', 5302),
 ('alternate universe - canon divergence,', 3098),
 ('fluff,', 2968),
 ('angst,', 2750),
 ('romance,', 1712),
 ('smut,', 1598),
 ('alternate universe,', 1420),
 ('slow burn,', 1070),
 ('hurt/comfort,', 1029),
 ('r plus l equals j,', 898),
 ('explicit sexual content,', 892),
 ('oral sex,', 820),
 ('fluff and angst,', 820),
 ('one shot,', 689),
 ('arranged marriage,', 644),
 ('fluff and smut,', 635),
 ('canon compliant,', 615),
 ('sibling incest,', 610),
 ('sexual content,', 589),
 ('incest,', 566),
 ('jon snow is a targaryen,', 555),
 ('love,', 549),
 ('canon-typical violence,', 525),
 ('humor,', 522),
 ('fix-it,', 522),
 ('family,', 519),
 ('other additional tags to be added,', 517),
 ('modern au,', 514),
 ('cunnilingus,', 508),
 ('drabble,', 503),
 ('violence,', 503),
 ('angst with a happy ending,', 503),
 ('first time,', 492),
 ('anal sex,', 485),
 ('post-canon,', 480),
 ('plot what plot/porn without plot,', 480),
 ('established relati

#### Additional Tags

In [501]:
#seasons 1 and 2 – season 3 starts March 2013
got1_2AT = TagsAnalyzer(got_all,'2006-08','2013-02','additional tags')
print('\n Seasons 1 and 2')
print(got1_2AT)

#seasons 3 and 4 – season 5 starts April 2015
got3_4AT = TagsAnalyzer(got_all,'2013-03','2015-03','additional tags')
print('\n Seasons 3 and 4')
print(got3_4AT)

#seasons 5 and 6 – season 7 starts July 2017
got5_6AT = TagsAnalyzer(got_all,'2015-07','2017-06','additional tags')
print('\n Seasons 5 and 6')
print(got5_6AT)

#season 7 – seasons 8 starts April 2019
got7AT = TagsAnalyzer(got_all,'2017-07','2019-03','additional tags')
print('\n Season 7')
print(got7AT)

#season 8 – starts April 2019 (window ends September 2019)
got8AT = TagsAnalyzer(got_all,'2019-04','2019-09','additional tags')
print('\n Season 8')
print(got8AT)







 Seasons 1 and 2
[('no english,', 108), ('angst,', 96), ('alternate universe - modern setting,', 95), ('alternate universe,', 91), ('somali,', 85), ('romance,', 67), ('sibling incest,', 56), ('alternate universe - canon divergence,', 50), ('sexual content,', 50), ('fluff,', 47), ('future fic,', 44), ('first time,', 36), ('incest,', 31), ('family,', 31), ('crossover,', 30), ('friendship,', 30), ('alternate universe - canon,', 29), ('oral sex,', 27), ('hurt/comfort,', 26), ('drabble,', 26), ('cunnilingus,', 26), ('somali only,', 24), ('half-sibling incest,', 24), ('au,', 23), ('dubious consent,', 22), ('pre-canon,', 20), ('pov female character,', 20), ('blow jobs,', 18), ('slash,', 18), ('arranged marriage,', 18), ('masturbation,', 16), ('unrequited love,', 15), ('character study,', 15), ('drama,', 15), ('canon compliant,', 15), ('canonical character death,', 14), ('first kiss,', 14), ('kink meme,', 14), ('personal,', 14), ('love,', 14), ('dirty talk,', 14), ('prompt fic,', 14), ('unres


 Season 8



#### Category (sexual pairing)

In [38]:
#seasons 1 and 2 – season 3 starts March 2013
got1_2cat = TagsAnalyzer(got_all,'2006-08','2013-02','category')
print('\n Seasons 1 and 2')
print(got1_2cat)

#seasons 3 and 4 – season 5 starts April 2015
got3_4cat = TagsAnalyzer(got_all,'2013-03','2015-03','category')
print('\n Seasons 3 and 4')
print(got3_4cat)

#seasons 5 and 6 – season 7 starts July 2017
got5_6cat = TagsAnalyzer(got_all,'2015-07','2017-06','category')
print('\n Seasons 5 and 6')
print(got5_6cat)

#season 7 – seasons 8 starts April 2019
got7cat = TagsAnalyzer(got_all,'2017-07','2019-03','category')
print('\n Season 7')
print(got7cat)

#season 8 – starts April 2019 (window ends September 2019)
got8cat = TagsAnalyzer(got_all,'2019-04','2019-09','category')
print('\n Season 8')
print(got8cat)







 Seasons 1 and 2
[('f/m,', 808), ('m/m,', 316), ('gen,', 282), ('f/f,', 68), ('multi,', 65), ('other,', 10)]

 Seasons 3 and 4
[('f/m,', 2920), ('m/m,', 1127), ('gen,', 1020), ('f/f,', 424), ('multi,', 286), ('other,', 87)]

 Seasons 5 and 6
[('f/m,', 4189), ('m/m,', 1096), ('gen,', 893), ('f/f,', 591), ('multi,', 411), ('other,', 124)]

 Season 7
[('f/m,', 7216), ('m/m,', 1234), ('gen,', 1096), ('f/f,', 865), ('multi,', 600), ('other,', 183)]

 Season 8
[('f/m,', 5205), ('gen,', 845), ('m/m,', 764), ('f/f,', 579), ('multi,', 373), ('other,', 158)]


## Visualizations and Analytics

Before visualizing, I transform the lists of tuples created above into dataframes so that the periods can be compared side by side. First, I do this for the relationship categories of both GoT and TLoK.

In [92]:
def tuple_to_df(tup, column_name_seasons, column_name_data):
    '''
    Description: Takes a list of tuples and turns it into a dataframe
    
    Input: A list of (tag, value) tuples, like the ones created above; a name for the value column, based on the "Season" or time period; and a name for the tag column
    Output: A dataframe with two columns: the values (named after the season or period) and the tags
    '''
    newdf = pd.DataFrame(list(tup))
    newdf[column_name_seasons] = newdf[1]
    newdf[column_name_data] = newdf[0]
    finalDF= newdf[[column_name_seasons,column_name_data]]
    return finalDF
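
The cells that follow build one dataframe per period and merge them on the shared tag column. Here is a minimal sketch of that pattern with two toy periods. The values and the column names `Period A` and `Period B` are hypothetical, and the frames are built with `columns=` directly, not through `tuple_to_df`.

```python
import pandas as pd

# Toy (tag, value) lists for two periods (hypothetical values)
period_a = [('f/m,', 40.0), ('f/f,', 10.0)]
period_b = [('f/f,', 45.0), ('f/m,', 25.0)]

df_a = pd.DataFrame(period_a, columns=['pairings', 'Period A'])
df_b = pd.DataFrame(period_b, columns=['pairings', 'Period B'])

# Merge on the shared tag column so rows align by tag, not by rank
combined = df_a.merge(df_b, on='pairings')
print(combined)
```

The default merge is an inner join, so a tag that is missing from any one period drops out of the combined table. Passing `how='outer'` would keep it, with missing values in the periods where it does not appear.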

In [93]:
#TAKING THE RELATIONSHIP CATEGORY TUPLE AND MAKING IT A MERGED DATAFRAME
gotcat1 = tuple_to_df(got1_2cat, "Seasons 1 and 2", 'pairings')
gotcat2 = tuple_to_df(got3_4cat, "Seasons 3 and 4", 'pairings')
gotcat3 = tuple_to_df(got5_6cat, "Seasons 5 and 6", 'pairings')
gotcat4 = tuple_to_df(got7cat, "Season 7", 'pairings')
gotcat5 = tuple_to_df(got8cat, "Season 8", 'pairings')

#MERGING
gotcat_all = gotcat1.merge(gotcat2)
gotcat_all = gotcat_all.merge(gotcat3)
gotcat_all = gotcat_all.merge(gotcat4)
gotcat_all = gotcat_all.merge(gotcat5)

#REPLACING 
gotcat_all = gotcat_all.replace('f/m,','F/M').replace('gen,','GEN').replace('f/f,','F/F').replace('m/m,','M/M').replace('multi,','MULTI').replace('other,','OTHER')
gotcat_all

NameError: name 'got1_2cat' is not defined
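
When `merge` is called with no arguments, pandas joins on the columns the two frames have in common, which here is 'pairings'. A minimal sketch of the same step with the key named explicitly and the relabeling done through one dictionary, assuming two small frames shaped like the output of `tuple_to_df`; the frames `s7` and `s8` and the `labels` mapping are hypothetical, with counts copied from the Season 7 and Season 8 output above:

```python
import pandas as pd

# Hypothetical mini-frames; note the rows are in a different order in each
s7 = pd.DataFrame({"Season 7": [7216, 1234, 1096],
                   "pairings": ["f/m,", "m/m,", "gen,"]})
s8 = pd.DataFrame({"Season 8": [5205, 845, 764],
                   "pairings": ["f/m,", "gen,", "m/m,"]})

# Join on the shared label column so rows align by pairing, not by position
merged = s7.merge(s8, on="pairings", how="inner")

# One mapping instead of a chain of .replace() calls
labels = {"f/m,": "F/M", "m/m,": "M/M", "gen,": "GEN"}
merged["pairings"] = merged["pairings"].replace(labels)

print(merged)
```

Naming the key makes it harder for an accidental second shared column to change the join, and the dictionary keeps all label changes in one place.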

In [94]:
#TAKING THE RELATIONSHIP CATEGORY TUPLE AND MAKING IT A MERGED DATAFRAME
korracat1 = tuple_to_df(korra_preKAcat, "Pre-Korrasami", 'pairings')
korracat2 = tuple_to_df(korra_subKAcat, "Subtext Korrasami", 'pairings')
korracat3 = tuple_to_df(korra_postKAcat, "Post Korrasami", 'pairings')

#MERGING
korracat_all = korracat1.merge(korracat2)
korracat_all = korracat_all.merge(korracat3)

#REPLACING PAIRING TITLES
korracat_all = korracat_all.replace('f/m,','F/M').replace('gen,','GEN').replace('f/f,','F/F').replace('m/m,','M/M').replace('multi,','MULTI').replace('other,','OTHER')
korracat_all

Unnamed: 0,Pre-Korrasami,pairings,Subtext Korrasami,Post Korrasami
0,40.959041,F/M,23.746702,20.164609
1,30.819181,GEN,22.427441,14.238683
2,12.137862,M/M,6.596306,4.938272
3,9.99001,F/F,43.007916,55.596708
4,4.545455,MULTI,2.638522,3.415638
5,1.548452,OTHER,1.583113,1.646091


In [96]:
fig = go.Figure(data=[
    go.Bar(name='Seasons 1 and 2', x=gotcat_all['pairings'], y=gotcat_all['Seasons 1 and 2']),
    go.Bar(name='Seasons 3 and 4', x=gotcat_all['pairings'], y=gotcat_all['Seasons 3 and 4']),
    go.Bar(name='Seasons 5 and 6', x=gotcat_all['pairings'], y=gotcat_all['Seasons 5 and 6']),
    go.Bar(name='Season 7', x=gotcat_all['pairings'], y=gotcat_all['Season 7']),
    go.Bar(name='Season 8', x=gotcat_all['pairings'], y=gotcat_all['Season 8'])
])

fig.update_layout(barmode='group')

fig.update_layout(
    title='Game of Thrones Romantic Pairing Trends'
)

NameError: name 'gotcat_all' is not defined

In [410]:
figGOT = go.Figure(data=[
    go.Bar(name='Seasons 1 and 2', x=gotcat_all['pairings'], y=gotcat_all['Seasons 1 and 2'], marker_color='#C0C0C0'),
    go.Bar(name='Seasons 3 and 4', x=gotcat_all['pairings'], y=gotcat_all['Seasons 3 and 4'], marker_color='#808080'),
    go.Bar(name='Seasons 5 and 6', x=gotcat_all['pairings'], y=gotcat_all['Seasons 5 and 6'], marker_color='#585858'),
    go.Bar(name='Season 7', x=gotcat_all['pairings'], y=gotcat_all['Season 7'], marker_color='#383838'),
    go.Bar(name='Season 8', x=gotcat_all['pairings'], y=gotcat_all['Season 8'], marker_color='#181818')
])

figGOT.update_layout(barmode='group')

figGOT.update_layout(
    title='Game of Thrones Romantic Pairing Trends', plot_bgcolor='#ffffff'
)


figGOT.write_html('images/GoT-Romantic-Pairings.html', auto_open=True)

In [101]:
figGOT = make_subplots(
    rows=2, cols=3,
    shared_yaxes=True,
    shared_xaxes=True,
    subplot_titles=("Seasons 1–2", "Seasons 3–4", "Seasons 5–6", "Season 7", "Season 8 and Beyond"))

#tuple_to_df returns named columns, so select them by name rather than by 0 and 1
figGOT.add_trace(go.Bar(
    y=gotcat1["Seasons 1 and 2"], 
    x=gotcat1['pairings'], 
    name="Seasons 1–2"), 
    row=1, 
    col=1)

figGOT.add_trace(go.Bar(y=gotcat2["Seasons 3 and 4"], x=gotcat2['pairings'], name="Seasons 3–4"), row=1, col=2)
figGOT.add_trace(go.Bar(y=gotcat3["Seasons 5 and 6"], x=gotcat3['pairings'], name="Seasons 5–6"), row=1, col=3)
figGOT.add_trace(go.Bar(y=gotcat4["Season 7"], x=gotcat4['pairings'], name="Season 7"), row=2, col=1)
figGOT.add_trace(go.Bar(y=gotcat5["Season 8"], x=gotcat5['pairings'], name="Season 8"), row=2, col=2)

figGOT.update_layout(
    title='Game of Thrones Romantic Pairing Trends'
)

figGOT.update_xaxes(ticks="inside")

figGOT.write_html('images/GoT-Romantic-Pairings.html', auto_open=True)

NameError: name 'gotcat1' is not defined

### The Legend of Korra Visualizations

In [132]:
korra_rel_fig = go.Figure(data=[
    go.Bar(name='Pre-Korrasami (2011–05/2014)', x=korra_preKAdf['Pairing'][:5], y=korra_preKAdf['Appearance'][:5], marker_color='#C0C0C0'),
    go.Bar(name='Subtext Korrasami (06/2014–11/2014)', x=korra_subKAdf['Pairing'][:5], y=korra_subKAdf['Appearance'][:5], marker_color='#585858'),
    go.Bar(name='Post Korrasami (12/2014–07/2015)', x=korra_postKA['Pairing'][:5], y=korra_postKA['Appearance'][:5], marker_color='#181818')
])

#I don't like this one because it only shows the top 5 pairings, and the top five change quite frequently from period to period

korra_rel_fig.update_layout(barmode='group',)

korra_rel_fig.update_layout(
    title='The Legend of Korra Romantic Pairing Trends', plot_bgcolor='#ffffff'
)

korra_rel_fig.write_html('images/TLoK-Romantic-Pairings.html', auto_open=True)

In [117]:
Korra_fig = go.Figure(data=[
    go.Bar(name='Pre-Korrasami (2011–05/2014)', x=korracat_all['pairings'], y=korracat_all['Pre-Korrasami'],marker_color='#C0C0C0'),
    go.Bar(name='Subtext Korrasami (06/2014–11/2014)', x=korracat_all['pairings'], y=korracat_all['Subtext Korrasami'], marker_color='#585858'),
    go.Bar(name='Post Korrasami (12/2014–07/2015)', x=korracat_all['pairings'], y=korracat_all['Post Korrasami'], marker_color='#181818')
])

Korra_fig.update_layout(barmode='group',)

Korra_fig.update_layout(
    title='The Legend of Korra Romantic Pairing Trends', plot_bgcolor='#ffffff'
)


Korra_fig.write_html('images/TLoK-Romantic-Pairings.html', auto_open=True)

In [102]:
figTLOK = make_subplots(
    rows=1, cols=3,
    shared_yaxes=True,
    shared_xaxes=True,
    subplot_titles=("Before Korrasami", "Korrasami Subtext", "Post Korrasami"))

#tuple_to_df returns named columns, so select them by name rather than by 0 and 1
figTLOK.add_trace(go.Bar(
    y=korracat1["Pre-Korrasami"], 
    x=korracat1['pairings'], 
    name='Up To July 2014'), 
              row=1, 
              col=1)

figTLOK.add_trace(go.Bar(y=korracat2["Subtext Korrasami"], x=korracat2['pairings'], name='August–November 2014'), row=1, col=2)

figTLOK.add_trace(go.Bar(y=korracat3["Post Korrasami"], x=korracat3['pairings'], name='December 2014 and Beyond'), row=1, col=3)

figTLOK.update_layout(
    title='The Legend of Korra Romantic Pairing Trends'
)

figTLOK.write_html('images/TLoK-Romantic-Pairings.html', auto_open=True)