In [1]:
#pandas for working with dataframes
import pandas as pd
import numpy as np

#nltk libraries
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.stem.porter import PorterStemmer
import string

## Create .TXT Files

First, I create a large .txt file for each of the separate <em>Game of Thrones</em> and <em>The Legend of Korra</em> corpora. I create these .txt files before running word2vec or doing any other work with the data so that I can easily access them in other notebooks without re-running this process.

### Creating GoT Corpora
I will create several different corpora, with a <strong>cleaned</strong> and an <strong>uncleaned</strong> version of each. This way, I can easily access the data when I need to. Here are the five GoT corpora I'll be working with (ten in total, counting the cleaned and uncleaned version of each):
- Seasons 1 & 2
- Seasons 3 & 4
- Seasons 5 & 6
- Season 7
- Season 8

I do not need to do this with the TLoK data because I have already done so. Find more in my GitHub repository, [tracing fan uptakes in <em>The Legend of Korra</em>](https://github.com/caramessina/tracing_fan_uptakes).

In [2]:
#reading in the GoT CSV files
gotmonth0 = pd.read_csv('data/group_month/got_1.csv')
gotmonth1 = pd.read_csv('data/group_month/got_2.csv')
gotmonth2 = pd.read_csv('data/group_month/got_3.csv')
gotmonth3 = pd.read_csv('data/group_month/got_4.csv')

In [3]:
#merging the GoT CSV files
got_all = pd.concat([gotmonth0, gotmonth1, gotmonth2, gotmonth3]).set_index('month')
got_all.head(5)

month,rating,additional tags,category,relationship,body,count
2006-08,"teen and up audiences,","incest, dreams,","m/m,","jon snow/robb stark,",up on the wall it is impossible to be warm and...,1
2007-02,"general audiences,","possible incest,",,,the last time jon had seen his half-sister san...,1
2007-05,"teen and up audiences,","tragedy, canonical character death, suicide, b...","f/m,","rhaegar targaryen/lyanna stark, robert barathe...","it is far too easy, to slip away. lyanna is kn...",1
2007-06,"mature, teen and up audiences,","alternate universe, infidelity, unrequited lov...","f/m, f/m,","cersei lannister/oberyn martell, petyr baelish...",the vase shatters beautifully against the wall...,2
2007-12,"teen and up audiences, general audiences,","romance, action/adventure, incest, maleslash,","f/m, m/m,","brienne/jaime lannister, jaime lannister/rhaeg...",asshai is the end of the world. the valaryians...,2


In [4]:
#creating the 5 separate corpora based on the seasons
got1_2 = got_all.loc['2006-08':'2013-02']
got3_4 = got_all.loc['2013-03':'2015-03']
got5_6 = got_all.loc['2015-07':'2017-06']
got7 = got_all.loc['2017-07':'2019-03']
got8 = got_all.loc['2019-04':'2019-09']

got8.head(2)

month,rating,additional tags,category,relationship,body,count
2019-04,"teen and up audiences, not rated, general audi...","alternate universe - fantasy, game of thrones ...","m/m, f/m, f/m, multi, other, f/f, f/m, f/m, f/...","midoriya izuku/todoroki shouto, bakugou katsuk...","izuku midoriya is a green, curly-haired advent...",971
2019-05,"mature, teen and up audiences, not rated, expl...","game of thrones alternate universe, game of th...","f/f, f/m, m/m, f/m, f/f, f/m, m/m, f/f, multi,...","jon snow/daenerys targaryen, gilly (asoiaf)/sa...",\n\nprologue\n\n\n\n \n\n \n\n\nharry was figh...,2498


## Transforming the Corpora Dataframes into Text

Since the dataframes are all confirmed to be within the proper timeframes, I will now use the functions below to take the body text of each published fanfiction, tokenize it, remove stopwords and punctuation, and stem it. Then, I will use the "save_txt" function to save each corpus as a text file so it can be used in the computational text analysis notebook.
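
The time-frame check mentioned above can also be run on the slice boundaries alone. The following is a minimal sketch and not part of the original workflow: `SEASON_RANGES`, `next_month`, and `uncovered_months` are hypothetical names, and the boundary labels are copied from the slicing cell. It lists any months that sit between two consecutive slices without belonging to either.

```python
# Boundary labels copied from the slicing cell. "YYYY-MM" strings sort
# chronologically, which is why label-based slicing on them works.
SEASON_RANGES = {
    "got1_2": ("2006-08", "2013-02"),
    "got3_4": ("2013-03", "2015-03"),
    "got5_6": ("2015-07", "2017-06"),
    "got7": ("2017-07", "2019-03"),
    "got8": ("2019-04", "2019-09"),
}

def next_month(label):
    """Return the 'YYYY-MM' label that follows the given one."""
    year, month = map(int, label.split("-"))
    if month == 12:
        return f"{year + 1}-01"
    return f"{year}-{month + 1:02d}"

def uncovered_months(ranges):
    """List months that fall between consecutive ranges but in neither."""
    ordered = sorted(ranges.values())
    missing = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        month = next_month(prev_end)
        while month < next_start:
            missing.append(month)
            month = next_month(month)
    return missing

gaps = uncovered_months(SEASON_RANGES)
print(gaps)  # ['2015-04', '2015-05', '2015-06']
```

With the boundaries as written, the sketch reports 2015-04 through 2015-06 as uncovered, so it is worth confirming whether those months were left out on purpose.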

### Saving the Uncleaned Versions
To save the unclean version, I will use the function 'column_to_txt_file', which takes the text from a column, joins it into one string, and then saves that string in a file.

In [31]:
def column_to_txt_file(dataframe,columnName,filePath):
    '''
    this function takes all the information from a specific column, joins it into a string, and saves it. This is NOT for cleaning the text, just for saving the string.
    Input: the dataframe, the column name, and the file path for the new .txt file
    Output: a saved .txt file; the first 200 characters are printed as a check
    '''
    #named "text" so it does not shadow the imported string module
    text = ' '.join(dataframe[columnName].tolist())
    
    #save the .txt
    with open(filePath, "w") as file:
        file.write(text)
    
    print(text[:200])

In [33]:
column_to_txt_file(got1_2,'body','data/group_month/got_all_txt/gotS1_2_unclean.txt')
column_to_txt_file(got3_4,'body', 'data/group_month/got_all_txt/gotS3_4_unclean.txt')
column_to_txt_file(got5_6,'body', 'data/group_month/got_all_txt/gotS5_6_unclean.txt')
column_to_txt_file(got7,'body', 'data/group_month/got_all_txt/gotS7_unclean.txt')
column_to_txt_file(got8,'body', 'data/group_month/got_all_txt/gotS8_unclean.txt')

up on the wall it is impossible to be warm and the wind blows through your furs, regardless of their quality. the skies are grey and most of the black brothers are huddled within the relative shelter 
john didn't seem the kind of guy who likes reading. especially heavy literature, with complicated plots and lots of character deaths. however, john was pretty much the biggest fanboy when it came to a

  



 part one: of trips and manuscripts  



it is the wine and lobsters.



sansa spends the entire trip pouring over tyrion's manuscript of his latest novel, only dropping it down for a good glas
it is a truth universally acknowledged that a prince in possession of too many titles and a remarkable lack of responsibility must be in want of a pint. it was, therefore, no surprise that her path fi
izuku midoriya is a green, curly-haired adventurer who does nothing but adventure. his mother bothers him numerous times about going out and how he is never careful. izuku, on the other hand, clai

### Saving the Cleaned Versions

Next, I will "clean" each of the corpora and save them as the cleaned versions of the bodies of text. Here are the functions I will use:
- "column_to_token" will take all the information in one column, make it into a string, and then tokenize it.

- "cleaning" will lower all the capital letters, remove the stopwords, remove the punctuation, and stem the texts using NLTK's Porter stemmer.

- "save_txt" takes a tokenized list, joins it to make it a string, and then saves that string in a desired location.

So basically, I am taking all the text out of the "body" columns and turning it into a tokenized list. I then use the "cleaning" function to remove punctuation and stopwords and to stem each word in the list. Finally, I join the cleaned list back into a string and save this new string as a .txt file so I may access it in the future. <strong>I will provide more detail for each step below.</strong>

In [5]:
def column_to_token(dataframe,columnName):
    '''
    this function takes all the information from a specific column, joins it to a string, and then tokenizes that string.
    Input: the dataframe and the column name
    Output: a tokenized list of all the words/punctuation from that specific column
    '''
    text = ' '.join(dataframe[columnName].tolist())
    corpus_token = word_tokenize(text)
    return corpus_token

In [6]:
stopwords_ = ['``','`','--','...',"'m","''","'re","'ve",'i','me', 'my',  'myself', "“", "”", 'we', 'our', '’', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her','hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', "he's","she's","they're","they've","i've", 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'do',"n't","don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", "'t", "'s", 'wasn', "wasn't", 'weren', "weren't", "would", "could", 'won', "won't", 'wouldn', "wouldn't"]

def cleaning(token_text):
    '''
    This function takes a tokenized list of words/punctuation and applies several "cleaning" steps.
    It first lowercases everything, then removes the stop words and punctuation, and finally stems each word. The stop words are listed above, and the punctuation comes from Python's string module (in my "imports")
    Input: the tokenized list of the text
    Output: the tokenized list of the text, lowercased and stemmed, with punctuation and stop words removed
    '''
    stemmer = PorterStemmer() 
    
    text_lc = [word.lower() for word in token_text]
    text_tokens_clean = [word for word in text_lc if word not in stopwords_]
    text_tokens_clean_ = [word for word in text_tokens_clean if word not in string.punctuation]
    tokens_stemmed = [stemmer.stem(w) for w in text_tokens_clean_]
    return tokens_stemmed
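
As a side note on the filtering steps, here is a minimal sketch of the same idea using sets. It is an illustration under stated assumptions, not the function used in this notebook: `toy_stopwords`, `punctuation_chars`, and `filter_tokens` are hypothetical names, and the stopword list is shortened. It also shows two things that are easy to miss: `in` on a string is a substring test, and Python joins adjacent string literals when a comma is missing.

```python
import string

# Toy stopword list for illustration only; the real stopwords_ list is much longer.
toy_stopwords = frozenset(["up", "on", "the", "it", "is", "to", "be"])

# A set of single characters makes the punctuation test an exact match.
# Testing `word in string.punctuation` is a substring test instead, so
# the empty string and runs such as '!"' also count as "punctuation".
punctuation_chars = frozenset(string.punctuation)

def filter_tokens(tokens, stopwords=toy_stopwords):
    """Lowercase tokens, then drop stopwords and single punctuation marks."""
    lowered = (token.lower() for token in tokens)
    return [t for t in lowered if t not in stopwords and t not in punctuation_chars]

sample = ["Up", "on", "the", "wall", ",", "it", "is", "impossible", "to", "be", "warm", "."]
filtered = filter_tokens(sample)
print(filtered)  # ['wall', 'impossible', 'warm']

# Adjacent string literals are concatenated, so a missing comma merges entries.
merged = ["i've" "do", "does"]
print(len(merged))  # 2
```

Set lookups also stay fast as the stopword list grows, which can matter when a corpus holds millions of tokens.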

In [8]:
def save_txt(cleanToken,filePath):
    '''
    take the tokenized list of words, convert to a string, and save it
    input: the tokenized list of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    clean_string = " ".join(cleanToken)
    #the with block closes the file even if the write fails
    with open(filePath,"w") as file2:
        file2.write(clean_string)

#### Step 1: Creating a Token

The first function, column_to_token, takes all the text from the "body" column and joins it into a single string. A string is an ordered sequence of characters. <strong>Strings</strong> usually appear between quotation marks, which is how the computer recognizes the characters as a string. "This is what a string might look like. I can have numbers, 2, in here, but the computer will see this as a long set of characters."

I will then take my strings, which will just be all of the text from the "body" column, and tokenize them. <strong>Tokenizing</strong> takes a string and breaks it into a list of tokens. NLTK's word_tokenize splits on whitespace and also separates punctuation into its own tokens. "a string like this" is turned into ['a', 'string', 'like', 'this'], where each word becomes an item in the list.
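
Whitespace is not the only place where a tokenizer can split. As a rough illustration, the sketch below contrasts Python's `str.split()` with a simple regular-expression pattern. The pattern is a stand-in chosen for this example and is far simpler than the rules word_tokenize applies; the sentence is taken from the corpus.

```python
import re

sentence = "the wind blows through your furs, regardless of their quality."

# Plain whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sentence.split()

# Stand-in pattern (an assumption, much simpler than NLTK's rules):
# runs of word characters, or any single non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[5:7])  # ['furs,', 'regardless']
print(regex_tokens[5:8])       # ['furs', ',', 'regardless']
```

Separating the punctuation matters for the cleaning step, because a token such as 'furs,' would match neither the stopword list nor the punctuation list.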


In [9]:
got1_2token = column_to_token(got1_2, 'body')
got3_4token = column_to_token(got3_4, 'body')
got5_6token = column_to_token(got5_6, 'body')
got7token = column_to_token(got7, 'body')
got8token = column_to_token(got8, 'body')

print(got1_2token[:5])
print(got5_6token[:5])

['up', 'on', 'the', 'wall', 'it']
['part', 'one', ':', 'of', 'trips']


#### Cleaning the Token

Tokenizing will allow us to remove any unwanted items in the list. For example, I can remove the word 'like' from the list, so my list will look like ['a', 'string', 'this'].

Using the "cleaning" function, the code will go through each item in my list, lowercase it, remove it if it is a stop word or punctuation, and stem whatever remains. <strong>Stemming</strong> reduces a word that may have multiple endings to its stem. The word "talk," for example, can appear as talk, talks, talked, talking, etc. A stemmer will recognize all these variations and turn all of them into "talk." This way, the words are standardized. Note that a stem is not always a dictionary word: the Porter stemmer turns "impossible" into "imposs."

You will be able to see the difference between the tokens printed in Step 1 and the cleaned tokens. The Step 1 output still has the words 'up,' 'on,' and 'the,' which are all in the stopwords list. The cleaned tokens, however, no longer contain them.

In [None]:
got1_2clean = cleaning(got1_2token)
got3_4clean = cleaning(got3_4token)
got5_6clean = cleaning(got5_6token)
got7clean = cleaning(got7token)
got8clean = cleaning(got8token)

In [None]:
print(got1_2clean[:5])
print(got5_6clean[:5])
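
To illustrate the stemming step on its own, here is a small sketch that runs NLTK's PorterStemmer on the "talk" variants discussed above. It assumes only that NLTK is installed (the stemmer needs no extra downloads); the word list is made up for the example.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# All four surface forms collapse to the same stem.
variants = ["talk", "talks", "talked", "talking"]
stems = {stemmer.stem(word) for word in variants}
print(stems)  # {'talk'}

# A stem does not have to be a dictionary word.
print(stemmer.stem("impossible"))  # imposs
```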

#### Saving the Results

Now that there are five new, cleaned corpora, I will save each of them. The "save_txt" function takes a tokenized list, joins it back into a string, and saves the string as a .txt file.

In [11]:
save_txt(got1_2clean,'data/group_month/got_all_txt/gotS1_2_clean.txt')
save_txt(got3_4clean,'data/group_month/got_all_txt/gotS3_4_clean.txt')
save_txt(got5_6clean,'data/group_month/got_all_txt/gotS5_6_clean.txt')
save_txt(got7clean,'data/group_month/got_all_txt/gotS7_clean.txt')
save_txt(got8clean,'data/group_month/got_all_txt/gotS8_clean.txt')

### Reading in the New Files

Now that I have saved the corpora, each containing the full body texts of GoT fanfiction from a different set of seasons, I will read them back in. This way, if this notebook crashes, I can easily access the data again by just running the cells below. I mainly want to check that the read_txt function works on both the clean and unclean texts, so I will test one corpus of each kind.

In [13]:
def read_txt(filePath):
    '''
    This function reads a file (specifically a text file) and tokenizes that file
    Input: a .txt filepath of a string of words
    Output: a tokenized list of words
    '''
    #the with block closes the file once it has been read
    with open(filePath, "r") as file:
        new_string = file.read()
    corpus_token = word_tokenize(new_string)
    return corpus_token
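
One way to gain confidence in the save and read steps is a round-trip check: write a list of tokens, read it back, and compare. The sketch below is a simplified, hypothetical version of that idea. `save_tokens` and `load_tokens` are illustrative names, the file goes to a temporary folder rather than the project's data directory, and the text is split on whitespace instead of being re-tokenized with word_tokenize. The explicit `encoding="utf-8"` is an assumption on my part; it avoids depending on the platform's default encoding, which can matter because the corpus contains curly quotes.

```python
import tempfile
from pathlib import Path

def save_tokens(tokens, file_path):
    """Join tokens with single spaces and write them as UTF-8 text."""
    Path(file_path).write_text(" ".join(tokens), encoding="utf-8")

def load_tokens(file_path):
    """Read UTF-8 text back and split it on whitespace."""
    return Path(file_path).read_text(encoding="utf-8").split()

tokens = ["wall", "imposs", "warm", "wind", "blow", "“", "”"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "roundtrip_demo.txt"  # throwaway path, not a project file
    save_tokens(tokens, path)
    restored = load_tokens(path)

print(restored == tokens)  # True
```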

In [14]:
#UNCLEANED
gots_unclean_test = read_txt('data/group_month/got_all_txt/gotS1_2_unclean.txt')

#CLEANED
gots_clean_test = read_txt('data/group_month/got_all_txt/gotS1_2_clean.txt')

In [15]:
print(gots_unclean_test[:5])
print(gots_clean_test[:5])

['up', 'on', 'the', 'wall', 'it']
['wall', 'imposs', 'warm', 'wind', 'blow']
