Luke Witten and Aditi Vinod:

# Methodology

Now that we have our data (conveniently stored as CSVs), we can start analyzing it.

Storing our data in this form means that we do not need to re-scrape Reddit every time we want to analyze it, but it also means that the data has to be loaded back into Python before we can work with it.

Luckily, reading data from a CSV is not difficult. Using the function `csv_to_dict`, we can easily convert a CSV file to a dictionary in Python.


In [1]:
import csv
import gamer_words

def csv_to_dict(file_name):
    with open(file_name) as csv_file:
        reader = csv.reader(csv_file)
        word_dict = dict(reader)
    return word_dict

gamer_dictionary = csv_to_dict("gaming.csv")


# csv.reader returns every field as a string, so convert the counts to integers
for word in gamer_dictionary:
    value = gamer_dictionary[word]
    gamer_dictionary[word] = int(value)



print(f"the length of the dictionary is {len(gamer_dictionary)}")

the length of the dictionary is 17752
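
As a side note, the integer conversion can be folded into the read itself. The following is only a sketch: `rows_to_counts` and the sample rows are made up for illustration, and it assumes each row of the file holds exactly a word and a count.

```python
import csv
import io

# Hypothetical two-column contents (word, count), standing in for gaming.csv.
sample_csv = "loot,12\ngank,7\nthe,40\n"

def rows_to_counts(csv_file):
    """Read (word, count) rows and return a dict with integer counts."""
    reader = csv.reader(csv_file)
    # Skip blank or malformed rows instead of letting dict() raise on them.
    return {row[0]: int(row[1]) for row in reader if len(row) == 2}

counts = rows_to_counts(io.StringIO(sample_csv))
print(counts)
```

Reading from `io.StringIO` keeps the example self-contained; with a real file, the same function would take the object returned by `open(file_name, newline="")`.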


We now have access to a dictionary that tells us how many times a word is used in the dataset we collected, but many of these words appear only once or are typos. These results are not particularly useful as we want words that are commonly used by gamers.

We can remove words that do not show up often enough fairly simply using `remove_too_uncommon()`. Bundled into this function is also code to remove typo words that duplicate themselves and code to remove strings that are far too long, such as URLs.

In [2]:
#create a new dictionary with only words that appear 3 or more times
gamer_dictionary_1 = gamer_words.remove_too_uncommon(gamer_dictionary,3)
print(f"the length of the dictionary is {len(gamer_dictionary_1)}")


the length of the dictionary is 5061


This dataset is much smaller than the original and likely more representative of words that gamers actually say. 
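
The source of `remove_too_uncommon()` is not shown here, so the following is only a sketch of the three filters described above. The name `remove_too_uncommon_sketch`, the `max_length` cutoff of 20 characters, and the reading of "words that duplicate themselves" as a word doubled onto itself are all assumptions.

```python
def remove_too_uncommon_sketch(word_counts, min_count, max_length=20):
    """Keep words seen at least min_count times that look like real words."""
    kept = {}
    for word, count in word_counts.items():
        if count < min_count:
            continue  # too rare to be informative
        if len(word) > max_length:
            continue  # probably a URL or other run-together string
        half = len(word) // 2
        if len(word) >= 4 and len(word) % 2 == 0 and word[:half] == word[half:]:
            continue  # a word doubled onto itself, e.g. "gamegame"
        kept[word] = count
    return kept

# Made-up counts for illustration only.
sample = {"loot": 12, "gank": 2, "gamegame": 5,
          "httpswwwredditcomrgaming": 9, "the": 40}
filtered = remove_too_uncommon_sketch(sample, 3)
print(filtered)
```

The doubled-word check is crude: it would also drop legitimate words such as "mama", so a real implementation may need an exception list.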

If we want to find out which words gamers use most, then all we need to do is find which words appear most frequently in the dataset. Once we have these "gamer words" we can compare them against a user's post history to find out if they are a gamer or not.

Let's run `find_most_frequent()` to find the 5 most frequently occurring "gamer words".

In [3]:
gamer_dictionary_2 = gamer_words.find_most_frequent(gamer_dictionary_1,5)
print(f"the five most frequent gamer words are {gamer_dictionary_2}")


the five most frequent gamer words are {'the': 6797, 'to': 4351, 'and': 3528, 'a': 3472, 'of': 3008}


Looking at the words that appeared we can see that something is obviously wrong. While there is no doubt that gamers use words like "a", "and", and "the" frequently, nobody would be fooled into believing that these words are unique to the gamer vocabulary. 
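
As an aside, a helper that picks the top entries of a count dictionary can be written in a few lines with `collections.Counter`. This is a sketch with a made-up name and made-up counts, not necessarily how `find_most_frequent()` is implemented.

```python
from collections import Counter

def find_most_frequent_sketch(word_counts, n):
    """Return the n highest-count words as a dict, most frequent first."""
    return dict(Counter(word_counts).most_common(n))

# Made-up counts for illustration only.
sample = {"loot": 12, "gank": 7, "the": 40, "a": 25}
top_two = find_most_frequent_sketch(sample, 2)
print(top_two)
```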

To fully determine what words are unique to the gamer vocabulary, we will need to somehow compare the language dataset from gamers to a language dataset for non-gamers.

For our convenience, this data has already been scraped and is stored in `normal.csv`.


To filter out words that are common to both dictionaries, we can go through each word in the gamer dictionary and compare how frequently it occurs in each. If a word is used at a similar rate in both language sets, we remove it from both.

Unfortunately, our dictionaries currently store the raw number of times each word appears, not the fraction of the whole language set that the word accounts for. Raw counts are hard to compare when two data sets are different sizes. The conversion is simple to code, though, and it exists in the form of the `instances_to_decimal` function.

In [4]:
#create and store the normal dictionary
normal_dictionary = csv_to_dict("normal.csv")
for word in normal_dictionary:
    value = normal_dictionary[word]
    normal_dictionary[word] = int(value)

#remove infrequent words
normal_dictionary_1 = gamer_words.remove_too_uncommon(normal_dictionary,3)

gamer_decimal_dictionary = gamer_words.instances_to_decimal(gamer_dictionary_1)
normal_decimal_dictionary = gamer_words.instances_to_decimal(normal_dictionary_1)

the_usages = gamer_dictionary_1["the"]
the_decimal = gamer_decimal_dictionary["the"]
print(f"the is used {the_usages} times in the gamer dictionary ")
print(f"the is used {the_decimal} of the time in the gamer language set")

the is used 6797 times in the gamer dictionary 
the is used 0.0482714050338049 of the time in the gamer language set
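
The conversion itself is just each count divided by the total number of word occurrences. A minimal sketch, assuming that is all `instances_to_decimal` does (the function name and sample counts below are made up):

```python
def instances_to_decimal_sketch(word_counts):
    """Convert raw counts to each word's share of all word occurrences."""
    total = sum(word_counts.values())
    if total == 0:
        return {}  # avoid dividing by zero on an empty dictionary
    return {word: count / total for word, count in word_counts.items()}

# Made-up counts for illustration only.
sample = {"the": 6, "loot": 3, "gank": 1}
fractions = instances_to_decimal_sketch(sample)
print(fractions)
```

Because every value is divided by the same total, the resulting fractions sum to 1, which is what makes two language sets of different sizes comparable.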


Now that we have our data in a usable form, we can parse through our two language sets and remove words that appear at a similar rate in both. The exact cutoff is arbitrary, but we found that removing words whose frequencies are within ±15% of each other left a good set of words.

We can use the function `remove_most_common()` to parse our two dictionaries with respect to each other and even output a list of words that were removed from both sets.


In [5]:
normal_decimal_dictionary_1, gamer_decimal_dictionary_1, ignore_list = gamer_words.remove_most_common(
    normal_decimal_dictionary, gamer_decimal_dictionary
)
print(f"the length of the dictionary is {len(gamer_decimal_dictionary_1)}")


the length of the dictionary is 4334
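
The ±15% rule can be read in more than one way. The sketch below assumes it means the gamer frequency lies within 15% of the normal frequency, measured relative to the normal frequency. The function name, the `tolerance` parameter, and the sample frequencies are hypothetical.

```python
def remove_most_common_sketch(normal_freqs, gamer_freqs, tolerance=0.15):
    """Drop words whose frequency is nearly the same in both language sets."""
    ignore_list = [
        word for word, gamer_freq in gamer_freqs.items()
        if word in normal_freqs
        and abs(gamer_freq - normal_freqs[word]) <= tolerance * normal_freqs[word]
    ]
    ignored = set(ignore_list)
    normal_kept = {w: f for w, f in normal_freqs.items() if w not in ignored}
    gamer_kept = {w: f for w, f in gamer_freqs.items() if w not in ignored}
    return normal_kept, gamer_kept, ignore_list

# Made-up frequencies for illustration only.
normal = {"the": 0.050, "and": 0.030, "game": 0.001}
gamer = {"the": 0.048, "and": 0.020, "game": 0.010, "gank": 0.002}
normal_kept, gamer_kept, ignored = remove_most_common_sketch(normal, gamer)
print(ignored)
```

Here only "the" is removed: its two frequencies differ by 4%, while "and" and "game" differ by far more than 15%.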


While we now have two curated language sets, one for the gamer language set and one for the normal language set, we still do not have a method for determining what words are extremely specific to the gamer lexicon. 

To do this, we can examine all words that appear in both the gamer and the normal dataset. If a word is used 5 times more frequently in the gamer language set than in the normal language set, we classify it as a gamer word. This 5x threshold is undoubtedly arbitrary, but in testing we found that it produced a good number of gamer words that were neither overly specific nor too common to be considered gamer specific.

Some candidate gamer words do not appear in the normal language set at all, so we need a separate rule for them. In testing, we found that a word that appears only in the gamer set and has a frequency above 0.00005 can reasonably be considered a gamer word. This value of 0.00005 is also arbitrary, but using it produces a solid set of gamer words.

Using our curated data sets, we can find a number of gamer words using the `determine_gamer_words()` function.

In [6]:
gamer_words_1 = gamer_words.determine_gamer_words(normal_decimal_dictionary_1,gamer_decimal_dictionary_1)

print(f"there are {len(gamer_words_1)} gamer words")

# uncomment to inspect the full list of gamer words
#print(gamer_words_1)

there are 345 gamer words
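
The two rules described above (the 5x ratio for shared words and the 0.00005 floor for words missing from the normal set) can be sketched as follows. This is an illustration under assumptions, not the actual `determine_gamer_words()` code: the function name, the parameter names, and the choice of `>=` for the ratio test are guesses.

```python
def determine_gamer_words_sketch(normal_freqs, gamer_freqs,
                                 ratio=5, min_unmatched_freq=0.00005):
    """Return words that are far more typical of the gamer set."""
    gamer_words = []
    for word, gamer_freq in gamer_freqs.items():
        if word in normal_freqs:
            # shared word: require it to be much more frequent among gamers
            if gamer_freq >= ratio * normal_freqs[word]:
                gamer_words.append(word)
        elif gamer_freq > min_unmatched_freq:
            # gamer-only word: require a minimum frequency to filter out noise
            gamer_words.append(word)
    return gamer_words

# Made-up frequencies for illustration only.
normal = {"game": 0.001, "and": 0.030, "i": 0.024}
gamer = {"game": 0.010, "and": 0.020, "i": 0.015, "gank": 0.002, "zzyx": 0.00001}
found = determine_gamer_words_sketch(normal, gamer)
print(found)
```

In this toy data, "game" passes the ratio test, "gank" passes the floor for gamer-only words, and "zzyx" is too rare to count.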


In [7]:
normal_final,gamer_final, gamer_words_final, ignore_final = gamer_words.parse_words(normal_dictionary,gamer_dictionary,3)

In [8]:
# compare how often "i" is used in the gamer set relative to the normal set
print(gamer_final["i"]/normal_final["i"])
print(gamer_final["i"])
print(normal_final["i"])

0.6064562587978671
0.014636952445883757
0.02413521541503668


In [9]:
print(len(gamer_words_final))
print((gamer_words_final))

345
['ea', 'battlefront', 'base', 'loot', 'progression', 'battlefield', 'game', 'fifa', 'profile', 'himanshu', 'delete', 'select', 'cards', 'stats', 'franchise', 'releases', 'soul', 'apology', 'gameplay', 'quote', 'differences', 'league', 'legends', 'player', 'tyler1', 'players', 'item', 'clg', 'tsm', 'championship', 'na', 'seed', 'match', 'vs', 'winner', 'bans', 'azir', 'kalista', 'lulu', 'towers', 'yasuo', 'phase', 'champion', 'teammates', 'elo', 'ranked', 'shiny', 'skins', 'champions', 'garen', 'ingame', 'riot', 'casual', 'summoner', 'x', 'lane', 'rank', 'akali', 'challenger', 'jungler', 'u', 'client', 'patch', 'queue', 'solo', 'minions', 'tp', 'gank', 'champ', 'aphelios', 'e', 'ult', 'kills', 'lux', 'g2', 'esports', 'congratulations', 'rng', '1bans', 'xin', 'hmtherald2', 'mmtmountain3', 'bmtbarons5', 'wunder', 'karsa', 'jankos', 'perkz', 'uzi', 'bmtbarons4', 'omtocean3', 'imtinfernal5', 'sincleesin', 'cmtcloud3', 'imtinfernal4', 'omtocean5', 'bmtbarons6', 'postmatch', 'cloud9', 'c9

In [10]:
import os

# build the path to each user's word-count CSV
file_list = ["suite_life_data/" + user for user in os.listdir("suite_life_data")]

user_value_dict = {}
for user in file_list:
    # load the user's word counts and convert them to integers
    user_dictionary = csv_to_dict(user)
    for word in user_dictionary:
        value = user_dictionary[word]
        user_dictionary[word] = int(value)

    # apply the same cleaning steps used for the gamer and normal sets
    user_dictionary = gamer_words.remove_too_uncommon(user_dictionary.copy(),3)
    user_dictionary = {word:value for (word,value) in user_dictionary.items() if word not in ignore_final}
    user_dictionary = gamer_words.instances_to_decimal(user_dictionary.copy())

    # store [similarity to gamer set, similarity to normal set] for this user
    swap_list = []
    swap_list.append(gamer_words.determine_language_similarity(gamer_final.copy(),user_dictionary.copy()))
    swap_list.append(gamer_words.determine_language_similarity(normal_final.copy(),user_dictionary.copy()))

    user_value_dict[user] = swap_list

print(user_value_dict)

    



{'suite_life_data/adp1030.csv': [0.11572542788529089, 0.11424546162373007], 'suite_life_data/Bonk.csv': [0.0714009387716252, 0.06638096321951123], 'suite_life_data/cLA.csv': [0.09218463884576775, 0.08745971337158197], 'suite_life_data/crane.csv': [0.10965551354141385, 0.10250149598617035], 'suite_life_data/Firefight451.csv': [0.23558957867261746, 0.22996696261987495], 'suite_life_data/Jiangster.csv': [0.3032867726579211, 0.295173016941167], 'suite_life_data/Kelly Stellmacher.csv': [0.12824787035401852, 0.12132498785079898], 'suite_life_data/Misfortune123.csv': [0.07113193357387446, 0.06599138649080756], 'suite_life_data/notbot.csv': [0.09457795709288658, 0.08741476731800797], 'suite_life_data/pieguy.csv': [0.08083238727727893, 0.07906097176597826], 'suite_life_data/Sier.csv': [0.06969730314834649, 0.06488058468672768]}
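
The raw dictionary is hard to read at a glance. One possible way to summarize it is to look at the gap between each user's two scores. The sketch below assumes that a higher value from `determine_language_similarity` means a closer match, which the cells above do not confirm; `summarize_scores` is a hypothetical helper.

```python
def summarize_scores(user_scores):
    """Return (user, gamer_score - normal_score) pairs, largest gap first."""
    gaps = {user: gamer - normal for user, (gamer, normal) in user_scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

# Three of the pairs printed above, rounded to four places for readability.
scores = {
    "Jiangster": [0.3033, 0.2952],
    "adp1030": [0.1157, 0.1142],
    "Sier": [0.0697, 0.0649],
}
ranking = summarize_scores(scores)
for user, gap in ranking:
    print(f"{user}: {gap:+.4f}")
```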
