# Steam Description Analysis
### Analyze the short descriptions of English-language Steam games to find the most popular common words.  
A word's popularity is the average of the highest average player counts of the games whose descriptions use that word.  
A common word is a word whose frequency is at or above the average word frequency.
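
As a minimal sketch of these two definitions, the snippet below applies them to made-up data. The words, the `players` column, and all numbers are invented for illustration and do not come from the Steam dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per (game, word) pair.
toy = pd.DataFrame({
    "word": ["build", "build", "build", "fight", "fight", "rare"],
    "players": [100, 300, 200, 50, 150, 9000],
})

# Frequency and popularity (mean player count) for each word.
stats = toy.groupby("word")["players"].agg(frequency="size", popularity="mean")

# A word is "common" if its frequency is at or above the mean frequency.
mean_frequency = stats["frequency"].mean()  # (3 + 2 + 1) / 3 = 2.0
common = stats[stats["frequency"] >= mean_frequency]

print(common.sort_values("popularity", ascending=False))
```

In this toy case "rare" has the highest popularity but appears only once, so the commonality filter excludes it. That is the reason for combining the two definitions.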


### What is the question?
I wanted to learn which words are most common in the descriptions of successful Steam games.

### What was the approach?
I used two Steam data sets from Kaggle to collect the short descriptions and the highest average player count.

### What problems did I encounter?
* I had to define the popularity and commonality of a word.
* For popularity, I used the highest average player count of each game and averaged it over all occurrences of a word.
* I defined a common word as one whose frequency is at or above the average of all word frequencies.
* I had to clean the descriptions and remove games whose descriptions were in Japanese or other non-ASCII languages.
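
The ASCII check is only a rough proxy for "English". The sketch below uses invented sample strings (the first one is taken from the store data shown later) to show what `str.isascii()` keeps and drops.

```python
samples = [
    # Plain ASCII English: kept.
    "Play the world's number 1 online action game.",
    # English with an accented letter: dropped.
    "A caf\u00e9 management sim",
    # English with typographic (curly) quotes: dropped.
    "Rock \u2019n\u2019 roll racing",
    # Japanese: dropped.
    "\u65e5\u672c\u8a9e\u306e\u8aac\u660e",
]

# str.isascii() is True only if every character is in the ASCII range.
kept = [s for s in samples if s.isascii()]
print(kept)
```

The filter therefore also removes some English descriptions, and it keeps non-English descriptions that happen to use only ASCII characters.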

### What new ideas did this generate?
I found it interesting to see which words successful games use to describe themselves. The results could be useful for writing new descriptions.


In [None]:
import pandas as pd
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_md")

### Setup
Download steam data from https://www.kaggle.com/datasets/souyama/steam-dataset?resource=download 

### Organize steam store data
* collect appID, name, short description, and type
* remove duplicate entries with the same appID
* remove non-English games (approximated by dropping descriptions that contain non-ASCII characters)

In [4]:
store_data_path = "./steam_dataset/appinfo/store_data/steam_store_data.json"
store_df = pd.read_json(store_data_path, orient="index")
store_df = store_df[["steam_appid", "name", "short_description", "type"]]
store_df = store_df.drop_duplicates(subset='steam_appid', keep="first")
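# Keep only descriptions that are strings of ASCII characters (a rough proxy for English);
# the isinstance check avoids an AttributeError on missing (NaN) descriptions
store_df = store_df[store_df["short_description"].apply(lambda x: isinstance(x, str) and x.isascii())]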
store_df = store_df.sort_values(by="steam_appid", ascending=True)
store_df = store_df.reset_index(drop=True)
store_df.head()

| | steam_appid | name | short_description | type |
|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | game |
| 1 | 20 | Team Fortress Classic | One of the most popular online action games of... | game |
| 2 | 30 | Day of Defeat | Enlist in an intense brand of Axis vs. Allied ... | game |
| 3 | 40 | Deathmatch Classic | Enjoy fast-paced multiplayer gaming with Death... | game |
| 4 | 50 | Half-Life: Opposing Force | Return to the Black Mesa Research Facility as ... | game |


### Organize steam spy data
* collect appID and `average_forever` (the highest average player count recorded)

In [5]:
steamspy_data_path = "./steam_dataset/steamspy/basic/steam_spy_scrap.json"
spy_df = pd.read_json(path_or_buf=steamspy_data_path, orient='index')
spy_df = spy_df[['appid', 'average_forever']]
spy_df = spy_df.rename(columns={"appid" : "steam_appid"})
spy_df = spy_df.sort_values(by='steam_appid', ascending=True)
spy_df = spy_df.reset_index(drop=True)

spy_df.head(5)

| | steam_appid | average_forever |
|---|---|---|
| 0 | 10 | 10639 |
| 1 | 20 | 1064 |
| 2 | 30 | 402 |
| 3 | 40 | 875 |
| 4 | 50 | 952 |


### Combine data frames from both data sets based on appID
* remove entries that are not games
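
A minimal sketch of this step on toy frames follows. The appIDs, types, and `average_forever` values are invented, and the `validate` argument is an optional safeguard that the original cell does not use.

```python
import pandas as pd

# Hypothetical stand-ins for store_df and spy_df.
store = pd.DataFrame({"steam_appid": [10, 20, 30], "type": ["game", "dlc", "game"]})
spy = pd.DataFrame({"steam_appid": [10, 20, 40], "average_forever": [500, 60, 75]})

# merge() defaults to an inner join, so apps missing from either frame are dropped.
# validate="one_to_one" raises MergeError if an appID is duplicated on either side.
merged = store.merge(spy, on="steam_appid", how="inner", validate="one_to_one")

# Keep only rows typed as games, then drop the now-constant column.
games = merged[merged["type"] == "game"].drop(columns=["type"])
print(games)
```

Here appID 30 and 40 disappear in the join because each exists in only one frame, and appID 20 is removed by the type filter.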

In [14]:
combined_df = store_df.merge(spy_df, on="steam_appid")
combined_df = combined_df[combined_df['type'] == "game"]
combined_df = combined_df.drop(columns = ['type'])
combined_df.head(5)

| | steam_appid | name | short_description | average_forever |
|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 |
| 1 | 20 | Team Fortress Classic | One of the most popular online action games of... | 1064 |
| 2 | 30 | Day of Defeat | Enlist in an intense brand of Axis vs. Allied ... | 402 |
| 3 | 40 | Deathmatch Classic | Enjoy fast-paced multiplayer gaming with Death... | 875 |
| 4 | 50 | Half-Life: Opposing Force | Return to the Black Mesa Research Facility as ... | 952 |


### Analyze descriptions
* use spaCy to parse the descriptions and reduce each word to its lemma
* a lemma is the most basic form of a word, for example: watched -> watch
* remove stop words, which are particles and other words that carry little meaning on their own
* remove tokens that are not purely alphabetic
* lowercase all words for uniformity

In [15]:
lemma = []

for doc in nlp.pipe(combined_df['short_description'].astype(str).values, batch_size=1000, n_process=6):
    if doc.has_annotation("DEP"):

        filtered_lemmas = []
        for item in doc:
            if not item.is_stop and item.is_alpha:
                filtered_lemmas.append(item.lemma_.lower())

        lemma.append(filtered_lemmas)

    else:
        # Keep the list of parsed results the same length as the DataFrame
        # by appending a placeholder when the parse fails

        lemma.append(None)


combined_df['description_lemma'] = lemma

combined_df.head()

| | steam_appid | name | short_description | average_forever | description_lemma |
|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 | [play, world, number, online, action, game, en... |
| 1 | 20 | Team Fortress Classic | One of the most popular online action games of... | 1064 | [popular, online, action, game, time, team, fo... |
| 2 | 30 | Day of Defeat | Enlist in an intense brand of Axis vs. Allied ... | 402 | [enlist, intense, brand, axis, allied, teampla... |
| 3 | 40 | Deathmatch Classic | Enjoy fast-paced multiplayer gaming with Death... | 875 | [enjoy, fast, pace, multiplayer, game, deathma... |
| 4 | 50 | Half-Life: Opposing Force | Return to the Black Mesa Research Facility as ... | 952 | [return, black, mesa, research, facility, mili... |


### Explode
* create a new data frame with new rows for each word
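
As a small illustration of how `DataFrame.explode` treats the different kinds of values in `description_lemma`, here is a sketch on an invented three-row frame (the names and numbers are made up).

```python
import pandas as pd

# Hypothetical rows: a normal lemma list, an empty list, and a failed parse (None).
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "average_forever": [10, 20, 30],
    "description_lemma": [["play", "world"], [], None],
})

# Each list element becomes its own row; the other columns and the index are repeated.
# An empty list becomes a single NaN row, and a None value is passed through as one row.
exploded = toy.explode("description_lemma")
print(exploded)

# Rows without a word can be dropped explicitly before counting.
words_only = exploded.dropna(subset=["description_lemma"])
```

Because the original index is repeated, the exploded frame has duplicate index labels, which matches the repeated `0` in the output of the real data. `groupby` skips missing keys by default, so the NaN and None rows do not turn into words in the later statistics.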

In [16]:
word_df = combined_df.explode('description_lemma')

word_df.head()

| | steam_appid | name | short_description | average_forever | description_lemma |
|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 | play |
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 | world |
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 | number |
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 | online |
| 0 | 10 | Counter-Strike | Play the world's number 1 online action game. ... | 10639 | action |


### Word statistics
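* combine occurrences of the same word, recording each word's frequency and the average of its popularity number (highest average player count)
* keep only common words (frequency at or above the average word frequency) and sort them by popularity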

In [17]:
# Count the rows per lemma and average the popularity number in one named aggregation
lemma_avg_df = (
    word_df.groupby('description_lemma')['average_forever']
    .agg(lemma_occurence='size', forever_avg_avg_for_lemma='mean')
    .rename_axis('lemma_word')
    .reset_index()
)
lemma_avg_df = lemma_avg_df[lemma_avg_df['lemma_occurence'] >= lemma_avg_df['lemma_occurence'].mean()]
lemma_avg_df = lemma_avg_df.sort_values(by="forever_avg_avg_for_lemma", ascending=False)
lemma_avg_df.head(5)

| | lemma_word | lemma_occurence | forever_avg_avg_for_lemma |
|---|---|---|---|
| 13507 | greedy | 31 | 30780 |
| 34452 | valuable | 83 | 22943 |
| 11896 | flying | 42 | 22630 |
| 25972 | realize | 106 | 18063 |
| 11051 | fade | 26 | 15155 |


### Finished Results

In [27]:
pd.set_option("display.precision", 0)

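# Rename the columns for presentation. The earlier two-column selection was overwritten
# by this assignment and had no effect, so it is dropped; lemma_occurence stays in the output.
finished_df = lemma_avg_df.rename(columns={"lemma_word": "Word", "forever_avg_avg_for_lemma": "Average popularity"})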
finished_df = finished_df.reset_index(drop=True)

finished_df.head(50)

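| | Word | lemma_occurence | Average popularity |
|---|---|---|---|
| 0 | greedy | 31 | 30780 |
| 1 | valuable | 83 | 22943 |
| 2 | flying | 42 | 22630 |
| 3 | realize | 106 | 18063 |
| 4 | fade | 26 | 15155 |
| 5 | pick | 337 | 11376 |
| 6 | debt | 35 | 11312 |
| 7 | trail | 96 | 9983 |
| 8 | item | 708 | 8151 |
| 9 | worry | 49 | 8082 |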
