# IndieP: Predicting the Success of Indie Games on Steam

# Part 3: Data Exploration

<div align="center"><img src="./data_exploration.jpg" title="Image from activestate.com" width="50%"/></div>

So. We've got a lot of data. We've got titles, genres, release dates, availability of external media sources, supported platforms and languages, and prices. Then we've got positive and negative ratings, estimated number of owners, number of people playing the game at the same time, and statistics on how long people have spent playing each game.

What do we do with it? How do we use all of this information to actually answer the question "What makes an indie game successful?"

**While Machine Learning is at the forefront of essential Data Science methodologies, there's a lot one can do to understand their data without (or before) applying ML techniques.** In Part 2 of this project, we cleaned and investigated each of the features in our dataset individually, but one question that we could ask is: "How do the features correlate with each other?" For example, how does the price of a game correlate with features such as positive and negative ratings, number of owners, or playtime? One could even ask how features such as the number of languages, platforms, or achievements correlate with the price. Are games with more of these attributes typically more expensive? Are there categories or genres that are more expensive? We could further investigate questions such as: is there a certain time of the year when more successful games are released?
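As a rough illustration of the kind of question posed above, the sketch below computes a rank correlation between price and a few other features. The `toy` DataFrame and its values are made up for illustration and are not taken from the real dataset; Spearman correlation is an assumption here, chosen because count features such as ratings tend to be heavy-tailed.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "price": [0.0, 4.99, 9.99, 14.99, 19.99],
    "n_languages": [1, 2, 5, 8, 10],
    "positive_ratings": [44, 508, 1200, 3100, 9000],
    "negative_ratings": [18, 168, 150, 400, 700],
})

# Spearman works on ranks, so a few huge rating counts do not dominate
# the result the way they can with the default Pearson correlation.
corr = toy.corr(method="spearman")

# Correlation of every other feature with price, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

On the real data, the same call on the numeric columns would give a first overview of which features move together with price.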

In [1]:
import csv
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
from ast import literal_eval
import re

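# Use the full option key; the short form "max_columns" is ambiguous in newer pandas versions.
pd.set_option("display.max_columns", 1000)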

In [4]:
data_full = pd.read_csv('./data/indie_games_steam.csv')

In [5]:
data_full.head(2)

Unnamed: 0,name,appid,developers,publishers,price,is_free,platforms,release_date,required_age,n_languages,n_packages,has_website,achievements,categories,genres,positive_ratings,negative_ratings,owners,average_playtime,median_playtime,ccu,tags
0,Rag Doll Kung Fu,1002,Mark Healey,Mark Healey,0,0,Windows,10/12/2005,0,1,1,1,0,Single-player; Multi-player,Indie,44,18,0 - 20000,0,0,0,2D Fighter; Martial Arts; Multiplayer; Intenti...
1,Darwinia,1500,Introversion Software,Introversion Software,0,0,Windows; Mac; Linux,07/14/2005,0,2,4,1,0,Single-player,Indie; Strategy,508,168,500000 - 1000000,9,13,5,Strategy; Indie; RTS; Singleplayer; Retro; Sto...


### 3.1 Additional Feature Cleaning

In Part 2, we created a clean dataset of indie games including all of the features that may be in some way correlated with the success of a game. However, to prepare our data for Machine Learning algorithms and dimensionality-reduction techniques such as Principal Component Analysis, we need to take care of a few more things:

1. Ensure all numerical data is provided as float or int
2. Bin continuous numerical data where appropriate
3. Decide how to deal with outliers
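
The three steps above could look something like the following minimal sketch. The `raw` DataFrame is a hypothetical stand-in for the real data, the bin edges and labels are arbitrary, and capping at Q3 + 1.5 × IQR is only one of several reasonable ways to handle outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the raw columns; not taken from the real file.
raw = pd.DataFrame({
    "price": ["0", "4.99", "9.99", "n/a", "59.99"],
    "owners": ["0 - 20000", "500000 - 1000000", "0 - 20000",
               "20000 - 50000", "0 - 20000"],
    "average_playtime": [0, 9, 120, 45, 95000],
})

# 1. Force numeric dtypes; unparseable strings become NaN instead of raising.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# The owners column is a range string, so keep its lower bound as an int.
raw["owners_min"] = raw["owners"].str.split(" - ").str[0].astype(int)

# 2. Bin a continuous feature into labelled price bands (edges are arbitrary).
raw["price_band"] = pd.cut(
    raw["price"],
    bins=[-0.01, 0, 5, 15, np.inf],
    labels=["free", "under_5", "5_to_15", "over_15"],
)

# 3. One option for outliers: cap values above Q3 + 1.5 * IQR.
q1, q3 = raw["average_playtime"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
raw["average_playtime_capped"] = raw["average_playtime"].clip(upper=upper)

print(raw.dtypes)
```

Capping keeps every row in the dataset, which matters when the extreme values belong to otherwise informative games; dropping or log-transforming them are alternatives worth comparing.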

Additionally, for the `release_date` feature, we intentionally left in missing data because we didn't want to remove thousands of rows that were otherwise useful. To account for this, we will create a subset of our dataset with the rows missing a release date removed, and use that subset only when analyzing the effect of the release date on other features.
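
A minimal sketch of building that date-only subset is shown below. The `games` DataFrame and the names `games_dated` and `release_month` are hypothetical; the `%m/%d/%Y` format is an assumption based on the month-first date strings in the `head()` output above.

```python
import pandas as pd

# Hypothetical rows; dates follow the MM/DD/YYYY strings seen in the data.
games = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "release_date": ["10/12/2005", None, "07/14/2005", "not a date"],
    "positive_ratings": [44, 10, 508, 3],
})

# Parse dates; anything missing or malformed becomes NaT.
games["release_date"] = pd.to_datetime(
    games["release_date"], format="%m/%d/%Y", errors="coerce"
)

# Date-only subset: the full frame is left untouched for other analyses.
games_dated = games.dropna(subset=["release_date"]).copy()
games_dated["release_month"] = games_dated["release_date"].dt.month

# Median positive ratings per release month, as a first seasonal check.
print(games_dated.groupby("release_month")["positive_ratings"].median())
```

Keeping the subset as a separate copy means the rows without a release date remain available for every analysis that does not depend on it.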

In [12]:
data_full.dtypes

name                object
appid                int64
developers          object
publishers          object
price               object
is_free              int64
platforms           object
release_date        object
required_age         int64
n_languages          int64
n_packages           int64
has_website          int64
achievements         int64
categories          object
genres              object
positive_ratings     int64
negative_ratings     int64
owners              object
average_playtime     int64
median_playtime      int64
ccu                  int64
tags                object
dtype: object

In [15]:
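# Draw a 10% random sample; a fixed seed keeps the sample reproducible between runs.
data_sample = data_full.sample(frac=0.1, random_state=42)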

In [16]:
len(data_sample)

2173

### 3.2 Feature Engineering

In addition to making sure our features are appropriately cleaned, we can also use human intuition to derive new features from existing ones so that the data can be used more effectively. Some examples of features we may want to create:

1. `ratings_ratio`: ratio of positive to negative ratings - if greater than 1, the game has more positive than negative ratings (undefined when a game has no negative ratings)
2. `ratings_net_pos`: binary - 1 if `ratings_ratio` is greater than 1, 0 otherwise.
3. `total_ratings`: total number of positive and negative ratings
4. `n_dev_games`: total number of games created by the developer of the game
5. `n_platforms`: number of platforms on which the game is available
6. `is_single_player`: 1 if has single-player option, 0 otherwise
7. `is_multi_player`: 1 if has multi-player option, 0 otherwise
8. `n_genres`: number of genres listed
9. `n_categories`: number of categories listed
10. `n_tags`: number of tags listed
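
The features listed above could be derived roughly as follows. This is a sketch on a hypothetical three-row DataFrame, assuming that multi-valued columns are stored as `"; "`-separated strings as in the `head()` output; `count_items` is a helper introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same shape as the cleaned dataset.
df = pd.DataFrame({
    "developers": ["Dev A", "Dev B", "Dev A"],
    "platforms": ["Windows", "Windows; Mac; Linux", "Windows; Mac"],
    "categories": ["Single-player; Multi-player", "Single-player", "Multi-player"],
    "genres": ["Indie", "Indie; Strategy", "Indie; Action; RPG"],
    "tags": ["2D Fighter; Martial Arts", "Strategy; Indie; RTS", "Action"],
    "positive_ratings": [44, 508, 10],
    "negative_ratings": [18, 168, 0],
})


def count_items(col):
    """Count entries in a '; '-separated string column."""
    return col.str.split("; ").str.len()


df["total_ratings"] = df["positive_ratings"] + df["negative_ratings"]

# Zero negative ratings would divide by zero, so treat them as missing here.
df["ratings_ratio"] = (
    df["positive_ratings"] / df["negative_ratings"].replace(0, np.nan)
)

# Equivalent to ratings_ratio > 1, but also defined when negatives are zero.
df["ratings_net_pos"] = (
    df["positive_ratings"] > df["negative_ratings"]
).astype(int)

# Number of games in the dataset by the same developer.
df["n_dev_games"] = df["developers"].map(df["developers"].value_counts())

df["n_platforms"] = count_items(df["platforms"])
df["n_genres"] = count_items(df["genres"])
df["n_categories"] = count_items(df["categories"])
df["n_tags"] = count_items(df["tags"])

df["is_single_player"] = (
    df["categories"].str.contains("Single-player", regex=False).astype(int)
)
df["is_multi_player"] = (
    df["categories"].str.contains("Multi-player", regex=False).astype(int)
)

print(df.filter(regex="^(n_|is_|ratings_|total_)").head())
```

Note that `ratings_net_pos` is computed from the raw counts, not from `ratings_ratio`, so a game with positive ratings and no negative ones is still flagged as net positive.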

### 3.3 Investigating Feature Correlations

### Bivariate Feature Analysis

### Correlation Matrices

### 3.4 Feature Encoding 

### 3.5 Feature Importance

### Principal Component Analysis

In [19]:
# Interaction between pairs of features.
#sns.pairplot(data_sample[['n_languages', 'achievements', 'ccu']], 
#             hue="ccu", 
#             diag_kind="kde",
#             height=4);