# Steam Games

## Can a predictive model be built to estimate whether, or to what degree, a game is acquired, based on its features, scores, and hours played?

---

## Participants

* David Velasquez
* David Venté
* Gerson Yarce

---
## Index

* Environment Setup
* Data Preparation
* EDA

## 1. Environment Setup

### Installing missing libraries

In [1]:
!pip install pandas numpy matplotlib seaborn



### Module import helper

In [2]:
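import importlib.util

def import_module_from_path(module_name, path):
    """Load a Python module from an explicit file path and return it."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    # spec is None when the path cannot be mapped to a loadable module
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module {module_name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module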
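
As a quick illustration of how the helper behaves, the sketch below writes a throwaway module to a temporary directory and loads it by path. The module name `demo_settings` and its contents are made up for the example; the helper is repeated so the block runs on its own.

```python
import importlib.util
import tempfile
from pathlib import Path


def import_module_from_path(module_name, path):
    # Same helper as above, repeated so this block is self-contained.
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Hypothetical module, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = Path(tmp_dir) / "demo_settings.py"
    module_path.write_text(
        "DB_NAME = 'etl_project'\n\ndef greet():\n    return 'loaded'\n"
    )
    demo_settings = import_module_from_path("demo_settings", module_path)

# The module object stays usable after the file is gone.
loaded_db_name = demo_settings.DB_NAME
greeting = demo_settings.greet()
```

Note that a module loaded this way is not registered in `sys.modules`, so a later `import demo_settings` would not find it unless it is added there explicitly.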

### Import DB Connection

In [3]:
connect_database = import_module_from_path("connect_database", "../code/connect_database.py")

### Loading libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## 2. Data Preparation

### Setup DB Connection

In [5]:
connection = connect_database.ConnectionPostgres()

In [6]:
connection.engine

Engine(postgresql://postgres:***@localhost:32771/etl_project)
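
`connect_database.py` is not shown here, so the following is only a guess at the kind of URL that `ConnectionPostgres` might hand to SQLAlchemy, based on the engine repr above. The function name, its parameters, and the use of the `PGPASSWORD` environment variable are assumptions for illustration.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    # Hypothetical helper: assembles a SQLAlchemy-style PostgreSQL URL.
    # quote_plus escapes characters such as '@' or '/' in the credentials.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Read the password from the environment rather than hard-coding it.
password = os.environ.get("PGPASSWORD", "change-me")
url = build_postgres_url("postgres", password, "localhost", 32771, "etl_project")
```

A URL of this shape could then be passed to `sqlalchemy.create_engine`; how the project actually stores its credentials is not visible from this notebook.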

### Loading Data

In [7]:
COLUMNS = [
    "id",
    "name",
    "release_date",
    "estimated_owners",
    "peak_ccu",
    "required_age",
    "price",
    "downloable_content_count",
    "supported_languages",
    "full_audio_languages",
    "reviews",
    "website",
    "support_url",
    "support_email",
    "windows",
    "mac",
    "linux",
    "metacritic_score",
    "metacritic_url",
    "user_score",
    "positive",
    "negative",
    "score_rank",
    "achievements",
    "recommendations",
    "average_playtime_forever_minute",
    "average_playtime_two_weeks_minute",
    "median_playtime_forever_minute",
    "median_playtime_two_weeks_minute",
    "developers",
    "publishers",
    "categories",
    "genres",
    "tags"
]


In [8]:
raw_games = pd.read_sql_table(
    "raw_games",
    con=connection.engine,
    columns=COLUMNS,
)

### Taking a look at the data structure

The first five rows show that each row represents one game, described by **34 columns**.


In [9]:
raw_games.head()

| | id | name | release_date | estimated_owners | peak_ccu | required_age | price | downloable_content_count | supported_languages | full_audio_languages | ... | recommendations | average_playtime_forever_minute | average_playtime_two_weeks_minute | median_playtime_forever_minute | median_playtime_two_weeks_minute | developers | publishers | categories | genres | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Galactic Bowling | Oct 21, 2008 | 0 - 20000 | 0 | 0 | 19.99 | 0 | ['English'] | [] | ... | 0 | 0 | 0 | 0 | 0 | Perpetual FX Creative | Perpetual FX Creative | Single-player,Multi-player,Steam Achievements,... | Casual,Indie,Sports | Indie,Casual,Sports,Bowling |
| 1 | 2 | Train Bandit | Oct 12, 2017 | 0 - 20000 | 0 | 0 | 0.99 | 0 | ['English', 'French', 'Italian', 'German', 'Sp... | [] | ... | 0 | 0 | 0 | 0 | 0 | Rusty Moyher | Wild Rooster | Single-player,Steam Achievements,Full controll... | Action,Indie | Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc... |
| 2 | 3 | Jolt Project | Nov 17, 2021 | 0 - 20000 | 0 | 0 | 4.99 | 0 | ['English', 'Portuguese - Brazil'] | [] | ... | 0 | 0 | 0 | 0 | 0 | Campião Games | Campião Games | Single-player | Action,Adventure,Indie,Strategy | |
| 3 | 4 | Henosis™ | Jul 23, 2020 | 0 - 20000 | 0 | 0 | 5.99 | 0 | ['English', 'French', 'Italian', 'German', 'Sp... | [] | ... | 0 | 0 | 0 | 0 | 0 | Odd Critter Games | Odd Critter Games | Single-player,Full controller support | Adventure,Casual,Indie | 2D Platformer,Atmospheric,Surreal,Mystery,Puzz... |
| 4 | 5 | Two Weeks in Painland | Feb 3, 2020 | 0 - 20000 | 0 | 0 | 0.0 | 0 | ['English', 'Spanish - Spain'] | [] | ... | 0 | 0 | 0 | 0 | 0 | Unusual Games | Unusual Games | Single-player,Steam Achievements | Adventure,Indie | Indie,Adventure,Nudity,Violent,Sexual Content,... |
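
The preview suggests that several columns arrive as display strings rather than typed values: `estimated_owners` is a range such as `0 - 20000`, `release_date` looks like `Oct 21, 2008`, and `supported_languages` appears to be the text form of a Python list. A minimal sketch of how such values could be parsed, assuming every row follows the formats visible in the preview (the function names are illustrative and not part of the project code):

```python
import ast
from datetime import date, datetime


def parse_owner_range(text):
    # "0 - 20000" -> (0, 20000)
    low, high = (int(part.strip()) for part in text.split("-"))
    return low, high


def parse_release_date(text):
    # "Oct 21, 2008" -> date(2008, 10, 21)
    # %b matches English month abbreviations under the default C locale.
    return datetime.strptime(text, "%b %d, %Y").date()


def parse_language_list(text):
    # "['English', 'French']" -> ['English', 'French']
    # literal_eval only parses Python literals; it does not execute code.
    return ast.literal_eval(text)


owners = parse_owner_range("0 - 20000")
released = parse_release_date("Feb 3, 2020")
languages = parse_language_list("['English', 'Portuguese - Brazil']")
```

In the notebook these could be applied column-wise, for example with `raw_games["estimated_owners"].map(parse_owner_range)`. Rows that deviate from the assumed formats would raise an error and need separate handling.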


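The `info()` output reports **83,560 entries** in total. Among the columns shown, `name` has 6 missing values, while `reviews` (9,716 non-null) and `website` (39,054 non-null) are only partially populated.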

In [10]:
raw_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83560 entries, 0 to 83559
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   id                                 83560 non-null  int64  
 1   name                               83554 non-null  object 
 2   release_date                       83560 non-null  object 
 3   estimated_owners                   83560 non-null  object 
 4   peak_ccu                           83560 non-null  int64  
 5   required_age                       83560 non-null  int64  
 6   price                              83560 non-null  float64
 7   downloable_content_count           83560 non-null  int64  
 8   supported_languages                83560 non-null  object 
 9   full_audio_languages               83560 non-null  object 
 10  reviews                            9716 non-null   object 
 11  website                            39054 non-null  obj
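
The non-null counts in the `info()` output translate directly into missing-value shares. The sketch below recomputes them by hand from the three visible counts that differ from the total; with the DataFrame in memory, `raw_games.isna().mean()` would give the same ratio for every column.

```python
TOTAL_ROWS = 83560

# Non-null counts copied from the info() output.
non_null_counts = {"name": 83554, "reviews": 9716, "website": 39054}

# Share of missing values per column, as a percentage.
missing_pct = {
    column: 100 * (TOTAL_ROWS - count) / TOTAL_ROWS
    for column, count in non_null_counts.items()
}

for column, pct in missing_pct.items():
    print(f"{column}: {pct:.2f}% missing")
```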

`describe()` summarizes the numeric columns. Note that `score_rank` has only 44 non-null values.

In [11]:
raw_games.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 83560.0 | 41780.5 | 24121.838584 | 1.0 | 20890.75 | 41780.5 | 62670.25 | 83560.0 |
| peak_ccu | 83560.0 | 136.255086 | 5450.777515 | 0.0 | 0.0 | 0.0 | 1.0 | 872138.0 |
| required_age | 83560.0 | 0.316563 | 2.267967 | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 |
| price | 83560.0 | 7.195325 | 12.312332 | 0.0 | 0.99 | 4.49 | 9.99 | 999.98 |
| downloable_content_count | 83560.0 | 0.551795 | 13.84687 | 0.0 | 0.0 | 0.0 | 0.0 | 2366.0 |
| metacritic_score | 83560.0 | 3.40827 | 15.551867 | 0.0 | 0.0 | 0.0 | 0.0 | 97.0 |
| user_score | 83560.0 | 0.040558 | 1.807466 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 |
| positive | 83560.0 | 976.050191 | 24582.729979 | 0.0 | 0.0 | 7.0 | 47.0 | 5764420.0 |
| negative | 83560.0 | 162.522367 | 4616.32546 | 0.0 | 0.0 | 2.0 | 14.0 | 895978.0 |
| score_rank | 44.0 | 98.909091 | 0.857747 | 97.0 | 98.0 | 99.0 | 100.0 | 100.0 |
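
For `metacritic_score` and `user_score` the 75th percentile is 0, so at least three quarters of the games carry a zero in those columns. It seems likely, though the statistics alone do not confirm it, that 0 stands for "no score" rather than a real rating. A small sketch of how the share of zeros could be measured, using made-up sample values (on the real data, `(raw_games["metacritic_score"] == 0).mean()` would be the pandas equivalent):

```python
def zero_share(values):
    # Fraction of entries that are exactly zero.
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for value in values if value == 0) / len(values)


# Made-up sample, not taken from the dataset.
sample_scores = [0, 0, 0, 97, 0, 0, 81, 0]
share = zero_share(sample_scores)
```

If the share turns out to be very high, treating zeros as missing before modeling may be more faithful than treating them as genuine low scores.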
