In [1]:
import pandas as pd
import numpy as np


In [2]:
games_df = pd.read_csv('../Data/games.csv')
games_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71716 entries, 0 to 71715
Data columns (total 39 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   AppID                       71716 non-null  int64  
 1   Name                        71715 non-null  object 
 2   Release date                71716 non-null  object 
 3   Estimated owners            71716 non-null  object 
 4   Peak CCU                    71716 non-null  int64  
 5   Required age                71716 non-null  int64  
 6   Price                       71716 non-null  float64
 7   DLC count                   71716 non-null  int64  
 8   About the game              69280 non-null  object 
 9   Supported languages         71716 non-null  object 
 10  Full audio languages        71716 non-null  object 
 11  Reviews                     9167 non-null   object 
 12  Header image                71716 non-null  object 
 13  Website                     350

We will drop the columns that would need one-hot encoding, the free-text (NLP) columns, and the columns we do not need.

In [3]:
drop_list = ['AppID', 'Name', 'About the game', 'Supported languages', 
             'Full audio languages', 'Notes', 'Categories',
             'Genres', 'Tags', 'Score rank', 'Header image',
             'Peak CCU', 'Estimated owners', 'Average playtime forever',
             'Average playtime two weeks', 'Median playtime forever',
             'Median playtime two weeks']

strip_df = games_df.drop(drop_list, axis=1)
strip_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71716 entries, 0 to 71715
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Release date      71716 non-null  object 
 1   Required age      71716 non-null  int64  
 2   Price             71716 non-null  float64
 3   DLC count         71716 non-null  int64  
 4   Reviews           9167 non-null   object 
 5   Website           35073 non-null  object 
 6   Support url       36250 non-null  object 
 7   Support email     60596 non-null  object 
 8   Windows           71716 non-null  bool   
 9   Mac               71716 non-null  bool   
 10  Linux             71716 non-null  bool   
 11  Metacritic score  71716 non-null  int64  
 12  Metacritic url    3778 non-null   object 
 13  User score        71716 non-null  int64  
 14  Positive          71716 non-null  int64  
 15  Negative          71716 non-null  int64  
 16  Achievements      71716 non-null  int64 

Carry over the features that are already complete:
- Required age, Price, DLC count, Windows, Mac, Linux, Achievements

In [4]:
augment_df = strip_df[['Required age', 'Price', 'DLC count', 'Windows', 'Mac', 'Linux', 'Achievements']]
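Selecting columns from strip_df like this can leave pandas unsure whether augment_df is a view or a copy, which is what triggers SettingWithCopyWarning when new columns are assigned to it (in pandas versions without copy-on-write). A minimal sketch of taking an explicit copy first, using a hypothetical toy frame; toy_df and subset_df are illustrative names, not part of this notebook:

```python
import pandas as pd

# Hypothetical toy data standing in for strip_df
toy_df = pd.DataFrame({
    'Required age': [0, 18],
    'Price': [0.0, 9.99],
    'Website': ['https://example.com', None],
})

# .copy() makes subset_df an independent DataFrame, so adding columns
# to it is unambiguous and never touches toy_df
subset_df = toy_df[['Required age', 'Price']].copy()
subset_df['is18plus'] = subset_df['Required age'] >= 18
```

The same idea would apply here by appending .copy() to the column selection that creates augment_df.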


Turn Release date into Release Month and Release Year columns

In [5]:
month_mapper = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

augment_df['Release Month'] = strip_df['Release date'].map(lambda x: month_mapper[x[0:3]])
augment_df['Release Year'] = strip_df['Release date'].str.slice(-4).astype('Int32')
# print(augment_df['Release Year'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  augment_df['Release Month'] = strip_df['Release date'].map(lambda x: month_mapper[x[0:3]])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  augment_df['Release Year'] = strip_df['Release date'].str.slice(-4).astype('Int32')
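The month and year extraction relies only on the first three and last four characters of each date string. A small sketch of the same idea using vectorized string methods for both columns; the two sample strings are assumptions about how Release date values may look, not rows taken from games.csv:

```python
import pandas as pd

month_mapper = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
                'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

# Assumed sample formats: one with a day of month, one without
dates = pd.Series(['Oct 21, 2008', 'May 2020'])

# First three characters -> month abbreviation -> month number
release_month = dates.str.slice(0, 3).map(month_mapper)

# Last four characters -> year, stored as a nullable integer
release_year = dates.str.slice(-4).astype('Int32')
```

Because Series.map with a dict yields NaN for unknown keys instead of raising KeyError, an unexpected month abbreviation would show up as a missing value rather than stopping the cell.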


Now let's create an is18plus column based on Required age

In [6]:
augment_df['is18plus'] = strip_df['Required age'].apply(lambda x: x >= 18)
# augment_df['is18plus']


Now let's create an isFreeToPlay column

In [7]:
augment_df['isFreeToPlay'] = ~strip_df['Price'].astype(bool)
augment_df['isFreeToPlay']


0        False
1        False
2        False
3        False
4         True
         ...  
71711     True
71712    False
71713    False
71714    False
71715    False
Name: isFreeToPlay, Length: 71716, dtype: bool
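Casting a numeric column to bool maps 0 to False and every non-zero value to True, so negating the cast flags the zero-price rows. A minimal sketch on hypothetical prices, showing that this matches an explicit comparison with 0:

```python
import pandas as pd

# Hypothetical prices; 0.0 stands for a free game
prices = pd.Series([0.0, 9.99, 0.0, 4.99])

# Cast-and-negate, as in the cell above
is_free_cast = ~prices.astype(bool)

# Explicit comparison, which states the intent more directly
is_free_cmp = prices == 0
```

Both forms give the same result when the column has no missing values, as is the case for Price here.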

Now let's create a hasDLC column

In [8]:
augment_df['hasDLC'] = strip_df['DLC count'].astype(bool)
# augment_df['hasDLC']


Now we create a hasSupport column by combining Support url and Support email, and a hasWebsite column

In [9]:
augment_df['hasSupport'] = strip_df['Support url'].fillna('').astype(bool) | strip_df['Support email'].fillna('').astype(bool)
augment_df['hasWebsite'] = strip_df['Website'].fillna('').astype(bool)
# augment_df['hasSupport']
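Filling missing values with an empty string and casting to bool treats both missing entries and empty strings as "absent". A short sketch on hypothetical data (the URLs and email address are made up) that also contrasts this with notna(), which would count an empty string as present:

```python
import pandas as pd

# Hypothetical support fields; dtype=object mirrors how read_csv loads text
urls = pd.Series(['https://example.com/support', None, ''], dtype=object)
emails = pd.Series([None, None, 'help@example.com'], dtype=object)

# Missing -> '' -> False; any non-empty string -> True
has_url = urls.fillna('').astype(bool)
has_email = emails.fillna('').astype(bool)

# A game has support if either contact channel is present
has_support = has_url | has_email

# notna() only checks for missing values, so '' still counts as present
url_not_null = urls.notna()
```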


Now we create a hasMedia column based on Screenshots and Movies; we will also create movieCount and screenshotCount columns

In [10]:
augment_df['hasMedia'] = strip_df['Screenshots'].fillna('').astype(bool) | strip_df['Movies'].fillna('').astype(bool)
# Count comma-separated entries; missing values count as 0
# (splitting an empty string would give [''], i.e. a count of 1)
augment_df['movieCount'] = strip_df['Movies'].str.split(',').str.len().fillna(0).astype(int)
augment_df['screenshotCount'] = strip_df['Screenshots'].str.split(',').str.len().fillna(0).astype(int)
# augment_df['Screenshots']


Add the positive and negative review counts and the number of recommendations

In [11]:
augment_df['posReviewCount'] = strip_df['Positive']
augment_df['negReviewCount'] = strip_df['Negative']
augment_df['recommendationCount'] = strip_df['Recommendations']


Now let's turn Metacritic score into a hasMetacriticScore flag; the actual Metacritic score will be used as part of the right-hand side of the ML model

In [12]:
augment_df['hasMetacriticScore'] = strip_df['Metacritic score'].astype(bool)


Now we just add devCount and pubCount columns (developer and publisher counts)

In [13]:
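# Count comma-separated names; missing values count as 0
# (splitting an empty string would give [''], i.e. a count of 1)
augment_df['devCount'] = strip_df['Developers'].str.split(',').str.len().fillna(0).astype(int)
augment_df['pubCount'] = strip_df['Publishers'].str.split(',').str.len().fillna(0).astype(int)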
# augment_df['devCount']
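One subtlety when counting comma-separated entries: splitting an empty string yields a one-element list, so filling missing values with '' before splitting reports a count of 1 rather than 0. A minimal sketch on hypothetical developer names that compares the two orderings:

```python
import pandas as pd

# Hypothetical developer field; None stands for a missing value
devs = pd.Series(['Studio A,Studio B', 'Solo Dev', None], dtype=object)

# Filling first: '' splits into [''], so the missing row counts as 1
fill_then_split = devs.fillna('').str.split(',').str.len()

# Splitting first: the missing row stays NaN and is then set to 0
split_then_fill = devs.str.split(',').str.len().fillna(0).astype(int)
```

Note that either way a single name containing a comma would be counted as two entries.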


In [14]:
augment_df['reviewCount'] = strip_df['Reviews'].fillna('').str.split(r'\”\s*([^\"]*)\s*\“', regex=True).str.len()
# augment_df['reviewCount']


Some final column renaming

In [15]:
augment_df.rename({'Achievements': 'achievementCount'}, axis=1, inplace=True)


In [16]:
augment_df.to_csv('../Data/augmented_data.csv')
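By default to_csv also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. A small sketch of the difference, using an in-memory buffer and a hypothetical toy frame instead of the real output file:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for augment_df
toy_df = pd.DataFrame({'Price': [0.0, 9.99], 'hasDLC': [False, True]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```

Whether to keep the index depends on how the file is consumed later; passing index=False is one option if the RangeIndex carries no information.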
