---

## 3. Data Preparation

The data is processed in the order shown below: new features are added first, and then a transformer is applied to each group of attributes.

### Columns

In [None]:
import re
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

**Categorical columns**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

class CategoriesTokenizer:
    """Callable tokenizer: splits a ';'-separated string into its categories."""

    def __init__(self):
        pass

    def __call__(self, doc):
        return doc.split(';')

This version of the vectorizer is for columns with few possible categories (fewer than 1,000):
- platforms (3 possible values)
- categories (29 possible values)
- genres (26 possible values)
- tags (306 possible values)

In [None]:
boc_some_values = CountVectorizer(
    tokenizer = CategoriesTokenizer(),
    max_df = 1.0,
    min_df = 0.05  # hyperparameter to tune
    )

This other version is for developers and publishers (5,617 and 3,961 possible values, respectively):

In [None]:
boc_many_values = CountVectorizer(
    tokenizer = CategoriesTokenizer(),
    max_df = 1.0,
    min_df = 1  # hyperparameter to tune
    # candidate values for GridSearch: [5, 10, 15] ???
    )

Release date and revenue features

In [None]:
import re

def custom_features(dataframe_in):
    """Add `month` and `revenue`, and convert `release_date` to a Julian date."""
    df = dataframe_in.copy(deep=True)

    release = pd.to_datetime(df['release_date'])
    df['month'] = release.dt.month
    df['release_date'] = release.apply(lambda x: x.to_julian_date())
    df['revenue'] = 0.0  # float default, matching the values assigned below

    # Revenue value assigned to each top publisher, keyed by a fragment of its name
    top_pub_revenues = {'microsoft':10.260, 'netease':6.668, 'activision':6.388, 'electronic':5.537, 'bandai':3.018, 'square':2.386, 'nexon':2.286,
                        'ubisoft':1.446, 'konami':1.303, 'SEGA':1.153, 'capcom':0.7673, 'warner':0.7324}

    for pub, rev in top_pub_revenues.items():
        # 'SEGA' is matched case-sensitively; every other key ignores case
        flags = 0 if pub == 'SEGA' else re.IGNORECASE
        # na=False treats a missing publisher as "no match" instead of raising on NaN
        mask = df['publisher'].str.contains(pub, flags=flags, na=False)
        df.loc[mask, 'revenue'] = rev
    return df
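The three derived columns can be traced on a tiny example. This is a sketch on made-up rows, not the real dataset: the two publishers are invented, and only the `'ubisoft'` entry of the revenue table is used.

```python
import pandas as pd

# Toy data (assumption): two games with ISO-formatted release dates
toy = pd.DataFrame({
    'release_date': ['2000-01-01', '2019-07-15'],
    'publisher': ['Ubisoft Entertainment', 'Some Indie Studio'],
})

dates = pd.to_datetime(toy['release_date'])
toy['month'] = dates.dt.month
# Julian date: a continuous day count, which turns the date into a plain number
toy['release_date'] = dates.apply(lambda ts: ts.to_julian_date())

# Case-insensitive substring match on the publisher name
toy['revenue'] = 0.0
mask = toy['publisher'].str.contains('ubisoft', case=False, na=False)
toy.loc[mask, 'revenue'] = 1.446

print(toy)
```

Midnight on 2000-01-01 corresponds to Julian date 2451544.5, so the first row is a convenient sanity check for the conversion.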

**Putting it all together**

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, OneHotEncoder


preprocessing = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['month']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price','release_date']),
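        # The step is named 'BoxCox', but the method actually applied is Yeo-Johnson
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' would keep the column unchanged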
])

### Summary of transformations

|     **Column**    |   **Processing**  |
|:-----------------:|:-----------------:|
|      **name**     |        ---        |
|    release_date   |    MinMaxScaler   |
|      english      |        ---        |
|     developer     |        BoW        |
|     publisher     |        BoW        |
|     platforms     |        BoW        |
|    required_age   |    MinMaxScaler   |
|     categories    |        BoW        |
|       genres      |        BoW        |
|        tags       |        BoW        |
|    achievements   |  PowerTransformer |
|  average_playtime |  PowerTransformer |
|       price       |    MinMaxScaler   |
| short_description |     Embeddings    |
|       month       |   OneHotEncoder   |
|      revenue      |  PowerTransformer |