# Mapping the Universe



## Supervised Learning

### Classifying the morphological type of galaxies

  * Implement random forest and neural network models to classify galaxies as elliptical, spiral, or irregular.
  
    + Use at least two different subsets of variables (one can be the best-performing set from the previous assignment).
    + Run a grid search for the best hyperparameters of the models used.
    + Compare the performance against the perceptron, logistic regression, nearest neighbors, or whichever models you used in the previous assignment.
  
### Estimating the _redshift_ of galaxies

  * Implement random forest, multi-layer perceptron, and/or stochastic gradient descent models to estimate the _redshift_ of galaxies from their photometric properties.
  
    + Use at least two different subsets of variables (one can be the best-performing set from the previous assignment).
    + Identify the most important hyperparameters of each algorithm and run a grid search for their best values.
    + Choose a metric to evaluate the performance of the methods.
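
As a rough illustration of the last item, the sketch below computes a few candidate metrics for a redshift regressor. The helper name `redshift_metrics` and the toy values are assumptions made for this example; the normalized residual (z_pred - z_true) / (1 + z_true) is a quantity often reported for photometric redshifts, included here only as one option.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def redshift_metrics(z_true, z_pred):
    """Return several regression metrics for redshift estimates (hypothetical helper)."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    # Residual normalized by (1 + z), so errors are comparable across redshifts.
    dz_norm = (z_pred - z_true) / (1.0 + z_true)
    return {
        "mse": mean_squared_error(z_true, z_pred),
        "mae": mean_absolute_error(z_true, z_pred),
        "r2": r2_score(z_true, z_pred),
        "mean_abs_dz_norm": float(np.mean(np.abs(dz_norm))),
    }


# Toy values, not taken from the dataset.
metrics = redshift_metrics([0.10, 0.20, 0.30], [0.12, 0.18, 0.30])
print(metrics)
```

MSE penalizes large outliers more strongly than MAE, so the choice depends on how much catastrophic redshift errors should weigh in the comparison.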
    

    
   

### Reading the data

This is one way to do it; use whichever approach suits you best.

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
filename = "~/DiploDatos_clean.csv"

In [5]:
df = pd.read_csv(filename,index_col=0)

In [6]:
df['modelColor_ug'] = df['modelMag_u'] - df['modelMag_g']
df['modelColor_gr'] = df['modelMag_g'] - df['modelMag_r']
df['modelColor_ri'] = df['modelMag_r'] - df['modelMag_i']
df['modelColor_iz'] = df['modelMag_i'] - df['modelMag_z']
df['petroColor_ug'] = df['petroMag_u'] - df['petroMag_g']
df['petroColor_gr'] = df['petroMag_g'] - df['petroMag_r']
df['petroColor_ri'] = df['petroMag_r'] - df['petroMag_i']
df['petroColor_iz'] = df['petroMag_i'] - df['petroMag_z']
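
The eight assignments above follow one pattern: each color is the difference between adjacent bands in the sequence u, g, r, i, z. A compact equivalent might look like the sketch below; the function name `add_colors` and the toy frame are assumptions for illustration, while the column naming follows the cell above.

```python
import pandas as pd


def add_colors(df, prefixes=("model", "petro"), bands="ugriz"):
    """Add adjacent-band color columns such as modelColor_ug (hypothetical helper)."""
    out = df.copy()
    for prefix in prefixes:
        # Pair each band with the next one: (u, g), (g, r), (r, i), (i, z).
        for blue, red in zip(bands[:-1], bands[1:]):
            out[f"{prefix}Color_{blue}{red}"] = (
                out[f"{prefix}Mag_{blue}"] - out[f"{prefix}Mag_{red}"]
            )
    return out


# Toy magnitudes: each band is exactly 1 mag brighter than the previous one.
toy = pd.DataFrame(
    {
        f"{prefix}Mag_{band}": [20.0 - i, 19.0 - i]
        for prefix in ("model", "petro")
        for i, band in enumerate("ugriz")
    }
)
toy = add_colors(toy)
print(toy.filter(like="Color").columns.tolist())
```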

### Joining dataframes

We go a step further and join the previous table with another one in which, for some of the galaxies, people have voted on whether the object is a spiral, elliptical, or irregular galaxy.

In [7]:
filename = '~/DiploDatos_Zoo.csv'

In [8]:
zoo = pd.read_csv(filename,index_col=0)

In [9]:
data = df.join(zoo)

In [10]:
data.shape

(585382, 83)

In [11]:
data.spiral = data.spiral.fillna(0)
data.elliptical = data.elliptical.fillna(0)
data.uncertain = data.uncertain.fillna(0)

In [12]:
data.loc[(data.spiral == 0) & (data.elliptical == 0) & (data.uncertain == 0), 'uncertain'] = 1
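
To make the effect of these cells concrete, here is a minimal sketch on toy data. The object IDs and values are invented; only the column names `spiral`, `elliptical`, and `uncertain` come from the notebook. `DataFrame.join` performs a left join on the index, so galaxies without votes get NaN in the vote columns, which are then filled with 0 and flagged as uncertain.

```python
import pandas as pd

# Toy stand-ins for the photometric table and the vote table (invented IDs).
photometry = pd.DataFrame({"z": [0.05, 0.10, 0.15]}, index=[101, 102, 103])
votes = pd.DataFrame(
    {"spiral": [1, 0], "elliptical": [0, 1], "uncertain": [0, 0]},
    index=[101, 103],
)

# Left join on the index: object 102 has no votes, so its vote columns are NaN.
joined = photometry.join(votes)

vote_cols = ["spiral", "elliptical", "uncertain"]
joined[vote_cols] = joined[vote_cols].fillna(0)

# Rows with no positive vote at all are marked as uncertain.
no_vote = (joined[vote_cols] == 0).all(axis=1)
joined.loc[no_vote, "uncertain"] = 1
print(joined)
```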

### Classifying the morphological type of galaxies

  * Implement random forest and neural network models to classify galaxies as elliptical, spiral, or irregular.
  
    + Use at least two different subsets of variables (one can be the best-performing set from the previous assignment).
    + Run a grid search for the best hyperparameters of the models used.
    + Compare the performance against the perceptron, logistic regression, nearest neighbors, or whichever models you used in the previous assignment.

Based on the results of the previous assignment...

In [13]:
subset1 = [col for col in data.columns if col.startswith("model")] + [col for col in data.columns if col.startswith("deVRad")]
subset3 = [col for col in data.columns if col.startswith("petro")]
subsets = [subset1, subset3]

We need to define the labels.

In [14]:
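# Keep the labels in a standalone Series: `data.labels = ...` sets an attribute
# instead of creating a column, and chained `.loc` writes may alter `data.spiral`.
labels = data.spiral.copy()
labels[data.elliptical == 1] = 2
labels[data.uncertain == 1] = 0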
print("Labels: \n 0 => Uncertain\n 1 => Spiral\n 2 => Elliptical")

Labels: 
 0 => Uncertain
 1 => Spiral
 2 => Elliptical
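
The same encoding can also be written in one step with `numpy.select`, which takes the first matching condition for each row. This is only a sketch on an invented vote table; the condition order reproduces the precedence used above, where `uncertain` overrides the other flags.

```python
import numpy as np
import pandas as pd

# Toy vote table (invented rows); the last row has no vote at all.
votes = pd.DataFrame(
    {
        "spiral": [1, 0, 0, 0],
        "elliptical": [0, 1, 0, 0],
        "uncertain": [0, 0, 1, 0],
    }
)

# np.select uses the first condition that holds, so `uncertain` is listed first.
conditions = [
    votes["uncertain"] == 1,
    votes["elliptical"] == 1,
    votes["spiral"] == 1,
]
choices = [0, 2, 1]  # 0 => Uncertain, 2 => Elliptical, 1 => Spiral
votes["label"] = np.select(conditions, choices, default=0)
print(votes)
```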


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 585382 entries, 957075158303008768 to 957064987820451840
Data columns (total 83 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   z                               585382 non-null  float64
 1   subClass_AGN                    585382 non-null  int64  
 2   subClass_AGN BROADLINE          585382 non-null  int64  
 3   subClass_BROADLINE              585382 non-null  int64  
 4   subClass_STARBURST              585382 non-null  int64  
 5   subClass_STARBURST BROADLINE    585382 non-null  int64  
 6   subClass_STARFORMING            585382 non-null  int64  
 7   subClass_STARFORMING BROADLINE  585382 non-null  int64  
 8   subClass_UNKNOWN                585382 non-null  int64  
 9   velDisp                         585382 non-null  float64
 10  ra                              585382 non-null  float64
 11  dec                             585382 non-null  

In [16]:
"""reducing.py
Author: Kirgsn, 2018

Use like this:
>>> import reducing
>>> df = reducing.Reducer().reduce(df)
"""
import numpy as np
import pandas as pd
import time
import gc
from joblib import Parallel, delayed
from fastprogress import master_bar, progress_bar

__all__ = ['Reducer']

def measure_time_mem(func):
    def wrapped_reduce(self, df, *args, **kwargs):
        # pre
        mem_usage_orig = df.memory_usage().sum() / self.memory_scale_factor
        start_time = time.time()
        # exec
        ret = func(self, df, *args, **kwargs)
        # post
        mem_usage_new = ret.memory_usage().sum() / self.memory_scale_factor
        end_time = time.time()
        print(f'reduced df from {mem_usage_orig:.4f} MB '
              f'to {mem_usage_new:.4f} MB '
              f'in {(end_time - start_time):.2f} seconds')
        gc.collect()
        return ret
    return wrapped_reduce

class Reducer:
    """
    Class that takes a dict of increasingly big numpy datatypes to transform
    the data of a pandas dataframe into, in order to save memory usage.
    """
    memory_scale_factor = 1024**2  # memory in MB

    def __init__(self, conv_table=None, use_categoricals=True, n_jobs=-1):
        """
        :param conv_table: dict with np.dtypes-strings as keys
        :param use_categoricals: Whether the new pandas dtype "Categoricals"
                shall be used
        :param n_jobs: Parallelization rate
        """

        self.conversion_table = \
            conv_table or {'int': [np.int8, np.int16, np.int32, np.int64],
                           'uint': [np.uint8, np.uint16, np.uint32, np.uint64],
                           'float': [np.float32, ]}
        self.null_int = {   np.int8:  pd.Int8Dtype,
                            np.int16: pd.Int16Dtype,
                            np.int32: pd.Int32Dtype,
                            np.int64: pd.Int64Dtype,
                            np.uint8: pd.UInt8Dtype,
                            np.uint16:pd.UInt16Dtype,
                            np.uint32:pd.UInt32Dtype,
                            np.uint64:pd.UInt64Dtype}
        
        self.use_categoricals = use_categoricals
        self.n_jobs = n_jobs
        
    def _type_candidates(self, k):
        for c in self.conversion_table[k]:
            i = np.iinfo(c) if 'int' in k else np.finfo(c)
            yield c, i

    @measure_time_mem
    def reduce(self, df, verbose=False):
        """Takes a dataframe and returns it with all data transformed to the
        smallest necessary types.

        :param df: pandas dataframe
        :param verbose: If True, outputs more information
        :return: pandas dataframe with reduced data types
        """
        ret_list = Parallel(n_jobs=self.n_jobs, max_nbytes=None)(progress_bar(list(delayed(self._reduce)
                                                (df[c], c, verbose) for c in
                                                df.columns)))

        del df
        gc.collect()
        return pd.concat(ret_list, axis=1)
    
    def _reduce(self, s, colname, verbose):
        try:
            isnull = False
            # skip NaNs
            if s.isnull().any():
                isnull = True
            # detect kind of type
            coltype = s.dtype
            if np.issubdtype(coltype, np.integer):
                conv_key = 'int' if s.min() < 0 else 'uint'
            elif np.issubdtype(coltype, np.floating):
                conv_key = 'float'
                asint = s.fillna(0).astype(np.int64)
                result = (s - asint)
                result = np.abs(result.sum())
                if result < 0.01:
                    conv_key = 'int' if s.min() < 0 else 'uint'
            else:
                if isinstance(coltype, object) and self.use_categoricals:
                    # check for all-strings series
                    if s.apply(lambda x: isinstance(x, str)).all():
                        if verbose: print(f'convert {colname} to categorical')
                        return s.astype('category')
                if verbose: print(f'{colname} is {coltype} - Skip..')
                return s
            # find right candidate
            for cand, cand_info in self._type_candidates(conv_key):
                if s.max() <= cand_info.max and s.min() >= cand_info.min:
                    if verbose: print(f'convert {colname} to {cand}')
                    if isnull:
                        return s.astype(self.null_int[cand]())
                    else:
                        return s.astype(cand)

            # reaching this code is bad. Probably there are inf, or other high numbs
            print(f"WARNING: {colname} doesn't fit the grid with \nmax: {s.max()} "
                f"and \nmin: {s.min()}")
            print('Dropping it..')
        except Exception as ex:
            print(f'Exception for {colname}: {ex}')
            return s

In [17]:
data = Reducer().reduce(data)
data.info()

reduced df from 395.1533 MB to 171.3870 MB in 10.55 seconds
<class 'pandas.core.frame.DataFrame'>
Int64Index: 585382 entries, 957075158303008768 to 957064987820451840
Data columns (total 83 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   z                               585382 non-null  float32
 1   subClass_AGN                    585382 non-null  uint8  
 2   subClass_AGN BROADLINE          585382 non-null  uint8  
 3   subClass_BROADLINE              585382 non-null  uint8  
 4   subClass_STARBURST              585382 non-null  uint8  
 5   subClass_STARBURST BROADLINE    585382 non-null  uint8  
 6   subClass_STARFORMING            585382 non-null  uint8  
 7   subClass_STARFORMING BROADLINE  585382 non-null  uint8  
 8   subClass_UNKNOWN                585382 non-null  uint8  
 9   velDisp                         585382 non-null  float32
 10  ra                              585382 non-null  fl
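
For readers who prefer not to depend on the `Reducer` class, a simpler and less thorough alternative is `pandas.to_numeric` with its `downcast` argument. The sketch below is an assumption-laden stand-in, not the method used above: the function name `downcast_numeric` and the toy frame are invented, and it does not handle missing values or categorical columns the way `Reducer` does.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Downcast integer and float columns to smaller dtypes (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Non-negative integers can use the unsigned types.
            kind = "unsigned" if out[col].min() >= 0 else "integer"
            out[col] = pd.to_numeric(out[col], downcast=kind)
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


toy = pd.DataFrame(
    {
        "flag": np.array([0, 1, 1], dtype="int64"),
        "z": np.array([0.125, 0.25, 0.5], dtype="float64"),
    }
)
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

Keep in mind that float32 carries only about seven significant decimal digits, which matters for columns such as `ra` and `dec` if sub-arcsecond precision is needed.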

Random Forest
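
The notebook leaves this part without code. As a starting point, the sketch below shows what a grid search over a random forest classifier could look like. It runs on synthetic three-class data from `make_classification`, not on the galaxy table, and the grid values are arbitrary assumptions; with the real data, `X` would be one of the `subsets` and `y` the labels defined earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for (features, labels); three classes like the morphology labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

# Small illustrative grid; the values are assumptions, not tuned choices.
param_grid = {"n_estimators": [10, 50], "max_depth": [4, 8]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# Evaluate the refitted best model on data the search never saw.
test_accuracy = accuracy_score(y_test, grid.best_estimator_.predict(X_test))
print(grid.best_params_, test_accuracy)
```

Swapping `RandomForestClassifier` for `sklearn.neural_network.MLPClassifier`, with a grid over its own hyperparameters, would cover the neural network part of the task in the same way.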

  
### Estimating the _redshift_ of galaxies

  * Implement random forest, multi-layer perceptron, and/or stochastic gradient descent models to estimate the _redshift_ of galaxies from their photometric properties.
  
    + Use at least two different subsets of variables (one can be the best-performing set from the previous assignment).
    + Identify the most important hyperparameters of each algorithm and run a grid search for their best values.
    + Choose a metric to evaluate the performance of the methods.

We work with the Model magnitudes, which gave the best results in the previous assignment.

Random Forest

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [19]:
params = {
    "n_estimators" : [200,100,50,10],
    "max_depth" : [6,10,20]
}

In [20]:
model = RandomForestRegressor(random_state=0, n_jobs=-1)

In [21]:
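def grid_cv(model,X,y):
    # 5-fold grid search over the global `params`; with no `scoring` given,
    # GridSearchCV ranks regressors by their default score, R^2.
    grid = GridSearchCV(estimator=model, param_grid=params, cv=5, n_jobs=-1)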
    grid.fit(X,y)
    best_model = grid.best_estimator_
    return best_model
def split(X,y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    return X_train, X_test, y_train, y_test
def random_forest(model,X_train, X_test, y_train, y_test):
    model.fit(X_train,y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    print("Train MSE Error")
    print(mean_squared_error(y_train,y_train_pred))
    print("Test MSE Error")
    print(mean_squared_error(y_test,y_test_pred))

In [22]:
import joblib

X = data[subset1]
y = data.z
print("Random Forest Regressor")
model = grid_cv(model, X, y)
print(model)
print("----------")
name = "randomforestCV_" + "modelMag" + ".pkl"
joblib.dump(model, name)
print("_____________________")

Random Forest Regressor
RandomForestRegressor(max_depth=20, n_estimators=200, n_jobs=-1, random_state=0)
----------
_____________________


In [25]:
X_train, X_test, y_train, y_test = split(X,y)
random_forest(model, X_train, X_test, y_train, y_test)

Train MSE Error
0.00015687338982727005
Test MSE Error
0.0005497582982857604
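
In the cells above, the grid search is fitted on the full `X` and `y` before the train/test split, so the hyperparameters are chosen with knowledge of the rows later used for testing. One possible variation, sketched below on synthetic data from `make_regression`, splits first and searches only on the training part, scoring by negative MSE so that selection and reporting use the same metric. The grid values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for (photometric features, redshift).
X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)

# Split first, so the test rows play no part in hyperparameter selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [6, 10]},
    cv=3,
    scoring="neg_mean_squared_error",  # higher is better, hence the sign flip
)
grid.fit(X_train, y_train)

# With refit=True (the default), grid.predict uses the best model refitted on X_train.
test_mse = mean_squared_error(y_test, grid.predict(X_test))
print(grid.best_params_, -grid.best_score_, test_mse)
```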


Stochastic Gradient Descent Regressor

In [30]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

In [31]:
params = {
    # "squared_loss" was renamed "squared_error" in scikit-learn 1.0 and removed in 1.2
    "loss" : ["squared_loss", "huber"],
    "penalty" : ["l1","l2"],
    "alpha" : [1000,200,100,50,0]
}

In [33]:
model = SGDRegressor(random_state=0)

In [34]:
def grid_cv(model,X,y):
    grid = GridSearchCV(estimator=model, param_grid=params, cv=5, n_jobs=-1)
    grid.fit(X,y)
    best_model = grid.best_estimator_
    return best_model
def split(X,y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    return X_train, X_test, y_train, y_test
def sgdr(model,X_train, X_test, y_train, y_test):
    model.fit(X_train,y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    print("Train MSE Error")
    print(mean_squared_error(y_train,y_train_pred))
    print("Test MSE Error")
    print(mean_squared_error(y_test,y_test_pred))

In [35]:
import joblib

X = data[subset1]
y = data.z
print("SGDRegressor")
model = grid_cv(model, X, y)
print(model)
print("----------")
name = "SGDRCV_" + "modelMag" + ".pkl"
joblib.dump(model, name)
print("_____________________")


SGDRegressor
SGDRegressor(alpha=0, loss='huber', penalty='l1', random_state=0)
----------
_____________________


In [36]:
X_train, X_test, y_train, y_test = split(X,y)
sgdr(model, X_train, X_test, y_train, y_test)

Train MSE Error
0.0017920295072500522
Test MSE Error
0.0017992826721478844
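
Stochastic gradient descent is sensitive to feature scale, and the cells above feed the raw columns to `SGDRegressor`. A hedged sketch of one common remedy follows: standardize the features inside a `Pipeline`, so the scaler is fitted only on each training fold. It uses synthetic data, and the grid values (including the much smaller `alpha` range) are assumptions for illustration rather than recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (photometric features, redshift).
X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
y = (y - y.mean()) / y.std()  # unit-variance target keeps the toy problem well scaled

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# The scaler lives inside the pipeline, so it is refitted on every CV training fold.
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "sgdregressor__loss": ["huber", "epsilon_insensitive"],
    "sgdregressor__penalty": ["l1", "l2"],
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3],
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, grid.predict(X_test))
print(grid.best_params_, test_mse)
```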


The best model is the Random Forest Regressor on the Model magnitudes: its test MSE (about 5.5e-4) is roughly a third of the SGDRegressor's (about 1.8e-3).