# Lab 2
## Team members
- Ricardo Méndez 21289
- Sara Echeverría 21371
- Melissa Pérez 21385
- Francisco Castillo 21562

# Task 1

#### Why is the Naive Bayes model considered "naive"?

Bayes' theorem itself places no independence requirement on events; it describes how the probability of one event changes given another. The _naive_ approach adds the assumption that the features are conditionally independent of one another given the class, so one feature of a data set is treated as unrelated to any other. In other words, the model is called naive because of this simplifying assumption of conditional independence among the features or attributes that describe the data. [(Shoba, 2018)](https://www.sciencedirect.com/science/article/abs/pii/S0169716118300191)
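
The independence assumption can be illustrated with a minimal sketch. The word likelihoods and priors below are made-up numbers chosen only for illustration, and the helper names `naive_joint` and `posterior` are hypothetical; the point is that the class score is just the prior times a product of per-feature likelihoods.

```python
import math

# Hypothetical per-class likelihoods for two word features.
# The numbers are invented for illustration only.
likelihoods = {
    "spam": {"free": 0.60, "win": 0.50},
    "ham": {"free": 0.05, "win": 0.10},
}
priors = {"spam": 0.5, "ham": 0.5}


def naive_joint(features, label):
    """P(label) times the product of P(feature | label), per the naive assumption."""
    return priors[label] * math.prod(likelihoods[label][f] for f in features)


def posterior(features):
    """Normalize the joint scores so they sum to 1 across classes."""
    joint = {label: naive_joint(features, label) for label in priors}
    total = sum(joint.values())
    return {label: score / total for label, score in joint.items()}


result = posterior(["free", "win"])
print(result)
```

Because each feature contributes its own factor, the model never needs to estimate how "free" and "win" co-occur, which is what keeps it simple.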

#### Explain the mathematical formulation that a Support Vector Machine seeks to optimize, and answer: how does the kernel trick work for this model?
Support vector machines look for a boundary that maximizes the separation margin between the classes. Predictions are made with the following formula:
$$
y_i = \operatorname{sign}(w^Tx_i+b)
$$
The margin is the perpendicular distance from the boundary to the closest point of each class, and this is the distance we maximize. Maximizing it is equivalent to minimizing $\frac{1}{2}w^Tw$ subject to $y_n(w^Tx_n+b) \geq 1$ for every point. This constrained problem is solved with Lagrange multipliers (one multiplier $\alpha_n \geq 0$ per data point):
$$
\min_{w, b} \max_{\alpha \geq 0} \; \frac{1}{2}w^Tw - \sum_{n=1}^{N}\alpha_n\left[y_n(w^Tx_n+b)-1\right]
$$
Setting the derivatives with respect to $w$ and $b$ to zero gives $w = \sum_{n=1}^{N}\alpha_ny_nx_n$ and $\sum_{n=1}^{N}\alpha_ny_n = 0$. Substituting these back yields the dual problem:
$$
\max_{\alpha \geq 0} \sum_{n=1}^{N}\alpha_n - \frac{1}{2} \sum_{n, m=1}^{N}\alpha_n\alpha_my_ny_m{x_n}^Tx_m \quad \text{subject to} \quad \sum_{n=1}^{N}\alpha_ny_n = 0
$$
The optimization no longer depends on the weights ($w$) and the bias ($b$); it depends only on the Lagrange multipliers ($\alpha$). These multipliers are what let us find the decision boundary.

Predictions are now made as follows:
$$
y_i = \operatorname{sign}\left(\sum_{n=1}^{N}\alpha_ny_nx_n^Tx_i+b\right)
$$
The data points $x_i$ appear only inside a dot product, which we can replace with a kernel function.
$$
y_i = \operatorname{sign}\left(\sum_{n=1}^{N}\alpha_ny_nK(x_n,x_i)+b\right)
$$

Using this function is what enables the _kernel trick_. The trick implicitly maps the data to a higher-dimensional space, where it is easier to find a decision boundary; that boundary is linear in the new space but not necessarily linear in the original one, and the mapping itself never has to be computed. [(Ranjan, 2019)](https://towardsdatascience.com/truly-understanding-the-kernel-trick-1aeb11560769)
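
As a small sketch of the kernel trick, consider the homogeneous degree-2 polynomial kernel $K(x, z) = (x^Tz)^2$ on 2-D inputs. This kernel and the helper names `phi`, `dot`, and `poly_kernel` are assumptions chosen for illustration; the notebook itself does not fix a kernel. The kernel value equals the dot product of an explicit 3-D feature map, yet it is computed entirely in the original 2-D space.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # dot product in the mapped 3-D space
implicit = poly_kernel(x, z)    # same quantity from the 2-D inputs
print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is why the dual formulation can swap $x_n^Tx_i$ for $K(x_n, x_i)$ without changing anything else.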

### What type of ensemble learning is this model (Random Forest)?
Random Forest is an ensemble method of the _Bootstrap Aggregation_, or _bagging_, type. Bagging reduces the variance of a machine learning model by training multiple models and combining them into a final model. In a _Random Forest_ every model is a decision tree. Each tree is trained on a different subset of the data, and the trees are then combined to obtain the final model. [(Sruthi, 2024)](https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/)
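
The bagging idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a Random Forest: the base learner is a one-threshold decision stump on a single feature instead of a full decision tree, there is no random feature selection, and the function names (`fit_stump`, `bagging_fit`, `bagging_predict`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold on a 1-D feature that misclassifies the fewest points.

    The stump predicts 1 when x > threshold and 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_fit(points, n_models, seed=0):
    """Train each stump on a bootstrap sample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: the label is 1 when x > 5
data = [(x, int(x > 5)) for x in range(11)]
ensemble = bagging_fit(data, n_models=25)
print(bagging_predict(ensemble, 1), bagging_predict(ensemble, 9))
```

Each stump sees a slightly different sample, so the individual thresholds vary, while the majority vote stays stable.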

### What is the general idea behind Random Forest?
As mentioned above, the goal is to build a learning model that combines the output of multiple decision trees to reach a single result.

### Why is low correlation sought between the trees of a Random Forest?
Low correlation between the trees is sought in order to reduce variance and improve the accuracy of the predictions. If the trees are highly correlated, they are very likely to make the same prediction errors, so combining them would offer little or no improvement. [(Breiman & Cutler, n.d.)](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
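
The effect of correlation can be made concrete with a standard result for the average of identically distributed predictors: if each has variance $\sigma^2$ and every pair has correlation $\rho$, the variance of the average of $B$ of them is $\rho\sigma^2 + (1-\rho)\sigma^2/B$. The sketch below only evaluates that formula; the function name `ensemble_variance` and the chosen values are assumptions for illustration.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees predictors that each have variance
    sigma2 and share a pairwise correlation of rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


# Compare uncorrelated, moderately correlated, and perfectly correlated trees
for rho in (0.0, 0.5, 1.0):
    print(rho, ensemble_variance(sigma2=1.0, rho=rho, n_trees=100))
```

With perfectly correlated trees the variance stays at $\sigma^2$ no matter how many trees are averaged, while with uncorrelated trees it shrinks in proportion to $1/B$.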

In [31]:
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()

# Task 2 Naive Bayes: Ham/Spam Message Classifier

The dataset was obtained from [this link](https://github.com/Sk70249/NLP-Spam-Ham-Classifier/blob/master/data/SMSSpamCollection.tsv)

# Task 2.1

In [6]:
data = []
with open('data/entrenamiento.txt', 'r') as file:
    for line in file:
        split = line.split('\t')
        if len(split) == 2:
            classification = split[0]
            message = split[1].lower()
            
            #Clean the message
            chars = ['\n', '.', ',', '!', '?', '(', ')', '"', ':', ';']
            for char in chars:
                message = message.replace(char, '')
            
            data.append([classification, message])

In [7]:
df = pd.DataFrame(data, columns=['classification', 'message'])

In [8]:
df.head(10)

| | classification | message |
|---|---|---|
| 0 | ham | go until jurong point crazy available only in ... |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | ham | u dun say so early hor u c already then say |
| 4 | ham | nah i don't think he goes to usf he lives arou... |
| 5 | spam | freemsg hey there darling it's been 3 week's n... |
| 6 | ham | even my brother is not like to speak with me t... |
| 7 | ham | as per your request 'melle melle oru minnaminu... |
| 8 | spam | winner as a valued network customer you have b... |
| 9 | spam | had your mobile 11 months or more u r entitled... |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5565 entries, 0 to 5564
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   classification  5565 non-null   object
 1   message         5565 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB


In [10]:
df['classification'] = df['classification'].map({'ham': 0, 'spam': 1})

In [11]:
df.head()

| | classification | message |
|---|---|---|
| 0 | 0 | go until jurong point crazy available only in ... |
| 1 | 0 | ok lar joking wif u oni |
| 2 | 1 | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | 0 | u dun say so early hor u c already then say |
| 4 | 0 | nah i don't think he goes to usf he lives arou... |


In [12]:
df['classification'].value_counts(normalize=True)

classification
0    0.865768
1    0.134232
Name: proportion, dtype: float64

In [13]:
# Balance the dataset
df_ham = df[df['classification'] == 0]
df_spam = df[df['classification'] == 1]
df_ham = df_ham.sample(n=len(df_spam), random_state=7)
df = pd.concat([df_ham, df_spam])

In [14]:
df['classification'].value_counts()

classification
0    747
1    747
Name: count, dtype: int64

In [15]:
features = df['message']
target = df['classification']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=7)

# Task 2.2

In [17]:
from collections import Counter

p_spam = sum(y_train == 1) / len(y_train)  # prior probability of a spam message
p_ham = sum(y_train == 0) / len(y_train)  # prior probability of a ham message

# Collect the vocabulary and the words seen in each class
vocab = set()
spam_words = []
ham_words = []

for message, classification in zip(X_train, y_train):
    for word in message.split():
        vocab.add(word)
        if classification == 1:
            spam_words.append(word)
        else:  # ham
            ham_words.append(word)

# Measure the vocabulary only after it has been filled
vocab_size = len(vocab)

# Count each word once instead of calling list.count() per vocabulary entry
spam_counts = Counter(spam_words)
ham_counts = Counter(ham_words)

# Laplace smoothing for the probability of each word given the class
k = 1  # Laplace smoothing constant

spam_word_probs = {word: (spam_counts[word] + k) / (len(spam_words) + k * vocab_size) for word in vocab}
ham_word_probs = {word: (ham_counts[word] + k) / (len(ham_words) + k * vocab_size) for word in vocab}

# Classify a new message by comparing log joint scores:
# log P(class) + sum of log P(word | class)
def classify(message):
    log_score_spam = np.log(p_spam)
    log_score_ham = np.log(p_ham)
    for word in message.split():
        # Words never seen in training are skipped for both classes
        if word in vocab:
            log_score_spam += np.log(spam_word_probs[word])
            log_score_ham += np.log(ham_word_probs[word])
    return 'spam' if log_score_spam > log_score_ham else 'ham'

# Print the prior probability of each class
print(f"Probability of spam: {p_spam}")
print(f"Probability of ham: {p_ham}")

# Print the probability of each word given each class
print(f"\nSpam word probabilities: {spam_word_probs}")
print(f"Ham word probabilities: {ham_word_probs}")

Probability of spam: 0.5087866108786611
Probability of ham: 0.49121338912133894
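
To make the Laplace smoothing step concrete, here is a minimal worked sketch on a toy corpus. The helper name `smoothed_probs`, the toy word list, and the toy vocabulary are all hypothetical; the formula is the usual add-k estimate, $(\text{count} + k) / (\text{total words in class} + k \cdot |\text{vocab}|)$.

```python
from collections import Counter


def smoothed_probs(words, vocab, k=1):
    """P(word | class) = (count + k) / (total words in class + k * |vocab|)."""
    counts = Counter(words)
    denominator = len(words) + k * len(vocab)
    return {word: (counts[word] + k) / denominator for word in vocab}


# Hypothetical toy corpus: words seen in spam, plus the full vocabulary
toy_spam_words = ["free", "win", "free"]
toy_vocab = {"free", "win", "hello"}

probs = smoothed_probs(toy_spam_words, toy_vocab)
print(probs)
```

The word "hello" never appears in the toy spam words, yet it still receives a small non-zero probability, so a single unseen word cannot drive the whole product to zero.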



The metric used is accuracy: since the classes are balanced, the proportion of correct predictions among all predictions is a meaningful summary of performance.

In [18]:
# Test the model on the TESTING set
y_pred = [classify(message) for message in X_test]

# Encode the predictions as 0 and 1
y_pred_encoded = [1 if prediction == 'spam' else 0 for prediction in y_pred]

# Calculate accuracy
accuracy = sum(y_pred_encoded == y_test) / len(y_test)

print(f"Accuracy on testing set: {accuracy}")

# Test the model on the TRAINING set
y_pred_train = [classify(message) for message in X_train]

# Encode the predictions as 0 and 1
y_pred_train_encoded = [1 if prediction == 'spam' else 0 for prediction in y_pred_train]

# Calculate accuracy
accuracy_train = sum(y_pred_train_encoded == y_train) / len(y_train)

print(f"Accuracy on training set: {accuracy_train}")

Accuracy on testing set: 0.9665551839464883
Accuracy on training set: 0.9774058577405857


The training and testing accuracies are close to each other (about 0.977 versus 0.967), which indicates that the model generalizes well: it is neither overfitted nor underfitted.

# Task 2.3

In [38]:
# Convert the message to lowercase and drop punctuation and non-ASCII characters
def clean_message(message):
    return "".join(char for char in message.lower() if char.isascii() and char not in string.punctuation)

# Ask for the input message and clean it
input_message = input('Enter your message: ')
cleaned_message = clean_message(input_message)

# Accumulate the log joint score of the message under each class
p_message_given_spam = np.log(p_spam)
p_message_given_ham = np.log(p_ham)
for word in cleaned_message.split():
    # Words outside the training vocabulary are skipped for both classes
    if word in vocab:
        p_message_given_spam += np.log(spam_word_probs[word])
        p_message_given_ham += np.log(ham_word_probs[word])

# These are unnormalized joint probabilities P(message, class), so they do not sum to 1
print(f"Spam message probability: {np.exp(p_message_given_spam)}")
print(f"Ham message probability: {np.exp(p_message_given_ham)}")
print(f"\nMessage classification: {classify(cleaned_message)}")

Spam message probability: 2.3777355481917977e-67
Ham message probability: 1.3681660617415548e-74

Message classification: spam
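
The two values printed above are joint scores, which is why they are so small and do not sum to 1. If normalized posterior probabilities are wanted, one option is to normalize in log space. The following is a sketch of that idea, not part of the original notebook; the helper name `normalize_log_scores` is hypothetical, and the inputs are the two printed values converted back to logarithms.

```python
import math


def normalize_log_scores(log_scores):
    """Turn log joint scores into posteriors that sum to 1 (log-sum-exp trick)."""
    highest = max(log_scores.values())
    # Subtracting the largest log score avoids underflow when exponentiating
    shifted = {label: math.exp(score - highest) for label, score in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}


# Joint scores printed by the cell above, converted back to logs
log_scores = {
    "spam": math.log(2.3777355481917977e-67),
    "ham": math.log(1.3681660617415548e-74),
}
posteriors = normalize_log_scores(log_scores)
print(posteriors)
```

Working with the log scores directly, rather than with `np.exp` of them, also keeps long messages from underflowing to zero.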


# Task 2.4

# Task 3 Classification of League of Legends Matches

## Exploratory Analysis

In [193]:
df_lol = pd.read_csv('data/high_diamond_ranked_10min.csv')

In [194]:
df_lol.head()

index,gameId,blueWins,blueWardsPlaced,blueWardsDestroyed,blueFirstBlood,blueKills,blueDeaths,blueAssists,blueEliteMonsters,blueDragons,...,redTowersDestroyed,redTotalGold,redAvgLevel,redTotalExperience,redTotalMinionsKilled,redTotalJungleMinionsKilled,redGoldDiff,redExperienceDiff,redCSPerMin,redGoldPerMin
0,4519157822,0,28,2,1,9,6,11,0,0,...,0,16567,6.8,17047,197,55,-643,8,19.7,1656.7
1,4523371949,0,12,1,0,5,5,5,0,0,...,1,17620,6.8,17438,240,52,2908,1173,24.0,1762.0
2,4521474530,0,15,0,0,7,11,4,1,1,...,0,17285,6.8,17254,203,28,1172,1033,20.3,1728.5
3,4524384067,0,43,1,0,4,5,5,1,0,...,0,16478,7.0,17961,235,47,1321,7,23.5,1647.8
4,4436033771,0,75,4,0,6,6,6,0,0,...,0,17404,7.0,18313,225,67,1004,-230,22.5,1740.4


In [195]:
df_lol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   gameId                        9879 non-null   int64  
 1   blueWins                      9879 non-null   int64  
 2   blueWardsPlaced               9879 non-null   int64  
 3   blueWardsDestroyed            9879 non-null   int64  
 4   blueFirstBlood                9879 non-null   int64  
 5   blueKills                     9879 non-null   int64  
 6   blueDeaths                    9879 non-null   int64  
 7   blueAssists                   9879 non-null   int64  
 8   blueEliteMonsters             9879 non-null   int64  
 9   blueDragons                   9879 non-null   int64  
 10  blueHeralds                   9879 non-null   int64  
 11  blueTowersDestroyed           9879 non-null   int64  
 12  blueTotalGold                 9879 non-null   int64  
 13  blu

In [196]:
df_lol.describe()

statistic,gameId,blueWins,blueWardsPlaced,blueWardsDestroyed,blueFirstBlood,blueKills,blueDeaths,blueAssists,blueEliteMonsters,blueDragons,...,redTowersDestroyed,redTotalGold,redAvgLevel,redTotalExperience,redTotalMinionsKilled,redTotalJungleMinionsKilled,redGoldDiff,redExperienceDiff,redCSPerMin,redGoldPerMin
count,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,...,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0
mean,4500084000.0,0.499038,22.288288,2.824881,0.504808,6.183925,6.137666,6.645106,0.549954,0.36198,...,0.043021,16489.041401,6.925316,17961.730438,217.349226,51.313088,-14.414111,33.620306,21.734923,1648.90414
std,27573280.0,0.500024,18.019177,2.174998,0.500002,3.011028,2.933818,4.06452,0.625527,0.480597,...,0.2169,1490.888406,0.305311,1198.583912,21.911668,10.027885,2453.349179,1920.370438,2.191167,149.088841
min,4295358000.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,11212.0,4.8,10465.0,107.0,4.0,-11467.0,-8348.0,10.7,1121.2
25%,4483301000.0,0.0,14.0,1.0,0.0,4.0,4.0,4.0,0.0,0.0,...,0.0,15427.5,6.8,17209.5,203.0,44.0,-1596.0,-1212.0,20.3,1542.75
50%,4510920000.0,0.0,16.0,3.0,1.0,6.0,6.0,6.0,0.0,0.0,...,0.0,16378.0,7.0,17974.0,218.0,51.0,-14.0,28.0,21.8,1637.8
75%,4521733000.0,1.0,20.0,4.0,1.0,8.0,8.0,9.0,1.0,1.0,...,0.0,17418.5,7.2,18764.5,233.0,57.0,1585.5,1290.5,23.3,1741.85
max,4527991000.0,1.0,250.0,27.0,1.0,22.0,22.0,29.0,2.0,1.0,...,2.0,22732.0,8.2,22269.0,289.0,92.0,10830.0,9333.0,28.9,2273.2


In [197]:
df_lol['blueWins'].value_counts(normalize=True)

blueWins
0    0.500962
1    0.499038
Name: proportion, dtype: float64

In [198]:
features_lol = df_lol.drop(columns=['blueWins', 'gameId'])
target_lol = df_lol['blueWins']

In [199]:
# Scale everything
features_lol = pd.DataFrame(scaler.fit_transform(features_lol), columns=features_lol.columns)

In [200]:
features_lol.describe()

statistic,blueWardsPlaced,blueWardsDestroyed,blueFirstBlood,blueKills,blueDeaths,blueAssists,blueEliteMonsters,blueDragons,blueHeralds,blueTowersDestroyed,...,redTowersDestroyed,redTotalGold,redAvgLevel,redTotalExperience,redTotalMinionsKilled,redTotalJungleMinionsKilled,redGoldDiff,redExperienceDiff,redCSPerMin,redGoldPerMin
count,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,...,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0,9879.0
mean,-2.8769820000000005e-17,5.034719e-18,-1.0788680000000001e-18,1.125619e-16,-1.179563e-16,-1.111234e-16,3.30853e-17,-8.702872e-17,-5.753965e-18,-2.7331330000000003e-17,...,2.5892840000000003e-17,1.146837e-15,1.444245e-15,3.394839e-16,5.897814e-16,8.055551e-17,-1.2586800000000001e-17,1.438491e-18,-1.352901e-15,-6.88318e-16
std,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,...,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051,1.000051
min,-0.9594869,-1.298863,-1.009663,-2.053863,-2.092146,-1.634988,-0.879231,-0.7532257,-0.4811324,-0.210439,...,-0.1983529,-3.539707,-6.961495,-6.254973,-5.036349,-4.718391,-4.66838,-4.364806,-5.036349,-3.539707
25%,-0.4599937,-0.8390689,-1.009663,-0.7253456,-0.7286663,-0.6508123,-0.879231,-0.7532257,-0.4811324,-0.210439,...,-0.1983529,-0.7120554,-0.4104749,-0.6276311,-0.6549,-0.7293122,-0.6446966,-0.6486683,-0.6549,-0.7120554
50%,-0.3489952,0.08051859,0.9904294,-0.06108705,-0.04692613,-0.1587244,-0.879231,-0.7532257,-0.4811324,-0.210439,...,-0.1983529,-0.07448379,0.2446271,0.01023723,0.0297014,-0.03122336,0.0001688026,-0.002926826,0.0297014,-0.07448379
75%,-0.1269983,0.5403123,0.9904294,0.6031716,0.634814,0.5794075,0.7195032,1.327623,-0.4811324,-0.210439,...,-0.1983529,0.6234576,0.8997291,0.6697989,0.7143028,0.5671385,0.6521677,0.6545317,0.7143028,0.6234576
max,12.63783,11.11557,0.9904294,5.252982,5.406995,5.500287,2.318237,1.327623,2.07843,16.15907,...,9.022956,4.18762,4.175239,3.593814,3.270148,4.057583,4.420473,4.842738,3.270148,4.18762


### Steps of the exploratory analysis

#### Encoding
- No encoding was performed, since there are no categorical variables

#### Balancing
- No balancing was performed, since the dataset is already balanced

#### Scaling
- The data was scaled, since the variables are on different scales

#### Feature Selection
- No feature selection was performed, since all variables were considered relevant
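
As a side note on the scaling step: the notebook fits the scaler on the full dataset before splitting. A common alternative is to learn the scaling statistics from the training split only and reuse them on the test split, so no test information reaches training. The sketch below illustrates that alternative in plain Python with hypothetical gold totals and hypothetical helper names (`fit_standardizer`, `standardize`). It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values only."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply a previously learned mean and standard deviation to any split."""
    return [(v - mean) / std for v in values]


# Hypothetical gold totals for a train/test split
train_gold = [15000.0, 16000.0, 17000.0, 18000.0]
test_gold = [16500.0, 19000.0]

mean, std = fit_standardizer(train_gold)
train_scaled = standardize(train_gold, mean, std)
test_scaled = standardize(test_gold, mean, std)  # reuses the training statistics
print(train_scaled, test_scaled)
```

With scikit-learn the same pattern corresponds to calling `fit` on the training features and `transform` on both splits.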

In [201]:
X_train_lol, X_test_lol, y_train_lol, y_test_lol = train_test_split(features_lol, target_lol, test_size=0.2, random_state=7)

## Task 3.1 Support Vector Machines

## Task 3.2 Decision Trees

## Task 3.3 Comparison