# Lab 2: Data!

Agenda
------
+ Your Dataset
+ Balanced/Imbalanced data sets
    - Oversampling correction
+ Training a Logistic Regression classifier
    - Evaluating classifier f_measure
+ Training a Neural Net w/ Keras


# Task 0: Who is in your group?

__Adrian Criollo__

## TASK 1 : Getting familiar with your data
---

For this task, you'll load in your data, gather some statistics about it, and work on some featurization.

**Answer the following:**

**Preliminary Analysis**

Before you get started with the coding aspect of the task, please analyse the dataset visually and draw inferences about common features you might have observed through just scanning through data.
1. What are your findings/inferences?

__This dataset contains two types of text, spam and non-spam, labeled 1 and 0 respectively. Just from scanning the data, there seem to be common themes within the spam emails: many of them try to sell a product or create urgency to do something.__

In [5]:
# if you'd like to use pandas, you can, but you aren't required to
import pandas as pd

import numpy as np
import random

from sklearn.linear_model import LogisticRegression

# for f_measure, word_tokenize
import nltk
# you can also use sklearn.metrics.f1_score
from sklearn.metrics import f1_score

# this function will help split our data into train and test
from sklearn.model_selection import train_test_split

# feel free to use these fancier data structures
from collections import Counter, defaultdict

In [6]:
# your data file should be in the same directory as this notebook
# take a look at this file to see what it looks like and how it is structured
# we've cleaned this dataset for you so you won't see any missing values

# The dataset contains spam and non-spam emails
# Labels: 0 - non-spam, 1 - spam
DATAFILE = "spam_or_not_spam.csv"

# read the data into a pandas dataframe or into numpy arrays or into a list of lists
df = pd.read_csv(DATAFILE)

In [7]:
# print the first few lines of your data to make sure it looks like what you expect
# make sure not to overwhelm yourself/your reader by printing too much data!
print(df.head())

                                               email  label
0   date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...      0
1  martin a posted tassos papadopoulos the greek ...      0
2  man threatens explosion in moscow thursday aug...      0
3  klez the virus that won t die already the most...      0
4   in adding cream to spaghetti carbonara which ...      0


In [8]:
# TO-DO
# Display the following
# Size of dataset
# Number of unique classes in the dataset
# Display all the unique classes
# Number of samples for each class

# Size of dataset
print("Size of dataset: ", df.shape)

# Unique classes in the dataset (find this programmatically!)
print("Number of unique classes: ", df['label'].nunique())
print("Unique classes: ", df['label'].unique())

# Number of samples for each class
print("Number of samples for each class: ", df['label'].value_counts())

Size of dataset:  (2997, 2)
Number of unique classes:  2
Unique classes:  [0 1]
Number of samples for each class:  label
0    2500
1     497
Name: count, dtype: int64


__Imbalanced__ datasets: An imbalanced dataset is one in which the target classes are not equally represented. In other words, the number of samples in one class significantly outnumbers the number of samples in the other class. This difference can bias a model toward the majority class, which here would mean poor spam detection even when overall accuracy looks high. It is therefore often worth rebalancing the training data to address this issue.

The following are some simple techniques to handle imbalanced datasets:
- Random Oversampling (randomly select minority class examples with replacement and add to training data)
- Random Undersampling (randomly select majority class examples and delete from the training data)
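To make the cost of imbalance concrete, the sketch below scores a "classifier" that always predicts the majority class, using the class counts reported for this dataset (2500 non-spam, 497 spam). The always-zero predictor is a hypothetical baseline introduced only for illustration; it is not part of the lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts reported above: 2500 non-spam (0) and 497 spam (1).
y_true = np.array([0] * 2500 + [1] * 497)

# A baseline that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids the warning raised when no positives are predicted.
spam_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy: {accuracy:.3f}")  # equals 2500 / 2997
print(f"spam F1:  {spam_f1:.3f}")
```

Accuracy looks respectable because most emails are non-spam, while the F1 score for the spam class is zero. This is one reason the lab evaluates with F1 rather than accuracy.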


**Answer the following:**
2. Is the dataset imbalanced?

__Yes, this data is imbalanced as it leans in favor of 0 (non-spam).__


# Task 2: Random Oversampling
---

In [12]:
# function to handle oversampling (also known as upsampling)
# implement EITHER the pandas version OR the list version
# you don't need to implement both
# delete whichever one you don't want to use
def oversample_minority_class_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Oversamples or upsamples the minority class in the given DataFrame to balance the class distribution.

    Parameters:
    -----------
    df : pandas.DataFrame
        Input DataFrame containing
        'email' column representing feature
        'label' column representing the class labels.

    Returns:
    --------
    pandas.DataFrame
        DataFrame with the minority class oversampled to match the size of the majority class:
        original majority samples + original minority samples + upsampled minority samples.
    """
    ## TO-DO

    # Identify the majority and minority class labels from the 'label' column in the DataFrame.
    class_counts = df['label'].value_counts()
    majority_class = class_counts.idxmax()
    minority_class = class_counts.idxmin()

    # Separate the majority and minority classes into different DataFrames based on their labels.
    majority_df = df[df['label'] == majority_class]
    minority_df = df[df['label'] == minority_class]

    # Determine the number of samples in the minority and majority classes.
    majority_samples_count = majority_df.shape[0]
    minority_samples_count = minority_df.shape[0]

    # Calculate the number of samples to be added to upsample the minority class.
    num_to_add_to_minority = majority_samples_count - minority_samples_count

    # Randomly select samples (with replacement) from the minority class to create upsampled data.
    # Hint: Can use random.choices
    samples_to_choose_from = minority_df.to_dict(orient="records")
    random_samples = random.choices(samples_to_choose_from, k=num_to_add_to_minority)
    random_samples_df = pd.DataFrame(random_samples)

    # Concatenate the upsampled minority class dataframe with the original majority class and minority class.
    # ignore_index=True gives the combined rows a fresh 0..n-1 index instead of repeated labels.
    upsampled_data = pd.concat([majority_df, minority_df, random_samples_df], ignore_index=True)

    # Return the resulting DataFrame with balanced class distribution.
    return upsampled_data


# call and test your function
balanced_df = oversample_minority_class_df(df)
# keep your original data around too! You'll need it!
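As an alternative to `random.choices`, the same idea can be sketched with pandas' own `DataFrame.sample(replace=True)`. The function name `oversample_with_sample`, its `seed` parameter, and the toy DataFrame are assumptions made for this illustration; they are not part of the lab template.

```python
import pandas as pd


def oversample_with_sample(df: pd.DataFrame, label_col: str = "label", seed: int = 0) -> pd.DataFrame:
    """Duplicate random minority rows until both classes have the same count."""
    counts = df[label_col].value_counts()
    minority_rows = df[df[label_col] == counts.idxmin()]
    n_extra = int(counts.max() - counts.min())
    # replace=True allows the same minority row to be drawn more than once.
    extra = minority_rows.sample(n=n_extra, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)


# Toy data: four non-spam rows and one spam row.
toy = pd.DataFrame({"email": ["a", "b", "c", "d", "buy now"], "label": [0, 0, 0, 0, 1]})
balanced_toy = oversample_with_sample(toy)
print(balanced_toy["label"].value_counts().to_dict())
```

Passing a fixed `random_state` makes the resampling reproducible, which can help when comparing runs.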

In [13]:
# TO-DO
# Display the following
# Size of oversampled dataset
# Number of samples for each class in your oversampled dataset

# Size of dataset
print("Size of dataset: ", balanced_df.shape)

# Unique classes in the dataset (find this programmatically!)
print("Number of unique classes: ", balanced_df['label'].nunique())
print("Unique classes: ", balanced_df['label'].unique())

# Number of samples for each class
print("Number of samples for each class: ", balanced_df['label'].value_counts())

Size of dataset:  (5000, 2)
Number of unique classes:  2
Unique classes:  [0 1]
Number of samples for each class:  label
0    2500
1    2500
Name: count, dtype: int64


# Task 3: Train a Logistic Regression Classifier
----


In [15]:
# TO-DO
# your first step before training a Logistic Regression classifier
# is to featurize your data.

# scikit-learn LogisticRegression and keras neural networks like data in
# the format:
# X: a matrix of shape (num_samples, num_features)
# y: a vector of shape (num_samples,)

# start by defining some words as your features — you'll be using these
# as a mini bag of words. Choose up to 10 words that you think might be
# indicative of spam emails. You can look at the data to get some ideas.
# note: "winner" is listed twice, so columns 0 and 8 of the feature matrix are identical
WORDS = ["winner", "sexy", "free", "guaranteed", "click", "password", "approved", "need", "winner", "lucky"]
nltk.download('punkt')
def featurize(X_train):
    """Return a (num_samples, len(WORDS)) DataFrame of 0/1 word-presence features."""
    # use nltk.word_tokenize to tokenize the emails
    features = []
    for email in X_train:
        # a set makes each membership test O(1) instead of scanning the token list
        tokens = set(nltk.word_tokenize(email))
        features.append([1 if word in tokens else 0 for word in WORDS])

    return pd.DataFrame(features, columns=WORDS)

# make sure you've successfully featurized your data

# the shape should be (num_samples, num_features) where num_features
# is the length of your word list


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adria\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
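For comparison, a similar presence/absence feature matrix can be sketched with scikit-learn's `CountVectorizer` given a fixed vocabulary. This swaps in scikit-learn's default tokenizer for `nltk.word_tokenize`, so results may differ slightly on real emails. The short `vocab` list and the two example emails are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical word list for illustration; CountVectorizer rejects duplicate vocabulary terms.
vocab = ["free", "click", "guaranteed"]

# binary=True records presence (1) or absence (0) rather than raw counts.
vectorizer = CountVectorizer(vocabulary=vocab, binary=True)

emails = [
    "click here for your free prize free free",
    "meeting notes from thursday",
]

# With a fixed vocabulary, transform can be called without fitting first.
X = vectorizer.transform(emails).toarray()
print(X.shape)  # (num_samples, num_features)
print(X)
```

The columns follow the order of `vocab`, which matches the (num_samples, num_features) layout that `LogisticRegression` and Keras expect.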


In [16]:
# split your data into train and test sets
X_featurized = featurize(balanced_df['email'])
# this function will put 70% of your data into the training set and 30% into the test set
X_train, X_test, y_train, y_test = train_test_split(X_featurized, balanced_df['label'], test_size=0.3, random_state=42)
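One caveat, offered as an aside rather than a change to the lab's procedure: when oversampling happens before the split, copies of the same spam email can end up in both the training and the test set, which may inflate test scores. A minimal sketch of the alternative order (split first, then oversample only the training part) on made-up toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the lab's DataFrame (hypothetical data).
toy = pd.DataFrame({
    "email": [f"ham {i}" for i in range(16)] + [f"spam {i}" for i in range(4)],
    "label": [0] * 16 + [1] * 4,
})

# 1) Split first; stratify keeps the class ratio similar in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["label"]
)

# 2) Oversample the minority class in the training part only.
counts = train_df["label"].value_counts()
minority = train_df[train_df["label"] == counts.idxmin()]
extra = minority.sample(n=int(counts.max() - counts.min()), replace=True, random_state=42)
train_balanced = pd.concat([train_df, extra], ignore_index=True)

# The training set is balanced, and no training email also appears in the test set.
overlap = set(train_balanced["email"]) & set(test_df["email"])
print(train_balanced["label"].value_counts().to_dict())
print(overlap)
```

The test set keeps the original class ratio here, so its scores reflect performance on imbalanced data.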


In [17]:
# train your model

model = LogisticRegression()
model.fit(X_train, y_train)


In [18]:
# see how good your model is by printing the f1 score
predictions = model.predict(X_test)

print("F1 score: ", f1_score(y_test, predictions))

F1 score:  0.7976878612716763
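As a reminder of what this number measures, the F1 score is the harmonic mean of precision and recall for the positive (spam) class. The sketch below computes it by hand on a small made-up label vector and compares the result with `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: 4 spam (1) and 6 non-spam (0) emails.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # spam correctly flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-spam flagged as spam
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1, f1_score(y_true, y_pred))
```

Both values agree because `f1_score` defaults to scoring the class labeled 1 in a binary problem.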


In [19]:
# what is the distribution of your predictions?
# is your model predicting all 0s or 1s?

lr_preds = model.predict(X_test)
predictions = pd.Series(lr_preds)
distribution = predictions.value_counts()
print("Predictions: \n", distribution)



Predictions: 
 0    848
1    652
Name: count, dtype: int64


1. Using the __original__ data, what label distribution does your LogisticRegression classifier give? __The model guesses 0 (non-spam) 801 times and 1 (spam) 99 times.__
2. Using the __oversampled/upsampled__ data, what label distribution does your LogisticRegression classifier give? __The model guesses 0 (non-spam) 868 times and 1 (spam) 632 times.__

In [21]:
# finally, take a look at the model's coefficients (weights) (model.coef_)
# which word is the most important for predicting spam?
coefficients = model.coef_[0]
feature_weights = pd.DataFrame({'Word': WORDS,'Coefficient': coefficients})
print(feature_weights)

         Word  Coefficient
0      winner    -0.473680
1        sexy    -0.209073
2        free     1.355523
3  guaranteed     2.819742
4       click     3.679690
5    password    -0.647605
6    approved     1.834257
7        need     0.444333
8      winner    -0.473680
9       lucky     0.107002
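One way to read these weights: in logistic regression, a coefficient is the change in the log-odds of spam when that word is present and the other features are held fixed, so `exp(coefficient)` is the factor by which the odds are multiplied. The sketch below ranks the coefficients printed above; the values are copied from that table, and the column names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above ("winner" appears twice there; listed once here).
weights = pd.Series({
    "winner": -0.473680,
    "sexy": -0.209073,
    "free": 1.355523,
    "guaranteed": 2.819742,
    "click": 3.679690,
    "password": -0.647605,
    "approved": 1.834257,
    "need": 0.444333,
    "lucky": 0.107002,
})

ranked = pd.DataFrame({
    "coefficient": weights,
    "odds_ratio": np.exp(weights),  # multiplicative change in the odds of spam
}).sort_values("coefficient", ascending=False)

print(ranked.round(3))
```

Positive coefficients push a prediction toward spam and negative ones toward non-spam, so the sign matters as well as the magnitude.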


3. What is the most important word when using the __original__ data? __click__
4. What is the most important word when using the __oversampled/upsampled__ data? __click__
5. What do the relative weights of the model when using the original vs. the oversampled data tell you about the effects of using oversampled data on this model? (1 - 2 sentences) __For some words, the weight changed substantially. This suggests that when the data is imbalanced, certain features are weighted less.__
6. We didn't implement it today, but we can also __downsample/undersample__, in which we reduce the size of the majority class in our data. What effects do you think that this would have? (1 - 2 sentences) __I think this would also lead to a closer balance between labels, but we would be sacrificing data.__
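For reference, random undersampling could be sketched as follows. The function name `undersample_majority_class`, its parameters, and the toy data are assumptions for this illustration; the lab itself does not implement undersampling.

```python
import pandas as pd


def undersample_majority_class(df: pd.DataFrame, label_col: str = "label", seed: int = 0) -> pd.DataFrame:
    """Randomly drop majority rows until both classes have the same count."""
    counts = df[label_col].value_counts()
    majority_rows = df[df[label_col] == counts.idxmax()]
    minority_rows = df[df[label_col] == counts.idxmin()]
    # replace=False keeps each retained majority row at most once.
    kept = majority_rows.sample(n=len(minority_rows), replace=False, random_state=seed)
    return pd.concat([kept, minority_rows], ignore_index=True)


# Toy data: six non-spam rows and two spam rows.
toy = pd.DataFrame({
    "email": ["a", "b", "c", "d", "e", "f", "buy now", "click here"],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})
undersampled_toy = undersample_majority_class(toy)
print(undersampled_toy["label"].value_counts().to_dict())
```

Applied to this lab's data, the same approach would keep 497 emails per class (994 rows in total), discarding roughly four fifths of the non-spam examples.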

## TASK 4: Training a Neural Net
---

Finally, we'll train and evaluate a neural net using the same data as your Logistic Regression model.

In [24]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [25]:
# Create your model
# (take a look at lecture notebook 9 for an example of creating a keras neural network)
# X_train and y_train (from the split above) are your training features and labels
# Define the model
model = Sequential()

# Add the input layer
# This is done for you as an example
model.add(Dense(units=32, input_dim=X_train.shape[1], activation='relu'))

# Add a hidden layer
# This is also done as an example
model.add(Dense(units=16, activation='relu'))

# Add the output layer, assuming binary classification (spam or not spam)
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()

# call compile



1. How many __hidden__ layers does your network have? __This network has 2 hidden layers: dense and dense_1.__

In [27]:
# train your model
# if you get this error:
# ValueError: Failed to find data adapter that can handle input: <class 'numpy.ndarray'>, (<class 'list'> containing values of types {"<class 'int'>"})
# this means that your inputs need to be numpy arrays

# some useful parameters to the keras model.fit function:
# epochs: how many times you want to go through your training data
# batch_size: how many examples to look at before updating the weights in your network
# verbose: whether or not you want to see the training progress
model.fit(X_train, y_train, epochs=10, verbose=1)

Epoch 1/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 1s 863us/step - accuracy: 0.5019 - loss: 0.6674
Epoch 2/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 744us/step - accuracy: 0.7670 - loss: 0.5504
Epoch 3/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 735us/step - accuracy: 0.7818 - loss: 0.4756
Epoch 4/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 744us/step - accuracy: 0.7944 - loss: 0.4469
Epoch 5/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 707us/step - accuracy: 0.7956 - loss: 0.4441
Epoch 6/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 753us/step - accuracy: 0.7996 - loss: 0.4373
Epoch 7/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 753us/step - accuracy: 0.7959 - loss: 0.4347
Epoch 8/10
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 836us/step - accuracy: 0.7883 - loss: 0.4487
Epoch 9/10
110/110

<keras.src.callbacks.history.History at 0x181df3baed0>

In [28]:
# see how good your model is (use the `model.predict(test_data)` function)
# print out the f1 score

# keras returns sigmoid probabilities with shape (num_samples, 1); f1_score needs 0/1 labels,
# otherwise it raises "ValueError: Classification metrics can't handle a mix of binary and continuous targets"
y_predictions = (model.predict(X_test) > 0.5).astype(int).ravel()
print("F1 score: ", f1_score(y_test, y_predictions))

47/47 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step

2. What is the output of the `keras` function `model.predict(test_data)`? Is it the same as your `sklearn` `LogisticRegression` model's `model.predict(test_data)` function? __YOUR ANSWER HERE__
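To illustrate the difference this question points at without needing TensorFlow, the sketch below uses scikit-learn only. `predict` returns hard 0/1 labels, while `predict_proba` returns probabilities, which is the kind of value a single sigmoid output unit produces. The toy data and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with one binary feature.
X_toy = np.array([[0], [0], [1], [1], [0], [1]])
y_toy = np.array([0, 0, 1, 1, 0, 1])
clf = LogisticRegression().fit(X_toy, y_toy)

labels = clf.predict(X_toy)             # hard labels, shape (6,)
probs = clf.predict_proba(X_toy)[:, 1]  # P(class 1) per sample, shape (6,)

# Reshape to a column to mimic the (num_samples, 1) layout of a one-unit sigmoid layer,
# then threshold at 0.5 and flatten to recover hard labels.
probs_column = probs.reshape(-1, 1)
recovered = (probs_column > 0.5).astype(int).ravel()

print(labels.shape, probs_column.shape)
print(np.array_equal(recovered, labels))
```

Thresholding at 0.5 reproduces the labels from `predict` for a binary logistic regression, which is why the same threshold is a common default for a sigmoid output.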

In [None]:
# what is the distribution of your predictions?
# is your model predicting all 0s or 1s?


# compare and contrast your LR predictions with your NN predictions

3. Using the __original__ data, what label distribution does your neural classifier give? __YOUR ANSWER HERE__ ({0.0: 784, 1.0: 116})
    1. What is the overlap with the Logistic Regression guesses (# both models agree on)? __YOUR ANSWER HERE__ ({True: 885, False: 15})
4. Using the __oversampled/upsampled__ data, what label distribution does your neural classifier give? __YOUR ANSWER HERE__ ({0.0: 921, 1.0: 579})
    1. What is the overlap with the Logistic Regression guesses (# both models agree on)? __YOUR ANSWER HERE__ (1493 agree, 7 disagree)
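Counts like the ones in parentheses above can be produced with a few lines of NumPy. The sketch below assumes both models' predictions are already 1-D arrays of 0/1 labels; the names `lr_labels` and `nn_labels` and their values are made up for illustration.

```python
from collections import Counter

import numpy as np

# Hypothetical hard labels from the two models on the same six test emails.
lr_labels = np.array([0, 1, 1, 0, 0, 1])
nn_labels = np.array([0, 1, 0, 0, 0, 1])

# Label distribution of the neural network's predictions.
nn_distribution = Counter(nn_labels.tolist())

# Element-wise comparison: True where both models predict the same label.
agreement = Counter((lr_labels == nn_labels).tolist())

print(nn_distribution)
print(agreement)
```

Both arrays must refer to the same test examples in the same order for the comparison to be meaningful.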



## BONUS: Training a Neural Net from scratch
---

While modern libraries give us efficient, well-tested implementations, understanding what they do internally makes it easier to write code around them. To get you started, here is a simple fully connected neural network that uses the sigmoid function as its activation function.

If you are interested in learning more about this, you can check out the neural network approach to creating a bi-gram model demonstrated here. [Youtube Link](https://youtu.be/PaCmpygFfXo?feature=shared&t=3777)

In [None]:
# Define the sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [None]:
# Define the derivative of the sigmoid function
# note: x must already be a sigmoid output s = sigmoid(z), since d(sigmoid)/dz = s * (1 - s)
def sigmoid_derivative(x):
    return x * (1 - x)
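A quick numerical check shows why the argument has to be the activation rather than the pre-activation. The sketch below compares `sigmoid_derivative` applied to `sigmoid(z)` against a central finite difference; the test points and step size are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_derivative(s):
    # s is expected to be sigmoid(z), not z itself
    return s * (1 - s)


z = np.array([-2.0, 0.0, 1.5])
s = sigmoid(z)

analytic = sigmoid_derivative(s)

# Central finite-difference approximation of d(sigmoid)/dz.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# Passing the pre-activation z directly gives a different, incorrect result.
wrong = sigmoid_derivative(z)

print(np.allclose(analytic, numeric))
print(np.allclose(wrong, numeric))
```

This matches how the network class uses the function: it is only ever called on stored activations.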

In [None]:
# Define the neural network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases
        # TODO
        self.weights_input_hidden = np.random.rand(input_size, hidden_size) # Weights for the links between the input layer and the hidden layer
        self.bias_hidden = np.zeros((1, hidden_size)) # Bias for the hidden layer
        self.weights_hidden_output = np.random.rand(hidden_size, output_size) # Weights for the links between the hidden layer and the output layer
        self.bias_output = np.zeros((1, output_size)) # Bias for the output layer

    def forward(self, X):
        # Forward pass
        self.hidden_activation = sigmoid(np.dot(X, self.weights_input_hidden) + self.bias_hidden)
        self.output_activation = sigmoid(np.dot(self.hidden_activation, self.weights_hidden_output) + self.bias_output)
        return self.output_activation

    def backward(self, X, y, learning_rate):
        # Backward pass
        error = y - self.output_activation
        output_delta = error * sigmoid_derivative(self.output_activation)
        hidden_error = output_delta.dot(self.weights_hidden_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_activation)

        # Update weights and biases
        self.weights_hidden_output += self.hidden_activation.T.dot(output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += X.T.dot(hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

    def train(self, X, y, epochs, learning_rate):
        # Training routine
        for epoch in range(epochs):
            output = self.forward(X) # The forward pass
            self.backward(X, y, learning_rate) # The backward pass
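The `+=` updates in `backward` can be read as ordinary gradient descent. The derivation below is an interpretation, since the notebook does not name its loss: it assumes a squared-error loss $L$, writes $W_1, b_1$ for `weights_input_hidden`, `bias_hidden`, $W_2, b_2$ for `weights_hidden_output`, `bias_output`, $h$ for `hidden_activation`, $\hat{y}$ for `output_activation`, and $\eta$ for `learning_rate`.

```latex
\begin{aligned}
h &= \sigma(X W_1 + b_1), \qquad \hat{y} = \sigma(h W_2 + b_2), \qquad
L = \tfrac{1}{2} \sum_i (y_i - \hat{y}_i)^2 \\
\delta_2 &= (y - \hat{y}) \odot \hat{y} \odot (1 - \hat{y}) = -\frac{\partial L}{\partial z_2},
\qquad z_2 = h W_2 + b_2 \\
\delta_1 &= (\delta_2 W_2^{\top}) \odot h \odot (1 - h) = -\frac{\partial L}{\partial z_1},
\qquad z_1 = X W_1 + b_1 \\
W_2 &\leftarrow W_2 + \eta \, h^{\top} \delta_2, \qquad b_2 \leftarrow b_2 + \eta \sum_i \delta_{2,i} \\
W_1 &\leftarrow W_1 + \eta \, X^{\top} \delta_1, \qquad b_1 \leftarrow b_1 + \eta \sum_i \delta_{1,i}
\end{aligned}
```

Because $\delta_2$ and $\delta_1$ are the negative gradients, adding them (scaled by $\eta$) moves the parameters downhill on $L$. The sums run over all training samples, so the step size grows with the dataset and the learning rate may need to be small.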

In [None]:
# Convert the data into an array
X_train_np = np.array(X_train)
y_train_np = np.array(y_train)
X_test_np = np.array(X_test)

In [None]:
# TODO
# Play around with these parameters to get optimal results.
input_size = X_train_np.shape[1]
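# starting values only (assumptions, not tuned); adjust them to improve the F1 score
hidden_size = 16
output_size = 1  # one sigmoid unit for binary spam / not-spam
learning_rate = 0.001
epochs = 500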

In [None]:
# Create and train the neural network
nn = NeuralNetwork(input_size, hidden_size, output_size)
nn.train(X_train_np, y_train_np.reshape(-1, 1), epochs, learning_rate)

In [None]:
# Make predictions on the test set
nn_y_pred = nn.forward(X_test_np)

# Convert probabilities to binary predictions (ravel flattens (num_samples, 1) to a 1-D vector)
nn_y_pred_binary = (nn_y_pred > 0.5).astype(int).ravel()

# Calculate F1 score
nn_f1 = f1_score(y_test, nn_y_pred_binary)
print("F1 Score - Neural Network (from scratch):", nn_f1)

Q. What do you think goes on inside the other neural network functions that libraries provide? (Give relevant references.)

__Your Answer here__

__Make sure to clear your kernel and run your notebook from top to bottom, ensuring there are no errors, before turning it in!__

Once you find the right combination of hyperparameters, your F1 score should exceed that of the previous implementation (the criterion for full bonus points).