# Lab 2: Data!

10/16/2023, Felix Muzny, Nidhi Bodar, Harshitha Somala, Ankit Ramakrishnan

Agenda
------
+ your data set
+ balanced/imbalanced data sets
    - oversampling correction
+ training a Logistic Regression classifier
    - evaluating classifier f_measure
+ training a Neural Net w/ Keras


# Task 0: Who is in your group?

Dave Budhram, Akshay Dupuguntla, Sunny Huang

# Task 1: Getting familiar with your data
---

For this task, you'll load in your data, gather some statistics about it, and work on some featurization.

In [1]:

# if you'd like to use pandas, you can, but you aren't required to
import pandas as pd

import numpy as np
import random

from sklearn.linear_model import LogisticRegression

# for f_measure, word_tokenize
import nltk
import nltk.corpus as corpus
# you can also use sklearn.metrics.f1_score
from sklearn.metrics import f1_score

# this function will help split our data into train and test
from sklearn.model_selection import train_test_split

# feel free to use these fancier data structures
from collections import Counter, defaultdict

In [5]:
# your data file should be in the same directory as this notebook
# take a look at this file to see what it looks like and how it is structured
# we've cleaned this dataset for you so you won't see any missing values

# The dataset contains spam and non-spam emails
# Labels: 0 - non-spam, 1 - spam
DATAFILE = "spam_or_not_spam.csv"

# read the data into a pandas dataframe or into numpy arrays or into a list of lists
spam_df = pd.read_csv(DATAFILE)


In [6]:
# print the first few lines of your data to make sure it looks like what you expect
# make sure not to overwhelm yourself/your reader by printing too much data!
spam_df.head()


| Unnamed: 0 | email | label |
|---|---|---|
| 0 | date wed NUMBER aug NUMBER NUMBER NUMBER NUMB... | 0 |
| 1 | martin a posted tassos papadopoulos the greek ... | 0 |
| 2 | man threatens explosion in moscow thursday aug... | 0 |
| 3 | klez the virus that won t die already the most... | 0 |
| 4 | in adding cream to spaghetti carbonara which ... | 0 |


In [7]:
# TO-DO
# Display the following
# Size of dataset
# Number of unique classes in the dataset 
# Display all the unique classes
# Number of samples for each class

# Size of dataset
print(len(spam_df))

# Unique classes in the dataset (find this programmatically!)
print(set(spam_df['label']))

# Number of samples for each class
print(spam_df['label'].value_counts())

2997
{0, 1}
0    2500
1     497
Name: label, dtype: int64


**Answer the following:**
1. Is the dataset imbalanced? Yes. Label 0 (non-spam) has 2500 emails and label 1 (spam) has 497, a ratio of roughly 5:1.

__Imbalanced__ datasets: A dataset is imbalanced when its target classes are not equally represented, that is, when one class has significantly more samples than the other. A model trained on such data can become biased toward the majority class. For this lab, that would mean poor spam detection even when overall accuracy looks high. Balancing the dataset is one common way to address this.
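To see how accuracy can hide this bias, consider a hypothetical baseline that always predicts the majority class. The sketch below uses the class counts printed above (2500 non-spam, 497 spam); the always-predict-0 baseline is only an illustration and is not part of the lab.

```python
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the value_counts() output above.
n_non_spam, n_spam = 2500, 497

y_true = [0] * n_non_spam + [1] * n_spam
# Hypothetical baseline: always predict the majority class (0).
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no sample is predicted as spam.
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"spam F1:  {baseline_f1:.3f}")
```

The baseline reaches about 83% accuracy without catching a single spam email, so its F1 score for the spam class is 0.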

The following are some simple techniques for handling imbalanced datasets:
- Random Oversampling (randomly select minority class examples with replacement and add to training data)
- Random Undersampling (randomly select majority class examples and delete from the training data)
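Random oversampling is implemented in Task 2. Random undersampling is not implemented in this lab, so here is a minimal sketch of one way it could look with pandas. The toy DataFrame and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the real emails: 6 non-spam rows, 2 spam rows.
toy_df = pd.DataFrame({
    "email": [f"message {i}" for i in range(8)],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

counts = toy_df["label"].value_counts()
majority_label, minority_label = counts.idxmax(), counts.idxmin()

majority_rows = toy_df[toy_df["label"] == majority_label]
minority_rows = toy_df[toy_df["label"] == minority_label]

# Random undersampling: keep only as many majority rows as there are minority rows.
kept_majority = majority_rows.sample(n=len(minority_rows), random_state=0)
undersampled_df = pd.concat([kept_majority, minority_rows])

print(undersampled_df["label"].value_counts())
```

Both classes end up with two rows each; the other four majority rows are discarded.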



# Task 2: Random Oversampling
---

In [16]:

# function to handle oversampling (also known as upsampling)
# implement EITHER the pandas version OR the list version
# you don't need to implement both
# delete whichever one you don't want to use
def oversample_minority_class_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Oversamples or upsamples the minority class in the given DataFrame to balance the class distribution.

    Parameters:
    -----------
    df : pandas.DataFrame
        Input DataFrame containing 
        'email' column representing feature
        'label' column representing the class labels.

    Returns:
    --------
    pandas.DataFrame 
        Concatenated dataframe - original majority samples + original minority samples + upsampled minority samples
        DataFrame with the minority class oversampled to match the size of the majority class.
    """
    ## TO-DO

    # Identify the majority and minority class labels from the 'label' column in the DataFrame.
    counts = df['label'].value_counts()
    majority = counts.idxmax()
    minority = counts.idxmin()
    # Separate the majority and minority classes into different DataFrames based on their labels.
    df_majority = df[df['label'] == majority]
    df_minority = df[df['label'] == minority]

    # Determine the number of samples in the minority and majority classes.
    majority_count = len(df_majority)
    minority_count = len(df_minority)

    # Calculate the number of samples to be added to upsample the minority class.
    minority_needed = majority_count - minority_count

    # Randomly select samples (with replacement) from the minority class to create upsampled data.
    # create a dataset with upsampled minority class
    # Hint: Can use random.choices
    seed_value = 42
    random.seed(seed_value)
    upsampled_minority_class = random.choices(df_minority.to_records(index=False), k=minority_needed )
    # Convert the upsampled minority class data back to a DataFrame
    df_upsampled_minority_class = pd.DataFrame.from_records(upsampled_minority_class, columns=df.columns)

    # Concatenate the upsampled minority class dataframe with the original majority class and minority class.
    upsampled_with_minority_df = pd.concat([df_minority, df_upsampled_minority_class])
    upsampled_df = pd.concat([df_majority, upsampled_with_minority_df])
    # Return the resulting DataFrame with balanced class distribution.
    return upsampled_df


# function to handle oversampling (also known as upsampling)
def oversample_minority_class(data: list) -> list:
    """
    Oversamples or upsamples the minority class in the given list to balance the class distribution.

    Parameters:
    -----------
    data : list
        Input data containing lists with two elements:
        the first is the email and the second is the label
        
    Returns:
    --------
    list
        list of lists - original majority samples + original minority samples + upsampled minority samples
        List with the minority class oversampled to match the size of the majority class.
    """
    ## TO-DO

    # Identify the majority and minority class labels
    
    # Separate the majority and minority classes into different groups
    
    
    # Determine the number of samples in the minority and majority classes.
    
    # Calculate the number of samples to be added to upsample the minority class.

    # Randomly select samples (with replacement) from the minority class to create upsampled data.
    # create a dataset with upsampled minority class
    # Hint: Can use random.choices with the k= optional parameter

    # Return the resulting list of lists with balanced class distribution.
    

# call and test your function
# keep your original data around too! You'll need it!
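For reference, the list-based steps outlined in the comments above could be sketched as follows. This is one possible approach, not the required solution: the function name `oversample_minority_class_list`, the `seed` parameter, and the toy data are assumptions made for this example, and it assumes exactly two classes.

```python
import random
from collections import Counter


def oversample_minority_class_list(data: list, seed: int = 42) -> list:
    """Return data plus randomly repeated minority rows so both classes match in size."""
    # Identify the majority and minority class labels.
    label_counts = Counter(label for _, label in data)
    majority_label = max(label_counts, key=label_counts.get)
    minority_label = min(label_counts, key=label_counts.get)

    # Collect the minority rows and work out how many extra rows are needed.
    minority_rows = [row for row in data if row[1] == minority_label]
    n_needed = label_counts[majority_label] - label_counts[minority_label]

    # Sample with replacement from the minority class only.
    rng = random.Random(seed)
    extra_rows = rng.choices(minority_rows, k=n_needed)

    # The original list is left untouched; a new list is returned.
    return data + extra_rows


toy_data = [["hello friend", 0], ["meeting notes", 0], ["lunch today", 0], ["free cash now", 1]]
balanced = oversample_minority_class_list(toy_data)
print(Counter(label for _, label in balanced))
```

Using a local `random.Random(seed)` keeps the sampling reproducible without changing the global random state.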

In [17]:
# TO-DO
# Display the following
# Size of oversampled dataset
# Number of samples for each class in your oversampled dataset
# df = pd.read_csv(DATAFILE)
spam_oversampled_df = oversample_minority_class_df(spam_df)
# Size of dataset
print(len(spam_oversampled_df))
# Unique classes
print(set(spam_oversampled_df['label']))
# Number of samples for each class
print(spam_oversampled_df['label'].value_counts())

5000
{0, 1}
0    2500
1    2500
Name: label, dtype: int64


# Task 3: Train a Logistic Regression Classifier
----


In [18]:
# TO-DO
# your first step before training a Logistic Regression classifier
# is to featurize your data.

# scikit-learn LogisticRegression and keras neural networks like data in
# the format:
# X: a matrix of shape (num_samples, num_features)
# y: a vector of shape (num_samples,)

# start by defining some words as your features — you'll be using these
# as a mini bag of words. Choose up to 10 words that you think might be
# indicative of spam emails. You can look at the data to get some ideas.
WORDS = ['money', 'card', 'free', 'cash', 'refund', 'credit', 'best', 'save', 'now', 'congratulations']

def featurize(emails):
    # implement this function to featurize your data
    # use nltk.word_tokenize to tokenize the emails
    # returns a list of lists with shape (num_samples, num_features)
    rows = []
    for email in emails:
        counter = Counter(nltk.word_tokenize(email))
        # a Counter returns 0 for words that never appear in the email
        rows.append([counter[word] for word in WORDS])
    return rows
    
    
# make sure you've successfully featurized your data

# the shape should be (num_samples, num_features) where num_features
# is the length of your word list
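One way to run the shape check described above is to convert the features to a numpy array and inspect `.shape`. The sketch below is self-contained, so it makes two assumptions that differ from the lab code: it uses a shortened, hypothetical vocabulary `TOY_WORDS`, and it splits on whitespace instead of calling `nltk.word_tokenize`.

```python
from collections import Counter

import numpy as np

# Hypothetical three-word vocabulary, shortened from the WORDS list above.
TOY_WORDS = ["free", "cash", "now"]


def featurize_whitespace(emails):
    """Count how often each vocabulary word appears in each email."""
    rows = []
    for email in emails:
        # Assumption: split on whitespace instead of using nltk.word_tokenize.
        token_counts = Counter(email.split())
        rows.append([token_counts[word] for word in TOY_WORDS])
    return np.array(rows)


toy_emails = ["free cash free", "see you at lunch"]
X = featurize_whitespace(toy_emails)
print(X.shape)  # (num_samples, num_features)
print(X)
```

With two emails and three vocabulary words the shape is `(2, 3)`; on the lab data the second dimension should equal `len(WORDS)`.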


In [19]:

features = featurize(spam_oversampled_df['email'])
labels = spam_oversampled_df['label']
# this function will put 70% of your data into the training set and 30% into the test set
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)



In [20]:
# train your model
oversampled_model = LogisticRegression()
oversampled_model.fit(X_train, y_train)


In [21]:
# see how good your model is by printing the f1 score
lr_preds = oversampled_model.predict(X_test)


In [23]:
# what is the distribution of your predictions?
# is your model predicting all 0s or 1s?
# overall accuracy on the test set (the prediction distribution is printed in the next cell)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, lr_preds))

0.7286666666666667
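Accuracy counts all correct predictions, while the F1 score combines precision and recall for the positive (spam) class. The sketch below uses made-up labels to show how F1 is computed by hand and checks the result against `sklearn.metrics.f1_score`. On the lab data the analogous call would be `f1_score(y_test, lr_preds)`.

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 1 = spam, 0 = non-spam.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = true_pos / (true_pos + false_pos)  # how many predicted spam are spam
recall = true_pos / (true_pos + false_neg)     # how many spam emails were found
manual_f1 = 2 * precision * recall / (precision + recall)

sklearn_f1 = f1_score(y_true_toy, y_pred_toy)
print(manual_f1, sklearn_f1)
```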


In [24]:
print(lr_preds)
count = Counter(lr_preds)
print(count)


[0 0 1 ... 0 0 0]
Counter({0: 931, 1: 569})


1. Using the __original__ data, what label distribution does your LogisticRegression classifier give? `Counter({0: 840, 1: 60})` (see the cell below)
2. Using the __oversampled/upsampled__ data, what label distribution does your LogisticRegression classifier give? `Counter({0: 931, 1: 569})` (see the cell above)

In [26]:
# split your data into train and test sets
features = featurize(spam_df['email'])
labels = spam_df['label']
# this function will put 70% of your data into the training set and 30% into the test set
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)

# train your model
spam_model = LogisticRegression()
spam_model.fit(X_train, y_train)
lr_preds = spam_model.predict(X_test)
count = Counter(lr_preds)
print(count)

Counter({0: 840, 1: 60})


In [27]:
# finally, take a look at the model's coefficients (weights) (model.coef_)
# which word is the most important for predicting spam?
coefficients = oversampled_model.coef_
abs_coefficients = np.abs(coefficients)
print(coefficients)
print(WORDS[abs_coefficients.argmax()])

[[ 0.70273435 -0.05135297  0.93570958  0.95170556 -1.12075203  1.27362274
  -0.07438184  0.78240272 -0.08678875  1.1951754 ]]
credit


In [28]:
coefficients = spam_model.coef_
abs_coefficients = np.abs(coefficients)
print(coefficients)
print(WORDS[abs_coefficients.argmax()])

[[ 0.6175562  -0.18227245  0.90549302  1.00808619 -0.65278403  1.50960816
  -0.39014197  0.71379644  0.01539242  1.3880844 ]]
credit
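Looking only at the single largest weight hides the rest of the picture. As a rough sketch, the words can be ranked by absolute weight, with `exp(weight)` read as the multiplicative change in the odds of spam for each extra occurrence of the word. The coefficients below are copied from the oversampled model's output above; the variable names are made up for this example.

```python
import numpy as np

WORDS = ['money', 'card', 'free', 'cash', 'refund', 'credit', 'best', 'save', 'now', 'congratulations']
# Coefficients copied from the oversampled model's printed output.
coefs = np.array([0.70273435, -0.05135297, 0.93570958, 0.95170556, -1.12075203,
                  1.27362274, -0.07438184, 0.78240272, -0.08678875, 1.1951754])

# Sort word indices by absolute weight, largest first.
order = np.argsort(-np.abs(coefs))
ranked = [(WORDS[i], float(coefs[i])) for i in order]

for word, weight in ranked:
    # A positive weight pushes toward spam, a negative weight toward non-spam.
    print(f"{word:16s} weight={weight:+.3f}  odds ratio={np.exp(weight):.2f}")
```

In this ranking "credit" comes first, followed by "congratulations" and then "refund", whose negative weight points toward non-spam.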


3. What is the most important word when using the __original__ data? Credit
4. What is the most important word when using the __oversampled/upsampled__ data? Credit
5. What do the relative weights of the model when using the original vs. the oversampled data tell you about the effects of using oversampled data on this model? (1 - 2 sentences) __YOUR ANSWER HERE__
6. We didn't implement it today, but we can also __downsample/undersample__ in which we reduce the size of the majority class in our data. What effects do you think that this would have? (1 - 2 sentences) __YOUR ANSWER HERE__

# Task 4: Training a Neural Net
---

Finally, we'll train and evaluate a neural net using the same data as your Logistic Regression model.

In [103]:
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K

In [112]:
# Create your model
# (take a look at lecture notebook 9 for an example of creating a keras neural network)
df = pd.read_csv(DATAFILE)
df = oversample_minority_class_df(df)
features = featurize(df['email'])
labels = df['label']
# this function will put 70% of your data into the training set and 30% into the test set
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)

model = Sequential()
# hidden layer
# you can play around with different activation functions
# input_dim must match the number of features (one per word in WORDS)
model.add(Dense(units=4, activation='relu', input_dim=len(WORDS)))

# output layer
# activation function is our classification function
model.add(Dense(units=1, activation='sigmoid'))

# configure the learning process
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

# print a summary of the layers and their parameter counts
model.summary()

Model: "sequential_20"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_35 (Dense)            (None, 4)                 44        
                                                                 
 dense_36 (Dense)            (None, 1)                 5         
                                                                 
Total params: 49 (196.00 Byte)
Trainable params: 49 (196.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


1. How many __hidden__ layers does your network have? 1
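The parameter counts in the summary above can be checked by hand: a Dense layer has one weight per input per unit, plus one bias per unit. A small sketch, where `dense_param_count` is a helper written for this example and not a Keras function:

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden_params = dense_param_count(n_inputs=10, n_units=4)  # 10 word features -> 4 hidden units
output_params = dense_param_count(n_inputs=4, n_units=1)   # 4 hidden units -> 1 output unit
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

This gives 44, 5, and 49, matching the `Param #` column and the total in the summary.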

In [115]:
# train your model
# if you get this error:
# ValueError: Failed to find data adapter that can handle input: <class 'numpy.ndarray'>, (<class 'list'> containing values of types {"<class 'int'>"})
# this means that your inputs need to be numpy arrays

# some useful parameters to the keras model.fit function:
# epochs: how many times you want to go through your training data
# batch_size: how many examples to look at before updating the weights in your network
# verbose: whether or not you want to see the training progress
model.fit(np.array(X_train), np.array(y_train), epochs=10, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x2939f4c40>
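Because `batch_size` is not passed to `model.fit` above, Keras uses its default of 32 for array inputs. The sketch below estimates how many weight updates that implies, assuming 3500 training samples (70% of the 5000 oversampled rows).

```python
import math

n_train = 3500    # 70% of the 5000 oversampled rows
batch_size = 32   # Keras default when batch_size is not given
epochs = 10

# The last batch of each epoch may be smaller, so round up.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```

That works out to 110 updates per epoch and 1100 in total. A smaller `batch_size` means more, noisier updates per epoch; a larger one means fewer, smoother updates.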

In [116]:
from sklearn.metrics import f1_score
# see how good your model is (use the `model.predict(test_data)` function)
# print out the f1 score
preds = model.predict(np.array(X_test))
print(preds)
preds = [x[0] for x in preds]
preds = [1 if x > .5 else 0 for x in preds]

print(f1_score(np.array(y_test), preds))

[[0.41495636]
 [0.42940155]
 [0.75322706]
 ...
 [0.41495636]
 [0.41495636]
 [0.41495636]]
0.6734059097978227


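2. What is the output of the `keras` function `model.predict(test_data)`? Is it the same as your `sklearn` `LogisticRegression` model's `model.predict(test_data)` function? No. The `keras` `model.predict` returns the sigmoid output for each sample, a float between 0 and 1 in an array of shape (num_samples, 1), so it has to be thresholded (here at 0.5) to get labels. The `sklearn` `LogisticRegression` `predict` returns the class labels (0 or 1) directly. After thresholding, the label distributions are very similar: with keras we have 0: 946, 1: 554 and with LogisticRegression we have 0: 931, 1: 569.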
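The conversion from `keras` probabilities to labels can also be written with numpy instead of list comprehensions. A minimal sketch, using the first three values from the `model.predict` output above:

```python
import numpy as np

# First three probabilities from the model.predict output, shape (3, 1).
probs = np.array([[0.41495636], [0.42940155], [0.75322706]])

# Flatten to shape (3,) and apply a 0.5 threshold to get class labels.
nn_labels = (probs.ravel() > 0.5).astype(int)

print(probs.shape, nn_labels.shape)
print(nn_labels)
```

Only the third value is above 0.5, so the labels are `[0 0 1]`.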

In [117]:
# what is the distribution of your predictions?
# is your model predicting all 0s or 1s?

count = Counter(preds)
print(count)

# compare and contrast your LR predictions with your NN predictions

Counter({0: 946, 1: 554})


3. Using the __original__ data, what label distribution does your neural classifier give? {0.0: 784, 1.0: 116}
    1. What is the overlap with the Logistic Regression guesses (# both models agree on)? 885 agree, 15 disagree
4. Using the __oversampled/upsampled__ data, what label distribution does your neural classifier give? {0.0: 921, 1.0: 579} (from a separate run; the run shown above gave {0: 946, 1: 554})
    1. What is the overlap with the Logistic Regression guesses (# both models agree on)? 1493 agree, 7 disagree
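The overlap counts can be computed by comparing the two label arrays element by element. A small sketch with made-up predictions; in the notebook the two arrays would be the thresholded neural net labels and the Logistic Regression predictions for the same test set.

```python
from collections import Counter

import numpy as np

# Made-up predictions for six test emails.
lr_labels = np.array([0, 1, 1, 0, 0, 1])
nn_labels = np.array([0, 1, 0, 0, 0, 1])

# True where both models give the same label.
agree = lr_labels == nn_labels
n_agree = int(agree.sum())
n_disagree = int((~agree).sum())

print(Counter(agree.tolist()))
print(n_agree, n_disagree)
```

Here the models agree on five emails and disagree on one. Both arrays must come from the same train/test split for the comparison to be meaningful.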



__Make sure to clear your kernel and run your notebook from top to bottom, ensuring there are no errors, before turning it in!__