# Homework 1 - Data munging and Linear Regression

Brandeis University
COSI 104-A Intro to Machine Learning
Spring 2025
Authors: Dylan Cashman and Binyamin Friedman

## Part 1: Data Munging (50 points)

### Installation
By now, you should have installed Jupyter Notebook and be familiar with how to run it with a virtual environment. To automatically install the dependencies listed in `requirements.txt`, you can run the following command in the terminal:
    
```bash
pip install -r requirements.txt
```

Alternatively, your IDE (PyCharm or VS Code) may auto-detect the `requirements.txt` file and install the required libraries.

### Overview
[Pandas](https://pandas.pydata.org/) is a data analysis library for efficiently working with tables of data. To complete this task, you can reference their documentation on [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which is Pandas' primary data structure. Pandas is built on top of [NumPy](https://numpy.org/), which is a widely-used library that provides super efficient multidimensional arrays. Many of the operations you will perform can either be done with higher-level Pandas operations, or done directly with NumPy. Either way, we want your final output to be a NumPy array, since NumPy ndarrays are the standard for use in scientific computing libraries. You can convert a Pandas DataFrame to a NumPy array with the `to_numpy()` method.
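As a quick illustration of the conversion described above, here is a minimal sketch on a made-up table (the `toy_df` and `toy_matrix` names and their values are hypothetical and unrelated to `sample_data.csv`):

```python
import numpy as np
import pandas as pd

# Toy table with two numeric columns (made-up values)
toy_df = pd.DataFrame({"height": [1.5, 1.8, 1.7], "weight": [50.0, 80.0, 65.0]})

# to_numpy() returns a 2-D ndarray with one row per DataFrame row
toy_matrix = toy_df.to_numpy()

print(type(toy_matrix))
print(toy_matrix.shape)
```

Note that the column names are not carried over into the array, so you need to keep track of which column index corresponds to which feature.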
 
[scikit-learn](https://scikit-learn.org/stable/) is a machine learning library that offers many useful algorithm implementations. It has a higher level of abstraction compared to PyTorch and TensorFlow. For this assignment, you should reference its documentation on [data preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html), as well as SciPy's documentation on [distance computations](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html).

Note that the final line of a cell will be automatically displayed if it is not assigned to a variable. Your solution to a problem can be the last line of the cell, OR it can be assigned to a descriptive variable name.

Modify the skeleton code at your own risk.

#### Loading data
Jupyter notebooks are stateful, which means variables and imports persist between cells. After running the next cell, you will have access to the local variables `df` and `matrix` in all subsequent cells. 

In [None]:
# These aliases are standard practice for pandas and numpy
import pandas as pd
import numpy as np

# This loads the data into a Pandas DataFrame
df = pd.read_csv('sample_data.csv')

# We can always convert a dataframe to a numpy matrix by calling the `to_numpy` method
matrix = df.to_numpy()

# Display the first few rows of the dataframe
df.head()

#### a) Randomly select 10 rows from the matrix without replacement to create a subset (10 points)

Remember, many basic operations can be accomplished with both Pandas and NumPy. Random selection is a method on `pandas.DataFrame` and also a function in `numpy.random`. We leave you to look at the documentation and decide which one to use. 

In [None]:
# Solution here



#### b) Select all rows from the matrix where the value of the `THHLD_NUMKID` column is greater than 4 (10 points)

In [None]:
# Solution here



#### c) Fetch the first sample (row) from the matrix (5 points)

In [None]:
# Solution here



#### d) Apply feature normalization to the matrix using `sklearn.preprocessing.StandardScaler`, which standardizes each feature to have a mean of 0 and a standard deviation of 1 (5 points)

Feature normalization is a common preprocessing step that adjusts the range of each feature to be within a similar scale. You can see that this operation is found in scikit-learn's preprocessing module.
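To make the idea concrete, the sketch below standardizes a made-up array by hand in plain NumPy. It is only an illustration of the per-column formula `(x - mean) / std`, which is what `StandardScaler` computes with its default settings; the array and variable names are hypothetical.

```python
import numpy as np

# Made-up data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

col_mean = toy.mean(axis=0)  # mean of each column
col_std = toy.std(axis=0)    # population standard deviation (ddof=0) of each column

# Subtract the column mean and divide by the column standard deviation
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0))  # each column mean is now approximately 0
print(toy_scaled.std(axis=0))   # each column standard deviation is now approximately 1
```

After scaling, both columns hold the same values even though the raw features differed by a factor of 100, which is the point of putting features on a similar scale.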

In [None]:
from sklearn.preprocessing import StandardScaler

# Solution here



#### e) Calculate the cosine similarity between two samples from the matrix by subtracting the cosine distance from 1 (10 points)

[SciPy](https://scipy.org/) is yet another Python library commonly used for scientific computing. SciPy has more fundamental algorithms than scikit-learn (which is focused on machine learning), related to optimization, integration, and linear algebra. Don't be overwhelmed by the number of libraries we're using! Everything is well-documented, and most of the functionality you'll need is a Google search away.

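Cosine similarity measures how similar two vectors are using the cosine of the angle between them: 1 means they point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Cosine distance is defined as 1 minus the cosine similarity.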
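For intuition, here is a small sketch that computes cosine similarity directly from its definition, the dot product divided by the product of the vector norms. The `cosine_similarity` helper and the vectors are hypothetical, chosen so the answers are easy to check by hand; this is not the SciPy-based approach the question asks for.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (hypothetical helper for illustration)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 3.0])

sim_uv = cosine_similarity(u, v)      # vectors 45 degrees apart
sim_uw = cosine_similarity(u, w)      # orthogonal vectors
sim_vv = cosine_similarity(v, 2 * v)  # same direction, different length

print(sim_uv, sim_uw, sim_vv)
```

Notice that scaling a vector does not change the result: only the direction matters, not the magnitude.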

In [None]:
from scipy.spatial.distance import cosine

# Select two samples from the matrix with no NaN values
# In practice, you need to decide how to handle NaN values, not just ignore them
sample1 = matrix[0]
sample2 = matrix[2]

# Solution here



#### f) Calculate the product of two matrices (10 points)

Take a look at the NumPy documentation for how to perform matrix multiplication. And feel free to do it by hand if you want to double-check your answer!
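As a reminder of the shape rule, an `m x n` matrix can be multiplied by an `n x p` matrix to give an `m x p` result. The sketch below uses two made-up matrices (`P` and `Q` are hypothetical, not the ones in this question) and NumPy's `@` operator, which is equivalent to `np.matmul` for 2-D arrays.

```python
import numpy as np

P = np.array([[1, 0],
              [2, 1]])     # shape (2, 2)
Q = np.array([[3, 4, 5],
              [6, 7, 8]])  # shape (2, 3)

# Inner dimensions match (2 and 2), so the product has shape (2, 3)
PQ = P @ Q

print(PQ.shape)
print(PQ)
```

Be careful not to confuse `@` with `*`, which performs element-wise multiplication instead.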

In [None]:
import numpy as np

A_a = np.array([[1, 2], [4, 5], [7, 8]])
B_a = np.array([[1, 1, 0], [0, 1, 1]])

# Solution here


## Part 2: Linear Regression (50 points)

In this part, you will use the training dataset (`training_data.csv`) to build a linear regression model to estimate `avgAnnCount`, `avgDeathsPerYear`, `TARGET_deathRate`, and `incidenceRate`. Then you will apply your model to the test dataset (`test_data.csv`) and analyze the results.

Please read the comments carefully and write your code in the designated places.

#### a) Creating a Model (15 points)
Create a linear regression model. We have supplied some important code to help you get started.
1. We load the raw data from a CSV file.
2. The preprocessing pipeline is partially defined for you. Your task is to complete the `handle_non_numeric` function. You might want to add additional preprocessing, such as scaling or normalization, if you think it helps you build a better model.
3. You should use the **normal equation** to learn the correct model parameters (you may not use sklearn.linear_model.LinearRegression, follow the procedure in the textbook instead). 
This should all result in a final variable `theta` that contains the fitted model parameters.
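For reference, the normal equation is usually written as `theta = (X^T X)^(-1) X^T y`. The sketch below applies it to made-up data with one feature and two targets, so you can see the expected shape of the result. It is only an illustration under stated assumptions: the toy arrays and the `theta_toy` name are hypothetical, and solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is a choice made here, which may differ from the textbook's procedure.

```python
import numpy as np

# Made-up data: one feature, two targets that are exact linear functions of it
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack((np.ones_like(x), x))  # bias column first, then the feature
Y = np.column_stack((1 + 2 * x, 5 - x))    # target 1 = 1 + 2x, target 2 = 5 - x

# Solve (X^T X) theta = X^T Y, which is equivalent to theta = (X^T X)^(-1) X^T Y
theta_toy = np.linalg.solve(X.T @ X, X.T @ Y)

print(theta_toy.shape)  # (number of features + 1, number of targets)
print(theta_toy)        # row 0 holds the intercepts, row 1 holds the slopes
```

Each column of the result holds the parameters for one target, with the bias term in the first row.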

In [None]:
raw_data.head()

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

targets = ['avgAnnCount', 'avgDeathsPerYear', 'TARGET_deathRate', 'incidenceRate']

# csv files are easy to work with!
# Note that Python also comes with a csv module out of the box
raw_data = pd.read_csv('training_data.csv')

def handle_non_numeric(data):
    """
    Student's solution here. 
    
    You can drop any non-numeric data, and/or attempt to extract numerical information from it (for example, `binnedInc` is a string containing the median income of the top and bottom decile).
    
    :param data: data from the pipeline
    :return: data with non-numeric fields handled
    """
    return data

def add_bias(data):
    """
    This function adds a bias term to the data. The NumPy syntax is confusing, but we
    encourage you to look up the documentation and understand what's happening here. 
    np.ones((data.shape[0], 1)) makes a Mx1 column vector of ones, and np.hstack concatenates it to the data horizontally (whereas np.vstack would concatenate vertically).
    
    :param data: data from the pipeline
    :return: data with a bias term added
    """
    return np.hstack((np.ones((data.shape[0], 1)), data))

# Here we preprocess the data:
# The sklearn pipeline is a powerful tool for chaining together data transformations. It
# chains together Transformers, which each perform some operation on data. The overall 
# pipeline is also a Transformer. This allows for some great code reusability.
# We make use of the FunctionTransformer to add our custom functions to the pipeline.
preprocess_pipeline = Pipeline([
    ('non_numeric', FunctionTransformer(handle_non_numeric)),  # Handle non-numeric fields
    ('imputer', SimpleImputer(strategy='median')),  # Fill in missing values with the median
    ('bias', FunctionTransformer(add_bias)),  # Add a bias term
])

# Separate the data into targets and preprocessed features
training_data_y = raw_data[targets].to_numpy()

# Note that in one line we drop the targets from the raw_data, and then use the pipeline with the fit_transform method
training_data_x = preprocess_pipeline.fit_transform(raw_data.drop(targets, axis=1))

# YOUR SOLUTION HERE - Define the normal equation and use it to learn theta
# Learn the model parameters
# def normal_equation(x, y):
#     return x
# 
# theta = normal_equation(training_data_x, training_data_y)
theta = None


#### b) Model Accuracy (10 points) 
Calculate the model's mean squared error (using `theta` calculated in the previous part) on both the training set and the test set. Make sure to preprocess the test set as you did on the training set. Which one is higher? Why is this the case?
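As a reminder, the mean squared error is the average of the squared differences between predictions and true values. Here is a minimal sketch on made-up numbers; the `mean_squared_error` helper and the toy arrays are hypothetical, and reporting one value per target column (rather than a single overall number) is an assumption about how you may want to present the result.

```python
import numpy as np

def mean_squared_error(X, Y, theta):
    """Per-target MSE of the linear model X @ theta (hypothetical helper)."""
    residuals = X @ theta - Y
    return np.mean(residuals ** 2, axis=0)

# Made-up data: a bias column plus one feature, and a single target
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
theta_toy = np.array([[1.0],
                      [2.0]])  # intercept 1, slope 2
Y_toy = np.array([[1.0],
                  [3.0],
                  [6.0]])      # the last point is off by 1

mse_toy = mean_squared_error(X_toy, Y_toy, theta_toy)
print(mse_toy)
```

The predictions are 1, 3, and 5, so only the last point contributes an error, and the squared error of 1 is averaged over the three samples.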

Start by loading the test data from `test_data.csv`.

In [None]:
# Solution here



*Explanation here*



#### c) Analyzing the Learned Parameters (10 points)
For each target attribute, what are the most and least predictive features? Explain why these make sense. The definitions of the data attributes should be helpful for theorizing why certain features are predictive. They can be found in `README.md`.

HINT: Everything you need to know is in the `theta` variable.

In [None]:
# Solution here


*Explanation here*


#### d) Plotting Data (10 points) 
Knowing how to visualize data and share your findings is a critical skill. Matplotlib is a widely-used library for plotting scientific data. If you are unfamiliar with it, you can reference their [documentation](https://matplotlib.org/stable/contents.html) or look at some of the many tutorials available online.

For this question, you will create two plots:
 1. Plot the `medIncome` attribute against the `incidenceRate` attribute in the test data (remember to apply your preprocessing pipeline). On top of this, overlay the linear regression model's best fit line (HINT: The slope and y-intercept are stored in `theta`).
 2. Do the same for the `medIncome` attribute against the `avgDeathsPerYear` attribute.

What interesting conclusion could you *possibly* draw from these plots?

We provide code to generate the scatterplot and best fit line for the first plot.  You must include a title, labels for the x and y axes, and a legend to differentiate the data points from the regression line.  Then, you must generate the second plot on your own.


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Solution here
def plot_attribute_against_target(attribute, target):
    """
    This function plots the given attribute against the target variable and overlays the model's best fit line. You must add a title, axis labels, and a legend.
    
    :param attribute: The attribute to plot on the x-axis
    :param target: The target variable to plot on the y-axis
    """
    
    # We preprocess the test data to get the transformed x values
    # This is necessary for overlaying the best fit line, since the model's parameters are learned from preprocessed data
    raw_test_data = pd.read_csv('test_data.csv')
    test_data_x = preprocess_pipeline.transform(raw_test_data.drop(targets, axis=1))
    
    # We get the index of the attribute and target in the final frame. This is necessary 
    # because your handle_non_numeric function will change the indices of some of the 
    # columns.
    final_frame = handle_non_numeric(raw_test_data.drop(targets, axis=1))
    attribute_i = final_frame.columns.get_loc(attribute) + 1  # +1 because of the bias term
    
    x_values = test_data_x[:, attribute_i]
    y_values = raw_test_data[target]
    
    plt.scatter(x_values, y_values, label='Data')
    
    target_index = targets.index(target)
    # The slope is the weight learned for this attribute; theta[0] holds the bias (y-intercept)
    best_fit_line = theta[attribute_i][target_index] * x_values + theta[0][target_index]

    plt.plot(x_values, best_fit_line, color='red', label='Regression Line')
    
    # YOUR CODE: Add a title, labels, and a legend
    plt.show()
    
plot_attribute_against_target('medIncome', 'incidenceRate')


*Explanation here*

We see that incidence rate is positively correlated with median income, while average deaths per year is negatively correlated with median income. It's likely that since wealthier areas have access to better healthcare, they have lower death rates but higher diagnosis rates.

#### README.md (5 points)
In the README cell below, please explain how to run your code, load the model, and interpret the results. The README should provide any necessary information about the code environment and give a broad overview of what your code accomplishes, so that someone unfamiliar with the project could understand its parts.

*README.md*

