# DS 3000 HW 3

Due: Tuesday Oct 31 @ 11:59 PM EST

### Submission Instructions
You may submit up to two files for this assignment. This `ipynb` file should have answers to the programming questions, and you can include the answers to the math problems as well, either via LaTeX typesetting in Markdown cells or by embedding images of your written work. If you would rather work on the math problems separately, you may also submit a pdf file with your handwritten answers to the math problems. To ensure that your submitted `ipynb` file represents your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to Gradescope.

### Tips for success
- Start early
- Make use of Piazza (also accessible through Canvas)
- Make use of Office Hours
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), though you are welcome to **talk about** (*not* show each other your answers to) the problems.

# Part 1: Data Cleaning and Ethics

The first part of this HW deals with the Get to Know You results from my DS3000 sections this semester. Please download the `cleaner_gtky.csv` file for use on this homework if you haven't already.

## Pseudonymizing data

We often 'pseudonymize' a dataset (replace personally identifiable information with random fake pseudonyms) by storing a code which maps each piece of sensitive information to a consistent tag.

| fake_student_id | time_stamp                 | class     | co_op | dream_career                                   | hobby                                      | song                                                                        | ai_feels                                                      | age_months | pets | credit_hours | work_hours |
|-----------------|----------------------------|-----------|-------|------------------------------------------------|--------------------------------------------|-----------------------------------------------------------------------------|---------------------------------------------------------------|------------|------|--------------|------------|
| 1841            | 2023/09/06 12:59:46 AM AST | Junior    | No    | Pharmaceutical data scientist job              | playing poker, try new restaurants         |                                                                             | I think it's the future, and the future looks great!          | 240        | 0    | 18           | 30         |
| 1049            | 2023/09/06 5:52:01 AM AST  | Senior    | Yes   | Management                                     | Drawing                                    | https://www.youtube.com/watch?v=A1EhBdsTkl8                                 | I think it's the future, and the future looks great!          | 259        | 8    | 16           | 0          |
| 1508            | 2023/09/06 8:40:59 AM AST  | Sophomore | No    | I want to be a UI/UX developer.                | I love doing yoga and going on long walks! | "I'll Be There For You"                                                     | It will be useful, but it won't change the world signicantly. | 230        | 0    | 19           | 15         |
| 1717            | 2023/09/06 9:07:26 AM AST  | Junior    | Yes   | International Affairs                          | Singing                                    | These are the Days by Inhaler - https://www.youtube.com/watch?v=l0Hilyfp_8A | I'm a bit worried it might be used for ill.                   | 246        | 5    | 17           | 0           |

To protect student privacy, this data has already been pseudonymized by creating the `fake_student_id` column with random numbers.

Note that there is a difference between:
- **pseudonymization** (changing everyone's name to a pseudonym)
- **anonymization**  (ensuring no individual can be uniquely identified within the data)

Briefly, if a single student was known to be the only one who submitted their Get to Know You at 15:37, then changing their name to a pseudonym is insufficient to protect their privacy within this data.  [This link](https://gathercapture.com/latest/anonymous-pseudonymous-data-are-they-important) contains further details, though it's not necessary for this HW.
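
As a minimal sketch of this distinction, the snippet below uses a small made-up table (the IDs and times are hypothetical, not taken from `cleaner_gtky.csv`) to flag rows whose `time_stamp` occurs only once. Anyone who knows when a student submitted could still re-identify such rows, even though the IDs are pseudonyms.

```python
import pandas as pd

# hypothetical toy data: pseudonymized IDs, but submission times are kept
df_toy = pd.DataFrame({
    'fake_student_id': ['a1f3c2', '9bd410', '77e0aa', 'c45d9e'],
    'time_stamp': ['15:37', '09:07', '09:07', '12:59'],
})

# for every row, count how many rows share its time stamp
stamp_counts = df_toy['time_stamp'].map(df_toy['time_stamp'].value_counts())

# rows with a count of 1 are unique on this column, so a pseudonym alone
# does not hide them from someone who knows the submission time
df_unique = df_toy[stamp_counts == 1]
print(df_unique)
```

Here the `15:37` and `12:59` rows are flagged, while the two `09:07` rows are not.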

### Generating a pseudonym

A [Universally Unique Identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier) (UUID) is a CS term for a unique name:

In [None]:
from uuid import uuid4

# a good random pseudonym/alias
str(uuid4())

In [None]:
# a good enough random alias for us (doesn't need to be as long to be unique)
str(uuid4())[:6]

You can read more about the process in the [python docs for uuid4()](https://docs.python.org/3/library/uuid.html#uuid.uuid4) but it's sufficient to know that there are so many unique output strings of `uuid4()` that we can assume no two calls return the same id.
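
To see why a six-character alias is "good enough", here is a rough sketch of the birthday-problem calculation. It assumes the first six characters of a `uuid4()` string behave as uniformly random hexadecimal digits. The helper name `collision_probability` and the class size of 100 are made up for illustration.

```python
from uuid import uuid4

def collision_probability(n_ids, n_chars=6):
    """Probability that at least two of n_ids random hex strings of
    length n_chars coincide (the birthday problem)."""
    n_possible = 16 ** n_chars
    p_all_distinct = 1.0
    for i in range(n_ids):
        p_all_distinct *= (n_possible - i) / n_possible
    return 1 - p_all_distinct

# hypothetical class size of 100 students
print(collision_probability(100))

# empirical check: draw 100 truncated ids and count the distinct ones
ids = [str(uuid4())[:6] for _ in range(100)]
print(len(set(ids)))
```

Under these assumptions the chance of any repeat among 100 aliases is well below one in a thousand. It is small but not zero, which is worth keeping in mind for much larger datasets.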

## Part 1.1: Pseudonymize (10 points)

Pseudonymize `df_gtky` (since I already did it once, pretend that the current 'fake_student_id' is actually real) by completing the tasks below:

1. Load `cleaner_gtky.csv` to a DataFrame called `df_gtky`
1. Write a `pseudonymize_col()` function which:
    * accepts a DataFrame and the name of one of its columns
    * returns `pseudo_map_dict`
        - keys are the unique items observed in original column of dataframe
        - values are the new pseudonyms
    * operates `inplace` by modifying the input DataFrame to replace each item in the given column with its corresponding pseudonym
    
Note that the pseudonymization must be consistent: all observations of a particular student ID in the original DataFrame are replaced with an identical pseudonym.

1. Call `pseudonymize_col()`, and print out the `.head()` of the resulting `df_gtky` to demonstrate it worked.

[This example](https://colab.research.google.com/drive/1VdikAnXZEBx3tGxclDG-BC-psr_D_DBN?usp=sharing) may help clarify the expected behavior of `pseudonymize_col()`.
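
As a small illustration of what "consistent" means, the sketch below applies a hand-written mapping to a made-up DataFrame. The column values and the pseudonyms `'aaa111'` and `'bbb222'` are hypothetical stand-ins; generating the pseudonyms and wrapping this in a function is left to you.

```python
import pandas as pd

# hypothetical toy data: student 1841 appears twice
df_toy = pd.DataFrame({'student_id': [1841, 1049, 1841],
                       'pets': [0, 8, 2]})

# hand-written mapping, standing in for the dictionary the function returns
toy_map = {1841: 'aaa111', 1049: 'bbb222'}

# replace each id with its pseudonym by assigning back into df_toy
df_toy['student_id'] = df_toy['student_id'].map(toy_map)
print(df_toy)
```

Both rows that originally held `1841` now hold the same pseudonym, and the mapping dictionary is what would let someone reverse the replacement later.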

In [None]:
import pandas as pd 

df_gtky = pd.read_csv('cleaner_gtky.csv')
df_gtky.head(4)

## Part 1.2: Cleaning (5 points)

As we've discussed in class, there are a few messy things about these data. For example:

- I am in the data, and I am not a student.
- A few students entered their age in months incorrectly.
- There are several missing values.

Clean the data set by making decisions about what to do with the incorrect and missing data and then justify your decisions in a markdown cell. Your final data set should:

- Contain no obviously incorrect data
- Contain a *minimal amount of missing data
  - *what you define as minimal may be different from other students. You **must** justify your cleaning choices adequately to receive full points on this part
  - Note this means you may still have some missing data, but you may want to remove some missing observations and/or impute others. Treat all missing data as missing at random.
  - Assume we care more about the numeric features than the categorical features.
 
When you are done, you should save the clean data set as a different object (such as `df_gtky_clean`) and demonstrate the changes. This could be done in any of several ways, including but not limited to displaying subsets of the original and cleaned data set and/or creating plot(s) or numerical summaries which demonstrate the changes.
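
The sketch below shows some pandas tools that may be useful here, applied to made-up data. The 144-month cutoff and the median imputation are only example choices, not the expected answer; whatever rules you pick still need your own justification.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one implausible age and some missing values
df_toy = pd.DataFrame({
    'age_months': [240, 21, 246, np.nan],
    'work_hours': [30, 0, np.nan, 15],
    'hobby': ['Drawing', None, 'Singing', 'Yoga'],
})

# how many values are missing in each column?
print(df_toy.isna().sum())

# work on a copy so the original stays available for comparison
df_toy_clean = df_toy.copy()

# one possible rule: treat ages under 12 years (144 months) as entry errors
too_young = df_toy_clean['age_months'] < 144
df_toy_clean.loc[too_young, 'age_months'] = np.nan

# one possible imputation: fill numeric gaps with the column median
for col in ['age_months', 'work_hours']:
    df_toy_clean[col] = df_toy_clean[col].fillna(df_toy_clean[col].median())

print(df_toy_clean)
```

Comparing `df_toy.isna().sum()` with `df_toy_clean.isna().sum()` is one simple way to demonstrate what changed.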

In [None]:
df_gtky.shape

## Part 1.3: Ethics Considerations (5 points)

Write a paragraph in the markdown cell below discussing why we should be pseudonymizing these data and why we needed to clean it. Your answer must be well thought out, **at least two sentences**, and provide insight specific to the data collected to receive full credit.

## Part 1.4: Ethics More Generally (5 points)

Notice that one of the decisions we made was to save `pseudo_map_dict` to ensure the pseudonymization can be undone. This is often done in medical studies when pseudonymizing data. Why would we want to do this? **Discuss in a couple of sentences**.

# Part 2: Summarizing and Visualizing Data

For this part, you will use the `players_fifa23.csv` from Canvas to investigate the ratings for soccer players in the FIFA 23 video game. Make sure the `.csv` is in the same directory as this notebook file.

**Note**: You do not need to know anything about soccer or video games to complete this problem, only perhaps that a higher `Overall` rating is considered a good thing.

## Part 2.1: Plotting Data (15 points)

Create a plotly scatter plot which shows the mean `Overall` rating for soccer players (rows) of a given `Nationality` for a particular `Age`. Focus on three countries (`England`, `Germany`, `Spain`). In other words, your plot's x-axis should be `Age`, the y-axis should be `Overall`, and there should be three different colored points at each `Age`, one for each `Nationality`.

Export your graph as an html file `age_ratings_nationality.html`. You do not have to submit it with this homework, but the code should show that you did this.

Hints:
- There may be multiple ways/approaches to accomplish this task.
- One approach: you may use `groupby()` and boolean indexing to build these values in a loop which runs per each `Nationality`.
- `px.scatter()` will only graph data from columns (not the index).  Some approaches may need to graph data from the index.  You can use [df.reset_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) to make your index a new column as shown [in this example](https://colab.research.google.com/drive/1d9JDphmpSTg9NtFMyfFnMQ6RmIx6zChK?usp=sharing)
- In some approaches you may need to combine multiple rows into one DataFrame. Note that `df.append()` was removed in pandas 2.0, so on recent versions use [pd.concat()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) instead; the older `df.append()` pattern is shown [in this example](https://colab.research.google.com/drive/1XbBHMcYq_2Q225nkKs3j06iigCQGmM4H?usp=sharing)
- In some approaches you may need to go from "wide" data to "long" data by using [df.melt()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html) as discussed [here](https://towardsdatascience.com/reshaping-a-pandas-dataframe-long-to-wide-and-vice-versa-517c7f0995ad)
- The first few code cells below get you started with looking at the data set.
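
To make the `reset_index()` and `melt()` hints concrete, here is a small sketch on made-up data. The column names `group`, `x`, and `score` are hypothetical and are not the FIFA columns.

```python
import pandas as pd

# hypothetical toy data (not the FIFA file)
df_toy = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'x': [1, 1, 1, 2],
    'score': [10, 20, 30, 40],
})

# groupby puts the grouping keys in the index ...
mean_by_key = df_toy.groupby(['group', 'x'])['score'].mean()

# ... and reset_index() turns them back into ordinary columns
df_long = mean_by_key.reset_index()
print(df_long)

# wide data (one column per group) can be melted back into long form
df_wide = df_long.pivot(index='x', columns='group', values='score')
df_melted = df_wide.reset_index().melt(id_vars='x', var_name='group',
                                       value_name='score')
print(df_melted)
```

The long form, with one row per (`group`, `x`) pair, is the shape `px.scatter()` can color by a column.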

In [None]:
# use pandas to read in the data
import pandas as pd

df_fifa = pd.read_csv('players_fifa23.csv', index_col = 'ID')
df_fifa.head()

In [None]:
df_fifa.Nationality.value_counts()

In [None]:
df_fifa.shape

In [None]:
df_fifa['Age'].unique()

In [None]:
# making the plot
import plotly
import plotly.express as px

## Part 2.2: Numerical Summaries (10 points)

1. Calculate the sample mean and median of `Overall` for the entire data set. In a markdown cell, discuss what their relative values imply about the distribution of `Overall`, and then use the plot from 2.1 and these values to discuss whether you think English, German, and Spanish players are generally better rated than other countries' players, and at what age they become average players.
2. Use the `.groupby()` function to calculate the means and standard deviations of `Overall` for the three Nationalities in Part 2.1 (you will want to use the original data frame or a slightly modified version of it (the `.isin` function from pandas may help), **NOT** the data frame you used for the plot). What do these values tell you about the differences between English, German, and Spanish players?
3. Create a subset of the original data frame that includes only `Age`, `Height`, `Weight`, and `Overall`. Calculate the correlation matrix for these four features and discuss what the relationships seem to be and whether those relationships make sense to you.
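
As a sketch of how `.isin()` and `.groupby()` fit together, the example below filters a made-up table and summarizes it per group. The `team` and `score` columns are hypothetical, not the FIFA columns.

```python
import pandas as pd

# hypothetical toy data (not the FIFA file)
df_toy = pd.DataFrame({
    'team': ['red', 'red', 'blue', 'blue', 'green'],
    'score': [60, 70, 80, 90, 50],
})

# keep only the rows whose team is in the list of interest
keep = df_toy['team'].isin(['red', 'blue'])
df_sub = df_toy[keep]

# mean and sample standard deviation of score within each team
summary = df_sub.groupby('team')['score'].agg(['mean', 'std'])
print(summary)
```

Note that pandas' `std` uses the sample standard deviation (dividing by n - 1) by default.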

# Part 3: Vector Geometry Practice (10 points)

Use the vectors below to compute the following quantities. You must show all math work/steps (no matter how trivial) to receive full credit. You may either use LaTeX typesetting within a Markdown cell, or do it by hand with pen and paper and embed the image in this .ipynb file, or submit a separate pdf file with your handwritten work. Round all decimals to three places.

After calculating the quantities by hand, use numpy in cells below to verify your answers.

$a = \begin{bmatrix} 2 \\ -1 \\ 3 \end{bmatrix}$

$b = \begin{bmatrix} -4 \\ -2 \\ 0 \end{bmatrix}$

$c = \begin{bmatrix} 3 \\ 3 \\ -3 \end{bmatrix}$

1. Compute $||b+c||$
2. Compute $2a + b$
3. Compute $c \cdot a$
4. Compute $||\frac{a}{2} - c||$
5. Compute $b \cdot (a + c)$
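
For reference, this is roughly what a numpy check looks like on two other vectors. `u` and `v` are hypothetical and chosen only for illustration; the hand calculation is repeated in the comments.

```python
import numpy as np

# hypothetical vectors, chosen only for illustration
u = np.array([1, 2, 2])
v = np.array([3, 0, -4])

# Euclidean norm of a sum: ||u + v|| = sqrt(4^2 + 2^2 + (-2)^2) = sqrt(24)
norm_sum = np.linalg.norm(u + v)
print(round(norm_sum, 3))  # 4.899

# dot product: u . v = (1)(3) + (2)(0) + (2)(-4) = -5
dot_uv = np.dot(u, v)
print(dot_uv)  # -5
```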

In [None]:
import numpy as np

a = np.array([2, -1, 3])
b = np.array([-4, -2, 0])
c = np.array([3, 3, -3])

# Part 4: Spotify Perceptron

In this part you will use a subset of our class' Spotify playlist to enact a linear perceptron algorithm in order to predict the `Mode` (Major or Minor) of a song based on its `danceability` and `energy`. I have created the subset and uploaded it to Canvas as the `simple_spot.csv` file. 

In [None]:
import pandas as pd

simple_spot = pd.read_csv('simple_spot.csv')
simple_spot.head()

## Part 4.1: Visualize the Data (5 points)

Before beginning to implement the perceptron algorithm, make a plot using plotly of the two predictor features coloring the points by the mode (note that `Mode = 1` corresponds to Major Mode). Note that this is **not** a random subset, but rather a semi-curated collection of 60 songs from our class' playlist. In a markdown cell, discuss whether this linear perceptron will converge or not without defining a loss function and how you know.

In [None]:
import plotly.express as px

## Part 4.2: Do the First Iteration By Hand (10 points)

The first observation in the data set is, with added bias term: $x_1 = \begin{bmatrix} 1 \\ .854 \\ .492 \end{bmatrix}$. You will use this observation to perform the first update of your linear perceptron by hand. Initialize the weight vector to be $w = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}$, determine if an update of $w$ is necessary and, if so, update it. If not, continue looking at observations until you make your first update of $w$. **Do this first by hand**, either writing out all your work with LaTeX in a Markdown cell, or with pencil and paper and then embedding the image below, or including it in the pdf submitted with this .ipynb file. **Then**, use python to do exactly what you did by hand.
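
The sketch below walks through one perceptron step on a made-up observation. Both the observation and its label are hypothetical. It assumes the common update rule $w \leftarrow w + (y - \hat{y})x$ with a learning rate of 1, and it predicts 1 when $w \cdot x \geq 0$ (the same convention as the `predict_perceptron` helper in Part 4.3). If the rule given in class differs, use that one.

```python
import numpy as np

# hypothetical observation (bias term first) and label, for illustration only
x = np.array([1, 0.5, 0.25])
y = 0
w = np.array([0.0, 1.0, -1.0])

# predict 1 when the weighted sum is non-negative, otherwise 0
yhat = 1 if np.dot(w, x) >= 0 else 0

# assumed update rule: on a mistake, move w by (y - yhat) * x
if yhat != y:
    w = w + (y - yhat) * x

print(yhat, w)
```

Here $w \cdot x = 0.25$, so the prediction is 1, which disagrees with the label 0, and $w$ becomes $\begin{bmatrix} -1 \\ 0.5 \\ -1.25 \end{bmatrix}$.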

In [None]:
import numpy as np

## Part 4.3: Write the Perceptron Function (10 points)

Write the function `linear_perceptron`, with appropriate docstring, which takes as arguments `X` (the array of X values, including the bias term), `y` (the array of response values), and `w` (the initial weight vector). These are already created for you below (to give you an idea of what they look like).

Your perceptron is a simple perceptron, which assumes the data are linearly separable, so the only stopping criterion is that all the observations are correctly classified by the final weight vector. Your perceptron should iterate through all observations, updating $w$ sequentially (and keeping track of the total number of iterations), until all observations are iterated through and correctly classified. Include in your function the following two lines for when the algorithm has converged:

```python
print(f'Algorithm converged, final w = {w}')
print(f'Total number of iterations = {iter}')
```

Also make sure your function returns the final weight vector `w`. To test that your function works, run the cell with the `assert` statement at the end of this part. When you run it, if the function is written correctly and the `assert` passes, you should see:

```python
Algorithm converged, final w = [ 2.    -4.571  1.288]
Total number of iterations = 62
```

**Note**: My "Total number of iterations" does not include those times when $w$ is updated. If you did keep track of the iterations where $w$ is updated, the total number should be 240 (i.e. four complete loops through the data). If you see either of these numbers, the final $w$ looks correct, and the `assert` passes, then you are done.

In [None]:
import numpy as np

Xdat = np.array(simple_spot.iloc[:,:-1])
ydat = np.array(simple_spot.iloc[:,-1])
wdat = np.array([0, 1, -1])

In [None]:
myrun = linear_perceptron(Xdat, ydat, w = wdat)

def predict_perceptron(x, w):
    yhat = 1 if np.dot(w, x) >= 0 else 0
    return yhat

assert (np.apply_along_axis(predict_perceptron, 1, Xdat, w=myrun) == ydat).all()

## Part 4.4: Cross Validate and Run (5 points)

To actually test the perceptron, we shouldn't run it on the full data set, but rather on a training set and then use a test set to evaluate the accuracy. Instead of Leave-One-Out Cross Validation (LOO-CV), we will use single-fold cross validation. The first cell below creates the training (`Xtrain` and `ytrain`) and test (`Xtest` and `ytest`) sets for this problem.

Apply your function from 4.3 to the training data, and then use the `predict_perceptron` function I defined in 4.3 (as well as the Xtest and ytest values) to determine the cross validated accuracy of your function.
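
As a sketch of the accuracy calculation only, the example below scores a made-up weight vector on made-up held-out data. `w_toy`, `X_toy`, and `y_toy` are hypothetical; in the homework you would use the weights learned from the training set together with `Xtest` and `ytest`.

```python
import numpy as np

def predict_perceptron(x, w):
    # same convention as the helper in Part 4.3
    return 1 if np.dot(w, x) >= 0 else 0

# hypothetical weights and held-out data, for illustration only
w_toy = np.array([0.5, -1.0, 0.0])
X_toy = np.array([[1, 0.2, 0.9],
                  [1, 0.8, 0.1],
                  [1, 0.4, 0.5],
                  [1, 0.7, 0.3]])
y_toy = np.array([1, 0, 0, 0])

# predict every row, then take the fraction of predictions that match
yhat_toy = np.apply_along_axis(predict_perceptron, 1, X_toy, w=w_toy)
accuracy = np.mean(yhat_toy == y_toy)
print(accuracy)
```

Three of the four toy rows are classified correctly, so the accuracy is 0.75.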

In [None]:
# may need to run pip install scikit-learn before doing below
from sklearn.model_selection import train_test_split

crossval = train_test_split(Xdat, 
                            ydat,
                            test_size=0.3)

Xtrain, Xtest, ytrain, ytest = crossval

## Part 4.5: Use `sklearn` and Compare (5 points)

Use the `Perceptron` function from `sklearn` to fit the training data, print the `.n_iter_` attribute after fitting and then use the `.score()` function to calculate the accuracy of the model fit to the test data. How many iterations did it take? Does the accuracy of `sklearn`'s perceptron function match yours? If so, do you expect this to always be the case? If not, why do you think this is?

You might take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron) to help you answer these questions, though this is meant to be a relatively open-ended question without necessarily a specific answer.

In [None]:
from sklearn.linear_model import Perceptron

## Part 4.6: Thoughts on Extending the Model (5 points)

In the above problem, you worked with only a subset of the Spotify data set you created in Homework 2. What are some considerations/what would have to be different if you used the full data set (including other features besides danceability and energy) in a perceptron (or any other machine learning) algorithm to try to predict if a song is in Major or Minor mode? **Write a few sentences discussing this question.** As a reminder, here is what the top of the raw data set looks something like:

| danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo   | ... |
|--------------|--------|-----|----------|------|-------------|--------------|------------------|----------|---------|---------|-----|
| 0.631        | 0.605  | 0   | -8.73    | 1    | 0.0386      | 0.453        | 0.000184         | 0.291    | 0.25    | 115.281 | ... |
| 0.531        | 0.766  | 8   | -7.692   | 1    | 0.0582      | 0.0056       | 0                | 0.201    | 0.532   | 130.048 | ... |
| 0.561        | 0.965  | 7   | -3.673   | 0    | 0.0343      | 0.00383      | 7.07E-06         | 0.371    | 0.304   | 128.04  | ... |
| 0.62         | 0.712  | 9   | -6.434   | 1    | 0.1         | 0.228        | 3.39E-06         | 0.0561   | 0.83    | 170.234 | ... |
