
# DSCI 521: Methods for analysis and interpretation <br> Chapter 3: Exploratory data analysis and visualization

## Exercises
Note: exercise numbers refer to the corresponding sections of the main notes.

#### 3.1.1.2 Exercise: modifying a sort order
Use the `key` argument and a lambda function to sort the `list_of_tuples` object primarily by the second (string) column, and secondarily by the first (integer) column.

In [None]:
list_of_tuples = [
    (8, "g"), (6, "a"), (7, "f"), (5, "d"),
    (3, "k"), (0, "t"), (9, "x"), (8,  "f")
]

## code here
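One possible approach, sketched here rather than offered as the only answer, is to have the `key` function return a tuple. Python compares tuples element by element, so placing the string first and the integer second gives the requested primary and secondary ordering. The name `sorted_tuples` is introduced only for this illustration.

```python
list_of_tuples = [
    (8, "g"), (6, "a"), (7, "f"), (5, "d"),
    (3, "k"), (0, "t"), (9, "x"), (8, "f")
]

# the key returns (string, integer), so entries are compared by the
# string first and by the integer only when the strings tie
sorted_tuples = sorted(list_of_tuples, key=lambda pair: (pair[1], pair[0]))
print(sorted_tuples)
```

The two `"f"` entries are the only tie on the string column, so they are the only place where the secondary (integer) ordering matters.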

#### 3.1.3.2 Exercise: percentiles
Compute all deciles (percentiles in increments of $10$) for the baseball player heights. How are these values spaced: are the deciles evenly spaced?

In [None]:
import numpy as np
# import scipy.stats
import pandas as pd

baseball_data = pd.read_csv(
    filepath_or_buffer= "data/baseball_heightweight.csv",
    sep = ",",
    header = 0
)

## code here
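A minimal sketch of the decile computation follows. It assumes `np.percentile` is an acceptable tool and uses randomly generated heights as a stand-in for `baseball_data["Height"]`, so the printed numbers say nothing about the real data. The names `heights`, `decile_levels`, `deciles`, and `gaps` are illustrative only.

```python
import numpy as np

# made-up stand-in for baseball_data["Height"] (inches); not the real data
rng = np.random.default_rng(0)
heights = rng.normal(loc=73.7, scale=2.3, size=1000)

# the 10th, 20th, ..., 90th percentiles
decile_levels = np.arange(10, 100, 10)
deciles = np.percentile(heights, decile_levels)

# gaps between neighboring deciles; roughly equal gaps would mean even spacing
gaps = np.diff(deciles)

for level, value in zip(decile_levels, deciles):
    print(f"{level}th percentile: {value:.2f}")
print("gaps:", np.round(gaps, 2))
```

Inspecting `gaps` is one way to answer the spacing question: for bell-shaped data the gaps tend to be smallest near the middle deciles and larger toward the tails.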

#### 3.1.4.3 Exercise: comparing player statistics
Represent each player in the baseball data as a (2-dimensional) pair of height/weight numbers. Use either the Euclidean or the taxicab distance on these pairs to determine which players are the 'closest' to one another in terms of build.

In [None]:
## code here
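The sketch below shows one way to find the closest pair with NumPy broadcasting. The five `(height, weight)` rows are hypothetical and merely stand in for the baseball data. The variable names are likewise assumptions made for this illustration.

```python
import numpy as np

# hypothetical (height, weight) pairs standing in for the baseball data
builds = np.array([
    [74, 180], [72, 215], [75, 182], [69, 176], [73, 200]
], dtype=float)

# pairwise differences via broadcasting: shape (n, n, 2)
differences = builds[:, None, :] - builds[None, :, :]

# Euclidean distance between every pair of players: shape (n, n)
distances = np.sqrt((differences ** 2).sum(axis=-1))
# taxicab alternative: np.abs(differences).sum(axis=-1)

# every player is at distance 0 from themselves, so mask the diagonal
np.fill_diagonal(distances, np.inf)

# row and column of the smallest remaining distance
i, j = np.unravel_index(np.argmin(distances), distances.shape)
print(f"closest pair: rows {i} and {j}, distance {distances[i, j]:.3f}")
```

Because height and weight are measured on different scales, the raw distance is dominated by weight. Standardizing both columns first may give a fairer notion of 'build'. The full distance matrix also grows quadratically with the number of players, which is manageable at roughly a thousand rows but worth keeping in mind.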

#### 3.1.7.2 Exercise: computing standard deviation-based outliers
Modify the above code to implement the outlier detection that defines outliers as all points at least $3$ standard deviations ($\sigma$s) away from the mean.

In [None]:
height_stdev = np.std(baseball_data["Height"])
height_mean = np.mean(baseball_data["Height"])

## code here
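As a hedged illustration of the $3\sigma$ rule, the following sketch flags points with a boolean mask. The heights are randomly generated, with two extreme values planted on purpose, so the flagged values are not those of the real data. `is_outlier` and `outliers` are names chosen for this example.

```python
import numpy as np

# made-up heights with two planted extreme values; not the real data
rng = np.random.default_rng(1)
heights = np.append(rng.normal(loc=73.7, scale=2.3, size=500), [60.0, 90.0])

height_mean = np.mean(heights)
height_stdev = np.std(heights)

# flag points whose distance from the mean is at least 3 standard deviations
is_outlier = np.abs(heights - height_mean) >= 3 * height_stdev
outliers = heights[is_outlier]
print(outliers)
```

With the real data, the same mask applied to `baseball_data["Height"]` should select the corresponding rows, assuming the column contains no missing values.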

#### 3.2.1.4 Exercise: standardizing data for comparison
Use the standardization technique from __Section 2.0.1__ to transform the height, weight, and age columns into numerically comparable quantities and visualize them in a side-by-side boxplot. How do they appear to differ now?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

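## matplotlib 3.6 renamed the seaborn styles with a "seaborn-v0_8-" prefix
if "seaborn-v0_8-whitegrid" in plt.style.available:
    plt.style.use("seaborn-v0_8-whitegrid")
else:
    plt.style.use("seaborn-whitegrid")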

def standardize(data):
    mean = np.mean(data)
    stdev = np.std(data)

    standardized_data = (data - mean) / stdev

    return standardized_data

## code here
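A possible shape for the side-by-side boxplot is sketched below. The three columns are randomly generated placeholders for the real height, weight, and age columns, and the means and spreads used to generate them are invented. The dictionary names are assumptions made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def standardize(data):
    # subtract the mean and divide by the standard deviation
    return (data - np.mean(data)) / np.std(data)

# made-up columns standing in for Height, Weight, and Age
rng = np.random.default_rng(2)
columns = {
    "Height": rng.normal(73.7, 2.3, 300),
    "Weight": rng.normal(201.0, 21.0, 300),
    "Age": rng.normal(28.7, 4.3, 300),
}

standardized = {name: standardize(values) for name, values in columns.items()}

# one box per standardized column, drawn on a shared axis
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(list(standardized.values()))
ax.set_xticklabels(list(standardized.keys()))
ax.set_ylabel("standardized value (z-score)")
ax.set_title("Standardized height, weight, and age")
```

After standardization every column has mean $0$ and standard deviation $1$, so any remaining differences between the boxes reflect shape (skew, tails, outliers) rather than units.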

#### 3.2.2.2 Exercise: a sorted bar plot
Use the `sorted()` function from __Sec. 3.1.1__ to rebuild the bar plot visualization of baseball positions, but with the bar heights decreasing from left to right.

In [None]:
from collections import Counter

## initialize a counter for the positions
positions = Counter()

## loop over rows to count up the positions
for ix, row in baseball_data.iterrows():
    positions[row["Position"]] += 1

In [None]:
## code here
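One way to order the bars, sketched under the assumption that sorting the counter's `(position, count)` pairs is what the exercise intends, is shown below. The counts are invented placeholders for the `positions` counter built above. `sorted_positions`, `labels`, and `counts` are illustrative names.

```python
from collections import Counter
import matplotlib.pyplot as plt

# hypothetical position counts standing in for the `positions` counter
positions = Counter({
    "Catcher": 76, "Shortstop": 52, "Relief_Pitcher": 315, "Outfielder": 194
})

# sort (position, count) pairs by count, largest first
sorted_positions = sorted(positions.items(), key=lambda item: item[1], reverse=True)
labels = [position for position, count in sorted_positions]
counts = [count for position, count in sorted_positions]

# bars are drawn in sorted order, so heights decrease from left to right
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(labels)), counts)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_ylabel("number of players")
```

`Counter.most_common()` returns the pairs in the same largest-first order and could replace the explicit `sorted()` call.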

#### 3.2.3.2 Exercise: binning
Repeat the above visualization using a few different bin sizes. Which choice appears to evoke the smoothest shape in the histogram? What happens when there are too many or too few bins?

In [None]:
## code here
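The effect of the bin count can also be inspected numerically before plotting. The sketch below uses `np.histogram` on randomly generated heights (a stand-in, not the real data) and records the counts for three arbitrary bin choices. The `results` dictionary is a name introduced here.

```python
import numpy as np

# made-up stand-in for the height column; not the real data
rng = np.random.default_rng(3)
heights = rng.normal(loc=73.7, scale=2.3, size=1000)

results = {}
for n_bins in (5, 20, 200):
    # counts per bin and the bin edges for this choice of bin count
    counts, edges = np.histogram(heights, bins=n_bins)
    results[n_bins] = counts
    bin_width = edges[1] - edges[0]
    print(f"{n_bins:>3} bins: width {bin_width:.3f}, "
          f"empty bins {np.sum(counts == 0)}, tallest bin {counts.max()}")
```

`plt.hist` accepts the same `bins` argument, so the values tried here can be passed straight to the plotting call. Very few bins hide the shape of the distribution, while very many leave sparse or empty bins and a jagged outline.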

#### 3.2.5.1 Exercise: line charts
Make a line chart of the Apple stock above, except this time plot the daily price change as a function of date.

In [None]:
## This loads the csv file from disk
APPL = pd.read_csv(
    filepath_or_buffer = "./data/APPL.csv", sep = ",",
    header=0, parse_dates = [0]
)

APPL = APPL.sort_values(by = "Date", ascending = True).reset_index(drop = True)

APPL["Change"] = APPL["Close"].diff().fillna(0)

## code here

#### 3.2.4.3 Exercise: combining visual elements
While it won't change the above correlation in any way, rebuild the density plot visualization from __Sec. 3.2.4.2__ using standardized height and weight values. Then, add to this visualization by plotting the line $y = x$. Discuss the strength of correlation and how drastically the density appears to fall out of alignment with this most basic linear relationship.

In [None]:
## code here

It looks like there's some variation in weight between players who have the same height. Below the mean height, this variation lies mostly above the $y = x$ line, meaning that players of below-average height tend to have standardized weights larger than their standardized heights. On the other hand, players of above-average height tend to have standardized weights smaller than their standardized heights.
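The claim that standardizing leaves the correlation unchanged can be checked numerically. The sketch below does so on randomly generated heights and weights. The generating coefficients are invented and the data are not the baseball data. It also fits a least-squares line to the standardized values, whose slope equals the correlation coefficient.

```python
import numpy as np

def standardize(data):
    # subtract the mean and divide by the standard deviation
    return (data - np.mean(data)) / np.std(data)

# made-up, positively related heights and weights; not the real data
rng = np.random.default_rng(4)
heights = rng.normal(73.7, 2.3, 500)
weights = 201.0 + 4.5 * (heights - 73.7) + rng.normal(0.0, 18.0, 500)

z_heights = standardize(heights)
z_weights = standardize(weights)

# correlation before and after standardizing
raw_r = np.corrcoef(heights, weights)[0, 1]
standardized_r = np.corrcoef(z_heights, z_weights)[0, 1]

# slope of the least-squares line through the standardized points
slope = np.polyfit(z_heights, z_weights, deg=1)[0]
print(raw_r, standardized_r, slope)
```

Since the fitted slope equals the correlation, any correlation below $1$ gives a line flatter than $y = x$, which is consistent with the density sitting above $y = x$ on the left and below it on the right.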

## Additional In-depth Exercises

### A. Rank-frequency distributions and Zipf's Law

Zipf's Law is a loose but common phenomenon that's readily observable in many forms of social data. Generally, when some collection of categorical entities, e.g., words ($w$), is counted ($f(w)$) and the entities are _ranked_ $r = 1, 2, \dots$ from largest to smallest by frequency, the frequencies follow a simple, reciprocal, 'power-law' relationship in the rank:

$$
f(w_r) = C\cdot r^{-1}
$$

for some constant, $C$. In this set of exercises we'll explore the universality of this relationship between words.
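To make the role of $C$ concrete, the sketch below evaluates the formula above for a hypothetical vocabulary size. It assumes one particular normalization, namely choosing $C$ equal to the largest rank so that the last-ranked word has frequency $1$. The value of `n_ranks` is invented for illustration.

```python
import numpy as np

# hypothetical number of distinct words; the real value comes from the data
n_ranks = 7000
ranks = np.arange(1, n_ranks + 1)

# f(w_r) = C / r; choosing C = n_ranks makes the lowest frequency equal 1
C = n_ranks
zipf_frequencies = C / ranks

# on log-log axes the model is a straight line with slope -1
slope = np.polyfit(np.log10(ranks), np.log10(zipf_frequencies), deg=1)[0]
print(slope)
```

Taking logarithms gives $\log_{10} f(w_r) = \log_{10} C - \log_{10} r$, which is why rank-frequency data are usually drawn on log-log axes: departures from a straight line of slope $-1$ are departures from Zipf's Law.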


#### A.1 Frankenstein rank vs frequency
First, plot this rank-frequency relationship for a processed copy of the Frankenstein book.
In particular, load the `'./data/84-processed.json'` file as `data` and plot the logarithms (via `np.log10()`)
of the `data['rs']` values vs the `data['fs']` values as a line plot. Additionally, plot Zipf's Law, scaled
so that the lowest frequency in the Zipf model has value 1.

When you do this, make sure you put a title and legend on the figure
and additionally support readability as much as possible,
including through plot color, font size, and line style.

In [None]:
## code here

#### A.2 Histogram of Novelty
Now make a histogram of a different column, namely `data['As']`. This is called the novelty function,
which, as a function of word order (let's call this $r$), indicates the probability that a word has not been observed before in the document.
Again, when you do this, make sure you put a title and legend on the figure
and additionally support readability as much as possible,
including through plot color, font size, and line style.

Additionally, plot a vertical line for the mean value of novelty in this histogram and comment on the scale of variation in the data: is the variation 'wild' or 'confined' to a limited region? Also, is the mean a good approximation of centrality?

In [None]:
## code here

#### A.3 Frankenstein novelty scatter plot
As it turns out, the novelty function exhibits some relatively smooth variation with respect to rank. Make a scatter plot of the novelty function and plot the mean novelty as a horizontal line.

As before, when you do this, make sure you put a title and legend on the figure
and additionally support readability as much as possible,
including through plot color, font size, and line style.

In [None]:
## code here

#### A.4 Lowest-loss word2vec optimized novelty
Now, unzip the `'./data/84-performance.zip'` file and subsequently load the `'./data/84-performance.json'` file, which contains a number of randomly initialized, gradient-descent-optimized models (the `'model'` field) for the novelty function. This object (name it `performance`) has additional fields named `'NLL'` (the negative log likelihoods, i.e., loss values) and `'correlation'` (the association of each model to the known likelihood function).

Determine which novelty function has the lowest `'NLL'` value and plot it in the background of the empirical novelty, utilizing any code from the previous part.

In [None]:
## code here
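Selecting the lowest-loss model might look like the sketch below. The layout of `performance` is an assumption: it is taken to be a dictionary keyed by iteration, with each value holding `'NLL'`, `'correlation'`, and `'model'`. The real file may be organized differently, so the lookup may need adapting. All numbers are invented.

```python
# hypothetical layout: keys are iterations, values hold the three named fields
performance = {
    "0": {"NLL": 5.2, "correlation": 0.61, "model": [1.0, 0.8, 0.5]},
    "1": {"NLL": 4.7, "correlation": 0.74, "model": [1.0, 0.7, 0.4]},
    "2": {"NLL": 4.9, "correlation": 0.70, "model": [1.0, 0.9, 0.6]},
}

# key of the entry with the smallest negative log likelihood
best_key = min(performance, key=lambda k: performance[k]["NLL"])
best_model = performance[best_key]["model"]
print(best_key, performance[best_key]["NLL"])
```

If the file instead stores one list per field, `np.argmin` over the `'NLL'` list would give the corresponding index.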

#### A.5 Plots of model convergence
Now, make two plots by iteration (keys in the `performance` object) of the performance values (`'NLL'` and `'correlation'`)
and mark the location of the minimum `'NLL'` value with a vertical line. Does it appear to align with the correlation with the
known novelty values?

In [None]:
## code here

#### A.6 Plot empirical frequency and novelty/word2vec transforms
Next, we'll observe the relationship between frequency and novelty. In particular, reuse the frequency-plotting code from the first part of this exercise, but additionally use the
`frequency_model` function provided below, which transforms a novelty function into a corresponding frequency model. Use this to compare the empirical frequency values
to 1) Zipf's law, 2) the frequency model implied by the theoretical novelty function, and 3) the frequency model implied by the learned novelty function (from the `performance` object).
Note that the novelty function taken from the `performance` object should be the one with the smallest negative log likelihood.

In [None]:
def frequency_model(As, ns):
    Am = ns/np.cumsum(1/As)
    ## approximate birthdays
    mn = (ns + Am - 1)/Am
    ## slicing Ashat from second, up, assumes the As are actually one behind
    ## these are the log-transformed Eq. 8 factors
    NLfactors = list(-(1 - As[1:])*np.log10(mn[:-1]/mn[1:]))
    ## this is a really tricky cumulative sum, since it operates in reverse for the factors,
    ## to make a cumulative product after the re-exponentiation (removing the log transformation)
    return np.floor(10 ** np.append(np.cumsum(NLfactors[::-1])[::-1], 0))

In [None]:
## code here