# Week 2: A Data Scientist's most fundamental tools

The point of these exercises is to refresh your memory on some mathematics and get you comfortable doing computations in code.

The exercises today cover:
* Basic visualization
* Linear algebra
* Statistics

**Advice**: Some of you may be new to solving problems using code. You may be wondering *what level of detail* I expect in your solutions, your code comments and explanations. **This is the guideline:** Solve the exercises in a manner that allows you to—later in life—use them as examples. This also means that you should add `#code comments` when the code isn't self-explanatory or if you're afraid it won't make sense when you look at it with fresh eyes. You may also want to comment on your output in plain text to capture the conclusions you arrive at throughout your analysis. But express yourself succinctly. To quote (probably) Einstein: "*Make everything as simple as possible, but not simpler*". Finally, when you optimize for your own future comprehension, other people will be able to understand what you did.

## Exercises

### Part 1: Visualization

>**Ex. 2.1.1**: The figure below meets the minimum style requirements that I expect the figures you make in this class (and in life in general) to meet:
* Figure sizing. Try to make the aspect ratio close to 4:3.
* Axis labels. Note that you may want to alter the `fontsize` to make them look nice.
* Properly sized x and y tick labels.
* Title (optional: not always necessary, but often helps the reader).
* Legend (general rule: only use if you have multiple trends so reader can distinguish).
>
> Your task in this exercise is to reproduce this figure (a perfect match is not required).
>
>*Hint: To get figures to display inside the notebook, use the Jupyter magic `%matplotlib inline`. For pointers on how to make plots like this in Python, Google something like "scatter plot python" and see if you can find some examples of how other people do this.*

<img src="https://dhsvendsen.github.io/images/ex211.png" width="500"/>

In [12]:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(10)
y = np.random.randn(10)
y  # display the sampled values

array([-0.5592661 ,  0.21670406, -0.64389097, -1.43748149, -1.04738727,
       -0.25641203,  1.49635567, -0.64611326, -0.98949272, -0.51116256])
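The style checklist above can be sketched in a few lines. This is only one possible way to do it, not the code behind the reference figure: the seed, label texts, and font sizes are my own assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)               # seeded so the figure is reproducible
x = rng.standard_normal(10)
y = rng.standard_normal(10)

fig, ax = plt.subplots(figsize=(8, 6))       # 8 x 6 inches is a 4:3 aspect ratio
ax.scatter(x, y, label="random points")
ax.set_xlabel("x", fontsize=14)              # axis labels with a readable font size
ax.set_ylabel("y", fontsize=14)
ax.tick_params(axis="both", labelsize=12)    # size of the x and y tick labels
ax.set_title("Ten standard-normal points", fontsize=16)
ax.legend(fontsize=12)                       # optional with a single trend
```

Working on the `fig, ax` objects rather than on `plt` directly makes it easier to reuse the same styling later when several plots sit side by side.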

>**Ex. 2.1.2**: The `get_x_y` function below gives you the number of comments versus score for the latest `N` posts on a given `subreddit`.
1. Make a scatter plot of `x` vs. `y` for the "blackmirror" subreddit (**remember** what you learned in the previous exercise about **styling**). Comment on what you see.
2. Maybe you've noticed that it looks pretty bad, right? That's because the majority of the data is clustered at low values while some posts have very high values of both score *and* comments! This is a very common thing. To visualize it you should then try to *transform* it somehow. Which transformation would allow us to visualize both *high* and *low* values in one plot?
3. In two separate figures placed side by side, make scatter plots of (left) the `x` and `y` variables for "blackmirror" and (right) `x` and `y` for "news". Remember to transform the data. My figure looks like [this](https://dhsvendsen.github.io/images/ex2123.pdf).
4. Comment on any differences you see in the trends. Why might number of comments versus post upvotes look different for a TV-show than for world news?
>
>*Hint: By "transformation" I explicitly mean that you map some function onto every value in a list of values. Example: I can apply a square root transformation like `x = [np.sqrt(v) for v in x]`. A faster way to do that, of course, would be just `x = np.sqrt(x)`.*
>*Note:* You can also use data transformations to illustrate a nonlinear relationship in your data.
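To make the idea of a transformation concrete, here is a small sketch on made-up numbers. The `scores` array is hypothetical and only stands in for heavy-tailed Reddit data; the logarithm is used as one example of a transformation that compresses large values.

```python
import numpy as np

# Hypothetical heavy-tailed counts standing in for Reddit scores (not real data)
scores = np.array([0, 1, 2, 3, 5, 8, 40, 250, 12000])

positive = scores[scores > 0]        # log is undefined at zero, so drop zeros first
log_scores = np.log10(positive)      # vectorized: applied to every element at once

spread_raw = positive.max() / positive.min()       # largest value is 12000x the smallest
spread_log = log_scores.max() - log_scores.min()   # about 4.08 on the log10 scale
```

On the raw scale the single largest value dwarfs everything else, while after the transformation all values fit within roughly four units.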

In [1]:
import requests as rq

def get_x_y(subreddit, N, count=25):
    
    def _get_data(subreddit, count, after):
        url = "https://www.reddit.com/r/%s/.json?count=%d&after=%s" % (subreddit, count, after)
        data = rq.get(url, headers = {'User-agent': 'sneakybot'}).json()
        print("Retrieved %d posts from page %s" % (count, after))
        return data
    
    after = ""

    x, y = [], []
    for n in range(N//count):
        data = _get_data(subreddit, count, after)
        for d in data['data']['children']:
            x.append(d['data']['num_comments'])
            y.append(d['data']['score'])
        after = data['data']['after']

    return x, y
                          
x, y = get_x_y("blackmirror", 500, count=25)

>**Ex. 2.1.3**: Likes and comments are clearly distributed very unevenly across posts. Let's visualize this using histograms!
1. Log transform `y` (e.g. create a new variable called `y_transformed`) and input it to `plt.hist`. Notice that if there are zeros in `y`, `np.log` will convert them to `-inf`, which `plt.hist` can't handle. In this case, we will remove zeros before log transforming. When you have done this, execute `hist_output = plt.hist(y_transformed)`. This should produce a histogram. But what does the variable `hist_output` contain?
2. Use `hist_output` to make a similar histogram with the `plt.bar` plotting function. I make you do this to force into your permanent memory what a histogram is: a bar chart showing counts within intervals/bins.
3. Plot the distributions of `y_transformed` for "blackmirror" and "news" as histograms, side by side (you can just use the regular `plt.hist` function here). My figure looks like [this](http://ulfaslak.com/computational_analysis_of_big_data/exer_figures/example_2.2c.png). Comment on the result.
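The relationship between a histogram and a bar chart can be sketched as follows. The `y` values here are made up for illustration, and the choice of five bins is an arbitrary assumption.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical scores standing in for the Reddit data (not real data)
y = np.array([0, 1, 1, 2, 3, 5, 8, 13, 40, 250, 0, 12000])

y_nonzero = y[y > 0]                 # np.log(0) is -inf, so remove zeros first
y_transformed = np.log(y_nonzero)

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# hist returns a tuple: (counts per bin, bin edges, the drawn bar patches)
counts, edges, _ = ax_hist.hist(y_transformed, bins=5)

# Rebuild the same picture by hand: one bar per bin, anchored at its left edge
ax_bar.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
```

There is always one more bin edge than there are counts, which is why the bar chart uses `edges[:-1]` as the left positions and `np.diff(edges)` as the widths.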

### Part 2: Linear algebra

>**Ex. 2.2.1**: What does Joel (book) mean when he uses the word *vector*? What are [Grant](https://youtu.be/fNk_zzaMoSs)'s vector definitions from the perspectives of the Physicist, the Computer Scientist and the Mathematician, respectively?

>**Ex. 2.2.2**: Using `numpy`, compute:
1. `2 * [2, 3]` (scalar multiplication),
2. `[3, 8] + [6, 1]` (addition),
3. `[3, 8] * [6, 1]` (element-wise product),
4. `[3, 8] · [6, 1]` (dot product), and
5. `[3, 8, 0] x [6, 1, 0]` (cross product).
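One way to carry out these five computations is sketched below. The variable names are my own, and the key assumption is that the lists are first wrapped in `np.array`, since plain Python lists behave differently (for example, `2 * [2, 3]` repeats the list).

```python
import numpy as np

u = np.array([3, 8])
v = np.array([6, 1])

scaled = 2 * np.array([2, 3])             # scalar multiplication -> [4, 6]
added = u + v                             # element-wise sum      -> [9, 9]
hadamard = u * v                          # element-wise product  -> [18, 8]
dot = u @ v                               # dot product           -> 26
cross = np.cross([3, 8, 0], [6, 1, 0])    # cross product         -> [0, 0, -45]
```

Note that `*` on arrays is element-wise, so the dot product needs `@` or `np.dot`.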

>**Ex. 2.2.3**: Say you have two vectors. What does it mean that the dot product between them is zero or very close to zero? What if it's very large? Intuitively, what does the dot product then measure?
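A few toy vectors can help build intuition here. The sketch below uses a hypothetical helper, `cosine_similarity`, which divides the dot product by both vector lengths so that only direction, and not magnitude, matters.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by both lengths, so the result lies in [-1, 1]."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

perpendicular = np.dot([1, 0], [0, 5])              # 0: the vectors share no direction
aligned = np.dot([1, 2], [10, 20])                  # 50: same direction, long vectors
cos_aligned = cosine_similarity([1, 2], [10, 20])   # 1.0 up to rounding
```

Comparing `aligned` with `cos_aligned` shows that a large dot product can come either from similar directions or simply from long vectors.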

>**Ex. 2.2.4**: In Data Science, we often think of matrices as (usually two-dimensional) containers for data. If we have, say, $N=500$ data points each with $M$ features, we can represent this data using an $N \times M$ matrix, that is, a matrix that has $N$ rows, one for each data point, and $M$ columns, one for each feature. Below I fetch a dataset of wines (rows) and their features (columns).

In [2]:
import pandas as pd

# Download dataset
X = pd.read_csv("https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv").drop('Wine', axis=1)

# Display dataset
X.head(10)

| | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| 5 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295 |
| 8 | 14.83 | 1.64 | 2.17 | 14.0 | 97 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045 |
| 9 | 13.86 | 1.35 | 2.27 | 16.0 | 98 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045 |


>So this dataset has $N=178$ rows and $M=13$ columns. Let's start by finding the so-called *covariance matrix* of the features. In this case it is a square $13\times13$ matrix in which entry $(i,j)$ is the covariance between features $i$ and $j$. [Read more here](https://en.wikipedia.org/wiki/Covariance_matrix).
1. Use the `np.cov` function on `X` to get its $13 \times 13$ covariance matrix.
2. Plot the covariance matrix using `plt.imshow` and `plt.colorbar()`.
3. Plot the correlations in the same way. Comment on the differences between these two plots. Is one easier to interpret than the other?
>
>*Hint 1: `np.cov` expects that rows are features and columns are observations. That is the transpose of how `X` is represented now.*<br>
>*Hint 2: The correlation matrix can be obtained with the `np.corrcoef` function.*
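The two hints can be illustrated on synthetic data. The sketch below uses a made-up array `data` with 100 observations and 3 features on deliberately different scales; it is not the wine dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 observations (rows) of 3 features (columns)
data = rng.standard_normal((100, 3)) * np.array([1.0, 10.0, 100.0])

cov_a = np.cov(data.T)                    # transpose so that rows are features
cov_b = np.cov(data, rowvar=False)        # or tell np.cov that columns are features
corr = np.corrcoef(data, rowvar=False)

# Correlation is covariance rescaled by the standard deviations of both features
std = np.sqrt(np.diag(cov_a))
corr_from_cov = cov_a / np.outer(std, std)
```

Both calls to `np.cov` give the same $3 \times 3$ matrix, and `corr_from_cov` matches `np.corrcoef` up to floating-point rounding.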

>**Ex. 2.2.5**: There's another use of the covariance matrix, other than just learning how features co-vary. In fact, it turns out that the *eigenvectors* of the covariance matrix are a set of mutually orthogonal vectors that point in the directions of greatest variance in the data. The eigenvector with the greatest *eigenvalue* points along the direction of greatest variation, the one with the second-greatest eigenvalue along the direction of greatest remaining variation, and so on. This is pretty neat, because if we know along which axes the data is most stretched, we can figure out how best to project it when visualizing it in 2D as a scatter plot! This whole procedure has a name: **Principal Component Analysis** (PCA) and it was invented by Karl Pearson in 1901. It belongs to a powerful class of linear algebra methods called **Matrix Factorization** methods. Ok, so rather than spending too much time on the math of PCA, let's just use the `sklearn` implementation and fit a PCA on `X`. Note that the cell below redefines `X` as four measurements from the penguins dataset.
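For those curious about the math being skipped, the eigenvector view of PCA can be sketched with plain `numpy`. The 2D data below is synthetic and deliberately stretched along the diagonal; this is an illustration of the idea, not how `sklearn` implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D data stretched along the diagonal (not the wine or penguin data)
t = rng.standard_normal(500)
data = np.column_stack([t, t + 0.3 * rng.standard_normal(500)])

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh handles symmetric matrices

order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

first_direction = eigvecs[:, 0]              # direction of greatest variance
explained_ratio = eigvals / eigvals.sum()    # share of total variance per direction
```

The columns of `eigvecs` are mutually orthogonal, and `first_direction` points roughly along the diagonal, which is where this data is most stretched.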

In [9]:
# Load the penguins dataset through the seaborn library
import seaborn as sns

df = sns.load_dataset('penguins')
df = df[df['species'] != 'Adelie']  # keep only Chinstrap and Gentoo
df = df.dropna()                    # drop rows with missing measurements
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
colors = df['species'].map({'Adelie': 'C0', 'Chinstrap': 'C1', 'Gentoo': 'C2'})
X

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 152 | 46.5 | 17.9 | 192.0 | 3500.0 |
| 153 | 50.0 | 19.5 | 196.0 | 3900.0 |
| 154 | 51.3 | 19.2 | 193.0 | 3650.0 |
| 155 | 45.4 | 18.7 | 188.0 | 3525.0 |
| 156 | 52.7 | 19.8 | 197.0 | 3725.0 |
| ... | ... | ... | ... | ... |
| 338 | 47.2 | 13.7 | 214.0 | 4925.0 |
| 340 | 46.8 | 14.3 | 215.0 | 4850.0 |
| 341 | 50.4 | 15.7 | 222.0 | 5750.0 |
| 342 | 45.2 | 14.8 | 212.0 | 5200.0 |


In [11]:
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca.fit(X)

In [5]:
pca.components_?

>1. Explain what the matrix you get when you call `pca.components_` means.
2. Make a bar plot of `pca.explained_variance_ratio_` and explain what it means (you could log-scale the y-axis to visualize both large and small values). What insights about our data can we extract from this?
3. Indeed, a problem with the data as-is is that the different features have very different orders of magnitude (some are huge numbers, others are small). The way to fix this is by doing something called "[z-scoring](https://en.wikipedia.org/wiki/Standard_score)", whereby each feature is normalized/rescaled to have zero mean and unit standard deviation. In this way, all of the data ends up with comparable variance. Make a new array `X_z` that is the z-scored `X`, using the `scipy.stats.zscore` function. Show that each column has zero mean and unit standard deviation.
4. Transform `X` using the PCA we fitted above to create a new array `X_pca`. Then fit a new PCA to `X_z` and transform it to create another new array `X_z_pca`. Produce the bar plot from before and comment on the differences.
5. Finally, scatter plot against each other the first two components (i.e. first two columns in the array) of `X_pca`, and color the data by penguin type. Do the same for `X_z_pca`. Comment on the difference.
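Why scale matters for PCA can be seen on a toy example. The sketch below uses two made-up, independent features on very different scales and computes the z-score by hand with `numpy`; as far as I know this matches the default behaviour of `scipy.stats.zscore` (population standard deviation), but treat that as an assumption to check. The helper `explained_variance_ratio` is hypothetical and not the `sklearn` attribute of the same name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (think millimetres vs grams)
small = rng.normal(45.0, 3.0, size=200)
large = rng.normal(4500.0, 500.0, size=200)
data = np.column_stack([small, large])

def explained_variance_ratio(m):
    """Covariance eigenvalues as shares of the total variance, largest first."""
    eigvals = np.linalg.eigvalsh(np.cov(m, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# z-score by hand: subtract each column's mean, divide by its standard deviation
data_z = (data - data.mean(axis=0)) / data.std(axis=0)

ratio_raw = explained_variance_ratio(data)    # first component takes nearly everything
ratio_z = explained_variance_ratio(data_z)    # roughly balanced for independent features
```

Before z-scoring, the first component mostly just rediscovers the feature with the largest numbers; afterwards, both features get a fair say.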

### Part 3: Statistics (DSFS Chapter 5)

>**Ex. 2.3.1**: Take a vector `a = [1, 3, 2, 5, 3, 1, 5, 1, 9000]`:
1. Compute the mean of `a` using `numpy`.
2. How is median defined? Compute the median of `a` using `numpy`.
3. For `a`, why might it make sense to take the median more seriously than the mean?
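A short sketch of the two computations, using the vector `a` from the exercise. The extra `middle` variable is my own addition to show where the median comes from.

```python
import numpy as np

a = np.array([1, 3, 2, 5, 3, 1, 5, 1, 9000])

mean_a = np.mean(a)        # 9021 / 9, about 1002.33
median_a = np.median(a)    # 3.0

# With an odd number of values, the median is the middle element after sorting
middle = np.sort(a)[len(a) // 2]   # 3
```

A single extreme value pulls the mean far away from where most of the data sits, while the median barely notices it.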

>**Ex. 2.3.2**: Using the same vector `a`:
1. How is *range* defined? Compute it.
2. How is *variance* defined? How do variance and standard deviation relate? Compute them both. Which value is greater?
3. What is the interquartile range? Compute it, and explain why it might be useful.
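These quantities can be computed as sketched below. Two assumptions to be aware of: `np.var` and `np.std` default to the population versions (`ddof=0`), and `np.percentile` uses linear interpolation by default, so other tools may report slightly different quartiles.

```python
import numpy as np

a = np.array([1, 3, 2, 5, 3, 1, 5, 1, 9000])

value_range = a.max() - a.min()        # 8999
variance = np.var(a)                   # population variance (ddof=0 by default)
std_dev = np.std(a)                    # square root of the variance
q1, q3 = np.percentile(a, [25, 75])    # 1.0 and 5.0 with the default interpolation
iqr = q3 - q1                          # 4.0
```

Pass `ddof=1` to `np.var` and `np.std` if you want the sample versions instead.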

>**Ex. 2.3.3**: Covariance and correlation both measure how similarly two variables trend.
1. How do they relate?
2. Compute the correlation between `a` and `b = [0, 4, 1, 6, 2, 0, 6, 0, 2]`.
3. How does that result change if you remove the last data-point from each list? Why? What *term* do we use for that last value for both `a` and `b`?
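The comparison in this exercise can be set up as sketched below, using `a` and `b` as given. The `r_manual` line is my own addition to show how correlation follows from covariance; `ddof=1` is used there because `np.cov` computes the sample covariance by default.

```python
import numpy as np

a = np.array([1, 3, 2, 5, 3, 1, 5, 1, 9000])
b = np.array([0, 4, 1, 6, 2, 0, 6, 0, 2])

# Correlation is covariance divided by the product of the standard deviations
cov_ab = np.cov(a, b)[0, 1]
r_manual = cov_ab / (np.std(a, ddof=1) * np.std(b, ddof=1))

r_all = np.corrcoef(a, b)[0, 1]                  # all nine points
r_trimmed = np.corrcoef(a[:-1], b[:-1])[0, 1]    # last point removed
```

Printing `r_all` next to `r_trimmed` shows how strongly one data point can change the result.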