# Math Matters with Python, Scipy, & Numpy


## Setup

This guide was written in Python 3.6.


### Python and Pip

Download [Python](https://www.python.org/downloads/) and [Pip](https://pip.pypa.io/en/stable/installing/).


### Libraries

We'll be working with numpy and scipy, so make sure to install them. Pull up your terminal and run the following: 

```
pip3 install -r requirements.txt
```

### Data

Lastly, for this tutorial you'll need some data. You can download it from this repo [here](https://github.com/lesley2958/dod-math).

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<h1><center>Why Does This Matter?</center></h1>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<h1><center>Math is Data</center></h1>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

## Data Structures 

### Vectors

Lists are data structures universal to pretty much any programming language. Vectors are very similar to lists, in that a vector is just an ordered collection of numbers. Because of this similarity, we can represent a vector with a list, for example:

In [5]:
A = [2.0, 3.0, 5.0]
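
One caveat worth seeing in code: a plain list stores a vector's numbers, but it does not do vector arithmetic by itself. The sketch below uses a second, made-up vector `ones` (not from the original example) to show the difference between joining two lists and adding two vectors element by element.

```python
A = [2.0, 3.0, 5.0]
ones = [1.0, 1.0, 1.0]  # hypothetical second vector, for illustration only

# `+` on lists concatenates them instead of adding the numbers.
concatenated = A + ones

# Vector addition has to be spelled out element by element.
vector_sum = [a + b for a, b in zip(A, ones)]

print(concatenated)  # [2.0, 3.0, 5.0, 1.0, 1.0, 1.0]
print(vector_sum)    # [3.0, 4.0, 6.0]
```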

### Matrices

A matrix is similar to a list or vector, but there's one fundamental difference: it's a 2D array that stores numbers. Another way of thinking about it is as multiple vectors in a list. Visually, matrices typically look like:

```
1 2 3
8 2 6
5 6 3
```

So to access any given element, you would use its row and column number, both counted from zero. For example, in the following matrix, we would access the number 8 in row 1, column 0 by:

In [30]:
B = [[1,2,3],[8,2,6],[5,6,3]]

print(B[1][0])

8
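
To make the "multiple vectors in a list" picture concrete, here is a small sketch using the same matrix. Pulling out a row is a single lookup, while a column has to be gathered from every row; the variable names are just for illustration.

```python
# The same matrix as above, stored as a list of row lists.
B = [[1, 2, 3],
     [8, 2, 6],
     [5, 6, 3]]

second_row = B[1]                     # a whole row is one inner list
first_column = [row[0] for row in B]  # a column is collected row by row

print(second_row)    # [8, 2, 6]
print(first_column)  # [1, 8, 5]
```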


## Numpy

Using the built-in data structures of the Python programming language, we implemented examples of vectors and matrices, but `numpy` gives us a better way! 

In [5]:
import numpy as np

vector1 = np.array([1,2,3])

matrix1 = np.matrix(
    [[0, 4],
     [2, 0]]
)
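
As a rough illustration of what "a better way" can mean in practice, the sketch below shows element-wise arithmetic and single-step indexing. It assumes plain `np.array` objects for both the vector and the matrix, which is a choice made for this sketch rather than something the example above requires.

```python
import numpy as np

vector1 = np.array([1, 2, 3])
matrix1 = np.array([[0, 4],
                    [2, 0]])

doubled = vector1 * 2       # arithmetic applies to every element
summed = vector1 + vector1  # element-wise addition (lists would concatenate)
element = matrix1[1, 0]     # row 1, column 0 in a single indexing step

print(doubled)        # [2 4 6]
print(summed)         # [2 4 6]
print(element)        # 2
print(matrix1.shape)  # (2, 2)
```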

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<h1><center>Math Operations = Data Operations</center></h1>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

## Matrix Operations

Within the `numpy` module, there are tons of matrix operations you can use. As with any module, this reduces the amount of code you need to write. But more importantly, because much of `numpy` is implemented in C, its operations are _incredibly_ fast.

Here are some notable examples!

### Identity Matrix

Recall that the identity matrix is an n x n matrix with 1s on the diagonal from the top left to the bottom right and 0s everywhere else, such as

```
[[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]]
```
We can generate identity matrices with numpy's `eye()` function:

In [32]:
np.eye(4)

array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
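
The reason the identity matrix matters is that multiplying by it leaves a matrix unchanged, much like multiplying a number by 1. A minimal check, assuming a 2 x 2 array `A` with the same values as `matrix1`:

```python
import numpy as np

A = np.array([[0, 4],
              [2, 0]])
I = np.eye(2)

left = I @ A   # identity on the left
right = A @ I  # identity on the right

print(np.array_equal(left, A))   # True
print(np.array_equal(right, A))  # True
```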

### Inverse Matrices

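Recall that the inverse of a matrix plays the role of a reciprocal: multiplying a matrix by its inverse gives the identity matrix. In `numpy`, we can compute it with `np.linalg.inv()`: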

In [34]:
inverse = np.linalg.inv(matrix1)
print(inverse)

[[ 0.    0.5 ]
 [ 0.25  0.  ]]
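
One way to convince yourself the result is right is to multiply the matrix by its inverse and compare against the identity. This sketch assumes a float array `A` with the same values as `matrix1`, and uses `np.allclose` because floating-point results are rarely exact.

```python
import numpy as np

A = np.array([[0.0, 4.0],
              [2.0, 0.0]])
A_inv = np.linalg.inv(A)

# A matrix times its inverse should give the identity matrix.
product = A @ A_inv

print(A_inv)
print(np.allclose(product, np.eye(2)))  # True
```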


### Determinant

Recall that the determinant of a matrix is a single number that tells us, among other things, whether the matrix has an inverse: a matrix is invertible only when its determinant is nonzero. For reference, the formula is as follows:

![ alt text](https://github.com/lesley2958/linear-algebra-with-python/blob/master/det.png?raw=true)

Instead of implementing this recursive algorithm, you can simply call the `det()` function in numpy. 

In [35]:
det = np.linalg.det(matrix1)
print(det)

-8.0
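
For the 2 x 2 case the determinant reduces to `ad - bc`, which is easy to check by hand. The helper `det_2x2` below is a hypothetical function written for this sketch (it is not part of `numpy`), and the `singular` matrix is a made-up example of a matrix with no inverse.

```python
import numpy as np

def det_2x2(m):
    """Determinant of a 2 x 2 matrix [[a, b], [c, d]], computed as ad - bc."""
    (a, b), (c, d) = m
    return a * d - b * c

A = [[0, 4],
     [2, 0]]
by_hand = det_2x2(A)                    # 0*0 - 4*2
by_numpy = np.linalg.det(np.array(A))

# The second row is twice the first, so this matrix has no inverse.
singular = [[1, 2],
            [2, 4]]
singular_det = det_2x2(singular)        # 1*4 - 2*2

print(by_hand)                        # -8
print(np.isclose(by_numpy, by_hand))  # True
print(singular_det)                   # 0
```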


<br>
<br>
<br>
<br>
<br>
<br>
<br>
<h1><center>Images are Data</center></h1>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

So far we've seen more abstract and dry examples of matrices and their operations, so now let's turn to a concrete example of when these data structures and operations come in handy.

Images consist of pixels, which vary in numerical value. But that's not the important part. The important part is what this structure looks like. 


Consider this picture of my very cute dog, Lennon: 

<img src="https://github.com/lesley2958/dod-math/blob/master/lennon.png?raw=true" alt="Drawing" style="width: 200px;"/>


This image is 200 x 200 pixels -- does this notation sound familiar to you? It's how we described the dimensionality of a matrix. Because of this wonderful property, we can literally treat the pixels of an image as an n x n matrix. For a color image, each pixel holds three numbers (red, green, and blue) rather than one.

In the following example, we'll do just that using `numpy` and `scipy`. 

In [None]:
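import scipy.misc  # `import scipy` alone does not load the `misc` submodule

# Note: `imread` was deprecated in SciPy 1.0.0 and removed in 1.2.0; newer versions need `imageio.imread` instead.
img = scipy.misc.imread("./lennon.jpg")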

The `img` variable we just created holds the image as a `numpy` array, which we can verify by printing out its type:

In [None]:
print(type(img))

In this example, we're going to manipulate the image so that it's tinted -- and we do this with a matrix operation! 

I arbitrarily chose three numbers, one per color channel, and put them in a vector. Multiplying the image matrix by this vector scales the red, green, and blue values of every pixel.

In [13]:
op = np.array([89/255, 172/255, 1])
img_tinted = img * op
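
To see what this multiplication does without needing the photo, here is a sketch on a tiny stand-in image. `fake_img` is a made-up 2 x 2 all-white image, assumed to have the usual height x width x 3 color layout; `numpy` applies the three-element vector to the last axis of every pixel.

```python
import numpy as np

# Hypothetical stand-in for the photo: a 2 x 2 image where every pixel is white.
fake_img = np.full((2, 2, 3), 255, dtype=np.uint8)

op = np.array([89 / 255, 172 / 255, 1])

# Shapes (2, 2, 3) and (3,) line up on the last axis,
# so each pixel's red, green, and blue values are scaled separately.
tinted = fake_img * op

print(tinted.shape)  # (2, 2, 3)
print(tinted[0, 0])  # roughly 89, 172, 255
print(tinted.dtype)  # float64
```

Note that the result holds floats rather than 8-bit integers, so depending on the library used to save the image, a conversion such as `tinted.astype(np.uint8)` may be needed first.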

`img_tinted` is now a manipulated version of the original photo matrix. We can then save this image to our local disk for safekeeping: 

In [14]:
scipy.misc.imsave('lennon.jpg', img_tinted)

Note: `imsave` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0. Use `imageio.imwrite` instead.


<img src="https://github.com/lesley2958/dod-math/blob/master/lennon_tinted.png?raw=true" alt="Drawing" style="width: 200px;"/>


<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<h1><center>Text is Data</center></h1>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

As a bonus, I'll introduce some beginning steps of a machine learning algorithm for sentiment analysis to see how `scipy` plays into this process. 

The first natural question is, _what exactly is "sentiment analysis"?_

Well, it's exactly what it sounds like: it's using computational tools to determine the emotional tone behind words. 

Sentiment analysis isn't a new concept. There are thousands of labeled datasets out there, with labels varying from simple positive and negative to more complex systems that determine how positive or negative a given text is.

For this post, I've selected a pre-labeled dataset consisting of tweets from Twitter already labeled as positive or negative. Using this data, we'll go through the initial steps of building a classifier that predicts whether a tweet is positive or negative. Namely, we'll set up the data preparation portion of this problem.


It's important to note that `sklearn` is a Python module with built-in machine learning algorithms. To utilize these models, having the correct data structures is **crucial**. 

This is where `scipy` comes in -- we need to format the Twitter data. Using `sklearn.feature_extraction.text.CountVectorizer`, we will convert the tweets to a matrix, or two-dimensional array, of word counts. Ultimately, this data would be used to build the classifier. 

First, we import this specific class:

In [47]:
from sklearn.feature_extraction.text import CountVectorizer        

Each file is a text file with one tweet per line. We will use the builtin `open()` function to read each file line by line and build up two lists: one for tweets and one for their labels. 

In [48]:
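data = []         # the tweet text, one entry per tweet
data_labels = []  # the matching label for each tweet

# Every line in this file is a positive tweet.
with open("./pos_tweets.txt") as f:
    for line in f:
        data.append(line)
        data_labels.append('pos')

# Every line in this file is a negative tweet.
with open("./neg_tweets.txt") as f:
    for line in f:
        data.append(line)
        data_labels.append('neg')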

Next, we initialize a vectorizer with sklearn's `CountVectorizer` class. This vectorizer will transform our data into vectors of features, where each feature is the number of times a given word appears in a tweet. 

In [49]:
vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = False,
)
features = vectorizer.fit_transform(
    data
)
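
To get a feel for what a matrix of word counts looks like, here is a hand-rolled sketch using only `numpy`. It is not how `CountVectorizer` is implemented (the real class has its own tokenization rules and returns a sparse matrix); the two tweets are made up for illustration.

```python
import numpy as np

# Two made-up tweets, purely for illustration.
tweets = ["love this dog", "hate this hate"]

# Vocabulary: every distinct word, sorted, each mapped to a column index.
vocab = sorted({word for tweet in tweets for word in tweet.split()})
column = {word: j for j, word in enumerate(vocab)}

# One row per tweet, one column per word, each cell a count.
counts = np.zeros((len(tweets), len(vocab)), dtype=int)
for i, tweet in enumerate(tweets):
    for word in tweet.split():
        counts[i, column[word]] += 1

print(vocab)   # ['dog', 'hate', 'love', 'this']
print(counts)
# [[1 0 1 1]
#  [0 2 0 1]]
```

Each row is now a numeric vector describing one tweet, which is the kind of structure a classifier can work with.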

You're likely wondering where `scipy` comes in, which is an excellent question. `sklearn` actually builds this class with the `scipy` module: the `features` object returned by `fit_transform` is a `scipy.sparse` matrix. If you look at the logs from installing `sklearn`, you'll see that it checks to make sure `scipy` is installed. This speaks to the important role `scipy` plays within machine learning.

If you want to go through the rest of this exercise, you can find the tutorial [here](https://trello.com/c/xcFqkVuv/111-making-sentiment-analysis-easy-with-scikit-learn).

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<h1><center>Statistics ♥ Data</center></h1>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

## Statistics

While not all data science relies on statistics, a lot of the exciting topics like machine learning or analysis rely on statistical concepts. 

#### In this section, we'll begin by asking ourselves, "What is statistics?" 

It's very likely that you've heard of statistics before, whether that be in an article, results for a test grade in school, or pretty much any other context. But to put it formally, statistics is a discipline that uses **data** to support claims about populations. You'll come to learn that these "populations" are what we refer to as "distributions."


## ... And?

These distributions _are_ your data. Those test scores you and the rest of your classmates bombed? Data. And as we saw above, data isn't very useful without the operations we can use on it. For example,

### Mean

You know what the mean is. You've heard it every time your computer science professor handed your midterms back and announced that the average, or mean, was a disappointing low of 59. Whoops.

With that said, the “average” is just one of many summary statistics you might choose to describe the typical value or the central tendency of a sample. As we saw in the linear algebra above, `numpy` can be used to accommodate even the "simplest" of operations (older versions of `scipy` exposed the same function as `scipy.mean`, but that alias has been deprecated in newer SciPy releases): 

In [40]:
import numpy as np
scores = np.array([17,42,86,21,55,66])
np.mean(scores)

47.833333333333336
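
The mean is nothing more than the sum of the values divided by how many there are, which the sketch below checks by hand on the same scores. It also computes the median with `np.median`, another summary of central tendency, for comparison.

```python
import numpy as np

scores = np.array([17, 42, 86, 21, 55, 66])

# Mean by hand: add everything up, divide by the count.
mean_by_hand = scores.sum() / len(scores)

# Median: the middle of the sorted values (here, the average of 42 and 55).
median = np.median(scores)

print(np.isclose(mean_by_hand, np.mean(scores)))  # True
print(median)                                     # 48.5
```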

But these modules go far beyond simple descriptive statistics like mean, median, and mode. They also cover more involved operations like cumulative distribution functions, distribution data generation, and skewness.

In [51]:
np.random.poisson(1,100)

array([0, 2, 0, 1, 4, 1, 1, 1, 0, 0, 0, 2, 2, 0, 0, 0, 2, 0, 1, 2, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 2, 0, 2, 4, 1, 2, 1, 1, 2, 1, 0, 0, 1, 0, 1, 1,
       1, 0, 2, 0, 1, 1, 0, 6, 0, 1, 1, 0, 1, 1, 4, 0, 2, 2, 2, 1, 1, 1, 2,
       0, 0, 0, 2, 1, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 0, 1, 3, 2, 1, 2, 2, 3,
       0, 0, 2, 1, 0, 4, 0, 1])

In [54]:
import scipy.stats  # `import scipy` alone does not load the `stats` submodule
scipy.stats.bernoulli.pmf(1, .5)

0.5

In [53]:
scipy.stats.skew([1,3,3,6,3,2,7,5,9,1])

0.592927061281571
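
The sketch below ties these three ideas together on small, checkable cases. The seed, sample size, and example data are arbitrary choices made for this illustration, and it assumes a `numpy` version that provides `np.random.default_rng`.

```python
import numpy as np
from scipy import stats

# Data generation: with a seeded generator the draws are reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.poisson(lam=1, size=10_000)
sample_mean = samples.mean()  # should land close to lam = 1

# Cumulative distribution function: half of a standard normal lies below 0.
cdf_at_zero = stats.norm.cdf(0)

# Skewness: perfectly symmetric data has no skew.
symmetric_skew = stats.skew([1, 2, 3, 4, 5])

print(abs(sample_mean - 1) < 0.1)  # True
print(cdf_at_zero)                 # 0.5
print(symmetric_skew)              # 0.0
```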