*Stanislav Borysov [stabo@dtu.dk], DTU Management*
# Advanced Business Analytics

## Refreshing Python and Machine Learning: Part 2 - Numpy and Matplotlib

*Based on the notebooks from 42184 Data Science for Mobility E19 / 42577 Introduction to Business Analytics E19*

*Some parts of this notebook were originally conceived by Prof. Sune Lehmann, for the Social data analysis and visualization (02806) course.*

## 1. Algebra in Numpy

In order to continue, you need to get comfortable with vector and matrix manipulation in Numpy, a very handy Python package.

### NumPy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

You can read more about NumPy [here](https://docs.scipy.org/doc/numpy/user/whatisnumpy.html)

#### Why NumPy???

* **Memory efficiency:** NumPy's arrays are more compact than Python lists. For example, one million single-precision floats fit in 4 MB as a NumPy array (4 bytes per value), whereas the same values in a Python list would take at least 20 MB. Reading and writing many items at once is also faster with NumPy.

* **Convenience:** You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.

* **Speed:** Here's a test on doing a sum over a list and a NumPy array, showing that the sum on the NumPy array is roughly 10x faster (the exact factor depends on your machine).

In [None]:
from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)           #NumPy's arange function, creates an array with Nelements
y = list(range(Nelements))      #the built-in range you know already, turned into a list with Nelements

t_numpy = Timer("x.sum()", "from __main__ import x")    #a simple operation on the array created
t_list = Timer("sum(y)", "from __main__ import y")      #a similar operation but on a list
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
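
The memory claim can be checked in a similar spirit. The sketch below is only an illustration: the variable names are made up, and the exact size of the list depends on the Python version and platform, so treat the numbers it prints as indicative.

```python
import sys
import numpy as np

n = 1_000_000

# A Python list stores pointers to separate float objects,
# so we count the list itself plus every float it refers to.
py_list = [float(i) for i in range(n)]
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# A NumPy array stores the raw values next to each other.
arr32 = np.arange(n, dtype=np.float32)   # single precision: 4 bytes per value
arr64 = np.arange(n, dtype=np.float64)   # double precision: 8 bytes per value

print("list:    %.1f MB" % (list_bytes / 1e6))
print("float32: %.1f MB" % (arr32.nbytes / 1e6))
print("float64: %.1f MB" % (arr64.nbytes / 1e6))
```

The `nbytes` attribute reports only the data buffer of the array, which is why it scales exactly with the number of elements and the size of the `dtype`.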

### The Basics

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.

<p>
<img src="https://www.safaribooksonline.com/library/view/elegant-scipy-1st/9781491922927/assets/elsp_0105.png"/>
</p>
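
As a small illustration of "homogeneous" and "indexed by a tuple", here is a sketch with two made-up arrays (`m` and `mixed` are just example names):

```python
import numpy as np

# A 2x3 array: 2 rows (axis 0) and 3 columns (axis 1)
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)     # number of axes
print(m.shape)    # size along each axis
print(m.dtype)    # one common type for all elements
print(m[1, 2])    # element at row 1, column 2 (indices start at 0)

# Mixing integers and floats still gives ONE type: everything is upcast to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```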

In [None]:
import numpy as np

### Creating arrays
a = np.array([1,2,3])
b = np.array([(1.5,2,3), (4,5,6)], dtype=float)

#### Initial Placeholders 
(or ways to quickly create a vector/array with ones or zeros...)

In [None]:
np.zeros([3,4])                    # Creates a 3x4 array of zeros

In [None]:
np.ones((2,3,4), dtype=np.int16)   # Creates a 2x3x4 array of ones

In [None]:
np.arange(10,30,5)                  # Creates an array of evenly spaced values (step value)

In [None]:
np.linspace(0,2,9)                 # Creates an array of evenly spaced values (number of samples)

In [None]:
np.full((2,2),7)                    # Creates an array of constants

In [None]:
np.eye(2)                           # Creates a 2x2 identity matrix

In [None]:
np.random.random((2,2))             # Creates an array of random values

In [None]:
import matplotlib.pyplot as plt   # we add this for you... 
#Yes, without this line below you'll run into trouble (wanna try?   ;-) )
%matplotlib inline   

In [None]:
# Build a vector of 10000 samples from a normal distribution with variance 0.5^2 and mean 2
mu, sigma = 2, 0.5   

v = np.random.normal(mu,sigma,10000)   #get 10000 samples from a normal
# Plot a normalized histogram with 50 bins
plt.hist(v, bins=50, density=True)     # matplotlib version (plot); density=True replaces the removed normed argument
plt.show()

In [None]:
np.empty((3,2))                     # Creates an uninitialized array (its values are arbitrary)

### Saving and loading text files

In [None]:
a = np.array([1,2,3])
np.savetxt("myarray.txt", a, delimiter=" ")
loaded_a=np.loadtxt("myarray.txt")  #reads directly to an np.array
print(loaded_a)

In [None]:
with open("myarray.txt") as f:   #read it as if it was a normal text file (and close it afterwards)
    for line in f:
        print(line, end="")      #each line already ends with a newline

### Indexing, Slicing and Iterating

In [None]:
a = np.arange(10)**3   #power operator
print(a)

In [None]:
print(a[2])
print(a[2:5])
print(a[ : :-1])      # reversed a

for i in a:
    print(i**(1/3.))

In [None]:
# Two dimensional arrays
print(b)
print(b[:1])       # prints the first row
print(b[0:2,1])    # prints the second column

In [None]:
a[a<2]            # Selection of elements from "a" less than 2

### Array Manipulation

In [None]:
print(b)
print(b.T) # Transposing Array (you could also do np.transpose(b))

In [None]:
# Changing array Shape
b=b.ravel()               # b becomes flattened (all elements in a single row)
print(b)
b=b.reshape(2,3)     #let's get it back to the initial 2x3 form
print(b)           #and here it is

In [None]:
# Adding/Removing elements
c=np.append(a,b)
print(c)

In [None]:
np.insert(c, 3, 5) #returns a copy of c with number 5 inserted in position 3 (c itself is unchanged)

In [None]:
print(c) #before
c=np.delete(c, 3) #removes the fourth element in array
print(c) #after

In [None]:
# Combining arrays
np.concatenate((b,b), axis=0)

In [None]:
np.vstack((b,b))   #Try hstack()!

## 2. Basic numpy statistics

Start by downloading this dataset: [Data](https://raw.githubusercontent.com/suneman/socialdataanalysis2017/master/files/data1.tsv). The format is `.tsv`, which stands for _tab separated values_. 
The file has two columns (separated using the tab character). The first column is $x$-values, and the second column is $y$-values.  

It's ok to just download this file to disk by right-clicking on the link, but if you use Python and _urllib_ (the _urllib.request_ module in Python 3) to get it, I'll really be impressed. If you don't know how to do that, I recommend opening up Google and typing "download file using Python" or something like that. When interpreting the search results remember that _stackoverflow_ is your friend.
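
If you want a starting point, here is one possible sketch using `urlretrieve` from the standard library. The helper name `download_file` and the local file name are assumptions made for this example; other approaches work just as well.

```python
from urllib.request import urlretrieve

def download_file(url, filename):
    """Download url and store it under filename; returns the local file name."""
    local_path, _headers = urlretrieve(url, filename)
    return local_path

# Usage (needs an internet connection), with the dataset URL given above:
# download_file("https://raw.githubusercontent.com/suneman/socialdataanalysis2017/master/files/data1.tsv", "data1.tsv")
```

The call is left commented out so that the cell does not touch the network until you decide to run it.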

* Using the `numpy` function `mean`, calculate the mean of both $x$-values and $y$-values. 
* Use python string formatting to print precisely two decimal places of these results to the output cell. Check out [this _stackoverflow_ page](http://stackoverflow.com/questions/8885663/how-to-format-a-floating-number-to-fixed-width-in-python) for help with the string formatting. 
* Now calculate the variance of $x$- and $y$-values (to three decimal places).
* Use `numpy` to calculate the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) between $x$- and $y$-values (also to three decimal places).
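
The calls involved can be sketched on made-up numbers. The arrays below are NOT the contents of `data1.tsv`; they only show how `np.mean`, `np.var`, `np.corrcoef` and the three common string-formatting styles fit together.

```python
import numpy as np

# Made-up numbers, only to show the calls; they are NOT the values in data1.tsv
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

mean_x = np.mean(x)
var_x = np.var(x)                  # population variance (ddof=0 is NumPy's default)
var_x_sample = np.var(x, ddof=1)   # sample variance, divides by N-1
r = np.corrcoef(x, y)[0, 1]        # off-diagonal entry of the 2x2 correlation matrix

print("mean(x) = %.2f" % mean_x)           # %-style formatting
print("var(x)  = {:.3f}".format(var_x))    # str.format
print(f"pearson = {r:.3f}")                # f-string
```

Note that `np.corrcoef` returns a matrix, so you need to pick the off-diagonal element to get a single correlation value.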

In [None]:
# ...

## 3. Numpy statistics and Matplotlib visualisations

Please open the file "data/pickups_zone_1_15min.csv". This corresponds to the series of taxi pickups in New York zone 1 (an area in Manhattan). 

You can use the function open(file), which **returns** a stream. 

In [None]:
f = open("data/pickups_zone_1_15min.csv")

Good, so we have the file stream, now we need to get to its content. This stream has a method to read all the lines at once (method readlines()). Let's use it. 

In [None]:
lines = f.readlines()

Just to be sure, let's check how many lines the file actually has (it should be 262849, if it's different, there's something wrong...). 

In [None]:
len(lines)

Print the first 10 lines of this file.

In [None]:
lines[:10]

Cool, so you have temporal attributes, and the actual number of pickups.

So, print **only** the pickups part (10 first lines). 

A tip - look what the following code does:

> x="1,2,3,4"

>xsplitted=x.split(',')


>print(xsplitted)

>print(xsplitted[2])

Output:

['1', '2', '3', '4']

3


In [None]:
# ...

OK, our goal is thus to make a **single** list with all the pickup data. You need to go over each line, get the pickup value, convert it to an integer, and add it to this list. 

Can you do that?

In [None]:
# ...

Cool, so now you have an actual time series dataset. As you'll shortly see, it's very useful to have the respective time stamps (not just know that it is a sequence of values, but also know exactly **when** each value occurred).

Notice, however, that the values for the date are split into 3 fields. More importantly, for Python, they are just numbers that have nothing to do with time. Luckily, there's a class called datetime.datetime (the repetition here is intentional...). 

Here's some example code for you:
>from datetime import datetime as dt

>s="2017-09-11"
>
>print(type(s))
>
>time=dt.strptime(s,  '%Y-%m-%d')
>
>print(type(time))
>
>print(time)
>
>time=time.replace(hour=14, minute=35)
>
>print(time)


Output:


`<class 'str'>`

`<class 'datetime.datetime'>`

2017-09-11 00:00:00

2017-09-11 14:35:00





So, it creates a "datetime" object, that has everything you need to know about the time of that data point. It's pretty handy as you'll see later. 
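
To give a feel for what such an object offers, here is a small sketch. It builds the datetime directly from integer fields (handy when the date arrives as separate numbers) and uses `timedelta` for arithmetic; the variable names are made up for the example.

```python
from datetime import datetime as dt, timedelta

t = dt(2017, 9, 11, 14, 35)      # year, month, day, hour, minute

print(t.year, t.month, t.day)    # individual fields
print(t.hour, t.minute)
print(t.weekday())               # Monday is 0, Sunday is 6

# Datetimes support arithmetic: adding 15 minutes gives the next time slot
next_slot = t + timedelta(minutes=15)
print(next_slot)
print(next_slot - t)             # the difference is a timedelta
```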

Like what you did in the previous exercise, we now want a single list with all the datetime objects. 

In [None]:
# ...

Time to reorganize your code a bit. You've made a few things above:
- load a file
- go over the content of the file to create a list with pickup data
- go over the content of the file to create a list of datetime objects

Let's put them together in a function that reads a file (with name fcsv) and returns the two lists mentioned. I'll give you the first and last lines:

>def read_csv(fcsv):
>
>      ...  #you just need to fill this part! ;-)
>
>      return pickups, times


In [None]:
# ...

Now, with this function, you can run all the above with different files with a single command! Do you want to try?


In [None]:
picks, times = read_csv("data/pickups_zone_1_15min.csv")

Now, let's create two numpy arrays from our two lists. 

In [None]:
vpicks = np.array(picks)
vtimes = np.array(times)

The pickup vector can be used right away to make a histogram. Think about it: what should the distribution of number of pickups (i.e. the observable taxi demand) look like?

In [None]:
n, bins, patches=plt.hist(vpicks, bins=60)
plt.ylabel("frequency")
plt.xlabel("pickups")
plt.show()

hhmmm... was this what you were expecting?... 

Another interesting way to look at this data is simply by plotting directly with a scatter plot, where the x axis is the index of the datapoint, and the y axis is the total number of pickups.

Tip: to see the data better, try playing with the size of the dots (for example, put s=0.1 as an argument to the scatter call)

In [None]:
plt.scatter(np.arange(len(vpicks)), vpicks, s=0.1, alpha=0.5)
plt.ylabel("# pickups")
plt.xlabel("Record ID")
plt.show()

This sort of pattern is intriguing. There seem to be two general "trends" of taxi pickups in this area. Do you think it relates to the time of day? It would make some sense (e.g. during the night, low values, during peak hours, high values). 

There are many ways to check this, but a simple one is to make a small change in the temporal vector. Instead of each element of this vector corresponding to absolute time (the *actual* date and time), why not just represent the minutes since midnight? 

Can you make a new vector with that content?

**Tip on how to get the hour and minute**:
>from datetime import datetime as dt
>
>s="2017-09-11 18:35:11"
>
>d=dt.strptime(s, "%Y-%m-%d %H:%M:%S")
>
>print(d)
>
>print(d.hour)
>
>print(d.minute)
>

Output:

2017-09-11 18:35:11

18

35
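
Building on the tip, one possible sketch of the conversion is shown below. The timestamps are made up and `example_times` is a stand-in name for your own list of datetime objects.

```python
from datetime import datetime as dt
import numpy as np

# A few made-up timestamps, standing in for the real list of datetime objects
example_times = [dt(2017, 9, 11, 0, 0), dt(2017, 9, 11, 0, 15),
                 dt(2017, 9, 11, 18, 30), dt(2017, 9, 11, 23, 45)]

# minutes since midnight = 60 * hour + minute
minutes = np.array([60 * t.hour + t.minute for t in example_times])
print(minutes)
```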

In [None]:
# ...

Ok, now for a cool trick. In Python (well, in general), you ultimately define colors with numbers. So, imagine that the number of minutes since midnight (that you just created) corresponds to a color. The function scatter allows you to give this list straight away and plot it (just use the argument c, for example "c=my_minute_since_midnight_list"). 

Do you want to try?
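
To show how the `c` argument behaves before you apply it to the taxi data, here is a sketch on synthetic values. Everything in it (the sine-shaped "demand", the variable names) is invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: four days of 15-minute slots with a daily cycle
slots = np.arange(4 * 96)                    # 96 slots of 15 minutes per day
minute_of_day = (slots % 96) * 15            # 0, 15, ..., 1425, then repeats
values = 50 + 40 * np.sin(2 * np.pi * (minute_of_day - 360) / 1440)

# c= takes one number per point; the colormap turns each number into a color
points = plt.scatter(slots, values, s=5, c=minute_of_day)
plt.colorbar(points, label="minutes since midnight")
plt.xlabel("Record ID")
plt.ylabel("value")
```

With `%matplotlib inline` the figure appears when the cell finishes; outside a notebook you would add `plt.show()`. The colorbar is optional, but it makes it much easier to read which color belongs to which time of day.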

In [None]:
# ...

Doesn't this explain something?  ;-)

Ok, so there seems to be indeed a relationship with time! 

If this is true, it may be interesting to do a 24-hr average plot. In other words, a plot where the x axis is 0 to 1440 (1440=24 hours X 60 minutes), and you show the average per minute.

BTW, it will also be very useful if you add the 5th and 95th percentiles. 

Can you do this?
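
One way to approach the grouping is sketched below on synthetic data. The Poisson-distributed values and all names are assumptions for the example; note that with 15-minute data there are only 96 distinct minutes of the day, not 1440.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 30 days of 15-minute slots
n_days = 30
minute_of_day = np.tile(np.arange(0, 1440, 15), n_days)
values = rng.poisson(20 + 15 * np.sin(2 * np.pi * minute_of_day / 1440))

# For every distinct minute of the day, summarise all values observed at it
slots = np.unique(minute_of_day)
mean = np.array([values[minute_of_day == m].mean() for m in slots])
p05 = np.array([np.percentile(values[minute_of_day == m], 5) for m in slots])
p95 = np.array([np.percentile(values[minute_of_day == m], 95) for m in slots])

print(slots[:4], mean[:4], p05[:4], p95[:4])
```

For the plot itself, `plt.plot(slots, mean)` together with `plt.fill_between(slots, p05, p95, alpha=0.3)` is one option for drawing the average with a shaded band around it.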

In [None]:
# ...

### Correlations

One very important task in Data Science modeling is to find (and understand) correlations between different variables. Let's do a few simple exercises.

Let's start with a simple question: are the different areas correlated with each other? If yes, it may be interesting knowledge. For example, maybe we can share data between them to predict better, later.



In [None]:
s1, _ = read_csv("data/pickups_zone_1_15min.csv")
s17, _ = read_csv("data/pickups_zone_17_15min.csv")

np.corrcoef([s1, s17])

It seems the areas are well correlated with each other. Can you see that in the plots above?

Now, a more interesting question: are there correlations between a given area, and the other areas in earlier time steps? 

This is a VERY important one. If you find high correlation, for example, between area 1 at time t, with area 17 at time t-1, then you can use area 17 to predict area 1!

To check this, you need to play a little bit with the vectors. Let's call a vector that is shifted in time by 1 time step a "lag1" vector:

In [None]:
s17_lag1=np.array(s17[:-1])  #look, it's the SAME vector, except that you stop before the last element: this is area 17 at time t-1. Why?
#if this is confusing you now, please ask the teacher to explain. 
s1_trimmed=np.array(s1[1:]) #area 1 at time t, starting on the second element. With a whiteboard nearby, it will be easy to understand! ;-)
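
If the slicing is hard to picture, a tiny synthetic example may help. Here `follower` is constructed (as an assumption for this sketch) to repeat `leader` exactly one time step later, so pairing the follower at time t with the leader at time t-1 should give a perfect correlation.

```python
import numpy as np

# Synthetic example: 'follower' repeats 'leader' one time step later
leader = np.array([3, 8, 1, 9, 4, 7, 2])
follower = np.roll(leader, 1)      # follower[t] == leader[t-1] for t >= 1

# Same time step: the two series do not line up
r_same = np.corrcoef(leader, follower)[0, 1]
print(r_same)

# Pair follower at time t with leader at time t-1:
# drop the first element of one and the last element of the other
r_lag = np.corrcoef(follower[1:], leader[:-1])[0, 1]
print(r_lag)
```

The series whose first element is dropped is the one observed later; the series whose last element is dropped is the one observed earlier, i.e. the potential predictor.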

Let's now check those correlations

In [None]:
np.corrcoef([s1_trimmed, s17_lag1])

WOW! Very interesting!! This means that you can use data from these other areas to predict for area 1... This is useful when there is missing data in area 1, for example... 

Try to replicate this same study, for all different areas. To make it quick, you can define a function that receives the area you want to "predict for", and the areas you want to use as "predictors".

In [None]:
def search_correlations(target, predictors, lag=1):
    # ...
    pass   # placeholder so the cell runs until you fill in the body

In [None]:
search_correlations(s17, [s1], 1)

An idea: check the correlation between the series "s1" and the series "s1_lag1". This is called the autocorrelation of lag 1 of the time series. 

In [None]:
search_correlations(s1, [s1])

Now, do this with lag=2

In [None]:
search_correlations(s1, [s1], 2)

Ok, the last couple of tasks for today: 

Create a list, let's call it "auto_s1", with the autocorrelations with all lags until 120
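
As a hint on the mechanics, here is a sketch on a synthetic series. The helper `autocorr` and the pure sine "daily cycle" are assumptions for illustration, not the taxi data: with 15-minute slots, one day is 96 steps, so a daily pattern should show up again at lag 96.

```python
import numpy as np

def autocorr(series, lag):
    """Pearson correlation between a series and itself shifted by lag >= 1 steps."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[lag:], series[:-lag])[0, 1]

# Synthetic stand-in: 20 'days' of a pure daily cycle sampled every 15 minutes
t = np.arange(20 * 96)
cycle = np.sin(2 * np.pi * t / 96)

auto_cycle = [autocorr(cycle, lag) for lag in range(1, 121)]
print(auto_cycle[47], auto_cycle[95])   # lags 48 (half a day) and 96 (a full day)
```

Real data is noisier than a sine wave, so expect the peaks to be lower, but their position along the lag axis carries the same kind of information.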

In [None]:
# ...

... and plot it

In [None]:
# ...

The resulting diagram is called an autocorrelogram! :-) What do you think? Can you explain what's happening in those bumps?