<hr/>

# Introduction to Data Science
**Tamás Budavári** - budavari@jhu.edu <br/>

- Python and Jupyter Notebook 
- Location, dispersion

<hr/>

<h1><font color="darkblue">Descriptive Statistics</font></h1>

### Data Sets

- For example, a set of $N$ scalar measurements 

>$ \displaystyle \big\{x_i\big\}_{i=1}^N $

### How to characterize the data?
- Location
- Dispersion
- Shape?

### Jupyter Notebook
Interactive data analysis made easy

In [None]:
# useful modules we'll always need
import numpy as np
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline 

In [None]:
%matplotlib --list

In [None]:
N = 1000
x = np.random.randn(N)

In [None]:
print (x[0])

In [None]:
# indexing from 0
print ("%f, %f, ..., %f" % (x[0], x[1], x[N-1]))

# last element
print ("%f = %f" % (x[N-1], x[-1]))

In [None]:
[i*i for i in range(5)]

In [None]:
# index out of bounds: the last iteration (i = N) raises an IndexError
for i in range(N-3,N+1):
    print ("%d : \t %r" % (i, x[i]))

In [None]:
# error handling with exceptions
for i in range(N-3,N+5):
    try: 
        print ("%d : \t %r" % (i, x[i]))
    except IndexError as err: 
        print (err)

In [None]:
# plot the values as a function of the index
plt.plot(x,'bo');

In [None]:
plt.plot(x,'bo', alpha=0.3);

In [None]:
# x vs y=x^2
plt.plot(x, x*x, 'r.');

In [None]:
# x vs y=x^2 again, now with thin lines connecting consecutive points
plt.plot(x, x*x, 'r.-', lw=0.1);

In [None]:
h = plt.hist(x, 50)

### Location

- Mode 

> where it peaks
<br>
> unimodal vs multimodal

- Sample average

> $\displaystyle \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$
<br><br>
> but indexing starts at 0 in Python and many other programming languages, so the same sum reads
<br><br>
> $\displaystyle \bar{x} = \frac{1}{N} \sum_{i=0}^{N-1} x_i$

- Median

> The number that separates the higher half of the set from the lower half
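
The three location measures above can be sketched directly from their definitions. This is only a minimal illustration: the helper names (`sample_mean`, `sample_median`, `histogram_mode`) and the bin count are assumptions made here, and the histogram-based mode is a rough estimate that depends on the binning.

```python
import numpy as np

def sample_mean(x):
    """Average via an explicit 0-based loop, mirroring the second formula."""
    total = 0.0
    for i in range(len(x)):   # i = 0, ..., N-1
        total += x[i]
    return total / len(x)

def sample_median(x):
    """Middle value of the sorted data; mean of the two middle values if N is even."""
    s = np.sort(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return 0.5 * (s[mid - 1] + s[mid])

def histogram_mode(x, bins=20):
    """Rough mode estimate: center of the most populated histogram bin."""
    counts, edges = np.histogram(x, bins=bins)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
data = rng.standard_normal(1000)

print (sample_mean(data), np.mean(data))
print (sample_median(data), np.median(data))
print (histogram_mode(data))
```

In practice `np.mean` and `np.median` do the same job; the explicit versions are here only to make the definitions concrete.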


In [None]:
avg = np.sum(x) / N 
avg, np.mean(x)

In [None]:
x.sum() / x.size

In [None]:
med = np.median(x)
med

In [None]:
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(1,1,1)
ax.hist(x,20)
# red arrow marks the average, yellow arrow marks the median
ax.arrow(avg,0,0,5,color='r')
ax.arrow(med,5,0,5,color='y')
ax.set_xlim(-10, 10);

### Dispersion
- Sample variance

> $\displaystyle s^2 = \frac{1}{N\!-\!1} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2$

- Standard deviation

> $\displaystyle s = \sqrt{s^2}$
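
As a sketch, the two formulas above translate almost line by line into NumPy. The helper name `sample_variance` and the small data array are assumptions chosen for illustration. Note that `np.var` and `np.std` divide by $N$ by default (`ddof=0`), so `ddof=1` is needed to reproduce the $(N\!-\!1)$ denominator.

```python
import numpy as np

def sample_variance(x):
    """s^2 with N-1 in the denominator, as in the formula above."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.sum() / n
    return np.sum((x - xbar) ** 2) / (n - 1)

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

s2 = sample_variance(data)
s = np.sqrt(s2)

# NumPy divides by N - ddof; the default ddof=0 divides by N
print (s2, np.var(data, ddof=1))
print (s, np.std(data, ddof=1))
print (np.var(data))   # ddof=0, a different number
```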

#### Unhomework
- Why is $(N\!-\!1)$ in the denominator above?

### Outliers
- What if just one element is too large, e.g., erroneously becomes $+\infty$?
- Sample average $\bar{x} \rightarrow +\infty$
- Sample variance explodes, too

**Ouch !!**


In [None]:
x[0] = 1e5  # corrupt a single element with a huge (but finite) value
plt.hist(x, 100);

### Robustness
- Robust against outliers? What fraction can we tolerate?
- Median is more robust than the mean
- Median Absolute Deviation (MAD) for dispersion
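
A minimal sketch of the MAD, taken here as the median of the absolute deviations from the median, without any rescaling. The helper name `mad`, the seed, and the corrupted value are assumptions for illustration. It compares how one bad element affects the standard deviation and the MAD.

```python
import numpy as np

def mad(x):
    """Median of the absolute deviations from the median (unscaled)."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(1)   # fixed seed, for reproducibility only
clean = rng.standard_normal(1000)

dirty = clean.copy()
dirty[0] = 1e5                   # one corrupted element

print ('std: %f -> %f' % (np.std(clean, ddof=1), np.std(dirty, ddof=1)))
print ('MAD: %f -> %f' % (mad(clean), mad(dirty)))
```

The standard deviation is dominated by the single corrupted value, while the MAD barely moves.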

In [None]:
print ('Average old vs new: %f %f' % (avg, np.mean(x)))
print ('Median  old vs new: %f %f' % (med, np.median(x)))

In [None]:
np.std(x, ddof=1), np.std(x[1:], ddof=1) # slicing! ddof=1 matches the N-1 formula above

In [None]:
# unhomework: what does this line do? 
x[x>100] 
# hint: look at the result of x>100 separately