# Data Analysis with Python

<a id='TOC'></a>
# Contents
## [I. Hello Jupyter](#HelloJupyter)
## [II. Python Basics](#PythonBasics)
## [III. Numpy](#NumPy)
## [IV. Scipy](#Scipy)
## [V. Pandas](#Pandas)
## [VI. Pickle](#Pickle)
## [VII. Matplotlib](#Matplotlib)
## [VIII. Exploring Datasets](#ExploringDatasets)
## [IX. WordClouds](#WordClouds)

<a id='HelloJupyter'></a>
# I. Hello Jupyter

## Markdown in Jupyter

### New paragraph

This is *rich* **text** with [links](http://ipython.org), equations:

$$\hat{f}(\theta) = \int_{-\infty}^{+\infty} f(x)\, \mathrm{e}^{-i \theta x}\, \mathrm{d}x$$

code with syntax highlighting:

```python
print("Hello world!")
```

and images:

![This is an image](http://ipython.org/_static/IPy_header.png)

## Image in Jupyter

In [None]:
from IPython.display import Image
Image('Images/LanguageChoice.png', width = 600)

## Video in Jupyter

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('8CX-Q0gtSp8')

## Math in Jupyter

In [None]:
from IPython.display import Latex
Latex(r"""\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0 
\end{eqnarray}""")

## Audio in Jupyter

In [None]:
import numpy as np
from IPython.display import Audio

max_time = 3
f1 = 220.0
f2 = 224.0
rate = 8000.0
L = 3
times = np.linspace(0, L, int(rate*L))  # the number of samples must be an integer
signal = np.sin(2*np.pi*f1*times) + np.sin(2*np.pi*f2*times)

Audio(data=signal, rate=rate)

## Packages we need

- **NumPy**: Numerical Python, provides the ability to specify and manipulate array data structures.


- **SciPy**: Scientific Python, provides a variety of high-level science and engineering modules.


- **Matplotlib**: Plotting library.


- **Pandas**: High-performance, easy-to-use data structures and data analysis tools.


- **IPython/Jupyter**: Enhanced Python shell, designed to increase the efficiency and usability of coding, testing and debugging Python.


- **scikit-learn**: Tools for machine learning and data analysis.

Get all of the above in one shot by installing **Anaconda**.

<a id='PythonBasics'></a>
# II. Python Basics

In [None]:
x = ?
print(x)

In [None]:
print(x * 4)

In [None]:
print(x**4)

In [None]:
import math
x = math.cos(43)
print(x)

## Python String Formatting

In [None]:
x = ?
print('This is a simple way of printing x', x)

In [None]:
y = ?
print('This is how we print x', x, 'and y', y, '.')

In [None]:
z = 0.489349842432423829328933 * 10**32
print('{}'.format(z))

In [None]:
print('What if I want to print x twice, like this {0}, {0}, and y once, {1}?'.format(x,y))

## Python Lists

In [None]:
x = list(?)
print(x)
x.reverse()
print(x)

In [None]:
x = [3, 42, 8, 5, 42, 0.4, 246]
x.sort()
print(x)

In [None]:
x = list(range(4))
print(x)

x.append(?)
print(x)

In [None]:
x = list(range(5, 10))
x

In [None]:
x = list(range(0, 10, 2))
x

In [None]:
x = list(range(10, 0, -1))
x

In [None]:
x = [4, 3, 41]
x[2] = "Let's put a string here instead"
x.append("And another for demonstration purposes")
print(x)

### Slicing Lists

[**start** : **stop** : **steps**]

[: stop] => slice from start till (excluding) stop index


[start :] => slice from start till end


steps = -1 => go through the list backwards

In [None]:
numbers = [4, 7, 24, 11, 2]
print(numbers)
print(numbers[0:3])
print(numbers[-1], numbers[-2])
print(numbers[3:])
print(numbers[:-1])
print(numbers[:4])
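
The cell above does not exercise the third part of the slice, the step. A small sketch using the same `numbers` list (the names `every_other`, `reversed_numbers` and `middle` are only illustrative):

```python
numbers = [4, 7, 24, 11, 2]

# A step of 2 keeps every second element, starting from index 0
every_other = numbers[::2]

# A step of -1 walks the list backwards, producing a reversed copy
reversed_numbers = numbers[::-1]

# start, stop and step can be combined: this picks indices 1 and 3
middle = numbers[1:4:2]

print(every_other)       # [4, 24, 2]
print(reversed_numbers)  # [2, 11, 24, 7, 4]
print(middle)            # [7, 11]
```

Slicing always returns a new list, so `numbers` itself is left unchanged.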

## Control Structures

### For loops

In [None]:
x = [4, 3, 24, 7]
for element in x:
    print(element)

In [None]:
x = [4, 3, 24, 7]
xsum = 0
for element in x:
    xsum += element # <--- this means xsum = xsum + element
print(xsum)

### if-elif-else

In [None]:
x, y = ?

if (x > y):
    print(x, '>', y)
elif (x == y):
    print(x, 'equals', y)
else:
    print('Hi there!')

### While

In [None]:
i = 0
while (?):
    print(i)
    i += 1

### List Comprehension

In [None]:
t = [? for x in [5, 6, 7]] 
print(t)

In [None]:
?complex

In [None]:
z = [? for x in range(0, 4, 1)
                       for y in range(4, 0, -1) if x > y]
print(z)

## Dictionary

In [None]:
d = {'a':1, 'b':2, 'c':3}
d

In [None]:
d['b']

In [None]:
d1 = dict(d=4, e=5, f=6)
d1

In [None]:
d.update(d1)
d

In [None]:
list(d.keys())

In [None]:
list(d.values())

In [None]:
d['g'] = 7
d

In [None]:
for k in d:
    print(?)

### Dictionary Comprehension

In [None]:
a = { ? for n in range(7) } # note curly brackets
print(a)

In [None]:
odd_sq = { n: n*n for n in range(7) if ? }
print(odd_sq)

In [None]:
# next example -> swaps the key:value pairs
a = { ? for key, val in a.items() }
print(a)

## Functions

Python functions are defined using the **def** keyword. 

In [None]:
def hello(name, loud=False):
    if loud:
        print('HELLO, %s!' % name.upper())
    else:
        print('Hello, %s' % name)

hello('Rose') 
hello('Jupyter', loud=True) 

## Lambda functions

Often we define a mathematical function with a quick one-line function called a lambda. No return statement is needed.

In [None]:
square = lambda x: x*x
print(square(3))

# The hypotenuse is the square root of the sum of squares
hypotenuse = lambda x, y: (x*x + y*y) ** 0.5

## Same as

# def hypotenuse(x, y):
#     return (x*x + y*y) ** 0.5

hypotenuse(?)
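
Lambdas are most often passed straight into another function rather than given a name. The sketch below uses one as a sort key; the `scores` data and the helper `second` are made up for illustration.

```python
# Hypothetical data: (name, score) pairs
scores = [('ann', 82), ('bob', 95), ('cy', 67)]

# The lambda is used once as the sort key, so it never needs a name
by_score = sorted(scores, key=lambda pair: pair[1])

# A lambda is an ordinary function object; this def is equivalent
def second(pair):
    return pair[1]

print(by_score)                                 # [('cy', 67), ('ann', 82), ('bob', 95)]
print(by_score == sorted(scores, key=second))   # True
```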

## EXERCISE: 
- Write a function called **isprime** that takes in a positive integer $N$, and determines whether or not it is prime. Return $N$ if it's prime and return nothing if it isn't.

- Create a **list** myprimes that contains all the prime numbers less than 100.

In [None]:
# your code here


## Classes

In [None]:
class Greeter(object):

    # Constructor
    def __init__(self, name):
        self.name = name  # Create an instance variable

    # Instance method
    def greet(self, loud=False):
        if loud:
            print('HELLO, %s!' % self.name.upper())
        else:
            print('Hello, %s' % self.name)

g = Greeter('Jupyter')  # Construct an instance of the Greeter class
g.greet()            # Call an instance method
g.greet(loud=True)   # Call an instance method

<a id='NumPy'></a>
# III. NumPy
Python lists are great. They can store strings, integers, or mixtures.

NumPy arrays, though, are **multi-dimensional**, and most **engineering** Python libraries use them instead.

They store the **same type of data** in each element and **cannot change size**.
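
A minimal sketch of what "same type" and "fixed size" mean in practice (the array names `a`, `b` and `c` are only illustrative):

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts everything to one common dtype
a = np.array([1, 2, 3.5])
print(a.dtype)           # float64

# Assigning a float into an integer array truncates it to fit the dtype
b = np.array([1, 2, 3])
b[0] = 7.9
print(b[0])              # 7

# "Appending" does not grow the array in place; np.append returns a new array
c = np.append(b, 4)
print(b.shape, c.shape)  # (3,) (4,)
```

Because every element has the same type and the size is fixed, NumPy can store the data in one contiguous block of memory, which is what makes the vectorized operations later in this section fast.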

In [None]:
import numpy as np

x = np.zeros(5)
print(x)

In [None]:
x = np.zeros( ? )
print(x)

In [None]:
print (np.arange(3, 10))       # Does not include end point
print (np.linspace(0, 1, 25))  # Includes end point

In [None]:
from math import pi
x = np.linspace(?)
print(np.cos(x))

In [None]:
x = np.arange(5)
print(?)

In [None]:
x = np.arange(5)
print(?)

In [None]:
# Numpy Methods (use Tab)
x = np.arange(0, 10, 0.1)
print(x.?)
print(x.?)
print(x.?)

## Numpy Speed

In [None]:
import random
n = 1000000
# Return random floats in the half-open interval [0.0, 1.0)
x = [random.random() for i in range(n)]
y = [random.random() for i in range(n)]

In [None]:
x[:3], y[:3]

In [None]:
z = [x[i] + y[i] for i in range(n)]
z[:3]

In [None]:
%timeit [x[i] + y[i] for i in range(n)]

In [None]:
xa = np.array(x)
ya = np.array(y)

In [None]:
za = xa + ya
za[:3]

In [None]:
%timeit xa + ya

We observe that this operation is more than one order of magnitude faster in NumPy than in pure Python!
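
`%timeit` only works inside IPython/Jupyter. As a rough sketch, the same comparison can be run in a plain Python script with the standard `timeit` module. The smaller `n` and the repetition count of 10 are arbitrary choices for this sketch, and the timings will vary from machine to machine.

```python
import random
import timeit

import numpy as np

n = 100_000  # smaller than the notebook's n, to keep the sketch quick
x = [random.random() for _ in range(n)]
y = [random.random() for _ in range(n)]
xa, ya = np.array(x), np.array(y)

# Time 10 repetitions of each approach
t_list = timeit.timeit(lambda: [x[i] + y[i] for i in range(n)], number=10)
t_numpy = timeit.timeit(lambda: xa + ya, number=10)

print('list comprehension: {:.4f} s'.format(t_list))
print('NumPy addition:     {:.4f} s'.format(t_numpy))

# Both approaches compute the same values
same = np.allclose([x[i] + y[i] for i in range(n)], xa + ya)
print(same)
```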

## Numpy 2D Arrays

In [None]:
import numpy as np 
a = np.array([[0.0,0.0,0.0], [10.0,10.0,10.0], [20.0,20.0,20.0], [30.0,30.0,30.0]]) 
b = np.array([0.0, 1.0, 2.0])  
   
print('First array:')
print(a)
print('\n')

print('Second array:')
print(b)
print('\n')

print('First Array + Second Array')
print(?)

In [None]:
import numpy as np
# Create a 10x6 matrix from normal distribution and convert to ints
n, nrows, ncols = ?
xs = np.random.normal(n, 15, size=(nrows,ncols)).astype('int')
xs

In [None]:
print(xs.max())
print(xs.max(axis=?)) # max of each col     
print(xs.max(axis=?)) # max of each row     

<a id='Scipy'></a>
# IV. Scipy

Numpy provides a high-performance multidimensional array and basic tools to compute with and manipulate these arrays. 

SciPy builds on this, and provides a large number of functions that operate on numpy arrays and are useful for different types of **scientific and engineering applications**.

## Optimization

In [None]:
# Example of Scipy functionality

# Optimization:
import scipy as sp
import matplotlib.pyplot as plt

def f(x):
    return ?

x = np.arange(-10, 10, 0.1)
plt.plot(?) 
plt.show() 

This function has a global minimum around -1.3 and a local minimum around 3.8.

Searching for a minimum can be done with **scipy.optimize.minimize()**: given a starting point x0, it returns the location of the minimum that it has found, which may be a local minimum rather than the global one.

In [None]:
from scipy.optimize import minimize     
result = minimize(?)
print(result)      # Global minimum
print(f(result.x)) # Value at global minimum
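
The cells above leave `f` and the starting point as exercises. As a minimal sketch, assume `f(x) = x**2 + 10*sin(x)`, a function that has a global minimum near -1.3 and a local minimum near 3.8. This choice of function and the two starting points are assumptions made for illustration, not necessarily what the exercise intends.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed example function (not taken from the notebook)
def f(x):
    return x**2 + 10 * np.sin(x)

# Starting near 0, the search settles in the global minimum
res_a = minimize(f, x0=0.0)

# Starting at 3, it gets trapped in the local minimum instead
res_b = minimize(f, x0=3.0)

print(res_a.x, res_a.fun)
print(res_b.x, res_b.fun)
```

The result depends on `x0` because `minimize` runs a local search: it follows the slope downhill from the starting point and stops at the first minimum it reaches.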

<a id='Pandas'></a>
# V. Pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime

### Dataframes
According to Pandas documentation:
*Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels.*

In human terms, this means that a **dataframe has rows and columns**, can **change size**, and possibly **has mixed data types**.

### Peek at the DataFrame contents
df.info()                  # index & data types

df.head(i)                 # get first i rows

df.tail(i)                 # get last i rows

df.describe()              # summary stats cols
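
A short sketch of these ideas on a tiny, made-up table (`df_demo` and its columns are hypothetical, chosen only to show mixed types and the peek methods):

```python
import pandas as pd

# Hypothetical data: mixed column types in one table
df_demo = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                        'score': [1.5, 2.0, 3.5, 4.0],
                        'passed': [False, False, True, True]})

# Size-mutable: a new column can be added after construction
df_demo['double'] = df_demo['score'] * 2

df_demo.info()               # prints index, column dtypes, non-null counts
print(df_demo.head(2))       # first 2 rows
print(df_demo.tail(2))       # last 2 rows
print(df_demo.describe())    # summary stats for the numeric columns
```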

### Pandas with IPL data

In [None]:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
            'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4 ,1 ,1, 2, 4, 1, 2],
            'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
            'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(?)

print(df)

In [None]:
df.?

A very powerful feature in Pandas is **groupby**. 

This function allows us to **group together rows that have the same value in a particular column**. 
Then, we can aggregate this group-by object to compute statistics in each group. 

In [None]:
df.groupby(?).groups

In [None]:
# Multiple columns
df.groupby(?).groups
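
To make the group-then-aggregate idea concrete, here is a minimal sketch on a made-up table that is smaller than the IPL data (the `scores` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data, smaller than the IPL table above
scores = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'B'],
                       'Year': [2014, 2015, 2014, 2015, 2015],
                       'Points': [10, 20, 30, 40, 50]})

# Group rows by one column, then aggregate each group
mean_by_team = scores.groupby('Team')['Points'].mean()
print(mean_by_team)

# Group by several columns by passing a list of names
total_by_team_year = scores.groupby(['Team', 'Year'])['Points'].sum()
print(total_by_team_year)
```

Grouping by a list of columns produces one group per distinct combination of values, so the result is indexed by `(Team, Year)` pairs.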

## Pandas with Roger Federer

In [None]:
# Now, Pandas with another dataset
df = pd.read_csv('Data/Roger-Federer.csv')
df

There are many columns. Each row corresponds to a match played by Roger Federer. Let's add a boolean variable indicating whether he has won the match or not. The `tail` method displays the last rows of the column.

In [None]:
player = ?
df['win'] = df['winner'] == player
df['win'].tail()

`df['win']` is a `Series` object: it is very similar to a NumPy array, except that each value has an index (here, the match index). This object has a few standard statistical functions. For example, let's look at the proportion of matches won.

In [None]:
print("{player} has won {vic:.0f}% of his ATP matches.".format(
      player=player, vic=100*df['win'].mean()))
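
Taking the mean of a boolean column works because `True` counts as 1 and `False` as 0. A tiny sketch with made-up results (the `matches` frame and the player name `'X'` are hypothetical, not the Federer data):

```python
import pandas as pd

# Hypothetical match results, not the Federer data
matches = pd.DataFrame({'winner': ['X', 'Y', 'X', 'X']})

# Comparing a column with a value yields a boolean Series
won = matches['winner'] == 'X'

# True counts as 1 and False as 0, so the mean is the win proportion
print(won.sum())    # 3
print(won.mean())   # 0.75
```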

We now look at the **proportion of double faults in each match**. 

In [None]:
df['dblfaults'] = (df['player1 double faults'] / df['player1 total points total'])

We can use the `head` and `tail` methods to take a look at the beginning and the end of the column, and `describe` to get summary statistics. In particular, let's note that some rows have `NaN` values (i.e. the number of double faults is not available for all matches).

In [None]:
df['dblfaults'].?

In [None]:
df['dblfaults'].?
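
How pandas treats `NaN` in these computations can be checked on a small made-up example (the `faults` and `points` Series are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical counts; NaN marks matches with no recorded statistics
faults = pd.Series([2, np.nan, 4, 0])
points = pd.Series([100, 120, np.nan, 80])

ratio = faults / points      # NaN in either operand gives NaN
print(ratio.isna().sum())    # 2
print(ratio.count())         # 2 non-missing values
print(ratio.mean())          # mean of the non-missing values only
```

Summary methods such as `mean`, `count` and `describe` skip missing values by default, which is why the statistics are still usable when some matches lack data.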

In [None]:
df.groupby(?)['win'].mean()

<a id='Pickle'></a>
# VI. Pickle

In [None]:
import pickle

# make an example object to pickle
some_obj = {'x':[4,2,1.5,1], 'y':[32,[101],17], 'foo':True, 'spam':False}

To save a pickle, use **pickle.dump**.

In [None]:
with open('mypickle.pickle', 'wb') as f:
    pickle.?

In [None]:
del some_obj
# Delete from memory

Loading the pickled file from your hard drive is as simple as opening it in binary mode and calling **pickle.load**:

In [None]:
with open('mypickle.pickle', 'rb') as f:  # pickle files must be read in binary mode
    loaded_obj = pickle.?

print('loaded_obj is', loaded_obj)
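
As a self-contained sketch of the full round trip, the example below writes to a temporary directory so that it does not touch the notebook's files. The file name `demo.pickle` and the shortened `some_obj` are assumptions made for illustration.

```python
import os
import pickle
import tempfile

some_obj = {'x': [4, 2, 1.5, 1], 'foo': True}

# A temporary path, so the sketch does not touch the notebook's files
path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')

# Pickle files are binary: write with 'wb' ...
with open(path, 'wb') as f:
    pickle.dump(some_obj, f)

# ... and read back with 'rb'
with open(path, 'rb') as f:
    loaded_obj = pickle.load(f)

print(loaded_obj == some_obj)   # True: same contents
print(loaded_obj is some_obj)   # False: a new object was built
```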

## Pickling pandas DataFrames

In [None]:
# Both rows need 11 values to match the 11 column names
df = pd.DataFrame([list(range(11)), list(range(100, 111))], columns=list('abcdefghijk'))

df

In [None]:
df.?('my_df.pickle')

In [None]:
df2 = pd.?('my_df.pickle')

df2

<a id='Matplotlib'></a>
# VII. Matplotlib - Data Visualization

## Histograms

The Gaussian PDF is given by:

$$ X \sim \mathcal{N}(\mu,\sigma^2), \qquad f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2} $$
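
As a quick numerical check of this formula, the sketch below evaluates the density on a grid and confirms that it integrates to roughly 1. The helper name `gaussian_pdf` and the grid limits are assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu, sigma = 0.0, 20.0              # same mean and std as the histogram cell that follows
x = np.linspace(-200, 200, 4001)   # +/- 10 sigma, step 0.1
pdf = gaussian_pdf(x, mu, sigma)

# The density peaks at the mean with height 1 / sqrt(2*pi*sigma^2)
print(pdf.max())

# A simple Riemann sum over the grid should be very close to 1
dx = x[1] - x[0]
area = np.sum(pdf) * dx
print(area)
```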

In [None]:
import numpy as np
from matplotlib import pyplot as plt

data = np.random.normal(0, 20, 1000)    # mean, std, numSamples 

# fixed bin size
bins = np.arange(-100, 100, 5)
print(bins)

plt.xlim([min(data)-5, max(data)+5])
plt.?(data, bins=bins, alpha=1)
plt.?('Random Gaussian data')
plt.?('variable X (bin size = 5)')
plt.?('count')

plt.?()

### Histogram of 2 overlapping data sets

In [None]:
import random
%matplotlib inline 
# Now, no need to do plt.show() 

data1 = [random.gauss(?) for i in range(500)] # 500 samples from Gaussian distribution, mu, sigma  
data2 = [random.gauss(?) for i in range(500)]   # 500 samples from Gaussian distribution, mu, sigma
bins = np.arange(-60, 60, 2.5)
plt.xlim([min(data1+data2)-5, max(data1+data2)+5])

plt.?(data1, bins=bins, alpha=0.5, label='class 1')    
plt.?(data2, bins=bins, alpha=0.5, label='class 2')
plt.title('Random Gaussian data')
plt.xlabel('variable X')
plt.ylabel('count')
plt.legend(loc='upper right')

## Lineplots

In [None]:
x = [1, 2, 3]

y_1 = ?
y_2 = ?

plt.?(x, y_1, marker=?) 
plt.?(x, y_2, marker=?)

plt.xlim([0, len(x)+1])
plt.ylim([0, max(y_1+y_2) + 10])
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Simple line plot')
plt.legend(['sample 1', 'sample2'], loc='upper left')

## Scatter Plots

In [None]:
# Generating a Gaussian dataset:
# creating random vectors from the multivariate normal distribution given mean and covariance 
mu_vec1 = np.array(?)        
cov_mat1 = np.array(?)       

x1_samples = np.random.multivariate_normal(mu_vec1, cov_mat1, 100)
x2_samples = np.random.multivariate_normal(mu_vec1+5, cov_mat1+0.8, 100)

# print(x1_samples.shape)
# (100, 2), 100 rows, 2 columns

plt.figure(figsize=(8,6))
    
plt.scatter(x1_samples[:,0], x1_samples[:,1], marker='x', 
            color='blue', alpha=0.7, label='x1 samples')
plt.scatter(x2_samples[:,0], x2_samples[:,1], marker='o', 
            color='green', alpha=0.7, label='x2 samples')
plt.title('Basic scatter plot')
plt.xlabel('variable X')
plt.ylabel('variable Y')
plt.legend(loc='upper right')

## 3D Scatter plot

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# Generate some 3D sample data
mu_vec1 = np.array(?) # mean vector                   
cov_mat1 = np.array(?) # covariance matrix            

class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20)
class2_sample = np.random.multivariate_normal(mu_vec1 + 1, cov_mat1, 20)
class3_sample = np.random.multivariate_normal(mu_vec1 + 3, cov_mat1, 20)

# class1_sample.shape -> (20, 3), 20 rows, 3 columns

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
   
ax.scatter(class1_sample[:,0], class1_sample[:,1], class1_sample[:,2], 
           marker='x', color='blue', s=40, label='class 1')
ax.scatter(class2_sample[:,0], class2_sample[:,1], class2_sample[:,2], 
           marker='o', color='green', s=40, label='class 2')
ax.scatter(class3_sample[:,0], class3_sample[:,1], class3_sample[:,2], 
           marker='^', color='red', s=40, label='class 3')

ax.set_xlabel('variable X')
ax.set_ylabel('variable Y')
ax.set_zlabel('variable Z')

plt.title('3D Scatter Plot')
plt.show()

## Boxplot

In [None]:
all_data = [np.random.normal(0, std, 100) for ?]   
# print(all_data)

fig = plt.figure(figsize=(8,6))

plt.boxplot(all_data, 
            notch=False, # box instead of notch shape 
            sym='rs',    # red squares for outliers
            vert=True)   # vertical box alignment

plt.xticks([y+1 for y in range(len(all_data))], ['x1', 'x2', 'x3'])
plt.xlabel('measurement x')
t = plt.title('Box plot')

## Subplots

In [None]:
x = ?
y = ?

fig, ax = plt.subplots(nrows=2,ncols=2)

for row in ax:
    for col in row:
        col.?

## Plot with XKCD

In [None]:
x = np.?(0, 2*np.pi, 100)          # ?
y = np.?
with plt.xkcd():
    plt.?
    plt.axis([0, 2*np.pi, -1.05, 1.05,])

<a id='ExploringDatasets'></a>
# VIII. Exploring Datasets

## Iris Dataset

In [None]:
import pandas as pd
iris = pd.read_csv("Data/iris.csv") 
iris.?

In [None]:
# Let's see how many examples we have of each species
iris[?].value_counts()                               

### Joint distribution plot - Seaborn 

In [None]:
import seaborn as sns
plt.figure()
sns.?(x='petal_length', y='petal_width', data=iris, kind='kde')

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8,4))

sns.?(x='species', y='petal_length', data=?, ax=axes[0])      
sns.?(x='species', y='petal_length', data=?, ax=axes[1])   

<a id='WordClouds'></a>
# IX. Generating WordClouds

In [None]:
from wordcloud import WordCloud

# Read the whole text.
text = open('Data/WordCloudData.txt').read()

print(text)

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
import matplotlib.pyplot as plt
plt.imshow(wordcloud.recolor(random_state=2017))
plt.title('Most Frequent Words')
plt.axis("off")
plt.show()

## [Go to Table of Contents](#TOC)