# Tutorial Brief

Pandas is a powerful and easy-to-use library for data analysis. It has two main objects for representing data: Series and DataFrame.

Finding Help:

- http://pandas.pydata.org/pandas-docs/stable/10min.html
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html

<table>
<tr>
    <td><img src="http://www.scipy.org/_static/images/numpylogo_med.png"  style="width:50px;height:50px;" /></td>
    <td><h4>NumPy</h4> Base N-dimensional array package </td>
    <td><img src="http://www.scipy.org/_static/images/scipy_med.png" style="width:50px;height:50px;" /></td>
    <td><h4>SciPy</h4> Fundamental library for scientific computing </td>
    <td><img src="http://www.scipy.org/_static/images/matplotlib_med.png" style="width:50px;height:50px;" /></td>
    <td><h4>Matplotlib</h4> Comprehensive 2D Plotting </td>
</tr>
<tr>
    <td><img src="http://www.scipy.org/_static/images/ipython.png" style="width:50px;height:50px;" /></td>
    <td><h4>IPython</h4> Enhanced Interactive Console </td>
    <td><img src="http://www.scipy.org/_static/images/sympy_logo.png" style="width:50px;height:50px;" /></td>
    <td><h4>SymPy</h4> Symbolic mathematics </td>
    <td style="background:Lavender;"><img src="http://www.scipy.org/_static/images/pandas_badge2.jpg" style="width:50px;height:50px;" /></td>
    <td style="background:Lavender;"><h4>Pandas</h4> Data structures & analysis </td>
</tr>
</table>

# Import libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

# Working with Series

A Series is a one-dimensional, array-like object whose values are labeled by an index.

#### pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

In [2]:
x = pd.Series([1,2,3,4,5])
x

0    1
1    2
2    3
3    4
4    5
dtype: int64

Notice that pandas generated an index for your items.
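The generated index can be replaced with labels of your own. A minimal sketch, where the labels and the name `scores` are made up for illustration:

```python
import pandas as pd

# Hypothetical labels and name, chosen only for this illustration
s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="scores")

print(s.index.tolist())  # the labels supplied above
print(s["b"])            # look up a value by its label
```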

## Basic Operation

In [3]:
x + 100

0    101
1    102
2    103
3    104
4    105
dtype: int64

In [4]:
(x ** 2) + 100

0    101
1    104
2    109
3    116
4    125
dtype: int64

In [5]:
x > 2

0    False
1    False
2     True
3     True
4     True
dtype: bool
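A boolean Series like the one above can be used as a mask to keep only the matching values. A short sketch, with `mask` and `selected` as illustrative names:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

mask = x > 2          # boolean Series with the same index as x
selected = x[mask]    # keeps only the rows where mask is True

print(selected.tolist())
```

The selected values keep their original index labels.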

## `any()` and `all()`

In [6]:
larger_than_2 = x > 2
larger_than_2

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [7]:
larger_than_2.any()

True

In [8]:
larger_than_2.all()

False
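Besides `any()` and `all()`, a boolean Series can be summarized numerically, because `True` counts as 1 and `False` as 0. A small sketch (the variable names `count_true` and `share_true` are illustrative):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
larger_than_2 = x > 2

count_true = int(larger_than_2.sum())     # number of True values
share_true = float(larger_than_2.mean())  # fraction of True values

print(count_true, share_true)
```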

## `apply()`

In [9]:
def f(x):
    if x % 2 == 0: #if x is even
        return x * 2
    else:
        return x * 3

x.apply(f)

0     3
1     4
2     9
3     8
4    15
dtype: int64

**Avoid looping over your data**

These are `%%timeit` results for a `for` loop and for `apply()`.

In [10]:
%%timeit

ds = pd.Series(range(10000))

for counter in range(len(ds)):
    ds[counter] = f(ds[counter])

1 loops, best of 3: 283 ms per loop


In [11]:
%%timeit

ds = pd.Series(range(10000))

ds = ds.apply(f)

10 loops, best of 3: 18 ms per loop
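When the function can be written as whole-array operations, a vectorized expression avoids calling Python code once per element and is usually faster still. The sketch below uses `np.where` as one possible way to express the same rule as `f`. No timing is claimed here, so measure it on your own machine:

```python
import numpy as np
import pandas as pd

def f(x):
    # Same rule as the tutorial's f: double even numbers, triple odd ones
    if x % 2 == 0:
        return x * 2
    return x * 3

ds = pd.Series(range(10000))

# The condition and both branches are computed on the whole Series at once
vectorized = pd.Series(np.where(ds % 2 == 0, ds * 2, ds * 3), index=ds.index)

# Confirm that the result agrees with apply()
print((vectorized == ds.apply(f)).all())
```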


## `astype()`

In [12]:
x.astype(np.float64)

0    1
1    2
2    3
3    4
4    5
dtype: float64
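`astype()` returns a new Series and leaves the original untouched. A brief sketch showing two conversions (the names `as_float` and `as_text` are illustrative):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

as_float = x.astype("float64")  # new Series; x keeps its integer dtype
as_text = x.astype(str)         # each value becomes a string

print(x.dtype, as_float.dtype)
print(as_text.tolist())
```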

## `copy()`

In [13]:
y = x

In [14]:
y[0]

1

In [15]:
y[0] = 100

In [16]:
y

0    100
1      2
2      3
3      4
4      5
dtype: int64

In [17]:
x

0    100
1      2
2      3
3      4
4      5
dtype: int64

**`copy()` makes a new, independent Series**

In [18]:
y = x.copy()

In [19]:
x[0]=1

In [20]:
x

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [21]:
y

0    100
1      2
2      3
3      4
4      5
dtype: int64
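The difference between plain assignment and `copy()` can be checked directly with `is`. A compact sketch of the same experiment, using the illustrative names `alias` and `clone`:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

alias = x            # a second name for the same object
clone = x.copy()     # an independent Series with its own data

x[0] = 100

print(alias is x, alias[0])   # the alias sees the change
print(clone is x, clone[0])   # the clone keeps the old value
```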

# DataFrame

A DataFrame is a collection of named Series that share an index.

#### pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

In [22]:
data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(data, columns=["x"])

In [23]:
df

|   | x |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
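Because a DataFrame is made of named Series, it can also be built from a dict that maps column names to Series. A minimal sketch, with the column names `x` and `y` chosen only for illustration:

```python
import pandas as pd

# Illustrative columns; any Series that share an index would work
x = pd.Series([1, 2, 3], name="x")
y = pd.Series([10, 20, 30], name="y")

df_xy = pd.DataFrame({"x": x, "y": y})

print(df_xy.columns.tolist())
print(type(df_xy["y"]).__name__)  # each column comes back as a Series
```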


## Selecting Data

In [24]:
df["x"]

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: x, dtype: int64

In [25]:
df["x"][0]

1
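Single values and row ranges can also be selected with `.loc` (by label) and `.iloc` (by position). A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9], columns=["x"])

by_label = df.loc[0, "x"]      # row label 0, column "x"
by_position = df.iloc[0, 0]    # first row, first column
last_rows = df.iloc[-2:]       # last two rows by position

print(by_label, by_position)
print(last_rows["x"].tolist())
```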

## Adding extra columns

In [26]:
df["x_plus_2"] = df["x"] + 2
df

|   | x | x_plus_2 |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
| 2 | 3 | 5 |
| 3 | 4 | 6 |
| 4 | 5 | 7 |
| 5 | 6 | 8 |
| 6 | 7 | 9 |
| 7 | 8 | 10 |
| 8 | 9 | 11 |


In [27]:
import math  # np.math is deprecated since NumPy 1.25 and removed in NumPy 2.0

df["x_square"] = df["x"] ** 2
df["x_factorial"] = df["x"].apply(math.factorial)
df

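|   | x | x_plus_2 | x_square | x_factorial |
|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 1 |
| 1 | 2 | 4 | 4 | 2 |
| 2 | 3 | 5 | 9 | 6 |
| 3 | 4 | 6 | 16 | 24 |
| 4 | 5 | 7 | 25 | 120 |
| 5 | 6 | 8 | 36 | 720 |
| 6 | 7 | 9 | 49 | 5040 |
| 7 | 8 | 10 | 64 | 40320 |
| 8 | 9 | 11 | 81 | 362880 |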


In [28]:
df["is_odd"] = df["x"] % 2
df

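|   | x | x_plus_2 | x_square | x_factorial | is_odd |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 1 | 1 |
| 1 | 2 | 4 | 4 | 2 | 0 |
| 2 | 3 | 5 | 9 | 6 | 1 |
| 3 | 4 | 6 | 16 | 24 | 0 |
| 4 | 5 | 7 | 25 | 120 | 1 |
| 5 | 6 | 8 | 36 | 720 | 0 |
| 6 | 7 | 9 | 49 | 5040 | 1 |
| 7 | 8 | 10 | 64 | 40320 | 0 |
| 8 | 9 | 11 | 81 | 362880 | 1 |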


### `map()`

In [29]:
df["odd_even"] = df["is_odd"].map({1:"odd", 0:"even"})
df

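|   | x | x_plus_2 | x_square | x_factorial | is_odd | odd_even |
|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 1 | 1 | odd |
| 1 | 2 | 4 | 4 | 2 | 0 | even |
| 2 | 3 | 5 | 9 | 6 | 1 | odd |
| 3 | 4 | 6 | 16 | 24 | 0 | even |
| 4 | 5 | 7 | 25 | 120 | 1 | odd |
| 5 | 6 | 8 | 36 | 720 | 0 | even |
| 6 | 7 | 9 | 49 | 5040 | 1 | odd |
| 7 | 8 | 10 | 64 | 40320 | 0 | even |
| 8 | 9 | 11 | 81 | 362880 | 1 | odd |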
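One detail worth knowing about `map()` with a dict: values that have no key in the dict become `NaN` rather than raising an error. A small sketch, where the extra value `2` is added deliberately to show this:

```python
import pandas as pd

is_odd = pd.Series([1, 0, 1, 2])  # 2 is deliberately missing from the mapping

mapped = is_odd.map({1: "odd", 0: "even"})

print(mapped.tolist())
print(mapped.isna().tolist())  # True marks the value that had no match
```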


### `drop()`

In [30]:
df = df.drop("is_odd", axis=1)
df

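|   | x | x_plus_2 | x_square | x_factorial | odd_even |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 1 | odd |
| 1 | 2 | 4 | 4 | 2 | even |
| 2 | 3 | 5 | 9 | 6 | odd |
| 3 | 4 | 6 | 16 | 24 | even |
| 4 | 5 | 7 | 25 | 120 | odd |
| 5 | 6 | 8 | 36 | 720 | even |
| 6 | 7 | 9 | 49 | 5040 | odd |
| 7 | 8 | 10 | 64 | 40320 | even |
| 8 | 9 | 11 | 81 | 362880 | odd |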
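`drop()` returns a new DataFrame instead of changing the original, which is why the cell above assigns the result back to `df`. A sketch on a small stand-in frame, using the `columns=` keyword as an alternative spelling of `axis=1` (the name `trimmed` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "is_odd": [1, 0, 1]})

trimmed = df.drop(columns=["is_odd"])  # same effect as df.drop("is_odd", axis=1)

print(trimmed.columns.tolist())
print(df.columns.tolist())  # the original still has both columns
```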


## Multi Column Select

In [31]:
df[["x", "odd_even"]]

|   | x | odd_even |
|---|---|---|
| 0 | 1 | odd |
| 1 | 2 | even |
| 2 | 3 | odd |
| 3 | 4 | even |
| 4 | 5 | odd |
| 5 | 6 | even |
| 6 | 7 | odd |
| 7 | 8 | even |
| 8 | 9 | odd |


## Filtering

In [32]:
df[df["odd_even"] == "odd"]

|   | x | x_plus_2 | x_square | x_factorial | odd_even |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 1 | odd |
| 2 | 3 | 5 | 9 | 6 | odd |
| 4 | 5 | 7 | 25 | 120 | odd |
| 6 | 7 | 9 | 49 | 5040 | odd |
| 8 | 9 | 11 | 81 | 362880 | odd |


In [33]:
df[df.odd_even == "even"]

|   | x | x_plus_2 | x_square | x_factorial | odd_even |
|---|---|---|---|---|---|
| 1 | 2 | 4 | 4 | 2 | even |
| 3 | 4 | 6 | 16 | 24 | even |
| 5 | 6 | 8 | 36 | 720 | even |
| 7 | 8 | 10 | 64 | 40320 | even |


### Chaining Filters

#### `|` OR

In [34]:
df[(df.odd_even == "even") | (df.x_square < 20)]

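|   | x | x_plus_2 | x_square | x_factorial | odd_even |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 1 | odd |
| 1 | 2 | 4 | 4 | 2 | even |
| 2 | 3 | 5 | 9 | 6 | odd |
| 3 | 4 | 6 | 16 | 24 | even |
| 5 | 6 | 8 | 36 | 720 | even |
| 7 | 8 | 10 | 64 | 40320 | even |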


#### `&` AND

In [35]:
df[(df.odd_even == "even") & (df.x_square < 20)]

|   | x | x_plus_2 | x_square | x_factorial | odd_even |
|---|---|---|---|---|---|
| 1 | 2 | 4 | 4 | 2 | even |
| 3 | 4 | 6 | 16 | 24 | even |
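Two related tools may be useful alongside `|` and `&`: the `~` operator negates a mask, and `DataFrame.query()` lets the same condition be written as a string. The sketch below rebuilds a reduced version of the frame so it runs on its own. The names `not_even` and `same_as_and` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1, 10)})
df["x_square"] = df["x"] ** 2
df["odd_even"] = (df["x"] % 2).map({1: "odd", 0: "even"})

# ~ flips every True/False in the mask
not_even = df[~(df.odd_even == "even")]

# query() expresses the AND filter as a string
same_as_and = df.query('odd_even == "even" and x_square < 20')

print(not_even["x"].tolist())
print(same_as_and["x"].tolist())
```

The parentheses around each comparison in the mask form matter, because `&`, `|` and `~` bind more tightly than `==` and `<` in Python.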


### Further Chaining

In [36]:
df[(df.odd_even == "even") & (df.x_square < 20)]["x_plus_2"][:1]

1    4
Name: x_plus_2, dtype: int64
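The same selection can be written as a single `.loc` call that takes the row mask and the column name together, which avoids stacking several `[]` lookups. A sketch on a rebuilt copy of the frame, with `mask` and `first_match` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1, 10)})
df["x_plus_2"] = df["x"] + 2
df["x_square"] = df["x"] ** 2
df["odd_even"] = (df["x"] % 2).map({1: "odd", 0: "even"})

mask = (df.odd_even == "even") & (df.x_square < 20)

# Rows and column are selected in one step, then the first match is kept
first_match = df.loc[mask, "x_plus_2"].head(1)

print(first_match)
```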

# Reading Data from CSV/TSV Files

In [37]:
url = "http://www.google.com/finance/historical?q=TADAWUL:TASI&output=csv"
stocks_data = pd.read_csv(url)

In [38]:
stocks_data

|   | Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|---|
| 0 | 10-Mar-16 | 6370.37 | 6432.09 | 6350.05 | 6354.48 | 282609657 |
| 1 | 9-Mar-16 | 6430.87 | 6430.87 | 6343.92 | 6370.37 | 362921458 |
| 2 | 8-Mar-16 | 6387.44 | 6452.71 | 6387.78 | 6430.87 | 380619549 |
| 3 | 7-Mar-16 | 6396.36 | 6418.41 | 6362.87 | 6387.44 | 370270557 |
| 4 | 6-Mar-16 | 6216.31 | 6398.97 | 6217.92 | 6396.36 | 370095180 |
| 5 | 3-Mar-16 | 6170.16 | 6223.78 | 6164.50 | 6216.31 | 253334352 |
| 6 | 2-Mar-16 | 6180.66 | 6223.33 | 6146.44 | 6170.16 | 296864784 |
| 7 | 1-Mar-16 | 6092.50 | 6190.72 | 6093.31 | 6180.66 | 337886205 |
| 8 | 29-Feb-16 | 6092.01 | 6155.65 | 6084.41 | 6092.50 | 337886991 |
| 9 | 28-Feb-16 | 5975.94 | 6096.58 | 5975.94 | 6092.01 | 247770070 |
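`read_csv()` accepts any file-like object as well as a URL or path, so the same call can be tried without network access. The sketch below feeds it three rows copied from the output above through `io.StringIO` and then converts the `Date` column to timestamps. The variable names `csv_text` and `sample` are illustrative, and the date format string assumes English month abbreviations:

```python
import io
import pandas as pd

# Three rows copied from the output above, standing in for a downloaded file
csv_text = """Date,Open,High,Low,Close,Volume
10-Mar-16,6370.37,6432.09,6350.05,6354.48,282609657
9-Mar-16,6430.87,6430.87,6343.92,6370.37,362921458
8-Mar-16,6387.44,6452.71,6387.78,6430.87,380619549
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Turn the text dates into real timestamps (day-month-year)
sample["Date"] = pd.to_datetime(sample["Date"], format="%d-%b-%y")

print(sample.dtypes)
```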


In [39]:
stocks_data["change_amount"] = stocks_data["Close"] - stocks_data["Open"]
stocks_data["change_percentage"] = stocks_data["change_amount"] / stocks_data["Close"]
stocks_data

|   | Date | Open | High | Low | Close | Volume | change_amount | change_percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | 10-Mar-16 | 6370.37 | 6432.09 | 6350.05 | 6354.48 | 282609657 | -15.89 | -0.002501 |
| 1 | 9-Mar-16 | 6430.87 | 6430.87 | 6343.92 | 6370.37 | 362921458 | -60.50 | -0.009497 |
| 2 | 8-Mar-16 | 6387.44 | 6452.71 | 6387.78 | 6430.87 | 380619549 | 43.43 | 0.006753 |
| 3 | 7-Mar-16 | 6396.36 | 6418.41 | 6362.87 | 6387.44 | 370270557 | -8.92 | -0.001396 |
| 4 | 6-Mar-16 | 6216.31 | 6398.97 | 6217.92 | 6396.36 | 370095180 | 180.05 | 0.028149 |
| 5 | 3-Mar-16 | 6170.16 | 6223.78 | 6164.50 | 6216.31 | 253334352 | 46.15 | 0.007424 |
| 6 | 2-Mar-16 | 6180.66 | 6223.33 | 6146.44 | 6170.16 | 296864784 | -10.50 | -0.001702 |
| 7 | 1-Mar-16 | 6092.50 | 6190.72 | 6093.31 | 6180.66 | 337886205 | 88.16 | 0.014264 |
| 8 | 29-Feb-16 | 6092.01 | 6155.65 | 6084.41 | 6092.50 | 337886991 | 0.49 | 0.000080 |
| 9 | 28-Feb-16 | 5975.94 | 6096.58 | 5975.94 | 6092.01 | 247770070 | 116.07 | 0.019053 |
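Note that `change_percentage` as computed above is a fraction of the closing price, not a value out of 100. A sketch that repeats the calculation for the first row and adds a hypothetical `change_percent` column scaled to percent:

```python
import pandas as pd

# Open and Close of the first row shown above
row = pd.DataFrame({"Open": [6370.37], "Close": [6354.48]})

row["change_amount"] = row["Close"] - row["Open"]
row["change_percentage"] = row["change_amount"] / row["Close"]

# Hypothetical extra column: the same ratio expressed in percent
row["change_percent"] = (row["change_percentage"] * 100).round(2)

print(row)
```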
