# Creating a Series

There are several ways to create a `Series` from scratch. Let's explore them.

In [1]:
import numpy as np
import pandas as pd

## Creating from a Dictionary

Let's use the sample data we got from CashBox. They want to track the balance of their users. This is how much money each user currently has in their accounts. CashBox requires that users create a username. 

In our example, `test_balance_data` is just a standard Python dictionary; the key is the username and the value is that user's current account balance.

In [2]:
test_balance_data = {
    'Callum':0.83,
    'Brad':20.00,
    'Matt': 35.00,
    'Clem':55.00
}

The `Series` constructor accepts any dict-like object.

In [3]:
balances = pd.Series(test_balance_data)
balances

Callum     0.83
Brad      20.00
Matt      35.00
Clem      55.00
dtype: float64

Notice that the labels have been set from `test_balance_data.keys()` and the values from `test_balance_data.values()`.
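
That claim can be checked directly. The sketch below rebuilds the same dictionary and compares the index and values against its keys and values; the name `balances_check` and the two boolean flags are illustrative, not part of the CashBox data.

```python
import pandas as pd

test_balance_data = {'Callum': 0.83, 'Brad': 20.00, 'Matt': 35.00, 'Clem': 55.00}
balances_check = pd.Series(test_balance_data)

# The index holds the dictionary keys, in insertion order
labels_match = list(balances_check.index) == list(test_balance_data.keys())
# The values hold the dictionary values, in the same order
values_match = list(balances_check.values) == list(test_balance_data.values())

print(labels_match, values_match)
```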

## Creating From an Iterable

You can pass any iterable as the first argument.

_NOTE_: If labels are not provided, they default to incrementing integers starting at 0. You can also provide an index, which must be the same length as the iterable.

In [4]:
unlabeled_balances = pd.Series([0.83, 20.00, 35.00, 55.00])
labeled_balances = pd.Series([0.83, 20.00, 35.00, 55.00],
                            index = ['Callum', 'Brad', 'Matt', 'Clem'])
# The order is maintained
unlabeled_balances, labeled_balances

(0     0.83
 1    20.00
 2    35.00
 3    55.00
 dtype: float64, Callum     0.83
 Brad      20.00
 Matt      35.00
 Clem      55.00
 dtype: float64)
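
To illustrate the note about index length, here is a small sketch. It shows the default integer labels and what happens when the index is shorter than the data; the names `default_labels` and `mismatch_rejected` are made up for this example.

```python
import pandas as pd

values = [0.83, 20.00, 35.00, 55.00]

# Without an index, labels default to 0, 1, 2, ...
default_labels = list(pd.Series(values).index)

# An index of the wrong length is rejected with a ValueError
try:
    pd.Series(values, index=['Callum', 'Brad'])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(default_labels, mismatch_rejected)
```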

## Creating from NumPy

A `Series` can also be made from an ndarray. In fact, NumPy and Pandas play together very well.

In [5]:

ndbalances = np.array([0.083, 20.00, 35.00, 55.00])
pd.Series(ndbalances)

0     0.083
1    20.000
2    35.000
3    55.000
dtype: float64
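
As a rough illustration of how the two libraries cooperate, the sketch below applies a NumPy function to a `Series` and converts the `Series` back to an ndarray. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ndbalances = np.array([0.083, 20.00, 35.00, 55.00])
series_from_array = pd.Series(ndbalances)

# NumPy universal functions accept a Series and return a Series
roots = np.sqrt(series_from_array)

# to_numpy() hands the values back as an ndarray
back_to_array = series_from_array.to_numpy()

print(type(roots).__name__, type(back_to_array).__name__)
```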

## Creating from a Scalar and an Index

Another way is to start from a scalar (a single value), which is broadcast to every label in the index. The index must be stated explicitly using the `index` argument.

In [6]:
pd.Series(42, index = ['Callum', 'Brad', 'Matt', 'Clem'])

Callum    42
Brad      42
Matt      42
Clem      42
dtype: int64

# Accessing a Series

There are multiple ways to get to data stored in your `Series`. Let's explore the __`balances`__ `Series`.

The `balances` `Series` is indexed by username: each label is a username and each value is that user's cash balance. A `Series` is ordered and zero-indexed, so a value can be fetched by position (using `iloc`), by label, or, when the label is a valid Python identifier, by dot notation.

In [7]:
balances.iloc[0], balances['Callum'], balances.Callum

(0.83, 0.83, 0.83)

You can also use slices to access data. A positional slice excludes its end position, while a label slice includes its end label.

In [8]:
balances[0:3], balances['Matt':'Clem']

(Callum     0.83
 Brad      20.00
 Matt      35.00
 dtype: float64, Matt    35.0
 Clem    55.0
 dtype: float64)
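
The difference between positional and label slices is easier to see with the explicit `iloc` and `loc` accessors. This is a minimal sketch on a copy of the balance data; `balances_demo`, `by_position` and `by_label` are illustrative names.

```python
import pandas as pd

balances_demo = pd.Series({'Callum': 0.83, 'Brad': 20.00, 'Matt': 35.00, 'Clem': 55.00})

# Positional slice: the end position (3) is excluded
by_position = balances_demo.iloc[0:3]
# Label slice: the end label ('Matt') is included
by_label = balances_demo.loc['Callum':'Matt']

print(list(by_position.index), list(by_label.index))
```

Both slices return the same three users here, even though one stops before position 3 and the other stops at the label `'Matt'`.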

# Series Vectorisation and Broadcasting

Just like NumPy, Pandas offers powerful vectorised methods. It also leans on broadcasting.


In [9]:
test_deposit_data = {
    'Callum':100.00,
    'Brad':0.00,
    'Matt':30,
    'Clem':25
}
deposits = pd.Series(test_deposit_data)

## Vectorisation

While it is possible to loop through each item and apply it to another...

In [10]:
for label, value in deposits.items():
    balances[label] += value
balances

Callum    100.83
Brad       20.00
Matt       65.00
Clem       80.00
dtype: float64

... it's important to lean on vectorisation and skip the loops altogether. Vectorisation is faster, and easier to read and write.

In [11]:
# undo the previous step 
balances -= deposits

# this is the same as the above loop but utilising vectorisation
balances += deposits
balances

Callum    100.83
Brad       20.00
Matt       65.00
Clem       80.00
dtype: float64
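
To confirm that the loop and the vectorised form agree, the sketch below runs both on fresh copies of the data and compares the results. The names `start`, `deposits_demo`, `looped` and `vectorised` are illustrative.

```python
import pandas as pd

start = pd.Series({'Callum': 0.83, 'Brad': 20.00, 'Matt': 35.00, 'Clem': 55.00})
deposits_demo = pd.Series({'Callum': 100.00, 'Brad': 0.00, 'Matt': 30, 'Clem': 25})

# Loop version: update one label at a time on a copy
looped = start.copy()
for label, value in deposits_demo.items():
    looped[label] += value

# Vectorised version: one expression, aligned on labels
vectorised = start + deposits_demo

print(looped.equals(vectorised))
```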

## Broadcasting

### Broadcasting a scalar
Just like NumPy arrays, a `Series` overloads the mathematical operators with vectorised versions, so a scalar operand is broadcast to every element.


In [12]:
balances + 5

Callum    105.83
Brad       25.00
Matt       70.00
Clem       85.00
dtype: float64

### Broadcasting a series

Labels are used to line up the two series. When a label exists on one side but not the other, the result for that label is `np.nan`, so arithmetic between series can introduce NaN values by accident.

Let's say Callum and Matt each got a 10 quid voucher...

In [13]:
vouchers = pd.Series(10, ['Callum', 'Matt'])
balances + vouchers

Brad         NaN
Callum    110.83
Clem         NaN
Matt       75.00
dtype: float64

It is possible to use the `.add` method with `fill_value` to get around this.

In [14]:
balances.add(vouchers, fill_value = 0)

Brad       20.00
Callum    110.83
Clem       80.00
Matt       75.00
dtype: float64
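
One way to catch accidental NaN values is to count them after the operation. The sketch below compares the plain `+` operator with `.add(..., fill_value=0)`; the `_demo` names and the two counters are illustrative.

```python
import pandas as pd

balances_demo = pd.Series({'Callum': 100.83, 'Brad': 20.00, 'Matt': 65.00, 'Clem': 80.00})
vouchers_demo = pd.Series(10, index=['Callum', 'Matt'])

# Plain addition leaves NaN wherever a label is missing on one side
plain_sum = balances_demo + vouchers_demo
# fill_value=0 treats the missing side as 0 instead
filled_sum = balances_demo.add(vouchers_demo, fill_value=0)

missing_plain = int(plain_sum.isna().sum())
missing_filled = int(filled_sum.isna().sum())
print(missing_plain, missing_filled)
```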

## Creating a DataFrame

There are a few ways to create a DataFrame from existing objects. Let's explore them!

### From a 2-Dimensional Object

If the data is already arranged in rows and columns, such as a list of lists, you can pass it straight to the constructor. Row labels and column headings are generated automatically as integer ranges.

In [15]:
test_user_list = [
    ['Callum', 'Smyth', 1],
    ['Joe', 'Bloggs', 2]
]
pd.DataFrame(test_user_list)

        0       1  2
0  Callum   Smyth  1
1     Joe  Bloggs  2


If you don't want the index and column labels to be autogenerated, they can be stated explicitly as arguments.

In [16]:
pd.DataFrame(test_user_list, index = ['csmyth93', 'jbloggs42'],
            columns = ['Fname', 'Lname', 'Number'])

            Fname   Lname  Number
csmyth93   Callum   Smyth       1
jbloggs42     Joe  Bloggs       2


### From a Dictionary

DataFrames can also be generated from dictionaries. Each key becomes a column name and each list becomes that column's values.

In [17]:
test_user_data = {
    'Fname':['Callum', 'Joe'],
    'Lname':['Smyth', 'Bloggs'],
    'Number': [1,2]
}
pd.DataFrame(test_user_data)

    Fname   Lname  Number
0  Callum   Smyth       1
1     Joe  Bloggs       2


The index can also be specified, just as when creating a DataFrame from a list of lists.

In [18]:
test_user_data = {
    'Fname':['Callum', 'Joe'],
    'Lname':['Smyth', 'Bloggs'],
    'Number': [1,2]
}
pd.DataFrame(test_user_data, index = ["csmyth93", "jbloggs42"])

            Fname   Lname  Number
csmyth93   Callum   Smyth       1
jbloggs42     Joe  Bloggs       2


### DataFrame.from_dict adds more options

__The `orient` keyword__

The `orient` keyword specifies whether the keys of your dictionary become the row labels (`index`) or the column titles (`columns`). In the example below the outer keys become the row labels and the nested dictionaries define the columns. With `orient='index'` you can also pass a list of column names to the `columns` argument.

In [19]:
by_username = {
    'csmyth93' : {
        'Fname' : 'Callum',
        'Lname' : 'Smyth',
        'Number' : 1
    },
    'jbloggs42' : {
        'Fname' : 'Joe',
        'Lname' : 'Bloggs',
        'Number' : 42
    },
    'rsmythster':{
        'Fname' : 'Robert',
        'Lname' : 'Smyth',
        'Number' : 66
    }
}
pd.DataFrame.from_dict(by_username, orient = 'index')

             Fname   Lname  Number
csmyth93    Callum   Smyth       1
jbloggs42      Joe  Bloggs      42
rsmythster  Robert   Smyth      66
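
To make the effect of `orient` concrete, here is a sketch that builds a two-user dictionary both ways, plus a variant that passes `columns` alongside list values. The names `by_username_demo`, `as_rows`, `as_columns` and `from_lists` are illustrative.

```python
import pandas as pd

by_username_demo = {
    'csmyth93': {'Fname': 'Callum', 'Lname': 'Smyth', 'Number': 1},
    'jbloggs42': {'Fname': 'Joe', 'Lname': 'Bloggs', 'Number': 42},
}

# orient='index': outer keys become row labels
as_rows = pd.DataFrame.from_dict(by_username_demo, orient='index')
# orient='columns' (the default): outer keys become column names
as_columns = pd.DataFrame.from_dict(by_username_demo, orient='columns')

# With orient='index' and list values, column names can be supplied separately
from_lists = pd.DataFrame.from_dict(
    {'csmyth93': ['Callum', 'Smyth', 1], 'jbloggs42': ['Joe', 'Bloggs', 42]},
    orient='index',
    columns=['Fname', 'Lname', 'Number'],
)

print(as_rows.shape, as_columns.shape, from_lists.shape)
```

The two orientations hold the same data and are transposes of each other.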


## Accessing a DataFrame

There are many different choices for indexing DataFrames (many of them resemble R).

First let's get a DataFrame object...

In [20]:
users = pd.DataFrame.from_dict(by_username, orient = 'index')

### Retrieve a Specific Series

#### By Column Name

Each column in a `DataFrame` is a `Series`, and the `DataFrame` gives access to each one by column name. For instance, we can select the `Fname` `Series`. The result keeps the column name as the name of the `Series`.

In [21]:
Fnames = users['Fname']
Fnames

csmyth93      Callum
jbloggs42        Joe
rsmythster    Robert
Name: Fname, dtype: object

#### By Label

You can retrieve a row by using the `loc` property and supplying the label. Note how the returned `Series` is labelled by the existing column labels.

In [22]:
users.loc['csmyth93']

Fname     Callum
Lname      Smyth
Number         1
Name: csmyth93, dtype: object

#### By Location

Ordinary list-like positions are also available: use the `iloc` (`i`nteger `loc`ation) property with the appropriate position to get a specific row.

In [23]:
users.iloc[0]

Fname     Callum
Lname      Smyth
Number         1
Name: csmyth93, dtype: object

#### By Row and Column

A `DataFrame` allows access to a specific value through a coordinate-like system. We can index `DataFrame.loc` with a row label and a column label, which together form a tuple. To be more explicit that a single value is wanted, we can use the `DataFrame.at` accessor instead.

In [24]:
users.loc['csmyth93', 'Fname'], users.at['csmyth93', 'Fname']

('Callum', 'Callum')
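
As a further sketch of coordinate-style access, the example below reads one cell with both accessors and then assigns to a single cell with `at`. The frame `users_demo` is a small stand-in built only for this illustration.

```python
import pandas as pd

users_demo = pd.DataFrame(
    {'Fname': ['Callum', 'Joe'], 'Lname': ['Smyth', 'Bloggs'], 'Number': [1, 42]},
    index=['csmyth93', 'jbloggs42'],
)

# loc with a row label and a column label returns a single value
via_loc = users_demo.loc['jbloggs42', 'Lname']
# at always addresses exactly one cell
via_at = users_demo.at['jbloggs42', 'Lname']

# at can also assign to a single cell
users_demo.at['jbloggs42', 'Number'] = 43

print(via_loc, via_at, users_demo.at['jbloggs42', 'Number'])
```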

## Retrieve a Specific DataFrame Through Slicing

Using both `loc` and `iloc` we can slice the existing `DataFrame` into a new one. 

In the first example we use `:` on the rows axis to select all rows, and we specify which columns we want back using a list on the columns axis (similar to NumPy fancy indexing).

When using a slice with `loc` we have to remember the results are inclusive, meaning they include the right side. 

When using a slice with `iloc` we have to remember the results are exclusive, like standard Python lists.

In [31]:
users.loc[:, ['Fname', 'Lname']]

             Fname   Lname
csmyth93    Callum   Smyth
jbloggs42      Joe  Bloggs
rsmythster  Robert   Smyth


In [30]:
users.loc['csmyth93':'jbloggs42', :]

            Fname   Lname  Number
csmyth93   Callum   Smyth       1
jbloggs42     Joe  Bloggs      42


In [35]:
users.iloc[0:2, :]

            Fname   Lname  Number
csmyth93   Callum   Smyth       1
jbloggs42     Joe  Bloggs      42
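
The inclusive and exclusive behaviour can be compared side by side by counting rows. This is a minimal sketch on a stand-in frame, `users_demo`, built only for this illustration.

```python
import pandas as pd

users_demo = pd.DataFrame(
    {'Fname': ['Callum', 'Joe', 'Robert'],
     'Lname': ['Smyth', 'Bloggs', 'Smyth'],
     'Number': [1, 42, 66]},
    index=['csmyth93', 'jbloggs42', 'rsmythster'],
)

# loc: the end label 'jbloggs42' is included, so two rows come back
label_slice = users_demo.loc['csmyth93':'jbloggs42', :]
# iloc: the end position 1 is excluded, so only one row comes back
position_slice = users_demo.iloc[0:1, :]
# iloc needs an end position of 2 to return the same two rows as loc
position_slice_two = users_demo.iloc[0:2, :]

print(len(label_slice), len(position_slice), len(position_slice_two))
```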
