In [5]:
import pandas as pd
import numpy as np

# Creating data
There are two core objects in pandas: the DataFrame and the Series.

## DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

In [2]:
fruits = pd.DataFrame({'Apples':[30],'Bananas':[21]})

In [3]:
fruits

Unnamed: 0,Apples,Bananas
0,30,21


The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

In [6]:
# 5x4 table of the integers 0..19 with explicit row and column labels
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=["Column1", "Column2", "Column3", "Coumn4"],
)

In [7]:
df

Unnamed: 0,Column1,Column2,Column3,Coumn4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19
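
As a small sketch of how the labels are stored, the row labels and column labels can be read back from the `index` and `columns` attributes. The frame is rebuilt here so the cell runs on its own; the names `row_labels` and `column_labels` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x4 frame from the cell above so this example is self-contained.
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Column1', 'Column2', 'Column3', 'Coumn4'],
)

row_labels = list(df.index)       # the Index holds the row labels
column_labels = list(df.columns)  # the column labels are stored in an Index too

print(row_labels)
print(column_labels)
print(df.shape)  # (number of rows, number of columns)
```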


## Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list. Selecting a single row or a single column of a DataFrame gives a Series; selecting more than one gives a DataFrame.

In [8]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

In [9]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

In [41]:
df.iloc[0]  # first row (Row1)


Column1    0
Column2    1
Column3    2
Coumn4     3
Name: Row1, dtype: int32

In [40]:
type(df.iloc[0]) #series

pandas.core.series.Series

In [43]:
df.iloc[[0,1]] #more than one row


Unnamed: 0,Column1,Column2,Column3,Coumn4
Row1,0,1,2,3
Row2,4,5,6,7


In [42]:
type(df.iloc[[0,1]]) #dataframe

pandas.core.frame.DataFrame

# DataFrame to array

In [44]:
df.iloc[:,1:].values


array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19]])
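
The pandas documentation recommends `DataFrame.to_numpy()` over the `.values` attribute for this conversion. The sketch below, which rebuilds the same frame so it runs on its own, suggests that both give the same array here; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x4 frame used throughout this notebook.
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Column1', 'Column2', 'Column3', 'Coumn4'],
)

arr_values = df.iloc[:, 1:].values     # attribute used in the cell above
arr_numpy = df.iloc[:, 1:].to_numpy()  # method form of the same conversion

print(type(arr_numpy))
print(arr_numpy.shape)                       # all rows, every column except the first
print(np.array_equal(arr_values, arr_numpy)) # both approaches agree on this frame
```

Note that the row and column labels are dropped in the conversion; only the values remain.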

# Indexing in pandas
The indexing operator (`df['Column1']`) and attribute selection (`df.Column1`) are convenient because they work just like they do in the rest of the Python ecosystem, which makes them easy for a novice to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`.
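
For reference, here is a minimal sketch of the two plain-Python styles of access on the frame used in this notebook. Attribute selection only works when the column name is a valid Python identifier; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x4 frame used throughout this notebook.
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Column1', 'Column2', 'Column3', 'Coumn4'],
)

col_by_key = df['Column1']  # indexing operator, like a dictionary lookup
col_by_attr = df.Column1    # attribute selection, like an object attribute

# Chaining a second lookup on the resulting Series picks one value by row label.
first_value = df['Column1']['Row1']

print(col_by_key.equals(col_by_attr))
print(first_value)
```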

## Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [11]:
df

Unnamed: 0,Column1,Column2,Column3,Coumn4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [10]:
df.iloc[0]  # accessing the first row

Column1    0
Column2    1
Column3    2
Coumn4     3
Name: Row1, dtype: int32

Both `loc` and `iloc` are row-first, column-second. This is the opposite of chained indexing with the plain indexing operator, such as `df['Column1']['Row1']`, which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get the first column with `iloc`, we can do the following (the slice `:1` keeps the result as a one-column DataFrame):

In [13]:
df.iloc[:,:1]

Unnamed: 0,Column1
Row1,0
Row2,4
Row3,8
Row4,12
Row5,16


On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.

In [16]:
df.iloc[:3,0] # Column1 for the first, second, and third rows only

Row1    0
Row2    4
Row3    8
Name: Column1, dtype: int32

It's also possible to pass a list:

In [17]:
df.iloc[[1,3,4],0] # Column1 for the second, fourth, and fifth rows

Row2     4
Row4    12
Row5    16
Name: Column1, dtype: int32

Finally, it's worth knowing that negative numbers can be used in selection. They count backwards from the end of the values. So, for example, here are the last four rows of the dataset:

In [18]:
df.iloc[-4:]

Unnamed: 0,Column1,Column2,Column3,Coumn4
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


## Label-based selection
The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the second entry in Column3, we would now do the following:

In [27]:
df.loc['Row2','Column3']

6
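
Like `iloc`, `loc` also accepts lists and slices, only with labels in place of positions. The following is a small sketch on the same frame, rebuilt so the cell runs on its own; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x4 frame used throughout this notebook.
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Column1', 'Column2', 'Column3', 'Coumn4'],
)

one_row = df.loc['Row2']                                    # a single row label gives a Series
subset = df.loc[['Row1', 'Row3'], ['Column1', 'Column3']]   # lists of labels give a DataFrame
block = df.loc['Row2':'Row4', 'Column2']                    # a label slice includes 'Row4'

print(one_row)
print(subset)
print(block)
```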

## Choosing between loc and iloc
When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than it is to index something like `df.loc['Apples':'Potatoet']` (t coming after s in the alphabet).
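
To make the string case concrete, here is a minimal sketch with a made-up Series; the labels and prices are hypothetical and exist only for illustration. The index is sorted alphabetically, and the slice keeps both endpoints.

```python
import pandas as pd

# Hypothetical data: an alphabetically sorted index of produce names.
prices = pd.Series(
    [1.2, 0.5, 3.0, 0.8, 2.1],
    index=['Apples', 'Bananas', 'Cherries', 'Potatoes', 'Radishes'],
    name='price',
)

# Label slices with loc include the end label, so 'Potatoes' is part of the result.
selected = prices.loc['Apples':'Potatoes']

print(list(selected.index))
```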

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.

Otherwise, the semantics of using loc are the same as those for iloc.
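
The off-by-one difference is easy to check directly. This sketch assumes a throwaway frame named `numbers` (not part of the notebook's data) with the default integer index 0,...,1000.

```python
import pandas as pd

# Hypothetical frame with 1001 rows and the default index 0..1000.
numbers = pd.DataFrame({'value': range(1001)})

by_position = numbers.iloc[0:1000]     # end position excluded
by_label = numbers.loc[0:1000]         # end label included
by_label_trimmed = numbers.loc[0:999]  # one lower, to match the iloc result

print(len(by_position), len(by_label), len(by_label_trimmed))
```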

# Manipulating the index
Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The `set_index()` method can be used to do the job. Here is what happens when we `set_index` to the `Column1` field:

In [32]:
df.set_index("Column1")

Unnamed: 0_level_0,Column2,Column3,Coumn4
Column1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
16,17,18,19


This is useful if you can come up with an index for the dataset which is better than the current one.
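
One detail worth noting: by default `set_index()` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept. The sketch below rebuilds the frame so it runs on its own and uses `reset_index()` to move the index back into a regular column; the names `indexed` and `restored` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x4 frame used throughout this notebook.
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Column1', 'Column2', 'Column3', 'Coumn4'],
)

indexed = df.set_index('Column1')  # new frame; df itself still has 'Column1' as a column

print(list(df.columns))
print(list(indexed.columns))
print(indexed.index.name)

# Label-based selection now uses the values of the old Column1 as row labels.
value = indexed.loc[8, 'Column2']
print(value)

restored = indexed.reset_index()  # 'Column1' becomes an ordinary column again
print(list(restored.columns))
```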