In [1]:
import numpy as np
import pandas as pd

# Series Object

A Series is the standard one-dimensional array of indexed data in Pandas.

In [2]:
data=pd.Series([0.5,0.3,0.6,0.1,-2])
data

0    0.5
1    0.3
2    0.6
3    0.1
4   -2.0
dtype: float64

In [3]:
print(data.values)
print(data.index)

[ 0.5  0.3  0.6  0.1 -2. ]
RangeIndex(start=0, stop=5, step=1)


In [4]:
print(data[3])
print(data[2:4])

0.1
2    0.6
3    0.1
dtype: float64


## As a NumPy array

In Pandas, the index of a Series can be **explicitly defined**.

This is unlike NumPy arrays and built-in Python lists, whose indices are always the implicit whole-number positions.

In [5]:
data=pd.Series([0.23,3,0.5,0.75,1.0],index=['c',1,'b','pr',-4])
data

c     0.23
1     3.00
b     0.50
pr    0.75
-4    1.00
dtype: float64

In [6]:
data['c']

0.23

In [7]:
data=pd.Series([1,2,3,4,5,6],index=[2,5,3,6,9,4])
data

2    1
5    2
3    3
6    4
9    5
4    6
dtype: int64
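
When the index holds integers that are not simply 0, 1, 2, ..., a bare `data[...]` is ambiguous between label and position. A minimal sketch, reusing the Series from the cell above, of how the `loc` (label-based) and `iloc` (position-based) accessors make the intent explicit:

```python
import pandas as pd

# Same Series as above: the labels are integers, but not in positional order
data = pd.Series([1, 2, 3, 4, 5, 6], index=[2, 5, 3, 6, 9, 4])

by_label = data.loc[5]      # value stored under the label 5
by_position = data.iloc[5]  # value at position 5 (the sixth element)

print(by_label)     # 2
print(by_position)  # 6
```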

## As a dictionary

A plain Python dictionary maps arbitrary keys to arbitrary values, whereas a Series maps **typed keys** to **typed values**.

(Consistency in data type simplifies the operations performed on a Pandas Series, although mixed types are also accepted.)

In [8]:
population_dict ={'California': 38332521,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Texas': 26448193,
                   'Illinois': 12882135}
population=pd.Series(population_dict)
population

California    38332521
New York      19651127
Florida       19552860
Texas         26448193
Illinois      12882135
dtype: int64

In [9]:
population['California']

38332521

A Series also supports slicing, unlike a dictionary:

In [10]:
population['New York':'Texas']

New York    19651127
Florida     19552860
Texas       26448193
dtype: int64
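
One detail worth noting in the slice above is that the endpoint `'Texas'` is included. A short sketch contrasting a label slice with a positional slice through `iloc`, which follows the usual Python half-open convention:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Texas': 26448193,
                        'Illinois': 12882135})

# Label-based slice: both endpoints are included
label_slice = population['New York':'Texas']

# Position-based slice: the stop position is excluded
position_slice = population.iloc[1:3]

print(list(label_slice.index))     # ['New York', 'Florida', 'Texas']
print(list(position_slice.index))  # ['New York', 'Florida']
```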

## Constructing Series objects

We've seen a few ways of constructing a Series object, most of which look like:

`pd.Series(data,index=index)`

`data` above can be either a list or a NumPy array, in which case `index` defaults to an integer sequence:

In [11]:
pd.Series([2, 4, 6])


0    2
1    4
2    6
dtype: int64

`data` can be a scalar, which is repeated to fill the specified index:

In [12]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

`data` can be a dictionary, in which case `index` defaults to the dictionary keys, in the order they appear:



In [13]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [14]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [15]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
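
As the last cell shows, passing `index` together with a dictionary keeps only the requested keys, in the requested order. A small sketch of the complementary case, assuming a label (`4` here, chosen only for illustration) that is not a key of the dictionary; Pandas fills that entry with `NaN`:

```python
import pandas as pd

# 4 is not a key of the dictionary, so there is no value to place under it
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s.index.tolist())   # [3, 2, 4]
print(s.isna().tolist())  # [False, False, True]
```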

# DataFrame Object

## As a NumPy array

Just as a Series is analogous to a one-dimensional array with flexible indices, a DataFrame is analogous to a two-dimensional array with flexible row and column names.

In [16]:
states=pd.DataFrame({'population':population,'area':area})
print(states)

            population    area
California    38332521  423967
Florida       19552860  170312
Illinois      12882135  149995
New York      19651127  141297
Texas         26448193  695662


Just like the Series object, a DataFrame exposes its row labels through the `index` attribute, and its column labels through `columns`:

In [17]:
print(states.index)

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')


In [18]:
print(states.columns)

Index(['population', 'area'], dtype='object')


## As a dictionary

Just as a dictionary maps a key to its value, a DataFrame maps a column name to a Series of column data. Consider the following example:

In [19]:
print(states['population'])

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
Name: population, dtype: int64


*Notice that if `data` is a Series with the default integer index, `data[0]` returns its first element, whereas if `data` is a DataFrame, `data[0]` looks for a column labelled `0` and raises an error when no such column exists. Because of this, it is probably better to think of DataFrames as generalized dictionaries rather than generalized arrays, though both views can be useful.*
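
The dictionary analogy can be sketched as follows, with the `states` DataFrame rebuilt from the values above so that the block runs on its own. Membership tests and `keys()` refer to column names, while rows are reached explicitly through `loc` or `iloc`:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'New York': 19651127,
                        'Florida': 19552860, 'Texas': 26448193,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})

# Dictionary-style behaviour applies to columns, not rows
print('area' in states)        # True
print('California' in states)  # False
print(list(states.keys()))     # ['population', 'area']

# Rows are selected explicitly, by label or by position
texas_area = states.loc['Texas', 'area']
first_row = states.iloc[0]
print(texas_area)      # 695662
print(first_row.name)  # California
```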

## Constructing a DataFrame object

### From a single Series

In [20]:
pd.DataFrame(population,columns=['population'])

            population
California    38332521
New York      19651127
Florida       19552860
Texas         26448193
Illinois      12882135


### From a list of dictionaries

In [21]:
data = [{'a': i, 'b': 2 * i} for i in range(1,4)]
data

[{'a': 1, 'b': 2}, {'a': 2, 'b': 4}, {'a': 3, 'b': 6}]

Each dictionary in the list becomes one row of the DataFrame, and its keys become the column names:

In [22]:
pd.DataFrame(data)

   a  b
0  1  2
1  2  4
2  3  6


Even if some keys are missing from some of the dictionaries, Pandas fills the gaps with `NaN` (i.e., "not a number") values:

In [23]:
pd.DataFrame([{'a':1,'b':2},{'b':1,'c':3}])

     a  b    c
0  1.0  2  NaN
1  NaN  1  3.0
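
A side effect of the `NaN` fill is visible in the output above: columns `a` and `c` show `1.0` and `3.0` rather than integers. A brief sketch of why, relying on the fact that `NaN` is a floating-point value, so a numeric column that contains it is stored as a float by default:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])

# Columns with a missing entry are upcast to float; 'b' is complete and stays integer
print(df.dtypes)

# Boolean mask marking the missing entries
missing = df.isna()
print(missing)
```

If integer columns with missing values are needed, Pandas also offers nullable dtypes, which are outside the scope of this section.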


### From a dictionary of Series objects

In [24]:
pd.DataFrame({'Population':population,'Area':area})

            Population    Area
California    38332521  423967
Florida       19552860  170312
Illinois      12882135  149995
New York      19651127  141297
Texas         26448193  695662


### From a two-dimensional NumPy array

In [25]:
thenumpyarray=np.array([[1,2,3],[4,5,6]])
pd.DataFrame(thenumpyarray,columns=['foo','ball','stick'],index=['a','b'])

   foo  ball  stick
a    1     2      3
b    4     5      6


# Index Object

An `Index` can be seen as an immutable (it cannot be changed) ordered multiset.
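
The "multiset" part means that, unlike a Python `set`, an `Index` may hold repeated labels. A short sketch, with labels made up purely for illustration:

```python
import pandas as pd

ind = pd.Index(['a', 'b', 'a'])

print(ind.is_unique)          # False: 'a' appears twice
print(ind.unique().tolist())  # ['a', 'b']

# A repeated label selects every matching entry
s = pd.Series([1, 2, 3], index=ind)
matches = s.loc['a'].tolist()
print(matches)                # [1, 3]
```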

### As an immutable array

In [26]:
ind=pd.Index([2,3.0,5,7,11])
ind

Index([2.0, 3.0, 5.0, 7.0, 11.0], dtype='float64')

In [27]:
print(ind[1])
print(ind[::2])
print(ind.size)
print(ind.shape)
print(ind.ndim)
print(ind.dtype)

3.0
Index([2.0, 5.0, 11.0], dtype='float64')
5
(5,)
1
float64


In [28]:
ind[2]=4 #shows immutability

TypeError: Index does not support mutable operations
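
The failed assignment can be reproduced safely by catching the exception. A minimal sketch, which also shows one way to obtain a modified copy, here by going through a plain Python list (one option among several):

```python
import pandas as pd

ind = pd.Index([2, 3.0, 5, 7, 11])

caught = False
try:
    ind[2] = 4  # item assignment is not supported on an Index
except TypeError as err:
    caught = True
    print('TypeError:', err)

# The original Index is unchanged; a new Index is built to hold the change
values = ind.tolist()
values[2] = 4.0
new_ind = pd.Index(values)

print(ind.tolist())      # [2.0, 3.0, 5.0, 7.0, 11.0]
print(new_ind.tolist())  # [2.0, 3.0, 4.0, 7.0, 11.0]
```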

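### As an ordered set

Because an `Index` is a `set`-like structure, it supports the familiar set operations: intersection, union and symmetric difference. In recent Pandas versions these are available as methods; the operators `&`, `|` and `^` act element-wise on an `Index` rather than as set operations.

In [None]:
indA=pd.Index([1,3,5,7,9])
indB=pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB) #intersection

Index([3, 5, 7], dtype='int64')

In [None]:
indA.union(indB) #union

Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [None]:
indA.symmetric_difference(indB) #symmetric difference

Index([1, 2, 9, 11], dtype='int64')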