## INTRODUCTION TO PANDAS

Pandas is a library built on top of NumPy for data processing and analysis in Python. It provides data structures and methods for representing and manipulating data. The fundamental data structures in Pandas are the Series and the DataFrame, supported by a rich variety of functions and methods that add functionality to them. In this introduction we will explore the Series object under the following subtopics:
+ Series Object
+ Creating a Series
+ Attributes (commonly used)
+ Indexing and Selecting Data in Series
+ Boolean Selection 
+ Index Alignment
+ Binary Operation and Arithmetic on Series
+ Methods (commonly used)
+ Conversion Operation on Series

### Series Object
A Series is a one-dimensional data structure in pandas containing an array of index-labeled data. This means that the data has index labels that can be used to select items. These labels are what distinguish a Series from a NumPy array. Before we proceed to create a Series, it's important to note that the pandas library is imported as pd by convention. The fundamental way of creating a Series object is by passing a sequence such as a list or an array to the Series constructor.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Creating a series from a random NumPy array
arr = np.random.randn(5)
ser = pd.Series(arr)

In [3]:
# One-dimensional NumPy array without index labels
arr

array([-0.16757655,  0.18732864,  0.19434023,  0.70940664,  0.81573995])

In [4]:
# Series with index label
ser

0   -0.167577
1    0.187329
2    0.194340
3    0.709407
4    0.815740
dtype: float64

In [5]:
# Using the index label to access the data.
ser[0]

-0.167576554672185

### Creating a Series
To create a Series, use the Series constructor. Below is its typical signature and parameters:

**pd.Series(data=None, index=None, dtype=None, name=None)**  

+ *data = An array, list, dictionary, scalar or other iterable*
+ *index = The labels to assign to the series data*
+ *dtype = The data type of the series values (inferred from the data if not given)*
+ *name = The name to assign to the series object*

When creating a Series from a dictionary, the dictionary **keys** become the index labels, while the **values** make up the array of data.
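
The cells in this section pass an array, a list and a dictionary as data. As a small sketch of the remaining cases, the snippet below passes a scalar as data and sets dtype and name. The variable names `filled` and `prices` are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# A scalar is repeated for every label in the given index.
filled = pd.Series(data=5, index=["a", "b", "c"])
print(filled.tolist())

# dtype forces the element type; name labels the series itself.
prices = pd.Series(data=[1, 2, 3], dtype="float64", name="prices")
print(prices.dtype, prices.name)
```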

In [6]:
# Creating an Array
arr  = np.arange(150,500,50)

In [7]:
# Creating a list
label = list("tuvwxyz")

In [8]:
# Creating a dictionary
dct = {"a":150,"b":200,"c":250,"d":300,"e":350,"f":400,"g":450}

In [9]:
lst = [10,20,30,40,50,60,70]

In [10]:
# Series from NumPy Array
pd.Series(data=arr,index=lst)

10    150
20    200
30    250
40    300
50    350
60    400
70    450
dtype: int32

In [11]:
# Series from List
pd.Series(data=lst, index=label)

t    10
u    20
v    30
w    40
x    50
y    60
z    70
dtype: int64

In [12]:
# Series from Dictionary
ser = pd.Series(data=dct, index=list("abcdefg"))
ser

a    150
b    200
c    250
d    300
e    350
f    400
g    450
dtype: int64

In [13]:
pd.Series(data=dct,index=label)

t   NaN
u   NaN
v   NaN
w   NaN
x   NaN
y   NaN
z   NaN
dtype: float64

**NOTE** When the keys of the dictionary match the given index values, the labels and their values remain intact. However, when the keys and the index don't match, you get a series of NaN. This is because the Series is first built with the keys from the dictionary as its index, after which it is reindexed with the given index values; labels that have no matching key get NaN, as in the last example.
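
A minimal sketch of the in-between case, assuming a shortened dictionary and an index in which only some labels match its keys. The names `small_dct` and `partial` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical shortened dictionary, for illustration only.
small_dct = {"a": 150, "b": 200, "c": 250}

# "a" and "c" match dictionary keys; "x" does not, so it becomes NaN.
# The key "b" is not in the requested index, so it is dropped.
partial = pd.Series(data=small_dct, index=["a", "c", "x"])
print(partial)
```

Because NaN is a floating-point value, the matching integers are converted to floats in the result.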

### Attributes (commonly used)

In [14]:
# Like an array, a series exposes its underlying values
ser.values

array([150, 200, 250, 300, 350, 400, 450], dtype=int64)

In [15]:
# Accessing the index labels 
ser.keys()

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

In [16]:
# Formal way of accessing the index labels 
ser.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

In [17]:
# Accessing the items in a series object
list(ser.items())

[('a', 150),
 ('b', 200),
 ('c', 250),
 ('d', 300),
 ('e', 350),
 ('f', 400),
 ('g', 450)]

In [18]:
# Array-like attributes of a series - size
ser.size

7

In [19]:
# Array-like attributes of a series - shape
ser.shape

(7,)

In [20]:
# Array-like attributes of a series - dimensions
ser.ndim

1

In [21]:
# Array-like attributes of a series - data type
ser.dtype

dtype('int64')

### Indexing and Selecting Data in Series
We saw situations where we used dictionary-like syntax (such as keys, values and items) to access values in the Series object. This dictionary-like behaviour extends to element selection. First, it's important to note that we can modify a series (add an element to it) by assigning a value to a key that does not yet exist.

In [22]:
ser

a    150
b    200
c    250
d    300
e    350
f    400
g    450
dtype: int64

In [23]:
# Assigning to new keys as if they already existed
ser["y"], ser["z"] = np.nan, 10

ser

a    150.0
b    200.0
c    250.0
d    300.0
e    350.0
f    400.0
g    450.0
y      NaN
z     10.0
dtype: float64

In [24]:
# Selecting individual elements by keys
ser["a"]

150.0

In [25]:
# Slicing elements by key (the end element is included)
ser["b":"e"]

b    200.0
c    250.0
d    300.0
e    350.0
dtype: float64

In [26]:
# Selecting an individual element by index position
# (recent pandas versions deprecate this on a labeled index; ser.iloc[0] is the explicit form)
ser[0]

150.0

In [27]:
# Slicing elements by index position (end element excluded)
ser[1:5]

b    200.0
c    250.0
d    300.0
e    350.0
dtype: float64

In [28]:
# Using fancy indexing to select elements 
ser[["a","d","f","g"]]

a    150.0
d    300.0
f    400.0
g    450.0
dtype: float64

**NOTE** 
From the above examples we saw that the index could be explicit (alphabetic labels) or implicit (integer index position). Pandas provides two indexers for this purpose:
+ **Series.loc[]**  *This is used for the explicit case where we have well-defined index labels*
+ **Series.iloc[]**  *This is used for the implicit case where we use the integer index location*
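
The distinction matters most when the labels are themselves integers. The sketch below assumes a hypothetical series `s` whose integer labels do not coincide with the positions, so that the two indexers return different elements.

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the positions.
s = pd.Series([10, 20, 30], index=[1, 2, 3])

by_label = s.loc[1]       # label 1 is the first element
by_position = s.iloc[1]   # position 1 is the second element
print(by_label, by_position)

# Label slices include the end label; position slices exclude the end position.
print(s.loc[1:2].tolist())
print(s.iloc[1:2].tolist())
```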


In [29]:
type(ser)

pandas.core.series.Series

In [44]:
# Using explicit index labels 
ser.loc["b":"e"]

b    200.0
c    250.0
d    300.0
e    350.0
dtype: float64

In [45]:
# Using implicit index positions
ser.iloc[1:5]

b    200.0
c    250.0
d    300.0
e    350.0
dtype: float64

### Boolean Selection
Similar to boolean selection in NumPy, we can use booleans to filter or select data based on specified conditions. This is done by using a comparison operator to compare a series to a value, which yields a boolean series that acts as a mask. Multiple conditions can be combined with the & (and) and | (or) operators. Let's explore boolean selection:

In [46]:
# The output below shows the series after the reassignment in In [53] was applied
ser

a    5000.0
b     200.0
c     250.0
d     300.0
e     350.0
f     400.0
g     450.0
y       NaN
z    5000.0
dtype: float64

In [47]:
# Comparing a series to a value returns a boolean series.
ser > 280

a     True
b    False
c    False
d     True
e     True
f     True
g     True
y    False
z     True
dtype: bool

In [48]:
# Comparing a series to a value returns a boolean series.
ser <= 300

a    False
b     True
c     True
d     True
e    False
f    False
g    False
y    False
z    False
dtype: bool

In [49]:
# Using the boolean series as a mask on the main series.
ser[ser > 280]

a    5000.0
d     300.0
e     350.0
f     400.0
g     450.0
z    5000.0
dtype: float64

In [50]:
# Using the boolean series as a mask on the main series.
ser[ser <= 300]

b    200.0
c    250.0
d    300.0
dtype: float64

In [51]:
# Multiple conditions 
ser[(ser > 200) & (ser <= 500)]

c    250.0
d    300.0
e    350.0
f    400.0
g    450.0
dtype: float64

In [52]:
# Multiple conditions 
ser[(ser < 200) | (ser > 500)]

a    5000.0
z    5000.0
dtype: float64
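
In the outputs above, the NaN entry never satisfies a comparison, so it is excluded from every selection. A small sketch of the consequence when a mask is negated with the ~ operator, using a hypothetical three-element series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([150.0, 300.0, np.nan], index=["a", "d", "y"])

mask = s > 280        # the comparison is False for NaN
selected = s[mask]    # only "d"
rest = s[~mask]       # "a" and also the NaN entry "y"

print(selected)
print(rest)
```

The negated mask therefore keeps the missing value, which is worth remembering when the intent is "values not greater than 280".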

In [53]:
# Reassigning the selected values
ser[ser < 200] = 5000

In [54]:
ser

a    5000.0
b     200.0
c     250.0
d     300.0
e     350.0
f     400.0
g     450.0
y       NaN
z    5000.0
dtype: float64

### Index Alignment

An important feature of the Pandas Series object is its behaviour during arithmetic between series objects with different indexes. When carrying out any arithmetic computation, the operation proceeds on every pair of matching index labels. Where a label appears in only one of the series, the internal data alignment process introduces a NaN value and propagates it to the result. Let's explore this behaviour:

In [55]:
cost = {"book": 20, "pen":30, "bag":50, "razor":7,"cup":11}

In [56]:
unit= {"brush":5, "pen":3, "book":5,"spoon":2,"razor":6,"knife":4}

In [57]:
ser1 = pd.Series(cost)
ser2 = pd.Series(unit)

In [58]:
ser1


book     20
pen      30
bag      50
razor     7
cup      11
dtype: int64

In [59]:
ser2

brush    5
pen      3
book     5
spoon    2
razor    6
knife    4
dtype: int64

In [60]:
# Indices with no match are replaced with NaN
ser1  + ser2

bag       NaN
book     25.0
brush     NaN
cup       NaN
knife     NaN
pen      33.0
razor    13.0
spoon     NaN
dtype: float64

In [61]:
# Filling the missing side of each unmatched label with the mean of ser1
ser1.add(ser2,fill_value=ser1.mean())

bag      73.6
book     25.0
brush    28.6
cup      34.6
knife    27.6
pen      33.0
razor    13.0
spoon    25.6
dtype: float64
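
As a further sketch, assuming shortened versions of the two dictionaries above, a fill value of 0 treats a label that is missing from one series as contributing nothing, so every label in either series survives with its own value. The names `left`, `right`, `plain` and `total` are illustrative.

```python
import pandas as pd

# Shortened, hypothetical versions of the cost and unit data.
left = pd.Series({"book": 20, "pen": 30, "bag": 50})
right = pd.Series({"pen": 3, "book": 5, "spoon": 2})

plain = left + right                  # unmatched labels become NaN
total = left.add(right, fill_value=0) # unmatched labels are treated as 0

print(plain)
print(total)
```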

### Binary Operation and Arithmetic on Series
As said earlier, for binary operations on two series, Pandas aligns the indices in the process of performing the operation.
It is interesting to note that NumPy universal functions (ufuncs) can be applied to pandas Series objects, and the result will be another pandas object with the indices preserved. We can perform binary operations on series such as addition, subtraction and many others. Pandas can also perform aggregation operations on series. There are also methods available for such arithmetic:

Python Operator | Pandas Method
:- | :-
`+` | `add()`
`-` | `sub()`, `subtract()`
`*` | `mul()`, `multiply()`
`/` | `truediv()`, `div()`, `divide()`
`//` | `floordiv()`
`%` | `mod()`
`**` | `pow()`
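
A brief sketch of the correspondence in the table, using two series with the same city labels as the examples in this section. Each operator is compared against its method form with Series.equals; the result names are illustrative.

```python
import pandas as pd

a = pd.Series([100, 200, 300, 400], index=["PH", "LAGOS", "KANO", "JOS"])
b = pd.Series([500, 600, 700, 800], index=["PH", "LAGOS", "KANO", "JOS"])

# Each operator and its method form give the same series.
same_sub = (b - a).equals(b.sub(a))
same_floordiv = (b // a).equals(b.floordiv(a))
same_mod = (b % a).equals(b.mod(a))
same_pow = (a ** 2).equals(a.pow(2))

print(same_sub, same_floordiv, same_mod, same_pow)
```

The method forms are mainly useful when an extra argument such as fill_value is needed, which the operators cannot accept.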

In [62]:
a = pd.Series([100,200,300,400],index=["PH","LAGOS","KANO","JOS"])
a

PH       100
LAGOS    200
KANO     300
JOS      400
dtype: int64

In [63]:
b = pd.Series([500,600,700,800],index=["PH","LAGOS","KANO","JOS"])
b

PH       500
LAGOS    600
KANO     700
JOS      800
dtype: int64

In [64]:
# Adding two series
a + b

PH        600
LAGOS     800
KANO     1000
JOS      1200
dtype: int64

In [65]:
# Division
a / b

PH       0.200000
LAGOS    0.333333
KANO     0.428571
JOS      0.500000
dtype: float64

In [66]:
# Multiplication
a * b

PH        50000
LAGOS    120000
KANO     210000
JOS      320000
dtype: int64

In [67]:
# Using NumPy Ufunc
np.add(a,b)

PH        600
LAGOS     800
KANO     1000
JOS      1200
dtype: int64
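
Index preservation also holds for ufuncs that take a single argument. A minimal sketch with np.sqrt, assuming a hypothetical series `areas` of perfect squares so the result is easy to check:

```python
import numpy as np
import pandas as pd

# Hypothetical series of perfect squares.
areas = pd.Series([100, 400, 900], index=["PH", "LAGOS", "KANO"])

# The ufunc returns a Series that keeps the original labels.
roots = np.sqrt(areas)
print(type(roots).__name__)
print(roots)
```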

In [68]:
# Addition method of the series object
a.add(b)

PH        600
LAGOS     800
KANO     1000
JOS      1200
dtype: int64

In [69]:
# Power method of the series object
a.pow(2)

PH        10000
LAGOS     40000
KANO      90000
JOS      160000
dtype: int64

In [70]:
# Sum
a.sum()

1000

In [71]:
# Mean
b.mean()

650.0

In [72]:
# Max
a.max()

400

In [73]:
# Standard deviation
b.std()

129.09944487358058

### Methods (commonly used)

In [74]:
# Checking for unique elements
ser.unique()

array([5000.,  200.,  250.,  300.,  350.,  400.,  450.,   nan])

In [75]:
# Checking whether all elements of the series are unique
ser.is_unique

False

In [76]:
# Checking for null values 
ser.isnull()

a    False
b    False
c    False
d    False
e    False
f    False
g    False
y     True
z    False
dtype: bool

In [77]:
# Checking for NaN (isna is an alias of isnull)
ser.isna()

a    False
b    False
c    False
d    False
e    False
f    False
g    False
y     True
z    False
dtype: bool
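
The two calls above return the same result, since isnull is an alias of isna. A short sketch, assuming a hypothetical series `s`, that checks this and uses the complementary notna method to keep only the values that are present:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([5000.0, np.nan, 250.0], index=["a", "y", "c"])

# isna and isnull produce identical boolean series.
same = s.isna().equals(s.isnull())

# notna is the complement, so it can serve as a mask for present values.
present = s[s.notna()]

print(same)
print(present)
```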

In [78]:
# Counting the non-null elements in a series
ser.count()

8
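
The result is 8 rather than 9 because count leaves out missing values, while size and len report every element. A minimal sketch of the difference on a hypothetical three-element series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([5000.0, 200.0, np.nan], index=["a", "b", "y"])

total_elements = s.size      # counts every element, including NaN
non_null = s.count()         # counts only the non-null elements

print(total_elements, len(s), non_null)
```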

In [79]:
# Statistics on the series
ser.describe()

count       8.000000
mean     1493.750000
std      2165.548017
min       200.000000
25%       287.500000
50%       375.000000
75%      1587.500000
max      5000.000000
dtype: float64

In [61]:
# Changing a property of a series: naming its index
ser.index.name = "series1"
ser.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'y', 'z'], dtype='object', name='series1')