# Pandas 

**Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python.**

**The name Pandas is derived from "panel data" (data sets that include multiple observations over multiple periods of time).**

**Pandas is built on top of NumPy and relies on its computational abilities.**
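
As a rough illustration of that relationship (a sketch that is not part of the original notebook), the values of a Series can be handed back to NumPy with `to_numpy()`, and NumPy functions accept a Series directly. The name `rates` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the original notebook
rates = pd.Series([72, 73, 74, 75])

# to_numpy() exposes the values of the Series as a NumPy array
values = rates.to_numpy()
print(type(values))

# NumPy functions can also be applied to the Series itself
average = np.mean(rates)
print(average)
```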

## Pandas Series
       The "Series" object is a single column of data: a set of values that correspond to a single variable.
       The Pandas Series object is something like a more powerful version of the Python list or an enhanced version of the NumPy
       array.
       A Series comes with a larger set of tools and capabilities that are specific to the pandas library.
       The Series object stores its values in a sequenced order and has an explicit index.
       A Series should hold values of a single data type, which helps maintain data consistency.
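
A small sketch of what data consistency means in practice, assuming made-up values: when every element has the same type, pandas picks a specific dtype, whereas mixed types fall back to the generic `object` dtype. The names `same_type` and `mixed` are hypothetical.

```python
import pandas as pd

# All integers: pandas chooses an integer dtype
same_type = pd.Series([10, 20, 30])
print(same_type.dtype)

# Mixed integers, strings, and floats: pandas falls back to the object dtype
mixed = pd.Series([10, 'twenty', 30.0])
print(mixed.dtype)
```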

In [2]:
import numpy as np
import pandas as pd

In [3]:
products = ['A','B','C','D']
products

['A', 'B', 'C', 'D']

In [4]:
type(products)

list

In [5]:
product_cat = pd.Series(products)
product_cat

0    A
1    B
2    C
3    D
dtype: object

In [6]:
type(product_cat)

pandas.core.series.Series

In [7]:
type(pd.Series(products))

pandas.core.series.Series

In [8]:
daily_rates_rupees = pd.Series([72,73,74,75])
daily_rates_rupees

0    72
1    73
2    74
3    75
dtype: int64

In [9]:
array_a = np.array([10,20,30,40,50])
array_a

array([10, 20, 30, 40, 50])

In [10]:
type(array_a)

numpy.ndarray

In [11]:
series_a = pd.Series(array_a)
series_a

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [12]:
type(series_a)

pandas.core.series.Series

In [13]:
series_a = pd.Series([10,20,30,40,50])
series_a

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [14]:
series_a.dtype

dtype('int64')

In [15]:
series_a.size

5

In [16]:
product_cat = pd.Series(['A','B','C','D'])
product_cat

0    A
1    B
2    C
3    D
dtype: object

In [17]:
product_cat.dtype

dtype('O')

In [18]:
product_cat.size

4

In [19]:
type(product_cat.size)

int

In [20]:
product_cat.name

In [21]:
print(product_cat)

0    A
1    B
2    C
3    D
dtype: object


In [22]:
product_cat.name = "Product Categories"
product_cat

0    A
1    B
2    C
3    D
Name: Product Categories, dtype: object

In [23]:
product_cat.name

'Product Categories'

### Using an index in Pandas

In [24]:
prices_per_catogery = {'Product A': 22250, 'Product B': 16600, 'Product C': 15600}
prices_per_catogery

{'Product A': 22250, 'Product B': 16600, 'Product C': 15600}

In [25]:
type(prices_per_catogery)

dict

In [26]:
prices_per_catogery = pd.Series(prices_per_catogery)
prices_per_catogery

Product A    22250
Product B    16600
Product C    15600
dtype: int64

In [27]:
type(prices_per_catogery)

pandas.core.series.Series

In [28]:
prices_per_catogery.index

Index(['Product A', 'Product B', 'Product C'], dtype='object')

In [29]:
type(prices_per_catogery.index)

pandas.core.indexes.base.Index

### Points to remember while working with indexes in pandas:
**1. An index allows you to refer to a position within a sequence, in other words, within a set of values in a sequenced order.**

**2. You can quickly access the prices of the relevant categories through their respective index labels.**

**3. The index data structure often speeds up lookups when working with large data sets.**
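
To make the second point concrete, here is a minimal sketch of label-based lookups through the index. The variable `prices` is a hypothetical stand-in that reuses the prices from the dictionary above; `.loc` is the pandas accessor for label-based selection.

```python
import pandas as pd

# Hypothetical stand-in for the price Series built above
prices = pd.Series({'Product A': 22250, 'Product B': 16600, 'Product C': 15600})

# Look up a single value by its index label
price_b = prices.loc['Product B']
print(price_b)

# Look up several labels at once; the result is again a Series
subset = prices.loc[['Product A', 'Product C']]
print(subset)

# Membership tests run against the index labels
has_b = 'Product B' in prices.index
print(has_b)
```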

In [36]:
series_a = pd.Series([10,20,30,40,50])
series_a
#this is zero-based indexing

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [32]:
series_a.index

RangeIndex(start=0, stop=5, step=1)

In [33]:
type(series_a.index)

pandas.core.indexes.range.RangeIndex

In [34]:
list(series_a.index)

[0, 1, 2, 3, 4]

In [35]:
prices_per_catogery = pd.Series({'Product A': 22250, 'Product B': 16600, 'Product C': 12500})
prices_per_catogery

Product A    22250
Product B    16600
Product C    12500
dtype: int64

In [37]:
prices_per_catogery.index

Index(['Product A', 'Product B', 'Product C'], dtype='object')

In [38]:
type(prices_per_catogery.index)

pandas.core.indexes.base.Index

**Indexing from a programming perspective:**

 1. Label-based indexing

 2. Position-based indexing

**Indexing from an analytical perspective:**

 1. Explicit - we have explicitly specified our own index.

 2. Implicit - if the user doesn't specify an index, pandas automatically attaches the default zero-based index to the object, so its values can be referred to via their positions.

 Label-based indexing = axis labels

 Position-based indexing = zero-based indexing
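
The difference between the two kinds of indexing can be sketched with the `.loc` (label-based) and `.iloc` (position-based) accessors. The Series `numbered` below is a hypothetical example whose integer labels start at 1, so labels and positions disagree.

```python
import pandas as pd

# Hypothetical Series with an explicit integer index starting at 1
numbered = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

# Label-based: the element whose index label is 1
by_label = numbered.loc[1]

# Position-based: the element at zero-based position 1
by_position = numbered.iloc[1]

print(by_label, by_position)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.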

In [39]:
series_a = pd.Series([10,20,30,40,50])
prices_per_catogery = pd.Series({'Product A': 22250, 'Product B': 16600, 'Product C': 12500})

In [40]:
series_a

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [42]:
series_a[0]

10

In [43]:
prices_per_catogery

Product A    22250
Product B    16600
Product C    12500
dtype: int64

In [44]:
prices_per_catogery['Product A']

22250

In [45]:
# Position-based access; plain prices_per_catogery[0] is deprecated in recent pandas versions
prices_per_catogery.iloc[0]

22250

**Let's create 'series_b' so that it contains the same data as 'series_a', but with an explicit index: a sequence of integers from 1 to 5.**

In [47]:
series_b = pd.Series([10,20,30,40,50], index = [1,2,3,4,5])
series_b

1    10
2    20
3    30
4    40
5    50
dtype: int64

In [48]:
series_b[1]

10

**Now, we'll be using index labels (numbers written as strings)**

In [50]:
series_c = pd.Series([10,20,30,40,50], index = ['1','2','3','4','5'])
series_c

1    10
2    20
3    30
4    40
5    50
dtype: int64

In [51]:
# Position 1, not the label '1'; plain series_c[1] is deprecated in recent pandas versions
series_c.iloc[1]

20

In [52]:
series_c['1']

10

In [53]:
# Position 0; plain series_c[0] is deprecated in recent pandas versions
series_c.iloc[0]

10

In [54]:
prices_per_catogery

Product A    22250
Product B    16600
Product C    12500
dtype: int64

**Index labels = a non-numeric index.**

For large data sets, position-based indexing is preferred.

**Every Python object is associated with a certain collection of attributes and methods.**

    Attributes - provide metadata; they deliver information about the given data.

    Methods - relate to the functionality and behaviour of the object; they work with the data stored in the object.
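
A brief sketch of the distinction, using a hypothetical Series named `deposits`: attributes are read without parentheses, while methods are called with parentheses and compute something from the data.

```python
import pandas as pd

# Hypothetical example data
deposits = pd.Series([3000, 3500, 2500], name='Deposits')

# Attributes: metadata about the object, accessed without parentheses
print(deposits.dtype)
print(deposits.size)
print(deposits.name)

# Methods: behaviour that works with the stored data, called with parentheses
total = deposits.sum()
largest = deposits.max()
print(total, largest)
```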

## Pandas Series Methods

In [55]:
start_date_deposits = pd.Series({
    '4/1/2019' : 3000,
    '3/2/2019' : 3000,
    '3/3/2019' : 3500,
    '2/4/2019' : 3000,
    '2/5/2019' : 3500,
    '1/6/2019' : 2500,
    '2/7/2019' : 1000,
    '5/8/2019' : 500,
    '20/8/2019' : 2000,
    '2/9/2019' : 5000,
    '3/10/2019' : 6000,
    '2/11/2019' : 4500,
    '2/12/2019' : 2000
})

In [56]:
start_date_deposits

4/1/2019     3000
3/2/2019     3000
3/3/2019     3500
2/4/2019     3000
2/5/2019     3500
1/6/2019     2500
2/7/2019     1000
5/8/2019      500
20/8/2019    2000
2/9/2019     5000
3/10/2019    6000
2/11/2019    4500
2/12/2019    2000
dtype: int64

In [57]:
start_date_deposits.sum()

39500

In [58]:
start_date_deposits.min()

500

In [59]:
start_date_deposits.max()

6000

In [60]:
start_date_deposits.idxmax()

'3/10/2019'

In [61]:
start_date_deposits.idxmin()

'5/8/2019'

Many of the methods used in Pandas are also available in NumPy, as Pandas is built on NumPy.

    When working with numeric data only, NumPy is usually sufficient.
    When working with both numeric and non-numeric data, we should use Pandas.
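
As a hedged comparison (the three deposits below are a made-up subset of the data above), the NumPy counterparts operate on plain arrays and report positions, whereas the pandas methods report index labels. NumPy's `argmax` is the closest counterpart to `idxmax()`.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the deposits used above
deposits = pd.Series({'4/1/2019': 3000, '3/10/2019': 6000, '5/8/2019': 500})
values = deposits.to_numpy()

# The same total, computed by pandas and by NumPy
total_pd = deposits.sum()
total_np = np.sum(values)

# pandas returns the index label of the maximum...
label_of_max = deposits.idxmax()

# ...while NumPy returns its zero-based position
position_of_max = int(np.argmax(values))

print(total_pd, total_np, label_of_max, position_of_max)
```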


**Methods that also work with non-numeric data.**

    head() - returns the first 5 rows of the dataset by default
    tail() - returns the last 5 rows of the dataset by default

In [62]:
start_date_deposits.head()

4/1/2019    3000
3/2/2019    3000
3/3/2019    3500
2/4/2019    3000
2/5/2019    3500
dtype: int64

In [63]:
start_date_deposits.tail()

20/8/2019    2000
2/9/2019     5000
3/10/2019    6000
2/11/2019    4500
2/12/2019    2000
dtype: int64

**Mathematical methods -- sum(), min(), max(), idxmax(), idxmin()**

**Non-mathematical methods -- head(), tail()**

In [64]:
start_date_deposits.head(3)

4/1/2019    3000
3/2/2019    3000
3/3/2019    3500
dtype: int64

In [68]:
start_date_deposits.head(8)

4/1/2019    3000
3/2/2019    3000
3/3/2019    3500
2/4/2019    3000
2/5/2019    3500
1/6/2019    2500
2/7/2019    1000
5/8/2019     500
dtype: int64

In [69]:
start_date_deposits.tail(4)

2/9/2019     5000
3/10/2019    6000
2/11/2019    4500
2/12/2019    2000
dtype: int64

## Pandas DataFrames 

    A DataFrame is a collection of multiple Series objects.
    Any characteristic of a Series also applies to the separate columns of a DataFrame.
    It holds multi-column data.
    Every column contains data of its own type.
    Each column represents a different variable.
    We can preserve data consistency within each column.
    It contains values along two axes (rows & columns).
    It represents a tabular structure.
    It can have both row & column labels.
    These labels give two points of reference for every value.
    Like a dictionary, it maps keys (column labels) to whole objects that hold the values of an entire column.
    It behaves much like a dictionary of columns, although it is not a subclass of the Python dict class.
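
A minimal sketch of several of these points, assuming a small made-up DataFrame: selecting a column with dictionary-style syntax returns a Series, each column carries its own dtype, and the two axes are exposed as `index` and `columns`.

```python
import pandas as pd

# Hypothetical two-column example
df = pd.DataFrame({'ProductName': ['Product A', 'Product B'],
                   'ProductPrice': [22250, 16600]})

# Dictionary-style access by column label returns a Series
price_column = df['ProductPrice']
print(type(price_column))

# Series methods therefore work on a single column
highest_price = price_column.max()
print(highest_price)

# Every column has its own dtype
print(df.dtypes)

# Two axes: row labels and column labels
row_labels = list(df.index)
column_labels = list(df.columns)
print(row_labels, column_labels)
```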

### Creating a DataFrame in 6 different ways
1. From a dictionary of lists.
2. From a dictionary of lists and specifying an index.
3. From a list of dictionaries.
4. From a dictionary of pandas series.
5. From a list of lists.
6. In a professional way.

### 1. Constructing a DataFrame from a dictionary of lists

In [70]:
data = {'ProductName': ['Product A','Product B','Product C'],
       'ProductPrice': [22250, 16600, 12500]}
df = pd.DataFrame(data)
df

  ProductName  ProductPrice
0   Product A         22250
1   Product B         16600
2   Product C         12500


**( We must always be careful with the number of records we provide: every list in the dictionary must have the same length. )**
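
To see why this matters, here is a small sketch in which one list is deliberately shorter than the other. pandas refuses to build the DataFrame and raises a `ValueError`; the exact wording of the message may vary between versions, so it is only printed here.

```python
import pandas as pd

# Deliberately inconsistent input: three names but only two prices
data = {'ProductName': ['Product A', 'Product B', 'Product C'],
        'ProductPrice': [22250, 16600]}

try:
    pd.DataFrame(data)
    error_message = None
except ValueError as err:
    # pandas rejects columns of unequal length
    error_message = str(err)

print(error_message)
```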

### 2. Constructing a DataFrame from a dictionary of lists and specifying an index

In [72]:
data = {'ProductName': ['Product A','Product B','Product C'],
       'ProductPrice': [22250, 16600, 12500]}
df = pd.DataFrame(data, index = ['A','B','C'])
df

  ProductName  ProductPrice
A   Product A         22250
B   Product B         16600
C   Product C         12500


In [73]:
data = {'ProductName': ['Product A','Product B','Product C'],
       'ProductPrice': [22250, 16600, 12500]}
products_IDs = ['A','B','C']
df = pd.DataFrame(data, index = products_IDs)
df

  ProductName  ProductPrice
A   Product A         22250
B   Product B         16600
C   Product C         12500


### 3. Constructing a DataFrame from a list of dictionaries 

In [74]:
data = [{'ProductName': 'Product A',
        'ProductPrice': 22250},
       {'ProductName': 'Product B',
        'ProductPrice': 16600},
       {'ProductName': 'Product C',
        'ProductPrice': 12500}]
df = pd.DataFrame(data)
df

  ProductName  ProductPrice
0   Product A         22250
1   Product B         16600
2   Product C         12500


**( This way of creating a DataFrame can help a great deal to preserve data consistency )**

In [77]:
data = [{'ProductName': 'Product A',
        'ProductPrice': 22250},
       {'ProductName': 'Product B',
        'ProductPrice': 16600},
       {'ProductName': 'Product C',
        'ProductPrice': [12500,5000]}]
df = pd.DataFrame(data)
df

# This will ruin data consistency

  ProductName   ProductPrice
0   Product A          22250
1   Product B          16600
2   Product C  [12500, 5000]
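
One possible way to detect this kind of inconsistency, sketched under the assumption that a price should always be a single number: the affected column ends up with the generic `object` dtype, and the offending rows can be found by checking each value's type. The helper name `is_list` is hypothetical.

```python
import pandas as pd

data = [{'ProductName': 'Product A', 'ProductPrice': 22250},
        {'ProductName': 'Product B', 'ProductPrice': 16600},
        {'ProductName': 'Product C', 'ProductPrice': [12500, 5000]}]
df = pd.DataFrame(data)

# A column mixing numbers and lists is stored with the object dtype
print(df['ProductPrice'].dtype)

# Flag the rows whose price is a list rather than a single number
is_list = df['ProductPrice'].apply(lambda value: isinstance(value, list))
print(df[is_list])
```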


### 4. Constructing a DataFrame from a dictionary of Pandas Series

In [78]:
ser_products = pd.Series(['Product A', 'Product B', 'Product C'])
ser_prices = pd.Series([22250, 16600, 12500])

In [79]:
data = {'ProductName': ser_products,
       'ProductPrice': ser_prices}
df = pd.DataFrame(data)
df

  ProductName  ProductPrice
0   Product A         22250
1   Product B         16600
2   Product C         12500


Adding specific indexes:
When creating each Series that will become a column of the new DataFrame, give it the same index labels as the other Series.

In [80]:
ser_products = pd.Series(['Product A', 'Product B', 'Product C'], index = ['A','B','C'])
ser_prices = pd.Series([22250, 16600, 12500], index = ['A','B','C'])
data ={'ProductName': ser_products,
      'ProductPrice': ser_prices}
df = pd.DataFrame(data)
df

  ProductName  ProductPrice
A   Product A         22250
B   Product B         16600
C   Product C         12500


If you change the order of the index in one of the Series, pandas still matches each element with its corresponding index label.

In the final output below, the rows follow the order A, B, C, and the prices are rearranged to line up with those labels.

In [81]:
ser_products = pd.Series(['Product A', 'Product B', 'Product C'], index = ['A','B','C'])
ser_prices = pd.Series([22250, 16600, 12500], index = ['C','B','A'])
data ={'ProductName': ser_products,
      'ProductPrice': ser_prices}
df = pd.DataFrame(data)
df

  ProductName  ProductPrice
A   Product A         12500
B   Product B         16600
C   Product C         22250
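
The same label alignment also applies when a label is missing from one of the Series. In this sketch (an assumed variation of the example above, with the price for 'C' left out on purpose), pandas fills the gap with NaN instead of shifting the other values.

```python
import pandas as pd

ser_products = pd.Series(['Product A', 'Product B', 'Product C'], index=['A', 'B', 'C'])

# Assumed variation: prices given in a different order and without label 'C'
ser_prices = pd.Series([22250, 16600], index=['B', 'A'])

df = pd.DataFrame({'ProductName': ser_products,
                   'ProductPrice': ser_prices})
print(df)

# Values are matched by label, not by position
price_a = df.loc['A', 'ProductPrice']
price_c_missing = pd.isna(df.loc['C', 'ProductPrice'])
print(price_a, price_c_missing)
```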


### 5. Constructing a DataFrame from a list of lists

Each inner list should contain the same number of elements.

In [89]:
data = [['Product A', 22250], ['Product B', 16600], ['Product C', 12500]]
dff = pd.DataFrame(data)
dff

           0      1
0  Product A  22250
1  Product B  16600
2  Product C  12500


In [90]:
data = [['Product A', 22250], ['Product B', 16600], ['Product C', 12500,5000]]
df = pd.DataFrame(data)
df

           0      1       2
0  Product A  22250     NaN
1  Product B  16600     NaN
2  Product C  12500  5000.0


To avoid this, we must make sure that every inner list has the same number of elements, that is, the same number of values along the horizontal axis.
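
One simple way to guard against this, sketched here as an assumption rather than a standard recipe, is to compare the lengths of the inner lists before building the DataFrame. The name `row_lengths` is hypothetical.

```python
import pandas as pd

data = [['Product A', 22250], ['Product B', 16600], ['Product C', 12500, 5000]]

# Collect the distinct lengths of the inner lists
row_lengths = {len(row) for row in data}

if len(row_lengths) == 1:
    # All rows have the same number of elements, so it is safe to build the DataFrame
    df = pd.DataFrame(data, columns=['ProductName', 'ProductPrice'])
else:
    df = None
    print('Inconsistent row lengths:', sorted(row_lengths))
```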

In [91]:
# Assigning column names to the dataset
dff.columns = ['ProductName','ProductPrice']
dff

  ProductName  ProductPrice
0   Product A         22250
1   Product B         16600
2   Product C         12500


In [92]:
# Assigning specific index values to the dataset
dff.index = ['A','B','C']
dff

  ProductName  ProductPrice
A   Product A         22250
B   Product B         16600
C   Product C         12500


### 6. Constructing a DataFrame in a Professional way
While constructing a DataFrame, you need to provide information about its data, columns, and index.

In [93]:
df = pd.DataFrame( data = [['Product A', 22250], ['Product B',16600], ['Product C',12500]],
                 columns = ['ProductName', 'ProductPrice'],
                 index = ['A','B','C'])
df

  ProductName  ProductPrice
A   Product A         22250
B   Product B         16600
C   Product C         12500


In [94]:
df.shape

(3, 2)

'.shape' returns a tuple with the number of rows and columns you currently have in your DataFrame.
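
Because '.shape' is a tuple, it can be unpacked into separate row and column counts. The sketch below rebuilds the same DataFrame and compares '.shape' with two related values, `len(df)` (number of rows) and `df.size` (total number of cells); the names `n_rows` and `n_cols` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(data=[['Product A', 22250], ['Product B', 16600], ['Product C', 12500]],
                  columns=['ProductName', 'ProductPrice'],
                  index=['A', 'B', 'C'])

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() counts rows only, while .size counts every cell
print(len(df))
print(df.size)
```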