# Python Library - Pandas


Pandas is a library built on top of NumPy for data analysis, and it is used extensively in data science projects.

There are two main data structures in pandas: the Series and the DataFrame.

*Source: https://pandas.pydata.org/pandas-docs/stable/overview.html*

In this section, we cover:
1. The pandas Series (similar to a NumPy array)
2. DataFrames

### 1. The Pandas Series

###### 1.1. Import Libraries and Create Pandas Series
###### 1.2. Access Elements and Indexing

In [35]:
###### 1.1. Import Libraries and Create Pandas Series

# import pandas, pd is an alias
import pandas as pd
import numpy as np

# Create pandas Series of different dtypes (int64 and object), plus a range of datetimes
s = pd.Series([2, 4, 5, 6, 9])
print(s)
# prints the index-value pairs, followed by the dtype
print(type(s))
print("----------------------------")
# Create a Series of strings; notice that the 'dtype' here is 'object' (not int64)
char_series = pd.Series(['a', 'b', 'af'])
print(char_series)
print("----------------------------")
# pd.date_range returns a DatetimeIndex (not a Series);
# '11-09-2017' is parsed month-first, i.e. as 9 November 2017
date_series = pd.date_range(start = '11-09-2017', end = '12-12-2017')
print(type(date_series))

0    2
1    4
2    5
3    6
4    9
dtype: int64
<class 'pandas.core.series.Series'>
----------------------------
0     a
1     b
2    af
dtype: object
----------------------------
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
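
The output above shows that `pd.date_range` yields a `DatetimeIndex` rather than a Series. The following is a minimal sketch of one way to get an actual datetime Series from it. The names `date_index` and `date_series_from_index` are illustrative, and the ISO-formatted dates assume the original range was meant as 9 November to 12 December 2017.

```python
import pandas as pd

# pd.date_range returns a DatetimeIndex with one entry per day (daily frequency by default)
date_index = pd.date_range(start="2017-11-09", end="2017-12-12")

# Wrapping the index in pd.Series gives a Series with a datetime dtype
date_series_from_index = pd.Series(date_index)

print(type(date_index).__name__)
print(type(date_series_from_index).__name__)
print(date_series_from_index.dtype)
print(len(date_series_from_index))  # both endpoints are included
```

Writing the dates in ISO format (`YYYY-MM-DD`) avoids any ambiguity between day-first and month-first parsing.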


In [36]:
###### 1.2. Access Elements and Indexing

print("-------The s pandas series of type int64----------")
print(s)
print("-----------4th element-----------------")
print(s[3])
print("-----------index=2 to end elements-----------------")
print(s[2:])
print("----------index 1 and 3 elements------------------")
print(s[[1, 3]])
# note that s[1, 3] will not work; the indices must be passed as a list, [1, 3], inside the outer []
print("---------------------------------------------------")
print("---------------------------------------------------")
print("----------Indexing as abc for 012------------------")
# Custom index labels are passed through the index argument
print(pd.Series([0, 1, 2], index = ['a', 'b', 'c']))
print("----------Indexing as 0 to 9 to the respective squares------------")
print(pd.Series(np.array(range(0,10))**2, index = range(0,10)))

-------The s pandas series of type int64----------
0    2
1    4
2    5
3    6
4    9
dtype: int64
-----------4th element-----------------
6
-----------index=2 to end elements-----------------
2    5
3    6
4    9
dtype: int64
----------index 1 and 3 elements------------------
1    4
3    6
dtype: int64
---------------------------------------------------
---------------------------------------------------
----------Indexing as abc for 012------------------
a    0
b    1
c    2
dtype: int64
----------Indexing as 0 to 9 to the respective squares------------
0     0
1     1
2     4
3     9
4    16
5    25
6    36
7    49
8    64
9    81
dtype: int64
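
Once a Series has custom labels, it can help to say explicitly whether a lookup is by label or by position. The sketch below uses `.loc` and `.iloc`, which are not used in the cells above. It reuses the `a`, `b`, `c` example, and the variable names are illustrative.

```python
import pandas as pd

s_labelled = pd.Series([0, 1, 2], index=["a", "b", "c"])

by_label = s_labelled.loc["b"]        # look up by index label
by_position = s_labelled.iloc[1]      # look up by integer position
subset = s_labelled.loc[["a", "c"]]   # several labels, passed as a list

label_slice = s_labelled.loc["a":"b"]  # label slices include both endpoints
pos_slice = s_labelled.iloc[0:1]       # position slices exclude the end

print(by_label, by_position)
print(subset)
print(len(label_slice), len(pos_slice))
```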


### 2. The Pandas Dataframe

A DataFrame is a table with rows and columns: each row has an index label, and each column has a meaningful name.

###### 2.1. Creating DataFrames from dictionaries, JSON objects, and text or CSV files

In [37]:
# Create a DataFrame from a dictionary: the keys become the column names
# and the values (lists of equal length) become the column contents
df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['engineer', 'doctor', 'data analyst', 'teacher']})
df

|   | name | age | occupation |
|---|---|---|---|
| 0 | Vinay | 22 | engineer |
| 1 | Kushal | 25 | doctor |
| 2 | Aman | 24 | data analyst |
| 3 | Saif | 28 | teacher |
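
The heading for this subsection also mentions JSON objects and text files, which the cells here do not demonstrate. As a rough sketch, the same kind of table can be built from a list of records, a JSON string, or delimited text. The in-memory strings and the `|` delimiter below are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# A list of dictionaries: each dictionary becomes one row
records = [{"name": "Vinay", "age": 22}, {"name": "Kushal", "age": 25}]
df_records = pd.DataFrame(records)

# A JSON array of objects, read through a file-like wrapper
json_text = '[{"name": "Aman", "age": 24}, {"name": "Saif", "age": 28}]'
df_json = pd.read_json(io.StringIO(json_text))

# Delimited text: read_csv accepts any single-character separator via sep
txt = "name|age\nVinay|22\nKushal|25\n"
df_txt = pd.read_csv(io.StringIO(txt), sep="|")

print(df_records)
print(df_json)
print(df_txt)
```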


In [38]:
# Create a DataFrame from a CSV file (here read directly from a URL)
market_df = pd.read_csv('https://designworksakiiitb9.s3.amazonaws.com/market_fact.csv')
market_df.head()

|   | Ord_id | Prod_id | Ship_id | Cust_id | Sales | Discount | Order_Quantity | Profit | Shipping_Cost | Product_Base_Margin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ord_5446 | Prod_16 | SHP_7609 | Cust_1818 | 136.81 | 0.01 | 23 | -30.51 | 3.6 | 0.56 |
| 1 | Ord_5406 | Prod_13 | SHP_7549 | Cust_1818 | 42.27 | 0.01 | 13 | 4.56 | 0.93 | 0.54 |
| 2 | Ord_5446 | Prod_4 | SHP_7610 | Cust_1818 | 4701.69 | 0.0 | 26 | 1148.9 | 2.5 | 0.59 |
| 3 | Ord_5456 | Prod_6 | SHP_7625 | Cust_1818 | 2337.89 | 0.09 | 43 | 729.34 | 14.3 | 0.37 |
| 4 | Ord_5485 | Prod_17 | SHP_7664 | Cust_1818 | 4233.15 | 0.08 | 35 | 1219.87 | 26.3 | 0.38 |


In [57]:
market_df.tail()
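# Here, each row represents an order item placed at a retail store
# (the same Ord_id can appear in more than one row).
# Note that the rows below are labelled by Ord_id rather than by the default integer index
# (0 to 8398) seen in head() above, so the index was presumably changed in a cell not shown here.
# Either way, the dataset has 8399 rows.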

| Ord_id | Prod_id | Ship_id | Cust_id | Sales | Discount | Order_Quantity | Profit | Shipping_Cost | Product_Base_Margin |
|---|---|---|---|---|---|---|---|---|---|
| Ord_5353 | Prod_4 | SHP_7479 | Cust_1798 | 2841.4395 | 0.08 | 28 | 374.63 | 7.69 | 0.59 |
| Ord_5411 | Prod_6 | SHP_7555 | Cust_1798 | 127.16 | 0.1 | 20 | -74.03 | 6.92 | 0.37 |
| Ord_5388 | Prod_6 | SHP_7524 | Cust_1798 | 243.05 | 0.02 | 39 | -70.85 | 5.35 | 0.4 |
| Ord_5348 | Prod_15 | SHP_7469 | Cust_1798 | 3872.87 | 0.03 | 23 | 565.34 | 30.0 | 0.62 |
| Ord_5459 | Prod_6 | SHP_7628 | Cust_1798 | 603.69 | 0.0 | 47 | 131.39 | 4.86 | 0.38 |
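
The `tail()` output is labelled by `Ord_id`, whereas the earlier `head()` output carried a 0-based integer index. One way such a change can come about is `set_index`. Whether the original notebook did exactly this is an assumption, since the intermediate cells are not shown. The sketch below uses a three-row sample copied from the `head()` output rather than the full file.

```python
import pandas as pd

# Small stand-in for market_df, using values from the head() output
sample = pd.DataFrame({
    "Ord_id": ["Ord_5446", "Ord_5406", "Ord_5446"],
    "Prod_id": ["Prod_16", "Prod_13", "Prod_4"],
    "Sales": [136.81, 42.27, 4701.69],
})
print(sample.index)  # default RangeIndex: 0, 1, 2

# set_index returns a new DataFrame whose row labels come from the chosen column
indexed = sample.set_index("Ord_id")
print(indexed.tail(2))

# Index labels do not have to be unique: Ord_5446 appears twice
print(indexed.index.is_unique)
```

After `set_index`, `Ord_id` is no longer counted among the columns, which is consistent with `info()` reporting 9 data columns.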


In [58]:
# Looking at the datatypes of each column
market_df.info()

# Note that each of the 9 columns (numbered 0 to 8) is basically a pandas Series of length 8399
# The ID columns are 'objects', i.e. they are being read as strings
# The rest are numeric (floats or ints); Product_Base_Margin has 63 missing values (8336 non-null)


<class 'pandas.core.frame.DataFrame'>
Index: 8399 entries, Ord_5446 to Ord_5459
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Prod_id              8399 non-null   object 
 1   Ship_id              8399 non-null   object 
 2   Cust_id              8399 non-null   object 
 3   Sales                8399 non-null   float64
 4   Discount             8399 non-null   float64
 5   Order_Quantity       8399 non-null   int64  
 6   Profit               8399 non-null   float64
 7   Shipping_Cost        8399 non-null   float64
 8   Product_Base_Margin  8336 non-null   float64
dtypes: float64(5), int64(1), object(3)
memory usage: 656.2+ KB
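
The non-null counts reported by `info()` can also be obtained directly. This is a small sketch on a made-up three-row sample with one missing margin value. `isna()` and `dtypes` are standard pandas, but the sample frame itself is illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Cust_id": ["Cust_1818", "Cust_1818", "Cust_1798"],
    "Order_Quantity": [23, 13, 28],
    "Product_Base_Margin": [0.56, np.nan, 0.59],
})

# dtypes lists the data type of each column
print(sample.dtypes)

# isna() marks missing cells; summing counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```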


In [59]:
# describe() gives a summary (count, mean, std, min, quartiles, max) of all the numeric columns
market_df.describe()

|   | Sales | Discount | Order_Quantity | Profit | Shipping_Cost | Product_Base_Margin |
|---|---|---|---|---|---|---|
| count | 8399.0 | 8399.0 | 8399.0 | 8399.0 | 8399.0 | 8336.0 |
| mean | 1775.878179 | 0.049671 | 25.571735 | 181.184424 | 12.838557 | 0.512513 |
| std | 3585.050525 | 0.031823 | 14.481071 | 1196.653371 | 17.264052 | 0.135589 |
| min | 2.24 | 0.0 | 1.0 | -14140.7 | 0.49 | 0.35 |
| 25% | 143.195 | 0.02 | 13.0 | -83.315 | 3.3 | 0.38 |
| 50% | 449.42 | 0.05 | 26.0 | -1.5 | 6.07 | 0.52 |
| 75% | 1709.32 | 0.08 | 38.0 | 162.75 | 13.99 | 0.59 |
| max | 89061.05 | 0.25 | 50.0 | 27220.69 | 164.73 | 0.85 |
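
The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by row and column label. The sketch below uses five rows taken from the `head()` output. The use of `.loc` and the variable names are additions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Prod_id": ["Prod_16", "Prod_13", "Prod_4", "Prod_6", "Prod_17"],
    "Sales": [136.81, 42.27, 4701.69, 2337.89, 4233.15],
    "Order_Quantity": [23, 13, 26, 43, 35],
})

# Only the numeric columns are summarised by default; Prod_id is left out
summary = sample.describe()
print(summary)

# The '50%' row is the median of each column
median_quantity = summary.loc["50%", "Order_Quantity"]
print(median_quantity)
```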


In [60]:
# Column names
market_df.columns

Index(['Prod_id', 'Ship_id', 'Cust_id', 'Sales', 'Discount', 'Order_Quantity',
       'Profit', 'Shipping_Cost', 'Product_Base_Margin'],
      dtype='object')

In [61]:
# The number of rows and columns
market_df.shape

(8399, 9)

In [63]:
# You can extract the values of a dataframe as a numpy array using df.values 
market_df.values

array([['Prod_16', 'SHP_7609', 'Cust_1818', ..., -30.51, 3.6, 0.56],
       ['Prod_13', 'SHP_7549', 'Cust_1818', ..., 4.56, 0.93, 0.54],
       ['Prod_4', 'SHP_7610', 'Cust_1818', ..., 1148.9, 2.5, 0.59],
       ...,
       ['Prod_6', 'SHP_7524', 'Cust_1798', ..., -70.85, 5.35, 0.4],
       ['Prod_15', 'SHP_7469', 'Cust_1798', ..., 565.34, 30.0, 0.62],
       ['Prod_6', 'SHP_7628', 'Cust_1798', ..., 131.39, 4.86, 0.38]],
      dtype=object)
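
The `dtype=object` in the array above arises because the frame mixes string and numeric columns, so NumPy falls back to a generic object array. As a hedged sketch on a made-up two-row sample, selecting only the numeric columns first gives a float array instead. `to_numpy()` is used here as an equivalent of `.values`, and it does not appear in the original cells.

```python
import pandas as pd

sample = pd.DataFrame({
    "Prod_id": ["Prod_16", "Prod_13"],
    "Sales": [136.81, 42.27],
    "Profit": [-30.51, 4.56],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = sample.shape

# Mixed string and float columns collapse to a single object dtype
mixed = sample.to_numpy()

# Restricting to numeric columns keeps a numeric dtype
numeric = sample[["Sales", "Profit"]].to_numpy()

print(n_rows, n_cols)
print(mixed.dtype)
print(numeric.dtype)
```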

Congratulations! You have now been introduced to the pandas library. The next tutorials cover operations on pandas data structures. All the best!