---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.10 (Pandas-02)</h1>

## _Overview of Pandas Dataframe Data Structure.ipynb_

#### Read about Pandas Data Structures: https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro

## Learning agenda of this notebook

1. Overview of Pandas
2. Creating Dataframes
    - An empty dataframe
    - Two-Dimensional NumPy Array
    - Dictionary of Python Lists
    - Dictionary of Pandas Series
3. Attributes of a Dataframe
4. Some basic Info gathering functions


<img align="right" width="500" height="500"  src="images/pandas-apps.png"  >

## 1. Overview of Pandas
- **Pandas** is an open-source Python library built on NumPy that provides easy-to-use data structures and data analysis tools. Its name is derived from "panel data", an econometrics term for multidimensional structured data sets. Wes McKinney began developing it in 2008.
- Data scientists use Pandas for the following tasks:
    - Reading, Writing, Downloading files of different formats like CSV, JSON, EXCEL, HTML, etc
    - Filtering and Modifying data based on multiple conditions
    - Attribute Generation (e.g., ID generation) 
    - Identifying and removing null values and duplicates
    - Imputation (replacement of missing observations by using statistical algorithms) 
    - Cutting, Splitting and Merging
    - Sorting and aggregating
    - Normalisation, standardisation, scaling, and pivoting
    - Data Partitioning (create training + validation + test data set)
- **Data Structures:**
    - **Series:** A one-dimensional labeled array holding a sequence of values of a single (homogeneous) data type. By default, its labels are integers starting from zero.
    - **Dataframe:** It is a 2-dimensional labeled data structure (like SQL table) with heterogeneously typed columns, having both a row and a column index.
    - Both of these data structures are *value-mutable*
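
As a minimal sketch of the two data structures described above (the variable names `s` and `df_demo` and their values are made up for illustration), the following creates one of each and overwrites a value in place to show what *value-mutable* means:

```python
import pandas as pd

# A Series: one-dimensional, a single dtype, default integer labels 0..n-1
s = pd.Series([10, 20, 30])

# A DataFrame: two-dimensional, each column may have its own dtype
df_demo = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, 2.5, 3.5]})

# Value-mutable: existing entries can be overwritten in place
s.loc[0] = 99
df_demo.loc[1, "score"] = 0.0

print(s.tolist())                 # [99, 20, 30]
print(df_demo["score"].tolist())  # [1.5, 0.0, 3.5]
```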

In [None]:
# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

In [1]:
import pandas as pd
pd.__version__ , pd.__path__

('1.3.4',
 ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas'])

## 2. Creating a Dataframe
<img align="right" width="500" height="500"  src="images/pandas.png"  >

- DataFrame is a 2-dimensional labeled data structure (like SQL table) with heterogeneously typed columns, having both a row and a column index.
- Pandas Dataframe vs NumPy array:
    - A Numpy array requires homogeneous data, while a Pandas DataFrame can have heterogeneously typed columns (float, int, string, datetime, etc.).
    - Pandas has a simpler interface for operations like file loading, plotting, selection, joining, and GROUP BY, which come in very handy in data-processing applications.
    - Pandas DataFrames (with column names) make it very easy to keep track of data.
    - Pandas is used when data is in Tabular Format, whereas Numpy is used for numeric array based data manipulation.
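
The homogeneous versus heterogeneous distinction above can be sketched as follows. The sample rows and the names `rows`, `arr_mixed` and `df_mixed` are invented for illustration. When given mixed values, NumPy coerces everything to one common dtype (here a Unicode string type), while the dataframe keeps a separate dtype per column:

```python
import numpy as np
import pandas as pd

rows = [[1, "Lahore", 2.5], [2, "Karachi", 3.5]]

# NumPy: one dtype for the whole array, so the numbers are converted to strings
arr_mixed = np.array(rows)
print(arr_mixed.dtype.kind)   # 'U' (Unicode string)

# Pandas: each column keeps its own dtype
df_mixed = pd.DataFrame(rows, columns=["id", "city", "score"])
print(df_mixed.dtypes)
```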


- You can create a dataframe, using the `pd.DataFrame()` constructor or one of the Pandas reader functions (such as `pd.read_csv()`), from a 
    - Two-Dimensional NumPy Array
    - Dictionary of Python Lists
    - Dictionary of Pandas Series
    - CSV, TSV, Excel files (residing on local disk or a public GitHub Gist, or from a public Google Document)
    - Database table
    - Web document (HTML, XML files)

```
pd.DataFrame(data, index, columns, dtype)
```
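
To make the four constructor parameters concrete, here is a small sketch that passes all of them at once. The data, the labels and the name `df_marks` are hypothetical and chosen only for illustration:

```python
import pandas as pd

marks = [[70, 80], [65, 90]]

df_marks = pd.DataFrame(
    data=marks,                    # the values, one inner list per row
    index=["s1", "s2"],            # row labels
    columns=["math", "physics"],   # column labels
    dtype="float64",               # force a single dtype on all columns
)
print(df_marks)
print(df_marks.dtypes)
```

Leaving out `index` or `columns` falls back to the default integer labels, and leaving out `dtype` lets Pandas infer a type for each column.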

### a. Creating an Empty Dataframe

In [2]:
import pandas as pd

df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


### b. Creating a Dataframe from a two-dimensional NumPy Array
- Remember that data need not be labeled at all to be placed into a dataframe

In [3]:
import pandas as pd
import numpy as np

# Create a 6x5 matrix of random integers from 10 to 99 (the upper bound 100 is exclusive)
arr = np.random.randint(10, 100, size=(6, 5))
print(arr)
# Pass this 2-D NumPy array to pd.DataFrame()
df = pd.DataFrame(data=arr)
df

[[12 22 57 34 15]
 [27 63 30 84 73]
 [74 85 97 49 48]
 [97 27 86 88 39]
 [68 95 27 77 35]
 [96 94 47 42 72]]


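|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 12 | 22 | 57 | 34 | 15 |
| 1 | 27 | 63 | 30 | 84 | 73 |
| 2 | 74 | 85 | 97 | 49 | 48 |
| 3 | 97 | 27 | 86 | 88 | 39 |
| 4 | 68 | 95 | 27 | 77 | 35 |
| 5 | 96 | 94 | 47 | 42 | 72 |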


- **Note that both the row indices and the column labels are implicitly set to integers from 0 to n-1, since neither was provided while creating the dataframe object. They are also not considered part of the data in the dataframe.**
- **In the majority of cases the row labels are left at their defaults, i.e., 0,1,2,3,... However, the column labels are usually changed from 0,1,2,3,... to meaningful names.**
- **This is shown below:**

In [6]:
# Let us set column labels of our choice while creating the dataframe
df = pd.DataFrame(data=arr, columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df

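|   | Col1 | Col2 | Col3 | Col4 | Col5 |
|---|---|---|---|---|---|
| 0 | 12 | 22 | 57 | 34 | 15 |
| 1 | 27 | 63 | 30 | 84 | 73 |
| 2 | 74 | 85 | 97 | 49 | 48 |
| 3 | 97 | 27 | 86 | 88 | 39 |
| 4 | 68 | 95 | 27 | 77 | 35 |
| 5 | 96 | 94 | 47 | 42 | 72 |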


In [8]:
# Let us set row labels of our choice while creating the dataframe
df = pd.DataFrame(data=arr, index=['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5'])
df

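|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Row0 | 12 | 22 | 57 | 34 | 15 |
| Row1 | 27 | 63 | 30 | 84 | 73 |
| Row2 | 74 | 85 | 97 | 49 | 48 |
| Row3 | 97 | 27 | 86 | 88 | 39 |
| Row4 | 68 | 95 | 27 | 77 | 35 |
| Row5 | 96 | 94 | 47 | 42 | 72 |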


In [5]:
# Let us set both the row labels and the column labels to strings of our choice while creating the dataframe
row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']
col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']
df = pd.DataFrame(data=arr, index=row_labels, columns=col_labels)
df

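|   | Col0 | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|---|
| Row0 | 12 | 22 | 57 | 34 | 15 |
| Row1 | 27 | 63 | 30 | 84 | 73 |
| Row2 | 74 | 85 | 97 | 49 | 48 |
| Row3 | 97 | 27 | 86 | 88 | 39 |
| Row4 | 68 | 95 | 27 | 77 | 35 |
| Row5 | 96 | 94 | 47 | 42 | 72 |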


- **You can do this later as well, i.e., after the dataframe has been created with default indices**
- **This is done by assigning a list of labels/values to `index` and `columns` attributes of a dataframe object**

In [9]:
arr = np.random.randint(10,100, size= (6,5))
df = pd.DataFrame(data=arr)
df

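|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 42 | 48 | 59 | 73 | 69 |
| 1 | 28 | 52 | 45 | 14 | 76 |
| 2 | 15 | 53 | 41 | 42 | 93 |
| 3 | 71 | 19 | 32 | 83 | 14 |
| 4 | 81 | 61 | 32 | 14 | 21 |
| 5 | 65 | 93 | 87 | 90 | 76 |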


In [10]:
row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']
col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']

df.columns = col_labels
df.index = row_labels

df

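|   | Col0 | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|---|
| Row0 | 42 | 48 | 59 | 73 | 69 |
| Row1 | 28 | 52 | 45 | 14 | 76 |
| Row2 | 15 | 53 | 41 | 42 | 93 |
| Row3 | 71 | 19 | 32 | 83 | 14 |
| Row4 | 81 | 61 | 32 | 14 | 21 |
| Row5 | 65 | 93 | 87 | 90 | 76 |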
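
Assigning a list to `columns` or `index` replaces every label at once, so the list must contain exactly one label per column or row. As a hedged sketch of an alternative, the `rename()` method accepts old-to-new mappings and changes only the labels you name. The frame `df_small` and its labels are invented for this example:

```python
import pandas as pd

df_small = pd.DataFrame([[1, 2], [3, 4]])   # default labels 0 and 1 on both axes

# rename() takes old -> new mappings and returns a new dataframe
df_renamed = df_small.rename(columns={0: "Col0"}, index={1: "Row1"})
print(df_renamed)

# Direct assignment must supply exactly one label per column
mismatch_raised = False
try:
    df_small.columns = ["only_one_label"]
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)
```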


### c. Creating a Dataframe from a Dictionary of Python Lists
- You can create a dataframe object from a dictionary of Python Lists 
    - The dictionary `Keys` become the column names, and 
    - The dictionary `Values` are lists/arrays containing data for the respective columns.

In [14]:
import pandas as pd
d1 = {
    'Names': ['Arif', 'Hadeed', 'Mujahid'],
    'Age': [50, 22, 18],
    'Addr': ['Lahore', 'Islamabad', 'Karachi']
}
# Pass this dictionary of Python lists to pd.DataFrame()
df = pd.DataFrame(data=d1)
df

|   | Names | Age | Addr |
|---|---|---|---|
| 0 | Arif | 50 | Lahore |
| 1 | Hadeed | 22 | Islamabad |
| 2 | Mujahid | 18 | Karachi |


- **Note that column labels are set as per the keys inside the dictionary object, while the row labels/indices are set to default numerical values**
- **You can set the row indices while creating the dataframe as shown below**

In [16]:
d1 = {
    'Names': ['Arif', 'Hadeed', 'Mujahid'],
    'Age': [50, 22, 18],
    'Addr': ['Lahore', 'Islamabad', 'Karachi']
}
# Let us change the row labels
df = pd.DataFrame(data=d1, index=['MS01', 'MS02', 'MS03'])
df

|   | Names | Age | Addr |
|---|---|---|---|
| MS01 | Arif | 50 | Lahore |
| MS02 | Hadeed | 22 | Islamabad |
| MS03 | Mujahid | 18 | Karachi |


- **OR you may change both by assigning a list of labels/values to `index` and `columns` attributes of a dataframe object**

In [18]:
row_labels = [0,1,2]
col_labels = ['Col1', 'Col2', 'Col3']

df.columns = col_labels
df.index = row_labels

df

|   | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | Arif | 50 | Lahore |
| 1 | Hadeed | 22 | Islamabad |
| 2 | Mujahid | 18 | Karachi |
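
One point worth checking when the data is a dictionary: passing `columns=` to the constructor does not rename the columns. It selects and orders dictionary keys, and any label that is not a key becomes a column of missing values. The sketch below uses a made-up dictionary `d_demo` and a hypothetical `Phone` label to show this:

```python
import pandas as pd

d_demo = {"Names": ["Arif", "Hadeed"], "Age": [50, 22]}

# 'Age' and 'Names' are picked from the dictionary in the requested order;
# 'Phone' is not a key, so that column is filled with NaN
df_sel = pd.DataFrame(d_demo, columns=["Age", "Names", "Phone"])
print(df_sel)
```

To rename columns of a dictionary-based dataframe, assign to `df.columns` after creation, as done above.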


### d. Creating a Dataframe from Dictionary of Pandas Series
One can think of a dataframe as a dictionary of Pandas Series: 
- `Keys` are column names, and 
- `Values` are Series objects for the respective columns.

In [19]:

# Use the name d2 rather than dict, so that Python's built-in dict type is not shadowed
d2 = {
    "Names": pd.Series(['Arif', 'Hadeed', 'Mujahid']),
    "Age": pd.Series([50, 22, 18]),
    "Addr": pd.Series(['Lahore', 'Islamabad','Karachi']),
}
df = pd.DataFrame(d2)
df

|   | Names | Age | Addr |
|---|---|---|---|
| 0 | Arif | 50 | Lahore |
| 1 | Hadeed | 22 | Islamabad |
| 2 | Mujahid | 18 | Karachi |


**Note from the above output that every Series object becomes the data of the corresponding column**

In [32]:
# Use the name d3 rather than dict, so that Python's built-in dict type is not shadowed
d3 = {
    "Names": pd.Series(['Arif', 'Hadeed', 'Mujahid', 'Maaz'], index=['a','b','c', 'd']),
    "Age": pd.Series([50, 22,np.nan, 18], index=['a','b','c','d']),
    "Addr": pd.Series(['Lahore', '', 'Peshawer','Karachi'], index=['a','b','c', 'd']),
}
df = pd.DataFrame(d3)
df

|   | Names | Age | Addr |
|---|---|---|---|
| a | Arif | 50.0 | Lahore |
| b | Hadeed | 22.0 |  |
| c | Mujahid | NaN | Peshawer |
| d | Maaz | 18.0 | Karachi |


- **In the above code and its output, note that every Series object has four data values and four corresponding indices**
- **Also note that the Age series contains a NaN value, while the Addr series contains an empty string. The empty string is displayed as a blank cell, but Pandas does not treat it as a missing value.**
- **Another point to note is that the row indices of the three series match exactly, in number as well as in sequence/value.**
- **A question arises: what if the indices of the series are different? See the following code to understand this concept.**

In [37]:
# Use the name d4 rather than dict, so that Python's built-in dict type is not shadowed
d4 = {
    "Names": pd.Series(['Arif', 'Hadeed', 'Mujahid', 'Maaz'], index=['a','b','c', 'd']),
    "Age": pd.Series([50, 22,np.nan, 18], index=['a','x','y','d']),
    "Addr": pd.Series(['Lahore', '','Karachi'], index=['a', 'd', 'x']),
}
df = pd.DataFrame(d4)
df

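|   | Names | Age | Addr |
|---|---|---|---|
| a | Arif | 50.0 | Lahore |
| b | Hadeed | NaN | NaN |
| c | Mujahid | NaN | NaN |
| d | Maaz | 18.0 |  |
| x | NaN | 22.0 | Karachi |
| y | NaN | NaN | NaN |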


- **In the above code and its output, note that the first Series object has four data values and four corresponding indices. The second has four indices and three values plus one NaN. The third has three indices and two values plus one empty string.**
- **The resulting Dataframe has six rows and three columns, because its row index is the union of the indices of the three series ('a', 'b', 'c', 'd', 'x', 'y').** 
    - **For index 'a' we have a value in all three series objects or columns**
    - **For index 'b' we have a value in the first series object, and NaN in the second and third columns, since the second and third series objects have no value corresponding to row index 'b'**
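
The alignment behaviour described above can be reproduced on a smaller scale. In this sketch the series `s1` and `s2` and their values are invented for illustration. The dataframe's row index is the union of both indices, and every label a series lacks is filled with NaN:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

# Rows are aligned on index labels, not on position
df_aligned = pd.DataFrame({"s1": s1, "s2": s2})
print(df_aligned)
print(list(df_aligned.index))   # ['a', 'b', 'c']

# Passing index= keeps only the requested labels
df_only_b = pd.DataFrame({"s1": s1, "s2": s2}, index=["b"])
print(df_only_b)
```

Because NaN is a floating-point value, a column that receives NaN during alignment is stored as `float64`, which is why Age is displayed as 50.0 rather than 50 in the output above.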

## 3. Attributes of Pandas Dataframe
- Like Series, we can access properties/attributes of a dataframe by using dot `.` notation

In [38]:
# Create a two-dimensional nested Python list (note: this is a list of lists, not a NumPy array)
import numpy as np
a2DNumPyArray =[[1001, 'elephant', 500.0],
       [1002, 'elephant', 600.0],
       [1003, 'elephant', 350.0],
       [1004, 'tiger', np.nan],
       [1005, 'tiger', 320.0],
       [1006, 'tiger', 330.0],
       [1007, 'tiger', 290.0],
       [1008, np.nan, 310.0],
       [1009, 'zebra', 200.0],
       [1010, 'zebra', 220.0],
       [1011, 'zebra', 240.0],
       [1012, 'zebra', 230.0],
       [1013, 'zebra', 220.0],
       [1014, 'zebra', 100.0],
       [1015, 'zebra', 80.0],
       [1016, 'lion', 420.0],
       [1017, 'lion', 600.0],
       [1018, np.nan, 500.0],
       [1019, 'lion', 390.0],
       [1020, 'kangaroo', 410.0],
       [1021, 'kangaroo', 430.0],
       [1022, 'kangaroo', 410.0]
               ]

In [39]:
# Use the two-dimensional nested list to create a dataframe
import pandas as pd
list1 = ['uniq_id', 'animal', 'water_need']
df = pd.DataFrame(data=a2DNumPyArray, columns=list1)
df

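|   | uniq_id | animal | water_need |
|---|---|---|---|
| 0 | 1001 | elephant | 500.0 |
| 1 | 1002 | elephant | 600.0 |
| 2 | 1003 | elephant | 350.0 |
| 3 | 1004 | tiger | NaN |
| 4 | 1005 | tiger | 320.0 |
| 5 | 1006 | tiger | 330.0 |
| 6 | 1007 | tiger | 290.0 |
| 7 | 1008 | NaN | 310.0 |
| 8 | 1009 | zebra | 200.0 |
| 9 | 1010 | zebra | 220.0 |

*Only the first 10 of the 22 rows are shown here.*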


**Note:** 
- The column labels are not part of the data, so the row count is 22
- Similarly, the row indices are not part of the data, so the column count is 3
- **This can be observed in the `shape` attribute of the dataframe as shown below**

In [58]:
# display a tuple representing the dimensionality of the DataFrame.
df.shape   # rows, columns = df.shape


(22, 3)

In [40]:
# display row labels  of the dataframe
df.index

RangeIndex(start=0, stop=22, step=1)

In [41]:
# display column labels of a dataframe
df.columns

Index(['uniq_id', 'animal', 'water_need'], dtype='object')

In [42]:
# display data types of each column in the dataframe
df.dtypes

uniq_id         int64
animal         object
water_need    float64
dtype: object

In [43]:
# display a NumPy ndarray having all the values in the DataFrame, without the axes labels
df.values

array([[1001, 'elephant', 500.0],
       [1002, 'elephant', 600.0],
       [1003, 'elephant', 350.0],
       [1004, 'tiger', nan],
       [1005, 'tiger', 320.0],
       [1006, 'tiger', 330.0],
       [1007, 'tiger', 290.0],
       [1008, nan, 310.0],
       [1009, 'zebra', 200.0],
       [1010, 'zebra', 220.0],
       [1011, 'zebra', 240.0],
       [1012, 'zebra', 230.0],
       [1013, 'zebra', 220.0],
       [1014, 'zebra', 100.0],
       [1015, 'zebra', 80.0],
       [1016, 'lion', 420.0],
       [1017, 'lion', 600.0],
       [1018, nan, 500.0],
       [1019, 'lion', 390.0],
       [1020, 'kangaroo', 410.0],
       [1021, 'kangaroo', 430.0],
       [1022, 'kangaroo', 410.0]], dtype=object)
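
The `dtype=object` at the end of the output above appears because a NumPy array holds a single dtype, so a dataframe with mixed column types falls back to generic Python objects. A minimal sketch, using a made-up frame `df_v`, of how selecting only the numeric columns gives a numeric array instead (`to_numpy()` is the method counterpart of the `values` attribute):

```python
import pandas as pd

df_v = pd.DataFrame({
    "id": [1, 2],
    "animal": ["lion", "zebra"],
    "need": [1.5, 2.5],
})

all_vals = df_v.to_numpy()                   # mixed columns -> object array
num_vals = df_v[["id", "need"]].to_numpy()   # int + float columns -> float64 array

print(all_vals.dtype)   # object
print(num_vals.dtype)   # float64
```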

In [None]:
# return number of dimensions of the underlying data
df.ndim

In [None]:
# return number of elements in the underlying data
df.size

In [None]:
# transpose the DataFrame, i.e., swap its row indices and column labels
df.T

In [None]:
# count the non-NA values in each column
df.count()

In [44]:
# return True if the dataframe is empty
df.empty

False
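
Several of the cells above have no recorded output, so here is a small sketch of what these attributes return. It uses a tiny hypothetical frame `df_attr` rather than the animals data, so the values are easy to verify by hand:

```python
import numpy as np
import pandas as pd

df_attr = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, np.nan, 1.5]})

print(df_attr.ndim)               # 2: a dataframe is always two-dimensional
print(df_attr.size)               # 6: rows * columns, NaN cells included
print(df_attr.T.shape)            # (2, 3): the transpose swaps rows and columns
print(df_attr.count().tolist())   # [3, 2]: non-NA values per column
print(pd.DataFrame().empty)       # True: no rows and no columns
```

Note that `size` counts every cell, while `count()` skips missing values, so the two can disagree when the data contains NaN.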

## 4. Some Basic Info Gathering Methods

In [45]:
import numpy as np
import pandas as pd

a2DNumPyArray =[
       [1001, 'elephant', 500.0],
       [1002, 'elephant', 600.0],
       [1003, 'elephant', 350.0],
       [1004, 'tiger', np.nan],
       [1005, 'tiger', 320.0],
       [1006, 'tiger', 330.0],
       [1007, 'tiger', 290.0],
       [1008, np.nan, 310.0],
       [1009, 'zebra', 200.0],
       [1010, 'zebra', 220.0],
       [1011, 'zebra', 240.0],
       [1012, 'zebra', 230.0],
       [1013, 'zebra', 220.0],
       [1014, 'zebra', 100.0],
       [1015, 'zebra', 80.0],
       [1016, 'lion', 420.0],
       [1017, 'lion', 600.0],
       [1018, np.nan, 500.0],
       [1019, 'lion', 390.0],
       [1020, 'kangaroo', 410.0],
       [1021, 'kangaroo', 430.0],
       [1022, 'kangaroo', 410.0]
       ]

list1 = ['uniq_id', 'animal', 'water_need']
df = pd.DataFrame(data=a2DNumPyArray, columns=list1)
df

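|   | uniq_id | animal | water_need |
|---|---|---|---|
| 0 | 1001 | elephant | 500.0 |
| 1 | 1002 | elephant | 600.0 |
| 2 | 1003 | elephant | 350.0 |
| 3 | 1004 | tiger | NaN |
| 4 | 1005 | tiger | 320.0 |
| 5 | 1006 | tiger | 330.0 |
| 6 | 1007 | tiger | 290.0 |
| 7 | 1008 | NaN | 310.0 |
| 8 | 1009 | zebra | 200.0 |
| 9 | 1010 | zebra | 220.0 |

*Only the first 10 of the 22 rows are shown here.*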


In [46]:
df.count()

uniq_id       22
animal        20
water_need    21
dtype: int64
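
Since `count()` reports only the non-NA values, the number of missing values per column is the complement. A minimal sketch, on an invented frame `df_na`, that obtains it directly with `isna().sum()`:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    "animal": ["lion", None, "zebra"],
    "water_need": [420.0, np.nan, 200.0],
})

present = df_na.count()        # non-NA values per column
missing = df_na.isna().sum()   # NA values per column

print(missing)
print((present + missing).tolist())   # [3, 3]: adds up to the number of rows
```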

In [47]:
# This method prints information about a DataFrame, including the index dtype, the columns and their dtypes,
# the non-null counts, and the memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   uniq_id     22 non-null     int64  
 1   animal      20 non-null     object 
 2   water_need  21 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 656.0+ bytes


In [49]:
# If no arguments are passed, it displays the descriptive statistics of the numerical columns of the dataframe
df.describe()

|   | uniq_id | water_need |
|---|---|---|
| count | 22.0 | 21.0 |
| mean | 1011.5 | 340.47619 |
| std | 6.493587 | 143.682852 |
| min | 1001.0 | 80.0 |
| 25% | 1006.25 | 230.0 |
| 50% | 1011.5 | 330.0 |
| 75% | 1016.75 | 420.0 |
| max | 1022.0 | 600.0 |


**Since the `df.describe()` method returns a dataframe, all dataframe attributes can be accessed on its result using dot notation. This is called chaining, and it is very important to understand because we will be using it extensively in upcoming sessions**

In [56]:
df.describe().shape

(8, 2)

In [53]:
df.describe().index

Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

In [54]:
df.describe().columns

Index(['uniq_id', 'water_need'], dtype='object')

In [55]:
df.describe().values

array([[  22.        ,   21.        ],
       [1011.5       ,  340.47619048],
       [   6.49358658,  143.68285181],
       [1001.        ,   80.        ],
       [1006.25      ,  230.        ],
       [1011.5       ,  330.        ],
       [1016.75      ,  420.        ],
       [1022.        ,  600.        ]])
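
Chaining is not limited to attributes. As a sketch, a single statistic can be pulled out of the summary by chaining a `.loc[row_label, column_label]` lookup onto `describe()`. The frame `df_c` and its values are made up for this example:

```python
import pandas as pd

df_c = pd.DataFrame({"water_need": [100.0, 200.0, 300.0]})

# describe() returns a dataframe, so .loc can be chained onto it directly
mean_need = df_c.describe().loc["mean", "water_need"]
median_need = df_c.describe().loc["50%", "water_need"]

print(mean_need)     # 200.0
print(median_need)   # 200.0
```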

In [57]:
df.describe(include='all')

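|   | uniq_id | animal | water_need |
|---|---|---|---|
| count | 22.0 | 20 | 21.0 |
| unique | NaN | 5 | NaN |
| top | NaN | zebra | NaN |
| freq | NaN | 7 | NaN |
| mean | 1011.5 | NaN | 340.47619 |
| std | 6.493587 | NaN | 143.682852 |
| min | 1001.0 | NaN | 80.0 |
| 25% | 1006.25 | NaN | 230.0 |
| 50% | 1011.5 | NaN | 330.0 |
| 75% | 1016.75 | NaN | 420.0 |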
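
The `unique`, `top` and `freq` rows above apply only to non-numeric columns, which is why they are empty for `uniq_id` and `water_need`. As a hedged sketch on an invented frame `df_d`, describing just the text column returns only those categorical statistics:

```python
import pandas as pd

df_d = pd.DataFrame({
    "animal": ["zebra", "zebra", "lion"],
    "water_need": [200.0, 220.0, 420.0],
})

# Describing a non-numeric column gives count, unique, top and freq
cat_stats = df_d[["animal"]].describe()
print(cat_stats)
```

Here `top` is the most frequent value and `freq` is how many times it occurs.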
