---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.9 (Pandas-01)</h1>

## _Overview of Pandas Series Data Structure.ipynb_

#### Read about Pandas Data Structures: https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro

## Learning agenda of this notebook

1. Overview of Python Pandas library and its data structures
2. Creating a Series
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
3. Attributes of a Pandas Series
4. Understanding Index in a Series and its usage
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment

<img align="right" width="500" height="500"  src="images/pandas-apps.png"  >

## 1. Overview of Pandas
- **Pandas** is an open-source Python library built on NumPy that provides easy-to-use data structures and data analysis tools. Its name is derived from "panel data", an econometrics term for multidimensional structured data sets. Wes McKinney began developing it in 2008.
- Data scientists use Pandas for tasks such as:
    - Reading, writing, and downloading files of different formats like CSV, JSON, Excel, HTML, etc.
    - Filtering and Modifying data based on multiple conditions
    - Attribute Generation (e.g., ID generation) 
    - Identifying and removing null values and duplicates
    - Imputation (replacement of missing observations by using statistical algorithms) 
    - Cutting, Splitting and Merging
    - Sorting and aggregating
    - Normalisation, standardisation, scaling, and pivoting
    - Data Partitioning (create training + validation + test data set)
- **Data Structures:**
    - **Series:** A one-dimensional labeled array whose values all share a single data type. By default, the labels are integers starting from zero.
    - **DataFrame:** A two-dimensional labeled data structure (like an SQL table) whose columns may have different data types. It has both a row index and a column index.
    - Both of these data structures are *value-mutable*: the values they hold can be changed in place.
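
To make the two data structures and the idea of value-mutability concrete, here is a minimal sketch. The names `marks` and `df` and the sample values are made up for illustration.

```python
import pandas as pd

# A Series: one labeled column of values sharing a single dtype
marks = pd.Series([71, 85, 90], index=['Arif', 'Rauf', 'Maaz'], name='marks')

# A DataFrame: several columns, each with its own dtype, sharing one row index
df = pd.DataFrame({'marks': [71, 85, 90], 'passed': [True, True, True]},
                  index=['Arif', 'Rauf', 'Maaz'])

# Value-mutable: an existing value can be changed in place
marks['Rauf'] = 88
df.loc['Rauf', 'marks'] = 88

print(marks)
print(df.dtypes)   # one dtype per column
```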

In [1]:
# To install this library in Jupyter notebook
import sys
!{sys.executable} -m pip install pandas --quiet

In [2]:
import pandas as pd
pd.__version__ , pd.__path__

('1.3.4',
 ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas'])

## 2. Creating a Series
<img align="right" width="500" height="500"  src="images/series-anatomy.png"  >

- A Series is a one-dimensional array containing a sequence of values of any data type (int, float, string, Python object, etc.), which by default have integer labels starting from zero.
- A Series has exactly one data type (dtype), such as int, float, or object. If the items are of different types, pandas converts them to a common type: for example, integers mixed with floats become `float64`, and numbers mixed with strings become `object`.
- We can imagine a Pandas Series as a column in a spreadsheet.
- Every Series object has an index that associates a label with every item. The labels can be integers (the default) or strings. The index is used for three purposes:
    - Identification
    - Selection
    - Alignment
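- Index labels must be hashable, and the index must have the same length as the data. Labels do not have to be unique, although unique labels keep lookups unambiguous. If no index is passed, it defaults to a `RangeIndex` holding the same integers as `np.arange(n)`.
- You can create a Series using the `pd.Series()` constructor from a: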
    - Python list 
    - NumPy array
    - Python Dictionary
    - A scalar value

**```pd.Series(data, index, dtype, name)```**
- data: A Python sequence, an ndarray, a Python dictionary, or a scalar value.
- index: The labels to associate with the values.
- dtype: A valid NumPy data type for the values.
- name: An optional name for the Series.

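To create an empty Series, use `s1 = pd.Series(dtype='float64')`. Passing an explicit `dtype` avoids the warning that some pandas 1.x versions emit for a bare `pd.Series()`.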
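
To see how pandas settles on a single dtype when the items have different types, the following minimal sketch builds three small Series. The variable names are illustrative only.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])              # all integers
mixed_num = pd.Series([1, 2.5, 3])       # integers and a float
mixed_obj = pd.Series([1, 'two', 3.0])   # numbers and a string

print(ints.dtype)        # an integer dtype
print(mixed_num.dtype)   # float64: the integers are converted to floats
print(mixed_obj.dtype)   # object: no common numeric type exists
```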

### a. Creating a Series from Python List

In [2]:
# Creating Series from a Python List
import pandas as pd
import numpy as np
list1 = ['Arif', 'Rauf', 'Maaz', '','Hadeed']

# When index is not provided, it creates an index for the data starting from zero and with a step size of one.
s = pd.Series(data=list1)
print(s)
print(type(s))

0      Arif
1      Rauf
2      Maaz
3          
4    Hadeed
dtype: object
<class 'pandas.core.series.Series'>


Observe that the output is shown in two columns: the index is on the left and the data values are on the right. If we do not explicitly specify an index while creating a Series, the index labels default to 0 through N - 1, where N is the number of data elements.

**You can explicitly specify an index for a Series object, which can be either int or string type.**

In [3]:
# When index labels are passed with the array, then the length of the index and array must be of the same size, 
# else it will result in a ValueError

list1 = ['Arif', 'Rauf', 'Maaz', 'Hadeed']
label1 = ['MS01', 'MS02', 'MS03', 'MS02']

s = pd.Series(data=list1, index=label1)
print(s)
print(type(s))

MS01      Arif
MS02      Rauf
MS03      Maaz
MS02    Hadeed
dtype: object
<class 'pandas.core.series.Series'>


In [4]:
# Creating a Series from a Python list that contains NaN
# np.nan is the IEEE 754 floating-point representation of Not a Number.
# It acts as a placeholder for missing numerical values in the array.

list1 = [1, 2.7, 3, 4, 5.9, 6, np.nan]
s = pd.Series(data=list1)
print(s)
print(type(s))

0    1.0
1    2.7
2    3.0
3    4.0
4    5.9
5    6.0
6    NaN
dtype: float64
<class 'pandas.core.series.Series'>


In [13]:
# You can assign names to series
list1 = ['Arif', 'Rauf', '', 'Hadeed']
label1 = ['MS01', 'MS02', 'MS03', 'MS04']
s = pd.Series(data=list1, index=label1, name='myseries1')
print(s)
print(type(s))

MS01      Arif
MS02      Rauf
MS03          
MS04    Hadeed
Name: myseries1, dtype: object
<class 'pandas.core.series.Series'>


### b. Creating a Series from NumPy Array

In [9]:
# Creating a Series from a Python range object
import numpy as np
import pandas as pd
s = pd.Series(data=range(4))
s

0    0
1    1
2    2
3    3
dtype: int64

In [17]:
# Creating Series from a Numpy Array
import numpy as np
import pandas as pd
arr1 = np.array([62,22,98,44])

s = pd.Series(data=arr1, dtype='float64')
s

0    62.0
1    22.0
2    98.0
3    44.0
dtype: float64

### c. Creating a Series from Python Dictionary

In [12]:
# Creating Series from a Python Dictionary
import pandas as pd
import numpy as np
my_dict = {
    'name':"Arif", 
    'gender':"Male", 
    'Role':"Teacher", 
    'subject':"Data Science"}
s = pd.Series(data=my_dict)
print(s)
print(type(s))

name               Arif
gender             Male
Role            Teacher
subject    Data Science
dtype: object
<class 'pandas.core.series.Series'>


**When you create a Series from a dictionary, the keys automatically become the index and the values become the data.**
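
When a dictionary is combined with an explicit `index`, pandas looks each label up among the keys. The sketch below illustrates this. The label `'city'` is a made-up key that is deliberately absent from the dictionary.

```python
import pandas as pd

my_dict = {'name': 'Arif', 'gender': 'Male', 'Role': 'Teacher'}

# Only the requested keys are kept, in the requested order.
# 'city' is not a key of my_dict, so its value is NaN.
s = pd.Series(data=my_dict, index=['Role', 'name', 'city'])
print(s)
```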

### d. Creating a Series from Scalar value

In [16]:
# Create Series from a Scalar value
s = pd.Series(data=25)
print(s)
print(type(s))

0    25
dtype: int64
<class 'pandas.core.series.Series'>
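
A scalar can also be combined with an explicit index, in which case the value is repeated once per label. This is a minimal sketch with made-up labels.

```python
import pandas as pd

# The scalar 25 is repeated for every label in the index
s = pd.Series(data=25, index=['a', 'b', 'c'])
print(s)
```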


In [20]:
# An empty Series
s=pd.Series(dtype='float64')
print(s)
print(type(s))

Series([], dtype: float64)
<class 'pandas.core.series.Series'>


## 3. Attributes of Pandas Series
- We can access certain properties of a Series, called attributes, by appending the attribute name to the Series name using dot (`.`) notation.

In [21]:
import pandas as pd
import numpy as np

my_dict = {0:"Rauf", 1:"Arif", 2:"Maaz", 3:"Hadeed", 4:"Mujahid", 5:"Mohid", 6:"Jamil"}
s = pd.Series(my_dict, name="ser1")
s

0       Rauf
1       Arif
2       Maaz
3     Hadeed
4    Mujahid
5      Mohid
6      Jamil
Name: ser1, dtype: object

In [22]:
s.name

'ser1'

In [23]:
# return the list of indices and its datatype
s.index

Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

In [24]:
# return the list of values and its datatype
s.values

array(['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid', 'Mohid', 'Jamil'],
      dtype=object)

In [25]:
# return the type of underlying data
s.dtype

dtype('O')

In [26]:
# return a tuple of shape of underlying data
s.shape

(7,)

In [27]:
# return the number of bytes of underlying data (here each object element is stored as an 8-byte reference)
s.nbytes

56

In [28]:
# return number of elements in the underlying data
s.size

7

In [29]:
# return number of dimensions of underlying data
s.ndim

1

In [30]:
# return true if there are NaN values in the data
s.hasnans

False

In [31]:
# return true if the series object is empty
s.empty

False

## 4. Understanding Index in a Series
- Every Series object has an index that associates a label with every item. The labels can be integers (the default) or strings. Labels must be hashable, and the index must have the same length as the data. Labels do not have to be unique.
- The index is used for three purposes
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment
- Before we discuss these three uses of indices, let us dig a little deeper to get a clear understanding of the index of a Series object.



- The Pandas series object supports both integer-based and label-based indexing and provides a host of methods for performing operations involving the index.
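
The difference between label-based and integer-position-based access can be sketched with the `.loc` and `.iloc` accessors. The Series `s` and `t` below are illustrative.

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13], index=['num1', 'num2', 'num3', 'num4'])

print(s.loc['num3'])   # label-based: the value whose label is 'num3'
print(s.iloc[2])       # position-based: the third value

# With integer labels the two can disagree
t = pd.Series([10, 11, 12, 13], index=[3, 2, 1, 0])
print(t.loc[0])        # label 0 belongs to the last element
print(t.iloc[0])       # position 0 is the first element
```

Using `.loc` and `.iloc` explicitly removes any doubt about whether a number is meant as a label or as a position.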


### a. Changing Index of a Series Object
- If we create a Series object from a dictionary, the keys of the dictionary become the index.
- If we create a Series object from a list or NumPy array, the index defaults to the integers 0, 1, 2, ...
- Once a Series object is created, we can assign a new index to it. This is shown below:

In [48]:
import pandas as pd
import numpy as np

# defining a series containing ten data values from 25 to 34
list1 = list(range(25,35))

s = pd.Series(data=list1)
print(s)

# index attribute shows that index range for this series is from (0-9) with step value of 1
print(s.index)

0    25
1    26
2    27
3    28
4    29
5    30
6    31
7    32
8    33
9    34
dtype: int64
RangeIndex(start=0, stop=10, step=1)


**Changing index of a series to some random whole number**

In [50]:
# creating a NumPy array of 10 random integers from 100 to 199 (high is exclusive)
arr1 = np.random.randint(low = 100, high = 200, size = 10)

# Now let us set this array as the index of our series
s.index = arr1
s

158    25
137    26
133    27
181    28
178    29
162    30
161    31
111    32
128    33
183    34
dtype: int64

**Changing index of a series to a list of strings**

In [52]:
s = pd.Series(range(10, 15))
print(s)

# the index of series s is still the default RangeIndex
s.index

0    10
1    11
2    12
3    13
4    14
dtype: int64


RangeIndex(start=0, stop=5, step=1)

In [54]:
# Now let us change the index of series object by assigning its index attribute to a list of strings
s.index = ['num1', 'num2', 'num3', 'num4', 'num5']
print(s)

# you can check that the index type of series s is now object
s.index

num1    10
num2    11
num3    12
num4    13
num5    14
dtype: int64


Index(['num1', 'num2', 'num3', 'num4', 'num5'], dtype='object')
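
Because the index must have the same length as the data, assigning a list with too few or too many labels fails. The sketch below shows the error being caught. The flag `assignment_failed` is a made-up helper name used only for illustration.

```python
import pandas as pd

s = pd.Series(range(10, 15))            # five values

assignment_failed = False
try:
    s.index = ['num1', 'num2', 'num3']  # only three labels for five values
except ValueError as err:
    assignment_failed = True
    print('ValueError:', err)

# The original index is left untouched after the failed assignment
print(s.index)
```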

### b. First use of Index (Identification)
- Since every data value of a Series object has an associated index label, we can use that label to identify the data value.

**Identification using Integer Labels**

In [71]:
s = pd.Series(range(10, 15))
s

0    10
1    11
2    12
3    13
4    14
dtype: int64

In [72]:
s[2]

12

In [73]:
s[[2, 1, 3]]

2    12
1    11
3    13
dtype: int64

**Identification using String Labels**

In [74]:
labels = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(range(10, 15), index=labels)
s

num1    10
num2    11
num3    12
num4    13
num5    14
dtype: int64

In [75]:
s['num1']

10

In [77]:
s[['num3', 'num1']]

num3    12
num1    10
dtype: int64

### c. Second use of Index (Selection)
- You can select a subset of a Series object using slice notation `[start:stop:step]`.
    - If only stop is provided, the slice runs from the first element up to stop.
    - If only start is provided, the slice runs from start to the last element.
    - If both start and stop are provided, the slice runs from start up to stop.
    - If start, stop, and step are provided, the slice runs from start up to stop, taking every step-th element.
    - When the slice uses integer positions, stop is excluded. When it uses string labels, the stop label is included.
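
The rules above describe positional slicing. Slicing by label behaves differently at the end point, which the following minimal sketch makes explicit by using `.iloc` for positions and `.loc` for labels. The variable names are illustrative.

```python
import pandas as pd

labels = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(range(15, 20), index=labels)

by_position = s.iloc[1:3]          # positions 1 and 2; stop (3) is excluded
by_label = s.loc['num2':'num4']    # labels num2 to num4; the stop label is included

print(by_position)
print(by_label)
```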


**Selection/Filtering/Subsetting using Integer Labels**

In [64]:
s = pd.Series(range(15, 23))
s

0    15
1    16
2    17
3    18
4    19
5    20
6    21
7    22
dtype: int64

In [None]:
s[::]

In [65]:
s[2::]

2    17
3    18
4    19
5    20
6    21
7    22
dtype: int64

In [66]:
s[2:4]

2    17
3    18
dtype: int64

In [67]:
s[3:9]

3    18
4    19
5    20
6    21
7    22
dtype: int64

In [68]:
s[1:5:2]

1    16
3    18
dtype: int64

In [69]:
s[::-1]

7    22
6    21
5    20
4    19
3    18
2    17
1    16
0    15
dtype: int64

**Selection/Filtering/Subsetting using String Labels**

In [84]:
labels = ['num1', 'num2', 'num3', 'num4', 'num5', 'num6', 'num7']
s = pd.Series(range(15, 22), index=labels)
s

num1    15
num2    16
num3    17
num4    18
num5    19
num6    20
num7    21
dtype: int64

In [85]:
s[::]

num1    15
num2    16
num3    17
num4    18
num5    19
num6    20
num7    21
dtype: int64

In [89]:
s['num2':'num5']

num2    16
num3    17
num4    18
num5    19
dtype: int64

In [90]:
s['num2':'num5':2]

num2    16
num4    18
dtype: int64

In [92]:
s['num5':'num3':-1]

num5    19
num4    18
num3    17
dtype: int64

### d. Third use of Index (Alignment)
- If we perform basic mathematical operations like addition, subtraction, multiplication, division, etc., on two Series objects, the operation is done on each corresponding pair of elements. This is done by matching the indices of the two series objects.
- Let us understand this with examples

**Understanding Alignment with Series having Integer Labels**

In [105]:
s1 = pd.Series(range(5, 10), index=[0,1,2,3,4])
s2 = pd.Series(range(20, 25), index=[0,1,2,3,4])

In [106]:
print(s1)
print(s1.index)

0    5
1    6
2    7
3    8
4    9
dtype: int64
Int64Index([0, 1, 2, 3, 4], dtype='int64')


In [107]:
print(s2)
print(s2.index)

0    20
1    21
2    22
3    23
4    24
dtype: int64
Int64Index([0, 1, 2, 3, 4], dtype='int64')


In [108]:
s1 + s2

0    25
1    27
2    29
3    31
4    33
dtype: int64

**What if the two series have slightly different indices?**
- When a mathematical operation is performed on series with mismatched indices, the result is NaN for every label that appears in only one of them.
- Calling `s.add()`, `s.sub()`, `s.mul()`, or `s.div()` explicitly, instead of using the operators (`+, -, *, /`), is preferred when labels may be missing, because the `fill_value` parameter substitutes a specific value for the missing operand and gives a concrete result in place of NaN.


In [109]:
s1 = pd.Series(range(5, 10), index=[0,1,4,3,5])
s2 = pd.Series(range(20, 25), index=[0,1,2,3,4])

In [110]:
print(s1)
print(s1.index)

0    5
1    6
4    7
3    8
5    9
dtype: int64
Int64Index([0, 1, 4, 3, 5], dtype='int64')


In [111]:
print(s2)
print(s2.index)

0    20
1    21
2    22
3    23
4    24
dtype: int64
Int64Index([0, 1, 2, 3, 4], dtype='int64')


In [112]:
s1+s2

0    25.0
1    27.0
2     NaN
3    31.0
4    31.0
5     NaN
dtype: float64

In [None]:
s1.add(s2)

**Note that the result of the addition is NaN wherever a label has a value in only one of the two series. Passing `fill_value` to `add()` supplies the missing operand, as the next cell shows.**

In [114]:
s1.add(s2, fill_value=0)

0    25.0
1    27.0
2    22.0
3    31.0
4    31.0
5     9.0
dtype: float64

**Understanding Alignment with Series having String Labels**

In [116]:
labels1 = ['num1', 'num2', 'num3', 'num4', 'num5']
s1 = pd.Series(range(15, 20), index=labels1)

labels2 = ['num1', 'num2', 'num3', 'num8', 'num5']
s2 = pd.Series(range(15, 20), index=labels2)


In [117]:
s1

num1    15
num2    16
num3    17
num4    18
num5    19
dtype: int64

In [118]:
s2

num1    15
num2    16
num3    17
num8    18
num5    19
dtype: int64

In [119]:
s1+s2

num1    30.0
num2    32.0
num3    34.0
num4     NaN
num5    38.0
num8     NaN
dtype: float64

**Note that the result of the addition is NaN wherever a label has a value in only one of the two series, so it is better to use `add()` with the `fill_value` parameter.**

In [120]:
s1.add(s2, fill_value=0)

num1    30.0
num2    32.0
num3    34.0
num4    18.0
num5    38.0
num8    18.0
dtype: float64

**Students are advised to practice the following methods:**

`s3 = s1.sub(s2, fill_value=0)`

`s3 = s1.mul(s2, fill_value=0)`

`s3 = s1.div(s2, fill_value=0)`
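
As a starting point for this practice, here is a minimal sketch on two small Series with one unmatched label each. The names `diff`, `prod`, and `quot` and the choice of `fill_value=1` for multiplication are illustrative. Note that `fill_value=0` in a division produces `inf` wherever the divisor is the filled-in zero.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([15, 16, 18], index=['num1', 'num2', 'num4'])
s2 = pd.Series([15, 16, 18], index=['num1', 'num2', 'num8'])

diff = s1.sub(s2, fill_value=0)   # num4 -> 18 - 0, num8 -> 0 - 18
prod = s1.mul(s2, fill_value=1)   # a fill value of 1 leaves the unmatched value unchanged
quot = s1.div(s2, fill_value=0)   # num4 -> 18 / 0 (inf), num8 -> 0 / 18

print(diff)
print(prod)
print(quot)
```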



Conditional selection and the `reset_index()`, `pop()`, `drop()`, `append()`, and `update()` methods will be discussed when studying DataFrames. Note that `append()` was removed in pandas 2.0 in favor of `pd.concat()`.




####  Conditional Selection in a Series
- The Series object also supports Boolean (conditional) indexing.
- It is useful when you want to pick elements from a Series without knowing their exact position or label.
- You only need to know the condition the elements should satisfy, for example a numerical criterion.
- The condition produces a Boolean index, which is then applied to the Series object.
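
A minimal sketch of Boolean indexing is given here. The Series values, the labels, and the thresholds are made up for illustration.

```python
import pandas as pd

s = pd.Series([15, 22, 8, 31, 19], index=['a', 'b', 'c', 'd', 'e'])

mask = s > 18        # a Boolean Series with the same index as s
print(mask)

selected = s[mask]   # keeps only the elements where mask is True
print(selected)

# Conditions combine with & (and), | (or), ~ (not); each condition needs parentheses
in_range = s[(s > 10) & (s < 20)]
print(in_range)
```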



#### Resetting Index
- The `s.reset_index()` method generates a new DataFrame or Series with the index reset.
- One reason to use it on a Series object is that the current index is meaningless and needs to be reset to the default before another operation.

**```s1.reset_index(drop=False, inplace=False, level=None)```**
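
The effect of the `drop` parameter can be sketched as follows. The Series values, its labels, and the name `'value'` are illustrative.

```python
import pandas as pd

s = pd.Series([25, 26, 27], index=[158, 137, 133], name='value')

# drop=True discards the old labels and returns a Series with the default index
fresh = s.reset_index(drop=True)

# drop=False (the default) keeps the old labels as a column, so the result is a DataFrame
as_frame = s.reset_index()

print(fresh)
print(as_frame)
```

In both cases the original Series `s` is left unchanged, because `inplace` defaults to `False`.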

### NumPy Array vs Pandas Series
>- In a Series we can define our own labeled index to access the elements. The labels can be numbers or strings. NumPy array elements are accessed by integer position only.
>- In a Series the index labels can be in any order, including descending. In a NumPy array, positions always start at zero for the first element and are fixed.
>- If two Series are not aligned, NaN values are generated for the unmatched labels. NumPy arrays have no index alignment: operations pair elements by position, and arrays with incompatible shapes raise an error.
>- A Series requires more memory because it stores an index alongside the values. A NumPy array occupies less memory.
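
The comparison above can be checked with a short sketch. The arrays and labels are made up, and `memory_usage()` is used here as one reasonable way to measure the footprint of a Series including its index.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)          # NumPy pairs elements by position

s1 = pd.Series(a, index=['x', 'y', 'z'])
s2 = pd.Series(b, index=['z', 'y', 'x'])
total = s1 + s2       # pandas pairs elements by label, not by position
print(total)

# The Series stores an index in addition to the values
print(a.nbytes, s1.memory_usage())
```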
    
    