# Introduction to Series Data Structure

---

**Author:** Dr. Saad Laouadi  
**Copyright:** Dr. Saad Laouadi  

---

## License

**This material is intended for educational purposes only and may not be used directly in courses, video recordings, or similar without prior consent from the author. When using or referencing this material, proper credit must be attributed to the author.**

```text
#**************************************************************************
#* (C) Copyright 2024 by Dr. Saad Laouadi. All Rights Reserved.           *
#**************************************************************************                                                                    
#* DISCLAIMER: The author has used their best efforts in preparing        *
#* this content. These efforts include development, research,             *
#* and testing of the theories and programs to determine their            *
#* effectiveness. The author makes no warranty of any kind,               *
#* expressed or implied, with regard to these programs or                 *
#* to the documentation contained within. The author shall not            *
#* be liable in any event for incidental or consequential damages         *
#* in connection with, or arising out of, the furnishing,                 *
#* performance, or use of these programs.                                 *
#*                                                                        *
#* This content is intended for tutorials, online articles,               *
#* and other educational purposes.                                        *
#**************************************************************************
```

In [1]:
# Environment Setup 
import random
from pprint import pprint
from datetime import datetime

import pandas as pd

## Introduction to Data Structures

### Lists

Let’s begin with a basic data structure that most people are familiar with—lists. A list in Python is an ordered collection that can store different data types.

In [2]:
# generate random list of grades
random.seed(0)
grades = random.choices(range(20, 101, 5), k=6)
print(grades)

[90, 80, 55, 40, 60, 50]


In lists, we access elements by their position (index), and that index is always an integer. Remember that Python uses zero-based indexing, so the first element sits at index 0.

In the previous example, we can access the third element like this:

In [3]:
print(f"The third element is: {grades[2]}")

The third element is: 55


In [4]:
# Checking the list index
for index, elem in enumerate(grades):
    print(f"Index is: {index} corresponding element: {elem}")

Index is: 0 corresponding element: 90
Index is: 1 corresponding element: 80
Index is: 2 corresponding element: 55
Index is: 3 corresponding element: 40
Index is: 4 corresponding element: 60
Index is: 5 corresponding element: 50


### Limitations of Lists in Python

While lists are flexible for storing multiple values, they have some limitations:
- **Lack of Labels**: Lists allow you to access elements by their position (index), but what if you need to assign meaningful labels to the data? For example, in a list of student grades, we can’t directly label each grade with the student’s name.
- **Non-integer Indexing**: In Python, list indices are limited to integers, starting from 0. If you want to use custom labels like strings or other objects as indices, lists don’t support that functionality.
- **No Metadata Support**: Lists don’t offer a direct way to associate additional metadata with the list itself. In the example above, `grades` is simply a variable name that refers to the list object, not a descriptive label stored with the data. This can be limiting if we want to store extra information, such as the subject name or the class for the grades.

Lists don’t provide a built-in mechanism to address these needs, which makes them less suited for complex data storage where meaningful labels or additional metadata are required.
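
As a small illustration of the first two limitations, the sketch below reuses the grade values printed earlier (hard-coded here so the snippet is self-contained) and shows that a list rejects a string label:

```python
# Grade values copied from the earlier output so the example stands alone.
grades = [90, 80, 55, 40, 60, 50]

# Positional access works with integers.
third = grades[2]

# A string label is rejected: list indices must be integers or slices.
try:
    grades["Adam"]
except TypeError as exc:
    error_message = str(exc)

print(third)
print(error_message)
```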

### Dictionaries 

We can overcome the previously mentioned shortcomings of the list data structure in Python by using a dictionary (dict) instead. A dictionary in Python is a collection of key-value pairs that preserves insertion order (since Python 3.7), where both the keys and values can be of different data types.

Here’s how a dictionary addresses the issues we faced with lists:

1.	**Labeling Data**: In dictionaries, we can assign meaningful labels (keys) to each data element (value), making it easier to access and understand the data. This is useful when we want to pair student names with their grades.
2.	**Flexible Indexing**: Unlike lists, dictionaries allow us to use any hashable object as keys (like strings, integers, or tuples). This means we’re not limited to using integer indices as with lists.
3.	**Storing Metadata**: A dictionary can hold additional metadata alongside the data, making it more descriptive and flexible.
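
The flexible-indexing point can be sketched as follows. The `scores` dictionary and its keys are made up for illustration; the only requirement on a key is that it is hashable:

```python
# Hypothetical data: keys may be strings, integers, tuples, and so on.
scores = {
    "Adam": 80,
    2021: 75000,
    ("Math", "Term 1"): 92,
}
tuple_lookup = scores[("Math", "Term 1")]

# Unhashable objects such as lists cannot be used as keys.
try:
    scores[["Math", "Term 1"]] = 88
except TypeError as exc:
    error_message = str(exc)

print(tuple_lookup)
print(error_message)
```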

### Example: Using a Dictionary to Store Student Grades

Let’s enhance the previous example by using a dictionary to store both student names and their grades.

In [5]:
# Creating a dictionary to store student names as keys and their grades as values
random.seed(0)
grades = {
    "Adam": random.choice(range(20, 101, 5)),
    "Chris": random.choice(range(20, 101, 5)),
    "David": random.choice(range(20, 101, 5)),
    "Micheal": random.choice(range(20, 101, 5)),
    "Denis": random.choice(range(20, 101, 5)),
    "Frank": random.choice(range(20, 101, 5))
}
pprint(grades)

{'Adam': 80, 'Chris': 85, 'David': 25, 'Denis': 100, 'Frank': 95, 'Micheal': 60}


We can access a value in a **dict** object using the `[...]` operator or the `get()` method:

In [6]:
# Accessing a student's grade using their name with []
print(f"Adam's grade: {grades['Adam']}")

# Using the get() method
print(f"Denis's grade: {grades.get('Denis')}")

Adam's grade: 80
Denis's grade: 100


### Example: Storing Annual Income Data with Duplicate Keys

Let’s say we want to track the annual income of a business over several years. In some cases, you might want to store multiple income values for the same year due to corrections or updated records. However, because dictionary keys must be unique, a standard Python dictionary won’t allow you to store multiple entries with the same key (year in this case).

In [7]:
# A dictionary storing income over several years
income = {
    2018: 50000,
    2019: 60000,
    2020: 65000,
    2021: 70000,
    2021: 75000  # Duplicate key!
}

print(income)

{2018: 50000, 2019: 60000, 2020: 65000, 2021: 75000}


In this case, the second entry for 2021 overwrites the first one. As a result, the original value for 2021 (70000) is lost, and only the latest entry (75000) remains. This shows that dictionaries don’t allow duplicate keys, and it can lead to data loss if you attempt to store multiple values for the same key.

### Overcoming this Limitation

To handle situations where duplicate keys are needed, you can use an alternative approach, such as storing lists or tuples as values. However, this brings back the limitations of lists discussed earlier.

Here is how we can store the income values for each year as a list to allow multiple entries for the same key.

In [8]:
# Dictionary with lists as values to store multiple income records for the same year
income = {
    2018: [50000],
    2019: [60000],
    2020: [65000],
    2021: [70000, 75000]  # Multiple values for the same year
}

# Accessing income data for 2021
print(f"Income in 2021: {income[2021]}")

Income in 2021: [70000, 75000]
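
When the records arrive one at a time, the same structure can be built incrementally. The following is a minimal sketch using `collections.defaultdict` from the standard library; the `records` list of `(year, amount)` pairs is an assumed input format, not part of the original example:

```python
from collections import defaultdict

# Assumed input: one (year, amount) pair per record, with 2021 repeated.
records = [
    (2018, 50000),
    (2019, 60000),
    (2020, 65000),
    (2021, 70000),
    (2021, 75000),
]

# Each missing key starts as an empty list, so repeated years accumulate.
income_by_year = defaultdict(list)
for year, amount in records:
    income_by_year[year].append(amount)

print(dict(income_by_year))
```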


### Adding Metadata 

We can add another key-value pair to the dict object, for example `"name": "Income"`, to the previous example. However, this pair is treated as just another element rather than as metadata for the object, because dictionaries have no separate place to store metadata. Let us try this approach first and then address the issue in the upcoming sections:

In [9]:
# Dictionary with the dataset name stored as an extra key alongside the yearly values
income = {
    2018: 50000,
    2019: 60000,
    2020: 65000,
    2021: 75000,
    "name": "Income"
}

# Access the name key
print(f"The name of data: {income.get('name')}")

The name of data: Income


#### Use a Custom Class for Metadata

One way to overcome the previous issue is to create a custom class that wraps around the dictionary and allows you to store additional metadata like a name.

In [10]:
class NamedDict:
    def __init__(self, name, data):
        self.name = name
        self.data = data

    def __repr__(self):
        return f'{self.name}: {self.data}'


income = NamedDict(name="Income Data", data={
    2018: 50000,
    2019: 60000,
    2020: 65000,
    2021: 75000,
})

print(income)  

Income Data: {2018: 50000, 2019: 60000, 2020: 65000, 2021: 75000}


#### Store Metadata Separately

You can also store metadata like the name separately from the dictionary:

In [11]:
# Storing metadata separately
income = {
    2018: 50000,
    2019: 60000,
    2020: 65000,
    2021: 70000,
}

# Metadata (like name) is stored separately
income_name = "Income Data"

# Access data
print(f"{income_name}: {income}")

Income Data: {2018: 50000, 2019: 60000, 2020: 65000, 2021: 70000}


### Combining Lists with Dictionaries

Suppose we want to store data and its associated metadata (such as labels or a name) together. A common approach in Python is to use a dictionary. Here’s an example where we store both the income values (data) and the corresponding years (index) as separate key-value pairs, along with a name for the dataset:

In [12]:
income = {
    "data": [50000, 60000, 65000, 70000, 75000],
    "index": [2018, 2019, 2020, 2021, 2021],
    "name": "Income Data"
}

# Accessing data and metadata separately
print(f"Data: {income['data']}")
print(f"Index: {income['index']}")
print(f"Name: {income['name']}")

Data: [50000, 60000, 65000, 70000, 75000]
Index: [2018, 2019, 2020, 2021, 2021]
Name: Income Data


Instead of accessing an entire value of the dict, we might be interested in a single element from one or more of the key-value pairs. Say we want the first element of data and the first element of index. We can retrieve them like this:

In [13]:
print(f"Data first element: {income['data'][0]}")
print(f"Index first element: {income['index'][0]}")

Data first element: 50000
Index first element: 2018


Or we can just write a function that handles this for us. This is the purpose of the following function:

In [14]:
def get_element(data_dict, index, data_only=True):
    """
    Extract an element and its associated index from the given dictionary.
    
    Parameters:
    - data_dict (dict): The dictionary containing 'data' and 'index' keys.
    - index (any type): The label to look up in the 'index' list (not a position).
    - data_only (bool): If True, returns only the data value; if False, returns both index and data as a tuple.
    
    Returns:
    - data_value or (index, data_value): Depending on the 'data_only' flag.
      Returns None if index is not found or if 'data' and 'index' keys are missing.
      If the label occurs more than once, only the first match is returned.
    """
    index_list = data_dict.get('index')
    data_list = data_dict.get('data')

    if index_list is None or data_list is None:
        print("Error: Missing 'data' or 'index' in the dictionary.")
        return None
    
    try:
        index_position = index_list.index(index)
        data_value = data_list[index_position]
        
        if data_only:
            return data_value
        else:
            return (index, data_value)
    
    except ValueError:
        print(f"Error: Index {index} not found in the index list.")
        return None

In [15]:
# Get the element of 2021
print(f"The 2021 data: {get_element(income, 2021)}")
print(f"Data with index {get_element(income, 2021, False)}")

The 2021 data: 70000
Data with index (2021, 70000)
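
Note that 2021 appears twice in the index, yet only the first match (70000) comes back, because `list.index` stops at the first occurrence. A possible variant that returns every match is sketched below; `get_all_elements` is a hypothetical helper added for illustration and is not part of the original function:

```python
def get_all_elements(data_dict, label):
    """Return every data value whose index entry equals `label` (hypothetical helper)."""
    return [
        value
        for idx, value in zip(data_dict["index"], data_dict["data"])
        if idx == label
    ]


# Same structure as the income dictionary used above.
income = {
    "data": [50000, 60000, 65000, 70000, 75000],
    "index": [2018, 2019, 2020, 2021, 2021],
    "name": "Income Data",
}

matches = get_all_elements(income, 2021)
print(matches)
```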


We have almost overcome the main issue. We can go further and show that the index can be of any data type, not just an integer as with lists. Let us look at more examples:

In [16]:
# Dictionary with string-based 'index' and 'data'
person_info = {
    "data": [30, "Software Engineer", "New York", "Python"],
    "index": ["Age", "Job Title", "Location", "Skill"],
    "name": "Person Info"
}

result = get_element(person_info, "Job Title")
print(result)  

# Getting both the index and data (data_only=False)
result = get_element(person_info, "Skill", data_only=False)
print(result)  

Software Engineer
('Skill', 'Python')


And here is an example where the index can be a datetime object:

In [17]:
weather_data = {
    "data": [72, 75, 68, 70, 65],
    "index": [
        datetime(2023, 9, 1).date(),
        datetime(2023, 9, 2).date(),
        datetime(2023, 9, 3).date(),
        datetime(2023, 9, 4).date(),
        datetime(2023, 9, 5).date()
    ],
    "name": "Temperature Data"
}

result = get_element(weather_data, datetime(2023, 9, 3).date())
print(result)  

result = get_element(weather_data, datetime(2023, 9, 4).date(), data_only=False)
print(result)  

68
(datetime.date(2023, 9, 4), 70)


## Pandas Series 
A **Pandas Series** is a one-dimensional array-like structure that holds data of similar types (such as integers, floats, strings, or more complex objects). Unlike a Python list, which can hold mixed data types, a Series generally contains elements of the same data type, providing more consistency in data manipulation.

A Series object is similar to a Python list or NumPy array but offers additional features, such as labels (indexes) and metadata, which make it more powerful for data analysis. It consists of two primary components:

1.	**Data**: The actual data stored in the Series.
2.	**Index**: The labels associated with the data, allowing for more flexible and intuitive data retrieval.

Unlike standard lists or arrays, a Pandas Series allows users to specify custom labels for each data point, which enhances readability and accessibility. By default, if an index is not provided, a Pandas Series will use a numeric range starting from 0 as the index. However, the real power of Series comes when you define custom indices, such as dates, strings, or categories.
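
To make the default-versus-custom index distinction concrete, here is a minimal sketch with made-up grade values:

```python
import pandas as pd

# Without an explicit index, pandas assigns a RangeIndex starting at 0.
default_series = pd.Series([85, 92, 78])

# With custom labels, values can be retrieved by label.
labeled_series = pd.Series([85, 92, 78], index=["Math", "Science", "History"])

print(default_series.index)
print(labeled_series["Science"])
```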

### Example
While the dictionary can store the data and index separately, using a **Pandas Series** allows us to combine the data and the index into a more structured format and leverage powerful features for analysis.

We will reuse the previous annual income example to show the power of the Pandas Series data structure.

In [18]:
# Create a Pandas Series
income_series = pd.Series(data=income["data"], 
                          index=income["index"],
                          name=income["name"])

# Display the Series
print(income_series)

2018    50000
2019    60000
2020    65000
2021    70000
2021    75000
Name: Income Data, dtype: int64
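
Because the label 2021 is repeated, label-based lookup behaves differently for unique and duplicated labels. The sketch below rebuilds the same Series so it runs on its own and uses `.loc` for both cases:

```python
import pandas as pd

income_series = pd.Series(
    data=[50000, 60000, 65000, 70000, 75000],
    index=[2018, 2019, 2020, 2021, 2021],
    name="Income Data",
)

# A unique label gives back a single value.
single = income_series.loc[2018]

# A duplicated label gives back a Series holding every matching row.
repeated = income_series.loc[2021]

print(single)
print(repeated)
```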


### Example: Sales Data with Duplicated Dates

Here’s another example of a Pandas Series where the index can be duplicated. Let’s take the example of daily sales data for a store. A store might record multiple sales on the same day, which leads to duplicate date entries, that is, a non-unique index.

In [19]:
# Sales data
sales_data = {
    "data": [1500, 2300, 1750, 1200, 2200, 3000],
    "index": [
        datetime(2023, 9, 1), 
        datetime(2023, 9, 1),  # Duplicate date entry
        datetime(2023, 9, 2),
        datetime(2023, 9, 3),
        datetime(2023, 9, 4),
        datetime(2023, 9, 4)   # Duplicate date entry
    ],
    "name": "Store Sales"
}

# Creating a Pandas Series
sales_series = pd.Series(data=sales_data["data"], 
                         index=sales_data["index"],
                         name=sales_data["name"])

print(sales_series)

2023-09-01    1500
2023-09-01    2300
2023-09-02    1750
2023-09-03    1200
2023-09-04    2200
2023-09-04    3000
Name: Store Sales, dtype: int64
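
One common way to work with such duplicates is to aggregate them by index label. The following sketch recreates the sales Series and sums the entries per date with `groupby(level=0)`; summing is only one reasonable choice of aggregation here:

```python
from datetime import datetime

import pandas as pd

sales_series = pd.Series(
    data=[1500, 2300, 1750, 1200, 2200, 3000],
    index=[
        datetime(2023, 9, 1),
        datetime(2023, 9, 1),
        datetime(2023, 9, 2),
        datetime(2023, 9, 3),
        datetime(2023, 9, 4),
        datetime(2023, 9, 4),
    ],
    name="Store Sales",
)

# Check whether any index label is repeated.
has_duplicates = not sales_series.index.is_unique

# Group rows that share the same index label and add them up.
daily_totals = sales_series.groupby(level=0).sum()

print(has_duplicates)
print(daily_totals)
```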


### Example: Student Grades with Subject Names as Index

Consider a dataset where a student receives grades in different subjects. Each subject name is a string and will act as the index.

In [20]:
# Student grades data
grades_data = {
    "data": [85, 92, 78, 90, 88],
    "index": ["Math", "Science", "History", "English", "Art"],
    "name": "Student Grades"
}

# Creating a Pandas Series
grades_series = pd.Series(data=grades_data["data"],
                          index=grades_data["index"],
                          name=grades_data["name"])

print(grades_series)

Math       85
Science    92
History    78
English    90
Art        88
Name: Student Grades, dtype: int64
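
With string labels in place, values can be selected by subject name or filtered by condition. A short sketch, rebuilding the same Series so it is self-contained:

```python
import pandas as pd

grades_series = pd.Series(
    data=[85, 92, 78, 90, 88],
    index=["Math", "Science", "History", "English", "Art"],
    name="Student Grades",
)

# Single label: returns one value.
math_grade = grades_series["Math"]

# List of labels: returns a smaller Series in the requested order.
selected = grades_series[["Math", "Art"]]

# Boolean mask: keeps only the subjects with a grade of 90 or above.
high_grades = grades_series[grades_series >= 90]

print(math_grade)
print(selected)
print(high_grades)
```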


## Advantages of Using Pandas Series:

1.	**Combining Data and Index**: The data and index from the dictionary are combined into a single structure, which is more intuitive and easier to manage than a separate data and index list.
2.	**Handling Duplicate Index**: Pandas Series allows the use of non-unique indices, as in the income example above, where 2021 appears twice. This would be harder to handle in a regular dictionary.
3.	**Automatic Data Alignment**: When performing operations, Pandas will automatically align data based on the index, which allows for more flexibility when working with datasets that have irregular indices.
4.	**Built-in Statistical Methods**: A Pandas Series offers many built-in functions to quickly analyze the data, such as calculating the mean, sum, or even plotting the data.

### Why Use Pandas Series?

- The Series structure is more powerful than a simple dictionary for organizing and manipulating data.
- It makes data operations, such as aggregation, alignment, and indexing, easier and more efficient.
- You can apply various built-in functions directly to the Series (e.g., calculating the mean, standard deviation, or creating plots).
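
As a brief illustration of the built-in methods mentioned above, the sketch below computes a few summary values on the student grades data (plotting is left out to keep the example dependency-free):

```python
import pandas as pd

grades_series = pd.Series(
    data=[85, 92, 78, 90, 88],
    index=["Math", "Science", "History", "English", "Art"],
    name="Student Grades",
)

average = grades_series.mean()        # arithmetic mean of the grades
total = grades_series.sum()           # sum of all grades
best_subject = grades_series.idxmax() # index label of the highest grade

print(average, total, best_subject)
print(grades_series.describe())       # count, mean, std, min, quartiles, max
```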

## The Importance of Index in Pandas Philosophy

In Pandas, the index is not just an afterthought; it’s central to the way data is handled and manipulated. The philosophy behind Pandas heavily emphasizes the importance of index alignment for data operations. This means that, unlike in a standard array where the position of data is the only way to reference it, in a Pandas Series (or DataFrame), each piece of data is explicitly associated with an index label. The index serves as a map or identifier for each element.

Here’s why the index is so important in Pandas:

1.	**Data Alignment**: Pandas automatically aligns data based on index labels during operations. This allows you to perform operations on datasets that don’t have matching dimensions or labels, and Pandas will intelligently match the values based on the index. This reduces the possibility of errors due to misalignment, something you might encounter when using raw arrays or lists.
Example:

In [21]:
series1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
series2 = pd.Series([4, 5, 6], index=["B", "C", "D"])

result = series1 + series2
print(result)

A    NaN
B    6.0
C    8.0
D    NaN
dtype: float64


Here, Pandas aligns the data based on the index labels, rather than by position, providing NaN for unmatched indices.

2.	**Efficient Data Access**: The index allows for fast lookups and slicing. Whether you’re working with a time-series dataset or labeled categories, the index can be used to quickly retrieve and filter the data.
3.	**Enhanced Functionality**: Pandas leverages the index for various built-in functions, such as merging, joining, grouping, and sorting. The index also facilitates more advanced data operations such as resampling, reindexing, and pivoting. In time-series data, for instance, using a date index allows for efficient time-based resampling and slicing.
4.	**Multiple Indexing Levels**: Pandas also supports hierarchical indexing through `MultiIndex`, which allows for multi-dimensional indexing of data, making it possible to represent more complex datasets with multiple levels of data granularity.
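
Two of the points above can be sketched briefly. The first part reuses `series1` and `series2` and shows that `Series.add` with `fill_value=0` treats a label missing on one side as 0 instead of producing NaN. The second part uses made-up temperature values on a date index to show label-based slicing, which includes both endpoints:

```python
import pandas as pd

series1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
series2 = pd.Series([4, 5, 6], index=["B", "C", "D"])

# Alignment still happens by label; missing entries are filled with 0 first.
filled = series1.add(series2, fill_value=0)
print(filled)

# Illustrative daily temperatures indexed by consecutive dates.
temps = pd.Series(
    [72, 75, 68, 70, 65],
    index=pd.date_range("2023-09-01", periods=5, freq="D"),
)

# Label-based slicing on a date index includes both the start and end labels.
window = temps.loc["2023-09-02":"2023-09-04"]
print(window)
```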


In summary, the index is a core feature that makes Pandas more than just a data storage tool: it turns Pandas into a powerful library for data analysis. The index provides the foundation for intuitive, flexible, and efficient data manipulation, which makes it central to Pandas’ goal of keeping data analysis both fast and easy.


### Example: Monthly Revenue for a Company

Imagine we are tracking the monthly revenue for a company throughout a year. The months will be used as the index, and the revenue for each month will be the data. We also want to handle some real-world complexities, such as missing data and calculating key metrics like average revenue.

In [22]:
revenue_data = {
    "data": [12000, 15000, None, 18000, 17000, 16000, None, 17500, 20000, 21000, 19000, 25000],
    "index": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"],
    "name": "Monthly Revenue"
}

revenue_series = pd.Series(data=revenue_data["data"], index=revenue_data["index"], name=revenue_data["name"])

print(revenue_series)

January      12000.0
February     15000.0
March            NaN
April        18000.0
May          17000.0
June         16000.0
July             NaN
August       17500.0
September    20000.0
October      21000.0
November     19000.0
December     25000.0
Name: Monthly Revenue, dtype: float64
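
The missing months and the average revenue mentioned in the introduction to this example can be handled as sketched below. Filling the gaps with the mean is only one possible strategy and is chosen here purely for illustration:

```python
import pandas as pd

months = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
revenue_series = pd.Series(
    data=[12000, 15000, None, 18000, 17000, 16000,
          None, 17500, 20000, 21000, 19000, 25000],
    index=months,
    name="Monthly Revenue",
)

# Count the months with no recorded revenue.
missing_count = revenue_series.isna().sum()

# mean() skips NaN values by default, so it averages the 10 recorded months.
average = revenue_series.mean()

# Assumed strategy: replace the missing months with that average.
filled = revenue_series.fillna(average)

print(missing_count)
print(average)
print(filled[["March", "July"]])
```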
