# Pandas Series

## What is a Series?

In the pandas library, a Series is a one-dimensional, labelled array-like object that can hold data of any type (e.g., integers, floats, strings or even Python objects).

### Components of a Series

The series has two components:

1. **Index**: The labels associated with the data, which can be integers, strings or other objects
2. **Data**: The actual values stored in the series, e.g., integers, floats etc.

### Key Characteristics of a Series

* One-dimensional: It represents a single column or row of data
* Labeled: Each element in the Series is associated with a label from its *index* (labels are usually unique, but duplicates are allowed). This allows for easy data access and manipulation.
* Homogeneous: All elements in a Series are of the same data type
* Mutable: The data in a Series can be modified.
* Supports Vectorized Operations: Operations are performed on all elements of the Series at once without needing explicit loops.
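
The characteristics listed above can be seen in a few lines. This is a minimal sketch; the Series `prices` and its labels are made-up illustration data, not part of the notebook.

```python
import pandas as pd

# Hypothetical data: three prices labelled by product name
prices = pd.Series([10.0, 20.0, 30.0], index=['pen', 'book', 'bag'], name='price')

# Homogeneous: one dtype for the whole Series
print(prices.dtype)

# Labeled: access a value through its index label
print(prices.loc['book'])

# Mutable: overwrite a value in place
prices.loc['pen'] = 12.0

# Vectorized: one expression applies to every element, no explicit loop
discounted = prices * 0.5
print(discounted)
```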


### Analogy

Think of a Series as a single column in a spreadsheet. The column header would be the name of the Series, and each cell in that column would represent an element in the Series.

### Common Use Cases

* Data Representation: Storing and representing single columns of data
* Data Manipulation: Performing element-wise operations on data such as filtering, sorting, and applying functions.
* Data Analysis: Analyzing and extracting insights from data

In [1]:
import pandas as pd
import numpy as np

## Creating Pandas Series

* From List

In [2]:
numbers = [10, 20, 30, 40, 50] # list of numbers 
# Series from list
s_list = pd.Series(numbers, name='numbers') # default indexing
s_list

0    10
1    20
2    30
3    40
4    50
Name: numbers, dtype: int64

In [3]:
# custom index
custom_index = ['a', 'b', 'c', 'd', 'e']
s_list_custom_index = pd.Series(numbers, index=custom_index)
s_list_custom_index

a    10
b    20
c    30
d    40
e    50
dtype: int64

* From Numpy arrays

In [4]:
numpy_array = np.array([1.1, 2.4, 5.8, 8.9, 3.5]) # numpy array 
n_series = pd.Series(numpy_array)
n_series

0    1.1
1    2.4
2    5.8
3    8.9
4    3.5
dtype: float64

* From Dictionaries

In [5]:
class_dict = {
    'Sylvia': 45, 
    'Edwin': 47, 
    'Beth': 50, 
    'Vincent': 52, 
    'Adam': 54
}

d_series = pd.Series(class_dict)
d_series

Sylvia     45
Edwin      47
Beth       50
Vincent    52
Adam       54
dtype: int64

* From scalars

In [6]:
scalar_value = 5
sc_series = pd.Series(scalar_value, index=[1, 2, 3, 4, 5])
sc_series

1    5
2    5
3    5
4    5
5    5
dtype: int64

## Indexing and Slicing

In [7]:
np.random.seed(21) # reproducibility
random_numbers = np.random.randint(1, 100, size=50)
num_series = pd.Series(random_numbers)

* Positional indexing (integer-based)

Use `.iloc[]` to access elements by their numerical positions (0-based index). Works like accessing elements in a list or array.

In [8]:
# access a single element 
print(num_series.iloc[2]) # access the 3rd element 

57


In [9]:
# access multiple elements 
print(num_series.iloc[[0, 25, 49]]) # using a list of positions

0     74
25    99
49    31
dtype: int32


* Label-Based indexing

Use `.loc[]` to access elements by their index labels. It provides greater flexibility when working with labeled data.

In [10]:
# use the dict Series
d_series

Sylvia     45
Edwin      47
Beth       50
Vincent    52
Adam       54
dtype: int64

In [11]:
# using iloc
print(f'Beth scored: {d_series.iloc[2]}')

Beth scored: 50


In [12]:
# use loc to access single element 
print(f"Beth scored: {d_series.loc['Beth']}")

Beth scored: 50


In [13]:
# using loc to access multiple elements 
d_series.loc[['Sylvia', 'Beth', 'Adam']]

Sylvia    45
Beth      50
Adam      54
dtype: int64

* Slicing with Integer Ranges

Use `.iloc[start:stop:step]` to slice based on position. The slicing is exclusive of the stop position, similar to Python slicing.

In [14]:
# slicing with integer ranges
num_series.iloc[10:16] # elements at position 10, 11, 12, 13, 14, 15

10    64
11    45
12    62
13    49
14    85
15    60
dtype: int32

* Slicing with Labels

Use `.loc[start:stop]` to slice based on index labels. Unlike `.iloc`, it includes both the start and stop labels in the result.

In [15]:
d_series.loc['Sylvia':'Beth'] # Elements from label 'Sylvia' to 'Beth' (inclusive)

Sylvia    45
Edwin     47
Beth      50
dtype: int64

## Boolean Indexing in Pandas Series

Boolean indexing filters the elements of a `pandas.Series` based on conditions. Create a boolean mask (a Series of `True` and `False` values) by applying a condition to the Series, and then use that mask to filter the Series.

In [16]:
# January daily sales 
np.random.seed(12) 
sales = np.random.randint(100, 1000, size=27)
sales_series = pd.Series(sales)

In [17]:
# creating Bool Mask
# identify good days: elements greater than 700
mask = sales_series > 700

mask

0      True
1      True
2     False
3     False
4     False
5      True
6     False
7      True
8     False
9     False
10    False
11     True
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
dtype: bool

In [18]:
# Applying Boolean Indexing 
# Filter the good days where sales > 700
good_days = sales_series[mask]
good_days

0     943
1     767
5     742
7     944
11    753
15    730
dtype: int32

### Combining Conditions

* `&`: Logical AND
* `|`: Logical OR
* `~`: Logical NOT (negation)

In [19]:
# combined mask
# average days: sales > 300 and sales <= 700
combined_mask = (sales_series > 300) & (sales_series <= 700)
average_days = sales_series[combined_mask]
average_days

2     490
3     353
4     341
6     359
8     532
9     378
12    573
13    518
14    559
17    484
18    473
20    304
21    369
24    391
26    519
dtype: int32

In [20]:
# bad days: sales <= 300 
bad_days = sales_series[~(sales_series > 300)] # filter values NOT above 300
bad_days

10    149
16    174
19    204
22    182
23    189
25    200
dtype: int32

## Filtering with Conditions in Pandas Series

Filtering with conditions extracts the elements of a pandas Series that satisfy specific criteria. Conditions are applied element-wise and return a boolean mask (`True` or `False`), which is then used to filter the Series.

In [21]:
np.random.seed(30)
random_series = pd.Series(np.random.randint(1, 100, size=40))

In [22]:
# filter based on single condition
filter_1 = random_series[random_series > 45]
filter_1

2     46
3     46
7     54
9     47
13    66
14    50
15    46
16    62
20    77
23    63
25    47
26    46
27    65
28    63
33    51
35    56
37    59
38    92
39    79
dtype: int32

In [23]:
# filter on multiple conditions 
filter_2 = random_series[(random_series < 10) | (random_series > 90)]
filter_2

6      3
10     4
12     8
22     7
38    92
dtype: int32

In [24]:
# negating a condition
filter_3 = random_series[~(random_series > 40)]
filter_3

0     38
1     38
4     13
5     24
6      3
8     18
10     4
12     8
17    36
18    19
19    19
21    17
22     7
24    28
29    12
30    16
31    24
32    14
34    34
36    29
dtype: int32

## Data Manipulation

* Head and Tail methods

In [25]:
# head() - first 5 rows (default)
sales_series.head()

0    943
1    767
2    490
3    353
4    341
dtype: int32

In [26]:
# specify number of rows 
sales_series.head(8) # 8 rows

0    943
1    767
2    490
3    353
4    341
5    742
6    359
7    944
dtype: int32

In [27]:
# tail() - last 5 rows (default)
sales_series.tail()

22    182
23    189
24    391
25    200
26    519
dtype: int32

In [28]:
# specify number
sales_series.tail(7) # last 7 rows

20    304
21    369
22    182
23    189
24    391
25    200
26    519
dtype: int32

* Value counts

In [29]:
# Series with repetition
np.random.seed(40)
rand_num_rep = np.random.randint(10, 20, size=50)
rand_num_rep_series = pd.Series(rand_num_rep)

In [30]:
# count occurrences of each unique value
rand_num_rep_series.value_counts()

13    9
17    7
12    7
19    6
18    5
15    5
11    4
16    3
14    2
10    2
dtype: int64

* Unique Values

In [31]:
# get the unique values in the Series
rand_num_rep_series.unique()

array([16, 17, 15, 18, 12, 11, 13, 19, 10, 14])

In [32]:
# number of unique values in the series 
rand_num_rep_series.nunique()

10

* Sorting

In [33]:
# sorting in ascending order
sale_asc = sales_series.sort_values() # default argument: ascending = True
sale_asc.reset_index(drop=True, inplace=True) # renumber the index from 0
sale_asc

0     149
1     174
2     182
3     189
4     200
5     204
6     304
7     341
8     353
9     359
10    369
11    378
12    391
13    473
14    484
15    490
16    518
17    519
18    532
19    559
20    573
21    730
22    742
23    753
24    767
25    943
26    944
dtype: int32

In [34]:
# sort values by descending order
sales_desc = sales_series.sort_values(ascending=False)
sales_desc.reset_index(drop=True, inplace=True)
sales_desc

0     944
1     943
2     767
3     753
4     742
5     730
6     573
7     559
8     532
9     519
10    518
11    490
12    484
13    473
14    391
15    378
16    369
17    359
18    353
19    341
20    304
21    204
22    200
23    189
24    182
25    174
26    149
dtype: int32

## Data Alignment in Pandas Series

Pandas supports **automatic alignment** of data during operations. When performing arithmetic operations between two Series, pandas aligns the data based on the index labels. If a label is missing in one Series, the result will have a `NaN` for that label. (Comparison operators such as `==` do not align; they require identically labelled Series.)

In [35]:
# generate random data for two series
np.random.seed(51)
data1 = np.random.randint(1, 101, size=5)
data2 = np.random.randint(50, 200, size=5)

# define custom indices
index1 = ['a', 'b', 'c', 'd', 'e']
index2 = ['c', 'd', 'e', 'f', 'g']

# create the series 
series1 = pd.Series(data1, index=index1)
series2 = pd.Series(data2, index=index2)

In [36]:
series1

a    58
b    97
c    74
d    70
e    17
dtype: int32

In [37]:
series2

c    199
d    155
e    172
f    160
g     87
dtype: int32

In [38]:
## adding the two series
add_result = series2 + series1
add_result

a      NaN
b      NaN
c    273.0
d    225.0
e    189.0
f      NaN
g      NaN
dtype: float64

In [39]:
## subtracting the two series
sub_result = series2 - series1
sub_result

a      NaN
b      NaN
c    125.0
d     85.0
e    155.0
f      NaN
g      NaN
dtype: float64
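
If the `NaN` values are not wanted, one option is the method form of the operator with `fill_value`, which substitutes a default for a label that is missing from one side. A small sketch with made-up values (`left` and `right` are illustrative names, not the notebook's `series1` and `series2`):

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Operator form: labels present in only one Series give NaN
plain = left + right
print(plain)

# Method form: a label missing from one side is treated as 0 before adding
filled = left.add(right, fill_value=0)
print(filled)
```

Whether 0 is a sensible default depends on what the data means; for sales it may be reasonable, for measurements it often is not.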

## Data Aggregations 

In [40]:
# shop sales random data
random_sales = np.random.randint(1, 1000, size=200)

# convert to a Series
shop_sales = pd.Series(random_sales)

# preview first 5 values
shop_sales.head()

0    573
1    863
2     29
3    853
4    677
dtype: int32

* `sum()`


In [41]:
# calculate total sales 
total_sales = shop_sales.sum()
total_sales

103741

* `mean()`

In [42]:
# average sales 
avg_sales = shop_sales.mean()
print(np.round(avg_sales, 4))

518.705


* `median()`

In [43]:
# median 
mid_sales = shop_sales.median()
mid_sales

510.0

* `mode()`

In [44]:
# most common sale
common_sale = shop_sales.mode().tolist()
common_sale

[29, 807]

* `min()`

In [45]:
# least sale 
least_sale = shop_sales.min()
least_sale

3

* `max()`

In [46]:
# highest sale 
high_sale = shop_sales.max()
high_sale

999

In [47]:
# find the index label of the first occurrence of the maximum value 
max_sale_index = shop_sales.idxmax()
print(max_sale_index)
print(shop_sales.loc[max_sale_index]) # idxmax() returns a label, so look it up with .loc

7
999


In [48]:
shop_sales.head(8)

0    573
1    863
2     29
3    853
4    677
5    164
6    924
7    999
dtype: int32

* `std()`

In [49]:
# std
std = shop_sales.std()
print(np.round(std, 3))

295.328


* `var()`

In [50]:
# var
var = shop_sales.var()
print(np.round(var, 3))

87218.581


In [51]:
var ** 0.5

295.3279209275631

* `cumsum()`

In [52]:
# cumulative sum
cumsum_value = shop_sales.cumsum()
cumsum_value

0         573
1        1436
2        1465
3        2318
4        2995
        ...  
195    102448
196    103085
197    103418
198    103678
199    103741
Length: 200, dtype: int32

## Data Transformations

Involves applying custom functions and mapping values to change the values in a Series.

### 1. Applying functions

* The `apply()` method applies a function to each value of a Series.

In [53]:
shop_sales.head()

0    573
1    863
2     29
3    853
4    677
dtype: int32

In [54]:
# square shop sales using lambda
squared_shop_sales = shop_sales.apply(lambda x: x**2)
squared_shop_sales.head()

0    328329
1    744769
2       841
3    727609
4    458329
dtype: int64

In [55]:
# custom functions
def square_numbers(num) -> int:
    result = num ** 2

    return result 

# apply the function square_numbers to the entire series 
square_shops = shop_sales.apply(square_numbers)

square_shops.head()

0    328329
1    744769
2       841
3    727609
4    458329
dtype: int64
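
For simple arithmetic such as squaring, the vectorized operator gives the same values as `apply()` without calling a Python function once per element. A sketch on a small stand-in Series (`values` is illustrative, not `shop_sales`):

```python
import pandas as pd

values = pd.Series([573, 863, 29])

squared_apply = values.apply(lambda x: x ** 2)  # calls the lambda once per element
squared_vec = values ** 2                       # single vectorized operation

print(squared_vec.tolist())
print((squared_apply == squared_vec).all())
```

`apply()` remains useful when the logic cannot be expressed with built-in vectorized operations.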

### 2. Mapping Values

* The `map()` method maps each value in a Series to another value, using either a function or a dict.

In [56]:
# function to categorize sales 
def categorize_sales(sale):
    if sale < 300:
        return 'low'
    elif sale < 600:  # covers 300 <= sale < 600, so exactly 300 is 'medium'
        return 'medium'
    else:
        return 'high'
    

# map each value to category 
sales_category = shop_sales.map(categorize_sales)

sales_category.head()

0    medium
1      high
2       low
3      high
4      high
dtype: object

In [57]:
duka_langu = pd.DataFrame({
    'sales': shop_sales, 
    'category': sales_category
})

duka_langu.head()

   sales category
0    573   medium
1    863     high
2     29      low
3    853     high
4    677     high


In [58]:
# map using dict 
np.random.seed(13)
drive_class = pd.Series((np.random.randint(1, 4, size=10)))

drive_class

0    3
1    1
2    3
3    1
4    3
5    3
6    1
7    2
8    1
9    3
dtype: int32

In [59]:
# create mapping dict
mapping_dict = {
    3 : 'Class E', 
    2 : 'Class BCE', 
    1 : 'class FG'
}

# map values in the series
drive_Class_full = drive_class.map(mapping_dict)

drive_Class_full

0      Class E
1     class FG
2      Class E
3     class FG
4      Class E
5      Class E
6     class FG
7    Class BCE
8     class FG
9      Class E
dtype: object
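
One thing to watch when mapping with a dict: a value that has no matching key becomes `NaN` rather than raising an error. A short sketch with made-up codes (`codes` and `labels` are illustrative names):

```python
import pandas as pd

codes = pd.Series([1, 2, 4])

# 4 has no key in the dict, so its mapped value is missing
labels = codes.map({1: 'one', 2: 'two'})

print(labels)
print(labels.isna().tolist())
```

Checking `isna()` after a dict-based `map()` is a quick way to spot codes that the mapping does not cover.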

## String Manipulation Methods

In [60]:
data = np.random.choice(['12%', '30%', '54%', '38%', '40%'], size=100)

hcl_percentages = pd.Series(data)

hcl_percentages.head()

0    38%
1    54%
2    40%
3    54%
4    54%
dtype: object

In [61]:
# calculate avg hcl_percentage 
avg_hcl = hcl_percentages.mean()

avg_hcl

TypeError: Could not convert 38%54%40%54%54%30%38%40%54%12%38%38%38%38%30%54%12%40%38%12%40%54%54%40%38%12%38%54%12%30%12%30%54%38%54%38%38%38%12%54%12%54%40%38%12%30%30%40%38%54%30%40%38%12%38%30%30%30%12%54%38%40%54%30%40%38%30%38%30%54%30%38%40%12%40%40%38%40%38%12%38%38%38%30%12%40%30%30%54%12%54%54%54%30%54%40%12%54%30%40% to numeric

In [90]:
# convert to float 

hcl_percentages.astype('Float64')

ValueError: could not convert string to float: '38%'

In [None]:
# step 1: remove the % sign and replace with nothing
# remember space is a character and you can't convert '38 ' into a number!!

hcl_percentages = hcl_percentages.str.replace('%', '') # no space between the quotes << nothing

hcl_percentages.head()

0    38
1    54
2    40
3    54
4    54
dtype: object

In [92]:
hcl_avg = hcl_percentages.mean()

hcl_avg

TypeError: Could not convert string '38544054543038405412383838383054124038124054544038123854123012305438543838381254125440381230304038543040381238303030125438405430403830383054303840124040384038123838383012403030541254545430544012543040' to numeric

In [None]:
# convert from string to a numeric data type using .astype(dtype)
hcl_percentages = hcl_percentages.astype('Float64')

hcl_percentages.head()

0    38.0
1    54.0
2    40.0
3    54.0
4    54.0
dtype: Float64

In [94]:
# calculate the mean 
hcl_mean = hcl_percentages.mean()

hcl_mean

np.float64(36.18)

* `38% != 38`

* `38% == 0.38`
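
The two notes above can be made concrete: stripping the `%` sign leaves the number 38, while the proportion it stands for is 0.38. A sketch with a small made-up Series (`pct` is an illustrative name, not the notebook's `hcl_percentages`):

```python
import pandas as pd

pct = pd.Series(['12%', '30%', '54%'])

# Strip the trailing '%' and convert the remaining text to float
as_number = pct.str.rstrip('%').astype(float)

# Divide by 100 to get the proportion each percentage represents
as_fraction = as_number / 100

print(as_number.tolist())
print(as_fraction.tolist())
```

Which form to keep depends on how the values are used later: the mean of `as_number` reads as a percentage, the mean of `as_fraction` as a proportion.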

## Date and Time Operations

In [97]:
# create a simple pandas series with random dates
dates = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')

# 50 random dates from the calendar year 2024
date_data = pd.Series(np.random.choice(dates, size=50))

date_data.head()

0   2024-02-26
1   2024-04-14
2   2024-12-27
3   2024-02-17
4   2024-07-23
dtype: datetime64[ns]

### 1. Datetime Accessors

In [99]:
# extract year
date_data.dt.year.head()

0    2024
1    2024
2    2024
3    2024
4    2024
dtype: int32

In [100]:
# extract month 
date_data.dt.month.head()

0     2
1     4
2    12
3     2
4     7
dtype: int32

In [101]:
# extract day 
date_data.dt.day.head()

0    26
1    14
2    27
3    17
4    23
dtype: int32

### 2. Datetime Formatting

In [None]:
# Ted's date format: 20.01.2025

date_data.dt.strftime('%d.%m.%Y').head() # formats dates as 'DD.MM.YYYY'

0    26.02.2024
1    14.04.2024
2    27.12.2024
3    17.02.2024
4    23.07.2024
dtype: object

### 3. Datetime Arithmetic

In [104]:
# delivery takes one day within the city >> add one day to each date

delivery_date = date_data + 1 


TypeError: Addition/subtraction of integers and integer-arrays with DatetimeArray is no longer supported.  Instead of adding/subtracting `n`, use `n * obj.freq`

In [106]:
# delivery takes one day within the city >> add one day to each date

delivery_date = date_data + pd.to_timedelta('1 day')

delivery_date.head()

0   2024-02-27
1   2024-04-15
2   2024-12-28
3   2024-02-18
4   2024-07-24
dtype: datetime64[ns]

### 4. Datetime Comparison

In [None]:
# q1 sales < Use the same format 

date_data[date_data < '2024-04-01'] # select dates before April 1, 2024

0    2024-02-26
3    2024-02-17
6    2024-02-18
9    2024-03-29
11   2024-03-18
15   2024-02-01
16   2024-01-02
20   2024-01-15
22   2024-03-30
27   2024-02-19
29   2024-01-09
32   2024-01-22
49   2024-02-29
dtype: datetime64[ns]

In [113]:
date_data[date_data.dt.month < 3]

0    2024-02-26
3    2024-02-17
6    2024-02-18
15   2024-02-01
16   2024-01-02
20   2024-01-15
27   2024-02-19
29   2024-01-09
32   2024-01-22
49   2024-02-29
dtype: datetime64[ns]

## Concatenation

* Combining multiple Series
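
`pd.concat()` is the usual tool for this. A minimal sketch, assuming two small made-up Series (`q1` and `q2` are illustrative names):

```python
import pandas as pd

q1 = pd.Series([100, 200], index=['jan', 'feb'])
q2 = pd.Series([300, 400], index=['mar', 'apr'])

# Stack the two Series end to end; the original labels are kept
combined = pd.concat([q1, q2])
print(combined)

# ignore_index=True discards the labels and renumbers from 0
renumbered = pd.concat([q1, q2], ignore_index=True)
print(renumbered)
```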

## Appending

* Adding elements to the end of a Series
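
`Series.append()` was removed in pandas 2.0, so appending is now done either by assigning to a new label or with `pd.concat()`. A sketch with made-up data (`scores` and `extra` are illustrative names):

```python
import pandas as pd

scores = pd.Series([45, 47], index=['Sylvia', 'Edwin'])

# Option 1: assign to a label that does not exist yet (modifies scores in place)
scores.loc['Beth'] = 50

# Option 2: concatenate with another Series (returns a new Series)
extra = pd.Series([52], index=['Vincent'])
scores_all = pd.concat([scores, extra])
print(scores_all)
```

Adding elements one at a time in a loop is slow for large data; collecting the values first and building the Series once is usually the better choice.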

## Dropping Elements 

In [114]:
hcl_percentages.head()

0    38.0
1    54.0
2    40.0
3    54.0
4    54.0
dtype: Float64

* Using index

In [115]:
# drop elements by index label
hcl_percentages = hcl_percentages.drop([0])
hcl_percentages.head()

1    54.0
2    40.0
3    54.0
4    54.0
5    30.0
dtype: Float64

* Based on a Condition

In [121]:
# Kenyan supplies: HCl at or above 40%
ken_hcl_supplies = hcl_percentages[hcl_percentages >= 40]

# reset the indexes
ken_hcl_supplies.reset_index(drop=True, inplace=True)

ken_hcl_supplies.head()

0    54.0
1    40.0
2    54.0
3    54.0
4    40.0
dtype: Float64

## Broadcasting

* Performing operations between a Series and a scalar.

In this scenario, broadcasting refers to how pandas efficiently performs operations between a Series and a scalar value.

In [122]:
# scalar addition 
ken_hcl_supplies = ken_hcl_supplies + 2

ken_hcl_supplies.head()

0    56.0
1    42.0
2    56.0
3    56.0
4    42.0
dtype: Float64

## Vectorized Operations

* Efficiently performing operations on entire Series

In [None]:
# find the square root of each element 
sqrt_s = np.sqrt(ken_hcl_supplies)

sqrt_s.head()

0    7.483315
1    6.480741
2    7.483315
3    7.483315
4    6.480741
dtype: Float64

## Groupby

Grouping and aggregating data

In [None]:
n_brms = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weight = [0.3, 0.25, 0.20, 0.15, 0.05, 0.02, 0.01, 0.01, 0.01]

# generate categorical data 
housing_data = np.random.choice(n_brms, size=100, p=weight) 

housing_series = pd.Series(housing_data)

housing_series.head()

0    1
1    4
2    2
3    2
4    2
dtype: int64

In [135]:
# group by and aggregate 
group_agg = housing_series.groupby(housing_series).agg('count')

group_agg

1    32
2    26
3    17
4    17
5     4
6     2
9     2
dtype: int64

## Joining and Merging 

* Combining Series with other Series
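
One way to join Series on their index labels is `pd.concat()` with `axis=1`, which places each Series in its own column of a DataFrame. A sketch with made-up data; the names `price` and `stock` are illustrative:

```python
import pandas as pd

price = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='price')
stock = pd.Series([5, 7], index=['b', 'c'], name='stock')

# Outer join (default): keep every label, fill gaps with NaN
outer = pd.concat([price, stock], axis=1)
print(outer)

# Inner join: keep only labels present in both Series
inner = pd.concat([price, stock], axis=1, join='inner')
print(inner)
```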

## Memory Usage

* Checking and optimizing memory usage 

### 1. `memory_usage()`

Check the memory usage of the Series

In [138]:
group_agg.memory_usage(deep=True) 

112

### 2. `astype()`

Convert the Series to a more memory-efficient data type, e.g. if all the values lie between -128 and 127, use `int8` (or `uint8` for values from 0 to 255).

In [139]:
optimized_series = group_agg.astype('int16')

optimized_series.memory_usage(deep=True)

70
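
Before downcasting, it helps to confirm that the values fit in the target dtype, because `astype()` to a type that is too small can silently wrap around. A sketch using `np.iinfo` for the range and `pd.to_numeric(..., downcast='integer')` to let pandas pick the smallest signed integer type; `counts` is a made-up Series:

```python
import numpy as np
import pandas as pd

counts = pd.Series([32, 26, 17, 17, 4, 2, 2], dtype='int64')

# Range that an 8-bit signed integer can hold
info = np.iinfo(np.int8)
fits_int8 = counts.min() >= info.min and counts.max() <= info.max

# Let pandas choose the smallest integer dtype that holds the data
small = pd.to_numeric(counts, downcast='integer')

print(fits_int8, small.dtype)
# index=False reports only the memory used by the values
print(counts.memory_usage(index=False), small.memory_usage(index=False))
```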