# A. The Pandas Series Object
A Pandas Series Object is a **one-dimensional** ***labeled array*** capable of **holding any type of data**. Because the series is a one-dimensional object, it has a ***single axis - the index***. The main property of a single axis object is that ***data is arranged in a linear fashion*** like that of lists or arrays. 

In [1]:
# import the libraries 
import pandas as pd

## Create a series from a list
Let us create a Pandas Series object from a list!

In [5]:
data = [10,20,30,40]
series = pd.Series(data=data)

series

0    10
1    20
2    30
3    40
dtype: int64

#### Explanation:
- The data list contains the values `[10, 20, 30, 40]`.
- The index of the Series is automatically generated as `0, 1, 2, 3`, corresponding to each value.

#### NOTE
- The index is not part of the values - the index is called the *axis*
- The values of the index are called the *axis labels*

Thus, a Series has three attributes namely:
- values 
- index
- name (optional) - we have not assigned a name to our series yet!

In [6]:
# give a name to the series - optional in nature 
series.name = 'Basic Series'
series

0    10
1    20
2    30
3    40
Name: Basic Series, dtype: int64

As you can see, the name of the series is now `Basic Series`!
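The three attributes listed above can be inspected directly. The following is a minimal sketch; the variable name `s` is only illustrative and is not part of the notebook's cells.

```python
import pandas as pd

# Build the same data as above, this time passing the name up front
s = pd.Series([10, 20, 30, 40], name="Basic Series")

print(s.values)  # the underlying data as a NumPy array
print(s.index)   # the axis labels (a default RangeIndex here)
print(s.name)    # the optional name of the Series
```

Passing `name=` to the constructor is equivalent to assigning `series.name` afterwards.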

We have seen that index is not part of the data/values. The natural question here is : **What type of object is the `index` in a `Series` object?**

## Index in Series

In [7]:
# let us look at the index of the series - this is a RangeIndex object!
series.index

RangeIndex(start=0, stop=4, step=1)

In [8]:
# the index being a RangeIndex object, we can use it in a loop
for index in series.index:
    print(index, end=' ')

0 1 2 3 

**Index** provides two levels of abstraction.
- The **first level of abstraction** - the index lets you label data points, which makes them easy to reference!
- The **second level of abstraction** - the index is a data structure in its own right, so you can replace it with a custom index independently of the values.

Let us look at both these levels of abstraction!

In [6]:
# First level of abstraction for index - reference data points
series[1]

20

In [9]:
# Creating a Series with a custom index
data = [10, 20, 30, 40]
index = ['a', 'b', 'c', 'd']

series = pd.Series(data=data, index=index)

print(series)

a    10
b    20
c    30
d    40
dtype: int64


In [10]:
series['b'] # reference data points 

20

In [None]:
# what's the index now?
series.index

Index(['a', 'b', 'c', 'd'], dtype='object')
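To illustrate the second level of abstraction, here is a small sketch showing that the index can be replaced without touching the values. The new labels `w`, `x`, `y`, `z` are arbitrary choices made for this example.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Replace the index labels; the values stay exactly as they were
s.index = ["w", "x", "y", "z"]
print(s)

# Label-based and position-based access point at the same data
print(s["x"], s.iloc[1])
```

The new index must have the same length as the Series, otherwise pandas raises an error.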

## Create a series with custom index from a dictionary

In [11]:
# Creating a dictionary
data_dict = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Converting the dictionary to a Series
# keys of the dictionary become the index and the values become the data points in the pandas Series object
series_from_dict = pd.Series(data_dict)

print(series_from_dict)

a    10
b    20
c    30
d    40
dtype: int64
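When building a Series from a dictionary, you can also pass an explicit `index` to pick and order the keys you want. The sketch below assumes a key `'z'` that is deliberately absent from the dictionary, to show what happens to labels without a matching key.

```python
import pandas as pd

data_dict = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Only the requested labels are kept, in the requested order.
# 'z' is not a key of the dictionary, so its value is missing (NaN).
subset = pd.Series(data_dict, index=['c', 'a', 'z'])
print(subset)
```

Because a missing value is present, the values are stored as floats rather than integers.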


## Miscellaneous Topics in Series

### Series - Non-homogeneous!

In [12]:
# The actual data (or values) for a series does not have to be numeric or homogeneous
data_dict = {'a': 10, 'b': 'Harry Potter', 'c': False, 'd': 'Lionel Messi'}

# Converting the dictionary to a Series
series_from_dict = pd.Series(data_dict)

print(series_from_dict)

a              10
b    Harry Potter
c           False
d    Lionel Messi
dtype: object


#### NOTE

The datatype of the Series is now *object* - i.e. each value is stored as a generic Python object.

- The object data type is also used for a series with string values. In addition, it is also used for values that have heterogeneous or mixed types.
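As a quick check of the note above, the following sketch looks at the Python type of each element in a mixed Series. The variable names `mixed` and `element_types` are illustrative only.

```python
import pandas as pd

mixed = pd.Series({'a': 10, 'b': 'Harry Potter', 'c': False, 'd': 'Lionel Messi'})

# The Series as a whole has the generic object dtype ...
print(mixed.dtype)

# ... while each element keeps its own Python type
element_types = [type(value).__name__ for value in mixed]
print(element_types)
```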

### Incorporating NULL values in Series

In [13]:
import numpy as np

# create a series with null values 
nan_series = pd.Series(data=[12,20,30,np.nan])

nan_series

0    12.0
1    20.0
2    30.0
3     NaN
dtype: float64

### Size of array
- using the `size` attribute of the Series object
- using the `count()` method on the Series object 

In [14]:
# size attribute - the number of elements in the Series, including missing values
nan_series.size

4

In [15]:
# get the size of the array excluding the missing values - use the count method 
nan_series.count()

3

NOTE - the `np.nan` used in the above code plays a role similar to `NULL` in SQL: it marks a missing value!
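A few common ways of working with missing values are sketched below. This is a minimal illustration rather than a complete treatment; it relies on `isna()`, `dropna()` and the default behaviour of `sum()`, which skips missing values.

```python
import numpy as np
import pandas as pd

nan_series = pd.Series([12, 20, 30, np.nan])

# Boolean mask marking the missing entries, and how many there are
print(nan_series.isna())
missing_count = int(nan_series.isna().sum())
print(missing_count)

# NaN never compares equal to itself, so use isna() instead of ==
print(np.nan == np.nan)

# Aggregations skip NaN by default; dropna() removes the missing rows
total = nan_series.sum()
cleaned = nan_series.dropna()
print(total, len(cleaned))
```

Note that `size - count()` gives the same number of missing values as `isna().sum()`.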

### Similarity with numpy array

In [15]:
# creating a numpy array
numpy_series = np.array([10,20,30,40])
numpy_series

array([10, 20, 30, 40])

In [16]:
numpy_series[0]

10

#### Boolean Array Concept - common to both NumPy and Series!

Both the numpy array and the series have the *boolean array* concept. We can use this concept to filter the series!

In [16]:
# consider the following series
series = pd.Series(data=[12,45,67,88,34,54,76,99,7,8,10])

# find the mean of the series 
mean_series = np.mean(series)
print(f"The mean of the series is {mean_series}")

# create the boolean array 
bool_series = series > mean_series
print("Boolean Series (True when value greater than mean): ")
print(bool_series)

# use this boolean series to filter the original series 
print("Using the boolean series to filter the original :")
print(series[bool_series])

The mean of the series is 45.45454545454545
Boolean Series (True when value greater than mean): 
0     False
1     False
2      True
3      True
4     False
5      True
6      True
7      True
8     False
9     False
10    False
dtype: bool
Using the boolean series to filter the original :
2    67
3    88
5    54
6    76
7    99
dtype: int64


As you can see, the filtering has taken place and we have a new series that contains only the values greater than the mean!
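Since the boolean array concept is common to both libraries, the same filter can be written for a NumPy array. The sketch below also combines two conditions with `&`; the thresholds 10 and 60 are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

values = [12, 45, 67, 88, 34, 54, 76, 99, 7, 8, 10]
arr = np.array(values)
series = pd.Series(values)

# Same idea on a NumPy array: a boolean mask selects the matching elements
above_mean_arr = arr[arr > arr.mean()]
print(above_mean_arr)

# Conditions are combined with & (and) or | (or); each one needs parentheses
between = series[(series > 10) & (series < 60)]
print(between)
```

Unlike the NumPy result, the filtered Series keeps the original index labels of the selected values.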

## Brief Introduction to Categorical Data

### Creating a Categorical Series Using dtype="category"
You can create a categorical Series directly by specifying the dtype as "category" in the Series constructor.

In [17]:
categories = pd.Series(data=['apple','banana','orange','apple'],
                       dtype='category')

print(categories)
print(f"Categories : {categories.cat.categories}")
print(f"Codes: {categories.cat.codes}")

0     apple
1    banana
2    orange
3     apple
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']
Categories : Index(['apple', 'banana', 'orange'], dtype='object')
Codes: 0    0
1    1
2    2
3    0
dtype: int8


In [21]:
categories.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x0000029AEE905BD0>

In the code, the `cat` accessor is used to ***access various properties and methods for categorical data in pandas***. Here’s a breakdown:

### Explanation of `cat`

When you create a pandas `Series` with `dtype='category'`, pandas treats the data as **categorical data**. This means it’s stored in a way that’s ***optimized for repeated values***, which can be ***especially useful for columns with a limited number of unique values (like categorical labels)***.

The `cat` accessor is specifically designed for working with these categorical data types. With it, you can access the underlying properties and methods that are unique to categorical data. 

### Example Output of the Code

```python
Categories
0     apple
1    banana
2    orange
3     apple
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']
```

- **`categories.cat.categories`**: This gives you the ***unique categories in your categorical Series***. In this case, the output would be `Index(['apple', 'banana', 'orange'], dtype='object')`, listing each unique category once.

### Common Uses of `cat`

The `cat` accessor in pandas provides several methods and properties specifically useful for managing and analyzing categorical data. Here are some of the most useful ones:

1. **`categories.cat.categories`**:
   - Returns the list of unique categories in the Series. This is helpful to quickly view all the distinct category labels in the data.

2. **`categories.cat.codes`**:
   - Returns the ***underlying integer codes for each category label in the Series***. This is useful for cases where you need numerical representations of each category (e.g., for machine learning tasks).

3. **`categories.cat.add_categories(new_categories)`**:
   - ***Adds new categories to the existing ones without altering current data***. For example, if you want to expand the list of possible categories before adding new data points, this is useful.

4. **`categories.cat.remove_categories(removals)`**:
   - ***Removes specified categories from the Series***. Values that belonged to a removed category become `NaN`. This is useful when certain categories are no longer relevant and should be excluded.

5. **`categories.cat.rename_categories(new_names)`**:
   - ***Renames the categories, allowing you to update category labels without modifying the data itself***. This is handy if you need more descriptive or consistent naming.

6. **`categories.cat.reorder_categories(new_order)`**:
   - ***Reorders the categories***. Useful when you need a specific order for analyses or plotting, such as changing the order from alphabetical to a custom order.

7. **`categories.cat.set_categories(new_categories)`**:
   - ***Sets new categories for the Series, allowing you to define the entire set of possible categories (even those not in use).*** This is helpful for aligning with a predefined set of categories, like all possible labels.

8. **`categories.cat.as_ordered()` and `categories.cat.as_unordered()`**:
   - ***Converts the Series to an ordered or unordered categorical***. Ordered categoricals support comparisons (e.g., for ordinal data like "low," "medium," "high").

Each of these methods is tailored for efficiently managing, analyzing, and transforming categorical data. They help make the Series data more meaningful, especially when dealing with large datasets or preparing data for machine learning.

Here are examples for each method and property of the `cat` accessor in pandas, using a sample categorical `Series` for illustration.

In [None]:
# Sample categorical Series
data = pd.Series(['apple', 'banana', 'orange', 'apple', 'banana'], dtype='category')
print(data)

print('categories.cat.categories')
# - Returns the list of unique categories.
print("Categories:", data.cat.categories)
# Output: Categories: Index(['apple', 'banana', 'orange'], dtype='object')
print('\n')

print('categories.cat.codes')
# - Returns integer codes for each category label.
print("Category Codes:", data.cat.codes)
# Output: Category Codes: 0 0, 1 1, 2 2, 3 0, 4 1 (where apple=0, banana=1, orange=2)
print('\n')

print('categories.cat.add_categories(new_categories)')
# - Pass the new categories as a list of string(s).
# - Adds new categories to the Series.
data = data.cat.add_categories(['grape'])
print("New Categories:", data.cat.categories)
# Output: New Categories: Index(['apple', 'banana', 'orange', 'grape'], dtype='object')
print('\n')

print('categories.cat.remove_categories(removals)')
# - Pass the removals as a list of string(s).
# - Removes specified categories from the Series.
data = data.cat.remove_categories(['orange'])
print("Categories after removal:", data.cat.categories)
# Output: Categories after removal: Index(['apple', 'banana', 'grape'], dtype='object')
print('\n')


print('categories.cat.rename_categories(new_names)')
#- Renames categories with new labels.
data = data.cat.rename_categories(['fruit_apple', 'fruit_banana', 'fruit_grape'])
print("Renamed Categories:", data.cat.categories)
# Output: Renamed Categories: Index(['fruit_apple', 'fruit_banana', 'fruit_grape'], dtype='object')
print(data) # the values that were 'orange' are NaN (that category was removed earlier); the other values show their new labels
print('\n')

print('categories.cat.reorder_categories(new_order)')
#- Changes the order of the categories.
data = data.cat.reorder_categories(['fruit_banana', 'fruit_grape', 'fruit_apple'], ordered=True)
print("Reordered Categories:", data.cat.categories)
# Output: Reordered Categories: Index(['fruit_banana', 'fruit_grape', 'fruit_apple'], dtype='object')
print(data) # no change in data but now the categories are ordered 
print('\n')

print('categories.cat.set_categories(new_categories)')
#- Sets new categories, allowing you to redefine the whole set.
data = data.cat.set_categories(['apple', 'banana', 'orange', 'mango'])
print("Updated Categories:", data.cat.categories)
# Output: Updated Categories: Index(['apple', 'banana', 'orange', 'mango'], dtype='object')
print(data) # all values become NaN because none of the current 'fruit_*' labels is in the new set of categories
print('\n')

print('categories.cat.as_ordered() and categories.cat.as_unordered()')
#- Converts the Series to an ordered or unordered categorical.

# Convert to ordered categorical
data = data.cat.as_ordered()
print("Is Ordered:", data.cat.ordered)
# Output: Is Ordered: True

# Convert back to unordered categorical
data = data.cat.as_unordered()
print("Is Ordered:", data.cat.ordered)
# Output: Is Ordered: False

0     apple
1    banana
2    orange
3     apple
4    banana
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']
categories.cat.categories
Categories: Index(['apple', 'banana', 'orange'], dtype='object')


categories.cat.codes
Category Codes: 0    0
1    1
2    2
3    0
4    1
dtype: int8


categories.cat.add_categories(new_categories)
New Categories: Index(['apple', 'banana', 'orange', 'grape'], dtype='object')


categories.cat.remove_categories(removals)
Categories after removal: Index(['apple', 'banana', 'grape'], dtype='object')


categories.cat.rename_categories(new_names)
Renamed Categories: Index(['fruit_apple', 'fruit_banana', 'fruit_grape'], dtype='object')
0     fruit_apple
1    fruit_banana
2             NaN
3     fruit_apple
4    fruit_banana
dtype: category
Categories (3, object): ['fruit_apple', 'fruit_banana', 'fruit_grape']


categories.cat.reorder_categories(new_order)
Reordered Categories: Index(['fruit_banana', 'fruit_grape', 'fruit_apple'], dtype='o

### Converting an Existing Series to a Categorical Series Using .astype("category")

- You can also convert an existing Series to categorical by calling the `.astype("category")` method.
    > Note, however, that using this method creates an unordered category!

In [24]:
# Creating a normal Series
data = ["dog", "cat", "dog", "bird", "cat"]
series = pd.Series(data)

# Converting the Series to categorical
categorical_series = series.astype('category')

print(categorical_series)

0     dog
1     cat
2     dog
3    bird
4     cat
dtype: category
Categories (3, object): ['bird', 'cat', 'dog']
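If you need an ordered categorical from an existing Series, one option is to pass a `pd.CategoricalDtype` to `.astype()` instead of the string `"category"`. The sketch below is a minimal illustration; the name `animal_dtype` and the chosen category order are assumptions made for this example only.

```python
import pandas as pd

series = pd.Series(["dog", "cat", "dog", "bird", "cat"])

# Plain conversion: the result is an unordered categorical
categorical_series = series.astype("category")
print(categorical_series.cat.ordered)

# Conversion with an explicit dtype: categories and ordering are under your control
animal_dtype = pd.CategoricalDtype(categories=["bird", "cat", "dog"], ordered=True)
ordered_series = series.astype(animal_dtype)
print(ordered_series.cat.ordered)
print(ordered_series)
```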


#### Benefits of Using Categorical Data:
- Less Memory: Categorical data usually uses less memory than regular string data when the Series has few unique values relative to its length, because each value is stored internally as a small integer code.
- Improved Performance: Many operations on categorical data (for example, grouping and sorting) can be faster than the same operations on raw strings.
- Enforced Membership: Categorical data ensures that the values in your Series belong to a predefined set of categories (or are missing).
- Ordering: You can impose an order on categories, making it useful for ranked or hierarchical data.
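The first and third benefits can be checked with a short experiment. The sketch below uses a made-up Series of 30,000 repeated labels; the exact byte counts depend on your pandas version, so only the comparison matters. The label `"extreme"` is a hypothetical value that is not among the categories.

```python
import pandas as pd

# Many rows, only three distinct labels
labels = pd.Series(["low", "medium", "high"] * 10_000)
as_category = labels.astype("category")

# Less memory: compare the deep memory footprint of both representations
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

# Enforced membership: assigning a value outside the categories is rejected
try:
    as_category.iloc[0] = "extreme"
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    print("Rejected:", exc)
```

To allow the new value, you would first extend the categories with `cat.add_categories()`.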

#### Creating an Ordered category 

In [25]:
# Creating an ordered categorical data 

lmh = ['low','medium','high','medium','low','high','high','low','medium','low']

ordered_categories = pd.Series(data=lmh,
                               dtype=pd.CategoricalDtype(
                                   categories=['low','medium','high'],
                                   ordered=True
                               ))

# Display the Series with its ordered categories
print(ordered_categories)
print(f"Is ordered : {ordered_categories.cat.ordered}")

0       low
1    medium
2      high
3    medium
4       low
5      high
6      high
7       low
8    medium
9       low
dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
Is ordered : True


In [None]:
# filter with the ordering - a benefit of using ordered categories - useful for ordinal data 
ordered_categories[ordered_categories>'low']

1    medium
2      high
3    medium
5      high
6      high
8    medium
dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
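Besides comparisons, ordered categories also give meaning to `min()`, `max()` and sorting. The following sketch uses a shorter, illustrative Series named `levels` rather than the one from the cell above.

```python
import pandas as pd

levels = pd.Series(
    ["low", "medium", "high", "medium", "low"],
    dtype=pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True),
)

# min/max follow the category order, not alphabetical order
print(levels.min(), levels.max())

# sort_values() also follows the category order
sorted_levels = levels.sort_values()
print(sorted_levels)
```

Sorted alphabetically, `"high"` would come before `"low"`; with the ordered dtype it comes last.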

#### Reordering Categories 
To reorder categories in a Pandas categorical Series, you can use the `cat.reorder_categories()` method. This allows you to rearrange the categories in any order you like. Additionally, if the categorical data is ordered, you can also change the order to define a new sorting or ranking scheme.

Here’s how you can reorder categories in a Pandas Series.

##### Example 1: Reordering Categories
Let's create a categorical Series and reorder its categories:

In [28]:
# Creating a categorical Series
data = pd.Series(data = ["medium", "low", "high", "medium", "low"], 
                 dtype = pd.CategoricalDtype(
                     categories=["low", "medium", "high"], 
                     ordered=True
                     )
                 )

# Reordering the categories
reordered_data = data.cat.reorder_categories(["high", "medium", "low"], ordered=True)

# Display the reordered Series
print(reordered_data)
print("\nReordered categories:", reordered_data.cat.categories)

# Rename categories
# Note: a list passed to rename_categories() is applied by position, so the
# new names must follow the current category order. We rename the reordered
# Series (high, medium, low) so that each label simply becomes uppercase.
renamed_data = reordered_data.cat.rename_categories(['HIGH','MEDIUM','LOW'])
print(renamed_data)
print("\nRenamed categories:", renamed_data.cat.categories)

0    medium
1       low
2      high
3    medium
4       low
dtype: category
Categories (3, object): ['high' < 'medium' < 'low']

Reordered categories: Index(['high', 'medium', 'low'], dtype='object')
0    MEDIUM
1       LOW
2      HIGH
3    MEDIUM
4       LOW
dtype: category
Categories (3, object): ['HIGH' < 'MEDIUM' < 'LOW']

Renamed categories: Index(['HIGH', 'MEDIUM', 'LOW'], dtype='object')


##### Example 2: Reordering and Keeping Categories Unordered
Passing `ordered=False` lets you reorder the categories without imposing an order on them. Here the result is an unordered categorical, even though `data` itself is ordered:

In [42]:
# Reordering the categories without imposing an order
reordered_unordered = data.cat.reorder_categories(["high", "medium", "low"], ordered=False)

# Display the reordered Series
print(reordered_unordered)
print("\nReordered categories (unordered):", reordered_unordered.cat.categories)

0    medium
1       low
2      high
3    medium
4       low
dtype: category
Categories (3, object): ['high', 'medium', 'low']

Reordered categories (unordered): Index(['high', 'medium', 'low'], dtype='object')


# B. Pandas DataFrame Object

- If a **Series** is an ***analog of a one-dimensional array with explicit indices***, a **DataFrame** is an ***analog of a two-dimensional array with explicit row and column indices***. Just as you might think of a two-dimensional array as an ordered sequence of aligned one dimensional columns, you can ***think of a DataFrame as a sequence of aligned Series objects***. Here, by “aligned” we mean that they share the same index.
    > Source: Python Data Science Handbook

## Creating a DataFrame
Let us create a simple pandas DataFrame object!

In [31]:
df = pd.DataFrame({
    'Age':[20,34,56,78],
    'Name' : ['Ajay', 'Shin', 'Freddy', 'Michael']
})
print(df)
print(type(df))

   Age     Name
0   20     Ajay
1   34     Shin
2   56   Freddy
3   78  Michael
<class 'pandas.core.frame.DataFrame'>


A single-column DataFrame can be constructed from a single Series.

In [38]:
population_dict = {'California': 39538223, 'Texas': 29145505, 
                   'Florida': 21538187, 'New York': 20201249, 
                   'Pennsylvania': 13002700}

population_series = pd.Series(population_dict)
print(population_series)
print(type(population_series))

print('\n')

population_df = pd.DataFrame(population_series, 
                             columns=['Population'])
print(population_df)
print(type(population_df))

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64
<class 'pandas.core.series.Series'>


              Population
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
<class 'pandas.core.frame.DataFrame'>


Let us view the dataframe as a sequence of aligned Series objects!

In [None]:
population_dict = {'California': 39538223, 'Texas': 29145505, 
                   'Florida': 21538187, 'New York': 20201249, 
                   'Pennsylvania': 13002700}

area_dict = {'California': 423967, 'Texas': 695662, 'Florida': 170312, 
             'New York': 141297, 'Pennsylvania': 119280}

states = pd.DataFrame({
    'population': population_dict,
    'area': area_dict
}) # think of states as a combination of two aligned Series objects
states

              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280
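To see what "aligned" means in practice, the sketch below builds a DataFrame from two Series whose labels are deliberately given in a different order. It uses a three-state subset of the data above; the variable names `population` and `area` are illustrative.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505, 'Florida': 21538187})

# Same labels, deliberately listed in a different order
area = pd.Series({'Florida': 170312, 'California': 423967, 'Texas': 695662})

# Values are matched by index label, not by position
states = pd.DataFrame({'population': population, 'area': area})
print(states)

# Each column is itself a Series that shares the DataFrame's index
area_column = states['area']
print(type(area_column))
print(area_column.index.equals(states.index))
```

Each row ends up with the population and area of the same state, regardless of the order in which the two Series listed them.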


## Index in DataFrame 

In [33]:
# just like the Series object, the DataFrame object also has an index 
states.index

Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

In [None]:
df.index # this will give us a RangeIndex since this dataframe uses the default index

RangeIndex(start=0, stop=4, step=1)

Since a DataFrame is a two-dimensional structure, it also has a `columns` attribute that gives us the index for the columns.

In [35]:
states.columns

Index(['population', 'area'], dtype='object')

In [36]:
df.columns

Index(['Age', 'Name'], dtype='object')

## Reference Datapoints 


# A Brief Discussion on Accessors

An **accessor** in pandas is an attribute that provides access to specialized functionality for certain data types within a `Series` or `DataFrame`. Accessors allow you to call specific methods and properties that are only relevant to a particular data type, making it easier to manipulate and analyze data without needing to change the core data structure. 

### Common Accessors in pandas

Here are a few of the main accessors in pandas:

1. **`.str`**: The string accessor for text data.
   - Provides methods specific to string operations, like `.str.lower()` (to lowercase), `.str.contains()` (for substring search), and `.str.replace()` (for replacing substrings).
   
2. **`.dt`**: The datetime accessor for datetime data.
   - Provides access to datetime-specific properties like `.dt.year`, `.dt.month`, and `.dt.day`, and methods like `.dt.to_period()` (to convert datetime to a period) and `.dt.floor()` (to round down to a specific time frequency).
   
3. **`.cat`**: The categorical accessor for categorical data.
   - Provides methods for managing categories, such as `.cat.categories`, `.cat.codes`, `.cat.add_categories()`, and `.cat.remove_categories()`.

### Example of Accessor Use

Accessors allow you to work with data in a more structured and intuitive way. For example:

```python
# String accessor example
data = pd.Series(['apple', 'banana', 'cherry'])
print(data.str.upper())  # Converts each string in the Series to uppercase

# Datetime accessor example
date_data = pd.Series(pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01']))
print(date_data.dt.month)  # Extracts the month from each datetime entry
```

### Why Accessors are Useful

Accessors provide a logical grouping of functionality based on data type, helping you:
- **Access** methods and properties relevant to the data type without needing conversions.
- **Manipulate** data more easily, especially with common operations like string handling, date extraction, or category management.
- **Improve readability**, as code using accessors is generally easier to follow and directly communicates the data type and operations being performed.
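Because each accessor is tied to a data type, using one on the wrong kind of Series fails. The sketch below is a small illustration: it tries the `.str` accessor on a numeric Series and then uses `.cat` on a categorical one. The variable names are hypothetical.

```python
import pandas as pd

# The .str accessor is only available for string data
numbers = pd.Series([1, 2, 3])
try:
    numbers.str.upper()
    failed = False
except AttributeError as exc:
    failed = True
    print("AttributeError:", exc)

# The .cat accessor works once the Series has the category dtype
fruits = pd.Series(['apple', 'banana', 'apple'], dtype='category')
codes = fruits.cat.codes.tolist()
print(codes)
```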

# END!