# Data Structures

One of the keys to understanding pandas is to understand its data model. At the core of pandas are two data structures: the Series, for one-dimensional array data, and the DataFrame, for two-dimensional tabular data. The table below shows their analogs in the spreadsheet and database worlds.


| Data Structure | Dimensionality | Spreadsheet Analog | Database Analog |
|-|-|-|-|
|Series|1D|Column|Column|
|DataFrame|2D|Single Sheet|Table|

A DataFrame is similar to a sheet with rows and columns, while a Series is similar to a single column
of data (when we refer to a column of data in this text, we are referring to a Series).



## NumPy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

At the core of the NumPy package is the `ndarray` object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences:

- NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
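The differences listed above can be seen in a few lines. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np

# A Python list can grow in place.
values = [1, 2, 3]
values.append(4)

# An ndarray has a fixed size and a single dtype for all elements.
arr = np.array([1, 2, 3, 4])

# "Growing" an array returns a new array; the original keeps its size.
bigger = np.append(arr, 5)
print(arr.size, bigger.size)

# Arithmetic is applied to every element without an explicit loop.
doubled = arr * 2
print(doubled)

# Mixing integers and floats upcasts every element to one common dtype.
mixed = np.array([1, 2.5])
print(mixed.dtype)
```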

A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. In other words, in order to efficiently use much (perhaps even most) of today’s scientific/mathematical Python-based software, just knowing how to use Python’s built-in sequence types is insufficient - one also needs to know how to use NumPy arrays.

Series, DataFrames, databases, and Excel are all concepts or tools for managing and analyzing data.

Series and DataFrames are the data structures used for data analysis with pandas in Python, databases are used for structured data storage and retrieval, and Excel is a spreadsheet tool often used for simpler data analysis and reporting.


## Pandas

In pandas, the two-dimensional counterpart to the one-dimensional Series is the DataFrame. If we want to understand this data structure, it helps to know how it is constructed. This chapter will
introduce the DataFrame.


DataFrames can be created from many types of input:
- columns (dict of lists)
- rows (list of dicts)
- CSV and xlsx files (`pd.read_csv`, `pd.read_excel`)
- NumPy ndarrays
- other: SQL, HDF5, Arrow, etc.
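As a rough illustration of the first, second, and fourth input types, the sketch below builds small DataFrames by hand. The column names and values are invented for the example and are not taken from the transaction data used later.

```python
import numpy as np
import pandas as pd

# Column-oriented input: a dict mapping column names to lists of values.
from_columns = pd.DataFrame({"item": ["Notebook", "Lanyard"], "amount": [20.0, 6.0]})

# Row-oriented input: a list with one dict per row.
from_rows = pd.DataFrame([
    {"item": "Notebook", "amount": 20.0},
    {"item": "Lanyard", "amount": 6.0},
])

# A 2D NumPy array plus explicit column names.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

# Both construction styles produce the same table.
print(from_columns.equals(from_rows))
print(from_array.shape)
```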

> Just as an Excel table is made up of columns, a DataFrame is made up of Series objects.
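To make the relationship concrete, the sketch below selects one column from a small, invented DataFrame and checks that the result is a Series that shares the DataFrame's index.

```python
import pandas as pd

# Hypothetical two-row table used only for illustration.
frame = pd.DataFrame({"item": ["Notebook", "Lanyard"], "amount": [20.0, 6.0]})

# Selecting a single column returns a Series.
column = frame["amount"]
print(type(column).__name__)

# The Series is named after the column and keeps the row labels.
print(column.name)
print(column.index.equals(frame.index))
```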

# Pandas Series

There are two main types of operations you can perform on Series and DataFrame objects:
operator methods and attributes.



## Operator Methods

A Series is used to model one-dimensional data. The Series object also has a few more bits of data,
including an index and a name. A common idea throughout pandas is the notion of an axis. Because
a series is one-dimensional, it has a single axis: the index.
Below is a table of counts of deposit amounts by customer. The unlabeled left column is the index.


| |Data|
|-|-|
|0|145|
|1|142|
|2|38|
|3|13|
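The table above can be reproduced as a Series. This is a small sketch; the name `"Data"` simply mirrors the table header, and `deposit_counts` is an illustrative variable name.

```python
import pandas as pd

# Values from the table; pandas supplies a default integer index 0..3.
deposit_counts = pd.Series([145, 142, 38, 13], name="Data")

print(deposit_counts.index)  # the single axis of the series
print(deposit_counts.name)
print(deposit_counts.ndim)   # a Series is always one-dimensional
```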

### Numerical Operations

|Method|Operator|Description|
|-|-|-|
|`s.add(s2)`|`s + s2`|Adds series|
|`s.radd(s2)`|`s2 + s`|Adds series (operands reversed)|
|`s.sub(s2)`|`s - s2`|Subtracts series|
|`s.rsub(s2)`|`s2 - s`|Subtracts series (operands reversed)|
|`s.mul(s2)`, `s.multiply(s2)`|`s * s2`|Multiplies series|
|`s.rmul(s2)`|`s2 * s`|Multiplies series (operands reversed)|
|`s.div(s2)`, `s.truediv(s2)`|`s / s2`|Divides series|
|`s.mod(s2)`|`s % s2`|Modulo of series division|
|`s.eq(s2)`|`s == s2`|Elementwise equals of series|
|`s.ne(s2)`|`s != s2`|Elementwise not equals of series|
|`s.gt(s2)`|`s > s2`|Elementwise greater than of series|
|`s.ge(s2)`|`s >= s2`|Elementwise greater than or equals of series|
|`s.lt(s2)`|`s < s2`|Elementwise less than of series|
|`s.le(s2)`|`s <= s2`|Elementwise less than or equals of series|

|Method or attribute|Description|
|-|-|
|`mad`|Return the mean absolute deviation (removed in pandas 2.0).|
|`max`|Return the maximum value.|
|`mean`|Return the mean value.|
|`median`|Return the median value.|
|`min`|Return the minimum value.|
|`nbytes`|Return the number of bytes of the data (attribute).|
|`ndim`|Return the number of dimensions (1) of the data (attribute).|
|`nunique`|Return the count of unique values.|
|`quantile`|Return the median value by default. Can override `q` to specify another quantile.|
|`sem`|Return the unbiased standard error of the mean.|
|`size`|Return the number of elements in the data (attribute).|
|`skew`|Return the unbiased skew of the data. A negative value indicates the tail is on the left side.|
|`std`|Return the sample standard deviation of the data.|
|`sum`|Return the sum of the series.|
### Series Attributes

Attributes allow you to inspect and retrieve information about the structure and characteristics of a pandas Series, which is useful for data analysis and manipulation. You can access them by appending them to a Series object. Entries written with parentheses below are methods and must be called.

Commonly used pandas Series attributes and methods:

1. `index`: Returns the index (row labels) of the Series.
2. `values`: Returns the data values of the Series as a NumPy array.
3. `name`: Returns or sets the name of the Series.
4. `dtype`: Returns the data type of the elements in the Series.
5. `size`: Returns the number of elements in the Series.
6. `shape`: Returns a tuple representing the shape of the Series (always `(n,)` for a Series).
7. `empty`: Returns True if the Series is empty.
8. `is_unique`: Returns True if all elements are unique.
9. `nunique()`: Returns the number of unique elements.
10. `count()`: Returns the number of non-null (non-missing) elements.
11. `head(n)`: Returns the first n elements of the Series.
12. `tail(n)`: Returns the last n elements of the Series.
13. `describe()`: Generates summary statistics for the Series.
14. `unique()`: Returns an array of unique elements.
15. `min()`: Returns the minimum value in the Series.
16. `max()`: Returns the maximum value in the Series.
17. `mean()`: Returns the mean (average) value of the Series.
18. `median()`: Returns the median value of the Series.
19. `std()`: Returns the standard deviation of the Series.
20. `var()`: Returns the variance of the Series.
21. `sum()`: Returns the sum of all elements in the Series.
22. `mode()`: Returns the mode(s), the most frequent value(s), of the Series as a new Series.
23. `idxmin()`: Returns the index of the minimum value.
24. `idxmax()`: Returns the index of the maximum value.
25. `apply(func)`: Applies a function to each element in the Series.
26. `map(dict_or_func)`: Maps values in the Series to new values using a dictionary or function.
27. `value_counts()`: Returns a Series containing counts of unique values.
28. `astype(dtype)`: Converts the data type of the Series.
29. `sort_values()`: Sorts the Series by values.
30. `sort_index()`: Sorts the Series by index labels.
31. `isna()` or `isnull()`: Returns a Boolean Series indicating missing values.
32. `notna()` or `notnull()`: Returns a Boolean Series indicating non-missing values.
33. `str`: Accesses string methods for Series with string data type.
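A short sketch of a few entries from this list, using the same small `deposit` series that appears in the notebook cells below:

```python
import pandas as pd

deposit = pd.Series([1, 2, 3, 4], name="Deposit",
                    index=["Glen", "Bob", "Chris", "Shirley"])

# Attributes: no parentheses.
print(deposit.shape)      # (4,)
print(deposit.is_unique)  # True
print(list(deposit.index))

# Methods: called with parentheses.
print(deposit.idxmax())   # Shirley
as_float = deposit.astype(float)
print(as_float.dtype)     # float64
```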


### Head and Tail

The `.head` and `.tail` methods are useful for pulling out values at the start or end of the series,
respectively. These methods are used to quickly inspect a chunk of the data.

In [4]:
%pip install pandas

Successfully installed pandas-2.2.3 pytz-2024.2 tzdata-2024.2
Note: you may need to restart the kernel to use updated packages.





In [23]:
import pandas as pd
deposit = pd.Series([1,2,3,4,5,6,7,8,9,10,11,12])
deposit

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
dtype: int64

In [24]:
deposit = pd.Series([1,2,3,4,5,6,7,8,9,10,11,12], name = "Deposit")
deposit

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
Name: Deposit, dtype: int64

In [25]:
deposit = pd.Series([1,2,3,4], name = "Deposit", index=["Glen","Bob","Chris","Shirley"])
deposit

Glen       1
Bob        2
Chris      3
Shirley    4
Name: Deposit, dtype: int64

In [26]:
deposit.head(2)

Glen    1
Bob     2
Name: Deposit, dtype: int64

In [27]:
deposit.tail(2)

Chris      3
Shirley    4
Name: Deposit, dtype: int64

In [28]:
deposit.describe()

count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
Name: Deposit, dtype: float64

In [7]:
df = pd.read_csv("transaction_data.csv")
df.head()

Unnamed: 0,Item_Description,Amount,Traffic_Source,Session_Duration
0,Notebook,20.0,Direct,3470
1,Google Hoodie,124.0,Direct,401
2,Greeting Cards,6.0,Paid Search,2608
3,,,,2130
4,Lanyard,6.0,Direct,3517


In [8]:
type(df)

pandas.core.frame.DataFrame

In [9]:
amount = df["Amount"]

In [10]:
amount

0        20.0
1       124.0
2         6.0
3         NaN
4         6.0
        ...  
7391     20.0
7392      5.0
7393    124.0
7394    124.0
7395    124.0
Name: Amount, Length: 7396, dtype: float64

In [11]:
amount > 50

0       False
1        True
2       False
3       False
4       False
        ...  
7391    False
7392    False
7393     True
7394     True
7395     True
Name: Amount, Length: 7396, dtype: bool

In [12]:
amount[amount > 50]

1       124.0
6       124.0
11      124.0
17      124.0
18      124.0
        ...  
7378    115.0
7379    124.0
7393    124.0
7394    124.0
7395    124.0
Name: Amount, Length: 1707, dtype: float64

In [15]:
mask = amount > amount.median()

In [16]:
amount[mask]

0        20.0
1       124.0
6       124.0
11      124.0
15       18.0
        ...  
7388     20.0
7391     20.0
7393    124.0
7394    124.0
7395    124.0
Name: Amount, Length: 3438, dtype: float64

In [17]:
second_amount = amount.copy()
second_amount

0        20.0
1       124.0
2         6.0
3         NaN
4         6.0
        ...  
7391     20.0
7392      5.0
7393    124.0
7394    124.0
7395    124.0
Name: Amount, Length: 7396, dtype: float64

In [18]:
amount + second_amount

0        40.0
1       248.0
2        12.0
3         NaN
4        12.0
        ...  
7391     40.0
7392     10.0
7393    248.0
7394    248.0
7395    248.0
Name: Amount, Length: 7396, dtype: float64

In [19]:
amount[[9,2,5]]

9    10.0
2     6.0
5     5.0
Name: Amount, dtype: float64

In [21]:
amount[100:]

100       6.0
101      18.0
102       5.0
103     124.0
104     125.0
        ...  
7391     20.0
7392      5.0
7393    124.0
7394    124.0
7395    124.0
Name: Amount, Length: 7296, dtype: float64

### Indexing 


Let’s shift the focus to pulling data out by using indexing operators. You can index directly on
a series object, but this is not recommended. Use `.loc` or `.iloc` to index instead.


The `.loc` attribute deals with index labels. It allows you to pull out pieces of the series. You can
pass any of the following into an index operation on `.loc`:
- A scalar value of one of the index labels.
- A list of index labels.
- A slice of labels (closed interval, so it includes the stop value).
- An index.
- A boolean array (same index labels as the series, but with True or False values).
- A function that accepts a series and returns one of the above.
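The input types listed above can be sketched with the small `deposit` series from earlier. The variable names on the left are illustrative only.

```python
import pandas as pd

deposit = pd.Series([1, 2, 3, 4], name="Deposit",
                    index=["Glen", "Bob", "Chris", "Shirley"])

single = deposit.loc["Glen"]             # scalar label -> a single value
subset = deposit.loc[["Glen", "Chris"]]  # list of labels -> a Series
span = deposit.loc["Bob":"Chris"]        # label slice includes the stop label
large = deposit.loc[deposit > 2]         # boolean array with matching index

# A function receives the series and returns one of the inputs above.
above_mean = deposit.loc[lambda s: s > s.mean()]

print(span)
print(above_mean)
```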

In [29]:
deposit

Glen       1
Bob        2
Chris      3
Shirley    4
Name: Deposit, dtype: int64

In [30]:
# Not recommended
deposit["Glen"]

np.int64(1)

In [31]:
# Recommended
deposit.loc["Glen"]

np.int64(1)

In [32]:
deposit.iloc[0]

np.int64(1)

In [33]:
import numpy as np
another_deposit_amount = pd.Series([1, 10, 60, np.nan], name="other_count")
new_deposit_amount = pd.Series([54, 83, 26, 90], name= "counts")

In [34]:
new_deposit_amount.add(another_deposit_amount)

0    55.0
1    93.0
2    86.0
3     NaN
dtype: float64

In [35]:
new_deposit_amount.add(another_deposit_amount, fill_value= 0)

0    55.0
1    93.0
2    86.0
3    90.0
dtype: float64

In [36]:
(
    new_deposit_amount
    # Add another
    .add(another_deposit_amount, fill_value=0)
    # Sum
    .sum()
)

np.float64(324.0)

In [37]:
(
    new_deposit_amount
    # Flag values greater than 25
    .gt(25)
    # Multiply by 100
    .mul(100)
    # Mean gives the percentage of values greater than 25
    .mean()
)

np.float64(100.0)

In [41]:
amount.isnull()

0       False
1       False
2       False
3        True
4       False
        ...  
7391    False
7392    False
7393    False
7394    False
7395    False
Name: Amount, Length: 7396, dtype: bool

In [38]:
amount.isna()

0       False
1       False
2       False
3        True
4       False
        ...  
7391    False
7392    False
7393    False
7394    False
7395    False
Name: Amount, Length: 7396, dtype: bool

In [42]:
amount[amount.isnull()]

3     NaN
137   NaN
212   NaN
256   NaN
334   NaN
385   NaN
432   NaN
474   NaN
507   NaN
540   NaN
597   NaN
Name: Amount, dtype: float64

In [39]:
amount.isna().sum()

np.int64(11)

In [40]:
(
    amount
    .isna()
    .std()
)

np.float64(0.03853932039363171)

> If we want `.iloc` to return a series object rather than a single value, we can index it with a list of positions. This can be a list
with a single position in it or multiple positions. For example, `deposit.iloc[[0, 1, -1]]` returns a series with the
first, second, and last values.
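A minimal sketch of positional indexing with a list, reusing the four-element `deposit` series from earlier in the notebook:

```python
import pandas as pd

deposit = pd.Series([1, 2, 3, 4], name="Deposit",
                    index=["Glen", "Bob", "Chris", "Shirley"])

# A list of positions returns a Series; -1 counts from the end.
picked = deposit.iloc[[0, 1, -1]]
print(picked)

# A one-element list still returns a Series, unlike a bare integer.
one = deposit.iloc[[0]]
scalar = deposit.iloc[0]
print(type(one).__name__, scalar)
```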

> We can also use slices with `.iloc`. In this case, slices behave as they do in Python lists and follow
the half-open interval. That is, they include the first index and go up to but do not include the last
index. If we want to return the first five items, we can use the `.head` method or `.iloc[0:5]`,
which takes index positions starting at 0 and includes 1, 2, 3, and 4, but does not include 5.
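The half-open behavior can be checked against `.head`. This sketch rebuilds the twelve-value series from the start of the notebook.

```python
import pandas as pd

deposit = pd.Series(list(range(1, 13)), name="Deposit")

# Positions 0, 1, 2, 3, 4 are included; position 5 is not.
first_five = deposit.iloc[0:5]
print(first_five.tolist())

# The same rows come back from .head(5).
print(first_five.equals(deposit.head(5)))
```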

With the series defined below:

```{python}
ser = df['Amount']
```
Find the following:
1. Find the count of non-missing values of a series.
2. Find the number of entries of a series.
3. Find the number of unique entries of a series.
4. Find the mean value of a series.
5. Find the maximum value of a series.

In [49]:
ser = df['Amount']

# 1. Count of non-missing values in the series
non_missing_count = ser.count()

# 2. Total number of entries in the series (including missing values)
total_entries = ser.size

# 3. Number of unique entries in the series
unique_entries = ser.nunique()

# 4. Mean value of the series
mean_value = ser.mean()

# 5. Maximum value of the series
max_value = ser.max()

# Output the results
print(f"Non-missing values {non_missing_count}")
print(f"Total entries {total_entries}")
print(f"Unique entries {unique_entries}")
print(f"Mean value {mean_value}")
print(f"Max value {max_value}")

Non-missing values 7385
Total entries 7396
Unique entries 9
Mean value 35.63236289776574
Max value 125.0
