## pandas
- Used for exploratory data analysis (EDA)
- Used for data cleaning and transformation
- Used for importing/exporting data, creating/deleting columns, etc.

#### Topics:
- Series: statistical operations, element-wise functions, boolean filtering, missing values, arithmetic operations, etc.

# Series
## Difference between pandas Series and DataFrame
```
Feature           Series                 DataFrame
Dimensions        1D                     2D
Shape             (n,)                   (rows, columns)
Data structure    Single column          Table with multiple columns
Usage             Single column or row   Full dataset
```
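
The rows of the table above can be checked directly. The following is a minimal sketch; the column names `age` and `dept` and their values are made up for illustration.

```python
import pandas as pd

# 1D: a single labeled column of values
ages = pd.Series([90, 30, 40], name="age")

# 2D: a table with rows and columns (illustrative data)
df = pd.DataFrame({"age": [90, 30, 40],
                   "dept": ["ops", "hr", "it"]})

print(ages.ndim, ages.shape)   # 1 (3,)
print(df.ndim, df.shape)       # 2 (3, 2)

col = df["age"]    # selecting one column of a DataFrame returns a Series
row = df.iloc[0]   # selecting one row of a DataFrame also returns a Series
print(type(col).__name__, type(row).__name__)   # Series Series
```

This is why the table lists "Single column or row" under Series usage: both a column and a row of a DataFrame come back as a Series.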

In [2]:
import pandas as pd
# print(pd.__version__)

import warnings
warnings.filterwarnings(action='ignore')



In [3]:
# Series declaration
l = [90, 30, 40, 20, 80, 10] # age
s = pd.Series(l) # From a List
print(s)
print(type(s)) # <class 'pandas.core.series.Series'>

0    90
1    30
2    40
3    20
4    80
5    10
dtype: int64
<class 'pandas.core.series.Series'>


In [4]:
# SKIP:
# series declaration from a List with Custom Index

l_indx = [f"Emp{i:03d}" for i in range(1, len(s)+1)]
print(f"custom index: {l_indx}")
s = pd.Series(l, index=l_indx)

print(s)

custom index: ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005', 'Emp006']
Emp001    90
Emp002    30
Emp003    40
Emp004    20
Emp005    80
Emp006    10
dtype: int64


In [5]:
# Accessing elements of a Series: indexing starts at 0
# Slice syntax: [start_idx : stop_idx : step_size]; stop_idx itself is excluded
# default for start_idx = 0
# default for stop_idx = end of the Series
# default for step_size = 1

l = [90, 30, 40, 20, 80, 10] # age

s = pd.Series(l)
print(s)

print(s[3])   # element at idx=3
print(s[1:4]) # elements from idx=1 to idx=3

print(s[0:3])  # first 3 elements. s[:3] does the same 
print(s[2:])  # elements from idx=2 to the last element

0    90
1    30
2    40
3    20
4    80
5    10
dtype: int64
20
1    30
2    40
3    20
dtype: int64
0    90
1    30
2    40
dtype: int64
2    40
3    20
4    80
5    10
dtype: int64
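
The cell above uses the default integer index, where positions and labels coincide. With a custom index such as the `Emp001` labels shown earlier, the explicit accessors `.iloc` (by position) and `.loc` (by label) make the intent unambiguous. A minimal sketch, reusing the same ages:

```python
import pandas as pd

ages = pd.Series([90, 30, 40, 20, 80, 10],
                 index=[f"Emp{i:03d}" for i in range(1, 7)])

third = ages.iloc[2]             # by position: the third element
by_label = ages.loc["Emp003"]    # by label: the same element

# Positional slices exclude the stop position, like Python lists
pos_slice = ages.iloc[1:4]       # positions 1, 2, 3

# Label slices INCLUDE the stop label
label_slice = ages.loc["Emp002":"Emp004"]

every_other = ages.iloc[::2]     # step_size = 2

print(third, by_label)
print(pos_slice.equals(label_slice))
print(every_other)
```

Note the asymmetry: `.iloc[1:4]` stops before position 4, while `.loc["Emp002":"Emp004"]` includes `Emp004`, so both slices select the same three elements here.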


## Operations on Series

In [6]:
# 1. A few arithmetic operations: +, *, /
l = [30, 20, 10, 40, 20, 22, 34]
s = pd.Series(l)

print(s + 5)  # Add 5 to each element. A plain list does not support this (l + 5 raises TypeError)

print(s * 2)  # Multiply each element by 2
print(l * 2)  # For a list, * 2 repeats the list: the result holds 2 copies of l

print(s / 10)  # Divide each element by 10
# print(l / 10)  # TypeError: a list cannot be divided by an int


0    35
1    25
2    15
3    45
4    25
5    27
6    39
dtype: int64
0    60
1    40
2    20
3    80
4    40
5    44
6    68
dtype: int64
[30, 20, 10, 40, 20, 22, 34, 30, 20, 10, 40, 20, 22, 34]
0    3.0
1    2.0
2    1.0
3    4.0
4    2.0
5    2.2
6    3.4
dtype: float64


In [11]:
# 2. Boolean filtering: select all elements that are > 25
l = [50, 20, 30, 40, 10]
s = pd.Series(l)

mask = (s > 25)  # Boolean Series: True where the element is greater than 25
print(s[mask])   # Keep only the elements where mask is True

# With a list, this would need a for loop or a list comprehension

0    50
2    30
3    40
dtype: int64
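
Masks can also be combined. The sketch below uses the same data as the cell above; the thresholds are arbitrary. Each condition needs its own parentheses, and the element-wise operators are `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`.

```python
import pandas as pd

l = [50, 20, 30, 40, 10]
s = pd.Series(l)

in_range = s[(s > 15) & (s < 45)]   # AND: both conditions hold
extremes = s[(s < 15) | (s > 45)]   # OR: at least one condition holds
not_big = s[~(s > 25)]              # NOT: invert the mask

# Plain-list equivalent of s[s > 25], for comparison
filtered_list = [x for x in l if x > 25]

print(in_range)
print(extremes)
print(not_big)
print(filtered_list)
```

Unlike the list comprehension, the filtered Series keeps the original index labels of the rows that survive.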


## Methods of Series

In [7]:
# Statistical methods
l = [30, 20, 10, 40, 20, 30, 20]
s = pd.Series(l)

print(s.mean())      # Average. Note: a list has no built-in method to compute the mean
print(s.median())    # Median
print(s.std())       # Standard deviation
print(s.max())       # Maximum
print(s.min())       # Minimum
print(s.sum())       # Sum of all elements

# cont...

24.285714285714285
20.0
9.759000729485331
40
10
170
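
One detail worth knowing about the standard deviation printed above: `Series.std()` divides by `n - 1` (sample standard deviation, `ddof=1`) by default. The sketch below cross-checks it against Python's standard-library `statistics` module, which is also the usual way to get these numbers from a plain list.

```python
import math
import statistics

import pandas as pd

l = [30, 20, 10, 40, 20, 30, 20]
s = pd.Series(l)

sample_std = s.std()            # default ddof=1: divide by n - 1
population_std = s.std(ddof=0)  # ddof=0: divide by n

# Standard-library equivalents that work directly on the list
print(math.isclose(sample_std, statistics.stdev(l)))
print(math.isclose(population_std, statistics.pstdev(l)))
print(math.isclose(s.mean(), statistics.mean(l)))
```

The population value is always the smaller of the two, so check which definition is expected before comparing results across tools.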


In [8]:
# cont...

print(s)
print(s.cumsum())    # cumulative sum:     30, 30 + 20, 30 + 20 + 10, ...
print(s.cumprod())   # cumulative product: 30, 30 * 20, 30 * 20 * 10, ...

0    30
1    20
2    10
3    40
4    20
5    30
6    20
dtype: int64
0     30
1     50
2     60
3    100
4    120
5    150
6    170
dtype: int64
0            30
1           600
2          6000
3        240000
4       4800000
5     144000000
6    2880000000
dtype: int64


In [9]:
# cont..

print(s)
print(s.describe())
print(s.quantile(0.25))  # 25th percentile

print(s.sem())     # Standard error of the mean
print(s.nunique()) # Number of unique elements
print(s.unique())  # Array of the unique elements, in order of first appearance

print(s.value_counts())  # Value counts (frequency of each unique value)

print(s.idxmin())  # Index of first min value
print(s.idxmax())  # Index of first max value

0    30
1    20
2    10
3    40
4    20
5    30
6    20
dtype: int64
count     7.000000
mean     24.285714
std       9.759001
min      10.000000
25%      20.000000
50%      20.000000
75%      30.000000
max      40.000000
dtype: float64
20.0
3.688555567816587
4
[30 20 10 40]
20    3
30    2
10    1
40    1
Name: count, dtype: int64
2
3
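
A few of the values printed above can be derived from one another. The sketch below, on the same data, shows how they relate; it is illustrative rather than exhaustive.

```python
import math

import pandas as pd

s = pd.Series([30, 20, 10, 40, 20, 30, 20])

# Standard error of the mean = sample std / sqrt(number of values)
manual_sem = s.std() / math.sqrt(s.count())
print(math.isclose(manual_sem, s.sem()))

# quantile(0.25) is the same number describe() reports as "25%"
print(s.quantile(0.25) == s.describe()["25%"])

# value_counts(normalize=True) returns relative frequencies instead of counts
shares = s.value_counts(normalize=True)
print(shares)

# idxmin()/idxmax() return index labels, which can be passed back to .loc
smallest = s.loc[s.idxmin()]
largest = s.loc[s.idxmax()]
print(smallest, largest)
```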


In [10]:
############################
# A few string operations (for a Series of strings, via the .str accessor)
s = pd.Series(['ash', 'kumar', 'bob', 'shaMLOdhiya', 'taNNER'])
print(s.str.upper())     # Convert to uppercase
print(s.str.lower())     # Convert to lowercase
print(s.str.len())       # Length of each string

##########################
# Handling Missing Data
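s = pd.Series([1, 2, None, 4, None, 1])  # None is stored as NaN, so the dtype becomes float64
print(s.isnull())        # True where the value is missing
print(s.fillna(0))       # Replace NaNs with 0
print(s.dropna())        # Drop NaNs (the remaining index labels are kept)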

#########################
# Sorting
s = pd.Series([10, 2, 30, 20, 90, 10])
print(s.sort_values(ascending=True))   # Sort by value, ascending; index labels travel with their values

0            ASH
1          KUMAR
2            BOB
3    SHAMLODHIYA
4         TANNER
dtype: object
0            ash
1          kumar
2            bob
3    shamlodhiya
4         tanner
dtype: object
0     3
1     5
2     3
3    11
4     6
dtype: int64
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
0    1.0
1    2.0
2    0.0
3    4.0
4    0.0
5    1.0
dtype: float64
0    1.0
1    2.0
3    4.0
5    1.0
dtype: float64
1     2
0    10
5    10
3    20
2    30
4    90
dtype: int64
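
Two common follow-ups to the missing-data and sorting cells above are sketched below on the same data: filling gaps with the mean instead of a constant, and handling the index after a sort. Whether mean imputation is appropriate depends on the dataset; it is shown here only as an example.

```python
import pandas as pd

s = pd.Series([1, 2, None, 4, None, 1])

n_missing = s.isna().sum()      # count the missing values
filled = s.fillna(s.mean())     # mean() skips NaN: (1 + 2 + 4 + 1) / 4 = 2.0
print(n_missing)
print(filled)

t = pd.Series([10, 2, 30, 20, 90, 10])

desc = t.sort_values(ascending=False)      # largest first, original labels kept
desc_reset = desc.reset_index(drop=True)   # renumber the index 0..n-1
restored = desc.sort_index()               # back to the original order
print(desc_reset)
print(restored.equals(t))
```

These methods return new Series objects, so `s` and `t` themselves are left unchanged.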
