### SESSION 16 - PANDAS SERIES

### What is Pandas
- Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
- https://pandas.pydata.org/about/index.html


### Pandas Series
- A Pandas Series is like a column in a table. It is a one-dimensional array that can hold data of any type.
- Each value in a Series is associated with an index label.
- By default the index runs from 0 to n - 1, where n is the number of elements; you can also specify your own index values.
- **Syntax : pd.Series( data, index, name )**
- The index is generated automatically; to set a custom index, pass it through the `index` parameter.

#### Basic data structure in pandas
Pandas provides two main classes for handling data:
- **Series**
- **DataFrame**

#### Importing Pandas
There are different ways to import the pandas library:

The most common way is an `import` statement with an alias:
- **import pandas as pd**

Importing all functions and classes (generally discouraged, because it clutters the namespace):
- **from pandas import ***

Importing specific functions and classes:
- **from pandas import DataFrame, read_csv**


**Installing pandas library**

In [1]:
# installing pandas library in jupyter notebook
!pip install pandas



**Checking pandas version**

In [2]:
# checking pandas version
import pandas as pd
print(pd.__version__)

1.5.0


**Upgrading the pandas library version:**

In [3]:
# upgrading the pandas library version in jupyter notebook
!pip install --upgrade pandas





**Create a Series from lists:**
1. string
2. integers
3. custom index
4. setting a name

In [4]:
# string
countries = ['India', 'Nepal', 'Srilanka', 'Bhutan']
print(pd.Series(countries))
# Note: the dtype is object, which is how pandas stores strings here

0       India
1       Nepal
2    Srilanka
3      Bhutan
dtype: object


In [5]:
# integer
runs = [54,76,100,74]
print(pd.Series(runs))

0     54
1     76
2    100
3     74
dtype: int64


In [6]:
# Custom index and name
import pandas as pd
marks = [ 58, 93, 89, 60 ]
subjects = [ 'C++' , 'Python', 'R', 'Java' ]
print(pd.Series(marks, index=subjects, name='student'))


C++       58
Python    93
R         89
Java      60
Name: student, dtype: int64


**Create a Series from Dictionary:**

In [7]:
import pandas as pd
marks = {
    'maths':78,
    'english': 70,
    'science': 89
}
pd.Series(marks, name='student score')

maths      78
english    70
science    89
Name: student score, dtype: int64

### Series Attribute :
- A Series object has several commonly used attributes:

**size:**
- The size attribute returns the number of elements in a Series, including any elements that might contain missing or NaN (Not-a-Number) values.

In [8]:
import pandas as pd
data = [10, 20, 30, None, 50]
print(pd.Series(data).size) 

data1 = [10, 20, 30, 50]
print(pd.Series(data1).size) 

5
4


**dtype:** 
- This attribute returns the data type of the elements in the Series. 
- It can be used to check the data type of the data within the Series.

In [9]:
import pandas as pd
data = [10, 20, 30, 40, 50]
print(pd.Series(data).dtype)

int64


**name:** 
- You can assign a name to a Series when creating it or later using the name attribute. 
- The name is typically used in the context of DataFrames, where a Series can represent a column.


In [10]:
import pandas as pd
data = [10, 20, 30, 40, 50]
print(pd.Series(data, name='MyData').name)

MyData


**is_unique:**
- The is_unique attribute returns a boolean value indicating whether all the values in the Series are unique (no duplicates) or not.

In [11]:
import pandas as pd
data1 = [10, 20, 30, 40, 50]
print(pd.Series(data1).is_unique)  # Returns True

data2 = [10, 10, 30, 50, 50]
print(pd.Series(data2).is_unique)  # Returns False

True
False


**index:** 
- This attribute returns the index labels associated with the Series. 
- The index labels can be integers, strings, or any other data type. 
- If no index labels were explicitly provided when creating the Series, a default integer index will be generated.

In [12]:
import pandas as pd
data = [10, 20, 30, 40, 50]
index_labels = ['A', 'B', 'C', 'D', 'E']
print(pd.Series(data, index=index_labels).index)

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')


**values:** 
- This attribute returns the data in the Series as a NumPy array.


In [13]:
import pandas as pd
marks = {'maths':78,'english': 70,'science': 89}
print(pd.Series(marks).values)

[78 70 89]


### Series using read_csv():

- **squeeze=True :** 
    - by default `read_csv()` returns a DataFrame; `squeeze=True` squeezes the result into a Series when it has only one column.
    - this parameter was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, call `.squeeze("columns")` on the result of `read_csv()` instead.
- **index_col :**
    - specifies which column in the CSV file should be used as the index of the result.
    - The index labels identify each row. 
    - If you don't specify `index_col`, Pandas creates a default integer index starting from 0.
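
A minimal sketch of reading a one-column CSV as a Series without the `squeeze` keyword, which may help on pandas 2.0 and later. The in-memory CSV text and its column names are made up for illustration and only stand in for the dataset files used in this session.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file with an index column and one data column
csv_text = "match_no,runs\n1,10\n2,25\n3,7\n"

# Read as a DataFrame, then squeeze the single remaining column into a Series
runs = pd.read_csv(io.StringIO(csv_text), index_col="match_no").squeeze("columns")

print(type(runs))   # a Series rather than a DataFrame
print(runs)
```

The same pattern should work for a file path in place of `io.StringIO(csv_text)`.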

In [14]:
import pandas as pd
subs = pd.read_csv('DATASETS/S16/subs.csv', squeeze=True)
print(subs)
print(type(subs))

0       48
1       57
2       40
3       43
4       44
      ... 
360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64
<class 'pandas.core.series.Series'>






In [14]:
vk = pd.read_csv('DATASETS/S16/kohli_ipl.csv', index_col='match_no', squeeze=True)
print(vk)
print(type(vk))

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64
<class 'pandas.core.series.Series'>






In [15]:
m = pd.read_csv('DATASETS/S16/bollywood.csv', index_col='movie', squeeze=True)
print(m)
print(type(m))





movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object
<class 'pandas.core.series.Series'>


### Series Methods:

**head():**
- The head() method returns the first few rows (5 by default) of a DataFrame or Series.
- If n is negative, head(n) returns all rows except the last |n| rows.

In [16]:
# head()
print(vk.head()) 
print(vk.head(2))
#print(vk.head(-2)) # all rows except the last 2

NameError: name 'vk' is not defined
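
Since the cell above depends on `vk` being loaded, here is a small self-contained sketch of `head()` on a made-up Series (the values are illustrative only), including the negative-argument case.

```python
import pandas as pd

scores = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

print(scores.head())     # first 5 values
print(scores.head(2))    # first 2 values
print(scores.head(-2))   # everything except the last 2 values
```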

**tail():**
- The tail() method is used to display the last few rows (default is 5) of a DataFrame or Series.

In [None]:
import pandas as pd
print(vk.tail()) # default is 5
print(vk.tail(2)) # specific number

**sample()**
- The sample() method in Pandas is used to randomly select a specified number of rows (default is 1) or elements from a DataFrame or Series. 
- This is particularly useful when you want to obtain a random sample from your data for various purposes, such as data exploration, analysis, or testing.
- Looking only at head() or tail() can give a biased picture when the data is ordered; a random sample helps avoid that.

In [None]:
print(vk.sample()) 
# 3 random rows from the dataset
print(vk.sample(3)) 
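
Because `sample()` is random, each run returns different rows. A minimal sketch, on made-up values, of how the `random_state` parameter can make a sample reproducible:

```python
import pandas as pd

runs = pd.Series([12, 45, 7, 88, 31, 5, 64], name="runs")

# random_state fixes the random seed, so the same rows come back on every call
first_draw = runs.sample(3, random_state=42)
second_draw = runs.sample(3, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))
```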

**value_counts() function:**
- value_counts() counts how many times each unique value occurs in a Series and returns the counts sorted in descending order.

In [None]:
print(m.value_counts())
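
A small sketch of `value_counts()` on made-up names (not the bollywood dataset), also showing the `normalize=True` option for relative frequencies:

```python
import pandas as pd

leads = pd.Series(["Akshay", "Salman", "Akshay", "Ajay", "Akshay", "Salman"])

counts = leads.value_counts()                 # absolute counts, most frequent first
shares = leads.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```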

**sort_values() function:**
- sort_values() is used to sort the values within a Pandas Series.
- By default, it sorts the values in ascending order; use the ascending parameter to choose the order, where **ascending=True** sorts in ascending order and **ascending=False** sorts in descending order.
- **syntax: series.sort_values(ascending=True, inplace=False)**
    - **inplace:** 
        - The inplace parameter, when set to True, performs the sorting operation in-place, meaning it modifies the original Series, and the sorted result replaces the original data. 
        - If inplace is set to False (or not specified, as False is the default), it returns a new Series with the sorted values while leaving the original Series unchanged.

In [None]:
print(vk.sort_values()) # default ascending=True
print(vk.sort_values(ascending=False))
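
The difference between the default behaviour and `inplace=True` can be sketched on a small made-up Series:

```python
import pandas as pd

runs = pd.Series([54, 76, 100, 74])

sorted_runs = runs.sort_values(ascending=False)  # new Series; runs is unchanged
print(sorted_runs.tolist())
print(runs.tolist())

runs.sort_values(inplace=True)                   # modifies runs itself and returns None
print(runs.tolist())
```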

**Method chaining :**
- Method chaining on a Pandas Series means applying several methods one after another in a single expression, each operating on the result of the previous one. 
- This approach is both efficient and readable, making it easier to perform complex data manipulations and transformations.

In [None]:
# method chaining
print(vk.sort_values(ascending=False).head(1).values[0])

In [None]:
# with inplace=True for permanent changes in series
#vk.sort_values(ascending=False,inplace=True)

In [None]:
#vk

**sort_index():** 
- The sort_index() method is similar in concept to sort_values(), but **instead of sorting the values within the Series, it sorts the index (row labels) of the Series.** 
- Both methods let you control the sorting order (ascending or descending), and both accept the inplace parameter to modify the original Series.
- **syntax: series.sort_index(ascending=True, inplace=False)**

In [None]:
# default ascending
print(vk.sort_index())
# descending
vk.sort_index(ascending=False)

In [None]:
# with inplace=True for permanent changes in series
vk.sort_index(ascending=False,inplace=True)

In [None]:
print(vk)
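
A self-contained sketch of `sort_index()` with string labels, using the subject marks from earlier in this session as sample data:

```python
import pandas as pd

marks = pd.Series([58, 93, 89, 60], index=["C++", "Python", "R", "Java"])

print(marks.sort_index())                  # labels in ascending (alphabetical) order
print(marks.sort_index(ascending=False))   # labels in descending order
```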

**Series Mathematical Methods :**

Common statistical methods in Pandas Series for analyzing data:

In [None]:
import pandas as pd
subs = pd.read_csv('DATASETS/S16/subs.csv',squeeze=True)
print(subs)
print(type(subs))

In [None]:
vk = pd.read_csv('DATASETS/S16/kohli_ipl.csv', index_col='match_no',squeeze=True)
print(vk)
print(type(vk))

In [None]:
m = pd.read_csv('DATASETS/S16/bollywood.csv', index_col='movie', squeeze=True)
print(m)
print(type(m))

**count():**
- Count the non-null elements in the Series.

In [None]:
print(vk.count())
print(subs.count())

**sum():**
- the sum() method is used to calculate the sum of all the elements in a Series

In [None]:
print(subs.sum())

**product():**
- The product() method in Pandas Series is used to calculate the product of all elements in the Series. 
- It multiplies all the values together and returns the result.

In [None]:
print(subs.product())
print(vk.product())

**mean():** 
- This method calculates the mean (average) of the elements in a Series.

In [None]:
print('average yt subs of campusx channel:',subs.mean())
print('average ipl runs of virat kohli:',vk.mean())

**median():** 
- The median() method calculates the median of the elements in a Series, which is the middle value when the data is sorted. It's a measure of central tendency.

In [None]:
print(subs.median())
print(vk.median())

**mode():** 
- The mode() method returns the mode(s) of the elements in a Series, which is the most frequently occurring value(s).

In [None]:
print('Most frequent lead actor :')
print(m.mode())

**std():** 
- The std() method computes the standard deviation of the elements in a Series, which measures the spread or dispersion of the data.

In [None]:
print(vk.std())

**var():** 
- The var() method calculates the variance of the elements in a Series, based on the squared differences from the mean. By default pandas computes the sample variance, dividing by n - 1 (`ddof=1`).

In [None]:
print(subs.var())
print(vk.var())
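
A short sketch, on made-up values, of how the `ddof` parameter switches between sample and population variance:

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(values.var())         # sample variance, divides by n - 1 (ddof=1 is the default)
print(values.var(ddof=0))   # population variance, divides by n
print(values.std(ddof=0))   # population standard deviation
```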

**min():**
- The min() method returns the minimum value in a Series or DataFrame.

In [None]:
print('minimum subs:',subs.min())
print('minimum runs:',vk.min())

**max():**
- The max() method returns the maximum value in a Series or DataFrame.

In [None]:
print('maximum subs:',subs.max())
print('maximum runs:',vk.max())

**describe():**
- The describe() method in Pandas is a convenient function to generate descriptive statistics of a numeric Series or DataFrame. 
- It provides a summary of various statistical measures, giving you insights into the data's distribution and central tendency.

- The describe() method provides the following statistics:

    - **count:** The number of non-null elements in the Series.
    - **mean:** The mean (average) of the Series.
    - **std:** The standard deviation, which measures the spread of the data.
    - **min:** The minimum value in the Series.
    - **25%:** The 25th percentile (lower quartile).
    - **50%:** The median (50th percentile).
    - **75%:** The 75th percentile (upper quartile).
    - **max:** The maximum value in the Series.
    
- **Note:** the statistics above apply to numeric data; for a non-numeric (object) Series, describe() reports count, unique, top and freq instead.

In [None]:
print(vk.describe())

In [None]:
print(subs.describe())
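
A minimal sketch comparing `describe()` on a numeric Series and on a Series of strings; both Series hold made-up sample values.

```python
import pandas as pd

numbers = pd.Series([48, 57, 40, 43, 44])
names = pd.Series(["Akshay", "Salman", "Akshay"])

num_summary = numbers.describe()   # count, mean, std, min, quartiles, max
name_summary = names.describe()    # count, unique, top, freq

print(num_summary)
print(name_summary)
```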

**Indexing in Series:**

In [None]:
import pandas as pd
subs = pd.read_csv('DATASETS/S16/subs.csv', squeeze=True)
print(subs)
print(type(subs))

In [None]:
vk = pd.read_csv('DATASETS/S16/kohli_ipl.csv', index_col='match_no', squeeze=True)
print(vk)
print(type(vk))

In [None]:
m = pd.read_csv('DATASETS/S16/bollywood.csv', index_col='movie', squeeze=True)
print(m)
print(type(m))

In [None]:
#Positive Indexing in series:
import pandas as pd
series = pd.Series([10,20,40,50])
print(series)

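**Note:** With the default integer index, `series[i]` looks up the label `i`, so non-negative integers work but a negative integer such as `series[-1]` raises a KeyError. Use `series.iloc[-1]` for position-based access.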

In [None]:
print(series[0])
print(series[1])
#print(series[5]) # KeyError: there is no label 5

In [None]:
#Negative Indexing in series:
#It does not work on a Series with the default integer index
#print(series[-1]) # KeyError: there is no label -1
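
Position-based and label-based access can be made explicit with `iloc` and `loc`. A minimal sketch on the same small Series:

```python
import pandas as pd

series = pd.Series([10, 20, 40, 50])

print(series.iloc[0])    # first element by position
print(series.iloc[-1])   # last element by position
print(series.loc[3])     # element whose label is 3 (here also the last one)

# With the default integer index, series[-1] is treated as a label lookup,
# and since no label -1 exists it raises KeyError.
try:
    series[-1]
except KeyError:
    print("KeyError: -1 is not an index label")
```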

**Note:** Here the movie names act as the index labels; an integer position can also be used.

In [None]:
# bollywood.csv dataset
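# Both positive and negative positions work here because the index labels are strings
# (newer pandas versions deprecate integer positions with []; m.iloc[1] is the explicit form)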
print(m['Uri: The Surgical Strike'])
print(m[1])
print(m[-1])

**Indexing with labels:**

In [None]:
# movie name is label for index
print(m['Uri: The Surgical Strike'])

**Slicing:**
- Slicing a Series works like slicing a Python list.

In [None]:
# Positive slicing
print(vk[4:8])
print(vk[1:8:2])

In [None]:
# negative slicing 
print(m[-5:]) # last 5 movies 

In [None]:
# Virat Kohli's runs in his last 5 IPL matches
print(vk[-5:]) 
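
One detail worth sketching: slicing by position with `iloc` excludes the end, while slicing by label with `loc` includes it. The Series below uses made-up runs with match numbers as labels.

```python
import pandas as pd

runs = pd.Series([1, 23, 13, 12, 1, 9], index=[1, 2, 3, 4, 5, 6])

print(runs.iloc[1:4])   # positions 1, 2, 3 (end position excluded)
print(runs.loc[1:4])    # labels 1 through 4 (end label included)
print(runs.iloc[-2:])   # last two elements
```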

**Fancy indexing:**

In [None]:
print(vk[[4,3,7]])
print(m[[4,6,10]])
print(subs[[9,6]])

**Editing Pandas Series:(Optional)**

Editing a Pandas Series involves modifying the data within the Series or changing its values or index labels.
Here are some common operations for editing Pandas Series:

In [None]:
import pandas as pd
marks = [58,93,89,60]
subjects = ['C++','Python','NumPy','Java']
student_marks = pd.Series(marks, index=subjects)
print(student_marks)

In [None]:
# Change a specific value
student_marks[0] = 90
print(student_marks)

In [None]:
# what if the index label does not exist? A new entry is added
student_marks['JavaScript'] = 90
print(student_marks)

In [None]:
# Using slicing:
# changing the vk ipl runs
vk[0:3] = [90,70,100]
print(vk.head(4))

In [None]:
# Fancy indexing:
vk[[1,2,3,4]] = [70,70,70,70]
print(vk.head(4))

In [None]:
# using index label
m['Battalion 609'] = 'Salman Khan'
print(m.head(4))

**Note:** Most of the time, real-world data is only read, not edited.

**Series with Python Functionalities:**

In [None]:
# len/type/dir/sorted/max/min
print(len(subs))
print(type(subs))
#print(dir(subs))
#print(sorted(subs))
print(min(subs))
print(max(subs))

**type conversion:**

In [None]:
print(list(student_marks))

In [None]:
print(dict(student_marks))

**membership operator:**
- The `in` operator checks the index labels, not the values.

In [None]:
m

In [None]:
print('Company (film)' in m)
print('Evening' in m)
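
A small sketch of the membership test on a made-up Series, including one way to search the values rather than the labels:

```python
import pandas as pd

marks = pd.Series([58, 93, 89], index=["C++", "Python", "Java"])

print("Python" in marks)     # 'Python' is an index label
print(93 in marks)           # 93 is a value, not a label
print(93 in marks.values)    # search the underlying values instead
```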

**Looping:**

In [None]:
for i in student_marks.index:
    print(i)
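
Besides looping over the index, a Series can be iterated over its values or over (label, value) pairs. A minimal sketch on made-up marks:

```python
import pandas as pd

marks = pd.Series([58, 93, 89], index=["C++", "Python", "Java"])

for value in marks:                  # iterating a Series yields its values
    print(value)

for label, value in marks.items():   # items() yields (label, value) pairs
    print(label, value)
```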

**Arithmetic operators:**

In [None]:
print(student_marks)

In [None]:
# marks remaining out of 100
# The scalar 100 is broadcast to every element
print(100 - student_marks)

**Relational operator :**

In [None]:
# matches where vk scored 50 or more runs
print(vk >= 50)

**Boolean Indexing on Series:**

In [None]:
# Find the number of 50s and 100s scored by Kohli
vk_score = vk[vk >= 50]
print(vk_score.size)

In [None]:
# find number of ducks(zeros)
print(vk[vk==0])
print(vk[vk==0].size)

In [None]:
# Count the number of days with more than 200 subs
print(subs[subs > 200].size)

In [None]:
# find actors who have done more than 20 movies
m_count = m.value_counts()
print(m_count[m_count > 20])
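
Conditions can also be combined with `&` (and) and `|` (or). A minimal sketch on made-up scores, separating fifties from hundreds:

```python
import pandas as pd

runs = pd.Series([0, 52, 100, 35, 75, 0, 113])

fifties = runs[(runs >= 50) & (runs < 100)]   # each condition needs its own parentheses
hundreds = runs[runs >= 100]
ducks = (runs == 0).sum()                     # True counts as 1, so this counts the zeros

print(fifties.size, hundreds.size, ducks)
```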

**Plotting Graphs on Series:**

In [None]:
# line plot of subscribers gained per day
subs.plot()

In [None]:
m.value_counts().head(10).plot(kind='pie')

In [None]:
m.value_counts().head(10).plot(kind='bar')

**Some Important Series Methods:**

These are some common methods and functions available for working with Pandas Series in Python:


In [23]:
import pandas as pd
m = pd.read_csv('DATASETS/S16/bollywood.csv', index_col='movie', squeeze=True)
print(m)

movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object






In [24]:
import pandas as pd
vk = pd.read_csv('DATASETS/S16/kohli_ipl.csv', index_col='match_no', squeeze=True)
print(vk)

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64






In [25]:
import pandas as pd
subs = pd.read_csv('DATASETS/S16/subs.csv', squeeze=True)
print(subs)

0       48
1       57
2       40
3       43
4       44
      ... 
360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64






**astype():**
- This method casts the elements of a Series to the specified data type (e.g., int, float, str).
- Casting to a smaller type (e.g., int64 to int32) can reduce memory usage.
- Example: series.astype(float)
- **Syntax: series.astype(dtype)**

In [26]:
import sys
print('Original size of dataset:',sys.getsizeof(vk))
vk_size = vk.astype('int32')
print('Reduce size of dataset:',sys.getsizeof(vk_size))

Original size of dataset: 3456
Reduce size of dataset: 2596
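
As an alternative to `sys.getsizeof`, the `memory_usage()` method reports the bytes used by the data itself. A sketch on a generated Series of 1000 integers (not the vk dataset):

```python
import pandas as pd

runs = pd.Series(range(1000), dtype="int64")
small = runs.astype("int32")

print(runs.memory_usage(index=False))    # bytes used by the int64 values
print(small.memory_usage(index=False))   # bytes used by the int32 values
```

Each int64 value takes 8 bytes and each int32 value takes 4, so the values need half the space after the cast.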


**between():**
- Checks if each element in the Series falls within the specified range; both bounds are included by default. 
- Returns a boolean Series.
- **Syntax: series.between(left, right, inclusive='both')**


In [47]:
print(vk.between(50,100)) # return boolean value
print(vk[vk.between(50, 100)]) # values

match_no
1      False
2      False
3      False
4      False
5      False
       ...  
211    False
212    False
213     True
214    False
215    False
Name: runs, Length: 215, dtype: bool
match_no
15      50
34      58
41      71
44      56
45      67
52      70
57      57
68      73
71      51
73      58
74      65
80      57
81      93
82      99
85      56
97      67
99      73
103     51
104     62
110     82
116     75
117     79
119     80
120    100
122     52
127     75
129     54
131     54
132     62
134     64
137     55
141     58
144     57
145     92
148     68
152     70
160     84
162     67
164    100
175     72
178     90
182     50
188     72
197     51
198     53
209     58
213     73
Name: runs, dtype: int64


In [49]:
print(vk.between(95,110)) # return boolean value
print(vk[vk.between(95, 110)]) # values

match_no
1      False
2      False
3      False
4      False
5      False
       ...  
211    False
212    False
213    False
214    False
215    False
Name: runs, Length: 215, dtype: bool
match_no
82      99
120    100
123    108
126    109
164    100
Name: runs, dtype: int64
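
A short sketch of how the `inclusive` parameter changes which boundary values match, using made-up scores and the string options available in pandas 1.3 and later:

```python
import pandas as pd

runs = pd.Series([49, 50, 75, 100, 101])

both = runs.between(50, 100)                          # both ends included (default)
neither = runs.between(50, 100, inclusive="neither")  # both ends excluded
left = runs.between(50, 100, inclusive="left")        # only the lower bound included

print(both.tolist())
print(neither.tolist())
print(left.tolist())
```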


**clip():**
- Clips values in the Series to be within the specified lower and upper bounds.
- **Syntax: series.clip(lower, upper)**

In [35]:
print(subs.clip(100,160))

0      100
1      100
2      100
3      100
4      100
      ... 
360    160
361    160
362    155
363    144
364    160
Name: Subscribers gained, Length: 365, dtype: int64
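
`clip()` can also be given only one bound. A minimal sketch on three made-up values:

```python
import pandas as pd

subs = pd.Series([48, 120, 231])

print(subs.clip(lower=100).tolist())   # only a lower bound
print(subs.clip(upper=160).tolist())   # only an upper bound
print(subs.clip(100, 160).tolist())    # both bounds
```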


**drop_duplicates():**
- Removes duplicate values from the Series.
- **Syntax: series.drop_duplicates(keep='first', inplace=False)**, where keep is 'first', 'last' or False

In [36]:
temp = pd.Series([1,1,3,3,3,5,5])
print(temp)
print(temp.drop_duplicates()) # default keep='first'
# keep='last' keeps the last occurrence and drops the earlier ones
print(temp.drop_duplicates(keep='last'))

0    1
1    1
2    3
3    3
4    3
5    5
6    5
dtype: int64
0    1
2    3
5    5
dtype: int64
1    1
4    3
6    5
dtype: int64


**duplicated():**
- The duplicated() method marks values in a Series that have already appeared earlier.
- It returns a boolean Series in which True marks a repeat of an earlier value.
- **Syntax: series.duplicated(keep='first')**

In [37]:
temp = pd.Series([1,1,3,3,3,5,5])
print(temp.duplicated()) # True means duplicate
print('Duplicate value count:',temp.duplicated().sum())

0    False
1     True
2    False
3     True
4     True
5    False
6     True
dtype: bool
Duplicate value count: 4


**isnull():**
- Returns a boolean Series indicating whether each element is NaN (missing data).
- **Syntax: series.isnull()**

In [39]:
import numpy as np
import pandas as pd
temp = pd.Series([1,3,np.nan,np.nan,5,np.nan,7,np.nan])
print(temp)

0    1.0
1    3.0
2    NaN
3    NaN
4    5.0
5    NaN
6    7.0
7    NaN
dtype: float64


In [40]:
temp.size 

8

In [41]:
temp.count() # counts only non-NaN values

4

In [42]:
# isnull
print(temp.isnull()) # return boolean
print('Missing values:',temp.isnull().sum())

0    False
1    False
2     True
3     True
4    False
5     True
6    False
7     True
dtype: bool
Missing values: 4


**dropna():**
- Removes missing (NaN) values from the Series.
- **Syntax: series.dropna(axis=0, inplace=False)**

In [43]:
print(temp.dropna())

0    1.0
1    3.0
4    5.0
6    7.0
dtype: float64


**fillna():**
- Fills missing (NaN) values in the Series with the specified value.
- **Syntax: series.fillna(value)**

In [45]:
import numpy as np
import pandas as pd
temp = pd.Series([1,3,np.nan,5,np.nan,7,np.nan])
print(temp)
print(temp.fillna(0))

0    1.0
1    3.0
2    NaN
3    5.0
4    NaN
5    7.0
6    NaN
dtype: float64
0    1.0
1    3.0
2    0.0
3    5.0
4    0.0
5    7.0
6    0.0
dtype: float64
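
Instead of a constant, missing values are often filled with a statistic of the Series. A minimal sketch, assuming the mean is a reasonable replacement for the made-up data below:

```python
import numpy as np
import pandas as pd

temp = pd.Series([1, 3, np.nan, 5, np.nan, 7])

# mean() skips NaN, so this is the mean of 1, 3, 5 and 7
filled = temp.fillna(temp.mean())

print(filled)
```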


**isin():**
- Checks if each element in the Series is in the provided list of values. 
- Returns a boolean Series.
- **Syntax: series.isin(values)**

In [46]:
print(vk.isin([49,99])) # return boolean
print(vk[vk.isin([49,99])]) 

match_no
1      False
2      False
3      False
4      False
5      False
       ...  
211    False
212    False
213    False
214    False
215    False
Name: runs, Length: 215, dtype: bool
match_no
82    99
86    49
Name: runs, dtype: int64


**apply():**
- Applies a given  function to each element in the Series and returns a new Series with the results.
- **Syntax: series.apply(func)**

In [None]:
print(m.apply(lambda x:x.split()[0].upper()).head(5))

In [None]:
# If a day's subs are above the average, label it a good day
# otherwise label it a bad day
avg = subs.mean()  # compute the mean once instead of once per element
print('mean', avg)
print(subs.apply(lambda x: 'Good day' if x > avg else 'Bad day'))
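
For a simple condition like this, a vectorized alternative to `apply()` is possible. The sketch below uses `numpy.where` on made-up subscriber counts and checks that both approaches give the same labels.

```python
import numpy as np
import pandas as pd

subs = pd.Series([48, 57, 231, 226, 40])
avg = subs.mean()

via_apply = subs.apply(lambda x: "Good day" if x > avg else "Bad day")
via_where = pd.Series(np.where(subs > avg, "Good day", "Bad day"), index=subs.index)

print(via_apply.tolist())
print(via_apply.tolist() == via_where.tolist())
```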

**copy():**
- Creates a deep copy of the Series. 
- Changes made to the copied Series do not affect the original.
- **Syntax: series.copy()**

In [None]:
print(vk.head())

In [None]:
new_vk = vk.head().copy()
print(new_vk)

In [None]:
new_vk[1] = 100

In [None]:
# vk original data
print(vk.head())

In [None]:
# modified new_vk using copy()
print(new_vk)
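
The effect of `copy()` is easiest to see next to plain assignment, which only creates a second name for the same Series. A minimal sketch on made-up values:

```python
import pandas as pd

original = pd.Series([1, 23, 13], index=[1, 2, 3])

alias = original               # plain assignment: both names refer to the same Series
independent = original.copy()  # separate copy of the data

alias[1] = 100                 # visible through `original` as well
independent[2] = 999           # stays local to the copy

print(original.tolist())
print(independent.tolist())
```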

**EXTRA series methods:**

**to_numeric():**
- The to_numeric() function converts the values in a Series (or DataFrame column) to a numeric data type. 
- It is particularly useful when a Series contains numbers stored as strings and you want integers or floating-point numbers.
- **Syntax: pd.to_numeric(arg, errors='raise', downcast=None)**
    - **errors='raise' (default):** Raises an error if any value cannot be converted to a number.
    - **errors='coerce':** Replaces non-convertible values with NaN.
    - **errors='ignore':** Returns the input unchanged if any value cannot be converted.
    - **downcast:** optionally 'integer', 'signed', 'unsigned' or 'float', to downcast the result to the smallest suitable numeric dtype.

In [None]:
import pandas as pd
data = pd.Series(['1', '2', '3.14', 'hello', '5'])
numeric_data = pd.to_numeric(data, errors='coerce')
print(numeric_data)
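
A short sketch contrasting `errors='coerce'` with the default `errors='raise'` on the same sample strings:

```python
import pandas as pd

data = pd.Series(["1", "2", "3.14", "hello", "5"])

converted = pd.to_numeric(data, errors="coerce")   # 'hello' becomes NaN
print(converted)
print(converted.isnull().sum())

# The default errors='raise' stops at the first value that cannot be parsed
try:
    pd.to_numeric(data)
except ValueError as exc:
    print("ValueError:", exc)
```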

**quantile():**
- Returns the value at the given quantile.
- q = 0.5 is the default, which gives the median.
- **Syntax: series.quantile(q=0.5, interpolation='linear')**
    - **q:** float or array-like, default 0.5 (50% quantile)
    - **interpolation:** {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
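
A minimal sketch of `quantile()` on a made-up Series, showing the default, a single quantile, several quantiles at once, and a non-default interpolation:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

print(values.quantile())                             # 0.5 quantile, i.e. the median
print(values.quantile(0.25))                         # lower quartile
print(values.quantile([0.25, 0.75]))                 # several quantiles, returned as a Series
print(values.quantile(0.4))                          # linear interpolation between 2 and 3
print(values.quantile(0.4, interpolation="lower"))   # takes the lower neighbour instead
```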
