### SESSION 16 - PANDAS SERIES

### What is Pandas
- Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
- https://pandas.pydata.org/about/index.html

### Pandas Series
- A Pandas Series is like a column in a table. It is a 1-D array holding data of any type.
- **Syntax : pd.Series( data, index, name )**
- The index is generated automatically (0, 1, 2, ...). To set custom labels, pass them through the `index` parameter.

### Importing Pandas


In [2]:
import numpy as np
import pandas as pd

**Create a Series from lists:**
1. string
2. integers
3. custom index
4. setting a name

In [4]:
# string
countries = ['India', 'Nepal', 'Srilanka', 'Bhutan']
print(pd.Series(countries))
# Note: dtype is object here; pandas stores Python strings as object dtype, so treat it as the string type

0       India
1       Nepal
2    Srilanka
3      Bhutan
dtype: object


In [6]:
# integer
runs = [54,76,100,74]
print(pd.Series(runs))

0     54
1     76
2    100
3     74
dtype: int64


In [9]:
# Custom index
marks = [58,93,89,60]
subjects = ['C++','Python','NumPy','Java']
print(pd.Series(marks, index=subjects))

C++       58
Python    93
NumPy     89
Java      60
dtype: int64
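
The list above also mentions setting a name, which the list-based cells do not show. A minimal sketch that reuses the marks and subjects lists from the previous cell; the label `'marks'` is an arbitrary name chosen for illustration:

```python
import pandas as pd

marks = [58, 93, 89, 60]
subjects = ['C++', 'Python', 'NumPy', 'Java']

# name= labels the whole Series; it shows up in the printed footer
named = pd.Series(marks, index=subjects, name='marks')
print(named)
print(named.name)
```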


**Create a Series from Dictionary:**

In [3]:
import pandas as pd
marks = {
    'maths':78,
    'english': 70,
    'science': 89
}
pd.Series(marks, name='student score')

maths      78
english    70
science    89
Name: student score, dtype: int64

### Series Attributes:
- A Pandas Series object has several attributes. The most commonly used ones are:

**size:**
- The size attribute returns the number of elements in a Series, including any elements that might contain missing or NaN (Not-a-Number) values.

In [5]:
import pandas as pd
data = [10, 20, 30, None, 50]
print(pd.Series(data).size) 

data1 = [10, 20, 30, 50]
print(pd.Series(data1).size) 

5
4
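
To see how size treats missing values, it can help to compare it with `count()`, a Series method not covered in these notes that counts only the non-missing elements. A small sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, None, 50])

print(s.size)     # 5: every element, including the missing one
print(s.count())  # 4: only the non-missing elements
print(s.dtype)    # float64: None is stored as NaN, so the integers become floats
```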


**dtype:** 
- This attribute returns the data type of the elements in the Series. 
- It can be used to check the data type of the data within the Series.

In [30]:
import pandas as pd
data = [10, 20, 30, 40, 50]
print(pd.Series(data).dtype)

int64


**name:** 
- You can assign a name to a Series when creating it or later using the name attribute. 
- The name is typically used in the context of DataFrames, where a Series can represent a column.


In [3]:
import pandas as pd
data = [10, 20, 30, 40, 50]
print(pd.Series(data, name='MyData').name)

MyData
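
The name can also be assigned after creation, and it becomes the column label when the Series is converted to a DataFrame. A short sketch, using `to_frame()` (a Series method not covered in these notes) to show the DataFrame connection:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s.name)        # None: no name was given at creation

s.name = 'MyData'    # assign a name afterwards

# to_frame() turns the Series into a one-column DataFrame;
# the Series name becomes the column label
df = s.to_frame()
print(df.columns.tolist())
```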


**is_unique:**
- The is_unique attribute returns a boolean value indicating whether all the values in the Series are unique (no duplicates) or not.

In [2]:
import pandas as pd
data1 = [10, 20, 30, 40, 50]
print(pd.Series(data1).is_unique)  # Returns True

data2 = [10, 10, 30, 50, 50]
print(pd.Series(data2).is_unique)  # Returns False

True
False

**index:** 
- This attribute returns the index labels associated with the Series. 
- The index labels can be integers, strings, or any other data type. 
- If no index labels were explicitly provided when creating the Series, a default integer index will be generated.

In [None]:
import pandas as pd
data = [10, 20, 30, 40, 50]
index_labels = ['A', 'B', 'C', 'D', 'E']
print(pd.Series(data, index=index_labels).index)
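
For comparison, here is a sketch of the default index next to a custom one. When no labels are given, pandas builds a `RangeIndex` that counts from 0:

```python
import pandas as pd

data = [10, 20, 30, 40, 50]

default_idx = pd.Series(data).index
custom_idx = pd.Series(data, index=['A', 'B', 'C', 'D', 'E']).index

print(default_idx)        # a RangeIndex covering 0..4
print(list(custom_idx))   # the labels that were passed in
```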

**values:** 
- This attribute returns the data in the Series as a NumPy array.


In [None]:
import pandas as pd
marks = {'maths':78,'english': 70,'science': 89}
print(pd.Series(marks).values)
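
A quick way to confirm that the result is a NumPy array is to check its type and then call a NumPy method on it. A minimal sketch using the same marks dictionary:

```python
import numpy as np
import pandas as pd

marks = pd.Series({'maths': 78, 'english': 70, 'science': 89})

arr = marks.values
print(type(arr))     # numpy.ndarray
print(arr.mean())    # NumPy methods work directly on the array
```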

### Series using read_csv():

- **squeeze=True :** 
    - Returns a Series instead of a DataFrame when the parsed data has only one column.
    - This parameter was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, call `.squeeze('columns')` on the result of `read_csv()` instead.
- **index_col :**
    - Specifies which column in the CSV file should be used as the index of the result.
    - The index holds the label of each row.
    - If index_col is not specified, Pandas creates a default integer index starting from 0.
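
The two options can be tried without any file on disk. The sketch below parses a small in-memory CSV (a hypothetical stand-in for a real file, built with `io.StringIO`) and squeezes the one-column DataFrame into a Series with `DataFrame.squeeze('columns')`, which works on both old and new pandas versions:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "match_no,runs\n1,1\n2,23\n3,13\n"

# index_col makes match_no the index, leaving a single data column
df = pd.read_csv(io.StringIO(csv_text), index_col='match_no')

# squeeze('columns') converts the one-column DataFrame into a Series
runs = df.squeeze('columns')

print(type(df))
print(type(runs))
print(runs)
```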

In [53]:
import pandas as pd
# squeeze('columns') replaces squeeze=True, which was removed in pandas 2.0
subs = pd.read_csv('DATASETS/S16/subs.csv').squeeze('columns')
print(subs)
print(type(subs))

0       48
1       57
2       40
3       43
4       44
      ... 
360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64
<class 'pandas.core.series.Series'>






In [77]:
# squeeze('columns') replaces squeeze=True, which was removed in pandas 2.0
vk = pd.read_csv('DATASETS/S16/kohli_ipl.csv', index_col='match_no').squeeze('columns')
print(vk)
print(type(vk))

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64
<class 'pandas.core.series.Series'>






In [93]:
# squeeze('columns') replaces squeeze=True, which was removed in pandas 2.0
bolly = pd.read_csv('DATASETS/S16/bollywood.csv', index_col='movie').squeeze('columns')
print(bolly)
print(type(bolly))

movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object
<class 'pandas.core.series.Series'>






### Series Methods:

**head():**
- The head() method is used to display the first few rows (default is 5) of a DataFrame or Series.

In [94]:
# head()
print(vk.head()) # default is 5
print(vk.head(2))
#print(vk.head(-2)) # a negative n returns all rows except the last 2

match_no
1     1
2    23
3    13
4    12
5     1
Name: runs, dtype: int64
match_no
1     1
2    23
Name: runs, dtype: int64
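
head() and tail() also accept a negative n, which means "everything except n rows" from the opposite end. A small sketch on a made-up five-element Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

print(s.head(-2).tolist())  # [10, 20, 30]: all rows except the last 2
print(s.tail(-2).tolist())  # [30, 40, 50]: all rows except the first 2
```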


**tail():**
- The tail() method is used to display the last few rows (default is 5) of a DataFrame or Series.

In [95]:
import pandas as pd
#print(vk.tail()) # default is 5
print(vk.tail(3)) # specific number

match_no
213    73
214    25
215     7
Name: runs, dtype: int64


**sample():**
- The sample() method in Pandas is used to randomly select a specified number of rows or elements from a DataFrame or Series. 
- This is particularly useful when you want to obtain a random sample from your data for various purposes, such as data exploration, analysis, or testing.
- It helps when the first or last rows are not representative of the whole dataset (for example, when the data is sorted), because a random sample gives a less biased view than head() or tail().

In [96]:
# default is 1
print(vk.sample()) 
# 3 random rows from the dataset
print(vk.sample(3)) 

match_no
106    1
Name: runs, dtype: int64
match_no
12     38
54     35
116    75
Name: runs, dtype: int64
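
Because sample() is random, the rows differ on every run. When repeatable results are needed, the `random_state` parameter fixes the seed, and `frac` draws a fraction of the rows instead of a fixed count. A sketch on a made-up Series of 100 numbers:

```python
import pandas as pd

s = pd.Series(range(100))

# random_state fixes the seed, so the same rows come back on every run
a = s.sample(3, random_state=42)
b = s.sample(3, random_state=42)
print(a.equals(b))   # True

# frac draws a fraction of the rows instead of a fixed count
tenth = s.sample(frac=0.1, random_state=0)
print(len(tenth))    # 10
```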


**value_counts() function:**
- The value_counts() function counts how many times each unique value occurs in a Series. The result is a new Series sorted from the most frequent value to the least frequent.

In [97]:
print(bolly.value_counts())

Akshay Kumar        48
Amitabh Bachchan    45
Ajay Devgn          38
Salman Khan         31
Sanjay Dutt         26
                    ..
Diganth              1
Parveen Kaur         1
Seema Azmi           1
Akanksha Puri        1
Edwin Fernandes      1
Name: lead, Length: 566, dtype: int64
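
value_counts() can also report shares instead of raw counts through its `normalize` parameter. A sketch on a small made-up Series (the actor names come from the output above, the counts are invented for illustration):

```python
import pandas as pd

# Small made-up sample of lead actors
leads = pd.Series([
    'Akshay Kumar', 'Ajay Devgn', 'Akshay Kumar',
    'Salman Khan', 'Akshay Kumar', 'Ajay Devgn',
])

counts = leads.value_counts()                 # raw frequencies, most frequent first
shares = leads.value_counts(normalize=True)   # fractions that add up to 1

print(counts)
print(shares)
```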


**sort_values() function:**
- The sort_values() method in Pandas Series is used to sort the values within the Series. 
- By default, it sorts the values in ascending order, but you can specify the sorting order as either ascending or descending using the ascending parameter.
- **syntax: series.sort_values(ascending=True, inplace=False)**
    - **inplace:** if True, sorts the Series itself and returns None
- **Note**: by default (inplace=False) the original Series is not changed; a new sorted Series is returned

In [98]:
#print(vk.sort_values()) # default ascending=True
print(vk.sort_values(ascending=False))

match_no
128    113
126    109
123    108
120    100
164    100
      ... 
211      0
8        0
130      0
135      0
87       0
Name: runs, Length: 215, dtype: int64


In [99]:
# method chaining
print(vk.sort_values(ascending=False).head(1).values[0])

113


In [100]:
# with inplace=True for permanent changes in the series
#vk.sort_values(inplace=True)

In [101]:
#vk
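
The difference between the default behaviour and inplace=True, as described in the Note above, can be checked on a small made-up Series:

```python
import pandas as pd

runs = pd.Series([23, 1, 13], index=[2, 1, 3], name='runs')

# Default: a new sorted Series is returned, the original keeps its order
sorted_copy = runs.sort_values()
before = runs.tolist()
print(before)   # [23, 1, 13]

# inplace=True: the Series itself is sorted and the call returns None
result = runs.sort_values(ascending=False, inplace=True)
print(result)          # None
print(runs.tolist())   # [23, 13, 1]
```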

**sort_index() function:**
- Works like sort_values(), but sorts the Series by its index labels instead of its values.
- String labels are sorted in lexicographic order (digits before letters, uppercase before lowercase).
- By default, it sorts the index in ascending order, but you can specify the sorting order as either ascending or descending using the ascending parameter.
- **syntax: series.sort_index(ascending=True, inplace=False)**
    - **inplace:** if True, sorts the Series itself and returns None
- **Note**: by default (inplace=False) the original Series is not changed; a new sorted Series is returned

In [102]:
# default ascending
#print(bolly.sort_index())

# descending
print(bolly.sort_index(ascending=False))

movie
Zor Lagaa Ke...Haiya!            Meghan Jadhav
Zokkomon                       Darsheel Safary
Zindagi Tere Naam           Mithun Chakraborty
Zindagi Na Milegi Dobara        Hrithik Roshan
Zindagi 50-50                      Veena Malik
                                   ...        
2 States (2014 film)              Arjun Kapoor
1971 (2007 film)                Manoj Bajpayee
1920: The Evil Returns             Vicky Ahuja
1920: London                     Sharman Joshi
1920 (film)                   Rajniesh Duggall
Name: lead, Length: 1500, dtype: object


In [103]:
# with inplace=True for permanent changes in the series
bolly.sort_index(inplace=True)

In [105]:
bolly

movie
1920 (film)                   Rajniesh Duggall
1920: London                     Sharman Joshi
1920: The Evil Returns             Vicky Ahuja
1971 (2007 film)                Manoj Bajpayee
2 States (2014 film)              Arjun Kapoor
                                   ...        
Zindagi 50-50                      Veena Malik
Zindagi Na Milegi Dobara        Hrithik Roshan
Zindagi Tere Naam           Mithun Chakraborty
Zokkomon                       Darsheel Safary
Zor Lagaa Ke...Haiya!            Meghan Jadhav
Name: lead, Length: 1500, dtype: object
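
The same behaviour can be reproduced without the CSV file. A sketch using three movie and lead pairs taken from the outputs above:

```python
import pandas as pd

leads = pd.Series(
    ['Hrithik Roshan', 'Arjun Kapoor', 'Vicky Kaushal'],
    index=['Zindagi Na Milegi Dobara', '2 States (2014 film)', 'Uri: The Surgical Strike'],
    name='lead',
)

asc = leads.sort_index()                  # digits sort before letters
desc = leads.sort_index(ascending=False)

print(asc.index.tolist())
print(desc.index.tolist())
print(leads.index.tolist())               # original order is unchanged
```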