## Pandas Series

### What is Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

https://pandas.pydata.org/about/index.html

### Pandas Series

A Pandas Series is like a column in a table. It is a one-dimensional labeled array that can hold data of any type.

In [82]:
# Install pandas (uncomment the next line to run it)
# !pip install pandas

### DATA
NAME -> Series 

AGE -> Series 

NAME | AGE | MARKS | IQ  -> Data Frame (group of series)
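
The relationship between a Series and a DataFrame can be tried out directly. The following is a minimal sketch with made-up names and ages (they do not come from any dataset used here): two Series are placed side by side to form a DataFrame, and one column is pulled back out as a Series.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
name = pd.Series(["Asha", "Ravi", "Meera"], name="NAME")
age = pd.Series([21, 25, 23], name="AGE")

# Placing Series side by side yields a DataFrame with one column per Series.
df = pd.concat([name, age], axis=1)
print(df)

# Selecting a single column gives back a Series.
age_col = df["AGE"]
print(type(age_col))
```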

In [83]:
# importing pandas
import pandas as pd
import numpy as np

### Series from List

In [84]:
# String
country= ['India', 'Nepal', 'USA']

pd.Series(country)

0    India
1    Nepal
2      USA
dtype: object

In [85]:
# Integer:
runs = (61, 51, 26, 34, 16)
pd.Series(runs)  # integers default to dtype int64

pd.Series(runs, dtype=np.int32)  # explicitly request dtype int32

# Since we didn't define an index, the default is 0, 1, 2, ...

0    61
1    51
2    26
3    34
4    16
dtype: int32
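
As a small sketch of the two ways to control the dtype, the example below sets it at construction time and then converts an existing Series with `astype`. The variable names `runs_32` and `runs_float` are made up for this illustration.

```python
import numpy as np
import pandas as pd

runs = (61, 51, 26, 34, 16)

# Request a smaller integer type when the Series is created.
runs_32 = pd.Series(runs, dtype=np.int32)

# Convert an existing Series; astype returns a new Series.
runs_float = runs_32.astype("float64")

print(runs_32.dtype)
print(runs_float.dtype)
```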

In [86]:
# Custom index: 
marks = [67, 26, 56, 78, 100]
subjects = ["maths", "english", "hindi", "bio", "chemistry"]

print(pd.Series(marks, index=subjects))
print()
print(pd.Series(subjects, index=marks))

maths         67
english       26
hindi         56
bio           78
chemistry    100
dtype: int64

67         maths
26       english
56         hindi
78           bio
100    chemistry
dtype: object


In [87]:
# setting the name

marks = [67, 26, 56, 78, 100]
subjects = ["maths", "english", "hindi", "bio", "chemistry"]

pd.Series(marks, index=subjects, name="Vrunda ke marks")

maths         67
english       26
hindi         56
bio           78
chemistry    100
Name: Vrunda ke marks, dtype: int64

### Series from Dictionary

In [88]:
marks = {
    "maths" : 67,
    "english" : 57,
    "science" : 78,
    "hindi" : 67
}
marks_ser = pd.Series(marks, name="Afsan ke marks")
print(marks_ser)

maths      67
english    57
science    78
hindi      67
Name: Afsan ke marks, dtype: int64
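
When a dictionary is combined with an explicit `index`, pandas looks up each index label among the dictionary keys. The sketch below uses made-up marks and a hypothetical "history" label that is not in the dictionary, to show that missing keys become NaN.

```python
import pandas as pd

marks = {"maths": 67, "english": 57, "science": 78}

# Only the requested labels are kept, in the requested order.
# "history" is not a key of the dict, so its value is NaN.
selected = pd.Series(marks, index=["science", "maths", "history"])
print(selected)
```

Because NaN is a float, the resulting Series has a float dtype even though the dictionary held integers.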


### Series Attributes

In [89]:
# size: the number of data points in the Series.
marks_ser.size

4

In [90]:
# dtype: gives data type of the series 
marks_ser.dtype

dtype('int64')

In [91]:
# name: gives name of the series 
marks_ser.name

'Afsan ke marks'

In [92]:
# is_unique: True if every value in the Series is distinct, otherwise False.
print(marks_ser.is_unique)
k=[1,23,4,42,12,33,23]
pd.Series(k).is_unique

False


False

In [93]:
# index: returns the index labels of the Series. Labels are usually kept unique, but pandas does not require it.

marks_ser.index

Index(['maths', 'english', 'science', 'hindi'], dtype='object')
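
Note that pandas itself does not enforce unique index labels. A minimal sketch with made-up scores, where the label "maths" is deliberately used twice:

```python
import pandas as pd

# Made-up scores; the label "maths" appears twice on purpose.
scores = pd.Series([67, 57, 81], index=["maths", "english", "maths"])

print(scores.index.is_unique)  # False, because "maths" is repeated
print(scores["maths"])         # a Series holding both "maths" rows
print(scores["english"])       # a single value, since this label is unique
```

Lookups on a repeated label return every matching row, so code that expects a scalar may want to check `index.is_unique` first.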

In [94]:
# values: returns all the values of the Series as an array.

marks_ser.values

array([67, 57, 78, 67])

### Reading series

In [95]:
folder_path= '/Users/vrunda/Library/CloudStorage/GoogleDrive-shahvrunda231296@gmail.com/.shortcut-targets-by-id/1If4Xq7JBYnZ3iRTOYUU8DDWlZlr7e5rJ/Dataset, Assignments, Interview Prep - D1'
file_path='/dataset/subs.csv' # path of the dataset 

subs=pd.read_csv(folder_path+file_path)  # Note: by default read_csv returns a DataFrame.

# You can check the type with the type() function
print(type(subs)) # it's a pandas DataFrame.

# Since we want a Series, we use the squeeze() method to convert the one-column DataFrame to a Series
subs_series= subs.squeeze()

print(type(subs_series))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
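
The same read-then-squeeze pattern can be reproduced without the CSV file. The sketch below feeds a small in-memory CSV (three made-up rows under the same column header) to `read_csv` through `io.StringIO`; the names `csv_text`, `df`, and `series` are illustrative.

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file: one header line and three rows.
csv_text = "Subscribers gained\n48\n57\n40\n"

df = pd.read_csv(io.StringIO(csv_text))  # always a DataFrame
series = df.squeeze("columns")           # collapse the single column

print(type(df))
print(type(series))
print(series.name)
```

Passing `"columns"` restricts the squeeze to the column axis, so a file with a single row would still give a Series rather than a scalar.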


In [96]:
print(subs)

     Subscribers gained
0                    48
1                    57
2                    40
3                    43
4                    44
..                  ...
360                 231
361                 226
362                 155
363                 144
364                 172

[365 rows x 1 columns]


In [97]:
# Printing a long Series shows only the first 5 and the last 5 rows.
print(subs_series)

# Note: the column header shown for the subs DataFrame is gone; the same text now appears as the Series name.
# Squeezing a one-column DataFrame returns a Series whose name equals df.columns[0],
# so `name` tells us which column the Series came from.

0       48
1       57
2       40
3       43
4       44
      ... 
360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64


In [98]:
# Now we want to read it as a Series with match_no as the index

folder_path= '/Users/vrunda/Library/CloudStorage/GoogleDrive-shahvrunda231296@gmail.com/.shortcut-targets-by-id/1If4Xq7JBYnZ3iRTOYUU8DDWlZlr7e5rJ/Dataset, Assignments, Interview Prep - D1'
file_path='/dataset/kohli_ipl.csv' # path of the dataset 

vk=pd.read_csv(folder_path+file_path, index_col='match_no').squeeze() # read_csv returns a DataFrame; squeeze() turns it into a Series
print(vk) # -> Series with match_no as index

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64
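
A self-contained sketch of `index_col`, using three made-up rows in place of the real file. The name `vk_demo` is hypothetical and chosen so that it does not clash with `vk`.

```python
import io
import pandas as pd

# In-memory stand-in for a two-column CSV file.
csv_text = "match_no,runs\n1,1\n2,23\n3,13\n"

# index_col moves match_no into the index, leaving a single data column,
# which squeeze then turns into a Series.
vk_demo = pd.read_csv(io.StringIO(csv_text), index_col="match_no").squeeze("columns")

print(vk_demo.index.name)  # name of the index
print(vk_demo.name)        # name of the remaining column
print(vk_demo.loc[2])      # look up runs by match number
```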


In [99]:
# head: returns the first n rows of the Series (default 5); the order of the data is not changed.
subs_series.head(10)

0    48
1    57
2    40
3    43
4    44
5    46
6    33
7    40
8    44
9    74
Name: Subscribers gained, dtype: int64

In [100]:
# tail: returns the last n rows of the Series (default 5); the order of the data is not changed.
subs_series.tail(5)

360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, dtype: int64

In [101]:
# sample: returns n randomly chosen rows of the Series (default 1).

subs_series.sample(5)

169    261
57      72
312    230
71      84
61      70
Name: Subscribers gained, dtype: int64
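
Because `sample` is random, repeated runs give different rows. A minimal sketch of making the draw reproducible with the `random_state` parameter; the seed value 42 and the toy Series are arbitrary choices for this example.

```python
import pandas as pd

s = pd.Series(range(100))

# The same seed produces the same sample every time.
first = s.sample(5, random_state=42)
second = s.sample(5, random_state=42)

print(first.equals(second))
```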

In [102]:
# value_counts(): counts how many times each value occurs. By default sort=True and ascending=False, so the most frequent values come first.
vk.value_counts()

runs
0     9
1     8
12    8
9     7
8     6
     ..
46    1
29    1
84    1
90    1
53    1
Name: count, Length: 78, dtype: int64
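
The sorting defaults and the `normalize` option of `value_counts` can be seen on a small example. The scores below are made up for illustration.

```python
import pandas as pd

runs = pd.Series([0, 12, 0, 50, 12, 0])

counts = runs.value_counts()                    # most frequent value first
shares = runs.value_counts(normalize=True)      # fractions instead of counts
rare_first = runs.value_counts(ascending=True)  # least frequent value first

print(counts)
print(shares)
print(rare_first)
```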

In [103]:
# sort_values(): sorts the Series by value; the default is ascending order.
vk.sort_values()

# sort_index(): sorts the Series by index; the default is ascending order.
vk.sort_index()

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64

In [104]:
# inplace=True modifies the existing Series instead of returning a new one, so the change is kept.

vk.sort_index(ascending=False, inplace=True)
print(vk)

vk.sort_values(ascending=False, inplace=True)
print(vk)

match_no
215     7
214    25
213    73
212    20
211     0
       ..
5       1
4      12
3      13
2      23
1       1
Name: runs, Length: 215, dtype: int64
match_no
128    113
126    109
123    108
120    100
164    100
      ... 
206      0
91       0
93       0
87       0
8        0
Name: runs, Length: 215, dtype: int64
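
Without `inplace=True`, the sort methods leave the original untouched and return a new Series. A small sketch with made-up data, showing the alternative of storing the result under a new name:

```python
import pandas as pd

original = pd.Series([23, 1, 13], index=[2, 1, 3])

# Each call returns a new, sorted Series; `original` keeps its order.
by_value = original.sort_values()
by_index_desc = original.sort_index(ascending=False)

print(original.tolist())
print(by_value.tolist())
print(by_index_desc.index.tolist())
```

Assigning the result to a name keeps both versions available, which can be handy when the unsorted data is needed again later.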


### Series Indexing 

In [105]:
# Note: x uses the default zero-based integer index
x = pd.Series([12,13,14,35,67,89,21,67,90])

In [106]:
print(x)
# positive indexing:
print(x[3])
# negative indexing: x[-1] raises a KeyError here, because with an integer index the key is looked up as a label and there is no label -1
# x[-1]
# Note: negative slicing still works, because integer slices are treated as positions.
print(x[-5:])

0    12
1    13
2    14
3    35
4    67
5    89
6    21
7    67
8    90
dtype: int64
35
4    67
5    89
6    21
7    67
8    90
dtype: int64
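
The ambiguity between labels and positions can be avoided with the explicit accessors `.loc` (labels) and `.iloc` (positions). A minimal sketch; the Series `shifted` with labels 10, 20, 30 is made up to make the difference visible.

```python
import pandas as pd

x = pd.Series([12, 13, 14, 35, 67])

last = x.iloc[-1]       # position -1: the last element
at_label_3 = x.loc[3]   # label 3

# With labels that differ from positions, the two accessors diverge.
shifted = pd.Series([12, 13, 14], index=[10, 20, 30])
first_by_pos = shifted.iloc[0]    # first element
first_by_label = shifted.loc[10]  # element labelled 10

print(last, at_label_3, first_by_pos, first_by_label)
```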


In [107]:
# movie dataset: with custom indexing, not using default integer indexing. 
folder_path= '/Users/vrunda/Library/CloudStorage/GoogleDrive-shahvrunda231296@gmail.com/.shortcut-targets-by-id/1If4Xq7JBYnZ3iRTOYUU8DDWlZlr7e5rJ/Dataset, Assignments, Interview Prep - D1'
file_path='/dataset/bollywood.csv' # path of the dataset 

movies=pd.read_csv(folder_path+file_path, index_col= 'movie').squeeze()
print(movies)

movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object


In [108]:
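# The index here is not an integer index, so integer keys fall back to positions and negative indexing works.
# Note: recent pandas versions deprecate this fallback; movies.iloc[2] and movies.iloc[-1] are the explicit positional forms.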
print(movies[2])
print(movies[-1])


Anupam Kher
Akshay Kumar




In [109]:
# Slicing: same syntax as for a list (start:stop:step)
print(movies[1:4])
print()
print()
print(movies[::2])

movie
Battalion 609                             Vicky Ahuja
The Accidental Prime Minister (film)      Anupam Kher
Why Cheat India                         Emraan Hashmi
Name: lead, dtype: object


movie
Uri: The Surgical Strike                   Vicky Kaushal
The Accidental Prime Minister (film)         Anupam Kher
Evening Shadows                         Mona Ambegaonkar
Fraud Saiyaan                               Arshad Warsi
Manikarnika: The Queen of Jhansi          Kangana Ranaut
                                              ...       
Raaz (2002 film)                              Dino Morea
Waisa Bhi Hota Hai Part II                  Arshad Warsi
Kaante                                  Amitabh Bachchan
Aankhen (2002 film)                     Amitabh Bachchan
Company (film)                                Ajay Devgn
Name: lead, Length: 750, dtype: object


In [110]:
# Fancy indexing: select several arbitrary entries at once
print(x[[1,2,5,6]]) # Note: the indices have to be passed as a list
print(movies[[1,3,4]])

1    13
2    14
5    89
6    21
dtype: int64
movie
Battalion 609           Vicky Ahuja
Why Cheat India       Emraan Hashmi
Evening Shadows    Mona Ambegaonkar
Name: lead, dtype: object




In [111]:
# custom index labels can also be used to get values
movies['Why Cheat India']

'Emraan Hashmi'

In [112]:
# Example: 
new_row = [f"row_{i}" for i in range(9)]
x.index = new_row
print(x)
print()

# zero-based (positional) slicing:
print(x[1:5])
print()


# custom (label) slicing:
print(x['row_1':'row_5']) # Note: with label slicing both start and stop are inclusive, unlike positional slicing

# Thus, label slicing returns one more row here.


row_0    12
row_1    13
row_2    14
row_3    35
row_4    67
row_5    89
row_6    21
row_7    67
row_8    90
dtype: int64

row_1    13
row_2    14
row_3    35
row_4    67
dtype: int64

row_1    13
row_2    14
row_3    35
row_4    67
row_5    89
dtype: int64
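
The difference between the two slicing styles can be made explicit with `.iloc` and `.loc`. A small sketch on a made-up Series with the same kind of `row_i` labels:

```python
import pandas as pd

s = pd.Series([12, 13, 14, 35, 67, 89],
              index=[f"row_{i}" for i in range(6)])

by_position = s.iloc[1:5]          # stop is exclusive: positions 1 to 4
by_label = s.loc["row_1":"row_5"]  # stop is inclusive: row_1 to row_5

print(len(by_position))
print(len(by_label))
```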


### Editing Series:

### Editing using indexing: 

In [113]:
# Using a positional (zero-based) index:

print(x)
x[1]= 100
print(x)

row_0    12
row_1    13
row_2    14
row_3    35
row_4    67
row_5    89
row_6    21
row_7    67
row_8    90
dtype: int64
row_0     12
row_1    100
row_2     14
row_3     35
row_4     67
row_5     89
row_6     21
row_7     67
row_8     90
dtype: int64




In [114]:
# Using custom index: 

print(x)
x['row_5']= 350
print(x)

row_0     12
row_1    100
row_2     14
row_3     35
row_4     67
row_5     89
row_6     21
row_7     67
row_8     90
dtype: int64
row_0     12
row_1    100
row_2     14
row_3     35
row_4     67
row_5    350
row_6     21
row_7     67
row_8     90
dtype: int64


In [115]:
# What if the index label doesn't exist? Assignment appends it.
# custom indexing:
print(x)
x['row_11']= 90
print(x)

# x[100]= 10 would throw an error: a position that does not exist cannot be assigned, especially when the Series has custom index labels.


row_0     12
row_1    100
row_2     14
row_3     35
row_4     67
row_5    350
row_6     21
row_7     67
row_8     90
dtype: int64
row_0      12
row_1     100
row_2      14
row_3      35
row_4      67
row_5     350
row_6      21
row_7      67
row_8      90
row_11     90
dtype: int64
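
A minimal sketch of the two usual ways to grow a Series: assigning to a new label through `.loc`, and joining a second Series with `pd.concat`. The labels and values are made up for illustration.

```python
import pandas as pd

s = pd.Series([12, 100], index=["row_0", "row_1"])

# Assigning to a label that does not exist yet appends a new entry.
s.loc["row_11"] = 90

# To add several entries at once, concatenate another Series.
extra = pd.Series([5, 6], index=["row_12", "row_13"])
combined = pd.concat([s, extra])

print(combined)
```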


In [116]:
# Slicing: 

# zero indexing: 
print(x)
x[2:4]= 30
print(x)

row_0      12
row_1     100
row_2      14
row_3      35
row_4      67
row_5     350
row_6      21
row_7      67
row_8      90
row_11     90
dtype: int64
row_0      12
row_1     100
row_2      30
row_3      30
row_4      67
row_5     350
row_6      21
row_7      67
row_8      90
row_11     90
dtype: int64


In [117]:
# For irregular patterns, we use fancy indexing:
# one value at multiple positions:
print(x)
x[[1,3,4,6,8]]= 259
print(x)

# multiple values at multiple positions:

x[[1,3,4,6,8]]=[10,20,30,40,50]
print(x)

row_0      12
row_1     100
row_2      30
row_3      30
row_4      67
row_5     350
row_6      21
row_7      67
row_8      90
row_11     90
dtype: int64
row_0      12
row_1     259
row_2      30
row_3     259
row_4     259
row_5     350
row_6     259
row_7      67
row_8     259
row_11     90
dtype: int64
row_0      12
row_1      10
row_2      30
row_3      20
row_4      30
row_5     350
row_6      40
row_7      67
row_8      50
row_11     90
dtype: int64




In [118]:
# Custom indexing:

x['row_8']= 90.95 # a Series holds a single dtype (homogeneous), so assigning one float converts every value to float64
print(x)

# x['row_6']= '0'
# print(x) # Now the dtype would become object, because the Series mixes numbers and a string.

row_0      12.00
row_1      10.00
row_2      30.00
row_3      20.00
row_4      30.00
row_5     350.00
row_6      40.00
row_7      67.00
row_8      90.95
row_11     90.00
dtype: float64
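
Relying on the implicit conversion can be avoided by changing the dtype explicitly before assigning the float. This is a sketch on made-up data, not the only possible approach; some pandas versions emit a warning when an assignment forces a dtype change, and the explicit conversion sidesteps that.

```python
import pandas as pd

marks = pd.Series([12, 10, 90], index=["row_0", "row_1", "row_8"])

# Convert to float first, then assign the fractional value.
marks = marks.astype("float64")
marks.loc["row_8"] = 90.95

print(marks)
```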




### Series with python functionality

In [119]:
# len: gives the length of the Series
len(x)

10

In [120]:
# type: 
type(x)

pandas.core.series.Series

In [121]:
# dir: lists all the attributes and methods available on the object.
dir(x)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__column_consortium_standard__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__firstlineno__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 ...

In [122]:
# sorted: returns the values as a sorted Python list

sorted(x)

[10.0, 12.0, 20.0, 30.0, 30.0, 40.0, 67.0, 90.0, 90.95, 350.0]

In [123]:
# min: 
min(x)

# max: 
max(x)

# sum: 
sum(x)

739.95
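
The built-in functions treat the Series as a plain iterable, while the Series methods of the same name skip missing values by default. A small sketch with a made-up Series that contains one NaN:

```python
import math
import pandas as pd

s = pd.Series([10.0, float("nan"), 30.0])

builtin_total = sum(s)  # NaN propagates through the plain Python sum
method_total = s.sum()  # the pandas method skips NaN by default

print(builtin_total)
print(method_total)
print(s.min(), s.max())
```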

In [124]:
# type conversion:

# Series -> list:
type(list(x))

# Series -> dict:
print(dict(x)) # keys are the index labels, values are the corresponding Series values

{'row_0': np.float64(12.0), 'row_1': np.float64(10.0), 'row_2': np.float64(30.0), 'row_3': np.float64(20.0), 'row_4': np.float64(30.0), 'row_5': np.float64(350.0), 'row_6': np.float64(40.0), 'row_7': np.float64(67.0), 'row_8': np.float64(90.95), 'row_11': np.float64(90.0)}
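
pandas also provides the conversion methods `tolist` and `to_dict`. A minimal sketch on a two-element, made-up Series; the printed dictionary is typically shown with plain Python floats rather than `np.float64(...)` wrappers.

```python
import pandas as pd

s = pd.Series([12.0, 10.0], index=["row_0", "row_1"])

as_list = s.tolist()   # values only
as_dict = s.to_dict()  # index labels become keys

print(as_list)
print(as_dict)
```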


In [125]:
# membership operator:

print("Company (film)" in movies) # Note: `in` checks only the index labels
print("Alia Bhatt" in movies.values) # use .values to check whether a value is present in the data

True
True


In [126]:
# looping
print()

for i in movies.index: # looping over the index
    print(i)

print()

for i in movies.values: # looping over the values
    print(i)


print()


Uri: The Surgical Strike
Battalion 609
The Accidental Prime Minister (film)
Why Cheat India
Evening Shadows
Soni (film)
Fraud Saiyaan
Bombairiya
Manikarnika: The Queen of Jhansi
Thackeray (film)
Amavas
Gully Boy
Hum Chaar
Total Dhamaal
Sonchiriya
Badla (2019 film)
Mard Ko Dard Nahi Hota
Hamid (film)
Photograph (film)
Risknamaa
Mere Pyare Prime Minister
22 Yards
Kesari (film)
Notebook (2019 film)
Junglee (2019 film)
Gone Kesh
Albert Pinto Ko Gussa Kyun Aata Hai?
The Tashkent Files
Kalank
Setters (film)
Student of the Year 2
PM Narendra Modi
De De Pyaar De
India's Most Wanted (film)
Yeh Hai India
Khamoshi (2019 film)
Kabir Singh
Article 15 (film)
One Day: Justice Delivered
Hume Tumse Pyaar Kitna
Super 30 (film)
Family of Thakurganj
Batla House
Jhootha Kahin Ka
Judgementall Hai Kya
Chicken Curry Law
Arjun Patiala
Jabariya Jodi
Pranaam
The Sky Is Pink
Mission Mangal
Saaho
Dream Girl (2019 film)
Section 375
The Zoya Factor (film)
Pal Pal Dil Ke Paas
Prassthanam
P Se Pyaar F Se Faraar
Ghost
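
To walk over labels and values together, `Series.items()` yields one (label, value) pair per row. A short sketch on a hand-made two-row Series; the name `leads` is illustrative.

```python
import pandas as pd

leads = pd.Series(
    ["Vicky Kaushal", "Ajay Devgn"],
    index=["Uri: The Surgical Strike", "Company (film)"],
)

pairs = []
for movie, lead in leads.items():  # label and value in one loop
    pairs.append(f"{movie} -> {lead}")
    print(pairs[-1])
```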

In [127]:
# Arithmetic operators (broadcasting): the scalar is applied to every element
print(x)
100+x
print(x+100)

row_0      12.00
row_1      10.00
row_2      30.00
row_3      20.00
row_4      30.00
row_5     350.00
row_6      40.00
row_7      67.00
row_8      90.95
row_11     90.00
dtype: float64
row_0     112.00
row_1     110.00
row_2     130.00
row_3     120.00
row_4     130.00
row_5     450.00
row_6     140.00
row_7     167.00
row_8     190.95
row_11    190.00
dtype: float64
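
Arithmetic between two Series works differently from arithmetic with a scalar: values are matched by index label, not by position. A minimal sketch with made-up Series `a` and `b`, including the `add` method with `fill_value` for labels present on only one side.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

total = a + b                    # "x" has no partner in b, so it becomes NaN
filled = a.add(b, fill_value=0)  # treat the missing partner as 0

print(total)
print(filled)
```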


In [128]:
# Relational Operators: 

print(x)
x>=150 # -> boolean Series
x[x>=150] # boolean masking: keeps only the rows where the condition is True

row_0      12.00
row_1      10.00
row_2      30.00
row_3      20.00
row_4      30.00
row_5     350.00
row_6      40.00
row_7      67.00
row_8      90.95
row_11     90.00
dtype: float64


row_5    350.0
dtype: float64
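
Several conditions can be combined into one mask with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than the comparisons. A sketch on made-up values:

```python
import pandas as pd

x = pd.Series([12, 350, 40, 67, 90])

in_range = x[(x >= 40) & (x < 100)]       # both conditions hold
extremes = x[(x < 20) | (x > 300)]        # either condition holds
outside = x[~((x >= 40) & (x < 100))]     # negation of the first mask

print(in_range.tolist())
print(extremes.tolist())
print(outside.tolist())
```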

In [129]:
# Reload the dataset as a Series indexed by match_no (vk was sorted in place earlier)

folder_path= '/Users/vrunda/Library/CloudStorage/GoogleDrive-shahvrunda231296@gmail.com/.shortcut-targets-by-id/1If4Xq7JBYnZ3iRTOYUU8DDWlZlr7e5rJ/Dataset, Assignments, Interview Prep - D1'
file_path='/dataset/kohli_ipl.csv' # path of the dataset 

vk=pd.read_csv(folder_path+file_path, index_col='match_no').squeeze()
print(vk)

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64


In [130]:
# Question: load the vk dataset as a Series and find the number of 50s and 100s scored by Kohli
print(vk[(vk==50) | (vk==100)].value_counts()) # innings with a score of exactly 50 or exactly 100

# A related count: innings with a score of 50 or more

print(sum(vk[vk>=50].value_counts()))

# or

print(vk[vk>=50].size) # How many times has he scored 50 or more?

# or
(vk>=50).sum()


runs
50     2
100    2
Name: count, dtype: int64
50
50


np.int64(50)
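
If "50s" and "100s" are read in the usual cricket sense (an assumption here: a fifty is a score from 50 to 99, a century is 100 or more), the two counts can be sketched as follows. The scores are made up and the names `fifties` and `centuries` are illustrative.

```python
import pandas as pd

# Made-up innings scores, indexed by match number.
runs = pd.Series([1, 50, 73, 100, 113, 49, 99], index=range(1, 8))

# between() is inclusive on both ends by default.
fifties = runs.between(50, 99).sum()
centuries = (runs >= 100).sum()

print(fifties, centuries)
```

Summing a boolean Series counts its True entries, so no intermediate filtering is needed.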

In [131]:
# movie dataset: with custom indexing, not using default integer indexing. 
folder_path= '/Users/vrunda/Library/CloudStorage/GoogleDrive-shahvrunda231296@gmail.com/.shortcut-targets-by-id/1If4Xq7JBYnZ3iRTOYUU8DDWlZlr7e5rJ/Dataset, Assignments, Interview Prep - D1'
file_path='/dataset/bollywood.csv' # path of the dataset 

movies=pd.read_csv(folder_path+file_path, index_col= 'movie').squeeze()
print(movies)

movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object


In [132]:
# Question: Find the actors who have played the lead in more than 20 movies
num_movies= movies.value_counts()
num_movies[(num_movies>20)]


# movies[(movies.value_counts()>20)] won't work, because the mask must be applied to the value counts, not to the movies Series.

lead
Akshay Kumar        48
Amitabh Bachchan    45
Ajay Devgn          38
Salman Khan         31
Sanjay Dutt         26
Shah Rukh Khan      22
Emraan Hashmi       21
Name: count, dtype: int64
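
The same count-then-filter pattern on a tiny, made-up Series, with one extra step: using the frequent names to filter the original rows through `isin`. The names "A", "B", "C" and the threshold of 2 are placeholders chosen for this sketch.

```python
import pandas as pd

leads = pd.Series(["A", "B", "A", "C", "A", "B"])

counts = leads.value_counts()  # one row per distinct name
frequent = counts[counts >= 2] # the mask is applied to the counts

# Keep only the original rows whose value is one of the frequent names.
rows_of_frequent = leads[leads.isin(frequent.index)]

print(frequent)
print(len(rows_of_frequent))
```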