# Pandas Series

A pandas Series is a one-dimensional labeled array capable of
holding data of any type (integers, strings, floats, Python objects,
etc.). The axis labels are collectively called the index.
A Series is comparable to a single column in an Excel sheet.

There are some differences worth noting between NumPy ndarrays and Series objects.

First, elements in a NumPy array are accessed by their integer position, starting with zero for the first element.
A pandas Series is more flexible because you can define your own labeled index and use those labels to access elements.

The labels can be letters instead of numbers, or numbers in descending rather than ascending order.

Second, Series objects align data by label, which makes combining data from different Series, and handling missing values, more convenient than with plain ndarrays. If a label has no match during alignment, pandas returns NaN (Not a Number) for it so that the operation does not fail.
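
As a minimal sketch of the two differences described above (the variable names and values are illustrative, not taken from the notebook), the snippet below accesses elements through custom labels and shows how unmatched labels produce NaN during alignment:

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only
arr = np.array([10, 20, 30])
first = arr[0]

# A Series can carry any labels, e.g. letters or descending numbers
by_letter = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
descending = pd.Series([10, 20, 30], index=[3, 2, 1])
value_b = by_letter['b']       # lookup by label
value_1 = descending.loc[1]    # label 1 belongs to the last element here

# Arithmetic aligns on labels; a label present in only one Series gives NaN
left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'c'])
total = left + right
print(total)
```

Only the shared label `b` produces a number; `a` and `c` come back as NaN.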

In [1]:
import numpy as np
import pandas as pd

# Creating a Series using Pandas

You can convert a list, a NumPy array, or a dictionary to a Series in the following manner:

In [31]:
labels = ['w','x','y','z']
# Named my_list and my_dict so the built-ins `list` and `dict` are not shadowed
my_list = [10,20,30,'x']
array = np.array([10,20,30,40])
my_dict = {'w':10,'x':20,'y':30,'z':40}

In [32]:
x = pd.Series(my_list)
x

0    10
1    20
2    30
3     x
dtype: object

In [33]:
type(x)

pandas.core.series.Series

In [34]:
# Objects in Series

In [35]:
x[0]

10

In [37]:
print(type(x[3]))

<class 'str'>


In [19]:
# The integer array keeps a numeric dtype (int64 on most platforms)
pd.Series(array,index=labels)

w    10
x    20
y    30
z    40
dtype: int64

In [13]:
pd.Series(my_dict)

w    10
x    20
y    30
z    40
dtype: int64
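
The following sketch, written for illustration with its own small inputs, shows how the dtype depends on the data and what happens when a dictionary is combined with an explicit `index`. The label `'q'` is a hypothetical key that is deliberately absent from the dictionary:

```python
import pandas as pd

# Mixed Python objects force the generic `object` dtype
mixed = pd.Series([10, 20, 30, 'x'])

# A homogeneous list of integers gets an integer dtype
numbers = pd.Series([10, 20, 30, 40])

# With a dict, keys become labels; passing `index` selects and orders the keys,
# and a label missing from the dict ('q') becomes NaN
data = {'w': 10, 'x': 20, 'y': 30, 'z': 40}
subset = pd.Series(data, index=['z', 'w', 'q'])

print(mixed.dtype, numbers.dtype)
print(subset)
```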

## Using an Index

The following two Series show how a labeled index is used:

In [14]:
sports1 = pd.Series([1,2,3,4],
                    index=['Cricket', 'Football', 'Basketball', 'Golf'])

In [15]:
sports1

Cricket       1
Football      2
Basketball    3
Golf          4
dtype: int64

In [16]:
sports2 = pd.Series([1,2,5,4],
                    index=['Cricket', 'Football', 'Baseball', 'Golf'])

In [17]:
sports2

Cricket     1
Football    2
Baseball    5
Golf        4
dtype: int64

In [22]:
# Operations are also aligned on the index labels:

In [23]:
sports1 + sports2  # labels found in only one Series give NaN (Not a Number)

Baseball      NaN
Basketball    NaN
Cricket       2.0
Football      4.0
Golf          8.0
dtype: float64
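
If NaN is not the desired result for unmatched labels, one option is `Series.add` with `fill_value`. This is a sketch of an alternative, not something the notebook itself uses; it rebuilds the two Series so that it runs on its own:

```python
import pandas as pd

sports1 = pd.Series([1, 2, 3, 4],
                    index=['Cricket', 'Football', 'Basketball', 'Golf'])
sports2 = pd.Series([1, 2, 5, 4],
                    index=['Cricket', 'Football', 'Baseball', 'Golf'])

# fill_value=0 stands in for a label that is missing on one side before adding
total = sports1.add(sports2, fill_value=0)
print(total)
```

Here `Baseball` and `Basketball` keep the value from the one Series that contains them instead of becoming NaN.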

# Pandas DataFrame

In [21]:
np.random.randint(5,10,size=(10,5))

array([[5, 5, 9, 7, 7],
       [9, 7, 8, 5, 5],
       [9, 8, 9, 8, 8],
       [9, 5, 8, 6, 9],
       [9, 8, 7, 7, 7],
       [7, 7, 5, 7, 6],
       [7, 8, 5, 5, 6],
       [6, 8, 8, 8, 6],
       [8, 8, 8, 6, 8],
       [5, 9, 5, 7, 9]])

In [169]:
import numpy as np
import pandas as pd
# No seed is set, so randint returns different values on every run;
# later outputs in this section come from a different run of this cell
dataframe = pd.DataFrame(np.random.randint(1,100,size=(10,5)),index='A B C D E F G H I J'.split(),
                  columns='Score1 Score2 Score3 Score4 Score5'.split())
dataframe

   Score1  Score2  Score3  Score4  Score5
A      93      41      61      71      50
B      51      33      91      21      53
C      39      66      81      58      53
D      90      20      95      65      43
E      48      22      98       2      90
F      39      13      19      78      78
G      89      91      84      40      99
H      16      38      97      37      99
I       2      33      31       8      51
J      51      33      44      42      70
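
Because the random values change on each run, outputs from different cells can disagree with one another. A possible way to make the example reproducible is NumPy's seeded `Generator`, sketched below. The helper name `make_scores` and the seed value `0` are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

def make_scores(seed=0):
    """Build the 10x5 score table from a seeded random generator."""
    # A generator created with a fixed seed returns the same values every time
    rng = np.random.default_rng(seed)
    values = rng.integers(1, 100, size=(10, 5))   # upper bound is exclusive
    return pd.DataFrame(values,
                        index='A B C D E F G H I J'.split(),
                        columns='Score1 Score2 Score3 Score4 Score5'.split())

scores = make_scores()
print(scores.shape)
```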


In [110]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, A to J
Data columns (total 5 columns):
Score1    10 non-null int32
Score2    10 non-null int32
Score3    10 non-null int32
Score4    10 non-null int32
Score5    10 non-null int32
dtypes: int32(5)
memory usage: 600.0+ bytes


In [111]:
dataframe.describe()

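          Score1     Score2     Score3     Score4     Score5
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean   45.500000  50.000000  41.500000  40.400000  52.100000
std    21.146316  27.446918  26.800705  33.463745  29.429577
min     4.000000   5.000000  10.000000   1.000000   7.000000
25%    35.750000  29.500000  24.500000  21.750000  39.500000
50%    49.000000  48.000000  33.500000  32.000000  46.000000
75%    61.000000  73.750000  54.000000  52.750000  79.000000
max    72.000000  84.000000  97.000000  97.000000  91.000000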


In [112]:
dataframe.head(5)

   Score1  Score2  Score3  Score4  Score5
A      64      43      23      83      41
B      45      17      18      89       1
C      98      83      72      15      37
D      27      33       2      49      89
E      23      35      95      22      69


In [113]:
dataframe.tail(3)

   Score1  Score2  Score3  Score4  Score5
H      65      90      41       8       8
I      34      59       8      35       5
J       6      62      51      77      46


In [35]:
dataframe['Score3']

A    33
B    32
C    16
D    97
E    58
F    42
G    10
H    22
I    34
J    71
Name: Score3, dtype: int32

In [36]:
# Pass a list of column names in any order necessary
dataframe[['Score2','Score1']]

   Score2  Score1
A      26      58
B      52      66
C      29       4
D      84      62
E      31      35
F      84      47
G       5      51
H      75      22
I      44      38
J      70      72


In [37]:
type(dataframe['Score1'])

pandas.core.series.Series

In [38]:
dataframe['Score15'] = dataframe['Score1'] + dataframe['Score2']
dataframe['Score15']

A     84
B    118
C     33
D    146
E     66
F    131
G     56
H     97
I     82
J    142
Name: Score15, dtype: int32
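
Assigning to `dataframe['Score15']` modifies the DataFrame in place. As a sketch of an alternative that leaves the original untouched, `DataFrame.assign` returns a new DataFrame with the extra column. The column name `Total` and the three-row table are chosen for this illustration:

```python
import pandas as pd

scores = pd.DataFrame({'Score1': [58, 66, 4], 'Score2': [26, 52, 29]},
                      index=['A', 'B', 'C'])

# assign returns a new DataFrame; `scores` keeps its two original columns
with_total = scores.assign(Total=scores['Score1'] + scores['Score2'])
print(with_total)
```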

In [58]:
dataframe

   Score1  Score2  Score3  Score4  Score5  Score15
A      58      26      33      34       7       84
B      66      52      32       8      14      118
C       4      29      16      30      67       33
D      62      84      97      59      83      146
E      35      31      58      97      50       66
F      47      84      42      21      41      131
G      51       5      10       1      91       56
H      22      75      22      34      39       97
I      38      44      34      96      42       82
J      72      70      71      24      87      142


**Removing Columns from DataFrame**

In [59]:
dataframe.drop('Score15',axis=1)              # Use axis=0 for dropping rows and axis=1 for dropping columns

   Score1  Score2  Score3  Score4  Score5
A      58      26      33      34       7
B      66      52      32       8      14
C       4      29      16      30      67
D      62      84      97      59      83
E      35      31      58      97      50
F      47      84      42      21      41
G      51       5      10       1      91
H      22      75      22      34      39
I      38      44      34      96      42
J      72      70      71      24      87

In [60]:
# The column is not removed from dataframe itself unless inplace=True is passed
dataframe

   Score1  Score2  Score3  Score4  Score5  Score15
A      58      26      33      34       7       84
B      66      52      32       8      14      118
C       4      29      16      30      67       33
D      62      84      97      59      83      146
E      35      31      58      97      50       66
F      47      84      42      21      41      131
G      51       5      10       1      91       56
H      22      75      22      34      39       97
I      38      44      34      96      42       82
J      72      70      71      24      87      142


In [45]:
dataframe.drop('Score15',axis=1,inplace=True)
dataframe

   Score1  Score2  Score3  Score4  Score5
A      58      26      33      34       7
B      66      52      32       8      14
C       4      29      16      30      67
D      62      84      97      59      83
E      35      31      58      97      50
F      47      84      42      21      41
G      51       5      10       1      91
H      22      75      22      34      39
I      38      44      34      96      42
J      72      70      71      24      87


In [46]:
dataframe.drop('A',axis=0)      # Returns a copy without row A; dataframe itself changes only if inplace=True is passed

   Score1  Score2  Score3  Score4  Score5
B      66      52      32       8      14
C       4      29      16      30      67
D      62      84      97      59      83
E      35      31      58      97      50
F      47      84      42      21      41
G      51       5      10       1      91
H      22      75      22      34      39
I      38      44      34      96      42
J      72      70      71      24      87
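
As a sketch of an equivalent style, `drop` also accepts the `columns=` and `index=` keywords in place of `axis`, and reassigning the result has the same effect as `inplace=True`. The three-row table below is a cut-down stand-in for the notebook's data:

```python
import pandas as pd

scores = pd.DataFrame({'Score1': [58, 66, 4],
                       'Score2': [26, 52, 29],
                       'Score15': [84, 118, 33]},
                      index=['A', 'B', 'C'])

# columns= and index= are keyword alternatives to axis=1 and axis=0
without_column = scores.drop(columns=['Score15'])
without_row = scores.drop(index=['A'])

# Both calls returned copies, so `scores` still has all rows and columns here
unchanged_shape = scores.shape

# Reassigning the result makes the change permanent without inplace=True
scores = scores.drop(columns=['Score15'])
```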


In [56]:
# Selecting Rows Using loc and iloc

# loc selects rows (and columns) by their index labels
# iloc selects them by integer position, starting at 0

In [69]:
dataframe.loc['J'] 

Score1    72
Score2    70
Score3    71
Score4    24
Score5    87
Name: J, dtype: int32

In [72]:
dataframe.iloc[9]

Score1    72
Score2    70
Score3    71
Score4    24
Score5    87
Name: J, dtype: int32

In [74]:
dataframe.loc['B','Score1']

66

In [75]:
dataframe.loc[['A','B'],['Score1','Score2']]

   Score1  Score2
A      58      26
B      66      52
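
One difference between the two indexers that is easy to miss is how slices behave. The sketch below uses a small stand-in table: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like a Python list.

```python
import pandas as pd

scores = pd.DataFrame({'Score1': [58, 66, 4, 62],
                       'Score2': [26, 52, 29, 84]},
                      index=['A', 'B', 'C', 'D'])

# Label slices include the end label
by_label = scores.loc['A':'C']       # rows A, B and C

# Position slices exclude the end position
by_position = scores.iloc[0:2]       # rows A and B

# The same cell, addressed by label and by position
cell_label = scores.loc['B', 'Score1']
cell_position = scores.iloc[1, 0]
```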


In [77]:
### Conditional Selection

# Similar to NumPy, we can make conditional selections by putting a boolean condition inside the brackets

In [78]:
dataframe

   Score1  Score2  Score3  Score4  Score5
A      58      26      33      34       7
B      66      52      32       8      14
C       4      29      16      30      67
D      62      84      97      59      83
E      35      31      58      97      50
F      47      84      42      21      41
G      51       5      10       1      91
H      22      75      22      34      39
I      38      44      34      96      42
J      72      70      71      24      87


In [81]:
dataframe.shape  # (number of rows, number of columns)

(10, 5)

In [83]:
dataframe.size  # total number of values (rows x columns)

50

In [85]:
dataframe>50

   Score1  Score2  Score3  Score4  Score5
A    True   False   False   False   False
B    True    True   False   False   False
C   False   False   False   False    True
D    True    True    True    True    True
E   False   False    True    True   False
F   False    True   False   False   False
G    True   False   False   False    True
H   False    True   False   False   False
I   False   False   False    True   False
J    True    True    True   False    True


In [102]:
a = dataframe[dataframe>90]
a

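   Score1  Score2  Score3  Score4  Score5
A     NaN     NaN     NaN     NaN     NaN
B     NaN     NaN     NaN     NaN     NaN
C     NaN     NaN     NaN     NaN     NaN
D     NaN     NaN    97.0     NaN     NaN
E     NaN     NaN     NaN    97.0     NaN
F     NaN     NaN     NaN     NaN     NaN
G     NaN     NaN     NaN     NaN    91.0
H     NaN     NaN     NaN     NaN     NaN
I     NaN     NaN     NaN    96.0     NaN
J     NaN     NaN     NaN     NaN     NaN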


In [114]:
# Counting the values greater than a particular number
x = a.notnull().sum().sum()  # first sum counts per column, second sum adds the column counts
x

4

In [116]:
dataframe[(dataframe['Score1']>50) & (dataframe['Score2'] <60)]

   Score1  Score2  Score3  Score4  Score5
A      58      26      33      34       7
B      66      52      32       8      14
G      51       5      10       1      91
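
The same ideas extend to other boolean operators. In the sketch below (a four-row stand-in table, with thresholds picked for illustration), a boolean mask is summed directly to count matches, and `|` (or) and `~` (not) are used alongside the `&` shown above. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

scores = pd.DataFrame({'Score1': [58, 66, 4, 62],
                       'Score2': [26, 52, 29, 84]},
                      index=['A', 'B', 'C', 'D'])

# True counts as 1, so summing a boolean mask counts the matching values
count_over_60 = int((scores > 60).sum().sum())

# Rows where either score is above 60
either = scores[(scores['Score1'] > 60) | (scores['Score2'] > 60)]

# Rows where Score1 is NOT above 50
not_high = scores[~(scores['Score1'] > 50)]
```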


# More Index Details

Some more features of indexing include:
  - resetting the index 
  - setting a different value
  - index hierarchy

In [170]:
dataframe

   Score1  Score2  Score3  Score4  Score5
A      93      41      61      71      50
B      51      33      91      21      53
C      39      66      81      58      53
D      90      20      95      65      43
E      48      22      98       2      90
F      39      13      19      78      78
G      89      91      84      40      99
H      16      38      97      37      99
I       2      33      31       8      51
J      51      33      44      42      70


In [171]:
# Reset to the default integer index; the old labels A to J become a column named index
dataframe.reset_index()

  index  Score1  Score2  Score3  Score4  Score5
0     A      93      41      61      71      50
1     B      51      33      91      21      53
2     C      39      66      81      58      53
3     D      90      20      95      65      43
4     E      48      22      98       2      90
5     F      39      13      19      78      78
6     G      89      91      84      40      99
7     H      16      38      97      37      99
8     I       2      33      31       8      51
9     J      51      33      44      42      70


In [172]:
newindex = 'IND JP CAN GE IT PL FY IU RT IP'.split()

In [173]:
newindex

['IND', 'JP', 'CAN', 'GE', 'IT', 'PL', 'FY', 'IU', 'RT', 'IP']

In [174]:
dataframe['Countries'] = newindex

In [175]:
dataframe

   Score1  Score2  Score3  Score4  Score5  Countries
A      93      41      61      71      50        IND
B      51      33      91      21      53         JP
C      39      66      81      58      53        CAN
D      90      20      95      65      43         GE
E      48      22      98       2      90         IT
F      39      13      19      78      78         PL
G      89      91      84      40      99         FY
H      16      38      97      37      99         IU
I       2      33      31       8      51         RT
J      51      33      44      42      70         IP


In [165]:
dataframe.set_index('Countries')     # returns a new DataFrame; dataframe itself is unchanged

           Score1  Score2  Score3  Score4  Score5
Countries
IND            93      41      61      71      50
JP             51      33      91      21      53
CAN            39      66      81      58      53
GE             90      20      95      65      43
IT             48      22      98       2      90
PL             39      13      19      78      78
FY             89      91      84      40      99
IU             16      38      97      37      99
RT              2      33      31       8      51
IP             51      33      44      42      70


In [166]:
dataframe

   Score1  Score2  Score3  Score4  Score5  Countries
A      93      41      61      71      50        IND
B      51      33      91      21      53         JP
C      39      66      81      58      53        CAN
D      90      20      95      65      43         GE
E      48      22      98       2      90         IT
F      39      13      19      78      78         PL
G      89      91      84      40      99         FY
H      16      38      97      37      99         IU
I       2      33      31       8      51         RT
J      51      33      44      42      70         IP


In [176]:
dataframe.set_index('Countries',inplace=True)
dataframe

           Score1  Score2  Score3  Score4  Score5
Countries
IND            93      41      61      71      50
JP             51      33      91      21      53
CAN            39      66      81      58      53
GE             90      20      95      65      43
IT             48      22      98       2      90
PL             39      13      19      78      78
FY             89      91      84      40      99
IU             16      38      97      37      99
RT              2      33      31       8      51
IP             51      33      44      42      70
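
The feature list for this section mentions an index hierarchy, which the cells above do not demonstrate. A minimal sketch of a two-level index follows; the `Region` level and its values are hypothetical groupings invented for this illustration:

```python
import pandas as pd

# Outer level: Region (hypothetical), inner level: Country
index = pd.MultiIndex.from_tuples(
    [('America', 'CAN'), ('Asia', 'IND'), ('Asia', 'JP')],
    names=['Region', 'Country'])
scores = pd.DataFrame({'Score1': [39, 93, 51],
                       'Score2': [66, 41, 33]},
                      index=index)

# An outer label selects every row beneath it
asia = scores.loc['Asia']

# A tuple of labels addresses a single row; adding a column gives one cell
jp_score1 = scores.loc[('Asia', 'JP'), 'Score1']

# reset_index turns both index levels back into ordinary columns
flat = scores.reset_index()
```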


# Missing Data

pandas provides several methods for dealing with missing data:

In [178]:
dataframe = pd.DataFrame({'Cricket':[1,2,np.nan,4,6,7,2,np.nan],
                  'Baseball':[5,np.nan,np.nan,5,7,2,4,5],
                  'Tennis':[1,2,3,4,5,6,7,8]})

In [179]:
dataframe

   Cricket  Baseball  Tennis
0      1.0       5.0       1
1      2.0       NaN       2
2      NaN       NaN       3
3      4.0       5.0       4
4      6.0       7.0       5
5      7.0       2.0       6
6      2.0       4.0       7
7      NaN       5.0       8


In [180]:
dataframe.dropna()

   Cricket  Baseball  Tennis
0      1.0       5.0       1
3      4.0       5.0       4
4      6.0       7.0       5
5      7.0       2.0       6
6      2.0       4.0       7


In [181]:
dataframe.dropna(axis=1)       # Use axis=1 to drop the columns that contain NaN values

   Tennis
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
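
Dropping every row or column that contains a NaN can discard a lot of data. As a sketch of finer-grained options, the snippet below rebuilds the same table under the name `games` (chosen for this illustration) and uses the `thresh` and `subset` parameters of `dropna`:

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({'Cricket': [1, 2, np.nan, 4, 6, 7, 2, np.nan],
                      'Baseball': [5, np.nan, np.nan, 5, 7, 2, 4, 5],
                      'Tennis': [1, 2, 3, 4, 5, 6, 7, 8]})

# Count the missing values in each column before deciding what to drop
missing_per_column = games.isna().sum()

# Keep rows that have at least 2 non-NaN values
at_least_two = games.dropna(thresh=2)

# Only consider the Cricket column when looking for NaN
cricket_known = games.dropna(subset=['Cricket'])
```

With `thresh=2` only row 2 is removed, since it is the only row with a single non-missing value.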


# Filling with Zero

In [185]:
dataframe.fillna(value=0)

   Cricket  Baseball  Tennis
0      1.0       5.0       1
1      2.0       0.0       2
2      0.0       0.0       3
3      4.0       5.0       4
4      6.0       7.0       5
5      7.0       2.0       6
6      2.0       4.0       7
7      0.0       5.0       8


# Filling with Mean Value

In [192]:
dataframe['Baseball'] = dataframe['Baseball'].fillna(value=dataframe['Baseball'].mean())
dataframe['Baseball']

0    5.000000
1    4.666667
2    4.666667
3    5.000000
4    7.000000
5    2.000000
6    4.000000
7    5.000000
Name: Baseball, dtype: float64

In [190]:
dataframe

   Cricket  Baseball  Tennis
0      1.0  5.000000       1
1      2.0  4.666667       2
2      NaN  4.666667       3
3      4.0  5.000000       4
4      6.0  7.000000       5
5      7.0  2.000000       6
6      2.0  4.000000       7
7      NaN  5.000000       8
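
The cell above fills one column at a time, which leaves the NaN values in `Cricket` untouched. A possible shortcut, sketched here on a fresh copy of the table named `games` (a name chosen for this illustration), is to pass the Series of column means to `fillna` so that each column is filled with its own mean:

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({'Cricket': [1, 2, np.nan, 4, 6, 7, 2, np.nan],
                      'Baseball': [5, np.nan, np.nan, 5, 7, 2, 4, 5],
                      'Tennis': [1, 2, 3, 4, 5, 6, 7, 8]})

# games.mean() is a Series of column means; fillna matches it to columns by name
filled = games.fillna(games.mean())
print(filled)
```

`fillna` returns a new DataFrame here, so `games` still contains its original NaN values.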
