
# **Hands-On Pandas Session**

In [2]:
import pandas as pd

**Pandas Series Object - As numpy array**

A Series is a one-dimensional array of indexed data. It can be created from a list or an array. A Series can hold values of different data types, in which case its dtype is object.

In [None]:
data = pd.Series([23,12,44,13,123])
data

0     23
1     12
2     44
3     13
4    123
dtype: int64
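As a small sketch of the point about data types (the variable names `ints` and `mixed` are illustrative and not part of the session), a Series built from same-typed values gets a specific dtype, while mixing types makes pandas fall back to the generic `object` dtype:

```python
import pandas as pd

# All integers: pandas infers an integer dtype.
ints = pd.Series([23, 12, 44])

# Integers, strings and floats together: stored as generic Python objects.
mixed = pd.Series([23, "twelve", 44.5])

print(ints.dtype)
print(mixed.dtype)   # object
```

Operations on an `object` Series are generally slower than on a numeric one, so mixed types are best kept for cases that really need them.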

A Series wraps both a sequence of values and a sequence of indices. This is another difference between Pandas and NumPy: a NumPy array has only implicit integer positions, not an explicit index.

In [None]:
data.values

array([ 23,  12,  44,  13, 123])

In [None]:
data.index

RangeIndex(start=0, stop=5, step=1)
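To make the contrast with NumPy concrete, here is a minimal sketch; the string labels `x`, `y`, `z` are an assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

arr = np.array([23, 12, 44])
ser = pd.Series([23, 12, 44], index=["x", "y", "z"])

# A NumPy array is addressed by position only.
print(arr[1])            # 12

# A Series carries an explicit index alongside its values.
print(ser["y"])          # 12
print(list(ser.index))   # ['x', 'y', 'z']
```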

In [None]:
data[1:3]

1    12
2    44
dtype: int64

In [None]:
data[:3]

0    23
1    12
2    44
dtype: int64

Another advantage of a Pandas Series is that we can assign whatever index we want, including non-contiguous or non-sequential values.

In [None]:
data = pd.Series([23,12,44,13,123], index = [2,3,5,7,9])
data

2     23
3     12
5     44
7     13
9    123
dtype: int64

**Series as specialized dictionary**

In [None]:
fruit_dict = {
    'Apples':30,
    'Mangoes' : 20,
    'Bananas' : 40,
    'Pears' : 100
}
fruits = pd.Series(fruit_dict)
fruits

Apples      30
Mangoes     20
Bananas     40
Pears      100
dtype: int64

In [None]:
fruits[:'Bananas']

Apples     30
Mangoes    20
Bananas    40
dtype: int64
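The dictionary analogy can be pushed a little further. The sketch below rebuilds the same `fruits` Series and shows dict-style membership and lookup next to the label slicing that a plain dict does not offer; the name `sliced` is only for illustration:

```python
import pandas as pd

fruits = pd.Series({"Apples": 30, "Mangoes": 20, "Bananas": 40, "Pears": 100})

# Dictionary-style membership test and lookup by key.
print("Mangoes" in fruits)    # True
print(fruits["Mangoes"])      # 20

# Unlike a dict, a Series supports slicing by label;
# the end label is included in the result.
sliced = fruits["Mangoes":"Pears"]
print(list(sliced.index))     # ['Mangoes', 'Bananas', 'Pears']

# Converting back to a plain dict.
print(fruits.to_dict())
```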

***Pandas DataFrame Object***

A DataFrame is a two-dimensional array with labeled rows and columns.

In [None]:
quant_dict = {
    'Apples':30,
    'Mangoes' : 20,
    'Bananas' : 40,
    'Pears' : 100
}
fruits_quant = pd.Series(quant_dict)
fruits_quant

Apples      30
Mangoes     20
Bananas     40
Pears      100
dtype: int64

In [None]:
Fruits_df = pd.DataFrame({
    'fruits_price' : fruits,
    'fruits_quantity' : fruits_quant
})
Fruits_df

| | fruits_price | fruits_quantity |
|---|---:|---:|
| Apples | 30 | 30 |
| Mangoes | 20 | 20 |
| Bananas | 40 | 40 |
| Pears | 100 | 100 |


In [None]:
Fruits_df.index

Index(['Apples', 'Mangoes', 'Bananas', 'Pears'], dtype='object')

In [None]:
Fruits_df.columns

Index(['fruits_price', 'fruits_quantity'], dtype='object')
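When a DataFrame is built from several Series, pandas aligns them on their index labels. The following sketch uses two deliberately mismatched Series (the data is made up for illustration) to show that labels missing from one Series become NaN in its column:

```python
import pandas as pd

price = pd.Series({"Apples": 30, "Mangoes": 20})
quantity = pd.Series({"Mangoes": 5, "Pears": 6})

df = pd.DataFrame({"price": price, "quantity": quantity})

# The row index is the union of both Series' labels.
print(df.shape)                   # (3, 2)
print(list(df.index))             # ['Apples', 'Mangoes', 'Pears']

# Each column has one label it did not cover, filled with NaN.
print(df.isna().sum().tolist())   # [1, 1]
```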

**Pandas Index Object**

In [2]:
ind = pd.Index([2,3,1,5,21])
ind

Int64Index([2, 3, 1, 5, 21], dtype='int64')

In [3]:
ind[1]

3

In [5]:
ind[::4]

Int64Index([2, 21], dtype='int64')

In [6]:
ind.values

array([ 2,  3,  1,  5, 21])

In [7]:
print(ind.size, ind.ndim, ind.shape, ind.dtype)

5 1 (5,) int64


One difference between Index objects and NumPy arrays is that Index objects are immutable.

In [8]:
ind[1]=0

TypeError: ignored
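A short sketch of what immutability means in practice: item assignment raises a `TypeError`, and the way to "change" an index is to build a new one. Using `delete` and `insert` here is just one possible approach, and `new_ind` is an illustrative name:

```python
import pandas as pd

ind = pd.Index([2, 3, 1, 5, 21])

try:
    ind[1] = 0           # item assignment is not supported on an Index
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Both methods return a new Index and leave the original untouched.
new_ind = ind.delete(1).insert(1, 0)
print(list(new_ind))     # [2, 0, 1, 5, 21]
print(list(ind))         # [2, 3, 1, 5, 21]
```

Immutability makes it safe to share one Index between several Series and DataFrames.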

# **Data Selection & Indexing**

In [17]:
data1 = pd.Series([23,12,44,13,123], index = ['a','b','c','d','e'])
data1

a     23
b     12
c     44
d     13
e    123
dtype: int64

In [18]:
#Appending a value under a new label
data1['f'] = 121
data1

a     23
b     12
c     44
d     13
e    123
f    121
dtype: int64

In [19]:
#Slicing by explicit index values
data1['c':'e']

c     44
d     13
e    123
dtype: int64

In [20]:
#Slicing by implicit index values
data1[0:3]

a    23
b    12
c    44
dtype: int64

In [22]:
#Fancy indexing - Array of particular values
data1[['d','a']]

d    13
a    23
dtype: int64

With explicit (label-based) indexing, the final index is included in the slice; with implicit (position-based) indexing, the final index is excluded.
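The difference in endpoint handling is easiest to see side by side. This sketch reuses the five-element `data1` Series from the cells above; `by_label` and `by_position` are illustrative names:

```python
import pandas as pd

data1 = pd.Series([23, 12, 44, 13, 123], index=["a", "b", "c", "d", "e"])

# Label-based slice: both endpoints, 'a' and 'c', are included.
by_label = data1["a":"c"]

# Position-based slice: position 2 is excluded, as in ordinary Python.
by_position = data1[0:2]

print(list(by_label.index))      # ['a', 'b', 'c']
print(list(by_position.index))   # ['a', 'b']
```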

# **Indexers: loc and iloc**

In [24]:
data3 = pd.Series(['a','b','c','d'], index = [1,3,5,7]) 
data3

1    a
3    b
5    c
7    d
dtype: object

In [34]:
data3.loc[1:3]

1    a
3    b
dtype: object

In [45]:
data3.loc[1]

'a'

The **loc** attribute selects rows by index label. It always uses the explicit index.

The **iloc** attribute selects rows by integer position. It always uses the implicit, Python-style index.

In [27]:
data3.iloc[1]

'b'

In [30]:
data3.iloc[1:3]

3    b
5    c
dtype: object
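With an integer index such as the one on `data3`, the same number can mean a label or a position, which is where `loc` and `iloc` matter most. A minimal sketch (the variable names on the left are illustrative):

```python
import pandas as pd

data3 = pd.Series(["a", "b", "c", "d"], index=[1, 3, 5, 7])

label_3 = data3.loc[3]       # the row whose label is 3
position_3 = data3.iloc[3]   # the fourth row, counting from 0
print(label_3, position_3)   # b d

# Slices follow the same split: labels include the end, positions do not.
label_slice = data3.loc[3:5].tolist()
position_slice = data3.iloc[1:2].tolist()
print(label_slice)           # ['b', 'c']
print(position_slice)        # ['b']
```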

# **Data Selection in Dataframes**

In [47]:
fruits = pd.Series({
    'Apples':30,
    'Mangoes' : 20,
    'Bananas' : 40,
    'Pears' : 100
})
fruits_quant = pd.Series({
    'Apples':3,
    'Mangoes' : 5,
    'Bananas' : 2,
    'Pears' : 6
})

data4 = pd.DataFrame({'Price':fruits, 'Quantity':fruits_quant})
data4

| | Price | Quantity |
|---|---:|---:|
| Apples | 30 | 3 |
| Mangoes | 20 | 5 |
| Bananas | 40 | 2 |
| Pears | 100 | 6 |


Individual columns can be accessed as Series with dictionary-style indexing.

In [48]:
data4['Price']

Apples      30
Mangoes     20
Bananas     40
Pears      100
Name: Price, dtype: int64

In [49]:
data4['Rate'] = data4['Price']/ data4['Quantity']
data4

| | Price | Quantity | Rate |
|---|---:|---:|---:|
| Apples | 30 | 3 | 10.000000 |
| Mangoes | 20 | 5 | 4.000000 |
| Bananas | 40 | 2 | 20.000000 |
| Pears | 100 | 6 | 16.666667 |


In [50]:
data4.values

array([[ 30.        ,   3.        ,  10.        ],
       [ 20.        ,   5.        ,   4.        ],
       [ 40.        ,   2.        ,  20.        ],
       [100.        ,   6.        ,  16.66666667]])

In [51]:
data4.iloc[:2,:3]

| | Price | Quantity | Rate |
|---|---:|---:|---:|
| Apples | 30 | 3 | 10.0 |
| Mangoes | 20 | 5 | 4.0 |


In [52]:
data4.loc[:'Bananas',:'Quantity']

| | Price | Quantity |
|---|---:|---:|
| Apples | 30 | 3 |
| Mangoes | 20 | 5 |
| Bananas | 40 | 2 |


In [54]:
data4.dtypes

Price         int64
Quantity      int64
Rate        float64
dtype: object

In [57]:
data4[data4['Quantity'] < 5]

| | Price | Quantity | Rate |
|---|---:|---:|---:|
| Apples | 30 | 3 | 10.0 |
| Bananas | 40 | 2 | 20.0 |
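The filter above works because the comparison produces a Boolean Series that pandas uses as a row mask. The sketch below rebuilds the Price and Quantity columns and, as an illustration beyond the session, combines two conditions and passes the mask to `loc`:

```python
import pandas as pd

data4 = pd.DataFrame(
    {"Price": [30, 20, 40, 100], "Quantity": [3, 5, 2, 6]},
    index=["Apples", "Mangoes", "Bananas", "Pears"],
)

# The comparison yields one Boolean per row.
mask = data4["Quantity"] < 5
print(mask.tolist())                       # [True, False, True, False]

# Conditions combine with & (and) and | (or); each needs its own parentheses.
cheap_and_few = data4[(data4["Quantity"] < 5) & (data4["Price"] < 40)]
print(list(cheap_and_few.index))           # ['Apples']

# loc accepts the mask together with a column selection.
masked_prices = data4.loc[mask, "Price"].tolist()
print(masked_prices)                       # [30, 40]
```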


# **Handling Missing Data**


**NaN and None in Pandas**

In [2]:
import numpy as np
data5 = pd.Series([1, np.nan, 'Hey', None])
data5

0       1
1     NaN
2     Hey
3    None
dtype: object

**Operations on Null Values**

Pandas provides several methods for detecting, removing, and replacing null values:

isnull() : returns a Boolean mask that is True wherever a value is null

notnull() : the opposite of isnull()

dropna() : returns a copy with the null values dropped

fillna() : returns a copy with the null values filled in

In [70]:
data5.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [71]:
data5[data5.notnull()]

0      1
2    Hey
dtype: object

In [72]:
data5.dropna()

0      1
2    Hey
dtype: object

In [3]:
data6= pd.DataFrame([[2,3, np.nan],
                     [np.nan, 32, 12],
                     [90,21,np.nan]
])

data6

| | 0 | 1 | 2 |
|---|---:|---:|---:|
| 0 | 2.0 | 3 | NaN |
| 1 | NaN | 32 | 12.0 |
| 2 | 90.0 | 21 | NaN |


We cannot drop a single NaN value from a DataFrame; dropping removes either entire rows or entire columns.

By default, dropna() drops every row that contains at least one null value.

In [4]:
data6.dropna()

Empty DataFrame
Columns: [0, 1, 2]
Index: []


You can drop NA values along a different axis: axis=1 (or axis='columns') drops all columns containing a null value.

In [79]:
data6.dropna(axis = 'columns')

| | 1 |
|---|---:|
| 0 | 3 |
| 1 | 32 |
| 2 | 21 |


In [5]:
data6.dropna(axis = 'rows')

Empty DataFrame
Columns: [0, 1, 2]
Index: []


In [6]:
data6[3] = np.nan

In [7]:
data6.dropna(axis = 'columns', how = 'all')

| | 0 | 1 | 2 |
|---|---:|---:|---:|
| 0 | 2.0 | 3 | NaN |
| 1 | NaN | 32 | 12.0 |
| 2 | 90.0 | 21 | NaN |


The default is how = 'any', which drops any row or column (depending on the axis) that contains at least one null value; how = 'all' drops only rows or columns in which every value is null.
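The two settings of `how` can be compared on one small frame. This is a sketch with made-up data in which column `a` is partly missing, `b` is entirely missing and `c` is complete:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],          # one null
    "b": [np.nan, np.nan, np.nan],    # entirely null
    "c": [7.0, 8.0, 9.0],             # no nulls
})

# how='any' (the default): drop a column if it has at least one null.
any_cols = list(df.dropna(axis="columns", how="any").columns)
print(any_cols)    # ['c']

# how='all': drop a column only if every value in it is null.
all_cols = list(df.dropna(axis="columns", how="all").columns)
print(all_cols)    # ['a', 'c']
```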

# **Filling Null values**

Rather than dropping rows or columns with null values, we can replace the null values with valid values.

In [11]:
data7 = pd.Series([1, np.nan, 5, None], index = list ('abcd'))
data7

a    1.0
b    NaN
c    5.0
d    NaN
dtype: float64

In [13]:
data7.fillna(0) #You can replace with any value that you want

a    1.0
b    0.0
c    5.0
d    0.0
dtype: float64

**Forward-Fill** method to propagate previous value forward.

In [15]:
data7.ffill() #fillna(method='ffill') is deprecated since pandas 2.1

a    1.0
b    1.0
c    5.0
d    5.0
dtype: float64

**Backward-Fill** method to propagate the next value backward.

In [16]:
data7.bfill() #fillna(method='bfill') is deprecated since pandas 2.1

a    1.0
b    5.0
c    5.0
d    NaN
dtype: float64

The examples above fill null values in a Series. DataFrames work the same way, with an axis argument specifying the direction along which to fill.

In [17]:
data6

| | 0 | 1 | 2 | 3 |
|---|---:|---:|---:|---:|
| 0 | 2.0 | 3 | NaN | NaN |
| 1 | NaN | 32 | 12.0 | NaN |
| 2 | 90.0 | 21 | NaN | NaN |


In [18]:
data6.ffill(axis=1) #fill each row from left to right

| | 0 | 1 | 2 | 3 |
|---|---:|---:|---:|---:|
| 0 | 2.0 | 3.0 | 3.0 | 3.0 |
| 1 | NaN | 32.0 | 12.0 | 12.0 |
| 2 | 90.0 | 21.0 | 21.0 | 21.0 |


In [19]:
data6.bfill(axis=0) #fill each column from bottom to top

| | 0 | 1 | 2 | 3 |
|---|---:|---:|---:|---:|
| 0 | 2.0 | 3 | 12.0 | NaN |
| 1 | 90.0 | 32 | 12.0 | NaN |
| 2 | 90.0 | 21 | NaN | NaN |


**Arithmetic with None raises a TypeError, whereas arithmetic with np.nan works and returns nan.**
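The difference between the two missing-value markers can be checked directly. The sketch below is illustrative only; the names `nan_sum`, `none_failed` and `s` are not from the session:

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary float, so arithmetic runs and the result is NaN.
nan_sum = 1 + np.nan
print(nan_sum)        # nan

# None has no arithmetic defined and raises TypeError.
try:
    1 + None
    none_failed = False
except TypeError:
    none_failed = True
print(none_failed)    # True

# In a numeric Series, pandas converts None to NaN, and reductions
# such as sum() skip NaN by default.
s = pd.Series([1, None, 5])
print(s.dtype)        # float64
print(s.sum())        # 6.0
```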

# **Pandas String Operations**

In [21]:
data9 = ['sheena','Rina', 'mintu', 'jhonson']
[s.capitalize() for s in data9]

['Sheena', 'Rina', 'Mintu', 'Jhonson']

This string operation fails if the list contains any missing values.

In [22]:
data10 = ['sheena','Rina', None,'mintu', 'jhonson']
[s.capitalize() for s in data10]

AttributeError: ignored

In [25]:
data11 = pd.Series( ['sheena','Rina', 'mintu', 'jhonson'])
data11.str.len()

0    6
1    4
2    5
3    7
dtype: int64

In [26]:
data11.str.startswith('j')

0    False
1    False
2    False
3     True
dtype: bool

In [28]:
data11.str.split('i')

0     [sheena]
1      [R, na]
2     [m, ntu]
3    [jhonson]
dtype: object
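The `.str` accessor also handles the missing-value case that broke the list comprehension earlier. A minimal sketch, reusing the names from the failing example (the variables `names` and `capitalized` are illustrative):

```python
import pandas as pd

names = pd.Series(["sheena", "Rina", None, "mintu", "jhonson"])

# The method is applied element-wise; the missing entry stays missing
# instead of raising AttributeError.
capitalized = names.str.capitalize()
print(capitalized)

print(capitalized.isna().tolist())   # [False, False, True, False, False]
```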

# **Concat and Append**

In [32]:
ser1 = pd.Series([3,5,7], index= list('abc'))
ser2 = pd.Series([22,33,44], index = list('xyz'))
pd.concat([ser1,ser2])

a     3
b     5
c     7
x    22
y    33
z    44
dtype: int64

In [4]:
df1 = pd.DataFrame([[3,5,7],[9,12,57]], index= list('ab'))
df2 = pd.DataFrame([[22,33,44],[76,35,90]], index = list('xy'))
pd.concat([df1,df2],axis=0)

| | 0 | 1 | 2 |
|---|---:|---:|---:|
| a | 3 | 5 | 7 |
| b | 9 | 12 | 57 |
| x | 22 | 33 | 44 |
| y | 76 | 35 | 90 |
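Two other `pd.concat` options are worth a quick sketch. Here `df2` is given the same row labels as `df1`, which is an assumption made for illustration and differs from the cell above, so that the effect of `ignore_index` and `axis=1` is visible:

```python
import pandas as pd

df1 = pd.DataFrame([[3, 5, 7], [9, 12, 57]], index=list("ab"))
df2 = pd.DataFrame([[22, 33, 44], [76, 35, 90]], index=list("ab"))

# axis=0 stacks rows; ignore_index=True replaces the repeated labels
# 'a', 'b', 'a', 'b' with a fresh 0..n-1 index.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.shape)          # (4, 3)
print(list(stacked.index))    # [0, 1, 2, 3]

# axis=1 places the frames side by side, aligned on the row labels.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side.shape)     # (2, 6)
```

The older `append` method on Series and DataFrame was removed in pandas 2.0, so `pd.concat` is the way to add rows in current versions.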


# **Join, Merge & Concat difference**

