### Ufuncs: Operations Between DataFrame and Series


In [4]:
import pandas as pd
import numpy as np
rng=np.random.RandomState(42)
A=rng.randint(100,size=(3,4))
A.real # .real returns only the real part of the array; .imag returns the imaginary part

array([[51, 92, 14, 71],
       [60, 20, 82, 86],
       [74, 74, 87, 99]])

In [7]:
np.ndarray((2,), buffer=np.array([1,2,3]),
           offset=np.int_().itemsize,
           dtype=int)    # offset = 1*itemsize, i.e. skip first element


array([2, 3])

In [18]:
A-A[0] # broadcasting subtracts the first row from every row of the array (nothing is deleted)

array([[  0,   0,   0,   0],
       [  9, -72,  68,  15],
       [ 23, -18,  73,  28]])

In Pandas, subtracting a Series from a DataFrame similarly operates row-wise by default.

In [2]:
df=pd.DataFrame(A,index=list('abc'),columns=list('defg')) # we can similarly name rows and columns
df

Unnamed: 0,d,e,f,g
a,51,92,14,71
b,60,20,82,86
c,74,74,87,99


In [26]:
df-df.iloc[0] # subtracts the first row from every row of the DataFrame

Unnamed: 0,d,e,f,g
a,0,0,0,0
b,9,-72,68,15
c,23,-18,73,28


In [28]:
df-df.iloc[0:2] # DataFrame minus DataFrame aligns on row and column labels: rows a and b cancel, row c has no match and becomes NaN

Unnamed: 0,d,e,f,g
a,0.0,0.0,0.0,0.0
b,0.0,0.0,0.0,0.0
c,,,,
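
The NaN row above comes from index alignment: `df.iloc[0:2]` has no row `c`. As a minimal sketch (the frame below only rebuilds the values shown above so the snippet stands alone), the `sub` method with `fill_value` treats a label that is missing from one operand as a given value instead:

```python
import pandas as pd

# Rebuild the example frame from the values shown above
df = pd.DataFrame([[51, 92, 14, 71],
                   [60, 20, 82, 86],
                   [74, 74, 87, 99]],
                  index=list('abc'), columns=list('defg'))

# Plain subtraction aligns on row labels; row 'c' has no partner, so it becomes NaN
aligned = df - df.iloc[0:2]

# With fill_value=0 the missing row is treated as zeros, so row 'c' keeps its values
filled = df.sub(df.iloc[0:2], fill_value=0)
print(filled)
```

Whether zero is a sensible stand-in depends on the data; it is used here only to make the effect visible.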


If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword

In [9]:
df.subtract(df['e'],axis=0) # column e is subtracted from actual dataframe

Unnamed: 0,d,e,f,g
a,-41,0,-78,-21
b,40,0,62,66
c,0,0,13,25
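
The `axis` keyword matters because of label alignment. The sketch below (rebuilding the same frame so it runs on its own) contrasts the plain operator, which matches the Series labels against the column labels, with `subtract(..., axis=0)`, which matches them against the row index:

```python
import pandas as pd

# Rebuild the example frame from the values shown above
df = pd.DataFrame([[51, 92, 14, 71],
                   [60, 20, 82, 86],
                   [74, 74, 87, 99]],
                  index=list('abc'), columns=list('defg'))

# Without axis=0 the Series labels (a, b, c) are matched against the
# column labels (d, e, f, g); nothing lines up, so every entry is NaN
misaligned = df - df['e']

# With axis=0 the Series labels are matched against the row index
columnwise = df.subtract(df['e'], axis=0)

print(misaligned.shape)
print(columnwise)
```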


In [13]:
halfrow=df.iloc[[0,2],[0,1]] # iloc selects by integer position: rows 0 and 2, columns 0 and 1
halfrow

Unnamed: 0,d,e
a,51,92
c,74,74


In [12]:
df.iloc?

In [39]:
b=df.iloc[0,2]
b

14

In [42]:
c=df.iloc[0,::2]
c

d    51
f    14
Name: a, dtype: int32

In [15]:
df.iloc[0]

d    51
e    92
f    14
g    71
Name: a, dtype: int32

In [16]:
df.iloc[[0]]

Unnamed: 0,d,e,f,g
a,51,92,14,71


In [18]:
df.iloc[[True, False, True],[True,True,False,False]] # iloc also accepts boolean masks: one flag per row and one per column

Unnamed: 0,d,e
a,51,92
c,74,74
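
The same sub-frame can be reached in several ways. As a short sketch (rebuilding the frame so the snippet is self-contained), positional lists, boolean masks, and label-based `loc` all pick rows `a`, `c` and columns `d`, `e`:

```python
import pandas as pd

# Rebuild the example frame from the values shown above
df = pd.DataFrame([[51, 92, 14, 71],
                   [60, 20, 82, 86],
                   [74, 74, 87, 99]],
                  index=list('abc'), columns=list('defg'))

by_position = df.iloc[[0, 2], [0, 1]]                        # integer positions
by_mask = df.iloc[[True, False, True],
                  [True, True, False, False]]                # boolean masks
by_label = df.loc[['a', 'c'], ['d', 'e']]                    # row and column labels

print(by_label)
```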


## Handling Missing Data

In [23]:
for dtype in ['int','object']:
    print('dtype=',dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype= int
4.78 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

dtype= object
103 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)



In [43]:
vals2 = np.array([1, np.nan, 3, 4]) # np.nan is a floating-point value, so the whole array is upcast to float64
vals2.dtype

dtype('float64')

### NaN: Missing numerical data

 You should be aware that NaN is a bit like a data virus—it infects any
other object it touches. Regardless of the operation, the result of arithmetic with NaN
will be another NaN
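
A small sketch of that propagation, using an array with the same values as `vals2`. It also shows one consequence that is easy to overlook: NaN does not compare equal to itself, so `np.isnan` is the reliable way to locate it.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN survives any arithmetic, even multiplication by zero
scaled = vals * 0
print(scaled)

# NaN is not equal to itself, so == cannot be used to find it
print(np.nan == np.nan)

# np.isnan gives an element-wise mask of the NaN positions
mask = np.isnan(vals)
print(mask)
```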

In [50]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype


dtype('float64')

In [45]:
1 + np.nan

nan

In [46]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values

In [47]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2) # NaN-aware versions of sum, min and max that skip missing values

(8.0, 1.0, 4.0)
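
For comparison, a brief sketch of the same data in a Pandas Series: the Pandas aggregations skip NaN by default, and the `skipna` argument switches back to the propagating behaviour seen with plain NumPy.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, 4])

# Pandas aggregations ignore NaN unless told otherwise
total_skipping = s.sum()

# skipna=False restores NumPy-style propagation
total_strict = s.sum(skipna=False)

print(total_skipping, total_strict)
```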

In [74]:
import pandas as pd
import numpy as np
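pd.isnull(vals2) # pass the array itself; the string 'vals2' is a non-null scalar, so pd.isnull('vals2') would simply return False

array([False,  True, False, False])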

In [80]:
data = pd.Series([1, np.nan, 'hello', None])
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool
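
The Series above has `object` dtype because it mixes numbers with a string. As a minimal sketch, when the remaining values are all numeric Pandas converts `None` to NaN and uses a float dtype, while both kinds of null are detected the same way:

```python
import numpy as np
import pandas as pd

# All-numeric data: None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, np.nan, 2, None])

# Mixed data: the Series falls back to the generic object dtype
mixed = pd.Series([1, np.nan, 'hello', None])

print(numeric.dtype, mixed.dtype)
print(numeric.isnull().sum())
```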

In [87]:
data.fillna(value=1,limit=1) # limit is the maximum number of null entries (NaN or None) to fill with the given value

0        1
1        1
2    hello
3     None
dtype: object

In [88]:
data[data.notnull()]

0        1
2    hello
dtype: object

### Dropping null values

In addition to the masking used before, there are the convenience methods, dropna()
(which removes NA values) and fillna() (which fills in NA values).

In [89]:
data.dropna()

0        1
2    hello
dtype: object

In [91]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


### Filling null values

In [99]:
# or we can fill the na position with 0
df.fillna(value=0)

Unnamed: 0,0,1,2
0,1.0,0.0,2
1,2.0,3.0,5
2,0.0,4.0,6


In [112]:
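df.fillna(method='ffill') # forward fill: each NaN takes the last valid value above it in the same column
# This cell was run after column 3 was added (see In [102] below), so the output has four columns.
# Recent pandas versions deprecate the method keyword in favour of df.ffill().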

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,2.0,4.0,6,


In [119]:
h=df.T
g=h.fillna(method='ffill')
g

Unnamed: 0,0,1,2
0,1.0,2.0,
1,1.0,3.0,4.0
2,2.0,5.0,6.0
3,2.0,5.0,6.0
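
The transpose above is one way to fill along rows instead of down columns. As an alternative sketch, `DataFrame.ffill` accepts an `axis` argument that should give the same values without transposing (the frame is rebuilt here, including the all-NaN column 3):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Fill each NaN with the last valid value to its left in the same row
along_rows = df.ffill(axis=1)

# Same idea via transposing, filling down, and transposing back
via_transpose = df.T.ffill().T

print(along_rows)
```

The leading NaN in the last row stays, because there is no earlier value in that row to carry forward.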


In [120]:
g.fillna(method='bfill') # bfill is a backward fill: each NaN takes the next valid value below it

Unnamed: 0,0,1,2
0,1.0,2.0,4.0
1,1.0,3.0,4.0
2,2.0,5.0,6.0
3,2.0,5.0,6.0


In [98]:
df.dropna() # dropna() cannot remove single values from a DataFrame; it drops whole rows (the default) or whole columns

Unnamed: 0,0,1,2
1,2.0,3.0,5


Alternatively, you can drop NA values along a different axis; axis=1 drops all columns containing a null value.

In [101]:
df.dropna(axis=1)
# df.dropna(axis='columns') gives the same output

Unnamed: 0,2
0,2
1,5
2,6


But this drops some good data as well; you might rather be interested in dropping
rows or columns with all NA values, or a majority of NA values. This can be specified
through the how or thresh parameters, which allow fine control of the number of
nulls to allow through.

In [102]:
df[3]=np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [106]:
df.dropna(axis='columns',how='all') # how='all' drops only columns in which every value is null; the default how='any' drops a column with at least one null

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


For finer-grained control, the thresh parameter lets you specify a minimum number
of non-null values for the row/column to be kept

In [111]:
df.dropna(axis='rows', thresh=3)

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,


Here the first and last row have been dropped, because they contain only two non-null values.
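
The same parameter applies along the other axis. As a short sketch on the same four-column frame (rebuilt here so the snippet runs alone), `thresh` can also decide which columns survive:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Keep columns with at least 2 non-null values: columns 0, 1 and 2 qualify
at_least_two = df.dropna(axis='columns', thresh=2)

# Keep columns with at least 3 non-null values: only column 2 is complete
at_least_three = df.dropna(axis='columns', thresh=3)

print(list(at_least_two.columns), list(at_least_three.columns))
```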

### Hierarchical Indexing

Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels within a
single index.

### A Multiply Indexed Series


Let’s start by considering how we might represent two-dimensional data within a
one-dimensional Series

### The bad way

Suppose you would like to track data about states from two different years

In [70]:
import pandas as pd
import numpy as np
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [71]:
# With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index
pop[('California', 2010):('Texas', 2000)]


(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

In [42]:
pop[[i for i in pop.index if i[1] == 2010]]
 # This works, but it is clumsy to write and slow for large data, because it loops over the index in Python

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

### The better way: Pandas MultiIndex

Pandas provides a better way: the MultiIndex type, which can be built directly from the list of tuples.

In [43]:
index1=pd.MultiIndex.from_tuples(index)
index1

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [56]:
pop1=pd.DataFrame(pop,index=index1,columns=['population'])
pop1

Unnamed: 0,Unnamed: 1,population
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [52]:
pop2=pop.reindex(index1)
pop2

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [57]:
pop.reindex?

Here the first two columns of the Series representation show the multiple index values,
while the third column shows the data. Notice that some entries are missing in
the first column: in this multi-index representation, any blank entry indicates the
same value as the line above it.

In [61]:
pop2[:, 2010] # this slice-plus-label indexing works only on a MultiIndex (pop2), not on the tuple-keyed pop

California    37253956
New York      19378102
Texas         25145561
dtype: int64
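
A minimal sketch of a few other ways to index the multiply indexed Series. It rebuilds the data with the values shown above; the name `pop2` mirrors the notebook, and the rest is standard `loc` usage:

```python
import pandas as pd

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop2 = pd.Series(populations, index=pd.MultiIndex.from_tuples(index))

# Partial indexing on the first level returns a Series indexed by year
california = pop2['California']

# Giving both levels returns a single value
ca_2010 = pop2.loc['California', 2010]

# A full slice on the first level selects one year across all states
year_2010 = pop2.loc[:, 2010]

print(california)
print(ca_2010)
print(year_2010)
```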

### MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data
using a simple DataFrame with index and column labels. In fact, Pandas is built with
this equivalence in mind. 

In [63]:
pop_df = pop2.unstack() # unstack() needs a MultiIndex, so it works on the re-indexed pop2 but not on the tuple-keyed pop
pop_df


Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [64]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
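
The equivalence between the two layouts can be checked directly. The sketch below rebuilds `pop2` from the values above and confirms that `unstack()` followed by `stack()` returns the original entries:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop2 = pd.Series([33871648, 37253956,
                  18976457, 19378102,
                  20851820, 25145561], index=index)

# The inner level (year) moves into the columns
pop_df = pop2.unstack()

# stack() moves the columns back into an inner index level
roundtrip = pop_df.stack()

print(pop_df.shape)
print(roundtrip.to_dict() == pop2.to_dict())
```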

Seeing this, you might wonder why we would bother with hierarchical indexing
at all. The reason is simple: just as we were able to use multi-indexing to represent
two-dimensional data within a one-dimensional Series, we can also use it to represent
data of three or more dimensions in a Series or DataFrame. Each extra level in a
multi-index represents an extra dimension of data; taking advantage of this property
gives us much more flexibility in the types of data we can represent. Concretely, we
might want to add another column of demographic data for each state at each year
(say, population under 18); with a MultiIndex this is as easy as adding another
column to the DataFrame.

In [97]:
pop_df1 = pd.DataFrame({'total': pop2,
                        'under18': [9267089, 9284094,
                                    4687374, 4318033,
                                    5906301, 6879014]})
pop_df1

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


Here we compute the fraction of people under 18 by year, given the above data.

In [101]:
pop_fract=pop_df1['under18']/pop_df1['total'] # element-wise ratio of two columns already present in the DataFrame
# pop_fract.unstack() would display the fractions as a state-by-year table
type(pop_fract)

pandas.core.series.Series
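
Because the result is a Series that keeps the MultiIndex, it can be unstacked into a state-by-year table. A self-contained sketch using the figures above (the names `frac` and `frac_table` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop_df1 = pd.DataFrame({'total': [33871648, 37253956,
                                  18976457, 19378102,
                                  20851820, 25145561],
                        'under18': [9267089, 9284094,
                                    4687374, 4318033,
                                    5906301, 6879014]},
                       index=index)

# Element-wise division keeps the two-level index
frac = pop_df1['under18'] / pop_df1['total']

# Move the year level into the columns for a compact table
frac_table = frac.unstack()
print(frac_table)
```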

In [102]:
pop_df1['fract']=pop_df1['under18']/pop_df1['total'] 
pop_df1

Unnamed: 0,Unnamed: 1,total,under18,fract
California,2000,33871648,9267089,0.273594
California,2010,37253956,9284094,0.249211
New York,2000,18976457,4687374,0.24701
New York,2010,19378102,4318033,0.222831
Texas,2000,20851820,5906301,0.283251
Texas,2010,25145561,6879014,0.273568


In [100]:
pop_df1['fract']=pop_fract # assigning the precomputed Series directly gives the same result as the cell above
pop_df1

Unnamed: 0,Unnamed: 1,total,under18,fract
California,2000,33871648,9267089,0.273594
California,2010,37253956,9284094,0.249211
New York,2000,18976457,4687374,0.24701
New York,2010,19378102,4318033,0.222831
Texas,2000,20851820,5906301,0.283251
Texas,2010,25145561,6879014,0.273568


### Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame
is to simply pass a list of two or more index arrays to the constructor

In [105]:
# Passing a list of index arrays to the constructor builds a MultiIndex; three arrays give three levels
df = pd.DataFrame(np.random.rand(8, 2),
                  index=[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                         [1, 1, 2, 2, 1, 1, 2, 2],
                         ['I', 'II', 'I', 'II', 'I', 'II', 'I', 'II']],
                  columns=['data1', 'data2'])
df

Unnamed: 0,Unnamed: 1,Unnamed: 2,data1,data2
a,1,I,0.690473,0.289988
a,1,II,0.145064,0.665637
a,2,I,0.114886,0.015767
a,2,II,0.710067,0.369914
b,1,I,0.215712,0.973067
b,1,II,0.914025,0.497934
b,2,I,0.852589,0.360117
b,2,II,0.901824,0.042223
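
For more explicit control, the `pd.MultiIndex` class has its own constructors. The sketch below uses a smaller two-level index than the one above (an assumption made to keep it short) and builds it in three ways that should describe the same labels:

```python
import pandas as pd

# From parallel arrays, one array per level
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'],
                                         [1, 2, 1, 2]])

# From a list of tuples, one tuple per row
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2),
                                         ('b', 1), ('b', 2)])

# From the Cartesian product of the level values
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(list(from_product))
```

Any of these objects can then be passed as the `index` argument when creating a Series or DataFrame.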


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default

In [119]:
data = {('California', 2000): 33871648, # to turn this dictionary into a DataFrame, first convert it into a Series
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
x=pd.Series(data)
x

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

In [124]:
df=pd.DataFrame(x,columns=['population'])
df

Unnamed: 0,Unnamed: 1,population
California,2000,33871648
California,2010,37253956
Texas,2000,20851820
Texas,2010,25145561
New York,2000,18976457
New York,2010,19378102

