<p style="color:Blue; font-size: 30px;text-align: center;"><b> ASAP The Pandas Library</b></p>

In [None]:
##<p style="color:Blue; font-size: 30px;text-align: center;"> Relsoft Systems</p>

<p style="color:Red; font-size: 25px;"><b> Pandas </b></p>

### Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

## Object Creation - Series

### 1. Creating a Series by passing a list of values, letting pandas create a default integer index:

### 2. One-dimensional ndarray with axis labels (including time series).

### 3. Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).


### Hashability is a property of a Python object that tells whether the object has a hash value. An object that has a hash value can be used as a key in a dictionary or as an element in a set.

### An object is hashable if it has a hash value that does not change during its entire lifetime.
Python obtains that value from the object's __hash__() method, and the object also needs an __eq__() method so that it can be compared to other objects. Hashable objects that compare equal must have the same hash value. Most immutable built-in objects in Python are hashable, while mutable containers like lists and dictionaries are not. Immutable containers such as tuples and frozensets are hashable only if all of their elements are hashable.

Objects that are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().
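
The hashability rules above can be checked directly with the built-in hash() function. The following is a minimal sketch; the class name `Point` and the variable names are hypothetical and only serve the illustration.

```python
# Immutable built-ins are hashable, so they work as dictionary keys
# (and therefore as pandas index labels).
labels = {"a": 1, 2: "two", (3, 4): "tuple"}

# A list is mutable, so hash() raises TypeError.
try:
    hash([1, 2])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A tuple is hashable only if everything inside it is hashable.
try:
    hash((1, [2, 3]))
    nested_is_hashable = True
except TypeError:
    nested_is_hashable = False


# Instances of a plain user-defined class (hypothetical `Point`) are
# hashable by default and compare unequal to every other instance.
class Point:
    pass


p, q = Point(), Point()
instances_equal = (p == q)
hash_is_stable = (hash(p) == hash(p))

print(list_is_hashable, nested_is_hashable, instances_equal, hash_is_stable)
```

A label that fails this check, such as a list, cannot be used as an index label in a Series.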

### 4. Operations between Series (+, -, /, *, **) align values based on their associated index values; the Series need not be the same length. The result index will be the sorted union of the two indexes.
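
To make the alignment rule concrete, here is a small sketch that adds two Series of different lengths. The Series `a` and `b` and their labels are made up for this illustration.

```python
import pandas as pd

# Two Series of different lengths that share only the label "z"
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["z", "w"])

# Addition pairs values by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = a + b
print(total)
```

Only the label "z" exists in both operands, so it is the only position that holds a number; every other label in the union is filled with NaN.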

The series is the object of the pandas library designed to represent one-dimensional data
structures, similar to an array but with some additional features.

![image.png](attachment:image.png)
The structure of the series object

## Declaring a Series

In [2]:
import numpy as np
import pandas as pd

In [3]:
s = pd.Series([12,-4,7,9])
s

0    12
1    -4
2     7
3     9
dtype: int64

In [4]:
s = pd.Series(data=[1,3,5, 6,8], name="myseries")
s

0    1
1    3
2    5
3    6
4    8
Name: myseries, dtype: int64

In [6]:
s = pd.Series(data=[1,3,5, 6,8], name="myseries", dtype=np.float32)
s

0    1.0
1    3.0
2    5.0
3    6.0
4    8.0
Name: myseries, dtype: float32

As you can see from the output of the series, on the left there are the values in the
index, which is a series of labels, and on the right are the corresponding values.
If you do not specify any index during the definition of the series, by default,
pandas will assign numerical values increasing from 0 as labels. In this case, the labels
correspond to the indexes (position in the array) of the elements in the series object.
Often, however, it is preferable to create a series using meaningful labels in order to
distinguish and identify each item regardless of the order in which they were inserted
into the series.

In this case it will be necessary, during the constructor call, to include the index
option and assign an array of strings containing the labels.

In [7]:
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s

a    12
b    -4
c     7
d     9
dtype: int64

If you want to individually see the two arrays that make up this data structure, you
can call the two attributes of the series as follows: index and values.

In [8]:
s.values

array([12, -4,  7,  9], dtype=int64)

In [9]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

### Selecting the Internal Elements
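You can select individual elements as you would in an ordinary NumPy array, by position, or by specifying the label as the key.

In [8]:
s.iloc[2]  # plain s[2] also worked in older pandas, but positional access through [] on a labeled Series is deprecated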

7

In [10]:
s['c']

7

In [11]:
s[0:2]

a    12
b    -4
dtype: int64
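
The difference between selecting by position and selecting by label is easiest to see with the explicit indexers. This is a minimal sketch that rebuilds the same labeled series; the variable names on the left-hand side are chosen only for the illustration.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=["a", "b", "c", "d"])

by_position = s.iloc[2]       # third element, regardless of its label
by_label = s.loc["c"]         # element whose label is "c"
pos_slice = s.iloc[0:2]       # positions 0 and 1; the stop is excluded
label_slice = s.loc["a":"c"]  # labels "a" through "c"; the stop is included

print(by_position, by_label)
print(pos_slice)
print(label_slice)
```

Note that a label slice includes its end point, while a positional slice follows the usual Python convention and excludes it.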

### Assigning Values to the Elements
Now that you understand how to select individual elements, you also know how to
assign new values to them. You can select the element to overwrite by position or by label.

In [12]:
s.iloc[2] = 10
s

a    12
b    -4
c    10
d     9
dtype: int64

In [13]:
s['b'] = 1
s

a    12
b     1
c    10
d     9
dtype: int64

## Defining a Series from NumPy Arrays and Other Series

In [14]:
arr = np.array([1,2,3,4])
s3 = pd.Series(arr)
s3

0    1
1    2
2    3
3    4
dtype: int32

In [15]:
s4 = pd.Series(s)
s4

a    12
b     1
c    10
d     9
dtype: int64

## Filtering Values

In [16]:
s[s > 8]

a    12
c    10
d     9
dtype: int64

In [17]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [18]:
s=pd.Series(data=np.arange(10))
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

## Operations and Mathematical Functions

In [19]:
s / 2

0    0.0
1    0.5
2    1.0
3    1.5
4    2.0
5    2.5
6    3.0
7    3.5
8    4.0
9    4.5
dtype: float64

In [20]:
np.log(s)


0        -inf
1    0.000000
2    0.693147
3    1.098612
4    1.386294
5    1.609438
6    1.791759
7    1.945910
8    2.079442
9    2.197225
dtype: float64

## Evaluating Values
There are often duplicate values in a series. You may then need more
information about the samples, including whether duplicates exist and whether a
certain value is present in the series.
To explore this, you can declare a series in which there are many duplicate values.

In [21]:
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd                                     

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

To know all the values contained in the series, excluding duplicates, you can use
the unique() function. The return value is an array containing the unique values in the
series, in order of first appearance rather than sorted.

In [22]:
serd.unique()

array([1, 0, 2, 3], dtype=int64)

In [24]:
serd.value_counts()

1    2
2    2
0    1
3    1
dtype: int64

Finally, isin() evaluates membership: given a list of values, it tells you whether each
element of the data structure is contained in that list. The Boolean values that are
returned can be very useful when filtering data in a series or in a column of a dataframe.

In [25]:
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [26]:
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool
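
As a sketch of the filtering use mentioned above, the Boolean result of isin() can be passed back into the series as a mask. The names `mask`, `selected`, and `excluded` are chosen only for this illustration.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=["white", "white", "blue", "green", "green", "yellow"])

# Boolean mask: True where the value is 0 or 3
mask = serd.isin([0, 3])

selected = serd[mask]    # keep only the matching elements
excluded = serd[~mask]   # ~ inverts the mask and keeps everything else

print(selected)
print(excluded)
```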

## NaN Values
In the earlier np.log(s) example, taking the logarithm of 0 returned -inf. Had the series
contained a negative number, its logarithm would have been NaN. This specific value NaN (Not a Number) is used in pandas
data structures to indicate the presence of an empty field or something that is not
definable numerically.
Generally, these NaN values are a problem and must be managed in some way,
especially during data analysis. They are often generated when extracting data
from a questionable source or when the source is missing data. Furthermore,
NaN values can also be generated in special cases, such as calculations
of logarithms of negative values, or exceptions during execution of some calculation
or function. In later chapters, you will see how to apply different strategies to address the
problem of NaN values.
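
The following sketch shows how a calculation can produce NaN. The series `values` is made up for the illustration, and np.errstate is used only to silence the runtime warnings that NumPy would otherwise print.

```python
import numpy as np
import pandas as pd

values = pd.Series([-1.0, 0.0, 1.0])

# Suppress the "divide by zero" and "invalid value" warnings for this block
with np.errstate(divide="ignore", invalid="ignore"):
    logs = np.log(values)

print(logs)

# Only the logarithm of the negative number counts as missing;
# -inf is a valid floating-point value and is not treated as NA.
print(logs.isna())
```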

In [27]:
s2 = pd.Series([5,-3,np.nan,14])
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [28]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [29]:
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64

In [30]:
s2[s2.isnull()]

2   NaN
dtype: float64

In [31]:
ind1=["India","Pakistan","Bangladesh","Srilanka","China","Nepal"]
ind1

['India', 'Pakistan', 'Bangladesh', 'Srilanka', 'China', 'Nepal']

Many built-in Python types are hashable: strings, integers, floating-point numbers, and Boolean values can all be used as index labels. Mutable containers such as lists, sets, and dictionaries are not hashable. Tuples and frozensets are hashable only when every element they contain is hashable.

In [33]:
s2 = pd.Series(data=[1,3,5,np.nan,6,8],name="myseries")
s2

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: myseries, dtype: float64

In [29]:
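# Passing a Series as data aligns it to the new index; none of s2's integer labels match ind1, so every value is NaN
s3 = pd.Series(data=s2, index=ind1, name="myseries")
s3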

India        NaN
Pakistan     NaN
Bangladesh   NaN
Srilanka     NaN
China        NaN
Nepal        NaN
Name: myseries, dtype: float64

In [34]:
s2 = pd.Series(data=[1,3,5,np.nan,6,8], index=ind1, name="myseries")
s2

India         1.0
Pakistan      3.0
Bangladesh    5.0
Srilanka      NaN
China         6.0
Nepal         8.0
Name: myseries, dtype: float64

### pandas.DataFrame.loc

In [31]:
s2.loc['Bangladesh']

5.0

1. Access a group of rows and columns by label(s) or a boolean array. 
2. .loc[] is primarily label based, but may also be used with a boolean array.

In [35]:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],index=['cobra', 'viper', 'sidewinder'],columns=['max_speed', 'shield'])
df

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


In [36]:
df.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64

## pandas.DataFrame.at

1. Access a single value for a row/column label pair.

In [37]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
4,0,2,3
5,0,4,1
6,10,20,30


In [38]:
df.at[4,"B"]

2

## Selection by position
1. Pandas provides a suite of methods for purely integer-based indexing.
2. The semantics closely follow Python and NumPy slicing.
3. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded.
4. Trying to use a non-integer, even a valid label, will raise an IndexError.
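
The slicing rules in this list can be sketched with .iloc on a small DataFrame. The frame below reuses the integer-labeled data from the .at example; the names `block`, `last_row`, and `all_rows` are chosen only for the illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=["A", "B", "C"])

# Row positions 0-1 and column positions 0-1; both stops are excluded
block = df.iloc[0:2, 0:2]

# Negative positions count from the end, as in Python lists
last_row = df.iloc[-1]

# For contrast: a label slice with .loc includes the stop label
all_rows = df.loc[4:6]

print(block)
print(last_row)
print(all_rows)
```

Because the row labels here are the integers 4, 5, and 6, the example also shows that .iloc ignores the labels entirely and works only with positions.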

In [43]:
df[:2]

Unnamed: 0,A,B,C
4,0,2,3
5,0,4,1


In [44]:
df[1:2]

Unnamed: 0,A,B,C
5,0,4,1


In [32]:
s2.name

'myseries'

In [33]:
s = pd.Series(data=[1,3,5, 6,8], name="myseries")
s

0    1
1    3
2    5
3    6
4    8
Name: myseries, dtype: int64

In [33]:
s2.index

Index(['India', 'Pakistan', 'Bangladesh', 'Srilanka', 'China', 'Nepal'], dtype='object')

In [34]:
# Series.append was removed in pandas 2.0; pd.concat is the equivalent
pd.concat([s2, s])


India         1.0
Pakistan      3.0
Bangladesh    5.0
Srilanka      NaN
China         6.0
Nepal         8.0
0             1.0
1             3.0
2             5.0
3             6.0
4             8.0
Name: myseries, dtype: float64

In [34]:
d = {'a': 1, 'b': 2, 'c': 3}
ser = pd.Series(data=d, index=['a', 'b', 'c'])
ser

a    1
b    2
c    3
dtype: int64

In [35]:
d = {'a': 1, 'b': 2, 'c': 3}
ser = pd.Series(data=d)
ser

a    1
b    2
c    3
dtype: int64

In [36]:
r = np.array([1, 2])
ser = pd.Series(r, copy=False)
ser

0    1
1    2
dtype: int32

## pandas.DataFrame.iloc
1. Purely integer-location based indexing for selection by position. 
2. .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

In [46]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 100, 'b': 200, 'c': 300, 'd': 400}, {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
mydict

[{'a': 1, 'b': 2, 'c': 3, 'd': 4},
 {'a': 100, 'b': 200, 'c': 300, 'd': 400},
 {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]

In [47]:
df = pd.DataFrame(mydict)
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [48]:
# first row
df.iloc[0]

a    1
b    2
c    3
d    4
Name: 0, dtype: int64

In [49]:
df.iloc[[0]]

Unnamed: 0,a,b,c,d
0,1,2,3,4


In [51]:
df.iloc[[0, 2], [1, 3]]

Unnamed: 0,b,d
0,2,4
2,2000,4000


In [53]:
# r = np.array([1, 2])
# ser = pd.Series(r, copy=False)
# ser.iloc[0] = 999
# ser

## pandas.DataFrame.iat

1. Access a single value for a row/column pair by integer position.
2. Similar to iloc, in that both provide integer-based lookups.

In [54]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,0,2,3
1,0,4,1
2,10,20,30


3. Get value at specified row/column pair.

In [55]:
df.iat[1, 2]

1

4. Set value at specified row/column pair. 

In [56]:
df.iat[1, 2] = 10
df

Unnamed: 0,A,B,C
0,0,2,3
1,0,4,10
2,10,20,30


5. Get value within a series.

In [57]:
df.loc[0].iat[1]

2

### pandas.DataFrame.where 
#### ● Replace values where the condition is False.

In [58]:
s = pd.Series(range(5))
s

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [59]:
s.where(s > 0)

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
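
By default, where() fills the positions that fail the condition with NaN. As a short sketch, the `other` argument supplies a replacement value instead, and mask() applies the opposite logic. The variable names are chosen only for the illustration.

```python
import pandas as pd

s = pd.Series(range(5))

# Keep values where the condition is True, replace the rest with -1
replaced = s.where(s > 0, other=-1)

# mask() is the inverse: replace values where the condition is True
masked = s.mask(s > 0, other=-1)

print(replaced)
print(masked)
```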

## pandas.DataFrame.reindex

1. Conform Series/DataFrame to new index with optional filling logic. 
2. Places NA/NaN in locations having no value in the previous index. 
3. A new object is produced unless the new index is equivalent to the current one and copy=False.

In [61]:
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
index

['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']

In [62]:
df = pd.DataFrame({'http_status': [200, 200, 404, 404,301], 'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)
df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


In [63]:
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
new_index

['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']

In [64]:
df.reindex(new_index)

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02


## pandas.DataFrame.isna

● Detect missing values. 

● Return a boolean same-sized object indicating if the values are NA. 

● NA values, such as None or numpy.nan, get mapped to True values. 

● Everything else gets mapped to False values. 

Characters such as empty strings '' or numpy.inf are not considered NA values.

In [4]:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(age=[5, 6, np.nan],born=[pd.NaT, pd.Timestamp('1939-05-27'),pd.Timestamp('1940-04-25')],name=['Alfred', 'Batman', ''],toy=[None, 'Batmobile', 'Joker']))
df

Unnamed: 0,age,born,name,toy
0,5.0,NaT,Alfred,
1,6.0,1939-05-27,Batman,Batmobile
2,,1940-04-25,,Joker


In [5]:
df.isna()

Unnamed: 0,age,born,name,toy
0,False,True,False,True
1,False,False,False,False
2,True,False,False,False


### pandas.DataFrame.fillna

#### 1. Fill NA/NaN values using the specified method.

### 2. Example

In [6]:
df=pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5], [np.nan, 3, np.nan, 4]], columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


### 3. Replace all NaN elements with 0s.

In [10]:
# fillna returns a new DataFrame; the original df is not changed
df.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [11]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [12]:
df1=df.fillna(0)
df1

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [14]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


### 4. We can also propagate non-null values forward or backward.

In [17]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [15]:
df.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4


In [16]:
df.bfill()  # backward fill; fillna(method='bfill') is deprecated in recent pandas

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4
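
Two further variations are sketched below on the same data: a different fill value per column, passed as a dict, and a forward fill that stops after one consecutive missing entry. The names `per_column` and `limited` are chosen only for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# A dict maps each column name to its own replacement value
per_column = df.fillna(value={"A": 0, "B": 1, "C": 2, "D": 3})

# limit=1 propagates the last valid value forward by at most one row
limited = df.ffill(limit=1)

print(per_column)
print(limited)
```

In column A, the forward fill reaches row 2 but not row 3, because the limit restricts how many consecutive NaN values are filled.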


## pandas.DataFrame.isnull 
### 1. Detect missing values.
### 2. Alias of isna.

In [19]:
df = pd.DataFrame(dict(age=[5, 6, np.nan], born=[pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')], name=['Alfred', 'Batman', ''], toy=[None, 'Batmobile', 'Joker']))
df

Unnamed: 0,age,born,name,toy
0,5.0,NaT,Alfred,
1,6.0,1939-05-27,Batman,Batmobile
2,,1940-04-25,,Joker


In [20]:
df.isnull()

Unnamed: 0,age,born,name,toy
0,False,True,False,True
1,False,False,False,False
2,True,False,False,False


## Series as Dictionaries

An alternative way to think of a series is as a dict (dictionary) object. This
similarity is also exploited when defining a series: you can create
a series from a previously defined dict.

In [38]:
import pandas as pd
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500,'orange': 1000}
myseries = pd.Series(mydict)
myseries

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

As you can see from this example, the index is filled with the keys of the dict while
the data are filled with the corresponding values. You can also define the index labels
separately. In this case, pandas matches the keys of the dict against the
labels in the index array. Where a label has no matching key, pandas inserts NaN.

In [39]:
colors = ['red','yellow','orange','blue','green']
myseries = pd.Series(mydict, index=colors)
myseries

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

## The DataFrame

The dataframe is a tabular data structure very similar to a spreadsheet. This data
structure is designed to extend series to multiple dimensions. In fact, the dataframe
consists of an ordered collection of columns (see Figure 4-2), each of which can contain
values of a different type (numeric, string, Boolean, etc.).

![image.png](attachment:image.png)
The dataframe structure

## Defining a Dataframe

The most common way to create a new dataframe is precisely to pass a dict object to the
DataFrame() constructor. This dict object contains a key for each column that you want
to define, with an array of values for each of them.

## Object Creation - DataFrame

In [41]:
data = {'color' : ['blue','green','yellow','red','white'],
'object' : ['ball','pen','pencil','paper','mug'],
'price' : [1.2,1.0,0.6,0.9,1.7]}

In [42]:
frame = pd.DataFrame(data)
frame

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


If the dict object from which you want to create a dataframe contains more data
than you are interested in, you can make a selection. In the constructor of the dataframe,
you can specify a sequence of columns using the columns option. The columns will be
created in the order of the sequence regardless of how they are contained in the dict
object.

In [43]:
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2

Unnamed: 0,object,price
0,ball,1.2
1,pen,1.0
2,pencil,0.6
3,paper,0.9
4,mug,1.7


In [45]:
import numpy as np
df=np.arange(16)
df

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [46]:
df1=df.reshape(4,4)
df1

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [47]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
            index=['red','blue','yellow','white'],
            columns=['ball','pen','pencil','paper'])
frame3

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


## Selecting Elements

If you want to know the name of all the columns of a dataframe, you can specify the
columns attribute on the instance of the dataframe object.

In [48]:
frame3.columns

Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')

In [49]:
frame.index

RangeIndex(start=0, stop=5, step=1)

In [50]:
frame['color']

0      blue
1     green
2    yellow
3       red
4     white
Name: color, dtype: object

In [51]:
frame['price']

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

In [52]:
frame3.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [53]:
frame3.columns

Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')

### For rows within a dataframe, it is possible to use the loc attribute with the index value of the row that you want to extract.

In [54]:
frame

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


In [55]:
frame.loc[2]

color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

To select multiple rows, you specify an array with the sequence of rows to extract:

In [56]:
frame.loc[[2,4]]

Unnamed: 0,color,object,price
2,yellow,pencil,0.6
4,white,mug,1.7


In [57]:
frame[1:3]

Unnamed: 0,color,object,price
1,green,pen,1.0
2,yellow,pencil,0.6


In [58]:
frame3[1:3]

Unnamed: 0,ball,pen,pencil,paper
blue,4,5,6,7
yellow,8,9,10,11


## 5. Parameters of the Series constructor:
    data : array-like, dict, or scalar value: Contains data stored in Series   
            
    index : array-like or Index (1d): Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex(len(data)) if not provided. If both a dict and index sequence are used, the index will override the keys found in the dict.  
    
    dtype : numpy.dtype or None: If None, dtype will be inferred
            
    copy : boolean, default False: Copy input data

## Concatenating DataFrames 

The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

In [59]:
df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                    'Name': ['ABC', 'PQR', 'DEF', 'GHI']})
  
# Second DataFrame
df2 = pd.DataFrame({'id': ['B05', 'B06', 'B07', 'B08'],
                    'Name': ['XYZ', 'TUV', 'MNO', 'JKL']})
frames = [df1, df2]
  
result = pd.concat(frames) # or
#result=pd.concat([df1, df2])
display(result)

Unnamed: 0,id,Name
0,A01,ABC
1,A02,PQR
2,A03,DEF
3,A04,GHI
0,B05,XYZ
1,B06,TUV
2,B07,MNO
3,B08,JKL
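
In the result above, the row labels 0 to 3 appear twice because concat() keeps each frame's original index. As a minimal sketch, ignore_index=True renumbers the rows, while keys adds an outer index level that records which frame each row came from. The key names "first" and "second" are made up for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A01", "A02", "A03", "A04"],
                    "Name": ["ABC", "PQR", "DEF", "GHI"]})
df2 = pd.DataFrame({"id": ["B05", "B06", "B07", "B08"],
                    "Name": ["XYZ", "TUV", "MNO", "JKL"]})

# Discard the original row labels and number the rows 0..7
stacked = pd.concat([df1, df2], ignore_index=True)

# Keep the original labels but tag each block with an outer key
keyed = pd.concat([df1, df2], keys=["first", "second"])

print(stacked)
print(keyed.loc[("second", 0), "id"])
```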


## Joining DataFrame

### The INNER JOIN keyword selects records that have matching values in both tables.

![image.png](attachment:image.png)

In [60]:
df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                    'Name': ['ABC', 'PQR', 'DEF', 'GHI']})
  
df3 = pd.DataFrame({'City': ['MUMBAI', 'PUNE', 'MUMBAI', 'DELHI'],
                    'Age': ['12', '13', '14', '12']})
  
# the default behaviour is join='outer'
# inner join
  
result = pd.concat([df1, df3], axis=1, join='inner')
display(result)

Unnamed: 0,id,Name,City,Age
0,A01,ABC,MUMBAI,12
1,A02,PQR,PUNE,13
2,A03,DEF,MUMBAI,14
3,A04,GHI,DELHI,12
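
The concat() call above matches rows by their index labels. When the matching values live in a column, as in a SQL INNER JOIN, pd.merge() is the usual tool. The sketch below uses a hypothetical `cities` table that shares the `id` column with the first frame; its contents are made up for the illustration.

```python
import pandas as pd

people = pd.DataFrame({"id": ["A01", "A02", "A03", "A04"],
                       "Name": ["ABC", "PQR", "DEF", "GHI"]})

# Hypothetical second table: only A02 and A04 also appear in `people`
cities = pd.DataFrame({"id": ["A02", "A04", "A05"],
                       "City": ["PUNE", "DELHI", "MUMBAI"]})

# Inner join: keep only ids present in both tables
inner = pd.merge(people, cities, on="id", how="inner")

# Outer join: keep every id from either table, filling gaps with NaN
outer = pd.merge(people, cities, on="id", how="outer")

print(inner)
print(outer)
```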


Q8. Create a series with the values [7,2,4,9,5,6], using a series of dates starting from '20210102' as its index, and add it as a new column named newCol. The function should return a DataFrame object with 6 rows and 5 columns.

In [59]:
dates = pd.date_range('20130101', periods=6) 
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [61]:
s3 = pd.date_range('20130101', periods=6) 
s3

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [29]:
s2 = pd.Series(data=[7,2,4,9,5,6], index=s3)
s2

2013-01-01    7
2013-01-02    2
2013-01-03    4
2013-01-04    9
2013-01-05    5
2013-01-06    6
Freq: D, dtype: int64

In [60]:
df = pd.DataFrame(data=np.random.randn(6,4), index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.060691,1.825019,0.386913,0.333096
2013-01-02,0.586965,1.734606,1.18274,0.205376
2013-01-03,-0.212064,-0.062599,-0.065482,0.328943
2013-01-04,-1.983105,-0.436435,-1.537473,-1.017995
2013-01-05,-1.360011,-0.624002,-0.011057,1.595122
2013-01-06,-0.160928,-0.700138,-0.120448,-0.498701


In [31]:
result = pd.concat([df, s2], axis=1, join='inner')
display(result)

Unnamed: 0,A,B,C,D,0
2013-01-01,-1.557744,0.526716,-0.562259,0.401993,7
2013-01-02,-1.557521,-0.532612,-0.187273,0.621357,2
2013-01-03,-1.98947,-0.464998,-0.118814,2.209643,4
2013-01-04,-0.802186,0.803187,-1.464163,0.14303,9
2013-01-05,1.737718,-0.086446,-2.472608,-0.279948,5
2013-01-06,1.12494,-1.046374,0.400629,0.613087,6


In [32]:
result.columns =["A","B","C","D","newCol"]

In [33]:
result

Unnamed: 0,A,B,C,D,newCol
2013-01-01,-1.557744,0.526716,-0.562259,0.401993,7
2013-01-02,-1.557521,-0.532612,-0.187273,0.621357,2
2013-01-03,-1.98947,-0.464998,-0.118814,2.209643,4
2013-01-04,-0.802186,0.803187,-1.464163,0.14303,9
2013-01-05,1.737718,-0.086446,-2.472608,-0.279948,5
2013-01-06,1.12494,-1.046374,0.400629,0.613087,6
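
One possible way to package this exercise as a function is sketched below. The function name `add_new_col` and its `start` parameter are hypothetical; the sketch assigns the series as a column directly, which aligns on the shared date index and avoids the separate renaming step.

```python
import numpy as np
import pandas as pd


def add_new_col(start="20210102"):
    """Return a 6x5 DataFrame: four random columns plus newCol."""
    dates = pd.date_range(start, periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
    new_col = pd.Series([7, 2, 4, 9, 5, 6], index=dates)
    # Column assignment aligns the series with the frame on the date index
    df["newCol"] = new_col
    return df


result = add_new_col()
print(result.shape)
print(result)
```

The random columns differ on every run, so only the shape and the newCol values are reproducible.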


In [62]:
dict1={'a':"john",'b':[0,1,1], "c":["foo","bar","bar"]}
df01=pd.DataFrame(dict1)
df01

Unnamed: 0,a,b,c
0,john,0,foo
1,john,1,bar
2,john,1,bar


In [63]:
df2 = pd.DataFrame(data={'A' : 1, 'B' : pd.Timestamp('20130102'), 'C' :pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1.0,3,test,foo
1,1,2013-01-02,1.0,3,train,foo
2,1,2013-01-02,1.0,3,test,foo
3,1,2013-01-02,1.0,3,train,foo


In [64]:
df2.dtypes

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## Viewing Data

pandas.DataFrame.head 
1. syntax: DataFrame.head(n) 
2. Return the first n rows. 
3. For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

In [9]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df = pd.DataFrame(data=np.random.randn(6,4), index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.080744,-0.30616,-2.572248,0.062044
2013-01-02,0.192475,1.438754,-0.124447,0.320705
2013-01-03,1.755335,-0.135448,-0.276858,2.958892
2013-01-04,0.053603,-1.271311,-0.523835,1.262606
2013-01-05,-0.333965,1.205723,2.264021,-1.519391
2013-01-06,0.737857,0.149769,-1.727581,-1.141776


In [42]:
#df2=pd.DataFrame(data={'A': 1,'B': pd.Timestamp('20130102')})

In [65]:
df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion','monkey', 'parrot', 'shark', 'whale', 'zebra']})
df

Unnamed: 0,animal
0,alligator
1,bee
2,falcon
3,lion
4,monkey
5,parrot
6,shark
7,whale
8,zebra


In [25]:
df.head()

Unnamed: 0,animal
0,alligator
1,bee
2,falcon
3,lion
4,monkey


In [66]:
df.tail()


Unnamed: 0,animal
4,monkey
5,parrot
6,shark
7,whale
8,zebra


In [27]:
df.head(3)

Unnamed: 0,animal
0,alligator
1,bee
2,falcon


In [68]:
df.head(-3)

Unnamed: 0,animal
0,alligator
1,bee
2,falcon
3,lion
4,monkey
5,parrot


In [69]:
df.tail(-3)

Unnamed: 0,animal
3,lion
4,monkey
5,parrot
6,shark
7,whale
8,zebra
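
The behaviour for negative n can be checked against ordinary slicing. This short sketch rebuilds the same animal frame and compares the results with equals(); the Boolean variable names are chosen only for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion", "monkey",
                              "parrot", "shark", "whale", "zebra"]})

# head(-3): every row except the last three
head_matches_slice = df.head(-3).equals(df[:-3])

# tail(-3): every row except the first three
tail_matches_slice = df.tail(-3).equals(df[3:])

print(head_matches_slice, tail_matches_slice)
```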


## Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [10]:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=6) 
df = pd.DataFrame(data=np.random.randn(6,4), index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.310217,-0.869673,1.593967,-0.304008
2013-01-02,0.475151,-0.683676,1.930054,-0.646594
2013-01-03,0.222106,-0.916923,1.173079,1.374699
2013-01-04,-0.682485,0.522869,2.087065,1.141443
2013-01-05,1.834533,0.273641,-0.40878,0.104467
2013-01-06,1.87456,0.332887,0.997788,-0.637607


In [9]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.723982,0.052805,-0.222078,-0.002251
std,0.865122,0.556041,0.755623,0.896292
min,-2.201782,-0.717543,-1.376603,-0.900283
25%,-1.028214,-0.148781,-0.519224,-0.599103
50%,-0.577492,-0.054895,-0.22104,-0.309955
75%,-0.067806,0.287303,0.12624,0.566459
max,0.089238,0.922872,0.862833,1.343575


In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [11]:
df.values

array([[-0.91514353, -0.06436933,  0.86283307,  1.34357505],
       [-1.06590467, -0.17691767, -1.37660321, -0.32184851],
       [ 0.08923833, -0.71754296, -0.4147823 ,  0.85463176],
       [-0.01046136, -0.04542037, -0.02729745, -0.9002827 ],
       [-0.23984112,  0.39821014,  0.1774189 , -0.69152057],
       [-2.20178228,  0.92287156, -0.55403818, -0.29806119]])

In [39]:
df.count()

A    6
B    6
C    6
D    6
dtype: int64

In [12]:
df['C']

2013-01-01    0.862833
2013-01-02   -1.376603
2013-01-03   -0.414782
2013-01-04   -0.027297
2013-01-05    0.177419
2013-01-06   -0.554038
Freq: D, Name: C, dtype: float64

In [13]:
# sort_values must be called, and it needs the column to sort by (column C is chosen here as an example)
df.sort_values(by='C')

Unnamed: 0,A,B,C,D
2013-01-02,-1.065905,-0.176918,-1.376603,-0.321849
2013-01-06,-2.201782,0.922872,-0.554038,-0.298061
2013-01-03,0.089238,-0.717543,-0.414782,0.854632
2013-01-04,-0.010461,-0.04542,-0.027297,-0.900283
2013-01-05,-0.239841,0.39821,0.177419,-0.691521
2013-01-01,-0.915144,-0.064369,0.862833,1.343575

## Creating a dataframe using a dictionary

In [14]:
dict1={'a':"john",'b':[0,1,1], "c":["foo","bar","bar"]}
df01=pd.DataFrame(dict1)
df01

Unnamed: 0,a,b,c
0,john,0,foo
1,john,1,bar
2,john,1,bar


In [15]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 100, 'b': 200, 'c': 300, 'd': 400}, {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df = pd.DataFrame(mydict)
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [16]:
df.iloc[0]

a    1
b    2
c    3
d    4
Name: 0, dtype: int64

In [17]:
df.iloc[[0]]

Unnamed: 0,a,b,c,d
0,1,2,3,4


In [18]:
df.iloc[[0 ,2]]

Unnamed: 0,a,b,c,d
0,1,2,3,4
2,1000,2000,3000,4000


In [20]:
df.iloc[[0 ,2], [1, 3]]

Unnamed: 0,b,d
0,2,4
2,2000,4000


Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [28]:
dates = pd.date_range('20130101', periods=6) 
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [11]:
df2 = pd.DataFrame(data={'A' : 1, 
                         'B' : pd.Timestamp('20130102'), 
                         'C' :pd.Series(1,index=list(range(4)),dtype='float32'),
                         'D' : np.array([3] * 4,dtype='int32'), 
                         'E' : pd.Categorical(["test","train","test","train"]), 
                         'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1.0,3,test,foo
1,1,2013-01-02,1.0,3,train,foo
2,1,2013-01-02,1.0,3,test,foo
3,1,2013-01-02,1.0,3,train,foo


In [12]:
df2.dtypes

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## Viewing Data

pandas.DataFrame.head 
1. syntax: DataFrame.head(n) 
2. Return the first n rows. 
3. For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

### Example:

In [33]:
df2.head()

Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1.0,3,test,foo
1,1,2013-01-02,1.0,3,train,foo
2,1,2013-01-02,1.0,3,test,foo
3,1,2013-01-02,1.0,3,train,foo


In [1]:
import pandas as pd
df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 
                'lion','monkey', 'parrot', 'shark', 'whale', 'zebra']})

### Viewing the first 5 lines

In [3]:
df.head()

Unnamed: 0,animal
0,alligator
1,bee
2,falcon
3,lion
4,monkey


### Viewing the last 5 lines

In [6]:
df.tail()

Unnamed: 0,animal
4,monkey
5,parrot
6,shark
7,whale
8,zebra


### Viewing the first n lines (three in this case):

In [7]:
df.head(3)

Unnamed: 0,animal
0,alligator
1,bee
2,falcon


## For negative values of n:

In [13]:
df2.head(-3)

Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1.0,3,test,foo


In [None]:
df2.tail(-3)

In [58]:
df2.values

array([[1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

## pandas.DataFrame.index

● Syntax: DataFrame.index   
● The index (row labels) of the DataFrame.  
● For example:

In [71]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [72]:
df = pd.DataFrame(data=np.random.randn(6,4), index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.882727,-0.941787,-0.889685,-1.144192
2013-01-02,-0.850495,-0.382628,0.699375,-0.624573
2013-01-03,-0.052372,0.275284,1.828779,-0.745642
2013-01-04,-0.514158,1.038022,1.121443,0.138829
2013-01-05,0.318209,-0.647672,0.416151,-0.358944
2013-01-06,0.637625,1.90134,1.195001,-0.181094
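
The frame above is built with a date index, but the index attribute itself is never read back. The sketch below does that; it uses zeros instead of random numbers so the result is reproducible, and the names `frame`, `row_labels`, `first_label` and `n_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# zeros instead of random numbers so the example is reproducible
frame = pd.DataFrame(np.zeros((6, 4)), index=dates, columns=list('ABCD'))

row_labels = frame.index        # the DatetimeIndex passed at creation
first_label = frame.index[0]    # a single row label, here a Timestamp
n_rows = len(frame.index)       # number of rows

print(row_labels)
print(first_label, n_rows)
```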


## pandas.DataFrame.columns

● Syntax: DataFrame.columns    
● The column labels of the DataFrame.    
● For example:  

In [73]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [74]:
df.values

array([[-0.88272676, -0.9417871 , -0.88968546, -1.14419199],
       [-0.85049499, -0.38262758,  0.69937487, -0.62457319],
       [-0.05237229,  0.27528425,  1.82877872, -0.74564177],
       [-0.5141583 ,  1.03802202,  1.12144272,  0.13882945],
       [ 0.31820914, -0.64767197,  0.4161505 , -0.3589436 ],
       [ 0.63762478,  1.9013398 ,  1.19500119, -0.1810945 ]])

## pandas.DataFrame.describe

In [75]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.223986,0.207093,0.72851,-0.485936
std,0.628721,1.092143,0.927316,0.451521
min,-0.882727,-0.941787,-0.889685,-1.144192
25%,-0.766411,-0.581411,0.486957,-0.715375
50%,-0.283265,-0.053672,0.910409,-0.491758
75%,0.225564,0.847338,1.176612,-0.225557
max,0.637625,1.90134,1.828779,0.138829


## pandas.DataFrame.info

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2013-01-01 to 2013-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes


## pandas.DataFrame.T  
1. Syntax: DataFrame.T   
2. Transposes the data.   
3. The transpose of a matrix interchanges its rows and columns. 

In [77]:
df1=df.T
df1

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-0.882727,-0.850495,-0.052372,-0.514158,0.318209,0.637625
B,-0.941787,-0.382628,0.275284,1.038022,-0.647672,1.90134
C,-0.889685,0.699375,1.828779,1.121443,0.416151,1.195001
D,-1.144192,-0.624573,-0.745642,0.138829,-0.358944,-0.181094


In [44]:
df1.T

Unnamed: 0,A,B,C,D
2013-01-01,-0.882727,-0.941787,-0.889685,-1.144192
2013-01-02,-0.850495,-0.382628,0.699375,-0.624573
2013-01-03,-0.052372,0.275284,1.828779,-0.745642
2013-01-04,-0.514158,1.038022,1.121443,0.138829
2013-01-05,0.318209,-0.647672,0.416151,-0.358944
2013-01-06,0.637625,1.90134,1.195001,-0.181094


## Sorting and Ranking

## pandas.DataFrame.sort_index

Another fundamental operation that uses indexing is sorting. Sorting data is often
necessary, so it is important to be able to do it easily. pandas provides the
sort_index() function, which returns a new object with the same contents as the
original, but with the elements ordered by their index labels.

1. Syntax: DataFrame.sort_index(axis=0, level=None, ascending=True, 
    inplace=False, kind='quicksort', 
    na_position='last', sort_remaining=True, ignore_index=False, key=None)

2. Parameters:
    ○ axis: {0 or 'index', 1 or 'columns'}, default 0. The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
    ○ level: int or level name or list of ints or list of level names. If not None, sort on values in the specified index level(s).
    ○ ascending: bool or list of bools, default True. Sort ascending vs. descending. When the index is a MultiIndex, the sort direction can be controlled for each level individually.
    ○ inplace: bool, default False. If True, perform the operation in place.
    ○ kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'. Choice of sorting algorithm; see numpy.sort for more information. Of these, mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
    ○ na_position: {'first', 'last'}, default 'last'. 'first' puts NaNs at the beginning; 'last' puts NaNs at the end. Not implemented for MultiIndex.
    ○ sort_remaining: bool, default True. If True and sorting by level and the index is multilevel, sort by the other levels too (in order) after sorting by the specified level.
    ○ ignore_index: bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.

3. Sort objects by labels (along an axis). 
4. Returns a new DataFrame sorted by label if the inplace argument is False; otherwise updates the original DataFrame and returns None.
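
A few of the parameters listed above can be tried on a small Series. This is a sketch under assumptions: the mixed-case labels and the variable names are hypothetical, and the key argument requires pandas 1.1 or later.

```python
import pandas as pd

# hypothetical labels with mixed case to show how `key` changes the order
s = pd.Series([1, 2, 3, 4], index=['b', 'A', 'c', 'D'])

# default: plain string comparison, so uppercase labels come first
default_order = s.sort_index()

# key receives the Index and must return an Index-like of sort keys
case_insensitive = s.sort_index(key=lambda idx: idx.str.lower())

# descending order, with the sorted labels replaced by 0..n-1
renumbered = s.sort_index(ascending=False, ignore_index=True)

print(default_order.index.tolist())
print(case_insensitive.index.tolist())
print(renumbered.tolist())
```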

## 5. Example:

In [78]:
ser = pd.Series([5,0,3,8,4],
index=['red','blue','yellow','white','green'])
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [79]:
ser.sort_index()

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

In [47]:
ser.sort_index(ascending=False)

yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

With a DataFrame, sorting can be performed independently on each of its two
axes. To order the rows by their index labels, call sort_index() without
arguments as before; to order the columns by their labels, set the axis
option to 1.

In [80]:
import numpy as np
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [81]:
frame.sort_index()

Unnamed: 0,ball,pen,pencil,paper
blue,4,5,6,7
red,0,1,2,3
white,12,13,14,15
yellow,8,9,10,11


In [52]:
frame.sort_index(axis=1)

Unnamed: 0,ball,paper,pen,pencil
red,0,3,1,2
blue,4,7,5,6
yellow,8,11,9,10
white,12,15,13,14


In [64]:
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [65]:
ser.sort_values()

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

In [82]:
frame.sort_values(by='pen')

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [83]:
frame.sort_values(by=['pen','pencil'])

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


### Ranking
Ranking is an operation closely related to sorting. It assigns a rank (a value
that starts at 1 and increases with the sort position) to each element of the
series. By default, ranks are assigned from the lowest value to the highest.

In [84]:
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [85]:
ser.rank()

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

When values are tied, the default is to give each of them the average of the ranks
they span. To break ties by the order in which the values appear in the data
structure instead, pass method='first'.

In [15]:
ser.rank(method='first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [86]:
ser.rank(ascending=False)

red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64
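
The ser example has no repeated values, so the default ranking and method='first' give the same result. The sketch below uses a hypothetical `scores` Series with one tie to show where the methods differ.

```python
import pandas as pd

# hypothetical data with a tie: the value 7 appears twice
scores = pd.Series([7, 3, 7, 1], index=list('abcd'))

avg = scores.rank()                  # tied values share the average rank (3.5)
first = scores.rank(method='first')  # ties broken by order of appearance
lowest = scores.rank(method='min')   # tied values all get the lowest rank

print(avg.tolist())
print(first.tolist())
print(lowest.tolist())
```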

In [72]:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4 , 5], index=[100, 29, 234, 1, 150], columns=['A'])
df

Unnamed: 0,A
100,1
29,2
234,3
1,4
150,5


In [73]:
df.sort_index()

Unnamed: 0,A
1,4
29,2
100,1
150,5
234,3


In [74]:
df.sort_index(ascending=False)

Unnamed: 0,A
234,3
150,5
100,1
29,2
1,4


## pandas.DataFrame.sort_values

In [76]:
import numpy as np
df = pd.DataFrame({ 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'], 
                   'col2': [2, 1, 9, 8, 7, 4],
                   'col3': [0, 1, 9, 4, 2, 3], 
                   'col4': ['a', 'B', 'c', 'D', 'e', 'F']})

In [77]:
df

Unnamed: 0,col1,col2,col3,col4
0,A,2,0,a
1,A,1,1,B
2,B,9,9,c
3,,8,4,D
4,D,7,2,e
5,C,4,3,F


In [78]:
df['col2']

0    2
1    1
2    9
3    8
4    7
5    4
Name: col2, dtype: int64

In [79]:
df.sort_values(by=['col1'])

Unnamed: 0,col1,col2,col3,col4
0,A,2,0,a
1,A,1,1,B
2,B,9,9,c
5,C,4,3,F
4,D,7,2,e
3,,8,4,D


In [80]:
df.sort_values(by=['col1', 'col2'])

Unnamed: 0,col1,col2,col3,col4
1,A,1,1,B
0,A,2,0,a
2,B,9,9,c
5,C,4,3,F
4,D,7,2,e
3,,8,4,D
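
sort_values also accepts na_position and a per-column ascending list. The sketch below reuses col1 and col2 from the frame above; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
                   'col2': [2, 1, 9, 8, 7, 4]})

# put the row with the missing col1 value at the top instead of the bottom
nan_first = df.sort_values(by='col1', na_position='first')

# col1 ascending, and within equal col1 values, col2 descending
mixed = df.sort_values(by=['col1', 'col2'], ascending=[True, False])

print(nan_first)
print(mixed)
```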


## Selection

1. While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, the optimized pandas data access methods .at, .iat, .loc and .iloc are recommended. (The older .ix indexer was removed in pandas 1.0.) 
2. Selecting a single column yields a Series: df['A'], or equivalently df.A. 
3. Selecting via [ ] with a slice selects rows.

In [81]:
df[:3]

Unnamed: 0,col1,col2,col3,col4
0,A,2,0,a
1,A,1,1,B
2,B,9,9,c


In [82]:
df['col1']

0      A
1      A
2      B
3    NaN
4      D
5      C
Name: col1, dtype: object

## Selecting by Label

## pandas.DataFrame.loc

1. Access a group of rows and columns by label(s) or a boolean array. 
2. .loc[] is primarily label based, but may also be used with a boolean array. 
3. Example:

In [87]:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

In [88]:
df

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


In [89]:
df.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64
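
Beyond a single row label, .loc accepts a row/column pair, a list of labels, a label slice and a boolean mask. The sketch below rebuilds the frame above; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

cell = df.loc['cobra', 'shield']                      # single value by row and column label
rows = df.loc[['viper', 'sidewinder']]                # list of labels returns a DataFrame
label_slice = df.loc['cobra':'viper', 'max_speed']    # label slices include both endpoints
fast = df.loc[df['max_speed'] > 3]                    # boolean mask selects matching rows

print(cell)
print(label_slice)
print(fast)
```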

## pandas.DataFrame.at

1. Access a single value for a row/column label pair.  
2. Example:

In [87]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], 
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
4,0,2,3
5,0,4,1
6,10,20,30


In [88]:
df.at[4, 'B']

2

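## df.ix

The .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0. Use .loc for label-based selection and .iloc for position-based selection instead.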

## Selection by position

1. pandas provides a suite of methods for purely integer-based indexing.
2. The semantics closely follow Python and NumPy slicing. 
3. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. 
4. Trying to use a non-integer, even a valid label, will raise an IndexError.
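
The difference between position-based and label-based slicing is easiest to see when the integer labels do not start at 0. This is a minimal sketch with a hypothetical Series; it also shows that a position past the end raises IndexError.

```python
import pandas as pd

# hypothetical Series whose integer labels do not start at 0
s = pd.Series([10, 20, 30, 40], index=[4, 5, 6, 7])

by_position = s.iloc[1:3]   # positions 1 and 2; the upper bound is excluded
by_label = s.loc[5:7]       # labels 5 through 7; both endpoints are included

try:
    s.iloc[10]              # there is no position 10
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(by_position.tolist(), by_label.tolist(), out_of_bounds)
```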

## pandas.DataFrame.iloc

1. Purely integer-location based indexing for selection by position.  
2. .iloc is primarily integer position based (from 0 to length-1 of the axis),  
   but may also be used with a boolean array.  
3. Allowed inputs are: 
    ○ An integer, e.g. 5.  
    ○ A list or array of integers, e.g. [4, 3, 0].  
    ○ A slice object with ints, e.g. 1:7.  
    ○ A boolean array.  
    ○ A callable function with one argument (the calling Series or DataFrame)   and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have reference to the calling object, but would like to base your selection on some value.

In [90]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, 
          {'a': 100, 'b': 200, 'c': 300, 'd': 400}, 
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df = pd.DataFrame(mydict)
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [91]:
df.iloc[0]

a    1
b    2
c    3
d    4
Name: 0, dtype: int64

In [92]:
df.iloc[[0]]

Unnamed: 0,a,b,c,d
0,1,2,3,4


4. Indexing both axes

In [93]:
df.iloc[[0, 2], [1, 3]]

Unnamed: 0,b,d
0,2,4
2,2000,4000
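
The remaining input types from the list of allowed inputs (slices, a boolean array and a callable) can be sketched on the same data. The result names are illustrative.

```python
import pandas as pd

mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)

sliced = df.iloc[1:3, 0:2]                        # rows 1-2, columns 0-1 (upper bounds excluded)
masked = df.iloc[[True, False, True]]             # boolean list, one entry per row
even_rows = df.iloc[lambda x: x.index % 2 == 0]   # callable that returns a boolean array

print(sliced)
print(masked)
print(even_rows)
```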


## pandas.DataFrame.iat

1. Access a single value for a row/column pair by integer position. 
2. Similar to iloc, in that both provide integer-based lookups.

In [94]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,0,2,3
1,0,4,1
2,10,20,30


3. Get value at specified row/column pair.

In [95]:
df.iat[1, 2]

1

In [122]:
df

Unnamed: 0,A,B,C
0,0,2,3
1,0,4,1
2,10,20,30


4. Set value at specified row/column pair.

In [96]:
df.iat[1, 2] = 100
df

Unnamed: 0,A,B,C
0,0,2,3
1,0,4,100
2,10,20,30


5. Get value within a series.

In [125]:
df.loc[0].iat[1]

2

## pandas.DataFrame.where

● Replace values where the condition is False.  
● Example:

In [2]:
import pandas as pd
s = pd.Series(range(5))
s

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [3]:
s.where(s > 0)

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
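
where() takes an optional second argument that replaces NaN as the fill value, and mask() is its inverse. A minimal sketch, with illustrative result names:

```python
import pandas as pd

s = pd.Series(range(5))

# keep values where the condition holds, use 0 elsewhere
clipped = s.where(s > 1, 0)

# mask() is the inverse: replace values where the condition is True
masked = s.mask(s > 1)

print(clipped.tolist())
print(masked.tolist())
```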

## Specifying an index at creation

In [4]:
# explicitly create an index
labels = ['Mike', 'Marcia', 'Mikael', 'Bleu']
role = ['Dad', 'Mom', 'Son', 'Dog']
s = pd.Series(labels, index=role)
s

Dad      Mike
Mom    Marcia
Son    Mikael
Dog      Bleu
dtype: object

In [5]:
s.index

Index(['Dad', 'Mom', 'Son', 'Dog'], dtype='object')

In [6]:
s['Dad']

'Mike'

## Lookup by label using the [] operator

In [14]:
# we will use this series to examine lookups
s1 = pd.Series(np.arange(10, 15), index=list('abcde'))
s1

a    10
b    11
c    12
d    13
e    14
dtype: int32

In [15]:
s1['a']

10

In [17]:
# get multiple items
s1[['d', 'b']]

d    13
b    11
dtype: int32

In [18]:
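# gets values based upon position (this positional fallback of []
# is deprecated in pandas 2.1+; s1.iloc[[3, 1]] is the explicit form)
s1[[3, 1]]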

d    13
b    11
dtype: int32

In [24]:
# a sequence of 5 values, all 2
pd.Series([2]*5)

0    2
1    2
2    2
3    2
4    2
dtype: int64

In [25]:
# use each character as a value
pd.Series(list('abcde'))

0    a
1    b
2    c
3    d
4    e
dtype: object

In [26]:
# create Series from dict
pd.Series({'Mike': 'Dad', 
           'Marcia': 'Mom', 
           'Mikael': 'Son', 
           'Bleu': 'Best doggie ever' })

Mike                   Dad
Marcia                 Mom
Mikael                 Son
Bleu      Best doggie ever
dtype: object

## Explicit position lookup with .iloc[]

## For integer index

In [28]:
# to demo lookup by matching labels as integer values
s2 = pd.Series([1, 2, 3, 4], index=[10, 11, 12, 13])
s2

10    1
11    2
12    3
13    4
dtype: int64

In [29]:
# explicitly  by position
s1.iloc[[0, 2]]

a    10
c    12
dtype: int32

In [30]:
# explicitly  by position
s2.iloc[[3, 2]]

13    4
12    3
dtype: int64

## Explicit label lookup with .loc[]

In [33]:
# explicit via labels
s1.loc[['a', 'd']]

a    10
d    13
dtype: int32

In [34]:
# get items with labels 11 and 12
s2.loc[[11, 12]]

11    2
12    3
dtype: int64

### Slicing a Series into subsets

In [95]:
# a Series to use for slicing
# using index labels not starting at 0 to demonstrate 
# position based slicing
s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
s

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int32

In [96]:
# lookup via list of positions
s.iloc[[1, 2, 3, 4, 5]]

11    101
12    102
13    103
14    104
15    105
dtype: int32

In [97]:
# items at position 1, 3, 5
s[1:6:2] # 1 to 6 step 2

11    101
13    103
15    105
dtype: int32

In [98]:
# first five by slicing, same as .head(5)
s[:5]

10    100
11    101
12    102
13    103
14    104
dtype: int32

In [99]:
# from position 4 (the fifth item) to the end
s[4:]

14    104
15    105
16    106
17    107
18    108
19    109
dtype: int32

In [100]:
# reverse the Series
s[::-1]

19    109
18    108
17    107
16    106
15    105
14    104
13    103
12    102
11    101
10    100
dtype: int32

In [101]:
# -4:, which means the last 4 rows
s[-4:]

16    106
17    107
18    108
19    109
dtype: int32

In [102]:
# :-4, all but the last 4
s[:-4]

10    100
11    101
12    102
13    103
14    104
15    105
dtype: int32

In [103]:
# equivalent to s.tail(4).head(3)
s[-4:-1]

16    106
17    107
18    108
dtype: int32

In [104]:
# used to demonstrate the next two slices
s = pd.Series(np.arange(0, 5), 
              index=['a', 'b', 'c', 'd', 'e'])
s

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [105]:
# slices by position as the index is characters
s[1:3]

b    1
c    2
dtype: int32

In [109]:
# this slices by the strings in the index
s['b':'d']

b    1
c    2
d    3
dtype: int32

## Alignment via index labels

In [110]:
# First series for alignment
s1 = pd.Series([1, 2], index=['a', 'b'])
s1

a    1
b    2
dtype: int64

In [111]:
# Second series for alignment
s2 = pd.Series([4, 3], index=['b', 'a'])
s2

b    4
a    3
dtype: int64

In [112]:
# add them
s1 + s2

a    4
b    6
dtype: int64

In [113]:
# multiply all values in s1 by 2
s1 * 2

a    2
b    4
dtype: int64

In [114]:
# scalar series using s1's index
t = pd.Series(2, s1.index)
t

a    2
b    2
dtype: int64

In [115]:
# multiply s1 by t
s1 * t

a    2
b    4
dtype: int64

In [116]:
# we will add this to s1
s3 = pd.Series([5, 6], index=['b', 'c'])
s3

b    5
c    6
dtype: int64

In [117]:
# s1 and s3 have different sets of index labels
# NaN will result for a and c
s1 + s3

a    NaN
b    7.0
c    NaN
dtype: float64
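
When a missing label should count as zero rather than produce NaN, the add() method with fill_value is one option. A minimal sketch with the same two Series; the result names are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['b', 'c'])

plain = s1 + s3                      # NaN wherever a label is missing on one side
filled = s1.add(s3, fill_value=0)    # a missing side is treated as 0 instead

print(plain)
print(filled)
```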

In [58]:
# 2 'a' labels
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s1

a    1.0
a    2.0
b    3.0
dtype: float64

In [60]:
# 3 a labels
s2 = pd.Series([4.0, 5.0, 6.0, 7.0], index=['a', 'a', 'c', 'a'])
s2

a    4.0
a    5.0
c    6.0
a    7.0
dtype: float64

In [61]:
# will result in 6 'a' index labels, and NaN for b and c
s1 + s2

a    5.0
a    6.0
a    8.0
a    6.0
a    7.0
a    9.0
b    NaN
c    NaN
dtype: float64

### Boolean selection

In [118]:
# which rows have values that are >= 3?
s = pd.Series(np.arange(0, 5), index=list('abcde'))
logical_results = s >= 3
logical_results

a    False
b    False
c    False
d     True
e     True
dtype: bool

In [120]:
s>=3

a    False
b    False
c    False
d     True
e     True
dtype: bool

In [64]:
# select where True
s[logical_results]

d    3
e    4
dtype: int32

In [66]:
# a little shorter version
s[s > 3]

e    4
dtype: int32

In [121]:
# combine conditions with & and wrap each one in parentheses
s[(s >=2) & (s < 5)]

c    2
d    3
e    4
dtype: int32

In [122]:
# are all items >= 0?
(s >= 0).all()

True

In [123]:
# any items < 2?
(s < 2).any()

True

In [124]:
# how many values < 2?
(s < 2).sum()

2
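
Conditions can also be combined with | (or), inverted with ~ (not), or expressed as membership tests with isin(). A minimal sketch on the same Series; the result names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 5), index=list('abcde'))

either = s[(s < 1) | (s > 3)]    # element-wise OR
negated = s[~(s >= 2)]           # element-wise NOT
members = s[s.isin([1, 3])]      # membership test

print(either.index.tolist())
print(negated.index.tolist())
print(members.index.tolist())
```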

## Reindexing a Series

## pandas.DataFrame.reindex

1. Conform Series/DataFrame to new index with optional filling logic.  
2. Places NA/NaN in locations having no value in the previous index.  
3. A new object is produced unless the new index is equivalent to the current one and copy=False.  
4. Example:

In [71]:
# sample series of five items
np.random.seed(123456)
s = pd.Series(np.random.randn(5))
s

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
dtype: float64

In [72]:
# change the index
s.index = ['a', 'b', 'c', 'd', 'e']
s

a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64

In [74]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
s1

a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
dtype: float64

In [75]:
# reindex with different number of labels
# results in dropped rows and/or NaN's
s2 = s1.reindex(['a', 'c', 'g'])
s2

a    0.469112
c   -1.509059
g         NaN
dtype: float64

In [76]:
# different types for the same values of labels
# causes big trouble
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s1 + s2

0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

In [77]:
# reindex by casting the label types
# and we will get the desired result
s2.index = s2.index.values.astype(int)
s1 + s2

0    3
1    5
2    7
dtype: int64

In [78]:
# fill with 0 instead of NaN
s2 = s.copy()
s2.reindex(['a', 'f'], fill_value=0)

a    0.469112
f    0.000000
dtype: float64

In [79]:
# create example to demonstrate fills
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
s3

0      red
3    green
5     blue
dtype: object

In [80]:
s3

0      red
3    green
5     blue
dtype: object

In [81]:
s3.reindex(np.arange(0,7))

0      red
1      NaN
2      NaN
3    green
4      NaN
5     blue
6      NaN
dtype: object

In [82]:
# forward fill example
s3.reindex(np.arange(0,7), method='ffill')

0      red
1      red
2      red
3    green
4    green
5     blue
6     blue
dtype: object

In [83]:
# backwards fill example
s3.reindex(np.arange(0,7), method='bfill')

0      red
1    green
2    green
3    green
4     blue
5     blue
6      NaN
dtype: object

In [2]:
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404,301], 
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)
df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


In [4]:
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df=df.reindex(new_index)
df

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02
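
reindex on a DataFrame also accepts fill_value, and it can conform the columns instead of the rows. The sketch below rebuilds the browser frame; the column name 'user_agent' is a hypothetical label that does not exist in the data.

```python
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']

# rows that are new in new_index get 0 instead of NaN
filled = df.reindex(new_index, fill_value=0)

# reindexing the columns: the unknown column is added and filled with NaN
with_new_col = df.reindex(columns=['http_status', 'user_agent'])

print(filled)
print(with_new_col)
```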


## Values considered “missing”

As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While
NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to
easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases,
however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.

In [12]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3),index=["a", "c", "e", "f", "h"],
                  columns=["one", "two", "three"],)
df

Unnamed: 0,one,two,three
a,-0.752268,1.809516,-0.044346
c,2.286315,-0.774779,-3.491551
e,-0.631728,0.374999,-1.229647
f,0.349595,1.012546,-0.504641
h,-0.022814,-1.010834,-0.869495


In [14]:
df["four"] = "bar"
df

Unnamed: 0,one,two,three,four
a,-0.752268,1.809516,-0.044346,bar
c,2.286315,-0.774779,-3.491551,bar
e,-0.631728,0.374999,-1.229647,bar
f,0.349595,1.012546,-0.504641,bar
h,-0.022814,-1.010834,-0.869495,bar


In [15]:
df["five"] = df["one"] > 0
df

Unnamed: 0,one,two,three,four,five
a,-0.752268,1.809516,-0.044346,bar,False
c,2.286315,-0.774779,-3.491551,bar,True
e,-0.631728,0.374999,-1.229647,bar,False
f,0.349595,1.012546,-0.504641,bar,True
h,-0.022814,-1.010834,-0.869495,bar,False


## Working with missing data

In [16]:
df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

Unnamed: 0,one,two,three,four,five
a,-0.752268,1.809516,-0.044346,bar,False
b,,,,,
c,2.286315,-0.774779,-3.491551,bar,True
d,,,,,
e,-0.631728,0.374999,-1.229647,bar,False
f,0.349595,1.012546,-0.504641,bar,True
g,,,,,
h,-0.022814,-1.010834,-0.869495,bar,False


In [18]:
df

Unnamed: 0,one,two,three,four,five
a,-0.752268,1.809516,-0.044346,bar,False
c,2.286315,-0.774779,-3.491551,bar,True
e,-0.631728,0.374999,-1.229647,bar,False
f,0.349595,1.012546,-0.504641,bar,True
h,-0.022814,-1.010834,-0.869495,bar,False


In [19]:
df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
df2

Unnamed: 0,one,two,three,four,five
a,-0.752268,1.809516,-0.044346,bar,False
b,,,,,
c,2.286315,-0.774779,-3.491551,bar,True
d,,,,,
e,-0.631728,0.374999,-1.229647,bar,False
f,0.349595,1.012546,-0.504641,bar,True
g,,,,,
h,-0.022814,-1.010834,-0.869495,bar,False


## pandas.DataFrame.isna

#### To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and notna() functions, which are also methods on Series and DataFrame objects:

#### ● Detect missing values.  
#### ● Return a boolean same-sized object indicating if the values are NA.  
#### ● NA values, such as None or numpy.nan, get mapped to True values.  
#### ● Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values.  

In [20]:
df2["one"]

a   -0.752268
b         NaN
c    2.286315
d         NaN
e   -0.631728
f    0.349595
g         NaN
h   -0.022814
Name: one, dtype: float64

In [9]:
pd.isna(df2["one"])

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [22]:
df2["one"].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool

In [23]:
df2["four"].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [24]:
df2.isna()

Unnamed: 0,one,two,three,four,five
a,False,False,False,False,False
b,True,True,True,True,True
c,False,False,False,False,False
d,True,True,True,True,True
e,False,False,False,False,False
f,False,False,False,False,False
g,True,True,True,True,True
h,False,False,False,False,False


Warning: One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but None's do.
Note that pandas/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [25]:
None == None

True

In [26]:
np.nan == np.nan

False

Compared with the above, a scalar equality comparison against None or np.nan therefore provides no useful information:

In [16]:
df2["one"] == np.nan

a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
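
The reliable way to find missing entries is isna(), not ==. A minimal sketch with a hypothetical Series that holds both np.nan and None:

```python
import numpy as np
import pandas as pd

# hypothetical Series; None is converted to NaN in a float Series
s = pd.Series([1.0, np.nan, None])

wrong = (s == np.nan)    # always False, because NaN never compares equal
right = s.isna()         # True for both the NaN and the None entry
n_missing = int(right.sum())

print(wrong.tolist())
print(right.tolist(), n_missing)
```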

In [27]:
df = pd.DataFrame(dict(age=[5, 6, np.nan], 
                born=[pd.NaT, pd.Timestamp('1939-05-27'),
                pd.Timestamp('1940-04-25')],name=['Alfred', 'Batman', ''],
                toy=[None, 'Batmobile', 'Joker']))
df

Unnamed: 0,age,born,name,toy
0,5.0,NaT,Alfred,
1,6.0,1939-05-27,Batman,Batmobile
2,,1940-04-25,,Joker


###  Example:

In [29]:
df.isna()

Unnamed: 0,age,born,name,toy
0,False,True,False,True
1,False,False,False,False
2,True,False,False,False


## pandas.DataFrame.fillna

1. Fill NA/NaN values using the specified method.  
2. Example:

In [30]:
df=pd.DataFrame([[np.nan, 2, np.nan, 0], 
        [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5], 
        [np.nan, 3, np.nan, 4]], columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


3. Replace all NaN elements with 0’s.

In [32]:
df.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [34]:
df1=df.fillna(0)
df1

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [35]:
df2

Unnamed: 0,one,two,three,four,five
a,-0.752268,1.809516,-0.044346,bar,False
b,,,,,
c,2.286315,-0.774779,-3.491551,bar,True
d,,,,,
e,-0.631728,0.374999,-1.229647,bar,False
f,0.349595,1.012546,-0.504641,bar,True
g,,,,,
h,-0.022814,-1.010834,-0.869495,bar,False


In [36]:
df2["one"].fillna("missing")

a   -0.752268
b     missing
c    2.286315
d     missing
e   -0.631728
f    0.349595
g     missing
h   -0.022814
Name: one, dtype: object

## 4. We can also propagate non-null values forward or backward.

In [37]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [28]:
pd.notna(df)

Unnamed: 0,A,B,C,D
0,False,True,False,True
1,True,True,False,True
2,False,False,False,True
3,False,True,False,True


In [38]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [40]:
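# forward fill; fillna(method=...) is deprecated since pandas 2.1,
# where df.ffill() is the preferred spelling
df.fillna(method='ffill')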

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4


In [41]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [42]:
# backward fill; df.bfill() is the preferred spelling since pandas 2.1
df.fillna(method='bfill')

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4


In [43]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [44]:
df.ffill()

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4



### Limit the amount of filling
If we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword:

In [45]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [46]:
# "pad" is a synonym for "ffill"; df.ffill(limit=1) is the preferred spelling since pandas 2.1
df.fillna(method="pad", limit=1)

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,,3.0,,4


### Filling with a PandasObject
You can also call fillna with a dict or Series that is alignable. The labels of the dict or the index of the Series must match the
columns of the frame you wish to fill. A typical use case is filling each column of a DataFrame with that column's mean.

In [2]:
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10, 3), columns=list("ABC"))
dff

Unnamed: 0,A,B,C
0,1.242035,-2.637482,1.54198
1,-0.470395,1.827986,-1.003457
2,0.668405,1.898285,-1.936237
3,0.279724,-1.661291,1.982024
4,0.54585,-0.343037,0.548498
5,0.890096,-2.661384,-0.680505
6,0.812188,-0.5492,0.177236
7,0.309767,-1.183596,0.669373
8,-2.045948,1.099659,-0.181441
9,-0.773392,0.844205,0.610492


In [3]:
dff.iloc[3:5, 0] = np.nan
dff.iloc[4:6, 1] = np.nan
dff.iloc[5:8, 2] = np.nan
dff

Unnamed: 0,A,B,C
0,1.242035,-2.637482,1.54198
1,-0.470395,1.827986,-1.003457
2,0.668405,1.898285,-1.936237
3,,-1.661291,1.982024
4,,,0.548498
5,0.890096,,
6,0.812188,-0.5492,
7,0.309767,-1.183596,
8,-2.045948,1.099659,-0.181441
9,-0.773392,0.844205,0.610492


In [65]:
dff['A'].mean()

0.079094

In [4]:
dff

Unnamed: 0,A,B,C
0,1.242035,-2.637482,1.54198
1,-0.470395,1.827986,-1.003457
2,0.668405,1.898285,-1.936237
3,,-1.661291,1.982024
4,,,0.548498
5,0.890096,,
6,0.812188,-0.5492,
7,0.309767,-1.183596,
8,-2.045948,1.099659,-0.181441
9,-0.773392,0.844205,0.610492


In [5]:
dff["A"].fillna(dff['A'].mean())

0    1.242035
1   -0.470395
2    0.668405
3    0.079094
4    0.079094
5    0.890096
6    0.812188
7    0.309767
8   -2.045948
9   -0.773392
Name: A, dtype: float64

In [66]:
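# fills the NaNs in every column with the mean of column A;
# the output below comes from an earlier run with different random values
dff.fillna(dff['A'].mean())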

Unnamed: 0,A,B,C
0,0.318375,-0.285125,-1.478929
1,-0.030896,0.687139,-0.367629
2,0.875753,-1.771526,0.535858
3,-0.666247,-0.437533,0.099994
4,-0.666247,-0.666247,1.692444
5,-1.829933,-0.666247,-0.666247
6,-2.922838,-1.238736,-0.666247
7,-0.503549,1.588579,-0.666247
8,-0.059504,0.440825,1.135898
9,-1.17738,-0.379597,-0.509849
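
To fill each column with its own mean, pass the Series returned by mean(); a dict works the same way for per-column constants. The sketch below uses a small hypothetical frame with fixed values (not the random dff above) so the result is reproducible.

```python
import numpy as np
import pandas as pd

# hypothetical frame with one gap per column
small = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [np.nan, 4.0, 8.0],
                      'C': [5.0, 5.0, np.nan]})

# small.mean() is a Series indexed by column name, so each column
# is filled with its own mean (A: 2.0, B: 6.0, C: 5.0)
by_mean = small.fillna(small.mean())

# a dict fills only the columns it names; C keeps its NaN
by_dict = small.fillna({'A': 0.0, 'B': -1.0})

print(by_mean)
print(by_dict)
```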


## pandas.DataFrame.isnull

1. Detect missing values. 
2. Alias of isna. 
3. Example:

In [69]:
df = pd.DataFrame(dict(age=[5, 6, np.nan], 
        born=[pd.NaT, pd.Timestamp('1939-05-27'), 
        pd.Timestamp('1940-04-25')], name=['Alfred', 'Batman', ''], 
                toy=[None, 'Batmobile', 'Joker'])) 
df

Unnamed: 0,age,born,name,toy
0,5.0,NaT,Alfred,
1,6.0,1939-05-27,Batman,Batmobile
2,,1940-04-25,,Joker


In [70]:
df.isnull()

Unnamed: 0,age,born,name,toy
0,False,True,False,True
1,False,False,False,False
2,True,False,False,False


## Merge

1. pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.  
2. These facilities fall into three groups: merge, join, and concatenate.  
3. Concatenating objects: The concat function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.  
4. pandas objects are concatenated with concat(). 
5. Example:

In [6]:
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number']) 
df1

Unnamed: 0,letter,number
0,a,1
1,b,2


In [7]:
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number']) 
df2

Unnamed: 0,letter,number
0,c,3
1,d,4


In [8]:
pd.concat([df1, df2])

Unnamed: 0,letter,number
0,a,1
1,b,2
0,c,3
1,d,4


6. Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

In [9]:
pd.concat([df1, df2], join="inner")

Unnamed: 0,letter,number
0,a,1
1,b,2
0,c,3
1,d,4
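
Because df1 and df2 above have exactly the same columns, join="inner" returns the same result as the default. The effect only becomes visible when the columns differ. A small sketch, assuming a hypothetical second frame with an extra `animal` column:

```python
import pandas as pd

left = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
# hypothetical frame with one column that `left` does not have
right = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                     columns=['letter', 'number', 'animal'])

# default (outer) join: union of the columns, NaN where a value is missing
outer = pd.concat([left, right])
# inner join: only the columns shared by both frames
inner = pd.concat([left, right], join='inner')
```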


7. Combine DataFrame objects horizontally (side by side, along the columns axis) by passing in axis=1.

In [75]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,letter,number,letter.1,number.1
0,a,1,c,3
1,b,2,d,4


## Append 
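1. Append rows to a DataFrame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so these examples require an older version; pd.concat() is the replacement.  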
2. Example:

In [10]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) 
df

Unnamed: 0,A,B
0,1,2
1,3,4


In [11]:
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2

Unnamed: 0,A,B
0,5,6
1,7,8


In [12]:
df.append(df2)

Unnamed: 0,A,B
0,1,2
1,3,4
0,5,6
1,7,8


3. With ignore_index set to True:

In [13]:
df.append(df2, ignore_index=True)

Unnamed: 0,A,B
0,1,2
1,3,4
2,5,6
3,7,8


In [82]:
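# reassigning keeps the result; note that re-running this cell appends df2
# again, which is why the output below shows rows 5,6 and 7,8 twice
df=df.append(df2, ignore_index=True)
df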

Unnamed: 0,A,B
0,1,2
1,3,4
2,5,6
3,7,8
4,5,6
5,7,8
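
On pandas 2.0 and later, DataFrame.append() no longer exists and the calls above raise AttributeError. The same results can be obtained with pd.concat(). A minimal sketch, where `df_a` and `df_b` are illustrative names standing in for the two frames above:

```python
import pandas as pd

df_a = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df_b = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# equivalent of df_a.append(df_b): original index labels are kept
stacked = pd.concat([df_a, df_b])

# equivalent of df_a.append(df_b, ignore_index=True): new 0..n-1 index
renumbered = pd.concat([df_a, df_b], ignore_index=True)
```

Like append(), concat() returns a new object, so the result has to be assigned to a name to be kept.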


## Grouping 
By “group by” we are referring to a process involving one or more of the following steps:  
1. Splitting the data into groups based on some criteria.  
2. Applying a function to each group independently.  
3. Combining the results into a data structure.

## pandas.DataFrame.groupby

1. Group DataFrame using a mapper or by a Series of columns.  
2. A groupby operation involves some combination of splitting the object, applying a function, and combining the results.  
3. This can be used to group large amounts of data and compute operations on these groups. 
4. Example:

In [14]:
import pandas as pd
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df

Unnamed: 0,Animal,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0
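
The frame above is created but never grouped. As a sketch of the split-apply-combine steps on this data (the name `speeds` is only illustrative), the rows can be split by Animal and the mean taken per group:

```python
import pandas as pd

speeds = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                       'Max Speed': [380., 370., 24., 26.]})

# split by Animal, apply mean to each group, combine into one Series
mean_speed = speeds.groupby('Animal')['Max Speed'].mean()
```

The result is indexed by the group keys, here Falcon and Parrot.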


5. Group the rows and then apply the function sum to the resulting groups.  
6. Example:

In [15]:
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] 
df = pd.DataFrame(l, columns=["a", "b", "c"])
df

Unnamed: 0,a,b,c
0,1,2.0,3
1,1,,4
2,2,1.0,3
3,1,2.0,2


In [16]:
df.groupby(by=["b"]).sum()

Unnamed: 0_level_0,a,c
b,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,2,3
2.0,2,5
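
The row whose b value is NaN does not appear in the output above, because groupby() drops missing group keys by default. A sketch of keeping them, assuming pandas 1.1 or later where the dropna parameter is available (`frame` is an illustrative name):

```python
import pandas as pd

rows = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
frame = pd.DataFrame(rows, columns=['a', 'b', 'c'])

# default: the row with a missing key in b is left out
default_sum = frame.groupby('b').sum()

# dropna=False: the missing key forms a group of its own
with_nan = frame.groupby('b', dropna=False).sum()
```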


In [None]:
df.groupby(by=["c"]).sum()

## Getting Data In/Out

## CSV: Writing to CSV format

1. The Series and DataFrame objects have an instance method to_csv which allows storing the contents of the object as a comma-separated-values file.  
2. The method takes a number of arguments.  
3. Only the first, the output path, is normally needed; if it is omitted, to_csv returns the CSV text as a string.  
4. Example:

In [87]:
df

Unnamed: 0,a,b,c
0,1,2.0,3
1,1,,4
2,2,1.0,3
3,1,2.0,2


In [17]:
df.to_csv('test.csv')

## Reading from CSV format  
1. The two workhorse functions for reading text files (a.k.a. flat files) are read_csv() and read_table().  
2. They both use the same parsing code to intelligently convert tabular data into a DataFrame object. See the cookbook for some advanced strategies.  
3. Example:

In [18]:
df1=pd.read_csv('test.csv')
df1

Unnamed: 0.1,Unnamed: 0,a,b,c
0,0,1,2.0,3
1,1,1,,4
2,2,2,1.0,3
3,3,1,2.0,2
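
The extra unnamed column in the output above is the index that to_csv() wrote to the file and read_csv() then read back as ordinary data. Two common ways to avoid it are sketched below; the example uses in-memory text instead of a file, and `frame` is a hypothetical stand-in for df.

```python
import io
import pandas as pd

frame = pd.DataFrame({'a': [1, 1, 2], 'c': [3, 4, 3]})

# option 1: do not write the index at all
text_without_index = frame.to_csv(index=False)
restored_1 = pd.read_csv(io.StringIO(text_without_index))

# option 2: write the index, then read the first column back as the index
text_with_index = frame.to_csv()
restored_2 = pd.read_csv(io.StringIO(text_with_index), index_col=0)
```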


In [19]:
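# read_table defaults to a tab separator, so the comma-separated
# file is parsed as a single column
df2=pd.read_table('test.csv')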
df2

Unnamed: 0,",a,b,c"
0,"0,1,2.0,3"
1,"1,1,,4"
2,"2,2,1.0,3"
3,"3,1,2.0,2"
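
read_table() accepts the same sep argument as read_csv(), so it can read a comma-separated file once the separator is given explicitly. A short sketch with in-memory text (`csv_text` is a hypothetical stand-in for the contents of a CSV file):

```python
import io
import pandas as pd

csv_text = "a,b,c\n1,2.0,3\n1,,4\n"

# default separator is a tab: every line becomes one value in one column
one_column = pd.read_table(io.StringIO(csv_text))

# explicit comma separator: three columns, as read_csv would produce
three_columns = pd.read_table(io.StringIO(csv_text), sep=',')
```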


In [20]:
df

Unnamed: 0,a,b,c
0,1,2.0,3
1,1,,4
2,2,1.0,3
3,1,2.0,2


In [22]:
df.to_excel('test.xlsx', sheet_name='Sheet1')

In [23]:
df=pd.read_excel('test.xlsx')

In [24]:
df

Unnamed: 0.1,Unnamed: 0,a,b,c
0,0,1,2.0,3
1,1,1,,4
2,2,2,1.0,3
3,3,1,2.0,2


### Creating a DataFrame from a CSV file

In [25]:
sp500 = pd.read_csv("data/sp500.csv")
sp500

Unnamed: 0,Symbol,Name,Sector,Price,Dividend Yield,Price/Earnings,Earnings/Share,Book Value,52 week low,52 week high,Market Cap,EBITDA,Price/Sales,Price/Book,SEC Filings
0,MMM,3M Co.,Industrials,141.14,2.12,20.33,6.900,26.668,107.15,143.37,92.345,8.1210,2.95,5.26,http://www.sec.gov/cgi-bin/browse-edgar?action...
1,ABT,Abbott Laboratories,Health Care,39.60,1.82,25.93,1.529,15.573,32.70,40.49,59.477,4.3590,2.74,2.55,http://www.sec.gov/cgi-bin/browse-edgar?action...
2,ABBV,AbbVie Inc.,Health Care,53.95,3.02,20.87,2.570,2.954,40.10,54.78,85.784,7.1900,4.48,18.16,http://www.sec.gov/cgi-bin/browse-edgar?action...
3,ACN,Accenture,Information Technology,79.79,2.34,19.53,4.068,8.326,69.00,85.88,50.513,4.4230,1.75,9.54,http://www.sec.gov/cgi-bin/browse-edgar?action...
4,ACE,ACE Limited,Financials,102.91,2.21,10.00,10.293,86.897,84.73,104.07,34.753,4.2750,1.79,1.18,http://www.sec.gov/cgi-bin/browse-edgar?action...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,YHOO,Yahoo Inc.,Information Technology,35.02,,28.94,1.199,12.768,23.82,41.72,35.258,0.8873,7.48,2.72,http://www.sec.gov/cgi-bin/browse-edgar?action...
496,YUM,Yum! Brands Inc,Consumer Discretionary,74.77,1.93,29.86,2.507,5.147,64.08,79.70,33.002,2.8640,2.49,14.55,http://www.sec.gov/cgi-bin/browse-edgar?action...
497,ZMH,Zimmer Holdings,Health Care,101.84,0.81,22.92,4.441,37.181,74.55,108.33,17.091,1.6890,3.68,2.74,http://www.sec.gov/cgi-bin/browse-edgar?action...
498,ZION,Zions Bancorp,Financials,28.43,0.56,18.82,1.511,30.191,26.39,33.33,5.257,0.0000,2.49,0.94,http://www.sec.gov/cgi-bin/browse-edgar?action...


In [26]:
sp500 = pd.read_csv("data/sp500.csv",index_col='Symbol')
sp500

Unnamed: 0_level_0,Name,Sector,Price,Dividend Yield,Price/Earnings,Earnings/Share,Book Value,52 week low,52 week high,Market Cap,EBITDA,Price/Sales,Price/Book,SEC Filings
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
MMM,3M Co.,Industrials,141.14,2.12,20.33,6.900,26.668,107.15,143.37,92.345,8.1210,2.95,5.26,http://www.sec.gov/cgi-bin/browse-edgar?action...
ABT,Abbott Laboratories,Health Care,39.60,1.82,25.93,1.529,15.573,32.70,40.49,59.477,4.3590,2.74,2.55,http://www.sec.gov/cgi-bin/browse-edgar?action...
ABBV,AbbVie Inc.,Health Care,53.95,3.02,20.87,2.570,2.954,40.10,54.78,85.784,7.1900,4.48,18.16,http://www.sec.gov/cgi-bin/browse-edgar?action...
ACN,Accenture,Information Technology,79.79,2.34,19.53,4.068,8.326,69.00,85.88,50.513,4.4230,1.75,9.54,http://www.sec.gov/cgi-bin/browse-edgar?action...
ACE,ACE Limited,Financials,102.91,2.21,10.00,10.293,86.897,84.73,104.07,34.753,4.2750,1.79,1.18,http://www.sec.gov/cgi-bin/browse-edgar?action...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YHOO,Yahoo Inc.,Information Technology,35.02,,28.94,1.199,12.768,23.82,41.72,35.258,0.8873,7.48,2.72,http://www.sec.gov/cgi-bin/browse-edgar?action...
YUM,Yum! Brands Inc,Consumer Discretionary,74.77,1.93,29.86,2.507,5.147,64.08,79.70,33.002,2.8640,2.49,14.55,http://www.sec.gov/cgi-bin/browse-edgar?action...
ZMH,Zimmer Holdings,Health Care,101.84,0.81,22.92,4.441,37.181,74.55,108.33,17.091,1.6890,3.68,2.74,http://www.sec.gov/cgi-bin/browse-edgar?action...
ZION,Zions Bancorp,Financials,28.43,0.56,18.82,1.511,30.191,26.39,33.33,5.257,0.0000,2.49,0.94,http://www.sec.gov/cgi-bin/browse-edgar?action...


In [27]:
import pandas as pd
# read in the data
# use the Symbol column as the index, and 
# only read in columns in positions 0, 2, 3, 7
sp500 = pd.read_csv("data/sp500.csv", 
                    index_col='Symbol', 
                    usecols=[0, 2, 3, 7])
sp500

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.60,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
ACE,Financials,102.91,86.897
...,...,...,...
YHOO,Information Technology,35.02,12.768
YUM,Consumer Discretionary,74.77,5.147
ZMH,Health Care,101.84,37.181
ZION,Financials,28.43,30.191


In [28]:
# peek at the first 5 rows of the data using .head()
sp500.head()

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
ACE,Financials,102.91,86.897


In [29]:
# how many rows of data?  Should be 500
len(sp500)

500

In [30]:
# what is the shape?
sp500.shape

(500, 3)

In [31]:
# what is the size (number of rows times number of columns)?
sp500.size

1500

In [32]:
# examine the index
sp500.index

Index(['MMM', 'ABT', 'ABBV', 'ACN', 'ACE', 'ACT', 'ADBE', 'AES', 'AET', 'AFL',
       ...
       'XEL', 'XRX', 'XLNX', 'XL', 'XYL', 'YHOO', 'YUM', 'ZMH', 'ZION', 'ZTS'],
      dtype='object', name='Symbol', length=500)

In [33]:
# get the columns
sp500.columns

Index(['Sector', 'Price', 'Book Value'], dtype='object')

## Selecting columns of a DataFrame

In [34]:
# retrieve the Sector column
sp500['Sector'].head()

Symbol
MMM                Industrials
ABT                Health Care
ABBV               Health Care
ACN     Information Technology
ACE                 Financials
Name: Sector, dtype: object

In [35]:
# retrieve the Price and Book Value columns
sp500[['Price', 'Book Value']].head()

Unnamed: 0_level_0,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,141.14,26.668
ABT,39.6,15.573
ABBV,53.95,2.954
ACN,79.79,8.326
ACE,102.91,86.897


In [36]:
# show that this is a DataFrame
type(sp500[['Price', 'Book Value']])

pandas.core.frame.DataFrame

In [37]:
# attribute access of column by name
sp500.Price

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

In [38]:
sp500['Price']

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

## Selecting rows of a DataFrame

In [39]:
# get row with label MMM
# returned as a Series
sp500.loc['MMM']

Sector        Industrials
Price              141.14
Book Value         26.668
Name: MMM, dtype: object

In [40]:
# rows with label MMM and MSFT
# this is a DataFrame result
sp500.loc[['MMM', 'MSFT']]

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
MSFT,Information Technology,40.12,10.584


In [41]:
# get the location of MMM and A in the index
i1 = sp500.index.get_loc('MMM')
i2 = sp500.index.get_loc('A')
(i1, i2)

(0, 10)

In [42]:
# and get the rows
sp500.iloc[[i1, i2]]

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
A,Health Care,56.18,16.928


## Scalar lookup by label or location using .at[] and .iat[]

In [43]:
# by label in both the index and column
sp500.at['MMM', 'Price']

141.14

In [44]:
# by location: row 0, column 1
sp500.iat[0, 1]

141.14

## Slicing using the [] operator

In [45]:
# first five rows
sp500[:5]

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
ACE,Financials,102.91,86.897


In [46]:
# ABT through ACN labels
sp500['ABT':'ACN']

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
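
The two slices above treat their end points differently: a positional slice excludes the stop position, while a label slice includes the stop label. A minimal sketch on a hypothetical three-row frame named `prices`:

```python
import pandas as pd

# hypothetical miniature frame using three symbols from the example
prices = pd.DataFrame({'Price': [141.14, 39.60, 53.95]},
                      index=['MMM', 'ABT', 'ABBV'])

# positions 0 and 1; position 2 is excluded
by_position = prices[0:2]

# MMM through ABBV; the end label is included
by_label = prices['MMM':'ABBV']
```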


### Selecting rows using Boolean selection

In [47]:
# what rows have a price < 100?
sp500.Price < 100

Symbol
MMM     False
ABT      True
ABBV     True
ACN      True
ACE     False
        ...  
YHOO     True
YUM      True
ZMH     False
ZION     True
ZTS      True
Name: Price, Length: 500, dtype: bool

In [134]:
# get only the Price where Price is < 10 and > 6
r = sp500[(sp500.Price < 10) &
          (sp500.Price > 6)]['Price']
r

Symbol
HCBK    9.80
HBAN    9.10
SLM     8.82
WIN     9.38
Name: Price, dtype: float64

In [48]:
# get the Sector of the same rows (Price < 10 and > 6)
r = sp500[(sp500.Price < 10) &
          (sp500.Price > 6)]['Sector']
r

Symbol
HCBK                     Financials
HBAN                     Financials
SLM                      Financials
WIN     Telecommunications Services
Name: Sector, dtype: object

In [50]:
# price > 100 and in the Health Care Sector
r = sp500[(sp500.Sector == 'Health Care') & 
          (sp500.Price > 100.00)] [['Price', 'Sector']]
r

Unnamed: 0_level_0,Price,Sector
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
ACT,213.77,Health Care
ALXN,162.3,Health Care
AGN,166.92,Health Care
AMGN,114.33,Health Care
BCR,146.62,Health Care
BDX,115.7,Health Care
BIIB,299.71,Health Care
CELG,150.13,Health Care
HUM,124.49,Health Care
ISRG,363.86,Health Care


## Selecting across both rows and columns

In [51]:
# select the price and sector columns for ABT and ZTS
sp500.loc[['ABT', 'ZTS']][['Sector', 'Price']]

Unnamed: 0_level_0,Sector,Price
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
ABT,Health Care,39.6
ZTS,Health Care,30.53
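
The chained form above first builds an intermediate DataFrame of the two rows and then selects columns from it. .loc also accepts a row selector and a column selector in a single call, which gives the same result in one step. A sketch on a hypothetical miniature frame (`stocks`, with an illustrative `Extra` column):

```python
import pandas as pd

stocks = pd.DataFrame({'Sector': ['Health Care', 'Industrials', 'Health Care'],
                       'Price': [39.60, 141.14, 30.53],
                       'Extra': [0, 1, 2]},  # illustrative filler column
                      index=['ABT', 'MMM', 'ZTS'])

# two steps: rows first, then columns
chained = stocks.loc[['ABT', 'ZTS']][['Sector', 'Price']]

# one step: rows and columns together
single = stocks.loc[['ABT', 'ZTS'], ['Sector', 'Price']]
```

The single-call form is also the one to prefer when assigning values, since assignment through a chained selection may act on a temporary copy.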


## Renaming columns

In [52]:
# import numpy and pandas
import numpy as np
import pandas as pd

# used for dates
from datetime import datetime, date

# Set some pandas options controlling output format
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 60)

# read in the data
# use the Symbol column as the index, and 
# only read in columns in positions 0, 2, 3, 7
sp500 = pd.read_csv("data/sp500.csv", 
                    index_col='Symbol', 
                    usecols=[0, 2, 3, 7])

In [53]:
sp500

Unnamed: 0_level_0,Sector,Price,Book Value
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.60,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
ACE,Financials,102.91,86.897
...,...,...,...
YHOO,Information Technology,35.02,12.768
YUM,Consumer Discretionary,74.77,5.147
ZMH,Health Care,101.84,37.181
ZION,Financials,28.43,30.191


In [54]:
sp500.columns

Index(['Sector', 'Price', 'Book Value'], dtype='object')

In [141]:
newSP500 = sp500.rename(columns=
                        {'Book Value': 'BookValue'})
# print first 2 rows
newSP500[:2]

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573


In [142]:
# verify the columns in the original did not change
sp500.columns

Index(['Sector', 'Price', 'Book Value'], dtype='object')

In [143]:
# this changes the column in-place
sp500.rename(columns=                  
             {'Book Value': 'BookValue'},                   
             inplace=True)
# we can see the column is changed
sp500.columns


Index(['Sector', 'Price', 'BookValue'], dtype='object')

In [144]:
sp500.BookValue[:5]

Symbol
MMM     26.668
ABT     15.573
ABBV     2.954
ACN      8.326
ACE     86.897
Name: BookValue, dtype: float64

## Adding new columns with [] and .insert()

In [145]:
sp500.Price

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
         ...  
YHOO     35.02
YUM      74.77
ZMH     101.84
ZION     28.43
ZTS      30.53
Name: Price, Length: 500, dtype: float64

In [147]:
# make a copy so that we keep the original data unchanged
sp500_copy = sp500.copy()
# add the new column
sp500_copy['RoundedPrice'] = sp500.Price.round()
sp500_copy[:2]

Unnamed: 0_level_0,Sector,Price,BookValue,RoundedPrice
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,141.0
ABT,Health Care,39.6,15.573,40.0


In [148]:
# make a copy so that we keep the original data unchanged
copy = sp500.copy()
# insert the rounded price as the
# second column in the DataFrame
copy.insert(1, 'RoundedPrice', sp500.Price.round())
copy[:2]

Unnamed: 0_level_0,Sector,RoundedPrice,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.0,141.14,26.668
ABT,Health Care,40.0,39.6,15.573


## Adding columns through enlargement

In [150]:
# copy of subset / slice
ss = sp500[:3].copy()
# add the new column initialized to 0
ss.loc[:,'PER'] = 0
# take a look at the results
ss

Unnamed: 0_level_0,Sector,Price,BookValue,PER
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,0
ABT,Health Care,39.6,15.573,0
ABBV,Health Care,53.95,2.954,0


In [155]:
# copy of subset / slice
ss = sp500[:3].copy()
# add the new column initialized with random numbers
np.random.seed(123456)
ss.loc[:,'PER'] = pd.Series(np.random.normal(size=3), index=ss.index)
# take a look at the results
ss

Unnamed: 0_level_0,Sector,Price,BookValue,PER
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,0.469112
ABT,Health Care,39.6,15.573,-0.282863
ABBV,Health Care,53.95,2.954,-1.509059


## Adding columns using concatenation

In [156]:
# create a DataFrame with only the RoundedPrice column
rounded_price = pd.DataFrame({'RoundedPrice':    
                              sp500.Price.round()})
# concatenate along the columns axis
concatenated = pd.concat([sp500, rounded_price], axis=1)
concatenated[:5]

Unnamed: 0_level_0,Sector,Price,BookValue,RoundedPrice
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,141.0
ABT,Health Care,39.6,15.573,40.0
ABBV,Health Care,53.95,2.954,54.0
ACN,Information Technology,79.79,8.326,80.0
ACE,Financials,102.91,86.897,103.0


In [157]:
# create a DataFrame holding the rounded prices in a column named Price
rounded_price = pd.DataFrame({'Price': sp500.Price.round()})
rounded_price[:5]

Unnamed: 0_level_0,Price
Symbol,Unnamed: 1_level_1
MMM,141.0
ABT,40.0
ABBV,54.0
ACN,80.0
ACE,103.0


In [158]:
# this will result in a duplicate Price column
dups = pd.concat([sp500, rounded_price], axis=1)
dups[:5]

Unnamed: 0_level_0,Sector,Price,BookValue,Price
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,141.0
ABT,Health Care,39.6,15.573,40.0
ABBV,Health Care,53.95,2.954,54.0
ACN,Information Technology,79.79,8.326,80.0
ACE,Financials,102.91,86.897,103.0


In [159]:
# retrieves both Price columns
dups.Price[:5]

Unnamed: 0_level_0,Price,Price
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,141.14,141.0
ABT,39.6,40.0
ABBV,53.95,54.0
ACN,79.79,80.0
ACE,102.91,103.0


## Reordering columns

In [160]:
# return a new DataFrame with the columns reversed
reversed_column_names = sp500.columns[::-1]
sp500[reversed_column_names][:5]

Unnamed: 0_level_0,BookValue,Price,Sector
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,26.668,141.14,Industrials
ABT,15.573,39.6,Health Care
ABBV,2.954,53.95,Health Care
ACN,8.326,79.79,Information Technology
ACE,86.897,102.91,Financials


## Replacing the contents of a column

In [45]:
# this occurs in-place so let's use a copy
copy = sp500.copy()
# replace the Price column data with the new values
# instead of adding a new column
copy.Price = rounded_price.Price
copy[:5]

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.0,26.668
ABT,Health Care,40.0,15.573
ABBV,Health Care,54.0,2.954
ACN,Information Technology,80.0,8.326
ACE,Financials,103.0,86.897


In [46]:
# this occurs in-place so let's use a copy
copy = sp500.copy()
# replace the Price column data with rounded values
copy.loc[:,'Price'] = rounded_price.Price
copy[:5]

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.0,26.668
ABT,Health Care,40.0,15.573
ABBV,Health Care,54.0,2.954
ACN,Information Technology,80.0,8.326
ACE,Financials,103.0,86.897


## Deleting columns

In [161]:
# Example of using del to delete a column
# make a copy as this is done in-place
copy = sp500.copy()
del copy['BookValue']
copy[:2]

Unnamed: 0_level_0,Sector,Price
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,Industrials,141.14
ABT,Health Care,39.6


In [48]:
# Example of using pop to remove a column from a DataFrame
# first make a copy of the data frame, as
# pop works in place
copy = sp500.copy()
# this will remove Sector and return it as a series
popped = copy.pop('Sector')
# Sector column removed in-place
copy[:2]

Unnamed: 0_level_0,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,141.14,26.668
ABT,39.6,15.573


In [49]:
# and we have the Sector column as the result of the pop
popped[:5]

Symbol
MMM                Industrials
ABT                Health Care
ABBV               Health Care
ACN     Information Technology
ACE                 Financials
Name: Sector, dtype: object

In [50]:
# Example of using drop to remove a column
# make a copy of the data frame
copy = sp500.copy()
# this will return a new DataFrame with 'Sector' removed
# the copy DataFrame is not modified
afterdrop = copy.drop(['Sector'], axis = 1)
afterdrop[:5]

Unnamed: 0_level_0,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,141.14,26.668
ABT,39.6,15.573
ABBV,53.95,2.954
ACN,79.79,8.326
ACE,102.91,86.897


## Appending rows from other DataFrame objects with .append()

In [51]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()
# take the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10, 11, 2]]
# append df1 and df2 (DataFrame.append() requires pandas < 2.0)
appended = df1.append(df2)
# the result is the rows of the first followed by 
# those of the second
appended

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
A,Health Care,56.18,16.928
GAS,Utilities,52.98,32.462
ABBV,Health Care,53.95,2.954


In [52]:
# data frame using df1.index and just a PER column
# also a good example of using a scalar value
# to initialize multiple rows
df3 = pd.DataFrame(0.0, 
                   index=df1.index,
                   columns=['PER'])
df3

Unnamed: 0_level_0,PER
Symbol,Unnamed: 1_level_1
MMM,0.0
ABT,0.0
ABBV,0.0


In [53]:
# append df1 and df3
# each has three rows, so 6 rows is the result
# df1 had no PER column, so NaN for those rows
# df3 had no BookValue, Price or Sector, so NaN there as well
df1.append(df3)

Unnamed: 0_level_0,Sector,Price,BookValue,PER
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,
ABT,Health Care,39.6,15.573,
ABBV,Health Care,53.95,2.954,
MMM,,,,0.0
ABT,,,,0.0
ABBV,,,,0.0


In [54]:
# ignore index labels, create default index
df1.append(df3, ignore_index=True)

Unnamed: 0,Sector,Price,BookValue,PER
0,Industrials,141.14,26.668,
1,Health Care,39.6,15.573,
2,Health Care,53.95,2.954,
3,,,,0.0
4,,,,0.0
5,,,,0.0


## Concatenating rows

In [55]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()
# take the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10, 11, 2]]
# pass them as a list
pd.concat([df1, df2])

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
A,Health Care,56.18,16.928
GAS,Utilities,52.98,32.462
ABBV,Health Care,53.95,2.954


In [56]:
# copy df2
df2_2 = df2.copy()
# add a column to df2_2 that is not in df1
df2_2.insert(3, 'Foo', pd.Series(0, index=df2.index))
# see what it looks like
df2_2

Unnamed: 0_level_0,Sector,Price,BookValue,Foo
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,Health Care,56.18,16.928,0
GAS,Utilities,52.98,32.462,0
ABBV,Health Care,53.95,2.954,0


In [57]:
# now concatenate
pd.concat([df1, df2_2])

Unnamed: 0_level_0,Sector,Price,BookValue,Foo
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MMM,Industrials,141.14,26.668,
ABT,Health Care,39.6,15.573,
ABBV,Health Care,53.95,2.954,
A,Health Care,56.18,16.928,0.0
GAS,Utilities,52.98,32.462,0.0
ABBV,Health Care,53.95,2.954,0.0


In [58]:
# specify keys
r = pd.concat([df1, df2_2], keys=['df1', 'df2'])
r

Unnamed: 0_level_0,Unnamed: 1_level_0,Sector,Price,BookValue,Foo
Unnamed: 0_level_1,Symbol,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
df1,MMM,Industrials,141.14,26.668,
df1,ABT,Health Care,39.6,15.573,
df1,ABBV,Health Care,53.95,2.954,
df2,A,Health Care,56.18,16.928,0.0
df2,GAS,Utilities,52.98,32.462,0.0
df2,ABBV,Health Care,53.95,2.954,0.0
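
The keys become the outer level of a hierarchical index, which makes it possible to pull the rows of each source frame back out of the result. A minimal sketch with two hypothetical frames (`first`, `second`) that reuse a few prices from the example:

```python
import pandas as pd

first = pd.DataFrame({'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])
second = pd.DataFrame({'Price': [56.18, 52.98]}, index=['A', 'GAS'])

combined = pd.concat([first, second], keys=['df1', 'df2'])

# all rows that came from the second frame
only_second = combined.loc['df2']

# a single value, addressed by (key, row label) and column
one_value = combined.loc[('df1', 'ABT'), 'Price']
```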


## Adding and replacing rows via setting with enlargement

In [59]:
# get a small subset of the sp500
# copy the slice so that the original is not modified
ss = sp500[:3].copy()
# create a new row with index label FOO
# and assign some values to the columns via a list
ss.loc['FOO'] = ['the sector', 100, 110]
ss

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
FOO,the sector,100.0,110.0


## Removing rows using .drop()

In [60]:
# get the first 5 rows of sp500 (no copy is needed here,
# because drop() returns a new DataFrame)
ss = sp500[:5]
ss

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
ACE,Financials,102.91,86.897


In [61]:
# drop rows with labels ABT and ACN
afterdrop = ss.drop(['ABT', 'ACN'])
afterdrop[:5]

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABBV,Health Care,53.95,2.954
ACE,Financials,102.91,86.897


## Removing rows using Boolean selection

In [4]:
# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[22000,25000,np.nan,24000],
    'Duration':['30day',None,'55days',np.nan],
    'Discount':[1000,2300,1000,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

   Courses      Fee Duration  Discount
0    Spark  22000.0    30day    1000.0
1  PySpark  25000.0     None    2300.0
2   Hadoop      NaN   55days    1000.0
3   Python  24000.0      NaN       NaN


In [5]:
df2=df.copy()
df1=df.copy()
df3=df.copy()
df4=df.copy()
df4

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,22000.0,30day,1000.0
1,PySpark,25000.0,,2300.0
2,Hadoop,,55days,1000.0
3,Python,24000.0,,


In [2]:
# drop, in place, the rows whose Fee is at least 24000
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
print(df)

  Courses      Fee Duration  Discount
0   Spark  22000.0    30day    1000.0
2  Hadoop      NaN   55days    1000.0


## Using loc[] to Drop Rows by Condition

The DataFrame.drop() method removes the rows whose index labels it is given, so passing it the index of a filtered frame drops rows by condition. Another widely used approach is to select the rows with loc[] or df[].

In [6]:
df2

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,22000.0,30day,1000.0
1,PySpark,25000.0,,2300.0
2,Hadoop,,55days,1000.0
3,Python,24000.0,,


In [7]:
# select (rather than drop) the rows with Fee >= 24000; df2 is unchanged
df6 = df2[df2.Fee >= 24000]
print(df6)

   Courses      Fee Duration  Discount
1  PySpark  25000.0     None    2300.0
3   Python  24000.0      NaN       NaN


In [8]:
df2

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,22000.0,30day,1000.0
1,PySpark,25000.0,,2300.0
2,Hadoop,,55days,1000.0
3,Python,24000.0,,


In [9]:
# find the matching rows with loc[] and drop them in place
df2.drop(df2.loc[df2["Fee"] >= 24000].index, inplace=True)
print(df2)

  Courses      Fee Duration  Discount
0   Spark  22000.0    30day    1000.0
2  Hadoop      NaN   55days    1000.0
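
The Hadoop row survives in the output above even though its Fee is missing: any comparison with NaN is False, so that row never matches the condition being dropped. Dropping rows that match a condition and keeping rows that match the opposite condition are therefore not always the same thing. A small sketch (`courses` is an illustrative name):

```python
import numpy as np
import pandas as pd

courses = pd.DataFrame({'Courses': ['Spark', 'PySpark', 'Hadoop', 'Python'],
                        'Fee': [22000, 25000, np.nan, 24000]})

# drop rows where Fee >= 24000: the NaN row fails the test, so it is kept
dropped = courses[~(courses['Fee'] >= 24000)]

# keep rows where Fee < 24000: the NaN row fails this test too, so it is lost
kept = courses[courses['Fee'] < 24000]
```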


In [10]:
df4

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,22000.0,30day,1000.0
1,PySpark,25000.0,,2300.0
2,Hadoop,,55days,1000.0
3,Python,24000.0,,


In [11]:
# Keep only the rows that satisfy conditions on multiple columns
df4 = df4[(df4['Fee'] >= 22000) & (df4['Discount'] == 2300)]
print(df4)

   Courses      Fee Duration  Discount
1  PySpark  25000.0     None    2300.0


In [62]:
# determine the rows where Price > 300
selection = sp500.Price > 300
# report number of rows and number that will be dropped
(len(selection), selection.sum())

(500, 10)

In [63]:
# select the rows that do NOT match, using ~ to take
# the complement of the Boolean selection
price_less_than_300 = sp500[~selection]
price_less_than_300

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.60,15.573
ABBV,Health Care,53.95,2.954
ACN,Information Technology,79.79,8.326
ACE,Financials,102.91,86.897
...,...,...,...
YHOO,Information Technology,35.02,12.768
YUM,Consumer Discretionary,74.77,5.147
ZMH,Health Care,101.84,37.181
ZION,Financials,28.43,30.191


## Quick Examples of Drop Rows With Condition in Pandas

In [None]:
# The DataFrame.drop() method deletes the rows that match a condition.
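
A few quick examples, sketched on a hypothetical copy of the course data used earlier (the name `courses` is illustrative). Each call returns a new DataFrame and leaves `courses` unchanged.

```python
import numpy as np
import pandas as pd

courses = pd.DataFrame({
    'Courses': ['Spark', 'PySpark', 'Hadoop', 'Python'],
    'Fee': [22000, 25000, np.nan, 24000],
    'Discount': [1000, 2300, 1000, np.nan],
})

# 1. drop the rows matching a single condition
cheap = courses.drop(courses[courses['Fee'] >= 24000].index)

# 2. drop the rows matching several conditions at once
no_pyspark = courses.drop(
    courses[(courses['Fee'] >= 22000) & (courses['Discount'] == 2300)].index)

# 3. drop the rows with a missing Fee
with_fee = courses.dropna(subset=['Fee'])
```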

## Removing rows using a slice

In [64]:
# get only the first three rows
only_first_three = sp500[:3]
only_first_three

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954


In [65]:
# first three, but a copy of them
only_first_three = sp500[:3].copy()
only_first_three

Unnamed: 0_level_0,Sector,Price,BookValue
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MMM,Industrials,141.14,26.668
ABT,Health Care,39.6,15.573
ABBV,Health Care,53.95,2.954


###  End of Chapter 5

## Excel  
● The read_excel() function can read Excel 2003 (.xls) files using the xlrd Python module and Excel 2007+ (.xlsx) files using openpyxl (xlrd 2.0 and later no longer reads .xlsx).  
● The to_excel() instance method is used for saving a DataFrame to Excel.  
● Generally the semantics are similar to working with csv data.

### Writing to an excel file.
● Example:

In [12]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')

### Reading from an excel file.

In [2]:
import pandas as pd
df3=pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
df3

Unnamed: 0,Symbol,Sector,Price,Book Value
0,MMM,Industrials,141.14,26.668
1,ABT,Health Care,39.60,15.573
2,ABBV,Health Care,53.95,2.954
3,ACN,Information Technology,79.79,8.326
4,ACE,Financials,102.91,86.897
...,...,...,...,...
495,YHOO,Information Technology,35.02,12.768
496,YUM,Consumer Discretionary,74.77,5.147
497,ZMH,Health Care,101.84,37.181
498,ZION,Financials,28.43,30.191


## ASAP Pandas Practical

In [None]:
https://colab.research.google.com/drive/1xhpSpt-g76xhg8hF4s1TDQpeAcfkn5qT?usp=sharing

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(100)
set1df = pd.DataFrame(data=np.random.randn(6,4), index=pd.date_range('20210101', periods=6),columns=list('wxyz'))
set1df

Unnamed: 0,w,x,y,z
2021-01-01,-1.749765,0.34268,1.153036,-0.252436
2021-01-02,0.981321,0.514219,0.22118,-1.070043
2021-01-03,-0.189496,0.255001,-0.458027,0.435163
2021-01-04,-0.583595,0.816847,0.672721,-0.104411
2021-01-05,-0.53128,1.029733,-0.438136,-1.118318
2021-01-06,1.618982,1.541605,-0.251879,-0.842436


### Q1. Complete the function below to return column w. The function should return a Series object of length 6.

In [3]:
def q1(set1df):
    x=(set1df['w'])
    return x
x1=q1(set1df)
print (x1)

2021-01-01   -1.749765
2021-01-02    0.981321
2021-01-03   -0.189496
2021-01-04   -0.583595
2021-01-05   -0.531280
2021-01-06    1.618982
Freq: D, Name: w, dtype: float64


### Q2. Complete the function below to return the third row. The function should return a Series object of length 4.

In [4]:
def q2(set1df):
  #Fill in your code here
  # iloc[2] selects the third row by position (counting from 0)
  x=set1df.iloc[2]
  return x
print(q2(set1df))

w   -0.189496
x    0.255001
y   -0.458027
z    0.435163
Name: 2021-01-03 00:00:00, dtype: float64


In [14]:
np.random.seed(100)
set1df1 = pd.DataFrame(data=np.random.randn(6,4), index=([1,2,3,4,5,6]),columns=list('wxyz'))
set1df1

Unnamed: 0,w,x,y,z
1,-1.749765,0.34268,1.153036,-0.252436
2,0.981321,0.514219,0.22118,-1.070043
3,-0.189496,0.255001,-0.458027,0.435163
4,-0.583595,0.816847,0.672721,-0.104411
5,-0.53128,1.029733,-0.438136,-1.118318
6,1.618982,1.541605,-0.251879,-0.842436


In [15]:
def q2(set1df):
  # loc[3] selects the row whose index label is 3
  x=set1df.loc[3]
  return x
print(q2(set1df1))

w   -0.189496
x    0.255001
y   -0.458027
z    0.435163
Name: 3, dtype: float64


In [16]:
set1df1.loc[3]

w   -0.189496
x    0.255001
y   -0.458027
z    0.435163
Name: 3, dtype: float64

### Q3. Complete the function below to select the last three rows from the above dataframe. The function should return a DataFrame object of 3 rows and 4 columns.

In [17]:
def q3(set1df):
  # iloc[-3:] selects the last three rows by position
  x=set1df.iloc[-3:]
  return x
print(q3(set1df1))

          w         x         y         z
4 -0.583595  0.816847  0.672721 -0.104411
5 -0.531280  1.029733 -0.438136 -1.118318
6  1.618982  1.541605 -0.251879 -0.842436


In [18]:
set1df1.tail(3)

Unnamed: 0,w,x,y,z
4,-0.583595,0.816847,0.672721,-0.104411
5,-0.53128,1.029733,-0.438136,-1.118318
6,1.618982,1.541605,-0.251879,-0.842436


In [19]:
def q3(set1df):
  # tail(3) returns the last three rows
  return set1df.tail(3)
q3(set1df1)

Unnamed: 0,w,x,y,z
4,-0.583595,0.816847,0.672721,-0.104411
5,-0.53128,1.029733,-0.438136,-1.118318
6,1.618982,1.541605,-0.251879,-0.842436


### Q4. Complete the function below to sort the above dataframe by column 'y' in ascending order. The function should return a DataFrame object of 6 rows and 4 columns.

In [None]:
#df.sort_values(by=['col1'])

In [20]:
def q4(set1df):
  # Sort the rows by the values in column 'y', smallest first
  return set1df.sort_values(by=['y'],ascending=True)
q4(set1df1)

Unnamed: 0,w,x,y,z
3,-0.189496,0.255001,-0.458027,0.435163
5,-0.53128,1.029733,-0.438136,-1.118318
6,1.618982,1.541605,-0.251879,-0.842436
2,0.981321,0.514219,0.22118,-1.070043
4,-0.583595,0.816847,0.672721,-0.104411
1,-1.749765,0.34268,1.153036,-0.252436


DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)



### Q5. Write a Pandas program to select the 'x' and 'y' columns from the above DataFrame. The function should return a DataFrame object of 6 rows and 2 columns.

In [109]:
set1df1.loc[:,['x','y']]

Unnamed: 0,x,y
1,0.34268,1.153036
2,0.514219,0.22118
3,0.255001,-0.458027
4,0.816847,0.672721
5,1.029733,-0.438136
6,1.541605,-0.251879


In [21]:
def q5(set1df):
    # Select every row (:) and the two named columns
    df1=set1df.loc[:,['x','y']]
    return df1
q5(set1df1)

Unnamed: 0,x,y
1,0.34268,1.153036
2,0.514219,0.22118
3,0.255001,-0.458027
4,0.816847,0.672721
5,1.029733,-0.438136
6,1.541605,-0.251879


In [22]:
def q5(set1df,p1,p2):
    # Same selection, with the two column names passed in as parameters
    df1=set1df.loc[:,[p1,p2]]
    return df1
q5(set1df1,'x','y')

Unnamed: 0,x,y
1,0.34268,1.153036
2,0.514219,0.22118
3,0.255001,-0.458027
4,0.816847,0.672721
5,1.029733,-0.438136
6,1.541605,-0.251879


### Q6. Display a summary of the basic information about this DataFrame and its data. The function should return a DataFrame object of 8 rows and 4 columns.

In [23]:
def q6(set1df):
  #Fill in your code here
  return x

In [24]:
np.random.seed(100)
set1df = pd.DataFrame(data=np.random.randn(8,4),columns=list('wxyz'))
set1df

Unnamed: 0,w,x,y,z
0,-1.749765,0.34268,1.153036,-0.252436
1,0.981321,0.514219,0.22118,-1.070043
2,-0.189496,0.255001,-0.458027,0.435163
3,-0.583595,0.816847,0.672721,-0.104411
4,-0.53128,1.029733,-0.438136,-1.118318
5,1.618982,1.541605,-0.251879,-0.842436
6,0.184519,0.937082,0.731,1.361556
7,-0.326238,0.055676,0.2224,-1.443217


In [25]:
def q6(set1df):
  #Fill in your code here
    df1=set1df.describe()
    return df1
q6(set1df)

Unnamed: 0,w,x,y,z
count,8.0,8.0,8.0,8.0
mean,-0.074444,0.686605,0.231537,-0.379268
std,1.028221,0.487093,0.591742,0.937724
min,-1.749765,0.055676,-0.458027,-1.443217
25%,-0.544359,0.320761,-0.298443,-1.082112
50%,-0.257867,0.665533,0.22179,-0.547436
75%,0.383719,0.960245,0.687291,0.030483
max,1.618982,1.541605,1.153036,1.361556


### Q7. Return a dataframe retaining only values greater than 0, with all other values set to NaN. The function should return a DataFrame object of 6 rows and 4 columns.

In [33]:
np.random.seed(100)
set1df = pd.DataFrame(data=np.random.randn(6,4),columns=list('wxyz'))
set1df

Unnamed: 0,w,x,y,z
0,-1.749765,0.34268,1.153036,-0.252436
1,0.981321,0.514219,0.22118,-1.070043
2,-0.189496,0.255001,-0.458027,0.435163
3,-0.583595,0.816847,0.672721,-0.104411
4,-0.53128,1.029733,-0.438136,-1.118318
5,1.618982,1.541605,-0.251879,-0.842436


In [34]:
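def q7(set1df):
    # Use np.nan (a float), not the string 'NaN', so the columns stay numeric.
    # Note: this assignment modifies the DataFrame passed in.
    set1df[set1df <= 0] = np.nan
    return set1df
q7(set1df)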

Unnamed: 0,w,x,y,z
0,,0.34268,1.153036,
1,0.981321,0.514219,0.22118,
2,,0.255001,,0.435163
3,,0.816847,0.672721,
4,,1.029733,,
5,1.618982,1.541605,,


In [31]:
#df[df < 0] = 0
#set1df[set1df<0]='NaN'

In [32]:
set1df

Unnamed: 0,w,x,y,z
0,,0.34268,1.153036,
1,0.981321,0.514219,0.22118,
2,,0.255001,,0.435163
3,,0.816847,0.672721,
4,,1.029733,,
5,1.618982,1.541605,,


### Q8. Create a series with [7,2,4,9,5,6] and a series of dates starting from '20210102' as its index, and add it as a new column named newCol. The function should return a DataFrame object of 6 rows and 5 columns.

In [10]:
ser1=np.array([7,2,4,9,5,6]) 
ser1

array([7, 2, 4, 9, 5, 6])

In [12]:
s1=pd.Series([7,2,4,9,5,6],name='series')
s1

0    7
1    2
2    4
3    9
4    5
5    6
Name: series, dtype: int64

In [13]:
df2=pd.date_range('20210101', periods=6)
df2

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [86]:
df2.reindex([0,1,2,3,4,5])
df2

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [14]:
def q8(set1df):
  #Fill in your code here
    x=set1df
    return x
x=q8(set1df)
x

Unnamed: 0,w,x,y,z
2021-01-01,-1.749765,0.34268,1.153036,-0.252436
2021-01-02,0.981321,0.514219,0.22118,-1.070043
2021-01-03,-0.189496,0.255001,-0.458027,0.435163
2021-01-04,-0.583595,0.816847,0.672721,-0.104411
2021-01-05,-0.53128,1.029733,-0.438136,-1.118318
2021-01-06,1.618982,1.541605,-0.251879,-0.842436
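
The `q8` cell above returns the DataFrame unchanged. One possible way to complete it is sketched below, assuming the dates are meant to start at '20210102' as the question states. The helper name `new_series` is illustrative. Because assignment aligns on the index, the first row (2021-01-01) receives NaN and the series value dated 2021-01-07 is dropped.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
set1df = pd.DataFrame(data=np.random.randn(6, 4),
                      index=pd.date_range('20210101', periods=6),
                      columns=list('wxyz'))

def q8(df):
    # Series of six values indexed by six consecutive days starting 2021-01-02
    new_series = pd.Series([7, 2, 4, 9, 5, 6],
                           index=pd.date_range('20210102', periods=6))
    # Work on a copy so the caller's DataFrame is not modified
    result = df.copy()
    # Column assignment aligns the series with the DataFrame's date index
    result['newCol'] = new_series
    return result

result = q8(set1df)
print(result)
```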


In [4]:
np.random.seed(100)
set1df = pd.DataFrame(data=np.random.randn(6,4), index=pd.date_range('20210101', periods=6),columns=list('wxyz'))
set1df

Unnamed: 0,w,x,y,z
2021-01-01,-1.749765,0.34268,1.153036,-0.252436
2021-01-02,0.981321,0.514219,0.22118,-1.070043
2021-01-03,-0.189496,0.255001,-0.458027,0.435163
2021-01-04,-0.583595,0.816847,0.672721,-0.104411
2021-01-05,-0.53128,1.029733,-0.438136,-1.118318
2021-01-06,1.618982,1.541605,-0.251879,-0.842436


### Q9. Complete the function to return the three largest values from column col2 of the DataFrame set2df after grouping by column col1. Return a DataFrame that contains three values from col2 for each unique value of col1.

In [6]:
set2df = pd.DataFrame({'col1': list('aaabbcaabcccbbc'), 'col2': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
set2df

Unnamed: 0,col1,col2
0,a,12
1,a,345
2,a,3
3,b,1
4,b,45
5,c,14
6,a,4
7,a,52
8,b,54
9,c,23
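
No solution cell is given for Q9. A minimal sketch of one possible approach follows, using `groupby` with `nlargest`. The function name `q9` follows the pattern of the other questions, and the final `reset_index` calls are just one way to get back to a flat DataFrame.

```python
import pandas as pd

set2df = pd.DataFrame({'col1': list('aaabbcaabcccbbc'),
                       'col2': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

def q9(df):
    # nlargest(3) runs once per group and returns a Series indexed by
    # (col1, original row label)
    top3 = df.groupby('col1')['col2'].nlargest(3)
    # Drop the original row labels, then turn col1 back into a regular column
    return top3.reset_index(level=1, drop=True).reset_index()

result = q9(set2df)
print(result)
```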


### Q10. Create a DataFrame set3df from the dictionary "Company" and use the list named "index" as the DataFrame's row index. The function should return a DataFrame object of 3 rows and 2 columns.

In [9]:
Company={"Comp_name":["TCS","Infosys","CATS"],"place":["Trivandrum","Banglore","Ernakulam"]}
index=list("abc")
index

['a', 'b', 'c']

In [40]:
set3df=pd.DataFrame(data=Company,index=index)
set3df

Unnamed: 0,Comp_name,place
a,TCS,Trivandrum
b,Infosys,Banglore
c,CATS,Ernakulam


In [71]:
set3df.iloc[:,1:2]

Unnamed: 0,place
a,Trivandrum
b,Banglore
c,Ernakulam


In [76]:
def q10(set3df):
  #Fill in your code here
    x=set3df.iloc[:,0:2]
    return x

In [77]:
q10(set3df)

Unnamed: 0,Comp_name,place
a,TCS,Trivandrum
b,Infosys,Banglore
c,CATS,Ernakulam


### Q11. Create a DataFrame set4df from the dictionary "data" and use the list named "labels" as the DataFrame's row index. The function should return a DataFrame object of 10 rows and 4 columns.

In [79]:
import numpy as np
import pandas as pd

data = {'patients': ['Jack', 'Annie', 'Gopal', 'Renuka', 'Charlie', 'Charlie', 'Annie', 'Jack', 'Charlie', 'Renuka'],
        'age': [25, 30, 50, np.nan, 50, 20, 20, np.nan, 20, np.nan],
        'visits': [1, 3, 2, 3, 2, 3, 4, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']


In [81]:
set4df=pd.DataFrame(data=data,index=labels)
set4df

Unnamed: 0,patients,age,visits,priority
a,Jack,25.0,1,yes
b,Annie,30.0,3,yes
c,Gopal,50.0,2,no
d,Renuka,,3,yes
e,Charlie,50.0,2,no
f,Charlie,20.0,3,no
g,Annie,20.0,4,no
h,Jack,,1,yes
i,Charlie,20.0,2,no
j,Renuka,,1,no


In [84]:
def q11(set4df):
  #Fill in your code here
    x=set4df.iloc[:,0:4]
    return x

In [85]:
q11(set4df)

Unnamed: 0,patients,age,visits,priority
a,Jack,25.0,1,yes
b,Annie,30.0,3,yes
c,Gopal,50.0,2,no
d,Renuka,,3,yes
e,Charlie,50.0,2,no
f,Charlie,20.0,3,no
g,Annie,20.0,4,no
h,Jack,,1,yes
i,Charlie,20.0,2,no
j,Renuka,,1,no


### Solution: refer to 72 ASAP Pandas Colab Practicals.ipynb

## Break

In [69]:
np.random.seed(100)
set1df = pd.DataFrame(data=np.random.randn(6,4), index=pd.date_range('20210101', periods=6),columns=list('wxyz'))
set1df

Unnamed: 0,w,x,y,z
2021-01-01,-1.749765,0.34268,1.153036,-0.252436
2021-01-02,0.981321,0.514219,0.22118,-1.070043
2021-01-03,-0.189496,0.255001,-0.458027,0.435163
2021-01-04,-0.583595,0.816847,0.672721,-0.104411
2021-01-05,-0.53128,1.029733,-0.438136,-1.118318
2021-01-06,1.618982,1.541605,-0.251879,-0.842436


In [62]:
set1df.loc[[2,3,4]]

Unnamed: 0,0,1,2,3
2,-0.189496,0.255001,-0.458027,0.435163
3,-0.583595,0.816847,0.672721,-0.104411
4,-0.53128,1.029733,-0.438136,-1.118318


In [118]:
def q1(set1df):
    x=set1df.loc[[2]]
    return x
x1=q1(set1df)
print (x1)

          w         x        y         z
2  0.981321  0.514219  0.22118 -1.070043


In [None]:
Break***************

### Filtering 

You can also filter a dataframe by applying a condition to it. For example, say you want
to keep only the values smaller than a certain number, such as 1.2.

In [59]:
frame[frame < 1.2]

item,color,object,price
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,blue,ball,
1,green,pen,1.0
2,yellow,pencil,
3,red,paper,0.9
4,white,mug,
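
A minimal sketch of the same idea applied to a single numeric column. The frame is rebuilt here with the values that appear in this section's outputs, and the name `cheap` is illustrative. A condition on one column returns only the rows that satisfy it.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                      'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
                      'price': [1.2, 1.0, 3.3, 0.9, 1.7]})

# frame['price'] < 1.2 is a boolean Series; indexing with it keeps the True rows
cheap = frame[frame['price'] < 1.2]
print(cheap)
```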


### DataFrame from Nested dict 

A very common data structure used in Python is a nested dict, as follows:

In [28]:
nestdict = {'red': { 2012: 22, 2013: 33},
            'white': { 2011: 13, 2012: 22, 2013: 16},
            'blue': { 2011: 17, 2012: 27, 2013: 18}}
frame2 = pd.DataFrame(nestdict)
frame2

Unnamed: 0,red,white,blue
2012,22.0,22,27
2013,33.0,16,18
2011,,13,17


### Transposition of a DataFrame 

An operation that you might need when you’re dealing with tabular data structures is
transposition (that is, columns become rows and rows become columns). pandas allows
you to do this in a very simple way. You can get the transpose of a dataframe by
accessing its T attribute.

In [16]:
frame2.T

Unnamed: 0,2012,2013,2011
red,22.0,33.0,
white,22.0,16.0,13.0
blue,27.0,18.0,17.0


## The Index Objects

Now that you know what the series and the dataframe are and how they are structured,
you can likely perceive the peculiarities of these data structures. Indeed, the majority of
their excellent characteristics are due to the presence of an Index object that’s integrated
in these data structures.  
The Index objects are responsible for the labels on the axes and other metadata such as
the names of the axes. You have already seen how an array containing labels is converted
into an Index object when you specify the index option in the constructor.

In [62]:
ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
ser.index

Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')

### Methods on Index

Some methods return information about the index of a data structure. For example, idxmin() and idxmax() return, respectively, the index label of the lowest value and the index label of the highest value.

In [63]:
ser.idxmin()

'blue'

In [64]:
ser.idxmax()

'white'

### Index with Duplicate Labels

So far, every index you have seen within a single data structure has had
unique labels. Although many functions require this condition to run, it is
not mandatory for pandas data structures.

In [29]:
serd = pd.Series(range(6), index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
serd

white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [66]:
serd['white']

white    0
white    1
dtype: int64

In [67]:
serd.index.is_unique

False

In [68]:
frame.index.is_unique

True

## Other Functionalities on Indexes

Compared to the data structures commonly used with Python, pandas not only
takes advantage of the high performance of NumPy arrays but also
integrates indexes into its data structures.

This section analyzes in detail a number of basic features that take advantage of this mechanism.

• Reindexing

• Dropping

• Alignment

## Reindexing
It was previously stated that once it’s declared in a data structure, the Index object
cannot be changed. This is true, but reindexing lets you work around this
restriction.  
In fact, it is possible to obtain a new data structure from an existing one, with a
newly defined index. Labels that are missing from the original are filled with NaN.  

In [30]:
ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In [70]:
ser.reindex(['three', 'four', 'five', 'one'])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64
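
The NaN for the missing label 'five' is also why the result above switches to float64. As a small sketch, `reindex()` accepts a `fill_value` argument that supplies a replacement for missing labels. The names `with_nan` and `with_zero` are illustrative.

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# Default behaviour: the missing label 'five' becomes NaN, so the dtype is float64
with_nan = ser.reindex(['three', 'four', 'five', 'one'])

# fill_value replaces missing labels, so the values stay integers
with_zero = ser.reindex(['three', 'four', 'five', 'one'], fill_value=0)

# reindex returns a new Series; the original keeps its index
print(with_zero)
print(ser.index.tolist())
```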

In [32]:
ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

In [33]:
ser3.reindex(range(6), method='ffill') # forward fill: each missing label takes the last preceding value

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

In [34]:
ser3.reindex(range(6), method='bfill') # backward fill: each missing label takes the next following value

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64

In [35]:
frame.reindex(range(5), method='ffill', columns=['colors', 'price', 'new', 'object'])

item,colors,price,new,object
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,blue,1.2,blue,ball
1,green,1.0,green,pen
2,yellow,3.3,yellow,pencil
3,red,0.9,red,paper
4,white,1.7,white,mug


### Dropping

Another operation that is connected to Index objects is dropping. Deleting a row or a
column becomes simple, due to the labels used to indicate the indexes and column names.
Also in this case, pandas provides a specific function for this operation, called
drop(). This method will return a new object without the items that you want to delete.
For example, take the case where we want to remove a single item from a series. To
do this, define a generic series of four elements with four distinct labels.

In [36]:
ser = pd.Series(np.arange(4.), index=['red', 'blue', 'yellow', 'white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [38]:
ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64

In [39]:
ser.drop(['blue','white'])

red       0.0
yellow    2.0
dtype: float64

In [40]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index=['red', 'blue', 'yellow', 'white'],
                    columns=['ball', 'pen', 'pencil', 'paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [41]:
frame.drop(['blue','yellow'])

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
white,12,13,14,15


In [42]:
frame.drop(['pen','pencil'], axis=1)

Unnamed: 0,ball,paper
red,0,3
blue,4,7
yellow,8,11
white,12,15


### Arithmetic and Data Alignment

In [43]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])

In [44]:
s1 + s2

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

In [45]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
          index=['red', 'blue', 'yellow', 'white'],
          columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
          index=['blue', 'green', 'white', 'yellow'],
          columns=['mug','pen','ball'])
frame1

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [46]:
frame2

Unnamed: 0,mug,pen,ball
blue,0,1,2
green,3,4,5
white,6,7,8
yellow,9,10,11


In [86]:
frame1 + frame2

Unnamed: 0,ball,mug,paper,pen,pencil
blue,6.0,,,6.0,
green,,,,,
red,,,,,
white,20.0,,,20.0,
yellow,19.0,,,19.0,


## Operations between Data Structures

Now that you are familiar with data structures such as series and dataframe and you
have seen how various elementary operations can be performed on them, it’s time to move on
to operations involving two or more of these structures.

For example, in the previous section, you saw how the arithmetic operators apply
between two of these objects. This section looks more closely at the
operations that can be performed between two data structures.

### Flexible Arithmetic Methods

You’ve just seen how to use mathematical operators directly on the pandas data
structures. The same operations can also be performed using appropriate methods,
called flexible arithmetic methods.

• add()

• sub()

• div()

• mul()

To call these functions, you use a different syntax from the one you use with
mathematical operators. For example, instead of writing a sum
between two dataframes as frame1 + frame2, you use the following format:

In [47]:
frame1.add(frame2)

Unnamed: 0,ball,mug,paper,pen,pencil
blue,6.0,,,6.0,
green,,,,,
red,,,,,
white,20.0,,,20.0,
yellow,19.0,,,19.0,


As you can see, the results are the same as what you’d get using the addition operator +.
Note also that if the indexes and column names differ greatly from one dataframe to
another, you’ll find yourself with a new dataframe full of NaN values. You’ll see later in
this chapter how to handle this kind of data.
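
One practical difference from the + operator is that the flexible methods accept extra arguments. As a small sketch, `fill_value` substitutes a value when a cell is missing from one of the two dataframes. Cells missing from both still come out as NaN. The name `filled` is illustrative.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# A cell present in only one frame is added to 0 instead of producing NaN
filled = frame1.add(frame2, fill_value=0)
print(filled)
```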

### Operations between DataFrame and Series

Coming back to the arithmetic operators, pandas allows you to perform operations
between different structures, for example between a dataframe and a series.
You can define these two structures in the following way.

In [88]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
          index=['red', 'blue', 'yellow', 'white'],
          columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [89]:
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
ser

ball      0
pen       1
pencil    2
paper     3
dtype: int32

In [90]:
frame - ser

Unnamed: 0,ball,pen,pencil,paper
red,0,0,0,0
blue,4,4,4,4
yellow,8,8,8,8
white,12,12,12,12


In [91]:
ser['mug'] = 9
ser

ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64

In [92]:
frame - ser

Unnamed: 0,ball,mug,paper,pen,pencil
red,0,,0,0,0
blue,4,,4,4,4
yellow,8,,8,8,8
white,12,,12,12,12
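
The subtraction above matches the series labels against the column names and repeats the operation for every row. To match on the row index instead, the flexible method `sub()` accepts an `axis` argument. A small sketch follows. The series `row_offsets` is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# One value per row label (hypothetical data, chosen to equal the 'ball' column)
row_offsets = pd.Series([0, 4, 8, 12], index=['red', 'blue', 'yellow', 'white'])

# axis=0 aligns the series with the row index, so each row loses its own offset
by_row = frame.sub(row_offsets, axis=0)
print(by_row)
```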


## Function Application and Mapping

### This section covers the pandas library functions.

### Functions by Element

The pandas library is built on the foundations of NumPy and then extends many of its
features by adapting them to new data structures such as series and dataframe. Among these
are the universal functions, called ufuncs. This class of functions operates element by
element on the data structure.

In [6]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
          index=['red', 'blue', 'yellow', 'white'],
          columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [7]:
np.sqrt(frame)

Unnamed: 0,ball,pen,pencil,paper
red,0.0,1.0,1.414214,1.732051
blue,2.0,2.236068,2.44949,2.645751
yellow,2.828427,3.0,3.162278,3.316625
white,3.464102,3.605551,3.741657,3.872983


### Functions by Row or Column 

The application of the functions is not limited to the ufunc functions, but also includes
those defined by the user. The important point is that they operate on a one-dimensional
array, giving a single number as a result. For example, you can define a lambda function
that calculates the range covered by the elements in an array.

In [5]:
frame.max()

ball      12
pen       13
pencil    14
paper     15
dtype: int32

In [94]:
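# A lambda that computes the range (max minus min) of a one-dimensional array
f = lambda x: x.max() - x.min()

# The same function written with def; this definition replaces the lambda above
def f(x):
    return x.max() - x.min()

# apply() calls f once per column by default
frame.apply(f)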

ball      12
pen       12
pencil    12
paper     12
dtype: int64

The result this time is one value per column. If you prefer to apply the
function by row instead of by column, set the axis option to 1.

In [95]:
frame.apply(f, axis=1)

red       3
blue      3
yellow    3
white     3
dtype: int64

The function passed to apply() does not have to return a scalar value. It can also
return a series. A useful case is computing several statistics
at once. In this case, you get two or more values for each column.
This can be done by defining a function in the following manner:

In [96]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,ball,pen,pencil,paper
min,0,1,2,3
max,12,13,14,15
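
As an alternative sketch, the same min/max table can be produced without a helper function by passing a list of function names to `agg()`. The name `summary` is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Each name in the list becomes one row of the result, computed per column
summary = frame.agg(['min', 'max'])
print(summary)
```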


## Statistics Functions

Most of the statistical functions for arrays are still valid for dataframe, so using the
apply() function is no longer necessary. For example, functions such as sum() and
mean() can calculate the sum and the average, respectively, of the elements contained
within a dataframe.

In [97]:
frame.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [98]:
frame.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [99]:
frame.describe()

Unnamed: 0,ball,pen,pencil,paper
count,4.0,4.0,4.0,4.0
mean,6.0,7.0,8.0,9.0
std,5.163978,5.163978,5.163978,5.163978
min,0.0,1.0,2.0,3.0
25%,3.0,4.0,5.0,6.0
50%,6.0,7.0,8.0,9.0
75%,9.0,10.0,11.0,12.0
max,12.0,13.0,14.0,15.0


## Sorting and Ranking

Another fundamental operation that uses indexing is sorting. Sorting the data is often
a necessity and it is very important to be able to do it easily. pandas provides the
sort_index() function, which returns a new object that holds the same data as the original,
but with the elements ordered by their index labels.

In [100]:
ser = pd.Series([5, 0, 3, 8, 4], index=['red','blue','yellow','white','green'])
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [101]:
ser.sort_index()

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

In [102]:
ser.sort_index(ascending=False)

yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

In [103]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
          index=['red', 'blue', 'yellow', 'white'],
          columns=['ball','pen','pencil','paper'])
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [104]:
frame.sort_index()

Unnamed: 0,ball,pen,pencil,paper
blue,4,5,6,7
red,0,1,2,3
white,12,13,14,15
yellow,8,9,10,11


In [105]:
frame.sort_index(axis=1)

Unnamed: 0,ball,paper,pen,pencil
red,0,3,1,2
blue,4,7,5,6
yellow,8,11,9,10
white,12,15,13,14


In [112]:
ser.sort_values()

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

In [113]:
frame.sort_values(by='pen')

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [114]:
frame.sort_values(by=['pen','pencil'])

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [115]:
ser.rank()

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [116]:
ser.rank(method='first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [117]:
ser.rank(ascending=False)

red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64
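
In the series above, `rank()` and `rank(method='first')` give the same result because no two values are equal. The methods only differ when there are ties. A short sketch with a made-up series (`tied`) that contains a repeated value:

```python
import pandas as pd

tied = pd.Series([5, 0, 3, 5], index=['red', 'blue', 'yellow', 'white'])

# Default ('average'): tied values share the mean of the positions they occupy
avg_rank = tied.rank()
# 'first': tied values are ranked in the order they appear
first_rank = tied.rank(method='first')
# 'min': tied values all receive the lowest position in the tie
min_rank = tied.rank(method='min')

print(pd.DataFrame({'average': avg_rank, 'first': first_rank, 'min': min_rank}))
```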

## Correlation and Covariance

Two important statistical calculations are correlation and covariance, expressed in
pandas by the corr() and cov() functions. These kinds of calculations normally involve
two series.

In [18]:
seq = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq.corr(seq2)

0.7745966692414835

In [16]:
seq                                

2006    1
2007    2
2008    3
2009    4
2010    4
2011    3
2012    2
2013    1
dtype: int64

In [17]:
seq2

2006    3
2007    4
2008    3
2009    4
2010    5
2011    4
2012    3
2013    2
dtype: int64

In [19]:
seq.cov(seq2)

0.8571428571428571
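
The two results are related: the Pearson correlation, which `corr()` computes by default, equals the covariance divided by the product of the two standard deviations. A quick check on the same data (the name `manual` is illustrative):

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], years)
seq2 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], years)

# cov() and std() both use the sample (n - 1) normalisation, so the ratio
# reproduces the Pearson correlation coefficient
manual = seq.cov(seq2) / (seq.std() * seq2.std())
print(manual, seq.corr(seq2))
```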

In [20]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
          index=['red', 'blue', 'yellow', 'white'],
          columns=['ball','pen','pencil','paper'])
frame2

Unnamed: 0,ball,pen,pencil,paper
red,1,4,3,6
blue,4,5,6,1
yellow,3,3,1,5
white,4,1,6,4


In [21]:
frame2.corr()

Unnamed: 0,ball,pen,pencil,paper
ball,1.0,-0.276026,0.57735,-0.763763
pen,-0.276026,1.0,-0.079682,-0.361403
pencil,0.57735,-0.079682,1.0,-0.692935
paper,-0.763763,-0.361403,-0.692935,1.0


In [123]:
frame2.cov()

Unnamed: 0,ball,pen,pencil,paper
ball,2.0,-0.666667,2.0,-2.333333
pen,-0.666667,2.916667,-0.333333,-1.333333
pencil,2.0,-0.333333,6.0,-3.666667
paper,-2.333333,-1.333333,-3.666667,4.666667


Using the corrwith() method, you can calculate the pairwise correlations between
the columns or rows of a dataframe and a series or another dataframe.

In [130]:
ser = pd.Series([0, 1, 2, 3, 9], index=['red','blue','yellow','white','green'])
ser

red       0
blue      1
yellow    2
white     3
green     9
dtype: int64

In [131]:
frame2.corrwith(ser)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

In [132]:
frame2.corrwith(frame)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

## "Not a Number" Data

In the previous sections, you saw how easily missing data can arise. Missing entries are
recognizable in the data structures by the NaN (Not a Number) value, and values
that are not defined in a data structure are quite common in data analysis.
pandas is designed to handle this situation. In this
section, you will learn how to treat these values so that many issues can be avoided.
For example, in the pandas library, calculating descriptive statistics excludes NaN values
implicitly.

### Assigning a NaN Value

If you need to specifically assign a NaN value to an element in a data structure, you can
use the np.nan value of the NumPy library. (The np.NaN alias was removed in NumPy 2.0.)

In [23]:
ser = pd.Series([0,1,2,np.nan,9], index=['red','blue','yellow','white','green'])
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

In [24]:
ser['white'] = None
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

### Filtering Out NaN Values

There are various ways to eliminate the NaN values during data analysis. Eliminating them
by hand, element by element, can be very tedious and risky, and you’re never sure that
you eliminated all the NaN values. This is where the dropna() function comes to your aid.

In [17]:
ser.dropna()

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

You can also directly perform the filtering function by placing notnull() in the
selection condition.

In [18]:
ser[ser.notnull()]

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

In [20]:
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

If you’re dealing with a dataframe, it gets a little more complex. By default, the
dropna() function removes every row that contains at least one NaN value, so a single
column of NaN values is enough to eliminate all the rows.

In [27]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                     index=['blue','green','red'],
                     columns=['ball','mug','pen'])
frame3

Unnamed: 0,ball,mug,pen
blue,6.0,,6.0
green,,,
red,2.0,,5.0


In [28]:
frame3.dropna()

Unnamed: 0,ball,mug,pen


Therefore, to avoid having entire rows and columns disappear completely, you
should specify the how option, assigning a value of all to it. This tells the dropna()
function to delete only the rows or columns in which all elements are NaN.

In [139]:
frame3.dropna(how='all')

Unnamed: 0,ball,mug,pen
blue,6.0,,6.0
red,2.0,,5.0
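
The same option works along the other axis, and `dropna()` also has a `thresh` argument that sets the minimum number of non-NaN values a row needs in order to be kept. A brief sketch on the same frame (the result names are illustrative):

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# axis=1 with how='all' drops only the columns that are entirely NaN ('mug')
no_empty_cols = frame3.dropna(axis=1, how='all')

# thresh=2 keeps the rows that have at least two non-NaN values
at_least_two = frame3.dropna(thresh=2)

print(no_empty_cols)
print(at_least_two)
```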


### Filling in NaN Occurrences

Rather than filter NaN values within data structures, with the risk of discarding them along with values that could be relevant in the context of data analysis, you can replace them with other numbers. For most purposes, the fillna() function is a great choice. Its main argument is the value with which to replace any NaN. It can be a single value used everywhere, or a dict that maps each column name to its own replacement value.

In [29]:
frame3.fillna(0)

Unnamed: 0,ball,mug,pen
blue,6.0,0.0,6.0
green,0.0,0.0,0.0
red,2.0,0.0,5.0


In [141]:
frame3.fillna({'ball':1, 'mug':0, 'pen': 99})

Unnamed: 0,ball,mug,pen
blue,6.0,0.0,6.0
green,1.0,0.0,99.0
red,2.0,0.0,5.0
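
A common variation, sketched here as an assumption about typical use rather than part of the original notebook, is to fill each column with its own mean. `fillna()` also accepts a series whose labels are matched against the column names. A column that is entirely NaN has no mean, so it stays NaN.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# frame3.mean() is a Series indexed by column name: ball 4.0, mug NaN, pen 5.5
filled = frame3.fillna(frame3.mean())
print(filled)
```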


## Hierarchical Indexing and Leveling

Hierarchical indexing is a very important feature of pandas, as it allows you to have
multiple levels of indexes on a single axis. It gives you a way to work with data in multiple
dimensions while continuing to work in a two-dimensional structure.
Let’s start with a simple example, creating a series containing two arrays of indexes,
that is, creating a structure with two levels.

In [25]:
mser = pd.Series(np.random.rand(8),
                index=[['white','white','white','blue','blue','red','red','red'],
                      ['up','down','right','up','down','up','down','left']])
mser

white  up       0.337767
       down     0.013559
       right    0.247099
blue   up       0.179219
       down     0.744752
red    up       0.393401
       down     0.333457
       left     0.600779
dtype: float64

In [26]:
mser.index

MultiIndex([('white',    'up'),
            ('white',  'down'),
            ('white', 'right'),
            ( 'blue',    'up'),
            ( 'blue',  'down'),
            (  'red',    'up'),
            (  'red',  'down'),
            (  'red',  'left')],
           )

Hierarchical indexing makes it simpler to select subsets of values.
You can select the values for a given label of the first index level in
the usual way:

In [28]:
mser['white']

up       0.337767
down     0.013559
right    0.247099
dtype: float64

Or you can select values for a given value of the second index, in the following
manner:

In [29]:
mser[:,'up']

white    0.337767
blue     0.179219
red      0.393401
dtype: float64

Intuitively, if you want to select a specific value, you specify both indexes.

In [30]:
mser['white','up']

0.337766580088508
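
The same selections can be written more explicitly with `loc` and `xs()`. The sketch below uses fixed values instead of random ones so that the results are reproducible. The numbers themselves are made up.

```python
import pandas as pd

mser = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
                 index=[['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red'],
                        ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']])

# All entries under the first-level label 'white'
white = mser.loc['white']

# Cross-section: every entry whose second-level label is 'up'
up = mser.xs('up', level=1)

# A single value, addressed by a tuple holding both labels
one = mser.loc[('white', 'up')]

print(white, up, one, sep='\n')
```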

Hierarchical indexing plays a critical role in reshaping data and in group-based
operations such as building a pivot table. For example, the data can be rearranged into
a dataframe with the unstack() method. It converts a series with a hierarchical index
into a dataframe in which the labels of the second level become the columns.
Combinations of labels that do not occur in the series are filled with NaN.

In [147]:
mser.unstack()

Unnamed: 0,down,left,right,up
blue,0.408367,,,0.08148
red,0.374153,0.325975,,0.465264
white,0.512268,,0.639885,0.661039


To perform the reverse operation, converting a dataframe to a series, use the
stack() method.

In [148]:
frame

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [149]:
frame.stack()

red     ball       0
        pen        1
        pencil     2
        paper      3
blue    ball       4
        pen        5
        pencil     6
        paper      7
yellow  ball       8
        pen        9
        pencil    10
        paper     11
white   ball      12
        pen       13
        pencil    14
        paper     15
dtype: int32
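
Because stack() and unstack() are inverse operations, applying one after the other should give back the original data. A minimal sketch of that round trip follows; the frame is rebuilt here from `np.arange(16)` on the assumption that this matches the `frame` shown above, and the result is realigned to the original row and column order before comparing, since the ordering of the reshaped labels is not something this sketch relies on.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the frame displayed above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['red', 'blue', 'yellow', 'white'],
    columns=['ball', 'pen', 'pencil', 'paper'],
)

stacked = frame.stack()        # series with a two-level (row label, column label) index
restored = stacked.unstack()   # innermost level moves back to the columns

# Realign to the original label order before comparing.
same = restored.loc[frame.index, frame.columns].equals(frame)

print(stacked.index.nlevels)
print(same)
```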

With a dataframe, you can define a hierarchical index for both the rows and the
columns. When you create the dataframe, pass an array of arrays to both the index and
the columns options.

In [32]:
mframe = pd.DataFrame(np.random.randn(16).reshape(4,4),
                     index=[['white','white','red','red'], ['up','down','up','down']],
                     columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe

Unnamed: 0_level_0,Unnamed: 1_level_0,pen,pen,paper,paper
Unnamed: 0_level_1,Unnamed: 1_level_1,1,2,1,2
white,up,-0.439913,-0.626905,-0.096799,-0.681926
white,down,-0.172315,0.620844,0.609874,0.664244
red,up,-0.469982,-2.35146,-0.24797,1.971591
red,down,0.081544,0.309313,-0.011495,0.938054
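
Selection works on both axes of such a dataframe. The sketch below is an illustration under assumptions, not part of the original notebook: the frame `mf` has the same labels as `mframe` but fixed integer values, so the selected results can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same labels as mframe but fixed values.
mf = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=[['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
    columns=[['pen', 'pen', 'paper', 'paper'], [1, 2, 1, 2]],
)

# Selecting on the outer column level returns the sub-frame below it.
pen = mf['pen']

# A tuple per axis addresses a single cell: (row labels), (column labels).
cell = mf.loc[('white', 'down'), ('paper', 2)]

print(pen)
print(cell)
```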


## Reordering and Sorting Levels

Occasionally, you might need to rearrange the order of the levels on an axis or sort
the data by the labels of a specific level.
The swaplevel() method accepts the names of the two levels that you want to
interchange and returns a new object with those levels swapped, leaving the data
unmodified. The sort_index() method, called with the level option, returns a new
object sorted by the labels of that level.

In [33]:
mframe.columns.names = ['objects','id']
mframe.index.names = ['colors','status']
mframe

Unnamed: 0_level_0,objects,pen,pen,paper,paper
Unnamed: 0_level_1,id,1,2,1,2
colors,status,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
white,up,-0.439913,-0.626905,-0.096799,-0.681926
white,down,-0.172315,0.620844,0.609874,0.664244
red,up,-0.469982,-2.35146,-0.24797,1.971591
red,down,0.081544,0.309313,-0.011495,0.938054


In [34]:
mframe.swaplevel('colors','status')

Unnamed: 0_level_0,objects,pen,pen,paper,paper
Unnamed: 0_level_1,id,1,2,1,2
status,colors,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
up,white,-0.439913,-0.626905,-0.096799,-0.681926
down,white,-0.172315,0.620844,0.609874,0.664244
up,red,-0.469982,-2.35146,-0.24797,1.971591
down,red,0.081544,0.309313,-0.011495,0.938054


In [35]:
mframe.sort_index(level='colors')

Unnamed: 0_level_0,objects,pen,pen,paper,paper
Unnamed: 0_level_1,id,1,2,1,2
colors,status,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
red,down,0.081544,0.309313,-0.011495,0.938054
red,up,-0.469982,-2.35146,-0.24797,1.971591
white,down,-0.172315,0.620844,0.609874,0.664244
white,up,-0.439913,-0.626905,-0.096799,-0.681926
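
The same two operations can be checked on a small frame with fixed values. This is a minimal sketch under assumptions: `mf` is a hypothetical frame with the row levels named as above and plain (single-level) columns to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: named two-level row index, plain columns.
mf = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_arrays(
        [['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
        names=['colors', 'status'],
    ),
    columns=['pen', 'paper'],
)

swapped = mf.swaplevel('colors', 'status')   # same rows, levels in the other order
by_color = mf.sort_index(level='colors')     # rows reordered by the 'colors' labels

print(list(swapped.index.names))
print(by_color.index.tolist())
```

Neither call modifies `mf`; both return new objects.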


## Summary Statistics by Level

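In older pandas releases, many descriptive and summary statistics on a dataframe or a
series accepted a level option that selected the level at which the statistic was
computed. That option was deprecated in pandas 1.3 and removed in 2.0. The same result
is obtained by grouping on the level with groupby() and then applying the statistic.
For example, to compute a sum for each label of a row level, group by the level name.

In [36]:
# sort=False keeps the groups in order of first appearance (white, then red)
mframe.groupby(level='colors', sort=False).sum()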


objects,pen,pen,paper,paper
id,1,2,1,2
colors,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
white,-0.612228,-0.006061,0.513075,-0.017681
red,-0.388438,-2.042147,-0.259464,2.909645


In [37]:
# sum(level='id', axis=1) was removed in pandas 2.0; group the transposed frame instead
mframe.T.groupby(level='id').sum().T


Unnamed: 0_level_0,id,1,2
colors,status,Unnamed: 2_level_1,Unnamed: 3_level_1
white,up,-0.536712,-1.308831
white,down,0.437559,1.285089
red,up,-0.717952,-0.379869
red,down,0.070049,1.247368
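
Grouping on a level works for statistics other than the sum, and it is the form available in pandas 2.0 and later, where the level option on reductions no longer exists. The following is a minimal sketch with assumed data: `mf` is a hypothetical frame with the same labels and level names as `mframe` but fixed integer values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same labels as mframe but fixed values.
mf = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_arrays(
        [['white', 'white', 'red', 'red'], ['up', 'down', 'up', 'down']],
        names=['colors', 'status']),
    columns=pd.MultiIndex.from_arrays(
        [['pen', 'pen', 'paper', 'paper'], [1, 2, 1, 2]],
        names=['objects', 'id']),
)

# Mean per label of the 'colors' row level.
mean_by_color = mf.groupby(level='colors', sort=False).mean()

# Maximum per label of the 'id' column level: transpose, group the rows, transpose back.
max_by_id = mf.T.groupby(level='id').max().T

print(mean_by_color)
print(max_by_id)
```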


## End