In [None]:
import numpy as np
import pandas as pd

# Reindexing and altering labels

## Filling while reindexing

reindex() takes an optional parameter, method, which is a filling method chosen from the following table:

| Method | Action |
| --- | --- |
| pad / ffill | Fill values forward |
| bfill / backfill | Fill values backward |
| nearest | Fill from the nearest index value |

In [None]:
eng = pd.date_range('01-05-2018',periods = 10, freq='m')
ts = pd.DataFrame(np.random.rand(10),index = eng,columns=['a'])
ts

Unnamed: 0,a
2018-01-31,0.344658
2018-02-28,0.976188
2018-03-31,0.840416
2018-04-30,0.169581
2018-05-31,0.024219
2018-06-30,0.304059
2018-07-31,0.375037
2018-08-31,0.953195
2018-09-30,0.346242
2018-10-31,0.980717


In [None]:
ts2 = ts[1:8:2]

In [None]:
ts2.reindex(ts.index)

Unnamed: 0,a
2018-01-31,
2018-02-28,0.976188
2018-03-31,
2018-04-30,0.169581
2018-05-31,
2018-06-30,0.304059
2018-07-31,
2018-08-31,0.953195
2018-09-30,
2018-10-31,


In [None]:
ts2.reindex(ts.index,method='ffill')

Unnamed: 0,a
2018-01-31,
2018-02-28,0.976188
2018-03-31,0.976188
2018-04-30,0.169581
2018-05-31,0.169581
2018-06-30,0.304059
2018-07-31,0.304059
2018-08-31,0.953195
2018-09-30,0.953195
2018-10-31,0.953195


In [None]:
ts2.reindex(ts.index,method='bfill')

Unnamed: 0,a
2018-01-31,0.976188
2018-02-28,0.976188
2018-03-31,0.169581
2018-04-30,0.169581
2018-05-31,0.304059
2018-06-30,0.304059
2018-07-31,0.953195
2018-08-31,0.953195
2018-09-30,
2018-10-31,


In [None]:
ts2.reindex(ts.index,method ='nearest')

Unnamed: 0,a
2018-01-31,0.976188
2018-02-28,0.976188
2018-03-31,0.169581
2018-04-30,0.169581
2018-05-31,0.304059
2018-06-30,0.304059
2018-07-31,0.953195
2018-08-31,0.953195
2018-09-30,0.953195
2018-10-31,0.953195


These methods require that the index is monotonically increasing or decreasing.

Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:

In [None]:
ts2.reindex(ts.index).fillna(method='ffill') # the first value stays NaN because there is no earlier value to fill from

Unnamed: 0,a
2018-01-31,
2018-02-28,0.976188
2018-03-31,0.976188
2018-04-30,0.169581
2018-05-31,0.169581
2018-06-30,0.304059
2018-07-31,0.304059
2018-08-31,0.953195
2018-09-30,0.953195
2018-10-31,0.953195


In [None]:
ts2.reindex(ts.index).fillna(method='bfill') # the last two values stay NaN because there is no later value to fill back from

Unnamed: 0,a
2018-01-31,0.976188
2018-02-28,0.976188
2018-03-31,0.169581
2018-04-30,0.169581
2018-05-31,0.304059
2018-06-30,0.304059
2018-07-31,0.953195
2018-08-31,0.953195
2018-09-30,
2018-10-31,


In [None]:
ts2.reindex(ts.index).fillna(method='nearest') # fillna() has no 'nearest' method, so this raises a ValueError

ValueError: ignored

reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() will not perform any checks on the order of the index.
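The difference can be sketched with a small hypothetical Series, `unordered`, whose index is deliberately out of order. The chained `ffill()` call is used here as a stand-in for `fillna(method='ffill')`:

```python
import pandas as pd

# Hypothetical Series whose index is neither increasing nor decreasing.
unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# reindex() with a fill method checks the ordering of the original index.
try:
    unordered.reindex([1, 2, 3, 4], method="ffill")
    raised = False
except ValueError:
    raised = True

# Reindexing first and filling afterwards performs no such check:
# label 4 is simply filled from the row above it in the new order.
filled = unordered.reindex([1, 2, 3, 4]).ffill()

print(raised)
print(filled)
```

Because the second form ignores index order, it fills from whatever row happens to come before, which may or may not be what you want on unsorted data.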

## Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. limit specifies the maximum number of consecutive values to fill:

In [None]:
ts2.reindex(ts.index,method = 'ffill',limit=1)

Unnamed: 0,a
2018-01-31,
2018-02-28,0.976188
2018-03-31,0.976188
2018-04-30,0.169581
2018-05-31,0.169581
2018-06-30,0.304059
2018-07-31,0.304059
2018-08-31,0.953195
2018-09-30,0.953195
2018-10-31,


In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [None]:
ts2.reindex(ts.index,method='ffill',tolerance='1 Day')

Unnamed: 0,a
2018-01-31,
2018-02-28,0.976188
2018-03-31,
2018-04-30,0.169581
2018-05-31,
2018-06-30,0.304059
2018-07-31,
2018-08-31,0.953195
2018-09-30,
2018-10-31,


Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
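As a rough illustration of the string coercion, the sketch below reindexes a hypothetical daily Series (`daily`) onto two intraday timestamps. The names and dates are made up for the example; the string `"12h"` and an explicit `pd.Timedelta` are expected to behave the same way:

```python
import pandas as pd

# Hypothetical daily series, one value per midnight.
daily = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2018-01-01", periods=3, freq="D"),
)

# Two target timestamps: 6 hours and 18 hours after the previous label.
target = pd.DatetimeIndex(["2018-01-01 06:00", "2018-01-02 18:00"])

# The tolerance given as a string is coerced to a Timedelta.
as_string = daily.reindex(target, method="ffill", tolerance="12h")

# The same tolerance given as an explicit Timedelta.
as_timedelta = daily.reindex(
    target, method="ffill", tolerance=pd.Timedelta(hours=12)
)

print(as_string)
```

The first target is within 12 hours of its preceding label and is filled; the second is 18 hours away and stays NaN.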

## Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [None]:
df = pd.DataFrame(
    {
        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
    }
)

In [None]:
df

Unnamed: 0,one,two,three
a,0.975297,-0.640324,
b,0.889013,0.354877,0.093643
c,0.251497,-0.809261,-1.581803
d,,-2.292713,-0.57477


In [None]:
df.drop(['a','d'],axis=0)

Unnamed: 0,one,two,three
b,0.889013,0.354877,0.093643
c,0.251497,-0.809261,-1.581803


In [None]:
df.drop(['one'],axis = 1)

Unnamed: 0,two,three
a,-0.640324,
b,0.354877,0.093643
c,-0.809261,-1.581803
d,-2.292713,-0.57477


Note that the following also works, but is a bit less obvious / clean:

In [None]:
df.reindex(df.index.difference(['a','d']))

Unnamed: 0,one,two,three
b,0.889013,0.354877,0.093643
c,0.251497,-0.809261,-1.581803


## Renaming / mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [None]:

s = pd.Series(np.random.randn(5),index=list('abcde'))
s

a   -0.750127
b    1.866692
c    1.778551
d   -0.074709
e   -0.705924
dtype: float64

In [None]:
s.rename(str.upper)

A   -0.750127
B    1.866692
C    1.778551
D   -0.074709
E   -0.705924
dtype: float64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). A dict or Series can also be used:

In [None]:
df.rename(
    columns={'one':'foo','two':'soo'},
    index={'A':"Apple",'B':'Banana','d':'druian'}

)

Unnamed: 0,foo,soo,three
a,0.975297,-0.640324,
b,0.889013,0.354877,0.093643
c,0.251497,-0.809261,-1.581803
druian,,-2.292713,-0.57477


If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping don’t throw an error.
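A minimal sketch of both behaviours, using a small hypothetical frame (`frame`) rather than the one above. The `errors="raise"` option is shown as one way to make unknown labels fail loudly:

```python
import pandas as pd

frame = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])

# "missing" is not a column, so it is silently ignored; "two" has no
# entry in the mapping, so it keeps its name.
renamed = frame.rename(columns={"one": "foo", "missing": "bar"})
print(list(renamed.columns))

# With errors="raise", unknown labels in the mapping raise a KeyError.
try:
    frame.rename(columns={"missing": "bar"}, errors="raise")
    strict_failed = False
except KeyError:
    strict_failed = True
print(strict_failed)
```

The lenient default is convenient for applying one shared mapping to several frames; the strict form helps catch typos in label names.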

DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper and the axis to apply that mapping to.

In [None]:
df.rename({'one':'foo',"two":'soo'}, axis='columns')

Unnamed: 0,foo,soo,three
a,0.975297,-0.640324,
b,0.889013,0.354877,0.093643
c,0.251497,-0.809261,-1.581803
d,,-2.292713,-0.57477


In [None]:
df.rename({'A':"Apple",'B':'Banana','d':'druian'},axis='index')

Unnamed: 0,one,two,three
a,0.975297,-0.640324,
b,0.889013,0.354877,0.093643
c,0.251497,-0.809261,-1.581803
druian,,-2.292713,-0.57477


Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [None]:
s.rename('scalar_name')

a   -0.750127
b    1.866692
c    1.778551
d   -0.074709
e   -0.705924
Name: scalar_name, dtype: float64

**The methods DataFrame.rename_axis() and Series.rename_axis() allow specific names of a MultiIndex to be changed (as opposed to the labels).**

In [None]:
df=pd.DataFrame({'ajay':[1,2,3,4,5,6],'kumar':[10,20,30,40,50,60]},
                index=pd.MultiIndex.from_product(
                    [['a','b','c'],[1,2]],names=['x','y']
                )
                )

In [None]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,ajay,kumar
x,y,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,10
a,2,2,20
b,1,3,30
b,2,4,40
c,1,5,50
c,2,6,60


In [None]:
df.rename_axis(index={'x':'abc'})

Unnamed: 0_level_0,Unnamed: 1_level_0,ajay,kumar
abc,y,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,10
a,2,2,20
b,1,3,30
b,2,4,40
c,1,5,50
c,2,6,60


In [None]:
df.rename_axis(index=str.upper)

Unnamed: 0_level_0,Unnamed: 1_level_0,ajay,kumar
X,Y,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,10
a,2,2,20
b,1,3,30
b,2,4,40
c,1,5,50
c,2,6,60


# Iteration

The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention of iterating over the “keys” of the objects.

In short, basic iteration (for i in object) produces:

* Series: values

* DataFrame: column labels

Thus, for example, iterating over a DataFrame gives you the column names:

In [None]:
df = pd.DataFrame(
    {"col1": np.random.randn(3), "col2": np.random.randn(3)}, index=["a", "b", "c"]
)
df

Unnamed: 0,col1,col2
a,0.78156,1.750903
b,1.270105,-1.108372
c,0.495083,-0.182836


In [None]:
for key in df:
  print(key)

col1
col2


In [None]:
for key, value in df.items():  # a DataFrame also supports dict-like iteration
  print(key)
  print(value)

col1
a    0.781560
b    1.270105
c    0.495083
Name: col1, dtype: float64
col2
a    1.750903
b   -1.108372
c   -0.182836
Name: col2, dtype: float64


pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.

To iterate over the rows of a DataFrame, you can use the following methods:

iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series objects, which can change the dtypes and has some performance implications.

itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than iterrows(), and is in most cases preferable to use to iterate over the values of a DataFrame.

**items**

Consistent with the dict-like interface, items() iterates through key-value pairs:

* Series: (index, scalar value) pairs

* DataFrame: (column, Series) pairs

For example:

In [None]:
# see the documentation
for key,value in df.items():
  print(key)
  print(value)

col1
a    0.781560
b    1.270105
c    0.495083
Name: col1, dtype: float64
col2
a    1.750903
b   -1.108372
c   -0.182836
Name: col2, dtype: float64


**iterrows**

iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in each row.

In [None]:
for row_idx, row in df.iterrows():
  print(row_idx, row, sep='\n')

a
col1    0.781560
col2    1.750903
Name: a, dtype: float64
b
col1    1.270105
col2   -1.108372
Name: b, dtype: float64
c
col1    0.495083
col2   -0.182836
Name: c, dtype: float64


**NOTE**

Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

In [None]:
df_orig = pd.DataFrame([[1, 1.5]], columns=["int", "float"])

df_orig.dtypes

int        int64
float    float64
dtype: object

In [None]:
row = next(df_orig.iterrows())[1]
row

int      1.0
float    1.5
Name: 0, dtype: float64

In [None]:
row.dtypes


dtype('float64')

For instance, a contrived way to transpose the DataFrame would be:

In [None]:
df2 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

In [None]:
df2

Unnamed: 0,x,y
0,1,4
1,2,5
2,3,6


In [None]:
df2.T


Unnamed: 0,0,1,2
x,1,2,3
y,4,5,6


In [None]:
df2_t = pd.DataFrame({idx:values for idx,values in df2.iterrows()})
df2_t
# is same as transpose

Unnamed: 0,0,1,2
x,1,2,3
y,4,5,6


## itertuples


The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [None]:
for row in df.itertuples():
  print(row)

Pandas(Index='a', col1=0.7815602026338467, col2=1.7509029980643234)
Pandas(Index='b', col1=1.2701045377566604, col2=-1.1083716082226118)
Pandas(Index='c', col1=0.49508294823263954, col2=-0.18283591324994306)


This method does not convert the row to a Series object; it merely returns the values inside a namedtuple. Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().
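To make the dtype difference concrete, here is a small sketch that reads the same row both ways. The frame and its column names (`qty`, `price`) are hypothetical:

```python
import pandas as pd

# One integer column and one float column.
mixed = pd.DataFrame({"qty": [1], "price": [1.5]})

# iterrows() packs the row into a single Series, so the integer
# is upcast to float64 to share one dtype with the float.
_, row = next(mixed.iterrows())

# itertuples() keeps each value as it was stored in its column.
tup = next(mixed.itertuples())

print(row.dtype)
print(type(tup.qty), type(tup.price))
```

If a row's integer values matter (for example as keys or counts), this is a reason to prefer itertuples() over iterrows().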

# .dt accessor

Series has an accessor to succinctly return datetime like properties for the values of the Series, if it is a datetime/period like Series. This will return a Series, indexed like the existing Series.

In [None]:
s = pd.Series(pd.date_range('2020-05-10 09:10:12',periods=10))
s = s[0:6]
s

0   2020-05-10 09:10:12
1   2020-05-11 09:10:12
2   2020-05-12 09:10:12
3   2020-05-13 09:10:12
4   2020-05-14 09:10:12
5   2020-05-15 09:10:12
dtype: datetime64[ns]

In [None]:
s.dt.hour

0    9
1    9
2    9
3    9
4    9
5    9
dtype: int64

In [None]:
s.dt.minute

0    10
1    10
2    10
3    10
4    10
5    10
dtype: int64

In [None]:
s.dt.second

0    12
1    12
2    12
3    12
4    12
5    12
dtype: int64

In [None]:
s.dt.day

0    10
1    11
2    12
3    13
4    14
5    15
dtype: int64

In [None]:
s.dt.day_name()

0       Sunday
1       Monday
2      Tuesday
3    Wednesday
4     Thursday
5       Friday
dtype: object

In [None]:
s.dt.day_of_week

0    6
1    0
2    1
3    2
4    3
5    4
dtype: int64

In [None]:
s.dt.day_of_year


0    131
1    132
2    133
3    134
4    135
5    136
dtype: int64

In [None]:
s.dt.dayofweek

0    6
1    0
2    1
3    2
4    3
5    4
dtype: int64

In [None]:
s.dt.dayofyear

0    131
1    132
2    133
3    134
4    135
5    136
dtype: int64

In [None]:
s.dt.days_in_month

0    31
1    31
2    31
3    31
4    31
5    31
dtype: int64

In [None]:
s.dt.daysinmonth # alias for days_in_month

0    31
1    31
2    31
3    31
4    31
5    31
dtype: int64

In [None]:
s[s.dt.day ==2] # no data based on the condition

Series([], dtype: datetime64[ns])

In [None]:
s[s.dt.day_of_week==2]

3   2020-05-13 09:10:12
dtype: datetime64[ns]

In [None]:
stz = s.dt.tz_localize('US/Eastern')
stz

0   2020-05-10 09:10:12-04:00
1   2020-05-11 09:10:12-04:00
2   2020-05-12 09:10:12-04:00
3   2020-05-13 09:10:12-04:00
4   2020-05-14 09:10:12-04:00
5   2020-05-15 09:10:12-04:00
dtype: datetime64[ns, US/Eastern]

In [None]:
s.dt.tz_localize('UTC')

0   2020-05-10 09:10:12+00:00
1   2020-05-11 09:10:12+00:00
2   2020-05-12 09:10:12+00:00
3   2020-05-13 09:10:12+00:00
4   2020-05-14 09:10:12+00:00
5   2020-05-15 09:10:12+00:00
dtype: datetime64[ns, UTC]

In [None]:
s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

0   2020-05-10 05:10:12-04:00
1   2020-05-11 05:10:12-04:00
2   2020-05-12 05:10:12-04:00
3   2020-05-13 05:10:12-04:00
4   2020-05-14 05:10:12-04:00
5   2020-05-15 05:10:12-04:00
dtype: datetime64[ns, US/Eastern]

You can also format datetime values as strings with Series.dt.strftime() which supports the same format as the standard strftime().

In [None]:
s = pd.Series(pd.date_range('20130101',periods=4))
s


0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: datetime64[ns]

In [None]:
s.dt.strftime("%y/%d/%m")

0    13/01/01
1    13/02/01
2    13/03/01
3    13/04/01
dtype: object

In [None]:
s.dt.strftime("%d/%m/%y")

0    01/01/13
1    02/01/13
2    03/01/13
3    04/01/13
dtype: object

In [None]:
s.dt.strftime("%D/%M/%Y") # format codes are case-sensitive: %D is mm/dd/yy and %M is minutes, not day and month

0    01/01/13/00/2013
1    01/02/13/00/2013
2    01/03/13/00/2013
3    01/04/13/00/2013
dtype: object

In [None]:
s.dt.strftime("%d/%m/%Y")

0    01/01/2013
1    02/01/2013
2    03/01/2013
3    04/01/2013
dtype: object

In [None]:
# The .dt accessor works for period and timedelta dtypes.

In [None]:
s1 = pd.DataFrame(pd.date_range('20130101',periods=6,freq='m'),columns=['Date'])
s1

Unnamed: 0,Date
0,2013-01-31
1,2013-02-28
2,2013-03-31
3,2013-04-30
4,2013-05-31
5,2013-06-30


In [None]:
s1.dt.day  # .dt is a Series accessor; it is not available on a DataFrame

AttributeError: ignored

In [None]:
s1['Date'].dt.day

0    31
1    28
2    31
3    30
4    31
5    30
Name: Date, dtype: int64

In [None]:
s1['Date'].dt.day_name()

0    Thursday
1    Thursday
2      Sunday
3     Tuesday
4      Friday
5      Sunday
Name: Date, dtype: object

In [None]:
s1['Date'].dt.year

0    2013
1    2013
2    2013
3    2013
4    2013
5    2013
Name: Date, dtype: int64

In [None]:
s1['Date'].dt.month_name()

0     January
1    February
2       March
3       April
4         May
5        June
Name: Date, dtype: object

In [None]:
#timedelta
s = pd.Series(pd.timedelta_range("1 day 00:00:05", periods=4, freq="s"))
s

0   1 days 00:00:05
1   1 days 00:00:06
2   1 days 00:00:07
3   1 days 00:00:08
dtype: timedelta64[ns]

In [None]:
s.dt.days

0    1
1    1
2    1
3    1
dtype: int64

In [None]:
s.dt.seconds

0    5
1    6
2    7
3    8
dtype: int64

In [None]:
s.dt.components

Unnamed: 0,days,hours,minutes,seconds,milliseconds,microseconds,nanoseconds
0,1,0,0,5,0,0,0
1,1,0,0,6,0,0,0
2,1,0,0,7,0,0,0
3,1,0,0,8,0,0,0



# Vectorized string methods

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:

In [None]:
s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)
s

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [None]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses regular expressions by default (and in some cases always uses them).
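A short sketch of what the regex default means in practice, using `str.contains` on a hypothetical Series (`texts`). The `regex=False` and `na=False` arguments are shown as one way to get literal matching and plain booleans:

```python
import pandas as pd

texts = pd.Series(["a.b", "axb", pd.NA], dtype="string")

# By default the pattern is a regular expression, so "." matches any character.
regex_hits = texts.str.contains("a.b", na=False)

# With regex=False the pattern is treated as a literal substring.
literal_hits = texts.str.contains("a.b", regex=False, na=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```

Without `na=False`, the missing entry would propagate as `<NA>` in the result instead of being reported as a non-match.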

# Sorting

pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

## By index

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its index levels.

In [None]:
df = pd.DataFrame(
    {
        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
    }
)


unsorted_df = df.reindex(
    index=["a", "d", "c", "b"], columns=["three", "two", "one"]
)

In [None]:
df

Unnamed: 0,one,two,three
a,0.145442,0.570495,
b,1.246334,0.593326,-0.343553
c,1.427597,0.272651,-0.641431
d,,-1.546006,-0.035763


In [None]:
unsorted_df

Unnamed: 0,three,two,one
a,,0.570495,0.145442
d,-0.035763,-1.546006,
c,-0.641431,0.272651,1.427597
b,-0.343553,0.593326,1.246334


In [None]:
unsorted_df.sort_index()

Unnamed: 0,three,two,one
a,,0.570495,0.145442
b,-0.343553,0.593326,1.246334
c,-0.641431,0.272651,1.427597
d,-0.035763,-1.546006,


In [None]:
unsorted_df.sort_index(ascending=False)

Unnamed: 0,three,two,one
d,-0.035763,-1.546006,
c,-0.641431,0.272651,1.427597
b,-0.343553,0.593326,1.246334
a,,0.570495,0.145442


In [None]:
unsorted_df.sort_index(axis=1)

Unnamed: 0,one,three,two
a,0.145442,,0.570495
d,,-0.035763,-1.546006
c,1.427597,-0.641431,0.272651
b,1.246334,-0.343553,0.593326


In [None]:
unsorted_df['three'].sort_index()

a         NaN
b   -0.343553
c   -0.641431
d   -0.035763
Name: three, dtype: float64

Sorting by index also supports a key parameter that takes a callable function to apply to the index being sorted. For MultiIndex objects, the key is applied per-level to the levels specified by level.

In [None]:
s1 = pd.DataFrame({"a": ["B", "a", "C"], "b": [1, 2, 3], "c": [2, 3, 4]}).set_index(
    list("ab"))
s1

Unnamed: 0_level_0,Unnamed: 1_level_0,c
a,b,Unnamed: 2_level_1
B,1,2
a,2,3
C,3,4


In [None]:
s1.sort_index(level='a')

Unnamed: 0_level_0,Unnamed: 1_level_0,c
a,b,Unnamed: 2_level_1
B,1,2
C,3,4
a,2,3


In [None]:
s1.sort_index(level='a',key=lambda idx:idx.str.lower())

Unnamed: 0_level_0,Unnamed: 1_level_0,c
a,b,Unnamed: 2_level_1
a,2,3
B,1,2
C,3,4


In [None]:
s1.sort_index(level='a',ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,c
a,b,Unnamed: 2_level_1
a,2,3
C,3,4
B,1,2


In [None]:
s1.sort_index(level='b',ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,c
a,b,Unnamed: 2_level_1
C,3,4
a,2,3
B,1,2


## By values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.

In [None]:
df1 = pd.DataFrame(
    {"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]}
)
df1

Unnamed: 0,one,two,three
0,2,1,5
1,1,3,4
2,1,2,3
3,1,4,2


In [None]:
df1.sort_values(by='one')

Unnamed: 0,one,two,three
1,1,3,4
2,1,2,3
3,1,4,2
0,2,1,5


In [None]:
df1.sort_values(by='three')

Unnamed: 0,one,two,three
3,1,4,2
2,1,2,3
1,1,3,4
0,2,1,5


In [None]:
df1.sort_values(by=['two','three'])

Unnamed: 0,one,two,three
0,2,1,5
2,1,2,3
1,1,3,4
3,1,4,2


In [None]:
df1.sort_values(by=['two','three'],ascending=False)

Unnamed: 0,one,two,three
3,1,4,2
1,1,3,4
2,1,2,3
0,2,1,5


In [None]:
#The by parameter can take a list of column names, e.g.:

df1[["one", "two", "three"]].sort_values(by=["one", "two"])

Unnamed: 0,one,two,three
2,1,2,3
1,1,3,4
3,1,4,2
0,2,1,5


These methods have special treatment of NA values via the na_position argument:

In [None]:
s[2]=np.nan
s

0       A
1       B
2    <NA>
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [None]:
s.sort_values()

0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2    <NA>
5    <NA>
dtype: string

In [None]:
s.sort_values(na_position='first')

2    <NA>
5    <NA>
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: string

Sorting by values also supports a key parameter that takes a callable to apply to the values being sorted. key will be given the Series of values and should return a Series or array of the same shape with the transformed values. For DataFrame objects, the key is applied per column, so the key should still expect a Series and return a Series, e.g.

In [None]:
df = pd.DataFrame({"a": ["B", "a", "C"], "b": [1, 2, 3]})

In [None]:
df.sort_values(by='a')

Unnamed: 0,a,b
0,B,1
2,C,3
1,a,2


In [None]:
df.sort_values(by='a',key=lambda idx : idx.str.lower()) # compare with the output above: the sort is now case-insensitive

Unnamed: 0,a,b
1,a,2
0,B,1
2,C,3


## By indexes and values

Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level names.

In [None]:
# Build MultiIndex
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)]
)
idx.names=['frist','second']
# build a DataFrame
df_mult = pd.DataFrame({'a':np.arange(6,0,-1)},index = idx)

In [None]:
df_mult


Unnamed: 0_level_0,Unnamed: 1_level_0,a
frist,second,Unnamed: 2_level_1
a,1,6
a,2,5
a,2,4
b,2,3
b,1,2
b,1,1


In [None]:
# Sort by 'second' (index level) and 'a' (column)
df_mult.sort_values(by=['second','a'])

Unnamed: 0_level_0,Unnamed: 1_level_0,a
frist,second,Unnamed: 2_level_1
b,1,1
b,1,2
a,1,6
b,2,3
a,2,4
a,2,5


## searchsorted

Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().



In [None]:
ser = pd.Series([1, 2, 3])

In [None]:
ser.searchsorted([0,3])

array([0, 2])

In [None]:
ser.searchsorted([1,4])

array([0, 3])

In [None]:
ser.searchsorted([1,3],side='right')

array([1, 3])

In [None]:
ser.searchsorted([1,3],side='left')

array([0, 2])

In [None]:
# change series
ser = pd.Series([3, 1, 2])

In [None]:
ser.searchsorted([0, 3], sorter=np.argsort(ser))

array([0, 2])

## smallest / largest values

Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
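The equivalence with the sort-then-head approach can be checked with a quick sketch. The Series here is a hypothetical permutation with unique values, so there are no ties to complicate the comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical Series of 1000 distinct integers in random order.
rng = np.random.default_rng(0)
values = pd.Series(rng.permutation(1000))

# Partial selection of the three smallest values...
via_nsmallest = values.nsmallest(3)

# ...gives the same rows as a full sort followed by head(3).
via_sort = values.sort_values().head(3)

print(via_nsmallest.equals(via_sort))
```

Only the amount of work differs: nsmallest() does not need to order the remaining values.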



In [None]:
s = pd.Series(np.random.permutation(10)  )
s

0    4
1    9
2    5
3    0
4    6
5    2
6    1
7    3
8    8
9    7
dtype: int64

In [None]:
s.sort_values()

3    0
6    1
5    2
7    3
0    4
2    5
4    6
9    7
8    8
1    9
dtype: int64

In [None]:
s.nsmallest()

3    0
6    1
5    2
7    3
0    4
dtype: int64

In [None]:
s.nsmallest(3)

3    0
6    1
5    2
dtype: int64

In [None]:
s.nlargest(3)

1    9
8    8
9    7
dtype: int64

In [None]:
# DataFrame also has the nlargest and nsmallest methods.
df = pd.DataFrame(
    {
        "a": [-2, -1, 1, 10, 8, 11, -1],
        "b": list("abdceff"),
        "c": [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0],
    }
)

In [None]:
df.nlargest(3,'a')

Unnamed: 0,a,b,c
5,11,f,3.0
3,10,c,3.2
4,8,e,


In [None]:
df.nlargest(3,['a','c'])

Unnamed: 0,a,b,c
5,11,f,3.0
3,10,c,3.2
4,8,e,


In [None]:
df.nsmallest(3,'a')

Unnamed: 0,a,b,c
0,-2,a,1.0
1,-1,b,2.0
6,-1,f,4.0


In [None]:
df.nsmallest(5,['a','c'])

Unnamed: 0,a,b,c
0,-2,a,1.0
1,-1,b,2.0
6,-1,f,4.0
2,1,d,4.0
4,8,e,


## Sorting by a MultiIndex column

You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels to by.

In [None]:
df1.columns = pd.MultiIndex.from_tuples(
    [("a", "one"), ("a", "two"), ("b", "three")]
)
df1

Unnamed: 0_level_0,a,a,b
Unnamed: 0_level_1,one,two,three
0,2,1,5
1,1,3,4
2,1,2,3
3,1,4,2


In [None]:
df1.sort_values(by=("a", "two"))

Unnamed: 0_level_0,a,a,b
Unnamed: 0_level_1,one,two,three
0,2,1,5
2,1,2,3
1,1,3,4
3,1,4,2


# Copying

The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are immutable) and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a handful of ways to alter a DataFrame in-place:

* Inserting, deleting, or modifying a column.

* Assigning to the index or columns attributes.

* For homogeneous data, directly modifying the values via the values attribute or advanced indexing.

To be clear, no pandas method has the side effect of modifying your data; almost every method returns a new object, leaving the original object untouched. If the data is modified, it is because you did so explicitly.
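A minimal sketch of this behaviour, using a hypothetical frame named `original`: a method call leaves the frame untouched, and an explicit copy can be modified independently.

```python
import pandas as pd

original = pd.DataFrame({"a": [3, 1, 2]})

# sort_values() returns a new object; the original order is unchanged.
ordered = original.sort_values("a")

# copy() duplicates the underlying data, so editing the copy
# does not affect the original.
duplicate = original.copy()
duplicate.loc[0, "a"] = 99

print(original["a"].tolist())
print(ordered["a"].tolist())
print(duplicate["a"].tolist())
```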

pandas has two ways to store strings.

* object dtype, which can hold any Python object, including strings.

* StringDtype, which is dedicated to strings.

Generally, we recommend using StringDtype. See Text data types for more.

Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for performance and interoperability with other libraries and methods; see object conversion).

A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.



In [None]:
dft = pd.DataFrame(
    {
        "A": np.random.rand(3),
        "B": 1,
        "C": "foo",
        "D": pd.Timestamp("20010102"),
        "E": pd.Series([1.0] * 3).astype("float32"),
        "F": False,
        "G": pd.Series([1] * 3, dtype="int8"),
    }
)

In [None]:
dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.658621,1,foo,2001-01-02,1.0,False,1
1,0.081029,1,foo,2001-01-02,1.0,False,1
2,0.345906,1,foo,2001-01-02,1.0,False,1


In [None]:
dft.dtypes # on a DataFrame the attribute is plural: dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a Series object, use the dtype attribute.

In [None]:
dft['A'].dtype

dtype('float64')

**If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).**

In [None]:
pd.Series([1,2,3,4,5,6]) # all values are integers, so the dtype is int64

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [None]:
pd.Series([1,2,'aj',2.5,]) # dtype is object

0      1
1      2
2     aj
3    2.5
dtype: object

The number of columns of each type in a DataFrame can be found by calling DataFrame.dtypes.value_counts().

In [None]:
dft.dtypes.value_counts()

float64           1
int64             1
object            1
datetime64[ns]    1
float32           1
bool              1
int8              1
dtype: int64

In [None]:
dft.value_counts()

A         B  C    D           E    F      G
0.081029  1  foo  2001-01-02  1.0  False  1    1
0.345906  1  foo  2001-01-02  1.0  False  1    1
0.658621  1  foo  2001-01-02  1.0  False  1    1
dtype: int64

In [None]:
# see the dtypes documentation in pandas


In [None]:
df1 = pd.DataFrame(np.random.randn(8,1),columns=['a'],dtype='float32')

In [None]:
df1

Unnamed: 0,a
0,0.151236
1,1.634171
2,-1.438826
3,0.009966
4,0.754646
5,0.346598
6,-0.290816
7,2.357955


In [None]:
df1.dtypes

a    float32
dtype: object

In [None]:
df2 = pd.DataFrame(
    {
        "A": pd.Series(np.random.randn(8), dtype="float16"),
        "B": pd.Series(np.random.randn(8)),
        "C": pd.Series(np.random.randint(0, 255, size=8), dtype="uint8"),
    }
)
df2

Unnamed: 0,A,B,C
0,-0.932617,0.387841,168
1,-0.680176,-0.147792,67
2,0.290039,-1.83693,158
3,0.978516,0.024535,14
4,1.626953,-0.342949,64
5,-0.616211,-1.666974,95
6,-0.293945,-1.622066,185
7,1.21582,1.751115,59


In [None]:
df2.dtypes

A    float16
B    float64
C      uint8
dtype: object

## defaults

By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit).

 The following will all result in int64 dtypes.

In [None]:
pd.DataFrame([1,2],columns=['a']).dtypes

a    int64
dtype: object

In [None]:
pd.DataFrame({'a':[1,2]}).dtypes

a    int64
dtype: object

In [None]:
pd.DataFrame({'a':1},index=list(range(2))).dtypes

a    int64
dtype: object

Note that NumPy will choose platform-dependent types when creating arrays. The following WILL result in int32 on a 32-bit platform.

In [None]:
frame=pd.DataFrame(np.array([1,0]))
frame

Unnamed: 0,0
0,1
1,0


In [None]:
frame.dtypes # int64 here because this machine is 64-bit

0    int64
dtype: object

## upcasting
Types can potentially be upcast when combined with other types, meaning they are promoted from the current type (e.g. int to float).

In [None]:
df1.reindex_like(df2).fillna(value=0.0)     # use df2's row and column labels; df1 has no A/B/C columns, so every cell is NaN and is filled with 0.0

Unnamed: 0,A,B,C
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,0.0
4,0.0,0.0,0.0
5,0.0,0.0,0.0
6,0.0,0.0,0.0
7,0.0,0.0,0.0


In [None]:
df3 = df1.reindex_like(df2).fillna(value=0.0)+df2  # adding df2 to the zero-filled frame upcasts every column to float64
df3

Unnamed: 0,A,B,C
0,-0.932617,0.387841,168.0
1,-0.680176,-0.147792,67.0
2,0.290039,-1.83693,158.0
3,0.978516,0.024535,14.0
4,1.626953,-0.342949,64.0
5,-0.616211,-1.666974,95.0
6,-0.293945,-1.622066,185.0
7,1.21582,1.751115,59.0


In [None]:
df3.dtypes

A    float64
B    float64
C    float64
dtype: object

DataFrame.to_numpy() will return the lowest common denominator of the dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force some upcasting.

In [None]:
df3.to_numpy().dtype

dtype('float64')
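Since every column of df3 is already float64, the example above does not show any upcasting. The sketch below uses two hypothetical mixed frames to illustrate the common-dtype rule more directly:

```python
import pandas as pd

# Integers and floats share float64 as their common dtype.
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
numeric_dtype = numeric.to_numpy().dtype

# Once a string column is involved, only object can hold everything.
with_text = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})
text_dtype = with_text.to_numpy().dtype

print(numeric_dtype)
print(text_dtype)
```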

## astype

You can use the astype() method to explicitly convert dtypes from one to another.

These will by default return a copy, even if the dtype was unchanged (pass copy=False to change this behavior).

In addition, they will raise an exception if the astype operation is invalid.

Upcasting is always according to the NumPy rules. If two different dtypes are involved in an operation, then the more general one will be used as the result of the operation.
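The two points above (invalid conversions raise, and mixed-dtype operations follow NumPy promotion) can be sketched as follows. The Series are hypothetical, and `np.result_type` is used only to show what NumPy itself would choose:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2], dtype="int32")
floats = pd.Series([0.5, 0.5], dtype="float32")

# The result dtype of a mixed operation follows NumPy's promotion rules.
combined = ints + floats
expected = np.result_type(np.int32, np.float32)
print(combined.dtype, expected)

# An invalid astype() raises instead of silently producing bad data.
try:
    pd.Series(["1", "x"]).astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```

Here int32 combined with float32 promotes to float64, because float32 cannot represent every int32 value exactly.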

In [None]:
df3

Unnamed: 0,A,B,C
0,-0.932617,0.387841,168.0
1,-0.680176,-0.147792,67.0
2,0.290039,-1.83693,158.0
3,0.978516,0.024535,14.0
4,1.626953,-0.342949,64.0
5,-0.616211,-1.666974,95.0
6,-0.293945,-1.622066,185.0
7,1.21582,1.751115,59.0


In [None]:
df3.astype('int32')

Unnamed: 0,A,B,C
0,0,0,168
1,0,0,67
2,0,-1,158
3,0,0,14
4,1,0,64
5,0,-1,95
6,0,-1,185
7,1,1,59


In [None]:
df3.astype('int32').dtypes

A    int32
B    int32
C    int32
dtype: object

Convert a subset of columns to a specified type using astype().

In [None]:
dfa= pd.DataFrame({'a':[1,2],"b":[3,4],'c':[5,6]})
dfa

Unnamed: 0,a,b,c
0,1,3,5
1,2,4,6


In [None]:
dfa[['a','b']] = dfa[['a','b']].astype(np.uint8)


In [None]:
dfa.dtypes

a    uint8
b    uint8
c    int64
dtype: object

Convert certain columns to a specific dtype by passing a dict to astype().

In [None]:
dft1 = pd.DataFrame({"a": [1, 0, 1], "b": [4, 5, 6], "c": [7, 8, 9]})

In [None]:
dft1 = dft1.astype({'a':np.bool_,'c':np.float64})

In [None]:
dft1.dtypes

a       bool
b      int64
c    float64
dtype: object

When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.

loc() tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the dtype from the right hand side. Therefore the following piece of code produces the unintended result.

In [None]:
dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
dft.dtypes

a    int64
b    int64
c    int64
dtype: object

In [None]:
dft.loc[:,['a','b']].astype(np.uint8).dtypes

a    uint8
b    uint8
dtype: object

In [None]:
dft.dtypes

a    int64
b    int64
c    int64
dtype: object

In [None]:
dft.loc[:, ["a", "b"]] = dft.loc[:, ["a", "b"]].astype(np.uint8)

In [None]:
dft.dtypes

a    int64
b    int64
c    int64
dtype: object
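
For comparison, here is a sketch of two ways that do change the dtypes of a column subset. The frame dft_demo and the names by_brackets and by_dict are hypothetical, chosen so the example does not touch dft.

```python
import numpy as np
import pandas as pd

dft_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Option 1: plain [] assignment takes the dtype from the right-hand side.
by_brackets = dft_demo.copy()
by_brackets[["a", "b"]] = by_brackets[["a", "b"]].astype(np.uint8)

# Option 2: a dict passed to astype() converts only the listed columns.
by_dict = dft_demo.astype({"a": np.uint8, "b": np.uint8})
```

Both leave column c untouched. The dict form avoids the assignment step entirely, so it does not depend on how the indexer handles dtypes.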

## object conversion
pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases where the data is already of the correct type, but stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.


-------important

In [None]:
import datetime
df = pd.DataFrame(
    [
        [1, 2],
        ["a", "b"],
        [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)],
    ]
)
df

Unnamed: 0,0,1
0,1,2
1,a,b
2,2016-03-02 00:00:00,2016-03-02 00:00:00


In [None]:
df = df.T

In [None]:
df

Unnamed: 0,0,1,2
0,1,a,2016-03-02
1,2,b,2016-03-02


In [None]:
df.dtypes

0            object
1            object
2    datetime64[ns]
dtype: object

Because the data was transposed, the original inference stored columns as object (here columns 0 and 1), which infer_objects() will correct wherever a more specific dtype fits.

In [None]:
df.infer_objects().dtypes  # column 0 is soft-converted to int64; column 1 holds strings and stays object

0             int64
1            object
2    datetime64[ns]
dtype: object
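
To make the difference between soft and hard conversion concrete, here is a short sketch with two made-up object Series. infer_objects() only unwraps values that already have the right Python type, and it does not parse strings.

```python
import pandas as pd

# Integers stored in an object-dtype Series (for example after a transpose).
boxed = pd.Series([1, 2, 3], dtype="object")
unboxed = boxed.infer_objects()  # soft conversion: the values are already ints

# Numeric-looking strings are left alone; parsing them needs a hard conversion.
text = pd.Series(["1", "2", "3"], dtype="object")
still_text = text.infer_objects()
parsed = pd.to_numeric(text)
```

Here unboxed and parsed end up as integer Series, while still_text keeps holding strings.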

The following functions are available for one dimensional object arrays or scalars to perform hard conversion of objects to a specified type:

* to_numeric() (conversion to numeric dtypes)

In [None]:
a = [1.1,3,4]
pd.to_numeric(a)

array([1.1, 3. , 4. ])

* to_datetime() (conversion to datetime objects)

In [None]:
import datetime

m = ["2016-07-09", datetime.datetime(2016, 3, 2)]

In [None]:
m

['2016-07-09', datetime.datetime(2016, 3, 2, 0, 0)]

In [None]:
pd.to_datetime(m)

DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

* to_timedelta() (conversion to timedelta objects)

In [None]:
m = ["5us", pd.Timedelta("1day")]

In [None]:
pd.to_timedelta(m)

TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any errors encountered will be raised during the conversion process.

However, if errors='coerce', these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

In [None]:
import datetime
m = ['apple',datetime.datetime(2016,3,2)]
m

['apple', datetime.datetime(2016, 3, 2, 0, 0)]

In [None]:
pd.to_datetime(m,errors='coerce')

DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [None]:
a = ['apple',2,3]

In [None]:
pd.to_numeric(a,errors='coerce')

array([nan,  2.,  3.])

In [None]:
k = ['apple',pd.Timedelta('1day')]

In [None]:
pd.to_timedelta(k,errors='coerce')

TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
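
A typical use of errors='coerce' is cleaning a mostly numeric column and then checking what was lost. The sketch below assumes a hypothetical raw column with one placeholder string.

```python
import pandas as pd

# Hypothetical raw column: mostly numbers, with one placeholder string.
raw = pd.Series(["1.5", "n/a", "3"])

clean = pd.to_numeric(raw, errors="coerce")  # "n/a" becomes NaN
n_bad = int(clean.isna().sum())              # how many entries failed to convert
bad_values = raw[clean.isna()].tolist()      # which original entries they were
```

Looking at bad_values before discarding anything helps confirm that only genuine placeholders were coerced. This check assumes raw had no missing values to begin with.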

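The errors parameter has a third option, errors='ignore', which simply returns the passed-in data if it encounters any errors with the conversion to the desired data type. Note that this option is deprecated as of pandas 2.2, so newer versions may warn about or reject the calls below: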

In [None]:
pd.to_datetime(m, errors='ignore')

Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [None]:
pd.to_numeric(a, errors='ignore')

array(['apple', 2, 3], dtype=object)

In [None]:
pd.to_timedelta(k, errors='ignore')

array(['apple', Timedelta('1 days 00:00:00')], dtype=object)
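
The same "return the input unchanged on failure" behavior can be written with an explicit try/except, which does not rely on the errors='ignore' option. The helper name to_numeric_or_original is made up for this sketch.

```python
import pandas as pd

def to_numeric_or_original(values):
    """Return pd.to_numeric(values), or values unchanged if conversion fails."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values

kept = to_numeric_or_original(["apple", 2, 3])   # conversion fails, the list comes back
converted = to_numeric_or_original(["1", 2, 3])  # conversion succeeds
```

The same pattern should carry over to pd.to_datetime and pd.to_timedelta by swapping the function called inside the try block.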

In addition to object conversion, to_numeric() provides another argument downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

In [None]:
m = ["1", 2, 3]

In [None]:
pd.to_numeric(m, downcast="integer")  # smallest signed int dtype

array([1, 2, 3], dtype=int8)

In [None]:
pd.to_numeric(m, downcast="signed")  # same as 'integer'

array([1, 2, 3], dtype=int8)

In [None]:
pd.to_numeric(m, downcast="unsigned")  # smallest unsigned int dtype

array([1, 2, 3], dtype=uint8)

In [None]:
pd.to_numeric(m, downcast="float")  # smallest float dtype

array([1., 2., 3.], dtype=float32)
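
The dtype chosen by downcast depends on the values, not only on the option. The sketch below uses made-up inputs to show the smallest integer type growing with the data, plus the memory effect on a hypothetical 1000-element Series.

```python
import pandas as pd

small = pd.to_numeric([1, 2, 3], downcast="integer").dtype    # fits in int8
medium = pd.to_numeric([1, 300], downcast="integer").dtype    # 300 needs int16
large = pd.to_numeric([1, 70_000], downcast="integer").dtype  # 70000 needs int32

# Memory effect on a hypothetical int64 Series holding the values 0..999.
s64 = pd.Series(range(1000), dtype="int64")
s_small = pd.to_numeric(s64, downcast="integer")
bytes_before, bytes_after = s64.nbytes, s_small.nbytes
```

Since 999 fits in int16, the Series data shrinks from 8 bytes to 2 bytes per element.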

As these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:

In [None]:
df = pd.DataFrame([["2016-07-09", datetime.datetime(2016, 3, 2)]] * 2, dtype="O")


In [None]:
df

Unnamed: 0,0,1
0,2016-07-09,2016-03-02 00:00:00
1,2016-07-09,2016-03-02 00:00:00


In [None]:
df.apply(pd.to_datetime)

Unnamed: 0,0,1
0,2016-07-09,2016-03-02
1,2016-07-09,2016-03-02


In [None]:
df = pd.DataFrame([["1.1", 2, 3]] * 2, dtype="O")
df.dtypes

0    object
1    object
2    object
dtype: object

In [None]:
df.apply(pd.to_numeric).dtypes


0    float64
1      int64
2      int64
dtype: object

In [None]:
df = pd.DataFrame([["5us", pd.Timedelta("1day")]] * 2, dtype="O")
df

Unnamed: 0,0,1
0,5us,1 days 00:00:00
1,5us,1 days 00:00:00


In [None]:
df.apply(pd.to_timedelta)

Unnamed: 0,0,1
0,0 days 00:00:00.000005,1 days
1,0 days 00:00:00.000005,1 days
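
apply() also forwards extra keyword arguments to the function, so the errors option can be combined with the column-wise conversion. A minimal sketch on a hypothetical frame raw_df:

```python
import pandas as pd

# Hypothetical object-dtype frame with one entry that is not a number.
raw_df = pd.DataFrame({"x": ["1", "bad"], "y": ["2.5", "3"]}, dtype="O")

# Keyword arguments after the function are passed through to pd.to_numeric.
numeric_df = raw_df.apply(pd.to_numeric, errors="coerce")
```

Column x becomes float64 because the coerced entry is stored as NaN, and column y becomes float64 because of the value 2.5.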


# Selecting columns based on dtype

The select_dtypes() method implements subsetting of columns based on their dtype.

First, let’s create a DataFrame with a slew of different dtypes:

In [None]:
df = pd.DataFrame(
    {
        "string": list("abc"),
        "int64": list(range(1, 4)),
        "uint8": np.arange(3, 6).astype("u1"),
        "float64": np.arange(4.0, 7.0),
        "bool1": [True, False, True],
        "bool2": [False, True, False],
        "dates": pd.date_range("now", periods=3),
        "category": pd.Series(list("ABC")).astype("category"),
    }
)

In [None]:
df["tdeltas"] = df.dates.diff()

df["uint64"] = np.arange(3, 6).astype("u8")

df["other_dates"] = pd.date_range("20130101", periods=3)

df["tz_aware_dates"] = pd.date_range("20130101", periods=3, tz="US/Eastern")

In [None]:
df

Unnamed: 0,string,int64,uint8,float64,bool1,bool2,dates,category,tdeltas,uint64,other_dates,tz_aware_dates
0,a,1,3,4.0,True,False,2023-07-20 06:40:23.466820,A,NaT,3,2013-01-01,2013-01-01 00:00:00-05:00
1,b,2,4,5.0,False,True,2023-07-21 06:40:23.466820,B,1 days,4,2013-01-02,2013-01-02 00:00:00-05:00
2,c,3,5,6.0,True,False,2023-07-22 06:40:23.466820,C,1 days,5,2013-01-03,2013-01-03 00:00:00-05:00


In [None]:
# all data types
df.dtypes

string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters include and exclude that allow you to say “give me the columns with these dtypes” (include) and/or “give the columns without these dtypes” (exclude).

In [None]:
df.select_dtypes(include=[bool])

Unnamed: 0,bool1,bool2
0,True,False
1,False,True
2,True,False


select_dtypes() works with generic dtypes as well.

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [None]:
df.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])

Unnamed: 0,int64,float64,bool1,bool2,tdeltas
0,1,4.0,True,False,NaT
1,2,5.0,False,True,1 days
2,3,6.0,True,False,1 days


In [None]:
df['tdeltas'].dtype

dtype('<m8[ns]')

In [None]:
# To select string columns you must use the object dtype:
df.select_dtypes(include='object')

Unnamed: 0,string
0,a
1,b
2,c
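
The include and exclude parameters can be tried out on a smaller frame. The sketch below uses a hypothetical four-column DataFrame named demo, with one column per broad dtype family.

```python
import pandas as pd

# Hypothetical frame with one column per broad dtype family.
demo = pd.DataFrame(
    {
        "i": [1, 2],
        "f": [1.5, 2.5],
        "b": [True, False],
        "d": pd.date_range("2013-01-01", periods=2),
    }
)

numeric_cols = list(demo.select_dtypes(include="number").columns)  # ints and floats
date_cols = list(demo.select_dtypes(include="datetime").columns)   # datetime64 columns
non_numeric = list(demo.select_dtypes(exclude="number").columns)   # everything else
```

The boolean column is not matched by "number", which is why the earlier example lists "bool" separately in include.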


To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a tree of child dtypes:



In [None]:
def subdtypes(dtype):
    """Return dtype together with a nested list of its child dtypes."""
    subs = dtype.__subclasses__()
    if not subs:
        return dtype  # leaf: no child dtypes
    return [dtype, [subdtypes(dt) for dt in subs]]

In [None]:
subdtypes(np.generic)

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.longlong,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.ulonglong]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]
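
Single relationships in this tree can also be checked directly with np.issubdtype, without walking the whole hierarchy. A short sketch (the variable names are made up for illustration):

```python
import numpy as np

# np.issubdtype answers "is the first dtype a descendant of the second?"
int_is_integer = np.issubdtype(np.int32, np.integer)        # True
float_is_floating = np.issubdtype(np.float64, np.floating)  # True
bool_is_number = np.issubdtype(np.bool_, np.number)         # False: bool_ sits outside number

# The tree above lists timedelta64 under signedinteger; this reports whether
# the same holds in the installed NumPy version.
print(np.issubdtype(np.timedelta64, np.signedinteger))
```

Where the last check prints True, it matches the earlier select_dtypes() output, in which the tdeltas column appeared when "number" was included.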