# Functions and Methods
+ Dropping entries from an axis
+ Sorting
+ Unique Values, Value Counts and Membership
+ Replacing Values
+ Setting and Resetting index
+ Groupby
+ Pivot Tables
+ Stacking and unstacking
+ Duplicates
+ Reindexing and Renaming

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Loading the datasets for this section.
pokemon = pd.read_csv("pokemon_data.csv")
tips = pd.read_csv("tips.csv")

### Dropping entries from an axis
Remove rows or columns by specifying label names and the corresponding axis, or by directly specifying index or column names. Returns a DataFrame without the removed index or column labels.

`syntax : df.drop(labels,axis,index,columns,level,inplace)`

Parameter | Description
:- | :-
labels | single label or list-like. Index or column labels to drop.
axis | {0 or 'index', 1 or 'columns'}, default 0. Whether to drop labels from the index (0 or 'index') or columns (1 or 'columns').
index | single label or list-like. Alternative to specifying axis (``labels, axis=0`` is equivalent to ``index=labels``).
columns | single label or list-like. Alternative to specifying axis (``labels, axis=1`` is equivalent to ``columns=labels``).
level | int or level name, optional. For MultiIndex, level from which the labels will be removed.
inplace | bool, default False. If False, return a copy. Otherwise, do the operation in place and return None.

In [3]:
arr = np.random.rand(3,5)
df = pd.DataFrame(arr, index=["row1", "row2", "row3"], columns=["col1", "col2", "col3", "col4", "col5"])
df

Unnamed: 0,col1,col2,col3,col4,col5
row1,0.938291,0.204311,0.20859,0.126049,0.349567
row2,0.627177,0.854003,0.31024,0.255143,0.234173
row3,0.243444,0.904148,0.478318,0.246787,0.822482


In [4]:
# Dropping a column 
df.drop("col1", axis=1)

Unnamed: 0,col2,col3,col4,col5
row1,0.204311,0.20859,0.126049,0.349567
row2,0.854003,0.31024,0.255143,0.234173
row3,0.904148,0.478318,0.246787,0.822482


In [5]:
# Dropping multiple rows 
df.drop(["row1","row2"],axis=0)

Unnamed: 0,col1,col2,col3,col4,col5
row3,0.243444,0.904148,0.478318,0.246787,0.822482
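As a complement to the cells above, here is a minimal sketch of the `index=` and `columns=` keywords from the parameter table. The `demo` frame and its integer values are made up for illustration; only the labels mirror the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same labels as the example above
demo = pd.DataFrame(
    np.arange(15).reshape(3, 5),
    index=["row1", "row2", "row3"],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# These two calls are equivalent ways to drop one column
by_axis = demo.drop("col1", axis=1)
by_keyword = demo.drop(columns="col1")

# Rows and columns can be dropped in a single call
trimmed = demo.drop(index=["row1"], columns=["col2", "col3"])

# drop returns a new object by default, so demo keeps all five columns
print(trimmed)
print(demo.shape)
```

Because `inplace` defaults to False, the original frame is left untouched unless the result is assigned back.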


### Sorting 
Sorting a dataset by some criterion is another built-in Pandas operation. To sort lexicographically by row or column index, use the `sort_index` method, which returns a new sorted object. The object can also be sorted in place if `inplace=True` is specified.

`syntax: df.sort_index(axis, ascending,inplace)`

Parameter | Description
:- | :-
axis | {0 or 'index', 1 or 'columns'}, default 0. The axis along which to sort.  The value 0 identifies the rows, and 1 identifies the columns.
ascending | bool or list of bools, default True. Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
inplace | bool, default False. If True, perform the operation in place.

In [6]:
# Creating a DataFrame with unsorted index
df2 = pd.DataFrame(np.linspace(100,105,25).reshape(5,5), index=[100, 29, 234, 1, 150], columns=list("ABCDE"))

In [7]:
df2

Unnamed: 0,A,B,C,D,E
100,100.0,100.208333,100.416667,100.625,100.833333
29,101.041667,101.25,101.458333,101.666667,101.875
234,102.083333,102.291667,102.5,102.708333,102.916667
1,103.125,103.333333,103.541667,103.75,103.958333
150,104.166667,104.375,104.583333,104.791667,105.0


In [8]:
# Sorting the index
df2.sort_index()

Unnamed: 0,A,B,C,D,E
1,103.125,103.333333,103.541667,103.75,103.958333
29,101.041667,101.25,101.458333,101.666667,101.875
100,100.0,100.208333,100.416667,100.625,100.833333
150,104.166667,104.375,104.583333,104.791667,105.0
234,102.083333,102.291667,102.5,102.708333,102.916667
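The `axis` and `ascending` parameters can be sketched on a small frame. The `demo` data below is hypothetical and only serves to show the two directions of sorting.

```python
import pandas as pd

# Hypothetical frame with unsorted row labels and unsorted column labels
demo = pd.DataFrame(
    {"C": [1, 2, 3], "A": [4, 5, 6], "B": [7, 8, 9]},
    index=[30, 10, 20],
)

# Sort the row labels in descending order
desc_rows = demo.sort_index(ascending=False)

# Sort the column labels instead of the row labels
sorted_cols = demo.sort_index(axis=1)

print(desc_rows)
print(sorted_cols)
```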


##### Sort by the values along either axis.
`syntax : df.sort_values(by, axis, ascending, inplace)`

In [9]:
# Sorting a DataFrame by a column
df2.sort_values(by="A",ascending=False)

Unnamed: 0,A,B,C,D,E
150,104.166667,104.375,104.583333,104.791667,105.0
1,103.125,103.333333,103.541667,103.75,103.958333
234,102.083333,102.291667,102.5,102.708333,102.916667
29,101.041667,101.25,101.458333,101.666667,101.875
100,100.0,100.208333,100.416667,100.625,100.833333
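`sort_values` also accepts a list of columns in `by`, with one sort direction per column in `ascending`. A minimal sketch, assuming a made-up `scores` frame:

```python
import pandas as pd

# Hypothetical data: two teams with different point totals
scores = pd.DataFrame(
    {"team": ["b", "a", "b", "a"], "points": [10, 30, 20, 40]}
)

# Sort by team (ascending), then by points within each team (descending)
ranked = scores.sort_values(by=["team", "points"], ascending=[True, False])

print(ranked)
```

The original row labels travel with the rows, which makes it easy to see how the order changed.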


### Unique Values, Value Counts and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. For a DataFrame, this is done by selecting the desired column and calling these methods on the selected column object.
+ **unique()** *returns an array of the unique values in the Series or column*
+ **nunique()** *returns the number of unique values*
+ **value_counts()** *returns a Series of unique values and their counts*
+ **isin()** *computes a boolean array indicating whether each Series value is contained in the passed sequence of values*

In [10]:
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False


In [11]:
# Finding unique items on a specific column
pokemon["Type 1"].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [12]:
# Finding the number of unique items in a column
pokemon["Type 1"].nunique()

18

In [13]:
# Finding the frequency of unique items 
pokemon["Type 1"].value_counts()

Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Ghost        32
Dragon       32
Ground       32
Dark         31
Poison       28
Fighting     27
Steel        27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

In [14]:
# Check for items in a column that are also in a sequence of values - (isin method)
mask = pokemon["Type 1"].isin(["Fire","Rock","Ice"])

In [15]:
# The boolean result
mask

0      False
1      False
2      False
3      False
4       True
       ...  
795     True
796     True
797    False
798    False
799     True
Name: Type 1, Length: 800, dtype: bool

In [16]:
# Masking the boolean result to the DataFrame
pokemon[mask]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
5,5,Charmeleon,Fire,,58,64,58,80,65,80,1,False
6,6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
7,6,CharizardMega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
8,6,CharizardMega Charizard Y,Fire,Flying,78,104,78,159,115,100,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
788,712,Bergmite,Ice,,55,69,85,32,35,28,6,False
789,713,Avalugg,Ice,,95,117,184,44,46,28,6,False
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True


In [17]:
# Verify the checked items 
pokemon[mask]["Type 1"].unique()

array(['Fire', 'Rock', 'Ice'], dtype=object)
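Two small variations on the cells above may be useful: `value_counts(normalize=True)` reports shares instead of counts, and `~` negates an `isin` mask. The `types` Series here is a tiny made-up stand-in for the `Type 1` column.

```python
import pandas as pd

# Hypothetical stand-in for a column such as pokemon["Type 1"]
types = pd.Series(
    ["Fire", "Water", "Fire", "Grass", "Water", "Fire"], name="Type 1"
)

# Relative frequencies instead of raw counts
shares = types.value_counts(normalize=True)

# Keep only the values that are NOT in the given list
not_fire = types[~types.isin(["Fire"])]

print(shares)
print(not_fire)
```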

### Replacing Values 
Replace values given in **to_replace** with **value**. Values of the DataFrame are replaced with other values dynamically. This differs from updating with **.loc** or **.iloc**, which require you to specify a location to update with some value.  

**syntax:** ``df.replace(to_replace=None,value=None,inplace=False,method='pad')``


In [18]:
# Creating a Series for this example
data = pd.Series([1.,-999.,2.,-999.,-1000.,3.])

In [19]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [20]:
# Replace value 
data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [21]:
# Replacing multiple values
data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [22]:
# replacing items - Argument passed as a dict
data.replace({-999:np.nan,-1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
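On a DataFrame, `replace` also accepts a nested dict so that each column gets its own mapping. The sketch below uses a hypothetical `readings` frame with a `-999.0` sentinel, in the spirit of the Series example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that uses sentinel values for missing data
readings = pd.DataFrame(
    {"temp": [21.5, -999.0, 19.0], "status": ["ok", "missing", "ok"]}
)

# Outer keys are column names, inner dicts map old values to new values
cleaned = readings.replace(
    {"temp": {-999.0: np.nan}, "status": {"missing": "unknown"}}
)

print(cleaned)
```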

### Setting and Resetting index
To set the DataFrame index using existing columns, we use the ``set_index`` method. It sets the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The new index can replace the existing index or expand on it. When multiple columns are used to set the index, a MultiIndex is returned.

**Syntax** ``df.set_index(keys,drop=True,append=False,inplace=False)``

Parameter | Description
:- | :-
keys | label or array-like or list of labels/arrays. This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays.
drop | bool, default True. Delete the columns to be used as the new index.
append | bool, default False. Whether to append the columns to the existing index.
inplace | bool, default False. If True, modifies the DataFrame in place (does not create a new object).


In [23]:
pokemon

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [24]:
# Setting a specific column as the index
pokemon.set_index(keys="#").sort_index()

Unnamed: 0_level_0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


**NB** It's a good idea to sort the index after setting it: label-based slicing is more predictable on a sorted index.
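The note above can be illustrated with a small sketch. The `unsorted` frame is hypothetical; the point is only that, once the index is sorted, a label slice behaves like a range of keys.

```python
import pandas as pd

# Hypothetical frame whose "id" column is not in order
unsorted = pd.DataFrame(
    {"id": [3, 1, 2, 4], "name": ["c", "a", "b", "d"]}
).set_index("id")

# The new index keeps the original row order, so it is not sorted yet
print(unsorted.index.is_monotonic_increasing)

# After sorting, a label slice selects the ids from 2 through 3 inclusive
ordered = unsorted.sort_index()
window = ordered.loc[2:3]

print(window)
```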

In [25]:
pokemon

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [26]:
# Setting a MultiIndex from two columns
pokemon.set_index(["Type 1","Name"])

Unnamed: 0_level_0,Unnamed: 1_level_0,#,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
Type 1,Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Grass,Bulbasaur,1,Poison,45,49,49,65,65,45,1,False
Grass,Ivysaur,2,Poison,60,62,63,80,80,60,1,False
Grass,Venusaur,3,Poison,80,82,83,100,100,80,1,False
Grass,VenusaurMega Venusaur,3,Poison,80,100,123,122,120,80,1,False
Fire,Charmander,4,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
Rock,Diancie,719,Fairy,50,100,150,100,150,50,6,True
Rock,DiancieMega Diancie,719,Fairy,50,160,110,160,110,110,6,True
Psychic,HoopaHoopa Confined,720,Ghost,80,110,60,150,130,70,6,True
Psychic,HoopaHoopa Unbound,720,Dark,80,160,60,170,130,80,6,True


#### Resetting the index
**syntax**: ``df.reset_index() ``
    
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

In [27]:
# Resetting the index
pokemon.reset_index()

Unnamed: 0,index,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True
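`reset_index` has a few options worth sketching: `level` moves only some MultiIndex levels back into columns, and `drop=True` discards the old index instead of keeping it as a column. The `small` frame below is a hypothetical three-row stand-in for the pokemon data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the pokemon data
small = pd.DataFrame(
    {
        "kind": ["Grass", "Fire", "Fire"],
        "name": ["Bulbasaur", "Charmander", "Charmeleon"],
        "hp": [45, 39, 58],
    }
)

indexed = small.set_index(["kind", "name"])

# Move every index level back into ordinary columns
restored = indexed.reset_index()

# Move only the "name" level back; "kind" stays as the index
partial = indexed.reset_index(level="name")

# After removing a row, renumber from 0 and discard the old labels
renumbered = small.drop(0).reset_index(drop=True)

print(partial)
print(renumbered)
```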


In [28]:
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False


In [29]:
# Setting a MultiIndex from two columns
tips.set_index(["sex","smoker"])

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,day,time,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Female,No,16.99,1.01,Sun,Dinner,2
Male,No,10.34,1.66,Sun,Dinner,3
Male,No,21.01,3.50,Sun,Dinner,3
Male,No,23.68,3.31,Sun,Dinner,2
Female,No,24.59,3.61,Sun,Dinner,4
...,...,...,...,...,...,...
Male,No,29.03,5.92,Sat,Dinner,3
Female,Yes,27.18,2.00,Sat,Dinner,2
Male,Yes,22.67,2.00,Sat,Dinner,2
Male,No,17.82,1.75,Sat,Dinner,2


### Groupby
The Pandas `groupby` method is powerful and versatile. It allows you to split your data into separate groups and perform computations on each group. Calling `groupby` on a DataFrame returns a GroupBy object on which we can call different aggregation functions.

If you call dir() on a Pandas GroupBy object, then you’ll see enough methods there to make your head spin! It can be hard to keep track of all of the functionality of a Pandas GroupBy object. One way to clear the fog is to compartmentalize the different methods into what they do and how they behave.

Broadly, methods of a Pandas GroupBy object fall into a handful of categories:

+ Aggregation methods (also called reduction methods) “smush” many data points into an aggregated statistic about those data points. An example is to take the sum, mean, or median of 10 numbers, where the result is just a single number.

+ Filter methods come back to you with a subset of the original DataFrame. This most commonly means using .filter() to drop entire groups based on some comparative statistic about that group and its sub-table. It also makes sense to include under this definition a number of methods that exclude particular rows from each group.

+ Transformation methods return a DataFrame with the same shape and indices as the original, but with different values. With both aggregation and filter methods, the resulting DataFrame will commonly be smaller in size than the input DataFrame. This is not true of a transformation, which transforms individual values themselves but retains the shape of the original DataFrame.

+ Meta methods are less concerned with the original object on which you called .groupby(), and more focused on giving you high-level information such as the number of groups and indices of those groups.

+ Plotting methods mimic the API of plotting for a Pandas Series or DataFrame, but typically break the output into multiple subplots.


Aggregation methods | Filter methods | Transformation methods | Meta methods | Plotting methods
:- | :- | :- | :- | :-
.agg() | .filter() | .bfill() | .\_\_iter\_\_() | .hist()
.aggregate() | .first() | .diff() | .get_group() | .ohlc()
.all() | .head() | .ffill() | .groups | .boxplot()
.any() | .last() | .fillna() | .indices | .plot()
.apply() | .nth() | .pct_change() | .ndim | 
.corr() | .tail() | .quantile() | .ngroup() | 
.corrwith() | .take() | .rank() | .ngroups | 
.count() | | .shift() | .dtypes | 
.cov() | | .transform() | | 
.cumcount() | | .tshift() | | 
.cummax() | | | | 
.cummin() | | | | 
.cumprod() | | | | 
.cumsum() | | | | 
.describe() | | | | 
.idxmax() | | | | 
.idxmin() | | | | 
.mad() | | | | 
.max() | | | | 
.mean() | | | | 
.median() | | | | 
.min() | | | | 
.nunique() | | | | 
.prod() | | | | 
.sem() | | | | 
.size() | | | | 
.skew() | | | | 
.std() | | | | 
.sum() | | | | 
.var() | | | | 
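To make the categories above more concrete, here is a minimal sketch with one method from each of the aggregation, transformation, filter, and meta groups. The `sales` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data: two stores with a few sales each
sales = pd.DataFrame(
    {"store": ["a", "a", "b", "b", "b"], "amount": [10, 20, 5, 5, 20]}
)
grouped = sales.groupby("store")["amount"]

# Aggregation: one value per group
totals = grouped.sum()

# Transformation: same length as the input, here the per-group mean
# broadcast back to every row so it can be subtracted from the amounts
centered = sales["amount"] - grouped.transform("mean")

# Filter: keep only the rows of groups with at least three sales
big_stores = sales.groupby("store").filter(lambda g: len(g) >= 3)

# Meta: information about the grouping itself
n_groups = sales.groupby("store").ngroups

print(totals)
print(centered)
print(big_stores)
print(n_groups)
```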

In [30]:
# Grouping by the sex column and finding the mean of the numeric columns
tips.groupby("sex").mean(numeric_only=True)

Unnamed: 0_level_0,total_bill,tip,size
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,18.056897,2.833448,2.45977
Male,20.744076,3.089618,2.630573


In [31]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [32]:
# Grouping by the day column and finding the sum of the numeric columns
tips.groupby("day").sum(numeric_only=True)

Unnamed: 0_level_0,total_bill,tip,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,325.88,51.96,40
Sat,1778.4,260.4,219
Sun,1627.16,247.39,216
Thur,1096.33,171.83,152


In [33]:
# Calling describe method on a groupby object
tips.groupby("day").describe()

Unnamed: 0_level_0,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,tip,tip,tip,tip,tip,size,size,size,size,size,size,size,size
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Fri,19.0,17.151579,8.30266,5.75,12.095,15.38,21.75,40.17,19.0,2.734737,...,3.365,4.73,19.0,2.105263,0.567131,1.0,2.0,2.0,2.0,4.0
Sat,87.0,20.441379,9.480419,3.07,13.905,18.24,24.74,50.81,87.0,2.993103,...,3.37,10.0,87.0,2.517241,0.819275,1.0,2.0,2.0,3.0,5.0
Sun,76.0,21.41,8.832122,7.25,14.9875,19.63,25.5975,48.17,76.0,3.255132,...,4.0,6.5,76.0,2.842105,1.007341,2.0,2.0,2.0,4.0,6.0
Thur,62.0,17.682742,7.88617,7.51,12.4425,16.2,20.155,43.11,62.0,2.771452,...,3.3625,6.7,62.0,2.451613,1.066285,1.0,2.0,2.0,2.0,6.0


In [34]:
# Selecting a particular column from a groupby object
tips.groupby("day").describe()['total_bill']

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Fri,19.0,17.151579,8.30266,5.75,12.095,15.38,21.75,40.17
Sat,87.0,20.441379,9.480419,3.07,13.905,18.24,24.74,50.81
Sun,76.0,21.41,8.832122,7.25,14.9875,19.63,25.5975,48.17
Thur,62.0,17.682742,7.88617,7.51,12.4425,16.2,20.155,43.11
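When only a few statistics are needed, `agg` with named aggregations is an alternative to slicing the output of `describe()`. The sketch below assumes a made-up `bills` frame with the same column names as `tips`; `mean_bill` and `max_tip` are output names chosen for illustration.

```python
import pandas as pd

# Hypothetical frame that borrows column names from the tips dataset
bills = pd.DataFrame(
    {
        "day": ["Fri", "Fri", "Sat", "Sat"],
        "total_bill": [10.0, 20.0, 30.0, 50.0],
        "tip": [1.0, 3.0, 4.0, 6.0],
    }
)

# Each keyword is an output column: (source column, aggregation function)
summary = bills.groupby("day").agg(
    mean_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
)

print(summary)
```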


### Pivot Tables
A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible by the groupby facility combined with reshape operations and hierarchical indexing. DataFrame has a `pivot_table` method, and there is also a top-level `pandas.pivot_table` function.

`df.pivot_table(values=None,index=None,columns=None,aggfunc='mean',fill_value=None,margins=False,dropna=True,margins_name='All',
    observed=False)`
    
Parameter | Description
:-|:-
values | column to aggregate, optional
index | column, Grouper, array, or list of the previous. Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
columns | column, Grouper, array, or list of the previous. Keys to group by on the pivot table columns. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
aggfunc | function, list of functions, or dict, default mean. If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions.
fill_value | scalar, default None. Value to replace missing values with (in the resulting pivot table, after aggregation).
margins | bool, default False. Add a row and a column holding the totals (e.g. for subtotals / grand totals).
dropna | bool, default True. Do not include columns whose entries are all NaN.
margins_name | str, default 'All'. Name of the row / column that will contain the totals when margins is True.
observed | bool, default False. This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.


In [35]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [36]:
# Reshaping a DataFrame by pivot_table
pivot_tips = tips.pivot_table(values=["total_bill","tip","size"],index=["sex","smoker"],columns='day',aggfunc="sum")

In [37]:
pivot_tips

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size,size,tip,tip,tip,tip,total_bill,total_bill,total_bill,total_bill
Unnamed: 0_level_1,day,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
Female,No,5,30,43,62,6.25,35.42,46.61,61.49,38.73,247.05,291.54,400.36
Female,Yes,14,33,10,17,18.78,43.03,14.0,20.93,88.58,304.0,66.16,134.53
Male,No,4,85,124,50,5.0,104.21,133.96,58.83,34.95,637.73,877.34,369.73
Male,Yes,17,71,39,23,21.93,77.74,52.82,30.58,163.62,589.62,392.12,191.71


In [38]:
pivot_tips['size']['Fri'].sum()

40
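The `fill_value` and `margins` parameters from the table above can be sketched on a tiny frame. The `orders` data is hypothetical and deliberately has no ("M", "Sat") combination, so one cell of the pivot table would otherwise be missing.

```python
import pandas as pd

# Hypothetical data with no ("M", "Sat") combination
orders = pd.DataFrame(
    {
        "sex": ["F", "F", "M", "M"],
        "day": ["Fri", "Sat", "Fri", "Fri"],
        "tip": [1.0, 2.0, 3.0, 4.0],
    }
)

table = orders.pivot_table(
    values="tip",
    index="sex",
    columns="day",
    aggfunc="sum",
    fill_value=0,   # show 0 instead of NaN for the missing combination
    margins=True,   # add an "All" row and an "All" column with totals
)

print(table)
```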

### Stacking and unstacking
When we have a DataFrame with a MultiIndex, we can rearrange it by moving levels between the rows and the columns, for example to get a simpler two-dimensional layout. This is achieved using the `stack` and `unstack` methods of the DataFrame.
+ **stack** *rotates or pivots from the columns of the DataFrame into the rows*
+ **unstack** *pivots from the rows into the columns*

In [39]:
pivot_tips

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size,size,tip,tip,tip,tip,total_bill,total_bill,total_bill,total_bill
Unnamed: 0_level_1,day,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
Female,No,5,30,43,62,6.25,35.42,46.61,61.49,38.73,247.05,291.54,400.36
Female,Yes,14,33,10,17,18.78,43.03,14.0,20.93,88.58,304.0,66.16,134.53
Male,No,4,85,124,50,5.0,104.21,133.96,58.83,34.95,637.73,877.34,369.73
Male,Yes,17,71,39,23,21.93,77.74,52.82,30.58,163.62,589.62,392.12,191.71


In [40]:
# Calling stack method on a MultiIndex DataFrame 
pivot_tips.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,size,tip,total_bill
sex,smoker,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,No,Fri,5,6.25,38.73
Female,No,Sat,30,35.42,247.05
Female,No,Sun,43,46.61,291.54
Female,No,Thur,62,61.49,400.36
Female,Yes,Fri,14,18.78,88.58
Female,Yes,Sat,33,43.03,304.0
Female,Yes,Sun,10,14.0,66.16
Female,Yes,Thur,17,20.93,134.53
Male,No,Fri,4,5.0,34.95
Male,No,Sat,85,104.21,637.73


In [41]:
# Calling unstack method on a MultiIndex DataFrame 
pivot_tips.unstack()

Unnamed: 0_level_0,size,size,size,size,size,size,size,size,tip,tip,tip,tip,tip,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill
day,Fri,Fri,Sat,Sat,Sun,Sun,Thur,Thur,Fri,Fri,...,Thur,Thur,Fri,Fri,Sat,Sat,Sun,Sun,Thur,Thur
smoker,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes,...,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes
sex,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
Female,5,14,30,33,43,10,62,17,6.25,18.78,...,61.49,20.93,38.73,88.58,247.05,304.0,291.54,66.16,400.36,134.53
Male,4,17,85,71,124,39,50,23,5.0,21.93,...,58.83,30.58,34.95,163.62,637.73,589.62,877.34,392.12,369.73,191.71


In [42]:
# Selecting a Particular column from a pivot table 
pivot_tips.unstack()["size"]

day,Fri,Fri,Sat,Sat,Sun,Sun,Thur,Thur
smoker,No,Yes,No,Yes,No,Yes,No,Yes
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Female,5,14,30,33,43,10,62,17
Male,4,17,85,71,124,39,50,23
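Since `stack` and `unstack` are inverse operations, a round trip on a small frame may help. The `wide` frame below is a hypothetical two-by-two table of counts.

```python
import pandas as pd

# Hypothetical wide table: one row per sex, one column per day
wide = pd.DataFrame(
    {"Fri": [1, 2], "Sat": [3, 4]},
    index=pd.Index(["Female", "Male"], name="sex"),
)
wide.columns.name = "day"

# stack moves the column labels into an inner index level,
# giving a Series indexed by (sex, day)
long = wide.stack()

# unstack moves the inner index level back into the columns
back = long.unstack()

print(long)
print(back)
```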


### Duplicates
While preparing our dataset, it may become necessary to check for and delete duplicate rows. This is achieved with the ``duplicated()`` and `drop_duplicates()` methods. To find duplicates on specific column(s), use the ``subset`` parameter.

In [43]:
# Checking for duplicate rows in a DataFrame returns a boolean Series
tips.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
239    False
240    False
241    False
242    False
243    False
Length: 244, dtype: bool

In [44]:
# Masking the boolean to the DataFrame
tips[tips.duplicated()]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
202,13.0,2.0,Female,Yes,Thur,Lunch,2


In [45]:
# Dropping the duplicates (Not in place)
tips.drop_duplicates()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [46]:
# Number of rows before dropping the duplicated row (244 rows)
tips.index.size

244

In [47]:
# Number of rows after dropping the duplicated row (243 rows)
tips.drop_duplicates().index.size

243
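The `subset` parameter mentioned above, together with `keep`, can be sketched as follows. The `visits` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical log of page visits
visits = pd.DataFrame(
    {
        "user": ["ann", "bob", "ann", "ann"],
        "page": ["home", "home", "home", "cart"],
    }
)

# A row is flagged only if ALL of its columns repeat an earlier row
flags = visits.duplicated()

# Compare on the "user" column only and keep the last row for each user
last_per_user = visits.drop_duplicates(subset="user", keep="last")

print(flags)
print(last_per_user)
```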

### Reindexing and Renaming
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing, such as:

+ Reorder the existing data to match a new set of labels.

+ Insert missing value (NA) markers in label locations where no data for the label existed.

The `DataFrame.reindex()` method conforms a DataFrame to a new index with optional filling logic, placing NA/NaN in locations that had no value in the previous index. A new object is produced unless the new index is equivalent to the current one and `copy=False`.

**Syntax** ` : DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)`

Parameters | Description
:- | :-
labels | New labels/index to conform the axis specified by ‘axis’ to.
index, columns | New labels / index to conform to. Preferably an Index object to avoid duplicating data
axis | Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).
method | {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional. Method to use for filling holes in the reindexed DataFrame. Only applicable when the index is monotonically increasing or decreasing.
copy | Return a new object, even if the passed indexes are the same
level | Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value | Value to use for missing values. Defaults to NaN, but can be any compatible value.
limit | Maximum number of consecutive elements to forward or backward fill
tolerance | Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] – target) <= tolerance.

`Returns : reindexed : DataFrame`



In [48]:
# Loading flights dataset
flights = pd.read_csv("flights.csv")

In [49]:
flights.head()

Unnamed: 0,year,month,passengers
0,1949,January,112
1,1949,February,118
2,1949,March,132
3,1949,April,129
4,1949,May,121


In [50]:
# Using a pivot table to group by year, then moving the months into the columns
pivot_flights = flights.pivot_table(index=["year","month"]).unstack()

In [51]:
pivot_flights

Unnamed: 0_level_0,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers
month,April,August,December,February,January,July,June,March,May,November,October,September
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
1949,129,148,118,118,112,148,135,132,121,104,119,136
1950,135,170,140,126,115,170,149,141,125,114,133,158
1951,163,199,166,150,145,199,178,178,172,146,162,184
1952,181,242,194,180,171,230,218,193,183,172,191,209
1953,235,272,201,196,196,264,243,236,229,180,211,237
1954,227,293,229,188,204,302,264,235,234,203,229,259
1955,269,347,278,233,242,364,315,267,270,237,274,312
1956,313,405,306,277,284,413,374,317,318,271,306,355
1957,348,467,336,301,315,465,422,356,355,305,347,404
1958,348,505,337,318,340,491,435,362,363,310,359,404


In [52]:
months = ["January", "February","March","April","May","June","July","August","September","October","November","December"]

In [53]:
# Reordering the columns with reindex 
pivoted_flights = pivot_flights["passengers"].reindex(columns=months)

In [54]:
pivoted_flights

month,January,February,March,April,May,June,July,August,September,October,November,December
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1949,112,118,132,129,121,135,148,148,136,119,104,118
1950,115,126,141,135,125,149,170,170,158,133,114,140
1951,145,150,178,163,172,178,199,199,184,162,146,166
1952,171,180,193,181,183,218,230,242,209,191,172,194
1953,196,196,236,235,229,243,264,272,237,211,180,201
1954,204,188,235,227,234,264,302,293,259,229,203,229
1955,242,233,267,269,270,315,364,347,312,274,237,278
1956,284,277,317,313,318,374,413,405,355,306,271,306
1957,315,301,356,348,355,422,465,467,404,347,305,336
1958,340,318,362,348,363,435,491,505,404,359,310,337
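Besides reordering, `reindex` can introduce labels that did not exist before. The sketch below uses a hypothetical `yearly` frame with a gap at 1951 to show the default NaN, `fill_value`, and `method="ffill"`.

```python
import pandas as pd

# Hypothetical yearly totals with no entry for 1951
yearly = pd.DataFrame(
    {"passengers": [112, 118, 132]}, index=[1949, 1950, 1952]
)
all_years = [1949, 1950, 1951, 1952]

# Labels without data get NaN by default
with_gap = yearly.reindex(all_years)

# fill_value supplies a constant for the new labels
zero_filled = yearly.reindex(all_years, fill_value=0)

# method="ffill" carries the last known value forward
forward_filled = yearly.reindex(all_years, method="ffill")

print(with_gap)
print(zero_filled)
print(forward_filled)
```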


### Renaming
The `rename()` method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [55]:
mth = {"January":"jan", "February":"feb","March":"mar","April":"apr","May":"may","June":"jun","July":"jul","August":"aug","September":"sept","October":"oct","November":"nov","December":"dec"}

In [57]:
# Renaming the columns
pivoted_flights.rename(columns=mth)

month,jan,feb,mar,apr,may,jun,jul,aug,sept,oct,nov,dec
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1949,112,118,132,129,121,135,148,148,136,119,104,118
1950,115,126,141,135,125,149,170,170,158,133,114,140
1951,145,150,178,163,172,178,199,199,184,162,146,166
1952,171,180,193,181,183,218,230,242,209,191,172,194
1953,196,196,236,235,229,243,264,272,237,211,180,201
1954,204,188,235,227,234,264,302,293,259,229,203,229
1955,242,233,267,269,270,315,364,347,312,274,237,278
1956,284,277,317,313,318,374,413,405,355,306,271,306
1957,315,301,356,348,355,422,465,467,404,347,305,336
1958,340,318,362,348,363,435,491,505,404,359,310,337
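`rename` also accepts a function instead of a dict, and it can relabel the row index as well as the columns. A minimal sketch on a hypothetical one-row `table`:

```python
import pandas as pd

# Hypothetical one-row table with full month names as columns
table = pd.DataFrame({"January": [112], "February": [118]}, index=[1949])

# A function is applied to every column label
short = table.rename(columns=lambda name: name[:3].lower())

# A dict passed to index relabels rows instead of columns
labelled = table.rename(index={1949: "y1949"})

print(short)
print(labelled)
```

As with the other methods in this section, the original `table` is unchanged unless the result is assigned back.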
