# Data Manipulation with Pandas

Pandas is one of the most widely used Python libraries for data science. It is incredibly helpful for manipulating data so that you can derive better insights and build better machine learning models.

In this notebook, we will look at some intermediate concepts of working with pandas.

### Table of Contents

1. Sorting dataframes
2. Merging dataframes

### Loading dataset

In this notebook we will use the Big Mart Sales Data.

In [2]:
import pandas as pd

In [3]:
import numpy as np

In [6]:
# Read the dataset
data_BM = pd.read_csv("Pandas Part 1/bigmart_data.csv")
# Drop every row that contains at least one null value
data_BM = data_BM.dropna(how = "any")
# View the top results
data_BM.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
5,FDP36,10.395,Regular,0.0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
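Note that the index in the output above jumps from 2 to 4, presumably because the row with index 3 contained a missing value and was removed by `dropna`. The following is a minimal sketch of that behaviour on a small made-up frame (the `toy` data is hypothetical and is not taken from the Big Mart file); it also shows `how="all"` and `reset_index(drop=True)` for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has one missing value
toy = pd.DataFrame({
    "Item_Identifier": ["A", "B", "C"],
    "Item_Weight": [9.3, np.nan, 17.5],
    "Outlet_Size": ["Medium", "High", "Small"],
})

# how="any" drops a row if at least one of its values is missing
dropped_any = toy.dropna(how="any")

# how="all" drops a row only if every value in it is missing
dropped_all = toy.dropna(how="all")

# The surviving rows keep their original index labels;
# reset_index(drop=True) renumbers them from 0
renumbered = dropped_any.reset_index(drop=True)

print(dropped_any)
print(renumbered)
```

Whether to renumber the index is a design choice: keeping the original labels makes it easy to trace rows back to the source file.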


### 1. Sorting dataframes

A pandas data frame has two useful sorting methods:

- **sort_values()**: sorts a pandas data frame by one or more columns
- **sort_index()**: sorts a pandas data frame by its row index

Each of these methods comes with numerous options, such as sorting in a specific order (ascending or descending), sorting in place, controlling where missing values are placed, and choosing the sorting algorithm.

Suppose you want to sort the dataframe by "Outlet_Establishment_Year". For that, you use **sort_values**:
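As a rough illustration of the options listed above, here is a small sketch on a made-up frame (the `toy` data and its index labels are hypothetical). It uses `sort_index()` to order rows by index, the `na_position` parameter to place missing values, and the `kind` parameter to pick a sorting algorithm.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a shuffled index and one missing weight
toy = pd.DataFrame(
    {"Item_Weight": [17.5, np.nan, 5.92, 9.3]},
    index=[2, 0, 3, 1],
)

# Sort rows by their index labels (ascending by default)
by_index = toy.sort_index()

# Sort by a column; missing values go last by default
nan_last = toy.sort_values(by="Item_Weight")

# Put missing values first instead
nan_first = toy.sort_values(by="Item_Weight", na_position="first")

# Request a specific algorithm; mergesort is a stable sort
stable = toy.sort_values(by="Item_Weight", kind="mergesort")

print(by_index)
print(nan_first)
```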

In [7]:
# Sort by year
sorted_data = data_BM.sort_values(by = 'Outlet_Establishment_Year')
# Print sorted data
sorted_data[:5]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
2812,FDR60,14.3,Low Fat,0.130307,Baking Goods,75.7328,OUT013,1987,High,Tier 3,Supermarket Type1,617.8624
5938,NCJ06,20.1,Low Fat,0.034624,Household,118.9782,OUT013,1987,High,Tier 3,Supermarket Type1,1549.3166
3867,FDY38,13.6,Regular,0.119077,Dairy,231.23,OUT013,1987,High,Tier 3,Supermarket Type1,2330.3
1307,FDB37,20.25,Regular,0.022922,Baking Goods,240.7538,OUT013,1987,High,Tier 3,Supermarket Type1,3364.9532
5930,NCA18,10.1,Low Fat,0.056031,Household,115.1492,OUT013,1987,High,Tier 3,Supermarket Type1,1737.738


- `sort_values` takes several options, including:
    - `ascending`: sorting is ascending by default; pass `False` to sort in descending order.
    - `inplace`: whether to sort the original dataframe in place instead of returning a sorted copy.

In [12]:
# Sort in descending order; inplace=True sorts data_BM itself instead of returning a new dataframe
data_BM.sort_values(by = 'Outlet_Establishment_Year', ascending = False, inplace = True)
data_BM[:5]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
2825,FDL16,12.85,Low Fat,0.169139,Frozen Foods,46.406,OUT018,2009,Medium,Tier 3,Supermarket Type2,186.424
7389,NCD42,16.5,Low Fat,0.012689,Health and Hygiene,39.7506,OUT018,2009,Medium,Tier 3,Supermarket Type2,227.7036
2165,DRJ39,20.25,Low Fat,0.036474,Dairy,218.3482,OUT018,2009,Medium,Tier 3,Supermarket Type2,2409.5302
2162,FDR60,14.3,Low Fat,0.130946,Baking Goods,76.7328,OUT018,2009,Medium,Tier 3,Supermarket Type2,1312.9576
2158,FDM58,16.85,Regular,0.080015,Snack Foods,111.8544,OUT018,2009,Medium,Tier 3,Supermarket Type2,1677.816
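The `ascending` and `inplace` options can also be combined with a sort on several columns. The sketch below uses a small made-up frame (the `toy` values are hypothetical) to sort by year in descending order and by `Item_MRP` in ascending order within each year. It also shows that a call with `inplace=True` returns `None`, so its result should not be assigned back to a variable.

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the dataset
toy = pd.DataFrame({
    "Outlet_Establishment_Year": [2009, 1987, 2009, 1999],
    "Item_MRP": [46.4, 75.7, 218.3, 141.6],
})

# One ascending flag per column: newest year first, cheapest item first within a year
multi = toy.sort_values(
    by=["Outlet_Establishment_Year", "Item_MRP"],
    ascending=[False, True],
)

# With inplace=True the frame itself is reordered and the call returns None
result = toy.sort_values(by="Item_MRP", inplace=True)

print(multi)
print(result)
print(toy)
```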
