# Data Manipulation with Pandas
## Mastering Indexing, Selection, and Grouping

📚 Completed tasks include:

* Loading a DataFrame from a CSV file
* Setting a specific column as the index
* Selecting columns and rows using various methods (.loc, .iloc, etc.)
* Filtering rows based on conditions
* Grouping data by one or multiple columns
* Calculating mean, sum, and size of groups
* Applying multiple aggregation functions using .agg
* Selecting rows based on multiple conditions using .query and .isin
* Renaming columns

🎉 Demonstrates proficiency in using Pandas for data manipulation and analysis.

In [7]:
import pandas as pd
import numpy as np

1. Load a DataFrame from a CSV file. Display the first and last five rows of the DataFrame.

In [8]:
# Loading the dataframe
df = pd.read_csv('C:/Users/faizr/Data-Science-Group-2---BWF---FAIZ-RAZA/Task 12/diabetes (1).csv')
print("First 5 Rows: ")
df.head()

First 5 Rows: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [9]:
print("Last 5 Rows: ")
df.tail()

Last 5 Rows: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0
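
The loading step can be tried without the original file. The sketch below reads a small in-memory CSV instead; the sample rows and the names `csv_text` and `sample` are assumptions made for illustration, not the diabetes dataset itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: three columns, seven rows.
csv_text = """Glucose,BMI,Outcome
148,33.6,1
85,26.6,0
183,23.3,1
89,28.1,0
137,43.1,1
116,25.6,0
78,31.0,1
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # head(n) returns the first n rows (default 5)
last_two = sample.tail(2)   # tail(n) returns the last n rows (default 5)

print(first_two)
print(last_two)
print(sample.shape)         # (number of rows, number of columns)
```

In a notebook only the last expression of a cell is displayed automatically, which is why `head()` and `tail()` sit in separate cells above.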


2. Set a specific column as the index of the DataFrame.

In [10]:
# set_index returns a new DataFrame; df itself is left unchanged
print("Setting 'Age' as the index")
df.set_index('Age')

Setting 'Age' as the index


Age,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Outcome
50,6,148,72,35,0,33.6,0.627,1
31,1,85,66,29,0,26.6,0.351,0
32,8,183,64,0,0,23.3,0.672,1
21,1,89,66,23,94,28.1,0.167,0
33,0,137,40,35,168,43.1,2.288,1
...,...,...,...,...,...,...,...,...
63,10,101,76,48,180,32.9,0.171,0
27,2,122,70,27,0,36.8,0.340,0
30,5,121,72,23,112,26.2,0.245,0
47,1,126,60,0,0,30.1,0.349,1
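
A short sketch of how `set_index` behaves, using a small made-up DataFrame (`sample` and its values are assumptions). It shows that the call returns a new object, and that `reset_index` moves the index back into a regular column.

```python
import pandas as pd

# Hypothetical three-row sample with the same column names as the dataset.
sample = pd.DataFrame(
    {"Age": [50, 31, 32], "Glucose": [148, 85, 183], "Outcome": [1, 0, 1]}
)

# set_index returns a new DataFrame; sample keeps 'Age' as a regular column.
indexed = sample.set_index("Age")

print(sample.columns.tolist())   # 'Age' is still listed here
print(indexed.index.name)        # the new index is named 'Age'

# reset_index turns the index back into a column.
restored = indexed.reset_index()
print(restored.equals(sample))
```

To keep the change, assign the result to a name, as done with `indexed` here.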


3. Select a specific column and display its values.

In [11]:
#Selecting the specific column and displaying
print("Displaying the Column name 'Outcome'")
df[['Outcome']]

Displaying the Column name 'Outcome'


Unnamed: 0,Outcome
0,1
1,0
2,1
3,0
4,1
...,...
763,0
764,0
765,0
766,1
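
The double brackets in the cell above matter. A minimal sketch of the difference, on a made-up DataFrame (`sample` is an assumption): single brackets return a Series, double brackets return a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data.
sample = pd.DataFrame({"BMI": [33.6, 26.6, 23.3], "Outcome": [1, 0, 1]})

as_series = sample["Outcome"]    # single brackets: a 1-D Series
as_frame = sample[["Outcome"]]   # double brackets: a 2-D DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```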


4. Select multiple columns and display the resulting DataFrame.

In [12]:
#displaying the desired columns
columns = ['Outcome', 'BMI', 'Pregnancies', 'BloodPressure']
print("Printing multiple columns")
df[columns]

Printing multiple columns


Unnamed: 0,Outcome,BMI,Pregnancies,BloodPressure
0,1,33.6,6,72
1,0,26.6,1,66
2,1,23.3,8,64
3,0,28.1,1,66
4,1,43.1,0,40
...,...,...,...,...
763,0,32.9,10,76
764,0,36.8,2,70
765,0,26.2,5,72
766,1,30.1,1,60


5. Select a subset of rows using the .loc method.

In [13]:
# .loc slices by label, so the end label (13) is included
print("Sub df: ")
df.loc[4:13]

Sub df: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1
10,4,110,92,0,0,37.6,0.191,30,0
11,10,168,74,0,0,38.0,0.537,34,1
12,10,139,80,0,0,27.1,1.441,57,0
13,1,189,60,23,846,30.1,0.398,59,1


6. Select a subset of rows and columns using the .iloc method.

In [14]:
# .iloc slices by position: rows 4 to 12 (end position excluded) and the first three columns
print("Sub df: ")
df.iloc[4:13, :3]

Sub df: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure
4,0,137,40
5,5,116,74
6,3,78,50
7,10,115,0
8,2,197,70
9,8,125,96
10,4,110,92
11,10,168,74
12,10,139,80
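
The two outputs above differ by one row because `.loc` and `.iloc` treat the end of a slice differently. A small sketch on made-up data (`sample` is an assumption) that makes the difference explicit:

```python
import pandas as pd

# Hypothetical 20-row sample with a default integer index (0..19).
sample = pd.DataFrame(
    {"Glucose": range(100, 120), "BMI": [20.0 + i for i in range(20)]}
)

# Label-based slice: both endpoints are included, so this returns 10 rows.
by_label = sample.loc[4:13]

# Position-based slice: the end is excluded, so this returns 9 rows.
# The second slice (:1) keeps only the first column.
by_position = sample.iloc[4:13, :1]

print(len(by_label), by_label.index[-1])
print(len(by_position), by_position.index[-1])
```

With a default integer index the labels and positions coincide, which makes the off-by-one difference easy to miss.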


7. Filter rows based on a condition.

In [15]:
#filtering
print("Filtered df: ")
df.loc[df['BMI'] > 50]

Filtered df: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
120,0,162,76,56,100,53.2,0.759,25,1
125,1,88,30,42,99,55.0,0.496,26,1
177,0,129,110,46,130,67.1,0.319,26,1
193,11,135,0,0,0,52.3,0.578,40,1
247,0,165,90,33,680,52.3,0.427,23,0
303,5,115,98,0,0,52.9,0.209,28,1
445,0,180,78,63,14,59.4,2.42,25,1
673,3,123,100,35,240,57.3,0.88,22,0


8. Group the DataFrame by a specific column and calculate the mean of each group.

In [16]:
#Exploring the groupby
#df is already defined
print("Grouping the df to get the mean of all")
df.groupby(['Glucose']).mean()

Grouping the df to get the mean of all


Glucose,Pregnancies,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,2.800000,67.600000,29.600000,4.600000,32.880000,0.380200,28.600000,0.40
44,5.000000,62.000000,0.000000,0.000000,25.000000,0.587000,36.000000,0.00
56,2.000000,56.000000,28.000000,45.000000,24.200000,0.332000,22.000000,0.00
57,4.500000,70.000000,18.500000,0.000000,27.250000,0.415500,54.000000,0.00
61,3.000000,82.000000,28.000000,0.000000,34.400000,0.243000,46.000000,0.00
...,...,...,...,...,...,...,...,...
195,6.500000,70.000000,16.500000,72.500000,28.000000,0.245500,43.000000,1.00
196,5.333333,80.666667,21.666667,176.333333,37.933333,0.643667,42.333333,1.00
197,4.000000,71.000000,45.750000,321.750000,31.950000,1.063250,46.250000,0.75
198,0.000000,66.000000,32.000000,274.000000,41.300000,0.502000,28.000000,1.00


9. Group the DataFrame by multiple columns and calculate the sum of each group.

In [17]:
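# Group by two columns at once
# Note: the task asks for the sum of each group, but this cell computes the mean;
# calling .sum() instead of .mean() would return the sums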
print("Grouping the df to get the mean of all")
df.groupby(['Glucose', 'BMI']).mean()

Grouping the df to get the mean of all


Glucose,BMI,Pregnancies,BloodPressure,SkinThickness,Insulin,DiabetesPedigreeFunction,Age,Outcome
0,24.7,1.0,48.0,20.0,0.0,0.140,22.0,0.0
0,27.7,1.0,74.0,20.0,23.0,0.299,21.0,0.0
0,32.0,1.0,68.0,35.0,0.0,0.389,22.0,0.0
0,39.0,6.0,68.0,41.0,0.0,0.727,41.0,1.0
0,41.0,5.0,80.0,32.0,0.0,0.346,37.0,1.0
...,...,...,...,...,...,...,...,...
197,30.5,2.0,70.0,45.0,543.0,0.158,53.0,1.0
197,34.7,2.0,70.0,99.0,0.0,0.575,62.0,1.0
197,36.7,4.0,70.0,39.0,744.0,2.329,31.0,0.0
198,41.3,0.0,66.0,32.0,274.0,0.502,28.0,1.0
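
Since the cell above computes means, here is a minimal sketch of the sum variant the task statement describes. The data and the `AgeGroup` column are hypothetical, chosen so that some groups contain more than one row.

```python
import pandas as pd

# Hypothetical sample; 'AgeGroup' is an invented column for illustration.
sample = pd.DataFrame(
    {
        "Outcome": [0, 0, 1, 1, 1],
        "AgeGroup": ["young", "young", "young", "old", "old"],
        "Pregnancies": [1, 2, 3, 4, 5],
        "Insulin": [0, 94, 168, 0, 88],
    }
)

# Grouping by a list of columns produces one row per unique combination,
# indexed by a MultiIndex of (Outcome, AgeGroup).
group_sums = sample.groupby(["Outcome", "AgeGroup"]).sum()
print(group_sums)

# A single group is addressed with a tuple of key values.
print(group_sums.loc[(1, "old"), "Pregnancies"])
```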


10. Use the agg method to apply multiple aggregation functions to grouped data.

In [18]:
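# Group by 'Glucose' and aggregate selected columns with .agg
# Each column is mapped to a single function here; a list such as ['sum', 'mean'] would apply several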
print("Grouping the multiple columns and getting the sum()")
df.groupby(['Glucose']).agg({'Pregnancies':'sum', 'BMI': 'sum', 'Insulin': 'sum'})

Grouping the multiple columns and getting the sum()


Glucose,Pregnancies,BMI,Insulin
0,14,164.4,23
44,5,25.0,0
56,2,24.2,45
57,9,54.5,0
61,3,34.4,0
...,...,...,...
195,13,56.0,145
196,16,113.8,529
197,16,127.8,1287
198,0,41.3,274
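
The cell above maps each column to one function. A sketch of two ways to apply several aggregations in a single `.agg` call, on made-up data (`sample` and the output names `mean_bmi` and `total_insulin` are assumptions):

```python
import pandas as pd

# Hypothetical sample data.
sample = pd.DataFrame(
    {
        "Outcome": [0, 0, 1, 1],
        "BMI": [26.6, 28.1, 33.6, 43.1],
        "Insulin": [0, 94, 0, 168],
    }
)

# Dict form: a list of functions per column yields MultiIndex columns
# such as ('BMI', 'mean') and ('BMI', 'max').
multi = sample.groupby("Outcome").agg({"BMI": ["mean", "max"], "Insulin": "sum"})
print(multi)

# Named aggregation: output_name=(column, function) yields flat column names.
named = sample.groupby("Outcome").agg(
    mean_bmi=("BMI", "mean"),
    total_insulin=("Insulin", "sum"),
)
print(named)
```

Named aggregation tends to be easier to work with afterwards because the result has ordinary single-level column names.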


11. Calculate the size of each group.

In [19]:
#Exploring the groupby
#df is already defined
print("Size of the group")
df.groupby(['Glucose']).size()

Size of the group


Glucose
0      5
44     1
56     1
57     2
61     1
      ..
195    2
196    3
197    4
198    1
199    1
Length: 136, dtype: int64
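
`size()` is easy to confuse with `count()`. A minimal sketch of the difference on made-up data with missing values (`sample` is an assumption): `size()` counts rows per group, while `count()` counts non-missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing Insulin readings.
sample = pd.DataFrame(
    {
        "Outcome": [0, 0, 1, 1, 1],
        "Insulin": [94.0, np.nan, 168.0, np.nan, 88.0],
    }
)

# size() counts every row in the group, including rows with NaN.
sizes = sample.groupby("Outcome").size()

# count() skips NaN values in the selected column.
counts = sample.groupby("Outcome")["Insulin"].count()

print(sizes)
print(counts)
```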

12. Select rows based on multiple conditions.

In [20]:
#filtering
print("Based on Multiple Conditions: ")
df.loc[(df['BMI'] > 50) & (df['BloodPressure'] > 90) & (df['Outcome'] == 1)]

Based on Multiple Conditions: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
177,0,129,110,46,130,67.1,0.319,26,1
303,5,115,98,0,0,52.9,0.209,28,1


13. Use the query method to filter rows.

In [21]:
#Querying
print("Using Querying: ")
df.query('BMI > 50 and BloodPressure > 90 and Outcome == 1')

Using Querying: 


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
177,0,129,110,46,130,67.1,0.319,26,1
303,5,115,98,0,0,52.9,0.209,28,1
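
Tasks 12 and 13 return the same rows because the boolean mask and the query string express the same condition. The sketch below checks that equivalence on made-up data and also shows how `query` can reference Python variables with the `@` prefix; `sample`, `min_bmi`, and `min_bp` are assumptions.

```python
import pandas as pd

# Hypothetical sample data.
sample = pd.DataFrame(
    {
        "BMI": [53.2, 67.1, 52.9, 30.1],
        "BloodPressure": [76, 110, 98, 95],
        "Outcome": [1, 1, 1, 1],
    }
)

min_bmi = 50
min_bp = 90

# Boolean mask: each condition needs its own parentheses, combined with &.
mask_result = sample[
    (sample["BMI"] > min_bmi)
    & (sample["BloodPressure"] > min_bp)
    & (sample["Outcome"] == 1)
]

# query: column names are used directly; @name refers to a Python variable.
query_result = sample.query(
    "BMI > @min_bmi and BloodPressure > @min_bp and Outcome == 1"
)

print(query_result)
print(mask_result.equals(query_result))
```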


14. Use isin to filter rows based on a list of values.

In [22]:
# Collect the BMI values above 50, then keep the rows whose BMI is one of them
v = df.loc[df['BMI'] > 50, 'BMI'].values
print("using isin")
df[df['BMI'].isin(v)]

using isin


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
120,0,162,76,56,100,53.2,0.759,25,1
125,1,88,30,42,99,55.0,0.496,26,1
177,0,129,110,46,130,67.1,0.319,26,1
193,11,135,0,0,0,52.3,0.578,40,1
247,0,165,90,33,680,52.3,0.427,23,0
303,5,115,98,0,0,52.9,0.209,28,1
445,0,180,78,63,14,59.4,2.42,25,1
673,3,123,100,35,240,57.3,0.88,22,0
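
`isin` is most useful with an explicit list of allowed values. A minimal sketch on made-up data (`sample` and `wanted_ages` are assumptions), including the negated form with `~`:

```python
import pandas as pd

# Hypothetical sample data.
sample = pd.DataFrame(
    {"Age": [21, 25, 26, 40, 26], "BMI": [28.1, 53.2, 55.0, 52.3, 67.1]}
)

wanted_ages = [25, 26]

# Keep rows whose Age appears in the list.
matches = sample[sample["Age"].isin(wanted_ages)]

# ~ inverts the boolean mask: keep rows whose Age is NOT in the list.
others = sample[~sample["Age"].isin(wanted_ages)]

print(matches)
print(others)
```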


15. Select specific columns and rename them.

In [23]:
# Rename the 'BMI' column; rename returns a new DataFrame and leaves df unchanged
df.rename(columns={'BMI' : 'HEHE'})

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,HEHE,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
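
The task statement also mentions selecting columns before renaming. A sketch of both steps chained together, on made-up data; `sample` and the new names `BodyMassIndex` and `HasDiabetes` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
sample = pd.DataFrame(
    {"Glucose": [148, 85], "BMI": [33.6, 26.6], "Outcome": [1, 0]}
)

# Select two columns, then rename them; both steps return new objects.
renamed = sample[["BMI", "Outcome"]].rename(
    columns={"BMI": "BodyMassIndex", "Outcome": "HasDiabetes"}
)

print(renamed.columns.tolist())
print(sample.columns.tolist())  # the original column names are unchanged
```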


# Practical Application: Sales Data Analysis
## Using Pandas for Data Manipulation and Analysis

In this task, we will demonstrate the use of Pandas for data manipulation and analysis on a sample sales dataset.

📊 We will load the data into a Pandas DataFrame and perform various operations, including:

* Selecting rows and columns
* Filtering rows with conditions, .query, and .isin
* Grouping and aggregating data
* Renaming columns

This task will showcase the power of Pandas in extracting insights from data.

In [34]:
data = {
	'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05'],
	'Sales': [100, 200, 300, 400, 500],
	'Region': ['North', 'South', 'East', 'West', 'North'],
	'Product': ['ProductA', 'ProductB', 'ProductA', 'ProductB', 'ProductA']
}

sales_data = pd.DataFrame(data)
sales_data

Unnamed: 0,Date,Sales,Region,Product
0,2022-01-01,100,North,ProductA
1,2022-01-02,200,South,ProductB
2,2022-01-03,300,East,ProductA
3,2022-01-04,400,West,ProductB
4,2022-01-05,500,North,ProductA


In [38]:
# Display first and last five rows of data
print(sales_data.head())
print(sales_data.tail())

# Set "Date" column as index
sales_data.set_index('Date', inplace=True)

# Select "Sales" column and display values
print(f"Sales: \n {sales_data['Sales']}")

# Select "Sales" and "Region" columns and display resulting DataFrame
print(f"Sales with Region: \n {sales_data[['Sales', 'Region']]}")

# Select subset of rows where "Sales" value is greater than 1000
# (no row in this sample exceeds 1000, so the result is an empty DataFrame)
print(f"Filtering Sales: \n {sales_data[sales_data['Sales'] > 1000]}")

# Select subset of rows and columns using .iloc method
print(f"Subdf: \n {sales_data.iloc[0:5, 0:2]}")

# Group data by "Region" column and calculate mean sales
print(f"Groupby region: \n  {sales_data.groupby('Region')['Sales'].mean()}")

# Group data by multiple columns ("Region" and "Product") and calculate sum of sales
print(f"Region and Product: \n {sales_data.groupby(['Region', 'Product'])['Sales'].sum()}")

# Use agg method to apply multiple aggregation functions to grouped data
print(f"Mean and sum of region: \n {sales_data.groupby('Region').agg({'Sales': ['mean', 'sum']})}")

# Calculate size of each group
print(f"Size of Region: \n {sales_data.groupby('Region').size()}")

# Select rows based on multiple conditions
# (the f prefix is required; without it the braces are printed literally)
print(f"Filtering Data: \n {sales_data[(sales_data['Sales'] > 1000) & (sales_data['Region'] == 'North')]}")

# Use query method to filter rows
print(sales_data.query('Sales > 1000 and Region == "North"'))

# Use isin to filter rows based on list of values
print(f"Using isin: \n {sales_data[sales_data['Product'].isin(['ProductA', 'ProductB'])]}")

# Select specific columns and rename them
print(sales_data.rename(columns={'Sales': 'Revenue'})[['Revenue', 'Region']])

         Date  Sales Region   Product
0  2022-01-01    100  North  ProductA
1  2022-01-02    200  South  ProductB
2  2022-01-03    300   East  ProductA
3  2022-01-04    400   West  ProductB
4  2022-01-05    500  North  ProductA
         Date  Sales Region   Product
0  2022-01-01    100  North  ProductA
1  2022-01-02    200  South  ProductB
2  2022-01-03    300   East  ProductA
3  2022-01-04    400   West  ProductB
4  2022-01-05    500  North  ProductA
Sales: 
 Date
2022-01-01    100
2022-01-02    200
2022-01-03    300
2022-01-04    400
2022-01-05    500
Name: Sales, dtype: int64
Sales with Region: 
             Sales Region
Date                    
2022-01-01    100  North
2022-01-02    200  South
2022-01-03    300   East
2022-01-04    400   West
2022-01-05    500  North
Filtering Sales: 
 Empty DataFrame
Columns: [Sales, Region, Product]
Index: []
Subdf: 
             Sales Region
Date                    
2022-01-01    100  North
2022-01-02    200  South
2022-01-03    300   East
2022-