## Notebook 5

## **Pandas Essentials: Practical Guide**

## Cleaning vs. Transforming Data
**Cleaning:** handling missing values, duplicates, incorrect types, and malformed data.

**Transforming:** changing structure, aggregating, formatting, and engineering new fields.

Pandas does both, and we'll practice each one step by step.
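
To make the distinction concrete, here is a minimal sketch on a small made-up table. The `raw` DataFrame and its `city` and `sales` columns are hypothetical and are not part of this notebook's data. The first three steps clean (remove a duplicate, convert a type, fill a gap), and the last one transforms (aggregates into a new shape).

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing value, numbers stored as text
raw = pd.DataFrame({
    "city": ["Cairo", "Cairo", "Giza", "Giza"],
    "sales": ["100", "100", "250", None],
})

# Cleaning: drop exact duplicates, convert text to numbers, fill the missing value
clean = raw.drop_duplicates().copy()
clean["sales"] = pd.to_numeric(clean["sales"])
clean["sales"] = clean["sales"].fillna(0)

# Transforming: aggregate into a new structure (total sales per city)
totals = clean.groupby("city")["sales"].sum()
print(totals)
```

Filling with 0 is only one possible choice here; whether it is appropriate depends on what a missing value means in the data.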



## Pandas Series and DataFrames Basics

In [1]:
import numpy as np
import pandas as pd

In [None]:
# Series : Revenue Over Months

Monthly_Revenue = pd.Series([30000, 50000, 45000 , 60000] , name = "Revenue")
print("The Monthly Revenue Series Is :" , "\n", Monthly_Revenue)

The Monthly Revenue Series Is : 
 0    30000
1    50000
2    45000
3    60000
Name: Revenue, dtype: int64


In [7]:
# Not A Number "np.nan" : A Missing Value Turns The Integer Series Into float64

Monthly_Revenue = pd.Series([30000, 50000, 45000, np.nan , 60000] , name = "Revenue")
print("The Monthly Revenue Series Is :" , "\n", Monthly_Revenue)

The Monthly Revenue Series Is : 
 0    30000.0
1    50000.0
2    45000.0
3        NaN
4    60000.0
Name: Revenue, dtype: float64
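
The output above shows the dtype changing from `int64` to `float64` once a `NaN` is present, because `np.nan` is itself a float. As a hedged aside, pandas also offers a nullable integer dtype (`"Int64"`, with a capital I) that keeps whole numbers and marks gaps with `pd.NA`. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

# Default behavior: a NaN forces the whole Series to float64
as_float = pd.Series([30000, np.nan, 60000], name="Revenue")

# Nullable integer dtype: values stay integers, the gap is stored as <NA>
as_nullable_int = pd.Series([30000, pd.NA, 60000], name="Revenue", dtype="Int64")

print(as_float.dtype)
print(as_nullable_int.dtype)
print(as_nullable_int.isna().sum())  # isna() detects <NA> as well
```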


In [12]:
# DataFrame : Employees At Different Firms

Data = {
    'Employees':['Abdelrahman' ,'Menna' ,'Yassin' ,'Ali'],
    'Company' : ['PwC' ,'Microsoft' ,'Noon' ,'Amazon'],
    'YOE':[2, 3,1, 0]
}
# Printing Data As DataFrame

df = pd.DataFrame(Data)
print("DataFrame Example : \n ",df)

DataFrame Example : 
       Employees    Company  YOE
0  Abdelrahman        PwC    2
1        Menna  Microsoft    3
2       Yassin       Noon    1
3          Ali     Amazon    0


In [13]:
# DataFrame : Employees At Different Firms

Data = {
    'Employees':['Abdelrahman' ,'Menna' ,'Yassin' ,'Ali'],
    'Company' : ['PwC' ,'Microsoft' ,'Noon' ,'Amazon'],
    # NaN Value
    'YOE':[2, np.nan,1, 0]
}
# Printing Data As DataFrame

df = pd.DataFrame(Data)
print("DataFrame Example : \n ",df)

DataFrame Example : 
       Employees    Company  YOE
0  Abdelrahman        PwC  2.0
1        Menna  Microsoft  NaN
2       Yassin       Noon  1.0
3          Ali     Amazon  0.0


## Exploring and Inspecting Your DataFrame

In [15]:
# Printing The First 5 Rows (This DataFrame Has Only 4, So All Of Them Appear)
df.head()

     Employees    Company  YOE
0  Abdelrahman        PwC  2.0
1        Menna  Microsoft  NaN
2       Yassin       Noon  1.0
3          Ali     Amazon  0.0


In [18]:
print(df.head(2))


     Employees    Company  YOE
0  Abdelrahman        PwC  2.0
1        Menna  Microsoft  NaN


In [16]:
# Printing The Last 5 Rows (This DataFrame Has Only 4, So All Of Them Appear)
df.tail()

     Employees    Company  YOE
0  Abdelrahman        PwC  2.0
1        Menna  Microsoft  NaN
2       Yassin       Noon  1.0
3          Ali     Amazon  0.0


In [20]:
print(df.tail(2))

  Employees Company  YOE
2    Yassin    Noon  1.0
3       Ali  Amazon  0.0


In [None]:
# Info Summary
# df.info() prints its report directly and returns None,
# so wrapping it in print() would add a stray "None" line

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Employees  4 non-null      object 
 1   Company    4 non-null      object 
 2   YOE        3 non-null      float64
dtypes: float64(1), object(2)
memory usage: 224.0+ bytes


In [23]:
# Describe Numerical Columns

df.describe()

       YOE
count  3.0
mean   1.0
std    1.0
min    0.0
25%    0.5
50%    1.0
75%    1.5
max    2.0


In [24]:
# Check Whether Any Values Are Missing (Count Per Column)

df.isna().sum()

Employees    0
Company      0
YOE          1
dtype: int64

In [25]:
# View The Full Matrix Of Missing Positions

df.isna()

   Employees  Company    YOE
0      False    False  False
1      False    False   True
2      False    False  False
3      False    False  False
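
The boolean matrix becomes hard to scan on larger tables. A short sketch of two follow-ups: pulling out only the rows that contain a gap, and measuring the share of missing values in a column. `df_demo` is a stand-in built for this example so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in data mirroring the employees table, with one missing YOE
df_demo = pd.DataFrame({
    "Employees": ["Abdelrahman", "Menna", "Yassin", "Ali"],
    "YOE": [2, np.nan, 1, 0],
})

# any(axis=1) is True for a row if at least one of its columns is missing
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_gaps)

# The mean of a boolean Series is the fraction of True values
share_missing = df_demo["YOE"].isna().mean()
print(share_missing)
```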


## Simulate Missing Values

In [27]:
# Simulate A Missing Experience Value 

# Set A Value To NaN Using loc (Row Label 2, Column "YOE")

df.loc[2,"YOE"] = np.nan
df

     Employees    Company  YOE
0  Abdelrahman        PwC  2.0
1        Menna  Microsoft  NaN
2       Yassin       Noon  NaN
3          Ali     Amazon  0.0


In [29]:
df.isna().sum()


Employees    0
Company      0
YOE          2
dtype: int64

In [30]:
# Dropping Rows With NaN Values (Returns A New DataFrame; df Itself Is Unchanged)
df.dropna()

     Employees Company  YOE
0  Abdelrahman     PwC  2.0
3          Ali  Amazon  0.0
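
Two details about `dropna()` are easy to miss: it returns a new DataFrame rather than changing the original, and its `subset` parameter limits the check to chosen columns. The sketch below uses a stand-in `df_demo` with one gap in `Company` and one in `YOE`; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Employees": ["Abdelrahman", "Menna", "Yassin", "Ali"],
    "Company": ["PwC", "Microsoft", "Noon", None],
    "YOE": [2, np.nan, 1, 0],
})

# Drop a row if ANY column is missing
all_complete = df_demo.dropna()

# Drop a row only if YOE is missing; a missing Company is tolerated
has_yoe = df_demo.dropna(subset=["YOE"])

print(all_complete)
print(has_yoe)
print(len(df_demo))  # the original still has all of its rows
```

To keep the result, assign it back, for example `df_demo = df_demo.dropna(subset=["YOE"])`.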


In [32]:
# Re-Initialize The DataFrame To Its Original Clean State

Data = {
    'Employees':['Abdelrahman' ,'Menna' ,'Yassin' ,'Ali'],
    'Company' : ['PwC' ,'Microsoft' ,'Noon' ,'Amazon'],
    'YOE':[2, 3,1, 0]
}
# Printing Data As DataFrame

df = pd.DataFrame(Data)
print("DataFrame Example : \n ",df)

DataFrame Example : 
       Employees    Company  YOE
0  Abdelrahman        PwC    2
1        Menna  Microsoft    3
2       Yassin       Noon    1
3          Ali     Amazon    0


## Selecting And Filtering DataFrames

In [38]:
# Selecting Columns 

print("Company Column  \n" , df['Company'])

Company Column  
 0          PwC
1    Microsoft
2         Noon
3       Amazon
Name: Company, dtype: object


In [42]:
# Select Names Of Employees And Years Of Experience

print("Name and YOE \n" , df[['Employees' , 'YOE']])

Name and YOE 
      Employees  YOE
0  Abdelrahman    2
1        Menna    3
2       Yassin    1
3          Ali    0


**Selecting Rows Using `loc` & `iloc`**

In [None]:
# Selecting Rows Using loc 
# loc ==> Selects By Index Label (Here The Labels Happen To Be 0, 1, 2, 3)

print("Select Row By Label \n",df.loc[2]) # Getting The Row With Index Label = 2

Select Row By Label 
 Employees    Yassin
Company        Noon
YOE               1
Name: 2, dtype: object


In [45]:
# Selecting Rows Using iloc 

print("Select Row By Position \n",df.iloc[2])  # Getting The Row With Position 2 

Select Row By Position 
 Employees    Yassin
Company        Noon
YOE               1
Name: 2, dtype: object
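
With the default index, `loc[2]` and `iloc[2]` return the same row, which hides the difference between them. A minimal sketch with a custom index shows it; the labels 10, 20, 30, 40 are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Employees": ["Abdelrahman", "Menna", "Yassin", "Ali"],
        "YOE": [2, 3, 1, 0],
    },
    index=[10, 20, 30, 40],  # custom labels, so label != position
)

by_label = df_demo.loc[30]     # the row whose index label is 30
by_position = df_demo.iloc[0]  # the first row, whatever its label is

print(by_label["Employees"])
print(by_position["Employees"])
```

Here `df_demo.loc[2]` would raise a `KeyError`, because no row carries the label 2.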


In [50]:
# Filtering Employees Who Have 0 YOE

print("The Employees Who Have 0 YOE : \n" )
df[df['YOE'] == 0]

The Employees Who Have 0 YOE : 



  Employees Company  YOE
3       Ali  Amazon    0


In [52]:
# Filter Employees At PwC

print("The Employees At PwC Are : \n")
df[df['Company'] =='PwC']

The Employees At PwC Are : 



     Employees Company  YOE
0  Abdelrahman     PwC    2
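
Single-condition filters extend naturally to several conditions. As a sketch, conditions are combined with `&` (and) or `|` (or), each wrapped in parentheses, and `isin` checks membership in a list. The stand-in `df_demo` repeats the employees data so the block runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Employees": ["Abdelrahman", "Menna", "Yassin", "Ali"],
    "Company": ["PwC", "Microsoft", "Noon", "Amazon"],
    "YOE": [2, 3, 1, 0],
})

# Parentheses are required: & binds tighter than >= and ==
at_least_two_years = df_demo["YOE"] >= 2
in_selected_firms = df_demo["Company"].isin(["Microsoft", "Amazon"])

selected = df_demo[at_least_two_years & in_selected_firms]
print(selected)

# | keeps rows that satisfy either condition
either = df_demo[(df_demo["YOE"] == 0) | (df_demo["Company"] == "PwC")]
print(either)
```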


## Modifying DataFrames And Adding Columns

In [54]:
# Standardize Company Names

df['Company'] = df['Company'].str.upper()

df

     Employees    Company  YOE
0  Abdelrahman        PWC    2
1        Menna  MICROSOFT    3
2       Yassin       NOON    1
3          Ali     AMAZON    0


In [56]:
# Add Seniority Flag (True When YOE Is 2 Or More)

df['Is_Senior'] = df['YOE'] >= 2
df

     Employees    Company  YOE  Is_Senior
0  Abdelrahman        PWC    2       True
1        Menna  MICROSOFT    3       True
2       Yassin       NOON    1      False
3          Ali     AMAZON    0      False
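
A boolean flag is one way to engineer a field; a text label is another. The sketch below uses `np.where` with the same `YOE >= 2` threshold as the cell above. The `Level` column and the "Senior"/"Junior" labels are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Employees": ["Abdelrahman", "Menna", "Yassin", "Ali"],
    "YOE": [2, 3, 1, 0],
})

# np.where(condition, value_if_true, value_if_false), applied row by row
df_demo["Level"] = np.where(df_demo["YOE"] >= 2, "Senior", "Junior")
print(df_demo)
```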


## Handling Missing Data

In [57]:
# Create Dataset Of Project Budgets

df_Projects = pd.DataFrame({
    'Client':['BMW' , 'Google' , 'OpenAi' , 'Tesla' , 'Mansory'],
    # Both np.nan and None are stored as NaN in a numeric column
    'Budget_KUSD': [700,360,np.nan,510,None]
})

df_Projects

    Client  Budget_KUSD
0      BMW        700.0
1   Google        360.0
2   OpenAi          NaN
3    Tesla        510.0
4  Mansory          NaN


In [68]:
# Filling Missing Budgets With Mean (mean() Skips NaN By Default)

Mean_Budgets = df_Projects['Budget_KUSD'].mean()
Mean_Budgets


np.float64(523.3333333333334)

In [69]:
# Assign the filled column back to the DataFrame.
# Calling fillna(..., inplace=True) on a selected column raises a FutureWarning,
# because the selection behaves as a copy and the pattern stops working in pandas 3.0.

df_Projects['Budget_KUSD'] = df_Projects['Budget_KUSD'].fillna(Mean_Budgets)
df_Projects

    Client  Budget_KUSD
0      BMW   700.000000
1   Google   360.000000
2   OpenAi   523.333333
3    Tesla   510.000000
4  Mansory   523.333333
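
The mean is not the only reasonable fill value. As a hedged alternative, the median is less sensitive to a single very large or very small budget. The sketch below works on a standalone Series with the same five budget values, and it shows that `fillna` returns a new object while the original keeps its gaps.

```python
import numpy as np
import pandas as pd

budgets = pd.Series([700, 360, np.nan, 510, np.nan], name="Budget_KUSD", dtype="float64")

# median() skips NaN, just like mean()
median_budget = budgets.median()

# fillna returns a new Series; budgets itself is not modified
filled = budgets.fillna(median_budget)

print(median_budget)
print(filled)
print(budgets.isna().sum())
```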



## Grouping & Aggregating DataFrames


In [72]:
# Revenue Per Company Per Quarter

df_revenue = pd.DataFrame({
'Company': ['KFC','PwC','EY', 'KPMG', 'KFC','PwC'],
'Quarter': ['Q1', 'Q1', 'Q1', 'Q1','Q2','Q2'],
'Revenue_kUSD': [120, 200, 180, 150, 130, 210]
})
print(df_revenue)


  Company Quarter  Revenue_kUSD
0     KFC      Q1           120
1     PwC      Q1           200
2      EY      Q1           180
3    KPMG      Q1           150
4     KFC      Q2           130
5     PwC      Q2           210


In [74]:
grouped = df_revenue.groupby('Company')['Revenue_kUSD'].sum()
print(" Total Revenue by Company:\n", grouped, "\n")

 Total Revenue by Company:
 Company
EY      180
KFC     250
KPMG    150
PwC     410
Name: Revenue_kUSD, dtype: int64 
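
Grouping by `Company` alone collapses the quarters. A short sketch of two variations: grouping by two keys, and computing several aggregates at once with `agg`. It rebuilds the revenue table and assumes the quarter labels are `Q1` and `Q2`.

```python
import pandas as pd

df_rev = pd.DataFrame({
    "Company": ["KFC", "PwC", "EY", "KPMG", "KFC", "PwC"],
    "Quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],  # assumed labels
    "Revenue_kUSD": [120, 200, 180, 150, 130, 210],
})

# Two grouping keys give one total per (Company, Quarter) pair
by_company_quarter = df_rev.groupby(["Company", "Quarter"])["Revenue_kUSD"].sum()
print(by_company_quarter)

# agg with a list of function names returns one column per aggregate
per_quarter = df_rev.groupby("Quarter")["Revenue_kUSD"].agg(["sum", "mean"])
print(per_quarter)
```

The two-key result has a MultiIndex; calling `.reset_index()` on it turns the keys back into ordinary columns.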



## Merging Dataframes Together

In [75]:
# Merge project managers and projects

df_managers = pd.DataFrame({
'ManagerID': [1, 2, 3],
'Name': ['Ibrahim', 'Maher', 'Abdelrahman']
})
print(df_managers)


   ManagerID         Name
0          1      Ibrahim
1          2        Maher
2          3  Abdelrahman


In [76]:
df_projects = pd.DataFrame({
'ProjectID': [101, 102, 103],
'ManagerID': [1, 2, 2],
'Client': ['Kiwilytics', 'PwC', 'EY']
})
print(df_projects)

   ProjectID  ManagerID      Client
0        101          1  Kiwilytics
1        102          2         PwC
2        103          2          EY


In [80]:
merged_df = pd.merge(df_managers, df_projects, on='ManagerID', how='inner')
print(" Merged Managers & Projects:\n", merged_df,"\n")

 Merged Managers & Projects:
    ManagerID     Name  ProjectID      Client
0          1  Ibrahim        101  Kiwilytics
1          2    Maher        102         PwC
2          2    Maher        103          EY 
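
The inner merge above silently drops the manager with `ManagerID` 3, who has no projects. As a sketch of the alternative, `how='left'` keeps every manager, and `indicator=True` adds a `_merge` column that records where each row came from. The two tables are rebuilt here so the block runs on its own.

```python
import pandas as pd

df_managers = pd.DataFrame({
    "ManagerID": [1, 2, 3],
    "Name": ["Ibrahim", "Maher", "Abdelrahman"],
})
df_projects = pd.DataFrame({
    "ProjectID": [101, 102, 103],
    "ManagerID": [1, 2, 2],
    "Client": ["Kiwilytics", "PwC", "EY"],
})

# Left merge: every manager is kept; project columns are NaN when there is no match
merged_left = pd.merge(df_managers, df_projects, on="ManagerID", how="left", indicator=True)
print(merged_left)

# Rows marked "left_only" are managers without any project
unassigned = merged_left[merged_left["_merge"] == "left_only"]
print(unassigned["Name"].tolist())
```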



**This notebook was made by: Abdelrahman Alaa**

**LinkedIn:** **https://www.linkedin.com/in/abdelrahman1alaa**

**GitHub:** **https://github.com/abdelrahman1alaa**