# Pandas - Basics

When wrangling data with Python, we usually begin by importing packages such as pandas so we can access their tools (functions and methods) for data manipulation.

**pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

*Description from https://pandas.pydata.org/*

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png" width=500/>


In [7]:
# Importing a library
import pandas as pd

Note that we use **as** to alias the library. We do this so we don't have to type the full module name every time we use it in our code.

For example, to create an empty dataframe without the alias, I have to call `pandas.DataFrame`.

Meanwhile, if I use **`import pandas as pd`**, I only need to call `pd.DataFrame`.
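
As a small illustration of the alias, the sketch below creates an empty DataFrame and a tiny populated one through `pd`. The variable names and the values are made up for this example.

```python
import pandas as pd

# An empty DataFrame, created through the alias
empty_df = pd.DataFrame()

# A small DataFrame built from a dictionary (illustrative values only)
sample_df = pd.DataFrame({'Name': ['Jose', 'Maria'], 'Age': [25, 30]})

print(empty_df.empty)
print(sample_df.shape)
```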

In [27]:
# This code mounts (adds access to) our GDrive's folders
# to the list of locations available to Python
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [29]:
# This code imports a library "os" that allows file navigation
import os
# This code sets the home directory
# Find your folder and put the path here as a string
os.chdir('/content/drive/MyDrive/DSF/DSF')

# Reading data

The most common source of simple data is a CSV (comma-separated values) file, which pandas reads with `read_csv`. Pandas can read other sources as well, using functions such as:
- `read_excel`
- `read_sql`
- `read_json`

**To read a file**, we pass pandas the path of the file. Relative paths are resolved from the current working directory, which we set with `os.chdir` above. Here `staff.csv` sits in a `Data/` subfolder of that directory, so we pass `Data/staff.csv`. If `Data/` were instead one level above the working directory, we would pass `../Data/staff.csv`, with `../` meaning one directory up.

In [32]:
df = pd.read_csv('Data/staff.csv')

In [33]:
df

Unnamed: 0,Name,Age,City
0,,25,Pasig
1,Jose,25,Makati
2,Maria,30,Quezon City
3,Juan,35,Malabon
4,Lourdes,40,Taguig
5,Manuel,45,Pasay
6,Manuel,45,Pasay
7,,88,
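
If you do not have `staff.csv` at hand, the following sketch rebuilds an equivalent table from an in-memory string. The CSV text is typed in from the output above and the variable names are assumptions of this example; `read_csv` accepts a file-like object as well as a path.

```python
import io
import pandas as pd

# CSV text mirroring the staff table shown above (empty fields become NaN)
csv_text = """Name,Age,City
,25,Pasig
Jose,25,Makati
Maria,30,Quezon City
Juan,35,Malabon
Lourdes,40,Taguig
Manuel,45,Pasay
Manuel,45,Pasay
,88,
"""

# read_csv accepts a file path or any file-like object
staff = pd.read_csv(io.StringIO(csv_text))
print(staff.shape)
```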


# Basics

In [34]:
len(df) # get length of dataframe

8

In [35]:
df.shape # display shape

(8, 3)

In [36]:
df.info() # display information per column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    6 non-null      object
 1   Age     8 non-null      int64 
 2   City    7 non-null      object
dtypes: int64(1), object(2)
memory usage: 320.0+ bytes


In [37]:
df.describe() # display basic stats for numerical columns

Unnamed: 0,Age
count,8.0
mean,41.625
std,20.381627
min,25.0
25%,28.75
50%,37.5
75%,45.0
max,88.0


In [38]:
df.head() # display first 5 rows

Unnamed: 0,Name,Age,City
0,,25,Pasig
1,Jose,25,Makati
2,Maria,30,Quezon City
3,Juan,35,Malabon
4,Lourdes,40,Taguig


In [39]:
df.tail() # display last 5 rows

Unnamed: 0,Name,Age,City
3,Juan,35,Malabon
4,Lourdes,40,Taguig
5,Manuel,45,Pasay
6,Manuel,45,Pasay
7,,88,


In [40]:
df.columns # display columns

Index(['Name', 'Age', 'City'], dtype='object')

## Selection of columns

In [41]:
## Select only one column
df['Name']

0        NaN
1       Jose
2      Maria
3       Juan
4    Lourdes
5     Manuel
6     Manuel
7        NaN
Name: Name, dtype: object

In [42]:
## Select multiple columns
df[['Name', 'City']]

Unnamed: 0,Name,City
0,,Pasig
1,Jose,Makati
2,Maria,Quezon City
3,Juan,Malabon
4,Lourdes,Taguig
5,Manuel,Pasay
6,Manuel,Pasay
7,,


## pandas: a table is a `DataFrame`

`pandas` reads files and represents their contents as its own type of object, called a `DataFrame`.

In [43]:
type(df)

A `DtypeWarning` is raised when a column in the file being read contains mixed data types. Recall that Python has several data types.

A `DataFrame` object is made up of `Series` objects, each of which holds a single column.


In [46]:
type(df['Age'])

`DataFrame` and `Series` objects have different methods associated with each, with DataFrame having access to more methods due to it being more complex.
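
To make the distinction concrete, here is a minimal sketch on a made-up two-row table: single brackets return a `Series`, while double brackets return a one-column `DataFrame`.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Jose', 'Maria'], 'Age': [25, 30]})

single = df_demo['Age']      # single brackets -> Series
table = df_demo[['Age']]     # double brackets -> one-column DataFrame

print(type(single).__name__)
print(type(table).__name__)
```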

The unnamed first column in the output is the `index`, which both `DataFrame` and `Series` objects have. It also has its own methods.

In [47]:
df.index

RangeIndex(start=0, stop=8, step=1)

In [48]:
df['Age'].index

RangeIndex(start=0, stop=8, step=1)

## Selection of rows

### `.loc[]`
`.loc[]` is primarily label-based: you select data by specifying row and column labels.


In [64]:
# Select a single row by label
# (a list of labels returns a DataFrame; df.loc[1] would return a Series)
row1 = df.loc[[1]]
print("Selected Row 1 using .loc[]:")
print(row1)

Selected Row 1 using .loc[]:
   Name  Age    City
1  Jose   25  Makati


In [62]:
# Select a specific cell by label
cell = df.loc[2, 'Age']
print("Selected Cell (Row 2, 'Age') using .loc[]:")
print(cell)

Selected Cell (Row 2, 'Age') using .loc[]:
30


In [63]:
# Select multiple rows and specific columns by labels
subset = df.loc[1:3, ['Name', 'City']]
print("Selected Subset using .loc[]:")
print(subset)

Selected Subset using .loc[]:
    Name         City
1   Jose       Makati
2  Maria  Quezon City
3   Juan      Malabon


### `.iloc[]`
`.iloc[]` is primarily integer-based: you select data by specifying row and column positions (integers starting at 0).

In [65]:
# Select a single row by index
row2 = df.iloc[2]
print("Selected Row 2 using .iloc[]:")
print(row2)

Selected Row 2 using .iloc[]:
Name          Maria
Age              30
City    Quezon City
Name: 2, dtype: object


In [53]:
# Select a specific cell by indices
cell = df.iloc[3, 1]
print("\nSelected Cell (Row 3, Column 1) using .iloc[]:")
print(cell)


Selected Cell (Row 3, Column 1) using .iloc[]:
35


In [54]:
# Select multiple rows and specific columns by indices
subset = df.iloc[1:4, [0, 1, 2]]
print("\nSelected Subset using .iloc[]:")
print(subset)


Selected Subset using .iloc[]:
    Name  Age         City
1   Jose   25       Makati
2  Maria   30  Quezon City
3   Juan   35      Malabon
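
Notice that `df.loc[1:3]` and `df.iloc[1:4]` returned the same three rows. A label slice includes both ends, while a position slice excludes the stop. The sketch below shows this on a made-up table with string labels, where labels and positions cannot be confused.

```python
import pandas as pd

# Illustrative data with string labels as the index
people = pd.DataFrame(
    {'Age': [25, 30, 35, 40]},
    index=['jose', 'maria', 'juan', 'lourdes'],
)

by_label = people.loc['maria':'juan']   # label slice: both ends included
by_position = people.iloc[1:2]          # position slice: stop is excluded

print(len(by_label), len(by_position))
```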


## View unique elements
Use `.unique()` and `.value_counts()` on a `Series` object.

In [115]:
df['City'].unique()

array(['Pasig', 'Makati', 'Quezon City', 'Malabon', 'Taguig', 'Pasay', nan],
      dtype=object)

In [116]:
df['Age'].value_counts()

Age
25    2
45    2
30    1
35    1
40    1
88    1
Name: count, dtype: int64
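
Missing values are treated differently by these methods: `.unique()` lists `nan`, while `.value_counts()` skips it by default. A short sketch on made-up city values, using the `dropna` parameter to include missing entries in the counts:

```python
import pandas as pd

cities = pd.Series(['Pasig', 'Makati', 'Pasay', 'Pasay', None])

print(cities.nunique())                     # missing values are not counted
counts = cities.value_counts(dropna=False)  # include missing values as their own entry
print(counts)
```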

# Filter operations

You can use the following operations to get the subset of rows that satisfies a condition.

Filtering returns a new object, so remember to store the filtered result in a variable if you want it to persist.
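
Under the hood, a condition such as `df['Age'] > 25` produces a Boolean `Series`, and indexing with it keeps the rows marked `True`. A minimal sketch with made-up data:

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Jose', 'Maria', 'Juan'], 'Age': [25, 30, 35]})

mask = staff['Age'] > 25      # Boolean Series: one True/False per row
older = staff[mask]           # keeps only the rows where mask is True

print(mask.tolist())
print(len(staff), len(older))
```

The original `staff` still has all of its rows; only `older` holds the filtered result.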

## Comparison Operators

### Equal (==)

In [73]:
filtered_data = df[df['Age'] == 30]
display(filtered_data)

Unnamed: 0,Name,Age,City
2,Maria,30,Quezon City


### Not Equal (!=)

In [76]:
filtered_data = df[df['Age'] != 30]
display(filtered_data)



Unnamed: 0,Name,Age,City
0,,25,Pasig
1,Jose,25,Makati
3,Juan,35,Malabon
4,Lourdes,40,Taguig
5,Manuel,45,Pasay
6,Manuel,45,Pasay
7,,88,


### Greater Than (>) and Less Than (<)

In [77]:
filtered_data = df[df['Age'] > 30]
display(filtered_data)

Unnamed: 0,Name,Age,City
3,Juan,35,Malabon
4,Lourdes,40,Taguig
5,Manuel,45,Pasay
6,Manuel,45,Pasay
7,,88,


### Greater Than or Equal To (>=) and Less Than or Equal To (<=)

In [78]:
filtered_data = df[df['Age'] >= 30]
display(filtered_data)

Unnamed: 0,Name,Age,City
2,Maria,30,Quezon City
3,Juan,35,Malabon
4,Lourdes,40,Taguig
5,Manuel,45,Pasay
6,Manuel,45,Pasay
7,,88,


### isin()

In [79]:
filtered_data = df[df['City'].isin(['Pasig'])]
display(filtered_data)

Unnamed: 0,Name,Age,City
0,,25,Pasig


### between()

In [80]:
filtered_data = df[df['Age'].between(30, 40)]
display(filtered_data)

Unnamed: 0,Name,Age,City
2,Maria,30,Quezon City
3,Juan,35,Malabon
4,Lourdes,40,Taguig
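
As the output shows, `between()` includes both endpoints by default. The sketch below, which assumes pandas 1.3 or later for the string form of `inclusive`, contrasts that with excluding the endpoints.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45])

closed = ages[ages.between(30, 40)]                          # 30 and 40 included
open_ends = ages[ages.between(30, 40, inclusive='neither')]  # endpoints excluded

print(closed.tolist())
print(open_ends.tolist())
```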


### isna()

In [117]:
df.isna()

Unnamed: 0,Name,Age,City
0,True,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,True,False,True


In [None]:
df.isna().sum()

### notna()

In [None]:
df.notna()

In [None]:
df.notna().sum()
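
Because `True` counts as 1, summing the result of `isna()` gives the number of missing values per column. The same mask can also be used as a filter. A minimal sketch on a made-up three-row table:

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': [None, 'Jose', None],
    'Age': [25, 25, 88],
    'City': ['Pasig', 'Makati', None],
})

missing_per_column = staff.isna().sum()          # True counts as 1
rows_missing_name = staff[staff['Name'].isna()]  # rows where Name is missing

print(missing_per_column.to_dict())
```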

## Logical operators

### & (And Operator):

The & operator performs element-wise logical AND between two or more conditions. It returns True only if all conditions are True.

In [None]:
filtered_data = df[(df['Age'] >= 30) & (df['City'] == 'Quezon City')]
display(filtered_data)

In [None]:
filtered_data = df[(df['Age'] >= 30) & (df['City'].isin(['Quezon City', 'Malabon']))]
display(filtered_data)

### | (Or Operator):

The | operator performs element-wise logical OR between two or more conditions. It returns True if at least one of the conditions is True.

In [None]:
filtered_data = df[(df['Age'] > 30) | (df['City'] == 'Makati')]
display(filtered_data)

### ~ (Not Operator):

The ~ operator negates a condition. It returns True where the condition is False and False where the condition is True.
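
One practical detail when combining conditions: each comparison needs its own parentheses, because `&`, `|` and `~` bind more tightly than `==` or `>` in Python. A small sketch with made-up rows:

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Jose', 'Maria', 'Juan'],
    'Age': [25, 30, 35],
    'City': ['Makati', 'Quezon City', 'Malabon'],
})

# Wrap every comparison in parentheses before combining or negating it
not_malabon = staff[~(staff['City'] == 'Malabon')]
young_or_malabon = staff[(staff['Age'] < 30) | (staff['City'] == 'Malabon')]

print(not_malabon['Name'].tolist())
print(young_or_malabon['Name'].tolist())
```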

In [86]:
filtered_data = df[(df['Age'] > 30) | ~(df['City'] == 'Malabon')]
display(filtered_data)

Unnamed: 0,Name,Age,City
0,,25,Pasig
1,Jose,25,Makati
2,Maria,30,Quezon City
3,Juan,35,Malabon
4,Lourdes,40,Taguig
5,Manuel,45,Pasay
6,Manuel,45,Pasay
7,,88,


# Adding Columns

## Insert to a specific position

In [81]:
df.insert(2, "Hobbies",
 ['Hiking', 'Reading', 'Painting', 'Cooking', 'Photography', 'Dancing', 'Dancing', None], True)

display(df)

Unnamed: 0,Name,Age,Hobbies,City
0,,25,Hiking,Pasig
1,Jose,25,Reading,Makati
2,Maria,30,Painting,Quezon City
3,Juan,35,Cooking,Malabon
4,Lourdes,40,Photography,Taguig
5,Manuel,45,Dancing,Pasay
6,Manuel,45,Dancing,Pasay
7,,88,,


## Assign

In [83]:
df = df.assign(Genders=['F', 'M', 'F', 'M', 'F', 'M', 'M', None])

display(df)

Unnamed: 0,Name,Age,Hobbies,City,Genders
0,,25,Hiking,Pasig,F
1,Jose,25,Reading,Makati,M
2,Maria,30,Painting,Quezon City,F
3,Juan,35,Cooking,Malabon,M
4,Lourdes,40,Photography,Taguig,F
5,Manuel,45,Dancing,Pasay,M
6,Manuel,45,Dancing,Pasay,M
7,,88,,,


## New list as a column appended at the end

In [85]:
df['Salary'] = [40917, 72982, 88687, 65401, 96602, 71143, 71143, None]

In [None]:
display(df)

## Map
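`map` applies a function to each value of a single column, so it works as long as the new column is computed from only one column. Note that comparisons with `NaN` are always `False`, so in the function below a missing salary falls through to the `else` branch and is labeled 'High'.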

In [87]:
def categorize_salary(salary):
    if salary <= 50000:
        return 'Low'
    elif salary <= 75000:
        return 'Medium'
    else:
        return 'High'


df['Categories'] = df['Salary'].map(categorize_salary)
df

Unnamed: 0,Name,Age,Hobbies,City,Genders,Salary,Categories
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High
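
Row 7 has no salary, yet it is labeled 'High' because every comparison with `NaN` is `False`. One possible way to avoid this, sketched here on made-up values, is the `na_action='ignore'` option of `Series.map`, which leaves missing values untouched instead of passing them to the function.

```python
import pandas as pd

def categorize_salary(salary):
    if salary <= 50000:
        return 'Low'
    elif salary <= 75000:
        return 'Medium'
    return 'High'

salaries = pd.Series([40917, 72982, 88687, None])

# na_action='ignore' skips missing values instead of passing them to the function
categories = salaries.map(categorize_salary, na_action='ignore')
print(categories.tolist())
```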


# Renaming Columns

In [88]:
df.columns

Index(['Name', 'Age', 'Hobbies', 'City', 'Genders', 'Salary', 'Categories'], dtype='object')

### .rename

In [91]:
rename_1_df = df.rename(columns={'Genders': 'Gender', 'Categories': 'Category'})
rename_1_df

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High


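### Overwrite columns

In [93]:
# Plain assignment does not copy: rename_2_df and df are the same object,
# so this also renames the columns of df.
# The new names must follow the existing column order.
rename_2_df = df
rename_2_df.columns = ['Name', 'Age', 'Hobbies', 'City', 'Gender', 'Salary', 'Category']
display(rename_2_df)

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category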
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High
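
Note that a plain assignment such as `rename_2_df = df` does not create a second table; both names point to the same object. The sketch below, with made-up data and hypothetical variable names, contrasts this with `.copy()`.

```python
import pandas as pd

original = pd.DataFrame({'Genders': ['F', 'M'], 'Salary': [40917, 72982]})

alias = original               # same object, no data is copied
independent = original.copy()  # a separate DataFrame

alias.columns = ['Gender', 'Salary']   # also renames original's columns
independent.columns = ['G', 'S']       # original is unaffected

print(list(original.columns))
```

Use `.copy()` whenever the original column names need to stay available.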


# Removing Data

## Removing Columns

In [94]:
# drop() returns a new DataFrame, so assign the result to a variable
df_drop_columns = df.drop(columns=['Hobbies', 'Name'])
df_drop_columns

## Removing Rows

In [97]:
# Drop rows by index label; df itself is left unchanged
df_drop_rows = df.drop(index=[0, 6])
df_drop_rows

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High
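
In both cases `drop()` returns a new DataFrame and leaves the original alone. A minimal sketch on a made-up table confirms this:

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Jose', 'Maria', 'Juan'],
    'Age': [25, 30, 35],
    'City': ['Makati', 'Quezon City', 'Malabon'],
})

no_city = staff.drop(columns=['City'])   # returns a new DataFrame
no_first = staff.drop(index=[0])         # index labels, not positions

print(list(no_city.columns), list(no_first.index), staff.shape)
```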


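## Removing Rows by Condition

In [98]:
# Drop every row whose Age is below 30
df_drop_conditions = df.drop(df[df['Age'] < 30].index)
# df is unchanged because drop() returns a new DataFrame
df

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category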
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High


In [101]:
df_drop_conditions[df_drop_conditions['Age'] < 30].index

Index([], dtype='int64')

## Check and drop duplicates

In [100]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
dtype: bool

In [103]:
df_no_duplicates = df.drop_duplicates()
df_no_duplicates

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High
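
By default a row counts as a duplicate only when every column matches an earlier row. As a sketch on made-up data, the `subset` and `keep` parameters of `drop_duplicates` narrow the comparison to chosen columns and control which occurrence survives.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Manuel', 'Manuel', 'Jose', 'Jose'],
    'City': ['Pasay', 'Pasay', 'Makati', 'Pasig'],
})

full_row = staff.drop_duplicates()                             # whole row must match
by_name = staff.drop_duplicates(subset=['Name'], keep='last')  # compare Name only

print(list(full_row.index), list(by_name.index))
```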


## Check and drop missing values

In [105]:
# Example 1: Drop rows with any missing values (NaN)
df_no_missing = df.dropna()
df_no_missing

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium


In [106]:
# Example 2: Drop rows with a missing value in a specific column
df_no_missing_age = df.dropna(subset=['Age'])
df_no_missing_age

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High


In [107]:
# Example 3: Drop rows with a missing value in any of several columns
df_no_missing_name_city = df.dropna(subset=['Name', 'City'])
df_no_missing_name_city

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium


In [None]:
# Example 4: Drop rows where all columns have missing values
df_no_all_missing = df.dropna(how='all')
df_no_all_missing
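
The difference between the default `how='any'` and `how='all'` is easiest to see on a table that has one partly empty and one fully empty row. The data below is made up for this sketch.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Jose', None, None],
    'City': ['Makati', 'Pasig', None],
})

any_missing = data.dropna()            # default how='any'
all_missing = data.dropna(how='all')   # drops only fully empty rows

print(list(any_missing.index), list(all_missing.index))
```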

# Replacing Data
You can replace full cell values with `.replace()`, or part of a cell's content with `.str.replace()` on string columns.

In [114]:
# Replace every cell equal to 'M' with 'Male'
# replace() returns a new DataFrame; df itself is unchanged
df_no_all_missing = df
df_no_all_missing.replace('M', 'Male')

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,Male,72982.0,Medium
2,Maria,30,Painting,Quezon City,F,88687.0,High
3,Juan,35,Cooking,Malabon,Male,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,Male,71143.0,Medium
6,Manuel,45,Dancing,Pasay,Male,71143.0,Medium
7,,88,,,,,High


In [118]:
# Remove "City" from the City column, i.e. replace "City" with ""
# (a trailing space remains, e.g. 'Quezon '; chain .str.strip() to remove it)
df_no_all_missing = df
df_no_all_missing['City'] = df_no_all_missing['City'].str.replace('City', '')
df_no_all_missing

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High
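
Two refinements are sketched below on made-up data: a nested dictionary limits whole-cell replacement to one column, and `.str.strip()` trims the space that is left behind when part of a string is removed.

```python
import pandas as pd

staff = pd.DataFrame({
    'Gender': ['F', 'M', 'M'],
    'City': ['Pasig', 'Quezon City', 'Pasay'],
})

# Whole-cell replacement, limited to one column via a nested dict
staff = staff.replace({'Gender': {'M': 'Male', 'F': 'Female'}})

# Partial replacement inside strings, then trim leftover whitespace
staff['City'] = staff['City'].str.replace('City', '', regex=False).str.strip()

print(staff['Gender'].tolist(), staff['City'].tolist())
```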


# Sorting Data

### Ascending

In [111]:
sorted_df_age_asc = df.sort_values(by='Age', ascending=True)
sorted_df_age_asc

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
0,,25,Hiking,Pasig,F,40917.0,Low
1,Jose,25,Reading,Makati,M,72982.0,Medium
2,Maria,30,Painting,Quezon,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High


### Descending

In [112]:
sorted_df_salary_desc = df.sort_values(by='Salary', ascending=False)
sorted_df_salary_desc

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
4,Lourdes,40,Photography,Taguig,F,96602.0,High
2,Maria,30,Painting,Quezon,F,88687.0,High
1,Jose,25,Reading,Makati,M,72982.0,Medium
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
0,,25,Hiking,Pasig,F,40917.0,Low
7,,88,,,,,High


### Multiple Columns

In [113]:
sorted_df_age_salary = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
sorted_df_age_salary

Unnamed: 0,Name,Age,Hobbies,City,Gender,Salary,Category
1,Jose,25,Reading,Makati,M,72982.0,Medium
0,,25,Hiking,Pasig,F,40917.0,Low
2,Maria,30,Painting,Quezon,F,88687.0,High
3,Juan,35,Cooking,Malabon,M,65401.0,Medium
4,Lourdes,40,Photography,Taguig,F,96602.0,High
5,Manuel,45,Dancing,Pasay,M,71143.0,Medium
6,Manuel,45,Dancing,Pasay,M,71143.0,Medium
7,,88,,,,,High


### Sort and reset index

In [None]:
sorted_df_salary = df.sort_values(by='Salary', ascending=True).reset_index(drop=True)
sorted_df_salary
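
Sorting keeps the original index labels, which is why `reset_index(drop=True)` is used to renumber the rows. Missing values are placed last by default; the `na_position` parameter changes that. A small sketch on made-up salaries:

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Jose', 'Maria', 'Juan'],
    'Salary': [72982.0, None, 65401.0],
})

# Missing values go last by default; na_position='first' moves them to the top
nan_first = staff.sort_values(by='Salary', na_position='first')
renumbered = staff.sort_values(by='Salary').reset_index(drop=True)

print(nan_first['Name'].tolist(), renumbered['Name'].tolist())
```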

# Save file

In [None]:
# Save to the same Data/ folder the input file was read from
sorted_df_salary_desc.to_csv('Data/staff_salary.csv')
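
By default `to_csv` also writes the row index as an extra unnamed first column. If that column is not wanted, `index=False` leaves it out. The sketch below uses an in-memory buffer in place of a file and made-up data, so nothing is written to disk.

```python
import io
import pandas as pd

staff = pd.DataFrame({'Name': ['Jose', 'Maria'], 'Age': [25, 30]})

buffer = io.StringIO()                 # stands in for a file on disk
staff.to_csv(buffer, index=False)      # leave the row index out of the file
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```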