# Complete Guide to Pandas


This guide uses the Titanic dataset throughout.

[Dataset link](https://www.kaggle.com/competitions/titanic/data)

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## Table of Contents(ToC):

#### 1. The Basics

#### 2. Creating DataFrame

#### 3. Treating null values

#### 4. Modify/Add new column(s)

#### 5. Deleting columns

#### 6. Renaming columns

#### 7.i. Slicing DataFrame

#### 7.ii. Slicing using iloc and loc

#### 8. Adding a row

#### 9. Dropping row(s)

#### 10. Sorting

#### 11. Joins

#### 12. Groupby

#### Importing modules.

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing


#### Importing data.

We load the training file (`train.csv`) of the Titanic dataset and work on a copy, so the original `train` DataFrame stays untouched.

In [None]:
train=pd.read_csv(r'/content/drive/MyDrive/0 Data Cleaning /titanic/train.csv')
df=train.copy()

<a id="content1"></a>
## 1. The Basics

`df.head()` and `df.tail()` are DataFrame methods that display a portion of the DataFrame.

`df.head(n)` returns the first n rows and `df.tail(n)` returns the last n rows; n defaults to 5 when it is not provided.

Both are useful for quickly inspecting the contents and structure of a DataFrame, especially when working with large datasets.

In [None]:
# See the first 5 rows
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# last 5 rows.
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


`df.shape` is a DataFrame property that returns a tuple with the dimensions of the DataFrame: the number of rows and the number of columns, in that order.

In [None]:
# n_samples x n_features
df.shape

(891, 12)

`df.columns` is a DataFrame property that returns the column labels. By default, these are the names given when the DataFrame was created (here, the header of the CSV file), but you can also assign custom labels to the columns.

In [None]:
#List of all the columns
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

`df.index` is a DataFrame property that returns the row labels. By default, these labels are integers starting from 0, but you can also assign custom labels to the index.

In [None]:
# Rows index
df.index

RangeIndex(start=0, stop=891, step=1)

Both `df.index` and `df.columns` are useful for accessing and manipulating the row and column labels in a DataFrame. You can use these properties to select specific rows or columns, rename rows or columns, or set new row or column labels.
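
As a minimal sketch of these uses, the toy DataFrame below (invented data, not the Titanic file) gets a custom row index, has one value looked up by label, and then has its column labels replaced. The names `toy`, `by_name`, and `age_of_b` are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [24, 18, 17]})

# Use the 'Name' column as the row index instead of the default 0, 1, 2.
by_name = toy.set_index("Name")

# Look up a single value by row label and column label.
age_of_b = by_name.loc["B", "Age"]

# Replace all column labels at once by assigning to .columns.
by_name.columns = ["age_years"]

print(list(by_name.index))
print(list(by_name.columns))
```

Assigning to `.columns` requires exactly one new label per existing column; for renaming only some columns, `rename` is usually more convenient.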

In [None]:
# Values with their counts in a particular column
df['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

`df.describe()` is a DataFrame method that provides a statistical summary of the DataFrame. By default it includes only the numerical columns and reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles for each of them.

In [None]:
# General description of dataset.
df.describe()
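
To include non-numeric columns in the summary, `describe` accepts an `include` argument. The sketch below uses a small invented DataFrame (not the Titanic data) to compare the default summary with `include="all"`; the variable names are illustrative.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "female"],
})

# Default: only the numeric column 'Age' is summarised.
numeric_summary = toy.describe()

# include="all" also summarises text columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example the mean of a text column) are reported as NaN.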

<a id="content2"></a>
## 2. Creating DataFrame

Creating a DataFrame in Pandas means creating an empty or populated DataFrame object that contains the data you want to work with. You can create a DataFrame from various data sources, such as a Python dictionary, a list of lists, a CSV file, or a SQL query.

When you create a DataFrame, you specify the structure of the data in terms of its columns and rows. Each column in the DataFrame represents a variable or a feature, and each row represents an observation or a data point. The data in each column can be of different data types, such as integers, floats, strings, or dates.

Once you create a DataFrame, you can manipulate and analyze the data using various built-in Pandas methods and functions. You can sort and filter the data, perform statistical calculations, apply functions to individual columns or rows, merge or join multiple DataFrames, and visualize the data using various plotting functions.

Overall, creating a DataFrame in Pandas is the first step in working with tabular data in Python and is essential for many data analysis tasks, including data cleaning, data wrangling, and data visualization.

In [None]:
# empty data frame
df_empty=pd.DataFrame()
df_empty.head()  

In [None]:
# From dict
student_dict={'Name':['A','B','C'],'Age':[24,18,17],'Roll':[1,2,3]}
df_student=pd.DataFrame(student_dict) # a default integer index (0, 1, 2) is created automatically, so reset_index is not needed
df_student.head()

Unnamed: 0,Name,Age,Roll
0,A,24,1
1,B,18,2
2,C,17,3
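
Besides a dictionary of columns, a DataFrame can also be built row by row. The following sketch, which reuses the same three students, constructs the frame from a list of lists and from a list of row dictionaries; the variable names are illustrative.

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately.
rows = [["A", 24, 1], ["B", 18, 2], ["C", 17, 3]]
df_from_lists = pd.DataFrame(rows, columns=["Name", "Age", "Roll"])

# Each dictionary is one row; keys become the column names.
records = [
    {"Name": "A", "Age": 24, "Roll": 1},
    {"Name": "B", "Age": 18, "Roll": 2},
    {"Name": "C", "Age": 17, "Roll": 3},
]
df_from_records = pd.DataFrame(records)

print(df_from_lists)
print(df_from_records)
```

Both constructions give a 3 x 3 DataFrame with the default integer index.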


<a id="content3"></a>
## 3. Treating null values

A null value (also called a missing value) is a value that is undefined or unknown. Pandas usually represents it as `NaN` (or as `None` in object columns and `NaT` for dates), and it can occur in columns holding any kind of data, such as numbers, strings, or dates.



In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# on whole df.
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
# on a particular column
df['Age'].isnull().sum()

177

#### Impute null values

Handling null values is an important part of data analysis, as they can affect the accuracy and reliability of the results. Some common ways to handle null values include:

- Removing null values: You can remove rows or columns that contain null values using the `dropna()` method.

- Replacing null values: You can replace null values with a default value, such as 0 or the mean of the column, using the `fillna()` method.

- Ignoring null values: You can leave the null values in place and compute on the remaining data; `sum()`, `mean()`, and the other pandas statistical methods skip null values by default.
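
The three strategies above can be sketched on a small invented Series (not the Titanic data); all variable names here are illustrative.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 36.0])

n_missing = ages.isnull().sum()

# 1. Remove the missing entries.
dropped = ages.dropna()

# 2. Replace the missing entries with the mean of the observed values.
filled = ages.fillna(ages.mean())

# 3. Ignore them: aggregations skip NaN by default (skipna=True).
mean_ignoring_nan = ages.mean()
mean_with_nan = ages.mean(skipna=False)  # NaN, because missing values are not skipped

print(n_missing, len(dropped), filled.tolist(), mean_ignoring_nan, mean_with_nan)
```

None of these calls modifies `ages` itself; each returns a new object, so the result has to be assigned to a name to be kept.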



In [None]:
df['Age']=df['Age'].fillna(df['Age'].mean()) # assign the result back instead of calling inplace=True on a column selection
df['Age'].isnull().sum()

0

In [None]:
df['Sex']=df['Sex'].fillna(df['Sex'].mode()[0]) # mode() returns a Series, so take its first value
df['Sex'].isnull().sum()

0


## 4. Modify/Add new column(s)

Modifying or adding columns in a Pandas DataFrame means changing existing columns or creating new ones that hold additional information or transformed data derived from the existing columns.

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The code below modifies the "Sex" column of `df` by mapping the values "male" and "female" to "0" and "1", respectively. Note that the replacements are strings, not integers, so later comparisons against this column use `"0"` and `"1"`.

In [None]:
df['Sex']=df['Sex'].map({"male":'0',"female":"1"})
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S


The code below creates two new columns in `df`, "last_name" and "first_name".

The `apply()` method applies a custom function to each value in the "Name" column. Here, a lambda function splits the full name at the comma: the part before the comma becomes the last name, and everything after it (including the title, such as "Mr.") becomes the first name.

The new columns can be used for further analysis or visualization, for example to group or filter the data by last name.

In [None]:
# Finding last name and first name from Name column.
df['last_name']=df['Name'].apply(lambda x: x.split(',')[0])
df['first_name']=df['Name'].apply(lambda x: ' '.join(x.split(',')[1:]))


In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,last_name,first_name
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry
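
As an alternative sketch, the same split can be written with the vectorised `.str` accessor instead of `apply()`. The example runs on two names copied from the output above; `n=1` splits only at the first comma, and `.str.strip()` removes the space that follows it. The frame `names` and the helper `parts` are illustrative.

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})

# Split each name once at the first comma into two columns (0 and 1).
parts = names["Name"].str.split(",", n=1, expand=True)

names["last_name"] = parts[0].str.strip()
names["first_name"] = parts[1].str.strip()

print(names)
```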


The code below creates a new column "Thrid&Men" in `df`. Passing `axis=1` makes `apply()` call the lambda once per row; the lambda returns 1 when the passenger is in third class and male (`Sex` equal to the string "0"), and 0 otherwise.

In [None]:
# Sets to 1 for men in 3rd class.
df['Thrid&Men']=df.apply(lambda row: int(row['Pclass']==3 and row['Sex']=="0"),axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,last_name,first_name,Thrid&Men
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1
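
Row-wise `apply()` calls a Python function for every row, which can be slow on large frames. A possible vectorised sketch of the same flag, shown on a small invented frame, combines two boolean Series with `&` and converts the result to integers.

```python
import pandas as pd

# Toy data; "0" encodes male and "1" female, as strings.
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["0", "1", "1"]})

# True only where both conditions hold, then True/False -> 1/0.
toy["Thrid&Men"] = ((toy["Pclass"] == 3) & (toy["Sex"] == "0")).astype(int)

print(toy)
```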


The code below creates a new categorical column 'Age_group' in df based on each passenger's age: 1 for under 18, 2 for 18 to 39, 3 for 40 to 59, and 4 for 60 and over.

In [None]:
def findAgeGroup(age):
    # 1: under 18, 2: 18 to 39, 3: 40 to 59, 4: 60 and over
    if age<18:
        return 1
    elif age<40:
        return 2
    elif age<60:
        return 3
    else:
        return 4
# Apply the custom function to every value in the Age column.
df['Age_group']=df['Age'].apply(findAgeGroup)

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,last_name,first_name,Thrid&Men,Age_group
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2
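
The same binning can also be sketched with `pd.cut`, which assigns each value to an interval. With `right=False` the intervals are closed on the left, so the boundaries match the function above: under 18, 18 to under 40, 40 to under 60, and 60 or more. One difference to keep in mind: `pd.cut` leaves missing ages as NaN, whereas the custom function would put them in group 4. The sample ages are invented.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2.0, 17.9, 18.0, 39.0, 40.0, 60.0, 80.0])

# Bin edges: (-inf, 18), [18, 40), [40, 60), [60, inf)
groups = pd.cut(
    ages,
    bins=[-np.inf, 18, 40, 60, np.inf],
    right=False,
    labels=[1, 2, 3, 4],
)

print(groups.tolist())
```

The result is a categorical Series; `groups.astype(int)` converts it to plain integers when there are no missing values.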


<a id="content5"></a>
## 5. Deleting columns

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,last_name,first_name,Thrid&Men,Age_group
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


The first line of code removes the `'PassengerId'` column from `df` using the `drop` method. The `axis=1` argument specifies that a column, rather than a row, should be dropped. `drop` returns a new DataFrame, which is assigned back to `df`.

The second line is commented out and shows the in-place alternative. With `inplace=True`, `drop` modifies the original DataFrame and returns `None`, so its result must not be assigned back to `df`; doing so would replace the DataFrame with `None`.

In [None]:
df=df.drop(['PassengerId'],axis=1)
#df.drop(['PassengerId'],axis=1,inplace=True) # in-place alternative; returns None, so do not assign it
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,last_name,first_name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2
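
A short sketch on invented data shows two related points: several columns can be dropped at once with the `columns=` keyword (equivalent to passing `axis=1`), and a call with `inplace=True` returns `None`.

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Fare": [7.25, 71.2833],
})

# Returns a new DataFrame; toy itself is unchanged.
trimmed = toy.drop(columns=["PassengerId", "Ticket"])

# Modifies toy directly and returns None.
result = toy.drop(columns=["Fare"], inplace=True)

print(list(trimmed.columns), result, list(toy.columns))
```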


<a id="content6"></a>
## 6. Renaming columns

The code below renames some columns in df using the `rename` method. Its `columns` argument takes a dictionary whose keys are the original column names and whose values are the new column names.

In this case, the column 'Sex' is renamed to 'Gender', 'Name' is renamed to 'Full Name', 'last_name' is renamed to 'Surname', and 'first_name' is renamed to 'Name'.



In [None]:
# Let's rename some columns.
df=df.rename(columns={'Sex':'Gender','Name':'Full Name','last_name':'Surname','first_name':'Name'})
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


The same approach works for any other column.

<a id="content7"></a>
## 7.i Slicing DataFrame

In [None]:
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


The code snippet creates a new DataFrame df_third_class that contains all rows from the original DataFrame df where the 'Pclass' column is equal to 3. This is done using boolean indexing, where `df['Pclass']==3` creates a boolean array with True values for all rows where the 'Pclass' column is equal to 3 and False values for all other rows. This boolean array is then used to select the corresponding rows from df.

The `reset_index` method resets the index of the resulting DataFrame. Passing `drop=True` discards the original index and replaces it with a new default one (0, 1, 2, ...). If `drop=False` or the argument is omitted, the original index is kept as a new column named `index`.

In [None]:
# All rows with pclass==3
df_third_class=df[df['Pclass']==3].reset_index(drop=True) # without drop=True, the old index is added as a column
df_third_class.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
2,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2
3,0,3,"Moran, Mr. James",0,29.699118,0,0,330877,8.4583,,Q,Moran,Mr. James,1,2
4,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S,Palsson,Master. Gosta Leonard,1,1
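
The effect of `drop` can be seen on a small invented frame. After filtering, the surviving rows keep their original labels; `reset_index()` moves those labels into a column, while `reset_index(drop=True)` discards them.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

third = toy[toy["Pclass"] == 3]          # rows keep their original labels 0 and 2
kept = third.reset_index()               # old labels become an 'index' column
dropped = third.reset_index(drop=True)   # old labels are discarded

print(list(third.index))
print(list(kept.columns))
print(list(dropped.index))
```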


The code snippet creates a new DataFrame df_aged that contains all rows from the original DataFrame df where the 'Age' column is greater than 60 and the 'Gender' column is equal to "1".

The condition `(df['Age']>60) & (df['Gender']=="1")` uses the & operator to combine two boolean arrays generated by two separate conditions. The first condition `df['Age']>60` creates a boolean array with True values for all rows where the 'Age' column is greater than 60 and False values for all other rows. The second condition `df['Gender']=="1"` creates a boolean array with True values for all rows where the 'Gender' column is equal to "1" ("1" encodes female after the mapping applied in section 4) and False values for all other rows. The & operator returns a new boolean array that is True only where both conditions are True. Each condition must be wrapped in parentheses because & binds more tightly than the comparison operators.

In [None]:
# Females with age > 60
df_aged=df[(df['Age']>60) & (df['Gender']=="1")]
df_aged.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
275,1,1,"Andrews, Miss. Kornelia Theodosia",1,63.0,1,0,13502,77.9583,D7,S,Andrews,Miss. Kornelia Theodosia,0,4
483,1,3,"Turkula, Mrs. (Hedwig)",1,63.0,0,0,4134,9.5875,,S,Turkula,Mrs. (Hedwig),0,4
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",1,62.0,0,0,113572,80.0,B28,,Stone,Mrs. George Nelson (Martha Evelyn),0,4


**Note that all three of these women survived; as older women, they may have been given priority during the evacuation.**
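
Boolean conditions can also be combined with `|` (or) and negated with `~` (not). The following sketch on invented data shows all three operators; the frame and result names are illustrative.

```python
import pandas as pd

# Toy data; "1" encodes female and "0" male, as strings.
toy = pd.DataFrame({
    "Age": [63.0, 30.0, 70.0, 62.0],
    "Gender": ["1", "1", "0", "1"],
})

older_women = toy[(toy["Age"] > 60) & (toy["Gender"] == "1")]      # both conditions
older_or_women = toy[(toy["Age"] > 60) | (toy["Gender"] == "1")]   # at least one condition
not_older = toy[~(toy["Age"] > 60)]                                # condition negated

print(older_women.index.tolist())
print(len(older_or_women))
print(not_older.index.tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on Series, which is why the `&`, `|`, and `~` operators are used instead.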

In [None]:
# Selecting some columns.
df1=df[['Age','Pclass','Gender']]
df1.head()

Unnamed: 0,Age,Pclass,Gender
0,22.0,3,0
1,38.0,1,1
2,26.0,3,1
3,35.0,1,1
4,35.0,3,0


In [None]:
# Select numerical columns only
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

df_num = df.select_dtypes(include=numerics)
df_num.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Thrid&Men,Age_group
0,0,3,22.0,1,0,7.25,1,2
1,1,1,38.0,1,0,71.2833,0,2
2,1,3,26.0,0,0,7.925,0,2
3,1,1,35.0,1,0,53.1,0,2
4,0,3,35.0,0,0,8.05,1,2


In [None]:
# categorical columns
df_cat=df.select_dtypes(include=['object'])
df_cat.head()

Unnamed: 0,Full Name,Gender,Ticket,Cabin,Embarked,Surname,Name
0,"Braund, Mr. Owen Harris",0,A/5 21171,,S,Braund,Mr. Owen Harris
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,PC 17599,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,"Heikkinen, Miss. Laina",1,STON/O2. 3101282,,S,Heikkinen,Miss. Laina
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,113803,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,"Allen, Mr. William Henry",0,373450,,S,Allen,Mr. William Henry


<a id="content8"></a>
## 7.ii Slicing using iloc and loc

#### iloc

In [None]:
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


The code snippet creates a new DataFrame `df_sub1` that contains the first 100 rows and all columns of the original DataFrame df. This is done using integer indexing and slicing with the iloc indexer.

The iloc indexer selects rows and columns by their integer index, and takes two arguments: the first argument specifies the rows to select (in this case, from row 0 up to but not including row 100), and the second argument specifies the columns to select (in this case, all columns, specified by :).



In [None]:
# First 100 rows & all columns
df_sub1=df.iloc[0:100,:]
df_sub1.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


Here the second argument to `iloc` is a list of integer column positions rather than a slice: the first argument selects the rows from position 0 up to but not including 250, and the second selects the columns at the given positions in `df.columns`.

In [None]:
#First 250 rows with a subset of columns

#df_sub2=df.iloc[:250,['Age']]
# This raises an error because iloc accepts only integer positions, not labels.

df_sub2=df.iloc[:250,[1,8]]
# Returns the first 250 rows and the columns at positions 1 and 8 of df.columns.
df_sub2.head()

Unnamed: 0,Pclass,Fare
0,3,7.25
1,1,71.2833
2,3,7.925
3,1,53.1
4,3,8.05


#### loc

In [None]:
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


The code snippet creates a new DataFrame `df_sub3` that contains the rows labeled 0 through 500 (501 rows in total) and all columns of the original DataFrame df. This is done using label-based slicing with the loc indexer.

The loc indexer selects rows and columns by their labels (i.e., index values and column names), and takes two arguments: the first argument specifies the rows to select (in this case, all rows up to and including the row with label 500), and the second argument specifies the columns to select (in this case, all columns, specified by :). Unlike iloc, a loc slice includes its end label.

In [None]:
# Rows with labels 0 through 500 (loc slices include the end label).
df_sub3=df.loc[:500,:]
df_sub3.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2
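
The difference between position-based and label-based selection can be sketched on a small invented frame. With the default index, `iloc[:5]` and `loc[:5]` differ only in whether the end point is included; once the rows are reordered, positions and labels no longer coincide at all.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

by_position = toy.iloc[:5]   # positions 0..4, end excluded
by_label = toy.loc[:5]       # labels 0..5, end included

# After sorting in descending order, label 0 sits in the last position.
shuffled = toy.sort_values("x", ascending=False)
first_by_position = shuffled.iloc[0, 0]   # value in the first row as displayed
value_at_label_zero = shuffled.loc[0, "x"]  # value in the row labeled 0

print(len(by_position), len(by_label), first_by_position, value_at_label_zero)
```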


The code snippet creates a new DataFrame df_sub4 that contains the `'Gender'` and `'Age'` columns of all rows from the original DataFrame df where the `'Age'` column is greater than 50. This is done using label indexing and filtering the rows with the condition `df['Age']>50`.

In [None]:
# Gender and Age of passengers older than 50
df_sub4=df.loc[(df['Age']>50),['Gender','Age']]
df_sub4.head()

Unnamed: 0,Gender,Age
6,0,54.0
11,1,58.0
15,1,55.0
33,0,66.0
54,0,65.0


<a id="content9"></a>
## 8. Adding a row

In [None]:
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0,2
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0,2
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0,2
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,1,2


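The code snippet adds a new row to the original DataFrame df. The new row is specified as a dictionary `row`, with keys corresponding to the column names and values corresponding to the entries for the new row. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the row is wrapped in a one-row DataFrame and combined with `pd.concat`; the legacy call is kept as a comment. Columns that are missing from `row` are filled with NaN, which is why the integer columns appear as floats in the output.

In [None]:
# Adding a row (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
row={'Age':24,'Full Name':'Peter','Survived':'Y'}
#df=df.append(row,ignore_index=True) # legacy form, pandas < 2.0 only
df=pd.concat([df,pd.DataFrame([row])],ignore_index=True)
# columns absent from row are filled with NaN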
df.tail()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
887,1,1.0,"Graham, Miss. Margaret Edith",1.0,19.0,0.0,0.0,112053,30.0,B42,S,Graham,Miss. Margaret Edith,0.0,2.0
888,0,3.0,"Johnston, Miss. Catherine Helen ""Carrie""",1.0,29.699118,1.0,2.0,W./C. 6607,23.45,,S,Johnston,"Miss. Catherine Helen ""Carrie""",0.0,2.0
889,1,1.0,"Behr, Mr. Karl Howell",0.0,26.0,0.0,0.0,111369,30.0,C148,C,Behr,Mr. Karl Howell,0.0,2.0
890,0,3.0,"Dooley, Mr. Patrick",0.0,32.0,0.0,0.0,370376,7.75,,Q,Dooley,Mr. Patrick,1.0,2.0
891,Y,,Peter,,24.0,,,,,,,,,,


In [None]:
# Adding new row using loc
df.loc[len(df.index)]=row
df.tail()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
888,0,3.0,"Johnston, Miss. Catherine Helen ""Carrie""",1.0,29.699118,1.0,2.0,W./C. 6607,23.45,,S,Johnston,"Miss. Catherine Helen ""Carrie""",0.0,2.0
889,1,1.0,"Behr, Mr. Karl Howell",0.0,26.0,0.0,0.0,111369,30.0,C148,C,Behr,Mr. Karl Howell,0.0,2.0
890,0,3.0,"Dooley, Mr. Patrick",0.0,32.0,0.0,0.0,370376,7.75,,Q,Dooley,Mr. Patrick,1.0,2.0
891,Y,,Peter,,24.0,,,,,,,,,,
892,Y,,Peter,,24.0,,,,,,,,,,
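
The change from integers to floats visible in the outputs above (for example `Pclass` shown as `1.0`) is a side effect of the NaN values introduced by the new rows. A minimal sketch on invented data, adding two rows at once with `pd.concat`, reproduces it:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, 38.0], "Pclass": [3, 1]})
dtype_before = str(toy["Pclass"].dtype)

# Two new rows; the first one has no value for Pclass.
new_rows = pd.DataFrame([{"Age": 24.0}, {"Age": 31.0, "Pclass": 2}])

combined = pd.concat([toy, new_rows], ignore_index=True)
dtype_after = str(combined["Pclass"].dtype)

print(dtype_before, dtype_after)
print(combined)
```

Collecting new rows and concatenating them once is generally preferable to adding rows one at a time in a loop, since every concatenation copies the data.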


<a id="content10"></a>
## 9. Dropping row(s)

The code snippet deletes the last row from the original DataFrame df using the drop function. `df.index[-1]` returns the label of the last row, and `axis=0` tells drop to remove a row rather than a column.



In [None]:
df=df.drop(df.index[-1],axis=0) # Deletes last row
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3.0,"Braund, Mr. Owen Harris",0,22.0,1.0,0.0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1.0,2.0
1,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1.0,0.0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0.0,2.0
2,1,3.0,"Heikkinen, Miss. Laina",1,26.0,0.0,0.0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0.0,2.0
3,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1.0,0.0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0.0,2.0
4,0,3.0,"Allen, Mr. William Henry",0,35.0,0.0,0.0,373450,8.05,,S,Allen,Mr. William Henry,1.0,2.0
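
Rows can also be dropped several at a time by passing a list of labels, or removed by condition by keeping only the rows that should stay. The sketch below uses invented data with non-default labels to make clear that `drop` works on labels, not positions.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]}, index=[10, 11, 12, 13])

without_last = toy.drop(toy.index[-1])   # drops the row labeled 13
without_two = toy.drop([10, 12])         # drops the rows labeled 10 and 12
over_30 = toy[toy["Age"] > 30]           # keeps only rows that satisfy the condition

print(list(without_last.index), list(without_two.index), list(over_30.index))
```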


<a id="content11"></a>
## 10. Sorting

In [None]:
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
0,0,3.0,"Braund, Mr. Owen Harris",0,22.0,1.0,0.0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,1.0,2.0
1,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1.0,0.0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),0.0,2.0
2,1,3.0,"Heikkinen, Miss. Laina",1,26.0,0.0,0.0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,0.0,2.0
3,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1.0,0.0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),0.0,2.0
4,0,3.0,"Allen, Mr. William Henry",0,35.0,0.0,0.0,373450,8.05,,S,Allen,Mr. William Henry,1.0,2.0


The code snippet sorts the original DataFrame df by the 'Age' column in descending order using the sort_values function. The by parameter specifies the column(s) to sort by, and in this case, it is set to 'Age'. The ascending parameter is set to False to sort the values in descending order.

In [None]:
# Sort by Age in descending order.
df=df.sort_values(by=['Age'],ascending=False) # the list can hold several columns to sort by
df.head()

Unnamed: 0,Survived,Pclass,Full Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Name,Thrid&Men,Age_group
630,1,1.0,"Barkworth, Mr. Algernon Henry Wilson",0,80.0,0.0,0.0,27042,30.0,A23,S,Barkworth,Mr. Algernon Henry Wilson,0.0,4.0
851,0,3.0,"Svensson, Mr. Johan",0,74.0,0.0,0.0,347060,7.775,,S,Svensson,Mr. Johan,1.0,4.0
96,0,1.0,"Goldschmidt, Mr. George B",0,71.0,0.0,0.0,PC 17754,34.6542,A5,C,Goldschmidt,Mr. George B,0.0,4.0
493,0,1.0,"Artagaveytia, Mr. Ramon",0,71.0,0.0,0.0,PC 17609,49.5042,,C,Artagaveytia,Mr. Ramon,0.0,4.0
116,0,3.0,"Connors, Mr. Patrick",0,70.5,0.0,0.0,370369,7.75,,Q,Connors,Mr. Patrick,1.0,4.0
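
When sorting by several columns, `ascending` can be a list with one flag per column. The sketch below, on invented data, sorts by class in ascending order and, within each class, by age in descending order; `reset_index(drop=True)` renumbers the rows afterwards.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 35.0, 19.0],
})

ordered = (
    toy.sort_values(by=["Pclass", "Age"], ascending=[True, False])
       .reset_index(drop=True)
)

print(ordered)
```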
