# What do you understand by the term Manipulation?


*   Type here
*   
*   
*   




## What is Pandas?  
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Source: [W3Schools](https://www.w3schools.com/python/pandas/pandas_intro.asp)


Pandas is one of the most popular libraries for working with data, and as a data scientist you will use it often.


**Pandas' New Data Types**

Pandas has 2 main types of collections that we will be working with: Series and DataFrames.

1. Series are used for 1-dimensional data.
2. DataFrames are used for 2-dimensional data.  

(Note that "DataFrame" is also commonly written in text with lowercase letters as "dataframe." We will use both interchangeably in the text. In code, however, you must write "DataFrame" with a capital D and a capital F, because Python is case-sensitive.)
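To make the two collection types concrete, here is a minimal sketch that builds one of each. The variable names `prices` and `items` and the three sample rows are chosen only for illustration.

```python
import pandas as pd

# A Series holds 1-dimensional data: one value per index label
prices = pd.Series([2.25, 1.25, 9.5], name="Price")

# A DataFrame holds 2-dimensional data: rows and named columns
items = pd.DataFrame({"Name": ["brownie", "cookie", "cake"],
                      "Price": [2.25, 1.25, 9.5]})

print(prices.ndim)  # 1
print(items.ndim)   # 2
```

The `ndim` attribute reports the number of dimensions, which is a quick way to tell the two types apart.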

We will usually be working with 2-dimensional data, so let's dive into DataFrames first.

<img src="https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/1679528098__bake-sale-spreadsheet-1.png">

**Creating a DataFrame**  

There are several different ways that we can create a dataframe.

We shall start from lists. The easiest way to create a dataframe from lists is to build a temporary dictionary with the column names we want as the keys and the lists of data as the values.

In [1]:
# Creating the lists
names_list = ['brownie','cookie','cake', 'cupcake']
prices_list = [2.25,1.25,9.5, 3.5]
quantities_sold_list = [17, 40, 1, 10]

In [2]:
# One way of storing our data
shop_records = {'Name':names_list,
                'Price':prices_list,
                "Quantity Sold": quantities_sold_list}

In [3]:
shop_records

{'Name': ['brownie', 'cookie', 'cake', 'cupcake'],
 'Price': [2.25, 1.25, 9.5, 3.5],
 'Quantity Sold': [17, 40, 1, 10]}

In [4]:
# import pandas library
import pandas as pd

In [5]:
# Make a dataframe from a dictionary
shop_df = pd.DataFrame(shop_records)
shop_df


Unnamed: 0,Name,Price,Quantity Sold
0,brownie,2.25,17
1,cookie,1.25,40
2,cake,9.5,1
3,cupcake,3.5,10
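A dictionary of lists is only one of the possible starting points. As a hedged sketch of another option, the example below passes a list of dictionaries, where each dictionary describes one row. The name `records_df` is an assumption made for this illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {"Name": "brownie", "Price": 2.25, "Quantity Sold": 17},
    {"Name": "cookie", "Price": 1.25, "Quantity Sold": 40},
]
records_df = pd.DataFrame(records)
print(records_df)
```

This layout can be convenient when the data arrives one record at a time, while the dictionary-of-lists layout suits data that is already organized by column.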


**Anatomy of a DataFrame**  

A DataFrame has 3 primary components:

* the column names, in bold font in the top row.
* the original raw data values, stored in the grid in the center.
* an index for each row, in bold on the left.

We can access these 3 components as attributes of our DataFrame.

An attribute is a variable that is stored within another object. We access attributes using dot notation, similar to how we use a function stored inside a package. We do not use parentheses when accessing an attribute.

You can access the three primary components of a dataframe individually using the following code:

In [6]:
# a special list with the names of each column
shop_df.columns

Index(['Name', 'Price', 'Quantity Sold'], dtype='object')

In [7]:
# the raw data in a numpy array
shop_df.values

array([['brownie', 2.25, 17],
       ['cookie', 1.25, 40],
       ['cake', 9.5, 1],
       ['cupcake', 3.5, 10]], dtype=object)

In [8]:
# a special list with the row names (index)
shop_df.index

RangeIndex(start=0, stop=4, step=1)
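The three attributes return pandas or NumPy objects rather than plain Python lists. If plain lists are easier to inspect, one possible approach is sketched below on a small stand-in frame. The name `demo_df` is hypothetical and is used so the example runs on its own.

```python
import pandas as pd

demo_df = pd.DataFrame({"Name": ["brownie", "cookie"], "Price": [2.25, 1.25]})

# Attributes are accessed without parentheses
column_names = list(demo_df.columns)   # Index object -> list of column names
row_labels = list(demo_df.index)       # RangeIndex -> list of row labels
raw_values = demo_df.values.tolist()   # NumPy array -> nested list of rows

print(column_names)
print(row_labels)
print(raw_values)
```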

### Loading Data with Pandas

We shall use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html).   
A [modified version](https://github.com/selva86/datasets/blob/master/BostonHousing.csv) is available on GitHub.

In [9]:
# Loading data from google drive
# Mount your drive
# from google.colab import drive
# drive.mount('/content/drive')

In [10]:
# Loading Data
# filename = "bostonHousing1978 (1).csv"
filename="https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
df = pd.read_csv(filename)
# pd.read_csv("/content/drive/MyDrive/test_images/bostonHousing1978.csv")

In [11]:
# print(filename)
type(df)

pandas.core.frame.DataFrame

In [12]:
# What is stored in df
df

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


We can see that pandas shows us a tabular view of the data. Here df is a dataframe containing the data organized by columns and rows.
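To see what `pd.read_csv` does without needing a network connection, the sketch below reads a tiny CSV held in memory. The two data rows are copied from the first rows shown above, restricted to three columns. The names `csv_text` and `sample_df` are assumptions for this illustration.

```python
import io
import pandas as pd

# A small CSV held in memory; the first line supplies the column names
csv_text = "crim,rm,medv\n0.00632,6.575,24.0\n0.02731,6.421,21.6\n"

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

Each line of the text becomes one row, and the comma-separated header becomes the column names.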

### Basic operations

Let's explore some of the basic methods and attributes of a dataframe that you will commonly use.

In [13]:
# Showing the first rows using the dataframe.head() method.
# By default this shows the first 5 rows in the dataframe
df.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [14]:
# We can also pass a parameter for the number of rows we want to show
# In the example below it will show the first 3 rows
df.head(3)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7


In [15]:
# Showing the last 5 rows with dataframe.tail() method.
df.tail()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


In [16]:
# How can we show the last 3 rows?
# Write your code here
df.tail(3)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


In [17]:
# Getting information about the dataframe using the .info() method
# This shows the number of rows, and for each column its data type and its number of non-null values.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


In [18]:
# Some of the common attributes are:
df.shape
# This returns a tuple of (number of rows, number of columns)

(506, 14)

In [19]:
df.dtypes
# This shows the datatypes of each column

crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
b          float64
lstat      float64
medv       float64
dtype: object

In [20]:
# Column names
df.columns

Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')

Note on series and dataframes.  

A Pandas **Series** is like a column in a table. It is a one-dimensional array holding data of any type.  

A Pandas **DataFrame** is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns. A Dataframe is made up of series.
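The statement that a DataFrame is made up of Series can be checked directly. The sketch below, with hypothetical names `names`, `prices` and `bake_df`, builds a DataFrame from two Series and then pulls one column back out.

```python
import pandas as pd

# Two separate 1-dimensional Series
names = pd.Series(["brownie", "cookie"])
prices = pd.Series([2.25, 1.25])

# Combine them as the columns of a 2-dimensional DataFrame
bake_df = pd.DataFrame({"Name": names, "Price": prices})

# Selecting a single column gives back a Series
price_column = bake_df["Price"]
print(type(price_column))
```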

### Slicing
As seen earlier with lists, slicing is a common operation that will be undertaken to get part of the data.

**Select columns using brackets**

In [21]:
# We can slice a column using square brackets like this.
# The general notation is dataframe[<column name>]
# First how can we view our column names.
# df.columns
df.head(2)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6


In [22]:
# Let us pick the rm column
df['rm']
# Please note that the column names are strings, that's why we use quotes.
# We can also use double quotes

0      6.575
1      6.421
2      7.185
3      6.998
4      7.147
       ...  
501    6.593
502    6.120
503    6.976
504    6.794
505    6.030
Name: rm, Length: 506, dtype: float64

In [23]:
# The above notation will return a series, we can verify this with the type function.
type(df['rm'])

pandas.core.series.Series

In [24]:
# To return a dataframe, we use double square brackets i.e.
# dataframe[[<column name>]] e.g
df[['rm']]

Unnamed: 0,rm
0,6.575
1,6.421
2,7.185
3,6.998
4,7.147
...,...
501,6.593
502,6.120
503,6.976
504,6.794


In [25]:
# We can verify this using the type function
type(df[['rm']])

pandas.core.frame.DataFrame

What visual differences have you observed between the sliced series and the sliced dataframe?

In [28]:
df['rm'].shape
# df[['rm']].shape

(506,)
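Besides the visual difference, the `shape` attribute also distinguishes the two kinds of slice. This is a small sketch on a stand-in frame called `toy_df` (a hypothetical name), using the first three `rm` and `lstat` values shown above.

```python
import pandas as pd

toy_df = pd.DataFrame({"rm": [6.575, 6.421, 7.185], "lstat": [4.98, 9.14, 4.03]})

as_series = toy_df["rm"]     # single brackets -> Series
as_frame = toy_df[["rm"]]    # double brackets -> DataFrame

print(as_series.shape)  # one number: rows only
print(as_frame.shape)   # two numbers: rows and columns
```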

In [29]:
# We can also slice multiple columns. We only need to write the different
# column names separated by a comma (,) using the double brackets notation
# i.e. dataframe[[<column name>, <column name>,...]] e.g
df[['rm', 'lstat']]

Unnamed: 0,rm,lstat
0,6.575,4.98
1,6.421,9.14
2,7.185,4.03
3,6.998,2.94
4,7.147,5.33
...,...,...
501,6.593,9.67
502,6.120,9.08
503,6.976,5.64
504,6.794,6.48


In [30]:
# This returns a dataframe
type(df[['rm', 'lstat']])

pandas.core.frame.DataFrame

In [31]:
# We can apply the dataframe/series methods and attributes to the sliced data e.g.
df[['rm', 'lstat']].head(3)

Unnamed: 0,rm,lstat
0,6.575,4.98
1,6.421,9.14
2,7.185,4.03


In [32]:
# or
df['rm'].head(3)

0    6.575
1    6.421
2    7.185
Name: rm, dtype: float64

In [33]:
# We can also slice by column and then by row by adding square brackets and selecting a range of rows.
df['rm'][0:5]
# We use the notation dataframe[column_slice][row:slice]

0    6.575
1    6.421
2    7.185
3    6.998
4    7.147
Name: rm, dtype: float64

In [34]:
# Slicing by column and index
# df[<column slice>][<row slice>]
df[['rm', 'lstat']][120:121]

Unnamed: 0,rm,lstat
120,5.87,14.37


In [35]:
# The row slicing is similar to list slicing so we can even use the step parameter.
df[['rm', 'lstat']][200:212:3]

Unnamed: 0,rm,lstat
200,7.135,4.45
203,7.853,3.81
206,6.326,10.97
209,5.344,23.09


In [36]:
# We can also slice only by index
df[3:6]

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7


In [37]:
df[3:6][['rm', 'lstat']]

Unnamed: 0,rm,lstat
3,6.998,2.94
4,7.147,5.33
5,6.43,5.21


In [39]:
# Remember lists
my_list  = [1,2,3,4,5,6,7,8,9,10]
# my_list[0:10:1]
my_list[0:10:2]

[1, 3, 5, 7, 9]

In [40]:
# Slicing entire dataframe by index with step
df[0:10:2]

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5


In [41]:
# We can use the dot notation for slicing series, but it is not recommended e.g.
df.rm
# In this case, we do not use the quotes around the column name.
# It is similar to accessing an attribute.

0      6.575
1      6.421
2      7.185
3      6.998
4      7.147
       ...  
501    6.593
502    6.120
503    6.976
504    6.794
505    6.030
Name: rm, Length: 506, dtype: float64

In [42]:
# We can check the type
type(df.rm)

pandas.core.series.Series

In [43]:
# Question: How can we check if the slice obtained using the brackets notation
# is the same as the slice obtained using the dot notation? i.e.
# Is df.rm the same as df['rm']?
# Hint: refer to comparison operators
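One possible way to approach this kind of check is sketched below on a small stand-in frame (`toy_df` is a hypothetical name). It is not the only answer: the first line follows the comparison-operator hint, and the second uses the `Series.equals` method.

```python
import pandas as pd

toy_df = pd.DataFrame({"rm": [6.575, 6.421, 7.185]})

# Element-wise comparison gives a boolean Series; .all() checks every element
same_elementwise = (toy_df.rm == toy_df["rm"]).all()

# Series.equals compares the two Series as a whole in one call
same_overall = toy_df.rm.equals(toy_df["rm"])

print(same_elementwise, same_overall)
```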

**NOTE:** The dot notation has limitations e.g.
 - when the column name is the same as an inbuilt attribute
 - when there is a space in the column name
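Both limitations can be reproduced on a tiny frame. In this sketch the frame `clash_df` and its column named `shape` are invented purely to trigger the name clash.

```python
import pandas as pd

clash_df = pd.DataFrame({"shape": ["round", "square"], "Quantity Sold": [17, 40]})

# Dot notation finds the built-in attribute, not the column called "shape"
print(clash_df.shape)  # (2, 2)

# Bracket notation always refers to the column
print(clash_df["shape"].tolist())

# A name with a space cannot be written after a dot, but brackets work
print(clash_df["Quantity Sold"].tolist())
```

This is the main reason bracket notation is the safer default.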

**Selecting columns using loc**  

The loc attribute of a dataframe can be used to slice the dataframe.  
The general notation is dataframe.loc[row/index_name/value or row slice, column_name/column slice].  
Although loc is primarily label-based, it can also be used with a boolean array as we shall see later.



In [44]:
# Slicing the rm series
df.loc[:, 'rm']

0      6.575
1      6.421
2      7.185
3      6.998
4      7.147
       ...  
501    6.593
502    6.120
503    6.976
504    6.794
505    6.030
Name: rm, Length: 506, dtype: float64

In [45]:
df.loc[2:5, ['rm', 'lstat']]

Unnamed: 0,rm,lstat
2,7.185,4.03
3,6.998,2.94
4,7.147,5.33
5,6.43,5.21


In [46]:
# Checking the type
type(df.loc[:, 'rm'])

pandas.core.series.Series

In [47]:
# Getting a dataframe slice, remember we use double brackets
df.loc[:, ['rm']]

Unnamed: 0,rm
0,6.575
1,6.421
2,7.185
3,6.998
4,7.147
...,...
501,6.593
502,6.120
503,6.976
504,6.794


In [48]:
# Checking the type
type(df.loc[:, ['rm']])

pandas.core.frame.DataFrame

In [49]:
# We can also slice multiple columns
df.loc[:, ['rm', 'lstat']]

Unnamed: 0,rm,lstat
0,6.575,4.98
1,6.421,9.14
2,7.185,4.03
3,6.998,2.94
4,7.147,5.33
...,...,...
501,6.593,9.67
502,6.120,9.08
503,6.976,5.64
504,6.794,6.48


In [50]:
# And also use the inbuilt methods and attributes
df.loc[:, ['rm', 'lstat']].shape

(506, 2)

What is the first parameter (:) in the brackets? It is the row slicer: we can specify the index labels for a specific slice, or use a colon (:) on its own to get every row. The same applies to columns: a colon on its own selects all of them.

In [51]:
# Slicing rows from index 5-10
df.loc[5:10, ['rm', 'lstat']]

Unnamed: 0,rm,lstat
5,6.43,5.21
6,6.012,12.43
7,6.172,19.15
8,5.631,29.93
9,6.004,17.1
10,6.377,20.45


In [52]:
# Another example
df.loc[0:10:2, ['rm', 'lstat']]

Unnamed: 0,rm,lstat
0,6.575,4.98
2,7.185,4.03
4,7.147,5.33
6,6.012,12.43
8,5.631,29.93
10,6.377,20.45


In [53]:
# Getting a specific row e.g row with index 2.
df.loc[2, ['rm', 'lstat']]

rm       7.185
lstat    4.030
Name: 2, dtype: float64

In [54]:
# Getting all columns using the : symbol
df.loc[:, :]

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


In [55]:
# Getting all columns from row 0 to 5
df.loc[0:5, :]
# Note: a .loc slice includes the end label, so this returns rows 0 through 5 (6 rows),
# whereas df[0:5] stops before position 5 and returns rows 0 through 4

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
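One detail worth checking when moving between the two notations: a plain bracket slice excludes its end position, like a list slice, while a `.loc` slice includes its end label. The sketch below shows this on a small stand-in frame, where `toy_df` is a hypothetical name.

```python
import pandas as pd

toy_df = pd.DataFrame({"rm": [6.575, 6.421, 7.185, 6.998, 7.147, 6.43]})

by_position = toy_df[0:3]       # positions 0, 1, 2 (end excluded)
by_label = toy_df.loc[0:3, :]   # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```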


### Exercise 1
You will use the [Pima Indians Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) to answer the following questions.
1. Load the data [here](https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv). You can use the URL directly.
2. How many rows and columns are in the dataset? 768 rows and 9 columns
3. Display the column names.
4. What are the datatypes of the columns? int64 (7 columns) and float64 (2 columns)
5. Does the dataset have null values? No

In [56]:
file = 'https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv'
df = pd.read_csv(file)

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [58]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [59]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

### Pandas dataframe indices explained

Remember that the index is the unique ID for each row in the dataframe. So far we have used a numerical range index, but pandas indices do not need to be numbers. We shall use the shop records dataframe to demonstrate this.

In [60]:
# We can use the set_index method to specify the column to be used as an index and this returns a modified dataframe
# Here we store the result in a new variable
shop_df_mod = shop_df.set_index('Name')
shop_df_mod

Unnamed: 0_level_0,Price,Quantity Sold
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
brownie,2.25,17
cookie,1.25,40
cake,9.5,1
cupcake,3.5,10


In [61]:
shop_df_mod.index

Index(['brownie', 'cookie', 'cake', 'cupcake'], dtype='object', name='Name')
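Because `set_index` returns a new dataframe, the original is left untouched. The sketch below illustrates this on a two-row stand-in, with `shop` and `shop_indexed` as hypothetical names.

```python
import pandas as pd

shop = pd.DataFrame({"Name": ["brownie", "cookie"], "Price": [2.25, 1.25]})
shop_indexed = shop.set_index("Name")

print(list(shop.columns))          # original still has the Name column
print(list(shop_indexed.columns))  # Name is no longer a column here
print(list(shop_indexed.index))    # it became the row labels instead
```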

In [62]:
# Try to slice the Name column
shop_df_mod['Name']

KeyError: 'Name'

We get an error because the column does not exist.
In fact, we can verify this by listing the columns.   
**Hint:** A key error when slicing is an indication of a non-existing column.

In [63]:
# List columns
shop_df_mod.columns

Index(['Price', 'Quantity Sold'], dtype='object')
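A membership test can tell us in advance whether a label is a column name or a row label, which helps explain a KeyError. This is a minimal sketch on a stand-in frame (`shop_indexed` is a hypothetical name).

```python
import pandas as pd

shop_indexed = pd.DataFrame(
    {"Name": ["brownie", "cookie"], "Price": [2.25, 1.25]}
).set_index("Name")

# `in` on .columns tests column names; `in` on .index tests row labels
has_name_column = "Name" in shop_indexed.columns
has_cookie_row = "cookie" in shop_indexed.index
print(has_name_column, has_cookie_row)

# Selecting a missing column raises KeyError, which we can catch
try:
    shop_indexed["Name"]
except KeyError as err:
    print("KeyError:", err)
```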

In [64]:
# Review: slicing 1 column
shop_df_mod["Price"]['cake']

9.5

In [65]:
# Use loc to slice the row with index label 2
# We expect an error as this index does not exist
shop_df_mod.loc[2]

KeyError: 2

### Slicing Rows & Columns with .loc

In [66]:
shop_df_mod['Price']

Name
brownie    2.25
cookie     1.25
cake       9.50
cupcake    3.50
Name: Price, dtype: float64

In [67]:
# Slice out the row with cake as the index
shop_df_mod.loc['cake',:]

Price            9.5
Quantity Sold    1.0
Name: cake, dtype: float64

In [68]:
# Slice multiple rows using a list
shop_df_mod.loc[['cake','brownie']]

Unnamed: 0_level_0,Price,Quantity Sold
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
cake,9.5,1
brownie,2.25,17


In [69]:
# Slice the cupcake/cookie rows and Price/Quantity Sold columns
shop_df_mod.loc[["cupcake","cookie"], ["Price", "Quantity Sold"]]

Unnamed: 0_level_0,Price,Quantity Sold
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
cupcake,3.5,10
cookie,1.25,40


### Slicing Columns (Only) with .loc

In [70]:
# Select all rows for the Price column
shop_df_mod.loc[:, 'Price']

Name
brownie    2.25
cookie     1.25
cake       9.50
cupcake    3.50
Name: Price, dtype: float64

In [71]:
# Slice all rows for multiple columns
shop_df_mod.loc[:, ["Price","Quantity Sold"]]



Unnamed: 0_level_0,Price,Quantity Sold
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
brownie,2.25,17
cookie,1.25,40
cake,9.5,1
cupcake,3.5,10


Issue with dot Notation

In [None]:
# We cannot use the dot notation to get the Quantity Sold series
# shop_df_mod.Quantity Sold

# What do you understand by filtering?

### Filtering
Get the data for this section from [here](https://drive.google.com/file/d/1Vs4IyWyXnanJxRz9wOgwgxqgg-kNAAaT/view?usp=sharing).

In [None]:
# Import pandas if not already imported
# import pandas as pd

In [72]:
filename1 = '../files/mortgages.csv'
df1 = pd.read_csv(filename1)

Let us get some basic information about our data.

In [73]:
df1.head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,Mortgage Name,Interest Rate
0,1,400000.0,1686.42,1000.0,686.42,399313.58,30 Year,0.03
1,2,399313.58,1686.42,998.28,688.14,398625.44,30 Year,0.03
2,3,398625.44,1686.42,996.56,689.86,397935.58,30 Year,0.03
3,4,397935.58,1686.42,994.83,691.59,397243.99,30 Year,0.03
4,5,397243.99,1686.42,993.1,693.32,396550.67,30 Year,0.03


In [74]:
# Get the number of rows and columns
df1.shape

(1080, 8)

In [75]:
# Get information about the datatypes
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Month             1080 non-null   int64  
 1   Starting Balance  1080 non-null   float64
 2   Repayment         1080 non-null   float64
 3   Interest Paid     1080 non-null   float64
 4   Principal Paid    1080 non-null   float64
 5   New Balance       1080 non-null   float64
 6   Mortgage Name     1080 non-null   object 
 7   Interest Rate     1080 non-null   float64
dtypes: float64(6), int64(1), object(1)
memory usage: 67.6+ KB


In [None]:
# What is the number of null values in each column?

New functions for checking null values

In [76]:
df1.isnull()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,Mortgage Name,Interest Rate
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
1075,False,False,False,False,False,False,False,False
1076,False,False,False,False,False,False,False,False
1077,False,False,False,False,False,False,False,False
1078,False,False,False,False,False,False,False,False


In [77]:
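# count() counts every entry in the boolean frame, so this gives the number of rows, not the number of nulls
df1.isnull().count()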

Month               1080
Starting Balance    1080
Repayment           1080
Interest Paid       1080
Principal Paid      1080
New Balance         1080
Mortgage Name       1080
Interest Rate       1080
dtype: int64

In [78]:
# sum() treats True as 1, so this gives the number of nulls in each column
df1.isnull().sum()

Month               0
Starting Balance    0
Repayment           0
Interest Paid       0
Principal Paid      0
New Balance         0
Mortgage Name       0
Interest Rate       0
dtype: int64

Note that although this dataset does not have null values, these methods will come in handy next week as we learn how to deal with missing values.
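Since this dataset has no gaps, the difference between `count()` and `sum()` on the result of `isnull()` is easier to see on data that does have some. The frame `gaps_df` below is invented for this illustration and is not part of the mortgage data.

```python
import pandas as pd

# A made-up frame with one missing value in each column
gaps_df = pd.DataFrame({"Repayment": [1686.42, float("nan"), 1686.42],
                        "Mortgage Name": ["30 Year", "30 Year", None]})

null_mask = gaps_df.isnull()   # True where a value is missing
entries = null_mask.count()    # counts every True/False entry: the row count
missing = null_mask.sum()      # True counts as 1: the number of nulls

print(entries)
print(missing)
```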

In [79]:
# Let us use some new methods to get information about the values in the columns
df1.columns

Index(['Month', 'Starting Balance', 'Repayment', 'Interest Paid',
       'Principal Paid', 'New Balance', 'Mortgage Name', 'Interest Rate'],
      dtype='object')

In [80]:
# The value_counts() method returns the number of occurrences of the
# unique values in a series e.g.
df1['Mortgage Name'].value_counts()

Mortgage Name
30 Year    720
15 Year    360
Name: count, dtype: int64

In [81]:
# We can also get this for the interest rate series
df1['Interest Rate'].value_counts()

Interest Rate
0.03    540
0.05    540
Name: count, dtype: int64
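The result of `value_counts()` is itself a Series indexed by the unique values, so a single count can be looked up by its label. The sketch below uses a short made-up Series called `terms` (a hypothetical name) and also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

terms = pd.Series(["30 Year", "30 Year", "15 Year", "30 Year"], name="Mortgage Name")
counts = terms.value_counts()

# Look up one count by its label
n_thirty = counts["30 Year"]

# normalize=True gives the share of each value instead of the count
shares = terms.value_counts(normalize=True)

print(n_thirty)
print(shares["30 Year"])
```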

In [None]:
# Question: How many mortgages have a 30 Year term?

In [82]:
# How do we create filters?
# We can use comparison operators e.g >, <,  !=,...
# E.g we can get which mortgage names are named 30 year using the equals (==) operator
df1['Mortgage Name'] == '30 Year'
# Notice this returns a boolean series/array based on the evaluation of the condition.

0        True
1        True
2        True
3        True
4        True
        ...  
1075    False
1076    False
1077    False
1078    False
1079    False
Name: Mortgage Name, Length: 1080, dtype: bool

In [83]:
# Let us store this filter in a variable that we shall reuse later.
mortgage_filter = df1['Mortgage Name'] == '30 Year'

In [84]:
# Let us filter our dataframe.
# 1. Filtering with square brackets
# The general notation is dataframe[filter] and this will return a filtered data frame e.g
df1[mortgage_filter]

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,Mortgage Name,Interest Rate
0,1,400000.00,1686.42,1000.00,686.42,399313.58,30 Year,0.03
1,2,399313.58,1686.42,998.28,688.14,398625.44,30 Year,0.03
2,3,398625.44,1686.42,996.56,689.86,397935.58,30 Year,0.03
3,4,397935.58,1686.42,994.83,691.59,397243.99,30 Year,0.03
4,5,397243.99,1686.42,993.10,693.32,396550.67,30 Year,0.03
...,...,...,...,...,...,...,...,...
715,356,10596.54,2147.29,44.15,2103.14,8493.40,30 Year,0.05
716,357,8493.40,2147.29,35.38,2111.91,6381.49,30 Year,0.05
717,358,6381.49,2147.29,26.58,2120.71,4260.78,30 Year,0.05
718,359,4260.78,2147.29,17.75,2129.54,2131.24,30 Year,0.05


In [85]:
# let us store this in a new variable and investigate the result
df1_mortgage_filtered = df1[mortgage_filter]
# We can check the value counts for the Mortgage name series
df1_mortgage_filtered['Mortgage Name'].value_counts()
# we can see that we only have the 30 Year value

Mortgage Name
30 Year    720
Name: count, dtype: int64
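If only the number of matching rows is needed, the boolean filter can be summed directly, since each True counts as 1. This is a sketch on a three-row made-up frame, with `loans` and `is_thirty` as hypothetical names.

```python
import pandas as pd

loans = pd.DataFrame({"Mortgage Name": ["30 Year", "15 Year", "30 Year"],
                      "Interest Rate": [0.03, 0.03, 0.05]})

is_thirty = loans["Mortgage Name"] == "30 Year"

n_matching = is_thirty.sum()      # number of True values in the filter
n_kept = len(loans[is_thirty])    # number of rows that pass the filter

print(n_matching, n_kept)
```

Both numbers agree, which is a quick sanity check that the filter did what was intended.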

In [86]:
# Return data with an interest rate of 0.05

# Create interest rate filter
interest_filter = df1['Interest Rate'] == 0.05

# Filter the dataframe and save in a variable
df1_interest_filtered = df1[interest_filter]

# Check the result
df1_interest_filtered['Interest Rate'].value_counts()

Interest Rate
0.05    540
Name: count, dtype: int64

In [87]:
# We can also filter using the loc attribute.
# The general notation is dataframe.loc[filter, columns]
df1.loc[interest_filter, :].head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,Mortgage Name,Interest Rate
360,1,400000.0,2147.29,1666.66,480.63,399519.37,30 Year,0.05
361,2,399519.37,2147.29,1664.66,482.63,399036.74,30 Year,0.05
362,3,399036.74,2147.29,1662.65,484.64,398552.1,30 Year,0.05
363,4,398552.1,2147.29,1660.63,486.66,398065.44,30 Year,0.05
364,5,398065.44,2147.29,1658.6,488.69,397576.75,30 Year,0.05
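The second position in `.loc` does not have to be a colon: a list of column names keeps only those columns for the filtered rows. The sketch below assumes a small made-up frame named `loans`.

```python
import pandas as pd

loans = pd.DataFrame({"Month": [1, 2, 1],
                      "Repayment": [1686.42, 1686.42, 2147.29],
                      "Interest Rate": [0.03, 0.03, 0.05]})

high_rate = loans["Interest Rate"] == 0.05

# Rows where the filter is True, restricted to two columns
subset = loans.loc[high_rate, ["Month", "Repayment"]]
print(subset)
```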


**Combining Filters**  
We can use bitwise operators, e.g. & (and) and | (or), to combine filters.

In [88]:
# Let us combine the 2 filters above i.e
# Get data where Mortgage Name is 30 Year and Interest Rate is 0.05
df_combined_filter = df1[mortgage_filter & interest_filter]
df_combined_filter.head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,Mortgage Name,Interest Rate
360,1,400000.0,2147.29,1666.66,480.63,399519.37,30 Year,0.05
361,2,399519.37,2147.29,1664.66,482.63,399036.74,30 Year,0.05
362,3,399036.74,2147.29,1662.65,484.64,398552.1,30 Year,0.05
363,4,398552.1,2147.29,1660.63,486.66,398065.44,30 Year,0.05
364,5,398065.44,2147.29,1658.6,488.69,397576.75,30 Year,0.05


In [89]:
# Verify by checking value_counts for both series
df_combined_filter['Interest Rate'].value_counts()

Interest Rate
0.05    360
Name: count, dtype: int64

In [90]:
df_combined_filter['Mortgage Name'].value_counts()

Mortgage Name
30 Year    360
Name: count, dtype: int64
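When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. The sketch below uses a made-up four-row frame named `loans`. It also shows `~`, which negates a filter.

```python
import pandas as pd

loans = pd.DataFrame({"Mortgage Name": ["30 Year", "15 Year", "30 Year", "15 Year"],
                      "Interest Rate": [0.03, 0.03, 0.05, 0.05]})

# OR: rows that satisfy at least one condition
either = loans[(loans["Mortgage Name"] == "30 Year") | (loans["Interest Rate"] > 0.04)]

# AND: rows that satisfy both conditions
both = loans[(loans["Mortgage Name"] == "30 Year") & (loans["Interest Rate"] > 0.04)]

# NOT: rows where the condition is False
not_thirty = loans[~(loans["Mortgage Name"] == "30 Year")]

print(list(either.index), list(both.index), list(not_thirty.index))
```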

In [None]:
# Question: Show data where the mortgage name is 15 Year or interest rate is less than 0.04

### Exercise 2
Using the Dataset from Exercise 1, answer the following questions.
1. Show the measures of central tendency (e.g. mean and median) for the dataset.
2. How many patients with at least 2 **pregnancies** have higher than the mean **Glucose** level?
3. How many patients have:
  - greater than median **Blood Pressure**
  - less than median **Blood Pressure**
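As a hedged starting point for questions like these, the sketch below shows the general pattern on a made-up five-row frame: compute a summary value, build a filter from it, and count the matches. The frame `patients` and its values are invented and are not taken from the exercise dataset.

```python
import pandas as pd

# Made-up values, used only to demonstrate the pattern
patients = pd.DataFrame({"Pregnancies": [0, 2, 3, 1, 5],
                         "Glucose": [90, 120, 150, 100, 140]})

glucose_mean = patients["Glucose"].mean()
glucose_median = patients["Glucose"].median()

# Combine two conditions, then count the rows where both hold
mask = (patients["Pregnancies"] >= 2) & (patients["Glucose"] > glucose_mean)
n_patients = mask.sum()

print(glucose_mean, glucose_median, n_patients)
```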
