# Week 3 Notebook 2: Data Exploration

## Pandas for Data Exploration

In this lesson, we will try out more functions for:
* Manipulating a Pandas DataFrame
* Sorting with Pandas
* Filtering with Pandas

First, we have to import the Pandas library.


In [None]:
import pandas as pd

## Pandas DataFrame

The Pandas library supports a two-dimensional data structure known as a `DataFrame` to store, retrieve and manipulate data.

Rows and columns of a `DataFrame` allow quick and easy access to the stored data.

Let's say we want to keep track of the staff hired by a local hospital in Melbourne. 
We will store the information provided by each member in a single `DataFrame` object. 

For this purpose, we can make use of the Pandas `DataFrame()` constructor method to create a new DataFrame. 

In [None]:
df = pd.DataFrame({
    'StaffID' : [897654, 290128, 612586, 478132, 108954],
    'FirstName' : ['Harvey', 'Mike', 'Riley', 'James', 'Jane'],
    'LastName' : ['Specter', 'Ross', 'Jackson', 'Bond', 'Austin'],
    'Profession' : ['Nurse', 'Pediatrician', 'Neurologist','Gynaecologist','Ophthalmologist']
})

Displaying the contents of the `df` object will show the nicely structured table format of the `DataFrame` object. 

In [None]:
df

A single column can be selected by placing its name in square brackets. This returns a Pandas `Series`:

In [None]:
df['Profession']

Alternatively, the dot operator `DataFrame.ColumnName` can also be used to display a column. However, this only works if the column name is a valid Python identifier. Column names that are reserved words, start with a digit or contain whitespace cannot be accessed using the dot operator.

In [None]:
# Selecting column using the dot operator
df.Profession
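
As a minimal sketch of this restriction, the example below uses a small made-up frame (`staff` and the `Room No` column are hypothetical, not part of the hospital data above). The column whose name contains a space can only be reached with square brackets.

```python
import pandas as pd

# Hypothetical frame: 'Room No' contains a space, so it is not a valid identifier
staff = pd.DataFrame({
    "Profession": ["Nurse", "Neurologist"],
    "Room No": [23, 45],
})

by_dot = staff.Profession        # works: 'Profession' is a valid identifier
by_brackets = staff["Room No"]   # 'staff.Room No' would be a SyntaxError

print(by_dot.tolist())
print(by_brackets.tolist())
```

Square brackets work for every column name, so they are the safer default when column names come from a file.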

To select more than one column, use square brackets and place the required column names in a list. A `DataFrame` object will be returned.

In [None]:
df[['Profession','FirstName']]
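
The difference between the two bracket forms can be checked directly. The sketch below uses a cut-down, made-up version of the staff table (the name `staff` is an assumption for illustration) and inspects the type returned by each selection.

```python
import pandas as pd

staff = pd.DataFrame({
    "FirstName": ["Harvey", "Mike"],
    "Profession": ["Nurse", "Pediatrician"],
})

single = staff["Profession"]                   # one name -> Series
multiple = staff[["Profession", "FirstName"]]  # list of names -> DataFrame
single_as_frame = staff[["Profession"]]        # one-item list -> one-column DataFrame

print(type(single).__name__)
print(type(multiple).__name__)
print(multiple.columns.tolist())  # columns follow the order of the list
```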

You can select rows from the `DataFrame` object using the slicing operator:

In [None]:
# Select rows indexed 1 to 3
df[1:4]

Rows can also be selected based on a condition:

In [None]:
df[df.StaffID == 478132]

We can add new columns to the `DataFrame` object even after it has been created:

In [None]:
df['RoomNo'] = [23,67,45,12,8]
df

Additionally, data can be read in from various file formats, including CSV, XLSX and TXT. Let's read in some data to do more exploration.

# More Pandas

We are going to practice our data exploration and cleaning skills with a different [data set provided by Pavan Tanniru on Kaggle](https://www.kaggle.com/pavantanniru/-datacleaningforbeginnerusingpandas).

Save the data set in the current working directory.


In [None]:
import pandas as pd

jobs = pd.read_csv('Data-cleaning-for-beginners-using-pandas.csv')
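
If the file is not yet available, `read_csv` can still be tried out on CSV text held in memory. The sketch below is an illustration only: the two rows in `csv_text` are made up and merely imitate the layout of the real file.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """Index,Age,Salary
0,44,$44k-$99k
1,66,$55k-$66k
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # the header row becomes the column names
```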

Before we start analysing, cleaning or wrangling our data set, it is always a good idea to explore our data and understand exactly what we are getting ourselves into. 

Pandas provides some great functions to do this.

To start exploring the data set, we can first check how many rows and columns it has, i.e. the *dimensions*, by using the `.shape` property:

In [None]:
# Check the dimensions of the DataFrame
jobs.shape

The above values indicate that our data set contains 29 rows with 7 columns.

Now, let's explore the data types of our data set columns!

In [None]:
# Check the data type of the columns
jobs.dtypes

We have a combination of integer, float and `object` types. Notice that the salary is stored as an `object`. Let's have a look at our data.

In [None]:
# View the first five rows
jobs.head()

In [None]:
# View the last five rows
jobs.tail()

Let's get some basic statistics regarding our data set. The `describe()` method returns statistics on the numerical values.

In [None]:
jobs.describe()
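
To see which columns `describe()` summarises, here is a small sketch on made-up data (the frame `sample` and its values are hypothetical). With a mix of numeric and text columns, only the numeric ones appear in the result by default.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [20, 30, 40],
    "Rating": [3.0, 4.0, 5.0],
    "Location": ["Melbourne", "Sydney", "Perth"],  # text column
})

summary = sample.describe()

print(summary.columns.tolist())     # the text column is left out
print(summary.loc["mean", "Age"])   # statistics are the row labels
```

The result is itself a `DataFrame`, so individual statistics can be picked out by label.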

## Sorting the DataFrame

We can arrange our data by sorting it with the `sort_values()` method.
The column or columns to sort on are chosen with the `by` parameter.

For example, if we want to sort the `jobs` DataFrame according to age, we would choose the `Age` column:

In [None]:
jobs.sort_values(by='Age')

Running the code above will show the `jobs` DataFrame sorted by age, from youngest to oldest.

We could also sort by more than one column. In this case, the column names must be provided as a list.

In [None]:
# Sort by Age, then Salary.
# The 'ascending' parameter specifies that the ages should be in
# descending order; rows with equal Age values are then arranged
# by Salary in ascending order.

jobs.sort_values(by=['Age', 'Salary'], ascending=[False, True])

The `ascending` parameter specifies, for each column named in `by`, whether the data is sorted in ascending or descending order.

In the example above, our data is sorted on two keys. First, it is sorted by `Age` in descending order; then rows with the same age are sorted by `Salary` in ascending order. Because `Salary` is stored as text, those rows are ordered alphabetically rather than by amount.

The `ascending` parameter is optional. By default, sorting is done in ascending order.
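
The two-key sort can be traced on a tiny made-up frame. In the sketch below, `people` and its values are hypothetical, and `Salary` is numeric here to keep the ordering easy to follow.

```python
import pandas as pd

people = pd.DataFrame({
    "Age": [30, 25, 30, 40],
    "Salary": [50, 40, 45, 60],  # numeric in this sketch
})

# Age descending; ties on Age are broken by Salary ascending
ordered = people.sort_values(by=["Age", "Salary"], ascending=[False, True])

print(ordered)
print(people.index.tolist())  # the original frame keeps its order
```

`sort_values()` returns a new `DataFrame`, so the result has to be assigned to a name if the sorted order is needed later.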

## Filtering Data

Filtering data means selecting only part of our data, such as certain columns or rows. Usually, we will store the filtered data in a new object.

We can pass a list of column names to specify which columns to select. A column name may appear in the list more than once.

In [None]:
# Filter some columns
someJobs = jobs[['Age', 'Rating', 'Salary', 'Age']]
print(someJobs)
type(someJobs)

We can see that `someJobs` is a `DataFrame` object. Printing `someJobs` also shows that the values in the `Salary` column start with a `$` character.
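
The leading `$` explains why the salary column is not treated as a number. The sketch below uses two made-up salary strings (an assumption about the general format, not values copied from the data set) and checks whether each column counts as numeric.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [44, 66],
    "Salary": ["$44k-$99k", "$55k-$66k"],  # made-up values
})

age_is_numeric = pd.api.types.is_numeric_dtype(sample["Age"])
salary_is_numeric = pd.api.types.is_numeric_dtype(sample["Salary"])

# .str gives access to string methods on a text column
starts_with_dollar = sample["Salary"].str.startswith("$")

print(age_is_numeric, salary_is_numeric)
print(starts_with_dollar.tolist())
```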

## Conditions

Another way to filter our data is to specify a condition on a column. This returns a Boolean value for each row, indicating whether the row matches the condition.

In [None]:
# Show which rows have Age values > 30
jobs['Age']>30

To filter the DataFrame based on this condition, we put the condition inside square brackets `[]`.

In [None]:
# Show the rows where the condition is true
jobs[jobs['Age']>30]
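
It can help to look at the two steps separately. In this sketch on made-up ages (`sample` is a hypothetical frame), the condition is first stored as a Boolean `Series`, often called a mask, and then used to filter.

```python
import pandas as pd

sample = pd.DataFrame({"Age": [25, 35, 45, 30]})

mask = sample["Age"] > 30   # one True/False value per row
older = sample[mask]        # keeps only the rows where the mask is True

print(mask.tolist())
print(mask.sum())           # True counts as 1, so this is the number of matches
print(older.index.tolist()) # the original row labels are kept
```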

### Combining Conditions

We can combine Boolean conditions in Pandas using the operators:
- `|` for `or`
- `&` for `and`
- `~` for `not`

In [None]:
# Using '&' operator to select rows that match both conditions
jobs[(jobs['Age']>30) & (jobs['Rating']< 3)]

In [None]:
# Using '~' and '|' operators
jobs[~(jobs['Easy Apply']=='TRUE') | (jobs['Established']==-1)]
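
The three operators can be compared on a small made-up frame (`sample` and its values are hypothetical). Each condition is wrapped in parentheses because Python evaluates `&` and `|` before comparisons such as `>`, so leaving them out usually raises an error.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [25, 35, 45],
    "Rating": [4.5, 2.5, 3.5],
})

both = sample[(sample["Age"] > 30) & (sample["Rating"] < 3)]    # and
either = sample[(sample["Age"] > 30) | (sample["Rating"] < 3)]  # or
negated = sample[~(sample["Age"] > 30)]                         # not

print(both.index.tolist())
print(either.index.tolist())
print(negated.index.tolist())
```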

### Using between()

The `between()` method is a useful way to deal with selecting a range of values.

The arguments specify the start and end of the interval. The `inclusive` argument controls whether the boundary values are included, and its value can be `both`, `neither`, `left` or `right`.


In [None]:
# Show jobs established between 2015 and 2020, inclusive
jobs[jobs['Established'].between(2015, 2020, inclusive='both')]
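
The effect of `inclusive` is easiest to see on a handful of values. The sketch below uses a made-up `Series` of years and assumes a recent Pandas version that accepts the string values of `inclusive`.

```python
import pandas as pd

years = pd.Series([2014, 2015, 2018, 2020, 2021])

both = years.between(2015, 2020, inclusive="both")        # 2015 <= x <= 2020
left = years.between(2015, 2020, inclusive="left")        # 2015 <= x <  2020
neither = years.between(2015, 2020, inclusive="neither")  # 2015 <  x <  2020

print(both.tolist())
print(left.tolist())
print(neither.tolist())
```

Like any other condition, the result is a Boolean `Series` that can be placed inside square brackets to filter rows.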

## Subsetting Rows and Columns

We can filter our data based on rows **and** columns using the `loc` and `iloc` attributes of the DataFrame. These allow us to specify the (*row*, *column*) selection required.

### Using loc

The `loc` attribute of a `DataFrame` object is used to specify the rows and columns required, based on their labels (names).

In our `jobs` DataFrame, the rows are labelled or indexed from 0 to 28 and the columns are labelled `Index`, `Age`, `Salary`, `Rating`, `Location`, `Established` and `Easy Apply`.
Let's try some examples.

In [None]:
# Getting the value in row labelled '3' and column labelled 'Salary'
jobs.loc[3, 'Salary']

Specifying the column name is optional. For example, to obtain the values of the row labelled `3`, we can simply use `jobs.loc` with the row label:

In [None]:
# getting the row labeled '3'
jobs.loc[3]

However, if we wanted to get a specific column without specifying the rows, we have to use the slicing operator `:` to get all of the rows.

In [None]:
# Using ':' to select all rows for the 'Rating' column
jobs.loc[:,'Rating']

To specify more than one row or column we can use lists:

In [None]:
# Selecting row labelled '13' to '15' and 2 columns labelled 'Age' and 'Salary'
jobs.loc[[13,14,15], ['Age', 'Salary']]

### Slicing

We can also use the slicing operator `:` to select the rows and columns. Remember that the slicing operator selects based on the values of *start*:*stop*:*step*.

Using the slicing operator with `loc` is **inclusive** on both bounds, so it will select up to and including the *stop* value. You can also use the slicing operator to select columns based on their names.

In [None]:
# Using slicing operator to select rows labeled '13' up to and including '19'
# and columns labeled 'Age' up to and including 'Established'
print(jobs.loc[13:19, 'Age': 'Established'])

# Adding the step interval by 2
print(jobs.loc[13:19:2, 'Age':'Established':2])
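
The inclusive behaviour of label slices can be verified on a small made-up frame. In this sketch, `sample`, its values and its row labels 13 to 15 are hypothetical and only mimic the column layout of `jobs`.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Age": [44, 66, 32],
        "Salary": [50, 60, 70],
        "Rating": [5.4, 3.5, 4.1],
        "Established": [1999, 2002, 2015],
    },
    index=[13, 14, 15],  # row labels
)

# Both the stop row (14) and the stop column ('Rating') are included
subset = sample.loc[13:14, "Age":"Rating"]

# A step of 2 takes every second row and every second column
stepped = sample.loc[13:15:2, "Age":"Established":2]

print(subset)
print(stepped)
```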

### loc with Conditions

We can also use `loc` to filter rows based on a condition. The conditions are usually based on the values found in specific columns.

In [None]:
# Find rows where 'Rating' value is greater than 4 
jobs.loc[jobs['Rating'] > 4]

Using `loc` allows us to filter the rows and determine the columns required to create a subset of the data.

We have to save our filtered data to a new object in order to perform further manipulation on it. 


In [None]:
# Filter the data to specific rows and columns and save in a new DataFrame object
highRatingJobs = jobs.loc[jobs['Rating']> 4, ['Age', 'Easy Apply', 'Location']]
highRatingJobs.head()

### Using iloc

The `iloc` attribute is another way of selecting the rows and columns. The difference between `loc` and `iloc` is that `iloc` only uses **integer** positions to specify the rows and columns.

In [None]:
# Using iloc to select row index 5 and column index 6
# remember that the index always starts from 0
jobs.iloc[5, 6]

In [None]:
# Select only 3 rows, but all columns
jobs.iloc[[1,7,9]]

In [None]:
# Show all rows, but select only 2 columns
jobs.iloc[:, [3,5]]

### Slicing with iloc

You can use Python’s slicing operators to define the range of integer positions. As with ordinary Python slicing, `iloc` does not include the upper bound in the result.

In [None]:
# Using iloc with slicing
jobs.iloc[2:10:2, 1:5]
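
The difference between labels and positions only becomes visible when the row labels are not simply 0, 1, 2 and so on. The sketch below uses a made-up frame whose rows are labelled 10 to 40 (an assumption for illustration) to contrast the two attributes.

```python
import pandas as pd

sample = pd.DataFrame({"Age": [44, 66, 32, 58]}, index=[10, 20, 30, 40])

by_label = sample.loc[10:30]    # labels 10, 20 and 30: stop label included
by_position = sample.iloc[0:2]  # positions 0 and 1: stop position excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
print(sample.iloc[1, 0])        # second row, first column
```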

## Exercises

A sample dataset about employment in Asia in 2019 and 2020 has been downloaded from the [International Labour Organization](https://ilostat.ilo.org/data/). This dataset is based on the "Employment-to-population ratio by sex and age" indicator and compares data for 2019 and 2020 for adults aged 25+.

**Q1 Loading the Data Set**

Load the dataset `ILOlabour.csv` into this notebook with the variable name `empData`.

In [None]:
# Q1 Answer
import pandas as pd

empData = pd.read_csv("ILOlabour.csv")

**Q2 Checking DataFrame Dimensions**

How many rows and columns are in the data set?

In [None]:
# Q2 Answer

empData.shape

**Q3 Viewing the DataFrame**

Show the first five rows of the data set.


In [None]:
# Q3 Answer
empData.head()

*Note* 

This data shows the employment-to-population ratio by sex for ages 25 and above.

So for example, for Afghanistan in 2020:
- 14% of the female population was employed 
- compared to 67.8% of the male population. 
- Overall, 41.7% of the population was employed.


**Q4 Data Sorting**

Sort the data by the column `Sex`, then the column `2020`, with the `2020` values in descending order.

In [None]:
# Q4 Answer
empData.sort_values(by=['Sex','2020'], ascending=[True, False])

**Q5 Data Filtering**

Filter the data set by selecting only rows where the `Sex` value is 'Total' and store the result in a new `DataFrame` object called `totalEmp`.

In [None]:
# Q5 Answer
totalEmp = empData[empData['Sex']=='Total']

totalEmp

**Q6 Filtering by Selecting Columns**

For the `totalEmp` data set, we no longer need the `Sex` column, because every remaining row has the value 'Total'.
So, select the other columns and save them into `totalEmp` again.

In [None]:
# Q6 Answer
totalEmp = totalEmp[['Reference area', '2019', '2020']]

In [None]:
totalEmp

**Q7 Data Set Subsetting using loc**

Using `loc` on the `totalEmp` data set, find the countries where the total employment-to-population ratio is below 50 in 2020. Show only the country name and the 2020 value.

In [None]:
# Q7 Answer
totalEmp.loc[totalEmp['2020']<50, ['Reference area','2020']]