<a href="https://colab.research.google.com/github/futureCodersSE/python-programming-for-data/blob/main/Worksheets/4a_Filtering_data_with_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Filtering dataframe data

Interrogating dataframes
---

*   single column: `df['column']`
*   multiple columns: `df[['column1', 'column2']]`
*   filter rows by condition: `df[df['column'] == value]`  
*   filter by multiple conditions where both are true (AND):  
	`df[(df['column1'] == value1) & (df['column2'] == value2)]`  
*   filter by multiple conditions where at least one is true (OR):  
	`df[(df['column1'] == value1) | (df['column2'] == value2)]`  
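
As a rough illustration of the filtering patterns listed above, here is a minimal sketch on a small made-up dataframe. The column names (`country`, `income`, `net`) and all values are hypothetical and are not taken from the worksheet's spreadsheet.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "income": ["High income", "Low income", "High income", "Low income"],
    "net": [12.5, -3.0, 40.0, 7.5],
})

# Single condition: == compares values (a single = would be an assignment)
high = df[df["income"] == "High income"]

# AND: both conditions must hold; each condition needs its own parentheses
high_and_big = df[(df["income"] == "High income") & (df["net"] > 20)]

# OR: at least one of the conditions must hold
low_or_negative = df[(df["income"] == "Low income") | (df["net"] < 0)]

print(len(high), len(high_and_big), len(low_or_negative))  # prints: 2 1 2
```

The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.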

### Useful Functions

---

`head()`: shows the first 5 rows of the dataframe.  
`tail()`: shows the last 5 rows of the dataframe.  
`len()`: returns the number of rows, e.g. `len(df)`.  
`mode()`: returns the most common value(s) in a column.  
`mean()`: returns the average of a column.  
`sort_values()`: sorts the dataframe by the values in one or more columns.
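
The functions above can be tried out on a small example. This is only a sketch: the dataframe, its `city` and `score` columns and the values in them are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
df = pd.DataFrame({
    "city": ["X", "Y", "X", "Z", "X", "Y"],
    "score": [3, 1, 4, 1, 5, 9],
})

first_rows = df.head()             # first 5 rows by default; df.head(2) gives 2
last_rows = df.tail()              # last 5 rows by default
row_count = len(df)                # number of rows in the dataframe
common_city = df["city"].mode()    # a Series, because several values can tie
average = df["score"].mean()       # average of a numeric column
ordered = df.sort_values("score")  # ascending order by default

print(row_count)       # 6
print(common_city[0])  # X
```

Because `mode()` returns a Series rather than a single value, the first (or only) most common value is read with `[0]`.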





## Data Imports & Table Content

---

In this worksheet you will use data from a dataset in an Excel spreadsheet called public_use-talent-migration.xlsx. This spreadsheet file contains 3 sheets, each with related but different data on migration.   

Running the code below will create a dataframe from each of the 3 sheets and will display the column names and data types so that you can start to get an idea of what is in the data.

You can then use these dataframes in the exercises, rather than keep re-reading the original files. The 3 dataframes will be called:

*  skill_migration
*  industry_migration
*  country_migration

In [None]:
import pandas as pd

skill_migration = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true', "Skill Migration")
industry_migration = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true', "Industry Migration")
country_migration = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true', "Country Migration")

def get_summary(df):
  # info() prints the column names and data types itself and returns None
  df.info()

get_summary(skill_migration)
get_summary(industry_migration)
get_summary(country_migration)

# Exercise 1

---
\
Write a function that will take the `skill_migration` dataframe as a parameter and will:
*  filter for the rows where the `wb_income` contains `High income`.
*  return the number of rows (length of the dataframe).

In [None]:
def filter_income(df):
  # add code below to find the number of high income countries using the wb_income column



#run test to see if you are getting the correct number of rows
actual = filter_income(skill_migration)
expected = 8904

if actual == expected:
  print("Test passed!\nExpected: {}\nActual: {}".format(expected, actual))
else:
  print("Test failed!\nExpected: {}\nActual: {}".format(expected, actual))

# Exercise 2

---
\
Write a function that will take the `skill_migration` dataframe and a particular type of `skill` (e.g. `Tech Skills`) as parameters and will:
*  filter for the rows that have a `skill_group_category` equal to the given `skill`.
*  return the country that shows up the most in the `country_name` column of the filtered dataframe.

In [None]:
import pandas as pd

skill_migration = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true', "Skill Migration")

def filter_skills(df, skill):
  # add code below to find all rows that have the skill 'skill' and return the most common country (mode)



#run test to see if you are getting the most frequent country name
tests = [
    { "id": 1, "actual": filter_skills(skill_migration, "Tech Skills")[0], "expected": "Australia" },
    { "id": 2, "actual": filter_skills(skill_migration, "Business Skills")[0], "expected": "Australia" },
    { "id": 3, "actual": filter_skills(skill_migration, "Specialized Industry Skills")[0], "expected": "United Kingdom" }
]

for test in tests:
  if test["actual"] == test["expected"]:
    print("Test {} passed!\nExpected: {}\nActual: {}\n".format(test["id"], test["expected"], test["actual"]))
  else:
    print("Test {} failed!\nExpected: {}\nActual: {}\n".format(test["id"], test["expected"], test["actual"]))

# Exercise 3

---
\
Write a function that will take the `skill_migration` dataframe as a parameter and will:
*  filter, using two conditions, for the rows where `skill_group_id` is `2265` and `net_per_10K_2019` is greater than `-500`.
*  sort the rows in ascending order based on `net_per_10K_2019`
*  return the first 5 rows.
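
Filtering and sorting keep each row's original index label, which is why a test can check the index of the first row in a result. The sketch below shows this on made-up data. The column names `group_id` and `net` are hypothetical, and the example keeps 2 rows rather than 5 to stay small.

```python
import pandas as pd

# Hypothetical toy data; index labels 0..5 are assigned automatically
df = pd.DataFrame({
    "group_id": [7, 7, 8, 7, 7, 8],
    "net": [-20.0, 15.0, 3.0, -900.0, 4.0, 1.0],
})

# Two conditions joined with & (AND), then sort ascending and keep 2 rows
subset = df[(df["group_id"] == 7) & (df["net"] > -500)]
smallest = subset.sort_values("net").head(2)

# The rows keep the index labels they had in the original dataframe
print(smallest.index.tolist())  # prints: [0, 4]
```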

In [None]:
def filter_skill_id(df):
  # add code below to find all rows with skill id 2265 and net_per_10K_2019 greater than -500, then return the first 5 rows when sorted in ascending order of net_per_10K_2019



#run test to see if you are getting the correct first row and only returning 5 items
filtered_df = filter_skill_id(skill_migration)
actual1 = filtered_df.index[0]
expected1 = 14550
actual2 = len(filtered_df)
expected2 = 5

if actual1 == expected1 and actual2 == expected2:
  print("Test passed!\nExpected: {} & {}\nActual: {} & {}".format(expected1, expected2, actual1, actual2))
else:
  print("Test failed!\nExpected: {} & {}\nActual: {} & {}".format(expected1, expected2, actual1, actual2))

# Exercise 4

---
\
Write a function that will take the `country_migration` dataframe and an `amount` of net migration per 10K as parameters and will:
*  filter for all rows with a `net_per_10K_2019` less than `amount`
*  return the number of rows.



In [None]:
def filter_net_per_10k(df, amount):
  # add code below to find all the rows where the net_per_10K_2019 is less than the `amount` parameter



#run test to see if you are getting the correct number of rows
tests = [
    { "id": 1, "actual": filter_net_per_10k(country_migration, 100), "expected": 4148 },
    { "id": 2, "actual": filter_net_per_10k(country_migration, 0), "expected": 1980 },
    { "id": 3, "actual": filter_net_per_10k(country_migration, -100), "expected": 0 }
]

for test in tests:
  if test["actual"] == test["expected"]:
    print("Test {} passed!\nExpected: {}\nActual: {}\n".format(test["id"], test["expected"], test["actual"]))
  else:
    print("Test {} failed!\nExpected: {}\nActual: {}\n".format(test["id"], test["expected"], test["actual"]))

# Exercise 5

---
\
Write a function that will take the `country_migration` dataframe as a parameter and will:
*  filter for all rows where both `net_per_10K_2015` and `net_per_10K_2016` values are greater than `50`.
*  return the number of rows

In [None]:
def filter_two_net(df):
  # add code below to find rows which have migration in 2015 & 2016 greater than 50



#run test to see if you are getting the correct length of rows
actual = filter_two_net(country_migration)
expected = 3

if actual == expected:
  print("Test passed!\nExpected: {}\nActual: {}".format(expected, actual))
else:
  print("Test failed!\nExpected: {}\nActual: {}".format(expected, actual))

# Exercise 6

---
\
Write a function that will take the `country_migration` dataframe as a parameter and will:
*  filter for all migrations from countries with `Low Income` to countries with `Upper Middle Income` and within the same region (`base_country_wb_region` is the same as `target_country_wb_region`)
*  return the number of rows.
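
A condition can compare two columns with each other, row by row, instead of comparing a column with a fixed value. Here is a minimal sketch on made-up data. The column names `base_region`, `target_region` and `base_income` are hypothetical stand-ins, not the real column names.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
df = pd.DataFrame({
    "base_region": ["North", "South", "East", "North"],
    "target_region": ["North", "East", "East", "South"],
    "base_income": ["Low", "Low", "High", "Low"],
})

# Row-by-row comparison of two columns: True where the values match
same_region = df[df["base_region"] == df["target_region"]]

# A column-to-column comparison combines with other conditions like any other
low_same = df[(df["base_income"] == "Low") & (df["base_region"] == df["target_region"])]

print(len(same_region), len(low_same))  # prints: 2 1
```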

In [None]:
def filter_two_income(df):
  # add code below to find all rows of migration from low income to upper middle income and where migration was to the same region



#run test to see if you are getting the correct number of rows
actual = filter_two_income(country_migration)
expected = 15

if actual == expected:
  print("Test passed!\nExpected: {}\nActual: {}".format(expected, actual))
else:
  print("Test failed!\nExpected: {}\nActual: {}".format(expected, actual))

# Exercise 7

---
\
Write a function that will take the `industry_migration` dataframe as a parameter and will:
*  filter for all rows which have an `isic_section_index` of `M` and an `industry_name` of `Biotechnology`
*  return the number of rows.

In [None]:
def filter_industry(df):
  # add code below to find all the rows from biotechnology industry with isic section index of M



#run test to see if you are getting the correct length of rows
actual = filter_industry(industry_migration)
expected = 32

if actual == expected:
  print("Test passed!\nExpected: {}\nActual: {}".format(expected, actual))
else:
  print("Test failed!\nExpected: {}\nActual: {}".format(expected, actual))

# Exercise 8

---
\
Write a function that will take the `industry_migration` dataframe as a parameter and will:
*  filter for all rows with `industry_name` of `Computer Software` that have a `wb_income` of `Low income`.
*  return the filtered dataframe

In [None]:
def filter_industry_income(df):
  # add code below to find Low Income Computer Software migrations, return the full set of filtered data



#run test to see if you are getting the correct first row and only returning 1 item
filtered_df = filter_industry_income(industry_migration)
actual1 = filtered_df.index[0]
expected1 = 3699
actual2 = len(filtered_df)
expected2 = 1

if actual1 == expected1 and actual2 == expected2:
  print("Test passed!\nExpected: {} & {}\nActual: {} & {}".format(expected1, expected2, actual1, actual2))
else:
  print("Test failed!\nExpected: {} & {}\nActual: {} & {}".format(expected1, expected2, actual1, actual2))

# Exercise 9

---
\
Write a function that will take the `industry_migration` dataframe as a parameter and will:
*  filter for all rows with a `country_name` of `United States` or `United Kingdom` and an `isic_section_index` of `M`
*  return the average of `net_per_10K_2015`.
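
When OR and AND appear in the same filter, the grouping of the conditions changes the result, because `&` is evaluated before `|`. The sketch below uses made-up data with hypothetical column names (`country`, `section`, `value`) to show the difference. It also shows `isin()`, a pandas Series method, as an alternative way to write the OR part.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
df = pd.DataFrame({
    "country": ["P", "Q", "R", "P", "Q"],
    "section": ["M", "M", "M", "J", "J"],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# (P or Q) and section M: the OR part is wrapped in its own parentheses
grouped = df[((df["country"] == "P") | (df["country"] == "Q")) & (df["section"] == "M")]

# The same filter written with isin() for the OR part
with_isin = df[df["country"].isin(["P", "Q"]) & (df["section"] == "M")]

# Without the extra parentheses this means: P or (Q and section M)
ungrouped = df[(df["country"] == "P") | (df["country"] == "Q") & (df["section"] == "M")]

print(grouped["value"].mean())         # prints: 15.0
print(len(grouped), len(ungrouped))    # prints: 2 3
```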

In [None]:
def filter_country(df):
  # add code below to find all USA and UK rows which have ISIC of M and return mean migration in 2015



#run test to see if you are getting the correct average
actual = round(filter_country(industry_migration), 2)
expected = 47.28

if actual == expected:
  print("Test passed!\nExpected: {}\nActual: {}".format(expected, actual))
else:
  print("Test failed!\nExpected: {}\nActual: {}".format(expected, actual))

# Exercise 10
---
\
Write a function that will take the `country_migration` dataframe and a base region as parameters and will:
*  filter for all migrations to countries with `Upper Middle Income` or `High Income` (`target_country_wb_income`) from the given region (`base_country_wb_region`)
*  return the number of rows.

In [None]:
def filter_migrations(df, region):



# run test to see if you are getting the correct result
actual = filter_migrations(country_migration, "Middle East & North Africa")
expected = 432
if actual == expected:
  print("Test passed!\nExpected: {}\nActual: {}".format(expected, actual))
else:
  print("Test failed!\nExpected: {}\nActual: {}".format(expected, actual))

# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer:

## What caused you the most difficulty?

Your answer: