Versatyle Python Skill Test - *Take-Home Assignment*
---------------------------------------------------

In this assignment, you will work through several questions that together form a workflow of basic data operations in Python, where each question builds on the previous one.

First, upload this notebook to [Google Colab](https://colab.research.google.com/).
Download the Excel file *data.xlsx*, store it locally, and then upload it; the code to do this is already in place below. The data consists of master data on passengers who were on board the Titanic, enriched with extra (fictional) information on each passenger.

You will start by performing SQL-like operations (using pandas) on 3 datasets. The resulting dataframe will then be used to implement cleaning rules on the data. Finally, you will use the cleaned dataset to build SQL queries as strings, which you will print.

**Use Google Chrome to avoid possible issues with Google Colab.**

There is no real time limit, but we don't want to take too much of your time. We think this test should be manageable in approximately one or two hours. The quality, readability, and efficiency of the code are most important. 

Good luck!

In [None]:
# RUN THIS CELL FIRST, TO UPLOAD THE DATA
from google.colab import files
data = files.upload()

# Upload the `data.xlsx` file below...

Make sure the name of the file you uploaded is the same as the one below read by the `pd.read_excel()` function. If you don't make any changes everything will run correctly.

In [None]:
# NOW RUN THIS CELL, TO READ THE DATA
import pandas as pd
import io

all_data_sets = pd.read_excel(io.BytesIO(data['data.xlsx']), sheet_name=None)

You should now have 3 separate dataframes stored in `all_data_sets`, a dictionary keyed by sheet name. Run a few checks to confirm that this is the case.
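
One possible check is sketched below. Because the real file is not available here, it uses a stand-in dictionary; the sheet names and columns are made up for illustration, and `summarize_sheets` is a hypothetical helper, not part of the assignment.

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None).
# Sheet names and columns are assumptions made for this illustration.
all_data_sets = {
    "titanic": pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]}),
    "passengers": pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, None]}),
    "extra": pd.DataFrame({"PassengerId": [1, 2], "Email": ["a@x.com", "b@y.com"]}),
}


def summarize_sheets(sheets):
    """Return one row per sheet with its row and column counts."""
    rows = [
        {"sheet": name, "n_rows": df.shape[0], "n_cols": df.shape[1]}
        for name, df in sheets.items()
    ]
    return pd.DataFrame(rows)


summary = summarize_sheets(all_data_sets)
print(summary)
```

Calling `df.head()` and `df.dtypes` on each sheet is a natural next step when looking for join attributes.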

# Exercise 1
---------------------

The three datasets need to be joined so that we end up with one dataframe to which the cleaning rules can be applied.

#### **a)**
Create a new dataframe (e.g. `passenger_extended_df`) that extends the **passengers** table (second dataset/tab in the xlsx) with **extra** information (third dataset/tab in the xlsx). Investigate the data to find the join attribute(s).
Make sure that the resulting table contains all rows from the **passengers** dataset. Then remove passengers that have a `nan` value for the `age` attribute.

#### **b)**
Obtain the final dataframe by joining the **titanic** table with the dataframe created in **a)**. Again, analyze the data to find the proper join attribute(s).
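
As a reminder of how join types behave in pandas, here is a minimal sketch on toy tables. The column names (`key`, `age`, `company`) are placeholders, not the real join attributes, and the inner join in the second step is an assumption, since the join type for **b)** is left for you to decide.

```python
import pandas as pd

# Toy tables; "key" is a placeholder for whatever join attribute(s) you find.
passengers = pd.DataFrame({"key": [1, 2, 3], "age": [30.0, None, 41.0]})
extra = pd.DataFrame({"key": [1, 2], "company": ["Acme", "Initech"]})
titanic = pd.DataFrame({"key": [1, 2, 3], "Survived": [1, 0, 0]})

# how="left" keeps every row of the left table, even without a match in `extra`.
extended = passengers.merge(extra, on="key", how="left")
n_rows_after_left_join = len(extended)

# Drop rows whose age is missing.
extended = extended.dropna(subset=["age"])

# Assumption: an inner join, so only passengers that remain in `extended` are kept.
final = titanic.merge(extended, on="key", how="inner")
print(final)
```

Comparing row counts before and after each merge is a quick way to detect unintended duplication caused by a non-unique join key.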

In [None]:
### Code for question 1a ###

In [None]:
### Code for question 1b ###

# Exercise 2
------------------------------------------
Now that we have our data in one set we would like to do some cleaning. 


#### **a)**
We need a new column called `RequirementX` based on whether someone survived or not and on the `Embarked` value. Fill the column according to the following rules:
 - If a passenger survived and `Embarked` equals `NaN` or "S", the cell has to take the value "..." (string with three dots)
 - If a passenger survived and `Embarked` equals "C" or "Q", the cell has to take the value "00A"
 - If a passenger did not survive and `Embarked` equals "Q", the cell has to take the value "Passed with Q"
 - Else, the cell has to take the value "TBD"
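
Rule tables like this one map naturally onto `numpy.select`, which evaluates a list of boolean conditions in order and takes the first match. The sketch below uses toy data and assumes that survival is encoded as `1`/`0` in a `Survived` column; check the real encoding before reusing it.

```python
import numpy as np
import pandas as pd

# Toy data; the 1/0 encoding of "Survived" is an assumption.
df = pd.DataFrame({
    "Survived": [1, 1, 1, 1, 0, 0],
    "Embarked": ["S", None, "C", "Q", "Q", "S"],
})

survived = df["Survived"] == 1
conditions = [
    survived & (df["Embarked"].isna() | (df["Embarked"] == "S")),
    survived & df["Embarked"].isin(["C", "Q"]),
    ~survived & (df["Embarked"] == "Q"),
]
choices = ["...", "00A", "Passed with Q"]

# Rows that match none of the conditions fall through to the default.
df["RequirementX"] = np.select(conditions, choices, default="TBD")
print(df)
```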


#### **b)** 
For all rows, delete all non-digit characters (anything other than 0-9) in the `Ticket` column.
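
A minimal sketch of this kind of cleanup on a toy series, reading "non-digit" as any character outside 0-9 and assuming the values are stored as strings:

```python
import pandas as pd

# Toy ticket values; the real column may contain other patterns.
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"])

# \D matches any character that is not a digit.
digits_only = tickets.str.replace(r"\D", "", regex=True)
print(digits_only.tolist())
```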


#### **c)**
If a passenger was born before 1965, clear out their `Email`.
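
Conditional updates of this kind are usually written with a boolean mask and `.loc`. The sketch below assumes a hypothetical datetime column named `Birthdate` and interprets "clear out" as setting the value to missing; both are assumptions, so adapt them to the actual data.

```python
import pandas as pd

# Toy data; "Birthdate" is a hypothetical column name.
df = pd.DataFrame({
    "Birthdate": pd.to_datetime(["1950-03-01", "1965-01-01", "1980-07-15"]),
    "Email": ["a@yahoo.com", "b@gmail.com", "c@yahoo.com"],
})

# Boolean mask: True for passengers born strictly before 1965.
born_before_1965 = df["Birthdate"].dt.year < 1965

# Assumption: "clearing" means replacing the value with a missing value.
df.loc[born_before_1965, "Email"] = None
print(df)
```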


#### **d)** 
Print the number of passengers that use *yahoo.com* as their `Email` provider.
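
One way to count matches is with the pandas string accessor, as sketched below on toy values. Treating the comparison as case-insensitive and matching only addresses that end in exactly `@yahoo.com` are assumptions about how the requirement should be read.

```python
import pandas as pd

# Toy email values, including a missing one and a look-alike domain.
emails = pd.Series(
    ["a@yahoo.com", "b@gmail.com", None, "C@Yahoo.com", "d@yahoo.com.br"]
)

# na=False makes missing emails count as "no match" instead of NaN.
is_yahoo = emails.str.lower().str.endswith("@yahoo.com", na=False)
n_yahoo = int(is_yahoo.sum())
print(n_yahoo)
```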
 

#### **e)** 
Select 30 passengers: 15 who have completed High School and 15 who have completed a Bachelor. Keep only the columns `Sex`, `PassengerId`, `Education` and `Company`.
Rename the columns according to the following mapping: 
* `Sex` --> `st`
* `PassengerId` --> `sa`
* `Education` --> `tt`
* `Company` --> `ta`

Give the output a new name.
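
The selection and renaming can be sketched as follows on toy data. The helper `pick_and_rename` is hypothetical, the education labels `"High School"` and `"Bachelor"` are assumed spellings, and taking the first rows of each group with `head` is only one of several valid ways to choose the passengers.

```python
import pandas as pd

COLUMN_MAP = {"Sex": "st", "PassengerId": "sa", "Education": "tt", "Company": "ta"}


def pick_and_rename(df, n_per_level, levels=("High School", "Bachelor")):
    """Take the first n rows per education level, then keep and rename columns."""
    parts = [df[df["Education"] == level].head(n_per_level) for level in levels]
    selected = pd.concat(parts, ignore_index=True)
    # Selecting by the mapping's keys also fixes the column order.
    return selected[list(COLUMN_MAP)].rename(columns=COLUMN_MAP)


# Toy data with an extra column and an extra education level.
toy = pd.DataFrame({
    "PassengerId": range(1, 7),
    "Sex": ["male", "female"] * 3,
    "Education": ["High School", "Bachelor", "Master"] * 2,
    "Company": list("abcdef"),
    "Age": [20, 30, 40, 50, 60, 70],
})

compact_df = pick_and_rename(toy, n_per_level=2)
print(compact_df)
```

For the assignment itself, `n_per_level` would be 15.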

In [None]:
### Code for question 2a ###

In [None]:
### Code for question 2b ###

In [None]:
### Code for question 2c ###

In [None]:
### Code for question 2d ###

In [None]:
### Code for question 2e ###

# Exercise 3  
----------------------------------------

Using the more compact table from our previous task **2e)** we want to create two SQL queries as strings, then output them to the screen. You are not expected to work with SQL or execute the queries.

Consider the following example dataframe:

In [1]:
import pandas as pd
example_data: dict = {
    'st': ['A', 'B', 'A', 'Z', 'Z'], 
    'sa': ['001', '007', '40021', '90833', 'hello World'], 
    'tt': ['table1', 'table1', 'table1', 'table2', 'table2'],
    'ta': ['xa', 'xb', 'xc', 'ya', 'yb']
}
example_df = pd.DataFrame.from_dict(example_data)

print(example_df)

  st           sa      tt  ta
0  A          001  table1  xa
1  B          007  table1  xb
2  A        40021  table1  xc
3  Z        90833  table2  ya
4  Z  hello World  table2  yb


For each group in the `tt` column, we would like to have a separate string. Since the dataframe from the previous exercise will have two distinct `tt` values by design, your code should output two strings. 

Based on the example dataframe, try to find a pattern for structuring and creating the SQL query.
Note the use of backticks. If values are added to the `tt` column, your program should output more strings.

Using the example data, your program should **exactly** create and print the following string for **table1**:
```sql
SELECT A.`001` as `xa`, B.`007` as `xb`, A.`40021` as `xc` 
FROM schema.A + schema.B
```

And this for **table2**: 
```sql
SELECT Z.`90833` as `ya`, Z.`hello World` as `yb`
FROM schema.Z
```
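
To make the pattern concrete, here is one possible reading of it, sketched against the example data only: each row contributes one `st`.`sa` as `ta` item to the `SELECT` clause, and the `FROM` clause lists each distinct `st` value once, in order of first appearance, joined by ` + `. The function `build_queries` and its `schema` parameter are hypothetical names, and the deduplication rule is inferred from the two example strings rather than stated in the assignment.

```python
import pandas as pd

example_df = pd.DataFrame({
    "st": ["A", "B", "A", "Z", "Z"],
    "sa": ["001", "007", "40021", "90833", "hello World"],
    "tt": ["table1", "table1", "table1", "table2", "table2"],
    "ta": ["xa", "xb", "xc", "ya", "yb"],
})


def build_queries(df, schema="schema"):
    """Return a dict mapping each `tt` value to its query string."""
    queries = {}
    # sort=False keeps the groups in order of first appearance.
    for table, group in df.groupby("tt", sort=False):
        select_items = [
            f"{st}.`{sa}` as `{ta}`"
            for st, sa, ta in zip(group["st"], group["sa"], group["ta"])
        ]
        # unique() preserves the order in which the sources first appear.
        sources = [f"{schema}.{st}" for st in group["st"].unique()]
        queries[table] = (
            "SELECT " + ", ".join(select_items) + "\nFROM " + " + ".join(sources)
        )
    return queries


queries = build_queries(example_df)
for query in queries.values():
    print(query)
    print()
```

Because the function loops over groups, adding a third `tt` value to the dataframe yields a third string without any code change.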

In [None]:
### Code for question 3 ###

# Exercise 4
----------------------------------------
Have your code reviewed and optimized by your favorite AI. Evaluate the AI's suggestions and explain whether or not they actually improve the code. You do not have to change the code; we are interested in your evaluation of the AI's output.

Please tell us which AI you used and how you used it.