
### Guidelines
I strongly advise you to carefully read this assignment, think about approaches and try to understand the data before diving into the questions. 

* **You can complete this assignment in this Google Colab notebook if you wish.**

In [0]:
from google.colab import files

uploaded = files.upload()

# Details


### You have the following information in your files
- "agents.csv" and
- "properties.csv"


- PROPERTIES table: 
  - **id**(PK, INT) - unique identification number of the property ad listing
  - **title**(VARCHAR) - title of the property ad listing
  - **features**(VARCHAR) - field with additional characteristics of the property ad listing
  - **living_area**(FLOAT) - living area of the property in square meters
  - **total_area**(FLOAT) - total area of the property in square meters
  - **plot_area**(FLOAT) - plot area of the property in square meters
  - **price**(FLOAT) - selling price of the property in euros
  - **agent_id**(INT) - selling agent id
  - **createdAt**(DATE) - date in which the property was added to the market
- AGENTS table: 
  - **agent_id**(PK, INT) - selling agent id
  - **company**(VARCHAR) - company for which the agent works

#### Details of properties:
- **locations** can be: Alenquer, Quinta da Marinha, Golden Mile, Nagüeles;
- **types** can be: ‘apartment’, ‘penthouse’, ‘duplex’, ‘house’, ‘villa’, ‘country estate’,
‘moradia’, ‘quinta’, ‘plot’, ‘land’;
- the property types can be part of the following **property groups**:
  1. group **‘apartments’** includes types ‘apartment’, ‘penthouse’, ‘duplex’;
  2. group **‘houses’** includes types ‘house’, ‘villa’, ‘country estate’, ‘moradia’, ‘quinta’;
  3. group **‘plots’** includes types ‘plot’, ‘land’.
- areas:
  - for the group **‘plots’** use **plot_area**;
  - for groups **‘apartments’** and **‘houses’** use the larger of **total_area** and **living_area**.
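The group and area rules above can be sketched as a small lookup plus a helper. This is only an illustration: the names `GROUP_BY_TYPE`, `property_group` and `effective_area` are hypothetical, and treating missing areas as `None` or NaN is an assumption about how the CSV is loaded.

```python
# Type-to-group mapping copied from the property details above.
GROUP_BY_TYPE = {
    "apartment": "apartments", "penthouse": "apartments", "duplex": "apartments",
    "house": "houses", "villa": "houses", "country estate": "houses",
    "moradia": "houses", "quinta": "houses",
    "plot": "plots", "land": "plots",
}


def property_group(ptype):
    """Return the property group for a type, or None if the type is unknown."""
    return GROUP_BY_TYPE.get(ptype.strip().lower())


def effective_area(ptype, living_area=None, total_area=None, plot_area=None):
    """Pick the area to use for a property, following the rules above."""
    group = property_group(ptype)
    if group == "plots":
        return plot_area
    if group in ("apartments", "houses"):
        # Skip missing values; `a == a` is False for NaN (assumed missing marker).
        candidates = [a for a in (living_area, total_area)
                      if a is not None and a == a]
        return max(candidates) if candidates else None
    return None
```

For instance, `effective_area("villa", living_area=120.0, total_area=150.0)` picks the total area, while `effective_area("land", plot_area=900.0)` picks the plot area.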



**Challenge**
- (Q6) Write code to identify the companies (agents) with the most expensive properties for each month of 2017
- (Q7) Write code to get the first and last property posted by each company (agents)
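One possible pandas shape for Q6 and Q7 is sketched below on made-up rows. The column names follow the table descriptions above; the toy values, and the reading of "most expensive" as the single highest-priced listing per month, are assumptions.

```python
import pandas as pd

# Toy rows standing in for properties.csv and agents.csv (values are made up).
properties = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "price": [100000.0, 250000.0, 400000.0, 90000.0, 300000.0],
    "agent_id": [10, 11, 10, 12, 11],
    "createdAt": ["2017-01-05", "2017-01-20", "2017-02-03",
                  "2016-12-30", "2017-02-14"],
})
agents = pd.DataFrame({"agent_id": [10, 11, 12], "company": ["A", "B", "C"]})

# Attach the company to every listing and parse the dates.
listings = properties.merge(agents, on="agent_id", how="left")
listings["createdAt"] = pd.to_datetime(listings["createdAt"])

# Q6: for each month of 2017, the listing with the highest price.
in_2017 = listings[listings["createdAt"].dt.year == 2017].copy()
in_2017["month"] = in_2017["createdAt"].dt.month
top_per_month = (
    in_2017.loc[in_2017.groupby("month")["price"].idxmax(),
                ["month", "company", "id", "price"]]
    .reset_index(drop=True)
)

# Q7: first and last listing id per company, ordered by creation date.
first_last = (
    listings.sort_values("createdAt")
    .groupby("company")
    .agg(first_id=("id", "first"), last_id=("id", "last"))
)
```

If "most expensive" is instead meant as the highest average price per company and month, the `idxmax` step would be replaced by a mean over `["month", "company"]`.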

# Data Analysis (Python)

 For this part, feel free to use as many cells as you need below this point. Please use properties.csv as your data source. 



## Problem 
A private investor is planning an investment in one of the four locations. To decide where to invest, he needs to know the price impact of features such as ‘pool’, ‘sea view’ and ‘garage’ on properties in each location.
He also asks for the mean price of the properties in each type group (‘apartments’, ‘houses’, ‘plots’) and wants to know which properties on the market are undervalued and which are overvalued. To address this problem, please complete the following steps:

#### Part 1: Data Cleaning
As you have seen previously, a lot of information is present in the title/features fields. From there, we want to extract the relevant information for further analysis, such as:
 - 1A: Property **type** (as presented in **Details** above) of each property, from the **title** field
 - 1B: Property **location** (as presented in **Details** above) of each property, from the **title** field
 - 1C: From the **features** field, whether a property has:
  - a pool
  - a garage
  - sea view

#### Expected Outcome for Part 1:
- Create a property dataset with the following schema and save it in a csv file:
  - id
  - location name
  - type
  - title
  - features
  - pool (0/1)
  - sea view (0/1)
  - garage (0/1)
- Pool, sea view and garage should be binary: 1 if the property has the feature and 0 if not
- For each of the 3 tasks (1A, 1B, 1C), describe in detail what you did.
- Please provide your code in the cells below, in a reproducible and understandable way.
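A minimal sketch of tasks 1A to 1C using whole-word, case-insensitive matching is shown below. It assumes that titles spell types and locations as listed in **Details**; the helper names `find_first` and `extract` are hypothetical. Two pitfalls it guards against: ‘house’ is a substring of ‘penthouse’, and ‘quinta’ (a type) also appears in ‘Quinta da Marinha’ (a location).

```python
import re

TYPES = ["apartment", "penthouse", "duplex", "house", "villa",
         "country estate", "moradia", "quinta", "plot", "land"]
LOCATIONS = ["Alenquer", "Quinta da Marinha", "Golden Mile", "Nagüeles"]
FEATURES = ["pool", "sea view", "garage"]


def find_first(terms, text):
    """Return the term that occurs earliest in text as a whole word, or None."""
    best = None
    for term in terms:
        match = re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), term)
    return best[1] if best else None


def extract(title, features):
    """Derive location, type and 0/1 feature flags from the two text fields."""
    location = find_first(LOCATIONS, title)
    remainder = title
    if location:
        # Drop the location first so 'Quinta da Marinha' is not read as type 'quinta'.
        remainder = re.sub(re.escape(location), " ", title, flags=re.IGNORECASE)
    row = {"location": location, "type": find_first(TYPES, remainder)}
    for name in FEATURES:
        pattern = r"\b" + re.escape(name) + r"\b"
        row[name] = int(bool(re.search(pattern, features or "", re.IGNORECASE)))
    return row
```

This simple matching does not handle plurals (‘pools’), negations (‘no garage’), or alternative spellings such as ‘Nagueles’; whether those occur should be checked against the actual data.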

#### Part 2: Identify outliers
Now that the data is structured correctly, let's look at which properties are a good deal for our investor. For this you will need to **identify undervalued, overvalued, and normal properties** in the dataset. Remember that a house that is undervalued in one location could be considered a high outlier in another location. Location and type classifications are important in this task.
#### Expected Outcome for Part 2:
- As before, deliver a csv file with the following format:
  - id
  - location name
  - type
  - area
  - price
  - over-valued (0/1)
  - under-valued (0/1)
  - normal (0/1)
- The new columns should be binary: for example, the **over-valued** column gets the value 1 if the property is over-valued and 0 otherwise;
- A short report (could be a pdf file or new cells within the notebook) containing:
  - visualizations (such as scatter plots) discriminating between the undervalued, overvalued and normal properties.
  - an explanation of the difference between under-valued/over-valued properties and pure data outliers;
  - any notes/conclusions you wish to add;
- Provide your code in a reproducible way in the cells below;
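As one possible starting point (not a prescribed method), the sketch below flags properties whose price per square meter falls outside the 1.5 × IQR fences computed within each location and property group. The IQR rule, the `group` and `price_per_m2` column names, and the toy values are all assumptions.

```python
import pandas as pd

# Toy data: eight houses in one location, all 100 m2 (values are made up).
df = pd.DataFrame({
    "id": list(range(1, 9)),
    "location": ["Alenquer"] * 8,
    "group": ["houses"] * 8,
    "area": [100.0] * 8,
    "price": [100e3, 105e3, 110e3, 95e3, 102e3, 98e3, 30e3, 300e3],
})

df["price_per_m2"] = df["price"] / df["area"]

# Quartiles are computed per (location, group) and broadcast back to each row.
grouped = df.groupby(["location", "group"])["price_per_m2"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

df["under-valued"] = (df["price_per_m2"] < q1 - 1.5 * iqr).astype(int)
df["over-valued"] = (df["price_per_m2"] > q3 + 1.5 * iqr).astype(int)
df["normal"] = 1 - df["under-valued"] - df["over-valued"]
```

A rule like this cannot by itself tell a genuinely cheap property from a data error (for example a price missing a digit), which is why the report asks for that distinction to be discussed separately.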

#### Part 3: Theoretical questions

- Describe in detail how you would evaluate the price impact of features such as sea view, pool and garage, considering the dataset provided. Your answer should also cover how you would deal with missing values, outliers and duplicated listings (the same property listing published by different agencies).
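One common way to frame this question is a hedonic regression on the log of the price. This is an assumption, not a method the assignment prescribes, and the control terms for area, location and type are illustrative choices:

$$
\log(\text{price}_i) = \beta_0 + \beta_1\,\text{pool}_i + \beta_2\,\text{sea view}_i + \beta_3\,\text{garage}_i + \beta_4 \log(\text{area}_i) + \gamma_{\text{location}(i)} + \delta_{\text{type}(i)} + \varepsilon_i
$$

Under this specification, $e^{\beta_1} - 1$ is the estimated relative price change associated with a pool when the other variables are held fixed. To obtain a separate impact per location, the feature terms could be interacted with location, or the model could be fitted to each location on its own.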


#### Part 4: Price estimation model

- Create a model to estimate the price of the properties based on the features you consider relevant. You can use linear, polynomial, multivariate or tree regressors.
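As a minimal illustration of the linear option, the sketch below fits an ordinary least-squares model with NumPy on synthetic data. The data is generated from a known formula so the fit can be checked; the chosen predictors (`area`, `pool`) and the `predict` helper are assumptions, and a real solution would also need a train/test split and an error metric.

```python
import numpy as np

# Synthetic data generated from price = 50000 + 1000 * area + 20000 * pool.
area = np.array([50.0, 80.0, 120.0, 200.0, 150.0, 90.0])
pool = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
price = 50_000 + 1_000 * area + 20_000 * pool

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(area), area, pool])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, per_m2, pool_premium = coef


def predict(area_m2, has_pool):
    """Estimate the price of a property from its area and pool flag."""
    return float(intercept + per_m2 * area_m2 + pool_premium * has_pool)
```

Because the synthetic prices are exactly linear, the fitted coefficients recover the generating formula; on real listings the residuals would show how much price variation these features leave unexplained.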

#### Extra challenge 5:
- Describe how you would model the data over time (using the createdAt field). What changes over time would you look for, and what would you expect the outcomes to be (e.g. in terms of pricing per location/type)?
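A simple first look at the time dimension could be the monthly median price per location and its month-over-month change, as sketched below. The toy rows are made up, and the choice of the median and of monthly buckets is an assumption.

```python
import pandas as pd

# Toy listings (values are made up); createdAt follows the PROPERTIES schema.
df = pd.DataFrame({
    "location": ["Alenquer", "Alenquer", "Alenquer",
                 "Golden Mile", "Golden Mile", "Golden Mile"],
    "createdAt": pd.to_datetime(["2017-01-03", "2017-01-25", "2017-02-10",
                                 "2017-01-15", "2017-02-01", "2017-02-20"]),
    "price": [100000.0, 120000.0, 121000.0, 500000.0, 550000.0, 450000.0],
})

# Median listing price per calendar month, one column per location.
monthly = (
    df.assign(month=df["createdAt"].dt.to_period("M"))
    .groupby(["location", "month"])["price"]
    .median()
    .unstack("location")
)

# Relative change versus the previous month.
change = monthly / monthly.shift(1) - 1
```

With real data, months with very few listings would make the median noisy, so a minimum count per bucket or a rolling window may be worth considering.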