# Homework pandas (with answers)

<table align="left">
    <tr>
    <td><a href="https://colab.research.google.com/github/airnandez/numpandas/blob/master/exam/2020-exam-with-answers.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a></td>
    <td><a href="https://mybinder.org/v2/gh/airnandez/numpandas/master?filepath=exam%2F2020-exam-with-answers.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Launch Binder"/>
</a></td>
  </tr>
</table>

*Author: Fabio Hernandez*

*Last updated: 2020-03-19*

*Location:* https://github.com/airnandez/numpandas/exam

--------------------
## Instructions

For this exercise we will use a public dataset titled **"Demandes de valeurs foncières géolocalisées"** available [here](https://www.data.gouv.fr/fr/datasets/demandes-de-valeurs-foncieres-geolocalisees/). This dataset contains information about registered real estate transactions (_mutations immobilières_) in France over several years. There is one file per year. The structure of the files and the semantics of each column are documented at the source.

For your convenience, this notebook comes with code for downloading the dataset from its source and loading it into memory as a **pandas** dataframe, along with some cleaning and helper functions. Your mission is to execute the provided cells and to write the code that answers the questions below.

You must not modify the code provided. You must provide code for answering the questions, following the instructions for each one of them.

When you have finished, please save your notebook as a `.ipynb` file and send it by e-mail to your instructor, following the instructions you received by e-mail.

---------------------
## Dependencies

In [None]:
import datetime
import os
import glob

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.__version__

In [None]:
import numpy as np
np.__version__

------
## Download the dataset

Define a helper function for downloading data to a local file:

In [None]:
import os
import requests

def download(url: str, path: str):
    """Download file at url and save it locally at path."""
    with requests.get(url, stream=True) as resp:
        if not resp.ok:
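            # Raising a plain string is a TypeError in Python 3, so raise a proper exception
            raise RuntimeError(f'Could not download file at URL {url} (HTTP status {resp.status_code})')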
            
        mode, data = 'wb', resp.content
        if 'text/plain' in resp.headers['Content-Type']:
            mode, data = 'wt', resp.text
        with open(path, mode) as f:
            f.write(data)

Download the data files, one per year, for the period 2016-2021, both inclusive. We store the downloaded data in the directory `../data` relative to the location of this notebook. If a file has already been downloaded, we don't download it again. The total amount of data to download is about 400 MB.

In [None]:
# Create destination directory
os.makedirs(os.path.join('..', 'data'), exist_ok=True)

# Download files
data_source = "https://files.data.gouv.fr/geo-dvf/latest/csv"

for year in range(2016, 2022):
    # Build the URL and the destination file path
    url = f'{data_source}/{year}/full.csv.gz'
    path = os.path.join('..', 'data', f'{year}-mutations-immobilieres.csv.gz')
    
    # If file already exists don't download it again
    if not os.path.isfile(path):
        print(f'downloading {url} to local file {path}')
        download(url, path)
    else:
        print(f'local file {path} already exists. Skipping download...')

Check what files we have for our analysis:

In [None]:
file_paths = glob.glob(os.path.join('..', 'data', '*-mutations-immobilieres.csv.gz'))
print('\n'.join(file_paths))

---------------------
## Load the dataset

Load the dataset (i.e. all the files `../data/*-mutations-immobilieres.csv.gz`) into a **pandas** dataframe. Here we select the columns we want to load. The information about the format and contents of each column is available [here](https://www.data.gouv.fr/fr/datasets/demandes-de-valeurs-foncieres-geolocalisees/). Please make sure you are familiar with that information, which you will need for analysing the data:

In [None]:
# These are the names of the columns present in the source files.
# We are not interested in analysing the commented-out columns, so we
# tell pandas not to load them
columns = (
    'id_mutation',
    'date_mutation',
    'numero_disposition',
    'nature_mutation',
    'valeur_fonciere',
    'adresse_numero',
    'adresse_suffixe',
    'adresse_nom_voie',
    'adresse_code_voie',
    'code_postal',
    'code_commune',
    'nom_commune',
    'code_departement',
#   'ancien_code_commune',
#   'ancien_nom_commune',
#   'id_parcelle',
#   'ancien_id_parcelle',
#   'numero_volume',
    'lot1_numero',
    'lot1_surface_carrez',
    'lot2_numero',
    'lot2_surface_carrez',
    'lot3_numero',
    'lot3_surface_carrez',
    'lot4_numero',
    'lot4_surface_carrez',
    'lot5_numero',
    'lot5_surface_carrez',
    'nombre_lots',
    'code_type_local',
    'type_local',
    'surface_reelle_bati',
    'nombre_pieces_principales',
#   'code_nature_culture',
    'nature_culture',
#   'code_nature_culture_speciale',
#   'nature_culture_speciale',
    'surface_terrain',
#   'longitude',
#   'latitude'
)

# These are the types we want pandas to use for each column
column_types = {
    'id_mutation': object,
    'adresse_numero': str,
    'adresse_suffixe': str,
    'adresse_nom_voie': str,
    'adresse_code_voie': str,
    'code_postal': str,
    'code_commune': str,
    'code_departement': str,
    'ancien_code_commune': str,
    'ancien_nom_commune': str,
    'id_parcelle': str,
    'ancien_id_parcelle': str,
    'lot1_numero': str,
    'lot2_numero': str,
    'lot3_numero': str,
    'lot4_numero': str,
    'lot5_numero': str,
    'code_type_local': str,
    'type_local': str,
}
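
As a small illustration of why the identifier-like columns are read as `str`, the sketch below parses a made-up two-row CSV twice. The values and the `csv_text` variable are assumptions invented for this example, not data from the real files.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the values are made up for illustration
csv_text = (
    "code_postal,code_departement,valeur_fonciere\n"
    "01000,01,150000\n"
    "75008,75,2000000\n"
)

# Default type inference parses the codes as integers, dropping leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the codes exactly as they are written in the file
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'code_postal': str, 'code_departement': str},
)

print(inferred['code_postal'].tolist())  # [1000, 75008]
print(as_text['code_postal'].tolist())   # ['01000', '75008']
```

With type inference, the postal code `01000` becomes the integer `1000`. With `dtype=str` it stays `'01000'`, so later comparisons such as `code_postal == '75008'` behave as expected.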

In [None]:
# Explicitly delete our existing dataframe, if any
try:
    del df
except NameError:
    pass

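file_paths = glob.glob(os.path.join('..', 'data', '*-mutations-immobilieres.csv.gz'))

# Read each yearly file into its own dataframe, then concatenate them once.
# DataFrame.append was removed in pandas 2.0 and copied the data on every call.
frames = []
for path in sorted(file_paths):
    print(f'Loading file {path}')
    frames.append(pd.read_csv(path, usecols=columns, dtype=column_types, parse_dates=['date_mutation']))
df = pd.concat(frames)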

In [None]:
# Inspect the dimensions of the dataframe
# Use distinct names so that the `columns` tuple defined above is not overwritten
n_rows, n_columns = df.shape
print(f'This dataframe has {n_rows:,} rows and {n_columns:,} columns')

In [None]:
df.head(10)

### WARNING:

Please note that there may be several rows for the same transaction. All the rows that are part of a single transaction have the same identifier (i.e. the same value) in the `id_mutation` column, as well as the same value in the column `valeur_fonciere`. For instance, there are two rows with the value `2018-2` in the `id_mutation` column:

In [None]:
df[df['id_mutation'] == '2018-2']
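
To see what this warning implies for counting and summing, here is a minimal sketch on a toy dataframe. The identifiers `T-1` and `T-2` and all the values are hypothetical, chosen only to mimic the structure described above.

```python
import pandas as pd

# Hypothetical rows: transaction 'T-1' spans two rows, 'T-2' a single row
toy = pd.DataFrame({
    'id_mutation': ['T-1', 'T-1', 'T-2'],
    'valeur_fonciere': [300_000.0, 300_000.0, 120_000.0],
    'type_local': ['Maison', 'Dépendance', 'Appartement'],
})

# Counting rows is not the same as counting transactions
row_count = len(toy)                              # 3 rows
transaction_count = toy['id_mutation'].nunique()  # 2 transactions

# Summing the raw column counts the value of 'T-1' twice
naive_total = toy['valeur_fonciere'].sum()

# Keeping one row per transaction before summing avoids the double count
deduplicated_total = toy.drop_duplicates('id_mutation')['valeur_fonciere'].sum()

print(row_count, transaction_count)
print(naive_total, deduplicated_total)
```

In this toy case the naive sum is 720,000 while the sum over unique transactions is 420,000, which is why the answers below work on unique `id_mutation` values.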

## Inspect the dataset

Let's see what **kind of transactions** are encoded in these records:

In [None]:
print('\n'.join(df['nature_mutation'].unique()))

And what **kind of properties** are these transactions about:

In [None]:
for t in df['type_local'].unique():
    print(t)

### Values for filters
Here we define some convenient constants that we can use for building masks:

In [None]:
APPARTMENT = 'Appartement'
HOUSE      = 'Maison'
BUSINESS   = 'Local industriel. commercial ou assimilé'
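
As a sketch of how such constants can be used, the example below builds two boolean masks on a toy dataframe and combines them with `&`. The toy rows are invented for illustration, and the names `toy`, `is_house` and `is_rhone` are assumptions of this example.

```python
import pandas as pd

HOUSE = 'Maison'

# Hypothetical rows, including one with a missing property type
toy = pd.DataFrame({
    'type_local': ['Maison', 'Appartement', 'Maison', None],
    'code_departement': ['69', '69', '75', '69'],
})

# Each comparison yields a boolean Series aligned with the dataframe rows
is_house = toy['type_local'] == HOUSE
is_rhone = toy['code_departement'] == '69'

# Combine the masks element-wise and use the result to select rows
houses_in_rhone = toy[is_house & is_rhone]
print(houses_in_rhone)
```

Only the first row satisfies both conditions. The row with a missing `type_local` compares as `False`, so it is left out.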

-------------------
# Questions (10 points + bonus)

---------------------
## Question N° 1

### Question 1a (1 point)
How many transactions of type sale (i.e. those with value `Vente` in the column `nature_mutation`) were registered in the period covered in the dataset?

In [None]:
is_sale = df['nature_mutation'] == 'Vente'
sales = df[is_sale]
sales_count = sales['id_mutation'].nunique()

In [None]:
print(f'There are {sales_count:,} sales in the dataset')

### Question 1b (1 point)
How many sales were registered for each kind of property (i.e. `Maison`, `Dépendance`, `Appartement` and `Local industriel`) in the whole period?

In [None]:
house_count    = sales[sales['type_local'] == HOUSE]['id_mutation'].nunique()
appt_count     = sales[sales['type_local'] == APPARTMENT]['id_mutation'].nunique()
business_count = sales[sales['type_local'] == BUSINESS]['id_mutation'].nunique()

In [None]:
# Determine the period covered in the dataset
start_date, end_date = df['date_mutation'].min(), df['date_mutation'].max()

# Compute the percentage of sales per kind of object
house_pct    = 100.0 * (house_count/sales_count)
appt_pct     = 100.0 * (appt_count/sales_count)
business_pct = 100.0 * (business_count/sales_count)

# Print the report
print(f'Period covered: from {start_date:%Y-%m-%d} to {end_date:%Y-%m-%d}:')
print(f'          total:  {sales_count:>10,} sales')
print(f'         houses:  {house_count:>10,} ({house_pct:>2.0f}%)')
print(f'     apartments:  {appt_count:>10,} ({appt_pct:>2.0f}%)')
print(f'       business:  {business_count:>10,} ({business_pct:>2.0f}%)')
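
An alternative way of obtaining the counts for every kind of property in one step is to group by `type_local`. The sketch below shows the idea on a toy dataframe with hypothetical identifiers and values. It is not the method used in the answer above.

```python
import pandas as pd

# Hypothetical sales: 'T-1' involves a house and an outbuilding
toy_sales = pd.DataFrame({
    'id_mutation': ['T-1', 'T-1', 'T-2', 'T-3', 'T-3'],
    'type_local': ['Maison', 'Dépendance', 'Appartement', 'Maison', 'Maison'],
})

# Number of distinct transactions for each kind of property
counts = toy_sales.groupby('type_local')['id_mutation'].nunique()
total = toy_sales['id_mutation'].nunique()

print(counts)
print(total)
```

Note that a transaction involving several kinds of property is counted once under each kind. Here the per-kind counts add up to 4 although there are only 3 transactions, so the percentages computed above need not add up to 100%.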

### Question 1c (2 points)
What is the total amount of money (in million €) involved in those sales? Please remember that there may be several rows for a single transaction and within a single transaction each row has the same value in the column `valeur_fonciere`. You may want to consider grouping all the rows for the same transaction.

In [None]:
# Group by 'id_mutation' and keep the first non-null value of each column per group
sales_by_id = sales.groupby('id_mutation').first()

# Sum the column 'valeur_fonciere' over all transactions (each group is now reduced to a single row)
sales_in_million_euros = sales_by_id['valeur_fonciere'].sum() / 1_000_000

In [None]:
print(f'The total amount of money in sales was {sales_in_million_euros:,.0f} million €')

-----------
## Question N° 2

### Question 2a (3 points)
Your client, a big international corporation, is looking to purchase a property for installing a retail store on the Av. des Champs Elysées, in Paris. They hire you to estimate the budget needed to purchase such a property, based on the data recorded in this dataset. You should only consider transactions involving business properties with a surface larger than 300 m².

In [None]:
# Build a view with the relevant data
sales             = df[is_sale]
is_business       = sales['type_local'] == BUSINESS
is_paris_8        = sales['code_postal'] == '75008'
# na=False makes rows with a missing street name evaluate to False instead of NaN
is_champs_elysees = sales['adresse_nom_voie'].str.contains('AV DES CHAMPS ELYSEES', case=False, na=False)
is_big_surface    = sales['surface_reelle_bati'] > 300

sales_champs_elysees = sales[is_business & is_paris_8 & is_champs_elysees & is_big_surface]

In [None]:
# Group by transaction id
sales_champs_elysees_by_id = sales_champs_elysees.groupby(['id_mutation'])

# For each transaction (i.e. each group), compute its cost. Since every row in a single
# group contains the same value in the column 'valeur_fonciere', we use the mean of that
# column for each group to get the value of the whole transaction
cost_per_transaction = sales_champs_elysees_by_id['valeur_fonciere'].mean()

# For each group, sum the surfaces of all the components of the transaction
surface_per_transaction = sales_champs_elysees_by_id['surface_reelle_bati'].sum()

# Compute the average of the cost per square meter for each transaction
mean_cost_per_sq_meter = np.mean(cost_per_transaction / surface_per_transaction)

In [None]:
print(f'The average observed cost per square meter, for business properties larger than 300 m², is {mean_cost_per_sq_meter:,.0f} €')
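
The figure above is the mean of the per-transaction prices per square meter. A different, equally defensible estimate divides the total cost by the total surface. The sketch below contrasts the two on toy data. All identifiers and values are hypothetical.

```python
import pandas as pd

# Hypothetical sales: 'T-1' is made of two rows, 'T-2' of a single row
toy = pd.DataFrame({
    'id_mutation': ['T-1', 'T-1', 'T-2'],
    'valeur_fonciere': [3_000_000.0, 3_000_000.0, 5_000_000.0],
    'surface_reelle_bati': [400.0, 600.0, 500.0],
})

by_id = toy.groupby('id_mutation')

# The value is repeated on every row of a transaction, so the mean recovers it
cost = by_id['valeur_fonciere'].mean()
# The surfaces of the components of a transaction add up
surface = by_id['surface_reelle_bati'].sum()

# Mean of the per-transaction ratios (the approach used in the answer)
mean_of_ratios = (cost / surface).mean()
# Ratio of the totals, which weights each transaction by its surface
pooled_ratio = cost.sum() / surface.sum()

print(f'{mean_of_ratios:,.0f} € per m² (mean of ratios)')
print(f'{pooled_ratio:,.0f} € per m² (ratio of totals)')
```

On this toy data the two estimates are 6,500 and about 5,333 € per m². The gap shows that the choice of averaging method matters when transactions differ a lot in size.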

### Question 2b (3 points)

Your customer also wants to know how much money was needed for the most expensive transaction and the address of the property. Can you provide them with that information?

In [None]:
# Retrieve the id and the cost for the biggest transaction
cost_per_transaction = sales_champs_elysees_by_id['valeur_fonciere'].mean()
id_mutation, max_cost = cost_per_transaction.idxmax(), cost_per_transaction.max()

# Retrieve the number and address of the property
most_expensive_sale = sales_champs_elysees[sales_champs_elysees['id_mutation'] == id_mutation]
address = f"{most_expensive_sale['adresse_numero'].values[0]}, {most_expensive_sale['adresse_nom_voie'].values[0]}"

In [None]:
print(f'The cost of the biggest sale transaction was {max_cost/1_000_000:,.0f} million € for a property located at {address}')
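
The lookup pattern used in this answer (`idxmax` on the per-transaction cost, then a selection of the matching rows) can be tried in isolation on toy data. The addresses and identifiers below are made up for this sketch.

```python
import pandas as pd

# Hypothetical sales with made-up addresses
toy = pd.DataFrame({
    'id_mutation': ['T-1', 'T-1', 'T-2'],
    'valeur_fonciere': [3_000_000.0, 3_000_000.0, 5_000_000.0],
    'adresse_numero': ['10', '10', '25'],
    'adresse_nom_voie': ['RUE EXEMPLE', 'RUE EXEMPLE', 'AV EXEMPLE'],
})

# Cost per transaction, indexed by transaction id
cost = toy.groupby('id_mutation')['valeur_fonciere'].mean()

# idxmax returns the index label (here the id) of the largest value
top_id = cost.idxmax()
top_cost = cost.loc[top_id]

# Take the first row of that transaction to read its address
first_row = toy[toy['id_mutation'] == top_id].iloc[0]
address = f"{first_row['adresse_numero']}, {first_row['adresse_nom_voie']}"

print(top_id, top_cost, address)
```

Because the series returned by `groupby(...).mean()` is indexed by `id_mutation`, `idxmax` gives the transaction identifier directly, which can then be used to filter the original rows.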

### Question 2c (bonus: 1 point)
Can you tell what store is now located at the address found in your answer for question 2b?