# CHALLENGE COLLECTING DATA - IMMOVLAN
## SECOND PART: CLEANING THE DATAFRAME

Because the main.py module is slow to scrape the property data, the cleaning is done separately in this notebook, on the results it has already produced.

### DATA COLLECTION

In [None]:
# Set the notebook to show all outputs in the cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import glob
import os

# Pandas options for data wrangling and output set-up 
import pandas as pd
pd.set_option('display.max_columns', None) # display all columns
pd.set_option('display.expand_frame_repr', False) # print all columns and in the same line
pd.set_option('display.max_colwidth', None) # display the full content of each cell
pd.set_option('display.float_format', lambda x: '%.3f' %x) # floats to be displayed with 3 decimal places

In [None]:
# Concatenate all scraped batches (CSV outputs) into a single dataframe
path = "output"
csv_files = sorted(glob.glob(os.path.join(path, "immovlan_properties*.csv")))  # sorted for a reproducible row order
df = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
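
A minimal sketch of the same batch concatenation on toy data, assuming two small hypothetical CSV files written to a temporary folder that stands in for "output". The guard is there because pd.concat raises an error when it receives no dataframes at all, which is what happens if the pattern matches no file.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the "output" folder, filled with two tiny batches
with tempfile.TemporaryDirectory() as path:
    pd.DataFrame({"Ref": ["A1", "A2"], "Prix": [250000, 310000]}).to_csv(
        os.path.join(path, "immovlan_properties_1.csv"), index=False)
    pd.DataFrame({"Ref": ["B1"], "Prix": [199000]}).to_csv(
        os.path.join(path, "immovlan_properties_2.csv"), index=False)

    # Same pattern as in the notebook, sorted so the row order is reproducible
    batch_files = sorted(glob.glob(os.path.join(path, "immovlan_properties*.csv")))
    if not batch_files:
        raise FileNotFoundError(f"No batch CSVs found in {path!r}")

    # ignore_index=True rebuilds a continuous 0..n-1 index across batches
    batches = pd.concat(map(pd.read_csv, batch_files), ignore_index=True)

print(batches.shape)
```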

### A LOOK AT THE RESULTS

In [None]:
df.shape

Price is the target variable, so records with a missing price are dropped.

In [None]:
df = df.dropna(subset = ['Prix'])

Duplicates are also dropped, based on the Ref column.

In [None]:
df = df.drop_duplicates(subset=["Ref"])

There are 121 columns.  
My scraper takes every tag in the HTML structure of each page.  
While working on the scraper, I noticed that some of these tags are hardly used on the website.  
--> remove all columns with more than 50% NAs

In [None]:
# percentage of NA per column
naPct = df.isnull().sum()/df.shape[0]*100

# list of columns over 50% NAs
col_to_drop = naPct[naPct>50].keys()
df = df.drop(col_to_drop, axis=1)

# dropping as well any records that may have all NAs
df = df.dropna(how='all')


In [None]:
df.shape

This looks much better...
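
The 50% rule can be sketched on a toy frame. This is only an illustration: the values and the Piscine column are made up, and only the threshold logic mirrors the cell above.

```python
import pandas as pd

# Toy data (hypothetical values; "Piscine" is an invented column name)
toy = pd.DataFrame({
    "Prix": [250000, 310000, 199000, 420000],
    "Jardin": ["Oui", None, "Non", "Oui"],   # 25% missing, kept
    "Piscine": [None, None, None, "Oui"],    # 75% missing, dropped
})

# Share of missing values per column, as a percentage
na_pct = toy.isnull().mean() * 100

# Columns strictly above the 50% threshold are removed
sparse_cols = na_pct[na_pct > 50].index
cleaned = toy.drop(columns=sparse_cols)

print(na_pct)
print(cleaned.columns.tolist())
```

Since the comparison is strict, a column with exactly 50% NAs is kept.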

### TRANSFORMING THE DATA

In [None]:
df.info()

In [None]:
# Differentiate numerical and categorical cols
numeric_cols = df.select_dtypes(include=np.number).columns
numeric_cols

categoric_cols = df.select_dtypes(exclude=np.number).columns
categoric_cols

In [None]:
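def MemOptimisation(df):
    """
    By default pandas assigns data types that consume a lot of memory.
    The category dtype handles categorical variables much better than object.
    The numerical variables seem to hold integers, so they are stored as nullable Int32.
    The dataframe is modified in place.
    """
    print(f"\nAmount of memory used by all attributes: {df.memory_usage(deep=True).sum()}\n")

    # Take the column lists from the dataframe itself rather than from notebook globals
    num_cols = df.select_dtypes(include=np.number).columns
    cat_cols = df.select_dtypes(exclude=np.number).columns

    # Optimise memory usage
    for i in cat_cols:
        df[i] = df[i].astype('category')
    for i in num_cols:
        # This cast assumes every numeric column holds whole numbers (missing values are allowed)
        df[i] = df[i].astype('Int32')

    df.info(memory_usage='deep')  # info() prints its report itself and returns None
    print("\nAmount of memory used now by all attributes: ", df.memory_usage(deep=True).sum())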

In [None]:
MemOptimisation(df)
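
The cast to Int32 relies on the numeric columns holding whole numbers. Below is a minimal sketch of a more defensive cast, assuming a hypothetical helper to_nullable_int and made-up values: it converts a column only when every non-missing value is a whole number and leaves it unchanged otherwise.

```python
import pandas as pd

def to_nullable_int(series: pd.Series) -> pd.Series:
    """Cast to nullable Int32 only if every non-missing value is a whole number."""
    values = series.dropna()
    if ((values % 1) == 0).all():
        return series.astype("Int32")
    return series  # fractional values present: keep the original dtype

# Hypothetical columns: a room count with a gap, and a surface with decimals
rooms = pd.Series([2.0, None, 4.0])
area = pd.Series([85.5, 120.0, None])

rooms_int = to_nullable_int(rooms)
area_kept = to_nullable_int(area)

print(rooms_int.dtype, area_kept.dtype)
```

The capitalised Int32 is the nullable pandas dtype, so the missing room count stays missing instead of forcing the column back to float.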

Let's have a look at the variables

In [None]:
df[numeric_cols].head()

All numeric variables seem to be correct and relevant to the goal of the project.

In [None]:
df[categoric_cols].head()

In [None]:
# Remove 'à vendre' from the values and rename the column to Type_du_bien
df['Titre'] = df['Titre'].str.replace('à vendre', '', regex=False).str.strip()
df = df.rename(columns={'Titre': 'Type_du_bien'})

In [None]:
# Extract the zip code and the locality from the ad link
df['Zip'] = df['ad_link'].str.extract(r'/a-vendre/(\d{4})/')[0]
df['Localite'] = df['ad_link'].str.extract(r'/a-vendre/\d{4}/([^/]+)/')[0].str.capitalize()
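
A small sketch of what these patterns capture, using made-up links. The URLs below are hypothetical and only assume the /a-vendre/<zip>/<locality>/ structure that the regular expressions above rely on. Both parts are extracted here in a single call with named groups.

```python
import pandas as pd

# Hypothetical ad links following the assumed /a-vendre/<zip>/<locality>/ layout
links = pd.Series([
    "https://example.com/fr/detail/appartement/a-vendre/1000/bruxelles/abc123",
    "https://example.com/fr/detail/maison/a-vendre/4000/liege/def456",
    "https://example.com/fr/contact",  # no match, gives missing values
])

# Named groups become the column names of the resulting dataframe
parts = links.str.extract(r"/a-vendre/(?P<Zip>\d{4})/(?P<Localite>[^/]+)/")
parts["Localite"] = parts["Localite"].str.capitalize()

print(parts)
```

Links that do not follow the layout end up with missing values in both columns, so they are easy to spot afterwards. Note that Zip stays a string, which keeps any leading zero intact.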

Confirm that the columns below are binary (Oui/Non), then transform them to 1/0.

In [None]:
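cols = ['Meublé', 'Caves', 'Ascenseur', 'Raccordé_à_leau_courante', 'Terrasse_aménagée', 'Grenier', 'Jardin']
for c in cols:
    # Expressions inside a loop are not displayed automatically, so print them
    print(df[c].value_counts(dropna=False), "\n")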

In [None]:
# Encode Oui as 1 and Non as 0; any other value becomes missing
for c in cols:
    df[c] = df[c].map({'Oui': 1, 'Non': 0})
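
Because the mapping silently turns any value other than Oui or Non into a missing value, it can help to check for unexpected values first. A minimal sketch on toy data, where the values are made up and the lowercase 'oui' stands for a hypothetical scraping inconsistency:

```python
import pandas as pd

# Toy data with one unexpected value in "Grenier"
toy = pd.DataFrame({
    "Jardin": ["Oui", "Non", None, "Oui"],
    "Grenier": ["Non", "Non", "Oui", "oui"],
})

expected = {"Oui", "Non"}

# Collect, per column, the non-missing values that are neither Oui nor Non
unexpected = {}
for col in toy.columns:
    extra = sorted(set(toy[col].dropna()) - expected)
    if extra:
        unexpected[col] = extra

# Encode only the columns that passed the check, as nullable 0/1 integers
for col in toy.columns:
    if col not in unexpected:
        toy[col] = toy[col].map({"Oui": 1, "Non": 0}).astype("Int8")

print(unexpected)
print(toy)
```

Columns listed in unexpected are left untouched so they can be inspected and cleaned before encoding.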

### FINAL DATAFRAME

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

The final dataframe shows:
- 14806 records
- 27 variables

In [None]:
# Save the dataframe as immovlan_properties_FINAL.csv
file_name = "immovlan_properties_FINAL.csv"
df.to_csv(file_name, index=False, encoding="utf-8-sig")  # utf-8-sig keeps accented characters readable in Excel