In [1]:
import numpy as np
import pandas as pd

### Project 2 - Housing in Brazil
Here, we'll learn:

- How to organize information using basic Python data structures.
- How to import data from CSV files and clean it using the pandas library.
- How to create data visualizations like scatter and box plots.
- How to examine the relationship between two variables using correlation.

**Problem / Question:**
- Which factor, location or size, has the greater influence on house prices in Brazil?

**Data Collection:**

In this project, we'll work with the `brasil-real-estate` datasets.
- Data Source: open data
- Data type: csv (#3)

**Wrangling / Preprocessing:**

**EDA:**

### Data Wrangling (Munging)

- The first part of any data science project is preparing your data, which means making sure it's in the right place and format for you to conduct your analysis.

**1.1   Import**
-  The first step of any data preparation is importing your raw data and cleaning it. 

In [2]:
df1 = pd.read_csv("C:/Users/Tsegi/Desktop/AAC_SCHOOL/DSProject/br/brasil-real-estate-1.csv")
df2 = pd.read_csv("C:/Users/Tsegi/Desktop/AAC_SCHOOL/DSProject/br/brasil-real-estate-2.csv")


Inspect each DataFrame:
- First, look at its `shape` attribute to see the number of rows and columns.
- Then use the `info` method to see the data types and number of missing values for each column.
- Finally, use the `head` method to look at the first five rows of your dataset.
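The three inspection steps above can be sketched on a small made-up DataFrame. The `toy` data and its values are hypothetical and only mimic a few columns of the real file; `isna().sum()` is added here as one more way to count missing values per column.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame(
    {
        "property_type": ["apartment", "house", "apartment"],
        "lat-lon": ["-9.64,-35.70", None, "-9.62,-35.72"],
        "area_m2": [110, 211, 99],
    }
)

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
missing = toy.isna().sum()   # missing values per column
first_rows = toy.head(2)     # first 2 rows (head() defaults to 5)

print(n_rows, n_cols)
print(missing)
toy.info()                   # dtypes and non-null counts per column
```

A column whose non-null count in `info()` is lower than the row count from `shape` contains missing values.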

In [3]:
df1.shape

(12834, 7)

**1.2 Data Cleaning**

**Clean df1:**

In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12834 entries, 0 to 12833
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               12834 non-null  int64 
 1   property_type            12834 non-null  object
 2   place_with_parent_names  12834 non-null  object
 3   region                   12834 non-null  object
 4   lat-lon                  11551 non-null  object
 5   area_m2                  12834 non-null  int64 
 6   price_usd                12834 non-null  object
dtypes: int64(2), object(5)
memory usage: 702.0+ KB


In [5]:
df1.head()

,Unnamed: 0,property_type,place_with_parent_names,region,lat-lon,area_m2,price_usd
0,1,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.6443051,-35.7088142",110,"$187,230.85"
1,2,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.6430934,-35.70484",65,"$81,133.37"
2,3,house,|Brasil|Alagoas|Maceió|,Northeast,"-9.6227033,-35.7297953",211,"$154,465.45"
3,4,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.622837,-35.719556",99,"$146,013.20"
4,5,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.654955,-35.700227",55,"$101,416.71"


Here, we can observe that:
- There are missing values (NaN) in the `lat-lon` column: only 11,551 of the 12,834 rows are non-null.
- The column `Unnamed: 0` should be dropped.
- The data type of the `price_usd` column is `object` when it should be `float`, because the values contain `'$'` and `','` characters.

In [6]:
# Remove rows with NaN values in place (modifies df1 instead of creating a new DataFrame)
df1.dropna(inplace=True)
# Drop the leftover index column
df1.drop(columns=["Unnamed: 0"], inplace=True)

# Transform price from object to float by stripping "$" and "," first
df1["price_usd"] = (
    df1["price_usd"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

In [8]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11551 entries, 0 to 12833
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   property_type            11551 non-null  object 
 1   place_with_parent_names  11551 non-null  object 
 2   region                   11551 non-null  object 
 3   lat-lon                  11551 non-null  object 
 4   area_m2                  11551 non-null  int64  
 5   price_usd                11551 non-null  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 631.7+ KB


In [12]:
df1.head()

,property_type,place_with_parent_names,region,lat-lon,area_m2,price_usd
0,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.6443051,-35.7088142",110,187230.85
1,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.6430934,-35.70484",65,81133.37
2,house,|Brasil|Alagoas|Maceió|,Northeast,"-9.6227033,-35.7297953",211,154465.45
3,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.622837,-35.719556",99,146013.2
4,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.654955,-35.700227",55,101416.71


In [15]:
df1[["lat","lon"]]=df1["lat-lon"].str.split(",", expand=True)

df1.head()

,property_type,place_with_parent_names,region,lat-lon,area_m2,price_usd,lat,lon
0,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.6443051,-35.7088142",110,187230.85,-9.6443051,-35.7088142
1,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.6430934,-35.70484",65,81133.37,-9.6430934,-35.70484
2,house,|Brasil|Alagoas|Maceió|,Northeast,"-9.6227033,-35.7297953",211,154465.45,-9.6227033,-35.7297953
3,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.622837,-35.719556",99,146013.2,-9.622837,-35.719556
4,apartment,|Brasil|Alagoas|Maceió|,Northeast,"-9.654955,-35.700227",55,101416.71,-9.654955,-35.700227


In [16]:
df1["state"]= df1["place_with_parent_names"].str.split("|", expand=True)[2]
df1.drop(columns=["place_with_parent_names","lat-lon"],inplace=True)

df1.head()

,property_type,region,area_m2,price_usd,lat,lon,state
0,apartment,Northeast,110,187230.85,-9.6443051,-35.7088142,Alagoas
1,apartment,Northeast,65,81133.37,-9.6430934,-35.70484,Alagoas
2,house,Northeast,211,154465.45,-9.6227033,-35.7297953,Alagoas
3,apartment,Northeast,99,146013.2,-9.622837,-35.719556,Alagoas
4,apartment,Northeast,55,101416.71,-9.654955,-35.700227,Alagoas
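Why index `[2]` picks out the state can be seen on a single string. The sketch below uses one place string from the output above plus a second, hypothetical one. Because each string starts and ends with `|`, splitting produces an empty string at position 0, the country at position 1, and the state at position 2.

```python
import pandas as pd

# First value is from the data shown above; the second is a hypothetical example
places = pd.Series(["|Brasil|Alagoas|Maceió|", "|Brasil|Pernambuco|Recife|"])

# Plain Python: the leading and trailing "|" produce empty strings
parts = "|Brasil|Alagoas|Maceió|".split("|")
print(parts)

# pandas: expand=True returns one column per piece, so column 2 is the state
states = places.str.split("|", expand=True)[2]
print(states.tolist())
```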


In [17]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11551 entries, 0 to 12833
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   property_type  11551 non-null  object 
 1   region         11551 non-null  object 
 2   area_m2        11551 non-null  int64  
 3   price_usd      11551 non-null  float64
 4   lat            11551 non-null  object 
 5   lon            11551 non-null  object 
 6   state          11551 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 721.9+ KB


In [29]:
df1["lat"]=df1["lat"].astype(float)
df1["lon"]=df1["lon"].astype(float)

In [19]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11551 entries, 0 to 12833
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   property_type  11551 non-null  object 
 1   region         11551 non-null  object 
 2   area_m2        11551 non-null  int64  
 3   price_usd      11551 non-null  float64
 4   lat            11551 non-null  float64
 5   lon            11551 non-null  float64
 6   state          11551 non-null  object 
dtypes: float64(3), int64(1), object(3)
memory usage: 721.9+ KB


NB:
- `str.replace(old, new, count)` is Python's built-in string method: it replaces occurrences of the substring `old` with `new`. The optional `count` argument limits the number of replacements (by default, all occurrences are replaced).
- `regex=False` is an optional argument of the pandas `Series.str.replace()` method used on `price_usd` above (it is not an argument of Python's built-in `str.replace()`). It controls whether the pattern is treated as a literal string (`False`) or as a regular expression (`True`). Here, `regex=False` makes `"$"` match a literal dollar sign rather than the regular-expression end-of-string anchor.



In [None]:
original_string = "This string has $10 and $20."
modified_string = original_string.replace("$", "")
print(modified_string)  # Output: This string has 10 and 20.
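To make the literal-versus-regex distinction concrete, here is a minimal sketch on two price strings taken from the `df1` output. It compares the chained literal replacements with a single regular-expression replacement; the character class `[$,]` is an alternative chosen for illustration, not something the notebook uses. Python's `re` module is used to show what a bare `"$"` means as a regular expression.

```python
import re

import pandas as pd

prices = pd.Series(["$187,230.85", "$81,133.37"])

# Literal replacement, chained twice (the approach used for df1)
literal = (
    prices.str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Regex replacement: the character class [$,] matches either symbol
pattern = prices.str.replace(r"[$,]", "", regex=True).astype(float)

# As a regular expression, a bare "$" is the end-of-string anchor,
# so re.sub leaves the dollar sign in place
anchor = re.sub("$", "", "$187,230.85")

print(literal.tolist())
print(pattern.tolist())
print(anchor)
```

Both pandas variants give the same floats; the `re.sub` call shows why passing `"$"` as a regular expression would not strip the currency symbol.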

**Clean df2:**

In [20]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12833 entries, 0 to 12832
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     12833 non-null  int64  
 1   property_type  12833 non-null  object 
 2   state          12833 non-null  object 
 3   region         12833 non-null  object 
 4   lat            12833 non-null  float64
 5   lon            12833 non-null  float64
 6   area_m2        11293 non-null  float64
 7   price_brl      12833 non-null  float64
dtypes: float64(4), int64(1), object(3)
memory usage: 802.2+ KB


Here, we can notice:
- There are many null values (NaN) in the `area_m2` column: only 11,293 of the 12,833 rows are non-null.
- The column `Unnamed: 0` should be dropped.
- The `price_brl` column is `float64`, but the prices are in Brazilian reais; they should be converted to US dollars in a new `price_usd` column.
    - If we want to compare all the home prices in this dataset, they all need to be in the same currency.
- After the conversion, drop the `price_brl` column.
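The currency step can be sketched with the exchange rate held in a named constant, so the rate is stated once and is easy to change. `BRL_PER_USD` is a hypothetical name; its value, 5.02, is the divisor used in this notebook's conversion cell, and the toy prices are made up.

```python
import pandas as pd

# Assumed exchange rate in BRL per USD (the divisor used in this notebook)
BRL_PER_USD = 5.02

# Hypothetical prices in Brazilian reais
toy = pd.DataFrame({"price_brl": [502.0, 1004.0, 251.0]})

# Convert to US dollars, round to cents, then drop the original column
toy["price_usd"] = (toy["price_brl"] / BRL_PER_USD).round(2)
toy = toy.drop(columns=["price_brl"])

print(toy)
```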

In [21]:
df2.dropna(inplace=True)
df2.drop(columns=["Unnamed: 0"],inplace=True)

# Create price_usd column by dividing price_brl by the exchange rate (5.02 BRL per USD)
df2["price_usd"]=(df2["price_brl"]/5.02).round(2)
df2.drop(columns=["price_brl"],inplace=True)
df2.head()

,property_type,state,region,lat,lon,area_m2,price_usd
0,apartment,Pernambuco,Northeast,-8.134204,-34.906326,72.0,82514.54
1,apartment,Pernambuco,Northeast,-8.126664,-34.903924,136.0,169005.68
2,apartment,Pernambuco,Northeast,-8.12555,-34.907601,75.0,59649.06
3,apartment,Pernambuco,Northeast,-8.120249,-34.89592,187.0,169005.68
4,apartment,Pernambuco,Northeast,-8.142666,-34.906906,80.0,92456.05


In [22]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11293 entries, 0 to 12832
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   property_type  11293 non-null  object 
 1   state          11293 non-null  object 
 2   region         11293 non-null  object 
 3   lat            11293 non-null  float64
 4   lon            11293 non-null  float64
 5   area_m2        11293 non-null  float64
 6   price_usd      11293 non-null  float64
dtypes: float64(4), object(3)
memory usage: 705.8+ KB


In [27]:
df2['area_m2'] = df2['area_m2'].astype('int64')

In [28]:
print(df2.dtypes)

property_type     object
state             object
region            object
lat              float64
lon              float64
area_m2            int64
price_usd        float64
dtype: object


**Concatenate DataFrame**

- You have two clean DataFrames, and now it's time to combine them into a single DataFrame so that you can conduct your analysis.
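A small sketch of how `pd.concat` stacks DataFrames, using two hypothetical frames `a` and `b`. Columns are matched by name, not position, and by default each frame keeps its own row labels, so labels can repeat; `ignore_index=True` is an optional extra, not used in this notebook, that renumbers the rows.

```python
import pandas as pd

# Hypothetical frames with the same columns in a different order
a = pd.DataFrame({"state": ["Alagoas", "Alagoas"], "area_m2": [110, 65]})
b = pd.DataFrame({"area_m2": [72, 136], "state": ["Pernambuco", "Pernambuco"]})

stacked = pd.concat([a, b])                        # keeps labels 0, 1, 0, 1
reindexed = pd.concat([a, b], ignore_index=True)   # fresh labels 0..3

print(stacked.index.tolist())
print(reindexed.index.tolist())
print(reindexed)
```

Checking that the combined row count equals the sum of the inputs is a quick sanity test after concatenating.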

In [30]:
df = pd.concat([df1,df2])
print(df.shape)
df.head()

(22844, 7)


,property_type,region,area_m2,price_usd,lat,lon,state
0,apartment,Northeast,110,187230.85,-9.644305,-35.708814,Alagoas
1,apartment,Northeast,65,81133.37,-9.643093,-35.70484,Alagoas
2,house,Northeast,211,154465.45,-9.622703,-35.729795,Alagoas
3,apartment,Northeast,99,146013.2,-9.622837,-35.719556,Alagoas
4,apartment,Northeast,55,101416.71,-9.654955,-35.700227,Alagoas


**Save/write df**

- The data is clean and in a single DataFrame, and now you need to save it as a CSV file so that you can examine it in your exploratory data analysis.

In [32]:
df.to_csv("C:/Users/Tsegi/Desktop/AAC_SCHOOL/DSProject/br/BrazilData_clean.csv", index=False)
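As a quick check of what `index=False` does, the sketch below writes a hypothetical two-row DataFrame to a temporary file and reads it back. Without `index=False`, the row labels would be written as an extra unnamed column, which is how a column like `Unnamed: 0` ends up in a CSV file. The file name `toy_clean.csv` is made up for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
toy = pd.DataFrame(
    {"state": ["Alagoas", "Pernambuco"], "price_usd": [187230.85, 82514.54]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_clean.csv")  # hypothetical file name
    toy.to_csv(path, index=False)                  # do not write row labels
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```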