# Prepare Data 

Hi! 👋 This project analyzes a dataset of home sales in Brazil. Our goal is to determine whether there are regional differences in the real estate market. We will also look at southern Brazil to see whether home size and price are related, similar to what we saw with housing in some states in Mexico. 🏡💰


### Import Python libraries

In [None]:
# Import Matplotlib, pandas, and plotly
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px

### Import the first CSV file into a DataFrame and drop missing values

In [None]:
df1 = pd.read_csv("data/brasil-real-estate-1.csv")
df1.dropna(inplace=True)
df1.head()

![Screenshot from 2024-04-29 12-43-08.png](attachment:46e1a2df-4862-4265-af39-5012dc45e4e3.png)

### Create two separate columns, "lat" and "lon", from the "lat-lon" column

In [None]:
df1[["lat", "lon"]] = df1["lat-lon"].str.split(",", expand=True).astype(float)
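To see what this split does, here is a minimal sketch on a made-up two-row sample. The coordinates and the name `sample_coords` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical sample that mimics the "lat-lon" column format
sample_coords = pd.DataFrame({"lat-lon": ["-9.6443051,-35.7088142", "-30.0346,-51.2177"]})

# Split each string on the comma into two columns, then cast both to float
sample_coords[["lat", "lon"]] = (
    sample_coords["lat-lon"].str.split(",", expand=True).astype(float)
)

print(sample_coords.dtypes)
```

With `expand=True` the split returns a DataFrame instead of a Series of lists, which is what allows the two-column assignment.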

### Create a "state" column from the "place_with_parent_names" column (the state name always follows the "|Brasil|" string)

In [None]:
df1["state"] = df1["place_with_parent_names"].str.split("|", expand=True)[2]
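Why index 2? A small sketch on invented strings may help. The place names below are assumptions that only follow the "|Brasil|State|..." pattern described above.

```python
import pandas as pd

# Hypothetical values shaped like the "place_with_parent_names" column
sample_places = pd.Series(
    ["|Brasil|Alagoas|Maceió|", "|Brasil|Rio Grande do Sul|Porto Alegre|"]
)

# Each string starts with "|", so piece 0 is an empty string,
# piece 1 is "Brasil", and piece 2 is the state name
parts = sample_places.str.split("|", expand=True)
sample_states = parts[2]

print(sample_states.tolist())
```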

### Transform the "price_usd" column of df1 so that all values are floating-point numbers instead of strings.

In [None]:
df1["price_usd"] = df1["price_usd"].astype(str)
df1["price_usd"] = df1["price_usd"].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
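Here is the same cleaning step as a sketch on two invented price strings (the values and the name `sample_prices` are assumptions). Note that `regex=False` matters here: "$" is a special character in regular expressions, so we ask pandas to treat it as plain text.

```python
import pandas as pd

# Hypothetical price strings with a dollar sign and thousands separators
sample_prices = pd.Series(["$187,230.85", "$81,133.37"])

# Strip "$" and "," as literal characters, then convert to float
cleaned_prices = (
    sample_prices.str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(cleaned_prices)
```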

### Drop the "lat-lon" and "place_with_parent_names" columns from df1

In [None]:
df1.drop(columns=["lat-lon", "place_with_parent_names"], inplace=True)

In [None]:
# Inspect df1 after cleaning
df1

![Screenshot from 2024-04-29 13-09-28.png](attachment:d6044249-c8dd-41d2-bf20-1b5adff1dbfc.png)

### Import the second CSV file into a second DataFrame

In [None]:
df2 = pd.read_csv("data/brasil-real-estate-2.csv")
df2

![Screenshot from 2024-04-29 13-16-06.png](attachment:0cb838c5-9710-4b4d-977a-b50450e0c0d9.png)

In [None]:
df2.head()
df2.shape

(12833, 7)

### Create a new column named "price_usd" from "price_brl" (at 3.19 BRL per USD) and drop the "price_brl" column

In [None]:
df2["price_usd"] = df2["price_brl"] / 3.19
df2.drop(columns=["price_brl"], inplace=True)
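A possible variation, sketched on invented prices: giving the exchange rate a name makes the conversion easier to read and to update. The constant `BRL_PER_USD` and the sample values are assumptions. Only the 3.19 rate comes from the cell above.

```python
import pandas as pd

# Exchange rate used in this notebook (Brazilian reais per US dollar)
BRL_PER_USD = 3.19

# Hypothetical prices in reais
sample_brl = pd.DataFrame({"price_brl": [319000.0, 638000.0]})

# Convert to dollars, round to cents, and drop the original column
sample_brl["price_usd"] = (sample_brl["price_brl"] / BRL_PER_USD).round(2)
sample_brl = sample_brl.drop(columns=["price_brl"])

print(sample_brl)
```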

### Drop missing values from df2

In [None]:
df2.dropna(inplace=True)

### Concatenate df1 and df2

In [None]:
df = pd.concat([df1, df2])
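One thing to be aware of: `pd.concat` keeps each frame's original row labels, so the combined index can contain repeats. The sketch below uses two tiny invented frames to show the difference that `ignore_index=True` makes. Whether you need it depends on whether later steps rely on unique row labels.

```python
import pandas as pd

# Two hypothetical frames, each with its own default index starting at 0
part_a = pd.DataFrame({"price_usd": [100.0, 200.0]})
part_b = pd.DataFrame({"price_usd": [300.0]})

# Default behavior: the original labels are kept, so 0 appears twice
stacked = pd.concat([part_a, part_b])

# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist(), renumbered.index.tolist())
```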

In [None]:
print("df shape:", df.shape)
df

![Screenshot from 2024-04-29 13-35-13.png](attachment:016fc5b6-103c-4692-98a1-fcc5e0c7f301.png)

# Visualization 📊

### Property locations on a Mapbox scatter plot

In [None]:
fig = px.scatter_mapbox(
    data_frame=df,
    lat="lat",
    lon="lon",
    center={"lat": -14.2, "lon": -51.9},  # Map will be centered on Brazil
    width=600,
    height=600,
    hover_data=["price_usd"],  # Display price when hovering mouse over house
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()

![Map_box_Location.png](attachment:82b1e4c0-4bdf-4562-b85a-f4a3defcf374.png)

### Summary statistics of df 

In [None]:
summary_stats = df[["area_m2", "price_usd"]].describe()
summary_stats

![Screenshot from 2024-04-29 14-38-11.png](attachment:ef286a42-ab79-447e-baab-95f80006b9a8.png)

### Distribution of Home Prices using a histogram

In [None]:
# Build histogram
plt.hist(df['price_usd'])

# Label x-axis
plt.xlabel("Price [USD]")

# Label y-axis
plt.ylabel("Frequency")

# Add title
plt.title("Distribution of Home Prices")

# Don't change the code below 👇
plt.savefig("images/1-5-12.png", dpi=150)

![Screenshot from 2024-04-29 14-39-39.png](attachment:aaa8d672-ca09-453d-829f-3b4ef88377ae.png)

### Distribution of Home Sizes using a box plot

In [None]:
# Build box plot
plt.boxplot(df['area_m2'])

# Label y-axis (the box plot is vertical by default, so area runs along y)
plt.ylabel("Area [sq meters]")

# Add title
plt.title("Distribution of Home Sizes")

![Screenshot from 2024-04-29 14-44-31.png](attachment:9e686820-f6b5-43df-a7a4-fb6d945bcddb.png)

### Create a Series named "mean_price_by_region" that shows the mean home price in each region of Brazil

In [None]:
mean_price_by_region = df.groupby('region')['price_usd'].mean().sort_values()
mean_price_by_region

![Screenshot from 2024-04-29 14-59-02.png](attachment:22897e7c-3566-413f-b7ed-44b6225895a3.png)
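If the group-then-average step is new to you, here is a minimal sketch on three invented rows. The region names and prices are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical homes: two in one region, one in another
sample_homes = pd.DataFrame(
    {
        "region": ["South", "South", "Northeast"],
        "price_usd": [100.0, 300.0, 150.0],
    }
)

# Group rows by region, average the prices in each group, sort ascending
sample_means = sample_homes.groupby("region")["price_usd"].mean().sort_values()

print(sample_means)
```

Sorting before plotting is what makes the bars in a bar chart of this Series run from the cheapest region to the most expensive.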

### Mean Home Price by Region using a bar chart

In [None]:
# Build bar chart, label axes, add title
mean_price_by_region.plot(
    kind="bar",
    xlabel="Region",
    ylabel="Mean Price [USD]",
    title="Mean Home Price by Region"
)

![Screenshot from 2024-04-29 15-05-07.png](attachment:b3d2c100-bdb5-4f64-91c1-70cb8922d9fa.png)

# Let's take a closer look at the southern region of Brazil 🔍

### Create a DataFrame of homes in the South region from df

In [None]:
df_south = df[df["region"] == "South"]
df_south

![Screenshot from 2024-04-29 15-21-42.png](attachment:f933a32a-e9c1-41d0-b4f8-49f0d5d9f4b1.png)

In [None]:
homes_by_state = df_south["state"].value_counts()
homes_by_state

![Screenshot from 2024-04-29 15-23-24.png](attachment:f4ef4e49-28e7-4992-8b3e-3fd1287c1d2a.png)

### Price vs Area using Scatter plot

In [None]:
# Subset data
df_south_rgs = df_south[df_south['state'] == "Rio Grande do Sul"]

# Build scatter plot
plt.scatter(x=df_south_rgs['area_m2'], y=df_south_rgs['price_usd'])

# Label x-axis
plt.xlabel("Area [sq meters]")

# Label y-axis
plt.ylabel("Price [USD]")

# Add title
plt.title("Rio Grande do Sul: Price vs. Area")

![Screenshot from 2024-04-29 15-26-46.png](attachment:9c97e948-cfc5-4cd7-883a-692f828d6445.png)

### Create a dictionary of southern states where the keys are state names and the values are the correlation coefficients between "area_m2" and "price_usd"

In [None]:
south_states = homes_by_state.index.tolist()  # Convert to list for iteration

# Create an empty dictionary to store correlation coefficients
south_states_corr = {}

# Iterate over the state names taken from the index of homes_by_state
for state in south_states:
    # Filter data for the current state
    state_data = df_south[df_south['state'] == state]

    # Calculate correlation coefficient between 'area_m2' and 'price_usd'
    corr = state_data['area_m2'].corr(state_data['price_usd'])

    # Add the correlation coefficient to the dictionary with the state name as the key
    south_states_corr[state] = corr

south_states_corr

![Screenshot from 2024-04-29 15-31-30.png](attachment:02df6314-6e23-4aef-a174-6ca335a9541f.png)
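As an alternative sketch, the same dictionary can be built by iterating over `groupby` groups, which avoids filtering the DataFrame once per state. The data below is invented so that one state has a perfect positive correlation and the other a perfect negative one. The variable names are assumptions too.

```python
import pandas as pd

# Hypothetical homes in two states with hand-picked areas and prices
sample_south = pd.DataFrame(
    {
        "state": ["Rio Grande do Sul"] * 3 + ["Santa Catarina"] * 3,
        "area_m2": [50.0, 100.0, 150.0, 50.0, 100.0, 150.0],
        "price_usd": [100.0, 200.0, 300.0, 300.0, 200.0, 100.0],
    }
)

# groupby yields (state name, sub-DataFrame) pairs; compute the Pearson
# correlation between area and price inside each group
sample_corr = {
    state: group["area_m2"].corr(group["price_usd"])
    for state, group in sample_south.groupby("state")
}

print(sample_corr)
```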

That's the end of the project! I hope you found it interesting. If you have suggestions for improvement or would like to hire me for a project, please contact me at sghyeryounes@gmail.com

© 2024 Younes Sghyer. All rights reserved.