<font size="+1"><strong>1.5. Housing in Brazil 🇧🇷</strong></font>

In this assignment, you'll work with a dataset of homes for sale in Brazil. Your goal is to determine if there are regional differences in the real estate market. Also, you will look at southern Brazil to see if there is a relationship between home size and price, similar to what you saw with housing in some states in Mexico. 

**Before you start:** Import the libraries you'll use in this notebook: Matplotlib, pandas, and plotly. Be sure to import them under the aliases we've used in this project.

In [2]:
# Import Matplotlib, pandas, and plotly

### Prepare Data

In this assignment, you'll work with real estate data from Brazil. In the `data` directory for this project, there are two CSV files that you need to import and clean, one at a time.

### Import

First, you are going to import and clean the data in `data/brasil-real-estate-1.csv`. 

**Task 1.5.1:** Import the CSV file `data/brasil-real-estate-1.csv` into the DataFrame `df1`.

In [4]:
df1 = ...
df1.head()

Before you move to the next task, take a moment to inspect `df1` using the `info` and `head` methods. What issues do you see in the data? What cleaning will you need to do before you can conduct your analysis?

**Task 1.5.2:** Drop all rows with `NaN` values from the DataFrame `df1`.
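
As a minimal sketch of dropping incomplete rows, here is `dropna` applied to a small toy DataFrame. The name `toy` and its values are hypothetical and do not come from the CSV files.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the assignment's CSV files)
toy = pd.DataFrame(
    {
        "area_m2": [110.0, np.nan, 65.0],
        "price_usd": [187230.85, 81133.37, np.nan],
    }
)

# dropna() returns a new DataFrame without any row that contains a NaN
toy_clean = toy.dropna()
print(toy_clean.shape)  # (1, 2)
```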

**Task 1.5.3:** Use the `"lat-lon"` column to create two separate columns in `df1`: `"lat"` and `"lon"`. Make sure that the data type for these new columns is `float`.
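
One way to split a combined coordinate column is sketched below. It assumes the values are strings of the form `"latitude,longitude"`; check the real column before relying on that format. The DataFrame `toy` is hypothetical.

```python
import pandas as pd

# Toy data; the "lat,lon" string format is an assumption
toy = pd.DataFrame({"lat-lon": ["-9.6443051,-35.7088142", "-9.6430934,-35.70484"]})

# expand=True returns one column per piece; astype(float) converts the strings
toy[["lat", "lon"]] = toy["lat-lon"].str.split(",", expand=True).astype(float)
print(toy.dtypes)
```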

**Task 1.5.4:** Use the `"place_with_parent_names"` column to create a `"state"` column for `df1`. (Note that the state name always appears after `"|Brasil|"` in each string.)
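
The sketch below shows why the position of the state matters when splitting on `"|"`. The example strings are an assumption based on the note above (a leading `"|"`, then `"Brasil"`, then the state), and `toy` is a hypothetical DataFrame.

```python
import pandas as pd

# Toy data; the exact string layout is an assumption based on the task's note
toy = pd.DataFrame(
    {
        "place_with_parent_names": [
            "|Brasil|Alagoas|Maceió|",
            "|Brasil|Rio Grande do Sul|Porto Alegre|",
        ]
    }
)

# The leading "|" makes piece 0 an empty string, so "Brasil" is piece 1
# and the state is piece 2
toy["state"] = toy["place_with_parent_names"].str.split("|", expand=True)[2]
print(toy["state"].tolist())  # ['Alagoas', 'Rio Grande do Sul']
```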

**Task 1.5.5:** Transform the `"price_usd"` column of `df1` so that all values are floating-point numbers instead of strings. 
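
Here is a hedged sketch of converting price strings to floats. It assumes the strings carry a `"$"` prefix and `","` thousands separators; the real column may use a different format, so inspect it first. The DataFrame `toy` is hypothetical.

```python
import pandas as pd

# Toy data; the "$" prefix and "," thousands separator are assumptions
toy = pd.DataFrame({"price_usd": ["$187,230.85", "$81,133.37"]})

toy["price_usd"] = (
    toy["price_usd"]
    .str.replace("$", "", regex=False)  # regex=False treats "$" literally
    .str.replace(",", "", regex=False)  # remove thousands separators
    .astype(float)
)
print(toy["price_usd"].tolist())  # [187230.85, 81133.37]
```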

**Task 1.5.6:** Drop the `"lat-lon"` and `"place_with_parent_names"` columns from `df1`.
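
Dropping columns by name can be sketched as follows, using a hypothetical one-row DataFrame `toy`.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"lat-lon": ["-9.64,-35.71"], "lat": [-9.64], "lon": [-35.71]})

# drop(columns=[...]) returns a new DataFrame without the listed columns
toy = toy.drop(columns=["lat-lon"])
print(toy.columns.tolist())  # ['lat', 'lon']
```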

Now that you have cleaned `data/brasil-real-estate-1.csv` and created `df1`, you are going to import and clean the data from the second file, `brasil-real-estate-2.csv`. 

**Task 1.5.7:** Import the CSV file `brasil-real-estate-2.csv` into the DataFrame `df2`.

In [None]:
df2 = ...

Before you jump to the next task, take a look at `df2` using the `info` and `head` methods. What issues do you see in the data? How is it similar or different from `df1`?

**Task 1.5.8:** Use the `"price_brl"` column to create a new column named `"price_usd"`. (Keep in mind that, when this data was collected in 2015 and 2016, a US dollar cost 3.19 Brazilian reals.)
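
Because one US dollar cost 3.19 reals, a price in reals is divided by 3.19 to get the price in dollars. For example, 319,000 reals is roughly 100,000 dollars. A minimal sketch follows; `toy` and the constant name `BRL_PER_USD` are hypothetical.

```python
import pandas as pd

BRL_PER_USD = 3.19  # exchange rate stated in the task

# Toy data (hypothetical prices in Brazilian reals)
toy = pd.DataFrame({"price_brl": [319000.0, 638000.0]})

# Divide reals by reals-per-dollar to get dollars; round to cents
toy["price_usd"] = (toy["price_brl"] / BRL_PER_USD).round(2)
print(toy)
```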

**Task 1.5.9:** Drop the `"price_brl"` column from `df2`, as well as any rows that have `NaN` values. 

OK! Now that you've cleaned the data from both CSV files and created `df1` and `df2`, it's time to combine them into a single DataFrame.

**Task 1.5.10:** Concatenate `df1` and `df2` to create a new DataFrame named `df`. 

In [None]:
df = ...
print("df shape:", df.shape)

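
As an illustration of row-wise concatenation, the sketch below stacks two hypothetical DataFrames, `a` and `b`, that share the same columns.

```python
import pandas as pd

# Toy data (hypothetical values)
a = pd.DataFrame({"area_m2": [110.0, 65.0], "price_usd": [187230.85, 81133.37]})
b = pd.DataFrame({"area_m2": [90.0], "price_usd": [120000.0]})

# pd.concat takes a list of DataFrames and stacks them row-wise by default
combined = pd.concat([a, b])
print(combined.shape)  # (3, 2)
print(combined.index.tolist())  # [0, 1, 0]
```

Note that the original row labels are kept, so the index can contain duplicates. Passing `ignore_index=True` renumbers the rows if that matters for later steps.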

<div class="alert alert-block alert-info">
    <p><b>Frequent Question:</b> I can't pass this question, and I don't know what I've done wrong. 😠 What's happening?</p>
    <p><b>Tip:</b> In this assignment, you're working with data that's similar, but not identical, to the data used in the lessons. That means you might need to adjust the code you used in the training to work here. Take a second look at <code>df1</code> after you complete 1.5.6, and make sure you've correctly created the state names.</p>
</div>

### Explore

It's time to start exploring your data. In this section, you'll use your new data visualization skills to learn more about the regional differences in the Brazilian real estate market.

Complete the code below to create a `scatter_mapbox` showing the location of the properties in `df`.

In [None]:
fig = px.scatter_mapbox(
    ...,
    lat=...,
    lon=...,
    center={"lat": -14.2, "lon": -51.9},  # Map will be centered on Brazil
    width=600,
    height=600,
    hover_data=["price_usd"],  # Display price when hovering mouse over house
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()

**Task 1.5.11:** Use the `describe` method to create a DataFrame `summary_stats` with the summary statistics for the `"area_m2"` and `"price_usd"` columns.

In [None]:
summary_stats = ...
summary_stats
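
To illustrate `describe` on a subset of columns, here is a sketch with a hypothetical DataFrame `toy`. Selecting with a list of column names keeps the result a DataFrame.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "area_m2": [50.0, 100.0, 150.0],
        "price_usd": [100000.0, 200000.0, 300000.0],
        "region": ["South", "South", "Northeast"],
    }
)

# Double brackets select several columns; describe() summarizes each one
stats = toy[["area_m2", "price_usd"]].describe()
print(stats.loc["mean"])
```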

**Task 1.5.12:** Create a histogram of `"price_usd"`. Make sure that the x-axis has the label `"Price [USD]"`, the y-axis has the label `"Frequency"`, and the plot has the title `"Distribution of Home Prices"`. Use Matplotlib (`plt`).

In [10]:
# Build histogram
plt.hist()


# Label axes


# Add title


# Don't change the code below 👇
plt.savefig("images/1-5-12.png", dpi=150)

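
A labeled histogram in Matplotlib can be sketched as below. The list `prices` holds hypothetical values; in the task you would pass the relevant column instead.

```python
import matplotlib.pyplot as plt

# Toy data (hypothetical prices)
prices = [95000, 120000, 150000, 150000, 210000, 480000]

# plt.hist needs the data as its first argument
plt.hist(prices)

# Axis labels and title are set with separate calls
plt.xlabel("Price [USD]")
plt.ylabel("Frequency")
plt.title("Distribution of Home Prices")
```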

**Task 1.5.13:** Create a horizontal boxplot of `"area_m2"`. Make sure that the x-axis has the label `"Area [sq meters]"` and the plot has the title `"Distribution of Home Sizes"`. Use Matplotlib (`plt`).

In [None]:
# Build box plot
plt.boxplot()


# Label x-axis


# Add title


# Don't change the code below 👇
plt.savefig("images/1-5-13.png", dpi=150)
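
The following sketch draws a horizontal box plot from a hypothetical list `areas`. It uses `vert=False`, which has long been the way to request a horizontal orientation; recent Matplotlib releases also accept `orientation="horizontal"`, so check the version you have installed.

```python
import matplotlib.pyplot as plt

# Toy data (hypothetical areas in square meters)
areas = [45, 60, 75, 80, 95, 110, 250]

# vert=False lays the box out horizontally, so the values run along the x-axis
plt.boxplot(areas, vert=False)

plt.xlabel("Area [sq meters]")
plt.title("Distribution of Home Sizes")
```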

**Task 1.5.14:** Use the `groupby` method to create a Series named `mean_price_by_region` that shows the mean home price in each region in Brazil, sorted from smallest to largest.

In [None]:
mean_price_by_region = ...
mean_price_by_region
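
A grouped mean sorted in ascending order might look like the sketch below, which uses a hypothetical DataFrame `toy` with two regions.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "region": ["South", "South", "Northeast", "Northeast"],
        "price_usd": [100000.0, 140000.0, 200000.0, 180000.0],
    }
)

# Group rows by region, average one column, then sort smallest to largest
mean_by_region = toy.groupby("region")["price_usd"].mean().sort_values()
print(mean_by_region)
```

Selecting the single column after `groupby` yields a Series rather than a DataFrame.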

**Task 1.5.15:** Use `mean_price_by_region` to create a bar chart. Make sure you label the x-axis as `"Region"` and the y-axis as `"Mean Price [USD]"`, and give the chart the title `"Mean Home Price by Region"`. Use pandas. 

In [None]:
# Build bar chart, label axes, add title
mean_price_by_region.plot()

# Don't change the code below 👇
plt.savefig("images/1-5-15.png", dpi=150)
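
With pandas, the chart type, axis labels, and title can all be passed to `plot` as keyword arguments. The sketch below uses a hypothetical Series `toy_means`.

```python
import pandas as pd

# Toy data (hypothetical mean prices)
toy_means = pd.Series({"South": 120000.0, "Northeast": 190000.0})

# kind="bar" selects a bar chart; plot() returns the Matplotlib Axes
ax = toy_means.plot(
    kind="bar",
    xlabel="Region",
    ylabel="Mean Price [USD]",
    title="Mean Home Price by Region",
)
```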

<div class="alert alert-block alert-info">
    <b>Keep it up!</b> You're halfway through your data exploration. Take one last break and get ready for the final push. 🚀
</div>

You're now going to shift your focus to the southern region of Brazil, and look at the relationship between home size and price.

**Task 1.5.16:** Create a DataFrame `df_south` that contains all the homes from `df` that are in the `"South"` region. 

In [None]:
df_south = ...
df_south.head()
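
Subsetting rows by a condition is usually done with a boolean mask. A minimal sketch with a hypothetical DataFrame `toy`:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "region": ["South", "Northeast", "South"],
        "price_usd": [100000.0, 200000.0, 140000.0],
    }
)

# The comparison yields a True/False Series; indexing with it keeps True rows
mask = toy["region"] == "South"
toy_south = toy[mask]
print(len(toy_south))  # 2
```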

**Task 1.5.17:** Use the `value_counts` method to create a Series `homes_by_state` that contains the number of properties in each state in `df_south`. 

In [None]:
homes_by_state = ...
homes_by_state
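
The sketch below counts occurrences with `value_counts` on a hypothetical DataFrame `toy`; the state names are placeholders. The result is sorted from most to least frequent, so the first index label is the most common value.

```python
import pandas as pd

# Toy data (placeholder state names)
toy = pd.DataFrame({"state": ["State A", "State B", "State A"]})

# value_counts() returns a Series of counts, largest first
counts = toy["state"].value_counts()
print(counts.index[0])  # State A
```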

**Task 1.5.18:** Create a scatter plot showing price vs. area for the state in `df_south` that has the largest number of properties. Be sure to label the x-axis `"Area [sq meters]"` and the y-axis `"Price [USD]"`, and use the title `"<name of state>: Price vs. Area"`. Use Matplotlib (`plt`).

<div class="alert alert-block alert-info">
    <p><b>Tip:</b> You should replace <code>&lt;name of state&gt;</code> with the name of the state that has the largest number of properties.</p>
</div>

In [None]:
# Subset data
df_south_rgs = ...

# Build scatter plot
plt.scatter()


# Label axes


# Add title
plt.title("Rio Grande do Sul: Price vs. Area")

# Don't change the code below 👇
plt.savefig("images/1-5-18.png", dpi=150)

**Task 1.5.19:** Create a dictionary `south_states_corr`, where the keys are the names of the three states in the `"South"` region of Brazil, and their associated values are the correlation coefficient between `"area_m2"` and `"price_usd"` in that state.

As an example, here's a dictionary with the states and correlation coefficients for the Southeast region. Since you're looking at a different region, the states and coefficients will be different, but the structure of the dictionary will be the same.

```python
{'Espírito Santo': 0.6311332554173303,
 'Minas Gerais': 0.5830029036378931,
 'Rio de Janeiro': 0.4554077103515366,
 'São Paulo': 0.45882050624839366}
```

In [None]:
south_states_corr = ...
south_states_corr
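
One possible way to build such a dictionary is to loop over the groups and call `Series.corr` (Pearson correlation by default) on each. This is a sketch under assumptions: `toy`, `corr_by_state`, and the placeholder state names are hypothetical, and other approaches work equally well.

```python
import pandas as pd

# Toy data (placeholder state names, hypothetical values)
toy = pd.DataFrame(
    {
        "state": ["State A"] * 3 + ["State B"] * 3,
        "area_m2": [50.0, 100.0, 150.0, 60.0, 80.0, 100.0],
        "price_usd": [100000.0, 200000.0, 300000.0, 90000.0, 150000.0, 120000.0],
    }
)

corr_by_state = {}
# Iterating over a groupby yields (group key, sub-DataFrame) pairs
for state, group in toy.groupby("state"):
    corr_by_state[state] = group["area_m2"].corr(group["price_usd"])

print(corr_by_state)
```

In this toy data, the first group is perfectly linear (a coefficient of about 1.0) and the second is only moderately correlated (about 0.5).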