# Data Exploration Code-Along

Implement the code blocks below, in order, to explore some common data-exploration techniques. We will be using the `realestate.csv` file again.

## Data Exploration I

Simple data explorations

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# TODO: Load in your data from `../../data/realestate.csv`

df = pd.read_csv("../data/realestate.csv")

In [None]:
# TODO: Observe first 5 rows

df.head()

In [34]:
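# scratch example on a small toy frame
# named `demo` so that the real-estate `df` loaded above is not overwritten
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
demo = pd.DataFrame(data)
demo

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |

In [28]:
# without inplace=True, rename() returns a renamed copy
# `demo` itself keeps column 'B', which the sort below relies on
demo.rename(columns={'B': 'C'})

In [35]:
demo.sort_values(by=['B'], ascending=False, inplace=True)

In [36]:
demo

|   | A | B |
|---|---|---|
| 2 | 3 | 6 |
| 1 | 2 | 5 |
| 0 | 1 | 4 |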


In [None]:
# TODO: describe Tukey's five-number summary (min, Q1, median, Q3, max)

...
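
As a rough illustration of what the five-number summary contains, here is a minimal sketch on made-up values (the series `toy_prices` is hypothetical and not taken from the real-estate data). It assumes pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Toy values for illustration only; not taken from the real-estate data.
toy_prices = pd.Series([10, 20, 30, 40, 50], name="toy_price")

# Tukey's five numbers: minimum, Q1, median, Q3, maximum.
five_numbers = toy_prices.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_numbers)

# describe() reports the same five values alongside count, mean and std.
summary = toy_prices.describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```

Calling `describe()` on a whole dataframe gives the same summary for every numeric column at once.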

## Univariate Analysis

Observe single-variable distributions and counts.

In [None]:
# TODO: make a histplot on "house age"
# which distributions/outliers do we observe?

df['house age'].plot.hist()

In [None]:
# What we might be seeing here is a "bimodal distribution"
# that is, our data has two separate modes (peaks)
# https://statisticsbyjim.com/basics/bimodal-distribution/

# we should be on the lookout for a category in our dataset whose groups have clearly different mean values of "house age"
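
To make the idea concrete, the sketch below builds a synthetic bimodal column from two hypothetical groups (`building_type` and `age` are invented names and the numbers are random, not real data). It shows how a single pooled mean hides the two peaks while a per-category mean recovers them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: two building types with different typical ages.
toy = pd.DataFrame({
    "building_type": ["old"] * 200 + ["new"] * 200,
    "age": np.concatenate([rng.normal(35, 3, 200), rng.normal(8, 3, 200)]),
})

# The pooled mean falls between the two peaks and describes neither group well.
pooled_mean = toy["age"].mean()
print(pooled_mean)

# Grouping by the category separates the two peaks.
group_means = toy.groupby("building_type")["age"].mean()
print(group_means)
```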

In [None]:
# TODO: make a histplot on "distance_to_mrt"
# which distributions/outliers do we observe?

df.hist(column=['distance_to_mrt'], bins=20)

In [None]:
# TODO: let's increase the number of bins for a better overview

...

In [None]:
# seems like outliers are making this harder to gauge...
# however, unlike the previous data-points from yesterday
# we have no reason (yet) to suspect these data points are erroneous
# so, let's just keep them in our dataset

In [None]:
# TODO: make a histplot on "num_convenience_stores"
# which distributions/outliers do we observe?

...

In [None]:
# TODO: make a histplot on "price_per_unit_area"
# which distributions/outliers do we observe?

...

In [None]:
# TODO: count the frequency of each unique value in the "num_of_rooms" column and save the result as "room_counts"
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html

...
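
For reference, here is a small sketch of how `value_counts` behaves on a single column, using made-up room numbers (the frame `toy` and its `rooms` column are hypothetical). Note that calling it on one column returns a pandas Series indexed by the unique values.

```python
import pandas as pd

toy = pd.DataFrame({"rooms": [2, 3, 3, 1, 3, 2]})  # made-up values

# value_counts() on a column returns a Series indexed by the unique values,
# sorted by frequency (most common first).
counts = toy["rooms"].value_counts()
print(counts)

# sort_index() orders by the room number instead, which often reads
# better on the x-axis of a bar plot.
counts_by_room = counts.sort_index()
print(counts_by_room)
```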

In [None]:
# TODO: plot a matplotlib barplot for the room_counts dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html

...

## Bivariate Analysis

Observe relationships between two variables.

In [None]:
# TODO: Create a boxplot that reveals the range of "price_per_unit_area" for each "num_of_rooms" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

...

In [None]:
# TODO: same as above, but hide the outliers
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

...

In [None]:
# outliers are identified by the IQR upper and lower fences
# IQR = Q3 - Q1
# Q3 --> median of the values above Q2
# Q1 --> median of the values below Q2

# upper fence = Q3 + (1.5 * IQR)
# lower fence = Q1 - (1.5 * IQR)

# we can calculate this ourselves...

three_rooms = df[df["num_of_rooms"] == 3]

stats = three_rooms["price_per_unit_area"].describe()
stats

In [None]:
q1 = stats["25%"]
q3 = stats["75%"]

IQR = q3 - q1

upper_fence = q3 + (1.5 * IQR)
lower_fence = q1 - (1.5 * IQR)

In [None]:
upper_fence

In [None]:
lower_fence

In [None]:
# find all outliers: rows above the upper fence or below the lower fence
price = three_rooms["price_per_unit_area"]
three_rooms[(price > upper_fence) | (price < lower_fence)]
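
The fence arithmetic is easier to follow on a handful of numbers. The sketch below uses made-up values (the `toy_` names are illustrative, chosen so they do not clash with the notebook's variables) and assumes pandas' default linear interpolation for quartiles.

```python
import pandas as pd

toy_values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 100])  # made-up prices

toy_q1 = toy_values.quantile(0.25)   # 14.0
toy_q3 = toy_values.quantile(0.75)   # 22.0
toy_iqr = toy_q3 - toy_q1            # 8.0

toy_lower = toy_q1 - 1.5 * toy_iqr   # 14 - 12 = 2.0
toy_upper = toy_q3 + 1.5 * toy_iqr   # 22 + 12 = 34.0

# Anything outside [toy_lower, toy_upper] is flagged as an outlier.
toy_outliers = toy_values[(toy_values < toy_lower) | (toy_values > toy_upper)]
print(toy_outliers.tolist())
```

Only the value 100 falls outside the fences here, which is the same rule a boxplot applies when it draws individual outlier points.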

In [None]:
# TODO: Calculate the linear (Pearson) correlation between "house age" and "distance_to_mrt"
# Documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

# import the function directly so that we do not overwrite the `stats` variable defined above
from scipy.stats import pearsonr

pearsonr(x=df["house age"], y=df["distance_to_mrt"])
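
One caveat worth keeping in mind: Pearson's r only measures *linear* association. The sketch below uses synthetic numbers and, as an assumption for brevity, pandas' `Series.corr` (which computes the Pearson coefficient by default) rather than scipy. It shows a perfectly predictable but curved relationship scoring roughly zero.

```python
import numpy as np
import pandas as pd

# Synthetic x values, symmetric around zero.
x = pd.Series(np.arange(-5, 6), dtype=float)

y_linear = 2 * x + 1   # straight-line relationship
y_curved = x ** 2      # fully determined by x, but not linear

r_linear = x.corr(y_linear)   # Pearson by default
r_curved = x.corr(y_curved)

print(round(r_linear, 3), round(r_curved, 3))
```

A coefficient near zero is therefore a prompt to plot the data, not proof that the two variables are unrelated.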

In [None]:
# seems like our correlation is ~0.04, let's see what's going on
# TODO: Create a scatter-plot that reveals the relationship between "house age" and "distance_to_mrt"
# Documentation: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

...

In [None]:
# When our scatter-points are overlaid on each other, we can also use
# a kde-plot to view the density of scatter-points
# this is helpful in identifying clusters
# TODO: Create a kde-plot that reveals the density of "house age" against "distance_to_mrt"
# Documentation: https://seaborn.pydata.org/generated/seaborn.kdeplot.html

...

## Data Exploration II

While these visualizations on basic facets of the dataset are already quite comprehensive, we can take this a step further by dividing our dataframe into *groups*.

In [None]:
# TODO: separate our dataframe according to the "num_of_rooms" categories to observe differences
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

...

In [None]:
# TODO: we cannot observe the intermediate grouped dataframe, but we can see the names of "groups" we've made
# Where else have we seen the "key" keyword???

...
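
As a hedged illustration on made-up data (the frame `toy` and its columns are hypothetical), a grouped object exposes a dict-like `groups` attribute that maps each group key to the row labels in that group. Individual groups can be pulled out with `get_group`.

```python
import pandas as pd

toy = pd.DataFrame({
    "rooms": [1, 2, 2, 3, 3, 3],
    "price": [20.0, 30.0, 34.0, 40.0, 44.0, 48.0],
})  # made-up values

grouped = toy.groupby("rooms")

# .groups behaves like a dictionary: group key -> row labels in that group.
group_keys = list(grouped.groups.keys())
print(group_keys)

# get_group() returns the rows for one key as an ordinary DataFrame.
three_room_rows = grouped.get_group(3)
print(three_room_rows)
```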

In [None]:
# TODO: let's calculate some aggregates per group
# We call this the split-apply-combine pattern
# Documentation: https://pandas.pydata.org/pandas-docs/version/1.1/user_guide/groupby.html

...
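
The pattern can be sketched on a toy frame (column names and values are invented for illustration): `groupby` splits the rows by key, `agg` applies each function per group, and the results are combined into one table indexed by the key.

```python
import pandas as pd

toy = pd.DataFrame({
    "rooms": [1, 2, 2, 3, 3, 3],
    "price": [20.0, 30.0, 34.0, 40.0, 44.0, 48.0],
})  # made-up values

# split by "rooms", apply three aggregates to "price", combine into one table
per_room = toy.groupby("rooms")["price"].agg(["mean", "median", "count"])
print(per_room)
```

The result has one row per group and one column per aggregate, so it can be passed straight to a bar plot.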

In [None]:
# TODO: let's visualize this as well using a bar-plot
# and let's attach labels!
# Documentation: https://matplotlib.org/stable/tutorials/pyplot.html#plotting-with-keyword-strings

...

In [None]:
# TODO: let's visualize this with error-bars! (not enough data)
# Documentation: https://seaborn.pydata.org/generated/seaborn.barplot.html

...

In [None]:
# we can also "group" or "bin" our dataframe according to values 
# for example, let's consider our histogram again...

...

In [None]:
# TODO: let's say we want to "group" or "bin" our "distance_to_mrt" column to two categories: close to mrt and far
# instead of attempting to do this ourselves, we can use the "qcut" function to separate our data into "quantiles"
# 10 for deciles, 4 for quartiles, 2 for halves
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.qcut.html

...
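
To show what quantile binning does, the sketch below applies `pd.qcut` to made-up, right-skewed distances (the series `toy_distance` is hypothetical). It also contrasts the result with `pd.cut`, which splits the value range into equal-width bins instead.

```python
import pandas as pd

# Made-up, right-skewed distances.
toy_distance = pd.Series([50, 120, 300, 450, 900, 1500, 2800, 4100])

# qcut with q=2 splits at the median, so each bin gets the same number of rows.
halves = pd.qcut(toy_distance, q=2)
half_counts = halves.value_counts()
print(half_counts)

# cut with bins=2 splits the range 50..4100 at its midpoint instead,
# so the skewed data lands mostly in the lower bin.
equal_width = pd.cut(toy_distance, bins=2)
width_counts = equal_width.value_counts()
print(width_counts)
```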

In [None]:
# TODO: we can take this a step further and label these quantiles for analysis

...

In [None]:
# TODO: notice how these are not saved by default, we can assign this to a new column...

...
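
A minimal sketch of labelling and saving the bins, again on made-up values (`toy`, `distance` and `mrt_proximity` are illustrative names): `qcut` returns a new Series, so nothing changes in the dataframe until the result is assigned to a column.

```python
import pandas as pd

toy = pd.DataFrame({"distance": [50, 120, 300, 450, 900, 1500, 2800, 4100]})  # made-up values

# labels= replaces the interval notation with readable category names;
# assigning the result stores it as a new column.
toy["mrt_proximity"] = pd.qcut(toy["distance"], q=2, labels=["close", "far"])

proximity_counts = toy["mrt_proximity"].value_counts()
print(toy)
print(proximity_counts)
```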

In [None]:
# and then explore this column using visuals
# notice that the group sizes are equal to one another! We expect this since quantiles create groups of equal size

...