# Exploratory Data Analysis (EDA) with Housing Price Dataset

Exploratory Data Analysis (EDA) is a crucial step in the data analysis pipeline. It helps us understand the data, discover patterns, spot anomalies, and frame hypotheses. In this lesson, we'll use a housing price dataset to explore various EDA techniques.

## Initial Steps for Data Analysis

The initial steps for data analysis in Python include:

1. **Data Acquisition:** This involves gathering data from various sources such as local files, databases, APIs, websites, etc.
 
2. **Loading the Data:** Common formats to consider are CSV (Comma Separated Values), JSON, XLS, HTML, XML, and more.

3. **Exploratory Data Analysis (EDA):** EDA is a systematic approach to initial data inspection. It leverages **descriptive analysis** techniques to understand the data better, identify outliers, highlight significant variables, and generally uncover underlying data patterns. Additionally, EDA helps in organizing the data, spotting errors, and assessing missing values.

4. **Data Cleaning:** It's crucial to check the available data and perform tasks such as removing empty columns, standardizing terms, imputing missing data where appropriate, and more.

5. **In-Depth EDA:** After cleaning, conduct a more in-depth exploratory data analysis to further understand the data.

## Methods in EDA

EDA methodologies can be broadly categorized into:

- **Numerical Measures:** These can include coefficients, frequency counts, and other statistical metrics.
  
- **Visual Representations:** Examples are histograms, scatter plots, pie charts, and more.

Additionally, based on the number of variables in focus, methods can be:

- **Univariate:** Describing the characteristics of a single variable at a time.
  
- **Bivariate:** Analyzing the relationship between two variables, either jointly or by examining how one (independent) variable influences the other (dependent) variable.
  
- **Multivariate:** An extension of bivariate analysis but for multiple variables. It explores the relationships among them or the impact of two or more independent variables (sometimes along with associated variables or covariates) on one or more dependent variables.

**Note**
It's crucial to ensure that all our analytical methods are tailored to the type of variable under consideration.


## Loading the Dataset

Before we dive into EDA, let's gather our data. In this case, we will load our dataset and take a quick look at its structure.


The dataset can be found [here](https://raw.githubusercontent.com/data-bootcamp-v4/data/main/housing_price_eda.csv) and the information about the dataset [here](https://github.com/data-bootcamp-v4/data/blob/main/housing_price_dataset_info.md).

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
```

```python
# Load the housing price dataset directly from its URL
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/housing_price_eda.csv")
```
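A first look at the structure could go along these lines. This is a minimal sketch: the tiny `df` built here is a made-up stand-in (three columns named after ones used later in this lesson, with invented values) so the snippet runs on its own. In the notebook, you would call the same methods on the `df` loaded above.

```python
import pandas as pd

# Made-up stand-in for the housing data (values are invented for illustration).
df = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV", "RL"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "TotRmsAbvGrd": [8, 6, 6, 7, 9],
})

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows of the table
print(df.dtypes)  # data type of each column
df.info()         # non-null counts and memory usage per column
```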

## Initial Exploration

Before diving into the specifics of univariate analysis, it's essential to get acquainted with our dataset.


### Exploring numerical and categorical variables

We'll explore numerical and categorical variables, and create two dataframes, one for each type of variable.

**Note**: 
- **Numerical variables**: These can encompass both quantitative and qualitative information. Discrete numerical variables with a limited number of distinct values often hint at a qualitative (categorical) variable encoded as numbers.

- **Object variables**: Typically, these consist of qualitative data, numeric data stored as strings, or data that might not be directly relevant to the analysis, such as identifiers like 'ID' numbers or 'Names'. Variables with a broad range of unique values, especially in string format, often fall into this last group.


Based on domain knowledge and the explorations above, decide which numerical columns are better treated as categorical, and vice versa.

For demonstration purposes, let's assume the columns in *potential_categorical_from_numerical* are categorical, even though this might not be the case in a real scenario.
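One way to build the two dataframes is sketched below. The names `df_numerical`, `df_categorical`, and `max_unique` are assumptions made for this example, the threshold is arbitrary, and the small `df` is invented so the snippet runs on its own.

```python
import pandas as pd

# Made-up stand-in for the housing data (values are invented for illustration).
df = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV", "RL", "RM"],
    "TotRmsAbvGrd": [8, 6, 6, 7, 9, 6],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Split columns by data type.
df_numerical = df.select_dtypes(include="number")
df_categorical = df.select_dtypes(exclude="number")

# Numerical columns with few distinct values may be categories encoded as numbers.
# The threshold is an arbitrary assumption; tune it with domain knowledge.
max_unique = 4
unique_counts = df_numerical.nunique()
potential_categorical_from_numerical = df_numerical.loc[:, unique_counts <= max_unique]

# Move those columns from the numerical dataframe to the categorical one.
df_categorical = pd.concat([df_categorical, potential_categorical_from_numerical], axis=1)
df_numerical = df_numerical.drop(columns=potential_categorical_from_numerical.columns)

print(df_categorical.columns.tolist())
print(df_numerical.columns.tolist())
```

A count-based threshold is only a heuristic: it flags candidates, and the final decision still depends on what each column means.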

As mentioned in the data cleaning phase, it's also essential to explore the following before cleaning the data properly:

- **Data Typing/Formatting**: ensure data is in the correct type or format.
- **Duplicates**: check for and handle repeated data points.
- **Missing Values**: identify and address null or missing data.
- **Categorical Variables**: examine the unique categories and their distributions, and check whether they need cleaning (for example, gender recorded as M, F, Masculine, masculine, Feminine, etc.). The same applies to numerical data.
- **Outliers**: identify extreme values and decide how to handle them.

We won't delve into this now, as we already covered it in the data cleaning lessons (except for outliers, which we'll review later in this lesson). We'll just quickly clean null values so we can focus on univariate analysis.

## Data Cleaning


### Checking for Missing Data

Missing data can influence our analysis. It's essential to identify and handle them appropriately.
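A quick check for missing values might look like the sketch below. The data is invented, and dropping incomplete rows is only one simple option chosen for illustration; it is not necessarily the strategy used in the original notebook, and imputation may be more appropriate in practice.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the housing data, with two missing entries.
df = pd.DataFrame({
    "MSZoning": ["RL", "RM", None, "FV", "RL"],
    "SalePrice": [208500.0, np.nan, 223500.0, 140000.0, 250000.0],
    "TotRmsAbvGrd": [8, 6, 6, 7, 9],
})

# Number and share of missing values per column.
missing_counts = df.isnull().sum()
missing_share = df.isnull().mean()
print(missing_counts[missing_counts > 0])
print(missing_share[missing_share > 0])

# Simple option: drop every row that has at least one missing value.
df_clean = df.dropna()
print(df_clean.shape)
```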


## Univariate Analysis

Univariate analysis, as its name suggests, concentrates on one variable at a time, giving us a deep understanding of its characteristics. This fundamental step in Exploratory Data Analysis (EDA) lays the groundwork for subsequent analyses involving multiple variables. Let's explore various techniques for both categorical and numerical variables.

**Categorical variables**:
- Frequency tables. Counts and proportions.
- Visualizations: Bar charts, pie charts

**Numerical variables**: 
- Measures of centrality: Mean, median, mode
- Measures of dispersion: Variance,  standard deviation, minimum, maximum, range, quantiles
- Shape of the distribution: Skewness (asymmetry) and kurtosis
- Visualizations: Histograms, box plots

### Categorical Variables

Categorical variables represent categories or labels, like types or groups. Analyzing categorical data involves understanding the frequency or proportion of each category.

#### Frequency Tables

Frequency tables are tabular representations that display the number of occurrences of each category. They help in understanding the distribution of categories in a dataset.

In Python, we can use:
- `value_counts()`
- `pd.crosstab()`

Let's consider *MSZoning* as our categorical variable of interest, which represents the general zoning classification of the sale.

We'll look at `value_counts()`.
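As a minimal sketch, the frequency and proportion tables can be built as follows. The zoning values are invented stand-ins, and the names `frequency_table` and `proportion_table` are assumptions for this example.

```python
import pandas as pd

# Made-up zoning labels standing in for the real 'MSZoning' column.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV", "RL", "RM", "RL", "RL"]})

# Counts per category, sorted from most to least frequent.
frequency_table = df["MSZoning"].value_counts()

# Share of each category (the values sum to 1).
proportion_table = df["MSZoning"].value_counts(normalize=True)

print(frequency_table)
print(proportion_table)
```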

The frequency table gives the count of each zoning type, while the proportion table provides the percentage representation of each category in the dataset. This helps to quickly identify dominant and minority categories.

Let's look at `pd.crosstab()`. The crosstab function can be useful to compute a cross-tabulation of two (or more) factors. Here, it's used to count the occurrences of each 'MSZoning' type.
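A sketch of the same count with `pd.crosstab()` is shown below, again on invented data. Passing the string `"count"` as `columns` is a common idiom for a one-way table; the name `my_table` matches the one used later in this lesson, while `my_table_proportions` is an assumption for this example.

```python
import pandas as pd

# Made-up zoning labels standing in for the real 'MSZoning' column.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV", "RL", "RM", "RL", "RL"]})

# One-way frequency table: one row per category, a single 'count' column.
my_table = pd.crosstab(index=df["MSZoning"], columns="count")

# Proportions: divide each count by the column total.
my_table_proportions = my_table / my_table.sum()

print(my_table)
print(my_table_proportions)
```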

The crosstab table displays the number of occurrences of each 'MSZoning' type, just like the frequency table. Computing the proportion table showcases the relative percentage of each category.

#### Visualizations

Visualizations offer a more intuitive understanding of categorical data distribution. Bar charts and pie charts are common methods to visually represent categorical data.

##### Bar charts

Bar charts display the frequency or proportion of each category using bars of varying lengths. Here, the same data is visualized using three different methods: `sns.barplot()`, the pandas `.plot.bar()` method (which uses matplotlib), and `sns.countplot()`.

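Let's see how to use the `sns.barplot()` function with the results from `value_counts()` and `pd.crosstab()`. Both approaches should give the same bar heights, although the bar order may differ because `value_counts()` sorts by frequency while `pd.crosstab()` sorts by category label.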
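The two calls could look like the sketch below, drawn side by side for comparison. The data is invented, and the two-panel layout is a choice made for this example rather than something taken from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up zoning labels standing in for the real 'MSZoning' column.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV", "RL", "RM", "RL", "RL"]})

frequency_table = df["MSZoning"].value_counts()
my_table = pd.crosstab(index=df["MSZoning"], columns="count")

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot from the value_counts() result.
sns.barplot(x=frequency_table.index, y=frequency_table.values, ax=ax_left)
ax_left.set_title("From value_counts()")

# Bar plot from the crosstab result.
sns.barplot(x=my_table.index, y=my_table["count"], ax=ax_right)
ax_right.set_title("From pd.crosstab()")

# In a plain script, call plt.show() to display the figure.
```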

Using pandas' built-in plotting (which relies on matplotlib), this would just be:

```python
my_table.plot.bar()
```

##### Countplots

A countplot is a type of bar plot in Seaborn that displays the count of occurrences of unique values in a categorical column.
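A minimal sketch on invented data: `sns.countplot()` takes the raw column and does the counting itself, so no frequency table is needed beforehand.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up zoning labels standing in for the real 'MSZoning' column.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV", "RL", "RM", "RL", "RL"]})

# One bar per category; the bar height is the number of rows in that category.
ax = sns.countplot(data=df, x="MSZoning")
ax.set_title("Houses per zoning classification")

# In a plain script, call plt.show() to display the figure.
```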

This gives the same result as the *bar plot*, since `sns.countplot()` counts the occurrences of each category itself and puts those counts on the Y axis.

##### Pie charts



Pie charts provide a circular representation of the data, showing the proportion of each category as slices of a pie. However, they can be challenging to interpret when there are many categories or when categories have similar proportions.

Seaborn, as of 2023, does not have a dedicated function for pie charts. Pie charts are more commonly created using `matplotlib`, which Seaborn is built upon.
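With matplotlib, a pie chart of the category counts could be sketched as follows. The data is invented, and the formatting options (`autopct`, `startangle`) are just one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up zoning labels standing in for the real 'MSZoning' column.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV", "RL", "RM", "RL", "RL"]})
frequency_table = df["MSZoning"].value_counts()

fig, ax = plt.subplots()
# One wedge per category, labelled with its share of the total.
ax.pie(
    frequency_table.values,
    labels=frequency_table.index,
    autopct="%1.1f%%",
    startangle=90,
)
ax.set_title("Share of houses per zoning classification")

# In a plain script, call plt.show() to display the figure.
```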

### Numerical Variables

Numerical variables are quantitative, and their values can be measured. Analyzing numerical data involves understanding its distribution, central tendency, and variability.


#### Summary Statistics

**Centrality and Dispersion Measures**

Let's start by getting some basic statistics on our dataset to understand its scale, centrality, and spread.


- The `.describe()` method provides key statistics for numerical columns (by default) in a dataframe, excluding NaN values; although it primarily targets numeric data, the `include` parameter allows for the selection of other data types.
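A short sketch of both uses, on an invented two-column dataframe:

```python
import pandas as pd

# Made-up stand-in for the housing data (values are invented for illustration).
df = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV", "RL"],
    "SalePrice": [100000, 150000, 200000, 250000, 800000],
})

# Default: count, mean, std, min, quartiles and max for numerical columns only.
numeric_summary = df.describe()

# include="all" also summarizes non-numerical columns (unique, top, freq).
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column's type (for example, the mean of 'MSZoning') appear as NaN in the combined table.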

#### More Centrality and Dispersion Measures

Now, suppose we want to calculate individual statistical measures without using the `.describe()` method. Here are some ways to do it:

- `df[column].mean()`: Computes the mean of the selected column.
- `df[column].median()`: Calculates the median of the selected column.
- `df[column].mode()`: Identifies the mode of the selected column.
- `df[column].std()`: Determines the standard deviation of the selected column.
- `df[column].var()`: Computes the variance of the selected column.
- `df[column].min()`: Finds the minimum value in the selected column.
- `df[column].max()`: Finds the maximum value in the selected column.
- `df[column].count()`: Counts the number of non-NaN entries in the selected column.

In these examples, replace `column` with the name of the column you want to analyze.

For this section, we'll focus on 'SalePrice' as our numerical variable of interest, which represents the price at which the house was sold.
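Applied to 'SalePrice', the individual measures might be computed as in this sketch. The prices are invented, and the range and quartiles are added here because they appear in the list of dispersion measures earlier in the lesson.

```python
import pandas as pd

# Made-up sale prices standing in for the real 'SalePrice' column.
df = pd.DataFrame({"SalePrice": [100000, 150000, 150000, 200000, 400000]})
prices = df["SalePrice"]

mean_price = prices.mean()
median_price = prices.median()
mode_price = prices.mode()      # returns a Series, since ties are possible
std_price = prices.std()        # sample standard deviation (ddof=1)
var_price = prices.var()        # sample variance (ddof=1)
price_range = prices.max() - prices.min()
quartiles = prices.quantile([0.25, 0.5, 0.75])

print(mean_price, median_price, mode_price.tolist())
print(std_price, var_price, price_range)
print(quartiles)
```

Here the mean is above the median, which is a first hint that the distribution is skewed to the right.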

#### Shape of the Distribution

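Skewness and kurtosis provide insights into the shape of the data distribution. Skewness measures its asymmetry: a positive value indicates a longer right tail, a negative value a longer left tail. Kurtosis describes its "tailedness", that is, how heavy the tails are compared with a normal distribution (it is often loosely described as how peaked the distribution is).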
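Both measures are available as Series methods in pandas, as in this sketch on invented prices. Note that `.kurt()` reports excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

# Made-up sale prices with one expensive house pulling the right tail.
df = pd.DataFrame({"SalePrice": [100000, 150000, 150000, 200000, 400000]})

skewness = df["SalePrice"].skew()  # > 0 means a longer right tail
kurtosis = df["SalePrice"].kurt()  # excess kurtosis (normal distribution ~ 0)

print(f"Skewness: {skewness:.2f}")
print(f"Kurtosis: {kurtosis:.2f}")

# For comparison, a perfectly symmetric sample has zero skewness.
symmetric = pd.Series([1, 2, 3, 4, 5])
print(f"Skewness of a symmetric sample: {symmetric.skew():.2f}")
```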

#### Visualizations

Visual tools like histograms and box plots offer insights into the distribution, variability, and potential outliers in numerical data.

##### Histograms

Histograms display the frequency distribution of a dataset. The height of each bar represents the number of data points in each bin.

##### Box plots

Box plots, or box-and-whisker plots, show the median, the central 50% of the data (the interquartile range), the whiskers, and potential outliers.
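The sketch below draws a box plot of invented prices and also computes the quantities behind it. The 1.5 × IQR rule used to flag outliers is the common default convention for box plot whiskers, stated here as an assumption rather than a rule from this lesson.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sale prices with one unusually expensive house.
df = pd.DataFrame({"SalePrice": [100000, 150000, 150000, 200000, 400000]})

ax = sns.boxplot(x=df["SalePrice"])
ax.set_title("Box plot of sale prices")

# The same information, numerically.
q1 = df["SalePrice"].quantile(0.25)
q3 = df["SalePrice"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn as individual outlier markers.
is_outlier = (df["SalePrice"] < lower_fence) | (df["SalePrice"] > upper_fence)
outliers = df.loc[is_outlier, "SalePrice"]
print(outliers.tolist())

# In a plain script, call plt.show() to display the figure.
```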

## Converting continuous to discrete variables: Discretization

Discretization is the process of converting continuous variables into discrete ones by creating a set of contiguous intervals (or bins) and then categorizing the variables into these intervals. This can be particularly useful when you want to categorize a continuous variable into different groups based on ranges. Note that we usually lose information in this process.

For our dataset, let's take the 'SalePrice' column, which is continuous, and discretize it into categories like 'Low', 'Medium', 'High', and 'Very High'.
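With fixed ranges, `pd.cut()` does the job. In this sketch the bin edges are arbitrary assumptions and the prices are invented; on the real data, the edges should come from domain knowledge or from the distribution itself.

```python
import pandas as pd

# Made-up sale prices standing in for the real 'SalePrice' column.
df = pd.DataFrame({"SalePrice": [100000, 150000, 200000, 250000, 400000]})

# Hypothetical bin edges; each interval includes its right edge by default.
bins = [0, 100000, 200000, 300000, float("inf")]
labels = ["Low", "Medium", "High", "Very High"]

df["SalePrice_category"] = pd.cut(df["SalePrice"], bins=bins, labels=labels)

print(df)
print(df["SalePrice_category"].value_counts())
```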

Another useful option is **discretizing by quantiles**. This means dividing the data into intervals based on specific quantile values. This ensures that each bin has (approximately) the same number of data points. The `pandas` library provides a convenient method, `qcut()`, for this purpose.

Discretizing by quantiles can be particularly useful when you want to create categories that represent relative rankings (like low, medium, high, etc.) based on the distribution of the data, rather than fixed numeric ranges.

**Step 1**: Choose the number of quantiles (or bins). For example, if you want quartiles, you would choose 4 bins. 

**Step 2**: Use the `qcut()` function from `pandas`.
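Put together, the two steps could look like this sketch. The prices are invented, and reusing the 'Low' to 'Very High' labels for the four quartiles is a choice made for this example.

```python
import pandas as pd

# Made-up sale prices standing in for the real 'SalePrice' column.
df = pd.DataFrame({
    "SalePrice": [100000, 120000, 140000, 160000, 180000, 200000, 220000, 240000]
})

# Step 1: four bins, i.e. quartiles.
n_bins = 4
labels = ["Low", "Medium", "High", "Very High"]

# Step 2: qcut picks the bin edges so each bin holds about the same number of rows.
df["SalePrice_quantile"] = pd.qcut(df["SalePrice"], q=n_bins, labels=labels)

counts = df["SalePrice_quantile"].value_counts().sort_index()
print(counts)
```

Unlike `pd.cut()`, the bin edges here depend on the data, so the same price could land in a different category in another dataset.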

### 💡 Check for understanding

Discretize the '1stFlrSF' column (first-floor square feet) into three categories: 'Small', 'Medium', 'Large'. Set the bins such that 'Small' includes sizes up to the 33rd percentile, 'Medium' includes sizes from the 33rd to the 66th percentile, and 'Large' includes sizes from the 66th percentile onward. How many houses fall into each category?

```python
# Your code goes here
```

## Summary

In this lesson, we've conducted a comprehensive univariate analysis:

- For **categorical variables**, we visualized the distribution of our zoning classifications with bar and pie charts, backed by frequency tables.
- For **numerical variables**, we explored the central tendencies, dispersions and shape of distribution of our sale prices, visualized through histograms and box plots.

This analysis allows us to deeply understand each variable, laying a strong foundation for subsequent multivariate analyses.

## 💡 Check for understanding

**Scenario**:
Given the 'TotRmsAbvGrd' column (total rooms above ground), let's dive deep into its univariate characteristics.

**Tasks**:

1. **Data Aggregation**:
    - Create a frequency table for 'TotRmsAbvGrd' to understand the distribution of the number of rooms in houses.
    - Calculate the mean, median, mode, variance, and standard deviation of 'TotRmsAbvGrd'.

2. **Visualization**:
    - Plot a histogram for 'TotRmsAbvGrd' to understand its distribution.
    - Plot a box plot for 'TotRmsAbvGrd' to visualize its central tendency, spread, and potential outliers.

3. **Interpretation**:
    - Is the distribution of the number of rooms skewed? If so, in which direction?
    - Based on the histogram and box plot, what can you infer about the common number of rooms above ground in houses? 
    - Are there any noticeable outliers in the number of rooms? If so, are there more houses with unusually many rooms or unusually few?


```python
# Your code goes here
```