# Exploratory Data Analysis

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Visualizing Data Insights

Full chapter [here](./Chapter_Visualizing_Data_Insights.ipynb).

## Overview of EDA

Exploratory Data Analysis (EDA) is a critical initial step in data analysis where researchers investigate datasets to understand their structure, patterns, and potential issues.



In [None]:
# read in the data
dataset_path_clean_laptops = Path.cwd().parent / "data" / "OUTPUT_laptops.parquet"
laptops = pd.read_parquet(dataset_path_clean_laptops)
laptops.info()

## Descriptive Statistics and Data Summarization

Descriptive statistics provide the foundation for understanding and interpreting data by summarizing its main features. When faced with a large dataset, it is often impractical—or even impossible—to examine every individual data point. Descriptive statistics condense the data into a few key measures that capture its central tendency, variability, and overall distribution. This subchapter focuses on the theory behind these techniques and explains why they are critical for effective data analysis.

Data summarization is essential for several reasons:
- **Simplification**: Large datasets are distilled into concise, interpretable numbers. Instead of sifting through thousands of records, analysts can review summary metrics.
- **Insight Generation**: By calculating central tendencies (mean, median, mode) and dispersion measures (range, variance, standard deviation), one can quickly gauge the typical value in a dataset and understand how spread out the values are.
- **Comparison**: Summarized statistics facilitate comparisons between different groups or over time, supporting informed decision-making.
- **Foundation for Further Analysis**: Descriptive insights serve as a preliminary step before more complex inferential statistics or predictive modeling, helping to verify assumptions and detect anomalies in the data.
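The measures named in the list above can be computed directly with pandas. The following is a minimal sketch on a small, made-up price series (the values are hypothetical and not taken from the laptops dataset):

```python
import pandas as pd

# Hypothetical prices in euros; not taken from the laptops dataset.
prices = pd.Series([500, 700, 700, 900, 1200], name="price_euros")

summary = {
    # Central tendency
    "mean": prices.mean(),
    "median": prices.median(),
    "mode": prices.mode().tolist(),  # mode() returns a Series because ties are possible
    # Dispersion
    "range": prices.max() - prices.min(),
    "variance": prices.var(),  # sample variance (ddof=1) by default
    "std": prices.std(),  # sample standard deviation (ddof=1) by default
}
print(summary)
```

Note that the mean (800) is higher than the median (700) here because the single most expensive item pulls the mean upward, which is the kind of pattern these summaries are meant to reveal.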

<p>Just like with NumPy, we can use any of the standard <a target="_blank" href="https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex">Python numeric operators</a> with series, including:</p>
<ul>
<li><code>series_a + series_b</code> - Addition</li>
<li><code>series_a - series_b</code> - Subtraction</li>
<li><code>series_a * series_b</code> - Multiplication (this is unrelated to the multiplications used in linear algebra).</li>
<li><code>series_a / series_b</code> - Division</li>
</ul>
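As a small illustration of these operators, the sketch below combines two made-up series (the names `ssd` and `hdd` and their values are hypothetical). Operations are applied element by element, and pandas matches elements by index label rather than by position:

```python
import pandas as pd

# Hypothetical storage sizes in GB for three laptops labelled a, b, c.
ssd = pd.Series([256, 512, 0], index=["a", "b", "c"])
hdd = pd.Series([0, 1000, 500], index=["a", "b", "c"])

total = ssd + hdd  # element-wise addition
ssd_share = ssd / total  # element-wise division

# Reordering one operand does not change the result, because values are
# matched by index label before the operation is applied.
reordered = hdd.loc[["c", "a", "b"]]
aligned = ssd + reordered

print(total)
print(ssd_share)
print(aligned)
```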


<p>Pandas supports many descriptive statistics methods that summarize a single column or a whole DataFrame. Here are a few of the most useful ones (with links to documentation):</p>
<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html"><code>DataFrame.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html"><code>DataFrame.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html"><code>DataFrame.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html"><code>DataFrame.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html"><code>DataFrame.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html"><code>DataFrame.sum()</code></a></li>
</ul>

In [None]:
# Create a scatter plot of total storage vs. weight
total_storage = laptops["storage_ssd_gb"] + laptops["storage_hdd_gb"] + laptops["storage_flash_gb"] + laptops["storage_hybrid_gb"]
weight = laptops["weight_kg"]
mean_storage = total_storage.mean()
mean_weight = weight.mean()

plt.figure(figsize=(10, 6))
plt.scatter(total_storage, weight, alpha=0.5)
plt.title("Total Storage vs. Weight of Laptops")
plt.xlabel("Total Storage (GB)")
plt.ylabel("Weight (kg)")
plt.grid(True)
plt.xlim(0, 3000)  # Set x-axis limit to 3000 GB
plt.ylim(0, 5)  # Set y-axis limit to 5 kg
plt.xticks(np.arange(0, 3200, 200))
plt.yticks(np.arange(0, 6, 0.5))
plt.axhline(y=mean_weight, color="r", linestyle="--", label="Mean weight")
plt.axvline(x=mean_storage, color="g", linestyle="--", label="Mean storage")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Exercise: Find the laptop with the highest RAM vs price ratio
# Calculate the RAM vs price ratio
laptops["ram_price_ratio"] = laptops["ram_gb"] / laptops["price_euros"]
# Find the laptop with the highest RAM vs price ratio
highest_ram_price_ratio = laptops.loc[laptops["ram_price_ratio"].idxmax()]
print("Laptop with the highest RAM vs price ratio:")
highest_ram_price_ratio[["manufacturer", "model_name", "ram_gb", "price_euros", "ram_price_ratio"]]


<p>The <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>Series.value_counts()</code> method</a> displays each unique non-null value in a column together with its count, sorted from the most frequent value to the least frequent.</p>

In [None]:
# Count the number of laptops by CPU manufacturer
laptops["cpu_manufacturer"].value_counts()

> <p><strong>Method chaining</strong> —&nbsp;a way to combine multiple methods together in a single line.</p>
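To make the definition above concrete, here is a minimal sketch on a made-up table (the data is hypothetical; only the column names mirror the laptops dataset). It writes the same logic first with intermediate variables and then as a single chain:

```python
import pandas as pd

# Hypothetical mini-dataset; column names mirror the laptops data.
toy = pd.DataFrame(
    {
        "manufacturer": ["Asus", "Asus", "Asus", "Dell", "Dell", "Dell", "Apple"],
        "price_euros": [900.0, 1500.0, 1000.0, 1100.0, 700.0, 1200.0, 2000.0],
    }
)

# Step by step, with intermediate variables.
expensive = toy[toy["price_euros"] > 800]
counts = expensive["manufacturer"].value_counts()
top = counts.head(2)

# The same logic as one chain; wrapping it in parentheses allows one
# method per line, which keeps long chains readable.
top_chained = (
    toy[toy["price_euros"] > 800]["manufacturer"]
    .value_counts()
    .head(2)
)
print(top_chained)
```

Both versions produce the same result. The chained form avoids naming intermediate objects that are never used again.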

In [None]:
# Get the price median for the intel and amd laptops
intel_price_median = laptops[laptops["cpu_manufacturer"] == "intel"]["price_euros"].median()
amd_price_median = laptops[laptops["cpu_manufacturer"] == "amd"]["price_euros"].median()

print(f"Intel median price: {intel_price_median}")
print(f"AMD median price: {amd_price_median}")

<p>Boolean indexing is a powerful tool which allows us to select or exclude parts of our data based on their values. However, to answer more complex questions, we need to learn how to combine boolean arrays.</p>
<p>To recap, boolean arrays are created using any of the Python standard <strong>comparison operators</strong>: <code>==</code> (equal), <code>&gt;</code> (greater than), <code>&lt;</code> (less than), <code>&gt;=</code> (greater than or equal), <code>&lt;=</code> (less than or equal), <code>!=</code> (not equal).</p>
<p>We combine boolean arrays using <strong>boolean operators</strong>. In Python, these boolean operators are <code>and</code>, <code>or</code>, and <code>not</code>. In pandas, the operators are slightly different:</p>
<table>
<thead>
<tr>
<th>pandas</th>
<th>Python equivalent</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>a &amp; b</code></td>
<td><code>a and b</code></td>
<td><code>True</code> if both <code>a</code> and <code>b</code> are <code>True</code>, else <code>False</code></td>
</tr>
<tr>
<td><code>a | b</code></td>
<td><code>a or b</code></td>
<td><code>True</code> if either <code>a</code> or <code>b</code> is <code>True</code>, else <code>False</code></td>
</tr>
<tr>
<td><code>~a</code></td>
<td><code>not a</code></td>
<td><code>True</code> if <code>a</code> is <code>False</code>, else <code>False</code></td>
</tr>
</tbody>
</table>
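The operators in the table can be tried on a small, made-up DataFrame (the values below are hypothetical). One practical detail worth showing: when comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than comparison operators.

```python
import pandas as pd

# Hypothetical data; column names mirror the laptops dataset.
toy = pd.DataFrame(
    {
        "ram_gb": [4, 8, 16, 32],
        "price_euros": [400.0, 800.0, 1200.0, 2500.0],
    }
)

has_enough_ram = toy["ram_gb"] >= 8
is_affordable = toy["price_euros"] < 1500

# Inline comparisons must be parenthesized before combining them.
both = toy[(toy["ram_gb"] >= 8) & (toy["price_euros"] < 1500)]

# Named masks can be combined directly.
either = toy[has_enough_ram | is_affordable]
neither = toy[~(has_enough_ram | is_affordable)]

print(both)
print(len(either), len(neither))
```

Using Python's `and`, `or`, or `not` on these masks raises a `ValueError`, because the truth value of a whole Series is ambiguous.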

In [None]:
# Get the top 3 manufacturers with the most laptops with price > 1000 and nvidia gpus
laptops_filtered = laptops[(laptops["price_euros"] > 1000) & (laptops["gpu_manufacturer"] == "nvidia")]
top_3_manufacturers = laptops_filtered["manufacturer"].value_counts().head(3)
top_3_manufacturers

In [None]:
# Get the cheapest laptop with ssd or flash storage and at least 16GB of RAM
laptops_filtered = laptops[(laptops["storage_ssd_gb"] > 0) | (laptops["storage_flash_gb"] > 0)]
laptops_filtered = laptops_filtered[laptops_filtered["ram_gb"] >= 16]
cheapest_laptop = laptops_filtered.loc[laptops_filtered["price_euros"].idxmin()]
print("Cheapest laptop with SSD or flash storage and at least 16GB of RAM:")
cheapest_laptop[["manufacturer", "model_name", "ram_gb", "storage_ssd_gb", "storage_flash_gb", "price_euros"]]

In [None]:
# Select all laptops from Asus that do not have a dedicated Nvidia GPU and have at least 16GB of RAM
laptops_filtered = laptops[(laptops["manufacturer"] == "Asus") & ~(laptops["gpu_manufacturer"] == "nvidia") & (laptops["ram_gb"] >= 16)]
laptops_filtered

In [None]:
# Get the Apple laptop with the highest price
laptops[laptops["manufacturer"] == "Apple"].sort_values(by="price_euros", ascending=False).head(1)

In [None]:
# Count laptops by manufacturer, only for laptops without HDD storage
laptops[laptops["storage_hdd_gb"] == 0]["manufacturer"].value_counts().head(10).plot(kind="barh", figsize=(10, 6), color="skyblue")
plt.title("Top 10 Manufacturers without HDD Storage")
plt.xlabel("Number of Laptops")
plt.ylabel("Manufacturer")
plt.xticks(rotation=45)
plt.grid(axis="x")
plt.tight_layout()
plt.show()

## Copy-on-Write (CoW)

- [Copy-on-Write (CoW)](https://pandas.pydata.org/docs/user_guide/copy_on_write.html#copy-on-write)
- [Returning a view versus a copy](https://pandas.pydata.org/docs/user_guide/indexing.html#returning-a-view-versus-a-copy)

> Copy-on-Write will become the default in pandas 3.0. We recommend turning it on now to benefit from all improvements. Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1. **CoW will be enabled by default in version 3.0.**

<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/view-vs-copy.png" alt="view-vs-copy">


<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/modifying.png" alt="modifying">

CoW will lead to more predictable behavior since it is not possible to update more than one object with one statement, e.g. indexing operations or methods won’t have side-effects. Additionally, through delaying copies as long as possible, the average performance and memory usage will improve.

**Previous behavior**

pandas indexing behavior is tricky to understand. Some operations return views while others return copies. Depending on the result of the operation, mutating one object might accidentally mutate another:

In [None]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df

In [None]:
subset = df["foo"]
subset.iloc[0] = 100
df

Mutating `subset`, e.g. updating its values, also updates `df`. The exact behavior is hard to predict. Copy-on-Write prevents accidentally modifying more than one object by explicitly disallowing it. With CoW enabled, `df` is unchanged:

In [None]:
pd.options.mode.copy_on_write = True

In [None]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 10
df

In [None]:
subset

Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules.
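One way to write code that behaves the same with and without CoW is sketched below, under the assumption that the goal is either an independent object or an in-place update of the parent. It uses only an explicit `.copy()` and a single `.loc` assignment; the frame is the same toy example as above.

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# An explicit copy is independent of df whether or not CoW is enabled.
subset = df["foo"].copy()
subset.iloc[0] = 100

# To change df itself, assign through df in a single indexing step.
df.loc[0, "bar"] = 40

print(df)
print(subset)
```

Here `df["foo"]` keeps its original values, `subset` holds the modified ones, and only the `.loc` assignment changes `df`.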

In [None]:
pd.options.mode.copy_on_write = False

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

In [None]:
test_data = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6], "zip": [7, 8, 9]}, index=["a", "b", "c"])
test_data

In [None]:
# access method 1:
test_data["bar"]["b"]

In [None]:
# access method 2:
test_data.loc["b", "bar"]

These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

`test_data["bar"]` selects the column and returns a Series; call it `test_data_with_bar`. A second Python operation, `test_data_with_bar["b"]`, then selects the element indexed by `"b"`. The intermediate name `test_data_with_bar` makes explicit that pandas sees these as two separate events, that is, two separate calls to `__getitem__`, so it has to treat them as linear operations that happen one after another.

Contrast this with `test_data.loc["b", "bar"]`, which makes a single call to `__getitem__` with the tuple `("b", "bar")`. This allows pandas to handle the lookup as a single operation. It can also be significantly faster, and it allows indexing both axes at once if desired.
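The two access paths can be spelled out step by step. The sketch below uses a hypothetical frame named `frame` with the same contents as `test_data`, and also shows `.loc` selecting along both axes in one call:

```python
import pandas as pd

# Hypothetical frame with the same contents as test_data.
frame = pd.DataFrame(
    {"foo": [1, 2, 3], "bar": [4, 5, 6], "zip": [7, 8, 9]},
    index=["a", "b", "c"],
)

# Chained indexing: two separate lookups.
column = frame["bar"]  # first __getitem__, returns a Series
value_chained = column["b"]  # second __getitem__, on that Series

# A single .loc lookup with a (row, column) pair.
value_loc = frame.loc["b", "bar"]

# .loc can also select on both axes in one call; label slices include
# both endpoints.
block = frame.loc["a":"b", ["foo", "bar"]]

print(value_chained, value_loc)
print(block)
```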

**Why does assignment fail when using chained indexing?**

The problem in the previous section is just a performance issue. What’s up with the SettingWithCopy warning? We don’t usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:

In [None]:
test_data.loc["b", "bar"] = 100
test_data

In [None]:
test_data["bar"]["b"] = 50
test_data

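Outside of simple cases, it’s very hard to predict whether the first indexing step (`test_data["bar"]`) returns a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees). If it returns a copy, the assignment modifies that temporary copy and `test_data` is left unchanged.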

In [None]:
pd.options.mode.copy_on_write = True

In [None]:
# With CoW enabled, chained assignment writes to a temporary object, so test_data is not updated
test_data["bar"]["b"] = 50

## Data Aggregation

In the world of data analysis, you often work with large datasets containing many rows of detailed information. However, for decision-making or further analysis, you usually don’t need to inspect every individual record. Instead, you need to summarize or aggregate the data to:
- Reveal Trends: Understand the overall performance (e.g., total sales, average ratings) rather than just individual data points.
- Compare Groups: Compare different categories, such as sales by region or performance by department.
- Simplify Analysis: Reduce the data to a manageable size while preserving essential patterns.
- Enhance Reporting: Create meaningful summaries that are easy to visualize and interpret.

Data aggregation helps to condense your dataset, making it easier to draw insights and take informed actions.

Before diving into the groupby method, it’s important to understand the **split-apply-combine** strategy—a common paradigm in data analysis that underlies many aggregation techniques.
- **Split**: Divide the dataset into groups based on one or more key variables (for example, grouping sales records by store).
- **Apply**: Perform an operation on each group independently. This could be a statistical function like sum, mean, or even a custom transformation.
- **Combine**: Merge the individual results from each group back into a single data structure.

This strategy allows analysts to work on each group separately and then bring the results together in a concise summary.

In [None]:
# To better understand what split-apply-combine is doing, here's a manual
# implementation using a for loop.
# Don't use this in practice, it's just for illustration.

mean_prices = {}

for laptop_category in laptops["category"].unique():
    category_group = laptops[laptops["category"] == laptop_category]
    mean_price = category_group["price_euros"].mean()
    mean_prices[laptop_category] = mean_price

print(mean_prices)

# Plot the mean prices for each category
plt.figure(figsize=(10, 6))
plt.bar(mean_prices.keys(), mean_prices.values())
plt.xlabel("Laptop Category")
plt.ylabel("Mean Price (Euros)")
plt.title("Mean Laptop Prices by Category")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Pandas implements the split-apply-combine strategy using its powerful `groupby` method. Here’s what happens under the hood:
- **Splitting the Data**: When you call `df.groupby('Column')`, pandas scans the specified column and divides the DataFrame into subsets, where each subset contains rows sharing the same value in that column.
- **Applying a Function**: Once the data is split, you can apply a function to each group. This function could be an aggregation (like sum, mean, min, or max), a transformation (like scaling or normalizing), or even a filtering function. Pandas efficiently applies these operations to each subset.
- **Combining the Results**: After the function is applied, pandas combines the output into a new DataFrame or Series. This new structure presents the aggregated results in a clear, tabular format.

The elegance of groupby is that it abstracts away the need for explicit loops, providing a more efficient and readable approach to data aggregation.

Suppose you have a dataset of sales records for different stores. You want to calculate the total sales per store.

In [None]:
data = {"Store": ["A", "A", "B", "B", "C", "C"], "Sales": [100, 200, 150, 250, 300, 400]}
df = pd.DataFrame(data)

# Group by 'Store' and sum the 'Sales'
total_sales = df.groupby("Store")["Sales"].sum()

print("Total sales per store:")
total_sales
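The groupby description above also mentions transformations and filters, which the sum example does not cover. The following sketch reuses the same made-up store data (the variable names and the 350 threshold are hypothetical) to show all three kinds of operation side by side:

```python
import pandas as pd

# Same hypothetical sales records as in the cell above.
sales = pd.DataFrame(
    {
        "Store": ["A", "A", "B", "B", "C", "C"],
        "Sales": [100, 200, 150, 250, 300, 400],
    }
)
grouped = sales.groupby("Store")["Sales"]

# Aggregation: one value per group.
totals = grouped.sum()

# Transformation: one value per original row, here each sale expressed
# as a share of its store's total.
share = sales["Sales"] / grouped.transform("sum")

# Filtering: keep only the rows of groups that pass a test
# (the threshold of 350 is arbitrary).
big_stores = sales.groupby("Store").filter(lambda g: g["Sales"].sum() > 350)

print(totals)
print(share)
print(big_stores)
```

The aggregation returns one row per store, the transformation keeps the original six rows, and the filter drops store A entirely because its total is 300.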

In [None]:
print(laptops.groupby("category", observed=True).groups)

In [None]:
# Get maximum price for each category
max_prices = laptops.groupby("category", observed=True)["price_euros"].max()
max_prices

You can also perform multiple operations on grouped data at once. 

In [None]:
def dif(group):
    return group.max() - group.min()


laptops.groupby("category", observed=True)["price_euros"].agg(["mean", "max", "median", "std", dif])
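As a variation on the list-based `agg` call, pandas also accepts keyword arguments that name each output column. The sketch below is an illustration on made-up data (the frame `toy` and the helper `price_range` are hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical data; column names mirror the laptops dataset.
toy = pd.DataFrame(
    {
        "category": ["gaming", "gaming", "notebook", "notebook", "notebook"],
        "price_euros": [1500.0, 2500.0, 600.0, 800.0, 1000.0],
    }
)


def price_range(group):
    """Return the difference between the highest and lowest value in the group."""
    return group.max() - group.min()


# Each keyword becomes a column name in the result.
summary = toy.groupby("category")["price_euros"].agg(
    mean_price="mean",
    max_price="max",
    spread=price_range,
)
print(summary)
```

Naming the outputs this way can make the result easier to read than columns labelled after the function names.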

Data aggregation is a vital step in the data analysis process, as it transforms large, detailed datasets into meaningful summaries. The split-apply-combine strategy is the conceptual framework behind many aggregation techniques.

In [None]:
# Exercise 1: Are laptops made by Apple more expensive than those made by
# other manufacturers?
laptops.groupby("manufacturer", observed=False)["price_euros"].mean().sort_values(ascending=False)

In [None]:
# Exercise 2: What is the best value laptop with a screen size of 38 cm or more?
cols_to_show = ["manufacturer", "model_name", "price_euros", "screen_size_cm"]
best_value_laptop = laptops.loc[laptops["screen_size_cm"] >= 38, cols_to_show].sort_values("price_euros")
print(f"Best value laptop is {best_value_laptop.iloc[0]['manufacturer']} {best_value_laptop.iloc[0]['model_name']} with price {best_value_laptop.iloc[0]['price_euros']} euros")
best_value_laptop.head()

In [None]:
# Exercise 3: Which laptop has the most RAM?
cols_to_show = ["manufacturer", "model_name", "ram_gb"]
most_ram_laptop = laptops.loc[laptops["ram_gb"] == laptops["ram_gb"].max(), cols_to_show]
most_ram_laptop

## Reshaping Data: Melting and Pivoting

In data analysis, the structure of your dataset plays a crucial role in determining the ease and effectiveness of your analysis. Often, data needs to be reshaped to fit the requirements of specific analytical methods or visualization tools. Pandas, a powerful data manipulation library in Python, provides robust functions, namely `melt()` and `pivot_table()`, to facilitate the transformation between different data formats.

In [None]:
laptops.head()

[`pandas.pivot_table`](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html): Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

The <code>df.pivot_table()</code> method can perform the same kinds of aggregations as the <code>df.groupby</code> method and make the code for complex aggregations easier to read. 

In [None]:
# Using the groupby method to get the mean price for each category
laptops.groupby("category", observed=True)["price_euros"].mean()

In [None]:
# Same as above but using pivot_table
laptops_pivot_category = laptops.pivot_table(values="price_euros", index="category", aggfunc="mean", observed=False)
laptops_pivot_category

Keep in mind that this method returns a DataFrame, so normal DataFrame filtering and methods can be applied to the result. For example, let's use the `DataFrame.plot()` method to create a visualization.


In [None]:
laptops_pivot_category.plot(
    kind="barh",
    figsize=(10, 6),
    color="skyblue",
    title="Mean Price by Category",
    xlabel="Price (Euros)",
    ylabel="Category",
    legend=False,
)
plt.show()

In [None]:
laptops.pivot_table(values=["price_euros", "weight_kg"], index="category", aggfunc="mean", observed=False)

In [None]:
laptops.pivot_table(
    values=["price_euros", "weight_kg"],
    index=["manufacturer", "category"],
    aggfunc=["mean", "max", "min"],
    observed=False,
    margins=True,
)

[`pandas.DataFrame.melt`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html#pandas.DataFrame.melt): Unpivot a DataFrame from wide to long format, optionally leaving identifiers set. This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

In [None]:
laptops.melt(id_vars=["model_name", "category"], value_vars=["ram_gb", "weight_kg"]).head()
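To see how melting and pivoting relate, the sketch below runs a round trip on a tiny, made-up table (the model names and values are hypothetical). It melts a wide table into long format and then uses `pivot_table` to recover the wide layout, which works here because each model and measurement pair occurs only once:

```python
import pandas as pd

# Hypothetical wide table: one row per model, one column per measurement.
wide = pd.DataFrame(
    {
        "model_name": ["X1", "Y2"],
        "ram_gb": [8, 16],
        "weight_kg": [1.3, 2.1],
    }
)

# Wide -> long: each (model, measurement) pair becomes its own row.
long_df = wide.melt(
    id_vars="model_name",
    value_vars=["ram_gb", "weight_kg"],
    var_name="measurement",
    value_name="value",
)

# Long -> wide: with one value per pair, the default mean aggregation
# simply returns that value.
back = long_df.pivot_table(index="model_name", columns="measurement", values="value")

print(long_df)
print(back)
```

The `var_name` and `value_name` arguments replace the default column names `variable` and `value` mentioned in the description above.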

## End-to-End EDA Workflow

Practical example.