# (PART) DATA EXPLORATION {-}
# How do you create a project directory ready for analysis?

## Explanation  
Before working with data, it's important to set up a clean and organized project directory. A consistent folder structure helps you manage scripts, datasets, and outputs across both Python and R — making your work easier to follow and share.

In this guide, we’ll create a root directory called `general-data-science` with four folders:

- `data/` – for datasets  
- `scripts/` – for code files  
- `images/` – for plots and charts  
- `library/` – for reusable functions  

---

**Example Folder Structure:**

```plaintext
general-data-science/
├── data/
├── scripts/
├── images/
└── library/
```

---



## Bash (Terminal)

You can create the entire structure with a single `mkdir` command (the brace expansion requires Bash or a compatible shell), then move into the new directory:

```bash
mkdir -p general-data-science/{data,scripts,images,library}
cd general-data-science
```



## Python Code

You can also create the same folder structure in Python:


```python
import os

folders = ["data", "scripts", "images", "library"]
root = "general-data-science"

os.makedirs(root, exist_ok=True)
for folder in folders:
    os.makedirs(os.path.join(root, folder), exist_ok=True)

print(f"Created '{root}' project folder with subdirectories.")
```
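
For a reusable variant, the same steps can be wrapped in a small function built on `pathlib`. This is a minimal sketch: the helper name `create_project` is hypothetical, and the example runs inside a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

FOLDERS = ["data", "scripts", "images", "library"]

def create_project(root, folders=FOLDERS):
    """Create root and its subfolders; return the subfolder paths."""
    root = Path(root)
    created = []
    for name in folders:
        path = root / name
        # parents=True also creates the root; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

with tempfile.TemporaryDirectory() as tmp:
    paths = create_project(Path(tmp) / "general-data-science")
    created_names = sorted(p.name for p in paths)
    all_exist = all(p.is_dir() for p in paths)

# Relative paths inside the project are built the same way
csv_path = Path("data") / "iris.csv"
print(created_names, all_exist, csv_path.as_posix())
```

Because `mkdir` is called with `exist_ok=True`, running the function twice on the same root should not raise an error.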

## R Code

Here’s how to do it in R:

```R
folders <- c("data", "scripts", "images", "library")
root <- "general-data-science"

if (!dir.exists(root)) dir.create(root)
for (folder in folders) {
  dir.create(file.path(root, folder), showWarnings = FALSE)
}

cat("Created", root, "project folder with subdirectories.\n")
```

> ✅ A clean project directory helps you stay organized, reuse code, and avoid errors — it’s the first step toward reproducible, professional data science.

# How do you install basic data science tools and libraries for Python and R?

## Explanation  
Before you can analyze data in Python or R, you need to install essential libraries. These libraries provide tools for **data manipulation**, **visualization**, **statistical analysis**, and **machine learning** — the four core layers in the CDI learning system. Installing them ensures you're ready to explore datasets and build reproducible workflows.

---

## Python Tools

In your terminal or command prompt, run:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn scipy
```

Then, import and check versions to confirm installation:


```python
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy

from scipy import stats  # submodule used later for statistical tests

# Print each library's version to confirm the installation
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", sns.__version__)
print("scikit-learn:", sklearn.__version__)
print("scipy:", scipy.__version__)
```


**Installed Python libraries by layer:**

- 🧹 **EDA**:  
  - `pandas` – Tabular data structures and data cleaning tools  
  - `numpy` – Efficient array operations for numerical computing  

- 📊 **Visualization**:  
  - `matplotlib` – Customizable static and interactive plots  
  - `seaborn` – Statistical data visualizations built on matplotlib  

- 📐 **Statistical Analysis**:  
  - `scipy.stats` – Tools for distributions, t-tests, ANOVA, correlation, and more  

- 🤖 **Machine Learning**:  
  - `scikit-learn` – Algorithms and utilities for classification, regression, clustering, model evaluation  

---


## R Tools

```{r eval=FALSE, echo=TRUE}
# -----------------------------
# 📊 EDA (Exploratory Data Analysis)
# -----------------------------
if (!require(tidyverse)) install.packages("tidyverse")
library(tidyverse)

# -----------------------------
# 📈 Visualization
# -----------------------------
if (!require(GGally)) install.packages("GGally")
library(GGally)

# -----------------------------
# 📐 Statistical Analysis (STATS)
# -----------------------------
if (!require(broom)) install.packages("broom")
library(broom)

if (!require(car)) install.packages("car")
library(car)

if (!require(emmeans)) install.packages("emmeans")
library(emmeans)

# -----------------------------
# 🤖 Machine Learning
# -----------------------------
if (!require(caret)) install.packages("caret")
library(caret)
```


**Installed R packages by layer:**

- 🧹 **EDA**:  
  - `tidyverse` – A collection of packages for tidy data workflows:  
    - `dplyr` (data manipulation)  
    - `readr` (reading CSV and text files)  
    - `tidyr` (reshaping data)  
    - `tibble` (modern data frames)  
    - `stringr` (string operations)  
    - `forcats` (working with factors)  
    - `ggplot2` (visualization)  
    - `purrr` (functional programming)

- 📊 **Visualization**:  
  - `ggplot2` – Grammar of graphics for elegant visualizations (included in tidyverse)  
  - `GGally` – Enhances ggplot2 with matrix plots, correlation plots, etc.  

- 📐 **Statistical Analysis**:  
  - `broom` – Converts model outputs into tidy data frames  
  - `car` – Tools for regression diagnostics, ANOVA, linear models  
  - `emmeans` – Estimated marginal means for post-hoc testing and comparisons  

- 🤖 **Machine Learning**:  
  - `caret` – A unified framework for training, tuning, and comparing models across many algorithms  

> ✅ Once these tools are installed, you’ll be ready to acquire datasets and begin your analysis.

# What are common sources of datasets for Python and R?

## Explanation

Before working with data, it’s important to know **where data comes from**. In both Python and R, you can use:

1. **Public datasets** from libraries or platforms  
2. **Downloaded datasets** from repositories  
3. **Real-world data** from research, surveys, APIs, or government sources

These sources help you practice data skills using real, structured information.

**Common sources include:**

- **Built-in datasets**:  
  - Python: `seaborn`, `sklearn.datasets`, `statsmodels`, `pydataset`  
  - R: `datasets` package, `MASS`, `ggplot2`, `palmerpenguins`


- **Online repositories**:  
  - [UCI Machine Learning Repository](https://archive.ics.uci.edu/)  
  - [Kaggle Datasets](https://www.kaggle.com/datasets)  
  - [data.gov](https://www.data.gov/)  
  - [World Bank Open Data](https://data.worldbank.org/), along with the open data portals of the WHO and the UN


- **Research & Surveys**:  
  - CSV/Excel/JSON files published with academic papers or institutions  
  - Survey data from organizations (e.g., Pew Research, Eurostat)

- **APIs and live feeds**:  
  - Weather, financial markets, genomics, social media (e.g., Twitter API)

- **Local files**:  
  - Saved from tools like Excel, Google Sheets, SPSS, or exported from databases

## Python Package-Based Datasets

- **Seaborn**:

  ```python
  import seaborn as sns

  # Downloads the dataset on first use (requires an internet connection)
  iris = sns.load_dataset("iris")
  print(iris.head())
  ```

- **Scikit-learn**:



  ```python
  from sklearn import datasets

  # Bundled with scikit-learn; .data holds the feature matrix as a NumPy array
  irisml = datasets.load_iris()
  print(irisml.data[:5])
  ```


- **Statsmodels**:

  ```python
  import statsmodels.api as sm

  # Fetches an R dataset over the internet and returns it as a DataFrame
  df = sm.datasets.get_rdataset("Guerry", "HistData").data
  ```



## R Package-Based Datasets

- **datasets** package:

  ```{r}
  iris <- datasets::iris
  head(iris)
  ```

- **ggplot2**:

  ```{r}
  data("diamonds", package = "ggplot2")
  head(diamonds)
  ```

- **palmerpenguins** (if installed):

  ```{r}
  library(palmerpenguins)
  head(penguins)
  ```

---

## Online Public Data Sources

| Source                       | Link                                                |
|-----------------------------|-----------------------------------------------------|
| UCI Machine Learning Repo   | https://archive.ics.uci.edu/                        |
| Kaggle Datasets             | https://www.kaggle.com/datasets                     |
| data.gov (US Government)    | https://www.data.gov                                |
| Awesome Public Datasets     | https://github.com/awesomedata/awesome-public-datasets |
| World Bank Open Data        | https://data.worldbank.org/                         |

> 💡 Tip: Always save downloaded datasets in your `data/` folder and reference them using relative paths like `data/filename.csv`.

---

> ✅ Now that you know where to find data, let’s learn how to save, load and preview it in your Python or R environment.


# How do you save a dataset in Python and R?

## Explanation

Once you've cleaned or prepared a dataset, it's good practice to save it in a standard format like CSV. This allows you to:

- Preserve your cleaned version for future use
- Avoid repeating preprocessing steps
- Share your data with others or load it in different tools

In this example, we'll use sample datasets provided by libraries in Python and R, then save them into the `data/` folder using `to_csv()` in Python and `write_csv()` in R.

---

## Python Code




```python
import pandas as pd
from sklearn import datasets
import seaborn as sns
import os

# Create data folder
os.makedirs("data", exist_ok=True)

# Save seaborn's iris dataset
df_iris_seaborn = sns.load_dataset("iris")
df_iris_seaborn.to_csv("data/iris_seaborn.csv", index=False)

print("\nSeaborn Iris\n", df_iris_seaborn.head())

# Save sklearn iris as well (optional)
iris_sklearn = datasets.load_iris(as_frame=True).frame
iris_sklearn.to_csv("data/iris_sklearn.csv", index=False)

print("\nSklearn Iris\n", iris_sklearn.head())

print("Datasets saved successfully.")
```
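
To see why `index=False` is passed to `to_csv()`, the following sketch saves a tiny made-up DataFrame both ways and reloads it. The file names and the data are assumptions for illustration; everything is written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_demo = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

with tempfile.TemporaryDirectory() as tmp:
    clean_path = Path(tmp) / "no_index.csv"
    index_path = Path(tmp) / "with_index.csv"

    df_demo.to_csv(clean_path, index=False)  # row index is not written
    df_demo.to_csv(index_path)               # default: row index is written

    reloaded_clean = pd.read_csv(clean_path)
    reloaded_index = pd.read_csv(index_path)

# The second file gains an extra, unnamed first column on reload
print(reloaded_clean.columns.tolist())
print(reloaded_index.columns.tolist())
```

Without `index=False`, the row index is stored as an extra column, which then has to be dropped or ignored every time the file is loaded.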

## R Code

```{r}
# Load necessary libraries
library(readr)
library(datasets)

# Create 'data/' directory if it doesn't exist
if (!dir.exists("data")) dir.create("data")

# Save the built-in iris dataset
write_csv(iris, "data/iris_rbase.csv")

cat("Datasets saved successfully.\n")


```


> ✅ After saving your cleaned or example dataset, you can now load it for further analysis or visualization in future sessions.

# How do you load a pre-cleaned dataset in Python and R?

## Explanation

Loading a dataset is one of the first steps in any data analysis project — especially when you're working from a previously saved, cleaned version.

In this guide, we assume that a CSV file such as `iris_seaborn.csv` (Python) or `iris_rbase.csv` (R) has already been saved in your `data/` folder. We’ll now load it using:

- **Python**: via the `pandas` library and its `read_csv()` function
- **R**: using the `readr` package and its `read_csv()` function

Using consistent relative paths (like `data/iris_seaborn.csv`) keeps the code reproducible across machines, as long as you run it from the project root.


---

## Python Code




```python
import pandas as pd

# Load the pre-cleaned iris dataset
df = pd.read_csv("data/iris_seaborn.csv")

# Preview the data
print(df.head())

# Confirm the shape
print("Rows and columns:", df.shape)
```
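
`read_csv()` also accepts options that control what gets loaded. The sketch below uses an in-memory CSV (made-up rows, read through `io.StringIO`) so it runs without any file on disk; with a real file you would pass the path instead.

```python
from io import StringIO

import pandas as pd

# Illustrative CSV text standing in for a file in data/
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,3.0,setosa
6.3,3.3,virginica
"""

df_small = pd.read_csv(
    StringIO(csv_text),
    usecols=["sepal_length", "species"],  # load only these columns
    dtype={"species": "category"},        # set a column type while reading
    nrows=2,                              # read only the first two data rows
)

print(df_small)
print(df_small.dtypes)
```

Limiting columns and rows at load time is mainly useful for large files, where a quick preview is cheaper than reading everything.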



## R Code

```{r}
library(readr)

# Load the pre-cleaned iris dataset
df <- read_csv("data/iris_rbase.csv")

# Preview the data
head(df)

# Dimensions
cat("Rows and columns:", dim(df), "\n")
```

---

> ✅ Once loaded, you're ready to continue with data wrangling, visualization, or modeling — all based on your clean, reusable dataset.

# How do you rename column names in Python and R?

## Explanation

When working with the same dataset in both Python and R, you may encounter slight differences in column names, such as capitalization or separators. For example, R's built-in `iris` uses `Sepal.Length`, while seaborn's copy uses `sepal_length`. To avoid confusion and ensure consistency across your analysis, it’s best to standardize the column names.

In this guide, we’ll rename the columns to lowercase with underscores:

- `sepal_length`  
- `sepal_width`  
- `petal_length`  
- `petal_width`  
- `species`

After renaming, we’ll save the final, standardized dataset as `data/iris.csv`, which will be used consistently throughout the rest of the guide.

---

## Python Code



```python
import pandas as pd

# Using seaborn version of the iris dataset
df1 = pd.read_csv("data/iris_seaborn.csv")
print("Original column names from seaborn version:", df1.columns.tolist())

# Rename columns by position (the list order must match the file's column order).
# seaborn's copy already uses these names, so this mainly documents the standard.
df1.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Save standardized version
df1.to_csv("data/iris.csv", index=False)
print("Saved standardized dataset from seaborn version as 'data/iris.csv'")
```



## R Code
```{r}
library(readr)
library(dplyr)

# Option 1: Assign all names at once by position (the order must match the columns in iris_rbase.csv)
df <- read_csv("data/iris_rbase.csv")

# Set standardized column names directly
colnames(df) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

# Save standardized dataset
write_csv(df, "data/iris.csv")
cat("Saved standardized dataset from iris_rbase.csv as data/iris.csv\n")
```


```{r}
# Option 2: If your dataset has capitalized column names (e.g., from iris_rbase.csv)
df <- read_csv("data/iris_rbase.csv")

# Rename columns using dplyr for consistency
df <- df %>%
  rename(
    sepal_length = Sepal.Length,
    sepal_width  = Sepal.Width,
    petal_length = Petal.Length,
    petal_width  = Petal.Width,
    species      = Species
  )

# Save standardized dataset
write_csv(df, "data/iris.csv")
cat("Saved standardized dataset from iris_rbase.csv as data/iris.csv\n")
```


> ✅ From this point forward, we’ll use `data/iris.csv` as the **unified, clean dataset** for all Python and R examples in the guide.

# How do you examine the structure and types of variables in Python and R?

## Explanation

Understanding the **structure** of your dataset — including data types — is a key step in exploratory data analysis. It helps you:

- Know what transformations are needed  
- Identify categorical vs. numerical variables  
- Prepare your data for modeling or visualization

Each column in your dataset has a specific **data type**. These types influence how operations behave, how memory is allocated, and how functions treat your data.

---

### ✅ Common Data Types in Python and R {-}

| Concept            | Python (`pandas`)      | R (`base`)           | Notes |
|:---------------|:------------------------|:-------------------|:--------------------------------------|
| Integer            | `int64`                | `integer`            | Use `astype(int)` or `as.integer()` |
| Decimal Number     | `float64`              | `numeric`, `double`  | `numeric` in R defaults to `double` |
| Text / String      | `str`, `object`        | `character`          | Use `astype(str)` or `as.character()` |
| Logical / Boolean  | `bool`                 | `logical`            | `True`/`False` in Python, `TRUE`/`FALSE` in R |
| Date / Time        | `datetime64[ns]`       | `Date`, `POSIXct`    | Use `pd.to_datetime()` or `as.Date()` |
| Category           | `category`             | `factor`             | Useful for grouping and modeling |
| Missing Values     | `NaN` (`numpy`)        | `NA`                 | Use `pd.isna()` or `is.na()` |
| Complex Numbers    | `complex`              | `complex`            | Rare in typical EDA workflows |
| List               | `list`                 | `list`               | R lists allow mixed data types |
| Dictionary         | `dict`                 | `named list`         | R lists with names can mimic Python dictionaries |
| Tuple              | `tuple`                | `c()`, `list()`      | No direct equivalent; use vectors or lists in R |

---
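
The conversions listed in the Notes column can be sketched in pandas as follows. The DataFrame and its column names are made up for illustration; every column except `score` starts as text and is converted explicitly.

```python
import pandas as pd

raw = pd.DataFrame({
    "count": ["1", "2", "3"],
    "price": ["1.5", "2.0", "3.25"],
    "label": ["a", "b", "a"],
    "when": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "score": [1.0, None, 3.0],  # None becomes NaN in a float column
})

typed = raw.assign(
    count=raw["count"].astype(int),          # text -> integer
    price=raw["price"].astype(float),        # text -> decimal number
    label=raw["label"].astype("category"),   # text -> category
    when=pd.to_datetime(raw["when"]),        # text -> datetime
)

print(typed.dtypes)
print("Missing scores:", typed["score"].isna().sum())
```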

## Python Code



```python
import pandas as pd

# Load the standardized dataset
df = pd.read_csv("data/iris.csv")

# View column names
print("Column names:", df.columns.tolist())

# Check data types
print("\nData types:")
print(df.dtypes)

# Optional: Use .info() for a more detailed summary
print("\nStructure info:")
df.info()
```


## R Code
```{r}
library(readr)

# Load the standardized dataset
df <- read_csv("data/iris.csv")

# View column names
names(df)

# Check data types (structure)
str(df)

# Optionally print class of each variable
sapply(df, class)
```

---

> ✅ Once you're familiar with variable types, you can decide how to clean, filter, or transform your data — and which variables are ready for plotting or modeling.

# How do you check for missing values in Python and R?

## Explanation

Before performing any analysis or modeling, it's important to check for **missing values**. These can cause errors, affect summary statistics, or bias machine learning models if not handled properly.

Python and R represent missing values differently:

- In **Python**, missing values typically appear as `NaN` (Not a Number)
- In **R**, they are represented as `NA`

We’ll use built-in functions to:

- Detect if any values are missing
- Count missing values by column
- (Optionally) summarize total missing values across the dataset

---

## Python Code



```python
import pandas as pd

# Load the standardized dataset
df = pd.read_csv("data/iris.csv")

# Check if there are any missing values
print("Any missing values?", df.isnull().values.any())

# Count missing values by column
print("\nMissing values per column:")
print(df.isnull().sum())

# Optional: total number of missing entries
print("\nTotal missing values:", df.isnull().sum().sum())
```




## R Code
```{r}
library(readr)

# Load the standardized dataset
df <- read_csv("data/iris.csv")

# Check if any missing values exist
any(is.na(df))

# Count missing values per column
colSums(is.na(df))

# Optional: total number of missing values
sum(is.na(df))
```

---
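
The Iris data has no missing values, so the three common ways of handling them (removing, imputing, flagging) are sketched here in Python on a small made-up DataFrame. Imputing with the median is only one possible choice and is an assumption of this example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 5.0],
    "species": ["setosa", "setosa", None, "virginica"],
})

# 1. Remove: drop every row that contains at least one missing value
dropped = df_demo.dropna()

# 2. Impute: fill missing numeric values with the column median
median_length = df_demo["sepal_length"].median()
imputed = df_demo.assign(
    sepal_length=df_demo["sepal_length"].fillna(median_length)
)

# 3. Flag: keep the gap but record where it was
flagged = df_demo.assign(
    sepal_length_missing=df_demo["sepal_length"].isna()
)

print(dropped)
print(imputed)
print(flagged)
```

Which option fits depends on how much data is missing and why; dropping rows is simplest but discards information.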

> ✅ Once you've identified missing values, the next step is to **decide how to handle them** — such as removing, imputing, or flagging them.

# How do you get summary statistics for numeric variables in Python and R?

## Explanation

Summary statistics provide a quick overview of your numeric data. They help you understand:

- Central tendency (mean, median)
- Spread (min, max, standard deviation, quartiles)
- Distribution shape and potential outliers

Both **Python** and **R** offer built-in functions to calculate summary statistics for each column in a dataset. These are essential when assessing data quality and preparing for visualization or modeling.

---

## Python Code



```python
import pandas as pd

# Load the dataset
df = pd.read_csv("data/iris.csv")

# Get summary statistics for all numeric columns
summary = df.describe()
print(summary)
```


> 💡 `df.describe()` returns count, mean, std, min, 25%, 50% (median), 75%, and max for each numeric column.


## R Code
```{r}
library(readr)

# Load the dataset
df <- read_csv("data/iris.csv")

# Get summary statistics
summary(df)
```

> 💡 For numeric columns, `summary()` in R returns min, 1st quartile, median, mean, 3rd quartile, and max; character columns such as `species` only show length, class, and mode.

---
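
Quartiles can also be used to screen for potential outliers. The Python sketch below applies the common 1.5 × IQR rule to a made-up series; both the data and the 1.5 multiplier are assumptions for illustration, not a rule this guide prescribes.

```python
import pandas as pd

values = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 9.8], name="sepal_length")

# Interquartile range: distance between the 1st and 3rd quartiles
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Points far outside the quartiles are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Positive skewness indicates a longer right tail
skewness = values.skew()

print("Bounds:", lower, upper)
print("Potential outliers:", outliers.tolist())
print("Skewness:", skewness)
```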

> ✅ These summaries give you a solid first look at the data distribution and can guide further steps like filtering, normalization, or visualization.

# How do you filter rows based on a condition in Python and R?

## Explanation

Filtering is one of the most important skills in data wrangling. It allows you to isolate subsets of data that meet certain conditions — for example:

- Observations above or below a threshold
- Specific categories (e.g., only one species)
- Logical combinations (e.g., long petals *and* wide sepals)

In both Python and R, filtering uses logical expressions that evaluate to true or false for each row (`True`/`False` in Python, `TRUE`/`FALSE` in R).

In this example, we’ll filter rows where:

> `sepal_length > 5.0`

---

## Python Code



```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Filter rows where sepal_length > 5.0
filtered_df = df[df["sepal_length"] > 5.0]

# View result
print(filtered_df.head())

# Confirm number of rows
print("Filtered rows:", filtered_df.shape[0])
```
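
Several conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a few made-up rows shaped like the Iris data; the thresholds are arbitrary.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 5.4],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 1.7],
    "species": ["setosa", "setosa", "virginica", "virginica", "setosa"],
})

# AND: both conditions must hold
both = df_demo[(df_demo["sepal_length"] > 5.0) & (df_demo["species"] == "setosa")]

# OR: at least one condition must hold
either = df_demo[(df_demo["sepal_length"] > 6.0) | (df_demo["petal_length"] < 1.5)]

# The same AND filter written as a query string
via_query = df_demo.query("sepal_length > 5.0 and species == 'setosa'")

print(both)
print(either)
```

The parentheses matter: `&` and `|` bind more tightly than comparison operators in Python, so omitting them usually raises an error.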


## R Code
```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Filter rows where sepal_length > 5.0
filtered_df <- df %>%
  filter(sepal_length > 5.0)

# View result
head(filtered_df)

# Confirm number of rows
nrow(filtered_df)
```


> ✅ Filtering is the gateway to conditional analysis — you can combine multiple conditions and pipe the result into visualizations or summaries.

# How do you sort rows based on a variable in Python and R?

## Explanation

Sorting lets you **organize data** by numeric values, text, or any other column. It’s useful for:

- Finding top/bottom performers
- Preparing plots or tables
- Detecting outliers and patterns

In this example, we’ll sort the Iris dataset by `petal_length` in **descending order** — meaning longest petals appear first.

---

## Python Code



```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Sort by petal_length (descending)
sorted_df = df.sort_values(by="petal_length", ascending=False)

# View result
print(sorted_df.head())
```
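
Sorting by more than one column works the same way: pass lists to `by` and `ascending`. A minimal sketch on made-up rows:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "species": ["virginica", "setosa", "setosa", "virginica"],
    "petal_length": [5.1, 1.4, 1.7, 6.0],
})

# Species A-Z, then longest petals first within each species
multi_sorted = df_demo.sort_values(
    by=["species", "petal_length"],
    ascending=[True, False],
)

# Shortcut for "top n rows by one column"
top_two = df_demo.nlargest(2, "petal_length")

print(multi_sorted)
print(top_two)
```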


## R Code
```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Sort by petal_length (descending)
sorted_df <- df %>%
  arrange(desc(petal_length))

# View result
head(sorted_df)
```

---

> ✅ Sorting helps highlight extremes and trends. You can sort by multiple columns or chain it with filtering for advanced workflows.

# How do you create a new variable in Python and R?

## Explanation

Creating new variables — also called **feature engineering** — allows you to extract more insight from your data. A new variable can be based on:

- Arithmetic between columns  
- Logical comparisons  
- Conditional rules

In this example, we’ll create a new column called `petal_ratio`, calculated as:

> `petal_length / petal_width`

This ratio can help distinguish species based on shape.

---

## Python Code



```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Create a new column
df["petal_ratio"] = df["petal_length"] / df["petal_width"]

# Preview result
print(df[["petal_length", "petal_width", "petal_ratio"]].head())
```
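
New variables can also come from logical comparisons and conditional rules. In the sketch below, the 4.0 cut-off and the `long`/`short` labels are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
})

# Logical comparison: a True/False column
df_demo["is_long"] = df_demo["petal_length"] > 4.0

# Conditional rule: pick one of two labels per row
df_demo["size_class"] = np.where(df_demo["petal_length"] > 4.0, "long", "short")

print(df_demo)
```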


## R Code
```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Create a new column
df <- df %>%
  mutate(petal_ratio = petal_length / petal_width)

# Preview result
df %>%
  select(petal_length, petal_width, petal_ratio) %>%
  head()
```

---

> ✅ Creating new variables gives you more ways to explore and model your data — it’s a key step in both EDA and machine learning pipelines.

# How do you detect and remove duplicate rows in Python and R?

## Explanation

Duplicate rows can arise from data entry errors, merging datasets, or exporting data multiple times. Identifying and removing them is an important step in **data cleaning** to ensure that your analysis isn’t biased or inflated.

In both Python and R, we can:

- Detect duplicates  
- Count them  
- Drop them if needed

---

## Python Code



```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Check for duplicate rows
duplicates = df.duplicated()
print("Any duplicates?", duplicates.any())

# Count duplicate rows
print("Number of duplicates:", duplicates.sum())

# Remove duplicates
df_cleaned = df.drop_duplicates()

# Confirm removal
print("New shape:", df_cleaned.shape)
```
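
By default, `duplicated()` marks only the second and later copies of a row and compares all columns. The `keep` and `subset` parameters change that, as this sketch on made-up rows shows:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "sepal_length": [5.1, 5.1, 4.9, 5.1],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

# Default: the first copy is kept, later copies are marked
n_later_copies = int(df_demo.duplicated().sum())

# keep=False marks every row that has a twin, including the first copy
n_all_copies = int(df_demo.duplicated(keep=False).sum())

# subset: compare only selected columns when looking for duplicates
unique_lengths = df_demo.drop_duplicates(subset=["sepal_length"])

# keep="last" retains the last copy instead of the first
keep_last = df_demo.drop_duplicates(keep="last")

print(n_later_copies, n_all_copies)
print(unique_lengths)
print(keep_last)
```

Inspecting the rows flagged with `keep=False` before dropping anything helps confirm that they are true duplicates rather than legitimate repeated measurements.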



## R Code
```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Check for duplicate rows
duplicates <- duplicated(df)
cat("Any duplicates?", any(duplicates), "\n")
cat("Number of duplicates:", sum(duplicates), "\n")

# Remove duplicates
df_cleaned <- df %>%
  distinct()

# Confirm new size
cat("New number of rows:", nrow(df_cleaned), "\n")
```

---

> ✅ Cleaning duplicates ensures your results reflect **true observations** and not duplicated data points.

# How do you export a cleaned dataset in Python and R?

## Explanation

After cleaning and transforming your data — renaming columns, removing duplicates, creating new variables — it’s good practice to **save the final version**.

This ensures:

- You don’t have to redo your work  
- Others can use the clean data  
- You can start fresh from a reliable version for future steps (like visualization or modeling)

In this example, we’ll export our cleaned Iris dataset to a CSV file called `iris_cleaned.csv` in the `data/` folder.

---

## Python Code



```python
import pandas as pd

# Load and optionally clean dataset
df = pd.read_csv("data/iris.csv")
df_cleaned = df.drop_duplicates()

# Export to new CSV file
df_cleaned.to_csv("data/iris_cleaned.csv", index=False)
print("Cleaned dataset saved as data/iris_cleaned.csv")
```



## R Code
```{r}
library(readr)
library(dplyr)

# Load and optionally clean dataset
df <- read_csv("data/iris.csv")
df_cleaned <- df %>%
  distinct()

# Export to new CSV file
write_csv(df_cleaned, "data/iris_cleaned.csv")
cat("Cleaned dataset saved as data/iris_cleaned.csv\n")
```

---

> ✅ With your cleaned dataset saved, you're now ready to begin **visualizing**, **modeling**, or **sharing** your data — with confidence.

# How do you convert variable types in Python and R?

## Explanation

Sometimes your data has columns in the wrong type — for example, a numeric column stored as text, or a categorical variable treated as a string. This can affect grouping, plotting, or modeling.

In this example, we’ll convert:

- The species column to a categorical variable
- A numeric column to string (for labeling)

---

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Convert species to categorical
df["species"] = df["species"].astype("category")

# Convert sepal_length to string (optional use case)
df["sepal_length_str"] = df["sepal_length"].astype(str)

# Confirm types (print all columns so the new one is included)
print(df.dtypes)
```
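
The opposite case, a numeric column stored as text, can be handled with `pd.to_numeric()`. This sketch uses a made-up series containing one unparseable entry; `errors="coerce"` turns such entries into `NaN` instead of raising an error.

```python
import pandas as pd

raw = pd.Series(["5.1", "4.9", "n/a", "6.3"], name="sepal_length")

# Text -> numbers; values that cannot be parsed become NaN
numeric = pd.to_numeric(raw, errors="coerce")
n_failed = int(numeric.isna().sum())

# A categorical column exposes its distinct levels via .cat.categories
species = pd.Series(["setosa", "virginica", "setosa"]).astype("category")

print(numeric)
print("Values that could not be converted:", n_failed)
print(list(species.cat.categories))
```

Counting the coerced values afterwards is a quick way to check how many entries were silently turned into missing data.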

## R Code

```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Convert species to factor
df <- df %>%
  mutate(species = as.factor(species))

# Convert sepal_length to character
df <- df %>%
  mutate(sepal_length_str = as.character(sepal_length))

# Confirm structure
str(df)

```

> ✅ Converting variable types ensures that each column behaves correctly in your analysis or visualization.

# How do you group and summarize data in Python and R?

## Explanation

Grouping data lets you compare subsets — like computing the average petal length for each species. This is a powerful technique for understanding patterns and trends.

---

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Group by species and compute average petal length
grouped = df.groupby("species")["petal_length"].mean().reset_index()

print(grouped)
```
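
Several statistics can be computed per group in one step with named aggregation. The rows below are made up, and the output column names (`avg_petal_length`, `max_petal_length`, `n`) are chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.0, 6.0, 5.5],
})

# Each keyword is new_column=(source_column, aggregation)
summary = (
    df_demo.groupby("species")
    .agg(
        avg_petal_length=("petal_length", "mean"),
        max_petal_length=("petal_length", "max"),
        n=("petal_length", "size"),
    )
    .reset_index()
)

print(summary)
```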

## R Code

```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Group by species and summarize
df_summary <- df %>%
  group_by(species) %>%
  summarise(avg_petal_length = mean(petal_length, na.rm = TRUE))

df_summary
```

> ✅ Grouping and summarizing help uncover relationships across categories in your data.

# How do you drop or reorder columns in Python and R?

## Explanation

Sometimes you want to exclude a column or rearrange the order of columns for reporting or modeling. This improves readability and usability.

---

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Drop petal_width column
df_dropped = df.drop(columns=["petal_width"])

# Reorder columns
cols = ["species", "sepal_length", "sepal_width", "petal_length"]
df_reordered = df[cols]

print(df_reordered.head())
```
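
When a dataset has many columns, typing the full order by hand becomes tedious. One possible approach, sketched here on a single made-up row, is to name only the columns that should come first and append the rest automatically.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "sepal_length": [5.1],
    "petal_length": [1.4],
    "species": ["setosa"],
})

# Move selected columns to the front, keep the others in their original order
front = ["species"]
ordered_cols = front + [c for c in df_demo.columns if c not in front]
df_front = df_demo[ordered_cols]

# errors="ignore" skips names that are not present instead of raising
df_no_petal = df_demo.drop(columns=["petal_length", "not_a_column"], errors="ignore")

print(df_front.columns.tolist())
print(df_no_petal.columns.tolist())
```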

## R Code

```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Drop petal_width column
df_dropped <- df %>%
  select(-petal_width)

# Reorder columns
df_reordered <- df %>%
  select(species, sepal_length, sepal_width, petal_length)

head(df_reordered)
```

> ✅ Dropping and reordering columns gives you control over which features to keep and how to present them.

# How do you subset specific columns in Python and R?

## Explanation

You may want to work with only a few columns at a time — for visualization, inspection, or modeling. This helps reduce clutter and focus on key variables.

---

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Select specific columns
subset = df[["sepal_length", "sepal_width", "species"]]

print(subset.head())
```

## R Code

```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Select specific columns
subset <- df %>%
  select(sepal_length, sepal_width, species)

head(subset)
```

> ✅ Subsetting lets you focus your analysis on the most relevant columns.

# How do you sample rows randomly in Python and R?

## Explanation

Sampling is useful when working with large datasets, doing quick checks, or creating training/test splits. You can randomly select a few rows for inspection.

---

## Python Code

```python
import pandas as pd

# Load dataset
df = pd.read_csv("data/iris.csv")

# Random sample of 5 rows
sampled = df.sample(n=5, random_state=42)

print(sampled)
```
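
The same method can produce a simple training/test split: sample a fraction of the rows, then take whatever is left. This is a minimal sketch on made-up data; the 80/20 ratio is an assumption, and dedicated tools such as scikit-learn's `train_test_split` offer more options.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "id": range(10),
    "value": [x * 1.5 for x in range(10)],
})

# 80% of the rows, chosen at random but reproducibly
train = df_demo.sample(frac=0.8, random_state=42)

# The remaining rows: everything whose index is not in the training set
test = df_demo.drop(train.index)

print("Training rows:", len(train))
print("Test rows:", len(test))
```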

## R Code

```{r}
library(readr)
library(dplyr)

# Load dataset
df <- read_csv("data/iris.csv")

# Random sample of 5 rows
set.seed(42)
sampled <- df %>%
  sample_n(5)

sampled
```

> ✅ Sampling lets you explore or test your code on a small, representative subset instead of working with every row.

# How do you take a random sample from a large dataset in Python and R?

## Explanation

Sampling is useful when working with large datasets, doing quick inspections, or preparing training/test splits. It lets you explore a manageable portion of the data instead of processing every row.

In this example, we’ll take a **random sample of 500 rows** from the `diamonds` dataset, which has 53,940 rows and is considerably more varied than the `iris` dataset we've used so far.

### 🎨 Why Switch to the Diamonds Dataset? {-}

The **Iris dataset** is a great starting point for learning structure, filtering, and variable creation — but it’s small and limited in variety.

The **Diamonds dataset** is much larger and richer, with:

- Categorical variables like `cut`, `color`, and `clarity`  
- Continuous variables like `price`, `carat`, and dimensions  
- Greater potential for beautiful and insightful visualizations

We’ll continue using **Iris** for clarity and comparison, but starting in the **Visualization Layer**, you’ll also work with a **sampled Diamonds dataset** to practice chart types like:

- Boxplots grouped by cut or clarity  
- Scatter plots of carat vs. price  
- Histograms of distribution patterns

Sampling allows us to keep things fast and responsive while still leveraging the power of a large, visual-friendly dataset.

---


## Python Code


```python
import pandas as pd
import seaborn as sns

# Load full diamonds dataset
diamonds = sns.load_dataset("diamonds")

# Take a random sample of 500 rows
sampled_diamonds = diamonds.sample(n=500, random_state=42)

# Save to CSV (optional)
sampled_diamonds.to_csv("data/diamonds_sample.csv", index=False)

# View first few rows
print(sampled_diamonds.head())
```


## R Code

```{r}
library(ggplot2)
library(dplyr)
library(readr)

# Load full diamonds dataset
data("diamonds")  # loads dataset into the environment

# Access the dataset
diamonds_df <- diamonds

# Take a random sample of 500 rows
set.seed(42)
diamonds_sample <- sample_n(diamonds_df, 500)

# Save to CSV (optional)
write_csv(diamonds_sample, "data/diamonds_sample.csv")

# View first few rows
head(diamonds_sample)
```

# EDA Summary {-}

You’ve successfully completed the **Exploratory Data Analysis (EDA)** layer of the CDI Learning System — working hands-on in both Python and R to load, inspect, clean, transform, and save your data.

This foundational layer has equipped you with the confidence to move forward — not just knowing how to use code, but how to think like a data scientist.

---

### 🧱 What You’ve Accomplished {-}

- ✅ Set up a clean, organized project workspace  
- ✅ Practiced essential EDA techniques with real data  
- ✅ Standardized and saved a reusable dataset (`data/iris.csv`)  
- ✅ Built fluency in both Python and R side by side

---

## 📈 What Comes After EDA? {-}

Now that you’ve completed the **Exploratory Data Analysis (EDA)** layer, you’re ready to move beyond structure and cleaning — and into **data storytelling**.

The next stages of your journey include:

- 🎨 **Data Visualization (VIZ)** — turn data into clear, compelling visual insights  
- 📐 **Statistical Analysis (STATS)** — test hypotheses and draw valid conclusions  
- 🤖 **Machine Learning (ML)** — build predictive models using real-world data

In these upcoming layers, you’ll continue working with familiar datasets — and explore new ones tailored to each domain.

---

## 🚀 Continue Learning with CDI {-}

Looking to go further or dive into another domain?

📚 **[Explore All CDI Products →](https://complexdatainsights.com/explore-products/)**

> ✅ With a strong analytical foundation, you're now equipped to **structure, inspect, and transform data confidently — preparing it for visualization, statistical analysis, and machine learning.**