# Exploratory Data Analysis (EDA): Initial Exploration

---

## 1. Introduction to EDA

- **Purpose of EDA:**
    - Review and clean data
    - Derive insights (e.g., summary statistics, correlations)
    - Generate hypotheses for experiments
    - Inform next steps: further analysis, modeling, or data collection

---

## 2. What Is Exploratory Data Analysis?

- **Definition:**  
  EDA is the process of reviewing and cleaning data to:
    - Derive insights (e.g., descriptive statistics, relationships)
    - Generate hypotheses
    - Decide on next steps (modeling, experiments, or even discarding the data)
- **Why:**  
  You can only ask good questions and draw useful conclusions once you know what your data contains.

---

## 3. Loading and Previewing the Data

Let's start by importing the dataset and taking a first look at it.

### **Code Example: Loading Data and Viewing the First Rows**

```python
import pandas as pd

# Load the books data from a CSV file
books = pd.read_csv("books.csv")

# Preview the first five rows
books.head()
```

**Output:**  
| name                         | author                  | rating | year | genre       |
|------------------------------|-------------------------|--------|------|-------------|
| 10-Day Green Smoothie Cleanse| JJ Smith                | 4.73   | 2016 | Non Fiction |
| 11/22/63: A Novel            | Stephen King            | 4.62   | 2011 | Fiction     |
| 12 Rules for Life            | Jordan B. Peterson      | 4.69   | 2018 | Non Fiction |
| 1984 (Signet Classics)       | George Orwell           | 4.73   | 2017 | Fiction     |
| 5,000 Awesome Facts          | National Geographic Kids| 4.81   | 2019 | Childrens   |

**Explanation:**
- `import pandas as pd`: Imports the pandas library for data manipulation.
- `pd.read_csv("books.csv")`: Reads the CSV file into a DataFrame called `books`.
- `books.head()`: Returns the first 5 rows of the DataFrame for a quick look at the structure and content.

**Significance:**  
- Quickly see what columns exist and what kind of data you are working with.
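
As a small variation on the preview step, the sketch below shows that `.head()` accepts a row count and that `.tail()` and `.shape` complement it. The CSV text is made-up sample data standing in for `books.csv`, and the names `csv_text` and `sample_books` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for books.csv (values are made up)
csv_text = """name,author,rating,year,genre
Book A,Author One,4.7,2016,Non Fiction
Book B,Author Two,4.6,2011,Fiction
Book C,Author Three,4.8,2019,Childrens
Book D,Author Four,4.4,2011,Fiction
"""

# read_csv accepts a file path or any file-like object
sample_books = pd.read_csv(io.StringIO(csv_text))

first_two = sample_books.head(2)  # first two rows instead of the default five
last_two = sample_books.tail(2)   # last two rows
print(first_two)
print(last_two)
print(sample_books.shape)         # (number of rows, number of columns)
```

Looking at both ends of the data can help catch files that were truncated or that end with stray footer rows.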

---

## 4. Summarizing Data Structure and Quality

### **Code Example: Checking DataFrame Info**

```python
books.info()
```

**Output:**
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     350 non-null    object 
 1   author   350 non-null    object 
 2   rating   350 non-null    float64
 3   year     350 non-null    int64  
 4   genre    350 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 13.8+ KB
```

**Explanation:**
- `books.info()`: Shows:
    - Number of rows and columns
    - Column names and data types
    - Count of non-null (non-missing) values for each column
    - Memory usage

**Significance:**  
- Helps detect missing data and understand the types of each column for further analysis.
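
When a DataFrame does contain gaps, a direct count of missing values per column can be easier to read than comparing non-null counts by eye. The following is a minimal sketch on made-up data; the `sample` DataFrame is hypothetical and not part of the books dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few deliberately missing entries
sample = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C", "Book D"],
    "rating": [4.7, np.nan, 4.8, 4.4],
    "genre": ["Fiction", "Fiction", None, None],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```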

---

## 5. Exploring Categorical Columns
A common question about categorical columns is how many data points fall into each category.  
For example, to see which genres are represented in the books data, select the `genre` column and call the pandas Series method `.value_counts()` to count the books in each genre.

### **Code Example: Value Counts for a Categorical Column**

```python
books['genre'].value_counts()
```

**Output:**
```
Non Fiction    179
Fiction        131
Childrens       40
Name: genre, dtype: int64
```

**Explanation:**
- `books['genre']`: Selects the `genre` column.
- `.value_counts()`: Counts the number of occurrences for each unique value (category) in the column.

**Significance:**  
- Reveals the distribution of genres. Useful for understanding class balance or planning further analysis.
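
If shares are more useful than raw counts, `.value_counts()` also accepts `normalize=True`. The sketch below uses a small made-up Series (`genres` is hypothetical) rather than the real books data.

```python
import pandas as pd

# Hypothetical genre labels: 5 Non Fiction, 4 Fiction, 1 Childrens
genres = pd.Series(
    ["Non Fiction"] * 5 + ["Fiction"] * 4 + ["Childrens"],
    name="genre",
)

# normalize=True returns the proportion of rows in each category
shares = genres.value_counts(normalize=True)
print(shares)
```

The proportions sum to 1, which makes class balance easier to compare across datasets of different sizes.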

---

## 6. Summarizing Numerical Columns
The `DataFrame.describe()` method gives a quick overview of the numerical columns. Calling `.describe()` on `books` returns the count, mean, and standard deviation of the values in each numerical column (here `rating` and `year`), along with the minimum, maximum, and quartile values.

### **Code Example: Descriptive Statistics for Numericals**

```python
books.describe()
```

**Output:**
```
           rating         year
count  350.000000   350.000000
mean     4.608571  2013.508571
std      0.226941     3.284711
min      3.300000  2009.000000
25%      4.500000  2010.000000
50%      4.600000  2013.000000
75%      4.800000  2016.000000
max      4.900000  2019.000000
```

**Explanation:**
- `books.describe()`: Computes summary statistics (count, mean, std, min, 25th percentile, 50th percentile/median, 75th percentile, max) for all numerical columns.

**Significance:**  
- Gives a quick overview of the central tendency and spread for numeric data.
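
To see where the individual rows of `.describe()` come from, the sketch below recomputes a few of them with `.mean()`, `.median()`, and `.quantile()`. The `ratings` Series is made-up illustration data.

```python
import pandas as pd

# Hypothetical ratings, chosen so the quartiles fall exactly on data points
ratings = pd.Series([4.1, 4.3, 4.5, 4.7, 4.9], name="rating")

summary = ratings.describe()

# The same quartiles, requested explicitly
quartiles = ratings.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
print(ratings.mean(), ratings.median())
```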

---

## 7. Visualizing Numerical Data
Histograms are a classic way to look at the distribution of numerical data by splitting numerical values into discrete bins and visualizing the count of values in each bin.

### **Code Example: Histogram of Ratings**

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=books, x="rating")
plt.show()
```
![image.png](attachment:d3bcdfb0-5d6a-404e-9e0d-ec3727bf6d97.png)
**Output:**  
*A histogram plot appears showing the distribution of book ratings, with most ratings above 4.4 and few below 4.0.*

**Explanation:**
- `import seaborn as sns`: Imports the Seaborn library for statistical data visualization.
- `import matplotlib.pyplot as plt`: Imports Matplotlib for plot display.
- `sns.histplot(data=books, x="rating")`: Plots a histogram of the `rating` column.
- `plt.show()`: Displays the plot.

**Significance:**  
- Visualizes the distribution of ratings, helping spot patterns or outliers (e.g., most books are highly rated).

---

## 8. Adjusting Bin Width in Histograms

### **Code Example: Setting Bin Width**

```python
sns.histplot(data=books, x="rating", binwidth=0.1)
plt.show()
```
![image.png](attachment:652d78a0-3b4b-484b-8d26-08fe679b6283.png)
**Output:**  
*A histogram with bins of width 0.1, giving finer granularity (e.g., 4.5–4.6, 4.6–4.7, etc.).*

**Explanation:**
- `binwidth=0.1`: Sets each histogram bin to represent a 0.1 increment in ratings.

**Significance:**  
- Finer bins provide more detail about the distribution and can reveal small patterns or clusters.
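
The idea behind a bin width can be sketched without plotting: choose edges spaced one bin width apart and count the values between them. This example uses NumPy on made-up values; `rates` and `edges` are assumptions for illustration, and Seaborn may place its bin edges differently (for example, starting at the smallest value in the data).

```python
import numpy as np
import pandas as pd

# Hypothetical values to bin
rates = pd.Series([0.26, 3.4, 3.9, 5.1, 5.2, 5.8, 10.9, 13.28])

bin_width = 1
edges = np.arange(0, 15, bin_width)  # edges 0, 1, ..., 14 give 14 bins of width 1

# counts[i] is the number of values in the bin starting at edges[i]
counts, _ = np.histogram(rates, bins=edges)

for left, n in zip(edges[:-1], counts):
    if n:
        print(f"[{left}, {left + bin_width}): {n}")
```

Narrower bins spread the same values over more bars, which is why reducing the bin width reveals finer detail but can also make the plot noisier.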

---

## 9. Summary: Key Steps in Initial Data Exploration

- **Load data** using pandas.
- **Preview data** with `.head()`.
- **Check data types and missing values** with `.info()`.
- **Explore categorical variables** with `.value_counts()`.
- **Summarize numericals** with `.describe()`.
- **Visualize distributions** with Seaborn (e.g., `sns.histplot`), adjusting bin width for clarity.

---


### Exercise: Functions for initial exploration

You are researching unemployment rates worldwide and have been given a new dataset to work with. The data has been saved and loaded for you as a pandas DataFrame called `unemployment`. You've never seen the data before, so your first task is to use a few pandas functions to learn about this new data.

pandas has been imported for you as `pd`.

Instructions

1. Use a pandas function to print the first five rows of the `unemployment` DataFrame.

```python
# Print the first five rows of unemployment
print(unemployment.head())
```

**Output:**
```
      country_code          country_name      continent   2010   2011  ...   2017   2018   2019   2020   2021
    0          AFG           Afghanistan           Asia  11.35  11.05  ...  11.18  11.15  11.22  11.71  13.28
    1          AGO                Angola         Africa   9.43   7.36  ...   7.41   7.42   7.42   8.33   8.53
    2          ALB               Albania         Europe  14.09  13.48  ...  13.62  12.30  11.47  13.33  11.82
    3          ARE  United Arab Emirates           Asia   2.48   2.30  ...   2.46   2.35   2.23   3.19   3.36
    4          ARG             Argentina  South America   7.71   7.18  ...   8.35   9.22   9.84  11.46  10.90
    
    [5 rows x 15 columns]
```
2. Use a pandas function to print a summary of column non-missing values and data types from the unemployment DataFrame.

```python
# Print a summary of non-missing values and data types in the unemployment DataFrame
# (.info() prints the summary itself and returns None, so print() adds the trailing "None")
print(unemployment.info())
```

**Output:**
```
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 182 entries, 0 to 181
    Data columns (total 15 columns):
     #   Column        Non-Null Count  Dtype  
    ---  ------        --------------  -----  
     0   country_code  182 non-null    object 
     1   country_name  182 non-null    object 
     2   continent     177 non-null    object 
     3   2010          182 non-null    float64
     4   2011          182 non-null    float64
     5   2012          182 non-null    float64
     6   2013          182 non-null    float64
     7   2014          182 non-null    float64
     8   2015          182 non-null    float64
     9   2016          182 non-null    float64
     10  2017          182 non-null    float64
     11  2018          182 non-null    float64
     12  2019          182 non-null    float64
     13  2020          182 non-null    float64
     14  2021          182 non-null    float64
    dtypes: float64(12), object(3)
    memory usage: 21.5+ KB
    None
```
3. Print the summary statistics (count, mean, standard deviation, min, max, and quartile values) of each numerical column in unemployment.

```python
# Print summary statistics for numerical columns in unemployment
print(unemployment.describe())
```

**Output:**
```
              2010     2011     2012     2013     2014  ...     2017     2018     2019     2020     2021
    count  182.000  182.000  182.000  182.000  182.000  ...  182.000  182.000  182.000  182.000  182.000
    mean     8.409    8.315    8.318    8.345    8.180  ...    7.669    7.426    7.244    8.421    8.391
    std      6.249    6.267    6.367    6.416    6.284  ...    5.902    5.819    5.697    6.041    6.067
    min      0.450    0.320    0.480    0.250    0.200  ...    0.140    0.110    0.100    0.210    0.260
    25%      4.015    3.775    3.743    3.692    3.625  ...    3.690    3.625    3.487    4.285    4.335
    50%      6.965    6.805    6.690    6.395    6.450  ...    5.650    5.375    5.240    6.695    6.425
    75%     10.957   11.045   11.285   11.310   10.695  ...   10.315    9.258    9.445   11.155   10.840
    max     32.020   31.380   31.020   29.000   28.030  ...   27.040   26.910   28.470   29.220   33.560
    
    [8 rows x 12 columns]
```

### Exercise: Counting categorical values

Recall from the previous exercise that the `unemployment` DataFrame contains 182 rows of country data including `country_code`, `country_name`, `continent`, and unemployment percentages from 2010 through 2021.

You'd now like to explore the categorical data contained in unemployment to understand the data that it contains related to each continent.

The unemployment DataFrame has been loaded for you along with pandas as pd.

Instructions

Use a method to count the values associated with each continent in the unemployment DataFrame.
```python
# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())
```

**Output:**
```
    continent
    Africa           53
    Asia             47
    Europe           39
    North America    18
    South America    12
    Oceania           8
    Name: count, dtype: int64
```


### Exercise: Global unemployment in 2021

It's time to explore some of the numerical data in `unemployment`! What was typical unemployment in a given year? What were the minimum and maximum unemployment rates, and what did the distribution of unemployment rates look like across the world? A histogram is a great way to get a sense of the answers to these questions.

Your task in this exercise is to create a histogram showing the distribution of global unemployment rates in 2021.

The unemployment DataFrame has been loaded for you along with pandas as pd.

Instructions

- Import the required visualization libraries.
- Create a histogram of the distribution of 2021 unemployment percentages across all countries in `unemployment`; show a full percentage point in each bin.

```python
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x='2021', binwidth=1)
plt.show()

```
![image.png](attachment:53aec1b9-7481-4668-8802-de91368a01fb.png)

# Data Validation in Exploratory Data Analysis (EDA)
---

## 1. Introduction: Why Data Validation?

- **Data validation** is a crucial early step in exploratory data analysis (EDA).
    - Ensures data types and value ranges are as expected.
    - Prevents issues later in analysis or modeling.
- Typical checks:
    - Are columns storing the right data type?
    - Are categorical values within expected categories?
    - Are numerical ranges plausible?

---

## 2. Validating Data Types

### Checking Data Types with `.info()`

```python
books.info()
```

**Output:**
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     350 non-null    object 
 1   author   350 non-null    object 
 2   rating   350 non-null    float64
 3   year     350 non-null    float64
 4   genre    350 non-null    object 
dtypes: float64(2), object(3)
memory usage: 13.8+ KB
```

**Line-by-line explanation:**
- `books.info()`:  
    - Displays a summary of the DataFrame.
    - Shows number of entries, column names, non-null count, and data types.
    - Purpose: Quick assessment of structure and types.

- **Significance:**  
    - We see `year` is stored as `float64` (decimal), but years should be integers.

---

### Checking Data Types with `.dtypes`

```python
books.dtypes
```

**Output:**
```
name      object
author    object
rating    float64
year      float64
genre     object
dtype: object
```

**Line-by-line explanation:**
- `books.dtypes`:
    - Returns a Series with column names and their data types.
    - Used when only data types are of interest.

- **Significance:**  
    - Confirms that `year` is `float64`, which is not ideal for year data.

---

## 3. Updating Data Types

### Converting the `year` Column to Integer

```python
books["year"] = books["year"].astype(int)
```

**Explanation:**
- `books["year"]`:  
    - Selects the `year` column in the DataFrame.
- `.astype(int)`:  
    - Converts the column's data type to integer (`int`).
    - Purpose: Years should be whole numbers, not decimals.

**Checking the update:**

```python
books.dtypes
```

**Output:**
```
name      object
author    object
rating    float64
year      int64
genre     object
dtype: object
```

**Significance:**
- Now, `year` is correctly stored as `int64` (integer).
- Ensures future operations (e.g., grouping, filtering) work as intended.
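
Two things are worth checking before a float-to-integer conversion: that the values really are whole numbers, and that none are missing. The sketch below illustrates both on made-up Series; `years` and `years_with_gap` are hypothetical, and the nullable `"Int64"` dtype is shown as one possible option rather than something the original analysis used.

```python
import numpy as np
import pandas as pd

# Hypothetical float years with no fractional part
years = pd.Series([2016.0, 2011.0, 2018.0])

# Confirm every value is a whole number before discarding the decimals
is_whole = (years % 1 == 0).all()
years_int = years.astype(int)

# A missing value cannot be stored in a plain integer column
years_with_gap = pd.Series([2016.0, np.nan, 2018.0])
try:
    years_with_gap.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# pandas' nullable integer dtype keeps the gap as <NA>
years_nullable = years_with_gap.astype("Int64")

print(is_whole, years_int.dtype, cast_failed)
print(years_nullable)
```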

---

### Common Data Types in Python

| Type       | Python Name |
|------------|-------------|
| String     | `str`       |
| Integer    | `int`       |
| Float      | `float`     |
| Dictionary | `dict`      |
| List       | `list`      |
| Boolean    | `bool`      |

- **Note:**  
    - Use these names (e.g., `int`, `float`, `str`) when converting types with `.astype()`.

---

## 4. Validating Categorical Data

### Checking if Values are in Expected Categories

```python
books["genre"].isin(["Fiction", "Non Fiction"])
```

**Output:**
```
0      True
1      True
2      True
3      True
4     False
...
345    True
346    True
347    True
348    True
349   False
Name: genre, Length: 350, dtype: bool
```

**Line-by-line explanation:**
- `books["genre"]`:  
    - Selects the `genre` column.
- `.isin(["Fiction", "Non Fiction"])`:  
    - Checks if each value in `genre` is either "Fiction" or "Non Fiction".
    - Returns a Boolean Series (`True`/`False`).

- **Significance:**  
    - `True` means the genre is one of the allowed values.
    - `False` means it's not ("bad" or unexpected data).

---

### Inverting the Boolean Mask

```python
~books["genre"].isin(["Fiction", "Non Fiction"])
```

**Output:**
```
0     False
1     False
2     False
3     False
4      True
...
345   False
346   False
347   False
348   False
349    True
Name: genre, Length: 350, dtype: bool
```

**Line-by-line explanation:**
- `~`:  
    - Bitwise NOT operator; applied to a Boolean Series, it flips each `True` to `False` and vice versa.
    - Purpose: Find rows where genre **is NOT** in the allowed list.

- **Significance:**  
    - Helps you quickly identify unexpected category values.
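
Because a Boolean mask sums as 0s and 1s, the inverted mask can also be used to count unexpected entries and list which values they are. A minimal sketch on made-up data follows; the `genres` Series and the `allowed` list are illustrative.

```python
import pandas as pd

# Hypothetical genre column containing one category outside the allowed list
genres = pd.Series(
    ["Fiction", "Non Fiction", "Childrens", "Fiction", "Childrens"],
    name="genre",
)
allowed = ["Fiction", "Non Fiction"]

unexpected = ~genres.isin(allowed)

n_unexpected = unexpected.sum()                  # True counts as 1
unexpected_values = genres[unexpected].unique()  # distinct offending values

print(n_unexpected)
print(unexpected_values)
```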

---

### Filtering DataFrame to Only Valid Categories

```python
books[books["genre"].isin(["Fiction", "Non Fiction"])].head()
```

**Output:**

|   | name                         | author               | rating | year | genre       |
|---|------------------------------|----------------------|--------|------|-------------|
| 0 | 10-Day Green Smoothie Cleanse| JJ Smith             | 4.7    | 2016 | Non Fiction |
| 1 | 11/22/63: A Novel            | Stephen King         | 4.6    | 2011 | Fiction     |
| 2 | 12 Rules for Life            | Jordan B. Peterson   | 4.7    | 2018 | Non Fiction |
| 3 | 1984 (Signet Classics)       | George Orwell        | 4.7    | 2017 | Fiction     |
| 5 | A Dance with Dragons         | George R. R. Martin  | 4.4    | 2011 | Fiction     |

**Line-by-line explanation:**
- `books[...]`:  
    - DataFrame indexing. Only rows where the condition inside is `True` are included.
- `books["genre"].isin(...)`:  
    - Produces a Boolean mask as before.

- **Significance:**  
    - You can filter your dataset to only include rows with valid categories.
    - Useful for cleaning or restricting analyses.

---

## 5. Validating Numerical Data

### Selecting Only Numerical Columns

```python
books.select_dtypes("number").head()
```

**Output:**

|   | rating | year |
|---|--------|------|
| 0 | 4.7    | 2016 |
| 1 | 4.6    | 2011 |
| 2 | 4.7    | 2018 |
| 3 | 4.7    | 2017 |
| 4 | 4.8    | 2019 |

**Line-by-line explanation:**
- `select_dtypes("number")`:  
    - Selects only columns with numeric types (`int`, `float`).
- `.head()`:  
    - Shows the first 5 rows for a quick look.

- **Significance:**  
    - Focuses on numerical data for further validation or exploration.

---

### Checking Minimum and Maximum Values

```python
books["year"].min()
```
**Output:**
```
2009
```

```python
books["year"].max()
```
**Output:**
```
2019
```

**Explanation:**
- `.min()`:  
    - Returns the smallest value in `year` column.
- `.max()`:  
    - Returns the largest value in `year` column.

- **Significance:**  
    - Verifies that all years are within expected bounds (e.g., no future or ancient years).
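
Beyond reading off the minimum and maximum, a range check can be expressed directly with `Series.between()`, which is inclusive on both ends by default. The sketch below uses made-up years, including one deliberately implausible value.

```python
import pandas as pd

# Hypothetical publication years; 2031 is a deliberately implausible entry
years = pd.Series([2009, 2013, 2019, 2031], name="year")

# True where the value lies within the expected range (bounds included)
in_range = years.between(2009, 2019)

out_of_range = years[~in_range]
print(in_range.all())
print(out_of_range)
```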

---

### Visualizing Distribution: Boxplot

```python
sns.boxplot(data=books, x="year")
plt.show()
```
![image.png](attachment:ac234578-2104-4e3d-bc2b-8f8e7506a784.png)
**Output:**  
*(A boxplot is displayed showing the distribution of publication years)*

**Explanation:**
- `sns.boxplot(...)`:  
    - Plots a boxplot of the `year` column.
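    - Shows the median, the quartiles (the box), whiskers reaching to the most extreme points within 1.5 × IQR of the box (Seaborn's default), and any points beyond the whiskers as individual outliers.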
- `plt.show()`:  
    - Displays the plot.

- **Significance:**  
    - Visualizes distribution, detects outliers or skewness.
    - Values read from the plot:
        - Min: **2009**
        - Max: **2019**
        - 25th percentile: **2010**
        - Median: **2013**
        - 75th percentile: **2016**
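
The outlier rule that boxplots commonly use can be reproduced numerically: points further than 1.5 × IQR from the box are flagged. This is a sketch on made-up years (with 1990 added as an artificial outlier), not a result from the books data.

```python
import pandas as pd

# Hypothetical years; 1990 is an artificial outlier
years = pd.Series([2009, 2010, 2010, 2013, 2013, 2016, 2016, 2019, 1990], name="year")

q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Fences used by the conventional 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = years[(years < lower_fence) | (years > upper_fence)]
print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

In a boxplot, the whiskers end at the most extreme data points that still lie inside these fences, so they may stop short of the fence values themselves.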

---

### Boxplot Grouped by Categorical Variable

```python
sns.boxplot(data=books, x="year", y="genre")
plt.show()
```
![image.png](attachment:4beaec94-4e90-4f9c-a7a3-aadb6d375d4e.png)
**Output:**  
*(A grouped boxplot showing year distribution for each genre)*

**Explanation:**
- `y="genre"`:  
    - Groups the boxplot by genre, showing year distribution per genre.
- Visual cue: See if some genres are published more recently.

- **Significance:**  
    - Example: Children's books might have slightly later publishing years, but overall range is the same for all genres.

---

## 6. Summary and Next Steps

- Validating data types ensures each column is stored optimally for analysis.
- Checking categorical values prevents problems from unexpected entries.
- Validating numerical ranges and distributions can reveal outliers, errors, or trends.
- Visualization (like boxplots) helps to understand variable distributions.

---

## **Key Code Reference**

```python
# Check data types and non-null counts
books.info()

# Show just data types
books.dtypes

# Convert year to integer
books["year"] = books["year"].astype(int)

# Check if genre values are valid
books["genre"].isin(["Fiction", "Non Fiction"])

# Invert to find invalid genres
~books["genre"].isin(["Fiction", "Non Fiction"])

# Filter to only valid genres
books[books["genre"].isin(["Fiction", "Non Fiction"])]

# Select only numerical columns
books.select_dtypes("number")

# Get min/max of a numerical column
books["year"].min()
books["year"].max()

# Boxplot of year
sns.boxplot(data=books, x="year")
plt.show()

# Boxplot of year grouped by genre
sns.boxplot(data=books, x="year", y="genre")
plt.show()
```



### Exercise: Detecting data types

A column has been changed in the `unemployment` DataFrame and it now has the wrong data type! This data type will stop you from performing effective exploration and analysis, so your task is to identify which column has the wrong data type and then fix it.

pandas has been imported as `pd`; `unemployment` is also available.

Instructions 1/2

**Question:** Which of the columns below requires an update to its data type?

Possible answers:

- country_name
- continent
- 2019
- 2021

```python
unemployment.info()
```

**Output:**
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country_code  182 non-null    object 
 1   country_name  182 non-null    object 
 2   continent     177 non-null    object 
 3   2010          182 non-null    float64
 4   2011          182 non-null    float64
 5   2012          182 non-null    float64
 6   2013          182 non-null    float64
 7   2014          182 non-null    float64
 8   2015          182 non-null    float64
 9   2016          182 non-null    float64
 10  2017          182 non-null    float64
 11  2018          182 non-null    float64
 12  2019          182 non-null    object 
 13  2020          182 non-null    float64
 14  2021          182 non-null    float64
dtypes: float64(11), object(4)
memory usage: 21.5+ KB

```
Instructions 2/2

- Update the data type of the `2019` column of `unemployment` to float.
- Print the dtypes of the `unemployment` DataFrame again to check that the data type has been updated!

```python
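# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment["2019"].astype(float)

# Print the dtypes to check your work
# (.info() lists each column's dtype; it prints directly and returns None, hence the trailing "None")
print(unemployment.info())
```

**Output:**
```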
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 182 entries, 0 to 181
    Data columns (total 15 columns):
     #   Column        Non-Null Count  Dtype  
    ---  ------        --------------  -----  
     0   country_code  182 non-null    object 
     1   country_name  182 non-null    object 
     2   continent     177 non-null    object 
     3   2010          182 non-null    float64
     4   2011          182 non-null    float64
     5   2012          182 non-null    float64
     6   2013          182 non-null    float64
     7   2014          182 non-null    float64
     8   2015          182 non-null    float64
     9   2016          182 non-null    float64
     10  2017          182 non-null    float64
     11  2018          182 non-null    float64
     12  2019          182 non-null    float64
     13  2020          182 non-null    float64
     14  2021          182 non-null    float64
    dtypes: float64(12), object(3)
    memory usage: 21.5+ KB
    None
```


### Exercise: Validating continents

Your colleague has informed you that the data on unemployment from countries in Oceania is not reliable, and you'd like to identify and exclude these countries from your unemployment data. The `.isin()` method can help with that!

Your task is to use `.isin()` to identify countries that are not in Oceania. These countries should return `True` while countries in Oceania should return `False`. This will set you up to use the results of `.isin()` to quickly filter out Oceania countries using Boolean indexing.

The `unemployment` DataFrame is available, and pandas has been imported as `pd`.

Instructions 1/2

Define a Series of Booleans describing whether or not each continent is outside of Oceania; call this Series not_oceania.

```python
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(['Oceania'])
```
Instructions 2/2

Use Boolean indexing to print the `unemployment` DataFrame without any of the data related to countries in Oceania.

```python
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])
```

**Output:**
```
        country_code          country_name      continent   2010   2011  ...   2017   2018   2019   2020   2021
    0            AFG           Afghanistan           Asia  11.35  11.05  ...  11.18  11.15  11.22  11.71  13.28
    1            AGO                Angola         Africa   9.43   7.36  ...   7.41   7.42   7.42   8.33   8.53
    2            ALB               Albania         Europe  14.09  13.48  ...  13.62  12.30  11.47  13.33  11.82
    3            ARE  United Arab Emirates           Asia   2.48   2.30  ...   2.46   2.35   2.23   3.19   3.36
    4            ARG             Argentina  South America   7.71   7.18  ...   8.35   9.22   9.84  11.46  10.90
    ..           ...                   ...            ...    ...    ...  ...    ...    ...    ...    ... 
    
    [174 rows x 15 columns]
```
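
One detail worth noting about this filter is how it treats missing continents: `.isin()` returns `False` for a missing value, so the inverted mask is `True` and those rows are kept. The sketch below demonstrates this on a made-up Series; the extra `strict` mask is an optional variation, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical continent column with one missing entry
continents = pd.Series(["Asia", "Oceania", np.nan, "Europe"], name="continent")

# Missing values are "not in Oceania", so they pass the filter
not_oceania = ~continents.isin(["Oceania"])

# Optional stricter variant: also require the continent to be known
strict = not_oceania & continents.notna()

print(not_oceania.tolist())
print(strict.tolist())
```

This is consistent with the exercise output: the 8 Oceania rows are removed from 182, leaving 174 rows, while the rows with a missing `continent` remain.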

### Exercise: Validating range

Now it's time to validate our numerical data. We saw in the previous lesson using .describe() that the largest unemployment rate during 2021 was nearly 34 percent, while the lowest was just above zero.

Your task in this exercise is to get much more detailed information about the range of unemployment data using Seaborn's boxplot, and you'll also visualize the range of unemployment rates in each continent to understand geographical range differences.

unemployment is available, and the following have been imported for you: Seaborn as sns, matplotlib.pyplot as plt, and pandas as pd.

Instructions

- Print the minimum and maximum unemployment rates, in that order, during 2021.
- Create a boxplot of 2021 unemployment rates (on the x-axis), broken down by continent (on the y-axis).

```python
# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021', y='continent', data=unemployment)
plt.show()
```

**Output:**
```
    0.26 33.56

```
![image.png](attachment:d7e9a287-455c-4908-bbdd-a080cf7845c1.png)

# Data Summarization: Exploratory Data Analysis in Python
---

## 1. Data Summarization

- Exploratory Data Analysis (EDA) often starts by summarizing and comparing groups within your data.
- In the previous analysis, we noticed that children's books in the dataset tend to be published later, on average, than other genres.

---

## 2. Exploring Groups of Data with `.groupby()`

Grouping data allows us to calculate summary statistics for different categories within a dataset.

### Example: Mean Rating and Year per Genre

```python
books[["genre", "rating", "year"]].groupby("genre").mean()
```

**Output:**

| genre        | rating    | year        |
|--------------|-----------|-------------|
| Childrens    | 4.780000  | 2015.075000 |
| Fiction      | 4.570229  | 2013.022901 |
| Non Fiction  | 4.598324  | 2013.513966 |

---

### **Line-by-line Explanation:**

- `books[["genre", "rating", "year"]]`  
    - **What:** Selects only the `genre`, `rating`, and `year` columns from the `books` DataFrame.
    - **Why:** Limits the analysis to relevant columns for summarization.
    - **Expected Output:** A DataFrame containing those three columns.

- `.groupby("genre")`  
    - **What:** Groups the DataFrame by the `genre` column.
    - **Why:** To compute statistics within each genre category.
    - **Expected Output:** Grouped object, ready for aggregation.

- `.mean()`  
    - **What:** Computes the mean (average) of each numeric column within each genre group.
    - **Why:** To summarize the typical rating and publication year for each genre.
    - **Expected Output:** DataFrame where each row is a genre and columns are average rating and year.

**Significance:**  
- Children's books have the highest average rating and are, on average, published more recently.
- Fiction books have the lowest average rating among the groups.
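
Selecting the columns before grouping is one way to keep text columns out of the mean; selecting them after grouping, or passing `numeric_only=True`, are alternatives that should give the same table. The sketch below compares the three on a small made-up DataFrame (`sample` is hypothetical).

```python
import pandas as pd

# Hypothetical books data with one non-numeric column besides genre
sample = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["Fiction", "Fiction", "Childrens", "Childrens"],
    "rating": [4.4, 4.6, 4.8, 4.9],
    "year": [2011, 2013, 2015, 2017],
})

# 1. Select columns first, then group (the approach used above)
by_subset = sample[["genre", "rating", "year"]].groupby("genre").mean()

# 2. Group first, then select the columns to aggregate
by_selection = sample.groupby("genre")[["rating", "year"]].mean()

# 3. Group everything and ask pandas to skip non-numeric columns
by_flag = sample.groupby("genre").mean(numeric_only=True)

print(by_subset)
print(by_subset.equals(by_selection), by_subset.equals(by_flag))
```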

---

## 3. Aggregating Functions

Pandas provides several functions to summarize data:

- `.sum()` — Total sum of values
- `.count()` — Number of non-null values
- `.min()` — Minimum value
- `.max()` — Maximum value
- `.var()` — Variance
- `.std()` — Standard deviation

**Purpose:**  
- These functions can be chained after `.groupby()` to compute summaries within groups.
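
As a brief illustration of chaining these functions after `.groupby()`, the sketch below applies `.count()` and `.max()` to a made-up DataFrame and contrasts `.count()` with `.size()`, which also counts rows whose value is missing. The data and names are hypothetical.

```python
import pandas as pd

# Hypothetical ratings with one missing value in the Childrens group
sample = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Childrens", "Childrens", "Childrens"],
    "rating": [4.4, 4.6, 4.8, None, 4.9],
})

grouped = sample.groupby("genre")["rating"]

counts = grouped.count()   # non-null ratings per genre
sizes = grouped.size()     # rows per genre, including missing ratings
highest = grouped.max()    # highest rating per genre

print(counts)
print(sizes)
print(highest)
```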

---

## 4. Aggregating Ungrouped Data with `.agg()`

`.agg()` (aggregate) allows you to apply one or more aggregation functions to the entire DataFrame or selected columns.

### Example: Aggregating Over All Rows

```python
books[["rating", "year"]].agg(["mean", "std"])
```

**Output:**

|        | rating    | year        |
|--------|-----------|-------------|
| mean   | 4.608571  | 2013.508571 |
| std    | 0.226941  | 3.284711    |

---

### **Line-by-line Explanation:**

- `books[["rating", "year"]]`  
    - **What:** Selects only the `rating` and `year` columns.
    - **Why:** Focuses aggregation on numeric columns only.
    - **Expected Output:** DataFrame with `rating` and `year`.

- `.agg(["mean", "std"])`  
    - **What:** Applies both the mean and standard deviation functions to each selected column.
    - **Why:** Quickly get a sense of central tendency and variability.
    - **Expected Output:** DataFrame showing mean and std for each column.

**Significance:**  
- The average rating is about 4.61, with a standard deviation of 0.23, indicating ratings are closely clustered.
- The average year is about 2013.5, with a standard deviation of ~3.28 years.

---

## 5. Specifying Aggregations for Columns

You can use a dictionary in `.agg()` to apply different aggregations to different columns.

### Example: Custom Aggregations for Each Column

```python
books.agg({
    "rating": ["mean", "std"],
    "year": ["median"]
})
```

**Output:**

|        | rating    | year      |
|--------|-----------|-----------|
| mean   | 4.608571  | NaN       |
| std    | 0.226941  | NaN       |
| median | NaN       | 2013.0    |

---

### **Line-by-line Explanation:**

- `books.agg({...})`  
    - **What:** Applies specified aggregating functions to specific columns.
    - **Why:** Allows for flexible, targeted summaries across columns.
    - **Expected Output:** DataFrame summarizing each column as directed.

- `"rating": ["mean", "std"]`  
    - **What:** For the `rating` column, computes both mean and standard deviation.
    - **Why:** Get both central tendency and spread for ratings.

- `"year": ["median"]`  
    - **What:** For the `year` column, computes the median value.
    - **Why:** Median is less sensitive to outliers; useful for skewed publication years.

**Significance:**  
- Easily see different summary statistics for different columns in one command.
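
The `NaN` cells in the output above appear because each requested function becomes a row, and a column only has a value in the rows it asked for. When every column needs just one statistic, passing a single function name per column should return a compact Series instead. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical numeric data
sample = pd.DataFrame({
    "rating": [4.4, 4.6, 4.8],
    "year": [2011, 2013, 2018],
})

# Lists of functions: one row per function, NaN where a column did not request it
table = sample.agg({"rating": ["mean", "std"], "year": ["median"]})

# One function name per column: a Series with one entry per column
summary = sample.agg({"rating": "mean", "year": "median"})

print(table)
print(summary)
```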

---

## 6. Named Summary Columns with `.agg()`

You can create custom-named summary columns for grouped data using named aggregation in `.agg()`: each keyword argument names an output column and takes a `(column, function)` tuple.

### Example: Named Summaries for Grouped Data

```python
books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median")
)
```

**Output:**

| genre        | mean_rating | std_rating | median_year |
|--------------|-------------|------------|-------------|
| Childrens    | 4.780000    | 0.122370   | 2015.0      |
| Fiction      | 4.570229    | 0.281123   | 2013.0      |
| Non Fiction  | 4.598324    | 0.179411   | 2013.0      |

---

### **Line-by-line Explanation:**

- `books.groupby("genre")`  
    - **What:** Groups the data by `genre`.
    - **Why:** To perform aggregation within each genre.

- `.agg(`  
    - **What:** Starts the aggregation.

- `mean_rating=("rating", "mean")`  
    - **What:** For each group, computes the mean of `rating` and names the result `mean_rating`.
    - **Why:** Clear, descriptive column names improve readability.

- `std_rating=("rating", "std")`  
    - **What:** For each group, computes the standard deviation of `rating` as `std_rating`.
    - **Why:** Measures rating variability within each genre.

- `median_year=("year", "median")`  
    - **What:** For each group, computes the median publication year as `median_year`.
    - **Why:** Shows typical publication year, robust to outliers.

**Significance:**  
- Custom naming improves clarity.
- Fiction has the lowest average rating and highest variation, while children's books are rated highest and most consistently.
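
The `(column, function)` tuples are shorthand for `pd.NamedAgg`, which spells out the two fields by name. The sketch below, on a made-up `toy` DataFrame (hypothetical values), suggests that both spellings produce the same result:

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 4.5, 4.7],
    "year": [2010, 2014, 2012, 2018],
})

# Short form: plain (column, function) tuples
with_tuples = toy.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    median_year=("year", "median"),
)

# Explicit form: pd.NamedAgg with named fields
with_namedagg = toy.groupby("genre").agg(
    mean_rating=pd.NamedAgg(column="rating", aggfunc="mean"),
    median_year=pd.NamedAgg(column="year", aggfunc="median"),
)

print(with_tuples)
print(with_tuples.equals(with_namedagg))
```

The tuple form is shorter; the explicit form can be easier to read when the aggregation function is a custom callable.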

---

## 7. Visualizing Categorical Summaries

Bar plots are useful for visualizing categorical group summaries.

### Example: Seaborn Bar Plot

```python
import matplotlib.pyplot as plt
import seaborn as sns
sns.barplot(data=books, x="genre", y="rating")
plt.show()
```

**Output:**  

![image.png](attachment:086c2c45-1392-4c64-b0e5-014bfc15275a.png)

*A bar plot appears, with genres on the x-axis and average ratings on the y-axis. Each bar represents the mean rating per genre, with vertical lines (error bars) showing the 95% confidence interval.*

---

### **Line-by-line Explanation:**

- `sns.barplot(data=books, x="genre", y="rating")`  
    - **What:** Creates a bar plot where the x-axis is `genre` and the y-axis is `rating`.
    - **Why:** Visualizes the average rating for each genre, with error bars for uncertainty.
    - **Expected Output:** Bar plot showing mean ratings per genre.

- `plt.show()`  
    - **What:** Displays the plot.
    - **Why:** Needed to render the plot when the code runs as a script; notebooks usually display the figure automatically.

**Significance:**  
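- Visual confirmation of earlier results: Fiction has the lowest mean rating. The error bars are 95% confidence intervals for each genre's mean, so their width reflects both how much the ratings vary and how many books the genre contains.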
- Bar plot makes group comparisons intuitive.
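
To see why an error bar is not the same thing as a standard deviation, it can help to compute an approximate interval by hand. The sketch below uses the normal approximation `mean ± 1.96 × std / √n` on made-up data. This is an assumption for illustration only: Seaborn estimates its interval by bootstrapping by default, so these numbers would not match a plot exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same four ratings, once in a small group (A)
# and repeated four times in a larger group (B)
toy = pd.DataFrame({
    "genre": ["A"] * 4 + ["B"] * 16,
    "rating": [4.0, 4.2, 4.4, 4.6] + [4.0, 4.2, 4.4, 4.6] * 4,
})

stats = toy.groupby("genre")["rating"].agg(["mean", "std", "count"])

# Half-width of an approximate 95% confidence interval for the mean
stats["ci_half_width"] = 1.96 * stats["std"] / np.sqrt(stats["count"])
print(stats)
```

The larger group gets a narrower interval even though its ratings are spread similarly, which is why a wide error bar can signal a small group as much as a variable one.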

---

# **Summary Table**

| Concept                                 | Key Function(s)             | Example Code Snippet                                |
|------------------------------------------|-----------------------------|-----------------------------------------------------|
| Group data by category                   | `.groupby()`                | `df.groupby("genre").mean(numeric_only=True)`       |
| Aggregate with standard functions        | `.mean()`, `.sum()`, etc.   | `df.groupby("genre").std(numeric_only=True)`        |
| Aggregate ungrouped data, multiple funcs | `.agg(["mean", "std"])`     | `df[["rating", "year"]].agg(["mean", "std"])`       |
| Specify aggregations per column          | `.agg({col: [funcs]})`      | `df.agg({"rating": ["mean"], "year": ["median"]})`  |
| Named summary columns                    | `.agg(name=(col, func))`    | `df.groupby("genre").agg(mean_rating=("rating", "mean"))` |
| Visualize group summaries                | `sns.barplot()`             | `sns.barplot(data=df, x="genre", y="rating")`       |

---

# **Key Takeaways**

- **Grouping and aggregation** are central to exploring categorical differences in data.
- Use **`.groupby()`** to split data by categories and summarize with aggregation functions.
- **`.agg()`** allows multiple and custom aggregations, including named output columns.
- **Visualization** (e.g., bar plots) communicates group summaries effectively.
- **Clear, concise summaries** help uncover meaningful patterns and guide further analysis.

---



### Exercise: Summaries with `.groupby()` and `.agg()`

In this exercise, you'll explore the means and standard deviations of the yearly unemployment data. First, you'll find means and standard deviations regardless of the continent to observe worldwide unemployment trends. Then, you'll check unemployment trends broken down by continent.

The `unemployment` DataFrame is available, and pandas has been imported as `pd`.

**Instructions (1/2):**

Print the mean and standard deviation of the unemployment rates for 2019 and 2020 (in that order).
```python
# Print the mean and standard deviation of rates for 2019 and 2020
print(unemployment[["2019", "2020"]].agg(["mean", "std"]))

# Output:
#        2019   2020
# mean  7.244  8.421
# std   5.697  6.041
```

Print the mean and standard deviation (in that order) of the unemployment rates for 2019 and 2020, grouped by continent.
```python
# Print mean and standard deviation grouped by continent
print(unemployment[["continent", "2019", "2020"]].groupby("continent").agg(["mean", "std"]))

# Output:
#                 2019           2020
#                 mean    std    mean    std
# continent
# Africa         9.264  7.455  10.308  7.928
# Asia           5.949  5.254   7.012  5.700
# Europe         6.764  4.125   7.471  4.071
# North America  7.095  4.770   9.298  4.963
# Oceania        3.774  2.369   4.274  2.617
# South America  7.719  3.380  10.275  3.411
```
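
The grouped result above has two header levels (year, then statistic). A minimal sketch of selecting from and flattening such columns, using a tiny made-up `toy` DataFrame in place of the course's `unemployment` data (the values are hypothetical):

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "2019": [4.0, 6.0, 5.0, 9.0],
    "2020": [5.0, 7.0, 6.0, 10.0],
})

grouped = toy.groupby("continent").agg(["mean", "std"])

# Select one statistic for one year with a (year, statistic) tuple
mean_2019 = grouped[("2019", "mean")]
print(mean_2019)

# Flatten the two header levels into single names such as "2019_mean"
flat = grouped.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat.columns.tolist())
```

Flattened names are often easier to work with when the summary is exported or merged with other tables.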

### Exercise: Named aggregations

You've seen how `.groupby()` and `.agg()` can be combined to show summaries across categories. Sometimes, it's helpful to name new columns when aggregating so that it's clear in the code output what aggregations are being applied and where.

Your task is to create a DataFrame called `continent_summary` which shows a row for each continent. Its columns will contain the mean unemployment rate for each continent in 2021 and the standard deviation of the 2021 unemployment rate. And of course, you'll name the columns so that their contents are clear!

The `unemployment` DataFrame is available, and pandas has been imported as `pd`.

**Instructions:**

- Create a column called `mean_rate_2021` which shows the mean 2021 unemployment rate for each continent.
- Create a column called `std_rate_2021` which shows the standard deviation of the 2021 unemployment rate for each continent.
```python
continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=("2021", "mean"),
    # Create the std_rate_2021 column
    std_rate_2021=("2021", "std"),
)
print(continent_summary)

# Output:
#                mean_rate_2021  std_rate_2021
# continent
# Africa                 10.474          8.132
# Asia                    6.906          5.415
# Europe                  7.415          3.948
# North America           9.155          5.076
# Oceania                 4.280          2.672
# South America           9.924          3.612
```


### Exercise: Visualizing categorical summaries

As you've learned in this chapter, Seaborn has many great visualizations for exploration, including a bar plot for displaying an aggregated average value by category of data.

In Seaborn, bar plots include a vertical bar indicating the 95% confidence interval for the categorical mean. Since confidence intervals are calculated using both the number of values and the variability of those values, they give a helpful indication of how much the data can be relied upon.

Your task is to create a bar plot to visualize the means and confidence intervals of unemployment rates across the different continents.

`unemployment` is available, and the following have been imported for you: Seaborn as `sns`, `matplotlib.pyplot` as `plt`, and pandas as `pd`.

**Instructions:**

- Create a bar plot showing continents on the x-axis and their respective average 2021 unemployment rates on the y-axis.
```python
# Create a bar plot of continents and their average 2021 unemployment rate
sns.barplot(data=unemployment, x="continent", y="2021")
plt.show()
```
![image.png](attachment:8284094e-e4ff-4852-9b58-35ffea3c4c88.png)

END Chap1