
# Data Exploration and Descriptive Statistics


## Overview


Data exploration and descriptive statistics are essential first steps in analyzing a dataset. They reveal the patterns, characteristics, and relationships in the data before any modeling begins. Python offers mature libraries for both tasks, so data scientists and analysts can inspect, summarize, and visualize a dataset in a few lines of code.

Data exploration involves examining and visualizing the data to discover its structure, uncover patterns, identify potential issues or outliers, and gain insights into the relationships between variables. It helps in forming hypotheses and guiding further analysis. Python libraries such as Pandas, NumPy, and Matplotlib provide a wide range of functions and methods for exploring and visualizing data.

Descriptive statistics, on the other hand, involves summarizing and describing the main characteristics of the data using statistical measures. These measures provide a numerical representation of the dataset's central tendency, dispersion, and shape. Descriptive statistics can include measures such as mean, median, mode, variance, standard deviation, percentiles, and more. Python libraries like NumPy and Pandas offer functions to calculate these statistics quickly and efficiently.
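The measures listed above can be computed directly on a Pandas `Series`. The following is a minimal sketch: the glucose readings are hypothetical values chosen so the arithmetic is easy to check by hand, and they are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical glucose readings, chosen only to keep the arithmetic easy to follow
glucose = pd.Series([85, 89, 89, 116, 137, 148, 183, 197], name="Glucose")

mean_value = glucose.mean()            # central tendency: arithmetic average
median_value = glucose.median()        # central tendency: middle value
mode_values = glucose.mode().tolist()  # most frequent value(s); there may be several
variance = glucose.var()               # dispersion: sample variance (ddof=1 by default)
std_dev = glucose.std()                # dispersion: sample standard deviation
quartiles = glucose.quantile([0.25, 0.5, 0.75])  # 25th, 50th and 75th percentiles

print(f"mean={mean_value}, median={median_value}, mode={mode_values}")
print(f"variance={variance:.2f}, std={std_dev:.2f}")
print(quartiles)
```

Note that `var()` and `std()` in Pandas use the sample formulas (dividing by n - 1) unless `ddof=0` is passed.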

Python's Pandas library is particularly useful for data exploration and descriptive statistics. It provides powerful data structures, such as DataFrames, which allow for efficient manipulation and analysis of tabular data. Pandas offers a wide range of functions to filter, sort, group, aggregate, and transform data, making it easier to extract meaningful insights from the dataset.
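As a rough illustration of filtering, sorting, and transforming a DataFrame, consider the sketch below. The rows are hypothetical and only borrow three column names from the dataset used later in this notebook; the `GlucoseVsGroupMean` column is a name invented for the example.

```python
import pandas as pd

# Hypothetical rows that reuse three of the dataset's column names
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Filter: keep rows whose glucose exceeds 100 (boolean indexing)
high_glucose = df[df["Glucose"] > 100]

# Sort: order the filtered rows by BMI, largest first
sorted_rows = high_glucose.sort_values("BMI", ascending=False)

# Transform: each row's deviation from the mean glucose of its Outcome group
group_mean = df.groupby("Outcome")["Glucose"].transform("mean")
df["GlucoseVsGroupMean"] = df["Glucose"] - group_mean

print(sorted_rows)
print(df)
```

Unlike an aggregation, `transform()` returns a result with the same length as the input, which is why it can be assigned back as a new column.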

Visualization is an integral part of data exploration and descriptive statistics. Python's Matplotlib library, along with other visualization libraries like Seaborn and Plotly, enables the creation of various types of plots, charts, and graphs. These visual representations help in understanding the data distribution, identifying outliers, and visualizing relationships between variables.

In summary, data exploration and descriptive statistics are the first steps in understanding a dataset: they reveal its characteristics, patterns, and relationships. Pandas, NumPy, Matplotlib, Seaborn, and Plotly supply the tools to explore data, calculate descriptive statistics, and visualize the results, laying the foundation for later analysis and modeling.

# Exploratory data analysis techniques

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process. It involves understanding the data, identifying patterns, detecting outliers, and gaining insights into the dataset. Pandas provides several techniques to perform EDA effectively. Here's an example of some common EDA techniques using the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(dataset.head())

# Get summary statistics of the dataset
print("\nSummary statistics:")
print(dataset.describe())

# Visualize the distribution of the 'Glucose' feature using a histogram
plt.hist(dataset['Glucose'], bins=10, edgecolor='black')
plt.xlabel('Glucose')
plt.ylabel('Frequency')
plt.title('Distribution of Glucose')
plt.show()

# Explore the correlation between features using a correlation matrix
correlation_matrix = dataset.corr()
plt.figure(figsize=(8, 6))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='none')
plt.colorbar()
plt.xticks(range(len(column_names)), column_names, rotation=45)
plt.yticks(range(len(column_names)), column_names)
plt.title('Correlation Matrix')
plt.show()


In this example, we first load the Pima Indian Diabetes dataset using Pandas. Then, we perform the following EDA techniques:

1. Displaying the first few rows of the dataset: This gives an initial understanding of the data and its structure.

2. Obtaining summary statistics: Using `describe()`, we get summary statistics such as count, mean, standard deviation, minimum, maximum, and quartiles for each numerical feature in the dataset.

3. Visualizing the distribution of a feature: We create a histogram using `plt.hist()` to visualize the distribution of the 'Glucose' feature. This helps in understanding the spread and shape of the data.

4. Exploring the correlation between features: We compute the correlation matrix using `dataset.corr()` and create a heatmap using `plt.imshow()`. This visual representation allows us to identify patterns and relationships between different features.
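EDA is also the stage at which data-quality issues tend to surface. One check that may be worth running on a dataset like this one is counting zeros in columns where a zero is not physiologically plausible. The sketch below assumes that such zeros stand for unrecorded measurements; that interpretation is an assumption, and the sample rows and the `suspect_columns` list are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the zeros stand in for unrecorded measurements
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero is assumed to mean "not measured".
# Outcome is excluded: there, 0 is a legitimate class label.
suspect_columns = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per suspect column
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Replace zeros with NaN so that mean(), describe(), etc. skip them
cleaned = sample.copy()
cleaned[suspect_columns] = cleaned[suspect_columns].replace(0, np.nan)
print(cleaned.isna().sum())
print(cleaned["Glucose"].mean())
```

Because Pandas skips `NaN` by default when aggregating, the mean of `Glucose` in `cleaned` is computed from the three recorded values only.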


# Summary statistics and descriptive analysis



## Summary statistics

Summary statistics give a concise overview of the distribution and properties of a dataset. Common measures include the mean, median, mode, standard deviation, minimum, maximum, and quartiles. The `describe()` method computes most of these in one call: the median appears as the 50% row, while the mode is not included and has to be computed separately with `mode()`.

Here's an example of calculating summary statistics on the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Calculate summary statistics
summary_stats = dataset.describe()

# Print the summary statistics
print(summary_stats)


In this example, we load the Pima Indian Diabetes dataset using Pandas and call the `describe()` method on it. `describe()` returns a DataFrame, stored here in `summary_stats`, containing the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.

Finally, we print `summary_stats` to display the output. The count is the number of non-null values in a column, and the quartiles are the 25th, 50th, and 75th percentiles; the 50th percentile is the median.
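The percentiles reported by `describe()` can be changed through its `percentiles` parameter, and measures it does not report, such as the mode, can be computed alongside it. A small sketch on hypothetical rows (the values are illustrative only):

```python
import pandas as pd

# Hypothetical rows reusing two of the dataset's column names
df = pd.DataFrame({
    "Glucose": [85, 89, 89, 116, 137, 148, 183, 197],
    "Age": [31, 21, 33, 30, 53, 50, 32, 26],
})

# Request the 10th, 50th and 90th percentiles instead of the default quartiles
custom_summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(custom_summary)

# describe() does not report the mode, so compute it separately
glucose_modes = df["Glucose"].mode().tolist()
print("Glucose mode(s):", glucose_modes)

# The "50%" row of describe() matches median()
print(df.median())
```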


## Descriptive analysis

Descriptive analysis in Pandas involves computing statistics and summaries to gain insight into the data. Pandas provides methods for measures of central tendency, dispersion, and correlation, which together describe the distribution, relationships, and overall characteristics of the dataset.

Here's an example of performing descriptive analysis on the Pima Indian Diabetes dataset using Pandas:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Compute descriptive statistics
summary = dataset.describe()

# Compute correlation matrix
correlation_matrix = dataset.corr()

# Print the descriptive statistics and correlation matrix
print("Descriptive Statistics:")
print(summary)
print("\nCorrelation Matrix:")
print(correlation_matrix)


In this example, we load the Pima Indian Diabetes dataset using Pandas. We then use two common methods for descriptive analysis:

1. `describe()`: This method computes the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column in the dataset. The resulting summary provides an overview of each column's distribution.
2. `corr()`: This method calculates the pairwise correlation between the numeric columns of the dataset, using the Pearson coefficient by default. The resulting correlation matrix shows the strength and direction of the linear relationship between each pair of variables.

We use these functions on the dataset and store the results in `summary` and `correlation_matrix` variables. Finally, we print the descriptive statistics and correlation matrix to examine the dataset's characteristics and relationships between variables.

Descriptive analysis provides valuable insights into the dataset, enabling you to understand the data's distribution, identify potential issues, and explore relationships between variables.
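A full correlation matrix can be hard to scan, so one option is to pull out a single column of it and rank the entries. The sketch below ranks features by the absolute value of their correlation with `Outcome`; the rows are hypothetical and the resulting coefficients say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; values are illustrative, not taken from the real dataset
df = pd.DataFrame({
    "Glucose": [85, 89, 116, 137, 148, 183],
    "BMI": [26.6, 28.1, 25.6, 43.1, 33.6, 23.3],
    "Age": [31, 21, 30, 33, 50, 32],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation (the default) between every pair of columns
corr = df.corr()

# Keep each feature's correlation with Outcome and drop the trivial self-correlation
outcome_corr = corr["Outcome"].drop("Outcome")

# Order the features by absolute correlation, strongest first
order = outcome_corr.abs().sort_values(ascending=False).index
ranked = outcome_corr.reindex(order)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations near the top, where sorting on the raw coefficient would push them to the bottom.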


# Grouping and aggregation operations



## Grouping
Grouping in Pandas splits the data into groups based on one or more categorical variables and applies a calculation within each group. This makes it possible to aggregate and summarize the data per category rather than only across the whole dataset.

Here's an example of grouping using the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Group the data by the 'Outcome' column and calculate the mean of other numeric columns
grouped_data = dataset.groupby('Outcome').mean()

# Print the grouped data
print(grouped_data)


In this example, we load the Pima Indian Diabetes dataset using Pandas. We then group the data by the 'Outcome' column, which contains binary values (0 or 1) indicating whether a person has diabetes or not. The `groupby()` function is used to group the data based on this column.

After grouping, we calculate the mean of the other numeric columns ('Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', and 'Age') within each group. The `mean()` function is applied to each group, calculating the average value for each column.

Finally, we print the grouped data, which displays the mean values for each numeric column, separated by the 'Outcome' groups. This provides insights into how the different variables relate to the outcome of diabetes in the dataset.
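Grouping is not limited to a single statistic or a single key. As a sketch under stated assumptions, the example below uses named aggregation to compute several statistics per group, then groups by two keys at once. The rows, the output column names (`records`, `mean_glucose`, `max_bmi`), and the age bands are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical rows reusing four of the dataset's column names
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Several statistics per Outcome group, using named aggregation:
# output_column=(input_column, aggregation)
per_outcome = df.groupby("Outcome").agg(
    records=("Glucose", "size"),
    mean_glucose=("Glucose", "mean"),
    max_bmi=("BMI", "max"),
)
print(per_outcome)

# Derive a categorical age band, then group by two keys at once
df["AgeBand"] = pd.cut(df["Age"], bins=[20, 30, 40, 60], labels=["21-30", "31-40", "41-60"])
per_outcome_and_age = df.groupby(["Outcome", "AgeBand"], observed=True)["Glucose"].mean()
print(per_outcome_and_age)
```

Passing `observed=True` restricts the result to combinations that actually occur in the data, which keeps the output free of empty groups when a key is categorical.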


## Aggregation operations



Aggregation operations in Pandas reduce a column, or each group of a grouped dataset, to a single summary value. They describe the data's distribution, central tendency, and other statistical properties. Pandas provides a range of aggregation methods, including `mean()`, `sum()`, `count()`, `max()`, and `min()`.

Here's an example of aggregation operations on the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Calculate the mean glucose level for all records
mean_glucose = dataset['Glucose'].mean()
print("Mean glucose level:", mean_glucose)

# Calculate the maximum BMI value
max_bmi = dataset['BMI'].max()
print("Maximum BMI value:", max_bmi)

# Count the number of records for each outcome (0 or 1)
outcome_counts = dataset['Outcome'].value_counts()
print("Outcome counts:")
print(outcome_counts)


In this example, we load the Pima Indian Diabetes dataset using Pandas. We perform several aggregation operations on different columns:

1. `mean_glucose = dataset['Glucose'].mean()`: Calculates the mean (average) glucose level for all records in the 'Glucose' column.

2. `max_bmi = dataset['BMI'].max()`: Finds the maximum BMI value from the 'BMI' column.

3. `outcome_counts = dataset['Outcome'].value_counts()`: Counts the number of records for each outcome (0 or 1) in the 'Outcome' column using the `value_counts()` function.

We then print the results to see the computed summary statistics and outcome counts.

By applying aggregation operations, you can gain valuable insights into the dataset, understand the distribution of different variables, and make informed decisions based on the computed summary statistics.
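Several aggregations can also be requested in one call with `agg()`. The following sketch uses hypothetical rows to show a list of aggregations on one column, a per-column mapping on a DataFrame, and `value_counts(normalize=True)` for proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical rows reusing three of the dataset's column names
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Several aggregations on one column
glucose_stats = df["Glucose"].agg(["min", "mean", "max"])
print(glucose_stats)

# Different aggregations per column; combinations not requested show up as NaN
per_column = df.agg({"Glucose": ["mean", "max"], "BMI": ["min", "max"]})
print(per_column)

# Class balance as proportions rather than raw counts
outcome_share = df["Outcome"].value_counts(normalize=True)
print(outcome_share)
```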


# Data visualization using Pandas and Matplotlib



Data visualization is an essential part of data analysis, allowing you to gain insights and communicate findings effectively. In Python, you can use Pandas for data manipulation and Matplotlib for creating various types of plots. Here's an example of data visualization using Pandas and Matplotlib on the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Histogram of the Glucose levels
plt.hist(dataset['Glucose'], bins=10, color='skyblue')
plt.xlabel('Glucose')
plt.ylabel('Frequency')
plt.title('Distribution of Glucose Levels')
plt.show()

# Boxplot of BMI by Outcome
dataset.boxplot(column='BMI', by='Outcome', grid=False)
plt.xlabel('Outcome')
plt.ylabel('BMI')
plt.title('BMI Distribution by Outcome')
plt.suptitle('')
plt.show()


In this example, we load the Pima Indian Diabetes dataset using Pandas. We then create two different types of plots using Matplotlib.

1. Histogram: We plot a histogram of the 'Glucose' levels using `plt.hist()`. The `bins` parameter determines the number of bins in the histogram. We set the x-axis label to 'Glucose', the y-axis label to 'Frequency', and the title of the plot to 'Distribution of Glucose Levels'.

2. Boxplot: We create a boxplot of the 'BMI' values grouped by 'Outcome' using `dataset.boxplot()`. The `column` parameter specifies the column to plot, and the `by` parameter specifies the grouping column. We set the x-axis label, y-axis label, and title, and call `plt.suptitle('')` to remove the figure-level title that Pandas adds automatically to grouped boxplots.

Finally, we use `plt.show()` to display each plot separately.

These examples demonstrate how to create basic visualizations using Pandas and Matplotlib. You can explore further by customizing plots, creating different types of visualizations, and adding more features to suit your analysis needs.
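As one possible next step, the sketch below draws a scatter plot and a bar plot side by side using Matplotlib's object-oriented interface, where labels and titles are set through `Axes` methods such as `set_title()` rather than `plt.title()`. The rows are hypothetical, and the figure layout and titles are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows reusing three of the dataset's column names
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# One figure with two side-by-side axes
fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: BMI against Glucose, coloured by Outcome
ax_scatter.scatter(df["Glucose"], df["BMI"], c=df["Outcome"], cmap="coolwarm")
ax_scatter.set_xlabel("Glucose")
ax_scatter.set_ylabel("BMI")
ax_scatter.set_title("BMI vs. Glucose")

# Bar plot: number of records per Outcome, drawn through the Pandas plotting API
outcome_counts = df["Outcome"].value_counts().sort_index()
outcome_counts.plot.bar(ax=ax_bar, rot=0)
ax_bar.set_xlabel("Outcome")
ax_bar.set_ylabel("Records")
ax_bar.set_title("Records per Outcome")

fig.tight_layout()
plt.show()
```

Passing `ax=ax_bar` tells the Pandas plotting call to draw on an existing axis, which is how Pandas and Matplotlib plots can share one figure.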


# Reflection Points

1. **Exploratory Data Analysis Techniques**:
   - How can exploratory data analysis help in understanding a dataset?
   - What are some common techniques used in exploratory data analysis?
   - How can you handle missing values during exploratory data analysis?
   - Can you provide examples of exploratory data analysis techniques you've used in the past?

2. **Summary Statistics and Descriptive Analysis**:
   - What is the purpose of summary statistics in data analysis?
   - How do you calculate measures of central tendency, such as mean, median, and mode, using Python?
   - What are some commonly used measures of dispersion, such as variance and standard deviation?
   - How can you interpret and use summary statistics to gain insights from a dataset?

3. **Grouping and Aggregation Operations**:
   - What is the significance of grouping and aggregation operations in data analysis?
   - How can you group data based on one or more variables using Pandas?
   - What are some commonly used aggregation functions, such as count, sum, mean, and max?
   - Can you provide examples of situations where grouping and aggregation operations are useful?

4. **Data Visualization using Pandas and Matplotlib**:
   - Why is data visualization important in data analysis?
   - How can you create basic visualizations, such as line plots, bar plots, and scatter plots, using Pandas and Matplotlib?
   - What are the key components of a well-designed data visualization?
   - Can you explain the concept of data visualization best practices, such as choosing appropriate chart types and labeling axes?


# A quiz on Data Exploration and Descriptive Statistics


1. Which Python library is commonly used for data manipulation and analysis?
<br>a) NumPy
<br>b) Pandas
<br>c) Scikit-learn
<br>d) Matplotlib

2. What is the function used in Pandas to calculate summary statistics for a numerical column?
<br>a) describe()
<br>b) summary()
<br>c) stats()
<br>d) summary_stats()

3. Which of the following is NOT a measure of central tendency?
<br>a) Mean
<br>b) Median
<br>c) Mode
<br>d) Variance

4. How can you calculate the correlation matrix between columns in a Pandas DataFrame?
<br>a) df.corr()
<br>b) df.correlation()
<br>c) df.calculate_correlation()
<br>d) df.correlation_matrix()

5. What function in Pandas allows you to group data based on one or more columns?
<br>a) groupby()
<br>b) split()
<br>c) divide()
<br>d) sort()

6. Which aggregation function in Pandas calculates the maximum value in a group?
<br>a) mean()
<br>b) min()
<br>c) max()
<br>d) sum()

7. Which method is used to create a scatter plot in Matplotlib?
<br>a) plot()
<br>b) scatter()
<br>c) bar()
<br>d) hist()

8. How can you change the figure size in Matplotlib?
<br>a) Using the `set_size()` method
<br>b) Using the `resize()` method
<br>c) Using the `figure_size()` method
<br>d) Using the `figure()` function with the `figsize` parameter

9. Which command in Matplotlib is used to add a title to a plot?
<br>a) title()
<br>b) add_title()
<br>c) set_title()
<br>d) plot_title()

10. Which line of code can be used to display the plot created using Matplotlib?
<br>a) `show()`
<br>b) `display()`
<br>c) `plot()`
<br>d) `draw()`
---
Answers:

1. b) Pandas
2. a) describe()
3. d) Variance
4. a) df.corr()
5. a) groupby()
6. c) max()
7. b) scatter()
8. d) Using the `figure()` function with the `figsize` parameter
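9. a) title() in the pyplot interface (`plt.title()`, as used in the examples above); c) set_title() is the equivalent method on an `Axes` object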
10. a) `show()`
---