<p align="center">
  <img src="../datasafari-logo-primary.png" width="300">
</p>

---

# Exploratory Data Analysis and Visualization

## Session Objective
Learn to summarize and visualize datasets to uncover patterns and distributions. We will compute key statistics and create plots (histograms, boxplots, bar/pie charts, scatterplots, heatmaps) to identify trends, outliers, and relationships in real data.

Exploratory Data Analysis (EDA) is used to analyze and investigate data sets and summarize their main characteristics, often with visualization. For example, as an analyst working with Tanzanian financial data, you might first compute averages and medians of incomes to understand "typical" values, and then plot histograms or boxplots to see how income is distributed. This helps identify patterns, spot anomalies or outliers, and reveal relationships among variables. Throughout this session, we focus on practical data science applications in Python (Pandas, Matplotlib, Seaborn).

---

## 🧮 Univariate Statistics (Summary Measures) (30 minutes)

**Scenario:** Imagine you work at a Tanzanian microfinance NGO. Before making decisions, you need to know the "typical" loan size and how much customer incomes vary.

The **univariate statistical measures** (mean, median, mode, variance, standard deviation) describe the center and spread of a single variable's values. The **mean** (average) sums all values and divides by the count. The **median** is the middle value that splits the sorted data in half, so it is not pulled by extreme values. The **mode** is the most frequent value. **Variance** and **standard deviation** quantify spread: variance is the average squared deviation from the mean, and standard deviation is its square root, which can be read as the typical distance of data points from the mean. Together, these summarize a numeric field.
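The definitions above can be checked by hand on a tiny example. The sketch below uses a hypothetical Series called `loans` (the values are invented so the arithmetic is easy to follow) and compares the manual result with pandas. Note that pandas divides by n - 1 by default (`ddof=1`, the sample formula).

```python
import pandas as pd

# Hypothetical loan sizes, chosen only so the arithmetic is easy to follow.
loans = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = loans.sum() / len(loans)        # 40 / 8
squared_dev = (loans - mean) ** 2      # squared distance of each value from the mean

pop_var = squared_dev.sum() / len(loans)           # population variance: divide by n
sample_var = squared_dev.sum() / (len(loans) - 1)  # sample variance: divide by n - 1

print("mean:", mean, "| median:", loans.median(), "| mode:", loans.mode()[0])
print("population variance:", pop_var, "| population std:", pop_var ** 0.5)
print("sample variance:", round(sample_var, 3), "| sample std:", round(sample_var ** 0.5, 3))

# pandas uses the sample formula by default; ddof=0 gives the population version.
print("pandas var():", round(loans.var(), 3), "| pandas var(ddof=0):", loans.var(ddof=0))
```

For a survey sample, the default sample formula is usually the one you want. The population version is shown only to make the difference visible.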

In [None]:
import pandas as pd

# Example dataset of 20 customers (synthetic financial data)
data = {
    'Age':       [56, 46, 32, 25, 38, 56, 36, 40, 28, 28, 41, 53, 57, 41, 20, 39, 19, 41, 47, 55],
    'Income':    [71936, 73081, 13413, 59052, 46234, 47542, 27855, 72305, 49633, 55333, 
                  56255, 62487, 45599, 49552, 51427, 59970, 47897, 49502, 38764, 38324],
    'Gender':    ['Female','Male','Female','Male','Female','Male','Male','Female','Male','Female',
                  'Female','Female','Female','Female','Female','Female','Female','Female','Female','Female'],
    'LoanAmount':[1500, 4798,  161, 4297, 1981,  995, 4911, 3342, 4551, 3798, 
                  1275, 1016,  337,  878, 1076, 4887, 3993, 4859,  379,  492],
    'Region':    ['Urban','Urban','Urban','Urban','Urban','Urban','Urban','Urban','Rural','Urban',
                  'Rural','Rural','Rural','Urban','Urban','Urban','Urban','Rural','Urban','Urban']
}
df = pd.DataFrame(data)

# Compute basic statistics
print("Mean Age:", df['Age'].mean())
print("Median Age:", df['Age'].median())
print("Mode Gender:", df['Gender'].mode()[0])
print("Income Std Dev:", df['Income'].std(), "\n")
print(df[['Age','Income','LoanAmount']].describe().round(1))

Pandas computes these directly. On a single column, `.mean()`, `.median()`, `.mode()`, and `.std()` give quick summaries; on a whole DataFrame that also contains text columns (such as `Gender` and `Region` here), pass `numeric_only=True` to `df.mean()`, `df.median()`, and `df.std()`. In the code above, we print the mean age, median age, most common gender, and the standard deviation of income; `describe()` then lists count, mean, std, min, quartiles, and max for each selected numeric column. These statistics tell us, for instance, whether the income distribution is tightly clustered or widely spread. The median is especially useful because it resists extreme values.

**Key point:** A large difference between mean and median indicates **skew** (asymmetric data). A high std dev means data are more spread out.
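A minimal sketch of this key point, using hypothetical incomes (nine similar values plus one very large one, invented for illustration): the single large value pulls the mean well above the median, and `Series.skew()` comes out positive.

```python
import pandas as pd

# Hypothetical monthly incomes: nine similar values plus one very large one.
incomes = pd.Series([300, 320, 340, 350, 360, 380, 400, 420, 450, 5000])

mean_income = incomes.mean()      # pulled upward by the single large value
median_income = incomes.median()  # middle of the sorted values, barely affected
skewness = incomes.skew()         # positive when the long tail is on the right

print("mean:", mean_income)       # 832.0
print("median:", median_income)   # 370.0
print("skewness is positive:", skewness > 0)
```

Here the mean is more than twice the median, even though nine of the ten values sit between 300 and 450.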

### Practical Activity
Using your dataset (e.g. a Tanzanian financial survey), compute summary stats for key fields:

1. Calculate mean, median, mode, variance, std dev for numerical features (use `df.describe()` or Pandas functions).
2. Interpret the results: Are values skewed? Which feature has the highest variance?
3. Write your observations in Markdown.

---

## 📈 Distribution Plots (Histogram, KDE, Boxplot) (30 minutes)

**Scenario:** To understand income patterns, you ask: Are most customers earning around a typical value, or is the distribution skewed? Are there outliers?

**Technical:** A **histogram** shows the frequency (count) of data points in numerical bins. It is one of the most common ways to visualize a data distribution. (Unlike a bar chart for categories, histogram bars touch each other and represent ranges of values.) A **Kernel Density Estimate (KDE)** plot is a smoothed version of the histogram. A **boxplot** (box-and-whisker) shows the median, quartiles, and outliers: the box extends from the first quartile (Q1) to the third quartile (Q3) with a line at the median, and each whisker reaches the most extreme data point that lies within 1.5× the interquartile range (IQR = Q3 - Q1) of the box edge; points beyond the whiskers are drawn as outliers. Boxplots make it easy to spot asymmetry and outliers.
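To make the idea of bins concrete, the sketch below counts values per bin with `np.histogram`, the NumPy function that `plt.hist` uses for its binning. The income values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical incomes, used only to show how values are assigned to bins.
incomes = np.array([12, 18, 25, 31, 33, 38, 40, 44, 47, 60])

# Five equal-width bins between the minimum and the maximum,
# the same binning plt.hist(incomes, bins=5) would draw.
counts, edges = np.histogram(incomes, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} values")
```

Each bar in the plotted histogram has the height of one of these counts. Changing `bins` changes the edges, which is why the same data can look smoother or more jagged depending on the bin count.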

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of Income
plt.figure(figsize=(6,4))
plt.hist(df['Income'], bins=5, color='skyblue', edgecolor='black')
plt.title('Histogram of Customer Income')
plt.xlabel('Income'); plt.ylabel('Count')
plt.show()

# KDE plot of Income
sns.kdeplot(df['Income'], fill=True, color='orange')  # fill replaces the deprecated shade argument
plt.title('Income Distribution (KDE)')
plt.xlabel('Income'); plt.ylabel('Density')
plt.show()

# Boxplot of Income
plt.figure(figsize=(4,4))
sns.boxplot(y=df['Income'], color='lightgreen')
plt.title('Boxplot of Income')
plt.ylabel('Income')
plt.show()

In the code, we first plot a histogram of incomes (`plt.hist`). The KDE plot (`sns.kdeplot`) draws a smooth density curve that highlights modes or skew. The boxplot (`sns.boxplot`) shows the median as a line inside the box (the box spans the middle 50% of the values) and any fliers beyond the whiskers. By inspecting these:

* If the histogram is skewed right (tail on the high end), the mean is typically greater than the median.
* Boxplot outliers appear as dots outside whiskers.
* A bimodal distribution would show two peaks in the KDE.
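The dots a boxplot draws beyond its whiskers can also be found numerically. The following is a minimal sketch of the 1.5 × IQR rule on a hypothetical Series of loan amounts; the names `lower_fence` and `upper_fence` are illustrative, not pandas terms.

```python
import pandas as pd

# Hypothetical loan amounts with one unusually large value.
loans = pd.Series([500, 700, 800, 900, 1000, 1100, 1200, 1300, 1500, 6000])

q1 = loans.quantile(0.25)   # first quartile
q3 = loans.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range: width of the box

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points a boxplot would draw as outliers.
outliers = loans[(loans < lower_fence) | (loans > upper_fence)]

print("Q1:", q1, "| Q3:", q3, "| IQR:", iqr)
print("fences:", lower_fence, "to", upper_fence)
print("outliers:", outliers.tolist())
```

With these values only the loan of 6000 falls outside the fences. Whether such a point is an error or a genuine large loan is a judgment the rule cannot make for you.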

### Practical Activity
1. Use `plt.hist()` or `sns.histplot()` to plot histograms of at least two numerical columns (e.g. income and loan amount).
2. Add titles/labels to make charts clear (use `plt.title`, `plt.xlabel`, `plt.ylabel`).
3. Create boxplots (`sns.boxplot()`) for the same columns.
4. Examine the plots: Are there skewed distributions or outliers? Write your interpretations in Markdown.

---

## 🥧 Categorical Data Visualization (Counts, Bar Charts, Pie Charts) (20 minutes)

**Scenario:** You have categorical fields like "Region" (Urban/Rural) or "Gender". You want to see how many customers fall into each category.

**Technical:** For categorical variables, we count frequencies. In Pandas, `df['Region'].value_counts()` gives counts per category. We often visualize these with a **bar chart**, where each category gets one bar whose height is proportional to its frequency. The bars should be equally spaced and labeled. A **pie chart** is another option, showing proportions of a whole (though bar charts are usually clearer).
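Counts are often easier to discuss as percentages. As a small sketch, `value_counts(normalize=True)` returns the share of each category; the `region` Series below is hypothetical and stands in for a column of your own data.

```python
import pandas as pd

# Hypothetical region labels for eight customers.
region = pd.Series(
    ['Urban', 'Urban', 'Rural', 'Urban', 'Urban', 'Rural', 'Urban', 'Urban'],
    name='Region',
)

counts = region.value_counts()                # number of rows per category
shares = region.value_counts(normalize=True)  # fraction of rows per category

# Put both views side by side; pandas aligns them on the category labels.
summary = pd.DataFrame({'count': counts, 'percent': (shares * 100).round(1)})
print(summary)
```

The `percent` column holds the same numbers a pie chart would print with `autopct`, in a form that is easy to quote in your written observations.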

In [None]:
# Bar chart of Gender
gender_counts = df['Gender'].value_counts()
plt.figure(figsize=(4,4))
plt.bar(gender_counts.index, gender_counts.values, color=['pink','lightblue'])
plt.title('Count by Gender')
plt.xlabel('Gender'); plt.ylabel('Count')
plt.show()

# Pie chart of Region
region_counts = df['Region'].value_counts()
plt.figure(figsize=(4,4))
plt.pie(region_counts.values, labels=region_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Region Distribution')
plt.axis('equal')  # keep the pie circular
plt.show()

Above, we made a bar chart of Gender counts and a pie chart of Region. Bar charts are preferable for categorical frequencies: "The bar chart is a familiar way of visualizing categorical distributions. It displays a bar for each category. The length of each bar is proportional to the frequency of that category." (Pie charts show slices of a circle.) You can see which category is largest and compare easily.

### Practical Activity
1. For each categorical feature in your data, compute `value_counts()` and make a bar chart (`plt.bar` or `sns.countplot`).
2. Optionally, create a pie chart with `plt.pie()`.
3. Check which categories dominate and if any category is rare. Summarize findings in Markdown (e.g. "Urban customers make up 85% of the dataset, so services should focus there.").

---

## 🔍 Bivariate Analysis (Scatterplots and Correlation) (25 minutes)

**Scenario:** To explore relationships, you examine pairs of numeric features. For example, is age related to loan amount? Does higher income correspond to larger loans?

**Technical:** A **scatter plot** shows one variable on the x-axis and another on the y-axis, with each point representing a record. Patterns (upward trend, cluster, none) may indicate **correlation**. We quantify this with the **Pearson correlation coefficient (r)**, which ranges from -1 to +1. A value near +1 means a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 means no linear correlation. In Pandas, `df.corr()` computes pairwise correlations; if the DataFrame also contains text columns, select the numeric columns first or pass `numeric_only=True`.
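For readers who want to see what the coefficient measures, here is a minimal sketch that computes Pearson's r from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with `Series.corr`. The six income and loan values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical income and loan values for six customers.
income = pd.Series([20, 30, 40, 50, 60, 70])
loan = pd.Series([1.0, 1.5, 1.4, 2.2, 2.0, 2.9])

dx = income - income.mean()   # deviations from the mean income
dy = loan - loan.mean()       # deviations from the mean loan

# Pearson's r: co-movement of the deviations, scaled to lie between -1 and +1.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = income.corr(loan)  # pandas uses the Pearson method by default

print("manual r:", round(r_manual, 3))
print("pandas r:", round(r_pandas, 3))
```

The two numbers agree, which confirms that `corr` is doing nothing more mysterious than this ratio.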

In [None]:
# Scatter plot: Income vs. LoanAmount
plt.figure(figsize=(6,4))
plt.scatter(df['Income'], df['LoanAmount'], color='purple', alpha=0.7)
plt.title('Income vs. Loan Amount')
plt.xlabel('Income'); plt.ylabel('Loan Amount')
plt.show()

# Compute correlation
corr_matrix = df[['Age','Income','LoanAmount']].corr()
print(corr_matrix)

The scatter plot (`plt.scatter`) lets you visually assess whether points cluster along a line. The printed correlation matrix shows numbers like r(Income, LoanAmount) ≈ 0.31, indicating a weak-to-moderate positive correlation; r(Age, LoanAmount) might be negative. Remember: r lies between -1 and 1 and measures the strength and direction of a linear relationship. As a rough rule of thumb, |r| of 0.7 or above is often considered strong and |r| below 0.3 weak.

* If points lie close to an upward-sloping line, r is positive (strong if near 1).
* If they form no clear pattern, r is near 0.
* A downward trend yields a negative r.
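The three cases can be reproduced with constructed data. The sketch below uses made-up values (not the customer data) and adds one caveat: r only measures linear association, so a clear curved pattern can still give r close to 0.

```python
import numpy as np

x = np.arange(-5, 6)   # -5, -4, ..., 5

y_up = 2 * x + 1       # exact upward line
y_down = -3 * x + 4    # exact downward line
y_curve = x ** 2       # clear U-shaped pattern, but not a linear one

r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_curve = np.corrcoef(x, y_curve)[0, 1]

print(f"upward line:    r = {r_up:.2f}")
print(f"downward line:  r = {r_down:.2f}")
print(f"U-shaped curve: |r| = {abs(r_curve):.2f}")
```

This is why the scatter plot should be inspected alongside the number: a value of r near 0 rules out a straight-line relationship, not every relationship.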

### Practical Activity
1. Choose two numeric variables (e.g. Income and LoanAmount) and create a scatter plot (`sns.scatterplot` or `plt.scatter`).
2. Compute their correlation with `df['Income'].corr(df['LoanAmount'])`, or with `df[['Income','LoanAmount']].corr()` for the full matrix.
3. Interpret: "We found r = X, indicating [strong/weak/none] [positive/negative] correlation."
4. Repeat for another pair (e.g. Age vs. Income). Record insights in Markdown.

---

## 🌡️ Correlation Heatmap (15 minutes)

**Scenario:** You have many numerical features and want an overview of all pairwise correlations to spot strong relationships.

**Technical:** A **heatmap** is a grid where cell colors represent values, useful for visualizing a correlation matrix. We will plot the correlation matrix of the numeric columns, computed with `df.corr(numeric_only=True)`, using Seaborn.

In [None]:
# Correlation heatmap
corr = df.corr(numeric_only=True)  # skip the text columns (Gender, Region)
plt.figure(figsize=(5,4))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()

The heatmap above (with values annotated) quickly highlights high correlations: dark red (near +1) or dark blue (near -1). For example, if Income and LoanAmount have a moderate positive correlation (~0.31), their cell will show ~0.31. Blocks of color show groups of features that move together. This multivariate view helps identify unexpected relationships or redundancy among features.
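With many features it can help to list the pairs in order of strength instead of reading them off the colors. The sketch below is one possible approach, not the only one. It builds a hypothetical DataFrame `df_demo`, keeps the upper triangle of the correlation matrix so each pair appears once, and sorts by absolute value.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns: B is built to track A closely, C to move against A.
df_demo = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'B': [2, 4, 5, 8, 10, 13],
    'C': [9, 7, 8, 4, 5, 1],
    'D': [3, 1, 4, 1, 5, 2],
})

corr = df_demo.corr(numeric_only=True)

# Keep only the cells above the diagonal so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One row per pair, ordered from strongest to weakest by absolute value.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.round(2))

# Pairs above an illustrative threshold of |r| > 0.5.
strong = pairs[pairs.abs() > 0.5]
print(strong.round(2))
```

The threshold of 0.5 is only a convenient cut-off for a first pass; with 20 rows or fewer, even fairly large values of r can arise by chance.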

### Practical Activity
1. Compute a correlation matrix for all numeric features (`df.corr(numeric_only=True)`).
2. Plot it with `sns.heatmap(..., annot=True)`.
3. Look for the strongest correlations (largest absolute values) and those closest to zero, then write down hypotheses (e.g. "Maybe older customers take smaller loans, since age and loan amount are negatively correlated.").

---

## 💻 Mini Project: EDA of a Household Dataset (45 minutes)

Now apply what you've learned in a guided mini-project. Context: You have a real dataset (e.g. a Tanzanian household survey or microfinance customer data). Follow these steps:

1. **Load the data** into a Pandas DataFrame (e.g. a CSV file from Zindi/Humdata).
2. **Compute summary statistics:** For each numerical column, find mean, median, mode, and standard deviation. Interpret these statistics.
3. **Visualize distributions:** Plot histograms (with appropriate binning) and boxplots for at least two numeric variables. Comment on their shapes (skewness, modality, outliers).
4. **Analyze categories:** For each categorical column (such as Region, Gender, Product type), create a bar chart of counts. Optionally, a pie chart for one variable. Note the largest categories.
5. **Explore relationships:** Pick two numeric features and plot a scatterplot. Compute their correlation. Repeat for a different pair.
6. **Correlation heatmap:** Generate a heatmap of all numeric-feature correlations. Identify any strong correlations (`|r|>0.5`) and mention what they might imply.
7. **Write observations:** In Markdown cells, summarize the key findings: e.g. which feature has the highest mean, whether income is skewed, any surprising outliers, and how features relate. Use bullet points for clarity.

This mini-project reinforces summarizing data with code and narrative.
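As a starting point for steps 1 and 2, the sketch below loads a small CSV and separates numeric from categorical columns. The CSV text, the column names, and the `survey` variable are all hypothetical stand-ins; with a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Stand-in for a real file: replace io.StringIO(csv_text) with your file path.
csv_text = """household_id,region,income,loan_amount
1,Urban,52000,1500
2,Rural,31000,400
3,Urban,47000,2200
4,Rural,28000,
"""
survey = pd.read_csv(io.StringIO(csv_text))

# Identifier columns are numeric but carry no meaning as quantities, so drop them here.
numeric_cols = [c for c in survey.select_dtypes(include='number').columns
                if c != 'household_id']
categorical_cols = survey.select_dtypes(exclude='number').columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# Summary statistics for the numeric columns; missing values are skipped by default.
print(survey[numeric_cols].agg(['mean', 'median', 'std']).round(1))

# Counts for each categorical column.
for col in categorical_cols:
    print(survey[col].value_counts())
```

Splitting the columns once at the top keeps the later plotting steps simple: loop over `numeric_cols` for histograms and boxplots, and over `categorical_cols` for bar charts.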

---

## 🎓 Capstone Project: Complete EDA (60 minutes)

For your capstone task, perform a comprehensive EDA on a new dataset (for example, download a Tanzanian financial dataset from the provided sources or use a public dataset relevant to economics/health). Break your work into these tasks:

1. **Data Preparation:** Load and inspect the dataset. Handle any missing values if needed.
2. **Univariate Analysis:** Compute all basic statistics (mean, median, mode, std, min, max) for numeric features. Display the results in a table (`df.describe()`).
3. **Distribution Plots:** Create histograms and boxplots for key variables (at least three). Explain in Markdown what each plot tells you (e.g. "The income distribution is right-skewed with a long tail, as most people have lower incomes but a few are very high.").
4. **Categorical Analysis:** For each categorical variable, present counts in a bar chart. Provide insights (e.g. "60% of respondents are from Urban regions, suggesting urban bias in the data.").
5. **Bivariate Plots:** For at least two pairs of variables, plot and analyze scatterplots. Compute correlation coefficients for each pair and discuss whether they indicate strong relationships.
6. **Correlation Overview:** Plot a heatmap of the correlation matrix for all numeric features. Highlight any surprising strong correlations or lack thereof.
7. **Narrative Summary:** In well-organized Markdown sections, discuss the overall patterns you discovered. Point out any outliers or anomalies (e.g. "One outlier with extremely high loan amount was found in Region X"). Suggest possible reasons or next steps (such as deeper data cleaning if anomalies appear).
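For the data preparation task, a minimal sketch of inspecting and handling missing values might look like the following. The `raw` DataFrame is hypothetical. Filling numeric gaps with the median and categorical gaps with an `'Unknown'` label is one common choice among several, so state whichever choice you make in your notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value in each column.
raw = pd.DataFrame({
    'income': [52000, np.nan, 47000, 28000, 39000],
    'region': ['Urban', 'Rural', None, 'Rural', 'Urban'],
})

# Step 1: count missing values per column before changing anything.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Step 2: fill on a copy so the raw data stays available for comparison.
cleaned = raw.copy()
cleaned['income'] = cleaned['income'].fillna(cleaned['income'].median())
cleaned['region'] = cleaned['region'].fillna('Unknown')

print(cleaned)
```

Dropping incomplete rows with `dropna()` is another option; it is simpler but shrinks the dataset, which matters when the survey is small.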

**Deliverable:** A Jupyter Notebook (with code and Markdown) containing all plots and written analysis. Make sure each visualization is labeled (with `plt.title`, axis labels) and every chart is interpreted in text. This final notebook should stand alone as a complete EDA report.

