In [1]:
# <div align="center" style="background-color:#ffcc00; color:white; padding:10px; border-radius:10px;">EDUCATION DISTRICTWISE | DESCRIPTIVE STATISTICS WITH PYTHON</div>


## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Table of Contents</div>

- [🔹 Introduction](#introduction)
- [🔹 Dataset Overview](#dataset-overview)
- [🔹 Import Packages and Libraries](#import-packages-and-libraries)
- [🔹 Explore the Data](#explore-the-data)
- [🔹 Descriptive Statistics](#descriptive-statistics)
- [🔹 Statistical Functions](#statistical-functions)
- [🔹 Conclusion](#conclusion)  

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Introduction</div>

Education is a cornerstone of socio-economic development, and understanding its distribution across regions is essential for policy-making and resource allocation. This dataset provides a district-wise overview of educational infrastructure and literacy levels across multiple states. By analyzing this data, we can identify disparities in education, assess the availability of educational facilities, and explore the relationship between population distribution and literacy rates.

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Dataset Overview</div>

The dataset consists of 680 entries and contains information on district-level educational infrastructure across various states. The key attributes include:

- DISTNAME: Name of the district.
- STATNAME: Name of the state.
- BLOCKS: Number of administrative blocks in the district.
- VILLAGES: Number of villages within the district.
- CLUSTERS: Number of school clusters, which likely represent administrative or educational groupings of schools.
- TOTPOPULAT: Total population of the district (634 non-null values out of 680, so 46 entries are missing).
- OVERALL_LI: Overall literacy rate of the district, given as a percentage (634 non-null values out of 680).

Key Observations:

- The dataset is well-structured but has missing values in the total population and literacy rate columns.
- The numerical data spans different aspects of education, from infrastructure availability (blocks, villages, clusters) to education outcomes (literacy rates).
- This dataset can be leveraged for regional education policy analysis, identification of underdeveloped districts, and correlation studies between infrastructure and literacy levels.

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Import Packages and Libraries</div>

Before we begin, we need to import the necessary libraries. In this course, we'll use pandas and NumPy for data operations, Matplotlib and seaborn for visualization, and SciPy for statistical functions.

In [2]:
%%time

# Importing select libraries
from gc import collect  # garbage collection to free up memory
from warnings import filterwarnings  # handle warning messages

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra

import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization

# Set the plot style to 'fivethirtyeight'
plt.style.use("fivethirtyeight")

from datetime import datetime  # Importing the datetime class from the datetime module

from scipy import stats # statistical functions

filterwarnings('ignore'); # Ignore warning messages
from IPython.display import display_html, clear_output; # displaying HTML content


clear_output();
print();
collect();


CPU times: user 1.56 s, sys: 129 ms, total: 1.69 s
Wall time: 1.4 s


0

In [3]:
import os

print(os.listdir("/kaggle/input"))

['education-districtwise']


In [4]:
import os

print(os.listdir("/kaggle/input/education-districtwise"))


['education_districtwise.csv']


In [9]:
import pandas as pd

# Define the correct file path
file_path = "/kaggle/input/education-districtwise/education_districtwise.csv"

# Load the dataset
df = pd.read_csv(file_path) 

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Explore the Data</div>

Let's begin with the head() function to quickly examine the dataset. Remember, head(n) returns the first n rows of the DataFrame; called without an argument, it returns the first five.

In [10]:
# Display the first few rows
df.head(10)

| Unnamed: 0 | DISTNAME | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI |
|---|---|---|---|---|---|---|---|
| 0 | DISTRICT32 | STATE1 | 13 | 391 | 104 | 875564.0 | 66.92 |
| 1 | DISTRICT649 | STATE1 | 18 | 678 | 144 | 1015503.0 | 66.93 |
| 2 | DISTRICT229 | STATE1 | 8 | 94 | 65 | 1269751.0 | 71.21 |
| 3 | DISTRICT259 | STATE1 | 13 | 523 | 104 | 735753.0 | 57.98 |
| 4 | DISTRICT486 | STATE1 | 8 | 359 | 64 | 570060.0 | 65.0 |
| 5 | DISTRICT323 | STATE1 | 12 | 523 | 96 | 1070144.0 | 64.32 |
| 6 | DISTRICT114 | STATE1 | 6 | 110 | 49 | 147104.0 | 80.48 |
| 7 | DISTRICT438 | STATE1 | 7 | 134 | 54 | 143388.0 | 74.49 |
| 8 | DISTRICT610 | STATE1 | 10 | 388 | 80 | 409576.0 | 65.97 |
| 9 | DISTRICT476 | STATE1 | 11 | 361 | 86 | 555357.0 | 69.9 |


Important: To accurately interpret this data, keep in mind that each row, or observation, represents a district—not a state or a village. The VILLAGES column shows the number of villages within each district, the TOTPOPULAT column reflects the total population of the district, and the OVERALL_LI column represents the literacy rate for that district.
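
To make the behaviour of head() and the one-row-per-district layout concrete, here is a minimal sketch. It rebuilds the first six rows of the output above as a small stand-in DataFrame; the name `sample` is an assumption for illustration and is not part of the original notebook. On the full dataset the same calls would be made on df.

```python
import pandas as pd

# Stand-in for df: the first six rows from the head() output above.
sample = pd.DataFrame({
    "DISTNAME": ["DISTRICT32", "DISTRICT649", "DISTRICT229",
                 "DISTRICT259", "DISTRICT486", "DISTRICT323"],
    "STATNAME": ["STATE1"] * 6,
    "BLOCKS": [13, 18, 8, 13, 8, 12],
    "VILLAGES": [391, 678, 94, 523, 359, 523],
    "CLUSTERS": [104, 144, 65, 104, 64, 96],
    "TOTPOPULAT": [875564.0, 1015503.0, 1269751.0, 735753.0, 570060.0, 1070144.0],
    "OVERALL_LI": [66.92, 66.93, 71.21, 57.98, 65.0, 64.32],
})

first_three = sample.head(3)   # head(n) returns the first n rows
default_head = sample.head()   # with no argument, head() returns the first 5 rows

n_rows, n_cols = sample.shape  # (rows, columns); each row is one district
print(first_three)
print(f"{n_rows} districts, {n_cols} columns")
print(sample.dtypes)           # data type of each column
```

Checking shape and dtypes alongside head() is a common first step, since it confirms how many observations there are and which columns are numeric before any statistics are computed.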

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Descriptive Statistics</div> 

Now that we have a clear understanding of the dataset, we can leverage Python to compute descriptive statistics efficiently.

One of the most essential functions for this task is describe(), which provides a summary of key statistical measures in a single command. Data professionals frequently use describe() to gain quick insights into numerical columns.

For a numeric column, describe() returns the following statistics:

- count: Number of non-null (non-NA) observations
- mean: Arithmetic average of the values
- std: Standard deviation, measuring data dispersion
- min: Minimum value in the dataset
- 25%: First quartile (25th percentile)
- 50%: Median (50th percentile)
- 75%: Third quartile (75th percentile)
- max: Maximum value in the dataset

By using describe(), we can efficiently summarize the dataset’s distribution and variability, making it a powerful tool for exploratory data analysis. 🚀

In [11]:
df['OVERALL_LI'].describe()

count    634.000000
mean      73.395189
std       10.098460
min       37.220000
25%       66.437500
50%       73.490000
75%       80.815000
max       98.760000
Name: OVERALL_LI, dtype: float64

### Interpreting the Summary Statistics 

The statistical summary provides valuable insights into the overall literacy rate. For instance, the mean helps identify the central tendency of the dataset, revealing that the average literacy rate across all districts is approximately 73%. This serves as both a key standalone metric and a useful reference for comparison. By knowing the mean literacy rate, we can better assess which districts significantly exceed or fall below this benchmark.

Note: The describe() function automatically excludes missing values (NaN) from its calculations. As a result, the count of observations for OVERALL_LI (634) is lower than the total number of rows in the dataset (680). Handling missing data is a complex topic beyond the scope of this course.
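
The gap between the row count and the describe() count can be checked directly. The sketch below uses a short, hypothetical series with two missing entries (the values are illustrative, not the real column) to show that describe() only counts and averages the non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical literacy rates with two missing entries (not the real data).
rates = pd.Series([66.92, np.nan, 71.21, 57.98, np.nan, 80.48], name="OVERALL_LI")

summary = rates.describe()

n_total = len(rates)                  # all rows, including NaN
n_missing = int(rates.isna().sum())   # rows with no value
n_used = int(summary["count"])        # rows describe() actually used

print(summary)
print(f"total={n_total}, missing={n_missing}, used={n_used}")
```

Applied to the full dataset, the same arithmetic gives 680 - 634 = 46 districts with no recorded literacy rate.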

In [12]:
df['STATNAME'].describe()

count         680
unique         36
top       STATE21
freq           75
Name: STATNAME, dtype: object

### Understanding Categorical Statistics 

The unique row shows that the dataset contains 36 different states. The top row identifies STATE21 as the most frequently occurring state, making it the mode. The freq row shows that STATE21 appears in 75 rows; since each row is one district, STATE21 accounts for 75 districts.

This insight can be valuable for resource allocation, as states with a higher number of districts may require more educational support and infrastructure.
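
For a fuller picture than the single top value, value_counts() lists the number of rows for every category. The following is a minimal sketch on hypothetical state labels (not the real data); on the full dataset it would be called on df['STATNAME'].

```python
import pandas as pd

# Hypothetical state labels, one per district (not the real data).
states = pd.Series(
    ["STATE1", "STATE2", "STATE2", "STATE3", "STATE2", "STATE1"],
    name="STATNAME",
)

counts = states.value_counts()   # districts per state, largest first
top_state = counts.idxmax()      # matches the `top` row of describe()
top_freq = int(counts.max())     # matches the `freq` row of describe()
n_unique = states.nunique()      # matches the `unique` row of describe()

print(counts)
print(f"{n_unique} states; most frequent: {top_state} ({top_freq} districts)")
```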

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Statistical Functions</div> 

The describe() function is particularly useful because it provides multiple key statistics in a single step. However, pandas also provides individual methods for specific statistical measures, such as mean(), median(), std(), min(), and max().

Earlier in the program, you used mean() and median() to identify potential outliers. These standalone functions are also valuable for performing additional computations based on descriptive statistics.
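
As an illustration, the standalone methods return the same numbers that describe() reports. This sketch uses only the ten literacy rates visible in the head() output, so the figures differ from the full-dataset summary; the variable names are chosen for this example.

```python
import pandas as pd

# Literacy rates of the ten districts shown in the head() output.
rates = pd.Series([66.92, 66.93, 71.21, 57.98, 65.0,
                   64.32, 80.48, 74.49, 65.97, 69.9])

summary = rates.describe()

mean_li = rates.mean()       # arithmetic average
median_li = rates.median()   # same value as the 50% row of describe()
std_li = rates.std()         # sample standard deviation (ddof=1), as in describe()

print(f"mean={mean_li:.2f}, median={median_li:.3f}, std={std_li:.2f}")
```

Having each statistic in its own variable makes it easy to reuse in later calculations, for example comparing the mean with the median to see whether a few extreme values pull the average away from the middle of the data.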

### Computing the Range Using max() and min()

The range of a dataset is calculated as the difference between its maximum and minimum values:

$$\text{Range} = \text{max} - \text{min}$$

You can apply max() and min() to determine the range of literacy rates across all districts in your dataset. This measure helps quantify the spread of values, offering insight into variability within the data.

In [13]:
range_overall_li = df['OVERALL_LI'].max() - df['OVERALL_LI'].min()
range_overall_li

61.540000000000006

### Interpreting the Literacy Rate Range 

The literacy rate across all districts varies by approximately 61.5 percentage points.

This significant disparity indicates that some districts have substantially higher literacy rates than others. As we continue analyzing the data, we can identify the districts with the lowest literacy levels. These insights can help the government gain a clearer national perspective on literacy and strengthen successful educational initiatives.
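
One way to move from the range to the districts behind it is to look up where the minimum and maximum occur. The sketch below does this for the ten districts shown in the head() output only; `sample` is a stand-in name, and on the full dataset the same calls would be applied to df.

```python
import pandas as pd

# The ten districts from the head() output (names and literacy rates only).
sample = pd.DataFrame({
    "DISTNAME": ["DISTRICT32", "DISTRICT649", "DISTRICT229", "DISTRICT259",
                 "DISTRICT486", "DISTRICT323", "DISTRICT114", "DISTRICT438",
                 "DISTRICT610", "DISTRICT476"],
    "OVERALL_LI": [66.92, 66.93, 71.21, 57.98, 65.0,
                   64.32, 80.48, 74.49, 65.97, 69.9],
})

li = sample["OVERALL_LI"]
literacy_range = li.max() - li.min()            # spread within this sample

lowest = sample.loc[li.idxmin(), "DISTNAME"]    # district with the lowest rate
highest = sample.loc[li.idxmax(), "DISTNAME"]   # district with the highest rate
bottom_three = sample.nsmallest(3, "OVERALL_LI")  # three lowest-literacy districts

print(f"range={literacy_range:.2f}, lowest={lowest}, highest={highest}")
print(bottom_three)
```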

## <div align="center" style="background-color:#0077b6; color:white; padding:10px; border-radius:10px;">📌 Conclusion</div> 


### Conclusion After Running Descriptive Statistics

Based on the descriptive statistics of the overall literacy rate (OVERALL_LI):

- General Literacy Insights
    - The mean literacy rate across districts is 73.40%, suggesting a moderately high literacy level overall.
    - However, the standard deviation of 10.10 percentage points indicates notable variation between districts.

- Distribution of Literacy Rates
    - The minimum literacy rate is 37.22%, highlighting regions with very low literacy.
    - The maximum literacy rate is 98.76%, suggesting some districts have nearly complete literacy.
    - The first and third quartiles (66.44% and 80.82%) bound the middle 50% of districts, giving an interquartile range (IQR) of about 14.38 percentage points.

- Regional Literacy Variability
    - The range of literacy rates is 61.54 percentage points, a large gap between the least and most literate districts.
    - STATE21 has the most districts (75), making it a key focus for analysis.
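
The interquartile range is the difference between the third and first quartiles. As a sketch, the code below computes it with quantile() on the ten rates from the head() output, and then repeats the subtraction with the quartiles reported by describe() for the full column. The variable names are chosen for this example.

```python
import pandas as pd

# Literacy rates of the ten districts shown in the head() output.
rates = pd.Series([66.92, 66.93, 71.21, 57.98, 65.0,
                   64.32, 80.48, 74.49, 65.97, 69.9])

q1 = rates.quantile(0.25)   # first quartile (25th percentile)
q3 = rates.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1               # spread of the middle half of this sample

# Same subtraction using the 25% and 75% values from the describe() output.
iqr_full = 80.815 - 66.4375

print(f"sample: Q1={q1:.4f}, Q3={q3:.4f}, IQR={iqr:.2f}")
print(f"full dataset IQR: {iqr_full:.4f}")
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so it is less affected by a handful of extreme districts.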

### Next Steps

- Visualize Literacy Rate Distribution
    - Use a histogram or boxplot to better understand the spread of literacy rates.

- Identify Low-Literacy Regions
    - Filter and analyze districts below the 25th percentile (66.44%) to investigate possible causes.

- Compare Literacy by State
    - Group data by STATNAME and compute average literacy rates per state for deeper insights.
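
The filtering and grouping steps could look roughly like the sketch below. It runs on a small, hypothetical DataFrame (district names, states, and rates are invented for illustration), so only the pattern carries over to df, not the numbers.

```python
import pandas as pd

# Hypothetical districts in three states, one with a missing rate (not the real data).
sample = pd.DataFrame({
    "DISTNAME": ["D1", "D2", "D3", "D4", "D5", "D6"],
    "STATNAME": ["STATE1", "STATE1", "STATE2", "STATE2", "STATE3", "STATE3"],
    "OVERALL_LI": [60.0, 70.0, 80.0, 90.0, 50.0, float("nan")],
})

# Districts below the 25th percentile; quantile() skips NaN by default.
q1 = sample["OVERALL_LI"].quantile(0.25)
low_literacy = sample[sample["OVERALL_LI"] < q1]

# Average literacy per state, lowest first; mean() also skips NaN.
state_means = sample.groupby("STATNAME")["OVERALL_LI"].mean().sort_values()

print(f"25th percentile: {q1}")
print(low_literacy)
print(state_means)
```

Note that rows with a missing rate never satisfy the comparison, so they are left out of the low-literacy subset rather than flagged; whether that is appropriate depends on how missing data should be handled.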