<a href="https://colab.research.google.com/github/bankurucharan/Statistical-Analysis/blob/main/CharanBankuru.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Introduction***

Statistical analysis is a fundamental aspect of data science and machine learning. In this session, we cover the basics of statistical analysis using Python, leveraging libraries like pandas, numpy, and scipy. We will explore how to apply statistical methods to the California Housing dataset.

**Introduction to the California Housing Dataset**

The California Housing dataset is a popular dataset used in the field of data science and machine learning for predictive modeling and statistical analysis. Originally derived from the 1990 U.S. Census, the dataset contains aggregated data on housing in California, focusing on demographic and economic factors that influence housing prices across various regions of the state. This dataset is often used for regression tasks, where the goal is to predict the median house value in a particular area based on a set of features describing the area.

# **1) Basic Concepts in Statistics:-**

**Need for Sampling in Statistics:-**

Sampling is important in statistics because it allows us to study a small group of people or things to understand a larger group without having to look at every single member of that larger group. Here’s why we need sampling:

**Too Many to Study:**  Often, there are too many people or items in a population to study all of them. For example, it’s impossible to survey every single person in a country, so we take a sample instead.

**Saves Time:** Studying a smaller sample takes less time than trying to study the whole population. This is especially important when quick decisions are needed.

**Saves Money:** It’s cheaper to collect data from a small sample than from the entire population. This makes sampling a more affordable option.

***Benefits of Sampling:***

Sampling has several advantages that make it very useful:

**Quick Results:** Sampling allows us to get results faster because we only need to collect data from a small part of the population.

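**Accurate Data**: With fewer units to measure, each observation can be collected and checked more carefully, which helps reduce measurement mistakes. A well-chosen random sample can therefore estimate population values accurately.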
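
As a minimal sketch of the sampling idea, the snippet below draws a simple random sample from a population and compares the sample mean with the population mean. The population is simulated (a log-normal stand-in for an income-like variable), and the names `population` and `sample` as well as the sample size of 1,000 are illustrative assumptions rather than part of this notebook's analysis.

```python
import numpy as np

# Hypothetical population: 20,640 skewed, income-like values (simulated, not the real dataset)
rng = np.random.default_rng(seed=42)
population = rng.lognormal(mean=1.2, sigma=0.5, size=20_640)

# Simple random sample without replacement (sample size chosen for illustration)
sample = rng.choice(population, size=1_000, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()

# Standard error of the sample mean: how much the estimate is expected to vary
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"Population mean: {population_mean:.4f}")
print(f"Sample mean:     {sample_mean:.4f}")
print(f"Standard error:  {standard_error:.4f}")
```

The sample mean should typically land within a few standard errors of the population mean, which is why a sample of about 5% of the units can already give a usable estimate.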

# **2) Understanding Basic Statistical Methods:**

 **Descriptive Statistics:**

**Mean, Median, Mode:** These give you a sense of the central tendency of each feature in the dataset. For example, the mean of 'MedInc' (median income) tells you the average of the block-group median incomes, while the median gives you the middle value.

**Standard Deviation and Variance:** These measure the dispersion or spread of the data. A high standard deviation indicates that the data points are spread out over a larger range of values.

**Range:** This shows the difference between the maximum and minimum values for each feature, giving you a sense of the data's spread.

**Skewness:** This indicates the asymmetry of the distribution. A skewness close to zero indicates a symmetric distribution, while positive or negative values indicate skewness to the right or left, respectively.

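**Kurtosis:** This measures the "tailedness" of the distribution. High kurtosis indicates heavier tails, meaning extreme values (outliers) are more likely. Note that the pandas kurt() method reports excess kurtosis, for which a normal distribution scores 0.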
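
The measures above can be sketched on a tiny sample. The five values below are the 'MedInc' entries of the first five rows printed by this notebook's `df.head()` call; the `summary` dictionary is only an illustrative container, and five points are far too few for skewness or kurtosis to be meaningful beyond showing the calls.

```python
import pandas as pd

# MedInc values of the first five rows shown in this notebook's df.head() output
medinc = pd.Series([8.3252, 8.3014, 7.2574, 5.6431, 3.8462], name="MedInc")

summary = {
    "mean": medinc.mean(),
    "median": medinc.median(),
    "std": medinc.std(),                    # sample standard deviation (ddof=1)
    "variance": medinc.var(),               # sample variance (ddof=1)
    "range": medinc.max() - medinc.min(),
    "skewness": medinc.skew(),              # negative here: the longer tail points toward smaller values
    "kurtosis": medinc.kurt(),              # excess kurtosis (a normal distribution scores 0)
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.4f}")
```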

# ***Inferential Statistics:***

**One-Sample T-Test:**

1. **T-Statistic and P-Value:** The one-sample t-test compares the mean of the 'MedInc' values against a hypothetical population mean (3.5 in this case). The t-statistic shows how far the sample mean is from the population mean in units of the standard error, and the p-value indicates the probability of observing a result at least as extreme as the one obtained, under the assumption that the null hypothesis is true.

2. **95% Confidence Interval:**
 The confidence interval provides a range of values that is likely to contain the true population mean of 'MedInc' with 95% confidence. If this interval does not include the hypothetical population mean (3.5), this further supports the conclusion from the t-test that the sample mean is significantly different from the population mean.

**Linear Regression:**

1. **Regression Summary:** The linear regression model examines the relationship between the independent variable ('MedInc', median income) and the dependent variable ('target', median house value).

2. **R-squared:** This measures the proportion of the variance in the dependent variable that is predictable from the independent variable. A higher R-squared indicates a better fit.

3. **P-value for the Coefficient:** If this p-value is low (typically < 0.05), it suggests that the independent variable ('MedInc') is statistically significantly related to the dependent variable ('target').
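
The quantities described in this section (t-statistic, p-value, confidence interval, regression slope, and R-squared) can be computed by hand to see what the library calls do. The sketch below uses simulated data as a stand-in for 'MedInc' and 'target'; the generating coefficients (0.45 and 0.42), the noise level, and the sample size are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for 'MedInc' (x) and 'target' (y); not the real dataset
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=3.9, scale=1.9, size=500)
y = 0.45 + 0.42 * x + rng.normal(scale=0.8, size=500)

# One-sample t-test by hand: t = (sample mean - mu0) / standard error
mu0 = 3.5
n = x.size
sample_mean = x.mean()
standard_error = x.std(ddof=1) / np.sqrt(n)
t_manual = (sample_mean - mu0) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_1samp(x, mu0)

# 95% confidence interval: sample mean +/- critical t value * standard error
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low = sample_mean - t_crit * standard_error
ci_high = sample_mean + t_crit * standard_error

# Simple linear regression by hand: slope = cov(x, y) / var(x)
slope = np.cov(x, y)[0, 1] / x.var(ddof=1)
intercept = y.mean() - slope * x.mean()

# R-squared = 1 - (residual sum of squares / total sum of squares)
residuals = y - (intercept + slope * x)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"t (manual) = {t_manual:.4f}, t (SciPy) = {t_scipy:.4f}")
print(f"p (manual) = {p_manual:.3g}, p (SciPy) = {p_scipy:.3g}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, R-squared = {r_squared:.4f}")
```

The 95% interval and the two-sided test at the 0.05 level always agree: the interval excludes the hypothesized mean exactly when the p-value falls below 0.05.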

# ***Overview of the California Housing Dataset:***

The dataset comprises 20,640 observations (rows) and 8 features (columns), each representing different characteristics of the housing and population in various block groups (the smallest geographical unit for which the U.S. Census Bureau publishes sample data). The block groups are typically smaller than a city or town and represent a collection of city blocks.

The target variable in the dataset is the median house value for California districts, expressed in hundreds of thousands of dollars. The dataset's features provide information about the socio-economic and environmental conditions in these areas, allowing for a detailed analysis of how these factors correlate with housing prices.

***Key Features:***

**MedInc:** Median income of households in the block group (in tens of thousands of dollars).

**HouseAge:** Median age of the houses in the block group.

**AveRooms:** Average number of rooms per household in the block group.

**AveBedrms:** Average number of bedrooms per household in the block group.

**Population:** Total population of the block group.

**AveOccup:** Average number of occupants per household in the block group.

**Latitude:** Latitude coordinate of the block group.

**Longitude:** Longitude coordinate of the block group.
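
Because 'MedInc' and the target are stored in scaled units, it can help to convert them to plain dollars when reading the numbers. The short sketch below does this for the first row of this notebook's `df.head()` output; the variable names are illustrative only.

```python
# First row of the df.head() output in this notebook
med_inc = 8.3252   # 'MedInc', in tens of thousands of dollars
target = 4.526     # median house value, in hundreds of thousands of dollars

# Undo the scaling to get plain dollar amounts
med_inc_dollars = med_inc * 10_000
house_value_dollars = target * 100_000

print(f"Median income:      ${med_inc_dollars:,.0f}")
print(f"Median house value: ${house_value_dollars:,.0f}")
```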

# ***Sample Code for Statistical Analysis in Python with the California Housing Dataset:***


import pandas as pd
from sklearn.datasets import fetch_california_housing
from scipy import stats
import numpy as np
import statsmodels.api as sm

# Load the dataset
california = fetch_california_housing()
df = pd.DataFrame(data=california.data, columns=california.feature_names)
df['target'] = california.target
# Display the first few rows
print(df.head())

# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())

# Example data: Median Income (MedInc) values
medinc_values = df['MedInc']

# Hypothetical population mean for Median Income
population_mean = 3.5

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(medinc_values, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# Sample mean and standard error for Median Income
sample_mean = np.mean(medinc_values)
standard_error = stats.sem(medinc_values)

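# Compute 95% confidence interval for Median Income
# (t-distribution with n - 1 degrees of freedom, consistent with the t-test above)
confidence_interval = stats.t.interval(0.95, len(medinc_values) - 1, loc=sample_mean, scale=standard_error)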

print(f"95% Confidence Interval for Median Income: {confidence_interval}")

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['MedInc'])

# Define dependent variable
y = df['target']

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())


   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  
Mean:
 MedInc           3.870671
HouseAge        28.639486
AveRooms         5.429000
AveBedrms        1.096675
Population    1425.476744
AveOccup         3.070655
Latitude        35.631861
Longitude     -119.569704
target           2.068558
dtype: float64

Median:
 MedInc           3.534800
HouseAge        29.000000
AveRooms         5.229129
AveBedrms        1.048780
Population    1166.000000
AveOccup 

# ***Conclusion:-***

The analysis indicates that median income is strongly associated with housing prices in California: block groups with higher median incomes tend to have higher median house values. The one-sample t-test supports the conclusion that the mean of the 'MedInc' values (about 3.87) differs significantly from the hypothetical population mean of 3.5. This relationship between income and housing prices highlights the economic disparities across different regions of the state and provides insight into the factors driving the housing market.

We also explored descriptive statistics, which offer powerful tools for summarizing and understanding data. Measures like the mean, median, mode, standard deviation, variance, and range provide valuable insights into the central tendency and spread of the data. Additional measures like skewness and kurtosis further characterize the shape of each distribution, making it easier to identify trends, outliers, and underlying patterns.

By applying these statistical techniques using Python and libraries like pandas, numpy, scipy, and statsmodels, we can perform robust analyses on the California Housing dataset.
