<a href="https://colab.research.google.com/github/aneelkumar18/bostondataset/blob/main/housingdataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's summarize the methods and findings from the analysis of the Boston Housing dataset.

**1. Descriptive Statistics:**  
**Mean:** The average values of the features and the target variable are calculated, providing an overall sense of the data.  
**Median:** The middle value in the dataset for each feature, which helps understand the central tendency and compare it with the mean.  
**Mode:** The most frequently occurring value in the dataset for each feature.  
**Standard Deviation:** Indicates the amount of variation or dispersion of each feature from its mean.     
**Variance:** The squared standard deviation, showing the spread of the feature values.  
**Range:** The difference between the maximum and minimum values, giving insight into the range of values.   
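**Skewness:** Measures the asymmetry of the distribution of values. Positive skewness indicates a long tail on the right, while negative skewness indicates a long tail on the left.  
**Kurtosis:** Measures how heavy the tails of the distribution are relative to a normal distribution. pandas reports excess kurtosis, so a normal distribution scores 0.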
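
The definitions above can be checked by hand on a tiny sample. The following is a minimal sketch using only the Python standard library. The five values are hypothetical and are not taken from the dataset. The skewness shown is the simple moment-based form; pandas applies a small-sample bias correction, so `Series.skew()` would give a somewhat different number on the same data.

```python
import statistics

# Hypothetical sample, not taken from the Boston dataset
values = [2, 4, 4, 5, 10]

n = len(values)
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)            # most frequent value
variance = statistics.variance(values)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)        # square root of the sample variance
value_range = max(values) - min(values)   # maximum minus minimum

# Moment-based skewness: third central moment over the second to the power 1.5
m2 = sum((v - mean) ** 2 for v in values) / n
m3 = sum((v - mean) ** 3 for v in values) / n
skewness = m3 / m2 ** 1.5

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", variance, "std dev:", std_dev, "range:", value_range)
print("skewness:", round(skewness, 3))
```

The single large value (10) pulls the mean above the median and produces positive skewness, which is the "long tail on the right" case described above.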

**2. One-Sample T-Test for Average Number of Rooms (RM):**  
**T-Statistic:** Measures how many standard errors the sample mean is from the hypothesized population mean (6.0 in this case).  
**P-Value:** Indicates the probability of observing a test statistic at least as extreme as the one computed, under the null hypothesis. A small p-value (typically less than 0.05) suggests that the observed mean is significantly different from the hypothesized mean.  
**Conclusion:** If the p-value is below the chosen significance level (0.05 here), we reject the null hypothesis and conclude that the average number of rooms differs significantly from the hypothesized population mean of 6.0. Otherwise, we fail to reject the null hypothesis: the data do not show a significant difference from 6.0.  
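
To make the t-statistic and p-value concrete, here is a minimal sketch that computes both by hand and compares them with `scipy.stats.ttest_1samp`. The eight room counts are hypothetical values chosen for illustration, not rows from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical room counts, not taken from the Boston dataset
rooms = np.array([6.1, 6.4, 5.9, 6.8, 6.3, 6.0, 6.6, 6.2])
hypothesized_mean = 6.0

n = rooms.size
sample_mean = rooms.mean()
standard_error = rooms.std(ddof=1) / np.sqrt(n)  # s / sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - hypothesized_mean) / standard_error

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual computation
t_scipy, p_scipy = stats.ttest_1samp(rooms, hypothesized_mean)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The two lines of output should match, which shows that the test is nothing more than the distance between the sample mean and 6.0 measured in standard errors.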

**3. 95% Confidence Interval for Average Number of Rooms (RM):**  
Gives a range of plausible values for the true population mean of the average number of rooms: an interval constructed this way captures the true mean in 95% of repeated samples. If the hypothesized mean of 6.0 falls outside this interval, the data are inconsistent with a true mean of 6.0 at the 5% significance level.
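
The interval can be built directly from the sample mean, the standard error, and a critical value. The sketch below uses the t-distribution with n - 1 degrees of freedom, which is consistent with the one-sample t-test; for large samples a normal-based interval is nearly identical. The data are hypothetical, not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical room counts, not taken from the Boston dataset
rooms = np.array([6.1, 6.4, 5.9, 6.8, 6.3, 6.0, 6.6, 6.2])
hypothesized_mean = 6.0

n = rooms.size
sample_mean = rooms.mean()
standard_error = stats.sem(rooms)  # s / sqrt(n), with ddof=1

# Critical value t* such that 95% of the t-distribution lies within +/- t*
t_critical = stats.t.ppf(0.975, df=n - 1)
margin = t_critical * standard_error
ci_lower, ci_upper = sample_mean - margin, sample_mean + margin

# Check whether the hypothesized mean is inside the interval
inside = ci_lower <= hypothesized_mean <= ci_upper

print(f"95% CI: ({ci_lower:.4f}, {ci_upper:.4f})")
print(f"Hypothesized mean {hypothesized_mean} inside interval: {inside}")
```

Whether 6.0 falls inside the interval always agrees with the two-sided t-test at the 5% level, because both are derived from the same t-distribution.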

**4. Linear Regression Analysis:**  
**Model Summary:** The linear regression model is fitted to predict housing prices (target) based on the average number of rooms (RM). The model summary includes:  
**Coefficients:** Show the relationship between the predictor (average number of rooms) and the target variable (housing price). A positive coefficient suggests that an increase in the number of rooms is associated with an increase in housing price.  
**R-squared:** Represents the proportion of the variance in the target variable that is predictable from the predictor variable. Higher values indicate a better fit of the model to the data.  
**P-Value:** Tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value indicates that the predictor variable is significantly associated with the target variable.  
**Confidence Intervals:** Provide a range within which the true coefficient values are expected to lie at a stated confidence level (95% by default in the statsmodels summary).
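
For a single predictor, the coefficient and R-squared reported in the model summary can be reproduced with closed-form formulas. The following NumPy-only sketch illustrates this under stated assumptions: the four data points are hypothetical, and the variable names are chosen for illustration rather than taken from the notebook.

```python
import numpy as np

# Hypothetical data: average rooms (x) and price (y), not from the Boston dataset
x = np.array([5.0, 6.0, 7.0, 8.0])
y = np.array([15.0, 22.0, 31.0, 36.0])

x_centered = x - x.mean()
y_centered = y - y.mean()

# Slope = covariance(x, y) / variance(x); the intercept makes the line pass
# through the point (mean of x, mean of y)
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

# R-squared = 1 - (residual sum of squares / total sum of squares)
predictions = intercept + slope * x
ss_residual = np.sum((y - predictions) ** 2)
ss_total = np.sum(y_centered ** 2)
r_squared = 1 - ss_residual / ss_total

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, R^2 = {r_squared:.4f}")
```

A positive slope here means each additional room is associated with a higher predicted price, and an R-squared near 1 means the line explains most of the variation in this toy sample.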

import pandas as pd
import statsmodels.api as sm
# sklearn.datasets.load_boston was removed in scikit-learn 1.2, so the data is read from the original source below
from scipy import stats
import numpy as np

# Load the Boston Housing dataset from the original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

df = pd.DataFrame(data=data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
df['target'] = target


# Display the first few rows
print(df.head())

# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())

# Example data: Average number of rooms (RM)
average_rooms = df['RM']

# Hypothetical population mean for Average number of rooms
population_mean = 6.0

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(average_rooms, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# Sample mean and standard error for Average number of rooms
sample_mean = np.mean(average_rooms)
standard_error = stats.sem(average_rooms)

# Compute 95% confidence interval for Average number of rooms
# (t-distribution with n - 1 degrees of freedom, consistent with the t-test)
confidence_interval = stats.t.interval(0.95, len(average_rooms) - 1, loc=sample_mean, scale=standard_error)

print(f"95% Confidence Interval for Average Number of Rooms: {confidence_interval}")

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['RM'])

# Define dependent variable (e.g., Housing price)
y = df['target']

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  target  
0     15.3  396.90   4.98    24.0  
1     17.8  396.90   9.14    21.6  
2     17.8  392.83   4.03    34.7  
3     18.7  394.63   2.94    33.4  
4     18.7  396.90   5.33    36.2  
Mean:
 CRIM         3.613524
ZN          11.363636
INDUS       11.136779
CHAS         0.069170
NOX          0.554695
RM           6.284634
AGE         68.574901
DIS          3.795043
RAD          9.549407
TAX        408.237154
PTRATIO     18.455534
B          356.674032
LSTAT       12.653063
target      22.532806
dtype: float64

Median:
 CRIM 

**Conclusion:**  
Descriptive statistics summarize the distribution and spread of each variable. The one-sample t-test assesses whether the average number of rooms differs significantly from the hypothesized value of 6.0, and the 95% confidence interval gives a range of plausible values for the true mean. Linear regression quantifies the association between the number of rooms and housing prices, and the model summary reports the statistics needed to judge the significance and strength of that relationship. Together, these steps describe the data, test an assumption about the average number of rooms, and model how housing prices vary with the number of rooms.