
# ***Introduction:***
Statistical analysis is a fundamental aspect of data science and machine learning. In this session, we explore the basics of statistical analysis in Python on the Titanic dataset, using libraries such as pandas, numpy, scipy, and statsmodels.

1. ***Descriptive Statistics:***

**Mean, Median, Mode:** These give you a sense of the central tendency of each feature in the dataset. For example, the mean of 'age' tells you the average passenger age in the dataset, while the median gives you the middle value.

**Standard Deviation and Variance:** These measure the dispersion or spread of the data. A high standard deviation indicates that the data points are spread out over a larger range of values.

**Range:** This shows the difference between the maximum and minimum values for each feature, giving you a sense of the data's spread.

**Skewness:** This indicates the asymmetry of the distribution. A skewness close to zero indicates a symmetric distribution, while positive or negative values indicate skewness to the right or left, respectively.

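**Kurtosis:** This measures the "tailedness" of the distribution. High kurtosis indicates heavier tails, meaning extreme values (outliers) are more likely. Note that pandas' `kurt()` reports excess kurtosis, for which a normal distribution scores 0.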
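The definitions of skewness and kurtosis above can be sketched directly from central moments. This is a minimal illustration, not the notebook's own method: the helper names `moment_skewness` and `excess_kurtosis` and both sample lists are hypothetical, and the formulas are the simple population (biased) versions, so they will differ slightly from the bias-corrected values that pandas' `skew()` and `kurt()` report.

```python
import numpy as np

def moment_skewness(values):
    """Population (biased) skewness: m3 / m2**1.5, from central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (population variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Population (biased) excess kurtosis: m4 / m2**2 - 3 (normal = 0)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Hypothetical samples, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

for name, sample in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name)
    print("  mean:", np.mean(sample), "median:", np.median(sample))
    print("  skewness:", moment_skewness(sample))
    print("  excess kurtosis:", excess_kurtosis(sample))
```

In the right-skewed sample the single large value pulls the mean above the median and makes the skewness positive, which is the pattern to look for in a long-tailed column.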

2. ***One-Sample T-Test:***

**T-Statistic and P-Value:** The one-sample t-test compares the mean of the 'age' values against a hypothesized population mean (30 in this case). The t-statistic shows how far the sample mean is from the hypothesized mean in units of the standard error, and the p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.

**Conclusion:** If the p-value is small (typically less than 0.05), you reject the null hypothesis, suggesting that the sample mean is statistically significantly different from the hypothesized population mean. If the p-value is large, you do not reject the null hypothesis, meaning there isn't strong evidence that the population mean differs from the hypothesized value.
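To make the t-statistic and p-value concrete, the sketch below computes both by hand and checks them against `scipy.stats.ttest_1samp`. The ten ages are made-up values for illustration only (they are not drawn from the dataset), and the variable names are assumptions of this example.

```python
import numpy as np
from scipy import stats

# Hypothetical ages, made up for illustration (not taken from the dataset).
sample = np.array([22.0, 41.0, 19.0, 35.0, 28.0, 54.0, 31.0, 24.0, 47.0, 30.0])
hypothesized_mean = 30.0

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)           # sample standard deviation (n - 1)
standard_error = sample_std / np.sqrt(n)  # standard error of the mean

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - hypothesized_mean) / standard_error

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_1samp(sample, hypothesized_mean)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```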

3. ***95% Confidence Interval:***
**Interpretation:** The confidence interval provides a range of values that is likely to contain the true population mean of 'age' with 95% confidence. If this interval does not include the hypothesized population mean (30), this is consistent with a t-test result that rejects the null hypothesis at the 5% level.
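The link between the confidence interval and the t-test can be sketched as follows. This example assumes a t-based interval with n - 1 degrees of freedom, since that is the version that agrees exactly with the two-sided t-test. The helper name `t_confidence_interval` and the sample values are hypothetical.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(values, confidence=0.95):
    """Confidence interval for the mean, based on the t-distribution."""
    x = np.asarray(values, dtype=float)
    return stats.t.interval(
        confidence,
        df=x.size - 1,       # degrees of freedom
        loc=x.mean(),        # center the interval on the sample mean
        scale=stats.sem(x),  # standard error of the mean (ddof=1)
    )

# Hypothetical ages, made up for illustration (not taken from the dataset).
sample = [22.0, 41.0, 19.0, 35.0, 28.0, 54.0, 31.0, 24.0, 47.0, 30.0]
hypothesized_mean = 30.0

low, high = t_confidence_interval(sample)
_, p_value = stats.ttest_1samp(sample, hypothesized_mean)

# The 95% interval excludes the hypothesized mean exactly when p < 0.05.
excluded = not (low <= hypothesized_mean <= high)

print(f"95% CI: ({low:.2f}, {high:.2f})")
print(f"p-value: {p_value:.4f}")
print(f"hypothesized mean outside CI: {excluded}")
```

For large samples a normal-based interval gives nearly the same bounds, but the t-based interval is the one that matches the t-test decision for any sample size.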

4. ***Linear Regression:***
**Regression Summary:** The linear regression model examines the relationship between the independent variable ('age') and the dependent variable ('fare').

**Key Outputs:**

**Coefficients:** The coefficient for 'age' indicates the expected change in fare for a one-unit (one-year) increase in age. Because the model has a single predictor, there are no other variables to hold constant.

**R-squared:** This measures the proportion of the variance in the dependent variable that is predictable from the independent variable. A higher R-squared indicates a better fit.

**P-value for the Coefficient:** If this p-value is low (typically < 0.05), it suggests that the independent variable ('age') is statistically significantly related to the dependent variable ('fare').

**Conclusion:** If the coefficient for 'age' is positive and statistically significant, you can conclude that older passengers tended to pay higher fares. The direction and size of this relationship are reflected in the coefficient, and its strength in the R-squared value.
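The coefficient and R-squared described above can be computed by hand for a single-predictor model. The sketch below uses synthetic data (an assumed true slope of 2 and intercept of 1 plus random noise, not the dataset) and cross-checks the result against `scipy.stats.linregress`, used here in place of statsmodels only to keep the example small.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration: y = 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Least-squares slope and intercept for simple linear regression
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# R-squared: share of the variance in y explained by the fitted line
predicted = intercept + slope * x
ss_res = np.sum((y - predicted) ** 2)  # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

# Cross-check against SciPy's simple linear regression
result = stats.linregress(x, y)

print(f"slope: {slope:.4f} (scipy: {result.slope:.4f})")
print(f"intercept: {intercept:.4f} (scipy: {result.intercept:.4f})")
print(f"R-squared: {r_squared:.4f} (scipy r^2: {result.rvalue ** 2:.4f})")
print(f"p-value for slope: {result.pvalue:.3g}")
```

With one predictor, R-squared equals the squared correlation between x and y, so a weak correlation implies a low R-squared even when the slope is statistically significant.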

In [1]:
import pandas as pd
import seaborn as sns
from scipy import stats
import numpy as np
import statsmodels.api as sm

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Display the first few rows
print(titanic.head())

# Drop rows with missing values in the columns of interest, then keep only those numeric columns
df = titanic.dropna(subset=['age', 'fare', 'pclass', 'survived'])
df = df[['age', 'fare', 'pclass', 'survived']]

# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())

# Example data: Age values
age_values = df['age']

# Hypothetical population mean for Age
population_mean = 30.0

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(age_values, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# Sample mean and standard error for Age
sample_mean = np.mean(age_values)
standard_error = stats.sem(age_values)

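# Compute 95% confidence interval for Age using the t-distribution,
# consistent with the one-sample t-test (df = n - 1)
confidence_interval = stats.t.interval(0.95, df=len(age_values) - 1, loc=sample_mean, scale=standard_error)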

print(f"95% Confidence Interval for Age: {confidence_interval}")

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['age'])

# Define dependent variable (e.g., Fare)
y = df['fare']

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
Mean:
 age         29.699118
fare        34.694514
pclass       2.236695
survived     0.406162
dtype: float64

Median:
 age         28.0000
fare        15.7417
pclass       2.0000
survived     0.0000
dtype: float64

Mod

***Conclusion:***

The descriptive statistics provide an overview of the age and fare distributions, showing how these variables are spread and their central tendencies.

The one-sample t-test indicates whether the average age differs significantly from the hypothesized population mean of 30.

The regression model explores the relationship between age and fare, though age alone may not be a strong predictor for fare (based on the R-squared value).