### Statistical Learning and Predictive Modeling

This project applies statistical learning techniques to analyze patterns and relationships in data and to make predictions. Using various datasets, it demonstrates key steps including:

- **Data Cleaning**: Handling missing values and ensuring data consistency.
- **Modeling**: Applying linear regression, logistic regression, and other models to make predictions.
- **Visualization**: Recreating visualizations to gain insights from data.
- **Data Manipulation**: Filtering and joining datasets to extract meaningful information.

The goal is to apply statistical learning methods to real-world data, showcasing skills in predictive modeling and analysis.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Analyzing Height Inheritance: Regression Toward the Mean

This project explores genetic inheritance using the historical dataset of fathers' and sons' heights analyzed by Karl Pearson and Alice Lee. The goal is to investigate whether the observed difference in heights between fathers and sons can be attributed to chance or reflects a systematic effect. The same data are a classic illustration of **regression toward the mean**: sons of unusually tall or short fathers tend to be closer to average height than their fathers.

In this context, regression analysis will be used to examine the relationship between fathers' height and sons' height, while showcasing modern statistical techniques.

In [4]:
# Load the data with the correct path
file_path = '~/Desktop/DevKit/fatherson.csv'
data = pd.read_csv(file_path, delimiter='\t')

FileNotFoundError: [Errno 2] No such file or directory: '/Users/zerayacobmeshesha/Desktop/DevKit/fatherson.csv'
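
The `FileNotFoundError` above comes from a path that exists only on one machine. Below is a minimal sketch of a loader that fails with a more descriptive message. The function name `load_heights` and its arguments are hypothetical, and the tab delimiter is simply carried over from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_heights(path, sep="\t"):
    """Read a delimited height file, raising a descriptive error if it is missing.

    `load_heights` is a hypothetical helper, not part of the original notebook.
    """
    resolved = Path(path).expanduser()  # expand "~" to the user's home directory
    if not resolved.is_file():
        raise FileNotFoundError(
            f"Expected the father-son data at {resolved}; adjust the path before running."
        )
    return pd.read_csv(resolved, sep=sep)
```

With this helper, the loading cell would become `data = load_heights('~/Desktop/DevKit/fatherson.csv')`.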

In [None]:
# Convert the columns to numeric, forcing errors to NaN
data['fheight'] = pd.to_numeric(data['fheight'], errors='coerce')
data['sheight'] = pd.to_numeric(data['sheight'], errors='coerce')

# Drop rows with NaN values (if any)
data = data.dropna()

# Display the first few rows of the cleaned dataset and summary statistics
print("First few rows of the dataset:")
print(data.head())

print("\nSummary statistics:")
print(data.describe())

# Create density plots for fathers' and sons' heights
plt.figure(figsize=(10, 6))
sns.kdeplot(data['fheight'], label="Fathers' Height")
sns.kdeplot(data['sheight'], label="Sons' Height")
plt.title("Density Plot of Fathers' and Sons' Heights")
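# Pearson and Lee recorded heights in inches; change the unit label if this file has not been converted
plt.xlabel("Height (cm)")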
plt.ylabel("Density")
plt.legend()
plt.show()

### Perform t-test: Compute t-statistic

To determine whether the difference in means between fathers' and sons' heights is statistically significant, I will compute the **t-statistic** manually, without using any pre-existing t-test functions.

Steps:
1. **Calculate the pooled standard deviation**: Combine the two sample variances, weighted by their degrees of freedom, into a single estimate of within-group variability.
2. **Calculate the t-statistic**: Divide the difference between the means by the standard error built from the pooled standard deviation.

The pooled standard deviation assumes that the two groups (fathers and sons) share a common population variance. Under that assumption, combining the two sample variances gives a more precise estimate of the common variance than either sample alone. It does not correct for unequal variances.

Finally, I will analyze the computed t-statistic to determine whether the difference in heights is statistically significant.
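
For reference, the two steps above can be written out as formulas. The notation is introduced here for illustration and is not from the original: $n_f, n_s$ are the sample sizes, $s_f, s_s$ the sample standard deviations, and $\bar{x}_f, \bar{x}_s$ the sample means of fathers and sons.

$$
s_p = \sqrt{\frac{(n_f - 1)\,s_f^2 + (n_s - 1)\,s_s^2}{n_f + n_s - 2}},
\qquad
t = \frac{\bar{x}_f - \bar{x}_s}{s_p \sqrt{\dfrac{1}{n_f} + \dfrac{1}{n_s}}}
$$

If the two samples are independent, normally distributed, and share a common variance, $t$ follows a t-distribution with $n_f + n_s - 2$ degrees of freedom under the null hypothesis of equal means.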

In [None]:
# Calculate means and standard deviations
mean_fathers = data['fheight'].mean()
mean_sons = data['sheight'].mean()
std_fathers = data['fheight'].std()
std_sons = data['sheight'].std()

# Calculate number of observations
n_fathers = data['fheight'].count()
n_sons = data['sheight'].count()

# Calculate pooled standard deviation
s_p = np.sqrt(((n_fathers - 1) * std_fathers**2 + (n_sons - 1) * std_sons**2) / (n_fathers + n_sons - 2))
# Calculate t-statistic
t_statistic = (mean_fathers - mean_sons) / (s_p * np.sqrt(1/n_fathers + 1/n_sons))

t_statistic

In [None]:
''' We use a pooled standard deviation to estimate the standard deviation of the entire population from which both samples are drawn. 
It is used when the variances of the two samples are assumed to be equal.'''

In [None]:
# Perform the t-test with SciPy (equal_var=True by default, matching the pooled calculation above)
t_statistic_scipy, p_value = stats.ttest_ind(data['fheight'], data['sheight'])

t_statistic_scipy, p_value
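
The pooled test treats fathers and sons as independent samples with equal variance. Each row of the dataset pairs a father with his own son, so it may be worth comparing the pooled result against Welch's test (no equal-variance assumption) and a paired test. The sketch below does this on synthetic data. The heights, the coefficients, and the helper name `pooled_t` are made up for illustration and are not taken from the Pearson-Lee data.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical heights (NOT the Pearson-Lee data)
rng = np.random.default_rng(0)
fathers = rng.normal(loc=67.7, scale=2.7, size=500)
# Each son's height depends partly on his father's, so the samples are paired
sons = 34.0 + 0.5 * fathers + rng.normal(loc=1.0, scale=2.4, size=500)


def pooled_t(x, y):
    """Two-sample t-statistic using a pooled (equal-variance) standard deviation."""
    n_x, n_y = len(x), len(y)
    s_p = np.sqrt(
        ((n_x - 1) * np.var(x, ddof=1) + (n_y - 1) * np.var(y, ddof=1))
        / (n_x + n_y - 2)
    )
    return (np.mean(x) - np.mean(y)) / (s_p * np.sqrt(1 / n_x + 1 / n_y))


t_manual = pooled_t(fathers, sons)
t_pooled, p_pooled = stats.ttest_ind(fathers, sons)                  # pooled variance
t_welch, p_welch = stats.ttest_ind(fathers, sons, equal_var=False)   # Welch's test
t_paired, p_paired = stats.ttest_rel(fathers, sons)                  # paired test

print(f"manual pooled t = {t_manual:.4f}, scipy pooled t = {t_pooled:.4f}")
print(f"Welch t = {t_welch:.4f}, paired t = {t_paired:.4f}")
```

The manual and SciPy pooled statistics should agree to floating-point precision. The paired statistic generally differs because it works on the within-family differences rather than on the two group means.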

### Monte Carlo Simulation: Father-Son Height Comparison

In this step, I will run a Monte Carlo simulation to further analyze the father-son height data.

- **Compute Mean and Standard Deviation**: Calculate the overall mean and standard deviation of combined father and son heights.
- **Generate Random Variables**: Create two sets of normal random variables (length 1100) for fathers and sons using the same mean and standard deviation.
- **Mean Difference**: Calculate the mean difference between the two groups and compare it to the previous t-test result.

Because both simulated groups are drawn from the same distribution, any difference between their means is due to sampling noise alone. This gives a baseline for judging the observed father-son difference.

In [None]:
combined_data = np.concatenate([data['fheight'], data['sheight']])

# Overall mean and standard deviation
overall_mean = np.mean(combined_data)
overall_std = np.std(combined_data)

# Set seed for reproducibility
np.random.seed(42)

# Create normal distributions for fathers and sons
n_samples = 1100
simulated_fathers = np.random.normal(loc=overall_mean, scale=overall_std, size=n_samples)
simulated_sons = np.random.normal(loc=overall_mean, scale=overall_std, size=n_samples)

# Calculate the mean difference
mean_difference = np.mean(simulated_fathers) - np.mean(simulated_sons)

mean_difference

### Monte Carlo Simulations: Repeated Sampling

In this step, I will run **R = 1000** Monte Carlo simulations:

- **Repeat Simulation**: Generate random distributions for fathers and sons, calculating the mean difference each time.
- **Store and Plot Differences**: Store the mean differences and plot their distribution.
- **Analysis**: Calculate the mean and standard deviation of the differences, then compare these results to the original father-son dataset.

The spread of these simulated differences shows how large a mean difference can arise from sampling noise alone when both groups share one distribution.

In [None]:
# Parameters
n_simulations = 1000
differences = []

# Run Monte Carlo simulations
for _ in range(n_simulations):
    fathers_sample = np.random.normal(loc=overall_mean, scale=overall_std, size=n_samples)
    sons_sample = np.random.normal(loc=overall_mean, scale=overall_std, size=n_samples)
    diff = np.mean(fathers_sample) - np.mean(sons_sample)
    differences.append(diff)

# Convert differences to a numpy array
differences = np.array(differences)

# Plot the distribution of differences
plt.figure(figsize=(10, 6))
plt.hist(differences, bins=30, density=True, alpha=0.7, color='g')
plt.title('Distribution of Differences from Monte Carlo Simulations')
plt.xlabel('Difference in Means')
plt.ylabel('Density')
plt.show()

# Calculate and print the mean and standard deviation of the differences
mean_of_differences = np.mean(differences)
std_of_differences = np.std(differences)

mean_of_differences, std_of_differences
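
As a sanity check on the simulation, the standard deviation of the simulated differences can be compared with its theoretical value. For two independent samples of size n drawn from the same normal distribution with standard deviation sigma, the difference of the sample means has standard deviation sigma * sqrt(2 / n). The sketch below uses hypothetical values for `mu` and `sigma` in place of `overall_mean` and `overall_std`, and vectorizes the loop by drawing all simulations at once.

```python
import numpy as np

# Hypothetical parameters standing in for overall_mean and overall_std
mu, sigma = 68.0, 2.8
n_samples, n_simulations = 1100, 1000

rng = np.random.default_rng(42)
# Draw every simulation in one call: each row is one simulated group
sim_fathers = rng.normal(mu, sigma, size=(n_simulations, n_samples))
sim_sons = rng.normal(mu, sigma, size=(n_simulations, n_samples))
sim_differences = sim_fathers.mean(axis=1) - sim_sons.mean(axis=1)

simulated_sd = sim_differences.std(ddof=1)
analytic_sd = sigma * np.sqrt(2 / n_samples)

print(f"simulated SD of differences: {simulated_sd:.4f}")
print(f"analytic SD of differences:  {analytic_sd:.4f}")
```

The two values should be close but not identical, since the simulated figure is itself an estimate based on a finite number of runs.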

### Percentile Calculation

In this step, I will calculate key percentiles from the distribution of differences generated in the Monte Carlo simulations:

- **Percentiles**: Compute the 90th, 95th, and 99th percentiles, along with the maximum of the distribution.
- **Comparison**: Compare these percentiles to the differences calculated from the original father-son dataset.
- **Conclusion**: Based on this comparison, analyze what the dataset reveals about the relationship between the generated and observed data.

This analysis helps determine how the observed dataset aligns with the simulated distribution of differences.

In [None]:
# Calculate percentiles
percentiles = np.percentile(differences, [90, 95, 99])
max_difference = np.max(differences)

percentiles, max_difference

If the observed mean difference falls beyond the 95th or 99th percentile of the simulated differences, a gap that large is unlikely to arise from random sampling when fathers and sons share the same height distribution, which points to a genuine generational difference. The percentiles above describe only the upper tail, so if the observed difference (fathers minus sons) is negative, compare its absolute value against them or use the matching lower percentiles (10th, 5th, and 1st). The simulation therefore provides a chance baseline, under the assumption of normally distributed heights, against which the observed difference can be judged.
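
The comparison against percentiles can also be summarized as a single Monte Carlo p-value. The sketch below is one way to do it, not the method used in the original analysis. The function name `empirical_p_value` is hypothetical, the add-one correction is a common convention rather than a requirement, and the small array of simulated differences is made up for illustration.

```python
import numpy as np


def empirical_p_value(observed_diff, simulated_diffs):
    """Two-sided Monte Carlo p-value.

    Share of simulated differences whose absolute value is at least as
    large as the absolute observed difference.
    """
    simulated_diffs = np.asarray(simulated_diffs)
    exceed = np.sum(np.abs(simulated_diffs) >= abs(observed_diff))
    # Add-one correction so the estimate is never exactly zero
    # when only a finite number of simulations is available
    return (exceed + 1) / (len(simulated_diffs) + 1)


# Made-up simulated differences, for illustration only
toy_differences = np.array([-0.2, -0.1, 0.0, 0.1, 0.2, 0.05, -0.05, 0.15, -0.15])

p_extreme = empirical_p_value(-1.0, toy_differences)  # far outside the simulated range
p_typical = empirical_p_value(0.1, toy_differences)   # well inside the simulated range
print(p_extreme, p_typical)
```

In the notebook, the call would take the observed `mean_fathers - mean_sons` and the `differences` array from the repeated simulations. Using absolute values makes the result independent of the sign of the observed difference.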