"Introduction"

Statistical analysis is a fundamental aspect of data science and machine learning. By understanding and applying statistical methods, you can derive meaningful insights from datasets, make predictions, and inform decision-making. In this session, we will explore the basics of statistical analysis using Python, leveraging libraries like pandas, numpy, and scipy. We will use various datasets, including the built-in Diabetes dataset from the sklearn library, and explore how to apply statistical methods to datasets from sources like Kaggle and the UCI Machine Learning Repository.


"Section 1: Basic Concepts in Statistics"

Before diving into statistical analysis, it’s essential to understand some fundamental concepts:

"1.1 Population and Sample"

Population: In statistics, a population refers to the complete set of all possible observations or measurements that could be made about a particular subject. For example, if we are studying the heights of all adults in a city, the population would include every adult in that city.
Sample: A sample is a subset of the population selected for the actual study. Sampling is necessary because it is often impractical or impossible to collect data from an entire population due to time, cost, and logistical constraints. In the same example, a sample might consist of 500 randomly selected adults from the city.
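
The distinction between a population and a sample can be sketched in a few lines of code. The population below is simulated (the mean of 170 cm, the spread of 10 cm, and the population size of 100,000 are all assumptions made for illustration); the sample of 500 mirrors the example above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: heights (cm) of 100,000 adults in a city (simulated)
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Simple random sample of 500 adults, drawn without replacement
sample = rng.choice(population, size=500, replace=False)

print(f"Population mean: {population.mean():.2f}")
print(f"Sample mean:     {sample.mean():.2f}")
```

In practice the population mean is usually unknown; it is printed here only to show how closely a random sample can track it.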

"1.2 Need for Sampling in Statistics"

Sampling is a powerful tool in statistics that allows researchers to make inferences about a population without needing to study every individual. The main reasons for sampling include:

Cost Efficiency: Studying a whole population can be very costly. Sampling reduces the resources needed for data collection.
Time Efficiency: Gathering data from an entire population can take a long time. Sampling provides quicker insights and results.
Feasibility: In many cases, it’s simply not feasible to collect data from everyone (e.g., when testing a new drug).
Manageability: Smaller, manageable samples make data analysis simpler and more straightforward.

1.3 "Benefits of Sampling"

Accuracy: With proper sampling techniques, the results from a sample can accurately reflect the population.
Less Data Overhead: Handling and analyzing a sample is less overwhelming than dealing with massive amounts of data.
Focused Research: Allows for more detailed analysis and can be tailored to specific aspects of a population.
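
One way to see the accuracy point is to draw many random samples of different sizes and measure how much the sample mean varies from one sample to the next. The sketch below uses a simulated, skewed population (an assumption for illustration, not data from this session) and three arbitrary sample sizes.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical skewed population (e.g., purchase amounts); values are simulated
population = rng.exponential(scale=50.0, size=20_000)

spreads = {}
for n in (50, 500, 5000):
    # Draw 200 independent random samples of size n and record each sample mean
    sample_means = [rng.choice(population, size=n, replace=False).mean()
                    for _ in range(200)]
    # Standard deviation of the 200 sample means for this sample size
    spreads[n] = float(np.std(sample_means, ddof=1))
    print(f"n = {n:>4}: std of sample means = {spreads[n]:.3f}")
```

The spread of the sample means shrinks as the sample size grows, which is why a well-drawn sample of modest size can stand in for the full population.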

Section 2: "Understanding Basic Statistical Methods"

2.1 "Descriptive Statistics"

Descriptive statistics summarize the basic features of a dataset. They provide simple summaries of the sample and its measures, and they form the basis of virtually every quantitative analysis of data. Each descriptive statistic condenses many data points into a single, simpler summary.

Here are the key measures of descriptive statistics:

Mean: The mean, often referred to as the average, is the sum of all values in a dataset divided by the number of values. It is a measure of central tendency that gives us an idea of the “central” value of a dataset. The mean is sensitive to outliers (extremely high or low values) and may not accurately reflect the central tendency if the dataset is skewed.
Median: The median is the middle value of a dataset when it is ordered from smallest to largest. If there is an even number of observations, the median is the average of the two middle values. The median is a robust measure of central tendency, meaning it is largely unaffected by outliers. It is particularly useful for skewed distributions.
Mode: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode (bimodal or multimodal), or no mode at all if all values are unique. The mode is useful for categorical data where we want to know which is the most common category.
Standard Deviation (SD): The standard deviation measures the amount of variation or dispersion in a dataset. A low standard deviation indicates that the data points tend to be close to the mean, whereas a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation is the square root of the variance.
Variance: Variance measures how far the values in a dataset lie from the mean. It is the average of the squared differences from the mean. Variance gives us a general idea of the spread of the data but is in squared units, which can make interpretation less intuitive compared to standard deviation.

Range: The range is the difference between the maximum and minimum values in a dataset. It provides a measure of how spread out the values are. However, the range is sensitive to outliers and does not give any information about the distribution of values within the dataset.
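
The measures above can be computed directly with pandas. The sketch below uses a small made-up series with one deliberate outlier (the values are assumptions for illustration) to show how the mean reacts to the outlier while the median does not.

```python
import pandas as pd

# Hypothetical data; the value 40 is a deliberate outlier
values = pd.Series([2, 3, 3, 4, 5, 6, 40])

mean = values.mean()              # pulled upward by the outlier
median = values.median()          # middle value, robust to the outlier
mode = values.mode().tolist()     # most frequent value(s)

# ddof=0 matches the "average of squared differences" definition above.
# Note: pandas defaults to ddof=1 (sample variance) when ddof is omitted.
variance = values.var(ddof=0)
std = values.std(ddof=0)

data_range = values.max() - values.min()

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
print(f"Variance: {variance:.3f}, Std: {std:.3f}, Range: {data_range}")
```

Here the mean is 9 while the median is 4, which illustrates why the median is often preferred for skewed data.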

"Additional Descriptive Measures":

"Percentiles and Quartiles": 
Percentiles indicate the value below which a given percentage of observations fall. Quartiles divide the dataset into four equal parts. The 25th percentile is the first quartile (Q1), the 50th percentile is the second quartile (Q2, also the median), and the 75th percentile is the third quartile (Q3). Quartiles and percentiles provide insight into the distribution and spread of the data.
Interquartile Range (IQR): The IQR is the range of the middle 50% of the values in a dataset, calculated as the difference between the third quartile (Q3) and the first quartile (Q1). The IQR is a robust measure of spread and is useful for identifying outliers.
Skewness: Skewness measures the asymmetry of the data distribution. A positive skew indicates that the right tail of the distribution is longer or fatter than the left tail (right-skewed). A negative skew indicates that the left tail is longer or fatter than the right tail (left-skewed). Skewness provides insights into the direction and degree of asymmetry in the data.
Kurtosis: Kurtosis measures the “tailedness” of the data distribution. High kurtosis means that the data have heavy tails or outliers, whereas low kurtosis indicates light tails or fewer outliers. Kurtosis helps in understanding the extremity of deviations from the mean.
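
As a minimal sketch of these additional measures, the code below computes quartiles, the IQR, skewness, and kurtosis for a small made-up array. The 1.5 × IQR fences used to flag outliers are a common rule of thumb and an assumption here, not something prescribed by this session.

```python
import numpy as np
from scipy import stats

# Hypothetical data; the value 40 is a deliberate outlier
data = np.array([2, 3, 3, 4, 5, 6, 40], dtype=float)

# Quartiles via percentiles (NumPy's default linear interpolation)
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Rule-of-thumb outlier fences (assumed convention: 1.5 * IQR)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

skewness = stats.skew(data)              # > 0 means right-skewed
excess_kurtosis = stats.kurtosis(data)   # Fisher definition: normal distribution = 0

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers}")
print(f"Skewness: {skewness:.3f}, Excess kurtosis: {excess_kurtosis:.3f}")
```

Note that `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores 0 rather than 3.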

By using these descriptive statistics, you can gain a better understanding of the dataset’s characteristics, including its central tendency, variability, and distribution shape. These measures are foundational for further statistical analysis and hypothesis testing, enabling more informed decision-making based on the data.

2.2 "Inferential Statistics"

Inferential statistics allow us to make predictions or inferences about a population based on a sample of data. Unlike descriptive statistics, which simply summarize data, inferential statistics help us draw conclusions and make predictions beyond the immediate data.

Common Methods of Inferential Statistics:

Hypothesis Testing: Hypothesis testing involves evaluating a hypothesis about a population parameter based on sample data. It allows us to determine if there is enough evidence to reject a null hypothesis in favor of an alternative hypothesis. This method is widely used in scientific research to test theories and assumptions.
Confidence Intervals: A confidence interval is a range of values, derived from the sample data, that is likely to contain the true population parameter. For a 95% confidence level, about 95% of intervals constructed this way from repeated samples would contain the true value. Confidence intervals provide an estimate of the uncertainty associated with a sample statistic, allowing researchers to gauge the precision of their estimates.
Regression Analysis: Regression analysis is used to model the relationship between dependent and independent variables. It helps in understanding how the typical value of the dependent variable changes when any one of the independent variables is varied. Regression analysis is fundamental in predicting outcomes and identifying trends in data.
ANOVA (Analysis of Variance): ANOVA is a statistical method used to compare the means of three or more groups to see if they are significantly different from each other. It is commonly used in experimental research to test the effects of different treatments or conditions.
Chi-Square Test: The chi-square test is used to test the independence of two categorical variables. It is particularly useful in survey research and marketing studies to examine relationships between different categorical data points.

Inferential statistics are powerful tools that allow us to go beyond mere description and make predictions about broader populations. By applying these methods, you can draw meaningful conclusions from your data and make informed decisions based on statistical evidence.
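
Two of the methods listed above, ANOVA and the chi-square test, can be sketched with `scipy.stats`. The three treatment groups and the 2×2 contingency table below are made-up data chosen purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# One-way ANOVA: compare the means of three hypothetical treatment groups
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=10.5, scale=2.0, size=30)
group_c = rng.normal(loc=13.0, scale=2.0, size=30)
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

# Chi-square test of independence on a hypothetical contingency table
# (rows: two customer segments, columns: bought / did not buy)
observed = np.array([[30, 10],
                     [20, 40]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)

print(f"ANOVA: F = {f_stat:.3f}, p = {anova_p:.4g}")
print(f"Chi-square: chi2 = {chi2:.3f}, p = {chi2_p:.4g}, dof = {dof}")
print("Expected counts under independence:\n", expected)
```

A small p-value in either test suggests the observed differences would be unlikely if the null hypothesis (equal group means, or independent variables) were true.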

In [23]:
pip install pandas numpy scipy scikit-learn statsmodels

Note: you may need to restart the kernel to use updated packages.


6.1 "Loading the Iris Dataset"

In [24]:
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display the first few rows
print(df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  


6.2 "Performing Descriptive Statistics"

In [25]:
# Calculate basic descriptive statistics
print("Mean:\n", df.mean())
print("\nMedian:\n", df.median())
print("\nMode:\n", df.mode().iloc[0])
print("\nStandard Deviation:\n", df.std())
print("\nVariance:\n", df.var())

# Additional descriptive statistics
print("\nRange:\n", df.max() - df.min())
print("\nSkewness:\n", df.skew())
print("\nKurtosis:\n", df.kurt())

Mean:
 sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64

Median:
 sepal length (cm)    5.80
sepal width (cm)     3.00
petal length (cm)    4.35
petal width (cm)     1.30
target               1.00
dtype: float64

Mode:
 sepal length (cm)    5.0
sepal width (cm)     3.0
petal length (cm)    1.4
petal width (cm)     0.2
target               0.0
Name: 0, dtype: float64

Standard Deviation:
 sepal length (cm)    0.828066
sepal width (cm)     0.435866
petal length (cm)    1.765298
petal width (cm)     0.762238
target               0.819232
dtype: float64

Variance:
 sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
target               0.671141
dtype: float64

Range:
 sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
target               2.0
dtype: float64

Sk

6.3 "Performing Inferential Statistics"

In [26]:
from scipy import stats

# Example data: Sepal Length
sepal_length_values = df['sepal length (cm)']

# Hypothetical population mean for Sepal Length
population_mean = 5.8  # Based on domain knowledge or assumption

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(sepal_length_values, population_mean)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")


T-Statistic: 0.6409183514112012
P-Value: 0.5225602746220779
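
With a p-value of about 0.52, well above the conventional 0.05 threshold, there is not enough evidence to reject the null hypothesis that the mean sepal length equals 5.8 cm. To make the test less of a black box, the sketch below recomputes a one-sample t-statistic by hand and checks it against `stats.ttest_1samp`. It uses simulated measurements (an assumption, so that the block runs on its own), and the significance level of 0.05 is likewise an assumed convention.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Hypothetical measurements (simulated); mu0 is the hypothesized population mean
sample = rng.normal(loc=5.8, scale=0.8, size=150)
mu0 = 5.8

# Manual computation: t = (sample mean - mu0) / (s / sqrt(n))
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test via SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

alpha = 0.05  # assumed significance level
reject_null = bool(p_scipy < alpha)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
print(f"Reject null at alpha = {alpha}: {reject_null}")
```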


6.4 "Confidence Intervals"

In [27]:
import numpy as np
from scipy import stats

# Sample mean and standard error for Sepal Length
sample_mean = np.mean(sepal_length_values)
standard_error = stats.sem(sepal_length_values)

# Compute 95% confidence interval for Sepal Length
# (normal approximation; reasonable here because n = 150)
confidence_interval = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

print(f"95% Confidence Interval for Sepal Length: {confidence_interval}")


95% Confidence Interval for Sepal Length: (5.710817588579892, 5.9758490780867755)
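
The interval above relies on the normal distribution, which works well for a sample of 150. For small samples, an interval based on the t distribution is usually the safer choice because it accounts for the extra uncertainty in the estimated standard deviation. The sketch below compares the two on a simulated sample of 20 values (the data and the sample size are assumptions for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Hypothetical small sample (simulated)
sample = rng.normal(loc=5.8, scale=0.8, size=20)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# 95% interval from the normal distribution
z_interval = stats.norm.interval(0.95, loc=mean, scale=sem)

# 95% interval from the t distribution with n - 1 degrees of freedom
t_interval = stats.t.interval(0.95, df=sample.size - 1, loc=mean, scale=sem)

print(f"Normal-based interval: {z_interval}")
print(f"t-based interval:      {t_interval}")
```

The t-based interval is slightly wider; as the sample size grows, the two intervals converge.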


6.5 "Regression Analysis"

In [28]:
import statsmodels.api as sm

# Define independent variable (add constant for intercept)
X = sm.add_constant(df['sepal length (cm)'])

# Define dependent variable (class label coded 0, 1, 2; treated as numeric here)
y = df['target']

# Fit linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     233.8
Date:                Wed, 04 Sep 2024   Prob (F-statistic):           2.89e-32
Time:                        14:44:06   Log-Likelihood:                -111.35
No. Observations:                 150   AIC:                             226.7
Df Residuals:                     148   BIC:                             232.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -3.5240      0.29