Statistics

**Statistics has two main branches**

1. **Descriptive Statistics**

- Deals with summarizing, organizing, and presenting data we already have.

- Uses measures like mean (average), median, mode, range, variance, and standard deviation to describe a dataset.

- Also involves graphs, charts, and tables for easy visualization.

- For example, calculating the average coding hours of AI engineering students.

2. **Inferential Statistics**

- Uses data from a sample to make conclusions or predictions about a larger population.

- Involves methods like correlation, regression, hypothesis testing, chi-square tests, t-tests, and ANOVA.

- Helps in data-driven decision making by testing ideas and estimating outcomes with a level of certainty.

- Foundation of predictive analysis, guiding policymakers and organizations in making strategic decisions.

- For example, using a survey of 1,000 people to predict how millions will vote in a presidential election.
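
As a rough illustration of the two branches, the sketch below first summarizes a small sample (descriptive), then uses the same sample to estimate a range for the mean of the wider population (inferential). The `coding_hours` values are invented for illustration, and the 1.96 multiplier assumes a normal approximation, which is crude for a sample this small.

```python
import numpy as np

# Hypothetical weekly coding hours for 10 students (invented for illustration)
coding_hours = np.array([20, 22, 25, 25, 27, 30, 31, 33, 35, 42])

# Descriptive statistics: summarize the data we already have
mean = coding_hours.mean()
median = np.median(coding_hours)
data_range = coding_hours.max() - coding_hours.min()
std = coding_hours.std(ddof=1)  # sample standard deviation

# Inferential statistics: estimate the population mean from the sample
standard_error = std / np.sqrt(len(coding_hours))
ci_low = mean - 1.96 * standard_error
ci_high = mean + 1.96 * standard_error

print(f"Mean: {mean:.1f}, Median: {median:.1f}, Range: {data_range}")
print(f"Approximate 95% interval for the population mean: {ci_low:.1f} to {ci_high:.1f}")
```

The first print line only describes the ten students in the sample; the second makes a claim about students who were never measured, which is why it comes with an interval rather than a single number.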

**Why Statistics?**

- Beyond the confusing formulas and intimidating numbers, statistics helps us answer questions like:

   - What's typical? (measures of center)
   - How much variation is there? (measures of spread)
   - Is this pattern real or just coincidence? (hypothesis testing)
   - Can we predict future outcomes? (regression)

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
# import scipy.stats as stats


In [2]:
# Let's set a random seed for reproducibility
np.random.seed(42)

In [3]:
# Let's simulate a dataset for AI engineering students
# 1. Traditional learning: mean of 25 hours/week, standard deviation of 5 hours
traditional_study_hours = np.random.normal(25, 5, 100)

# 2. Accelerated learning (project-based, hands-on style):
# mean of 35 hours/week, standard deviation of 8 hours
accelerated_study_hours = np.random.normal(35, 8, 100)

# Generate corresponding performance scores (clipped to 0-100 in a later cell)
# Note: these are drawn independently of study hours, so the only link
# between hours and scores here is the difference between the two tracks
traditional_scores = np.random.normal(75, 12, 100)  # mean 75, SD 12
accelerated_scores = np.random.normal(82, 15, 100)  # mean 82, SD 15

# Generate project completion counts
traditional_project = np.random.poisson(8, 100)  # average 8 projects

accelrated_project = np.random.poisson(12, 100)  # average 12 projects
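
The scores in the cell above are sampled without reference to the study hours. One possible way to make scores depend on hours, sketched below, is to model each score as a linear function of study hours plus random noise. The slope of 1.5 points per hour, the noise SD of 10, and the names `rng`, `hours`, `noise`, and `scores` are illustrative assumptions, not part of the dataset used in this notebook.

```python
import numpy as np

# Separate generator, so the notebook's np.random.seed stream is not affected
rng = np.random.default_rng(42)

# Simulated study hours: mean 25, SD 5 (same parameters as the traditional track)
hours = rng.normal(25, 5, 100)

# Score = baseline + slope * (deviation from the mean hours) + random noise
noise = rng.normal(0, 10, 100)
scores = 75 + 1.5 * (hours - 25) + noise

# Pearson correlation between hours and scores
r = np.corrcoef(hours, scores)[0, 1]
print(f"Correlation between hours and scores: {r:.2f}")
```

A larger slope or a smaller noise SD would strengthen the correlation; setting the slope to 0 recovers independent draws like the ones above.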

In [4]:
# Now let's create the DataFrame

data = pd.DataFrame({
    'Study_Hours_per_week': np.concatenate([traditional_study_hours,accelerated_study_hours]),
    'Performance_Score': np.concatenate([traditional_scores,accelerated_scores]),
    'Projects_completed': np.concatenate([traditional_project,accelrated_project]),
    'Learning_Track': ['Traditional'] * 100 + ['Accelerated'] *100
})

In [5]:
data.head()

   Study_Hours_per_week  Performance_Score  Projects_completed Learning_Track
0             27.483571          79.293448                  12    Traditional
1             24.308678          81.729414                   7    Traditional
2             28.238443          87.996615                  10    Traditional
3             32.615149          87.645625                   5    Traditional
4             23.829233          58.467968                  12    Traditional


In [6]:
data.tail()

     Study_Hours_per_week  Performance_Score  Projects_completed Learning_Track
195             38.082539          74.962365                  15    Accelerated
196             27.929141          56.302982                   6    Accelerated
197             36.229801         102.308086                   8    Accelerated
198             35.465670          80.281902                  13    Accelerated
199             25.856238         100.567245                  18    Accelerated


In [7]:
# np.clip limits values to a [min, max] interval: anything below the minimum
# becomes the minimum, and anything above the maximum becomes the maximum
data['Study_Hours_per_week'] = np.clip(data['Study_Hours_per_week'], 10, 60).round(1)
data['Performance_Score'] = np.clip(data['Performance_Score'], 0, 100).round(1)
data['Projects_completed'] = np.clip(data['Projects_completed'], 1, 25)
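
To see what `np.clip` does in isolation, here is a minimal sketch on a few hand-picked values. The `raw_scores` array is hypothetical and chosen so that one value falls below the lower bound and one above the upper bound.

```python
import numpy as np

# Hypothetical raw scores: one below 0, one above 100
raw_scores = np.array([-5.0, 42.0, 87.5, 102.3])

# Values outside [0, 100] are replaced by the nearest bound
clipped = np.clip(raw_scores, 0, 100)
print(clipped)
```

Values already inside the interval pass through unchanged. Clipping is also why several students in this dataset end up with a score of exactly 100.0: every simulated score above 100 is pulled down to the bound.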

In [8]:
data.tail()

     Study_Hours_per_week  Performance_Score  Projects_completed Learning_Track
195                  38.1               75.0                  15    Accelerated
196                  27.9               56.3                   6    Accelerated
197                  36.2              100.0                   8    Accelerated
198                  35.5               80.3                  13    Accelerated
199                  25.9              100.0                  18    Accelerated


In [9]:
# Let's take a snapshot of the data

print(f'Total AI engineering students: {len(data)}')
print(f"Learning Tracks: {data['Learning_Track'].unique()}")
print("\nFirst 10 students in our dataset")
print(data.head(10).round(1))  # DataFrame.round() is a pandas method, separate from Python's built-in round()


Total AI engineering students: 200
Learning Tracks: ['Traditional' 'Accelerated']

First 10 students in our dataset
   Study_Hours_per_week  Performance_Score  Projects_completed Learning_Track
0                  27.5               79.3                  12    Traditional
1                  24.3               81.7                   7    Traditional
2                  28.2               88.0                  10    Traditional
3                  32.6               87.6                   5    Traditional
4                  23.8               58.5                  12    Traditional
5                  23.8               63.7                   9    Traditional
6                  32.9               81.2                   4    Traditional
7                  28.8               81.2                   7    Traditional
8                  22.7               81.2                   4    Traditional
9                  27.7              100.0                   7    Traditional


# Descriptive statistics for this data

In [10]:
# Let's get summary statistics of study hours for each learning track

data.groupby('Learning_Track')['Study_Hours_per_week'].describe().round(2)

                count   mean   std   min    25%    50%    75%   max
Learning_Track
Accelerated     100.0  35.18  7.63  19.6  28.58  35.65  39.33  56.8
Traditional     100.0  24.48  4.54  11.9  22.00  24.35  27.05  34.3


In [11]:
# Let's get summary statistics of performance scores for each learning track

data.groupby('Learning_Track')['Performance_Score'].describe().round(2)

                count   mean    std   min    25%    50%    75%    max
Learning_Track
Accelerated     100.0  82.80  11.83  50.1  73.45  82.75  92.25  100.0
Traditional     100.0  75.52  12.33  36.1  67.18  76.20  83.42  100.0


In [16]:
# Alternatively, describe all numeric columns at once instead of one at a time
print(data.groupby('Learning_Track')[['Study_Hours_per_week', 'Performance_Score', 'Projects_completed']].describe().round(2))

               Study_Hours_per_week                                          \
                              count   mean   std   min    25%    50%    75%   
Learning_Track                                                                
Accelerated                   100.0  35.18  7.63  19.6  28.58  35.65  39.33   
Traditional                   100.0  24.48  4.54  11.9  22.00  24.35  27.05   

                     Performance_Score         ...                \
                 max             count   mean  ...    75%    max   
Learning_Track                                 ...                 
Accelerated     56.8             100.0  82.80  ...  92.25  100.0   
Traditional     34.3             100.0  75.52  ...  83.42  100.0   

               Projects_completed                                            
                            count   mean   std  min  25%   50%    75%   max  
Learning_Track                                                               
Accelerated                 1
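
The printed summary above is abbreviated because pandas shortens wide tables when printing. Two possible ways to see every statistic are sketched below on a small stand-in DataFrame (the `demo` frame and its values are hypothetical): transpose the result so the statistics run down the rows, or temporarily lift the column limit with `pd.option_context`.

```python
import pandas as pd

# Small stand-in for the notebook's dataset (values are made up)
demo = pd.DataFrame({
    'Study_Hours_per_week': [20.0, 25.0, 30.0, 35.0],
    'Performance_Score': [70.0, 75.0, 80.0, 90.0],
    'Learning_Track': ['Traditional', 'Traditional', 'Accelerated', 'Accelerated'],
})

summary = (
    demo.groupby('Learning_Track')[['Study_Hours_per_week', 'Performance_Score']]
    .describe()
    .round(2)
)

# Option 1: transpose so each (column, statistic) pair gets its own row
print(summary.T)

# Option 2: temporarily show all columns without abbreviation
with pd.option_context('display.max_columns', None, 'display.width', 200):
    print(summary)
```

The transposed form tends to be easier to read in a narrow terminal, while `option_context` keeps the original layout and restores the default display settings once the `with` block ends.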

In [25]:
# Let's check other descriptive measures
# Analyze each learning track

print("\n=== DESCRIPTIVE SUMMARY BY TRACK ===")
for track in ['Traditional', 'Accelerated']:
    track_data = data[data['Learning_Track'] == track]

    print(f"\n{track} Learning Track (n={len(track_data)} students):")
    print("  Study Hours per Week:")
    print(f"    Mean: {track_data['Study_Hours_per_week'].mean():.1f} hours")
    print(f"    Median: {track_data['Study_Hours_per_week'].median():.1f} hours")
    print(f"    Standard Deviation: {track_data['Study_Hours_per_week'].std():.1f} hours")

    print("  Performance Scores:")
    print(f"    Mean: {track_data['Performance_Score'].mean():.1f}")
    print(f"    Median: {track_data['Performance_Score'].median():.1f}")
    print(f"    Standard Deviation: {track_data['Performance_Score'].std():.1f}")

    print("  Projects Completed:")
    print(f"    Mean: {track_data['Projects_completed'].mean():.1f} projects")
    print(f"    Median: {track_data['Projects_completed'].median():.1f} projects")
    print(f"    Range: {track_data['Projects_completed'].min()} - {track_data['Projects_completed'].max()} projects")


=== DESCRIPTIVE SUMMARY BY TRACK ===

Traditional Learning Track (n=100 students):
  Study Hours per Week:
    Mean: 24.5 hours
    Median: 24.4 hours
    Standard Deviation: 4.5 hours
  Performance Scores:
    Mean: 75.5
    Median: 76.2
    Standard Deviation: 12.3
  Projects Completed:
    Mean: 7.9 projects
    Median: 8.0 projects
    Range: 3 - 14 projects

Accelerated Learning Track (n=100 students):
  Study Hours per Week:
    Mean: 35.2 hours
    Median: 35.6 hours
    Standard Deviation: 7.6 hours
  Performance Scores:
    Mean: 82.8
    Median: 82.8
    Standard Deviation: 11.8
  Projects Completed:
    Mean: 12.0 projects
    Median: 12.0 projects
    Range: 3 - 22 projects
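
The same kind of per-track summary can also be built without a loop. The sketch below uses pandas named aggregation on a small stand-in DataFrame; `demo`, its values, and the output column names such as `hours_mean` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Small stand-in for the notebook's dataset (values are made up)
demo = pd.DataFrame({
    'Study_Hours_per_week': [20.0, 24.0, 28.0, 30.0, 36.0, 42.0],
    'Projects_completed': [6, 8, 10, 9, 12, 15],
    'Learning_Track': ['Traditional'] * 3 + ['Accelerated'] * 3,
})

# Named aggregation: output_column=(input_column, aggregation)
summary = demo.groupby('Learning_Track').agg(
    hours_mean=('Study_Hours_per_week', 'mean'),
    hours_median=('Study_Hours_per_week', 'median'),
    hours_std=('Study_Hours_per_week', 'std'),
    projects_min=('Projects_completed', 'min'),
    projects_max=('Projects_completed', 'max'),
)
print(summary.round(1))
```

Because each output column names its own aggregation, pairing a "Mean" label with a median calculation by accident becomes harder than in a hand-written block of print statements.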
