# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

- **Consider a significance level of 5% for all tests.**

In [1]:
#Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import math
import os
from scipy.stats import ttest_ind, norm

# Challenge 1 - Independent Sample T-tests

In this challenge, we will be using the Pokemon dataset. Before applying statistical methods to this data, let's first examine the data.

To load the data, run the code below.

In [2]:
# List files in the current directory
files = os.listdir()
files

['.config', 'pokemon.csv', 'sample_data']

In [3]:
# Read the CSV file into a DataFrame
df = pd.read_csv('pokemon.csv')

Let's start off by looking at the `head` function in the cell below.

In [4]:
# pokemon data:
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


The first thing we would like to do is compare the legendary Pokemon to the regular Pokemon. To do this, we should examine the data further. What is the count of legendary vs. non-legendary Pokemon?

In [5]:
# Legendary Pokemon count:
df['Legendary'].value_counts()

Legendary
False    735
True      65
Name: count, dtype: int64

Compute the mean and standard deviation of the total points for both legendary and non-legendary Pokemon.

## Non-legendary pokemons info

In [6]:
# Filter [Non-Legendary] Pokemons:
non_legendary_df = df[df['Legendary']==False]

In [7]:
# Get non-legendary mean and std info:
non_legendary_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
#,735.0,353.315646,208.590419,1.0,175.5,346.0,533.5,715.0
Total,735.0,417.213605,106.760417,180.0,324.0,425.0,498.0,700.0
HP,735.0,67.182313,24.808849,1.0,50.0,65.0,79.5,255.0
Attack,735.0,75.669388,30.490153,5.0,54.5,72.0,95.0,185.0
Defense,735.0,71.559184,30.408194,5.0,50.0,66.0,85.0,230.0
Sp. Atk,735.0,68.454422,29.091705,10.0,45.0,65.0,85.0,175.0
Sp. Def,735.0,68.892517,25.66931,20.0,50.0,65.0,85.0,230.0
Speed,735.0,65.455782,27.843038,5.0,45.0,64.0,85.0,160.0
Generation,735.0,3.284354,1.673471,1.0,2.0,3.0,5.0,6.0


In [8]:
non_legendary_totals = df[df['Legendary']==False]['Total']
non_legendary_totals

0      318
1      405
2      525
3      625
4      309
      ... 
787    494
788    304
789    514
790    245
791    535
Name: Total, Length: 735, dtype: int64

In [9]:
# non-legendary total mean:
non_legendary_totals_mean = non_legendary_totals.mean()
non_legendary_totals_mean

417.21360544217686

In [10]:
# non-legendary std:
non_legendary_totals_std = non_legendary_totals.std()
non_legendary_totals_std

106.76041745713005

## Legendary pokemons info

In [11]:
# Filter Legendary Pokemons:
legendary_df = df[df['Legendary']==True]

In [12]:
# Get legendary mean and std info:
legendary_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
#,65.0,470.215385,173.651095,144.0,381.0,483.0,642.0,721.0
Total,65.0,637.384615,60.937389,580.0,580.0,600.0,680.0,780.0
HP,65.0,92.738462,21.722164,50.0,80.0,91.0,105.0,150.0
Attack,65.0,116.676923,30.348037,50.0,100.0,110.0,131.0,190.0
Defense,65.0,99.661538,28.255131,20.0,90.0,100.0,115.0,200.0
Sp. Atk,65.0,122.184615,31.104608,50.0,100.0,120.0,150.0,194.0
Sp. Def,65.0,105.938462,28.827004,20.0,90.0,100.0,120.0,200.0
Speed,65.0,100.184615,22.952323,50.0,90.0,100.0,110.0,180.0
Generation,65.0,3.769231,1.455262,1.0,3.0,4.0,5.0,6.0


In [13]:
legendary_totals = df[df['Legendary']==True]['Total']
legendary_totals

156    580
157    580
158    580
162    680
163    780
      ... 
795    600
796    700
797    600
798    680
799    600
Name: Total, Length: 65, dtype: int64

In [14]:
# legendary total mean:
legendary_totals_mean = legendary_totals.mean()
legendary_totals_mean

637.3846153846154

In [15]:
# legendary std:
legendary_totals_std = legendary_totals.std()
legendary_totals_std

60.93738905315344
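
The per-group mean and standard deviation computed cell by cell above can also be obtained in one call with `groupby`. This is a minimal sketch on a tiny made-up frame (the `toy` DataFrame and its five rows are hypothetical and only reuse the `Legendary` and `Total` column names); it is not the Pokemon data itself.

```python
import pandas as pd

# Hypothetical miniature frame that only mimics the column names of pokemon.csv
toy = pd.DataFrame({
    "Legendary": [False, False, False, True, True],
    "Total": [318, 405, 525, 580, 680],
})

# One row per group, with the count, mean and sample standard deviation of Total
summary = toy.groupby("Legendary")["Total"].agg(["count", "mean", "std"])
print(summary)
```

On the real data, the same pattern would be applied to `df` instead of `toy`.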

The computed means might give us a clue regarding how the statistical test may turn out; however, they do not prove that there is a significant difference between the two groups.

In the cell below, use the `ttest_ind` function in `scipy.stats` to compare the total points for legendary and non-legendary Pokemon. Since we do not have any information about the population, assume the variances are not equal.

In [16]:
# T-test info for Legendary versus Non-legendary Pokemons:

# H0: mean Total of non-legendary = mean Total of legendary
# H1: mean Total of non-legendary != mean Total of legendary

# significance level = 0.05

## T-test Manual version

In [31]:
# Non-legendary pokemons
non_legendary_sample = 735
non_legendary_totals_mean = non_legendary_totals.mean()
non_legendary_totals_std = non_legendary_totals.std()

# Legendary Pokemons
legendary_sample = 65
legendary_totals_mean = legendary_totals.mean()
legendary_totals_std = legendary_totals.std()

In [32]:
# Manual t-test with the pooled-variance (Student's) formula.
# Note: this formula assumes equal variances, so its statistic is not expected
# to match a Welch's test (equal_var=False) on the same data.
pooled_sample_std = math.sqrt(((non_legendary_sample-1)*non_legendary_totals_std**2 + (legendary_sample-1)*legendary_totals_std**2)/(non_legendary_sample+legendary_sample-2))
statistic = (non_legendary_totals_mean-legendary_totals_mean)/(pooled_sample_std*math.sqrt((1/non_legendary_sample)+(1/legendary_sample)))
print("T Statistic is: ", statistic)

T Statistic is:  -16.386116965872432
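
The manual version stops at the test statistic, while the decision at a 5% significance level needs either a p-value or a critical value. The sketch below derives both from the printed statistic, assuming the pooled test's degrees of freedom are `n1 + n2 - 2`; the variable names (`t_manual`, `df_pooled`, `t_critical`) are illustrative and not from the notebook.

```python
from scipy import stats

t_manual = -16.386116965872432  # statistic printed by the manual cell
df_pooled = 735 + 65 - 2        # pooled test: n1 + n2 - 2 degrees of freedom
alpha = 0.05                    # significance level used throughout

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df_pooled)

# Two-sided critical value: reject H0 when |t| exceeds it
t_critical = stats.t.ppf(1 - alpha / 2, df_pooled)

reject_h0 = abs(t_manual) > t_critical
print(p_manual, t_critical, reject_h0)
```

Both routes lead to the same decision: comparing `p_manual` with `alpha` is equivalent to comparing `abs(t_manual)` with `t_critical`.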


## T-test Automatic version

In [53]:
# Welch's t-test (equal_var=False does not assume equal variances)
t_statistic, p_value = ttest_ind(non_legendary_totals, legendary_totals, equal_var=False)
t_statistic, p_value

(-25.8335743895517, 9.357954335957446e-47)
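
The statistic returned with `equal_var=False` can be reproduced by hand from the summary statistics computed earlier. The sketch below uses Welch's formula and the Welch-Satterthwaite approximation for the degrees of freedom; the numbers are copied from the outputs above, and the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics copied from the earlier cells
mean_nl, std_nl, n_nl = 417.21360544217686, 106.76041745713005, 735  # non-legendary
mean_l, std_l, n_l = 637.3846153846154, 60.93738905315344, 65        # legendary

# Variance of each group's sample mean
var_nl = std_nl**2 / n_nl
var_l = std_l**2 / n_l

# Welch's statistic: each group keeps its own variance (no pooling)
t_welch = (mean_nl - mean_l) / math.sqrt(var_nl + var_l)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (var_nl + var_l) ** 2 / (var_nl**2 / (n_nl - 1) + var_l**2 / (n_l - 1))

# Two-sided p-value from the t distribution
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)
print(t_welch, df_welch, p_welch)
```

The pooled formula and Welch's formula share the same numerator (the difference in means) and differ only in the standard error, which is why the two statistics above are not the same.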

What do you conclude from this test? Write your conclusions below.

In [18]:
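# Conclusions:
# The p-value (9.36e-47) is far below the 0.05 significance level, so we reject H0:
# legendary and non-legendary Pokemon differ significantly in mean Total.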

How about we try to compare the different types of pokemon? In the cell below, list the types of Pokemon from column `Type 1` and the count of each type.

In [34]:
# List and count 'Type 1' Pokemons:
df['Type 1'].value_counts()

Type 1
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: count, dtype: int64

Since water is the largest group of Pokemon, compare the mean and standard deviation of water Pokemon to all other Pokemon.

## Water Pokemon's Info:

In [40]:
# Filter water Pokemon:
water_pokemon = df[df['Type 1']=='Water']
len(water_pokemon)

112

In [41]:
water_pokemon.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0
mean,303.089286,430.455357,72.0625,74.151786,72.946429,74.8125,70.517857,65.964286,2.857143
std,188.440807,113.188266,27.487026,28.377192,27.773809,29.030128,28.460493,23.019353,1.5588
min,7.0,200.0,20.0,10.0,20.0,10.0,20.0,15.0,1.0
25%,130.0,328.75,52.25,53.0,54.5,55.0,50.0,50.0,1.0
50%,275.0,455.0,70.0,72.0,70.0,70.0,65.0,65.0,3.0
75%,456.25,502.25,90.25,92.0,88.5,90.5,89.25,82.0,4.0
max,693.0,770.0,170.0,155.0,180.0,180.0,160.0,122.0,6.0


In [42]:
# Create water pokemon Totals:
water_totals = df[df['Type 1']=='Water']['Total']
water_totals

9      314
10     405
11     530
12     630
59     320
      ... 
724    314
725    405
726    530
762    330
763    500
Name: Total, Length: 112, dtype: int64

In [43]:
# Water Pokemon mean:
water_totals_mean = water_totals.mean()
water_totals_mean

430.45535714285717

In [44]:
# Water Pokemon std:
water_totals_std = water_totals.std()
water_totals_std

113.1882660643146

## Non-Water Pokemon Info:

In [45]:
# Filter non-water Pokemon:
non_water_pokemon = df[df['Type 1']!='Water']
len(non_water_pokemon)

688

In [46]:
non_water_pokemon.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,688.0,688.0,688.0,688.0,688.0,688.0,688.0,688.0,688.0
mean,372.536337,435.859012,68.802326,79.790698,73.988372,72.49564,72.127907,68.65407,3.399709
std,209.928799,121.091682,25.194299,33.025152,31.719933,33.292537,27.739292,29.925907,1.666119
min,1.0,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,197.75,330.0,50.0,55.0,50.0,45.0,50.0,45.0,2.0
50%,380.0,450.0,65.0,76.0,70.0,65.0,70.0,65.0,3.0
75%,553.25,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,780.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0


In [47]:
# Create non-water pokemon Totals:
non_water_totals = df[df['Type 1']!='Water']['Total']
non_water_totals

0      318
1      405
2      525
3      625
4      309
      ... 
795    600
796    700
797    600
798    680
799    600
Name: Total, Length: 688, dtype: int64

In [48]:
# Non-Water Pokemon mean:
non_water_totals_mean = non_water_totals.mean()
non_water_totals_mean

435.85901162790697

In [49]:
# Non-Water Pokemon std:
non_water_totals_std = non_water_totals.std()
non_water_totals_std

121.0916823020807

Perform a hypothesis test comparing the mean of total points for water Pokemon to all non-water Pokemon. Assume the variances are equal.

In [None]:
# T-test info for Water versus Non-water Pokemons:

# H0: mean Total of water Pokemon = mean Total of non-water Pokemon
# H1: mean Total of water Pokemon != mean Total of non-water Pokemon

# significance level = 0.05

In [50]:
# Water pokemons
water_sample = len(water_pokemon)
water_totals_mean = water_totals.mean()
water_totals_std = water_totals.std()

# Non-Water Pokemons
non_water_sample = len(non_water_pokemon)
non_water_totals_mean = non_water_totals.mean()
non_water_totals_std = non_water_totals.std()

In [52]:
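# Welch's t-test (equal_var=False). Note: the exercise asks to assume equal
# variances, which would be equal_var=True; the output below comes from this Welch call.
t_statistic_2, p_value_2 = ttest_ind(water_totals, non_water_totals, equal_var=False)
t_statistic_2, p_value_2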

(-0.4638681676327303, 0.6433915385821449)

In [54]:
# We fail to reject H0: p-value 0.643 > 0.05

Write your conclusion below.

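**Conclusions:** Since the p-value (0.643) is greater than 0.05, we fail to reject H0. The data show no statistically significant difference between the mean total points of water Pokemon and those of all other Pokemon types combined. Note that this is not proof that the two means are equal, only that the test found no evidence against it.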
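
The exercise asks for equal variances, so it is worth checking whether that assumption changes the decision. The sketch below runs both variants from the summary statistics printed above, using `scipy.stats.ttest_ind_from_stats`; the dictionary names `water` and `others` are illustrative.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the water / non-water cells above
water = dict(mean1=430.45535714285717, std1=113.1882660643146, nobs1=112)
others = dict(mean2=435.85901162790697, std2=121.0916823020807, nobs2=688)

# Student's t-test (pooled variance), as the exercise requests
pooled = ttest_ind_from_stats(**water, **others, equal_var=True)

# Welch's t-test (separate variances)
welch = ttest_ind_from_stats(**water, **others, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

Both p-values are well above 0.05, so the decision not to reject H0 does not depend on the variance assumption here.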

# Challenge 2 - Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon. Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. In the cell below, import the `ttest_rel` function from `scipy.stats` and compare the two columns to see if there is a statistically significant difference between them.

In [58]:
# Attack & Defense data listed:
attack_scores = df['Attack'].tolist()
defense_scores = df['Defense'].tolist()

In [60]:
# Paired t-test of Attack versus Defense (both scores belong to the same Pokemon):
from scipy.stats import ttest_rel
ttest_rel(attack_scores, defense_scores)

TtestResult(statistic=4.325566393330478, pvalue=1.7140303479358558e-05, df=799)

Describe the results of the test in the cell below.

In [24]:
# Your conclusions here:
# Since the p-value (1.71e-05) is less than the significance level of 0.05, we reject the null hypothesis (H0)
# Mean Attack and mean Defense scores are not equal
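
A matched-pairs test works on the per-row differences, whereas `ttest_ind` treats the two columns as unrelated samples. The sketch below illustrates how much that choice can matter. It uses synthetic, correlated scores (the arrays `first` and `second` are hypothetical and generated with a fixed seed), not the Pokemon data.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)  # fixed seed; synthetic data only

# Hypothetical paired scores: the second score is the first one
# shifted down by 3 points, plus some per-row noise
first = rng.normal(loc=75, scale=30, size=800)
second = first - 3 + rng.normal(loc=0, scale=10, size=800)

independent = ttest_ind(first, second)  # ignores the pairing
paired = ttest_rel(first, second)       # tests the per-row differences

print("independent:", independent.statistic, independent.pvalue)
print("paired:     ", paired.statistic, paired.pvalue)
```

Because the paired test removes the variation shared by both scores of the same row, its standard error is smaller and it detects the shift far more readily.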

We are also curious about whether there is a significant difference between the mean of special defense and the mean of special attack. Perform the hypothesis test in the cell below.

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


In [64]:
# Get Special Attack totals mean:
special_attack_mean = df['Sp. Atk'].mean()
special_attack_mean

72.82

In [65]:
# Get Special Defense totals mean:
special_defense_mean = df['Sp. Def'].mean()
special_defense_mean

71.9025

In [67]:
# Get special attack + defense totals to compare:
special_attack_total = df['Sp. Atk'].tolist()
special_defense_total = df['Sp. Def'].tolist()

In [68]:
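# T-test of Special Attack versus Special Defense.
# Note: ttest_ind treats the two columns as independent samples; because both values
# come from the same Pokemon, a paired test (ttest_rel) would match this challenge's design.
ttest_ind(special_attack_total, special_defense_total)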

TtestResult(statistic=0.6041290031014401, pvalue=0.5458436328840358, df=1598.0)

Describe the results of the test in the cell below.

In [69]:
# Conclusions: p-value (0.546) > 0.05, so we fail to reject H0
# (no significant difference between the means of special attack and special defense)

As you may recall, a two sample matched pairs test can also be expressed as a one sample test of the difference between the two dependent columns.

Import the `ttest_1samp` function and perform a one sample t-test of the difference between defense and attack. Test the hypothesis that the difference between the means is zero. Confirm that the results of the test are the same.

In [71]:
# Attack & Defense data listed:
attack_scores = df['Attack'].tolist()
defense_scores = df['Defense'].tolist()

In [72]:
# Calculate the differences between defense and attack scores:
differences = []
for i in range(len(defense_scores)):
    difference = defense_scores[i] - attack_scores[i]
    differences.append(difference)

In [77]:
from scipy.stats import ttest_1samp
# One-sample t-test of the differences against a mean of 0:
ttest_1samp(differences, 0)

TtestResult(statistic=-4.325566393330478, pvalue=1.7140303479358558e-05, df=799)

In [78]:
# Conclusion: p-value (1.71e-05) < 0.05, so we reject the null hypothesis (H0):
# the mean difference between Defense and Attack is not zero
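
The equivalence between a matched-pairs test and a one-sample test on the differences can be checked directly. The sketch below does so on synthetic integer scores (the `attack` and `defense` arrays here are randomly generated stand-ins, not the dataset columns) and computes the differences with a vectorized subtraction rather than a loop.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(42)  # fixed seed; synthetic data only

# Hypothetical stand-ins for the Attack and Defense columns
attack = rng.integers(5, 191, size=800)
defense = rng.integers(5, 231, size=800)

# Vectorized per-row differences (defense minus attack)
differences = defense - attack

one_sample = ttest_1samp(differences, 0)  # H0: mean difference is zero
paired = ttest_rel(defense, attack)       # same order as the subtraction

print(one_sample.statistic, one_sample.pvalue)
print(paired.statistic, paired.pvalue)
```

Swapping the argument order in `ttest_rel` only flips the sign of the statistic; the two-sided p-value is unchanged.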