<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Programming: Stratified Sampling Exercise

## Example

We are going to use the example we looked at in the overview. Our dataset contains the heights of Moringa School students. It has 10,000 entries, of which 60% are female and 40% are male. We are going to perform stratified sampling on this population to draw a sample of 1,000 students with the same gender proportions as the population.

**Import relevant Libraries**

In [None]:
import pandas as pd


**Load the Dataset**

Here is the dataset we are going to use in this example: [Dataset Download](https://drive.google.com/file/d/1ODcSRSs_isRKCAShFwnMrXdcphed9kYn/view?usp=sharing)

In [None]:
# Load the data into a pandas DataFrame
data = pd.read_csv('Gender_heights.csv')

# Check out the data
data



In [None]:
# Now we will confirm how many female and male students there are in our dataset.
# To achieve this we will use the pandas .value_counts() method, which returns the number of times each value appears in a column.
data['gender'].value_counts()

F    6000
M    4000
Name: gender, dtype: int64
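
The same check can be expressed as proportions rather than raw counts. The following is a minimal sketch on a synthetic stand-in for the dataset (the `population` frame is built here for illustration and is not loaded from `Gender_heights.csv`); it relies on the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Synthetic stand-in for the population: 6,000 F rows and 4,000 M rows.
# (Assumption for illustration only; the real data comes from Gender_heights.csv.)
population = pd.DataFrame({"gender": ["F"] * 6000 + ["M"] * 4000})

# normalize=True returns relative frequencies instead of counts
proportions = population["gender"].value_counts(normalize=True)
print(proportions)
```

With the real data, the same call would be `data['gender'].value_counts(normalize=True)`.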

Now that we have confirmed that our population follows the expected proportions, we are going to first create a random sample of 1,000 students without stratification, just to see how it behaves.

In [None]:
# To create a random sample from a DataFrame we use the pandas sample method. You can read more about it here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
# We pass in an argument called frac, which is the fraction of the population we want to use as our sample

# Non stratified sample
non_strat_output = data.sample(frac=0.1) 

# check the proportion of the non_stratified sample
print(non_strat_output['gender'].value_counts())

F    618
M    382
Name: gender, dtype: int64


Run the code several times, taking note of the proportion of female and male students. What do you notice?

When you run it several times, you'll notice that the proportion varies each time. Although it stays close to our desired proportion, it's not quite the 60-40 split we want. This is where stratified sampling comes in handy.
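
To see this variation without re-running the cell by hand, one option is to draw several simple random samples in a loop. The sketch below uses a synthetic population with the same 60-40 split (an assumption for illustration, not the real file) and a different `random_state` seed per draw.

```python
import pandas as pd

# Synthetic stand-in for the population: 6,000 F rows and 4,000 M rows.
population = pd.DataFrame({"gender": ["F"] * 6000 + ["M"] * 4000})

# Draw five simple random samples of 10% each and count the F rows in each one.
female_counts = [
    int((population.sample(frac=0.1, random_state=seed)["gender"] == "F").sum())
    for seed in range(5)
]

# The counts hover around 600 but are generally not all equal to it.
print(female_counts)
```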

Let us create a stratified random sample and see how it behaves.

In [None]:

# To create a stratified random sample we'll use the pandas groupby method. It splits our dataset into subsets (one per group) and then applies a function to each subset.
# In our case, we apply a function that draws a random sample of 10% of each subset (frac=0.1). The beauty of this method is that it maintains the population's proportions (up to rounding within each group).

# Stratified sample
strat_output = data.groupby('gender', group_keys=False).apply(lambda grouped_subset : grouped_subset.sample(frac=0.1))

# proportion of the stratified sample
print(strat_output['gender'].value_counts())

# Check the stratified output
print(strat_output)

 

F    600
M    400
Name: gender, dtype: int64
     gender         ht
2809      F  57.999551
69        F  68.038378
2106      F  67.503046
5058      F  60.443173
1188      F  58.968027
...     ...        ...
7638      M  85.052857
6772      M  86.741967
6616      M  92.597088
7329      M  87.790862
7162      M  90.415005

[1000 rows x 2 columns]
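
As an alternative to the `groupby(...).apply(...)` call above, pandas 1.1 and later provide a `sample` method directly on grouped objects. Depending on the pandas version, `apply` may also emit a deprecation warning about operating on the grouping column, which `sample` avoids. The sketch below is a hedged illustration on synthetic data: the `population` frame and its `ht` values are placeholders, not the real heights, and `random_state=42` is an arbitrary seed chosen to make the draw repeatable.

```python
import pandas as pd

# Synthetic stand-in for the heights dataset (placeholder ht values).
population = pd.DataFrame({
    "gender": ["F"] * 6000 + ["M"] * 4000,
    "ht": range(10000),
})

# Sample 10% of each gender group; random_state fixes the draw.
strat_sample = population.groupby("gender").sample(frac=0.1, random_state=42)

counts = strat_sample["gender"].value_counts()
print(counts)
```

Fixing `random_state` is useful when results need to be reproduced; omit it to get a fresh sample on each run.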


## <font color="green">Challenges</font>

### Challenge 1

In [None]:
# Challenge 1
# ---
# Question: Moringa School is analysing how students perform across the different programmes it offers.
# You are given a dataset of 10,000 students, of whom 50% are Core students, 25% are Prep students and 25% are Pre-prep students.
# Also, 60% of the students in the dataset are female and the rest are male.
# You are tasked with creating a stratified random sample that represents that population.
# ---
# Dataset Source = https://drive.google.com/file/d/10THQj3iqund_D5tgypBdeKoc2FZ9pC0S/view?usp=sharing

# Load the data into a pandas DataFrame
Moringa = pd.read_csv('Moringa_Students_heights.csv')

# Checking data

Moringa.head(5)



  programme gender         ht
0      Prep      F  57.861539
1      Prep      F  65.550765
2      Core      F  52.142763
3  Pre-prep      F  69.453854
4  Pre-prep      F  54.203258


In [None]:
Moringa.shape

(10000, 3)

In [None]:
# Confirming how many Female students and Male students and the programmes they're on in our dataset.
Moringa['gender'].value_counts()

F    6000
M    4000
Name: gender, dtype: int64

In [None]:
Moringa['programme'].value_counts()

Core        5000
Pre-prep    2500
Prep        2500
Name: programme, dtype: int64

In [None]:
# Trying out Non-stratified sampling
# Non stratified sample
non_strat_output = Moringa.sample(frac=0.1) 

# check the proportion of the non_stratified sample
print(non_strat_output['gender'].value_counts())


F    612
M    388
Name: gender, dtype: int64


In [None]:
# Stratified sample
strat_output = Moringa.groupby('gender', group_keys=False).apply(lambda grouped_subset : grouped_subset.sample(frac=0.1))

# proportion of the stratified sample
print(strat_output['gender'].value_counts())

# Check the stratified output
print(strat_output)


F    600
M    400
Name: gender, dtype: int64
     programme gender         ht
3965  Pre-prep      F  60.408243
365       Core      F  62.865218
552       Prep      F  55.503927
5953      Core      F  63.424460
5463      Core      F  61.206943
...        ...    ...        ...
9554  Pre-prep      M  86.012101
7255      Prep      M  84.730698
6557      Prep      M  90.471306
6600      Prep      M  80.664773
8389      Core      M  97.555265

[1000 rows x 3 columns]


In [None]:
strat_output = Moringa.groupby('programme', group_keys=False).apply(lambda grouped_subset : grouped_subset.sample(frac=0.1))

# proportion of the stratified sample
print(strat_output['programme'].value_counts())

# Check the stratified output
print(strat_output)

Core        500
Pre-prep    250
Prep        250
Name: programme, dtype: int64
     programme gender         ht
409       Core      F  55.123906
7314      Core      M  90.424957
966       Core      F  59.757305
4433      Core      F  60.928354
2949      Core      F  65.699003
...        ...    ...        ...
584       Prep      F  67.065707
3640      Prep      F  57.466359
9970      Prep      M  96.051919
7820      Prep      M  89.710478
4774      Prep      F  75.107617

[1000 rows x 3 columns]
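
The two samples above stratify on gender and on programme separately, so each one preserves only one of the two proportions. One way to preserve both at once is to group on both columns together. The sketch below is a minimal illustration on synthetic data: the cell sizes in `cells` are hypothetical (they assume a 60-40 gender split inside every programme, which the real dataset may not follow), and `students` is built here rather than loaded from `Moringa_Students_heights.csv`.

```python
import pandas as pd

# Hypothetical programme x gender cell sizes (assumption for illustration).
cells = {
    ("Core", "F"): 3000, ("Core", "M"): 2000,
    ("Prep", "F"): 1500, ("Prep", "M"): 1000,
    ("Pre-prep", "F"): 1500, ("Pre-prep", "M"): 1000,
}
rows = [(programme, gender) for (programme, gender), n in cells.items() for _ in range(n)]
students = pd.DataFrame(rows, columns=["programme", "gender"])

# Each (programme, gender) combination is its own stratum.
joint_sample = students.groupby(["programme", "gender"]).sample(frac=0.1, random_state=0)

programme_counts = joint_sample["programme"].value_counts()
gender_counts = joint_sample["gender"].value_counts()
print(programme_counts)
print(gender_counts)
```

With the real data, the equivalent call would group `Moringa` by `['programme', 'gender']`.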


### Challenge 2

In [None]:
# Challenge 2
# ---
# Question: A wine company would like to perform some analysis on a variety of new red wines. 
# Select a stratified sample based on wine quality from the given dataset.
# ---
# Dataset url = http://bit.ly/RedWinesDataset

RedWine = pd.read_csv("http://bit.ly/RedWinesDataset")

RedWine.head(5)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [None]:
RedWine.shape

(1599, 12)

In [None]:
# Trying out Non-stratified sampling
# Non stratified sample
non_strat_output = RedWine.sample(frac=0.1) 

# check the proportion of the non_stratified sample
print(non_strat_output['quality'].value_counts())

5    67
6    54
7    29
4     9
3     1
Name: quality, dtype: int64


In [None]:
strat_output = RedWine.groupby('quality', group_keys=False).apply(lambda grouped_subset : grouped_subset.sample(frac=0.1))

# proportion of the stratified sample
print(strat_output['quality'].value_counts())

# Check the stratified output
print(strat_output)

5    68
6    64
7    20
4     5
8     2
3     1
Name: quality, dtype: int64
      fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
832            10.4             0.440         0.42  ...       0.86      9.9        3
1484            6.8             0.910         0.06  ...       0.64     10.9        4
1176            6.5             0.880         0.03  ...       0.50     11.2        4
1239            6.5             0.670         0.00  ...       0.56     11.8        4
633            10.1             0.935         0.22  ...       0.64     11.3        4
...             ...               ...          ...  ...        ...      ...      ...
430            10.5             0.240         0.47  ...       0.90     11.0        7
200             9.6             0.320         0.47  ...       0.82     10.3        7
502            10.4             0.440         0.73  ...       0.85     12.0        7
828             7.8             0.570         0.09  ...       0.74     12.
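
When strata are very small, as the rarest quality scores are here, sampling a fixed fraction can round a stratum down to zero rows. The sketch below shows one possible safeguard: a hypothetical helper, `stratified_sample`, that guarantees a minimum number of rows per stratum. The helper, its parameters and the `wines` counts are all assumptions for illustration and are not part of the exercise or the real wine dataset.

```python
import pandas as pd

def stratified_sample(df, column, frac, min_per_stratum=1, random_state=None):
    """Sample `frac` of each stratum, keeping at least `min_per_stratum` rows."""
    parts = []
    for _, group in df.groupby(column):
        n = max(min_per_stratum, round(len(group) * frac))
        n = min(n, len(group))  # never ask for more rows than the stratum has
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)

# Hypothetical quality counts: the stratum with 4 rows would round to 0 at frac=0.1.
wines = pd.DataFrame({"quality": [3] * 4 + [5] * 60 + [6] * 36})

sample = stratified_sample(wines, "quality", frac=0.1, random_state=0)
counts = sample["quality"].value_counts().sort_index()
print(counts)
```

Note that forcing a minimum slightly over-represents the smallest strata, so the sample is no longer exactly proportional.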

In [None]:
import pandas as pd

### Challenge 3

In [None]:
# Challenge 3
# ---
# Question: You have been provided with a list of employees of a certain company with some details about their gender (male/female) 
# and their type of employment (full-time/part-time). The HR team wants to conduct a survey on working conditions 
# that will be representative of the general opinion without interviewing every employee. 
# They request you conduct stratified sampling before any analysis is done. 
# ---
# Dataset url = http://bit.ly/StratifiedEmployeeDataset

# Hint: Perform EDA first


Employee = pd.read_excel("http://bit.ly/StratifiedEmployeeDataset")
Employee.head(5)

Unnamed: 0,Employee,Gender,Time,Strata
0,Em001,Male,Full-time,MF
1,Em002,Male,Part-time,MP
2,Em003,Male,Full-time,MF
3,Em004,Female,Part-time,FP
4,Em005,Male,Full-time,MF
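
Since the dataset already carries a `Strata` column combining gender and employment type, one possible approach is to sample a fixed fraction from each stratum. The sketch below is a hedged illustration on synthetic data: the stratum sizes in `strata_sizes`, the `FF` code, the 20% fraction and the seed are all assumptions, and `employees` is built here rather than read from the Excel file.

```python
import pandas as pd

# Hypothetical stratum sizes (assumption; the real counts should come from EDA).
strata_sizes = {"MF": 40, "MP": 20, "FF": 30, "FP": 10}

employees = pd.DataFrame({
    "Employee": [f"Em{i:03d}" for i in range(1, 101)],
    "Strata": [code for code, n in strata_sizes.items() for _ in range(n)],
})

# Sample 20% of each stratum so the survey mirrors the workforce composition.
survey_sample = employees.groupby("Strata").sample(frac=0.2, random_state=1)

sample_counts = survey_sample["Strata"].value_counts().sort_index()
print(sample_counts)
```

With the real data, checking `Employee['Strata'].value_counts()` first would show the actual stratum sizes before choosing a fraction.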
