# Data and Sampling Distributions

Even in the era of big data, sampling remains useful: a well-chosen sample lets you understand the population quickly and at a lower computational cost than processing every record.

## Random sampling and sample bias

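A sample is a subset of a larger data set (the population). One sampling method is random sampling, in which every member of the population has the same probability of being selected. It can be done with replacement (a selected member is returned to the population and may be drawn again) or without replacement (each member can be selected at most once).

Note: Data quality often matters more than sample size when reducing bias; a larger sample does not fix a biased collection process.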
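
To make the difference between sampling with and without replacement concrete, here is a minimal sketch using `pandas.Series.sample`. The toy `population` series is hypothetical and stands in for any data set; it is not the loan data used later.

```python
import pandas as pd

# Hypothetical toy population of ten values (not the loan data).
population = pd.Series(range(10), name="value")

# Without replacement: each row can appear at most once in the sample.
without_repl = population.sample(n=5, replace=False, random_state=42)

# With replacement: the same row may be drawn more than once,
# so n is allowed to exceed the population size.
with_repl = population.sample(n=15, replace=True, random_state=42)

print(without_repl.index.is_unique)  # True: no row is repeated
print(with_repl.index.is_unique)     # False: 15 draws from 10 rows must repeat
```

Fixing `random_state` makes both draws repeatable, which is useful when a result has to be reproduced later.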

## Imports

In [4]:
import pandas as pd
import warnings

In [5]:
warnings.filterwarnings('ignore')

### Random sampling example with Loan Data

In [6]:
df = pd.read_csv('../../datasets/loan_data_2007_2014.csv')
df.head()

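| Unnamed: 0.1 | Unnamed: 0 | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1077501 | 1296599 | 5000 | 5000 | 4975.0 | 36 months | 10.65 | 162.87 | B | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 1 | 1077430 | 1314167 | 2500 | 2500 | 2500.0 | 60 months | 15.27 | 59.83 | C | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2 | 1077175 | 1313524 | 2400 | 2400 | 2400.0 | 36 months | 15.96 | 84.33 | C | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 3 | 1076863 | 1277178 | 10000 | 10000 | 10000.0 | 36 months | 13.49 | 339.31 | C | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 4 | 1075358 | 1311748 | 3000 | 3000 | 3000.0 | 60 months | 12.69 | 67.79 | B | ... |  |  |  |  |  |  |  |  |  |  |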


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 466285 entries, 0 to 466284
Data columns (total 75 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Unnamed: 0                   466285 non-null  int64  
 1   id                           466285 non-null  int64  
 2   member_id                    466285 non-null  int64  
 3   loan_amnt                    466285 non-null  int64  
 4   funded_amnt                  466285 non-null  int64  
 5   funded_amnt_inv              466285 non-null  float64
 6   term                         466285 non-null  object 
 7   int_rate                     466285 non-null  float64
 8   installment                  466285 non-null  float64
 9   grade                        466285 non-null  object 
 10  sub_grade                    466285 non-null  object 
 11  emp_title                    438697 non-null  object 
 12  emp_length                   445277 non-null  object 
 13 

In [9]:
# Fix random_state so the same 384 rows are drawn on every run (reproducibility)
sample_df = df.sample(n=384, random_state=42)
sample_df.head()

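| Unnamed: 0.1 | Unnamed: 0 | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 362514 | 362514 | 19677589 | 21900299 | 32500 | 32500 | 32500.0 | 60 months | 14.99 | 773.01 | C | ... |  |  |  |  |  |  | 25100.0 |  |  |  |
| 288564 | 288564 | 29755527 | 32278795 | 11000 | 11000 | 11000.0 | 60 months | 20.99 | 297.53 | E | ... |  |  |  |  |  |  | 24000.0 |  |  |  |
| 213591 | 213591 | 1343334 | 1588314 | 30000 | 30000 | 30000.0 | 36 months | 14.65 | 1034.83 | C | ... |  |  |  |  |  |  |  |  |  |  |
| 263083 | 263083 | 33131681 | 35775007 | 14400 | 14400 | 14400.0 | 60 months | 14.49 | 338.74 | C | ... |  |  |  |  |  |  | 17400.0 |  |  |  |
| 165001 | 165001 | 3293168 | 4066358 | 15000 | 15000 | 14900.0 | 36 months | 8.9 | 476.3 | A | ... |  |  |  |  |  |  | 8700.0 |  |  |  |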
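
The notebook does not say why `n=384` was chosen. As a hedged aside, one widely used rule of thumb, Cochran's formula for estimating a proportion, gives roughly this number for a 95% confidence level and a 5% margin of error. The values of `z`, `p`, and `e` below are assumptions for illustration, not settings taken from the source.

```python
import math

# Assumed inputs (not stated in the notebook):
z = 1.96  # z-score for a 95% confidence level
p = 0.5   # assumed proportion; 0.5 maximizes p * (1 - p), the conservative choice
e = 0.05  # margin of error of 5 percentage points

# Cochran's formula for a large population: n0 = z^2 * p * (1 - p) / e^2
n0 = z**2 * p * (1 - p) / e**2

print(round(n0, 2))    # approximately 384.16
print(math.floor(n0))  # 384
```

Whether to round this figure down or up is a matter of convention; rounding up to 385 is the more cautious choice.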


In [10]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 384 entries, 362514 to 365592
Data columns (total 75 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   384 non-null    int64  
 1   id                           384 non-null    int64  
 2   member_id                    384 non-null    int64  
 3   loan_amnt                    384 non-null    int64  
 4   funded_amnt                  384 non-null    int64  
 5   funded_amnt_inv              384 non-null    float64
 6   term                         384 non-null    object 
 7   int_rate                     384 non-null    float64
 8   installment                  384 non-null    float64
 9   grade                        384 non-null    object 
 10  sub_grade                    384 non-null    object 
 11  emp_title                    360 non-null    object 
 12  emp_length                   365 non-null    object 
 13  home_ownership   
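
One quick sanity check on a random sample is whether it preserves simple properties of the population, such as the share of non-null values per column. The sketch below hard-codes the counts visible in the two `info()` outputs above rather than reloading the CSV; the variable names are illustrative.

```python
# Row counts and non-null counts copied from the two info() outputs:
# (population count, sample count) per column.
population_rows, sample_rows = 466285, 384
non_null = {
    "emp_title": (438697, 360),
    "emp_length": (445277, 365),
}

shares = {}
for column, (pop_count, sample_count) in non_null.items():
    pop_share = pop_count / population_rows
    sample_share = sample_count / sample_rows
    shares[column] = (pop_share, sample_share)
    print(f"{column}: population {pop_share:.3f}, sample {sample_share:.3f}")
```

For both columns the sample's non-null share is within half a percentage point of the population's, which is what one would hope for from a random draw of this size.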

## Bias

...