#### 1. A physician is evaluating a new diet for her patients with a family history of
#### heart disease. To test the effectiveness of this diet, 16 patients are placed on the
#### diet for 6 months. Their weights and triglyceride levels are measured before
#### and after the study, and the physician wants to know if either set of
#### measurements has changed. (Data set: DietStudy.xls)

In [24]:
import pandas as pd
import numpy as np

In [19]:
diet = pd.read_excel('DietStudy_1.xls')

In [20]:
diet.head()

Unnamed: 0,Patient ID,Age in years,Gender,Triglyceride,1st interim triglyceride,2nd interim triglyceride,3rd interim triglyceride,Final triglyceride,Weight,1st interim weight,2nd interim weight,3rd interim weight,Final weight
0,1,45,0,180,148,106,113,100,198,196,193,188,192
1,2,56,0,139,94,119,75,92,237,233,232,228,225
2,3,50,0,152,185,86,149,118,233,231,229,228,226
3,4,46,1,112,145,136,149,82,179,181,177,174,172
4,5,64,0,156,104,157,79,97,219,217,215,213,214


In [25]:
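# Recode Gender as labels; the 0 = Male, 1 = Female coding is assumed here
cond = [diet['Gender']==0, diet['Gender']==1]
target = ['Male', 'Female']
# A string default keeps the fallback value the same dtype as the choices
diet['Gender'] = np.select(cond, target, default='Unknown')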

In [26]:
diet.head()

Unnamed: 0,Patient ID,Age in years,Gender,Triglyceride,1st interim triglyceride,2nd interim triglyceride,3rd interim triglyceride,Final triglyceride,Weight,1st interim weight,2nd interim weight,3rd interim weight,Final weight
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225
2,3,50,Male,152,185,86,149,118,233,231,229,228,226
3,4,46,Female,112,145,136,149,82,179,181,177,174,172
4,5,64,Male,156,104,157,79,97,219,217,215,213,214


In [27]:
from scipy.stats import ttest_rel

In [30]:
stat_tri, p_tri = ttest_rel(diet['Triglyceride'], diet['Final triglyceride'])

In [31]:
alpha = 0.05
if p_tri <= alpha:
    print('Significant Changes in Triglyceride')
else:
    print('No Significant Changes in Triglyceride')
print(f'stats: {stat_tri:.3f}, p-value: {p_tri:.3f}')

No Significant Changes in Triglyceride
stats: 1.200, p-value: 0.249


In [32]:
stat_wei, p_wei = ttest_rel(diet['Weight'], diet['Final weight'])

In [33]:
alpha = 0.05
if p_wei <= alpha:
    print('Significant Changes in Weight')
else:
    print('No Significant Changes in Weight')
print(f'stats: {stat_wei:.3f}, p-value: {p_wei:.3f}')

Significant Changes in Weight
stats: 11.175, p-value: 0.000
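
A paired t-test is the same as a one-sample t-test on the per-patient differences, which also makes the size and direction of the change visible. The sketch below illustrates this on the five patients shown by `diet.head()` only, so its numbers are not the full-sample results; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Illustrative subset: the five patients displayed by diet.head(),
# not the full sample of 16.
weight_before = np.array([198, 237, 233, 179, 219])
weight_after = np.array([192, 225, 226, 172, 214])

# Paired test on the two columns.
stat_paired, p_paired = ttest_rel(weight_before, weight_after)

# Equivalent one-sample test on the differences against a mean of zero.
diff = weight_before - weight_after
stat_diff, p_diff = ttest_1samp(diff, popmean=0)

print(f'mean change (before - after): {diff.mean():.2f}')
print(f'paired:      stat={stat_paired:.3f}, p={p_paired:.4f}')
print(f'differences: stat={stat_diff:.3f}, p={p_diff:.4f}')
```

A positive mean difference with a positive statistic indicates that the "before" values are larger, i.e. the measurement dropped over the study.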


#### 2. An analyst at a department store wants to evaluate a recent credit card
#### promotion. To this end, 500 cardholders were randomly selected. Half received
#### an ad promoting a reduced interest rate on purchases made over the next three
#### months, and half received a standard seasonal ad. Is the promotion effective to
#### increase sales? (Data set: creditpromo.csv)

In [34]:
credit = pd.read_csv('creditpromo.csv')

In [35]:
credit.head()

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542
2,973,Standard,2327.092181
3,1096,Standard,1280.030541
4,1541,New Promotion,1513.5632


In [36]:
from scipy.stats import ttest_ind

In [58]:
standard = credit[credit['insert']=='Standard']['dollars']
new = credit[credit['insert']=='New Promotion']['dollars']

In [59]:
stat_cr, p_cr = ttest_ind(standard, new)

In [60]:
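alpha = 0.05
# stat_cr is computed as standard minus new, so a negative statistic
# means the new-promotion group spent more on average
if p_cr <= alpha and stat_cr < 0:
    print('Promotion is Effective in Increasing Sales')
else:
    print('Promotion is Not Effective in Increasing Sales')
print(f'stats: {stat_cr:.3f}, p-value: {p_cr:.3f}')

Promotion is Effective in Increasing Sales
stats: -2.260, p-value: 0.024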
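
Because the question is directional (does the promotion increase sales?), a one-sided test is a natural alternative to the two-sided one. The sketch below uses hypothetical, randomly generated spending figures in place of the `dollars` column, and assumes SciPy 1.6 or later for the `alternative` argument. It also uses Welch's version (`equal_var=False`), which does not assume equal variances in the two groups.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical data standing in for the 'dollars' column of each group.
rng = np.random.default_rng(0)
standard_demo = rng.normal(loc=1550, scale=350, size=250)
new_demo = rng.normal(loc=1650, scale=350, size=250)

# Two-sided Welch test: do the group means differ at all?
stat_two, p_two = ttest_ind(new_demo, standard_demo, equal_var=False)

# One-sided Welch test: is the new-promotion mean greater?
stat_one, p_one = ttest_ind(new_demo, standard_demo, equal_var=False,
                            alternative='greater')

print(f'two-sided: stat={stat_two:.3f}, p={p_two:.4f}')
print(f'one-sided: stat={stat_one:.3f}, p={p_one:.4f}')
```

When the statistic points in the hypothesised direction, the one-sided p-value is half the two-sided one.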


<p><b> 3. An experiment is conducted to study the hybrid seed production of bottle gourd
under open field conditions. The main aim of the investigation is to compare
natural pollination and hand pollination. The data is collected on 10 randomly
selected plants from each of natural pollination and hand pollination. The data
is collected on fruit weight (kg), seed yield/plant (g) and seedling length (cm).
(Data set: pollination.csv)<br>
a. Is the overall population mean of Seed yield/plant (g) equal to 200?<br>
b. Test whether the natural pollination and hand pollination under open field
    conditions are equally effective or are significantly different.</b></p>

In [43]:
pol = pd.read_csv('pollination.csv')

In [45]:
pol

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77
2,Natural,1.83,149.97,16.35
3,Natural,1.89,172.33,18.26
4,Natural,1.8,144.46,17.9
5,Natural,1.88,138.3,16.95
6,Natural,1.89,150.58,18.15
7,Natural,1.79,140.99,18.86
8,Natural,1.85,140.57,18.39
9,Natural,1.84,138.33,18.58


In [64]:
from scipy.stats import ttest_1samp

In [66]:
st, p_m = ttest_1samp(pol['Seed_Yield_Plant'], popmean=200)

In [67]:
print(f'Stats: {st:.3f}, p-value: {p_m:.3f}')

Stats: -2.301, p-value: 0.033


In [70]:
alpha = 0.05
if p_m <= alpha:
    print('Overall population mean of Seed yield/plant (g) is not equal to 200')
else:
    print('Overall population mean of Seed yield/plant (g) is not significantly different from 200')

Overall population mean of Seed yield/plant (g) is not equal to 200


In [71]:
seed_yield = np.mean(pol['Seed_Yield_Plant'])
print(f'Actual Mean is {seed_yield:.2f}')

Actual Mean is 180.80
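
The one-sample statistic can be checked by hand from t = (sample mean - 200) / (s / sqrt(n)) with n - 1 degrees of freedom. The sketch below does this for the ten Natural rows displayed above only, so its values differ from the pooled result; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# The ten Natural-pollination rows displayed above (only part of the data set).
seed_yield_natural = np.array([147.7, 136.86, 149.97, 172.33, 144.46,
                               138.3, 150.58, 140.99, 140.57, 138.33])
mu0 = 200  # hypothesised population mean

n = seed_yield_natural.size
mean = seed_yield_natural.mean()
# Standard error uses the sample standard deviation (ddof=1).
se = seed_yield_natural.std(ddof=1) / np.sqrt(n)

t_manual = (mean - mu0) / se
# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = ttest_1samp(seed_yield_natural, popmean=mu0)
print(f'manual: t={t_manual:.3f}, p={p_manual:.3g}')
print(f'scipy:  t={t_scipy:.3f}, p={p_scipy:.3g}')
```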


In [61]:
natural = pol[pol['Group']=='Natural']['Seed_Yield_Plant']
hand = pol[pol['Group']=='Hand']['Seed_Yield_Plant']

In [62]:
stat_pol, p_pol = ttest_ind(natural, hand)

In [63]:
alpha = 0.05
if p_pol <= alpha:
    print('Significantly Different')
else:
    print('Equally Effective')
print(f'stats: {stat_pol:.3f}, p-value: {p_pol:.3f}')

Significantly Different
stats: -13.958, p-value: 0.000


<p><b>4. An electronics firm is developing a new DVD player in response to customer
requests. Using a prototype, the marketing team has collected focus data for
different age groups viz. <br>Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above.<br>
Do you think that consumers of various ages rated the design differently?
    (Data set: dvdplayer.csv).</b></p>

In [72]:
dvd = pd.read_csv('dvdplayer.csv')

In [76]:
dvd.head()

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.92446
4,Under 25,30.450007


In [74]:
dvd['agegroup'].unique()

array(['65 and over', '55-64', 'Under 25', '25-34', '45-54', '35-44'],
      dtype=object)

In [77]:
age_under25 = dvd[dvd['agegroup']=='Under 25']['dvdscore']
age_25_34 = dvd[dvd['agegroup']=='25-34']['dvdscore']
age_35_44 = dvd[dvd['agegroup']=='35-44']['dvdscore']
age_45_54 = dvd[dvd['agegroup']=='45-54']['dvdscore']
age_55_64 = dvd[dvd['agegroup']=='55-64']['dvdscore']
age_65over = dvd[dvd['agegroup']=='65 and over']['dvdscore']

In [78]:
from scipy.stats import f_oneway

In [79]:
stat_dvd, p_dvd = f_oneway(age_under25, age_25_34, age_35_44, age_45_54, age_55_64, age_65over)

In [81]:
alpha = 0.05
if p_dvd <= alpha:
    print('Consumers of various ages rated the design differently')
else:
    print('Consumers of various ages rated the design the same')
print(f'stats: {stat_dvd:.3f}, p-value: {p_dvd:.3f}')

Consumers of various ages rated the design differently
stats: 6.993, p-value: 0.000
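
The six per-group filters can also be built with a single `groupby`, which avoids retyping the labels. A minimal sketch, using a small hypothetical DataFrame (`dvd_demo`) with the same column names as `dvd`:

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical scores; only the column names match the notebook's dvd DataFrame.
dvd_demo = pd.DataFrame({
    'agegroup': ['Under 25'] * 4 + ['25-34'] * 4 + ['65 and over'] * 4,
    'dvdscore': [30.5, 28.1, 33.0, 29.4,
                 35.2, 36.8, 33.9, 37.5,
                 25.9, 31.7, 27.3, 24.6],
})

# One array of scores per age group, whatever labels the column contains.
groups = [scores.to_numpy()
          for _, scores in dvd_demo.groupby('agegroup')['dvdscore']]

# One-way ANOVA across all groups.
stat_demo, p_demo = f_oneway(*groups)
print(f'groups: {len(groups)}, stat: {stat_demo:.3f}, p-value: {p_demo:.4f}')
```

A significant F statistic only says that at least one group mean differs; it does not say which groups differ from each other.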


<p><b>5. A survey was conducted among 2800 customers on several demographic
characteristics. Working status, sex, age, age-group, race, happiness, no. of
child, marital status, educational qualifications, income group etc. had been
captured for that purpose. (Data set: sample_survey.csv).<br>
a. Is there any relationship between labour force status and marital status?<br>
b. Do you think educational qualification is somehow controlling the marital
status?<br>
c. Is happiness driven by earnings or marital status?</b></p>

In [82]:
survey = pd.read_csv('sample_survey.csv')

In [83]:
survey.head()

Unnamed: 0,id,wrkstat,marital,childs,age,educ,paeduc,maeduc,speduc,degree,...,agecat,childcat,news1,news2,news3,news4,news5,car1,car2,car3
0,1,Working full time,Divorced,2.0,60.0,12.0,12.0,12.0,,High school,...,55 to 64,1-2,No,No,No,No,No,American,Japanese,Japanese
1,2,Working part-time,Never married,0.0,27.0,17.0,20.0,,,Junior college,...,25 to 34,,No,No,Yes,No,No,American,German,Japanese
2,3,Working full time,Married,2.0,36.0,12.0,12.0,12.0,16.0,High school,...,35 to 44,1-2,No,No,No,Yes,Yes,American,American,
3,4,Working full time,Never married,0.0,21.0,13.0,,12.0,,High school,...,Less than 25,,No,No,No,Yes,Yes,American,Other,
4,5,Working full time,Never married,0.0,35.0,16.0,,12.0,,Bachelor,...,35 to 44,,No,No,No,No,No,American,American,Korean


In [84]:
survey.columns

Index(['id', 'wrkstat', 'marital', 'childs', 'age', 'educ', 'paeduc', 'maeduc',
       'speduc', 'degree', 'sex', 'race', 'born', 'parborn', 'granborn',
       'income', 'rincome', 'polviews', 'cappun', 'postlife', 'happy',
       'hapmar', 'owngun', 'news', 'tvhours', 'howpaid', 'ethnic', 'eth1',
       'eth2', 'eth3', 'confinan', 'conbus', 'coneduc', 'conpress', 'conmedic',
       'contv', 'agecat', 'childcat', 'news1', 'news2', 'news3', 'news4',
       'news5', 'car1', 'car2', 'car3'],
      dtype='object')

In [85]:
survey['wrkstat'].unique()

array(['Working full time', 'Working part-time', 'Unemployed, laid off',
       'Keeping house', 'Retired', 'School', 'Other',
       'Temporarily not working', nan], dtype=object)

In [86]:
survey['marital'].unique()

array(['Divorced', 'Never married', 'Married', 'Separated', 'Widowed',
       nan], dtype=object)

In [87]:
table1 = pd.crosstab(survey['marital'], survey['wrkstat'])

In [90]:
table1

wrkstat,Keeping house,Other,Retired,School,Temporarily not working,"Unemployed, laid off",Working full time,Working part-time
marital,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Divorced,25,12,53,7,9,10,295,35
Married,200,16,168,9,23,13,778,138
Never married,35,14,17,60,11,32,392,102
Separated,13,4,6,2,1,0,58,9
Widowed,55,8,150,1,2,3,44,20


In [89]:
from scipy.stats import chi2
from scipy.stats import chi2_contingency

In [91]:
stat1, p1, dof1, expected1 = chi2_contingency(table1)

In [95]:
expected1

array([[ 51.69187279,   8.51024735,  62.09328622,  12.45017668,
          7.24946996,   9.14063604, 246.95477032,  47.90954064],
       [155.8869258 ,  25.66431095, 187.25441696,  37.5459364 ,
         21.86219081,  27.56537102, 744.74028269, 144.48056537],
       [ 76.84240283,  12.65088339,  92.30459364,  18.50777385,
         10.77667845,  13.58798587, 367.10989399,  71.21978799],
       [ 10.77879859,   1.7745583 ,  12.94770318,   2.59611307,
          1.51166078,   1.90600707,  51.495053  ,   9.99010601],
       [ 32.8       ,   5.4       ,  39.4       ,   7.9       ,
          4.6       ,   5.8       , 156.7       ,  30.4       ]])

In [105]:
prob = 0.95
critical1 = chi2.ppf(prob, dof1)
print(f'probability: {prob}, critical: {critical1:.3f}, stat: {stat1:.3f}')
if abs(stat1) >= critical1:
    print('Relationship exists')
else:
    print('Relationship does not exist')

probability: 0.95, critical: 41.337, stat: 729.242
Relationship exists
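
Comparing the statistic with the critical value and comparing the p-value with alpha are two forms of the same decision rule. The sketch below rebuilds the observed counts from `table1` as a plain array (the name `observed` is illustrative) and checks that both rules agree; it should reproduce the degrees of freedom and critical value printed above.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts copied from table1
# (rows: marital status, columns: work status, same order as displayed).
observed = np.array([
    [25, 12, 53, 7, 9, 10, 295, 35],      # Divorced
    [200, 16, 168, 9, 23, 13, 778, 138],  # Married
    [35, 14, 17, 60, 11, 32, 392, 102],   # Never married
    [13, 4, 6, 2, 1, 0, 58, 9],           # Separated
    [55, 8, 150, 1, 2, 3, 44, 20],        # Widowed
])

alpha = 0.05
stat, p, dof, expected = chi2_contingency(observed)

# Rule 1: statistic against the critical value of the chi-square distribution.
critical = chi2.ppf(1 - alpha, dof)
reject_by_critical = stat >= critical

# Rule 2: p-value against alpha.
reject_by_p = p <= alpha

print(f'dof: {dof}, critical: {critical:.3f}, stat: {stat:.3f}')
print(f'reject by critical value: {reject_by_critical}, reject by p-value: {reject_by_p}')
```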


In [94]:
survey['degree'].unique()

array(['High school', 'Junior college', 'Bachelor', 'LT High school',
       'Graduate', nan], dtype=object)

In [96]:
table2 = pd.crosstab(survey['degree'], survey['marital'])

In [99]:
table2

marital,Divorced,Married,Never married,Separated,Widowed
degree,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Bachelor,58,251,129,12,28
Graduate,29,123,41,3,9
High school,241,686,367,58,148
Junior college,45,108,46,3,6
LT High school,70,174,77,17,92


In [97]:
stat2, p2, dof2, expected2 = chi2_contingency(table2)

In [98]:
expected2

array([[ 75.06345268, 227.39312301, 111.83268345,  15.75824176,
         47.95249911],
       [ 32.19248493,  97.52215526,  47.9617157 ,   6.75824176,
         20.56540234],
       [235.55476781, 713.57674583, 350.9393832 ,  49.45054945,
        150.4785537 ],
       [ 32.66359447,  98.94930876,  48.66359447,   6.85714286,
         20.86635945],
       [ 67.52570011, 204.55866714, 100.60262318,  14.17582418,
         43.1371854 ]])

In [106]:
prob = 0.95
critical2 = chi2.ppf(prob, dof2)
print(f'probability: {prob}, critical: {critical2:.3f}, stat: {stat2:.3f}')
if abs(stat2) >= critical2:
    print('Educational qualification is somehow controlling the marital status')
else:
    print('Educational qualification is not controlling the marital status')

probability: 0.95, critical: 26.296, stat: 122.684
Educational qualification is somehow controlling the marital status


In [109]:
# Same decision expressed through the p-value instead of the critical value
alpha = 1 - prob
print(f'alpha: {alpha:.2f}, p: {p2:.3f}, stat: {stat2:.3f}')
if p2 <= alpha:
    print('Educational qualification is somehow controlling the marital status')
else:
    print('Educational qualification is not controlling the marital status')

alpha: 0.05, p: 0.000, stat: 122.684
Educational qualification is somehow controlling the marital status


In [112]:
table3 = pd.crosstab(survey['income'], survey['happy'])

In [113]:
table3

happy,Not too happy,Pretty happy,Very happy
income,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
$1000 TO 2999,7,20,5
$10000 - 14999,39,107,44
$15000 - 19999,33,119,26
$20000 - 24999,40,155,50
$25000 or more,113,888,571
$3000 TO 3999,9,11,4
$4000 TO 4999,9,13,10
$5000 TO 5999,6,18,11
$6000 TO 6999,14,13,6
$7000 TO 7999,12,21,14


In [114]:
stat3, p3, dof3, expected3 = chi2_contingency(table3)

In [115]:
expected3

array([[  3.89520355,  18.16041919,   9.94437727],
       [ 23.12777106, 107.82748892,  59.04474002],
       [ 21.66706973, 101.01733172,  55.31559855],
       [ 29.82265216, 139.04070939,  76.13663845],
       [191.35187424, 892.1305925 , 488.51753325],
       [  2.92140266,  13.62031439,   7.45828295],
       [  3.89520355,  18.16041919,   9.94437727],
       [  4.26037888,  19.86295848,  10.87666264],
       [  4.01692866,  18.72793229,  10.25513906],
       [  5.72108021,  26.67311568,  14.60580411],
       [  7.06005643,  32.91575977,  18.0241838 ],
       [  4.26037888,  19.86295848,  10.87666264]])

In [122]:
alpha3 = 0.05
if p3 <= alpha3:
    print('Relationship between income and happiness exists')
else:
    print('Relationship between income and happiness does not exist')
print(f'stats: {stat3:.3f}, p-value: {p3:.3f}, alpha: {alpha3}')

Relationship between income and happiness exists
stats: 178.951, p-value: 0.000, alpha: 0.05


In [118]:
table4 = pd.crosstab(survey['marital'], survey['happy'])

In [119]:
table4

happy,Not too happy,Pretty happy,Very happy
marital,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Divorced,72,278,93
Married,71,684,582
Never married,108,426,120
Separated,30,49,13
Widowed,59,137,83


In [120]:
stat4, p4, dof4, expected4 = chi2_contingency(table4)

In [121]:
expected4

array([[ 53.6969697 , 248.58538324, 140.71764706],
       [162.06060606, 750.24527629, 424.69411765],
       [ 79.27272727, 366.98609626, 207.74117647],
       [ 11.15151515,  51.62495544,  29.22352941],
       [ 33.81818182, 156.55828877,  88.62352941]])

In [126]:
alpha4 = 0.05
if p4 <= alpha4:
    print('Relationship between Marriage and happiness exists')
else:
    print('Relationship between Marriage and happiness does not exist')
print(f'stats: {stat4:.3f}, p-value: {p4:.3f}, alpha: {alpha4}')

Relationship between Marriage and happiness exists
stats: 260.689, p-value: 0.000, alpha: 0.05
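
With roughly 2,800 respondents, even a weak association can give a very small p-value, so significance alone does not say which factor is more strongly tied to happiness. One common way to compare strength is Cramér's V, sqrt(chi2 / (n * (min(rows, cols) - 1))). This is a technique added here for illustration, not part of the original analysis. The sketch applies it to the counts from `table4`; the helper `cramers_v` is a hypothetical name, and the same function could be applied to `table3`.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramér's V effect size for a contingency table (hypothetical helper)."""
    observed = np.asarray(observed)
    stat, _, _, _ = chi2_contingency(observed)
    n = observed.sum()
    k = min(observed.shape) - 1
    return np.sqrt(stat / (n * k))

# Observed counts copied from table4
# (rows: marital status; columns: Not too happy, Pretty happy, Very happy).
marital_happy = np.array([
    [72, 278, 93],    # Divorced
    [71, 684, 582],   # Married
    [108, 426, 120],  # Never married
    [30, 49, 13],     # Separated
    [59, 137, 83],    # Widowed
])

v_marital = cramers_v(marital_happy)
print(f"Cramér's V (marital vs happy): {v_marital:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association), so the table with the larger V shows the stronger association with happiness.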
