In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

1. A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed. (Data set: dietstudy.csv)

In [2]:
diet = pd.read_csv('dietstudy.csv')
diet.head()

Unnamed: 0,patid,age,gender,tg0,tg1,tg2,tg3,tg4,wgt0,wgt1,wgt2,wgt3,wgt4
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225
2,3,50,Male,152,185,86,149,118,233,231,229,228,226
3,4,46,Female,112,145,136,149,82,179,181,177,174,172
4,5,64,Male,156,104,157,79,97,219,217,215,213,214


In [3]:
stats.ttest_rel(a=diet.tg0,b=diet.tg4)

Ttest_relResult(statistic=1.2000008533342437, pvalue=0.24874946576903698)

In [4]:
# One-sided p-value (H1: tg0 > tg4); halving is valid because the statistic is positive
0.24874946576903698/2

0.12437473288451849
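Instead of halving by hand, the one-sided test can be requested directly. This is a minimal sketch, assuming SciPy 1.6 or later, where `ttest_rel` accepts an `alternative` argument. The values are only the five rows shown by `diet.head()`, so the numbers will not match the full 16-patient result; `tg_before` and `tg_after` are illustrative names.

```python
from scipy import stats

# tg0 and tg4 for the five patients shown in diet.head() (not the full data set)
tg_before = [180, 139, 152, 112, 156]
tg_after = [100, 92, 118, 82, 97]

# Default two-sided paired t-test
two_sided = stats.ttest_rel(tg_before, tg_after)

# H1: mean(before - after) > 0, i.e. triglycerides dropped on the diet
one_sided = stats.ttest_rel(tg_before, tg_after, alternative="greater")

# The two printed values agree whenever the statistic is positive
print(two_sided.pvalue / 2, one_sided.pvalue)
```

If the statistic had the opposite sign, the one-sided p-value would be `1 - pvalue/2` rather than `pvalue/2`, which is why passing `alternative` is less error-prone.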

In [5]:
stats.ttest_rel(a=diet.wgt0,b=diet.wgt4)

Ttest_relResult(statistic=11.174521688532522, pvalue=1.137689414996614e-08)

2. An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective at increasing sales? (Data set: creditpromo.csv)

In [6]:
credit = pd.read_csv('creditpromo.csv')
credit.head()

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542
2,973,Standard,2327.092181
3,1096,Standard,1280.030541
4,1541,New Promotion,1513.5632


In [7]:
standard = credit[credit['insert'] == 'Standard']['dollars']
promo = credit[credit['insert'] == 'New Promotion']['dollars']
# Informal check of the equal-variance assumption: difference in group standard deviations
promo.std() - standard.std()

10.030121614125676

In [8]:
stats.ttest_ind(a=standard,b=promo,equal_var=True)

Ttest_indResult(statistic=-2.2604227264649963, pvalue=0.024225996894147814)

In [9]:
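# One-sided p-value (H1: promo > standard); halving is valid because the statistic is negative with a=standard
0.024225996894147814/2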

0.012112998447073907
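Comparing the two standard deviations by eye is an informal way to justify `equal_var=True`. A more formal route, sketched here on synthetic data rather than `creditpromo.csv`, is Levene's test for equal variances, with Welch's t-test as a fallback that does not assume equal variances. The names `standard_demo` and `promo_demo` and the distribution parameters are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic spending figures standing in for the two ad groups (not creditpromo.csv)
rng = np.random.default_rng(0)
standard_demo = rng.normal(loc=1500, scale=350, size=250)
promo_demo = rng.normal(loc=1600, scale=360, size=250)

# Levene's test: H0 is that both groups have equal variances
levene_stat, levene_p = stats.levene(standard_demo, promo_demo)

# Welch's t-test: independent-samples t-test without the equal-variance assumption
welch = stats.ttest_ind(standard_demo, promo_demo, equal_var=False)

print(levene_p, welch.statistic, welch.pvalue)
```

A large Levene p-value gives no evidence against equal variances, in which case the pooled test used above is reasonable.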

3. An experiment is conducted to study the hybrid seed production of bottle gourd under open field conditions. The main aim of the investigation is to compare natural pollination and hand pollination. Data are collected on 10 randomly selected plants from each group (natural pollination and hand pollination), recording fruit weight (kg), seed yield/plant (g) and seedling length (cm). (Data set: pollination.csv)

a. Is the overall population mean of seed yield/plant (g) equal to 200?

In [10]:
poll = pd.read_csv('pollination.csv')
poll.head()

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77
2,Natural,1.83,149.97,16.35
3,Natural,1.89,172.33,18.26
4,Natural,1.8,144.46,17.9


In [11]:
poll.shape

(20, 4)

In [12]:
stats.ttest_1samp(a=poll.Seed_Yield_Plant,popmean=200)

Ttest_1sampResult(statistic=-2.3009121248548645, pvalue=0.032891040921283025)
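For reference, the statistic reported by `ttest_1samp` is t = (sample mean - hypothesised mean) / (s / sqrt(n)) with n - 1 degrees of freedom. The sketch below computes it by hand using only the five `Seed_Yield_Plant` values visible in `poll.head()`, so the result differs from the full 20-plant output above.

```python
import math
import statistics

# Seed_Yield_Plant for the five rows shown in poll.head() (not the full data set)
sample = [147.7, 136.86, 149.97, 172.33, 144.46]
popmean = 200  # hypothesised population mean from question 3a

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# One-sample t statistic and its degrees of freedom
t_stat = (mean - popmean) / (sd / math.sqrt(n))
dof = n - 1

print(t_stat, dof)
```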

b. Test whether the natural pollination and hand pollination under open field conditions are equally effective or are significantly different.

In [13]:
natural_fruit = poll['Fruit_Wt'][poll.Group == 'Natural']
natural_yield = poll['Seed_Yield_Plant'][poll.Group == 'Natural']
natural_length = poll['Seedling_length'][poll.Group == 'Natural']

hand_yield = poll['Seed_Yield_Plant'][poll.Group == 'Hand']
hand_fruit = poll['Fruit_Wt'][poll.Group == 'Hand']
hand_length = poll['Seedling_length'][poll.Group == 'Hand']

In [14]:
stats.ttest_ind(a=natural_fruit,b=hand_fruit)

Ttest_indResult(statistic=-17.669989614440286, pvalue=8.078362076486221e-13)

In [15]:
stats.ttest_ind(a=natural_length,b=hand_length)

Ttest_indResult(statistic=-2.542229999657055, pvalue=0.020428817064110226)

In [16]:
stats.ttest_ind(a=natural_yield,b=hand_yield)

Ttest_indResult(statistic=-13.958260515902547, pvalue=4.2714815854843853e-11)

4. An electronics firm is developing a new DVD player in response to customer requests. Using a prototype, the marketing team has collected focus group data for different age groups: Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above. Do you think that consumers of different ages rated the design differently? (Data set: dvdplayer.csv)

In [17]:
dvd = pd.read_csv('dvdplayer.csv')
dvd.agegroup.value_counts()

65 and over    17
Under 25       13
35-44          12
45-54          10
25-34          10
55-64           6
Name: agegroup, dtype: int64

In [18]:
dvd.head()

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.92446
4,Under 25,30.450007


In [19]:
s1 = dvd.dvdscore[dvd.agegroup == '65 and over']
s2 = dvd.dvdscore[dvd.agegroup == 'Under 25']
s3 = dvd.dvdscore[dvd.agegroup == '35-44']
s4 = dvd.dvdscore[dvd.agegroup == '25-34']
s5 = dvd.dvdscore[dvd.agegroup == '55-64']
s6 = dvd.dvdscore[dvd.agegroup == '45-54']

In [20]:
stats.f_oneway(s1,s2,s3,s4,s5,s6)

F_onewayResult(statistic=6.992526962676516, pvalue=3.087324905679639e-05)
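Writing one selection per age group works, but it is easy to miss or mistype a label. A possible alternative, sketched on a small made-up frame (the `toy` data is an assumption, not `dvdplayer.csv`), lets `groupby` build the per-group samples and unpacks them into `f_oneway`.

```python
import pandas as pd
from scipy import stats

# Made-up scores with the same column names as dvdplayer.csv (illustration only)
toy = pd.DataFrame({
    "agegroup": ["Under 25"] * 3 + ["25-34"] * 3 + ["35-44"] * 3,
    "dvdscore": [30.5, 28.0, 33.1, 25.2, 27.9, 24.4, 35.0, 36.2, 31.8],
})

# One array of scores per age group, whatever labels are present
groups = [scores.to_numpy() for _, scores in toy.groupby("agegroup")["dvdscore"]]

# One-way ANOVA across all groups
result = stats.f_oneway(*groups)
print(result.statistic, result.pvalue)
```

The F statistic does not depend on the order of the groups, so this gives the same result as listing each group by hand.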

5. A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age group, race, happiness, number of children, marital status, educational qualifications, income group, etc. were captured. (Data set: sample_survey.csv)

In [21]:
survey = pd.read_csv('sample_survey.csv')
survey.head()

Unnamed: 0,id,wrkstat,marital,childs,age,educ,paeduc,maeduc,speduc,degree,...,agecat,childcat,news1,news2,news3,news4,news5,car1,car2,car3
0,1,Working full time,Divorced,2.0,60.0,12.0,12.0,12.0,,High school,...,55 to 64,1-2,No,No,No,No,No,American,Japanese,Japanese
1,2,Working part-time,Never married,0.0,27.0,17.0,20.0,,,Junior college,...,25 to 34,,No,No,Yes,No,No,American,German,Japanese
2,3,Working full time,Married,2.0,36.0,12.0,12.0,12.0,16.0,High school,...,35 to 44,1-2,No,No,No,Yes,Yes,American,American,
3,4,Working full time,Never married,0.0,21.0,13.0,,12.0,,High school,...,Less than 25,,No,No,No,Yes,Yes,American,Other,
4,5,Working full time,Never married,0.0,35.0,16.0,,12.0,,Bachelor,...,35 to 44,,No,No,No,No,No,American,American,Korean


a. Is there any relationship between labour force status and marital status?

In [22]:
df = pd.crosstab(survey.wrkstat,survey.marital,margins=True)
df

wrkstat \ marital,Divorced,Married,Never married,Separated,Widowed,All
Keeping house,25,200,35,13,55,328
Other,12,16,14,4,8,54
Retired,53,168,17,6,150,394
School,7,9,60,2,1,79
Temporarily not working,9,23,11,1,2,46
"Unemployed, laid off",10,13,32,0,3,58
Working full time,295,778,392,58,44,1567
Working part-time,35,138,102,9,20,304
All,446,1345,663,93,283,2830


In [23]:
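# Note: df includes the 'All' margins, so the test sees a 9x6 table and reports dof = 40;
# the 8x5 table of counts alone has dof = (8-1)*(5-1) = 28, which changes the p-value
stats.chi2_contingency(observed=df)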

(729.2421426572284,
 1.820339965538765e-127,
 40,
 array([[5.16918728e+01, 1.55886926e+02, 7.68424028e+01, 1.07787986e+01,
         3.28000000e+01, 3.28000000e+02],
        [8.51024735e+00, 2.56643110e+01, 1.26508834e+01, 1.77455830e+00,
         5.40000000e+00, 5.40000000e+01],
        [6.20932862e+01, 1.87254417e+02, 9.23045936e+01, 1.29477032e+01,
         3.94000000e+01, 3.94000000e+02],
        [1.24501767e+01, 3.75459364e+01, 1.85077739e+01, 2.59611307e+00,
         7.90000000e+00, 7.90000000e+01],
        [7.24946996e+00, 2.18621908e+01, 1.07766784e+01, 1.51166078e+00,
         4.60000000e+00, 4.60000000e+01],
        [9.14063604e+00, 2.75653710e+01, 1.35879859e+01, 1.90600707e+00,
         5.80000000e+00, 5.80000000e+01],
        [2.46954770e+02, 7.44740283e+02, 3.67109894e+02, 5.14950530e+01,
         1.56700000e+02, 1.56700000e+03],
        [4.79095406e+01, 1.44480565e+02, 7.12197880e+01, 9.99010601e+00,
         3.04000000e+01, 3.04000000e+02],
        [4.46000000e+02, 1.345
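The degrees of freedom reported above (40) come from passing the crosstab with its `All` row and column. The sketch below rebuilds the 8x5 table from the counts printed above and runs the test with and without margins. The statistic is the same in both cases, because every margin cell has an expected count equal to its observed count, but the degrees of freedom (and therefore the p-value) differ. The names `inner` and `with_margins` are illustrative.

```python
import numpy as np
from scipy import stats

# Counts copied from the wrkstat x marital crosstab above, without the 'All' margins
inner = np.array([
    [25, 200, 35, 13, 55],     # Keeping house
    [12, 16, 14, 4, 8],        # Other
    [53, 168, 17, 6, 150],     # Retired
    [7, 9, 60, 2, 1],          # School
    [9, 23, 11, 1, 2],         # Temporarily not working
    [10, 13, 32, 0, 3],        # Unemployed, laid off
    [295, 778, 392, 58, 44],   # Working full time
    [35, 138, 102, 9, 20],     # Working part-time
])

# Rebuild the margins=True table: append the column totals, then the row totals
with_margins = np.vstack([inner, inner.sum(axis=0)])
with_margins = np.hstack([with_margins, with_margins.sum(axis=1, keepdims=True)])

chi2_m, p_m, dof_m, _ = stats.chi2_contingency(with_margins)
chi2, p, dof, _ = stats.chi2_contingency(inner)

print(chi2_m, dof_m)  # margins included
print(chi2, dof)      # margins excluded
```

With pandas, the same effect can be had by building the crosstab without `margins=True`, or by slicing the margins off (for example `df.iloc[:-1, :-1]`) before testing.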

b. Do you think educational qualification influences marital status?

In [24]:
df1 = pd.crosstab(survey.degree,survey.marital,margins=True)
df1

degree \ marital,Divorced,Married,Never married,Separated,Widowed,All
Bachelor,58,251,129,12,28,478
Graduate,29,123,41,3,9,205
High school,241,686,367,58,148,1500
Junior college,45,108,46,3,6,208
LT High school,70,174,77,17,92,430
All,443,1342,660,93,283,2821


In [25]:
# Note: df1 includes the 'All' margins (6x6), hence dof = 25; the 5x5 counts alone give dof = 16
stats.chi2_contingency(observed=df1)

(122.68449020508541,
 7.424404099753273e-15,
 25,
 array([[  75.06345268,  227.39312301,  111.83268345,   15.75824176,
           47.95249911,  478.        ],
        [  32.19248493,   97.52215526,   47.9617157 ,    6.75824176,
           20.56540234,  205.        ],
        [ 235.55476781,  713.57674583,  350.9393832 ,   49.45054945,
          150.4785537 , 1500.        ],
        [  32.66359447,   98.94930876,   48.66359447,    6.85714286,
           20.86635945,  208.        ],
        [  67.52570011,  204.55866714,  100.60262318,   14.17582418,
           43.1371854 ,  430.        ],
        [ 443.        , 1342.        ,  660.        ,   93.        ,
          283.        , 2821.        ]]))

c. Is happiness driven by earnings or by marital status?

In [30]:
maritalt = pd.crosstab(survey.marital,survey.happy,margins=True)
maritalt

marital \ happy,Not too happy,Pretty happy,Very happy,All
Divorced,72,278,93,443
Married,71,684,582,1337
Never married,108,426,120,654
Separated,30,49,13,92
Widowed,59,137,83,279
All,340,1574,891,2805


In [31]:
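# Note: maritalt includes the 'All' margins (6x4), hence dof = 15; the 5x3 counts alone give dof = 8
stats.chi2_contingency(observed=maritalt)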

(260.68943894182826,
 7.762777322980048e-47,
 15,
 array([[  53.6969697 ,  248.58538324,  140.71764706,  443.        ],
        [ 162.06060606,  750.24527629,  424.69411765, 1337.        ],
        [  79.27272727,  366.98609626,  207.74117647,  654.        ],
        [  11.15151515,   51.62495544,   29.22352941,   92.        ],
        [  33.81818182,  156.55828877,   88.62352941,  279.        ],
        [ 340.        , 1574.        ,  891.        , 2805.        ]]))

In [34]:
survey.income.value_counts()

$25000 or more    1587
$20000 - 24999     247
$10000 - 14999     192
$15000 - 19999     179
$8000 TO 9999       59
$7000 TO 7999       47
LT $1000            36
$5000 TO 5999       35
$6000 TO 6999       33
$1000 TO 2999       32
$4000 TO 4999       32
$3000 TO 3999       24
Name: income, dtype: int64

In [36]:
earningt = pd.crosstab(survey.happy,survey.income,margins=True)
earningt

happy \ income,$1000 TO 2999,$10000 - 14999,$15000 - 19999,$20000 - 24999,$25000 or more,$3000 TO 3999,$4000 TO 4999,$5000 TO 5999,$6000 TO 6999,$7000 TO 7999,$8000 TO 9999,LT $1000,All
Not too happy,7,39,33,40,113,9,9,6,14,12,9,11,302
Pretty happy,20,107,119,155,888,11,13,18,13,21,30,13,1408
Very happy,5,44,26,50,571,4,10,11,6,14,19,11,771
All,32,190,178,245,1572,24,32,35,33,47,58,35,2481


In [37]:
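# Note: earningt includes the 'All' margins (4x13), hence dof = 36; the 3x12 counts alone give dof = 22
stats.chi2_contingency(observed=earningt)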

(178.95053061216427,
 7.234749067043371e-21,
 36,
 array([[   3.89520355,   23.12777106,   21.66706973,   29.82265216,
          191.35187424,    2.92140266,    3.89520355,    4.26037888,
            4.01692866,    5.72108021,    7.06005643,    4.26037888,
          302.        ],
        [  18.16041919,  107.82748892,  101.01733172,  139.04070939,
          892.1305925 ,   13.62031439,   18.16041919,   19.86295848,
           18.72793229,   26.67311568,   32.91575977,   19.86295848,
         1408.        ],
        [   9.94437727,   59.04474002,   55.31559855,   76.13663845,
          488.51753325,    7.45828295,    9.94437727,   10.87666264,
           10.25513906,   14.60580411,   18.0241838 ,   10.87666264,
          771.        ],
        [  32.        ,  190.        ,  178.        ,  245.        ,
         1572.        ,   24.        ,   32.        ,   35.        ,
           33.        ,   47.        ,   58.        ,   35.        ,
         2481.        ]]))

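Both associations are statistically significant. The chi-square statistic for marital status (260.69) is larger than the one for earnings (178.95), which suggests that happiness is more strongly associated with marital status.
Because the two tables differ in size and in the number of respondents, this comparison of raw statistics is only indicative.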
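Raw chi-square statistics grow with sample size and table dimensions, so one hedged way to put the two associations on a common scale is Cramér's V, V = sqrt(chi2 / (n * min(r - 1, c - 1))). The sketch below uses only the statistics and grand totals printed above; r and c count categories without the `All` margins, and `cramers_v` is a helper written for this illustration, not a library function.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V for an n_rows x n_cols contingency table (margins excluded)."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# marital (5 categories) x happy (3 categories), n = 2805
v_marital = cramers_v(260.68943894182826, 2805, 5, 3)

# happy (3 categories) x income (12 categories), n = 2481
v_income = cramers_v(178.95053061216427, 2481, 3, 12)

print(round(v_marital, 3), round(v_income, 3))
```

On these figures V is roughly 0.22 for marital status and 0.19 for income, so the ordering of the two associations is unchanged, although both are modest in size.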