# FetchMaker

#### Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker.


#### FetchMaker’s mission is to match up prospective dog owners with their perfect pet.

#### Data on thousands of adoptable dogs are in FetchMaker’s system, and it’s your job to analyze some of that data.

Let's start by including a data interface called `fetchmaker` that will give you access to FetchMaker's dog data.

Add `import fetchmaker` at the top of the file to import the `fetchmaker` package.

We will also import `numpy`.


In [2]:
import numpy as np
import fetchmaker

The attributes that FetchMaker keeps track of are:

  - `weight`, an integer representing how heavy a dog is in pounds
  - `tail_length`, a float representing tail length in inches
  - `age`, in years
  - `color`, a string such as `"brown"` or `"grey"`
  - `is_rescue`, a flag stored as the integer `0` or `1`

The `fetchmaker` package lets you access this data for a specific breed of dog with the following format:

```
fetchmaker.get_weight("poodle")
```

This returns a pandas Series of the weights of the poodles recorded in the system. The other methods are `get_tail_length`, `get_color`, `get_age`, and `get_is_rescue`, each of which takes a breed as input.

Get the tail lengths of all of the `"rottweiler"`s in the system, and store them in a variable called `rottweiler_tl`.

In [5]:
rottweiler_tl = fetchmaker.get_tail_length(
    'rottweiler')
print(rottweiler_tl.head())

400    3.13
401    3.32
402    1.16
403    2.23
404    8.86
Name: tail_length, dtype: float64



Print out the mean of `rottweiler_tl` and the standard deviation of `rottweiler_tl`, using `np.mean` and `np.std`.


In [7]:
print(np.mean(rottweiler_tl))
print(np.std(rottweiler_tl))

4.2360999999999995
2.0647536874891395
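
One detail worth knowing here: `np.std` defaults to the population standard deviation (`ddof=0`), which divides by `n` rather than `n - 1`. The sketch below illustrates the difference using only the standard library's `statistics` module and the five tail lengths printed by `.head()` above, so its numbers will not match the full-sample results.

```python
import statistics

# The five rottweiler tail lengths shown by .head() above (not the full sample)
tail_lengths = [3.13, 3.32, 1.16, 2.23, 8.86]

mean_tl = statistics.mean(tail_lengths)

# Population standard deviation: divides by n, like np.std with its default ddof=0
pop_std = statistics.pstdev(tail_lengths)

# Sample standard deviation: divides by n - 1, like np.std(..., ddof=1)
sample_std = statistics.stdev(tail_lengths)

print(mean_tl)
print(pop_std)
print(sample_std)
```

The sample version is always the larger of the two, and the gap shrinks as the number of observations grows.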



Historically, about `8%` of dogs in the FetchMaker system have been rescues. We want to know whether whippets are significantly more or less likely than that to be rescues.


Store the `is_rescue` values for `"whippet"`s in a variable called `whippet_rescue`.

In [8]:
whippet_rescue = fetchmaker.get_is_rescue(
    'whippet')
print(whippet_rescue.head())

700    0
701    0
702    0
703    0
704    0
Name: is_rescue, dtype: int64



Use `np.count_nonzero` to get the number of entries in `whippet_rescue` that are `1`. Store this number in a variable called `num_whippet_rescues`.


In [11]:
num_whippet_rescues = np.count_nonzero(
    whippet_rescue)
print(num_whippet_rescues)

6



Get the number of samples in the whippet set by taking the `np.size` of `whippet_rescue`. Store this in a variable called `num_whippets`.


In [12]:
num_whippets = np.size(whippet_rescue)
print(num_whippets)

100



Use a binomial test to test the number of whippet rescues, `num_whippet_rescues`, against our expected percentage, `8%`.


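Remember to import the binomial test. In SciPy 1.7 and later this is `from scipy.stats import binomtest`; the older `binom_test` function was removed in SciPy 1.12.

In [13]:
from scipy.stats import binomtest
# binomtest returns a result object; the p-value is its .pvalue attribute
pval = binomtest(num_whippet_rescues,
                 num_whippets,
                 p=0.08).pvalue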


Print out the p-value. Is your result significant?


In [14]:
print(pval)

def pval_sig(pvalue):
  if pvalue < 0.05:
    return 'The pvalue is significant'
  else:
    return 'The pvalue is not significant'
print(pval_sig(pval))

0.5811780106238098
The pvalue is not significant
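
To see where this p-value comes from, the calculation can be sketched by hand with the standard library. The helper names below are hypothetical, and the approach (summing the probabilities of every outcome that is no more likely than the observed one) is the usual two-sided definition; that it agrees with SciPy's result to the last digit is an assumption, not something the source confirms.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_sided_binomial_pvalue(k, n, p):
    """Sum the probabilities of every outcome no more likely than k."""
    observed = binomial_pmf(k, n, p)
    # Small relative tolerance so ties are not lost to floating-point rounding
    threshold = observed * (1 + 1e-7)
    return sum(
        binomial_pmf(i, n, p)
        for i in range(n + 1)
        if binomial_pmf(i, n, p) <= threshold
    )


# Values from the cells above: 6 rescues out of 100 whippets, 8% expected
manual_pval = two_sided_binomial_pvalue(6, 100, 0.08)
print(manual_pval)
```

Six rescues out of 100 is close to the eight we would expect, so many other outcomes are at least as unlikely, which is why the p-value is large.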


Three of our most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds? Perform a comparative numerical test to determine if there is a significant difference.

In [15]:
wt_whip = fetchmaker.get_weight('whippet')
wt_ter = fetchmaker.get_weight('terrier')
wt_pit = fetchmaker.get_weight('pitbull')

print(np.mean(wt_whip))
print(np.mean(wt_ter))
print(np.mean(wt_pit))

# Standard deviations: ANOVA assumes the groups have similar spread
std_whip = np.std(wt_whip)
std_ter = np.std(wt_ter)
std_pit = np.std(wt_pit)

print(std_whip)
print(std_ter)
print(std_pit)

# Ratios of standard deviations; values close to 1 suggest similar spread
ratio_whip_ter = std_whip / std_ter
#print(ratio_whip_ter)
ratio_whip_pit = std_whip / std_pit
#print(ratio_whip_pit)
ratio_ter_pit = std_ter / std_pit
#print(ratio_ter_pit)

from scipy.stats import f_oneway

fstat, pval = f_oneway(wt_whip, wt_ter, wt_pit)
print(pval)

40.82
30.92
44.16
12.795608621710809
8.601953266555219
9.373067800885684
3.276415588274815e-17
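
The cell above computes ratios of standard deviations but never prints them. A minimal sketch of how they might be used, with the standard deviations copied from the output above: ANOVA assumes the groups have roughly equal spread, and the cutoff of 2 used here is a common rule of thumb assumed for illustration, not a formal test.

```python
# Standard deviations copied from the output above
stds = {
    "whippet": 12.795608621710809,
    "terrier": 8.601953266555219,
    "pitbull": 9.373067800885684,
}

# Ratio of the largest to the smallest standard deviation
max_ratio = max(stds.values()) / min(stds.values())
print(max_ratio)

# ASSUMPTION: a cutoff of 2 is a rule of thumb, not a formal test
similar_spread = max_ratio < 2
print(similar_spread)
```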


Now, perform another test to determine which of the pairs of these dog breeds differ from each other.

In [16]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

wt_whip_ter_pit = np.concatenate(
    [wt_whip,
    wt_ter,
    wt_pit])
labels = ['whippet'] * len(wt_whip) + ['terrier'] * len(wt_ter) + ['pitbull'] * len(wt_pit)

tukey_whip_ter_pit = pairwise_tukeyhsd(
    wt_whip_ter_pit,
    labels,
    0.05)
print(tukey_whip_ter_pit)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
 group1  group2  meandiff   lower    upper  reject
--------------------------------------------------
pitbull terrier    -13.24 -16.728   -9.752   True
pitbull whippet     -3.34  -6.828    0.148  False
terrier whippet       9.9   6.412   13.388   True
--------------------------------------------------
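
The `reject` column follows from the confidence interval on each mean difference: a pair is flagged as different when its interval excludes zero. The short sketch below re-derives that column from the interval bounds copied from the table; it illustrates how to read the output and does not reimplement Tukey's test.

```python
# (group1, group2, lower, upper) copied from the Tukey HSD table above
intervals = [
    ("pitbull", "terrier", -16.728, -9.752),
    ("pitbull", "whippet", -6.828, 0.148),
    ("terrier", "whippet", 6.412, 13.388),
]

for group1, group2, lower, upper in intervals:
    # The pair differs significantly when the interval excludes zero
    differs = not (lower <= 0 <= upper)
    print(group1, group2, differs)

significant_pairs = [
    (g1, g2) for g1, g2, lower, upper in intervals if not (lower <= 0 <= upper)
]
```

Terriers differ from both pitbulls and whippets, while the pitbull and whippet interval just barely includes zero.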


We want to see if `"poodle"`s and `"shihtzu"`s have significantly different color breakdowns.

Get the poodle colors and store them in a variable called `poodle_colors`.

Get the shih tzu colors and store them in a variable called `shihtzu_colors`.

In [17]:
poodle_colors = fetchmaker.get_color('poodle')
shihtzu_colors = fetchmaker.get_color('shihtzu')
#print(poodle_colors)
#print(shihtzu_colors)

You can get the number of occurrences of brown poodles by using `np.count_nonzero(poodle_colors == "brown")`.

Use this function to build a Chi Square contingency table, called `color_table`, with the following structure:

```
     Poodle	Shih Tzu
Black	x	x
Brown	x	x
Gold	x	x
Grey	x	x
White	x	x
```

Fill in the "x" entries with the number of each poodle or shih tzu with the specified color.

In [18]:
pood_bl = np.count_nonzero(
    poodle_colors == 'black')
pood_br = np.count_nonzero(
    poodle_colors == 'brown')
pood_go = np.count_nonzero(
    poodle_colors == 'gold')
pood_gr = np.count_nonzero(
    poodle_colors == 'grey')
pood_wh = np.count_nonzero(
    poodle_colors == 'white')

shih_bl = np.count_nonzero(
    shihtzu_colors == 'black')
shih_br = np.count_nonzero(
    shihtzu_colors == 'brown')
shih_go = np.count_nonzero(
    shihtzu_colors == 'gold')
shih_gr = np.count_nonzero(
    shihtzu_colors == 'grey')
shih_wh = np.count_nonzero(
    shihtzu_colors == 'white')

# Contingency table
#           Poodle   |  Shih Tzu
# -------+-----------+---------------
# Black  |  17          10
# Brown  |  13          36
# Gold   |   8           6
# Grey   |  52          41
# White  |  10           7

color_table = [[pood_bl, shih_bl],
               [pood_br, shih_br],
               [pood_go, shih_go],
               [pood_gr, shih_gr],
               [pood_wh, shih_wh]]
print(color_table)

[[17, 10], [13, 36], [8, 6], [52, 41], [10, 7]]
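
The ten separate counts above could also be produced in a loop. The following is a sketch using `collections.Counter` on hypothetical stand-in lists (not FetchMaker's records); `build_color_table` and `COLORS` are names introduced here for illustration. Because iterating a pandas Series yields its values, the same helper should accept `poodle_colors` and `shihtzu_colors` directly, though that is untested here.

```python
from collections import Counter

COLORS = ["black", "brown", "gold", "grey", "white"]


def build_color_table(first_breed_colors, second_breed_colors):
    """Return one [first, second] row of counts per color, in COLORS order."""
    first_counts = Counter(first_breed_colors)
    second_counts = Counter(second_breed_colors)
    # Counter returns 0 for colors that never appear
    return [[first_counts[c], second_counts[c]] for c in COLORS]


# Hypothetical stand-in data, not FetchMaker's records
demo_poodles = ["black", "grey", "grey", "white"]
demo_shihtzus = ["brown", "brown", "grey", "gold"]
demo_table = build_color_table(demo_poodles, demo_shihtzus)
print(demo_table)
```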


Feed your `color_table` into SciPy's chi-square test, save the p-value, and print it out.

Is there a significant difference?

In [19]:
from scipy.stats import chi2_contingency

chi2, pval, dof, expected = chi2_contingency(
    color_table)
print(chi2)
print(pval)
print(dof)
print(expected)

if pval < 0.05:
  print('Significant')
else:
  print('Not significant')

14.726934501399128
0.005302408293244593
4
[[13.5 13.5]
 [24.5 24.5]
 [ 7.   7. ]
 [46.5 46.5]
 [ 8.5  8.5]]
Significant
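
The numbers reported by `chi2_contingency` can be checked by hand. The sketch below recomputes the expected counts, the Pearson chi-square statistic, and the p-value from the observed table using only the standard library. The closed-form tail probability is specific to 4 degrees of freedom and is used here only to avoid a SciPy dependency; no continuity correction is applied, on the understanding that SciPy applies one only when there is a single degree of freedom.

```python
from math import exp

# Observed counts from color_table above: [poodle, shih tzu] per color
observed = [[17, 10], [13, 36], [8, 6], [52, 41], [10, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected_counts = [
    [r * c / grand_total for c in col_totals] for r in row_totals
]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected_counts)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# For 4 degrees of freedom the upper-tail probability has a closed form:
# P(X > x) = exp(-x / 2) * (1 + x / 2). This shortcut only holds when dof == 4.
half = chi2_stat / 2
chi2_pval = exp(-half) * (1 + half)

print(chi2_stat, dof, chi2_pval)
```

Most of the statistic comes from the brown row, where shih tzus outnumber poodles 36 to 13 against an expected 24.5 each.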
