In [1]:
from azureml import Workspace

ws = Workspace()
ds = ws.datasets['trinidad_sample_000000000000.csv']
frame = ds.to_dataframe()



In [2]:
df = frame
df['device_ifa'].value_counts()

00000000-0000-0000-0000-000000000000    7656
e24e6c59-dd59-4468-9b32-5d5395dc75f6    2930
440c367d-ceed-4663-9682-fb3d2f1e638b    2334
9071F452-D6BD-443E-ADE3-F679A12441D9    2316
6eba2486-216c-42e8-bf6e-cd0c02c655f4    1823
0c788db3-45ef-473a-ba87-bd0a27561113    1623
27e87f66-bd30-403a-8c6d-35ecb268eb6c    1541
1943bb23-f355-433c-8521-83ce9682c60e    1344
49371aae-fb7a-4764-9cb6-2b737d9e20ac    1077
6077b675-1d64-4a76-8863-9a441b3e3e2b    1075
f43f484c-6ac5-4e1c-b5b6-9125364a6737    1035
400e46e1-588c-4201-9130-4da05397fd40     916
aed17c22-c94a-4d0b-8b46-d307d261821a     909
99d1b3a3-cf9e-49e1-b5ba-6e16f199db68     880
C1F24A61-E71D-455B-AAA2-FA7D746E1032     838
710a190a-9055-4fbc-8fa0-92c6cb87bc2b     836
7de1d1fe-0f8d-4e99-9b7d-94410999c534     790
7b29aed5-9df9-468c-80b0-eae9886b6b2e     764
cd265ee5-39c0-4a39-830c-5f46ff74c217     718
d93d9804-e9e7-4de6-93ed-7bad874628e3     693
2b393e94-784d-4ad6-93be-eeddab23ce4a     655
7d820b7c-c8ba-48b6-aff5-2fa781b008ce     643
f7bd6986-f

### Remove the all-zero ifa so that we study only the ifas that carry a unique identifier.

In [3]:
ifa = df[['device_ifa','device_model']].loc[df['device_ifa'] != '00000000-0000-0000-0000-000000000000']
ifa['device_ifa'].value_counts()

e24e6c59-dd59-4468-9b32-5d5395dc75f6    2930
440c367d-ceed-4663-9682-fb3d2f1e638b    2334
9071F452-D6BD-443E-ADE3-F679A12441D9    2316
6eba2486-216c-42e8-bf6e-cd0c02c655f4    1823
0c788db3-45ef-473a-ba87-bd0a27561113    1623
27e87f66-bd30-403a-8c6d-35ecb268eb6c    1541
1943bb23-f355-433c-8521-83ce9682c60e    1344
49371aae-fb7a-4764-9cb6-2b737d9e20ac    1077
6077b675-1d64-4a76-8863-9a441b3e3e2b    1075
f43f484c-6ac5-4e1c-b5b6-9125364a6737    1035
400e46e1-588c-4201-9130-4da05397fd40     916
aed17c22-c94a-4d0b-8b46-d307d261821a     909
99d1b3a3-cf9e-49e1-b5ba-6e16f199db68     880
C1F24A61-E71D-455B-AAA2-FA7D746E1032     838
710a190a-9055-4fbc-8fa0-92c6cb87bc2b     836
7de1d1fe-0f8d-4e99-9b7d-94410999c534     790
7b29aed5-9df9-468c-80b0-eae9886b6b2e     764
cd265ee5-39c0-4a39-830c-5f46ff74c217     718
d93d9804-e9e7-4de6-93ed-7bad874628e3     693
2b393e94-784d-4ad6-93be-eeddab23ce4a     655
7d820b7c-c8ba-48b6-aff5-2fa781b008ce     643
f7bd6986-ff61-4ab2-baaa-76e3e30dee29     641
351CD7DC-2

#### We create a new column, device_count, holding the number of distinct device_model values associated with each ifa

In [4]:
ifa_model_ctn = ifa.groupby('device_ifa')['device_model'].nunique().sort_values(ascending=False).reset_index(name='device_count')
ifa_model_ctn.head()


| | device_ifa | device_count |
|---|---|---|
| 0 | F5542BB4-21AC-4406-9B19-4F7BA513F08D | 3 |
| 1 | 90473B91-2750-4998-A172-E43C7D1B09FF | 3 |
| 2 | 25c81c36-8041-4ae6-8eda-14d416761de4 | 2 |
| 3 | be3a9e3a-1b42-4470-9296-bda375dd0386 | 2 |
| 4 | d698c57c-264d-4517-8bb8-8b32424a370b | 2 |


We see that some ifas have more than one device_model associated with them. We select one such ifa and look at which device models are associated with it.

In [5]:
ifa.loc[ifa['device_ifa'] =='F5542BB4-21AC-4406-9B19-4F7BA513F08D'].groupby('device_model').groups.keys()

dict_keys(['iPad', 'A1432', 'iPhone'])
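
The two steps above (counting distinct models per ifa, then listing the models behind a multi-model ifa) can be tried on a small made-up frame. This is only a sketch: the identifiers `ifa-a`, `ifa-b`, `ifa-c`, the model `SM-G900` and the names `toy`, `toy_counts` and `models_by_ifa` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; identifiers and models are made up for illustration.
toy = pd.DataFrame({
    "device_ifa": ["ifa-a", "ifa-a", "ifa-a", "ifa-b", "ifa-b", "ifa-c"],
    "device_model": ["iPad", "iPhone", "iPad", "SM-G900", "SM-G900", "A1432"],
})

# Number of distinct models per identifier, largest first.
toy_counts = (
    toy.groupby("device_ifa")["device_model"]
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name="device_count")
)

# Distinct models for every identifier that maps to more than one model.
multi = toy_counts.loc[toy_counts["device_count"] > 1, "device_ifa"]
models_by_ifa = (
    toy[toy["device_ifa"].isin(multi)]
    .groupby("device_ifa")["device_model"]
    .unique()
    .apply(sorted)
    .to_dict()
)

print(toy_counts)
print(models_by_ifa)
```

Collecting the models for all multi-model ifas at once avoids repeating the single-ifa lookup for each identifier.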

### Problem:
We start from the assumption that each <b>device_ifa</b> represents exactly one <b>device_model</b>. We use a hypothesis test to check whether the mean of <b>device_count</b> equals one.<br> H_0: mean(device_count) = 1 <br> H_a: mean(device_count) != 1

### One-sample, two-sided test
We use the standard significance level <b>alpha</b> = 0.05 and calculate the p-value with a t-test.<br>
```python
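def decide(p_value, alpha=0.05):
    # Illustrative helper: compare the p-value with the significance level.
    if p_value <= alpha:
        return "reject H_0"  # we say the result is statistically significant
    return "fail to reject H_0"  # not enough evidence against H_0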
```

In [6]:
from scipy import stats
stats.ttest_1samp(a = ifa_model_ctn['device_count'], 
                  popmean = 1.00000)

Ttest_1sampResult(statistic=39.897538917461937, pvalue=0.0)
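
To see what `stats.ttest_1samp` computes, the statistic can be rebuilt by hand as t = (sample mean - popmean) / (s / sqrt(n)), with the two-sided p-value taken from a t distribution with n - 1 degrees of freedom. The sketch below uses a small hypothetical array `counts`, not the real `device_count` column, and checks the manual values against SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical device counts, not taken from the dataset.
counts = np.array([1, 1, 1, 2, 1, 3, 1, 1, 2, 1], dtype=float)
popmean = 1.0

n = counts.size
sample_mean = counts.mean()
sample_std = counts.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
t_manual = (sample_mean - popmean) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(counts, popmean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual and library values should agree up to floating-point rounding.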

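Since <b>pvalue < alpha</b>, we reject H_0: the mean device_count differs from 1. Because every ifa has at least one device_model and the t statistic is positive, the mean is greater than 1, so some ifas have more than one device_model associated with them. (The reported pvalue of 0.0 is a value too small to represent in floating point, not an exact zero.) However: <br> <b> Type 1 and Type 2 Error </b> <br> Any time you make a decision about a hypothesis there is a chance you made a mistake: you may have rejected a hypothesis that is true, or failed to reject a hypothesis that is false.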

<ol>
<li>
    <b>Type 1 Error = incorrectly rejecting the null hypothesis</b>. The researcher says there is a difference between the groups when there really isn't. It can be thought of as a false positive study result. Type 1 Error is related to the p-value and alpha: alpha is the probability of making a Type 1 Error when the null hypothesis is true. You can remember this by thinking that α is the first letter of the Greek alphabet.
</li>
<li>
    <b>Type 2 Error = failing to reject the null hypothesis when it should have been rejected</b>. The researcher says there is no difference between the groups when there is a difference. It can be thought of as a false negative study result. The probability of making a Type 2 Error is called beta. You can remember this by thinking that β is the second letter of the Greek alphabet.
</li>
</ol>
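
Both error rates can be estimated by simulation. The sketch below is an illustration only: the normal populations, the sample size of 30, the 2000 repetitions and the alternative mean of 1.3 are arbitrary assumptions, not properties of the dataset. When H_0 is true, the share of rejections should land near alpha. When H_0 is false, the share of non-rejections estimates beta for that particular alternative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_trials, n_samples = 2000, 30  # arbitrary illustrative sizes

# H_0 true: the population mean really is 1, so every rejection is a Type 1 error.
null_samples = rng.normal(loc=1.0, scale=0.5, size=(n_trials, n_samples))
p_null = stats.ttest_1samp(null_samples, popmean=1.0, axis=1).pvalue
type1_rate = np.mean(p_null <= alpha)

# H_0 false: the population mean is 1.3 (assumed), so every
# failure to reject is a Type 2 error.
alt_samples = rng.normal(loc=1.3, scale=0.5, size=(n_trials, n_samples))
p_alt = stats.ttest_1samp(alt_samples, popmean=1.0, axis=1).pvalue
type2_rate = np.mean(p_alt > alpha)

print(type1_rate, type2_rate)
```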

Reference: http://www.stomponstep1.com/p-value-null-hypothesis-type-1-error-statistical-significance/