# Evaluating Data Sources

## Case 1

### Data source
**Theoretical Data Source**: Amsterdam availability data scraped from AirBnB on December 24th.

**Question**: What are the popular neighborhoods in Amsterdam?

### Evaluation
AirBnB availability on Christmas Eve may not reflect typical neighborhood popularity in Amsterdam. Quite likely, places in all Amsterdam neighborhoods were fully booked that day regardless of their popularity. Moreover, neighborhoods where major celebration events took place might be more popular than usual.

In short, this source is of dubious quality because the data were collected under special circumstances that could distort AirBnB users' normal behavior.

## Case 2

### Data source
**Theoretical Data Source**: Mental health services use on September 12, 2001 in San Francisco, CA and New York City, NY.

**Question**: How do patterns of mental health service use vary between cities?

### Evaluation
On September 11, 2001, thousands died in a terrorist attack in New York City. While the whole world was shaken, New Yorkers were likely affected more than most and were more likely than people in San Francisco to seek mental health services the next day.

As a result, a comparison between these two cities on September 12, 2001 may not reflect their true differences in mental health service use in general.

## Case 3

### Data source

**Actual Data Source**: [Armenian Pub Survey](https://www.kaggle.com/erikhambardzumyan/pubs).

**Question**: What are the most common reasons Armenians visit local pubs?

### Evaluation

To check for potential selection bias, we can examine whether this dataset is a representative sample of the Armenian population.

First, let's import the data.

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Then, we can read the data and take a look at the first five rows.

In [16]:
# Import the data downloaded from Kaggle
pubs = pd.read_csv('armenian_pubs.csv')
# Inspect the first 5 rows
pubs.head()

| Unnamed: 0 | Timestamp | Age | Gender | Income | Occupation | Fav_Pub | WTS | Freq | Prim_Imp | Sec_Imp | Stratum | Lifestyle | Occasions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017/02/25 10:52:03 PM GMT+4 | 19 | Male | 100000.0 | Student | Station | 2000.0 | Several times in a month | Environment | Menu | Capital | Nightlife | Hang outs with friends |
| 1 | 2017/02/25 10:53:19 PM GMT+4 | 19 | Female | 50000.0 | Student | Calumet | 2000.0 | rarely (once two week/or a month) | Music | Pricing | Capital | Adventure/traveling/exploring | Hang outs with friends |
| 2 | 2017/02/25 10:54:05 PM GMT+4 | 20 | Male | 100000.0 | Student | Liberty | 3000.0 | rarely (once two week/or a month) | Environment | Music | Capital | Busy(student life, work) | Hang outs with friends |
| 3 | 2017/02/25 10:55:09 PM GMT+4 | 18 | Male | 0.0 | Student | Calumet | 3000.0 | Several times in a month | Environment | Music | Capital | Art | Hang outs with friends |
| 4 | 2017/02/25 10:55:38 PM GMT+4 | 19 | Female | 130000.0 | Student + working | Liberty | 10000.0 | rarely (once two week/or a month) | Pricing | Environment | Capital | | Hang outs with friends |


As we can see, this dataset contains pub-goers' demographic information (e.g., age, gender, income, occupation) as well as their pub-going behaviors (e.g., favorite pub, frequency, and factors of primary and secondary importance). To see whether this sample might be biased, we can check whether its demographics are representative of the 2017 Armenian population.

#### Age

Calling `pubs['Age']` raises a `KeyError`, which may be caused by whitespace in the column names. According to this [Stack Overflow answer](https://stackoverflow.com/questions/21606987/how-can-i-strip-the-whitespace-from-pandas-dataframe-headers), a quick fix is to strip the whitespace from each column name with `str.strip()` and apply the result with `rename()` (the `inplace=True` argument modifies the original data frame instead of returning a new one).

In [23]:
pubs.rename(columns=lambda x: x.strip(), inplace=True)
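
To make the failure mode concrete, here is a minimal sketch on a toy data frame whose column names are padded with spaces. The frame `demo` and its values are made up for illustration and are not taken from the survey.

```python
import pandas as pd

# Hypothetical toy frame: the column names carry stray spaces,
# mimicking the kind of headers that make pubs['Age'] fail.
demo = pd.DataFrame({" Age ": [19, 20], "Gender ": ["Male", "Female"]})

try:
    demo["Age"]
except KeyError:
    print("KeyError: 'Age' does not match ' Age '")

# Strip whitespace from every column name, returning a new frame.
demo = demo.rename(columns=lambda name: name.strip())

print(list(demo.columns))
print(demo["Age"].tolist())
```

An equivalent way to write the fix is `pubs.columns = pubs.columns.str.strip()`, which assigns the cleaned names directly.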

Let's summarize the age of this sample:

In [26]:
pubs['Age'].describe()

count    175.000000
mean      19.548571
std        2.770262
min       16.000000
25%       18.000000
50%       19.000000
75%       20.000000
max       41.000000
Name: Age, dtype: float64

The median age in this sample is 19 years, far younger than the national median of 35.1 years (source: [Armenia Demographics Profile 2018](https://www.indexmundi.com/armenia/demographics_profile.html), estimated in 2017). Since younger people may go to pubs for different reasons than other age groups, this sample is a biased source for answering the question about common reasons why Armenians go to pubs.

#### Gender

Is the male-female gender ratio in this sample representative?

In [36]:
pubs['Gender'].value_counts()['Male']/pubs['Gender'].value_counts()['Female']

0.7156862745098039

The national male-to-female ratio in Armenia was 0.94, noticeably higher than the sample's 0.72, so women are over-represented in this sample.
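
A ratio can be easier to interpret as a share. As a quick arithmetic sketch, a male-to-female ratio r corresponds to a male share of r / (1 + r), assuming everyone is counted as either male or female (an assumption made here only for the conversion). The helper name `male_share` is hypothetical.

```python
def male_share(male_to_female_ratio: float) -> float:
    """Convert a male-to-female ratio into the male share of the total."""
    return male_to_female_ratio / (1 + male_to_female_ratio)


sample_share = male_share(0.7156862745098039)  # sample ratio computed above
national_share = male_share(0.94)              # national ratio quoted in the text

print(f"sample:   {sample_share:.1%}")    # about 41.7%
print(f"national: {national_share:.1%}")  # about 48.5%
```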

#### Income

Are incomes in this sample representative? (According to historical data on [Online Currency Converter](https://freecurrencyrates.com/en/exchange-rate-history/USD-AMD/2017), the exchange rate on 2017/02/25 was 485.86 Armenian drams per US dollar.)

In [38]:
# Convert drams to USD before summarizing so that count stays a row count
(pubs['Income'] / 485.86).describe()

count     174.000000
mean      226.809577
std       736.019961
min         0.000000
25%         0.205821
50%       113.201334
75%       205.820607
max      9261.927304
Name: Income, dtype: float64

Armenia's annual household income per capita in 2017 was about 1453.62 USD (source: [CEIC](https://www.ceicdata.com/en/indicator/armenia/annual-household-income-per-capita)), much higher than the sample mean of 226.81 USD.
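
One detail worth keeping in mind when converting currencies in a summary: dividing the output of `describe()` scales every row, including `count`, whereas converting the column first leaves `count` as the number of non-missing answers. The sketch below illustrates the difference on made-up incomes; the series `income_amd` is hypothetical and not drawn from the survey.

```python
import pandas as pd

AMD_PER_USD = 485.86  # exchange rate quoted in the text

# Hypothetical incomes in drams, including one missing answer.
income_amd = pd.Series([0.0, 50000.0, 100000.0, None])

converted_first = (income_amd / AMD_PER_USD).describe()
summarized_first = income_amd.describe() / AMD_PER_USD

print(converted_first["count"])   # 3.0, the number of non-missing answers
print(summarized_first["count"])  # 3 / 485.86, no longer a row count
```

The remaining rows (mean, standard deviation, quartiles, minimum, and maximum) scale linearly, so they agree between the two approaches.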

#### Occupation

In [40]:
pubs['Occupation'].value_counts()/len(pubs['Occupation'])

Student                             0.697143
Student + working                   0.228571
Working                             0.051429
Working                             0.005714
army                                0.005714
Entrepreneur / Software Engineer    0.005714
CEO                                 0.005714
Name: Occupation, dtype: float64

Students and working students made up nearly 93% of this sample, which is clearly not representative of the occupational makeup of the Armenian population.
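
The output above lists `Working` twice, which suggests the raw labels differ in a way that is invisible in print, for example a trailing space. That is a guess and has not been verified against the data. Under that assumption, here is a sketch of normalizing labels before counting, using made-up values:

```python
from collections import Counter

# Hypothetical raw answers; "Working " carries a trailing space.
raw = ["Student", "Working", "Working ", "Student", "Student + working"]

before = Counter(raw)
after = Counter(label.strip() for label in raw)

print(before["Working"], before["Working "])  # 1 1
print(after["Working"])                       # 2
```

If the cause is indeed stray whitespace, the pandas counterpart would be `pubs['Occupation'].str.strip().value_counts()`.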

#### Summary

This dataset is not a random sample of the Armenian population. It is strongly biased toward students, who are younger and less wealthy than the population as a whole. In light of this bias, we need to limit the scope to Armenian students (or even students of the American University of Armenia, according to this dataset's [Kaggle description](https://www.kaggle.com/erikhambardzumyan/pubs)) when drawing any conclusions.