<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Swiss Dog Owners

Please check out the data documentation on Kaggle, [here](https://www.kaggle.com/kmader/dogs-of-zurich).

## 1) Import the libraries you'll need below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## 2) Load the datasets
This time, there are _three_ datasets. Load them in as separate variables.

In [2]:
dogs15 = pd.read_csv('../../../../resource-datasets/swiss_dogs/20151001hundehalter.csv')
dogs16 = pd.read_csv('../../../../resource-datasets/swiss_dogs/20160307hundehalter.csv')
dogs17 = pd.read_csv('../../../../resource-datasets/swiss_dogs/20170308hundehalter.csv')

In [3]:
dogs15.head()

Unnamed: 0,HALTER_ID,ALTER,GESCHLECHT,STADTKREIS,STADTQUARTIER,RASSE1,RASSE1_MISCHLING,RASSE2,RASSE2_MISCHLING,RASSENTYP,GEBURTSJAHR_HUND,GESCHLECHT_HUND,HUNDEFARBE
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,,K,2011,w,schwarz/braun
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,,K,2002,w,brindle
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,,I,2012,w,braun
3,893,61-70,w,7.0,71.0,Mittelschnauzer,,,,I,2010,w,schwarz
4,1177,51-60,m,10.0,102.0,Shih Tzu,,,,K,2011,m,schwarz/weiss


## 3) Append them together.
In each dataset, make an appropriate `year` column. After that, append all three `DataFrame`s into one master `DataFrame`.

In [4]:
dogs = pd.concat([dogs15, dogs16, dogs17], axis=0)
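
The concatenation above stacks the three frames but does not create the `year` column that the prompt asks for (and that the final problem relies on). A minimal sketch of one way to add it, using small made-up frames in place of the real CSVs; the toy rows and the `frames_by_year` name are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# Toy stand-ins for the three yearly snapshots (made-up rows, not the real data).
frames_by_year = {
    2015: pd.DataFrame({'HALTER_ID': [126, 574], 'GEBURTSJAHR_HUND': [2011, 2002]}),
    2016: pd.DataFrame({'HALTER_ID': [126], 'GEBURTSJAHR_HUND': [2011]}),
    2017: pd.DataFrame({'HALTER_ID': [695, 893], 'GEBURTSJAHR_HUND': [2012, 2010]}),
}

# Tag each snapshot with the year it was taken, then stack them row-wise.
tagged = [df.assign(year=year) for year, df in frames_by_year.items()]
combined = pd.concat(tagged, axis=0, ignore_index=True)

print(combined['year'].value_counts().sort_index())
```

Note that adding `year` to the real data gives 14 columns, so a 13-name positional assignment to `.columns` would no longer fit.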

## 4) Check yourself
Did step 3 work? Did the data append properly?

In [5]:
dogs15.shape

(6980, 13)

In [6]:
dogs16.shape

(6930, 13)

In [7]:
dogs17.shape

(7155, 13)

In [8]:
dogs.shape

(21065, 13)
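
The shapes agree: 6980 + 6930 + 7155 = 21065 rows, with 13 columns throughout. The same check can be written as code instead of done by eye. This is a sketch on made-up toy frames (the `parts` and `stacked` names are illustrative):

```python
import pandas as pd

# Toy frames standing in for the yearly snapshots (made-up values).
parts = [
    pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}),
    pd.DataFrame({'a': [7], 'b': [8]}),
    pd.DataFrame({'a': [9, 10], 'b': [11, 12]}),
]
stacked = pd.concat(parts, axis=0)

# Row counts should add up, and the columns should be unchanged
# when every input frame uses the same column names.
rows_match = len(stacked) == sum(len(p) for p in parts)
cols_match = all(list(p.columns) == list(stacked.columns) for p in parts)
print(rows_match, cols_match)
```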

## 5) Ach nein! This data set is in German!
Rename each column so that it is in English. The translations are in the data documentation.

**Note:** This dataset describes dog **owners** and their dogs, so be careful to distinguish the two when labeling columns.

In [9]:
dogs15.columns

Index(['HALTER_ID', 'ALTER', 'GESCHLECHT', 'STADTKREIS', 'STADTQUARTIER',
       'RASSE1', 'RASSE1_MISCHLING', 'RASSE2', 'RASSE2_MISCHLING', 'RASSENTYP',
       'GEBURTSJAHR_HUND', 'GESCHLECHT_HUND', 'HUNDEFARBE'],
      dtype='object')

In [10]:
dogs.columns = ['owner_id', 'age', 'gender', 'city_district', 'city_quarter', 'primary_breed',
                'primary_breed_hybrid', 'secondary_breed', 'secondary_breed_hybrid',
                'breed_type', 'dog_birth_year', 'dog_gender', 'dog_color']
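
Assigning a list to `.columns` depends on the column order and count being exactly right. A possible alternative is `rename` with an explicit mapping, which leaves any unlisted column untouched. The sketch below pairs the German names from the output above with the English names used in this notebook; the empty `toy` frame and its extra `year` column are assumptions for illustration:

```python
import pandas as pd

# German -> English names, paired in the order shown in the notebook.
german_to_english = {
    'HALTER_ID': 'owner_id',
    'ALTER': 'age',
    'GESCHLECHT': 'gender',
    'STADTKREIS': 'city_district',
    'STADTQUARTIER': 'city_quarter',
    'RASSE1': 'primary_breed',
    'RASSE1_MISCHLING': 'primary_breed_hybrid',
    'RASSE2': 'secondary_breed',
    'RASSE2_MISCHLING': 'secondary_breed_hybrid',
    'RASSENTYP': 'breed_type',
    'GEBURTSJAHR_HUND': 'dog_birth_year',
    'GESCHLECHT_HUND': 'dog_gender',
    'HUNDEFARBE': 'dog_color',
}

# Empty toy frame with the German headers plus an extra column.
toy = pd.DataFrame(columns=list(german_to_english) + ['year'])

# Columns missing from the mapping (here, 'year') keep their names.
renamed = toy.rename(columns=german_to_english)
print(list(renamed.columns))
```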

In [11]:
dogs.head()

Unnamed: 0,owner_id,age,gender,city_district,city_quarter,primary_breed,primary_breed_hybrid,secondary_breed,secondary_breed_hybrid,breed_type,dog_birth_year,dog_gender,dog_color
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,,K,2011,w,schwarz/braun
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,,K,2002,w,brindle
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,,I,2012,w,braun
3,893,61-70,w,7.0,71.0,Mittelschnauzer,,,,I,2010,w,schwarz
4,1177,51-60,m,10.0,102.0,Shih Tzu,,,,K,2011,m,schwarz/weiss


## 6) One of these columns is totally blank.
Drop it permanently.

In [12]:
# how='all' drops only the columns in which every value is missing.
dogs.dropna(axis='columns', how='all', inplace=True)

In [13]:
dogs.head()

Unnamed: 0,owner_id,age,gender,city_district,city_quarter,primary_breed,primary_breed_hybrid,secondary_breed,breed_type,dog_birth_year,dog_gender,dog_color
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,K,2011,w,schwarz/braun
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,K,2002,w,brindle
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,I,2012,w,braun
3,893,61-70,w,7.0,71.0,Mittelschnauzer,,,I,2010,w,schwarz
4,1177,51-60,m,10.0,102.0,Shih Tzu,,,K,2011,m,schwarz/weiss


## 7) Pugs
Create a filtered DataFrame that contains all of the pugs in this dataset. And yes, even the dog breeds are in German: the German word for pug is "Mops".

In [14]:
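# .copy() makes pugs an independent DataFrame, so later column assignments
# do not trigger a SettingWithCopyWarning.
pugs = dogs[(dogs['primary_breed'] == 'Mops') | (dogs['secondary_breed'] == 'Mops')].copy()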
pugs.head()

Unnamed: 0,owner_id,age,gender,city_district,city_quarter,primary_breed,primary_breed_hybrid,secondary_breed,breed_type,dog_birth_year,dog_gender,dog_color
35,6321,51-60,w,8.0,81.0,Mops,,,K,2011,m,beige
37,6469,41-50,m,11.0,111.0,Mops,,,K,2011,m,beige
104,65665,61-70,w,8.0,82.0,Mops,,,K,2002,m,beige
105,65665,61-70,w,8.0,82.0,Mops,,,K,2002,m,schwarz
123,66524,51-60,w,12.0,123.0,Mops,,Chihuahua,K,2013,m,blue/merle


## 8) Tables
For the pug data, show the counts of:
* Human genders
* Dog genders
* Dog color (only show the top 5)
* Dog gender _versus_ human gender

In [15]:
pugs.groupby('gender').size()

gender
m    138
w    419
dtype: int64

In [16]:
pugs.groupby('dog_gender').size()

dog_gender
m    302
w    255
dtype: int64

In [17]:
pugs.groupby('dog_color').size().sort_values(ascending=False)[:5]

dog_color
beige            281
schwarz          127
beige/schwarz     63
braun             21
hellbraun          8
dtype: int64
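
The last item in the list, dog gender versus human gender, is a two-way table. One way to build it is `pd.crosstab`. The sketch below uses a tiny made-up frame (`toy_pugs`) with the same column names rather than the real pug data:

```python
import pandas as pd

# Made-up rows using the notebook's column names.
toy_pugs = pd.DataFrame({
    'gender': ['w', 'w', 'm', 'w', 'm'],
    'dog_gender': ['m', 'w', 'm', 'm', 'w'],
})

# Rows: dog gender; columns: owner gender; cells: number of records.
table = pd.crosstab(toy_pugs['dog_gender'], toy_pugs['gender'])
print(table)
```

On the real data, the same call with `pugs['dog_gender']` and `pugs['gender']` should produce the corresponding counts.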

## 9) Translate the gender columns
Convert all instances of `m` to `M` and `w` to `F` in the pug data.

In [18]:
# replace() leaves already-translated values alone, so re-running this cell is safe;
# map() would turn them into NaN on a second run.
pugs['gender'] = pugs['gender'].replace({'m': 'M', 'w': 'F'})
pugs['dog_gender'] = pugs['dog_gender'].replace({'m': 'M', 'w': 'F'})

pugs.head()

Unnamed: 0,owner_id,age,gender,city_district,city_quarter,primary_breed,primary_breed_hybrid,secondary_breed,breed_type,dog_birth_year,dog_gender,dog_color
35,6321,51-60,F,8.0,81.0,Mops,,,K,2011,M,beige
37,6469,41-50,M,11.0,111.0,Mops,,,K,2011,M,beige
104,65665,61-70,F,8.0,82.0,Mops,,,K,2002,M,beige
105,65665,61-70,F,8.0,82.0,Mops,,,K,2002,M,schwarz
123,66524,51-60,F,12.0,123.0,Mops,,Chihuahua,K,2013,M,blue/merle


## 10) Translate colors
Still using the pug data, translate each dog's color into English, using the dictionary provided below as a guide. For colors not in the dictionary, use `"other"`.


In [19]:
color_dict = {'beige': 'beige', 
              'schwarz': 'black', 
              'braun': 'brown', 
              'gestromt': 'brindle', 
              'beige/schwarz': 'beige/black',
              'silber': 'silver',
              'beige/weiss': 'beige/white',
              'grau': 'grey',
              'rehbraun': 'fawn brown',
              'sandfarbig': 'buff',
              'brindle': 'brindle',
              'schwarz/weiss': 'black/white',
              'braun gestromt': 'brown brindle',
              'braun/schwarz': 'brown/black',
              'hellbraun': 'light-brown',
              'blondfarben': 'blond',
              'tricolor': 'tricolor',
              'beige/braun': 'beige/brown',
              'apricot': 'apricot', 
              'weiss': 'white',
              'blue/merle': 'blue/merle',
              'creme': 'creme',
              'sable': 'sable'}
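
One way to apply a dictionary like this is `Series.map` followed by `fillna`: `map` returns `NaN` for any value that is not a key, and `fillna` then labels those as `"other"`. The sketch below uses an abbreviated copy of the dictionary and a made-up color series (`'rot'` is a deliberately unmapped value):

```python
import pandas as pd

# Abbreviated copy of the color dictionary, for illustration only.
color_dict = {'beige': 'beige', 'schwarz': 'black', 'braun': 'brown'}

# Made-up colors; 'rot' is not a key in the dictionary.
colors = pd.Series(['beige', 'schwarz', 'rot', 'braun'], name='dog_color')

# Unmapped values become NaN under map(), then "other" under fillna().
translated = colors.map(color_dict).fillna('other')
print(translated.tolist())
```

Missing color values would also end up as `"other"` with this approach.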

## Grande Finale

This problem is going to involve a few steps. Read carefully. Pick apart each task and solve them one at a time. **This pertains to the full dataset, no longer just pugs.**

* Create a new column, `age_mid`, which approximates the owner's age as the midpoint of the range in the `age` column. For example, `51-60` => `55.5`. Two ways to do this:
    - Build a dictionary as in the last problem. This is brute force: it takes a lot of manual work and is not extensible.
    - Write a function that takes a single string as input and returns the appropriate number. While this might sound more difficult, it actually involves less work, is cleaner, and is more extensible.
* Create a new column, `dog_age`, which is the age of each dog _at that time_. You can compute this from the dog's year of birth and the `year` column you made in part 3.
* Take a look at this new `dog_age` variable. Drop or clean up the values that make no sense and are likely the result of data errors.
* Subset to only include pugs (`Mops`), shiba inus (`Shiba Inu`), any dog with "Retriever" in its name, and any dog with "Terrier" in its name.
    - _Hint:_ Check out the `.str.contains()` method.
* Keep only breeds with more than 100 observations.
* With this data subset, compute the average human and dog age for each breed.
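
The steps above can be sketched end to end on a small made-up frame. Several things here are assumptions rather than part of the exercise: the toy rows, the 0 to 25 year plausibility window for `dog_age`, matching on `primary_breed` only, and a `MIN_COUNT` of 1 so that the toy data survives the filter (the exercise itself asks for more than 100).

```python
import pandas as pd


def age_midpoint(age_range):
    """Return the midpoint of a range string such as '51-60', or NaN if unparseable."""
    try:
        low, high = (float(part) for part in str(age_range).split('-'))
    except ValueError:
        return float('nan')
    return (low + high) / 2


# Made-up rows using the notebook's column names plus a 'year' column.
toy = pd.DataFrame({
    'age': ['51-60', '61-70', '41-50', '31-40', '21-30', None],
    'primary_breed': ['Mops', 'Welsh Terrier', 'Labrador Retriever',
                      'Shiba Inu', 'Chihuahua', 'Mops'],
    'dog_birth_year': [2011, 2002, 2012, 1962, 2010, 2014],
    'year': [2015, 2015, 2016, 2016, 2017, 2017],
})

toy['age_mid'] = toy['age'].apply(age_midpoint)
toy['dog_age'] = toy['year'] - toy['dog_birth_year']

# Assumed plausibility window: drop negative or implausibly large dog ages.
MAX_DOG_AGE = 25
plausible = toy[toy['dog_age'].between(0, MAX_DOG_AGE)]

# Exact matches for two breeds, substring matches for the other two groups.
breed = plausible['primary_breed']
wanted = (breed.isin(['Mops', 'Shiba Inu'])
          | breed.str.contains('Retriever|Terrier', na=False))
subset = plausible[wanted]

# Keep breeds with more than MIN_COUNT rows (use 100 on the real data).
MIN_COUNT = 1
counts = subset['primary_breed'].value_counts()
common = subset[subset['primary_breed'].isin(counts[counts > MIN_COUNT].index)]

# Average owner age (midpoint) and dog age per breed; NaN midpoints are skipped.
summary = common.groupby('primary_breed')[['age_mid', 'dog_age']].mean()
print(summary)
```

Writing `age_midpoint` as a function rather than a lookup table means that any range in `low-high` form is handled without listing it in advance.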