<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Swiss Dog Owners

The data documentation is available on Kaggle, [here](https://www.kaggle.com/kmader/dogs-of-zurich).

## 1) Import the libraries you'll need below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## 2) Load the datasets
This time, there are _three_ datasets. Load them in as separate variables.

In [2]:
dogs15 = pd.read_csv('../../../../../resource-datasets/swiss_dogs/20151001hundehalter.csv')
dogs16 = pd.read_csv('../../../../../resource-datasets/swiss_dogs/20160307hundehalter.csv')
dogs17 = pd.read_csv('../../../../../resource-datasets/swiss_dogs/20170308hundehalter.csv')

In [3]:
dogs15.shape, dogs16.shape, dogs17.shape

((6980, 13), (6930, 13), (7155, 13))

## 3) Append them together.
In each dataset, make an appropriate `year` column. After that, append all three `DataFrame`s into one master `DataFrame`.

In [4]:
dogs15['year'] = 2015
dogs16['year'] = 2016
dogs17['year'] = 2017
dogs = pd.concat([dogs15, dogs16, dogs17], axis=0)

## 4) Check yourself
Did step 3 work? Did the data append properly?

In [5]:
dogs.head()

Unnamed: 0,HALTER_ID,ALTER,GESCHLECHT,STADTKREIS,STADTQUARTIER,RASSE1,RASSE1_MISCHLING,RASSE2,RASSE2_MISCHLING,RASSENTYP,GEBURTSJAHR_HUND,GESCHLECHT_HUND,HUNDEFARBE,year
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,,K,2011,w,schwarz/braun,2015
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,,K,2002,w,brindle,2015
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,,I,2012,w,braun,2015
3,893,61-70,w,7.0,71.0,Mittelschnauzer,,,,I,2010,w,schwarz,2015
4,1177,51-60,m,10.0,102.0,Shih Tzu,,,,K,2011,m,schwarz/weiss,2015


In [6]:
dogs.shape

(21065, 14)
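One way to confirm the append is to compare row counts before and after, and to count rows per `year`. The following is a minimal sketch on toy frames; the names `a`, `b`, and `combined` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the yearly files (illustrative data only).
a = pd.DataFrame({'HALTER_ID': [1, 2]})
b = pd.DataFrame({'HALTER_ID': [3, 4, 5]})
a['year'] = 2015
b['year'] = 2016

combined = pd.concat([a, b], axis=0)

# Row counts should add up, and each year should keep its own rows.
rows_match = len(combined) == len(a) + len(b)
per_year = combined['year'].value_counts().sort_index()
print(rows_match)
print(per_year)
```

For the real data, 6980 + 6930 + 7155 = 21065, which matches `dogs.shape`. Note that the original row labels are kept, so index values repeat across years; passing `ignore_index=True` to `pd.concat` would give a fresh index instead.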

## 5) Ach nein! This data set is in German!
Rename each column so that it is in English. The translations are in the data documentation.

**Note:** This dataset describes dog **owners** as well as their dogs, so label the columns carefully. For example, `GESCHLECHT` is the owner's gender, while `GESCHLECHT_HUND` is the dog's.

In [7]:
dogs.columns

Index(['HALTER_ID', 'ALTER', 'GESCHLECHT', 'STADTKREIS', 'STADTQUARTIER',
       'RASSE1', 'RASSE1_MISCHLING', 'RASSE2', 'RASSE2_MISCHLING', 'RASSENTYP',
       'GEBURTSJAHR_HUND', 'GESCHLECHT_HUND', 'HUNDEFARBE', 'year'],
      dtype='object')

In [8]:
dogs.head(3)

Unnamed: 0,HALTER_ID,ALTER,GESCHLECHT,STADTKREIS,STADTQUARTIER,RASSE1,RASSE1_MISCHLING,RASSE2,RASSE2_MISCHLING,RASSENTYP,GEBURTSJAHR_HUND,GESCHLECHT_HUND,HUNDEFARBE,year
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,,K,2011,w,schwarz/braun,2015
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,,K,2002,w,brindle,2015
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,,I,2012,w,braun,2015


In [9]:
dogs.columns = [
    'id', 'age', 'gender', 'district', 'quarter', 'breed1', 'breed1_hybrid',
    'breed2', 'breed2_hybrid', 'breed_type',
    'dog_year', 'dog_gender', 'color', 'year'
]
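Assigning a list to `.columns` relies on the order of the columns being exactly as expected. A possible alternative, sketched here on a toy frame with only three of the German headers, is `DataFrame.rename` with an explicit mapping; the names `translations` and `renamed` are illustrative.

```python
import pandas as pd

# Toy frame with a few of the German headers from the dataset.
df = pd.DataFrame({'HALTER_ID': [126], 'ALTER': ['51-60'], 'RASSE1': ['Welsh Terrier']})

# Explicit old -> new mapping; the order of the columns no longer matters.
translations = {'HALTER_ID': 'id', 'ALTER': 'age', 'RASSE1': 'breed1'}
renamed = df.rename(columns=translations)
print(list(renamed.columns))
```

`rename` returns a new frame and leaves any column not listed in the mapping untouched.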

In [10]:
dogs.head()

Unnamed: 0,id,age,gender,district,quarter,breed1,breed1_hybrid,breed2,breed2_hybrid,breed_type,dog_year,dog_gender,color,year
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,,K,2011,w,schwarz/braun,2015
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,,K,2002,w,brindle,2015
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,,I,2012,w,braun,2015
3,893,61-70,w,7.0,71.0,Mittelschnauzer,,,,I,2010,w,schwarz,2015
4,1177,51-60,m,10.0,102.0,Shih Tzu,,,,K,2011,m,schwarz/weiss,2015


## 6) One of these columns is totally blank.
Drop it permanently.

In [11]:
dogs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21065 entries, 0 to 7154
Data columns (total 14 columns):
id               21065 non-null int64
age              21060 non-null object
gender           21065 non-null object
district         21060 non-null float64
quarter          21060 non-null float64
breed1           21065 non-null object
breed1_hybrid    1939 non-null object
breed2           1590 non-null object
breed2_hybrid    0 non-null float64
breed_type       20891 non-null object
dog_year         21065 non-null int64
dog_gender       21065 non-null object
color            21065 non-null object
year             21065 non-null int64
dtypes: float64(3), int64(3), object(8)
memory usage: 2.4+ MB


In [12]:
dogs.drop('breed2_hybrid', axis=1, inplace=True)
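If the empty column were not known by name, `dropna` could find it. This is a small sketch on made-up data, not the notebook's own approach:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'breed2': ['Beagle', None],
    'breed2_hybrid': [np.nan, np.nan],  # entirely empty
    'color': ['braun', 'beige'],
})

# Drop every column in which all values are missing.
cleaned = df.dropna(axis=1, how='all')
print(list(cleaned.columns))
```

Columns with only some missing values, such as `breed2` here, are kept.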

## 7) Pugs
Create a filtered DataFrame that contains all of the pugs in this dataset. The breed names are in German too: the German word for pug is "Mops".

![](imgs/chloe.jpg)

In [13]:
pugs = dogs[dogs['breed1'] == 'Mops']

## 8) Tables
For the pug data, show the counts of:
* Human genders
* Dog genders
* Dog color (only show the top 5)
* Dog gender _versus_ human gender

In [14]:
pugs.gender.value_counts(dropna=False)

w    402
m    129
Name: gender, dtype: int64

In [15]:
pugs.dog_gender.value_counts(dropna=False)

m    294
w    237
Name: dog_gender, dtype: int64

In [16]:
pugs.color.value_counts().head()

beige            271
schwarz          127
beige/schwarz     60
braun             17
gestromt           7
Name: color, dtype: int64

In [17]:
pugs.groupby(['gender', 'dog_gender'])[['age']].count().unstack()

| gender | dog_gender: m | dog_gender: w |
|---|---|---|
| m | 74 | 55 |
| w | 220 | 182 |


In [18]:
pd.crosstab(pugs.gender, pugs.dog_gender)

dog_gender,m,w
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
m,74,55
w,220,182
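`pd.crosstab` can also append row and column totals. A brief sketch on toy data (the series `owners` and `pets` are made up for illustration):

```python
import pandas as pd

owners = pd.Series(['m', 'w', 'w', 'w'], name='gender')
pets = pd.Series(['m', 'm', 'w', 'w'], name='dog_gender')

# margins=True appends an 'All' row and column holding the totals.
table = pd.crosstab(owners, pets, margins=True)
print(table)
```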


## 9) Translate the gender columns
Convert all instances of `m` to `M` and `w` to `F` in the pug data.

In [19]:
dogs.head()

Unnamed: 0,id,age,gender,district,quarter,breed1,breed1_hybrid,breed2,breed_type,dog_year,dog_gender,color,year
0,126,51-60,m,9.0,92.0,Welsh Terrier,,,K,2011,w,schwarz/braun,2015
1,574,61-70,w,2.0,23.0,Cairn Terrier,,,K,2002,w,brindle,2015
2,695,41-50,m,6.0,63.0,Labrador Retriever,,,I,2012,w,braun,2015
3,893,61-70,w,7.0,71.0,Mittelschnauzer,,,I,2010,w,schwarz,2015
4,1177,51-60,m,10.0,102.0,Shih Tzu,,,K,2011,m,schwarz/weiss,2015


In [20]:
# There are several ways to do this. Each column below uses one of the two more common ones.
# Note: this is applied to the full `dogs` frame; `w` (weiblich) becomes `F`.
dogs.gender = dogs.gender.map({'m': 'M', 'w': 'F'})
dogs.dog_gender = np.where(dogs.dog_gender == 'm', 'M', 'F')
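The two approaches differ in how they treat values outside the expected codes. The sketch below uses a made-up code `'x'` to show the difference, assuming the task's translation of `m` to `M` and `w` to `F`:

```python
import numpy as np
import pandas as pd

codes = pd.Series(['m', 'w', 'x'])  # 'x' is a made-up unexpected code

# .map() leaves anything missing from the dictionary as NaN.
mapped = codes.map({'m': 'M', 'w': 'F'})

# np.where() sends everything that is not 'm' to the fallback value.
where = pd.Series(np.where(codes == 'm', 'M', 'F'))

print(mapped.tolist())
print(where.tolist())
```

With `.map()`, unexpected codes surface as missing values; with `np.where()`, they are silently folded into the fallback category, so it is worth checking `value_counts()` first.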

## 10) Translate colors
Still using the pug data, translate each dog's color into English with the dictionary provided below. For colors not in the dictionary, use `"other"`.


In [21]:
color_dict = {'beige': 'beige', 
              'schwarz': 'black', 
              'braun': 'brown', 
              'gestromt': 'brindle', 
              'beige/schwarz': 'beige/black',
              'silber': 'silver',
              'beige/weiss': 'beige/white',
              'grau': 'grey',
              'rehbraun': 'fawn brown',
              'sandfarbig': 'buff',
              'brindle': 'brindle',
              'schwarz/weiss': 'black/white',
              'braun gestromt': 'brown brindle',
              'braun/schwarz': 'brown/black',
              'hellbraun': 'light-brown',
              'blondfarben': 'blond',
              'tricolor': 'tricolor',
              'beige/braun': 'beige/brown',
              'apricot': 'apricot', 
              'weiss': 'white',
              'blue/merle': 'blue/merle',
              'creme': 'creme',
              'sable': 'sable'}

In [22]:
pugs = pugs.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
pugs['color_en'] = pugs['color'].map(color_dict).fillna('other')  # unmapped colors become "other"


In [23]:
pugs

Unnamed: 0,id,age,gender,district,quarter,breed1,breed1_hybrid,breed2,breed_type,dog_year,dog_gender,color,year,color_en
35,6321,51-60,w,8.0,81.0,Mops,,,K,2011,m,beige,2015,beige
37,6469,41-50,m,11.0,111.0,Mops,,,K,2011,m,beige,2015,beige
104,65665,61-70,w,8.0,82.0,Mops,,,K,2002,m,beige,2015,beige
105,65665,61-70,w,8.0,82.0,Mops,,,K,2002,m,schwarz,2015,black
123,66524,51-60,w,12.0,123.0,Mops,,Chihuahua,K,2013,m,blue/merle,2015,blue/merle
130,67001,51-60,w,3.0,34.0,Mops,,,K,2013,m,beige,2015,beige
144,70788,21-30,m,7.0,71.0,Mops,,Beagle,K,2014,w,braun,2015,brown
257,80556,81-90,m,3.0,31.0,Mops,,,K,2009,m,gestromt,2015,brindle
319,80881,71-80,w,7.0,72.0,Mops,,,K,2008,w,beige,2015,beige
320,80881,71-80,w,7.0,72.0,Mops,,,K,2006,m,beige,2015,beige


In [24]:
pugs.color_en.value_counts(dropna=False)

beige            271
black            127
beige/black       60
brown             17
brindle           10
brown/black        6
light-brown        4
silver             3
fawn brown         3
brown brindle      3
beige/brown        3
beige/white        3
grey               3
apricot            3
black/white        3
buff               3
sable              2
blue/merle         2
tricolor           2
creme              1
white              1
blond              1
Name: color_en, dtype: int64
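Every pug color happens to appear in the dictionary, so no `"other"` shows up in the counts above. To see how the fallback for unmapped colors could work, here is a small sketch with a made-up color (`'lila'`) and an excerpt of the dictionary:

```python
import pandas as pd

colors = pd.Series(['beige', 'schwarz', 'lila'])  # 'lila' is a made-up unmapped value
small_dict = {'beige': 'beige', 'schwarz': 'black'}  # excerpt of the dictionary

# Unmapped values come back as NaN from .map(); fillna() turns them into "other".
translated = colors.map(small_dict).fillna('other')
print(translated.tolist())
```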

## Grande Finale

This problem is going to involve a few steps. Read carefully. Pick apart each task and solve them one at a time. **This pertains to the full dataset, no longer just pugs.**

- Create a new column, `age_mid`, which approximates the owner's age as the midpoint of the range in the `age` column. For example, `51-60` => `55.5`. Two possible approaches:
    - Build a dictionary as in the last problem. This is brute force: it takes more typing and does not extend to new ranges.
    - Write a function that takes a single string as input and returns the midpoint. This may sound harder, but it involves less work, is cleaner, and is more extensible.
- Create a new column, `dog_age`, which is the age of each dog _at the time of the record_. Use the dog's year of birth and the `year` column you made in part 3 to compute this.
- Take a look at this new `dog_age` variable. Drop or clean up the values that make no sense and are likely the result of data errors.
- Subset to only include pugs (`Mops`), shiba inus (`Shiba Inu`), any dog with "Retriever" in its breed name, and any dog with "Terrier" in its breed name.
    - _Hint:_ Check out the `.str.contains()` method.
- Keep only breeds with more than 100 observations.
- With this data subset, compute the average human and dog age for each breed.

In [25]:
dogs.age.unique()

array(['51-60', '61-70', '41-50', '71-80', '31-40', '81-90', '21-30',
       '91-100', nan, '11-20'], dtype=object)

In [26]:
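def age_map(age_str):
    # Convert a range such as '51-60' to its midpoint; return NaN if it cannot be parsed
    try:
        lower, upper = age_str.split("-")
    except (AttributeError, ValueError):  # NaN (a float) has no .split; malformed strings fail to unpack
        return np.nan

    return (int(lower) + int(upper)) / 2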
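The same midpoint can also be computed without a row-wise function. This is a hedged sketch of a vectorized variant using the pandas string accessor; the names `bounds` and `mids` are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series(['51-60', np.nan, '11-20'])

# Split each range into two columns, cast to float, and average row-wise.
bounds = ages.str.split('-', expand=True).astype(float)
mids = bounds.mean(axis=1)
print(mids.tolist())
```

Missing ages stay missing, because a row with no values has no mean.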

In [27]:
dogs['age_mid'] = dogs.age.apply(age_map)

In [28]:
dogs['age_mid'].value_counts()

55.5    4559
45.5    4472
35.5    3942
65.5    3095
25.5    2337
75.5    1957
85.5     524
15.5     127
95.5      47
Name: age_mid, dtype: int64

In [29]:
dogs.dog_year.unique()

array([2011, 2002, 2012, 2010, 2005, 2004, 2001, 2013, 2014, 2007, 2003,
       1999, 2000, 2009, 1997, 2008, 2006, 2015, 1998, 1995, 1980,    8,
          1, 1994, 1962, 5012, 2016, 1996, 2017])

In [30]:
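# Fix obvious typos: single-digit years (8, 1) become 2008 and 2001, and 5012 becomes 2012
dogs['dog_year_corrected'] = dogs['dog_year'].map(lambda x: x + 2000 if x < 10 else (x - 3000 if x > 3000 else x))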
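The nested conditional in the lambda can be easier to read as a named function. A sketch that mirrors the same rules (the function name `fix_birth_year` is hypothetical, and the rules themselves are a guess about what the typos meant):

```python
def fix_birth_year(year):
    """Repair obvious typos in a birth year (assumed rules, mirroring the lambda above)."""
    if year < 10:       # e.g. 8 -> 2008
        return year + 2000
    if year > 3000:     # e.g. 5012 -> 2012
        return year - 3000
    return year


fixed = [fix_birth_year(y) for y in [8, 1, 5012, 2011]]
print(fixed)
```

Such a function could be passed to `.map()` in place of the lambda.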

In [31]:
dogs['dog_year_corrected'].value_counts()

2012    1874
2013    1730
2010    1710
2014    1698
2009    1663
2011    1639
2008    1520
2007    1474
2006    1322
2005    1251
2004    1179
2015    1132
2003     837
2002     664
2016     511
2001     402
2000     251
1999     117
1998      65
1997      16
1996       3
1994       2
1980       2
1995       1
2017       1
1962       1
Name: dog_year_corrected, dtype: int64

In [32]:
dogs['year'].unique()

array([2015, 2016, 2017])

In [33]:
dogs['dog_age'] = dogs['year'] - dogs['dog_year_corrected']

In [34]:
dogs.dog_age.value_counts()

 3     1780
 4     1756
 5     1749
 2     1682
 6     1653
 7     1644
 8     1548
 1     1463
 9     1459
 10    1373
 11    1264
 12    1113
 13     911
 14     636
 15     414
 16     235
 0      195
 17     113
 18      50
 19      16
 20       4
 21       2
 35       1
 36       1
-1        1
 53       1
 22       1
Name: dog_age, dtype: int64

In [35]:
# keep plausible dog ages only (drops the values -1, 35, 36 and 53)
dogs_sub = dogs.loc[dogs.dog_age.between(0, 22), :]
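`Series.between` includes both endpoints by default, so ages of exactly 0 and 22 are kept. A quick sketch on toy values:

```python
import pandas as pd

ages = pd.Series([-1, 0, 5, 22, 35, 53])

# between() is inclusive on both ends by default.
mask = ages.between(0, 22)
kept = ages[mask]
print(kept.tolist())
```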

In [36]:
# subset on breeds
is_pug = dogs_sub.breed1 == 'Mops'
is_shiba = dogs_sub.breed1 == 'Shiba Inu'
is_retriever = dogs_sub.breed1.str.contains('Retriever')
is_terrier = dogs_sub.breed1.str.contains("Terrier")

dogs_sub = dogs_sub.loc[is_pug | is_shiba | is_retriever | is_terrier, :]
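The four masks could also be collapsed into two: `isin` for the exact names and a single regular expression for the substrings. The sketch below is one possible variant on toy data; it adds `na=False` so that a missing breed counts as a non-match, which the real `breed1` column does not need because it has no missing values.

```python
import pandas as pd

breeds = pd.Series(['Mops', 'Shiba Inu', 'Golden Retriever',
                    'Cairn Terrier', 'Chihuahua', None])

# One regex covers both substrings; na=False treats missing breeds as non-matches.
mask = breeds.isin(['Mops', 'Shiba Inu']) | breeds.str.contains('Retriever|Terrier', na=False)
selected = breeds[mask].tolist()
print(selected)
```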

In [37]:
# get value counts and subset
breed_counts = dogs_sub.breed1.value_counts()
top_dogs = breed_counts[breed_counts > 100].index

In [38]:
good_dogs = dogs_sub.loc[dogs_sub.breed1.isin(top_dogs), :]
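An equivalent way to filter on group size is to attach each row's breed count and compare it with the threshold directly. A minimal sketch on toy data, with a threshold of 1 instead of 100:

```python
import pandas as pd

df = pd.DataFrame({'breed1': ['Mops'] * 3 + ['Shiba Inu'] + ['Terrier'] * 2})

# Look up each row's breed count, then filter on it (threshold of 1 for this toy data).
counts = df['breed1'].map(df['breed1'].value_counts())
common = df[counts > 1]
print(common['breed1'].value_counts().to_dict())
```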

In [39]:
good_dogs.breed1.value_counts()

Labrador Retriever             1324
Jack Russel Terrier             886
Yorkshire Terrier               872
Mops                            530
Golden Retriever                491
West Highland White Terrier     296
Terrier                         179
Cairn Terrier                   154
Tibet Terrier                   138
Flat Coated Retriever           137
Parson Russell Terrier          137
Parson Jack Russell Terrier     116
Name: breed1, dtype: int64

In [40]:
# group by breed and obtain mean ages of owners and dogs
good_dogs.groupby('breed1').agg({'age_mid': 'mean', 'dog_age': 'mean'}).sort_values(by='dog_age')

Unnamed: 0_level_0,age_mid,dog_age
breed1,Unnamed: 1_level_1,Unnamed: 2_level_1
Parson Russell Terrier,47.178832,5.729927
Mops,43.254717,6.103774
Flat Coated Retriever,54.259124,6.291971
Yorkshire Terrier,50.683486,6.938073
Labrador Retriever,51.058912,7.153323
Golden Retriever,52.648676,7.645621
Parson Jack Russell Terrier,48.948276,8.017241
Jack Russel Terrier,48.852144,8.247178
Terrier,50.807263,8.837989
Tibet Terrier,58.905797,9.253623
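When reporting group means, it can help to show the group sizes alongside them. Named aggregation is one way to do that; the sketch below uses toy data, and the output column names (`owner_age_mean`, `dog_age_mean`, `n_dogs`) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'breed1': ['Mops', 'Mops', 'Terrier'],
    'age_mid': [45.5, 35.5, 55.5],
    'dog_age': [4, 6, 9],
})

# Named aggregation: output column = (input column, function).
summary = (
    df.groupby('breed1')
      .agg(owner_age_mean=('age_mid', 'mean'),
           dog_age_mean=('dog_age', 'mean'),
           n_dogs=('dog_age', 'size'))
      .sort_values('dog_age_mean')
)
print(summary)
```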
