# Uncommon names

Here we want to answer: which names are unusually frequent in a district compared to the Berlin average?

For each name, we compare:

freq_in_district / mean( freq_in_all_districts )

The result tells us how many times more often a name is chosen in a district than in Berlin on average. We then take the top names from that list.
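
As a small worked example of the ratio above, the sketch below uses a made-up table with two districts and two names. The district labels `A` and `B`, the names, and all counts are hypothetical and are not taken from the Berlin data; only the column names follow the dataset used here.

```python
import pandas as pd

# Hypothetical toy counts, not taken from the Berlin dataset.
toy = pd.DataFrame({
    'bezirk':  ['A', 'A', 'B', 'B'],
    'vorname': ['Anna', 'Ben', 'Anna', 'Ben'],
    'anzahl':  [30, 70, 10, 90],
})

# Frequency of each name within its own district.
toy['frequency'] = toy['anzahl'] / toy.groupby('bezirk')['anzahl'].transform('sum')
# Mean of the per-district frequencies of each name.
toy['mean_freq'] = toy.groupby('vorname')['frequency'].transform('mean')
# Ratio > 1: the name is more common in this district than on average.
toy['deviation'] = toy['frequency'] / toy['mean_freq']

print(toy)
```

In this toy table, Anna has a frequency of 0.3 in district A and 0.1 in district B, so her mean frequency is 0.2 and her deviation in district A is 1.5.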

In [1]:
import pandas as pd
import numpy as np

## Load data

In [2]:
def load_data():
    """Load and return the name data as a DataFrame."""
    data_path = '../data/'
    df = pd.read_csv(data_path + 'processed/namedata.csv', sep=',')
    return df

df = load_data()
df.head()

Unnamed: 0,anzahl,bezirk,frequency,geschlecht,vorname,year
0,122,Charlottenburg-Wilmersdorf,0.013676,w,Marie,2012
1,105,Charlottenburg-Wilmersdorf,0.01177,w,Sophie,2012
2,78,Charlottenburg-Wilmersdorf,0.008743,w,Charlotte,2012
3,69,Charlottenburg-Wilmersdorf,0.007735,w,Maria,2012
4,66,Charlottenburg-Wilmersdorf,0.007398,m,Paul,2012


## Compute deviation

In [3]:
year = 2017

In [4]:
def get_deviation(df, year):
    # Filter by year
    dfyear = df[df['year'] == year]
    # Count up all names (sum over gender); only the 'anzahl' column is needed
    name_counts = dfyear.groupby(by=['bezirk', 'vorname'])[['anzahl']].sum()
    # Find total count in bezirk for all names
    bezirk_counts = name_counts.groupby(by=['bezirk']).sum()
    bezirk_counts = bezirk_counts.rename(columns={'anzahl': 'anzahl_bezirk'})
    # Merge name counts and bezirk counts
    compare = pd.merge(bezirk_counts, name_counts, left_index=True, right_index=True).reset_index()
    # Calculate the frequency of each name
    compare['frequency'] = compare['anzahl'] / compare['anzahl_bezirk']
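    # Calculate the mean frequency of each name over the districts in which it occurs, then merge.
    # Selecting the column before .mean() avoids averaging non-numeric columns such as 'bezirk'.
    mean_freqs = compare.groupby(by='vorname')['frequency'].mean()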
    mean_freqs.name = 'mean_freq'
    compare = pd.merge(compare, mean_freqs.reset_index(), left_on='vorname', right_on='vorname')
    # Calculate deviation from mean for each name
    compare['deviation'] = compare['frequency'] / compare['mean_freq']
    compare = compare[['bezirk', 'vorname', 'anzahl', 'frequency', 'mean_freq', 'deviation']]
    return compare

deviation = get_deviation(df, year)
deviation.head()

Unnamed: 0,bezirk,vorname,anzahl,frequency,mean_freq,deviation
0,Charlottenburg-Wilmersdorf,Aaliyah,1,0.000105,0.000321,0.327615
1,Friedrichshain-Kreuzberg,Aaliyah,2,0.000235,0.000321,0.733231
2,Mitte,Aaliyah,2,0.000235,0.000321,0.732542
3,Reinickendorf,Aaliyah,1,0.000568,0.000321,1.770796
4,Spandau,Aaliyah,1,0.000168,0.000321,0.522656


We only keep uncommon names whose count is above a threshold, because we don't want names that are too exotic. Here we check how high the threshold can be set while still leaving enough names for the graph (around 10). Reinickendorf has very low counts, which is why we set a separate threshold for it.

In [5]:
for bezirk in deviation['bezirk'].unique():
    print(bezirk, len(deviation[(deviation['anzahl'] > 10) & (deviation['bezirk'] == bezirk)]))

Charlottenburg-Wilmersdorf 162
Friedrichshain-Kreuzberg 128
Mitte 126
Reinickendorf 4
Spandau 98
Tempelhof-Schöneberg 175
Treptow-Köpenick 22
Lichtenberg 71
Neukölln 56
Pankow 158
Steglitz-Zehlendorf 12
Marzahn-Hellersdorf 18
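
To compare several candidate thresholds at once instead of one at a time, a sketch like the following could be used. The helper `names_above` and the toy frame are hypothetical; the toy frame only stands in for the `deviation` dataframe and reuses its `bezirk` and `anzahl` columns.

```python
import pandas as pd

# Hypothetical stand-in for the `deviation` dataframe.
toy = pd.DataFrame({
    'bezirk':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'vorname': ['Anna', 'Ben', 'Cem', 'Anna', 'Ben', 'Cem'],
    'anzahl':  [12, 9, 25, 8, 11, 3],
})

def names_above(frame, thresholds):
    """Count, per district, the names whose count exceeds each threshold."""
    counts = pd.DataFrame({
        t: frame[frame['anzahl'] > t].groupby('bezirk').size()
        for t in thresholds
    })
    # Districts with no name above a threshold come out as NaN; show them as 0.
    return counts.fillna(0).astype(int)

counts = names_above(toy, [5, 10, 20])
print(counts)
```

Each column of the result corresponds to one threshold, so a district that drops below the roughly 10 names needed for the graph is easy to spot.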


Now we get the top 10 uncommon names for each district and save them.

In [6]:
dfs = []
for bezirk in deviation['bezirk'].unique():
    min_anzahl = 10 if bezirk != 'Reinickendorf' else 7
    mask = (deviation['anzahl'] > min_anzahl) & (deviation['bezirk'] == bezirk)
    df_dev = deviation[mask].sort_values(by=['deviation'], ascending=False).head(10)
    dfs.append(df_dev)
uncommon = pd.concat(dfs)
uncommon.head()

Unnamed: 0,bezirk,vorname,anzahl,frequency,mean_freq,deviation
2313,Charlottenburg-Wilmersdorf,Caspar,21,0.002208,0.000819,2.69551
14109,Charlottenburg-Wilmersdorf,William,14,0.001472,0.000645,2.282402
5585,Charlottenburg-Wilmersdorf,Hugo,15,0.001577,0.000717,2.200057
2232,Charlottenburg-Wilmersdorf,Carla,12,0.001261,0.000593,2.125977
6886,Charlottenburg-Wilmersdorf,Julius,32,0.003364,0.001591,2.113881
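
The per-district selection can also be written without an explicit loop. The sketch below is one possible alternative, not the approach used in this notebook: it maps each district to its threshold and then takes the first rows of each group after sorting. The toy data, the `special_min` dictionary, and the top-2 cutoff are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical stand-in for the `deviation` dataframe.
toy = pd.DataFrame({
    'bezirk':    ['A', 'A', 'A', 'B', 'B', 'B'],
    'vorname':   ['Anna', 'Ben', 'Cem', 'Anna', 'Ben', 'Cem'],
    'anzahl':    [10, 15, 30, 8, 9, 20],
    'deviation': [1.2, 0.9, 2.0, 0.8, 1.1, 1.5],
})

default_min = 10
special_min = {'B': 7}  # hypothetical lower threshold for a sparse district

# Threshold per row: the special value where defined, the default otherwise.
min_anzahl = toy['bezirk'].map(special_min).fillna(default_min)

top = (toy[toy['anzahl'] > min_anzahl]
       .sort_values('deviation', ascending=False)
       .groupby('bezirk', sort=False)
       .head(2))  # top 2 per district in this toy example

print(top)
```

Unlike the loop, the rows of `top` are ordered by deviation across all districts rather than grouped district by district, so a final sort on `bezirk` may be needed if the grouping matters.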


### Save result

In [7]:
def save_df(uncommon, year):
    """Save the uncommon names to CSV, with the deviation as an integer percentage."""
    # Use a local name that does not shadow the function itself
    out = uncommon.rename(columns={'deviation': 'freq_dev'})
    # Convert the ratio to a truncated integer percentage, e.g. 2.69551 -> 269
    out['freq_dev'] = out['freq_dev'].apply(lambda x: int(x*100))
#     out['bezirk'] = out['bezirk'].apply(lambda x: x.replace('oe','ö').title())
#     out = out[['bezirk', 'geschlecht', 'vorname', 'anzahl', 'freq_dev']]

    out = out[['bezirk', 'vorname', 'anzahl', 'freq_dev']]
    out.to_csv('../data/processed/beliebte_namen_'+str(year)+'.csv', sep=',', index=False)
    
save_df(uncommon, year)