This notebook uses the previously prepared dataset of 730,000 anonymous English Wikipedia revisions (May 4 - June 4) and creates the dataframes used in the [Local Vandals](localvandals.herokuapp.com) web app.

In [240]:
import pandas as pd
import numpy as np
import json

In [2]:
changes = pd.read_csv('changes.csv').drop(['Unnamed: 0', 'latlng'], axis=1)

In [114]:
# Article titles use underscores in place of spaces; restore the spaces.
changes['rc_title'] = changes['rc_title'].map(lambda x: str(x).replace('_', ' '))

The three functions below do most of the work. Each takes a dataframe, a geographic unit (the name of the column to group by, e.g. 'country'), and a cutoff: geographic units with fewer revisions than this minimum are dropped. The top-articles and top-vandalized functions also take the number of top articles to return per unit.
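
As a quick illustration of the cutoff step shared by all three functions, here is a minimal sketch on a made-up table. The `toy` dataframe and its values are invented for the example; only the `value_counts` / `isin` pattern mirrors the functions below.

```python
import pandas as pd

# Toy stand-in for the revisions table; values are made up for illustration.
toy = pd.DataFrame({
    "country": ["France", "France", "France", "Peru", "Peru", "Chad"],
    "damage_prob": [0.10, 0.30, 0.20, 0.80, 0.60, 0.95],
})

minimum = 2  # keep only units with at least this many revisions

# Count revisions per unit, then keep the rows whose unit meets the cutoff.
counts = toy["country"].value_counts()
kept_units = counts[counts >= minimum].index
selection = toy[toy["country"].isin(kept_units)]

print(sorted(selection["country"].unique()))  # ['France', 'Peru']
```

Chad has a single revision, so it falls below the cutoff and its row is dropped before any statistics are computed.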

In [362]:
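def get_damage_pct_diff(df, unit, minimum):
    """Return each unit's mean damage probability as a rounded percent difference from the overall mean."""
    # Keep only units with at least `minimum` revisions.
    counts = df[unit].value_counts()
    selection = df[df[unit].isin(counts[counts >= minimum].index)]
    # Overall mean over the retained revisions, then the mean per unit.
    mean_damage = selection['damage_prob'].mean()
    unit_damage_df = selection.groupby(unit, as_index=False)['damage_prob'].mean()
    # Note: the difference is divided by the unit's own mean, not the overall mean.
    pct_diff = round(
        ((unit_damage_df['damage_prob'] - mean_damage)/unit_damage_df['damage_prob']) * 100)
    unit_damage_df['pct_diff'] = pct_diff
    # Index by unit and keep only the pct_diff column.
    unit_damage_df = unit_damage_df.set_index(unit).drop('damage_prob', axis=1)
    return unit_damage_df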
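
To make the arithmetic concrete, here is a small worked example on invented numbers. It applies the same formula as `get_damage_pct_diff` and, for comparison, a variant that divides by the overall mean. The variant is only an assumption about what a reader might expect, not something the notebook computes.

```python
import pandas as pd

# Made-up damage probabilities for two units.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "damage_prob": [0.10, 0.30, 0.50, 0.70],
})

overall_mean = toy["damage_prob"].mean()                  # about 0.4
unit_mean = toy.groupby("country")["damage_prob"].mean()  # A: about 0.2, B: about 0.6

# Same formula as get_damage_pct_diff: the difference is divided by the
# unit's own mean.
pct_diff = ((unit_mean - overall_mean) / unit_mean * 100).round()

# Hypothetical alternative: express the difference relative to the overall mean.
pct_diff_vs_mean = ((unit_mean - overall_mean) / overall_mean * 100).round()

print(pct_diff.to_dict())
print(pct_diff_vs_mean.to_dict())
```

The two versions agree in sign but not in size, and the first is undefined for a unit whose mean damage probability is exactly zero.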

In [333]:
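def get_top_articles(df, unit, minimum, n_articles=5):
    """Return a dataframe of the n_articles most-edited titles for each unit."""
    # Keep only units with at least `minimum` revisions.
    counts = df[unit].value_counts()
    selection = df[df[unit].isin(counts[counts >= minimum].index)]
    # Count edits per (unit, title), then keep the largest counts in each unit.
    top_articles = selection.groupby(unit)['rc_title'].value_counts()
    top_n = top_articles.groupby(level=[0]).nlargest(n_articles)
    # nlargest prepends the group key as an extra index level; drop it.
    top_n.index = top_n.index.droplevel(0)
    # Keep only the (unit, title) pairs as columns; the counts are discarded.
    top_n = top_n.index.to_frame().reset_index(drop=True)
    return top_n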

In [334]:
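def get_top_vandalized(df, unit, minimum, n_articles=5):
    """Return a dataframe of the n_articles most-vandalized titles for each unit."""
    # The cutoff is applied to all revisions, before filtering for damage.
    counts = df[unit].value_counts()
    selection = df[df[unit].isin(counts[counts >= minimum].index)]
    # Treat a revision as vandalism when its damage probability exceeds 0.9.
    vandalized = selection[selection['damage_prob'] > 0.9]
    # Same top-n logic as get_top_articles, on the vandalized revisions only.
    top_articles = vandalized.groupby(unit)['rc_title'].value_counts()
    top_n = top_articles.groupby(level=[0]).nlargest(n_articles)
    top_n.index = top_n.index.droplevel(0)
    top_n = top_n.index.to_frame().reset_index(drop=True)
    return top_n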
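
For intuition about the "top n per unit" step, here is a minimal sketch on invented data. It uses `groupby(...).head(n)` rather than the `nlargest` / `droplevel` combination above, and it keeps the edit counts in a hypothetical `n_edits` column, so it is a simplified variant and not the functions' exact code.

```python
import pandas as pd

# Made-up revisions: one row per edit.
toy = pd.DataFrame({
    "country": ["France"] * 6 + ["Peru"] * 3,
    "rc_title": ["Paris", "Paris", "Paris", "Lyon", "Lyon", "Nice",
                 "Lima", "Lima", "Cusco"],
})

n_articles = 2

# value_counts() within each group is sorted from most to least edited,
# so the first n rows of each group are that group's top n titles.
per_unit_counts = toy.groupby("country")["rc_title"].value_counts()
top_n = (per_unit_counts
         .groupby(level=0).head(n_articles)
         .reset_index(name="n_edits"))

print(top_n)
```

Ties between equally edited titles are broken by whatever order `value_counts` happens to return, so the last slot in a unit's top n can be arbitrary.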

In [359]:
country_damage = get_damage_pct_diff(changes, 'country', 100)

In [213]:
country_iso_damage = get_damage_pct_diff(changes, 'country_iso', 100)

In [232]:
country_iso_damage.to_pickle('country_iso_damage.pkl')

In [118]:
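# Take an explicit copy so that adding columns below does not trigger SettingWithCopyWarning.
changes_us = changes.query('country_iso == "US"').copy()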

Some zip codes got mangled in the pandas import: values read as numbers lose their leading zeros. Let's convert them back to five-digit strings.
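
Here is a minimal reproduction of the problem on an invented two-row CSV. It also shows one way to avoid it at import time by reading the column as text; that is offered only as an alternative sketch, not what this notebook does.

```python
import io
import pandas as pd

# Made-up CSV with a zip code that starts with a zero.
csv_text = "postcode,city\n02139,Cambridge\n10001,New York\n"

# Default type inference parses the column as integers: "02139" becomes 2139.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps the leading zero.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"postcode": str})

# Repairing already-parsed values: pad back to five digits.
repaired = inferred["postcode"].astype(str).str.zfill(5)

print(inferred["postcode"].tolist())  # [2139, 10001]
print(as_text["postcode"].tolist())   # ['02139', '10001']
print(repaired.tolist())              # ['02139', '10001']
```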

In [188]:
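def regularize_zip(zipcode):
    """Return the zip code as a zero-padded 5-digit string, or None if it is not numeric."""
    try:
        # int() strips a trailing '.0' from floats; NaN raises ValueError.
        code = str(int(zipcode))
    except (TypeError, ValueError):
        return None
    # Restore leading zeros dropped when the value was parsed as a number.
    code = code.zfill(5)
    return code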

In [189]:
changes_us['postcode'] = changes_us['postcode'].map(regularize_zip)

In [363]:
state_damage = get_damage_pct_diff(changes_us, 'state', 0)

We can't use 'city' by itself: e.g., Springfield, IL and Springfield, MA are two different places.
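
A tiny made-up example shows the difference. The state abbreviations here are only for illustration; the format of the real `state` column is not shown in this notebook.

```python
import pandas as pd

# Invented rows: two different Springfields and one Boston.
toy = pd.DataFrame({
    "city": ["Springfield", "Springfield", "Boston"],
    "state": ["IL", "MA", "MA"],
})

# Combine city and state into a single grouping key.
toy["citystate"] = toy["city"] + ", " + toy["state"]

n_by_city = toy["city"].nunique()            # the two Springfields are merged
n_by_citystate = toy["citystate"].nunique()  # each place is distinct

print(n_by_city, n_by_citystate)  # 2 3
```

One thing to keep in mind: if either `city` or `state` is missing for a row, the concatenated key is missing too, and that row is then silently left out of `value_counts` and `groupby`.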

In [352]:
changes_us['citystate'] = changes_us['city'] + ", " + changes_us['state']

In [364]:
city_damage = get_damage_pct_diff(changes_us, 'citystate', 10)
len(city_damage)

2931

In [365]:
zip_damage = get_damage_pct_diff(changes_us, 'postcode', 10)
len(zip_damage)

5205

In [369]:
geog_damage = pd.concat([country_damage, state_damage, city_damage, zip_damage])
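
One caveat with stacking frames whose indexes come from different geographic levels: the same label can appear at two levels. Whether that happens in this dataset is not known; the sketch below uses invented values and assumes, purely for illustration, a country and a state that are both called "Georgia". Passing `keys` to `pd.concat` is one possible way to keep such rows apart.

```python
import pandas as pd

# Invented frames standing in for country_damage and state_damage.
country_toy = pd.DataFrame({"pct_diff": [12.0]}, index=["Georgia"])
state_toy = pd.DataFrame({"pct_diff": [-5.0]}, index=["Georgia"])

# Plain concatenation keeps both rows under the same label.
plain = pd.concat([country_toy, state_toy])
n_matches = len(plain.loc[["Georgia"]])

# With keys, each row is addressed by (level, name).
keyed = pd.concat([country_toy, state_toy], keys=["country", "state"])
state_value = keyed.loc[("state", "Georgia"), "pct_diff"]

print(n_matches, state_value)  # 2 -5.0
```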

In [375]:
country_top_articles = country_top_articles.set_index('country')
state_top_articles = state_top_articles.set_index('state')
city_top_articles = city_top_articles.set_index('citystate')
zip_top_articles = zip_top_articles.set_index('postcode')

In [377]:
geog_top_articles = pd.concat([country_top_articles, state_top_articles,
                               city_top_articles, zip_top_articles])

In [378]:
country_top_vandalized = country_top_vandalized.set_index('country')
state_top_vandalized = state_top_vandalized.set_index('state')
city_top_vandalized = city_top_vandalized.set_index('citystate')
zip_top_vandalized = zip_top_vandalized.set_index('postcode')

In [379]:
geog_top_vandalized = pd.concat([country_top_vandalized, state_top_vandalized,
                                city_top_vandalized, zip_top_vandalized])

In [380]:
geog_damage.to_pickle('geog_damage.pkl')
geog_top_articles.to_pickle('geog_top_articles.pkl')
geog_top_vandalized.to_pickle('geog_top_vandalized.pkl')
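
For completeness, a minimal round-trip sketch: write a frame with `to_pickle`, read it back with `pd.read_pickle`, and look up one geographic unit. The frame contents and the temporary path are invented, and how the web app actually loads these files is an assumption.

```python
import os
import tempfile
import pandas as pd

# Invented stand-in for geog_damage: one country and one zip code.
toy_damage = pd.DataFrame({"pct_diff": [12.0, -5.0]},
                          index=["France", "02139"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "geog_damage.pkl")
    toy_damage.to_pickle(path)      # same call as in the cell above
    loaded = pd.read_pickle(path)   # what a consumer of the file might do

# Look up a single unit by its index label.
value = loaded.loc["02139", "pct_diff"]
print(value)  # -5.0
```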