In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import functools
%matplotlib inline

# Untagged data

These data were written out by the clean_content.py script: they are the rows where the taxons column was empty.
- Here we assume the taxons column was empty because the content item has not been tagged.

In [2]:
#read in untagged content to describe content with no taxons
untagged = pd.read_csv('../../data/untagged_content.csv')

In [3]:
print("There are {} rows in the untagged content data".
      format(untagged.shape[0]))
print("There are {} unique content items in the untagged content data".
      format(untagged.content_id.nunique()))

There are 57337 rows in the untagged content data
There are 57123 unique content items in the untagged content data
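
The two counts above differ by 214 (57337 rows against 57123 unique content_id values), so some content items occupy more than one row. A minimal sketch of how the repeated items could be pulled out for inspection, using an invented toy frame (`toy_untagged` and its values are assumptions, not the real data):

```python
import pandas as pd

# Toy stand-in for the untagged data; the values are invented for illustration.
toy_untagged = pd.DataFrame({
    "content_id": ["a", "a", "b", "c"],
    "title": ["first copy", "second copy", "other item", "another item"],
})

# Number of rows beyond one per content item
n_extra_rows = toy_untagged.shape[0] - toy_untagged.content_id.nunique()

# Every row whose content_id occurs more than once
repeated = toy_untagged[toy_untagged.duplicated("content_id", keep=False)]

print("Extra rows:", n_extra_rows)
print(repeated)
```

Running the same two expressions on `untagged` would show which columns differ between the repeated rows.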


In [4]:
untagged.columns

Index(['first_published_at', 'base_path', 'content_id', 'description',
       'details', 'document_type', 'first_published_at.1', 'locale',
       'primary_publishing_organisation', 'publishing_app', 'taxons', 'title',
       'document_type_gp', 'body', 'combined_text'],
      dtype='object')

# Taxon data

The taxons data has one row per taxon, with columns for the taxon_id or taxon title at each level of the hierarchy. For example, a taxon that sits at level1 has missing values for level2 and all deeper levels. For a taxon at level3, the level2 and level1 columns have been filled in recursively from its ancestors.

Each taxon in taxons is identified by its content_id.
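
To illustrate the level columns, here is a small sketch of filling each level by walking up from a taxon to the top of the hierarchy. The parent mapping, the taxon names and the helper `path_from_root` are all hypothetical; this is not necessarily how clean_taxons.py does it.

```python
import pandas as pd

# Hypothetical toy taxonomy: taxon id -> parent id (None marks a top-level taxon).
parent_of = {"t1": None, "t2": "t1", "t3": "t2"}
name_of = {"t1": "Education", "t2": "Schools", "t3": "Academies"}


def path_from_root(taxon_id):
    """Return the taxon names from level 1 down to the given taxon."""
    path = []
    while taxon_id is not None:
        path.append(name_of[taxon_id])
        taxon_id = parent_of[taxon_id]
    return path[::-1]


max_depth = 3
rows = []
for taxon_id in parent_of:
    path = path_from_root(taxon_id)
    # Levels deeper than the taxon itself stay missing
    padded = path + [None] * (max_depth - len(path))
    rows.append([taxon_id] + padded)

levels = pd.DataFrame(
    rows, columns=["content_id", "level1taxon", "level2taxon", "level3taxon"]
)
print(levels)
```

In this toy frame the level3 taxon has all three level columns filled, while the level1 taxon has missing values for level2taxon and level3taxon.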

In [5]:
#read in taxon file which was cleaned from raw using clean_taxons.py
taxons = pd.read_csv('../../data/clean_taxons.csv')

In [6]:
taxons.columns

Index(['Unnamed: 0', 'base_path', 'content_id', 'taxon_name', 'level1',
       'level2tax_id', 'level3tax_id', 'level4tax_id', 'level1taxon',
       'level2taxon', 'level3taxon', 'level4taxon', 'level5taxon'],
      dtype='object')

# Content data

These data were created in clean_content.py so that each row represents a single content-taxon pair. There can be multiple rows for a content item (content_id) if it has been tagged to multiple taxons (taxon_id).  
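
The `variable` and `taxon_id` columns in the content data are what `pd.melt` produces by default, so one plausible way to reach one row per content-taxon pair is sketched below. This is an assumption about clean_content.py, not a copy of it, and the wide column names `taxon1` and `taxon2` are invented.

```python
import pandas as pd

# Hypothetical wide layout: one row per content item, one column per tagged taxon.
toy_wide = pd.DataFrame({
    "content_id": ["c1", "c2"],
    "taxon1": ["t1", "t2"],
    "taxon2": ["t3", None],
})

# Reshape to long form: one row per content item / taxon pair
toy_long = (
    toy_wide
    .melt(id_vars="content_id", value_name="taxon_id")
    .dropna(subset=["taxon_id"])  # drop the slots where no taxon was present
    .reset_index(drop=True)
)

print(toy_long)
print("rows:", toy_long.shape[0], "unique items:", toy_long.content_id.nunique())
```

As with the real data, the long frame has more rows than unique content_id values because c1 is tagged to two taxons.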

In [7]:
#read in content items file which was cleaned from raw using clean_content.py

content = pd.read_csv('../../data/clean_content.csv.gz', compression='gzip')

In [8]:
content.columns

Index(['Unnamed: 0', 'base_path', 'content_id', 'description', 'details',
       'document_type', 'first_published_at', 'locale',
       'primary_publishing_organisation', 'publishing_app', 'title', 'body',
       'combined_text', 'variable', 'taxon_id'],
      dtype='object')

In [9]:
content.shape

(335615, 15)

In [10]:
content.content_id.nunique()

140103

## Labelled data

In [11]:
labelled = pd.read_csv('../../data/labelled.csv')

In [22]:
labelled.shape

(336967, 24)

In [12]:
labelled.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'base_path', 'content_id', 'description',
       'details', 'document_type', 'first_published_at', 'locale',
       'primary_publishing_organisation', 'publishing_app', 'title', 'body',
       'combined_text', 'variable', 'taxon_id', 'base_path_y', 'content_id_y',
       'taxon_name', 'level1taxon', 'level2taxon', 'level3taxon',
       'level4taxon', '_merge'],
      dtype='object')

In [13]:
print(labelled['_merge'].value_counts())

both    232149
Name: _merge, dtype: int64
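
The `_merge` column, together with the `content_id_y` and `base_path_y` names, is consistent with a pandas merge that used `indicator=True`. A sketch of such a join on toy frames follows; the frames, the join keys and the `how="left"` choice are assumptions for illustration, not the notebook's actual merge.

```python
import pandas as pd

# Toy content/taxon pairs; "t9" deliberately has no entry in the taxon table.
content_toy = pd.DataFrame({
    "content_id": ["c1", "c2", "c3"],
    "taxon_id": ["t1", "t2", "t9"],
})
taxons_toy = pd.DataFrame({
    "content_id": ["t1", "t2"],
    "taxon_name": ["Education", "Transport"],
})

merged = content_toy.merge(
    taxons_toy,
    how="left",
    left_on="taxon_id",
    right_on="content_id",
    suffixes=("", "_y"),  # the taxon's own id becomes content_id_y
    indicator=True,       # adds the _merge column
)

print(merged["_merge"].value_counts())
```

Rows marked `both` found a matching taxon; rows marked `left_only` carry a taxon_id that is absent from the taxon table.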


## Empty taxons

In [24]:
empty_taxons = pd.read_csv('../../data/empty_tags.csv')

In [25]:
empty_taxons.shape

(1352, 24)

In [26]:
empty_taxons.content_id_y.nunique()

1352

In [27]:
empty_taxons_not_world = pd.read_csv('../../data/empty_tags_not_world.csv')

In [28]:
empty_taxons_not_world.shape

(105, 8)

In [29]:
empty_taxons_not_world.content_id_y.nunique()

105

## Filtered
#### by taxon to exclude specific taxons from prediction activities

Current approach: Take out World and Corporate top taxons   
Note that the data we predict on needs to come from the same population as the training data, and it is hard to filter the unlabelled data to remove World and Corporate (unless they are perfectly predicted by a metadata variable such as document type). It may be safer to keep them in the training data, predict on all data, and act differently when World or Corporate is predicted.
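
The two options discussed above can be sketched as follows. The level1taxon labels "World" and "Corporate" are placeholders (the exact strings in the data may differ), and the toy frames and predictions are invented.

```python
import pandas as pd

# Placeholder names for the top-level taxons to exclude
excluded = {"World", "Corporate"}

labelled_toy = pd.DataFrame({
    "content_id": ["c1", "c2", "c3", "c4"],
    "level1taxon": ["Education", "World", "Corporate", "Transport"],
})

# Option 1 (current approach): drop the excluded branches before training
filtered_toy = labelled_toy[~labelled_toy.level1taxon.isin(excluded)]

# Option 2: keep every row for training, then flag predictions that land
# in an excluded branch so they can be handled differently downstream
predicted = pd.Series(["Education", "World", "Transport"], name="predicted_level1")
needs_special_handling = predicted.isin(excluded)

print(filtered_toy)
print(needs_special_handling)
```

Option 2 avoids having to filter the unlabelled data, at the cost of training on classes that will never be surfaced as suggestions.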

In [30]:
filtered = pd.read_csv('../../data/filtered.csv')

In [31]:
filtered.shape

(229084, 20)

In [33]:
filtered.content_id.nunique()

127320

In [34]:
filtered.columns

Index(['Unnamed: 0', 'base_path', 'content_id', 'description', 'details',
       'document_type', 'first_published_at', 'locale',
       'primary_publishing_organisation', 'publishing_app', 'title', 'body',
       'combined_text', 'taxon_id', 'taxon_name', 'level1taxon', 'level2taxon',
       'level3taxon', 'level4taxon', '_merge'],
      dtype='object')

### Old tags

Need to add this to untagged data...

In [37]:
old_tags = pd.read_csv('../../data/old_tags.csv')

In [38]:
old_tags.columns

Index(['Unnamed: 0', 'base_path_x', 'content_id_x', 'document_type',
       'first_published_at', 'locale', 'primary_publishing_organisation',
       'publishing_app', 'title', 'taxon_id'],
      dtype='object')

In [39]:
print("There are {} taxons represented in the {} content item/taxon combinations which have no corresponding taxon in the taxon data"
      .format(old_tags.taxon_id.nunique(), old_tags.shape[0]))

There are 2010 taxons represented in the 106154 content item/taxon combinations which have no corresponding taxon in the taxon data


In [40]:
print("There are {} content item/taxon combinations with missing taxon because these were removed during clean_taxons.py"
      .format(old_tags[old_tags.taxon_id.isnull()].shape[0]))

There are 0 content item/taxon combinations with missing taxon because these were removed during clean_taxons.py


Developers spot-checked these. Some of the taxons were not part of the topic taxonomy, so they had no match in the topic taxonomy file; others are in the World branch of the taxonomy.
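
For reference, a minimal sketch of how unmatched content item/taxon combinations like these could be isolated with an anti-join. The toy frames are invented, and this is not necessarily how old_tags.csv was produced.

```python
import pandas as pd

# Toy content/taxon pairs; "t8" and "t9" are absent from the taxon table.
pairs_toy = pd.DataFrame({
    "content_id": ["c1", "c2", "c3", "c4"],
    "taxon_id": ["t1", "t9", "t9", "t8"],
})
known_taxons_toy = pd.DataFrame({"content_id": ["t1", "t2"]})

# Anti-join: keep pairs whose taxon_id is not a known taxon
unmatched = pairs_toy[~pairs_toy.taxon_id.isin(known_taxons_toy.content_id)]

print("There are {} taxons represented in the {} unmatched combinations"
      .format(unmatched.taxon_id.nunique(), unmatched.shape[0]))
```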