# Descriptive statistics for published reviews (metadata)

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# read in file

filename = '../../../../../../data/raw/20190618_published_reviews.csv'

reviews = pd.read_csv(filename)

reviews.drop([col for col in reviews.columns if 'Unnamed' in col], axis=1, inplace=True)

reviews.head()


## Assessing Duplicates

There appear to be more unique CD numbers than unique titles, i.e. some titles occur with more than one CD number.

In [None]:
# nunique() counts distinct values directly (NaN is excluded by default)
print("Unique CD numbers: ")
print(reviews['CD Number'].nunique())

print("Unique titles: ")
print(reviews['Review Title'].nunique())

In [None]:
# define duplicate columns
reviews['duplicate_CD_title'] = reviews.duplicated(subset=['CD Number', 'Review Title'], keep=False)
reviews['duplicate_CD'] = reviews.duplicated(subset=['CD Number'], keep=False)
reviews['duplicate_title'] = reviews.duplicated(subset=['Review Title'], keep=False)
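
To make the effect of `keep=False` concrete, here is a minimal sketch on a toy frame. The rows are invented for illustration and are not taken from the reviews file; only the column names match the notebook.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the reviews file
toy = pd.DataFrame({
    "CD Number": ["CD001", "CD001", "CD002", "CD003"],
    "Review Title": ["Title A", "Title A", "Title B", "Title B"],
})

# keep=False flags every member of a duplicate set, not only the later repeats
dup_pair = toy.duplicated(subset=["CD Number", "Review Title"], keep=False)
dup_title = toy.duplicated(subset=["Review Title"], keep=False)

print(dup_pair.tolist())
print(dup_title.tolist())
```

With the default `keep='first'`, the first row of each duplicate set would stay unflagged, which makes it harder to inspect the full set side by side.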

The combination of a CD number and title occurs more than once for 7 rows.

In [None]:
reviews[reviews['duplicate_CD_title']]

CD numbers occur more than once in 7 rows, the same rows as above.

In [None]:
reviews[reviews['duplicate_CD']]

Multiple titles occur more than once, sometimes with different CD numbers.

In [None]:
reviews[reviews['duplicate_title']]
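
One way to isolate the titles that map to several CD numbers is to count distinct CD numbers per title. The sketch below uses an invented toy frame rather than the real data; against the notebook's data, `toy` would be replaced by `reviews`.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the reviews file
toy = pd.DataFrame({
    "CD Number": ["CD001", "CD001", "CD002", "CD003"],
    "Review Title": ["Title A", "Title A", "Title B", "Title B"],
})

# number of distinct CD numbers attached to each title
cds_per_title = toy.groupby("Review Title")["CD Number"].nunique()

# keep only the titles that appear under more than one CD number
shared_titles = cds_per_title[cds_per_title > 1]
print(shared_titles)
```

Here "Title A" is a plain repeated row (one CD number), while "Title B" is the case of interest: one title under two different CD numbers.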

## Descriptive statistics

### Reviews per Review Group

We first assess the number of reviews that exist for each Review Group. On average, a Review Group has produced 141 reviews, but the standard deviation is large, indicating that the number varies considerably across groups.

In [None]:
# descriptive stats for reviews per group
print(reviews["Group"].value_counts().describe())
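
To put the spread on a scale relative to the mean, one option is the coefficient of variation (standard deviation divided by mean). This is a sketch on invented toy data, not a result from the reviews file; the group labels are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration: group "A" has 5 reviews, group "B" has 1
toy = pd.DataFrame({"Group": ["A"] * 5 + ["B"]})

# reviews per group, as in the cell above
counts = toy["Group"].value_counts()

# coefficient of variation; Series.std() uses the sample formula (ddof=1),
# which matches the std reported by describe()
cv = counts.std() / counts.mean()
print(round(cv, 3))
```

A value close to or above 1 means the standard deviation is of the same order as the mean, i.e. group sizes differ widely.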

In [None]:
reviews["Group"].value_counts().iloc[:10][::-1].plot(kind="barh", title="Number of reviews per group - top 10")

### Review status: Publication Flag

It seems that a review can have one of six publication flags. What does each of these mean?

In [None]:
reviews["Publication Flag"].value_counts().iloc[::-1]#.plot(kind="barh", title="Status of reviews count", color="#34495E")

The plot below shows the distribution (in %) over Publication Flags by Review Group. Most groups seem to have a similar proportion of reviews under each flag, although there are some outliers.

In [None]:
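# count reviews per (Group, Publication Flag) pair
status_by_group = reviews.groupby(["Group", "Publication Flag"]).agg({'CD Number': 'count'})
# convert counts to percentages within each group; transform keeps the original index,
# which avoids a duplicated 'Group' level on reset_index in recent pandas versions
status_by_group_pct = status_by_group.groupby(level=0).transform(lambda x: 100 * x / x.sum()).reset_index()
# one row per group, one column per flag; missing combinations become 0
status_by_group_pct = status_by_group_pct.pivot(index='Group', columns='Publication Flag', values='CD Number').fillna(value=0).round(2)
plt.figure(figsize=(18, 24))
plt.title("Distribution (in %) over Publication Flag by Review Group")
sns.heatmap(status_by_group_pct, annot=True)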
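
As a cross-check, the same row percentages can be sketched with `pd.crosstab` and `normalize="index"`. The toy frame and its flag values ("New", "Update") are invented for illustration and are not the real flags. Note that `crosstab` counts rows, whereas the cell above counts non-null `CD Number` values, so the two can differ if CD numbers are missing.

```python
import pandas as pd

# Toy data, invented for illustration; group names and flag values are hypothetical
toy = pd.DataFrame({
    "Group": ["G1", "G1", "G1", "G2"],
    "Publication Flag": ["New", "New", "Update", "New"],
})

# normalize="index" divides each row by its row total, giving shares per group;
# combinations that never occur are filled with 0 automatically
pct = (pd.crosstab(toy["Group"], toy["Publication Flag"], normalize="index") * 100).round(2)
print(pct)
```

The resulting frame has one row per group and one column per flag, so it can be passed to `sns.heatmap` in the same way as `status_by_group_pct`.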