# EDA (Exploratory Data Analysis)

The DBLP dataset contains the title and, where one exists, the abstract of each English computer science paper. To perform EDA (Exploratory Data Analysis) on this dataset, we first define a document as a title together with its abstract, if there is one. Each document can then be split into sentences that end with a period, a question mark, or an exclamation mark, and each sentence can be split into tokens on whitespace and on punctuation such as a comma, a colon, or a semicolon.
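
The splitting rules above can be sketched as follows. This is a minimal illustration, not the script that produced the report files: the two regular expressions and the helper names `split_sentences` and `tokenize` are assumptions, and the example document is made up.

```python
import re

# Assumption: a sentence ends with '.', '?' or '!' followed by whitespace.
SENTENCE_END = re.compile(r'(?<=[.?!])\s+')
# Assumption: tokens are separated by whitespace, commas, colons, or semicolons.
TOKEN_SEP = re.compile(r'[\s,:;]+')

def split_sentences(document):
    """Split a document into sentences at '.', '?' or '!' followed by whitespace."""
    return [s for s in SENTENCE_END.split(document.strip()) if s]

def tokenize(sentence):
    """Split a sentence into tokens, dropping the sentence-final punctuation."""
    return [t for t in TOKEN_SEP.split(sentence.rstrip('.?!')) if t]

# Hypothetical document (title plus abstract), for illustration only.
doc = "Phrase mining: a survey. Does it scale? Yes, it does!"
sentences = split_sentences(doc)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

A rule this simple also splits at periods inside abbreviations or decimal numbers, which is one way unusually short or long "sentences" can arise.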

In [None]:
import pandas as pd
import numpy as np
import json
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

In [None]:
# Import data: report_data.txt holds one value per line
with open('../data/eda/report_data.txt', 'r') as f:
    n_document = f.readline().rstrip('\n')
    n_sentence = f.readline().rstrip('\n')
    n_token = f.readline().rstrip('\n')
    n_unique = f.readline().rstrip('\n')
    document_len = f.readline().rstrip('\n')
    sentence_len = f.readline().rstrip('\n')
    count_white = f.readline().rstrip('\n')
    weird = f.readline().rstrip('\n')

# The length lists are stored as "[a, b, c]"; drop the brackets and parse each entry
document_len = [int(i) if i.strip().isdigit() else i.strip() for i in document_len[1:-1].split(',')]
sentence_len = [int(i) if i.strip().isdigit() else i.strip() for i in sentence_len[1:-1].split(',')]
# Token frequencies are stored as JSON (despite the .csv extension)
with open('../data/eda/report_tokens.csv') as f:
    all_token = json.load(f)
# Strip the surrounding quotes from each "weird" character
weird = [i[1:-1] for i in weird[1:-1].split(', ')]
# AutoPhrase may produce no single-word output; fall back to an empty frame
try:
    single = pd.read_csv('../data/out/AutoPhrase_single-word.txt', sep='\t', header=None)
    single.columns = ['Quality Score', 'Phrase']
except (FileNotFoundError, pd.errors.EmptyDataError):
    single = pd.DataFrame(columns=['Quality Score', 'Phrase'])
multi = pd.read_csv('../data/out/AutoPhrase_multi-words.txt', sep='\t', header=None)
multi.columns = ['Quality Score', 'Phrase']

# How many documents/sentences/tokens are there in your input corpus?

In [None]:
print('# of documents: ' + str(n_document))
print('# of sentences: ' + str(n_sentence))
print('# of total tokens: ' + str(n_token))
print('# of unique tokens: ' + str(n_unique))

By looping through each line in the dataset, we found a total of 2,243,969 documents, 8,501,604 sentences, and 91,890,440 tokens. Among these tokens, there are 1,359,793 unique ones.
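
As a quick sanity check, the four counts above can be turned into per-document and per-sentence averages. The sketch below only does arithmetic on the reported numbers; the variable names are illustrative. Note that these are means, not the modes read off the plots later on.

```python
# Counts reported in the text above.
n_documents = 2_243_969
n_sentences = 8_501_604
n_tokens = 91_890_440
n_unique_tokens = 1_359_793

# Mean number of sentences per document and tokens per sentence/document.
sentences_per_document = n_sentences / n_documents
tokens_per_sentence = n_tokens / n_sentences
tokens_per_document = n_tokens / n_documents
# Type-token ratio: distinct tokens divided by total token occurrences.
type_token_ratio = n_unique_tokens / n_tokens

print(f'sentences per document: {sentences_per_document:.2f}')
print(f'tokens per sentence:    {tokens_per_sentence:.2f}')
print(f'tokens per document:    {tokens_per_document:.2f}')
print(f'type-token ratio:       {type_token_ratio:.4f}')
```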

# What are the length distributions of documents and sentences? Any outliers?

We use both a density plot and a box plot to analyze the length distributions of documents and sentences. Document and sentence lengths are measured in number of tokens. A density plot shows the probability density (the probability per unit on the x-axis) of the lengths, estimated with a kernel density estimate. A box plot, in turn, shows how the document and sentence lengths are spread out.
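
To make "probability per unit on the x-axis" concrete, here is a minimal Gaussian kernel density estimate written in plain Python. It is a sketch under assumptions: the function name `gaussian_kde`, the fixed bandwidth, and the sample lengths are all made up for illustration, and plotting libraries choose the bandwidth automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the probability density of `samples`."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical sentence lengths in tokens, for illustration only.
lengths = [8, 9, 10, 10, 11, 12, 14, 30]
density = gaussian_kde(lengths, bandwidth=2.0)

# A density integrates to 1 over the x-axis; check with a Riemann sum on [-20, 60].
step = 0.1
area = sum(density(i * step) * step for i in range(-200, 601))
print(f'area under the curve: {area:.3f}')
```

Because the curve integrates to 1, its height at a given length is comparable across datasets of different sizes, which raw counts are not.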

In [None]:
sns.set(style='whitegrid', font_scale=2)
plt.figure(figsize=(20, 15))
# Top panel: normalized histogram, so the y-axis is a density rather than a count
plt.subplot(2, 1, 1)
plt.hist(document_len, density=True)
#sns.distplot(document_len)
plt.ylabel('Density')
plt.title('Length Distribution of Documents', fontsize=25)
# Bottom panel: box plot of the same lengths
plt.subplot(2, 1, 2)
sns.boxplot(x=document_len)
plt.xlabel('Document Length')

#plt.savefig('../data/eda/Length Distribution of Documents.png')
#plt.show()

Most documents are shorter than 200 tokens, with a peak at about 100 tokens. A few outliers have a very high number of tokens, which is why the plots are right-skewed. The middle half of the data varies little, which gives a small interquartile range and thus a short box. The longest document has about 4,800 tokens.
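
The link between a small interquartile range, a short box, and many outliers can be sketched with the usual box plot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The helper `box_plot_summary` and the sample lengths are hypothetical, and the quartile method is an assumption chosen to mimic linear interpolation.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Compute quartiles, IQR, and the points a box plot would draw as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {'q1': q1, 'median': median, 'q3': q3, 'iqr': iqr, 'outliers': outliers}

# Hypothetical document lengths in tokens, for illustration only.
lengths = [12, 80, 90, 95, 100, 105, 110, 120, 4800]
summary = box_plot_summary(lengths)
print(summary)
```

With a tight middle half, even moderately long or short documents fall outside the whiskers, so a short box and a long trail of outlier points go together.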

In [None]:
sns.set(style='whitegrid', font_scale=2)
plt.figure(figsize=(20, 15))
# Top panel: normalized histogram, so the y-axis is a density rather than a count
plt.subplot(2, 1, 1)
#sns.distplot(sentence_len)
plt.hist(sentence_len, density=True)
plt.ylabel('Density')
plt.title('Length Distribution of Sentences', fontsize=25)
# Bottom panel: box plot of the same lengths
plt.subplot(2, 1, 2)
sns.boxplot(x=sentence_len)
plt.xlabel('Sentence Length')

#plt.savefig('../data/eda/Length Distribution of Sentences.png')
#plt.show()

Most sentences are shorter than 30 tokens, with a peak at about 10 tokens. A few outliers have a very high number of tokens, which is why the plots are right-skewed. The longest sentence has about 230 tokens.

Outliers exist in both distributions, but the length distribution of documents has many more extreme outliers, which makes its plots highly right-skewed. Most documents are shorter than 200 tokens and most sentences are shorter than 30 tokens; the short documents make sense because many documents consist of a title only, with no abstract. However, the box plot of sentence lengths shows a few extremely long sentences of more than 200 tokens, which suggests that some “weird” characters cause sentences to be split incorrectly.

# What is the distribution of all tokens? How many "rare" tokens (e.g., < 5 times)? 

As mentioned above, there are 91,890,440 tokens in total but only 1,359,793 unique ones. Counting the frequency of each unique token gives a frequency distribution of tokens.

In [None]:
# One row per unique token, sorted from most to least frequent
all_token_df = pd.DataFrame(list(all_token.items()), columns=['token', 'frequency'])
all_token_df = all_token_df.sort_values(by='frequency', ascending=False).reset_index(drop=True)
all_token_df

In [None]:
sns.set(style='whitegrid', font_scale=2)
plt.figure(figsize=(20, 15))
# Top panel: normalized histogram, so the y-axis is a density rather than a count
plt.subplot(2, 1, 1)
#sns.distplot(list(all_token.values()))
plt.hist(list(all_token.values()), density=True)
plt.ylabel('Density')
plt.title('Frequency Distribution of Tokens', fontsize=25)
# Bottom panel: box plot of the same frequencies
plt.subplot(2, 1, 2)
sns.boxplot(x=list(all_token.values()))
plt.xlabel('Token Frequency')

#plt.savefig('../data/eda/Frequency Distribution of Tokens.png')
#plt.show()

Most tokens have a frequency below 100,000. A few extremely frequent outliers make the plots right-skewed. The middle half of the data varies little, which gives a small interquartile range and thus a short box. The most frequent token has a frequency of about 490,000.

In [None]:
# Top 20 tokens with their frequency
sns.set(style='whitegrid', font_scale=2)
plt.figure(figsize=(25, 20))
ax = sns.barplot(x='frequency', y='token', data=all_token_df[:20])
ax.axes.set_title('Top 20 Most Frequent Tokens', fontsize=50)
ax.axes.set_xlabel('Frequency', fontsize=30)
ax.axes.set_ylabel('Token', fontsize=30)

#plt.savefig('../data/eda/Top 20 Most Frequent Tokens.png')
#plt.show()

According to the plot, the most frequent tokens in the dataset are not domain-specific.

In [None]:
rare = all_token_df.loc[all_token_df['frequency'] < 5]

print('# of rare tokens (e.g., < 5 times): ' + str(rare.shape[0]))
print('Some of the rare tokens are: ')
print(rare['token'][:20].to_list())

Examining the least frequent tokens, we found 1,162,239 tokens with a frequency below 5. Many of these terms are in fact domain-specific.
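
The rare-token filter can also be sketched without pandas, using `collections.Counter`. The toy token list and the helper `rare_tokens` are hypothetical; the final ratio uses only the two counts reported in this notebook and shows that roughly 85% of the unique tokens are rare.

```python
from collections import Counter

def rare_tokens(token_counts, threshold=5):
    """Return the tokens whose frequency is below `threshold`, sorted alphabetically."""
    return sorted(t for t, c in token_counts.items() if c < threshold)

# Hypothetical toy corpus, for illustration only.
toy_tokens = ['the'] * 6 + ['of'] * 5 + ['autophrase'] * 2 + ['dblp']
counts = Counter(toy_tokens)
rare = rare_tokens(counts)
print(rare)

# Share of rare tokens among unique tokens, from the counts reported above.
rare_share = 1_162_239 / 1_359_793
print(f'share of unique tokens that are rare: {rare_share:.1%}')
```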

# Is there any pre-processing required? e.g., remove the consecutive whitespace, remove some "weird" characters.

In [None]:
print('# of sentences containing consecutive whitespace: ' + str(count_white))
print('Some of the "weird" characters are: ')
print(weird[-21:])

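Pre-processing the data is strongly recommended: at a minimum, consecutive whitespace should be collapsed and the "weird" characters listed above should be removed.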
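
One possible shape for such a pre-processing step is sketched below. It assumes that "weird" characters are Unicode control and format characters (categories starting with `C`); the actual characters in the dataset may differ, and `clean_sentence` and `has_consecutive_whitespace` are hypothetical helper names.

```python
import re
import unicodedata

def has_consecutive_whitespace(text):
    """True if the text contains two or more whitespace characters in a row."""
    return re.search(r'\s{2,}', text) is not None

def clean_sentence(text):
    """Drop control/format characters, then collapse runs of whitespace."""
    # Assumption: "weird" means Unicode categories starting with 'C'
    # (control, format, ...). Ordinary whitespace such as tabs is kept here
    # and normalized to single spaces in the next step.
    kept = ''.join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith('C')
    )
    return re.sub(r'\s+', ' ', kept).strip()

# Hypothetical raw sentence with a NUL byte and a zero-width space.
raw = 'Deep  learning\x00 for \u200bphrase   mining.'
cleaned = clean_sentence(raw)
print(repr(cleaned))
```

Removing the characters before collapsing whitespace matters: deleting a character that sits between two spaces would otherwise leave a new run of consecutive whitespace behind.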

# Run AutoPhrase, and then plot the quality score distribution of single-word and multi-word phrases separately. Compare and discuss their differences.

After running AutoPhrase on the DBLP dataset, we obtained single-word and multi-word phrases together with their quality scores. We can therefore compare single-word and multi-word phrases by plotting their quality score distributions on the same scale.
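
"On the same scale" means two things here: both distributions share the same bins over the score range, and both are normalized to a density so that the differing numbers of phrases do not affect bar heights. The sketch below shows that idea in plain Python; the helper `density_histogram` and the scores are made up and are not AutoPhrase output.

```python
def density_histogram(scores, n_bins=20, low=0.0, high=1.0):
    """Bin scores into equal-width bins on [low, high] and normalize to a density."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for s in scores:
        if low <= s <= high:
            # Clamp so that a score equal to `high` lands in the last bin.
            idx = min(int((s - low) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    return [c / (total * width) if total else 0.0 for c in counts]

# Hypothetical quality scores, for illustration only.
single_scores = [0.31, 0.42, 0.44, 0.58, 0.61, 0.63]
multi_scores = [0.05, 0.08, 0.11, 0.12, 0.20, 0.91, 1.0]

single_density = density_histogram(single_scores)
multi_density = density_histogram(multi_scores)
print(single_density)
print(multi_density)
```

Including the right edge of the score range in the last bin matters for the multi-word phrases, since a bin grid that stops short of 1 would silently drop the highest-scoring phrases.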

In [None]:
sns.set(style='whitegrid', font_scale=2)
plt.figure(figsize=(20, 15))
# Shared bins of width 0.05 covering the full score range [0, 1], right edge included
score_bins = np.linspace(0, 1, 21)
# Top panel: single-word phrases (normalized so the y-axis is a density)
plt.subplot(2, 1, 1)
#sns.distplot(single['Quality Score'].to_list(), bins=np.arange(0, 1, 0.05))
plt.hist(single['Quality Score'].to_list(), bins=score_bins, density=True)
plt.xlim(-0.3, 1.3)
plt.ylim(0, 5)
plt.ylabel('Density')
plt.title('Quality Score Distribution of Single-Word Phrases', fontsize=25)
# Bottom panel: multi-word phrases, same bins and axis limits for comparison
plt.subplot(2, 1, 2)
#sns.distplot(multi['Quality Score'], bins=np.arange(0, 1, 0.05))
plt.hist(multi['Quality Score'], bins=score_bins, density=True)
plt.xlim(-0.3, 1.3)
plt.ylim(0, 5)
plt.ylabel('Density')
plt.title('Quality Score Distribution of Multi-Word Phrases', fontsize=25)

#plt.savefig('../data/eda/Quality Score Distribution.png')
#plt.show()

The quality score distribution of single-word phrases is closer to normal, while that of multi-word phrases is more right-skewed. Quality scores of single-word phrases cluster between 0.3 and 0.7, with two major modes at about 0.4 and 0.6 and a minor mode at about 0.3. Quality scores of multi-word phrases cluster between 0 and 0.3, with a major mode at about 0.1 and a minor mode at about 0.9.

On the same scale, single-word phrases generally have higher quality scores than multi-word phrases. However, some multi-word phrases score much higher, at around 0.8 to 0.9, than any single-word phrase. This makes sense: single-word phrases are used easily and frequently in both domain-specific and general contexts, whereas multi-word phrases can be more domain-specific because they are coined and used only within the domain.