# The dataset
---
The data was downloaded from https://dumps.wikimedia.org/enwiki/, and the corpus was recreated with the extraction method from https://github.com/markriedl/WikiPlots.

The raw data is a Wikipedia dump of English articles, from which the articles containing a sub-header with the word **"plot"** (e.g., "Plot", "Plot Summary") are extracted.

When the corpus is recreated, we have two files:

* plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by **`<EOS>`** on a line by itself.
* titles: a text file containing the title of each article in which a story plot was found and extracted, one title per line.
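
The file layout described above can be parsed with a few lines of Python. The sketch below works on a small in-memory sample rather than the real files; the sample text, the sample titles, and the helper name `split_stories` are made up for illustration.

```python
EOS = '<EOS>'

def split_stories(plots_text):
    """Split the contents of a plots file into a list of stories,
    each story being a list of sentences (one per line)."""
    stories = []
    current = []
    for line in plots_text.splitlines():
        line = line.strip()
        if line == EOS:
            # The marker closes the story collected so far
            stories.append(current)
            current = []
        elif line:
            current.append(line)
    return stories

# Invented sample that mimics the plots/titles layout
sample_plots = (
    "A knight sets out to find a dragon.\n"
    "He finds it asleep.\n"
    "<EOS>\n"
    "A robot learns to paint.\n"
    "<EOS>\n"
)
sample_titles = ["The Knight", "The Robot"]

stories = split_stories(sample_plots)
for title, sentences in zip(sample_titles, stories):
    print(title, '->', len(sentences), 'sentence(s)')
```

Since every story ends with its own `<EOS>` line, the number of stories should match the number of titles, which is a cheap sanity check after loading.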

The dump used was uploaded on 23-Mar-2017 14:24 and can be found [here](https://dumps.wikimedia.org/enwiki/). It is a 56 GB `.xml` file compressed as a 14 GB `.bz2` archive.

When extracted, the articles are split into folders (e.g., "AA/", "AB/", "AC/"), each containing several stories.

In [30]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.porter import PorterStemmer
import numpy as np

# Stemming

***

In [31]:
stemmer = PorterStemmer()

def stem(text):
    """Apply the Porter stemmer to every space-separated token of `text`."""
    text_stem = [stemmer.stem(token) for token in text.split(' ')]
    text_stem_join = ' '.join(text_stem)
    return text_stem_join

# Reading the dataset
***

For this example I used the story plots from the "AB/" folder.

In [32]:
dataset_dir = 'dataset/'
plots_filename = 'plots_AB.txt'
titles_filename = 'titles_AB.txt'
separator = '<EOS>'

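with open(dataset_dir + plots_filename, 'r') as file:
    corpus = file.readlines()
    corpus = corpus[:-1]              # drop the final <EOS> line so the split leaves no empty last story
    corpus = ''.join(corpus)
    corpus = corpus.split(separator)  # one string per story

with open(dataset_dir + titles_filename, 'r') as file:
    titles = file.readlines()         # each title keeps its trailing newline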

## Extracted stories
---

In [45]:
print('Total of extracted stories:', len(titles))
titles

Total of extracted stories: 65


['Day of the Tentacle\n',
 'Doraemon\n',
 'Dressed to Kill (1980 film)\n',
 'Doom (1993 video game)\n',
 'Diablo II\n',
 'Dune Messiah\n',
 'Duke Nukem 3D\n',
 'Dr. Strangelove\n',
 'Das Boot\n',
 'Death of a Hero\n',
 'The Evil Dead\n',
 'Young and Innocent\n',
 'Escape from New York\n',
 'Eyes Wide Shut\n',
 'Edward de Vere, 17th Earl of Oxford\n',
 'Enter the Dragon\n',
 'Evil Dead II\n',
 'The Trial\n',
 'The Metamorphosis\n',
 'Fahrenheit 451\n',
 'Fearless (1993 film)\n',
 'Formant\n',
 'The Pinchcliffe Grand Prix\n',
 'Four Weddings and a Funeral\n',
 'Final Fantasy: The Spirits Within\n',
 'Four Feather Falls\n',
 'Show Me Love (film)\n',
 'Follies\n',
 'Fawlty Towers\n',
 'Full Metal Jacket\n',
 'Farmer Giles of Ham\n',
 'King Kong vs. Godzilla\n',
 'Ebirah, Horror of the Deep\n',
 'Son of Godzilla\n',
 'Destroy All Monsters\n',
 'Godzilla vs. Megalon\n',
 'Godzilla vs. Biollante\n',
 'Terror of Mechagodzilla\n',
 'Godzilla vs. King Ghidorah\n',
 'Godzilla vs. Mothra\n',
 'God

## Plot example (A Song of Ice and Fire)
---

In [34]:
corpus[43]

'\nA Song of Ice and Fire takes place in a fictional world in which seasons last for years and end unpredictably.\nNearly three centuries before the events of the first novel (see backstory), the Seven Kingdoms of Westeros were united under the Targaryen dynasty by Aegon I and his sisters Visenya and Rhaenys, with Aegon Targaryen becoming the first king of the whole of the continent of Westeros, save for the southerly Dorne.\nAt the beginning of A Game of Thrones, 15 peaceful summer years have passed since the rebellion led by Robert Baratheon deposed and killed the last Targaryen king, Aerys II, and proclaimed Robert king of the Seven Kingdoms.\nThe principal story chronicles the power struggle for the Iron Throne between the great Houses of Westeros following the death of King Robert in A Game of Thrones.\nRobert\'s heir apparent, the 13-year old Joffrey, is immediately proclaimed king through the machinations of his mother, Cersei Lannister.\nWhen Lord Eddard "Ned" Stark, King Rober

# Fitting and transforming the stemmed corpus
---

In [35]:
corpus_stem = list(map(stem, corpus))
tfidf = TfidfVectorizer(norm='l2', use_idf=True, stop_words='english')

X = tfidf.fit_transform(corpus_stem)
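
To make the weighting concrete, here is a minimal NumPy sketch of what `TfidfVectorizer` computes with its default settings and `norm='l2'`: raw term counts multiplied by a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The toy count matrix is invented, and tokenisation and stop-word removal are left out.

```python
import numpy as np

# Toy term-count matrix: 3 documents (rows) x 4 terms (columns). Invented data.
counts = np.array([
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 3],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency

weights = counts * idf                            # tf * idf
row_norms = np.linalg.norm(weights, axis=1, keepdims=True)
tfidf_matrix = weights / row_norms                # every row now has unit length

print(np.round(tfidf_matrix, 3))
```

The second term occurs in every document, so its idf is exactly 1 and it carries the least weight relative to its count; terms that occur in a single document are boosted.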

# Extracted features
---

In [44]:
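# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases
feature_names = tfidf.get_feature_names_out().tolist()
print('Total features extracted:', len(feature_names))
print(feature_names)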

# print(X.toarray())

Total features extracted: 7055
['000', '10', '100', '11', '114', '116', '117', '1194', '13', '1300', '14', '141', '15', '153', '1567', '1568', '1570', '1571', '1572', '1573', '1576', '1577', '1578', '1580', '1581', '1582', '1583', '1584', '1585', '1586', '1589', '1592', '1596', '16', '1600', '1601', '1603', '1604', '1605', '1606', '16th', '17', '175', '17th', '18', '1861', '1863', '1864', '1865', '1869', '1873', '1877', '1884', '19', '1941', '1944', '1945', '1947', '195', '1954', '1955', '1960', '1960s', '1962', '1965', '1968', '1970s', '1973', '1988', '1992', '1993', '1997', '1999', '19th', '1st', '20', '200', '2008', '2065', '20th', '21', '21st', '22', '2204', '23', '24', '25', '26', '27', '28', '280', '29', '2nd', '30', '31', '34', '36', '39', '3d', '40', '400', '451', '46', '49', '50', '500', '50s', '51', '52', '5309', '61', '66', '69th', '78', '79', '800', '80th', '83', '843rd', '867', '88', '900', '93', 'abandon', 'abbey', 'abduct', 'abdul', 'abid', 'abil', 'abilities', 'ability'

# Cosine Similarities of NxN articles
---

In [37]:
# cosine_similarity accepts the sparse matrix directly and returns the full N x N array
a = cosine_similarity(X)

df = pd.DataFrame(data=a)
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000000 | 0.023168 | 0.061498 | 0.022726 | 0.021548 | 0.020452 | 0.019789 | 0.036444 | 0.020765 | 0.041371 | ... | 0.027946 | 0.014564 | 0.028720 | 0.031060 | 0.024586 | 0.010632 | 0.018904 | 0.034613 | 0.029478 | 0.034668 |
| 1 | 0.023168 | 1.000000 | 0.015307 | 0.016203 | 0.020730 | 0.015709 | 0.019732 | 0.034435 | 0.015435 | 0.039898 | ... | 0.003125 | 0.008296 | 0.012961 | 0.015582 | 0.015231 | 0.010136 | 0.014915 | 0.016827 | 0.016428 | 0.031814 |
| 2 | 0.061498 | 0.015307 | 1.000000 | 0.017928 | 0.025703 | 0.017407 | 0.018253 | 0.037495 | 0.023200 | 0.029896 | ... | 0.010566 | 0.008656 | 0.018467 | 0.028953 | 0.022246 | 0.024767 | 0.024423 | 0.006419 | 0.029006 | 0.029950 |
| 3 | 0.022726 | 0.016203 | 0.017928 | 1.000000 | 0.110076 | 0.023845 | 0.127291 | 0.055872 | 0.049428 | 0.044426 | ... | 0.083821 | 0.061469 | 0.067852 | 0.032100 | 0.035585 | 0.016573 | 0.016628 | 0.028774 | 0.027401 | 0.034248 |
| 4 | 0.021548 | 0.020730 | 0.025703 | 0.110076 | 1.000000 | 0.020796 | 0.055067 | 0.019943 | 0.030299 | 0.043900 | ... | 0.065922 | 0.049887 | 0.036959 | 0.036179 | 0.021332 | 0.011398 | 0.021258 | 0.047584 | 0.024788 | 0.038019 |
| 5 | 0.020452 | 0.015709 | 0.017407 | 0.023845 | 0.020796 | 1.000000 | 0.026535 | 0.025934 | 0.018570 | 0.035337 | ... | 0.013127 | 0.012962 | 0.014413 | 0.024806 | 0.132072 | 0.009017 | 0.020888 | 0.018331 | 0.022760 | 0.025754 |
| 6 | 0.019789 | 0.019732 | 0.018253 | 0.127291 | 0.055067 | 0.026535 | 1.000000 | 0.033424 | 0.048311 | 0.040963 | ... | 0.035480 | 0.024445 | 0.043663 | 0.028714 | 0.031793 | 0.096390 | 0.012636 | 0.037332 | 0.060811 | 0.036694 |
| 7 | 0.036444 | 0.034435 | 0.037495 | 0.055872 | 0.019943 | 0.025934 | 0.033424 | 1.000000 | 0.074712 | 0.056114 | ... | 0.022928 | 0.008482 | 0.012646 | 0.028229 | 0.032688 | 0.019896 | 0.019878 | 0.018822 | 0.024414 | 0.030550 |
| 8 | 0.020765 | 0.015435 | 0.023200 | 0.049428 | 0.030299 | 0.018570 | 0.048311 | 0.074712 | 1.000000 | 0.064684 | ... | 0.019581 | 0.015693 | 0.015247 | 0.021471 | 0.041605 | 0.021553 | 0.020771 | 0.013271 | 0.033159 | 0.040655 |
| 9 | 0.041371 | 0.039898 | 0.029896 | 0.044426 | 0.043900 | 0.035337 | 0.040963 | 0.056114 | 0.064684 | 1.000000 | ... | 0.033951 | 0.027646 | 0.025267 | 0.032398 | 0.024612 | 0.017017 | 0.046029 | 0.059761 | 0.050151 | 0.069298 |
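
Cosine similarity between two vectors is their dot product divided by the product of their norms. When the rows are already L2-normalised, as they are with `norm='l2'`, the whole similarity matrix reduces to a single matrix product. A small sketch on made-up vectors (`V` and `toy_cos` are illustrative names only):

```python
import numpy as np

# Invented row vectors standing in for TF-IDF rows
V = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 5.0],   # orthogonal to rows 0 and 1
])

# Normalise each row to unit length, then take all pairwise dot products
V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
toy_cos = V_unit @ V_unit.T

print(np.round(toy_cos, 3))
```

The result is symmetric with ones on the diagonal, which is the same structure visible in the table above: every article is maximally similar to itself.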


# Setting the diagonal to 0 and getting the most similar articles
---

In [38]:
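# Zero the diagonal so that an article is never matched with itself
sim_no_diag = df.to_numpy(copy=True)
np.fill_diagonal(sim_no_diag, 0)
df = pd.DataFrame(sim_no_diag)

most_similar = df.idxmax(axis=1).to_numpy()       # index of the best match for each article
most_similar_values = df.max(axis=1).to_numpy()   # similarity of that best match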
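
The same idea extends from the single nearest neighbour to the k nearest. The helper below is a hypothetical addition, not part of the original notebook; it takes any square similarity matrix and skips the article itself. The toy matrix is invented.

```python
import numpy as np

def top_k_similar(sim_matrix, index, k=3):
    """Return (neighbour_index, similarity) pairs for the k rows most
    similar to `index`, excluding the row itself."""
    row = np.asarray(sim_matrix, dtype=float)[index].copy()
    row[index] = -np.inf                  # never return the article itself
    order = np.argsort(row)[::-1][:k]     # indices by descending similarity
    return [(int(j), float(row[j])) for j in order]

# Invented 4x4 similarity matrix for illustration
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
])

print(top_k_similar(toy_sim, 0, k=2))
```

With the real data, something like `top_k_similar(df.to_numpy(), i)` could be used to list several candidates per story instead of only the best one.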

# Most similar articles
---

In [39]:
for i in range(len(most_similar)):
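    # Note: df[most_similar[i]].max() is the highest similarity in the matched article's column,
    # which is not necessarily its similarity to article i (that value is most_similar_values[i]).
    print('Most similar article to', titles[i], 'is:', titles[most_similar[i]], 'with cosine similarity:', df[most_similar[i]].max())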
    print()

Most similar article to Day of the Tentacle
 is: Gunpowder Plot
 with cosine similarity: 0.6208293729

Most similar article to Doraemon
 is: Gunpowder Plot
 with cosine similarity: 0.6208293729

Most similar article to Dressed to Kill (1980 film)
 is: Glen or Glenda
 with cosine similarity: 0.0973278054087

Most similar article to Doom (1993 video game)
 is: Duke Nukem 3D
 with cosine similarity: 0.186365996796

Most similar article to Diablo II
 is: Doom (1993 video game)
 with cosine similarity: 0.127290879176

Most similar article to Dune Messiah
 is: Heretics of Dune
 with cosine similarity: 0.182127168354

Most similar article to Duke Nukem 3D
 is: Escape from New York
 with cosine similarity: 0.186365996796

Most similar article to Dr. Strangelove
 is: The Return of Godzilla
 with cosine similarity: 0.298076043532

Most similar article to Das Boot
 is: Galaxy Quest
 with cosine similarity: 0.118794051549

Most similar article to Death of a Hero
 is: Glen or Glenda
 with cosine si

# Most similar pair of articles

In [48]:
# Both articles of the best pair should share the same maximum, so this is expected to yield two indices
most_similar_articles_indexes = np.where(most_similar_values == most_similar_values.max())[0]
print('Most similar articles:')
print(titles[most_similar_articles_indexes[0]], 'and', titles[most_similar_articles_indexes[1]], 'cos:', df[most_similar_articles_indexes[0]].max())
print()
print('Plot of', titles[most_similar_articles_indexes[0]])
print()
print(corpus[most_similar_articles_indexes[0]])
print()
print()
print('Plot of', titles[most_similar_articles_indexes[1]])
print()
print(corpus[most_similar_articles_indexes[1]])

Most similar articles:
Guy Fawkes
 and Gunpowder Plot
 cos: 0.6208293729

Plot of Guy Fawkes



In 1604 Fawkes became involved with a small group of English Catholics, led by Robert Catesby, who planned to assassinate the Protestant King James and replace him with his daughter, third in the line of succession, Princess Elizabeth.
Fawkes was described by the Jesuit priest and former school friend Oswald Tesimond as "pleasant of approach and cheerful of manner, opposed to quarrels and strife&nbsp;.
loyal to his friends".
Tesimond also claimed Fawkes was "a man highly skilled in matters of war", and that it was this mixture of piety and professionalism which endeared him to his fellow conspirators.
The author Antonia Fraser describes Fawkes as "a tall, powerfully built man, with thick reddish-brown hair, a flowing moustache in the tradition of the time, and a bushy reddish-brown beard", and that he was "a man of action&nbsp;.
capable of intelligent argument as well as physical endurance, 
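
Because the similarity matrix is symmetric, the best pair can also be located directly by searching only the entries above the diagonal. A sketch with an invented matrix, using `np.triu_indices`:

```python
import numpy as np

# Invented symmetric similarity matrix for illustration
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
])

# Index pairs (row, col) with row < col, i.e. each unordered pair exactly once
rows, cols = np.triu_indices(toy_sim.shape[0], k=1)
best = np.argmax(toy_sim[rows, cols])
i, j = int(rows[best]), int(cols[best])

print('Most similar pair:', (i, j), 'cos:', toy_sim[i, j])
```

This avoids relying on two rows having exactly equal maxima, and it does not require zeroing the diagonal first.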

# Wordcloud of most similar articles

In [50]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(background_color='white', max_words=30, stopwords=tfidf.get_stop_words())
wc_ = WordCloud(background_color='white', max_words=30, stopwords=tfidf.get_stop_words())

wc.generate(corpus[most_similar_articles_indexes[0]])
wc_.generate(corpus[most_similar_articles_indexes[1]])


fig = plt.figure()
a = fig.add_subplot(1,2,1)
imgplot = plt.imshow(wc)
plt.title(titles[most_similar_articles_indexes[0]])
plt.axis("off")

a = fig.add_subplot(1,2,2)
imgplot = plt.imshow(wc_)
plt.title(titles[most_similar_articles_indexes[1]])
plt.axis("off")

plt.show()

# Heatmap of the similarity matrix

In [51]:
import seaborn as sns
%matplotlib

sns.heatmap(df)

Using matplotlib backend: TkAgg


<matplotlib.axes._subplots.AxesSubplot at 0x1138d9710>