## Multi Modal Data Analysis

In this notebook we will look at what a multi-modal analysis of social and online objects might look like!'

We will start by looking at some wikipedia data - most pages have text, quite a few have images, and since they're linked to each other you can create a network of pages. Some pages have audio too!

In this notebook we will be using the wikipedia data found on [Kaggle](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset), scraped using [pywikibot](https://github.com/OlehOnyshchak/pyWikiMM). For the sake of demonstrations, I downloaded 4 wikipedia json files from that link - 

1) 1960 South Vietnamese Coup Attempt

2) 1962 South Vietnamese Independence Palace bombing

3) 1880 Democratic National Convention

4) 1880 Republican National Convention


The data we downloaded contains the raw text, as well as a link to images and their features, already extracted using a CNN, similar to the exercises we saw. Let us load these files and see what they look like.

In [1]:
import json

In [59]:
import lucem_illud
import spacy

In [21]:
with open('../data/wiki/text_0.json') as f:
    text_0 = json.loads(json.load(f))

In [49]:
text_0.keys()

dict_keys(['title', 'id', 'url', 'wikitext', 'html'])

In [83]:
text_0['wikitext'][0:1000]

"{{pp-move-indef}}\n{{Infobox military conflict\n|conflict=1960 South Vietnamese coup attempt\n|image=Ngo Dinh Diem - Thumbnail - ARC 542189.png\n|image_size  = 300px\n|alt=A portrait of a middle-aged man, looking to the left in a half-portrait/profile. He has chubby cheeks, parts his hair to the side and wears a suit and tie.\n|caption=President Ngô Đình Diệm of South Vietnam\n|partof=\n|date=November 11, 1960\n|place=[[Ho Chi Minh City|Saigon]], [[South Vietnam]]\n|result=[[Coup d'état|Coup]] attempt defeated\n\n|combatant1=[[Army of the Republic of Vietnam|ARVN]] rebels\n|combatant2={{flagicon|South Vietnam}} [[Army of the Republic of Vietnam|ARVN]] loyalists\n|commander1=[[Vương Văn Đông]]<br />[[Nguyễn Chánh Thi]]\n|commander2=[[Ngô Đình Diệm]]<br />[[Nguyễn Văn Thiệu]]<br />[[Trần Thiện Khiêm]]\n|strength1=One armoured regiment, one marine unit, and three paratrooper battalions\n|strength2=[[5th Division (South Vietnam)|5th Division]] and [[7th Division (South Vietnam)|7th Divisi

In [84]:
with open('../data/wiki/text_1.json') as f:
    text_1 = json.loads(json.load(f))

In [85]:
with open('../data/wiki/text_2.json') as f:
    text_2 = json.loads(json.load(f))

In [86]:
with open('../data/wiki/text_3.json') as f:
    text_3 = json.loads(json.load(f))

In [79]:
nlp = spacy.load("en")

Now that we've loaded the text, we can begin to vectorise. The following methods help you create either avg word2vec or doc2vec vectors.

In [89]:
def word_tokenize(word_list):
    tokenized = []
    # pass word list through language model.
    doc = nlp(word_list)
    for token in doc:
        if not token.is_punct and len(token.text.strip()) > 0 and not token.is_stop and not token.like_num:
            if '|' not in token.text and '=' not in token.text and '.' not in token.text:
                tokenized.append(token.lemma_.lower())
    return tokenized

In [90]:
tokenized_0 = word_tokenize(text_0['wikitext'])

In [91]:
tokenized_1 = word_tokenize(text_1['wikitext'])

In [92]:
tokenized_2 = word_tokenize(text_2['wikitext'])

In [93]:
tokenized_3 = word_tokenize(text_3['wikitext'])

In [81]:
def create_vector(text, model, model_type=None):
    if model_type == "word2vec":
        vectors = []
        for word in text:
            try:
                vectors.append(model.wv[word])
            except KeyError:
                pass
        if len(vectors) > 0:
            return np.mean(vectors, axis=0)
    if model_type == "doc2vec":
        vector = model.infer_vector(text)
        return vector

In [94]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [95]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate([tokenized_0, tokenized_1, tokenized_2, tokenized_3])]

In [96]:
d2vmodel = Doc2Vec(documents, vector_size=50)

In [97]:
model_address = "data/" 

In [None]:
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(model_address, binary=True)

In [None]:
create_vector(tokenized_0, w2vmodel, model_type="word2vec")

In [98]:
create_vector(tokenized_0, d2vmodel, model_type="doc2vec")

array([-0.47521707, -0.6569927 , -0.03321923,  1.390251  ,  3.1851172 ,
       -2.3046446 ,  0.7811589 ,  2.6919131 ,  0.12278743,  4.8917794 ,
       -2.1079886 , -0.7931807 ,  0.6441535 ,  1.2792832 ,  0.72818583,
       -0.95999724,  1.4883771 ,  3.074919  , -0.87856305, -2.6820378 ,
       -0.19215181, -2.5349438 , -1.6373653 ,  4.462885  ,  0.22386754,
       -4.3792377 ,  0.7293477 ,  0.30209842,  1.484349  , -2.9116993 ,
       -0.9497888 ,  0.23001377, -0.47558174,  1.1162286 ,  2.031805  ,
       -1.1123452 , -0.72439283,  2.1367135 , -1.7740167 , -0.5438571 ,
       -1.2810451 ,  1.610157  , -0.37582704, -2.4665263 , -4.057991  ,
       -1.3699012 , -1.5396568 , -0.99103254,  4.587286  , -3.1472726 ],
      dtype=float32)

In [29]:
with open('../data/wiki/meta_0.json') as f:
    meta_images_0 = json.loads(json.load(f))

In [44]:
len(meta_images_0['img_meta'])

4

In [46]:
meta_images_0['img_meta'][0].keys()

dict_keys(['filename', 'title', 'url', 'description', 'caption', 'headings', 'features', 'parsed_title'])

So there are 4 images in the first meta file. It also includes image embedding features extracted from a CNN.

In [45]:
wiki_0_img_features = {}

In [47]:
for img in meta_images_0['img_meta']:
    wiki_0_img_features[img['title']] = np.array(img['features'])

In [48]:
wiki_0_img_features

{'Madame Ngô Đình Nhu and Lyndon Baines Johnson.jpg': ['12.354491',
  '1.5270848',
  '13.063374',
  '2.5052433',
  '1.173463',
  '12.640376',
  '1.715704',
  '8.215351',
  '19.42873',
  '3.864879',
  '8.719646',
  '5.615429',
  '0.7377188',
  '3.2366543',
  '21.107939',
  '10.09845',
  '10.476905',
  '15.076618',
  '1.8579829',
  '4.58282',
  '6.404323',
  '9.05388',
  '10.285291',
  '4.8379645',
  '6.991437',
  '4.822517',
  '2.7573485',
  '7.594947',
  '3.186844',
  '7.57923',
  '2.6561885',
  '3.8202536',
  '10.310757',
  '32.220436',
  '10.473255',
  '19.846302',
  '4.0166388',
  '9.215218',
  '6.8447847',
  '23.832325',
  '15.827032',
  '5.9551196',
  '10.423122',
  '3.1189883',
  '8.595874',
  '1.988519',
  '3.4408007',
  '5.54282',
  '1.8818237',
  '2.868326',
  '1.3910198',
  '12.305906',
  '11.429949',
  '9.158894',
  '7.8012547',
  '5.894635',
  '1.0868963',
  '11.428989',
  '4.8180437',
  '6.71693',
  '4.6707015',
  '5.6400266',
  '5.8589907',
  '5.1224685',
  '0.30416858',


Now that we have text features and image features, we can begin to check for similarities and differences!

In [99]:
def cosine_distance(X, Y):
    cosine_similarity = np.dot(X, Y) / (np.linalg.norm(X)* np.linalg.norm(Y))
    return 1 - cosine_similarity