## Merging dataframes to get image<->taxo_label dataframe.
In this file we merge an image<->article dataframe (segment 0 of WIT_Dataset) with an article<->taxo_label dataframe (training dataset of ORES) to obtain a image<->taxo_label dataframe.

In [3]:
import pandas as pd

### 1. Reading images <-> taxo_label into a dataframe.

The training images given by the file _wit_v1.train.all-00000-of-00010.tsv.gz_ has 3.7 million images inside.

In [4]:
# Segment numbers of the WIT_Dataset

# Construct segments dataframe

segment_numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
frames = []
for nr in segment_numbers:
    filename = '/scratch/WIT_Dataset/wit_v1.train.all-0000' + str(nr) + '-of-00010.tsv.gz'
    new_frame = pd.read_csv(filename, compression='gzip', sep='\t')
    new_frame.to_json(f'data/images_df_segment_{str(nr)}.json.bz2', compression='bz2')
    # frames.append(new_frame)
    print(f'Added segment {nr}')


In [3]:
images_df_ready_to_json = images_df.reset_index()
images_df_ready_to_json.to_json('data/image_df_all_segments.json.bz2', compression='bz2')

ValueError: DataFrame index must be unique for orient='columns'.

Notice that some images are present in the images_df dataset multiple times in different articles and languages same language.

In [None]:
print(images_df.image_url.value_counts())

https://upload.wikimedia.org/wikipedia/commons/c/c7/North_Macedonia_relief_location_map.jpg    286
https://upload.wikimedia.org/wikipedia/commons/1/1a/E3d_txikia.png                             157
https://upload.wikimedia.org/wikipedia/commons/d/db/Moonmap_from_clementine_data.png           126
http://upload.wikimedia.org/wikipedia/commons/b/bb/Location_map_South_Georgia.png               95
https://upload.wikimedia.org/wikipedia/commons/f/ff/Espa%C3%B1aLoc.svg                          80
                                                                                              ... 
https://upload.wikimedia.org/wikipedia/commons/9/9a/DinmukhametAkhimov.jpg                       1
https://upload.wikimedia.org/wikipedia/commons/a/a8/Alessandro_Haber_2007_cropped.jpg            1
https://upload.wikimedia.org/wikipedia/commons/2/26/Amsel_mit_Beere.JPG                          1
https://upload.wikimedia.org/wikipedia/commons/f/f8/TheAll-Story-June1912.jpg                    1
https://up

### 2. Reading ORES training data (article<->taxo_label)

From this other dataset _enwiki.labeled_article_items.json.bz2_ (apparently ORES training data), we get information about 5.9 million articles (qid, title & taxo_labels).

In [5]:
articles_df = pd.read_json('data/enwiki.labeled_article_items.json.bz2', compression='bz2', lines=True)
print(articles_df.shape)
articles_df.head(5)

(5926244, 9)


Unnamed: 0,article_pid,wp_templates,article_revid,qid,sitelinks,title,talk_revid,talk_pid,taxo_labels
0,18951386.0,"[WikiProject Objectivism, WikiProject Novels, ...",926765055.0,Q374098,"{'la': 'Atlas Shrugged', 'sv': 'Och världen sk...",Atlas Shrugged,911346471,128,"[Culture.Media.Books, History and Society.Poli..."
1,358.0,"[WikiProject Africa, WikiProject Countries, Wi...",928541225.0,Q262,"{'crh': 'Cezair', 'ku': 'Cezayir', 'et': 'Alže...",Algeria,927128572,354,"[Geography.Geographical, History and Society.S..."
2,2482.0,"[WikiProject France, WikiProject Architecture,...",924947879.0,Q64436,"{'sv': 'Triumfbågen, Paris', 'en': 'Arc de Tri...",Arc de Triomphe,921129575,672,"[Culture.Visual arts.Visual arts*, Geography.G..."
3,18951655.0,"[WikiProject Archaeology, WikiProject Anthropo...",926027145.0,Q23498,"{'ku': 'Arkeolojî', 'tg': 'Бостоншиносӣ', 'ro'...",Archaeology,896487747,692,"[History and Society.History, History and Soci..."
4,713.0,"[WikiProject Robotics, WikiProject Science Fic...",917854604.0,Q181787,"{'sv': 'Android', 'en': 'Android (robot)', 'ja...",Android (robot),899920932,714,"[Culture.Media.Entertainment, STEM.Technology,..."


### 3. Merging dataframes
Merging the dataframes we get categories (known as taxo_labels) for each image. After we are done, we group the dataframe by image_url and then aggregate it so that the images' taxo_labels are equal to a union of the taxo_labels of the articles that reference the images.

In [6]:
image_labels = images_df[['page_url', 'image_url', 'page_title']].merge(articles_df[['title', 'taxo_labels']], left_on=['page_title'], right_on=['title'])
print(image_labels.shape)
image_labels.head(2)

(6279903, 5)


Unnamed: 0,page_url,image_url,page_title,title,taxo_labels
0,https://en.wikipedia.org/wiki/Oxydactylus,https://upload.wikimedia.org/wikipedia/commons...,Oxydactylus,Oxydactylus,"[STEM.Biology, STEM.STEM*, STEM.Earth and envi..."
1,https://it.wikipedia.org/wiki/Oxydactylus,https://upload.wikimedia.org/wikipedia/commons...,Oxydactylus,Oxydactylus,"[STEM.Biology, STEM.STEM*, STEM.Earth and envi..."


In [7]:
grouped = image_labels.groupby('image_url').agg(lambda x: list(x)).reset_index() # https://stackoverflow.com/questions/34962104/how-can-i-use-the-apply-function-for-a-single-column & https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby
grouped.taxo_labels = grouped.taxo_labels.apply(lambda x : x[0])
grouped

Unnamed: 0,image_url,page_url,page_title,title,taxo_labels
0,http://upload.wikimedia.org/wikipedia/commons/...,[https://en.wikipedia.org/wiki/Chevrolet_Bisca...,[Chevrolet Biscayne],[Chevrolet Biscayne],"[STEM.STEM*, History and Society.Transportatio..."
1,http://upload.wikimedia.org/wikipedia/commons/...,[https://en.wikipedia.org/wiki/Dodge_Colt],[Dodge Colt],[Dodge Colt],"[STEM.STEM*, History and Society.Transportatio..."
2,http://upload.wikimedia.org/wikipedia/commons/...,[https://lt.wikipedia.org/wiki/Pontiac_Grand_Am],[Pontiac Grand Am],[Pontiac Grand Am],"[STEM.STEM*, History and Society.Transportatio..."
3,http://upload.wikimedia.org/wikipedia/commons/...,[https://en.wikipedia.org/wiki/%C3%89mile_Bayard],[Émile Bayard],[Émile Bayard],"[Geography.Regions.Europe.Europe*, Geography.R..."
4,http://upload.wikimedia.org/wikipedia/commons/...,[https://en.wikipedia.org/wiki/March_1st_Movem...,[March 1st Movement],[March 1st Movement],"[Geography.Regions.Asia.Asia*, Geography.Regio..."
...,...,...,...,...,...
3527930,https://upload.wikimedia.org/wikipedia/vi/f/f9...,[https://vi.wikipedia.org/wiki/SMS_L%C3%BCtzow],[SMS Lützow],[SMS Lützow],"[History and Society.History, History and Soci..."
3527931,https://upload.wikimedia.org/wikipedia/vi/f/fa...,[https://vi.wikipedia.org/wiki/Messerschmitt_M...,[Messerschmitt Me 262],[Messerschmitt Me 262],"[STEM.Engineering, History and Society.Transpo..."
3527932,https://upload.wikimedia.org/wikipedia/vi/f/fb...,[https://vi.wikipedia.org/wiki/John_Crawfurd],[John Crawfurd],[John Crawfurd],"[Geography.Regions.Europe.Northern Europe, Geo..."
3527933,https://upload.wikimedia.org/wikipedia/vi/f/fb...,[https://vi.wikipedia.org/wiki/Mikoyan-Gurevic...,[Mikoyan-Gurevich MiG-17],[Mikoyan-Gurevich MiG-17],"[Geography.Regions.Asia.Asia*, Geography.Regio..."


Finally, we have a dataset with 1 million image <-> taxo_label entries, save it!

In [8]:
grouped.to_json('data/image_labels_segments_0_to_4.json.bz2', compression='bz2')