## Search Wikipedia for information about the MJF bands and save it to a JSON file

Concert information is first loaded from a CSV file. We need
* the name of the concert (usually the name of the artist)
* its ID in the MJF database.

We then use the `wikipedia` Python package to search Wikipedia for each concert name. For the matching page we record the summary, the list of image URLs, and the categories.
We store the information in a pandas dataframe.
Finally, we serialize the dataframe to JSON and write it to a file.

In [1]:
import pandas as pd

Load the concert file

In [4]:
# keep_default_na=False keeps empty names as '' instead of turning them into NaN
CList = pd.read_csv('bands.csv', header=None, names=['id', 'name'], keep_default_na=False)

In [5]:
CList

| | id | name |
|---|---|---|
| 0 | 13 | Amy Macdonald |
| 1 | 14 | Jorge Ben Jor |
| 2 | 15 | Afrocubism |
| 3 | 16 | Kassav' |
| 4 | 17 | Little Feat |
| 5 | 18 | Jethro Tull's Ian Anderson |
| 6 | 19 | Alabama Shakes |
| 7 | 20 | Alanis Morissette |
| 8 | 21 | Katie Melua |
| 9 | 22 | Jamie N Commons |
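
The `keep_default_na=False` argument matters for the search step, which tests each name for truthiness before querying Wikipedia. A minimal sketch with an in-memory CSV shows the difference; the three rows are made up for illustration, and the code assumes Python 3:

```python
import io
import pandas as pd

# Hypothetical sample: a normal name, an empty name, and the literal name "NA"
sample = "13,Amy Macdonald\n14,\n15,NA\n"

kept = pd.read_csv(io.StringIO(sample), header=None,
                   names=['id', 'name'], keep_default_na=False)
default = pd.read_csv(io.StringIO(sample), header=None,
                      names=['id', 'name'])

# With keep_default_na=False every name stays a plain string
print(kept.name.tolist())
# With the defaults, the empty field and "NA" are both parsed as NaN
print(default.name.isna().tolist())
```

A NaN value is truthy in Python, so a test such as `if searchQ:` would not skip a missing name, whereas an empty string is skipped as intended.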


Count the entries

In [6]:
len(CList)

2390

### Perform the Wikipedia search

In [7]:
import wikipedia as wk
import time

In [8]:
# Fetch a Wikipedia page and return (categories, images, summary).
# Any field that cannot be retrieved keeps its empty default.
def getWresults(Wkresults):
    categories = []
    images = []
    summary = ''
    try:
        Wpage = wk.page(Wkresults)
    except Exception:
        # page not found, disambiguation page, network error, ...
        return categories, images, summary
    try:
        images = Wpage.images
    except Exception:
        pass
    try:
        summary = Wpage.summary
    except Exception:
        pass
    try:
        categories = Wpage.categories
    except Exception:
        pass
    return categories, images, summary
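
The fallback behaviour of `getWresults` can be checked without network access. The sketch below reimplements the same pattern with the page loader passed in as a parameter; `fetch_fields`, `FakePage` and `broken_loader` are hypothetical names introduced for this illustration and are not part of the `wikipedia` package.

```python
def fetch_fields(load_page, title):
    """Return (categories, images, summary), with empty defaults on failure."""
    defaults = {'categories': [], 'images': [], 'summary': ''}
    try:
        page = load_page(title)
    except Exception:
        # The page itself could not be loaded: return all defaults
        return defaults['categories'], defaults['images'], defaults['summary']
    fields = {}
    for name, default in defaults.items():
        try:
            fields[name] = getattr(page, name)
        except Exception:
            # One field failed: fall back for that field only
            fields[name] = default
    return fields['categories'], fields['images'], fields['summary']


class FakePage(object):
    """Stand-in for a page object whose `images` attribute fails."""
    summary = 'A band.'
    categories = ['Rock groups']

    @property
    def images(self):
        raise KeyError('images')


def broken_loader(title):
    """Stand-in for a loader that cannot find the page."""
    raise LookupError(title)


partial = fetch_fields(lambda title: FakePage(), 'Little Feat')
missing = fetch_fields(broken_loader, 'No such page')
print(partial)
print(missing)
```

A failure on one field leaves the other fields intact, and a failure to load the page yields three empty values, so the caller always receives a tuple of the same shape.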

In [None]:
# Search Wikipedia for every band name
# This takes some time!
slist = []          # best search hit for each band
sslist = []         # search suggestion for each band
lcategories = []
limages = []
lsummary = []
slcategories = []
slimages = []
slsummary = []

for i in range(len(CList)):
    searchQ = CList.name[i]
    # Defaults, so that every band adds exactly one entry to each list
    hit, suggestion = '', ''
    categories, images, summary = [], [], ''
    scategories, simages, ssummary = [], [], ''
    if searchQ:
        # With suggestion=True, wk.search returns (list of titles, suggestion)
        Wkresults = wk.search(searchQ, results=5, suggestion=True)
        if Wkresults[0]:
            hit = Wkresults[0][0]
            categories, images, summary = getWresults(hit)
        if Wkresults[1]:
            suggestion = Wkresults[1]
            scategories, simages, ssummary = getWresults(suggestion)
    slist.append(hit)
    lcategories.append(categories)
    limages.append(images)
    lsummary.append(summary)
    sslist.append(suggestion)
    slcategories.append(scategories)
    slimages.append(simages)
    slsummary.append(ssummary)
    # Progress report every 10 bands
    if not i % 10:
        print(i)
        print(time.ctime())

0
Mon Jul 25 14:51:44 2016
10
Mon Jul 25 14:52:21 2016
20
Mon Jul 25 14:52:49 2016
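
Because the lists are later concatenated column by column, each band must contribute exactly one entry to every list, including bands with an empty name or no search hit. One way to make that invariant hard to break is to build one record per band. The sketch below uses a stub in place of `wk.search`; `fake_search` and `search_row` are hypothetical names, and the stub's results are made up.

```python
def fake_search(query):
    """Stub returning (list of titles, suggestion or None)."""
    canned = {
        'Katie Melua': (['Katie Melua'], None),
        'Afrocubism': ([], 'afrocubism'),
    }
    return canned.get(query, ([], None))


def search_row(name, search=fake_search):
    """Return one record per band, whatever the outcome of the search."""
    row = {'searchQ': '', 'sug_searchQ': ''}
    if name:
        hits, suggestion = search(name)
        if hits:
            row['searchQ'] = hits[0]
        if suggestion:
            row['sug_searchQ'] = suggestion
    return row


names = ['Katie Melua', '', 'Afrocubism', 'Unknown act']
rows = [search_row(name) for name in names]

# One record per input name, so the columns stay aligned with the ids
print(len(rows) == len(names))
for row in rows:
    print(row)
```

A list of such records can be passed directly to `pd.DataFrame`, which avoids keeping several parallel lists in step by hand.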


In [45]:
# Save the data in the pickle format
import pickle
tosave = {
    'savelsummaryband.p': lsummary,
    'saveslsummaryband.p': slsummary,
    'saveslistband.p': slist,
    'savesslistband.p': sslist,
    'savelcategoriesband.p': lcategories,
    'saveslcategoriesband.p': slcategories,
    'savelimagesband.p': limages,
    'saveslimagesband.p': slimages,
}
for fname, data in tosave.items():
    # The with block closes each file once it has been written
    with open(fname, 'wb') as f:
        pickle.dump(data, f)
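
To resume the second part in a later session, the lists can be read back with `pickle.load`. A minimal round-trip sketch, written to a temporary directory with made-up data so that it does not touch the real files:

```python
import os
import pickle
import tempfile

# Made-up stand-in for one of the result lists
example = ['summary of band A', '', 'summary of band C']

path = os.path.join(tempfile.mkdtemp(), 'example.p')
with open(path, 'wb') as f:      # binary mode is required for pickle
    pickle.dump(example, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == example)
```

Only files you created yourself should be unpickled, since loading a pickle can execute arbitrary code.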

## Part 2: Handle the data

For each page, the results are
* a string for the search query used
* a string for the summary
* a list of strings for the categories
* a list of strings for the images (image URLs)

The two lists of strings are handled differently. The categories are joined into a single comma-separated string. For the images, only JPG files are kept, and the first one in the list is selected.
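
Both rules can be sketched on a single made-up page result (the URLs and category names below are invented for illustration):

```python
categories = ['Scottish singers', 'Living people']
images = [
    'https://example.org/logo.svg',
    'https://example.org/portrait.JPG',
    'https://example.org/stage.jpg',
]

# Categories: join the list into one comma-separated string
categ_string = ', '.join(categories)

# Images: keep the jpg files, then take the first one (None if there is none)
jpg_images = [img for img in images if img[-3:].lower() == 'jpg']
first_jpg = jpg_images[0] if jpg_images else None

print(categ_string)
print(first_jpg)
```

Note that a test on the last three characters does not match files ending in `.jpeg`.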

### Cast the lists into dataframes

In [50]:
dfslist  = pd.DataFrame(slist,columns=['searchQ'])
dfsslist  = pd.DataFrame(sslist,columns=['sug_searchQ'])
dflsummary  = pd.DataFrame(lsummary,columns=['summary'])
dfslsummary = pd.DataFrame(slsummary,columns=['sug_summary'])

In [66]:
dfc = pd.concat([dfslist,dfsslist,dflsummary,dfslsummary],axis=1)

### Filter images

In [159]:
# Keep only the images whose URL ends in 'jpg' (case-insensitive)
def keepjpg(imagelists):
    return [[img for img in imgs if img[-3:].lower() == 'jpg']
            for imgs in imagelists]

filtimages = keepjpg(limages)
filtsimages = keepjpg(slimages)

In [160]:
# Take only the first image (None when a page has no jpg image)
dfimg = pd.DataFrame([imgs[0] if imgs else None for imgs in filtimages],
                     columns=['image'])
dfsimg = pd.DataFrame([imgs[0] if imgs else None for imgs in filtsimages],
                      columns=['sug_image'])

In [162]:
dfci = pd.concat([dfc,dfimg,dfsimg],axis=1)

### Turn the lists of categories into strings and make single-column dataframes

In [103]:
# Join each list of categories into one comma-separated string
categstring = [', '.join(cats) for cats in lcategories]
categstringsug = [', '.join(cats) for cats in slcategories]

In [106]:
dflcategories  = pd.DataFrame(categstring,columns=['categories'])
dfslcategories  = pd.DataFrame(categstringsug,columns=['sug_categories'])

In [163]:
dfcicat = pd.concat([dfci,dflcategories,dfslcategories],axis=1)

### Index the dataframe by the MJF id

In [164]:
# Add the MJFid to the dataframe
dff = pd.concat([dfcicat,CList.id],axis=1)
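
`pd.concat(..., axis=1)` aligns rows on the index, not on position. Here both frames carry the default 0..n-1 index, so row i of the results is paired with row i of `CList`. A small sketch with made-up frames shows what happens if one column is shorter than the other:

```python
import pandas as pd

# Made-up frames: two search results but three ids
results = pd.DataFrame({'searchQ': ['Amy Macdonald', 'Jorge Ben Jor']})
ids = pd.Series([13, 14, 15], name='id')

joined = pd.concat([results, ids], axis=1)

# The result has three rows; the missing search result becomes NaN
print(joined)
print(len(joined), joined.searchQ.isna().tolist())
```

No error is raised in that case, so checking that each list has `len(CList)` entries before building the dataframes is a cheap guard against silent misalignment.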

In [165]:
dff.columns

Index([u'searchQ', u'sug_searchQ', u'summary', u'sug_summary', u'image',
       u'sug_image', u'categories', u'sug_categories', u'id'],
      dtype='object')

In [166]:
dffx = dff.set_index('id')

In [167]:
# Save the file
import json
with open('MJFdatabands.json', 'w') as f:
    f.write(dffx.to_json(orient='index'))
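
With `orient='index'`, the file holds one JSON object keyed by the index value (here the MJF id), each entry mapping column names to values. The sketch below reads such a structure back with the standard `json` module; the record is a shortened, made-up example, parsed from a string instead of the real file.

```python
import json

# Made-up excerpt in the layout produced by to_json(orient='index')
text = '{"13": {"searchQ": "Amy Macdonald", "image": null}}'

data = json.loads(text)

# JSON object keys are always strings, so the id is looked up as "13"
band = data['13']
print(band['searchQ'])
print(band['image'] is None)
```

For the real file, `json.load` on the opened `MJFdatabands.json` returns the same kind of nested dictionary.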