# First question: Who is talking about cinema, in terms of gender and age?

In [1]:
import pandas as pd
import bz2
import json

from pathlib import Path

from ressources import config

## Task 1

We first import the data. Quotations whose source URLs contain one of the keywords 'movies', 'films' or 'cinema' were extracted from Quotebank and saved to a JSON file.
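
As an illustration of the keyword extraction described above, here is a minimal sketch in plain Python. The helper `mentions_cinema`, the substring matching rule and the toy records are assumptions made for this example; the actual extraction step is not shown in this notebook.

```python
# Keywords taken from the text; the matching rule (case-insensitive substring) is an assumption.
KEYWORDS = ('movies', 'films', 'cinema')


def mentions_cinema(urls):
    """Return True if at least one URL contains one of the keywords."""
    return any(keyword in url.lower() for url in urls for keyword in KEYWORDS)


# Toy records with made-up URLs, only for illustration.
records = [
    {'quoteID': 'a', 'urls': ['http://example.com/movies/review']},
    {'quoteID': 'b', 'urls': ['http://example.com/politics/vote']},
    {'quoteID': 'c', 'urls': ['http://example.com/news', 'http://example.com/Cinema/festival']},
]

# Keep the IDs of the quotations with at least one cinema-related URL.
cinema_ids = [record['quoteID'] for record in records if mentions_cinema(record['urls'])]
print(cinema_ids)
```

A quotation is kept as soon as one of its URLs matches, which is why record `c` passes even though its first URL does not.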

Summary of the features:

- quoteID : primary key of the quotation
- quotation : text of the longest encountered original form of the quotation
- speaker : selected most likely speaker. This matches the first speaker entry in `probas`
- qids : Wikidata IDs of all aliases that match the selected speaker
- date : earliest occurrence date of any version of the quotation
- numOccurrences : number of times this quotation occurs in the articles
- probas : array representing the probabilities of each speaker having uttered the quotation.
- urls : list of links to the original articles containing the quotation 
- phase : corresponding phase of the data in which the quotation first occurred (A-E)
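
To make the relation between `speaker` and `probas` concrete, the following sketch parses one JSON line and reads its first candidate. The record is hypothetical (made-up name, QID and probabilities) and only mimics the fields listed above.

```python
import json

# Hypothetical record modelled on the feature list; all values are illustrative.
line = json.dumps({
    'quoteID': '2018-01-01-000000',
    'quotation': 'A made-up quotation.',
    'speaker': 'Jane Doe',
    'qids': ['Q0'],
    'numOccurrences': 2,
    'probas': [['Jane Doe', 0.81], ['None', 0.19]],
    'phase': 'E',
})

record = json.loads(line)

# According to the feature list, the selected speaker is the first entry of `probas`.
top_speaker, top_proba = record['probas'][0]

# Assumption: an unattributed quotation carries the string 'None' as speaker.
is_attributed = record['speaker'] != 'None'
print(top_speaker, top_proba, is_attributed)
```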

In [2]:
# Load the dataframe in chunks to have a look at the data
path_to_file = Path('../generated/QUOTEBANK/quotes-2018-cinema.json.bz2')

df_reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=100000)
for chunk in df_reader:
    display(chunk.sample(1))

| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase |
|---|---|---|---|---|---|---|---|---|---|
| 29501 | 2018-12-20-005804 | And putting the song in the sequence where Aqu... | James Wan | [Q374286] | 2018-12-20 15:31:51 | 4 | [[James Wan, 0.624], [None, 0.2683], [Captain ... | [http://feeds.businessinsider.com.au/~/5884369... | E |
| 128420 | 2018-04-30-085445 | remembers these negative comments about Ashley... | Fran Walsh | [Q116861] | 2018-04-30 22:29:41 | 1 | [[Fran Walsh, 0.5222], [Peter Jackson, 0.2398]... | [http://www.msn.com/en-us/movies/news/ashley-j... | E |
| 220493 | 2018-02-12-069071 | Kate Upton comes out for Saturday's Jonathan S... | | [] | 2018-02-12 19:34:07 | 3 | [[None, 0.3795], [Coco Rocha, 0.2683], [Kate U... | [http://msn.com/en-us/movies/celebrity/hot-mam... | E |
| 382078 | 2018-11-17-017361 | I didn't steal anything but I actually got som... | Evanna Lynch | [Q211730] | 2018-11-17 07:00:00 | 1 | [[Evanna Lynch, 0.8702], [Jason Isaacs, 0.0692... | [http://digitalspy.com/movies/harry-potter/fea... | E |
| 427062 | 2018-04-26-124778 | The (gender gap) came out with technology, and... | Ellen Spiro | [Q5365019] | 2018-04-26 07:20:54 | 1 | [[Ellen Spiro, 0.8573], [None, 0.1427]] | [http://www.dailycal.org/2018/04/26/local-fema... | E |
| 500681 | 2018-10-04-054207 | I'm telling your wife, | | [] | 2018-10-04 18:00:01 | 1 | [[None, 0.8127], [True North, 0.1873]] | [https://vancouversun.com/entertainment/movies... | E |


## Task 2

Secondly, the data needs to be filtered. Some quotations are not attributed to any speaker, meaning that their most likely "speaker" is None. These unattributed quotations are dropped. Falling back on the second most likely candidate would make no sense, since that candidate is by definition less likely than no speaker at all.

In [46]:
# Drop rows without an attributed speaker and save the result to a new file
path_to_file = Path('../generated/QUOTEBANK/quotes-2018-cinema.json.bz2')
path_to_out = Path('../generated/QUOTEBANK/quotes-2018_cinema_filtered.json.bz2')
df_reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=100000)

with open(path_to_out, 'wb') as d_file:
    for chunk in df_reader:
        # Keep only the rows whose speaker is not the placeholder string 'None'
        wo_none = chunk.loc[chunk['speaker'] != 'None']
        wo_none.to_json(d_file, orient='records', compression='bz2', lines=True)

| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase |
|---|---|---|---|---|---|---|---|---|---|
| 86677 | 2018-05-11-113515 | the older your kids get [ on screen ], the old... | Gabrielle Union | [Q231648] | 2018-05-11 06:00:00 | 2 | [[Gabrielle Union, 0.748], [None, 0.252]] | [http://www.mcall.com/entertainment/movies/la-... | E |
| 142053 | 2018-10-18-149642 | would assume that that would be an option that... | David Rosado | [Q20657720] | 2018-10-18 15:30:00 | 2 | [[David Rosado, 0.8702], [None, 0.1025], [Luke... | [https://www.msn.com/en-us/news/us/city-fires-... | E |
| 210761 | 2018-05-03-140013 | Today, if you wanted to sell your little chick... | Judith Miller | [Q265810, Q6303576, Q978869] | 2018-05-03 11:10:09 | 1 | [[Judith Miller, 0.8113], [None, 0.1577], [Jus... | [http://www.thesouthernreporter.co.uk/news/tre... | E |
| 379585 | 2018-10-11-130919 | true story of how Atlanta high school educator... | Ryan Coogler | [Q7383978] | 2018-10-11 21:11:53 | 1 | [[Ryan Coogler, 0.4213], [None, 0.3725], [Mich... | [https://www.slashfilm.com/black-panther-2-dir... | E |
| 409121 | 2018-01-14-049139 | Thankyou Fans & well wishers u r truly support... | Romit Raj | [Q7363235] | 2018-01-14 09:03:36 | 2 | [[Romit Raj, 0.9344], [None, 0.0483], [Shilpa ... | [https://in.news.yahoo.com/bigg-boss-11-shilpa... | E |
| 500497 | 2018-11-08-091345 | So, I guess you guys heard, Special Agent in C... | Karen Page | [Q27987843] | 2018-11-08 17:32:12 | 11 | [[Karen Page, 0.8842], [None, 0.1158]] | [http://marvelcinematicuniverse.wikia.com/wiki... | E |
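
The filtering rule of Task 2 can be checked on a small in-memory dataframe before running it on the full file. The sketch below uses made-up rows and assumes that an unattributed quotation may appear either as the string 'None' or as a missing value; which of the two occurs depends on how the file was written.

```python
import pandas as pd

# Toy dataframe with made-up rows; real chunks come from pd.read_json.
toy = pd.DataFrame({
    'quoteID': ['q1', 'q2', 'q3', 'q4'],
    'speaker': ['Jane Doe', 'None', None, 'John Roe'],
})

# Assumption: an unattributed quote is either the string 'None' or a missing value.
attributed = toy['speaker'].notna() & (toy['speaker'] != 'None')
filtered = toy.loc[attributed]

print(f'kept {len(filtered)} of {len(toy)} quotations')
print(filtered['speaker'].tolist())
```

Comparing the number of rows before and after the filter is a cheap sanity check that the placeholder was matched as expected.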


## Task 3

The Quotebank data will then be enriched with the gender and age of each speaker, retrieved from Wikidata.
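
One possible shape for this enrichment is a left join between the quotes and a table of speaker attributes, sketched below. Everything here is hypothetical: the `speakers` table, its column names (`qid`, `gender`, `date_of_birth`), the QIDs and the dates are made up, and taking only the first QID of each speaker is a simplifying assumption.

```python
import pandas as pd

# Toy quotes; names, QIDs and dates are made up for illustration.
quotes = pd.DataFrame({
    'quoteID': ['q1', 'q2'],
    'speaker': ['Jane Doe', 'John Roe'],
    'qids': [['Q_A'], ['Q_B', 'Q_C']],
    'date': pd.to_datetime(['2018-05-11', '2018-10-18']),
})

# Hypothetical attribute table; in practice it would be built from Wikidata.
speakers = pd.DataFrame({
    'qid': ['Q_A', 'Q_B'],
    'gender': ['female', 'male'],
    'date_of_birth': pd.to_datetime(['1972-03-02', '1990-01-15']),
})

# Assumption: when several QIDs match a speaker, only the first one is used.
quotes['qid'] = quotes['qids'].apply(lambda ids: ids[0] if ids else None)

# A left join keeps every quotation, even if no attributes are found.
enriched = quotes.merge(speakers, on='qid', how='left')

# Approximate age: difference in calendar years at the time of the quotation.
enriched['age'] = enriched['date'].dt.year - enriched['date_of_birth'].dt.year
print(enriched[['speaker', 'gender', 'age']])
```

Speakers with several matching QIDs are ambiguous (homonyms), so dropping them instead of keeping the first QID would be an equally defensible choice.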

## Task 4

Analysing the data:
- Group by age and gender
- Create visualisations
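
The grouping step could look like the sketch below, which bins ages with `pd.cut` and counts quotations per age group and gender. The toy data, the column names and the bin edges are assumptions chosen for the example.

```python
import pandas as pd

# Toy enriched data; values are made up for illustration.
enriched = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'male'],
    'age': [46, 28, 63, 31, 35],
})

# Assumed age brackets; the bin edges are a free choice.
bins = [0, 30, 50, 120]
labels = ['<30', '30-49', '50+']
enriched['age_group'] = pd.cut(enriched['age'], bins=bins, labels=labels, right=False)

# Count quotations per age group and gender, one column per gender.
counts = (
    enriched.groupby(['age_group', 'gender'], observed=False)
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

With matplotlib installed, `counts.plot.bar()` would turn this table into a grouped bar chart, which covers the visualisation item of the list.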