# First question: who talks about cinema, in terms of gender and age?

In [1]:
import pandas as pd
import bz2
import json

from pathlib import Path

from ressources import config

## Importing data

Quotations whose URLs contain the keywords 'movies', 'films' and 'cinema' were extracted from Quotebank into a JSON file.

Summary of the features:

- quoteID : primary key of the quotation
- quotation : text of the longest encountered original form of the quotation
- speaker : selected most likely speaker. This matches the first speaker entry in `probas`
- qids : wikidata IDs of all aliases that match the selected speaker
- date : earliest occurrence date of any version of the quotation
- numOccurrences : number of times this quotation occurs in the articles
- probas : array representing the probabilities of each speaker having uttered the quotation.
- urls : list of links to the original articles containing the quotation 
- phase : corresponding phase of the data in which the quotation first occurred (A-E)
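
To make the relationship between `speaker` and `probas` concrete, here is a minimal sketch on a made-up record. The record values and the helper `top_speaker` are hypothetical, and the sketch assumes that each `probas` entry is a `[name, probability]` pair whose probability may be stored as a string, hence the `float()` conversion.

```python
# Hypothetical record following the schema above; the values are invented.
record = {
    "quoteID": "2018-01-01-000001",
    "speaker": "Jane Doe",
    "probas": [["Jane Doe", "0.75"], ["None", "0.25"]],
}


def top_speaker(probas):
    """Return the (name, probability) pair with the highest probability."""
    name, prob = max(probas, key=lambda pair: float(pair[1]))
    return name, float(prob)


name, prob = top_speaker(record["probas"])

# A quotation counts as attributed when the most likely candidate is not 'None'.
is_attributed = name != "None"
print(name, prob, is_attributed)
```

If the assumption holds, `name` should equal the `speaker` field of the same record.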

In [37]:
# Load the dataframe in chunks and preview one random row per chunk (preview only, not needed for the analysis)
path_to_file = Path('../generated/QUOTEBANK/quotes-2018-cinema.json.bz2')

df_reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=100000)
for chunk in df_reader:
    display(chunk.sample(1))

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
31325,2018-06-09-020360,I took my kids to it and it's been such a joy ...,Robert Smigel,[Q779151],2018-06-09,43,"[[Robert Smigel, 0.7924], [None, 0.1909], [Kel...",[http://abcnews.go.com/Entertainment/wireStory...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
197360,2018-02-15-054265,"I think that's wonderful, because I dress like...",,[],2018-02-15 10:03:20,1,"[[None, 0.4236], [Chadwick Boseman, 0.2125], [...",[http://lasvegasweekly.com/ae/film/2018/feb/15...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
243780,2018-06-01-090015,The Lines that Taught me Life.. The lines that...,,[],2018-06-01 05:34:00,3,"[[None, 0.7146], [Deepika Padukone, 0.2855]]",[http://www.sify.com/movies/naina-from-yeh-jaw...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
394826,2018-12-18-033509,I just want to go back because I love it. When...,,[],2018-12-18 13:43:19,24,"[[None, 0.8566], [John Cena, 0.0744], [John Or...",[https://www.seattletimes.com/entertainment/wi...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
401769,2018-03-02-094048,"She is the best stage wife ever, and I hope pe...",Josh Lucas,"[Q37904085, Q53651]",2018-03-02 09:32:00,5,"[[Josh Lucas, 0.6233], [Uma Thurman, 0.3047], ...",[http://www.eonline.com/news/917638/uma-thurma...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
500154,2018-06-07-033922,"I could have done with you, 9,000 Dubliners, `...",Michael McIntyre,"[Q20640481, Q2546628, Q6832726]",2018-06-07 09:16:15,187,"[[Michael McIntyre, 0.766], [None, 0.234]]",[https://www.belfasttelegraph.co.uk/news/repub...,E


## Filter data

Some quotations are not attributed to a most likely speaker, so these unattributed quotes are removed. Falling back on the second most likely speaker would not make sense, because that candidate is by definition less likely than "no speaker".

In [46]:
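# Drop rows without an attributed speaker and save the result to a new file
path_to_file = Path('../generated/QUOTEBANK/quotes-2018-cinema.json.bz2')
path_to_out = Path('../generated/QUOTEBANK/quotes-2018_cinema_filtered.json.bz2')
df_reader = pd.read_json(path_to_file, lines=True, compression='bz2', chunksize=100000)

# bz2.open in text mode compresses on write; depending on the pandas version,
# the `compression` argument of to_json may be ignored for an open file handle
with bz2.open(path_to_out, 'wt') as d_file:
    for chunk in df_reader:
        # Unattributed quotes carry the string 'None' (or a missing value) as speaker
        wo_none = chunk.loc[chunk['speaker'].notna() & (chunk['speaker'] != 'None')]
        json_lines = wo_none.to_json(orient='records', lines=True)
        # Make sure each chunk ends with a newline so records do not run together
        if json_lines and not json_lines.endswith('\n'):
            json_lines += '\n'
        d_file.write(json_lines)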

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
86677,2018-05-11-113515,"the older your kids get [ on screen ], the old...",Gabrielle Union,[Q231648],2018-05-11 06:00:00,2,"[[Gabrielle Union, 0.748], [None, 0.252]]",[http://www.mcall.com/entertainment/movies/la-...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
142053,2018-10-18-149642,would assume that that would be an option that...,David Rosado,[Q20657720],2018-10-18 15:30:00,2,"[[David Rosado, 0.8702], [None, 0.1025], [Luke...",[https://www.msn.com/en-us/news/us/city-fires-...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
210761,2018-05-03-140013,"Today, if you wanted to sell your little chick...",Judith Miller,"[Q265810, Q6303576, Q978869]",2018-05-03 11:10:09,1,"[[Judith Miller, 0.8113], [None, 0.1577], [Jus...",[http://www.thesouthernreporter.co.uk/news/tre...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
379585,2018-10-11-130919,true story of how Atlanta high school educator...,Ryan Coogler,[Q7383978],2018-10-11 21:11:53,1,"[[Ryan Coogler, 0.4213], [None, 0.3725], [Mich...",[https://www.slashfilm.com/black-panther-2-dir...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
409121,2018-01-14-049139,Thankyou Fans & well wishers u r truly support...,Romit Raj,[Q7363235],2018-01-14 09:03:36,2,"[[Romit Raj, 0.9344], [None, 0.0483], [Shilpa ...",[https://in.news.yahoo.com/bigg-boss-11-shilpa...,E


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
500497,2018-11-08-091345,"So, I guess you guys heard, Special Agent in C...",Karen Page,[Q27987843],2018-11-08 17:32:12,11,"[[Karen Page, 0.8842], [None, 0.1158]]",[http://marvelcinematicuniverse.wikia.com/wiki...,E


## Enriching data

Gender and age of the speakers need to be added from Wikidata.
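
One possible way to do this enrichment is sketched below on toy data. The attribute table `speakers` and its columns `qid`, `gender` and `birth_year` are assumptions made for illustration, not the actual Wikidata export. Keeping only speakers with a single QID is also just one way to handle ambiguous names.

```python
import pandas as pd

# Toy quotes following the Quotebank schema (values invented).
quotes = pd.DataFrame({
    "quoteID": ["q1", "q2", "q3"],
    "speaker": ["A", "B", "C"],
    "qids": [["Q1"], ["Q2", "Q3"], ["Q4"]],
    "date": pd.to_datetime(["2018-03-01", "2018-06-15", "2018-11-30"]),
})

# Hypothetical speaker attributes; column names are assumptions.
speakers = pd.DataFrame({
    "qid": ["Q1", "Q2", "Q3"],
    "gender": ["female", "male", "male"],
    "birth_year": [1980, 1975, 1990],
})

# Keep only quotes whose speaker name maps to exactly one Wikidata entity.
unambiguous = quotes[quotes["qids"].str.len() == 1].copy()
unambiguous["qid"] = unambiguous["qids"].str[0]

# A left join keeps quotes even when no attributes are found (values become NaN).
enriched = unambiguous.merge(speakers, on="qid", how="left")

# Approximate age at the time of the quotation, in years.
enriched["age"] = enriched["date"].dt.year - enriched["birth_year"]
print(enriched[["quoteID", "speaker", "gender", "age"]])
```

Quotes whose QID has no match in the attribute table keep missing values for `gender` and `age`, so they can be counted or dropped explicitly later.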

## Analysing data

Group by age and gender
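
A minimal sketch of such a grouping, assuming an enriched dataframe with `gender` and `age` columns (both column names, the toy values and the age bins are illustrative choices, not results):

```python
import pandas as pd

# Toy enriched data; values are invented for illustration.
enriched = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male"],
    "age": [25, 34, 61, 47, 38],
})

# Left-closed bins: [0, 30), [30, 50), [50, 120).
bins = [0, 30, 50, 120]
labels = ["<30", "30-49", "50+"]
enriched["age_group"] = pd.cut(enriched["age"], bins=bins, labels=labels, right=False)

# Count quotations per gender and age group; observed=False keeps empty groups as 0.
counts = (
    enriched.groupby(["gender", "age_group"], observed=False)
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

The resulting table has one row per gender and one column per age group, which is convenient for plotting a grouped bar chart.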