In [None]:
import pandas as pd
import sparql_dataframe

wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


<span style="color:red">#### The "Audience" query: *what audience is the most sexist?* ---- *Is there any decade in which the reviews are the most sexist?*</span>

<span style="color:red">I think we could skip these queries for the SPARQL endpoint, as they could be implemented as "audience" visualizations directly:</span>
<span style="color:red">1. the Genre value is already present in the finalmovies.csv file and it is not a "missing" value which we need to look for in the SPARQL endpoint</span>
<span style="color:red">((2. we'd need to associate reviews and movies through different CSV files, and I think this would take too long at the moment))</span>

#### The "Characters" queries
##### Bechdel test query: *how many of the [selected and tested for Bechdel] movies have **male** directors?*
To answer this query (and the following one, regarding character dialogues) we gather data from the `dialogue_bechdel.csv` file.

We first create an **empty list `film_list_bechdel`**, which will hold tuples pairing the IMDb id of each movie (`imdbid` column) with its result in the Bechdel test (`bechdel_rating` column).
The `bechdel_rating` values mean:
- 0 &rarr; FAILED the first criterion
- 1 &rarr; FAILED the second criterion
- 2 &rarr; FAILED the third criterion
- 3 &rarr; PASSED the test (passed all three criteria)
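
As a small illustration of the rating scale above, the mapping can be written as a lookup table. This is only a sketch: `BECHDEL_LABELS` and `describe_rating` are hypothetical names that do not appear elsewhere in the notebook, and the label strings simply mirror the ones used in the query cell further down.

```python
# Hypothetical lookup table mirroring the rating scale listed above
BECHDEL_LABELS = {
    0: "FAILED criteria 1",
    1: "FAILED criteria 2",
    2: "FAILED criteria 3",
    3: "PASSED",
}

def describe_rating(rating):
    """Return a readable label for a numeric Bechdel rating (e.g. 2.0)."""
    # Ratings are read from the CSV as floats, so cast to int for the lookup
    return BECHDEL_LABELS[int(rating)]

print(describe_rating(2.0))
```

A dictionary keeps the rating-to-label correspondence in one place, which may be easier to maintain than a chain of `if`/`elif` branches.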

Then, we **read the CSV file as a dataframe and clean it**, dropping all the movies which have not been tested for the Bechdel test: these movies will have a `NaN` value under the `bechdel_rating` column.

Then, we populate the `film_list_bechdel` and measure its length: this is the total number of movies which have been tested for the Bechdel test (72).
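
The cleaning and list-building steps can be tried out on a toy dataframe first. The sketch below uses made-up ids and ratings (not taken from `dialogue_bechdel.csv`), and the names `toy_df`, `tested_df` and `toy_film_list` are hypothetical; it also builds the tuples with `zip` rather than row-by-row iteration, which is an alternative to the loop used in the next cell, not the notebook's own method.

```python
import pandas as pd

# Toy data standing in for dialogue_bechdel.csv (ids and ratings are made up)
toy_df = pd.DataFrame({
    "imdbid": [12345, 123456, 1234567],
    "bechdel_rating": [0.0, None, 3.0],  # None becomes NaN: movie not tested
})

# Keep only the rows that have a Bechdel rating
tested_df = toy_df.dropna(axis=0, subset=["bechdel_rating"])

# Pair each IMDb id with its rating, column against column
toy_film_list = list(zip(tested_df["imdbid"], tested_df["bechdel_rating"]))

print(toy_film_list)
print("Tested movies:", len(toy_film_list))
```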

In [None]:
film_list_bechdel = list()

df = pd.read_csv('../data/dialogue/dialogue_bechdel.csv')

bechdel_df = df.dropna(axis=0,subset=["bechdel_rating"])

for idx, row in bechdel_df.iterrows():
    # Named `film` rather than `tuple`, to avoid shadowing the built-in type
    film = (row["imdbid"], row["bechdel_rating"])
    film_list_bechdel.append(film)

n_films = len(film_list_bechdel)
print("Total number of movies tested for Bechdel test:\t",n_films)

Before querying the SPARQL endpoint, we need to convert our numeric IMDb ids into the format stored in Wikidata: the `tt` prefix followed by the number, zero-padded to seven digits. We do so with the helper function `createIMDBid`.

In [None]:
def createIMDBid(code):
    # IMDb ids are "tt" followed by the number, left-padded with zeros
    # to at least 7 digits (e.g. 12345 -> "tt0012345")
    return "tt" + str(code).zfill(7)

We then create an empty dataframe, `Bechdel_df`, with three columns (`Movie`, `Director`, `Bechdel_result`), to be populated with the results of our query, which takes one movie at a time from `film_list_bechdel`.

Finally, we query the SPARQL endpoint, selecting only the movies from our list which have a **male director** (that is, a director whose `wdt:P21` "sex or gender" property points to the Wikidata item `wd:Q6581097`, "male"). We populate the `Bechdel_df` dataframe with only these movies, leaving out the rest.
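
The query is built with Python's `str.format`, which is why the SPARQL braces are doubled in the template: `{{` and `}}` produce literal braces, while `{imdbid}` is replaced by the actual id. A minimal sketch with a shortened, hypothetical template (`template` and the id `tt0012345` are illustrative only):

```python
# Hypothetical, shortened template: doubled braces are literal,
# {imdbid} is the placeholder filled in by str.format
template = "SELECT ?movie WHERE {{ ?movie wdt:P345 '{imdbid}' . }}"

query = template.format(imdbid="tt0012345")
print(query)
# SELECT ?movie WHERE { ?movie wdt:P345 'tt0012345' . }
```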

Now, if we count the rows of `Bechdel_df`, we get the **total number of male directors of the movies tested for the Bechdel test**.
Note that this number is actually higher than the total number of movies tested for the Bechdel test (`n_films`): this is because some movies have more than one director, and each director produces a separate row.

The result of this query means that **no matter the result of the Bechdel test, all the movies which have been tested for it have male directors**.
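
Since co-directed movies contribute several rows, the row count by itself does not say how many distinct movies have a male director. One rough way to check, sketched here on made-up `(movie, director)` rows with hypothetical names (`rows`, `n_tested`), is to compare the number of distinct movies returned with the number of tested movies:

```python
# Hypothetical (movie, director) rows, shaped like the query results
rows = [
    ("Movie A", "Director 1"),
    ("Movie A", "Director 2"),  # co-directed movie: two rows
    ("Movie B", "Director 3"),
]
n_tested = 3  # hypothetical: a third movie returned no male director

n_rows = len(rows)
movies_with_male_director = {movie for movie, _ in rows}

# True only if every tested movie appears at least once in the results
all_have_male_director = len(movies_with_male_director) == n_tested

print("Rows:", n_rows)
print("Distinct movies:", len(movies_with_male_director))
print("All tested movies have a male director:", all_have_male_director)
```

On the real dataframe, the equivalent count would be `Bechdel_df["Movie"].nunique()`, to be compared with `n_films`.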

In [None]:
# SPARQL QUERY 
Bechdel_df = pd.DataFrame(columns=["Movie", "Director", "Bechdel_result"])

# The query template is the same for every movie, so it is defined once;
# doubled braces are literal braces for str.format
query_gender_director = '''
    SELECT ?Movie ?Director
    WHERE {{
        ?movie wdt:P345 '{imdbid}' ;
                wdt:P57 ?director ;
                rdfs:label ?Movie .
        ?director rdfs:label ?Director ;
                    wdt:P21 wd:Q6581097 .
        filter ((lang(?Director) = "en") && (lang(?Movie) = "en"))
    }}
'''

for imdb_code, rating in film_list_bechdel:
    imdb_id = createIMDBid(imdb_code)

    result_query = sparql_dataframe.get(wikidata_endpoint, query_gender_director.format(imdbid=imdb_id))

    # Label only the rows returned for this movie, using this movie's rating
    if rating == 0.0:
        result_query["Bechdel_result"] = "FAILED criteria 1"
    elif rating == 1.0:
        result_query["Bechdel_result"] = "FAILED criteria 2"
    elif rating == 2.0:
        result_query["Bechdel_result"] = "FAILED criteria 3"
    else:
        result_query["Bechdel_result"] = "PASSED"

    Bechdel_df = pd.concat([Bechdel_df, result_query], ignore_index=True)

# One row per (movie, male director) pair
total_Mdirectors = len(Bechdel_df.index)
print("Total number of male directors of the movies tested for the Bechdel test:\t", total_Mdirectors)

##### Characters dialogue query: *how many of the [selected] films have **male** directors?*

#### Gaze score queries
##### GS query 1: *Which genres do the top 10 films in the gaze score ranking belong to?*

##### GS query 2: *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*

##### GS query 3: *Is there any decade in which the films rank higher in the gaze score ranking?*