# Preprocessed Dataframe

## Input:
In this notebook we work with the two dataframes stored in "credits_cleaned.csv" and "movies_metadata_cleaned.csv", which we created earlier from the original datasets.
We have already removed all movies that do not occur in all of our datasets (that is, the script files as well as the two dataframes with metadata about the movies).

## Output:
Now we want to combine the cleaned "credits_cleaned" and "movie_metadata" data into a single dataframe that contains only the columns relevant to us.
To do so, we create a new dataframe and take over the relevant columns from the respective source dataframe.
Note that we split the genre column into three separate columns, one per genre, instead of keeping one column that holds a list of dictionaries for all genres,
as we need to access the genres individually later on.
At the end, we get a single dataframe with the following columns: character, actress/actor, gender, title_of_movie, genre1, genre2, genre3, release_date, budget, voting, original_language
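
As a minimal sketch of the genre split described above, assuming each entry of the genre column is the string form of a list of dictionaries with a `name` key. The sample string and the helper name `first_three_genres` are illustrative and not taken from the dataset:

```python
import ast


def first_three_genres(raw, n=3):
    """Return the first n genre names from a stringified list of dicts.

    Missing positions are padded with empty strings, so the result
    always has exactly n entries.
    """
    genres = ast.literal_eval(raw)  # safely parse the string into Python objects
    names = [genre.get("name", "") for genre in genres[:n]]
    return names + [""] * (n - len(names))


# Hypothetical sample entry with only two genres
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
genre1, genre2, genre3 = first_three_genres(sample)
print(genre1, genre2, repr(genre3))
```

Padding with empty strings keeps the three genre columns aligned even for movies that list fewer than three genres.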


In [1]:
# Import necessary libraries
import pandas as pd 
import re  # the standard-library module covers everything used in this notebook

In [2]:
# loading both our cleaned datasets
df = pd.read_csv('credits_cleaned.csv')
df_2 = pd.read_csv('movies_metadata_cleaned.csv')

### Creating the new Dataframe

In [6]:
# let's create a dataframe that only holds the useful information
clean_data = {'character': [], "actress/actor": [], "gender": [], "title_of_movie": [], "genre1": [], "genre2": [], "genre3": [], "release_date": [], "budget": [], "voting": [], "original_language": []} 
# this creates an empty dataframe, only the column names are defined
clean_frame = pd.DataFrame(clean_data)

# ast.literal_eval parses the stringified lists without executing arbitrary code
import ast

# let's get the useful columns from our datasets
cast_col = df['cast']
# Here we turn the strings in the cast_col into lists of dictionaries, so we can work with them
cast_col = [ast.literal_eval(single_movie) for single_movie in cast_col]
# getting the titles for our dataframe
title_col = df_2['original_title']
# getting the genres of the movies
genres_col = df_2['genres']
# Here we turn the strings in the genres_col into lists of dictionaries, so we can work with them
genres_col = [ast.literal_eval(single_movie) for single_movie in genres_col]
# Extracting the genres is a bit more complicated, because all genres of a movie are stored
# as a list of dictionaries in the single genre column.
# Because the dictionaries also hold other information, we create a flat list with only the genre names
# (exactly 3 entries for each movie)
genre_list = []
# We want to look into every movie of the genre column
for movie_genres in genres_col:
    # some movies have more than 3 genres, but we are only interested in the first 3
    for position in range(3):
        # We try to append the genre name to our genre list
        try:
            genre_list.append(movie_genres[position].get('name'))
        # If the movie has fewer genres, we append an empty string instead
        except IndexError:
            genre_list.append("")
# getting the release dates of the movies
release_col = df_2['release_date']
# getting the budget of the movies
budget_col = df_2['budget']
# getting the voting of the movies
voting_col = df_2['vote_average']
# getting the original language of the movies
language_col = df_2['original_language']

In [7]:
index_counter = -1  # running row index for clean_frame
# position of the current movie's first genre in the flat "genre_list"
count = 0
# the outer loop operates on the level of the different movies stored in our title column
for index in range(len(title_col)):
    # the inner loop looks at the single cast members of the respective movie,
    # which we access in cast_col via the index from the outer loop
    for member in cast_col[index]:
        # we add one row for every cast member in each movie
        index_counter += 1
        # note how we use the "count" variable to get the proper index for our "genre_list"
        clean_frame.loc[index_counter] = (
            member['character'], member['name'], member['gender'], title_col[index],
            genre_list[count], genre_list[count + 1], genre_list[count + 2],
            release_col[index], budget_col[index], voting_col[index], language_col[index],
        )
    count += 3  # we raise this count by 3 because our genre list holds 3 genres for each movie

# At the end, we reset our index
clean_frame = clean_frame.reset_index(drop=True)
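
The cell above appends one row at a time with `.loc`, which can become slow for large frames. An alternative sketch, not the method used in this notebook, collects the rows in a plain list and builds the dataframe once at the end. The toy titles and cast entries below are hypothetical stand-ins for the real columns, and only four of the eleven columns are shown:

```python
import pandas as pd

# Hypothetical toy data standing in for title_col and cast_col
titles = ["Movie A", "Movie B"]
casts = [
    [{"character": "Anna", "name": "Actor One", "gender": 1}],
    [
        {"character": "Ben", "name": "Actor Two", "gender": 2},
        {"character": "Cleo", "name": "Actor Three", "gender": 0},
    ],
]

rows = []
for title, cast in zip(titles, casts):
    # one output row per cast member of each movie
    for member in cast:
        rows.append({
            "character": member["character"],
            "actress/actor": member["name"],
            "gender": member["gender"],
            "title_of_movie": title,
        })

# build the dataframe in a single step from the collected rows
toy_frame = pd.DataFrame(rows)
print(toy_frame.shape)
```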


## Getting missing Genders 
Our dataset is incomplete with respect to the genders of the actresses/actors.
All missing values are marked with the gender "0.0".
In the following code snippets,
we try to reduce the number of missing gender values by accessing the actresses'/actors' German Wikipedia entries and extracting their genders from there.
Note that some values are still missing afterwards, because some lesser-known people in our dataset don't have a Wikipedia entry,
and we didn't see an easily realizable way to fill in those remaining values.
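
The classification idea can be tried without any network access. The following is a minimal sketch, assuming the short description of a person is already available as a string. The helper name `gender_from_description` and the sample descriptions are illustrative, and the case-insensitive search is a simplification of the pattern used in the function below:

```python
import re

# Matches the German word for "actress" anywhere in the text, ignoring case
ACTRESS_PATTERN = re.compile(r"schauspielerin", re.IGNORECASE)


def gender_from_description(description):
    """Return 1.0 (female) if the text mentions 'Schauspielerin', else 2.0 (male)."""
    return 1.0 if ACTRESS_PATTERN.search(description) else 2.0


# Hypothetical sample descriptions
print(gender_from_description("US-amerikanische Schauspielerin"))
print(gender_from_description("britischer Schauspieler"))
```

Note that every description without the word "Schauspielerin" falls into the 2.0 branch, so people who are not described as actresses or actors at all are also labelled male by this rule.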

In [10]:
def get_gender(name):
    """
    Assigns a gender to a person according to their German Wikipedia entry.
  
    Parameters:
    name (str): the name of the actor/actress whose gender is currently missing
  
    Returns:
    float: either 1.0 (for female) or 2.0 (for male)
  
    """
    # processing of the name so it fits into the URL
    name = name.replace(" ", "_")
    # getting all html tables of the respective Wiki page
    scraper = pd.read_html("https://de.wikipedia.org/wiki/{}".format(name))
        
    index = 0
    # In case there are multiple HTML tables in the wiki-entry, 
    # we are filtering for the one called "Personendaten"
    for idx, table in enumerate(scraper):
        if table.columns[0] == "Personendaten":
            index = idx
            break
                
    # changing the index column so we can access the entry via "KURZBESCHREIBUNG"
    new_index = scraper[index].set_index('Personendaten')
    # Accessing the short description, which states either "Schauspieler" or "Schauspielerin"
    # We decide on the gender of the person via the gendered German job title "Schauspieler" vs. "Schauspielerin"
    personen_daten = new_index.at['KURZBESCHREIBUNG', 'Personendaten.1']
    # Regular expression that matches any string containing the German word for "actress"
    g = re.compile(r".*(S|s)(chauspielerin).*", re.DOTALL)

    # Whether the "personen_daten" entry matches our regular expression determines the assigned gender of our person
    if g.match(personen_daten):
        # 1.0 == female in our DataFrame
        assigned_gender = 1.0
    else:
        # 2.0 == male in our DataFrame
        assigned_gender = 2.0
        
    return assigned_gender

In [None]:
# We apply our function to every entry in our gender column
for i, line in enumerate(clean_frame['gender']):
    # In case there is a row with no assigned gender
    if line == 0.0:
        # We try to get the gender through their Wikipedia entry
        try:
            name = clean_frame.at[i, 'actress/actor']
            gender = get_gender(name)
        # If this is not possible (maybe the person doesn't have a wiki entry), we leave it unassigned
        except Exception:
            gender = 0.0
        clean_frame.at[i, 'gender'] = gender

# Saving the dataframe into a CSV for later use, so we don't have to run the code of this notebook every time
clean_frame.to_csv("movie_data_cleaned.csv")
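
To see how many entries are still unassigned after this step, a quick check along the following lines may help. This is a sketch on a hypothetical toy column, assuming 0.0 still marks a missing gender:

```python
import pandas as pd

# Hypothetical toy column standing in for clean_frame['gender']
toy = pd.DataFrame({"gender": [0.0, 1.0, 2.0, 0.0]})

# count the rows that are still marked as missing and their share of all rows
missing = int((toy["gender"] == 0.0).sum())
share = missing / len(toy)
print(missing, share)
```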

## Final Dataframe 
Now we have created the dataframe that we will later use to conduct the actual Bechdel test!