# Computer Vision Analysis: color image processing

In this notebook I extracted the 5 most prominent colors of each cover and classified them, to determine whether cover colors differed according to the most common gender among the cast and directors of each movie and TV show.

I used the *OpenCV* library to read the images as arrays.

In [1]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import cv2
from collections import Counter
from skimage.color import rgb2lab, deltaE_cie76
import os
import re

%matplotlib inline

import sys
sys.path.insert(0, '../Functions')
from Functions_EDA import *

## Reading the dataset

I read the .csv file obtained from the notebook *Webscrapping the images* and took only the most relevant columns for this analysis.

In [2]:
data = pd.read_csv('../Datasets/final_dataset_clean_coverInfo.csv',index_col=0)
data.head()

Unnamed: 0,Title,Link,Type,Year_Release,Country,Directors,Women_Directors,Men_Directors,Not_Set_Directors,Total_Directors,...,Description,RatingAverage,Votes,Reviews,Country_Code,Continent,Most_Common_Gender_Cast,Most_Common_Gender_Directors,Cover_Image_Info,Cover_Image_Path
0,Money Heist,https://www.filmaffinity.com/us/film879405.html,TV show,2017,Spain,"Álex Pina, Jesús Colmenar, Miguel Ángel Vivas,...",0,5,2,7,...,TV Series (2017-Present Day). 4 Seasons. A mys...,7.1,25691.0,"""[4th Season Review]: [It] is like an extended...",ESP,Europe,Men,Men,('https://pics.filmaffinity.com/la_casa_de_pap...,covers/MoneyHeistTVSeries.jpg
1,The Blacklist,https://www.filmaffinity.com/us/film573633.html,TV show,2013,United States of America,"Jon Bokenkamp, Michael W. Watkins, Andrew McCa...",4,32,0,36,...,"The world's most wanted criminal, Thomas Raymo...",6.4,5148.0,"""His name is above the title and, depending ho...",USA,Americas,Men,Men,('https://pics.filmaffinity.com/the_blacklist_...,covers/TheBlacklistTVSeries.jpg
2,Locked Up,https://www.filmaffinity.com/us/film441483.html,TV show,2015,Spain,"Iván Escobar, Esther Martínez Lobato, Daniel É...",2,3,4,9,...,Macarena Ferreiro is a young naive woman who f...,7.0,6941.0,,ESP,Europe,Women,Men,('https://pics.filmaffinity.com/vis_a_vis_tv_s...,covers/LockedUpTVSeries.jpg
3,Prison Break,https://www.filmaffinity.com/us/film822756.html,TV show,2005,United States of America,"Paul Scheuring, Bobby Roth, Kevin Hooks, Dwigh...",1,30,1,32,...,TV Series (2005-2009). 5 Seasons. 90 Episodes....,7.3,71511.0,"""A strong cast led by Wentworth Miller (...) I...",USA,Americas,Men,Men,('https://pics.filmaffinity.com/prison_break_t...,covers/PrisonBreakTVSeries.jpg
4,13 Reasons Why,https://www.filmaffinity.com/us/film999360.html,TV show,2017,United States of America,"Brian Yorkey, Tom McCarthy, Kyle Patrick Alvar...",2,5,0,7,...,"'Thirteen Reasons Why', based on the best-sell...",6.8,21496.0,"""[2nd Season Review]: [It] is a frustratingly ...",USA,Americas,Men,Men,('https://pics.filmaffinity.com/13_reasons_why...,covers/ThirteenReasonsWhyTVSeries.jpg


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2186 entries, 0 to 2185
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Title                         2186 non-null   object 
 1   Link                          2186 non-null   object 
 2   Type                          2186 non-null   object 
 3   Year_Release                  2186 non-null   int64  
 4   Country                       2186 non-null   object 
 5   Directors                     2167 non-null   object 
 6   Women_Directors               2186 non-null   int64  
 7   Men_Directors                 2186 non-null   int64  
 8   Not_Set_Directors             2186 non-null   int64  
 9   Total_Directors               2186 non-null   int64  
 10  %_Women_Directors             2186 non-null   float64
 11  %_Men_Directors               2186 non-null   float64
 12  %_Not_Set_Directors           2186 non-null   float64
 13  Cas

In [4]:
data.columns

Index(['Title', 'Link', 'Type', 'Year_Release', 'Country', 'Directors',
       'Women_Directors', 'Men_Directors', 'Not_Set_Directors',
       'Total_Directors', '%_Women_Directors', '%_Men_Directors',
       '%_Not_Set_Directors', 'Cast', 'Women_Cast', 'Men_Cast', 'Not_Set_Cast',
       'Total_Cast', '%_Women_Cast', '%_Men_Cast', '%_Not_Set_Cast', 'Genres',
       'Description', 'RatingAverage', 'Votes', 'Reviews', 'Country_Code',
       'Continent', 'Most_Common_Gender_Cast', 'Most_Common_Gender_Directors',
       'Cover_Image_Info', 'Cover_Image_Path'],
      dtype='object')

In [5]:
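# .copy() makes df an independent DataFrame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
df = data[['Title','Year_Release', 'Country', 'Country_Code', 'Continent',
           'RatingAverage', 'Votes', 'Genres', 'Most_Common_Gender_Cast', 
           'Most_Common_Gender_Directors', 'Cover_Image_Path']].copy()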
df.head()

Unnamed: 0,Title,Year_Release,Country,Country_Code,Continent,RatingAverage,Votes,Genres,Most_Common_Gender_Cast,Most_Common_Gender_Directors,Cover_Image_Path
0,Money Heist,2017,Spain,ESP,Europe,7.1,25691.0,"TV Series, Thriller, Mystery, Heist Film, Kidn...",Men,Men,covers/MoneyHeistTVSeries.jpg
1,The Blacklist,2013,United States of America,USA,Americas,6.4,5148.0,"TV Series, Mystery, Drama, Crime, Spy Film",Men,Men,covers/TheBlacklistTVSeries.jpg
2,Locked Up,2015,Spain,ESP,Europe,7.0,6941.0,"TV Series, Thriller, Drama, Prison Drama",Women,Men,covers/LockedUpTVSeries.jpg
3,Prison Break,2005,United States of America,USA,Americas,7.3,71511.0,"TV Series, Action, Drama, Prison Drama, Cop Mo...",Men,Men,covers/PrisonBreakTVSeries.jpg
4,13 Reasons Why,2017,United States of America,USA,Americas,6.8,21496.0,"TV Series, Drama, Mystery, Teen/coming-of-age,...",Men,Men,covers/ThirteenReasonsWhyTVSeries.jpg


## Reading the images in array format

From here, I read the covers as RGB arrays so that I could detect the most used colors and classify them later.

To detect the colors, I used the KMeans model from the *sklearn* library.

In [6]:
def get_image(image_path):
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return image

In [7]:
images = []

for item in df.index:
    images.append(get_image(df.Cover_Image_Path.iloc[item]))

In [8]:
images

[array([[[2, 2, 2],
         [0, 3, 2],
         [0, 6, 4],
         ...,
         [2, 2, 4],
         [2, 2, 4],
         [2, 2, 4]],
 
        [[3, 1, 2],
         [0, 3, 2],
         [0, 6, 4],
         ...,
         [2, 2, 4],
         [2, 2, 4],
         [2, 2, 4]],
 
        [[5, 0, 4],
         [1, 2, 4],
         [0, 6, 4],
         ...,
         [2, 2, 4],
         [2, 2, 4],
         [2, 2, 4]],
 
        ...,
 
        [[3, 3, 5],
         [3, 3, 5],
         [3, 3, 5],
         ...,
         [3, 3, 5],
         [3, 3, 5],
         [3, 3, 5]],
 
        [[3, 3, 5],
         [3, 3, 5],
         [3, 3, 5],
         ...,
         [3, 3, 5],
         [3, 3, 5],
         [3, 3, 5]],
 
        [[3, 3, 5],
         [3, 3, 5],
         [3, 3, 5],
         ...,
         [3, 3, 5],
         [3, 3, 5],
         [3, 3, 5]]], dtype=uint8),
 array([[[234, 234, 234],
         [235, 235, 235],
         [237, 237, 237],
         ...,
         [232, 238, 236],
         [226, 235, 232],
      

In [9]:
df['Cover_Image_Colors'] = images
df.head()

Unnamed: 0,Title,Year_Release,Country,Country_Code,Continent,RatingAverage,Votes,Genres,Most_Common_Gender_Cast,Most_Common_Gender_Directors,Cover_Image_Path,Cover_Image_Colors
0,Money Heist,2017,Spain,ESP,Europe,7.1,25691.0,"TV Series, Thriller, Mystery, Heist Film, Kidn...",Men,Men,covers/MoneyHeistTVSeries.jpg,"[[[2, 2, 2], [0, 3, 2], [0, 6, 4], [0, 5, 4], ..."
1,The Blacklist,2013,United States of America,USA,Americas,6.4,5148.0,"TV Series, Mystery, Drama, Crime, Spy Film",Men,Men,covers/TheBlacklistTVSeries.jpg,"[[[234, 234, 234], [235, 235, 235], [237, 237,..."
2,Locked Up,2015,Spain,ESP,Europe,7.0,6941.0,"TV Series, Thriller, Drama, Prison Drama",Women,Men,covers/LockedUpTVSeries.jpg,"[[[52, 71, 49], [34, 53, 31], [32, 51, 29], [3..."
3,Prison Break,2005,United States of America,USA,Americas,7.3,71511.0,"TV Series, Action, Drama, Prison Drama, Cop Mo...",Men,Men,covers/PrisonBreakTVSeries.jpg,"[[[0, 0, 5], [43, 46, 51], [39, 44, 48], [19, ..."
4,13 Reasons Why,2017,United States of America,USA,Americas,6.8,21496.0,"TV Series, Drama, Mystery, Teen/coming-of-age,...",Men,Men,covers/ThirteenReasonsWhyTVSeries.jpg,"[[[2, 6, 15], [2, 6, 15], [2, 6, 15], [2, 6, 1..."


In [10]:
def RGB2HEX(color):
    # function that will convert RGB to hex so that we can use them as labels 
    # for our pie chart.
    return "#{:02x}{:02x}{:02x}".format(int(color[0]), int(color[1]), int(color[2]))
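
As a side note, `int()` truncates the fractional part of each channel, and a value outside 0-255 would produce a malformed hex string. Below is a minimal sketch of a more defensive variant, assuming that rounding and clamping are acceptable for this use. The names `rgb_to_hex` and `hex_to_rgb` are hypothetical helpers, not part of the notebook.

```python
def rgb_to_hex(color):
    """Convert an (R, G, B) triple of numbers to a '#rrggbb' string."""
    channels = []
    for value in color[:3]:
        # Round to the nearest integer, then clamp into the valid 0-255 range
        channels.append(max(0, min(255, int(round(float(value))))))
    return "#{:02x}{:02x}{:02x}".format(*channels)


def hex_to_rgb(hex_color):
    """Inverse helper: '#rrggbb' -> (R, G, B) tuple of ints."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


# Cluster center taken from the Money Heist output shown in this notebook
print(rgb_to_hex([166.17702177, 19.66397578, 29.62995531]))
```

The inverse helper is only there to make round-trip checks easy; the pie chart itself needs just the hex strings.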

In [11]:
def get_colors(image, number_of_colors, fig_name, show_chart=False):
    # fig_name is only a label for the chart; it is not used inside the function.
    # KMeans expects two-dimensional input, so we use NumPy's reshape
    # to flatten the image into a list of RGB pixels.
    modified_image = image.reshape(image.shape[0]*image.shape[1], 3)
    
    # KMeans groups the pixels into clusters of similar colors; the cluster centers are our top colors.
    # We fit and predict on the same image to get the cluster label of every pixel.
    clf = KMeans(n_clusters = number_of_colors)
    labels = clf.fit_predict(modified_image)
    
    counts = Counter(labels)

    center_colors = clf.cluster_centers_
    # Order the centers by iterating through the keys of counts, so that colors
    # and pie slices line up. Each center is indexed exactly once.
    ordered_colors = [center_colors[i] for i in counts.keys()]
    hex_colors = [RGB2HEX(color) for color in ordered_colors]
    rgb_colors = ordered_colors

    if (show_chart):
        plt.figure(figsize = (8, 6))
        plt.pie(counts.values(), colors = hex_colors)

    return rgb_colors
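
The order of `counts.keys()` is simply the order in which cluster labels first appear among the pixels, so the returned colors are not sorted by how much of the cover they fill. The following is a small sketch of ranking cluster centers by pixel share. It uses toy labels and centers rather than real KMeans output, and `rank_centers` is a hypothetical helper.

```python
from collections import Counter


def rank_centers(labels, centers):
    """Return (center, share) pairs sorted from most to least frequent cluster.

    labels  -- cluster label of every pixel (e.g. the result of fit_predict)
    centers -- sequence where centers[k] is the RGB center of cluster k
    """
    counts = Counter(labels)
    total = len(labels)
    # most_common() yields (label, count) pairs in descending count order
    return [(centers[label], count / total) for label, count in counts.most_common()]


# Toy data: 6 pixels assigned to 3 clusters
toy_labels = [2, 0, 2, 1, 2, 0]
toy_centers = [(10, 10, 10), (200, 30, 40), (240, 240, 240)]

ranked = rank_centers(toy_labels, toy_centers)
for center, share in ranked:
    print(center, round(share, 2))
```

Keeping the share next to each center would also make it possible to weight colors later, instead of treating all five as equally important.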

In [12]:
# top 5 colors from Money Heist cover
colors = get_colors(df.Cover_Image_Colors.iloc[0], 5, fig_name=None, show_chart=True)

In [13]:
# top 5 colors from Money Heist cover
colors

[array([99.80301842, 91.66045995, 84.19188554]),
 array([13.01282999, 10.69761586, 11.71678091]),
 array([44.62366378, 40.29016166, 36.46952477]),
 array([163.54390681, 159.32437276, 152.97580645]),
 array([166.17702177,  19.66397578,  29.62995531])]

In [None]:
colors = []

for item in df.index:
    c = get_colors(df.Cover_Image_Colors.iloc[item], 5, fig_name=None, show_chart=False)
    colors.append(c)

In [None]:
len(colors)

In [None]:
df['Cover_Image_Top5_Colors'] = colors
df.head()

In [None]:
# Saving the information to not lose it
df.to_csv('../Datasets/final_dataset_coverColors.csv')

In [None]:
df = pd.read_csv('../Datasets/final_dataset_coverColors.csv',index_col=0)
df.head()

In [None]:
def parse_colors(text):
    # After the CSV round trip each cell is a string, so we extract every number
    # from it and regroup the numbers into [R, G, B] triplets.
    values = [float(v) for v in re.findall(r"[-+]?\d*\.\d+|\d+", text)]
    return [values[k:k + 3] for k in range(0, len(values), 3)]

# Assigning the whole column at once avoids chained assignment through .iloc
df['Cover_Image_Top5_Colors'] = df['Cover_Image_Top5_Colors'].apply(parse_colors)
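
Saving a column of arrays to CSV stores each cell as text, which is why the colors have to be parsed back. Here is a self-contained sketch of that parsing step, assuming the cell contains nothing numeric besides the color values (any other digit in the text, such as the `8` in `dtype=uint8`, would be picked up too). The `parse_rgb_triplets` name and the sample string are made up for illustration.

```python
import re

# Matches plain decimals, integers and scientific notation such as 1.3e+01
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_rgb_triplets(text):
    """Turn a stringified list of RGB arrays back into a list of [R, G, B] floats."""
    values = [float(match) for match in NUMBER.findall(text)]
    if len(values) % 3 != 0:
        raise ValueError("expected a multiple of 3 numbers, got %d" % len(values))
    return [values[i:i + 3] for i in range(0, len(values), 3)]


sample = "[array([99.8, 91.6, 84.1]), array([1.3e+01, 10.5, 11.0])]"
print(parse_rgb_triplets(sample))
```

Raising on a count that is not a multiple of 3 is a design choice: it surfaces a malformed cell immediately instead of silently producing a short last triplet.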

In [None]:
COLORS = {
    'GREEN': [0, 255, 0],
    'BLUE': [0, 0, 255],
    'RED': [255, 0, 0],
    'YELLOW': [255, 255, 0],
    'MAGENTA': [255, 0, 255],
    'CYAN': [0, 255, 255],
    'BROWN': [147, 81, 22],
    'ORANGE': [255, 136, 48],
    'GREY': [178, 186, 187],
    'PURPLE': [142, 68, 173 ]
}

In [None]:
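def match_image_by_color(image, color, threshold = 60, number_of_colors = 10):
    # `image` is the list of top RGB colors of a cover, not the pixel array.
    # Convert the reference color to CIELAB, where Euclidean distance
    # approximates the perceived difference between two colors.
    selected_color = rgb2lab(np.uint8(np.asarray([[color]])))

    select_image = False
    # Never look past the colors actually available for this cover
    for i in range(min(number_of_colors, len(image))):
        curr_color = rgb2lab(np.uint8(np.asarray([[image[i]]])))
        diff = deltaE_cie76(selected_color, curr_color)
        # The cover matches as soon as one of its top colors is close enough
        if (diff < threshold):
            select_image = True
    
    return select_image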
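
To make the distance used above more concrete, here is a dependency-free sketch of the CIE76 difference: convert both colors from sRGB to CIELAB, then take the Euclidean distance. It assumes the D65 white point and the standard sRGB matrix. The results should be close to what `rgb2lab` and `deltaE_cie76` return with their defaults, but I have not verified that they match digit for digit. The function names are hypothetical.

```python
import math

# D65 reference white (assumed, together with the standard sRGB primaries)
WHITE = (0.95047, 1.0, 1.08883)


def srgb_to_lab(rgb):
    """Convert an (R, G, B) triple in 0-255 to CIELAB (L*, a*, b*)."""
    # 1. Undo the sRGB gamma curve to get linear light in 0-1
    linear = []
    for channel in rgb:
        c = channel / 255.0
        linear.append(c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = linear

    # 2. Linear RGB -> CIE XYZ
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    # 3. XYZ -> Lab, using the piecewise cube-root function
    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = (f(v / w) for v, w in zip((x, y, z), WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e_cie76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance between two colors in Lab space."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))


cover_red = (166, 20, 30)  # roughly the red cluster found on the Money Heist cover
print(delta_e_cie76(cover_red, (255, 0, 0)))  # distance to the RED reference
print(delta_e_cie76(cover_red, (0, 255, 0)))  # distance to the GREEN reference
```

The threshold of 60 is fairly permissive on this scale, so a single cover color can fall within range of several reference colors at once.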

In [None]:
match_image_by_color(df.Cover_Image_Top5_Colors[0], COLORS['RED'], threshold = 60, number_of_colors = 5)

In [None]:
df.Cover_Image_Top5_Colors[0]

In [None]:
df['GREEN_Cover'] = 0
df['BLUE_Cover'] = 0
df['RED_Cover'] = 0
df['YELLOW_Cover'] = 0
df['MAGENTA_Cover'] = 0
df['CYAN_Cover'] = 0
df['BROWN_Cover'] = 0
df['ORANGE_Cover'] = 0
df['GREY_Cover'] = 0
df['PURPLE_Cover'] = 0

In [None]:
COLORS

In [None]:
for item in df.index:
    print(item)
    for color in COLORS:
        match = match_image_by_color(df.Cover_Image_Top5_Colors[item], COLORS[color], threshold = 60, number_of_colors = 5)
        if match:
            # .loc writes into df itself; chained indexing may write to a temporary copy
            df.loc[item, color+'_Cover'] = 1

In [None]:
df.head()

## Exploring the data

After obtaining all the information, I did some exploratory analysis to check my hypothesis.

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['GREEN_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['BLUE_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['RED_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['YELLOW_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['MAGENTA_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['CYAN_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['BROWN_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['ORANGE_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['GREY_Cover'].sum()

In [None]:
df.groupby(['Most_Common_Gender_Cast'])['PURPLE_Cover'].sum()
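
The sums above are hard to compare directly because the groups differ in size, so the charts below divide each sum by the number of titles in its group. The sketch below makes that arithmetic explicit with plain dictionaries. It looks groups up by label rather than by position, so the result does not depend on the order in which groups come out. The rows are made up for illustration and `share_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def share_by_group(records, group_key, flag_key):
    """Percentage of records with flag == 1, computed separately per group label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        hits[group] += record[flag_key]
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Toy records standing in for rows of df (values are made up)
rows = [
    {"Most_Common_Gender_Cast": "Men", "RED_Cover": 1},
    {"Most_Common_Gender_Cast": "Men", "RED_Cover": 0},
    {"Most_Common_Gender_Cast": "Men", "RED_Cover": 0},
    {"Most_Common_Gender_Cast": "Men", "RED_Cover": 0},
    {"Most_Common_Gender_Cast": "Women", "RED_Cover": 1},
    {"Most_Common_Gender_Cast": "Women", "RED_Cover": 0},
]

shares = share_by_group(rows, "Most_Common_Gender_Cast", "RED_Cover")
print(shares)
```

In pandas, the mean of a 0/1 column within each group gives the same share, so `groupby(...)[column].mean() * 100` should be an equivalent one-liner.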

In [None]:
COLORS.keys()

In [None]:
def getList(d):
    # Return the dictionary keys as a list (avoids shadowing the built-ins dict and list)
    return list(d.keys())

In [None]:
getList(COLORS)[0]

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    x, y = [],[]
    i = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum().values[1] * 100 / len(df[df.Most_Common_Gender_Cast == 'Women']) 
    
    x.append(color.lower())
    y.append(i)
    
    fig.add_trace(go.Scatter(
        x=x, y=y,
        mode='markers',
        marker=dict(
            color=color.lower(),
            size = 50
            ),
        opacity=0.8,
        name = color.lower())
        )

fig.update_layout(
    title = 'Proportion of colors where female cast is most common gender',
    )

fig.show()

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    x, y = [],[]
    # Select the 'Men' group by label so that the numerator matches the denominator
    i = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum()['Men'] * 100 / len(df[df.Most_Common_Gender_Cast == 'Men']) 
    
    x.append(color.lower())
    y.append(i)
    
    fig.add_trace(go.Scatter(
        x=x, y=y,
        mode='markers',
        marker=dict(
            color=color.lower(),
            size = 50
            ),
        opacity=0.8,
        name = color.lower())
        )

fig.update_layout(
    title = 'Proportion of colors where male cast is most common gender',
    )

fig.show()

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    i = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum().values[0] * 100 / len(df[df.Most_Common_Gender_Cast == 'Men']) 
    j = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum().values[1] * 100 / len(df[df.Most_Common_Gender_Cast == 'Women']) 
    
    fig.add_trace(go.Bar(
        x=df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum().keys(), 
        y=[i,j],
        opacity=0.8,
        name = color.lower(),
        marker_color = color.lower(),
        text = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum()
        )
    )

fig.update_layout(barmode='group',
                   title={
                    'text': "Distribution of cover colors by gender in CAST",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'},
                   yaxis=dict( title='Percentage of covers',),
                   paper_bgcolor='white',
                   legend=dict(
                    x=1,
                    y=1,
                    bgcolor='rgba(255, 255, 255, 0)',
                    bordercolor='rgba(255, 255, 255, 0)'
                    ),
                   uniformtext_minsize=1, 
                   uniformtext_mode='show',
)

fig

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    i = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum().values[0] * 100 / len(df[df.Most_Common_Gender_Cast == 'Men']) 
    j = df.groupby(['Most_Common_Gender_Cast'])[color+'_Cover'].sum().values[1] * 100 / len(df[df.Most_Common_Gender_Cast == 'Women']) 
    
    fig.add_trace(go.Bar(
        x=[color.lower()], 
        y=[i - j],
        opacity=0.7,
        name = color.lower(),
        marker_color = color.lower()
        )
    )

fig.update_layout(
    title={
                    'text': "Difference between percentage of colors by gender in CAST",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'},
    )

fig

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    x, y = [],[]
    i = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().values[1] * 100 / len(df[df.Most_Common_Gender_Directors == 'Women']) 
    
    x.append(color.lower())
    y.append(i)
    
    fig.add_trace(go.Scatter(
        x=x, y=y,
        mode='markers',
        marker=dict(
            color=color.lower(),
            size = 50
            ),
        opacity=0.8,
        name = color.lower())
        )

fig.update_layout(
    title = 'Proportion of colors where female directors is most common gender',
    )

fig.show()

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    x, y = [],[]
    i = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().values[0] * 100 / len(df[df.Most_Common_Gender_Directors == 'Men']) 
    
    x.append(color.lower())
    y.append(i)
    
    fig.add_trace(go.Scatter(
        x=x, y=y,
        mode='markers',
        marker=dict(
            color=color.lower(),
            size = 50
            ),
        opacity=0.8,
        name = color.lower())
        )

fig.update_layout(
    title = 'Proportion of colors where male directors is most common gender',
    )

fig.show()

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    i = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().values[0] * 100 / len(df[df.Most_Common_Gender_Directors == 'Men']) 
    j = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().values[1] * 100 / len(df[df.Most_Common_Gender_Directors == 'Women']) 
    
    fig.add_trace(go.Bar(
        x=df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().keys(), 
        y=[i,j],
        opacity=0.8,
        name = color.lower(),
        marker_color = color.lower(),
        text = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum()
        )
    )

fig.update_layout(barmode='group',
                   title={
                    'text': "Distribution of cover colors by gender in DIRECTION",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'},
                   yaxis=dict( title='Percentage of covers',),
                   paper_bgcolor='white',
                   legend=dict(
                    x=1,
                    y=1,
                    bgcolor='rgba(255, 255, 255, 0)',
                    bordercolor='rgba(255, 255, 255, 0)'
                    ),
                   uniformtext_minsize=1, 
                   uniformtext_mode='show',
)

fig

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    i = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().values[0] * 100 / len(df[df.Most_Common_Gender_Directors == 'Men']) 
    j = df.groupby(['Most_Common_Gender_Directors'])[color+'_Cover'].sum().values[1] * 100 / len(df[df.Most_Common_Gender_Directors == 'Women']) 
    
    fig.add_trace(go.Bar(
        x=[color.lower()], 
        y=[i - j],
        opacity=0.7,
        name = color.lower(),
        marker_color = color.lower()
        )
    )

fig.update_layout(
    title={
                    'text': "Difference between percentage of colors by gender in DIRECTION",
                    'y':0.9,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'},
    )

fig

In [None]:
df.groupby(['Year_Release'])['GREEN_Cover'].mean()

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    x, y = [],[]
    for item in df.groupby(['Year_Release'])[color+'_Cover'].mean().index:
        if (df.groupby(['Year_Release'])[color+'_Cover'].mean()[item] > 0) == True:
            if item >= 2005:
                x.append(item)
                y.append(df.groupby(['Year_Release'])[color+'_Cover'].mean()[item] *100)
    
    fig.add_trace(go.Scatter(
        x=x, y=y,
        mode='lines+markers',
        marker_color=color.lower(),
        opacity=0.6,
        name = color.lower())
        )

fig.update_layout(
    title='Color evolution of covers during the last 15 years',
    yaxis=dict(
        title='Color Ratio',
    ),
    xaxis=dict(
        title='Year',
    ),
)

fig.show()

In [None]:
df.head()

In [None]:
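# numeric_only=True skips text columns, which recent pandas versions refuse to average
df.groupby(['Continent']).mean(numeric_only=True)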

In [None]:
fig = go.Figure()

for color in getList(COLORS):
    x, y = [],[]
    for item in df.groupby(['Continent'])[color+'_Cover'].mean().index:
        if (df.groupby(['Continent'])[color+'_Cover'].mean()[item] > 0) == True:
            x.append(item)
            y.append(df.groupby(['Continent'])[color+'_Cover'].mean()[item] * 100)
    
    fig.add_trace(go.Bar(
        x=x, y=y,
        marker_color=color.lower(),
        opacity=0.6,
        name = color.lower()),
        )

fig.update_layout(
    barmode='stack',
    title='Color ratio of covers by Continent',
)

fig.update_yaxes(showticklabels=False)

fig.show()

In [None]:
df.head(2)

In [None]:
df[df.Most_Common_Gender_Cast == 'Women'].sort_values(by='Votes', ascending=False).head()

In [None]:
get_colors(df.Cover_Image_Colors.iloc[86], 5, 'HIMYM_pie_women', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[88], 5, 'Prince_Bel-Air_pie_women', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[673], 5, 'Mamma_mia_pie_women', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[529], 5, 'Matilda_pie_women', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[986], 5, 'Annihilation_pie_women', True)

In [None]:
df[df.Most_Common_Gender_Cast == 'Men'].sort_values(by='Votes', ascending=False).head()

In [None]:
get_colors(df.Cover_Image_Colors.iloc[890], 5, 'Pulp_Fiction_pie_men', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[682], 5, 'Forrest_Gump_pie_men', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[750], 5, 'Matrix_pie_men', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[729], 5, 'LOTR_pie_men', True)

In [None]:
get_colors(df.Cover_Image_Colors.iloc[834], 5, 'Kill_Bill_pie_men', True)

In [None]:
# try looking at the colors for each genre (thriller, drama)

In [None]:
top = pd.read_csv('../Datasets/genres_top_to_bottom.csv', index_col=0)
top.head()

In [None]:
top = top.iloc[:20,:]
top

In [None]:
for x in top.index:
    print(x)

In [None]:
genre = pd.DataFrame([x.split(',') for x in df.Genres])
genre.head()

In [None]:
gn = pd.DataFrame()
for x in genre.T: 
    gn = pd.concat([gn, genre.T[x]])

In [None]:
gn.head()

In [None]:
gn[0].value_counts().head(10)

In [None]:
gn = gn.reset_index(drop=True) 
gn.head()

In [None]:
# Strip surrounding whitespace from every genre name; missing values stay missing
gn[0] = gn[0].str.strip()

In [None]:
gn.head()

In [None]:
gn = gn.dropna() 
gn.head(20)

In [None]:
gn = gn.reset_index(drop=True) 
gn.head(20)

In [None]:
top20_genres = gn.index.to_list()[0:20] 
top20_genres

In [None]:
list_genres = gn[0].to_list() 
list_genres

In [None]:
# Drop format labels that are not genres. Building a new list avoids removing
# items from list_genres while iterating over it, which skips elements.
not_genres = {'TV Series', 'Serie de TV', 'Miniserie de TV', 'TV Movie'}
list_genres = [element for element in list_genres if element not in not_genres]

In [None]:
list_genres

In [None]:
# Remove duplicates while keeping the order of first appearance
list_genres2 = list(dict.fromkeys(list_genres))
list_genres2
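
The last few cells strip whitespace, drop the format labels and remove duplicates in separate passes. As a rough sketch, the same cleaning can be done in a single pass that never mutates the list it is iterating over. The helper name `clean_genres` and the sample list are assumptions for illustration.

```python
def clean_genres(raw_genres, not_genres):
    """Strip whitespace, drop format labels and duplicates, keep first-seen order."""
    seen = set()
    cleaned = []
    for name in raw_genres:
        if name is None:
            continue  # padding left by splitting rows with fewer genres
        name = name.strip()
        if not name or name in not_genres or name in seen:
            continue
        seen.add(name)
        cleaned.append(name)
    return cleaned


FORMAT_LABELS = {"TV Series", "Serie de TV", "Miniserie de TV", "TV Movie"}
raw = ["TV Series", " Thriller", "TV Series", " Drama", None, "Thriller", "TV Movie"]
print(clean_genres(raw, FORMAT_LABELS))
```

Using a set for `seen` keeps the membership test cheap even when the raw list contains many thousands of entries.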

In [None]:
res = pd.DataFrame() 
res['Genre'] = list_genres2 
res['GREEN_Cover'] = 0
res['BLUE_Cover'] = 0
res['RED_Cover'] = 0
res['YELLOW_Cover'] = 0
res['MAGENTA_Cover'] = 0
res['CYAN_Cover'] = 0
res['BROWN_Cover'] = 0
res['ORANGE_Cover'] = 0
res['GREY_Cover'] = 0
res['PURPLE_Cover'] = 0
res.head(10)

In [None]:
for item in df.index: 
    for x in list_genres2: 
        if x in df.Genres[item]:
            for color in COLORS:
                if df[color+'_Cover'][item] == 1:
                    res.loc[res.Genre == x, color+'_Cover'] += 1
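
One detail worth knowing about the loop above: `x in df.Genres[item]` is a substring test on the genre string, so `'Drama'` also matches a title tagged only as `'Prison Drama'`. Whether that is desirable depends on the analysis. The sketch below shows the stricter alternative of splitting the string and matching whole genre names. The rows are made up for illustration and `count_colors_by_genre` is a hypothetical helper.

```python
from collections import Counter


def count_colors_by_genre(rows, color_names):
    """Count, for every (genre, color) pair, how many titles carry both."""
    counts = Counter()
    for row in rows:
        # Split the comma-separated string so that 'Drama' does not match 'Prison Drama'
        genres = {g.strip() for g in row["Genres"].split(",") if g.strip()}
        for genre in genres:
            for color in color_names:
                if row.get(color + "_Cover") == 1:
                    counts[(genre, color)] += 1
    return counts


# Made-up rows for illustration only
rows = [
    {"Genres": "TV Series, Thriller, Prison Drama", "RED_Cover": 1, "GREY_Cover": 1},
    {"Genres": "Drama, Crime", "RED_Cover": 0, "GREY_Cover": 1},
]
counts = count_colors_by_genre(rows, ["RED", "GREY"])
print(counts[("Drama", "GREY")], counts[("Prison Drama", "GREY")], counts[("Drama", "RED")])
```

With exact matching, sub-genres only roll up into their parent genre if a later renaming step maps them there explicitly.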

In [None]:
res.head()

In [None]:
res.to_csv('../Datasets/final_dataset_genres_colors.csv')

In [None]:
res = pd.read_csv('../Datasets/final_dataset_genres_colors.csv', index_col=0)
res.head()

In [None]:
english_genre_names = {'Comedia':'Comedy','Documental':'Documentary','Crimen':'Crime', 'True Crime':'Crime',
                      'Acción':'Action', 'Intriga':'Intrigue', 'Fantástico':'Fantasy',
                      'Ciencia ficción':'Science fiction', 'Sci-Fi':'Science fiction', 
                       'Basado en hechos reales':'Based on real facts','Familia':'Family', 
                       'Biográfico': 'Biographical','Aventuras':'Adventure',
                       'Adventures':'Adventure', 'Aventuras marinas':'Adventure',
                       'Aventura espacial':'Adventure', 'Sea Adventures':'Adventure',
                       'Space Adventure':'Adventure',
                      'Adolescencia':'Adolescence','Amistad':'Friendship', 'Secuela':'Sequel',
                      'Comedia dramática':'Dramatic comedy', 'Cine familiar':'Family',
                      'Sitcom familiar':'Family', 'Family Relationships':'Family',
                      'Family-friendly':'Family', 'Family Sitcom':'Family',
                      'Animación':'Animation', 'Prison Drama':'Drama', 'Infantil':'Kids', 'Infancia':'Kids',
                      'Comedia juvenil':'Comedy', 'Comedia negra':'Comedy',
                      'Comedia de terror':'Comedy','Comedia romántica':'Comedy',
                      'Comedia absurda':'Comedy', 'Black Comedy':'Comedy','Teen Comedy':'Comedy',
                      'Comedy-Drama':'Dramatic comedy','Broad Comedy':'Comedy', 'Horror Comedy':'Comedy',
                      'Romantic Comedy':'Comedy', 'High Comedy':'Comedy','Courtroom Drama':'Drama',
                      'Romantic Drama':'Drama','Psychological Drama':'Drama','Drama de época':'Drama',
                      'Drama psicológico':'Drama','Social Drama':'Drama','Drama romántico':'Drama',
                      'Drama sureño':'Drama','Drama judicial':'Drama','Drama carcelario':'Drama',
                      'Drama social':'Drama','Sobrenatural':'Supernatural','Drogas':'Drugs',
                      'Cómic':'Based on a Comic','Policíaco':'Cop Movies','Documental sobre música':'Documentary',
                      'Documental deportivo':'Documentary','Documental científico':'Documentary',
                      'Documental marino':'Documentary','Documental sobre cine':'Documentary',
                      'Documental sobre Historia':'Documentary','Documental sobre videojuegos':'Documentary',
                      'Movie Documentary':'Documentary','Music Documentary':'Documentary',
                      'Secuestros / Desapariciones':'Kidnapping Film / Disappearance',
                      'Colegios & Universidad':'Schools & University','Animales':'Animals','Política':'Politics',
                      'Videojuego':'Based on a Video Game','Histórico':'Historical','Deporte':'Sports',
                      'Sport Documentaries':'Documentary','Fútbol':'Soccer/Football','Superhéroes':'Superheroes',
                       'DC Comics':'Superheroes','Marvel Comics':'Superheroes',
                      'Psychological Thriller':'Thriller','Thriller psicológico':'Thriller','Thriller futurista':'Thriller',
                      'Futuristic Thriller':'Thriller','Animación para adultos':'Animation','Adult Animation':'Animation',
                      'Cortometraje (animación)':'Short Film (Animated)','Venganza':'Revenge','Espionaje':'Spy Film',
                      'Iraq War':'War','II World War':'War','Spanish Civil War':'War','Spanish Post-War':'War',
                      'Cold War':'War','Guerra de Vietnam':'War','Guerra de Afganistán':'War','II Guerra Mundial':'War',
                      'Guerra Fría':'War','Guerra Civil Española':'War','Guerra de Siria':'War',
                      'I Guerra Mundial':'War','Guerra de los Balcanes':'War','Guerras Napoleónicas':'War',
                      'Guerra de Corea':'War','Guerra de Iraq':'War','Internet / Informática':'Computers / Internet',
                      'Brujería':'Witchcraft','Robos & Atracos':'Heist Film','Magia':'Magic','Magic Realism':'Magic',
                      'Música':'Music','Supervivencia':'Survival Film','Bélico':'War','Asesinos en serie':'Serial Killers',
                      'Naturaleza':'Nature','Vampiros':'Vampires','Periodismo':'Journalism',
                      'Artes marciales':'Martial Arts','Racismo':'Racism','Vejez':'Old Age','Discapacidad':'Disability',
                      'Terrorismo':'Terrorism','Inmigración':'Immigration','Zombis':'Zombies',
                      'Viajes en el tiempo':'Time Travel','Medieval Fantasy':'Fantasy','Fantasía medieval':'Fantasy',
                      'Monstruos':'Monsters','Dragones':'Dragons','Extraterrestres':'Aliens',
                      'Distopía':'Dystopia','Homosexualidad':'Gay & Lesbian','Coches / Automovilismo':'Car Movies',
                      'Religión':'Religion','Años 80':'1980s','Años 70':'1970s','Años 60':'1960s',
                      'Años 90':'1990s','Años 50':'1950s','Años 40':'1940s','Perros/Lobos':'Dogs/Wolves',
                      'Literatura':'Literature','Sátira':'Satire','3-D':'3D','Feminismo':'Feminism',
                      'Navidad':'Christmas','Película de culto':'Cult Movie','Mediometraje':'Half-length Film',
                      'Siglo XIX':'19th Century','Enfermedad':'Disease/illness','Conciertos':'Concerts',
                      'Hombres lobo':'Werewolf'}

In [None]:
# Assign the result back: an inplace replace on a selected column may not update res
res['Genre'] = res['Genre'].replace(english_genre_names)
res.head(20)
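
After the renaming, several rows can end up sharing the same genre name (for example the Spanish and English variants of comedy), so their counts have to be added together before plotting. Here is a minimal sketch of that rename-then-merge step, using a three-entry excerpt of `english_genre_names` and made-up counts. `merge_genre_counts` is a hypothetical helper.

```python
from collections import Counter

# A tiny excerpt of the mapping, copied from english_genre_names
GENRE_MAP = {"Comedia": "Comedy", "Romantic Comedy": "Comedy", "Drama social": "Drama"}


def merge_genre_counts(counts_by_genre, mapping):
    """Rename genres through `mapping` and add up the counts of names that collide."""
    merged = Counter()
    for genre, count in counts_by_genre.items():
        # Names missing from the mapping are kept unchanged
        merged[mapping.get(genre, genre)] += count
    return dict(merged)


# Made-up counts for illustration only
red_counts = {"Comedia": 3, "Romantic Comedy": 2, "Comedy": 10, "Drama social": 1, "Thriller": 4}
print(merge_genre_counts(red_counts, GENRE_MAP))
```

This is the same idea as grouping by the renamed genre column and summing, only spelled out on a plain dictionary.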

In [None]:
res['Genre'].value_counts()

In [None]:
all_genres = res.groupby(['Genre']).sum()
all_genres

In [None]:
final_genres = pd.merge(top, all_genres, how='left', left_on=top.index, right_on=all_genres.index)
final_genres

In [None]:
final_genres.iloc[0].to_frame()

In [None]:
final_genres.iloc[0].index

In [None]:
def get_labels_values(df_item):
    labels = []
    values = []
    for color in COLORS:
        if df_item[color+'_Cover']:
            labels.append(color.lower())
            values.append(df_item[color+'_Cover'])
    
    return labels, values

In [None]:
labels, values = get_labels_values(final_genres.iloc[0])

In [None]:
print(labels)
print(values)

In [None]:
plt.figure(figsize = (8, 6))
plt.pie(values, colors = labels)
plt.title(final_genres.iloc[0]['key_0'])

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial',
                             marker=dict(colors=labels),
                             title_text=final_genres.iloc[0]['key_0'],
                            )])

fig.update(layout_title_text=final_genres.iloc[0]['key_0'],
           layout_showlegend=True)

fig.show()

In [None]:
labels1, values1 = get_labels_values(final_genres.iloc[5])
len(labels1)

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

labels1, values1 = get_labels_values(final_genres.iloc[0])
labels2, values2 = get_labels_values(final_genres.iloc[1])
labels3, values3 = get_labels_values(final_genres.iloc[2])
labels4, values4 = get_labels_values(final_genres.iloc[3])
labels5, values5 = get_labels_values(final_genres.iloc[4])
labels6, values6 = get_labels_values(final_genres.iloc[5])

# Create subplots, using 'domain' type for pie charts
specs = [[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}], 
         [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=2, cols=3, specs=specs)

# Define pie charts
fig.add_trace(go.Pie(labels=labels1, values=values1, name=final_genres.iloc[0]['key_0'],
                     marker_colors=labels1, title_text=final_genres.iloc[0]['key_0'],
                    textposition = ['none','none','none','none','none','inside','inside','inside','inside','inside']), 1, 1)
fig.add_trace(go.Pie(labels=labels2, values=values2, name=final_genres.iloc[1]['key_0'],
                     marker_colors=labels2, title_text=final_genres.iloc[1]['key_0'],
                    textposition = ['none','none','none','none','none','inside','inside','inside','inside','inside']), 1, 2)
fig.add_trace(go.Pie(labels=labels3, values=values3, name=final_genres.iloc[2]['key_0'],
                     marker_colors=labels3, title_text=final_genres.iloc[2]['key_0'],
                    textposition = ['none','none','none','inside','inside','inside','inside','none']), 1, 3)
fig.add_trace(go.Pie(labels=labels4, values=values4, name=final_genres.iloc[3]['key_0'],
                     marker_colors=labels4, title_text=final_genres.iloc[3]['key_0'],
                    textposition = ['none','none','none','none','none','inside','inside','inside','inside','inside']), 2, 1)
fig.add_trace(go.Pie(labels=labels5, values=values5, name=final_genres.iloc[4]['key_0'],
                     marker_colors=labels5, title_text=final_genres.iloc[4]['key_0'],
                    textposition = ['none','none','none','none','none','inside','inside','inside','inside','inside']), 2, 2)
fig.add_trace(go.Pie(labels=labels6, values=values6, name=final_genres.iloc[5]['key_0'],
                     marker_colors=labels6, title_text=final_genres.iloc[5]['key_0'],
                    textposition = ['none','none','none','none','inside','inside','inside','inside','inside']), 2, 3)

# Tune layout and hover info
fig.update_traces(hoverinfo='label+percent+name', textinfo='percent')
fig.update(layout_title_text='Top 6 Genres Most Prominent Colors Shown Proportionally',
           layout_showlegend=False)

fig = go.Figure(fig)
fig.show()